Planet Argon has been hard at work on a new podcast that focuses on long-term software investments and thinking beyond the rewrite. In it, Robby Russell speaks with seasoned practitioners from all sorts of technologies about their challenges and successes when working with technical debt and legacy code. It's called Maintainable, and we hope you'll give it a listen.
On a recent episode, Robby spoke with Eileen Uchitelle, a Senior Engineer at GitHub and a member of the Rails Core team. Eileen has a deep knowledge of Rails. She's given talks at conferences around the world about Rails app performance, open source, and the future of Rails. Eileen wrote a comprehensive blog post about upgrading GitHub’s large application from Rails 3.2 to 5.2, which was also the focus of her conversation with Robby.
If you're having issues prioritizing a Rails upgrade for your application, some of the topics in this interview may help you push forward. They discuss getting managers on board with an upgrade, how teams can work together, and the technical aspects of running multiple versions in parallel.
Below is the first half of Robby’s interview with Eileen. To listen to the interview in its entirety, you can stream the episode from the Maintainable website, Apple Podcasts, Spotify, or anywhere else you stream podcasts.
Robby: Welcome to the Maintainable software podcast, where we speak with seasoned practitioners who have helped organizations work past the problems often associated with technical debt and legacy code. I'm your host Robby Russell. On this episode, we're joined by Eileen Uchitelle who is a senior systems engineer at GitHub and a member of the Ruby on Rails core team. Eileen Uchitelle, thanks for joining us!
Eileen: Thanks for having me.
Robby: So last year GitHub announced that they were able to upgrade to the most recent and stable version of Ruby on Rails, having been behind several major releases. You were a key member of that upgrade project and I'd love to learn more about how you and your team approached such a project and how that might have been balanced when you're also – I'm assuming GitHub was also working on new features around the same time and how that all worked out together in the end.
But first, how do you describe technical debt?
Eileen: Technical debt is inevitable. You'll never have an application that doesn't have it, but technical debt that becomes burdensome is when you can't develop fast enough. Your CI is slow, deployments are slow, and you realize that your application has become this problem rather than your solution. And that's when you have to start addressing your technical debt.
Robby: Alright. So when it came to working on the upgrade project, how did you get introduced into that project in the first place? Was that one of your first big projects at GitHub?
Eileen: So I started on the – we're now called App Systems – but then we were called the Platform Systems team. When I started there wasn't anyone who was full time on the Rails upgrade. It was a volunteer effort. I don't recommend that route for upgrades. It will never get done if you have it be a volunteer effort. I took on that project. I didn't realize it was gonna take a year and a half. I just started with the build, and it didn't boot when I first started.
So Aaron and I – Aaron Patterson, he's also on my team – figured out why it didn't boot. It had to do with a route change. We fixed that. And then there were some other reasons it didn't boot, and it turned out there was a bug in Rails. So that was good. We fixed that and then it was like, oh, here's two thousand failures to work on. So I just started chipping away at those.
Robby: All right. From a process perspective, how did your team decide to...so you get those two thousand errors at that point or failing test in your suite I'm assuming...how did you start going about dividing and conquering that amount of work, knowing that there's like two thousand of them? Did you actually create two thousand tickets in your issue tracking system or anything?
Eileen: Not for the 3.2 to 4.0 upgrade. That was mostly me, so there wasn't anyone else doing work. But we eventually had a volunteer effort where people were like, "Oh, what can I work on?" And so I would tell them, "Work on this one, because I'm not working on it." We didn't really have a process yet. And then during a hack week somebody was like, "I want to help." And I was like, "Great, look, here's a bunch of failures," and he's like, "No, I'm going to do something different."
And so what he did was he made a tool in CI that we could create issues from the failures by adding a special label. And so it just added a plus sign next to the failure and we would just hit the plus sign and then group all the errors together that were the same, and make issues for those so that way we can assign work and we weren't duplicating our work.
So that was sort of a long-ish process, but usually when we have two thousand errors the build doesn't even finish, so we fix the really big ones first, you know, the one where there's two hundred of the same failure. We would fix those before we declared the build ready for volunteers. Or by "we" I mean I would fix those big ones before we declared it ready for anyone else. And then once we could actually see the errors... it's hard to explain, but our CI groups them in these boxes. When the build can't finish because there's too much output, too many failures, you don't have those boxes.
So once we could actually see the individual failures and not just the log, I would create issues for every single failure and then group them together. That way we would know that all of these that are getting a 404 response are probably the same bug, so we'll fix that bug together as one. And so that way nobody was working on the same issue.
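The grouping step Eileen describes can be sketched in a few lines of Ruby. This is only an illustration of the idea, not GitHub's CI tool: the `Failure` struct, its fields, and the normalization rules are all assumptions made for the example.

```ruby
# Hypothetical failure record; the fields are assumptions, not GitHub's CI format.
Failure = Struct.new(:test_name, :message)

# Normalize a message so that failures differing only in object addresses or
# file line numbers collapse to the same signature. Status codes are left
# alone so that "got 404" and "got 500" stay in separate groups.
def failure_signature(message)
  message
    .gsub(/0x[0-9a-f]+/i, "<addr>")
    .gsub(/:\d+/, ":<line>")
    .strip
end

# Group failures by signature and list the largest group first, so the
# "two hundred of the same failure" cases surface at the top.
def group_failures(failures)
  failures
    .group_by { |failure| failure_signature(failure.message) }
    .sort_by { |_signature, group| -group.size }
end

failures = [
  Failure.new("UsersControllerTest#test_show", "Expected response 200, got 404"),
  Failure.new("ReposControllerTest#test_index", "Expected response 200, got 404"),
  Failure.new("IssueTest#test_create", "undefined method 'foo' for #<Issue:0x00007f8e>"),
]

group_failures(failures).each do |signature, group|
  puts "#{group.size} failure(s): #{signature}"
  group.each { |failure| puts "  - #{failure.test_name}" }
end
```

Each resulting group would map to one issue, which is what keeps two people from chasing the same underlying bug.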
Robby: Right. I know that that can be a challenge when we're working on upgrade projects for different clients as well, and I think sometimes there's the kind of moments where you need to send someone out on kind of like an independent mission for a bit to try to figure out what's going on, because if you try to get too many people at the same time doing that then you're potentially duplicating efforts, and that's not super productive for the team.
So that process of starting to create issues came after the first big milestone? For how long were you working in kind of a solo approach and tackling one issue after the next as they popped up?
Eileen: It depended on the build, I think, because 4.0 was already set up – that one was really easy to get booting again. It was just a lot of route issues or whatever, because the routes changed from 3.2 to 4.0. 4.1 was the hardest. It probably took a few months because we did something really bad. Our test framework directly mutates minitest classes, and Rails 4.1 required minitest 5.1, and before that we were using 4.7. So we literally had to rewrite our test framework to use Rails 4.1.
Robby: Oh wow. How did you approach that sort of thing when you're dealing with a dependency like that? Were there conversations about trying to figure out ways to backport minitest itself in a way to accommodate that, or did you just try to figure out how to make everything work with the newest version of minitest?
Eileen: Well we tried it, it was impossible, we had to support two versions of minitest in the code base at once and it was a nightmare. I still don't think that it's right. There are still some things that I think that we never fixed but nobody remembers what it used to be like. So it's fine. But don't don't don't mutate your minitest classes. Don't touch. Do not touch minitest ever.
Robby: I was just reviewing the article that you've written on the GitHub engineering blog about how you're supporting multiple versions. Is that something that you're still doing outside of the most recent version and what's on master? Because I know that there are some examples of where you had conditionals for different versions of Rails. Approximately how many different versions were you trying to support in parallel?
Eileen: We would support up to three versions in parallel. So 3.2 was in production. As soon as the 4.0 build was green, we would make that required for every push to GitHub. So you had to write code that passed in both, and that way we wouldn't have to deal with regressions. And then we would work on 4.1, but 4.1 would only run manually when we would run it. So really, GitHub engineers who were building features only had to deal with 3.2 and 4.1, and then when 4.1 was green, we deleted 4.0 and made 3.2 and 4.1 required and worked on 4.2.
Once we got to 4.2, we deployed that, and then we did the same process for 4.2 to 5.2. A lot of people ask me why we didn’t just go straight to 5.2. And I think that we would have never finished it, because we would have gotten that minitest problem and said, “This is too hard.” We would have made no progress, because it took a year and a half – and it wasn't just a year and a half of doing upgrades. People see that and they're like, "Oh my God, so long." And it's like, “Do you really think that we only did that for a year and a half?” No, that's not true. From when I started to when we were on 5.2, that was a year and a half.
I feel like the biggest thing for us was preventing regressions as we made progress. If we had just gone from 3.2 to 5.2, we would have constantly been dealing with regressions and not knowing which version of Rails something would break on. We'd have to support much more major changes between 3.2 and 5.2, and our code base is so big that we just can't support that. But right now we're still supporting two versions, 5.2 and 6.
The 6 build is required right now and we were having no regressions being introduced by anyone except for Rails so that means that like we upgrade every week, we fix all the bugs, we fix Rails if it's broken, and then we merge that to master, then everyone has like that new version of Rails and that means that when Rails 6 is released we're going to be able to be on Rails 6 that day.
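For readers who haven't seen one, a version conditional of the kind Robby asks about might look roughly like the Ruby sketch below. The interview doesn't show which changes GitHub actually branched on, so the `before_filter`/`before_action` choice is purely illustrative, and `FrameworkVersion` and `callback_macro` are hypothetical names. In a real app the version string would come from `Rails.version`; it is passed in here so the sketch runs without Rails.

```ruby
require "rubygems" # provides Gem::Version

# Hypothetical helper wrapping a version string so it can be compared
# segment by segment rather than as plain text.
class FrameworkVersion
  def initialize(version_string)
    @version = Gem::Version.new(version_string)
  end

  def at_least?(minimum)
    @version >= Gem::Version.new(minimum)
  end
end

# Illustrative branch: Rails 4.0 introduced `before_action` as the newer
# name for `before_filter`, so code shared by a 3.2 build and a 4.x build
# could pick the macro based on the running version.
def callback_macro(version)
  version.at_least?("4.0") ? :before_action : :before_filter
end

puts callback_macro(FrameworkVersion.new("3.2.22"))
puts callback_macro(FrameworkVersion.new("4.1.0"))
```

Keeping the check behind one small helper makes the branches easy to find and delete once the older build is retired.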
Robby: Oh wow ok, that's awesome. And you know that as a member of the Rails core team, has there been conversations about baking in some sort of functionality within Rails to help developers with this kind of approach as well?
Eileen: Honestly, that's more on Bundler than on Rails. With Bundler, we had to hack it to support two Gemfiles. And I know that there's an issue open on Bundler. I don't know if they're going to actually do it or not, but if they could implement multiple Gemfiles, that would seriously help Rails upgrades. It would improve that process drastically, because we can't upgrade Bundler without rehacking it, and that sucks. But I'd rather monkey patch Bundler than fork Rails. So I'll take it.
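Eileen's team monkey patched Bundler, and the interview doesn't show that patch. A lighter, different technique that some teams use is to keep two Gemfiles side by side and point Bundler at one of them through the `BUNDLE_GEMFILE` environment variable, which Bundler does read. In the Ruby sketch below, the `Gemfile.next` file name, the `RAILS_NEXT` variable, and the `gemfile_for` helper are assumptions made for illustration.

```ruby
# Hypothetical layout: "Gemfile" pins the deployed Rails version and
# "Gemfile.next" pins the version being upgraded to. RAILS_NEXT is an
# assumed variable name; BUNDLE_GEMFILE is the variable Bundler reads.
def gemfile_for(env, root: Dir.pwd)
  name = env["RAILS_NEXT"] == "1" ? "Gemfile.next" : "Gemfile"
  File.join(root, name)
end

# This has to run before Bundler is loaded (for example near the top of a
# boot file) so that dependency resolution uses the chosen Gemfile. An
# explicitly set BUNDLE_GEMFILE is respected rather than overwritten.
ENV["BUNDLE_GEMFILE"] ||= gemfile_for(ENV)
puts ENV["BUNDLE_GEMFILE"]
```

A CI job for the upgrade build would then set `RAILS_NEXT=1`, while the default build leaves it unset.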
Robby: All right, that makes sense. Yeah, we're working on some upgrade projects and there's a little bit of a fun challenge in trying to come up with ways to modify which Gemfile is being used and to make that work. I think I've seen that issue too on the Bundler project, or at least the request to add that functionality. So, thinking more about your team's process, approximately how many people were working on the upgrade at any given time throughout the process? Was there a handful of people? You mentioned there were volunteers, so did a lot of people contribute over the course of that year and a half period, knowing that that wasn't how much time you actually spent on the upgrade? How many people were part of the process?
Eileen: Yeah. So 3.2 to 4.2 was mostly me. I was the only full-time engineer on it. Of course, Aaron helped when I would get stuck on stuff, but he had other responsibilities to work on too. There's like 10 people in the Rails team on GitHub, but that was just, "Oh hey, I have an afternoon, I'll pick up an issue and work on it." It was easier to get volunteers as it became more obvious that we were actually going to do it. I think there was a lot of skepticism about actually finishing it, because there wasn't anyone before to drive that effort. You know, someone would be like, "I really want to work on this, but I've got a whole bunch of other stuff I have to do."
It wasn't a priority for so long because there was so much other stuff to do, so I think people got excited when they saw it coming. "Oh my God, we might actually be on 5.2. I'll help out, I want to help!" So that was fun. And then for the 4.2 to 5.2 we had four full-time engineers and I acted as lead, working out process and doing reviews and making sure that everything was solid, but not doing the actual upgrade, because I didn't want to. I mean, I wanted to see us on 5.2, but I was like, I'm going to burn out if I keep doing this, because it's a lot of work. I can't do it by myself anymore.
So really, I went from pushing that effort forward mostly by myself to pushing that effort forward with a team. And that's why 4.2 to 5.2 was only five months. It was really fast. We actually did 5.1 to 5.2 in two weeks.
Robby: Oh wow, ok. Without getting too far into the weeds on the technical side of things, but just as an overall concept for those listening who aren't familiar with Ruby on Rails or this specific framework: the project was basically several versions behind what Rails had released for the community, and with each major release there's some new functionality or changes to how things work in the framework. And so GitHub needed to support multiple versions at the same time to get through this process. Diving into a little bit of detail here, approximately how many conditionals do you think you had throughout the app that were checking on things?
And did that start to become a concern, that it would become complicated, to be like, “Oh, I know that I'm doing something special, but that's only going to work in Rails 5 here”? When you were working on new features, were there just a lot of conditionals? Or were there some common areas where you would see a lot of those, "Alright, this is 3.2 versus 4 or 5" type of checks?
Eileen: I cannot estimate the number of conditionals that there were. When we were on 4.1, we would delete the conditionals for 4.0, so we didn't just leave them littered in the code base. And because we weren't running the 4.0 build anymore, we didn't care if it passed. We were never going to deploy 4.0. So we only really cared about moving forward, and sometimes, if something was really major, we'd just backport a monkey patch so that we could support it in multiple versions and not have to do the conditionals. For the before filter changes, we waited until we were on 4.2 to actually do that work, because we knew that the conditionals were just going to be all over the place and we didn't want to duplicate all those filters and the callbacks.
So we just waited, and I guess the thing is you have to like...everything is just a tradeoff. Figuring out how long you want to wait to fix something.
Some deprecations make sense to fix right away. Rails 6.1 has a deprecation where the validates uniqueness constraint requires you to pass case sensitive to it. But that works on all the versions, because it's just a keyword argument that gets ignored if it's not used by Rails. That way we don't have to add conditionals. We can just write the code for both 5.2 and 6.1, and once we get to 6.1 it won't care anymore and then we're just done with that.
Robby: As you had a team growing and expanding in size over the course of that period, what sort of processes did your team experiment with, outside of using your build tool and finding ways to group common errors? Were there any other processes that you put in place in terms of being able to measure progress as a team and how far away you were? Or was it always one of those questions of “Well, how long is a piece of string?” if someone actually asked you when this release would be done? Was that something you were able to start projecting at all?
Eileen: It's not like we had a deadline so it was totally directed by us because we hadn't stopped feature development and we hadn't stopped anything else. Nobody was like, "When's it going to be done?" Because we weren't blocking anybody with it. It did actually go down a lot faster than I expected, like 5.2 got done faster than I expected. And so it was a little bit of scrambling to be like, oh actually I need all of you to stop what you're doing and test this thing.
But we actually ended up with a really good process for that. We just said, "Hey, assign someone from your team to click test in this environment and just tell us if you found any bugs," and that's it. And we actually found only like two bugs in that process. Who knows that part of the code base better than the team that works on it? So it made sense for them to do that click testing instead of creating a team for click testing. And so that was a good process. And then in terms of other processes we used, I mean, our only measurement of success was: are the numbers of failures going down, and are we done?
We also wanted to deploy it with no downtime, which we did in both 4.2 and 5.2. The customer impact was so minimal that we did not even have a single incident during the deployment.
Robby: Oh it's always great when you have those rollouts and they're insignificant in memory, like you don't remember much about it because then everything just worked. Those are great.
Eileen: We were so careful for the 4.2 rollout that afterwards, in the review, I said we took too long to roll it out because we were so careful. I think it took us a month or two to roll that out with zero... like, we've never done this before. We don't know what's going to happen. So 5.2 we did a lot faster, because I was like, I'm totally confident that if I deploy at 7 a.m. East Coast and it doesn't break, it will be totally fine on the West Coast. For 4.2, we would deploy it late at night Australian time and then ramp up to different time zones, and we didn't really do that with 5.2. It didn't seem necessary. We literally found zero bugs in the Australian time zones.
It wasn't a waste, but I felt that 4.2 was so smooth that we didn't find out enough stuff in the off hours. I mean, GitHub doesn't really have off hours, but we have hours with less traffic. I felt that we were gonna find a lot more, faster, in the on hours. As long as we knew it wasn't going to take the site down, at that point it was like, just deploy it. Especially when, if you deploy off hours, you're like, "I found one failure." So then you fix that, you deploy that, two days later you deploy again, you find one failure.
I'd rather find six at once and then just fix those and then deploy again than find one every day for three weeks.
Listen to the full interview...
You can listen to this interview, along with 15 more minutes of discussions about Rails version upgrades (including how to justify upgrade costs to a reluctant manager), Eileen’s most recommended book for programmers, and her vision for Rails in the open source world by streaming this episode of Maintainable.