Failed Deployment
I hadn’t broken anything in production in a while until last week, so it seems like a good time to write about that. The deployment in question was an upgrade to one of our services containing two changes:
- A several-thousand-line refactor of the entire service
- A few-dozen-line removal of a redundant authorization check
Which one of those changes do you think introduced a problem that caused me to roll back this attempted upgrade, not once, but twice? The second one, unfortunately. The first was well tested, since the refactor made it significantly easier to add automated tests, and it had been running in our lower environments for some time without issue. I was confident the second was a minor change because it only removed clearly duplicated behavior.
I upgraded the service in production and ran through my usual checks: various web pages worked and I wasn’t seeing any errors in the service logs. I marked the changes as deployed, closed the ticket, and was about to move on when I realized I hadn’t checked a particular webpage. Clicking on that page triggered something that looked like our auto-logout behavior, which didn’t make sense because I had logged in only a few minutes prior; my session shouldn’t have expired yet. I logged in again and was able to replicate the behavior. At that point I decided to roll back the change while I investigated. After doing so I checked the staging environment and saw the same issue.

What had happened was that the supposedly redundant code I removed behaved slightly differently than the version I thought was handling it: the now-deleted version would read the access token from either a cookie or the Authorization header, but the remaining version only read it from the header. This webpage depended on the cookie, so after the change the service decided the request wasn’t authenticated and the 401 response triggered the auto-logout behavior.
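To make the difference concrete, here’s a minimal sketch in Python (not our service’s actual code; the function and cookie names are hypothetical). The deleted check accepted a token from either source, while the check I assumed was equivalent only looked at the header:

```python
from typing import Mapping, Optional

# Hypothetical cookie name for illustration; the real service uses its own.
ACCESS_TOKEN_COOKIE = "access_token"


def token_from_header(headers: Mapping[str, str]) -> Optional[str]:
    """Like the check that remained: only honors the Authorization header."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):]
    return None


def token_from_header_or_cookie(
    headers: Mapping[str, str], cookies: Mapping[str, str]
) -> Optional[str]:
    """Like the deleted check: falls back to a cookie when the header is absent."""
    return token_from_header(headers) or cookies.get(ACCESS_TOKEN_COOKIE)


# A request that authenticates via cookie only, like the page that broke:
headers: Mapping[str, str] = {}
cookies = {ACCESS_TOKEN_COOKIE: "abc123"}

print(token_from_header(headers))                     # None -> 401 -> auto-logout
print(token_from_header_or_cookie(headers, cookies))  # 'abc123' -> authenticated
```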
The code change to fix that was simple, and another team member gave it a thumbs-up, so the next day I attempted the deployment again. Like the first time, I didn’t see any issues in the lower environments, and my usual checks after upgrading the prod service revealed nothing wrong. Again I closed the ticket, marked the change as completed, and then noticed something weird: after I logged out and returned to the landing page, I was immediately logged in again. It turns out the fix was incomplete and I had broken our logout functionality entirely. Another rollback, another patch, and another upgrade attempt to schedule.
Luckily, in both cases I noticed the issue right after making the change and was able to roll it back before there was significant impact to users. That outcome definitely wasn’t guaranteed. This experience highlighted an over-reliance on manual testing. In both cases I discovered the issue doing manual tests after upgrading the production service, but the same issues were present in the lower environments. So not only was I relying on manual testing to catch bugs, I wasn’t doing it thoroughly enough or consistently making the same checks in each environment. If I’m going to do manual checks after making a production change (which I still think I should), I need to use a checklist to ensure I’m validating the same behavior every time.
The broken change was introduced after the refactor that added lots of automated tests, so why didn’t the tests catch the problem? It turns out I hadn’t written any tests that provided the access token from a cookie. Setting up the tests with the Authorization header was easier, and I hadn’t thought to validate that endpoints behaved the same way with both. This is a good reminder that it’s not safe to assume automated tests cover the real behavior you’d see in production. Unit tests are a human-created model of reality, and that model won’t always capture all the relevant details. I added some tests that use the cookie, and I’ll keep that in mind when writing tests in the future.
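As a sketch of the kind of test I’m talking about (using pytest and the hypothetical helper from the earlier example, not our real test suite), parametrizing over both credential sources covers the case I had missed:

```python
import pytest

# Assumes the hypothetical helper from the earlier sketch lives in auth_sketch.py.
from auth_sketch import token_from_header_or_cookie


@pytest.mark.parametrize(
    "headers, cookies",
    [
        ({"Authorization": "Bearer abc123"}, {}),  # header path: what I had covered
        ({}, {"access_token": "abc123"}),          # cookie path: the case I had missed
    ],
)
def test_token_accepted_from_either_source(headers, cookies):
    assert token_from_header_or_cookie(headers, cookies) == "abc123"
```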
Although it wasn’t a factor in the observed errors this time, I also want to reflect on the long lead time between writing and deploying the refactor change. I wrote that change in August and only attempted to put it in production last week. A change that huge shouldn’t have been bundled with anything else in an upgrade. In this case it was easy to tell the errors were related to the second change, but that won’t always be true. A more likely outcome was hitting an issue with the refactor itself, or one without an obvious cause, and deploying the two changes together would only have added confusion while looking for it. It’s unacceptable that I let so much time pass between writing and deploying the refactor, and that I then deployed it alongside other changes.
These issues had a limited impact, and the biggest damage was to my own ego. I’m obviously frustrated by the experience, but in reality it’s practically impossible to never introduce bugs. Instead of holding onto that frustration, I’m taking the time to figure out what I did wrong and applying those lessons to improve my work going forward.