#338 – Pulling the Thread
Friday Ship #338 | March 10th, 2023
Last week we (finally) upgraded our infrastructure…
It started with an upgrade to NodeJS, the software that powers our servers. We were on a version set to reach end-of-life by September and needed to upgrade in order to use the latest features, improve performance, and, most importantly, stay secure.
Unfortunately, upgrading NodeJS required us to upgrade our Operating System (we were a few versions behind there, too). Upgrading our OS required upgrading our hosting platform, Dokku (an open-source alternative to Heroku).
As luck would have it, that in turn required a new version of Docker, a new version of Let’s Encrypt (our Certificate Authority), and a handful of other packages. What is typically a painless upgrade turned into a full two-day event for our infrastructure team. A story as old as time in the software industry.
Our original trusty server, and all the settings that power it, were painstakingly configured by our CEO Jordan Husney back in 2016 – long before we even had a DevSecOps team. We had hoped we’d migrate to Kubernetes before we needed to upgrade, but our luck had run out.
Rafael Romero Carmona, our newest Senior DevSecOps member, took this as a challenge. He spun up a new server running the latest software, copied over all of the bespoke settings he could find, and after we tested it on staging, we switched the DNS to the new server.
The result was a zero-downtime upgrade!
However, as tech workers, when something works perfectly the first time, panic ensues.
Sure enough, the next day when we reached peak load, our app became sporadically unavailable for some users. The root cause: we hadn’t set our reverse proxy’s open-file limit high enough to handle all of the traffic that we get on a typical Friday.
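For the curious, here is a rough sketch of the arithmetic behind that kind of limit. A reverse proxy usually holds two file descriptors per proxied request (one socket to the client, one to the app behind it), so the limit has to cover roughly twice the peak number of concurrent connections plus some overhead for logs and listeners. The function names, the overhead figure, and the peak number below are all hypothetical placeholders, not our real configuration or traffic.

```python
import resource  # Unix-only standard library module


def fds_needed(peak_connections: int, fds_per_connection: int = 2,
               overhead: int = 64) -> int:
    """Estimate the descriptors a reverse proxy needs at peak load.

    Assumes two descriptors per proxied connection (client side plus
    upstream side) and a fixed overhead for log files and listening
    sockets. Both defaults are illustrative guesses.
    """
    return peak_connections * fds_per_connection + overhead


def has_headroom(peak_connections: int, soft_limit: int) -> bool:
    """Return True if the soft limit covers the estimated peak usage."""
    return fds_needed(peak_connections) <= soft_limit


if __name__ == "__main__":
    # Soft limit of the current process; the proxy's own limit may differ.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    peak = 1000  # hypothetical peak concurrent connections
    if soft == resource.RLIM_INFINITY or has_headroom(peak, soft):
        print(f"OK: limit covers the estimated {fds_needed(peak)} descriptors")
    else:
        print(f"Too low: need about {fds_needed(peak)}, soft limit is {soft}")
```

A check like this only reports the limit of the process that runs it, so in practice it would need to be pointed at the proxy's own settings.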
Our team in Europe discovered, triaged, and fixed the bug before I even sat down for breakfast in California. While remote work can be challenging at times, events like this are a clear reminder that a globally distributed team gives us a huge advantage when it comes to site reliability.
Usage is generally down this week. After several strong weeks, a dip like this is expected. We’ll investigate further if the trend continues.
This week we…
…wrapped up Sprint 117. This included building the foundation for our new Activity Library, which will support dozens of new meeting types.
…held our first Developer Experience (DX) retro. Engineers shared their pain points & we created some next steps to make development here a little better.
…published a Complete Guide to Project Retrospectives for 2023.
Next week we’ll…
…take a week off from sprinting to focus on the features that we personally care about. We call it slack week & it’s where we get to focus on the issues that matter the most to us, even if they don’t align with our sprint goals.
Have feedback? See something that you like or something you think could be better? Please write to us.