The Art of Purposeful Release Strategy
Defining a release strategy is a common challenge for staff+ engineers. After all, it’s a process that every software product has to go through. And it’s not easy.
I’m pretty sure you’ve heard of (or even survived) some serious release failures that caused outages leading to lost revenue and/or reputational damage. So the first thing we want from our release strategy is safety.
But safety is not everything. Just having no failures is not good enough. Releasing is the only way to deliver new features to our customers. And delivering new features is necessary to grow revenue. What’s more, in many situations, the pace of delivering those new features also affects the reputation of the software. So we also want speed in our release strategy.
Safety and speed are hard things to mix. You have to understand them separately, and then apply the right amount of each, depending on the context-specific purpose of the strategy.
Releasing With Safety
Safety is about minimizing risk. I’ve deliberately used the word “minimize” there, to emphasize that the risk can never be completely eliminated. That’s important to understand and accept, because it allows us to think properly about safety.
The fact that the risk can never be completely removed has led to the oldest (but still present in some organisations) way of thinking about releasing - “if releasing is always risky, let’s minimize that risk by minimizing the number of releases”. This is how we got big-bang releases. An organisation would plan 3-4 (sometimes even fewer) releases within a year. Each release would be packed with changes, because when you only have a few slots to ship something, you want to ship as much as possible in each slot. The result? I think you already know - failures.
That size of the release made it impossible to fully verify quality. Of course, there were processes put in place to try to mitigate this, such as long regression test cycles or change advisory/approval boards. They didn't help. In the end, the only time you know if the release is going to have problems is after the release. You have to accept that there is risk in releasing. You can minimize that risk by releasing to a subset of users and observing the impact. This is the first safety capability you can include in your release strategy - progressive rollout.
There are two challenges to progressive rollout. The first is selecting the subset of your users. This is where the purpose comes into play. If you treat all your users the same and just want to minimize the impact, you can randomly select a certain percentage of them for each release. If you have users of varying importance (and sometimes courage), you can segment them and base your progressive rollout on that. Or maybe your concerns are more on the technical side? Maybe you have infrastructure in multiple regions, and some are easier to deploy to than others? Then you can deploy to the easier ones first and observe before deploying to the others.
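To make the selection options above a bit more concrete, here is a minimal sketch of percentage-based and segment-based selection. Everything in it is an assumption for illustration: the function names, the idea of hashing a user ID together with a release ID, and the "early segment" concept are mine, not a reference to any particular rollout tool.

```python
import hashlib
from typing import Optional, Set


def rollout_bucket(user_id: str, release_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given release.

    Hashing (instead of calling random()) keeps the decision stable:
    the same user always lands in the same bucket for the same release.
    """
    digest = hashlib.sha256(f"{release_id}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_in_rollout(
    user_id: str,
    release_id: str,
    percent: int,
    user_segment: Optional[str] = None,
    early_segments: Optional[Set[str]] = None,
) -> bool:
    """Decide whether a user should see the release (hypothetical helper)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Segment-based selection: users in an early segment always get the release.
    if early_segments and user_segment in early_segments:
        return True
    # Percentage-based selection: expose the first `percent` buckets.
    return rollout_bucket(user_id, release_id) < percent
```

Because the bucket is stable, raising the percentage from 10 to 50 only adds users; nobody who already has the release loses it between stages. A region-based rollout would follow the same shape, with the region name taking the place of the user segment.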
This brings me to the second challenge of progressive rollouts - what to do when you observe an issue. Ideally, you should be able to roll back the change. That’s the second safety capability that your release strategy may include - recovery and rollback procedures. Again, the presence and nature of these procedures will depend on your purpose. Here, the purpose is usually expressed in terms of well-established metrics:
The average time required to restore functionality after a failure is detected (Mean Time to Recover - MTTR)
The maximum acceptable time that an application can be unavailable after a failure (Recovery Time Objective - RTO)
The maximum acceptable amount of data loss due to a failure, expressed as a window of time (Recovery Point Objective - RPO)
It’s up to the staff+ engineer to derive target values for these metrics from the business needs.
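As a rough illustration of how two of these metrics can be tracked, the sketch below computes MTTR from a list of incidents and flags incidents whose downtime exceeded an RTO. The Incident record and its field names are hypothetical; in practice the timestamps would come from whatever incident tooling you already use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with the three timestamps we need."""
    failed_at: datetime    # when the failure actually started
    detected_at: datetime  # when the failure was detected
    restored_at: datetime  # when functionality was restored


def mean_time_to_recover(incidents: List[Incident]) -> timedelta:
    """Average of (restored - detected), matching the MTTR definition above."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.restored_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


def rto_breaches(incidents: List[Incident], rto: timedelta) -> List[Incident]:
    """Incidents whose total downtime (restored - failed) exceeded the RTO."""
    return [i for i in incidents if i.restored_at - i.failed_at > rto]
```

The MTTR clock here starts at detection while the RTO clock starts at the failure itself, so an organisation with slow detection can report a healthy MTTR and still breach its RTO.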
What if you don’t (or can’t) have rollbacks? Then you had better have the speed to quickly release a fix for the issue. This is where speed serves safety. It can minimize risk by allowing you to respond faster. And it’s not the only way. It’s hard to be proficient at something that’s done infrequently. So if your organisation releases infrequently, it won’t be proficient at it. To minimize release risk, you should release often (faster) to build that muscle. So let’s talk about releasing with speed.
Releasing With Speed
When organisations realise that releasing faster is beneficial from a release safety perspective, they jump at it. After all, they’ve always known it’s beneficial from a business perspective. So they start arbitrarily shortening the release cadence. And that’s a problematic approach.
Releasing with speed is not really about release cadence; it’s about flexibility. It’s about being able to release fast when you need to release. So why do organisations so often get stuck in thinking about optimising release cadence? Because release is the way we deliver new features to our customers. So organisations think in terms of delivering new features monthly, weekly, daily. And that thinking leads to a problem. A strict release cadence often forces teams to either try to squeeze in a feature at the expense of quality, or to work on features in isolation on long-lived branches and struggle with integration issues. These are not the ways to achieve speed. They are ways to compromise safety. The right way to achieve speed is to decouple feature delivery from the release, and that’s what you should be thinking about when you think about speed in your release strategy.
What do I mean by decoupling feature delivery from release? I mean being able to isolate and toggle the feature at its current state of development throughout your solution (code and potentially also infrastructure). That way, whenever you release, you can release the current state of the feature without impacting the users. How sophisticated your approach to feature toggling will be, as in any other case, depends on your purpose. If the purpose is simply to decouple the feature before it’s complete, then all you need is a static registry that provides toggle values at build time, and a process to ensure that each toggle is removed once it’s no longer needed. Maybe you want toggles to live longer as a rollback mechanism, in which case you need a tool that allows you to change their values at runtime. Or maybe you want to add more business value to your toggles by making them an experimentation mechanism for the business, in which case you will need integration between toggles and observability systems. Whatever your purpose, it will always result in freeing teams from thinking about “when” the release will happen. It will happen when it’s needed, and it will happen fast with the push of a button.
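Here is a minimal sketch of the first two levels of toggle sophistication: a static registry and a store whose values can change at runtime. The toggle name, the ToggleStore class, and the render_checkout function are all made up for illustration; a real setup would more likely sit on top of a dedicated feature-flag tool.

```python
from typing import Dict, Mapping

# Static registry: values are fixed when the application is built or deployed.
# "new_checkout" is a hypothetical feature that is still in development.
STATIC_TOGGLES: Mapping[str, bool] = {"new_checkout": False}


class ToggleStore:
    """In-memory toggle store whose values can be changed at runtime."""

    def __init__(self, defaults: Mapping[str, bool]) -> None:
        self._values: Dict[str, bool] = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        # Unknown toggles are treated as off, so unfinished code stays dark.
        return self._values.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        # Only registered toggles can be flipped, to catch typos early.
        if name not in self._values:
            raise KeyError(f"unknown toggle: {name}")
        self._values[name] = enabled


def render_checkout(toggles: ToggleStore) -> str:
    """The unfinished feature ships in the release but stays behind its toggle."""
    if toggles.is_enabled("new_checkout"):
        return "new checkout"
    return "legacy checkout"
```

With only the static registry, turning the feature on means another release. With the runtime store, calling set("new_checkout", True) switches it on without a release, and set("new_checkout", False) acts as the rollback.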
Yes, the second aspect you should consider for speed in your release strategy is automation. Machines are great at doing things fast and repeatably, humans not so much. Of course, that doesn’t mean your strategy should assume 100% automation from day one. Remember the purpose, and choose the elements to automate that support it.
Good Strategy Is Always About Balance (And It’s Boring)
As we discussed, how much you want to invest in progressive rollouts, recovery and rollback procedures, decoupling feature delivery from release, and automation depends on your purpose. Understanding that purpose will help you find the right balance in implementing these capabilities. It will also help you choose the right techniques. As you make these decisions, remember to be boring and don’t design against your team.