Staff+ Engineer in the SaaS World - Architectural Survival Kit
As part of the “Staff+ Engineer in the SaaS World” series, we have explored several key concepts, including the multitenancy spectrum, the tenant lifecycle, control planes, and costs/pricing. A recurring theme in all of these articles is that the context, the choices, and the resulting solution architecture will change and evolve over time. This is because every SaaS solution goes through a journey. On paper, this journey is often described as quite straightforward:
“Design and architect your solution” → “Implement your solution” → “Operate and optimise your solution”
This is true but, as always, the devil is in the details. The first two parts of this journey may indeed be straightforward (if you’re building a well-defined MVP), but the third certainly won’t be. In reality, “operate and optimise” means navigating a jungle of competing requirements coming from business, regulations, client expectations, scaling, costs, and much more. To survive in such a jungle, you will need to be able to…
OK, maybe improvising is not something that you should do too often, although it will sometimes be useful and necessary. The most important thing is the ability to adapt. If you ensure that you have it, it will allow you to overcome your challenges.
Today, I want to share with you some essential techniques for your architectural survival kit that will give you adaptability when needed.
Feature Flags (So New Features Don’t Choke the Path)
I touched on feature flags when writing about “The Art of Purposeful Release Strategy”. They decouple feature delivery from the release, enabling faster releases. The last thing you want is to have to say that you can’t fix a critical issue, or introduce a small improvement that would help bring in a new client, because you are in the middle of developing a “big feature”.
Yes, some teams avoid being blocked by “big features” by developing them in long-lived branches and maintaining a very sophisticated topology of release branches. In my experience, these approaches still choke teams, just more slowly. Rebasing those long-lived branches often grows into separate tasks that eat tons of time. The same goes for cherry-picking and backporting changes to the correct release branches (and in the end, something still turns out to be missing). If you don’t want to be choked, I advise you to aim for a branching model that is close to trunk-based development and to use feature flags.
Of course, feature flags are not “free”. They require proper management. They should never be used as an excuse to keep dead code around. Once the new feature is released, the old code paths and the feature flag should be removed from the code base. Removed, but not forgotten. You should maintain knowledge of past feature flags so that you never end up reusing one. If someone didn’t remove all the code guarded by the old flag, reusing its name can bring that code back to life (go and search for the history of Knight Capital). This also means that feature flags should have a short lifespan. Otherwise, you risk creating an unmanageable mess. They are not a mechanism for introducing feature tiers, so you should not use them for that.
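To make the “removed, but not forgotten” rule concrete, here is a minimal sketch of a flag registry. Everything in it (the Flag and FlagRegistry names, the remove_by field, the in-memory storage) is a hypothetical illustration, not the API of any particular feature flag product. It assumes three rules: a retired flag name can never be registered again, an unknown flag always evaluates to off, and every flag carries a date by which it should be gone.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Flag:
    name: str
    enabled: bool
    remove_by: date  # assumption: every flag declares when it should be gone


class FlagRegistry:
    """Hypothetical in-memory registry; a real one would be persisted."""

    def __init__(self, retired_names=()):
        self._flags = {}
        self._retired = set(retired_names)  # names that must never be reused

    def register(self, flag):
        if flag.name in self._retired:
            raise ValueError(f"flag name {flag.name!r} was used before")
        if flag.name in self._flags:
            raise ValueError(f"flag {flag.name!r} is already registered")
        self._flags[flag.name] = flag

    def is_enabled(self, name):
        # Unknown or retired flags evaluate to off, so leftover code that
        # still checks a removed flag stays dormant.
        flag = self._flags.get(name)
        return flag is not None and flag.enabled

    def retire(self, name):
        # Remove the flag, but remember its name forever.
        del self._flags[name]
        self._retired.add(name)

    def overdue(self, today):
        # Flags that have outlived their declared lifespan.
        return sorted(f.name for f in self._flags.values() if f.remove_by < today)
```

A scheduled check that fails the build when overdue() returns anything is one possible way to keep lifespans short; whether that is too strict depends on your team.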
API Versioning (So You Don’t Trip When You Take a Step)
From the very beginning of our software engineering journey, we are taught to avoid coupling. This is because coupling hinders change. So we aim at decoupling our code internally (although we often take this too far, but that’s a different discussion). When talking about solution-level architecture, it’s important to recognise that APIs are often the primary source of coupling. This coupling comes in two types.
The first type is coupling between the services that compose your solution (regardless of whether they are microservices, picoservices, or services of any other arbitrary size). The moment someone says that another service needs to be updated before this one can be released, you have a problem because you’re tripping over your own legs. If someone claims this is because business changes have to be made to both services, you should take a look back at the feature flags section. If someone says it’s due to a contract dependency, it means you’re lacking an API versioning strategy.
The second type of coupling is between your solution and the outside world. If you expose any API, someone will integrate with it (even if that wasn’t the intended purpose). Even worse, people will be frustrated when that integration breaks, even if you never promised to maintain it. The only way to avoid this is to have an API versioning strategy.
When considering your API versioning strategy, you should make informed decisions regarding a few key factors:
What mechanism will you use to specify the version? The typical options for web-based APIs are URI, query string, header and media type.
Will versioning be implicit (there is a default version that will be used) or explicit (the client must always specify a version)? I would recommend the latter to avoid issues when the default version needs to change.
How many past versions do you intend to support?
What will your strategy be for deprecating old versions? Will you simply shut them down (in the case of web-based APIs, with a proper status code) and risk upsetting those who are still using them? Or do you take them offline with the option of bringing them back online when necessary? Do you have monitoring in place to identify and notify those who are still using those versions?
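The decisions above can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the X-Api-Version header name, the handler functions, and the response shapes are all hypothetical, versioning is explicit (no default), retired versions answer with 410 Gone, and a simple counter stands in for real monitoring. Real HTTP header names are case-insensitive; the plain dict lookup here ignores that for brevity.

```python
from collections import Counter
from http import HTTPStatus

VERSION_HEADER = "X-Api-Version"  # hypothetical header name


def list_orders_v2():
    # Hypothetical older response shape.
    return {"orders": []}


def list_orders_v3():
    # Hypothetical newer, backward-incompatible response shape.
    return {"items": [], "next_page": None}


HANDLERS = {"2": list_orders_v2, "3": list_orders_v3}  # supported versions
RETIRED = {"1"}  # versions that have been shut down

# Stand-in for monitoring: how often each version is still requested.
usage_by_version = Counter()


def dispatch(headers):
    """Return (status, body) for a request, based on its explicit version."""
    version = headers.get(VERSION_HEADER)
    if version is None:
        # Explicit versioning: there is no default to fall back on.
        return HTTPStatus.BAD_REQUEST, {"error": f"{VERSION_HEADER} header is required"}
    usage_by_version[version] += 1
    if version in RETIRED:
        return HTTPStatus.GONE, {"error": f"version {version} has been retired"}
    handler = HANDLERS.get(version)
    if handler is None:
        return HTTPStatus.BAD_REQUEST, {"error": f"unknown version {version}"}
    return HTTPStatus.OK, handler()
```

Counting requests to retired versions as well is deliberate in this sketch: it tells you who still needs to be notified, and whether bringing a version back online is worth considering.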
When making these decisions, you need to consider the impact on both your team and your clients. You may not achieve the perfect balance with your first approach, but that’s no reason to delay implementing an API versioning strategy.
Expand and Contract (So You Can Keep Your Footing)
The expand and contract pattern emerged quite some time ago. Its purpose is to implement backward-incompatible changes safely. This is achieved by breaking the change into three distinct phases:
Expand, where you perform the changes that don’t break the old version and provide as much support as possible for the new version.
Migrate, where you perform the changes that are required for full support of the new version and are breaking from the old version’s perspective.
Contract, where you perform any leftover changes that are essentially a clean-up after the old version and are not breaking from the new version’s perspective.
I first encountered it in the database space, where this pattern ensured safety and limited downtime to the migration phase only. Since then, the pattern has provided me with a logical framework for conducting safe changes in various areas, such as restructuring queues, evolving interfaces, or updating infrastructure. It has also been adapted to zero-downtime deployments, where downtime is eliminated by eliminating the migration phase. This, of course, requires careful consideration of the individual changes and often results in more iterations than would be needed if downtime were allowed. However, I have yet to encounter a scenario where this wasn’t possible.
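As an illustration of the zero-downtime adaptation, here is a minimal sketch of renaming a column with Python’s built-in sqlite3 module. The users table, the name and full_name columns, and the function names are hypothetical. In this variant every step is non-breaking on its own, so the migration phase shrinks to a backfill that can run while the old version is still live.

```python
import sqlite3


def expand(db):
    # Expand: add the new column next to the old one.
    # Old readers and writers are unaffected.
    db.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def write_user(db, user_id, name):
    # Transitional application code: write both columns (dual write),
    # so old and new readers both see complete data.
    db.execute(
        "INSERT INTO users (id, name, full_name) VALUES (?, ?, ?)",
        (user_id, name, name),
    )


def backfill(db):
    # Fill the new column for rows written before the expand step.
    db.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")


def contract(db):
    # Contract: once nothing reads or writes the old column, remove it.
    # The table is rebuilt rather than using DROP COLUMN, which only
    # newer SQLite versions support.
    db.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT)")
    db.execute("INSERT INTO users_new (id, full_name) SELECT id, full_name FROM users")
    db.execute("DROP TABLE users")
    db.execute("ALTER TABLE users_new RENAME TO users")
```

Each function would typically ship in its own deployment, with contract running only after monitoring confirms that no deployed code still touches the old column.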
The philosophy behind the expand and contract pattern is your ultimate technique for orchestrating changes to your solution without negatively affecting your clients.
Yes, Deployments Are Your Steps
The ability to adapt means the ability to make changes. If you have the constant ability to make changes, they don’t have to be perfect. You can take a step forward, you can take a step back, and you can take a step to the side. Having the constant ability to make changes and take all the steps you want requires being able to deploy safely and whenever needed. This is why the above essentials of an architectural survival kit for a staff+ engineer in the SaaS world focus on achieving that.