Coverage, Mutants, and Ensuring Test Quality
“How will you ensure the quality of the code you deliver?” I frequently ask this question when vetting more senior engineers or setting up new teams. I must say that automated tests are increasingly becoming part of the standard answer. That’s good. The challenging conversation begins when I ask how they intend to ensure the quality of those tests. I usually get two answers. One is silence. The other is measuring coverage. Surprisingly, I consider the latter to be the more problematic one. Why? Because a greenfield conversation about good practices is easier than one where you first need to uncover what someone means by a certain practice and then discuss its flaws. And people often don’t know what they mean when they say “measuring coverage”, because it comes in many flavours.
Many Flavours of Coverage
In most cases, when people say “measuring coverage”, they are referring to the percentage of code executed during testing. To steer the conversation in a creative direction, I usually ask whether they mean statement or line coverage. The subtle difference between the two (whether we observe logical units of execution or take a volume-based approach to the codebase) usually sparks some thought. Sometimes, someone will try to cut the conversation short by saying “the simpler one to measure”, but this actually broadens it, because the simplest type of coverage to measure is function/method coverage. So why is function/method coverage not that common? Because knowing only what percentage of functions or methods has been called during tests provides little information.

Does knowing what percentage of lines or statements has been executed really provide much more? We would prefer to know what percentage of the code’s logic has been tested. This usually brings us to branch coverage: knowing what percentage of the possible branches from decision points has been taken. This sounds good, but what if the conditions are complicated? Should we go further and measure condition coverage? Should we try to be more comprehensive and measure the percentage of all possible paths through the code that have been executed? At this point, you will find that very few tools support this, due to impracticality: the number of paths tends to grow exponentially with code complexity. You may start to feel that coverage is flawed.
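To make the difference concrete, here is a minimal Python sketch (the function, test name, and values are mine, purely for illustration). A single test executes every line of the function, so statement and line coverage report 100%, yet the branch where the condition is false is never taken:

```python
def shipping_cost(weight_kg: int) -> int:
    cost = 10
    if weight_kg > 20:
        cost += 5  # surcharge for heavy parcels
    return cost

def test_heavy_parcel():
    # Executes every line, so statement/line coverage reports 100%,
    # but the weight_kg <= 20 branch is never taken; a branch-aware run
    # (e.g. `coverage run --branch -m pytest`) would flag the missed branch.
    assert shipping_cost(25) == 15
```

Function/method coverage would also report 100% here, which shows how little each of the cheaper flavours tells you on its own.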
Why Coverage Is Flawed
Yes, coverage is flawed. This is primarily because it provides very high-level information. We get a number that tells us what percentage of our solution has been executed during testing, but this percentage rarely represents the solution’s actual logic. Also, “executed” does not necessarily mean “tested”. Have you ever heard stories about writing tests with “assert true” just to satisfy high coverage requirements? Last but not least, coverage can only be calculated reasonably if our testing strategy adheres to a specific interpretation of testing terminology: the one centred around the typical “testing pyramid” (I once shared my thoughts on how this can limit a testing strategy). What if the majority of our tests are end-to-end tests that reflect how users would interact with the solution?
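Those “assert true” stories are easy to reproduce. A contrived Python sketch (the names are mine): the test below executes every line of the function, so the coverage report looks perfect, while verifying nothing at all:

```python
def total_with_tax(net: float, rate: float) -> float:
    return net * (1 + rate)

def test_total_with_tax():
    total_with_tax(100.0, 0.23)  # executes the code under test...
    assert True                  # ...and then asserts nothing about it
```

Any coverage tool will count `total_with_tax` as fully covered; nothing in the number tells you that its behaviour was never checked.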
The sad truth about coverage is that it’s only a potentially negative metric. What do I mean by that? High coverage doesn’t guarantee that our tests are good (only that they execute code), so coverage can’t indicate a positive state. At the same time, low coverage may indicate poor-quality tests, but only potentially, which means that coverage can only potentially indicate a negative state. So, is there an alternative practice for ensuring test quality?
Killing Mutants Beats Coverage
My preferred practice for ensuring test quality is mutation testing. In mutation testing, you introduce small, deliberate changes (mutations) into the code to check how well the existing tests detect them; each changed version of the code is called a mutant. If the tests detect a mutant (at least one test fails), the mutant is considered “killed”. If a mutant survives, your tests have gaps. The available mutations depend on the mutation testing tool you choose. The most common include changing an operator (e.g., > to >=), swapping a value (e.g., true to false), or deleting a line of code. However, some frameworks offer very creative mutations.
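A minimal Python illustration (the function and values are mine): suppose the tool mutates the `>=` below into `>`. The first test still passes against the mutant, so on its own it would let the mutant survive; the boundary test is what kills it:

```python
def can_vote(age: int) -> bool:
    return age >= 18

def test_can_vote_obvious_cases():
    assert can_vote(30) is True   # also true for the ">" mutant
    assert can_vote(10) is False  # also false for it: the mutant survives

def test_can_vote_boundary():
    # Kills the mutant: with ">", can_vote(18) would return False
    # and this assertion would fail.
    assert can_vote(18) is True
```

Note that line coverage of `can_vote` is already 100% after the first test, yet the off-by-one mutant would live on. That is exactly the kind of gap mutation testing exposes and coverage cannot.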
You can think of mutation testing as a test for your tests. My typical approach is to set up a dedicated pipeline for mutation testing only. Because analysing every mutant means re-running the test suite many times, it runs at regular intervals (usually no more than once or twice a week) rather than on every commit, and it fails when the percentage of killed mutants (the mutation score) falls below a certain threshold. Over time, I aim for this threshold to be quite high. A sketch of such a pipeline is shown below.
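Here is what that could look like as a scheduled pipeline, sketched in GitHub Actions syntax. The `run-mutation-tests.sh` script and its `--min-killed` flag are hypothetical placeholders; the real command and threshold option depend on your tool (for example, pitest has a `mutationThreshold` parameter for Java, and Stryker has `thresholds.break` for JavaScript/TypeScript).

```yaml
name: mutation-tests
on:
  schedule:
    - cron: "0 3 * * 1"  # weekly, Monday 03:00 UTC (too slow for every commit)
jobs:
  mutation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Hypothetical wrapper script: runs the mutation testing tool and
      # exits non-zero when the kill rate drops below the given threshold.
      - run: ./run-mutation-tests.sh --min-killed 80
```

This allows me to achieve the true goal of my testing strategy.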
The Goal Is Trust, Not Satisfying Metrics
The only goal of having a testing strategy and automated tests is trust: trust in what you are delivering, and trust that your tests are an effective quality gate when it comes to detecting defects. This can only be achieved by revealing the weaknesses in your tests, not by satisfying an arbitrary number.

