38

I am adding unit tests to some existing code which initially was not tested at all.

I have set up my CI system to fail builds where the code coverage percentage is reduced compared to the previous build - mainly to set a path of continuous improvement.

I then encountered an unexpected and annoying (although mathematically correct) situation that can be reduced to the following:

  • Before refactor:
    Method 1: 100 lines of code, 10 covered --> 10% coverage
    Method 2: 20 lines of code, 15 covered --> 75% coverage

    Total: 25 / 120 --> ~20% coverage

  • After refactor:
    Method 1: 100 lines of code, 10 covered --> 10% coverage (untouched)
    Method 2: 5 lines of code, 5 covered --> 100% coverage

    Total: 15 / 105 --> ~14% coverage

So even though IMO the situation was improved, my CI system obviously does not agree.

Admittedly, this is a very esoteric problem, and it would probably disappear once the bulk of the code is covered better, but I would appreciate insights and ideas (or tools/configurations) that might allow me to keep enforcing an "improvement" path for coverage with my CI system.

The environment is Java, Jenkins, JaCoCo.

18
  • 2
    One way to view the effect: total coverage ratio is the weighted sum of coverage of the components, weighted by their relative size (.833 * 10% + .167 * 75%, etc.). So your refactoring also changed weights, giving more weight to less covered components.
    – Pablo H
    Commented Nov 8, 2019 at 10:56
  • 20
    My suggestion? Disable the coverage nanny temporarily for this commit and move on. These tools are aids to measure code quality, but they shouldn't be the only tool you use to judge it. You know you actually have improved the code; if the measurement tool didn't catch that, the tool is wrong, so just tell it not to freak out, and don't freak out yourself worrying too much about it.
    – Lie Ryan
    Commented Nov 8, 2019 at 11:28
  • 4
    I built a tool which handles this situation a little better: as long as the lines you have changed are still covered, the check passes. So in this case the 5 new lines are covered and diffFilter would pass the build. github.com/exussum12/coverageChecker
    – exussum
    Commented Nov 8, 2019 at 13:52
  • 6
    The metric isn't lying to you. You're relying on a metric that counts the coverage of lines, not e.g. average coverage of classes or files. The problem you're highlighting is effectively how you want it to be measured. If you want it to be measured differently, use a different metric.
    – Flater
    Commented Nov 8, 2019 at 15:54
  • 3
    If you are enforcing a policy that emphasizes code coverage, then you need to emphasize coverage in your development efforts, i.e. prioritize coverage of the 90 un-covered lines in your method #1. In other words, use the 80/20 rule - focus first on increasing coverage of method #1 (or perhaps reduce its line count) before tackling the method #2 refactor.
    – Zenilogix
    Commented Nov 9, 2019 at 3:50

11 Answers

18

You can mitigate the effect to some degree by allowing the relative code coverage to drop when the total number of uncovered lines also drops, or when the total number of lines drops, since these are pretty clear signs of a refactoring which sets a new baseline for your coverage metrics.

In your example, the total number of uncovered lines drops from 95 to 90, and the total number of lines from 120 to 105. That should give you enough confidence that the relative coverage is quite unimportant in this situation. However, if you add new code, the metric still reflects the expectation that the new code gets at least the relative coverage you had before.
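
To make that gate concrete, here is a minimal sketch in plain Java, assuming the line counts are read from the previous and current JaCoCo reports (the class and method names are illustrative, not any JaCoCo API):

    final class CoverageGate {

        // Pass when coverage did not drop, or when a drop comes with clear
        // signs of a shrinking refactoring: fewer uncovered lines overall,
        // or fewer lines overall.
        static boolean passes(int prevCovered, int prevTotal,
                              int covered, int total) {
            double prevRatio = prevCovered / (double) prevTotal;
            double ratio = covered / (double) total;
            if (ratio >= prevRatio) {
                return true;
            }
            int prevUncovered = prevTotal - prevCovered;
            int uncovered = total - covered;
            return uncovered < prevUncovered || total < prevTotal;
        }

        public static void main(String[] args) {
            // The example from the question: 25/120 before, 15/105 after.
            System.out.println(passes(25, 120, 15, 105)); // prints: true
        }
    }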

Side note: be aware that none of these metrics tells you whether the tests are covering the most sensible parts of the code base.

2
  • Eventually this is the course of action I took: I allowed a 2% reduction. I just don't want this to allow a slow decline over time. Commented Nov 10, 2019 at 10:15
  • I chose this answer mainly for being pragmatic and effective. I implemented the same solution even before asking this question. Hopefully when the bulk of the code is tested, I would not need to use this axe. The discussion around my question was most interesting and I got many good insights and learned a thing or two as well, but at the end of the day the issue was the difficulty for a machine to appreciate the quality of a change that is (arguably) obvious to a person. Thanks everyone! Commented Nov 10, 2019 at 11:42
47

The problem I see here is that you have made code coverage a trigger for build failure. I do believe that code coverage should be something that is routinely reviewed, but as you have experienced, you can have temporary reductions in your pursuit of higher code coverage.

In general, build failures should be predictable. The following make good build failure events:

  • Code will not compile
  • Tests will not run
  • Tests fail
  • Post-build packaging fails (e.g. can't make containers)
  • Packages cannot be published to repositories

All of these are pass/fail, quantitative measures: they work (binary value 1) or they don't (binary value 0).

Code quality measures should be monitored because they are qualitative. NOTE: percentages are a qualitative measure even though they have a number associated with them. The following are qualitative measures:

  • Cyclomatic complexity (a number associated with a qualitative concept, i.e. the number has a qualitative meaning)
  • Maintainability index (a number associated with a qualitative concept, i.e. the number has a qualitative meaning)
  • Percentage of covered lines/branches/methods (a qualitative summary of quantitative results)
  • Static analysis results (no numbers involved)

As with any trend, when you look over past releases you can find momentary reductions while the overall trend improves or stays the same (100% remains 100%). If the long term trend is toward less tested or maintainable code then you have a team problem you need to address. If the long term trend is toward higher quality code based on your measurements, then the team is doing their jobs.

8
  • 7
    I'm not sure I agree with your classification of "qualitative" vs. "quantitative" – I thought quantitative means having a number (which is the case in your second group). Commented Nov 9, 2019 at 1:33
  • @PaŭloEbermann I agree; it seems quantitative and qualitative have been used exactly backwards in this answer. A pass/fail test is qualitative, the degree of test coverage is quantitative. Commented Nov 9, 2019 at 16:33
  • qualitative: relating to, measuring, or measured by the quality of something rather than its quantity. In this case, just because a number is assigned for relative comparison's sake doesn't mean that number has any relevance taken in isolation. Pass/Fail is 0 or 1, binary, a concept that we should all be familiar with, and very much a quantity. Commented Nov 9, 2019 at 17:32
  • In general, percentages are used in qualitative analysis, particularly when comparing things that are of unequal size, where the proportion of things is important. Commented Nov 9, 2019 at 17:35
  • 1
    @BerinLoritsch - I agree, and your insights are valuable; the coverage metric is not a perfect metric to fail a build on, but I am pragmatic here - I noticed that sometimes, especially when under pressure, people tend to cut corners, so I wanted a gatekeeper. The gatekeeper is pretty weak - just don't make things worse, coverage-wise. Sometimes the "we are all grown-ups here" notion does not work if not enforced. Commented Nov 10, 2019 at 10:09
37

Have you considered not using code coverage metrics?

I'm not going to argue that code coverage isn't something that you should look at. It absolutely is. It's good to keep track of what was covered before a build and after a build. It's also good to make sure that you're providing coverage over new and modified lines of code in a change (and, depending on your environment, there may be tools that can facilitate this).

But as you're seeing right now, you can refactor and end up removing lines that were covered such that the percentage of lines covered goes down. I'd agree with your reasoning - the logical coverage has not changed. You've also probably simplified other aspects of your code (I'd be curious how various complexity measures changed before and after your refactoring, for example). But the math is correct - your percentage of lines covered has, in fact, gone down.

Additionally, code coverage says absolutely nothing about the effectiveness or quality of the tests. It also says nothing about the risk behind each part of the code. It's relatively easy to write tests that cover trivial lines or check the most simple logic while not giving much confidence in the quality of the system.

11
  • 11
    "Additionally, code coverage says absolutely nothing about the effectiveness or quality of the tests." very important point. I've heard from people who worked in companies that demands 100% code coverage or as close as possible (say, 95% or 98% minimum or thereabout). They end up writing tests for dumb getters and setters and not even bother to assert the results are correct - the tests are in the form of call getter -> done. Technically, that's a test. Practically, tells you nothing. Other tests can also work in this fashion and not show correctness of code at all.
    – VLAZ
    Commented Nov 8, 2019 at 7:56
  • 3
    Also note there are different types of coverage. Line coverage, condition coverage etc. Line coverage is the worst to have.
    – Sulthan
    Commented Nov 8, 2019 at 8:50
  • 3
    While CC is not an amazing metric, it is a useful metric. If I see one project with 10% CC and another with 85% CC, it gives me some useful information. Of course, it doesn't tell you that your tests are perfect. But that is no reason not to have any standards. And having high CC is a good start for test automation, as having covered code means it can be run by an automated test.
    – Euphoric
    Commented Nov 8, 2019 at 10:19
  • 3
    The advice I hear is that low coverage means you should think about adding useful tests, not about getting the coverage up.
    – toolforger
    Commented Nov 8, 2019 at 16:01
  • 1
    @Euphoric - My thoughts exactly - CC is not perfect, but useful. Commented Nov 10, 2019 at 10:14
15

This is called Simpson's paradox, and it is a well-known statistical issue with your approach.
You could even construct cases where you refactor and afterwards every single method has a higher coverage, but the overall coverage still goes down.
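
To see how, with invented numbers: suppose a change grows the poorly covered method while slightly improving every method's own coverage:

  • Before: Method 1: 10 / 100 covered (10%); Method 2: 15 / 20 covered (75%); total 25 / 120 (~21%)
  • After: Method 1: 24 / 200 covered (12%); Method 2: 16 / 20 covered (80%); total 40 / 220 (~18%)

Each method's coverage rose, yet the total fell, because the aggregate is a size-weighted average and the weight shifted toward the less covered method.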

You will need to find other ways than overall percentages to catch the kind of 'regressions' you wanted to catch (although I liked the approach when reading it).

2
  • Obviously - if, for example, the 100-line function was the single worst offender and you changed it from 100 lines / 10 covered to 1000 lines / 101 covered, its own coverage would rise slightly, yet the total coverage would still drop.
    – gnasher729
    Commented Nov 8, 2019 at 11:16
  • 1
    Thanks, it is good to have a name for the condition :) Commented Nov 10, 2019 at 10:17
5

While all the other answers tell you not to use code coverage as a quality metric, none really answered the question: what to do with reduced coverage?
The answer is quite simple: use absolute numbers, not percentages. More important than the coverage percentage are the number of untested functions and the number of untested branches. Also, it would be great to get a list of the untested functions and branches.
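
If you want to automate that, here is a minimal sketch (not an official JaCoCo tool) that reads the report-level counters out of a JaCoCo XML report and prints the absolute missed counts per counter type (METHOD missed is roughly "untested functions", BRANCH missed is "untested branches"). The report path is an assumption; adjust it to your build layout:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class MissedCounts {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The report declares a DTD; tell the parser not to fetch it.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder().parse(
                Files.newInputStream(Paths.get("target/site/jacoco/jacoco.xml")));
            NodeList counters = doc.getElementsByTagName("counter");
            for (int i = 0; i < counters.getLength(); i++) {
                Element c = (Element) counters.item(i);
                // Keep only the grand totals: counters directly under <report>.
                if ("report".equals(c.getParentNode().getNodeName())) {
                    System.out.printf("%s missed: %s%n",
                        c.getAttribute("type"), c.getAttribute("missed"));
                }
            }
        }
    }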

3
  • 1
    Yes, currently OP is in the paradoxical situation that removing code lines lowers his metric. Using absolute instead of relative code coverage can at least fix this aspect of the problem.
    – kutschkem
    Commented Nov 8, 2019 at 10:50
  • 1
  • The absolute number of covered lines here is still a reduction - from 25 to 15. Commented Nov 8, 2019 at 12:59
  • 6
    @afaulconbridge Yes, and that number is irrelevant. As I said in the answer, what is important is an increase in the number of untested functions and untested branches. Commented Nov 8, 2019 at 13:03
2

One option that might help with this is to switch from line coverage to branch coverage. I say 'might' help because you can run into a similar situation with uncovered branches if you reduce branches in a function.

Overall, though, branch coverage is a more meaningful metric than line coverage. And I'll hazard a guess that when you refactor a long method into a short one, you usually reduce lines of code more than you reduce branches. Of course, if it's really nasty code, then you might end up pruning a lot of branches. Even in that case, I think you are still better off counting branches.
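
A tiny hypothetical example of the difference: a single test asserting that max(3, 2) == 3 executes every line of the method below (100% line coverage), but it exercises only one of the two branches of the condition, so branch coverage is 50%:

    class MaxExample {
        // One line, two branches: (a > b) true and (a > b) false.
        static int max(int a, int b) {
            return (a > b) ? a : b;
        }
    }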

6
  • 1
    The problem I see with using branch coverage as the metric to focus on is that branches are harder to test and cover completely. It might lead devs to put unnecessary effort into covering corner execution paths that are totally meaningless for the business. Branches generate corner cases more often than we wish, and it's hard to say whether they are dead ends or possible (but rare) cases. It has all the ingredients to turn devs into paranoid-tester mode.
    – Laiv
    Commented Nov 8, 2019 at 6:46
  • 1
    @Laiv My team have been using it for years and this has never been an issue but YMMV. I find this logic suspect, however. Essentially you are saying you want to be ignorant of untested branches because it might be too hard to test them. If you don't know what is untested, how do you know that? Perhaps they are infrequent cases but the impact is critical. Branches that are not tested could be indicative of flaws in the program flow i.e. they shouldn't exist. Eliminating the branch is a possible solution.
    – JimmyJames
    Commented Nov 8, 2019 at 16:21
  • It is the corner-case branches that generally hide the bugs, however. Branches taken thousands of times a day very quickly have code that works; it is the weird edge case that hides the totally bogus code. Far more important to have tests for the error-handling edge cases than for the stuff that will be noticed by the dev the first time they hit run. Coverage is a useful METRIC, but as soon as you treat it as more than a vaguely interesting number it (like all metrics) tends to become the tail that wags the dog. This is not something you need to hard-fail your builds over.
    – Dan Mills
    Commented Nov 8, 2019 at 16:41
  • This is good advice, but in my case most of the CC metrics were reduced, lines as well as branches. Commented Nov 10, 2019 at 10:19
  • 1
    @JonathanGross I agree that some level of requirement is useful. That way the time in code review can be spent more on whether the tests are good and meaningful and less on whether there is enough coverage. In fact, putting in rules like this can help prevent the real problem here: the large method with low coverage.
    – JimmyJames
    Commented Nov 11, 2019 at 15:42
2

What you are doing is not wrong. While code coverage is not an amazing metric, it is a useful metric. It does not tell you that your code is perfectly tested, but having no standard that you expect your tests to meet is also a bad idea.

One way to solve this problem is to simply change the limit if you believe the change is justified. The whole point of this kind of hard limit is to notify you quickly when something bad happens. In many cases, the reduction will be unintentional and needs to be fixed by introducing more and better tests. But if you identify the failure as a false positive, and you can prove to yourself and your team that the reduction is expected, then it is perfectly fine to reduce the limit.

Another way is to turn that reduction into a "soft" failure. For example, Jenkins can mark a build "unstable": the final artifacts are still usable, but there might be some degradation of functionality. That gives you time to decide what to do - whether tests need to be expanded or limits lowered. This could also be implemented using two different builds: one that produces the build artifacts with lesser checks and limits (when it fails, not even the artifacts are usable), and a second with much stricter limits that fails if any of your quality metrics gets broken.
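
As a sketch of the two-tier idea in plain Java (the thresholds and status codes are invented for illustration; a CI job could map the middle value to Jenkins' "unstable" result):

    final class TwoTierGate {
        static final int PASS = 0, UNSTABLE = 1, FAILED = 2;

        static int check(double coverage, double hardFloor, double softTarget) {
            if (coverage < hardFloor)  return FAILED;   // block the build
            if (coverage < softTarget) return UNSTABLE; // usable, but flagged
            return PASS;
        }
    }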

1
  • I might try to have a soft failure, although I believe it would cause confusion with the team, and require some fix anyway. Commented Nov 10, 2019 at 10:20
1

In addition to the answer from Berin Loritsch, I would like to highlight a technique called Mutation Testing (see https://pitest.org/ for a Java tool).

Mutation testing reports which lines of code could be mutated to behave differently without causing a unit test to fail.

In your case, if the refactoring was done well, you should see the mutation testing metrics improve (when expressed as a percentage of unit-test line coverage).
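
As a hypothetical illustration (JUnit 4, not from the question's code base): the test below executes every line of add(), giving 100% line coverage, yet asserts nothing, so a PIT mutant such as "replaced integer addition with subtraction" survives; line coverage alone cannot see the difference:

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class AdderTest {
        static int add(int a, int b) {
            return a + b;
        }

        @Test
        public void coversEveryLineButKillsNoMutant() {
            add(2, 2);                     // executed, result never checked
            // assertEquals(4, add(2, 2)); // this would kill the mutant
        }
    }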

This link describes a couple of (java-related) mutation testing tools: https://pitest.org/java_mutation_testing_systems/

1
  • This is something I will investigate Commented Nov 10, 2019 at 10:22
1

Before refactor: 95 untested lines.

After refactor: 90 untested lines.

You should always consider how many untested lines you have!

The code coverage percentage is a good measure of this only if your code always grows, which is unfortunate.

Another way to put it: whenever you manage to reduce the amount of lines in your code, without breaking any tests, you have all rights to ignore code coverage percentage!

1

Here's a solution that doesn't involve changing the definition of the code coverage metric or requesting / giving yourself an exemption to the policy. That said, the policy as stated is arguably flawed, and you might be better off counting the total number of uncovered lines or enforcing a per-file lower bound (e.g. 80%) instead of requiring monotonicity.

Improve the code coverage of Method 1 prior to or at the same time as refactoring Method 2.

Here's the current situation stated in the question.

                    Old World                   New World

           covered  total   %           covered  total   %
 Method 1       10    100  10                10    100  10
 Method 2       15     20  75                 5      5 100
Aggregate       25    120  20.8              15    105  14.3

In the New World, the total number of covered lines must be at least 22 for the metric to fail to decrease. (The ceiling of 105 * 25 / 120 is 22).

So, all you need to do is add tests that cover 7 more lines of the file where Method 1 and Method 2 live.
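
The same arithmetic as a throwaway snippet (numbers taken from the table above):

    public class RequiredCoverage {
        public static void main(String[] args) {
            // Covered lines needed so the new ratio is at least the old 25/120.
            int required = (int) Math.ceil(105 * 25 / 120.0); // 21.875 -> 22
            System.out.println(required - 15);                // 7 more lines
        }
    }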

This solution does have the drawback of including an unrelated change in the same commit in order to satisfy the code coverage monotonicity rule. It still might be worth doing if this one change to Method 2 is not important enough to justify changing the policy. It also might be worth doing if you want to get this change in first before you change the policy.

2
  • I think the total number of uncovered lines would not be a good metric, as it wouldn't be possible to add any new uncovered code. Lower bounds (global or per-file) will become useful once we cross some coverage threshold, and that's the reason for the current monotonicity requirement. Specifically for the per-file lower bound, I couldn't find an easily integrated tool that allows this, and actually the issue could happen within a single file. Commented Nov 10, 2019 at 19:08
  • @JonathanGross You might be able to write a script that parses the output of your coverage tool and enforces per-file lower bounds itself. The main thrust of my suggestion though is that you can live with the policy as currently stated by adding more tests to existing code at the same time as you refactor. Commented Nov 10, 2019 at 19:29
0

The problem is: you have one huge function with very little code coverage, so in order to get the percentage up, you need to add covered lines of code. Adding 20 lines with 10 covered improves the percentage more than adding 5 lines with 5 covered, even though the second is most likely better (unless speed was important, and your first function was "if (easy case) { handle easy case quickly } else { handle complex case slowly }" while the second was "handle all cases slowly").

So we conclude: you have a measurable thing where improving the measurement doesn't improve the quality of your code. Which means you should only use this as a target when common sense is applied at the same time. From a book about motivation: the most important thing is to motivate people in such a way that following the motivation produces better results.

Imagine there's a function with a line "int x = 100;". To improve code coverage, I change it to "int x = 0; x++; x++; ..." (repeated 100 times). Instead of one covered line I have 101 covered lines. Total code coverage goes up, and I get a bonus.

1
  • The example I gave is a reduction of the real situation; it would have been the same if I had 10 functions containing 10 lines each, with only 1 line covered in each. Commented Nov 10, 2019 at 10:28
