6

When writing tests, I often struggle with a conceptual issue, and maybe I am missing something major here.

For many classes, writing tests is straightforward and simple; often the tests (and their verification) can even be repeated with any amount of random input data.
However, for a certain type of class/method - and they are the real 'meat' of the program - it seems to be difficult to write any useful test. Take for example a method that returns all possible moves of a knight in a given chess position: I could hard-code a position, call the method, and compare to the hard-coded expected resulting move list. I could then add ten or twenty more such hard-coded examples, and/or add one or two examples each for every special or edge case I can imagine. However, this doesn't feel like a good test - I'd rather generate 1000 positions at random, and check the result for each one - but how? I would have to basically re-code the whole method again in the test, to be able to verify the result!?

Aside from the possibility that having such a complex method might itself be the issue, what would be the approach?

Another example is a recursive method for minimax move evaluation with alpha-beta pruning. Although it has only 20 lines, the complexity of the desired functionality seems to rule out any approach beyond hard-coding some input-output pairs and verifying them - again, very unsatisfying, and likely to miss many complex cases and subtle errors.

How can testing such a complicated method be approached in a practical and effective way?

5
  • 2
    I don't see how "returns all possible moves for a knight" is a complex method. And I don't see how hard-coding input-output pairs is somehow bad?
    – Euphoric
    Commented Oct 13, 2019 at 18:35
  • 1
    That is the core of my question - if it is not bad, then I learned something. I thought that maybe there is a better way and I miss it.
    – Aganju
    Commented Oct 13, 2019 at 19:08
  • Could you yourself actually test that min max method manually if you had to? How would you do it? Some methods ought to be tested on ranges of results. I’m fine proving algorithm X is at least 200ms faster than algorithm Y on Z iterations. I’m not interested in testing the X algorithm itself. This doesn’t apply to your first example though, which deserves a test for every position if you’re into defensive programming. But you’re also fine finding the edge cases and testing those. Commented Oct 13, 2019 at 22:07
  • 3
    "I could hard-code a position, call the method, and compare to the hard-codes expected resulting move list." - sounds perfect. Good work.
    – Ant P
    Commented Oct 14, 2019 at 8:51
  • Be careful. If you make your unit test too complex, you'll have to write a unit test for the unit test. Then it's turtles all the way down. At some point someone has to hardcode something.
    – John Wu
    Commented Oct 15, 2019 at 7:52

4 Answers

19

There is a mantra in the testing community about "Listening to your tests". Your tests tell you things about your code, and you should listen to them.

If you cannot figure out how to write a sensible test, then that could be an indication of a problem not with your tests, but with the code you are testing.

This comes up especially often in the context of "Doctor, it hurts when I try to test complex methods", and the answer is: don't write complex methods. After all, a test is just another client of your code; if your tests are hard to write, that could be an indication that other clients are hard to write, too.

<aside>
This is one of the advantages of Test-Driven Development: since the production code, and thus its interface, doesn't exist yet when you write the tests, you can "dream up" any interface you want to write your tests against. And since programmers tend to be lazy, the interface you dream up will be easy to use: why would you deliberately make it hard to write tests if there are no constraints that force you to?
</aside>

Unfortunately, this is easier said than done.

For complex algorithms, there are some possibilities.

  1. Existing test cases: oftentimes, complex algorithms are based on some research paper, and oftentimes, such a research paper will contain examples. These examples make good test cases.
  2. Test known corner cases, special cases, edge cases, exceptions, and values at and near singularities: many algorithms have corner cases that need to be handled, or they have singularities where they don't work, etc. Pathfinding algorithms that break down when the graph contains cycles, for example. Fibonacci of 0. Factorial of 0. If you know that the algorithm has problems with small numbers, you test it with -1, 0, 1, 0+ε, 0-ε, and so on.
  3. Property-based testing: this is a variation of #2 that was popularized by Haskell's QuickCheck library. You state properties about your algorithm, and the test framework generates randomized inputs and searches for counterexamples. Microsoft Research's Pex is a very interesting example of this, since it performs the analysis statically, so it can not only prove that a property is violated but compute a concrete counterexample without having to generate inputs randomly. (A minimal sketch of this style follows the list.)
  4. Comparison with a known-good implementation: You can compare the output of your implementation of the algorithm for a large number of both carefully selected cases (see #1 and #2) and randomly generated cases (a kind of "fuzz testing" for algorithms) with an implementation of the algorithm you know to be correct. For example, you have a highly optimized implementation in your library and a separate, very slow, very dumb, very simple, very straightforward, line-by-line translation of the idea behind the algorithm that you can read and verify, in your tests. You can even use different languages, e.g. a Python test that reads almost like a natural language description of the algorithm, and your production code is aggressively hand-optimized assembly.
  5. Comparison with a known-good algorithm: This is similar to #4 but instead of comparing against a known-good (ideally simpler) implementation of the same algorithm, you use a different (ideally simpler) algorithm altogether. Going with your example for an implementation of min-max search with alpha-beta pruning, you could e.g. implement a simple brute-force search in the test and verify that they always come to the same solution for the same inputs. (The high-level language trick from #4 can also be used here.)

I would have to basically re-code the whole method again in the test, to be able to verify the result!?

That is essentially my points #4 and #5, but I want to emphasize a crucial point: the two implementations don't need to be the same. They shouldn't be the same!

The idea is that the implementation or even the algorithm that is used in the test is simpler than the one in the production code. Ideally, it should be so simple that it can be verified to be correct.

The second idea is that this test implementation never changes. As you add performance improvements to your production code, the test implementation always stays the same, and thus serves as a regression test to make sure that your optimizations never change the results.

You can even completely change the algorithm in the production code and be sure that you didn't break anything. (If you do this, you can also add the old algorithm you replaced to the tests!)
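
As a sketch of #5 in Python: the names here (alphabeta for the production search, random_position for a test-side generator, and a Position type with children(), is_terminal(), and evaluate()) are all hypothetical stand-ins for your own code.

```python
# Differential test sketch: the optimized production search must
# agree with a dumb, obviously-correct brute-force search.
import random

def minimax_bruteforce(position, depth):
    """Plain negamax: no pruning, no move ordering, no caching.
    Slow, but simple enough to verify by reading."""
    if depth == 0 or position.is_terminal():
        return position.evaluate()
    return max(-minimax_bruteforce(child, depth - 1)
               for child in position.children())

def test_alphabeta_matches_bruteforce():
    rng = random.Random(42)  # fixed seed keeps the test repeatable
    for _ in range(1000):
        position = random_position(rng)
        assert alphabeta(position, depth=3) == \
               minimax_bruteforce(position, depth=3)
```

The brute-force version then stays frozen while the production search is optimized, move-ordered, cached, or rewritten entirely.
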

5
  • 3
    My answer is basically just "Yes, you do all of those things you listed in your question." Commented Oct 13, 2019 at 18:45
  • Especially the point "the implementation that is used in the test is simpler than the one in the production code" is a great tip - it can be slow but straightforward, as the test speed doesn't matter, but the quality of its 'verifiability' does. Thanks!
    – Aganju
    Commented Oct 13, 2019 at 19:06
  • 1
    @Aganju: One thing I forgot, that I will add as #5 shortly: it doesn't even have to be a simpler implementation of the same algorithm, it can be a different, simpler algorithm altogether. I.e. use brute-force search in your test and min-max with pruning in your production code. Commented Oct 13, 2019 at 19:11
  • 1
    Tests force you to use your method. If your method is hard to test it might be because you made it hard to use. Commented Oct 14, 2019 at 0:54
  • @candied_orange: Indeed. I'll try to make that clearer. This is one of the advantages of TDD: since the code, and thus the interface, doesn't exist yet when you write the tests, and since humans are naturally lazy, you will tend to "dream up" simple-to-use interfaces to write your tests against. Commented Oct 14, 2019 at 4:21
3

Jörg's answer is correct: avoiding complex methods is the way to go. But I wanted to add a practical example of how to segregate these (while trying to minimize the change required).

for example a method that returns all possible moves of a knight in a given chess position

When you think about it, what you're really trying to test is much simpler: checking if two given fields are a possible start/end position for a knight move.

The rest of the logic (returning all valid fields on the board that you can move to) is merely a for loop wrapped around the "is valid start/end field" logic.
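
A minimal sketch of that split in Python (empty board, squares as (file, rank) tuples; the names are illustrative):

```python
# The interesting logic lives in a small, easily tested predicate;
# the "all moves" method is just a trivial loop around it.
def is_valid_knight_move(start, end):
    """True if a knight could move from `start` to `end` on an empty board."""
    df, dr = abs(start[0] - end[0]), abs(start[1] - end[1])
    return sorted((df, dr)) == [1, 2]

def all_knight_moves(start):
    return [(f, r)
            for f in range(8) for r in range(8)
            if is_valid_knight_move(start, (f, r))]
```

Your hard-coded test cases now target is_valid_knight_move, which is small enough to test exhaustively if you want to.
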

The second test can arguably be omitted (IMHO) since it's really just testing a trivial for loop over the board, and you shouldn't test trivial code.

Robert Martin, despite being one of the fiercest champions of unit testing and TDD, contributed to the debate with his blog entry "The Pragmatics of TDD." Martin doesn't test-drive everything. Here's his list of scenarios where he doesn't use TDD:

  • Getters and setters
  • Member variables
  • Functions that are obviously trivial
  • GUIs
  • Code written by trial and error
  • Code written by third parties

A simple for loop wrapped around a method (which is unit tested already) fits the definition of "obviously trivial".


I'd rather generate 1000 positions at random, and check the result for each one - but how?

Just a short mention, but don't do random. Random is inherently unpredictable, and is bound to lead to inconsistent behavior. E.g. if you run your test twice with one random field per run, and your code happens to have a bug in the white fields but not the black, both runs will pass despite the bug 1/4 of the time.

The more you test, the smaller the chance of never stumbling on an actual error, but the core issue remains: your unit tests can come up with different results without the codebase ever having changed, which is nonsensical and bound to send you on a quixotic quest to solve a heisenbug (or to mistakenly ignore a real bug).

If anything, random dependencies need to be mocked so they are not random for testing purposes, e.g. you test a dice roll with a MockedDice whose outcome is fixed (e.g. it always rolls 6).
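
A sketch of what that injection can look like (Game, FixedDice, and roll are illustrative names):

```python
# Make randomness injectable so tests are deterministic.
import random

class Game:
    def __init__(self, rng=random):
        self._rng = rng  # production code uses the real `random` module

    def roll(self):
        return self._rng.randint(1, 6)

class FixedDice:
    """Test double that always 'rolls' the same value."""
    def __init__(self, value):
        self._value = value

    def randint(self, lo, hi):
        return self._value

def test_roll_of_six():
    assert Game(rng=FixedDice(6)).roll() == 6
```
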

Just to be clear, you are allowed to randomly pick any arbitrary amount of fields and then hardcode them (i.e. you pick them at random, but the runtime doesn't).
You can reuse the same testing method by abstracting and parametrizing it (e.g. void TestForField(Field field, Field[] expectedResults)), and then use it in all the tests, where you simply call TestForField(fieldA1, new Field[] { fieldB3, fieldC2 }). This is just a simple example to point out that there's nothing wrong with DRY in unit testing - up to the point of simple data changes, not behavioral changes.
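
The same idea in pytest, for example (knight_moves is again a stand-in name; the expected sets are real knight moves from a1 and d4):

```python
# One test body, many hand-picked (input, expected) pairs.
import pytest

@pytest.mark.parametrize("square, expected", [
    ("a1", {"b3", "c2"}),                 # corner: only 2 moves
    ("d4", {"b3", "b5", "c2", "c6",
            "e2", "e6", "f3", "f5"}),     # centre: all 8 moves
])
def test_knight_moves(square, expected):
    assert set(knight_moves(square)) == expected
```
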

However, if you're thinking of writing your tests for all 64 fields, then you need to reconsider (see next paragraph).


I would have to basically re-code the whole method again in the test, to be able to verify the result!?

This is where testing goes off the deep end and becomes a futile exercise in writing the same thing twice and hoping you got it right at least one of the times. This is the outer edge of the usefulness of testing, the point where you realize you've gone too far.
Don't worry, every developer who is diligent about testing ends up here sooner or later.

If you are able to list all fields and their expected outcomes, and are sure of its correctness (since you want to use it as the measure of correctness for your unit tests), then you've proven that the code you're testing is pointless. You can delete the code and instead just read data from the mapping you just created (your test data) without needing to calculate anything. The data is correct and complete, so there's no point to calculating anything on the fly anymore!

If you are not able (or willing) to list everything, that means your code has a purpose (it does what you're unable/unwilling to do manually). And to verify that your code does its job well, you perform some sampling, i.e. testing with a subset of possible values so you have a rudimentary understanding of whether the logic works or not. You do this by testing the basics and any easy errors you can think of. For example (the last two checks are sketched in code after this list):

  • Basic test to see if the logic does return the expected outcomes when the input field is in the middle of the board
  • Test to see if the logic doesn't return "off the board" results when the input field is on the board's edge
  • Test to see if the logic doesn't return any fields where there is already a piece of the same color
  • Test to see if the logic does return any fields where there is already a piece of the opposing color

This catches most errors you're likely to make.
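
As a sketch, the two occupancy checks could look like this, assuming a hypothetical knight_moves(board, square) where the board maps squares to pieces (uppercase white, lowercase black):

```python
# Occupancy checks; `knight_moves(board, square)` is a hypothetical API.
def test_friendly_piece_blocks_target():
    board = {"d4": "N", "b3": "P"}   # white knight, white pawn
    assert "b3" not in knight_moves(board, "d4")

def test_enemy_piece_can_be_captured():
    board = {"d4": "N", "b3": "p"}   # white knight, black pawn
    assert "b3" in knight_moves(board, "d4")
```
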

Does this catch every possible error? No. There are four things missing (that I can think of - there are probably even more):

  • It's not adequately testing for pawns where normal moves and attack moves are fundamentally different
  • It's not testing for en passant movement
  • It's not testing for castling
  • It's not testing for cases where moving the current piece would leave your own king in check (making it impossible to move this piece).

Note: the sheer variety of situations suggests that your complex method needs to be separated into multiple behaviors, which should all be tested separately. But that's not what I want to focus on here.

But you can't be expected to know every possible error and fringe case in advance, nor can you be expected to never forget something. If you were able to do that, you'd never need to write tests to begin with because you'd be writing perfect code!

Unit tests should be refined iteratively. You write the tests that seem valid, and then you rely on them. If you find another bug in the logic, you revisit the tests and wonder why the tests didn't catch it. When you figure that out, you add another test which now catches the thing you forgot to write a test for the first time. And then you rely on your tests until you find another bug, and you repeat the process.

I've written a chess engine, and the latter two unit tests I mentioned are tests I had to add at a later stage, because I hadn't yet thought about them back when I was writing my basic movement logic and its unit tests. It's absolutely normal not to be able to catch everything beforehand and to have to expand your tests later.

Never succumb to the arrogance of thinking you will get it right the first time. The real development skill is in correcting mistakes gracefully, as opposed to the pipe dream of never making mistakes in the first place.

Unit tests are there to make sure you don't have to solve the same thing twice. Unit tests do not prevent you from having to solve an issue once.

3
  • 1
    "Don't do random" is a little simplistic. I'd prefer: "when testing, control your random". Crypto specifications do this all the time. Commented Oct 14, 2019 at 14:24
  • 1
    @candied_orange: In that sense, a controlled random isn't really random anymore :) But yes, there is more than just "fixed single value", you can mock a sequence of alleged-random values, or you could simply enforce a random seed to keep the outcome consistent.
    – Flater
    Commented Oct 14, 2019 at 14:28
  • Wasn't looking to start a chat. Was hoping to inspire an edit. Commented Oct 14, 2019 at 14:50
1

There are problems that are hard to solve. An extreme example is the search for Fermat primes: For about half the potential Fermat primes there is a simple proof that they are not primes, but for the other half there is nothing you can do other than re-do the exact same calculation. Fortunately if you develop a new algorithm, you can compare its results with known results - but if there is disagreement, there is no way to check who is right.

You may be able to prove that a correct solution has certain properties, and verify that your solution has the same properties. This will either give you a slightly higher confidence, or demonstrate your solution is wrong.
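
For instance, integer factorisation is hard to compute but cheap to check; a sketch, with factorize's output checked by an assumed is_prime helper:

```python
# Verifying properties of an answer is often far cheaper than
# computing it. `is_prime` (and the `factorize` under test) are
# assumed names for illustration.
def check_factorization(n, factors):
    product = 1
    for f in factors:
        assert f > 1 and is_prime(f)
        product *= f
    assert product == n  # the factors really multiply back to n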

You may have a randomised algorithm, which means you could perform the actual calculation in many different ways. If your code is correct, it will give the same answer in every case; if you get different answers, at most one can be correct.

You may be able to construct test cases with known solutions, but that has the big problem that your code might only be correct for those cases.

1

It's important to note that "useful" tests are not necessarily "exhaustive" or "complete" tests. It all depends on why you are writing tests in the first place.

It is possible to use (automated) tests for verifying correctness of new code or existing (3rd party) code. However, probably the most important reason for writing tests is to help catch regressions when your code changes, or when assumptions about its environment change. Regression testing was touched upon in the answer by @Jörg W Mittag, but I want to elaborate on what are "useful" regression tests.

I am going to assume this discussion is about unit tests - those involving one component such as a function, class, or module. A component could be a "knight moves calculator" or a "min-max evaluator".

Changes to code may include refactoring, optimization, or porting to a new language. Changes to the environment may be things like using a new compiler or runtime. Any of these changes could introduce bugs or change behavior in unexpected ways.

The first type of unit test is the black-box test, where you test the component from the perspective of its user. Black-box tests should correspond to the specification of the component. For example, for a knight moves calculator specified as:

  1. the method returns an array of all the moves
  2. only valid positions on the board are returned
  3. positions containing pieces of the same color are not returned

you could write three test cases (sketched in code after the list):

  1. choose a position in the middle of an empty board and check that all 8 moves were returned
  2. choose a position on the edge, and check that only the 4 valid positions are returned
  3. add some pieces of the same color, and check those positions are not returned
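
In code, with empty_board and knight_moves(board, square) as hypothetical names, the three cases might look like:

```python
# Black-box tests written against the specification only;
# `empty_board` and `knight_moves` are hypothetical names.
def test_all_eight_moves_from_open_centre():
    assert len(knight_moves(empty_board(), "d4")) == 8

def test_only_valid_positions_from_corner():
    assert knight_moves(empty_board(), "a1") == {"b3", "c2"}

def test_same_color_pieces_are_excluded():
    board = empty_board()
    board["b3"] = "P"  # same-colour pawn as the knight on d4
    assert "b3" not in knight_moves(board, "d4")
```
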

These tests are neither exhaustive (only some positions are used) nor complete (there may be logic in the component they don't exercise), but they are still useful for a number of reasons:

  • The tests serve as sanity checks which ensure the API of the component still makes sense, even if the code changes significantly. If you have to change these tests, you will probably have to change all places that use your component.

  • The tests serve as documentation that is always up-to-date. Users of the component can see what behaviours are guaranteed. Maintainers can see what use cases are the most important and must continue to work.

Another type of unit test is the white-box test, where you test specific cases in the implementation of the component.

This is where "code coverage" is often talked about - we often try to write complete tests that execute every line or logical statement in the code. The idea behind complete coverage is that any logical change to the code (which is visible through the public API) should cause a test to fail. However, complete coverage isn't necessarily exhaustive coverage, because there may be interactions between different branches which are not captured by individually testing each branch.

It is often useful to think about what the code is actually doing to get the correct answer, and what changes could result in the wrong answer. For the knight moves calculator:

  • Suppose the calculator has a template which contains relative move coordinates like (-2, -1), (1, 2), etc. This template is applied to the current position to get a set of new absolute positions. (Both this and the lookup-table variant are sketched in code after this list.)
    • It may be enough to test just one absolute position, plus corner cases, just like in our black-box tests.
  • On the other hand, let's say there is a lookup table from starting position to array of absolute positions.
    • It may be necessary to test multiple positions, to make sure the lookup logic is correct. On the other hand, edges are no longer corner cases that we have to care about.
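
Both implementation styles can be sketched in a few lines (names assumed):

```python
# Template-based: relative offsets applied to the current square.
KNIGHT_TEMPLATE = [(-2, -1), (-2, 1), (-1, -2), (-1, 2),
                   (1, -2), (1, 2), (2, -1), (2, 1)]

def moves_from_template(square):
    f, r = square
    return [(f + df, r + dr) for df, dr in KNIGHT_TEMPLATE
            if 0 <= f + df <= 7 and 0 <= r + dr <= 7]

# Lookup-table variant: precomputed once; afterwards only the lookup
# itself can go wrong, so tests should probe several table entries.
MOVE_TABLE = {(f, r): moves_from_template((f, r))
              for f in range(8) for r in range(8)}

def moves_from_table(square):
    return MOVE_TABLE[square]
```
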

Note that we could assume the template or lookup table is static data that has been verified by some other means and doesn't need to be tested exhaustively. After all, the rules of chess are not going to change. However, if I was re-encoding the data in some new format, I may want to write a one-off exhaustive test comparing the output of the algorithm using the old and new data encoding formats.
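
Such a one-off test could be as blunt as comparing every square (the two function names are assumed):

```python
# Exhaustive one-off check: old and new encodings must agree everywhere.
def test_new_encoding_matches_old():
    for f in range(8):
        for r in range(8):
            assert moves_old_encoding((f, r)) == moves_new_encoding((f, r))
```
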
