129

To test whether an algorithm for some problem is correct, the usual starting point is to run the algorithm by hand on a number of simple test cases -- try it on a few example problem instances, including a few simple "corner cases". This is a great heuristic: it quickly weeds out many incorrect attempts at an algorithm, and helps you understand why the algorithm doesn't work.

However, when learning algorithms, some students are tempted to stop there: if their algorithm works correctly on a handful of examples, including all of the corner cases they can think to try, then they conclude that the algorithm must be correct. There's always a student who asks: "Why do I need to prove my algorithm correct, if I can just try it on a few test cases?"

So, how do you fool the "try a bunch of test cases" heuristic? I'm looking for some good examples to show that this heuristic is not enough. In other words, I am looking for one or more examples of an algorithm that superficially looks like it might be correct, and that outputs the right answer on all of the small inputs that anyone is likely to come up with, but where the algorithm actually doesn't work. Maybe the algorithm just happens to work correctly on all small inputs and only fails for large inputs, or only fails for inputs with an unusual pattern.

Specifically, I am looking for:

  1. An algorithm. The flaw has to be at the algorithmic level. I am not looking for implementation bugs. (For instance, at a bare minimum, the example should be language-agnostic, and the flaw should relate to algorithmic concerns rather than software engineering or implementation issues.)

  2. An algorithm that someone might plausibly come up with. The pseudocode should look at least plausibly correct (e.g., code that is obfuscated or obviously dubious is not a good example). Bonus points if it is an algorithm that some student actually came up with when trying to solve a homework or exam problem.

  3. An algorithm that would pass a reasonable manual test strategy with high probability. Someone who tries a few small test cases by hand should be unlikely to discover the flaw. For instance, "simulate QuickCheck by hand on a dozen small test cases" should be unlikely to reveal that the algorithm is incorrect.

  4. Preferably, a deterministic algorithm. I've seen many students think that "try some test cases by hand" is a reasonable way to check whether a deterministic algorithm is correct, but I suspect most students would not assume that trying a few test cases is a good way to verify probabilistic algorithms. For probabilistic algorithms, there's often no way to tell whether any particular output is correct; and you can't hand-crank enough examples to do any useful statistical test on the output distribution. So, I'd prefer to focus on deterministic algorithms, as they get more cleanly to the heart of student misconceptions.

I'd like to teach the importance of proving your algorithm correct, and I'm hoping to use a few examples like this to help motivate proofs of correctness. I would prefer examples that are relatively simple and accessible to undergraduates; examples that require heavy machinery or a ton of mathematical/algorithmic background are less useful. Also, I don't want algorithms that are "unnatural"; while it might be easy to construct some weird artificial algorithm to fool the heuristic, if it looks highly unnatural or has an obvious backdoor constructed just to fool this heuristic, it probably won't be convincing to students. Any good examples?

  • 3
    I love your question; it's also related to a very interesting question I saw on Mathematics the other day about disproving conjectures with large constants. You can find it here. Commented Aug 28, 2014 at 20:56
  • 1
    Some more digging and I found those two geometric algorithms. Commented Aug 28, 2014 at 21:23
  • @ZeroUltimax You are right, the center point of any 3 non-collinear points is not guaranteed to be inside. The quick remedy is to get a point on the line between the furthest left and furthest right. Is there a problem elsewhere?
    – InformedA
    Commented Aug 29, 2014 at 7:58
  • The premise of this question seems odd to me in a way I am having difficulty getting my head around, but I think what it comes down to is that the process for algorithm design as described is fundamentally broken. Even for students that don't 'stop there' it is doomed. 1> write algorithm, 2> think of/run test cases, 3a> stop or 3b> prove correct. The first step pretty much has to be identifying the input classes for the problem domain. Corner cases and the algorithm itself arise from those. (cont)
    – Mr.Mindor
    Commented Aug 30, 2014 at 6:11
  • 1
    How do you formally distinguish an implementation bug from a flawed algorithm? I was interested by your question, but at the same time I was bothered by the fact that the situation you describe seems to be more the rule than the exception. Many people do test what they implement, but they usually still have bugs. The second example of the most upvoted answer is precisely such a bug.
    – babou
    Commented Sep 5, 2014 at 6:27

15 Answers

95

A common error, I think, is to use a greedy algorithm, which is not always the correct approach but may well work on most test cases.

Example: given coin denominations $d_1,\dots,d_k$ and a number $n$, express $n$ as a sum of $d_i$'s with as few coins as possible.

A naive approach is to use the largest possible coin first, and greedily produce such a sum.

For instance, with coins of value $6$, $5$ and $1$, the greedy algorithm gives correct answers for all numbers between $1$ and $14$ except for $10$: greedy produces $10 = 6+1+1+1+1$ (five coins), while the optimum is $5+5$ (two coins).
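If you want students to discover this themselves, here is a minimal sketch in Python (the function names are mine) comparing the greedy choice against a dynamic-programming optimum:

def greedy_coins(n, denoms):
    """Greedy: repeatedly take the largest coin that still fits."""
    count = 0
    for d in sorted(denoms, reverse=True):
        count += n // d
        n %= d
    return count if n == 0 else None  # None: greedy got stuck

def optimal_coins(n, denoms):
    """Dynamic programming: fewest coins summing to each value 0..n."""
    INF = float("inf")
    best = [0] + [INF] * n
    for value in range(1, n + 1):
        for d in denoms:
            if d <= value and best[value - d] + 1 < best[value]:
                best[value] = best[value - d] + 1
    return best[n] if best[n] < INF else None

# Greedy agrees with the optimum for 1..9 and 11..14, masking the bug at 10.
for n in range(1, 15):
    g, o = greedy_coins(n, [6, 5, 1]), optimal_coins(n, [6, 5, 1])
    if g != o:
        print(f"n={n}: greedy uses {g} coins, optimum is {o}")  # n=10: 5 vs 2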

  • 19
    This is indeed a good example, in particular one that students routinely get wrong. Not only do you need to pick particular coin sets, but also particular values, in order to see the algorithm fail.
    – Raphael
    Commented Sep 1, 2014 at 12:31
  • 4
    In addition, let me say that students will also often have wrong proofs in this example (sporting some naive arguments that fail on closer examination), so more than one lesson can be learned here.
    – Raphael
    Commented Sep 3, 2014 at 11:12
  • 3
    The old-style British coin system (prior to the 1971 decimalization) had a real example of this. A greedy algorithm for counting out four shillings would use a half-crown (2½ shillings), a one-shilling coin, and a sixpence (½ shilling). But the optimal solution uses two florins (2 shillings each). Commented Jan 26, 2016 at 3:23
  • 2
    Indeed, in a lot of cases greedy algorithms seem reasonable but don't work -- another example is maximum bipartite matching. On the other hand, there are also examples where it seems like a greedy algorithm shouldn't work, but it does: maximum spanning tree.
    – jkff
    Commented Apr 19, 2019 at 3:36
76

I immediately recalled an example from R. Backhouse (this might have been in one of his books). Apparently, he had set a programming assignment in which the students had to write a Pascal program to test the equality of two strings. One of the programs turned in by a student was the following:

issame := (string1.length = string2.length);

if issame then
  for i := 1 to string1.length do
    issame := string1.char[i] = string2.char[i];

write(issame);

We can now test the program with the following inputs:

"university" "university" $\Rightarrow$ True; OK

"course" "course" $\Rightarrow$ True; OK

"" "" $\Rightarrow$ True; OK

"university" "course" $\Rightarrow$ False; OK

"lecture" "course" $\Rightarrow$ False; OK

"precision" "exactness" $\Rightarrow$ False, OK

All of this seems very promising: maybe the program does indeed work. But more careful testing with, say, "pure" and "true" reveals faulty output. In fact, the program says "True" whenever the two strings have the same length and the same last character!

However, testing had been pretty thorough: we had strings of different lengths, strings of equal length but different content, and even equal strings. Furthermore, the student had even tested and executed every branch. You can't really argue that testing was careless here -- given that the program is indeed very simple, it might be hard to find the motivation and energy to test it thoroughly enough.


Another cute example is binary search. In TAOCP, Knuth says that "although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky". Apparently, a bug in Java's standard-library binary search went unnoticed for a decade. It was an integer overflow bug, which only manifested with large enough input. Tricky details of binary search implementations are also covered by Bentley in the book Programming Pearls.
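The bug itself is easy to state: computing the midpoint as (low + high) / 2 overflows a 32-bit int once low + high exceeds 2^31 - 1, even though both indices are valid on their own. A small sketch that models Java's wrap-around arithmetic in Python (Python's own integers are unbounded, so the wrap has to be simulated explicitly):

def mid_32bit(low, high):
    """Simulate (low + high) / 2 in Java's 32-bit signed int arithmetic."""
    s = (low + high) & 0xFFFFFFFF
    if s >= 2**31:        # the sum wrapped around to a negative int
        s -= 2**32
    return s // 2

low, high = 2_000_000_000, 2_100_000_000   # both are valid array indices
print(mid_32bit(low, high))                # -97483648: a negative "midpoint"
print(low + (high - low) // 2)             # 2050000000: the standard fix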

Bottom line: it can be surprisingly hard to be certain a binary search algorithm is correct by just testing it.

  • 11
    Of course, the flaw is quite apparent from the source (if you have yourself written a similar thing before).
    – Raphael
    Commented Aug 29, 2014 at 10:54
  • 5
    Even if the simple flaw in the example program is corrected, strings give quite a few interesting problems! String reversal is a classic -- the "basic" way of doing it is by simply reversing the bytes. Then encoding comes into play. Then surrogates (usually twice). The problem is, of course, that there's no easy way of formally proving your method is correct.
    – Ordous
    Commented Aug 29, 2014 at 16:02
  • 7
    Maybe I am completely misinterpreting the question, but this seems to be a flaw in the implementation rather than a flaw in the algorithm itself.
    – Mr.Mindor
    Commented Aug 30, 2014 at 6:04
  • 9
    @Mr.Mindor: how can you tell whether the programmer has written down a correct algorithm and then implemented it incorrectly, or written down an incorrect algorithm and then implemented it faithfully (I hesitate to say "correctly"!)? Commented Sep 1, 2014 at 11:41
  • 2
    @wabbit That's debatable. What's obvious to you might not be obvious to a first-year student.
    – Juho
    Commented Apr 14, 2019 at 9:40
38

The best example I ever came across is primality testing:

input: natural number p, p != 2
output: is p a prime or not?
algorithm: compute 2**(p-1) mod p. If result = 1 then p is prime else p is not.

This works for (almost) every number, except for a very few counterexamples, and one practically needs a machine to find a counterexample in a realistic period of time. The first counterexample is 341, and the density of counterexamples decreases as p grows, though only slowly.

Instead of using only 2 as the base of the power, one may improve the algorithm by also using additional, increasingly large small primes as bases whenever the previous base returned 1. Even so, there are counterexamples to this scheme, namely the Carmichael numbers, though they are pretty rare.
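For illustration, here is a sketch of the base-2 test in Python (the function name is mine; three-argument pow is modular exponentiation):

def fermat_is_prime(p, base=2):
    """Fermat test: declare p prime iff base**(p-1) ≡ 1 (mod p)."""
    return pow(base, p - 1, p) == 1

# Agrees with true primality for every p from 3 to 340, so any handful of
# hand-picked tests will pass -- and then:
print(fermat_is_prime(341))   # True, but 341 = 11 * 31
print(fermat_is_prime(561))   # True for every base coprime to 561 (Carmichael)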

  • Fermat's primality test is a probabilistic test, so your post-condition isn't correct.
    – Femaref
    Commented Aug 29, 2014 at 15:15
  • 8
    Of course it is a probabilistic test, but the answer nicely shows (more generally) how probabilistic algorithms mistaken for exact ones can be a source of error. More on Carmichael numbers.
    – vzn
    Commented Aug 29, 2014 at 15:17
  • 3
    That's a nice example, with a limitation: for the practical use of primality testing that I'm familiar with, namely asymmetric cryptographic key generation, we use probabilistic algorithms! The numbers are too large for exact tests (if they weren't, then they wouldn't be suitable for crypto, because the keys could be found by brute force in realistic time). Commented Aug 29, 2014 at 15:57
  • 3
    The limitation you refer to is practical, not theoretical, and primality tests in crypto systems, e.g. RSA, are subject to rare/highly improbable failures for exactly these reasons, again underlining the significance of the example; i.e., in practice this limitation is sometimes accepted as unavoidable. There are P-time algorithms for primality testing, e.g. AKS, but they take too long for the "smaller" numbers used in practice.
    – vzn
    Commented Aug 30, 2014 at 16:13
  • If you test not just with $2^{p-1}$ but with $a^{p-1}$ for 50 different random values $2 \le a < p$, then most people will know it is probabilistic, but with failures so unlikely that it is more likely that a malfunction in your computer produces the wrong answer. With bases 2, 3, 5 and 7, failures are already very rare.
    – gnasher729
    Commented Apr 22, 2019 at 11:00
27

Here's one that was thrown at me by Google reps at a convention I went to. It was coded in C, but it works in other languages that use references. Sorry for having to post code on [cs.se], but it's the only way to illustrate it.

swap(int& X, int& Y){
    X := X ^ Y
    Y := X ^ Y
    X := X ^ Y
}

This algorithm will work for any values given to x and y, even if they hold the same value. However, it will not work if it's called as swap(x,x), that is, with both arguments referring to the same variable. In that situation, x ends up as 0. Now, this might not satisfy you, since you can somehow prove this operation to be correct mathematically, yet still forget about this edge case.
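The aliasing failure is easy to demonstrate. Python has no reference parameters, so the sketch below (function name mine) models the two references as two indices into the same list; i == j plays the role of swap(x,x):

def xor_swap(a, i, j):
    """XOR-swap a[i] and a[j] in place; breaks when i == j (aliasing)."""
    a[i] ^= a[j]
    a[j] ^= a[i]
    a[i] ^= a[j]

v = [3, 7]
xor_swap(v, 0, 1); print(v)   # [7, 3] -- fine
xor_swap(v, 0, 0); print(v)   # [0, 3] -- the aliased slot is zeroed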

  • 3
    That trick was used in the underhanded C contest to produce a flawed RC4 implementation. Reading that article again, I just noticed that this hack was probably submitted by @D.W. Commented Aug 29, 2014 at 9:47
  • 11
    This flaw is indeed subtle -- but it is language-specific, so it's not really a flaw in the algorithm; it's a flaw in the implementation. One could come up with other examples of language oddities that make it easy to conceal subtle flaws, but that wasn't really what I was looking for (I was looking for something at the level of abstraction of algorithms). In any case, this flaw isn't an ideal demonstration of the value of proof; unless you're already thinking about aliasing, you might end up overlooking the same issue when you write out your "proof" of correctness.
    – D.W.
    Commented Aug 30, 2014 at 1:26
  • That's why I'm surprised this got voted so high. Commented Aug 30, 2014 at 21:04
  • 5
    @D.W. That's a matter of what model you define the algorithm in. If you go down to a level where memory references are explicit (rather than the common model that assumes the absence of sharing), this is an algorithm flaw. The flaw isn't really language-specific; it turns up in any language that supports sharing of memory references. Commented Sep 18, 2014 at 18:51
21

There is a whole class of algorithms that is inherently hard to test: pseudo-random number generators. You cannot test a single output; you have to investigate (many) series of outputs by statistical means. Depending on what and how you test, you may well miss non-random characteristics.

One famous case where things went horribly wrong is RANDU. It passed the scrutiny available at the time, which failed to consider the behaviour of tuples of subsequent outputs. Already triples show lots of structure: plotted as points in the unit cube, all triples of consecutive RANDU outputs fall on just 15 parallel planes.

Basically, the tests did not cover all use cases: while one-dimensional use of RANDU was (probably mostly) fine, it did not support using it to sample three-dimensional points (in this way).
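You can demonstrate this without plotting anything: Marsaglia observed that every RANDU output satisfies an exact linear relation with its two predecessors, which is why consecutive triples fall on so few planes. A sketch, with the generator written out from the published recurrence x_{k+1} = 65539·x_k mod 2^31:

def randu(seed=1):
    """RANDU's recurrence; users divided x by 2**31 to get floats in [0, 1)."""
    x = seed
    while True:
        x = (65539 * x) % 2**31
        yield x

gen = randu()
xs = [next(gen) for _ in range(30000)]

# x_{k+2} - 6*x_{k+1} + 9*x_k ≡ 0 (mod 2**31) holds for *every* triple,
# which no test of single outputs or of pairs will ever notice.
violations = sum((xs[k+2] - 6*xs[k+1] + 9*xs[k]) % 2**31 != 0
                 for k in range(len(xs) - 2))
print(violations)   # 0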

Proper pseudo-random sampling is a tricky business. Luckily, there are powerful test suites these days, e.g. dieharder, which specialise in throwing all the statistics we know at a proposed generator. Is that enough?

To be fair, I have no idea what you can feasibly prove for PRNGs.

  • 2
    Nice example; however, in general there is no way to prove that a PRNG has no flaw, there is only an infinite hierarchy of weaker vs. stronger tests. Actually proving one is "random" in any strict sense is presumably undecidable (haven't seen that proven, though).
    – vzn
    Commented Aug 29, 2014 at 15:20
  • 2
    That's a good idea of something that's hard to test, but RNGs are also hard to prove. PRNGs are not so much prone to implementation bugs as to being badly specified. Tests like diehard are good for some uses, but for crypto you can pass diehard and still be laughed out of the room. There is no "proven secure" CSPRNG; the best you can hope for is to prove that if your CSPRNG is broken then so is AES. Commented Aug 29, 2014 at 15:55
  • 1
    @Gilles I was not trying to go into crypto, only statistical randomness (I think the two have pretty much orthogonal requirements). Should I make that clear in the answer?
    – Raphael
    Commented Aug 29, 2014 at 16:09
  • 2
    Crypto randomness implies statistical randomness. Neither has a mathematically formal definition, though, as far as I know, apart from the ideal (and contradictory with the concept of a PRNG implemented on a deterministic Turing machine) notion of information-theoretic randomness. Does statistical randomness have a formal definition beyond "must be independent from the distributions we'll test it against"? Commented Aug 29, 2014 at 16:14
  • 2
    @vzn: what it means to be a random sequence of numbers can be defined in many possible ways, but a simple one is "large Kolmogorov complexity". In that case, it is easy to show that determining randomness is undecidable.
    – cody
    Commented Sep 2, 2014 at 20:47
17

2D local maximum

input: 2-dimensional $n \times n$ array $A$

output: a local maximum -- a pair $(i,j)$ such that $A[i,j]$ has no neighboring cell in the array that contains a strictly larger value.

(The neighboring cells are those among $A[i, j+1], A[i, j-1], A[i-1, j], A[i+1, j]$ that are present in the array.) So, for example, if $A$ is

$$\begin{array}{cccc} 0&1&3&\mathbf{4}\\ \mathbf{3}&2&\mathbf{3}&1\\ 2&\mathbf{5}&0&1\\ \mathbf{4}&0&1&\mathbf{3}\end{array}$$

then each bolded cell is a local maximum. Every non-empty array has at least one local maximum.

Algorithm. There is an $O(n^2)$-time algorithm: just check each cell. Here's an idea for a faster, recursive algorithm.

Given $A$, define cross $X$ to consist of the cells in the middle column, and the cells in the middle row. First check each cell in $X$ to see if the cell is a local maximum in $A$. If so, return such a cell. Otherwise, let $(i, j)$ be a cell in $X$ with maximum value. Since $(i, j)$ is not a local maximum, it must have a neighboring cell $(i', j')$ with larger value.

Partition $A \setminus X$ (the array $A$, minus the cells in $X$) into four quadrants -- the upper left, upper right, lower left, and lower right quadrants -- in the natural way. The neighboring cell $(i', j')$ with larger value must be in one of those quadrants. Call that quadrant $A'$.

Lemma. Quadrant $A'$ contains a local maximum of $A$.

Proof. Consider starting at the cell $(i', j')$. If it is not a local maximum, move to a neighbor with a larger value. This can be repeated until arriving at a cell that is a local maximum. That final cell has to be in $A'$, because $A'$ is bounded on all sides by cells whose values are smaller than the value of cell $(i', j')$. This proves the lemma. $\diamond$

The algorithm calls itself recursively on the $\frac{n}{2}\times\frac{n}{2}$ sub-array $A'$ to find a local maximum $(i, j)$ there, then returns that cell.

The running time $T(n)$ for an $n\times n$ matrix satisfies $T(n) = T(n/2) + O(n)$, so $T(n) = O(n)$.

Thus, we have proven the following theorem:

Theorem. There is an $O(n)$-time algorithm for finding a local-maximum in an $n\times n$ array.

Or have we?

  • 1
    I'm still absorbing this. A quick question until I do: did you mean the running time is $T(n) = O(n \log n)$? (Since that is the solution to the recurrence $T(n) = T(n/2) + O(n)$.)
    – D.W.
    Commented Mar 31, 2018 at 15:11
  • 2
    This is a beautiful example! I love it. Thank you. (I finally figured out the flaw in this algorithm. From the timestamps you can get a lower bound on how long it took me. I'm too embarrassed to reveal the actual time. :-)
    – D.W.
    Commented Mar 31, 2018 at 15:46
  • 3
    A nice part is that this algorithm is almost correct: if we pick a 'window' instead of a 'cross', we can recurse properly and get an $O(n)$ algorithm. See also courses.csail.mit.edu/6.006/spring11/lectures/lec02.pdf.
    – Discrete lizard
    Commented Mar 14, 2019 at 14:25
12

The Fisher-Yates-Knuth shuffling algorithm is a (practical) example, and one that one of the authors of this site has commented on.

The algorithm generates a random permutation of a given array as:

 // To shuffle an array a of n elements (indices 0..n-1):
  for i from n − 1 downto 1 do
       j ← random integer with 0 ≤ j ≤ i
       exchange a[j] and a[i]

One sees that in the loop the elements are swapped between positions $i$ and $j$, where $0 \le j \le i$. This produces unbiased sampling of the permutations (no permutation is over-represented, none under-represented).

A "naive" algorithm could be:

 // To shuffle an array a of n elements (indices 0..n-1):
  for i from n − 1 downto 1 do
       j ← random integer with 0 ≤ j ≤ n-1
       exchange a[j] and a[i]

where, in the loop, the element to be swapped is chosen from among all available elements. However, this produces biased sampling of the permutations (some are over-represented, etc.).

Actually, one can come up with the Fisher-Yates-Knuth shuffle using a simple (or naive) counting analysis.

The number of permutations of $n$ elements is $n! = n \times (n-1) \times (n-2) \times \cdots \times 1$: the 1st element can be placed in any of $n$ positions, the 2nd element in the remaining $n-1$ positions, and so on. This is exactly the set of choices the Fisher-Yates shuffle makes, and it is why it produces unbiased (uniformly random) permutations, unlike the "naive" algorithm.

The main problem with verifying whether a shuffling algorithm is correct or not (biased or not) is that, due to the statistics involved, a large number of samples is needed. The codinghorror article I link above explains exactly that (with actual tests).
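Here is a sketch of the two loops side by side, with a brute-force histogram over a 3-element array (function names mine). The bias of the naive version is plainly visible, but only because a machine can afford hundreds of thousands of samples, which nobody will hand-crank:

import random
from collections import Counter

def fisher_yates(a):
    """j is drawn from 0..i: exactly n! equally likely outcomes."""
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, i)
        a[j], a[i] = a[i], a[j]

def naive_shuffle(a):
    """j is drawn from 0..n-1: for n = 3 this gives 9 equally likely
    branches, which cannot map uniformly onto 6 permutations."""
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, len(a) - 1)
        a[j], a[i] = a[i], a[j]

def histogram(shuffle, trials=600_000):
    counts = Counter()
    for _ in range(trials):
        a = [0, 1, 2]
        shuffle(a)
        counts[tuple(a)] += 1
    return counts

print(histogram(fisher_yates))    # all six counts close to 100,000
print(histogram(naive_shuffle))   # clearly skewed counts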

  • 1
    See here for an example correctness proof for a shuffle algorithm.
    – Raphael
    Commented Sep 18, 2014 at 13:54
11

These are all primality examples, because such flaws are common.

(1) Primality in SymPy. Issue 1789. There was an incorrect test on a well-known web site that didn't fail until after 10^14. While the fix was correct, it was just patching holes rather than rethinking the issue.

(2) Primality in Perl 6. Perl 6 added is-prime, which uses a number of M-R tests with fixed bases. There are known counterexamples, but they're quite large, since the default number of tests is huge (basically hiding the real problem by degrading performance). This will be addressed soon.

(3) Primality in FLINT. n_isprime() returning true for composites, since fixed. Basically the same issue as SymPy. Using the Feitsma/Galway database of SPRP-2 pseudoprimes to 2^64, we can now test these.

(4) Perl's Math::Primality. is_aks_prime broken. This seems similar to lots of AKS implementations -- lots of code that either worked by accident (e.g. got lost in step 1 and ended up doing the entire thing by trial division) or didn't work for larger examples. Unfortunately AKS is so slow that it is difficult to test.

(5) Pari's pre-2.2 is_prime. Math::Pari ticket. It used 10 random bases for M-R tests (with a fixed seed on startup, rather than GMP's fixed seed on every call). It will tell you 9 is prime about 1 out of every 1M calls. If you pick the right number you can get it to fail relatively often, but the numbers become sparser, so it doesn't show up much in practice. They have since changed the algorithm and API.

This last kind isn't wrong, but it raises the classic question for probabilistic tests: how many rounds do you give, say, mpz_probab_prime_p? If we give it 5 rounds, it sure looks like it works well -- numbers have to pass a base-210 Fermat test and then Miller-Rabin tests with 5 pre-selected bases. You won't find a counterexample until 3892757297131 (with GMP 5.0.1 or 6.0.0a), so you'd have to do a lot of testing to find it. But there are thousands of counterexamples under 2^64. So you keep raising the number of rounds. How far? Is there an adversary? How important is a correct answer? Are you confusing random bases with fixed bases? Do you know what input sizes you'll be given?
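To make the fixed-base trap concrete, here is a sketch of Miller-Rabin with the bases 2, 3, 5 and 7 (my own toy code, not any of the implementations above). It answers correctly for every number below 3,215,031,751 (a published bound, due to Jaeschke), so it will survive essentially any hand-run test:

def miller_rabin(n, bases=(2, 3, 5, 7)):
    """Miller-Rabin with a fixed set of bases -- looks deterministic."""
    if n < 2:
        return False
    if n in bases:
        return True
    if n % 2 == 0:
        return False
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False   # a witnesses that n is composite
    return True            # passed every base

n = 3215031751             # = 151 * 751 * 28351, strong pseudoprime to 2,3,5,7
print(miller_rabin(n))     # True, but n is composite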

There is a related point: what is a big number? To students, it seems many think 10,000 is a huge number. To many programmers, $10^{16}$ is a big number. To programmers working on cryptography, these are small, and big is, say, 4096 bits. To programmers working in computational number theory, these are all small, and big might be 10 to 100 thousand decimal digits. To some mathematicians all of these may be considered "not big", considering that there are many more positive numbers larger than these examples than there are smaller ones. This is something a lot of people don't think about, but it makes a difference when thinking about correctness and performance.

These are quite difficult to test correctly. My strategy includes obvious unit tests, plus edge cases, plus examples of failures seen before or in other packages, test vs. known databases where possible (e.g. if you do a single base-2 M-R test, then you've reduced the computationally infeasible task of testing 2^64 numbers to testing about 32 million numbers), and finally, lots of randomized tests using another package as a standard. The last point works for functions like primality where there is a fairly simple input and a known output, but quite a few tasks are like this. I have used this to find defects in both my own development code as well as occasional problems in the comparison packages. But given the infinite input space, we can't test everything.

As for proving correctness, here is another primality example. The BLS75 methods and ECPP have the concept of a primality certificate. Basically, after they churn away doing searches to find values that work for their proofs, they can output them in a known format. One can then write a verifier, or have someone else write it. These run very fast compared to the creation, and now either (1) both pieces of code are incorrect (which is why you'd prefer other programmers for the verifiers), or (2) the math behind the proof idea is wrong. #2 is always possible, but these have typically been published and reviewed by multiple people (and in some cases they are easy enough for you to walk through yourself).

In comparison, methods like AKS, APR-CL, trial division, or the deterministic Rabin test all produce no output other than "prime" or "composite". In the latter case we may have a factor and hence can verify, but in the former case we're left with nothing other than this one bit of output. Did the program work correctly? Dunno.

It's important to test the software on more than just a few toy examples, and also to go through some examples at each step of the algorithm, asking "given this input, does it make sense that I am here with this state?"

  • 1
    Many of these look like either (1) implementation errors (the underlying algorithm is correct but it wasn't implemented correctly), which are interesting but not the point of this question, or (2) a deliberate, aware choice to select something that is fast and mostly works but might fail with very small probability (for code that tests with one random base or a few fixed/random bases, I would hope that whoever chose to do that knew they were making a performance tradeoff).
    – D.W.
    Commented Aug 31, 2014 at 19:36
  • You are right on the first point -- correct algorithm + bug isn't the point, although the discussion and other examples are conflating them as well. The field is ripe with conjectures that work for small numbers but are incorrect. For point (2) that is true for some, but my examples #1 and #3 were not this case -- it was believed that the algorithm was correct (these 5 bases give proven results for numbers under 10^16), then later discovered that it was not.
    – DanaJ
    Commented Aug 31, 2014 at 19:58
  • Isn't this a fundamental issue with pseudo-primality tests?
    – asmeurer
    Commented Sep 2, 2014 at 4:25
  • asmeurer, yes, in my #2 and the later discussion of them. But #1 and #3 were both cases of using Miller-Rabin with known bases to give deterministically correct results below a threshold. So in this case the "algorithm" (using the term loosely to match the OP) was incorrect. #4 isn't a probable-prime test, but as D.W. pointed out, the algorithm works fine; it's just the implementation that is difficult. I included it because it leads to a similar situation: testing is needed, and how far do you go beyond simple examples before you say it works?
    – DanaJ
    Commented Sep 2, 2014 at 14:55
  • Some of your post seems to fit the question while some does not (cf. @D.W.'s comment). Please remove the examples (and other content) that do not answer the question.
    – Raphael
    Commented Sep 18, 2014 at 14:05
8

The best example (read: the thing I am most butt-hurt about) I have ever seen has to do with the Collatz conjecture. I was in a programming competition (with a $500 prize on the line for first place) in which one of the problems was to find the minimum number of steps it takes for two numbers to reach the same number. The solution, of course, is to alternately step each one until they both reach something that has been seen before. We were given a range of numbers (I think it was between 1 and 1000000) and told that the Collatz conjecture had been verified up to 2^64, so all of the numbers we were given would eventually converge to 1. However, I used 32-bit integers to do the steps with. It turns out that there is one obscure number between 1 and 1000000 (170 thousand something) that will cause a 32-bit integer to overflow in due time. In fact, these numbers are extremely rare below 2^31. We tested our system on HUGE numbers far greater than 1000000 to "ensure" that overflow was not occurring. It turns out a much smaller number that we just didn't test caused the overflow. Because I used "int" instead of "long", I only got a $300 prize rather than the $500 prize.
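In a language with unbounded integers, like Python, it is a quick experiment to locate the trap (a sketch; which value overflows first depends on whether the 32-bit int is signed). The point is that what overflows is not the starting value but the peak of the trajectory, so testing huge starting values proves nothing:

def collatz_peak(n):
    """Largest value reached on the Collatz trajectory of n."""
    peak = n
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        peak = max(peak, n)
    return peak

# First starting value whose trajectory would wrap a signed 32-bit int:
first = next(n for n in range(1, 200_000)
             if collatz_peak(n) > 2**31 - 1)
print(first)   # 113383: an unremarkable number whose trajectory tops 2**31 - 1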

6

The 0/1 knapsack problem is one that almost all students think is solvable by a greedy algorithm. That happens even more often if you have previously shown some greedy solutions, such as the fractional version of the knapsack problem, where a greedy algorithm does work.

For these problems, in class, I show the correctness proof for 0/1 knapsack (dynamic programming) to remove any doubt, and the proof for the greedy fractional version too. Actually, neither proof is trivial, and students will probably find them very helpful. In addition, there is a discussion of this in CLRS 3rd edition, Chapter 16, pages 425-427.

Problem: a thief robbing a store can carry a maximal weight of $W$ in their knapsack. There are $n$ items; the $i$-th item weighs $w_i$ and is worth $v_i$ dollars. Which items should the thief take to maximize the gain?

0/1 knapsack problem: the setup is the same, but the items may not be broken into smaller pieces, so the thief may decide either to take an item or to leave it (a binary choice), but may not take a fraction of an item.

And you will get from students ideas and algorithms that follow the same idea as the greedy version, such as (a sketch contrasting the first one with the optimum follows below):

  • Take the total capacity of the bag, put in the most valuable object that still fits, and iterate until nothing more fits (either because the bag is full or because no remaining object is light enough).
  • Another wrong way of thinking: take the lighter items first, ordered from highest to lowest price.
  • ...

Is this helpful for you? Actually, we know the coin problem is a version of the knapsack problem. But there are more examples in the forest of knapsack problems: for example, what about 2D knapsack (genuinely useful when you want to cut wood to make furniture, as I saw in a shop in my city)? It is very common to think that greedy works there, too, but it does not.
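Here is the promised sketch (function names mine): the first wrong idea above next to a brute-force optimum, on an instance small enough to check by hand:

from itertools import combinations

def greedy_by_value(items, W):
    """Repeatedly take the most valuable item that still fits."""
    total_v = total_w = 0
    for v, w in sorted(items, reverse=True):
        if total_w + w <= W:
            total_v += v
            total_w += w
    return total_v

def optimum(items, W):
    """Brute force over all subsets; fine for tiny instances."""
    return max(sum(v for v, w in sub)
               for r in range(len(items) + 1)
               for sub in combinations(items, r)
               if sum(w for v, w in sub) <= W)

items = [(10, 10), (9, 5), (9, 5)]   # (value, weight) pairs
print(greedy_by_value(items, 10))    # 10: the most valuable item fills the bag
print(optimum(items, 10))            # 18: the two lighter items together win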

  • Greedy was already covered in the accepted answer, but the knapsack problem in particular is well-suited to set some traps.
    – Raphael
    Commented Sep 22, 2014 at 17:53
4

A common mistake is to implement shuffling algorithms incorrectly. See the discussion on Wikipedia.

The trouble is that the bias is usually not easy to detect; one needs to prove that the algorithm makes exactly $n!$ equally likely "choices", and not $n^n$ or $(n-1)^n$, which are common for wrong implementations.

  • 1
    It's a good bug, but not a good illustration of fooling the test-cases heuristic, as testing doesn't really apply to shuffling algorithms (they're randomized, so how would you test one? What would it mean to fail a test case, and how would you detect that from looking at the output?)
    – D.W.
    Commented Aug 31, 2014 at 17:26
  • You test it statistically, of course. Uniform randomness is far from "anything can happen in the output". Would you not be suspicious if a program said to emulate a die gave you 100 3's in a row? Commented Aug 31, 2014 at 17:49
  • Again, I'm talking about the student heuristic of "try some test cases by hand". I've seen many students think that this is a reasonable way to check whether a deterministic algorithm is correct, but I suspect they would not assume it's a good way to test whether a shuffling algorithm is correct (since a shuffling algorithm is randomized, there's no way to tell whether any particular output is correct; in any case, you can't hand-crank enough examples by hand to do any useful statistical test). So I don't expect shuffling algorithms will help much to clear up the common misconception.
    – D.W.
    Commented Aug 31, 2014 at 19:39
  • 1
    @PerAlexandersson: Even if you generate only one shuffle, it can't be truly uniform using MT with n > 2080. The deviation from the expected distribution will be very small, so you probably won't care... but this applies even if you generate far fewer shuffles than the period (as asmeurer points out above).
    – Charles
    Commented Sep 2, 2014 at 15:22
  • 2
    This answer seems to have been made obsolete by Nikos M.'s more elaborate one?
    – Raphael
    Commented Sep 18, 2014 at 13:56
4

Python's PEP 450, which introduced statistics functions into the standard library, might be of interest. As part of the justification for having a function that calculates the variance in Python's standard library, its author Steven D'Aprano writes:

def variance(data):
    # Use the Computational Formula for Variance.
    n = len(data)
    ss = sum(x**2 for x in data) - (sum(data)**2)/n
    return ss/(n-1)

The above appears to be correct with a casual test:

>>> data = [1, 2, 4, 5, 8]
>>> variance(data)
  7.5

But adding a constant to every data point should not change the variance:

>>> data = [x+1e12 for x in data]
>>> variance(data)
  0.0

And variance should never be negative:

>>> variance(data*100)
  -1239429440.1282566

The issue is about numerics and how precision gets lost: the computational formula subtracts two large, nearly equal quantities. If you want maximum precision, you have to order your operations in a certain way. A naive implementation leads to incorrect results because the imprecision is too large. That was one of the issues my numerics course at university was about.
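For contrast, here is a sketch of the textbook two-pass algorithm, which centres the data first and therefore never subtracts two huge, nearly equal quantities:

def variance_two_pass(data):
    # Pass 1: the mean. Pass 2: squared deviations, which stay small.
    n = len(data)
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)
    return ss / (n - 1)

data = [x + 1e12 for x in [1, 2, 4, 5, 8]]
print(variance_two_pass(data))   # 7.5, as expected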

  • 1
    If I decipher the syntax correctly, the algorithm is about correct (one might argue about dividing by $n-1$, but well). I assume the issue lies with the implementation of the operations? Not sure if that fits the question: I think D.W. is looking for algorithmic mistakes, not "your algorithm can't deal with real hardware" mistakes.
    – Raphael
    Commented Aug 29, 2014 at 16:13
  • 3
    @Raphael: Though to be fair, the chosen algorithm is well known to be a poor choice for floating-point data.
    – user5386
    Commented Sep 1, 2014 at 21:47
  • 2
    It's not simply about the implementation of the operations; it's about numerics and how precision gets lost. If you want maximum precision then you have to order your operations in a certain way. That was one of the issues my numerics course at university was about.
    – Christian
    Commented Sep 18, 2014 at 14:27
  • 1
    In addition to Raphael's accurate comment, a shortcoming of this example is that I don't think a proof of correctness would help avoid this flaw. If you're not aware of the subtleties of floating-point arithmetic, you might think you have proven this correct (by proving that the formula is valid). So it's not an ideal example to teach students why it is important to prove their algorithms correct. If students saw this example, my suspicion is they would instead draw the lesson "floating point / numeric computation stuff is tricky".
    – D.W.
    Commented Sep 19, 2014 at 3:20
3

For almost 40 years it was thought that an intuitive two-pointer algorithm for finding a maximum-area triangle inside a convex polygon was correct. It was proved incorrect in https://arxiv.org/abs/1705.11035.

2

While this is likely not quite what you're after, it's certainly easy to understand, and testing some small cases without any further thought will lead to an incorrect algorithm.

Problem: Write a function that takes a nonnegative integer $n$ and returns the number of proper divisors of $n^2+n+41$, i.e., the number of integers $d$ with $0 < d < n^2+n+41$ such that $d$ divides $n^2+n+41$.

Proposed solution:

int f(int n) {
   return 1;
}

This happens to be correct for $n = 0, 1, 2, \dotsc, 39$ but fails at $n = 40$: Euler's polynomial $n^2+n+41$ is prime for $n = 0, \dotsc, 39$, while $40^2+40+41 = 1681 = 41^2$ has two proper divisors, $1$ and $41$.
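A brute-force checker (a sketch; function name mine) makes the failure point explicit:

def proper_divisor_count(m):
    """Count divisors d of m with 0 < d < m."""
    return sum(1 for d in range(1, m) if m % d == 0)

# "return 1" survives 40 consecutive test cases before breaking:
for n in range(41):
    if proper_divisor_count(n * n + n + 41) != 1:
        print(n)   # 40
        break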

This "try some small cases and infer an algorithm from the result" approach crops up frequently (though not as extremely as here) in programming competitions where the pressure is to come up with an algorithm that (a) is quick to implement and (b) has a fast run time.

  • 6
    I don't think this is a very good example, because few people would attempt to find the divisors of a polynomial's values by returning 1.
    – Brian S
    Commented Aug 29, 2014 at 15:08
  • 1
    Suppose the problem asked to determine, for a given integer $n$, whether 3 divides $n^3-n$. A correct solution for this problem is a function which returns true for any input. Is this significantly different from the example I gave? Commented Aug 29, 2014 at 16:43
  • This could be relevant, in the sense that returning a constant value for the number of divisors (or another calculation) can be the result of a wrong algorithmic approach to a problem (for example a statistical problem, or not handling edge cases of the algorithm). However, the answer needs rephrasing.
    – Nikos M.
    Commented Sep 18, 2014 at 22:42
  • @NikosM. Heh. I feel like I'm beating a dead horse here, but the question's second paragraph says that "if their algorithm works correctly on a handful of examples, including all of the corner cases they can think to try, then they conclude that the algorithm must be correct. There's always a student who asks: 'Why do I need to prove my algorithm correct, if I can just try it on a few test cases?'" In this instance, for the first 40 values (far more than a student is likely to try), returning 1 is correct. It seems to me that's what the OP was looking for. Commented Sep 18, 2014 at 23:54
  • Ok, yeah, but as phrased this is trivial (maybe technically correct), but not in the spirit of the question. It still needs rephrasing.
    – Nikos M.
    Commented Sep 19, 2014 at 4:03
0

Though this question was asked a long time ago, let me contribute two (in my opinion) demonstrative examples, both matching all the criteria you listed; maybe they will come in handy if you would like to update your slides someday.

  1. The famous activity selection problem. An intuitive (greedy) approach many people may come up with is to choose the shortest interval/activity first. Think of an edge case where a short activity conflicts with 2 longer, mutually non-conflicting activities (overlapping the end of the first one and the beginning of the other): it is then easy to see that the algorithm is incorrect, since we could have picked those 2 instead of the single short one. Interestingly, the correct solution is also greedy (a small sketch contrasting the two rules appears at the end of this answer).

  2. This one may be a less popular problem: instead of copying over all the details from Leetcode, please read them there before proceeding. When I first came across this problem, I already had a solid understanding of backtracking, so I thought: let's solve it with a 2-phase (greedy) backtracking approach! I.e., let's try to collect as many cherries as possible on the way down to the bottom-right corner, then let's try to do the exact same thing on the way back to the upper-left corner (for the remaining, so far unpicked cherries). I correctly solved 50 test cases out of 56. An edge case that I didn't consider was something like:

    [image: an example grid that defeats the two-phase greedy approach]

    My approach would have solved it like:

    [image: the route chosen by the two-phase approach, leaving cherries behind]

    Note that on the way back (the 2nd phase), due to the movement rules defined by the problem, you can pick up only one of the remaining 2 cherries. However, you can actually pick up all of them while still playing by the rules:

    [image: a route that collects every cherry while following the movement rules]

Sorry for not posting pseudo-code snippets here: pseudocode for the first problem can be found on its Wikipedia page, and the translation of my C++ solution for the 2nd problem would have been just too long/complex.
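For the first example, here is the minimal sketch promised above (in Python, function names mine), pitting shortest-first against the correct earliest-finish rule:

def greedy_select(activities, key):
    """Scan activities in 'key' order, keeping each one compatible with all chosen."""
    chosen = []
    for a in sorted(activities, key=key):
        if all(a[1] <= s or f <= a[0] for s, f in chosen):
            chosen.append(a)
    return chosen

# A short activity straddling two long, mutually compatible ones:
acts = [(0, 10), (9, 11), (10, 20)]                         # (start, finish)
print(len(greedy_select(acts, key=lambda a: a[1] - a[0])))  # shortest first: 1
print(len(greedy_select(acts, key=lambda a: a[1])))         # earliest finish: 2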

