Is it possible for a challenge to have hidden scoring test-cases?

Question

Assume that the method of generating the hidden test-cases is known. (for example: a list of 5000 uniformly random integers in range [1..1000])

Reasons why hidden test cases should be allowed (that I can think of):

Guaranteed to prevent hard-coding (assume that there are sufficiently many possible test-cases)
It's already used in (some) existing challenges, such as Where's Blackhat? .

Reasons why hidden test cases should not be allowed:

It makes the winning condition objectivity more probabilistic.

It does not completely contradict the answer linked above; however, as the answer poster can generate random test cases using the same method and then guess; it will still be correct most of the time.

Question:

Is it possible for a challenge to have hidden scoring test-cases?

If not, what should be done with the existing challenges with hidden scoring test-cases?

We've been through this on chat: it's fine as long as I reveal the test cases at a later date. I don't get what your problem is — Beta Decay, Commented Jul 29, 2018 at 8:31
Do we need a meta consensus? We're just going to drive people away from this site if we keeping adding more and more obscure rules — Beta Decay, Commented Jul 29, 2018 at 19:18
@BetaDecay In this case, a meta consensus is exactly what we need, as it make the rules less obscure. — DELETE_ME, Commented Jul 30, 2018 at 2:01

ბიმო · Accepted Answer · 2018-07-28 13:22:44Z

Yes, but they need to be made public after X time units

In my opinion for test-battery challenges this can be a sensible thing to do, however there are two important things that a challenge should do in such a case:

they need to specify a time span (at least once some people are answering) for when they will choose the winner and by that time make the test cases public
they need to make sure (eg. by providing a cryptographic hash of the test cases/seed to generate them) that the test cases were not chosen after writing the challenge, this is to make it "impossible" for them to adjust it to an answer they like

Addressing issues mentioned above

1) It makes the winning condition objectivity more probabilistic

If the method of generating hidden test cases is not known, this might be true. However king-of-the-hill challenges are the same. Since the conditions are the same for everyone, this won't cause any advantages/disadvantages for other people. And more importantly just because a challenge fulfils our criterions for being valid, it does not mean that it makes a good challenge:

In my opinion a good test-battery challenge will have at least one example and some public test cases which reflect approximately what's being tested. The scoring could then be a mix between hidden and public tests or just the hidden ones.

2) The score reported by the challenge author cannot be independently verified

Once the test cases are public the score can be verified by everyone and providing a hash will make sure that they are indeed the original tests.

3) The challenge dies when the author stops responding

The first point is very similar to what we do on cops-and-robbers challenges where cops are required to post their solution after some time (usually 7 days), though probably a challenge should give more time than 7 days to solve it.

I doubt that an OP of a challenge will just disappear before they chose a winner, we don't disallow cops-and-robbers challenges because the cop could stop responding and we shouldn't disallow test-battery challenges for that reason. Of course it is a possibility but I don't think that this will create problems.

Note: For this answer I don't think it really matters whether there's a known method of generating the hidden test cases. If anything points (1) and (3) won't cause troubles since people can generate tests themselves to get an approximative score and if an author stops responding the community could come up with new tests (which probably will never happen).

Not sure if this is a good idea. This may discourage future competition (given that old challenges currently doesn't draw much attention, this may be acceptable) — DELETE_ME, Commented Jul 28, 2018 at 13:41
@user202729: I think it's not that bad, why not have some challenges limited only for a certain time? In that case if a challenge was really great, it can always be redone in a different flavour. And a user can always add an answer later, but it won't get a green checkmark. — ბიმო, Commented Jul 28, 2018 at 13:50
I don't see any similarity to cops-and-robbers. If a cop fails to post their solution, they're the only person to lose out. If the OP of a question with hidden scoring disappears, everyone else loses out. — Peter Taylor, Commented Jul 30, 2018 at 9:51
@PeterTaylor: That's true only if no-one solved/attempted a cop answer and the same would apply if no-one solved/attempted such a challenge. — ბიმო, Commented Jul 30, 2018 at 12:39
If I as a robber crack a cop's submission, everyone can verify that I've cracked it, at least on all of the cops-and-robbers questions I've seen. It doesn't require further activity by the cop. The only thing a cop needs to do after posting their answer is to claim it as safe if no-one cracks it. — Peter Taylor, Commented Jul 30, 2018 at 12:43
@PeterTaylor: With the partly public test cases we could verify it too, but your point still stands as with this answer they would not be obligatory. I still think it is comparable because if a cop disappears we don't know their intended solution which could be more interesting, shorter, cleverer etc. and - more importantly - if there were only futile (possibly even very time consuming) attempts it would still be a great loss to those people that attempted & failed to solve it. — ბიმო, Commented Jul 30, 2018 at 13:33

Peter Taylor · Accepted Answer · 2018-07-30 10:18:16Z

2

Reasons why hidden test cases should be allowed (that I can think of):

Guaranteed to prevent hard-coding (assume that there are sufficiently many possible test-cases)

If there are thousands of test cases (as you suggest earlier in the question) then making them public will not make hard-coding a competitive approach, so I don't see any value to this argument.

It's already used in (some) existing challenges, such as Where's Blackhat? .

"It's already been done" is a terrible argument: see code-trolling. What matters is whether it's a good criterion, not whether it's been done before.

The biggest reason that I see for disallowing private test cases is that I can't tell whether a small tweak I make will make the score better or worse. With king-of-the-hill I can at least run my own simulations on equal terms with the OP and decide which version to post. With hidden scoring the only competitive option is to post multiple answers with tiny differences, which annoys everyone involved (me, OP, and third parties).

In short, hidden test cases create the wrong incentives for answerers.

answered Jul 30, 2018 at 10:18

Peter Taylor

43.1k8 gold badges55 silver badges116 bronze badges

\$\begingroup\$ What about cases such as “Where’s Blackhat?” where there are a limited number of test cases (never mind the work that one would have to do to compile many many test cases)? What would be the appropriate tiebreaker then? Code golf as a tiebreak seems like a cop-out to me \$\endgroup\$
– Beta Decay
Commented Jul 31, 2018 at 0:44
\$\begingroup\$ @BetaDecay, if it's realistic that answers will get a perfect score then the tiebreaker is really the winning criterion. If you can't think of a good winning criterion, you can always try the sandbox to see whether anyone can suggest one. \$\endgroup\$
– Peter Taylor
Commented Jul 31, 2018 at 10:13
\$\begingroup\$ @PeterTaylor Realistic != easy... and although answers are public, most people would want to try writing their own answer before looking at others'. \$\endgroup\$
– DELETE_ME
Commented Aug 1, 2018 at 2:41
\$\begingroup\$ @user202729, I'm not sure why it's relevant whether or not people would want to try writing their own answer before looking at others', but there are plenty of questions where the majority of the answers credit the person whose approach they've ported to a different language. \$\endgroup\$
– Peter Taylor
Commented Aug 1, 2018 at 5:57
\$\begingroup\$ "If there are thousands of test cases (as you suggest earlier in the question) then making them public will not make hard-coding a competitive approach" => That's not true for code-challenge challenges, which is the majority of challenges that need hidden test cases. \$\endgroup\$
– Fatalize
Commented Aug 1, 2018 at 7:46
\$\begingroup\$ @Fatalize, even if people aren't aiming to write short code for the winning criterion, the post length limitation of 64kB still applies, and I would rather see questions which place a limit of 4kB on the code than questions which don't let me score my own answer. \$\endgroup\$
– Peter Taylor
Commented Aug 1, 2018 at 7:53
1

\$\begingroup\$ "The biggest reason that I see for disallowing private test cases is that I can't tell whether a small tweak I make will make the score better or worse" => The point of these types of challenges often is that you are supposed to design an answer that generalizes well to unseen data. So making small ad-hoc tweaks that slightly improve your score opposes the goal of the challenge, because you're over-optimizing for the given cases. \$\endgroup\$
– Fatalize
Commented Aug 1, 2018 at 8:01
1

\$\begingroup\$ @Fatalize, then IMO they're not a good fit for the site. \$\endgroup\$
– Peter Taylor
Commented Aug 1, 2018 at 8:04
\$\begingroup\$ If there are thousands of test cases, a good hash function might still make the table function not that long \$\endgroup\$
– l4m2
Commented Oct 17, 2018 at 16:08

Add a comment |

feersum · Accepted Answer · 2018-07-30 00:33:19Z

1

Challenges with hidden test cases should not be created. This is because

The score reported by the challenge author cannot be independently verified.
The challenge dies when the author stops responding.

As an alternative (in the case where there is a specified random distribution for generating the test cases) I suggest generating test cases based on a random number generator seeded by a cryptographic hash of the solution. The non-invertibility of the hash function prevents the answerer from targeting the seed that generates the easiest answers (by adding comments, etc.). This would still allow such "optimization" opportunities by trying different strings through brute force (and I suggest the rules explicitly allow this), but if there are enough test cases it shouldn't go far.

edited Jul 30, 2018 at 0:33

answered Jul 28, 2018 at 9:59

feersum

31.4k21 silver badges24 bronze badges

\$\begingroup\$ @NathanMerrill How else should a submission's average behavior over all inputs be measured? Of course it's impractical to test it for all inputs, right? \$\endgroup\$
– DELETE_ME
Commented Jul 28, 2018 at 15:06
\$\begingroup\$ @feersum Ah, true. (I've deleted my comments to ensure clarity) \$\endgroup\$
– Nathan Merrill
Commented Jul 29, 2018 at 19:26

Add a comment |

Community · Accepted Answer · 2020-06-17 09:03:53Z

Provide a training set, a validation set, and hide the test set

Most challenges that have had hidden test cases are code-challenge challenges that involve some kind of pattern matching algorithms (such as those image recognition challenges).

These challenges are typically solvable using maching learning techniques. With such approaches, it is pretty much always possible to optimize them such that they perform flawlessly on a dataset (even with thousands of elements), if the whole dataset is known. Therefore, for most challenges, scoring on the number of accurate recognition on the given dataset would be meaningless, as everyone could just optimize their answer for it.

Moreover, this kind of ruins the purpose of such challenges where it is expected that answerers provide models that should generalize well to any never seen input data.

As such, the simplest solution is to provide three datasets of test cases, in much the same way machine learning algorithms are evaluated in scientific literature:

A training set, visible to all, that contains a representative number of examples.
A validation set, visible to all, that people can use to see if their answer generalizes well on unseen data, after designing it using the training set.
A test set, only known to the creator of the challenge, that they use to score each answer.

That way, answerers cannot optimize their programs for the test set.

Stack Exchange Network

Is it possible for a challenge to have hidden scoring test-cases?

4 Answers 4

Yes, but they need to be made public after X time units

Addressing issues mentioned above

1) It makes the winning condition objectivity more probabilistic

2) The score reported by the challenge author cannot be independently verified

3) The challenge dies when the author stops responding

Provide a training set, a validation set, and hide the test set

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged
discussion
.

Linked

Hot Network Questions

Is it possible for a challenge to have hidden scoring test-cases?

4 Answers 4

Yes, but they need to be made public after X time units

Addressing issues mentioned above

1) It makes the winning condition objectivity more probabilistic

2) The score reported by the challenge author cannot be independently verified

3) The challenge dies when the author stops responding

Provide a training set, a validation set, and hide the test set

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged discussion.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
discussion
.