
I am comparing three prompting techniques for LLMs to determine which one performs best. All three strategies include three examples for in-context learning (few-shot only, no fine-tuning).
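
For context, the setup looks roughly like this (a minimal sketch; the `build_prompt` helper, the example pairs, and the Q/A format are hypothetical placeholders, not my actual task):

```python
# Hypothetical few-shot setup: three in-context examples prepended to
# each test question. The examples and format stand in for the real task.
FEW_SHOT_EXAMPLES = [
    ("What is 2 + 2?", "4"),
    ("What is 3 * 3?", "9"),
    ("What is 10 - 7?", "3"),
]

def build_prompt(question: str) -> str:
    """Prepend the three in-context examples to the test question."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"{shots}\nQ: {question}\nA:"
```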

If I use greedy decoding, I get a single deterministic accuracy score, which I believe does not capture the true error.

I can also sample with temperature = 0.7 to get a distribution of accuracy scores. But then, to compare methods, I should check what kind of distribution I have. I sampled five times (I could sample more, but a large N gets expensive). I checked, and the scores look normal, so I am just doing a t-test.
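
Concretely, the check and the test look something like this (a minimal sketch using scipy; the accuracy values below are made up for illustration):

```python
# Five sampled accuracy scores per method (hypothetical numbers).
from scipy import stats

acc_a = [0.71, 0.74, 0.69, 0.72, 0.70]  # method A, one accuracy per run
acc_b = [0.66, 0.68, 0.65, 0.69, 0.67]  # method B, one accuracy per run

# Shapiro-Wilk is one way to check normality, though with N = 5 it has
# very little power to detect departures from normality.
print(stats.shapiro(acc_a))
print(stats.shapiro(acc_b))

# Welch's t-test (equal_var=False) avoids assuming equal variances.
t, p = stats.ttest_ind(acc_a, acc_b, equal_var=False)
print(f"t = {t:.3f}, p = {p:.4f}")
```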

Is my approach correct? Should I apply something like a Bonferroni correction, since I am comparing more than two methods? Does this sampling approach better approximate the true error?
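
For the multiple-comparison part, what I have in mind is something like the following (again a sketch with made-up numbers; the significance level alpha = 0.05 is an assumption):

```python
# Bonferroni correction over the three pairwise comparisons among
# methods A, B, C. Accuracy lists are hypothetical.
from itertools import combinations
from scipy import stats

runs = {
    "A": [0.71, 0.74, 0.69, 0.72, 0.70],
    "B": [0.66, 0.68, 0.65, 0.69, 0.67],
    "C": [0.73, 0.75, 0.71, 0.74, 0.72],
}

pairs = list(combinations(runs, 2))  # 3 pairwise tests for 3 methods
alpha = 0.05 / len(pairs)            # Bonferroni-adjusted threshold

for a, b in pairs:
    t, p = stats.ttest_ind(runs[a], runs[b], equal_var=False)
    print(f"{a} vs {b}: p = {p:.4f}, significant at corrected alpha: {p < alpha}")
```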
