
The setup for this question is the same as in a previous question of mine, with one important difference: I have decided to go for a paired test, always performing two consecutive experiments with the two possible inputs (zero or random). I randomize the order (zeros first and random second, or vice versa) for each measurement, seeking to avoid certain domain-specific biases (if anyone cares: related to the out-of-order execution capability of the underlying CPU).
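For concreteness, the order-randomized pairing scheme I use looks roughly like this (Python for illustration only; `measure`, `zeros_input` and `random_input` are hypothetical stand-ins for my actual benchmark harness):

```python
import random

def measure_pair(measure, zeros_input, random_input):
    """Run one paired measurement, randomizing which input goes first.

    Only the order randomization is the point here; `measure` is a
    hypothetical callable returning a cycle count for one input.
    """
    if random.random() < 0.5:
        t_zero = measure(zeros_input)
        t_rand = measure(random_input)
    else:
        t_rand = measure(random_input)
        t_zero = measure(zeros_input)
    return t_zero - t_rand  # one paired difference D_i, in clock cycles
```

Either ordering yields the same paired difference, so the randomization only perturbs the CPU-level biases, not the statistic.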

After more research into the issue, I've learned about equivalence tests, and the more I read about them, the more I'm convinced they're the correct approach for my problem. Not only do they answer the "right" question, but I believe they also sidestep an issue described in that question: with too many samples, a standard test would reject the null hypothesis by detecting a difference too small to be relevant to my problem at hand.

In particular I've come across the EQUIVNONINF R package and have been reading its documentation, as well as the associated monograph "Testing Statistical Hypotheses of Equivalence and Noninferiority" by Stefan Wellek.

Given the particulars of my data set (paired, non-normal, discrete [integer, actually] data with a high probability of ties), I believe the srktie functions, which perform a signed-rank equivalence test generalized for discrete data, are the most appropriate for my problem. In particular, srktie_m, which is said to be faster, is applicable, so that's the one I planned to go with. This leads to my first question:

Question 1: is the srktie_m really the "best" choice in my case?

I understand that, prior to running the tests, I need to define the equivalence interval, and that this is something that must be answered by domain-specific knowledge. I have decided that any difference below one clock cycle must be attributable to measurement noise, since a CPU is a digital electronic circuit that operates in quanta of clock cycles.

I believe the equivalence interval is related to the eps1 and eps2 parameters of the srktie_m function. They are defined as the left-hand and right-hand limits of the hypothetical equivalence range for $q_+/(1 - q_0) - 1/2$, where $q_+ = P[D_i + D_j > 0]$ and $q_0 = P[D_i + D_j = 0]$. As I understand it, $D_i$ and $D_j$ are the differences in clock cycles between a "zeros" input and a "random" input for two randomly selected measurements. I am unsure how to relate this to my one-clock-cycle equivalence hypothesis (I feel it has to do with the variance of the distribution somehow), which leads to my second question:
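To make sure I understand the target parameter, here is how I would compute its plug-in estimate from a sample of paired differences (Python for illustration; I use index pairs $i < j$, and whether Wellek's definition also includes $i = j$ is a detail I'd have to check against the monograph):

```python
from itertools import combinations

def signed_rank_stat(d):
    """Plug-in estimate of q_+/(1 - q_0) - 1/2 from paired differences d.

    q_+ and q_0 are estimated by the fraction of index pairs i < j with
    D_i + D_j > 0 and D_i + D_j == 0, respectively (the U-statistic form
    based on Walsh sums).
    """
    pairs = list(combinations(d, 2))
    q_plus = sum(1 for a, b in pairs if a + b > 0) / len(pairs)
    q_zero = sum(1 for a, b in pairs if a + b == 0) / len(pairs)
    return q_plus / (1 - q_zero) - 0.5
```

Under exact symmetry of the $D_i$ around zero, this statistic should hover around 0, which is why the equivalence range is expressed as $(-\epsilon_1, \epsilon_2)$ around that value.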

Question 2: how do I choose eps1 and eps2 in srktie_m so that differences of less than one clock cycle are considered equivalent?
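To get a feel for the mapping myself, I can translate a hypothetical shift of $\delta$ clock cycles into the parameter $q_+/(1 - q_0) - 1/2$ by Monte Carlo, under an *assumed* noise model (Python sketch; the integer-uniform noise below is a pure assumption — the real distribution would have to come from my measurements):

```python
import random

def stat_for_shift(delta, noise, n=20000, seed=1):
    """Monte Carlo estimate of q_+/(1 - q_0) - 1/2 when paired
    differences are distributed as delta + noise.

    `noise(rng)` draws one noise sample; pairing two independent draws
    approximates the distribution of D_i + D_j.
    """
    rng = random.Random(seed)
    sums = [2 * delta + noise(rng) + noise(rng) for _ in range(n)]
    q_plus = sum(s > 0 for s in sums) / n
    q_zero = sum(s == 0 for s in sums) / n
    return q_plus / (1 - q_zero) - 0.5

# e.g., integer measurement noise uniform on {-3, ..., 3}:
# stat_for_shift(1, lambda r: r.randint(-3, 3))
```

Running this for $\delta = 1$ (one clock cycle) under a plausible noise model would then suggest candidate magnitudes for eps1 and eps2 — though I'd welcome a principled answer rather than this simulation crutch.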

As I'm looking to run this test by the book, I would like to perform a power analysis first to determine the required sample size before actually running the test. Note that these are computer experiments: I can easily perform a million measurements, and perhaps even a billion wouldn't be out of the question. But I have no idea whether a million samples is enough, or whether a billion is too many (or, worse, still too few). I understand it would be "cheating" to simply rerun the test with more samples if I got a "non-equivalent" result.

Although I can find some functions that appear suitable for power calculation for other tests in the EQUIVNONINF package, I can't seem to find one for the srktie functions. Skimming through Wellek's monograph, I also haven't found anything about this. Which leads to my third question:

Question 3: how do I calculate power and sample size for the signed-rank-with-ties equivalence test?

Lastly, I'm trying to understand whether I should apply the Bonferroni correction, given that I have (as stated in my previous question) an ensemble composed of different parameter sets (security levels for the underlying cryptographic algorithm), implementations, and other parameters.

Although these are not multiple comparisons within the same data set (each parameter generates a different set of measurements), ultimately the research question I'm trying to answer is "does the hardware display data-independent timing?", and I collect evidence for it across these different parameter sets and implementations. This leads to my fourth and final question:

Question 4: would it be proper to apply the Bonferroni correction to my ensemble of measurements?
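For reference, the mechanical part of what I'd do is trivial (Python sketch): compare each of the $m$ per-configuration p-values against $\alpha/m$. My real uncertainty is whether the correction is even called for when the overall claim requires *every* configuration to pass — an intersection-union setup, where each test at level $\alpha$ may already control the overall error and Bonferroni would be conservative.

```python
def bonferroni_equivalence(p_values, alpha=0.05):
    """Bonferroni-adjusted decisions for m equivalence tests.

    Declares equivalence for a configuration only when its p-value is
    below alpha / m, controlling the family-wise error rate across the
    ensemble of parameter sets and implementations.
    """
    m = len(p_values)
    return [p < alpha / m for p in p_values]
```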
