Rationale & FAQ

Why, oh why, oh why?

Sigh. Before you ask “Why do we need yet another test suite?”: it’s not another 😉

It’s the only one that does what it does. There is no other suite that tests for randomness below the 1 MB limit, whilst providing accurate p values. Most won’t even work with such small file sizes (not counting the pretty useless rngtest). Yet these file sizes are of paramount importance to the DIY TRNG Maker community and one time pad generateurs.

It is focussed on validating sub-megabyte files full of random bytes, rather than validating TRNGs per se which are capable of generating many orders of magnitude more entropy. That’s appropriate for one time pad applications. It’s not dieharder.

Current issues (original ent)

  • In “evaluating pseudorandom number generators for encryption”, the required entropy rate is 1 bit/bit, or 8 bits/byte. Anything substantially less is useless for cryptography. There is no need to measure it, as the only result that concerns us is a pass/fail determination within agreed confidence bounds, à la the other standard randomness tests like dieharder. Yet ent provides no bounds and no determination of any p values for confidence, other than for a bit/byte $ \chi^2 $ test.

  • And it can’t be used for general entropy measurement in its default setting, as it reports the wrong type of entropy. Cryptography focuses on the most conservative measure, min-entropy $H_\infty$, not Shannon entropy. ent reports Shannon entropy, which is higher than min-entropy for every sample distribution other than the uniform one (both measures are defined below). And uniform distributions are uncommon from most DIY entropy sources.

  • From the ent website, “BUGS: Note that the ‘optimal compression’ shown for the file is computed from the byte- or bit-stream entropy and thus reflects compressibility based on a reading frame of the chosen width (8-bit bytes or individual bits if the -b option is specified). Algorithms which use a larger reading frame, such as the Lempel-Ziv [Lempel & Ziv] algorithm, may achieve greater compression if the file contains repeated sequences of multiple bytes.”

  • A consequence of the above is that an 8-bit window presupposes IID data with a relaxation period not greater than 8 bits. It is common to see ent used (incorrectly) against non-IID data sets. In those cases the reported entropy would be much higher than the true rate.

Read a longer version of current issues here. In summary, there is no confidence in the ent metrics as to whether the sample data set is suitable for cryptography.
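
For reference, the two entropy measures in question, for a sample with byte-value probabilities $p_0, \ldots, p_{255}$, are:-

$$ H_{\text{Shannon}} = -\sum_{i=0}^{255} p_i \log_2 p_i \qquad \qquad H_\infty = -\log_2 \left( \max_i \, p_i \right) $$

$H_\infty \le H_{\text{Shannon}}$ for every distribution, with equality only when the distribution is uniform, which is why ent’s default output flatters a skewed source.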

Why empirical simulation?

Leveraging holistic notions of machine learning, kernel density estimation and polynomial spline interpolation allows accurate p value determination for sample files between 25 kB and 1 MB. Our empirical models identify patterns in random data. Traditional randomness testing relies on statistical tests to find such patterns, and ent3000 does indeed use some of these, but not exclusively: such tests can be fooled by non-random data that exhibits similar statistical properties, so we also use machine learning to identify patterns that are not easily detected by traditional statistical tests. And we test with sophisticated compression algorithms such as LZMA and BZ2, which have no algebraic determination of a p value.
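
As a rough illustration of the empirical approach (and emphatically not ent3000’s actual code), a compression-style statistic can be turned into a p value by looking it up in an eCDF built from the calibration trials. The class and method names below are invented for this sketch, and the JDK’s built-in Deflater stands in for the LZMA and BZ2 libraries that ent3000 actually ships:-

```java
import java.util.Arrays;
import java.util.zip.Deflater;

/** Sketch only: an empirical p value read from simulated calibration statistics. */
public class EmpiricalPValueSketch {

    /** Compressed length of a sample; Deflater stands in for the shipped LZMA/BZ2 libraries. */
    static int compressedLength(byte[] sample) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(sample);
        deflater.finish();
        byte[] buffer = new byte[sample.length + 1024];   // random data barely compresses, so this is ample
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);            // only the count of output bytes matters here
        }
        deflater.end();
        return total;
    }

    /** Two-sided empirical p value: how far the observed statistic sits into either tail of the eCDF. */
    static double empiricalPValue(double[] sortedTrialStats, double observed) {
        int idx = Arrays.binarySearch(sortedTrialStats, observed);
        if (idx < 0) idx = -idx - 1;                      // insertion point when not found exactly
        double cdf = (double) idx / sortedTrialStats.length;
        return 2.0 * Math.min(cdf, 1.0 - cdf);
    }
}
```

The kernel density estimation and spline interpolation mentioned above go further than this raw step-function lookup; a plain eCDF is at its coarsest in the tails, which is exactly where p values matter most.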

Overall, machine learning is a powerful tool that we use to improve the effectiveness of randomness testing. Our justification also builds on traditional test failings:-

From: https://webhome.phy.duke.edu/~rgb/General/dieharder.php

The number of actual samples within a test that contribute to the single-run test statistic was made a variable when possible. This was generally possible when the target was an easily computable function of the number of samples, but a number of the tests have pre-computed targets for specific numbers of samples and that number cannot be varied because no general function is known relating the target value to the number of samples.

Many of diehard’s tests investigated overlapping bit sequences. Overlapping sequences are not independent and one has to account for covariance between the samples (or a gradually vanishing degree of autocorrelation between sequential samples with gradually decreasing overlap). This was generally done at least in part because it used file-based input of random numbers and the size of files that could reasonably be generated and tested in the mid-90’s contained on the order of a million random deviates.

Unfortunately, some of the diehard tests that rely on weak inverses of the covariance matrices associated with overlapping samples seem to have errors in their implementation, whether in the original diehard (covariance) data or in dieharder-specific code it is difficult to say. Fortunately, it is no longer necessary to limit the number of random numbers drawn from a generator when running an integrated test, and non-overlapping versions of these same tests do not require any treatment of covariance. For that reason non-overlapping versions of the questionable tests have been provided where possible (in particular testing permutations and sums) and the overlapping versions of those tests are deprecated pending a resolution of the apparent errors.

Indeed, from within the source code for dieharder’s 2d Spheres test, V3.31.1, we find:-

This test has a BUG in it – the expression it uses to evaluate p is not accurate enough to withstand the demands of dieharder. M. Fischler pointed this out and derived the required corrections (giving explicit forms useful out to d=5) and they are incorporated in rgb_minimum_distance(). This test is hence OBSOLETE and is left in so people can play with it and convince themselves that this is so.

From: https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-22r1a.pdf

For many of the tests in this test suite, the assumption has been made that the size of the sequence length, n, is large (of the order 10^3 to 10^7). For such large sample sizes of n, asymptotic reference distributions have been derived and applied to carry out the tests. Most of the tests are applicable for smaller values of n. However, if used for smaller values of n, the asymptotic reference distributions would be inappropriate and would need to be replaced by exact distributions that would commonly be difficult to compute.

According to Maurer, the source-coding algorithm due to Lempel-Ziv “seems to be less suited for application as a statistical test” because it seems to be difficult to define a test statistic whose distribution can be determined or approximated.

From: https://en.wikipedia.org/wiki/Diehard_tests

Most of the tests in DIEHARD return a p-value, which should be uniform on [0,1) if the input file contains truly independent random bits. Those p-values are obtained by p = F(X), where F is the assumed distribution of the sample random variable X – often normal. But that assumed F is just an asymptotic approximation, for which the fit will be worst in the tails.

The parking lot test: Theory for the behavior (sic) of such a random curve seems beyond reach…

The 3D spheres test: Thus the radius cubed is exponential with mean 30. (The mean is obtained by extensive simulation).

From: https://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm

An attractive feature of this test is that the distribution of the K-S test statistic itself does not depend on the underlying cumulative distribution function being tested. Another advantage is that it is an exact test (the chi-square goodness-of-fit test depends on an adequate sample size for the approximations to be valid). Despite these advantages, the K-S test has several important limitations:

It only applies to continuous distributions.

It tends to be more sensitive near the center of the distribution than at the tails.

Perhaps the most serious limitation is that the distribution must be fully specified. That is, if location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid. It typically must be determined by simulation.

And finally, simulation avoids the assumption of the validity of squiggly functions such as NIST’s 800-22 STS’ Cumulative Sums (Cusum) test:-

$$ G(z) = \frac{1}{\sqrt{2 \pi}} \int_{-z}^{z} \sum_{k=-\infty}^{\infty} (-1)^k \exp \left\{ -\frac{(u-2kz)^2}{2} \right\} du $$

or NIST’s 800-22 STS’ Serial test:-

$$ P\text{-value2} = \text{igamc} \left\{ 2^{m-3}, \nabla^2 \Psi_m^2 / 2 \right\} $$.

And so given the above, it’s statistically and mathematically safer and more accurate to simulate rather than theorise. You know, to make it perfect.

Choosing the number of trial runs

If we simulate, we need to know how many simulation trials ($n$) to run. Those trials will then be used to construct empirical cumulative distribution functions (eCDFs). We can determine confidence interval bounds around the eCDF, $F_n(x)$, based on $\alpha = 0.05$ from:-

$$ F(x) = F_n(x) \pm \epsilon $$ where $$ \epsilon = \sqrt{\frac{\ln({\frac{2}{\alpha}})}{2n}} $$

knowing that the true CDF, $F(x)$, lies within them. If $\alpha = 0.05$, that’s a $\frac{1}{20}$th resolution. Applying that $\frac{1}{20}$th resolution again, this time to $\epsilon$ (i.e. $\epsilon = 0.05/20$), gives $\epsilon = 0.0025$. Which leads us to $n = 295,110$, say three hundred thousand trials. With hindsight, we should have calculated $n$ before we ran 1,000,000 trials for some of the calibration simulations. C’est la vie.
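
That arithmetic is easy to check; a minimal sketch (the class name is purely illustrative):-

```java
/** Reproduces the trial-count arithmetic from the confidence bound above. */
public class TrialCount {
    public static void main(String[] args) {
        double alpha = 0.05;
        double epsilon = alpha / 20.0;                                    // the 1/20th resolution applied again
        double n = Math.log(2.0 / alpha) / (2.0 * epsilon * epsilon);     // invert epsilon = sqrt(ln(2/alpha) / 2n)
        System.out.printf("epsilon = %.4f, n = %.0f%n", epsilon, n);      // epsilon = 0.0025, n = 295110
    }
}
```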

Raw byte sequences were generated by Java 17’s new L128X256MixRandom pseudo random number generator. The Mersenne Twister was not used anywhere in the calibrations for the reasons stated at https://en.wikipedia.org/wiki/Mersenne_Twister#Disadvantages. And that it’s slow. We used $5.8 \times 10^{12}$ pseudo random bytes for a full calibration run per test. That’s nearly six trillion bytes, or 19 times the number of stars in our galaxy. Our machines are really quite tired as some calibrations such as for the Darts test take over 600 hours of compute time.

If you’re worried that we used too many random bytes to be statistically independent, don’t. We restarted multiple copies of L128X256MixRandom throughout the days. Even if we hadn’t, though, and had just run a single instance, we still would have drawn less per test than $\sqrt[7]{\text{state}}$ of the generator’s capacity. Hence one of the reasons for Java 17.
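
For anyone wanting to reproduce raw trial bytes, Java 17’s generator can be requested by name through the standard RandomGenerator factory; the 25 kB sample size and output file name below are purely illustrative:-

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.random.RandomGenerator;

/** Sketch: write one 25 kB trial sample from a fresh L128X256MixRandom instance (Java 17+). */
public class TrialSample {
    public static void main(String[] args) throws Exception {
        RandomGenerator rng = RandomGenerator.of("L128X256MixRandom");  // look the generator up by name
        byte[] sample = new byte[25_000];                               // 25 kB, the suite's lower file limit
        rng.nextBytes(sample);                                          // fill with pseudorandom bytes
        Files.write(Path.of("trial.bin"), sample);                      // illustrative output file name
    }
}
```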

Why those file limits exactly?

  • 25 kB lower limit. A random file of this size should hold an expected count of approximately 100 of each byte value 0 - 255 (actually 97.656). That is sufficient to perform statistical tests on the sample, such as $\chi^2$ (see the sketch after this list). It becomes statistically infeasible to test much smaller samples, especially for tests that float 5 byte tuples, like our Kolmogorov–Smirnov (KS) test, or our 9 byte tuple Shells test.

  • 1 MB upper limit. Once a samples file reaches around a megabyte in size, NIST’s Special Publication 800-22, “A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications” can be used (implemented here). And above about 10 MB, diehard can be used. Typically, DIY entropy sources operate slowly. So a megabyte is sufficient for 95% of our generateurs. And few would need to generate significantly more entropy, as one time pads are optimised for text correspondence. No one should be using them to encrypt 8K UHD movies.
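
As referenced in the 25 kB bullet above, a byte-frequency $\chi^2$ statistic needs only the expected per-value count ($25{,}000 / 256 \approx 97.656$). A minimal sketch, not the exact ent3000 implementation:-

```java
/** Sketch: byte-frequency chi-squared statistic for a sample file (255 degrees of freedom). */
public class ByteChiSquared {
    static double chiSquared(byte[] sample) {
        long[] counts = new long[256];
        for (byte b : sample) {
            counts[b & 0xFF]++;                      // histogram of byte values 0..255
        }
        double expected = sample.length / 256.0;     // 25,000 / 256 = 97.656... for a 25 kB file
        double chi2 = 0.0;
        for (long observed : counts) {
            double diff = observed - expected;
            chi2 += diff * diff / expected;
        }
        return chi2;                                 // compare against the chi-squared(255) distribution
    }
}
```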

Why Java?

Because of certain empirical tests. The compression test uses LZMA and BZ2 libraries that we then have to package and ship. The empirical cumulative distribution functions that determine the test’s p value are trained on those particular libraries. Java makes it very easy and reliable to put them into a single JAR file. Python package management is a real pain and unreliable, as https://imgs.xkcd.com/comics/python_environment.png illustrates. Java also runs a lot faster than Python when generating the simulation data, which was important for the Darts test.

A basic Java install simply means expanding the Java archive into a directory and running it from there. It doesn’t affect anything else and you can have many concurrently installed versions. Java is just easier than Python, or than C/C++ which come with their own memory management issues. And who knows how to use Rust?

Any other justifications?

  • Creates an unambiguous test result per samples file. ent3000 has PASS/FAIL determinations. Therefore no statistics knowledge is necessary to interpret the results.

  • Tries to moot all those crypto.stackexchange.com questions about NIST’s STS, diehard and dieharder testing and the interpretation of p values, such as Interpretation of certain results of NIST Test Suite, Random number generator issue and Interpretation of the results of NIST (p)NRG suite.

  • Sentimentally maintains the existing ent tests, but tightens their bounds whilst adding other tests.

  • Makes calibration algorithms and resultant data publicly available to instil confidence through the opportunity for independent validation.

And why $\alpha=0.05$?

Critical values for a test of hypothesis depend upon a test statistic, which is specific to the type of test, and the significance level $\alpha$, which defines the sensitivity of the test. A value of $\alpha=0.05$ implies that the null hypothesis ($H_0$) is rejected 5% of the time when it is in fact true (a type I error). The choice of $\alpha$ is somewhat arbitrary, although in practice values of 0.1, 0.05, and 0.01 are common. Critical values are essentially cut-off values that define regions where the test statistic is unlikely to lie; for example, a region that the test statistic exceeds with probability $\alpha$ if the null hypothesis is true. The null hypothesis is rejected if the test statistic lies within this region, which is often referred to as the rejection region(s).
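
In p value terms, that boils down to the textbook decision rule for a single test at significance level $\alpha$:-

$$ \text{reject } H_0 \quad \iff \quad p < \alpha, \qquad \alpha = 0.05 $$

(A two-tailed test splits $\alpha$ between the tails, rejecting when $p < \alpha/2$ or $p > 1 - \alpha/2$.)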

$\alpha = 0.05$ is commonly accepted in most sciences, other than particle physics. An $\alpha$ of 0.05 creates a test 5× stronger than in the NIST Special Publication 800-22, “A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications” which uses $\alpha = 0.01$✝. And 50× stronger than in the NIST Special Publication 800-90B, “Recommendation for the Entropy Sources Used for Random Bit Generation” which uses $\alpha = 0.001$✝.

A larger $\alpha$ of 0.05 reduces type II errors (a bad generator passing), erring on the side of caution, as this is a cryptographic test suite after all. Tests will fail safe as your safety is our first priority. If several sample files all fail ent3000, it’s not because of our $\alpha$. It’s because of your TRNG. So there.


Notes:-

✝ Why is this? If security is indeed important in cryptography, why does NIST posit such low values of $\alpha$ and risk generateurs passing TRNGs for which the alternative hypothesis (your TRNG is bad) is in fact true? Is your safety not their first priority? See conspiracies.