Individual test details

Entropy

Shannon entropy $(H)$ is the information density of the contents of the samples file, expressed as a number of bits per byte. For a general source $X$ with $n$ categories of symbol:-

$$H(X) = ts = -\sum_{i=1}^{n} P(x_i) \log_2(P(x_i))$$

This test mirrors the old ent entropy test by featuring the same scan window of 8 bits. Unfortunately, for reasons described in our rationale, we have no closed-form solution for this. Instead we rely on simulation for the p value, taken directly from the empirical cumulative distribution function (eCDF) of the test statistic $ts$.
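
For illustration, here is a minimal Java sketch of the statistic over the 8-bit scan window (the class and method names are ours, not ent3000's, and the whole samples file is assumed to fit in memory):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class EntropySketch {
    // Shannon entropy in bits per byte, using an 8-bit scan window
    // (i.e. the 256 possible byte values are the symbol categories).
    static double entropy(byte[] samples) {
        long[] counts = new long[256];
        for (byte b : samples) {
            counts[b & 0xFF]++;
        }
        double h = 0.0;
        for (long c : counts) {
            if (c == 0) continue;                    // 0 log 0 is taken as 0
            double p = (double) c / samples.length;
            h -= p * Math.log(p) / Math.log(2);      // log base 2
        }
        return h;
    }

    public static void main(String[] args) throws Exception {
        byte[] samples = Files.readAllBytes(Path.of(args[0]));
        System.out.println("ts = " + entropy(samples) + " bits/byte");
    }
}
```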

You should expect $ts \approx $ 8 bits/byte.

Compression

As with the entropy test above, no closed-form determination of a p value is possible due to the overwhelming algorithmic complexity of the test. The test statistic is the combined size of the compressed samples file $S$, as:-

$$ ts = |LZMA(S)| + |BZ2(S)| $$

We rely on simulation for the p value directly from the eCDF for $ts$. We also believe that two attempts at compression are better than one. The LZMA and BZ2 compression algorithms are located in the xz-1.9.jar and commons-compress-1.21.jar libraries. These jar files should not be changed without consideration of re-calibrating the test’s embedded eCDF.
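
A minimal sketch of the statistic, assuming the stock stream classes from those two jars with their default settings (ent3000 itself may configure the compressors differently):

```java
import java.io.ByteArrayOutputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;
import org.apache.commons.compress.compressors.xz.XZCompressorOutputStream;

public class CompressionSketch {
    // |LZMA(S)|: compressed size via the XZ (LZMA2) stream, backed by xz-1.9.jar.
    static long lzmaSize(byte[] s) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (XZCompressorOutputStream xz = new XZCompressorOutputStream(sink)) {
            xz.write(s);
        }
        return sink.size();
    }

    // |BZ2(S)|: compressed size via commons-compress-1.21.jar.
    static long bz2Size(byte[] s) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (BZip2CompressorOutputStream bz2 = new BZip2CompressorOutputStream(sink)) {
            bz2.write(s);
        }
        return sink.size();
    }

    static long ts(byte[] s) throws Exception {
        return lzmaSize(s) + bz2Size(s);    // ts = |LZMA(S)| + |BZ2(S)|
    }
}
```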

You should expect $ts$ to be slightly $> 2 \times |S|$.

Chi

The chi-square test $(\chi^2)$ is one of the most common tests for the randomness of data, and is extremely sensitive to errors.

The test’s approximation to the chi-squared distribution breaks down if expected frequencies are too low. It will normally be acceptable so long as no more than 20% of the categories have expected frequencies below 5. We surpass 5 by engineering the test to have an expected 34 counts per category (actually slightly greater). If we run the math $(\text{threshold} = 34 \times \text{bits} \times 2^{\text{bits}} /8)$, we arrive at the following minimum sample file lengths for bit windows of sizes 9 to 14. The larger the bit window, the more rigorous the test. ent uses a bit window of length 8.

| Bits | Categories | Threshold (bytes) | Min. Length (bytes) |
|------|------------|-------------------|---------------------|
| 9    | 512        | 19,584            | 25,000              |
| 10   | 1,024      | 43,520            | 50,000              |
| 11   | 2,048      | 95,744            | 100,000             |
| 12   | 4,096      | 208,896           | 300,000             |
| 13   | 8,192      | 452,608           | 500,000             |
| 14   | 16,384     | 974,848           | 1,000,000           |

And finally the p value arises from the $\chi^2$ statistic, with $2^{\text{bits}} - 1$ degrees of freedom for a goodness-of-fit test over $2^{\text{bits}}$ categories.
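
A sketch of the whole procedure for one window size, assuming non-overlapping bit windows and the goodness-of-fit test from Apache Commons Math (our choice of library for illustration, not necessarily ent3000's):

```java
import java.util.Arrays;
import org.apache.commons.math3.stat.inference.ChiSquareTest;

public class ChiSketch {
    // p value for a chi-squared goodness-of-fit test over non-overlapping
    // `bits`-wide windows of the samples file.
    static double chiPValue(byte[] samples, int bits) {
        int categories = 1 << bits;
        long totalBits = (long) samples.length * 8;
        long windows = totalBits / bits;             // non-overlapping windows

        long[] observed = new long[categories];
        for (long w = 0; w < windows; w++) {
            int value = 0;
            for (long bit = w * bits; bit < (w + 1) * bits; bit++) {
                int b = samples[(int) (bit / 8)] & 0xFF;
                value = (value << 1) | ((b >> (7 - (int) (bit % 8))) & 1);
            }
            observed[value]++;
        }

        // Uniform expectation: every category is equally likely for random data.
        double[] expected = new double[categories];
        Arrays.fill(expected, (double) windows / categories);
        return new ChiSquareTest().chiSquareTest(expected, observed);
    }
}
```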

Mean

This is simply the result of summing all the bytes in the samples file and dividing by its length. If those bytes are random, the expectation should be 127.5. The variance of a single byte drawn from a discrete uniform distribution is $\frac{256^2-1}{12}$, and the variance of the mean of $n$ independent bytes is that divided by $n$. Thus for $n$ bytes:-

$$ \sigma^2 = \frac{256^2-1}{12n} = \frac{5461.25}{n} $$

From here we calculate a standard Z score, $z = \frac{x - \mu}{\sigma},$ and finally a p value.
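
A sketch under the assumption of a two-sided p value taken from the standard normal distribution (via Commons Math here, purely for illustration):

```java
import org.apache.commons.math3.distribution.NormalDistribution;

public class MeanSketch {
    static double pValue(byte[] samples) {
        double sum = 0;
        for (byte b : samples) {
            sum += (b & 0xFF);
        }
        double ts = sum / samples.length;                    // the mean, ts

        double mu = 127.5;                                   // expectation for random bytes
        double sigma = Math.sqrt(5461.25 / samples.length);  // sigma of the mean
        double z = (ts - mu) / sigma;

        NormalDistribution stdNormal = new NormalDistribution();
        return 2 * (1 - stdNormal.cumulativeProbability(Math.abs(z)));
    }
}
```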

You should expect $ts \approx $ 127.5.

Pi

To compute Monte Carlo estimates of $\pi$, we appropriate the function $f(x) = \sqrt{1 - x^2}$. The average function method is more accurate than ent's area method because it has a smaller variance, and is thus more rigorous. The graph of the function on the interval $[0, 1]$ is shown in the plot below. The curve forms a quarter circle of unit radius, and the exact area under the curve is $\frac{\pi}{4}$.

A quarter unit circle with area $\pi/4$.

We leverage the fact that the average value of a continuous function $f$ on the interval $[a, b]$ is given by the integral $f_{avg} = \frac{1}{b-a} \int_a^b f(x)\,dx$. That implies that for our curve:-

$$ f_{avg} = \int_0^1 \sqrt{1 - x^2}\,dx = \frac{\pi}{4} $$

So after estimating the left-hand side of the equation, we multiply the estimate by 4 to estimate $\pi$. And that's its mean value, $\mu$.

And by a little magic: computing $E[(\sqrt{1-X^2})^2] = \int_0^1 (1-x^2)\,dx = 2/3$ gives the variance of $\sqrt{1-X^2}$ as $E[1-X^2] - E[\sqrt{1-X^2}]^2 = 2/3 - (\pi/4)^2$. Since our estimate of $\pi$ is 4 times the mean of $n$ such terms, its variance is $16$ times that, divided by $n$:-

$$ \sigma^2 = \frac{16}{n} \Bigl\{ 2/3 - \bigl( \frac{\pi}{4} \bigr)^2 \Bigr\} \approx \frac{0.797062}{n} $$

(Thanks to Ian for magical assistance with the variance formula.)

From here we calculate a standard Z score, $z = \frac{x - \mu}{\sigma},$ and finally a p value.
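
A sketch of the average function method, under the assumption that each byte is scaled to a value $x \in [0, 1]$ (how ent3000 actually forms its $x$ values may differ):

```java
import org.apache.commons.math3.distribution.NormalDistribution;

public class PiSketch {
    static double pValue(byte[] samples) {
        int n = samples.length;
        double sum = 0;
        for (byte b : samples) {
            double x = (b & 0xFF) / 255.0;   // assumed byte-to-[0,1] mapping
            sum += Math.sqrt(1 - x * x);     // f(x) = sqrt(1 - x^2)
        }
        double ts = 4 * sum / n;             // the estimate of pi

        double sigma = Math.sqrt(0.797062 / n);
        double z = (ts - Math.PI) / sigma;
        return 2 * (1 - new NormalDistribution().cumulativeProbability(Math.abs(z)));
    }
}
```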

You should expect $ts \approx $ 3.14159.

Uncorrelation

This quantity measures the extent to which each byte in the file depends upon the previous byte. We simply use a standard Java library to compute the Pearson correlation coefficient $(R)$ between the original samples and an identical copy of the samples at a lag of 1.

The reasoning for calling this test Uncorrelation rather than Correlation as in the original ent is that ent displays the correlation coefficient, which is a test statistic $(ts)$. It’s not a p value. Our library returns a p value associated with the (two-sided) null hypothesis that the corresponding correlation coefficient is zero. Those are opposites, as in a good ent pass $ts \to 0$ whilst in a good ent3000 “PASS” $p \to 1$. Hence “Un…”
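
A sketch using Commons Math's PearsonsCorrelation, whose p values carry exactly the two-sided null hypothesis described above (whether this is the precise class ent3000 wraps is our assumption):

```java
import org.apache.commons.math3.stat.correlation.PearsonsCorrelation;

public class UncorrelationSketch {
    // p value for the two-sided null hypothesis that the lag-1 correlation is zero.
    static double pValue(byte[] samples) {
        int n = samples.length - 1;              // number of (s[i], s[i+1]) pairs
        double[][] data = new double[n][2];
        for (int i = 0; i < n; i++) {
            data[i][0] = samples[i] & 0xFF;      // original samples
            data[i][1] = samples[i + 1] & 0xFF;  // the same samples at a lag of 1
        }
        PearsonsCorrelation pc = new PearsonsCorrelation(data);
        double r = pc.getCorrelationMatrix().getEntry(0, 1);  // what ent displays (ts)
        System.out.println("R = " + r);
        return pc.getCorrelationPValues().getEntry(0, 1);     // what ent3000 reports
    }
}
```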