Shannon entropy $(H)$ is the information density of the contents of the samples file, expressed as a number of bits per byte. For a general source $X$ with $n$ categories of symbol:-
$$H(X) = ts = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$$
This test mirrors the old ent entropy test by featuring the same scan window of 8 bits. Unfortunately, for reasons described in our rationale, we have no closed-form solution to this. Instead we rely on simulation for the p value, taken directly from the empirical cumulative distribution function (eCDF) for the test statistic $ts$.
You should expect $ts \approx $ 8 bits/byte.
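For illustration, a minimal Java sketch of the 8 bit entropy calculation is shown below. The class name and file handling are hypothetical, and the p value lookup against the embedded eCDF is not reproduced here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Minimal sketch: Shannon entropy (bits/byte) over an 8 bit scan window. */
public class EntropyTest {

    /** Returns H(X) in bits per byte for the supplied samples. */
    static double entropy(byte[] samples) {
        long[] counts = new long[256];                  // one category per byte value
        for (byte b : samples) {
            counts[b & 0xFF]++;
        }
        double h = 0.0;
        for (long c : counts) {
            if (c == 0) continue;                       // 0 * log(0) is taken as 0
            double p = (double) c / samples.length;
            h -= p * (Math.log(p) / Math.log(2));       // -Σ P(x) log2 P(x)
        }
        return h;                                       // expect ≈ 8 for random data
    }

    public static void main(String[] args) throws IOException {
        byte[] samples = Files.readAllBytes(Paths.get(args[0]));
        System.out.printf("ts = %.6f bits/byte%n", entropy(samples));
    }
}
```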
As with the entropy test above, no closed-form determination of a p value is possible due to the overwhelming algorithmic complexity of the test. The test statistic is the combined size of the compressed samples file $S$, as:-
$$ ts = |LZMA(S)| + |BZ2(S)| $$
We rely on simulation for the p value, taken directly from the eCDF for $ts$. We also believe that two attempts at compression are better than one. The LZMA and BZ2 compression algorithms are provided by the xz-1.9.jar and commons-compress-1.21.jar libraries. These jar files should not be changed without considering a re-calibration of the test’s embedded eCDF.
You should expect $ts$ to be slightly $> 2 \times |S|$.
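As a sketch of how the test statistic might be computed with the libraries named above (commons-compress supplying the BZ2 codec and delegating LZMA to XZ for Java), the class below sums the two compressed sizes. The class name and default stream settings are illustrative assumptions, and the eCDF lookup for the p value is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;
import org.apache.commons.compress.compressors.lzma.LZMACompressorOutputStream;

/** Minimal sketch: ts = |LZMA(S)| + |BZ2(S)| using commons-compress. */
public class CompressionTest {

    static long lzmaSize(byte[] s) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (LZMACompressorOutputStream lzma = new LZMACompressorOutputStream(sink)) {
            lzma.write(s);
        }
        return sink.size();                             // |LZMA(S)| in bytes
    }

    static long bz2Size(byte[] s) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (BZip2CompressorOutputStream bz2 = new BZip2CompressorOutputStream(sink)) {
            bz2.write(s);
        }
        return sink.size();                             // |BZ2(S)| in bytes
    }

    public static void main(String[] args) throws IOException {
        byte[] s = Files.readAllBytes(Paths.get(args[0]));
        long ts = lzmaSize(s) + bz2Size(s);             // expect slightly > 2 * |S| for random data
        System.out.println("ts = " + ts + " bytes");
    }
}
```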
The chi-square test $(\chi^2)$ is one of the most common tests for the randomness of data, and is extremely sensitive to errors.
The test’s approximation to the chi-squared distribution breaks down if expected frequencies are too low. It will normally be acceptable so long as no more than 20% of the categories have expected frequencies below 5. We exceed 5 by engineering the test to have an expected 34 counts per category (actually slightly greater). If we run the math $(\text{threshold} = 34 \times \text{bits} \times 2^{\text{bits}} / 8)$, we arrive at the following minimum sample file lengths for bit windows of sizes 9 to 14. The larger the bit window, the more rigorous the test. ent uses a bit window of length 8.
| Bits | Categories | Threshold (bytes) | Min. length (bytes) |
|---|---|---|---|
| 9 | 512 | 19,584 | 25,000 |
| 10 | 1,024 | 43,520 | 50,000 |
| 11 | 2,048 | 95,744 | 100,000 |
| 12 | 4,096 | 208,896 | 300,000 |
| 13 | 8,192 | 452,608 | 500,000 |
| 14 | 16,384 | 974,848 | 1,000,000 |
And finally the p value arises from the $\chi^2$ statistic.
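The sketch below illustrates one plausible reading of the test: non-overlapping bit windows scanned MSB first, counts per category, the usual $\sum (O-E)^2/E$ statistic, and a p value from Apache Commons Math’s chi-squared distribution. The windowing and scan order are assumptions, not a statement of ent3000’s exact implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.math3.distribution.ChiSquaredDistribution;

/** Minimal sketch: chi-square over non-overlapping n-bit windows, p value from the chi-squared distribution. */
public class ChiSquareTest {

    /** Returns {chiSquared, pValue} for a window of `bits` bits (9..14 in ent3000). */
    static double[] chiSquare(byte[] samples, int bits) {
        int categories = 1 << bits;
        long[] observed = new long[categories];

        long window = 0;
        int filled = 0;
        long totalWindows = 0;
        for (byte b : samples) {
            for (int i = 7; i >= 0; i--) {              // MSB-first bit scan (assumed)
                window = (window << 1) | ((b >> i) & 1);
                if (++filled == bits) {
                    observed[(int) window]++;
                    totalWindows++;
                    window = 0;
                    filled = 0;
                }
            }
        }

        double expected = (double) totalWindows / categories;   // >= 34 by design
        double chi2 = 0.0;
        for (long o : observed) {
            double d = o - expected;
            chi2 += d * d / expected;
        }
        double p = 1.0 - new ChiSquaredDistribution(categories - 1).cumulativeProbability(chi2);
        return new double[] { chi2, p };
    }

    public static void main(String[] args) throws IOException {
        byte[] samples = Files.readAllBytes(Paths.get(args[0]));
        double[] r = chiSquare(samples, 14);
        System.out.printf("chi2 = %.2f, p = %.4f%n", r[0], r[1]);
    }
}
```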
This is simply the result of summing all the bytes in the samples file and dividing by its length. If those bytes are random, the expectation should be 127.5. The variance of the mean is the variance of a single byte divided by the number of bytes, and for a discrete uniform distribution over single byte values that variance is $\frac{256^2-1}{12}$. Thus for $n$ bytes:-
$$ \sigma^2 = \frac{256^2-1}{12n} = \frac{5461.25}{n} $$
From here we calculate a standard Z score, $z = \frac{x - \mu}{\sigma},$ and finally a p value.
You should expect $ts \approx $ 127.5.
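A minimal sketch of the whole calculation follows; treating the p value as two-sided is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.math3.distribution.NormalDistribution;

/** Minimal sketch: mean-of-bytes test with a two-sided p value from a Z score. */
public class MeanTest {

    public static void main(String[] args) throws IOException {
        byte[] samples = Files.readAllBytes(Paths.get(args[0]));
        int n = samples.length;

        double sum = 0.0;
        for (byte b : samples) {
            sum += (b & 0xFF);                          // treat bytes as unsigned 0..255
        }
        double ts = sum / n;                            // expect ≈ 127.5

        double mu = 127.5;
        double sigma = Math.sqrt(5461.25 / n);          // sqrt((256^2 - 1) / (12 n))
        double z = (ts - mu) / sigma;
        double p = 2.0 * new NormalDistribution().cumulativeProbability(-Math.abs(z));

        System.out.printf("ts = %.4f, z = %.4f, p = %.4f%n", ts, z, p);
    }
}
```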
To compute Monte Carlo estimates of $\pi$, we appropriate the function $f(x) = \sqrt{1 - x^2}$. The average function method is more accurate than ent’s area method because it has a smaller variance, and thus is more rigorous. The graph of the function on the interval $[0,1]$ is shown in the plot below. The curve forms a quarter circle of unit radius, and the exact area under the curve is $\frac{\pi}{4}$.
We leverage the fact that the average value of a continuous function $f$ on the interval $[a, b]$ is given by the integral $f_{avg} = \frac{1}{b-a} \int_a^b f(x)\, dx$. That implies that for our curve:-
$$ f_{avg} = \int_0^1 \sqrt{1 - x^2}\, dx = \frac{\pi}{4} $$
So after estimating the left hand side of the equation, we multiply the estimate by 4 to estimate $\pi$. And that’s its mean value, $\mu$.
By a little magic, and by computing $E\bigl[(\sqrt{1-X^2})^2\bigr]=\int_0^1 (1-x^2)\, dx = \frac{2}{3}$ so that the variance of $\sqrt{1-X^2}$ is $E[1-X^2]-E\bigl[\sqrt{1-X^2}\bigr]^2=\frac{2}{3}-\bigl(\frac{\pi}{4}\bigr)^2$, we determine that the variance of this estimate of $\pi$ over $n$ samples is:-
$$ \sigma^2 = \frac{16}{n} \Bigl\{ \frac{2}{3} - \Bigl( \frac{\pi}{4} \Bigr)^2 \Bigr\} \approx \frac{0.797062}{n} $$
(Thanks to Ian for magical assistance with the variance formula.)
From here we calculate a standard Z score, $z = \frac{x - \mu}{\sigma},$ and finally a p value.
You should expect $ts \approx $ 3.14159.
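The sketch below assumes, purely for illustration, that each non-overlapping group of three bytes is mapped to an $x \in [0,1)$; the mapping actually used by ent3000 may differ. The variance formula above supplies $\sigma$.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.math3.distribution.NormalDistribution;

/** Minimal sketch: average-function Monte Carlo estimate of pi from the samples file. */
public class MonteCarloPiTest {

    public static void main(String[] args) throws IOException {
        byte[] s = Files.readAllBytes(Paths.get(args[0]));

        // Illustrative assumption: map each non-overlapping group of three bytes to x in [0,1).
        int n = s.length / 3;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            long v = ((s[3 * i] & 0xFFL) << 16) | ((s[3 * i + 1] & 0xFFL) << 8) | (s[3 * i + 2] & 0xFFL);
            double x = v / (double) (1 << 24);
            sum += Math.sqrt(1.0 - x * x);              // f(x) = sqrt(1 - x^2)
        }
        double ts = 4.0 * sum / n;                      // expect ≈ 3.14159

        double sigma = Math.sqrt(0.797062 / n);         // (16/n) * (2/3 - (pi/4)^2)
        double z = (ts - Math.PI) / sigma;
        double p = 2.0 * new NormalDistribution().cumulativeProbability(-Math.abs(z));

        System.out.printf("ts = %.5f, z = %.4f, p = %.4f%n", ts, z, p);
    }
}
```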
This quantity measures the extent to which each byte in the file depends upon the previous byte. We simply use a standard Java library to compute the Pearson correlation coefficient $(R)$ between the original samples and the same samples at a lag of 1.
The reasoning for calling this test Uncorrelation rather than Correlation, as in the original ent, is that ent displays the correlation coefficient, which is a test statistic $(ts)$, not a p value. Our library returns a p value associated with the (two-sided) null hypothesis that the corresponding correlation coefficient is zero. Those are opposites: in a good ent pass $ts \to 0$, whilst in a good ent3000 “PASS” $p \to 1$. Hence “Un…”
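A sketch of the lag-1 calculation using Apache Commons Math’s PearsonsCorrelation is shown below. Deriving the two-sided p value via a t statistic is an assumption about the library call, not a statement of ent3000’s exact code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.math3.distribution.TDistribution;
import org.apache.commons.math3.stat.correlation.PearsonsCorrelation;

/** Minimal sketch: lag-1 Pearson correlation of the samples against themselves, with a p value. */
public class UncorrelationTest {

    public static void main(String[] args) throws IOException {
        byte[] s = Files.readAllBytes(Paths.get(args[0]));
        int n = s.length - 1;                           // number of (s[i], s[i+1]) pairs

        double[] x = new double[n];
        double[] y = new double[n];
        for (int i = 0; i < n; i++) {
            x[i] = s[i] & 0xFF;                         // original samples
            y[i] = s[i + 1] & 0xFF;                     // the same samples lagged by 1
        }

        double r = new PearsonsCorrelation().correlation(x, y);

        // Two-sided p value for H0: the correlation coefficient is zero.
        double t = r * Math.sqrt((n - 2) / (1.0 - r * r));
        double p = 2.0 * new TDistribution(n - 2).cumulativeProbability(-Math.abs(t));

        System.out.printf("R = %.6f, p = %.4f%n", r, p); // a good ent3000 "PASS" has p -> 1
    }
}
```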