Consider the 1 MB of concatenated raw JPEG samples from the Photonic Instrument.
Naive application of Shannon’s $H = -K\sum_{i} {p_i \log p_i}, \enspace i \in \{0, 1\}^n$ entropy equation suggests a Shannon entropy rate of = 7.67 bits/byte for the JPEGs if $n=8$ (as is typical). A cmix compression estimate is only 5.17 bits/byte. What has happened to the missing 2.5 bits of Shannon entropy estimated to be in each byte? Our/an alternative way of thinking about auto correlation is that the missing 2.5 bits of entropy supposedly within each 8 bit window are elsewhere; outside of that narrow 8 bit window of consideration. The missing shmeared bits break the IID criterion, overlapping with many other bytes. We could say with enthusiasm and a grin that:-
Our novel Entropy Shmear ($ES$) metric can be calculate as:-
cmix
compression as $ \frac{|Cmix(\pi)| - k}{|\pi|} \times 8$. $|Cmix(\pi)| = 697,126 $ bytes in this case.cmix
./dev/urandom
, of length equivalent to $|\pi|$.Clearly in an IID situation, $ H_{Shannon} - H_{compression} - k \approx 0 $ [1]. Obviously an approximation, but might $ES$ be an alternative way of viewing auto correlation? We could say that $\pi$ has an auto correlation coefficient of >0.01. Or we can imagine 33% of $\pi$’s bits that would have made an IID sequence, being shmeared out to adjacent bytes. 33% of any byte in $\pi$ is strongly dependant on others.
Anyway, it’s a thought isn’t it ¯\_(ツ)_/¯
Notes:
[1] An ent
test of 1.2 MB of /dev/urandom
samples showing that auto correlation is never exactly zero, even for cryptographically strong pseudo random sequences :-
$ ent /tmp/u
Entropy = 7.999855 bits per byte.
Optimum compression would reduce the size
of this 1200000 byte file by 0 percent.
Chi square distribution for 1200000 samples is 242.00, and randomly
would exceed this value 71.10 percent of the times.
Arithmetic mean value of data bytes is 127.5589 (127.5 = random).
Monte Carlo value for Pi is 3.144180000 (error 0.08 percent).
Serial correlation coefficient is 0.000989 (totally uncorrelated = 0.0).
[2] And the correlation coefficient is inversely proportional to the length of an IID test sequence. This is a problem for DIY TRNG builders as the entropy sources we have resources to build cannot easily produce Gigabytes of test data. The Arduino-Entropy-Library produces “approximately two 32-bit integer values every second”. It would take over 41 hours to generate a similar 1.2 MB sample! Even the venerable HotBits only produces entropy at $\approx$ 800 bps.