Entropy Shmear

Consider the raw Zener samples in the previous $\pi$ sequence.

Naive application of Shannon’s entropy equation, $H(\pi) = -K\sum_{i \in \{0, 1\}^n} p_i \log p_i$, suggests an entropy rate of 6.75 bits/byte for the sequence if $n=8$ (as is typical). A cmix compression estimate is only 1.05 bits/byte. What has happened to the missing 5.7 bits of Shannon entropy estimated to be in each byte?
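
For concreteness, here is a minimal Python sketch of that naive byte-level calculation, roughly what ent does for $n = 8$: tally the byte frequencies and push them through Shannon’s formula. The filename samples.bin is just a stand-in for the raw sample file.

# Minimal sketch: byte-level Shannon entropy, roughly what ent reports for n = 8.
# "samples.bin" is a stand-in filename for the raw Zener sample file.
import math
from collections import Counter

def shannon_bits_per_byte(path):
    data = open(path, "rb").read()
    counts = Counter(data)                     # histogram over the 256 possible byte values
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(shannon_bits_per_byte("samples.bin"))    # ~6.75 bits/byte for the pi sequence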

An alternative way of thinking about correlation is that the missing 5.7 bits of entropy supposedly within each 8-bit window are elsewhere, outside of that narrow window of consideration. The shmeared bits break the IID criterion, overlapping many other bytes. We could say with enthusiasm and a grin that:-

  • Each byte does contain 1.05 bits of true entropy based on $n \gg 8$.
  • The missing 5.7 bits are shmeared out across adjoining bytes and words due to autocorrelation.
  • They live outside of the enumeration of $i$, due to its 8-bit window (a probability distribution $P(\pi)$ based on $n = 8$).
  • Thus the common means of applying Shannon can neither see nor account for them.
  • Compute the entropy $H_{compression}$ measured via cmix compression as $ \frac{|Cmix(\pi)| - k}{|\pi|} $, with the compressed size counted in bits and the length $|\pi|$ in bytes, so that the result comes out in bits per byte.
  • And $k$ is a constant representing the compressor’s file overhead, found from $ |Cmix(U)| - |U| $.
  • $U$ is a good pseudo-random file, such as one from /dev/urandom, of length equal to $|\pi|$.
  • Hint: use ent to measure $H_{Shannon}$.
  • The entropy shmear due to autocorrelation, $ES$, is $ \frac{H_{Shannon} - H_{compression}}{H_{Shannon}} $, or 84% for our $\pi$ samples (a worked sketch follows this list).
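
A worked sketch of the recipe above, assuming all file sizes are measured in bytes (hence the factor of 8). The numbers below are placeholders rather than real cmix results; in practice they would be the sizes obtained by compressing the $\pi$ samples and an equal-length /dev/urandom file $U$.

# Worked sketch of the recipe above. All sizes are in bytes; the numbers are
# placeholders rather than real cmix results.
len_pi      = 1_200_000    # |pi|: length of the raw sample file (placeholder)
len_cmix_pi = 158_000      # |Cmix(pi)|: compressed size of the samples (placeholder)
len_U       = 1_200_000    # |U|: /dev/urandom file of the same length as pi
len_cmix_U  = 1_200_350    # |Cmix(U)|: urandom barely compresses (placeholder)

k = len_cmix_U - len_U                           # the compressor's fixed file overhead
H_compression = 8 * (len_cmix_pi - k) / len_pi   # bits per byte (sizes are in bytes)
H_shannon = 6.75                                 # bits per byte, from ent

ES = (H_shannon - H_compression) / H_shannon
print(f"H_compression ~ {H_compression:.2f} bits/byte, ES ~ {ES:.0%}")   # ~1.05 and ~84%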

Clearly in an IID situation, $ H_{Shannon} - H_{compression} \approx 0 $ [1]. Obviously an approximation, but might $ES$ be an alternative way of viewing autocorrelation? We could say that $\pi$ has an autocorrelation coefficient of 0.1. Or we could imagine that 84% of the bits that would have made $\pi$ an IID sequence have been shmeared out to adjacent bytes; 84% of them are strongly dependent on others.
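
For comparison, the serial correlation figure that ent reports is essentially the lag-1 correlation between successive bytes. A rough sketch, again using the stand-in filename samples.bin:

# Sketch: lag-1 serial correlation of successive bytes, close to the figure ent reports.
def serial_correlation(path):
    data = open(path, "rb").read()
    x, y = data[:-1], data[1:]                    # successive byte pairs
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov   = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5           # ~0 for an IID byte stream

print(serial_correlation("samples.bin"))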

Anyway, it’s a thought, isn’t it? ¯\_(ツ)_/¯


Notes:

[1] An ent test of 1.2 MB of /dev/urandom samples, showing that autocorrelation is never exactly zero, even for cryptographically strong pseudo-random sequences :-

$ ent /tmp/urandom
Entropy = 7.999855 bits per byte.

Optimum compression would reduce the size
of this 1200000 byte file by 0 percent.

Chi square distribution for 1200000 samples is 242.00, and randomly
would exceed this value 71.10 percent of the times.

Arithmetic mean value of data bytes is 127.5589 (127.5 = random).
Monte Carlo value for Pi is 3.144180000 (error 0.08 percent).
Serial correlation coefficient is 0.000989 (totally uncorrelated = 0.0).

[2] And the correlation coefficient shrinks with the length of the test sequence (for an IID sequence its expected magnitude falls roughly as $1/\sqrt{N}$). This is a problem for DIY TRNG builders, as the entropy sources we have the resources to build cannot easily produce gigabytes of test data. The Arduino-Entropy-Library produces “approximately two 32-bit integer values every second”; at that rate it would take over 40 hours to generate a similar 1.2 MB sample. Even the venerable HotBits only produces random bits at ~800 bps.
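
A back-of-the-envelope check of those collection times, using the rates quoted above:

# Time needed to collect a 1.2 MB test file at the quoted generation rates.
target_bits = 1_200_000 * 8    # the 1.2 MB sample size used in note [1]

arduino_bps = 2 * 32           # "approximately two 32-bit integer values every second"
hotbits_bps = 800              # HotBits, ~800 bits per second

print(f"Arduino: {target_bits / arduino_bps / 3600:.1f} hours")   # about 42 hours
print(f"HotBits: {target_bits / hotbits_bps / 3600:.1f} hours")   # about 3.3 hours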