Entropy Shmear

Consider the raw samples in the $\pi$ sequence.

Naive application of Shannon’s $H = -K\sum_{i} {p_i \log p_i}$ entropy equation suggests an entropy rate of 4.62 bits/byte for the sequence. A cmix compression estimate is only 0.6 bits/byte. What has happened to the missing 4.02 bits of entropy estimated to be in each byte?
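
As a rough sketch of how the two estimates are obtained (pi_bytes.bin is a hypothetical file name for the raw samples, and Python's lzma is used only as a stand-in for cmix, so its figure will typically land well above cmix's 0.6 bits/byte) :-

# Naive byte-wise Shannon estimate vs. a crude compression estimate.
# pi_bytes.bin is a hypothetical file of the raw samples; lzma is only a
# stand-in for cmix and will typically give a looser (higher) figure.
import lzma
import math
from collections import Counter

data = open("pi_bytes.bin", "rb").read()

# Shannon estimate: treat each byte as an IID draw from the byte histogram.
counts = Counter(data)
h_shannon = -sum((c / len(data)) * math.log2(c / len(data)) for c in counts.values())

# Compression estimate: compressed size in bits per input byte.
h_compressed = 8 * len(lzma.compress(data, preset=9)) / len(data)

print(f"Shannon estimate:     {h_shannon:.2f} bits/byte")
print(f"Compression estimate: {h_compressed:.2f} bits/byte")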

An alternative way of thinking about correlation is that the missing 4.02 bits of entropy supposedly within each 8-bit byte window are elsewhere, outside of that narrow window of consideration. The shmeared bits break the IID criterion, overlapping many other bytes. We could say with a grin that:-

  • Each byte does contain 0.6 bits of entropy.
  • The missing 4.02 bits are shmeared out across adjoining bytes and words due to autocorrelation.
  • They live outside of the enumeration of $i$, due to its 8-bit window.
  • Thus the common means of applying Shannon cannot see nor account for them.
  • The entropy shmear due to autocorrelation, $ES$, is $ \frac{H_{Shannon} - H_{compression}}{H_{Shannon}} \times 100\% $, or 87% in this case (worked just below).
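
Plugging the two estimates above into that formula:-

$ ES = \frac{4.62 - 0.6}{4.62} \times 100\% \approx 87\% $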

Clearly, in an IID situation, $ H_{Shannon} - H_{compression} \approx 0 $, since a perfectly random sequence is incompressible and has very low autocorrelation[1].

Might $ES$ be an alternative way of viewing autocorrelation? We could say that $\pi$ has an autocorrelation coefficient of 0.1. Or, we can imagine 87% of the bits that would have made $\pi$ an IID sequence being shmeared out across adjacent octets; 87% of them are strongly dependent on others. Anyway, it’s a thought, isn’t it ¯\_(ツ)_/¯
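
For the curious, a lag-1 serial correlation estimate of the kind ent reports can be sketched in a few lines of Python (pi_bytes.bin is again a hypothetical file name, and this is a plain Pearson estimate rather than ent's exact formula) :-

# Lag-1 serial correlation of the byte values: a plain Pearson estimate,
# comparable to (but not exactly) the figure ent reports.
# pi_bytes.bin is a hypothetical file name for the raw samples.
data = open("pi_bytes.bin", "rb").read()
n = len(data)

mean = sum(data) / n
var = sum((b - mean) ** 2 for b in data) / n

# Correlate each byte with its successor; the final byte has no successor pair.
cov = sum((data[i] - mean) * (data[i + 1] - mean) for i in range(n - 1)) / (n - 1)

print(f"Serial correlation coefficient: {cov / var:+.6f}")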


Notes:

[1] An ent test of 1GB of /dev/urandom samples, showing that autocorrelation is never exactly zero, even for cryptographically strong pseudo-random sequences :-

Entropy = 8.000000 bits per byte.

Optimum compression would reduce the size
of this 1073741824 byte file by 0 percent.

Chi square distribution for 1073741824 samples is 271.72, and randomly
would exceed this value 22.54 percent of the times.

Arithmetic mean value of data bytes is 127.5000 (127.5 = random).
Monte Carlo value for Pi is 3.141575564 (error 0.00 percent).
Serial correlation coefficient is -0.000034 (totally uncorrelated = 0.0).

[2] And the correlation coefficient shrinks with the length of the test sequence (for a truly random source its typical magnitude falls off roughly as $1/\sqrt{n}$). This is a problem for DIY TRNG builders, as the entropy sources we have the resources to build cannot easily produce gigabytes of test data. The Arduino-Entropy-Library produces “approximately two 32-bit integer values every second”. That would take four years to generate a similar 1GB sample. Even the venerable HotBits only produces random bits at ~800 bps.
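
A quick sketch of that effect: for truly random bytes the typical magnitude of the lag-1 serial correlation shrinks roughly as $1/\sqrt{n}$, so short, DIY-sized samples will always show visibly larger coefficients than the 1GB run above :-

# The typical |serial correlation| of truly random bytes shrinks roughly
# as 1/sqrt(n) with sample length n, so small samples look "more correlated".
import os
import statistics

def lag1_corr(data):
    n = len(data)
    mean = sum(data) / n
    var = sum((b - mean) ** 2 for b in data) / n
    cov = sum((data[i] - mean) * (data[i + 1] - mean) for i in range(n - 1)) / (n - 1)
    return cov / var

for n in (1_000, 10_000, 100_000, 1_000_000):
    # Average |r| over a few /dev/urandom-backed trials to smooth the noise.
    trials = [abs(lag1_corr(os.urandom(n))) for _ in range(5)]
    print(f"n = {n:>9,}   mean |r| = {statistics.mean(trials):.5f}   1/sqrt(n) = {n ** -0.5:.5f}")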