# Proving randomness

Or, how can we prove to ourselves that output from a TRNG is indeed truly random?

“It seems natural to call a chain random if it cannot be written down in a more condensed form, i.e., if the shortest program for generating it is as long as the chain itself. “

-Andrey N Kolmogorov. Randomness tests with only data as input can mathematically and thus conclusively prove non-randomness, but not vice versa. It is impossible to prove that a chain of numbers you’re given is truly random without further information about how they came to be. Computational indistinguishability means that a truly random sequence looks very much like a pseudo random sequence when tested within polynomial time.

This is the paradigm that vexes users of all commercial TRNGs concerned about their device’s true security. A deep question thus arises from Kolmogorov’s quotation – what is the shortest program that describes your data set? Or more simply, just what made it exactly? Where did the initial uncertainty/secrets come from?

True (Kolmogorov) random sequences possess three fundamental properties. Each of these may be taken as the basis of a definition of randomness:-

1. Being typical. There are many typical sequences and that each typical sequence belongs to a reasonable majority.
2. Having stability of frequencies. By definition, a binary sequence is stochastic whenever any suitably chosen subsequence of it has frequency stability.
3. Being chaotic. Being chaotic signifies a complexity of structure. Then the fact that a sequence is chaotic signifies that the complexity of its beginning segments grows sufficiently fast.

Properties one and two can be achieved by pseudo random number generators. They can be tested against a specific null hypothesis ($H_0$) that they generate IID data using myriad statistical tests. Tests like Chi, Non-overlapping Template Matching Test, Maurer’s Universal Statistical Test, Parking Lot Test, etc. For each applied test, a decision or conclusion is derived that accepts or rejects $H_0$, i.e., whether the generator is (or is not) producing random looking values, based on the sequence that was produced.

Let’s test two distribution ensembles $C$ and $R$ over domain $D$, which for IID RNGs is typically the range of a byte (0 – 255), a bit (0 - 1) or $\{0,1\}^l$ generally. $C$ is generated by a CSPRNG and $R$ is the ideal theoretical ensemble expected of a perfectly true random source. We can calculate a statistical distance between them by summing up the various differential probabilities as:-

$$\Delta(n) = \frac{1}{2} \sum_{\alpha \in D} | \mathbb{P}[C_n = \alpha] - \mathbb{P}[R_n = \alpha] |$$

All cryptographic RNGs will offer a negligible concrete $\Delta(n)$ which NIST defines as $\le 2^{-64}$. Consequently they can’t be differentiated mathematically (in polynomial time). Even the non cryptographic Mersenne Twister (as in Python) makes a good fist of $\Delta(n)$. Simply put, the distribution ensembles look the same for all intents and purposes. So far so good.

Property three is where your problem and our interest lies. Is it truly chaotic or just appears complex? The binary expansions of irrational numbers like $\pi$ , $e$ and $\sqrt{2}$ are widely regarded as being perfectly pseudo random, but are nevertheless calculable and therefore predicable. Yet these constants pass all randomness tests. Indeed, are used as test data sets to verify the tests themselves (like NIST SP 800-22, Appendix B).

They are not truly random though. The entirety of $\pi$ can be expressed with a Kolmogorov complexity of just the following few mathematical symbols:-

$$\pi = \sum_{0}^{\infty} \left( \frac{4}{8n+1} - \frac{2}{8n+4} - \frac{1}{8n+5} - \frac{1}{8n+6} \right) \left( \frac{1}{16} \right)^n$$

And similarly 1MB of perfectly random looking data can actually consist of only three lines of Java as:-

SecureRandom random = SecureRandom();
byte numbers[] = new byte[1_000_000];
random.nextBytes(numbers);


or even less if you have access to the ChaCha20 powered Linux random number generator:-

dd if=/dev/urandom of=pseudo-random-chain bs=1M count=1K


which very quickly produces 1GB of pseudo random bytes that are specifically and very carefully designed to pass any and all cryptographic strength randomness tests.

So the answer to the problem of property three is that we “don’t know” and you can’t tell. There very well may be a program much shorter than your data set, covertly or via blunder operating as $\operatorname{G}: \{0,1\}^n \to \ \{0,1\}^m$ with $m \gg n$. The data would not possess information theoretic security, and therefore would not be as secure as possible. And perhaps even predictable by sufficiently resourced actors. By it’s very definition, such a sequence could not create a one time pad. It’s just not possible for anyone to tell what the original generator of your sample chain is if it comes out of a sealed black box .

Ref. A.N.KOLMOGOROV AND V.A.USPENSKII (Translated by Bernard Seckler), ALGORITHMS AND RANDOMNESS, THEORY PROBAB. APPL., Vol 32, No. 3.