Shannon entropy is the quantitative measure of information produced by a random source. For a discrete random variable $X$ taking values $x_1, \dots, x_n$ with probabilities $p_1, \dots, p_n$:
$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$
Introduced in Claude Shannon’s 1948 paper “A Mathematical Theory of Communication”, this formula launched information theory as a discipline and remains the mathematical foundation of digital communication and data compression.
1. What the formula measures
Intuitively, $H(X)$ is the average number of bits needed to describe one draw from the distribution of $X$, assuming an optimal code and base-2 logarithms. A deterministic source ($p_i = 1$ for one $i$, all others 0) has entropy zero: its output is known in advance, so observing it conveys no information. A uniform source over $n$ outcomes has entropy $\log_2 n$ bits, the maximum for a support of size $n$.
The base of the logarithm is a choice of unit:
- $\log_2$ → bits
- $\ln$ (base $e$) → nats
- $\log_{10}$ → hartleys
In pure mathematics and physics, the natural logarithm is usually preferred; in engineering and computer science, base 2.
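The definition and the choice of unit can be illustrated with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `entropy` and the example distributions are chosen here purely for illustration.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution, in units set by `base`."""
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probs must be non-negative and sum to 1")
    # Outcomes with p == 0 are skipped: p * log(1/p) tends to 0 as p -> 0.
    nats = sum(p * math.log(1.0 / p) for p in probs if p > 0)
    return nats / math.log(base)  # convert from nats to the requested unit

deterministic = [1.0, 0.0, 0.0, 0.0]      # one certain outcome
uniform = [0.25] * 4                      # four equally likely outcomes
skewed = [0.5, 0.25, 0.125, 0.125]

print(entropy(deterministic))             # zero: nothing is learned
print(entropy(uniform))                   # about 2 bits, i.e. log2(4)
print(entropy(skewed))                    # about 1.75 bits, below the uniform case
print(entropy(uniform, base=math.e))      # the same uncertainty in nats, about ln(4)
```

Changing `base` only rescales the result by a constant factor, which is why the base is a choice of unit rather than a different measure.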
2. Axiomatic characterization
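Shannon showed that the formula is forced by three requirements that any reasonable measure of uncertainty $H(p_1, \dots, p_n)$ should satisfy:
- Continuity in the probabilities $p_i$.
- Monotonicity: for uniform distributions ($p_i = 1/n$), more outcomes means more uncertainty, so $H$ increases with $n$.
- Additivity under grouping: if a choice is broken into successive stages, the total uncertainty is the weighted sum of the uncertainties of the stages. Additivity for independent experiments is a special case.
The unique function (up to a positive multiplicative constant $K$, which fixes the unit) satisfying all three is $H = -K \sum_i p_i \log p_i$.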
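The additivity requirement can be checked numerically, both for independent experiments and in the stronger staged form, where a choice is split into successive steps. The sketch below uses the probabilities 1/2, 1/3, 1/6 from the decomposition example in Shannon's paper; the helper name `entropy_bits` and the marginals `px`, `py` are illustrative assumptions.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Independent experiments: the joint distribution is the product of the marginals.
px = [0.5, 0.5]
py = [0.2, 0.8]
joint = [a * b for a in px for b in py]
h_joint = entropy_bits(joint)
h_sum = entropy_bits(px) + entropy_bits(py)

# Staged choice: first pick one of two halves, then (with probability 1/2)
# split the second half 2/3 : 1/3. Overall outcomes have probability 1/2, 1/3, 1/6.
h_direct = entropy_bits([1/2, 1/3, 1/6])
h_staged = entropy_bits([1/2, 1/2]) + 0.5 * entropy_bits([2/3, 1/3])

print(h_joint, h_sum)        # the two values agree up to rounding
print(h_direct, h_staged)    # the two values agree up to rounding
```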
3. The source coding theorem
Shannon’s source coding theorem establishes the operational meaning:
A source with entropy $H$ bits per symbol cannot be compressed losslessly to fewer than $H$ bits per symbol on average, and rates arbitrarily close to $H$ are achievable with sufficiently long block codes.
This is a hard lower bound on the expected code length for a source with known statistics. Lossless compression methods such as Huffman coding, arithmetic coding, and the LZ77/LZ78 family (on which formats like gzip and brotli build) are practical ways of approaching it.
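To see the bound in action, the sketch below builds a binary Huffman code for a known distribution and compares its average codeword length with the entropy. It is an illustrative symbol-by-symbol coder, not a production compressor; the function names and the two example distributions are assumptions made for this example.

```python
import heapq
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def huffman_code_lengths(probs):
    """Codeword length in bits that a binary Huffman code assigns to each symbol."""
    if len(probs) == 1:
        return [1]
    # Heap entries: (subtree probability, unique tie-breaker, symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # two least probable subtrees
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1               # each merge adds one bit to every symbol below it
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def average_length(probs, lengths):
    """Expected number of bits per symbol under the given code lengths."""
    return sum(p * l for p, l in zip(probs, lengths))

for probs in ([0.5, 0.25, 0.125, 0.125], [0.9, 0.1]):
    lengths = huffman_code_lengths(probs)
    print(lengths, average_length(probs, lengths), entropy_bits(probs))
```

For the first distribution the probabilities are powers of 1/2 and the average length matches the entropy. For the second, a symbol-by-symbol code cannot spend less than one bit per symbol even though the entropy is well below one bit; closing that gap is what coding over longer blocks is for.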
4. Continuous entropy and the Gaussian
For a continuous random variable $X$ with density $f$, the differential entropy is
$$h(X) = -\int f(x) \log f(x)\, dx$$
A remarkable result: among all densities on $\mathbb{R}$ with fixed mean $\mu$ and variance $\sigma^2$, the one with maximum differential entropy is the Gaussian $\mathcal{N}(\mu, \sigma^2)$. This is one reason the Gaussian appears so often: it is the least informative distribution consistent with given first and second moments.
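The maximum-entropy property can be illustrated by comparing the Gaussian with two other common densities at the same variance. The sketch below uses the standard closed-form differential entropies (in nats) of the Gaussian, Laplace, and uniform distributions; the choice of these comparison families and the function names are assumptions for illustration, and three examples do not constitute a proof.

```python
import math

def h_gaussian(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def h_laplace(sigma):
    """Laplace with scale b has variance 2*b^2 and entropy 1 + ln(2*b)."""
    b = sigma / math.sqrt(2)
    return 1 + math.log(2 * b)

def h_uniform(sigma):
    """Uniform on an interval of width w has variance w^2/12 and entropy ln(w)."""
    w = sigma * math.sqrt(12)
    return math.log(w)

sigma = 1.0
print(h_gaussian(sigma))  # largest of the three
print(h_laplace(sigma))
print(h_uniform(sigma))   # smallest of the three
```

The ordering is the same for any positive `sigma`, because each of the three entropies grows by the same amount, the logarithm of the scaling factor, when the standard deviation is rescaled.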
5. Connections beyond information theory
Statistical mechanics
Shannon entropy is mathematically identical (up to Boltzmann’s constant $k_B$ and the choice of logarithm base) to the Gibbs entropy in statistical mechanics:
$$S = -k_B \sum_i p_i \ln p_i$$
The equivalence is not coincidental — both describe uncertainty about microstates given macroscopic information. Jaynes’s maximum entropy principle (1957) unified the two frameworks.
Machine learning
Cross-entropy loss, ubiquitous in classification, is a direct descendant:
$$H(p, q) = -\sum_x p(x) \log q(x)$$
where $p$ is the empirical distribution of labels and $q$ is the model’s predicted distribution. Minimizing cross-entropy is equivalent to maximum-likelihood estimation under a categorical model.
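The link to maximum likelihood can be made concrete in the simplest setting, where the model predicts a single categorical distribution for every sample. In the sketch below the labels and the model distribution `q` are made-up values, and the function name `cross_entropy` is chosen for illustration; the point is only that the cross-entropy against the empirical label distribution equals the average negative log-likelihood.

```python
import math

def cross_entropy(p, q):
    """Cross-entropy of q relative to p, in nats: minus the sum of p(x) * ln q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

labels = [0, 2, 1, 0, 0, 2]   # hypothetical class labels
q = [0.5, 0.2, 0.3]           # hypothetical model distribution over the 3 classes

n = len(labels)
p_emp = [labels.count(k) / n for k in range(len(q))]   # empirical label distribution

avg_nll = -sum(math.log(q[y]) for y in labels) / n     # average negative log-likelihood

print(cross_entropy(p_emp, q), avg_nll)   # the two values agree up to rounding
print(cross_entropy(p_emp, p_emp))        # the smallest achievable value: the entropy of p_emp
```

Since the two quantities coincide, lowering the cross-entropy raises the likelihood of the observed labels, and the minimum is reached when `q` matches the empirical distribution.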
Cryptography
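Shannon’s later paper “Communication Theory of Secrecy Systems” (1949) defined perfect secrecy in entropy terms: the ciphertext must reveal nothing about the message, i.e. $H(M \mid C) = H(M)$. He proved that the one-time pad achieves perfect secrecy and that any perfectly secret cipher needs a key with at least as much entropy as the message, $H(K) \ge H(M)$. Practical cryptosystems use much shorter keys and therefore rely on computational rather than information-theoretic security.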
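The perfect-secrecy property of the one-time pad can be checked by brute force on a toy message space. The sketch below assumes 2-bit messages, a uniformly random 2-bit key, and XOR as the encryption operation; the variable names are illustrative.

```python
from collections import Counter

n_bits = 2
space = range(2 ** n_bits)    # all 2-bit values, used for messages, keys, ciphertexts

# For each message, tabulate the ciphertext distribution over a uniformly random key.
cipher_dist = {}
for m in space:
    counts = Counter(m ^ k for k in space)            # one-time pad: c = m XOR k
    cipher_dist[m] = [counts[c] / len(space) for c in space]

for m, dist in cipher_dist.items():
    print(m, dist)
```

Every message produces the same uniform ciphertext distribution, so observing the ciphertext does not change the probability of any message. The price is a key as long as the message.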
6. Mathematical extensions
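- Conditional entropy: $H(Y \mid X) = -\sum_{x,y} p(x,y) \log p(y \mid x) = H(X,Y) - H(X)$
- Mutual information: $I(X;Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X,Y)$
- KL divergence: $D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$
- Rényi entropy: $H_\alpha(X) = \frac{1}{1-\alpha} \log \sum_i p_i^\alpha$, a parametric family generalizing Shannon entropy, which is recovered in the limit $\alpha \to 1$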
Each has become a central tool in its own field — statistics, machine learning, ergodic theory, quantum information.
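The first three quantities are closely related, which a small numerical example makes visible. The sketch below starts from a made-up 2×2 joint distribution and computes the conditional entropy via the chain rule, then the mutual information both from entropies and as a KL divergence between the joint and the product of its marginals. All names and numbers are illustrative assumptions.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

px = [sum(row) for row in joint]            # marginal of X
py = [sum(col) for col in zip(*joint)]      # marginal of Y
flat = [p for row in joint for p in row]    # joint as a flat list, x-major order

h_y_given_x = entropy_bits(flat) - entropy_bits(px)   # chain rule: H(Y|X) = H(X,Y) - H(X)
mi = entropy_bits(py) - h_y_given_x                   # I(X;Y) = H(Y) - H(Y|X)

product = [a * b for a in px for b in py]             # independent model, same order as flat
mi_as_kl = kl_bits(flat, product)                     # I(X;Y) = D(p(x,y) || p(x)p(y))

print(h_y_given_x, mi, mi_as_kl)
```

Both routes give the same mutual information, and it is strictly positive here because the joint distribution concentrates on the diagonal, so knowing `X` reduces the uncertainty about `Y`.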
Further reading
- Shannon, A Mathematical Theory of Communication (1948) — the founding paper, still worth reading.
- Cover & Thomas, Elements of Information Theory — the standard graduate text.
- MacKay, Information Theory, Inference, and Learning Algorithms — applied, freely available online.
Frequently asked
Why does Shannon entropy use a logarithm?
Because information is additive for independent sources: if X and Y are independent, the joint probability factors as p(x,y) = p(x)·p(y), and we want H(X,Y) = H(X) + H(Y). The logarithm is the unique continuous function (up to a constant factor, which amounts to the choice of base) that turns products into sums. Base-2 logs give entropy in bits; natural logs give nats.
What does the maximum-entropy principle say?
Among all probability distributions consistent with a given set of constraints (for example, fixed mean and variance), the one with the highest entropy is the least biased — it assumes the least additional structure beyond the given constraints. For fixed mean and variance on the reals, the maximum-entropy distribution is the Gaussian.
What is Shannon's source coding theorem?
It states that a source with entropy H bits per symbol cannot be losslessly compressed to fewer than H bits per symbol on average, and that rates arbitrarily close to H are achievable with sufficiently long block codes. This gives a hard lower bound on the expected code length of any lossless code for that source and is the mathematical foundation of modern lossless data compression.