Shannon entropy is the quantitative measure of information produced by a random source. For a discrete random variable $X$ taking values $x_1, \ldots, x_n$ with probabilities $p_1, \ldots, p_n$:

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

Introduced in Claude Shannon’s 1948 paper “A Mathematical Theory of Communication”, this formula launched information theory as a discipline and underlies the analysis of modern communication and compression systems.

1. What the formula measures

Intuitively, $H(X)$ is the average number of bits needed to describe one draw from the distribution of $X$, assuming an optimal code and base-2 logarithms. A deterministic source ($p_1 = 1$, all others 0) has entropy zero: its output is known in advance, so observing it conveys no information. A uniform source over $n$ outcomes has entropy $\log_2 n$, the maximum for that support size.
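The two boundary cases above can be checked numerically. The following is a minimal sketch; the function name `shannon_entropy` and the three example distributions are illustrative choices, not part of any particular library.

```python
import math

def shannon_entropy(probs, base=2.0):
    """Entropy of a discrete distribution; outcomes with p = 0 contribute nothing."""
    if not math.isclose(sum(probs), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Written as p * log(1/p) so that every term is non-negative.
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

# Hypothetical example distributions over four outcomes.
examples = {
    "deterministic": [1.0, 0.0, 0.0, 0.0],
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "skewed": [0.7, 0.1, 0.1, 0.1],
}

for name, dist in examples.items():
    print(f"{name:>13}: {shannon_entropy(dist):.4f} bits")
```

The deterministic case gives zero, the uniform case gives $\log_2 4 = 2$ bits, and any other distribution on four outcomes falls strictly between the two.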

The base of the logarithm is a choice of unit:

  • $\log_2$ → bits
  • $\log_e$ → nats
  • $\log_{10}$ → hartleys

In pure mathematics and physics, the natural logarithm is usually preferred; in engineering and computer science, base 2.
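Changing the base only rescales entropy by a constant factor, because $\log_b x = \log_a x / \log_a b$. As a worked conversion (the subscripted names are notation introduced here for illustration):

```latex
H_{\text{nats}}(X) = (\ln 2)\, H_{\text{bits}}(X) \approx 0.693\, H_{\text{bits}}(X),
\qquad
H_{\text{hartleys}}(X) = (\log_{10} 2)\, H_{\text{bits}}(X) \approx 0.301\, H_{\text{bits}}(X)
```

A fair coin flip therefore carries 1 bit, about 0.693 nats, or about 0.301 hartleys.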

2. Axiomatic characterization

Shannon derived the formula from three axioms that any reasonable measure of uncertainty should satisfy:

  1. Continuity in the probabilities $p_i$.
  2. Monotonicity: for uniform distributions, more outcomes means more uncertainty.
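  3. Grouping (composition): if a choice is broken into successive sub-choices, the total uncertainty is the weighted sum of the uncertainties of the individual choices. Additivity for independent experiments follows as a special case.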

The unique function (up to a multiplicative constant) satisfying all three is $-\sum_i p_i \log p_i$.
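Additivity over independent experiments is one property the formula satisfies, and it is easy to verify numerically. The sketch below assumes two hypothetical independent distributions `px` and `py` and builds their joint distribution as a product.

```python
import math
from itertools import product

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes are skipped."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical marginals of two independent experiments.
px = [0.5, 0.3, 0.2]
py = [0.9, 0.1]

# Independence means the joint probability is the product of the marginals.
joint = [a * b for a, b in product(px, py)]

h_sum = entropy_bits(px) + entropy_bits(py)
h_joint = entropy_bits(joint)
print(f"H(X) + H(Y) = {h_sum:.6f} bits")
print(f"H(X, Y)     = {h_joint:.6f} bits")
```

The two printed values agree up to floating-point rounding.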

3. The source coding theorem

Shannon’s source coding theorem establishes the operational meaning:

A source with entropy $H$ bits per symbol cannot be compressed losslessly to fewer than $H$ bits per symbol on average, but rates arbitrarily close to $H$ are achievable with sufficiently long block codes.

This is a hard lower bound on lossless compression. Practical lossless methods (Huffman coding, arithmetic coding, LZ77/LZ78, and formats built on them such as gzip and brotli) can be read as attempts to approach this bound under some model of the source.
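To see the bound at work, one can compare the average codeword length of a Huffman code with the entropy of the source. This is a minimal sketch that codes one symbol at a time rather than in long blocks, so it only guarantees an average length within one bit of $H$; the helper names and the two example distributions are illustrative.

```python
import heapq
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def huffman_code_lengths(probs):
    """Codeword length per symbol for a binary Huffman code (needs >= 2 symbols)."""
    if len(probs) < 2:
        raise ValueError("need at least two symbols")
    # Heap entries: (subtree probability, unique tie-breaker, symbol indices in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol beneath them.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths

def average_length(probs):
    """Expected codeword length in bits per symbol."""
    return sum(p * n for p, n in zip(probs, huffman_code_lengths(probs)))

for probs in ([0.5, 0.25, 0.125, 0.125], [0.4, 0.3, 0.2, 0.1]):
    print(f"H = {entropy_bits(probs):.4f} bits, Huffman average = {average_length(probs):.4f} bits")
```

When every probability is a power of 1/2, as in the first example, the Huffman code meets the entropy exactly. Otherwise the average length lies above $H$ but below $H + 1$.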

4. Continuous entropy and the Gaussian

For a continuous random variable with density $f$, the differential entropy is

$$h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx$$

A remarkable result: among all densities on $\mathbb{R}$ with fixed mean $\mu$ and variance $\sigma^2$, the one with maximum differential entropy is the Gaussian $N(\mu, \sigma^2)$. This is one reason the Gaussian appears so often: it is the least informative distribution consistent with given first and second moments.
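The maximum-entropy property can be illustrated by comparing a few densities that share the same variance. The sketch below uses the standard closed-form differential entropies (in nats) of the Gaussian, Laplace, and uniform distributions; the choice of comparison families and the function names are assumptions made for illustration, and three examples do not amount to a proof.

```python
import math

def gaussian_entropy(sigma):
    """h = 0.5 * ln(2 * pi * e * sigma^2) for N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

def laplace_entropy(sigma):
    """Laplace with scale b has variance 2 b^2 and h = 1 + ln(2 b)."""
    scale = sigma / math.sqrt(2)
    return 1.0 + math.log(2 * scale)

def uniform_entropy(sigma):
    """Uniform on an interval of width w has variance w^2 / 12 and h = ln(w)."""
    width = sigma * math.sqrt(12)
    return math.log(width)

sigma = 1.0
for name, fn in [("Gaussian", gaussian_entropy),
                 ("Laplace", laplace_entropy),
                 ("uniform", uniform_entropy)]:
    print(f"{name:>8}: {fn(sigma):.4f} nats")
```

At equal variance the Gaussian value is the largest of the three. The mean does not appear because differential entropy is unchanged by shifting the density.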

5. Connections beyond information theory

Statistical mechanics

Shannon entropy is mathematically identical (up to Boltzmann’s constant $k_B$) to the Gibbs entropy in statistical mechanics:

$$S = -k_B \sum_i p_i \ln p_i$$

The equivalence is not coincidental — both describe uncertainty about microstates given macroscopic information. Jaynes’s maximum entropy principle (1957) unified the two frameworks.

Machine learning

Cross-entropy loss — ubiquitous in classification — is a direct descendant:

$$L(p, q) = -\sum_i p_i \log q_i$$

where $p$ is the empirical distribution of labels and $q$ is the model’s predicted distribution. Minimizing cross-entropy is equivalent to maximum-likelihood estimation under a categorical model.
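The link to maximum likelihood can be made concrete: the cross-entropy between the empirical label distribution and the model equals the average negative log-likelihood of the labels. The sketch below assumes a hypothetical list of class labels and a model that predicts the same categorical distribution `q` for every sample.

```python
import math
from collections import Counter

labels = [0, 1, 1, 2, 1, 0]   # hypothetical observed class labels
q = [0.3, 0.5, 0.2]           # hypothetical model distribution over the 3 classes

# Empirical distribution p: relative frequency of each class.
counts = Counter(labels)
p = [counts[k] / len(labels) for k in range(len(q))]

# Cross-entropy L(p, q) = -sum_i p_i log q_i (natural log, so the unit is nats).
cross_entropy = -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

# Average negative log-likelihood of the labels under q.
avg_nll = -sum(math.log(q[y]) for y in labels) / len(labels)

print(f"cross-entropy    = {cross_entropy:.6f} nats")
print(f"average neg. LL  = {avg_nll:.6f} nats")
```

Because the two quantities coincide, choosing `q` to minimize the cross-entropy is the same as choosing it to maximize the likelihood of the observed labels.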

Cryptography

Shannon’s later paper “Communication Theory of Secrecy Systems” (1949) defined perfect secrecy in entropy terms and proved that it requires a key with at least as much entropy as the message. The one-time pad meets this requirement and is unconditionally secure; practical ciphers with shorter keys settle for computational security instead.
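A minimal sketch of the one-time pad shows why the ciphertext alone reveals nothing about the message. The function name `otp_xor` and both plaintexts are illustrative; the sketch assumes the key is uniformly random, as long as the message, and never reused.

```python
import secrets

def otp_xor(data: bytes, key: bytes) -> bytes:
    """XOR data with a key of equal length; the same call encrypts and decrypts."""
    if len(key) != len(data):
        raise ValueError("one-time pad key must be exactly as long as the message")
    return bytes(d ^ k for d, k in zip(data, key))

message = b"ATTACK AT DAWN"
key = secrets.token_bytes(len(message))   # fresh random key, used once

ciphertext = otp_xor(message, key)
recovered = otp_xor(ciphertext, key)

# Any other plaintext of the same length is explained equally well by some key,
# so an eavesdropper cannot prefer one candidate message over another.
alternative = b"RETREAT AT TEN"
alternative_key = otp_xor(ciphertext, alternative)
decoy = otp_xor(ciphertext, alternative_key)

print(recovered)
print(decoy)
```

The second decryption yields the alternative plaintext from the very same ciphertext, which is the intuition behind perfect secrecy.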

6. Mathematical extensions

  • Conditional entropy: $H(X \mid Y) = -\sum_{x,y} p(x,y) \log p(x \mid y)$
  • Mutual information: $I(X; Y) = H(X) + H(Y) - H(X,Y)$
  • KL divergence: $D_{\text{KL}}(p \| q) = \sum_i p_i \log(p_i/q_i)$
  • Rényi entropy: a parametric family generalizing Shannon entropy

Each has become a central tool in its own field — statistics, machine learning, ergodic theory, quantum information.
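The first three quantities in the list can be computed directly from a joint probability table. The following sketch assumes a small hypothetical joint distribution over two binary variables; it also checks two standard identities, $H(X \mid Y) = H(X,Y) - H(Y)$ and the fact that mutual information is the KL divergence between the joint distribution and the product of its marginals.

```python
import math

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
joint = [[0.30, 0.10],
         [0.10, 0.50]]

px = [sum(row) for row in joint]          # marginal of X
py = [sum(col) for col in zip(*joint)]    # marginal of Y
cells = [(i, j) for i in range(len(px)) for j in range(len(py)) if joint[i][j] > 0]

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

h_x = entropy_bits(px)
h_y = entropy_bits(py)
h_xy = entropy_bits([joint[i][j] for i, j in cells])

# Conditional entropy from the definition, with p(x|y) = p(x,y) / p(y).
h_x_given_y = -sum(joint[i][j] * math.log2(joint[i][j] / py[j]) for i, j in cells)

# Mutual information from the entropies.
mutual_info = h_x + h_y - h_xy

# KL divergence between the joint and the product of the marginals.
kl_joint_vs_product = sum(
    joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j])) for i, j in cells
)

print(f"H(X|Y)  = {h_x_given_y:.6f} bits")
print(f"I(X;Y)  = {mutual_info:.6f} bits")
print(f"D_KL    = {kl_joint_vs_product:.6f} bits")
```

The mutual information and the KL divergence print the same value, and both are zero exactly when the two variables are independent.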

Further reading

  • Shannon, A Mathematical Theory of Communication (1948) — the founding paper, still worth reading.
  • Cover & Thomas, Elements of Information Theory — the standard graduate text.
  • MacKay, Information Theory, Inference, and Learning Algorithms — applied, freely available online.

Frequently asked

Why does Shannon entropy use a logarithm?

Because information should be additive for independent sources: if X and Y are independent, the joint probability factors as p(x,y) = p(x)·p(y), and we want H(X,Y) = H(X) + H(Y). The logarithm is the only continuous function (up to the choice of base) that turns products into sums. Base-2 logs give entropy in bits; natural logs give nats.

What does the maximum-entropy principle say?

Among all probability distributions consistent with a given set of constraints (for example, fixed mean and variance), the one with the highest entropy is the least biased — it assumes the least additional structure beyond the given constraints. For fixed mean and variance on the reals, the maximum-entropy distribution is the Gaussian.

What is Shannon's source coding theorem?

It states that a source with entropy H bits per symbol cannot be losslessly compressed to fewer than H bits per symbol on average, while rates arbitrarily close to H are achievable. This gives a hard lower bound on the compressibility of any data stream and is the mathematical foundation of modern data compression.