In 1948, an engineer at Bell Labs named Claude Shannon published a paper titled “A Mathematical Theory of Communication.” It was 79 pages long, contained no breakthrough physics, and, in retrospect, did something more consequential than most physics papers of its century: it created an entirely new branch of mathematics.

The paper proposed that “information” (until then a vague word used by engineers and philosophers) could be measured precisely, in units. The unit Shannon used is the bit, a name he credited to his colleague John Tukey. The measure he defined for the average information content of a probability distribution is what we now call Shannon entropy. Together, these ideas became the foundation of the digital technologies that followed: compression, error correction, cryptography, the modern internet, machine learning, and theoretical computer science.

This article walks through what information theory says and why it has stayed at the center of computing for nearly 80 years.

What is information, mathematically?

Shannon’s first move was almost philosophical: information is surprise. A message that tells you something you already knew contains no information. A message that resolves uncertainty — that picks out one specific outcome from many possibilities — contains a lot.

This led to a precise definition. If an event has probability $p$, the information content of being told it occurred is

$$I(p) = -\log_2 p \text{ bits.}$$

The minus sign and the logarithm aren’t arbitrary. They’re forced by three reasonable requirements:

  1. Information should be non-negative. Telling you something can’t make you less informed.
  2. Independent events combine additively. Learning two unrelated facts gives you the sum of their information contents — not the product.
  3. Certain events have zero information. If something happens with probability 1, learning it occurred tells you nothing.

Up to the choice of logarithm base, $I(p) = -\log p$ is the only continuous function of $p$ that satisfies all three. The base of the logarithm just sets the unit: base 2 gives bits, base $e$ gives nats, base 10 gives dits (also called hartleys). Bits are conventional in computing.
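
A short derivation shows where the logarithm comes from. This is a sketch, and it assumes that $I$ is a continuous function of $p$, which the three requirements above do not state explicitly; the helper function $f$ and the constant $c$ are introduced here only for the argument.

```latex
\begin{aligned}
&\text{Independent events: } P(A \cap B) = pq, \text{ so additivity requires } I(pq) = I(p) + I(q). \\
&\text{Write } p = 2^{-x},\; q = 2^{-y} \text{ and } f(x) = I(2^{-x}): \quad f(x + y) = f(x) + f(y). \\
&\text{The continuous solutions are } f(x) = c\,x, \text{ hence } I(p) = -c \log_2 p. \\
&\text{Non-negativity forces } c \ge 0, \text{ and } I(1) = -c \log_2 1 = 0 \text{ holds automatically.} \\
&\text{Choosing } c = 1 \text{ sets the unit so that } I(1/2) = 1 \text{ bit.}
\end{aligned}
```

A different positive value of $c$ is the same as a different logarithm base, which is why the base only changes the unit.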

Practical consequences:

  • Tossing a fair coin (probability 1/2 for each outcome) gives $-\log_2(1/2) = 1$ bit of information per toss.
  • Rolling a fair six-sided die gives $-\log_2(1/6) \approx 2.585$ bits.
  • Drawing a specific card from a 52-card deck gives $-\log_2(1/52) \approx 5.7$ bits.

Drawing a memorable card, say the ace of spades, carries exactly the same information (about 5.7 bits) as drawing any other specific card. What matters is the probability of the outcome, not which outcome it is.
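
The three figures above can be checked in a few lines of Python. This is a minimal sketch; the function name `self_information` is chosen here for illustration and does not come from any library.

```python
import math

def self_information(p: float) -> float:
    """Information content, in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # log2(1/p) is the same quantity as -log2(p).
    return math.log2(1.0 / p)

examples = [("fair coin", 1 / 2), ("fair die", 1 / 6), ("one card of 52", 1 / 52)]
for label, p in examples:
    print(f"{label}: {self_information(p):.3f} bits")
```

Halving the probability always adds exactly one bit, which is one way to read the logarithm.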

Entropy: average information

For a random variable that can take many values, Shannon defined the entropy as the average information content over all possible outcomes:

$$H(X) = -\sum_i p_i \log_2 p_i.$$

This is the Shannon entropy formula. It measures the average uncertainty in the distribution — equivalently, the average number of bits needed to describe an outcome.

Key intuitions:

  • Uniform distribution maximizes entropy. A coin that lands heads with probability 0.5 has entropy 1 bit per toss. A biased coin with probability 0.99 for heads has entropy of about 0.08 bits. The biased coin is much more predictable, so each toss carries less information.
  • Entropy of a deterministic event is zero. If one outcome has probability 1, $H = 0$. There is no surprise to measure.
  • Adding outcomes increases entropy if probabilities are spread out. A six-sided die has more entropy than a coin.
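
The intuitions in this list can be tried out directly. The sketch below assumes the distribution is given as a plain list of probabilities; `entropy` is an illustrative name, not a library function.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print("fair coin:    ", entropy([0.5, 0.5]))
print("biased coin:  ", entropy([0.99, 0.01]))
print("deterministic:", entropy([1.0, 0.0]))
print("fair die:     ", entropy([1 / 6] * 6))
```

The fair coin gives 1 bit, the 0.99 coin about 0.08 bits, the deterministic case 0, and the die about 2.585 bits, matching the figures quoted in the text.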

Shannon proved a remarkable theorem: entropy is the minimum number of bits needed, on average, to encode a random message. You cannot losslessly compress a sequence of symbols below their entropy. This is the source coding theorem, the lower bound for all data compression.
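
One way to see the bound at work is to build an actual prefix code and compare its average length with the entropy. The sketch below uses Huffman coding, which is not mentioned in the text above and is swapped in here only as a familiar example of a symbol code; the two probability lists are made-up inputs.

```python
import heapq
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def huffman_code_lengths(probs):
    """Codeword length that Huffman's algorithm assigns to each symbol."""
    if len(probs) == 1:
        return [1]
    # Heap entries: (subtree probability, unique tie-breaker, symbols in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie_breaker = len(probs)
    while len(heap) > 1:
        p1, _, symbols1 = heapq.heappop(heap)
        p2, _, symbols2 = heapq.heappop(heap)
        for symbol in symbols1 + symbols2:
            lengths[symbol] += 1  # every merged symbol moves one level deeper
        heapq.heappush(heap, (p1 + p2, tie_breaker, symbols1 + symbols2))
        tie_breaker += 1
    return lengths

def average_length(probs):
    """Expected number of code bits per symbol under the Huffman code."""
    lengths = huffman_code_lengths(probs)
    return sum(p * n for p, n in zip(probs, lengths))

for probs in ([0.5, 0.25, 0.125, 0.125], [0.4, 0.3, 0.2, 0.1]):
    print(probs, "entropy:", round(entropy(probs), 4),
          "average code length:", round(average_length(probs), 4))
```

When every probability is a power of 1/2, the code meets the entropy exactly. Otherwise a symbol-by-symbol code pays a small overhead of less than one bit per symbol, and never drops below the entropy.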

Compression in practice

Why is your compressed ZIP file smaller than the original? Because real data has low entropy compared to its raw representation.

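A photograph stored as raw pixel values typically uses 24 bits per pixel (8 bits for each of three color channels). But adjacent pixels are highly correlated: if one is blue, the next is probably blue too. The conditional entropy of a pixel given its neighbors is much less than 24 bits. Image formats exploit this in different ways. Lossless formats such as PNG predict each pixel from its neighbors and encode only the prediction error, so the file size approaches the actual entropy of the source. Lossy formats such as JPEG also discard detail the eye is unlikely to notice, which lets them go smaller still at the cost of exact reconstruction. WebP supports both modes.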
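
A toy example makes the effect of neighbor prediction concrete. The sketch below is not how any real image format works in full; it uses a hypothetical one-row "image" whose brightness rises by 1 at every pixel, and compares the empirical entropy of the raw values with that of the left-neighbor differences.

```python
import math
from collections import Counter

def empirical_entropy(values):
    """Entropy in bits per symbol of the observed value frequencies."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Hypothetical smooth gradient: 256 pixels, each one step brighter than the last.
row = list(range(256))

# Predict each pixel as "same as the pixel to its left" and keep only the error.
# (A real codec would also store the first pixel so the row can be rebuilt.)
residuals = [right - left for left, right in zip(row, row[1:])]

print("raw values:", empirical_entropy(row), "bits per pixel")
print("residuals: ", empirical_entropy(residuals), "bits per pixel")
```

Every raw value is different, so treated independently the row looks like 8 bits per pixel. Every residual is the same, so after prediction there is nothing left to encode. Real photographs sit between these extremes.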

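Text shows the same effect. If all 26 letters were equally likely, English would carry $\log_2 26 \approx 4.7$ bits per letter, and using actual single-letter frequencies brings that down to a little over 4 bits. But conditional on the previous few characters, the entropy drops to about 1.0–1.5 bits per character. After ‘Q’, the next letter is almost always ‘U’, so it carries barely any information. Compression algorithms (gzip, bzip2, zstd) exploit this kind of context. At 8 bits per stored character, an entropy of 1.0–1.5 bits puts the floor at roughly 12–19% of the original size. General-purpose compressors usually land somewhat above that floor, and no lossless algorithm can go below it on average.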

The bound is sharp. No lossless compression algorithm can do better than entropy, on average, for a given source distribution. This is not a limitation of current technology; it’s a mathematical theorem.
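
The contrast between low-entropy and high-entropy data is easy to observe with Python's standard `zlib` module. This is an illustrative sketch: the repeated sentence and the seeded pseudo-random bytes are made-up inputs, and the exact output sizes will depend on the zlib version.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# Low-entropy source: the same sentence repeated many times.
predictable = b"the quick brown fox jumps over the lazy dog. " * 200

# High-entropy source: independent, uniformly distributed bytes.
noise = bytes(rng.getrandbits(8) for _ in range(len(predictable)))

for name, data in [("predictable", predictable), ("noise", noise)]:
    packed = zlib.compress(data, 9)
    print(f"{name}: {len(data)} -> {len(packed)} bytes")
```

The repetitive input shrinks to a small fraction of its size, while the random bytes barely change: they are already close to 8 bits of entropy per byte, so there is nothing for the compressor to remove.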

Channel capacity

Shannon’s other big result concerned communication over a noisy channel. If you send bits over a wire that occasionally flips them, how many useful bits per second can you transmit?

The channel capacity theorem answers: there exists a maximum rate $C$ (in bits per second) such that:

  • Below $C$, you can communicate arbitrarily reliably with appropriate error-correcting codes.
  • Above $C$, no coding scheme can make the error rate arbitrarily small; it stays bounded away from zero.

This was extraordinary news in 1948. It said that arbitrarily reliable communication over a noisy channel is possible, as long as you add enough well-chosen redundancy and keep the transmission rate below capacity. The channel doesn’t have to be perfect.
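
For the wire that occasionally flips bits, the capacity has a simple closed form. The sketch below assumes the standard binary symmetric channel model, in which each bit is flipped independently with probability `f` (a symbol introduced here, not in the text above), and reports capacity in bits per transmitted bit rather than bits per second.

```python
import math

def binary_entropy(f: float) -> float:
    """Entropy in bits of a coin that comes up heads with probability f."""
    if f in (0.0, 1.0):
        return 0.0
    return f * math.log2(1.0 / f) + (1.0 - f) * math.log2(1.0 / (1.0 - f))

def bsc_capacity(f: float) -> float:
    """Capacity of a binary symmetric channel with flip probability f."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    return 1.0 - binary_entropy(f)

for f in (0.0, 0.01, 0.11, 0.5):
    print(f"flip probability {f}: capacity {bsc_capacity(f):.4f} bits per use")
```

A perfect wire carries one useful bit per transmitted bit. At a flip probability of about 0.11 the capacity is roughly one half, and at 0.5 the output is independent of the input and the capacity is zero. Multiplying by the number of bits sent per second gives a rate in bits per second.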

The theorem doesn’t tell you how to construct such codes; that took decades of further work. Reed-Solomon codes (1960) became an early practical workhorse. Turbo codes (1993) and LDPC codes (invented by Robert Gallager in the early 1960s and rediscovered in 1996) were the first practical families to come close to Shannon’s bound. Modern Wi-Fi, 5G, and deep-space communication with NASA spacecraft all rely on modern codes, the best of which operate within a fraction of a decibel of channel capacity.

This means the bits in your phone signal, the bits in your DSL connection, and the bits beamed back from Mars rovers are all governed by the same theorem from 1948. The amount of redundancy needed depends on the noise level — channel capacity tells engineers exactly how much.
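
The simplest possible error-correcting code shows how redundancy buys reliability. The sketch below uses a 3x repetition code with majority voting over a simulated bit-flipping channel. It is far cruder than the codes named above and is only meant to illustrate the idea; the function names and the 5% flip probability are assumptions made for this example.

```python
import random

def encode(bits, n=3):
    """Repeat every bit n times."""
    return [b for b in bits for _ in range(n)]

def noisy_channel(bits, flip_prob, rng):
    """Flip each bit independently with probability flip_prob."""
    return [b ^ int(rng.random() < flip_prob) for b in bits]

def decode(received, n=3):
    """Majority vote over each group of n received bits."""
    return [int(sum(received[i:i + n]) > n // 2)
            for i in range(0, len(received), n)]

def residual_error(flip_prob):
    """Exact chance that a 3-copy majority vote decodes a bit wrongly."""
    return 3 * flip_prob**2 * (1 - flip_prob) + flip_prob**3

rng = random.Random(1)  # fixed seed so the run is repeatable
message = [rng.randint(0, 1) for _ in range(10_000)]
decoded = decode(noisy_channel(encode(message), 0.05, rng))
errors = sum(a != b for a, b in zip(message, decoded))

print("raw channel error rate:", 0.05)
print("simulated error rate:  ", errors / len(message))
print("predicted error rate:  ", residual_error(0.05))
```

Tripling every bit cuts the error rate from 5% to under 1%, but it spends two thirds of the channel on redundancy. For a 5% flip probability the capacity of this channel is about 0.71 useful bits per transmitted bit, so far better trade-offs exist, and finding codes that achieve them is what took decades.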

Information theory and AI

Much of modern AI can be read as information theory in deep neural-network form.

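Cross-entropy loss, the standard loss function for classification, measures the average number of bits needed to encode outcomes drawn from the true distribution using a code optimized for the predicted one. It equals the entropy of the true distribution plus the Kullback-Leibler divergence between the two, so minimizing it pulls the predicted distribution toward the true one. (It is not a distance in the strict sense, because it is not symmetric.) Lower cross-entropy means the model is less surprised by the ground truth.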
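
The relationship between cross-entropy, entropy, and Kullback-Leibler divergence can be verified numerically. In the sketch below the three distributions are made-up examples, and the functions work in bits (base 2), whereas machine learning frameworks usually use natural logarithms.

```python
import math

def entropy(p):
    """Shannon entropy of p in bits."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average bits to encode outcomes from p with a code built for q.

    Assumes q assigns non-zero probability wherever p does.
    """
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra bits paid for using q's code instead of p's own."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.7, 0.2, 0.1]     # hypothetical true class distribution
good_model = [0.6, 0.3, 0.1]    # hypothetical prediction close to the truth
poor_model = [0.1, 0.1, 0.8]    # hypothetical prediction far from the truth

for name, q in [("good model", good_model), ("poor model", poor_model)]:
    print(name,
          "cross-entropy:", round(cross_entropy(true_dist, q), 4),
          "entropy + KL:", round(entropy(true_dist) + kl_divergence(true_dist, q), 4))
```

The two printed columns agree, and the cross-entropy can never drop below the entropy of the true distribution. When the true distribution is a one-hot label, the sum collapses to the negative log of the probability the model gave to the correct class.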

Compression and intelligence are closely related. Some researchers argue that learning is essentially compression: a model that can predict the next token in a text well has, in effect, compressed its training data into its weights. The better the prediction, the better the compression, the closer to the source’s entropy. This is one informal definition of an “intelligent” model.

Information bottleneck is a theoretical framework proposed to explain why deep networks generalize. In this view, each hidden layer compresses information about the input, throwing away irrelevant details and keeping only what’s predictive of the output. Tishby and others have argued that the layered structure of deep nets is essentially a sequence of information bottlenecks, though how well this describes real networks is still debated.

Mutual information, a measure of the information shared between two variables, appears throughout machine learning. Variational autoencoders, contrastive learning, and other self-supervised methods are often derived from or analyzed with mutual information bounds.

In short: Shannon’s mathematics from 1948 is, in 2025, the language in which much of artificial intelligence is theoretically analyzed.

Where else it shows up

Information theory has spread far beyond communication engineering.

Cryptography. Shannon himself wrote a 1949 paper showing that perfect secrecy requires a key at least as long as the message, and that the one-time pad (a truly random key as long as the message, used once and never reused) achieves it. The reason is information-theoretic: the ciphertext provides zero information about the plaintext, because every plaintext of the same length is equally consistent with it. Practical ciphers such as AES and RSA provide computational rather than information-theoretic security; an attacker with unlimited computing power could break them. (See our primes-and-cryptography post for more.)
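
The one-time pad fits in a few lines, and the same lines show why the ciphertext reveals nothing. The sketch below uses XOR as the combining operation and Python's `secrets` module for the key; the two messages are made-up examples of equal length.

```python
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Combine two equal-length byte strings with bitwise XOR."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))

plaintext = b"ATTACK AT DAWN"
key = secrets.token_bytes(len(plaintext))  # fresh random key, used only once
ciphertext = xor_bytes(plaintext, key)

# The intended recipient, who holds the key, recovers the message.
recovered = xor_bytes(ciphertext, key)

# An eavesdropper cannot rule anything out: for any other message of the
# same length, some key turns the very same ciphertext into that message.
decoy = b"RETREAT AT TEN"
decoy_key = xor_bytes(ciphertext, decoy)

print(recovered)
print(xor_bytes(ciphertext, decoy_key))
```

Because every candidate plaintext has exactly one matching key, and all keys are equally likely, seeing the ciphertext does not change the odds of any message. Reusing the key breaks this argument, which is why the pad must be used once.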

Statistical mechanics. The Boltzmann entropy formula $S = k \log W$ has the same mathematical structure as Shannon entropy for a uniform distribution over $W$ states. The connection, explored by Edwin Jaynes in the 1950s, is that thermodynamic entropy is essentially the Shannon entropy of the distribution over molecular states, scaled by Boltzmann’s constant. This is why “information is physical” became a common slogan in the late 20th century.
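
The link between the two formulas can be written out in a few lines. This sketch starts from the Gibbs form of the entropy, which is not stated in the text above, and takes $\log$ to mean the natural logarithm, as is conventional in physics.

```latex
\begin{aligned}
S &= -k \sum_{i=1}^{W} p_i \ln p_i
  && \text{(Gibbs form, any distribution over microstates)} \\
  &= -k \sum_{i=1}^{W} \frac{1}{W} \ln \frac{1}{W}
  && \text{(all } W \text{ microstates equally likely)} \\
  &= k \ln W
  && \text{(Boltzmann's formula)} \\[4pt]
S &= (k \ln 2)\, H
  && \text{(with } H = -\textstyle\sum_i p_i \log_2 p_i \text{ in bits)}
\end{aligned}
```

The last line says that the two entropies differ only by a constant factor that converts bits into physical units.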

Genetics. DNA is a four-letter code, so each base pair carries at most 2 bits. The human genome contains roughly $3 \times 10^9$ base pairs, and because of repeats and other structure it can be compressed below that 2-bit ceiling. Information-theoretic methods help estimate gene boundaries, identify functional sequences, and reconstruct evolutionary histories.

Astronomy. Detecting exoplanets via radial-velocity wobbles, or finding gravitational waves in noisy detector data, can be framed as an information-theoretic problem. The signal is buried in noise, and the question is which minimal description captures the real signal without also fitting the noise.

Linguistics. Comparing entropy of different languages reveals structural differences. Languages with more morphological complexity (Finnish, Russian) have different entropy profiles than analytic languages (English, Mandarin).

What it teaches

The deepest lesson of information theory is that “information” is not a vague concept — it’s a precise mathematical quantity, measured in bits, with concrete bounds and theorems governing its behavior.

Several intuitions follow from this:

Compression has a floor. No matter how cleverly you encode something, you cannot beat its entropy. Text, images, audio, genomic data — all have an information-theoretic minimum size.

Communication has a ceiling. No matter how good your error-correcting code, you cannot transmit reliably at a rate above channel capacity. This bound is one of the few hard limits in engineering.

Surprise is information. Anything you can predict perfectly carries no information. The most informative events are the unexpected ones. This explains why news is about unusual events, why scientific discoveries are about anomalies, and why interesting machine learning happens at distribution boundaries.

Mutual information measures relationship. If two variables share mutual information, knowing one tells you something about the other. Zero mutual information means independence. This generalizes correlation to non-linear relationships.
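
A small discrete example shows mutual information catching a dependence that correlation misses. The sketch below represents a joint distribution as a dictionary from `(x, y)` pairs to probabilities; that representation and the function names are choices made for this illustration.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def covariance(joint):
    """Cov(X, Y); zero covariance means zero linear correlation."""
    mean_x = sum(p * x for (x, _), p in joint.items())
    mean_y = sum(p * y for (_, y), p in joint.items())
    mean_xy = sum(p * x * y for (x, y), p in joint.items())
    return mean_xy - mean_x * mean_y

# X is uniform on {-1, 0, 1} and Y = X squared: Y is fully determined by X.
squared = {(x, x * x): 1 / 3 for x in (-1, 0, 1)}

# Two independent fair bits.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

print("Y = X^2:     covariance", covariance(squared),
      "mutual information", round(mutual_information(squared), 4))
print("independent: covariance", covariance(independent),
      "mutual information", round(mutual_information(independent), 4))
```

In the first case the covariance is zero even though Y is a function of X, while the mutual information is about 0.92 bits. In the second case both are zero, as independence requires.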

For Shannon, who joked late in life that he had “stumbled into” his subject, information theory was a kind of intellectual lightning. He didn’t see all the applications. Few do. But the mathematics he wrote down in 1948 has held up better than most theories of its era — it correctly described digital communication before digital communication existed, correctly described compression before compression algorithms existed, and correctly described machine learning decades before machine learning existed.

That kind of foresight is rare. Information theory is one of the cleanest examples of pure mathematics turning out, decades later, to be exactly the language a new technology needed.

Frequently asked

Is Shannon entropy the same as thermodynamic entropy?

They are closely related but not literally the same quantity. The formulas share a form (both involve sums of p log p), and the connection runs deeper than notation. Shannon's entropy quantifies uncertainty in a probability distribution; thermodynamic entropy quantifies uncertainty about the microscopic state of a physical system, expressed in physical units through Boltzmann's constant. The unifying principle is information: specifying the microstate of a gas calls for the same kind of measure as specifying the next letter in a message.

Why is the unit called a 'bit'?

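John Tukey coined the word as shorthand for 'binary digit,' and Shannon's 1948 paper, which credits Tukey, established it as the unit of information. One bit is the amount of information needed to distinguish between two equally likely outcomes, that is, the answer to a single yes/no question. Less likely events carry more bits. Because information content is a logarithm, it need not be a whole number: a nearly certain event carries only a fraction of a bit.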

Does information theory limit AI?

Yes, in important ways. A language model's average prediction loss (its cross-entropy) cannot fall below the entropy of the text it is predicting, just as lossless compression cannot beat the entropy bound. Lossy compression involves trade-offs that information theory makes precise. Many capabilities of AI systems can be analyzed through Shannon's framework.