Suppose you want to write down a long string of letters from some alphabet — perhaps the text of a novel, or the symbols emitted by a measurement device — and you want to store or send it using as few bits as possible. The simplest plan is to give every symbol a fixed-length code. An alphabet of symbols needs bits per symbol that way: five symbols need three bits each, since is not enough but is plenty.
But that wastes space whenever the symbols are not equally common. In an English novel the letter “e” appears about thirteen percent of the time, and the letter “z” about a tenth of a percent. Paying the same number of bits to write “e” as to write “z” is, intuitively, throwing bits away. A better plan would be to assign short codes to common symbols and long codes to rare ones, the way Morse code does — “E” is a single dot, “Q” is dash-dash-dot-dash.
This is the idea of a variable-length code. It immediately raises a subtler question: when codewords have different lengths, how does the receiver tell where one codeword ends and the next begins? And, having answered that, how do you find the best such code? Huffman coding, invented by David Huffman in 1952 while he was a graduate student at MIT, answers both questions in a single short algorithm. Despite being seventy years old, it sits inside almost every modern compressor — ZIP, JPEG, MP3, PNG, HTTP/2 — and is still the standard against which alternatives are measured. This article is about how it works and why.
Prefix codes: making variable-length unambiguous
The first thing to settle is the boundary problem. If codes have different lengths, the decoder must be able to tell when each codeword ends. Huffman’s trick is to use a prefix code: no codeword is allowed to be the start of any other codeword. Once that rule is in force, the decoder simply reads bits from left to right; the moment the bits read so far form a valid codeword, it must be the only one possible, since no longer codeword can start the same way. The bits can then be decoded and the next codeword begins.
A prefix code has a natural representation as a binary tree. Internal nodes branch left for and right for ; leaves are the codewords. The path from the root to a leaf gives the bit string. Because leaves cannot have descendants, no codeword is a prefix of another by construction.
The Huffman recipe
Suppose the alphabet has symbols with probabilities (or counts) , and we want the prefix code that minimises the average codeword length , where is the length of the codeword assigned to symbol . Huffman’s algorithm is:
- Treat each symbol as a leaf node with weight .
- Repeat: pick the two nodes with the smallest weights, combine them under a new parent node whose weight is the sum of the two, and put the parent back into the pool.
- Stop when only one node — the root — remains.
The tree built this way is the optimal Huffman tree. The codes are read off as the binary paths from root to each leaf. The whole construction is greedy — at each step it makes the locally best decision — and remarkably, this greedy choice produces a globally optimal code.
A worked example
Take an alphabet of five symbols with probabilities , , , , . Apply the algorithm.
Step 1: the two smallest are and . Combine them into a parent of weight .
Step 2: the pool is now . The two smallest are and . Combine them into of weight .
Step 3: the pool is . The two smallest are and . Combine into of weight .
Step 4: the pool is . Combine into the root of weight .
The resulting tree:
Reading off the codes by tracing root-to-leaf paths gives
The most common symbol, , gets a two-bit code; the rarest, and , get three-bit codes. The average codeword length is
A naive fixed-length code would need three bits per symbol for five letters, so Huffman saves bits per symbol — about a reduction. Over a long text, that adds up to a substantial saving.
How close to optimal is this really?
There is a fundamental lower bound. The Shannon entropy of the source,
is the theoretical minimum average number of bits per symbol that any compression scheme can achieve in the limit of long messages. For our example bits per symbol. Huffman achieved , just bits per symbol away from the absolute floor. In general Huffman comes within one bit per symbol of the entropy and usually much closer, and a theorem of Huffman’s own paper shows there is no integer-length prefix code that does any better. So among integer-length prefix codes, Huffman is exactly optimal.
The gap between and comes from a hard structural constraint: Huffman must give each symbol a whole number of bits, but the “ideal” number of bits for symbol is , which is rarely an integer. The most common symbol ideally carries bits of information — but a codeword cannot have bits, so Huffman rounds up to . The cumulative round-up across all symbols accounts for the small but persistent gap to entropy. Arithmetic coding sidesteps this by allowing fractional bits, which is why it is preferred when the last few percent of compression really matter.
Why the greedy choice is optimal
The proof that Huffman’s bottom-up construction produces the best possible tree has a short and surprising structure. It hinges on two observations.
First, in any optimal tree, two symbols of smallest probability can always be placed as siblings at the deepest level. If not, swapping them with whatever two siblings are at the deepest level can only decrease the average codeword length, since the deepest leaves get the largest . So we lose nothing by committing, at the start, to put the two least-frequent symbols together as siblings.
Second, once those two symbols are paired, replacing them with a single combined symbol of weight equal to their sum leaves the rest of the optimisation problem of the same shape, just on a smaller alphabet. This is exactly the move the Huffman algorithm makes. By induction the merged problem is solved optimally, and that optimal solution lifts back to an optimal tree on the original alphabet.
The argument is short and clean, and the fact that a greedy algorithm gives a provably global optimum is one of the most-cited examples of when greed succeeds in algorithm design.
What Huffman cannot do
The integer-bit limitation is the main one, and arithmetic coding overcomes it at the cost of more arithmetic per encoded symbol. Huffman also cannot adapt automatically to a source whose probabilities change as the data flows by — though adaptive Huffman variants update the tree on the fly, with some efficiency cost. And Huffman cannot exploit correlations between successive symbols: it treats each symbol as independent, so a source where letters are highly predictable from context (e.g. natural language) needs an additional modelling step before Huffman can compress it well. The DEFLATE format used by gzip and ZIP couples Huffman with a sliding-window dictionary that captures repeated phrases — the first stage finds the patterns, the second stage Huffman-codes whatever is left.
A foundational pebble
David Huffman’s algorithm is small enough to be sketched on a napkin yet powerful enough that it has not been displaced in seventy-three years for the basic task of compressing a known distribution of symbols. Inside every ZIP file you have ever made, inside every JPEG image, every MP3 song, every PNG screenshot, every HTTP header passing through the modern web, a small Huffman tree is doing its quiet work — giving short codes to common things and long codes to rare ones, exactly as Huffman saw was possible in a sudden moment of insight when he was supposed to be preparing for a final exam.
Frequently asked
Who was Huffman and how did he discover the algorithm?
David A. Huffman was a graduate student at MIT in 1951 when his professor, Robert Fano, offered the class a choice: take the final exam, or solve the open problem of finding the most efficient prefix code for a given set of letter frequencies. Huffman attempted the open problem for months, gave up, and was about to start studying for the exam when the idea suddenly came to him — a simple bottom-up tree construction that was provably optimal. He published it in 1952 and it has been the standard ever since.
What is a prefix code, and why does it matter?
A prefix code is one where no codeword is a prefix of any other. So if 'A' is encoded as 10, no other letter's code can start with 10. The benefit is that the decoder can read a bitstream from left to right and unambiguously split it into codewords, with no need for separators between symbols. Huffman codes are always prefix codes by construction — each codeword traces out a unique root-to-leaf path in a binary tree, and no leaf can be an ancestor of another leaf.
Is Huffman coding optimal?
It is optimal among the codes that assign a whole number of bits to each symbol. For an alphabet whose symbols have probabilities p_i, the Shannon entropy H = −Σ p_i log_2(p_i) gives the theoretical minimum average bits per symbol, and Huffman gets within one bit of this — usually much closer. But because Huffman must use integer-length codewords, it cannot represent, say, 'this symbol carries 1.5 bits of information' efficiently. Arithmetic coding overcomes that limitation and is the modern standard when the last fraction of a bit really matters.
Where is Huffman coding used today?
Just about everywhere data is compressed. The DEFLATE algorithm at the heart of ZIP, gzip, and PNG uses Huffman as one of its two stages. JPEG compresses the quantized DCT coefficients of an image with Huffman. MP3 and AAC encode quantized audio coefficients with Huffman. HTTP/2 uses a static Huffman table to compress HTTP headers. Even the file formats of fonts and the encoded streams in PDFs lean on Huffman. Seventy years on, it is still the workhorse first step of nearly every compressor in daily use.