In 1906, the Russian mathematician Andrey Markov sat down with a copy of Alexander Pushkin’s verse novel Eugene Onegin and started counting letters. He wanted to test something: most theorems of probability theory at the time required random variables to be independent. Real-world data — like sequences of letters in text — is not independent. The letter following ‘q’ is almost always ‘u’. The letter following ‘t’ is unusually likely to be ‘h’.
Markov’s question was: do the basic probability theorems still work for dependent sequences, as long as the dependence has a particular structure? He counted vowels and consonants in the first 20,000 letters of Pushkin’s poem and demonstrated that yes — for sequences where each letter depends only on the immediately preceding one, classical results like the law of large numbers can be generalized.
The structure he isolated is what we now call a Markov chain: a random sequence where the next state depends only on the current state, not on the path that got you there. The structure turns out to be one of the most useful in all of applied mathematics. It describes card shuffling, weather forecasting, the spread of epidemics, language models in machine learning, queueing systems, financial credit ratings, PageRank, and population dynamics. Wherever you have a system that evolves randomly but where “the future depends only on the present,” you have a Markov chain.
The setup
A Markov chain consists of:
- A state space — the possible configurations the system can be in.
- Transition probabilities — the probability of moving from state to state in one step.
The defining property — the Markov property — is:
In words: given the current state, the future is independent of the past. The chain has no memory beyond its current position.
For a finite state space, the transition probabilities form a transition matrix , where is the probability of going from state to state . Each row of must sum to 1 (you have to go somewhere).
A simple example: weather with two states, “Sunny” and “Rainy.”
| Sunny | Rainy | |
|---|---|---|
| Sunny | 0.8 | 0.2 |
| Rainy | 0.4 | 0.6 |
Today is sunny? 80% chance tomorrow is sunny too, 20% rain. Today rainy? 40% chance of recovery, 60% more rain. Note that yesterday’s weather doesn’t matter — only today’s.
This is, of course, a toy model. Real weather has dependence on much more than yesterday — pressure systems, seasons, and so on. But many natural processes really are well approximated by Markov chains.
The stationary distribution
If you let the chain run for a long time, the distribution of states approaches a stationary distribution — a probability distribution that doesn’t change under further evolution. For our weather chain, the stationary distribution turns out to be:
In the long run, regardless of today’s weather, you can expect 2/3 of all days to be sunny.
The stationary distribution is the probability vector satisfying
This is a left eigenvector of the transition matrix with eigenvalue 1. For any Markov chain that is irreducible (every state reachable from every other) and aperiodic (no rigid cyclic structure), the stationary distribution exists and is unique.
If you’ve read our piece on PageRank, this should sound familiar. PageRank is the stationary distribution of a Markov chain whose state space is the web and whose transitions are “follow a random link.” The Perron-Frobenius theorem we used there is exactly the existence theorem for stationary distributions of Markov chains.
Why “memoryless” is more flexible than it sounds
The most common objection to Markov modeling is “but my system has memory!” Stock prices depend on past behavior. Conversations depend on what was said before. Customer behavior depends on past purchases.
The trick is that the state can be defined to include whatever memory you need. If yesterday’s weather matters for tomorrow’s forecast, define the state as the last two days of weather instead of just today. Then the chain is again Markov, just with a larger state space.
This is the technique behind n-gram language models that dominated natural language processing for decades. Instead of “next word depends on current word” (a 1-Markov model that produces nonsense), you use “next word depends on previous 3 words” (a 4-gram, or 3rd-order Markov chain). The state space is huge — every possible triple of words — but the system is still Markovian.
Modern language models, including the deep learning kind, can be understood as elaborate Markov chains. The state is the entire input context (sometimes thousands of tokens). The transition is generated by a neural network. The chain produces text by sampling from the conditional distribution of the next token given the current context.
In a deep sense, predicting the next thing given the current state is exactly what Markov chains do. We have just gotten very good at making the state representation rich.
The role of memory length
What changes when you increase Markov order — that is, the amount of past information included in the state?
A Markov chain of order zero has no memory at all. Each step is independent of the rest. This is the i.i.d. (independent and identically distributed) sequence — the simplest probability model.
A first-order Markov chain remembers only the immediate past. This already produces interesting structure. If you train a first-order Markov chain on English text and let it generate, you get strings like “of the the and an for” — words that locally look reasonable but globally make no sense.
A higher-order Markov chain (third, fifth, tenth) remembers more. The output starts to look syntactically correct over short stretches but still drifts globally. By the time you reach context lengths of ~50 tokens, the output looks like coherent paragraphs but lacks long-range structure.
Modern transformer-based language models effectively have context lengths of hundreds to millions of tokens. They are giant Markov chains where the “state” is enormous and the transition function is a deep neural network. The generative behavior — coherent over thousands of words — comes from the depth of the state representation, not from any escape from the Markov framework.
Where Markov chains show up
The list of applications is enormous. A partial sample:
Card shuffling. Each shuffle is a transition in a Markov chain whose state space is “all 52! orderings of a deck.” A famous result by Persi Diaconis shows that 7 riffle shuffles are sufficient to mix a deck thoroughly, though fewer leaves residual patterns.
Queueing theory. Customers arriving at a service desk, jobs being processed in a computer queue, packets routing through a network — all modeled as Markov chains where states are queue lengths and transitions are arrivals/departures. The math underlies everything from telephone networks to internet backbone routing.
Population genetics. Allele frequencies in a population evolve according to Markov processes — the Wright-Fisher model. Genetic drift, natural selection, and mutation are all formalized as transition rules.
Financial credit ratings. A bond’s rating (AAA, AA, A, BBB, …) evolves according to historical transition probabilities estimated from data. These are used to price credit default swaps and structure portfolios.
Speech recognition. Hidden Markov Models (HMMs) — Markov chains where you observe a noisy version of the state — were the dominant technology for speech recognition from the 1970s until deep learning took over in the 2010s.
Disease spread. SIR models for epidemics (Susceptible → Infected → Recovered) are Markov chains on a small state space. The COVID-19 pandemic was modeled with vast extensions of this framework.
Markov Chain Monte Carlo (MCMC). A technique for sampling from complicated probability distributions by constructing a Markov chain whose stationary distribution is the one you want to sample. MCMC underlies Bayesian statistics, statistical physics simulations, and large parts of computational neuroscience.
The Metropolis algorithm
One of the most important uses of Markov chains is for sampling from distributions you can’t draw from directly. The Metropolis algorithm (1953) is the canonical method.
Suppose you want to sample from a probability distribution over a complicated state space — for example, the configurations of 1000 atoms in a fluid, weighted by their Boltzmann factors. You can compute for any specific , but you cannot enumerate all to normalize.
The trick: design a Markov chain whose stationary distribution is . The Metropolis algorithm does this by:
- Propose a small random change to the current state.
- Compute the ratio .
- If , accept the new state. If , accept with probability .
- Repeat.
This produces a chain that, after enough iterations, generates samples distributed according to . The technique was developed for atomic-bomb simulations at Los Alamos in the 1940s and 50s and is now a workhorse of computational science.
What Markov chains teach
The deepest lesson of Markov chains is something Markov himself articulated in 1906: many natural systems have a kind of local memory, where what just happened matters far more than what happened long ago. If you can identify the right notion of “current state” — where current may include some recent history packed into the definition — the future depends only on it.
This is a strong constraint. Most random sequences in the wild are not, naively, Markov. But with the right state representation, an enormous fraction of them become Markov. And once a process is Markov, you have access to a huge body of mathematical machinery: stationary distributions, mixing times, ergodic theorems, hitting times, large-deviation results, MCMC samplers.
For an applied mathematician, “is this process approximately Markov?” is one of the first questions to ask about any random system. The answer — even a partial yes — opens enormous toolkits. The same conceptual move that lets a 21st-century language model generate coherent prose also lets a 21st-century epidemiologist forecast disease and a 21st-century actuary price an insurance portfolio.
Markov took an arcane question about Pushkin’s poetry and ended up giving twentieth-century science one of its most flexible mathematical tools. The Markov property — “the future depends only on the present” — turns out to be either approximately true or makeable approximately true for most things humans want to model.
That, in retrospect, is one of the more useful things mathematics has produced.
Frequently asked
Why is it called 'Markov' chain?
Andrey Markov, a Russian mathematician, introduced the concept in 1906 to extend the law of large numbers to dependent random variables. He used it to study the statistics of letters in Pushkin's poetry — proving that even though letter sequences are not independent, statistical regularities still hold.
Are real-world systems actually memoryless?
Strictly speaking, almost never. But many systems are *approximately* memoryless if you define 'state' broadly enough. A chess game looks like it depends on history, but if 'state' includes the full board configuration, the next move depends only on that — making chess a (very high-dimensional) Markov chain.
What's the connection to PageRank?
PageRank is exactly a Markov chain. The state is the current web page; the transition is following a random outbound link. PageRank scores are the stationary distribution of this Markov chain — the long-run probabilities of being at each page after enough random clicking.