A random variable $X$ has mean $\mu$ and standard deviation $\sigma$. How likely is it that $X$ deviates from its mean by more than a given amount? The full answer depends on the shape of the distribution: bell-curve, uniform, heavy-tailed, anything. To compute the exact probability, you need to know the whole distribution.
But suppose you do not know the distribution. You know only the mean and the variance. Can you still say anything? The remarkable answer, given by Pafnuty Chebyshev in 1867 (after a forgotten earlier proof by Bienaymé in 1853), is yes, and the bound is striking in its simplicity. For any random variable $X$ with mean $\mu$ and finite variance $\sigma^2$, and any $k > 0$:
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$
The probability of being $k$ or more standard deviations from the mean is at most $1/k^2$, no matter what distribution $X$ has. This is Chebyshev’s inequality, and it is one of the most-used results in all of probability theory. This article is about what the inequality says, why it works regardless of distribution, and how the gap between its conservative bound and the actual tail of, say, a normal distribution reveals the difference between universal and shape-specific reasoning.
The bound in concrete terms
For $k = 2$ Chebyshev gives $P(|X - \mu| \ge 2\sigma) \le 1/4$. No more than a quarter of any distribution’s probability can lie two or more standard deviations from the mean. For $k = 3$ the bound is $1/9 \approx 11.1\%$. For $k = 4$, $1/16 = 6.25\%$. The bound is not tight for most everyday distributions, but it is always valid, no matter how heavy-tailed or bizarre the distribution might be.
This is genuinely surprising. The mean and variance are just two numbers. They contain only a little information about a distribution. Yet they fix an upper bound on every tail probability of the distribution simultaneously. This is the simplest example of a concentration inequality — a statement that random variables with finite second moments cannot wander too far from their mean too often, regardless of their detailed structure.
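The claim can be checked empirically. The following is a minimal sketch using only Python's standard library; the three test distributions, the sample size, and the seed are arbitrary choices made here for illustration. Because a finite sample is itself a distribution with its own mean and variance, the observed fraction can never exceed $1/k^2$ when the sample mean and the population-style sample standard deviation are used.

```python
import random
import statistics


def tail_fraction(samples, k):
    """Fraction of samples at least k standard deviations from the sample mean."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # population-style standard deviation
    return sum(abs(x - mu) >= k * sigma for x in samples) / len(samples)


def chebyshev_bound(k):
    """Chebyshev's upper bound 1/k^2, capped at 1 because it bounds a probability."""
    return min(1.0, 1.0 / k**2)


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n = 20_000
samples_by_name = {
    "uniform(0, 1)": [rng.random() for _ in range(n)],
    "exponential(1)": [rng.expovariate(1.0) for _ in range(n)],
    "lognormal(0, 1)": [rng.lognormvariate(0.0, 1.0) for _ in range(n)],
}

for name, samples in samples_by_name.items():
    for k in (2, 3, 4):
        print(f"{name:16s} k={k}: observed {tail_fraction(samples, k):.4f} "
              f"<= bound {chebyshev_bound(k):.4f}")
```

In every row the observed fraction sits below the bound, usually by a wide margin, even for the heavy-tailed lognormal sample.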
The two-line proof
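The cleanest proof goes through Markov’s inequality, a still simpler result. For any non-negative random variable $Y$ and any positive $a$, Markov’s inequality says
$$P(Y \ge a) \le \frac{E[Y]}{a}.$$
The intuition: if $Y$ were at least $a$ with probability greater than $E[Y]/a$, those outcomes alone would push the average above $E[Y]$, which is impossible. Apply Markov’s inequality to the non-negative random variable $Y = (X - \mu)^2$, with $a = k^2\sigma^2$:
$$P\big((X - \mu)^2 \ge k^2\sigma^2\big) \le \frac{E[(X - \mu)^2]}{k^2\sigma^2} = \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}.$$
But the event $(X - \mu)^2 \ge k^2\sigma^2$ is exactly the event $|X - \mu| \ge k\sigma$, so we have Chebyshev’s bound:
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$
That is the entire proof: one application of Markov’s inequality to the squared deviation. Notice that nothing about the shape of $X$’s distribution enters except its mean and variance. The bound is distribution-free.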
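For completeness, Markov’s inequality itself follows from a single comparison. The sketch below takes a non-negative random variable $Y$ and a threshold $a > 0$, and writes $\mathbf{1}\{\cdot\}$ for an indicator function (notation introduced here for the derivation, not taken from the text above):

```latex
E[Y] \;\ge\; E\big[\,Y \,\mathbf{1}\{Y \ge a\}\,\big]
     \;\ge\; E\big[\,a \,\mathbf{1}\{Y \ge a\}\,\big]
     \;=\; a \, P(Y \ge a)
```

The first step drops the part of the average where $Y < a$, which is allowed because $Y$ is non-negative; the second replaces $Y$ by the smaller value $a$ on the event $Y \ge a$. Dividing by $a$ gives the inequality.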
How conservative is the bound, really?
For a normal distribution with mean $\mu$ and standard deviation $\sigma$, the actual tail probabilities are known exactly. The picture below shows Chebyshev’s bound (in red) alongside the actual probability for a normal distribution (in blue), at $k = 1, 2, 3, 4$.
For $k = 1$ Chebyshev’s bound is trivial: it just says the probability is at most $1$. For $k = 2$ the gap opens up: Chebyshev allows up to $25\%$, but a normal distribution actually only puts about $4.55\%$ beyond two sigmas. At $k = 3$ the gap is enormous: the Chebyshev bound is $11.1\%$ while the normal distribution puts only $0.27\%$ that far out, a factor of about forty. At $k = 4$ the gap is roughly a thousand-fold.
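These figures can be reproduced directly. The sketch below uses the identity $P(|X - \mu| \ge k\sigma) = \operatorname{erfc}(k/\sqrt{2})$ for a normal $X$ and compares it with $1/k^2$; the function names are illustrative choices.

```python
import math


def chebyshev_bound(k):
    """Chebyshev's upper bound 1/k^2, capped at 1."""
    return min(1.0, 1.0 / k**2)


def normal_two_sided_tail(k):
    """P(|X - mu| >= k*sigma) for a normal X, via the complementary error function."""
    return math.erfc(k / math.sqrt(2))


for k in (1, 2, 3, 4):
    bound = chebyshev_bound(k)
    exact = normal_two_sided_tail(k)
    print(f"k={k}: Chebyshev {bound:.4%}, normal {exact:.4%}, ratio {bound / exact:.0f}x")
```

The ratio column makes the widening gap explicit: it grows from single digits at small $k$ to about forty at $k = 3$ and close to a thousand at $k = 4$.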
This is not a defect in Chebyshev’s bound; it is the price of universality. The inequality has to be valid for every distribution, including pathological ones with extremely heavy tails. The bound is sharp: for every $k \ge 1$ there is a distribution that attains equality. The classic example is a three-point distribution that puts probability $1/(2k^2)$ on each of $\mu \pm k\sigma$ and the rest, $1 - 1/k^2$, on $\mu$ itself; this distribution has mean $\mu$ and variance $\sigma^2$, and it attains Chebyshev’s bound exactly. So you cannot tighten the inequality further without assuming something about the shape of the distribution.
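The extremal example can be verified with exact rational arithmetic. This is a small sketch: it places mass $1/(2k^2)$ at each of $\mu - k\sigma$ and $\mu + k\sigma$ and the remainder at $\mu$, and the values of `mu`, `sigma`, and `k` are arbitrary illustrative choices.

```python
from fractions import Fraction


def extremal_distribution(mu, sigma, k):
    """Mass 1/(2k^2) at mu - k*sigma and at mu + k*sigma, the remainder at mu."""
    p = Fraction(1, 2 * k * k)
    return {mu - k * sigma: p, mu: 1 - 2 * p, mu + k * sigma: p}


def mean(dist):
    return sum(x * p for x, p in dist.items())


def variance(dist):
    m = mean(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())


def tail_probability(dist, mu, sigma, k):
    """Total mass at least k*sigma away from mu."""
    return sum(p for x, p in dist.items() if abs(x - mu) >= k * sigma)


mu, sigma, k = Fraction(0), Fraction(1), 3  # illustrative values
dist = extremal_distribution(mu, sigma, k)
print(mean(dist), variance(dist), tail_probability(dist, mu, sigma, k))
# prints: 0 1 1/9
```

The mean and variance come out as prescribed, and the tail probability equals $1/k^2$ exactly, so no universal bound smaller than $1/k^2$ is possible.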
Two important uses
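Proving the weak law of large numbers. Suppose $X_1, X_2, \dots, X_n$ are independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2$. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the sample mean. By linearity of expectation and the formula for variance of independent sums, $E[\bar{X}_n] = \mu$ and $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$.
Apply Chebyshev’s inequality to $\bar{X}_n$, whose standard deviation is $\sigma/\sqrt{n}$, with $k = \varepsilon\sqrt{n}/\sigma$ for any fixed $\varepsilon > 0$:
$$P(|\bar{X}_n - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2}.$$
The right side goes to zero as $n \to \infty$. So the sample mean converges in probability to the true mean. This is the weak law of large numbers, proved in essentially one line via Chebyshev. It is a standard pedagogical route in probability courses, and it shows the inequality is not just a curiosity but a workhorse of the basic theory.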
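A short simulation illustrates the argument. This sketch assumes Uniform(0, 1) draws, which have mean $1/2$ and variance $1/12$; the tolerance `eps`, the trial count, and the seed are arbitrary illustrative choices. It compares how often the sample mean misses the true mean by `eps` or more with the Chebyshev bound $\sigma^2/(n\varepsilon^2)$.

```python
import random
import statistics


def chebyshev_wlln_bound(sigma2, n, eps):
    """Chebyshev bound sigma^2 / (n * eps^2) on P(|sample mean - mu| >= eps), capped at 1."""
    return min(1.0, sigma2 / (n * eps**2))


def miss_frequency(n, eps, trials, rng):
    """Fraction of trials where the mean of n Uniform(0, 1) draws misses 0.5 by eps or more."""
    misses = 0
    for _ in range(trials):
        sample_mean = statistics.fmean([rng.random() for _ in range(n)])
        misses += abs(sample_mean - 0.5) >= eps
    return misses / trials


rng = random.Random(1)  # fixed seed, chosen arbitrarily
sigma2 = 1 / 12         # variance of Uniform(0, 1)
eps = 0.05
for n in (10, 100, 1000):
    print(f"n={n}: observed {miss_frequency(n, eps, 2000, rng):.4f}, "
          f"bound {chebyshev_wlln_bound(sigma2, n, eps):.4f}")
```

Both columns shrink as $n$ grows. The observed frequency falls much faster than the bound, which is the same conservatism seen in the normal comparison.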
Distribution-free confidence intervals. In statistics, when nothing is known about the distribution of an estimator beyond its mean and variance, Chebyshev’s inequality gives a usable confidence interval. For an unbiased estimator with known standard deviation, the interval extending $k$ standard deviations either side of the estimate covers the true value with probability at least $1 - 1/k^2$. The interval is wide, wider than what a Gaussian-based interval would give, but it is guaranteed to be conservative. This guarantee is occasionally exactly what is needed: in regulatory or safety-critical settings, a Chebyshev-style bound that holds for any distribution is more trustworthy than a Gaussian-style bound that assumes the data are normally distributed.
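To see how much wider the distribution-free interval is, one can compare the multipliers directly. The sketch below assumes the estimator's standard deviation is known exactly; solving $1 - 1/k^2 = c$ for a target coverage $c$ gives $k = 1/\sqrt{1 - c}$, and the Gaussian multiplier comes from the standard library's `statistics.NormalDist`. The function names are illustrative.

```python
import math
from statistics import NormalDist


def chebyshev_multiplier(confidence):
    """Smallest k with 1 - 1/k^2 >= confidence, i.e. k = 1 / sqrt(1 - confidence)."""
    return 1.0 / math.sqrt(1.0 - confidence)


def gaussian_multiplier(confidence):
    """Two-sided normal quantile; valid only if the estimator really is Gaussian."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


for confidence in (0.90, 0.95, 0.99):
    k_cheb = chebyshev_multiplier(confidence)
    z_gauss = gaussian_multiplier(confidence)
    print(f"{confidence:.0%}: Chebyshev k={k_cheb:.2f}, Gaussian z={z_gauss:.2f}, "
          f"width ratio {k_cheb / z_gauss:.2f}")
```

At 95% coverage the Chebyshev multiplier is $\sqrt{20} \approx 4.47$ against the familiar Gaussian $1.96$, so the distribution-free interval is a little over twice as wide.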
Why the bound has to be loose
If you know nothing about a distribution except its mean and variance, the worst case is genuinely bad. For any $k \ge 1$, a distribution can put as much as $1/k^2$ of its mass exactly $k$ standard deviations from the mean, split equally between the two sides, while keeping all the rest at the mean, and that distribution will hit Chebyshev’s bound exactly. So the universal bound cannot be tightened.
But it can be tightened if you know more. Markov’s inequality bounds a tail using just the mean. Chebyshev’s bound uses the mean and variance, and is correspondingly tighter. Hoeffding’s inequality bounds the tail of a sum of bounded independent random variables much more tightly than Chebyshev does, with exponential rather than polynomial decay. Chernoff’s bound and the Bernstein, Bennett, and Talagrand inequalities give sharper results still under their own additional assumptions. The whole modern theory of concentration of measure, one of the central themes of high-dimensional probability and a key tool in modern statistics and machine learning, is the search for the right inequality at the right level of assumption.
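The effect of the extra assumptions can be seen numerically. The sketch below takes the mean of $n$ fair coin flips (values in $[0, 1]$, variance $1/4$) as an illustrative case and compares Chebyshev's bound $\sigma^2/(nt^2)$ with the two-sided Hoeffding bound $2\exp(-2nt^2/(b - a)^2)$ for variables confined to $[a, b]$. The deviation `t` and the sample sizes are arbitrary choices.

```python
import math


def chebyshev_mean_bound(n, t, variance=0.25):
    """Chebyshev: P(|mean - mu| >= t) <= variance / (n t^2); 0.25 is the fair-coin variance."""
    return min(1.0, variance / (n * t**2))


def hoeffding_mean_bound(n, t, lo=0.0, hi=1.0):
    """Hoeffding: P(|mean - mu| >= t) <= 2 exp(-2 n t^2 / (hi - lo)^2) for values in [lo, hi]."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t**2 / (hi - lo) ** 2))


t = 0.1
for n in (100, 1000, 10000):
    print(f"n={n}: Chebyshev {chebyshev_mean_bound(n, t):.3e}, "
          f"Hoeffding {hoeffding_mean_bound(n, t):.3e}")
```

For small $n$ the two bounds are comparable, and Chebyshev can even be slightly smaller. As $n$ grows, the exponential decay of Hoeffding's bound overtakes the $1/n$ decay of Chebyshev's by many orders of magnitude.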
But the entry point is always Chebyshev: a single, simple, universally valid bound. Two lines of proof. Knowledge required: only mean and variance. The fact that this works at all is itself a strong piece of mathematical news, and it is the reason a 19th-century lemma still appears at the start of many modern courses on concentration inequalities.
A small inequality that became the foundation of a subject
Chebyshev’s inequality is, on the face of it, a small statement. It contains only one inequality, two parameters, and a single proof step. But that small statement turned out to be the cornerstone of an entire area of probability theory. The reason is that the same algebraic move, applying Markov’s inequality to a cleverly chosen non-negative function of the random variable, is the engine of nearly all concentration arguments. Square the deviation and apply Markov, and you get Chebyshev. Take an exponential of the deviation and apply Markov, and you get Chernoff. Bound the resulting moment generating function for bounded variables, and you get Hoeffding. The shape of the argument never changes; only the function chosen and the assumptions on the distribution do.
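Written side by side, the two simplest instances of the template look like this. This is a sketch for a random variable $X$ with mean $\mu$, a deviation $t > 0$, and a free parameter $\lambda > 0$ (symbols introduced here for the comparison); the second line assumes the exponential moment is finite.

```latex
% Squared deviation (Chebyshev):
P(|X - \mu| \ge t) = P\big((X - \mu)^2 \ge t^2\big)
  \le \frac{E[(X - \mu)^2]}{t^2}

% Exponential of the deviation (Chernoff), for any \lambda > 0:
P(X - \mu \ge t) = P\big(e^{\lambda (X - \mu)} \ge e^{\lambda t}\big)
  \le e^{-\lambda t} \, E\big[e^{\lambda (X - \mu)}\big]
```

In both lines the first step rewrites the event using an increasing function and the second is Markov’s inequality. The Chernoff form is then optimised over $\lambda$, which is where assumptions such as boundedness or independence enter.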
This is one of the most consequential examples in mathematics of a single template becoming a research programme. Bienaymé wrote down the trick in 1853 because he needed it for a specific computation. Chebyshev recognised its generality fourteen years later. A century and a half on, the same template, applied at every possible level of sophistication, underlies the analysis of randomised algorithms, the convergence proofs of stochastic gradient descent, the design of approximation schemes in numerical analysis, and a great deal of modern probability theory. Each of these results, when you trace it back, lives in the lineage of one short line of nineteenth-century algebra: square the deviation, apply Markov, and read off the bound.
Frequently asked
Who first proved Chebyshev's inequality?
Irénée-Jules Bienaymé proved the inequality in 1853, and Pafnuty Chebyshev rediscovered and popularised it in 1867. Bienaymé's earlier work was largely forgotten for decades; Chebyshev used the inequality as a lemma in his proof of a version of the law of large numbers, and his name has been attached to it ever since. Modern probability sometimes calls it the Bienaymé–Chebyshev inequality to acknowledge both, though the shorter form is far more common.
Why does the bound work for any distribution?
Because the proof uses no information about the distribution beyond its mean and variance. Specifically, it applies Markov's inequality to the non-negative random variable (X − μ)², whose expectation is exactly the variance σ². The bound P(|X − μ| ≥ kσ) ≤ σ²/(kσ)² = 1/k² then follows by one substitution. Since every distribution with finite variance has a well-defined mean and variance, the same bound applies to all of them — the entire derivation is two lines of algebra and uses absolutely nothing else about the shape of the distribution.
How does Chebyshev's inequality compare with the actual tails of a normal distribution?
It is dramatically more pessimistic. For a normal distribution, the probability of being more than 3σ from the mean is about 0.27%, but Chebyshev's bound only promises 11.1%, over forty times looser. At 4σ the gap is even wider: Chebyshev says 'at most 6.25%' while the normal distribution gives about 0.0063%, a gap of roughly a thousand-fold. The universal bound has to be conservative because it must work for distributions that genuinely do have heavy tails, even though most everyday distributions are much closer to Gaussian than Chebyshev's bound implies.
What is the inequality used for?
Three big things. First, the standard proof of the weak law of large numbers — that the sample average converges to the true mean — uses Chebyshev as its main tool. Second, in statistics and machine learning, Chebyshev provides distribution-free confidence intervals when nothing better is available. Third, it is the entry point to the modern theory of concentration of measure, which produces much tighter bounds for sums of independent random variables and powers most of modern high-dimensional probability and statistics. The inequality is one of the few results in probability that gets used at every level of the subject, from a first lecture course to current research.