Suppose you want to solve f(x)=0f(x) = 0 for some function ff. Algebra can handle simple cases — quadratics with the quadratic formula, polynomials up to degree 4 with classical formulas (and not beyond, thanks to Galois theory). But for arbitrary functions, you need a different approach: an algorithm that produces successively better approximations to the answer.

Newton’s method (or Newton-Raphson, after Joseph Raphson who generalized Newton’s idea) is the gold standard. The basic step is: at each guess xnx_n, the next guess is

xn+1=xnf(xn)f(xn).x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.

That’s the whole algorithm. Two function evaluations per step. No deep theory. And yet, from a reasonable starting point, Newton’s method converges to a root with stunning speed — doubling the number of correct digits at each iteration.

Today, when your calculator computes 2\sqrt{2}, when an optimizer trains a neural network, when a CAD program intersects two surfaces, Newton’s method is running in the background. The mathematics is from 1669. The relevance is permanent.

The geometric picture

The cleanest way to understand Newton’s method is geometrically. At your current guess xnx_n, draw the tangent line to ff at the point (xn,f(xn))(x_n, f(x_n)). The tangent line crosses the xx-axis at some new point — call that xn+1x_{n+1}. Repeat.

f(x) x₀ x₁ x₂ root Newton's method: tangent line → next guess → repeat

Starting at x0x_0, the tangent to the curve there crosses the xx-axis at x1x_1. The tangent at x1x_1 crosses at x2x_2. Each new guess is closer to the actual root than the previous one — usually much closer.

The algebra matches the picture. The tangent line to ff at xnx_n has slope f(xn)f'(x_n) and passes through (xn,f(xn))(x_n, f(x_n)). Its equation:

yf(xn)=f(xn)(xxn).y - f(x_n) = f'(x_n)(x - x_n).

Set y=0y = 0 to find where the tangent crosses the xx-axis:

0f(xn)=f(xn)(xxn)    x=xnf(xn)f(xn).0 - f(x_n) = f'(x_n)(x - x_n) \implies x = x_n - \frac{f(x_n)}{f'(x_n)}.

That’s the Newton iteration. Tangent line geometry produces it directly.

Quadratic convergence

The reason Newton’s method is the standard root-finder is quadratic convergence: each iteration roughly squares the error.

More precisely, if en=xnre_n = x_n - r is the error (distance from the true root rr), then under reasonable conditions:

en+1Cen2e_{n+1} \approx C \cdot e_n^2

for some constant CC. Squaring the error means: if your current error is 10310^{-3}, the next error is roughly 10610^{-6}. The iteration after that, 101210^{-12}. After 5 iterations from a reasonable start, you’re typically at machine precision.

Compare this to bisection — the simpler method of halving an interval at each step. Bisection has linear convergence: each iteration halves the error. To go from 10310^{-3} to 10610^{-6} requires about 10 iterations. Bisection adds one digit per iteration; Newton’s method doubles digits per iteration.

For computing 2\sqrt{2}:

IterationNewtonBisection
01.51.5
11.4166666…1.25
21.41421568…1.375
31.41421356…1.4375
41.41421356237… (correct to 14 digits)1.40625

Newton’s method reaches 14 correct digits in 4 iterations. Bisection would need around 50 iterations to match.

A worked example: 2\sqrt{2}

To compute 2\sqrt{2}, we solve f(x)=x22=0f(x) = x^2 - 2 = 0. The derivative is f(x)=2xf'(x) = 2x. The Newton iteration is

xn+1=xnxn222xn=xn2+1xn.x_{n+1} = x_n - \frac{x_n^2 - 2}{2 x_n} = \frac{x_n}{2} + \frac{1}{x_n}.

This is the Babylonian method for square roots, known to ancient Babylonian mathematicians and rediscovered by Greeks (Hero of Alexandria, around 60 AD). It’s been used for two thousand years — and it’s Newton’s method applied to x2=2x^2 = 2.

Starting from x0=1x_0 = 1:

  • x1=1/2+1/1=1.5x_1 = 1/2 + 1/1 = 1.5
  • x2=1.5/2+1/1.5=0.75+0.666...=1.41666...x_2 = 1.5/2 + 1/1.5 = 0.75 + 0.666... = 1.41666...
  • x3=1.4142156...x_3 = 1.4142156...
  • x4=1.41421356237...x_4 = 1.41421356237...

After 4 iterations from x0=1x_0 = 1, you have 2\sqrt{2} to 11 decimal places. From x0=1.5x_0 = 1.5 (a better starting guess), 3 iterations suffice.

This is also exactly how your calculator computes 2\sqrt{2}. The hardware does a couple of Newton iterations on a hardware-aware starting point, and you get a result accurate to the machine precision.

When Newton’s method fails

Newton’s method isn’t magic. It can fail in several ways:

Bad starting point: if you start far from a root, the tangent line might intersect the xx-axis at a point even farther away. The iteration diverges.

Zero derivative: if f(xn)=0f'(x_n) = 0, the iteration formula is undefined — you’d divide by zero. Geometrically, the tangent is horizontal and doesn’t cross the xx-axis.

Inflection points: near inflection points (where ff'' changes sign), Newton’s method can oscillate or stall.

Cycle traps: rarely, the iteration can fall into a cycle of points and never converge.

Multiple roots: if rr is a root where ff also has zero derivative (a “multiple” or “tangential” root), convergence slows from quadratic to linear.

Non-differentiable functions: Newton’s method requires a derivative. For non-smooth functions (absolute value, max, etc.), the method doesn’t apply directly.

In practice, robust implementations of Newton’s method use safeguards: check that progress is being made, fall back to bisection if Newton fails to converge, restart with a better initial guess if needed.

Newton fractals — the beautiful failure mode

When you apply Newton’s method to a function with multiple roots, the basin-of-attraction question becomes interesting: which root does each starting point converge to?

For a real function with two real roots, the answer is usually simple: starting points left of some boundary go to the left root, points right go to the right root. The boundary is usually a single point.

For complex polynomials, the picture is wildly different. Consider f(z)=z31f(z) = z^3 - 1, which has three complex roots (one real, two complex). Apply Newton’s method starting from each point in the complex plane, and color the point by which root it converges to.

The result is the Newton fractal:

Newton fractal for z³ - 1: three basins meet in a fractal boundary Each color is a starting region converging to one of the three roots (white dots).

The boundaries between the three colored regions are not smooth lines but fractals — infinitely detailed, self-similar at every scale. Near any boundary point, all three basins approach arbitrarily close. Microscopic changes in starting position can send you to entirely different roots.

This is the same kind of structure as the Mandelbrot set and other complex-dynamics phenomena. Newton fractals are some of the most striking visual demonstrations of mathematical chaos — and they emerge from an algorithm whose simple job is to find roots.

Where Newton’s method appears

Beyond computing 2\sqrt{2}, Newton’s method is the foundation of much of numerical computation.

Calculators and CPUs: hardware square root and division on modern processors use Newton iteration (with cleverly chosen starting points from lookup tables). The square-root instruction in IEEE 754 floating-point is essentially Newton’s method.

Optimization: to find a minimum of g(x)g(x), find where g(x)=0g'(x) = 0. Newton’s method applied to gg' — using the second derivative gg'' for the iteration — is one of the fastest optimization methods. This is the basis of Newton’s method in optimization.

Machine learning: most ML training is optimization. Newton-like methods (or simplifications using only first derivatives, like gradient descent and Adam) underlie every training algorithm. For very small models, true Newton’s method (using the Hessian) is sometimes used; for large models, approximations are preferred because computing the full Hessian is expensive.

Curve and surface intersection: in CAD and 3D modeling, finding where two surfaces intersect typically uses Newton iteration in 2D or 3D.

Equations of state in physics: solving for equilibrium temperatures, pressures, or chemical compositions involves nonlinear systems — Newton’s method handles them.

Numerical PDE solvers: implicit time-stepping methods produce nonlinear equations at each step. Newton’s method (sometimes simplified to Quasi-Newton) solves them.

Computer graphics: ray tracing through curved surfaces, finding self-intersections, solving for shading equations.

Cryptography: certain primality tests and elliptic curve computations use Newton-like methods.

If your code computes anything nonlinear and reaches machine precision, Newton’s method is somewhere in the implementation.

Newton’s method in higher dimensions

The 1D Newton iteration extends naturally to systems. For f:RnRnf: \mathbb{R}^n \to \mathbb{R}^n, the iteration is

xn+1=xnJ1(xn)f(xn)\mathbf{x}_{n+1} = \mathbf{x}_n - J^{-1}(\mathbf{x}_n) \cdot f(\mathbf{x}_n)

where JJ is the Jacobian matrix of partial derivatives. The Jacobian inverse plays the role of 1/f(x)1/f'(x) in the 1D case.

For a 1000-variable system, the Jacobian is a 1000×1000 matrix. Inverting it is expensive (O(n3)O(n^3) operations) — so practical implementations don’t actually invert, they solve JΔx=fJ \cdot \Delta \mathbf{x} = -f at each step (which is O(n3)O(n^3) as a linear-algebra operation but cheaper in some structures).

For very large systems, Quasi-Newton methods approximate the Jacobian or its inverse instead of computing it exactly. BFGS, L-BFGS, and the SR1 method are widely-used quasi-Newton variants used in optimization software, including most ML libraries.

The structure of the 1D Newton iteration — local linearization, fast quadratic convergence near the solution, fragility far from the solution — extends to higher dimensions almost unchanged. Quadratic convergence in Rn\mathbb{R}^n is just as fast as in R\mathbb{R}.

The historical thread

Newton’s manuscript De Analysi per Aequationes Numero Terminorum Infinitas (around 1669, unpublished) contains the first description of the iteration applied to specific polynomial equations. Newton didn’t have the general framework; he showed by example that the technique worked.

Joseph Raphson, in 1690, published Analysis Aequationum Universalis, which gave a more general presentation. Raphson’s version is closer to what’s now standard, and historians sometimes credit him with the modern formulation — hence “Newton-Raphson method.”

Earlier roots exist. The Babylonian iteration for square roots is essentially Newton’s method applied to x2=ax^2 = a. Persian mathematicians studied related techniques. The synthesis under Newton and Raphson, plus systematic study, made the method foundational.

Modern analysis of convergence rates, basins of attraction, and modifications was developed in the 19th and 20th centuries. The understanding of Newton fractals came in the late 20th century with the rise of computer graphics enabling visualization.

What Newton’s method teaches

The deepest lesson of Newton’s method is that local linearization plus iteration is one of the most powerful tools in numerical mathematics. At each step, you approximate a complicated function by its tangent line (or hyperplane in higher dimensions), solve the easy linear approximation, and repeat. This pattern shows up throughout numerical analysis:

  • Implicit ODE solvers use Newton’s method at each time step.
  • Optimization methods use Newton’s method on the gradient.
  • Inverse function computation uses Newton’s method to invert.
  • Eigenvalue computation uses Newton-like iterations.

For a student of mathematics, Newton’s method is one of the cleanest demonstrations that derivatives are not just for theory — they’re the practical tool that makes numerical computation work. Without calculus, modern computing would have no good general way to solve nonlinear equations.

For everyday users, every time your calculator gives you a square root to 15 digits, or your phone unlocks via face recognition (which trains via Newton-like optimization), or your GPS computes your position from satellite signals, Newton’s method is doing work in the background. The 1669 mathematics is among the most-executed algorithms in the world.

For working scientists and engineers, Newton’s method is part of the standard toolkit. When you need to solve a nonlinear equation and want fast convergence, Newton (or a quasi-Newton variant) is usually the first thing to reach for. Robust implementations exist in every numerical library — SciPy, MATLAB, Mathematica, Maple, Julia — and they all do the same thing internally: iterate the tangent-line approximation until convergence.

A simple geometric idea — replace the curve with its tangent line, solve the linear approximation, repeat — turned out to be one of the most consequential algorithms in computational mathematics. Newton noticed it; Raphson generalized it; the world has been using it ever since.

That’s the kind of long-tail impact that the best mathematical ideas have. The next time you reach for 2\sqrt{2}, remember: somewhere in the chip on your device, a tangent line is being drawn.

Frequently asked

Did Newton actually invent this method?

Partly. Newton described a similar iterative idea around 1669 in unpublished work, applying it to specific polynomials. Joseph Raphson generalized and published it in 1690, applying it to any polynomial — and his presentation is closer to the modern version. So 'Newton-Raphson method' is sometimes used. Earlier, similar root-finding ideas appeared in Persian mathematics; the modern algorithm is the synthesis.

Why does Newton's method converge so fast?

Because it has quadratic convergence: each iteration roughly doubles the number of correct digits. If you're 3 digits accurate, the next iteration gives 6, then 12, then 24. Most other methods (bisection, secant) converge linearly — they add one digit per iteration. Quadratic convergence means Newton's method usually reaches machine precision in 5–10 iterations from a reasonable starting point.

When does Newton's method fail?

Several ways. If the starting point is far from the root, iterations can diverge. If the derivative is zero or near-zero at a root, convergence slows or fails. For complex starting points and complex roots, the method produces beautiful but chaotic 'Newton fractals.' For non-differentiable functions, the method can't be applied at all.