There’s a common perception that artificial intelligence is alchemy — a mysterious process where data goes in and cognition comes out. It isn’t. Modern AI is mathematics: linear algebra, calculus, probability, and optimization, applied at enormous scale.
This is an overview of the mathematical structures that make AI work. You won’t become a machine learning engineer from reading it, but you’ll understand what’s happening when people talk about training, gradient descent, attention, and embeddings.
Everything is a vector
The single most important mathematical move in modern AI is: represent everything as a vector in a high-dimensional space.
Words, images, user behaviors, audio clips, molecules: all get mapped to points in $\mathbb{R}^n$ for some large $n$ (typically a few hundred to a few thousand). Similar objects end up close together; different objects end up far apart.
The mathematics that lets you do useful things with such vectors — add them, measure distances, project them, rotate them — is linear algebra. Every time an AI model processes input, what’s happening under the hood is matrix-vector multiplication on an absolutely enormous scale.
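As a minimal sketch of what "close together" and "far apart" can mean, the snippet below compares toy 3-dimensional vectors using cosine similarity, one common way to measure closeness. The vectors and the names `cat`, `kitten`, and `car` are made up for illustration; real embeddings are learned from data and have far more dimensions.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means same direction."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical toy "embeddings" (illustrative values, not from a real model).
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.95]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_car = cosine_similarity(cat, car)
print(sim_cat_kitten, sim_cat_car)
```

With these made-up numbers, `cat` scores much higher against `kitten` than against `car`, which is the kind of geometry a trained model is meant to produce.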
An $m \times n$ matrix is a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$. A neural network with $k$ layers is essentially an alternating composition of $k$ matrices and simple nonlinear functions (activations). Compute a forward pass of a neural network, and you’re multiplying matrices.
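Here is a rough sketch of that alternating composition, written in plain Python so every multiplication is visible. The weights `W1` and `W2`, the choice of ReLU as the activation, and the omission of bias terms are all simplifying assumptions for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    """Elementwise ReLU activation: max(0, value)."""
    return [max(0.0, a) for a in v]

def forward(layers, x):
    """Apply each weight matrix in turn, with ReLU between layers (none after the last)."""
    for i, W in enumerate(layers):
        x = matvec(W, x)
        if i < len(layers) - 1:
            x = relu(x)
    return x

# Hypothetical weights: a 2x3 matrix (R^3 -> R^2), then a 1x2 matrix (R^2 -> R^1).
W1 = [[1.0, -1.0, 0.0],
      [0.0, 2.0, -1.0]]
W2 = [[1.0, 1.0]]

output = forward([W1, W2], [1.0, 2.0, 3.0])
print(output)  # [1.0]
```

A production model does the same thing with far larger matrices, usually on hardware built for matrix multiplication.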
Training is optimization
A neural network starts with random weights. Training is the process of adjusting those weights so the network makes useful predictions on training data.
The mathematical problem: you have a loss function $\mathcal{L}(\theta)$ that measures how wrong the network is, parameterized by weights $\theta \in \mathbb{R}^p$ (where $p$ might be in the billions). You want to find $\theta$ minimizing $\mathcal{L}(\theta)$.
Gradient descent
The standard approach is gradient descent:
$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t)$$
Move the weights $\theta$ in the direction opposite the gradient of the loss $\mathcal{L}$. The step size $\eta$ is called the learning rate. Computing the gradient of a neural network requires the chain rule, a result from first-year calculus, applied recursively through each layer. The algorithm that does this efficiently is backpropagation.
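The update rule (new weights = old weights minus learning rate times gradient) can be sketched on a toy problem. The two-parameter loss below, with its minimum at (3, -1), is a made-up stand-in for a real network's loss, and its gradient is written out by hand rather than computed by backpropagation.

```python
def loss(theta):
    """Toy loss: a bowl with its minimum at theta = (3, -1)."""
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def grad(theta):
    """Gradient of the toy loss, worked out by hand."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

def gradient_descent(theta, learning_rate=0.1, steps=100):
    """Repeatedly step in the direction opposite the gradient."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - learning_rate * gi for t, gi in zip(theta, g)]
    return theta

theta_final = gradient_descent([0.0, 0.0])
print(theta_final, loss(theta_final))
```

On this bowl-shaped loss, a learning rate of 0.1 shrinks the distance to the minimum by a constant factor each step. Real losses are far less well behaved, which is why choosing the learning rate takes care in practice.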
This is why multivariable calculus is at the heart of AI: every gradient-based training step is a numerical application of the chain rule to a function that can have billions of inputs (the weights) and a single scalar output (the loss).
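To see the chain rule at work, consider a deliberately tiny "network" with two weights: the prediction is `w2 * tanh(w1 * x)` and the loss is the squared error against a target. This model and its numbers are assumptions chosen for illustration. The sketch computes the gradient by hand, working backward through each step, and checks it against a finite-difference estimate.

```python
import math

def forward_and_backward(w1, w2, x, target):
    """Return (loss, dloss/dw1, dloss/dw2) for loss = (w2 * tanh(w1 * x) - target)^2."""
    # Forward pass: keep intermediate values for reuse in the backward pass.
    z = w1 * x
    h = math.tanh(z)
    y = w2 * h
    loss = (y - target) ** 2

    # Backward pass: apply the chain rule from the loss back toward the weights.
    dloss_dy = 2.0 * (y - target)
    dloss_dw2 = dloss_dy * h             # because y = w2 * h
    dloss_dh = dloss_dy * w2
    dloss_dz = dloss_dh * (1.0 - h * h)  # d tanh(z)/dz = 1 - tanh(z)^2
    dloss_dw1 = dloss_dz * x             # because z = w1 * x
    return loss, dloss_dw1, dloss_dw2

def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of df/dw, used only as a sanity check."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
loss_value, g1, g2 = forward_and_backward(w1, w2, x, target)
n1 = numerical_grad(lambda w: forward_and_backward(w, w2, x, target)[0], w1)
n2 = numerical_grad(lambda w: forward_and_backward(w1, w, x, target)[0], w2)
print(g1, n1)
print(g2, n2)
```

Backpropagation in a real framework automates exactly this bookkeeping, reusing the intermediate values from the forward pass so the whole gradient costs roughly as much as a few forward passes.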
Probability everywhere
Most modern AI models don’t output a single answer; they output a probability distribution. A language model’s next-token output is a distribution over the entire vocabulary. A classifier’s prediction is a distribution over classes. An image model’s output can be a distribution over pixel values.
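One common way a model turns raw scores into a distribution is the softmax function. The sketch below assumes a made-up four-word vocabulary and made-up scores; a real language model would produce one score per token in a vocabulary of tens of thousands.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into positive probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and raw scores (logits) for the next token.
vocab = ["cat", "dog", "car", "the"]
logits = [2.0, 1.0, 0.1, -1.0]

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

The highest-scoring word gets the largest probability, but every word keeps some probability mass, which is what lets a model sample varied outputs.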
The Gaussian distribution shows up constantly: as an assumption about noise, as a prior on the weights that acts as a regularizer, and as the noise model under which least-squares linear regression coincides with maximum-likelihood estimation.
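Bayes’ theorem underlies a huge portion of AI reasoning about uncertainty:
$$P(H \mid E) = \frac{P(E \mid H) \, P(H)}{P(E)}$$
Here $H$ is a hypothesis and $E$ is the observed evidence: the prior $P(H)$ becomes the posterior $P(H \mid E)$ once $E$ is seen.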
Every time an AI model “updates its beliefs” in light of new evidence, there is, in principle, Bayes’ theorem doing the bookkeeping.
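A small worked example may help. Bayes’ theorem says the posterior probability of a hypothesis equals the likelihood of the evidence under that hypothesis, times the prior, divided by the overall probability of the evidence. The spam-filter numbers below are invented for illustration: 20% of mail is spam, and a certain word appears in 60% of spam but only 5% of other mail.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(H | E) computed from P(H), P(E | H), and P(E | not H)."""
    # P(E) by the law of total probability.
    evidence = likelihood * prior + likelihood_if_not * (1.0 - prior)
    return likelihood * prior / evidence

p_spam_given_word = posterior(prior=0.2, likelihood=0.6, likelihood_if_not=0.05)
print(round(p_spam_given_word, 3))  # 0.75
```

Seeing the word moves the probability of spam from 20% to 75%: the evidence is twelve times more likely under the spam hypothesis, and the theorem converts that ratio into an updated belief.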
The attention mechanism
Transformers — the architecture behind ChatGPT, Claude, Gemini, and nearly every modern LLM — are built around a specific piece of linear algebra called attention.
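For input tokens $x_1, \dots, x_n$, compute three projections: queries $Q$, keys $K$, values $V$. Attention is:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
Here $d_k$ is the dimension of the key vectors, and the softmax is applied to each row.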
This one equation — a softmax-weighted sum of values, where weights come from dot products of queries and keys — is the key to how transformers work. It’s pure linear algebra plus a normalization.
Understanding why this works in practice is an active research area. Mathematically, it’s just matrix multiplication, a softmax (a smoothed argmax), and a scaled dot product.
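The following is a minimal sketch of scaled dot-product attention in plain Python: dot products of queries with keys, divided by the square root of the key dimension, passed through a softmax, then used to weight the values. The toy `Q`, `K`, and `V` are invented for illustration, and the learned projection matrices that produce them from tokens in a real transformer are left out.

```python
import math

def softmax(scores):
    """Turn real scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of query, key, and value vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Softmax-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs

# Hypothetical toy inputs: 2 queries, 3 key/value pairs, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

result = attention(Q, K, V)
print(result)
```

Each query lines up strongly with one key, so its output is almost exactly that key's value vector. With less extreme scores, the output would be a genuine blend of several values.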
Stochastic optimization
Full gradient descent computes the gradient over the entire dataset at every step; with billions of examples, that is prohibitively slow. Modern training uses stochastic gradient descent (SGD): at each step, estimate the gradient using a small random batch of examples.
This turns training into a stochastic process. The mathematical theory of SGD — convergence rates, noise characteristics, the effect of batch size — draws on stochastic analysis and probability theory. Recent work even connects SGD to random matrix theory and, remarkably, to spin glasses in statistical physics.
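As a rough sketch of the idea, the code below fits a line to synthetic data using mini-batch SGD. Everything here is an assumption made for illustration: the data (generated from y = 2x + 1 plus noise), the batch size of 16, the learning rate, and the fixed random seed.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical synthetic data: y = 2x + 1 plus a little Gaussian noise.
data = []
for _ in range(1000):
    x = random.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + 1.0 + random.gauss(0.0, 0.1)))

def batch_gradient(w, b, batch):
    """Gradient of mean squared error over one batch, for the model y = w*x + b."""
    gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
    gb = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
    return gw, gb

w, b = 0.0, 0.0
learning_rate, batch_size = 0.1, 16
for step in range(2000):
    batch = random.sample(data, batch_size)  # small random batch of examples
    gw, gb = batch_gradient(w, b, batch)
    w -= learning_rate * gw
    b -= learning_rate * gb

print(w, b)
```

Each step looks at only 16 of the 1000 examples, so each gradient is a noisy estimate. The estimates still point the right way on average, and the parameters settle near the slope and intercept used to generate the data, jittering slightly around them.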
Why the math matters
Understanding the mathematics changes how you read AI news.
- “A new algorithm trains 10× faster.” This is almost always about optimization: a better way to take gradient steps, a smarter use of momentum, or a better regularizer.
- “A new architecture learns better.” This is about function approximation: a better parameterization of the space of functions the model can represent.
- “The model is hallucinating.” This often traces back to the probability distributions the model was trained on, combined with the maximum-likelihood objective that most LLMs are trained against.
- “We don’t know why it works.” This is honest. Many empirical successes in deep learning don’t yet have clean mathematical explanations. The gap between practice and theory is real, and closing it is an active area of research.
Where this connects
The mathematical structure behind AI isn’t new. Linear algebra was systematized in the 19th century; probability theory has a rigorous foundation from Kolmogorov (1933); gradient-based optimization goes back to Cauchy.
What’s new is the scale. Modern AI is the largest-scale application of 19th- and early-20th-century mathematics ever undertaken. That we can build something that feels like intelligence from matrix multiplication and probability theory tells us something, though what, exactly, is a question for the next century of mathematics to answer.
If you want to go deeper, start with linear algebra (Strang’s textbook or Axler’s Linear Algebra Done Right), then multivariable calculus with attention to the chain rule, then an introductory machine learning textbook (e.g. Bishop, Murphy, or Hastie–Tibshirani–Friedman). Everything in modern AI is built on that foundation.
Frequently asked
Do you need calculus to understand AI?
To really understand it, yes — especially multivariable calculus for gradient descent, the algorithm at the heart of training neural networks. But you can get a working feel for AI with linear algebra alone.
Is AI really ‘just’ mathematics?
At the formal level, yes: an AI model is a mathematical function whose parameters are tuned using statistical and optimization techniques. What’s remarkable is that relatively simple mathematical structures, trained on enough data, can produce behavior that feels intelligent.