
How AI Works: No Magic, Just Mathematics

This article doesn’t attempt to cover every topic in the AI universe or replicate a textbook. Instead, the goal is to give you a clear foundation and a practical mental model, enough to take you from absolute zero to an understanding of how neural networks really work. To make this journey accessible, we focus on the essential mathematical pillars behind modern AI: linear algebra, calculus, multivariable calculus, and probability theory.

Figure 1: Roadmap to Understanding AI [1]

Linear algebra helps us describe and represent models, while calculus provides the tools to fit those models to data, optimizing them step by step. Probability theory ties everything together by giving us a mathematical framework for reasoning under uncertainty, which is essential for making reliable predictions. Let’s take a closer look at each of these foundations.

1. Linear Algebra

In AI, everything revolves around prediction. We build models that estimate the next value, the next word, or the next action, and we optimize them so that those predictions move closer to reality. These predictive models, including neural networks, are ultimately mathematical functions trained using tools from calculus. However, the structure of these models is expressed through linear algebra. Concepts like vectors, matrices, and matrix multiplication form the language in which neural networks are defined and computed. Every layer of a neural network applies a series of matrix operations to transform inputs into more meaningful representations [2].

1.1 Vectors and Vector Spaces

To build intuition, imagine the familiar Cartesian (Euclidean) plane with an x-axis and a y-axis. Pick two points: the origin (0, 0) and another point (x1, y1). If you draw an arrow from the origin to the second point, you have just plotted a vector.

Figure 2: Plotting a vector from (0, 0) to (x1, y1)

We can plot additional vectors as well. Vectors can be added to each other by adding their components, or they can be scaled by multiplying them with a scalar.

Figure 3: A vector scaled by a constant c, and an example of vector addition
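To make these two operations concrete, here is a minimal NumPy sketch (the vectors and the scaling constant are made up for illustration):

```python
import numpy as np

# Two vectors in the plane
u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

print(u + v)      # component-wise addition -> [4. 1.]
print(2.5 * u)    # scaling by the constant c = 2.5 -> [2.5 5. ]
```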

In real-world AI systems, vector spaces often have hundreds or thousands of dimensions. They are too complex to visualize mentally, but the 2D Euclidean model provides all the intuition we need. Formally, a set of vectors V is a vector space over the real numbers if you can:
  • add any two vectors in V, and
  • scale any vector by a real number
while still staying inside the space.

1.2 Normed Spaces

Once vectors exist in a space, the next step is measuring their size and the distance between them. In Euclidean space, the most common way to measure a vector’s length is with the Euclidean norm:

Figure 4: Euclidean Norm

This formula is essentially the Pythagorean theorem written in vector form.

Figure 5: Pythagorean Theorem in geometric form

In higher dimensions, the Euclidean norm is simply the repeated application of the Pythagorean theorem.

Figure 6: Euclidean norm in higher dimensions

We can generalize distance using the p-norm family, expressed as:

Figure 7: p-norm family

  • When p=2, we get the Euclidean norm.
  • When p=1, we get the Manhattan norm, where distance is computed like navigating a city grid.

Figure 8: Manhattan norm illustration
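In code, the whole p-norm family is available through NumPy’s norm function; the sketch below uses an arbitrary example vector:

```python
import numpy as np

v = np.array([3.0, 4.0])

euclidean = np.linalg.norm(v, ord=2)   # sqrt(3^2 + 4^2) = 5.0
manhattan = np.linalg.norm(v, ord=1)   # |3| + |4| = 7.0
p3 = np.linalg.norm(v, ord=3)          # (3^3 + 4^3)^(1/3) ≈ 4.50

print(euclidean, manhattan, p3)
```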

Some norms originate from an inner product. When the inner product between two vectors is zero, we say they are orthogonal (perpendicular).

Figure 9: 2D Orthogonal and Orthonormal Vectors [3]

If the basis vectors in a vector space are mutually orthogonal, we call it an orthogonal basis. If each of those vectors also has length 1, the basis is orthonormal.

Figure 10: 3D orthonormal basis i,j,k [4]
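A minimal sketch of this check in NumPy, using the standard 2D basis vectors:

```python
import numpy as np

i = np.array([1.0, 0.0])
j = np.array([0.0, 1.0])

print(np.dot(i, j))                           # 0.0 -> i and j are orthogonal
print(np.linalg.norm(i), np.linalg.norm(j))   # 1.0 1.0 -> each has length 1, so {i, j} is orthonormal
```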

1.3 Linear Transformations

Neural networks are built from layers, and each layer is a mathematical function: f(x) = sigmoid(Ax + b). Here:
  • A is a matrix,
  • x and b are vectors,
  • sigmoid is a nonlinear activation function.
The expression Ax represents a linear transformation of the input. A concrete example is a rotation in the plane: rotations can be fully expressed as matrix transformations around the origin.
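As a rough sketch (with made-up weights, purely for illustration), a single layer and a rotation look like this in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One layer f(x) = sigmoid(Ax + b), with toy values for A, b, and x
A = np.array([[0.5, -1.0],
              [2.0,  0.3]])
b = np.array([0.1, -0.2])
x = np.array([1.0, 2.0])
print(sigmoid(A @ x + b))          # the layer's output vector

# A pure linear transformation: rotation by 90 degrees around the origin
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(R @ np.array([1.0, 0.0]))    # approximately [0, 1]
```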

1.4 Matrices

Matrices allow us to express linear transformations compactly. The most important operation is matrix multiplication:

Figure 11: Matrix Multiplication

Matrix multiplication corresponds to composing linear transformations, applying one transformation after another.
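The sketch below (reusing the rotation example from the previous section) checks this numerically: multiplying the two matrices first gives the same result as applying the rotations one after another.

```python
import numpy as np

def rotation(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

x = np.array([1.0, 0.0])

# Rotate by 30 degrees, then by 60 degrees ...
one_after_another = rotation(np.radians(60)) @ (rotation(np.radians(30)) @ x)

# ... versus multiplying the matrices first and applying the composed map once
composed = (rotation(np.radians(60)) @ rotation(np.radians(30))) @ x

print(np.allclose(one_after_another, composed))   # True
```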

1.5 Determinants

The determinant of a matrix tells us how a linear transformation changes area (in 2D) or volume (in 3D).
  • If the determinant is negative, the transformation flips orientation.
  • If the determinant is zero, the transformation collapses space into a lower dimension.

Figure 12 : Effect of Determinant Transformations [5]

In intuitive terms: the determinant tells us how much the “size” of an object (like the Mona Lisa image) is scaled under the transformation.
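A small NumPy sketch with hand-picked matrices illustrates the three cases:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
print(np.linalg.det(A))    # 6.0 -> areas are scaled by a factor of 6

F = np.array([[0.0, 1.0],
              [1.0, 0.0]])  # swap the axes (a reflection)
print(np.linalg.det(F))    # -1.0 -> orientation is flipped

S = np.array([[1.0, 2.0],
              [2.0, 4.0]])  # the rows are linearly dependent
print(np.linalg.det(S))    # ~0.0 -> the plane collapses onto a line
```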

1.6 Eigenvalues, eigenvectors, and matrix decompositions

Many important matrix forms and algorithms aim to decompose a matrix A into simpler components. One of the most powerful tools is the Singular Value Decomposition (SVD): A = UΣVᵀ. Here, U and V are orthogonal matrices (their columns are orthonormal), and Σ is diagonal. SVD is widely used in:
  • dimensionality reduction,
  • feature extraction,
  • Principal Component Analysis (PCA).
  

Figure 13: Eigenvectors preserve direction [6]

Eigenvectors and eigenvalues give us a geometric interpretation:
  • An eigenvector marks a direction that the transformation leaves unchanged: vectors along it are only scaled, not rotated.
  • The corresponding eigenvalue tells how much the vector is stretched or compressed along that direction.
Together, these tools form the backbone of how we represent data, compress information, and analyze structure in modern AI systems.
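As a small sketch, NumPy exposes both decompositions directly (the matrix below is arbitrary):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

# Eigenvectors keep their direction under A; eigenvalues give the stretch factor
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                            # [3. 2.]

# Singular Value Decomposition: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A)
print(np.allclose(A, U @ np.diag(S) @ Vt))    # True
```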

2. Calculus

Calculus is the study of how functions change, through differentiation, and how these changes accumulate, through integration. Since a neural network is a differentiable function, calculus becomes one of the core tools for training it.

Differentiation lies at the heart of optimization. A function reaches a local maximum or minimum where its derivative equals zero. Such points are called critical points. To determine whether a critical point is a minimum or maximum, we examine the second derivative:
  • If f′′(x)>0, the function is locally convex, so x is a local minimum.
  • If f′′(x)<0, the function is locally concave, so x is a local maximum.
  • If f′′(x)=0, the situation is inconclusive.
One of the most important tools in calculus is the chain rule, which tells us how to differentiate composite functions. For two functions f and g: (f(g(x)))′ = f′(g(x)) · g′(x). This rule is essential in deep learning because neural networks are literally nested compositions of functions, layer after layer.

Integration, often described as the inverse of differentiation, can be interpreted as the signed area under a curve. Many quantities in information theory, including entropy and Kullback–Leibler divergence, are defined using integrals.
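A quick numerical sanity check of the chain rule (a sketch with an arbitrary composite function): differentiating sin(x²) by hand with the chain rule agrees with a finite-difference estimate.

```python
import numpy as np

# h(x) = f(g(x)) with f = sin and g(x) = x**2, so the chain rule gives
# h'(x) = cos(x**2) * 2x
def h(x):
    return np.sin(x ** 2)

x0 = 1.3
analytic = np.cos(x0 ** 2) * 2 * x0                 # chain rule
numeric = (h(x0 + 1e-6) - h(x0 - 1e-6)) / 2e-6      # central finite difference

print(analytic, numeric)   # the two values agree to several decimal places
```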

3. Multivariable Calculus

This is where linear algebra and calculus meet. A neural network is a function of many variables, not just one. Therefore, we need tools that allow us to understand how changes in multiple directions influence the output.

3.1 Differentiation in multiple variables

In single-variable calculus, the derivative captures the slope of the tangent line. In multiple dimensions, a point on a surface has many possible tangent directions. Two especially important ones are:
  • the tangent parallel to the x–z plane,
  • the tangent parallel to the y–z plane.
Their slopes are given by partial derivatives.

Another fundamental concept is the gradient, a vector that points in the direction of the steepest ascent. If you take an infinitesimal step along the gradient, your function value increases as much as possible compared to any other direction. Naive gradient descent uses the negative gradient to minimize loss, but it has limitations. Modern optimizers, such as Adam, RMSProp, and momentum-based methods, use adaptive step sizes, historical gradients, and other techniques to make optimization faster and more stable.

Using the chain rule in multiple dimensions gives us the total derivative, which connects all partial derivatives through the structure of the model.
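To see the basic mechanism without any of the modern refinements, here is a sketch of plain gradient descent on a simple made-up loss surface:

```python
import numpy as np

# Toy loss: f(x, y) = (x - 3)^2 + (y + 1)^2, minimized at (3, -1)
def grad(p):
    x, y = p
    return np.array([2 * (x - 3), 2 * (y + 1)])

p = np.array([0.0, 0.0])   # starting point
lr = 0.1                   # fixed step size (learning rate)

for _ in range(100):
    p = p - lr * grad(p)   # step against the gradient: steepest descent

print(p)                   # approximately [ 3. -1.]
```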

3.2 Higher-order derivatives

Training a neural network means minimizing a loss function with respect to its parameters. In multivariable settings, we generalize the second derivative using the Hessian matrix, which contains all second-order partial derivatives. The determinant of the Hessian plays a role analogous to the second derivative in one dimension:
  • It helps classify critical points (points where the gradient is zero).
  • It indicates whether we are at a minimum, maximum, or saddle point.
Understanding higher-order derivatives is essential for advanced optimization techniques and for analyzing model behavior near optimal solutions.
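A tiny example, using the classic saddle f(x, y) = x² − y²: its gradient vanishes at the origin, and the Hessian reveals that the point is neither a minimum nor a maximum.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a critical point at the origin; its Hessian is constant
H = np.array([[2.0,  0.0],
              [0.0, -2.0]])

print(np.linalg.det(H))       # -4.0 < 0 -> saddle point
print(np.linalg.eigvals(H))   # one positive and one negative curvature direction
```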

4. Probability Theory

Probability theory is the mathematical study of uncertainty and chance, a foundation not only for AI, but for all scientific disciplines. In machine learning, we rely on probability to model noise, quantify uncertainty, and make informed predictions about the world. At its core, probability is a function that assigns a number between 0 and 1 to an event, describing how likely that event is to occur.

Figure 14: Events and Event Spaces

4.1 Expected Value

The expected value captures the long-term average outcome of a random process. In machine learning, many loss functions can be interpreted as expected values. For example:
  • Mean Squared Error approximates the expected squared difference between predictions and true values.
  • Cross-entropy measures the expected log-loss over the data distribution.
Training neural networks can thus be seen as approximating these expectations and minimizing them over time.
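Both losses are just sample averages, as the sketch below shows with made-up predictions and labels:

```python
import numpy as np

# Regression: Mean Squared Error as an average of squared differences
y_true = np.array([1.2, 0.7, 2.5, 1.9])
y_pred = np.array([1.0, 0.9, 2.2, 2.0])
mse = np.mean((y_pred - y_true) ** 2)

# Binary classification: cross-entropy as an average log-loss
labels = np.array([1, 0, 1, 1])            # true classes
probs = np.array([0.9, 0.2, 0.7, 0.6])     # predicted P(class = 1)
cross_entropy = -np.mean(labels * np.log(probs) +
                         (1 - labels) * np.log(1 - probs))

print(mse, cross_entropy)
```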

4.2 Law of large numbers

The Law of Large Numbers tells us that as a random experiment is repeated many times, the average outcome converges to its expected value.

A simple example: casino games. Even if you win occasionally, every game has a negative expected value for the player. As the number of plays grows large, the average outcome moves closer to that negative expectation, meaning the casino wins in the long run. Machine learning relies on the same principle: using many data samples helps our estimates approach the true underlying distribution.
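A short simulation makes the point (the win probability below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy bet: win 1 with probability 0.48, lose 1 otherwise
# Expected value per play: 0.48 * 1 + 0.52 * (-1) = -0.04
outcomes = rng.choice([1, -1], size=1_000_000, p=[0.48, 0.52])

for n in (100, 10_000, 1_000_000):
    print(n, outcomes[:n].mean())   # the running average drifts toward -0.04
```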

4.3 Information theory

Information theory provides tools to quantify uncertainty, surprise, and similarity between probability distributions. It plays a crucial role in modern AI. Key concepts include:
  • Entropy is a measure of uncertainty or unpredictability in a distribution.
  • Cross-entropy is used to measure how well a model’s predicted distribution matches the true distribution. This gives us the most common loss function in classification tasks.
  • Kullback–Leibler (KL) divergence measures how one probability distribution differs from another and appears frequently in variational inference, regularization, and modern generative models.
These ideas connect probability with learning theory, enabling us to formalize how models capture information and improve over time.
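A compact sketch of all three quantities for two small made-up distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model's predicted distribution

entropy = -np.sum(p * np.log(p))            # uncertainty in p
cross_entropy = -np.sum(p * np.log(q))      # expected log-loss of q under p
kl_divergence = np.sum(p * np.log(p / q))   # equals cross_entropy - entropy

print(entropy, cross_entropy, kl_divergence)
```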

5. Putting It All Together

All the mathematical components we covered work together to form the core of modern AI.
  • Linear algebra gives us vectors, matrices, and transformations, the structural backbone of neural networks.
  • Calculus and multivariable calculus provide the tools to optimize millions of parameters through gradients and the chain rule.
  • Probability theory enables models to reason about uncertainty, define loss functions, and express predictions as distributions.
A neural network is essentially a sequence of linear transformations combined with nonlinear activations, trained using gradient-based optimization, and evaluated using probabilistic principles. When these ideas come together, they allow AI systems to learn patterns, generalize from data, and make reliable predictions at scale.
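As a closing sketch (with random toy weights and no training loop), a two-layer forward pass is only a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two layers, each "linear transformation + nonlinear activation", with toy weights
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # 4 hidden units -> 1 output

x = np.array([0.5, -1.0, 2.0])
hidden = sigmoid(W1 @ x + b1)        # first layer
output = sigmoid(W2 @ hidden + b2)   # second layer

print(output)   # a value in (0, 1), ready to be read as a probability
```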

6. Conclusion

AI may appear magical, but its foundation is a set of clear mathematical ideas. Linear algebra structures the model, calculus drives learning, and probability theory explains uncertainty. Once you understand these pillars, the behavior of neural networks becomes explainable rather than mysterious. These fundamentals shape every advanced AI system, including the solutions we build at MDP.
