Math notation used in ML

The definitions and notation below are not precise, just my rough notes; the more I learn, the less I understand.

Δ (uppercase delta): A finite, discrete change. The actual difference, Δx = x₂ − x₁, "Delta x", measuring how much x actually changed.

δ (lowercase delta): A small change.

d (Latin d, "dee"): The derivative, df/dx = lim_{h→0} (f(x + h) − f(x)) / h. df/dx, more precisely df(x)/dx, called the "derivative" of a function f at an input x, is the ratio between an infinitesimal change in the value of f and the infinitesimal change in the input x. Alternatively thought of as the slope of the tangent line to f at x. Note that d/dx is also sometimes treated as an operator.

∂ ("partial" or "del"): Partial derivative. Rate of change w.r.t. one variable (others kept constant). ∂f/∂x, "del f del x", is the derivative of f w.r.t. x, holding the other variables constant.

∇ (nabla, "del", gradient): Vector of all partial derivatives.

Σ (sigma): Summation.

Π (pi): Product.

θ (theta): Model parameters (weights).

α, η (alpha, eta): Learning rate.

ε (epsilon): Small constant for numerical stability.

‖x‖ (norm): The "size"/"length" of a vector.

x̂, ŷ (x hat, y hat): Prediction / estimate.

x* (x star): Optimal value.
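The limit definition of the derivative suggests a direct numerical approximation: pick a small h instead of taking the limit. A minimal sketch (the function f and the step h here are illustrative choices, not from the notes):

```python
def numerical_derivative(f, x, h=1e-6):
    """Forward-difference approximation of df/dx = lim_{h→0} (f(x+h) − f(x)) / h."""
    return (f(x + h) - f(x)) / h

# f(x) = x² has df/dx = 2x, so at x = 3 the slope is 6.
print(numerical_derivative(lambda x: x**2, 3.0))  # ≈ 6.0
```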

Chain rule

The heart of backprop

dy/dx = dy/dz · dz/dx

"If a cycle is twice as fast as a man, and a car is four times as fast as the cycle, then the car is 8 = 4 · 2 times as fast as the man."
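The same example as a numeric sanity check (the variable names, step size h, and evaluation point are my own illustrative choices): with z = 2x (cycle vs. man) and y = 4z (car vs. cycle), the finite-difference slopes multiply exactly as the chain rule predicts.

```python
# Chain rule check: dy/dx = dy/dz · dz/dx.
# z = 2x (cycle vs man), y = 4z (car vs cycle), so dy/dx should be 8.
z = lambda x: 2 * x
y_of_z = lambda zz: 4 * zz
y = lambda x: y_of_z(z(x))

h, x0 = 1e-6, 1.0
dz_dx = (z(x0 + h) - z(x0)) / h                   # ≈ 2
dy_dz = (y_of_z(z(x0) + h) - y_of_z(z(x0))) / h   # ≈ 4
dy_dx = (y(x0 + h) - y(x0)) / h                   # ≈ 8
print(dy_dz * dz_dx, dy_dx)  # both ≈ 8
```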


Core update rule

θ ← θ − α · ∂L/∂θ

Update parameters by subtracting learning rate times the gradient of loss with respect to parameters.

In vector form

θ ← θ − α · ∇L(θ)
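A minimal sketch of this update loop on a toy problem (the loss L(θ) = ‖θ‖², its gradient ∇L(θ) = 2θ, and the learning rate are my own illustrative choices):

```python
import numpy as np

# Gradient descent on L(θ) = ‖θ‖², whose gradient is ∇L(θ) = 2θ.
theta = np.array([3.0, -2.0])
alpha = 0.1
for _ in range(100):
    grad = 2 * theta               # ∇L(θ)
    theta = theta - alpha * grad   # θ ← θ − α·∇L(θ)
print(theta)  # approaches the minimizer θ* = [0, 0]
```

Each step moves θ against the gradient, i.e. in the direction of steepest local decrease of L.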


Jacobian

The gradient ∇ is a special case of the Jacobian J when the output is scalar.

Just as with derivatives, where there is a derivative function Df and a derivative value df/dx at a particular point x, we have the Jacobian as a function J and the Jacobian matrix J(x), often written just J, obtained by evaluating J at a (now multi-dimensional) point x.

Gradient ∇ - For f: ℝⁿ → ℝ (many inputs, one output)

∇f =
| ∂f/∂x₁ |
| ∂f/∂x₂ |
| ...    |
| ∂f/∂xₙ |

Typically this is written as a column vector, and typically the function we’re concerned with is the scalar loss L.
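The gradient can be approximated numerically by nudging one coordinate at a time, which makes the "vector of all partial derivatives" definition concrete. A sketch (the test function and step size h are illustrative):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Finite-difference ∇f for f: ℝⁿ → ℝ: perturb one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)  # central difference for ∂f/∂xᵢ
    return grad

# f(x) = x₁² + 3x₂ has ∇f = [2x₁, 3].
print(numerical_gradient(lambda x: x[0]**2 + 3 * x[1], np.array([2.0, 5.0])))  # ≈ [4, 3]
```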

Jacobian J - For f: ℝⁿ → ℝᵐ (many inputs, many outputs)

J = | ∂f₁/∂x₁  ∂f₁/∂x₂  ...  ∂f₁/∂xₙ |
    | ∂f₂/∂x₁  ∂f₂/∂x₂  ...  ∂f₂/∂xₙ |
    | ...                            |
    | ∂fₘ/∂x₁  ∂fₘ/∂x₂  ...  ∂fₘ/∂xₙ |

This matrix of ratios tells you how a small change in the inputs propagates to the outputs. Each element is a multiplier: how many units of output change per unit of input change, locally at the evaluation point x. It’s a rate, like a slope. So ∂f₁/∂x₁ = 2 means Δf₁ ≈ 2 · Δx₁.

The Jacobian is thus the multivariable generalization of the derivative: if we nudge the input by a tiny vector Δ_in, the output changes by approximately Δ_out ≈ J · Δ_in.
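This linearization can be checked directly for a small f: ℝ² → ℝ² whose Jacobian is easy to write by hand (the function, its analytic Jacobian, and the nudge vector are my own illustrative choices):

```python
import numpy as np

# f: ℝ² → ℝ² with a hand-derived Jacobian; check Δ_out ≈ J · Δ_in for a tiny nudge.
def f(x):
    return np.array([x[0]**2, x[0] * x[1]])

def jacobian_at(x):
    # Rows: [∂f₁/∂x₁, ∂f₁/∂x₂] and [∂f₂/∂x₁, ∂f₂/∂x₂].
    return np.array([[2 * x[0], 0.0],
                     [x[1],     x[0]]])

x = np.array([1.0, 2.0])
dx = np.array([1e-4, -2e-4])          # tiny input nudge Δ_in
d_out = f(x + dx) - f(x)              # actual output change Δ_out
print(d_out, jacobian_at(x) @ dx)     # the two vectors are approximately equal
```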

In a backprop context, we mostly deal with gradients because loss is scalar.