Math notation used in ML

The definitions and notation below are not precise, just my rough notes; the more I learn, the less I understand.

Δ (uppercase delta): A finite, discrete change. The actual difference, Δx = x₂ − x₁, "Delta x", measuring how much x actually changed.

δ (lowercase delta): A small change.

d (Latin d, "dee"): The derivative, df/dx = lim_{h→0} (f(x + h) − f(x)) / h. df/dx, more precisely df(x)/dx, called the "derivative" of a function f at an input x, is the ratio between an infinitesimal change in the value of f and the infinitesimal change in the input x. Alternatively thought of as the slope of the tangent line to f at x. Note that d/dx is also sometimes treated as an operator.

∂ ("partial" or "del"): Partial derivative. Rate of change w.r.t. one variable (others kept constant). ∂f/∂x, "del f del x", is the derivative of f w.r.t. x, holding the other variables constant.

∇ (nabla, "del", gradient): Vector of all partial derivatives.

Σ (sigma): Summation.

Π (pi): Product.

θ (theta): Model parameters (weights).

α, η (alpha, eta): Learning rate.

ε (epsilon): Small constant for numerical stability.

‖x‖ (norm): The "size"/"length" of a vector.

x̂, ŷ (x hat, y hat): Prediction / estimate.

x* (x star): Optimal value.
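The limit definition of the derivative suggests a direct numerical approximation: pick a small h instead of taking the limit. A minimal sketch (the function f and the step h here are illustrative choices, not from the notes):

```python
def numerical_derivative(f, x, h=1e-6):
    """Forward-difference approximation of df/dx = lim_{h→0} (f(x+h) − f(x)) / h."""
    return (f(x + h) - f(x)) / h

# f(x) = x² has df/dx = 2x, so at x = 3 the slope is 6.
print(numerical_derivative(lambda x: x**2, 3.0))  # ≈ 6.0
```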

Chain rule

The heart of backprop

dy/dx = dy/dz · dz/dx

"If a cycle is twice as fast as a man, and a car is four times as fast as the cycle, then the car is 8 = 4 · 2 times as fast as the man."
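The same example as a numeric sanity check (the variable names, step size h, and evaluation point are my own illustrative choices): with z = 2x (cycle vs. man) and y = 4z (car vs. cycle), the finite-difference slopes multiply exactly as the chain rule predicts.

```python
# Chain rule check: dy/dx = dy/dz · dz/dx.
# z = 2x (cycle vs man), y = 4z (car vs cycle), so dy/dx should be 8.
z = lambda x: 2 * x
y_of_z = lambda zz: 4 * zz
y = lambda x: y_of_z(z(x))

h, x0 = 1e-6, 1.0
dz_dx = (z(x0 + h) - z(x0)) / h                   # ≈ 2
dy_dz = (y_of_z(z(x0) + h) - y_of_z(z(x0))) / h   # ≈ 4
dy_dx = (y(x0 + h) - y(x0)) / h                   # ≈ 8
print(dy_dz * dz_dx, dy_dx)  # both ≈ 8
```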


Core update rule

θ ← θ − α · ∂L/∂θ

Update parameters by subtracting learning rate times the gradient of loss with respect to parameters.

In vector form

θ ← θ − α · ∇L(θ)
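A minimal sketch of this update loop on a toy problem (the loss L(θ) = ‖θ‖², its gradient ∇L(θ) = 2θ, and the learning rate are my own illustrative choices):

```python
import numpy as np

# Gradient descent on L(θ) = ‖θ‖², whose gradient is ∇L(θ) = 2θ.
theta = np.array([3.0, -2.0])
alpha = 0.1
for _ in range(100):
    grad = 2 * theta               # ∇L(θ)
    theta = theta - alpha * grad   # θ ← θ − α·∇L(θ)
print(theta)  # approaches the minimizer θ* = [0, 0]
```

Each step moves θ against the gradient, i.e. in the direction of steepest local decrease of L.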


Jacobian

The gradient ∇ is a special case of the Jacobian J when the output is scalar.

Just as with derivatives, where there is a derivative function Df and a derivative value df/dx at a particular point x, we have the Jacobian as a function J and the Jacobian matrix J(x), often written just J, obtained by evaluating J at a (now multi-dimensional) point x.

Gradient ∇ - For f: ℝⁿ → ℝ (many inputs, one output)

∇f =
| ∂f/∂x₁ |
| ∂f/∂x₂ |
| ...    |
| ∂f/∂xₙ |

Typically this is written as a column vector, and typically the function we’re concerned with is the scalar loss L.
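The gradient can be approximated numerically by nudging one coordinate at a time, which makes the "vector of all partial derivatives" definition concrete. A sketch (the test function and step size h are illustrative):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Finite-difference ∇f for f: ℝⁿ → ℝ: perturb one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)  # central difference for ∂f/∂xᵢ
    return grad

# f(x) = x₁² + 3x₂ has ∇f = [2x₁, 3].
print(numerical_gradient(lambda x: x[0]**2 + 3 * x[1], np.array([2.0, 5.0])))  # ≈ [4, 3]
```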

Jacobian J - For f: ℝⁿ → ℝᵐ (many inputs, many outputs)

J = | ∂f₁/∂x₁  ∂f₁/∂x₂  ...  ∂f₁/∂xₙ |
    | ∂f₂/∂x₁  ∂f₂/∂x₂  ...  ∂f₂/∂xₙ |
    | ...                            |
    | ∂fₘ/∂x₁  ∂fₘ/∂x₂  ...  ∂fₘ/∂xₙ |

This matrix of ratios tells you how a small change in the inputs propagates to the outputs. Each element is a multiplier: how many units of output change per unit of input change, locally at the evaluation point x. It’s a rate, like a slope. So ∂f₁/∂x₁ = 2 means Δf₁ ≈ 2 · Δx₁.

The Jacobian is thus the multivariable generalization of the derivative: if we nudge the input by a tiny vector Δ_in, the output changes by approximately Δ_out ≈ J · Δ_in.
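This linearization can be checked directly for a small f: ℝ² → ℝ² whose Jacobian is easy to write by hand (the function, its analytic Jacobian, and the nudge vector are my own illustrative choices):

```python
import numpy as np

# f: ℝ² → ℝ² with a hand-derived Jacobian; check Δ_out ≈ J · Δ_in for a tiny nudge.
def f(x):
    return np.array([x[0]**2, x[0] * x[1]])

def jacobian_at(x):
    # Rows: [∂f₁/∂x₁, ∂f₁/∂x₂] and [∂f₂/∂x₁, ∂f₂/∂x₂].
    return np.array([[2 * x[0], 0.0],
                     [x[1],     x[0]]])

x = np.array([1.0, 2.0])
dx = np.array([1e-4, -2e-4])          # tiny input nudge Δ_in
d_out = f(x + dx) - f(x)              # actual output change Δ_out
print(d_out, jacobian_at(x) @ dx)     # the two vectors are approximately equal
```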

In a backprop context, we mostly deal with gradients because loss is scalar.