| Symbol | Name / Pron. | Description |
|---|---|---|
| Δ | Uppercase delta | A finite, discrete change: Δx = x₂ − x₁, “Delta x”, measuring how much x actually changed |
| δ | Lowercase delta | A small change |
| d | Latin d, “dee” | The derivative: df/dx (more precisely df(x)/dx), the “derivative” of a function f at an input x, is the ratio of an infinitesimal change in the value of f to the infinitesimal change in the input x. Equivalently, the slope of the tangent line to f at x. Note that d/dx is also sometimes treated as an operator |
| ∂ | Curly d, “partial” or “del” | Partial derivative: rate of change w.r.t. one variable, with the others held constant. ∂f/∂x, “del f del x”, is the derivative of f w.r.t. x, holding the other variables constant |
| ∇ | Nabla, “del”, gradient | Vector of all partial derivatives |
| Σ | Sigma | Summation |
| Π | Pi | Product |
| θ | Theta | Model parameters (weights) |
| α, η | Alpha, Eta | Learning rate |
| ε | Epsilon | Small constant for numerical stability |
| ‖x‖ | Norm | “size”/“length” of a vector |
| x̂, ŷ | x hat, y hat | Prediction / estimate |
| x* | x star | Optimal value |
Chain rule
The heart of backprop
"if a cycle is twice as fast as a man, and a car is 4 times as fast as the cycle, then the car is 8 = 4 · 2 times as fast as the man"
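The speed analogy is the chain rule: d/dx f(g(x)) = f′(g(x)) · g′(x), rates multiply along the composition. A minimal numeric sketch, with illustrative linear functions chosen to mirror the analogy:

```python
# Illustrative functions: g(x) = 2x ("cycle is twice as fast as the man"),
# f(y) = 4y ("car is 4 times as fast as the cycle").
def g(x):
    return 2 * x

def f(y):
    return 4 * y

def dg_dx(x):
    return 2.0

def df_dy(y):
    return 4.0

x = 3.0
# Chain rule: the derivative of the composition is the product of the factors.
analytic = df_dy(g(x)) * dg_dx(x)   # 4 · 2 = 8

# Numerical check with a small finite difference.
h = 1e-6
numeric = (f(g(x + h)) - f(g(x))) / h

print(analytic)  # 8.0
```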
Core update rule
θ ← θ - α · ∂L/∂θ
Update parameters by subtracting learning rate times the gradient of loss with respect to parameters.
In vector form
θ ← θ - α · ∇L(θ)
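The vector-form update as a minimal NumPy sketch; the quadratic loss, target, and values here are illustrative, not part of any particular model:

```python
import numpy as np

# Illustrative loss L(θ) = ||θ - target||², whose gradient is ∇L(θ) = 2(θ - target).
target = np.array([1.0, -2.0])

def grad_L(theta):
    return 2 * (theta - target)

theta = np.zeros(2)   # initial parameters θ
alpha = 0.1           # learning rate α

# θ ← θ - α · ∇L(θ), repeated until (near) convergence.
for _ in range(100):
    theta = theta - alpha * grad_L(theta)

print(theta)  # close to [1.0, -2.0], the minimizer
```

Each step moves θ a fraction α of the way along the negative gradient, which for this loss shrinks the distance to the target geometrically.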
Jacobian
The gradient ∇ is a special case of the Jacobian J when the output is scalar.
Just as a derivative exists both as a function Df and as a value df/dx at a particular point x, the Jacobian exists both as a function J and as the Jacobian matrix J(x), often written just J, obtained by evaluating J at a (now multi-dimensional) point x.
Gradient ∇ - For f: ℝⁿ → ℝ (many inputs, one output)
∇f =
| ∂f/∂x₁ |
| ∂f/∂x₂ |
| ... |
| ∂f/∂xₙ |
Typically this is written as a column vector, and typically the function we’re concerned with is the scalar loss L.
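The column of partials above can be approximated one coordinate at a time with finite differences. A small sketch, where the scalar loss L and the evaluation point are illustrative:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Finite-difference approximation of ∇f at x: one partial per coordinate."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_plus[i] += h                      # nudge only coordinate i
        grad[i] = (f(x_plus) - f(x)) / h    # ∂f/∂xᵢ, others held constant
    return grad

# Illustrative scalar loss L: ℝ² → ℝ, L(x) = x₁² + 3x₂, so ∇L = [2x₁, 3].
L = lambda x: x[0] ** 2 + 3 * x[1]
x = np.array([2.0, 5.0])
print(numerical_gradient(L, x))  # approximately [4.0, 3.0]
```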
Jacobian J - For f: ℝⁿ → ℝᵐ (many inputs, many outputs)
J = | ∂f₁/∂x₁ ∂f₁/∂x₂ ... ∂f₁/∂xₙ |
| ∂f₂/∂x₁ ∂f₂/∂x₂ ... ∂f₂/∂xₙ |
| ... |
| ∂fₘ/∂x₁ ∂fₘ/∂x₂ ... ∂fₘ/∂xₙ |
This matrix of ratios tells us how a small change in the inputs propagates to the outputs. Each element is a multiplier: how many units of output change per unit of input change, locally at the evaluation point x. It’s a rate, like a slope. So ∂f₁/∂x₁ = 2 means Δf₁ ≈ 2 · Δx₁.
The Jacobian is thus the multivariable generalization of the derivative: if we nudge the input by a tiny vector Δ_in, the output changes by approximately Δ_out ≈ J · Δ_in.
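The nudge relation can be checked numerically: compare the exact output change against the Jacobian-predicted one. The function f and the point here are illustrative:

```python
import numpy as np

# Illustrative f: ℝ² → ℝ², f(x) = [x₁², x₁·x₂].
def f(x):
    return np.array([x[0] ** 2, x[0] * x[1]])

def jacobian(x):
    # Analytic Jacobian: rows index outputs, columns index inputs.
    return np.array([
        [2 * x[0], 0.0],     # ∂f₁/∂x₁, ∂f₁/∂x₂
        [x[1],     x[0]],    # ∂f₂/∂x₁, ∂f₂/∂x₂
    ])

x = np.array([1.0, 2.0])
delta_in = np.array([1e-4, -2e-4])           # a tiny nudge to the input

delta_out_exact = f(x + delta_in) - f(x)     # actual change in the output
delta_out_linear = jacobian(x) @ delta_in    # predicted: J · Δ_in

print(delta_out_exact)
print(delta_out_linear)
```

The two printed vectors agree to first order; the gap shrinks quadratically as the nudge gets smaller, which is exactly the sense in which J is the local linear approximation of f.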
In a backprop context, we mostly deal with gradients because loss is scalar.