Matrix Calculus Essentials

Why Matrix Calculus

Every major optimization problem in telecommunications ultimately reduces to differentiating a scalar objective with respect to a vector or matrix parameter:

  • Beamforming. Maximizing the signal-to-noise ratio $\text{SNR} = \mathbf{w}^H \mathbf{R}_s \mathbf{w} / \mathbf{w}^H \mathbf{R}_n \mathbf{w}$ requires the gradient of a Rayleigh quotient with respect to the weight vector $\mathbf{w}$.
  • Precoder design. Minimizing mean-squared error or maximizing mutual information over a precoding matrix $\mathbf{F}$ demands derivatives of trace expressions and log-determinants with respect to $\mathbf{F}$.
  • Covariance optimization. The capacity-achieving input covariance $\mathbf{R}_x$ is found by setting the gradient of $\log\det(\mathbf{I} + \mathbf{H}\mathbf{R}_x\mathbf{H}^H / \sigma^2)$ to zero, which yields the celebrated water-filling solution.
  • Adaptive filtering. The LMS and RLS algorithms are nothing but stochastic gradient descent on quadratic cost functions.

Without a systematic calculus for vectors and matrices, one would have to expand every expression element by element, a process that is both tedious and error-prone. Matrix calculus provides compact, coordinate-free derivative rules that make these optimizations tractable and elegant.

Definition: Gradient of a Scalar with Respect to a Vector (Numerator Layout)

Let $f : \mathbb{C}^n \to \mathbb{R}$ be a real-valued function of a complex vector $\mathbf{x} = [x_1, \ldots, x_n]^T$. The gradient of $f$ with respect to $\mathbf{x}$ in the numerator layout convention is the column vector

$$\frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{C}^n.$$

When $\mathbf{x}$ is complex, the gradient with respect to $\mathbf{x}^*$ (the Wirtinger derivative) is often more natural:

$$\frac{\partial f}{\partial \mathbf{x}^*} = \begin{bmatrix} \frac{\partial f}{\partial x_1^*} \\ \vdots \\ \frac{\partial f}{\partial x_n^*} \end{bmatrix},$$

where $\frac{\partial}{\partial x_k^*} = \frac{1}{2}\left(\frac{\partial}{\partial \Re(x_k)} + j\,\frac{\partial}{\partial \Im(x_k)}\right)$. A necessary condition for a minimum of $f$ is $\frac{\partial f}{\partial \mathbf{x}^*} = \mathbf{0}$.
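
The Wirtinger derivative can be sanity-checked numerically. Below is a minimal NumPy sketch (the cost $f(\mathbf{x}) = \|\mathbf{x} - \mathbf{c}\|^2$ and all variable names are illustrative): its Wirtinger gradient is $\partial f / \partial \mathbf{x}^* = \mathbf{x} - \mathbf{c}$, which the finite-difference combination $\frac{1}{2}(\partial/\partial\Re + j\,\partial/\partial\Im)$ reproduces.

```python
import numpy as np

# Illustrative cost: f(x) = ||x - c||^2, whose Wirtinger gradient is df/dx* = x - c.
rng = np.random.default_rng(0)
n = 4
c = rng.normal(size=n) + 1j * rng.normal(size=n)
x = rng.normal(size=n) + 1j * rng.normal(size=n)

def f(x):
    return np.real(np.vdot(x - c, x - c))   # real-valued cost

# Finite-difference Wirtinger derivative: (1/2)(d/dRe + j d/dIm), component by component.
eps = 1e-6
grad_fd = np.zeros(n, dtype=complex)
for k in range(n):
    e = np.zeros(n)
    e[k] = 1.0
    d_re = (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    d_im = (f(x + 1j * eps * e) - f(x - 1j * eps * e)) / (2 * eps)
    grad_fd[k] = 0.5 * (d_re + 1j * d_im)

print(np.allclose(grad_fd, x - c, atol=1e-5))   # matches the analytic gradient
```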

The numerator layout places the components of the function being differentiated (the numerator) in the rows of the Jacobian for vector-to-vector maps, and yields a column vector for the gradient of a scalar. Some references (especially in statistics) use the denominator layout, which transposes all results. This textbook consistently uses the numerator layout.

Definition: Gradient of a Scalar with Respect to a Matrix

Let $f : \mathbb{C}^{m \times n} \to \mathbb{R}$ be a real-valued function of a matrix $\mathbf{A} = [a_{ij}] \in \mathbb{C}^{m \times n}$. The gradient of $f$ with respect to $\mathbf{A}$ is the $m \times n$ matrix

$$\frac{\partial f}{\partial \mathbf{A}} = \left[\frac{\partial f}{\partial a_{ij}}\right]_{i,j} \in \mathbb{C}^{m \times n},$$

whose $(i,j)$-th entry is $\frac{\partial f}{\partial a_{ij}}$.

Equivalently, $\frac{\partial f}{\partial \mathbf{A}}$ is the unique matrix satisfying

$$df = \operatorname{tr}\!\left( \left(\frac{\partial f}{\partial \mathbf{A}}\right)^T d\mathbf{A} \right)$$

for all infinitesimal perturbations $d\mathbf{A}$. This trace-differential characterization is the most powerful tool for deriving matrix gradients: one computes $df$, rewrites it as $\operatorname{tr}(\mathbf{G}^T\,d\mathbf{A})$, and reads off $\frac{\partial f}{\partial \mathbf{A}} = \mathbf{G}$.

For the complex case, the gradient with respect to $\mathbf{A}^*$ uses the analogous Wirtinger convention: $df = \operatorname{tr}\!\left( \left(\frac{\partial f}{\partial \mathbf{A}^*}\right)^H d\mathbf{A} \right)$.
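
The trace-differential rule is easy to verify numerically. The following sketch (NumPy; the test function $f(\mathbf{A}) = \operatorname{tr}(\mathbf{A}^T\mathbf{A})$ is illustrative) compares element-wise finite differences against the analytic gradient $2\mathbf{A}$ and checks that $df \approx \operatorname{tr}(\mathbf{G}^T\,d\mathbf{A})$ for a random small perturbation.

```python
import numpy as np

# Test function: f(A) = tr(A^T A) = ||A||_F^2, with differential df = tr((2A)^T dA),
# i.e. the gradient read off from the trace form is G = 2A.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
G = 2 * A                                   # analytic gradient

def f(M):
    return np.trace(M.T @ M)

# Element-wise central differences reproduce G ...
eps = 1e-6
G_fd = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = 1.0
        G_fd[i, j] = (f(A + eps * E) - f(A - eps * E)) / (2 * eps)
print(np.allclose(G_fd, G, atol=1e-5))

# ... and for a random small perturbation dA, df is approximately tr(G^T dA).
dA = 1e-6 * rng.normal(size=A.shape)
print(np.isclose(f(A + dA) - f(A), np.trace(G.T @ dA), rtol=1e-3, atol=1e-9))
```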

Theorem: Gradient of a Hermitian Quadratic Form

Let $\mathbf{A} \in \mathbb{C}^{n \times n}$ and define $f(\mathbf{x}) = \mathbf{x}^H \mathbf{A} \mathbf{x}$ for $\mathbf{x} \in \mathbb{C}^n$. Then

$$\frac{\partial f}{\partial \mathbf{x}^*} = \mathbf{A}\mathbf{x}, \qquad \frac{\partial f}{\partial \mathbf{x}} = \mathbf{A}^T \mathbf{x}^*.$$

In particular, if $\mathbf{A}$ is Hermitian ($\mathbf{A}^H = \mathbf{A}$), then

$$\frac{\partial f}{\partial \mathbf{x}^*} = \mathbf{A}\mathbf{x}.$$

For the real case ($\mathbf{x} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n \times n}$),

$$\frac{\partial}{\partial \mathbf{x}} (\mathbf{x}^T \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x},$$

which reduces to $2\mathbf{A}\mathbf{x}$ when $\mathbf{A}$ is symmetric.

A quadratic form is the matrix analogue of $ax^2$ in scalar calculus. The derivative of $ax^2$ is $2ax$; correspondingly, the gradient of $\mathbf{x}^T\mathbf{A}\mathbf{x}$ involves $\mathbf{A}\mathbf{x}$ (or $(\mathbf{A}+\mathbf{A}^T)\mathbf{x}$ when $\mathbf{A}$ is not symmetric, because the non-symmetric part contributes differently from left and right).
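
As a quick check, the following NumPy sketch (illustrative sizes and data) compares both gradient formulas against finite differences: the real symmetric case $(\mathbf{A}+\mathbf{A}^T)\mathbf{x} = 2\mathbf{A}\mathbf{x}$ and the complex Hermitian Wirtinger case $\partial f/\partial\mathbf{x}^* = \mathbf{A}\mathbf{x}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
eps = 1e-6

# Real symmetric case: gradient of x^T A x is (A + A^T) x = 2 A x.
A = rng.normal(size=(n, n))
A = A + A.T
x = rng.normal(size=n)
f_real = lambda x: x @ A @ x
g_fd = np.array([(f_real(x + eps * np.eye(n)[k]) - f_real(x - eps * np.eye(n)[k])) / (2 * eps)
                 for k in range(n)])
print(np.allclose(g_fd, 2 * A @ x, atol=1e-4))

# Complex Hermitian case: Wirtinger gradient of x^H A x with respect to x* is A x.
Ah = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
Ah = (Ah + Ah.conj().T) / 2
xc = rng.normal(size=n) + 1j * rng.normal(size=n)
f_cplx = lambda z: np.real(z.conj() @ Ah @ z)
gw = np.zeros(n, dtype=complex)
for k in range(n):
    e = np.eye(n)[k]
    d_re = (f_cplx(xc + eps * e) - f_cplx(xc - eps * e)) / (2 * eps)
    d_im = (f_cplx(xc + 1j * eps * e) - f_cplx(xc - 1j * eps * e)) / (2 * eps)
    gw[k] = 0.5 * (d_re + 1j * d_im)
print(np.allclose(gw, Ah @ xc, atol=1e-4))
```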

Theorem: Gradient of Log-Determinant

Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be invertible with $\det(\mathbf{A}) > 0$. Then

$$\frac{\partial}{\partial \mathbf{A}} \log\det(\mathbf{A}) = \mathbf{A}^{-T} = (\mathbf{A}^{-1})^T.$$

More generally, if $\mathbf{A} \in \mathbb{C}^{n \times n}$ is invertible and Hermitian positive definite, then

$$\frac{\partial}{\partial \mathbf{A}^*} \log\det(\mathbf{A}) = \mathbf{A}^{-T},$$

and the Wirtinger gradient with respect to $\mathbf{A}$ is $\mathbf{A}^{-H}$. For the common case where $\mathbf{A}$ is Hermitian and we restrict to Hermitian perturbations, the result is simply $\mathbf{A}^{-1}$.

In one dimension, $\frac{d}{da}\log a = 1/a$. The matrix analogue replaces division by the inverse, and the transpose accounts for the mismatch between row and column indexing in the trace-differential identification.
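
A short numerical confirmation of the real-case identity (a NumPy sketch with an illustrative symmetric positive definite matrix, so the determinant is positive):

```python
import numpy as np

# Check d/dA log det(A) = A^{-T} on a symmetric positive definite A (det(A) > 0).
rng = np.random.default_rng(3)
n = 4
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)

f = lambda X: np.log(np.linalg.det(X))
eps = 1e-6
G_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = 1.0
        G_fd[i, j] = (f(A + eps * E) - f(A - eps * E)) / (2 * eps)

print(np.allclose(G_fd, np.linalg.inv(A).T, atol=1e-5))
```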

Theorem: Gradient of Trace Expressions

Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times m}$. Then

$$\frac{\partial}{\partial \mathbf{A}} \operatorname{tr}(\mathbf{A}\mathbf{B}) = \mathbf{B}^T.$$

More generally, the following identities hold (real case):

  1. $\frac{\partial}{\partial \mathbf{A}} \operatorname{tr}(\mathbf{B}\mathbf{A}^T) = \mathbf{B}$
  2. $\frac{\partial}{\partial \mathbf{A}} \operatorname{tr}(\mathbf{A}^T\mathbf{B}) = \mathbf{B}$
  3. $\frac{\partial}{\partial \mathbf{A}} \operatorname{tr}(\mathbf{A}^T\mathbf{A}) = 2\mathbf{A}$
  4. $\frac{\partial}{\partial \mathbf{A}} \operatorname{tr}(\mathbf{B}\mathbf{A}\mathbf{C}) = \mathbf{B}^T\mathbf{C}^T$

The trace is a linear functional, so its gradient with respect to $\mathbf{A}$ is particularly simple: it just picks out the coefficient matrix of $\mathbf{A}$ in the trace expression, with an appropriate transpose to match the layout convention.
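
The four identities can be confirmed by finite differences. The sketch below uses NumPy with illustrative matrix sizes and a hypothetical helper num_grad; the shapes of the constant matrices are chosen so that each trace is well defined.

```python
import numpy as np

def num_grad(f, A, eps=1e-6):
    """Element-wise central differences of a scalar function of a matrix (illustrative helper)."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = 1.0
            G[i, j] = (f(A + eps * E) - f(A - eps * E)) / (2 * eps)
    return G

rng = np.random.default_rng(4)
m, n, p = 3, 4, 3
A = rng.normal(size=(m, n))
B = rng.normal(size=(m, n))    # same shape as A, for identities 1 and 2
Bp = rng.normal(size=(p, m))   # for identity 4: Bp A C is p x p
C = rng.normal(size=(n, p))

print(np.allclose(num_grad(lambda A: np.trace(B @ A.T), A), B, atol=1e-6))              # 1
print(np.allclose(num_grad(lambda A: np.trace(A.T @ B), A), B, atol=1e-6))              # 2
print(np.allclose(num_grad(lambda A: np.trace(A.T @ A), A), 2 * A, atol=1e-6))          # 3
print(np.allclose(num_grad(lambda A: np.trace(Bp @ A @ C), A), Bp.T @ C.T, atol=1e-6))  # 4
```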

Chain Rule for Matrix Derivatives

The chain rule extends naturally to matrix-valued functions. If $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ can be decomposed as $f(\mathbf{A}) = g(\mathbf{B}(\mathbf{A}))$, where $\mathbf{B} : \mathbb{R}^{m \times n} \to \mathbb{R}^{p \times q}$ and $g : \mathbb{R}^{p \times q} \to \mathbb{R}$, then the differential form of the chain rule gives

$$df = \operatorname{tr}\!\left( \left(\frac{\partial g}{\partial \mathbf{B}}\right)^T d\mathbf{B} \right).$$

One substitutes $d\mathbf{B}$ in terms of $d\mathbf{A}$ and rearranges until the expression takes the form $\operatorname{tr}(\mathbf{G}^T\,d\mathbf{A})$, at which point $\frac{\partial f}{\partial \mathbf{A}} = \mathbf{G}$.

Concrete example. Suppose $f(\mathbf{A}) = \log\det(\mathbf{I} + \mathbf{A}\mathbf{B})$ where $\mathbf{B}$ is constant. Let $\mathbf{C} = \mathbf{I} + \mathbf{A}\mathbf{B}$. Then
$$df = \operatorname{tr}(\mathbf{C}^{-1}\,d\mathbf{C}) = \operatorname{tr}(\mathbf{C}^{-1}\,d\mathbf{A}\,\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{C}^{-1}\,d\mathbf{A}) = \operatorname{tr}\!\big((\mathbf{C}^{-T}\mathbf{B}^T)^T\,d\mathbf{A}\big),$$
so
$$\frac{\partial f}{\partial \mathbf{A}} = \mathbf{C}^{-T}\mathbf{B}^T = (\mathbf{I} + \mathbf{A}\mathbf{B})^{-T}\mathbf{B}^T.$$
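
The same result can be confirmed numerically. The sketch below (NumPy, illustrative sizes, matrices scaled so that $\mathbf{I} + \mathbf{A}\mathbf{B}$ stays well conditioned) compares the analytic gradient against finite differences.

```python
import numpy as np

# f(A) = log det(I + A B) with constant B;  analytic gradient: (I + A B)^{-T} B^T.
rng = np.random.default_rng(5)
m, n = 4, 3
A = 0.1 * rng.normal(size=(m, n))   # small entries keep I + A B well conditioned
B = 0.1 * rng.normal(size=(n, m))

f = lambda A: np.log(np.linalg.det(np.eye(m) + A @ B))
G = np.linalg.inv(np.eye(m) + A @ B).T @ B.T        # analytic gradient, shape m x n

eps = 1e-6
G_fd = np.zeros_like(A)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = 1.0
        G_fd[i, j] = (f(A + eps * E) - f(A - eps * E)) / (2 * eps)

print(np.allclose(G_fd, G, atol=1e-6))
```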

This chain-rule technique is used repeatedly in deriving gradients for MIMO capacity expressions and MMSE precoder designs.

Example: Optimal Beamforming via Gradient (Reduction to an Eigenvalue Problem)

Consider a single-user system where the received signal is $y = \mathbf{w}^H \mathbf{h}\,s + \mathbf{w}^H \mathbf{n}$, with $\mathbf{h} \in \mathbb{C}^n$ the channel vector, $s$ the transmitted symbol with $\mathbb{E}[|s|^2] = \sigma_s^2$, and $\mathbf{n}$ the noise with covariance $\mathbf{R}_n = \mathbb{E}[\mathbf{n}\mathbf{n}^H]$.

The output SNR is

$$\text{SNR}(\mathbf{w}) = \frac{\sigma_s^2\,|\mathbf{w}^H \mathbf{h}|^2}{\mathbf{w}^H \mathbf{R}_n \mathbf{w}} = \sigma_s^2\,\frac{\mathbf{w}^H (\mathbf{h}\mathbf{h}^H) \mathbf{w}}{\mathbf{w}^H \mathbf{R}_n \mathbf{w}}.$$

Find the beamforming vector $\mathbf{w}$ that maximizes the SNR subject to $\|\mathbf{w}\| = 1$. Show that the solution reduces to an eigenvalue problem.
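
As a numerical preview of the answer, the following NumPy sketch (illustrative toy data) forms the dominant eigenvector of $\mathbf{R}_n^{-1}\mathbf{R}_s$ and checks that no randomly drawn unit-norm beamformer achieves a higher SNR.

```python
import numpy as np

# Toy setup (illustrative values): rank-one signal covariance R_s = sigma_s^2 h h^H
# and a Hermitian positive definite noise covariance R_n.
rng = np.random.default_rng(6)
n = 4
sigma_s2 = 1.0
h = rng.normal(size=n) + 1j * rng.normal(size=n)
N = rng.normal(size=(n, 2 * n)) + 1j * rng.normal(size=(n, 2 * n))
Rn = N @ N.conj().T / (2 * n) + 0.1 * np.eye(n)

Rs = sigma_s2 * np.outer(h, h.conj())
snr = lambda w: np.real(w.conj() @ Rs @ w) / np.real(w.conj() @ Rn @ w)

# Candidate solution: dominant eigenvector of R_n^{-1} R_s (generalized eigenvalue problem).
mu, V = np.linalg.eig(np.linalg.inv(Rn) @ Rs)
w_opt = V[:, np.argmax(mu.real)]
w_opt = w_opt / np.linalg.norm(w_opt)

print("SNR of eigenvector beamformer:", snr(w_opt))
print("largest eigenvalue mu_max:    ", np.max(mu.real))

# No randomly drawn unit-norm beamformer should do better.
trials = rng.normal(size=(200, n)) + 1j * rng.normal(size=(200, n))
print(all(snr(w / np.linalg.norm(w)) <= snr(w_opt) + 1e-9 for w in trials))
```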

Gradient Descent for Quadratic Minimization

Complexity: $O(n^2)$ per iteration (matrix-vector product)
Input: Hermitian $\mathbf{A} \succ 0$, vector $\mathbf{b}$, step size $\eta > 0$, tolerance $\epsilon$
Output: Minimizer of $f(\mathbf{x}) = \mathbf{x}^H \mathbf{A} \mathbf{x} - 2\Re(\mathbf{b}^H \mathbf{x})$
1. Initialize $\mathbf{x}_0$ (e.g., $\mathbf{0}$)
2. for $k = 0, 1, 2, \ldots$ do
3.   $\mathbf{g}_k \leftarrow 2\mathbf{A}\mathbf{x}_k - 2\mathbf{b}$  (gradient)
4.   $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k - \eta\,\mathbf{g}_k$
5.   if $\|\mathbf{g}_k\| < \epsilon$ then return $\mathbf{x}_k$
6. end for

Convergence rate depends on the condition number $\kappa(\mathbf{A}) = \lambda_{\max}/\lambda_{\min}$. The optimal step size is $\eta = 2/(\lambda_{\max} + \lambda_{\min})$. With this choice, the error contracts by a factor $\left(\frac{\kappa - 1}{\kappa + 1}\right)^2$ per iteration. For ill-conditioned problems ($\kappa \gg 1$), convergence is slow and preconditioning or conjugate gradient methods are preferred.
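
A direct NumPy transcription of the listing (a sketch; the problem data, step size, and tolerances are illustrative) can be checked against the closed-form minimizer $\mathbf{x}^\star = \mathbf{A}^{-1}\mathbf{b}$:

```python
import numpy as np

def gd_quadratic(A, b, eta, eps=1e-8, max_iter=10000):
    """Gradient descent for f(x) = x^H A x - 2 Re(b^H x), with Hermitian A > 0."""
    x = np.zeros_like(b)
    for _ in range(max_iter):
        g = 2 * (A @ x - b)              # gradient of f at x
        if np.linalg.norm(g) < eps:
            break
        x = x - eta * g
    return x

# Illustrative problem with a known minimizer x* = A^{-1} b.
rng = np.random.default_rng(7)
n = 6
M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
A = M @ M.conj().T + np.eye(n)           # Hermitian positive definite
b = rng.normal(size=n) + 1j * rng.normal(size=n)

lam = np.linalg.eigvalsh(A)
eta = 1.0 / (lam.max() + lam.min())      # the text's 2/(lmax+lmin), halved for the factor-2 gradient
x_hat = gd_quadratic(A, b, eta)
print(np.allclose(x_hat, np.linalg.solve(A, b), atol=1e-6))
```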

Gradient Descent on a Quadratic: Convergence on Elliptical Contours

Gradient descent on $f(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x}$ follows a zig-zag path along the elliptical contours of the quadratic form. The eccentricity of the ellipses (determined by the condition number of $\mathbf{A}$) controls how fast the algorithm converges.
The trajectory converges to the minimum at the origin. A well-conditioned matrix (nearly circular contours) converges faster than an ill-conditioned one (elongated ellipses).

Gradient Field of a Quadratic Form

Visualize the gradient vectors $\nabla f(\mathbf{x}) = 2\mathbf{A}\mathbf{x}$ for a 2D quadratic form. Eigenvectors of $\mathbf{A}$ determine the principal directions; eigenvalues determine the curvature.

Why This Matters: Optimizing Beamformers via Matrix Gradients Leads to Eigenvalue Problems

The central optimization in receive beamforming is to maximize the signal-to-noise ratio

$$\text{SNR} = \frac{\mathbf{w}^H \mathbf{R}_s \mathbf{w}}{\mathbf{w}^H \mathbf{R}_n \mathbf{w}}$$

subject to $\|\mathbf{w}\| = 1$, where $\mathbf{R}_s = \sigma_s^2\,\mathbf{h}\mathbf{h}^H$ is the signal covariance and $\mathbf{R}_n$ is the noise-plus-interference covariance.

Setting the gradient of the Lagrangian to zero (as derived in the example "Optimal Beamforming via Gradient" above) yields the generalized eigenvalue problem

$$\mathbf{R}_n^{-1}\mathbf{R}_s \mathbf{w} = \mu \mathbf{w}.$$

The optimal beamformer is the eigenvector corresponding to the largest eigenvalue $\mu_{\max}$ of $\mathbf{R}_n^{-1}\mathbf{R}_s$. The maximum achievable SNR equals $\mu_{\max}$.

This pattern (set the matrix gradient to zero, obtain an eigenvalue problem) recurs throughout wireless communications:

  • MIMO precoding: maximizing mutual information over the precoder $\mathbf{F}$ leads to a water-filling solution on the eigenvalues of $\mathbf{H}^H\mathbf{H}$.
  • MMSE receive filter: minimizing the MSE $\mathbb{E}[\|\hat{\mathbf{x}} - \mathbf{x}\|^2]$ yields the Wiener filter $\mathbf{W} = \mathbf{R}_{yx}\mathbf{R}_{yy}^{-1}$, derived by setting a matrix gradient to zero.
  • Dominant eigenmode transmission: transmitting along the principal eigenvector of $\mathbf{H}^H\mathbf{H}$ maximizes received SNR in MIMO.

See full treatment in MIMO Receivers

🚨 Critical Engineering Note

Diagonal Loading: Taming Ill-Conditioned Covariance Matrices

The MVDR/Capon beamformer requires inverting $\mathbf{R}_n$, and the MMSE filter requires inverting $\mathbf{R}_{yy}$. In practice, sample covariance matrices estimated from finite data are often ill-conditioned or singular (especially when the number of snapshots $T$ is comparable to the array size $n$).

Diagonal loading adds a small multiple of the identity: $\hat{\mathbf{R}}_n^{(\text{loaded})} = \hat{\mathbf{R}}_n + \delta \mathbf{I}$, where $\delta > 0$ is the loading factor. This:

  1. Ensures the matrix is strictly positive definite (invertible).
  2. Regularizes the smallest eigenvalues, preventing noise amplification.
  3. Provides robustness against steering vector mismatch and calibration errors.

A common rule of thumb: set $\delta = 10\sigma_n^2/n$ (ten times the noise power $\sigma_n^2$, divided by the array size). In 5G NR, the channel estimation error itself acts as a natural form of diagonal loading in the MMSE filter.
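
The effect is easy to see numerically. The sketch below (NumPy, with illustrative array size, snapshot count, and noise power) forms a sample covariance from few snapshots and compares its condition number before and after loading with the rule-of-thumb $\delta$:

```python
import numpy as np

# Illustrative values: n-element array, T snapshots of white noise with power sigma_n^2.
rng = np.random.default_rng(8)
n, T = 16, 20
sigma_n2 = 1.0

X = (rng.normal(size=(n, T)) + 1j * rng.normal(size=(n, T))) * np.sqrt(sigma_n2 / 2)
R_hat = X @ X.conj().T / T               # sample covariance (poorly conditioned for small T)

delta = 10 * sigma_n2 / n                # rule-of-thumb loading factor from the text
R_loaded = R_hat + delta * np.eye(n)

print("condition number without loading:", np.linalg.cond(R_hat))
print("condition number with loading:   ", np.linalg.cond(R_loaded))
```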

Practical Constraints

  • Without loading: the condition number of the sample covariance can exceed $10^{10}$ for $T < 2n$.
  • 3GPP baseline: the MMSE-IRC receiver uses diagonal loading implicitly via the noise variance term.
  • Optimal loading level depends on SNR and the number of snapshots; there is no universal constant.

📋 Ref: 3GPP TS 38.214, Section 5.2.2 (MMSE-IRC receiver)

Common Mistake: Numerator vs. Denominator Layout Confusion

Mistake:

A very common source of errors is mixing the numerator layout and denominator layout conventions within the same derivation. In numerator layout, $\frac{\partial f}{\partial \mathbf{x}}$ is a column vector (same shape as $\mathbf{x}$) and the Jacobian of a vector-valued function $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ has the components of $\mathbf{f}$ in its rows. In denominator layout, the gradient is a row vector and the Jacobian is transposed. Many textbooks do not state which convention they use, and some even switch mid-chapter. The result: sign errors, spurious transposes, and factors of 2 that appear or vanish mysteriously.

Correction:

Always declare your convention and stick to it. This textbook uses numerator layout throughout. In this convention:

  • The gradient $\frac{\partial f}{\partial \mathbf{x}} \in \mathbb{R}^n$ is a column vector (same shape as $\mathbf{x}$).
  • The Jacobian $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ is an $m \times n$ matrix when $\mathbf{f} \in \mathbb{R}^m$ and $\mathbf{x} \in \mathbb{R}^n$.
  • The gradient of a scalar with respect to a matrix has the same shape as the matrix.

When consulting external references, check whether $\frac{\partial}{\partial \mathbf{x}}(\mathbf{a}^T\mathbf{x})$ equals $\mathbf{a}$ (numerator layout) or $\mathbf{a}^T$ (denominator layout). If the reference uses a different convention, transpose all results before substituting into your derivation.

Key Takeaway

Matrix calculus transforms wireless optimization problems (maximizing SNR, minimizing MSE, maximizing capacity) into eigenvalue problems. The three workhorse identities are: (1) $\frac{\partial}{\partial \mathbf{x}^*}(\mathbf{x}^H\mathbf{A}\mathbf{x}) = \mathbf{A}\mathbf{x}$ for Hermitian $\mathbf{A}$, (2) $\frac{\partial}{\partial \mathbf{A}}\log\det\mathbf{A} = \mathbf{A}^{-T}$, and (3) $\frac{\partial}{\partial \mathbf{A}}\operatorname{tr}(\mathbf{A}\mathbf{B}) = \mathbf{B}^T$. Setting a gradient to zero and recognizing the resulting equation as $\mathbf{M}\mathbf{v} = \lambda\mathbf{v}$ is the single most important pattern in the mathematical toolbox for telecommunications.

Quick Check

Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be symmetric and positive definite, and let $f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}$. What is $\frac{\partial f}{\partial \mathbf{x}}$?

$\mathbf{A}\mathbf{x}$

$2\mathbf{A}\mathbf{x}$

$\mathbf{A}^T\mathbf{x}$

$\mathbf{x}^T\mathbf{A}$

Quick Check

What is $\frac{\partial}{\partial \mathbf{A}} \operatorname{tr}(\mathbf{A}^T \mathbf{B})$ where $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$?

$\mathbf{B}^T$

$\mathbf{B}$

$\mathbf{A}\mathbf{B}$

$\mathbf{I}$

Gradient

For a scalar-valued function $f$ of a vector $\mathbf{x} \in \mathbb{R}^n$ (or $\mathbb{C}^n$), the gradient $\nabla f(\mathbf{x}) = \frac{\partial f}{\partial \mathbf{x}}$ is the vector of partial derivatives. In numerator layout it is a column vector. The gradient points in the direction of steepest ascent and its magnitude equals the maximum directional derivative. Setting $\nabla f = \mathbf{0}$ is the first-order necessary condition for an extremum.

Related: Jacobian, Gradient of a Scalar with Respect to a Vector (Numerator Layout), Wirtinger derivative

Numerator Layout

A convention for arranging partial derivatives in which the gradient of a scalar with respect to an $n$-dimensional vector is a column vector (i.e., the indices of the function being differentiated determine the row structure of the result). Also called the Jacobian formulation. In this convention, $\frac{\partial f}{\partial \mathbf{x}} \in \mathbb{R}^{n \times 1}$ and the Jacobian of $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ is an $m \times n$ matrix. The alternative is the denominator layout (or Hessian formulation), in which the gradient is a row vector.

Related: denominator layout, Jacobian, Gradient

Jacobian

For a vector-valued function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix (in numerator layout) whose $(i,j)$-th entry is $\frac{\partial f_i}{\partial x_j}$. It generalizes the derivative to multivariate vector functions and governs how infinitesimal perturbations of the input propagate to the output. The gradient of a scalar function is a special case (a single-row Jacobian transposed into a column vector).

Related: Gradient, Chain Rule of Mutual Information, Gradient of a Scalar with Respect to a Vector (Numerator Layout)