Matrix Calculus Essentials

Why Matrix Calculus

Every major optimization problem in telecommunications ultimately reduces to differentiating a scalar objective with respect to a vector or matrix parameter:

  • Beamforming. Maximizing the signal-to-noise ratio $\text{SNR} = \mathbf{w}^H \mathbf{R}_s \mathbf{w} / \mathbf{w}^H \mathbf{R}_n \mathbf{w}$ requires the gradient of a Rayleigh quotient with respect to the weight vector $\mathbf{w}$.
  • Precoder design. Minimizing mean-squared error or maximizing mutual information over a precoding matrix $\mathbf{F}$ demands derivatives of trace expressions and log-determinants with respect to $\mathbf{F}$.
  • Covariance optimization. The capacity-achieving input covariance $\mathbf{R}_x$ is found by setting the gradient of $\log\det(\mathbf{I} + \mathbf{H}\mathbf{R}_x\mathbf{H}^H / \sigma^2)$ to zero, which yields the celebrated water-filling solution.
  • Adaptive filtering. The LMS and RLS algorithms are nothing but stochastic gradient descent on quadratic cost functions.

Without a systematic calculus for vectors and matrices, one would have to expand every expression element by element, a process that is both tedious and error-prone. Matrix calculus provides compact, coordinate-free derivative rules that make these optimizations tractable and elegant.

Definition: Gradient of a Scalar with Respect to a Vector (Numerator Layout)

Let $f : \mathbb{C}^n \to \mathbb{R}$ be a real-valued function of a complex vector $\mathbf{x} = [x_1, \ldots, x_n]^T$. The gradient of $f$ with respect to $\mathbf{x}$ in the numerator layout convention is the column vector

$$\frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{C}^n.$$

When $\mathbf{x}$ is complex, the gradient with respect to $\mathbf{x}^*$ (the Wirtinger derivative) is often more natural:

$$\frac{\partial f}{\partial \mathbf{x}^*} = \begin{bmatrix} \frac{\partial f}{\partial x_1^*} \\ \vdots \\ \frac{\partial f}{\partial x_n^*} \end{bmatrix},$$

where $\frac{\partial}{\partial x_k^*} = \frac{1}{2}\left(\frac{\partial}{\partial \Re(x_k)} + j\,\frac{\partial}{\partial \Im(x_k)}\right)$. A necessary condition for a minimum of $f$ is $\frac{\partial f}{\partial \mathbf{x}^*} = \mathbf{0}$.
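
The Wirtinger derivative can be sanity-checked numerically. Below is a minimal NumPy sketch (the cost $f(\mathbf{x}) = \|\mathbf{x} - \mathbf{c}\|^2$ and all variable names are illustrative): its Wirtinger gradient is $\partial f / \partial \mathbf{x}^* = \mathbf{x} - \mathbf{c}$, which the finite-difference combination $\frac{1}{2}(\partial/\partial\Re + j\,\partial/\partial\Im)$ reproduces.

```python
import numpy as np

# Illustrative cost: f(x) = ||x - c||^2, whose Wirtinger gradient is df/dx* = x - c.
rng = np.random.default_rng(0)
n = 4
c = rng.normal(size=n) + 1j * rng.normal(size=n)
x = rng.normal(size=n) + 1j * rng.normal(size=n)

def f(x):
    return np.real(np.vdot(x - c, x - c))   # real-valued cost

# Finite-difference Wirtinger derivative: (1/2)(d/dRe + j d/dIm), component by component.
eps = 1e-6
grad_fd = np.zeros(n, dtype=complex)
for k in range(n):
    e = np.zeros(n)
    e[k] = 1.0
    d_re = (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    d_im = (f(x + 1j * eps * e) - f(x - 1j * eps * e)) / (2 * eps)
    grad_fd[k] = 0.5 * (d_re + 1j * d_im)

print(np.allclose(grad_fd, x - c, atol=1e-5))   # matches the analytic gradient
```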

The numerator layout places the components of the function being differentiated (the numerator) in the rows of the Jacobian for vector-to-vector maps, and yields a column vector for the gradient of a scalar. Some references (especially in statistics) use the denominator layout, which transposes all results. This textbook consistently uses the numerator layout.

Definition: Gradient of a Scalar with Respect to a Matrix

Let $f : \mathbb{C}^{m \times n} \to \mathbb{R}$ be a real-valued function of a matrix $\mathbf{A} = [a_{ij}] \in \mathbb{C}^{m \times n}$. The gradient of $f$ with respect to $\mathbf{A}$ is the $m \times n$ matrix

$$\frac{\partial f}{\partial \mathbf{A}} = \left[\frac{\partial f}{\partial a_{ij}}\right]_{i,j} \in \mathbb{C}^{m \times n},$$

whose $(i,j)$-th entry is $\frac{\partial f}{\partial a_{ij}}$.

Equivalently, $\frac{\partial f}{\partial \mathbf{A}}$ is the unique matrix satisfying

$$df = \operatorname{tr}\!\left( \left(\frac{\partial f}{\partial \mathbf{A}}\right)^T d\mathbf{A} \right)$$

for all infinitesimal perturbations $d\mathbf{A}$. This trace-differential characterization is the most powerful tool for deriving matrix gradients: one computes $df$, rewrites it as $\operatorname{tr}(\mathbf{G}^T\,d\mathbf{A})$, and reads off $\frac{\partial f}{\partial \mathbf{A}} = \mathbf{G}$.

For the complex case, the gradient with respect to $\mathbf{A}^*$ uses the analogous Wirtinger convention: $df = \operatorname{tr}\!\left( \left(\frac{\partial f}{\partial \mathbf{A}^*}\right)^H d\mathbf{A} \right)$.
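
The trace-differential rule is easy to verify numerically. The following sketch (NumPy; the test function $f(\mathbf{A}) = \operatorname{tr}(\mathbf{A}^T\mathbf{A})$ is illustrative) compares element-wise finite differences against the analytic gradient $2\mathbf{A}$ and checks that $df \approx \operatorname{tr}(\mathbf{G}^T\,d\mathbf{A})$ for a random small perturbation.

```python
import numpy as np

# Test function: f(A) = tr(A^T A) = ||A||_F^2, with differential df = tr((2A)^T dA),
# i.e. the gradient read off from the trace form is G = 2A.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
G = 2 * A                                   # analytic gradient

def f(M):
    return np.trace(M.T @ M)

# Element-wise central differences reproduce G ...
eps = 1e-6
G_fd = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = 1.0
        G_fd[i, j] = (f(A + eps * E) - f(A - eps * E)) / (2 * eps)
print(np.allclose(G_fd, G, atol=1e-5))

# ... and for a random small perturbation dA, df is approximately tr(G^T dA).
dA = 1e-6 * rng.normal(size=A.shape)
print(np.isclose(f(A + dA) - f(A), np.trace(G.T @ dA), rtol=1e-3, atol=1e-9))
```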

Theorem: Gradient of a Hermitian Quadratic Form

Let $\mathbf{A} \in \mathbb{C}^{n \times n}$ and define $f(\mathbf{x}) = \mathbf{x}^H \mathbf{A} \mathbf{x}$ for $\mathbf{x} \in \mathbb{C}^n$. Then

$$\frac{\partial f}{\partial \mathbf{x}^*} = \mathbf{A}\mathbf{x}, \qquad \frac{\partial f}{\partial \mathbf{x}} = \mathbf{A}^T \mathbf{x}^*.$$

In particular, if $\mathbf{A}$ is Hermitian ($\mathbf{A}^H = \mathbf{A}$), then

$$\frac{\partial f}{\partial \mathbf{x}^*} = \mathbf{A}\mathbf{x}.$$

For the real case ($\mathbf{x} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n \times n}$),

$$\frac{\partial}{\partial \mathbf{x}} (\mathbf{x}^T \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x},$$

which reduces to $2\mathbf{A}\mathbf{x}$ when $\mathbf{A}$ is symmetric.

A quadratic form is the matrix analogue of $ax^2$ in scalar calculus. The derivative of $ax^2$ is $2ax$; correspondingly, the gradient of $\mathbf{x}^T\mathbf{A}\mathbf{x}$ involves $\mathbf{A}\mathbf{x}$ (or $(\mathbf{A}+\mathbf{A}^T)\mathbf{x}$ when $\mathbf{A}$ is not symmetric, because the non-symmetric part contributes differently from left and right).
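
As a quick check, the following NumPy sketch (illustrative sizes and data) compares both gradient formulas against finite differences: the real symmetric case $(\mathbf{A}+\mathbf{A}^T)\mathbf{x} = 2\mathbf{A}\mathbf{x}$ and the complex Hermitian Wirtinger case $\partial f/\partial\mathbf{x}^* = \mathbf{A}\mathbf{x}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
eps = 1e-6

# Real symmetric case: gradient of x^T A x is (A + A^T) x = 2 A x.
A = rng.normal(size=(n, n))
A = A + A.T
x = rng.normal(size=n)
f_real = lambda x: x @ A @ x
g_fd = np.array([(f_real(x + eps * np.eye(n)[k]) - f_real(x - eps * np.eye(n)[k])) / (2 * eps)
                 for k in range(n)])
print(np.allclose(g_fd, 2 * A @ x, atol=1e-4))

# Complex Hermitian case: Wirtinger gradient of x^H A x with respect to x* is A x.
Ah = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
Ah = (Ah + Ah.conj().T) / 2
xc = rng.normal(size=n) + 1j * rng.normal(size=n)
f_cplx = lambda z: np.real(z.conj() @ Ah @ z)
gw = np.zeros(n, dtype=complex)
for k in range(n):
    e = np.eye(n)[k]
    d_re = (f_cplx(xc + eps * e) - f_cplx(xc - eps * e)) / (2 * eps)
    d_im = (f_cplx(xc + 1j * eps * e) - f_cplx(xc - 1j * eps * e)) / (2 * eps)
    gw[k] = 0.5 * (d_re + 1j * d_im)
print(np.allclose(gw, Ah @ xc, atol=1e-4))
```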

Theorem: Gradient of Log-Determinant

Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be invertible with $\det(\mathbf{A}) > 0$. Then

$$\frac{\partial}{\partial \mathbf{A}} \log\det(\mathbf{A}) = \mathbf{A}^{-T} = (\mathbf{A}^{-1})^T.$$

More generally, if $\mathbf{A} \in \mathbb{C}^{n \times n}$ is invertible and Hermitian positive definite, then

$$\frac{\partial}{\partial \mathbf{A}^*} \log\det(\mathbf{A}) = \mathbf{A}^{-T},$$

and the Wirtinger gradient with respect to $\mathbf{A}$ is $\mathbf{A}^{-H}$. For the common case where $\mathbf{A}$ is Hermitian and we restrict to Hermitian perturbations, the result is simply $\mathbf{A}^{-1}$.

In one dimension, $\frac{d}{da}\log a = 1/a$. The matrix analogue replaces division by the inverse, and the transpose accounts for the mismatch between row and column indexing in the trace-differential identification.
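
A short numerical confirmation of the real-case identity (a NumPy sketch with an illustrative symmetric positive definite matrix, so the determinant is positive):

```python
import numpy as np

# Check d/dA log det(A) = A^{-T} on a symmetric positive definite A (det(A) > 0).
rng = np.random.default_rng(3)
n = 4
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)

f = lambda X: np.log(np.linalg.det(X))
eps = 1e-6
G_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = 1.0
        G_fd[i, j] = (f(A + eps * E) - f(A - eps * E)) / (2 * eps)

print(np.allclose(G_fd, np.linalg.inv(A).T, atol=1e-5))
```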

Theorem: Gradient of Trace Expressions

Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times m}$. Then

$$\frac{\partial}{\partial \mathbf{A}} \operatorname{tr}(\mathbf{A}\mathbf{B}) = \mathbf{B}^T.$$

More generally, the following identities hold (real case):

  1. $\frac{\partial}{\partial \mathbf{A}} \operatorname{tr}(\mathbf{B}\mathbf{A}^T) = \mathbf{B}$
  2. $\frac{\partial}{\partial \mathbf{A}} \operatorname{tr}(\mathbf{A}^T\mathbf{B}) = \mathbf{B}$
  3. $\frac{\partial}{\partial \mathbf{A}} \operatorname{tr}(\mathbf{A}^T\mathbf{A}) = 2\mathbf{A}$
  4. $\frac{\partial}{\partial \mathbf{A}} \operatorname{tr}(\mathbf{B}\mathbf{A}\mathbf{C}) = \mathbf{B}^T\mathbf{C}^T$

The trace is a linear functional, so its gradient with respect to $\mathbf{A}$ is particularly simple: it just picks out the coefficient matrix of $\mathbf{A}$ in the trace expression, with an appropriate transpose to match the layout convention.
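
The four identities can be confirmed by finite differences. The sketch below uses NumPy with illustrative matrix sizes and a hypothetical helper num_grad; the shapes of the constant matrices are chosen so that each trace is well defined.

```python
import numpy as np

def num_grad(f, A, eps=1e-6):
    """Element-wise central differences of a scalar function of a matrix (illustrative helper)."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = 1.0
            G[i, j] = (f(A + eps * E) - f(A - eps * E)) / (2 * eps)
    return G

rng = np.random.default_rng(4)
m, n, p = 3, 4, 3
A = rng.normal(size=(m, n))
B = rng.normal(size=(m, n))    # same shape as A, for identities 1 and 2
Bp = rng.normal(size=(p, m))   # for identity 4: Bp A C is p x p
C = rng.normal(size=(n, p))

print(np.allclose(num_grad(lambda A: np.trace(B @ A.T), A), B, atol=1e-6))              # 1
print(np.allclose(num_grad(lambda A: np.trace(A.T @ B), A), B, atol=1e-6))              # 2
print(np.allclose(num_grad(lambda A: np.trace(A.T @ A), A), 2 * A, atol=1e-6))          # 3
print(np.allclose(num_grad(lambda A: np.trace(Bp @ A @ C), A), Bp.T @ C.T, atol=1e-6))  # 4
```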

Chain Rule for Matrix Derivatives

The chain rule extends naturally to matrix-valued functions. If $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ can be decomposed as $f(\mathbf{A}) = g(\mathbf{B}(\mathbf{A}))$, where $\mathbf{B} : \mathbb{R}^{m \times n} \to \mathbb{R}^{p \times q}$ and $g : \mathbb{R}^{p \times q} \to \mathbb{R}$, then the differential form of the chain rule gives

$$df = \operatorname{tr}\!\left( \left(\frac{\partial g}{\partial \mathbf{B}}\right)^T d\mathbf{B} \right).$$

One substitutes $d\mathbf{B}$ in terms of $d\mathbf{A}$ and rearranges until the expression takes the form $\operatorname{tr}(\mathbf{G}^T\,d\mathbf{A})$, at which point $\frac{\partial f}{\partial \mathbf{A}} = \mathbf{G}$.

Concrete example. Suppose $f(\mathbf{A}) = \log\det(\mathbf{I} + \mathbf{A}\mathbf{B})$ where $\mathbf{B}$ is constant. Let $\mathbf{C} = \mathbf{I} + \mathbf{A}\mathbf{B}$. Then
$$df = \operatorname{tr}(\mathbf{C}^{-1}\,d\mathbf{C}) = \operatorname{tr}(\mathbf{C}^{-1}\,d\mathbf{A}\,\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{C}^{-1}\,d\mathbf{A}) = \operatorname{tr}\!\big((\mathbf{C}^{-T}\mathbf{B}^T)^T\,d\mathbf{A}\big),$$
so
$$\frac{\partial f}{\partial \mathbf{A}} = \mathbf{C}^{-T}\mathbf{B}^T = (\mathbf{I} + \mathbf{A}\mathbf{B})^{-T}\mathbf{B}^T.$$
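
The same result can be confirmed numerically. The sketch below (NumPy, illustrative sizes, matrices scaled so that $\mathbf{I} + \mathbf{A}\mathbf{B}$ stays well conditioned) compares the analytic gradient against finite differences.

```python
import numpy as np

# f(A) = log det(I + A B) with constant B;  analytic gradient: (I + A B)^{-T} B^T.
rng = np.random.default_rng(5)
m, n = 4, 3
A = 0.1 * rng.normal(size=(m, n))   # small entries keep I + A B well conditioned
B = 0.1 * rng.normal(size=(n, m))

f = lambda A: np.log(np.linalg.det(np.eye(m) + A @ B))
G = np.linalg.inv(np.eye(m) + A @ B).T @ B.T        # analytic gradient, shape m x n

eps = 1e-6
G_fd = np.zeros_like(A)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = 1.0
        G_fd[i, j] = (f(A + eps * E) - f(A - eps * E)) / (2 * eps)

print(np.allclose(G_fd, G, atol=1e-6))
```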

This chain-rule technique is used repeatedly in deriving gradients for MIMO capacity expressions and MMSE precoder designs.

Example: Optimal Beamforming via Gradient (Reduction to an Eigenvalue Problem)

Consider a single-user system where the received signal is $y = \mathbf{w}^H \mathbf{h}\,s + \mathbf{w}^H \mathbf{n}$, with $\mathbf{h} \in \mathbb{C}^n$ the channel vector, $s$ the transmitted symbol with $\mathbb{E}[|s|^2] = \sigma_s^2$, and $\mathbf{n}$ the noise with covariance $\mathbf{R}_n = \mathbb{E}[\mathbf{n}\mathbf{n}^H]$.

The output SNR is

$$\text{SNR}(\mathbf{w}) = \frac{\sigma_s^2\,|\mathbf{w}^H \mathbf{h}|^2}{\mathbf{w}^H \mathbf{R}_n \mathbf{w}} = \sigma_s^2\,\frac{\mathbf{w}^H (\mathbf{h}\mathbf{h}^H) \mathbf{w}}{\mathbf{w}^H \mathbf{R}_n \mathbf{w}}.$$

Find the beamforming vector $\mathbf{w}$ that maximizes the SNR subject to $\|\mathbf{w}\| = 1$. Show that the solution reduces to an eigenvalue problem.
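
As a numerical preview of the answer, the following NumPy sketch (illustrative toy data) forms the dominant eigenvector of $\mathbf{R}_n^{-1}\mathbf{R}_s$ and checks that no randomly drawn unit-norm beamformer achieves a higher SNR.

```python
import numpy as np

# Toy setup (illustrative values): rank-one signal covariance R_s = sigma_s^2 h h^H
# and a Hermitian positive definite noise covariance R_n.
rng = np.random.default_rng(6)
n = 4
sigma_s2 = 1.0
h = rng.normal(size=n) + 1j * rng.normal(size=n)
N = rng.normal(size=(n, 2 * n)) + 1j * rng.normal(size=(n, 2 * n))
Rn = N @ N.conj().T / (2 * n) + 0.1 * np.eye(n)

Rs = sigma_s2 * np.outer(h, h.conj())
snr = lambda w: np.real(w.conj() @ Rs @ w) / np.real(w.conj() @ Rn @ w)

# Candidate solution: dominant eigenvector of R_n^{-1} R_s (generalized eigenvalue problem).
mu, V = np.linalg.eig(np.linalg.inv(Rn) @ Rs)
w_opt = V[:, np.argmax(mu.real)]
w_opt = w_opt / np.linalg.norm(w_opt)

print("SNR of eigenvector beamformer:", snr(w_opt))
print("largest eigenvalue mu_max:    ", np.max(mu.real))

# No randomly drawn unit-norm beamformer should do better.
trials = rng.normal(size=(200, n)) + 1j * rng.normal(size=(200, n))
print(all(snr(w / np.linalg.norm(w)) <= snr(w_opt) + 1e-9 for w in trials))
```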

Gradient Descent for Quadratic Minimization

Complexity: $O(n^2)$ per iteration (matrix-vector product)
Input: Hermitian $\mathbf{A} \succ 0$, vector $\mathbf{b}$, step size $\eta > 0$, tolerance $\epsilon$
Output: Minimizer of $f(\mathbf{x}) = \mathbf{x}^H \mathbf{A} \mathbf{x} - 2\Re(\mathbf{b}^H \mathbf{x})$
1. Initialize $\mathbf{x}_0$ (e.g., $\mathbf{0}$)
2. for $k = 0, 1, 2, \ldots$ do
3.   $\mathbf{g}_k \leftarrow 2\mathbf{A}\mathbf{x}_k - 2\mathbf{b}$  (gradient)
4.   $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k - \eta\,\mathbf{g}_k$
5.   if $\|\mathbf{g}_k\| < \epsilon$ then return $\mathbf{x}_k$
6. end for

Convergence rate depends on the condition number $\kappa(\mathbf{A}) = \lambda_{\max}/\lambda_{\min}$. The optimal step size is $\eta = 2/(\lambda_{\max} + \lambda_{\min})$. With this choice, the error contracts by a factor $\left(\frac{\kappa - 1}{\kappa + 1}\right)^2$ per iteration. For ill-conditioned problems ($\kappa \gg 1$), convergence is slow and preconditioning or conjugate gradient methods are preferred.
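
A direct NumPy transcription of the listing (a sketch; the problem data, step size, and tolerances are illustrative) can be checked against the closed-form minimizer $\mathbf{x}^\star = \mathbf{A}^{-1}\mathbf{b}$:

```python
import numpy as np

def gd_quadratic(A, b, eta, eps=1e-8, max_iter=10000):
    """Gradient descent for f(x) = x^H A x - 2 Re(b^H x), with Hermitian A > 0."""
    x = np.zeros_like(b)
    for _ in range(max_iter):
        g = 2 * (A @ x - b)              # gradient of f at x
        if np.linalg.norm(g) < eps:
            break
        x = x - eta * g
    return x

# Illustrative problem with a known minimizer x* = A^{-1} b.
rng = np.random.default_rng(7)
n = 6
M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
A = M @ M.conj().T + np.eye(n)           # Hermitian positive definite
b = rng.normal(size=n) + 1j * rng.normal(size=n)

lam = np.linalg.eigvalsh(A)
eta = 1.0 / (lam.max() + lam.min())      # the text's 2/(lmax+lmin), halved for the factor-2 gradient
x_hat = gd_quadratic(A, b, eta)
print(np.allclose(x_hat, np.linalg.solve(A, b), atol=1e-6))
```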

Gradient Descent on a Quadratic: Convergence on Elliptical Contours

Gradient descent on $f(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x}$ follows a zig-zag path along the elliptical contours of the quadratic form. The eccentricity of the ellipses (determined by the condition number of $\mathbf{A}$) controls how fast the algorithm converges.
The trajectory converges to the minimum at the origin. A well-conditioned matrix (nearly circular contours) converges faster than an ill-conditioned one (elongated ellipses).

Gradient Field of a Quadratic Form

Visualize the gradient vectors $\nabla f(\mathbf{x}) = 2\mathbf{A}\mathbf{x}$ for a 2D quadratic form. Eigenvectors of $\mathbf{A}$ determine the principal directions; eigenvalues determine the curvature.

Why This Matters: Optimizing Beamformers via Matrix Gradients Leads to Eigenvalue Problems

The central optimization in receive beamforming is to maximize the signal-to-noise ratio

$$\text{SNR} = \frac{\mathbf{w}^H \mathbf{R}_s \mathbf{w}}{\mathbf{w}^H \mathbf{R}_n \mathbf{w}}$$

subject to $\|\mathbf{w}\| = 1$, where $\mathbf{R}_s = \sigma_s^2\,\mathbf{h}\mathbf{h}^H$ is the signal covariance and $\mathbf{R}_n$ is the noise-plus-interference covariance.

Setting the gradient of the Lagrangian to zero (as derived in the example "Optimal Beamforming via Gradient" above) yields the generalized eigenvalue problem

$$\mathbf{R}_n^{-1}\mathbf{R}_s \mathbf{w} = \mu \mathbf{w}.$$

The optimal beamformer is the eigenvector corresponding to the largest eigenvalue $\mu_{\max}$ of $\mathbf{R}_n^{-1}\mathbf{R}_s$. The maximum achievable SNR equals $\mu_{\max}$.

This pattern (set the matrix gradient to zero, obtain an eigenvalue problem) recurs throughout wireless communications:

  • MIMO precoding: maximizing mutual information over the precoder $\mathbf{F}$ leads to a water-filling solution on the eigenvalues of $\mathbf{H}^H\mathbf{H}$.
  • MMSE receive filter: minimizing the MSE $\mathbb{E}[\|\hat{\mathbf{x}} - \mathbf{x}\|^2]$ yields the Wiener filter $\mathbf{W} = \mathbf{R}_{yx}\mathbf{R}_{yy}^{-1}$, derived by setting a matrix gradient to zero.
  • Dominant eigenmode transmission: transmitting along the principal eigenvector of $\mathbf{H}^H\mathbf{H}$ maximizes received SNR in MIMO.

See full treatment in MIMO Receivers

🚨 Critical Engineering Note

Diagonal Loading: Taming Ill-Conditioned Covariance Matrices

The MVDR/Capon beamformer requires inverting $\mathbf{R}_n$, and the MMSE filter requires inverting $\mathbf{R}_{yy}$. In practice, sample covariance matrices estimated from finite data are often ill-conditioned or singular (especially when the number of snapshots $T$ is comparable to the array size $n$).

Diagonal loading adds a small multiple of the identity: $\hat{\mathbf{R}}_n^{(\text{loaded})} = \hat{\mathbf{R}}_n + \delta \mathbf{I}$, where $\delta > 0$ is the loading factor. This:

  1. Ensures the matrix is strictly positive definite (invertible).
  2. Regularizes the smallest eigenvalues, preventing noise amplification.
  3. Provides robustness against steering vector mismatch and calibration errors.

A common rule of thumb: set $\delta = 10\sigma_n^2/n$ (ten times the noise power $\sigma_n^2$, divided by the array size). In 5G NR, the channel estimation error itself acts as a natural form of diagonal loading in the MMSE filter.
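
The effect is easy to see numerically. The sketch below (NumPy, with illustrative array size, snapshot count, and noise power) forms a sample covariance from few snapshots and compares its condition number before and after loading with the rule-of-thumb $\delta$:

```python
import numpy as np

# Illustrative values: n-element array, T snapshots of white noise with power sigma_n^2.
rng = np.random.default_rng(8)
n, T = 16, 20
sigma_n2 = 1.0

X = (rng.normal(size=(n, T)) + 1j * rng.normal(size=(n, T))) * np.sqrt(sigma_n2 / 2)
R_hat = X @ X.conj().T / T               # sample covariance (poorly conditioned for small T)

delta = 10 * sigma_n2 / n                # rule-of-thumb loading factor from the text
R_loaded = R_hat + delta * np.eye(n)

print("condition number without loading:", np.linalg.cond(R_hat))
print("condition number with loading:   ", np.linalg.cond(R_loaded))
```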

Practical Constraints

  • Without loading: the condition number of the sample covariance can exceed $10^{10}$ for $T < 2n$.
  • 3GPP baseline: the MMSE-IRC receiver uses diagonal loading implicitly via the noise variance term.
  • Optimal loading level depends on SNR and the number of snapshots; there is no universal constant.

📋 Ref: 3GPP TS 38.214, Section 5.2.2 (MMSE-IRC receiver)

Common Mistake: Numerator vs. Denominator Layout Confusion

Mistake:

A very common source of errors is mixing the numerator layout and denominator layout conventions within the same derivation. In numerator layout, $\frac{\partial f}{\partial \mathbf{x}}$ is a column vector (same shape as $\mathbf{x}$) and the Jacobian of a vector-valued function $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ has the components of $\mathbf{f}$ in its rows. In denominator layout, the gradient is a row vector and the Jacobian is transposed. Many textbooks do not state which convention they use, and some even switch mid-chapter. The result: sign errors, spurious transposes, and factors of 2 that appear or vanish mysteriously.

Correction:

Always declare your convention and stick to it. This textbook uses numerator layout throughout. In this convention:

  • The gradient $\frac{\partial f}{\partial \mathbf{x}} \in \mathbb{R}^n$ is a column vector (same shape as $\mathbf{x}$).
  • The Jacobian $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ is an $m \times n$ matrix when $\mathbf{f} \in \mathbb{R}^m$ and $\mathbf{x} \in \mathbb{R}^n$.
  • The gradient of a scalar with respect to a matrix has the same shape as the matrix.

When consulting external references, check whether $\frac{\partial}{\partial \mathbf{x}}(\mathbf{a}^T\mathbf{x})$ equals $\mathbf{a}$ (numerator layout) or $\mathbf{a}^T$ (denominator layout). If the reference uses a different convention, transpose all results before substituting into your derivation.

Key Takeaway

Matrix calculus transforms wireless optimization problems (maximizing SNR, minimizing MSE, maximizing capacity) into eigenvalue problems. The three workhorse identities are: (1) $\frac{\partial}{\partial \mathbf{x}^*}(\mathbf{x}^H\mathbf{A}\mathbf{x}) = \mathbf{A}\mathbf{x}$ for Hermitian $\mathbf{A}$, (2) $\frac{\partial}{\partial \mathbf{A}}\log\det\mathbf{A} = \mathbf{A}^{-T}$, and (3) $\frac{\partial}{\partial \mathbf{A}}\operatorname{tr}(\mathbf{A}\mathbf{B}) = \mathbf{B}^T$. Setting a gradient to zero and recognizing the resulting equation as $\mathbf{M}\mathbf{v} = \lambda\mathbf{v}$ is the single most important pattern in the mathematical toolbox for telecommunications.

Quick Check

Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be symmetric and positive definite, and let $f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}$. What is $\frac{\partial f}{\partial \mathbf{x}}$?

$\mathbf{A}\mathbf{x}$

$2\mathbf{A}\mathbf{x}$

$\mathbf{A}^T\mathbf{x}$

$\mathbf{x}^T\mathbf{A}$

Quick Check

What is $\frac{\partial}{\partial \mathbf{A}} \operatorname{tr}(\mathbf{A}^T \mathbf{B})$ where $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$?

$\mathbf{B}^T$

$\mathbf{B}$

$\mathbf{A}\mathbf{B}$

$\mathbf{I}$

Gradient

For a scalar-valued function $f$ of a vector $\mathbf{x} \in \mathbb{R}^n$ (or $\mathbb{C}^n$), the gradient $\nabla f(\mathbf{x}) = \frac{\partial f}{\partial \mathbf{x}}$ is the vector of partial derivatives. In numerator layout it is a column vector. The gradient points in the direction of steepest ascent and its magnitude equals the maximum directional derivative. Setting $\nabla f = \mathbf{0}$ is the first-order necessary condition for an extremum.

Related: Jacobian, Gradient of a Scalar with Respect to a Vector (Numerator Layout), Wirtinger derivative

Numerator Layout

A convention for arranging partial derivatives in which the gradient of a scalar with respect to an $n$-dimensional vector is a column vector (i.e., the indices of the function being differentiated determine the row structure of the result). Also called the Jacobian formulation. In this convention, $\frac{\partial f}{\partial \mathbf{x}} \in \mathbb{R}^{n \times 1}$ and the Jacobian of $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ is an $m \times n$ matrix. The alternative is the denominator layout (or Hessian formulation), in which the gradient is a row vector.

Related: denominator layout, Jacobian, Gradient

Jacobian

For a vector-valued function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix (in numerator layout) whose $(i,j)$-th entry is $\frac{\partial f_i}{\partial x_j}$. It generalizes the derivative to multivariate vector functions and governs how infinitesimal perturbations of the input propagate to the output. The gradient of a scalar function is a special case (a single-row Jacobian transposed into a column vector).

Related: Gradient, Chain Rule of Mutual Information, Gradient of a Scalar with Respect to a Vector (Numerator Layout)