B Matrix Calculus and Automatic Differentiation

University of Lausanne

This appendix collects the few matrix-calculus identities used in the main text and gives a one-page summary of how automatic differentiation realizes them computationally.

B.1 Matrix calculus identities

For $\bm A \in \R^{m\times n}$, $\bm x \in \R^n$, $\bm y = \bm A\bm x$:

$$
\frac{\partial \bm y}{\partial \bm x} = \bm A,\qquad \frac{\partial (\bm y^\top \bm y)}{\partial \bm x} = 2\bm A^\top \bm A\bm x.
$$
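Both identities can be checked numerically with automatic differentiation. The following is a minimal sketch in JAX; the matrix sizes, the random seed, and the helper names `linear` and `squared_norm` are illustrative assumptions, not part of the main text.

```python
import jax
import jax.numpy as jnp

# Arbitrary small sizes (m = 3, n = 4), chosen only for illustration.
key_A, key_x = jax.random.split(jax.random.PRNGKey(0))
A = jax.random.normal(key_A, (3, 4))
x = jax.random.normal(key_x, (4,))

def linear(x):
    """y = A x."""
    return A @ x

def squared_norm(x):
    """y^T y with y = A x."""
    y = A @ x
    return y @ y

jac = jax.jacobian(linear)(x)        # Jacobian dy/dx, shape (3, 4)
grad_sq = jax.grad(squared_norm)(x)  # gradient of y^T y, shape (4,)

# Compare against the closed-form identities (float32 tolerances).
identity_1_ok = bool(jnp.allclose(jac, A, rtol=1e-4, atol=1e-4))
identity_2_ok = bool(jnp.allclose(grad_sq, 2 * A.T @ A @ x, rtol=1e-4, atol=1e-4))
print(identity_1_ok, identity_2_ok)
```

The Jacobian has shape $m \times n$, while the gradient of the scalar $\bm y^\top \bm y$ is a vector in $\R^n$.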

For a quadratic form $f(\bm x) = \bm x^\top \bm Q \bm x$ with symmetric $\bm Q$, $\nabla f(\bm x) = 2\bm Q\bm x$ and $\nabla^2 f = 2\bm Q$. For a deep network $\bm{\hat y} = g_L(\bm W_L\, g_{L-1}(\cdots g_1(\bm W_1 \bm x + \bm b_1)\cdots))$, the chain rule gives

$$
\frac{\partial \bm{\hat y}}{\partial \bm W_l} \;=\; \underbrace{(\bm \delta^{(l)})}_{\substack{\text{backprop} \\ \text{delta}}} \,\bigl(\bm a^{(l-1)}\bigr)^{\!\top}
$$

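with $\bm \delta^{(l)} = \bigl(\bm W_{l+1}^\top \bm \delta^{(l+1)}\bigr) \odot g_l'(\bm z^{(l)})$, where $\bm z^{(l)} = \bm W_l \bm a^{(l-1)} + \bm b_l$ is the pre-activation of layer $l$, $\bm a^{(l)} = g_l(\bm z^{(l)})$ is its activation, and $\bm a^{(0)} = \bm x$ (Section 1.6). The outer-product form applies to a scalar output (or to a scalar loss in its place); for a vector-valued output it holds component by component.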
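The delta recursion can be compared against automatic differentiation on a small example. The sketch below assumes a two-layer network with a tanh hidden layer, an identity output activation, and a scalar output, so that $\bm\delta^{(2)} = 1$; the layer sizes and the function name `forward` are hypothetical choices made for illustration.

```python
import jax
import jax.numpy as jnp

# Hypothetical two-layer network: 3 inputs, 5 hidden units, 1 output.
k1, k2, k3, k4, k5 = jax.random.split(jax.random.PRNGKey(0), 5)
W1 = jax.random.normal(k1, (5, 3))
b1 = jax.random.normal(k2, (5,))
W2 = jax.random.normal(k3, (1, 5))
b2 = jax.random.normal(k4, (1,))
x = jax.random.normal(k5, (3,))

def forward(W1, b1, W2, b2, x):
    """Scalar output with g_1 = tanh and g_2 = identity."""
    a1 = jnp.tanh(W1 @ x + b1)
    return (W2 @ a1 + b2)[0]

# Manual backward pass using the delta recursion.
z1 = W1 @ x + b1
a1 = jnp.tanh(z1)
delta2 = jnp.ones((1,))                               # g_2'(z2) = 1 for the identity
delta1 = (W2.T @ delta2) * (1.0 - jnp.tanh(z1) ** 2)  # (W_2^T delta2) * g_1'(z1)
dW2_manual = jnp.outer(delta2, a1)                    # delta^(2) (a^(1))^T
dW1_manual = jnp.outer(delta1, x)                     # delta^(1) (a^(0))^T

# Reference derivatives from reverse-mode AD.
dW1_ad, dW2_ad = jax.grad(forward, argnums=(0, 2))(W1, b1, W2, b2, x)

print(bool(jnp.allclose(dW1_manual, dW1_ad, atol=1e-5)),
      bool(jnp.allclose(dW2_manual, dW2_ad, atol=1e-5)))
```

Each weight derivative has the same shape as the weight matrix itself, which is what the outer product $\bm\delta^{(l)} (\bm a^{(l-1)})^\top$ delivers.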

B.2 Reverse-mode AD in one paragraph

Reverse-mode AD evaluates the function once forward, caches every intermediate value, and then traverses the computation graph backwards, accumulating local vector-Jacobian products in time linear in the number of operations. For a scalar loss, one reverse sweep returns the gradient with respect to all parameters at a small constant multiple of the forward cost. The computation still scales with the size of the graph, but it does not require one separate derivative pass per parameter; this is what makes million-parameter networks trainable. See Baydin et al. (2018) and Margossian (2019) for complete treatments.
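The forward-then-backward structure can be made explicit with `jax.vjp`, which returns the function value together with a closure that performs the reverse sweep. The sketch below is illustrative only: the single-layer model, the squared-error `loss`, and the array sizes are assumptions. For contrast, it also computes one forward-mode directional derivative with `jax.jvp`, which recovers a single parameter's partial derivative per pass.

```python
import jax
import jax.numpy as jnp

key_W, key_b, key_x = jax.random.split(jax.random.PRNGKey(0), 3)
W = jax.random.normal(key_W, (4, 3))
b = jax.random.normal(key_b, (4,))
x = jax.random.normal(key_x, (3,))
target = jnp.zeros(4)
params = (W, b)

def loss(params, x, target):
    """Hypothetical scalar loss of a one-layer tanh model."""
    W, b = params
    pred = jnp.tanh(W @ x + b)
    return jnp.sum((pred - target) ** 2)

# Forward pass: evaluates the loss and stores what the reverse sweep needs.
loss_value, vjp_fn = jax.vjp(lambda p: loss(p, x, target), params)

# One reverse sweep, seeded with cotangent 1, yields the gradient
# with respect to every parameter at once.
(grads_reverse,) = vjp_fn(jnp.ones_like(loss_value))

# jax.grad wraps the same mechanism.
grads_ref = jax.grad(loss)(params, x, target)

# Forward mode: one pass gives one directional derivative, here d loss / d W[0, 0].
tangent = (jnp.zeros_like(W).at[0, 0].set(1.0), jnp.zeros_like(b))
_, dloss_dW00 = jax.jvp(lambda p: loss(p, x, target), (params,), (tangent,))

print(bool(jnp.allclose(grads_reverse[0], grads_ref[0], atol=1e-5)),
      bool(jnp.allclose(dloss_dW00, grads_reverse[0][0, 0], atol=1e-5)))
```

Recovering the full gradient by forward mode in this way would take one `jax.jvp` call per parameter (16 here), whereas the reverse sweep needs a single call.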

B.3 Higher-order AD

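PINN losses involve second derivatives ($V_{SS}$ in Black-Scholes, Laplacians in Poisson equations, and second-order terms in diffusion HJBs). Higher-order derivatives are obtained by composing reverse-mode AD with itself: `torch.autograd.grad` applied to the result of `torch.autograd.grad` in PyTorch (with `create_graph=True` on the inner call, so that the first derivative is itself differentiable), or `jax.grad` of `jax.grad` in JAX. Activation regularity matters: the network must be at least $C^k$ for the strong $k$-th-order residual to be well-defined. A ReLU network, for instance, is piecewise linear, so its second derivatives vanish almost everywhere and carry no information for a second-order residual.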
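A minimal JAX sketch of nested differentiation follows. The functions `V`, `u_tanh`, and `u_relu` are hypothetical stand-ins for a network output, chosen so that the second derivatives have known closed forms; they are not taken from the main text.

```python
import jax
import jax.numpy as jnp

def V(S, t):
    """Hypothetical smooth value function with V_SS = 2 exp(-t)."""
    return S ** 2 * jnp.exp(-t)

# Second derivative in S: differentiate twice with respect to argument 0.
V_SS = jax.grad(jax.grad(V, argnums=0), argnums=0)
v_ss = V_SS(1.5, 0.2)
v_ss_exact = 2.0 * jnp.exp(-0.2)

def u_tanh(s):
    """Smooth activation: the second derivative is informative."""
    return jnp.tanh(2.0 * s)

def u_relu(s):
    """ReLU activation: piecewise linear in s."""
    return jnp.maximum(s, 0.0)

s0 = 0.3
d2_tanh = jax.grad(jax.grad(u_tanh))(s0)
t0 = jnp.tanh(2.0 * s0)
d2_tanh_exact = -8.0 * t0 * (1.0 - t0 ** 2)  # closed form of d^2/ds^2 tanh(2s)

d2_relu = jax.grad(jax.grad(u_relu))(1.5)    # zero away from the kink at s = 0

print(float(v_ss), float(d2_tanh), float(d2_relu))
```

The ReLU case illustrates the regularity requirement: the nested call runs without error, but the second derivative it returns is identically zero away from the kink.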

References
  1. Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: a Survey. Journal of Machine Learning Research, 18(153), 1–43. http://jmlr.org/papers/v18/17-468.html
  2. Margossian, C. C. (2019). A Review of Automatic Differentiation and its Efficient Implementation. WIREs Data Mining and Knowledge Discovery, 9(4), e1305.