Matrix Calculus and Automatic Differentiation - Deep Learning for Solving and Estimating Dynamic Models in Economics and Finance

This appendix collects the few matrix-calculus identities used in the main text and gives a one-page summary of how automatic differentiation realizes them computationally.

B.1 Matrix calculus identities

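For $\bm A \in \R^{m\times n}$, $\bm x \in \R^n$, and $\bm y = \bm A\bm x$:

$$
\frac{\partial \bm y}{\partial \bm x} = \bm A,\qquad \frac{\partial (\bm y^\top \bm y)}{\partial \bm x} = 2\bm A^\top \bm A\bm x. \tag{B.1}
$$

Here $\partial \bm y/\partial \bm x$ is the $m\times n$ Jacobian, while the derivative of the scalar $\bm y^\top \bm y$ is written as a column gradient in $\R^n$.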
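The two identities in (B.1) can be checked numerically. The sketch below is a minimal illustration in JAX, not part of the main text: the matrix `A`, the point `x`, and the helper names `linear_map` and `squared_norm` are assumptions chosen only for the example.

```python
import jax
import jax.numpy as jnp

# Illustrative values only (m = 2, n = 3); not taken from the main text.
A = jnp.array([[1.0, 2.0, 0.0],
               [0.0, -1.0, 3.0]])
x = jnp.array([0.5, -1.0, 2.0])

def linear_map(v):
    # y = A v
    return A @ v

def squared_norm(v):
    # y^T y with y = A v
    y = linear_map(v)
    return y @ y

jac_ad = jax.jacobian(linear_map)(x)    # Jacobian of y w.r.t. x, shape (2, 3)
grad_ad = jax.grad(squared_norm)(x)     # gradient of y^T y, shape (3,)
grad_closed = 2.0 * A.T @ A @ x         # closed form from (B.1)

jacobian_matches = bool(jnp.allclose(jac_ad, A))
gradient_matches = bool(jnp.allclose(grad_ad, grad_closed))
print(jacobian_matches, gradient_matches)
```

Both comparisons use `jnp.allclose` rather than exact equality because the computation runs in floating point.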

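For a quadratic form $f(\bm x) = \bm x^\top \bm Q \bm x$ with symmetric $\bm Q$, $\nabla f(\bm x) = 2\bm Q\bm x$ and $\nabla^2 f = 2\bm Q$.

For a deep network $\hat y = g_L(\bm W_L\, g_{L-1}(\cdots g_1(\bm W_1 \bm x + \bm b_1)\cdots))$ with scalar output (or with a scalar loss in place of $\hat y$), let $\bm z^{(l)}$ denote the pre-activation and $\bm a^{(l)} = g_l(\bm z^{(l)})$ the activation of layer $l$, with $\bm a^{(0)} = \bm x$. The chain rule gives

$$
\frac{\partial \hat y}{\partial \bm W_l} \;=\; \underbrace{(\bm \delta^{(l)})}_{\substack{\text{backprop} \\ \text{delta}}} \,\bigl(\bm a^{(l-1)}\bigr)^{\!\top} \tag{B.2}
$$

with $\bm \delta^{(l)} = \partial \hat y/\partial \bm z^{(l)} = \bigl(\bm W_{l+1}^\top\bm \delta^{(l+1)}\bigr) \odot g_l'(\bm z^{(l)})$ for $l < L$ and $\bm \delta^{(L)} = g_L'(\bm z^{(L)})$ (Section 1.6).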
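The quadratic-form identities and the backprop recursion in (B.2) can be compared against automatic differentiation. The following is a sketch under stated assumptions: a hypothetical two-layer network with a tanh hidden layer, an identity output activation, a bias in every layer, and randomly drawn weights. None of these choices come from the main text.

```python
import jax
import jax.numpy as jnp

# --- Quadratic form: gradient 2Qx and Hessian 2Q for symmetric Q ---
Q = jnp.array([[2.0, 0.5],
               [0.5, 1.0]])            # illustrative symmetric matrix
x = jnp.array([1.0, -2.0])

def quad(v):
    return v @ Q @ v

quad_grad_ok = bool(jnp.allclose(jax.grad(quad)(x), 2.0 * Q @ x))
quad_hess_ok = bool(jnp.allclose(jax.hessian(quad)(x), 2.0 * Q))

# --- Backprop deltas for a hypothetical two-layer scalar-output network ---
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
W1 = jax.random.normal(k1, (4, 2))
b1 = jax.random.normal(k2, (4,))
W2 = jax.random.normal(k3, (1, 4))
b2 = jax.random.normal(k4, (1,))

def net(params, inp):
    W1, b1, W2, b2 = params
    a1 = jnp.tanh(W1 @ inp + b1)       # hidden layer, g_1 = tanh
    return (W2 @ a1 + b2)[0]           # output layer, g_2 = identity

# Manual backward pass following (B.2).
z1 = W1 @ x + b1
a1 = jnp.tanh(z1)
delta2 = jnp.ones(1)                                   # g_2'(z2) = 1
delta1 = (W2.T @ delta2) * (1.0 - jnp.tanh(z1) ** 2)   # tanh'(z) = 1 - tanh(z)^2
dW2_manual = jnp.outer(delta2, a1)                     # delta^(2) (a^(1))^T
dW1_manual = jnp.outer(delta1, x)                      # delta^(1) (a^(0))^T

# Reference gradients from reverse-mode AD.
dW1_ad, _, dW2_ad, _ = jax.grad(net)((W1, b1, W2, b2), x)

dW1_ok = bool(jnp.allclose(dW1_manual, dW1_ad, atol=1e-5))
dW2_ok = bool(jnp.allclose(dW2_manual, dW2_ad, atol=1e-5))
print(quad_grad_ok, quad_hess_ok, dW1_ok, dW2_ok)
```

The manual pass stores `z1` and `a1` from the forward evaluation and reuses them when forming the deltas, which is the same caching that a reverse-mode AD system performs automatically.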

B.2 Reverse-mode AD in one paragraph

Reverse-mode AD evaluates the function once forward, caches every intermediate value, and then traverses the computation graph backwards, accumulating local vector-Jacobian products in time linear in the number of operations. For a scalar loss, one reverse sweep returns the gradient with respect to all parameters at a small constant multiple of the forward cost. The computation still scales with the size of the graph, but it does not require one separate derivative pass per parameter; this is what makes million-parameter networks trainable. See Baydin et al. (2018) and Margossian (2019) for complete treatments.
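The contrast between one reverse sweep and one pass per parameter can be made concrete with `jax.vjp` and `jax.jvp`. The sketch below uses a hypothetical scalar loss, `loss(theta) = sum(tanh(theta)^2)`, chosen only because its gradient has a simple closed form; the parameter vector has five entries for readability.

```python
import jax
import jax.numpy as jnp

def loss(theta):
    # Hypothetical scalar loss, for illustration only.
    return jnp.sum(jnp.tanh(theta) ** 2)

theta = jnp.linspace(-1.0, 1.0, 5)

# Reverse mode: one forward evaluation plus one backward sweep
# yields the gradient with respect to every parameter.
value, pullback = jax.vjp(loss, theta)
(grad_reverse,) = pullback(jnp.ones_like(value))

# Forward mode: one Jacobian-vector product per parameter,
# so the number of passes grows with the parameter count.
basis = jnp.eye(theta.size)
grad_forward = jnp.stack(
    [jax.jvp(loss, (theta,), (e,))[1] for e in basis]
)

# Closed-form gradient: 2 tanh(theta) (1 - tanh(theta)^2).
t = jnp.tanh(theta)
grad_closed = 2.0 * t * (1.0 - t ** 2)

reverse_ok = bool(jnp.allclose(grad_reverse, grad_closed, atol=1e-5))
forward_ok = bool(jnp.allclose(grad_forward, grad_closed, atol=1e-5))
print(reverse_ok, forward_ok)
```

Both routes give the same gradient; the difference is that the forward-mode loop runs once per entry of `theta`, while the reverse-mode pullback is called once regardless of the parameter count.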

B.3 Higher-order AD

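PINN losses involve second derivatives ($V_{SS}$ in Black-Scholes, Laplacians in Poisson equations, and second-order terms in diffusion HJBs). Higher-order derivatives are obtained by composing reverse-mode AD with itself: in PyTorch, torch.autograd.grad applied to the result of an earlier torch.autograd.grad call made with create_graph=True; in JAX, jax.grad of jax.grad. Activation regularity matters: the network must be at least $C^k$ for the strong $k$-th-order residual to be well-defined. ReLU, for example, is only $C^0$, so its second derivative vanishes almost everywhere and gives a second-order residual nothing to work with, whereas tanh and softplus are smooth.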
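A minimal JAX sketch of nested differentiation follows. The two one-parameter "networks" `smooth_net` and `relu_net`, the slope 1.5, and the evaluation point 0.3 are hypothetical choices made only to show the effect of activation regularity on a second derivative.

```python
import jax
import jax.numpy as jnp

W = 1.5   # illustrative weight
S = 0.3   # illustrative evaluation point

def smooth_net(s):
    # Smooth activation: all derivatives exist.
    return jnp.tanh(W * s)

def relu_net(s):
    # Piecewise-linear activation: second derivative is zero away from the kink.
    return jax.nn.relu(W * s)

# Second derivatives via reverse-mode AD composed with itself.
d2_smooth = jax.grad(jax.grad(smooth_net))(S)
d2_relu = jax.grad(jax.grad(relu_net))(S)

# Closed form for tanh: d^2/ds^2 tanh(W s) = -2 W^2 tanh(W s) (1 - tanh(W s)^2).
t = jnp.tanh(W * S)
d2_closed = -2.0 * W ** 2 * t * (1.0 - t ** 2)

smooth_ok = bool(jnp.allclose(d2_smooth, d2_closed, atol=1e-5))
relu_is_zero = float(d2_relu) == 0.0
print(smooth_ok, relu_is_zero)
```

The tanh network returns a nonzero second derivative that matches the closed form, while the ReLU network returns exactly zero at this point, so a residual built from its second derivative would carry no gradient information there.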

References

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: a Survey. Journal of Machine Learning Research, 18(153), 1–43. http://jmlr.org/papers/v18/17-468.html
Margossian, C. C. (2019). A Review of Automatic Differentiation and its Efficient Implementation. WIREs Data Mining and Knowledge Discovery, 9(4), e1305.
