This appendix collects the few matrix-calculus identities used in the main text and gives a one-page summary of how automatic differentiation realizes them computationally.
B.1 Matrix calculus identities
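For $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{a} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n \times n}$:

$$
\nabla_{\mathbf{x}} (\mathbf{a}^\top \mathbf{x}) = \mathbf{a}, \qquad
\nabla_{\mathbf{x}} (\mathbf{x}^\top \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^\top)\,\mathbf{x}, \qquad
\nabla_{\mathbf{x}} \|\mathbf{x}\|_2^2 = 2\mathbf{x}.
$$

For a quadratic form $f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A} \mathbf{x}$ with symmetric $\mathbf{A}$, $\nabla f = 2\mathbf{A}\mathbf{x}$ and $\nabla^2 f = 2\mathbf{A}$. For a deep network $f = f_L \circ \cdots \circ f_1$, the chain rule gives

$$
\frac{\partial f}{\partial \mathbf{x}} = \mathbf{J}_L \, \mathbf{J}_{L-1} \cdots \mathbf{J}_1,
$$

with $\mathbf{J}_\ell = \partial f_\ell / \partial \mathbf{h}_{\ell-1}$ the Jacobian of layer $\ell$ evaluated at its input $\mathbf{h}_{\ell-1}$, where $\mathbf{h}_0 = \mathbf{x}$ (Section 1.6).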
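The identities above can be checked numerically. The following is a minimal JAX sketch, not code from the main text: the matrix `A`, the weights `W1` and `W2`, and the two-layer network `net` are hypothetical values chosen for illustration. It assumes the quadratic form is $f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A} \mathbf{x}$ with symmetric $\mathbf{A}$, so the gradient is $2\mathbf{A}\mathbf{x}$ and the Hessian is $2\mathbf{A}$, and it compares the Jacobian of a composed network with the product of its layer Jacobians.

```python
import jax
import jax.numpy as jnp

# Hypothetical data: a small symmetric matrix and an evaluation point.
A = jnp.array([[2.0, 1.0],
               [1.0, 3.0]])
x = jnp.array([0.5, -1.0])

def quad(x):
    # Quadratic form f(x) = x^T A x (assumed convention, no factor 1/2).
    return x @ A @ x

grad_quad = jax.grad(quad)(x)      # should match 2 A x
hess_quad = jax.hessian(quad)(x)   # should match 2 A

# Hypothetical two-layer network f = layer2 o layer1.
W1 = jnp.array([[1.0, -2.0],
                [0.5,  0.3],
                [-1.0, 0.8]])      # maps R^2 -> R^3
W2 = jnp.array([[0.2, -0.4, 1.0]]) # maps R^3 -> R^1

def layer1(h):
    return jnp.tanh(W1 @ h)

def layer2(h):
    return W2 @ h

def net(x):
    return layer2(layer1(x))

# Jacobian of the whole network versus the product of layer Jacobians,
# each evaluated at the input that layer actually receives.
J_full = jax.jacobian(net)(x)             # shape (1, 2)
J1 = jax.jacobian(layer1)(x)              # shape (3, 2)
J2 = jax.jacobian(layer2)(layer1(x))      # shape (1, 3)
J_chain = J2 @ J1

print(bool(jnp.allclose(grad_quad, 2 * A @ x, atol=1e-5)))
print(bool(jnp.allclose(hess_quad, 2 * A, atol=1e-5)))
print(bool(jnp.allclose(J_full, J_chain, atol=1e-5)))
```

JAX uses single precision by default, which is why the comparisons use a tolerance rather than exact equality.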
B.2 Reverse-mode AD in one paragraph
Reverse-mode AD evaluates the function once forward, caches every intermediate value, and then traverses the computation graph backwards, accumulating the local vector--Jacobian products in time linear in the number of operations. For a scalar loss, one reverse sweep returns the gradient with respect to all parameters at a small constant multiple of the forward cost. The computation still scales with the size of the graph, but it does not require one separate derivative pass per parameter; this is what makes million-parameter networks trainable. See Baydin et al. (2018) and Margossian (2019) for complete treatments.
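The forward-then-backward structure can be made explicit with `jax.vjp`, which returns the function value together with a pullback closure that holds the cached intermediates. This is a hedged sketch: the small network, the `loss` function, and the random data are hypothetical, and real frameworks organize the tape differently internally.

```python
import jax
import jax.numpy as jnp

def loss(params, x, y):
    # Hypothetical one-hidden-layer network with a squared-error loss.
    W1, b1, W2, b2 = params
    h = jnp.tanh(W1 @ x + b1)
    pred = W2 @ h + b2
    return jnp.sum((pred - y) ** 2)

# Hypothetical parameters and data.
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
params = (
    jax.random.normal(k1, (8, 3)),
    jnp.zeros(8),
    jax.random.normal(k2, (2, 8)),
    jnp.zeros(2),
)
x = jax.random.normal(k3, (3,))
y = jax.random.normal(k4, (2,))

# Forward pass: evaluates the loss and records what the reverse sweep needs.
value, pullback = jax.vjp(lambda p: loss(p, x, y), params)

# Reverse sweep: seed with dL/dL = 1 and propagate backwards once.
(grads,) = pullback(jnp.ones_like(value))

# One sweep yields a gradient entry for every parameter.
n_params = sum(g.size for g in grads)
print(n_params)

# jax.grad wraps the same forward/reverse pattern for scalar outputs.
grads_ref = jax.grad(loss)(params, x, y)
```

A finite-difference gradient of the same loss would need at least one extra loss evaluation per parameter, whereas the pullback is called once regardless of `n_params`.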
B.3 Higher-order AD
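PINN losses involve second derivatives ($\partial^2 V / \partial S^2$ in Black--Scholes, Laplacians in Poisson equations, and second-order terms in diffusion HJBs). Higher-order derivatives are obtained by composing reverse-mode AD with itself: `torch.autograd.grad` applied to the result of `torch.autograd.grad` (with `create_graph=True` on the inner call) in PyTorch, or `jax.grad` of `jax.grad` in JAX. Activation regularity matters: the network must be at least $C^k$ for the strong $k$-th-order residual to be well-defined. For example, tanh is smooth, whereas ReLU is only $C^0$ and its second derivative vanishes almost everywhere.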
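As an illustration of composing AD with itself, the sketch below computes a second derivative with `jax.grad` of `jax.grad` and a Laplacian as the trace of `jax.hessian`, then compares the Laplacian with its closed form. The network `u`, its weights, and the helper `laplacian` are hypothetical names introduced for this example; a real PINN would evaluate such terms at many collocation points.

```python
import jax
import jax.numpy as jnp

# Hypothetical weights for a tiny network u(x) = w2 . tanh(W1 x + b1), x in R^2.
W1 = jnp.array([[0.7, -0.3],
                [1.1,  0.4],
                [-0.5, 0.9]])
b1 = jnp.array([0.1, -0.2, 0.3])
w2 = jnp.array([0.6, -1.2, 0.8])

def u(x):
    return w2 @ jnp.tanh(W1 @ x + b1)

def laplacian(f):
    # Trace of the Hessian. jax.hessian composes forward mode over reverse
    # mode internally; the result is the same second-derivative matrix.
    return lambda x: jnp.trace(jax.hessian(f)(x))

x = jnp.array([0.25, -0.5])
lap_ad = laplacian(u)(x)

# Closed form for this architecture:
# sum_i w2_i * ||W1_i||^2 * tanh''(z_i), with tanh''(z) = -2 tanh(z) (1 - tanh(z)^2).
t = jnp.tanh(W1 @ x + b1)
lap_exact = jnp.sum(w2 * jnp.sum(W1 ** 2, axis=1) * (-2.0 * t * (1.0 - t ** 2)))

# Second derivative along the first coordinate: reverse mode applied twice.
def u_first_coord(s):
    return u(jnp.array([s, -0.5]))

d2u_ds2 = jax.grad(jax.grad(u_first_coord))(0.25)

# Regularity check: the second derivative of ReLU away from the kink is zero,
# so a ReLU network contributes nothing to a second-order residual there.
relu_dd = jax.grad(jax.grad(jax.nn.relu))(1.0)

print(bool(jnp.allclose(lap_ad, lap_exact, atol=1e-4)))
```

The ReLU line shows why activation choice matters here: the derivative exists at that point, but it carries no curvature information for the PDE residual.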
- Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: a Survey. Journal of Machine Learning Research, 18(153), 1–43. http://jmlr.org/papers/v18/17-468.html
- Margossian, C. C. (2019). A Review of Automatic Differentiation and its Efficient Implementation. WIREs Data Mining and Knowledge Discovery, 9(4), e1305.