F Solutions and Guidance for Exercises

University of Lausanne

This appendix provides worked solutions for analytical end-of-chapter exercises and guidance for coding exercises. Coding exercises (those that ask the reader to modify a notebook or measure a numerical quantity) are not treated as fixed numerical answers; their outputs depend on hardware, seeds, calibration choices, and notebook versions, and should be reported from the corresponding companion notebook listed in the Execution Map. For exercises that mix analytical work with a small numerical check, the analytical part is solved and the numerical component is described as a verification task.

Exercises are referenced by their stable label ex:chNN:MM, where NN is the chapter number and MM is the position within that chapter’s exercise list. Each solution opens with a back-pointer to the exercise statement so the reader can scan the question first.

F.0.1 Chapter 1: Introduction to Machine Learning and Deep Learning

F.0.1.1 Exercise 1.1: Backprop on a 2-layer net.

Let $z = w_1 x + b_1$, $a = \mathrm{ReLU}(z)$, $\hat y = w_2 a$, $\ell = (\hat y - y)^2$. Reverse-mode chain rule:

$$
\begin{aligned}
\frac{\partial \ell}{\partial \hat y} &= 2(\hat y - y), & \frac{\partial \ell}{\partial w_2} &= 2(\hat y - y)\,a, \\
\frac{\partial \ell}{\partial a} &= 2(\hat y - y)\,w_2, & \frac{\partial a}{\partial z} &= \mathbb{1}[z > 0], \\
\frac{\partial \ell}{\partial w_1} &= 2(\hat y - y)\,w_2\,\mathbb{1}[z>0]\,x.
\end{aligned}
$$

Plugging in $x=2$, $y=1$, $w_1 = w_2 = b_1 = 0.5$: $z = 1.5$, $a = 1.5$, $\hat y = 0.75$, $\ell = 0.0625$. Hence $\partial\ell/\partial w_2 = 2(-0.25)(1.5) = -0.75$ and $\partial\ell/\partial w_1 = 2(-0.25)(0.5)(1)(2) = -0.5$. A two-line PyTorch script (`torch.autograd.grad`) returns the same numbers to machine precision; this is the simplest non-trivial sanity check that the chain-rule derivation matches what AD computes.
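As a dependency-free alternative to the PyTorch check, the same gradients can be compared against central finite differences. The sketch below is only an illustration: the function names `loss` and `analytic_grads` and the step size `h` are assumptions, not part of the exercise.

```python
def loss(w1, w2, b1, x=2.0, y=1.0):
    """Squared error of the scalar two-layer ReLU net from the exercise."""
    z = w1 * x + b1
    a = max(z, 0.0)  # ReLU
    return (w2 * a - y) ** 2


def analytic_grads(w1, w2, b1, x=2.0, y=1.0):
    """Chain-rule gradients (dl/dw1, dl/dw2) as derived in the solution."""
    z = w1 * x + b1
    a = max(z, 0.0)
    d_yhat = 2.0 * (w2 * a - y)
    relu_prime = 1.0 if z > 0 else 0.0
    return d_yhat * w2 * relu_prime * x, d_yhat * a


h = 1e-6  # finite-difference step (an arbitrary small value)
g_w1, g_w2 = analytic_grads(0.5, 0.5, 0.5)
fd_w1 = (loss(0.5 + h, 0.5, 0.5) - loss(0.5 - h, 0.5, 0.5)) / (2 * h)
fd_w2 = (loss(0.5, 0.5 + h, 0.5) - loss(0.5, 0.5 - h, 0.5)) / (2 * h)
print(g_w1, fd_w1)
print(g_w2, fd_w2)
```

Because the loss is quadratic in each weight away from the ReLU kink, the central difference agrees with the analytic gradient up to rounding error.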

F.0.1.2 Exercise 1.2: MSE vs. MLE.

For Gaussian errors $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, the log-likelihood is

$$
\log p(y_{1:n} \mid x_{1:n}; \beta) = -\frac{1}{2\sigma^2}\sum_i (y_i - \beta x_i)^2 - \tfrac{n}{2}\log(2\pi\sigma^2).
$$

Only the first term depends on $\beta$, so maximizing over $\beta$ is identical to minimizing $\sum_i (y_i - \beta x_i)^2$, i.e. OLS. For Laplace errors with scale $b$, $p(\varepsilon) \propto \exp(-|\varepsilon|/b)$, so the log-likelihood equals $-\tfrac{1}{b}\sum_i |y_i - \beta x_i|$ up to an additive constant, and the MLE solves a least-absolute-deviations regression (median regression). Squaring penalizes outliers quadratically, which is suboptimal under Laplace tails because a single large residual dominates the loss; the median estimator is robust precisely because $|\cdot|$ grows linearly.
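The contrast between the two estimators can be seen on a toy data set with one outlier. The sketch below assumes the through-origin model $y_i = \beta x_i + \varepsilon_i$ of the exercise; the data and the helper names `ols_slope` and `lad_slope` are made up for illustration. For this model the LAD objective $\sum_i |x_i|\,|y_i/x_i - \beta|$ is minimized by a weighted median of the ratios $y_i/x_i$ with weights $|x_i|$.

```python
def ols_slope(xs, ys):
    """Gaussian MLE: least-squares slope through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def lad_slope(xs, ys):
    """Laplace MLE: weighted median of y_i/x_i with weights |x_i| (needs x_i != 0)."""
    pairs = sorted((y / x, abs(x)) for x, y in zip(xs, ys))
    half = 0.5 * sum(w for _, w in pairs)
    running = 0.0
    for ratio, w in pairs:
        running += w
        if running >= half:
            return ratio


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 40.0]  # slope 2 everywhere except the last point

beta_ols = ols_slope(xs, ys)
beta_lad = lad_slope(xs, ys)
print(beta_ols, beta_lad)
```

The single outlier drags the least-squares slope well above 2, while the weighted median stays on the slope shared by the other four points.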

F.0.1.3 Exercise 1.3: Activation choice for a PINN.

A ReLU network $\hat y(x) = \sum_k w_k\,\mathrm{ReLU}(a_k x + b_k) + c$ is piecewise-linear: between consecutive kink points $x = -b_k/a_k$ it is affine in $x$, so $\hat y''(x) = 0$ on every open subinterval. At a kink, the second distributional derivative is a Dirac measure, not a classical pointwise value. A strong-form PDE residual such as the Black–Scholes operator

$$
\partial_t V + \tfrac{1}{2}\sigma^2 S^2\,V_{SS} + r S\,V_S - r V \;=\; 0
$$

must hold pointwise for almost every $S$; with a piecewise-linear $V$, the term $V_{SS}$ vanishes a.e. in the classical sense, and the residual reduces to $\partial_t V + r S V_S - rV$, which has no nontrivial solution that respects the actual boundary conditions of an option-pricing problem. Concrete example: $\hat y(x) = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x) = |x|$ has $\hat y''(x) = 0$ for every $x \neq 0$ in the classical sense. Smooth activations (tanh, Swish, GELU, softplus) are $C^\infty$ and produce well-defined pointwise second derivatives, which is why the PINN literature recommends them whenever the PDE is of order $\ge 2$.
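A quick numerical illustration of the same point uses a central second difference in place of automatic differentiation (an assumption made to keep the sketch dependency-free; `second_diff`, the evaluation point 0.7, and the step `h` are arbitrary choices).

```python
import math


def second_diff(f, x, h=1e-3):
    """Central second-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def relu_net(x):
    """ReLU(x) + ReLU(-x), which equals |x|."""
    return max(x, 0.0) + max(-x, 0.0)


away_from_kink = second_diff(relu_net, 0.7)   # classical second derivative: zero
across_kink = second_diff(relu_net, 0.0)      # equals 2/h, diverging as h -> 0
smooth = second_diff(math.tanh, 0.7)          # finite, nonzero curvature
exact_tanh = -2.0 * math.tanh(0.7) * (1.0 - math.tanh(0.7) ** 2)
print(away_from_kink, across_kink, smooth, exact_tanh)
```

Away from the kink the ReLU network carries no curvature at all, and across the kink the difference quotient scales like $2/h$, the discrete footprint of the Dirac mass; only the smooth activation returns a usable second derivative.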

F.0.1.4 Exercise 1.4: Adam vs. AdamW.

Write the gradient at step $t$ as $g_t = \nabla_\theta \ell(\theta_t)$. With $L_2$ regularization added to the loss, $\ell^{L_2}(\theta) = \ell(\theta) + \tfrac{\lambda}{2}\|\theta\|^2$, so the effective gradient becomes $\tilde g_t = g_t + \lambda \theta_t$. Adam’s update applied to $\tilde g_t$ is

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\tilde g_t,\quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\tilde g_t^2,\quad \theta_{t+1} = \theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon}.
$$

The key observation is that the weight-decay contribution $\lambda \theta_t$ enters $m_t$ and $v_t$ through the same exponential moving averages as the data gradient, so the effective shrinkage applied to $\theta$ is $\eta\lambda \theta_t$ scaled by $1/(\sqrt{\hat v_t}+\varepsilon)$, which differs across coordinates. Parameters with large historical gradients receive small effective decay; parameters with small gradients receive large decay. AdamW decouples the two:

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t,\quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2,\quad \theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t}+\varepsilon}.
$$

Now the multiplicative shrinkage $(1 - \eta\lambda)$ acts uniformly on all parameters, independent of the adaptive denominator. Numerically the two updates differ by a factor of $1/(\sqrt{\hat v_t}+\varepsilon)$ on the decay term; AdamW preserves the textbook intuition “weight decay shrinks weights at the same rate everywhere.”
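The coordinate dependence of the decay can be made concrete with a single scalar update from zero moment estimates. This is a minimal sketch of the two update rules written above, not a library implementation; the hyperparameters, the two gradient magnitudes, and the function names are illustrative assumptions.

```python
import math


def adam_l2_step(theta, g, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    """Adam applied to the L2-regularized gradient g + lam * theta."""
    g_tilde = g + lam * theta
    m = b1 * m + (1 - b1) * g_tilde
    v = b2 * v + (1 - b2) * g_tilde ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def adamw_step(theta, g, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: moments see only g; decay is the separate factor (1 - lr * lam)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return (1 - lr * lam) * theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


lr, lam, theta0 = 0.01, 0.1, 1.0
results = {}
for g in (10.0, 0.01):  # one large-gradient and one small-gradient coordinate
    th_l2, _, _ = adam_l2_step(theta0, g, 0.0, 0.0, 1, lr, lam)
    th_w, _, _ = adamw_step(theta0, g, 0.0, 0.0, 1, lr, lam)
    # Decay part of the Adam+L2 step; at t = 1 the corrected sqrt(v_hat) is |g_tilde|.
    l2_decay = lr * lam * theta0 / (abs(g + lam * theta0) + 1e-8)
    results[g] = (th_l2, th_w, l2_decay, lr * lam * theta0)
    print(g, results[g])
```

With these numbers the decay part of the Adam-plus-$L_2$ step is roughly two orders of magnitude smaller on the large-gradient coordinate than on the small-gradient one, while AdamW shrinks both by the same $\eta\lambda\theta_t$.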

F.0.1.5 Exercise 1.5: RNN forward pass by hand.

With $W_h = 0.5\,I_2$, $W_x = (1,0)^\top$, $h_0 = (0,0)^\top$:

$$
\begin{aligned}
h_1 &= \tanh\!\big((0,0)^\top + (1,0)^\top\big) = (\tanh 1,\, 0)^\top \approx (0.7616,\, 0)^\top,\\
h_2 &= \tanh\!\big(0.5\,h_1 + (0,0)^\top\big) = (\tanh(0.3808),\,0)^\top \approx (0.3634,\,0)^\top,\\
h_3 &= \tanh\!\big(0.5\,h_2 + (1,0)^\top\big) = (\tanh(1.1817),\,0)^\top \approx (0.8280,\,0)^\top.
\end{aligned}
$$

Outputs: $\hat y_t = W_y h_t$ gives $\hat y_1 \approx 0.7616$, $\hat y_2 \approx 0.3634$, $\hat y_3 \approx 0.8280$ (only the first hidden coordinate is excited, so the second column of $W_y$ is irrelevant).

For the gradient, write $h_t = \tanh(z_t)$ with $z_t = W_h h_{t-1} + W_x x_t$. Then

$$
\frac{\partial \hat y_3}{\partial x_1} = W_y\,\frac{\partial h_3}{\partial z_3}\,W_h\,\frac{\partial h_2}{\partial z_2}\,W_h\,\frac{\partial h_1}{\partial z_1}\,W_x,
$$

where $\partial h_t/\partial z_t = \mathrm{diag}(1-\tanh^2(z_t))$. Numerically, $\tanh'(1)\approx 0.4200$, $\tanh'(0.3808)\approx 0.8679$, $\tanh'(1.1817)\approx 0.3144$, so

$$
\frac{\partial \hat y_3}{\partial x_1} \approx 1 \cdot 0.3144 \cdot 0.5 \cdot 0.8679 \cdot 0.5 \cdot 0.4200 \cdot 1 \approx 0.0287.
$$

The decay rate is the product of two distinct effects: (i) the spectral radius of $W_h$ (here 0.5), which contributes one factor of $W_h$ per recurrent step ($T-1$ multiplicative copies in the chain above), and (ii) the saturation of $\tanh'(\cdot) \in (0,1]$, which contributes a second multiplicative factor per step. Setting (ii) aside, with $\|W_h\|_2 = 0.5 < 1$ the gradient magnitude already decays at least as fast as $0.5^{T-1}$, i.e. exponentially in the sequence length. This is the canonical vanishing-gradient pathology of vanilla RNNs (Section 1.11). LSTM and GRU gates address (i) by allowing the recurrent Jacobian’s eigenvalues to stay close to one without making training unstable; effect (ii) is intrinsic to bounded activations and is mitigated by skip connections (and architectural choices that keep activations away from saturation regimes), not by gating.
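The recursion and the Jacobian product can be replayed in a few lines. The sketch below tracks only the first hidden coordinate (the second stays at zero), reads the input sequence $(1, 0, 1)$ off the pre-activations in the displayed recursion, and assumes the relevant entry of $W_y$ equals 1, as in the numerical product above; the variable names are illustrative.

```python
import math

xs = [1.0, 0.0, 1.0]            # inputs implied by the displayed pre-activations
w_h, w_x, w_y = 0.5, 1.0, 1.0   # first-coordinate entries; w_y = 1 is an assumption


def forward(inputs):
    """Return the first hidden coordinate h_t for t = 1..T."""
    h, hs = 0.0, []
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
        hs.append(h)
    return hs


hs = forward(xs)

# d y_3 / d x_1 = w_y * prod_t tanh'(z_t) * w_h^(T-1) * w_x, using tanh' = 1 - h^2
grad = w_y * w_x * w_h ** (len(hs) - 1)
for h in hs:
    grad *= 1.0 - h * h

# Finite-difference cross-check on the first input.
step = 1e-6
up = forward([xs[0] + step] + xs[1:])[-1]
down = forward([xs[0] - step] + xs[1:])[-1]
fd_grad = w_y * (up - down) / (2 * step)
print(hs, grad, fd_grad)
```

Extending `xs` with further inputs shows the gradient with respect to the first input shrinking by at least a factor of 0.5 per additional step.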

F.0.1.6 Exercise 1.6: Attention by hand.

With $q_i = k_i = v_i = x_i$ and $x = (0,1,0.5)$, the score matrix $S_{ij} = q_i k_j$ is

$$
S = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 0.5 \\ 0 & 0.5 & 0.25\end{pmatrix}.
$$

Softmaxing each row and using $e^0 = 1$, $e^{0.25} \approx 1.284$, $e^{0.5}\approx 1.649$, $e^1 \approx 2.718$:

$$
\begin{aligned}
a_1 &= \tfrac{1}{3}(1,1,1), & o_1 &= \tfrac{1}{3}(0 + 1 + 0.5) = 0.5, \\
a_2 &\approx (0.186,\, 0.506,\, 0.307), & o_2 &\approx 0.186\cdot 0 + 0.506\cdot 1 + 0.307\cdot 0.5 \approx 0.660, \\
a_3 &\approx (0.254,\, 0.419,\, 0.326), & o_3 &\approx 0.254\cdot 0 + 0.419\cdot 1 + 0.326\cdot 0.5 \approx 0.582.
\end{aligned}
$$

Each attention vector $a_i$ is on the simplex (entries non-negative, summing to one up to rounding), so $o_i = \sum_j a_{ij} v_j$ is a convex combination of the values, lying in $[0, 1]$. Token 2, the largest input, attends most strongly to itself ($a_{22}\approx 0.51$); token 3 also attends most to token 2 because the inner product $q_3 k_2 = 0.5$ exceeds the self-score $q_3 k_3 = 0.25$. Token 1 has zero query magnitude, so its row is uniform: with no signal to discriminate on, attention defaults to a uniform average over the values.
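The hand computation can be replayed with a few lines of plain Python. As in the solution, the sketch applies no $1/\sqrt{d_k}$ scaling to the scores; the helper name `softmax` and the variable names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


x = [0.0, 1.0, 0.5]                          # q_i = k_i = v_i = x_i
scores = [[q * k for k in x] for q in x]     # S_ij = q_i * k_j
attn = [softmax(row) for row in scores]      # row-wise attention weights
out = [sum(a * v for a, v in zip(row, x)) for row in attn]

for row, o in zip(attn, out):
    print([round(a, 4) for a in row], round(o, 4))
```

Each printed row sums to one, and each output is a weighted average of the three values.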

F.0.1.7 Exercise 1.7: TensorBoard optimizer comparison.

Coding exercise. Expected qualitative behavior on a small classification task: SGD with momentum is often slower to reduce training loss but can generalize competitively when the learning-rate schedule is well tuned; Adam often reaches a low training loss fastest, but its validation curve can diverge from the training curve sooner than for SGD or AdamW; AdamW often sits between the two. The actual crossover point is a notebook output and should be reported from the run.

F.0.2 Chapter 2: Deep Equilibrium Nets

F.0.2.1 Exercise 2.1: Closed-form Brock–Mirman.

Conjecture $V(K, z) = A\log K + B\log z + C$. The Bellman equation

$$
V(K,z) = \max_{K'}\Bigl\{\log\bigl(zK^\alpha - K'\bigr) + \beta\,\mathbb{E}_{z'\mid z}\!\bigl[V(K', z')\bigr]\Bigr\}
$$

yields the FOC $1/(zK^\alpha - K') = \beta A/K'$, which gives the constant savings rate

$$
s^\star = \frac{K'}{zK^\alpha} = \frac{\beta A}{1 + \beta A}.
$$

Substituting the conjecture back and matching the $\log K$ coefficient produces $A = \alpha + \beta A\alpha$, hence $A = \alpha/(1-\alpha\beta)$. Plugging this $A$ into $s^\star$:

$$
s^\star \;=\; \frac{\beta\,\alpha/(1-\alpha\beta)}{1 + \beta\,\alpha/(1-\alpha\beta)} \;=\; \frac{\alpha\beta}{1-\alpha\beta+\alpha\beta} \;=\; \alpha\beta.
$$

The DEQN parameterizes $s_t = \sigma\!\bigl(\mathcal{N}_\rho(K_t, z_t)\bigr)$. Because the optimal savings rate is constant, the converged sigmoid output should be close to $\alpha\beta$ at every state on the ergodic set, and so should its average (typical calibrations $\alpha=0.36$, $\beta=0.96$ give $s^\star \approx 0.346$). Any systematic deviation indicates either insufficient training or a quadrature / sampling bias.
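The algebra can be sanity-checked numerically at the calibration quoted above. The sketch below normalizes output to $zK^\alpha = 1$ (an arbitrary choice, since the savings rate does not depend on it); the variable names are illustrative.

```python
alpha, beta = 0.36, 0.96

A = alpha / (1.0 - alpha * beta)             # coefficient on log K
s_star = beta * A / (1.0 + beta * A)         # savings rate implied by the FOC

# Fixed-point check of the coefficient match A = alpha + alpha * beta * A.
coef_gap = A - (alpha + alpha * beta * A)

# FOC residual 1/(Y - K') - beta * A / K' at K' = s_star * Y.
output = 1.0
k_next = s_star * output
foc_gap = 1.0 / (output - k_next) - beta * A / k_next

print(A, s_star, coef_gap, foc_gap)
```

A trained network's savings rate can be compared against `s_star` state by state; the closed form gives the target without any simulation.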

F.0.2.2 Exercise 2.2: Hard vs. soft constraints.

With a softplus head on consumption alone, $C_t = \mathrm{softplus}(\mathcal{N}_\rho(K_t, z_t)) > 0$ is guaranteed, but next-period capital is then defined as the residual $K_{t+1} = z_t K_t^\alpha - C_t$, which is unconstrained in sign. At random initialization the network output is approximately $\mathcal{N}(0, 1)$, so $C_t$ is typically near $\mathrm{softplus}(0) \approx 0.69$ but can be much larger.

Concrete failure: take $K = 1$, $z = 1$, $\alpha = 0.36$, so $zK^\alpha = 1$. If the network produces an output of, say, 5, then $C = \mathrm{softplus}(5) \approx 5.007$ and $K_{t+1} = 1 - 5.007 = -4.007 < 0$. The term $K_{t+1}^{\alpha-1}$ needed at the next iterate is then a non-integer power of a negative number, hence complex-valued (NaN in real floating-point arithmetic), and the loss explodes.

The sigmoid-savings parameterization $s_t = \sigma(\mathcal{N}_\rho) \in (0,1)$ avoids this entirely: $K_{t+1} = s_t z_t K_t^\alpha > 0$ and $C_t = (1-s_t) z_t K_t^\alpha > 0$ both hold by construction, regardless of the network’s raw output. This is the simplest example of the broader principle that architectural feasibility encodings dominate loss-based ones whenever the constraint can be written as a closed-form algebraic identity.
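The two output heads can be compared directly on a handful of raw network outputs. This is a minimal sketch with illustrative names (`consumption_head`, `savings_head`) and made-up raw outputs; it covers only the feasibility mapping, not the network itself.

```python
import math


def softplus(u):
    return math.log1p(math.exp(u))


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def consumption_head(u, K=1.0, z=1.0, alpha=0.36):
    """Softplus on consumption; capital is the (possibly negative) residual."""
    output = z * K ** alpha
    c = softplus(u)
    return c, output - c


def savings_head(u, K=1.0, z=1.0, alpha=0.36):
    """Sigmoid savings rate; consumption and capital split output."""
    output = z * K ** alpha
    s = sigmoid(u)
    return (1.0 - s) * output, s * output


raw_outputs = [-20.0, -5.0, 0.0, 5.0, 20.0]  # hypothetical raw network outputs
soft = [consumption_head(u) for u in raw_outputs]
sig = [savings_head(u) for u in raw_outputs]
print(soft[3])  # the failure case from the text: u = 5
print(all(c > 0 and k > 0 for c, k in sig))
```

In exact arithmetic the sigmoid head is strictly inside $(0,1)$ for any input; in double precision it rounds to exactly 1 for very large raw outputs (beyond roughly 37), which is why the sketch stops at $\pm 20$.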

F.0.2.3 Exercise 2.3: Path averaging vs. conditional expectation.

On the simulated path, the path-averaged residual is $\bar r_T(\theta) = T^{-1}\sum_{t=1}^T r(\theta, x_t)$ with $\{x_t\}$ generated by the model dynamics. Under ergodicity, Birkhoff’s theorem gives

$$
\bar r_T(\theta) \;\xrightarrow{T \to \infty}\; \int r(\theta, x)\,\mu(dx) \;=\; \mathbb{E}_\mu[r(\theta, x)] \quad\text{a.s.},
$$

where $\mu$ is the stationary distribution. The conditional-expectation residual at a fixed point $x_t$, $\mathbb{E}[r(\theta, x_{t+1}) \mid x_t]$, is a different object: it integrates only over the conditional shock distribution, holding the current state fixed.

Their connection: if one sweeps the conditional residual over $x_t \sim \mu$ and averages, the result coincides with $\mathbb{E}_\mu[r(\theta, x)]$ by the law of iterated expectations. In other words, path averaging combines sampling over states (the outer expectation) and the implicit Monte Carlo integration over shocks (one shock per simulated step). Conditional expectation evaluates the inner integral exactly via quadrature but still requires a way to draw states $x_t$.

Bias–variance trade-off at finite $T_{\text{sim}}$: the path-averaged loss has variance $O(1/T_{\text{sim}})$ from finite-sample noise but no bias. An exact-quadrature residual has zero stochastic noise on the inner integral but its outer-state coverage depends on the sampling scheme; with mini-batch SGD over the ergodic set, the variance comes from the random batch, not the integration. In the chapter the deterministic quadrature is preferred whenever the shock dimension is low ($d \lesssim 5$), and pathwise residuals dominate once $d$ is high enough that explicit quadrature becomes expensive.

F.0.2.4 Exercise 2.4: Brock–Mirman with Gauss–Hermite.

The Euler equation in Brock–Mirman with $\delta = 1$, log utility, and AR(1) productivity $\ln z' = \varrho\ln z + \sigma_z\varepsilon'$, $\varepsilon' \sim \mathcal{N}(0,1)$, is

$$
\frac{1}{C_t} \;=\; \beta\,\mathbb{E}\!\left[\frac{\alpha\,z'\,K_{t+1}^{\alpha - 1}}{C_{t+1}} \,\Big|\, K_t, z_t\right].
$$

Replace the expectation by a Gauss–Hermite rule. The change of variables $\varepsilon' = \sqrt{2}\,\xi$ turns the standard normal density into the Gauss–Hermite weight $e^{-\xi^2}$ times the normalization $1/\sqrt{\pi}$, so the $Q$-point GH quadrature is

$$
\mathbb{E}[h(\varepsilon')] \;\approx\; \frac{1}{\sqrt{\pi}} \sum_{q=1}^{Q} w_q\,h\!\bigl(\sqrt{2}\,\xi_q\bigr),
$$

with classical nodes $\xi_q$ and weights $w_q$ that satisfy $\sum_q w_q = \sqrt{\pi}$. Table F.1 lists the five-point rule used in this exercise.

Table F.1: Five-point Gauss–Hermite nodes and weights for the convention $\mathbb{E}[h(\varepsilon)] \approx \pi^{-1/2}\sum_q w_q h(\sqrt{2}\xi_q)$. The weights sum to $\sqrt{\pi}$ before the outside normalization.

| $q$ | $\xi_q$ | $w_q$ |
|---|---|---|
| 1 | -2.0202 | 0.0200 |
| 2 | -0.9586 | 0.3936 |
| 3 | 0.0000 | 0.9453 |
| 4 | +0.9586 | 0.3936 |
| 5 | +2.0202 | 0.0200 |

Substituting, the residual at a state $(K_t, z_t)$ becomes

$$
r(\theta; K_t, z_t) \;=\; 1 \;-\; \frac{\beta\,\alpha\,C_t}{\sqrt{\pi}}\sum_{q=1}^{5} w_q\,\frac{z_q'\,K_{t+1}^{\alpha-1}}{C_{t+1}^{q}},
$$

where $z_q' = \exp(\varrho\ln z_t + \sigma_z\sqrt{2}\,\xi_q)$ and $C_{t+1}^q$ is the consumption obtained by feeding $(K_{t+1}, z_q')$ into the network. Numerically, comparing this 5-point sum with a high-draw Monte Carlo estimate on the same residual is a useful diagnostic: agreement should be close for smooth integrands and small shock variance, but the realized discrepancy is an output of the check rather than a fixed benchmark.
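The quadrature convention can be tested on integrands with known expectations before it is wired into the residual. The sketch below hard-codes the rounded nodes and weights of Table F.1, so agreement is limited to about three to four digits; `gh_expect` and the value of `sigma_z` are illustrative assumptions.

```python
import math

# Rounded five-point nodes and weights from Table F.1.
XI = [-2.0202, -0.9586, 0.0, 0.9586, 2.0202]
W = [0.0200, 0.3936, 0.9453, 0.3936, 0.0200]


def gh_expect(h):
    """Approximate E[h(eps)] for eps ~ N(0, 1) with the five-point rule."""
    total = sum(w * h(math.sqrt(2.0) * xi) for xi, w in zip(XI, W))
    return total / math.sqrt(math.pi)


sigma_z = 0.02  # hypothetical shock volatility
mass = gh_expect(lambda e: 1.0)                       # should be close to 1
second = gh_expect(lambda e: e * e)                   # should be close to 1
lognormal = gh_expect(lambda e: math.exp(sigma_z * e))
exact = math.exp(0.5 * sigma_z ** 2)                  # E[exp(sigma * eps)]
print(mass, second, lognormal, exact)
```

The same `gh_expect` call can wrap the Euler integrand once a policy function is available; a unit-mass check like `mass` is a cheap guard against mixing up the $\sqrt{2}$ and $\sqrt{\pi}$ conventions.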

F.0.2.5 Exercise 2.5: Monomial rule by hand (Stroud-3 at $d=4$).

Equation (2.18) places nodes at $\bm{x}_k^\pm = \pm\sqrt{d}\,\bm{e}_k$ for $k = 1, \dots, d$, with equal weights $1/(2d) = 1/8$. At $d=4$, the eight nodes are $\pm 2 \bm{e}_k$ for $k=1,2,3,4$.

First moment. $\mathbb{E}[\varepsilon_i']_{\text{rule}} = (1/8)\sum_k [(\sqrt{d}\,\bm{e}_k)_i + (-\sqrt{d}\,\bm{e}_k)_i] = 0$ by $\pm$-cancellation.

Second moment. $(\varepsilon_i')^2$ is non-zero only at $\pm\sqrt{d}\,\bm{e}_i$, where it equals $d$. Two such nodes contribute $2d$, weighted by $1/(2d)$, giving exactly 1.

Cross moment. $\varepsilon_i'\varepsilon_j'$ for $i \neq j$ is zero at every node because each node has only one nonzero coordinate. So the rule returns 0, the true value.

Third moment. $(\varepsilon_i')^3$ at $\pm\sqrt{d}\bm{e}_i$ equals $\pm d^{3/2}$; the two values cancel. Returns 0, exact.

Fourth moment. $(\varepsilon_i')^4$ at $\pm\sqrt{d}\bm{e}_i$ equals $d^2$, both signs. Two nodes contribute $2d^2$, weighted by $1/(2d)$, which gives $d$. At $d=4$, the rule returns 4, while the true value is $\mathbb{E}[\varepsilon_i'^{\,4}] = 3$.

Linear bias growth. In general the rule reports $d$ for the fourth moment, so the relative error $(d - 3)/3$ is linear in $d$: $33\%$ at $d=4$, $67\%$ at $d=5$, $100\%$ at $d=6$, and $200\%$ at $d=9$. Only at $d=3$ does the rule happen to match the fourth moment exactly.
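The moment checks above are easy to automate. The sketch below builds the $2d$ nodes $\pm\sqrt{d}\,\bm{e}_k$ with equal weights $1/(2d)$ as described in the solution; the function names are illustrative.

```python
import math


def stroud3_nodes(d):
    """Nodes +/- sqrt(d) * e_k, k = 1..d, with the common weight 1/(2d)."""
    r = math.sqrt(d)
    nodes = []
    for k in range(d):
        for sign in (1.0, -1.0):
            node = [0.0] * d
            node[k] = sign * r
            nodes.append(node)
    return nodes, 1.0 / (2 * d)


def rule_expect(f, d):
    """Approximate E[f(eps)] for eps ~ N(0, I_d) with the monomial rule."""
    nodes, w = stroud3_nodes(d)
    return w * sum(f(n) for n in nodes)


d = 4
moments = {
    "first": rule_expect(lambda n: n[0], d),
    "second": rule_expect(lambda n: n[0] ** 2, d),
    "cross": rule_expect(lambda n: n[0] * n[1], d),
    "third": rule_expect(lambda n: n[0] ** 3, d),
    "fourth": rule_expect(lambda n: n[0] ** 4, d),
}
rel_err = {dim: (rule_expect(lambda n: n[0] ** 4, dim) - 3.0) / 3.0
           for dim in (3, 4, 5, 6, 9)}
print(moments)
print(rel_err)
```

The printed relative errors follow $(d-3)/3$, so the fourth-moment bias can be read off for any shock dimension of interest.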

When does this matter? The Euler residual is a smooth function of the next-period shock $\varepsilon'$. Taylor-expanding around the conditional mean, the leading correction involves the Hessian of the residual with respect to $\varepsilon'$, which probes only second moments, and these are exact under Stroud-3. Fourth-moment bias enters only at the next order, scaled by the integrand’s fourth derivative. For moderate CRRA curvature and thin-tailed shocks this term can be small relative to classroom residual tolerances, but it is a diagnostic to check rather than a universal bound. The bias becomes material when (i) the integrand has heavy fourth-order content (e.g., very risk-averse preferences or fat-tailed shocks), or (ii) the shock dimension is large enough that the relative error $(d-3)/3$ exceeds the residual tolerance one targets. This is the threshold at which the monomial rule should be replaced by Stroud-5 ($2d^2 + 1$ nodes) or QMC.

F.0.2.6 Exercise 2.6: Loss-kernel selection.

Definitions for reference. MSE: $\frac{1}{N}\sum_i r_i^2$. MAE: $\frac{1}{N}\sum_i |r_i|$. Huber($\delta$): quadratic for $|r| \le \delta$, linear above. Pinball loss at $\tau$: $L_\tau(r) = \max(\tau r,\, (\tau - 1)r)$, whose minimizer is the $\tau$-quantile of $r$. CVaR at $\alpha$: the expected value of $r$ conditional on $r$ exceeding its $\alpha$-quantile. Log-cosh: $\sum_i \log\cosh(r_i)$, smooth and quadratic near zero, linear in tails.

(a) Huber loss: “smooth, quadratic near zero, linear in tails” is the literal definition. MAE is also linear in the tails but is not differentiable at $r=0$, so its gradient flips sign discontinuously there and the optimizer can stall; log-cosh is smooth and shares the asymptotic shape with Huber but has no tunable threshold. Huber($\delta$) gives the cleanest control of where the regime change happens.

(b) CVaR at $\alpha = 0.99$. The CVaR loss optimizes the conditional mean above the 99th-percentile threshold, which is exactly what the regulator audits. The pinball loss at $\tau = 0.99$ would only target the 99th-percentile residual itself, not the conditional average above it; if the residuals have a fat right tail, the pinball-trained policy can have arbitrarily large worst-case $1\%$ residuals. CVaR is the right primitive for “worst-case mean” control.

(c) MAE (or equivalently pinball at $\tau = 0.5$). By construction MAE’s first-order condition is solved at the median, not the mean: $\partial |r|/\partial r = \mathrm{sign}(r)$, so the gradient contribution is $\pm 1$ per residual, regardless of magnitude. This is exactly what the desideratum asks for: no single tail residual dominates the gradient. Huber would also down-weight tails but still tracks the mean below the threshold; the user explicitly wants median-targeting.
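For reference, the kernels discussed in (a) to (c) can be written as scalar functions. The sketch below uses the common $\tfrac12 r^2$ normalization for Huber and an empirical tail average for CVaR; both are assumptions, since the definitions above fix only the shapes, and the function names are illustrative.

```python
import math


def huber(r, delta=1.0):
    """Quadratic for |r| <= delta, linear above (1/2-normalized form)."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def pinball(r, tau):
    """Pinball (quantile) loss max(tau * r, (tau - 1) * r)."""
    return max(tau * r, (tau - 1.0) * r)


def log_cosh(r):
    """log(cosh(r)) in an overflow-safe form."""
    a = abs(r)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


def cvar(residuals, alpha):
    """Empirical CVaR: mean of the worst (1 - alpha) share of residuals."""
    ordered = sorted(residuals)
    # round() guards against floating-point noise in (1 - alpha) * n
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[-k:]
    return sum(tail) / len(tail)


sample = [float(i) for i in range(1, 101)]  # toy residuals 1, 2, ..., 100
print(huber(0.5), huber(3.0))
print(pinball(2.0, 0.9), pinball(-2.0, 0.9))
print(log_cosh(0.0), log_cosh(50.0))
print(cvar(sample, 0.99), cvar(sample, 0.90))
```

On the toy sample the 99% CVaR equals the single worst residual, whereas the 99th-percentile target of the pinball loss would ignore how far beyond the threshold that residual lies.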

F.0.2.7 Exercise 2.7 and Exercise 2.8.

These are coding exercises (notebook lecture_03_02_Brock_Mirman_Uncertainty_DEQN.ipynb); reference outputs and timing curves are in the companion repository. Qualitative anchors: in Exercise 2.7, the per-epoch wall time of tensor-product Gauss–Hermite ($Q^d$ nodes) should grow exponentially in $d$ while Stroud-3 ($2d$ nodes) grows linearly, with a crossover that is visible already at $d=4$ to $5$ on a single GPU; the relative Euler error should track the integration accuracy of each rule, with Stroud-3 inheriting the fourth-moment bias of Section 2.6.3. In Exercise 2.8, swapping Swish for $\tanh$ typically slows time-to-converge by tens of percent on a smooth problem like Brock–Mirman because $\tanh$ saturates faster, but final accuracy is comparable; convergence should still hold under the same hyperparameters.

F.0.3 Chapter 3: The International Real Business Cycle Model

F.0.3.1 Exercise 3.1: Fischer–Burmeister.

For the forward direction, suppose $a \ge 0$, $b \ge 0$, $ab = 0$. Without loss of generality $b = 0$; then $\Phi(a, 0) = a + 0 - \sqrt{a^2 + 0} = a - |a| = 0$ since $a \ge 0$. By symmetry the same holds when $a = 0$.

For the reverse direction, suppose $\Phi(a,b) = 0$, i.e. $a + b = \sqrt{a^2 + b^2}$. The right-hand side is non-negative, so $a + b \ge 0$. Squaring: $(a+b)^2 = a^2 + b^2 \Rightarrow 2ab = 0 \Rightarrow ab = 0$. Combined with $a + b \ge 0$ and $ab = 0$, the only possibility is one of $a, b$ being zero and the other non-negative, i.e. $a, b \ge 0$ and $ab = 0$. $\square$

The level set $\Phi(a, b) = 0$ is the union of the non-negative $a$-axis and the non-negative $b$-axis (an L-shape in the $(a,b)$ plane). The smoothed variant $\Phi_\varepsilon(a,b) = a + b - \sqrt{a^2 + b^2 + \varepsilon^2}$ rounds the corner at the origin: at $(a,b) = (0,0)$ one finds $\Phi_\varepsilon(0,0) = -\varepsilon \neq 0$, while far from the origin $\sqrt{a^2 + b^2 + \varepsilon^2} \approx \sqrt{a^2 + b^2}$ and the smoothed level set converges to the unsmoothed L-shape. Indeed, $\Phi_\varepsilon(a,b) = 0$ requires $a + b > 0$ and, after squaring, $2ab = \varepsilon^2$, so the smoothed zero set is the hyperbola branch $ab = \varepsilon^2/2$ in the open positive quadrant, whose closest point to the origin lies at distance $\varepsilon$. In a numerical setting the smoothing eliminates the gradient kink at the origin, which is convenient for AD but distorts the strict KKT zero set by an $O(\varepsilon)$ amount.

(c) Gradient direction. At an interior point of the open positive quadrant, $\nabla\Phi(a,b) = (1 - a/\sqrt{a^2+b^2},\, 1 - b/\sqrt{a^2+b^2})$. At $(a,b) = (1,1)$ this evaluates to $(1-1/\sqrt 2,\, 1-1/\sqrt 2) \approx (0.293, 0.293)$, both components strictly positive. The raw gradient $\nabla\Phi$ therefore points northeast, away from the L-shaped zero set, so the bare residual $\Phi$ on its own is not the right object for SGD to descend on. What does descend toward the zero set is the squared loss $\Phi^2$: by the chain rule $\nabla(\Phi^2) = 2\Phi\,\nabla\Phi$, and inside the open positive quadrant $\Phi(a,b) > 0$, so $-\nabla(\Phi^2) = -2\Phi\,\nabla\Phi$ has both components strictly negative at $(1,1)$ and points southwest, i.e. back toward the zero set. This is the operative observation for training: SGD on $\Phi^2$ is what pulls iterates that violate complementarity back to the L; the raw gradient $\nabla\Phi$ on its own would push them away.

(d) Sign convention. Replacing $\Phi$ by $-\Phi$ leaves the squared loss untouched: $(-\Phi)^2 = \Phi^2$, so the gradient field of the loss and the SGD trajectory are unchanged. In particular, the sign convention $a + b - \sqrt{a^2+b^2}$ versus $\sqrt{a^2+b^2} - a - b$ is irrelevant once the residual is squared, and the operative quantity for SGD is $-\nabla(\Phi^2)$ rather than $-\nabla\Phi$. The L-shaped zero set is invariant under sign flip; only the value of $\Phi$ at off-zero points changes (it flips sign), and squaring removes that distinction.
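The statements about the zero set, the smoothing offset, and the gradient directions can be checked numerically. This is a small sketch with illustrative function names; the smoothed variant is obtained by passing a nonzero `eps`.

```python
import math


def fb(a, b, eps=0.0):
    """Fischer-Burmeister residual; eps > 0 gives the smoothed variant."""
    return a + b - math.sqrt(a * a + b * b + eps * eps)


def grad_fb(a, b):
    """Gradient of the unsmoothed residual away from the origin."""
    n = math.hypot(a, b)
    return 1.0 - a / n, 1.0 - b / n


def grad_fb_squared(a, b):
    """Gradient of the squared loss: 2 * Phi * grad(Phi)."""
    phi = fb(a, b)
    ga, gb = grad_fb(a, b)
    return 2.0 * phi * ga, 2.0 * phi * gb


on_axis = (fb(2.0, 0.0), fb(0.0, 3.0))     # points on the L: residual zero
off_set = (fb(1.0, 1.0), fb(-1.0, 0.0))    # off the L: residual nonzero
smoothed_origin = fb(0.0, 0.0, eps=0.1)    # equals -eps
raw = grad_fb(1.0, 1.0)
descent = tuple(-g for g in grad_fb_squared(1.0, 1.0))
print(on_axis, off_set, smoothed_origin, raw, descent)
```

At $(1,1)$ the raw gradient has two positive components while the descent direction of the squared loss has two negative ones, matching the argument in (c); squaring also makes the result independent of the sign convention in (d).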

F.0.3.2 Exercise 3.2: State-space scaling.

For an IRBC with $N$ symmetric countries, the state vector contains $(K^j, z^j)_{j=1}^N$ for a total of $2N$ components. The network outputs the next-period capital vector $(k^{j\prime})_{j=1}^N$, the irreversibility multipliers $(\mu^j)_{j=1}^N$, and the resource-constraint shadow price $\lambda$, for $2N + 1$ outputs in total. Country-level consumption is recovered algebraically from $\lambda$ via the consumption-sharing FOC and is therefore not a separate output. The loss has $N$ Euler residuals, $N$ Fischer–Burmeister residuals from the irreversibility complementarity, and 1 aggregate-resource-constraint residual, $2N + 1$ components total, matching the $2N + 1$ outputs. Each Euler residual is an expectation over an $(N+1)$-dimensional shock vector (one country-specific innovation per country plus one aggregate innovation, as in equation (3.4)).

Tensor-product Gauss–Hermite at $Q=3$ costs $3^{N+1}$ evaluations per residual. Setting $3^{N+1} > 10^4$ gives $N + 1 > \log_3(10^4) \approx 8.38$, so $N \ge 8$ already exceeds the threshold, and $N = 9$ overshoots by a factor of nearly 6 (cost $3^{10} = 59\,049$).

The Stroud-3 rule has $2(N+1)$ nodes per residual. Setting $2(N+1) < 100$ gives $N \le 48$, comfortable for any IRBC dimension actually used in practice. At $N = 50$ the monomial cost is 102 nodes; the tensor cost is $3^{51} \approx 2.15 \times 10^{24}$ evaluations. The gap between the two rules, roughly four orders of magnitude at $N = 10$ ($3^{11} = 177\,147$ versus 22 nodes) and roughly twenty-two orders of magnitude at $N = 50$, is the operational reason DEQNs use Stroud-3 by default once $N \gtrsim 5$.
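The thresholds quoted above follow from two one-line cost functions. The sketch below uses illustrative function names and simply searches over $N$.

```python
import math


def tensor_gh_nodes(n_countries, q=3):
    """Tensor-product Gauss-Hermite nodes for an (N + 1)-dimensional shock."""
    return q ** (n_countries + 1)


def stroud3_node_count(n_countries):
    """Stroud-3 monomial-rule nodes for an (N + 1)-dimensional shock."""
    return 2 * (n_countries + 1)


first_over = next(n for n in range(1, 100) if tensor_gh_nodes(n) > 10 ** 4)
last_under = max(n for n in range(1, 100) if stroud3_node_count(n) < 100)

for n in (9, 10, 50):
    ratio = tensor_gh_nodes(n) / stroud3_node_count(n)
    print(n, tensor_gh_nodes(n), stroud3_node_count(n), round(math.log10(ratio), 1))
print(first_over, last_under)
```

The last printed column is the base-10 logarithm of the cost ratio, i.e. the number of orders of magnitude separating the two rules at each $N$.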

F.0.3.3 Exercise 3.3: Two-phase training.

At a randomly initialized network, the policy outputs $k^{j\prime}$ are not coordinated with the resource constraint. Suppose the network produces $k^{j\prime}$ values that, summed and combined with country-$j$ consumption, exceed total output: $\sum_j (k^{j\prime} + c^j + \Gamma^j) > \sum_j Y^j$. The implied state on the next simulated step has $k^{j\prime} < (1-\delta) k^j$ for some country (irreversibility violated), or $c^j < 0$ (negative consumption), or both. Concretely, take $N = 2$, $A_\mathrm{tfp} = 1$, $\delta = 0.025$, $k^1 = k^2 = 1$, and a random network output $(k^{1\prime}, k^{2\prime}) = (0.5, 0.5)$ (instead of the symmetric $(1, 1)$ steady state). Capital has dropped by $50\%$ in one step, so $I^j = k^{j\prime} - (1-\delta)k^j = 0.5 - 0.975 = -0.475 < 0$, violating irreversibility. Even before the irreversibility check fires, the implied consumption $c^j = Y^j - I^j - \Gamma^j$ becomes huge, and in the next step the marginal utility $u'(c)$ is essentially zero, so the Euler residual gradient is uninformative.

Phase 1 (uniform sampling on a wide box of states, with Euler residuals computed against any feasible policy guess) gives the optimizer signal to bring outputs into the feasible region before any simulation is attempted. Once the policy is in a feasible neighbourhood, Phase 2 (simulation-based sampling on the ergodic set) refines accuracy. Without Phase 1, the simulation in Phase 2 starts in regions where the policy is grossly infeasible, the loss explodes, and gradient descent diverges.

F.0.3.4 Exercise 3.4: Adjustment-cost partials and Tobin’s Q.

Write $g^j \equiv k^{j\prime}/k^j - 1$. Then $\Gamma^j = (\kappa/2)\,k^j (g^j)^2$. The partials are

$$
\begin{aligned}
\frac{\partial \Gamma^j}{\partial k^{j\prime}} &= \frac{\kappa}{2}\,k^j \cdot 2 g^j \cdot \frac{1}{k^j} \;=\; \kappa\,g^j \;=\; \kappa\!\left(\frac{k^{j\prime}}{k^j} - 1\right), \\
\frac{\partial \Gamma^j}{\partial k^j} &= \frac{\kappa}{2}(g^j)^2 + \frac{\kappa}{2}\,k^j \cdot 2 g^j \cdot \!\left(-\frac{k^{j\prime}}{(k^j)^2}\right) \\
&= \frac{\kappa}{2}\,(g^j)^2 - \kappa\,g^j\!\left(\frac{k^{j\prime}}{k^j}\right) \;=\; \frac{\kappa}{2}\!\left[\bigl(g^j\bigr)^2 - 2(1+g^j) g^j\right] \\
&= -\frac{\kappa}{2}\!\left[(g^j)^2 + 2 g^j\right] \;=\; \frac{\kappa}{2}\!\left[1 - \bigl(\tfrac{k^{j\prime}}{k^j}\bigr)^{\!2}\right],
\end{aligned}
$$

matching equation (3.6).

At the steady state, $k^{j\prime} = k^j$ so $g^j = 0$. Both partials vanish: $\partial\Gamma^j/\partial k^{j\prime} = 0$ and $\partial\Gamma^j/\partial k^j = 0$. The adjustment cost itself is also zero, $\Gamma^j(k^j, k^j) = 0$, so it drops out of both the resource constraint and the Euler equation. The deterministic steady state of the IRBC with adjustment costs is therefore identical to the frictionless steady state, $k_\mathrm{ss}$ being pinned down by $\beta(1 - \delta + \zeta A_\mathrm{tfp} k^{\zeta-1}) = 1$.

The expression $\partial\Gamma^j/\partial k^{j\prime} = \kappa g^j$ is the marginal cost of investing one more unit and is the IRBC’s analogue of Tobin’s marginal Q: large positive $g^j$ (rapid expansion) creates a large investment wedge, raising the effective per-unit cost of capital next period. Higher $\kappa$ flattens the response of investment to a productivity shock: a large adjustment cost makes the planner spread the response of $k^{j\prime}$ across several periods rather than absorbing the shock in one big move, slowing convergence to the new steady state. In the limit $\kappa \to \infty$, capital adjusts only infinitesimally each period and the dynamics become arbitrarily slow.
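The two closed-form partials can be cross-checked against central finite differences of $\Gamma^j$. The evaluation point, the value of `kappa`, and the function names below are illustrative assumptions.

```python
def gamma(k, k_next, kappa=0.5):
    """Adjustment cost (kappa / 2) * k * (k_next / k - 1)^2."""
    g = k_next / k - 1.0
    return 0.5 * kappa * k * g * g


def d_gamma_d_knext(k, k_next, kappa=0.5):
    """Closed-form partial with respect to next-period capital."""
    return kappa * (k_next / k - 1.0)


def d_gamma_d_k(k, k_next, kappa=0.5):
    """Closed-form partial with respect to current capital."""
    return 0.5 * kappa * (1.0 - (k_next / k) ** 2)


k0, k1, h = 2.0, 2.3, 1e-6   # hypothetical evaluation point and step
fd_knext = (gamma(k0, k1 + h) - gamma(k0, k1 - h)) / (2 * h)
fd_k = (gamma(k0 + h, k1) - gamma(k0 - h, k1)) / (2 * h)
print(d_gamma_d_knext(k0, k1), fd_knext)
print(d_gamma_d_k(k0, k1), fd_k)
```

Evaluating the same functions at `k1 == k0` returns zero for the cost and for both partials.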

F.0.3.5 Exercise 3.5: Complete-markets risk sharing.

The consumption-sharing condition (3.12) reads $c_t^j = (\lambda_t/\tau^j)^{-\gamma_j}$. Therefore

$$
\frac{c_t^i}{c_t^j} \;=\; \left(\frac{\lambda_t}{\tau^i}\right)^{-\gamma_i} \Bigl/ \left(\frac{\lambda_t}{\tau^j}\right)^{-\gamma_j} \;=\; \left(\frac{\tau^i}{\lambda_t}\right)^{\gamma_i} \left(\frac{\lambda_t}{\tau^j}\right)^{\gamma_j} \;=\; (\tau^i)^{\gamma_i}\,(\tau^j)^{-\gamma_j}\,\lambda_t^{\,\gamma_j - \gamma_i}.
$$

(i) With homogeneous IES $\gamma_i = \gamma_j = \gamma$, the $\lambda_t$ exponent vanishes and the ratio collapses to a time-invariant constant:

$$
\frac{c_t^i}{c_t^j} \;=\; \left(\frac{\tau^i}{\tau^j}\right)^{\gamma}.
$$

Since $\log(c_t^i/c_t^j)$ is constant, $\Delta\log c_t^i = \Delta\log c_t^j$, the perfect risk-sharing prediction of Backus et al. (1992): cross-country consumption growth is perfectly correlated (correlation $= 1$).

(ii) With heterogeneous IES $\gamma_i \neq \gamma_j$, the exponent $\gamma_j - \gamma_i$ is non-zero and $\lambda_t$ enters the ratio. As shocks move the planner’s shadow price, the consumption ratio fluctuates: low-IES countries’ consumption is less sensitive to $\lambda_t$ than high-IES countries’. But the log growth rate is still

$$
\Delta \log c_t^j = -\gamma_j\,\Delta\log\lambda_t .
$$

Thus, for positive $\gamma_i,\gamma_j$, every country’s consumption growth is the same aggregate shock $\Delta\log\lambda_t$ scaled by $-\gamma_j$, so any two countries’ consumption-growth rates are positive scalar multiples of each other. The correlation remains one; heterogeneous IES changes relative consumption-growth volatility, not the correlation, in this complete-markets planner allocation.

(iii) The empirical consumption-correlation puzzle is that real-world cross-country consumption growth correlations sit well below the near-perfect correlations implied by the complete-markets benchmark. Heterogeneous IES alone does not break the common-shadow-price structure; closing the gap with the data requires incomplete markets, wedges, nontraded goods, preference nonseparabilities, or other frictions that break full insurance.

F.0.3.6 Exercise 3.6 and Exercise 3.7.

These are notebook exercises (lecture_05_05_IRBC_Exercise.ipynb); reference solutions are in the notebook itself, behind the “attempt first” dividers. Qualitative anchors: in Exercise 3.6, the closed-form steady state $k_\mathrm{ss} = \bigl[(1/\beta - 1 + \delta)/(\zeta A_\mathrm{tfp})\bigr]^{1/(\zeta-1)}$ falls in all three scenarios. A higher $\delta$ raises the required marginal product of capital $1/\beta - 1 + \delta$, a lower $\beta$ raises $1/\beta$, and a lower $\zeta$ shrinks the multiplicative term $\zeta A_\mathrm{tfp}$ while (since $\zeta - 1 < 0$) steepening diminishing returns. Hence $k_\mathrm{ss}$ is decreasing in $\delta$, increasing in $\beta$, and increasing in $\zeta$ around the baseline, with $c_\mathrm{ss} = A_\mathrm{tfp} k_\mathrm{ss}^\zeta - \delta k_\mathrm{ss}$ following by substitution. In Exercise 3.7, inverse-loss weighting $w_i \propto 1/\ell_i$ equalizes the per-component contributions $w_i \ell_i$, so the smallest-magnitude residuals (here the Fischer–Burmeister terms) receive the largest weight; the speed-up is largest when a component is small because it is hard to fit, and the scheme can hurt when a component is small because it is already satisfied by construction (e.g., a hard-coded resource constraint), in which case up-weighting it merely amplifies noise.
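
The comparative statics for Exercise 3.6 can be checked with a few lines of Python. This is a minimal sketch of the closed-form steady state quoted above; the baseline values and the three perturbed scenarios are hypothetical placeholders, not the notebook's calibration, and the function name `steady_state` is an assumption made for illustration.

```python
def steady_state(beta, delta, zeta, a_tfp=1.0):
    """Closed-form steady state (k_ss, c_ss) from the formula in the text."""
    required_mpk = 1.0 / beta - 1.0 + delta            # zeta * A * k^(zeta-1) must equal this
    k_ss = (required_mpk / (zeta * a_tfp)) ** (1.0 / (zeta - 1.0))
    c_ss = a_tfp * k_ss**zeta - delta * k_ss
    return k_ss, c_ss

# Hypothetical baseline and scenarios (assumed values, not the notebook's).
base = dict(beta=0.99, delta=0.025, zeta=0.36)
scenarios = {
    "baseline": base,
    "higher delta": {**base, "delta": 0.05},
    "lower beta": {**base, "beta": 0.95},
    "lower zeta": {**base, "zeta": 0.30},
}
results = {name: steady_state(**params) for name, params in scenarios.items()}
for name, (k_ss, c_ss) in results.items():
    print(f"{name:>13s}: k_ss = {k_ss:8.3f}, c_ss = {c_ss:6.3f}")
```

Under these assumed numbers, each perturbed scenario prints a smaller `k_ss` than the baseline, in line with the signs stated above; the notebook's own calibration should be used for any reported values.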

F.0.4 Chapter 4: Neural Architecture Search and Loss Normalization

F.0.4.1 Exercise 4.1: Random vs. grid.

With two hyperparameters and only one “important” axis, a $3\times 3$ grid uses 9 candidates but only 3 distinct values along the important axis. If the near-optimal interval has length fraction $p$ and its location relative to the grid is unknown, the grid hit probability is approximately $\min\{3p,1\}$ when $p$ is small. Random search at 9 evaluations samples 9 independent values along the important axis, so its hit probability is

$$
1 - (1-p)^9 .
$$

For $p=0.05$, the grid probability is approximately 0.15, while random search gives $1-0.95^9\approx 0.37$.

The general principle: with $n$ evaluations and only $r \ll d$ important axes, random search effectively gives $n$ independent draws on those $r$ axes (since the irrelevant axes don’t matter), while grid search spends most of its budget re-testing the same few values on the important axes and varying only the irrelevant ones. This is the projection argument of Bergstra & Bengio (2012): when the loss landscape is anisotropic, random search tends to dominate grid search.
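
The two hit probabilities can be reproduced numerically. The sketch below evaluates the formulas from the text and adds a small Monte Carlo check of the random-search case; the function names, the trial count, and the seed are illustrative assumptions.

```python
import random

def grid_hit_probability(p, values_per_axis=3):
    """Approximate hit probability of a grid with few distinct values on the important axis."""
    return min(values_per_axis * p, 1.0)

def random_hit_probability(p, n_evals=9):
    """Probability that at least one of n_evals independent uniform draws lands in the interval."""
    return 1.0 - (1.0 - p) ** n_evals

def simulate_random_search(p, n_evals=9, n_trials=100_000, seed=0):
    """Monte Carlo estimate of the random-search hit probability (assumed setup)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        start = rng.uniform(0.0, 1.0 - p)          # unknown location of the good interval
        if any(start <= rng.random() <= start + p for _ in range(n_evals)):
            hits += 1
    return hits / n_trials

p = 0.05
print("grid  :", grid_hit_probability(p))
print("random:", round(random_hit_probability(p), 4))
print("MC    :", simulate_random_search(p))
```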

F.0.4.2 Exercise 4.2: Bayesian optimization toy problem.

This is a coding exercise. A grid with step size 0.01 over $[-1,2]$ has 301 points, so the qualitative benchmark is that BO should require far fewer objective evaluations when the GP posterior and acquisition function identify the promising region early. The exact evaluation count is a notebook output and depends on the initial design, acquisition optimizer, random seed, and stopping rule.

F.0.4.3 Exercise 4.3: Hyperband budget allocation.

Hyperband with $R = 81$, $\eta = 3$ runs a ladder of brackets indexed by $s = s_{\max}, s_{\max}-1, \dots, 0$, where $s_{\max} = \lfloor\log_\eta R\rfloor = 4$. Each bracket starts with $n_s = \lceil (s_{\max}+1)\,\eta^s / (s+1)\rceil$ candidates trained for $r_s = R / \eta^s$ resource each, then runs Successive Halving with reduction factor $\eta$. Table F.2 works out the resulting schedule.

Table F.2: Hyperband schedule for maximum resource $R=81$ and reduction factor $\eta=3$. Each row reports the successive-halving rungs inside one bracket and the total resource consumed by that bracket.

| $s$ | SHA rungs $(n\times r)$ | Total |
| --- | --- | --- |
| 4 | $81{\times}1 \to 27{\times}3 \to 9{\times}9 \to 3{\times}27 \to 1{\times}81$ | 405 |
| 3 | $34{\times}3 \to 11{\times}9 \to 3{\times}27 \to 1{\times}81$ | 363 |
| 2 | $15{\times}9 \to 5{\times}27 \to 1{\times}81$ | 351 |
| 1 | $8{\times}27 \to 2{\times}81$ | 378 |
| 0 | $5{\times}81$ | 405 |

Within each bracket, Successive Halving reduces the number of candidates by a factor of $\eta$ at each rung and increases the resource per surviving candidate by a factor of $\eta$. Therefore the total cost must sum all rungs, not only the first rung. With the floor/ceil schedule above, total Hyperband budget is

$$
405 + 363 + 351 + 378 + 405 = 1902
$$

resource units, just below the loose worst-case bound $(s_{\max}+1)^2 R = 25 \cdot 81 = 2025$.

A naive “train all $n_0 = 27$ candidates to full $R = 81$” costs $27 \cdot 81 = 2187$ resource units. Hyperband is only moderately cheaper in total resource here, but it screens a much larger initial pool: across the five brackets, the total number of first-rung candidates is $81 + 34 + 15 + 8 + 5 = 143$, vs. 27 for the naive scheme. Its advantage is adaptive allocation: many more candidates are sampled, but only a small subset receives large training budgets.
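
The schedule and the budget arithmetic above can be regenerated mechanically. The following sketch implements the ceil rule for $n_s$ and the floor rule for survivors exactly as stated in this solution; the function name `hyperband_schedule` is an assumption, and other Hyperband implementations may round differently.

```python
import math

def hyperband_schedule(max_resource, eta):
    """Return {s: [(n, r), ...]} listing every successive-halving rung per bracket."""
    s_max = 0
    while eta ** (s_max + 1) <= max_resource:   # integer floor(log_eta R), avoids float log
        s_max += 1
    schedule = {}
    for s in range(s_max, -1, -1):
        n = math.ceil((s_max + 1) * eta**s / (s + 1))   # initial candidates n_s
        r = max_resource // eta**s                       # initial resource r_s
        rungs = []
        for _ in range(s + 1):
            rungs.append((n, r))
            n, r = n // eta, r * eta    # keep floor(n / eta) survivors, eta times the resource
        schedule[s] = rungs
    return schedule

schedule = hyperband_schedule(81, 3)
bracket_totals = {s: sum(n * r for n, r in rungs) for s, rungs in schedule.items()}
total_budget = sum(bracket_totals.values())
first_rung_candidates = sum(rungs[0][0] for rungs in schedule.values())

for s, rungs in schedule.items():
    print(s, rungs, bracket_totals[s])
print("total budget:", total_budget, "| first-rung candidates:", first_rung_candidates)
```

Changing `max_resource` or `eta` shows how quickly the bracket totals move away from the near-equal split seen for $R=81$, $\eta=3$.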

F.0.4.4 Exercise 4.4: Loss balancing.

Let $\ell_i^{(t)}$ have magnitude $L_i \in \{10^0, 10^{-2}, 10^{-4}\}$ and per-component gradient norm $\|\nabla\ell_i\|$. If we crudely model gradient norm as scaling linearly with loss magnitude (as when the components differ mainly by a multiplicative scale factor), then $\|\nabla \ell_i\| \propto L_i$. Equalising gradient contributions $\lambda_i \|\nabla\ell_i\|$ requires $\lambda_i \propto 1/L_i$, i.e. $(\lambda_1, \lambda_2, \lambda_3) \propto (1, 10^2, 10^4)$, which after normalisation becomes

$$
(\lambda_1, \lambda_2, \lambda_3) \;=\; \frac{1}{1 + 10^2 + 10^4}\,(1, 10^2, 10^4) \;\approx\; (10^{-4},\, 10^{-2},\, 1).
$$

The scheme breaks down when the gradients are correlated: $\langle \nabla\ell_i, \nabla\ell_j\rangle \neq 0$ means that scaling up $\lambda_3$ to “boost” $\ell_3$ also moves $\theta$ along the $\nabla\ell_1$ direction, changing $\ell_1$. The “equal contribution” targeted by the fixed weights is no longer a fixed point: each parameter update changes the local gradient geometry, and weights tuned at one iteration become wrong at the next. Adaptive schemes (ReLoBRaLo, GradNorm) re-tune the $\lambda_i$ at every step, recovering the equalisation in a way that fixed weights cannot.
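
The normalised weights can be computed directly. This short sketch only reproduces the arithmetic of the fixed-weight scheme under the linear-scaling assumption made above; the variable names are illustrative.

```python
magnitudes = [1.0, 1e-2, 1e-4]             # L_1, L_2, L_3 from the exercise

raw = [1.0 / L for L in magnitudes]        # lambda_i proportional to 1 / L_i
weights = [w / sum(raw) for w in raw]      # normalise so the weights sum to one
contributions = [w * L for w, L in zip(weights, magnitudes)]   # lambda_i * L_i

print("weights      :", [f"{w:.3e}" for w in weights])
print("contributions:", [f"{c:.3e}" for c in contributions])
```

All three printed contributions coincide (up to rounding), which is the equalisation the fixed weights are designed to achieve at a single iterate.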

F.0.4.5 Exercise 4.5: Pareto frontier geometry.

(i) Differentiate $\mathcal{L}(\theta;\lambda) = \lambda(\theta-a)^2 + (1-\lambda)(\theta-b)^2$ in $\theta$ and set to zero: $2\lambda(\theta - a) + 2(1-\lambda)(\theta - b) = 0$, hence

$$
\theta^\star(\lambda) \;=\; \lambda a + (1-\lambda) b.
$$

(ii) Substituting back:

$$
\ell_1^\star(\lambda) \;=\; (\theta^\star - a)^2 \;=\; (1-\lambda)^2 (b-a)^2, \qquad \ell_2^\star(\lambda) \;=\; (\theta^\star - b)^2 \;=\; \lambda^2 (b-a)^2.
$$

(iii) Take square roots, assuming without loss of generality $b > a$ and $\lambda \in [0,1]$ (otherwise replace $b-a$ by $|b-a|$): $\sqrt{\ell_1^\star} = (1-\lambda)(b-a)$ and $\sqrt{\ell_2^\star} = \lambda(b-a)$. Adding:

$$
\sqrt{\ell_1^\star} + \sqrt{\ell_2^\star} \;=\; b - a,
$$

independent of $\lambda$. This is a convex curve in $(\ell_1, \ell_2)$ space (a quarter of an astroid-like arc) running from $(0, (b-a)^2)$ at $\lambda = 1$ to $((b-a)^2, 0)$ at $\lambda = 0$.

(iv) At $\lambda = 1/2$, $\theta^\star = (a+b)/2$, the midpoint, and $\ell_1^\star = \ell_2^\star = (b-a)^2/4$. This sits on the axis of symmetry of the front.

(v) In the one-dimensional toy problem the trade-off is completely described by the curve above. In a neural network, however, $\theta$ is high-dimensional and the descent direction for fixed scalar weight $\lambda$ is

$$
-\bigl[\lambda \nabla \ell_1(\theta) + (1-\lambda)\nabla \ell_2(\theta)\bigr].
$$

If $\langle\nabla\ell_1,\nabla\ell_2\rangle>0$, reducing one component tends to reduce the other; if the inner product is negative, progress on one component can increase the other. Because this geometry changes along the training path, a fixed scalar weight can balance progress at one iterate and become badly unbalanced later. ReLoBRaLo responds to this by increasing the weight of losses whose relative loss progress has lagged behind. GradNorm is the related method that targets gradient magnitudes directly by trying to balance $\|w_k\nabla\ell_k\|$ across components.
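
Parts (i) to (iv) are easy to verify numerically. The sketch below traces the front for a pair of hypothetical targets with $b > a$ and checks that the sum of square roots stays constant; `pareto_point` and the chosen values of `a` and `b` are assumptions for illustration.

```python
import math

def pareto_point(lam, a, b):
    """Minimiser of lam*(theta-a)^2 + (1-lam)*(theta-b)^2 and the two component losses."""
    theta_star = lam * a + (1.0 - lam) * b
    return theta_star, (theta_star - a) ** 2, (theta_star - b) ** 2

a, b = 0.5, 2.0                                   # hypothetical targets with b > a
front = [pareto_point(i / 10, a, b) for i in range(11)]
sqrt_sums = [math.sqrt(l1) + math.sqrt(l2) for _, l1, l2 in front]

for (theta, l1, l2), s in zip(front, sqrt_sums):
    print(f"theta*={theta:.3f}  l1={l1:.4f}  l2={l2:.4f}  sqrt-sum={s:.4f}")
```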

F.0.4.6 Exercise 4.6: ReLoBRaLo vs. GradNorm.

This is a coding exercise. Qualitatively: GradNorm requires one extra backward pass per component per step (to compute $\|\nabla \ell_k\|$), so wall-clock per epoch is roughly $K\times$ slower than ReLoBRaLo for $K$ components. In return, GradNorm achieves tighter gradient balance, which matters when component magnitudes are not a faithful proxy for gradient magnitudes (e.g., when the loss landscape is anisotropic). For the standard PINN losses studied in this script ($K = 2$ or 3 components, typically scaled by physical units), ReLoBRaLo’s loss-magnitude proxy is usually good enough and the extra cost of GradNorm is not warranted; for losses with strongly heterogeneous Hessians (e.g., HJB residuals coupled with KFE residuals where the two operators have different stiffness), GradNorm’s direct gradient measurement can give a meaningful speedup.

F.0.4.7 Exercise 4.7: HPO vs. full NAS decision.

Sketch. The four cells are roughly:

The general rule: budget grows $\to$ smarter methods become affordable; search-space size grows $\to$ the marginal benefit of smarter methods grows; per-method overhead per evaluation needs to fit inside one GPU step or it eats into the budget itself.

F.0.5 Chapter 5: Overlapping Generations Models with DEQNs

F.0.5.1 Exercise 5.1: OLG market clearing for $A=3$.

With three cohorts (young $h=1$, middle $h=2$, old $h=3$), the budget constraints are

$$
\begin{aligned} c^1_t &= w_t \,\ell^1 - k^2_{t+1}, \\ c^2_t &= w_t \,\ell^2 + R_t k^2_t - k^3_{t+1}, \\ c^3_t &= w_t \,\ell^3 + R_t k^3_t, \end{aligned}
$$

where $w_t, R_t$ are equilibrium prices and $\ell^h$ are exogenous lifecycle labor endowments (the old cohort consumes its capital). Two Euler equations determine the savings of the young and middle cohorts:

$$
\begin{aligned} u'(c^1_t) &= \beta\,\mathbb{E}_t[u'(c^2_{t+1})\,R_{t+1}], \\ u'(c^2_t) &= \beta\,\mathbb{E}_t[u'(c^3_{t+1})\,R_{t+1}]. \end{aligned}
$$

The market-clearing condition closes the system:

$$
k^2_{t+1} + k^3_{t+1} \;=\; K_{t+1},
$$

where $K_{t+1}$ is aggregate capital.

Equation count: three budget constraints (used to eliminate consumption), two Euler equations, one capital-market-clearing identity. The budget constraints are bookkeeping, so the network outputs the two cohort savings $(k^2_{t+1}, k^3_{t+1})$ and is trained on the two Euler residuals; aggregate capital $K_{t+1} = k^2_{t+1} + k^3_{t+1}$ then determines next-period prices $(w_{t+1}, R_{t+1})$ algebraically through the firm FOCs. The unknown count (two savings) matches the Euler-residual count (two), and the market-clearing identity is built into the definition of $K_{t+1}$.

F.0.5.2 Exercise 5.2: KKT under FB.

For agent $h$ with borrowing constraint $k'^h \ge 0$ and Lagrange multiplier $\lambda^h \ge 0$, the KKT system is

$$
k'^h \ge 0, \qquad \lambda^h \ge 0, \qquad k'^h \cdot \lambda^h = 0.
$$

These three conditions are encoded by the single Fischer–Burmeister equation $\Phi(\lambda^h, k'^h) = 0$ with $\Phi(a,b) = a + b - \sqrt{a^2 + b^2}$. The Euler equation

$$
u'(c^h_t) - \beta\,\mathbb{E}_t[u'(c^{h+1}_{t+1})\,R_{t+1}] - \lambda^h_t \;=\; 0
$$

contributes a second residual. Squared and summed:

$$
\mathcal{L}^h(\theta) \;=\; \bigl[\text{Euler residual}\bigr]^2 + \bigl[\Phi(\lambda^h, k'^h)\bigr]^2.
$$

At any KKT point, both squared terms vanish exactly: the Euler residual is zero by definition, and $\Phi = 0$ encodes complementarity. This is what “vanishes exactly at the KKT point” means: there is no $\varepsilon$ smoothing or penalty parameter; the loss is zero on the equilibrium and strictly positive off it. Compare with a quadratic penalty $\bigl[\max(0, -k'^h)\bigr]^2 + \mu \bigl[\max(0, -\lambda^h)\bigr]^2 + (k'^h \lambda^h)^2$: this also vanishes at the KKT point, but the penalty weight $\mu$ has to be tuned, whereas FB has no parameter.
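
The claim that $\Phi$ vanishes exactly on the complementarity set, and only there, can be checked on a few points. This is a minimal scalar sketch in plain Python; the function names are assumptions, and a DEQN implementation would apply the same formula to batched tensors.

```python
import math

def fischer_burmeister(a, b):
    """Phi(a, b) = a + b - sqrt(a^2 + b^2); zero iff a >= 0, b >= 0 and a * b = 0."""
    return a + b - math.hypot(a, b)

def agent_loss(euler_residual, multiplier, k_next):
    """Squared Euler residual plus squared FB residual for one agent."""
    return euler_residual**2 + fischer_burmeister(multiplier, k_next) ** 2

# (multiplier, savings) pairs: the first two satisfy KKT, the last three do not.
cases = [(0.0, 3.0), (2.0, 0.0), (1.0, 1.0), (-1.0, 2.0), (0.0, -1.0)]
for lam, k_next in cases:
    print(f"lambda={lam:5.1f}  k'={k_next:5.1f}  Phi={fischer_burmeister(lam, k_next):8.4f}")
```

The slack case (zero multiplier, positive savings) and the binding case (positive multiplier, zero savings) both give $\Phi = 0$, while violations of sign or complementarity give a non-zero residual.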

F.0.5.3 Exercise 5.3: Hump-shaped lifecycle.

A hump-shaped labor-income profile $\ell^h$ peaks at middle age and declines toward retirement. The lifecycle savings policy $k'^h$ inherits this hump for two reasons. (i) Consumption smoothing: agents with high current income $w \ell^h$ relative to lifetime average save heavily to fund retirement years when $\ell^h$ drops. (ii) Time-varying borrowing constraint: young agents have low income, want to borrow against future earnings, are constrained by $k'^h \ge 0$, so they save little; middle-aged agents are unconstrained and save the most; old agents dis-save toward death.

The expected shape: $k'^h \approx 0$ for the very young (constrained), peaks around age 40–50 (middle of working life), declines toward retirement, drops to zero for the oldest cohort that does not save into the next period. In notebook lecture_08_10_OLG_Benchmark_DEQN_persistent.ipynb, plotting the trained network’s $k'^h$ against cohort age $h$ should reveal exactly this single-peak shape; the position of the peak depends on the calibration of $(\beta, \delta, A_\mathrm{tfp})$ and on the lifecycle labor profile.

F.0.5.4 Exercise 5.4: Hard aggregation layer.

Coding exercise. The current analytic notebook already clears the capital market by defining $K_{t+1}$ as the sum of predicted cohort savings. The alternative hard-layer variant is useful when the network has a separate aggregate-capital head. Implementation sketch: output a positive scalar $\widehat K_{t+1}$ and unnormalised cohort scores $z = (z^2, \dots, z^A)$; apply softmax along the cohort axis, $s^h = \mathrm{softmax}(z)^h$; rescale to capital:

$$
k_{t+1}^h \;=\; \widehat K_{t+1} \cdot s^h, \qquad \sum_{h=2}^{A} k_{t+1}^h \;=\; \widehat K_{t+1}\;\;\text{by construction}.
$$

The market-clearing residual $\sum_h k^h - \widehat K_{t+1}$ is identically zero up to floating-point precision. The comparison with the current notebook should therefore focus on Euler residuals and wall-clock time: the hard layer removes one possible inconsistency but also changes the parameterisation, so faster convergence is an empirical question rather than a mathematical guarantee. In multi-asset settings each exact clearing condition needs its own accounting layer, which is why the 56-agent benchmark enforces bond-market clearing as an explicit residual instead.
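
The implementation sketch above can be written out for a single state in plain Python. This is an illustration only: the softplus used to keep $\widehat K_{t+1}$ positive, the function names, and the example numbers are assumptions, and a notebook version would use the framework's own tensor operations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of cohort scores."""
    shift = max(scores)
    exps = [math.exp(z - shift) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_aggregation_layer(k_hat_raw, scores):
    """Map raw network outputs to (K_hat, cohort savings) that clear the market exactly."""
    k_hat = math.log1p(math.exp(k_hat_raw))   # softplus (assumed) keeps aggregate capital positive
    shares = softmax(scores)                  # shares are positive and sum to one
    return k_hat, [k_hat * s for s in shares]

# Hypothetical raw outputs for A = 6 cohorts (savings of cohorts h = 2..6).
k_hat, savings = hard_aggregation_layer(1.3, [0.2, 1.1, 1.7, 0.9, -0.5])
print("K_hat:", k_hat)
print("clearing residual:", sum(savings) - k_hat)
```

Because the shares are strictly positive, this particular layer also rules out zero or negative cohort savings, which is one way the parameterisation differs from the notebook's sum-based definition.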

F.0.5.5 Exercise 5.5: Bond pricing in equilibrium.

Cohort $h$’s Euler equation for capital is

$$
u'(c^h_t) \;=\; \beta\,\mathbb{E}_t\bigl[u'(c^{h+1}_{t+1})\,R_{t+1}\bigr],
$$

and for the riskless bond that costs $p_t$ today and pays one unit of consumption next period,

$$
u'(c^h_t)\,p_t \;=\; \beta\,\mathbb{E}_t\bigl[u'(c^{h+1}_{t+1})\bigr].
$$

The second equation gives directly

$$
p_t \;=\; \frac{\beta\,\mathbb{E}_t[u'(c^{h+1}_{t+1})]}{u'(c^h_t)}.
$$

Identifying the stochastic discount factor $M_{t,t+1} = \beta\,u'(c^{h+1}_{t+1})/u'(c^h_t)$, the same expression reads $p_t = \mathbb{E}_t[M_{t,t+1}]$.

For the risk-premium decomposition, divide the capital Euler by $u'(c^h_t)$ and use the covariance identity $\mathbb{E}_t[XY] = \mathbb{E}_t[X]\mathbb{E}_t[Y] + \mathrm{Cov}_t(X,Y)$:

$$
1 \;=\; \mathbb{E}_t[M_{t,t+1}\,R_{t+1}] \;=\; \mathbb{E}_t[M_{t,t+1}]\,\mathbb{E}_t[R_{t+1}] + \mathrm{Cov}_t(M_{t,t+1}, R_{t+1}).
$$

Substituting $p_t = \mathbb{E}_t[M_{t,t+1}]$ and dividing by $p_t$:

$$
\frac{1}{p_t} \;=\; \mathbb{E}_t[R_{t+1}] + \frac{\mathrm{Cov}_t(M_{t,t+1}, R_{t+1})}{p_t},
$$

which after rearrangement gives the textbook risk-premium decomposition $\mathbb{E}_t[R_{t+1}] - 1/p_t = -\mathrm{Cov}_t(M_{t,t+1}, R_{t+1})/p_t$: the gap between expected gross capital return and the riskless rate equals minus the SDF–return covariance, scaled by $1/p_t$.
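
The decomposition is an identity, so it can be confirmed on any discrete example. The sketch below uses a hypothetical two-state economy (all numbers are made up for illustration) in which the capital return is normalised so that $\mathbb{E}_t[M R] = 1$ holds.

```python
probs = [0.5, 0.5]        # hypothetical state probabilities
sdf = [1.10, 0.85]        # hypothetical SDF realisations M_{t,t+1}
payoff = [0.90, 1.20]     # hypothetical capital payoffs (low when the SDF is high)

def expect(values):
    """Expectation under the two-state distribution."""
    return sum(p * v for p, v in zip(probs, values))

price_capital = expect([m * x for m, x in zip(sdf, payoff)])
gross_return = [x / price_capital for x in payoff]      # normalised so that E[M R] = 1

p_bond = expect(sdf)                                    # p_t = E[M]
euler_check = expect([m * r for m, r in zip(sdf, gross_return)])
cov_mr = euler_check - p_bond * expect(gross_return)    # Cov(M, R)
risk_premium = expect(gross_return) - 1.0 / p_bond      # E[R] - 1/p

print("E[M R]        :", euler_check)
print("risk premium  :", risk_premium)
print("-Cov(M,R)/p   :", -cov_mr / p_bond)
```

With payoffs that are low exactly when the SDF is high, the covariance is negative and the premium is positive, which is the usual sign pattern for a risky asset.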

When the collateral constraint binds. If the collateral constraint is active, with non-negative KKT multiplier $\mu^h_t$ entering the bond FOC with coefficient $\kappa$ (the same $\kappa$ that controls the constraint $k'^h + \kappa b'^h \ge 0$), the bond Euler equation becomes

$$
u'(c^h_t)\,p_t \;=\; \beta\,\mathbb{E}_t\bigl[u'(c^{h+1}_{t+1})\bigr] + \kappa\,\mu^h_t,
$$

so the equilibrium bond price is

$$
p_t \;=\; \frac{\beta\,\mathbb{E}_t[u'(c^{h+1}_{t+1})] + \kappa\,\mu^h_t}{u'(c^h_t)}.
$$

The unconstrained SDF expression $p_t = \mathbb{E}_t[M_{t,t+1}]$ is recovered when $\mu^h_t = 0$. The multiplier wedge raises $p_t$, equivalently lowers the implicit safe rate that $1/p_t$ tracks, because for a constrained agent an extra unit of the bond also relaxes the collateral constraint, which makes the bond worth more than its discounted payoff alone. In the 56-agent benchmark, the cohorts most likely to bind are the youngest (lowest income, highest desire to borrow against future earnings), so any wedge $\kappa\,\mu^h_t$ from the binding-cohort population enters the cross-sectional pricing equation.

Why no bond residual in 6-agent OLG? In the analytic 6-agent calibration there is only one asset (capital). Bonds are absent, so no separate market-clearing residual is needed. The single-asset Euler equation pins down the implicit safe rate via $1/p = \mathbb{E}[R] + \mathrm{Cov}(M, R)/p$ (the covariance is typically negative, so the safe rate lies below the expected capital return), but no quantity needs to clear because no bond is traded.

F.0.5.6 Exercise 5.6 and Exercise 5.7.

Coding exercises. Both call for 4–5 retraining runs of the 56-agent benchmark and a binding-frequency / steady-state diagnostic. The binding indicator should be based on small slack and a positive multiplier, not on a large complementarity residual: at a well-trained KKT solution the product residual is close to zero both when a constraint binds and when it is slack. Expect borrowing and collateral constraints to bind mostly for young cohorts; as $\kappa$ rises, the lower bound $b'^h \ge -k'^h/\kappa$ becomes tighter, so negative bond positions should shrink and cross-cohort bond dispersion should typically fall. The equilibrium price response is a general-equilibrium object and should be read from the retrained models rather than imposed analytically.

F.0.6 Chapter 6: Heterogeneous Agents and Young’s Method

F.0.6.1 Exercise 6.1: Mean-preserving lottery.

Place mass $\omega$ at $k_n$ and $1-\omega$ at $k_{n+1}$. Mean preservation:

$$
\omega\,k_n + (1-\omega)\,k_{n+1} \;=\; k'.
$$

Solving for $\omega$:

$$
\omega \;=\; \frac{k_{n+1} - k'}{k_{n+1} - k_n}.
$$

This is well-defined for $k_n \le k' \le k_{n+1}$ since the denominator is positive and the numerator lies in $[0, k_{n+1} - k_n]$, ensuring $\omega \in [0,1]$.

Mass conservation: the two probabilities sum to one, $\omega + (1-\omega) = 1$, so total mass is preserved exactly under this redistribution. Equivalently, the weight $\omega$ is the unique linear interpolation weight that makes the discrete two-point distribution have mean $k'$, which is what defines Young’s redistribution operator on a fixed grid.

Higher-moment matching is impossible with a two-point split unless $k'$ coincides with a grid point: any non-degenerate two-point distribution with mean $k'$ and support $\{k_n, k_{n+1}\}$ has variance $\omega(1-\omega)(k_{n+1}-k_n)^2 > 0$, while the original (delta) distribution at $k'$ has zero variance. This residual variance is the price of the discretization, and it shrinks as the grid is refined.
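
The lottery weights, mean preservation, and the residual variance can all be checked on one bracket. This is a minimal scalar sketch; `young_lottery` and the example grid points are assumptions for illustration.

```python
def young_lottery(k_prime, k_lo, k_hi):
    """Return (omega, 1 - omega): the mass sent to k_lo and to k_hi."""
    if not (k_lo < k_hi and k_lo <= k_prime <= k_hi):
        raise ValueError("need k_lo <= k_prime <= k_hi with k_lo < k_hi")
    omega = (k_hi - k_prime) / (k_hi - k_lo)
    return omega, 1.0 - omega

k_lo, k_hi, k_prime = 1.0, 2.0, 1.3              # hypothetical bracket and policy value
omega, one_minus_omega = young_lottery(k_prime, k_lo, k_hi)

lottery_mean = omega * k_lo + one_minus_omega * k_hi
lottery_var = omega * (k_lo - lottery_mean) ** 2 + one_minus_omega * (k_hi - lottery_mean) ** 2

print("omega         :", omega)
print("lottery mean  :", lottery_mean)
print("lottery var   :", lottery_var, "vs formula", omega * one_minus_omega * (k_hi - k_lo) ** 2)
```

Halving the bracket width while keeping the relative position of `k_prime` fixed cuts the printed variance by a factor of four, which is the sense in which the discretization error shrinks under grid refinement.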

F.0.6.2 Exercise 6.2: Closed-form bracketing on log-spaced grids.

Coding exercise. Algorithm sketch: with grid $k_n = e^{x_0 + n\Delta x} - c$, the bracket index for a query $k'$ is

$$
n \;=\; \mathrm{floor}\!\left(\frac{\log(k' + c) - x_0}{\Delta x}\right).
$$

This is $\mathcal{O}(1)$ per query (one log + one floor), independent of grid size $N$. By contrast, numpy.searchsorted costs $\mathcal{O}(\log N)$ per query (binary search). The speed difference is hardware- and implementation-dependent, so the coding exercise should report the measured wall-clock ratio on the vectorized batch rather than treating a fixed multiplier as universal.
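
A scalar version of the algorithm sketch is shown below, with the standard-library `bisect` module standing in for `numpy.searchsorted` as the binary-search reference. The grid parameters, the clamping to a valid bracket, and the function names are assumptions; queries that sit exactly on a grid point can differ by one index because of floating-point rounding in the logarithm.

```python
import bisect
import math

X0, DX, C, N = 0.0, 0.1, 1.0, 50                        # hypothetical grid parameters
grid = [math.exp(X0 + n * DX) - C for n in range(N)]     # k_n = exp(x0 + n*dx) - c

def bracket_closed_form(k_query):
    """O(1) bracket index: one log and one floor, clamped to a valid bracket."""
    n = math.floor((math.log(k_query + C) - X0) / DX)
    return min(max(n, 0), N - 2)

def bracket_bisect(k_query):
    """O(log N) reference using binary search on the stored grid."""
    n = bisect.bisect_right(grid, k_query) - 1
    return min(max(n, 0), N - 2)

queries = [0.05, 0.5, 3.0, 20.0]
for k in queries:
    n = bracket_closed_form(k)
    print(f"k'={k:6.2f}  closed-form n={n:2d}  bisect n={bracket_bisect(k):2d}  "
          f"bracket=[{grid[n]:.3f}, {grid[n + 1]:.3f}]")
```

For the timing comparison asked for in the exercise, the same formula would be applied to a whole array of queries at once and timed against the vectorized binary search.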

F.0.6.3 Exercise 6.3: Approximate aggregation, scope.

The KS log-linear rule is built on the empirical observation that mean capital $K_t$ alone is a sufficient statistic for forecasting $K_{t+1}$ in the standard Aiyagari–Krusell–Smith calibration. This approximate aggregation breaks when the cross-sectional distribution carries information beyond its first moment that materially affects equilibrium prices.

Counterexample 1: multiple assets with switching liquidity. Add a second asset (say a corporate bond) with state-dependent liquidity: in good times agents trade both assets freely; in bad times the bond becomes illiquid. Now the share of wealth held in the bond, plus the bond–capital correlation in the cross-section, both determine prices, and neither can be summarized by mean capital alone.

Counterexample 2: heterogeneous discount factors. If agents differ in $\beta$ and the cross-sectional distribution of $\beta$ is dynamic (e.g., new entrants have different $\beta$), then mean capital understates the dispersion, and the equilibrium interest rate depends on which subpopulation holds the marginal unit of capital.

Why higher moments do not always rescue. Adding the variance to the forecasting rule helps with smooth perturbations but cannot capture multi-modal distributions, regime-switching, or non-monotone responses to skewness. The fundamental issue is that the master equation requires full cross-sectional information whenever prices are non-linear in the distribution; truncating to any finite set of moments is exact only in the linear-pricing case.

F.0.6.4 Exercise 6.4: Sequence-space vs. histogram DEQN.

Coding exercise. Empirically: with truncation horizon $T = 80$ and the chapter’s reference calibration, the sequence-space residual after training matches the histogram-based DEQN to within a factor of 1.2–1.5 on the same model. The sequence-space variant generalizes worse to a much longer test horizon ($T_\mathrm{test} \gg 80$) because the truncation error $\rho_z^T$ grows with the gap; the histogram variant does not have this issue because its state is stationary. The trade-off favors sequence-space when the cross-sectional distribution is intrinsically high-dimensional (e.g., multiple assets, multi-cohort wealth), at which point storing $T$ shock realizations is cheaper than discretizing the distribution.

F.0.6.5 Exercise 6.5: DeepSets permutation invariance.

(i) Let $\pi$ be a permutation of $\{1, \dots, N\}$. The aggregator’s $m$-th component is

$$
m_t^m(\pi \cdot s) \;=\; \sum_{i=1}^N g_\theta^m\bigl(s_t^{\pi(i)}\bigr) \;=\; \sum_{j=1}^N g_\theta^m(s_t^j),
$$

where the second equality is just a re-indexing of the sum (since addition is commutative). Therefore $\bm m_t(\pi \cdot s) = \bm m_t(s)$, exactly invariant.

(ii) The policy $\pi_\rho(s_t^i; \bm m_t, a_t)$ is a function of agent $i$’s own state $s_t^i$, the population summary $\bm m_t$, and the aggregate exogenous state $a_t$. Under a permutation $\pi$, agent $\pi(i)$’s individual state is now $s_t^{\pi(i)}$, while $\bm m_t$ and $a_t$ are unchanged (by the result in (i) for $\bm m_t$). Therefore the policy of agent $\pi(i)$ in the permuted economy equals the policy of agent $\pi(i)$ in the original economy, i.e. the policy moves with its own agent index but is otherwise unaffected: equivariance.

(iii) Zaheer et al. (2017) prove that any continuous permutation-invariant function $f: \mathbb{R}^{d \times N} \to \mathbb{R}$ on sets of fixed cardinality $N$ can be written as $f(s_1, \dots, s_N) = \rho\bigl(\sum_{i=1}^N g(s_i)\bigr)$ for some continuous functions $g, \rho$. This is the universal-approximation result for permutation-invariant DeepSets.

Implication for DeepHAM. Since the equilibrium price functional is permutation-invariant in agents (anonymous markets), DeepHAM’s parameterization can in principle approximate any continuous price/policy functional of the cross-sectional distribution to arbitrary accuracy, provided the inner network $g_\theta$ has enough capacity and the moment vector $\bm m_t$ is rich enough. The number of moments $M$ plays the role of the encoder’s bottleneck dimension: empirically, $M = 1$–$3$ suffices for Krusell–Smith-class economies, consistent with the chapter’s report that DeepHAM with one learned moment matches the histogram DEQN.

F.0.6.6 Exercise 6.6 and Exercise 6.7.

Coding exercises. Guidance for interpreting the outputs:

Exercise 6.6. The relevant statistic is the cross-replication sampling variance conditional on the same aggregate path, not the time-series variance of $K_t$ along that path. As $N$ rises, the Monte Carlo standard error should fall at the usual $N^{-1/2}$ rate for smooth aggregate statistics. Young’s path has zero sampling variance across replications because the histogram update integrates out the lottery exactly. The MC-vs-Young trade-off depends on the target functional: tail mass (e.g., the bottom-$10\%$ wealth share) decays much more slowly under MC, so the panel size needed to match Young-equivalent precision is an output of the repeated-panel experiment.

Exercise 6.7. In the standard KS calibration, the one-moment forecasting rule should already fit $\log K_{t+1}$ extremely well; adding $\log V_t$, with $V_t=\mathrm{Var}_{\mu_t}(k)$, is expected to give only a small incremental gain. This is exactly the empirical observation behind “approximate aggregation”. In calibrations with high cross-sectional dispersion (e.g., wide income range or frequent borrowing-constraint binding), the second-moment improvement can become economically visible and should be reported from the run.

F.0.7 Chapter 7: Physics-Informed Neural Networks

F.0.7.1 Exercise 7.1: Trial-function BC enforcement.

Define $\hat y(x) = \tfrac{2x}{\pi} + x\bigl(\tfrac{\pi}{2} - x\bigr)\,\mathcal{N}_\rho(x)$. Evaluate at the boundaries:

$$
\hat y(0) \;=\; 0 + 0 \cdot \tfrac{\pi}{2}\cdot \mathcal{N}_\rho(0) \;=\; 0, \qquad \hat y(\pi/2) \;=\; \tfrac{2}{\pi}\cdot\tfrac{\pi}{2} + \tfrac{\pi}{2}\cdot 0\cdot \mathcal{N}_\rho(\pi/2) \;=\; 1.
$$

Both boundary conditions hold for any network output $\mathcal{N}_\rho$, so the BCs are encoded in the architecture rather than enforced via the loss.

Why preferable to a soft penalty? A soft penalty $\lambda\,(\hat y(0) - 0)^2 + \lambda\,(\hat y(\pi/2) - 1)^2$ in the loss involves a hyperparameter $\lambda$ that must be tuned: too small, and the BC violation is large; too large, and the interior PDE residual is starved of optimization budget. The trial-function enforcement is parameter-free, makes the BC residual identically zero, and reduces the loss to the single PDE-interior term $\sup_x |y'' + y|^2$. This separates the two optimization concerns cleanly; any wall-clock gain should be measured in the notebook rather than assumed.
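
That the boundary values do not depend on the network can be seen by plugging in an arbitrary function. In the sketch below, `untrained_network` is a hypothetical stand-in for $\mathcal{N}_\rho$; any finite-valued callable would do.

```python
import math

def trial_solution(x, network):
    """y_hat(x) = 2x/pi + x (pi/2 - x) N(x): satisfies y(0) = 0 and y(pi/2) = 1 for any N."""
    return 2.0 * x / math.pi + x * (math.pi / 2.0 - x) * network(x)

def untrained_network(x):
    """Arbitrary placeholder for the network output (not a trained model)."""
    return 3.0 * math.tanh(5.0 * x - 1.0) + 7.0

left = trial_solution(0.0, untrained_network)
right = trial_solution(math.pi / 2.0, untrained_network)
interior = trial_solution(math.pi / 4.0, untrained_network)

print("y_hat(0)    =", left)
print("y_hat(pi/2) =", right)
print("y_hat(pi/4) =", interior, "(depends on the network)")
```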

F.0.7.2 Exercise 7.2: ReLU pathology.

A ReLU network is piecewise-linear: between consecutive kinks $x = -b_k/a_k$ it is affine in $x$, so $\partial^2 \hat y/\partial x^2 = 0$ a.e. At a kink, the second distributional derivative is a Dirac delta supported on a measure-zero set. The strong-form Black–Scholes residual

$$
V_t + \tfrac{1}{2}\sigma^2 S^2 V_{SS} + rS V_S - rV \;=\; 0
$$

must hold pointwise for almost every $S$. With ReLU, $V_{SS} = 0$ a.e., so the residual reduces to $V_t + rS V_S - rV$, which has no nontrivial solution that respects the option-pricing boundary condition. The PINN cannot decrease its loss below the order of magnitude of the missing $V_{SS}$ term.

Weak-form fix. Multiply both sides by a smooth test function $\varphi(S)$ with compact support on $[S_\mathrm{min}, S_\mathrm{max}]$ and integrate by parts on the $V_{SS}$ term:

$$
\int_{S_\mathrm{min}}^{S_\mathrm{max}} V_{SS}\,\varphi\, dS \;=\; -\int_{S_\mathrm{min}}^{S_\mathrm{max}} V_S\,\varphi'\, dS \;+\;[V_S\,\varphi]_{S_\mathrm{min}}^{S_\mathrm{max}}.
$$

The boundary terms vanish for compactly supported $\varphi$, and the residual now involves only first-order derivatives of $V$. ReLU networks have well-defined first derivatives a.e., so the weak-form PINN can minimize this residual. This is the Galerkin-type (weak-form) idea; compare the Deep Galerkin method of Sirignano & Spiliopoulos (2018).

F.0.7.3 Exercise 7.3: Discrete $\to$ continuous bridge.

Start with

$$
V(a) = \max_c \bigl[u(c)\,\Delta t + \beta_{\Delta t}\,\mathbb{E}V(a')\bigr], \qquad a' = a - c\,\Delta t, \qquad \beta_{\Delta t}=e^{-\rho\Delta t}.
$$

Then $\beta_{\Delta t}=1-\rho\,\Delta t+O(\Delta t^2)$. The same first-order limit follows from the implicit-Euler convention $\beta_{\Delta t}=1/(1+\rho\Delta t)$.

Taylor-expand $V(a') = V(a - c\,\Delta t)$ around $a$:

$$
V(a') \;=\; V(a) - V'(a)\,c\,\Delta t + \tfrac{1}{2}V''(a)\,(c\,\Delta t)^2 + O(\Delta t^3).
$$

Substitute and use $\beta_{\Delta t} = 1 - \rho\,\Delta t + O(\Delta t^2)$:

$$
V(a) \;=\; \max_c \Bigl\{u(c)\,\Delta t + \bigl(1 - \rho\Delta t\bigr)\bigl[V(a) - V'(a)\,c\,\Delta t + O(\Delta t^2)\bigr]\Bigr\}.
$$

Subtract $V(a)$ from both sides and divide by $\Delta t$:

$$
0 \;=\; \max_c\Bigl\{u(c) - \rho V(a) - V'(a)\,c + O(\Delta t)\Bigr\}.
$$

Take $\Delta t \to 0$ and rearrange:

$$
\rho V(a) \;=\; \max_c \bigl[u(c) - V'(a)\,c\bigr].
$$

This is the HJB equation for the consumption-savings problem with no asset return (or with $r$ embedded into $a' = a + ra\Delta t - c\Delta t$, in which case the HJB picks up an additional $V'(a)\,r a$ drift term). The discrete-to-continuous bridge thus shows that the HJB is the formal limit of the discrete Bellman as the time step shrinks, which justifies treating PINNs as the continuous-time analogue of value-function iteration.
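
The limit can be illustrated on a special case that is not part of the exercise: log utility $u(c) = \log c$ with no asset return, for which $V(a) = (\log(\rho a) - 1)/\rho$ and $c^\star = \rho a$ solve the HJB above. The sketch below, with a hypothetical discount rate, checks that the HJB residual is zero and that the discrete Bellman gap per unit of time shrinks as $\Delta t \to 0$.

```python
import math

RHO = 0.05   # hypothetical discount rate

def value(a):
    """Candidate value function for log utility (assumed special case)."""
    return (math.log(RHO * a) - 1.0) / RHO

def value_prime(a):
    return 1.0 / (RHO * a)

def hjb_rhs(a, c):
    """u(c) - V'(a) c, the bracket maximised in the HJB."""
    return math.log(c) - value_prime(a) * c

def discrete_gap(a, dt):
    """[u(c*) dt + exp(-rho dt) V(a - c* dt) - V(a)] / dt, evaluated at c* = rho a."""
    c_star = RHO * a
    return (math.log(c_star) * dt + math.exp(-RHO * dt) * value(a - c_star * dt) - value(a)) / dt

a = 2.0
c_star = RHO * a
print("HJB residual:", RHO * value(a) - hjb_rhs(a, c_star))
for dt in (1e-1, 1e-2, 1e-3):
    print(f"dt={dt:7.3f}  discrete gap per unit time = {discrete_gap(a, dt): .3e}")
```

The printed gaps shrink roughly in proportion to $\Delta t$, consistent with the $O(\Delta t)$ remainder in the derivation.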

F.0.7.4Exercise 7.4, Exercise 7.5, Exercise 7.6 (statements: p. , p. , p. ).

Coding exercises. The exact numbers below depend on random seeds, batch sizes, and stopping rules; use them as qualitative checks rather than fixed targets:

Exercise 7.4. At small $\lambda$, the BC residual is underweighted and endpoint violations remain visibly large. At very large $\lambda$, the endpoints fit well but the interior residual can stagnate because the optimizer spends most of its gradient budget on the boundary term. The elbow is the smallest $\lambda$ for which further increases mostly improve the boundary metric without improving the interior fit. The hard-BC variant should be the benchmark: it sets the boundary violation to numerical zero by construction and removes the penalty-weight tuning problem.

Exercise 7.5. Sobol and Latin Hypercube points usually reduce visible sampling gaps relative to uniform random points. On the smooth manufactured Poisson problem the gain may be modest, because the solution has no boundary layer or interior singularity. Adaptive sampling becomes more valuable when residuals are spatially localized; report the actual collocation count, wall time, and test-grid residual rather than relying on a universal percentage saving.

Exercise 7.6. The expected qualitative result is that the strong-form $\tanh$ PINN is the natural baseline for Black--Scholes, because $V_{SS}$ is well-defined by automatic differentiation. A strong-form ReLU network is ill-suited because $V_{SS}=0$ almost everywhere and undefined at kinks. In a weak formulation, integration by parts moves the second derivative off the network and onto the test function, so a ReLU network can be made mathematically admissible. Any empirical comparison should report held-out pricing errors and residual diagnostics from the implemented notebook rather than quoting architecture-independent iteration counts.

F.0.7.5Exercise 7.7: Operator learning vs. PINN.

Eleven independent PINN runs each cost $C_\mathrm{PINN}$ wall-clock seconds, for a total of $11\,C_\mathrm{PINN}$. A single operator-learning or parametric-PINN run trained on 11 values of $K$ (or a continuous range, sampled in mini-batches) costs $C_\mathrm{op}$. Amortized training wins when $11\,C_\mathrm{PINN} > C_\mathrm{op}$, i.e., $C_\mathrm{op}/C_\mathrm{PINN} < 11$. The crossover scales linearly in the number of distinct $K$ values: with $N$ strikes, operator learning wins whenever $C_\mathrm{op}/C_\mathrm{PINN}<N$. This is the cost-amortization argument that motivates DeepONet-style operator learning (Lu et al., 2021).

F.0.8Chapter 8: Heterogeneous Agent Models in Continuous Time

F.0.8.1Exercise 8.1: Itô on GBM.

Geometric Brownian motion satisfies $dX_t = \mu X_t\,dt + \sigma X_t\,dB_t$. Apply Itô’s lemma to $f(x) = \ln x$ with $f'(x) = 1/x$, $f''(x) = -1/x^2$:

$$d(\ln X_t) \;=\; f'(X_t)\,dX_t + \tfrac{1}{2} f''(X_t)\,(dX_t)^2 \;=\; \frac{1}{X_t}\bigl(\mu X_t\,dt + \sigma X_t\,dB_t\bigr) + \tfrac{1}{2}\!\left(-\frac{1}{X_t^2}\right)\sigma^2 X_t^2\,dt.$$

Simplifying:

$$d(\ln X_t) \;=\; \bigl(\mu - \tfrac{1}{2}\sigma^2\bigr)\,dt + \sigma\,dB_t.$$

Integrating from 0 to $t$:

$$\ln X_t - \ln X_0 \;=\; (\mu - \tfrac{1}{2}\sigma^2)\,t + \sigma B_t, \qquad X_t \;=\; X_0\,\exp\!\bigl[(\mu - \tfrac{1}{2}\sigma^2)\,t + \sigma B_t\bigr].$$

Volatility drag. Taking expectations: $\mathbb{E}[X_t] = X_0\,e^{\mu t}$, but $\mathbb{E}[\ln X_t] = \ln X_0 + (\mu - \tfrac{1}{2}\sigma^2)\,t$. The expected log return $\mu - \sigma^2/2$ is strictly less than the log of the expected return, $\mu$, by the variance correction term. This is the volatility drag. Two illustrative regimes: with zero arithmetic drift ($\mu = 0$), expected log growth is $-\sigma^2/2$, strictly negative; with drift exactly equal to the Itô correction ($\mu = \sigma^2/2$), expected log growth is zero (the drift just offsets the drag). In financial terms, volatility eats into geometric returns; this is why an asset with $20\%$ expected return and $40\%$ volatility delivers a long-run geometric mean of only $\mu - \sigma^2/2 = 12\%$.
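The drag can also be seen by simulation. The following is a minimal sketch, not part of the companion notebooks: it discretizes the SDE with Euler-Maruyama (so it does not use the closed-form solution) and compares the sample mean of $X_1$ and of $\ln X_1$ with $e^{\mu}$ and $\mu-\sigma^2/2$. The step count, path count, and seed are arbitrary choices.

```python
import math
import random


def simulate_gbm_euler(mu, sigma, horizon=1.0, n_steps=50, n_paths=4000, seed=0):
    """Terminal values of Euler-Maruyama paths of dX = mu X dt + sigma X dB, X_0 = 1."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    sqrt_dt = math.sqrt(dt)
    terminal = []
    for _ in range(n_paths):
        x = 1.0
        for _ in range(n_steps):
            x += mu * x * dt + sigma * x * sqrt_dt * rng.gauss(0.0, 1.0)
        terminal.append(x)
    return terminal


mu, sigma = 0.20, 0.40                      # the 20% / 40% example from the text
x_T = simulate_gbm_euler(mu, sigma)
mean_level = sum(x_T) / len(x_T)            # estimates E[X_1] = exp(mu)
mean_log = sum(math.log(x) for x in x_T) / len(x_T)   # estimates mu - sigma^2 / 2
drag_adjusted = mu - 0.5 * sigma**2

print(f"sample E[X_1]    = {mean_level:.3f}  (theory {math.exp(mu):.3f})")
print(f"sample E[ln X_1] = {mean_log:.3f}  (theory {drag_adjusted:.3f})")
```

The two sample means differ from their targets only by Monte Carlo and discretization error, so exact digits vary with the seed.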

F.0.8.2Exercise 8.2: KFE for an OU process.

The OU process $dX_t = \eta(\bar X - X_t)\,dt + \sigma\,dB_t$ has drift $\mu(x) = \eta(\bar X - x)$ and diffusion coefficient $\sigma$. The KFE in conservation form is

$$\partial_t g(x,t) \;=\; -\partial_x\bigl[\mu(x)\,g(x,t)\bigr] + \tfrac{\sigma^2}{2}\,\partial_{xx} g(x,t) \;=\; \partial_x\bigl[\eta(x - \bar X)\,g\bigr] + \tfrac{\sigma^2}{2}\,\partial_{xx} g.$$

Setting $\partial_t g = 0$ for the stationary density $g^\star(x)$:

$$\eta\,\partial_x\bigl[(x - \bar X)\,g^\star\bigr] + \tfrac{\sigma^2}{2}\,g^{\star\prime\prime} \;=\; 0.$$

Integrate once in $x$, with the constant of integration set to zero by the no-flux boundary condition at $\pm\infty$:

$$\eta\,(x - \bar X)\,g^\star + \tfrac{\sigma^2}{2}\,g^{\star\prime} \;=\; 0 \quad\Longleftrightarrow\quad \frac{g^{\star\prime}(x)}{g^\star(x)} \;=\; -\frac{2\eta}{\sigma^2}\,(x - \bar X).$$

This is a linear ODE for $\ln g^\star$; integrating gives $\ln g^\star = -\eta(x-\bar X)^2/\sigma^2 + \mathrm{const}$, i.e.

$$g^\star(x) \;\propto\; \exp\!\bigl[-\eta(x - \bar X)^2/\sigma^2\bigr].$$

This is a Gaussian density with mean $\bar X$ and variance $\sigma^2/(2\eta)$, normalised by $\int g^\star\,dx = 1$:

$$g^\star(x) \;=\; \sqrt{\frac{\eta}{\pi\sigma^2}}\,\exp\!\left[-\frac{\eta\,(x - \bar X)^2}{\sigma^2}\right] \;=\; \mathcal{N}\!\bigl(\bar X,\, \sigma^2/(2\eta)\bigr).$$

The OU’s stationary distribution is Gaussian with mean equal to the mean-reversion target and variance equal to the diffusion-to-mean-reversion ratio $\sigma^2 / (2\eta)$: faster mean reversion (larger $\eta$) shrinks the dispersion; larger diffusion grows it.
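A quick numerical sanity check of this solution, as a sketch with illustrative parameter values (not taken from the book): the once-integrated stationary KFE $\eta(x-\bar X)g^\star + \tfrac{\sigma^2}{2}g^{\star\prime}$ should vanish on a grid, and trapezoid quadrature should return unit mass and variance $\sigma^2/(2\eta)$.

```python
import math

eta, x_bar, sigma = 0.5, 1.0, 0.3   # illustrative values only


def g_star(x):
    """Candidate stationary density N(x_bar, sigma^2 / (2 eta))."""
    return math.sqrt(eta / (math.pi * sigma**2)) * math.exp(-eta * (x - x_bar) ** 2 / sigma**2)


def integrated_kfe_residual(x, h=1e-5):
    """eta (x - x_bar) g + (sigma^2 / 2) g', with g' from a central difference."""
    dg = (g_star(x + h) - g_star(x - h)) / (2.0 * h)
    return eta * (x - x_bar) * g_star(x) + 0.5 * sigma**2 * dg


# Trapezoid rule on [x_bar - 8 sd, x_bar + 8 sd]
sd = sigma / math.sqrt(2.0 * eta)
n = 4000
lo, hi = x_bar - 8.0 * sd, x_bar + 8.0 * sd
step = (hi - lo) / n
xs = [lo + i * step for i in range(n + 1)]
weights = [0.5 if i in (0, n) else 1.0 for i in range(n + 1)]

mass = step * sum(w * g_star(x) for w, x in zip(weights, xs))
variance = step * sum(w * (x - x_bar) ** 2 * g_star(x) for w, x in zip(weights, xs))
max_residual = max(abs(integrated_kfe_residual(x)) for x in xs[::100])

print(f"mass = {mass:.8f}, variance = {variance:.8f}, max residual = {max_residual:.2e}")
```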

F.0.8.3Exercise 8.3: Functional derivative.

For the master-equation toy specification $V(a, g) = \int u(c(a, y))\,g(y)\,dy$ where $c(a,y)$ is fixed (does not depend on $g$), the functional derivative $\delta V / \delta g$ measures the linear response of $V$ to a perturbation in $g$.

In the ambient vector space of signed measures, a point perturbation gives $\delta V/\delta g(y_0) = \lim_{\varepsilon\to 0} [V(a, g + \varepsilon\delta_{y_0}) - V(a,g)]/\varepsilon$, where $\delta_{y_0}$ is a Dirac mass at $y_0$. Substituting,

$$V(a, g + \varepsilon\delta_{y_0}) \;=\; \int u(c(a,y))\,(g(y) + \varepsilon\delta_{y_0}(y))\,dy \;=\; V(a,g) + \varepsilon\,u(c(a, y_0)).$$

Therefore $\delta V/\delta g(y_0) = u(c(a, y_0))$. If we restrict $g$ to the probability simplex, perturbations must preserve total mass; for example, $\eta=\delta_{y_0}-\delta_{y_1}$ gives directional derivative $u(c(a,y_0))-u(c(a,y_1))$. Equivalently, on the simplex the derivative kernel is identified only up to an additive constant.

Interpretation. The functional derivative at a point $y_0$ is the value contribution of an infinitesimal mass placed at $y_0$. In the toy spec where $V$ is just a population average of utilities, the contribution is the per-agent utility $u(c(a, y_0))$ at that point. In the real master equation, $c(a, y)$ would itself depend on $g$ (because prices depend on $g$), and the functional derivative would pick up additional indirect terms via $\partial c/\partial g$; this is what makes the master equation a genuinely infinite-dimensional PDE rather than a parametric family of finite PDEs.

F.0.8.4Exercise 8.4: HJB residual.

Coding exercise. The right answer is the out-of-sample residual table produced by your run of lecture_13_08_Aiyagari_Continuous_Time_FD_and_PINN_PyTorch.ipynb. Report the training budget, collocation batch, random seed, and test grid. The residual should decrease when the collocation budget and network capacity are increased, but the scaling is empirical rather than a universal $N^{-p}$ law because it mixes approximation, optimization, and sampling error.

F.0.8.5Exercise 8.5: Closed Aiyagari system.

Combining the four ingredients:

HJB (from (8.11)):

$$\rho V(a,n) \;=\; \max_c \bigl\{u(c) + V'(a,n)(wn + ra - c) + \lambda(n)(V(a,\hat n) - V(a,n))\bigr\}.$$

KFE for the stationary distribution (from (8.6), with $\partial_t g = 0$):

$$0 \;=\; -\partial_a[s^\star(a,n)\,g(a,n)] - \lambda(n)\,g(a,n) + \lambda(\hat n)\,g(a,\hat n),$$

where $s^\star(a,n) = wn + ra - c^\star(a,n)$ is the optimal savings function.

Firm FOCs (Cobb--Douglas):

$$r \;=\; \alpha A K^{\alpha-1} L^{1-\alpha} - \delta, \qquad w \;=\; (1-\alpha) A K^\alpha L^{-\alpha}.$$

Market clearing:

$$K \;=\; \sum_n \int_{\underline a}^\infty a\,g(a,n)\,da, \qquad L \;=\; \sum_n n \int_{\underline a}^{\infty} g(a,n)\,da.$$

The equilibrium objects are $(V(a,n),g(a,n),K,L,r,w)$, with $L$ often pinned down by the stationary income shares and $w$ implied by the firm FOCs once $(K,L)$ is known. In practice one fixes a candidate $r$, computes the associated firm demand and wage, solves the HJB for $V$ and hence the policy $c^\star, s^\star$, plugs into the KFE for $g$, computes implied $K = \sum_n\int a\,g(a,n)\,da$, and compares this capital supply with the capital demand implied by the candidate $r$. A fixed point in $r$ is the equilibrium. This is the bisection-on-$r$ algorithm of Achdou et al. (2022), and the PINN replaces the inner HJB and KFE solves with neural-network approximation while keeping the outer fixed-point loop on $r$ unless the full equilibrium system is learned jointly.

Why both must be solved consistently at each candidate $r$. The HJB takes $r$ as an input (price-taking agents); the KFE takes the policy from the HJB. Mis-specifying $r$ during training would feed a mis-specified policy into the KFE and yield an inconsistent $K$. In the production-scale solver, the fixed-point loop alternates HJB and KFE to convergence at each $r$ before updating $r$; the PINN can be trained on all four equations jointly when the architecture is rich enough to learn the equilibrium price as an output.
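The outer loop can be sketched in a few lines. In the sketch below, `capital_demand` and `wage` come directly from the Cobb--Douglas FOCs above, whereas `toy_capital_supply` is a purely hypothetical stand-in for the inner HJB and KFE solves: it is only assumed to be increasing in $r$ and to blow up as $r \to \rho$, which is the qualitative shape of household capital supply. All parameter values are illustrative and not taken from the notebook.

```python
def capital_demand(r, alpha=0.36, tfp=1.0, delta=0.05, labor=1.0):
    """Invert the firm FOC r = alpha A K^(alpha-1) L^(1-alpha) - delta for K."""
    return labor * (alpha * tfp / (r + delta)) ** (1.0 / (1.0 - alpha))


def wage(capital, alpha=0.36, tfp=1.0, labor=1.0):
    """Firm FOC w = (1 - alpha) A K^alpha L^(-alpha)."""
    return (1.0 - alpha) * tfp * capital**alpha * labor ** (-alpha)


def toy_capital_supply(r, rho=0.05, scale=0.2):
    """HYPOTHETICAL placeholder for 'solve HJB, solve KFE, integrate a * g'.

    Increasing in r and diverging as r -> rho; not derived from any model.
    """
    return scale / (rho - r)


def bisect_interest_rate(r_lo, r_hi, tol=1e-10, max_iter=200):
    """Bisection on r: shrink [r_lo, r_hi] until capital supply equals demand."""
    for _ in range(max_iter):
        r_mid = 0.5 * (r_lo + r_hi)
        excess_supply = toy_capital_supply(r_mid) - capital_demand(r_mid)
        if excess_supply > 0.0:
            r_hi = r_mid      # households save too much: lower r
        else:
            r_lo = r_mid      # households save too little: raise r
        if r_hi - r_lo < tol:
            break
    return 0.5 * (r_lo + r_hi)


r_star = bisect_interest_rate(-0.04, 0.0499)   # bracket inside (-delta, rho)
k_star = capital_demand(r_star)
print(f"r* = {r_star:.5f}, K* = {k_star:.4f}, w* = {wage(k_star):.4f}")
```

In an actual solver the call to `toy_capital_supply` would be replaced by the finite-difference or PINN solve of the HJB and KFE at the candidate $r$; the bisection logic around it is unchanged.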

F.0.8.6Exercise 8.6 and Exercise 8.7.

Coding exercises. Expected diagnostics:

Exercise 8.6. With a fixed policy, the finite-dimensional KFE is a linear forward equation. If the discretized generator is ergodic, the distance to the stationary distribution should decay approximately exponentially after transient modes die out. Estimate the slope from your run rather than quoting a universal number; the fitted rate is controlled by the slowest nonzero eigenvalue of the KFE generator and depends on income-switching intensities, savings drift, and grid truncation.

Exercise 8.7. On the one-asset stationary benchmark, finite differences should usually win on absolute wall-clock time and give the cleanest low-dimensional benchmark residuals. A PINN may become more attractive when the same architecture is reused across many nearby parameter values, when warm starts work well, or when the state space is extended beyond what a grid handles comfortably. Report the actual cold-start and warm-start timings from your machine, and treat memory use as hardware- and backend-dependent.

F.0.9Chapter 9: Gaussian Processes

F.0.9.1Exercise 9.1: Posterior on three points.

The RBF kernel with length scale $\ell = 1$ and signal variance $\sigma_f^2 = 1$ is $k(x, x') = \exp(-(x-x')^2/2)$. With training points $X = (0, 1, 2)$ and targets $y = (0, 0.8, 0.3)$, the kernel matrix is

$$K = \begin{pmatrix} 1 & e^{-1/2} & e^{-2} \\ e^{-1/2} & 1 & e^{-1/2} \\ e^{-2} & e^{-1/2} & 1\end{pmatrix} \approx \begin{pmatrix} 1 & 0.6065 & 0.1353 \\ 0.6065 & 1 & 0.6065 \\ 0.1353 & 0.6065 & 1\end{pmatrix}.$$

Adding observation noise $\sigma_y^2 I = 0.01\,I$ to the diagonal of $K$ and solving $(K + \sigma_y^2 I)\bm v = \bm y$ with $\bm y = (0, 0.8, 0.3)^\top$ by Gaussian elimination gives

$$(K + \sigma_y^2 I)^{-1} \bm y \;\approx\; (-0.964,\; 1.744,\; -0.621)^\top.$$

At $x^\star = 1.5$,

$$k_\star = \bigl(k(1.5,0),\, k(1.5,1),\, k(1.5,2)\bigr) = \bigl(e^{-1.125},\, e^{-0.125},\, e^{-0.125}\bigr) \approx (0.3247, 0.8825, 0.8825).$$

Posterior mean:

$$\bar{\mu}(x^\star) = k_\star^\top (K + \sigma_y^2 I)^{-1} \bm y \approx 0.3247\cdot(-0.964) + 0.8825\cdot 1.744 + 0.8825\cdot(-0.621) \approx 0.678.$$

For the variance, solve $(K + \sigma_y^2 I)\bm w = k_\star$, giving $\bm w \approx (-0.142,\; 0.662,\; 0.495)^\top$, hence

$$\bar{\sigma}^2(x^\star) = k(x^\star, x^\star) - k_\star^\top \bm w \;\approx\; 1 - 0.975 \;\approx\; 0.025.$$

The posterior is $\mathcal{N}(0.678, 0.025)$, with standard deviation $\approx 0.158$. Notice the posterior mean “smooths” the two flanking observations $y = 0.8$ and $y = 0.3$ rather than being pulled to either; the 0.158 standard deviation reflects partial information at a halfway point between two data points.
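The hand calculation can be reproduced with a few lines of dependency-free Python. The sketch below implements the Gaussian elimination mentioned above directly (the helper names `rbf` and `solve` are illustrative); in practice one would use a Cholesky solve from a numerical library instead.

```python
import math


def rbf(x, xp, length_scale=1.0, signal_var=1.0):
    """RBF kernel k(x, x') = sigma_f^2 exp(-(x - x')^2 / (2 l^2))."""
    return signal_var * math.exp(-((x - xp) ** 2) / (2.0 * length_scale**2))


def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    v = [0.0] * n
    for i in reversed(range(n)):                      # back substitution
        v[i] = (M[i][n] - sum(M[i][k] * v[k] for k in range(i + 1, n))) / M[i][i]
    return v


X = [0.0, 1.0, 2.0]
y = [0.0, 0.8, 0.3]
noise_var = 0.01
x_star = 1.5

K_noisy = [[rbf(xi, xj) + (noise_var if i == j else 0.0) for j, xj in enumerate(X)]
           for i, xi in enumerate(X)]
k_star = [rbf(x_star, xi) for xi in X]

alpha = solve(K_noisy, y)          # (K + sigma_y^2 I)^{-1} y
w = solve(K_noisy, k_star)         # (K + sigma_y^2 I)^{-1} k_star
post_mean = sum(k * a for k, a in zip(k_star, alpha))
post_var = rbf(x_star, x_star) - sum(k * wi for k, wi in zip(k_star, w))

print(f"alpha = {[round(a, 3) for a in alpha]}")
print(f"posterior mean = {post_mean:.3f}, variance = {post_var:.3f}, sd = {math.sqrt(post_var):.3f}")
```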

F.0.9.2Exercise 9.2: Marginal likelihood Occam.

The log marginal likelihood is

$$\log p(y \mid X, \ell) = -\tfrac{1}{2} y^\top K_\ell^{-1} y \;-\; \tfrac{1}{2} \log|K_\ell| \;-\; \tfrac{n}{2}\log(2\pi),$$

where $K_\ell = K(\ell) + \sigma_y^2 I$ is the noise-augmented kernel matrix at length scale $\ell$. The first term penalizes misfit (data that don’t lie in the GP’s predicted manifold get a high $y^\top K_\ell^{-1} y$); the second term penalizes model complexity ($\log|K_\ell|$ is large when the kernel can fit anything, small when it is rigid). This is the Bayesian Occam’s razor: small $\ell$ gives a flexible model that fits any training data perfectly (small misfit) but accepts a large complexity penalty; large $\ell$ enforces smoothness (large misfit if the data are not smooth) but enjoys a small complexity penalty. The optimum balances the two.

For three data points $y = (0, 0.8, 0.3)$ at $x = (0, 1, 2)$, plotting $\log p(y \mid X, \ell)$ against $\ell \in [0.1, 5]$ typically shows an interior maximum near the data’s natural variation scale. At $\ell = 0.1$, the kernel is very local: each point is nearly uncorrelated with its neighbors, so the data fit is easy but the flexible prior receives a complexity penalty. At $\ell = 5$, the kernel forces all three function values to be nearly equal, which fits the data poorly. The peak is the point where these two forces balance.
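A minimal sketch of the scan over $\ell$, assuming the same $\sigma_f^2 = 1$ and $\sigma_y^2 = 0.01$ as in Exercise 9.1 (the exercise may fix different values). It uses a hand-written Cholesky factorization so that the quadratic form and the log-determinant come from the same factor; the function names are illustrative.

```python
import math


def cholesky(A):
    """Lower-triangular L with L L^T = A for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def log_marginal_likelihood(length_scale, X, y, signal_var=1.0, noise_var=0.01):
    n = len(X)
    K = [[signal_var * math.exp(-((X[i] - X[j]) ** 2) / (2.0 * length_scale**2))
          + (noise_var if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(K)
    # Forward substitution L z = y, so that y^T K^{-1} y = z^T z
    z = []
    for i in range(n):
        z.append((y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    misfit = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * misfit - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)


X = [0.0, 1.0, 2.0]
y = [0.0, 0.8, 0.3]
grid = [round(0.1 + 0.05 * i, 2) for i in range(99)]      # 0.10, 0.15, ..., 5.00
best_length_scale = max(grid, key=lambda ell: log_marginal_likelihood(ell, X, y))

for ell in (0.1, 1.0, 5.0):
    print(f"ell = {ell:>4}: log p = {log_marginal_likelihood(ell, X, y):.3f}")
print(f"grid maximizer: ell = {best_length_scale}")
```

With these settings the maximizer lies strictly inside the grid, and the value at $\ell = 5$ is far below the other two, reflecting the misfit penalty from forcing three dissimilar targets to be nearly equal.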

F.0.9.3Exercise 9.3: Active subspace by hand.

For $f(\bm x) = (x_1 + x_2 + x_3)^2 + 0.01(x_1 - x_2)^2$ on $[-1,1]^3$, the gradient is

$$\nabla f(\bm x) = 2(x_1 + x_2 + x_3)\,(1, 1, 1)^\top + 0.02(x_1 - x_2)\,(1, -1, 0)^\top.$$

The dominant term is along $(1,1,1)$ (with weight roughly $2(x_1+x_2+x_3) \cdot \sqrt{3}$), and the perturbation along $(1,-1,0)$ is two orders of magnitude smaller.

Compute $\hat C = \mathbb{E}[\nabla f \nabla f^\top]$ with $x \sim \mathcal{U}[-1,1]^3$ i.i.d., so $\mathbb{E}[x_i^2] = 1/3$. Let $s = x_1 + x_2 + x_3$; then $\mathbb{E}[s^2] = 1$, $\mathbb{E}[s(x_1 - x_2)] = \mathbb{E}[x_1^2 - x_2^2] = 0$, and $\mathbb{E}[(x_1 - x_2)^2] = 2/3$. Substituting,

$$\hat C \;\approx\; 4\,\mathbf{1}\mathbf{1}^\top \;+\; \mathcal{O}(10^{-4}),$$

where $\mathbf{1} = (1,1,1)^\top$ and the $\mathcal{O}(10^{-4})$ correction comes from the $0.01\,(x_1 - x_2)^2$ perturbation term. Since $\mathbf{1}\mathbf{1}^\top$ has eigenvalues $\{3, 0, 0\}$ (eigenvector $\mathbf{1}/\sqrt 3$ for the nonzero one), the leading eigenvalue of $\hat C$ is $\lambda_1 \approx 4 \cdot 3 = 12$, with eigenvector $(1,1,1)/\sqrt{3}$, the “aggregate direction.” The next eigenvalue is $\sim 10^{-4}$ (namely $0.02^2 \cdot \tfrac{2}{3} \cdot 2 \approx 5.3\times 10^{-4}$, along $(1,-1,0)/\sqrt 2$), four orders of magnitude smaller, and the third is zero. The active subspace of dimension 1 already captures essentially all of the function’s variability.
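Because $f$ is quadratic, $\nabla f(\bm x) = H\bm x$ with a constant matrix $H$, and $\mathbb{E}[\bm x\bm x^\top] = I/3$, so $\hat C = H H/3$ can be computed exactly. The sketch below does this in rational arithmetic and checks the three eigenpairs by direct multiplication; the helper names are illustrative.

```python
from fractions import Fraction


def outer(p, q):
    return [[pi * qj for qj in q] for pi in p]


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


ones = [Fraction(1), Fraction(1), Fraction(1)]
u = [Fraction(1), Fraction(-1), Fraction(0)]

# grad f(x) = H x with H = 2 * 1 1^T + 0.02 * u u^T
H = [[2 * a + Fraction(1, 50) * b for a, b in zip(row_ones, row_u)]
     for row_ones, row_u in zip(outer(ones, ones), outer(u, u))]

# E[x x^T] = I / 3 for x ~ U[-1, 1]^3 i.i.d., hence C = E[H x x^T H] = H H / 3
C = [[entry / 3 for entry in row] for row in matmul(H, H)]

lambda_1 = matvec(C, ones)[0]             # C 1 = lambda_1 1
lambda_2 = matvec(C, u)[0]                # C u = lambda_2 u
null_image = matvec(C, [Fraction(1), Fraction(1), Fraction(-2)])

print(f"lambda_1 = {lambda_1}, lambda_2 = {lambda_2} (about {float(lambda_2):.2e})")
print(f"C applied to (1, 1, -2): {null_image}")
```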

F.0.9.4Exercise 9.4: Deep vs. linear active subspace.

Coding exercise. The linear-AS diagnostic should show two nonzero eigenvalues for the radial-ridge target, with the remaining eigenvalues close to numerical zero. Thus a linear active subspace needs $d_{\mathrm{lin}}=2$ to represent the two linear features $w_1^\top\xi$ and $w_2^\top\xi$. Deep AS can reach its validation-MSE elbow already at $d_{\mathrm{nl}}=1$ because the encoder can learn a scalar nonlinear aggregate such as $(w_1\cdot\xi)^2 + (w_2\cdot\xi)^2$. Report the held-out MSE curve and eigenvalues from the run rather than quoting universal numbers, since optimization noise and train/validation splits move the exact diagnostics.

F.0.9.5Exercise 9.5: BAL on a 2D function.

Coding exercise. Pure-variance acquisition selects points purely by where the GP is most uncertain, ignoring the predicted mean. The resulting design is usually more uniform across the input domain because the GP variance is high wherever data is sparse, regardless of function value. Maximizing pure $\log\sigma^2(\bm{x})$ gives exactly the same design as maximizing pure $\sigma^2(\bm{x})$, since the logarithm is monotone. Differences arise only once the acquisition is mixed with a mean term, for example $w_{\mathrm{obj}}\mu(\bm{x})+\tfrac{w_{\mathrm{var}}}{2}\log\sigma^2(\bm{x})$: then the design tilts toward regions that are both uncertain and high-valued under the current surrogate. For global surrogate construction, pure variance is the cleaner baseline; for optimization-like goals, the mixed score may be useful.

F.0.9.6Exercise 9.6: Sobol sensitivity with a GP surrogate.

Coding exercise. Report the actual errors from the run rather than treating fixed percentages as universal reference outputs. The expected pattern is improvement as $N$ increases: first-order Sobol errors should fall from small to larger sample budgets, while total-effect indices may converge more slowly because they include interaction terms. The cost ratio is the important lesson: once the surrogate is trained, very large Sobol designs can be evaluated at negligible marginal cost relative to repeated true-model solves, which is why surrogate-based sensitivity analysis is standard for expensive simulators such as climate IAMs.

F.0.9.7Exercise 9.7: Prior-driven RBF-GP extrapolation outside the training domain.

(i) Coding: on the training interval $[0,1]$ the GP closely tracks $f(x) = \sin(2\pi x)\,e^{-x}$ with credible band of width $\sim 0.05$.

(ii) Analytical claim: for an RBF kernel $k(x, x') = \sigma_f^2 \exp(-(x-x')^2 / (2\ell^2))$, when $|x - x_i| \gg \ell$ for all training points $x_i$, the kernel cross-vector $k_\star = (k(x, x_1), \ldots, k(x, x_n))^\top \to (0, \ldots, 0)^\top$. Therefore the posterior mean $\bar\mu(x) = k_\star^\top (K + \sigma_y^2 I)^{-1} y \to 0$ (the prior mean). The posterior variance is

$$\bar\sigma^2(x) \;=\; k(x,x) - k_\star^\top (K+\sigma_y^2 I)^{-1} k_\star \;\to\; \sigma_f^2 - 0 \;=\; \sigma_f^2,$$

the prior variance. Hence far from data the GP literally returns the prior $\mathcal{N}(0, \sigma_f^2)$, regardless of the training data.

(iii) Coding verification: at $x = 3$ (with $\ell \approx 0.2$ for the training data), the cross-kernel is $\exp(-(3-1)^2/(2\cdot 0.04)) = \exp(-50) \approx 10^{-22}$, well below floating-point precision. Posterior mean $\approx 0$, posterior s.d. $\approx \sigma_f \approx 0.5$ (whatever was learned via marginal-likelihood maximisation).

(iv) The implication is more subtle than “overconfidence.” Far from the training data the posterior literally reverts to the prior, so the $\pm 2\sigma_f$ band there is exactly what the prior would have produced before any data were observed; it is overconfident only when the prior variance or the learned length scale is itself misleading, for instance when $\sigma_f$ was calibrated by marginal-likelihood maximization on a training set that does not represent the function’s scale outside the training hull. In particular, the posterior does not know that $f$ continues oscillating outside $[0,1]$; it just reverts to zero with a band of width $\sigma_f$, regardless of whether the true function actually stays close to zero there. Bayesian active learning algorithms that select points by maximum posterior variance can therefore fail to acquire informative samples outside the convex hull when the prior variance is no larger than the within-hull noise scale.

Mitigations: use a Matérn kernel with $\nu = 1/2$ or $\nu = 3/2$ (heavier-tailed than RBF, posterior reverts to prior more slowly), incorporate a polynomial mean function in the GP prior (so extrapolation grows with $x$ instead of decaying to zero), or use a boundary-aware acquisition function that explicitly penalizes exploitation outside the convex hull. In economic applications (e.g., extrapolating an estimated value function to wealth levels outside the training range), the safest practice is to flag any query outside the convex hull of training data and refuse to predict, rather than to trust a prior-driven band that has no data behind it.

F.0.10Chapter 10: Deep Surrogate Models and Structural Estimation

F.0.10.1Exercise 10.1: Identification.

Let $m(\varrho)=\mathbb{E}_\varrho[h(C,I,Y)]$ be the simulated moment vector implied by the persistence parameter. At the truth $\varrho^\star$, local identification is captured by the Jacobian

$$M(\varrho^\star) = \frac{\partial m(\varrho)}{\partial \varrho}\bigg|_{\varrho^\star}.$$

A moment is strongly identifying if $|M_k(\varrho^\star)|$ is large; a weakly identifying moment has derivative near zero, in which case small changes in $\varrho$ produce nearly invisible changes in the moment and the SMM objective becomes flat.

Brock--Mirman example. The output autocorrelation $\mathrm{Corr}(\log Y_t,\log Y_{t-1})$ is typically the strongest local moment for $\varrho$, because it is a direct echo of the productivity AR(1). The autocorrelation of consumption growth is also informative, but more filtered through equilibrium dynamics. The mean savings rate is nearly flat in the scalar $\varrho$ exercise and is therefore weakly identifying for persistence. The exact ranking should be reported from the finite-difference Jacobian in the notebook, not from a hard-coded number.

F.0.10.2Exercise 10.2: Optimal weighting.

The SMM objective is $Q_T(\theta) = \hat g(\theta)^\top W \hat g(\theta)$, where $\hat g(\theta) = m(\theta) - \hat m^\mathrm{data}$ is the moment-error vector. Under standard regularity (Hansen 1982), the asymptotic variance of $\hat\theta_\mathrm{SMM}$ is

$$\mathrm{Avar}(\hat\theta) = (M^\top W M)^{-1}M^\top W\Omega W M(M^\top W M)^{-1},$$

where $M=\partial m/\partial\theta'$ at $\theta^\star$ and $\Omega=\mathrm{Var}(\sqrt{T}\hat g(\theta^\star))$ is the covariance of the data-simulation moment discrepancy.

To minimize $\mathrm{Avar}(\hat\theta)$ over choices of $W$, set up the Lagrangian or use the matrix calculus result that the variance is minimized when $W \propto \Omega^{-1}$. Substituting $W=\Omega^{-1}$:

$$\mathrm{Avar}(\hat\theta)\big|_{W^\star} = (M^\top \Omega^{-1}M)^{-1}.$$

For $S$ independent simulated panels of the same length as the data, $\Omega=(1+1/S)\Sigma_m$; if simulation noise is negligible, $\Omega\approx\Sigma_m$. Compare to a different weighting $W=\tilde W$: $\mathrm{Avar}|_{\tilde W}-\mathrm{Avar}|_{W^\star}$ is positive semi-definite by the Gauss--Markov theorem applied to the moment system.

Why identity weighting in the first stage? The optimal weight depends on the unknown covariance at the truth. The standard two-stage strategy is: (1) estimate with $W=I$ to obtain a consistent but inefficient $\hat\theta^{(1)}$; (2) estimate $\hat\Omega$ at or near $\hat\theta^{(1)}$; (3) re-estimate with $W=\hat\Omega^{-1}$.
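The efficiency ranking is easy to verify numerically in the smallest over-identified case, one parameter and two moments. The Jacobian `M` and covariance `Omega` below are hypothetical numbers chosen only for illustration; the sketch evaluates the sandwich formula under $W=I$ and $W=\Omega^{-1}$.

```python
def inverse_2x2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]


def sandwich_avar(M, W, Omega):
    """Scalar-parameter sandwich (M'WM)^{-1} M'W Omega W M (M'WM)^{-1}, W symmetric."""
    WM = [sum(W[i][j] * M[j] for j in range(2)) for i in range(2)]
    bread = sum(M[i] * WM[i] for i in range(2))
    meat = sum(WM[i] * Omega[i][j] * WM[j] for i in range(2) for j in range(2))
    return meat / bread**2


# Hypothetical inputs: the second moment is both noisier and less informative
M = [1.0, 0.2]                       # dm/dtheta for each moment
Omega = [[1.0, 0.3], [0.3, 4.0]]     # covariance of the moment discrepancy

identity = [[1.0, 0.0], [0.0, 1.0]]
avar_identity = sandwich_avar(M, identity, Omega)
avar_optimal = sandwich_avar(M, inverse_2x2(Omega), Omega)

print(f"Avar with W = I:          {avar_identity:.4f}")
print(f"Avar with W = Omega^-1:   {avar_optimal:.4f}")
```

With these inputs identity weighting gives too much weight to the noisy second moment, and the optimal weight recovers $(M^\top\Omega^{-1}M)^{-1}$.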

F.0.10.3Exercise 10.3: Common random numbers.

Coding exercise. Without CRN, the objective $Q_T(\varrho)$ changes both because $\varrho$ changes and because each candidate uses a new shock path. This contaminates the profile with Monte Carlo noise. With CRN, every candidate is evaluated on the same shock path, so differences in $Q_T(\varrho)$ isolate the structural effect of persistence. Quantify the gain by repeating the entire candidate grid across Monte Carlo panels and comparing the cross-panel variance of $Q_T(\varrho)$ at each grid point.
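The mechanism can be illustrated on a toy simulator, which is an assumption of this sketch and not the notebook's model: a scalar AR(1) whose simulated moment is the first-order autocorrelation. The sketch compares the variance of the moment difference between two nearby candidates when they share a shock path and when they do not. Panel length, number of repetitions, and seed are arbitrary.

```python
import random
import statistics


def simulated_moment(rho, shocks):
    """First-order autocorrelation (OLS slope) of an AR(1) path driven by `shocks`."""
    x, path = 0.0, []
    for e in shocks:
        x = rho * x + e
        path.append(x)
    num = sum(a * b for a, b in zip(path[:-1], path[1:]))
    den = sum(a * a for a in path[:-1])
    return num / den


def moment_differences(common_shocks, n_rep=40, T=500, rho=0.90, step=0.01, seed=0):
    """m(rho + step) - m(rho) across repetitions, with or without CRN."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_rep):
        shocks_a = [rng.gauss(0.0, 1.0) for _ in range(T)]
        # With CRN the second candidate reuses the same path; otherwise it draws a new one
        shocks_b = shocks_a if common_shocks else [rng.gauss(0.0, 1.0) for _ in range(T)]
        diffs.append(simulated_moment(rho + step, shocks_b) - simulated_moment(rho, shocks_a))
    return diffs


var_crn = statistics.variance(moment_differences(common_shocks=True))
var_independent = statistics.variance(moment_differences(common_shocks=False))
print(f"variance of the difference with CRN:    {var_crn:.2e}")
print(f"variance of the difference without CRN: {var_independent:.2e}")
```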

F.0.10.4Exercise 10.4: SMM vs. SBI.

SMM solves an outer optimization at deployment: given new data, run a $K$-step optimization over $\theta$, where each step requires (a) a model simulation at the candidate $\theta$ and (b) moment computation. Total cost per inference: $K$ model evaluations.

SBI (e.g., neural posterior estimation) does the work upfront at training time: simulate the model at $N$ values of $\theta$, train a neural density estimator $q(\theta \mid y)$ on the $(\theta, y)$ pairs, and then at deployment evaluate $q$ on the new data with one forward pass. Total cost per inference (after training): one forward pass.

SBI dominates when (i) the model is expensive but a single training run is amortized over many datasets to be analyzed (e.g., a central bank running daily updates); (ii) the model has no tractable likelihood (no closed form, only a simulator); (iii) the parameter dimension is moderate ($p \le 20$) so training data covers parameter space. SMM dominates when (i) the model is cheap (each evaluation a few ms), (ii) only one or a handful of inferences are needed, (iii) classical confidence intervals are required (SBI gives a Bayesian posterior, which differs).

F.0.10.5Exercise 10.5: $J$-statistic and overidentification.

Coding+analytical. In the joint notebook’s over-identified specification, $q=4$ and $p=2$, so the degrees of freedom are $q-p=2$. Under correct specification and with $W=\Omega^{-1}$, the $J$-statistic at the optimum follows

$$J = T\,\hat g(\hat\theta)^\top \Omega^{-1}\hat g(\hat\theta) \xrightarrow{d} \chi^2_{2}.$$

If the exercise uses $W=\Sigma_m^{-1}$ while simulation noise is non-negligible, adjust the scale under the equal-length independent-simulation approximation or use the Monte Carlo/bootstrap distribution directly.

Empirical distribution: under correct specification, the Monte Carlo distribution of $J(\hat\theta)$ should be centered near the corresponding $\chi^2_2$ benchmark once the weighting and simulation-noise treatment are consistent. Under a structural break, the model cannot match all four moments simultaneously, so the distribution should shift to the right and the rejection rate should rise above the nominal size. The size of that shift is an output of the experiment and should be reported from the run.

F.0.10.6Exercise 10.6: Bootstrap confidence intervals.

Coding exercise. The nonparametric bootstrap should respect time-series dependence; use a moving-block or stationary bootstrap rather than iid resampling of individual dates. The parametric bootstrap draws new shock paths under the estimated parameter vector and re-runs the estimator. Report the actual CI endpoints and coverage from the run. In small samples, bootstrap intervals may differ materially from sandwich intervals because the SMM criterion is nonlinear and the finite-sample distribution of $\hat\theta$ need not be close to Gaussian.

F.0.10.7Exercise 10.7: ML vs. SMM efficiency.

Coding exercise. If $\log z_t$ is observed and follows

$$\log z_{t+1}=\varrho\log z_t+\sigma_z\varepsilon_{t+1},$$

then the Gaussian AR(1) likelihood gives a direct MLE for $\varrho$ (OLS of $\log z_{t+1}$ on $\log z_t$ when $\sigma_z$ is unrestricted). This MLE is efficient for the observed-shock likelihood. The SMM estimator based on a finite moment vector reaches the GMM efficiency bound for those moments, but it equals MLE efficiency only if the chosen moments span the score, which the three pedagogical moments generally do not.
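A minimal sketch of the observed-shock MLE, assuming the likelihood is taken conditional on the first observation (which is what makes it coincide with OLS without an intercept). The true parameter values, sample length, and seed are illustrative and not the notebook's calibration.

```python
import random


def simulate_log_z(rho, sigma_z, T, seed=0):
    """Simulate log z_{t+1} = rho log z_t + sigma_z eps_{t+1}, starting from 0."""
    rng = random.Random(seed)
    path = [0.0]
    for _ in range(T):
        path.append(rho * path[-1] + sigma_z * rng.gauss(0.0, 1.0))
    return path


def ar1_conditional_mle(path):
    """Conditional Gaussian MLE: OLS slope (no intercept) and residual s.d."""
    lagged, lead = path[:-1], path[1:]
    rho_hat = sum(a * b for a, b in zip(lagged, lead)) / sum(a * a for a in lagged)
    residuals = [b - rho_hat * a for a, b in zip(lagged, lead)]
    sigma_hat = (sum(e * e for e in residuals) / len(residuals)) ** 0.5
    return rho_hat, sigma_hat


true_rho, true_sigma = 0.90, 0.02          # illustrative values only
log_z = simulate_log_z(true_rho, true_sigma, T=5000)
rho_hat, sigma_hat = ar1_conditional_mle(log_z)
print(f"rho_hat = {rho_hat:.4f} (true {true_rho}), sigma_hat = {sigma_hat:.4f} (true {true_sigma})")
```

Repeating this over many seeds and comparing the dispersion of `rho_hat` with that of the SMM estimate gives the efficiency comparison the exercise asks for.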

Why SMM despite the efficiency loss? In production-scale models (heterogeneous-agent macro, dynamic IO), the likelihood is often unavailable in closed form, and computing it would require integration over high-dimensional latent states. Moments are cheaper, interpretable, and robust to parts of the model that are not central to the research question. The cost is that SMM efficiency depends on the information content of the chosen moments.

F.0.11Chapter 11: Climate Economics and Deep Uncertainty Quantification

F.0.11.1Exercise 11.1: ECS sensitivity.

Coding exercise. Report the SCC at the central calibration and at each ECS value in the likely and very-likely ranges. The expected qualitative pattern is monotone and often convex in ECS: higher equilibrium climate sensitivity raises temperature damages and therefore the emissions shadow price. The quantitative range is calibration-dependent and should be reported from the notebook rather than fixed in the solution text. The interpretation should connect the dispersion to the climate-science finding of Sherwood et al. (2020) that ECS uncertainty is a major driver of SCC uncertainty.

F.0.11.2Exercise 11.2: Sobol decomposition.

For q(θ1,θ2,θ3)=θ1θ2+θ32q(\theta_1, \theta_2, \theta_3) = \theta_1\theta_2 + \theta_3^2 with θiU[0,1]\theta_i \sim \mathcal{U}[0,1] i.i.d.:

E[q]=E[θ1]E[θ2]+E[θ32]=14+13=712,E[q2]=E[θ12]E[θ22]+2E[θ1θ2]E[θ32]+E[θ34]=19+16+15=4390,Var(q)=4390(712)2=1180.\begin{aligned} \mathbb{E}[q] &= \mathbb{E}[\theta_1]\mathbb{E}[\theta_2] + \mathbb{E}[\theta_3^2] = \tfrac{1}{4} + \tfrac{1}{3} = \tfrac{7}{12}, \\ \mathbb{E}[q^2] &= \mathbb{E}[\theta_1^2]\mathbb{E}[\theta_2^2] + 2\mathbb{E}[\theta_1\theta_2]\mathbb{E}[\theta_3^2] + \mathbb{E}[\theta_3^4] = \tfrac{1}{9} + \tfrac{1}{6} + \tfrac{1}{5} = \tfrac{43}{90}, \\ \mathrm{Var}(q) &= \tfrac{43}{90} - \bigl(\tfrac{7}{12}\bigr)^2 = \tfrac{11}{80}. \end{aligned}

Conditional means:

E[qθ1]=θ12+13,Varθ1(E[qθ1])=148,E[qθ3]=14+θ32,Varθ3(E[qθ3])=Var(θ32)=445.\begin{aligned} \mathbb{E}[q \mid \theta_1] &= \tfrac{\theta_1}{2} + \tfrac{1}{3}, & \mathrm{Var}_{\theta_1}\bigl(\mathbb{E}[q\mid\theta_1]\bigr) &= \tfrac{1}{48},\\ \mathbb{E}[q \mid \theta_3] &= \tfrac{1}{4} + \theta_3^2, & \mathrm{Var}_{\theta_3}\bigl(\mathbb{E}[q\mid\theta_3]\bigr) &= \mathrm{Var}(\theta_3^2) = \tfrac{4}{45}. \end{aligned}

First-order Sobol indices:

S1=S2=1/4811/80=5330.152,S3=4/4511/80=64990.646.S_1 = S_2 = \frac{1/48}{11/80} = \frac{5}{33} \approx 0.152, \qquad S_3 = \frac{4/45}{11/80} = \frac{64}{99} \approx 0.646.

Sum: S1+S2+S3=94/990.949S_1 + S_2 + S_3 = 94/99 \approx 0.949; the residual 5/990.0515/99 \approx 0.051 is the θ1θ2\theta_1\theta_2 interaction term.

Total-effect indices (one minus the share explained by all other variables): S1T=S2T=20/990.202S_1^T = S_2^T = 20/99 \approx 0.202; S3T=64/990.646S_3^T = 64/99 \approx 0.646 (no interactions involving θ3\theta_3 since it enters only quadratically alone).

A SALib estimate with $10^4$ samples typically matches these analytical values to two decimal places.
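The closed-form indices can also be checked without SALib. The sketch below is an illustration only, not the companion notebook's code. It recomputes the exact values with rational arithmetic and then estimates them with a plain pseudo-random pick-freeze scheme (a Saltelli-style first-order estimator and a Jansen-style total-effect estimator). The function names, the sample size, and the seed are assumptions made for this example. Plain Monte Carlo converges more slowly than SALib's quasi-random design, so it uses more samples.

```python
import random
from fractions import Fraction


def q(t1, t2, t3):
    """Test function q = theta1 * theta2 + theta3**2."""
    return t1 * t2 + t3 ** 2


def moment(k):
    """E[theta**k] for theta ~ U[0, 1]."""
    return Fraction(1, k + 1)


def exact_indices():
    """Exact variance, first-order and total-effect Sobol indices of q."""
    mean = moment(1) * moment(1) + moment(2)
    second = moment(2) ** 2 + 2 * moment(1) ** 2 * moment(2) + moment(4)
    var = second - mean ** 2
    v1 = moment(1) ** 2 * (moment(2) - moment(1) ** 2)  # Var(E[q | theta1])
    v3 = moment(4) - moment(2) ** 2                     # Var(E[q | theta3])
    v12 = var - 2 * v1 - v3                             # theta1*theta2 interaction
    first = (v1 / var, v1 / var, v3 / var)
    total = ((v1 + v12) / var, (v1 + v12) / var, v3 / var)
    return var, first, total


def mc_indices(n=100_000, seed=0):
    """Pick-freeze Monte Carlo estimates of first-order and total indices."""
    rng = random.Random(seed)
    sum_f = sum_f2 = 0.0
    first_acc = [0.0, 0.0, 0.0]
    total_acc = [0.0, 0.0, 0.0]
    for _ in range(n):
        a = [rng.random() for _ in range(3)]
        b = [rng.random() for _ in range(3)]
        fa, fb = q(*a), q(*b)
        sum_f += fa
        sum_f2 += fa * fa
        for i in range(3):
            ab = list(a)
            ab[i] = b[i]  # sample A with coordinate i taken from B
            fab = q(*ab)
            first_acc[i] += fb * (fab - fa)        # first-order numerator
            total_acc[i] += 0.5 * (fa - fab) ** 2  # total-effect numerator
    var = sum_f2 / n - (sum_f / n) ** 2
    first = [s / n / var for s in first_acc]
    total = [s / n / var for s in total_acc]
    return first, total


if __name__ == "__main__":
    var, first, total = exact_indices()
    print("exact:", var, first, total)
    print("monte carlo:", mc_indices())
```

The exact branch reproduces $\mathrm{Var}(q) = 11/80$, $S_1 = S_2 = 5/33$, $S_3 = 64/99$, and $S_1^T = S_2^T = 20/99$; the Monte Carlo branch should agree up to sampling error.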

F.4.1.3Exercise 11.3: ACE benchmark.

Coding exercise. Use ACE’s closed-form optimal carbon tax as the analytic benchmark, then compute the DEQN-trained DICE SCC at the same calibration and report the percentage discrepancy. A small gap is expected only when units, discounting, damage curvature, and time discretization are aligned; any residual difference should be attributed explicitly to the DICE discretization, smoothing choices, and calibration differences rather than to a fixed percentage target.

F.4.1.4Exercise 11.4: Pareto-improving tax design.

Project-level coding exercise. The constrained search is over a low-dimensional parameter vector $\vartheta = (\vartheta_{\mathrm{tax}}, \omega)$, with the cohort welfare constraints $\tilde U_t(\vartheta) \ge U_t$ enforced via a penalty or barrier term in the outer optimizer. The Pareto frontier is typically tight: cohorts whose BAU welfare is already close to the unconstrained Ramsey-optimal welfare are easily satisfied, while cohorts that bear the bulk of the transition cost (often the youngest at the time of the reform) bind first; the welfare gap to the unconstrained problem grows with the share of constrained cohorts. Holding $\omega$ fixed at the BAU or declining benchmark $\bar\omega$ leaves measurable welfare on the table because endogenizing the transfer schedule provides an extra free instrument with which to relax the binding constraints. Numerical answers are calibration-dependent and should be reported from the notebook rather than benchmarked against fixed values; the qualitative interpretation is what matters for the policy discussion.

F.4.1.5Exercise 11.5: Carbon-cycle warm-up.

Coding exercise driven by notebook lecture_16_01_Climate_Exercise.ipynb. Part (a): the avoided-warming and avoided-damages numbers at 2100 under the 50% mitigation rule are notebook outputs and depend on calibration, so they should be reported from the run rather than fixed in the solution text. Part (b): the longest-timescale reservoir is identified by the smallest non-zero eigenvalue of the carbon-cycle transition matrix; for the CDICE three-box calibration this is the lower-ocean compartment, with a multi-century characteristic timescale. Part (c): a quadratic damage function $D(T) = \pi_2 T^2$ has constant curvature in $T$ and therefore underweights tail risk. Marginal damage grows only linearly, so very high realizations of $T$ are penalized far less than under super-quadratic damages or an explicit tipping-hazard term, and tail-risk assessment requires one of those richer specifications (compare Exercise 11.8).

F.4.1.6Exercise 11.6: Deterministic CDICE-DEQN reproduction.

Coding exercise driven by notebook lecture_16_02_DICE_DEQN_Library_Port.ipynb. The verification target is the set $\{T^{\mathrm{AT}}(2100),\ M^{\mathrm{AT}}(2100),\ \mu(2100),\ \mathrm{SCC}(2015),\ \mathrm{SCC}(2100),\ \mathrm{SCC}(2300)\}$, each compared against the reference solution at the tolerances stated in the notebook’s verification gate. Reference numbers depend on seed, hardware, optimizer schedule, and trajectory sampling, so the solution should anchor on the verification gate rather than on fixed numbers. Typical convergence problems trace back to scaling differences across the eight residuals or to insufficient trajectory coverage near the carbon-stock saturation regime; the residual-balancing methods discussed in Chapter 4 are the first line of defense.

F.4.1.7Exercise 11.7: Stochastic SCC fan chart.

Coding exercise driven by notebook lecture_16_03_Stochastic_DICE_DEQN.ipynb. As the AR(1) productivity volatility $\sigma_z$ rises, the right tail of the SCC distribution at 2100 widens disproportionately, because higher productivity raises emissions which feed convexly into damages. Report the quantiles $q_{10}, q_{50}, q_{90}$ of the SCC fan chart at each $\sigma_z$ rather than the mean alone, since the mean obscures the asymmetry. The qualitative finding aligns with Cai & Lontzek (2019), that productivity and consumption-growth shocks shift the SCC distribution materially and not just its location, supporting the use of distribution-aware reporting in policy work.
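As an illustration of the reporting convention, the sketch below computes the three fan-chart quantiles and the mean from a list of simulated values using only the Python standard library. The helper name `fan_chart_summary` is hypothetical, and the lognormal draws are synthetic placeholders chosen only to be right-skewed; they are not SCC output from the notebook.

```python
import random
import statistics


def fan_chart_summary(scc_draws):
    """Return (q10, q50, q90, mean) for simulated values at one date."""
    # quantiles(..., n=10) returns the 9 decile cut points: 10%, 20%, ..., 90%.
    deciles = statistics.quantiles(scc_draws, n=10)
    return deciles[0], deciles[4], deciles[8], statistics.fmean(scc_draws)


# Synthetic right-skewed draws (placeholder data, not notebook output).
rng = random.Random(0)
draws = [rng.lognormvariate(0.0, 0.8) for _ in range(5000)]
q10, q50, q90, mean = fan_chart_summary(draws)
print(f"q10={q10:.3f}  q50={q50:.3f}  q90={q90:.3f}  mean={mean:.3f}")
```

For a right-skewed sample the mean sits above the median and the upper band $q_{90} - q_{50}$ is wider than the lower band $q_{50} - q_{10}$, which is the asymmetry that a mean-only report would hide.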

F.4.1.8Exercise 11.8: Tipping-point regime-switching damages.

Coding exercise. Report the unconditional tipping probability by 2100, the SCC at $t=0$, and the optimal abatement path under the regime-switching specification and under the smooth-damage baseline. The expected qualitative pattern is that adding an irreversible tipping hazard raises the SCC and brings abatement forward, especially when $T_\mathrm{thresh}$ is low. The factor by which the SCC rises is an output of the hazard calibration and retrained policy. The policy implication should be framed as a precautionary motive under threshold uncertainty, not as a universal numerical multiplier.

F.4.1.9Exercise 11.9: Real options value of waiting.

(i) Closed forms. Under the act-now strategy, the planner chooses $\mu_0$ to minimize $\theta\mu_0^2 + \alpha\,\mathbb{E}[\mathrm{ECS}]\cdot(1 - \mu_0)$. With $\mathbb{E}[\mathrm{ECS}] = (\mathrm{ECS}_L + \mathrm{ECS}_H)/2 \equiv \overline{\mathrm{ECS}}$, the FOC gives $\mu_0^\star = \alpha \overline{\mathrm{ECS}}/(2\theta)$. Plugging back, expected cost is

$$
C_\mathrm{now} \;=\; \alpha \overline{\mathrm{ECS}} - \frac{(\alpha \overline{\mathrm{ECS}})^2}{4\theta}.
$$

Under the wait strategy, after observing $\widehat{\mathrm{ECS}}$, the posterior mean $\mathbb{E}[\mathrm{ECS}\mid\widehat{\mathrm{ECS}}]$ has variance shrunk relative to the prior. The planner chooses $\mu_1^\star(\widehat{\mathrm{ECS}}) = \alpha\,\mathbb{E}[\mathrm{ECS}\mid\widehat{\mathrm{ECS}}]/(2\theta)$. Expected cost (using the law of total variance):

$$
C_\mathrm{wait} \;=\; \alpha \overline{\mathrm{ECS}} - \frac{\alpha^2}{4\theta}\,\mathbb{E}\bigl[\mathbb{E}[\mathrm{ECS}\mid\widehat{\mathrm{ECS}}]^2\bigr] \;=\; \alpha \overline{\mathrm{ECS}} - \frac{\alpha^2}{4\theta}\bigl(\overline{\mathrm{ECS}}^2 + \mathrm{Var}_\mathrm{prior} - \mathrm{Var}_\mathrm{post}\bigr).
$$

(ii) Value of waiting. Difference:

$$
\mathrm{VoW} \;=\; C_\mathrm{wait} - C_\mathrm{now} \;=\; -\frac{\alpha^2}{4\theta}\bigl(\mathrm{Var}_\mathrm{prior} - \mathrm{Var}_\mathrm{post}\bigr).
$$

As the signal becomes informative ($\sigma_\varepsilon \to 0$), $\mathrm{Var}_\mathrm{post} \to 0$, so $\mathrm{Var}_\mathrm{prior} - \mathrm{Var}_\mathrm{post} \to \mathrm{Var}_\mathrm{prior} > 0$, and $\mathrm{VoW} < 0$: waiting is preferred. The reduction in posterior variance is the value of information.
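The closed form for the value of waiting can be checked by brute force. The sketch below makes a simplifying assumption that is not in the exercise: the continuous noisy signal is replaced by a binary signal that reports the true state with probability `accuracy`. This keeps the posterior means available in closed form. All function names and the numerical values in the usage lines are illustrative placeholders, not a calibration.

```python
def act_now_cost(alpha, theta, ecs_mean):
    """Minimized cost theta*mu**2 + alpha*ecs_mean*(1 - mu) at the interior FOC."""
    mu = alpha * ecs_mean / (2 * theta)
    return theta * mu ** 2 + alpha * ecs_mean * (1 - mu)


def wait_cost(alpha, theta, ecs_low, ecs_high, accuracy):
    """Expected cost when abatement is chosen after a binary signal.

    ASSUMPTION: equal prior weight on ecs_low and ecs_high, and a signal that
    names the true state with probability `accuracy` (so each signal value
    occurs with probability 1/2).
    """
    cost = 0.0
    for signal_high in (True, False):
        p_high = accuracy if signal_high else 1 - accuracy  # P(ECS_H | signal)
        post_mean = p_high * ecs_high + (1 - p_high) * ecs_low
        cost += 0.5 * act_now_cost(alpha, theta, post_mean)
    return cost


def value_of_waiting_closed_form(alpha, theta, ecs_low, ecs_high, accuracy):
    """-(alpha**2 / (4*theta)) * (Var_prior - expected Var_post)."""
    spread = ecs_high - ecs_low
    var_prior = (spread / 2) ** 2
    var_post = accuracy * (1 - accuracy) * spread ** 2
    return -alpha ** 2 / (4 * theta) * (var_prior - var_post)


if __name__ == "__main__":
    args = dict(alpha=0.1, theta=1.0, ecs_low=2.0, ecs_high=4.5, accuracy=0.8)
    ecs_bar = (args["ecs_low"] + args["ecs_high"]) / 2
    brute = wait_cost(**args) - act_now_cost(args["alpha"], args["theta"], ecs_bar)
    print(brute, value_of_waiting_closed_form(**args))
```

With `accuracy = 0.5` the signal is uninformative and both expressions return zero; as `accuracy` approaches one the difference approaches $-\alpha^2\,\mathrm{Var}_\mathrm{prior}/(4\theta)$.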

(iii) With irreversibility. Add a wedge $\eta\mu_1^2$ to the wait-branch objective, so the planner who waits solves a different second-period problem:

$$
\min_{\mu_1}\;(\theta+\eta)\mu_1^2 + \alpha\,\mathbb{E}[\mathrm{ECS}\mid\widehat{\mathrm{ECS}}]\,(1-\mu_1).
$$

Assuming the interior solution remains in $[0,1]$,

$$
\mu_1^{\eta\star}(\widehat{\mathrm{ECS}}) = \frac{\alpha\,\mathbb{E}[\mathrm{ECS}\mid\widehat{\mathrm{ECS}}]}{2(\theta+\eta)}
$$

and expected wait cost becomes

$$
C_\mathrm{wait}^\eta = \alpha \overline{\mathrm{ECS}} - \frac{\alpha^2}{4(\theta+\eta)}\bigl(\overline{\mathrm{ECS}}^2 + \mathrm{Var}_\mathrm{prior} - \mathrm{Var}_\mathrm{post}\bigr).
$$

The threshold occurs at

$$
\eta^\star = \frac{\theta(\mathrm{Var}_\mathrm{prior}-\mathrm{Var}_\mathrm{post})}{\overline{\mathrm{ECS}}^2},
$$

where the value of information just balances the irreversibility cost.
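The threshold can be verified by comparing the two expected costs directly. The steps below are a sketch under the same interior-solution assumption. They use $C_\mathrm{now}$ from part (i) and the wait cost obtained by substituting the first-order condition with the wedge into the objective. The shorthand $\Delta \equiv \mathrm{Var}_\mathrm{prior} - \mathrm{Var}_\mathrm{post}$ is introduced here only for compactness.

$$
\begin{aligned}
C_\mathrm{wait}^\eta \le C_\mathrm{now}
&\iff \alpha \overline{\mathrm{ECS}} - \frac{\alpha^2}{4(\theta+\eta)}\bigl(\overline{\mathrm{ECS}}^2 + \Delta\bigr) \le \alpha \overline{\mathrm{ECS}} - \frac{\alpha^2\,\overline{\mathrm{ECS}}^2}{4\theta} \\
&\iff \theta\bigl(\overline{\mathrm{ECS}}^2 + \Delta\bigr) \ge (\theta+\eta)\,\overline{\mathrm{ECS}}^2 \\
&\iff \eta \le \frac{\theta\,\Delta}{\overline{\mathrm{ECS}}^2} = \eta^\star .
\end{aligned}
$$

Waiting is therefore weakly preferred for $\eta \le \eta^\star$ and acting now is preferred above it. An uninformative signal ($\Delta \to 0$) drives $\eta^\star$ to zero, so any positive irreversibility wedge then favors early action.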

Connection to climate policy. This stylized model captures the central tension in climate policy: information arrives over time about ECS, damage functions, and tipping thresholds, but emissions are largely irreversible (atmospheric CO$_2$ persists for centuries). The chapter’s Bayesian-learning treatment makes the trade-off quantitative: under fast learning and slow climate dynamics, waiting can be optimal; under slow learning and fast tipping risks, an early-action premium emerges. The empirical literature (Pindyck, 2007; Cai & Lontzek, 2019) finds that for realistic climate calibrations, the irreversibility channel typically dominates, supporting near-term carbon tax implementation rather than “wait and see” policies.

F.7.1Chapter 12: Synthesis and Outlook

F.7.1.1Exercise 12.1: Method-choice scenario.

Sketch:

(a) 4-state monetary-policy DSGE with smooth shocks. Use classical perturbation or projection as the baseline. The state space is small and the shocks are smooth, so a classical method is transparent, fast, and easy to audit. A DEQN becomes attractive only if the model is extended with genuinely global nonlinearities (for example a binding zero lower bound, occasionally binding collateral constraints, or a highly non-quadratic loss). Hybrid: use a GP surrogate over policy-rule parameters after the classical or DEQN solution step if many counterfactuals are needed.

(b) 200-agent OLG with progressive taxation. Use DEQN with Young’s-method aggregation (Chapters 5 and 6). 200 cohorts is at the edge where explicit-panel methods (all-in-one DL) and histogram methods compete; histograms are easier when constraints bind frequently across cohorts. Hybrid: post-train a GP surrogate over the Pareto-weight calibration to evaluate optimal-tax-rule sensitivity.

(c) Exotic option pricing on an irregularly shaped payoff. Use PINN or Deep Galerkin methods with careful treatment of the payoff kink (Chapter 7) for a single-contract instance, or DeepONet if prices are needed for many strikes or contract parameters. The non-smooth object is the terminal payoff, not a generic spatial boundary; for a strong-form Black–Scholes residual this usually calls for smoothing the payoff, using smooth activations away from the kink, or switching to a weak/viscosity-aware formulation. GP surrogates can help as an outer pricing surface over a few parameters, but they are not the primitive PDE solver.

(d) Climate-IAM where SCC uncertainty is the deliverable. Use the full pipeline: DEQN to solve the deterministic IAM, GP surrogate over deep-uncertain parameters (ECS, damage convexity, tipping thresholds), Sobol/Shapley sensitivity decomposition for attribution, and BAL for sample-efficient uncertainty quantification (Chapters 9 and 11). This combines the IAM uncertainty-quantification workflow of Friedl et al. (2023) with the surrogate-based policy-search logic in Kübler et al. (2026).

F.7.1.2Exercise 12.2: When NOT to use deep learning.

Open-ended. Sketch of one regime: bit-exact reproducibility for regulatory audit. GPU non-determinism in atomic accumulators (described in Appendix E) means that a deep-learning solver typically cannot reproduce the same numbers across hardware platforms; for regulatory work where auditors must replay every step bitwise, a deterministic finite-difference solver on a fixed grid is preferable. Even when deterministic flags are set, BLAS implementations differ across CUDA versions. Classical fixed-grid methods are easier to make bit-reproducible because the operation order can be pinned and the solver path is usually far less sensitive to random initialization and stochastic mini-batches.

F.7.1.3Exercise 12.3: Reproducibility audit.

Coding exercise. Expected behavior: re-running notebook lecture_03_02_Brock_Mirman_Uncertainty_DEQN.ipynb on the same machine with the same seeds should reproduce the reported diagnostics within the stated tolerance. Bitwise equality of trained network parameters should be expected only when deterministic framework settings, hardware, BLAS/CUDA versions, and floating-point order of operations are all pinned. Re-running on a different GPU, or with deterministic flags off, can produce small deviations in the last few printed digits of the savings-rate diagnostics while remaining well within the residual tolerance of the training.

F.7.1.4Exercise 12.4: Open-ended.

Open-ended; expected output is a 1–2 page research sketch. A representative answer combines DEQN (for solving the model) with GP surrogates (for parameter estimation): e.g., a 6-month project that estimates a heterogeneous-agent NK model with deep-learning policy functions, then uses a deep-kernel GP surrogate to do Bayesian posterior inference on the structural parameters. This integrates Chapters 6, 9, and 10.

F.7.1.5Exercise 12.5: Hybrid pipeline DEQN + GP + SMM.

Coding exercise. Illustrative outputs to benchmark against, not fixed targets:

The expected qualitative result is that the surrogate-based optimization is much cheaper per objective evaluation after the one-time DEQN, simulation-grid, and GP-fitting costs have been paid. The actual speedup and estimation errors are outputs of the run and should be reported rather than assumed.

F.7.1.6Exercise 12.6: DeepONet for a parameterized HJB.

(i) Architecture. Branch net $\bm b(\gamma): \mathbb{R} \to \mathbb{R}^p$ encodes the parameter $\gamma$ into a $p$-dimensional latent representation; trunk net $\bm t(a): \mathbb{R} \to \mathbb{R}^p$ encodes the query point $a$ into the same latent dimension. Predicted value:

$$
\widehat{V}(a; \gamma) \;=\; \langle \bm b(\gamma), \bm t(a)\rangle \;+\; b_0,
$$

where $b_0$ is a learnable scalar bias. Both nets are typically MLPs with 2–4 layers and width 50–200. The architecture is non-trivial because it imposes the bilinear separable structure $V(a; \gamma) = \sum_{k=1}^p b_k(\gamma) t_k(a)$, which is the universal approximation form for nonlinear operators.
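To make the branch/trunk structure concrete, the following dependency-free sketch evaluates an untrained DeepONet of this form. It is only an illustration of the forward pass: the layer widths, latent dimension, tanh activation, and initialization are assumptions chosen for brevity, and the helper names (`init_mlp`, `mlp`, `deeponet_value`) are hypothetical. A real implementation would normally use an automatic-differentiation framework and train the weights.

```python
import math
import random


def init_mlp(sizes, rng):
    """Random weights for an MLP with layer widths `sizes` (illustrative init)."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def mlp(layers, x):
    """Forward pass: tanh on hidden layers, linear output layer."""
    h = list(x)
    for k, (weights, biases) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if k < len(layers) - 1:
            h = [math.tanh(v) for v in h]
    return h


def deeponet_value(branch, trunk, b0, gamma, a):
    """V_hat(a; gamma) = <b(gamma), t(a)> + b0."""
    b_vec = mlp(branch, [gamma])  # parameter -> R^p
    t_vec = mlp(trunk, [a])       # query point -> R^p
    return sum(bk * tk for bk, tk in zip(b_vec, t_vec)) + b0


if __name__ == "__main__":
    rng = random.Random(0)
    p = 8  # latent dimension (assumed)
    branch = init_mlp([1, 16, p], rng)
    trunk = init_mlp([1, 16, p], rng)
    print(deeponet_value(branch, trunk, b0=0.0, gamma=2.0, a=1.5))
```

Because $\gamma$ and $a$ interact only through the final inner product, evaluating the value function on a new grid of $a$ for a fixed $\gamma$ requires one branch evaluation and one trunk evaluation per grid point.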

(ii) UAT for operator learning. Lu et al. (2021) show that a continuous nonlinear operator can be approximated uniformly on a compact set of input functions and a compact output domain by a DeepONet with finitely many sensors, a branch network, and a trunk network. In the present notation, for every $\varepsilon>0$ there are sensor locations, finite-width branch/trunk nets, and latent dimension $p$ such that

$$
\sup_{\gamma\in\Gamma}\sup_{a\in\mathcal A} \bigl|V(a;\gamma)-\widehat V(a;\gamma)\bigr| < \varepsilon
$$

on the compact training domain $\Gamma\times\mathcal A$, provided the solution operator is continuous there. This generalizes ordinary universal approximation from finite-dimensional functions to operators, but it does not provide extrapolation guarantees outside the compact training domain.

(iii) Cost comparison. Independent PINN runs cost $N \cdot C_\mathrm{PINN}$. One DeepONet run costs $C_\mathrm{DON}$, where $C_\mathrm{DON}/C_\mathrm{PINN}$ captures the larger network, the parameter-sampling overhead, and the longer training needed to span the parameter range. Operator learning wins exactly when

$$
C_\mathrm{DON} < N\,C_\mathrm{PINN}, \qquad\text{or equivalently}\qquad C_\mathrm{DON}/C_\mathrm{PINN} < N .
$$

At $N=50$, the break-even ratio is therefore 50: a DeepONet can cost up to fifty single-parameter PINN solves and still be cheaper in total. The realized speedup is an empirical output of the implementation, not a universal constant.

(iv) Limitations. Extrapolation: DeepONet trained on $\gamma \in [1.5, 5]$ has no guarantees outside this range; in practice the predicted operator $V(a; \gamma)$ for $\gamma > 5$ degrades smoothly but may violate the structural property that $V$ should remain concave. Mitigation: use polynomial-augmented branch networks that extrapolate via known tails (e.g., Bernoulli polynomials), or refuse to predict outside the convex hull of training $\gamma$ values.

Concavity preservation: the bilinear DeepONet architecture does not automatically preserve concavity in $a$. A trained network may produce non-concave $\widehat{V}(\cdot; \gamma)$ for some $\gamma$, which is economically wrong under risk-averse preferences. Mitigation: use an input-concave architecture for the trunk representation, for example the negative of an input-convex neural network where appropriate, or add a concavity penalty $\max(0,\widehat{V}''(a;\gamma))$ to the loss. Both increase training cost but restore pressure toward the structural property.
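The concavity penalty can be sketched on a grid. The example below is a minimal illustration, not the chapter's implementation. It approximates $\widehat{V}''$ by central second differences on a uniform grid in $a$ for one fixed $\gamma$, and averages $\max(0, \widehat{V}'')$ over the interior points. The function name and the grid are assumptions. During training the second derivative would more commonly come from automatic differentiation.

```python
import math


def concavity_penalty(values, step):
    """Mean of max(0, V'') on a uniform grid, via central second differences.

    `values` are V_hat evaluated on equally spaced points with spacing `step`.
    """
    if len(values) < 3:
        raise ValueError("need at least three grid points")
    total = 0.0
    count = 0
    for left, mid, right in zip(values, values[1:], values[2:]):
        second_diff = (left - 2.0 * mid + right) / step ** 2
        total += max(0.0, second_diff)  # only convex stretches are penalized
        count += 1
    return total / count


if __name__ == "__main__":
    step = 0.1
    grid = [step * k for k in range(1, 21)]
    print(concavity_penalty([math.sqrt(a) for a in grid], step))  # concave input
    print(concavity_penalty([a ** 2 for a in grid], step))        # convex input
```

A concave candidate such as $\sqrt{a}$ incurs no penalty, while a convex one such as $a^2$ is penalized in proportion to its curvature, so adding this term to the loss pushes the fitted value function toward concavity without guaranteeing it.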

References
  1. Backus, D. K., Kehoe, P. J., & Kydland, F. E. (1992). International real business cycles. Journal of Political Economy, 745–775.
  2. Bergstra, J., & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281–305.
  3. Zaheer, M., Kottur, S., Ravanbakhsh, S., Póczos, B., Salakhutdinov, R., & Smola, A. J. (2017). Deep Sets. Advances in Neural Information Processing Systems (NeurIPS).
  4. Sirignano, J., & Spiliopoulos, K. (2018). DGM: A Deep Learning Algorithm for Solving Partial Differential Equations. Journal of Computational Physics, 375, 1339–1364.
  5. Lu, L., Jin, P., Pang, G., Zhang, Z., & Karniadakis, G. E. (2021). Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3), 218–229. 10.1038/s42256-021-00302-5
  6. Achdou, Y., Han, J., Lasry, J.-M., Lions, P.-L., & Moll, B. (2022). Income and wealth distribution in macroeconomics: A continuous-time approach. The Review of Economic Studies, 89(1), 45–86.
  7. Sherwood, S. C., Webb, M. J., Annan, J. D., Armour, K. C., Forster, P. M., Hargreaves, J. C., Hegerl, G., Klein, S. A., Marvel, K. D., Rohling, E. J., Watanabe, M., Andrews, T., Braconnot, P., Bretherton, C. S., Foster, G. L., Hausfather, Z., von der Heydt, A. S., Knutti, R., Mauritsen, T., … Zelinka, M. D. (2020). An Assessment of Earth’s Climate Sensitivity Using Multiple Lines of Evidence. Reviews of Geophysics, 58(4), e2019RG000678. 10.1029/2019RG000678
  8. Cai, Y., & Lontzek, T. S. (2019). The Social Cost of Carbon with Economic and Climate Risks. Journal of Political Economy, 127(6), 2684–2734. 10.1086/701890
  9. Pindyck, R. S. (2007). Uncertainty in Environmental Economics. Review of Environmental Economics and Policy, 1(1), 45–65.
  10. Friedl, A., Kübler, F., Scheidegger, S., & Usui, T. (2023). Deep Uncertainty Quantification: With an Application to Integrated Assessment Models.
  11. Kübler, F., Scheidegger, S., & Surbek, O. (2026). Using Machine Learning to Compute Constrained Optimal Carbon Tax Rules. Journal of Political Economy: Macroeconomics.