Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

2 Deep Equilibrium Nets

University of Lausanne

This chapter introduces Deep Equilibrium Nets (DEQNs), the central computational framework of this script. Rather than generating labeled training data, a DEQN embeds the equilibrium conditions of a dynamic stochastic model (Euler equations, budget constraints, complementarity conditions) directly into the loss function of a neural network. The network then learns policy functions by minimizing the violation of these conditions via gradient descent. We develop the methodology on the classical Brock--Mirman growth model, for which an analytical solution exists and serves as a benchmark. The primary reference is Azinovic et al. (2022).

2.0.0.1Notation transition.

Chapter Chapter 1 used θ\bm{\theta} for the parameter vector of a generic supervised model, following the textbook convention. From this chapter onward we switch to ρ\rho (so that the network is denoted Nρ(x)\mathcal{N}_\rho(\x)), matching the notation of Azinovic et al. (2022) and most of the subsequent DEQN literature. The two symbols denote the same object; we use ρ\rho for parameters wherever a network appears, and reserve θ\bm{\theta} for structural economic parameters that the network does not see (Chs. Chapter 10--Chapter 11).

2.1The Curse of Dimensionality in Economics

A central challenge in computational economics is that many models of interest, such as overlapping-generations (OLG) economies, heterogeneous-agent models, and international business cycle models, feature state spaces of high dimension. In an OLG model with NN cohorts and idiosyncratic risk, the state vector may include individual capital holdings, productivity levels, and wealth shares for each cohort, leading to state spaces of dimension d10d \gg 10. Traditional grid-based solution methods, such as value function iteration on a Cartesian grid with nn nodes per dimension, require ndn^d grid points; this is the curse of dimensionality, the term coined by Bellman (1961) to describe the exponential growth of computational cost with the number of state variables. A concrete back-of-the-envelope makes the bottleneck plain: at the modest resolution of n=100n=100 nodes per dimension, a d=5d=5 state space already requires 1005=1010100^5 = 10^{10} grid points, which is infeasible under any reasonable compute budget; even more modest resolutions of n=20n=20--30 become prohibitive once dd exceeds ten.

Table Table 2.1 illustrates the curse concretely. Even with a modest n=10n=10 grid points per dimension, the total number of grid points grows astronomically.

Table 2.1:Size of an n=10n = 10 Cartesian grid and the 64-bit memory required to store one floating-point value per grid point, as a function of state-space dimension dd. Grid-based methods are comfortable only at low dimension; by d=10d = 10 even storing one scalar per grid point is already borderline before policies, interpolation objects, residuals, and simulation output are added.

ddGrid points (10d10^d)Memory (64-bit)Feasibility
11080 Btrivial
2100800 Btrivial
5105800 KBfeasible
10101080 GBborderline
2010208×10118 \times 10^{11} GBimpossible
501050---absurd
10010100---a googol

2.1.0.1The volume paradox.

A complementary, and more geometric, perspective on the curse of dimensionality starts from a single ratio. The volume of the unit dd-ball relative to the smallest cube that contains it, [1,1]d[-1,1]^d, is

Vball(d)Vcube(d)=πd/22dΓ(d/2+1).\frac{V_\mathrm{ball}(d)}{V_\mathrm{cube}(d)} = \frac{\pi^{d/2}}{2^d\,\Gamma(d/2+1)}.

This ratio collapses with dd (Figure Figure 2.1): it is π/40.785\pi/4 \approx 0.785 at d=2d=2 and π/60.524\pi/6 \approx 0.524 at d=3d=3, has fallen to 2.5×103\approx 2.5\times 10^{-3} by d=10d=10, and is below 10-70 at d=100d=100. The geometry behind the numbers is sharp. The inscribed ball touches the center of each of the 2d2d faces, at distance 1 from the center, but its surface never gets within d1\sqrt{d}-1 of any of the 2d2^d corners, which sit at distance d\sqrt{d}. As dd grows the corners march off to infinity while the ball stays put, so essentially all of the cube’s volume drains into its corners, far from the center. A box-shaped object in high dimensions is, to a first approximation, all corners and no middle.

Why should an economist care about a fact about cubes and balls? Because solving a dynamic stochastic model means producing a global solution[1], and the region over which that solution actually has to be accurate is the model’s ergodic set, the states the economy visits under its own law of motion. That set is typically a thin, curved, low-volume sliver of any hypercube [xmin,xmax]d[\x_{\min},\x_{\max}]^d that one would draw around it, often well under one percent of the bounding box already at moderate dd Judd et al., 2011. Yet most of the high-dimensional approximation machinery on offer, tensor-product grids, adaptive sparse grids, tensor-product quadrature, is built on the hypercube; by the volume paradox an exponentially growing fraction of those nodes lands in corners the model never visits, so most of the computational effort is spent representing the function where it does not matter. Figure Figure 2.2 visualizes this mismatch. This is exactly the waste the DEQN approach is designed to avoid: instead of tiling a box, it trains the network on states sampled from the model’s own simulated trajectories, putting the approximation effort where the economics lives.

Distance itself is the second casualty of high dimensions, and it is worth flagging now because it returns later with teeth. Draw a point uniformly from the unit dd-ball: its radius has cumulative distribution rdr^d and density drd1d\,r^{d-1} on [0,1][0,1], which for large dd piles essentially all of its mass against the shell r1r\to 1. Random points sit on the pulp, not in the core, and they do so with overwhelming probability. Relatedly, pairwise Euclidean distances among random points concentrate so tightly that the nearest and the farthest neighbor of a query point become almost equidistant, so “distance” loses most of its power to discriminate Aggarwal et al., 2001 (Figure Figure 2.3). This is not a curio: any method that judges similarity through xx\|\x-\x'\|, from kk-nearest-neighbors to kernel ridge regression to the Gaussian-process surrogates of Chapter Chapter 9, whose RBF and Matérn kernels are functions of xx\|\x-\x'\| alone, loses resolution as dd grows. It is one of the reasons Chapter Chapter 9 reaches for dimension reduction, active subspaces and deep kernels, before fitting a GP in a high-dimensional input space.

(a) the unit ball inscribed in [-1,1]^d

Figure 2.1:(a) the unit ball inscribed in [1,1]d[-1,1]^d

(b) ratio (2.1) versus dd (log scale)

The volume paradox behind the curse of dimensionality. (a) The largest ball that fits inside the cube [1,1]d[-1,1]^d touches the center of every face, at distance 1 from the center, but its surface stays d1\sqrt{d}-1 away from each of the 2d2^d corners, which lie at distance d\sqrt{d}. As dd grows the corners recede while the ball does not, so almost all of the cube’s volume ends up in the corners (tinted), far from the center. (b) The ball-to-cube volume ratio of equation (2.1) on a logarithmic scale: π/4\pi/4 at d=2d=2, about 2.5×1032.5\times10^{-3} at d=10d=10, and below 10-70 at d=100d=100. A grid or quadrature rule built on the bounding hypercube therefore spends an exponentially growing share of its nodes in corners that the model’s ergodic set never reaches.

Grid-based vs. simulation-based state sampling. A Cartesian grid (left) allocates effort uniformly over the hypercube and places most of its n^d points far from the model’s ergodic set, the small subset of the state space that the economy actually visits under its own dynamics (red band). The DEQN algorithm samples states along simulated trajectories (right), concentrating training points exactly on that set. As d grows, the fraction of the cube reached by the ergodic set shrinks rapidly, so the grid’s relative waste grows exponentially.

Figure 2.2:Grid-based vs. simulation-based state sampling. A Cartesian grid (left) allocates effort uniformly over the hypercube and places most of its ndn^d points far from the model’s ergodic set, the small subset of the state space that the economy actually visits under its own dynamics (red band). The DEQN algorithm samples states along simulated trajectories (right), concentrating training points exactly on that set. As dd grows, the fraction of the cube reached by the ergodic set shrinks rapidly, so the grid’s relative waste grows exponentially.

(a) blue core and red shell each hold half a d-ball’s volume; the shell thins as d grows

Figure 2.3:(a) blue core and red shell each hold half a dd-ball’s volume; the shell thins as dd grows

(b) the radius CDF piles up at r1r\to1 as dd grows

Why a “random” point in high dimensions lives on the shell, not in the core. (a) For each dd, the blue inner disk and the red outer shell each hold half of the dd-ball’s volume; the shell’s fractional thickness 1(1/2)1/d1-(1/2)^{1/d} shrinks from 0.29\approx0.29 at d=2d=2 to 0.07\approx0.07 at d=10d=10 to 0.014\approx0.014 at d=50d=50, so half of the mass crowds into an ever thinner rim near the surface. (b) Equivalently, the radius of a uniform draw has cumulative distribution rdr^{d}, which for large dd stays near zero until rr is almost 1 and then jumps: the radius is essentially deterministic at 1. The companion fact is that pairwise Euclidean distances among random points concentrate, so a query point’s nearest and farthest neighbors become nearly indistinguishable Aggarwal et al., 2001; this is why distance-based methods, including the Gaussian-process kernels of Chapter Chapter 9, lose resolution in high dimensions.

2.2From Supervised Learning to Self-Supervised Equilibrium Solving

A word on terminology before we start, recalling the taxonomy of Section Section 1.3. A learning method is supervised when its loss compares the network’s output to an externally given target label; it is unsupervised when there are no target labels at all (clustering, density estimation, dimension reduction); and it is self-supervised when, although no human-provided labels exist, the method manufactures its own supervisory target out of the available structure. Deep Equilibrium Nets are unsupervised in the narrow “no labels” sense, but the sharper and more useful description is equation-based self-supervision: at every state where we evaluate the model, the equilibrium conditions tell us what the correct relationship between today’s policy and tomorrow’s must be, namely a residual of zero, and that residual plays exactly the role a label plays in supervised learning. The DEQN objective introduced below is therefore a genuine regression loss, only against a target the model itself dictates rather than one handed to us as data.

With that in mind, the key conceptual shift introduced by Azinovic et al. (2022) is to replace the labeled training data of standard supervised learning with the structural equations of the economic model. In a generic dynamic stochastic economic model, the state of the economy at time tt is summarized by a vector xtRd\x_t \in \R^d, and agents choose policy (or decision) variables pt=p(xt)Rmp_t = p(\x_t) \in \R^m. The equilibrium is characterized by a system of functional equations

G(xt,p(xt),Et[H(xt+1,p(xt+1))])=0,xt,G\bigl(\x_t,\, p(\x_t),\, \mathbb{E}_t[H(\x_{t+1}, p(\x_{t+1}))]\bigr) = 0, \qquad \forall\, \x_t,

where GG encodes optimality conditions (e.g., Euler equations) and Et[]\mathbb{E}_t[\cdot] is the conditional expectation over next-period shocks via the transition law xt+1=T(xt,p(xt),εt+1)\x_{t+1} = T(\x_t, p(\x_t), \varepsilon_{t+1}). The fundamental challenge is that this is a functional equation: we seek functions p:RdRmp:\R^d \to \R^m rather than finite-dimensional parameter vectors. The DEQN approach parameterizes pp as a neural network and solves this functional equation via stochastic optimization (Figure Figure 2.4).

self-supervised

Figure 2.4:self-supervised

Supervised learning (left) versus DEQN training (right). Both paradigms train a parameter vector by minimizing a residual loss with SGD; the difference is the source of the training signal, labeled data (xi,yi)(\x_i, y_i) in the supervised case, structural equilibrium equations G(x,p(x))=0G(\x, p(\x)) = 0 in the DEQN case. No labeled solution data are required for DEQNs. For Brock--Mirman the right-hand side specializes to G(K,z)=1β(C/C)αzKα1G(K, z) = 1 - \beta\,(C/C') \cdot \alpha z' K'^{\alpha-1} with C=Nρ(K,z)C = \mathcal{N}_\rho(K, z) and K=zKαCK' = z K^\alpha - C; the network learns the policy by driving the squared mean of this residual to zero on simulated trajectories.

In the standard (supervised) paradigm, one minimizes the discrepancy between the network’s predictions and known target values. In the DEQN framework, the loss function is instead the squared norm of the equilibrium residual:

ρ=1Ni=1NG(xi,Nρ(xi))2,\boxed{ \ell_\rho = \frac{1}{N} \sum_{i=1}^{N} \bigl\| G\bigl(\x_i,\, \mathcal{N}_\rho(\x_i)\bigr) \bigr\|^2, }

where Nρ\mathcal{N}_\rho denotes a deep neural network with parameters ρ\rho that maps states to policies. If the residual loss is zero on a sufficiently rich set of states and the conditional expectations inside GG are evaluated accurately (rather than with a single shock realization), the network satisfies the equilibrium conditions on those states; global accuracy must always be assessed on independent validation trajectories and, where available, against benchmark solutions. The continuous-time analogue is the residual-loss formulation of Raissi et al. (2019), which inspired the PINN literature developed in Chapter Chapter 7. The training points xi\x_i are drawn not from an exogenous dataset, but from the model’s own simulated dynamics: the network learns on the ergodic distribution of the economy, concentrating computational effort on economically relevant regions of the state space. The idea of using simulated paths as the support for projection goes back to the parameterized-expectations algorithm of Marcet (1988) and Marcet & Marshall (1994), which the DEQN literature reframed as ergodic-set sampling to focus computation on economically relevant states; the numerical stability of stochastic-simulation projection was further developed by Judd et al. (2011).

2.2.0.1Connection to the modern overparameterized regime.

Note that DEQNs operate squarely in the “modern regime” of Section Section 1.10: the network often has pnp \gg n parameters relative to the mini-batch size nn. Unlike supervised learning with a fixed dataset, however, the training distribution is renewable: each episode can generate fresh state samples from the model’s own simulation. This changes the fixed-dataset double-descent analogy, but it does not remove the need for validation: pp can still exceed the effective sample budget seen during training, and a flexible network can fit residuals on a narrow simulated region. The practical lesson is to combine overparameterized networks and SGD with independent validation trajectories, held-out Euler residuals, and, where available, closed-form benchmarks.

2.3The DEQN Training Algorithm

Several features of this algorithm deserve emphasis:

2.3.0.1Episode length TsimT_{\mathrm{sim}}.

The simulated-trajectory length TsimT_{\mathrm{sim}} controls the effective size of the training set within an episode. Azinovic et al. (2022) use Tsim=10,000T_{\mathrm{sim}} = 10{,}000 time steps (split into many short mini-batches) in their 113-dimensional 56-agent OLG benchmark. Shorter episodes (e.g., Tsim=1000T_{\mathrm{sim}} = 1000) speed up early training because the distribution is re-drawn more frequently, at the cost of higher variance; longer episodes produce smoother gradient estimates but run the risk of stale data if the policy has drifted during training. A pragmatic rule is to start short, lengthen TsimT_{\mathrm{sim}} as the loss plateaus, and check out-of-sample accuracy on an independent simulated trajectory.

2.3.0.2Connection to continuous-time methods.

The DEQN residual loss is the discrete-time analogue of the Deep Galerkin Method (DGM) of Sirignano & Spiliopoulos (2018), which minimizes a PDE residual on neural-network-parameterized PDE solutions in continuous time, and of the deep BSDE solver of Han et al. (2018) (Han--Jentzen--E), which formulates the same problem as a backward stochastic differential equation. Within the discrete-time deep-learning toolkit, Maliar et al. (2021) unify lifetime-reward, Bellman-equation, and Euler-equation training into a single framework, paired with an “all-in-one” integration operator that estimates all conditional expectations from a single Monte Carlo realization; DEQN as developed here is the Euler-equation branch of that taxonomy. Chapters Chapter 7 and Chapter 8 develop the continuous-time analogues explicitly; the present chapter should be read as introducing the discrete-time machinery on which those continuous-time methods are built.

2.4The Brock--Mirman Benchmark

Having laid out the generic training loop, we now put it to work on a concrete model. To validate the DEQN methodology, Azinovic et al. (2022) begin with the stochastic growth model of Brock & Mirman (1972), which admits a closed-form solution and therefore serves as an ideal test case.

The social planner solves:

max{Ct,Kt+1}t=0E ⁣[t=0βtln(Ct)]s.t.Kt+1+Ct=ztKtα,lnzt+1=ϱlnzt+σzϵt+1,\max_{\{C_t,\, K_{t+1}\}_{t=0}^{\infty}} \E{\sum_{t=0}^{\infty} \beta^t \ln(C_t)} \quad\text{s.t.}\quad K_{t+1} + C_t = z_t K_t^\alpha, \quad \ln z_{t+1} = \varrho \ln z_t + \sigma_z \epsilon_{t+1},

where β(0,1)\beta \in (0,1) is the discount factor, α(0,1)\alpha \in (0,1) the capital share, ϱ[0,1)\varrho \in [0,1) the shock persistence, σz>0\sigma_z > 0 the shock volatility, and ϵt+1N(0,1)\epsilon_{t+1} \sim \mathcal{N}(0,1) i.i.d. innovations. Depreciation is full (δ=1\delta = 1).

2.4.0.1Derivation of the Euler equation.

To derive the optimality conditions, form the Lagrangian by attaching a multiplier βtλt\beta^t \lambda_t to the resource constraint at each date tt:

L=E ⁣[t=0βt[ln(Ct)+λt(ztKtαCtKt+1)]].\mathcal{L} = \E{\sum_{t=0}^{\infty} \beta^t \Bigl[\ln(C_t) + \lambda_t \bigl(z_t K_t^\alpha - C_t - K_{t+1}\bigr)\Bigr]}.

Since the planner chooses CtC_t and Kt+1K_{t+1} at each date, we take partial derivatives with respect to each:

FOC w.r.t. CtC_t:

LCt=βt ⁣(1Ctλt)=0λt=1Ct.\frac{\partial \mathcal{L}}{\partial C_t} = \beta^t\!\left(\frac{1}{C_t} - \lambda_t\right) = 0 \qquad\Longrightarrow\qquad \lambda_t = \frac{1}{C_t}.

FOC w.r.t. Kt+1K_{t+1}: The variable Kt+1K_{t+1} appears in the date-tt constraint (with coefficient -1) and in the date-(t ⁣+ ⁣1)(t\!+\!1) constraint via output zt+1Kt+1αz_{t+1}K_{t+1}^\alpha. Differentiating:

LKt+1=βt(λt)+βt+1Et ⁣[λt+1αzt+1Kt+1α1]=0.\frac{\partial \mathcal{L}}{\partial K_{t+1}} = \beta^t\bigl(-\lambda_t\bigr) + \beta^{t+1}\,\mathbb{E}_t\!\bigl[\lambda_{t+1}\,\alpha\, z_{t+1}\, K_{t+1}^{\alpha-1}\bigr] = 0.

Dividing both sides by βt\beta^t yields:

λt=βEt ⁣[λt+1αzt+1Kt+1α1].\lambda_t = \beta\,\mathbb{E}_t\!\bigl[\lambda_{t+1}\,\alpha\, z_{t+1}\, K_{t+1}^{\alpha-1}\bigr].

Finally, substituting λt=1/Ct\lambda_t = 1/C_t from (2.6) into this expression gives the Euler equation:

1Ct=βEt ⁣[αzt+1Kt+1α1Ct+1].\frac{1}{C_t} = \beta\,\mathbb{E}_t\!\left[ \frac{\alpha\, z_{t+1}\, K_{t+1}^{\alpha-1}}{C_{t+1}} \right].

The economic intuition is transparent: the left-hand side is the marginal utility cost of saving one additional unit today; the right-hand side is the expected discounted marginal utility gain from the extra output αzt+1Kt+1α1\alpha\, z_{t+1}\, K_{t+1}^{\alpha-1} that unit produces tomorrow. At the optimum, the planner is indifferent between consuming and saving at the margin.

2.4.0.2Analytical solution.

When depreciation is full (δ=1\delta = 1), this model admits a closed-form solution. One can verify by direct substitution into the Euler equation (2.9) that the optimal consumption policy is:

Ct=(1βα)ztKtα.C_t^\star = (1 - \beta\alpha)\, z_t K_t^\alpha.

The derivation proceeds by guessing that the value function takes the form V(K,z)=V0+BlnK+DlnzV(K,z) = V_0 + B\ln K + D\ln z (with the constant V0V_0 written as V0V_0 rather than AA to avoid clashing with the TFP / cohort-count uses of AA elsewhere in the script) and using the envelope theorem. Substituting this guess into the Bellman equation V(K,z)=maxC{lnC+βE ⁣[V(K,z)]}V(K,z) = \max_C \{\ln C + \beta \E{V(K',z')}\} with K=zKαCK' = zK^\alpha - C yields a system of equations for V0V_0, BB, and DD. Matching coefficients on lnK\ln K pins down B=α/(1αβ)B = \alpha/(1-\alpha\beta); matching on lnz\ln z pins down D=1/[(1αβ)(1βϱ)]D = 1/[(1-\alpha\beta)(1-\beta\varrho)] (so DD, unlike BB, depends on the shock persistence); and the constant V0V_0 then absorbs the remaining terms. The first-order condition 1/Ct=βB/Kt+11/C_t = \beta\,B/K_{t+1} together with Ct+Kt+1=ztKtαC_t + K_{t+1} = z_t K_t^\alpha delivers the linear savings rule Kt+1=βαztKtαK_{t+1} = \beta\alpha\, z_t K_t^\alpha, from which (2.10) follows immediately via the resource constraint. Notably, this closed-form solution holds regardless of the shock persistence ϱ\varrho for the savings rule itself, because zt+1z_{t+1} cancels in the ratio zt+1/Ct+1z_{t+1}/C_{t+1} under the linear consumption rule: Ct+1zt+1C_{t+1} \propto z_{t+1}, so zt+1/Ct+1=1/[(1βα)Kt+1α]z_{t+1}/C_{t+1} = 1/[(1-\beta\alpha) K_{t+1}^\alpha] no longer depends on the next-period shock.

2.4.0.3Persistent shocks.

When ϱ>0\varrho > 0, productivity shocks are autocorrelated and the state of the economy becomes the pair (Kt,zt)(K_t, z_t), since the current shock level ztz_t now carries information about future productivity. Under log utility with Cobb--Douglas production and full depreciation (δ=1\delta = 1), the analytical solution (2.10) continues to hold regardless of the persistence ϱ\varrho. The reason fits in one line: under the linear consumption rule Ct=(1βα)ztKtαC_t = (1-\beta\alpha)\,z_t K_t^\alpha we have zt+1/Ct+1=1/[(1βα)Kt+1α]z_{t+1}/C_{t+1} = 1/[(1-\beta\alpha)K_{t+1}^\alpha], so zt+1z_{t+1} cancels and the conditional expectation in the Euler equation (2.9) no longer depends on the next-period shock; the constant DD in the value function still depends on ϱ\varrho, but DD enters VV and cancels in the FOC for the savings rule. However, for more general preferences (e.g., CRRA with γ1\gamma \neq 1), partial depreciation (δ<1\delta < 1), or non-Cobb--Douglas production, the closed-form solution breaks down and numerical methods, and the DEQN approach in particular, become essential: the policy function C(Kt,zt)C(K_t, z_t) must be approximated rather than derived analytically. (Concretely: the deterministic companion notebook uses δ=1\delta = 1 so that the closed form applies and can be used as a benchmark; the stochastic companion notebook switches to δ=0.1\delta = 0.1, where it does not, and the policy is genuinely learned. Only the loss-kernel notebook of Section 2.9 returns to δ=1\delta = 1, precisely so that the closed-form savings rate s=αβs^\star = \alpha\beta is available as ground truth.)

2.4.0.4DEQN formulation.

We parametrize the consumption policy as Ct=Nρ(Kt,zt)C_t = \mathcal{N}_\rho(K_t, z_t) and define the residual

G(Kt,zt)=1βCtCt+1αzt+1Kt+1α1,G(K_t, z_t) = 1 - \beta \, \frac{C_t}{C_{t+1}} \cdot \alpha z_{t+1} K_{t+1}^{\alpha-1},

where zt+1z_{t+1} denotes a single realization of the next-period shock and Kt+1=ztKtαNρ(Kt,zt)K_{t+1} = z_t K_t^\alpha - \mathcal{N}_\rho(K_t, z_t). In the code listing below the network actually emits a savings share st(0,1)s_t\in(0,1) through a sigmoid head rather than CtC_t directly, which guarantees Ct>0C_t>0 and Kt+1>0K_{t+1}>0 at every training step; Section Section 2.5 explains the hard-vs-soft constraint split this exemplifies. Note that the Euler equation (2.9) involves the conditional expectation Et[]\mathbb{E}_t[\cdot], whereas the residual GG above is written for a single realization of zt+1z_{t+1}. Two approaches are common for handling this expectation:

  1. Path averaging: draw a single zt+1z_{t+1} per state, compute GG for that draw, and average G2G^2 over many states along the simulated path. This minimizes E(Kt,zt,zt+1)[G2]\mathbb{E}_{(K_t,z_t,z_{t+1})}[G^2] -- the squared pathwise residual. It is an unbiased Monte Carlo objective for that stronger loss, and Jensen’s inequality gives (Et[G])2Et[G2]\bigl(\mathbb{E}_t[G]\bigr)^2 \leq \mathbb{E}_t[G^2]. Thus G=0G=0 almost surely implies the conditional Euler equation, but the converse need not hold: a policy can have Et[G]=0\mathbb{E}_t[G]=0 while GG varies with the shock. In the log/full-depreciation Brock--Mirman benchmark the closed-form policy happens to drive the pathwise residual itself to zero, so this stronger target is harmless; in richer models it is a modeling choice. This is the approach used in the Brock--Mirman code listing below. Note that the pathwise-residual-zero coincidence is special to δ=1\delta = 1: for δ<1\delta < 1 (the calibration of the stochastic notebook) a converged policy has Eε[Gx]=0\mathbb{E}_\varepsilon[G\mid x]=0 but GG itself is nonzero realization-by-realization, so path averaging then minimizes a strictly stronger objective than the conditional Euler residual.

  2. Quadrature: for each state (Kt,zt)(K_t, z_t), approximate Et[]\mathbb{E}_t[\cdot] explicitly via deterministic nodes and weights (Section Section 2.6), form the residual from the estimated expectation, and then square. This targets (Et[G])2\bigl(\mathbb{E}_t[G]\bigr)^2 directly and is the approach used for the IRBC model of Chapter Chapter 3, where accurate expectations are critical for convergence.

Both approaches recover the analytical solution (2.10) to high accuracy, providing a rigorous validation of the methodology. Because convergence curves depend on the exact training run, random seed, and solver configuration, this manuscript does not include a hand-drawn convergence plot. In practice, one should report diagnostics from the actual notebook run: residual trajectories, held-out Euler errors, and the gap between the learned policy and the analytical benchmark. Figure Figure 2.5 sketches the qualitative shape one should expect to see.

Schematic, not measured: the qualitative convergence behavior typical of a successful Brock--Mirman DEQN run. An early high-residual phase reflects an untrained network feeling out the state space; a mid phase descends roughly exponentially as the policy locks onto the equilibrium structure; a late phase plateaus near the irreducible quadrature/training-noise floor. The dotted line marks the analytical benchmark below which the residual cannot be driven without higher-precision quadrature. For the actual numbers on a specific seed, consult the companion notebook.

Figure 2.5:Schematic, not measured: the qualitative convergence behavior typical of a successful Brock--Mirman DEQN run. An early high-residual phase reflects an untrained network feeling out the state space; a mid phase descends roughly exponentially as the policy locks onto the equilibrium structure; a late phase plateaus near the irreducible quadrature/training-noise floor. The dotted line marks the analytical benchmark below which the residual cannot be driven without higher-precision quadrature. For the actual numbers on a specific seed, consult the companion notebook.

2.4.0.5Relative Euler equation error.

Following Judd (1998) and Azinovic et al. (2022) (Eq. 43), the standard accuracy diagnostic for a DEQN is the relative Euler error, which measures the percentage consumption error implied by a violation of the Euler equation. For Brock--Mirman with log utility, the relative error at state xj=(Kj,zj)\x_j = (K_j, z_j) is

exjREE(ρ)  =  (βE[αz(Kj)α1/Cρ(xj)xj])1Cρ(xj)    1,e^{\mathrm{REE}}_{\x_j}(\rho) \;=\; \frac{ \bigl(\beta\,\mathbb{E}[\alpha\, z' (K_j')^{\alpha-1}/C'_\rho(\x'_j)\mid \x_j]\bigr)^{-1} }{C_\rho(\x_j)} \;-\; 1,

where Cρ=NρC_\rho = \mathcal{N}_\rho, KjK'_j is next-period capital under the network policy, the expectation conditions on xj\x_j, and the 1^{-1} inverts the marginal-utility relation u(c)=1/cu'(c) = 1/c. The value eREE=104e^{\mathrm{REE}} = 10^{-4} means the agent’s optimal consumption is mispriced by 0.01%0.01\%, independent of units or utility scale. This is the metric reported in Table 3 of Azinovic et al. (2022), where the trained DEQN achieves mean relative Euler errors of order 10-4 on the 113-dimensional 56-agent OLG benchmark.

def deqn_loss(model, K, z, beta, alpha, z_next):
    Y       = z * K**alpha                   # output today
    s       = model(K, z)                    # savings share in (0,1) via sigmoid
    C       = (1.0 - s) * Y                  # consumption today  (>0)
    K_next  = s * Y                          # capital tomorrow   (>0)
    Y_next  = z_next * K_next**alpha
    s_next  = model(K_next, z_next)
    C_next  = (1.0 - s_next) * Y_next        # consumption tomorrow
    # z_next: single draw; expectation via path averaging
    G = 1.0 - beta * (C / C_next) * alpha * z_next * K_next**(alpha-1)
    return tf.reduce_mean(G**2)

Representative DEQN loss for Brock--Mirman with path averaging. The network outputs a savings share s(0,1)s \in (0,1) via a sigmoid, which jointly enforces C>0C > 0 \emph{and} K>0K' > 0; softplus on CC alone would not, since C>zKαC > z K^\alpha would yield K<0K' < 0 and an undefined Kα1K'^{\,\alpha-1}.

A Gauss--Hermite variant (developed formally in Section 2.6.2) replaces the single shock draw z_next by a QQ-node weighted sum E[αzKα1/Cz]q=1QwqαzqKα1/C(K,zq)\mathrm{E}[\,\alpha z' K'^{\alpha-1}/C'\mid z\,] \approx \sum_{q=1}^{Q} w_q\,\alpha z_q' K'^{\alpha-1}/C'(K',z_q') with zq=zϱexp(σzεq)z_q' = z^\varrho\exp(\sigma_z\varepsilon_q) at the rescaled Hermite nodes εq\varepsilon_q; in practice Q=5Q=5 already drives the quadrature error below the training error. The autodiff companion notebook 03_Brock_Mirman_Uncertainty_AutoDiff_DEQN.ipynb implements this variant explicitly.

2.5Encoding Equilibrium Conditions: Hard vs. Soft Constraints

The Brock--Mirman DEQN loss above already used one design choice without dwelling on it: the network emits a savings share through a sigmoid rather than consumption directly. This is an instance of a general principle. Azinovic et al. (2022) §4.2.2 do not treat every equilibrium equation symmetrically in the loss (2.3); they split the conditions into two groups. We make the split explicit on the Brock--Mirman model of the previous section and then indicate how it generalizes.

Hard constraints, encoded in the architecture (never in the loss).

Some equations can be satisfied exactly for every state x=(K,z)\x = (K,z):

  • The resource / state-transition equation Kt+1=ztKtαCtK_{t+1} = z_t K_t^\alpha - C_t defines next-period capital from the network’s consumption policy. It is not minimized; it is evaluated as a closed-form function of xt\x_t and CtC_t.

  • The economic requirements Ct>0C_t > 0 and Kt+1>0K_{t+1} > 0 are jointly imposed by parameterizing the network’s output as a savings share st(0,1)s_t \in (0,1) via a sigmoid activation, sigmoid(z)=1/(1+ez)\mathrm{sigmoid}(z) = 1/(1+e^{-z}), and recovering both quantities in closed form from the resource constraint, Ct=(1st)ztKtαC_t = (1-s_t)\,z_t K_t^\alpha and Kt+1=stztKtαK_{t+1} = s_t\,z_t K_t^\alpha. A softplus head on CtC_t alone would guarantee Ct>0C_t>0 but not Kt+1>0K_{t+1}>0 (the network could output Ct>ztKtαC_t > z_t K_t^\alpha). This sigmoid-savings parameterization removes an entire class of infeasible candidate policies before training begins; see the code listing in Section Section 2.4 above.

Soft constraint, minimized in the loss.

The only equilibrium condition that cannot be enforced analytically is the Euler equation (2.9). The squared relative Euler error (2.12) is averaged over the mini-batch and driven toward zero by stochastic gradient descent.

This split is pedagogically important for three reasons: (i) only the genuinely non-closed-form conditions enter the loss, which speeds up training; (ii) it eliminates a family of bad local minima in which the network produces, e.g., slightly negative consumption or infeasible states; and (iii) it explains why the loss typically converges to a small but nonzero value even at the optimum, since the Euler residual is intrinsic and cannot be removed by re-parameterization. Figure Figure 2.6 summarizes this construction for Brock--Mirman.

Hard vs. soft constraints in the DEQN architecture for Brock--Mirman. The network \mathcal{N}_\rho reads the state (K_t,z_t) and emits a savings share s_t \in (0,1) via a sigmoid head. Both consumption C_t = (1-s_t)\,z_t K_t^\alpha and next-period capital K_{t+1} = s_t\,z_t K_t^\alpha are then defined in closed form (green, top), guaranteeing C_t>0 and K_{t+1}>0 simultaneously; a softplus head on C_t alone could not, since C_t > z_t K_t^\alpha would push K_{t+1}<0 and make K_{t+1}^{\alpha-1} undefined. Only the relative Euler-equation residual (red, bottom) is squared, averaged over the mini-batch, and minimized by SGD. This figure matches the code listing in Section  above. In richer models (OLG, Chapter ) the same split extends to firm first-order conditions, household budget constraints, and KKT multipliers with softplus heads and Fischer--Burmeister complementarity residuals.

Figure 2.6:Hard vs. soft constraints in the DEQN architecture for Brock--Mirman. The network Nρ\mathcal{N}_\rho reads the state (Kt,zt)(K_t,z_t) and emits a savings share st(0,1)s_t \in (0,1) via a sigmoid head. Both consumption Ct=(1st)ztKtαC_t = (1-s_t)\,z_t K_t^\alpha and next-period capital Kt+1=stztKtαK_{t+1} = s_t\,z_t K_t^\alpha are then defined in closed form (green, top), guaranteeing Ct>0C_t>0 and Kt+1>0K_{t+1}>0 simultaneously; a softplus head on CtC_t alone could not, since Ct>ztKtαC_t > z_t K_t^\alpha would push Kt+1<0K_{t+1}<0 and make Kt+1α1K_{t+1}^{\alpha-1} undefined. Only the relative Euler-equation residual (red, bottom) is squared, averaged over the mini-batch, and minimized by SGD. This figure matches the code listing in Section Section 2.4 above. In richer models (OLG, Chapter Chapter 5) the same split extends to firm first-order conditions, household budget constraints, and KKT multipliers with softplus heads and Fischer--Burmeister complementarity residuals.

2.5.0.1How the split generalizes.

In models with more equilibrium objects (the OLG benchmark of Azinovic et al. (2022), the IRBC model of Chapter Chapter 3, the HA economies of Chapter Chapter 6), the same logic extends: firm first-order conditions for prices (wt,rt)(w_t,r_t) are closed-form in the aggregate state, the household budget constraint algebraically determines consumption given savings, and each inequality constraint contributes a softplus-headed KKT multiplier paired with a Fischer--Burmeister complementarity residual in the soft part of the loss (Chapter Chapter 3). The Brock--Mirman case above is the minimal instance of the pattern.

2.5.0.2Market-clearing layers.

One can push this design philosophy further by encoding the market-clearing condition itself into the network as a dedicated output layer rather than minimizing a market-clearing residual in the loss: the network outputs unnormalized cohort savings (or shares) and a final layer rescales them so that aggregate clearing holds by construction. Azinovic-Yang & Žemlička (2024) use such a “market-clearing layer” in an OLG economy with rare disasters; this is the discrete-time counterpart of the PINN-style practice of baking conservation laws directly into the network output (Chapter Chapter 7). Sigmoid heads, softplus heads, Fischer--Burmeister residuals, and market-clearing rescaling layers thus form a small toolbox of architecture-level encodings that each move part of the equilibrium structure from the soft loss into the hard part of the network.

2.6Quadrature Rules for Conditional Expectations

2.6.1Why We End Up Integrating, and What Numerical Integration Is

2.6.1.1Why an integral always shows up.

Sooner or later, every dynamic stochastic model in this script presents the same algorithmic bottleneck: a forward-looking agent has to form an expectation over future shocks before any decision can be made. The Euler equation (2.9) hinges on Et[αzt+1Kt+1α1/Ct+1]\mathbb{E}_t[\,\alpha z_{t+1} K_{t+1}^{\alpha-1}/C_{t+1}\,]; the Bellman operator behind a value-function iteration evaluates E[V(s)s,a]\mathbb{E}[V(\bm{s}')\mid\bm{s},a] at every state and action; the relative Euler error (2.12) embeds the same expectation in the diagnostic; the integrated-assessment models of Chapter Chapter 11 require expected discounted utility under climate-shock distributions; and the structural-estimation moments of Chapter Chapter 10 are themselves expectations over simulated paths. Outside a handful of conjugate or fully-linear-quadratic-Gaussian special cases, none of these expectations admit closed-form expressions, so the integral E[g(ε)]=g(ε)μ(ε)dε\mathbb{E}[g(\bm\varepsilon)] = \int g(\bm\varepsilon)\mu(\bm\varepsilon)\,d\bm\varepsilon has to be approximated numerically by a finite, cheap-to-evaluate, ideally differentiable, weighted sum qwqg(εq)\sum_q w_q\, g(\bm\varepsilon_q). Choosing the nodes εq\bm\varepsilon_q and weights wqw_q well is the entire content of quadrature theory. The path-averaging recipe used in the Brock--Mirman code above is one such choice (a Monte Carlo sample average with one node per state along a simulated trajectory); the rest of this section makes the deterministic and quasi-random alternatives explicit and explains when each is preferable.

2.6.1.2Picture 1: a definite integral as area, and the Riemann/midpoint sum.

Strip away the economics. A definite integral abf(x)dx\int_a^b f(x)\,dx is the (signed) area between the graph of ff and the xx-axis on [a,b][a,b]. The simplest deterministic numerical rule, the midpoint rule, replaces this area by a stack of NN rectangles of equal width Δx=(ba)/N\Delta x = (b-a)/N, each one as tall as ff at the midpoint of its base (Figure Figure 2.7, left). Adding the rectangle areas yields

IN=Δxi=1Nf(ximid)  N  abf(x)dx,with error INI=O(N2)I_N = \Delta x \sum_{i=1}^{N} f\bigl(x_i^{\text{mid}}\bigr) \;\xrightarrow[N \to \infty]{}\; \int_a^b f(x)\,dx, \qquad \text{with error } |I_N - I| = \mathcal{O}(N^{-2})

for smooth ff. Trapezoid and Simpson rules refine the same idea by replacing each rectangle with a trapezoid or a parabolic arc and reach O(N2)\mathcal{O}(N^{-2}) and O(N4)\mathcal{O}(N^{-4}) respectively; Gauss--Hermite, the workhorse of Section 2.6.2, is the optimal rule when the integrand is multiplied by the Gaussian weight ex2e^{-x^2} and reaches degree-(2Q1)(2Q-1) exactness using only QQ nodes. The same tile-the-area idea generalizes to higher dimensions by laying out a Cartesian grid. With NN nodes per coordinate the cost is NdN^d; equivalently, with M=NdM=N^d total nodes the midpoint rate becomes O(M2/d)\mathcal{O}(M^{-2/d}). Notice that it is the exponent of the rate, not just the constant, that degrades from -2 in 1D to 2/d-2/d in dd-D Cartesian, the curse of dimensionality made quantitative. This is the explicit form of the curse of dimensionality flagged in Section Section 2.1, and it motivates both the Stroud monomial rules of Section 2.6.3 and the randomized methods of Section 2.6.4.

2.6.1.3Picture 2: throwing darts to estimate π\pi.

An entirely different idea is to abandon the grid and approximate the integral by a sample average: draw NN random points u1,,uN\bm{u}_1, \ldots, \bm{u}_N uniformly from the integration domain Ω\Omega and report IN=vol(Ω)1Ni=1Nf(ui)I_N = \mathrm{vol}(\Omega)\cdot \tfrac{1}{N}\sum_{i=1}^{N} f(\bm{u}_i). This is plain Monte Carlo (MC), and its error decays as O(N1/2)\mathcal{O}(N^{-1/2}) with a rate that is independent of the dimension dd, although the variance constant can still worsen with dimension. This is why MC dominates deterministic grids when dd is large. The cleanest illustration uses the unit square [0,1]2[0,1]^2 and the indicator function of a quarter-disc:

π4  =  01 ⁣ ⁣011[x2+y21]dxdy    NinN,π^N  =  4NinN,\frac{\pi}{4} \;=\; \int_0^1\!\!\int_0^1 \mathbf{1}\bigl[x^2 + y^2 \leq 1\bigr]\, dx\, dy \;\approx\; \frac{N_{\mathrm{in}}}{N}, \qquad \widehat{\pi}_N \;=\; 4\cdot \frac{N_{\mathrm{in}}}{N},

where NinN_{\mathrm{in}} counts how many of the NN uniform “darts” land inside the quarter-circle x2+y21x^2+y^2 \leq 1 (Figure Figure 2.7, right). With N=100N=100 a typical run gives π^1003.04\widehat{\pi}_{100} \approx 3.04 (about 3% off); with N=106N=10^6 a typical run gives π^1063.1417\widehat{\pi}_{10^6} \approx 3.1417 (about 10-4 off). The error shrinks as O(1/N)\mathcal{O}(1/\sqrt{N}), requiring a hundredfold increase in NN to gain one extra decimal of accuracy, which is glacially slow compared to O(N4)\mathcal{O}(N^{-4}) for Simpson’s rule on a smooth 1D integrand. But the MC rate has no dependence on the dimension of the domain: replacing the quarter-disc by a dd-dimensional unit ball would leave the rate untouched, while the deterministic grid would suffer the NdN^d cost explosion of the curse of dimensionality. This is what makes MC and its quasi-random refinement (QMC, Section 2.6.4) the natural tools for the conditional expectations encountered in DEQNs at d10d \gtrsim 10.

Two paradigms for numerical integration that underlie every rule in this section. Deterministic tiling (left) is exact and highly accurate at low dimension but suffers exponential cost growth in d. Random sampling (right) is dimension-independent in its error rate but slow to converge in any single dimension. The Gauss--Hermite, monomial, and QMC rules of §-- are sophisticated descendants of the two ideas, designed to combine deterministic accuracy with manageable cost in d.

Figure 2.7:Two paradigms for numerical integration that underlie every rule in this section. Deterministic tiling (left) is exact and highly accurate at low dimension but suffers exponential cost growth in dd. Random sampling (right) is dimension-independent in its error rate but slow to converge in any single dimension. The Gauss--Hermite, monomial, and QMC rules of §Section 2.6.2--Section 2.6.4 are sophisticated descendants of the two ideas, designed to combine deterministic accuracy with manageable cost in dd.

2.6.1.4Where the rest of this section is going.

With this picture in hand, the design of every quadrature rule in the literature can be read as an answer to two questions: (i) where do we place the nodes εq\bm\varepsilon_q, and (ii) what weights wqw_q do we attach to them? Tensor-product Gauss--Hermite (Section 2.6.2) places nodes deterministically on a Cartesian grid of Hermite roots; the Stroud-3 monomial rule of Section 2.6.3 places only 2d2d nodes on the principal axes and accepts a controlled bias on fourth-order moments in exchange for linear-in-dd scaling; the QMC construction of Section 2.6.4 places nodes from a low-discrepancy sequence in the unit cube and pulls them back through the inverse CDF. The cost--accuracy trade-offs differ dramatically with dd, as Table Table 2.2 (page ) makes concrete. Chapter Chapter 3 returns to these numbers when scaling DEQNs to multi-country economies.

2.6.2Tensor-Product Gauss--Hermite

The classical Gauss--Hermite rule approximates integrals against the weight function ex2e^{-x^2}:

f(x)ex2dxq=1Qw~qf(xq),\int_{-\infty}^{\infty} f(x)\, e^{-x^2}\, dx \approx \sum_{q=1}^{Q} \tilde{w}_q\, f(x_q),

where the QQ nodes {xq}\{x_q\} are the roots of the QQ-th Hermite polynomial HQ(x)H_Q(x) and the weights are w~q=2Q1Q!π/(Q2[HQ1(xq)]2)\tilde{w}_q = 2^{Q-1} Q!\sqrt{\pi}/(Q^2 [H_{Q-1}(x_q)]^2). For expectations under the standard normal distribution, we apply the change of variables ε=2x\varepsilon = \sqrt{2}\, x to obtain:

E ⁣[h(ε)]=h(ε)eε2/22πdεq=1Qwqh(εq),wq=w~qπ,εq=2xq.\E{h(\varepsilon)} = \int_{-\infty}^{\infty} h(\varepsilon)\frac{e^{-\varepsilon^2/2}}{\sqrt{2\pi}}\,d\varepsilon \approx \sum_{q=1}^{Q} w_q\, h(\varepsilon_q), \qquad w_q = \frac{\tilde{w}_q}{\sqrt{\pi}},\quad \varepsilon_q = \sqrt{2}\,x_q.

The 1D rule (2.16) is exact for univariate polynomials in ε\varepsilon of degree at most 2Q12Q-1, an attractive accuracy guarantee for smooth integrands.

2.6.2.1Worked example: Brock--Mirman Euler expectation.

In the Brock--Mirman model, the conditional expectation in the Euler equation (2.9) is one-dimensional:

Et ⁣[αzt+1Kt+1α1Ct+1],zt+1=ztϱexp(σzεt+1),εt+1N(0,1).\mathbb{E}_t\!\left[ \frac{\alpha\, z_{t+1}\,K_{t+1}^{\alpha-1}}{C_{t+1}} \right], \qquad z_{t+1}=z_t^{\varrho}\exp(\sigma_z\varepsilon_{t+1}), \quad \varepsilon_{t+1}\sim\mathcal{N}(0,1).

Replacing path averaging by Gauss--Hermite with Q=5Q=5 nodes turns the residual (2.9) into a deterministic function of (Kt,zt)(K_t, z_t) and the network parameters, eliminating the shock noise inside the loss while costing only five extra forward passes per state. This is the simplest concrete instance of the general construction below and a useful sanity check: at Q=5Q=5 the quadrature error is negligible relative to the training error in this benchmark, so a well-trained network should match the analytical Brock--Mirman policy to numerical tolerance.

For shock dimension d>1d > 1, the multivariate expectation is computed via a tensor product of one-dimensional rules. Each multi-dimensional quadrature node is a dd-tuple (ε1q1,,εdqd)(\varepsilon_1^{q_1}, \ldots, \varepsilon_d^{q_d}) with weight j=1dwqj\prod_{j=1}^{d} w_{q_j}. The total number of quadrature nodes grows as QdQ^{d}, which becomes prohibitive once dd is large Judd, 1998. For d10d \geq 10, the monomial rule of Section 2.6.3 or sparse-grid (Smolyak) alternatives can substantially reduce the computational cost: Gerstner & Griebel (1998) introduce the construction in numerical analysis, Heiss & Winschel (2008) apply it to econometric likelihoods, and Brumm & Scheidegger (2017) develop adaptive sparse grids tailored to dynamic-programming and equilibrium computations. The polynomial-versus-exponential gap depends on smoothness assumptions and the effective dimension of the integrand: sparse grids attain polynomial rates for sufficiently regular functions; QMC achieves close-to-O(1/M)\mathcal{O}(1/M) rates under randomized constructions and finite-variation / low-effective-dimension conditions Sobol', 1967Niederreiter, 1992Owen, 1995Caflisch, 1998Novak & Woźniakowski, 2008, but neither method is universally polynomial.

2.6.3Monomial (Stroud-3) Cubature with Linear Scaling

The exponential cost QdQ^{d} of the previous subsection is the price paid for being exact in the highest-degree univariate polynomial along each coordinate. In equilibrium-residual applications, however, the integrand h(ε)h(\bm\varepsilon') obtained by composing the policy network with model primitives is typically only mildly nonlinear in the shock vector, and reproducing every cross-coordinate monomial of total degree up to 2Q12Q-1 is overkill. Once one is willing to give up that high degree, dramatically cheaper rules become available.

The simplest such rule is the Stroud degree-3 monomial cubature Stroud, 1971, a cornerstone of the toolkit reviewed in Judd (1998) [§7.5] and Maliar & Maliar (2014) [§5.3] for large-scale dynamic models. For a dd-dimensional standard normal expectation it uses only 2d2d nodes, all on the principal axes:

E ⁣[h(ε)]    12dk=1d[h(+dek)  +  h(dek)],\E{h(\bm\varepsilon')} \;\approx\; \frac{1}{2d} \sum_{k=1}^{d}\Bigl[\, h\bigl(+\sqrt{d}\,\bm{e}_k\bigr) \;+\; h\bigl(-\sqrt{d}\,\bm{e}_k\bigr)\,\Bigr],

where ek\bm{e}_k is the kk-th standard basis vector in Rd\R^{d}. All weights are equal to 1/(2d)1/(2d), and the nodes lie at the common radius d\sqrt{d} from the origin (so they expand outward as the dimension grows, consistent with the fact that the bulk of a high-dimensional Gaussian sits in a thin spherical shell, recalling the dimension-of-volumes discussion in Section Section 2.1). At d=1d=1, the rule collapses to the two-point cubature {+1,1}\{+1, -1\} with weights {12,12}\{\tfrac12, \tfrac12\}, exact for polynomials in ε\varepsilon up to degree three.

2.6.3.1Why it works (and where it stops).

Rule (2.18) integrates every monomial in ε\bm\varepsilon' of total degree at most three exactly: the first moments vanish by ±-symmetry, the variances E ⁣[εi2]\E{\varepsilon_i'^{\,2}} all evaluate to 1 (each axis contributes 12d2d=1\tfrac{1}{2d} \cdot 2 \cdot d = 1), the cross-moments E ⁣[εiεj]\E{\varepsilon_i' \varepsilon_j'} (iji \neq j) vanish because each node has only one nonzero coordinate, and the third moments vanish again by symmetry. Higher moments are not exact: E ⁣[εi4]\E{\varepsilon_i'^{\,4}} is reproduced as dd rather than the truth 3, so the rule introduces a controlled bias proportional to the integrand’s fourth-order shock content. In the IRBC class of Chapter Chapter 3 this bias is empirically negligible at the Euler-equation tolerance one cares about (103\sim 10^{-3}); see Pichler (2011) for a careful demonstration on a multi-country RBC and Maliar & Maliar (2014) [Table 5] for a broader cost--accuracy benchmark. When higher accuracy is required, Stroud’s degree-5 monomial rule with 2d2+12d^2 + 1 nodes still scales polynomially, not exponentially Judd, 1998.

2.6.3.2Cost comparison.

At Q=3Q=3 nodes per dimension, the tensor-product rule of Section 2.6.2 costs 3d3^{d} evaluations per state, while (2.18) costs 2d2d. Table Table 2.2 contrasts the two for the dimensionalities encountered later in the script. The monomial rule is the default integration scheme behind the production DEQN code for IRBC-type models in Chapter Chapter 3 once the shock dimension exceeds about five, dropped in by replacing a single quadrature kernel.

Table 2.2:Quadrature cost per residual evaluation, tensor-product Gauss--Hermite (Q=3Q=3) versus the Stroud-3 monomial rule of equation (2.18), as a function of the shock dimension dd. The tensor-product cost grows exponentially in dd; the monomial cost grows linearly. Brock--Mirman (d=1d=1) is the simplest case where the two rules agree to within a factor of 1.5; the gap explodes once the shock vector is multi-dimensional.

shock dim. ddwhere it appearstensor-product 3d3^{d}monomial 2d2dspeed-up
1Brock--Mirman (Ch. Chapter 2)321.5×\sim 1.5\times
32-country IRBC (Ch. Chapter 3)2765×\sim 5\times
65-country IRBC7291260×\sim 60\times
1110-country IRBC177,147177{,}147228,000×\sim 8{,}000\times
2120-country IRBC1010\sim 10^{10}422×108×\sim 2 \times 10^8\times
5150-country IRBC1024\sim 10^{24}1021022×\sim 10^{22}\times

2.6.4Inverse-CDF Transform and Quasi--Monte Carlo

For very high-dimensional expectations (d10d \geq 10), even sparse-grid rules can become expensive, and, when the integrand contains material fourth-order shock content, the bias of the monomial rule (2.18) starts to dominate. An effective alternative is Quasi-Monte Carlo (QMC) integration combined with the inverse-CDF transform Judd, 1998. This method maps the unbounded domain of normal shocks Rd\R^{d} to the unit hypercube [0,1]d[0,1]^{d} using the inverse normal CDF:

E ⁣[f(ε)]1Mm=1Mf(Φ1(um)),\E{f(\bm{\varepsilon})} \approx \frac{1}{M} \sum_{m=1}^{M} f\bigl(\Phi^{-1}(\bm{u}_m)\bigr),

where εRd\bm{\varepsilon} \in \R^{d} is the vector of i.i.d. standard normal shocks, Φ1\Phi^{-1} is the element-wise inverse standard normal CDF, and um\bm{u}_m are points from a low-discrepancy sequence (e.g., Sobol) on [0,1]d[0,1]^{d}. For sufficiently smooth integrands, randomized QMC often achieves error rates close to O(1/M)\mathcal{O}(1/M) up to logarithmic factors, compared with the O(1/M)\mathcal{O}(1/\sqrt{M}) rate of standard Monte Carlo. This makes QMC highly effective in many high-dimensional applications, including the multi-country IRBC of Chapter Chapter 3. The companion notebook 07_Genz_Approximation_and_Loss_Functions.ipynb explores QMC integration using the Genz test functions, a standard suite of integrands for benchmarking high-dimensional quadrature methods.

2.7Automatic Differentiation for DEQNs

The quadrature rules of the previous section answer the question “how do we evaluate the expectation in the Euler equation?” This section answers the companion question that every DEQN application faces: “how do we evaluate the derivatives in the Euler equation, the marginal utility u(Ct)u'(C_t), the marginal product Y/K\partial Y / \partial K, and the derivative of the value function V(K)V'(K)?” For the textbook Brock--Mirman model of Section 2.4 we derived all of these by hand. But every variation of the model, whether CRRA utility in place of log, CES production in place of Cobb--Douglas, a capital adjustment cost, or a borrowing constraint, would force us to redo pages of algebra before writing the loss. Automatic differentiation (AD) removes that tax: the user writes only the period payoff Π\Pi of the model once, and the first-order conditions, the envelope theorem, and all higher-order sensitivities are computed exactly from the code that evaluates Π\Pi. The result is not a numerical approximation of the Euler equation: it is the textbook Euler equation, produced by a different mechanism.

2.7.1Three ways to take a derivative

There are three generic ways to compute a derivative of a programmable function f:RnRf:\R^n \to \R at a point x0x_0: symbolic, finite differences, and autodiff. Each has a distinct profile.

2.7.1.1Symbolic differentiation

uses a computer algebra system (Mathematica, SymPy) to manipulate the algebraic expression for ff into an algebraic expression for ff'. Its output is exact and human-readable, and for clean analytic formulas it is ideal. Its two weaknesses are (i) expression swell: the chain rule applied to a deeply composed expression can produce a result with many more terms than the original, sometimes enough to defeat further symbolic manipulation; and (ii) it cannot handle arbitrary code with loops and conditional branches.

2.7.1.2Finite differences (FD)

approximates the derivative by the definition itself. For the common central-difference variant,

f^(x0)  =  f(x0+h)f(x0h)2h  =  f(x0)  +  16f(x0)h2  +  O(h4).\widehat{f'}(x_0) \;=\; \frac{f(x_0 + h) - f(x_0 - h)}{2\,h} \;=\; f'(x_0) \;+\; \tfrac{1}{6}\,f'''(x_0)\,h^2 \;+\; \mathcal{O}(h^4).

The geometric content of this formula is the mundane fact that the slope of a smooth curve at a point can be approximated by the slope of a nearby secant line. Figure Figure 2.8 makes this concrete: the true tangent at x0x_0 (red) is approximated by the secant connecting (x0h,f(x0h))(x_0 - h, f(x_0-h)) to (x0+h,f(x0+h))(x_0+h, f(x_0+h)) (blue dashed). As hh shrinks the secant rotates toward the tangent, and its slope converges to f(x0)f'(x_0); taking hh too small, however, forces the numerator f(x0+h)f(x0h)f(x_0+h) - f(x_0-h) to become the difference of two nearly equal floating-point numbers, at which point catastrophic cancellation dominates.

Geometric meaning of a central finite difference. The derivative f'(x_0) is the slope of the tangent to f at x_0 (red). Finite differences replace this slope by the slope of the secant through the two nearby points (x_0 - h, f(x_0-h)) and (x_0 + h, f(x_0+h)) (blue, dashed). For small h, the secant approaches the tangent at rate \mathcal{O}(h^2); for h too small, the subtraction in the numerator becomes unreliable in finite-precision arithmetic and the approximation degrades. Figure  plots the resulting error against h.

Figure 2.8:Geometric meaning of a central finite difference. The derivative f(x0)f'(x_0) is the slope of the tangent to ff at x0x_0 (red). Finite differences replace this slope by the slope of the secant through the two nearby points (x0h,f(x0h))(x_0 - h, f(x_0-h)) and (x0+h,f(x0+h))(x_0 + h, f(x_0+h)) (blue, dashed). For small hh, the secant approaches the tangent at rate O(h2)\mathcal{O}(h^2); for hh too small, the subtraction in the numerator becomes unreliable in finite-precision arithmetic and the approximation degrades. Figure Figure 2.9 plots the resulting error against hh.

Two sources of error compete as hh changes. The truncation error 16f(x0)h2\tfrac{1}{6}f'''(x_0)\,h^2 shrinks as hh decreases; the rounding error associated with computing the near-zero subtraction f(x0+h)f(x0h)f(x_0+h)-f(x_0-h) in floating-point arithmetic grows as ϵ/h\epsilon/h, where ϵ2.21016\epsilon \approx 2.2\cdot 10^{-16} is machine precision in double precision. The total error is minimized at the step size that balances the two,

hϵ1/3,minimum errorϵ2/31011.h^\star \sim \epsilon^{1/3},\qquad\text{minimum error}\quad \sim \epsilon^{2/3} \approx 10^{-11}.

The practical consequence is that central FD loses roughly five digits of precision relative to the input precision. For second derivatives the loss is worse: Hessians obtained by FD are usually unusable for precision-critical work. FD has one undeniable advantage: it treats ff as a black box, requiring only the ability to evaluate ff. For debugging or for a one-off comparative-statics check it is fine. For training a DEQN, where the gradient is used millions of times, it is badly suboptimal. Figure Figure 2.9 makes the trade-off visible.

The classic U-curve of central finite differences, here for f(x) = \mathrm{e}^x at x_0 = 1. Truncation error (red, dashed) falls as h^2; roundoff error (green, dashed) grows as \epsilon / h. Their sum (blue) has a minimum at h^\star \approx 5 \cdot 10^{-6}, where the achievable precision is around 10^{-11}, roughly five digits worse than the input precision. Autodiff, by contrast, attains machine precision at no step-size tuning.

Figure 2.9:The classic U-curve of central finite differences, here for f(x)=exf(x) = \mathrm{e}^x at x0=1x_0 = 1. Truncation error (red, dashed) falls as h2h^2; roundoff error (green, dashed) grows as ϵ/h\epsilon / h. Their sum (blue) has a minimum at h5106h^\star \approx 5 \cdot 10^{-6}, where the achievable precision is around 10-11, roughly five digits worse than the input precision. Autodiff, by contrast, attains machine precision at no step-size tuning.

2.7.1.3Automatic differentiation

exploits the fact that every piece of Python (or any other imperative language) that evaluates a numerical function is, at run-time, a composition of elementary operations (++, -, ×\times, //, exp\exp, log\log, sin\sin, cos\cos, \ldots) whose derivatives are known in closed form. The chain rule composes these local derivatives into the derivative of the whole program, exactly. The cost is a small multiple of one function evaluation, typically 2--4×4\times in reverse mode, and is independent of the number of inputs. There is no step-size hh to tune, and the result is exact to machine precision. For ML-scale problems with 106 parameters and a scalar loss, AD is the only feasible method. The classic computer-science references are Baydin et al. (2018) and Margossian (2019); both motivate AD from machine-learning workloads and lay out the forward/reverse-mode taxonomy used below.

2.7.1.4Where do local derivatives come from? A tiny built-in table.

It is worth pausing on a question that beginners reasonably ask: “How does the computer know that the derivative of sin(x)\sin(x) is cos(x)\cos(x)?” The answer is deflating but honest: it does not compute this from first principles. Every autodiff framework ships with a short, hand-written table of derivatives for the elementary operations it supports, a few dozen entries such as those in Table Table 2.3. When your program executes y=sin(x)y = \sin(x), the framework looks up the rule “gradient of sin\sin is cos\cos” in that table and records cos(x)\cos(x) on the edge of the computational graph. The chain rule then composes those table entries. Autodiff is not a theorem prover; it is a library of primitive rules plus an automatic application of the chain rule. This explains both its power (every program expressible in terms of these primitives is differentiable) and its limits (genuinely new functions require extending the table).

Table 2.3:A representative slice of the primitive-derivative table that every autodiff framework ships with. TensorFlow, PyTorch, and JAX each include a few hundred such entries: one per elementary operation they support. User code composes these primitives; the chain rule composes their derivatives. When an economist writes the Brock--Mirman period payoff Π(Kt,Kt+1)=log(Ktα+(1δ)KtKt+1)\Pi(K_t, K_{t+1}) = \log(K_t^{\alpha} + (1-\delta)K_t - K_{t+1}), only entries from this table are invoked; no calculus is performed at runtime.

PrimitiveLocal derivative (built in)Where to find it
y=x+zy = x + zy/x=1, y/z=1\partial y/\partial x = 1,\ \partial y/\partial z = 1tf.math.add
y=xzy = x \cdot zy/x=z, y/z=x\partial y/\partial x = z,\ \partial y/\partial z = xtf.math.multiply
y=xαy = x^{\alpha}y/x=αxα1\partial y/\partial x = \alpha\,x^{\alpha-1}tf.math.pow
y=exp(x)y = \exp(x)y/x=exp(x)\partial y/\partial x = \exp(x)tf.math.exp
y=log(x)y = \log(x)y/x=1/x\partial y/\partial x = 1/xtf.math.log
y=sin(x)y = \sin(x)y/x=cos(x)\partial y/\partial x = \cos(x)tf.math.sin
y=cos(x)y = \cos(x)y/x=sin(x)\partial y/\partial x = -\sin(x)tf.math.cos
y=ReLU(x)y = \mathrm{ReLU}(x)y/x=1{x>0}\partial y/\partial x = \mathbf{1}\{x > 0\}tf.nn.relu

This explanation also makes the limits of AD concrete. Operations absent from the table (a black-box solver call, a non-differentiable step like sorting or argmax) require either a custom rule (via @tf.custom_gradient) or a redesign of the code. The pitfalls in Section 2.7.4 are essentially cases where either the table’s entry is degenerate at a point (kinks, x/x\partial|x|/\partial x at x=0x=0) or the composed gradient is numerically unstable even though each local rule is correct.

2.7.2Computational graph; forward and reverse modes

Every numerical function, once evaluated, corresponds to a directed acyclic computational graph whose nodes are elementary operations and whose edges carry intermediate values. Take the toy example y=f(x)=x2+sin(x)y = f(x) = x^2 + \sin(x) evaluated at x0=2x_0=2:

x    v1=x2    y=v1+v2,x    v2=sin(x)    y=v1+v2.x \;\to\; v_1 = x^2 \;\to\; y = v_1 + v_2,\qquad x \;\to\; v_2 = \sin(x) \;\to\; y = v_1 + v_2.

The values along the graph are v1=4v_1=4, v2=sin(2)0.909v_2=\sin(2) \approx 0.909, y4.909y \approx 4.909. The edges carry local derivatives: v1/x=2x=4\partial v_1/\partial x = 2x = 4, v2/x=cos(x)0.416\partial v_2/\partial x = \cos(x) \approx -0.416, y/v1=y/v2=1\partial y/\partial v_1 = \partial y/\partial v_2 = 1. AD combines the edge derivatives by the chain rule in one of two traversal orders. Figure Figure 2.10 shows the graph together with both traversals.

The two modes of autodiff on y = x^2 + \sin(x) at x = 2. Top: forward mode carries a derivative tag \dot v = \partial v / \partial x alongside each value and reads \dot y = dy/dx at the output. Bottom: reverse mode evaluates f forward, stores the graph, then walks backwards with \bar v = \partial y / \partial v and reads \bar x = dy/dx at the input. Both deliver 3.584, equal to f'(2) = 2 \cdot 2 + \cos(2) at machine precision. Forward mode scales linearly with the number of inputs; reverse mode scales linearly with the number of outputs, which is why it wins for DEQN training (scalar loss, many parameters).

Figure 2.10:The two modes of autodiff on y=x2+sin(x)y = x^2 + \sin(x) at x=2x = 2. Top: forward mode carries a derivative tag v˙=v/x\dot v = \partial v / \partial x alongside each value and reads y˙=dy/dx\dot y = dy/dx at the output. Bottom: reverse mode evaluates ff forward, stores the graph, then walks backwards with vˉ=y/v\bar v = \partial y / \partial v and reads xˉ=dy/dx\bar x = dy/dx at the input. Both deliver 3.584, equal to f(2)=22+cos(2)f'(2) = 2 \cdot 2 + \cos(2) at machine precision. Forward mode scales linearly with the number of inputs; reverse mode scales linearly with the number of outputs, which is why it wins for DEQN training (scalar loss, many parameters).

2.7.2.1Forward mode

carries a derivative tag v˙\dot v next to every value vv and updates both in a single forward pass through the graph. One seeds x˙=1\dot x = 1 at the input; by the chain rule, v˙1=4\dot v_1 = 4, v˙2=0.416\dot v_2 = -0.416, and y˙=v˙1+v˙2=3.584\dot y = \dot v_1 + \dot v_2 = 3.584, which is exactly f(2)=2x+cosxf'(2) = 2x + \cos x evaluated at x=2x=2. The cost is one extra float per variable and one extra pass through the graph. For a function with nn inputs and mm outputs, the cost of the full Jacobian is O(n)\mathcal{O}(n) forward passes; forward mode is therefore attractive when the number of inputs is small.

2.7.2.2Reverse mode

first evaluates ff forward, storing the computational graph and all intermediate values. It then walks the graph backwards, carrying a sensitivity vˉ=y/v\bar v = \partial y / \partial v along each edge. Seeded with yˉ=1\bar y = 1 at the output, the backward pass accumulates

vˉi  =  j:vivjvˉjvjvi,\bar v_i \;=\; \sum_{j:\, v_i \to v_j} \bar v_j \cdot \frac{\partial v_j}{\partial v_i},

yielding at the input node xˉ=vˉ14+vˉ2(0.416)=14+1(0.416)=3.584\bar x = \bar v_1 \cdot 4 + \bar v_2 \cdot (-0.416) = 1\cdot 4 + 1\cdot(-0.416) = 3.584. The computational cost is 2--4×4\times a single forward pass, and, crucially, one reverse pass produces the full gradient with respect to all inputs simultaneously. Reverse mode is what tf.GradientTape, torch.autograd, and jax.grad use by default. The price paid is memory: the entire forward graph must be stored for the backward pass. For extremely long simulations this can be a binding constraint; we return to the point in the pitfalls below.

2.7.2.3Which mode for DEQNs.

The DEQN loss is a scalar mean-squared Euler residual, computed from a network with n103n \sim 10^3 to 105 parameters. Reverse mode produces the full gradient with respect to all network parameters in 2--4×4\times one forward pass, a speedup of n/4n/4 over forward mode. This is the same reason reverse mode is the workhorse of deep-learning training: it exactly fits the “few outputs, many inputs” regime.

2.7.3The autodiff Euler residual

We now apply autodiff to the very object we spent Section 2.4 deriving by hand: the Euler equation of the Brock--Mirman planner. Define the period payoff as a function of the current state and the current choice:

Π(Kin,Kout,zin)  =  u ⁣(zinKinα+(1δ)Kin    Kout),\Pi(K_{\text{in}}, K_{\text{out}}, z_{\text{in}}) \;=\; u\!\left(\,z_{\text{in}} K_{\text{in}}^{\alpha} + (1-\delta)K_{\text{in}} \;-\; K_{\text{out}}\,\right),

where the three argument slots are the state KinK_{\text{in}}, the choice KoutK_{\text{out}}, and the exogenous shock zinz_{\text{in}} (the same slot names the code listing below uses); applying Π\Pi at a date tt means plugging KtK_t into slot 1, Kt+1K_{t+1} into slot 2, and ztz_t into slot 3. This is the only place the model enters. Change uu, YY, or δ\delta and nothing else in what follows needs to move. The Bellman equation of Section 2.4 then reads

V(Kt,zt)  =  maxKt+1  Π(Kt,Kt+1,zt)+βE ⁣[V(Kt+1,zt+1)zt].V(K_t, z_t) \;=\; \max_{K_{t+1}}\;\Pi(K_t, K_{t+1}, z_t) + \beta\,\mathbb{E}\!\left[V(K_{t+1}, z_{t+1})\,\big|\,z_t\right].

2.7.3.1Notation: what do 1Π\partial_1\Pi and 2Π\partial_2\Pi mean?

Throughout this section the subscript names the slot of Π\Pi being differentiated, not a time index. With the three-slot definition Π(Kin,Kout,zin)\Pi(K_{\text{in}},K_{\text{out}},z_{\text{in}}) of (2.24), we write

1Π    ΠKin,2Π    ΠKout.\partial_1\Pi \;\equiv\; \frac{\partial\Pi}{\partial K_{\text{in}}}, \qquad \partial_2\Pi \;\equiv\; \frac{\partial\Pi}{\partial K_{\text{out}}}.

The first slot KinK_{\text{in}} is the state, the second slot KoutK_{\text{out}} is the choice, the third slot zinz_{\text{in}} is an exogenous parameter and is not differentiated. An expression like 2Π(Kt,Kt+1,zt)\partial_2\Pi(K_t,K_{t+1},z_t) therefore denotes the derivative of Π\Pi in its second slot, evaluated with KtK_t plugged into slot 1, Kt+1K_{t+1} into slot 2, and ztz_t into slot 3, so it equals Π/Kt+1\partial\Pi/\partial K_{t+1}. The expression 1Π(Kt+1,Kt+2,zt+1)\partial_1\Pi(K_{t+1},K_{t+2},z_{t+1}) denotes the derivative in the first slot, evaluated at the time-(t+1)(t+1) state pair, so it also equals Π/Kt+1\partial\Pi/\partial K_{t+1} but for a different physical reason (because Kt+1K_{t+1} now sits in the state slot of the period-t+1t+1 problem). This overloading is the whole point: the same partial expression handles the FOC in one period and the envelope term in the next.

Two facts, both derived in Section 2.4, are now worth restating in terms of Π\Pi:

FOC w.r.t.\ the choice Kt+1K_{t+1}:
2Π(Kt,Kt+1,zt)  +  βEzt+1zt[V(Kt+1,zt+1)]  =  0.\partial_2 \Pi(K_t,K_{t+1},z_t) \;+\; \beta\,\mathbb{E}_{z_{t+1}\mid z_t}[\,V'(K_{t+1},z_{t+1})\,] \;=\;0.
Envelope at the optimum, evaluated at the state KtK_t:
V(Kt,zt)  =  1Π(Kt,g(Kt,zt),zt),V'(K_t, z_t) \;=\; \partial_1 \Pi(K_t, g(K_t, z_t), z_t),
where gg is the optimal policy.

Substituting the envelope (evaluated one period ahead, so that Kt+1K_{t+1} sits in the state slot) into the FOC, the familiar Euler equation becomes

0=2Π(Kt,Kt+1,zt)+βEzt+1zt ⁣[1Π(Kt+1,Kt+2,zt+1)].\boxed{ \begin{aligned} 0 = {}& \partial_2\,\Pi(K_t, K_{t+1}, z_t)\\ &+ \beta\,\mathbb{E}_{z_{t+1}\mid z_t}\!\left[ \partial_1\,\Pi(K_{t+1}, K_{t+2}, z_{t+1}) \right]. \end{aligned}}

Here Kt+2=g(Kt+1,zt+1)K_{t+2}=g(K_{t+1},z_{t+1}). Both terms in (2.29) are derivatives with respect to the same physical variable Kt+1K_{t+1}, but one treats Kt+1K_{t+1} as the choice of period tt (slot 2) and the other treats Kt+1K_{t+1} as the state of period t+1t+1 (slot 1). Every term is therefore a partial derivative of the same function Π\Pi. Neither uu' nor the marginal product of capital, nor the envelope formula V(K)=u(C)(αKα1+1δ)V'(K) = u'(C)(\alpha K^{\alpha-1} + 1 - \delta) need to be written out by hand. tf.GradientTape records the forward evaluation of Π\Pi and produces both partials on demand; the expectation is approximated by any of the quadrature rules of Section 2.6. The code is shorter than the pen-and-paper counterpart and, more importantly, entirely model-agnostic: swapping log for CRRA utility means editing the body of Pi and nothing else.

def Pi(K_in, K_out, z_in):
    Y = z_in * K_in ** alpha
    C = Y + (1.0 - delta) * K_in - K_out
    return tf.math.log(C)

def euler_residual(K_t, z_t, K_tp1, K_tp2, z_tp1):
    # FOC term:       d Pi / d K_{t+1} at (K_t, K_{t+1}, z_t)
    with tf.GradientTape() as t1:
        t1.watch(K_tp1)
        pi_t = Pi(K_t, K_tp1, z_t)
    d2 = t1.gradient(pi_t, K_tp1)
    # Envelope term:  d Pi / d K_t at (K_{t+1}, K_{t+2}, z_{t+1})
    with tf.GradientTape() as t2:
        t2.watch(K_tp1)
        pi_tp1 = Pi(K_tp1, K_tp2, z_tp1)
    d1 = t2.gradient(pi_tp1, K_tp1)
    return d2 + beta * d1

Program 2:Autodiff Euler residual for Brock--Mirman. The function \texttt{Pi} is the only model-specific code; the rest is generic. A full implementation lives in the autodiff chapter’s code folder, notebook 02_Brock_Mirman_AutoDiff_DEQN.ipynb.

2.7.3.2Cross-checking the autodiff loss.

An important pedagogical point is that the autodiff residual (2.29) and the hand-derived residual of Section 2.4 are the same mathematical object. The companion autodiff notebooks implement both expressions on the same network and report the maximum absolute difference, which in our (seeded) runs is of order 10-6--10-7 in the deterministic case and 10-6--10-5 in the stochastic case with Gauss--Hermite quadrature (float32 arithmetic; float64 tightens this by roughly seven orders of magnitude). In every case the residual is consistent with finite-precision arithmetic, graph ordering, and quadrature accumulation rather than with any difference in the underlying mathematics. Under full depreciation (δ=1\delta = 1) the trained policy from the autodiff loss matches the analytical closed-form Kt+1=αβKtαK_{t+1} = \alpha\beta K_t^{\alpha} (respectively αβztKtα\alpha\beta z_t K_t^{\alpha}) to mean relative error 104\sim 10^{-4} in the deterministic case and 103\sim 10^{-3} on a coarse classroom grid in the stochastic case (the residual rises with the quadrature footprint), confirming that minimizing the autodiff residual recovers the true policy when one is available.

2.7.4Pitfalls of autodiff

AD is exact and cheap but it is not magical. Five pitfalls are worth naming; each has a concrete workaround.

2.7.4.1Non-differentiable kinks.

Operations such as max(0,x)\max(0,x) (ReLU), x|x|, min\min, max\max, argmax, sort\mathrm{sort}, and indicators are non-differentiable at isolated points. Frameworks return a subgradient at the kink -- in TensorFlow and PyTorch, ReLU/xx=0=0\partial \mathrm{ReLU}/\partial x\,|_{x=0} = 0 by convention. If the loss repeatedly lands on such a kink, SGD can receive an uninformative gradient and stall. The cure is to either smooth the kink (softplus\mathrm{softplus} for ReLU, x2+ε2\sqrt{x^2 + \varepsilon^2} for x|x|, log-sum-exp for max\max) or, in complementarity problems, to replace the non-smooth indicator by a Fischer--Burmeister residual (Chapter Chapter 3, Section 3.3.0.8).

2.7.4.2Boundary singularities.

The Brock--Mirman loss contains logCt\log C_t and hence 1/Ct1/C_t. If the network’s output happens to drive CtC_t close to zero during training, AD dutifully returns a very large gradient; SGD then takes a very large step and the network explodes or produces NaN. Two cures of different quality are in wide use:

2.7.4.3Reverse-mode memory.

Reverse mode stores the entire forward graph so it can walk it backwards. Unrolling a 10,00010{,}000-step simulation and back-propagating through all of it has memory cost O(T×dimstate)\mathcal{O}(T \times \mathrm{dim}\,\text{state}) and can exhaust GPU memory on non-trivial models. Standard mitigations are (i) gradient checkpointing (tf.recompute_grad), which recomputes parts of the graph on the backward pass in exchange for memory, (ii) truncated back-propagation through time for long recurrences, and (iii) path-averaging mini-batches of trajectory segments rather than full trajectories (notebook 02 of Day 2 already does this).

2.7.4.4Loss of structural insight.

AD returns a number, not an algebraic expression. Useful structural facts -- the cancellation u(Ct)βu(Ct+1)Ru'(C_t) - \beta u'(C_{t+1}) R has the sign of the Euler wedge, the Euler residual is homogeneous of a known degree in productivity, and so on -- are invisible to AD and come from the hand derivation. In practice one should retain the ability to derive the FOC on a toy version of the model, precisely as we did in Section 2.4: AD scales that derivation; it does not replace the understanding it provides.

2.7.4.5stop_gradient and the envelope theorem.

A subtle point worth emphasizing. The envelope identity V(K)=1Π(K,g(K))V'(K) = \partial_1 \Pi(K, g(K)) holds at the optimum because the FOC kills the term (u(C)+βV(g(K)))g(K)(-u'(C) + \beta V'(g(K)))\cdot g'(K) that otherwise contributes to the total derivative of the composed continuation value. During training we are not yet at the optimum. Two natural codings of the next-period term therefore give different numbers: one differentiates through the policy gg and computes a total derivative of a rollout, while the other treats K=g(K)K'=g(K) as a fixed choice and computes the partial derivative used in the Euler FOC. The autodiff residual (2.29) uses the latter, which is what falls out of t2.gradient(pi_tp1, K_tp1) in Listing Program 2. This is the correct residual for checking the Euler equation; differentiating through the policy is a different object. The two codings agree as the FOC residual vanishes, but off the optimum they can disagree.

2.7.4.6Companion notebooks.

Three notebooks (in the autodiff chapter’s code folder) put the above into practice, with a fourth provided as self-study:

2.8Data Parallelization with MPI

For very large-scale applications (e.g., N50N \geq 50 countries in the IRBC model of Chapter Chapter 3), training can be accelerated by distributing the gradient computation across multiple GPUs or compute nodes. The standard approach uses synchronous data parallelism via MPI Allreduce; in the DEQN paper the corresponding implementation is built on Horovod Sergeev & Del Balso, 2018, which wraps Allreduce into a drop-in replacement for the training optimizer.

Since each worker processes an independent mini-batch and the Allreduce operation averages the gradients, the effective batch size scales linearly with the number of workers. In practice, this yields near-linear speedup for moderate numbers of workers (P32P \leq 32), with communication overhead becoming significant only for very large clusters. The key advantage for economics applications is that each worker can simulate its own trajectory of the economy, naturally exploring different regions of the state space and improving the diversity of the training distribution.

2.9Choice of Loss Kernel: Beyond Mean-Squared Loss

Every DEQN derivation in this script ends with the same step: take the equilibrium residual r(x)r(\bm{x}), square it, average over a mini-batch, and minimize. The squared average, mean-squared error (MSE), is the canonical default, but it is only one of many reasonable reductions of a residual vector to a scalar. Different reductions emphasize different parts of the residual distribution and have visibly different convergence properties, even on the same model with the same network and data. This section makes the trade-off concrete on the stochastic Brock--Mirman model with full depreciation, where the optimal savings rate is known in closed form (s=αβs^\star = \alpha\beta) so that the deviation between learned and optimal policy can be measured exactly.

2.9.0.1Six kernels.

We compare:

2.9.0.2Convergence behavior.

Figure Figure 2.11 shows the mean, p90p_{90}, and p99p_{99} of the absolute relative Euler error on a fixed evaluation grid as a function of the training episode, for each of the six kernels. Three patterns are immediate.

Convergence of the relative Euler-error distribution under six different loss kernels on the stochastic Brock--Mirman model with full depreciation. Same network (2\times32 swish, sigmoid head), same Adam optimizer, same CRN training stream; only the loss kernel changes. Left: mean |r|. Middle: p_{90}\,|r|. Right: p_{99}\,|r|. All on a log scale. Closed-form benchmark: s^\star = \alpha\beta.

Figure 2.11:Convergence of the relative Euler-error distribution under six different loss kernels on the stochastic Brock--Mirman model with full depreciation. Same network (2×\times32 swish, sigmoid head), same Adam optimizer, same CRN training stream; only the loss kernel changes. Left: mean r|r|. Middle: p90rp_{90}\,|r|. Right: p99rp_{99}\,|r|. All on a log scale. Closed-form benchmark: s=αβs^\star = \alpha\beta.

First, MAE clearly stalls at a noise floor an order of magnitude above the other kernels, exactly the constant-gradient pathology described above; the curve is monotone but flattens out and refuses to drop further. Second, MSE and log-cosh achieve essentially the same final mean residual (1.4 ⁣× ⁣104\sim 1.4\!\times\!10^{-4}), and both pull the p99p_{99} down with comparable efficacy. Third, the tail-aware kernels (Huber, Quantile, CVaR) typically produce the narrowest spread between mean and max residual but pay a small constant cost in the bulk; the practical pay-off shows up in problems where rare-but-consequential states matter.

2.9.0.3Why log-cosh tends to converge most smoothly.

Of the six kernels, log-cosh is the one whose convergence curve is essentially monotone with no spikes. This is not an accident: logcosh(r)\log\cosh(r) is a convenient smooth interpolant between 12r2\tfrac12 r^2 near zero and r|r| in the tails, with no kink at any threshold. Near the optimum its gradient is tanh(r)r\tanh(r) \approx r (linear, like MSE), so it inherits MSE’s fine-grained convergence. Far from the optimum its gradient saturates at ±1\pm 1 (like MAE), so an unusually large residual cannot dominate the mini-batch gradient. In effect, log-cosh is what one would design if asked for “MSE near zero and MAE in the tails, smoothly”. For practical DEQN settings where the residual distribution is unknown in advance, log-cosh is therefore a useful default to compare against MSE rather than a mere robustness afterthought.

2.9.0.4Why this is an economic comparison.

The relative Euler-equation residual already has a direct economic interpretation: a 1%1\% relative Euler error translates approximately into a 1%1\% per-period consumption error along the simulated path Judd, 1998. The mean / p90p_{90} / p99p_{99} panels of Figure Figure 2.11 therefore measure, in interpretable units, the average and tail consumption mistakes that each loss kernel leaves behind. The point of the experiment is that the training-loss ranking is not the same as the ranking on this economic metric: a kernel that achieves a slightly higher mean residual but a much narrower tail produces smaller worst-case consumption errors, which is what matters when rare states are economically consequential (occasionally binding constraints, fat-tailed shocks, tipping risks). The training loss is the instrument; the relative Euler error along simulated paths is the criterion. Choosing the kernel with that hierarchy in mind is part of the modeling decision. The companion notebook pushes this one step further by collapsing the per-period error distribution into a single consumption-equivalent welfare loss against ss^\star; readers who want a single-number summary of the trade-off should consult that table directly.

The full experiment, including a path-residual histogram, a policy-error heatmap on the (z,logK)(z, \log K) plane, and a CE-welfare-loss summary table, is in the companion notebook 05_StochasticBM_LossComparison.ipynb.

2.9.0.5Extensions of the basic DEQN template.

The Brock--Mirman model establishes the core DEQN recipe, but it is only the starting point. Later chapters modify the same template in several directions: Chapter Chapter 5 adds lifecycle structure and complementarity constraints, Chapter Chapter 6 replaces a finite vector of states by a histogram representation of the cross-sectional distribution, Section Section 6.7 then shows that one can also feed shock histories to the network rather than the current endogenous aggregate state, and Chapter Chapter 11 extends the method to nonstationary climate-economy models with pseudo-states. The equilibrium logic is the same in all cases: choose a network parameterization, simulate the model forward, evaluate equilibrium residuals, and update the network by gradient descent.

2.9.0.6Code examples.

The following Jupyter notebooks implement and extend the material in this chapter. Notebooks 01 and 02 illustrate the sampling progression that is pedagogically central to the method: 01 uses uniform random states to isolate the loss-and-architecture mechanics, while 02 replaces the exogenous grid by simulated trajectories and learns on the model’s ergodic set (Section 2.3); both derive the FOC and apply the envelope theorem on paper before writing the loss. Notebooks 03 and 04 introduce additional techniques (endogenous labor, a KKT-constrained labor-time ceiling encoded via a Fischer--Burmeister complementarity function, and a six-period OLG extension) that anticipate material developed formally in Chapters Chapter 3 and Chapter 5. Three further notebooks that re-solve the same two Brock--Mirman models with an autodiff loss (illustrating the template of Section 2.7) are placed in the autodiff-chapter code folder as the autodiff primer: 01_AutoDiff_Analytical_Examples.ipynb, 02_Brock_Mirman_AutoDiff_DEQN.ipynb, 03_Brock_Mirman_Uncertainty_AutoDiff_DEQN.ipynb.

2.10Further Reading

2.11Exercises

Worked solutions and guidance for these exercises appear in Appendix Appendix F.

Footnotes
  1. In the sense of Brumm & Scheidegger (2017): an approximation of the equilibrium policy (or value) functions over the entire economically relevant region of the state space, in particular over the model’s ergodic set, as opposed to a local (perturbation) solution that is accurate only in a neighborhood of the deterministic steady state.

References
  1. Azinovic, M., Gaegauf, L., & Scheidegger, S. (2022). DEEP EQUILIBRIUM NETS. International Economic Review, 63(4), 1471–1525. 10.1111/iere.12575
  2. Bellman, R. (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press.
  3. Judd, K. L., Maliar, L., & Maliar, S. (2011). Numerically Stable and Accurate Stochastic Simulation Approaches for Solving Dynamic Economic Models. Quantitative Economics, 2(2), 173–210.
  4. Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). On the Surprising Behavior of Distance Metrics in High Dimensional Space. Database Theory — ICDT 2001, 1973, 420–434. 10.1007/3-540-44503-X_27
  5. Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.
  6. Marcet, A. (1988). Solving Non-Linear Stochastic Models by Parameterizing Expectations. Carnegie Mellon University Working Paper.
  7. Marcet, A., & Marshall, D. A. (1994). Solving Nonlinear Rational Expectations Models By Parametrized Expectations: Convergence to Stationary Solutions [Discussion Paper 91]. Federal Reserve Bank of Minneapolis.
  8. Sirignano, J., & Spiliopoulos, K. (2018). DGM: A Deep Learning Algorithm for Solving Partial Differential Equations. Journal of Computational Physics, 375, 1339–1364.
  9. Han, J., Jentzen, A., & E, W. (2018). Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34), 8505–8510.
  10. Maliar, L., Maliar, S., & Winant, P. (2021). Deep learning for solving dynamic economic models. Journal of Monetary Economics, 122, 76–101.
  11. Brock, W. A., & Mirman, L. J. (1972). Optimal economic growth and uncertainty: The discounted case. Journal of Economic Theory, 4(3), 479–513.
  12. Judd, K. L. (1998). Numerical methods in economics. The MIT press.
  13. Azinovic-Yang, M., & Žemlička, J. (2024). Intergenerational consequences of rare disasters. Available at SSRN 4386477. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4386477
  14. Gerstner, T., & Griebel, M. (1998). Numerical Integration Using Sparse Grids. Numerical Algorithms, 18, 209–232.
  15. Heiss, F., & Winschel, V. (2008). Likelihood Approximation by Numerical Integration on Sparse Grids. Journal of Econometrics, 144(1), 62–80.