This chapter introduces Deep Equilibrium Nets (DEQNs), the central computational framework of this script. Rather than generating labeled training data, a DEQN embeds the equilibrium conditions of a dynamic stochastic model (Euler equations, budget constraints, complementarity conditions) directly into the loss function of a neural network. The network then learns policy functions by minimizing the violation of these conditions via gradient descent. We develop the methodology on the classical Brock--Mirman growth model, for which an analytical solution exists and serves as a benchmark. The primary reference is Azinovic et al. (2022).
2.0.0.1Notation transition.¶
Chapter Chapter 1 used for the parameter vector of a generic supervised model, following the textbook convention. From this chapter onward we switch to (so that the network is denoted ), matching the notation of Azinovic et al. (2022) and most of the subsequent DEQN literature. The two symbols denote the same object; we use for parameters wherever a network appears, and reserve for structural economic parameters that the network does not see (Chs. Chapter 10--Chapter 11).
2.1The Curse of Dimensionality in Economics¶
A central challenge in computational economics is that many models of interest, such as overlapping-generations (OLG) economies, heterogeneous-agent models, and international business cycle models, feature state spaces of high dimension. In an OLG model with cohorts and idiosyncratic risk, the state vector may include individual capital holdings, productivity levels, and wealth shares for each cohort, leading to state spaces of dimension . Traditional grid-based solution methods, such as value function iteration on a Cartesian grid with nodes per dimension, require grid points; this is the curse of dimensionality, the term coined by Bellman (1961) to describe the exponential growth of computational cost with the number of state variables. A concrete back-of-the-envelope makes the bottleneck plain: at the modest resolution of nodes per dimension, a state space already requires grid points, which is infeasible under any reasonable compute budget; even more modest resolutions of --30 become prohibitive once exceeds ten.
Table Table 2.1 illustrates the curse concretely. Even with a modest grid points per dimension, the total number of grid points grows astronomically.
Table 2.1:Size of an Cartesian grid and the 64-bit memory required to store one floating-point value per grid point, as a function of state-space dimension . Grid-based methods are comfortable only at low dimension; by even storing one scalar per grid point is already borderline before policies, interpolation objects, residuals, and simulation output are added.
| Grid points () | Memory (64-bit) | Feasibility | |
|---|---|---|---|
| 1 | 10 | 80 B | trivial |
| 2 | 100 | 800 B | trivial |
| 5 | 105 | 800 KB | feasible |
| 10 | 1010 | 80 GB | borderline |
| 20 | 1020 | GB | impossible |
| 50 | 1050 | --- | absurd |
| 100 | 10100 | --- | a googol |
2.1.0.1The volume paradox.¶
A complementary, and more geometric, perspective on the curse of dimensionality starts from a single ratio. The volume of the unit -ball relative to the smallest cube that contains it, , is
This ratio collapses with (Figure Figure 2.1): it is at and at , has fallen to by , and is below 10-70 at . The geometry behind the numbers is sharp. The inscribed ball touches the center of each of the faces, at distance 1 from the center, but its surface never gets within of any of the corners, which sit at distance . As grows the corners march off to infinity while the ball stays put, so essentially all of the cube’s volume drains into its corners, far from the center. A box-shaped object in high dimensions is, to a first approximation, all corners and no middle.
Why should an economist care about a fact about cubes and balls? Because solving a dynamic stochastic model means producing a global solution[1], and the region over which that solution actually has to be accurate is the model’s ergodic set, the states the economy visits under its own law of motion. That set is typically a thin, curved, low-volume sliver of any hypercube that one would draw around it, often well under one percent of the bounding box already at moderate Judd et al., 2011. Yet most of the high-dimensional approximation machinery on offer, tensor-product grids, adaptive sparse grids, tensor-product quadrature, is built on the hypercube; by the volume paradox an exponentially growing fraction of those nodes lands in corners the model never visits, so most of the computational effort is spent representing the function where it does not matter. Figure Figure 2.2 visualizes this mismatch. This is exactly the waste the DEQN approach is designed to avoid: instead of tiling a box, it trains the network on states sampled from the model’s own simulated trajectories, putting the approximation effort where the economics lives.
Distance itself is the second casualty of high dimensions, and it is worth flagging now because it returns later with teeth. Draw a point uniformly from the unit -ball: its radius has cumulative distribution and density on , which for large piles essentially all of its mass against the shell . Random points sit on the pulp, not in the core, and they do so with overwhelming probability. Relatedly, pairwise Euclidean distances among random points concentrate so tightly that the nearest and the farthest neighbor of a query point become almost equidistant, so “distance” loses most of its power to discriminate Aggarwal et al., 2001 (Figure Figure 2.3). This is not a curio: any method that judges similarity through , from -nearest-neighbors to kernel ridge regression to the Gaussian-process surrogates of Chapter Chapter 9, whose RBF and Matérn kernels are functions of alone, loses resolution as grows. It is one of the reasons Chapter Chapter 9 reaches for dimension reduction, active subspaces and deep kernels, before fitting a GP in a high-dimensional input space.
Figure 2.1:(a) the unit ball inscribed in
(b) ratio (2.1) versus (log scale)
The volume paradox behind the curse of dimensionality. (a) The largest ball that fits inside the cube touches the center of every face, at distance 1 from the center, but its surface stays away from each of the corners, which lie at distance . As grows the corners recede while the ball does not, so almost all of the cube’s volume ends up in the corners (tinted), far from the center. (b) The ball-to-cube volume ratio of equation (2.1) on a logarithmic scale: at , about at , and below 10-70 at . A grid or quadrature rule built on the bounding hypercube therefore spends an exponentially growing share of its nodes in corners that the model’s ergodic set never reaches.
Figure 2.2:Grid-based vs. simulation-based state sampling. A Cartesian grid (left) allocates effort uniformly over the hypercube and places most of its points far from the model’s ergodic set, the small subset of the state space that the economy actually visits under its own dynamics (red band). The DEQN algorithm samples states along simulated trajectories (right), concentrating training points exactly on that set. As grows, the fraction of the cube reached by the ergodic set shrinks rapidly, so the grid’s relative waste grows exponentially.
Figure 2.3:(a) blue core and red shell each hold half a -ball’s volume; the shell thins as grows
(b) the radius CDF piles up at as grows
Why a “random” point in high dimensions lives on the shell, not in the core. (a) For each , the blue inner disk and the red outer shell each hold half of the -ball’s volume; the shell’s fractional thickness shrinks from at to at to at , so half of the mass crowds into an ever thinner rim near the surface. (b) Equivalently, the radius of a uniform draw has cumulative distribution , which for large stays near zero until is almost 1 and then jumps: the radius is essentially deterministic at 1. The companion fact is that pairwise Euclidean distances among random points concentrate, so a query point’s nearest and farthest neighbors become nearly indistinguishable Aggarwal et al., 2001; this is why distance-based methods, including the Gaussian-process kernels of Chapter Chapter 9, lose resolution in high dimensions.
2.2From Supervised Learning to Self-Supervised Equilibrium Solving¶
A word on terminology before we start, recalling the taxonomy of Section Section 1.3. A learning method is supervised when its loss compares the network’s output to an externally given target label; it is unsupervised when there are no target labels at all (clustering, density estimation, dimension reduction); and it is self-supervised when, although no human-provided labels exist, the method manufactures its own supervisory target out of the available structure. Deep Equilibrium Nets are unsupervised in the narrow “no labels” sense, but the sharper and more useful description is equation-based self-supervision: at every state where we evaluate the model, the equilibrium conditions tell us what the correct relationship between today’s policy and tomorrow’s must be, namely a residual of zero, and that residual plays exactly the role a label plays in supervised learning. The DEQN objective introduced below is therefore a genuine regression loss, only against a target the model itself dictates rather than one handed to us as data.
With that in mind, the key conceptual shift introduced by Azinovic et al. (2022) is to replace the labeled training data of standard supervised learning with the structural equations of the economic model. In a generic dynamic stochastic economic model, the state of the economy at time is summarized by a vector , and agents choose policy (or decision) variables . The equilibrium is characterized by a system of functional equations
where encodes optimality conditions (e.g., Euler equations) and is the conditional expectation over next-period shocks via the transition law . The fundamental challenge is that this is a functional equation: we seek functions rather than finite-dimensional parameter vectors. The DEQN approach parameterizes as a neural network and solves this functional equation via stochastic optimization (Figure Figure 2.4).
Figure 2.4:self-supervised
Supervised learning (left) versus DEQN training (right). Both paradigms train a parameter vector by minimizing a residual loss with SGD; the difference is the source of the training signal, labeled data in the supervised case, structural equilibrium equations in the DEQN case. No labeled solution data are required for DEQNs. For Brock--Mirman the right-hand side specializes to with and ; the network learns the policy by driving the squared mean of this residual to zero on simulated trajectories.
In the standard (supervised) paradigm, one minimizes the discrepancy between the network’s predictions and known target values. In the DEQN framework, the loss function is instead the squared norm of the equilibrium residual:
where denotes a deep neural network with parameters that maps states to policies. If the residual loss is zero on a sufficiently rich set of states and the conditional expectations inside are evaluated accurately (rather than with a single shock realization), the network satisfies the equilibrium conditions on those states; global accuracy must always be assessed on independent validation trajectories and, where available, against benchmark solutions. The continuous-time analogue is the residual-loss formulation of Raissi et al. (2019), which inspired the PINN literature developed in Chapter Chapter 7. The training points are drawn not from an exogenous dataset, but from the model’s own simulated dynamics: the network learns on the ergodic distribution of the economy, concentrating computational effort on economically relevant regions of the state space. The idea of using simulated paths as the support for projection goes back to the parameterized-expectations algorithm of Marcet (1988) and Marcet & Marshall (1994), which the DEQN literature reframed as ergodic-set sampling to focus computation on economically relevant states; the numerical stability of stochastic-simulation projection was further developed by Judd et al. (2011).
2.2.0.1Connection to the modern overparameterized regime.¶
Note that DEQNs operate squarely in the “modern regime” of Section Section 1.10: the network often has parameters relative to the mini-batch size . Unlike supervised learning with a fixed dataset, however, the training distribution is renewable: each episode can generate fresh state samples from the model’s own simulation. This changes the fixed-dataset double-descent analogy, but it does not remove the need for validation: can still exceed the effective sample budget seen during training, and a flexible network can fit residuals on a narrow simulated region. The practical lesson is to combine overparameterized networks and SGD with independent validation trajectories, held-out Euler residuals, and, where available, closed-form benchmarks.
2.3The DEQN Training Algorithm¶
Several features of this algorithm deserve emphasis:
Endogenous training distribution. As the policy network improves, the simulated paths become more accurate, and the training points concentrate on the ergodic set of the true equilibrium. This is fundamentally different from supervised learning, where the training distribution is fixed. Early in training, however, the simulated distribution is generated by a poor and rapidly changing policy: it can be unstable, infeasible, or far too narrow. Practical implementations therefore (i) burn in with broad sampling from the prior or a uniform box around the deterministic steady state, (ii) clip or guard against infeasible states (e.g. negative consumption), and (iii) often maintain a small replay buffer so that early states are revisited as the policy improves. Out-of-sample validation on an independent trajectory remains essential.
No labeled data. The algorithm requires no “ground truth” solutions, only the structural equations of the model. The loss function has direct economic interpretation: it measures the violation of equilibrium conditions.
Stochastic optimization. Mini-batch SGD naturally handles the stochasticity of the model: different mini-batches sample different states, providing implicit exploration of the state space.
Scalability. The DEQN avoids explicit tensor-product state grids entirely: the per-iteration cost scales with network size, batch size, the number of equations, and the cost of integrating the conditional expectation; the input/output layer widths and the simulation step also depend on the state and shock dimensions, but only linearly. This favorable empirical scaling does not eliminate all high-dimensional costs (sample complexity, expectation accuracy, optimization difficulty), but it does mitigate the practical grid-based curse of dimensionality.
JIT-compile every gradient step. In any production or classroom-scale implementation, the per-batch gradient step (lines 5--6 inside the inner loop) should be wrapped in
@tf.function(TensorFlow),torch.compile(PyTorch), or@jax.jit(JAX). The speed-up over an un-traced Python loop is typically 5-- for the per-step kernel, and the trace cost is amortized after the first call. All companion notebooks in this course follow this convention; un-traced Python loops are simply too slow for classroom use, let alone for research-scale runs.
2.3.0.1Episode length .¶
The simulated-trajectory length controls the effective size of the training set within an episode. Azinovic et al. (2022) use time steps (split into many short mini-batches) in their 113-dimensional 56-agent OLG benchmark. Shorter episodes (e.g., ) speed up early training because the distribution is re-drawn more frequently, at the cost of higher variance; longer episodes produce smoother gradient estimates but run the risk of stale data if the policy has drifted during training. A pragmatic rule is to start short, lengthen as the loss plateaus, and check out-of-sample accuracy on an independent simulated trajectory.
2.3.0.2Connection to continuous-time methods.¶
The DEQN residual loss is the discrete-time analogue of the Deep Galerkin Method (DGM) of Sirignano & Spiliopoulos (2018), which minimizes a PDE residual on neural-network-parameterized PDE solutions in continuous time, and of the deep BSDE solver of Han et al. (2018) (Han--Jentzen--E), which formulates the same problem as a backward stochastic differential equation. Within the discrete-time deep-learning toolkit, Maliar et al. (2021) unify lifetime-reward, Bellman-equation, and Euler-equation training into a single framework, paired with an “all-in-one” integration operator that estimates all conditional expectations from a single Monte Carlo realization; DEQN as developed here is the Euler-equation branch of that taxonomy. Chapters Chapter 7 and Chapter 8 develop the continuous-time analogues explicitly; the present chapter should be read as introducing the discrete-time machinery on which those continuous-time methods are built.
2.4The Brock--Mirman Benchmark¶
Having laid out the generic training loop, we now put it to work on a concrete model. To validate the DEQN methodology, Azinovic et al. (2022) begin with the stochastic growth model of Brock & Mirman (1972), which admits a closed-form solution and therefore serves as an ideal test case.
The social planner solves:
where is the discount factor, the capital share, the shock persistence, the shock volatility, and i.i.d. innovations. Depreciation is full ().
2.4.0.1Derivation of the Euler equation.¶
To derive the optimality conditions, form the Lagrangian by attaching a multiplier to the resource constraint at each date :
Since the planner chooses and at each date, we take partial derivatives with respect to each:
FOC w.r.t. :
FOC w.r.t. : The variable appears in the date- constraint (with coefficient -1) and in the date- constraint via output . Differentiating:
Dividing both sides by yields:
Finally, substituting from (2.6) into this expression gives the Euler equation:
The economic intuition is transparent: the left-hand side is the marginal utility cost of saving one additional unit today; the right-hand side is the expected discounted marginal utility gain from the extra output that unit produces tomorrow. At the optimum, the planner is indifferent between consuming and saving at the margin.
2.4.0.2Analytical solution.¶
When depreciation is full (), this model admits a closed-form solution. One can verify by direct substitution into the Euler equation (2.9) that the optimal consumption policy is:
The derivation proceeds by guessing that the value function takes the form (with the constant written as rather than to avoid clashing with the TFP / cohort-count uses of elsewhere in the script) and using the envelope theorem. Substituting this guess into the Bellman equation with yields a system of equations for , , and . Matching coefficients on pins down ; matching on pins down (so , unlike , depends on the shock persistence); and the constant then absorbs the remaining terms. The first-order condition together with delivers the linear savings rule , from which (2.10) follows immediately via the resource constraint. Notably, this closed-form solution holds regardless of the shock persistence for the savings rule itself, because cancels in the ratio under the linear consumption rule: , so no longer depends on the next-period shock.
2.4.0.3Persistent shocks.¶
When , productivity shocks are autocorrelated and the state of the economy becomes the pair , since the current shock level now carries information about future productivity. Under log utility with Cobb--Douglas production and full depreciation (), the analytical solution (2.10) continues to hold regardless of the persistence . The reason fits in one line: under the linear consumption rule we have , so cancels and the conditional expectation in the Euler equation (2.9) no longer depends on the next-period shock; the constant in the value function still depends on , but enters and cancels in the FOC for the savings rule. However, for more general preferences (e.g., CRRA with ), partial depreciation (), or non-Cobb--Douglas production, the closed-form solution breaks down and numerical methods, and the DEQN approach in particular, become essential: the policy function must be approximated rather than derived analytically. (Concretely: the deterministic companion notebook uses so that the closed form applies and can be used as a benchmark; the stochastic companion notebook switches to , where it does not, and the policy is genuinely learned. Only the loss-kernel notebook of Section 2.9 returns to , precisely so that the closed-form savings rate is available as ground truth.)
2.4.0.4DEQN formulation.¶
We parametrize the consumption policy as and define the residual
where denotes a single realization of the next-period shock and . In the code listing below the network actually emits a savings share through a sigmoid head rather than directly, which guarantees and at every training step; Section Section 2.5 explains the hard-vs-soft constraint split this exemplifies. Note that the Euler equation (2.9) involves the conditional expectation , whereas the residual above is written for a single realization of . Two approaches are common for handling this expectation:
Path averaging: draw a single per state, compute for that draw, and average over many states along the simulated path. This minimizes -- the squared pathwise residual. It is an unbiased Monte Carlo objective for that stronger loss, and Jensen’s inequality gives . Thus almost surely implies the conditional Euler equation, but the converse need not hold: a policy can have while varies with the shock. In the log/full-depreciation Brock--Mirman benchmark the closed-form policy happens to drive the pathwise residual itself to zero, so this stronger target is harmless; in richer models it is a modeling choice. This is the approach used in the Brock--Mirman code listing below. Note that the pathwise-residual-zero coincidence is special to : for (the calibration of the stochastic notebook) a converged policy has but itself is nonzero realization-by-realization, so path averaging then minimizes a strictly stronger objective than the conditional Euler residual.
Quadrature: for each state , approximate explicitly via deterministic nodes and weights (Section Section 2.6), form the residual from the estimated expectation, and then square. This targets directly and is the approach used for the IRBC model of Chapter Chapter 3, where accurate expectations are critical for convergence.
Both approaches recover the analytical solution (2.10) to high accuracy, providing a rigorous validation of the methodology. Because convergence curves depend on the exact training run, random seed, and solver configuration, this manuscript does not include a hand-drawn convergence plot. In practice, one should report diagnostics from the actual notebook run: residual trajectories, held-out Euler errors, and the gap between the learned policy and the analytical benchmark. Figure Figure 2.5 sketches the qualitative shape one should expect to see.
Figure 2.5:Schematic, not measured: the qualitative convergence behavior typical of a successful Brock--Mirman DEQN run. An early high-residual phase reflects an untrained network feeling out the state space; a mid phase descends roughly exponentially as the policy locks onto the equilibrium structure; a late phase plateaus near the irreducible quadrature/training-noise floor. The dotted line marks the analytical benchmark below which the residual cannot be driven without higher-precision quadrature. For the actual numbers on a specific seed, consult the companion notebook.
2.4.0.5Relative Euler equation error.¶
Following Judd (1998) and Azinovic et al. (2022) (Eq. 43), the standard accuracy diagnostic for a DEQN is the relative Euler error, which measures the percentage consumption error implied by a violation of the Euler equation. For Brock--Mirman with log utility, the relative error at state is
where , is next-period capital under the network policy, the expectation conditions on , and the inverts the marginal-utility relation . The value means the agent’s optimal consumption is mispriced by , independent of units or utility scale. This is the metric reported in Table 3 of Azinovic et al. (2022), where the trained DEQN achieves mean relative Euler errors of order 10-4 on the 113-dimensional 56-agent OLG benchmark.
def deqn_loss(model, K, z, beta, alpha, z_next):
Y = z * K**alpha # output today
s = model(K, z) # savings share in (0,1) via sigmoid
C = (1.0 - s) * Y # consumption today (>0)
K_next = s * Y # capital tomorrow (>0)
Y_next = z_next * K_next**alpha
s_next = model(K_next, z_next)
C_next = (1.0 - s_next) * Y_next # consumption tomorrow
# z_next: single draw; expectation via path averaging
G = 1.0 - beta * (C / C_next) * alpha * z_next * K_next**(alpha-1)
return tf.reduce_mean(G**2)Representative DEQN loss for Brock--Mirman with path averaging. The network outputs a savings share via a sigmoid, which jointly enforces \emph{and} ; softplus on alone would not, since would yield and an undefined .
A Gauss--Hermite variant (developed formally in Section 2.6.2) replaces the single shock draw z_next by a -node weighted sum with at the rescaled Hermite nodes ; in practice already drives the quadrature error below the training error. The autodiff companion notebook 03_Brock_Mirman_Uncertainty_AutoDiff_DEQN.ipynb implements this variant explicitly.
2.5Encoding Equilibrium Conditions: Hard vs. Soft Constraints¶
The Brock--Mirman DEQN loss above already used one design choice without dwelling on it: the network emits a savings share through a sigmoid rather than consumption directly. This is an instance of a general principle. Azinovic et al. (2022) §4.2.2 do not treat every equilibrium equation symmetrically in the loss (2.3); they split the conditions into two groups. We make the split explicit on the Brock--Mirman model of the previous section and then indicate how it generalizes.
- Hard constraints, encoded in the architecture (never in the loss).
Some equations can be satisfied exactly for every state :
The resource / state-transition equation defines next-period capital from the network’s consumption policy. It is not minimized; it is evaluated as a closed-form function of and .
The economic requirements and are jointly imposed by parameterizing the network’s output as a savings share via a sigmoid activation, , and recovering both quantities in closed form from the resource constraint, and . A softplus head on alone would guarantee but not (the network could output ). This sigmoid-savings parameterization removes an entire class of infeasible candidate policies before training begins; see the code listing in Section Section 2.4 above.
- Soft constraint, minimized in the loss.
The only equilibrium condition that cannot be enforced analytically is the Euler equation (2.9). The squared relative Euler error (2.12) is averaged over the mini-batch and driven toward zero by stochastic gradient descent.
This split is pedagogically important for three reasons: (i) only the genuinely non-closed-form conditions enter the loss, which speeds up training; (ii) it eliminates a family of bad local minima in which the network produces, e.g., slightly negative consumption or infeasible states; and (iii) it explains why the loss typically converges to a small but nonzero value even at the optimum, since the Euler residual is intrinsic and cannot be removed by re-parameterization. Figure Figure 2.6 summarizes this construction for Brock--Mirman.
Figure 2.6:Hard vs. soft constraints in the DEQN architecture for Brock--Mirman. The network reads the state and emits a savings share via a sigmoid head. Both consumption and next-period capital are then defined in closed form (green, top), guaranteeing and simultaneously; a softplus head on alone could not, since would push and make undefined. Only the relative Euler-equation residual (red, bottom) is squared, averaged over the mini-batch, and minimized by SGD. This figure matches the code listing in Section Section 2.4 above. In richer models (OLG, Chapter Chapter 5) the same split extends to firm first-order conditions, household budget constraints, and KKT multipliers with softplus heads and Fischer--Burmeister complementarity residuals.
2.5.0.1How the split generalizes.¶
In models with more equilibrium objects (the OLG benchmark of Azinovic et al. (2022), the IRBC model of Chapter Chapter 3, the HA economies of Chapter Chapter 6), the same logic extends: firm first-order conditions for prices are closed-form in the aggregate state, the household budget constraint algebraically determines consumption given savings, and each inequality constraint contributes a softplus-headed KKT multiplier paired with a Fischer--Burmeister complementarity residual in the soft part of the loss (Chapter Chapter 3). The Brock--Mirman case above is the minimal instance of the pattern.
2.5.0.2Market-clearing layers.¶
One can push this design philosophy further by encoding the market-clearing condition itself into the network as a dedicated output layer rather than minimizing a market-clearing residual in the loss: the network outputs unnormalized cohort savings (or shares) and a final layer rescales them so that aggregate clearing holds by construction. Azinovic-Yang & Žemlička (2024) use such a “market-clearing layer” in an OLG economy with rare disasters; this is the discrete-time counterpart of the PINN-style practice of baking conservation laws directly into the network output (Chapter Chapter 7). Sigmoid heads, softplus heads, Fischer--Burmeister residuals, and market-clearing rescaling layers thus form a small toolbox of architecture-level encodings that each move part of the equilibrium structure from the soft loss into the hard part of the network.
2.6Quadrature Rules for Conditional Expectations¶
2.6.1Why We End Up Integrating, and What Numerical Integration Is¶
2.6.1.1Why an integral always shows up.¶
Sooner or later, every dynamic stochastic model in this script presents the same algorithmic bottleneck: a forward-looking agent has to form an expectation over future shocks before any decision can be made. The Euler equation (2.9) hinges on ; the Bellman operator behind a value-function iteration evaluates at every state and action; the relative Euler error (2.12) embeds the same expectation in the diagnostic; the integrated-assessment models of Chapter Chapter 11 require expected discounted utility under climate-shock distributions; and the structural-estimation moments of Chapter Chapter 10 are themselves expectations over simulated paths. Outside a handful of conjugate or fully-linear-quadratic-Gaussian special cases, none of these expectations admit closed-form expressions, so the integral has to be approximated numerically by a finite, cheap-to-evaluate, ideally differentiable, weighted sum . Choosing the nodes and weights well is the entire content of quadrature theory. The path-averaging recipe used in the Brock--Mirman code above is one such choice (a Monte Carlo sample average with one node per state along a simulated trajectory); the rest of this section makes the deterministic and quasi-random alternatives explicit and explains when each is preferable.
2.6.1.2Picture 1: a definite integral as area, and the Riemann/midpoint sum.¶
Strip away the economics. A definite integral is the (signed) area between the graph of and the -axis on . The simplest deterministic numerical rule, the midpoint rule, replaces this area by a stack of rectangles of equal width , each one as tall as at the midpoint of its base (Figure Figure 2.7, left). Adding the rectangle areas yields
for smooth . Trapezoid and Simpson rules refine the same idea by replacing each rectangle with a trapezoid or a parabolic arc and reach and respectively; Gauss--Hermite, the workhorse of Section 2.6.2, is the optimal rule when the integrand is multiplied by the Gaussian weight and reaches degree- exactness using only nodes. The same tile-the-area idea generalizes to higher dimensions by laying out a Cartesian grid. With nodes per coordinate the cost is ; equivalently, with total nodes the midpoint rate becomes . Notice that it is the exponent of the rate, not just the constant, that degrades from -2 in 1D to in -D Cartesian, the curse of dimensionality made quantitative. This is the explicit form of the curse of dimensionality flagged in Section Section 2.1, and it motivates both the Stroud monomial rules of Section 2.6.3 and the randomized methods of Section 2.6.4.
2.6.1.3Picture 2: throwing darts to estimate .¶
An entirely different idea is to abandon the grid and approximate the integral by a sample average: draw random points uniformly from the integration domain and report . This is plain Monte Carlo (MC), and its error decays as with a rate that is independent of the dimension , although the variance constant can still worsen with dimension. This is why MC dominates deterministic grids when is large. The cleanest illustration uses the unit square and the indicator function of a quarter-disc:
where counts how many of the uniform “darts” land inside the quarter-circle (Figure Figure 2.7, right). With a typical run gives (about 3% off); with a typical run gives (about 10-4 off). The error shrinks as , requiring a hundredfold increase in to gain one extra decimal of accuracy, which is glacially slow compared to for Simpson’s rule on a smooth 1D integrand. But the MC rate has no dependence on the dimension of the domain: replacing the quarter-disc by a -dimensional unit ball would leave the rate untouched, while the deterministic grid would suffer the cost explosion of the curse of dimensionality. This is what makes MC and its quasi-random refinement (QMC, Section 2.6.4) the natural tools for the conditional expectations encountered in DEQNs at .
Figure 2.7:Two paradigms for numerical integration that underlie every rule in this section. Deterministic tiling (left) is exact and highly accurate at low dimension but suffers exponential cost growth in . Random sampling (right) is dimension-independent in its error rate but slow to converge in any single dimension. The Gauss--Hermite, monomial, and QMC rules of §Section 2.6.2--Section 2.6.4 are sophisticated descendants of the two ideas, designed to combine deterministic accuracy with manageable cost in .
2.6.1.4Where the rest of this section is going.¶
With this picture in hand, the design of every quadrature rule in the literature can be read as an answer to two questions: (i) where do we place the nodes , and (ii) what weights do we attach to them? Tensor-product Gauss--Hermite (Section 2.6.2) places nodes deterministically on a Cartesian grid of Hermite roots; the Stroud-3 monomial rule of Section 2.6.3 places only nodes on the principal axes and accepts a controlled bias on fourth-order moments in exchange for linear-in- scaling; the QMC construction of Section 2.6.4 places nodes from a low-discrepancy sequence in the unit cube and pulls them back through the inverse CDF. The cost--accuracy trade-offs differ dramatically with , as Table Table 2.2 (page ) makes concrete. Chapter Chapter 3 returns to these numbers when scaling DEQNs to multi-country economies.
2.6.2Tensor-Product Gauss--Hermite¶
The classical Gauss--Hermite rule approximates integrals against the weight function :
where the nodes are the roots of the -th Hermite polynomial and the weights are . For expectations under the standard normal distribution, we apply the change of variables to obtain:
The 1D rule (2.16) is exact for univariate polynomials in of degree at most , an attractive accuracy guarantee for smooth integrands.
2.6.2.1Worked example: Brock--Mirman Euler expectation.¶
In the Brock--Mirman model, the conditional expectation in the Euler equation (2.9) is one-dimensional:
Replacing path averaging by Gauss--Hermite with nodes turns the residual (2.9) into a deterministic function of and the network parameters, eliminating the shock noise inside the loss while costing only five extra forward passes per state. This is the simplest concrete instance of the general construction below and a useful sanity check: at the quadrature error is negligible relative to the training error in this benchmark, so a well-trained network should match the analytical Brock--Mirman policy to numerical tolerance.
For shock dimension , the multivariate expectation is computed via a tensor product of one-dimensional rules. Each multi-dimensional quadrature node is a -tuple with weight . The total number of quadrature nodes grows as , which becomes prohibitive once is large Judd, 1998. For , the monomial rule of Section 2.6.3 or sparse-grid (Smolyak) alternatives can substantially reduce the computational cost: Gerstner & Griebel (1998) introduce the construction in numerical analysis, Heiss & Winschel (2008) apply it to econometric likelihoods, and Brumm & Scheidegger (2017) develop adaptive sparse grids tailored to dynamic-programming and equilibrium computations. The polynomial-versus-exponential gap depends on smoothness assumptions and the effective dimension of the integrand: sparse grids attain polynomial rates for sufficiently regular functions; QMC achieves close-to- rates under randomized constructions and finite-variation / low-effective-dimension conditions Sobol', 1967Niederreiter, 1992Owen, 1995Caflisch, 1998Novak & Woźniakowski, 2008, but neither method is universally polynomial.
2.6.3Monomial (Stroud-3) Cubature with Linear Scaling¶
The exponential cost of the previous subsection is the price paid for being exact in the highest-degree univariate polynomial along each coordinate. In equilibrium-residual applications, however, the integrand obtained by composing the policy network with model primitives is typically only mildly nonlinear in the shock vector, and reproducing every cross-coordinate monomial of total degree up to is overkill. Once one is willing to give up that high degree, dramatically cheaper rules become available.
The simplest such rule is the Stroud degree-3 monomial cubature Stroud, 1971, a cornerstone of the toolkit reviewed in Judd (1998) [§7.5] and Maliar & Maliar (2014) [§5.3] for large-scale dynamic models. For a -dimensional standard normal expectation it uses only nodes, all on the principal axes:
where is the -th standard basis vector in . All weights are equal to , and the nodes lie at the common radius from the origin (so they expand outward as the dimension grows, consistent with the fact that the bulk of a high-dimensional Gaussian sits in a thin spherical shell, recalling the dimension-of-volumes discussion in Section Section 2.1). At , the rule collapses to the two-point cubature with weights , exact for polynomials in up to degree three.
2.6.3.1Why it works (and where it stops).¶
Rule (2.18) integrates every monomial in of total degree at most three exactly: the first moments vanish by ±-symmetry, the variances all evaluate to 1 (each axis contributes ), the cross-moments () vanish because each node has only one nonzero coordinate, and the third moments vanish again by symmetry. Higher moments are not exact: is reproduced as rather than the truth 3, so the rule introduces a controlled bias proportional to the integrand’s fourth-order shock content. In the IRBC class of Chapter Chapter 3 this bias is empirically negligible at the Euler-equation tolerance one cares about (); see Pichler (2011) for a careful demonstration on a multi-country RBC and Maliar & Maliar (2014) [Table 5] for a broader cost--accuracy benchmark. When higher accuracy is required, Stroud’s degree-5 monomial rule with nodes still scales polynomially, not exponentially Judd, 1998.
2.6.3.2Cost comparison.¶
At nodes per dimension, the tensor-product rule of Section 2.6.2 costs evaluations per state, while (2.18) costs . Table Table 2.2 contrasts the two for the dimensionalities encountered later in the script. The monomial rule is the default integration scheme behind the production DEQN code for IRBC-type models in Chapter Chapter 3 once the shock dimension exceeds about five, dropped in by replacing a single quadrature kernel.
Table 2.2:Quadrature cost per residual evaluation, tensor-product Gauss--Hermite () versus the Stroud-3 monomial rule of equation (2.18), as a function of the shock dimension . The tensor-product cost grows exponentially in ; the monomial cost grows linearly. Brock--Mirman () is the simplest case where the two rules agree to within a factor of 1.5; the gap explodes once the shock vector is multi-dimensional.
2.6.4Inverse-CDF Transform and Quasi--Monte Carlo¶
For very high-dimensional expectations (), even sparse-grid rules can become expensive, and, when the integrand contains material fourth-order shock content, the bias of the monomial rule (2.18) starts to dominate. An effective alternative is Quasi-Monte Carlo (QMC) integration combined with the inverse-CDF transform Judd, 1998. This method maps the unbounded domain of normal shocks to the unit hypercube using the inverse normal CDF:
where is the vector of i.i.d. standard normal shocks, is the element-wise inverse standard normal CDF, and are points from a low-discrepancy sequence (e.g., Sobol) on . For sufficiently smooth integrands, randomized QMC often achieves error rates close to up to logarithmic factors, compared with the rate of standard Monte Carlo. This makes QMC highly effective in many high-dimensional applications, including the multi-country IRBC of Chapter Chapter 3. The companion notebook 07_Genz_Approximation_and_Loss_Functions.ipynb explores QMC integration using the Genz test functions, a standard suite of integrands for benchmarking high-dimensional quadrature methods.
2.7Automatic Differentiation for DEQNs¶
The quadrature rules of the previous section answer the question “how do we evaluate the expectation in the Euler equation?” This section answers the companion question that every DEQN application faces: “how do we evaluate the derivatives in the Euler equation, the marginal utility , the marginal product , and the derivative of the value function ?” For the textbook Brock--Mirman model of Section 2.4 we derived all of these by hand. But every variation of the model, whether CRRA utility in place of log, CES production in place of Cobb--Douglas, a capital adjustment cost, or a borrowing constraint, would force us to redo pages of algebra before writing the loss. Automatic differentiation (AD) removes that tax: the user writes only the period payoff of the model once, and the first-order conditions, the envelope theorem, and all higher-order sensitivities are computed exactly from the code that evaluates . The result is not a numerical approximation of the Euler equation: it is the textbook Euler equation, produced by a different mechanism.
2.7.1Three ways to take a derivative¶
There are three generic ways to compute a derivative of a programmable function at a point : symbolic, finite differences, and autodiff. Each has a distinct profile.
2.7.1.1Symbolic differentiation¶
uses a computer algebra system (Mathematica, SymPy) to manipulate the algebraic expression for into an algebraic expression for . Its output is exact and human-readable, and for clean analytic formulas it is ideal. Its two weaknesses are (i) expression swell: the chain rule applied to a deeply composed expression can produce a result with many more terms than the original, sometimes enough to defeat further symbolic manipulation; and (ii) it cannot handle arbitrary code with loops and conditional branches.
2.7.1.2Finite differences (FD)¶
approximates the derivative by the definition itself. For the common central-difference variant,
The geometric content of this formula is the mundane fact that the slope of a smooth curve at a point can be approximated by the slope of a nearby secant line. Figure Figure 2.8 makes this concrete: the true tangent at (red) is approximated by the secant connecting to (blue dashed). As shrinks the secant rotates toward the tangent, and its slope converges to ; taking too small, however, forces the numerator to become the difference of two nearly equal floating-point numbers, at which point catastrophic cancellation dominates.
Figure 2.8:Geometric meaning of a central finite difference. The derivative is the slope of the tangent to at (red). Finite differences replace this slope by the slope of the secant through the two nearby points and (blue, dashed). For small , the secant approaches the tangent at rate ; for too small, the subtraction in the numerator becomes unreliable in finite-precision arithmetic and the approximation degrades. Figure Figure 2.9 plots the resulting error against .
Two sources of error compete as changes. The truncation error shrinks as decreases; the rounding error associated with computing the near-zero subtraction in floating-point arithmetic grows as , where is machine precision in double precision. The total error is minimized at the step size that balances the two,
The practical consequence is that central FD loses roughly five digits of precision relative to the input precision. For second derivatives the loss is worse: Hessians obtained by FD are usually unusable for precision-critical work. FD has one undeniable advantage: it treats as a black box, requiring only the ability to evaluate . For debugging or for a one-off comparative-statics check it is fine. For training a DEQN, where the gradient is used millions of times, it is badly suboptimal. Figure Figure 2.9 makes the trade-off visible.
Figure 2.9:The classic U-curve of central finite differences, here for at . Truncation error (red, dashed) falls as ; roundoff error (green, dashed) grows as . Their sum (blue) has a minimum at , where the achievable precision is around 10-11, roughly five digits worse than the input precision. Autodiff, by contrast, attains machine precision at no step-size tuning.
2.7.1.3Automatic differentiation¶
exploits the fact that every piece of Python (or any other imperative language) that evaluates a numerical function is, at run-time, a composition of elementary operations (, , , , , , , , ) whose derivatives are known in closed form. The chain rule composes these local derivatives into the derivative of the whole program, exactly. The cost is a small multiple of one function evaluation, typically 2-- in reverse mode, and is independent of the number of inputs. There is no step-size to tune, and the result is exact to machine precision. For ML-scale problems with 106 parameters and a scalar loss, AD is the only feasible method. The classic computer-science references are Baydin et al. (2018) and Margossian (2019); both motivate AD from machine-learning workloads and lay out the forward/reverse-mode taxonomy used below.
2.7.1.4Where do local derivatives come from? A tiny built-in table.¶
It is worth pausing on a question that beginners reasonably ask: “How does the computer know that the derivative of is ?” The answer is deflating but honest: it does not compute this from first principles. Every autodiff framework ships with a short, hand-written table of derivatives for the elementary operations it supports, a few dozen entries such as those in Table Table 2.3. When your program executes , the framework looks up the rule “gradient of is ” in that table and records on the edge of the computational graph. The chain rule then composes those table entries. Autodiff is not a theorem prover; it is a library of primitive rules plus an automatic application of the chain rule. This explains both its power (every program expressible in terms of these primitives is differentiable) and its limits (genuinely new functions require extending the table).
Table 2.3:A representative slice of the primitive-derivative table that every autodiff framework ships with. TensorFlow, PyTorch, and JAX each include a few hundred such entries: one per elementary operation they support. User code composes these primitives; the chain rule composes their derivatives. When an economist writes the Brock--Mirman period payoff , only entries from this table are invoked; no calculus is performed at runtime.
| Primitive | Local derivative (built in) | Where to find it |
|---|---|---|
tf.math.add | ||
tf.math.multiply | ||
tf.math.pow | ||
tf.math.exp | ||
tf.math.log | ||
tf.math.sin | ||
tf.math.cos | ||
tf.nn.relu |
This explanation also makes the limits of AD concrete. Operations absent from the table (a black-box solver call, a non-differentiable step like sorting or argmax) require either a custom rule (via @tf.custom_gradient) or a redesign of the code. The pitfalls in Section 2.7.4 are essentially cases where either the table’s entry is degenerate at a point (kinks, at ) or the composed gradient is numerically unstable even though each local rule is correct.
2.7.2Computational graph; forward and reverse modes¶
Every numerical function, once evaluated, corresponds to a directed acyclic computational graph whose nodes are elementary operations and whose edges carry intermediate values. Take the toy example evaluated at :
The values along the graph are , , . The edges carry local derivatives: , , . AD combines the edge derivatives by the chain rule in one of two traversal orders. Figure Figure 2.10 shows the graph together with both traversals.
Figure 2.10:The two modes of autodiff on at . Top: forward mode carries a derivative tag alongside each value and reads at the output. Bottom: reverse mode evaluates forward, stores the graph, then walks backwards with and reads at the input. Both deliver 3.584, equal to at machine precision. Forward mode scales linearly with the number of inputs; reverse mode scales linearly with the number of outputs, which is why it wins for DEQN training (scalar loss, many parameters).
2.7.2.1Forward mode¶
carries a derivative tag next to every value and updates both in a single forward pass through the graph. One seeds at the input; by the chain rule, , , and , which is exactly evaluated at . The cost is one extra float per variable and one extra pass through the graph. For a function with inputs and outputs, the cost of the full Jacobian is forward passes; forward mode is therefore attractive when the number of inputs is small.
2.7.2.2Reverse mode¶
first evaluates forward, storing the computational graph and all intermediate values. It then walks the graph backwards, carrying a sensitivity along each edge. Seeded with at the output, the backward pass accumulates
yielding at the input node . The computational cost is 2-- a single forward pass, and, crucially, one reverse pass produces the full gradient with respect to all inputs simultaneously. Reverse mode is what tf.GradientTape, torch.autograd, and jax.grad use by default. The price paid is memory: the entire forward graph must be stored for the backward pass. For extremely long simulations this can be a binding constraint; we return to the point in the pitfalls below.
2.7.2.3Which mode for DEQNs.¶
The DEQN loss is a scalar mean-squared Euler residual, computed from a network with to 105 parameters. Reverse mode produces the full gradient with respect to all network parameters in 2-- one forward pass, a speedup of over forward mode. This is the same reason reverse mode is the workhorse of deep-learning training: it exactly fits the “few outputs, many inputs” regime.
2.7.3The autodiff Euler residual¶
We now apply autodiff to the very object we spent Section 2.4 deriving by hand: the Euler equation of the Brock--Mirman planner. Define the period payoff as a function of the current state and the current choice:
where the three argument slots are the state , the choice , and the exogenous shock (the same slot names the code listing below uses); applying at a date means plugging into slot 1, into slot 2, and into slot 3. This is the only place the model enters. Change , , or and nothing else in what follows needs to move. The Bellman equation of Section 2.4 then reads
2.7.3.1Notation: what do and mean?¶
Throughout this section the subscript names the slot of being differentiated, not a time index. With the three-slot definition of (2.24), we write
The first slot is the state, the second slot is the choice, the third slot is an exogenous parameter and is not differentiated. An expression like therefore denotes the derivative of in its second slot, evaluated with plugged into slot 1, into slot 2, and into slot 3, so it equals . The expression denotes the derivative in the first slot, evaluated at the time- state pair, so it also equals but for a different physical reason (because now sits in the state slot of the period- problem). This overloading is the whole point: the same partial expression handles the FOC in one period and the envelope term in the next.
Two facts, both derived in Section 2.4, are now worth restating in terms of :
- FOC w.r.t.\ the choice :
- Envelope at the optimum, evaluated at the state :
- where is the optimal policy.
Substituting the envelope (evaluated one period ahead, so that sits in the state slot) into the FOC, the familiar Euler equation becomes
Here . Both terms in (2.29) are derivatives with respect to the same physical variable , but one treats as the choice of period (slot 2) and the other treats as the state of period (slot 1). Every term is therefore a partial derivative of the same function . Neither nor the marginal product of capital, nor the envelope formula need to be written out by hand. tf.GradientTape records the forward evaluation of and produces both partials on demand; the expectation is approximated by any of the quadrature rules of Section 2.6. The code is shorter than the pen-and-paper counterpart and, more importantly, entirely model-agnostic: swapping log for CRRA utility means editing the body of Pi and nothing else.
def Pi(K_in, K_out, z_in):
Y = z_in * K_in ** alpha
C = Y + (1.0 - delta) * K_in - K_out
return tf.math.log(C)
def euler_residual(K_t, z_t, K_tp1, K_tp2, z_tp1):
# FOC term: d Pi / d K_{t+1} at (K_t, K_{t+1}, z_t)
with tf.GradientTape() as t1:
t1.watch(K_tp1)
pi_t = Pi(K_t, K_tp1, z_t)
d2 = t1.gradient(pi_t, K_tp1)
# Envelope term: d Pi / d K_t at (K_{t+1}, K_{t+2}, z_{t+1})
with tf.GradientTape() as t2:
t2.watch(K_tp1)
pi_tp1 = Pi(K_tp1, K_tp2, z_tp1)
d1 = t2.gradient(pi_tp1, K_tp1)
return d2 + beta * d1Program 2:Autodiff Euler residual for Brock--Mirman. The function \texttt{Pi} is the only model-specific code; the rest is generic. A full implementation lives in the autodiff chapter’s code folder, notebook 02_Brock_Mirman_AutoDiff_DEQN.ipynb.
2.7.3.2Cross-checking the autodiff loss.¶
An important pedagogical point is that the autodiff residual (2.29) and the hand-derived residual of Section 2.4 are the same mathematical object. The companion autodiff notebooks implement both expressions on the same network and report the maximum absolute difference, which in our (seeded) runs is of order 10-6--10-7 in the deterministic case and 10-6--10-5 in the stochastic case with Gauss--Hermite quadrature (float32 arithmetic; float64 tightens this by roughly seven orders of magnitude). In every case the residual is consistent with finite-precision arithmetic, graph ordering, and quadrature accumulation rather than with any difference in the underlying mathematics. Under full depreciation () the trained policy from the autodiff loss matches the analytical closed-form (respectively ) to mean relative error in the deterministic case and on a coarse classroom grid in the stochastic case (the residual rises with the quadrature footprint), confirming that minimizing the autodiff residual recovers the true policy when one is available.
2.7.4Pitfalls of autodiff¶
AD is exact and cheap but it is not magical. Five pitfalls are worth naming; each has a concrete workaround.
2.7.4.1Non-differentiable kinks.¶
Operations such as (ReLU), , , , argmax, , and indicators are non-differentiable at isolated points. Frameworks return a subgradient at the kink -- in TensorFlow and PyTorch, by convention. If the loss repeatedly lands on such a kink, SGD can receive an uninformative gradient and stall. The cure is to either smooth the kink ( for ReLU, for , log-sum-exp for ) or, in complementarity problems, to replace the non-smooth indicator by a Fischer--Burmeister residual (Chapter Chapter 3, Section 3.3.0.8).
2.7.4.2Boundary singularities.¶
The Brock--Mirman loss contains and hence . If the network’s output happens to drive close to zero during training, AD dutifully returns a very large gradient; SGD then takes a very large step and the network explodes or produces NaN. Two cures of different quality are in wide use:
Reparameterize so the domain is respected by architecture. In the Brock--Mirman notebooks, the network outputs the savings share through a sigmoid; this guarantees and simultaneously, at every training iteration, with no penalty term. This is the hard/soft constraint split of Figure Figure 2.6.
Use numerically stable primitives. Prefer
tf.math.log1p(x)tolog(1+x),tf.math.softplusto a hand-coded , andtf.math.xlogy(a,b)to when the convention matters. AD does not see the cancellations these functions implement; the human must.
2.7.4.3Reverse-mode memory.¶
Reverse mode stores the entire forward graph so it can walk it backwards. Unrolling a -step simulation and back-propagating through all of it has memory cost and can exhaust GPU memory on non-trivial models. Standard mitigations are (i) gradient checkpointing (tf.recompute_grad), which recomputes parts of the graph on the backward pass in exchange for memory, (ii) truncated back-propagation through time for long recurrences, and (iii) path-averaging mini-batches of trajectory segments rather than full trajectories (notebook 02 of Day 2 already does this).
2.7.4.4Loss of structural insight.¶
AD returns a number, not an algebraic expression. Useful structural facts -- the cancellation has the sign of the Euler wedge, the Euler residual is homogeneous of a known degree in productivity, and so on -- are invisible to AD and come from the hand derivation. In practice one should retain the ability to derive the FOC on a toy version of the model, precisely as we did in Section 2.4: AD scales that derivation; it does not replace the understanding it provides.
2.7.4.5stop_gradient and the envelope theorem.¶
A subtle point worth emphasizing. The envelope identity holds at the optimum because the FOC kills the term that otherwise contributes to the total derivative of the composed continuation value. During training we are not yet at the optimum. Two natural codings of the next-period term therefore give different numbers: one differentiates through the policy and computes a total derivative of a rollout, while the other treats as a fixed choice and computes the partial derivative used in the Euler FOC. The autodiff residual (2.29) uses the latter, which is what falls out of t2.gradient(pi_tp1, K_tp1) in Listing Program 2. This is the correct residual for checking the Euler equation; differentiating through the policy is a different object. The two codings agree as the FOC residual vanishes, but off the optimum they can disagree.
2.7.4.6Companion notebooks.¶
Three notebooks (in the autodiff chapter’s code folder) put the above into practice, with a fourth provided as self-study:
01_AutoDiff_Analytical_Examples: warm-up (), the FD U-curve regenerated numerically, CRRA utility’s first and second derivative, Cobb--Douglas production’s 2-D gradient field, the capital adjustment cost with AD vs hand partials side by side, and the Hessian of Cobb--Douglas via a nestedGradientTape.02_Brock_Mirman_AutoDiff_DEQN: the loss of (2.29) implemented on the deterministic model of Section 2.4; cross-check against the hand residual at float32 tolerance; cross-check of the trained policy against .03_Brock_Mirman_Uncertainty_AutoDiff_DEQN: the same, with AR(1) productivity and Gauss--Hermite quadrature; cross-check against the stochastic hand residual and the closed-form in a full-depreciation side-experiment.04_IRBC_AutoDiff_DEQN(self-study): the same template scaled up to the multi-country IRBC model of Chapter Chapter 3 (per-country planner Lagrangian, Fischer--Burmeister irreversibility, tensor-product Gauss--Hermite), with a machine-precision cross-check against the Chapter Chapter 3 hand-derived residual.
2.8Data Parallelization with MPI¶
For very large-scale applications (e.g., countries in the IRBC model of Chapter Chapter 3), training can be accelerated by distributing the gradient computation across multiple GPUs or compute nodes. The standard approach uses synchronous data parallelism via MPI Allreduce; in the DEQN paper the corresponding implementation is built on Horovod Sergeev & Del Balso, 2018, which wraps Allreduce into a drop-in replacement for the training optimizer.
Since each worker processes an independent mini-batch and the Allreduce operation averages the gradients, the effective batch size scales linearly with the number of workers. In practice, this yields near-linear speedup for moderate numbers of workers (), with communication overhead becoming significant only for very large clusters. The key advantage for economics applications is that each worker can simulate its own trajectory of the economy, naturally exploring different regions of the state space and improving the diversity of the training distribution.
2.9Choice of Loss Kernel: Beyond Mean-Squared Loss¶
Every DEQN derivation in this script ends with the same step: take the equilibrium residual , square it, average over a mini-batch, and minimize. The squared average, mean-squared error (MSE), is the canonical default, but it is only one of many reasonable reductions of a residual vector to a scalar. Different reductions emphasize different parts of the residual distribution and have visibly different convergence properties, even on the same model with the same network and data. This section makes the trade-off concrete on the stochastic Brock--Mirman model with full depreciation, where the optimal savings rate is known in closed form () so that the deviation between learned and optimal policy can be measured exactly.
2.9.0.1Six kernels.¶
We compare:
MSE: . Quadratic everywhere; large residuals dominate the gradient.
MAE: . Robust to outliers, but the gradient has constant magnitude, so the optimizer takes the same step size regardless of how close to zero a residual already is. This is fine for learning a coarse fit, but it means MAE training plateaus at a finite “noise floor” that is several orders of magnitude above what MSE attains at the same training budget.
Huber: quadratic for , linear for . Smooth hybrid that combines MSE’s fine convergence with MAE’s tail robustness, but the transition at is a kink in the gradient, which can introduce small training-loop oscillations as residuals migrate across the threshold.
Quantile pinball at level : . Targets the signed -quantile of the residual distribution; high emphasizes positive residual tails, low emphasizes negative tails, and all observations contribute with asymmetric weights.
CVaR-style at confidence : mean of the top fraction of . Explicitly trains the worst-case states.
LogCosh: . This kernel is everywhere and combines MSE-like behavior near () with MAE-like saturation in the tails (, gradient saturates at ). Crucially, the transition between the two regimes is smooth: there is no kink at any threshold.
2.9.0.2Convergence behavior.¶
Figure Figure 2.11 shows the mean, , and of the absolute relative Euler error on a fixed evaluation grid as a function of the training episode, for each of the six kernels. Three patterns are immediate.

Figure 2.11:Convergence of the relative Euler-error distribution under six different loss kernels on the stochastic Brock--Mirman model with full depreciation. Same network (232 swish, sigmoid head), same Adam optimizer, same CRN training stream; only the loss kernel changes. Left: mean . Middle: . Right: . All on a log scale. Closed-form benchmark: .
First, MAE clearly stalls at a noise floor an order of magnitude above the other kernels, exactly the constant-gradient pathology described above; the curve is monotone but flattens out and refuses to drop further. Second, MSE and log-cosh achieve essentially the same final mean residual (), and both pull the down with comparable efficacy. Third, the tail-aware kernels (Huber, Quantile, CVaR) typically produce the narrowest spread between mean and max residual but pay a small constant cost in the bulk; the practical pay-off shows up in problems where rare-but-consequential states matter.
2.9.0.3Why log-cosh tends to converge most smoothly.¶
Of the six kernels, log-cosh is the one whose convergence curve is essentially monotone with no spikes. This is not an accident: is a convenient smooth interpolant between near zero and in the tails, with no kink at any threshold. Near the optimum its gradient is (linear, like MSE), so it inherits MSE’s fine-grained convergence. Far from the optimum its gradient saturates at (like MAE), so an unusually large residual cannot dominate the mini-batch gradient. In effect, log-cosh is what one would design if asked for “MSE near zero and MAE in the tails, smoothly”. For practical DEQN settings where the residual distribution is unknown in advance, log-cosh is therefore a useful default to compare against MSE rather than a mere robustness afterthought.
2.9.0.4Why this is an economic comparison.¶
The relative Euler-equation residual already has a direct economic interpretation: a relative Euler error translates approximately into a per-period consumption error along the simulated path Judd, 1998. The mean / / panels of Figure Figure 2.11 therefore measure, in interpretable units, the average and tail consumption mistakes that each loss kernel leaves behind. The point of the experiment is that the training-loss ranking is not the same as the ranking on this economic metric: a kernel that achieves a slightly higher mean residual but a much narrower tail produces smaller worst-case consumption errors, which is what matters when rare states are economically consequential (occasionally binding constraints, fat-tailed shocks, tipping risks). The training loss is the instrument; the relative Euler error along simulated paths is the criterion. Choosing the kernel with that hierarchy in mind is part of the modeling decision. The companion notebook pushes this one step further by collapsing the per-period error distribution into a single consumption-equivalent welfare loss against ; readers who want a single-number summary of the trade-off should consult that table directly.
The full experiment, including a path-residual histogram, a policy-error heatmap on the plane, and a CE-welfare-loss summary table, is in the companion notebook 05_StochasticBM_LossComparison.ipynb.
2.9.0.5Extensions of the basic DEQN template.¶
The Brock--Mirman model establishes the core DEQN recipe, but it is only the starting point. Later chapters modify the same template in several directions: Chapter Chapter 5 adds lifecycle structure and complementarity constraints, Chapter Chapter 6 replaces a finite vector of states by a histogram representation of the cross-sectional distribution, Section Section 6.7 then shows that one can also feed shock histories to the network rather than the current endogenous aggregate state, and Chapter Chapter 11 extends the method to nonstationary climate-economy models with pseudo-states. The equilibrium logic is the same in all cases: choose a network parameterization, simulate the model forward, evaluate equilibrium residuals, and update the network by gradient descent.
2.9.0.6Code examples.¶
The following Jupyter notebooks implement and extend the material in this chapter. Notebooks 01 and 02 illustrate the sampling progression that is pedagogically central to the method: 01 uses uniform random states to isolate the loss-and-architecture mechanics, while 02 replaces the exogenous grid by simulated trajectories and learns on the model’s ergodic set (Section 2.3); both derive the FOC and apply the envelope theorem on paper before writing the loss. Notebooks 03 and 04 introduce additional techniques (endogenous labor, a KKT-constrained labor-time ceiling encoded via a Fischer--Burmeister complementarity function, and a six-period OLG extension) that anticipate material developed formally in Chapters Chapter 3 and Chapter 5. Three further notebooks that re-solve the same two Brock--Mirman models with an autodiff loss (illustrating the template of Section 2.7) are placed in the autodiff-chapter code folder as the autodiff primer: 01_AutoDiff_Analytical_Examples.ipynb, 02_Brock_Mirman_AutoDiff_DEQN.ipynb, 03_Brock_Mirman_Uncertainty_AutoDiff_DEQN.ipynb.
01_Brock_Mirman_1972_DEQN.ipynb: deterministic Brock--Mirman; exogenous uniform sampling; hand-derived FOC + envelope.02_Brock_Mirman_Uncertainty_DEQN.ipynb: stochastic Brock--Mirman (, ); simulation-based sampling on the ergodic set; hand-derived FOC + envelope.03_DEQN_Exercises_Blanks.ipynb: four guided exercises (endogenous labor, KKT + Fischer--Burmeister, simple OLG extension); blank version.04_DEQN_Exercises_Solutions.ipynb: the same four exercises with complete solutions.05_StochasticBM_LossComparison.ipynb: the stochastic Brock--Mirman is re-solved six times with identical network, optimizer, and CRN training data, and only the loss kernel changes (MSE, MAE, Huber, quantile pinball, CVaR, log-cosh). The notebook switches to full depreciation so that the closed-form optimal savings rate is available, then evaluates each trained policy on the economic metric: the relative Euler-equation error along a simulated path, plus the consumption-equivalent welfare loss against . The exercise makes concrete that the choice of training loss and the convergence of the metric we ultimately care about are not the same thing, and that tail-aware kernels (Huber, CVaR, quantile pinball) trade a small loss in the bulk for a much cleaner .
2.10Further Reading¶
Azinovic et al. (2022), the foundational DEQN paper; required reading.
Maliar et al. (2021), “all-in-one” deep learning, an alternative formulation discussed in Chapter Chapter 6.
Fernández-Villaverde et al. (2024), broad survey of deep learning in macro, situating DEQNs against PINNs and other approaches.
Judd (1998), the classical numerical-methods reference whose toolkit DEQNs supplement (rather than replace); especially Chapters 7 and 7.5 on Gauss--Hermite quadrature and monomial rules underlying Section 2.6.
Stroud (1971), the canonical reference for the monomial cubature formulas of Section 2.6.3.
Pichler (2011) and Maliar & Maliar (2014), monomial integration in large-scale dynamic economic models.
2.11Exercises¶
Worked solutions and guidance for these exercises appear in Appendix Appendix F.
In the sense of Brumm & Scheidegger (2017): an approximation of the equilibrium policy (or value) functions over the entire economically relevant region of the state space, in particular over the model’s ergodic set, as opposed to a local (perturbation) solution that is accurate only in a neighborhood of the deterministic steady state.
- Azinovic, M., Gaegauf, L., & Scheidegger, S. (2022). DEEP EQUILIBRIUM NETS. International Economic Review, 63(4), 1471–1525. 10.1111/iere.12575
- Bellman, R. (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press.
- Judd, K. L., Maliar, L., & Maliar, S. (2011). Numerically Stable and Accurate Stochastic Simulation Approaches for Solving Dynamic Economic Models. Quantitative Economics, 2(2), 173–210.
- Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). On the Surprising Behavior of Distance Metrics in High Dimensional Space. Database Theory — ICDT 2001, 1973, 420–434. 10.1007/3-540-44503-X_27
- Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.
- Marcet, A. (1988). Solving Non-Linear Stochastic Models by Parameterizing Expectations. Carnegie Mellon University Working Paper.
- Marcet, A., & Marshall, D. A. (1994). Solving Nonlinear Rational Expectations Models By Parametrized Expectations: Convergence to Stationary Solutions [Discussion Paper 91]. Federal Reserve Bank of Minneapolis.
- Sirignano, J., & Spiliopoulos, K. (2018). DGM: A Deep Learning Algorithm for Solving Partial Differential Equations. Journal of Computational Physics, 375, 1339–1364.
- Han, J., Jentzen, A., & E, W. (2018). Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34), 8505–8510.
- Maliar, L., Maliar, S., & Winant, P. (2021). Deep learning for solving dynamic economic models. Journal of Monetary Economics, 122, 76–101.
- Brock, W. A., & Mirman, L. J. (1972). Optimal economic growth and uncertainty: The discounted case. Journal of Economic Theory, 4(3), 479–513.
- Judd, K. L. (1998). Numerical methods in economics. The MIT press.
- Azinovic-Yang, M., & Žemlička, J. (2024). Intergenerational consequences of rare disasters. Available at SSRN 4386477. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4386477
- Gerstner, T., & Griebel, M. (1998). Numerical Integration Using Sparse Grids. Numerical Algorithms, 18, 209–232.
- Heiss, F., & Winschel, V. (2008). Likelihood Approximation by Numerical Integration on Sparse Grids. Journal of Econometrics, 144(1), 62–80.