The OLG models of Chapter 5 featured a finite number of agent types, so the cross-sectional state was simply a finite-dimensional vector. Many important macroeconomic applications instead require a continuum of agents subject to idiosyncratic shocks. In the Krusell & Smith (1998) economy, an aggregate productivity shock additionally moves the cross section in a stochastic way, and the entire wealth distribution then becomes part of the aggregate state. The Aiyagari (1994) model is the special case without aggregate risk: the wealth distribution is fixed in stationary equilibrium and evolves deterministically along transitions, so it is a parameter of the equilibrium rather than a stochastic state variable. Incomplete markets prevent full insurance in both, making explicit distributional tracking essential, but the “master-equation” challenge of treating the distribution as a high-dimensional aggregate state arises only once aggregate risk is added on top of Aiyagari.
Why represent the distribution as a histogram on a discrete grid rather than as a Monte Carlo panel? Two reasons motivate the choice up front. First, a histogram is deterministic, so re-running the same equilibrium gives identical aggregates and the loss is a smooth function of the network weights. Second, it is noise-free, so Euler-equation residuals reflect approximation error rather than Monte Carlo sampling noise (Figure 6.5 illustrates this contrast). This chapter develops Young’s non-stochastic simulation method (Young, 2010) for representing the distribution as a histogram and shows how to embed that method within the DEQN framework of Azinovic et al. (2022) to solve heterogeneous-agent economies with neural network policy functions.
6.0.0.1 How this chapter maps onto the slides and notebooks.
The heterogeneous-agent material of this chapter, together with the companion deck in lecture_09_heterogeneous_agents_youngs_method/slides/, can be read independently of the sequence-space material in Section 6.7; readers already comfortable with Krusell--Smith may skip straight there on a first pass. Two notebooks in lecture_09_heterogeneous_agents_youngs_method/code/ accompany Sections 6.3--6.4: lecture_09_10_Youngs_Method_Examples.ipynb isolates Young’s redistribution operator on toy examples, and lecture_09_11_Continuum_of_Agents_DEQN.ipynb embeds the same operator inside the Appendix A.5 endowment-economy DEQN. Section 6.6 on alternative deep-learning approaches is paired with lecture_09_12_KrusellSmith_DeepLearning.ipynb, a classroom-scale all-in-one DL solver in the spirit of Maliar et al. (2021).
6.0.0.2 The Bewley--Huggett--Aiyagari lineage.
The continuum-agent framework that this chapter operationalises has three foundational sources. Bewley (1986) introduced stationary monetary equilibrium with a continuum of agents subject to iid endowment shocks; the explicit self-insurance-through-borrowing-constraints mechanism that defines the modern incomplete-markets workhorse is due to İmrohoroğlu (1989), Huggett (1993), and Aiyagari (1994). Huggett (1993) cast the idea as a tractable endowment economy with a single risk-free asset, focusing on the equilibrium interest rate. Aiyagari (1994) added neoclassical production, closing the model in general equilibrium with capital accumulation; this is the canonical incomplete-markets economy. Krusell & Smith (1998) layered aggregate productivity shocks on top, producing the modern Krusell--Smith economy that this chapter targets, and its continuous-time reformulation is developed by Achdou et al. (2022) and revisited in Chapter 8.
Figure 6.1: Genealogy of the heterogeneous-agent models treated in this script. This chapter targets the Krusell--Smith branch (right) by combining a DEQN policy with Young’s histogram update; the continuous-time branch (Achdou--Han--Lasry--Lions--Moll) is taken up in Chapter 8.
6.1 From Representative to Heterogeneous Agents
In the representative-agent models of Chapters 2--3, aggregate capital is a sufficient statistic for the economy’s state. In reality, agents differ in wealth, income, and employment status, and incomplete markets prevent full insurance against idiosyncratic risk. This heterogeneity matters for policy analysis:
- Fiscal stimulus: agents near the borrowing constraint have high marginal propensities to consume; wealthy agents save most of a windfall.
- Monetary policy affects borrowers and savers differently.
- The shape of the wealth distribution feeds back into aggregate demand and equilibrium prices.
The mathematical challenge depends on whether the economy carries aggregate risk. Without it (the Aiyagari case), the cross-section is fixed in stationary equilibrium and evolves deterministically along transitions, so it parameterizes the equilibrium rather than serving as a stochastic state variable. With aggregate risk (the Krusell--Smith case that this chapter targets), an aggregate shock moves the cross-section in a stochastic way, and the entire distribution then becomes part of the aggregate state. Since the cross-sectional distribution is in either case a measure (an infinite-dimensional object), it cannot be placed on a grid without encountering the curse of dimensionality.
6.1.0.1 Two approaches.
There are two main strategies for handling heterogeneous agents in discrete time. Finite-agent methods keep a high-dimensional but finite vector of individual states, which is exact for a given number of agents $N$ but pays a permutation-symmetry cost that scales poorly with $N$ (the OLG models of Chapter 5, and the finite-agent Monte-Carlo implementation of the continuum-agent Krusell--Smith model by Maliar et al. (2010), fall in this category). Continuum-agent methods replace the agent labels with the cross-sectional distribution as the state, which is infinite-dimensional but eliminates sampling noise and the permutation problem, at the price of needing to approximate that distribution (Krusell & Smith (1998) and Young (2010) are the canonical references). This chapter pursues the second approach: tracking the distribution via a histogram, with Young’s histogram method (Young, 2010) providing the distribution update.
6.2 The Krusell--Smith (1998) Economy
The canonical heterogeneous-agent model with aggregate uncertainty is due to Krusell & Smith (1998). Its key features are:
- A continuum of ex ante identical, infinitely-lived agents.
- Incomplete markets: no insurance against idiosyncratic labor income risk.
- Aggregate uncertainty: productivity shocks that affect all agents.
- A single asset (capital/bonds) with a borrowing constraint.
Each agent must forecast future prices to make optimal savings decisions. But prices depend on the entire wealth distribution, which is infinite-dimensional. The key insight of Krusell & Smith (1998) is that the mean of the wealth distribution is a nearly sufficient statistic for forecasting: a log-linear forecasting rule
$$\log \hat{K}' = a_0(z) + a_1(z)\,\log K \tag{6.1}$$
where $\hat{K}'$ is the approximate next-period aggregate capital (the price-forecasting function), $a_0(z)$ and $a_1(z)$ are the OLS intercept and slope coefficients (each a function of the aggregate shock state $z$), and $K$ is mean capital; this rule achieves an $R^2$ extremely close to one in practice.
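The regression behind the forecasting rule can be sketched in a few lines. This is a minimal illustration, not the original Krusell--Smith code: the function name `fit_forecast_rule`, the synthetic data, and the coefficient values in `true` are all assumptions made for the example. It fits one OLS line per aggregate state and reports the corresponding $R^2$.

```python
import numpy as np

def fit_forecast_rule(K, z, n_z=2):
    """Fit log K_{t+1} = a0(z_t) + a1(z_t) * log K_t by OLS, one regression per z.

    K : realized mean capital, length T
    z : integer aggregate-shock indices, length T
    Returns (coef, r2) with coef[s] = (a0, a1) and r2[s] the fit in state s.
    """
    logK = np.log(K)
    coef = np.zeros((n_z, 2))
    r2 = np.zeros(n_z)
    for s in range(n_z):
        idx = np.where(z[:-1] == s)[0]            # periods t with z_t = s
        X = np.column_stack([np.ones(idx.size), logK[idx]])
        y = logK[idx + 1]
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        coef[s] = beta
        r2[s] = 1.0 - resid.var() / y.var()
    return coef, r2

# Synthetic check: simulate from a known (hypothetical) rule and recover it.
rng = np.random.default_rng(0)
T = 2000
true = np.array([[0.10, 0.95], [0.14, 0.95]])     # illustrative values only
z = rng.integers(0, 2, size=T)
K = np.empty(T)
K[0] = 10.0
for t in range(T - 1):
    a0, a1 = true[z[t]]
    K[t + 1] = np.exp(a0 + a1 * np.log(K[t]) + 1e-4 * rng.standard_normal())

coef, r2 = fit_forecast_rule(K, z)
```

In an actual Krusell--Smith run, `K` and `z` would come from the simulated economy rather than from a known rule, and `coef` would feed back into the household problem.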
6.2.0.1 Equilibrium.
A recursive competitive equilibrium consists of:
- Individual optimization: each agent chooses savings to maximize utility given the forecasting rule.
- Market clearing: aggregate demand equals supply.
- Consistent distribution evolution: the distribution evolves according to the law of motion implied by the optimal policies.
- Rational expectations: agents’ forecasts are self-confirming.
The question is: how do we simulate the distribution forward to compute the realized mean and check the forecasting rule?
6.2.0.2 The traditional Krusell--Smith algorithm in detail.
Before turning to the distribution-update step, it is useful to write the canonical KS algorithm explicitly. It is a nested fixed-point iteration with an outer loop over forecasting-rule coefficients and an inner loop that solves the individual household’s Bellman equation given those coefficients.
The remarkable empirical finding of Krusell & Smith (1998) is that a log-linear rule in mean capital alone achieves an $R^2$ very close to one and very small forecast errors: the “approximate aggregation” result. Textbook implementations (e.g., the econ-ark/KrusellSmith REMARK) typically converge in the outer loop under the standard calibration: log utility, a persistent two-state aggregate productivity shock, and a lower unemployment rate in the good state than in the bad state.
6.2.0.3 Bottleneck.
The inner Bellman solve on a two-dimensional grid is the expensive step. It scales exponentially if one tries to add additional moments (e.g., tracking variance or skewness), which is why the KS algorithm as stated caps out at one moment. Extensions that need more moments must use more powerful function approximators. This is where the Young-histogram DEQN of Section 6.4 (and the alternatives in Section 6.6) come in.
6.3 Young’s (2010) Non-Stochastic Simulation
Young (2010) proposed a deterministic method for propagating the wealth distribution that avoids Monte Carlo sampling entirely.
6.3.0.1 Core idea.
Represent the wealth distribution at time $t$ as a histogram over discrete bins, using two ingredients:
- a fixed capital grid $k_1 < k_2 < \dots < k_{N_k}$, where $N_k$ is the number of bins and $i$ indexes individual grid points $k_i$;
- a finite set of idiosyncratic-shock states $\{\varepsilon_1, \dots, \varepsilon_{N_\varepsilon}\}$, indexed by $j$.
The histogram value $G_t(i,j)$ is the probability mass of agents sitting at capital bin $i$ in shock state $j$ at time $t$; by construction, $\sum_{i,j} G_t(i,j) = 1$.
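Given the household policy function $k'(k, \varepsilon; K, z)$, where:
- $k$ is the individual’s current capital and $\varepsilon$ their current idiosyncratic shock;
- $K$ is aggregate mean capital (the summary statistic from Krusell--Smith);
- $z$ is the aggregate productivity shock (the same $z$ that drives prices in the KS setup above);
the key operation is mass redistribution. Evaluating $k'$ at a bin $(k_i, \varepsilon_j)$ produces the savings target $k'_{ij} = k'(k_i, \varepsilon_j; K, z)$, which generically lies off the grid. Let $J$ denote the bracketing index defined by $k_J \le k'_{ij} < k_{J+1}$, and let
$$m = G_t(i,j) \tag{6.2}$$
be the probability mass currently sitting at bin $(i,j)$. This mass is then split between the two bracketing grid points $k_J$ and $k_{J+1}$ using linear-interpolation weights:
$$\omega = \frac{k_{J+1} - k'_{ij}}{k_{J+1} - k_J}, \qquad \omega\, m \ \text{goes to}\ k_J, \qquad (1-\omega)\, m \ \text{goes to}\ k_{J+1}. \tag{6.3}$$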
Figure 6.2 illustrates this operation. A mass $m$ sitting at the off-grid savings target $k'$ is split between the two bracketing grid points $k_J$ and $k_{J+1}$, with the weight $\omega$ on $k_J$ proportional to the proximity of $k'$ to $k_J$ (so $\omega = 1$ when $k' = k_J$ and $\omega = 0$ when $k' = k_{J+1}$).
Figure 6.2: Linear interpolation in Young’s method. Mass at off-grid point $k'$ is redistributed to the two bracketing grid points $k_J$ and $k_{J+1}$ using weights $\omega$ and $1-\omega$. Closer proximity to a grid point yields a larger share of the mass.
Why exactly this weight? The lottery weight in Eq. (6.3) is not an arbitrary choice: it is the unique value for which the two-point split has conditional mean equal to the off-grid policy choice $k'$. Solving the mean-preservation equation for $\omega$ recovers Eq. (6.3) in one line:
$$\omega\, k_J + (1-\omega)\, k_{J+1} = k' \quad\Longrightarrow\quad \omega = \frac{k_{J+1} - k'}{k_{J+1} - k_J}. \tag{6.4}$$
By linearity of expectation, this conditional mean equality at every grid bin extends to the full distribution: the unconditional mean of the updated histogram equals the unconditional mean of the policy-implied next-period capital. Higher moments (variance, percentiles) are only approximated; the leading error shrinks with the grid spacing on smooth densities, so a finer grid improves higher-moment fidelity at no cost to mean preservation.
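A quick numerical check makes both statements concrete. The sketch below uses illustrative numbers (the bracket $[2, 3]$ and the target $2.3$ are assumptions for the example): the two-point split reproduces the target mean exactly, while it injects a conditional variance of $\omega(1-\omega)(k_{J+1}-k_J)^2$, which is the sense in which higher moments are only approximated.

```python
def lottery_weight(x, k_lo, k_hi):
    """Weight on k_lo that makes the split {k_lo, k_hi} have mean x."""
    return (k_hi - x) / (k_hi - k_lo)

k_lo, k_hi, x = 2.0, 3.0, 2.3            # illustrative bracket and off-grid target
omega = lottery_weight(x, k_lo, k_hi)

# Mean of the two-point lottery: equals the target x.
mean = omega * k_lo + (1 - omega) * k_hi

# Variance of the lottery around x: spurious dispersion added by the grid.
var = omega * (k_lo - x) ** 2 + (1 - omega) * (k_hi - x) ** 2
var_formula = omega * (1 - omega) * (k_hi - k_lo) ** 2
```

Because the injected variance scales with the squared bin width, halving the grid spacing cuts this per-step dispersion by a factor of four in this calculation.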
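6.3.0.2 Worked example.
We illustrate the full histogram update with a small grid of four capital levels, $k \in \{1, 2, 3, 4\}$, and two idiosyncratic states $\varepsilon \in \{\varepsilon_1, \varepsilon_2\}$.
Step 1, Setup. The initial histogram $G_t$ (masses summing to 1):
| | $k_1 = 1$ | $k_2 = 2$ | $k_3 = 3$ | $k_4 = 4$ | Row sum |
|---|---|---|---|---|---|
| $\varepsilon_1$ | 0.10 | 0.20 | 0.10 | 0.05 | 0.45 |
| $\varepsilon_2$ | 0.05 | 0.15 | 0.20 | 0.15 | 0.55 |
The mean capital is $K_t = 0.15 \cdot 1 + 0.35 \cdot 2 + 0.30 \cdot 3 + 0.20 \cdot 4 = 2.55$.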
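Step 2, Policy evaluation. Let $e(\varepsilon_j)$ denote the dollar productivity associated with shock state $\varepsilon_j$ (here $\varepsilon_j$ is the state index, $e(\varepsilon_j)$ is its numerical value). Using a simple savings rule that is linear in $k$ and $e$, the policy values $k'(k_i, \varepsilon_j)$ are as follows (they are consistent with $k' = 0.4k + 0.5$ in state $\varepsilon_1$ and $k' = 0.4k + 1.5$ in state $\varepsilon_2$):
| | $k_1 = 1$ | $k_2 = 2$ | $k_3 = 3$ | $k_4 = 4$ |
|---|---|---|---|---|
| $k'(\cdot, \varepsilon_1)$ | 0.9 | 1.3 | 1.7 | 2.1 |
| $k'(\cdot, \varepsilon_2)$ | 1.9 | 2.3 | 2.7 | 3.1 |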
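Step 3, Interpolation weights. Since the grid spacing is $\Delta k = 1$, each off-grid $k'$ is bracketed by $[k_J, k_{J+1}]$ with weight $\omega = k_{J+1} - k'$ on the lower point:
| Bracket, $\omega$ | $k_1 = 1$ | $k_2 = 2$ | $k_3 = 3$ | $k_4 = 4$ |
|---|---|---|---|---|
| $\varepsilon_1$ | below grid, clip | $[k_1, k_2]$, 0.7 | $[k_1, k_2]$, 0.3 | $[k_2, k_3]$, 0.9 |
| $\varepsilon_2$ | $[k_1, k_2]$, 0.1 | $[k_2, k_3]$, 0.7 | $[k_2, k_3]$, 0.3 | $[k_3, k_4]$, 0.9 |
When $k' < k_1$ (here $k' = 0.9 < 1$ for the low-state, lowest-capital agents), all mass is assigned to $k_1$ (clipped to the boundary).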
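Step 4, Mass redistribution. For simplicity, assume the shock transition is the identity ($\pi_\varepsilon(\varepsilon' \mid \varepsilon) = 1$ if $\varepsilon' = \varepsilon$ and $0$ otherwise), so mass stays in its current $\varepsilon$-state. We build $G_{t+1}$ by accumulating the redistributed mass from each source bin. Mass is conserved bin-by-bin: for example, the $(k_2, \varepsilon_1)$ source of mass 0.20 contributes $0.7 \times 0.20 = 0.14$ to $k_1$ and $0.3 \times 0.20 = 0.06$ to $k_2$.
| | $k_1 = 1$ | $k_2 = 2$ | $k_3 = 3$ | $k_4 = 4$ | Row sum |
|---|---|---|---|---|---|
| $\varepsilon_1$ | 0.270 | 0.175 | 0.005 | 0.000 | 0.450 |
| $\varepsilon_2$ | 0.005 | 0.210 | 0.320 | 0.015 | 0.550 |
| total | 0.275 | 0.385 | 0.325 | 0.015 | 1.000 |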
Step 5, Mean verification. The mean of $G_{t+1}$ is $0.275 \cdot 1 + 0.385 \cdot 2 + 0.325 \cdot 3 + 0.015 \cdot 4 = 2.08$. The unclipped, policy-implied mean is $\sum_{i,j} G_t(i,j)\, k'(k_i, \varepsilon_j) = 2.07$, so boundary clipping at $k_1$ raises the mean by only 0.01. Mean preservation is exact for source bins whose policy choice lies strictly inside the grid; clipping at the boundary slightly biases the mean, here upward, because mass that would have landed at $k' = 0.9$ is forced to $k_1 = 1$. In general, boundary clipping biases the mean in the direction of the grid interior: clipping at the lower bound $k_1$ biases the mean upward (mass is pulled in from below the grid), while clipping at the upper bound $k_{N_k}$ biases it downward. With a wider grid ($k_1 \le 0.9$) the mean would be preserved exactly.
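The five steps can be replayed numerically. The following sketch hard-codes the grid, histogram, and policy values from the tables above; the array layout (`G[i, j]` with capital bins in rows and shock states in columns) is an assumption that matches the convention used in the code later in this section. It uses the identity shock transition of Step 4.

```python
import numpy as np

k_grid = np.array([1.0, 2.0, 3.0, 4.0])
# G[i, j]: rows are capital bins k_1..k_4, columns are shock states eps_1, eps_2.
G = np.array([[0.10, 0.05],
              [0.20, 0.15],
              [0.10, 0.20],
              [0.05, 0.15]])
# Policy values k'(k_i, eps_j) from Step 2.
kp = np.array([[0.9, 1.9],
               [1.3, 2.3],
               [1.7, 2.7],
               [2.1, 3.1]])

G_next = np.zeros_like(G)
for i in range(4):
    for j in range(2):
        x = min(max(kp[i, j], k_grid[0]), k_grid[-1])  # clip target to the grid
        J = min(int(x - k_grid[0]), 2)                 # bracket index (unit spacing)
        omega = k_grid[J + 1] - x                      # weight on lower point k_J
        G_next[J, j] += omega * G[i, j]                # identity shock transition:
        G_next[J + 1, j] += (1 - omega) * G[i, j]      # mass stays in column j

mean_next = float((G_next.sum(axis=1) * k_grid).sum())   # mean of G_{t+1}
mean_policy = float((G * kp).sum())                      # unclipped policy-implied mean
```

The gap `mean_next - mean_policy` isolates the clipping bias: it equals the clipped mass (0.10) times the distance from the target 0.9 to the boundary 1.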
The full picture: one cell splits into four. The worked example above used identity shock transitions to keep the arithmetic visible. The general case combines the capital lottery with a shock fork: each piece of mass split between $k_J$ and $k_{J+1}$ is then split again across the reachable next-period values $\varepsilon'$ according to $\pi_\varepsilon(\varepsilon' \mid \varepsilon)$. With two shocks this produces four destination bins for every source bin, with weights given by the products $\omega\, \pi_\varepsilon(\varepsilon' \mid \varepsilon)$ and $(1-\omega)\, \pi_\varepsilon(\varepsilon' \mid \varepsilon)$. Figure 6.3 reproduces this two-stage cascade (essentially Fig. 1 of Young (2010)), annotated with a concrete numerical case.
Figure 6.3: Young’s cascade for one source bin (essentially Fig. 1 of Young (2010)). Mass $m$ at $(k_i, \varepsilon_j)$ flows in two stages: the capital lottery ($\omega$ vs. $1-\omega$) sends it to the bracketing grid points $k_J$ and $k_{J+1}$, and the shock fork ($\pi_\varepsilon(\varepsilon_1 \mid \varepsilon_j)$ vs. $\pi_\varepsilon(\varepsilon_2 \mid \varepsilon_j)$) splits each piece across the reachable next-period idiosyncratic states. Each of the four leaves receives the product of its stage-1 and stage-2 weights times the source mass; the four leaf masses sum back to $m$. Repeating the cascade for every active source bin and accumulating the leaves yields the new histogram $G_{t+1}$. Verification: the conditional mean of next-period capital is $\omega\, k_J + (1-\omega)\, k_{J+1} = k'$, by Eq. (6.4).
A reader implementing the method should recognize three properties from the figure: (i) mass is conserved bin-by-bin because $\omega + (1-\omega) = 1$ and $\sum_{\varepsilon'} \pi_\varepsilon(\varepsilon' \mid \varepsilon) = 1$; (ii) the capital lottery’s expected next-period $k$ equals the policy choice $k'$ by Eq. (6.4); (iii) the entire update is linear in $G_t$, so it is a sparse matrix-vector product $\operatorname{vec}(G_{t+1}) = T\, \operatorname{vec}(G_t)$ where the transition operator $T$ depends on the current policy. Because the lottery weights are in turn affine in the policy value for a fixed bracket, the histogram update is differentiable in the policy values almost everywhere, conditional on the interpolation brackets, when we embed it inside a neural-network training loop in Section 6.4. Index changes at bin boundaries and clipping at domain edges are nondifferentiable; standard implementations either ignore these events, smooth the assignment, or stop gradients through the index-selection step, and in practice none of these choices materially affects training in the calibrations covered in this chapter.
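Property (iii) can be made explicit by assembling the transition operator as a matrix. The sketch below is illustrative only: the helper name `young_matrix`, the row-major flattening of the histogram, and the toy policy and transition matrix are assumptions for the example, and a dense array is used for readability where a production code would use a sparse format.

```python
import numpy as np

def young_matrix(kp, pi_eps, k_grid):
    """Dense T such that G_next.ravel() = T @ G.ravel() (row-major flattening).

    kp[i, j]       policy choice at (k_grid[i], eps_j)
    pi_eps[j, jp]  Pr(eps' = jp | eps = j)
    """
    n_k, n_e = kp.shape
    x = np.clip(kp, k_grid[0], k_grid[-1])
    J = np.clip(np.searchsorted(k_grid, x, side="right") - 1, 0, n_k - 2)
    omega = (k_grid[J + 1] - x) / (k_grid[J + 1] - k_grid[J])
    T = np.zeros((n_k * n_e, n_k * n_e))
    for i in range(n_k):
        for j in range(n_e):
            src = i * n_e + j                       # flat index of the source bin
            for jp in range(n_e):
                w = pi_eps[j, jp]
                T[J[i, j] * n_e + jp, src] += omega[i, j] * w
                T[(J[i, j] + 1) * n_e + jp, src] += (1.0 - omega[i, j]) * w
    return T

# Toy inputs (hypothetical policy and transition matrix).
k_grid = np.linspace(0.0, 4.0, 5)
kp = 0.5 * k_grid[:, None] + np.array([0.3, 0.9])
pi_eps = np.array([[0.9, 0.1], [0.2, 0.8]])
T = young_matrix(kp, pi_eps, k_grid)

G = np.full((5, 2), 0.1)                            # uniform histogram, sums to 1
G_next = (T @ G.ravel()).reshape(G.shape)
```

Each column of `T` has at most four nonzero entries (the four leaves of the cascade) and sums to one, which is the matrix form of bin-by-bin mass conservation.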
6.3.0.3 The histogram update algorithm.
At each time step, the histogram is updated deterministically:
Implementation cheatsheet. The pseudocode above translates into a few lines of NumPy; the inner double loop can also be vectorised with np.add.at for production use.
```python
import numpy as np

def young_step(G, kp, pi_eps, k_grid):
    """One Young (2010) histogram update on a uniform k-grid.

    G[i,j]        mass at (k_grid[i], eps_j), sums to 1
    kp[i,j]       policy choice k'(k_i, eps_j), possibly off-grid
    pi_eps[j,jp]  = Pr(eps_{t+1}=jp | eps_t=j)
    """
    G_next = np.zeros_like(G)
    dk = k_grid[1] - k_grid[0]                # uniform spacing
    n_k = len(k_grid)
    for j in range(G.shape[1]):               # current shock
        for i in range(G.shape[0]):           # current capital bin
            x = kp[i, j]
            # Boundary handling: clip BOTH the bracket index J AND the
            # lottery weight omega. Otherwise an off-grid x produces
            # omega < 0 or omega > 1 and hence negative or super-unit mass.
            if x <= k_grid[0]:
                J, omega = 0, 1.0             # all mass to the floor
            elif x >= k_grid[-1]:
                J, omega = n_k - 2, 0.0       # all mass to the cap
            else:
                J = int((x - k_grid[0]) // dk)
                J = min(max(J, 0), n_k - 2)
                omega = (k_grid[J + 1] - x) / dk   # in [0, 1]
            for jp in range(G.shape[1]):      # next-period shock
                w = pi_eps[j, jp] * G[i, j]   # shock fork x source mass
                G_next[J, jp] += omega * w
                G_next[J + 1, jp] += (1 - omega) * w
    return G_next
```

The two scatter-add lines, executed once for each of the two next-period shocks, correspond exactly to the four leaves of Figure 6.3: each leaf receives omega or (1-omega) from the capital lottery, multiplied by pi_eps[j, jp] from the shock fork, multiplied by the source mass. The Krusell--Smith JAX tutorial in lectures/lecture_10_sequence_space_deqns/code/lecture_10_KrusellSmith_Tutorial_CPU.ipynb implements this same operation as distribution_step, vectorised over the grid via jax.vmap and accumulated with .at[ ].add( ).
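The loop-free variant mentioned above can be sketched as follows. This is one possible vectorisation, not the notebook's code: the function name `young_step_vec` and the test inputs are assumptions. It applies the capital lottery with `np.add.at` (which accumulates correctly when several sources hit the same bin) and then applies the shock fork as a matrix product.

```python
import numpy as np

def young_step_vec(G, kp, pi_eps, k_grid):
    """Vectorised Young update; works for uniform or non-uniform k_grid."""
    n_k = len(k_grid)
    x = np.clip(kp, k_grid[0], k_grid[-1])            # clip targets to the grid
    J = np.clip(np.searchsorted(k_grid, x, side="right") - 1, 0, n_k - 2)
    omega = (k_grid[J + 1] - x) / (k_grid[J + 1] - k_grid[J])

    # Stage 1, capital lottery: scatter mass to rows J and J+1, same shock column.
    G_k = np.zeros_like(G)
    cols = np.broadcast_to(np.arange(G.shape[1]), G.shape)
    np.add.at(G_k, (J, cols), omega * G)
    np.add.at(G_k, (J + 1, cols), (1.0 - omega) * G)

    # Stage 2, shock fork: G_next[i, jp] = sum_j G_k[i, j] * pi_eps[j, jp].
    return G_k @ pi_eps

# Illustrative inputs (hypothetical policy; all targets lie inside the grid).
rng = np.random.default_rng(1)
k_grid = np.linspace(0.0, 10.0, 51)
kp = 0.9 * k_grid[:, None] + np.array([0.2, 0.8])
pi_eps = np.array([[0.9, 0.1], [0.3, 0.7]])
G = rng.random((51, 2))
G /= G.sum()

G_next = young_step_vec(G, kp, pi_eps, k_grid)
```

Plain fancy-index assignment (`G_k[J, cols] += ...`) would silently drop contributions when two sources share a destination bin, which is why the unbuffered `np.add.at` is used here.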
Figure 6.4 visualizes the five stages of a single forward step.
Figure 6.4: Flow diagram for one forward step of Young’s histogram update (Algorithm 6.1). Starting from $G_t$, the policy function is evaluated at every active bin, the resulting off-grid savings are interpolated back onto the grid, and idiosyncratic shock transitions redistribute mass across $\varepsilon$-states to produce $G_{t+1}$.
Comparison with Monte Carlo. Young’s method produces zero sampling noise (deterministic), preserves the mean exactly as long as no mass is clipped at the grid boundary, requires only 100--5,000 grid points (versus 50,000 agents for Monte Carlo), and is fully reproducible. The trade-off is that it approximates higher moments and requires a grid that is wide enough to contain all mass. The following table summarizes the comparison:
| | Young’s method | Panel simulation |
|---|---|---|
| Sampling noise | None | $O(N^{-1/2})$ in the number of agents $N$ |
| Mean preservation | Exact | Approximate |
| Typical size | 100--5,000 grid points | 50,000 agents |
| Reproducibility | Deterministic | Seed-dependent |
| Higher moments | Approximated | Approximated |
Figure 6.5 contrasts the two approaches visually: the histogram method yields a smooth, noise-free distribution, while a Monte Carlo panel of comparable size exhibits visible sampling noise.
Figure 6.5: Young’s histogram (left) versus Monte Carlo panel simulation (right). Both approximate the same underlying wealth density (dashed). The histogram method is deterministic and smooth; the Monte Carlo panel exhibits sampling noise that contaminates downstream OLS regressions in the Krusell--Smith algorithm. The bars in this figure are a TikZ schematic illustrating the two regimes; for the actual histograms produced by the algorithm see notebook lecture_09_10_Youngs_Method_Examples in the Lecture-09 code/ folder.
The absence of sampling noise matters for the Krusell--Smith algorithm: Monte Carlo noise in the realized mean contaminates the OLS regression that updates the forecasting rule, potentially destabilizing convergence.
Grid design. Two separate grids are used in practice. The value function grid (typically 150 irregularly spaced points) is finer near the borrowing constraint where the policy function has high curvature and coarser for large $k$ where behavior is smooth. The simulation grid for Young’s histogram (typically 1,000--5,000 uniformly spaced points) uses uniform spacing to produce smooth histograms without artifacts. The upper bound $k_{\max}$ must be chosen large enough that no mass reaches the boundary; if $k' > k_{\max}$ for any agent, all of that agent’s mass is assigned to the last grid point, which violates mean preservation. A practical safeguard is to run a preliminary simulation and verify that the boundary bins contain negligible mass.
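The two-grid setup and the boundary safeguard can be sketched as below. Everything specific here is an assumption for illustration: the power-law spacing in `policy_grid`, its `curvature` parameter, the linear toy policy, the exponential toy histogram, and the choice of inspecting the last five bins.

```python
import numpy as np

def policy_grid(k_min, k_max, n=150, curvature=3.0):
    """Irregular grid, denser near k_min (power spacing is an assumed choice)."""
    u = np.linspace(0.0, 1.0, n)
    return k_min + (k_max - k_min) * u ** curvature

def boundary_mass(G, n_edge=5):
    """Total mass in the top n_edge capital bins of histogram G[i, j]."""
    return float(G[-n_edge:, :].sum())

k_pol = policy_grid(0.0, 50.0)              # grid for solving the household problem
k_sim = np.linspace(0.0, 50.0, 1000)        # uniform grid for Young's histogram

# Hypothetical policy known on the coarse grid, interpolated to the fine one.
kp_pol = 0.5 + 0.9 * k_pol
kp_sim = np.interp(k_sim, k_pol, kp_pol)

# Toy histogram with mass concentrated at low wealth.
G = np.exp(-k_sim / 5.0)[:, None] * np.array([0.5, 0.5])
G /= G.sum()

edge = boundary_mass(G)                     # should be negligible
grid_wide_enough = bool(kp_sim.max() < k_sim[-1])
```

If `edge` is not negligible or `grid_wide_enough` is false after a preliminary simulation, the upper bound of the simulation grid should be raised before trusting the simulated mean.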
The full Krusell--Smith algorithm. Combining value function iteration (VFI) with Young’s simulation yields:
1. Initialize forecasting coefficients $(a_0(z), a_1(z))$.
2. Solve the household problem via VFI given the forecasting rule (6.1).
3. Forward-simulate the distribution for $T$ periods via Young’s method, recording realized means $K_t$.
4. Re-estimate $(a_0(z), a_1(z))$ by OLS on the simulated series $\{K_t, z_t\}$.
5. Check convergence; if not converged, return to step 2.
Young’s non-stochastic simulation makes step 3 essentially noise-free and fast relative to a large Monte Carlo panel. The full outer-loop iteration count and wall-clock time, however, remain implementation- and tolerance-dependent because the VFI solve and forecasting-rule fixed point are still present; the speedup applies to the simulation step, not as a generic wall-clock guarantee for the traditional KS workflow.
6.3.0.5 Convergence and accuracy.
A remarkable empirical finding is that the log-linear forecasting rule (6.1) achieves an $R^2$ extremely close to one in the standard Krusell--Smith economy: the first moment of the wealth distribution is a nearly sufficient statistic for predicting next-period prices. Adding higher moments (variance, skewness) to the forecasting rule barely improves the fit. This “approximate aggregation” result does not hold universally; it relies on the specific features of the Krusell--Smith calibration (small aggregate shocks, moderate borrowing constraint), but it is a useful benchmark against which richer models can be compared.
6.4 DEQN with a Continuum of Agents
The traditional Krusell--Smith algorithm provides the benchmark logic for this chapter, but the classroom DEQN notebook used in the course is the simpler Bewley endowment economy of Appendix A.5 in Azinovic et al. (2022). This distinction is pedagogically useful. Krusell--Smith explains why distribution tracking matters and why Young’s method is valuable inside an outer forecasting-rule loop. Appendix A.5 then shows how the same histogram machinery enters a neural-equilibrium implementation once one replaces the forecasting rule by a price network. By combining Young’s histogram method with neural network policies, the DEQN approach overcomes both main limitations of the traditional KS workflow: the network can condition on the full histogram, and there is no need for a separate forecasting rule because the price network directly takes the distribution as input.
6.4.0.1 What Section 6.4 inherits from Appendix A.5.
Five features of the Appendix-A.5 teaching model anchor the rest of this section and the companion notebook 11_Continuum_of_Agents_DEQN.ipynb; they are the distinguishing departures from the canonical Krusell--Smith calibration of Sections 6.2--6.3:
- Endowment economy, not production. Aggregate output is exogenous rather than produced from capital and labor; there is no capital and no firm problem.
- Bonds in unit net supply. Households trade a single one-period bond at an endogenous price, and the market-clearing condition is that aggregate bond demand equals the unit net supply.
- Epstein--Zin recursive utility. Risk aversion and the inverse IES are separate parameters. The KS benchmark in Section 6.2 instead used log utility.
- Six-state aggregate shock. The aggregate state encodes a 2-state uncertainty regime crossed with a 3-state income level, replacing the 2-state TFP shock of canonical KS.
- Two idiosyncratic productivity types. Labour endowment follows a two-state Markov chain, in place of the employed/unemployed two-state Markov chain of canonical KS.
6.4.0.2 Histogram encoding.
The aggregate state is encoded as a vector containing three aggregate-shock index entries plus $N$ histogram values for each idiosyncratic shock type. In the Appendix A.5 notebook these three entries are the combined six-state aggregate index, the income-level index, and the uncertainty-regime index:
$$X_t = \big(z_t,\ z^{\mathrm{inc}}_t,\ z^{\mathrm{unc}}_t,\ G_t(1,1), \dots, G_t(N,1),\ G_t(1,2), \dots, G_t(N,2)\big) \in \mathbb{R}^{3 + 2N}. \tag{6.9}$$
For $N = 100$ grid points and 2 idiosyncratic states, the aggregate state has dimension $3 + 2 \times 100 = 203$. The full input to the policy network adds the two-dimensional individual state (asset position and idiosyncratic shock), giving total dimension 205. This is the production setting; the checked-in smoke and teaching runs use $N = 50$, so the aggregate state has dimension 103 and the policy input 105. Figure 6.6 shows how the histogram vector and individual state are assembled and fed into the two networks.
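The dimension count can be checked with a small assembly routine. This is a sketch under stated assumptions: the helper names, the ordering of the entries, and the placement of the individual state in front of the aggregate state are choices made for the example and may differ from the notebook.

```python
import numpy as np

def encode_aggregate_state(z_idx, inc_idx, unc_idx, G):
    """Stack three shock-index entries and the two histogram columns of G[i, j]."""
    shock_part = np.array([z_idx, inc_idx, unc_idx], dtype=float)
    return np.concatenate([shock_part, G[:, 0], G[:, 1]])

def policy_input(asset, eps, agg_state):
    """Prepend the two-dimensional individual state to the aggregate state."""
    return np.concatenate([np.array([asset, eps], dtype=float), agg_state])

dims = {}
for N in (100, 50):                          # production and teaching grid sizes
    G = np.full((N, 2), 1.0 / (2 * N))       # placeholder uniform histogram
    X = encode_aggregate_state(4, 1, 1, G)   # illustrative index values
    x_pol = policy_input(0.5, 1.0, X)
    dims[N] = (X.size, x_pol.size)
```

Here `dims` comes out as `{100: (203, 205), 50: (103, 105)}`, matching the production and teaching dimensions quoted in the text.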
6.4.0.3 How the notebooks fit together.
The companion notebook sequence mirrors this decomposition. 10_Youngs_Method_Examples.ipynb isolates Young’s redistribution operator on toy examples, first in one dimension and then with idiosyncratic shocks. 11_Continuum_of_Agents_DEQN.ipynb then reuses the same logic inside the full aggregate state vector (6.9). In other words, the second notebook is not a new distributional device; it is the first notebook’s histogram update embedded in a larger equilibrium-learning loop.
Figure 6.6: Histogram encoding and neural network architecture. The individual state (blue boxes / blue arrows) and the aggregate state (orange box for the three shock-index entries, red boxes for the two histograms) are concatenated as input to the policy network; each input arrow is colored to match its source box and enters the policy-input layer at a distinct horizontal offset, so the five arrows are uniquely identifiable at a glance. The price network receives only the aggregate state. Both networks use softplus output activations.
Neural network architecture. Two networks are trained jointly:
- Policy network: takes the individual plus aggregate state (dimension 205 in the production setting) and outputs savings, the borrowing-constraint multiplier, and the value (3 outputs with softplus activation).
- Price network: takes only the aggregate state (dimension 203 in the production setting) and outputs the bond price (1 output with softplus).
Both networks use two hidden layers with 500 units (production) or 128 (the checked-in smoke/teaching runs).
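The shapes of the two networks can be illustrated with a plain NumPy forward pass in the teaching configuration (128 hidden units, $N = 50$). This is only a sketch: the text does not state the hidden-layer activation or the initialisation, so the ReLU hidden units and the scaled-normal initial weights below are assumptions, as is the convention that the first two policy inputs are the individual state.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights and zero biases for a fully connected network."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def softplus(x):
    """Numerically stable log(1 + exp(x)); strictly positive outputs."""
    return np.logaddexp(0.0, x)

def forward(params, x):
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)       # hidden activation: ReLU (assumed)
    W, b = params[-1]
    return softplus(h @ W + b)               # softplus output layer, as in the text

rng = np.random.default_rng(0)
policy_net = init_mlp(rng, [105, 128, 128, 3])   # savings, multiplier, value
price_net = init_mlp(rng, [103, 128, 128, 1])    # bond price

x_pol = rng.random((8, 105))                 # batch of 8 policy inputs
policy_out = forward(policy_net, x_pol)      # shape (8, 3)
price_out = forward(price_net, x_pol[:, 2:]) # aggregate part only, shape (8, 1)
```

The softplus output layer guarantees positive savings, multipliers, values, and prices by construction, so no separate positivity penalty is needed for these outputs.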
Equilibrium conditions as loss. The loss function comprises five terms, all of which should be zero in equilibrium:
$$\mathcal{L} = \mathrm{EE}^2 + \mathrm{BE}^2 + \mathrm{MC}^2 + \mathrm{KKT}^2 + \mathrm{CB}^2 \tag{6.10}$$
where EE is the Euler equation residual, BE is the Bellman equation consistency condition, MC is the bond market clearing condition, KKT is the borrowing complementarity, and CB penalizes negative consumption. The market-clearing residual is rescaled by the number of aggregate-shock states before squaring, which puts the single market-clearing residual on the same scale as the per-state Euler, Bellman, and complementarity residuals (themselves evaluated in relative terms, divided through by consumption).
6.4.0.4 Market clearing via histogram.
A key advantage of the histogram representation is that market clearing can be computed exactly (no Monte Carlo). The equilibrium condition is $\sum_{i,j} G_t(i,j)\, b'(i,j) = B$; the residual that enters the loss is its deviation from zero,
$$\mathrm{MC}_t = \sum_{i,j} G_t(i,j)\, b'(i,j) - B,$$
where $G_t(i,j)$ is the histogram mass, $b'(i,j)$ is the policy network output (savings) at bin $(i,j)$, and $B$ is aggregate net bond supply. We write the signed residual here because it enters the loss (6.10) squared, which is also how the notebook computes it. The Appendix A.5 notebook normalizes $B = 1$, which is the source of the “-1” term sometimes seen in code and slides.
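In code, the residual is a single weighted sum. The snippet below is a minimal sketch with made-up inputs: the function name, the uniform histogram, and the toy savings rule are assumptions chosen so that the market happens to clear exactly.

```python
import numpy as np

def market_clearing_residual(G, b_next, supply=1.0):
    """Signed residual: histogram-weighted bond demand minus net supply."""
    return float(np.sum(G * b_next) - supply)

# Toy inputs: 50 grid points, 2 idiosyncratic types, uniform mass.
b_grid = np.linspace(0.0, 4.0, 50)
G = np.full((50, 2), 1.0 / 100.0)

# Hypothetical savings rule: everyone saves half of current holdings.
b_next = 0.5 * b_grid[:, None] * np.ones((1, 2))

resid = market_clearing_residual(G, b_next)          # mean savings is 1, so ~0
resid_excess = market_clearing_residual(G, 2.0 * b_next)  # mean savings is 2
```

With `supply=1.0` the subtraction is the “-1” term referred to above; the squared value of this residual is what would enter the loss.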
6.4.0.5 Young’s method inside the training loop.
During each training episode, the histogram is propagated forward using Young’s method (Algorithm 6.1) with the current neural network providing the policy function. This creates a sequence of aggregate states on which the equilibrium residuals (6.10) are evaluated. Young’s redistribution operator is differentiable almost everywhere (linear interpolation conditional on fixed brackets), so the one-step histogram update that enters the market-clearing residual at each sampled state carries gradients back to the policy network. The longer simulated path of aggregate states is treated as data: it is regenerated each episode from the current network and held fixed inside the gradient tape, rather than backpropagated through end to end. Even so the distribution co-evolves with the network: early in training, when the policy network is inaccurate, the simulated path will be “wrong,” but the market-clearing residuals evaluated along it provide corrective feedback through the loss, and as the network improves the distribution converges toward the ergodic steady state.
6.5 Results and Discussion
In the Appendix A.5 teaching model, the DEQN with histogram encoding achieves competitive accuracy: Euler equation errors are of order $10^{-3}$, and market-clearing residuals are comparably small. These figures come from the checked-in teaching/smoke configuration of the companion notebook (a small network trained for a modest number of episodes); the production configuration tightens them further. The broader lesson for the Krusell--Smith benchmark is conceptual rather than model-specific. Compared to the traditional KS algorithm, the DEQN approach has two advantages: (i) the neural network can condition on the full distribution rather than just its mean, providing a richer approximation that can capture situations where higher moments of the distribution matter for prices, and (ii) there is no need for a separate outer loop to update forecasting coefficients, since equilibrium conditions are enforced directly through the loss function.
6.6 Alternative Deep-Learning Approaches to Krusell--Smith
Before turning to the deep-learning alternatives, it is useful to set the histogram-DEQN method in the broader landscape of solution techniques for heterogeneous-agent equilibria with aggregate shocks. Table 6.1 compares classical and modern approaches along four dimensions that drive method choice in practice.
Table 6.1: Heterogeneous-agent solution methods at a glance. The first three rows are classical or finite-difference; the last three are the modern deep-learning families compared in detail in Table 6.2 and the rest of this section.
| Method | Distribution rep. | Aggregate state | Solution principle | Best when |
|---|---|---|---|---|
| Classical KS (Krusell & Smith, 1998) | Panel of agents | First moment(s) | Bounded-rationality fixed point of forecasting rule | Standard incomplete-markets, low-dim aggregate state |
| Reiter, back-loaded (Reiter, 2009) | Histogram on a fixed grid | First-order perturbation around the stationary distribution | Linearize after solving the steady state | Small aggregate shocks, smooth policies |
| Continuous-time (Achdou et al., 2022) | Density solving a KFE PDE | In the limit, the entire density | Coupled HJB+KFE finite differences | PDE-friendly model, smooth densities |
| Histogram-DEQN (Azinovic et al., 2022) | Young histogram on grid | Histogram entries (or moments thereof) | SGD on equilibrium residuals | Strong nonlinearities, occasionally binding constraints |
| All-in-one DL (Maliar et al., 2021) | Panel of agents | Agent-level states, policy and aggregate together | SGD on stacked Euler+market-clearing residuals | Many states per agent, GPU available |
| DeepHAM (Han et al., 2024) | Permutation-invariant set encoder (DeepSets) | Learned generalized moments | Cumulative utility along simulated paths (policy-gradient with structural individual dynamics) | Want a low-dim aggregate state without committing to a moment a priori |
Whereas Table 6.1 is panoramic (classical and DL methods on common axes), Table 6.2 drills into the DL trio along the axes that matter when choosing among them. Histogram-DEQN is not the only deep-learning approach to heterogeneous-agent equilibria, and it is pedagogically useful to see how the space of deep-learning strategies decomposes. Three broad families have emerged in the literature. Two informative axes organize them: how the cross-sectional distribution is represented as input to the network, and what objective is optimized. Histogram-DEQN and All-in-One DL minimize residuals of the structural equilibrium equations; DeepHAM instead maximizes cumulative utility along simulated paths and uses Bellman residuals as a validation diagnostic.
Table 6.2: Three deep-learning approaches to heterogeneous-agent equilibria, all applied to the Krusell--Smith benchmark. They differ on two axes: how the cross-sectional distribution is encoded as input to the network, and what objective is optimized. Histogram-DEQN and All-in-One DL minimize squared structural residuals (Euler, Bellman, market clearing); DeepHAM instead maximizes cumulative utility along simulated paths, with the individual law of motion embedded in the computational graph and Bellman residuals tracked as a validation diagnostic.
| | Histogram DEQN | All-in-one DL | DeepHAM |
|---|---|---|---|
| Reference | Azinovic et al., 2022 | Maliar et al., 2021 | Han et al., 2024 |
| State distribution representation | Explicit histogram vector on a fixed grid | Explicit panel of agents’ states | Learned generalized moments via a permutation-invariant encoder |
| Dimension of aggregate state seen by the network | Histogram entries | Agent states | Learned scalars |
| Interpretability of distributional state | High (histogram readable) | Low (permutation-dependent) | High in the economic sense (the learned basis is interpretable, e.g. concave in assets, linking the moment to MPCs and redistribution) |
| Permutation invariance | Automatic (histogram is invariant) | Requires careful architecture / data augmentation | Baked into the encoder by DeepSets structure |
| Training objective | Euler + Bellman + market clearing + FB residuals (squared) | Euler + Bellman + market clearing residuals (squared) | Cumulative utility along simulated paths plus value-function error; Bellman residual is a validation diagnostic only |
| Reported accuracy (baseline KS) | Euler errors down to the order of $10^{-4}$ | Approximation errors reported for a finite panel of agents | Bellman residual (used as a validation diagnostic) reduced relative to KS with one learned generalized moment |
6.6.1All-in-One Deep Learning (Maliar, Maliar & Winant, 2021)¶
Maliar et al. (2021) propose what they call all-in-one deep learning. The key observation is that every dynamic economic model can be cast, at its core, as a collection of expectation conditions (optimality, feasibility, market clearing) that vanish at the true solution. They rewrite these conditions as non-linear regression equations with zero dependent variable, parameterize the policy and value functions by deep neural networks, and minimize the expected-squared residual by stochastic gradient descent on simulated paths.
Two related routes to the same problem are worth noting before the details. The symmetry-exploiting parameterisation of Kahou et al. (2021) is an alternative deep-learning approach that uses a permutation-invariant aggregation of agent-level features, and the refined Reiter implementation of Bayer & Luetticke (2020) is the modern perturbation route. Both are useful complements to the Young/DEQN combination of Section 6.4, with different scaling profiles and code-complexity trade-offs. For the Krusell--Smith benchmark with aggregate shocks and a continuum of agents, Maliar et al. (2021) formulate the following components of the loss.
Euler residual. The household’s consumption-saving first-order condition:
where is the individual-plus-aggregate state and is an -agent panel of capital holdings. The policy network outputs ; the Euler residual is evaluated at simulated states and the expectation by Monte Carlo over next-period idiosyncratic and aggregate shocks. With the borrowing constraint , the plain Euler residual is insufficient: at a binding constraint the equation can be slack, so the loss must combine EE with a complementarity condition via Fischer--Burmeister or KKT (see below).
Bellman residual / value-function error for off-policy learning of a value function, used in the “lifetime reward” formulation.
Market-clearing residual: summed across the agents in the panel.
Borrowing constraint non-negativity enforced architecturally (softplus on savings); the complementarity optimality condition still enters the loss via a Fischer--Burmeister term regardless.
A central computational contribution is the all-in-one integration operator: a single Monte Carlo realization of the next-period state is used to estimate the combined expectation across all residual terms simultaneously, rather than evaluating separate quadrature nodes for each conditional expectation. This reduces the per-iteration cost from (with nodes per shock and shocks) to in expectation.
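To see why the treatment of expectations inside a squared loss matters, the following sketch compares three quantities for a toy integrand: the squared conditional mean that quadrature would target, the expectation of a squared single draw, and the expectation of a product of two independent draws. The two-point shock distribution, the function `residual`, and the parameter `theta` are illustrative assumptions. Whether the product-of-draws device coincides with the exact operator used by Maliar et al. (2021) is likewise an assumption of this sketch, not something the text above establishes.

```python
from itertools import product

# Hypothetical two-point shock distribution: (value, probability).
SHOCKS = [(-1.0, 0.5), (1.0, 0.5)]


def residual(eps, theta):
    """Toy integrand f(eps; theta); its conditional mean is theta."""
    return theta + eps


def squared_mean(theta):
    """Target quantity (E[f])^2, computed exactly by summing over the shocks."""
    mean = sum(p * residual(e, theta) for e, p in SHOCKS)
    return mean ** 2


def expected_single_draw(theta):
    """E[f(eps)^2]: what one obtains on average by squaring a single draw."""
    return sum(p * residual(e, theta) ** 2 for e, p in SHOCKS)


def expected_two_draws(theta):
    """E[f(eps1) * f(eps2)] for two independent draws of the shock."""
    return sum(
        p1 * p2 * residual(e1, theta) * residual(e2, theta)
        for (e1, p1), (e2, p2) in product(SHOCKS, SHOCKS)
    )


if __name__ == "__main__":
    theta = 0.5
    print("squared mean      :", squared_mean(theta))
    print("single draw, sq.  :", expected_single_draw(theta))
    print("two-draw product  :", expected_two_draws(theta))
```

Squaring one draw adds the conditional variance of the integrand to the objective, whereas the product of two independent draws has the squared mean as its expectation, so it can serve as a stochastic-gradient target without a quadrature grid.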
6.6.1.1Scale reported by the authors.¶
Maliar et al. (2021) demonstrate the approach across several setups. The first is a Krusell--Smith variant with explicit agents and state variables (one per agent plus one aggregate) in the baseline parameterization, scaled up to agents in a moments-reduced variant where the cross section is summarized by 10--20 aggregate moments. On a one-agent consumption-savings problem with kinked policies, they report approximation errors of at most a fraction of one percentage point. A classroom-scale Python/TensorFlow version is available at lectures/lecture_09_heterogeneous_agents_youngs_method/code/lecture_09_12_KrusellSmith_DeepLearning.ipynb (described in Section 6.6.4.1).
6.6.1.2Why this approach?¶
All-in-one DL is the closest cousin of the DEQN framework of Chapter 2. The only conceptual difference is that the aggregate state is represented by an explicit panel of agents (a large vector) rather than by a histogram on a fixed grid. Stochastic mini-batches draw both individual state and aggregate state, and the law of large numbers delivers permutation-invariance in the limit without requiring a permutation-invariant architecture. The appeal is conceptual simplicity; the cost is that the input dimension grows with the number of agents and the learner must rediscover permutation symmetry from the data.
6.6.2DeepHAM: Generalized Moments via DeepSets (Han, Yang & E, 2024)¶
Han et al. (2024) ask a more ambitious question: can the network learn which summary statistics of the cross-sectional distribution are relevant for individual decisions, rather than having the researcher commit to tracking a histogram or the first moment? They propose replacing the distribution with a small set of generalized moments obtained by averaging a flexible neural feature over the cross-section,
with and a neural feature encoder trained jointly with the policy and value networks. Equation (6.13) is the canonical DeepSets architecture of permutation-invariant set functions Zaheer et al., 2017: averaging (or, in the continuum limit, integration against ) makes invariant to permutations of the agents, and the encoders are flexible enough (by universal approximation on the permutation-invariant functions) to represent any fixed-arity moment.
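The permutation invariance of averaged features can be checked directly. The sketch below is a minimal illustration, not the DeepHAM implementation: the one-layer `feature` encoder, its weights, and the five-agent `panel` are hypothetical, and in the actual method the encoder is a trained neural network.

```python
import math


def feature(a, w=(0.7, -0.3), b=(0.1, 0.2)):
    """Hypothetical encoder: maps one agent's assets to two tanh features."""
    return [math.tanh(wi * a + bi) for wi, bi in zip(w, b)]


def generalized_moments(assets, encoder=feature):
    """Average the encoded features over the cross-section (DeepSets pooling)."""
    feats = [encoder(a) for a in assets]
    n = len(feats)
    return [sum(f[j] for f in feats) / n for j in range(len(feats[0]))]


panel = [0.0, 0.5, 1.0, 2.0, 5.0]      # illustrative asset holdings
reordered = list(reversed(panel))      # same agents, different order

m_original = generalized_moments(panel)
m_reordered = generalized_moments(reordered)

# With the identity encoder the construction collapses to the first moment.
mean_assets = generalized_moments(panel, encoder=lambda a: [a])
```

The two moment vectors agree up to floating-point summation order, and the last line shows that the Krusell--Smith first moment is the special case in which the encoder is the identity.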
The individual’s value and policy functions then take the form
and all networks are trained jointly. The primitive training objective is to maximize cumulative utility along simulated paths,
where the expectation is taken over simulated idiosyncratic and aggregate histories generated under the current policy. Squared Bellman residuals are reported as a validation diagnostic, not as the optimization target. Because the individual law of motion, the budget constraint, and the transition structure are known economic objects, they are written directly into the computational graph: gradients of flow through these structural dynamics, in contrast to model-free reinforcement learning where transitions are observed only as samples (Yang et al., 2025). Because the encoders that define the generalized moments are trained jointly with the policy (they are parameters, not hyperparameters), the method learns which distributional statistics matter for decisions; only the number of moments remains a choice of the researcher.
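The difference between maximizing cumulative utility and minimizing an Euler residual can be seen in a deliberately small example. The sketch below assumes a deterministic growth economy with log utility and full depreciation, a policy consisting of a single constant savings rate, and illustrative parameter values; none of this is taken from Han et al. (2024). A grid search stands in for the stochastic gradient ascent used in practice.

```python
import math

ALPHA, BETA = 0.36, 0.96   # illustrative values (assumptions)
HORIZON = 500              # long enough that BETA**HORIZON is negligible


def cumulative_utility(s, k0=0.2, horizon=HORIZON):
    """Discounted log utility along one simulated path under savings rate s."""
    k, total = k0, 0.0
    for t in range(horizon):
        y = k ** ALPHA                              # output, no shocks in this toy
        total += BETA ** t * math.log((1.0 - s) * y)
        k = s * y                                   # structural law of motion
    return total


def best_savings_rate(step=0.001):
    """Grid search over s in (0, 1), standing in for gradient ascent."""
    grid = [i * step for i in range(1, int(round(1.0 / step)))]
    return max(grid, key=cumulative_utility)


if __name__ == "__main__":
    print("utility-maximizing savings rate:", best_savings_rate())
    print("closed-form optimum alpha*beta :", ALPHA * BETA)
```

No first-order condition appears anywhere in the code: the law of motion is simulated forward and the objective is evaluated on the resulting path, which is the feature that makes this formulation usable when Euler equations are inconvenient to write down.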
A practical consequence of the policy-gradient formulation is that DeepHAM is well suited to problems where first-order conditions are difficult to write down or inconvenient to use, including constrained-efficiency problems with aggregate shocks, optimal-policy design, and behavioral macro questions; the headline application in Han et al. (2024) is precisely a constrained-efficiency problem solved by simulating the economy under candidate allocation rules and updating the rules to improve social welfare.
6.6.2.1Reported accuracy (validation diagnostics).¶
The numbers that follow are taken from Han et al. (2024); they have not been independently replicated by the companion notebooks of this chapter, and should be read as reported results rather than as figures we can vouch for from our own runs. On the baseline Krusell--Smith model, the authors report the following Bellman-error reductions, computed ex post as validation measures of the converged solution rather than as the training objective:
DeepHAM with only the first moment in the state vector reduces the Bellman residual by 61.6% compared to the classical KS algorithm.
DeepHAM with one learned generalized moment (on top of, or replacing, the first moment) reduces the residual by 68.9%.
The method extends to HA models with multiple shocks, multiple endogenous states, large shocks (risky steady state), and nonlinearities (e.g., ZLB) where both KS and the local perturbation method of Reiter (2009) break down.
Conceptually, DeepHAM bridges two traditions: the approximate-aggregation insight of Krusell & Smith (1998) (a small number of moments can suffice) and the deep-function-approximator philosophy of Maliar et al. (2021) and Azinovic et al. (2022) (the NN need not be restricted to pre-specified basis functions). The official reference implementation, including replication code for the Krusell--Smith benchmarks above, is available at https://
6.6.2.2Interpretability of the learned moments.¶
Although the encoders are flexible neural networks, the moments they produce are interpretable in the economic sense relevant to heterogeneous-agent macro: they summarize how the cross-section affects welfare and policy. In the Krusell--Smith application of Han et al. (2024), the learned basis function is concave in assets, so a marginal asset held by a poor household contributes more to the moment than a marginal asset held by a rich household. This links the learned distributional representation directly to marginal propensities to save, redistribution, and welfare effects, which is the natural object of interest in HA models. Generalized moments are therefore not just a flexible compression of ; they double as a device for reading economic content out of the trained equilibrium.
6.6.2.3From learned moments to learned state spaces.¶
Once equilibrium computation is written as policy-gradient optimization over simulated paths, one can ask a sharper question: do agents need the full distribution as a state variable at all? In Walrasian heterogeneous-agent models, agents care about only indirectly, through equilibrium prices. Yang et al. (2025) exploit this in their structural reinforcement learning (SRL) framework: agents’ policy functions take low-dimensional prices, or short price histories, as the aggregate state, while remains part of the simulated environment used to clear markets. This sidesteps the master equation entirely for a substantial class of HA economies and produces a natural conceptual arc: KS chooses moments a priori (Section 6.3); DeepHAM learns moments from the equilibrium objective (this section); SRL replaces moments with prices when the model permits, and lets ML help define a tractable equilibrium concept rather than only solving a fixed one.
6.6.3Beyond Walrasian Markets: DeepSAM and Search Frictions¶
A natural follow-up question is whether the DeepHAM machinery extends to economies in which the cross-sectional distribution affects decisions through channels other than aggregate prices. In standard heterogeneous-agent models the distribution is felt only through equilibrium prices: enters individual problems by setting and . In labor markets, search-and-matching, and other non-Walrasian settings, the distribution enters more directly: through the matching technology, the type composition on each side of the market, outside options, and bargained transfers. This makes the equilibrium mapping intrinsically non-Walrasian and forecloses simple price-only summaries of the cross-section.
Payne et al. (2025) address this case with DeepSAM, a deep-learning solver for search-and-matching models with two-sided heterogeneity and aggregate shocks. The architecture inherits the DeepHAM idea of a permutation-invariant set encoder, applied to each side of the market separately, and feeds the resulting type-composition summaries into networks that approximate value functions, matching surplus, and policy. The training objective is again policy-improvement on simulated paths, with the matching technology and bargaining rule embedded structurally. DeepHAM and DeepSAM therefore share a permutation-invariant set-encoder architecture but treat the distribution differently: in the Walrasian case it matters only through equilibrium prices, whereas in search-and-matching it enters the individual problem directly.
6.6.4Which Method, When?¶
The three deep-learning approaches (Histogram DEQN, Section 6.4; All-in-One DL, Section 6.6.1; and DeepHAM, Section 6.6.2) are complements rather than substitutes. Table 6.2 summarizes the practical trade-offs:
For teaching purposes, the Histogram DEQN is the cleanest: the network input is an interpretable distribution vector, and the training loop directly mirrors the DEQN template introduced in Chapter 2.
For research problems where the number of agents is the natural state dimension (e.g., overlapping generations with many cohorts, or finite-agent social-planner problems), All-in-One DL is often more convenient because it requires no grid design.
For policy analysis in richer HA environments (risky steady states, multiple endogenous states, ZLB), DeepHAM’s learned-moment representation pays off both in accuracy and in interpretability, because the learned moments can be plotted and analyzed as functions of the distribution.
Table 6.3 distils the same trade-offs into a quick decision aid: when the model fits the row’s "When it shines" column, the matching method is the first one to try.
Table 6.3:Practical chooser for the three DL approaches to Krusell--Smith. When several rows look applicable, the recommended ordering is: start with Histogram DEQN if a clean teaching narrative is the goal; switch to All-in-One DL if grid design is awkward; switch to DeepHAM if the cumulative-utility objective or learned moments are first-order to the question. Adapted from the L09 deck’s Head-to-Head slide.
| Method | Pros | When it shines |
|---|---|---|
| Histogram DEQN Azinovic et al., 2022 | Interpretable state; exact market clearing; reuses the DEQN template | Teaching; moderate; smooth policies |
| All-in-One DL Maliar et al., 2021 | No grid design; large panels of agents; single optimizer for all residuals | Large-panel research problems; OLG-many-cohort extensions |
| DeepHAM Han et al., 2024 | Learned moments; cumulative-utility objective; risky steady state; ZLB | Rich HA macro-finance; constrained-efficiency / optimal-policy design |
6.6.4.1Notebook: a classroom-scale all-in-one KS solver.¶
The accompanying Jupyter notebook lecture_09_12_KrusellSmith_DeepLearning.ipynb (introduced in Section 6.5 and described in detail in the README) implements a classroom-scale version of the all-in-one DL approach of Maliar et al. (2021) on the Krusell--Smith benchmark with the parameters of Krusell & Smith (1998). It uses a single policy network parameterized by TensorFlow/Keras and an explicit panel of agents as the distributional input; the loss is the squared Euler residual, augmented with a Fischer--Burmeister complementarity term at the borrowing constraint, and aggregate consistency is imposed by construction (next-period capital is the cross-sectional mean of the panel’s savings choices) rather than through a separate market-clearing penalty. This is a deliberate simplification of the full all-in-one formulation of Section 6.6.1, which also carries a value network and an explicit market-clearing residual. The notebook is annotated cell-by-cell with the correspondence to the equations in this chapter, and is designed to converge in under ten minutes on a standard CPU. For the production-scale counterpart (up to agents, GPU acceleration, richer shock structure), we refer the reader to the replication repository of Maliar (2022).
6.6.4.2Benchmarks and replication pointers.¶
For readers who want to benchmark any of the deep-learning approaches against the traditional KS algorithm, the canonical reference implementation is econ-ark/KrusellSmith (The Econ-ARK Team, 2020), which implements the forecasting-rule-update algorithm of Krusell & Smith (1998) and reports within about 20 outer iterations under standard parameters (, , log utility, two aggregate states). Any deep-learning method must at least match that accuracy on the baseline model; its advantages are expected to appear when the model is extended in directions KS cannot handle cleanly (more moments, many endogenous states, risky steady state).
6.7Extension: Deep Learning in the Sequence Space¶
The histogram-based DEQN above is transparent because it feeds a direct approximation of the endogenous cross-sectional distribution into the neural network. The price of that transparency is dimensionality: in richer heterogeneous-agent economies, the aggregate state can contain hundreds of histogram entries. Azinovic-Yang & Žemlička (2025) propose a different representation of the aggregate state. Instead of feeding the current endogenous state to the network, they feed a truncated history of exogenous aggregate shocks. The equilibrium logic does not change: one still enforces Euler equations, market clearing, and occasionally binding constraints inside the loss. What changes is the object that summarizes the aggregate state for the network. Figure 6.7 contrasts the two views.
Figure 6.7:Two ways to encode the aggregate state in deep equilibrium learning. Each pipeline reads top-to-bottom: the upper (colored) box is the input the user gives to the same neural network , the middle (colored) box is the network’s output (policy and price objects), and the green box is the equilibrium loss that consumes those outputs. Histogram DEQNs (left, blue) feed an endogenous state representation ; sequence-space DEQNs (right, red) feed a truncated exogenous shock history . Crucially, the network and the residual-based training loss are identical across the two pipelines; only the input changes.
The sequence-space representation. Let denote the last realizations of the exogenous aggregate shock. The key claim is that, in an ergodic economy, this history is an approximate sufficient statistic for the endogenous aggregate state. In the Brock--Mirman warm-up notebook, the network maps the shock history to a bounded savings rate, from which next-period capital follows by the resource constraint,
where is the logistic squashing that keeps feasible. In the richer heterogeneous-agent version, the network instead maps the same history to higher-level equilibrium objects such as policy-function coefficients or pricing objects. This connects the method to the MIT-shock and sequence-space Jacobian literature of Boppart et al. (2018) and Auclert et al. (2021), but replaces local linear approximations with a global residual-based neural approximation.
Intuition first. The easiest way to think about the method is as a memory compression device. A positive aggregate shock today raises output and therefore raises tomorrow’s capital. That extra capital still matters the period after, but only through the production elasticity , so its influence is smaller. One more period later it is smaller again. In other words, the current aggregate state stores a decaying memory of past shocks. The sequence-space idea is to feed that shock history directly to the network rather than feeding the current endogenous state itself.
Figure 6.8:Intuition for sequence space in Brock--Mirman. depends on past shocks with weights that decay like in the lag (here , the standard capital share). Already at the weight has fallen to , so a finite history of recent shocks summarizes the relevant aggregate information; very old shocks matter little.
Brock--Mirman: what changes relative to Chapter 2? The Brock--Mirman warm-up is useful because the change can be written down exactly. In Chapter 2, the state-space DEQN uses the current state as input,
In the sequence-space version, the economic model is unchanged, but the network sees a different input:
The Euler residual is the same object as before,
so the economics are unchanged. What changes is the computational representation:
the network input changes from the current state to the recent history ;
the network output changes from current consumption to a bounded savings rate in the warm-up notebook, so that is feasible by construction;
the current capital stock is no longer fed directly into the network, but is generated recursively from the initial condition and previously predicted capital choices;
the training samples are overlapping shock histories rather than pointwise states .
This distinction is important conceptually. For Brock--Mirman, sequence space is not a dimensionality reduction, since is only two-dimensional whereas a history of length is larger. The Brock--Mirman notebook is therefore a pedagogical demonstration of the idea that histories can stand in for endogenous states. The dimensionality gain appears only in richer heterogeneous-agent models, where the relevant alternative is a large histogram or other high-dimensional distributional summary.
Intermediate bridge: sequence-space IRBC. Between the one-shock Brock--Mirman warm-up and the infinite-dimensional Krusell--Smith state, the companion notebook lectures/lecture_10_sequence_space_deqns/code/lecture_10_05b_SequenceSpace_IRBC.ipynb re-trains the two-country IRBC model of Chapter 3 under sequence-space inputs: the policy network reads the last shock vectors (a 240-dimensional history with truncation error) instead of the four-dimensional current state. The equilibrium residuals (Euler, ARC, Fischer--Burmeister), the Gauss--Hermite quadrature, and the cloud-method sampler are unchanged from notebook 01; only the input domain changes. Because the current capital stock is no longer an input, we parametrize the output head around the steady state, and , which keeps gradients lively at the target policy and prevents the cold-start divergence that plagued a naive softplus head. This notebook is a pedagogical bridge rather than a computational win (at a four-dimensional state the history is much larger, not smaller), but it shows that the same template handles a multi-equation system with multiple independent shock channels before we hand the method over to Krusell--Smith, where the dimensionality gain is real.
Training logic. The computational pattern is also close to the rest of this chapter. One samples an exogenous shock path, constructs overlapping history windows , evaluates the network on those windows, and then uses the resulting decisions to simulate the endogenous economy forward. In the Brock--Mirman warm-up this produces the capital sequence directly; in the Krusell--Smith tutorial it produces policy-function objects, while Young’s method still propagates the cross-sectional distribution inside the simulator. Residuals are then evaluated on the simulated path and backpropagated through the full pipeline. Figure 6.9 summarizes this workflow.
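The workflow just described can be sketched for the Brock--Mirman case with full depreciation and log utility. Everything specific in the block is an assumption made for illustration: the parameter values, the history length `T`, and above all `savings_rate`, a two-parameter logistic rule that stands in for the neural network. The residual uses the realized next-period shock rather than an expectation, so it is a one-draw estimate.

```python
import math
import random

ALPHA, BETA = 0.36, 0.96   # illustrative parameter values (assumptions)
T = 5                      # history length fed to the "network" (assumption)


def history_windows(shocks, length):
    """Overlapping windows (z_{t-length+1}, ..., z_t), one per feasible t."""
    return [shocks[t - length + 1 : t + 1] for t in range(length - 1, len(shocks))]


def savings_rate(window, theta):
    """Stand-in for the network: logistic squashing keeps the rate in (0, 1)."""
    score = theta[0] + theta[1] * sum(math.log(z) for z in window)
    return 1.0 / (1.0 + math.exp(-score))


def simulate(shocks, theta, k0=0.2):
    """Roll capital forward from k0 and collect relative Euler residuals."""
    windows = history_windows(shocks, T)
    k, residuals = k0, []
    for i in range(len(windows) - 1):
        z = windows[i][-1]
        y = z * k ** ALPHA
        s = savings_rate(windows[i], theta)
        c, k_next = (1.0 - s) * y, s * y          # feasible by construction
        z_next = windows[i + 1][-1]
        y_next = z_next * k_next ** ALPHA
        c_next = (1.0 - savings_rate(windows[i + 1], theta)) * y_next
        # Log utility: beta * (c / c') * alpha * z' * k'^(alpha - 1) - 1.
        residuals.append(
            BETA * ALPHA * z_next * k_next ** (ALPHA - 1.0) * c / c_next - 1.0
        )
        k = k_next
    return residuals


def logit(p):
    return math.log(p / (1.0 - p))


rng = random.Random(0)
shocks = [math.exp(rng.gauss(0.0, 0.02)) for _ in range(60)]
theta_star = (logit(ALPHA * BETA), 0.0)   # constant savings rate alpha * beta
theta_off = (0.0, 0.0)                    # constant savings rate 0.5
```

In this special case the known closed-form policy is a constant savings rate equal to `ALPHA * BETA`, so `simulate(shocks, theta_star)` returns residuals that are zero up to rounding, while `theta_off` leaves a residual of `ALPHA * BETA / 0.5 - 1` in every period. Capital never enters `savings_rate`; it is generated recursively inside `simulate`.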
Figure 6.9:Training flow for sequence-space DEQNs. The exogenous shock history is the network input, but the forward simulator still produces endogenous objects such as prices, aggregate capital, or cross-sectional distributions needed for residual evaluation.
Why truncated histories can work. The Brock--Mirman warm-up makes the logic especially transparent. With full depreciation ($\delta = 1$) and log utility, recursive substitution shows that the capital stock depends on the last shocks up to an error of order . Since (typically ), this error vanishes exponentially: for , the truncation error is of order $10^{-11}$. More generally, in ergodic economies with persistent aggregate shocks, the approximation error decays at roughly . In richer heterogeneous-agent models this is no longer an exact algebraic statement, so the history length becomes an empirical accuracy choice rather than a theorem.
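The geometric decay of the truncation error can be verified numerically. The sketch below assumes the closed-form Brock--Mirman law of motion under full depreciation and log utility, log k' = log(alpha * beta) + log z + alpha * log k, and anchors the unobserved capital stock before the window at its deterministic steady state; the parameter values, the shock volatility, and the function names are illustrative assumptions.

```python
import math
import random

ALPHA, BETA = 0.36, 0.96                              # illustrative values
LOG_K_SS = math.log(ALPHA * BETA) / (1.0 - ALPHA)     # steady state with z = 1


def exact_log_k(log_z, log_k0):
    """Iterate the law of motion; path[t] is log capital entering period t."""
    path = [log_k0]
    for lz in log_z:
        path.append(math.log(ALPHA * BETA) + lz + ALPHA * path[-1])
    return path


def truncated_log_k(log_z, t, length):
    """Approximate path[t] from the last `length` shocks only (needs t >= length)."""
    approx = ALPHA ** length * LOG_K_SS               # anchor for the forgotten past
    for j in range(length):
        approx += ALPHA ** j * (math.log(ALPHA * BETA) + log_z[t - 1 - j])
    return approx


def truncation_error(log_z, path, t, length):
    return abs(path[t] - truncated_log_k(log_z, t, length))


rng = random.Random(1)
log_z = [rng.gauss(0.0, 0.02) for _ in range(200)]
path = exact_log_k(log_z, LOG_K_SS + 0.5)             # start away from steady state
max_dev = max(abs(p - LOG_K_SS) for p in path)

if __name__ == "__main__":
    for length in (5, 10, 20):
        print(length, truncation_error(log_z, path, 200, length))
```

The error equals `ALPHA ** length` times the distance between the forgotten capital stock and the steady-state anchor, so each additional lag shrinks it by the factor `ALPHA`.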
Why this is useful in heterogeneous-agent models. Two advantages are worth separating. First, as a network input, a history of shocks can be much smaller than a histogram with hundreds of bins. Second, exogenous shock histories are sampled from a fixed distribution. This removes one source of instability in residual-based training: the set of network inputs is anchored by model primitives even though the endogenous simulator still evolves with the current policy network. In the Krusell--Smith tutorial, this means that the network is conditioned on shock histories, while Young’s method remains responsible for propagating the distribution used in market-clearing calculations.
Shape-preserving operator learning. A second contribution of Azinovic-Yang & Žemlička (2025) is to let the network output policy-function objects rather than a single scalar choice. In particular, they construct architectures that guarantee monotonicity and concavity of the predicted consumption rule by representing it with an I-spline basis and non-negative coefficients. In the Krusell--Smith tutorial, the network maps the shock history to these coefficients; the resulting policy can then be evaluated at all idiosyncratic states on the wealth grid. This operator-learning view pairs naturally with the endogenous grid method (EGM) of Carroll (2006) and avoids ad hoc penalties for monotonicity or concavity.
Explicit I-spline MPC parameterization. Having seen why a shape-preserving output head matters (above), we now write the construction down concretely; this is the most technical paragraph of the section and a reader who already accepts the monotonicity/concavity guarantees can skip to “Fischer--Burmeister KKT loss” below. Let be a fixed log-spaced wealth grid and let be a precomputed I-spline basis evaluated on it,
where is a small numerical shift (the BASIS_SHIFT constant in the notebook) and each is an integrated B-spline that is monotonically increasing from 0 to 1. For each idiosyncratic state , the network outputs two objects: a boundary marginal propensity to consume (sigmoid head) and non-negative weights with (a “phantom-zero” softmax head). The grid MPC is
which is decreasing in by construction (positive weights times an increasing basis, subtracted off a constant), bounded in , and continuous in the network parameters. Consumption is then recovered on the grid by cumulation of the MPC schedule along cash-on-hand ,
and off-grid evaluation uses piecewise-linear interpolation. Equations (6.25)--(6.26) guarantee, by construction and without any auxiliary penalty, that the consumption rule is non-negative, monotonically increasing in , concave in , and feasible (). In code, is the matrix ispline_basis, and come from the two heads of actor_c_grid, and the cumulation is the closing block of that same function.
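A stripped-down version of the construction shows how the shape guarantees arise. The sketch makes several simplifying assumptions that are not taken from the notebook: piecewise-linear ramps (the lowest-order I-splines) replace the precomputed basis, a single grid plays the role of both the wealth grid and cash-on-hand, the MPC takes the assumed form `mpc0 * (1 - sum_j w_j * I_j)`, and the "phantom-zero" softmax is interpreted as a softmax with one extra logit fixed at zero whose share is discarded. The names `ramp`, `phantom_zero_softmax`, and `consumption_on_grid` are hypothetical.

```python
import math


def ramp(x, lo, hi):
    """Monotone basis function rising from 0 (x <= lo) to 1 (x >= hi)."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def phantom_zero_softmax(logits):
    """Softmax with an extra zero logit; the returned weights sum to less than 1."""
    mx = max(0.0, max(logits))                     # stabilizer includes the phantom 0
    exps = [math.exp(l - mx) for l in logits]
    denom = math.exp(-mx) + sum(exps)
    return [e / denom for e in exps]


def consumption_on_grid(grid, knots, raw_mpc0, logits):
    """Return (mpc, c) on the grid; slopes are decreasing, so c is concave."""
    mpc0 = 1.0 / (1.0 + math.exp(-raw_mpc0))       # sigmoid head, in (0, 1)
    w = phantom_zero_softmax(logits)
    mpc = [
        mpc0 * (1.0 - sum(wj * ramp(m, knots[j], knots[j + 1]) for j, wj in enumerate(w)))
        for m in grid
    ]
    c = [mpc[0] * grid[0]]                         # assumes c(0) = 0
    for i in range(1, len(grid)):
        c.append(c[-1] + mpc[i] * (grid[i] - grid[i - 1]))
    return mpc, c


GRID = [0.1 * 100.0 ** (i / 19) for i in range(20)]   # log-spaced, 0.1 to 10
KNOTS = [0.1, 0.5, 2.0, 10.0]                         # three ramps
mpc, c = consumption_on_grid(GRID, KNOTS, raw_mpc0=0.5, logits=[0.2, -0.1, 0.4])
```

Whatever values the two heads produce, the MPC stays in (0, 1) and is non-increasing along the grid, so the cumulated consumption rule is increasing, concave, and never exceeds available resources. No penalty term is involved.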
Fischer--Burmeister KKT loss. Households face a borrowing constraint . The Karush--Kuhn--Tucker conditions of the household problem split into two regimes: at an interior optimum the Euler equation holds with equality, while at a binding constraint the Euler equation can be slack but next-period capital is zero. Define the (relative) Euler residual and the (relative) savings slack (in this section is reused as the Euler residual, matching the tutorial code’s variable name; it is not the household policy function of Section 6.3)
where is the consumption level implied by the Euler equation given the network’s continuation policy. The KKT pair is then
which is a complementarity condition. The Fischer--Burmeister envelope, in the same sign convention used in Ch. Chapter 3 and Ch. Chapter 5,
is smooth and satisfies if and only if with both non-negative; the small constant (set to $10^{-12}$ in the notebooks) is a numerical stabilizer for the square root. The upstream JAX tutorial code uses the negative-sign variant , which has the same zero set and yields an identical loss once squared.
so that one differentiable scalar simultaneously enforces the Euler equation in the interior region and the complementarity condition at the borrowing constraint, without case splits or shadow-price augmentation. This reuses the smooth complementarity construction of Fischer (1992) that is standard in nonlinear programming, applied here to a heterogeneous-agent equilibrium loss.
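A few lines suffice to check how the smoothed Fischer--Burmeister function separates the KKT regimes. The sign convention `a + b - sqrt(a^2 + b^2 + eps)` is an assumption of this sketch (the squared loss does not depend on it), and the argument names `euler_slack` and `savings_slack` are illustrative labels for the two members of the KKT pair.

```python
import math

EPS = 1e-12   # stabilizer inside the square root


def fischer_burmeister(euler_slack, savings_slack, eps=EPS):
    """Smoothed FB function; close to zero iff both arguments are >= 0 and one is 0."""
    a, b = euler_slack, savings_slack
    return a + b - math.sqrt(a * a + b * b + eps)


def fb_loss(pairs):
    """Mean squared FB residual over a batch of (euler_slack, savings_slack) pairs."""
    return sum(fischer_burmeister(a, b) ** 2 for a, b in pairs) / len(pairs)


interior = (0.0, 0.3)     # Euler equation holds, savings strictly positive
binding = (0.2, 0.0)      # Euler equation slack, constraint binds
both_slack = (0.2, 0.3)   # complementarity violated
negative = (-0.1, 0.3)    # sign condition violated

if __name__ == "__main__":
    for name, pair in [("interior", interior), ("binding", binding),
                       ("both slack", both_slack), ("negative", negative)]:
        print(name, fischer_burmeister(*pair))
```

The first two cases return values of the order of the stabilizer, and the last two are penalized, so no case split is needed in the loss. At the kink where both arguments are exactly zero, the smoothed function equals `-sqrt(eps)` rather than zero, which is the price of differentiability.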
Putting the pieces together: the HA training loop. The Krusell--Smith tutorial assembles the encoder, the I-spline policy head, Young’s distribution step, and the Fischer--Burmeister loss into a single replay-buffer training loop. Definition 6.3 states it explicitly.
Three implementation choices in Definition 6.3 are worth flagging. First, the forward roll is wrapped in stop_gradient so that gradients only flow through the FB residual evaluated on the buffer, not through long simulation chains; this is what makes training tractable for horizons. Second, because Young’s step gives exact aggregate sums, the only stochasticity in the gradient comes from buffer mini-batching, which acts as standard SGD noise rather than a Monte-Carlo aggregation noise floor. Third, the buffer simultaneously decouples training-state drawing from the current policy and lets the network see distributions generated by earlier policies, which improves coverage of the ergodic set during early training.
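The buffer logic behind the third point can be sketched independently of any deep-learning framework. The class below is a hypothetical minimal replay buffer, not the one in the tutorial code: states are stored as plain Python values, which plays the role that a stop-gradient operation plays in a framework, and the oldest entries are discarded once the capacity is reached.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of simulated states, sampled uniformly for training."""

    def __init__(self, capacity, seed=0):
        self._data = deque(maxlen=capacity)   # oldest states drop out automatically
        self._rng = random.Random(seed)

    def add(self, state):
        """Append a state produced by the forward roll (no gradient attached)."""
        self._data.append(state)

    def sample(self, batch_size):
        """Draw a mini-batch without replacement, capped at the buffer size."""
        k = min(batch_size, len(self._data))
        return self._rng.sample(list(self._data), k)

    def __len__(self):
        return len(self._data)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=1000)
    for step in range(1500):
        # In a real loop `state` would hold a shock history and a histogram.
        buffer.add({"step": step})
    batch = buffer.sample(32)
    print(len(buffer), len(batch))
```

Because sampling is uniform over everything still in the buffer, a mini-batch mixes states generated under the current policy with states generated under earlier ones, which is the coverage effect described above.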
Two training algorithms: residual minimization vs. time iteration. Definition 6.3 is one of two families of training schemes used in the sequence-space DL literature. In direct residual minimization (the version above and in our notebook), the network is trained by gradient descent on the squared equilibrium residual itself. In time iteration with EGM, the network is trained on a sequence of supervised regression problems: at each outer iteration, one (i) uses the current network to construct next-period policies, (ii) backs out implied current-period policies via the endogenous-grid method of Carroll (2006), and (iii) updates the network by minimizing the squared error against those EGM targets. Time iteration is more involved and requires a per-batch root-finding step, but it is more flexible: it tolerates non-trivial market clearing and, crucially, handles non-convex choices (e.g., a discrete retirement decision) and non-monotone Laffer curves where the Euler equation has multiple roots that direct residual minimization cannot disambiguate. In practice, residual minimization is the simpler entry point on smooth, convex problems; switch to time iteration when convergence stalls, when the model contains discrete choices, or when the continuation value has convex regions that admit multiple optimal savings.
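Step (ii) of the time-iteration scheme can be illustrated in a setting simple enough to check by hand. The sketch assumes a deterministic consumption-savings problem with CRRA utility, a constant gross return, and no labor income; these assumptions, the parameter values, and the function names are illustrative and are not taken from the paper. The function inverts the Euler equation on a grid of end-of-period assets and returns (cash-on-hand, consumption) pairs that would serve as regression targets for the network in step (iii).

```python
BETA, R, GAMMA = 0.96, 1.02, 1.0   # illustrative values; GAMMA = 1 is log utility


def egm_targets(next_policy, asset_grid, beta=BETA, gross_return=R,
                gamma=GAMMA, income=0.0):
    """One EGM step: map a next-period policy into current (m, c) targets."""
    targets = []
    for a in asset_grid:                       # a = end-of-period assets, a > 0
        m_next = gross_return * a + income     # next-period cash-on-hand
        c_next = next_policy(m_next)
        # Invert u'(c) = beta * R * u'(c') with u'(c) = c^(-gamma).
        c = (beta * gross_return * c_next ** (-gamma)) ** (-1.0 / gamma)
        targets.append((a + c, c))             # endogenous grid point m = a + c
    return targets


def linear_rule(m, kappa=1.0 - BETA):
    """Known infinite-horizon solution under log utility and zero income."""
    return kappa * m


targets = egm_targets(linear_rule, [0.5, 1.0, 2.0, 4.0])
```

Under these assumptions the rule `c = (1 - BETA) * m` is a fixed point of the step, so the targets lie exactly on the policy that generated them; in the training scheme, the network is instead regressed on the targets and the step is repeated until the policy stops changing.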
Applications and notebooks. The paper applies this framework to three demanding models: (i) an OLG economy with portfolio choice and aggregate risk, (ii) an economy with a continuum of heterogeneous firms and households featuring idiosyncratic and aggregate shocks, and (iii) a lifecycle model with a discrete early-retirement choice that introduces local convexities. Mean Euler equation errors are below 0.2% in all cases. The two TensorFlow 2 companion notebooks are intentionally complementary: 05_SequenceSpace_BrockMirman.ipynb is a Brock--Mirman warm-up that isolates the history-to-policy logic in the simplest possible environment, while 06_SequenceSpace_KrusellSmith.ipynb is a compact teaching implementation that combines sequence-space inputs, I-spline policies, and Young’s method in a heterogeneous-agent setting.
A third notebook, KrusellSmith_Tutorial_CPU.ipynb, is a JAX/optax port of the upstream pedagogical tutorial released by the paper’s authors. It exposes the same shape-preserving I-spline MPC parameterization, the same Young step inside the simulator, and the same Fischer--Burmeister KKT loss as the TensorFlow notebook, but in the original JAX form. It is adapted from the upstream tutorial 01_KrusellSmith_Tutorial_CPU.ipynb in the companion code repository Azinovic-Yang & Žemlička, 2025, available at https://
6.8Further Reading¶
Young (2010), the original non-stochastic histogram paper.
Krusell & Smith (1998), the canonical heterogeneous-agent benchmark with aggregate shocks.
Azinovic-Yang & Žemlička (2025), sequence-space DEQNs, the natural extension.
Achdou et al. (2022), the continuous-time treatment that Chapter 8 builds on.
Maliar et al. (2021) and Han et al. (2024), alternative deep-learning approaches to KS, contrasted in Section 6.6.
6.9 Exercises
Worked solutions and guidance for these exercises appear in Appendix F.
- Krusell, P., & Smith, A. A., Jr. (1998). Income and wealth heterogeneity in the macroeconomy. Journal of Political Economy, 106(5), 867–896.
- Aiyagari, S. R. (1994). Uninsured Idiosyncratic Risk and Aggregate Saving. The Quarterly Journal of Economics, 109(3), 659–684.
- Young, E. R. (2010). Solving the incomplete markets model with aggregate uncertainty using the Krusell–Smith algorithm and non-stochastic simulations. Journal of Economic Dynamics and Control, 34(1), 36–41.
- Azinovic, M., Gaegauf, L., & Scheidegger, S. (2022). Deep Equilibrium Nets. International Economic Review, 63(4), 1471–1525. 10.1111/iere.12575
- Maliar, L., Maliar, S., & Winant, P. (2021). Deep learning for solving dynamic economic models. Journal of Monetary Economics, 122, 76–101.
- Bewley, T. (1986). Stationary monetary equilibrium with a continuum of independently fluctuating consumers. In W. Hildenbrand & A. Mas-Colell (Eds.), Contributions to Mathematical Economics in Honor of Gérard Debreu (pp. 79–102). North-Holland.
- İmrohoroğlu, A. (1989). Cost of Business Cycles with Indivisibilities and Liquidity Constraints. Journal of Political Economy, 97(6), 1364–1383. 10.1086/261640
- Huggett, M. (1993). The Risk-Free Rate in Heterogeneous-Agent Incomplete-Insurance Economies. Journal of Economic Dynamics and Control, 17(5–6), 953–969. 10.1016/0165-1889(93)90024-M
- Achdou, Y., Han, J., Lasry, J.-M., Lions, P.-L., & Moll, B. (2022). Income and wealth distribution in macroeconomics: A continuous-time approach. The Review of Economic Studies, 89(1), 45–86.
- Maliar, L., Maliar, S., & Valli, F. (2010). Solving the Incomplete Markets Model with Aggregate Uncertainty Using the Krusell–Smith Algorithm. Journal of Economic Dynamics and Control, 34(1), 42–49.
- Azinovic-Yang, M., & Žemlička, J. (2025). Deep Learning in the Sequence Space. 10.48550/arXiv.2509.13623
- Reiter, M. (2009). Solving heterogeneous-agent models by projection and perturbation. Journal of Economic Dynamics and Control, 33(3), 649–665.
- Han, J., Yang, Y., & E, W. (2024). DeepHAM: A Global Solution Method for Heterogeneous Agent Models with Aggregate Shocks. Quantitative Economics.
- Kahou, M. E., Fernández-Villaverde, J., Perla, J., & Sood, A. (2021). Exploiting Symmetry in High-Dimensional Dynamic Programming (NBER Working Paper No. 28981). National Bureau of Economic Research.
- Bayer, C., & Luetticke, R. (2020). Solving Discrete Time Heterogeneous Agent Models with Aggregate Risk and Many Idiosyncratic States by Perturbation. Quantitative Economics, 11(4), 1253–1288.