Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

6 Heterogeneous Agents and Young's Method

University of Lausanne

The OLG models of Chapter Chapter 5 featured a finite number of agent types, so the cross-sectional state was simply a vector (k1,,kA)(k^1,\ldots,k^A). Many important macroeconomic applications instead require a continuum of agents subject to idiosyncratic shocks. In the Krusell & Smith (1998) economy, an aggregate productivity shock additionally moves the cross section in a stochastic way, and the entire wealth distribution μt\mu_t then becomes part of the aggregate state. The Aiyagari (1994) model is the special case without aggregate risk: μt\mu_t is fixed in stationary equilibrium and evolves deterministically along transitions, so it is a parameter of the equilibrium rather than a stochastic state variable. Incomplete markets prevent full insurance in both, making explicit distributional tracking essential, but the “master-equation” challenge of treating μt\mu_t as a high-dimensional aggregate state arises only once aggregate risk is added on top of Aiyagari. Why represent μt\mu_t as a histogram on a discrete grid rather than as a Monte Carlo panel? Two reasons motivate the choice up front: a histogram is deterministic, so re-running the same equilibrium gives identical aggregates and the loss is a smooth function of the network weights, and it is noise-free, so Euler-equation residuals reflect approximation error rather than O(1/N)\mathcal{O}(1/\sqrt N) Monte Carlo sampling noise (Figure Figure 6.5 makes this contrast quantitative). This chapter develops Young’s non-stochastic simulation method Young, 2010 for representing μt\mu_t as a histogram and shows how to embed that method within the DEQN framework of Azinovic et al. (2022) to solve heterogeneous-agent economies with neural network policy functions.

6.0.0.1How this chapter maps onto the slides and notebooks.

The heterogeneous-agent material of this chapter, together with the companion deck in lecture_09_heterogeneous_agents_youngs_method/slides/, can be read independently of the sequence-space material in Section 6.7; readers already comfortable with Krusell--Smith may skip straight there on a first pass. Two notebooks in lecture_09_heterogeneous_agents_youngs_method/code/ accompany Sections Section 6.3--Section 6.4: lecture_09_10_Youngs_Method_Examples.ipynb isolates Young’s redistribution operator on toy examples, and lecture_09_11_Continuum_of_Agents_DEQN.ipynb embeds the same operator inside the Appendix A.5 endowment-economy DEQN. Section Section 6.6 on alternative deep-learning approaches is paired with lecture_09_12_KrusellSmith_DeepLearning.ipynb, a classroom-scale all-in-one DL solver in the spirit of Maliar et al. (2021).

6.0.0.2The Bewley--Huggett--Aiyagari lineage.

The continuum-agent framework that this chapter operationalises has three foundational sources. Bewley (1986) introduced stationary monetary equilibrium with a continuum of agents subject to iid endowment shocks; the explicit self-insurance-through-borrowing-constraints mechanism that defines the modern incomplete-markets workhorse is due to İmrohoroğlu (1989), Huggett (1993), and Aiyagari (1994). Huggett (1993) cast the idea as a tractable endowment economy with a single risk-free asset, focusing on the equilibrium interest rate. Aiyagari (1994) added neoclassical production, closing the model in general equilibrium with capital accumulation; this is the canonical incomplete-markets economy. Krusell & Smith (1998) layered aggregate productivity shocks on top, producing the modern Krusell--Smith economy that this chapter targets, and its continuous-time reformulation is developed by Achdou et al. (2022) and revisited in Chapter Chapter 8.

Genealogy of the heterogeneous-agent models treated in this script. This chapter targets the Krusell--Smith branch (right) by combining a DEQN policy with Young’s histogram update; the continuous-time branch (Achdou--Han--Lasry--Lions--Moll) is taken up in Chapter .

Figure 6.1:Genealogy of the heterogeneous-agent models treated in this script. This chapter targets the Krusell--Smith branch (right) by combining a DEQN policy with Young’s histogram update; the continuous-time branch (Achdou--Han--Lasry--Lions--Moll) is taken up in Chapter Chapter 8.

6.1From Representative to Heterogeneous Agents

In the representative-agent models of Chapters Chapter 2--Chapter 3, aggregate capital KtK_t is a sufficient statistic for the economy’s state. In reality, agents differ in wealth, income, and employment status, and incomplete markets prevent full insurance against idiosyncratic risk. This heterogeneity matters for policy analysis:

The mathematical challenge depends on whether the economy carries aggregate risk. Without it (the Aiyagari case), the cross-section μt\mu_t is fixed in stationary equilibrium and evolves deterministically along transitions, so it parameterizes the equilibrium rather than serving as a stochastic state variable. With aggregate risk (the Krusell--Smith case that this chapter targets), an aggregate shock moves the cross-section in a stochastic way, and the entire distribution μt\mu_t then becomes part of the aggregate state. Since μt\mu_t is in either case a measure (an infinite-dimensional object), it cannot be placed on a grid without encountering the curse of dimensionality.

6.1.0.1Two approaches.

There are two main strategies for handling heterogeneous agents in discrete time. Finite-agent methods keep a high-dimensional but finite vector (k1,,kN)(k_1,\ldots,k_N) of individual states, which is exact in μt\mu_t but pays a permutation-symmetry cost that scales poorly with NN (the OLG models of Chapter Chapter 5, and the finite-agent Monte-Carlo implementation of the continuum-agent Krusell--Smith model by Maliar et al. (2010), fall in this category). Continuum-agent methods replace the agent labels with the distributional state μt()\mu_t(\cdot), which is infinite-dimensional but eliminates sampling noise and the permutation problem, at the price of needing to approximate μt\mu_t (Krusell & Smith (1998) and Young (2010) are the canonical references). This chapter pursues the second approach: tracking the distribution via a histogram, with Young’s histogram method Young, 2010 providing the distribution update.

6.2The Krusell--Smith (1998) Economy

The canonical heterogeneous-agent model with aggregate uncertainty is due to Krusell & Smith (1998). Its key features are:

Each agent must forecast future prices to make optimal savings decisions. But prices depend on the entire wealth distribution, which is infinite-dimensional. The key insight of Krusell & Smith (1998) is that the mean of the wealth distribution is a nearly sufficient statistic for forecasting: a log-linear forecasting rule

H^(m1,a)=exp ⁣(A(a)+B(a)logm1),\hat{H}(m_1, a) = \exp\!\bigl(A(a) + B(a)\log m_1\bigr),

where H^(m1,a)\hat{H}(m_1,a) is the approximate next-period aggregate capital (the price-forecasting function), A(a)A(a) and B(a)B(a) are the OLS intercept and slope coefficients (each a function of the aggregate shock state aa), and m1=kdμm_1 = \int k\,d\mu is mean capital; this rule achieves R2>0.9999R^2 > 0.9999 in practice.

6.2.0.1Equilibrium.

A recursive competitive equilibrium consists of:

  1. Individual optimization: each agent chooses savings kk' to maximize utility given the forecasting rule.

  2. Market clearing: aggregate demand equals supply.

  3. Consistent distribution evolution: μt+1=T(μt,at)\mu_{t+1} = T(\mu_t, a_t) given optimal policies.

  4. Rational expectations: agents’ forecasts are self-confirming.

The question is: how do we simulate the distribution forward to compute the realized mean and check the forecasting rule?

6.2.0.2The traditional Krusell--Smith algorithm in detail.

Before turning to the distribution-update step, it is useful to write the canonical KS algorithm explicitly. It is a nested fixed-point iteration with an outer loop over forecasting-rule coefficients and an inner loop that solves the individual household’s Bellman equation given those coefficients.

The remarkable empirical finding of Krusell & Smith (1998) is that a log-linear rule in m1m_1 alone achieves R2>0.9999R^2 > 0.9999 and very small forecast errors: the “approximate aggregation” result. Textbook implementations (e.g., the econ-ark/KrusellSmith REMARK) typically converge in 20\sim 20 outer iterations with standard parameters β=0.99\beta = 0.99, α=0.36\alpha = 0.36, log utility, aggregate shocks on {good,bad}\{\text{good},\text{bad}\} with persistence 0.875\approx 0.875, and unemployment rates 4%4\% (good) vs. 10%10\% (bad).

6.2.0.3Bottleneck.

The inner Bellman solve on a two-dimensional (k,m1)(k, m_1) grid is the expensive step. It scales exponentially if one tries to add additional moments (e.g., tracking variance or skewness), which is why the KS algorithm as stated caps out at one moment. Extensions that need more moments must use more powerful function approximators. This is where the Young-histogram DEQN of Section 6.4 (and the alternatives in Section 6.6) come in.

6.3Young’s (2010) Non-Stochastic Simulation

Young (2010) proposed a deterministic method for propagating the wealth distribution that avoids Monte Carlo sampling entirely.

6.3.0.1Core idea.

Represent the wealth distribution at time tt as a histogram over discrete bins, using two ingredients:

The histogram value Gt(ki,εj)[0,1]G_t(k_i, \varepsilon_j) \in [0,1] is the probability mass of agents sitting at capital bin kik_i in shock state εj\varepsilon_j at time tt; by construction, i,jGt(ki,εj)=1\sum_{i,j} G_t(k_i, \varepsilon_j) = 1.

Given the household policy function k=g(k,ε,m1,a)k' = g(k, \varepsilon, m_1, a), where:

the key operation is mass redistribution. Evaluating gg at a bin (ki,εj)(k_i,\varepsilon_j) produces the savings target k=g(ki,εj,m1,a)k' = g(k_i, \varepsilon_j, m_1, a), which generically lies off the grid. Let n=n(i,j){1,,Nk1}n = n(i,j) \in \{1,\ldots,N_k-1\} denote the bracketing index defined by knk<kn+1k_n \leq k' < k_{n+1}, and let

mGt(ki,εj)m \equiv G_t(k_i,\varepsilon_j)

be the probability mass currently sitting at bin (ki,εj)(k_i,\varepsilon_j). This mass mm is then split between the two bracketing grid points knk_n and kn+1k_{n+1} using linear-interpolation weights:

ω=1kknkn+1kn,mass ωm to kn,(1ω)m to kn+1.\omega = 1 - \frac{k' - k_n}{k_{n+1} - k_n}, \qquad \text{mass } \omega\cdot m \text{ to } k_n, \quad (1-\omega)\cdot m \text{ to } k_{n+1}.

Figure Figure 6.2 illustrates this operation. A mass mm sitting at the off-grid savings target kk' is split between the two bracketing grid points knk_n and kn+1k_{n+1}, with the weight ω\omega proportional to the proximity of kk' to knk_n (so ω1\omega \to 1 when kknk' \to k_n and ω0\omega \to 0 when kkn+1k' \to k_{n+1}).

Linear interpolation in Young’s method. Mass m at off-grid point k' is redistributed to the two bracketing grid points k_n and k_{n+1} using weights \omega and 1-\omega. Closer proximity to a grid point yields a larger share of the mass.

Figure 6.2:Linear interpolation in Young’s method. Mass mm at off-grid point kk' is redistributed to the two bracketing grid points knk_n and kn+1k_{n+1} using weights ω\omega and 1ω1-\omega. Closer proximity to a grid point yields a larger share of the mass.

Why exactly this weight? The lottery weight ω\omega in Eq. (6.3) is not an arbitrary choice: it is the unique value for which the two-point split has conditional mean equal to the off-grid policy choice kk'. Solving the equation ωkn+(1ω)kn+1=k\omega\cdot k_n + (1-\omega)\cdot k_{n+1} = k' for ω\omega recovers Eq. (6.3) in one line:

ωkn+(1ω)kn+1=kn+(1ω)(kn+1kn)=kn+(kkn)=k.\omega\cdot k_n + (1-\omega)\cdot k_{n+1} = k_n + (1-\omega)(k_{n+1} - k_n) = k_n + (k' - k_n) = k'.

By linearity of expectation, this conditional mean equality at every grid bin extends to the full distribution: the unconditional mean of Gt+1G_{t+1} equals the unconditional mean of the policy-implied next-period capital. Higher moments (variance, percentiles) are only approximated; the leading error is of order (Δk)2(\Delta k)^2 on smooth densities, so a finer grid improves higher-moment fidelity at no cost to mean preservation.

6.3.0.2Worked example.

We illustrate the full histogram update with a small grid of Nk=4N_k=4 capital levels {1.0,2.0,3.0,4.0}\{1.0,\, 2.0,\, 3.0,\, 4.0\} and two idiosyncratic states ε{low,high}\varepsilon \in \{\text{low},\, \text{high}\}.

Step 1, Setup. The initial histogram G0G_0 (masses summing to 1):

k=1.0k=1.0k=2.0k=2.0k=3.0k=3.0k=4.0k=4.0Row sum
ε=low\varepsilon = \text{low}0.100.200.100.050.45
ε=high\varepsilon = \text{high}0.050.150.200.150.55

The mean capital is kˉ0=i,jG0(ki,εj)ki=0.10(1)+0.20(2)+0.10(3)+0.05(4)+0.05(1)+0.15(2)+0.20(3)+0.15(4)=2.55\bar{k}_0 = \sum_{i,j} G_0(k_i,\varepsilon_j)\cdot k_i = 0.10(1) + 0.20(2) + 0.10(3) + 0.05(4) + 0.05(1) + 0.15(2) + 0.20(3) + 0.15(4) = 2.55.

Step 2, Policy evaluation. Let y(ε)y(\varepsilon) denote the dollar productivity associated with shock state ε\varepsilon, with ylow=1y_{\text{low}}=1 and yhigh=3y_{\text{high}}=3 (here ε\varepsilon is the state index, yy is its numerical value). Using a simple linear savings rule k=0.4k+0.5y(ε)k' = 0.4 k + 0.5\,y(\varepsilon):

k=1.0k=1.0k=2.0k=2.0k=3.0k=3.0k=4.0k=4.0
ε=low\varepsilon = \text{low}: kk'0.91.31.72.1
ε=high\varepsilon = \text{high}: kk'1.92.32.73.1

Step 3, Interpolation weights. Since the grid spacing is Δk=1.0\Delta k = 1.0, each off-grid kk' is bracketed by [kn,kn+1][k_n, k_{n+1}] with weight ω=1(kkn)/Δk\omega = 1 - (k' - k_n)/\Delta k:

k=1.0k=1.0k=2.0k=2.0k=3.0k=3.0k=4.0k=4.0
ε=low\varepsilon = \text{low}: bracket, ω\omega[1,1][1,1], clip[1,2][1,2], 0.7[1,2][1,2], 0.3[2,3][2,3], 0.9
ε=high\varepsilon = \text{high}: bracket, ω\omega[1,2][1,2], 0.1[2,3][2,3], 0.7[2,3][2,3], 0.3[3,4][3,4], 0.9

When k<k1=1.0k' < k_1 = 1.0 (here k=0.9k' = 0.9 for the low-state, lowest-capital agents), all mass is assigned to k1k_1 (clipped to the boundary).

Step 4, Mass redistribution. For simplicity, assume the shock transition is the identity (ε=ε\varepsilon' = \varepsilon), so mass stays in its current ε\varepsilon-state. Building G1G_1 by accumulating the redistributed mass from each source bin (mass is conserved bin-by-bin: e.g. the ε=low\varepsilon=\text{low}, k=2k=2 source of mass 0.20 contributes 0.70.20=0.140.7\cdot0.20 = 0.14 to k=1k=1 and 0.30.20=0.060.3\cdot0.20 = 0.06 to k=2k=2):

k=1.0k=1.0k=2.0k=2.0k=3.0k=3.0k=4.0k=4.0Row sum
ε=low\varepsilon = \text{low}0.2700.1750.0050.0000.450
ε=high\varepsilon = \text{high}0.0050.2100.3200.0150.550
total0.2750.3850.3250.0151.000

Step 5, Mean verification. The mean of G1G_1 is kˉ1=0.275(1)+0.385(2)+0.325(3)+0.015(4)=2.08\bar{k}_1 = 0.275(1)+0.385(2)+0.325(3)+0.015(4) = 2.08. The unclipped, policy-implied mean is i,jG0(ki,εj)k(ki,εj)=2.07\sum_{i,j} G_0(k_i,\varepsilon_j)\,k'(k_i,\varepsilon_j) = 2.07, so boundary clipping at k=0.9k' = 0.9 raises the mean by only 0.01. Mean preservation is exact for source bins whose policy choice kk' lies strictly inside the grid; clipping at the boundary slightly biases the mean, here upward, because mass that would have landed at k=0.9k'=0.9 is forced to k=1k=1. In general, boundary clipping biases the mean in the direction of the violated boundary: clipping at kmink_{\min} biases the mean upward (mass is pulled in from below the grid), clipping at kmaxk_{\max} biases it downward. With a wider grid (kmin<0.9k_{\min} < 0.9) the mean would be preserved exactly.

The full picture: one cell splits into four. The worked example above used identity shock transitions to keep the arithmetic visible. The general case combines the capital lottery with a shock fork: each piece of mass split between kJk_J and kJ+1k_{J+1} is then split again across the reachable next-period ε\varepsilon' values according to πεε,a\pi_{\varepsilon'\mid\varepsilon, a}. With two shocks ε{L,H}\varepsilon \in \{L, H\} this produces four destination bins for every source bin, with weights given by the product {ω,1ω}×{πεL,πεH}\{\omega, 1-\omega\} \times \{\pi_{\varepsilon L}, \pi_{\varepsilon H}\}. Figure Figure 6.3 reproduces this two-stage cascade (essentially Fig. 1 of Young (2010)), annotated with a concrete numerical case.

Verification: 0.027 + 0.003 + 0.018 + 0.002 = 0.05 = m ✓.The conditional mean of next-period capital is \omega k_J + (1-\omega) k_{J+1} = k', by Eq. .

Figure 6.3:Verification: 0.027+0.003+0.018+0.002=0.05=m0.027 + 0.003 + 0.018 + 0.002 = 0.05 = m .The conditional mean of next-period capital is ωkJ+(1ω)kJ+1=k\omega k_J + (1-\omega) k_{J+1} = k', by Eq. (6.4).

Young’s cascade for one source bin (essentially Fig. 1 of Young (2010)). Mass mm at (k,ε=L)(k, \varepsilon{=}L) flows in two stages: the capital lottery (ω\omega vs. 1ω1-\omega) sends it to the bracketing grid points kJk_J and kJ+1k_{J+1}, and the shock fork (πLL\pi_{LL} vs. πLH\pi_{LH}) splits each piece across the reachable next-period idiosyncratic states. Each of the four leaves receives the product of its stage-1 and stage-2 weights times the source mass; the four leaf masses sum back to mm. Repeating the cascade for every active source bin and accumulating the leaves yields the new histogram Gt+1G_{t+1}.

A reader implementing the method should recognize three properties from the figure: (i) mass is conserved bin-by-bin because ω+(1ω)=1\omega + (1-\omega) = 1 and επεε=1\sum_{\varepsilon'} \pi_{\varepsilon\varepsilon'} = 1; (ii) the capital lottery’s expected next-period kk equals the policy choice kk' by Eq. (6.4); (iii) the entire update is linear in GtG_t, so it is a sparse matrix-vector product Gt+1=T(ρ)GtG_{t+1} = T(\rho)\,G_t where the transition operator TT depends on the current policy. This linearity is what makes the histogram update differentiable in the policy values almost everywhere, conditional on the interpolation brackets, when we embed it inside a neural-network training loop in Section 6.4. Index changes at bin boundaries and clipping at domain edges are nondifferentiable; standard implementations either ignore these measure-zero events, smooth the assignment, or stop gradients through the index-selection step, and in practice none of these choices materially affects training in the calibrations covered in this chapter.

6.3.0.3The histogram update algorithm.

At each time step, the histogram is updated deterministically:

Implementation cheatsheet. The pseudocode above translates into a few lines of NumPy; the inner double loop can also be vectorised with np.add.at for production use.

[language=Python, basicstyle=\footnotesize\ttfamily, frame=single, backgroundcolor=\color{backcolor}, numbers=none, escapeinside={(*}{*)}]
import numpy as np

def young_step(G, kp, pi_eps, k_grid):
    """One Young (2010) histogram update on a uniform k-grid.
    G[i,j]      mass at (k_grid[i], eps_j),       sums to 1
    kp[i,j]     policy choice k'(k_i, eps_j),     possibly off-grid
    pi_eps[j,jp] = Pr(eps_{t+1}=jp | eps_t=j)
    """
    G_next = np.zeros_like(G)
    dk     = k_grid[1] - k_grid[0]                       # uniform spacing
    n_k    = len(k_grid)
    for j in range(G.shape[1]):                          # current shock
        for i in range(G.shape[0]):                      # current capital bin
            x = kp[i, j]
            # Boundary handling: clip BOTH the bracket index J AND the
            # lottery weight omega.  Otherwise an off-grid x produces
            # omega < 0 or omega > 1 and hence negative or super-unit mass.
            if x <= k_grid[0]:
                J, omega = 0, 1.0                        # all mass to the floor
            elif x >= k_grid[-1]:
                J, omega = n_k - 2, 0.0                  # all mass to the cap
            else:
                J     = int((x - k_grid[0]) // dk)
                J     = min(max(J, 0), n_k - 2)
                omega = (k_grid[J + 1] - x) / dk         # in [0, 1]
            for jp in range(G.shape[1]):                 # next-period shock
                w = pi_eps[j, jp] * G[i, j]              # shock fork x source mass
                G_next[J,     jp] += omega       * w
                G_next[J + 1, jp] += (1 - omega) * w
    return G_next

The four scatter-add lines correspond exactly to the four leaves of Figure Figure 6.3: each leaf receives omega or (1-omega) from the capital lottery, multiplied by pi_eps[j, jp] from the shock fork, multiplied by the source mass. The Krusell--Smith JAX tutorial in lectures/lecture_10_sequence_space_deqns/code/lecture_10_KrusellSmith_Tutorial_CPU.ipynb implements this same operation as distribution_step, vectorised over the grid via jax.vmap and accumulated with .at[ ].add( ).

Figure Figure 6.4 visualizes the five stages of a single forward step.

Flow diagram for one forward step of Young’s histogram update (Algorithm ). Starting from G_t, the policy function is evaluated at every active bin, the resulting off-grid savings are interpolated back onto the grid, and idiosyncratic shock transitions redistribute mass across \varepsilon-states to produce G_{t+1}.

Figure 6.4:Flow diagram for one forward step of Young’s histogram update (Algorithm Algorithm 6.1). Starting from GtG_t, the policy function is evaluated at every active bin, the resulting off-grid savings are interpolated back onto the grid, and idiosyncratic shock transitions redistribute mass across ε\varepsilon-states to produce Gt+1G_{t+1}.

Comparison with Monte Carlo. Young’s method produces zero sampling noise (deterministic), preserves the mean exactly, requires only \sim100--5,000 grid points (versus >>50,000 agents for Monte Carlo), and is fully reproducible. The trade-off is that it approximates higher moments and requires a grid that is wide enough to contain all mass. The following table summarizes the comparison:

Young’s methodPanel simulation
Sampling noiseNoneO(1/N)\mathcal{O}(1/\sqrt{N})
Mean preservationExactApproximate
Typical size100--5,000 grid points>>50,000 agents
ReproducibilityDeterministicSeed-dependent
Higher momentsApproximatedApproximated

Figure Figure 6.5 contrasts the two approaches visually: the histogram method yields a smooth, noise-free distribution, while a Monte Carlo panel of comparable size exhibits visible sampling noise.

Young’s histogram (left) versus Monte Carlo panel simulation (right). Both approximate the same underlying wealth density (dashed). The histogram method is deterministic and smooth; the Monte Carlo panel exhibits \mathcal{O}(1/\sqrt{N}) sampling noise that contaminates downstream OLS regressions in the Krusell--Smith algorithm. The bars in this figure are a TikZ schematic illustrating the two regimes; for the actual histograms produced by the algorithm see notebook lecture_09_10_Youngs_Method_Examples in the Lecture-09 code/ folder.

Figure 6.5:Young’s histogram (left) versus Monte Carlo panel simulation (right). Both approximate the same underlying wealth density (dashed). The histogram method is deterministic and smooth; the Monte Carlo panel exhibits O(1/N)\mathcal{O}(1/\sqrt{N}) sampling noise that contaminates downstream OLS regressions in the Krusell--Smith algorithm. The bars in this figure are a TikZ schematic illustrating the two regimes; for the actual histograms produced by the algorithm see notebook lecture_09_10_Youngs_Method_Examples in the Lecture-09 code/ folder.

The absence of sampling noise matters for the Krusell--Smith algorithm: Monte Carlo noise in the realized mean contaminates the OLS regression that updates the forecasting rule, potentially destabilizing convergence.

Grid design. Two separate grids are used in practice. The value function grid (typically \sim150 irregularly spaced points) is finer near the borrowing constraint where the policy function has high curvature and coarser for large kk where behavior is smooth. The simulation grid for Young’s histogram (typically \sim1,000--5,000 uniformly spaced points) uses uniform spacing to produce smooth histograms without artifacts. The upper bound kmaxk_{\max} must be chosen large enough that no mass reaches the boundary; if k>kmaxk' > k_{\max} for any agent, all mass is assigned to the last grid point, which violates mean preservation. A practical safeguard is to run a preliminary simulation and verify that the boundary bins contain negligible mass.

The full Krusell--Smith algorithm. Combining value function iteration (VFI) with Young’s simulation yields:

  1. Initialize forecasting coefficients A(a),B(a)A(a), B(a).

  2. Solve the household problem via VFI given the forecasting rule H^\hat{H}.

  3. Forward-simulate the distribution for TT periods via Young’s method, recording realized means m1tm_1^t.

  4. Re-estimate A(a),B(a)A(a), B(a) by OLS on simulated (m1t,m1t+1,at)(m_1^t, m_1^{t+1}, a_t).

  5. Check convergence; if not converged, return to step 2.

Young’s non-stochastic simulation makes step 3 essentially noise-free and fast relative to a large Monte Carlo panel. The full outer-loop iteration count and wall-clock time, however, remain implementation- and tolerance-dependent because the VFI solve and forecasting-rule fixed point are still present; the speedup applies to the simulation step, not as a generic wall-clock guarantee for the traditional KS workflow.

6.3.0.5Convergence and accuracy.

A remarkable empirical finding is that the log-linear forecasting rule (6.1) achieves R2>0.9999R^2 > 0.9999 in the standard Krusell--Smith economy: the first moment of the wealth distribution is a nearly sufficient statistic for predicting next-period prices. Adding higher moments (variance, skewness) to the forecasting rule barely improves the fit. This “approximate aggregation” result does not hold universally; it relies on the specific features of the Krusell--Smith calibration (small aggregate shocks, moderate borrowing constraint), but it is a useful benchmark against which richer models can be compared.

6.4DEQN with a Continuum of Agents

The traditional Krusell--Smith algorithm provides the benchmark logic for this chapter, but the classroom DEQN notebook used in the course is the simpler Bewley endowment economy of Appendix A.5 in Azinovic et al. (2022). This distinction is pedagogically useful. Krusell--Smith explains why distribution tracking matters and why Young’s method is valuable inside an outer forecasting-rule loop. Appendix A.5 then shows how the same histogram machinery enters a neural-equilibrium implementation once one replaces the forecasting rule by a price network. By combining Young’s histogram method with neural network policies, the DEQN approach overcomes both main limitations of the traditional KS workflow: the network can condition on the full histogram, and there is no need for a separate forecasting rule because the price network directly takes the distribution as input.

6.4.0.1What Section 6.4 inherits from Appendix A.5.

Five features of the Appendix-A.5 teaching model anchor the rest of this section and the companion notebook 11_Continuum_of_Agents_DEQN.ipynb; they are the distinguishing departures from the canonical Krusell--Smith calibration of §Section 6.2--Section 6.3:

6.4.0.2Histogram encoding.

The aggregate state is encoded as a vector containing three aggregate-shock index entries plus NbN_b histogram values for each idiosyncratic shock type. In the Appendix A.5 notebook these three entries are the combined six-state aggregate index, the income-level index, and the uncertainty-regime index:

xtagg=(ztidx,inctidx,unctidx3 shock-index entries,  htη=0.8(b1),,htη=0.8(bNb)Nb values,  htη=1.2(b1),,htη=1.2(bNb)Nb values).x_t^{agg} = \bigl(\underbrace{z_t^{\mathrm{idx}},\,\mathrm{inc}_t^{\mathrm{idx}},\,\mathrm{unc}_t^{\mathrm{idx}}}_{3\text{ shock-index entries}},\; \underbrace{h_t^{\eta=0.8}(b_1),\ldots,h_t^{\eta=0.8}(b_{N_b})}_{N_b\text{ values}},\;\underbrace{h_t^{\eta=1.2}(b_1),\ldots,h_t^{\eta=1.2}(b_{N_b})}_{N_b\text{ values}}\bigr).

For Nb=100N_b = 100 grid points and 2 idiosyncratic states, the aggregate state has dimension 3+200=2033 + 200 = 203. The full input to the policy network adds the individual state (bt,ηt)(b_t, \eta_t), giving total dimension 205. This is the production setting; the checked-in smoke and teaching runs use Nb=50N_b = 50, so the aggregate state has dimension 103 and the policy input 105. Figure Figure 6.6 shows how the histogram vector and individual state are assembled and fed into the two networks.

6.4.0.3How the notebooks fit together.

The companion notebook sequence mirrors this decomposition. 10_Youngs_Method_Examples.ipynb isolates Young’s redistribution operator on toy examples, first in one dimension and then with idiosyncratic shocks. 11_Continuum_of_Agents_DEQN.ipynb then reuses the same logic inside the full aggregate state vector (6.9). In other words, the second notebook is not a new distributional device; it is the first notebook’s histogram update embedded in a larger equilibrium-learning loop.

5+2N_b

Figure 6.6:5+2Nb5+2N_b

3+2Nb3+2N_b

Histogram encoding and neural network architecture. The individual state (ηt,bt)(\eta_t, b_t) (blue boxes / blue arrows) and the aggregate state (ztidx,inctidx,unctidx,htη=0.8,htη=1.2)(z_t^{\mathrm{idx}},\mathrm{inc}_t^{\mathrm{idx}},\mathrm{unc}_t^{\mathrm{idx}}, h_t^{\eta=0.8}, h_t^{\eta=1.2}) (orange box for the three shock-index entries, red boxes for the two histograms) are concatenated as input to the policy network Npol\mathcal{N}_{pol}; each input arrow is colored to match its source box and enters the policy-input layer at a distinct horizontal offset, so the five arrows are uniquely identifiable at a glance. The price network Nprice\mathcal{N}_{price} receives only the aggregate state. Both networks use softplus output activations.

Neural network architecture. Two networks are trained jointly:

Both networks use two hidden layers with 500 units (production) or 128 (the checked-in smoke/teaching runs).

Equilibrium conditions as loss. The loss function comprises five terms, all of which should be zero in equilibrium:

L=1Ni=1N[EEi2+BEi2+(nZMCi)2+KKTi2+CBi2],\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left[\mathrm{EE}_i^2 + \mathrm{BE}_i^2 + \left(n_Z\,\mathrm{MC}_i\right)^2 + \mathrm{KKT}_i^2 + \mathrm{CB}_i^2\right],

where EE is the Euler equation residual, BE is the Bellman equation consistency condition, MC is the bond market clearing condition, KKT is the borrowing complementarity, and CB penalizes negative consumption. The market-clearing residual is rescaled by nZn_Z (the number of aggregate-shock states) before squaring, which puts the single market-clearing residual on the same scale as the per-state Euler, Bellman, and complementarity residuals (themselves evaluated in relative terms, divided through by consumption).

6.4.0.4Market clearing via histogram.

A key advantage of the histogram representation is that market clearing can be computed exactly (no Monte Carlo). The equilibrium condition is η,jht(η,bj)b(η,bj,xtagg)=Bˉ\sum_{\eta,j} h_t(\eta,b_j)\,b'(\eta,b_j,x_t^{agg}) = \bar B; the residual that enters the loss is its deviation from zero,

MC  =  ηj=1Nbht(η,bj)b(η,bj,xtagg)    Bˉ  =  0 at equilibrium,\mathrm{MC} \;=\; \sum_{\eta}\sum_{j=1}^{N_b} h_t(\eta, b_j)\cdot b'(\eta, b_j, x_t^{agg}) \;-\; \bar B \;=\; 0 \text{ at equilibrium},

where ht(η,bj)h_t(\eta, b_j) is the histogram mass, b()b'(\cdot) is the policy network output, and Bˉ\bar B is aggregate net bond supply. We write the signed residual here because it enters the loss (6.10) squared, which is also how the notebook computes it. The Appendix A.5 notebook normalizes Bˉ=1\bar B=1, which is the source of the “-1” term sometimes seen in code and slides.

6.4.0.5Young’s method inside the training loop.

During each training episode, the histogram is propagated forward using Young’s method (Algorithm Algorithm 6.1) with the current neural network providing the policy function. This creates a sequence of aggregate states (x0agg,x1agg,,xTagg)(x_0^{agg}, x_1^{agg}, \ldots, x_T^{agg}) on which the equilibrium residuals (6.10) are evaluated. Young’s redistribution operator is differentiable almost everywhere (linear interpolation conditional on fixed brackets), so the one-step histogram update that enters the market-clearing residual at each sampled state carries gradients back to the policy network. The longer simulated path of aggregate states is treated as data: it is regenerated each episode from the current network and held fixed inside the gradient tape, rather than backpropagated through end to end. Even so the distribution co-evolves with the network: early in training, when the policy network is inaccurate, the simulated path will be “wrong,” but the market-clearing residuals evaluated along it provide corrective feedback through the loss, and as the network improves the distribution converges toward the ergodic steady state.

6.5Results and Discussion

In the Appendix A.5 teaching model, the DEQN with histogram encoding achieves competitive accuracy: Euler equation errors are of order 10-3, and market-clearing residuals are comparably small. These figures come from the checked-in teaching/smoke configuration of the companion notebook (a small network trained for a modest number of episodes); the production configuration tightens them further. The broader lesson for the Krusell--Smith benchmark is conceptual rather than model-specific. Compared to the traditional KS algorithm, the DEQN approach has two advantages: (i) the neural network can condition on the full distribution rather than just its mean, providing a richer approximation that can capture situations where higher moments of the distribution matter for prices, and (ii) there is no need for a separate outer loop to update forecasting coefficients, since equilibrium conditions are enforced directly through the loss function.

6.6Alternative Deep-Learning Approaches to Krusell--Smith

Before turning to the deep-learning alternatives, it is useful to set the histogram-DEQN method in the broader landscape of solution techniques for heterogeneous-agent equilibria with aggregate shocks. Table Table 6.1 compares classical and modern approaches along four dimensions that drive method choice in practice.

Table 6.1:Heterogeneous-agent solution methods at a glance. The first three rows are classical or finite-difference; the last three are the modern deep-learning families compared in detail in Table Table 6.2 and the rest of this section.

MethodDistribution rep.Aggregate stateSolution principleBest when
Classical KS Krusell & Smith, 1998Panel of N104N \sim 10^4 agentsFirst moment(s)Bounded-rationality fixed point of forecasting ruleStandard incomplete-markets, low-dim aggregate state
Reiter (back-loaded) Reiter, 2009Histogram on a fixed gridFirst-order perturbation around stationaryLinearize after solving the steady stateSmall aggregate shocks, smooth policies
Continuous-time Achdou et al. Achdou et al., 2022Density solving a KFE PDEIn the limit, the entire densityCoupled HJB+KFE finite differencesPDE-friendly model, smooth densities
Histogram-DEQN Azinovic et al., 2022Young histogram on gridHistogram entries (or moments thereof)SGD on equilibrium residualsStrong nonlinearities, occasionally binding constraints
All-in-one DL Maliar et al., 2021Panel of N103N \gtrsim 10^3 agentsAgent-level states, policy and aggregate togetherSGD on stacked Euler+market-clearing residualsMany states per agent, GPU available
DeepHAM Han et al., 2024Permutation-invariant set encoder (DeepSets)Learned M ⁣ ⁣NM\!\ll\!N generalized momentsCumulative utility along simulated paths (policy-gradient with structural individual dynamics)Want a low-dim aggregate state without committing to a moment a priori

Whereas Table Table 6.1 is panoramic (classical and DL methods on common axes), Table Table 6.2 drills into the DL trio along the axes that matter when choosing among them. Histogram-DEQN is not the only deep-learning approach to heterogeneous-agent equilibria, and it is pedagogically useful to see how the space of deep-learning strategies decomposes. Three broad families have emerged in the literature. Two informative axes organize them: how the cross-sectional distribution is represented as input to the network, and what objective is optimized. Histogram-DEQN and All-in-One DL minimize residuals of the structural equilibrium equations; DeepHAM instead maximizes cumulative utility along simulated paths and uses Bellman residuals as a validation diagnostic.

Table 6.2:Three deep-learning approaches to heterogeneous-agent equilibria, all applied to the Krusell--Smith benchmark. They differ on two axes: how the cross-sectional distribution is encoded as input to the network, and what objective is optimized. Histogram-DEQN and All-in-One DL minimize squared structural residuals (Euler, Bellman, market clearing); DeepHAM instead maximizes cumulative utility along simulated paths, with the individual law of motion embedded in the computational graph and Bellman residuals tracked as a validation diagnostic.

Histogram DEQN

All-in-one DL

DeepHAM

Azinovic et al., 2022

Maliar et al., 2021

Han et al., 2024

State distribution representation

Explicit histogram vector on a fixed grid

Explicit panel of NN agents’ states

Learned generalized moments via a permutation-invariant encoder

Dimension of aggregate state seen by the network

O(Nb)\mathcal{O}(N_b) histogram entries

O(N)\mathcal{O}(N) agent states

MNbM \ll N_b learned scalars

Interpretability of distributional state

High (histogram readable)

Low (permutation-dependent)

High in the economic sense (the learned basis is interpretable, e.g. concave in assets, linking the moment to MPCs and redistribution)

Permutation invariance

Automatic (histogram is invariant)

Requires careful architecture / data augmentation

Baked into the encoder by DeepSets structure

Training objective

Euler + Bellman + market clearing + FB residuals (squared)

Euler + Bellman + market clearing residuals (squared)

Cumulative utility along simulated paths plus value-function error; Bellman residual is a validation diagnostic only

Reported accuracy (baseline KS)

Euler errors 103\sim 10^{-3}--10-4

Approximation errors <1%< 1\% with >103>10^3 agents

Bellman residual (used as a validation diagnostic) reduced by 68.9%68.9\% vs. KS with one learned generalized moment

6.6.1All-in-One Deep Learning (Maliar, Maliar & Winant, 2021)

Maliar et al. (2021) propose what they call all-in-one deep learning. The key observation is that every dynamic economic model can be cast, at its core, as a collection of expectation conditions (optimality, feasibility, market clearing) that vanish at the true solution. They rewrite these conditions as non-linear regression equations with zero dependent variable, parameterize the policy and value functions by deep neural networks, and minimize the expected-squared residual by stochastic gradient descent on simulated paths.

For the Krusell--Smith benchmark with aggregate shocks and a continuum of agents, Maliar et al. (2021) formulate the following components of the loss. An alternative deep-learning route to the same problem is the symmetry-exploiting parameterisation of Kahou et al. (2021), which uses a permutation-invariant aggregation of agent-level features; the modern perturbation route is the refined Reiter implementation of Bayer & Luetticke (2020). Both are useful complements to the Young/DEQN combination developed below, with different scaling profiles and code-complexity tradeoffs.

EE(s)  =  u(ct(s))βEt ⁣[u(ct+1(s))Rt+1],\mathrm{EE}(s) \;=\; u'(c_t(s)) - \beta\,\mathbb{E}_{t}\!\left[u'(c_{t+1}(s'))\,R_{t+1}\right],

where s=(kt,εt;St)s = (k_t,\varepsilon_t; \bm{S}_t) is the individual-plus-aggregate state and St\bm{S}_t is an NN-agent panel of capital holdings. The policy network outputs ct=πρ(s)c_t = \pi_\rho(s); the Euler residual is evaluated at simulated states and the expectation by Monte Carlo over next-period idiosyncratic and aggregate shocks. With the borrowing constraint k0k'\ge 0, the plain Euler residual is insufficient: at a binding constraint the equation can be slack, so the loss must combine EE with a complementarity condition via Fischer--Burmeister or KKT (see below).

A central computational contribution is the all-in-one integration operator: a single Monte Carlo realization of the next-period state is used to estimate the combined expectation across all residual terms simultaneously, rather than evaluating separate quadrature nodes for each conditional expectation. This reduces the per-iteration cost from O(Qm)\mathcal{O}(Q^m) (with QQ nodes per shock and mm shocks) to O(1)\mathcal{O}(1) in expectation.

6.6.1.1Scale reported by the authors.

Maliar et al. (2021) demonstrate the approach across several increasingly demanding setups: a Krusell--Smith variant with N1,000N \geq 1{,}000 explicit agents and 2,0012{,}001 state variables (one per agent plus one aggregate) in the baseline parameterization, scaled up to N=10,000N = 10{,}000 agents in a moments-reduced variant where the cross section is summarized by 10--20 aggregate moments; and on a one-agent consumption-savings problem with kinked policies they report approximation errors of at most a fraction of one percentage point. A Python/TensorFlow replication is available at https://github.com/marcmaliar/deep-learning-euler-method-krusell-smith Maliar, 2022; it is the basis for the companion notebook lectures/lecture_09_heterogeneous_agents_youngs_method/code/lecture_09_12_KrusellSmith_DeepLearning.ipynb (described below).

6.6.1.2Why this approach?

All-in-one DL is the closest cousin of the DEQN framework of Chapter Chapter 2. The only conceptual difference is that the aggregate state is represented by an explicit panel of NN agents (a large vector StRN\bm{S}_t \in \R^N) rather than by a histogram on a fixed grid. Stochastic mini-batches draw both individual state and aggregate state, and the law of large numbers delivers permutation-invariance in the limit NN \to \infty without requiring a permutation-invariant architecture. The appeal is conceptual simplicity; the cost is that the input dimension grows with NN and the learner must rediscover permutation symmetry from the data.

6.6.2DeepHAM: Generalized Moments via DeepSets (Han, Yang & E, 2024)

Han et al. (2024) ask a more ambitious question: can the network learn which summary statistics of the cross-sectional distribution are relevant for individual decisions, rather than having the researcher commit to tracking a histogram or the first moment? They propose replacing the distribution μt\mu_t with a small set of generalized moments obtained by averaging a flexible neural feature over the cross-section,

mt  =  1Ni=1Nϕθ(sti)Nϕθ(s)dμt(s),=1,,M,m_t^\ell \;=\; \frac{1}{N}\sum_{i=1}^{N} \phi_\theta^{\,\ell}(s_t^i) \quad\xrightarrow{N\to\infty}\quad \int \phi_\theta^{\,\ell}(s)\, d\mu_t(s), \qquad \ell = 1,\ldots,M,

with mt=(mt1,,mtM)\bm{m}_t = (m_t^1,\ldots,m_t^M) and ϕθ:RdR\phi_\theta^{\,\ell}: \R^{d} \to \R a neural feature encoder trained jointly with the policy and value networks. Equation (6.13) is the canonical DeepSets architecture of permutation-invariant set functions Zaheer et al., 2017: averaging (or, in the continuum limit, integration against μt\mu_t) makes mt\bm{m}_t invariant to permutations of the agents, and the ϕθ\phi_\theta^{\,\ell} encoders are flexible enough (by universal approximation on the permutation-invariant functions) to represent any fixed-arity moment.

The individual’s value and policy functions then take the form

Vt=Vρ(sti;mt,at),kt+1i=πρ(sti;mt,at),V_t = V_\rho(s_t^i; \bm{m}_t, a_t), \qquad k_{t+1}^i = \pi_\rho(s_t^i; \bm{m}_t, a_t),

and all networks (Vρ,πρ,ϕθ)(V_\rho, \pi_\rho, \phi_\theta) are trained jointly. The primitive training objective is to maximize cumulative utility along simulated paths,

J(θ,ρ)  =  E ⁣[t=0T1βtu ⁣(cρ(sti;mt,at))  +  βTVψ(sTi;mT,aT)],J(\theta,\rho) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{T-1} \beta^{t}\,u\!\bigl(c_\rho(s_t^i; \bm{m}_t, a_t)\bigr) \;+\; \beta^{T}\, V_\psi(s_T^i; \bm{m}_T, a_T)\right],

where the expectation is taken over simulated idiosyncratic and aggregate histories generated under the current policy. Squared Bellman residuals are reported as a validation diagnostic, not as the optimization target. Because the individual law of motion, the budget constraint, and the transition structure are known economic objects, they are written directly into the computational graph: gradients of JJ flow through these structural dynamics, in contrast to model-free reinforcement learning where transitions are observed only as samples Yang et al., 2025. Because the generalized moments are themselves parameters of the optimization (not hyperparameters), the method automatically discovers the minimal set of distributional statistics required for equilibrium pricing.

A practical consequence of the policy-gradient formulation is that DeepHAM is well suited to problems where first-order conditions are difficult to write down or inconvenient to use, including constrained-efficiency problems with aggregate shocks, optimal-policy design, and behavioral macro questions; the headline application in Han et al. (2024) is precisely a constrained-efficiency problem solved by simulating the economy under candidate allocation rules and updating the rules to improve social welfare.

6.6.2.1Reported accuracy (validation diagnostics).

The numbers that follow are taken from Han et al. (2024); they have not been independently replicated by the companion notebooks of this chapter, and should be read as reported results rather than as figures we can vouch for from our own runs. On the baseline Krusell--Smith model, the authors report the following Bellman-error reductions, computed ex post as validation measures of the converged solution rather than as the training objective:

Conceptually, DeepHAM bridges two traditions: the approximate-aggregation insight of Krusell & Smith (1998) (a small number of moments can suffice) and the deep-function-approximator philosophy of Maliar et al. (2021) and Azinovic et al. (2022) (the NN need not be restricted to pre-specified basis functions). The official reference implementation, including replication code for the Krusell--Smith benchmarks above, is available at https://github.com/frankhan91/DeepHAM.

6.6.2.2Interpretability of the learned moments.

Although the encoders ϕθ\phi_\theta^{\,\ell} are flexible neural networks, the moments they produce are interpretable in the economic sense relevant to heterogeneous-agent macro: they summarize how the cross-section affects welfare and policy. In the Krusell--Smith application of Han et al. (2024), the learned basis function is concave in assets, so a marginal asset held by a poor household contributes more to the moment than a marginal asset held by a rich household. This links the learned distributional representation directly to marginal propensities to save, redistribution, and welfare effects, which is the natural object of interest in HA models. Generalized moments are therefore not just a flexible compression of μt\mu_t; they double as a device for reading economic content out of the trained equilibrium.

6.6.2.3From learned moments to learned state spaces.

Once equilibrium computation is written as policy-gradient optimization over simulated paths, one can ask a sharper question: do agents need the full distribution as a state variable at all? In Walrasian heterogeneous-agent models, agents care about μt\mu_t only indirectly, through equilibrium prices. Yang et al. (2025) exploit this in their structural reinforcement learning (SRL) framework: agents’ policy functions take low-dimensional prices, or short price histories, as the aggregate state, while μt\mu_t remains part of the simulated environment used to clear markets. This sidesteps the master equation entirely for a substantial class of HA economies and produces a natural conceptual arc: KS chooses moments a priori (Section Section 6.3); DeepHAM learns moments from the equilibrium objective (this section); SRL replaces moments with prices when the model permits, and lets ML help define a tractable equilibrium concept rather than only solving a fixed one.

6.6.3Beyond Walrasian Markets: DeepSAM and Search Frictions

A natural follow-up question is whether the DeepHAM machinery extends to economies in which the cross-sectional distribution affects decisions through channels other than aggregate prices. In standard heterogeneous-agent models the distribution is felt only through equilibrium prices: μt\mu_t enters individual problems by setting rt=r(Kt)r_t = r(K_t) and wt=w(Kt)w_t = w(K_t). In labor markets, search-and-matching, and other non-Walrasian settings, the distribution enters more directly: through the matching technology, the type composition on each side of the market, outside options, and bargained transfers. This makes the equilibrium mapping intrinsically non-Walrasian and forecloses simple price-only summaries of the cross-section.

Payne et al. (2025) address this case with DeepSAM, a deep-learning solver for search-and-matching models with two-sided heterogeneity and aggregate shocks. The architecture inherits the DeepHAM idea of a permutation-invariant set encoder, applied to each side of the market separately, and feeds the resulting type-composition summaries into networks that approximate value functions, matching surplus, and policy. The training objective is again policy-improvement on simulated paths, with the matching technology and bargaining rule embedded structurally. In Walrasian HAM the cross-sectional asset/wealth/income distribution enters individual problems only indirectly, through equilibrium prices rt,wtr_t,w_t; in SAM the type composition on each side of the matching market enters directly, through the matching technology, outside options, and bargained transfers. That is why DeepHAM and DeepSAM, despite sharing a permutation-invariant set-encoder architecture, treat the distribution differently.

6.6.4Which Method, When?

The three deep-learning approaches (Histogram DEQN, Section 6.4; All-in-One DL, Section 6.6.1; and DeepHAM, Section 6.6.2) are complements rather than substitutes. Table Table 6.2 summarizes the practical trade-offs:

Table Table 6.3 distils the same trade-offs into a quick decision aid: when the model fits the row’s "When it shines" column, the matching method is the first one to try.

Table 6.3:Practical chooser for the three DL approaches to Krusell--Smith. When several rows look applicable, the recommended ordering is: start with Histogram DEQN if a clean teaching narrative is the goal; switch to All-in-One DL if grid design is awkward; switch to DeepHAM if the cumulative-utility objective or learned moments are first-order to the question. Adapted from the L09 deck’s Head-to-Head slide.

MethodProsWhen it shines
Histogram DEQN Azinovic et al., 2022Interpretable state; exact market clearing; reuses the DEQN templateTeaching; NbN_b moderate; smooth policies
All-in-One DL Maliar et al., 2021No grid design; large-NN panels; single optimizer for all residualsLarge-NN research problems; OLG-many-cohort extensions
DeepHAM Han et al., 2024Learned moments; cumulative-utility objective; risky steady state; ZLBRich HA macro-finance; constrained-efficiency / optimal-policy design

6.6.4.1Notebook: a classroom-scale all-in-one KS solver.

The accompanying Jupyter notebook lecture_09_12_KrusellSmith_DeepLearning.ipynb (introduced in Section 6.5 and described in detail in the README) implements a classroom-scale version of the all-in-one DL approach of Maliar et al. (2021) on the Krusell--Smith benchmark with the parameters of Krusell & Smith (1998). It uses a single policy network πρ(k,ε,K,a)\pi_\rho(k, \varepsilon, K, a) parameterized by TensorFlow/Keras and an explicit panel of NN agents as the distributional input; the loss is the squared Euler residual, augmented with a Fischer--Burmeister complementarity term at the borrowing constraint, and aggregate consistency is imposed by construction (next-period capital is the cross-sectional mean of the panel’s savings choices) rather than through a separate market-clearing penalty. This is a deliberate simplification of the full all-in-one formulation of Section 6.6.1, which also carries a value network and an explicit market-clearing residual. The notebook is annotated cell-by-cell with the correspondence to the equations in this chapter, and is designed to converge in under ten minutes on a standard CPU. For the production-scale counterpart (up to N=104N = 10^4 agents, GPU acceleration, richer shock structure), we refer the reader to the replication repository of Maliar (2022).

6.6.4.2Benchmarks and replication pointers.

For readers who want to benchmark any of the deep-learning approaches against the traditional KS algorithm, the canonical reference implementation is econ-ark/KrusellSmith The Econ-ARK Team, 2020, which implements the forecasting-rule-update algorithm of Krusell & Smith (1998) and reports R2>0.9999R^2 > 0.9999 within about 20 outer iterations under standard parameters (β=0.99\beta = 0.99, α=0.36\alpha = 0.36, log utility, two aggregate states). Any deep-learning method must at least match that accuracy on the baseline model; the advantages are supposed to appear when the model is extended in directions KS cannot handle cleanly (more moments, many endogenous states, risky steady state).

6.7Extension: Deep Learning in the Sequence Space

The histogram-based DEQN above is transparent because it feeds a direct approximation of the endogenous cross-sectional distribution into the neural network. The price of that transparency is dimensionality: in richer heterogeneous-agent economies, the aggregate state can contain hundreds of histogram entries. Azinovic-Yang & Žemlička (2025) propose a different representation of the aggregate state. Instead of feeding the current endogenous state to the network, they feed a truncated history of exogenous aggregate shocks. The equilibrium logic does not change: one still enforces Euler equations, market clearing, and occasionally binding constraints inside the loss. What changes is the object that summarizes the aggregate state for the network. Figure Figure 6.7 contrasts the two views.

Two ways to encode the aggregate state in deep equilibrium learning. Each pipeline reads top-to-bottom: the upper (colored) box is the input the user gives to the same neural network \mathcal{N}_\rho, the middle (colored) box is the network’s output (policy and price objects), and the green box is the equilibrium loss that consumes those outputs. Histogram DEQNs (left, blue) feed an endogenous state representation (A_t, \mu_t); sequence-space DEQNs (right, red) feed a truncated exogenous shock history z_t^T. Crucially, the network and the residual-based training loss are identical across the two pipelines, only the input changes.

Figure 6.7:Two ways to encode the aggregate state in deep equilibrium learning. Each pipeline reads top-to-bottom: the upper (colored) box is the input the user gives to the same neural network Nρ\mathcal{N}_\rho, the middle (colored) box is the network’s output (policy and price objects), and the green box is the equilibrium loss that consumes those outputs. Histogram DEQNs (left, blue) feed an endogenous state representation (At,μt)(A_t, \mu_t); sequence-space DEQNs (right, red) feed a truncated exogenous shock history ztTz_t^T. Crucially, the network and the residual-based training loss are identical across the two pipelines, only the input changes.

The sequence-space representation. Let ztT:=(ztT+1,,zt)RTz_t^T := (z_{t-T+1}, \ldots, z_t) \in \R^T denote the last TT realizations of the exogenous aggregate shock. The key claim is that, in an ergodic economy, this history is an approximate sufficient statistic for the endogenous aggregate state. In the Brock--Mirman warm-up notebook, the network maps the shock history to a bounded savings rate, from which next-period capital follows by the resource constraint,

st=σ ⁣(Nρ(ztT))(0,1),Kt+1=stztKtα,s_t = \sigma\!\bigl(\mathcal{N}_\rho(z_t^T)\bigr) \in (0,1), \qquad K_{t+1} = s_t\, z_t K_t^\alpha,

where σ\sigma is the logistic squashing that keeps Kt+1K_{t+1} feasible. In the richer heterogeneous-agent version, the network instead maps the same history to higher-level equilibrium objects such as policy-function coefficients or pricing objects. This connects the method to the MIT-shock and sequence-space Jacobian literature of Boppart et al. (2018) and Auclert et al. (2021), but replaces local linear approximations with a global residual-based neural approximation.

Intuition first. The easiest way to think about the method is as a memory compression device. A positive aggregate shock today raises output and therefore raises tomorrow’s capital. That extra capital still matters the period after, but only through the production elasticity α\alpha, so its influence is smaller. One more period later it is smaller again. In other words, the current aggregate state stores a decaying memory of past shocks. The sequence-space idea is to feed that shock history directly to the network rather than feeding the current endogenous state itself.

Intuition for sequence space in Brock--Mirman. \log K_t depends on past shocks with weights that decay like \alpha^j in the lag j (here \alpha = 0.36, the standard capital share). Already at j = 3 the weight has fallen to \sim\!0.05, so a finite history of recent shocks summarizes the relevant aggregate information; very old shocks matter little.

Figure 6.8:Intuition for sequence space in Brock--Mirman. logKt\log K_t depends on past shocks with weights that decay like αj\alpha^j in the lag jj (here α=0.36\alpha = 0.36, the standard capital share). Already at j=3j = 3 the weight has fallen to  ⁣0.05\sim\!0.05, so a finite history of recent shocks summarizes the relevant aggregate information; very old shocks matter little.

Brock--Mirman: what changes relative to Chapter Chapter 2? The Brock--Mirman warm-up is useful because the change can be written down exactly. In Chapter Chapter 2, the state-space DEQN uses the current state as input,

xtstate=(Kt,zt),Ct=Nρ(Kt,zt),Kt+1=ztKtαCt.x_t^{\mathrm{state}} = (K_t, z_t), \qquad C_t = \mathcal{N}_\rho(K_t, z_t), \qquad K_{t+1} = z_t K_t^\alpha - C_t.

In the sequence-space version, the economic model is unchanged, but the network sees a different input:

xtseq=ztT=(ztT+1,,zt),st=σ ⁣(Nρ(ztT)),Kt+1=stztKtα,Ct=(1st)ztKtα.x_t^{\mathrm{seq}} = z_t^T = (z_{t-T+1}, \ldots, z_t), \qquad s_t = \sigma\!\bigl(\mathcal{N}_\rho(z_t^T)\bigr), \qquad K_{t+1} = s_t\, z_t K_t^\alpha, \qquad C_t = (1-s_t)\, z_t K_t^\alpha.

The Euler residual is the same object as before,

Gt=1βCtCt+1αzt+1Kt+1α1,G_t = 1 - \beta \,\frac{C_t}{C_{t+1}} \,\alpha z_{t+1} K_{t+1}^{\alpha-1},

so the economics are unchanged. What changes is the computational representation:

This distinction is important conceptually. For Brock--Mirman, sequence space is not a dimensionality reduction, since (Kt,zt)(K_t, z_t) is only two-dimensional whereas a history of length T=25T=25 is larger. The Brock--Mirman notebook is therefore a pedagogical demonstration of the idea that histories can stand in for endogenous states. The dimensionality gain appears only in richer heterogeneous-agent models, where the relevant alternative is a large histogram or other high-dimensional distributional summary.

Intermediate bridge: sequence-space IRBC. Between the one-shock Brock--Mirman warm-up and the infinite-dimensional Krusell--Smith state, the companion notebook lectures/lecture_10_sequence_space_deqns/code/lecture_10_05b_SequenceSpace_IRBC.ipynb re-trains the two-country IRBC model of Chapter Chapter 3 under sequence-space inputs: the policy network reads the last T=80T=80 shock vectors (a 240-dimensional history with ρzT1.7×102\rho_z^T \approx 1.7\times 10^{-2} truncation error) instead of the four-dimensional current state. The 2N+12N+1 equilibrium residuals (Euler, ARC, Fischer--Burmeister), the Gauss--Hermite quadrature, and the cloud-method sampler are literally unchanged from nb 01; only the input domain changes. Because the current capital stock is no longer an input, we parametrize the output head around the steady state, kj=kssexp(tanhzjk)k'_j = k_{ss}\exp(\tanh z^k_j) and λ=λssexp(tanhzλ)\lambda = \lambda_{ss}\exp(\tanh z^\lambda), which keeps gradients lively at the target policy and prevents the cold-start divergence that plagued a naive softplus head. This notebook is a pedagogical bridge rather than a computational win, at a four-dimensional state the history is much larger, not smaller, but it shows that the same template handles a multi-equation system with multiple independent shock channels before we hand the method over to Krusell--Smith, where the dimensionality gain is real.

Training logic. The computational pattern is also close to the rest of this chapter. One samples an exogenous shock path, constructs overlapping history windows ztTz_t^T, evaluates the network on those windows, and then uses the resulting decisions to simulate the endogenous economy forward. In the Brock--Mirman warm-up this produces the capital sequence directly; in the Krusell--Smith tutorial it produces policy-function objects, while Young’s method still propagates the cross-sectional distribution inside the simulator. Residuals are then evaluated on the simulated path and backpropagated through the full pipeline. Figure Figure 6.9 summarizes this workflow.

Training flow for sequence-space DEQNs. The exogenous shock history is the network input, but the forward simulator still produces endogenous objects such as prices, aggregate capital, or cross-sectional distributions needed for residual evaluation.

Figure 6.9:Training flow for sequence-space DEQNs. The exogenous shock history is the network input, but the forward simulator still produces endogenous objects such as prices, aggregate capital, or cross-sectional distributions needed for residual evaluation.

Why truncated histories can work. The Brock--Mirman warm-up makes the logic especially transparent. With full depreciation (δ=1\delta = 1) and log utility, recursive substitution shows that the capital stock depends on the last TT shocks up to an error of order αTlog(KtT)\alpha^T \log(K_{t-T}). Since α<1\alpha < 1 (typically α0.36\alpha \approx 0.36), this error vanishes exponentially: for T=25T = 25, the truncation error is of order 10-11. More generally, in ergodic economies with persistent aggregate shocks, the approximation error decays at roughly max{ϱ,α}T\max\{|\varrho|, |\alpha|\}^T. In richer heterogeneous-agent models this is no longer an exact algebraic statement, so the history length TT becomes an empirical accuracy choice rather than a theorem.

Why this is useful in heterogeneous-agent models. Two advantages are worth separating. First, as a network input, a history of T25T \approx 25 shocks can be much smaller than a histogram with hundreds of bins. Second, exogenous shock histories are sampled from a fixed distribution. This removes one source of instability in residual-based training: the set of network inputs is anchored by model primitives even though the endogenous simulator still evolves with the current policy network. In the Krusell--Smith tutorial, this means that the network is conditioned on shock histories, while Young’s method remains responsible for propagating the distribution used in market-clearing calculations.

Shape-preserving operator learning. A second contribution of Azinovic-Yang & Žemlička (2025) is to let the network output policy-function objects rather than a single scalar choice. In particular, they construct architectures that guarantee monotonicity and concavity of the predicted consumption rule by representing it with an I-spline basis and non-negative coefficients. In the Krusell--Smith tutorial, the network maps the shock history to these coefficients; the resulting policy can then be evaluated at all idiosyncratic states on the wealth grid. This operator-learning view pairs naturally with the endogenous grid method (EGM) of Carroll (2006) and avoids ad hoc penalties for monotonicity or concavity.

Explicit I-spline MPC parameterization. Having seen why a shape-preserving output head matters (above), we now write the construction down concretely; this is the most technical paragraph of the section and a reader who already accepts the monotonicity/concavity guarantees can skip to “Fischer--Burmeister KKT loss” below. Let {kn}n=0N\{k_n\}_{n=0}^N be a fixed log-spaced wealth grid and let BRJ×(N+1)B \in \mathbb{R}^{J \times (N+1)} be a precomputed I-spline basis evaluated on it,

Bj,n  =  Ij ⁣(log(η+kn)),j=1,,J,  n=0,,N,B_{j,n} \;=\; I_j\!\bigl(\log(\eta + k_n)\bigr), \qquad j = 1,\ldots,J,\; n=0,\ldots,N,

where η>0\eta>0 is a small numerical shift (the BASIS_SHIFT constant in the notebook) and each IjI_j is an integrated B-spline that is monotonically increasing from 0 to 1. For each idiosyncratic state ε\varepsilon, the network outputs two objects: a boundary marginal propensity to consume α(ε)(0,1)\alpha(\varepsilon)\in(0,1) (sigmoid head) and non-negative weights w~j(ε)0\widetilde w_j(\varepsilon)\ge 0 with jw~j(ε)<1\sum_j \widetilde w_j(\varepsilon) < 1 (a “phantom-zero” softmax head). The grid MPC is

MPCε,n  =  α(ε)(1j=1Jw~j(ε)Bj,n),\mathrm{MPC}_{\varepsilon,n} \;=\; \alpha(\varepsilon)\Bigl(1 - \sum_{j=1}^J \widetilde w_j(\varepsilon)\, B_{j,n}\Bigr),

which is decreasing in nn by construction (positive weights times an increasing basis, subtracted off a constant), bounded in [0,α(ε)][0,1][0, \alpha(\varepsilon)] \subset [0,1], and continuous in the network parameters. Consumption is then recovered on the grid by cumulation of the MPC schedule along cash-on-hand m=wε+Rkm = w\varepsilon + Rk,

c(ε,k0)  =  MPCε,0m(ε,k0),c(ε,kn)  =  c(ε,kn1)+MPCε,nR(knkn1),c(\varepsilon, k_0) \;=\; \mathrm{MPC}_{\varepsilon,0}\, m(\varepsilon, k_0), \qquad c(\varepsilon, k_n) \;=\; c(\varepsilon, k_{n-1}) + \mathrm{MPC}_{\varepsilon,n}\, R\, (k_n - k_{n-1}),

and off-grid evaluation uses piecewise-linear interpolation. Equations (6.25)--(6.26) guarantee, by construction and without any auxiliary penalty, that the consumption rule is non-negative, monotonically increasing in kk, concave in kk, and feasible (cmc \le m). In code, BB is the matrix ispline_basis, α\alpha and w~\widetilde w come from the two heads of actor_c_grid, and the cumulation is the closing block of that same function.

Fischer--Burmeister KKT loss. Households face a borrowing constraint kt+10k_{t+1} \ge 0. The Karush--Kuhn--Tucker conditions of the household problem split into two regimes: at an interior optimum the Euler equation holds with equality, while at a binding constraint the Euler equation can be slack but next-period capital is zero. Define the (relative) Euler residual and the (relative) savings slack (in this section gg is reused as the Euler residual, matching the tutorial code’s variable name; it is not the household policy function g(k,ε,)g(k,\varepsilon,\ldots) of Section 6.3)

g  =  cEulercc,s  =  kc,g \;=\; \frac{c_{\text{Euler}} - c}{c}, \qquad s \;=\; \frac{k'}{c},

where cEuler=(u)1 ⁣(βEt[Ru(c)])c_{\text{Euler}} = (u')^{-1}\!\bigl(\beta\,\mathbb{E}_t [R'\,u'(c')]\bigr) is the consumption level implied by the Euler equation given the network’s continuation policy. The KKT pair is then

g=0, s0(interior)org0, s=0(constrained),g = 0,\ s \ge 0 \quad\text{(interior)} \qquad\text{or}\qquad g \ge 0,\ s = 0 \quad\text{(constrained)},

which is a complementarity condition. The Fischer--Burmeister envelope, in the same sign convention used in Ch. Chapter 3 and Ch. Chapter 5,

FB(g,s)  =  g  +  s    g2+s2+ϵfb\mathrm{FB}(g, s) \;=\; g \;+\; s \;-\; \sqrt{g^2 + s^2 + \epsilon_{\text{fb}}}

is smooth and satisfies FB(g,s)=0\mathrm{FB}(g,s) = 0 if and only if min(g,s)=0\min(g, s) = 0 with both non-negative; the small constant ϵfb\epsilon_{\text{fb}} (set to 10-12 in the notebooks) is a numerical stabilizer for the square root. The upstream JAX tutorial code uses the negative-sign variant g2+s2+ϵgs\sqrt{g^2+s^2+\epsilon}-g-s, which has the same zero set when squared. The training loss is the buffer-and-grid average of FB2\mathrm{FB}^2,

L(ρ)  =  E(zH,μ)B[1Nε(N+1)ε,nFB ⁣(gε,n(ρ),sε,n(ρ))2],\mathcal{L}(\rho) \;=\; \mathbb{E}_{(z^H,\mu)\sim\mathcal{B}}\biggl[\, \frac{1}{N_\varepsilon (N+1)} \sum_{\varepsilon,n} \mathrm{FB}\!\bigl(g_{\varepsilon,n}(\rho), s_{\varepsilon,n}(\rho)\bigr)^2 \,\biggr],

so that one differentiable scalar simultaneously enforces the Euler equation in the interior region and the complementarity condition at the borrowing constraint, without case splits or shadow-price augmentation. This reuses the smooth complementarity construction of Fischer (1992) that is standard in nonlinear programming, applied here to a heterogeneous-agent equilibrium loss.

Putting the pieces together: the HA training loop. The Krusell--Smith tutorial assembles the encoder, the I-spline policy head, Young’s distribution step, and the Fischer--Burmeister loss into a single replay-buffer training loop. Definition 6.3 states it explicitly.

Three implementation choices in Definition 6.3 are worth flagging. First, the forward roll is wrapped in stop_gradient so that gradients only flow through the FB residual evaluated on the buffer, not through long simulation chains; this is what makes training tractable for T=100T = 100 horizons. Second, because Young’s step gives exact aggregate sums, the only stochasticity in the gradient comes from buffer mini-batching, which acts as standard SGD noise rather than a Monte-Carlo aggregation noise floor. Third, the buffer simultaneously decouples training-state drawing from the current policy and lets the network see distributions μ\mu generated by earlier policies, which improves coverage of the ergodic set during early training.

Two training algorithms: residual minimization vs. time iteration. Definition 6.3 is one of two families of training schemes used in the sequence-space DL literature. In direct residual minimization (the version above and in our notebook), the network is trained by gradient descent on the squared equilibrium residual itself. In time iteration with EGM, the network is trained on a sequence of supervised regression problems: at each outer iteration, one (i) uses the current network to construct next-period policies, (ii) backs out implied current-period policies via the endogenous-grid method of Carroll (2006), and (iii) updates the network by minimizing the squared error against those EGM targets. Time iteration is more involved and requires a per-batch root-finding step, but it is more flexible: it tolerates non-trivial market clearing and, crucially, handles non-convex choices (e.g., a discrete retirement decision) and non-monotone Laffer curves where the Euler equation has multiple roots that direct residual minimization cannot disambiguate. In practice, residual minimization is the simpler entry point on smooth, convex problems; switch to time iteration when convergence stalls, when the model contains discrete choices, or when the continuation value has convex regions that admit multiple optimal savings.

Applications and notebooks. The paper applies this framework to three demanding models: (i) an OLG economy with portfolio choice and aggregate risk, (ii) an economy with a continuum of heterogeneous firms and households featuring idiosyncratic and aggregate shocks, and (iii) a lifecycle model with a discrete early-retirement choice that introduces local convexities. Mean Euler equation errors are below 0.2% in all cases. The two TensorFlow 2 companion notebooks are intentionally complementary: 05_SequenceSpace_BrockMirman.ipynb is a Brock--Mirman warm-up that isolates the history-to-policy logic in the simplest possible environment, while 06_SequenceSpace_KrusellSmith.ipynb is a compact teaching implementation that combines sequence-space inputs, I-spline policies, and Young’s method in a heterogeneous-agent setting.

A third notebook, KrusellSmith_Tutorial_CPU.ipynb, is a JAX/optax port of the upstream pedagogical tutorial released by the paper’s authors. It exposes the same shape-preserving I-spline MPC parameterization, the same Young step inside the simulator, and the same Fischer--Burmeister KKT loss as the TensorFlow notebook, but in the original JAX form. It is adapted from the upstream tutorial 01_KrusellSmith_Tutorial_CPU.ipynb in the companion code repository Azinovic-Yang & Žemlička, 2025, available at https://github.com/azinoma/DeepLearningInTheSequenceSpace; the local adaptation adds an explicit shape-guarantee diagnostic and additional inline commentary, leaving the algorithm unchanged.

6.8Further Reading

6.9Exercises

Worked solutions and guidance for these exercises appear in Appendix Appendix F.

References
  1. Krusell, P., & Smith, A. A., Jr. (1998). Income and wealth heterogeneity in the macroeconomy. Journal of Political Economy, 106(5), 867–896.
  2. Aiyagari, S. R. (1994). Uninsured Idiosyncratic Risk and Aggregate Saving. The Quarterly Journal of Economics, 109(3), 659–684.
  3. Young, E. R. (2010). Solving the incomplete markets model with aggregate uncertainty using the Krusell–Smith algorithm and non-stochastic simulations. Journal of Economic Dynamics and Control, 34(1), 36–41.
  4. Azinovic, M., Gaegauf, L., & Scheidegger, S. (2022). DEEP EQUILIBRIUM NETS. International Economic Review, 63(4), 1471–1525. 10.1111/iere.12575
  5. Maliar, L., Maliar, S., & Winant, P. (2021). Deep learning for solving dynamic economic models. Journal of Monetary Economics, 122, 76–101.
  6. Bewley, T. (1986). Stationary monetary equilibrium with a continuum of independently fluctuating consumers. In W. Hildenbrand & A. Mas-Colell (Eds.), Contributions to Mathematical Economics in Honor of Gérard Debreu (pp. 79–102). North-Holland.
  7. İmrohoroğlu, A. (1989). Cost of Business Cycles with Indivisibilities and Liquidity Constraints. Journal of Political Economy, 97(6), 1364–1383. 10.1086/261640
  8. Huggett, M. (1993). The Risk-Free Rate in Heterogeneous-Agent Incomplete-Insurance Economies. Journal of Economic Dynamics and Control, 17(5–6), 953–969. 10.1016/0165-1889(93)90024-M
  9. Achdou, Y., Han, J., Lasry, J.-M., Lions, P.-L., & Moll, B. (2022). Income and wealth distribution in macroeconomics: A continuous-time approach. The Review of Economic Studies, 89(1), 45–86.
  10. Maliar, L., Maliar, S., & Valli, F. (2010). Solving the Incomplete Markets Model with Aggregate Uncertainty Using the Krusell–Smith Algorithm. Journal of Economic Dynamics and Control, 34(1), 42–49.
  11. Azinovic-Yang, M., & Žemlička, J. (2025). Deep Learning in the Sequence Space. 10.48550/arXiv.2509.13623
  12. Reiter, M. (2009). Solving heterogeneous-agent models by projection and perturbation. Journal of Economic Dynamics and Control, 33(3), 649–665.
  13. Han, J., Yang, Y., & E, W. (2024). DeepHAM: A Global Solution Method for Heterogeneous Agent Models with Aggregate Shocks. Quantitative Economics.
  14. Kahou, M. E., Fernández-Villaverde, J., Perla, J., & Sood, A. (2021). Exploiting Symmetry in High-Dimensional Dynamic Programming (NBER Working Paper No. 28981). National Bureau of Economic Research.
  15. Bayer, C., & Luetticke, R. (2020). Solving Discrete Time Heterogeneous Agent Models with Aggregate Risk and Many Idiosyncratic States by Perturbation. Quantitative Economics, 11(4), 1253–1288.