Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

9 Gaussian Processes

University of Lausanne

Section 4.4.1 introduced Gaussian processes at the picture level, with just enough machinery to follow the Bayesian-optimization argument of Chapter Chapter 4: the definition f(x)GP(μ(x),k(x,x))f(\x) \sim \mathcal{GP}(\mu(\x), k(\x, \x')), the squared-exponential kernel (4.2), and the closed-form posterior mean and variance (4.3)--(4.4). Those formulas took the kernel and its hyperparameters (,σf,σy)(\ell, \sigma_f, \sigma_y) as given. The present chapter removes both shortcuts and develops the full machinery: it samples whole functions from the GP prior (Section 9.1), learns (,σf,σy)(\ell, \sigma_f, \sigma_y) from data via the marginal-likelihood Occam’s razor (Section 9.2), expands the kernel zoo beyond the squared-exponential to Matérn and composite kernels (Section 9.2.1.1), uses the posterior variance to choose where to evaluate next via Bayesian Active Learning (Section 9.3), scales the whole machinery to high-dimensional inputs through active subspaces (Section 9.5), and embeds GPs inside value-function iteration for high-dimensional dynamic programming (Section 9.6). Two applications-driven chapters then put the machinery to work: structural estimation via surrogate-driven SMM (Chapter Chapter 10) and climate-policy uncertainty quantification (Chapter Chapter 11). Methodological foundations are the GP textbook of Rasmussen & Williams (2005) and the Bayesian-active-learning ideas dating back to MacKay (1992) and Krause et al. (2008). Embedding GPs inside dynamic programming was pioneered by Deisenroth et al. (2009) (GPDP) and Engel et al. (2005) (GP temporal-difference learning); within economics, applications include high-dimensional growth models Scheidegger & Bilionis, 2019, dynamic incentive problems Renner & Scheidegger, 2018, and deep uncertainty quantification for integrated assessment models Friedl et al., 2023. Sparse approximations Titsias, 2009Hensman et al., 2013 extend GPs beyond their cubic scaling limit; deep kernel learning combines neural-network feature extraction with GP inference and closes the chapter.

9.1Gaussian Process Regression

9.1.0.1Sampling from the GP prior.

Recall from Section 4.4.1 that for any finite set of test inputs x1,,xn\x_1, \ldots, \x_n, the function values f=(f(x1),,f(xn))\bm{f} = (f(\x_1), \ldots, f(\x_n))^\top are jointly Gaussian with prior mean μ\bm{\mu} (the entries μ(xi)\mu(\x_i)) and covariance matrix Kij=k(xi,xj)K_{ij} = k(\x_i, \x_j). To draw a function from the GP prior at those test inputs, evaluate KK, compute its Cholesky decomposition K=LLK = L L^\top, draw uN(0,I)\bm{u} \sim \mathcal{N}(\bm{0}, I), and set f=μ+Lu\bm{f} = \bm{\mu} + L \bm{u}. The resulting paths have the smoothness, amplitude, and length scale dictated entirely by the kernel.

9.1.0.2Kernel choice.

The squared-exponential kernel (4.2) of Section 4.4.1 remains the right default for genuinely smooth targets, e.g. in the unconstrained interior of an ergodic set. In economic applications, however, the target function (value, policy, equilibrium price) often has kinks from occasionally binding constraints, and the Matérn-3/23/2 kernel introduced in Section 9.2.1.1 below is then a better default than RBF, since it does not oversmooth at non-differentiable features.

Squared-exponential kernel as a function of distance for three length scales. Small \ell makes correlations decay quickly and produces rougher, more local fits; large \ell couples distant points and imposes smoother functions.

Figure 9.1:Squared-exponential kernel as a function of distance for three length scales. Small \ell makes correlations decay quickly and produces rougher, more local fits; large \ell couples distant points and imposes smoother functions.

Figure Figure 9.1 shows how the RBF length scale controls the distance over which observations remain informative. For any training dataset D={(xi,yi)}i=1n\mathcal{D} = \{(\x_i, y_i)\}_{i=1}^n, the GP posterior at a test point x\x_* has the closed-form mean fˉ\bar{f}_* and variance σf,2\sigma_{f,*}^2 stated in (4.3)--(4.4), with Kij=k(xi,xj)K_{ij} = k(\x_i,\x_j) the kernel matrix on training inputs, Ky=K+σy2IK_y = K + \sigma_y^2 I its noise-augmented version, μX\bm{\mu}_X the prior-mean vector at training inputs, and k\bm{k}_* the cross-covariance whose ii-th entry is k(x,xi)k(\x_*, \x_i). For a noisy future observation yy_* the predictive variance is σy,2=σf,2+σy2\sigma_{y,*}^2 = \sigma_{f,*}^2 + \sigma_y^2, and the common zero-mean formulas are recovered by centering outputs or setting μ0\mu \equiv 0.

9.1.0.3A hand-traceable 1D example.

To make (4.3)--(4.4) concrete, take f(x)=sinxf(x) = \sin x, observe ff noiselessly at x1=0x_1 = 0 and x2=πx_2 = \pi (so y1=y2=0y_1 = y_2 = 0), and query at x=π/2x_\star = \pi/2. Use the kernel k(x,x)=exp ⁣((xx)2/2)k(x, x') = \exp\!\bigl(-(x - x')^2/2\bigr) and a tiny noise floor σy2=106\sigma_y^2 = 10^{-6} for numerical stability. The training kernel matrix is $$K + \sigma_y^2 I ;=;

(1eπ2/2eπ2/21)\begin{pmatrix} 1 & e^{-\pi^2/2} \\ e^{-\pi^2/2} & 1 \end{pmatrix}

;\approx;

(1.0000.007190.007191.000)\begin{pmatrix} 1.000 & 0.00719 \\ 0.00719 & 1.000 \end{pmatrix}
(exp((π/2)2/2)exp((π/2)2/2))\begin{pmatrix} \exp(-(\pi/2)^2/2) \\ \exp(-(\pi/2)^2/2) \end{pmatrix}

;\approx;

(0.29100.2910)\begin{pmatrix} 0.2910 \\ 0.2910 \end{pmatrix}

Figure Figure 9.2 illustrates the GP prior and posterior for a simple one-dimensional regression problem. Before observing data, the GP prior has constant mean and uniform uncertainty. After conditioning on five observations, the posterior mean interpolates the data and the uncertainty bands collapse near the observations while remaining wide in unexplored regions.

Gaussian-process prior and posterior on a 1D regression problem. Left: the prior has constant mean (here zero) and uniform uncertainty; the shaded bands show the 68\% and 95\% credible intervals, and the thin grey curves are three sample paths drawn from the prior. Right: after conditioning on five observations (black dots), the posterior mean (red curve) interpolates the data exactly, and the credible band collapses near the observed points while widening in unexplored regions away from the data, giving the GP its built-in uncertainty quantification.

Figure 9.2:Gaussian-process prior and posterior on a 1D regression problem. Left: the prior has constant mean (here zero) and uniform uncertainty; the shaded bands show the 68%68\% and 95%95\% credible intervals, and the thin grey curves are three sample paths drawn from the prior. Right: after conditioning on five observations (black dots), the posterior mean (red curve) interpolates the data exactly, and the credible band collapses near the observed points while widening in unexplored regions away from the data, giving the GP its built-in uncertainty quantification.

9.2Kernel Functions and Hyperparameter Learning

The kernel hyperparameters ϑ=(,σf,σy)\bm{\vartheta} = (\ell, \sigma_f, \sigma_y) are learned by maximizing the log marginal likelihood:

logp(yX,ϑ)=12(yμX)Ky1(yμX)12logKyn2log2π,\log p(\bm{y} | \X, \bm{\vartheta}) = -\frac{1}{2}(\bm{y}-\bm{\mu}_X)^\top K_y^{-1}(\bm{y}-\bm{\mu}_X) - \frac{1}{2}\log|K_y| - \frac{n}{2}\log 2\pi,

where Ky=K+σy2IK_y = K + \sigma_y^2 I and μX\bm{\mu}_X is the prior mean evaluated on the training inputs. In the zero-mean convention used elsewhere in this section, set μX=0\bm{\mu}_X = 0 (or center the outputs).

9.2.0.1Why marginal likelihood?

The log evidence logp(yX,ϑ)\log p(\bm y \mid \X, \bm\vartheta) encodes both data fit and an automatic complexity penalty in a single closed-form expression. The quadratic form 12yKy1y-\tfrac{1}{2}\bm y^\top K_y^{-1}\bm y rewards hyperparameters that explain the centered observations with a small inverse-covariance norm, while the log-determinant term 12logKy-\tfrac{1}{2}\log|K_y| penalises overly flexible kernels that admit too many possible functions, giving Bayesian Occam’s razor Rasmussen & Williams, 2005. Compared with cross-validated MSE, this approach requires no held-out split, makes use of all nn observations, and exposes a closed-form gradient with respect to ϑ\bm\vartheta, which is essential for L-BFGS-style optimization in scikit-learn / GPyTorch. The maximum is reached at the kernel that is just expressive enough to fit the data but no more (Figure Figure 9.3).

Marginal-likelihood Occam’s razor for a GP. As the kernel becomes more flexible (smaller length scale \ell), the data-fit term improves but the \log|K_y| complexity penalty grows linearly. Their sum, the log evidence, peaks at an interior optimum that is just expressive enough to explain the data, automatically. No held-out validation set is required.

Figure 9.3:Marginal-likelihood Occam’s razor for a GP. As the kernel becomes more flexible (smaller length scale \ell), the data-fit term improves but the logKy\log|K_y| complexity penalty grows linearly. Their sum, the log evidence, peaks at an interior optimum that is just expressive enough to explain the data, automatically. No held-out validation set is required.

9.2.0.2A free held-out diagnostic.

Marginal likelihood is not the only validation tool. The leave-one-out (LOO) predictive error of a GP admits a closed form using the same Cholesky factor already computed for posterior inference, so it costs nothing extra; we develop the formula in Section 9.6.5, where it serves as an iteration-by-iteration health check inside GP-based VFI, and reuse it in Section 10.1.4 for the GP layer over the SMM moment map.

9.2.1Kernel Composition

More complex covariance structures can be built by composing simpler kernels. The sum of two kernels is again a valid kernel, as is the product. For example:

9.2.1.1The Matérn kernel family.

The Matérn kernel is parameterized by a smoothness parameter ν>0\nu > 0:

kMateˊrn(r)=σf221νΓ(ν)(2νr)νKν ⁣(2νr),k_{\mathrm{Mat\acute{e}rn}}(r) = \sigma_f^2\,\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^\nu K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right),

where r=xxr = \|\x - \x'\|, \ell is the length scale, and KνK_\nu is the modified Bessel function of the second kind. The smoothness parameter ν\nu controls the regularity of sample paths: a Matérn-ν\nu GP is kk-times mean-square differentiable for every integer k<νk < \nu, with the RBF kernel recovered in the limit ν\nu \to \infty. Important special cases:

For economic applications where the target function may have kinks (e.g., due to occasionally binding constraints), the Matérn-3/2 kernel is often a better choice than the infinitely smooth RBF kernel, as it does not oversmooth near non-differentiable features Renner & Scheidegger, 2018.

9.3Bayesian Active Learning for Sample-Efficient Training

When model evaluations are expensive, we wish to select training points that provide maximal information. Bayesian Active Learning (BAL) uses the GP posterior variance to guide this selection. The information-theoretic foundations go back to MacKay (1992), with submodular guarantees for variance-based sensor placement established by Krause et al. (2008) and the widely used upper-confidence-bound formulation of Srinivas et al. (2010).

9.3.0.1Connection to Bayesian optimization (Chapter Chapter 4).

BAL is the active-learning twin of the Bayesian-optimization (BO) recipe introduced in Chapter Chapter 4: same GP surrogate, same acquisition-function machinery. The difference is the target. BO seeks a scalar optimum of the surrogate (e.g. Expected Improvement steers samples toward arg maxf^\argmax \hat f), so its acquisition trades exploration against the chance of beating the current best. BAL instead targets global function approximation: it allocates samples wherever posterior variance is largest, irrespective of the predicted value. Both reduce to the same primitive (fit GP, maximise acquisition, evaluate, refit), but the choice of acquisition reflects whether one wants the best point or the best surrogate.

The BAL algorithm concentrates training points near kinks, boundary layers, and other regions where the function is hardest to approximate, achieving the same accuracy as uniform sampling with far fewer evaluations Renner & Scheidegger, 2018. The exploration--exploitation trade-off controlled by (wobj,wvar)(w_{\mathrm{obj}}, w_{\mathrm{var}}) ensures that the algorithm balances refining the approximation in already well-sampled regions against exploring uncharted territory.

The intuition behind the acquisition function is as follows. The first term wobjμ(x)w_{\mathrm{obj}} \cdot \mu(\x) favors regions where the predicted function value is large (exploitation: sample where the function is interesting). The second term wvar2logσ2(x)\frac{w_{\mathrm{var}}}{2}\log\sigma^2(\x) favors regions of high uncertainty (exploration: sample where we know least). This acquisition function belongs to the Upper Confidence Bound (UCB) family Srinivas et al., 2010: with wobj=0w_{\mathrm{obj}} = 0 it has the same maximizers as pure posterior-variance sampling, while the logarithmic variance weighting provides a more conservative exploration bonus than the standard-deviation weighting used in GP-UCB when it is combined with exploitation. By adjusting wobjw_{\mathrm{obj}} and wvarw_{\mathrm{var}}, the practitioner can control the balance. For economic applications where the entire domain is relevant (e.g., approximating a policy function), a pure exploration strategy (wobj=0w_{\mathrm{obj}} = 0, wvar>0w_{\mathrm{var}} > 0) is often appropriate, reducing BAL to an uncertainty sampling scheme that minimizes the integrated posterior variance. Figure Figure 9.4 illustrates two BAL iterations on a 1D toy.

Bayesian Active Learning in action. (a) Starting from three initial observations (red dots), the GP posterior mean (blue line) deviates from the true function (dashed black) in the data-sparse region, where the 95% credible band (blue shading) is wide. (b) The acquisition function selects the point of maximum posterior variance (green diamond); after evaluation, the posterior tightens locally and the mean improves. (c) A second active-learning iteration fills the remaining gap. With only five strategically chosen points, the GP posterior closely tracks the true function across the entire domain.

Figure 9.4:Bayesian Active Learning in action. (a) Starting from three initial observations (red dots), the GP posterior mean (blue line) deviates from the true function (dashed black) in the data-sparse region, where the 95% credible band (blue shading) is wide. (b) The acquisition function selects the point of maximum posterior variance (green diamond); after evaluation, the posterior tightens locally and the mean improves. (c) A second active-learning iteration fills the remaining gap. With only five strategically chosen points, the GP posterior closely tracks the true function across the entire domain.

9.4When to Use GPs vs. DNNs

Now that the GP machinery (posterior inference, kernel design, and BAL) has been introduced, we can give a more detailed comparison than the overview in Table Table 10.1. Active subspaces, introduced in Section Section 9.5, are the main tool for pushing GP surrogates beyond the moderate-dimensional regime. Table Table 9.1 extends Table Table 10.1 with the GP-specific items (LOO diagnostics, marginal-likelihood Occam, BAL).

Table 9.1:GP and DNN surrogate trade-offs after the basic GP machinery has been introduced. GPs are sample-efficient and uncertainty-aware but expensive in the number of training points; DNNs scale to much larger datasets and dimensions but need more data and separate machinery for calibrated uncertainty.
\dagger Sparse-GP via inducing points reduces O(N3)\mathcal{O}(N^3) to O(Nm2+m3)\mathcal{O}(Nm^2 + m^3) for mNm \ll N inducing inputs Titsias, 2009Hensman et al., 2013; see the inducing-point remarkbox below.

CriterionGaussian ProcessesDeep Neural Networks
Training costO(N3)\mathcal{O}(N^3); exact for NN in the low thousands^\dagger^O(N)\mathcal{O}(N) per epoch; scales to N106N \sim 10^6
Prediction costO(N)\mathcal{O}(N) posterior mean, O(N2)\mathcal{O}(N^2) posterior variance, per test pointO(weights)\mathcal{O}(\text{weights}) per forward pass; independent of NN
Gradient accessclosed-form from posteriorautodiff (exact, any order)
Non-smooth featuresMatérn-3/23/2 adapts wellexcellent with enough data
High-dim. extensionactive subspaces (d10d \gg 10)native

Practical guidelines.

9.4.1GP Regression in Practice

In code, GP regression is a one-liner once the kernel is chosen. In scikit-learn the standard pattern is to assemble a kernel as the sum of an RBF (RBF(length_scale=...)) and a noise term (WhiteKernel(noise_level=...)), pass it to GaussianProcessRegressor, call .fit(X_train, y_train), and obtain posterior mean and standard deviation from .predict(X_test, return_std=True); the kernel hyperparameters (length_scale, noise_level, output amplitude) are optimized by maximising the marginal likelihood, with n_restarts_optimizer controlling robustness to local optima. The companion notebook 02_GP_and_BAL.ipynb provides a full worked example fitting noisy observations of sin(x)\sin(x) on [2,2][-2,2].

9.4.1.1Application: GP surrogates for option pricing.

GPs are particularly well suited as surrogates for derivative pricing models. For example, one can train a GP on as few as 5--50 Black--Scholes option prices (evaluated at different spot prices or parameter configurations) and obtain a surrogate that accurately reproduces the pricing surface with calibrated uncertainty bands. The posterior variance immediately quantifies the interpolation uncertainty at each query point. This idea extends naturally to stochastic volatility models such as Heston, where the analytical pricing formula is expensive to evaluate. Furthermore, because GP predictions are linear in the training targets, the uncertainty of a portfolio of GP-priced instruments propagates analytically: for a linear portfolio iwiV^i\sum_i w_i \hat{V}_i with vector of weights w\bm{w} and joint posterior covariance ΣV^\Sigma_{\hat{V}}, Var(wV^)=wΣV^w\mathrm{Var}(\bm{w}^\top \hat{V}) = \bm{w}^\top \Sigma_{\hat{V}} \bm{w}. When the surrogate errors are independent across instruments, ΣV^\Sigma_{\hat{V}} is diagonal with entries σi2\sigma_i^2 and the formula reduces to iwi2σi2\sum_i w_i^2 \sigma_i^2; otherwise the off-diagonal cross-instrument covariances must be retained, e.g. via a multi-output GP. Either way the assessment is instant.

Inducing-point intuition. Exact GP inference conditions on all n=21 observations and uses the full n\times n kernel matrix. Sparse variational GP methods introduce m\ll n inducing inputs \bm Z and approximate the posterior through the low-rank structure induced by K_{nm}K_{mm}^{-1}K_{mn}. The top panel compares the exact posterior mean and uncertainty band with a sparse approximation using m=4 pseudo-inputs; the bottom panel shows the absolute difference between the two means. In this smooth one-dimensional illustration the approximation is close, but its quality depends on m, the kernel hyperparameters, and the placement of \bm Z. The dominant training cost falls from \mathcal{O}(n^3) to \mathcal{O}(nm^2 + m^3); the variational formulation of  optimizes \bm Z jointly with the kernel hyperparameters.

Figure 9.5:Inducing-point intuition. Exact GP inference conditions on all n=21n=21 observations and uses the full n×nn\times n kernel matrix. Sparse variational GP methods introduce mnm\ll n inducing inputs Z\bm Z and approximate the posterior through the low-rank structure induced by KnmKmm1KmnK_{nm}K_{mm}^{-1}K_{mn}. The top panel compares the exact posterior mean and uncertainty band with a sparse approximation using m=4m=4 pseudo-inputs; the bottom panel shows the absolute difference between the two means. In this smooth one-dimensional illustration the approximation is close, but its quality depends on mm, the kernel hyperparameters, and the placement of Z\bm Z. The dominant training cost falls from O(n3)\mathcal{O}(n^3) to O(nm2+m3)\mathcal{O}(nm^2 + m^3); the variational formulation of Titsias (2009) optimizes Z\bm Z jointly with the kernel hyperparameters.

9.5Scaling GPs to High Dimensions: Active Subspaces

A common criticism of GP methods is that they are limited to moderate-dimensional problems (d10d \leq 10). While this is true for naïve implementations, where the number of training points needed to cover the input space grows exponentially in dd, Scheidegger & Bilionis (2019) show that active subspace methods can mitigate this barrier when the target function exhibits a clear spectral gap in the gradient outer-product matrix CC of (9.8) below, i.e., when its variation is concentrated on a low-dimensional linear subspace of the input space. If ff depends comparably on all dd coordinates, the spectral gap is absent and the technique offers little gain; sufficient conditions are in Constantine (2015). For a textbook treatment of the active subspace framework and references to its origins, see Constantine (2015); we summarize the key ideas and their application to dynamic economic models below.

9.5.0.1Intuition: why most directions don’t matter.

Consider a value function V(x)V(\x) in a dynamic stochastic economy with dd state variables. Although VV is formally a function of all dd inputs, it often responds primarily to a few linear combinations of them. For example, in a multi-country model, the value function may depend mostly on aggregate capital jkj\sum_j k^j and a measure of inequality Var(kj)\mathrm{Var}(k^j), rather than on each country’s capital individually. If we can identify these important directions, we can project the dd-dimensional input onto a much lower-dimensional subspace and fit the GP there, substantially mitigating the practical grid-based curse of dimensionality.

9.5.0.2The gradient outer product matrix.

The key diagnostic tool is the gradient outer product matrix Constantine, 2015. Given a function f ⁣:RdRf\colon \R^d \to \R with input distribution π(x)\pi(\x), define:

C=f(x)f(x)π(x)dx    Rd×d.C = \int \nabla f(\x) \, \nabla f(\x)^\top \, \pi(\x)\, d\x \;\in\; \R^{d \times d}.

The matrix CC is symmetric and positive semi-definite. Its (i,j)(i,j)-entry measures how much ff co-varies along input directions ii and jj, averaged over the input distribution. In particular:

9.5.0.3Worked toy example.

Take f(x1,x2)=x12f(x_1, x_2) = x_1^2 on xUnif([1,1]2)\x \sim \mathrm{Unif}([-1,1]^2). Then f=(2x1,0)\nabla f = (2x_1, 0)^\top, and integrating against the uniform measure gives C=diag(4/3,0)C = \mathrm{diag}(4/3,\,0). The eigenvalues are λ1=4/3\lambda_1 = 4/3 and λ2=0\lambda_2 = 0; the first eigenvector is (1,0)(1,0)^\top, the second is (0,1)(0,1)^\top; the spectral gap is infinite. The active subspace is the x1x_1-axis, and ff is recovered exactly as f=g(x1)f = g(x_1) with g(u)=u2g(u) = u^2. This is the cleanest instance of the projection in (9.9) below.

9.5.0.4Eigendecomposition and the spectral gap.

Let C=UΛUC = U \Lambda U^\top be the eigendecomposition, with eigenvalues λ1λ2λd0\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0. The eigenvectors reveal the directions of maximal average variation; the eigenvalues quantify how much variation each direction captures. Each diagonal entry of CC is the average squared sensitivity of ff along that input direction, so eigenvectors of CC are directions of correlated sensitivity: when the top mm eigenvalues capture most of the trace, ff varies primarily along their span and is approximately constant in the perpendicular (dm)(d-m)-dimensional complement. If there is a clear spectral gap, i.e., λmλm+1\lambda_m \gg \lambda_{m+1}, then the first mm eigenvectors Um=[u1,,um]U_m = [u_1, \ldots, u_m] span the active subspace, and ff is well approximated by a function of the reduced coordinates x~=UmxRm\tilde{\x} = U_m^\top \x \in \R^m alone.

f(x)    g(Umx),Um=[u1,,um].f(\x) \;\approx\; g(U_m^\top \x), \qquad U_m = [u_1,\ldots,u_m].
Spectral decay of the active-subspace eigenvalues for a schematic example with d=12 state variables. The first three eigenvalues are orders of magnitude larger than the rest; the dashed red line marks the spectral gap after \lambda_3, which indicates that f effectively lives on a 3-dimensional active subspace (green brace) and reduces the GP regression problem from \R^{12} to \R^3. The remaining nine directions form the inactive subspace (gray brace), along which f is nearly constant on average.

Figure 9.6:Spectral decay of the active-subspace eigenvalues for a schematic example with d=12d=12 state variables. The first three eigenvalues are orders of magnitude larger than the rest; the dashed red line marks the spectral gap after λ3\lambda_3, which indicates that ff effectively lives on a 3-dimensional active subspace (green brace) and reduces the GP regression problem from R12\R^{12} to R3\R^3. The remaining nine directions form the inactive subspace (gray brace), along which ff is nearly constant on average.

9.5.0.5Computing the gradient.

In practice, the integral in (9.8) is estimated from a finite sample of gradient evaluations Constantine, 2015 (this is a one-time pre-processing cost, O(Ngd)\mathcal{O}(N_g\, d) with Ng2dN_g \sim 2d--10d10d, far smaller than the surrogate training set of size NN):

C^=1Ngi=1Ngf(xi)f(xi),xiπ(x).\hat{C} = \frac{1}{N_g}\sum_{i=1}^{N_g} \nabla f(\x_i)\,\nabla f(\x_i)^\top, \qquad \x_i \sim \pi(\x).

The gradient f(xi)\nabla f(\x_i) can be obtained via:

The eigendecomposition of C^\hat{C} is a standard d×dd \times d matrix operation and is cheap relative to the model evaluations. When {xi}\{\x_i\} are i.i.d. from π\pi, C^C\hat{C} \to C almost surely as NgN_g \to \infty by the law of large numbers, so the eigenvectors of C^\hat{C} are consistent estimates of the population active subspace; finite-sample bias is documented in Constantine (2015).

9.5.0.6Practical workflow.

The full dimension-reduction pipeline consists of three steps:

  1. Gradient sampling: Draw NgN_g points xiπ(x)\x_i \sim \pi(\x) and compute f(xi)\nabla f(\x_i) at each. Form the empirical estimate C^\hat{C} via (9.10).

  2. Eigendecomposition and dimension selection: Compute C^=U^Λ^U^\hat{C} = \hat{U}\hat{\Lambda}\hat{U}^\top. Inspect the eigenvalue decay, as in Figure Figure 9.6, and choose mm at the spectral gap.

  3. GP in reduced space: Project all training inputs onto the active subspace: x~i=U^mxiRm\tilde{\x}_i = \hat{U}_m^\top \x_i \in \R^m. Fit a standard GP on the mm-dimensional projected data.

When combined with BAL, the GP adaptively selects training points in the reduced space, further improving sample efficiency.

Linear active-subspace pipeline. Gradient samples identify the dominant eigenspace of the gradient outer-product matrix, all simulator inputs are projected to the reduced coordinates \tilde{\x}=U_m^\top \x, and the GP/BAL loop is then run in the low-dimensional active subspace.

Figure 9.7:Linear active-subspace pipeline. Gradient samples identify the dominant eigenspace of the gradient outer-product matrix, all simulator inputs are projected to the reduced coordinates x~=Umx\tilde{\x}=U_m^\top \x, and the GP/BAL loop is then run in the low-dimensional active subspace.

Figure Figure 9.7 summarizes the linear active-subspace workflow used before fitting the GP.

9.5.0.7Application to dynamic stochastic economies.

Scheidegger & Bilionis (2019) apply this pipeline to high-dimensional dynamic programming problems arising in neoclassical growth models. Their key findings:

9.5.0.8Anchor on the multi-sector growth benchmark.

In the multi-sector neoclassical growth benchmark of Scheidegger & Bilionis (2019), the active-subspace dimension collapses to m=1m = 1 even at D=500D = 500, so the entire ASGP pipeline operates on a one-dimensional projected coordinate. On models with other configurations the active subspace can be larger; the spectral-gap test is what decides.

9.5.1Nonlinear Generalization: Deep Active Subspaces

The linear pipeline of the previous section assumes that the important directions are linear combinations of the input variables. That is an inductive bias: the first mm columns of UU always span a linear subspace of Rd\R^d, so if the response ff actually varies along a curved low-dimensional manifold (a ridge, a product of two linear features, a radial coordinate, a hidden regime indicator), the linear projection has to carry enough directions to cover the whole manifold, not the manifold itself. Concretely, if f(ξ)=φ ⁣((w1ξ)2+(w2ξ)2)f(\xi) = \varphi\!\bigl((w_1^\top \xi)^2 + (w_2^\top \xi)^2\bigr) is a radial function of two linear features, the gradient f=2φ()(s1w1+s2w2)\nabla f = 2\,\varphi'(\cdot)\,(s_1 w_1 + s_2 w_2) spans a two-dimensional space when averaged over ξ\xi, so linear AS returns m=2m = 2; yet the intrinsic “interesting” coordinate is the scalar aggregate r2=s12+s22r^2 = s_1^2 + s_2^2, which is one-dimensional. Tripathy & Bilionis (2018) remove the linearity bias by replacing the linear projection with a learned nonlinear encoder and thereby recover the one-dimensional structure.

9.5.1.1The deep active-subspace ansatz.

Tripathy and Bilionis parametrize the surrogate as a composition of two small neural networks,

f^(ξ)  =  g(h(ξ)),h ⁣:RD ⁣ ⁣Rd,g ⁣:Rd ⁣ ⁣R,\hat f(\xi) \;=\; g\bigl(h(\xi)\bigr), \qquad h\colon \R^D \!\to\! \R^d, \qquad g\colon \R^d \!\to\! \R,

where hh is a multilayer perceptron (MLP) that plays the role of the projection matrix UmU_m^\top in the linear case, and gg is a second MLP that plays the role of the GP link. Setting h(ξ)=Umξh(\xi) = U_m^\top \xi and taking gg to be a polynomial recovers the linear active-subspace surrogate (9.9), so (9.11) generalizes the linear-AS ansatz rather than the gradient covariance matrix (9.8) itself. The two networks are trained jointly to minimize the residual f^(ξi)yi\hat f(\xi_i) - y_i on an input-output sample, so that the bottleneck h(ξ)Rdh(\xi) \in \R^d adapts to the specific response surface rather than being fixed in advance by the spectrum of CC.

9.5.1.2Architecture choices.

Three design choices make (9.11) trainable with a few hundreds of samples:

dk  =  Deηwk,ηw=1Llog(d/D),k=0,1,,L,d_k \;=\; \bigl\lceil D \, e^{\eta_{\mathrm{w}} k} \bigr\rceil, \qquad \eta_{\mathrm{w}} = \tfrac{1}{L}\log(d/D), \qquad k = 0, 1, \ldots, L,

so that the bottleneck closes smoothly from DD to dd without a brittle hyperparameter choice. For D=20,L=3,d=1D = 20,\, L = 3,\, d = 1 this gives widths [20,8,3,1][20, 8, 3, 1]; for d=2d = 2, widths [20,10,5,2][20, 10, 5, 2] -- a recipe rather than a search.

σ(z)  =  z1+eγz,γ=1,\sigma(z) \;=\; \frac{z}{1 + e^{-\gamma z}}, \qquad \gamma = 1,

which is smooth everywhere (unlike ReLU) and non-saturating (unlike tanh\tanh); smoothness matters because we will want to differentiate through f^\hat f for sensitivity analysis and because the elastic-net penalty below otherwise has to fight the sharp corners of ReLU.

L(θ)  =  1Ni=1N(f^θ(ξi)yi)2  +  λ1θ1  +  λ2θ22,\mathcal{L}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^N \bigl(\hat f_\theta(\xi_i) - y_i\bigr)^2 \;+\; \lambda_1 \lVert \theta \rVert_1 \;+\; \lambda_2 \lVert \theta \rVert_2^2,

with small λ1,λ2\lambda_1, \lambda_2; the 1\ell_1 term encourages sparse weights and input usage, the 2\ell_2 term controls the overall smoothness of f^θ\hat f_\theta and prevents the loss landscape from becoming pathological as NN shrinks.

Crucially, no gradient samples of ff are required: θ\theta is learned from the input-output pairs {(ξi,yi)}\{(\xi_i, y_i)\} alone. The orthogonality constraint UU=IU^\top U = I that is implicit in the linear case becomes unnecessary, because the learned encoder is free to pick any smooth low-dimensional parameterization of the active manifold.

9.5.1.3Choosing the latent dimension dd.

The spectral gap of CC is no longer available -- the encoder is nonlinear -- so dd is chosen by a validation-MSE elbow: hold out an independent fraction of the sample, train a small family of models with d=1,2,3,d = 1, 2, 3, \ldots, and pick the smallest dd beyond which held-out error no longer drops significantly. An operational rule of thumb is to stop at the first dd for which the MSE improvement from dd to d+1d+1 is less than a factor of two: smaller gains are typically driven by optimization slack, not by new latent structure. On curved problems this elbow lies strictly below the linear-AS spectral gap: the deep encoder collapses two linear features into a single nonlinear aggregate (notebook 09_Deep_Active_Subspace_Ridge gives a reproducible instance in D=20D = 20). On nearly-linear problems the two criteria agree qualitatively, and at small training-set sizes a polynomial link on top of a two-dimensional linear AS can in fact be more data-efficient than the deep encoder (notebook 10_Deep_AS_vs_Linear_AS_Borehole, the canonical borehole benchmark with D=8D = 8 and N=500N = 500). Figure Figure 9.8 contrasts the elbow rule with the linear-AS spectral gap on the radial-ridge target.

Stylized comparison of the two selection criteria for the radial-ridge target y(\xi) = \exp(-[(w_1^\top\xi)^2 + (w_2^\top\xi)^2]) in D = 20 (see notebook 09). The linear-AS eigenvalue spectrum has two dominant directions, so the spectral-gap criterion picks d = 2 (right green dashed line). The deep-AS validation MSE, by contrast, is already at its plateau at d = 1 (left green dashed line): the learned encoder h_\theta represents the nonlinear aggregate r^2 = s_1^2 + s_2^2 as a scalar. The curves are stylized; the orders of magnitude track the notebook (linear-AS eigenvalues \approx 0.16 for i = 1, 2 then a sharp drop; deep-AS validation MSE \approx 10^{-4} at d = 1 and roughly flat thereafter), and the elbow-at-d=1 versus spectral-gap-at-d=2 contrast is reproduced.

Figure 9.8:Stylized comparison of the two selection criteria for the radial-ridge target y(ξ)=exp([(w1ξ)2+(w2ξ)2])y(\xi) = \exp(-[(w_1^\top\xi)^2 + (w_2^\top\xi)^2]) in D=20D = 20 (see notebook 09). The linear-AS eigenvalue spectrum has two dominant directions, so the spectral-gap criterion picks d=2d = 2 (right green dashed line). The deep-AS validation MSE, by contrast, is already at its plateau at d=1d = 1 (left green dashed line): the learned encoder hθh_\theta represents the nonlinear aggregate r2=s12+s22r^2 = s_1^2 + s_2^2 as a scalar. The curves are stylized; the orders of magnitude track the notebook (linear-AS eigenvalues 0.16\approx 0.16 for i=1,2i = 1, 2 then a sharp drop; deep-AS validation MSE 104\approx 10^{-4} at d=1d = 1 and roughly flat thereafter), and the elbow-at-d=1d=1 versus spectral-gap-at-d=2d=2 contrast is reproduced.

9.5.1.4Training recipe.

A practical recipe that reproduces the experiments in notebooks 09 and 10:

  1. Sample NN input-output pairs (ξi,yi)(\xi_i, y_i); split 80/2080/20 into train and validation sets. Standardize inputs to unit variance and center outputs.

  2. For each candidate d{1,2,3,}d \in \{1, 2, 3, \ldots\}: build hθh_\theta with widths (9.12) and gθg_\theta with two hidden layers (16--32 units); train on loss (9.14) with Adam (lr=5×103\text{lr} = 5 \times 10^{-3}), a cosine learning-rate schedule over 103--2×1032 \times 10^3 epochs, and λ1=105,λ2=104\lambda_1 = 10^{-5},\, \lambda_2 = 10^{-4}.

  3. Record the validation MSE, apply the elbow rule, and deploy f^θ\hat f_\theta with the chosen dd.

Sample-budget rule of thumb: N50dnlN \approx 50\, d_{\mathrm{nl}} to find the bottleneck, inflated to N200dnlN \approx 200\, d_{\mathrm{nl}} for a deployment surrogate. Two post-training sanity checks are worth doing: (i) the validation curve should be monotone-then-flat in dd; (ii) a scatter of hθ(ξi)h_\theta(\xi_i) against the top linear-AS coordinate U1ξiU_1^\top \xi_i reveals whether the encoder has actually gone nonlinear (a non-monotone relation is the fingerprint).

Deep active-subspace pipeline. Input--output pairs (\xi_i, y_i) are drawn directly from the simulator (no gradient samples needed); a nonlinear encoder h_\theta compresses the high-dimensional input into a d-dimensional latent code, a small link network g_\theta maps the latent code to the response, and an elastic-net / validation-MSE elbow chooses d. The trained composition \hat f_\theta = g_\theta \circ h_\theta is the deployed surrogate. Compared with the linear active-subspace pipeline (Figure ), the encoder + link boxes replace the eigendecomposition + linear projection, the elbow box replaces the spectral-gap search, and the gradient-sampling step disappears entirely.

Figure 9.9:Deep active-subspace pipeline. Input--output pairs (ξi,yi)(\xi_i, y_i) are drawn directly from the simulator (no gradient samples needed); a nonlinear encoder hθh_\theta compresses the high-dimensional input into a dd-dimensional latent code, a small link network gθg_\theta maps the latent code to the response, and an elastic-net / validation-MSE elbow chooses dd. The trained composition f^θ=gθhθ\hat f_\theta = g_\theta \circ h_\theta is the deployed surrogate. Compared with the linear active-subspace pipeline (Figure Figure 9.7), the encoder + link boxes replace the eigendecomposition + linear projection, the elbow box replaces the spectral-gap search, and the gradient-sampling step disappears entirely.

Figure Figure 9.9 should be compared with the linear pipeline in Figure Figure 9.7: the two blue encoder / link boxes replace “eigendecomposition + project”, the “elastic-net + elbow” box replaces “find spectral gap”, and the gradient-sampling step is gone entirely. The economic pay-off is the same, a cheap surrogate in dd variables where the original has DD, but the modeling assumption is weaker: curved active manifolds are now admissible.

9.5.1.5When is the extra machinery worth it?

Deep AS pays off when (i) no gradient samples are available (e.g., black-box simulators, noisy observations from a lab experiment, or calibration targets returned by a proprietary solver), (ii) the spectral gap of CC is ambiguous or the response surface genuinely depends on nonlinear aggregates of the inputs (ridges, radial coordinates, thresholded features, piecewise regimes), or (iii) the input dimension is large (D10D \gg 10) and a linear projection is too restrictive to capture the active manifold. When none of these conditions apply, linear AS plus a polynomial or a GP link is usually more data-efficient and should be fit first as a baseline. The borehole experiment in notebook 10 is the honest diagnostic: on a function this close to a d=2d = 2 ridge, the two curves cross, and a cubic polynomial on two linear features beats the deep encoder at all d2d \ge 2. Deep AS’s advantage materializes precisely at d=1d = 1, where a single linear feature cannot span the active direction. This perspective also connects naturally to deep kernel learning (Section Section 9.7), which composes a learned feature map with a GP head; deep AS is the same idea for the feature map, with an MLP link in place of the GP.

9.5.1.6Applications in economics and finance.

The deep-AS architecture is particularly attractive in settings where the state variable is high-dimensional but the economic mechanism acts through a few aggregate quantities that are not linear in the state. Examples include (i) heterogeneous-agent models whose cross-sectional distribution of wealth enters the equilibrium law of motion only through a few moments (mean, dispersion, tail mass), the Krusell-Smith aggregation logic of Chapter Chapter 6, where the map from the raw distribution to those moments is a learned, nonlinear encoder; (ii) multi-region climate-economy integrated assessment models with hundreds of regional capital stocks, in which damages and optimal carbon taxes depend on nonlinear aggregates of the underlying state (total radiative forcing, regional inequality indices); and (iii) dynamic games and mean-field games in which the relevant action depends on a curved functional of the opponent distribution. In each case the original space is too large for a direct GP and the linear active subspace is too rigid, while the deep encoder provides a trainable compromise. Bilionis & Zabaras (2016) and Tripathy & Bilionis (2018) provide the general UQ and deep-active-subspace methodology; for integrated assessment models specifically, Friedl et al. (2023) is the economics-side reference used in this course.

9.6Dynamic Programming with Gaussian Processes

Returning to the second oracle announced at the start of this chapter and developed formally in Section 9.6.1 below, we now embed the GP and active-subspace machinery developed above into a value function iteration (VFI) algorithm, with the Bellman operator TVTV playing the role of the expensive label generator. The acronym ASGP used in the section titles and figure captions below stands for Active-Subspace Gaussian Process: the dimensionality reducer of Section 9.5 stacked under the GP of Section 9.2.

The idea of representing the value function as a GP inside a DP recursion was introduced under the name Gaussian process dynamic programming (GPDP) by Deisenroth et al. (2009), who used it for continuous-state, continuous-action optimal control problems and combined it with a variance-based active selection of support points. Closely related work Engel et al., 2005 embeds GPs into temporal-difference learning, replacing tabular value functions with a GP posterior. These ideas later inspired model-based policy search methods such as PILCO Deisenroth & Rasmussen, 2011, which propagate GP uncertainty through long horizons. In economics, Scheidegger & Bilionis (2019) show that pairing this paradigm with active subspaces and Bayesian Gaussian mixture models for the ergodic set scales the framework to economies with up to 500 continuous state variables; Renner & Scheidegger (2018) apply it to dynamic incentive problems, Gaegauf et al. (2023) to dynamic portfolio choice, and Chen et al. (2025) to dynamic private asset allocation. The remainder of this section presents the algorithm in this combined form.

9.6.1Why GPs for DP: a Supervised-Learning View of the Bellman Operator

Before stating the algorithm, it helps to see VFI through a supervised-learning lens. At iteration ss, given an incumbent value function Vs1V^{s-1}, we generate a training set

Ds={(x(i),tis)}i=1ns,tis=(TVs1)(x(i)),\mathcal{D}^s = \bigl\{(\x^{(i)},\, t_i^s)\bigr\}_{i=1}^{n^s}, \qquad t_i^s = (TV^{s-1})(\x^{(i)}),

and fit a regressor Vs(TVs1)V^s \approx (TV^{s-1}) to it. The label tist_i^s is not free. Each evaluation of the Bellman operator at a state x(i)\x^{(i)} requires solving a constrained nonlinear program over controls ξ\xi subject to the law of motion and any feasibility / market-clearing constraints; quadrature over x\x' is taken inside. In textbook problems each oracle call costs 10-2--100 seconds, and in higher-dimensional models with adjustment costs, occasionally binding constraints, or aggregator nonlinearities, it can run into minutes. Because the calls are independent across x(i)\x^{(i)}, the design is embarrassingly parallel Scheidegger & Bilionis, 2019, but each individual label remains expensive.

This is exactly the regime where Gaussian processes shine. Three properties make them the natural surrogate class:

  1. Sample efficiency under an expensive oracle. The marginal-likelihood Occam’s razor of Section 9.2 delivers calibrated fits at n102n \sim 10^2--103 design points, well below the n104n \gg 10^4 regime in which deep-network surrogates start to dominate (Table Table 9.3).

  2. Built-in uncertainty quantification. The GP posterior variance σGP2(x)\sigma_\mathrm{GP}^2(\x) tells us, at every state, how much the current surrogate trusts its own prediction. This is the input that turns a passive interpolant into an adaptive one (Section 9.6.6).

  3. Cheap held-out diagnostics. The leave-one-out predictive error is available in closed form from the same Cholesky factor used for posterior inference (Section 9.6.5), so we can monitor surrogate health every iteration without spending oracle calls on a held-out split.

The same logic applies, with a different oracle, in the structural estimation chapter: Chapter Chapter 10 stacks a GP over the moment map m(θ)m(\theta) where each label is a forward simulation plus moment computation under a candidate parameter, again expensive to evaluate and cheap to store. Both settings are instances of one pattern: supervised regression on the output of a costly numerical procedure. The next subsection writes down the formal DP problem; the subsequent subsections develop the GP-based interpolant and the two ingredients (LOO and active learning) that make it sample-efficient at modest design size.

9.6.2The Dynamic Programming Problem

Consider an infinite-horizon, discrete-time stochastic optimal decision problem. A representative agent chooses a sequence of controls {ξt}t=0\{\xi_t\}_{t=0}^\infty to maximize the expected discounted sum of returns:

V(x0)=max{ξt}E0t=0βtr(xt,ξt),V(\x_0) = \max_{\{\xi_t\}} \mathbb{E}_0 \sum_{t=0}^\infty \beta^t\, r(\x_t, \xi_t),

subject to the law of motion xt+1F(xt,ξt)\x_{t+1} \sim F(\cdot\,|\,\x_t, \xi_t), where xtXRD\x_t \in \mathcal{X} \subset \R^D is the state, ξtΛ(xt)\xi_t \in \Lambda(\x_t) is the control, rr is the per-period return function, and β(0,1)\beta \in (0,1) is the discount factor.

By the principle of optimality Bellman, 1957, the infinite-dimensional problem (9.16) reduces to the Bellman equation:

V(x)=maxξΛ(x){r(x,ξ)+βE[V(x)]},V(\x) = \max_{\xi \in \Lambda(\x)} \bigl\{ r(\x, \xi) + \beta\, \mathbb{E}\bigl[V(\x')\bigr] \bigr\},

where the expectation is over the stochastic transition. The Bellman operator TT maps value functions to value functions:

(TV)(x)=maxξΛ(x){r(x,ξ)+βE[V(x)]}.(TV)(\x) = \max_{\xi \in \Lambda(\x)} \bigl\{ r(\x, \xi) + \beta\, \mathbb{E}\bigl[V(\x')\bigr] \bigr\}.

Under standard regularity conditions (bounded returns, β<1\beta < 1, monotonicity, discounting), TT is a contraction mapping on the Banach space of bounded continuous functions with sup-norm, with modulus β\beta. By the Banach fixed-point theorem (Appendix Appendix D), TT admits a unique fixed point VV^*, and iterating Vs+1=TVsV^{s+1} = TV^s from any initial guess V0V^0 converges geometrically: VsVβsV0V\|V^s - V^*\|_\infty \le \beta^s \|V^0 - V^*\|_\infty. The classical references in economics are Stokey et al. (1989) [Chs. 3--4] for the formal contraction-and-fixed-point theory of the Bellman operator and Judd (1998) [Ch. 12] for the numerical implementation in continuous-state settings; Ljungqvist & Sargent (2018) [Chs. 3--4] and Sargent & Stachurski (2026) provide modern macroeconomic treatments. In the operations-research and optimal-control literature, Bertsekas (2000) is the canonical reference, with extensive coverage of approximate dynamic programming. The contraction modulus β\beta applies to the exact Bellman operator; convergence of the GP-fitted iterates additionally requires controlling the per-step interpolation error of the surrogate, otherwise the approximate operator can fail to contract globally even though the exact one does. This is the standard textbook VFI loop: guess a value function, apply the Bellman operator, interpolate the updated values, and repeat until convergence. In the present chapter, the only change is the interpolant: a GP replaces the grid or polynomial basis when the state space is irregular or moderately high-dimensional. For a comprehensive treatment of dynamic programming in economics, see also the open-source QuantEcon lectures.[1]

9.6.2.1The computational challenge.

At every VFI iteration, we must approximate Vs+1V^{s+1} as a function of the full state vector xRD\x \in \R^D. Traditional approaches (finite grids, Smolyak sparse grids, tensor-product polynomial bases) suffer from the curse of dimensionality: the number of grid points grows exponentially in DD, and they require hypercubic state-space geometries. For D>20D > 20, these methods become infeasible.

9.6.3The Stochastic Optimal Growth Model

The workhorse test case for GP-based dynamic programming is the multi-sector stochastic optimal growth model. Assume DD sectors, each with a capital stock kjk_j, so the state is k=(k1,,kD)R+D\bm{k} = (k_1, \ldots, k_D) \in \R^D_+. A representative household chooses consumption c\bm{c}, labor supply l\bm{l}, and investment i\bm{i} to maximize:

V0(k0)=max{ct,lt,it}E0t=0βtu(ct,lt),V_0(\bm{k}_0) = \max_{\{\bm{c}_t, \bm{l}_t, \bm{i}_t\}} \mathbb{E}_0 \sum_{t=0}^\infty \beta^t\, u(\bm{c}_t, \bm{l}_t),

subject to the capital law of motion kj,t+1=(1δ)kj,t+ij,tk_{j,t+1} = (1-\delta)\,k_{j,t} + i_{j,t}, a sector-by-sector resource constraint cj,t+ij,t+Γj,t=exp(zj,t)Akj,tψlj,t1ψc_{j,t} + i_{j,t} + \Gamma_{j,t} = \exp(z_{j,t})\,A\,k_{j,t}^\psi\,l_{j,t}^{1-\psi} with convex adjustment cost Γj,t\Gamma_{j,t}, and Cobb--Douglas production f(kj,lj)=Akjψlj1ψf(k_j, l_j) = A\,k_j^\psi\,l_j^{1-\psi}. Productivity shocks zj,tz_{j,t} follow a stationary process and the dimension DD can be scaled from 1 (the textbook Brock--Mirman model of Chapter Chapter 2) to 500 without changing the algorithmic structure.

9.6.4GP-Based Value Function Iteration (ASGP)

The key idea, common to Deisenroth et al. (2009) in the control literature and Scheidegger & Bilionis (2019) in the economics literature, is to use a Gaussian process as the interpolation scheme inside VFI, replacing grid-based methods entirely. At each VFI step ss:

  1. Generate nsn^s training inputs X={x(1),,x(ns)}[k,kˉ]D\X = \{\x^{(1)}, \ldots, \x^{(n^s)}\} \subset [\underline{\bm{k}}, \bar{\bm{k}}]^D.

  2. Evaluate the Bellman operator at each training point: tis=(TVs1)(x(i))t_i^s = (TV^{s-1})(\x^{(i)}).

  3. Learn a GP (or ASGP) surrogate VsurrV_\mathrm{surr} from the training data {X,t}\{\X, \bm{t}\}.

  4. Set Vs=VsurrV^s = V_\mathrm{surr} (the GP posterior mean).

  5. Compute the convergence error es=VsVs1/ΔVe^s = \|V^s - V^{s-1}\|_\infty / \Delta_V, where ΔV\Delta_V is a normalizing value-function range.

  6. If es<eˉe^s < \bar{e}, stop; otherwise continue.

9.6.4.1Advantages over grid-based methods.

This GP-VFI approach offers several structural advantages:

9.6.4.2Parallelization.

The most expensive part of the algorithm, evaluating the Bellman operator at nsn^s training points, is embarrassingly parallel: each evaluation is independent. In the MPI implementation of Scheidegger & Bilionis (2019), the current value function surrogate is broadcast to all workers, each worker evaluates the Bellman operator at ns/ncpun^s/n_\mathrm{cpu} points, and the results are gathered at the master for GP fitting. Communication cost is negligible relative to the Bellman evaluations.

9.6.5Leave-One-Out Error: a Held-out Diagnostic for Free

How do we know that the GP surrogate at iteration ss is good enough to take another Bellman step? The marginal-likelihood objective of Section 9.2 fits the kernel, but it does not, on its own, certify pointwise predictive accuracy on the design. A held-out validation split would, but every held-out point is one fewer expensive Bellman label going into the GP.

The standard escape route in GP regression is the leave-one-out (LOO) predictive error, and it has the pleasant feature that for a Gaussian process it admits a closed form using the same Cholesky factor already computed for the posterior Rasmussen & Williams, 2005:

μi(x(i))yi=αi[Ky1]ii,σi2(x(i))=1[Ky1]ii,\mu_{-i}(\x^{(i)}) - y_i = -\frac{\alpha_i}{\bigl[K_y^{-1}\bigr]_{ii}}, \qquad \sigma_{-i}^2(\x^{(i)}) = \frac{1}{\bigl[K_y^{-1}\bigr]_{ii}},

where μi\mu_{-i} and σi2\sigma_{-i}^2 denote the posterior mean and variance after removing the ii-th observation, and α=Ky1(yμX)\alpha = K_y^{-1}(\bm y-\bm\mu_X) (for centered outputs, μX=0\bm\mu_X=0). Since Ky1K_y^{-1} is recovered from the Cholesky factor of KyK_y in O(n2)\mathcal{O}(n^2) once the factorisation has been done, computing the full nn-vector of LOO residuals is essentially free relative to the O(n3)\mathcal{O}(n^3) already paid for posterior inference.

9.6.5.1What the LOO RMSE tells us.

Tracking

LOO-RMSE(Ds)  =  1nsi=1ns(μi(x(i))tis)2\mathrm{LOO\text{-}RMSE}(\mathcal{D}^s) \;=\; \sqrt{\frac{1}{n^s}\sum_{i=1}^{n^s}\bigl(\mu_{-i}(\x^{(i)}) - t_i^s\bigr)^2}

across VFI iterations is a cheap surrogate-health metric that is independent of the Bellman residual:

The notebook 04_GP_Value_Function_Iteration.ipynb computes (9.20) via scipy.linalg.cho_solve (function gp_loo_rmse) and reports it alongside the Bellman residual at every VFI iteration, separating these two failure modes. The same diagnostic reappears in Section 10.1.4 for the GP layer over the SMM moment map.

9.6.6Active Learning Inside the VFI Loop

The Bayesian active-learning machinery of Section 9.3 carries over almost verbatim once the VFI loop is in place, but with one important adjustment: the goal is now uniform interpolation accuracy of the value function on the relevant state-space region, not maximisation of a payoff or minimisation of a loss. The right acquisition function is therefore pure exploration, the GP posterior standard deviation, rather than a UCB or expected-improvement criterion that trades off exploitation and exploration:

xnextarg maxxXcandσGPs(x).\x^{\mathrm{next}} \in \argmax_{\x \in \mathcal{X}^\mathrm{cand}} \sigma_\mathrm{GP}^s(\x).

A practical implementation, used in notebook 04_GP_Value_Function_Iteration.ipynb, runs the VFI iterations with a frozen design for a few steps, then enriches the design every NN iterations:

  1. Evaluate σGPs\sigma_\mathrm{GP}^s on a dense candidate set Xcand\mathcal{X}^\mathrm{cand} (a Latin-hypercube draw over the current state-space region).

  2. Pick the top-naddn_\mathrm{add} candidates by posterior standard deviation, subject to a minimum-spacing constraint x(new)x(j)δspacing\|\x^{(\mathrm{new})} - \x^{(j)}\| \ge \delta_\mathrm{spacing} against existing design points to avoid clustering.

  3. Evaluate the Bellman operator at the new points (one expensive oracle call each, in parallel) and append them to Ds\mathcal{D}^s.

  4. Refit the GP and continue iterating.

9.6.6.1Why pure exploration here.

A UCB-style acquisition would bias the design toward states with high value, which is not what we want when the surrogate is a building block of an iteration. We want the GP posterior to be uniformly tight wherever the Bellman operator might be evaluated, so that the contraction modulus of the approximate operator stays close to β\beta. This is structurally different from the optimisation setting of Bayesian optimisation, where exploitation is a feature.

9.6.6.2Empirical impact.

The companion notebook compares a same-budget fixed Latin-hypercube design with an active design inside the one-dimensional GP-VFI loop. Both designs use Bellman labels; the active design starts from a small initial set and adds states by maximising the GP posterior standard deviation (9.22) subject to a spacing rule. At the same final number of labels, the active design lowers posterior uncertainty and achieves a comparable or smaller dense-grid Bellman residual (Figure Figure 9.10). The figure is one-dimensional by design: the goal is to show active enrichment inside a genuine Bellman iteration, not to use a separable interpolation toy as a proxy for multidimensional dynamic programming.

Same-budget active enrichment inside one-dimensional GP value-function iteration. The left panel compares the GP posterior means from a fixed Latin-hypercube design and an active design against a reference GP-VFI solution. The middle panel shows the posterior standard deviation and the states added by the active rule , marked as triangles on the horizontal axis. The right panel reports the dense-grid Bellman residual. Unlike the previously used two-dimensional separable interpolation benchmark, every plotted training value here is generated by a Bellman maximization. Generated by notebook lecture_14_04_GP_Value_Function_Iteration.ipynb.

Figure 9.10:Same-budget active enrichment inside one-dimensional GP value-function iteration. The left panel compares the GP posterior means from a fixed Latin-hypercube design and an active design against a reference GP-VFI solution. The middle panel shows the posterior standard deviation and the states added by the active rule (9.22), marked as triangles on the horizontal axis. The right panel reports the dense-grid Bellman residual. Unlike the previously used two-dimensional separable interpolation benchmark, every plotted training value here is generated by a Bellman maximization. Generated by notebook lecture_14_04_GP_Value_Function_Iteration.ipynb.

9.6.7Results: From One to 500 Dimensions

The companion notebook deliberately stops at the one-dimensional Bellman problem, where every numerical object can be inspected directly. The high-dimensional claims below come from the ASGP implementation of Scheidegger & Bilionis (2019), where the GP operates on an active subspace rather than on the full tensor-product state space; we report their findings as a literature summary, not as something the notebook itself demonstrates.

Scheidegger & Bilionis (2019) apply the ASGP-VFI algorithm to the stochastic optimal growth model (9.19) across a range of dimensions, from D=1D = 1 (the textbook Brock--Mirman case) to D=500D = 500 continuous state variables. The key findings are:

  1. Convergence is dimension-independent in the active subspace: all models converge to a normalized error below 10-4 within approximately 70 VFI iterations regardless of DD, where the iteration count is dimension-independent in the mm-dimensional active subspace on which the GP actually operates rather than in the full ambient RD\R^D.

  2. Low-dimensional structure: the active subspace dimension is m=1m = 1 in all cases tested, so the value function depends on a single linear combination of the DD capital stocks, confirming that the high-dimensional problem has intrinsic low-dimensional structure.

  3. Sub-exponential scaling: CPU time grows sub-exponentially with DD (concave on a log scale when parallelized across compute nodes); on the benchmarks reported in Scheidegger & Bilionis (2019), D=500D = 500 is solved at sub-exponential cost in DD when distributed across a compute cluster. This avoids the explicit grid-based curse but does not eliminate all high-dimensional sample/optimization costs.

  4. Beyond the reach of grids: grid-based methods such as Smolyak sparse grids cannot handle D>100D > 100, making ASGP one demonstrated grid-free approach for models at this scale; neural residual methods such as DEQNs are complementary alternatives when the equilibrium conditions are easier to express directly.

9.6.8Uncertainty Quantification and Ergodic Sets

9.6.8.1Free UQ from the GP posterior.

Because the value function at each VFI iteration is a GP, the posterior variance σ2(x)\sigma^2(\x) provides a built-in measure of approximation uncertainty. This gives pointwise credible bands for the interpolated value function essentially for free. Policy uncertainty is not automatic in the same sense: it must be propagated through the Bellman maximization, for example by evaluating policies under posterior draws or local approximations to the GP posterior.

9.6.8.2Parameter uncertainty propagation.

Uncertain model parameters can be treated as additional pseudo-states. For example, extending the state to S~=(k,γ)\tilde{S} = (\bm{k}, \gamma) where γ[γ,γˉ]\gamma \in [\underline{\gamma}, \bar{\gamma}] is the risk aversion parameter allows solving the model once on the extended state space and then extracting univariate effects and global sensitivity indices in a single computation, avoiding the traditional approach of re-solving the model for each parameter value. Harenberg et al. (2019) give a general introduction to uncertainty quantification for economic models, covering univariate-effect plots and Sobol’ sensitivity indices.

9.6.8.3Learning ergodic sets.

Many economic models have irregularly shaped ergodic sets (ellipsoids or manifolds, not hypercubes). Scheidegger & Bilionis (2019) propose learning the ergodic distribution via Bayesian Gaussian mixture models: (i) solve the model on a large initial domain; (ii) simulate the economy forward to generate capital paths {kt+}\{\bm{k}_t^+\}; (iii) fit a mixture of Gaussians ρ^ergodic\hat{\rho}_\mathrm{ergodic} to the simulated paths; (iv) re-solve on the ergodic set by sampling training points from ρ^ergodic\hat{\rho}_\mathrm{ergodic} instead of the full hypercube. GPs handle this naturally because they require no grid structure.

9.6.9Comparison: GPs vs. DEQNs for Solving Dynamic Models

Both GPs (this chapter) and DEQNs (Chapters Chapter 2--Chapter 5) solve dynamic stochastic models, but with different trade-offs (Table Table 9.2):

Table 9.2:GP/ASGP and DEQN solvers for dynamic models. The table distinguishes exact fixed-point theory from the numerical approximation actually used: GP-VFI inherits the Bellman contraction only up to interpolation error, while DEQNs are judged by residual and simulation diagnostics.

CriterionGP / ASGPDEQNs
Solution methodValue function iterationEuler equation residuals
DimensionalityUp to \sim500 (with AS)>>1000 feasible
UQ built-inYes (GP posterior variance)No (needs extra work)
HardwareCPU clusters (MPI)GPUs (TensorFlow/PyTorch)
Irregular domainsYes (grid-free)Yes (mesh-free)
Sensitivity analysisCheap after pseudo-state augmentationRequires pseudo-states or re-training
Convergence diagnosticsExact Bellman operator is a contraction; GP interpolation error must be controlledEuler residuals and simulation tests; no generic global proof

When to use which: GPs when the effective dimension is moderate (D15D \lesssim 15, or a few hundred only when active-subspace structure is strong) and uncertainty quantification or sensitivity analysis is required; DEQNs when DD is very large, GPU hardware is available, or the model involves complicated market-clearing conditions that are more naturally expressed as Euler equation residuals than as Bellman maximization.

9.7Deep Kernel Learning

Standard GP kernels such as the RBF or Matérn operate on raw input features: the covariance k(x,x)k(\x, \x') depends on xx\|\x - \x'\|, which may be a poor measure of similarity when the inputs are high-dimensional or when the function’s structure depends on complex nonlinear interactions among variables. Deep Kernel Learning (DKL) Wilson et al., 2016 addresses this limitation by combining the representation-learning power of deep neural networks with the probabilistic framework of GPs.

9.7.0.1The DKL architecture.

A DKL model consists of two components:

  1. A feature extractor ϕ(x;θNN) ⁣:RDRd\phi(\x; \bm{\theta}_\mathrm{NN})\colon \R^D \to \R^d, typically a multi-layer perceptron that maps the raw input x\x to a learned representation of (usually much lower) dimension dd.

  2. A GP layer that places a GP prior on the function g(z)=g(ϕ(x;θNN))g(\z) = g(\phi(\x; \bm{\theta}_\mathrm{NN})), with a standard base kernel kbase(z,z)k_\mathrm{base}(\z, \z') (e.g., RBF or Matérn) operating in the learned feature space.

The deep kernel is thus:

kDKL(x,x)=kbase(ϕ(x;θNN),  ϕ(x;θNN)).k_\mathrm{DKL}(\x, \x') = k_\mathrm{base}\bigl(\phi(\x; \bm{\theta}_\mathrm{NN}),\; \phi(\x'; \bm{\theta}_\mathrm{NN})\bigr).

Unlike a standard kernel that computes distance in the input space, the DKL kernel computes distance in a learned feature space. If the neural network learns to map functionally similar inputs close together, the GP can exploit this structure for better predictions and more calibrated uncertainty estimates.

9.7.0.2Joint training.

The parameters of both the neural network (θNN\bm{\theta}_\mathrm{NN}) and the GP hyperparameters (,σf,σy\ell, \sigma_f, \sigma_y of kbasek_\mathrm{base}) are optimized jointly by maximizing the GP marginal likelihood:

maxθNN,ϑ  logp(yΦ(X;θNN),ϑ),\max_{\bm{\theta}_\mathrm{NN},\, \bm{\vartheta}} \; \log p(\bm{y}\,|\,\Phi(\X; \bm{\theta}_\mathrm{NN}),\, \bm{\vartheta}),

where Φ(X;θNN)\Phi(\X; \bm{\theta}_\mathrm{NN}) is the matrix of transformed features. This end-to-end training procedure automatically learns features that are useful for the GP, without requiring manual feature engineering. In practice, DKL is implemented via GPyTorch Gardner et al., 2018, which provides efficient GPU-accelerated GP inference and integrates seamlessly with PyTorch for the neural network component.

9.7.0.3When DKL helps.

DKL is most beneficial when:

The trade-off relative to a standard GP is clear: DKL offers greater expressiveness at the cost of more parameters and the risk of overfitting with very small datasets. For the moderate-data regimes typical of economic surrogate models (N100N \sim 100--1000), DKL can offer significant improvements over standard kernels, particularly when the target function has complex multi-scale behavior. Chen et al. (2025) apply learned-feature-map GP architectures in a dynamic model of private asset allocation, where the feature map captures the complex nonlinear interactions between illiquidity, portfolio composition, and optimal rebalancing.

9.7.0.4Illustrative examples.

Two simple examples highlight when DKL provides a qualitative advantage over standard kernels:

9.8GPs Among Their Bayesian Cousins

Gaussian processes are the most analytically transparent way to attach uncertainty to a non-parametric regressor, but they are not the only one. For completeness, this section briefly situates the GP machinery against two neural-network-based alternatives that come up frequently in modern uncertainty-aware deep learning.

9.8.0.1Monte-Carlo dropout.

Gal & Ghahramani (2016) show that a deep network trained with dropout and evaluated with dropout still active can be interpreted as approximate variational inference over a particular Bayesian neural network. Predictive uncertainty is obtained by averaging TT stochastic forward passes; the cost is one extra forward pass per sample. Calibration is rougher than a GP’s, but the method requires no architectural change and scales to deep nets and large nn.

9.8.0.2Deep ensembles.

Lakshminarayanan et al. (2017) train EE independent neural networks with different random seeds and combine them into a Gaussian-mixture predictor. Empirically this is one of the most robust uncertainty-quantification recipes available, often beating MC dropout and approaching the calibration of GPs, at EE times the training cost.

Table 9.3:Three uncertainty-quantification recipes compared on the dimensions that drive method choice in economic surrogate work. This script uses GPs in Chapter Chapter 9 and Chapter Chapter 11 because the typical regime (n103n \lesssim 10^3, expensive simulator) plays to their strengths; readers operating in larger-data regimes should consider MC dropout or deep ensembles before resorting to a sparse GP.

Gaussian processMC dropoutDeep ensembles
Calibrationexact under modelapproximatestrong empirically
Training costO(n3)\mathcal{O}(n^3)one networkE×E\times one network
Inference costO(n2)\mathcal{O}(n^2)TT forward passesEE forward passes
Sample efficiencybest at small nnneeds much more dataneeds much more data
Best whenn104n \lesssim 10^4, low ddcheap UQ on existing netswilling to pay E×E\times for top calibration
ReferenceRasmussen & Williams (2005)Gal & Ghahramani (2016)Lakshminarayanan et al. (2017)

9.8.0.3A decision rule for practice.

In our experience the right method follows from the application: (i) plain GPs for moderate dd (10\lesssim 10--20) with a smooth target and an expensive simulator, when calibrated uncertainty is the goal; (ii) deep kernels (Wilson & al., (2016)) when the input geometry is non-trivial (regime switches, manifold structure, image-like inputs); (iii) deep ensembles or MC dropout for high-dd regression where calibrated uncertainty is desirable but exact GP inference is infeasible; (iv) sparse GPs (Titsias (2009) Hensman et al. (2013)) for n104n \gtrsim 10^4 when the target stays smooth. The summary frame in the companion deck (“Toolbox: When to Use What”) gives the same decomposition visually.

9.9Further Reading

9.10Exercises

Worked solutions and guidance for these exercises appear in Appendix Appendix F.

Footnotes
References
  1. Rasmussen, C. E., & Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
  2. MacKay, D. J. C. (1992). Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4), 590–604.
  3. Krause, A., Singh, A., & Guestrin, C. (2008). Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9, 235–284.
  4. Deisenroth, M. P., Rasmussen, C. E., & Peters, J. (2009). Gaussian process dynamic programming. Neurocomputing, 72(7), 1508–1524.
  5. Engel, Y., Mannor, S., & Meir, R. (2005). Reinforcement Learning with Gaussian Processes. Proceedings of the 22nd International Conference on Machine Learning.
  6. Scheidegger, S., & Bilionis, I. (2019). Machine learning for high-dimensional dynamic stochastic economies. Journal of Computational Science, 33, 68–82. https://doi.org/10.1016/j.jocs.2019.03.004
  7. Renner, P., & Scheidegger, S. (2018). Machine learning for dynamic incentive problems. 10.2139/ssrn.3282487
  8. Friedl, A., Kübler, F., Scheidegger, S., & Usui, T. (2023). Deep Uncertainty Quantification: With an Application to Integrated Assessment Models.
  9. Titsias, M. K. (2009). Variational Learning of Inducing Variables in Sparse Gaussian Processes. Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS).
  10. Hensman, J., Fusi, N., & Lawrence, N. D. (2013). Gaussian Processes for Big Data. Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI).
  11. Srinivas, N., Krause, A., Kakade, S., & Seeger, M. (2010). Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. Proceedings of the 27th International Conference on Machine Learning, 1015–1022.
  12. Constantine, P. G. (2015). Active Subspaces: Emerging Ideas for Dimension Reduction in Parameter Studies. SIAM.
  13. Tripathy, R. K., & Bilionis, I. (2018). Deep UQ: Learning Deep Neural Network Surrogate Models for High Dimensional Uncertainty Quantification. Journal of Computational Physics, 375, 565–588.
  14. Bilionis, I., & Zabaras, N. (2016). Bayesian uncertainty propagation using Gaussian processes. In Handbook of Uncertainty Quantification (pp. 1–45). Springer International Publishing. 10.1007/978-3-319-11259-6_16-1
  15. Deisenroth, M. P., & Rasmussen, C. E. (2011). PILCO: A Model-Based and Data-Efficient Approach to Policy Search. Proceedings of the 28th International Conference on Machine Learning.