Gaussian Processes - Deep Learning for Solving and Estimating Dynamic Models in Economics and Finance

Section 4.4.1 introduced Gaussian processes at the picture level, with just enough machinery to follow the Bayesian-optimization argument of Chapter Chapter 4: the definition $f(\x) \sim \mathcal{GP}(\mu(\x), k(\x, \x'))$ , the squared-exponential kernel (4.2), and the closed-form posterior mean and variance (4.3)--(4.4). Those formulas took the kernel and its hyperparameters $(\ell, \sigma_f, \sigma_y)$ as given. The present chapter removes both shortcuts and develops the full machinery: it samples whole functions from the GP prior (Section 9.1), learns $(\ell, \sigma_f, \sigma_y)$ from data via the marginal-likelihood Occam’s razor (Section 9.2), expands the kernel zoo beyond the squared-exponential to Matérn and composite kernels (Section 9.2.1.1), uses the posterior variance to choose where to evaluate next via Bayesian Active Learning (Section 9.3), scales the whole machinery to high-dimensional inputs through active subspaces (Section 9.5), and embeds GPs inside value-function iteration for high-dimensional dynamic programming (Section 9.6). Two applications-driven chapters then put the machinery to work: structural estimation via surrogate-driven SMM (Chapter Chapter 10) and climate-policy uncertainty quantification (Chapter Chapter 11). Methodological foundations are the GP textbook of Rasmussen & Williams (2005) and the Bayesian-active-learning ideas dating back to MacKay (1992) and Krause et al. (2008). Embedding GPs inside dynamic programming was pioneered by Deisenroth et al. (2009) (GPDP) and Engel et al. (2005) (GP temporal-difference learning); within economics, applications include high-dimensional growth models Scheidegger & Bilionis, 2019, dynamic incentive problems Renner & Scheidegger, 2018, and deep uncertainty quantification for integrated assessment models Friedl et al., 2023. Sparse approximations Titsias, 2009Hensman et al., 2013 extend GPs beyond their cubic scaling limit; deep kernel learning combines neural-network feature extraction with GP inference and closes the chapter.

9.1Gaussian Process Regression¶

9.1.0.1Sampling from the GP prior.¶

Recall from Section 4.4.1 that for any finite set of test inputs $\x_1, \ldots, \x_n$ , the function values $\bm{f} = (f(\x_1), \ldots, f(\x_n))^\top$ are jointly Gaussian with prior mean $\bm{\mu}$ (the entries $\mu(\x_i)$ ) and covariance matrix $K_{ij} = k(\x_i, \x_j)$ . To draw a function from the GP prior at those test inputs, evaluate $K$ , compute its Cholesky decomposition $K = L L^\top$ , draw $\bm{u} \sim \mathcal{N}(\bm{0}, I)$ , and set $\bm{f} = \bm{\mu} + L \bm{u}$ . The resulting paths have the smoothness, amplitude, and length scale dictated entirely by the kernel.

9.1.0.2Kernel choice.¶

The squared-exponential kernel (4.2) of Section 4.4.1 remains the right default for genuinely smooth targets, e.g. in the unconstrained interior of an ergodic set. In economic applications, however, the target function (value, policy, equilibrium price) often has kinks from occasionally binding constraints, and the Matérn- $3/2$ kernel introduced in Section 9.2.1.1 below is then a better default than RBF, since it does not oversmooth at non-differentiable features.

Squared-exponential kernel as a function of distance for three length scales. Small \ell makes correlations decay quickly and produces rougher, more local fits; large \ell couples distant points and imposes smoother functions. — Figure 9.1:Squared-exponential kernel as a function of distance for three length scales. Small $\ell$ makes correlations decay quickly and produces rougher, more local fits; large $\ell$ couples distant points and imposes smoother functions.

Figure Figure 9.1 shows how the RBF length scale controls the distance over which observations remain informative. For any training dataset $\mathcal{D} = \{(\x_i, y_i)\}_{i=1}^n$ , the GP posterior at a test point $\x_*$ has the closed-form mean $\bar{f}_*$ and variance $\sigma_{f,*}^2$ stated in (4.3)--(4.4), with $K_{ij} = k(\x_i,\x_j)$ the kernel matrix on training inputs, $K_y = K + \sigma_y^2 I$ its noise-augmented version, $\bm{\mu}_X$ the prior-mean vector at training inputs, and $\bm{k}_*$ the cross-covariance whose $i$ -th entry is $k(\x_*, \x_i)$ . For a noisy future observation $y_*$ the predictive variance is $\sigma_{y,*}^2 = \sigma_{f,*}^2 + \sigma_y^2$ , and the common zero-mean formulas are recovered by centering outputs or setting $\mu \equiv 0$ .

9.1.0.3A hand-traceable 1D example.¶

To make (4.3)--(4.4) concrete, take $f(x) = \sin x$ , observe $f$ noiselessly at $x_1 = 0$ and $x_2 = \pi$ (so $y_1 = y_2 = 0$ ), and query at $x_\star = \pi/2$ . Use the kernel $k(x, x') = \exp\!\bigl(-(x - x')^2/2\bigr)$ and a tiny noise floor $\sigma_y^2 = 10^{-6}$ for numerical stability. The training kernel matrix is $$K + \sigma_y^2 I ;=;

\begin{pmatrix} 1 & e^{-\pi^2/2} \\ e^{-\pi^2/2} & 1 \end{pmatrix}

(9.1)

;\approx;

\begin{pmatrix} 1.000 & 0.00719 \\ 0.00719 & 1.000 \end{pmatrix}

(9.2)

\begin{pmatrix} \exp(-(\pi/2)^2/2) \\ \exp(-(\pi/2)^2/2) \end{pmatrix}

(9.3)

;\approx;

\begin{pmatrix} 0.2910 \\ 0.2910 \end{pmatrix}

(9.4)

Figure Figure 9.2 illustrates the GP prior and posterior for a simple one-dimensional regression problem. Before observing data, the GP prior has constant mean and uniform uncertainty. After conditioning on five observations, the posterior mean interpolates the data and the uncertainty bands collapse near the observations while remaining wide in unexplored regions.

Gaussian-process prior and posterior on a 1D regression problem. Left: the prior has constant mean (here zero) and uniform uncertainty; the shaded bands show the 68\% and 95\% credible intervals, and the thin grey curves are three sample paths drawn from the prior. Right: after conditioning on five observations (black dots), the posterior mean (red curve) interpolates the data exactly, and the credible band collapses near the observed points while widening in unexplored regions away from the data, giving the GP its built-in uncertainty quantification. — Figure 9.2:Gaussian-process prior and posterior on a 1D regression problem. *Left:* the prior has constant mean (here zero) and uniform uncertainty; the shaded bands show the $68\%$ and $95\%$ credible intervals, and the thin grey curves are three sample paths drawn from the prior. *Right:* after conditioning on five observations (black dots), the posterior mean (red curve) interpolates the data exactly, and the credible band collapses near the observed points while widening in unexplored regions away from the data, giving the GP its built-in uncertainty quantification.

9.2Kernel Functions and Hyperparameter Learning¶

The kernel hyperparameters $\bm{\vartheta} = (\ell, \sigma_f, \sigma_y)$ are learned by maximizing the log marginal likelihood:

\log p(\bm{y} | \X, \bm{\vartheta}) = -\frac{1}{2}(\bm{y}-\bm{\mu}_X)^\top K_y^{-1}(\bm{y}-\bm{\mu}_X) - \frac{1}{2}\log|K_y| - \frac{n}{2}\log 2\pi,

(9.5)

where $K_y = K + \sigma_y^2 I$ and $\bm{\mu}_X$ is the prior mean evaluated on the training inputs. In the zero-mean convention used elsewhere in this section, set $\bm{\mu}_X = 0$ (or center the outputs).

9.2.0.1Why marginal likelihood?¶

The log evidence $\log p(\bm y \mid \X, \bm\vartheta)$ encodes both data fit and an automatic complexity penalty in a single closed-form expression. The quadratic form $-\tfrac{1}{2}\bm y^\top K_y^{-1}\bm y$ rewards hyperparameters that explain the centered observations with a small inverse-covariance norm, while the log-determinant term $-\tfrac{1}{2}\log|K_y|$ penalises overly flexible kernels that admit too many possible functions, giving Bayesian Occam’s razor Rasmussen & Williams, 2005. Compared with cross-validated MSE, this approach requires no held-out split, makes use of all $n$ observations, and exposes a closed-form gradient with respect to $\bm\vartheta$ , which is essential for L-BFGS-style optimization in scikit-learn / GPyTorch. The maximum is reached at the kernel that is just expressive enough to fit the data but no more (Figure Figure 9.3).

Marginal-likelihood Occam’s razor for a GP. As the kernel becomes more flexible (smaller length scale \ell), the data-fit term improves but the \log|K_y| complexity penalty grows linearly. Their sum, the log evidence, peaks at an interior optimum that is just expressive enough to explain the data, automatically. No held-out validation set is required. — Figure 9.3:Marginal-likelihood Occam’s razor for a GP. As the kernel becomes more flexible (smaller length scale $\ell$ ), the data-fit term improves but the $\log|K_y|$ complexity penalty grows linearly. Their sum, the log evidence, peaks at an interior optimum that is just expressive enough to explain the data, automatically. No held-out validation set is required.

9.2.0.2A free held-out diagnostic.¶

Marginal likelihood is not the only validation tool. The leave-one-out (LOO) predictive error of a GP admits a closed form using the same Cholesky factor already computed for posterior inference, so it costs nothing extra; we develop the formula in Section 9.6.5, where it serves as an iteration-by-iteration health check inside GP-based VFI, and reuse it in Section 10.1.4 for the GP layer over the SMM moment map.

9.2.1Kernel Composition¶

More complex covariance structures can be built by composing simpler kernels. The sum of two kernels is again a valid kernel, as is the product. For example:

$k_\mathrm{SE} + k_\mathrm{periodic}$ : captures a smooth trend plus periodic oscillations.
$k_\mathrm{SE}(\ell_1) \cdot k_\mathrm{SE}(\ell_2)$ : models interactions between two length scales.

9.2.1.1The Matérn kernel family.¶

The Matérn kernel is parameterized by a smoothness parameter $\nu > 0$ :

k_{\mathrm{Mat\acute{e}rn}}(r) = \sigma_f^2\,\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^\nu K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right),

(9.6)

where $r = \|\x - \x'\|$ , $\ell$ is the length scale, and $K_\nu$ is the modified Bessel function of the second kind. The smoothness parameter $\nu$ controls the regularity of sample paths: a Matérn- $\nu$ GP is $k$ -times mean-square differentiable for every integer $k < \nu$ , with the RBF kernel recovered in the limit $\nu \to \infty$ . Important special cases:

$\nu = 1/2$ : the Ornstein--Uhlenbeck kernel $k(r) = \sigma_f^2 \exp(-r/\ell)$ ; sample paths are continuous but nowhere differentiable.
$\nu = 3/2$ : $k(r) = \sigma_f^2(1 + \sqrt{3}\,r/\ell)\exp(-\sqrt{3}\,r/\ell)$ ; once differentiable.
$\nu = 5/2$ : $k(r) = \sigma_f^2(1 + \sqrt{5}\,r/\ell + 5r^2/(3\ell^2))\exp(-\sqrt{5}\,r/\ell)$ ; twice differentiable.
$\nu \to \infty$ : recovers the squared exponential (RBF) kernel; infinitely differentiable.

For economic applications where the target function may have kinks (e.g., due to occasionally binding constraints), the Matérn-3/2 kernel is often a better choice than the infinitely smooth RBF kernel, as it does not oversmooth near non-differentiable features Renner & Scheidegger, 2018.

9.3Bayesian Active Learning for Sample-Efficient Training¶

When model evaluations are expensive, we wish to select training points that provide maximal information. Bayesian Active Learning (BAL) uses the GP posterior variance to guide this selection. The information-theoretic foundations go back to MacKay (1992), with submodular guarantees for variance-based sensor placement established by Krause et al. (2008) and the widely used upper-confidence-bound formulation of Srinivas et al. (2010).

9.3.0.1Connection to Bayesian optimization (Chapter Chapter 4).¶

BAL is the active-learning twin of the Bayesian-optimization (BO) recipe introduced in Chapter Chapter 4: same GP surrogate, same acquisition-function machinery. The difference is the target. BO seeks a scalar optimum of the surrogate (e.g. Expected Improvement steers samples toward $\argmax \hat f$ ), so its acquisition trades exploration against the chance of beating the current best. BAL instead targets global function approximation: it allocates samples wherever posterior variance is largest, irrespective of the predicted value. Both reduce to the same primitive (fit GP, maximise acquisition, evaluate, refit), but the choice of acquisition reflects whether one wants the best point or the best surrogate.

Definition 9.1

U(\x) = w_{\mathrm{obj}} \cdot \mu(\x) + \frac{w_{\mathrm{var}}}{2}\log\sigma^2(\x),

(9.7)

where $\mu(\x)$ and $\sigma^2(\x)$ are the GP posterior mean and variance, $w_{\mathrm{obj}}$ is the objective-following (or Bayesian-optimization) weight, and $w_{\mathrm{var}}$ controls exploration. This logarithmic-variance form is the one used by Renner & Scheidegger (2018) in economic applications; it is a non-standard relative of GP-UCB Srinivas et al., 2010. When the goal is global surrogate accuracy rather than optimization, integrated posterior variance or expected error reduction can be more natural objectives. The weights $w_{\mathrm{obj}}$ and $w_{\mathrm{var}}$ are local to the BAL context and distinct from the discount factor $\beta$ and shock-persistence notation used elsewhere in the script.

Definition 9.2

Input: Initial design $\mathcal{D}_0 = \{(\x_i, y_i)\}_{i=1}^{n_0}$ , budget $N$ , kernel $k$ , acquisition weights $(w_{\mathrm{obj}}, w_{\mathrm{var}})$
Fit GP on $\mathcal{D}_0$ : learn hyperparameters via marginal likelihood
for $n = n_0+1, \ldots, N$ :
- Compute acquisition function: $U(\x) = w_{\mathrm{obj}} \cdot \mu_{n-1}(\x) + \frac{w_{\mathrm{var}}}{2}\log\sigma_{n-1}^2(\x)$
- Select next point: $\x_n = \argmax_{\x} U(\x)$
- Evaluate expensive model: $y_n = f(\x_n)$
- Update dataset: $\mathcal{D}_n = \mathcal{D}_{n-1} \cup \{(\x_n, y_n)\}$
- Re-fit GP on $\mathcal{D}_n$ (update hyperparameters every $k$ iterations)
Output: Trained GP surrogate $\hat{f}(\cdot)$ with posterior mean $\mu_N$ and variance $\sigma_N^2$

The BAL algorithm concentrates training points near kinks, boundary layers, and other regions where the function is hardest to approximate, achieving the same accuracy as uniform sampling with far fewer evaluations Renner & Scheidegger, 2018. The exploration--exploitation trade-off controlled by $(w_{\mathrm{obj}}, w_{\mathrm{var}})$ ensures that the algorithm balances refining the approximation in already well-sampled regions against exploring uncharted territory.

The intuition behind the acquisition function is as follows. The first term $w_{\mathrm{obj}} \cdot \mu(\x)$ favors regions where the predicted function value is large (exploitation: sample where the function is interesting). The second term $\frac{w_{\mathrm{var}}}{2}\log\sigma^2(\x)$ favors regions of high uncertainty (exploration: sample where we know least). This acquisition function belongs to the Upper Confidence Bound (UCB) family Srinivas et al., 2010: with $w_{\mathrm{obj}} = 0$ it has the same maximizers as pure posterior-variance sampling, while the logarithmic variance weighting provides a more conservative exploration bonus than the standard-deviation weighting used in GP-UCB when it is combined with exploitation. By adjusting $w_{\mathrm{obj}}$ and $w_{\mathrm{var}}$ , the practitioner can control the balance. For economic applications where the entire domain is relevant (e.g., approximating a policy function), a pure exploration strategy ( $w_{\mathrm{obj}} = 0$ , $w_{\mathrm{var}} > 0$ ) is often appropriate, reducing BAL to an uncertainty sampling scheme that minimizes the integrated posterior variance. Figure Figure 9.4 illustrates two BAL iterations on a 1D toy.

Bayesian Active Learning in action. (a) Starting from three initial observations (red dots), the GP posterior mean (blue line) deviates from the true function (dashed black) in the data-sparse region, where the 95% credible band (blue shading) is wide. (b) The acquisition function selects the point of maximum posterior variance (green diamond); after evaluation, the posterior tightens locally and the mean improves. (c) A second active-learning iteration fills the remaining gap. With only five strategically chosen points, the GP posterior closely tracks the true function across the entire domain. — Figure 9.4:Bayesian Active Learning in action. **(a)** Starting from three initial observations (red dots), the GP posterior mean (blue line) deviates from the true function (dashed black) in the data-sparse region, where the 95% credible band (blue shading) is wide. **(b)** The acquisition function selects the point of maximum posterior variance (green diamond); after evaluation, the posterior tightens locally and the mean improves. **(c)** A second active-learning iteration fills the remaining gap. With only five strategically chosen points, the GP posterior closely tracks the true function across the entire domain.

9.4When to Use GPs vs. DNNs¶

Now that the GP machinery (posterior inference, kernel design, and BAL) has been introduced, we can give a more detailed comparison than the overview in Table Table 10.1. Active subspaces, introduced in Section Section 9.5, are the main tool for pushing GP surrogates beyond the moderate-dimensional regime. Table Table 9.1 extends Table Table 10.1 with the GP-specific items (LOO diagnostics, marginal-likelihood Occam, BAL).

Table 9.1:GP and DNN surrogate trade-offs after the basic GP machinery has been introduced. GPs are sample-efficient and uncertainty-aware but expensive in the number of training points; DNNs scale to much larger datasets and dimensions but need more data and separate machinery for calibrated uncertainty.
$\dagger$ Sparse-GP via inducing points reduces $\mathcal{O}(N^3)$ to $\mathcal{O}(Nm^2 + m^3)$ for $m \ll N$ inducing inputs Titsias, 2009Hensman et al., 2013; see the inducing-point remarkbox below.

Criterion	Gaussian Processes	Deep Neural Networks
Training cost	$\mathcal{O}(N^3)$ ; exact for $N$ in the low thousands^ $\dagger$ ^	$\mathcal{O}(N)$ per epoch; scales to $N \sim 10^6$
Prediction cost	$\mathcal{O}(N)$ posterior mean, $\mathcal{O}(N^2)$ posterior variance, per test point	$\mathcal{O}(\text{weights})$ per forward pass; independent of $N$
Gradient access	closed-form from posterior	autodiff (exact, any order)
Non-smooth features	Matérn- $3/2$ adapts well	excellent with enough data
High-dim. extension	active subspaces ( $d \gg 10$ )	native

Practical guidelines.

Choose a GP when each model evaluation is expensive (minutes to hours per solve), the effective dimensionality is moderate ( $d \lesssim 15$ , or higher with active subspaces), and uncertainty quantification on the surrogate is needed, e.g., for reporting confidence intervals on estimated parameters or for guiding further data collection via BAL.
Choose a DNN when training data is cheap to generate, the input dimension is high ( $d \gg 10$ ), and the downstream task requires millions of fast evaluations, e.g., Monte Carlo simulation, grid search over a large parameter space, or real-time inference.
Combine both when the problem has a natural two-stage structure: use a GP with BAL to build a small, high-quality training set with uncertainty estimates, then train a DNN on the resulting dataset for fast large-scale deployment. The active subspace approach of Scheidegger & Bilionis (2019) is particularly effective in this setting, extending GP methods far beyond their naïve scaling limits.

9.4.1GP Regression in Practice¶

In code, GP regression is a one-liner once the kernel is chosen. In scikit-learn the standard pattern is to assemble a kernel as the sum of an RBF (RBF(length_scale=...)) and a noise term (WhiteKernel(noise_level=...)), pass it to GaussianProcessRegressor, call .fit(X_train, y_train), and obtain posterior mean and standard deviation from .predict(X_test, return_std=True); the kernel hyperparameters (length_scale, noise_level, output amplitude) are optimized by maximising the marginal likelihood, with n_restarts_optimizer controlling robustness to local optima. The companion notebook 02_GP_and_BAL.ipynb provides a full worked example fitting noisy observations of $\sin(x)$ on $[-2,2]$ .

9.4.1.1Application: GP surrogates for option pricing.¶

GPs are particularly well suited as surrogates for derivative pricing models. For example, one can train a GP on as few as 5--50 Black--Scholes option prices (evaluated at different spot prices or parameter configurations) and obtain a surrogate that accurately reproduces the pricing surface with calibrated uncertainty bands. The posterior variance immediately quantifies the interpolation uncertainty at each query point. This idea extends naturally to stochastic volatility models such as Heston, where the analytical pricing formula is expensive to evaluate. Furthermore, because GP predictions are linear in the training targets, the uncertainty of a portfolio of GP-priced instruments propagates analytically: for a linear portfolio $\sum_i w_i \hat{V}_i$ with vector of weights $\bm{w}$ and joint posterior covariance $\Sigma_{\hat{V}}$ , $\mathrm{Var}(\bm{w}^\top \hat{V}) = \bm{w}^\top \Sigma_{\hat{V}} \bm{w}$ . When the surrogate errors are independent across instruments, $\Sigma_{\hat{V}}$ is diagonal with entries $\sigma_i^2$ and the formula reduces to $\sum_i w_i^2 \sigma_i^2$ ; otherwise the off-diagonal cross-instrument covariances must be retained, e.g. via a multi-output GP. Either way the assessment is instant.

Remark 9.2

The main limitation of GPs is their $\mathcal{O}(n^3)$ training cost, arising from the inversion of the $n \times n$ kernel matrix. For $n > 10{,}000$ , exact GP inference becomes impractical. Approximate methods, such as variational sparse GPs with inducing points Titsias, 2009, stochastic-variational GPs Hensman et al., 2013, random Fourier features, and structured kernel interpolation, can extend the range to $n \sim 10^5$ , but for truly large-scale problems, deep neural networks remain the method of choice. The BAL framework mitigates this limitation by keeping $n$ small through intelligent sample selection.

The inducing-point idea (Figure Figure 9.5) is simple: instead of carrying the full $n\times n$ kernel matrix, summarize the dataset with a much smaller set of $m \ll n$ pseudo-inputs $\bm Z = \{\bm z_1,\dots,\bm z_m\}$ and replace $K_{nn}$ by the Nyström-style low-rank approximation $K_{nm}K_{mm}^{-1}K_{mn}$ . Training and prediction then cost about $\mathcal{O}(nm^2 + m^3)$ rather than $\mathcal{O}(n^3)$ , with the $m^3$ term coming from the small inducing-point block. The variational formulation of Titsias (2009) additionally treats $\bm Z$ as parameters to be optimized against the marginal likelihood, so the inducing inputs migrate to wherever the GP actually needs resolution.

$Inducing-point intuition. Exact GP inference conditions on all n=21 observations and uses the full n\times n kernel matrix. Sparse variational GP methods introduce m\ll n inducing inputs \bm Z and approximate the posterior through the low-rank structure induced by K_{nm}K_{mm}^{-1}K_{mn}. The top panel compares the exact posterior mean and uncertainty band with a sparse approximation using m=4 pseudo-inputs; the bottom panel shows the absolute difference between the two means. In this smooth one-dimensional illustration the approximation is close, but its quality depends on m, the kernel hyperparameters, and the placement of \bm Z. The dominant training cost falls from \mathcal{O}(n^3) to \mathcal{O}(nm^2 + m^3); the variational formulation of optimizes \bm Z jointly with the kernel hyperparameters.$

Figure 9.5:Inducing-point intuition. Exact GP inference conditions on all $n=21$ observations and uses the full $n\times n$ kernel matrix. Sparse variational GP methods introduce $m\ll n$ inducing inputs $\bm Z$ and approximate the posterior through the low-rank structure induced by $K_{nm}K_{mm}^{-1}K_{mn}$ . The top panel compares the exact posterior mean and uncertainty band with a sparse approximation using $m=4$ pseudo-inputs; the bottom panel shows the absolute difference between the two means. In this smooth one-dimensional illustration the approximation is close, but its quality depends on $m$ , the kernel hyperparameters, and the placement of $\bm Z$ . The dominant training cost falls from $\mathcal{O}(n^3)$ to $\mathcal{O}(nm^2 + m^3)$ ; the variational formulation of Titsias (2009) optimizes $\bm Z$ jointly with the kernel hyperparameters.

9.5Scaling GPs to High Dimensions: Active Subspaces¶

A common criticism of GP methods is that they are limited to moderate-dimensional problems ( $d \leq 10$ ). While this is true for naïve implementations, where the number of training points needed to cover the input space grows exponentially in $d$ , Scheidegger & Bilionis (2019) show that active subspace methods can mitigate this barrier when the target function exhibits a clear spectral gap in the gradient outer-product matrix $C$ of (9.8) below, i.e., when its variation is concentrated on a low-dimensional linear subspace of the input space. If $f$ depends comparably on all $d$ coordinates, the spectral gap is absent and the technique offers little gain; sufficient conditions are in Constantine (2015). For a textbook treatment of the active subspace framework and references to its origins, see Constantine (2015); we summarize the key ideas and their application to dynamic economic models below.

9.5.0.1Intuition: why most directions don’t matter.¶

Consider a value function $V(\x)$ in a dynamic stochastic economy with $d$ state variables. Although $V$ is formally a function of all $d$ inputs, it often responds primarily to a few linear combinations of them. For example, in a multi-country model, the value function may depend mostly on aggregate capital $\sum_j k^j$ and a measure of inequality $\mathrm{Var}(k^j)$ , rather than on each country’s capital individually. If we can identify these important directions, we can project the $d$ -dimensional input onto a much lower-dimensional subspace and fit the GP there, substantially mitigating the practical grid-based curse of dimensionality.

9.5.0.2The gradient outer product matrix.¶

The key diagnostic tool is the gradient outer product matrix Constantine, 2015. Given a function $f\colon \R^d \to \R$ with input distribution $\pi(\x)$ , define:

C = \int \nabla f(\x) \, \nabla f(\x)^\top \, \pi(\x)\, d\x \;\in\; \R^{d \times d}.

(9.8)

The matrix $C$ is symmetric and positive semi-definite. Its $(i,j)$ -entry measures how much $f$ co-varies along input directions $i$ and $j$ , averaged over the input distribution. In particular:

If $\partial f / \partial x_i$ is consistently large across the input space, the $i$ -th diagonal entry of $C$ is large, meaning direction $i$ is important.
If two directions always change $f$ together, the corresponding off-diagonal entry is large, as they form part of the same important linear combination.
If $\partial f / \partial x_i \approx 0$ everywhere, then direction $i$ contributes nothing to $C$ , so it is irrelevant and can be projected away.

9.5.0.3Worked toy example.¶

Take $f(x_1, x_2) = x_1^2$ on $\x \sim \mathrm{Unif}([-1,1]^2)$ . Then $\nabla f = (2x_1, 0)^\top$ , and integrating against the uniform measure gives $C = \mathrm{diag}(4/3,\,0)$ . The eigenvalues are $\lambda_1 = 4/3$ and $\lambda_2 = 0$ ; the first eigenvector is $(1,0)^\top$ , the second is $(0,1)^\top$ ; the spectral gap is infinite. The active subspace is the $x_1$ -axis, and $f$ is recovered exactly as $f = g(x_1)$ with $g(u) = u^2$ . This is the cleanest instance of the projection in (9.9) below.

9.5.0.4Eigendecomposition and the spectral gap.¶

Let $C = U \Lambda U^\top$ be the eigendecomposition, with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$ . The eigenvectors reveal the directions of maximal average variation; the eigenvalues quantify how much variation each direction captures. Each diagonal entry of $C$ is the average squared sensitivity of $f$ along that input direction, so eigenvectors of $C$ are directions of correlated sensitivity: when the top $m$ eigenvalues capture most of the trace, $f$ varies primarily along their span and is approximately constant in the perpendicular $(d-m)$ -dimensional complement. If there is a clear spectral gap, i.e., $\lambda_m \gg \lambda_{m+1}$ , then the first $m$ eigenvectors $U_m = [u_1, \ldots, u_m]$ span the active subspace, and $f$ is well approximated by a function of the reduced coordinates $\tilde{\x} = U_m^\top \x \in \R^m$ alone.

f(\x) \;\approx\; g(U_m^\top \x), \qquad U_m = [u_1,\ldots,u_m].

(9.9)

Spectral decay of the active-subspace eigenvalues for a schematic example with d=12 state variables. The first three eigenvalues are orders of magnitude larger than the rest; the dashed red line marks the spectral gap after \lambda_3, which indicates that f effectively lives on a 3-dimensional active subspace (green brace) and reduces the GP regression problem from \R^{12} to \R^3. The remaining nine directions form the inactive subspace (gray brace), along which f is nearly constant on average. — Figure 9.6:Spectral decay of the active-subspace eigenvalues for a schematic example with $d=12$ state variables. The first three eigenvalues are orders of magnitude larger than the rest; the dashed red line marks the spectral gap after $\lambda_3$ , which indicates that $f$ effectively lives on a 3-dimensional active subspace (green brace) and reduces the GP regression problem from $\R^{12}$ to $\R^3$ . The remaining nine directions form the inactive subspace (gray brace), along which $f$ is nearly constant on average.

9.5.0.5Computing the gradient.¶

In practice, the integral in (9.8) is estimated from a finite sample of gradient evaluations Constantine, 2015 (this is a one-time pre-processing cost, $\mathcal{O}(N_g\, d)$ with $N_g \sim 2d$ -- $10d$ , far smaller than the surrogate training set of size $N$ ):

\hat{C} = \frac{1}{N_g}\sum_{i=1}^{N_g} \nabla f(\x_i)\,\nabla f(\x_i)^\top, \qquad \x_i \sim \pi(\x).

(9.10)

The gradient $\nabla f(\x_i)$ can be obtained via:

Automatic differentiation through the model solver (if differentiable, e.g., a DEQN or PINN).
Finite differences: evaluate $f$ at $d+1$ perturbed inputs per sample point. This costs $(d+1)\cdot N_g$ model evaluations, but $N_g$ can be modest (typically $N_g \sim 2d$ to $10d$ suffices for a reliable estimate of the leading eigenspace).

The eigendecomposition of $\hat{C}$ is a standard $d \times d$ matrix operation and is cheap relative to the model evaluations. When $\{\x_i\}$ are i.i.d. from $\pi$ , $\hat{C} \to C$ almost surely as $N_g \to \infty$ by the law of large numbers, so the eigenvectors of $\hat{C}$ are consistent estimates of the population active subspace; finite-sample bias is documented in Constantine (2015).

9.5.0.6Practical workflow.¶

The full dimension-reduction pipeline consists of three steps:

Gradient sampling: Draw $N_g$ points $\x_i \sim \pi(\x)$ and compute $\nabla f(\x_i)$ at each. Form the empirical estimate $\hat{C}$ via (9.10).
Eigendecomposition and dimension selection: Compute $\hat{C} = \hat{U}\hat{\Lambda}\hat{U}^\top$ . Inspect the eigenvalue decay, as in Figure Figure 9.6, and choose $m$ at the spectral gap.
GP in reduced space: Project all training inputs onto the active subspace: $\tilde{\x}_i = \hat{U}_m^\top \x_i \in \R^m$ . Fit a standard GP on the $m$ -dimensional projected data.

When combined with BAL, the GP adaptively selects training points in the reduced space, further improving sample efficiency.

Linear active-subspace pipeline. Gradient samples identify the dominant eigenspace of the gradient outer-product matrix, all simulator inputs are projected to the reduced coordinates \tilde{\x}=U_m^\top \x, and the GP/BAL loop is then run in the low-dimensional active subspace. — Figure 9.7:Linear active-subspace pipeline. Gradient samples identify the dominant eigenspace of the gradient outer-product matrix, all simulator inputs are projected to the reduced coordinates $\tilde{\x}=U_m^\top \x$ , and the GP/BAL loop is then run in the low-dimensional active subspace.

Figure Figure 9.7 summarizes the linear active-subspace workflow used before fitting the GP.

9.5.0.7Application to dynamic stochastic economies.¶

Scheidegger & Bilionis (2019) apply this pipeline to high-dimensional dynamic programming problems arising in neoclassical growth models. Their key findings:

Models with up to $d = 500$ continuous state variables can be solved, far beyond the reach of grid-based methods or naïve GP implementations.
The active subspace typically reduces the effective dimension to $m = 2$ --5, even when $d$ is in the hundreds. The value function’s dependence on the full state vector collapses onto a few aggregate quantities (e.g., mean capital, dispersion of productivity).
The state space is partitioned into clusters via Bayesian Gaussian mixture models, and a separate GP with BAL is fit within each cluster. This local approximation strategy handles non-stationarities in the value function (e.g., different curvature in high- vs. low-capital regions).
Parallel computing distributes the gradient evaluations and GP fits across multiple processors, enabling the method to scale to large-scale models on high-performance computing clusters.

9.5.0.8Anchor on the multi-sector growth benchmark.¶

In the multi-sector neoclassical growth benchmark of Scheidegger & Bilionis (2019), the active-subspace dimension collapses to $m = 1$ even at $D = 500$ , so the entire ASGP pipeline operates on a one-dimensional projected coordinate. On models with other configurations the active subspace can be larger; the spectral-gap test is what decides.

9.5.1Nonlinear Generalization: Deep Active Subspaces¶

The linear pipeline of the previous section assumes that the important directions are linear combinations of the input variables. That is an inductive bias: the first $m$ columns of $U$ always span a linear subspace of $\R^d$ , so if the response $f$ actually varies along a curved low-dimensional manifold (a ridge, a product of two linear features, a radial coordinate, a hidden regime indicator), the linear projection has to carry enough directions to cover the whole manifold, not the manifold itself. Concretely, if $f(\xi) = \varphi\!\bigl((w_1^\top \xi)^2 + (w_2^\top \xi)^2\bigr)$ is a radial function of two linear features, the gradient $\nabla f = 2\,\varphi'(\cdot)\,(s_1 w_1 + s_2 w_2)$ spans a two-dimensional space when averaged over $\xi$ , so linear AS returns $m = 2$ ; yet the intrinsic “interesting” coordinate is the scalar aggregate $r^2 = s_1^2 + s_2^2$ , which is one-dimensional. Tripathy & Bilionis (2018) remove the linearity bias by replacing the linear projection with a learned nonlinear encoder and thereby recover the one-dimensional structure.

9.5.1.1The deep active-subspace ansatz.¶

Tripathy and Bilionis parametrize the surrogate as a composition of two small neural networks,

\hat f(\xi) \;=\; g\bigl(h(\xi)\bigr), \qquad h\colon \R^D \!\to\! \R^d, \qquad g\colon \R^d \!\to\! \R,

(9.11)

where $h$ is a multilayer perceptron (MLP) that plays the role of the projection matrix $U_m^\top$ in the linear case, and $g$ is a second MLP that plays the role of the GP link. Setting $h(\xi) = U_m^\top \xi$ and taking $g$ to be a polynomial recovers the linear active-subspace surrogate (9.9), so (9.11) generalizes the linear-AS ansatz rather than the gradient covariance matrix (9.8) itself. The two networks are trained jointly to minimize the residual $\hat f(\xi_i) - y_i$ on an input-output sample, so that the bottleneck $h(\xi) \in \R^d$ adapts to the specific response surface rather than being fixed in advance by the spectrum of $C$ .

9.5.1.2Architecture choices.¶

Three design choices make (9.11) trainable with a few hundreds of samples:

Exponentially decaying encoder widths Tripathy & Bilionis, 2018. For an encoder with $L$ layers mapping $\R^D$ to $\R^d$ , choose widths

d_k \;=\; \bigl\lceil D \, e^{\eta_{\mathrm{w}} k} \bigr\rceil, \qquad \eta_{\mathrm{w}} = \tfrac{1}{L}\log(d/D), \qquad k = 0, 1, \ldots, L,

(9.12)

so that the bottleneck closes smoothly from $D$ to $d$ without a brittle hyperparameter choice. For $D = 20,\, L = 3,\, d = 1$ this gives widths $[20, 8, 3, 1]$ ; for $d = 2$ , widths $[20, 10, 5, 2]$ -- a recipe rather than a search.

Swish activation Tripathy & Bilionis, 2018,

\sigma(z) \;=\; \frac{z}{1 + e^{-\gamma z}}, \qquad \gamma = 1,

(9.13)

which is smooth everywhere (unlike ReLU) and non-saturating (unlike $\tanh$ ); smoothness matters because we will want to differentiate through $\hat f$ for sensitivity analysis and because the elastic-net penalty below otherwise has to fight the sharp corners of ReLU.

Elastic-net regularization on every weight matrix Tripathy & Bilionis, 2018. Writing $\theta = (\theta_h, \theta_g)$ for all encoder and link parameters, the training loss is

\mathcal{L}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^N \bigl(\hat f_\theta(\xi_i) - y_i\bigr)^2 \;+\; \lambda_1 \lVert \theta \rVert_1 \;+\; \lambda_2 \lVert \theta \rVert_2^2,

(9.14)

with small $\lambda_1, \lambda_2$ ; the $\ell_1$ term encourages sparse weights and input usage, the $\ell_2$ term controls the overall smoothness of $\hat f_\theta$ and prevents the loss landscape from becoming pathological as $N$ shrinks.

Crucially, no gradient samples of $f$ are required: $\theta$ is learned from the input-output pairs $\{(\xi_i, y_i)\}$ alone. The orthogonality constraint $U^\top U = I$ that is implicit in the linear case becomes unnecessary, because the learned encoder is free to pick any smooth low-dimensional parameterization of the active manifold.

9.5.1.3Choosing the latent dimension $d$ .¶

The spectral gap of $C$ is no longer available -- the encoder is nonlinear -- so $d$ is chosen by a validation-MSE elbow: hold out an independent fraction of the sample, train a small family of models with $d = 1, 2, 3, \ldots$ , and pick the smallest $d$ beyond which held-out error no longer drops significantly. An operational rule of thumb is to stop at the first $d$ for which the MSE improvement from $d$ to $d+1$ is less than a factor of two: smaller gains are typically driven by optimization slack, not by new latent structure. On curved problems this elbow lies strictly below the linear-AS spectral gap: the deep encoder collapses two linear features into a single nonlinear aggregate (notebook 09_Deep_Active_Subspace_Ridge gives a reproducible instance in $D = 20$ ). On nearly-linear problems the two criteria agree qualitatively, and at small training-set sizes a polynomial link on top of a two-dimensional linear AS can in fact be more data-efficient than the deep encoder (notebook 10_Deep_AS_vs_Linear_AS_Borehole, the canonical borehole benchmark with $D = 8$ and $N = 500$ ). Figure Figure 9.8 contrasts the elbow rule with the linear-AS spectral gap on the radial-ridge target.

Stylized comparison of the two selection criteria for the radial-ridge target y(\xi) = \exp(-[(w_1^\top\xi)^2 + (w_2^\top\xi)^2]) in D = 20 (see notebook 09). The linear-AS eigenvalue spectrum has two dominant directions, so the spectral-gap criterion picks d = 2 (right green dashed line). The deep-AS validation MSE, by contrast, is already at its plateau at d = 1 (left green dashed line): the learned encoder h_\theta represents the nonlinear aggregate r^2 = s_1^2 + s_2^2 as a scalar. The curves are stylized; the orders of magnitude track the notebook (linear-AS eigenvalues \approx 0.16 for i = 1, 2 then a sharp drop; deep-AS validation MSE \approx 10^{-4} at d = 1 and roughly flat thereafter), and the elbow-at-d=1 versus spectral-gap-at-d=2 contrast is reproduced. — Figure 9.8:Stylized comparison of the two selection criteria for the radial-ridge target $y(\xi) = \exp(-[(w_1^\top\xi)^2 + (w_2^\top\xi)^2])$ in $D = 20$ (see notebook `09`). The linear-AS eigenvalue spectrum has two dominant directions, so the spectral-gap criterion picks $d = 2$ (right green dashed line). The deep-AS validation MSE, by contrast, is already at its plateau at $d = 1$ (left green dashed line): the learned encoder $h_\theta$ represents the nonlinear aggregate $r^2 = s_1^2 + s_2^2$ as a *scalar*. The curves are stylized; the orders of magnitude track the notebook (linear-AS eigenvalues $\approx 0.16$ for $i = 1, 2$ then a sharp drop; deep-AS validation MSE $\approx 10^{-4}$ at $d = 1$ and roughly flat thereafter), and the elbow-at- $d=1$ versus spectral-gap-at- $d=2$ contrast is reproduced.

9.5.1.4Training recipe.¶

A practical recipe that reproduces the experiments in notebooks 09 and 10:

Sample $N$ input-output pairs $(\xi_i, y_i)$ ; split $80/20$ into train and validation sets. Standardize inputs to unit variance and center outputs.
For each candidate $d \in \{1, 2, 3, \ldots\}$ : build $h_\theta$ with widths (9.12) and $g_\theta$ with two hidden layers (16--32 units); train on loss (9.14) with Adam ( $\text{lr} = 5 \times 10^{-3}$ ), a cosine learning-rate schedule over 10³-- $2 \times 10^3$ epochs, and $\lambda_1 = 10^{-5},\, \lambda_2 = 10^{-4}$ .
Record the validation MSE, apply the elbow rule, and deploy $\hat f_\theta$ with the chosen $d$ .

Sample-budget rule of thumb: $N \approx 50\, d_{\mathrm{nl}}$ to find the bottleneck, inflated to $N \approx 200\, d_{\mathrm{nl}}$ for a deployment surrogate. Two post-training sanity checks are worth doing: (i) the validation curve should be monotone-then-flat in $d$ ; (ii) a scatter of $h_\theta(\xi_i)$ against the top linear-AS coordinate $U_1^\top \xi_i$ reveals whether the encoder has actually gone nonlinear (a non-monotone relation is the fingerprint).

$Deep active-subspace pipeline. Input--output pairs (\xi_i, y_i) are drawn directly from the simulator (no gradient samples needed); a nonlinear encoder h_\theta compresses the high-dimensional input into a d-dimensional latent code, a small link network g_\theta maps the latent code to the response, and an elastic-net / validation-MSE elbow chooses d. The trained composition \hat f_\theta = g_\theta \circ h_\theta is the deployed surrogate. Compared with the linear active-subspace pipeline (Figure ), the encoder + link boxes replace the eigendecomposition + linear projection, the elbow box replaces the spectral-gap search, and the gradient-sampling step disappears entirely.$

Figure 9.9:Deep active-subspace pipeline. Input--output pairs $(\xi_i, y_i)$ are drawn directly from the simulator (no gradient samples needed); a nonlinear encoder $h_\theta$ compresses the high-dimensional input into a $d$ -dimensional latent code, a small link network $g_\theta$ maps the latent code to the response, and an elastic-net / validation-MSE elbow chooses $d$ . The trained composition $\hat f_\theta = g_\theta \circ h_\theta$ is the deployed surrogate. Compared with the linear active-subspace pipeline (Figure Figure 9.7), the encoder + link boxes replace the eigendecomposition + linear projection, the elbow box replaces the spectral-gap search, and the gradient-sampling step disappears entirely.

Figure Figure 9.9 should be compared with the linear pipeline in Figure Figure 9.7: the two blue encoder / link boxes replace “eigendecomposition + project”, the “elastic-net + elbow” box replaces “find spectral gap”, and the gradient-sampling step is gone entirely. The economic pay-off is the same, a cheap surrogate in $d$ variables where the original has $D$ , but the modeling assumption is weaker: curved active manifolds are now admissible.

9.5.1.5When is the extra machinery worth it?¶

Deep AS pays off when (i) no gradient samples are available (e.g., black-box simulators, noisy observations from a lab experiment, or calibration targets returned by a proprietary solver), (ii) the spectral gap of $C$ is ambiguous or the response surface genuinely depends on nonlinear aggregates of the inputs (ridges, radial coordinates, thresholded features, piecewise regimes), or (iii) the input dimension is large ( $D \gg 10$ ) and a linear projection is too restrictive to capture the active manifold. When none of these conditions apply, linear AS plus a polynomial or a GP link is usually more data-efficient and should be fit first as a baseline. The borehole experiment in notebook 10 is the honest diagnostic: on a function this close to a $d = 2$ ridge, the two curves cross, and a cubic polynomial on two linear features beats the deep encoder at all $d \ge 2$ . Deep AS’s advantage materializes precisely at $d = 1$ , where a single linear feature cannot span the active direction. This perspective also connects naturally to deep kernel learning (Section Section 9.7), which composes a learned feature map with a GP head; deep AS is the same idea for the feature map, with an MLP link in place of the GP.

9.5.1.6Applications in economics and finance.¶

The deep-AS architecture is particularly attractive in settings where the state variable is high-dimensional but the economic mechanism acts through a few aggregate quantities that are not linear in the state. Examples include (i) heterogeneous-agent models whose cross-sectional distribution of wealth enters the equilibrium law of motion only through a few moments (mean, dispersion, tail mass), the Krusell-Smith aggregation logic of Chapter Chapter 6, where the map from the raw distribution to those moments is a learned, nonlinear encoder; (ii) multi-region climate-economy integrated assessment models with hundreds of regional capital stocks, in which damages and optimal carbon taxes depend on nonlinear aggregates of the underlying state (total radiative forcing, regional inequality indices); and (iii) dynamic games and mean-field games in which the relevant action depends on a curved functional of the opponent distribution. In each case the original space is too large for a direct GP and the linear active subspace is too rigid, while the deep encoder provides a trainable compromise. Bilionis & Zabaras (2016) and Tripathy & Bilionis (2018) provide the general UQ and deep-active-subspace methodology; for integrated assessment models specifically, Friedl et al. (2023) is the economics-side reference used in this course.

9.6Dynamic Programming with Gaussian Processes¶

Returning to the second oracle announced at the start of this chapter and developed formally in Section 9.6.1 below, we now embed the GP and active-subspace machinery developed above into a value function iteration (VFI) algorithm, with the Bellman operator $TV$ playing the role of the expensive label generator. The acronym ASGP used in the section titles and figure captions below stands for Active-Subspace Gaussian Process: the dimensionality reducer of Section 9.5 stacked under the GP of Section 9.2.

The idea of representing the value function as a GP inside a DP recursion was introduced under the name Gaussian process dynamic programming (GPDP) by Deisenroth et al. (2009), who used it for continuous-state, continuous-action optimal control problems and combined it with a variance-based active selection of support points. Closely related work Engel et al., 2005 embeds GPs into temporal-difference learning, replacing tabular value functions with a GP posterior. These ideas later inspired model-based policy search methods such as PILCO Deisenroth & Rasmussen, 2011, which propagate GP uncertainty through long horizons. In economics, Scheidegger & Bilionis (2019) show that pairing this paradigm with active subspaces and Bayesian Gaussian mixture models for the ergodic set scales the framework to economies with up to 500 continuous state variables; Renner & Scheidegger (2018) apply it to dynamic incentive problems, Gaegauf et al. (2023) to dynamic portfolio choice, and Chen et al. (2025) to dynamic private asset allocation. The remainder of this section presents the algorithm in this combined form.

9.6.1Why GPs for DP: a Supervised-Learning View of the Bellman Operator¶

Before stating the algorithm, it helps to see VFI through a supervised-learning lens. At iteration $s$ , given an incumbent value function $V^{s-1}$ , we generate a training set

\mathcal{D}^s = \bigl\{(\x^{(i)},\, t_i^s)\bigr\}_{i=1}^{n^s}, \qquad t_i^s = (TV^{s-1})(\x^{(i)}),

(9.15)

and fit a regressor $V^s \approx (TV^{s-1})$ to it. The label $t_i^s$ is not free. Each evaluation of the Bellman operator at a state $\x^{(i)}$ requires solving a constrained nonlinear program over controls $\xi$ subject to the law of motion and any feasibility / market-clearing constraints; quadrature over $\x'$ is taken inside. In textbook problems each oracle call costs 10^-2--10⁰ seconds, and in higher-dimensional models with adjustment costs, occasionally binding constraints, or aggregator nonlinearities, it can run into minutes. Because the calls are independent across $\x^{(i)}$ , the design is embarrassingly parallel Scheidegger & Bilionis, 2019, but each individual label remains expensive.

This is exactly the regime where Gaussian processes shine. Three properties make them the natural surrogate class:

Sample efficiency under an expensive oracle. The marginal-likelihood Occam’s razor of Section 9.2 delivers calibrated fits at $n \sim 10^2$ --10³ design points, well below the $n \gg 10^4$ regime in which deep-network surrogates start to dominate (Table Table 9.3).
Built-in uncertainty quantification. The GP posterior variance $\sigma_\mathrm{GP}^2(\x)$ tells us, at every state, how much the current surrogate trusts its own prediction. This is the input that turns a passive interpolant into an adaptive one (Section 9.6.6).
Cheap held-out diagnostics. The leave-one-out predictive error is available in closed form from the same Cholesky factor used for posterior inference (Section 9.6.5), so we can monitor surrogate health every iteration without spending oracle calls on a held-out split.

The same logic applies, with a different oracle, in the structural estimation chapter: Chapter Chapter 10 stacks a GP over the moment map $m(\theta)$ where each label is a forward simulation plus moment computation under a candidate parameter, again expensive to evaluate and cheap to store. Both settings are instances of one pattern: supervised regression on the output of a costly numerical procedure. The next subsection writes down the formal DP problem; the subsequent subsections develop the GP-based interpolant and the two ingredients (LOO and active learning) that make it sample-efficient at modest design size.

9.6.2The Dynamic Programming Problem¶

Consider an infinite-horizon, discrete-time stochastic optimal decision problem. A representative agent chooses a sequence of controls $\{\xi_t\}_{t=0}^\infty$ to maximize the expected discounted sum of returns:

V(\x_0) = \max_{\{\xi_t\}} \mathbb{E}_0 \sum_{t=0}^\infty \beta^t\, r(\x_t, \xi_t),

(9.16)

subject to the law of motion $\x_{t+1} \sim F(\cdot\,|\,\x_t, \xi_t)$ , where $\x_t \in \mathcal{X} \subset \R^D$ is the state, $\xi_t \in \Lambda(\x_t)$ is the control, $r$ is the per-period return function, and $\beta \in (0,1)$ is the discount factor.

By the principle of optimality Bellman, 1957, the infinite-dimensional problem (9.16) reduces to the Bellman equation:

V(\x) = \max_{\xi \in \Lambda(\x)} \bigl\{ r(\x, \xi) + \beta\, \mathbb{E}\bigl[V(\x')\bigr] \bigr\},

(9.17)

where the expectation is over the stochastic transition. The Bellman operator $T$ maps value functions to value functions:

(TV)(\x) = \max_{\xi \in \Lambda(\x)} \bigl\{ r(\x, \xi) + \beta\, \mathbb{E}\bigl[V(\x')\bigr] \bigr\}.

(9.18)

Under standard regularity conditions (bounded returns, $\beta < 1$ , monotonicity, discounting), $T$ is a contraction mapping on the Banach space of bounded continuous functions with sup-norm, with modulus $\beta$ . By the Banach fixed-point theorem (Appendix Appendix D), $T$ admits a unique fixed point $V^*$ , and iterating $V^{s+1} = TV^s$ from any initial guess $V^0$ converges geometrically: $\|V^s - V^*\|_\infty \le \beta^s \|V^0 - V^*\|_\infty$ . The classical references in economics are Stokey et al. (1989) [Chs. 3--4] for the formal contraction-and-fixed-point theory of the Bellman operator and Judd (1998) [Ch. 12] for the numerical implementation in continuous-state settings; Ljungqvist & Sargent (2018) [Chs. 3--4] and Sargent & Stachurski (2026) provide modern macroeconomic treatments. In the operations-research and optimal-control literature, Bertsekas (2000) is the canonical reference, with extensive coverage of approximate dynamic programming. The contraction modulus $\beta$ applies to the exact Bellman operator; convergence of the GP-fitted iterates additionally requires controlling the per-step interpolation error of the surrogate, otherwise the approximate operator can fail to contract globally even though the exact one does. This is the standard textbook VFI loop: guess a value function, apply the Bellman operator, interpolate the updated values, and repeat until convergence. In the present chapter, the only change is the interpolant: a GP replaces the grid or polynomial basis when the state space is irregular or moderately high-dimensional. For a comprehensive treatment of dynamic programming in economics, see also the open-source QuantEcon lectures.^[1]

9.6.2.1The computational challenge.¶

At every VFI iteration, we must approximate $V^{s+1}$ as a function of the full state vector $\x \in \R^D$ . Traditional approaches (finite grids, Smolyak sparse grids, tensor-product polynomial bases) suffer from the curse of dimensionality: the number of grid points grows exponentially in $D$ , and they require hypercubic state-space geometries. For $D > 20$ , these methods become infeasible.

9.6.3The Stochastic Optimal Growth Model¶

The workhorse test case for GP-based dynamic programming is the multi-sector stochastic optimal growth model. Assume $D$ sectors, each with a capital stock $k_j$ , so the state is $\bm{k} = (k_1, \ldots, k_D) \in \R^D_+$ . A representative household chooses consumption $\bm{c}$ , labor supply $\bm{l}$ , and investment $\bm{i}$ to maximize:

V_0(\bm{k}_0) = \max_{\{\bm{c}_t, \bm{l}_t, \bm{i}_t\}} \mathbb{E}_0 \sum_{t=0}^\infty \beta^t\, u(\bm{c}_t, \bm{l}_t),

(9.19)

subject to the capital law of motion $k_{j,t+1} = (1-\delta)\,k_{j,t} + i_{j,t}$ , a sector-by-sector resource constraint $c_{j,t} + i_{j,t} + \Gamma_{j,t} = \exp(z_{j,t})\,A\,k_{j,t}^\psi\,l_{j,t}^{1-\psi}$ with convex adjustment cost $\Gamma_{j,t}$ , and Cobb--Douglas production $f(k_j, l_j) = A\,k_j^\psi\,l_j^{1-\psi}$ . Productivity shocks $z_{j,t}$ follow a stationary process and the dimension $D$ can be scaled from 1 (the textbook Brock--Mirman model of Chapter Chapter 2) to 500 without changing the algorithmic structure.

9.6.4GP-Based Value Function Iteration (ASGP)¶

The key idea, common to Deisenroth et al. (2009) in the control literature and Scheidegger & Bilionis (2019) in the economics literature, is to use a Gaussian process as the interpolation scheme inside VFI, replacing grid-based methods entirely. At each VFI step $s$ :

Generate $n^s$ training inputs $\X = \{\x^{(1)}, \ldots, \x^{(n^s)}\} \subset [\underline{\bm{k}}, \bar{\bm{k}}]^D$ .
Evaluate the Bellman operator at each training point: $t_i^s = (TV^{s-1})(\x^{(i)})$ .
Learn a GP (or ASGP) surrogate $V_\mathrm{surr}$ from the training data $\{\X, \bm{t}\}$ .
Set $V^s = V_\mathrm{surr}$ (the GP posterior mean).
Compute the convergence error $e^s = \|V^s - V^{s-1}\|_\infty / \Delta_V$ , where $\Delta_V$ is a normalizing value-function range.
If $e^s < \bar{e}$ , stop; otherwise continue.

9.6.4.1Advantages over grid-based methods.¶

This GP-VFI approach offers several structural advantages:

Arbitrary geometries: GPs require no tensor-product structure, so training points can be placed anywhere in the state space, including on irregularly shaped ergodic sets.
Adaptive training: Points where the Bellman operator fails to converge (e.g., near constraints) can simply be excluded and retried later; adding or removing points is trivial.
Built-in uncertainty quantification: The GP posterior variance provides interpolation uncertainty at every state point, at every iteration, “free” UQ.
Active subspace integration: When $D \gg 10$ , the GP operates on the projected inputs $\tilde{\x} = U_m^\top \x \in \R^m$ discovered via the active subspace pipeline (Section Section 9.5), reducing the effective dimensionality from hundreds to a handful.

9.6.4.2Parallelization.¶

The most expensive part of the algorithm, evaluating the Bellman operator at $n^s$ training points, is embarrassingly parallel: each evaluation is independent. In the MPI implementation of Scheidegger & Bilionis (2019), the current value function surrogate is broadcast to all workers, each worker evaluates the Bellman operator at $n^s/n_\mathrm{cpu}$ points, and the results are gathered at the master for GP fitting. Communication cost is negligible relative to the Bellman evaluations.

9.6.5Leave-One-Out Error: a Held-out Diagnostic for Free¶

How do we know that the GP surrogate at iteration $s$ is good enough to take another Bellman step? The marginal-likelihood objective of Section 9.2 fits the kernel, but it does not, on its own, certify pointwise predictive accuracy on the design. A held-out validation split would, but every held-out point is one fewer expensive Bellman label going into the GP.

The standard escape route in GP regression is the leave-one-out (LOO) predictive error, and it has the pleasant feature that for a Gaussian process it admits a closed form using the same Cholesky factor already computed for the posterior Rasmussen & Williams, 2005:

\mu_{-i}(\x^{(i)}) - y_i = -\frac{\alpha_i}{\bigl[K_y^{-1}\bigr]_{ii}}, \qquad \sigma_{-i}^2(\x^{(i)}) = \frac{1}{\bigl[K_y^{-1}\bigr]_{ii}},

(9.20)

where $\mu_{-i}$ and $\sigma_{-i}^2$ denote the posterior mean and variance after removing the $i$ -th observation, and $\alpha = K_y^{-1}(\bm y-\bm\mu_X)$ (for centered outputs, $\bm\mu_X=0$ ). Since $K_y^{-1}$ is recovered from the Cholesky factor of $K_y$ in $\mathcal{O}(n^2)$ once the factorisation has been done, computing the full $n$ -vector of LOO residuals is essentially free relative to the $\mathcal{O}(n^3)$ already paid for posterior inference.

9.6.5.1What the LOO RMSE tells us.¶

Tracking

\mathrm{LOO\text{-}RMSE}(\mathcal{D}^s) \;=\; \sqrt{\frac{1}{n^s}\sum_{i=1}^{n^s}\bigl(\mu_{-i}(\x^{(i)}) - t_i^s\bigr)^2}

(9.21)

across VFI iterations is a cheap surrogate-health metric that is independent of the Bellman residual:

A flat-then-rising LOO curve at the same design size signals kernel mis-specification (length scale collapsing, noise variance hitting a bound) and tells us to revisit the kernel choice or hyperparameter bounds before adding more design points.
A high LOO RMSE at small $n^s$ that decays as the design grows is the expected behaviour and tells us that the surrogate simply needs more labels.
A small LOO RMSE coexisting with a large Bellman residual points the finger at the operator, not the surrogate: the iterate may be far from the fixed point even though the GP fits the current $TV^{s-1}$ well.

The notebook 04_GP_Value_Function_Iteration.ipynb computes (9.20) via scipy.linalg.cho_solve (function gp_loo_rmse) and reports it alongside the Bellman residual at every VFI iteration, separating these two failure modes. The same diagnostic reappears in Section 10.1.4 for the GP layer over the SMM moment map.

9.6.6Active Learning Inside the VFI Loop¶

The Bayesian active-learning machinery of Section 9.3 carries over almost verbatim once the VFI loop is in place, but with one important adjustment: the goal is now uniform interpolation accuracy of the value function on the relevant state-space region, not maximisation of a payoff or minimisation of a loss. The right acquisition function is therefore pure exploration, the GP posterior standard deviation, rather than a UCB or expected-improvement criterion that trades off exploitation and exploration:

\x^{\mathrm{next}} \in \argmax_{\x \in \mathcal{X}^\mathrm{cand}} \sigma_\mathrm{GP}^s(\x).

(9.22)

A practical implementation, used in notebook 04_GP_Value_Function_Iteration.ipynb, runs the VFI iterations with a frozen design for a few steps, then enriches the design every $N$ iterations:

Evaluate $\sigma_\mathrm{GP}^s$ on a dense candidate set $\mathcal{X}^\mathrm{cand}$ (a Latin-hypercube draw over the current state-space region).
Pick the top- $n_\mathrm{add}$ candidates by posterior standard deviation, subject to a minimum-spacing constraint $\|\x^{(\mathrm{new})} - \x^{(j)}\| \ge \delta_\mathrm{spacing}$ against existing design points to avoid clustering.
Evaluate the Bellman operator at the new points (one expensive oracle call each, in parallel) and append them to $\mathcal{D}^s$ .
Refit the GP and continue iterating.

9.6.6.1Why pure exploration here.¶

A UCB-style acquisition would bias the design toward states with high value, which is not what we want when the surrogate is a building block of an iteration. We want the GP posterior to be uniformly tight wherever the Bellman operator might be evaluated, so that the contraction modulus of the approximate operator stays close to $\beta$ . This is structurally different from the optimisation setting of Bayesian optimisation, where exploitation is a feature.

9.6.6.2Empirical impact.¶

The companion notebook compares a same-budget fixed Latin-hypercube design with an active design inside the one-dimensional GP-VFI loop. Both designs use Bellman labels; the active design starts from a small initial set and adds states by maximising the GP posterior standard deviation (9.22) subject to a spacing rule. At the same final number of labels, the active design lowers posterior uncertainty and achieves a comparable or smaller dense-grid Bellman residual (Figure Figure 9.10). The figure is one-dimensional by design: the goal is to show active enrichment inside a genuine Bellman iteration, not to use a separable interpolation toy as a proxy for multidimensional dynamic programming.

Same-budget active enrichment inside one-dimensional GP value-function iteration. The left panel compares the GP posterior means from a fixed Latin-hypercube design and an active design against a reference GP-VFI solution. The middle panel shows the posterior standard deviation and the states added by the active rule , marked as triangles on the horizontal axis. The right panel reports the dense-grid Bellman residual. Unlike the previously used two-dimensional separable interpolation benchmark, every plotted training value here is generated by a Bellman maximization. Generated by notebook lecture_14_04_GP_Value_Function_Iteration.ipynb. — Figure 9.10:Same-budget active enrichment inside one-dimensional GP value-function iteration. The left panel compares the GP posterior means from a fixed Latin-hypercube design and an active design against a reference GP-VFI solution. The middle panel shows the posterior standard deviation and the states added by the active rule (9.22), marked as triangles on the horizontal axis. The right panel reports the dense-grid Bellman residual. Unlike the previously used two-dimensional separable interpolation benchmark, every plotted training value here is generated by a Bellman maximization. Generated by notebook `lecture_14_04_GP_Value_Function_Iteration.ipynb`.

9.6.7Results: From One to 500 Dimensions¶

The companion notebook deliberately stops at the one-dimensional Bellman problem, where every numerical object can be inspected directly. The high-dimensional claims below come from the ASGP implementation of Scheidegger & Bilionis (2019), where the GP operates on an active subspace rather than on the full tensor-product state space; we report their findings as a literature summary, not as something the notebook itself demonstrates.

Scheidegger & Bilionis (2019) apply the ASGP-VFI algorithm to the stochastic optimal growth model (9.19) across a range of dimensions, from $D = 1$ (the textbook Brock--Mirman case) to $D = 500$ continuous state variables. The key findings are:

Convergence is dimension-independent in the active subspace: all models converge to a normalized error below 10^-4 within approximately 70 VFI iterations regardless of $D$ , where the iteration count is dimension-independent in the $m$ -dimensional active subspace on which the GP actually operates rather than in the full ambient $\R^D$ .
Low-dimensional structure: the active subspace dimension is $m = 1$ in all cases tested, so the value function depends on a single linear combination of the $D$ capital stocks, confirming that the high-dimensional problem has intrinsic low-dimensional structure.
Sub-exponential scaling: CPU time grows sub-exponentially with $D$ (concave on a log scale when parallelized across compute nodes); on the benchmarks reported in Scheidegger & Bilionis (2019), $D = 500$ is solved at sub-exponential cost in $D$ when distributed across a compute cluster. This avoids the explicit grid-based curse but does not eliminate all high-dimensional sample/optimization costs.
Beyond the reach of grids: grid-based methods such as Smolyak sparse grids cannot handle $D > 100$ , making ASGP one demonstrated grid-free approach for models at this scale; neural residual methods such as DEQNs are complementary alternatives when the equilibrium conditions are easier to express directly.

9.6.8Uncertainty Quantification and Ergodic Sets¶

9.6.8.1Free UQ from the GP posterior.¶

Because the value function at each VFI iteration is a GP, the posterior variance $\sigma^2(\x)$ provides a built-in measure of approximation uncertainty. This gives pointwise credible bands for the interpolated value function essentially for free. Policy uncertainty is not automatic in the same sense: it must be propagated through the Bellman maximization, for example by evaluating policies under posterior draws or local approximations to the GP posterior.

9.6.8.2Parameter uncertainty propagation.¶

Uncertain model parameters can be treated as additional pseudo-states. For example, extending the state to $\tilde{S} = (\bm{k}, \gamma)$ where $\gamma \in [\underline{\gamma}, \bar{\gamma}]$ is the risk aversion parameter allows solving the model once on the extended state space and then extracting univariate effects and global sensitivity indices in a single computation, avoiding the traditional approach of re-solving the model for each parameter value. Harenberg et al. (2019) give a general introduction to uncertainty quantification for economic models, covering univariate-effect plots and Sobol’ sensitivity indices.

9.6.8.3Learning ergodic sets.¶

Many economic models have irregularly shaped ergodic sets (ellipsoids or manifolds, not hypercubes). Scheidegger & Bilionis (2019) propose learning the ergodic distribution via Bayesian Gaussian mixture models: (i) solve the model on a large initial domain; (ii) simulate the economy forward to generate capital paths $\{\bm{k}_t^+\}$ ; (iii) fit a mixture of Gaussians $\hat{\rho}_\mathrm{ergodic}$ to the simulated paths; (iv) re-solve on the ergodic set by sampling training points from $\hat{\rho}_\mathrm{ergodic}$ instead of the full hypercube. GPs handle this naturally because they require no grid structure.

9.6.9Comparison: GPs vs. DEQNs for Solving Dynamic Models¶

Both GPs (this chapter) and DEQNs (Chapters Chapter 2--Chapter 5) solve dynamic stochastic models, but with different trade-offs (Table Table 9.2):

Table 9.2:GP/ASGP and DEQN solvers for dynamic models. The table distinguishes exact fixed-point theory from the numerical approximation actually used: GP-VFI inherits the Bellman contraction only up to interpolation error, while DEQNs are judged by residual and simulation diagnostics.

Criterion	GP / ASGP	DEQNs
Solution method	Value function iteration	Euler equation residuals
Dimensionality	Up to $\sim$ 500 (with AS)	$>$ 1000 feasible
UQ built-in	Yes (GP posterior variance)	No (needs extra work)
Hardware	CPU clusters (MPI)	GPUs (TensorFlow/PyTorch)
Irregular domains	Yes (grid-free)	Yes (mesh-free)
Sensitivity analysis	Cheap after pseudo-state augmentation	Requires pseudo-states or re-training
Convergence diagnostics	Exact Bellman operator is a contraction; GP interpolation error must be controlled	Euler residuals and simulation tests; no generic global proof

When to use which: GPs when the effective dimension is moderate ( $D \lesssim 15$ , or a few hundred only when active-subspace structure is strong) and uncertainty quantification or sensitivity analysis is required; DEQNs when $D$ is very large, GPU hardware is available, or the model involves complicated market-clearing conditions that are more naturally expressed as Euler equation residuals than as Bellman maximization.

9.7Deep Kernel Learning¶

Standard GP kernels such as the RBF or Matérn operate on raw input features: the covariance $k(\x, \x')$ depends on $\|\x - \x'\|$ , which may be a poor measure of similarity when the inputs are high-dimensional or when the function’s structure depends on complex nonlinear interactions among variables. Deep Kernel Learning (DKL) Wilson et al., 2016 addresses this limitation by combining the representation-learning power of deep neural networks with the probabilistic framework of GPs.

9.7.0.1The DKL architecture.¶

A DKL model consists of two components:

A feature extractor $\phi(\x; \bm{\theta}_\mathrm{NN})\colon \R^D \to \R^d$ , typically a multi-layer perceptron that maps the raw input $\x$ to a learned representation of (usually much lower) dimension $d$ .
A GP layer that places a GP prior on the function $g(\z) = g(\phi(\x; \bm{\theta}_\mathrm{NN}))$ , with a standard base kernel $k_\mathrm{base}(\z, \z')$ (e.g., RBF or Matérn) operating in the learned feature space.

The deep kernel is thus:

k_\mathrm{DKL}(\x, \x') = k_\mathrm{base}\bigl(\phi(\x; \bm{\theta}_\mathrm{NN}),\; \phi(\x'; \bm{\theta}_\mathrm{NN})\bigr).

(9.23)

Unlike a standard kernel that computes distance in the input space, the DKL kernel computes distance in a learned feature space. If the neural network learns to map functionally similar inputs close together, the GP can exploit this structure for better predictions and more calibrated uncertainty estimates.

9.7.0.2Joint training.¶

The parameters of both the neural network ( $\bm{\theta}_\mathrm{NN}$ ) and the GP hyperparameters ( $\ell, \sigma_f, \sigma_y$ of $k_\mathrm{base}$ ) are optimized jointly by maximizing the GP marginal likelihood:

\max_{\bm{\theta}_\mathrm{NN},\, \bm{\vartheta}} \; \log p(\bm{y}\,|\,\Phi(\X; \bm{\theta}_\mathrm{NN}),\, \bm{\vartheta}),

(9.24)

where $\Phi(\X; \bm{\theta}_\mathrm{NN})$ is the matrix of transformed features. This end-to-end training procedure automatically learns features that are useful for the GP, without requiring manual feature engineering. In practice, DKL is implemented via GPyTorch Gardner et al., 2018, which provides efficient GPU-accelerated GP inference and integrates seamlessly with PyTorch for the neural network component.

9.7.0.3When DKL helps.¶

DKL is most beneficial when:

The input space is high-dimensional but the function has low-dimensional structure that is nonlinearly embedded (active subspaces find linear structure; DKL can find nonlinear features).
Sufficient training data is available to train the feature extractor without overfitting (DKL has more parameters than a standard GP).
Uncertainty quantification is important, but a standard GP kernel is insufficiently expressive to capture the true function’s covariance structure.

The trade-off relative to a standard GP is clear: DKL offers greater expressiveness at the cost of more parameters and the risk of overfitting with very small datasets. For the moderate-data regimes typical of economic surrogate models ( $N \sim 100$ --1000), DKL can offer significant improvements over standard kernels, particularly when the target function has complex multi-scale behavior. Chen et al. (2025) apply learned-feature-map GP architectures in a dynamic model of private asset allocation, where the feature map captures the complex nonlinear interactions between illiquidity, portfolio composition, and optimal rebalancing.

9.7.0.4Illustrative examples.¶

Two simple examples highlight when DKL provides a qualitative advantage over standard kernels:

Step functions. Consider approximating a 1D function with a sharp discontinuity. A standard GP with an RBF kernel necessarily oversmooths near the jump and yields poorly calibrated uncertainty bands. A DKL model, by contrast, can learn a feature map that “compresses” the input near the discontinuity, effectively sharpening the GP’s resolution where it matters most. This is directly relevant to economic applications with occasionally binding constraints, where policy functions exhibit kinks or jumps.
Anisotropic boundaries in 2D. Standard stationary kernels (RBF, Matérn) are isotropic: they measure distance with circular level sets. When the target function has a discontinuity along a diagonal or curved boundary, common in portfolio problems with no-trade regions or in models with regime-dependent policies, the isotropic kernel cannot adapt. DKL learns a nonlinear coordinate transformation that aligns the kernel’s smoothness assumptions with the function’s actual structure, capturing diagonal and curved boundaries that would require an impractical number of training points with a standard kernel.

9.8GPs Among Their Bayesian Cousins¶

Gaussian processes are the most analytically transparent way to attach uncertainty to a non-parametric regressor, but they are not the only one. For completeness, this section briefly situates the GP machinery against two neural-network-based alternatives that come up frequently in modern uncertainty-aware deep learning.

9.8.0.1Monte-Carlo dropout.¶

Gal & Ghahramani (2016) show that a deep network trained with dropout and evaluated with dropout still active can be interpreted as approximate variational inference over a particular Bayesian neural network. Predictive uncertainty is obtained by averaging $T$ stochastic forward passes; the cost is one extra forward pass per sample. Calibration is rougher than a GP’s, but the method requires no architectural change and scales to deep nets and large $n$ .

9.8.0.2Deep ensembles.¶

Lakshminarayanan et al. (2017) train $E$ independent neural networks with different random seeds and combine them into a Gaussian-mixture predictor. Empirically this is one of the most robust uncertainty-quantification recipes available, often beating MC dropout and approaching the calibration of GPs, at $E$ times the training cost.

Table 9.3:Three uncertainty-quantification recipes compared on the dimensions that drive method choice in economic surrogate work. This script uses GPs in Chapter Chapter 9 and Chapter Chapter 11 because the typical regime ( $n \lesssim 10^3$ , expensive simulator) plays to their strengths; readers operating in larger-data regimes should consider MC dropout or deep ensembles before resorting to a sparse GP.

	Gaussian process	MC dropout	Deep ensembles
Calibration	exact under model	approximate	strong empirically
Training cost	$\mathcal{O}(n^3)$	one network	$E\times$ one network
Inference cost	$\mathcal{O}(n^2)$	$T$ forward passes	$E$ forward passes
Sample efficiency	best at small $n$	needs much more data	needs much more data
Best when	$n \lesssim 10^4$ , low $d$	cheap UQ on existing nets	willing to pay $E\times$ for top calibration
Reference	Rasmussen & Williams (2005)	Gal & Ghahramani (2016)	Lakshminarayanan et al. (2017)

9.8.0.3A decision rule for practice.¶

In our experience the right method follows from the application: (i) plain GPs for moderate $d$ ( $\lesssim 10$ --20) with a smooth target and an expensive simulator, when calibrated uncertainty is the goal; (ii) deep kernels (Wilson & al., (2016)) when the input geometry is non-trivial (regime switches, manifold structure, image-like inputs); (iii) deep ensembles or MC dropout for high- $d$ regression where calibrated uncertainty is desirable but exact GP inference is infeasible; (iv) sparse GPs (Titsias (2009) Hensman et al. (2013)) for $n \gtrsim 10^4$ when the target stays smooth. The summary frame in the companion deck (“Toolbox: When to Use What”) gives the same decomposition visually.

9.9Further Reading¶

Rasmussen & Williams (2005), the standard GP textbook.
Constantine (2015), the active-subspaces monograph.
Renner & Scheidegger (2018) Scheidegger & Bilionis (2019), GP+BAL methodology and applications in economics.
Wilson et al. (2016), deep kernel learning.
Titsias (2009) Hensman et al. (2013), sparse-GP scaling.

9.10Exercises¶

Worked solutions and guidance for these exercises appear in Appendix Appendix F.

Exercise 9.4

[Computational] Deep vs. linear active subspace on a radial ridge. Take the target $y(\xi) = \exp\!\bigl(-[(w_1^\top\xi)^2 + (w_2^\top\xi)^2]\bigr)$ with $\xi \sim \mathcal{N}(0, I_{20})$ and $w_1, w_2 \in \R^{20}$ fixed orthonormal. (i) Compute $\hat C$ from $4\,000$ samples and confirm that it has two nonzero eigenvalues. (ii) Train the Tripathy--Bilionis surrogate (9.11) with widths (9.12), Swish activation (9.13), and elastic-net loss (9.14) for $d \in \{1,2,3,4\}$ ; plot held-out MSE on a log scale and identify the elbow. (iii) Explain why linear AS needs $d \ge 2$ while deep AS already captures the response at $d = 1$ (cf. notebook 09_Deep_Active_Subspace_Ridge).

Exercise 9.6

[Computational] Sobol sensitivity with a GP surrogate. Write a short script, or extend notebook 02_GP_and_BAL.ipynb, to train a GP surrogate on the 4-dimensional Genz product-peak function $f(\bm x) = \prod_{i=1}^{4}(c_i^{-2} + (x_i - w_i)^2)^{-1}$ on $[0,1]^4$ with $c = (1, 2, 0.5, 1.5)$ , $w = (0.4, 0.6, 0.3, 0.7)$ , using $N \in \{50, 100, 200\}$ training points. For each $N$ , compute the Sobol first-order and total-effect indices in two ways: (i) direct on the true $f$ via $10{,}000$ Monte Carlo samples (reference), (ii) on the GP surrogate via 10⁶ samples (cheap thanks to the surrogate). Plot the relative error in each Sobol index against $N$ . Verify that the surrogate-based Sobol estimates converge to the reference as $N$ grows, and identify the $N$ at which all four first-order indices match the reference within $5\%$ . Discuss why surrogate-based sensitivity analysis is the workhorse for expensive simulators (climate models, structural macro models) where direct 10⁶-sample Monte Carlo is infeasible.

Exercise 9.7

[Core] Prior-driven RBF-GP extrapolation outside the training domain. Train a GP with RBF kernel on 20 points sampled uniformly from $[0, 1]$ from the function $f(x) = \sin(2\pi x)\,e^{-x}$ . (i) Plot the posterior mean and $\pm 2$ standard deviation band on the training interval $[0,1]$ ; verify the posterior tracks the true function and the band is narrow. (ii) Now extrapolate to $x \in [1.5, 3.0]$ and plot the posterior mean and band over the extended interval. Show analytically that for an RBF kernel with length scale $\ell$ , the posterior at $x$ far from any training point ( $|x - x_i| \gg \ell$ for all $i$ ) reverts to the prior: posterior mean $\to 0$ , posterior variance $\to \sigma_f^2$ . (iii) Verify numerically: at $x = 3$ , the posterior mean is essentially zero (independent of the training data), and the posterior standard deviation has expanded back to the prior $\sigma_f$ . (iv) Discuss the implication: far from the training domain the posterior reverts to the prior, so the variance band there reflects the learned hyperparameters rather than the data, and the band is overconfident only when the prior variance or the learned length scale is itself misleading. Naive Bayesian active learning that relies on posterior variance may therefore fail to acquire informative samples outside the convex hull when the prior variance is no larger than the inside-hull noise scale. Mitigations: use a Matérn- $\nu$ kernel with smaller $\nu$ (heavier-tailed), bound the analysis to the convex hull of the training data, or use a boundary-aware acquisition function.

Footnotes¶

https://dp.quantecon.org/
↩

References¶

Rasmussen, C. E., & Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
MacKay, D. J. C. (1992). Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4), 590–604.
Krause, A., Singh, A., & Guestrin, C. (2008). Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9, 235–284.
Deisenroth, M. P., Rasmussen, C. E., & Peters, J. (2009). Gaussian process dynamic programming. Neurocomputing, 72(7), 1508–1524.
Engel, Y., Mannor, S., & Meir, R. (2005). Reinforcement Learning with Gaussian Processes. Proceedings of the 22nd International Conference on Machine Learning.
Scheidegger, S., & Bilionis, I. (2019). Machine learning for high-dimensional dynamic stochastic economies. Journal of Computational Science, 33, 68–82. https://doi.org/10.1016/j.jocs.2019.03.004
Renner, P., & Scheidegger, S. (2018). Machine learning for dynamic incentive problems. 10.2139/ssrn.3282487
Friedl, A., Kübler, F., Scheidegger, S., & Usui, T. (2023). Deep Uncertainty Quantification: With an Application to Integrated Assessment Models.
Titsias, M. K. (2009). Variational Learning of Inducing Variables in Sparse Gaussian Processes. Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS).
Hensman, J., Fusi, N., & Lawrence, N. D. (2013). Gaussian Processes for Big Data. Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI).
Srinivas, N., Krause, A., Kakade, S., & Seeger, M. (2010). Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. Proceedings of the 27th International Conference on Machine Learning, 1015–1022.
Constantine, P. G. (2015). Active Subspaces: Emerging Ideas for Dimension Reduction in Parameter Studies. SIAM.
Tripathy, R. K., & Bilionis, I. (2018). Deep UQ: Learning Deep Neural Network Surrogate Models for High Dimensional Uncertainty Quantification. Journal of Computational Physics, 375, 565–588.
Bilionis, I., & Zabaras, N. (2016). Bayesian uncertainty propagation using Gaussian processes. In Handbook of Uncertainty Quantification (pp. 1–45). Springer International Publishing. 10.1007/978-3-319-11259-6_16-1
Deisenroth, M. P., & Rasmussen, C. E. (2011). PILCO: A Model-Based and Data-Efficient Approach to Policy Search. Proceedings of the 28th International Conference on Machine Learning.