Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

1 Introduction to Machine Learning and Deep Learning

University of Lausanne

The Preface argued that quantitative economics needs new tools because the models of interest, heterogeneous-agent economies, OLG models with aggregate risk, integrated assessment models, all share state spaces too large for traditional grid-based methods. This chapter takes that argument as given and supplies the technical machinery the rest of the manuscript will build on. Readers who want a fuller pitch for why deep learning is the right response should re-read the Preface; readers who already accept the premise can dive directly into the machinery below.

We begin with a brief refresher on the motivation, mostly to fix vocabulary and citations, then survey the three broad paradigms of machine learning (supervised, unsupervised, and reinforcement learning), and then develop the core technical machinery: neural network architectures, loss functions, gradient-based optimization, backpropagation, weight initialization, activation functions, and the modern theory of generalization including the double descent phenomenon. Readers already comfortable with these topics may skim this chapter and proceed to Chapter Chapter 2, where the economic applications begin.

The foundational references for the material in this chapter include McCulloch & Pitts (1943), Hebb (1949) and Rosenblatt (1958) for the historical origins of artificial neurons and the first learning rules, Cybenko (1989) and Hornik et al. (1989) for universal approximation, Rumelhart et al. (1986) for backpropagation, Robbins & Monro (1951) for the stochastic-approximation roots of SGD, Geman et al. (1992) for the bias/variance dilemma that underpins modern generalization theory, Kingma & Ba (2015) for the Adam optimizer, Ioffe & Szegedy (2015) and Srivastava et al. (2014) for batch normalization and dropout, and Goodfellow et al. (2016) for a comprehensive textbook treatment, complemented by the historical survey of Schmidhuber (2015).

1.1Motivation and Applications

The past decade has witnessed a remarkable convergence of three developments that have transformed machine learning from a niche academic pursuit into a practical tool of extraordinary power: the availability of large-scale datasets, the advent of massively parallel hardware (GPUs and TPUs), and algorithmic innovations in training deep neural networks (Figure Figure 1.1). While much of the public attention has focused on applications such as image recognition, natural language processing, and game playing, the implications for economics and finance are equally profound.

The three enablers of the deep-learning revolution: large-scale data, massively parallel compute, and algorithmic improvements. None of the three alone is sufficient; their co-availability since the early 2010s is what has turned neural networks from a niche academic curiosity into a workhorse scientific and industrial tool.

Figure 1.1:The three enablers of the deep-learning revolution: large-scale data, massively parallel compute, and algorithmic improvements. None of the three alone is sufficient; their co-availability since the early 2010s is what has turned neural networks from a niche academic curiosity into a workhorse scientific and industrial tool.

Deep learning has already demonstrated its potential across a broad range of economic applications. In macroeconomics, neural networks serve as global approximators of policy and value functions in high-dimensional equilibria that classical grid-based methods cannot reach Maliar et al., 2021Azinovic et al., 2022; heterogeneous-agent extensions encode cross-sectional distributions via histograms or permutation-invariant moment networks Young, 2010Han et al., 2024Yang et al., 2025. Search-and-matching models with aggregate shocks Payne et al., 2025 and continuous-time macro-finance settings requiring HJB approximation or deep-BSDE solvers Gopalakrishna, 2024Duarte et al., 2024Han et al., 2018 add further coverage, as do optimal monetary policy rules under persistent supply shocks Nuño et al., 2024. In climate economics, surrogate-based workflows solve integrated assessment models and derive Pareto-improving carbon-tax rules in OLG--IAMs with deep uncertainty Kübler et al., 2026Folini et al., 2025Fernández-Villaverde et al., 2025. In finance, surrogate models accelerate option pricing, sovereign-default computation, and portfolio optimization Hutchinson et al., 1994Scheidegger & Treccani, 2018Arellano, 2008Gaegauf et al., 2023Chen et al., 2025. For structural estimation, neural-network surrogates make inference tractable and enable global uncertainty quantification for IAMs Kase et al., 2022Friedl et al., 2023Chen et al., 2026. Finally, these methods connect to an earlier generation of neural-network function approximation in economic computation: adaptive learning Chen & White, 1998, derivative pricing Hutchinson et al., 1994, and parameterized expectations Duffy & McNelis, 2001, with Valaitis & Villa (2024) showing how the parameterized-expectations approach extends to contemporary deep architectures; structural discrete-choice estimation rounds out the historical picture Norets, 2012.

Two themes cut across these application areas and motivate the rest of the script. First, every application area listed there involves a state space whose dimension grows with the number of agents, assets, shocks, or climate states; tensor-product grids become infeasible long before the modeling questions become uninteresting. Second, neural networks are universal approximators with cost that scales with parameter count rather than with NdN^d, so they are the natural replacement function class once the problem becomes high-dimensional. Subsequent chapters take this through-line and develop it: Chapter Chapter 2 introduces the DEQN methodology that all macro / heterogeneous-agent / search applications above share; Chapter Chapter 3 scales it to a 100+-country benchmark; Chapters Chapter 7--Chapter 8 develop the continuous-time analogue; Chapter Chapter 9 develops the Gaussian-process methodology, and Chapters Chapter 10--Chapter 11 put the deep-surrogate and GP machinery to work on structural estimation and integrated assessment models.

In this course, we focus on the recent advances enabled by deep neural networks, modern hardware, and algorithmic innovations in training.

For central banks, in particular, these tools address pressing practical needs. Many of the models listed above, such as heterogeneous-agent New Keynesian (HANK) models, search and matching models with aggregate shocks, overlapping-generations (OLG) economies with aggregate risk, and integrated assessment models coupling climate and economic dynamics, involve state spaces of dimension d5d \gg 5, where traditional grid-based numerical methods are computationally infeasible: a tensor-product grid with NN points per dimension requires NdN^{d} nodes, the canonical curse of dimensionality of Bellman (1961). Deep learning provides a mesh-free function approximator. For Barron-class functions, two-layer networks achieve dimension-independent rates in the number of hidden units (often stated as squared L2L^2 error of order O(1/n1)\mathcal{O}(1/n_1) under a bounded Barron norm), whereas tensor-product grids for generic smooth functions suffer rates such as O(M2/d)\mathcal{O}(M^{-2/d}) in the total number M=NdM=N^d of grid nodes. The point is not that every economic object satisfies a Barron bound, but that neural approximators avoid the mechanical tensor-product explosion. Chapter Chapter 2 returns to this comparison with concrete numbers for a six-shock DSGE.

Specifically, this course focuses on using deep neural networks as computational tools for: (i) solving high-dimensional dynamic stochastic general equilibrium (DSGE) models, (ii) approximating value and policy functions in continuous-time settings via physics-informed neural networks, (iii) constructing fast and accurate surrogate models for parameter estimation and uncertainty quantification, and (iv) leveraging Gaussian processes for sample-efficient Bayesian active learning Azinovic et al., 2022Friedl et al., 2023Chen et al., 2026. Supporting techniques include neural architecture search and adaptive multi-objective loss balancing (one option, ReLoBRaLo Bischof & Kraus, 2025, is developed in the notebooks; alternatives include SoftAdapt and GradNorm). For a comprehensive textbook treatment of the foundations, we refer the reader to Goodfellow et al. (2016) and Chollet (2017); for concise surveys we recommend LeCun et al. (2015).

1.2Types of Machine Learning

Before diving into the technical details, it is useful to recall the three broad paradigms of machine learning, each defined by the nature of the available data and the feedback signal provided to the algorithm.

1.2.1Supervised Learning

Given a set of labeled input--output pairs {(x(i),y(i))}i=1m\{(\x^{(i)}, y^{(i)})\}_{i=1}^{m}, the goal is to learn a mapping h:XYh:\mathcal{X}\to\mathcal{Y} that generalizes to unseen inputs. The two main tasks are regression and classification.

1.2.1.0.1Regression

(yRy \in \R): predict a continuous target from input features. A simple linear model takes the form

hθ(x)=θ0+θ1x,h_{\bm{\theta}}(x) = \theta_0 + \theta_1 x,

where the parameters θ=(θ0,θ1)\bm{\theta} = (\theta_0, \theta_1) are learned from data. Figure Figure 1.2 illustrates regression on a house-price dataset: each dot is a training observation, and the line is the fitted model.

Supervised learning: regression. The model h_{\bm{\theta}}(x) = \theta_0 + \theta_1 x (red line) is fitted to observed house prices (blue dots).

Figure 1.2:Supervised learning: regression. The model hθ(x)=θ0+θ1xh_{\bm{\theta}}(x) = \theta_0 + \theta_1 x (red line) is fitted to observed house prices (blue dots).

Classification. (y{0,1,,K}y \in \{0,1,\dots,K\}): assign an input x\x to one of KK discrete categories. A linear classifier predicts class 1 whenever wx+b>0\w^\top \x + b > 0 and class 0 otherwise. Figure Figure 1.3 shows a credit-scoring example: applicants are classified as low-risk or high-risk based on income and savings, and the dashed line is the learned decision boundary.

Supervised learning: classification. A linear decision boundary separates low-risk (blue circles) from high-risk (red crosses) applicants in the income--savings feature space.

Figure 1.3:Supervised learning: classification. A linear decision boundary separates low-risk (blue circles) from high-risk (red crosses) applicants in the income--savings feature space.

1.2.2Unsupervised Learning

Given only unlabeled data {x(i)}i=1m\{\x^{(i)}\}_{i=1}^{m}, the goal is to discover hidden structure without any target signal. Two common tasks are:

Figure Figure 1.4 illustrates a clustering task: unlabeled data points in two dimensions are partitioned into three clusters, each indicated by a different color and centroid marker.

Unsupervised learning: clustering. Unlabeled data points are grouped into three clusters; the + markers indicate cluster centroids. No target labels are used; the algorithm discovers the grouping from the data alone.

Figure 1.4:Unsupervised learning: clustering. Unlabeled data points are grouped into three clusters; the ++ markers indicate cluster centroids. No target labels are used; the algorithm discovers the grouping from the data alone.

1.2.3Reinforcement Learning

In reinforcement learning, an agent interacts with an environment over a sequence of time steps. At each step tt, the agent observes a state sts_t, selects an action at=π(st)a_t = \pi(s_t) according to its policy π\pi, and receives a reward rtr_t from the environment. The goal is to learn a policy that maximizes the expected cumulative discounted return:

maxπ  Eπ ⁣[t=0γtrt],γ[0,1),\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t \, r_t\right], \qquad \gamma \in [0,1),

where Eπ[]\mathbb{E}_{\pi}[\,\cdot\,] is taken over the trajectory distribution induced jointly by the policy π\pi and the (possibly stochastic) environment dynamics, starting from a given initial-state distribution. Figure Figure 1.5 illustrates this agent--environment interaction loop.

Reinforcement learning: the agent--environment loop. The agent observes a state, takes an action, and receives a reward signal. Over time, it learns a policy \pi that maximizes cumulative discounted reward.

Figure 1.5:Reinforcement learning: the agent--environment loop. The agent observes a state, takes an action, and receives a reward signal. Over time, it learns a policy π\pi that maximizes cumulative discounted reward.

Example: an algorithmic trader learning an execution strategy by optimizing realized profit over sequences of order placements; or a central bank learning an interest-rate rule by maximizing a welfare criterion over simulated macroeconomic trajectories.

1.3Course Focus: Supervised vs. Unsupervised Learning

This course begins with the supervised learning paradigm, which provides the essential building blocks: choosing a parameterized model, defining a loss function, and minimizing it via gradient descent.

The core methods of this course, DEQNs and PINNs, are not supervised in the classical labeled-data sense. More precisely, they are self-supervised residual methods: the economic equilibrium conditions or governing equations generate the training signal. To see why, recall the key distinction: in supervised learning, the loss function measures the discrepancy between the network’s prediction y^(i)\hat{y}^{(i)} and a known target label y(i)y^{(i)}, for example, a mean squared error 1mi(y^(i)y(i))2\frac{1}{m}\sum_i (\hat{y}^{(i)} - y^{(i)})^2. This requires a dataset of correct input--output pairs.

In DEQNs and PINNs, no such labels exist. Consider the key differences:

In both cases, the training data consists only of input locations (sampled states or collocation points) with no associated output labels. The loss is defined entirely by the structure of the economic model or the governing equation, not by example solutions. They are therefore unsupervised in the narrow sense of using no labels, but the more informative term is equation-based self-supervision.

Despite this fundamental difference, the optimization machinery is shared: these approaches define a loss J(θ)J(\bm{\theta}) over trainable parameters and minimize it via (stochastic) gradient descent. This is why we introduce the supervised learning pipeline first in the next section: it establishes the model--loss--optimizer framework that DEQNs and PINNs then adapt by replacing the data-driven loss with a physics-based one.

1.4The Supervised Learning Pipeline

Every supervised learning algorithm follows the same three-step recipe, regardless of whether the model is a linear regression, a random forest, or a deep neural network (Figure Figure 1.6). Understanding this pipeline is essential because the DEQN and PINN methods in later chapters modify step 2 (replacing data-driven losses with physics-based residuals) while keeping steps 1 and 3 intact.

The three-step supervised-learning recipe that underpins every model in this course. Choose a parametric hypothesis h(\x;\bm{\theta}), measure its misfit on a labeled dataset via a loss J(\bm{\theta}), and minimize J over the parameter vector \bm{\theta}. DEQNs and PINNs modify step 2 (replacing the data-driven loss with an equilibrium or PDE residual) while keeping steps 1 and 3 identical.

Figure 1.6:The three-step supervised-learning recipe that underpins every model in this course. Choose a parametric hypothesis h(x;θ)h(\x;\bm{\theta}), measure its misfit on a labeled dataset via a loss J(θ)J(\bm{\theta}), and minimize JJ over the parameter vector θ\bm{\theta}. DEQNs and PINNs modify step 2 (replacing the data-driven loss with an equilibrium or PDE residual) while keeping steps 1 and 3 identical.

Given a training set {(x(i),y(i))}i=1m\{(\x^{(i)}, y^{(i)})\}_{i=1}^{m} of input--output pairs, we seek a hypothesis h:XYh:\mathcal{X}\to\mathcal{Y} parameterized by θ\bm{\theta} that minimizes the empirical risk. For regression problems, the default choice is the mean squared error (MSE):

J(θ)=1mi=1m(hθ(x(i))y(i))2.J(\bm{\theta}) = \frac{1}{m}\sum_{i=1}^{m}\big(h_{\bm{\theta}}(\x^{(i)}) - y^{(i)}\big)^{2}.

This loss is not chosen arbitrarily. If the data are generated by

y(i)=hθ(x(i))+ε(i),ε(i)N(0,σ2),y^{(i)} = h_{\bm{\theta}}(\x^{(i)}) + \varepsilon^{(i)}, \qquad \varepsilon^{(i)} \sim \mathcal{N}(0,\sigma^2),

then the log-likelihood of the sample is, up to constants, proportional to i(hθ(x(i))y(i))2-\sum_i (h_{\bm{\theta}}(\x^{(i)}) - y^{(i)})^2. Minimizing MSE is therefore equivalent to maximum likelihood under homoskedastic Gaussian observation noise. This is one reason why squared error is the natural benchmark loss for regression; see Bishop (2006) Goodfellow et al. (2016)Deisenroth et al. (2020).

For classification tasks, the model must output class probabilities. In the binary case (K=2K=2), the simplest approach passes a single scalar score zz through the sigmoid function,

p  =  σ(z)  =  11+ez    (0,1),p \;=\; \sigma(z) \;=\; \frac{1}{1+e^{-z}} \;\in\; (0,1),

and assigns class 1 whenever p>0.5p > 0.5, equivalently whenever z>0z > 0 (Figure Figure 1.7). For K>2K>2 classes the natural generalization maps a raw score vector zRK\bm{z}\in\mathbb{R}^K onto the probability simplex via softmax: y^k=ezk/jezj\hat{y}_k = e^{z_k}/\sum_j e^{z_j}. In both cases the misfit between predicted probabilities and true labels is measured by the cross-entropy loss:

J=1mi=1mk=1Kyk(i)logy^k(i),J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}\log \hat{y}_k^{(i)},

where y^k(i)\hat{y}_k^{(i)} is the predicted probability that observation ii belongs to class kk. If the target is encoded as a one-hot vector, exactly one component of (y1(i),,yK(i))(y_1^{(i)},\dots,y_K^{(i)}) equals one and all others are zero, so for an observation whose true class is cc the loss contribution reduces to logy^c(i)-\log \hat{y}_c^{(i)}. The model is rewarded for assigning high probability to the correct class and penalized heavily when that probability is near zero.

The origin of cross-entropy is again likelihood theory. In binary classification, with label y{0,1}y \in \{0,1\} and predicted success probability pp, the Bernoulli log-likelihood is

logL=ylogp+(1y)log(1p).\log L = y \log p + (1-y)\log(1-p).

Negating and averaging this expression gives the binary cross-entropy. The KK-class formula above is the corresponding negative log-likelihood for a categorical distribution with probabilities generated by sigmoid (K=2K=2) or softmax (K>2K>2). Cross-entropy is therefore the statistically natural loss whenever the model output is meant to represent class probabilities; see again Bishop (2006) Deisenroth et al. (2020).

Binary classification with a sigmoid output. A scalar score z (the model’s raw output) is mapped to a probability p = \sigma(z) \in (0,1). The prediction rule assigns class 1 whenever p > 0.5, equivalently whenever z > 0. The dashed lines mark the decision threshold; no neural-network architecture is assumed---any model that produces a real-valued score can be combined with this mapping and the binary cross-entropy loss.

Figure 1.7:Binary classification with a sigmoid output. A scalar score zz (the model’s raw output) is mapped to a probability p=σ(z)(0,1)p = \sigma(z) \in (0,1). The prediction rule assigns class 1 whenever p>0.5p > 0.5, equivalently whenever z>0z > 0. The dashed lines mark the decision threshold; no neural-network architecture is assumed---any model that produces a real-valued score can be combined with this mapping and the binary cross-entropy loss.

Binary cross-entropy and mean squared error as functions of the predicted class probability p. Cross-entropy rises much more sharply near confident mistakes, which is why it is usually better aligned with probabilistic classification.

Figure 1.8:Binary cross-entropy and mean squared error as functions of the predicted class probability pp. Cross-entropy rises much more sharply near confident mistakes, which is why it is usually better aligned with probabilistic classification.

Figure Figure 1.8 makes the practical difference visible. If the true label is y=1y=1 but the model predicts a very small pp, then the cross-entropy loss explodes because logp-\log p \to \infty as p0p \downarrow 0. The same holds symmetrically when y=0y=0 and p1p \uparrow 1. Mean squared error does penalize mistakes, but it does so much more mildly near the boundaries. For probabilistic classification, that weaker penalty is usually undesirable because it does not strongly discourage overconfident wrong predictions.

The optimization is performed via gradient descent or one of its stochastic variants, which we discuss in Section Section 1.6.

1.4.1Beyond MSE: Robust and Asymmetric Losses

MSE is optimal under Gaussian noise, but real-world economic and financial data often contain outliers and heavy tails that inflate the squared penalty disproportionately. Two classical alternatives are useful in this course and beyond.

1.4.1.0.1Huber loss.

Introduced by Huber (1964) in the context of robust location estimation, the Huber loss behaves like MSE near the origin and like L1L_1 in the tails, capping the influence of any single observation:

δ(r)={12r2if rδ,δ(r12δ)otherwise,ryy^.\ell_\delta(r) = \begin{cases} \tfrac{1}{2} r^2 & \text{if } |r| \leq \delta, \\ \delta\,(|r| - \tfrac{1}{2}\delta) & \text{otherwise,} \end{cases} \qquad r \equiv y - \hat{y}.

The threshold δ\delta controls the transition and is typically chosen to be a few times the noise scale. Huber loss retains the smoothness needed for gradient-based optimization while reducing the weight of extreme residuals, which makes it the default choice for regression problems with suspected outliers.

1.4.1.0.2Quantile (pinball) loss.

Koenker & Bassett (1978) proposed the check function

τ(r)=max ⁣(τr,  (τ1)r),τ(0,1),\ell_\tau(r) = \max\!\bigl(\tau\, r,\; (\tau-1)\, r\bigr), \qquad \tau \in (0,1),

whose minimizer is the conditional τ\tau-quantile of yy given x\x rather than the conditional mean. Setting τ=0.5\tau = 0.5 recovers the median (absolute-error) regression; setting τ=0.05\tau = 0.05 or τ=0.95\tau = 0.95 targets the lower or upper tail. In financial risk management this is precisely the statistic of interest: τ=0.05\tau = 0.05 yields a neural-network estimator of the lower-tail 5%5\%-quantile of returns, which corresponds to Value-at-Risk (VaR) at the conventional 5%5\% level, and averaging the pinball loss over many quantiles traces out the full conditional distribution of returns, an approach known as quantile regression or distributional regression.

1.5From Perceptrons to Deep Networks

The building block of every neural network is the artificial neuron, first proposed by McCulloch & Pitts (1943). A companion question, how the synaptic weights wiw_i themselves should adapt with experience, was first addressed by Hebb (1949), whose rule “neurons that fire together, wire together” (Δwij=ηxiyj\Delta w_{ij} = \eta\, x_i y_j) is the conceptual ancestor of all gradient-based learning rules discussed below. Rosenblatt (1958) then introduced the Perceptron, the first trainable binary classifier built on these ideas. A single neuron computes a weighted linear combination of its inputs, adds a bias term, and passes the result through a nonlinear activation function:

y^=g ⁣(w0+xw),\hat{y} = g\!\big(w_0 + \x^\top \w\big),

where w=(w1,,wd)\w = (w_1, \dots, w_d)^\top are the synaptic weights, w0w_0 is the bias, and g()g(\cdot) is the activation function (Figure Figure 1.9).

An artificial neuron in the McCulloch--Pitts lineage. Inputs x_i are multiplied by synaptic weights w_i, summed into a pre-activation z, and passed through a nonlinear activation g(\cdot) to yield the output \hat{y}. The original  unit used a binary threshold for g; the modern artificial neuron generalizes this to arbitrary smooth activations, and all deep networks are compositions of neurons of this form.

Figure 1.9:An artificial neuron in the McCulloch--Pitts lineage. Inputs xix_i are multiplied by synaptic weights wiw_i, summed into a pre-activation zz, and passed through a nonlinear activation g()g(\cdot) to yield the output y^\hat{y}. The original McCulloch & Pitts (1943) unit used a binary threshold for gg; the modern artificial neuron generalizes this to arbitrary smooth activations, and all deep networks are compositions of neurons of this form.

Common choices for gg include the sigmoid σ(z)=(1+ez)1\sigma(z) = (1+e^{-z})^{-1}, the hyperbolic tangent tanh(z)\tanh(z), and the rectified linear unit ReLU(z)=max(0,z)\mathrm{ReLU}(z) = \max(0,z) Nair & Hinton, 2010Glorot et al., 2011. Without a nonlinear activation, any composition of linear layers collapses to a single affine map, a mathematical fact of fundamental importance: W2(W1x+b1)+b2=W2W1x+(W2b1+b2)\W_2(\W_1\x + \bb_1) + \bb_2 = \W_2\W_1\x + (\W_2\bb_1 + \bb_2).

1.5.0.0.1From a single neuron to a layer.

The single-neuron equation y^=g(w0+xw)\hat y = g(w_0 + \x^\top \w) produces a scalar output. In practice we want vector-valued outputs (and, more importantly, vector-valued intermediate features). A layer of nn parallel neurons, each with its own weights w(j)\w^{(j)} and bias w0(j)w_0^{(j)}, is a vector-valued generalization

a  =  g ⁣(Wx+b),WRn×d, bRn, aRn,\bm{a} \;=\; g\!\big(\W \x + \bb\big), \qquad \W \in \R^{n \times d},\ \bb \in \R^{n},\ \bm{a} \in \R^{n},

where the nonlinearity gg is applied componentwise. Each row of W\W is the weight vector of one neuron; stacking nn of them gives the matrix W\W at once, so the layer evaluates nn neurons in a single matrix--vector product.

1.5.0.0.2From one layer to a deep composition.

A single hidden layer is already a universal approximator (Cybenko (1989)Hornik et al. (1989); the universal-approximation theorem stated in the next subsection), but its hidden-layer width can grow exponentially in the input dimension to attain a target accuracy. Stacking layers on top of one another reuses earlier features as inputs to later neurons; the resulting compositional representation is dramatically more efficient for many functions of interest Telgarsky, 2016Barron, 1993. A deep feedforward network with LL layers is therefore a nested composition of layer maps:

f(x;θ)=gL ⁣(W(L)gL1 ⁣(g1 ⁣(W(1)x+b(1))+b(L1))+b(L)).f(\x;\bm{\theta}) = g_L\!\Big(\W^{(L)} g_{L-1}\!\big(\cdots g_1\!\big(\W^{(1)}\x + \bb^{(1)}\big)\cdots + \bb^{(L-1)}\big) + \bb^{(L)}\Big).

The architecture that implements (1.12) is sketched in Figure Figure 1.10.

An L-layer deep feedforward network. Each layer applies an affine map followed by a pointwise nonlinearity; the composition realizes Eq. . Depth (rather than width) is what gives neural networks their efficient representational power for compositionally structured functions.

Figure 1.10:An LL-layer deep feedforward network. Each layer applies an affine map followed by a pointwise nonlinearity; the composition realizes Eq. (1.12). Depth (rather than width) is what gives neural networks their efficient representational power for compositionally structured functions.

A useful geometric intuition, popularized by Chollet (2017), is that each layer of the network performs a nonlinear coordinate transformation, successively “untangling” the manifold on which the data lies. In the input space, the data may be entangled in complex ways (e.g., two classes forming concentric spirals); each hidden layer warps the space so that the data become progressively more linearly separable. By the final hidden layer, a simple linear readout suffices. This perspective, formalized also by Goodfellow et al. (2016), explains why depth is so powerful: each layer adds an additional coordinate transformation, and the composition of many simple transformations can represent very complex mappings with far fewer parameters than a single, wide layer would require.

1.5.0.0.3Other architectures.

This course focuses almost entirely on feedforward networks of the form (1.12), because DEQNs and PINNs operate on unstructured state vectors xRd\x \in \R^d for which feedforward maps are the natural choice. For structured inputs other architecture families exist and are used widely elsewhere: convolutional networks LeCun et al., 2015 for image data, graph neural networks for relational data, and Transformers (Section Section 1.11.4) for sequences. We mention them so that readers who encounter these models in the empirical-finance or applied-ML literatures know where they fit; none are required for the methods developed in later chapters.

The universal approximation theorem Cybenko, 1989Hornik et al., 1989 guarantees that even a single hidden layer with sufficiently many neurons can approximate any continuous function on a compact set to arbitrary precision. However, in practice, deep (multi-layer) networks achieve the same accuracy with exponentially fewer parameters than wide (single-layer) ones, which motivates the use of depth; Telgarsky (2016) makes this precise by exhibiting compositional functions that a depth-LL network can represent in poly(d,L)\mathrm{poly}(d,L) parameters but for which any depth-(L1)(L-1) network requires width exponential in LL. Barron (1993) provides classical dimension-independent approximation rates for Barron-class targets, often stated as squared L2L^2 error of order O(1/n1)\mathcal{O}(1/n_1) in the hidden width, whereas tensor-product methods for generic smooth functions scale poorly in the total number of grid nodes. This qualified comparison is the formal version of the “deep learning can beat grids” argument that motivates DEQNs in Chapter Chapter 2.

1.6Training: Loss Functions, Gradient Descent, and Backpropagation

Given a loss function J(θ)J(\bm{\theta}), training proceeds by iteratively updating the parameters in the direction of steepest descent:

θθη ⁣θJ(θ),\bm{\theta} \leftarrow \bm{\theta} - \eta\,\nabla_{\!\bm{\theta}} J(\bm{\theta}),

where η>0\eta > 0 is the learning rate. In this introductory chapter θ\bm{\theta} denotes generic trainable model parameters; in later chapters, when structural parameters and neural-network weights appear together, the script uses θ\bm{\theta} for the structural parameters and ρ\rho for the network weights. Computing the gradient  ⁣θJ\nabla_{\!\bm{\theta}} J for a deep network is achieved through backpropagation Rumelhart et al., 1986, an efficient application of the chain rule that propagates error signals from the output layer back to the input layer. Appendix Appendix B collects the matrix-calculus identities and the one-paragraph reverse-mode AD summary used throughout the script.

To see why the chain rule is central, consider a network with a single hidden layer. Let z=W(1)x+b(1)\z = \W^{(1)}\x + \bb^{(1)}, a=g(z)\a = g(\z), and y^=w(2)a+b(2)\hat{y} = \w^{(2)\top}\a + b^{(2)} with loss J=12(y^y)2J = \tfrac{1}{2}(\hat{y}-y)^2. Define the “delta” at the hidden layer:

δ(1)=Jy^y^aaz=(y^y)w(2)g(z).\bm{\delta}^{(1)} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \a} \cdot \frac{\partial \a}{\partial \z} = (\hat{y}-y)\,\w^{(2)} \odot g'(\z).

Then the weight gradient follows immediately:

JW(1)=δ(1)x.\frac{\partial J}{\partial \W^{(1)}} = \bm{\delta}^{(1)}\,\x^\top.

The key insight of Rumelhart et al. (1986) is that this chain rule application can be organized as a single backward pass through the network, reusing intermediate quantities (the δ\bm{\delta} vectors) from the forward pass. The resulting algorithm has computational cost proportional to the forward pass; there is no need for finite differences or other expensive gradient approximations.

In practice, evaluating the full gradient over all mm training examples is expensive. Stochastic gradient descent (SGD), whose roots go back to the stochastic-approximation scheme of Robbins & Monro (1951), replaces the full sum with a random mini-batch B{1,,m}\mathcal{B} \subset \{1, \dots, m\}, yielding an unbiased estimate of the empirical-risk gradient at much lower cost per iteration:

 ⁣θJ^B=1BiB ⁣θ(hθ(x(i)),y(i)).\widehat{\nabla_{\!\bm{\theta}} J}_{\mathcal{B}} = \frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \nabla_{\!\bm{\theta}}\,\ell\bigl(h_{\bm{\theta}}(\x^{(i)}),\, y^{(i)}\bigr).

With mini-batch sizes of 32--256, SGD achieves both computational efficiency and an implicit regularization effect from gradient noise. Two strands of theoretical work explain why this matters in deep nets specifically: the loss landscape of a deep network is dominated by saddle points rather than isolated bad local minima Dauphin et al., 2014, and SGD’s gradient noise tends to bias training toward flat rather than sharp regions of the loss surface, which often generalize better Keskar et al., 2017. Even on linearly separable data, gradient descent on the logistic loss converges to the maximum-margin solution, an instance of the broader principle that the optimizer itself imposes an implicit bias that contributes to generalization Soudry et al., 2018. For a comprehensive modern review of stochastic optimization for large-scale learning (including convergence rates, adaptive methods, and variance reduction), see Bottou et al. (2018).

1.6.1The Adam Optimizer

Modern optimizers such as Adam Kingma & Ba, 2015 adapt the learning rate for each parameter based on running averages of the first and second moments of the gradient:

mt=β1mt1+(1β1)Jt,vt=β2vt1+(1β2)(Jt)2,m^t=mt/(1β1t),v^t=vt/(1β2t),θt+1=θtηm^t/(v^t+ε).\begin{aligned} \bm{m}_t &= \beta_1\,\bm{m}_{t-1} + (1-\beta_1)\,\nabla J_t, \\ \bm{v}_t &= \beta_2\,\bm{v}_{t-1} + (1-\beta_2)\,(\nabla J_t)^2, \\ \hat{\bm{m}}_t &= \bm{m}_t/(1-\beta_1^t), \qquad \hat{\bm{v}}_t = \bm{v}_t/(1-\beta_2^t), \\ \bm{\theta}_{t+1} &= \bm{\theta}_t - \eta \cdot \hat{\bm{m}}_t / (\sqrt{\hat{\bm{v}}_t} + \varepsilon). \end{aligned}

The bias-corrected first moment m^t\hat{\bm{m}}_t provides momentum (smoothing out gradient noise), while the second moment v^t\hat{\bm{v}}_t provides per-parameter adaptive learning rates (parameters with large gradients receive smaller effective steps). The default hyperparameters β1=0.9\beta_1=0.9, β2=0.999\beta_2=0.999, ε=108\varepsilon=10^{-8} work well across a wide range of problems, including all the economic applications in this course.

1.6.2The Optimizer Family Tree: Momentum, RMSprop, Adam, AdamW

Adam did not appear out of thin air; it inherits from a family of refinements to plain SGD whose interactions are worth being explicit about for readers who will tune optimizers in practice. Table Table 1.1 traces the lineage; each row is a one-line modification of the row above it.

Table 1.1:Lineage from plain SGD to AdamW. Each row introduces exactly one new ingredient: momentum buffers gradient noise; RMSprop adds a per-parameter learning-rate scaling by the running second moment; Adam combines the two with bias correction; AdamW separates weight decay from the gradient step so that the implicit L2L_2 regularizer does not interact with the adaptive denominator. PINN training in continuous-time chapters (Chapters Chapter 7--Chapter 8) often uses Adam or AdamW; DEQNs in Chapters Chapter 2--Chapter 6 use plain Adam with the default (β1,β2)=(0.9,0.999)(\beta_1,\beta_2)=(0.9, 0.999) as in Azinovic et al. (2022).

OptimizerUpdate rule (one parameter)Reference
SGDθθηgt\theta \leftarrow \theta - \eta\,g_tRobbins & Monro (1951)
SGD + momentumvt=μvt1+gtv_t = \mu v_{t-1} + g_t;  θθηvt\theta \leftarrow \theta - \eta\,v_tSutskever et al. (2013)
RMSpropst=ρst1+(1ρ)gt2s_t = \rho s_{t-1} + (1-\rho)g_t^2;  θθηgt/st+ε\theta \leftarrow \theta - \eta\,g_t/\sqrt{s_t+\varepsilon}Tieleman & Hinton (2012)
Adammomentum on gtg_t and on gt2g_t^2 with bias correction (Eqs. above)Kingma & Ba (2015)
AdamWAdam plus decoupled weight decay θ(1ηλ)θηm^t/(v^t+ε)\theta \leftarrow (1-\eta\lambda)\theta - \eta\,\hat m_t/(\sqrt{\hat v_t}+\varepsilon)Loshchilov & Hutter (2019)

The Adam-vs-AdamW distinction is sharper than the one-line table entry suggests, so it is worth writing out both rules side by side. With m^t\hat m_t, v^t\hat v_t the bias-corrected first and second moment of the gradient and λ\lambda the weight-decay rate, Adam-with-L2L_2 (i.e. Adam applied to the loss J+λ2θ2J + \tfrac{\lambda}{2}\|\bm\theta\|^2) updates

θt+1  =  θt    ηm^t+λθtv^t+ε,\bm\theta_{t+1} \;=\; \bm\theta_t \;-\; \eta\,\frac{\hat m_t + \lambda\,\bm\theta_t}{\sqrt{\hat v_t}+\varepsilon},

so the implicit regularizer is itself rescaled by the adaptive denominator v^t+ε\sqrt{\hat v_t}+\varepsilon. AdamW separates the two:

θt+1  =  (1ηλ)θt    ηm^tv^t+ε,\bm\theta_{t+1} \;=\; (1-\eta\lambda)\,\bm\theta_t \;-\; \eta\,\frac{\hat m_t}{\sqrt{\hat v_t}+\varepsilon},

so the weight-decay term shrinks every parameter by the same proportional factor regardless of gradient magnitude. This is why AdamW recovers the textbook intuition “weight decay shrinks weights uniformly” that Adam-with-L2L_2 loses.

Figure Figure 1.11 gives a schematic comparison of the qualitative convergence patterns behind this optimizer family tree.

Schematic loss-trajectory comparison on a moderately ill-conditioned objective. Momentum and adaptive rescaling often improve early convergence relative to plain SGD, and Adam/AdamW are therefore useful defaults in the notebooks. The ranking is illustrative rather than universal: on some objectives, carefully tuned SGD or RMSprop can match or beat Adam-family methods.

Figure 1.11:Schematic loss-trajectory comparison on a moderately ill-conditioned objective. Momentum and adaptive rescaling often improve early convergence relative to plain SGD, and Adam/AdamW are therefore useful defaults in the notebooks. The ranking is illustrative rather than universal: on some objectives, carefully tuned SGD or RMSprop can match or beat Adam-family methods.

1.6.3Learning Rate Schedules

The choice of learning rate η\eta is arguably the single most important hyperparameter in deep learning. Too large, and the optimizer diverges; too small, and convergence is impractically slow. A popular schedule is cosine annealing Loshchilov & Hutter, 2017, which smoothly decays the learning rate according to:

η(t)=ηmin+12(ηmaxηmin)(1+cos(πt/T)),\eta(t) = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\bigl(1 + \cos(\pi\, t / T)\bigr),

where TT is the total number of training iterations. Figure Figure 1.12 compares the three learning-rate strategies used most often in practice.

Three common learning-rate schedules. A constant rate is simple but often converges slowly in the fine-tuning phase; exponential decay shrinks monotonically; cosine annealing  provides a smooth warm-to-cold transition that empirically performs well across a wide range of problems. DEQNs and PINNs typically use exponential decay or cosine annealing to polish the solution after the initial coarse-grained phase.

Figure 1.12:Three common learning-rate schedules. A constant rate is simple but often converges slowly in the fine-tuning phase; exponential decay shrinks monotonically; cosine annealing Loshchilov & Hutter, 2017 provides a smooth warm-to-cold transition that empirically performs well across a wide range of problems. DEQNs and PINNs typically use exponential decay or cosine annealing to polish the solution after the initial coarse-grained phase.

In practice, decaying schedules such as exponential decay or cosine annealing tend to refine solutions in the later stages of training, once the optimizer has found a good basin of attraction.

1.6.4Backpropagation: The Chain Rule at Scale

For a network with LL layers, denote the pre-activation at layer ll as z(l)=W(l)a(l1)+b(l)\z^{(l)} = \W^{(l)}\a^{(l-1)} + \bb^{(l)} and the activation as a(l)=g(z(l))\a^{(l)} = g(\z^{(l)}). If the final layer is linear, set g(z)=1g'(z)=1 at l=Ll=L and interpret a(L)\a^{(L)} as the prediction y^\hat y. The backpropagation algorithm computes J/W(l)\partial J / \partial \W^{(l)} for all layers simultaneously by propagating a “delta” vector backward:

δ(L)=a(L)Jg(z(L)),δ(l)=((W(l+1))δ(l+1))g(z(l)),l=L1,,1,\begin{aligned} \bm{\delta}^{(L)} &= \nabla_{\a^{(L)}} J \odot g'(\z^{(L)}), \\ \bm{\delta}^{(l)} &= \bigl((\W^{(l+1)})^\top \bm{\delta}^{(l+1)}\bigr) \odot g'(\z^{(l)}), \qquad l = L-1, \ldots, 1, \end{aligned}

where \odot denotes element-wise multiplication. The parameter gradients are then J/W(l)=δ(l)(a(l1))\partial J / \partial \W^{(l)} = \bm{\delta}^{(l)} (\a^{(l-1)})^\top and J/b(l)=δ(l)\partial J / \partial \bb^{(l)} = \bm{\delta}^{(l)}. The computational cost is linear in the number of layers and the total number of parameters, a remarkable efficiency that enables training networks with millions of parameters. Figure Figure 1.13 shows the forward and backward passes side by side.

Backpropagation as forward and backward passes through the network. The forward pass (blue) produces and caches every layer’s pre-activations \z^{(l)} and activations \a^{(l)}; the backward pass (red, dashed) propagates the “delta” vectors \bm{\delta}^{(l)} from the loss back to the inputs, reusing the cached quantities to compute all parameter gradients in a single sweep.

Figure 1.13:Backpropagation as forward and backward passes through the network. The forward pass (blue) produces and caches every layer’s pre-activations z(l)\z^{(l)} and activations a(l)\a^{(l)}; the backward pass (red, dashed) propagates the “delta” vectors δ(l)\bm{\delta}^{(l)} from the loss back to the inputs, reusing the cached quantities to compute all parameter gradients in a single sweep.

In modern deep learning frameworks such as TensorFlow and PyTorch, backpropagation is implemented automatically through computational graph tracing (“autograd”). The user only needs to define the forward computation; the framework handles the differentiation. This same automatic differentiation capability is what makes PINNs (Chapter Chapter 7) possible: the PDE residual requires derivatives of the network output with respect to its inputs, which autograd provides exactly and efficiently.

1.7Weight Initialization

The initialization of network weights has a profound effect on training dynamics. If weights are initialized too large, activations explode through the layers; if too small, they vanish to zero. In both cases, the gradient signal degrades and training stalls. The key principle is to choose initial weights so that the variance of activations remains approximately constant across layers.

1.7.0.0.1Xavier/Glorot initialization.

Glorot & Bengio (2010) derived the following rule for networks with symmetric activations (such as tanh\tanh). For a layer with ninn_\mathrm{in} input neurons and noutn_\mathrm{out} output neurons, initialize weights as:

WijU ⁣[6nin+nout,  6nin+nout]or equivalentlyWijN ⁣(0,  2nin+nout).W_{ij} \sim \mathcal{U}\!\left[-\sqrt{\frac{6}{n_\mathrm{in}+n_\mathrm{out}}},\; \sqrt{\frac{6}{n_\mathrm{in}+n_\mathrm{out}}}\right] \qquad\text{or equivalently}\qquad W_{ij} \sim \mathcal{N}\!\left(0,\; \frac{2}{n_\mathrm{in}+n_\mathrm{out}}\right).

This ensures Var[a(l)]Var[a(l1)]\mathrm{Var}[a^{(l)}] \approx \mathrm{Var}[a^{(l-1)}] under the assumption that activations are in the linear regime of tanh\tanh.

1.7.0.0.2He initialization.

For ReLU activations, He et al. (2015) showed that the weight variance should be doubled relative to the forward fan-in rule:

WijN ⁣(0,  2nin).W_{ij} \sim \mathcal{N}\!\left(0,\; \frac{2}{n_\mathrm{in}}\right).

The justification is a second-moment-preserving calculation, not a variance one. For a centered symmetric pre-activation zz,

E ⁣[ReLU(z)2]  =  0z2p(z)dz  =  12E ⁣[z2],\E{\mathrm{ReLU}(z)^2} \;=\; \int_0^\infty z^2 \, p(z) \, dz \;=\; \tfrac{1}{2}\,\E{z^2},

because p(z)p(z) is symmetric about zero, so the negative half of the integrand is killed and the positive half is preserved. Doubling the input weight variance therefore preserves the second moment E ⁣[(z(l))2]\E{(z^{(l)})^2} across layers under ReLU. Strictly speaking E ⁣[ReLU(z)]>0\E{\mathrm{ReLU}(z)} > 0, so the variance (centered second moment) is slightly smaller than the second moment, and the factor of 2 is an approximation rather than an identity; in practice the approximation is excellent and He initialization is the default for ReLU-family networks throughout this course.

1.8Activation Functions in Depth

Beyond the three classical choices (sigmoid, tanh, ReLU), several modern activation functions address specific shortcomings. Table Table 1.2 summarizes the options used in this course.

Table 1.2:Activation functions used throughout the course. Origin papers: ReLU Nair & Hinton, 2010, Leaky ReLU Maas et al., 2013, ELU Clevert et al., 2016, Swish Ramachandran et al., 2017, GELU Hendrycks & Gimpel, 2016, Mish Misra, 2019. Range is the set of output values for zRz \in \R. Smoothness matters when derivatives of the network output are needed: sigmoid, tanh, Swish, GELU, Mish, and softplus are CC^\infty; ReLU is only C0C^0; Leaky ReLU is piecewise linear; ELU is piecewise CC^\infty and is C1C^1 at the origin only for α=1\alpha=1. Smooth activations are required for PINN applications that involve second-order derivatives (Chapter Chapter 7).

ActivationFormulaRangeKey property
Sigmoidσ(z)=(1+ez)1\sigma(z) = (1+e^{-z})^{-1}(0,1)(0,1)Smooth, saturates
Tanhtanh(z)\tanh(z)(1,1)(-1,1)Zero-centered, saturates
ReLUmax(0,z)\max(0,z)[0,)[0,\infty)Non-saturating for z>0z>0
Leaky ReLUmax(αz,z)\max(\alpha z, z), α ⁣= ⁣0.1\alpha\!=\!0.1(,)(-\infty,\infty)No dead neurons
ELUz if z>0; α(ez1) elsez\text{ if }z>0;\ \alpha(e^z-1)\text{ else}[α,)[-\alpha,\infty)Negative saturation; C1C^1 if α=1\alpha=1
Swishzσ(z)z \cdot \sigma(z)[0.28,)\approx [-0.28,\infty)Smooth, non-monotone
GELUzΦ(z)z\cdot\Phi(z)[0.17,)\approx [-0.17,\infty)Smooth, default in BERT / GPT
Mishztanh(ln(1+ez))z\cdot\tanh(\ln(1+e^z))[0.31,)\approx [-0.31,\infty)Smooth, used in YOLOv4
Softplusln(1+ez)\ln(1+e^z)(0,)(0,\infty)Smooth ReLU approximation

Leaky ReLU and ELU address the dying-neuron issue by providing a small but nonzero gradient for negative inputs. The Swish activation swish(z)=zσ(z)\mathrm{swish}(z) = z\sigma(z) Ramachandran et al., 2017, which is used extensively in the DEQN and IRBC implementations of this course, combines the benefits of ReLU (non-saturating for large zz) with smoothness at the origin. Its derivative swish(z)=σ(z)+zσ(z)(1σ(z))\mathrm{swish}'(z) = \sigma(z) + z\sigma(z)(1-\sigma(z)) is smooth everywhere and bounded between approximately -0.1 and 1.1, which can improve optimization stability.

For PDE applications (Chapter Chapter 7), the choice of activation function is particularly important because the PINN loss involves derivatives of the network output. Since ReLU(z)=0\mathrm{ReLU}''(z) = 0 almost everywhere, a ReLU network cannot represent second-order PDE residuals faithfully. Smooth activations such as tanh\tanh (CC^\infty) or Swish are therefore required for PINN applications involving second-order PDEs. Figure Figure 1.14 plots seven representative activations from Table Table 1.2.

Seven representative activation functions from Table . Sigmoid and tanh saturate at large |z| (vanishing gradients); ReLU is non-saturating but kinked at the origin; Leaky ReLU and ELU repair the dead-neuron problem with a small negative response; Swish and Softplus are everywhere C^\infty, which the PINN chapter (Chapter ) requires.

Figure 1.14:Seven representative activation functions from Table Table 1.2. Sigmoid and tanh saturate at large z|z| (vanishing gradients); ReLU is non-saturating but kinked at the origin; Leaky ReLU and ELU repair the dead-neuron problem with a small negative response; Swish and Softplus are everywhere CC^\infty, which the PINN chapter (Chapter Chapter 7) requires.

1.9Vanishing and Exploding Gradients

A central obstacle to training deep networks is that the gradient signal reaching early layers can become either astronomically small or astronomically large as it is back-propagated through many layers. The backward recursion derived above for the “delta” vector is

δ(l)=((W(l+1))δ(l+1))g(z(l)),\bm{\delta}^{(l)} = \bigl((\W^{(l+1)})^\top \bm{\delta}^{(l+1)}\bigr) \odot g'(\z^{(l)}),

so the magnitude of δ(l)\bm{\delta}^{(l)} is governed, roughly, by the product of derivatives k=lLg(z(k))\prod_{k=l}^{L} g'(\z^{(k)}) and the norms W(k)\|\W^{(k)}\|. Two symmetric failure modes follow:

Three ingredients, each already introduced separately, combine to tame these problems:

  1. Non-saturating activations. ReLU has g(z)=1g'(z) = 1 for z>0z > 0, eliminating the (1/4)L(1/4)^L decay; Swish and tanh avoid vanishing when activations remain in a moderate range.

  2. Variance-preserving initialization. Xavier/Glorot Glorot & Bengio, 2010 and He He et al., 2015 pick Var[W]1/nin\mathrm{Var}[W] \propto 1/n_\mathrm{in} precisely so that Var[z(l)]\mathrm{Var}[\z^{(l)}] is constant across layers, keeping activations in the useful range of gg (Section Section 1.7).

  3. Batch normalization Ioffe & Szegedy, 2015. Re-centering and re-scaling the pre-activations of each mini-batch prevents them from drifting toward the saturated tails of gg during training and allows much larger learning rates. Its affine parameters (γ,β)(\gamma, \beta) are learned.

A practical complement is gradient clipping: if θJ\|\nabla_{\bm{\theta}} J\| exceeds a threshold, rescale it to the threshold. This eliminates the most damaging exploding-gradient events at negligible cost and is standard in RNN training; it is occasionally useful in DEQNs and PINNs when the residual magnitudes are highly unbalanced across collocation points.

1.9.1Batch Normalization

Among the three mitigations listed above, batch normalization Ioffe & Szegedy, 2015 (BN) deserves a closer look because it is simple to state, surprisingly effective in practice, and has become a default building block in supervised deep networks. At its core BN is the standardization trick familiar from regression, re-centering each variable to mean zero and rescaling it to unit variance, applied separately to every layer’s pre-activations and recomputed on every mini-batch of training data.

Let z1,,zBz_1,\dots,z_B denote the pre-activations at one neuron over the BB examples in a mini-batch B\mathcal B. Batch normalization replaces them by

μB=1Bi=1Bzi,σB2=1Bi=1B(ziμB)2,z^i=ziμBσB2+ε,yi=γz^i+β,\mu_{\mathcal B} = \frac{1}{B}\sum_{i=1}^{B} z_i, \quad \sigma^2_{\mathcal B} = \frac{1}{B}\sum_{i=1}^{B} (z_i-\mu_{\mathcal B})^2, \quad \hat z_i = \frac{z_i-\mu_{\mathcal B}}{\sqrt{\sigma^2_{\mathcal B}+\varepsilon}}, \quad y_i = \gamma\,\hat z_i + \beta,

where ε\varepsilon is a small constant for numerical stability and (γ,β)(\gamma,\beta) are learnable scalar parameters specific to that neuron. The transformed activation yiy_i is what the next layer sees. In the standard recipe BN is inserted between the linear map W()a(1)+b()\W^{(\ell)}\bm a^{(\ell-1)}+\bm b^{(\ell)} and the elementwise nonlinearity g()g(\cdot).

1.9.1.0.1Why standardization, layer by layer.

Without BN, the input distribution to a hidden layer \ell depends on every weight in layers 1,,11,\dots,\ell-1. As earlier weights update during gradient descent, the distribution faced by layer \ell drifts from one optimization step to the next: each layer therefore chases a moving target, a phenomenon Ioffe & Szegedy (2015) called internal covariate shift. BN pins the input distribution of every layer to mean zero and unit variance at every step (Figure Figure 1.15). Gradients become better conditioned, and substantially larger learning rates become safe.

1.9.1.0.2The role of the affine parameters.

At first glance the learnable shift and scale (γ,β)(\gamma,\beta) seem to undo the normalization that BN just imposed. This is exactly the point. If a layer happens to prefer non-standard inputs, for example a tanh layer that needs slightly negative pre-activations to operate in its linear regime, the network is free to recover them via (γ,β)(\gamma,\beta). BN therefore never reduces the network’s representational capacity; it merely shifts to a parameterization in which the optimization trajectory is easier to follow.

Distribution of pre-activations at one hidden neuron, sampled at three points during training. Left: without BatchNorm, the distribution drifts in mean and in variance as earlier layers update, each layer chases a moving target. Right: with BatchNorm, the affine pre-normalization transformation pins the inputs to \mathcal{N}(0,1) at every training step, before the learned scale \gamma and shift \beta are applied. The downstream layer always operates on inputs of the same scale, and the gradient signal flowing back is well conditioned.

Figure 1.15:Distribution of pre-activations at one hidden neuron, sampled at three points during training. Left: without BatchNorm, the distribution drifts in mean and in variance as earlier layers update, each layer chases a moving target. Right: with BatchNorm, the affine pre-normalization transformation pins the inputs to N(0,1)\mathcal{N}(0,1) at every training step, before the learned scale γ\gamma and shift β\beta are applied. The downstream layer always operates on inputs of the same scale, and the gradient signal flowing back is well conditioned.

1.9.1.0.3Why higher learning rates work.

A precise Lipschitz bound depends on the operator norms of the surrounding weight matrices, the activation derivatives, and the learned affine scale γ\gamma. The useful intuition is that BN reduces sensitivity to shifts and rescalings of intermediate activations, making the local optimization problem better conditioned and allowing step sizes that would otherwise cause divergence. Santurkar et al. (2018) argue that loss-landscape smoothing, rather than the original “internal covariate shift” interpretation, better explains why BN helps optimization; the two views are complementary, but the smoothing perspective is the more directly testable one.

1.9.1.0.4At inference time.

During training, BN uses the current mini-batch to compute μB,σB2\mu_{\mathcal B},\sigma^2_{\mathcal B}. At inference the mini-batch may be a single example, in which case those statistics would be ill-defined. Implementations therefore maintain a running average of μ\mu and σ2\sigma^2 across training mini-batches and use these fixed estimates at test time, so the network’s output is deterministic at deployment.

1.9.2Normalization Variants Beyond BatchNorm

Batch normalization is the original normalization trick, but for several common deep-learning settings, small mini-batches, recurrent / sequence models, generative models, transformers, it is not the right one. All variants share the form z^=(zμ)/σ\hat z = (z-\mu)/\sigma followed by a learned affine γz^+β\gamma\hat z + \beta; they differ only in which axes the statistics (μ,σ)(\mu,\sigma) are pooled over:

For the methods in this script, the practical guidance is: use BatchNorm for large supervised datasets (Chapter Chapter 1), LayerNorm for recurrent or attention-based architectures and as an option when the effective batch size is small, and skip normalization entirely for small DEQN or PINN MLPs when input/output scaling plus Adam already condition the problem well.

1.10Generalization: Overfitting, Regularization, and Double Descent

1.10.1Train / Validation / Test Split

Before discussing overfitting formally, it is essential to fix the experimental protocol that every supervised-learning study should follow. The available data are partitioned into three disjoint subsets:

The key discipline is that no decision about the model (not hyperparameter tuning, not architecture, not early-stopping patience) may be informed by the test set. Using the test set multiple times turns it into an implicit validation set and invalidates its role as a measure of out-of-sample error. For small datasets, kk-fold cross-validation replaces the fixed train/validation split: the training data are partitioned into kk equal folds; for each fold, the model is trained on the other k1k-1 folds and evaluated on the held-out fold; the kk resulting validation scores are averaged. Common choices are k=5k = 5 or k=10k = 10; the test set is always held separately. In DEQNs and PINNs (Chapters Chapter 2 and Chapter 7), training and validation points are drawn from the same state distribution, and “generalization” is measured against an independently simulated test trajectory rather than a held-out labeled set.

A model that memorizes the training data but fails on unseen examples is said to overfit. To understand overfitting precisely, consider the following thought experiment. The decomposition below is the classical bias/variance analysis of Geman et al. (1992), which provided the canonical framework for thinking about generalization in neural networks long before modern overparameterized regimes were studied. Suppose we draw many independent training sets D\mathcal{D}, each of size nn, from the same data-generating process y=f(x)+εy = f(\x) + \varepsilon, where ε\varepsilon is zero-mean noise with variance σ2\sigma^2. On each training set we fit our model, obtaining a predictor f^D\hat{f}_{\mathcal{D}}. Conditioning on a fixed test input x0\x_0 and averaging over both the training set and the new test noise ε0\varepsilon_0, the squared prediction error decomposes into exactly three terms:

ED,ε0 ⁣[(y0f^D(x0))2]=(f(x0)ED[f^D(x0)])2Bias2+ED ⁣[(f^D(x0)ED[f^D(x0)])2]Variance  +  σ2Irreducible noise.\begin{aligned} \mathbb{E}_{\mathcal{D},\varepsilon_0} \!\bigl[(y_0 - \hat{f}_{\mathcal{D}}(\x_0))^2\bigr] &= \underbrace{\bigl(f(\x_0) - \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(\x_0)]\bigr)^2}_{\text{Bias}^2} \notag\\ &\quad+ \underbrace{\mathbb{E}_{\mathcal{D}}\!\bigl[(\hat{f}_{\mathcal{D}}(\x_0) - \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(\x_0)])^2\bigr]}_{\text{Variance}} \;+\; \underbrace{\sigma^2}_{\text{Irreducible noise}}. \end{aligned}

Each term captures a distinct source of error:

In classical statistics, there is a fundamental trade-off: reducing bias requires more flexible models, which increases variance. However, modern deep networks challenge this picture. Zhang et al. (2017) showed empirically that standard architectures can perfectly fit randomly labeled data, implying a VC-style capacity far beyond what classical bounds predict, yet still generalize well on real data. Belkin et al. (2019) subsequently demonstrated that with sufficient overparameterization, models exhibit a double descent phenomenon: test error first increases as the model becomes more complex (classical regime), but then decreases again as the number of parameters greatly exceeds the number of data points (interpolation regime). Nakkiran et al. (2020) showed that this phenomenon extends to deep networks and occurs not only as a function of model size, but also as a function of training time (“epoch-wise double descent”) and dataset size.

The key techniques for preventing overfitting in neural networks are:

  1. Early stopping Prechelt, 1998: monitor validation loss and stop training when it begins to rise.

  2. Weight decay (L2L_2 regularization) Krogh & Hertz, 1991: add λ2θ2\frac{\lambda}{2}\|\bm{\theta}\|^2 to the loss.

  3. Dropout Srivastava et al., 2014: randomly drop a fraction pp of activations at every training step. Two implementation conventions exist. The original convention drops units during training and multiplies the outgoing weights or activations by the keep probability 1p1-p at test time, so that the expected activation matches the training-time expectation. The now-standard inverted-dropout convention divides the retained activations by 1p1-p during training, so no rescaling is needed at test time. Either way, the mechanism is equivalent to training, on each mini-batch, a different sub-network drawn from an exponentially large ensemble that shares weights; the final network approximates the ensemble average. Typical values are p=0.5p=0.5 for hidden layers and p=0.1p=0.1--0.2 for inputs. Dropout is less commonly used in DEQN and PINN applications, where the loss is already noisy (stochastic collocation) and regularization is often supplied implicitly by the state-space sampling scheme.

  4. Data augmentation: synthetically enlarge the training set via transformations.

  5. Batch normalization Ioffe & Szegedy, 2015: normalize activations within each mini-batch to stabilize training; its mini-batch statistics also act as a mild regularizer.

Schematic of the double-descent phenomenon. In the classical regime (p < n) test error follows the standard bias--variance U-curve; around the interpolation threshold p \approx n test error can peak sharply because the fitted function is highly sensitive to noise; in the modern overparameterized regime (p \gg n) test error decreases again . In some linearized, kernel, max-margin, or least-norm settings, gradient methods exhibit an implicit bias toward particular low-complexity interpolants; in nonlinear finite-width networks this bias depends on architecture, data, optimizer, initialization, and training protocol. Axes are unitless; the qualitative shape, not the scale, is the point. The curve is illustrative, not a measurement.

Figure 1.16:Schematic of the double-descent phenomenon. In the classical regime (p<np < n) test error follows the standard bias--variance U-curve; around the interpolation threshold pnp \approx n test error can peak sharply because the fitted function is highly sensitive to noise; in the modern overparameterized regime (pnp \gg n) test error decreases again Belkin et al., 2019Nakkiran et al., 2020. In some linearized, kernel, max-margin, or least-norm settings, gradient methods exhibit an implicit bias toward particular low-complexity interpolants; in nonlinear finite-width networks this bias depends on architecture, data, optimizer, initialization, and training protocol. Axes are unitless; the qualitative shape, not the scale, is the point. The curve is illustrative, not a measurement.

Figure Figure 1.16 illustrates why classical bias--variance intuition breaks down for modern deep networks. In the classical regime (p<np < n), increasing model capacity beyond a point leads to overfitting. At the interpolation threshold (pnp \approx n), the model has just enough parameters to perfectly fit the training data, and the resulting solution can be extremely sensitive to noise. In the modern regime (pnp \gg n), test error often decreases again because optimization and architecture bias select comparatively regular interpolating solutions rather than arbitrary ones Belkin et al., 2019.

This phenomenon has been documented across many architectures and datasets by Nakkiran et al. (2020), who showed that it persists even when controlling for effective model complexity. The implications for computational economics are substantial but should not be overstated. In DEQN and PINN applications, the practitioner controls both the network size (number of parameters pp) and the amount of training data (number of collocation points nn), and those collocation points are often resampled rather than fixed once and for all. Overparameterized networks can therefore be useful and sometimes necessary, but their credibility must be checked by independent residual diagnostics, simulated trajectories, and benchmark comparisons rather than by parameter counting alone.

The double descent phenomenon can be explored interactively in the companion notebook 03_Double_Descent.ipynb, which reproduces the spirit of the double-descent curve in a small kernel-regression / random-Fourier-features setting (Belkin et al. (2019)); see also Nakkiran et al. (2020) for the deep-network / CIFAR-10 version of the experiment, which the notebook does not attempt to replicate at scale.

1.10.2A Pointer to the Theory: Neural Tangent Kernel (NTK) and Benign Overfitting

Why does an over-parameterized network with pnp \gg n generalize instead of merely memorizing? Two complementary lines of theory have emerged.

1.10.2.0.1Neural Tangent Kernel (NTK).

In the limit of infinite width and a particular initialization scaling, gradient-descent training of a deep network is described by kernel gradient descent with a fixed, deterministic kernel, the Neural Tangent Kernel of Jacot et al. (2018). In that lazy-training limit the network’s function evolves almost linearly around its initialization, which explains why first-order optimization can be well behaved for very wide networks even though the finite-parameter objective is non-convex.

1.10.2.0.2Benign overfitting.

In linear and kernel settings, gradient methods often select a minimum-norm interpolating solution, which can behave much like a ridge-regularized least-squares estimator and inherit good generalization properties. Bartlett et al. (2020) make this precise for linear regression, showing that interpolation can be benign provided the spectrum of the input covariance has heavy enough tails. Subsequent work has extended both stories beyond the simplest settings; for the practitioner the takeaway is that the NTK regime helps explain why training succeeds, while benign-overfitting theory explains why interpolation need not imply poor test error under additional assumptions.

These two threads are not the final word: finite-width deviations from the NTK matter for feature learning, and benign overfitting requires conditions on the data covariance. They nevertheless help explain why the rest of this script can use networks substantially wider than a classical degrees-of-freedom calculation would recommend, provided the numerical residuals are validated out of sample.

1.11Sequence Models: RNNs, LSTMs, and Attention

The chapters that follow rely almost entirely on feedforward networks, but many economic and financial datasets are intrinsically sequential. Before closing this introduction, it is therefore useful to briefly survey the main architectures for sequence data. We represent a sequence as tokens x1,,xn\x_1, \dots, \x_n, where each token is a vector. The word “token” is borrowed from natural-language processing, where a token is typically a word or sub-word piece; in the economic and financial context we use throughout this course a token is simply one element of the sequence, for example a scalar return, a vector of macroeconomic variables at one quarter, or a price-volume pair at one tick. The algorithms below are agnostic to this choice; they only need the sequence elements to be real-valued vectors of a fixed dimension.

1.11.1Recurrent Neural Networks

The traditional approach to sequence data is the Recurrent Neural Network (RNN) Elman, 1990. Unlike an MLP, which maps an input to an output in a single forward pass, an RNN maintains an internal hidden state ht\h_t that acts as a memory of past information. At each time step tt, the network updates this state using the current input xt\x_t and the previous state ht1\h_{t-1}:

ht=σ(Whhht1+Wxhxt+b),\h_t = \sigma(\Wh \h_{t-1} + \Wx \x_t + \bb),

where σ\sigma is an activation function. Concretely: for a scalar time series xt\x_t is a scalar (e.g. log-return at date tt), htRd\h_t \in \R^d is a dd-dimensional hidden vector summarizing everything the network has seen so far, and Whh,Wxh,b\Wh,\Wx,\bb are learnable parameters. The same update is applied at every time step, so this recursive structure lets the network process sequences of arbitrary length with a fixed parameter budget. Figure Figure 1.17 shows the resulting unrolled computation graph.

An unrolled Recurrent Neural Network. The same parameters \Wh, \Wx are reused at every time step, allowing the hidden state \h_t to accumulate historical information.

Figure 1.17:An unrolled Recurrent Neural Network. The same parameters Whh,Wxh\Wh, \Wx are reused at every time step, allowing the hidden state ht\h_t to accumulate historical information.

1.11.1.0.1Training: Backpropagation Through Time (BPTT).

Because the unrolled RNN is a feedforward graph of depth TT with shared weights, one can apply ordinary backpropagation and then sum the weight-gradients across time. Concretely, let LT=t=1T(y^t,yt)\mathcal{L}_T = \sum_{t=1}^{T}\ell(\hat{\y}_t, \y_t) denote the total loss. For the recurrence above, the forward single-step Jacobian is

Jththt1=diag ⁣(σ(Whhht1+Wxhxt+b))Whh.J_t \equiv \frac{\partial \h_t}{\partial \h_{t-1}} = \mathrm{diag}\!\bigl(\sigma'(\Wh\h_{t-1}+\Wx\x_t+\bb)\bigr)\Wh.

With column gradients, the backward pass multiplies by JtJ_t^\top. Differentiating with respect to an early hidden state hk\h_k therefore yields the schematic product

hkLT    Jk+1Jk+2JThTLT,\nabla_{\h_k}\mathcal{L}_T \;\approx\; J_{k+1}^{\top}J_{k+2}^{\top}\cdots J_T^{\top}\,\nabla_{\h_T}\mathcal{L}_T,

so the gradient picks up TkT-k products of matrices containing the same recurrent weight matrix Whh\Wh. The relevant singular values or spectral radii of these factors determine the asymptotics. If they are mostly below one, the gradient vanishes exponentially in TkT-k and the network cannot learn dependencies that span many steps; if they are mostly above one, it explodes, producing NaNs during training. This is the vanishing/exploding gradient problem, originally analyzed by Hochreiter (1991) and developed formally in Bengio et al. (1994) Hochreiter et al. (2001), and revisited with a modern optimization-theoretic view by Pascanu et al. (2013).

Three practical remedies partially alleviate the pathology without changing the architecture. Gradient clipping Pascanu et al., 2013 rescales the parameter-gradient whenever its norm exceeds a threshold; it eliminates the worst exploding events at negligible cost and is now a standard training default for any recurrent model. Orthogonal or identity-like initialization of Whh\Wh places its singular values near 1 so that the Jacobian product preserves norms at the start of training. Truncated BPTT unrolls only KK steps back at each update, capping both the memory footprint and the effective gradient horizon at KK. These help but do not cure the problem: the structural fix is to change the recurrence itself, which is the move made by gated cells.

1.11.2Long Short-Term Memory (LSTM) and GRUs

Before writing equations, it helps to keep a plain-language picture in mind. A vanilla RNN asks one hidden state ht\h_t to do three jobs at once: retain useful old information, absorb new information, and expose the relevant part to the next layer. This is a fragile design. The LSTM separates these tasks. It keeps a dedicated memory lane Ct\bm{C}_t flowing through time and at each date makes three soft decisions: what to keep, what to write, and what to reveal. That is the intuition behind the gate equations below.

The LSTM cell Hochreiter & Schmidhuber, 1997 replaces the single-state recurrence ht=σ(Whhht1+Wxhxt)\h_t = \sigma(\Wh\h_{t-1} + \Wx\x_t) by a pair of states: a cell state Ct\bm{C}_t that flows along the top of the cell with additive updates, and a hidden state ht\h_t that is read off it. Three learned sigmoid gates, each depending on the concatenation [ht1,xt][\h_{t-1}, \x_t], control what information flows through:

ft=σ ⁣(Wf[ht1,xt]+bf)(forget gate)\bm{f}_t = \sigma\!\big(\W_f\,[\h_{t-1}, \x_t] + \bb_f\big) \text{(forget gate)}
it=σ ⁣(Wi[ht1,xt]+bi)(input gate)\bm{i}_t = \sigma\!\big(\W_i\,[\h_{t-1}, \x_t] + \bb_i\big) \text{(input gate)}
C~t=tanh ⁣(WC[ht1,xt]+bC)(candidate cell)\tilde{\bm{C}}_t = \tanh\!\big(\W_C\,[\h_{t-1}, \x_t] + \bb_C\big) \text{(candidate cell)}
Ct=ftCt1+itC~t(cell state update)\bm{C}_t = \bm{f}_t \odot \bm{C}_{t-1} + \bm{i}_t \odot \tilde{\bm{C}}_t \text{(cell state update)}
ot=σ ⁣(Wo[ht1,xt]+bo)(output gate)\bm{o}_t = \sigma\!\big(\W_o\,[\h_{t-1}, \x_t] + \bb_o\big) \text{(output gate)}
ht=ottanh(Ct)(hidden state).\h_t = \bm{o}_t \odot \tanh(\bm{C}_t) \text{(hidden state).}

Each of ft,it,ot(0,1)d\bm{f}_t, \bm{i}_t, \bm{o}_t \in (0,1)^{d} acts as a soft switch applied element-wise. The crucial structural change is in equation (1.34): the cell state is additively corrected rather than multiplicatively overwritten. Along the direct memory path, differentiating Ct\bm{C}_t with respect to Ct1\bm{C}_{t-1} contributes diag(ft)\mathrm{diag}(\bm{f}_t) in place of a full recurrent matrix product. The full derivative also contains indirect terms because the gates depend on ht1\h_{t-1} and hence on earlier cell states, but the direct path is the constant-error-carousel intuition: when the cell judges information worth keeping, it can open the forget gate (ft1\bm{f}_t \approx \bm{1}) and allow gradients to flow through as if the sequence were shorter. Figure Figure 1.18 sketches the resulting cell.

The LSTM cell. The green top lane is the protected memory lane: old memory can be kept, new information can be written additively, and the resulting state can later be revealed through the hidden output (visible output, blue bottom lane). This is the intuition behind the forget, input, candidate, and output components shown inside the cell.

Figure 1.18:The LSTM cell. The green top lane is the protected memory lane: old memory can be kept, new information can be written additively, and the resulting state can later be revealed through the hidden output (visible output, blue bottom lane). This is the intuition behind the forget, input, candidate, and output components shown inside the cell.

The Gated Recurrent Unit (GRU) of Cho et al. (2014) is a lighter sibling that merges the forget and input gates into a single update gate zt\bm{z}_t and drops the separate cell state in favor of the hidden state itself:

zt=σ(Wz[ht1,xt]+bz),rt=σ(Wr[ht1,xt]+br),h~t=tanh ⁣(Wh[rtht1,xt]+bh),ht=(1zt)ht1+zth~t.\begin{aligned} \bm{z}_t &= \sigma(\W_z\,[\h_{t-1}, \x_t] + \bb_z), \\ \bm{r}_t &= \sigma(\W_r\,[\h_{t-1}, \x_t] + \bb_r), \\ \tilde{\h}_t &= \tanh\!\big(\W_h\,[\bm{r}_t \odot \h_{t-1}, \x_t] + \bb_h\big), \\ \h_t &= (1 - \bm{z}_t) \odot \h_{t-1} + \bm{z}_t \odot \tilde{\h}_t. \end{aligned}

A GRU uses roughly 25%25\% fewer parameters than an LSTM of the same hidden size and performs comparably on many sequence tasks Chung et al., 2014, at the cost of a slightly less expressive memory channel.

LSTMs remain particularly effective for economic time-series tasks where capturing specialized temporal patterns is more important than massive scalability. For example, Holt et al. (2024) use LSTMs to detect Edgeworth cycles in retail gasoline-price data. These cycles are asymmetric, high-frequency price movements that are difficult to identify with traditional spectral analysis or simple rule-based methods. In this setting the LSTM architecture excels at identifying the characteristic sawtooth patterns (sudden price jumps followed by slow decays) across thousands of retail stations, providing a robust tool for antitrust analysis and competition policy.

1.11.3Limits of Recurrent Models

Even with gated cells, recurrent architectures retain two fundamental limitations that no amount of tuning can remove.

  1. Inherently sequential. Step tt cannot begin until step t1t-1 has finished, so the full computation takes TT serial wavefronts regardless of how many cores or tensor units are available. On a modern GPU that can evaluate thousands of MLP rows in parallel, this forced serialization makes RNN training slow and wall-clock expensive. Training a contemporary large language model on a trillion-token corpus with a plain LSTM would take years of wall-clock time rather than weeks.

  2. Path-length-dependent decay. Information from position 1 must travel through all intermediate hidden states to influence position TT, each hop applying a matrix and a nonlinearity. Gating mitigates but does not eliminate this decay, and empirically LSTM performance saturates on sequences well below the theoretical limit implied by the cell’s capacity.

Both problems have the same structural cause: the recurrence forces a path of length TT between the two extreme positions of the sequence. The remedy is to drop recurrence entirely and let every position read every other position directly, in parallel. This is the Transformer.

1.11.4The Transformer Architecture

The Transformer architecture Vaswani et al., 2017 replaced recurrence entirely with self-attention. This allows the model to weigh the importance of all tokens in a sequence simultaneously, enabling massive parallelization and better capturing global dependencies.[1]

For a first pass, four steps are enough. Add position information so the model knows order; project each token into a query, key, and value; compare each query to all keys; then take a weighted average of the values. Multi-heads, LayerNorm, residual connections, and pointwise MLPs are refinements of this core idea, not a different idea.

1.11.4.1The Self-Attention Mechanism

1.11.4.1.1Intuition: search, then retrieval.

Before writing the mechanism formally it is useful to see it as a soft library lookup. Picture a small library with nn shelves, one per position in the sequence. Each shelf ii carries three vectors: a query qiq_i (“what am I looking for?”), a key kik_i (“what is printed on my label?”), and a value viv_i (“what do I actually contain?”). All three are linear projections of the same input token xi\x_i, and the projection matrices WQ,WK,WV\W_Q,\W_K,\W_V are learned during training; the librarian (queries and keys) and the books (values) are co-designed for whatever task the training objective encodes.

To produce the updated representation oi\bm{o}_i at shelf ii, the following four steps happen.

  1. Score. Shelf ii’s query qiq_i is compared against every shelf’s key kjk_j; the dot product qi ⁣kjq_i^{\!\top} k_j is a similarity score, high when shelf jj’s label matches what shelf ii is looking for.

  2. Normalize. The nn scores are divided by dk\sqrt{d_k} (to keep them in a numerically sane range) and pushed through a softmax, producing probability weights αij[0,1]\alpha_{ij}\in[0,1] with jαij=1\sum_j \alpha_{ij}=1.

  3. Retrieve. The weights are applied to the values: oi=jαijvj\bm{o}_i = \sum_j \alpha_{ij}\, v_j is a weighted average of the nn shelf contents.

  4. Repeat. Steps 1--3 happen in parallel for every shelf ii, producing the whole output sequence o1,,on\bm{o}_1,\dots,\bm{o}_n in a single layer.

Shelves whose label matched the query contribute most to the output; shelves whose label was off-topic are almost ignored.

1.11.4.1.2Why this is useful: a worked example.

Consider the sentence “The cat sat on the mat. It purred.” For the model to process “it” properly, it must first decide what “it” refers to, the cat or the mat. Self-attention performs exactly this disambiguation: the query vector at the “it” position probes the key vectors at every earlier position, and the softmax converts the raw similarity scores into probability weights that concentrate most of the mass on the correct antecedent. Figure Figure 1.19 below illustrates the resulting pattern: the bulk of the weight lands on “cat”, and the updated representation at the “it” position is formed as a weighted average of the values, driven mostly by “cat”. The same mechanism, run in parallel for every position, produces all of o1,,on\bm{o}_1,\dots,\bm{o}_n in a single layer.

1.11.4.1.3A small concrete example.

Suppose we have only three shelves and two-dimensional qq’s and kk’s. Take the query at shelf 3 to be q3=(1,0)q_3 = (1,0) and the three keys k1=(0.9,0.1), k2=(0.1,0.8), k3=(1,0).k_1 = (0.9,\,0.1),\ k_2 = (0.1,\,0.8),\ k_3 = (1,\,0). The raw similarity scores are q3 ⁣k1=0.9q_3^{\!\top}k_1 = 0.9, q3 ⁣k2=0.1q_3^{\!\top}k_2 = 0.1, q3 ⁣k3=1.0q_3^{\!\top}k_3 = 1.0. Ignoring the 1/dk1/\sqrt{d_k} scaling to keep the arithmetic clean, the softmax weights come out to roughly (0.405,0.182,0.448)(0.405,\,0.182,\,0.448). Shelf 3’s output o3\bm{o}_3 is therefore a blend of the three values in those proportions: most mass on shelf 3 itself and on shelf 1 (the closest match in label space); shelf 2, whose label k2k_2 points in an orthogonal direction, contributes about 18%18\%. The softmax is “soft” precisely in this sense, a smoothed nearest-neighbor retrieval rather than a hard argmax over the most similar shelf.

The “self” in *self-*attention says that every shelf plays both roles simultaneously: every shelf’s query probes every shelf’s key, and every shelf receives an updated representation as output. A single attention layer therefore pairs all nn positions with all nn positions in parallel. Contrast this with an RNN or LSTM: to let position tt peek at information from position 1, the signal must be shuttled through every intermediate position via the hidden state, with each hop applying its own weight matrix and nonlinearity, so long-range information is either blurred or lost entirely. Attention short-circuits that chain: position tt reads position 1 directly, with no intermediate hops and no path-length-dependent attenuation. This is the architectural source of the Transformer’s advantage on long sequences, and the reason why attention-based models quickly displaced recurrent ones for language, code, and (increasingly) time-series forecasting.

Given a sequence of input vectors, stack the tokens as rows of XRn×d\X \in \R^{n\times d}. The attention mechanism computes three projections for each token: a Query (QQ), a Key (KK), and a Value (VV), using learned weight matrices WQ,WK,WV\W_Q, \W_K, \W_V:

Q=XWQ,K=XWK,V=XWV.Q = \X \W_Q, \qquad K = \X \W_K, \qquad V = \X \W_V.

The output of the attention layer is a weighted sum of the values, where the weights are determined by the compatibility (dot product) of the queries with the keys:

Attention(Q,K,V)=softmax(QKdk)V.\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V.

The scaling factor dk\sqrt{d_k} (the dimensionality of the keys) prevents the dot products from growing too large in magnitude, which would otherwise make the softmax unstable.

1.11.4.1.4An econometric lens.

The attention layer is a data-dependent, learnable kernel smoother. Compare it to the Nadaraya--Watson estimator f^(x)=iwi(x)yi\hat f(x) = \sum_i w_i(x)\,y_i with kernel-based weights wi(x)k(x,xi)w_i(x) \propto k(x, x_i): attention has exactly this form, but the similarity k(,)k(\cdot,\cdot) is the parametric bilinear form (q,k)q ⁣k/dk(q,k)\mapsto q^{\!\top}k/\sqrt{d_k} and both qq and kk are themselves learned projections of the input. From this vantage point the Transformer’s “magic” is less mysterious: it is a nonparametric smoother whose kernel the optimizer tunes to whatever task the training objective encodes. Self-attention further recovers the classical recurrence-free property that every pair of positions interacts in a single parallel layer, with no signal decay along the sequence.

Figure Figure 1.19 renders the attention pattern of the worked “cat/it” example on a compressed five-token version of the sentence. The output oit\bm{o}_{\textit{it}} is the new representation at the “it” position, formed as a weighted average of the values, with most weight coming from “cat”.

A worked self-attention pattern. Both q_{\textit{it}} and \bm{o}_{\textit{it}} sit above the “it” position: q_{\textit{it}} is the query projection of “it” (blue), and \bm{o}_{\textit{it}} is the updated representation at that same position (green). The blue arrows show how the query is built from “it” and then compared with every key. The softmax weights above each token sum to one; here the largest weight lands on “cat”, so the green aggregation arrow starts above “cat” and curves up to \bm{o}_{\textit{it}}, indicating that the update at “it” is driven mainly by the value at “cat”.

Figure 1.19:A worked self-attention pattern. Both qitq_{\textit{it}} and oit\bm{o}_{\textit{it}} sit above the “it” position: qitq_{\textit{it}} is the query projection of “it” (blue), and oit\bm{o}_{\textit{it}} is the updated representation at that same position (green). The blue arrows show how the query is built from “it” and then compared with every key. The softmax weights above each token sum to one; here the largest weight lands on “cat”, so the green aggregation arrow starts above “cat” and curves up to oit\bm{o}_{\textit{it}}, indicating that the update at “it” is driven mainly by the value at “cat”.

1.11.4.2Multi-Head Attention

A single attention layer implements one similarity pattern between positions. Linguistic and economic sequences, however, contain many relations that matter simultaneously: subject--verb agreement, coreference of pronouns, topic alignment, or, in a macro panel, sector co-movements, shock transmission lags, and autocorrelation structure. A single head is forced to compress all of them into one kernel, and typically does a poor job.

The fix is multi-head attention. Run HH attention layers in parallel, each with its own projection matrices (WQ(h),WK(h),WV(h))(\W_Q^{(h)}, \W_K^{(h)}, \W_V^{(h)}) mapping the input to a lower-dimensional subspace of size dk=d/Hd_k = d/H, compute HH attention outputs independently, then concatenate and linearly project back:

headh=Attention(XWQ(h),XWK(h),XWV(h)),MHA(X)=[head1;head2;;headH]WO.\begin{aligned} \mathrm{head}_h &= \mathrm{Attention}\bigl(\X\W_Q^{(h)},\, \X\W_K^{(h)},\, \X\W_V^{(h)}\bigr), \\ \mathrm{MHA}(\X) &= \big[\mathrm{head}_1; \,\mathrm{head}_2; \,\ldots;\,\mathrm{head}_H\big]\,\W_O . \end{aligned}

The total parameter count is essentially unchanged, because each head works on a 1/H1/H-dimensional slice, but the inductive bias is richer: different heads are free to specialize in different relations. Interpretability studies of trained Transformers routinely find heads that focus on the previous token, on the closing bracket matching an open one, on the subject of the current clause, or, in time-series models, on the most recent analogue of the current calendar month. Multi-head attention is therefore structurally analogous to a mixture of learned kernels in a nonparametric regression; the weights WO\W_O are the mixing coefficients, and the softmax-scored pairs are the kernels themselves.

1.11.4.3Positional Encoding

Self-attention treats its input as a set: permuting the positions of the tokens permutes the rows of X\X and permutes the rows of the output, but changes nothing else. This is problematic, because order matters in every sequence application (“dog bites man” vs. “man bites dog”; “inflation before the shock” vs. “after”). A model without a notion of position cannot distinguish them.

The fix that Vaswani et al. (2017) propose is to add a deterministic vector PE(t)Rd\mathrm{PE}(t)\in\R^d to the input embedding at each position tt, chosen so that the dot products PE(t)PE(t+k)\mathrm{PE}(t)^\top \mathrm{PE}(t+k) encode the relative distance kk. The original sinusoidal scheme is

PE(t,2k)=sin ⁣(t/100002k/d),PE(t,2k+1)=cos ⁣(t/100002k/d),k=0,,d/21.\mathrm{PE}(t, 2k) = \sin\!\big(t/10000^{2k/d}\big), \qquad \mathrm{PE}(t, 2k+1) = \cos\!\big(t/10000^{2k/d}\big), \qquad k=0,\ldots,d/2-1.

Three properties make this choice useful. First, each coordinate 2k2k is a sine wave of wavelength 2π100002k/d2\pi \cdot 10000^{2k/d}, so the encoding spans wavelengths from roughly 2π2\pi up to 2π100002\pi \cdot 10000; the model sees both fine local positions and coarse global ones at once. Second, for each sine/cosine pair, a fixed offset can be represented by a 2×22\times2 rotation: PE(t+r)\mathrm{PE}(t+r) is a linear function of PE(t)\mathrm{PE}(t) with coefficients depending only on the relative offset rr. This gives attention a convenient way to represent relative positions. Third, the encoding extrapolates: a model trained on sequences of length 512 can be fed sequences of length 1024 and the position code is still well-defined. Modern alternatives, rotary position embeddings (RoPE) and ALiBi, refine these properties but keep the same philosophy.

1.11.4.4The Transformer Block

Single attention layers, even multi-headed, are not yet expressive enough: they are essentially linear in the values, with a nonlinear mixing pattern. A full Transformer block wraps one MHA layer and one pointwise MLP with residual connections and LayerNorm:

x+=x+MHA ⁣(LN(x))\x^{+} = \x + \mathrm{MHA}\!\big(\mathrm{LN}(\x)\big)
xout=x++MLP ⁣(LN(x+)).\x^{\mathrm{out}} = \x^{+} + \mathrm{MLP}\!\big(\mathrm{LN}(\x^{+})\big).

The LayerNorm steps Ba et al., 2016 standardize across feature coordinates; together with the residual additions they stabilize training of very deep stacks. Equations (1.42)--(1.43) describe the modern pre-norm variant (LN before each sub-block), which is easier to train than the original post-norm variant of Vaswani et al. (2017). Figure Figure 1.20 shows the architecture schematically.

tokens exchange information

Figure 1.20:tokens exchange information

same MLP applied token by token

One Transformer block in pre-norm form. Self-attention first mixes information across token positions, then the pointwise MLP transforms each token separately. The red skip paths are the residual connections that let deep stacks train stably. A full Transformer stacks LL such blocks; GPT-3, for instance, uses L=96L=96.

1.11.4.4.1Encoder, decoder, causal masking.

The original Transformer of Vaswani et al. (2017) pairs an encoder stack (processes the source sequence) with a decoder stack (generates the target sequence), linked by a cross-attention layer. Most modern large language models are decoder-only: they use the same block as above, but the attention softmax is applied to a masked score matrix that sets all entries above the diagonal to -\infty. This causal mask forbids a position from attending to future positions, turning the Transformer into a left-to-right autoregressive predictor suitable for language modeling.

1.11.4.4.2Scaling and parallelism.

For day 1 the key engineering fact is simpler than the modern LLM discussion: attention is parallel, recurrence is not. Because every sub-block is pointwise in time and the only cross-position operation (attention) is fully parallelizable, a Transformer with LL blocks, HH heads, and hidden dimension dd can be trained on accelerators at roughly the theoretical peak throughput. This is why modern foundation models, GPT-nn, BERT, ViT, Claude, are Transformers rather than RNNs. Empirical studies of scaling laws Kaplan et al., 2020Hoffmann et al., 2022 document that test loss decreases as a power law in parameters, data, and compute, giving rise to the compute-optimal prescriptions used by the largest labs. For an economist, the relevant takeaway is that the marginal cost of additional capability is governed by a smooth, quantifiable, and very large compute bill.

1.11.4.5At a glance: RNN vs LSTM vs Transformer

Table Table 1.3 summarizes the three architectures along the dimensions most relevant to a practitioner’s choice.

Table 1.3:Comparison of the three sequence architectures. Transformers trade a quadratic-in-TT attention cost for full parallelism and unit-length paths between any pair of positions, which is an excellent trade on modern accelerators and for the long sequences typical in language and high-frequency finance.

RNNLSTM / GRUTransformer
Hidden statesingle ht\h_tht\h_t and Ct\bm{C}_tnone per step; the residual stream across layers is an implicit state
Path length 1T1 \to TO(T)\mathcal{O}(T)O(T)\mathcal{O}(T)O(1)\mathcal{O}(1)
Parallelism over ttnonenonefull (all positions at once)
Compute per layerO(Td2)\mathcal{O}(T\,d^2)O(Td2)\mathcal{O}(T\,d^2)O(T2d+Td2)\mathcal{O}(T^2 d + T\,d^2)
Memory per layerO(Td)\mathcal{O}(T\,d)O(Td)\mathcal{O}(T\,d)O(T2+Td)\mathcal{O}(T^2 + T\,d)
Training stabilitygradient decay/blow-upmuch better, gatedstable with LN + residuals
Sweet spotshort sequencesmid-length, niche patternslong context, massive parallelism

A practical rule of thumb follows immediately. If the task is a specialized time-series problem with moderate history length and limited data, an LSTM remains a strong baseline. If context is long and accelerator-friendly parallelism matters, one should usually start with a Transformer.

1.11.5Advanced Aside: In-Context Learning of an AR(1) Process

The material up to this point is the core Chapter Chapter 1 message; readers comfortable with the RNN \to LSTM \to Transformer summary may skip directly to the Chapter Summary. This subsection is an optional but illuminating detour: it shows why economists often find attention intuitive once they see it through a regression lens, and it is the analytical companion to notebook 09_Transformer_InContext_AR1.

A remarkable emergent property of large Transformers is in-context learning: the ability to solve a task at inference time, given only examples in the prompt, with no weight updates. A model pretrained on a universe of sequences can be shown a fresh series it has never seen and produce sensible next-step forecasts. For an economist this is striking: the Transformer behaves as though it runs a regression inside its forward pass, with the prompt playing the role of the training sample and the final token playing the role of the test point. Von Oswald et al. (2023) make this formal by showing that self-attention layers can implement gradient-based optimization internally, so a stack of such layers iteratively reduces an implicit loss.

Consider the simplest concrete setting, an autoregressive process of order 1:

xt=ϱxt1+εt,εtN(0,σ2).x_t = \varrho \, x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2).

If we provide a Transformer with a sequence (x1,,xt)(x_1, \dots, x_t), it can predict xt+1x_{t+1} by implicitly estimating ϱ\varrho from the history. There are two closely related but distinct perspectives. First, under suitable linear-attention parameterizations, self-attention layers can implement gradient-descent-like updates on least-squares objectives Von Oswald et al., 2023; specialized to the AR(1) target above, this reads as descent on minϱi=2t(xiϱxi1)2\min_\varrho \sum_{i=2}^{t} (x_i - \varrho\,x_{i-1})^2. Second, with ordinary softmax attention, one natural Q/K/V assignment behaves like a kernel smoother:

This is one particular parameterization that works because the AR(1) target is linear in the lagged state; with this choice the softmax-attention output becomes

x^t+1=i=2tαixiϱ^xt,\hat{x}_{t+1} = \sum_{i=2}^{t} \alpha_i\, x_i \approx \hat{\varrho} \, x_t,

where the softmax weights αiexp(xtxi1)\alpha_i \propto \exp(x_t\,x_{i-1}) concentrate mass on those past states xi1x_{i-1} that look like the current state xtx_t.

1.11.5.0.1Why this approximates ϱxt\varrho\,x_t, in three short steps.

Equation (1.45) is exactly a Nadaraya--Watson kernel regression of xix_i on xi1x_{i-1}, evaluated at xi1=xtx_{i-1} = x_t, with kernel K(x,xi1)=exp(xxi1)K(x_*, x_{i-1}) = \exp(x_*\,x_{i-1}). The intuition is then standard:

  1. Population fact. The AR(1) data-generating process implies E ⁣[xixi1=x]=ϱx\E{x_i \mid x_{i-1} = x_*} = \varrho\, x_* exactly. So the population conditional mean we are trying to estimate at x=xtx_* = x_t is just ϱxt\varrho\, x_t.

  2. Kernel smoother. The softmax attention output iαixi\sum_i \alpha_i x_i with αiK(xt,xi1)\alpha_i \propto K(x_t, x_{i-1}) is a kernel-weighted average of the values xix_i at past time steps ii, with the weights peaked where the lagged state xi1x_{i-1} is closest to xtx_t. Provided enough past observations land near xtx_t (so the kernel concentrates around xtx_t), this is the empirical Nadaraya--Watson estimator of E ⁣[xxi1=xt]\E{x \mid x_{i-1} = x_t}.

  3. Conclusion. Combining the two, iαixiE ⁣[xixi1=xt]=ϱxt\sum_i \alpha_i x_i \approx \E{x_i \mid x_{i-1} = x_t} = \varrho\, x_t. The shock variance σ2\sigma^2 controls how much the realized xix_i’s scatter around the conditional mean and therefore the variance of the kernel estimate, but it does not enter the Nadaraya--Watson location at first order.

Note that the unscaled inner product xtxi1x_t\, x_{i-1} used in the kernel is dimension-1, so the standard 1/dk1/\sqrt{d_k} scaling of multi-head attention plays no role here: even without it, the softmax concentrates around the past xi1x_{i-1}'s closest to xtx_t as soon as the prompt is long enough.

For the econometrician’s mental library the closest classical objects are the Nadaraya--Watson and local-linear estimators; the novelty is that the kernel is not hand-chosen but jointly learned with the data representation, over a large corpus of related tasks. This is a form of meta-learning: the network learned “how to regress” during pretraining, and at inference it runs that regression on a brand-new series. Self-attention can therefore implement optimization-like or regression-like computations internally, not merely local pattern matching.

1.11.5.0.2Code examples.

The following Jupyter notebooks implement and extend the material in this chapter:

1.12Further Reading

1.13Exercises

Worked solutions and guidance for these exercises appear in Appendix Appendix F.

Workload labels. Throughout the script, every exercise carries one of three workload tags inside its title. [Core] marks short analytical or pencil-and-paper questions suitable for a weekly problem set. [Computational] marks notebook-based exercises that involve running or modifying companion code; allow yourself a long evening or a weekend with verification gates and starter code in hand. [Advanced/project] marks longer, research-style assignments that may require a multi-day investment, a proper compute budget, or a small term-project plan. The labels are advisory rather than prescriptive: students with prior exposure can promote a [Computational] exercise to a quick warm-up, while those new to the material can treat several [Advanced/project] entries as inspiration for term work.

Footnotes
  1. The idea that a network can compute its own weights from its inputs has a long history: the Fast Weight Programmers of Schmidhuber (1992) use one network to write the weights of another from context, which is widely viewed as a conceptual precursor of attention. Schlag et al. (2021) make the formal equivalence explicit, showing that linear-attention Transformers are Fast Weight Programmers.

References
  1. McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), 115–133.
  2. Hebb, D. O. (1949). The Organization of Behavior: A Neuropsychological Theory. Wiley.
  3. Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386–408.
  4. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
  5. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
  6. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
  7. Robbins, H., & Monro, S. (1951). A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3), 400–407.
  8. Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural Networks and the Bias/Variance Dilemma. Neural Computation, 4(1), 1–58.
  9. Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR).
  10. Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd International Conference on Machine Learning (ICML), 448–456.
  11. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
  12. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  13. Schmidhuber, J. (2015). Deep Learning in Neural Networks: An Overview. Neural Networks, 61, 85–117.
  14. Maliar, L., Maliar, S., & Winant, P. (2021). Deep learning for solving dynamic economic models. Journal of Monetary Economics, 122, 76–101.
  15. Azinovic, M., Gaegauf, L., & Scheidegger, S. (2022). DEEP EQUILIBRIUM NETS. International Economic Review, 63(4), 1471–1525. 10.1111/iere.12575