The Preface argued that quantitative economics needs new tools because the models of interest, heterogeneous-agent economies, OLG models with aggregate risk, integrated assessment models, all share state spaces too large for traditional grid-based methods. This chapter takes that argument as given and supplies the technical machinery the rest of the manuscript will build on. Readers who want a fuller pitch for why deep learning is the right response should re-read the Preface; readers who already accept the premise can dive directly into the machinery below.
We begin with a brief refresher on the motivation, mostly to fix vocabulary and citations, then survey the three broad paradigms of machine learning (supervised, unsupervised, and reinforcement learning), and then develop the core technical machinery: neural network architectures, loss functions, gradient-based optimization, backpropagation, weight initialization, activation functions, and the modern theory of generalization including the double descent phenomenon. Readers already comfortable with these topics may skim this chapter and proceed to Chapter 2, where the economic applications begin.
The foundational references for the material in this chapter include McCulloch & Pitts (1943), Hebb (1949) and Rosenblatt (1958) for the historical origins of artificial neurons and the first learning rules, Cybenko (1989) and Hornik et al. (1989) for universal approximation, Rumelhart et al. (1986) for backpropagation, Robbins & Monro (1951) for the stochastic-approximation roots of SGD, Geman et al. (1992) for the bias/variance dilemma that underpins modern generalization theory, Kingma & Ba (2015) for the Adam optimizer, Ioffe & Szegedy (2015) and Srivastava et al. (2014) for batch normalization and dropout, and Goodfellow et al. (2016) for a comprehensive textbook treatment, complemented by the historical survey of Schmidhuber (2015).
1.1 Motivation and Applications
The past decade has witnessed a remarkable convergence of three developments that have transformed machine learning from a niche academic pursuit into a practical tool of extraordinary power: the availability of large-scale datasets, the advent of massively parallel hardware (GPUs and TPUs), and algorithmic innovations in training deep neural networks (Figure 1.1). While much of the public attention has focused on applications such as image recognition, natural language processing, and game playing, the implications for economics and finance are equally profound.
Figure 1.1: The three enablers of the deep-learning revolution: large-scale data, massively parallel compute, and algorithmic improvements. None of the three alone is sufficient; their co-availability since the early 2010s is what has turned neural networks from a niche academic curiosity into a workhorse scientific and industrial tool.
Deep learning has already demonstrated its potential across a broad range of economic applications. In macroeconomics, neural networks serve as global approximators of policy and value functions in high-dimensional equilibria that classical grid-based methods cannot reach (Maliar et al., 2021; Azinovic et al., 2022); heterogeneous-agent extensions encode cross-sectional distributions via histograms or permutation-invariant moment networks (Young, 2010; Han et al., 2024; Yang et al., 2025). Search-and-matching models with aggregate shocks (Payne et al., 2025) and continuous-time macro-finance settings requiring HJB approximation or deep-BSDE solvers (Gopalakrishna, 2024; Duarte et al., 2024; Han et al., 2018) add further coverage, as do optimal monetary policy rules under persistent supply shocks (Nuño et al., 2024). In climate economics, surrogate-based workflows solve integrated assessment models and derive Pareto-improving carbon-tax rules in OLG--IAMs with deep uncertainty (Kübler et al., 2026; Folini et al., 2025; Fernández-Villaverde et al., 2025). In finance, surrogate models accelerate option pricing, sovereign-default computation, and portfolio optimization (Hutchinson et al., 1994; Scheidegger & Treccani, 2018; Arellano, 2008; Gaegauf et al., 2023; Chen et al., 2025). For structural estimation, neural-network surrogates make inference tractable and enable global uncertainty quantification for IAMs (Kase et al., 2022; Friedl et al., 2023; Chen et al., 2026). Finally, these methods connect to an earlier generation of neural-network function approximation in economic computation: adaptive learning (Chen & White, 1998), derivative pricing (Hutchinson et al., 1994), and parameterized expectations (Duffy & McNelis, 2001), with Valaitis & Villa (2024) showing how the parameterized-expectations approach extends to contemporary deep architectures; structural discrete-choice estimation rounds out the historical picture (Norets, 2012).
Two themes cut across these application areas and motivate the rest of the script. First, every application area listed above involves a state space whose dimension $d$ grows with the number of agents, assets, shocks, or climate states; tensor-product grids become infeasible long before the modeling questions become uninteresting. Second, neural networks are universal approximators whose cost scales with the parameter count rather than exponentially in $d$, so they are the natural replacement function class once the problem becomes high-dimensional. Subsequent chapters develop this through-line: Chapter 2 introduces the DEQN methodology that all macro / heterogeneous-agent / search applications above share; Chapter 3 scales it to a 100+-country benchmark; Chapters 7--8 develop the continuous-time analogue; Chapter 9 develops the Gaussian-process methodology; and Chapters 10--11 put the deep-surrogate and GP machinery to work on structural estimation and integrated assessment models.
In this course, we focus on the recent advances enabled by deep neural networks, modern hardware, and algorithmic innovations in training.
For central banks in particular, these tools address pressing practical needs. Many of the models listed above, such as heterogeneous-agent New Keynesian (HANK) models, search and matching models with aggregate shocks, overlapping-generations (OLG) economies with aggregate risk, and integrated assessment models coupling climate and economic dynamics, involve state spaces of high dimension $d$, where traditional grid-based numerical methods are computationally infeasible: a tensor-product grid with $k$ points per dimension requires $k^d$ nodes, the canonical curse of dimensionality of Bellman (1961). Deep learning provides a mesh-free function approximator. For Barron-class functions, two-layer networks achieve dimension-independent rates in the number $n$ of hidden units (often stated as squared error of order $\mathcal{O}(1/n)$ under a bounded Barron norm), whereas tensor-product grids for generic smooth functions suffer rates such as $\mathcal{O}(M^{-2/d})$ in the total number $M$ of grid nodes. The point is not that every economic object satisfies a Barron bound, but that neural approximators avoid the mechanical tensor-product explosion. Chapter 2 returns to this comparison with concrete numbers for a six-shock DSGE.
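To make the node-count arithmetic concrete, the following sketch compares the size of a tensor-product grid with the parameter count of a small fully connected network. The 10 points per dimension, the two hidden layers, and the width of 64 are illustrative assumptions, not values taken from the text, and a parameter count by itself says nothing about the accuracy either method attains.

```python
def grid_nodes(points_per_dim: int, dim: int) -> int:
    """Number of nodes in a tensor-product grid: k**d."""
    return points_per_dim ** dim


def mlp_params(dim_in: int, width: int, hidden_layers: int, dim_out: int = 1) -> int:
    """Weights plus biases of a fully connected feedforward network."""
    sizes = [dim_in] + [width] * hidden_layers + [dim_out]
    return sum((n_in + 1) * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


for d in (2, 6, 20, 100):
    print(f"d={d:3d}  grid nodes={grid_nodes(10, d):.3e}  "
          f"network parameters={mlp_params(d, 64, 2)}")
```

The grid size grows exponentially in `d`, while the network's parameter count grows only linearly through the first layer.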
Specifically, this course focuses on using deep neural networks as computational tools for: (i) solving high-dimensional dynamic stochastic general equilibrium (DSGE) models, (ii) approximating value and policy functions in continuous-time settings via physics-informed neural networks, (iii) constructing fast and accurate surrogate models for parameter estimation and uncertainty quantification, and (iv) leveraging Gaussian processes for sample-efficient Bayesian active learning (Azinovic et al., 2022; Friedl et al., 2023; Chen et al., 2026). Supporting techniques include neural architecture search and adaptive multi-objective loss balancing (one option, ReLoBRaLo (Bischof & Kraus, 2025), is developed in the notebooks; alternatives include SoftAdapt and GradNorm). For a comprehensive textbook treatment of the foundations, we refer the reader to Goodfellow et al. (2016) and Chollet (2017); for a concise survey we recommend LeCun et al. (2015).
1.2 Types of Machine Learning
Before diving into the technical details, it is useful to recall the three broad paradigms of machine learning, each defined by the nature of the available data and the feedback signal provided to the algorithm.
1.2.1 Supervised Learning
Given a set of labeled input--output pairs $\{(x_i, y_i)\}_{i=1}^{N}$, the goal is to learn a mapping $f: x \mapsto y$ that generalizes to unseen inputs. The two main tasks are regression and classification.
Regression ($y \in \mathbb{R}$): predict a continuous target from input features. A simple linear model takes the form

$$\hat{y} = w^\top x + b,$$

where the parameters $(w, b)$ are learned from data. Figure 1.2 illustrates regression on a house-price dataset: each dot is a training observation, and the line is the fitted model.
Figure 1.2: Supervised learning: regression. The model (red line) is fitted to observed house prices (blue dots).
Classification ($y \in \{1, \dots, K\}$): assign an input to one of $K$ discrete categories. In the binary case with labels coded 0/1, a linear classifier predicts class 1 whenever $w^\top x + b > 0$ and class 0 otherwise. Figure 1.3 shows a credit-scoring example: applicants are classified as low-risk or high-risk based on income and savings, and the dashed line is the learned decision boundary.
Figure 1.3: Supervised learning: classification. A linear decision boundary separates low-risk (blue circles) from high-risk (red crosses) applicants in the income--savings feature space.
1.2.2 Unsupervised Learning
Given only unlabeled data $\{x_i\}_{i=1}^{N}$, the goal is to discover hidden structure without any target signal. Two common tasks are:
Clustering: partitioning data into groups of similar observations. Example: segmenting firms into peer groups based on financial characteristics such as size, leverage, and profitability.
Dimensionality reduction: compressing features into fewer dimensions while preserving important variation. Example: principal component analysis of yield curves, where three factors (level, slope, curvature) capture most of the cross-sectional variation.
Figure 1.4 illustrates a clustering task: unlabeled data points in two dimensions are partitioned into three clusters, each indicated by a different color and centroid marker.
Figure 1.4: Unsupervised learning: clustering. Unlabeled data points are grouped into three clusters; the markers indicate cluster centroids. No target labels are used; the algorithm discovers the grouping from the data alone.
1.2.3 Reinforcement Learning
In reinforcement learning, an agent interacts with an environment over a sequence of time steps. At each step $t$, the agent observes a state $s_t$, selects an action $a_t$ according to its policy $\pi(a_t \mid s_t)$, and receives a reward $r_t$ from the environment. The goal is to learn a policy $\pi^*$ that maximizes the expected cumulative discounted return:

$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t\Big],$$

where $\gamma \in (0, 1)$ is the discount factor and the expectation $\mathbb{E}_{\pi}$ is taken over the trajectory distribution induced jointly by the policy and the (possibly stochastic) environment dynamics, starting from a given initial-state distribution. Figure 1.5 illustrates this agent--environment interaction loop.
Figure 1.5: Reinforcement learning: the agent--environment loop. The agent observes a state, takes an action, and receives a reward signal. Over time, it learns a policy that maximizes cumulative discounted reward.
Example: an algorithmic trader learning an execution strategy by optimizing realized profit over sequences of order placements; or a central bank learning an interest-rate rule by maximizing a welfare criterion over simulated macroeconomic trajectories.
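As a small illustration of the objective above, the sketch below computes the discounted return of a single finite reward sequence by backward recursion. The reward values and the discount factor are made-up inputs; the expected return is what one would obtain by averaging this quantity over many simulated trajectories.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory, accumulated from the last step back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```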
1.3 Course Focus: Supervised vs. Unsupervised Learning
This course begins with the supervised learning paradigm, which provides the essential building blocks: choosing a parameterized model, defining a loss function, and minimizing it via gradient descent.
The core methods of this course, DEQNs and PINNs, are not supervised in the classical labeled-data sense. More precisely, they are self-supervised residual methods: the economic equilibrium conditions or governing equations generate the training signal. To see why, recall the key distinction: in supervised learning, the loss function measures the discrepancy between the network's prediction $\hat{y}_i = f_\theta(x_i)$ and a known target label $y_i$, for example, a mean squared error $\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$. This requires a dataset of correct input--output pairs.
In DEQNs and PINNs, no such labels exist. Consider the key differences:
DEQNs: A neural network approximates the unknown policy function of a dynamic economic model. The loss function is the Euler equation residual, which measures how much the network’s output violates the model’s optimality conditions at sampled state points. The “correct” policy values are never provided; instead, the network discovers the equilibrium by driving these residuals to zero.
PINNs: A neural network approximates the unknown solution to a partial differential equation (PDE). The loss function is the PDE residual, which evaluates how well the network satisfies the differential equation at randomly sampled collocation points. Again, the true solution is not available as training data; the network learns it by enforcing the PDE constraint.
In both cases, the training data consists only of input locations (sampled states or collocation points) with no associated output labels. The loss is defined entirely by the structure of the economic model or the governing equation, not by example solutions. They are therefore unsupervised in the narrow sense of using no labels, but the more informative term is equation-based self-supervision.
Despite this fundamental difference, the optimization machinery is shared: these approaches define a loss over trainable parameters and minimize it via (stochastic) gradient descent. This is why we introduce the supervised learning pipeline first in the next section: it establishes the model--loss--optimizer framework that DEQNs and PINNs then adapt by replacing the data-driven loss with a physics-based one.
1.4 The Supervised Learning Pipeline
Every supervised learning algorithm follows the same three-step recipe, regardless of whether the model is a linear regression, a random forest, or a deep neural network (Figure 1.6). Understanding this pipeline is essential because the DEQN and PINN methods in later chapters modify step 2 (replacing data-driven losses with physics-based residuals) while keeping steps 1 and 3 intact.
Figure 1.6: The three-step supervised-learning recipe that underpins every model in this course. Choose a parametric hypothesis $f_\theta$, measure its misfit on a labeled dataset via a loss $\mathcal{L}(\theta)$, and minimize over the parameter vector $\theta$. DEQNs and PINNs modify step 2 (replacing the data-driven loss with an equilibrium or PDE residual) while keeping steps 1 and 3 identical.
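Given a training set $\{(x_i, y_i)\}_{i=1}^{N}$ of input--output pairs, we seek a hypothesis $f_\theta$ parameterized by $\theta$ that minimizes the empirical risk. For regression problems, the default choice is the mean squared error (MSE):

$$\mathcal{L}_{\mathrm{MSE}}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - f_\theta(x_i)\big)^2.$$

This loss is not chosen arbitrarily. If the data are generated by

$$y_i = f_\theta(x_i) + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_\varepsilon^2),$$

then the log-likelihood of the sample is, up to an additive constant, equal to $-\frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^{N}\big(y_i - f_\theta(x_i)\big)^2$. Minimizing MSE is therefore equivalent to maximum likelihood under homoskedastic Gaussian observation noise. This is one reason why squared error is the natural benchmark loss for regression; see Bishop (2006), Goodfellow et al. (2016), and Deisenroth et al. (2020).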
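The equivalence can be checked numerically. The sketch below uses synthetic data from an assumed linear model (slope 2, intercept 0.5, noise scale 0.3, all chosen for illustration) and verifies that the Gaussian negative log-likelihood is an affine, increasing function of the MSE, so both are minimized by the same parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.3                       # sample size and noise scale (assumed)
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + 0.5 + rng.normal(0.0, sigma, size=n)   # synthetic linear data


def mse(w, b):
    return np.mean((y - (w * x + b)) ** 2)


def gaussian_nll(w, b):
    """Negative log-likelihood under y = w*x + b + N(0, sigma^2) noise."""
    resid = y - (w * x + b)
    return 0.5 * n * np.log(2.0 * np.pi * sigma**2) + np.sum(resid**2) / (2.0 * sigma**2)


# NLL = const + scale * MSE for every (w, b)
const = 0.5 * n * np.log(2.0 * np.pi * sigma**2)
scale = n / (2.0 * sigma**2)

# Least-squares fit: minimizes the MSE and therefore also the NLL
A = np.column_stack([x, np.ones(n)])
w_hat, b_hat = np.linalg.lstsq(A, y, rcond=None)[0]

for w, b in [(w_hat, b_hat), (1.0, 0.0), (3.0, -1.0)]:
    print(f"w={w:.3f} b={b:.3f}  MSE={mse(w, b):.4f}  NLL={gaussian_nll(w, b):.2f}")
```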
For classification tasks, the model must output class probabilities. In the binary case ($K = 2$), the simplest approach passes a single scalar score $z$ through the sigmoid function,

$$\sigma(z) = \frac{1}{1 + e^{-z}},$$

and assigns class 1 whenever $\sigma(z) > 1/2$, equivalently whenever $z > 0$ (Figure 1.7). For $K > 2$ classes the natural generalization maps a raw score vector $z \in \mathbb{R}^{K}$ onto the probability simplex via softmax: $p_k = e^{z_k} / \sum_{j=1}^{K} e^{z_j}$. In both cases the misfit between predicted probabilities and true labels is measured by the cross-entropy loss:

$$\mathcal{L}_{\mathrm{CE}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\,\log p_{ik},$$

where $p_{ik}$ is the predicted probability that observation $i$ belongs to class $k$. If the target is encoded as a one-hot vector, exactly one component of $y_i$ equals one and all others are zero, so for an observation whose true class is $c$ the loss contribution reduces to $-\log p_{ic}$. The model is rewarded for assigning high probability to the correct class and penalized heavily when that probability is near zero.
The origin of cross-entropy is again likelihood theory. In binary classification, with label $y \in \{0, 1\}$ and predicted success probability $p$, the Bernoulli log-likelihood is

$$\log\big(p^{y}(1 - p)^{1 - y}\big) = y \log p + (1 - y)\log(1 - p).$$

Negating and averaging this expression gives the binary cross-entropy. The $K$-class formula above is the corresponding negative log-likelihood for a categorical distribution with probabilities generated by sigmoid ($K = 2$) or softmax ($K > 2$). Cross-entropy is therefore the statistically natural loss whenever the model output is meant to represent class probabilities; see again Bishop (2006) and Deisenroth et al. (2020).
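A minimal NumPy sketch of the softmax map and the cross-entropy loss follows. The scores and labels are made-up numbers, class indices are 0-based in the code (the text uses $1, \dots, K$), and subtracting the row maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax; the max shift avoids overflow without changing the output."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def cross_entropy(probs, labels):
    """Average of -log p_{i, c_i}, where c_i is the true class of observation i."""
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


scores = np.array([[2.0, 1.0, 0.1],      # confident in class 0
                   [0.0, 0.0, 0.0]])     # uninformative scores
labels = np.array([0, 2])

p = softmax(scores)
loss = cross_entropy(p, labels)
print(p, loss)

# Binary case: sigmoid(z) equals the first softmax probability of the scores [z, 0]
z = 1.3
print(sigmoid(z), softmax(np.array([z, 0.0]))[0])
```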
Figure 1.7: Binary classification with a sigmoid output. A scalar score $z$ (the model's raw output) is mapped to a probability $p = \sigma(z)$. The prediction rule assigns class 1 whenever $p > 1/2$, equivalently whenever $z > 0$. The dashed lines mark the decision threshold; no neural-network architecture is assumed, since any model that produces a real-valued score can be combined with this mapping and the binary cross-entropy loss.
Figure 1.8: Binary cross-entropy and mean squared error as functions of the predicted class probability $p$. Cross-entropy rises much more sharply near confident mistakes, which is why it is usually better aligned with probabilistic classification.
Figure 1.8 makes the practical difference visible. If the true label is $y = 1$ but the model predicts a very small $p$, then the cross-entropy loss explodes because $-\log p \to \infty$ as $p \to 0$. The same holds symmetrically when $y = 0$ and $p \to 1$. Mean squared error does penalize mistakes, but its penalty is bounded by one and grows much more mildly near the boundaries. For probabilistic classification, that weaker penalty is usually undesirable because it does not strongly discourage overconfident wrong predictions.
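The contrast can be tabulated directly. The short sketch below evaluates both losses for a true label $y = 1$ at a few illustrative probabilities chosen here for the example.

```python
import math


def binary_cross_entropy(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def squared_error(p, y):
    return (p - y) ** 2


# True label y = 1; the prediction becomes increasingly (and wrongly) confident in class 0
for p in (0.5, 0.1, 0.01, 0.001):
    print(f"p={p:<6}  cross-entropy={binary_cross_entropy(p, 1):7.3f}  "
          f"squared error={squared_error(p, 1):.3f}")
```

The squared error never exceeds one, whereas the cross-entropy grows without bound as `p` approaches zero.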
The optimization is performed via gradient descent or one of its stochastic variants, which we discuss in Section 1.6.
1.4.1 Beyond MSE: Robust and Asymmetric Losses
MSE is optimal under Gaussian noise, but real-world economic and financial data often contain outliers and heavy tails that inflate the squared penalty disproportionately. Two classical alternatives are useful in this course and beyond.
Huber loss.
Introduced by Huber (1964) in the context of robust location estimation, the Huber loss behaves like MSE near the origin and like the absolute error $|r|$ in the tails, capping the influence of any single observation. Writing $r = y - \hat{y}$ for the residual,

$$\mathcal{L}_{\delta}(r) = \begin{cases} \tfrac{1}{2} r^2, & |r| \le \delta, \\ \delta\big(|r| - \tfrac{1}{2}\delta\big), & |r| > \delta. \end{cases}$$

The threshold $\delta$ controls the transition and is typically chosen to be a few times the noise scale. Huber loss is continuously differentiable, which suits gradient-based optimization, while it reduces the weight of extreme residuals; this makes it a common default for regression problems with suspected outliers.
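A direct NumPy transcription of the piecewise definition is sketched below, with the threshold defaulting to an arbitrary illustrative value of 1.

```python
import numpy as np


def huber(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond."""
    r = np.asarray(r, dtype=float)
    quadratic = 0.5 * r**2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quadratic, linear)


residuals = np.array([-5.0, -1.0, -0.5, 0.0, 0.5, 1.0, 5.0])
print(huber(residuals))          # Huber: the outliers at +/-5 contribute 4.5 each
print(0.5 * residuals**2)        # half squared error: the same outliers contribute 12.5
```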
Quantile (pinball) loss.
Koenker & Bassett (1978) proposed the check function

$$\rho_{\tau}(u) = u\,\big(\tau - \mathbb{1}\{u < 0\}\big) = \max\big(\tau u,\ (\tau - 1)u\big), \qquad u = y - \hat{y},$$

whose minimizer is the conditional $\tau$-quantile of $y$ given $x$ rather than the conditional mean. Setting $\tau = 0.5$ recovers median (absolute-error) regression; setting $\tau$ close to 0 or 1 targets the lower or upper tail. In financial risk management this is precisely the statistic of interest: a small $\tau$, such as $\tau = 0.05$, yields a neural-network estimator of the lower-tail $\tau$-quantile of returns, which corresponds to Value-at-Risk (VaR) at the $1 - \tau$ confidence level (here 95%), and averaging the pinball loss over many quantiles traces out the full conditional distribution of returns, an approach known as quantile regression or distributional regression.
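The claim that the pinball loss is minimized at a quantile can be checked on simulated data. In the sketch below the heavy-tailed "returns" are synthetic draws from a scaled Student-$t$ distribution (an assumption made purely for illustration), the level $\tau = 0.05$ is an example choice, and the minimizer is found by a simple grid search over constant predictions rather than by a neural network.

```python
import numpy as np


def pinball(u, tau):
    """Check function rho_tau(u) = max(tau * u, (tau - 1) * u), with u = y - prediction."""
    u = np.asarray(u, dtype=float)
    return np.maximum(tau * u, (tau - 1.0) * u)


rng = np.random.default_rng(1)
returns = 0.01 * rng.standard_t(df=4, size=5000)    # synthetic heavy-tailed returns

tau = 0.05
candidates = np.linspace(returns.min(), returns.max(), 2001)
avg_loss = np.array([pinball(returns - q, tau).mean() for q in candidates])

q_hat = candidates[avg_loss.argmin()]    # constant prediction minimizing the pinball loss
q_emp = np.quantile(returns, tau)        # empirical 5% quantile
print(q_hat, q_emp)
```

Up to the resolution of the search grid, the two numbers coincide.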
1.5 From Perceptrons to Deep Networks
The building block of every neural network is the artificial neuron, first proposed by McCulloch & Pitts (1943). A companion question, how the synaptic weights themselves should adapt with experience, was first addressed by Hebb (1949), whose rule "neurons that fire together, wire together" ($\Delta w_{ij} \propto x_i x_j$) is the conceptual ancestor of all gradient-based learning rules discussed below. Rosenblatt (1958) then introduced the Perceptron, the first trainable binary classifier built on these ideas. A single neuron computes a weighted linear combination of its inputs, adds a bias term, and passes the result through a nonlinear activation function:

$$y = \phi\Big(\sum_{i=1}^{d} w_i x_i + b\Big),$$

where $w_1, \dots, w_d$ are the synaptic weights, $b$ is the bias, and $\phi$ is the activation function (Figure 1.9).
Figure 1.9: An artificial neuron in the McCulloch--Pitts lineage. Inputs $x_1, \dots, x_d$ are multiplied by synaptic weights $w_1, \dots, w_d$, summed into a pre-activation $z = \sum_{i} w_i x_i + b$, and passed through a nonlinear activation $\phi$ to yield the output $y = \phi(z)$. The original McCulloch & Pitts (1943) unit used a binary threshold for $\phi$; the modern artificial neuron generalizes this to arbitrary smooth activations, and all deep networks are compositions of neurons of this form.
Common choices for $\phi$ include the sigmoid $\sigma(z) = 1/(1 + e^{-z})$, the hyperbolic tangent $\tanh(z)$, and the rectified linear unit $\mathrm{ReLU}(z) = \max(0, z)$ (Nair & Hinton, 2010; Glorot et al., 2011). Without a nonlinear activation, any composition of linear layers collapses to a single affine map, a mathematical fact of fundamental importance: $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2)$.
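The collapse of stacked linear layers is easy to confirm numerically. The layer sizes and random values below are arbitrary choices for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two linear layers applied in sequence ...
two_layers = W2 @ (W1 @ x + b1) + b2
# ... equal one affine map with merged weights and bias
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))

# Inserting a ReLU between the layers gives a map that is no longer affine in x
with_relu = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2
print(two_layers, with_relu)
```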
From a single neuron to a layer.
The single-neuron equation produces a scalar output. In practice we want vector-valued outputs (and, more importantly, vector-valued intermediate features). A layer of $m$ parallel neurons, each with its own weights $w_j \in \mathbb{R}^{d}$ and bias $b_j$, is a vector-valued generalization

$$h = \phi(W x + b), \qquad W \in \mathbb{R}^{m \times d},\quad b \in \mathbb{R}^{m},$$

where the nonlinearity is applied componentwise. Each row of $W$ is the weight vector of one neuron; stacking $m$ of them gives the matrix at once, so the layer evaluates $m$ neurons in a single matrix--vector product.
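The following sketch confirms that the matrix form reproduces the neuron-by-neuron computation. The sizes and the choice of $\tanh$ as activation are illustrative assumptions.

```python
import numpy as np


def dense_layer(x, W, b, phi=np.tanh):
    """Evaluate all neurons of a layer at once: phi(W x + b)."""
    return phi(W @ x + b)


rng = np.random.default_rng(0)
d, m = 3, 4                                   # input size and number of neurons (assumed)
W, b, x = rng.normal(size=(m, d)), rng.normal(size=m), rng.normal(size=d)

# The same computation, one neuron (one row of W) at a time
one_by_one = np.array([np.tanh(W[j] @ x + b[j]) for j in range(m)])
layer_out = dense_layer(x, W, b)
print(np.allclose(one_by_one, layer_out))
```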
From one layer to a deep composition.
A single hidden layer is already a universal approximator (Cybenko, 1989; Hornik et al., 1989; see the universal-approximation theorem discussed below), but its hidden-layer width may need to grow exponentially in the input dimension to attain a target accuracy. Stacking layers on top of one another reuses earlier features as inputs to later neurons; the resulting compositional representation is dramatically more efficient for many functions of interest (Telgarsky, 2016; Barron, 1993). A deep feedforward network with $L$ layers is therefore a nested composition of layer maps:

$$f_\theta(x) = \big(f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(1)}\big)(x), \qquad f^{(\ell)}(a) = \phi\big(W^{(\ell)} a + b^{(\ell)}\big), \tag{1.12}$$

with $\theta = \{W^{(\ell)}, b^{(\ell)}\}_{\ell=1}^{L}$ collecting all weights and biases.
The architecture that implements (1.12) is sketched in Figure 1.10.
Figure 1.10: An $L$-layer deep feedforward network. Each layer applies an affine map followed by a pointwise nonlinearity; the composition realizes Eq. (1.12). Depth (rather than width) is what gives neural networks their efficient representational power for compositionally structured functions.
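The nested composition amounts to a short loop. The sketch below assumes $\tanh$ hidden layers, a linear output layer (the usual choice for regression), layer sizes picked for illustration, and a simple $1/\sqrt{\text{fan-in}}$ weight scale that is a placeholder rather than a recommendation.

```python
import numpy as np


def init_params(sizes, rng):
    """One (W, b) pair per layer; the 1/sqrt(fan_in) scale is an assumed default."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(x, params, phi=np.tanh):
    """Apply the layers in sequence; the last layer is left linear."""
    a = x
    for W, b in params[:-1]:
        a = phi(W @ a + b)
    W, b = params[-1]
    return W @ a + b


rng = np.random.default_rng(0)
sizes = [6, 32, 32, 1]                   # 6 inputs, two hidden layers, scalar output
params = init_params(sizes, rng)

y_hat = forward(rng.normal(size=6), params)
n_params = sum(W.size + b.size for W, b in params)
print(y_hat, n_params)
```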
A useful geometric intuition, popularized by Chollet (2017), is that each layer of the network performs a nonlinear coordinate transformation, successively “untangling” the manifold on which the data lies. In the input space, the data may be entangled in complex ways (e.g., two classes forming concentric spirals); each hidden layer warps the space so that the data become progressively more linearly separable. By the final hidden layer, a simple linear readout suffices. This perspective, formalized also by Goodfellow et al. (2016), explains why depth is so powerful: each layer adds an additional coordinate transformation, and the composition of many simple transformations can represent very complex mappings with far fewer parameters than a single, wide layer would require.
Other architectures.
This course focuses almost entirely on feedforward networks of the form (1.12), because DEQNs and PINNs operate on unstructured state vectors for which feedforward maps are the natural choice. For structured inputs other architecture families exist and are used widely elsewhere: convolutional networks (LeCun et al., 2015) for image data, graph neural networks for relational data, and Transformers (Section 1.11.4) for sequences. We mention them so that readers who encounter these models in the empirical-finance or applied-ML literatures know where they fit; none are required for the methods developed in later chapters.
The universal approximation theorem (Cybenko, 1989; Hornik et al., 1989) guarantees that even a single hidden layer with sufficiently many neurons can approximate any continuous function on a compact set to arbitrary precision. In practice, however, deep (multi-layer) networks can achieve the same accuracy with exponentially fewer parameters than wide (single-layer) ones for certain function classes, which motivates the use of depth; Telgarsky (2016) makes this precise by exhibiting functions that a network with $\Theta(k^3)$ layers and constant width can represent, but that any network with $\mathcal{O}(k)$ layers can approximate only with $\Omega(2^{k})$ nodes, that is, with size exponential in $k$. Barron (1993) provides classical dimension-independent approximation rates for Barron-class targets, often stated as squared error of order $\mathcal{O}(1/n)$ in the hidden width $n$, whereas tensor-product methods for generic smooth functions scale poorly in the total number of grid nodes. This qualified comparison is the formal version of the "deep learning can beat grids" argument that motivates DEQNs in Chapter 2.
1.6 Training: Loss Functions, Gradient Descent, and Backpropagation
Given a loss function $\mathcal{L}(\theta)$, training proceeds by iteratively updating the parameters in the direction of steepest descent:

$$\theta_{t+1} = \theta_t - \eta\,\nabla_\theta \mathcal{L}(\theta_t),$$

where $\eta > 0$ is the learning rate. In this introductory chapter $\theta$ denotes generic trainable model parameters; in later chapters, when structural parameters and neural-network weights appear together, the script uses separate symbols for the structural parameters and for the network weights. Computing the gradient $\nabla_\theta \mathcal{L}$ for a deep network is achieved through backpropagation (Rumelhart et al., 1986), an efficient application of the chain rule that propagates error signals from the output layer back to the input layer. Appendix B collects the matrix-calculus identities and the one-paragraph reverse-mode AD summary used throughout the script.
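To see why the chain rule is central, consider a network with a single hidden layer. Let $z = W_1 x + b_1$, $h = \phi(z)$, and $\hat{y} = W_2 h + b_2$ with loss $\mathcal{L} = \tfrac{1}{2}\lVert \hat{y} - y \rVert^2$. Define the "delta" at the hidden layer, with $\odot$ denoting the element-wise product:

$$\delta = \frac{\partial \mathcal{L}}{\partial z} = \big(W_2^\top (\hat{y} - y)\big) \odot \phi'(z).$$

Then the weight gradient follows immediately:

$$\frac{\partial \mathcal{L}}{\partial W_1} = \delta\, x^\top.$$

The key insight of Rumelhart et al. (1986) is that this chain rule application can be organized as a single backward pass through the network, reusing intermediate quantities (the pre-activations $z$ and activations $h$) from the forward pass. The resulting algorithm has computational cost proportional to that of the forward pass; there is no need for finite differences or other expensive gradient approximations.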
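The single-hidden-layer formulas can be verified against central finite differences. The sketch below assumes a $\tanh$ hidden activation (so $\phi'(z) = 1 - \tanh^2 z$), the squared-error loss $\tfrac{1}{2}\lVert\hat{y} - y\rVert^2$, and arbitrary small layer sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)


def loss_fn(W1_):
    """Loss as a function of the first-layer weights only."""
    h = np.tanh(W1_ @ x + b1)
    y_hat = W2 @ h + b2
    return 0.5 * np.sum((y_hat - y) ** 2)


# Forward pass, keeping the intermediate quantities
z = W1 @ x + b1
h = np.tanh(z)
y_hat = W2 @ h + b2

# Backward pass: delta at the hidden layer, then the weight gradient
delta = (W2.T @ (y_hat - y)) * (1.0 - h**2)
grad_W1 = np.outer(delta, x)

# Central finite differences, one weight at a time (expensive; for checking only)
eps = 1e-6
grad_fd = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        E = np.zeros_like(W1)
        E[i, j] = eps
        grad_fd[i, j] = (loss_fn(W1 + E) - loss_fn(W1 - E)) / (2.0 * eps)

print(np.max(np.abs(grad_W1 - grad_fd)))
```

The finite-difference check needs two loss evaluations per weight, whereas the backward pass delivers all entries at once.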
In practice, evaluating the full gradient over all $N$ training examples is expensive. Stochastic gradient descent (SGD), whose roots go back to the stochastic-approximation scheme of Robbins & Monro (1951), replaces the full sum with a random mini-batch $\mathcal{B}_t \subset \{1, \dots, N\}$, yielding an unbiased estimate of the empirical-risk gradient at much lower cost per iteration:

$$\theta_{t+1} = \theta_t - \eta\,\frac{1}{|\mathcal{B}_t|}\sum_{i \in \mathcal{B}_t} \nabla_\theta \ell_i(\theta_t),$$

where $\ell_i$ denotes the loss on observation $i$.
With mini-batch sizes of 32--256, SGD achieves both computational efficiency and an implicit regularization effect from gradient noise. Two strands of theoretical work explain why this matters in deep nets specifically: the loss landscape of a deep network is dominated by saddle points rather than isolated bad local minima (Dauphin et al., 2014), and SGD's gradient noise tends to bias training toward flat rather than sharp regions of the loss surface, which often generalize better (Keskar et al., 2017). Even on linearly separable data, gradient descent on the logistic loss converges in direction to the maximum-margin solution, an instance of the broader principle that the optimizer itself imposes an implicit bias that contributes to generalization (Soudry et al., 2018). For a comprehensive modern review of stochastic optimization for large-scale learning (including convergence rates, adaptive methods, and variance reduction), see Bottou et al. (2018).
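A minimal mini-batch SGD loop is sketched below on a synthetic linear regression problem. The data-generating coefficients, the batch size of 64, the learning rate, and the number of steps are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = rng.normal(size=(N, 3))
w_true = np.array([1.0, -2.0, 0.5])             # assumed ground truth for the demo
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(3)
eta, batch_size = 0.05, 64

for step in range(2000):
    idx = rng.choice(N, size=batch_size, replace=False)   # random mini-batch
    resid = X[idx] @ w - y[idx]
    grad = 2.0 * X[idx].T @ resid / batch_size            # gradient of the batch MSE
    w -= eta * grad

print(w)    # close to w_true, up to noise from the data and the mini-batches
```

Each step touches 64 observations instead of 1000, at the price of a noisy gradient.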
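1.6.1 The Adam Optimizer
Modern optimizers such as Adam (Kingma & Ba, 2015) adapt the learning rate for each parameter based on running averages of the first and second moments of the gradient $g_t = \nabla_\theta \mathcal{L}(\theta_t)$. With $m_0 = v_0 = 0$ and $t = 1, 2, \dots$,

$$\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, & v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^{t}}, & \hat{v}_t &= \frac{v_t}{1 - \beta_2^{t}}, \\
\theta_{t+1} &= \theta_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
\end{aligned}$$

where squares, square roots, and divisions are taken element-wise.
The bias-corrected first moment $\hat{m}_t$ provides momentum (smoothing out gradient noise), while the second moment $\hat{v}_t$ provides per-parameter adaptive learning rates (parameters with large gradients receive smaller effective steps). The default hyperparameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ work well across a wide range of problems, including all the economic applications in this course.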
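The update can be written in a few lines. The sketch below is a plain NumPy transcription for illustration, not the implementation of any particular framework; the ill-conditioned quadratic test objective and the learning rate of 0.01 are assumptions made for the demo.

```python
import numpy as np


def adam_minimize(grad_fn, theta0, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Run `steps` Adam updates starting from theta0 and return the final parameters."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g          # running first moment
        v = beta2 * v + (1.0 - beta2) * g**2       # running second moment
        m_hat = m / (1.0 - beta1**t)               # bias corrections
        v_hat = v / (1.0 - beta2**t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta


# Ill-conditioned quadratic: L(theta) = 0.5 * (100 * theta_1^2 + theta_2^2)
curvature = np.array([100.0, 1.0])
loss = lambda th: 0.5 * np.sum(curvature * th**2)
grad = lambda th: curvature * th

theta_one = adam_minimize(grad, [1.0, 1.0], eta=0.01, steps=1)
theta_final = adam_minimize(grad, [1.0, 1.0], eta=0.01, steps=3000)
print(theta_one, theta_final, loss(theta_final))
```

After the first step both coordinates have moved by almost exactly `eta`, even though their gradients differ by a factor of 100: this is the per-parameter rescaling at work.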
1.6.2 The Optimizer Family Tree: Momentum, RMSprop, Adam, AdamW
Adam did not appear out of thin air; it inherits from a family of refinements to plain SGD whose interactions are worth being explicit about for readers who will tune optimizers in practice. Table 1.1 traces the lineage; each row is a one-line modification of the row above it.
Table 1.1: Lineage from plain SGD to AdamW. Each row introduces exactly one new ingredient: momentum smooths gradient noise; RMSprop adds a per-parameter learning-rate scaling by the running second moment; Adam combines the two with bias correction; AdamW separates weight decay from the gradient step so that the weight-decay regularizer does not interact with the adaptive denominator. Here $g_t$ denotes the (mini-batch) gradient at $\theta_t$. PINN training in continuous-time chapters (Chapters 7--8) often uses Adam or AdamW; DEQNs in Chapters 2--6 use plain Adam with the default hyperparameters as in Azinovic et al. (2022).
| Optimizer | Update rule (one parameter) | Reference |
|---|---|---|
| SGD | $\theta_{t+1} = \theta_t - \eta\, g_t$ | Robbins & Monro (1951) |
| SGD + momentum | $m_t = \mu\, m_{t-1} + g_t$; $\theta_{t+1} = \theta_t - \eta\, m_t$ | Sutskever et al. (2013) |
| RMSprop | $v_t = \beta\, v_{t-1} + (1 - \beta)\, g_t^2$; $\theta_{t+1} = \theta_t - \eta\, g_t / (\sqrt{v_t} + \epsilon)$ | Tieleman & Hinton (2012) |
| Adam | momentum on $g_t$ and on $g_t^2$ with bias correction (Eqs. above) | Kingma & Ba (2015) |
| AdamW | Adam plus decoupled weight decay $-\eta \lambda \theta_t$ | Loshchilov & Hutter (2019) |
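The Adam-vs-AdamW distinction is sharper than the one-line table entry suggests, so it is worth writing out both rules side by side. With $\hat{m}_t$, $\hat{v}_t$ the bias-corrected first and second moment of the gradient and $\lambda$ the weight-decay rate, Adam-with-$L_2$ (i.e. Adam applied to the loss $\mathcal{L}(\theta) + \tfrac{\lambda}{2}\lVert\theta\rVert^2$) updates

$$g_t = \nabla_\theta \mathcal{L}(\theta_t) + \lambda \theta_t, \qquad \theta_{t+1} = \theta_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

so the penalty gradient $\lambda\theta_t$ enters the moments and is itself rescaled by the adaptive denominator $\sqrt{\hat{v}_t} + \epsilon$. AdamW separates the two:

$$g_t = \nabla_\theta \mathcal{L}(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\Big(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t\Big),$$

so the weight-decay term shrinks every parameter by the same proportional factor $(1 - \eta\lambda)$ regardless of gradient magnitude. This is why AdamW recovers the textbook intuition "weight decay shrinks weights uniformly" that Adam-with-$L_2$ loses.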
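A one-step numerical comparison makes the difference visible. The sketch below looks only at the very first update from zero moments, where the bias-corrected moments reduce to $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$; the gradients, learning rate, and decay rate are made-up numbers.

```python
import numpy as np

eps = 1e-8
eta, lam = 0.1, 0.5                       # learning rate and weight-decay rate (assumed)
theta = np.array([1.0, 1.0])
grad_loss = np.array([10.0, 0.1])         # loss gradient: one large, one small entry

# Adam with an L2 penalty: the decay term is folded into the gradient ...
g_l2 = grad_loss + lam * theta
denom_l2 = np.sqrt(g_l2**2) + eps         # sqrt(v_hat) + eps at the first step
theta_l2 = theta - eta * g_l2 / denom_l2
# ... so the shrinkage attributable to the penalty is divided by the denominator
decay_l2 = eta * lam * theta / denom_l2

# AdamW: adaptive step on the loss gradient only, decay applied directly to theta
denom_w = np.sqrt(grad_loss**2) + eps
theta_w = theta - eta * (grad_loss / denom_w + lam * theta)
decay_w = eta * lam * theta

print("Adam + L2 shrinkage per parameter:", decay_l2)
print("AdamW     shrinkage per parameter:", decay_w)
```

Under Adam-with-$L_2$ the parameter with the large gradient is barely regularized, while AdamW shrinks both parameters by the same amount.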
Figure 1.11 gives a schematic comparison of the qualitative convergence patterns behind this optimizer family tree.
Figure 1.11: Schematic loss-trajectory comparison on a moderately ill-conditioned objective. Momentum and adaptive rescaling often improve early convergence relative to plain SGD, and Adam/AdamW are therefore useful defaults in the notebooks. The ranking is illustrative rather than universal: on some objectives, carefully tuned SGD or RMSprop can match or beat Adam-family methods.
1.6.3 Learning Rate Schedules
The learning rate is arguably the single most important hyperparameter in deep learning. Too large, and the optimizer diverges; too small, and convergence is impractically slow. A popular schedule is cosine annealing (Loshchilov & Hutter, 2017), which smoothly decays the learning rate from $\eta_{\max}$ to $\eta_{\min}$ according to:

$$\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\Big(1 + \cos\frac{\pi t}{T}\Big),$$

where $T$ is the total number of training iterations. Figure 1.12 compares the three learning-rate strategies used most often in practice.
Figure 1.12: Three common learning-rate schedules. A constant rate is simple but often converges slowly in the fine-tuning phase; exponential decay shrinks monotonically; cosine annealing (Loshchilov & Hutter, 2017) provides a smooth warm-to-cold transition that empirically performs well across a wide range of problems. DEQNs and PINNs typically use exponential decay or cosine annealing to polish the solution after the initial coarse-grained phase.
In practice, decaying schedules such as exponential decay or cosine annealing tend to refine solutions in the later stages of training, once the optimizer has found a good basin of attraction.
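The two decaying schedules can be sketched as plain functions of the iteration counter. The endpoint rates, the decay factor, and the horizon below are illustrative assumptions, and the cosine variant shown is a single decay cycle without restarts.

```python
import math


def cosine_lr(t, T, eta_max=1e-3, eta_min=1e-5):
    """Cosine annealing from eta_max at t = 0 down to eta_min at t = T."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))


def exponential_lr(t, eta0=1e-3, decay=0.999):
    """Exponential decay: the rate is multiplied by `decay` at every iteration."""
    return eta0 * decay**t


T = 5000
for t in (0, 1250, 2500, 3750, 5000):
    print(f"t={t:5d}  cosine={cosine_lr(t, T):.2e}  exponential={exponential_lr(t):.2e}")
```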
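1.6.4 Backpropagation: The Chain Rule at Scale
For a network with $L$ layers, denote the pre-activation at layer $\ell$ as $z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}$ and the activation as $a^{(\ell)} = \phi(z^{(\ell)})$, with $a^{(0)} = x$. If the final layer is linear, set $\phi(z) = z$ at $\ell = L$ and interpret $a^{(L)}$ as the prediction $\hat{y}$. The backpropagation algorithm computes $\partial \mathcal{L} / \partial W^{(\ell)}$ for all layers simultaneously by propagating a "delta" vector $\delta^{(\ell)} = \partial \mathcal{L} / \partial z^{(\ell)}$ backward:

$$\delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot \phi'\big(z^{(L)}\big), \qquad \delta^{(\ell)} = \Big(\big(W^{(\ell+1)}\big)^\top \delta^{(\ell+1)}\Big) \odot \phi'\big(z^{(\ell)}\big), \quad \ell = L-1, \dots, 1,$$

where $\odot$ denotes element-wise multiplication. The parameter gradients are then $\partial \mathcal{L} / \partial W^{(\ell)} = \delta^{(\ell)} \big(a^{(\ell-1)}\big)^\top$ and $\partial \mathcal{L} / \partial b^{(\ell)} = \delta^{(\ell)}$. The computational cost is linear in the number of layers and the total number of parameters, a remarkable efficiency that enables training networks with millions of parameters. Figure 1.13 shows the forward and backward passes side by side.
Figure 1.13: Backpropagation as forward and backward passes through the network. The forward pass (blue) produces and caches every layer's pre-activations $z^{(\ell)}$ and activations $a^{(\ell)}$; the backward pass (red, dashed) propagates the "delta" vectors $\delta^{(\ell)}$ from the loss back to the inputs, reusing the cached quantities to compute all parameter gradients in a single sweep.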
In modern deep learning frameworks such as TensorFlow and PyTorch, backpropagation is implemented automatically through computational graph tracing (“autograd”). The user only needs to define the forward computation; the framework handles the differentiation. This same automatic differentiation capability is what makes PINNs (Chapter 7) possible: the PDE residual requires derivatives of the network output with respect to its inputs, which autograd provides exactly and efficiently.
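To see what the frameworks automate, the delta recursion can be written out by hand for a small network. The sketch below is illustrative only and rests on assumptions made here: tanh hidden layers, a linear output layer, the squared-error loss $\tfrac{1}{2}\|a^{(L)} - y\|^2$, and hypothetical layer sizes. It checks one hand-computed gradient entry against a central finite difference.

```python
import numpy as np

def forward(params, x):
    """Forward pass; caches (pre-activation, activation) for every layer."""
    a, cache = x, [(None, x)]
    for idx, (W, b) in enumerate(params):
        z = W @ a + b
        # tanh on hidden layers, identity on the final (linear) layer
        a = z if idx == len(params) - 1 else np.tanh(z)
        cache.append((z, a))
    return a, cache

def backward(params, cache, y):
    """Backward pass for the loss 0.5 * ||a_L - y||^2 via the delta recursion."""
    grads = [None] * len(params)
    delta = cache[-1][1] - y                      # linear output: delta_L = a_L - y
    for idx in reversed(range(len(params))):
        a_prev = cache[idx][1]                    # activation feeding layer idx
        grads[idx] = (np.outer(delta, a_prev), delta)   # (dL/dW, dL/db)
        if idx > 0:
            z_prev = cache[idx][0]                # pre-activation of the layer below
            W = params[idx][0]
            delta = (W.T @ delta) * (1.0 - np.tanh(z_prev) ** 2)
    return grads

def loss(params, x, y):
    out, _ = forward(params, x)
    return 0.5 * np.sum((out - y) ** 2)

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]                              # hypothetical layer widths
params = [(rng.normal(size=(m, n)) / np.sqrt(n), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
x, y = rng.normal(size=3), rng.normal(size=2)

_, cache = forward(params, x)
grads = backward(params, cache, y)

# Central finite difference on a single weight entry of the first layer
eps = 1e-6
W0 = params[0][0]
W0[1, 2] += eps
loss_plus = loss(params, x, y)
W0[1, 2] -= 2 * eps
loss_minus = loss(params, x, y)
W0[1, 2] += eps                                   # restore the original weight
fd = (loss_plus - loss_minus) / (2 * eps)
print("backprop:", grads[0][0][1, 2], " finite difference:", fd)
```

The backward pass visits each layer once and reuses the cached forward quantities, which is why its cost is of the same order as the forward pass.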
1.7Weight Initialization¶
The initialization of network weights has a profound effect on training dynamics. If weights are initialized too large, activations explode through the layers; if too small, they vanish to zero. In both cases, the gradient signal degrades and training stalls. The key principle is to choose initial weights so that the variance of activations remains approximately constant across layers.
1.7.0.0.1Xavier/Glorot initialization.¶
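Glorot & Bengio (2010) derived the following rule for networks with symmetric activations (such as $\tanh$). For a layer with $n_{\text{in}}$ input neurons and $n_{\text{out}}$ output neurons, initialize weights as:
$$W_{ij} \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}} + n_{\text{out}}}\right).$$
This ensures $\mathrm{Var}(a^{(\ell)}) \approx \mathrm{Var}(a^{(\ell-1)})$ under the assumption that activations are in the linear regime of $\tanh$.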
1.7.0.0.2He initialization.¶
For ReLU activations, He et al. (2015) showed that the weight variance should be doubled relative to the forward fan-in rule:
$$W_{ij} \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}}}\right).$$
The justification is a second-moment-preserving calculation, not a variance one. For a centered symmetric pre-activation $z$ with density $p$,
$$\mathbb{E}\big[\mathrm{ReLU}(z)^2\big] = \int_0^\infty z^2\, p(z)\, dz = \tfrac{1}{2}\, \mathbb{E}[z^2],$$
because $z$ is symmetric about zero, so the negative half of the integrand is killed and the positive half is preserved. Doubling the input weight variance therefore preserves the second moment across layers under ReLU. Strictly speaking $\mathbb{E}[\mathrm{ReLU}(z)] > 0$, so the variance (centered second moment) is slightly smaller than the second moment, and the factor of 2 is an approximation rather than an identity; in practice the approximation is excellent and He initialization is the default for ReLU-family networks throughout this course.
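The second-moment argument is easy to check numerically. The following sketch pushes Gaussian inputs through a stack of bias-free ReLU layers and reports the mean squared activation at the top, for three weight scales: the He rule (variance $2/n_{\text{in}}$), the plain fan-in rule without the factor 2, and a fixed small standard deviation. Depth, width, batch size, and the helper name `final_second_moment` are arbitrary choices made here for illustration.

```python
import numpy as np

def final_second_moment(weight_std_fn, depth=20, width=512, batch=128, seed=0):
    """Push N(0, 1) inputs through `depth` ReLU layers; return the mean squared activation."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(scale=weight_std_fn(width), size=(width, width))
        a = np.maximum(a @ W, 0.0)        # bias-free ReLU layer
    return float(np.mean(a ** 2))

he = final_second_moment(lambda n_in: np.sqrt(2.0 / n_in))       # He: Var(W) = 2 / n_in
fan_in = final_second_moment(lambda n_in: np.sqrt(1.0 / n_in))   # no factor 2
tiny = final_second_moment(lambda n_in: 0.01)                    # fixed small std

print(f"He: {he:.3e}   fan-in without factor 2: {fan_in:.3e}   std = 0.01: {tiny:.3e}")
```

Under the He rule the second moment stays of order one after 20 layers, whereas dropping the factor 2 halves it at every layer and a fixed small standard deviation drives it to numerical zero.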
1.8Activation Functions in Depth¶
Beyond the three classical choices (sigmoid, tanh, ReLU), several modern activation functions address specific shortcomings. Table 1.2 summarizes the options used in this course.
Table 1.2:Activation functions used throughout the course. Origin papers: ReLU Nair & Hinton, 2010, Leaky ReLU Maas et al., 2013, ELU Clevert et al., 2016, Swish Ramachandran et al., 2017, GELU Hendrycks & Gimpel, 2016, Mish Misra, 2019. Range is the set of output values for $x \in \mathbb{R}$. Smoothness matters when derivatives of the network output are needed: sigmoid, tanh, Swish, GELU, Mish, and softplus are $C^\infty$; ReLU is only $C^0$; Leaky ReLU is piecewise linear; ELU is piecewise $C^\infty$ and is $C^1$ at the origin only for $\alpha = 1$. Smooth activations are required for PINN applications that involve second-order derivatives (Chapter 7).
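| Activation | Formula | Range | Key property |
|---|---|---|---|
| Sigmoid | $\sigma(x) = 1/(1 + e^{-x})$ | $(0, 1)$ | Smooth, saturates |
| Tanh | $\tanh(x)$ | $(-1, 1)$ | Zero-centered, saturates |
| ReLU | $\max(0, x)$ | $[0, \infty)$ | Non-saturating for $x > 0$ |
| Leaky ReLU | $\max(\alpha x, x)$, small $\alpha > 0$ | $(-\infty, \infty)$ | No dead neurons |
| ELU | $x$ if $x > 0$, $\alpha(e^{x} - 1)$ if $x \le 0$ | $(-\alpha, \infty)$ | Negative saturation; $C^1$ if $\alpha = 1$ |
| Swish | $x\,\sigma(x)$ | $[-0.278, \infty)$ (approx.) | Smooth, non-monotone |
| GELU | $x\,\Phi(x)$, $\Phi$ the standard normal CDF | $[-0.170, \infty)$ (approx.) | Smooth, default in BERT / GPT |
| Mish | $x \tanh(\ln(1 + e^{x}))$ | $[-0.309, \infty)$ (approx.) | Smooth, used in YOLOv4 |
| Softplus | $\ln(1 + e^{x})$ | $(0, \infty)$ | Smooth ReLU approximation |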
Leaky ReLU and ELU address the dying-neuron issue by providing a small but nonzero gradient for negative inputs. The Swish activation Ramachandran et al., 2017, which is used extensively in the DEQN and IRBC implementations of this course, combines the benefits of ReLU (non-saturating for large $x$) with smoothness at the origin. Its derivative is smooth everywhere and bounded between approximately $-0.1$ and $1.1$, which can improve optimization stability.
For PDE applications (Chapter 7), the choice of activation function is particularly important because the PINN loss involves derivatives of the network output. Since $\mathrm{ReLU}''(x) = 0$ almost everywhere, a ReLU network cannot represent second-order PDE residuals faithfully. Smooth activations such as $\tanh$ ($C^\infty$) or Swish are therefore required for PINN applications involving second-order PDEs. Figure 1.14 plots seven representative activations from Table 1.2.
Figure 1.14:Seven representative activation functions from Table 1.2. Sigmoid and tanh saturate at large $|x|$ (vanishing gradients); ReLU is non-saturating but kinked at the origin; Leaky ReLU and ELU repair the dead-neuron problem with a small negative response; Swish and Softplus are everywhere $C^\infty$, which the PINN chapter (Chapter 7) requires.
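Two of the claims above can be verified directly: the range of the Swish derivative, and the fact that ReLU carries no second-derivative information away from its kink. The sketch below uses only NumPy; the evaluation grid, the test points, and the finite-difference step are arbitrary choices made here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_grad(x):
    """Analytic derivative of x * sigmoid(x)."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def relu(x):
    return np.maximum(x, 0.0)

def second_diff(f, x, h=1e-3):
    """Central finite-difference approximation of the second derivative."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

grid = np.linspace(-10.0, 10.0, 20001)
d_swish = swish_grad(grid)
print("Swish derivative range:", d_swish.min(), d_swish.max())

pts = np.array([-2.0, -0.5, 0.5, 2.0])     # points away from the kink at 0
relu_dd = second_diff(relu, pts)
swish_dd = second_diff(swish, pts)
print("ReLU''  at test points:", relu_dd)
print("Swish'' at test points:", swish_dd)
```

The ReLU second differences are zero up to rounding, while the Swish values are clearly nonzero, which is the property a second-order PDE residual needs.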
1.9Vanishing and Exploding Gradients¶
A central obstacle to training deep networks is that the gradient signal reaching early layers can become either astronomically small or astronomically large as it is back-propagated through many layers. The backward recursion derived above for the “delta” vector is
$$\delta^{(\ell)} = \left( (W^{(\ell+1)})^\top \delta^{(\ell+1)} \right) \odot \sigma'(z^{(\ell)}),$$
so the magnitude of $\delta^{(1)}$ is governed, roughly, by the product of derivatives $\sigma'(z^{(\ell)})$ and the norms $\|W^{(\ell)}\|$. Two symmetric failure modes follow:
Vanishing gradients. For the sigmoid activation, $\sigma'(z) = \sigma(z)(1 - \sigma(z)) \le 1/4$; for $\tanh$, $\tanh'(z) = 1 - \tanh^2(z) \le 1$; and both derivatives approach zero when $|z|$ is large. Even at the sigmoid's most favorable point ($z = 0$), and with weights of order one, each layer shrinks the gradient by a factor of about $1/4$, so after $L$ layers the signal at the first layer can be attenuated by roughly $4^{-L}$. Early layers stop learning.
Exploding gradients. If the per-layer factors $\|W^{(\ell)}\|\,|\sigma'(z^{(\ell)})|$ exceed one across many layers, the product grows instead of shrinking. Updates become huge, parameters diverge, and the loss “blows up”.
Three ingredients, each already introduced separately, combine to tame these problems:
Non-saturating activations. ReLU has $\sigma'(z) = 1$ for $z > 0$, eliminating the decay; Swish and tanh avoid vanishing when activations remain in a moderate range.
Variance-preserving initialization. Xavier/Glorot Glorot & Bengio, 2010 and He He et al., 2015 pick $\mathrm{Var}(W^{(\ell)}_{ij})$ precisely so that $\mathrm{Var}(a^{(\ell)})$ is approximately constant across layers, keeping activations in the useful range of $\sigma$ (Section 1.7).
Batch normalization Ioffe & Szegedy, 2015. Re-centering and re-scaling the pre-activations of each mini-batch prevents them from drifting toward the saturated tails of $\sigma$ during training and allows much larger learning rates. Its affine parameters $(\gamma, \beta)$ are learned.
A practical complement is gradient clipping: if the gradient norm $\|\nabla_\theta \mathcal{L}\|$ exceeds a threshold, rescale the gradient so that its norm equals the threshold. This eliminates the most damaging exploding-gradient events at negligible cost and is standard in RNN training; it is occasionally useful in DEQNs and PINNs when the residual magnitudes are highly unbalanced across collocation points.
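Clipping by the joint norm of all parameter gradients takes only a few lines. The helper name `clip_by_global_norm` and the small constant guarding against division by zero are choices made for this sketch; deep learning frameworks ship their own clipping utilities.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so that their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))   # leave small gradients untouched
    return [g * scale for g in grads], total_norm

grads = [np.array([[3.0, 4.0]]), np.array([12.0])]      # joint norm: sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)

small, _ = clip_by_global_norm([np.array([0.1, 0.2])], max_norm=1.0)
print(small)                                            # below the threshold: unchanged
```

Because every array is multiplied by the same scalar, the direction of the update is preserved; only its length is capped.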
1.9.1Batch Normalization¶
Among the three mitigations listed above, batch normalization Ioffe & Szegedy, 2015 (BN) deserves a closer look because it is simple to state, surprisingly effective in practice, and has become a default building block in supervised deep networks. At its core BN is the standardization trick familiar from regression, re-centering each variable to mean zero and rescaling it to unit variance, applied separately to every layer’s pre-activations and recomputed on every mini-batch of training data.
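Let $z_1, \dots, z_m$ denote the pre-activations at one neuron over the $m$ examples in a mini-batch $\mathcal{B}$. Batch normalization replaces them by
$$\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} z_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (z_i - \mu_{\mathcal{B}})^2, \qquad \hat{z}_i = \frac{z_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad \tilde{z}_i = \gamma\, \hat{z}_i + \beta,$$
where $\epsilon > 0$ is a small constant for numerical stability and $(\gamma, \beta)$ are learnable scalar parameters specific to that neuron. The transformed activation $\tilde{z}_i$ is what the next layer sees. In the standard recipe BN is inserted between the linear map $W a + b$ and the elementwise nonlinearity $\sigma$.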
1.9.1.0.1Why standardization, layer by layer.¶
Without BN, the input distribution to a hidden layer $\ell$ depends on every weight in layers $1, \dots, \ell - 1$. As earlier weights update during gradient descent, the distribution faced by layer $\ell$ drifts from one optimization step to the next: each layer therefore chases a moving target, a phenomenon Ioffe & Szegedy (2015) called internal covariate shift. BN pins the input distribution of every layer to mean zero and unit variance at every step (Figure 1.15). Gradients become better conditioned, and substantially larger learning rates become safe.
1.9.1.0.2The role of the affine parameters.¶
At first glance the learnable shift $\beta$ and scale $\gamma$ seem to undo the normalization that BN just imposed. This is exactly the point. If a layer happens to prefer non-standard inputs, for example a tanh layer that needs slightly negative pre-activations to operate in its linear regime, the network is free to recover them via $(\gamma, \beta)$. BN therefore never reduces the network’s representational capacity; it merely shifts to a parameterization in which the optimization trajectory is easier to follow.
Figure 1.15:Distribution of pre-activations at one hidden neuron, sampled at three points during training. Left: without BatchNorm, the distribution drifts in mean and in variance as earlier layers update, so each layer chases a moving target. Right: with BatchNorm, the normalization step pins the inputs to zero mean and unit variance at every training step, before the learned scale and shift are applied. The downstream layer always operates on inputs of the same scale, and the gradient signal flowing back is well conditioned.
1.9.1.0.3Why higher learning rates work.¶
A precise Lipschitz bound depends on the operator norms of the surrounding weight matrices, the activation derivatives, and the learned affine scale $\gamma$. The useful intuition is that BN reduces sensitivity to shifts and rescalings of intermediate activations, making the local optimization problem better conditioned and allowing step sizes that would otherwise cause divergence. Santurkar et al. (2018) argue that loss-landscape smoothing, rather than the original “internal covariate shift” interpretation, better explains why BN helps optimization; the two views are complementary, but the smoothing perspective is the more directly testable one.
1.9.1.0.4At inference time.¶
During training, BN uses the current mini-batch to compute $(\mu_{\mathcal{B}}, \sigma_{\mathcal{B}}^2)$. At inference the mini-batch may be a single example, in which case those statistics would be ill-defined. Implementations therefore maintain a running average of $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$ across training mini-batches and use these fixed estimates at test time, so the network’s output is deterministic at deployment.
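The training/inference distinction is easiest to see in code. The class below is a minimal forward-only sketch (no backward pass); the class name `BatchNorm1D`, the exponential-moving-average update with `momentum=0.1`, and the synthetic data are assumptions made here, and framework implementations differ in details such as the variance estimator used for the running statistics.

```python
import numpy as np

class BatchNorm1D:
    """Forward-only batch normalization over the batch axis of a (batch, features) array."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)          # learnable scale (kept fixed here)
        self.beta = np.zeros(num_features)          # learnable shift (kept fixed here)
        self.eps, self.momentum = eps, momentum
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def __call__(self, z, training):
        if training:
            mu, var = z.mean(axis=0), z.var(axis=0)  # mini-batch statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var   # fixed estimates at test time
        z_hat = (z - mu) / np.sqrt(var + self.eps)
        return self.gamma * z_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=4)
for _ in range(200):                                 # pre-activations with mean 3, std 2
    out = bn(rng.normal(loc=3.0, scale=2.0, size=(64, 4)), training=True)

single = bn(np.full((1, 4), 3.0), training=False)    # a batch of one is fine at inference
print(out.mean(axis=0), out.std(axis=0))
print(single)
```

In training mode each mini-batch is standardized exactly; in inference mode a single example at the population mean maps to a value near zero because the running statistics have converged to roughly mean 3 and variance 4.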
1.9.2Normalization Variants Beyond BatchNorm¶
Batch normalization is the original normalization trick, but for several common deep-learning settings (small mini-batches, recurrent / sequence models, generative models, transformers) it is not the right one. The activation-based variants share the form $\hat{z} = (z - \mu)/\sqrt{\sigma^2 + \epsilon}$ followed by a learned affine map $\gamma \hat{z} + \beta$; they differ only in which axes the statistics $(\mu, \sigma^2)$ are pooled over. WeightNorm instead normalizes the weights themselves:
LayerNorm Ba et al., 2016: pool across all features of a single example. Decouples training from batch size and is the de-facto choice for RNNs and transformers; it can also be useful in small-batch residual methods, though PINNs more commonly rely on careful input/output scaling and smooth activations.
GroupNorm Wu & He, 2018: pool over a group of $C/G$ channels of a single example, where $C$ is the number of channels and $G$ the number of groups. Interpolates between LayerNorm ($G = 1$) and InstanceNorm ($G$ = channels); the default in object detection and small-batch CNNs.
WeightNorm Salimans & Kingma, 2016: reparameterize $w = g\, v / \|v\|$ so the activation statistics never enter the gradient. Avoids any batch-size dependence at the cost of slightly less representation power.
For the methods in this script, the practical guidance is: use BatchNorm for large supervised datasets (Chapter 1), LayerNorm for recurrent or attention-based architectures and as an option when the effective batch size is small, and skip normalization entirely for small DEQN or PINN MLPs when input/output scaling plus Adam already condition the problem well.
1.10Generalization: Overfitting, Regularization, and Double Descent¶
1.10.1Train / Validation / Test Split¶
Before discussing overfitting formally, it is essential to fix the experimental protocol that every supervised-learning study should follow. The available data are partitioned into three disjoint subsets:
Training set (typically ): used to fit the model parameters by minimizing the loss.
Validation set (typically ): used for model selection, choosing hyperparameters, comparing architectures, deciding when to stop training. The model’s performance on this set guides these decisions but its parameters are never trained on it.
Test set (typically ): touched once, at the very end, to report an unbiased estimate of generalization performance.
The key discipline is that no decision about the model (not hyperparameter tuning, not architecture, not early-stopping patience) may be informed by the test set. Using the test set multiple times turns it into an implicit validation set and invalidates its role as a measure of out-of-sample error. For small datasets, $k$-fold cross-validation replaces the fixed train/validation split: the training data are partitioned into $k$ equal folds; for each fold, the model is trained on the other $k - 1$ folds and evaluated on the held-out fold; the resulting $k$ validation scores are averaged. Common choices are $k = 5$ or $k = 10$; the test set is always held separately. In DEQNs and PINNs (Chapters 2 and 7), training and validation points are drawn from the same state distribution, and “generalization” is measured against an independently simulated test trajectory rather than a held-out labeled set.
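The $k$-fold procedure is short enough to write out in full. The sketch below uses closed-form ridge regression as a stand-in model so that it runs with NumPy alone; the synthetic data, the penalty grid, and the helper names (`kfold_indices`, `ridge_fit`, `cv_score`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and split the indices into k (nearly) equal folds."""
    return np.array_split(rng.permutation(n), k)

def ridge_fit(X, y, lam):
    """Closed-form ridge regression without intercept."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_score(X, y, lam, k=5, seed=0):
    """Average validation MSE over k folds for penalty lam."""
    rng = np.random.default_rng(seed)
    folds = kfold_indices(len(y), k, rng)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = ridge_fit(X[train_idx], y[train_idx], lam)       # fit on the other k-1 folds
        scores.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

scores = {lam: cv_score(X, y, lam) for lam in (1e-3, 1.0, 1e4)}
best = min(scores, key=scores.get)
print(scores, "selected penalty:", best)
```

Only after the penalty has been selected this way would the model be refit on all training data and evaluated, once, on the separately held test set.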
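A model that memorizes the training data but fails on unseen examples is said to overfit. To understand overfitting precisely, consider the following thought experiment. The decomposition below is the classical bias/variance analysis of Geman et al. (1992), which provided the canonical framework for thinking about generalization in neural networks long before modern overparameterized regimes were studied. Suppose we draw many independent training sets $\mathcal{D}$, each of size $n$, from the same data-generating process $y = f(x) + \varepsilon$, where $\varepsilon$ is zero-mean noise with variance $\sigma_\varepsilon^2$. On each training set we fit our model, obtaining a predictor $\hat{f}_{\mathcal{D}}$. Conditioning on a fixed test input $x$ and averaging over both the training set $\mathcal{D}$ and the new test noise $\varepsilon$, the squared prediction error decomposes into exactly three terms:
$$\mathbb{E}_{\mathcal{D}, \varepsilon}\Big[\big(y - \hat{f}_{\mathcal{D}}(x)\big)^2\Big] = \underbrace{\Big(\mathbb{E}_{\mathcal{D}}\big[\hat{f}_{\mathcal{D}}(x)\big] - f(x)\Big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_{\mathcal{D}}\Big[\big(\hat{f}_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)]\big)^2\Big]}_{\text{variance}} + \underbrace{\sigma_\varepsilon^2}_{\text{irreducible noise}}.$$
Each term captures a distinct source of error:
Bias measures the systematic error: how far the average prediction $\mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)]$ is from the true function $f(x)$. A model that is too simple (e.g., a linear function fit to a nonlinear target) will have high bias regardless of how much data it sees.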
Variance measures the sensitivity of the prediction to the particular training set drawn. A highly flexible model (e.g., a large neural network) may fit each training set well, but the predictions can differ wildly across draws, which is overfitting.
Irreducible noise is the error that no model can eliminate, because it stems from randomness in the data-generating process itself.
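The thought experiment can be run literally as a small Monte Carlo study. In the sketch below the target function, the noise level, the test input, and the use of polynomial fits as the model family are all assumptions made here for illustration: many training sets are drawn, a polynomial is fit to each, and the three terms are estimated at a fixed test input.

```python
import numpy as np

def bias_variance(degree, n_train=30, n_sets=500, noise_std=0.3, x0=0.25, seed=0):
    """Monte Carlo estimate of bias^2, variance, noise, and total MSE at the input x0."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2.0 * np.pi * x)            # hypothetical true function
    preds = np.empty(n_sets)
    for s in range(n_sets):                          # one independent training set per pass
        x = rng.uniform(0.0, 1.0, n_train)
        y = f(x) + rng.normal(scale=noise_std, size=n_train)
        preds[s] = np.polyval(np.polyfit(x, y, degree), x0)
    bias2 = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    # Fresh test labels at x0, one per fitted predictor
    y0 = f(x0) + rng.normal(scale=noise_std, size=n_sets)
    mse = float(np.mean((y0 - preds) ** 2))
    return float(bias2), float(variance), noise_std ** 2, mse

for degree in (1, 6):
    b2, var, noise, mse = bias_variance(degree)
    print(f"degree {degree}: bias^2={b2:.4f}  variance={var:.4f}  "
          f"noise={noise:.4f}  sum={b2 + var + noise:.4f}  MSE={mse:.4f}")
```

The linear fit is dominated by bias and the degree-6 fit by variance, and in both cases the simulated mean squared error matches the sum of the three terms up to Monte Carlo error.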
In classical statistics, there is a fundamental trade-off: reducing bias requires more flexible models, which increases variance. However, modern deep networks challenge this picture. Zhang et al. (2017) showed empirically that standard architectures can perfectly fit randomly labeled data, implying an effective capacity so large that classical capacity-based bounds become vacuous, yet the same architectures still generalize well on real data. Belkin et al. (2019) subsequently demonstrated that with sufficient overparameterization, models exhibit a double descent phenomenon: test error first follows the classical U-shaped curve and rises as the model approaches the point where it can fit the training data exactly (classical regime), but then decreases again as the number of parameters greatly exceeds the number of data points (interpolation regime). Nakkiran et al. (2020) showed that this phenomenon extends to deep networks and occurs not only as a function of model size, but also as a function of training time (“epoch-wise double descent”) and dataset size.
The key techniques for preventing overfitting in neural networks are:
Early stopping Prechelt, 1998: monitor validation loss and stop training when it begins to rise.
Weight decay ($L_2$ regularization) Krogh & Hertz, 1991: add a penalty $\lambda \|\theta\|_2^2$ on the parameters $\theta$ to the loss.
Dropout Srivastava et al., 2014: randomly drop a fraction $p_{\text{drop}}$ of activations at every training step. Two implementation conventions exist. The original convention drops units during training and multiplies the outgoing weights or activations by the keep probability $1 - p_{\text{drop}}$ at test time, so that the expected activation matches the training-time expectation. The now-standard inverted-dropout convention divides the retained activations by $1 - p_{\text{drop}}$ during training, so no rescaling is needed at test time. Either way, the mechanism is equivalent to training, on each mini-batch, a different sub-network drawn from an exponentially large ensemble that shares weights; the final network approximates the ensemble average. Typical drop rates are up to $0.5$ for hidden layers and up to $0.2$ for inputs. Dropout is less commonly used in DEQN and PINN applications, where the loss is already noisy (stochastic collocation) and regularization is often supplied implicitly by the state-space sampling scheme.
Data augmentation: synthetically enlarge the training set via transformations.
Batch normalization Ioffe & Szegedy, 2015: normalize activations within each mini-batch to stabilize training; its mini-batch statistics also act as a mild regularizer.
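Of the techniques in this list, inverted dropout is the one whose bookkeeping is most often gotten wrong, so a minimal sketch may help. The function name `dropout` and the all-ones test input are choices made here; the point is only that rescaling by the keep probability during training leaves the expected activation unchanged, so the test-time pass is the identity.

```python
import numpy as np

def dropout(a, drop_prob, training, rng):
    """Inverted dropout: zero a random subset and rescale the survivors during training."""
    if not training or drop_prob == 0.0:
        return a                                   # identity at test time
    keep_prob = 1.0 - drop_prob
    mask = rng.random(a.shape) < keep_prob         # True where the unit is kept
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((10000, 50))
train_out = dropout(a, drop_prob=0.5, training=True, rng=rng)
test_out = dropout(a, drop_prob=0.5, training=False, rng=rng)

print("mean activation in training:", train_out.mean())
print("fraction of dropped units:  ", (train_out == 0).mean())
print("mean activation at test time:", test_out.mean())
```

About half of the units are zeroed in training mode, yet the average activation stays close to one, matching the untouched test-time output.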
Figure 1.16:Schematic of the double-descent phenomenon, with $p$ the number of parameters and $n$ the number of training points. In the classical regime ($p < n$) test error follows the standard bias--variance U-curve; around the interpolation threshold $p \approx n$ test error can peak sharply because the fitted function is highly sensitive to noise; in the modern overparameterized regime ($p \gg n$) test error decreases again Belkin et al., 2019; Nakkiran et al., 2020. In some linearized, kernel, max-margin, or least-norm settings, gradient methods exhibit an implicit bias toward particular low-complexity interpolants; in nonlinear finite-width networks this bias depends on architecture, data, optimizer, initialization, and training protocol. Axes are unitless; the qualitative shape, not the scale, is the point. The curve is illustrative, not a measurement.
Figure 1.16 illustrates why classical bias--variance intuition breaks down for modern deep networks. In the classical regime ($p < n$), increasing model capacity beyond a point leads to overfitting. At the interpolation threshold ($p \approx n$), the model has just enough parameters to perfectly fit the training data, and the resulting solution can be extremely sensitive to noise. In the modern regime ($p \gg n$), test error often decreases again because optimization and architecture bias select comparatively regular interpolating solutions rather than arbitrary ones Belkin et al., 2019.
This phenomenon has been documented across many architectures and datasets by Nakkiran et al. (2020), who showed that it persists even when controlling for effective model complexity. The implications for computational economics are substantial but should not be overstated. In DEQN and PINN applications, the practitioner controls both the network size (number of parameters $p$) and the amount of training data (number of collocation points $n$), and those collocation points are often resampled rather than fixed once and for all. Overparameterized networks can therefore be useful and sometimes necessary, but their credibility must be checked by independent residual diagnostics, simulated trajectories, and benchmark comparisons rather than by parameter counting alone.
The double descent phenomenon can be explored interactively in the companion notebook 03_Double_Descent.ipynb, which reproduces the spirit of the double-descent curve in a small kernel-regression / random-Fourier-features setting (Belkin et al. (2019)); see also Nakkiran et al. (2020) for the deep-network / CIFAR-10 version of the experiment, which the notebook does not attempt to replicate at scale.
1.10.2A Pointer to the Theory: Neural Tangent Kernel (NTK) and Benign Overfitting¶
Why does an over-parameterized network with $p \gg n$ (many more parameters than training points) generalize instead of merely memorizing? Two complementary lines of theory have emerged.
1.10.2.0.1Neural Tangent Kernel (NTK).¶
In the limit of infinite width and a particular initialization scaling, gradient-descent training of a deep network is described by kernel gradient descent with a fixed, deterministic kernel, the Neural Tangent Kernel of Jacot et al. (2018). In that lazy-training limit the network’s function evolves almost linearly around its initialization, which explains why first-order optimization can be well behaved for very wide networks even though the finite-parameter objective is non-convex.
1.10.2.0.2Benign overfitting.¶
In linear and kernel settings, gradient methods often select a minimum-norm interpolating solution, which can behave much like a ridge-regularized least-squares estimator and inherit good generalization properties. Bartlett et al. (2020) make this precise for linear regression, showing that interpolation can be benign provided the spectrum of the input covariance has heavy enough tails. Subsequent work has extended both stories beyond the simplest settings; for the practitioner the takeaway is that the NTK regime helps explain why training succeeds, while benign-overfitting theory explains why interpolation need not imply poor test error under additional assumptions.
These two threads are not the final word: finite-width deviations from the NTK matter for feature learning, and benign overfitting requires conditions on the data covariance. They nevertheless help explain why the rest of this script can use networks substantially wider than a classical degrees-of-freedom calculation would recommend, provided the numerical residuals are validated out of sample.
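The minimum-norm interpolant mentioned in the benign-overfitting paragraph can be made concrete in the linear case. The sketch below is a toy illustration on random data, with dimensions and step size chosen here: for an overparameterized least-squares problem it compares the pseudoinverse solution, plain gradient descent started from zero, a ridge estimator with a very small penalty, and a second interpolant obtained by adding a null-space direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                                  # overparameterized: p >> n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm solution of X w = y via the pseudoinverse
w_min = np.linalg.pinv(X) @ y

# Gradient descent on 0.5 * ||X w - y||^2 from w = 0 never leaves the row space of X
w_gd = np.zeros(p)
step = 1.0 / np.linalg.norm(X, 2) ** 2          # 1 / (largest singular value)^2
for _ in range(5000):
    w_gd -= step * X.T @ (X @ w_gd - y)

# Ridge regression with a very small penalty (dual form, n x n system)
lam = 1e-8
w_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

# A different interpolant: add a direction from the null space of X
null_dir = rng.normal(size=p)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)  # project out the row-space component
w_other = w_min + null_dir

print("max residual (min-norm):", np.abs(X @ w_min - y).max())
print("||w_min|| =", np.linalg.norm(w_min), " ||w_other|| =", np.linalg.norm(w_other))
print("GD vs min-norm:", np.abs(w_gd - w_min).max(),
      " ridge vs min-norm:", np.abs(w_ridge - w_min).max())
```

All four vectors fit the training data, but gradient descent from zero and the small-penalty ridge estimator both land on the minimum-norm one; this implicit selection is what the linear theory analyzes. Nothing in this toy example speaks to test error, which depends on the covariance conditions discussed above.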
1.11Sequence Models: RNNs, LSTMs, and Attention¶
The chapters that follow rely almost entirely on feedforward networks, but many economic and financial datasets are intrinsically sequential. Before closing this introduction, it is therefore useful to briefly survey the main architectures for sequence data. We represent a sequence as $T$ tokens $x_1, \dots, x_T$, where each token $x_t \in \mathbb{R}^{d}$ is a vector. The word “token” is borrowed from natural-language processing, where a token is typically a word or sub-word piece; in the economic and financial context we use throughout this course a token is simply one element of the sequence, for example a scalar return, a vector of macroeconomic variables at one quarter, or a price-volume pair at one tick. The algorithms below are agnostic to this choice; they only need the sequence elements to be real-valued vectors of a fixed dimension.
1.11.1Recurrent Neural Networks¶
The traditional approach to sequence data is the Recurrent Neural Network (RNN) Elman, 1990. Unlike an MLP, which maps an input to an output in a single forward pass, an RNN maintains an internal hidden state $h_t$ that acts as a memory of past information. At each time step $t$, the network updates this state using the current input $x_t$ and the previous state $h_{t-1}$:
$$h_t = \sigma\big(W_h h_{t-1} + W_x x_t + b\big),$$
where $\sigma$ is an activation function. Concretely: for a scalar time series $x_t$ is a scalar (e.g. log-return at date $t$), $h_t$ is a $d_h$-dimensional hidden vector summarizing everything the network has seen so far, and $(W_h, W_x, b)$ are learnable parameters. The same update is applied at every time step, so this recursive structure lets the network process sequences of arbitrary length with a fixed parameter budget. Figure 1.17 shows the resulting unrolled computation graph.
Figure 1.17:An unrolled Recurrent Neural Network. The same parameters $(W_h, W_x, b)$ are reused at every time step, allowing the hidden state $h_t$ to accumulate historical information.
1.11.1.0.1Training: Backpropagation Through Time (BPTT).¶
Because the unrolled RNN is a feedforward graph of depth $T$ with shared weights, one can apply ordinary backpropagation and then sum the weight-gradients across time. Concretely, let $\mathcal{L}$ denote the total loss. For the recurrence above, with pre-activation $z_t = W_h h_{t-1} + W_x x_t + b$, the forward single-step Jacobian is
$$J_t = \frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\big(\sigma'(z_t)\big)\, W_h.$$
With column gradients, the backward pass multiplies by $J_t^\top = W_h^\top \operatorname{diag}\big(\sigma'(z_t)\big)$. Differentiating $\mathcal{L}$ with respect to an early hidden state $h_k$ therefore yields the schematic product (keeping only the path through the final state $h_T$)
$$\frac{\partial \mathcal{L}}{\partial h_k} = J_{k+1}^\top J_{k+2}^\top \cdots J_T^\top\, \frac{\partial \mathcal{L}}{\partial h_T},$$
so the gradient picks up products of matrices containing the same recurrent weight matrix $W_h$. The relevant singular values or spectral radii of these factors determine the asymptotics. If they are mostly below one, the gradient vanishes exponentially in $T - k$ and the network cannot learn dependencies that span many steps; if they are mostly above one, it explodes, producing NaNs during training. This is the vanishing/exploding gradient problem, originally analyzed by Hochreiter (1991) and developed formally in Bengio et al. (1994) and Hochreiter et al. (2001), and revisited with a modern optimization-theoretic view by Pascanu et al. (2013).
Three practical remedies partially alleviate the pathology without changing the architecture. Gradient clipping Pascanu et al., 2013 rescales the parameter-gradient whenever its norm exceeds a threshold; it eliminates the worst exploding events at negligible cost and is now a standard training default for any recurrent model. Orthogonal or identity-like initialization of $W_h$ places its singular values near 1 so that the Jacobian product preserves norms at the start of training. Truncated BPTT unrolls only $K$ steps back at each update, capping both the memory footprint and the effective gradient horizon at $K$. These help but do not cure the problem: the structural fix is to change the recurrence itself, which is the move made by gated cells.
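The Jacobian-product argument can be checked on a toy recurrence. The sketch below runs a small RNN forward and accumulates the product of single-step Jacobians from the first to the last hidden state. The recurrent matrix is a scaled orthogonal matrix so that its singular values are known exactly; the hidden size, sequence length, the two scales, and the `linear` switch (identity activation, used to exhibit explosion without tanh saturation) are all assumptions made for this illustration.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b, linear=False):
    """Run h_t = act(W_h h_{t-1} + W_x x_t + b) from h_0 = 0; return all hidden states."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x in xs:
        z = W_h @ h + W_x @ x + b
        h = z if linear else np.tanh(z)
        hs.append(h)
    return np.array(hs)

def jacobian_product_norm(hs, W_h, linear=False):
    """Spectral norm of d h_T / d h_1, the product of the single-step Jacobians."""
    J = np.eye(W_h.shape[0])
    for h in hs[1:]:
        d_act = np.ones_like(h) if linear else 1.0 - h ** 2   # tanh'(z_t) = 1 - h_t^2
        J = (d_act[:, None] * W_h) @ J                         # diag(act'(z_t)) W_h
    return float(np.linalg.norm(J, 2))

rng = np.random.default_rng(0)
d_h, T = 16, 50
xs = rng.normal(size=(T, 1))
W_x, b = rng.normal(size=(d_h, 1)), np.zeros(d_h)
Q, _ = np.linalg.qr(rng.normal(size=(d_h, d_h)))               # orthogonal matrix

hs_small = rnn_forward(xs, 0.5 * Q, W_x, b)
vanish = jacobian_product_norm(hs_small, 0.5 * Q)              # singular values 0.5

hs_large = rnn_forward(xs, 1.5 * Q, W_x, b, linear=True)
explode = jacobian_product_norm(hs_large, 1.5 * Q, linear=True)  # singular values 1.5

print("tanh RNN,   singular values 0.5:", vanish)
print("linear RNN, singular values 1.5:", explode)
```

Over 49 steps the first product is at most $0.5^{49}$ and the second equals $1.5^{49}$, which is the exponential dependence on the time gap described above.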
1.11.2Long Short-Term Memory (LSTM) and GRUs¶
Before writing equations, it helps to keep a plain-language picture in mind. A vanilla RNN asks one hidden state to do three jobs at once: retain useful old information, absorb new information, and expose the relevant part to the next layer. This is a fragile design. The LSTM separates these tasks. It keeps a dedicated memory lane flowing through time and at each date makes three soft decisions: what to keep, what to write, and what to reveal. That is the intuition behind the gate equations below.
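The LSTM cell Hochreiter & Schmidhuber, 1997 replaces the single-state recurrence by a pair of states: a cell state $c_t$ that flows along the top of the cell with additive updates, and a hidden state $h_t$ that is read off it. Three learned sigmoid gates, each depending on the concatenation $[h_{t-1}, x_t]$, control what information flows through (here $\sigma$ is the logistic sigmoid):
$$\begin{aligned}
f_t &= \sigma\big(W_f [h_{t-1}, x_t] + b_f\big) && \text{(forget gate)} \\
i_t &= \sigma\big(W_i [h_{t-1}, x_t] + b_i\big) && \text{(input gate)} \\
o_t &= \sigma\big(W_o [h_{t-1}, x_t] + b_o\big) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\big(W_c [h_{t-1}, x_t] + b_c\big) && \text{(candidate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && (1.34) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$
Each of $f_t, i_t, o_t \in (0, 1)^{d_h}$ acts as a soft switch applied element-wise. The crucial structural change is in equation (1.34): the cell state is additively corrected rather than multiplicatively overwritten. Along the direct memory path, differentiating $c_t$ with respect to $c_{t-1}$ contributes $\operatorname{diag}(f_t)$ in place of a full recurrent matrix product. The full derivative also contains indirect terms because the gates depend on $h_{t-1}$ and hence on earlier cell states, but the direct path is the constant-error-carousel intuition: when the cell judges information worth keeping, it can open the forget gate ($f_t \approx 1$) and allow gradients to flow through as if the sequence were shorter. Figure 1.18 sketches the resulting cell.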
Figure 1.18:The LSTM cell. The green top lane is the protected memory lane: old memory can be kept, new information can be written additively, and the resulting state can later be revealed through the hidden output (visible output, blue bottom lane). This is the intuition behind the forget, input, candidate, and output components shown inside the cell.
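The Gated Recurrent Unit (GRU) of Cho et al. (2014) is a lighter sibling that merges the forget and input gates into a single update gate $u_t$ and drops the separate cell state in favor of the hidden state itself:
$$\begin{aligned}
u_t &= \sigma\big(W_u [h_{t-1}, x_t] + b_u\big) && \text{(update gate)} \\
r_t &= \sigma\big(W_r [h_{t-1}, x_t] + b_r\big) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\big(W_{\tilde{h}} [r_t \odot h_{t-1}, x_t] + b_{\tilde{h}}\big) && \text{(candidate)} \\
h_t &= (1 - u_t) \odot h_{t-1} + u_t \odot \tilde{h}_t
\end{aligned}$$
A GRU uses roughly $25\%$ fewer parameters than an LSTM of the same hidden size (three weight blocks instead of four) and performs comparably on many sequence tasks Chung et al., 2014, at the cost of a slightly less expressive memory channel.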
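A single LSTM step and the parameter comparison with the GRU can be sketched as follows. The stacking order of the four blocks inside one weight matrix, the hidden and input sizes, and the helper `n_params` are conventions chosen for this illustration; the count assumes one weight matrix and one bias vector per block, whereas some frameworks keep separate input and recurrent biases, which changes the totals but not the 4:3 ratio.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*d_h, d_h + d_x); rows are stacked as [f, i, o, candidate]."""
    d_h = h_prev.shape[0]
    pre = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(pre[0 * d_h:1 * d_h])          # forget gate: what to keep
    i = sigmoid(pre[1 * d_h:2 * d_h])          # input gate: what to write
    o = sigmoid(pre[2 * d_h:3 * d_h])          # output gate: what to reveal
    c_tilde = np.tanh(pre[3 * d_h:4 * d_h])    # candidate memory
    c_t = f * c_prev + i * c_tilde             # additive cell-state update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

def n_params(n_blocks, d_h, d_x):
    """Each block has a (d_h x (d_h + d_x)) weight matrix plus a bias of size d_h."""
    return n_blocks * (d_h * (d_h + d_x) + d_h)

d_h, d_x = 32, 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_x))
b = np.zeros(4 * d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(10, d_x)):         # a short random input sequence
    h, c = lstm_step(x_t, h, c, W, b)

lstm_p, gru_p = n_params(4, d_h, d_x), n_params(3, d_h, d_x)
print(h.shape, c.shape)
print("LSTM params:", lstm_p, " GRU params:", gru_p, " saving:", 1 - gru_p / lstm_p)
```

The hidden state is bounded in $(-1, 1)$ by construction, while the cell state is free to accumulate, which is what lets it act as the protected memory lane.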
LSTMs remain particularly effective for economic time-series tasks where capturing specialized temporal patterns is more important than massive scalability. For example, Holt et al. (2024) use LSTMs to detect Edgeworth cycles in retail gasoline-price data. These cycles are asymmetric, high-frequency price movements that are difficult to identify with traditional spectral analysis or simple rule-based methods. In this setting the LSTM architecture excels at identifying the characteristic sawtooth patterns (sudden price jumps followed by slow decays) across thousands of retail stations, providing a robust tool for antitrust analysis and competition policy.
1.11.3Limits of Recurrent Models¶
Even with gated cells, recurrent architectures retain two fundamental limitations that no amount of tuning can remove.
Inherently sequential. Step $t$ cannot begin until step $t - 1$ has finished, so the full computation takes $T$ serial wavefronts regardless of how many cores or tensor units are available. On a modern GPU that can evaluate thousands of MLP rows in parallel, this forced serialization makes RNN training slow and wall-clock expensive. Training a contemporary large language model on a trillion-token corpus with a plain LSTM would take years of wall-clock time rather than weeks.
Path-length-dependent decay. Information from position 1 must travel through all intermediate hidden states to influence position $T$, each hop applying a matrix and a nonlinearity. Gating mitigates but does not eliminate this decay, and empirically LSTM performance saturates on sequences well below the theoretical limit implied by the cell’s capacity.
Both problems have the same structural cause: the recurrence forces a path of length $O(T)$ between the two extreme positions of the sequence. The remedy is to drop recurrence entirely and let every position read every other position directly, in parallel. This is the Transformer.
1.11.4The Transformer Architecture¶
The Transformer architecture Vaswani et al., 2017 replaced recurrence entirely with self-attention. This allows the model to weigh the importance of all tokens in a sequence simultaneously, enabling massive parallelization and better capturing global dependencies.[1]
For a first pass, four steps are enough. Add position information so the model knows order; project each token into a query, key, and value; compare each query to all keys; then take a weighted average of the values. Multi-heads, LayerNorm, residual connections, and pointwise MLPs are refinements of this core idea, not a different idea.
1.11.4.1The Self-Attention Mechanism¶
1.11.4.1.1Intuition: search, then retrieval.¶
Before writing the mechanism formally it is useful to see it as a soft library lookup. Picture a small library with $T$ shelves, one per position in the sequence. Each shelf $t$ carries three vectors: a query $q_t$ (“what am I looking for?”), a key $k_t$ (“what is printed on my label?”), and a value $v_t$ (“what do I actually contain?”). All three are linear projections of the same input token $x_t$, and the projection matrices $W_Q, W_K, W_V$ are learned during training; the librarian (queries and keys) and the books (values) are co-designed for whatever task the training objective encodes.
To produce the updated representation $y_t$ at shelf $t$, the following four steps happen.
Score. Shelf $t$’s query $q_t$ is compared against every shelf’s key $k_s$; the dot product $q_t^\top k_s$ is a similarity score, high when shelf $s$’s label matches what shelf $t$ is looking for.
Normalize. The scores are divided by $\sqrt{d_k}$ (to keep them in a numerically sane range) and pushed through a softmax, producing probability weights $\alpha_{ts} \ge 0$ with $\sum_{s} \alpha_{ts} = 1$.
Retrieve. The weights are applied to the values: $y_t = \sum_{s} \alpha_{ts}\, v_s$ is a weighted average of the shelf contents.
Repeat. Steps 1--3 happen in parallel for every shelf $t$, producing the whole output sequence $y_1, \dots, y_T$ in a single layer.
Shelves whose label matched the query contribute most to the output; shelves whose label was off-topic are almost ignored.
1.11.4.1.2Why this is useful: a worked example.¶
Consider the sentence “The cat sat on the mat. It purred.” For the model to process “it” properly, it must first decide what “it” refers to, the cat or the mat. Self-attention performs exactly this disambiguation: the query vector at the “it” position probes the key vectors at every earlier position, and the softmax converts the raw similarity scores into probability weights that concentrate most of the mass on the correct antecedent. Figure 1.19 below illustrates the resulting pattern: the bulk of the weight lands on “cat”, and the updated representation at the “it” position is formed as a weighted average of the values, driven mostly by “cat”. The same mechanism, run in parallel for every position, produces all of the outputs $y_1, \dots, y_T$ in a single layer.
1.11.4.1.3 A small concrete example.
Suppose we have only three shelves and two-dimensional ’s and ’s. Take the query at shelf 3 to be and the three keys The raw similarity scores are , , . Ignoring the scaling to keep the arithmetic clean, the softmax weights come out to roughly . Shelf 3’s output is therefore a blend of the three values in those proportions: most mass on shelf 3 itself and on shelf 1 (the closest match in label space); shelf 2, whose label points in an orthogonal direction, contributes about . The softmax is “soft” precisely in this sense, a smoothed nearest-neighbor retrieval rather than a hard argmax over the most similar shelf.
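The arithmetic of a three-shelf lookup of this kind can be checked in a few lines. The sketch below uses illustrative two-dimensional vectors chosen for this snippet (they are assumptions, not necessarily the numbers used in the text): the query at shelf 3 is `q3`, the rows of `K` are the three keys, and the rows of `V` are hypothetical values. The second key is orthogonal to the query, so its raw score is zero, yet it still receives a small positive weight.

```python
import numpy as np

# Hypothetical numbers for illustration only (not taken from the text).
q3 = np.array([1.0, 1.0])            # query at shelf 3
K = np.array([[1.0, 0.0],            # key of shelf 1
              [1.0, -1.0],           # key of shelf 2 (orthogonal to q3)
              [1.0, 1.0]])           # key of shelf 3
V = np.array([[1.0, 0.0],            # values stored on shelves 1, 2, 3
              [0.0, 1.0],
              [1.0, 1.0]])

scores = K @ q3                      # raw dot-product similarities
weights = np.exp(scores - scores.max())
weights /= weights.sum()             # softmax, without the sqrt(d_k) scaling
z3 = weights @ V                     # weighted average of the values

print("scores :", scores)                 # [1. 0. 2.]
print("weights:", np.round(weights, 3))   # [0.245 0.09  0.665]
print("output :", np.round(z3, 3))        # [0.91  0.755]
```

With these numbers the orthogonal shelf keeps about 9% of the weight, which is the sense in which the retrieval is soft rather than a hard argmax.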
The “self” in *self*-attention says that every shelf plays both roles simultaneously: every shelf’s query probes every shelf’s key, and every shelf receives an updated representation as output. A single attention layer therefore pairs all positions with all positions in parallel. Contrast this with an RNN or LSTM: to let position $t$ peek at information from position 1, the signal must be shuttled through every intermediate position via the hidden state, with each hop applying the recurrent weight matrix and a nonlinearity, so long-range information is either blurred or lost entirely. Attention short-circuits that chain: position $t$ reads position 1 directly, with no intermediate hops and no path-length-dependent attenuation. This is the architectural source of the Transformer’s advantage on long sequences, and the reason why attention-based models quickly displaced recurrent ones for language, code, and (increasingly) time-series forecasting.
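Given a sequence of $T$ input vectors $x_1, \dots, x_T \in \mathbb{R}^{d}$, stack the tokens as rows of $X \in \mathbb{R}^{T \times d}$. The attention mechanism computes three projections for each token: a Query ($Q$), a Key ($K$), and a Value ($V$), using learned weight matrices $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$:
$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V .$$
The output of the attention layer is a weighted sum of the values, where the weights are determined by the compatibility (dot product) of the queries with the keys:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V ,$$
with the softmax applied row by row.
The scaling factor $\sqrt{d_k}$ (where $d_k$ is the dimensionality of the keys) prevents the dot products from growing too large in magnitude, which would otherwise push the softmax into saturated regions where gradients are vanishingly small.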
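The matrix form of scaled dot-product attention translates almost line by line into NumPy. The following is a minimal sketch under stated assumptions: the function names `softmax` and `self_attention`, the random weight matrices, and the dimensions are all illustrative choices rather than anything fixed by the text or the companion notebooks.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention on a (T, d) input."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # projections of each token
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))          # (T, T) attention weights
    return A @ V, A                              # outputs and weights

# Illustrative sizes and random (untrained) weights.
rng = np.random.default_rng(0)
T, d, d_k = 5, 8, 4
X = rng.normal(size=(T, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) / np.sqrt(d) for _ in range(3))

Z, A = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)            # (5, 4)
print(A.sum(axis=1))      # each row of weights sums to one
```

Row `t` of `A` holds the weights that position `t` places on every position, so `A @ V` computes all weighted averages in one matrix product.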
1.11.4.1.4 An econometric lens.
The attention layer is a data-dependent, learnable kernel smoother. Compare it to the Nadaraya–Watson estimator $\hat m(x) = \sum_i w_i(x)\, y_i$ with kernel-based weights $w_i(x) = \kappa_h(x - x_i) / \sum_j \kappa_h(x - x_j)$: attention has exactly this form, but the similarity is the parametric bilinear form $q_t^\top k_s = x_t^\top W_Q W_K^\top x_s$ and both $q_t$ and $k_s$ are themselves learned projections of the input. From this vantage point the Transformer’s “magic” is less mysterious: it is a nonparametric smoother whose kernel the optimizer tunes to whatever task the training objective encodes. Self-attention is also recurrence-free: every pair of positions interacts in a single parallel layer, with no signal decay along the sequence.
Figure 1.19 renders the attention pattern of the worked “cat/it” example on a compressed five-token version of the sentence. The output is the new representation at the “it” position, formed as a weighted average of the values, with most weight coming from “cat”.
Figure 1.19: A worked self-attention pattern. Both the query and the output sit above the “it” position: the query is the query projection of “it” (blue), and the output is the updated representation at that same position (green). The blue arrows show how the query is built from “it” and then compared with every key. The softmax weights above each token sum to one; here the largest weight lands on “cat”, so the green aggregation arrow starts above “cat” and curves up to the output, indicating that the update at “it” is driven mainly by the value at “cat”.
1.11.4.2 Multi-Head Attention
A single attention layer implements one similarity pattern between positions. Linguistic and economic sequences, however, contain many relations that matter simultaneously: subject–verb agreement, coreference of pronouns, topic alignment, or, in a macro panel, sector co-movements, shock transmission lags, and autocorrelation structure. A single head is forced to compress all of them into one kernel, and typically does a poor job.
The fix is multi-head attention. Run $H$ attention layers in parallel, each with its own projection matrices $W_Q^{(h)}, W_K^{(h)}, W_V^{(h)}$ mapping the input to a lower-dimensional subspace of size $d/H$, compute attention outputs independently, then concatenate and linearly project back:
$$\mathrm{head}_h = \mathrm{Attention}\big(X W_Q^{(h)},\, X W_K^{(h)},\, X W_V^{(h)}\big), \qquad \mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W_O .$$
The total parameter count is essentially unchanged, because each head works on a $d/H$-dimensional slice, but the inductive bias is richer: different heads are free to specialize in different relations. Interpretability studies of trained Transformers routinely find heads that focus on the previous token, on the closing bracket matching an open one, on the subject of the current clause, or, in time-series models, on the most recent analogue of the current calendar month. Multi-head attention is therefore structurally analogous to a mixture of learned kernels in a nonparametric regression; the weights $W_O$ are the mixing coefficients, and the softmax-scored query–key pairs are the kernels themselves.
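A compact way to see why the parameter count does not grow with the number of heads is to implement the heads as slices of full-width projection matrices. The sketch below is one such implementation in NumPy; the function name `multi_head_attention`, the helper `split_heads`, and the random weights are assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """Multi-head self-attention on a (T, d) input; d must divide by n_heads."""
    T, d = X.shape
    assert d % n_heads == 0
    d_h = d // n_heads                           # width of each head

    def split_heads(M):
        # (T, d) -> (n_heads, T, d_h): each head sees its own slice of columns
        return M.reshape(T, n_heads, d_h).transpose(1, 0, 2)

    Q, K, V = (split_heads(X @ W) for W in (W_Q, W_K, W_V))
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_h))   # (n_heads, T, T)
    heads = A @ V                                          # (n_heads, T, d_h)
    concat = heads.transpose(1, 0, 2).reshape(T, d)        # concatenate heads
    return concat @ W_O, A

rng = np.random.default_rng(0)
T, d, H = 6, 8, 4
X = rng.normal(size=(T, d))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))

out, A = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=H)
n_params = sum(W.size for W in (W_Q, W_K, W_V, W_O))
print(out.shape, A.shape, n_params)   # (6, 8) (4, 6, 6) 256
```

Changing `n_heads` from 4 to 1 or 8 leaves `n_params` at `4 * d * d`; only the way the columns are grouped into heads changes.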
1.11.4.3 Positional Encoding
Self-attention treats its input as a set: permuting the positions of the tokens permutes the rows of $X$ and permutes the rows of the output in the same way, but changes nothing else. This is problematic, because order matters in every sequence application (“dog bites man” vs. “man bites dog”; “inflation before the shock” vs. “after”). A model without a notion of position cannot distinguish them.
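This permutation property is easy to verify numerically. The sketch below, with an illustrative single-head `self_attention` function and random weights (both assumptions of this snippet), shuffles the input rows and checks that the output rows are shuffled in exactly the same way.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention, no positional information."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    S = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(1)
T, d = 6, 4
X = rng.normal(size=(T, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(T)                      # a random reordering of positions
Z = self_attention(X, W_Q, W_K, W_V)
Z_perm = self_attention(X[perm], W_Q, W_K, W_V)

# Shuffling the inputs only shuffles the outputs: order carries no information.
print(np.allclose(Z_perm, Z[perm]))            # True
```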
The fix that Vaswani et al. (2017) propose is to add a deterministic vector $p_t \in \mathbb{R}^{d}$ to the input embedding at each position $t$, chosen so that the dot products $p_t^\top p_s$ encode the relative distance $t - s$. The original sinusoidal scheme is
$$p_{t,2i} = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad p_{t,2i+1} = \cos\!\left(\frac{t}{10000^{2i/d}}\right), \qquad i = 0, \dots, d/2 - 1 .$$
Three properties make this choice useful. First, each coordinate is a sine wave of wavelength $2\pi \cdot 10000^{2i/d}$, so the encoding spans wavelengths from roughly $2\pi$ up to $10000 \cdot 2\pi$; the model sees both fine local positions and coarse global ones at once. Second, for each sine/cosine pair, a fixed offset can be represented by a rotation: $p_{t+k}$ is a linear function of $p_t$ with coefficients depending only on the relative offset $k$. This gives attention a convenient way to represent relative positions. Third, the encoding extrapolates: a model trained on sequences of length 512 can be fed sequences of length 1024 and the position code is still well-defined. Modern alternatives such as rotary position embeddings (RoPE) and ALiBi refine these properties but keep the same philosophy.
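The first two properties can be checked directly. The sketch below builds the sinusoidal table in NumPy and verifies that dot products depend only on the offset and that a shift by `k` positions is a fixed block-diagonal rotation. The function name `sinusoidal_encoding` and the sizes `T = 64`, `d = 16`, `k = 5` are illustrative assumptions.

```python
import numpy as np

def sinusoidal_encoding(T, d, base=10000.0):
    """(T, d) table: even columns are sines, odd columns the matching cosines."""
    assert d % 2 == 0
    pos = np.arange(T)[:, None]               # positions 0, ..., T-1
    i = np.arange(d // 2)[None, :]            # index of each sine/cosine pair
    angles = pos / base ** (2 * i / d)        # (T, d/2)
    P = np.empty((T, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

T, d, k = 64, 16, 5
P = sinusoidal_encoding(T, d)

# Property: dot products depend only on the offset between positions.
G = P @ P.T
print(np.allclose(G[3, 7], G[20, 24]))        # True (both pairs are 4 apart)

# Property: shifting by k is a rotation within each sine/cosine pair.
omega = 1.0 / 10000.0 ** (2 * np.arange(d // 2) / d)
R = np.zeros((d, d))
for j, w in enumerate(omega):
    c, s = np.cos(w * k), np.sin(w * k)
    R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
shifted = P[:-k] @ R.T                        # rotate every p_t by the offset k
print(np.allclose(shifted, P[k:]))            # True
```

The rotation matrix `R` depends on `k` but not on `t`, which is what lets an attention head express "look `k` steps back" with a single linear map.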
1.11.4.4 The Transformer Block
Single attention layers, even multi-headed, are not yet expressive enough: they are essentially linear in the values, with a nonlinear mixing pattern. A full Transformer block wraps one multi-head attention (MHA) layer and one pointwise MLP with residual connections and LayerNorm (LN):
$$\tilde{X} = X + \mathrm{MHA}\big(\mathrm{LN}(X)\big), \tag{1.42}$$
$$X^{\text{out}} = \tilde{X} + \mathrm{MLP}\big(\mathrm{LN}(\tilde{X})\big). \tag{1.43}$$
The LayerNorm steps (Ba et al., 2016) standardize across feature coordinates; together with the residual additions they stabilize training of very deep stacks. Equations (1.42)–(1.43) describe the modern pre-norm variant (LN before each sub-block), which is easier to train than the original post-norm variant of Vaswani et al. (2017). Figure 1.20 shows the architecture schematically.
Figure 1.20: One Transformer block in pre-norm form. Self-attention first mixes information across token positions (“tokens exchange information”), then the pointwise MLP transforms each token separately (“same MLP applied token by token”). The red skip paths are the residual connections that let deep stacks train stably. A full Transformer stacks $L$ such blocks; GPT-3, for instance, uses $L = 96$.
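The pre-norm block can be sketched in a few NumPy functions. This is a simplified illustration under stated assumptions, not a reference implementation: attention is single-head for brevity, LayerNorm omits its learned gain and bias, the MLP uses a ReLU, and all names (`transformer_block`, the parameter dictionary `p`) are hypothetical.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Standardize each token across its feature coordinates (no learned gain/bias)."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_Q, W_K, W_V, W_O):
    """Single-head self-attention followed by an output projection."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V @ W_O

def mlp(X, W1, b1, W2, b2):
    """The same two-layer ReLU network applied to every token separately."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def transformer_block(X, p):
    """Pre-norm block: LN, then sub-block, then residual addition, twice."""
    X_mid = X + attention(layer_norm(X), p["W_Q"], p["W_K"], p["W_V"], p["W_O"])
    return X_mid + mlp(layer_norm(X_mid), p["W1"], p["b1"], p["W2"], p["b2"])

rng = np.random.default_rng(0)
T, d, d_ff = 5, 8, 32
p = {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in ("W_Q", "W_K", "W_V", "W_O")}
p.update(W1=rng.normal(size=(d, d_ff)) / np.sqrt(d), b1=np.zeros(d_ff),
         W2=rng.normal(size=(d_ff, d)) / np.sqrt(d_ff), b2=np.zeros(d))

X = rng.normal(size=(T, d))
Y = transformer_block(X, p)
print(Y.shape)                                  # (5, 8)

# With both output projections zeroed, the residual paths make the block an identity map.
p_zero = dict(p, W_O=np.zeros((d, d)), W2=np.zeros((d_ff, d)))
Y_id = transformer_block(X, p_zero)
print(np.allclose(Y_id, X))                     # True
```

The identity check at the end illustrates why residual connections help deep stacks: a block can start close to doing nothing and learn a small correction.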
1.11.4.4.1 Encoder, decoder, causal masking.
The original Transformer of Vaswani et al. (2017) pairs an encoder stack (processes the source sequence) with a decoder stack (generates the target sequence), linked by a cross-attention layer. Most modern large language models are decoder-only: they use the same block as above, but the attention softmax is applied to a masked score matrix that sets all entries above the diagonal to $-\infty$. This causal mask forbids a position from attending to future positions, turning the Transformer into a left-to-right autoregressive predictor suitable for language modeling.
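A causal mask is a one-line change to the score matrix. The sketch below (function name and sizes are illustrative assumptions) sets the entries strictly above the diagonal to minus infinity before the softmax, so that the resulting weight matrix is lower triangular.

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Softmax attention weights with a causal (lower-triangular) mask."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)           # block attention to the future
    scores = scores - scores.max(axis=-1, keepdims=True) # the diagonal is never masked
    w = np.exp(scores)                                   # exp(-inf) = 0 exactly
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d_k = 4, 3
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))

W = causal_attention_weights(Q, K)
print(np.round(W, 3))
print(np.allclose(np.triu(W, k=1), 0.0))   # True: no weight on future positions
```

The first position can attend only to itself, so the first row of `W` is `[1, 0, 0, 0]` regardless of the scores.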
1.11.4.4.2 Scaling and parallelism.
For day 1 the key engineering fact is simpler than the modern LLM discussion: attention is parallel, recurrence is not. Because every other sub-block acts pointwise in time and the only cross-position operation (attention) is fully parallelizable, a Transformer with $L$ blocks, $H$ heads, and hidden dimension $d$ can be trained on accelerators at a high fraction of their peak throughput. This is why modern foundation models such as the GPT series, BERT, ViT, and Claude are Transformers rather than RNNs. Empirical studies of scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) document that test loss decreases approximately as a power law in parameters, data, and compute, giving rise to the compute-optimal prescriptions used by the largest labs. For an economist, the relevant takeaway is that the marginal cost of additional capability is governed by a smooth, quantifiable, and very large compute bill.
1.11.4.5 At a glance: RNN vs LSTM vs Transformer
Table 1.3 summarizes the three architectures along the dimensions most relevant to a practitioner’s choice.
Table 1.3: Comparison of the three sequence architectures. Transformers trade a quadratic-in-$T$ attention cost for full parallelism and unit-length paths between any pair of positions, which is an excellent trade on modern accelerators and for the long sequences typical in language and high-frequency finance.
| | RNN | LSTM / GRU | Transformer |
|---|---|---|---|
| Hidden state | single $h_t$ | $h_t$ and $c_t$ | none per step; the residual stream across layers is an implicit state |
| Path length | $O(T)$ | $O(T)$ | $O(1)$ |
| Parallelism over $T$ | none | none | full (all positions at once) |
| Compute per layer | $O(T d^2)$ | $O(T d^2)$ | $O(T^2 d)$ |
| Memory per layer | $O(T d)$ | $O(T d)$ | $O(T^2 + T d)$ |
| Training stability | gradient decay/blow-up | much better, gated | stable with LN + residuals |
| Sweet spot | short sequences | mid-length, niche patterns | long context, massive parallelism |
A practical rule of thumb follows immediately. If the task is a specialized time-series problem with moderate history length and limited data, an LSTM remains a strong baseline. If context is long and accelerator-friendly parallelism matters, one should usually start with a Transformer.
1.11.5 Advanced Aside: In-Context Learning of an AR(1) Process
The material up to this point is the core Chapter 1 message; readers comfortable with the RNN → LSTM → Transformer summary may skip directly to the Chapter Summary. This subsection is an optional but illuminating detour: it shows why economists often find attention intuitive once they see it through a regression lens, and it is the analytical companion to notebook 09_Transformer_InContext_AR1.
A remarkable emergent property of large Transformers is in-context learning: the ability to solve a task at inference time, given only examples in the prompt, with no weight updates. A model pretrained on a universe of sequences can be shown a fresh series it has never seen and produce sensible next-step forecasts. For an economist this is striking: the Transformer behaves as though it runs a regression inside its forward pass, with the prompt playing the role of the training sample and the final token playing the role of the test point. Von Oswald et al. (2023) make this formal by showing that self-attention layers can implement gradient-based optimization internally, so a stack of such layers iteratively reduces an implicit loss.
Consider the simplest concrete setting, an autoregressive process of order 1:
$$y_t = \rho\, y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \ \text{i.i.d. with mean } 0 \text{ and variance } \sigma^2 .$$
If we provide a Transformer with a sequence $(y_1, \dots, y_T)$, it can predict $y_{T+1}$ by implicitly estimating $\rho$ from the history. There are two closely related but distinct perspectives. First, under suitable linear-attention parameterizations, self-attention layers can implement gradient-descent-like updates on least-squares objectives (Von Oswald et al., 2023); specialized to the AR(1) target above, this reads as descent on $\sum_{t=2}^{T} (y_t - \rho\, y_{t-1})^2$. Second, with ordinary softmax attention, one natural Q/K/V assignment behaves like a kernel smoother:
- Query: the current state $y_T$ (“where am I now?”),
- Key: the lagged state $y_{t-1}$ (“past states that preceded each outcome”),
- Value: the realized successor $y_t$ (“what came next”).
This is one particular parameterization that works because the AR(1) target is linear in the lagged state; with this choice the softmax-attention output becomes
$$\hat{y}_{T+1} = \sum_{t=2}^{T} \alpha_t\, y_t, \qquad \alpha_t = \frac{\exp(y_T\, y_{t-1})}{\sum_{s=2}^{T} \exp(y_T\, y_{s-1})}, \tag{1.45}$$
where the softmax weights $\alpha_t$ concentrate mass on those past states $y_{t-1}$ that look like the current state $y_T$.
1.11.5.0.1 Why this approximates $\rho\, y_T$, in three short steps.
Equation (1.45) is exactly a Nadaraya–Watson kernel regression of $y_t$ on $y_{t-1}$, evaluated at $y_{t-1} = y_T$, with kernel $\kappa(y_T, y_{t-1}) = \exp(y_T\, y_{t-1})$. The intuition is then standard:
1. Population fact. The AR(1) data-generating process implies $\mathbb{E}[y_t \mid y_{t-1}] = \rho\, y_{t-1}$ exactly. So the population conditional mean we are trying to estimate at $y_{t-1} = y_T$ is just $\rho\, y_T$.
2. Kernel smoother. The softmax attention output with query $y_T$, keys $y_{t-1}$, and values $y_t$ is a kernel-weighted average of the values $y_t$ at past time steps $t$, with the weights largest on the lagged states $y_{t-1}$ that the kernel scores as most similar to $y_T$. Provided enough past observations land near $y_T$ (so the kernel concentrates around $y_{t-1} \approx y_T$), this is the empirical Nadaraya–Watson estimator of $\mathbb{E}[y_t \mid y_{t-1} = y_T]$.
3. Conclusion. Combining the two, $\hat{y}_{T+1} \approx \rho\, y_T$. The shock variance $\sigma^2$ controls how much the realized $y_t$’s scatter around the conditional mean and therefore the variance of the kernel estimate, but it does not enter the Nadaraya–Watson location at first order.
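Note that the inner product used in the kernel is between scalars ($d_k = 1$), so the standard $1/\sqrt{d_k}$ scaling of attention plays no role here. The overall scale of the learned query and key projections still matters, however: an inner-product kernel up-weights lagged states that are large and share the sign of $y_T$, rather than those literally nearest to it. For a stationary Gaussian series, the weighted average of the lagged states approaches $y_T$ in a long prompt when that scale equals the inverse of the stationary variance, and because the AR(1) target is linear in the lagged state, nothing more is required.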
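The kernel-regression reading can be checked on simulated data without training any network. The sketch below is a hand-wired softmax-attention forecast, not a trained Transformer, and it rests on labeled assumptions: Gaussian shocks, a shock variance chosen so that the series has unit stationary variance (the case in which the unscaled inner-product kernel is correctly calibrated), a hypothetical query state of 1.0, and a long prompt of 50,000 observations. An OLS estimate of the AR(1) coefficient is printed for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, T = 0.5, 50_000
sigma = np.sqrt(1.0 - rho**2)        # assumption: unit stationary variance

# Simulate the AR(1) prompt y_1, ..., y_T with Gaussian shocks.
y = np.empty(T)
y[0] = rng.normal()
for t in range(1, T):
    y[t] = rho * y[t - 1] + sigma * rng.normal()

keys, values = y[:-1], y[1:]         # key = lagged state, value = its successor
query = 1.0                          # hypothetical current state

# Softmax attention with an unscaled scalar inner product as the score.
scores = query * keys
w = np.exp(scores - scores.max())
w /= w.sum()
attn_forecast = w @ values           # kernel-weighted average of successors

# Classical benchmark: OLS slope of y_t on y_{t-1} (no intercept).
rho_ols = (keys @ values) / (keys @ keys)

print("attention forecast :", attn_forecast)
print("rho * query        :", rho * query)
print("OLS rho * query    :", rho_ols * query)
```

If the shock variance is changed so that the stationary variance differs from one, the raw attention forecast is rescaled by that variance; a learned query–key scale would be needed to undo it.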
For the econometrician’s mental library the closest classical objects are the Nadaraya--Watson and local-linear estimators; the novelty is that the kernel is not hand-chosen but jointly learned with the data representation, over a large corpus of related tasks. This is a form of meta-learning: the network learned “how to regress” during pretraining, and at inference it runs that regression on a brand-new series. Self-attention can therefore implement optimization-like or regression-like computations internally, not merely local pattern matching.
1.11.5.0.2 Code examples.
The following Jupyter notebooks implement and extend the material in this chapter:
- 01_BasicML_intro: linear regression, classification, and loss functions.
- 02_GradientDescent_and_SGD: implementing and visualizing optimizers.
- 03_Double_Descent: the modern generalization regime.
- 04_Gentle_DNN: building a simple DNN from scratch.
- 05_Tensorboard: monitoring training progress.
- 06_PyTorch_intro: introduction to PyTorch fundamentals.
- 07_Genz_Approximation_and_Loss_Functions: high-dimensional integration using Genz test functions and robust losses.
- 08_MLP_LSTM_Transformer_Edgeworth_Cycles: a three-way comparison on the same Edgeworth-cycle task. An MLP with no access to the history collapses near the cycle mean (no memory); an LSTM with a 32-unit hidden state tracks the sawtooth almost exactly via its gated memory; a tiny encoder-only Transformer (two layers, four heads) attends to the full window in parallel and beats the MLP by an order of magnitude, while remaining slightly behind the LSTM on this small, highly periodic, low-data signal. This is a deliberate illustration that architectural inductive bias matters as much as flexibility on small problems. Quantitatively, the test-set MAE ranking that the notebook prints in its summary cell is, from lowest to highest error, LSTM, Transformer, MLP, with the Transformer typically within a small multiple of the LSTM and the MLP an order of magnitude worse; readers should consult the notebook’s printed table for the seed-specific numbers.
- 09_Transformer_InContext_AR1: advanced / optional notebook. A tiny 2-layer Transformer learns how to regress across many AR(1) draws; at inference it recovers $\rho$ in-context without weight updates, reproducing the analytical prediction above.
1.12 Further Reading
- Goodfellow et al. (2016): the standard graduate textbook covering everything in this chapter at greater depth.
- Schmidhuber (2015): a historical survey tracing the deep-learning lineage; useful for context on LSTMs, Highway Networks, and Fast Weight Programmers.
- Bishop (2006): pattern recognition from a Bayesian/statistical viewpoint; the natural complement for econometric readers.
- Kingma & Ba (2015) and Loshchilov & Hutter (2019): the canonical references on Adam and AdamW; read together they explain modern optimizer tuning.
- Bottou et al. (2018): a comprehensive survey of stochastic optimization for large-scale learning, including convergence rates.
1.13 Exercises
Worked solutions and guidance for these exercises appear in Appendix F.
Workload labels. Throughout the script, every exercise carries one of three workload tags inside its title. [Core] marks short analytical or pencil-and-paper questions suitable for a weekly problem set. [Computational] marks notebook-based exercises that involve running or modifying companion code; allow yourself a long evening or a weekend with verification gates and starter code in hand. [Advanced/project] marks longer, research-style assignments that may require a multi-day investment, a proper compute budget, or a small term-project plan. The labels are advisory rather than prescriptive: students with prior exposure can promote a [Computational] exercise to a quick warm-up, while those new to the material can treat several [Advanced/project] entries as inspiration for term work.
The idea that a network can compute its own weights from its inputs has a long history: the Fast Weight Programmers of Schmidhuber (1992) use one network to write the weights of another from context, which is widely viewed as a conceptual precursor of attention. Schlag et al. (2021) make the formal equivalence explicit, showing that linear-attention Transformers are Fast Weight Programmers.
- McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), 115–133.
- Hebb, D. O. (1949). The Organization of Behavior: A Neuropsychological Theory. Wiley.
- Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386–408.
- Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
- Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
- Robbins, H., & Monro, S. (1951). A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3), 400–407.
- Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural Networks and the Bias/Variance Dilemma. Neural Computation, 4(1), 1–58.
- Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR).
- Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd International Conference on Machine Learning (ICML), 448–456.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Schmidhuber, J. (2015). Deep Learning in Neural Networks: An Overview. Neural Networks, 61, 85–117.
- Maliar, L., Maliar, S., & Winant, P. (2021). Deep learning for solving dynamic economic models. Journal of Monetary Economics, 122, 76–101.
- Azinovic, M., Gaegauf, L., & Scheidegger, S. (2022). Deep Equilibrium Nets. International Economic Review, 63(4), 1471–1525. doi:10.1111/iere.12575