# Comparison Report: AlphaGo-Style Notebooks
Date: 2026-02-09
Subject: Comparison of website/experiments/companion-notebook-2-alphago.ipynb with Tom’s tom/kiyotaki_wright_alphago.ipynb
Paper: Marimon, McGrattan & Sargent, “Money as a Medium of Exchange in an Economy with Artificially Intelligent Agents,” Journal of Economic Dynamics and Control, 14 (1990), 329–373.
## 1. Executive Summary
Both notebooks apply modern deep learning / RL techniques to the Kiyotaki-Wright monetary exchange model, replacing the Holland classifier systems from the MMS (1990) paper. However, they take fundamentally different algorithmic approaches:
| | Companion Notebook 2 | Tom’s Notebook |
|---|---|---|
| Core Algorithm | MCTS + Policy-Value Networks (AlphaGo Zero) | Deep Q-Networks (DQN) |
| Decision Rule | Forward-looking tree search + neural net prior | Backward-looking Q-value maximization |
| Training Paradigm | Self-play → MCTS → Network training loop | Online experience replay + target networks |
| Economies Tested | A1, A2, B, C (fiat money) | A1, A1.2 (no A2, B, or C) |
| Key Finding | Discovers speculative equilibrium in A2 | — (A2 not tested) |
| Lines of Code | ~1,880 (35 cells) | ~930 (27 cells) |
| Dependencies | NumPy + Matplotlib only | NumPy + Matplotlib only |
Bottom line: Companion Notebook 2 implements the actual AlphaGo architecture (MCTS + policy-value networks), while Tom’s notebook implements DQN — a different (earlier) deep RL algorithm from the same era. Both are valid approaches to the research question, but they represent different points in the AlphaGo design space.
## 2. Algorithmic Architecture

### 2.1 Companion Notebook 2: AlphaGo Zero Architecture
Faithfully adapts the AlphaGo Zero pipeline:
**Combined Policy-Value Network (`PolicyValueNetwork`)**
- 2-layer MLP with a shared trunk (32, 16 hidden units)
- Separate output heads: sigmoid (policy) + tanh (value)
- Trained via backpropagation with a combined BCE + MSE loss plus L2 regularization
- Directly corresponds to AlphaGo’s dual-head network
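As a sketch, the shared trunk and dual heads amount to the following forward pass (NumPy; layer sizes from the description above, all variable names and initialization details illustrative rather than the notebook’s actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

def init(shape):
    # He-style initialization, appropriate for the ReLU trunk
    return rng.normal(0.0, np.sqrt(2.0 / shape[0]), shape)

# Shared trunk: input(2n=6) -> 32 -> 16, then two 1-unit heads
W1, W2 = init((6, 32)), init((32, 16))
w_pi, w_v = init((16, 1)), init((16, 1))

def forward(x):
    h = np.maximum(0.0, x @ W1)               # ReLU layer 1
    h = np.maximum(0.0, h @ W2)               # shared trunk output
    policy = 1.0 / (1.0 + np.exp(-(h @ w_pi)))  # sigmoid head: P(accept trade)
    value = np.tanh(h @ w_v)                    # tanh head: V(s) in [-1, 1]
    return policy.item(), value.item()

p, v = forward(np.ones(6))
```

The single sigmoid output suffices because the trade decision is binary; a softmax head would replace it for larger action sets.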
**Monte Carlo Tree Search (`MCTS` class)**
- PUCT formula for action selection: $a^* = \arg\max_a \left[ Q(s,a) + c_{\text{puct}} \, P(s,a) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right]$
- Heuristic rollout policy (always consume own good, random trades)
- Value network terminal evaluation
- Configurable simulation count and rollout depth
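PUCT selection over a node’s child statistics can be sketched as follows (array layout assumed for brevity; the notebook’s `MCTS` class presumably stores these per node):

```python
import numpy as np

def puct_select(Q, N, P, c_puct=1.5):
    """Pick the child maximizing Q + c_puct * P * sqrt(sum(N)) / (1 + N)."""
    U = c_puct * P * np.sqrt(N.sum()) / (1 + N)
    return int(np.argmax(Q + U))

# An unvisited action with a high prior beats a visited, mediocre one:
Q = np.array([0.2, 0.0])   # mean backed-up values
N = np.array([10, 0])      # visit counts
P = np.array([0.3, 0.7])   # network priors
puct_select(Q, N, P)  # -> 1
```

The exploration bonus `U` shrinks as a child accumulates visits, so search effort shifts toward actions whose backed-up `Q` values hold up.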
**Self-Play Training Loop (`train_alphago_agents`)**
- Iterative cycle: simulate economy → MCTS policy improvement → train networks
- Epsilon schedule from 0.5 → 0.1 over iterations
- EMA policy updates (α = 0.3) for stability
**World Model (`EconomyModel`)**
- Models partner trade behavior using estimated acceptance probabilities
- Tracks goods distribution for realistic MCTS rollouts
- Updated from simulation data each training iteration
### 2.2 Tom’s Notebook: Deep Q-Network (DQN) Architecture
Implements the DQN algorithm (Mnih et al., 2015):
**Q-Network (`SimpleNN`)**
- 3-layer MLP (input → 64 → 64 → output)
- ReLU activations with linear output for Q-values
- Adam optimizer (hand-coded with momentum terms)
- Separate networks for trade and consume decisions
**Experience Replay (`ReplayBuffer`)**
- Stores (state, action, reward, next_state, done) tuples
- Random sampling breaks temporal correlations
- Capacity: 10,000 transitions
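A minimal buffer of this shape (the class name matches the notebook; the internals are a guess at a standard implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions; uniform random sampling
    breaks the temporal correlation of consecutive simulation periods."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for t in range(5):                 # pushing past capacity drops the oldest
    buf.push(t, 0, 1.0, t + 1, False)
len(buf)  # -> 3
```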
**Target Networks (in `DQNAgent`)**
- Separate target Q-network updated every 100 steps
- Stabilizes the moving-target problem in Q-learning
**ε-Greedy Exploration**
- Epsilon decays from 1.0 → 0.05 multiplicatively (×0.9995 per step)
- Standard exploration strategy (no MCTS)
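The two stabilizers above, multiplicative ε decay with a floor and a hard target-network copy every 100 steps, can be sketched together (the dict is a toy stand-in for network weights; all names here are placeholders, not Tom’s identifiers):

```python
import copy

EPS_MIN, DECAY, SYNC_EVERY = 0.05, 0.9995, 100

def anneal(eps):
    # Multiplicative decay, floored at EPS_MIN
    return max(EPS_MIN, eps * DECAY)

q_params = {"W": [0.1, 0.2]}              # stand-in for Q-network weights
target_params = copy.deepcopy(q_params)   # target starts as an exact copy

eps = 1.0
for step in range(1, 10_001):
    eps = anneal(eps)
    if step % SYNC_EVERY == 0:            # hard update every 100 steps
        target_params = copy.deepcopy(q_params)

eps  # -> 0.05 (0.9995**10000 ≈ 0.0067, so the floor binds)
```

The hard copy (rather than a soft Polyak average) matches the "updated every 100 steps" description above.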
### 2.3 Key Algorithmic Differences
| Feature | Companion NB2 (MCTS+PV) | Tom’s NB (DQN) |
|---|---|---|
| Planning | Forward search (MCTS simulates future periods) | No planning (greedy Q-value maximization) |
| Policy representation | Explicit policy network P(a|s) | Implicit via argmax Q(s,a) |
| Value estimation | Value network V(s) + MCTS backup | Q(s,a) with target network |
| Training signal | MCTS-improved policy & value targets | Bellman error from replay buffer |
| Optimizer | SGD | Adam (hand-coded) |
| Network size | 2 layers (32, 16) with dual head | 3 layers (64, 64) separate trade/consume |
| Temporal reasoning | Explicit rollouts (15 periods ahead) | Only through discounted Q-values (γ=0.95) |
| Exploration | PUCT + epsilon schedule (0.5→0.1) | ε-greedy (1.0→0.05) |
## 3. Neural Network Comparison

### 3.1 Architecture
Companion NB2 — `PolicyValueNetwork`:

```
Input(2n) → Dense(32) → ReLU → Dense(16) → ReLU → ┬─ Dense(1) → Sigmoid  [policy head]
                                                  └─ Dense(1) → Tanh     [value head]
```

- Parameters: ~680 (for n_goods=3, input_dim=6)
- Xavier/He initialization
- SGD optimizer with L2 regularization (λ=1e-4)
Tom’s — `SimpleNN`:

```
Input(2n) → Dense(64) → ReLU → Dense(64) → ReLU → Dense(2)  [Q-values: refuse/accept]
```

- Parameters: ~4,800 (for n_goods=3, input_dim=6)
- Xavier initialization
- Adam optimizer (β1=0.9, β2=0.999, ε=1e-8)
### 3.2 Training
| Aspect | Companion NB2 | Tom’s NB |
|---|---|---|
| Training data | MCTS-generated (state, π*, V*) triples | Experience replay transitions (s,a,r,s’,done) |
| Loss function | BCE(policy) + MSE(value) + L2 | MSE(Q-target) only |
| Batch size | Full state space per iteration (~9-12 states) | 64 random transitions from replay buffer |
| Training schedule | 10 epochs × 40 iterations = 400 updates | 1 update per simulation period |
| Pre-training | Yes — 300 epochs of domain knowledge | No pre-training |
| Learning rate | 0.005 | 0.001 |
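The Companion NB2 loss can be sketched as follows (NumPy; the equal weighting of the two heads is an assumption, and the notebook may scale them differently):

```python
import numpy as np

def combined_loss(p_pred, p_target, v_pred, v_target, weights, l2=1e-4):
    """BCE on the policy head + MSE on the value head + L2 penalty,
    in the spirit of the PolicyValueNetwork loss described above."""
    eps = 1e-12  # guard against log(0)
    bce = -np.mean(p_target * np.log(p_pred + eps)
                   + (1 - p_target) * np.log(1 - p_pred + eps))
    mse = np.mean((v_pred - v_target) ** 2)
    reg = l2 * sum(np.sum(w ** 2) for w in weights)
    return bce + mse + reg

loss = combined_loss(
    p_pred=np.array([0.9, 0.2]), p_target=np.array([1.0, 0.0]),
    v_pred=np.array([0.5, -0.4]), v_target=np.array([0.6, -0.5]),
    weights=[np.ones((2, 2))],
)  # a small positive scalar
```

The Tom’s-NB counterpart is just the `mse` term, computed against a Bellman target built from the target network.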
## 4. Economy Model & Simulation

### 4.1 Companion NB2
Two-phase architecture:
1. **Training phase (`train_alphago_agents`)**
   - Iterative self-play + MCTS + network training
   - 40 iterations, ~300 evaluation periods each
   - Takes ~40-65 seconds per economy
2. **Evaluation phase (`KWAlphaGoSimulation`)**
   - Runs trained agents for 1000-2000 periods
   - No exploration (pure exploitation)
   - Records full holdings distribution history
Key features:
- `EconomyModel` class provides a world model for MCTS rollouts
- Partner acceptance probabilities estimated from simulation data
- Population goods distribution tracked and updated
- Holdings initialized with production goods (economic realism)
- Fiat money support (Economy C with 4 goods)
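The acceptance-probability estimate could look like the following sketch; the class and method names are hypothetical, and only the idea of estimating partner acceptance frequencies from simulation data is taken from the notebook description:

```python
from collections import defaultdict

class AcceptanceModel:
    """Empirical estimate of P(partner of type t accepts good g),
    refreshed from observed trade offers each training iteration.
    (Hypothetical sketch; the notebook's EconomyModel may differ.)"""
    def __init__(self):
        self.offers = defaultdict(int)   # (partner_type, good) -> # offers seen
        self.accepts = defaultdict(int)  # (partner_type, good) -> # accepted

    def update(self, partner_type, good, accepted):
        self.offers[(partner_type, good)] += 1
        self.accepts[(partner_type, good)] += int(accepted)

    def prob(self, partner_type, good):
        n = self.offers[(partner_type, good)]
        # Laplace smoothing: an uninformative 0.5 prior before any data
        return (self.accepts[(partner_type, good)] + 1) / (n + 2)

m = AcceptanceModel()
for accepted in (True, True, False):
    m.update(partner_type=2, good=1, accepted=accepted)
m.prob(2, 1)  # -> 0.6  (= (2+1)/(3+2))
```

The smoothing keeps early MCTS rollouts from treating never-offered trades as impossible.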
### 4.2 Tom’s Notebook
Single-phase architecture:
- Training and evaluation happen simultaneously
- Agents learn online during the 2000-period simulation
- ε-greedy exploration decays as the simulation progresses
Key features:
- `KiyotakiWrightDQN` class handles both simulation and training
- No separate economy model; agents learn directly from interactions
- Random initial holdings (not production goods)
- No fiat money support (only 3-good economies)
### 4.3 Simulation Differences
| Feature | Companion NB2 | Tom’s NB |
|---|---|---|
| Agent sharing | 1 agent per type (shared policy) | 1 DQN per type (shared Q-networks) |
| Initial holdings | Production goods | Random |
| Consumption logic | Only consume own consumption good | Only consume own consumption good |
| Reward structure | u_i - s(new_good) or -s(holding) | Same |
| Fiat money | Supported (Economy C) | Not supported |
| Simulation length | 1000-2000 eval periods | 2000 periods total |
## 5. Experimental Coverage

### 5.1 Companion NB2
| Economy | Storage Costs | Utility | Production | Result | MMS Match? |
|---|---|---|---|---|---|
| A1 | (0.1, 1, 20) | 100 | [1,2,0] | Fundamental | ✓ Same |
| A2 | (0.1, 1, 20) | 500 | [1,2,0] | Speculative | ✗ Different |
| B | (1, 4, 9) | 100 | [2,0,1] | Mixed | ≈ Similar |
| C | (9, 14, 29, 0) | 100 | [1,2,0]+fiat | Fiat Money | ✓ Same |
Also includes:
- Comparative analysis tables (AlphaGo vs. MMS for all economies)
- Visual comparison bar charts
- Summary table with automatic equilibrium classification
- Detailed conclusions with historical perspective
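A threshold rule of roughly this shape would suffice for the automatic classification (the 0.2 threshold is a hypothetical illustration, not the notebook’s actual criterion):

```python
def classify_equilibrium(p_type1_good3, threshold=0.2):
    """Label the equilibrium from Type 1's frequency of holding Good 3.
    In the fundamental equilibrium Type 1 avoids the high-storage-cost
    good; holding it at non-trivial frequency indicates speculative play."""
    return "speculative" if p_type1_good3 >= threshold else "fundamental"

classify_equilibrium(0.36)  # -> 'speculative' (the A2 frequency reported in section 8.2)
classify_equilibrium(0.00)  # -> 'fundamental'
```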
### 5.2 Tom’s Notebook
| Economy | Storage Costs | Utility | Production | Result |
|---|---|---|---|---|
| A1 | (0.1, 1, 20) | 100 | [1,2,0] | Has stored outputs (not re-executed) |
| A1.2 | (0.1, 1, 30) | 100 | [1,2,0] | Has stored outputs (not re-executed) |
Also includes:
- Comparison function (`plot_comparison`) for Holland vs. DQN (requires the classifier sim object)
- Steady-state computation function
- Exploration rate visualization
- Analysis markdown comparing DQN vs. classifier advantages/disadvantages
Missing from Tom’s notebook:
- Economy A2 (the critical test of fundamental vs. speculative)
- Economy B (alternative production structure)
- Economy C (fiat money)
- Actual comparison with classifier-system results (the comparison function exists but requires companion notebook 1 data)
## 6. Faithfulness to AlphaGo

### 6.1 AlphaGo’s Core Components (Silver et al., 2016, 2017)
| AlphaGo Component | Companion NB2 | Tom’s NB |
|---|---|---|
| Policy network π(a|s) | ✅ Explicit policy head (sigmoid) | ❌ No policy network (implicit via Q) |
| Value network V(s) | ✅ Explicit value head (tanh) | ❌ No value network (Q serves dual role) |
| MCTS | ✅ Full MCTS with PUCT formula | ❌ Not implemented |
| Self-play | ✅ Economy simulation = self-play | ✅ Agent interactions = self-play |
| Training from MCTS | ✅ Networks trained on MCTS targets | ❌ Networks trained on Bellman error |
| Combined policy-value net | ✅ Shared trunk, two heads | ❌ Separate Q-networks |
Assessment:
- Companion NB2 implements 5/6 core AlphaGo components faithfully
- Tom’s notebook implements DQN (Mnih et al., 2015), a predecessor technology: the “deep neural network” part of AlphaGo but not the “tree search” part
### 6.2 What Tom’s DQN Does Have from the AlphaGo Era
While not MCTS-based, Tom’s DQN does include important innovations from the same research lineage:
- Experience replay (Mnih et al., 2015), also used in AlphaGo’s supervised learning phase
- Target networks: a stabilization technique used throughout deep RL
- Neural function approximation: the foundational deep RL contribution
- Adam optimizer: more sophisticated than the SGD used in Companion NB2
- 3-layer network: a deeper architecture (though the problem doesn’t require it)
The intro markdown in Tom’s notebook explicitly frames it as “AlphaGo-style” and lists {DQN, experience replay, target networks, ε-greedy} as the AlphaGo principles being applied. This is partially accurate — these are components used within AlphaGo’s training pipeline, but the signature innovation of AlphaGo (MCTS + policy-value networks) is not present.
## 7. Code Quality & Documentation

### 7.1 Companion NB2
Strengths:
- Extensive markdown explaining each component’s connection to AlphaGo
- Mathematical notation (PUCT formula, loss functions, etc.)
- Detailed docstrings with parameter descriptions
- Clear separation of training and evaluation phases
- Automatic equilibrium classification with economic reasoning
- Comprehensive comparison with MMS results
Areas for improvement:
- Dense code cells (~200 lines for MCTS, ~220 lines for AlphaGoAgent)
- Some complexity in the heuristic rollout policy (a workaround for the coordination problem)
- EMA policy smoothing adds complexity that wouldn’t be needed in a perfect implementation
### 7.2 Tom’s Notebook
Strengths:
- Clean, well-structured code
- Adam optimizer implementation is thorough and correct
- Good separation of concerns (NN, ReplayBuffer, Agent, Simulation)
- Comprehensive visualization functions (exploration rate plot, side-by-side comparison)
- Clear hyperparameter configuration in `EconomyConfig`
Areas for improvement:
- Only tests 2 of the 4 key economies from the paper
- Missing Economy A2 (the most interesting test case)
- Missing fiat money (Economy C, the paper’s key finding)
- `plot_comparison` function requires classifier-system data not available in the notebook
- Conclusions claim “fundamental equilibrium emerges,” but only one parameter set was tested
- Title says “AlphaGo-Style,” but the notebook implements DQN (a distinctly different algorithm)
## 8. Results Comparison

### 8.1 Economy A1 (the baseline test)
Both notebooks should produce similar results for A1, since it’s the simplest case with a unique fundamental equilibrium. Both predict convergence to:
- Type 1 mostly holds Good 2 (its production good)
- Type 2 holds a mix of its production good (Good 3) and Good 1 (medium of exchange)
- Type 3 mostly holds Good 1 (not its production good; it uses it as money)
Companion NB2 results (from execution):
```
π_i^h(j) | j=1    j=2    j=3
i=1      | 0.020  0.980  0.000   → holds production good, trades directly
i=2      | 0.620  0.060  0.320   → holds Good 1 as money (62%)
i=3      | 0.820  0.180  0.000   → holds Good 1 as money (82%)
```

Classification: Fundamental ✓
Tom’s NB results: Stored outputs available but not re-executed in current session. Based on the code and hyperparameters, expected to produce similar results.
### 8.2 Economy A2 (the critical test)
Only tested in Companion NB2.
This is the most scientifically interesting experiment — it tests whether AlphaGo’s forward-looking MCTS can overcome the backward-looking myopia that trapped the MMS classifier system in the fundamental equilibrium.
- Companion NB2 result: Speculative equilibrium (Type 1 holds 36% Good 3)
- This differs from MMS and is attributed to MCTS’s ability to simulate future periods
Tom’s NB: Not tested. This is a significant omission, as Economy A2’s fundamental vs. speculative contest is the central question of the paper.
### 8.3 Economies B and C
Only tested in Companion NB2.
- Economy B: Mixed equilibrium
- Economy C: Fiat money emerges (~32% of holdings), matching MMS
## 9. Summary Table
| Dimension | Companion NB2 | Tom’s NB | Winner |
|---|---|---|---|
| Algorithm fidelity to AlphaGo | MCTS + Policy-Value Net | DQN (no MCTS) | Companion NB2 |
| Experimental coverage | 4 economies (A1, A2, B, C) | 2 economies (A1, A1.2) | Companion NB2 |
| Novel finding | Speculative eq. in A2 | — | Companion NB2 |
| Fiat money | Yes (Economy C) | No | Companion NB2 |
| Code cleanliness | Good but dense | Clean and modular | Tom’s NB |
| Optimizer | SGD | Adam | Tom’s NB |
| Network depth | 2 layers | 3 layers | Tom’s NB |
| Training stability | EMA + pre-training (workarounds needed) | Experience replay + target networks | Comparable |
| MMS comparison | Detailed tables + charts | Framework only (needs classifier data) | Companion NB2 |
| Mathematical exposition | Extensive (PUCT, loss functions) | Moderate | Companion NB2 |
## 10. Recommendations for Merging
Tom’s notebook could be enhanced by:
- Adding Economy A2 (the fundamental vs. speculative test)
- Adding Economy C (fiat money)
- Adding MCTS on top of DQN for planning, making it truly “AlphaGo-style”
- Connecting to companion-notebook-1 for actual comparison data
Companion NB2 could be improved by:
- Adopting the Adam optimizer from Tom’s implementation
- Using a 3-layer network for more capacity
- Adding experience replay as a complement to MCTS training
- Breaking large code cells into smaller, more modular cells
Both notebooks would benefit from:
- Running multiple random seeds for statistical robustness
- Reporting confidence intervals on equilibrium classifications
- Systematic hyperparameter sensitivity analysis