
Comparison Report: AlphaGo-Style Notebooks

Date: 2026-02-09
Subject: Comparison of website/experiments/companion-notebook-2-alphago.ipynb with Tom’s tom/kiyotaki_wright_alphago.ipynb
Paper: “Money as a Medium of Exchange in an Economy with Artificially Intelligent Agents,” Journal of Economic Dynamics and Control, 14, 329–373.


1. Executive Summary

Both notebooks apply modern deep learning / RL techniques to the Kiyotaki-Wright monetary exchange model, replacing the Holland classifier systems from the MMS (1990) paper. However, they take fundamentally different algorithmic approaches:

| | Companion Notebook 2 | Tom's Notebook |
|---|---|---|
| Core Algorithm | MCTS + Policy-Value Networks (AlphaGo Zero) | Deep Q-Networks (DQN) |
| Decision Rule | Forward-looking tree search + neural net prior | Backward-looking Q-value maximization |
| Training Paradigm | Self-play → MCTS → Network training loop | Online experience replay + target networks |
| Economies Tested | A1, A2, B, C (fiat money) | A1, A1.2 (no A2, B, or C) |
| Key Finding | Discovers speculative equilibrium in A2 | — (A2 not tested) |
| Lines of Code | ~1,880 (35 cells) | ~930 (27 cells) |
| Dependencies | NumPy + Matplotlib only | NumPy + Matplotlib only |

Bottom line: Companion Notebook 2 implements the actual AlphaGo architecture (MCTS + policy-value networks), while Tom’s notebook implements DQN — a different (earlier) deep RL algorithm from the same era. Both are valid approaches to the research question, but they occupy different points in the deep RL design space.


2. Algorithmic Architecture

2.1 Companion Notebook 2: AlphaGo Zero Architecture

Faithfully adapts the AlphaGo Zero pipeline:

  1. Combined Policy-Value Network (PolicyValueNetwork)

    • 2-layer MLP with shared trunk (32, 16 hidden units)

    • Separate output heads: sigmoid (policy) + tanh (value)

    • Trained via backpropagation with combined BCE + MSE loss + L2 regularization

    • Directly corresponds to AlphaGo’s dual-head network

  2. Monte Carlo Tree Search (MCTS class)

    • PUCT formula for action selection: $a^* = \arg\max_a \left[ Q(s,a) + c_{\text{puct}} \cdot P(s,a) \cdot \frac{\sqrt{N(s)}}{1+N(s,a)} \right]$

    • Heuristic rollout policy (always consume own good, random trades)

    • Value network terminal evaluation

    • Configurable simulation count and rollout depth

  3. Self-Play Training Loop (train_alphago_agents)

    • Iterative cycle: simulate economy → MCTS policy improvement → train networks

    • Epsilon schedule from 0.5 → 0.1 over iterations

    • EMA policy updates (α=0.3) for stability

  4. World Model (EconomyModel)

    • Models partner trade behavior using estimated acceptance probabilities

    • Tracks goods distribution for realistic MCTS rollouts

    • Updated from simulation data each training iteration
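The PUCT selection in step 2 above can be sketched in NumPy (function and variable names are illustrative, not the notebook's `MCTS` class API; `c_puct=1.5` is an assumed value):

```python
import numpy as np

def puct_select(Q, P, N, c_puct=1.5):
    """Choose a* = argmax_a [ Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a)) ].

    Q: mean action values from MCTS backups, P: policy-network priors,
    N: per-action visit counts. All names are illustrative."""
    scores = Q + c_puct * P * np.sqrt(N.sum()) / (1.0 + N)
    return int(np.argmax(scores))

# At a node with visit counts [4, 1]: the prior bonus shrinks with visits,
# so a less-visited action can win despite a modest prior.
Q = np.array([0.1, 0.3])   # e.g. refuse / accept a proposed trade
P = np.array([0.7, 0.3])
N = np.array([4, 1])
a_star = puct_select(Q, P, N)  # selects action 1 here
```

The division by `1 + N(s,a)` is what makes the search self-balancing: once an action has been visited often, its score is driven mainly by the backed-up value `Q` rather than the prior.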

2.2 Tom’s Notebook: Deep Q-Network (DQN) Architecture

Implements the DQN algorithm (Mnih et al., 2015):

  1. Q-Network (SimpleNN)

    • 3-layer MLP (input → 64 → 64 → output)

    • ReLU activations with linear output for Q-values

    • Adam optimizer (hand-coded with momentum terms)

    • Separate networks for trade and consume decisions

  2. Experience Replay (ReplayBuffer)

    • Stores (state, action, reward, next_state, done) tuples

    • Random sampling breaks temporal correlations

    • Capacity: 10,000 transitions

  3. Target Networks (in DQNAgent)

    • Separate target Q-network updated every 100 steps

    • Stabilizes the moving target problem in Q-learning

  4. ε-Greedy Exploration

    • Epsilon decays from 1.0 → 0.05 multiplicatively (×0.9995 per step)

    • Standard exploration strategy (no MCTS)
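The replay-buffer/target-network update in items 2–3 can be sketched as follows (function and variable names are illustrative, not Tom's `DQNAgent` API):

```python
import numpy as np

GAMMA = 0.95  # discount factor from the notebook

def dqn_targets(q_next_target, rewards, dones, gamma=GAMMA):
    """Bellman targets y = r + gamma * max_a' Q_target(s', a') * (1 - done),
    computed from the frozen target network. Names are illustrative."""
    max_next = q_next_target.max(axis=1)
    return rewards + gamma * max_next * (1.0 - dones)

# A replay-buffer minibatch of 3 transitions (2 actions: refuse/accept):
q_next = np.array([[0.2, 0.5],
                   [1.0, 0.0],
                   [0.3, 0.4]])
r = np.array([1.0, -0.1, 0.0])
done = np.array([0.0, 0.0, 1.0])  # terminal transition bootstraps nothing
y = dqn_targets(q_next, r, done)  # [1.475, 0.85, 0.0]
```

The online network is then regressed toward `y` with an MSE loss on the taken actions. Incidentally, the ×0.9995 multiplicative decay takes ε from 1.0 to its 0.05 floor after roughly ln(0.05)/ln(0.9995) ≈ 5,990 steps.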

2.3 Key Algorithmic Differences

| Feature | Companion NB2 (MCTS+PV) | Tom's NB (DQN) |
|---|---|---|
| Planning | Forward search (MCTS simulates future periods) | No planning (greedy Q-value maximization) |
| Policy representation | Explicit policy network P(a\|s) | Implicit via argmax Q(s,a) |
| Value estimation | Value network V(s) + MCTS backup | Q(s,a) with target network |
| Training signal | MCTS-improved policy & value targets | Bellman error from replay buffer |
| Optimizer | SGD | Adam (hand-coded) |
| Network size | 2 layers (32, 16) with dual head | 3 layers (64, 64), separate trade/consume |
| Temporal reasoning | Explicit rollouts (15 periods ahead) | Only through discounted Q-values (γ=0.95) |
| Exploration | PUCT + epsilon schedule (0.5→0.1) | ε-greedy (1.0→0.05) |

3. Neural Network Comparison

3.1 Architecture

Companion NB2 — PolicyValueNetwork:

```
Input(2n) → Dense(32) → ReLU → Dense(16) → ReLU → ┬─ Dense(1) → Sigmoid  [policy head]
                                                  └─ Dense(1) → Tanh     [value head]
```
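A minimal NumPy sketch of this dual-head forward pass (parameter names, initialization, and the 2n one-hot state encoding are assumptions; the notebook's exact code may differ):

```python
import numpy as np

def pv_forward(x, params):
    """Shared trunk (Dense(32)-ReLU -> Dense(16)-ReLU) with a sigmoid policy
    head and a tanh value head. Parameter layout is an assumption."""
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])   # Dense(32) + ReLU
    h = np.maximum(0.0, h @ params["W2"] + params["b2"])   # Dense(16) + ReLU
    p = 1.0 / (1.0 + np.exp(-(h @ params["Wp"] + params["bp"])[0]))  # P(accept)
    v = np.tanh((h @ params["Wv"] + params["bv"])[0])                # value in [-1, 1]
    return p, v

rng = np.random.default_rng(0)
n = 3  # goods; assume the state is a 2n one-hot of (agent type, current holding)
params = {
    "W1": rng.normal(0, 0.1, (2 * n, 32)), "b1": np.zeros(32),
    "W2": rng.normal(0, 0.1, (32, 16)),    "b2": np.zeros(16),
    "Wp": rng.normal(0, 0.1, (16, 1)),     "bp": np.zeros(1),
    "Wv": rng.normal(0, 0.1, (16, 1)),     "bv": np.zeros(1),
}
x = np.zeros(2 * n)
x[0], x[n + 1] = 1.0, 1.0   # type 1 currently holding good 2
p, v = pv_forward(x, params)
```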

Tom’s — SimpleNN:

```
Input(2n) → Dense(64) → ReLU → Dense(64) → ReLU → Dense(2)  [Q-values: refuse/accept]
```

3.2 Training

| Aspect | Companion NB2 | Tom's NB |
|---|---|---|
| Training data | MCTS-generated (state, π*, V*) triples | Experience replay transitions (s, a, r, s′, done) |
| Loss function | BCE(policy) + MSE(value) + L2 | MSE(Q-target) only |
| Batch size | Full state space per iteration (~9–12 states) | 64 random transitions from replay buffer |
| Training schedule | 10 epochs × 40 iterations = 400 updates | 1 update per simulation period |
| Pre-training | Yes — 300 epochs of domain knowledge | No pre-training |
| Learning rate | 0.005 | 0.001 |
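Companion NB2's combined loss can be sketched as below (names and the L2 coefficient are illustrative, not the notebook's actual code):

```python
import numpy as np

def pv_loss(p_pred, v_pred, pi_target, v_target, weights, l2=1e-4):
    """Combined loss from the table above: BCE on the policy head,
    MSE on the value head, plus L2 weight decay. Names are illustrative."""
    eps = 1e-12  # avoid log(0)
    bce = -np.mean(pi_target * np.log(p_pred + eps)
                   + (1.0 - pi_target) * np.log(1.0 - p_pred + eps))
    mse = np.mean((v_pred - v_target) ** 2)
    reg = l2 * sum(np.sum(W ** 2) for W in weights)
    return bce + mse + reg

# An uninformative policy prediction against a sure target costs ln 2 nats:
loss = pv_loss(np.array([0.5]), np.array([0.0]),
               np.array([1.0]), np.array([0.0]), weights=[])
```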

4. Economy Model & Simulation

4.1 Companion NB2

Two-phase architecture:

  1. Training phase (train_alphago_agents)

    • Iterative self-play + MCTS + network training

    • 40 iterations, ~300 evaluation periods each

    • Takes ~40-65 seconds per economy

  2. Evaluation phase (KWAlphaGoSimulation)

    • Runs trained agents for 1000-2000 periods

    • No exploration (pure exploitation)

    • Records full holdings distribution history

Key features:

4.2 Tom’s Notebook

Single-phase architecture:

Key features:

4.3 Simulation Differences

| Feature | Companion NB2 | Tom's NB |
|---|---|---|
| Agent sharing | 1 agent per type (shared policy) | 1 DQN per type (shared Q-networks) |
| Initial holdings | Production goods | Random |
| Consumption logic | Only consume own consumption good | Only consume own consumption good |
| Reward structure | u_i − s(new_good) or −s(holding) | Same |
| Fiat money | Supported (Economy C) | Not supported |
| Simulation length | 1000–2000 eval periods | 2000 periods total |
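The shared reward rule in the table can be sketched as follows, assuming Economy A1's storage costs and hypothetical names (`period_reward`, `s_A1`):

```python
def period_reward(consumed, end_holding, u, s):
    """Per-period reward as in the table: u - s(new production good) when the
    agent consumed this period, otherwise -s(current holding).

    Names and indexing are illustrative, not the notebooks' actual code."""
    if consumed:
        # Consumption is followed by producing the production good,
        # whose storage cost is paid this period.
        return u - s[end_holding]
    return -s[end_holding]

s_A1 = {1: 0.1, 2: 1.0, 3: 20.0}  # Economy A1 storage costs from Section 5.1
r_consume = period_reward(True, end_holding=2, u=100, s=s_A1)   # 100 - 1.0
r_store = period_reward(False, end_holding=3, u=100, s=s_A1)    # -20.0
```

Note how the high storage cost of Good 3 makes holding it a large per-period penalty, which is exactly what a backward-looking learner must overcome to discover the speculative equilibrium.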

5. Experimental Coverage

5.1 Companion NB2

| Economy | Storage Costs | Utility | Production | Result | MMS Match? |
|---|---|---|---|---|---|
| A1 | (0.1, 1, 20) | 100 | [1,2,0] | Fundamental | ✓ Same |
| A2 | (0.1, 1, 20) | 500 | [1,2,0] | Speculative | ✗ Different |
| B | (1, 4, 9) | 100 | [2,0,1] | Mixed | ≈ Similar |
| C | (9, 14, 29, 0) | 100 | [1,2,0]+fiat | Fiat Money | ✓ Same |

Also includes:

5.2 Tom’s Notebook

| Economy | Storage Costs | Utility | Production | Result |
|---|---|---|---|---|
| A1 | (0.1, 1, 20) | 100 | [1,2,0] | Has stored outputs (not re-executed) |
| A1.2 | (0.1, 1, 30) | 100 | [1,2,0] | Has stored outputs (not re-executed) |

Also includes:

Missing from Tom’s notebook:


6. Faithfulness to AlphaGo

6.1 AlphaGo’s Core Components (Silver et al., 2016, 2017)

| AlphaGo Component | Companion NB2 | Tom's NB |
|---|---|---|
| Policy network π(a\|s) | ✅ Explicit policy head (sigmoid) | ❌ No policy network (implicit via Q) |
| Value network V(s) | ✅ Explicit value head (tanh) | ❌ No value network (Q serves dual role) |
| MCTS | ✅ Full MCTS with PUCT formula | ❌ Not implemented |
| Self-play | ✅ Economy simulation = self-play | ✅ Agent interactions = self-play |
| Training from MCTS | ✅ Networks trained on MCTS targets | ❌ Networks trained on Bellman error |
| Combined policy-value net | ✅ Shared trunk, two heads | ❌ Separate Q-networks |

Assessment:

6.2 What Tom’s DQN Does Have from the AlphaGo Era

While not MCTS-based, Tom’s DQN does include important innovations from the same research lineage:

The intro markdown in Tom’s notebook explicitly frames it as “AlphaGo-style” and lists {DQN, experience replay, target networks, ε-greedy} as the AlphaGo principles being applied. This is partially accurate — these are components used within AlphaGo’s training pipeline, but the signature innovation of AlphaGo (MCTS + policy-value networks) is not present.


7. Code Quality & Documentation

7.1 Companion NB2

Strengths:

Areas for improvement:

7.2 Tom’s Notebook

Strengths:

Areas for improvement:


8. Results Comparison

8.1 Economy A1 (the baseline test)

Both notebooks should produce similar results for A1, since it is the simplest case with a unique fundamental equilibrium, to which both are expected to converge.

Companion NB2 results (from execution):

```
π_i^h(j) | j=1     j=2     j=3
  i=1    | 0.020   0.980   0.000    → holds production good, trades directly
  i=2    | 0.620   0.060   0.320    → holds Good 1 as money (62%)
  i=3    | 0.820   0.180   0.000    → holds Good 1 as money (82%)
```

Classification: Fundamental

Tom’s NB results: Stored outputs available but not re-executed in current session. Based on the code and hyperparameters, expected to produce similar results.

8.2 Economy A2 (the critical test)

Only tested in Companion NB2.

This is the most scientifically interesting experiment — it tests whether AlphaGo’s forward-looking MCTS can overcome the backward-looking myopia that trapped the MMS classifier system in the fundamental equilibrium.

Companion NB2 result: Speculative equilibrium (Type 1 holds 36% Good 3)

Tom’s NB: Not tested. This is a significant omission, as Economy A2’s fundamental vs. speculative contest is the central question of the paper.

8.3 Economies B and C

Only tested in Companion NB2.


9. Summary Table

| Dimension | Companion NB2 | Tom's NB | Winner |
|---|---|---|---|
| Algorithm fidelity to AlphaGo | MCTS + Policy-Value Net | DQN (no MCTS) | Companion NB2 |
| Experimental coverage | 4 economies (A1, A2, B, C) | 2 economies (A1, A1.2) | Companion NB2 |
| Novel finding | Speculative eq. in A2 | — | Companion NB2 |
| Fiat money | Yes (Economy C) | No | Companion NB2 |
| Code cleanliness | Good but dense | Clean and modular | Tom's NB |
| Optimizer | SGD | Adam | Tom's NB |
| Network depth | 2 layers | 3 layers | Tom's NB |
| Training stability | EMA + pre-training (workarounds needed) | Experience replay + target networks | Comparable |
| MMS comparison | Detailed tables + charts | Framework only (needs classifier data) | Companion NB2 |
| Mathematical exposition | Extensive (PUCT, loss functions) | Moderate | Companion NB2 |

10. Recommendations for Merging

  1. Tom’s notebook could be enhanced by:

    • Adding Economy A2 (fundamental vs. speculative test)

    • Adding Economy C (fiat money)

    • Adding MCTS on top of DQN for planning (making it truly “AlphaGo-style”)

    • Connecting to companion-notebook-1 for actual comparison data

  2. Companion NB2 could be improved by:

    • Adopting Adam optimizer from Tom’s implementation

    • Using a 3-layer network for more capacity

    • Adding experience replay as a complement to MCTS training

    • Breaking large code cells into smaller, more modular cells

  3. Both notebooks would benefit from:

    • Running multiple random seeds for statistical robustness

    • Reporting confidence intervals on equilibrium classifications

    • Systematic hyperparameter sensitivity analysis