This is the repo for Sable, a very strong Othello engine, and OthelloZero, the high-performance Reinforcement Learning (RL) framework used to train it. This project was developed for the CS 234 Final Project at Stanford University.
In this project, we approach the AlphaZero replication problem as a systems architecture problem rather than a learning one. The resulting system features a highly optimized search-and-evaluation pipeline capable of self-play at scale, as well as a very strong engine trained over a week of self-play. We hope this code is helpful for other Othello engine developers.
After a week of self-play (~1.2 billion positions, ~500 billion rollouts), Sable is quite strong and, at most settings, superhuman. We evaluate it against Edax: at ~400k nodes per move, Sable is roughly equal in strength to Edax at depth 21, and at 100 to 1000 rollouts per move, it is comparable to Edax at depths 5-11.
To facilitate large-scale reinforcement learning, we utilize a decoupled Hub-and-Spoke architecture leveraging Google Cloud Storage (GCS) as a high-throughput message bus between the training head and worker nodes.
The system coordinates state through two primary buckets:
- Model Bucket (in our case, gs://othello-models): Acts as the source of models, executables, and scripts for the fleet.
  - Hot-Swapping: Stores the C++ selfplay binaries and ONNX Runtime libraries, allowing for logic updates without rebuilding VM images.
  - Weight Registry: Houses latest_model.onnx and a /history/ directory for "Boss Mode" (League Play).
  - Dynamic Orchestration: Workers pull the worker.sh control script every loop, enabling real-time global adjustments to search parameters (rollouts, noise, temperature). Simply modify the script and reupload it to the bucket.
- Data Bucket (in our case, gs://othello-data): A write-heavy sink for experience collection.
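The coordination pattern above can be sketched as a single worker iteration. This is an illustration under stated assumptions, not the repository's worker.sh or selfplay binary: `InMemoryBucket` stands in for a GCS bucket, `worker_settings.json` is a hypothetical stand-in for the control script, and `run_selfplay` and the `games/...` object naming are invented for the example.

```python
import json
from typing import Callable, Dict, List


class InMemoryBucket:
    """Hypothetical stand-in for a GCS bucket, so the sketch runs offline."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def upload(self, name: str, data: bytes) -> None:
        self.objects[name] = data

    def download(self, name: str) -> bytes:
        return self.objects[name]


def worker_iteration(
    model_bucket: InMemoryBucket,
    data_bucket: InMemoryBucket,
    worker_id: str,
    iteration: int,
    run_selfplay: Callable[[bytes, dict], List[dict]],
) -> str:
    """Run one pull -> self-play -> push cycle and return the uploaded object name."""
    # Re-read the control settings on every loop, so a newly uploaded file
    # changes search parameters on the worker's next iteration.
    settings = json.loads(model_bucket.download("worker_settings.json"))
    # Always fetch the current weights; the training head overwrites this object.
    weights = model_bucket.download("latest_model.onnx")
    games = run_selfplay(weights, settings)
    # Write results under a name unique to this worker and iteration
    # (hypothetical naming scheme) so workers never overwrite each other.
    name = f"games/{worker_id}-{iteration:06d}.json"
    data_bucket.upload(name, json.dumps(games).encode("utf-8"))
    return name
```

Because the worker holds no state between iterations other than its counter, the training head can change behavior across the fleet just by overwriting objects in the model bucket.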
VM instances are created with setup_gcp.sh as their startup automation. For convenience, you can upload setup_gcp.sh to a bucket and just provide its URL. As soon as the instance group starts, it will automatically begin self-play.
We use the train_gcp.py script as the main interface with the self-play generated data. To begin training, just run python train_gcp.py.
The engine requires the C++ ONNX Runtime to be installed in external (see CMakeLists.txt); Python dependencies are listed in requirements.txt. For GPU support, you will also need to install the appropriate CUDA and cuDNN versions for your system.
You will also need to install the Google Cloud SDK and set up your GCP credentials if you want to use Google Cloud to host your training.
The engine executable provides access to the Sable engine via an interface similar to (but lacking most of the features of) UCI, a well-documented interface used by most chess engines. It supports only one way of playing a position:
setposition startpos [moves] [f5 d6 ...]
go nodes 1000
The position is set as the start position, plus every move in the game up to the current point, including passes. The go command can control the number of rollouts performed.
Threads and GPU batch size can be set with:
setoption name Threads value {threads}
setoption name BatchSize value {batch_size}
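As an illustration, the commands above can be assembled and piped to the engine from Python. This is a minimal sketch under several assumptions: `moves` is treated as a literal keyword preceding the move list (as in UCI), the engine is assumed to read commands from standard input and exit at end of input, and `engine_path` is a placeholder for wherever your build puts the executable. The engine's output format is not documented here, so the helper returns raw text without parsing it.

```python
import subprocess
from typing import List, Optional, Sequence


def build_commands(
    moves: Sequence[str],
    nodes: int,
    threads: Optional[int] = None,
    batch_size: Optional[int] = None,
) -> List[str]:
    """Build the command lines for one search from the start position."""
    commands: List[str] = []
    # Options are sent first so they apply to the search that follows.
    if threads is not None:
        commands.append(f"setoption name Threads value {threads}")
    if batch_size is not None:
        commands.append(f"setoption name BatchSize value {batch_size}")
    # The position is the start position plus every move played so far.
    position = "setposition startpos"
    if moves:
        position += " moves " + " ".join(moves)  # assumed keyword syntax
    commands.append(position)
    commands.append(f"go nodes {nodes}")
    return commands


def run_engine(engine_path: str, commands: Sequence[str], timeout: float = 60.0) -> str:
    """Send the commands on stdin and return whatever the engine prints."""
    result = subprocess.run(
        [engine_path],
        input="\n".join(commands) + "\n",
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

For example, `build_commands(["f5", "d6"], 1000)` yields the two lines shown earlier, and passing the result to `run_engine` with the path to your built executable runs the search.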
We also provide gui.py, which lets you play against the engine via a PyGame interface on CPU, but it does not support heavy search on the GPU.
The latest model checkpoint from our training runs is included in latest_model.onnx.
For completeness, we include vs_edax.py, which is called within the evaluation processes of train_gcp.py in order to run diagnostic checks against low-depth Edax. However, Edax (as far as we know) does not natively support depth limiting, so we needed to modify the source code to do so. Thus, by default vs_edax.py will probably not work as intended.
As a research project, Sable is not seriously optimized for real game conditions: for example, it uses no transposition tables or other tricks for pruning its tree. Although we support virtual visits for parallel MCTS, this mechanism was not tuned. Sable cannot currently search particularly fast, maxing out at about 50k nodes per second in our tests. As such, it is not ready out of the box for competitive time-per-move conditions (and its interface does not support timing logic either).
If you use this engine or system in your research, please cite:
@misc{tseng2026othello,
author = {Tseng, Jonathan},
title = {Self-play for Training 8x8 Othello Agents: Reinforcement Learning as a Systems Engineering Problem at Scale},
year = {2026},
publisher = {Stanford University},
note = {CS 234 Final Project}
}
