A Proximal Policy Optimization (PPO) agent trained to land a rocket using thrust vector controls. The gym environment is built in Rust using macroquad and rapier2d and is inspired by the LunarLander task. The landing task has been modified to have continuous control instead of a discrete action space and to be a bit simpler (fixed landing zone, flat land).
A live demo of the PPO agent is available here. The demo is a WASM build of the `controlled_sim` binary. The model is still a work in progress (the task is a lot harder than I initially expected).
- A physics simulator of the lander and thrust vector controls using `rapier2d` via Rust.
- FFI Python bindings from the Rust simulation using `pyo3` to create a gym module for training.
- PPO reinforcement learning agent trained using Python and PyTorch.
- Curriculum learning for training the PPO agent.
- Interactive simulation with random start positions and mouse click-and-drag to reposition the rocket.
- Model inference in Rust (for the controlled simulation) using the ONNX runtime.
- WASM build for running the simulation in the browser.
The Proximal Policy Optimization (PPO) model is trained to control the rocket based on the following observation and action spaces:
The observation space is a 6-dimensional vector (since the sim is only in 2D) containing the following in order:
- $x, y$: The x and y coordinates of the rocket in the world.
- $\theta$: The angle of the rocket from the vertical.
- $v_x, v_y$: The linear velocity of the rocket in the x and y directions.
- $\omega$: The angular velocity of the rocket.
The observations are roughly normalized to the range $[-1, 1]$.
The action space is a 2-dimensional vector representing standard thrust vector controls:
- $F_{\text{thrust}}$: The amount of thrust to apply, normalized to the range $[-1, 1]$.
- $\theta_{\text{gimbal}}$: The angle of the gimbal, normalized to the range $[-1, 1]$.
See `base/src/constants.rs` for the true min and max values of the action and observation spaces.
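For concreteness, a single observation and action might look like the arrays below. The values are made up for illustration; the exact ordering and scaling are defined in `base/src/constants.rs`.

```python
import numpy as np

# Illustrative values only; see base/src/constants.rs for the real ranges.
obs = np.array([
    0.10,   # x position
    0.80,   # y position
    0.05,   # theta: angle from the vertical
    0.00,   # v_x: horizontal velocity
   -0.30,   # v_y: vertical velocity
    0.01,   # omega: angular velocity
], dtype=np.float32)

action = np.array([
    0.50,   # F_thrust, normalized to [-1, 1]
   -0.20,   # theta_gimbal, normalized to [-1, 1]
], dtype=np.float32)
```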
The project is divided into three main parts:
- `base`: A Rust crate that contains the simulation (including the game engine, physics, and rendering).
- `gym`: A Rust crate that provides a Python binding to the simulation for a "gym" style reinforcement learning environment.
- `python`: Source code for training the PPO agent using PyTorch.
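To give a rough sense of how the `gym` crate's binding is consumed from Python during training, a hypothetical episode loop is sketched below. The module, class, and method names are assumptions made purely for illustration; the actual interface is defined in `gym/src/lib.rs`.

```python
import numpy as np

# Hypothetical sketch: module, class, and method names are assumed, not the
# real binding (see gym/src/lib.rs for the actual interface).
from tvc_gym import LanderEnv

env = LanderEnv()
obs = env.reset()                         # 6-dim observation vector
done = False
while not done:
    action = np.array([0.5, 0.0])         # [F_thrust, theta_gimbal], both in [-1, 1]
    obs, reward, done = env.step(action)  # assumed gym-style step signature
```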
- Clone the repository:

  ```bash
  git clone https://github.com/akkshay0107/tvc-lander.git
  cd tvc-lander
  ```

- Set up the Python venv with Poetry and install dependencies:

  ```bash
  cd python
  poetry install
  ```
The PPO agent is trained using curriculum learning. The training is divided into stages, where each stage increases the difficulty of the landing task. The curriculum is defined in `python/src/ppo.py`.
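As a rough illustration of how such a curriculum might be structured, the sketch below uses invented stage parameters; the real schedule lives in `python/src/ppo.py`, and the start-state sampling it drives is implemented by the `_sample` function in `gym/src/lib.rs`.

```python
# Hypothetical curriculum; the actual stages and parameters live in
# python/src/ppo.py (training) and gym/src/lib.rs (_sample).
CURRICULUM = [
    # Each stage widens the range of initial conditions the agent must handle.
    {"max_start_height": 0.3, "max_start_speed": 0.1, "max_start_angle": 0.05},
    {"max_start_height": 0.6, "max_start_speed": 0.3, "max_start_angle": 0.15},
    {"max_start_height": 1.0, "max_start_speed": 0.5, "max_start_angle": 0.30},
]

for stage_idx, stage in enumerate(CURRICULUM):
    # Placeholder for running PPO updates with the environment's start-state
    # sampling restricted to this stage's ranges.
    print(f"Training stage {stage_idx}: {stage}")
```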
To train the PPO agent, run the following commands from the `python` directory:

```bash
poetry run maturin develop  # builds and installs the gym crate as a wheel in the venv
poetry run python ./src/ppo.py
```

The trained model will be saved to `python/models/policy_net.pth`.
Additionally, to run test episodes using the trained model, run the following command:

```bash
poetry run python ./tests/ppo_test_episodes.py
```

To run the simulation with the PPO agent providing controls, you first need to export the trained model to ONNX format. From the `python` directory, run:
```bash
poetry run python ./utils/export_to_onnx.py
```
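For reference, the export is conceptually a standard `torch.onnx.export` call followed by an optional sanity check with `onnxruntime`. The sketch below is illustrative only: it assumes `PolicyNet` can be constructed without arguments and maps a 6-dimensional observation to a 2-dimensional action, and the output path is made up; the actual logic is in `python/utils/export_to_onnx.py`.

```python
import numpy as np
import onnxruntime as ort
import torch

from src.ppo import PolicyNet  # defined in python/src/ppo.py

# Illustrative export; constructor arguments, paths, and output shape are assumptions.
policy = PolicyNet()
policy.load_state_dict(torch.load("models/policy_net.pth", map_location="cpu"))
policy.eval()

dummy_obs = torch.zeros(1, 6)  # batch of one 6-dimensional observation
torch.onnx.export(
    policy,
    dummy_obs,
    "models/policy_net.onnx",
    input_names=["observation"],
    output_names=["action"],
)

# Quick sanity check that the exported graph produces a 2-dimensional action.
session = ort.InferenceSession("models/policy_net.onnx")
(action,) = session.run(None, {"observation": np.zeros((1, 6), dtype=np.float32)})
print(action.shape)  # expected: (1, 2)
```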
Then, run the simulation with the following command from the project root:

```bash
cargo run --bin controlled_sim --release
```

In the simulation, the rocket starts at a random position and the model attempts to land it safely. The rocket can then be clicked and dragged to different locations on the screen to see how the model reacts when the rocket is dropped from that location.
The simulation can also be run with keyboard inputs (instead of the model providing controls). For this, run the following from the project root:

```bash
cargo run --bin base --release --features="keyinput"
```

To build the WASM version of the simulation (needs the `wasm-bindgen` CLI), run the following command from the project root:
```bash
./build_wasm.sh
```

This will create a `dist` directory with the WASM build. You can then serve the `dist` directory using a local web server (for example, `python -m http.server`).
The WASM build is automatically deployed to GitHub Pages on every push to the `main` branch. The deployment workflow is defined in `.github/workflows/deploy.yml`.
The code is designed to be modular, but if you'd like to experiment with your own models, you'll need to modify the following files:
- `python/src/ppo.py`: Modify the model parameters and training process here.
- `python/utils/export_to_onnx.py`: This is currently hardcoded to accept the `PolicyNet` class, but can be adapted for other models.
- `base/src/bin/controlled_sim.rs`: This loads the model from a hardcoded path and applies post-processing (clamping); it will need to be updated to reflect a new model.
Unlike other gym interfaces, you also have more leeway over what goes into the model's input, since the code in the gym module can be modified directly:
- `gym/src/lib.rs`: You can modify the `calculate_reward` function to change the rewards for the landing task. Additionally, the `_sample` function can be modified to implement different curriculums for curriculum training.
- Achieve parity with the OpenAI Gym interface (standardize the API, add a `render` function).
- Improve the PPO agent's performance and deploy a more robust solution to the web demo.
