WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning
Authors: Xin Li, Mengbing Liu, Yiyang Zhu, Wenhe Zhang, Li Wei, Jiancheng An, Chau Yuen

Affiliation: Nanyang Technological University
Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance.
We present WirelessMathLM, demonstrating that compact models (0.5B–7B parameters) can match or exceed much larger models through domain-specific reinforcement learning with verifiable rewards. Our key insight is that wireless mathematics problems have verifiable correctness: a final answer can be checked automatically, which enables effective reinforcement learning without human feedback.
- WirelessMathBench-XL: A comprehensive benchmark of 4,027 problems from 970 papers in wireless communications
- Domain-specific RL: Group Relative Policy Optimization (GRPO) with binary verification rewards, training directly from base checkpoints without supervised warm-start
- Efficient Performance: Our 7B model achieves 39.5% accuracy, approaching GPT-4o (40.4%) while using ~100× fewer parameters than DeepSeek-R1 (671B, 57.4%)
- Transfer Learning: Positive transfer to general mathematics benchmarks (+8.4 points average across MATH, Minerva-Math, OlympiadBench, AMC, and AIME)
| Model | Parameters | Accuracy |
|---|---|---|
| WirelessMathLM-7B | 7B | 39.5% |
| GPT-4o | ~1.8T | 40.4% |
| DeepSeek-R1 | 671B | 57.4% |
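The "~100× fewer parameters" comparison can be checked with a few lines of arithmetic. The sketch below uses only the WirelessMathLM-7B and DeepSeek-R1 figures from the table above; the variable names are illustrative and not part of any released code.

```python
# Parameter counts (in billions) and accuracies (%) taken from the table above.
wirelessmathlm_params_b = 7
deepseek_r1_params_b = 671
wirelessmathlm_acc = 39.5
deepseek_r1_acc = 57.4

# How many times larger DeepSeek-R1 is than the 7B model.
param_ratio = deepseek_r1_params_b / wirelessmathlm_params_b

# Accuracy gap in percentage points.
acc_gap = deepseek_r1_acc - wirelessmathlm_acc

print(f"parameter ratio: {param_ratio:.1f}x")
print(f"accuracy gap: {acc_gap:.1f} points")
```

The ratio comes out a little under 100, which is consistent with the rounded "~100×" figure.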
GRPO training improves accuracy at every model scale and nearly doubles it at 3B and 7B (relative improvement over the base model):
- 0.5B: +11% improvement
- 3B: +103% improvement
- 7B: +81% improvement
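The percentages above read most naturally as relative gains over the base checkpoint, since "+103%" corresponds to roughly doubling. Under that assumption, the sketch below shows how such a gain is computed and how to invert it. Base-model accuracies are not listed here, so the only number derived is the base accuracy implied for the 7B model by its reported 39.5% and +81%. The function names are hypothetical.

```python
def relative_improvement(base_acc: float, trained_acc: float) -> float:
    """Relative gain of trained over base accuracy, in percent."""
    if base_acc <= 0:
        raise ValueError("base accuracy must be positive")
    return (trained_acc - base_acc) / base_acc * 100.0


def implied_base_accuracy(trained_acc: float, rel_gain_pct: float) -> float:
    """Invert relative_improvement: base accuracy implied by a reported gain."""
    return trained_acc / (1.0 + rel_gain_pct / 100.0)


# 7B model: 39.5% after GRPO with a reported +81% gain (assumed to be relative).
base_7b = implied_base_accuracy(39.5, 81.0)
print(f"implied 7B base accuracy: {base_7b:.1f}%")
```

If the reported gains were absolute percentage points instead, this inversion would not apply.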
WirelessMathBench-XL contains 4,027 mathematical problems extracted from 970 research papers in wireless communications, covering:
- Information theory and channel capacity
- Signal processing and beamforming
- Optimization in wireless networks
- MIMO systems and spatial diversity
- Resource allocation and scheduling
- Network coding and cooperative communications
Our approach uses GRPO with binary verification rewards:
- No Supervised Fine-tuning: Train directly from base model checkpoints
- Verifiable Rewards: Leverage the mathematical nature of wireless problems for automatic verification
- Domain-specific Training: Focus specifically on wireless communications mathematics
- Efficient Scaling: Achieve strong performance with compact models
    Base Model → GRPO Training → WirelessMathLM
        ↑              ↑               ↓
     Qwen2.5    Binary Rewards   Wireless Math
                                   Expertise
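To make the reward signal concrete, here is a minimal sketch of a binary verification reward and the group-relative advantage used in the commonly published GRPO formulation. It is not the authors' implementation: the whitespace-only `normalize` helper, the exact-match check, and the example answers are all assumptions made for illustration, and a real verifier would likely need symbolic or numeric equivalence checking. The policy-update step (clipped probability ratio and KL penalty) is omitted.

```python
import statistics


def normalize(answer: str) -> str:
    """Hypothetical normalization: remove all whitespace."""
    return "".join(answer.split())


def binary_reward(predicted: str, reference: str) -> float:
    """Return 1.0 if the normalized answers match exactly, else 0.0."""
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one hypothetical channel-capacity problem.
reference = "B*log2(1+SNR)"
samples = ["B * log2(1 + SNR)", "B*log(1+SNR)", "B*log2(1+SNR)", "log2(1+SNR)"]

rewards = [binary_reward(s, reference) for s in samples]
advantages = group_relative_advantages(rewards)

print(rewards)
print([round(a, 2) for a in advantages])
```

Because advantages are computed relative to the group mean, a group in which every sample is correct (or every sample is wrong) yields all-zero advantages and contributes no gradient signal; the `eps` term only guards against division by zero in that case.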
Our models show positive transfer to general mathematics:
| Benchmark | Improvement |
|---|---|
| MATH | +8.2 points |
| Minerva-Math | +7.9 points |
| OlympiadBench | +9.1 points |
| AMC | +8.7 points |
| AIME | +8.5 points |
| Average | +8.4 points |
@article{li2025wirelessmathlm,
title={WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning},
author={Li, Xin and Liu, Mengbing and Zhu, Yiyang and Zhang, Wenhe and Wei, Li and An, Jiancheng and Yuen, Chau},
journal={arXiv preprint},
year={2025}
}

- Paper: Coming soon on arXiv
- Code: Will be released upon publication
- Website: Project Homepage
- Overview: WirelessMathLM-Overview.pdf
For questions or collaborations, please contact:
- Xin Li: xin019@ntu.edu.sg
Nanyang Technological University | Project Maxwell