A lightweight modular math OCR pipeline that processes scanned equations and reconstructs them into LaTeX.
- Symbol Classifier (CNN-based)
- Symbol Segmentation
For symbol classification training, we use the HASYv2 dataset.
git clone https://github.com/yourusername/math-ocr.git
cd math-ocr
pip install -r requirements.txtFolder Structure
data/: Store your raw, processed, and synthetic imagesmodels/: Trained model checkpointssrc/: All source code modules
To download HASYv2 dataset:
import kagglehub
# Download latest version
path = kagglehub.dataset_download("guru001/hasyv2")
print("Path to dataset files:", path)Then move files as needed.