MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning
AAAI 2026 Accepted Paper
Wenrui Zhang1 · Xinggang Wang1 · Bin Feng1 · Wenyu Liu1
1School of Electronic Information and Communications, Huazhong University of Science and Technology
MolSight is a comprehensive learning framework for Optical Chemical Structure Recognition (OCSR), designed to bridge the gap between computer vision and chemical informatics (AI4S).
Accurately translating molecular images into machine-readable formats (like SMILES) is critical for drug discovery and digital chemistry. MolSight addresses the limitations of previous methods—particularly in handling complex stereoisomers—through a novel three-stage training paradigm:
- SMILES Pretraining: Aligns visual representations with chemical strings.
- Multi-Granularity Fine-Tuning: Captures both global structure and local functional group details.
- RL Post-Training: Utilizes Reinforcement Learning to optimize for chemical semantic correctness rather than simple token matching.
- First RL-based OCSR: MolSight is the first OCSR system to integrate Reinforcement Learning. We utilize Group Relative Policy Optimization (GRPO) to directly optimize chemical validity.
- Stereo-200k Dataset: We introduce a new annotated dataset consisting of 200,000 challenging stereoisomeric molecules specifically curated to address confusion in 3D chiral structures.
- SOTA Performance: Extensive experiments demonstrate that MolSight achieves state-of-the-art results in accuracy, similarity, and robustness, outperforming classical and learning-based baselines.
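The group-relative step in GRPO, mentioned in the list above, can be illustrated with a minimal sketch. It is not taken from the MolSight code base: the function name `group_relative_advantages`, the binary reward values, and the use of the population standard deviation are all assumptions made for illustration. The idea is that several SMILES candidates are sampled for one image, each receives a scalar reward, and each reward is normalized against the mean and spread of its own group.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the statistics of its own sampling group.

    `rewards` holds one scalar per candidate sampled for the same input image.
    `eps` avoids division by zero when every candidate gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four SMILES candidates decoded from one image:
# 1.0 if the candidate matches the reference molecule, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Candidates that score above the group mean receive a positive advantage and the others a negative one, so no separate value network is needed to provide a baseline.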
- [2025-11-26] 🎉 MolSight has been accepted to AAAI 2026!
- [2025-11-26] 🚀 Code released.
- Release code
- Release Stereo-200k dataset
- Release model weights
# Clone the repository
git clone https://github.com/hustvl/MolSight
cd MolSight
# Install dependencies
pip install -r requirements.txt

- Pretrain dataset: MolParser-7M
- SFT datasets: PubChem-1M, USPTO-680k
- RL dataset: Stereo-200k
- USPTO, UoB, CLEF, JPO: images and labels; we also provide the labels in SMILES format.
- Stereo-2k
Notes: The Stereo dataset is introduced for the first time in this work, consisting entirely of stereoisomeric molecules.
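As a rough illustration of what "stereoisomeric" means at the label level, the sketch below counts stereo annotations in a SMILES string. It relies only on standard SMILES notation, where `@` or `@@` inside a bracket atom marks a tetrahedral center and `/` or `\` marks double-bond geometry. The helper name `stereo_markers` is hypothetical and is not part of the MolSight code; it is a plain string heuristic, not a chemistry-aware parser.

```python
import re

# A bracket atom that contains '@' or '@@', e.g. [C@@H] or [C@]
CHIRAL_ATOM = re.compile(r"\[[^\]]*@[^\]]*\]")


def stereo_markers(smiles):
    """Count stereo annotations in a SMILES string (string heuristic only)."""
    return {
        "chiral_centers": len(CHIRAL_ATOM.findall(smiles)),
        "directional_bonds": smiles.count("/") + smiles.count("\\"),
    }


# L-alanine has one annotated tetrahedral center
print(stereo_markers("N[C@@H](C)C(=O)O"))
# trans-2-butene is written with two directional bonds
print(stereo_markers("C/C=C/C"))
# ethanol carries no stereo annotation
print(stereo_markers("CCO"))
```

A check like this can help filter a label file down to molecules that carry stereo information before evaluating on them.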
| Name | Predict Field | Description | Acc. on USPTO |
| --- | --- | --- | --- |
| MolSight-base | SMILES & edge | Trained on PubChem-1M and USPTO-680k for 10 epochs. | 91.2 |
| MolSight-coord | SMILES & edge & coord | Further trained on PubChem-1M for 2 epochs to add a coordinate head. | 91.1 |
| MolSight-stereo | SMILES | Further trained on Stereo-200k with LoRA for 2 epochs to improve performance on stereo molecules. | 90.3 |
| MolSight-extra | SMILES & edge | Same setup as MolSight-base but trained longer (30 epochs); usually achieves better evaluation scores. | 92.0 |
| MolSight-Markush | SMILES | Fine-tuned on MarkushGrapher; predicts SMILES-M to handle Markush structures. | - |
Start MolSight training with:
# SFT
bash train.sh
# train the additional coord predictor
bash train_loc_predictor.sh
# post training with RL
bash post_train.sh

If you find MolSight or the Stereo-200k dataset useful for your research in AI4Science or Chemistry, please cite our paper:
@article{zhang2025molsight,
title={MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning},
author={Zhang, Wenrui and Wang, Xinggang and Feng, Bin and Liu, Wenyu},
journal={arXiv preprint arXiv:2511.17300},
year={2025}
}

This project draws on several excellent open-source repositories (MolScribe, trl, Whisper, MMPose). We thank their authors for their wonderful work and contributions to the community.