microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification
- Oct-03-25: Preprint available on arXiv.
- Oct-03-25: Initial release of microCLIP code.
microCLIP is a lightweight self-training framework that adapts CLIP for fine-grained image classification without requiring labeled data.
While CLIP is strong at zero-shot transfer, it relies primarily on coarse global features. microCLIP augments CLIP with localized, fine-grained cues, yielding sharper attention, more accurate pseudo-labels, and higher classification accuracy on challenging benchmarks.
Key ideas:
- Saliency-Oriented Attention Pooling (SOAP): builds a fine-grained [FG] token from salient patch embeddings (sketched below).
- TokenFusion: fuses [FG] with the global [CLS] token for coarse–fine alignment.
- Two-headed LLM-derived classifier: a frozen prior and a learnable classifier stabilize pseudo-labeling.
- Dynamic Knowledge Aggregation: convexly combines static CLIP/LLM priors with evolving TokenFusion logits (see the second sketch below).
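A minimal PyTorch sketch of the SOAP idea, assuming `patch_tokens` are CLIP ViT patch embeddings and `saliency` is a per-patch score (e.g., NCut-derived, as the figures below suggest); the names here are illustrative, not the repository's actual API:

```python
import torch
import torch.nn.functional as F

def soap_pool(patch_tokens: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
    """Saliency-Oriented Attention Pooling (illustrative sketch).

    patch_tokens: (B, N, D) CLIP ViT patch embeddings.
    saliency:     (B, N) per-patch saliency scores.
    Returns the fine-grained [FG] token, shape (B, D).
    """
    # Convert saliency scores into attention weights over patches,
    # then pool the patch embeddings under those weights.
    attn = F.softmax(saliency, dim=-1)
    return torch.einsum("bn,bnd->bd", attn, patch_tokens)
```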
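And a sketch of how TokenFusion, the two heads, and Dynamic Knowledge Aggregation could fit together. Here fusion is a simple concat-and-project stand-in (the paper's mechanism may differ), `prior_weights` plays the frozen LLM-derived classifier, and `alpha` is the convex mixing weight; again, all names are hypothetical:

```python
import torch
import torch.nn as nn

class TokenFusionHead(nn.Module):
    """Sketch: fuse [CLS] and [FG], then classify with two heads."""

    def __init__(self, dim: int, num_classes: int, prior_weights: torch.Tensor):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)               # coarse-fine fusion (illustrative)
        self.learnable_head = nn.Linear(dim, num_classes)
        # Frozen LLM-derived prior (e.g., text prototypes from class descriptions).
        self.register_buffer("prior", prior_weights)      # (num_classes, dim)

    def forward(self, cls_tok: torch.Tensor, fg_tok: torch.Tensor, alpha: float = 0.5):
        fused = self.fuse(torch.cat([cls_tok, fg_tok], dim=-1))  # (B, D)
        learned_logits = self.learnable_head(fused)
        prior_logits = fused @ self.prior.t()                    # frozen prior head
        # Dynamic Knowledge Aggregation: convex combination of the static
        # prior logits and the evolving TokenFusion logits.
        return alpha * prior_logits + (1.0 - alpha) * learned_logits
```

Pseudo-labels for self-training would then come from the aggregated logits (e.g., an argmax over classes), with the frozen prior damping drift in the learnable head.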
microCLIP improves average accuracy by +2.90% across 13 fine-grained benchmarks, setting a new state of the art for unsupervised CLIP adaptation.
Figures (ablations and visualizations):
- Effect of coarse vs. fine-grained cues
- Effect of SOAP
- Dynamic Knowledge Aggregation
- Two-headed classifier initialization
- Sharper local attention via SOAP-guided [FG]
- [CLS] vs. [FG] attention across datasets
- Pseudo-label accuracy progression
- NCut saliency masks
# Clone repository
git clone https://github.com/sathiiii/microCLIP.git
cd microCLIP
# Create environment
conda env create -f environment.yml
conda activate microclip
- Dataset paths are defined in configs/dataset_catalog.json. Update these paths to point to your local dataset locations (a quick way to inspect them is shown below).
- Dataset label files are provided in configs/classes.json.
- For dataset preparation, we recommend using the scripts from the VISSL repository.
- Check this issue for guidance on downloading the Stanford Cars dataset.
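If you want to see exactly which entries need editing, you can load the catalog and print it (standard library only; the file's internal schema is whatever ships with the repository):

```python
import json

# Inspect the dataset catalog to see which paths need updating.
# configs/dataset_catalog.json ships with the repository.
with open("configs/dataset_catalog.json") as f:
    catalog = json.load(f)

for name, entry in catalog.items():
    print(name, entry)
```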
Training:
python train.py --dataset dataset-name --train_config ours_vit_b_32_cupl_proto

Evaluation:
python evaluate.py --dataset dataset-name --ckpt-path path/to/checkpoint.pth

This work builds upon the MUST repository. We thank the authors for their open-source code.
We thank the authors of MetaCLIP for releasing their codebase, which we use in our additional experiments.
We also acknowledge CuPL for providing GPT-3 generated class descriptions, which we include in our repository under all_prompts/.
If you find this work useful in your research, please consider citing:
@misc{silva2025microclipunsupervisedclipadaptation,
title={microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification},
author={Sathira Silva and Eman Ali and Chetan Arora and Muhammad Haris Khan},
year={2025},
eprint={2510.02270},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.02270},
}