
microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification


📢 Latest Updates

  • Oct-03-25: Preprint available on arXiv.
  • Oct-03-25: Initial release of microCLIP code.

💡 Overview

microCLIP is a lightweight self-training framework that adapts CLIP for fine-grained image classification without requiring labeled data.

While CLIP transfers strongly in zero-shot settings, it relies primarily on coarse, global features. microCLIP enhances CLIP with localized, fine-grained cues, enabling sharper attention, more accurate pseudo-labels, and improved classification accuracy across challenging benchmarks.

Key ideas:

  • Saliency-Oriented Attention Pooling (SOAP): builds a fine-grained [FG] token from salient patch embeddings.
  • TokenFusion: fuses [FG] with the global [CLS] token for coarse–fine alignment.
  • Two-headed LLM-derived classifier: a frozen prior and a learnable classifier stabilize pseudo-labeling.
  • Dynamic Knowledge Aggregation: convexly combines static CLIP/LLM priors with evolving TokenFusion logits.

microCLIP improves average accuracy by +2.90% across 13 fine-grained benchmarks, setting a new state of the art for unsupervised CLIP adaptation.
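
The sketch below illustrates the SOAP pooling, TokenFusion, and Dynamic Knowledge Aggregation ideas at a high level. It is a minimal PyTorch illustration under our own naming assumptions (CoarseFineFusion, aggregate_logits, and all tensor shapes are hypothetical), not the repository's actual implementation; see train.py and the paper for the real details.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseFineFusion(nn.Module):
    """Hypothetical sketch: fuse a coarse [CLS] token with a saliency-pooled [FG] token."""
    def __init__(self, dim, num_classes, num_heads=8):
        super().__init__()
        self.fuse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)  # learnable head

    def forward(self, cls_token, patch_tokens, saliency):
        # SOAP-style pooling: weight patch embeddings by saliency to form the [FG] token
        weights = F.softmax(saliency, dim=-1).unsqueeze(-1)            # (B, N, 1)
        fg_token = (weights * patch_tokens).sum(dim=1, keepdim=True)   # (B, 1, D)
        # TokenFusion-style step: let the coarse [CLS] token attend to the fine [FG] token
        fused, _ = self.fuse(cls_token.unsqueeze(1), fg_token, fg_token)
        return self.classifier(fused.squeeze(1))                       # (B, num_classes)

def aggregate_logits(prior_logits, fusion_logits, lam=0.5):
    # Dynamic Knowledge Aggregation (sketch): convex combination of the frozen
    # CLIP/LLM prior logits and the evolving TokenFusion logits
    return lam * prior_logits + (1.0 - lam) * fusion_logits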

Overall Architecture

Overall architecture of microCLIP


📊 Results

Comparison to Zero-Shot and Unsupervised Adaptation (UA) Baselines

Top-1 accuracy comparison across 13 datasets (ViT-B/32 backbone)

Ablation Studies

Effect of coarse vs fine-grained cues:

Ablation on coarse-feature baselines

Effect of SOAP:

Ablation on Attention Pooling (SOAP vs baselines)

Dynamic Knowledge Aggregation:

Ablation on pseudo-labeler

Two-headed classifier initialization:

Two-headed classifier ablation

Backbone Scaling

Results with ViT-B/16 backbone


Visualizations

Sharper local attention via SOAP-guided [FG]:

Attention maps (Birdsnap/RESISC)

[CLS] vs [FG] attention across datasets:

Attention comparison between CLS and FG tokens

Pseudo-label accuracy progression:

Pseudo-labeling accuracy curves

NCut saliency masks:

NCut-based saliency maps on Birdsnap

📦 Installation

# Clone repository
git clone https://github.com/sathiiii/microCLIP.git
cd microCLIP

# Create environment
conda env create -f environment.yml
conda activate microclip

🗂️ Datasets

🔧 Usage

Train (UA Fine-tuning)

python train.py --dataset dataset-name --train_config ours_vit_b_32_cupl_proto

Evaluate

python evaluate.py --dataset dataset-name --ckpt-path path/to/checkpoint.pth

🙏 Acknowledgements

This work builds upon the MUST repository. We thank the authors for their open-source code.

We thank the authors of MetaCLIP for releasing their codebase, which we use in our additional experiments.

We also acknowledge CuPL for providing GPT-3 generated class descriptions, which we include in our repository under all_prompts/.
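
For reference, class-description prompts like those from CuPL are typically turned into a frozen text classifier by encoding and averaging them per class. The snippet below is only a rough sketch of that idea: the file name and JSON layout shown are assumptions, not the exact format of all_prompts/.

import json
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Assumed layout: {"class name": ["description 1", "description 2", ...], ...}
with open("all_prompts/example_dataset.json") as f:   # hypothetical file name
    prompts = json.load(f)

class_weights = []
for cls_name, descriptions in prompts.items():
    tokens = clip.tokenize(descriptions, truncate=True).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    class_weights.append(emb.mean(dim=0))             # average per-class embedding
text_classifier = torch.stack(class_weights)          # (num_classes, embed_dim), frozen prior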

📜 Citation

If you find this work useful in your research, please consider citing:

@misc{silva2025microclipunsupervisedclipadaptation,
      title={microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification}, 
      author={Sathira Silva and Eman Ali and Chetan Arora and Muhammad Haris Khan},
      year={2025},
      eprint={2510.02270},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.02270}, 
}
