
feat: add VLM_SmolVLM_Local plugin for fully local vision-language inference#2472

Open
Wanbogang wants to merge 2 commits into OpenMind:main from Wanbogang:feat/vlm-smolvlm-local

Conversation

@Wanbogang
Collaborator

Summary

Adds a new vision-language model (VLM) input plugin that runs SmolVLM2
directly via HuggingFace transformers — no Ollama, no internet connection,
and no external server required after the initial model download.

Design Decisions

Why SmolVLM2-256M?

  • Less than 1GB VRAM — runs on embedded hardware and CPU fallback
  • Apache 2.0 license — compatible with OM1's MIT license
  • Auto-downloaded from HuggingFace on first run, cached locally

Why optional dependency?

transformers is a large package (~500MB+). Adding it to the main
dependencies would force all OM1 users to install it even if they
never use this plugin. Instead, it is added as an optional group:

[project.optional-dependencies]
smolvlm = [
    "transformers>=4.52.0",
    "num2words>=0.5.14",
]

Users install it only when needed:

pip install om1[smolvlm]

If transformers is not installed, the plugin logs a clear warning
and disables itself gracefully — no crash, no exception propagation.
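A minimal sketch of such an import guard (class and attribute names here are illustrative, not the plugin's actual API; only `HAS_TRANSFORMERS` is named in this PR):

```python
import logging

# Module-level guard: detect the optional dependency once at import time.
try:
    import transformers  # noqa: F401
    HAS_TRANSFORMERS = True
except ImportError:
    HAS_TRANSFORMERS = False


class VLMSmolVLMLocal:  # hypothetical, simplified plugin class
    def __init__(self) -> None:
        if not HAS_TRANSFORMERS:
            # Warn and disable instead of raising, so OM1 keeps running.
            logging.warning(
                "transformers not installed; VLM_SmolVLM_Local disabled. "
                "Install with: pip install om1[smolvlm]"
            )
            self.enabled = False
            return
        self.enabled = True
```

The key point is that the `ImportError` is caught at module load, so instantiating the plugin never propagates an exception to the rest of OM1.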

GPU auto-detection

Follows the same pattern as other local plugins:

self.device = "cuda" if torch.cuda.is_available() else "cpu"

Falls back to CPU automatically if no CUDA device is available.
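As a sketch (the helper name is hypothetical), the one-liner above can be wrapped so device selection also degrades when torch itself is absent:

```python
import logging


def select_device() -> str:
    """Return "cuda" when a CUDA device is visible, else "cpu".

    Hypothetical helper mirroring the pattern above; also falls back
    to CPU if torch is not installed at all.
    """
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        logging.warning("torch not installed; defaulting to cpu")
        return "cpu"


print(select_device())
```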

How to Use

  1. Install optional dependencies:
pip install transformers num2words
  2. Use config/smolvlm_local.json5 or add to your config:
{
  type: "VLM_SmolVLM_Local",
  config: {
    camera_index: 0,
    model_id: "HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
    prompt: "Briefly describe what you see in one or two sentences.",
  },
}
  3. Run OM1 normally — model downloads automatically on first run.

Testing

The 1% missing coverage is HAS_TRANSFORMERS = True on line 26,
which only executes when transformers is installed. This is
intentionally not installed in the OM1 dev venv since it is an
optional dependency.
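For what it's worth, the fallback branch can be exercised without uninstalling anything by simulating an absent package: setting a module's `sys.modules` entry to `None` makes a subsequent import raise `ImportError`. A sketch (names illustrative, not the actual test suite):

```python
import sys


def detect_transformers() -> bool:
    # Mirrors the plugin's module-level import guard (illustrative).
    try:
        import transformers  # noqa: F401
        return True
    except ImportError:
        return False


# Simulate transformers being absent for this process only.
saved = sys.modules.get("transformers")
sys.modules["transformers"] = None  # a None entry forces ImportError
assert detect_transformers() is False

# Restore so other code in the process is unaffected.
if saved is not None:
    sys.modules["transformers"] = saved
else:
    del sys.modules["transformers"]
```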

Add VLM_SmolVLM_Local plugin that runs SmolVLM2 vision-language model
directly via HuggingFace transformers — no Ollama or internet connection
required after the initial model download.

- Auto-detects GPU via torch.cuda.is_available(), falls back to CPU
- Default model SmolVLM2-256M requires less than 1GB VRAM
- Graceful degradation if transformers is not installed
- Add smolvlm optional dependency group in pyproject.toml
- Add config/smolvlm_local.json5 for fully local stack with OllamaLLM
- 16 tests, 99% coverage

Install optional dependencies with:
    pip install om1[smolvlm]
@Wanbogang Wanbogang requested review from a team as code owners March 13, 2026 10:24
@github-actions github-actions bot added labels dependencies, robotics, python, tests, config on Mar 13, 2026
@codecov

codecov bot commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 99.08257% with 1 line in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/inputs/plugins/vlm_smolvlm_local.py 99.08% 1 Missing ⚠️
