See like humans. Think like scientists.
VisionGPT is a reasoning-first visual foundation model designed to solve one of the largest weaknesses in modern vision-language systems:
they can describe images, but they struggle to reason about them.
Modern vision models typically follow a simple pipeline:
Image
↓
Vision Encoder
↓
Language Model
↓
Answer
This approach works well for:
- Captioning
- Basic visual question answering
- OCR
- General image understanding
However, it often struggles with:
- Counting
- Spatial reasoning
- Object relationships
- Multi-step visual logic
- Explainable decision making
Example:
Question:
How many people are holding umbrellas?
Traditional Models:
Image
↓
Answer
VisionGPT:
Image
↓
Objects
↓
Relationships
↓
Reasoning
↓
Answer
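The umbrella question above can be sketched as a small program over explicit objects and relationships. This is a toy illustration only: the `DetectedObject` and `Relation` types, the `count_people_holding` helper, and the hand-written detections are all assumptions made for this example, not VisionGPT's actual data model or output.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DetectedObject:
    obj_id: str
    label: str


@dataclass(frozen=True)
class Relation:
    subject_id: str
    predicate: str
    object_id: str


def count_people_holding(
    objects: List[DetectedObject],
    relations: List[Relation],
    held_label: str,
) -> Tuple[int, List[str]]:
    """Count distinct people with a 'holding' relation to an object of held_label."""
    labels = {o.obj_id: o.label for o in objects}
    holders = {
        r.subject_id
        for r in relations
        if r.predicate == "holding"
        and labels.get(r.subject_id) == "person"
        and labels.get(r.object_id) == held_label
    }
    # Return the supporting person IDs alongside the count.
    return len(holders), sorted(holders)


# Hand-written stand-ins for what earlier stages might emit (hypothetical data).
objects = [
    DetectedObject("p1", "person"),
    DetectedObject("p2", "person"),
    DetectedObject("p3", "person"),
    DetectedObject("u1", "umbrella"),
    DetectedObject("u2", "umbrella"),
    DetectedObject("b1", "bag"),
]
relations = [
    Relation("p1", "holding", "u1"),
    Relation("p2", "holding", "b1"),
    Relation("p3", "holding", "u2"),
]

count, evidence = count_people_holding(objects, relations, "umbrella")
print(count, evidence)  # 2 ['p1', 'p3']
```

The count comes with the IDs of the people that produced it, so the answer can be checked against the detections rather than taken on trust.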
VisionGPT is built around explicit reasoning.
┌─────────────────┐
│   Input Image   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Perception    │
│     Engine      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Scene Graph   │
│     Engine      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Spatial     │
│    Reasoning    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Reasoning    │
│     Engine      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Response     │
│     Engine      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Final Answer   │
└─────────────────┘
Every answer is derived through explicit reasoning stages.
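One way to picture the staged flow is as a list of functions applied in order, each adding its output to a shared state. The sketch below is a minimal illustration under that assumption. The stage functions are stubs with hard-coded outputs, and names such as `State`, `STAGES`, and `run_pipeline` are hypothetical rather than taken from the project's specifications.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Stage = Tuple[str, Callable[[State], State]]


def perception(state: State) -> State:
    # Stub: a real perception engine would detect these from pixels.
    return {**state, "objects": ["cup", "table"]}


def scene_graph(state: State) -> State:
    # Stub: link detected objects that may be related.
    return {**state, "edges": [("cup", "table")]}


def spatial_reasoning(state: State) -> State:
    # Stub: turn each edge into a spatial fact.
    facts = [(a, "on", b) for a, b in state["edges"]]
    return {**state, "facts": facts}


def reasoning(state: State) -> State:
    # Answer "what is on the table?" using only the facts produced above.
    on_table = [s for s, p, o in state["facts"] if p == "on" and o == "table"]
    return {**state, "conclusion": on_table}


def response(state: State) -> State:
    return {**state, "answer": ", ".join(state["conclusion"])}


STAGES: List[Stage] = [
    ("perception", perception),
    ("scene_graph", scene_graph),
    ("spatial_reasoning", spatial_reasoning),
    ("reasoning", reasoning),
    ("response", response),
]


def run_pipeline(image: Any, stages: List[Stage] = STAGES) -> Tuple[State, List[str]]:
    """Run each stage in order, keeping every intermediate output in the state."""
    state: State = {"image": image}
    completed: List[str] = []
    for name, stage in stages:
        state = stage(state)
        completed.append(name)
    return state, completed


final_state, completed = run_pipeline("example.jpg")  # placeholder input
print(final_state["answer"])  # cup
print(completed)
```

Because every stage writes to its own key, the final state still holds the objects, edges, and facts that led to the answer.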
VisionGPT is optimized for:
- Counting
- Spatial understanding
- Relationship reasoning
- Multi-step logic
instead of pure caption generation.
Every answer should be traceable.
Objects
↓
Relationships
↓
Facts
↓
Reasoning
↓
Answer
No black-box decisions.
VisionGPT attempts to prevent unsupported conclusions by restricting responses to facts produced by earlier stages.
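A minimal sketch of such a restriction, assuming facts are stored as (subject, predicate, object) triples and every candidate response lists the facts it relies on. The `Claim` type, the `grounded_response` function, and the refusal message are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Set, Tuple

Fact = Tuple[str, str, str]  # (subject, predicate, object), an assumed format


@dataclass(frozen=True)
class Claim:
    text: str
    cites: Tuple[Fact, ...]


REFUSAL = "Not supported by the extracted facts."


def grounded_response(claim: Claim, known_facts: Set[Fact]) -> str:
    """Return the claim text only if every cited fact came from an earlier stage."""
    # A claim with no citations, or one citing an unknown fact, is rejected.
    if not claim.cites or any(fact not in known_facts for fact in claim.cites):
        return REFUSAL
    return claim.text


# Facts as earlier stages might have produced them (hypothetical data).
known_facts = {("p1", "holding", "u1"), ("p3", "holding", "u2")}

supported = Claim(
    "Two people are holding umbrellas.",
    (("p1", "holding", "u1"), ("p3", "holding", "u2")),
)
unsupported = Claim("A dog is under the umbrella.", (("d1", "under", "u1"),))

print(grounded_response(supported, known_facts))
print(grounded_response(unsupported, known_facts))
```

This check only confirms that the cited facts exist. Verifying that the sentence actually follows from them is a separate problem that the sketch does not address.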
Each subsystem can be developed independently.
Perception
Scene Graph
Spatial
Reasoning
Response
Because the stages are separate, a wrong answer can be traced to the stage that produced the error, and each stage can be evaluated on its own.
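As an illustration of what independent development could look like, the sketch below gives one subsystem a narrow interface and checks it in isolation, with no other stage involved. The `SpatialStage` protocol, the `CenterXSpatial` implementation, the box format, and the `check_stage` helper are assumptions for this example and are not drawn from INTERFACE_SPEC.md.

```python
from typing import Dict, List, Protocol, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format
Fact = Tuple[str, str, str]


class SpatialStage(Protocol):
    """Any object with this method can serve as the spatial stage."""

    def run(self, boxes: Dict[str, Box]) -> List[Fact]: ...


class CenterXSpatial:
    """Emits 'left_of' facts by comparing horizontal box centers."""

    def run(self, boxes: Dict[str, Box]) -> List[Fact]:
        centers = {name: (box[0] + box[2]) / 2 for name, box in boxes.items()}
        return [
            (a, "left_of", b)
            for a in sorted(centers)
            for b in sorted(centers)
            if a != b and centers[a] < centers[b]
        ]


def check_stage(stage: SpatialStage) -> bool:
    """Evaluate a spatial stage on a fixed input, independent of other stages."""
    boxes = {"cup": (0, 0, 2, 2), "lamp": (5, 0, 7, 4)}
    return stage.run(boxes) == [("cup", "left_of", "lamp")]


print(check_stage(CenterXSpatial()))  # True
```

Any replacement implementation that satisfies the same interface can be dropped in and run through the same check.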
Initial foundation release.
Perception Engine: 3.0B
Scene Graph Engine: 1.0B
Spatial Reasoning: 0.5B
Reasoning Engine: 4.0B
Response Engine: 1.5B
Total: 10.0B
VisionGPT/
├── docs/
│
├── models/
│
├── datasets/
│
├── training/
│
├── evaluation/
│
├── serving/
│
├── applications/
│
├── research/
│
├── infra/
│
├── tests/
│
└── tools/
Project specifications live under:
docs/specs/
Current specifications:
MODEL_SPEC.md
ARCHITECTURE.md
DATASET_SPEC.md
TRAINING_SPEC.md
EVALUATION_SPEC.md
REPOSITORY_SPEC.md
INTERFACE_SPEC.md
These documents define the project before implementation begins.
VisionGPT is being built as an open architecture project focused on reasoning-first visual intelligence.
Contributions are welcome after the core architecture reaches implementation stability.
Apache 2.0
The goal is not to build another model that looks at images.
The goal is to build a system that can:
Observe
Understand
Reason
Explain
with every conclusion grounded in evidence.