VisionGPT

See like humans. Think like scientists.

VisionGPT is a reasoning-first visual foundation model designed to address one of the largest weaknesses in modern vision-language systems: they can describe images, but they struggle to reason about them.

The Problem

Modern vision models typically follow a simple pipeline:

Image
 ↓
Vision Encoder
 ↓
Language Model
 ↓
Answer

This approach works well for:

Captioning
Basic visual question answering
OCR
General image understanding

However, it often struggles with:

Counting
Spatial reasoning
Object relationships
Multi-step visual logic
Explainable decision making

Example:

Question:

How many people are holding umbrellas?

Traditional Models:

Image
 ↓
Answer

VisionGPT:

Image
 ↓
Objects
 ↓
Relationships
 ↓
Reasoning
 ↓
Answer
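
As a minimal sketch of the VisionGPT flow above, the umbrella question can be answered by counting over explicit objects and relationships instead of predicting a number directly from pixels. The class names, the `holding` predicate, and the hand-written detections below are illustrative assumptions, not part of the VisionGPT specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DetectedObject:
    obj_id: int
    label: str


@dataclass(frozen=True)
class Relation:
    subject_id: int
    predicate: str
    object_id: int


def count_subjects(objects: List[DetectedObject], relations: List[Relation],
                   subject_label: str, predicate: str, object_label: str) -> int:
    """Count distinct subjects with subject_label linked by predicate to an object_label."""
    labels = {o.obj_id: o.label for o in objects}
    matching = {
        r.subject_id
        for r in relations
        if r.predicate == predicate
        and labels.get(r.subject_id) == subject_label
        and labels.get(r.object_id) == object_label
    }
    return len(matching)


# Hand-written stand-ins for what perception and scene-graph stages might emit.
objects = [
    DetectedObject(0, "person"),
    DetectedObject(1, "person"),
    DetectedObject(2, "person"),
    DetectedObject(3, "umbrella"),
    DetectedObject(4, "umbrella"),
]
relations = [
    Relation(0, "holding", 3),
    Relation(2, "holding", 4),
    Relation(1, "next_to", 0),
]

answer = count_subjects(objects, relations, "person", "holding", "umbrella")
print(answer)  # 2
```

Using a set of subject ids means a person linked to two umbrellas is still counted once.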

VisionGPT Architecture

VisionGPT is built around explicit reasoning.

┌─────────────────┐
│ Input Image     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Perception      │
│ Engine          │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Scene Graph     │
│ Engine          │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Spatial         │
│ Reasoning       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Reasoning       │
│ Engine          │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Response        │
│ Engine          │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Final Answer    │
└─────────────────┘

Every answer is derived through explicit reasoning stages.
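
One way to picture the staged flow in the diagram is as an ordered list of functions, each reading the state left by earlier stages and adding its own output. This is a sketch under assumptions: the `run_pipeline` helper, the dictionary-based state, and the placeholder stage bodies are hypothetical and do not reflect VisionGPT's actual interfaces.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Stage = Callable[[State], State]


def run_pipeline(image: Any, stages: List[Tuple[str, Stage]]) -> State:
    """Run each stage in order and record which stages have executed."""
    state: State = {"image": image, "stages_run": []}
    for name, stage in stages:
        state = stage(state)
        state["stages_run"] = state["stages_run"] + [name]
    return state


# Placeholder stages: each adds one key and leaves earlier keys untouched.
def perception(state: State) -> State:
    return {**state, "objects": ["person", "umbrella"]}


def scene_graph(state: State) -> State:
    return {**state, "relations": [("person", "holding", "umbrella")]}


def spatial(state: State) -> State:
    return {**state, "spatial_facts": []}


def reasoning(state: State) -> State:
    holding = [r for r in state["relations"] if r[1] == "holding"]
    return {**state, "conclusion": len(holding)}


def response(state: State) -> State:
    return {**state, "answer": str(state["conclusion"])}


result = run_pipeline("image.png", [
    ("perception", perception),
    ("scene_graph", scene_graph),
    ("spatial", spatial),
    ("reasoning", reasoning),
    ("response", response),
])
print(result["stages_run"])
print(result["answer"])  # 1
```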

Core Principles

1. Reasoning First

VisionGPT is optimized for the following tasks rather than for pure caption generation:

Counting
Spatial understanding
Relationship reasoning
Multi-step logic

2. Explainable Outputs

Every answer should be traceable.

Objects
 ↓
Relationships
 ↓
Facts
 ↓
Reasoning
 ↓
Answer

No black-box decisions.
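
A traceable answer could be represented as a record that carries the output of every step alongside the final answer. The `AnswerTrace` class and its field names below are assumptions chosen to mirror the chain above; they are not a defined VisionGPT output format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnswerTrace:
    objects: List[str] = field(default_factory=list)
    relationships: List[str] = field(default_factory=list)
    facts: List[str] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)
    answer: str = ""

    def render(self) -> str:
        """Format the trace as plain text, one section per step."""
        sections = [
            ("Objects", self.objects),
            ("Relationships", self.relationships),
            ("Facts", self.facts),
            ("Reasoning", self.reasoning),
        ]
        lines: List[str] = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items)
        lines.append(f"Answer: {self.answer}")
        return "\n".join(lines)


trace = AnswerTrace(
    objects=["person#0", "person#1", "umbrella#2"],
    relationships=["person#0 holding umbrella#2"],
    facts=["1 person is holding an umbrella"],
    reasoning=["Count distinct persons with a 'holding' link to an umbrella"],
    answer="1",
)
print(trace.render())
```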

3. Low Hallucination Design

VisionGPT attempts to prevent unsupported conclusions by restricting responses to facts produced by earlier stages.
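
The restriction described above can be illustrated with a simple membership check: a candidate claim is kept only if an earlier stage produced it as a fact. This is a sketch under assumptions, not the project's mechanism; the triple representation and the `split_by_support` helper are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, predicate, object)


def split_by_support(claims: Iterable[Fact],
                     facts: Set[Fact]) -> Tuple[List[Fact], List[Fact]]:
    """Separate claims backed by upstream facts from claims that are not."""
    supported: List[Fact] = []
    unsupported: List[Fact] = []
    for claim in claims:
        (supported if claim in facts else unsupported).append(claim)
    return supported, unsupported


# Facts produced by earlier stages (hand-written for illustration).
facts = {
    ("person#0", "holding", "umbrella#2"),
    ("person#1", "next_to", "person#0"),
}

# Claims a response stage might want to state.
claims = [
    ("person#0", "holding", "umbrella#2"),
    ("person#1", "holding", "umbrella#2"),  # not backed by any fact
]

supported, unsupported = split_by_support(claims, facts)
print(supported)
print(unsupported)
```

Only the `supported` list would be passed on to the response; the `unsupported` list is useful for logging what was withheld.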

4. Modular Architecture

Each subsystem can be developed independently.

Perception

Scene Graph

Spatial

Reasoning

Response

Because each subsystem has its own inputs and outputs, an error can be traced to a single stage, and each stage can be evaluated on its own.
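
As an example of evaluating one subsystem in isolation, the relationships predicted by a scene-graph stage can be scored against hand-labeled relationships without running any other stage. The triple format and the sample data are assumptions made for illustration; precision and recall are used here as generic metrics, not as the project's chosen evaluation protocol.

```python
from typing import Set, Tuple

Triple = Tuple[int, str, int]  # (subject_id, predicate, object_id)


def precision_recall(predicted: Set[Triple], gold: Set[Triple]) -> Tuple[float, float]:
    """Score predicted relationship triples against gold triples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical scene-graph output and hand-labeled ground truth.
predicted = {(0, "holding", 3), (1, "holding", 4)}
gold = {(0, "holding", 3), (2, "holding", 4)}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)  # 0.5 0.5
```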

Model Roadmap

VisionGPT-10B

Initial foundation release. Planned parameter budget per subsystem:

Perception Engine:         3.0B
Scene Graph Engine:        1.0B
Spatial Reasoning:         0.5B
Reasoning Engine:          4.0B
Response Engine:           1.5B

Total:                    10.0B
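
The budget above can be checked with a few lines of arithmetic, which also shows how the total is divided among subsystems. The sizes are taken from the table; the dictionary layout and variable names are only illustrative.

```python
# Planned sizes in billions, copied from the table above.
budget = {
    "Perception Engine": 3.0,
    "Scene Graph Engine": 1.0,
    "Spatial Reasoning": 0.5,
    "Reasoning Engine": 4.0,
    "Response Engine": 1.5,
}

total = sum(budget.values())
shares = {name: 100 * size / total for name, size in budget.items()}

print(f"Total: {total:.1f}B")
for name, share in shares.items():
    print(f"{name}: {share:.0f}%")
```

Under this budget, the Reasoning Engine is the largest component at 40% of the total, followed by the Perception Engine at 30%.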

Repository Structure

VisionGPT/
├── docs/
├── models/
├── datasets/
├── training/
├── evaluation/
├── serving/
├── applications/
├── research/
├── infra/
├── tests/
└── tools/

Documentation

Project specifications live under:

docs/specs/

Current specifications:

MODEL_SPEC.md

ARCHITECTURE.md

DATASET_SPEC.md

TRAINING_SPEC.md

EVALUATION_SPEC.md

REPOSITORY_SPEC.md

INTERFACE_SPEC.md

These documents define the project before implementation begins.

Contributing

VisionGPT is being built as an open architecture project focused on reasoning-first visual intelligence.

Contributions will be welcome once the core architecture is stable.

License

Apache 2.0

Vision

The goal is not to build another model that looks at images.

The goal is to build a system that can:

Observe

Understand

Reason

Explain

with every conclusion grounded in evidence.
