Skip to content

krishpranav/visionGPT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

50 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

VisionGPT

See like humans. Think like scientists.

VisionGPT is a reasoning-first visual foundation model designed to solve one of the largest weaknesses in modern vision-language systems:

they can describe images, but they struggle to reason about them.


The Problem

Modern vision models typically follow a simple pipeline:

Image
 ↓
Vision Encoder
 ↓
Language Model
 ↓
Answer

This approach works well for:

  • Captioning
  • Basic visual question answering
  • OCR
  • General image understanding

However, it often struggles with:

  • Counting
  • Spatial reasoning
  • Object relationships
  • Multi-step visual logic
  • Explainable decision making

Example:

Question:

How many people are holding umbrellas?

Traditional Models:

Image
 ↓
Answer

VisionGPT:

Image
 ↓
Objects
 ↓
Relationships
 ↓
Reasoning
 ↓
Answer

VisionGPT Architecture

VisionGPT is built around explicit reasoning.

┌─────────────────┐
│ Input Image     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Perception      │
│ Engine          │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Scene Graph     │
│ Engine          │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Spatial         │
│ Reasoning       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Reasoning       │
│ Engine          │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Response        │
│ Engine          │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Final Answer    │
└─────────────────┘

Every answer is derived through explicit reasoning stages.


Core Principles

1. Reasoning First

VisionGPT is optimized for:

  • Counting
  • Spatial understanding
  • Relationship reasoning
  • Multi-step logic

instead of pure caption generation.


2. Explainable Outputs

Every answer should be traceable.

Objects
 ↓
Relationships
 ↓
Facts
 ↓
Reasoning
 ↓
Answer

No black-box decisions.


3. Low Hallucination Design

VisionGPT attempts to prevent unsupported conclusions by restricting responses to facts produced by earlier stages.


4. Modular Architecture

Each subsystem can be developed independently.

Perception

Scene Graph

Spatial

Reasoning

Response

This makes debugging and evaluation significantly easier.


Model Roadmap

VisionGPT-10B

Initial foundation release.

Perception Engine:         3.0B
Scene Graph Engine:        1.0B
Spatial Reasoning:         0.5B
Reasoning Engine:          4.0B
Response Engine:           1.5B

Total:                    10.0B

Repository Structure

VisionGPT/

├── docs/
│
├── models/
│
├── datasets/
│
├── training/
│
├── evaluation/
│
├── serving/
│
├── applications/
│
├── research/
│
├── infra/
│
├── tests/
│
└── tools/

Documentation

Project specifications live under:

docs/specs/

Current specifications:

MODEL_SPEC.md

ARCHITECTURE.md

DATASET_SPEC.md

TRAINING_SPEC.md

EVALUATION_SPEC.md

REPOSITORY_SPEC.md

INTERFACE_SPEC.md

These documents define the project before implementation begins.


Contributing

VisionGPT is being built as an open architecture project focused on reasoning-first visual intelligence.

Contributions are welcome after the core architecture reaches implementation stability.


License

Apache 2.0


Vision

The goal is not to build another model that looks at images.

The goal is to build a system that can:

Observe

Understand

Reason

Explain

with every conclusion grounded in evidence.

About

Open-source visual reasoning AI built entirely in-house. 100ms latency. Private. Free. Outperforms GPT-4o on reasoning.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages