Pitch Visualizer: From Words to Storyboard

An AI-powered tool that transforms sales narratives into visual storyboards. Paste a customer success story, and the system automatically extracts characters, settings, emotions, and story phases — then generates a multi-panel illustrated storyboard that brings your pitch to life.

Built for the Darwix AI Assessment — Challenge 2: The Pitch Visualizer.

Examples

Example 1 — Anime Style with Custom Character

Input: Sales pitch about a team member struggling with pitch deck creation
Character: Custom — "Yuki, Japanese 20 year old anime style girl"
Style: Story Animation

The system detected a single-character narrative and created a 4-panel story arc:
Daily Grind → Breaking Point → Discovery → Triumph

Example 2 — Auto-Generated Character (Single Person)

Input: Sarah's story of drowning in emails and discovering Darwix AI
Character: Not specified — AI created Sarah with full visual description
Style: Story Animation

The system identified Sarah as the sole character from the narrative, generated her appearance, and chose a 5-panel arc:
Overwhelmed → Frustration → Discovery → Transformation → Victory

Example 3 — Auto-Generated Team (Multiple Characters)

Input: MedTech Solutions sales team struggling with CRM and data entry
Character: Not specified — AI created 3 team members
Style: Comic

The narrative mentioned "the regional sales team" without naming individuals. The system invented 3 visually distinct characters to represent the team and chose a 5-panel arc:
Struggling → Frustrated → Discovery → Transformation → Success

Architecture & Pipeline

┌─────────────────────────────────────────────────────────────┐
│                        USER INPUT                            │
│   Sales narrative text + (optional) character + style choice │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              STAGE 1: NARRATIVE ANALYZER                      │
│              (narrative_analyzer.py — LLM call)              │
│                                                              │
│  Single LLM call (Groq/Llama 3.3 70B) that decides:         │
│                                                              │
│  ► Character count: 1-3 (auto-detected from narrative,       │
│    or uses user-provided character)                          │
│  ► Character descriptions: structured fields                 │
│    (face, hair, clothing, build — front-loaded by importance)│
│  ► Setting: one consistent location for all panels           │
│  ► Panel count: 3-8 (based on story complexity)              │
│  ► Phase labels: LLM chooses labels that fit the story       │
│    (not locked to Problem/Pain/Solution/Result)              │
│  ► Per-panel: facial expressions → body language → actions   │
│  ► Per-panel: story objects, lighting mood, sales caption     │
│                                                              │
│  Retry logic: 3 attempts with validation on each             │
│  Banned word filtering: art/style words stripped from output │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              STAGE 2: PROMPT BUILDER                         │
│              (prompt_builder.py — NO LLM, pure code)         │
│                                                              │
│  Assembles SDXL-optimized prompts by concatenation:          │
│                                                              │
│  [Style prefix] + [Scene/emotions] + [Setting] +             │
│  [Character tags] + [Objects] + [Lighting] + [Style suffix]  │
│                                                              │
│  ► Scene/emotions placed early in prompt (SDXL uses early    │
│    tokens for global composition — research-backed)          │
│  ► Character tags repeated identically in every prompt       │
│  ► Banned words stripped as safety net                        │
│  ► Style prefix/suffix from config (no conflicts possible)   │
│  ► Target: ~100-150 words per prompt (SDXL sweet spot)       │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              STAGE 3: IMAGE GENERATOR                        │
│              (image_generator.py — HuggingFace API)          │
│                                                              │
│  ► Model: Stable Diffusion XL Base 1.0 (via HF Inference)   │
│  ► Fallback: Segmind SSD-1B (distilled SDXL)                │
│  ► Async parallel generation (all panels at once)            │
│  ► Per-panel retry with model loading detection              │
│  ► Style-specific negative prompts to prevent conflicts      │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              STAGE 4: STORYBOARD DISPLAY                     │
│              (FastAPI + Jinja2 HTML templates)                │
│                                                              │
│  ► Animated panel-by-panel reveal                            │
│  ► Color-coded phase labels with contextual icons            │
│  ► Narrative arc visualization bar                           │
│  ► Character and setting info cards                          │
│  ► Responsive grid layout                                    │
└─────────────────────────────────────────────────────────────┘
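
The structured output that Stage 1 hands to Stage 2 might look roughly like the sketch below. It uses standard-library dataclasses so it runs without dependencies, whereas the project itself defines Pydantic models in schemas.py. Every class, field, and method name here (`Character`, `Panel`, `StoryPlan`, `tags`) is an assumption derived from the diagram, not the actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Character:
    # Structured appearance fields, most important details first.
    face: str
    hair: str
    clothing: str
    build: str

    def tags(self) -> str:
        """Join the fields in a fixed order so every panel gets the same string."""
        return ", ".join([self.face, self.hair, self.clothing, self.build])


@dataclass
class Panel:
    phase_label: str         # chosen by the LLM, e.g. "Breaking Point"
    facial_expressions: str
    body_language: str
    actions: str
    objects: str
    lighting_mood: str
    caption: str             # sales caption shown with the panel


@dataclass
class StoryPlan:
    characters: list[Character]
    setting: str             # one consistent location for all panels
    panels: list[Panel] = field(default_factory=list)
```

Because `tags()` is derived from frozen fields rather than rewritten per panel, the same character always yields the same text in every prompt.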

Prompt Engineering Methodology

The prompt engineering follows a three-layer approach:

1. LLM extracts content: Groq/Llama 3.3 analyzes the narrative and outputs structured fields: facial expressions, body language, character actions, scene objects, and lighting mood. Each field is front-loaded (most important details first) and free of any art/style words.
2. Code assembles the prompt: The prompt builder concatenates fields in SDXL-optimal order: [style prefix] + [scene/emotions] + [setting] + [character tags] + [objects] + [lighting] + [style suffix]. A banned-word regex strips any style tokens that leaked from the LLM.
3. Style tokens are injected by config only: Each visual style (Story Animation, Comic, Watercolor, Simple) has its own prefix, suffix, and negative prompt defined in config.py. No style words ever come from the LLM, eliminating conflicts like "cinematic" appearing in a comic prompt.
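
The assembly order described above can be sketched as plain string concatenation. This is a minimal illustration, not the contents of prompt_builder.py: the preset values, the dictionary keys, and the function names `build_prompt` and `build_all` are all assumptions.

```python
# Hypothetical style preset; the real values live in config.py.
STYLE_PRESETS = {
    "comic": {
        "prefix": "comic book art, bold outlines",
        "suffix": "cel shading, graphic novel look",
    },
}


def build_prompt(style: str, scene: str, setting: str,
                 character_tags: str, objects: str, lighting: str) -> str:
    """Concatenate fields in the fixed order: prefix, scene, setting,
    character tags, objects, lighting, suffix."""
    preset = STYLE_PRESETS[style]
    parts = [preset["prefix"], scene, setting, character_tags,
             objects, lighting, preset["suffix"]]
    # Skip empty fields so the prompt never contains ", ," gaps.
    return ", ".join(p.strip() for p in parts if p.strip())


def build_all(style: str, setting: str, character_tags: str,
              panels: list[dict]) -> list[str]:
    """Build one prompt per panel, reusing the same setting and character tags."""
    return [
        build_prompt(style, p["scene"], setting, character_tags,
                     p["objects"], p["lighting"])
        for p in panels
    ]
```

Since `character_tags` is passed through unchanged, the same substring appears in every panel prompt, which is the property the LLM-written prompts could not provide.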

Key Design Decisions

Why a Hybrid LLM + Code Approach?

Early iterations let the LLM write entire image prompts. This failed because:

The LLM paraphrased character descriptions differently each panel (inconsistency)
Style words leaked in ("cinematic" in a comic prompt, breaking the art style)
Prompts were too long (200+ words, far beyond the 77-token window of each SDXL text encoder)

The final architecture splits responsibilities:

LLM handles content: characters, emotions, actions, objects (what it's good at)
Code handles style: prefix/suffix tokens, banned word stripping, prompt assembly (deterministic, conflict-free)

This guarantees that character tags are byte-for-byte identical across panels, and no style words conflict with the chosen art direction.

Why Front-Loaded Field Ordering?

Each of SDXL's two CLIP text encoders has a context window of 77 tokens, and words near the beginning of the prompt tend to have more influence on the final image than words near the end. Our prompt builder therefore places elements in this priority order:

1. Style prefix (sets the art direction immediately)
2. Scene/emotions (facial expressions, body language: the story)
3. Setting (consistent environment)
4. Character appearance (visual consistency)
5. Objects (story details)
6. Lighting mood (emotional atmosphere)
7. Style suffix (reinforces art direction)

The LLM is also instructed to front-load each field — most important details first within every field — so even if later words get less attention from SDXL, the critical information is already encoded.
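
To see why the ordering matters, one can estimate which fields land inside the encoder window. The sketch below is a rough diagnostic, not part of the project: it counts words and punctuation marks as a crude stand-in for CLIP's BPE tokens (real counts will differ), and the function names are hypothetical. The budget of 75 assumes the 77-token window minus the start and end tokens.

```python
import re

TOKEN_BUDGET = 75  # 77-token window minus start and end tokens


def rough_token_count(text: str) -> int:
    """Crude proxy: count words and punctuation marks. Real CLIP BPE counts differ."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def fields_within_budget(fields: list[tuple[str, str]],
                         budget: int = TOKEN_BUDGET) -> list[str]:
    """Return the names of the fields that fit entirely inside the budget, in order."""
    used = 0
    kept = []
    for name, text in fields:
        used += rough_token_count(text)
        if used > budget:
            break
        kept.append(name)
    return kept
```

Passing the fields in priority order shows at a glance which low-priority fields (typically lighting and the style suffix) are the first to fall outside the window when a prompt grows.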

Why LLM-Decided Panel Count and Phase Labels?

A fixed 4-panel "Problem → Pain → Solution → Result" structure produced rigid, formulaic storyboards. By letting the LLM choose both the count (3-8) and the phase names, we get story-appropriate arcs:

A simple pitch generates 3-4 panels: "Struggle → Discovery → Triumph"
A complex team narrative generates 5-6 panels: "Daily Grind → Breaking Point → Failed Attempt → New Hope → Transformation → Victory"

The LLM sees examples of 3-, 4-, 5-, 6-, and 7-panel structures in its prompt, so it understands the range and picks what fits.
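
Because the panel count and labels are left to the LLM, its output needs checking before use. The following is a minimal sketch of the "3 attempts with validation" idea from Stage 1, assuming the LLM returns JSON with a `panels` list whose items carry a `phase_label`. Those key names, and the functions `validate_plan` and `analyze_with_retry`, are hypothetical.

```python
import json
from typing import Callable

MAX_ATTEMPTS = 3


def validate_plan(plan: object) -> None:
    """Raise ValueError if the plan breaks the panel-count or label rules."""
    if not isinstance(plan, dict):
        raise ValueError("plan must be a JSON object")
    panels = plan.get("panels")
    if not isinstance(panels, list) or not 3 <= len(panels) <= 8:
        raise ValueError("panel count must be between 3 and 8")
    for panel in panels:
        if not isinstance(panel, dict) or not str(panel.get("phase_label", "")).strip():
            raise ValueError("every panel needs a non-empty phase label")


def analyze_with_retry(call_llm: Callable[[str], str], narrative: str) -> dict:
    """Call the LLM up to MAX_ATTEMPTS times until its output validates."""
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        try:
            plan = json.loads(call_llm(narrative))
            validate_plan(plan)
            return plan
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
    raise RuntimeError(f"no valid plan after {MAX_ATTEMPTS} attempts") from last_error
```

`call_llm` is injected as a plain callable so the retry loop can be exercised with a stub instead of a live Groq request.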

Banned Word System

Art/style words like "realistic", "cinematic", "3d", "illustration" can conflict with the chosen visual style and confuse SDXL. These are prevented at two levels:

LLM prompt: explicitly lists banned words and instructs the model to avoid them
Prompt builder: regex-strips any that leak through as a safety net

This dual-layer approach prevents prompts like "comic art style, cinematic lighting, photorealistic" which would produce incoherent images.
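
Both levels can be driven from a single word list, as in the sketch below. The list contains only the words named in this section plus "photorealistic"; the project's full list is not shown in this README, and the function names are assumptions.

```python
import re

# Hypothetical list built from the examples in this section.
BANNED_WORDS = ["realistic", "photorealistic", "cinematic", "3d", "illustration"]

_BANNED_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in BANNED_WORDS) + r")\b",
    re.IGNORECASE,
)


def banned_instruction() -> str:
    """Level 1: a sentence that could be appended to the LLM prompt."""
    return "Never use these words: " + ", ".join(BANNED_WORDS) + "."


def strip_banned(text: str) -> str:
    """Level 2: remove leaked words, then tidy leftover spaces and commas."""
    text = _BANNED_RE.sub("", text)
    text = re.sub(r"\s+", " ", text)             # collapse whitespace runs
    text = re.sub(r"\s*,(?:\s*,)+", ",", text)   # collapse repeated commas
    text = re.sub(r"\s+,", ",", text)            # no space before a comma
    return text.strip(" ,")
```

Word boundaries (`\b`) keep the filter from cutting into longer words, and keeping one list for both levels means the LLM instruction and the safety net cannot drift apart.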

Limitations & Future Improvements

Current Model Constraints

This project uses Stable Diffusion XL Base 1.0 via HuggingFace's free inference API. SDXL is a capable model, but it was released in 2023, and newer models such as FLUX.1 and SD3.5 generally follow prompts more closely. The free inference tier also has rate limits and occasional cold starts (about 30 seconds when the model has not been used recently).
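
Cold starts are the reason Stage 3 retries each panel separately. The sketch below shows one way to combine parallel generation with a per-panel wait-and-retry loop. It assumes the API signals a loading model with a 503 status and an `estimated_time` field. The `fetch` callable is a hypothetical stand-in for the real HTTP request, and none of this is taken from image_generator.py.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical fetch signature: takes a prompt, returns (status_code, payload).
FetchFn = Callable[[str], Awaitable[tuple[int, dict]]]


async def generate_panel(fetch: FetchFn, prompt: str,
                         max_attempts: int = 3, max_wait: float = 30.0) -> dict:
    """Request one image, waiting and retrying while the model is loading."""
    status, payload = await fetch(prompt)
    for _ in range(max_attempts - 1):
        if status != 503:
            break
        # Sleep for the server's estimate, capped at max_wait seconds.
        await asyncio.sleep(min(float(payload.get("estimated_time", 1.0)), max_wait))
        status, payload = await fetch(prompt)
    if status != 200:
        raise RuntimeError(f"generation failed with status {status}")
    return payload


async def generate_all(fetch: FetchFn, prompts: list[str]) -> list[dict]:
    """Generate every panel concurrently; results keep the prompt order."""
    return await asyncio.gather(*(generate_panel(fetch, p) for p in prompts))
```

Retrying inside `generate_panel` rather than around `generate_all` means one slow panel does not force the others to be regenerated.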

Character consistency across panels remains the hardest challenge. Without img2img or IP-Adapter, each panel is generated independently from text alone. The identical character tags in every prompt help, but SDXL will still vary facial features between panels. This is a fundamental limitation of text-to-image without visual conditioning.

img2img: The Path Not Taken

An img2img pipeline would solve the consistency problem:

1. Generate Panel 1 via text-to-image (the base scene)
2. Feed Panel 1's image + a modified prompt into img2img for Panels 2-4
3. The strength parameter (0.3-0.5) would preserve the base composition while changing mood and expressions

We implemented and tested this approach, but every hosted API provider we tried (HuggingFace hf-inference, fal-ai, Replicate) requires paid credits for img2img. Local GPU img2img with SD 1.5 works on a card with 4GB or more of VRAM, but it adds a heavyweight dependency (PyTorch + diffusers + a 4GB model download) that makes the project harder to evaluate.

The tradeoff with img2img is that while consistency improves dramatically, the panels can look too similar to each other — the mood shifts become subtle rather than dramatic. For a sales storyboard where you want a stark visual contrast between "the problem" and "the success", independent text-to-image with strong emotional prompting may actually produce more compelling results.

What Would Make This Production-Ready

FLUX.1 or SD3.5: Modern models with better prompt following and character consistency
IP-Adapter / InstantID: Feed a character reference image into every generation for much stronger identity consistency
img2img pipeline: Generate base scene, then modify mood/lighting/expressions across panels
Slide deck export: Generate PowerPoint/PDF alongside the HTML storyboard
Prompt iteration UI: Let users regenerate individual panels or tweak the LLM's scene descriptions
Multi-language support: Narrative analysis in languages other than English

Tech Stack

| Component | Technology | Purpose |
|---|---|---|
| LLM | Groq + Llama 3.3 70B | Narrative analysis, character/scene extraction |
| Image Gen | SDXL 1.0 (HuggingFace) | Panel image generation |
| Backend | FastAPI | API server, pipeline orchestration |
| Frontend | Jinja2 + HTML/CSS | Server-rendered storyboard UI |
| Validation | Pydantic | Structured data models |
| HTTP | httpx (async) | Parallel image generation |
| JSON | Groq JSON mode | Guaranteed valid JSON from LLM |

Setup & Running

Prerequisites

Python 3.10+
A Groq API key (free, takes 30 seconds)
A HuggingFace token (free, read access)

Step-by-Step Setup

# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/pitch-visualizer.git
cd pitch-visualizer

# 2. Create a virtual environment
python -m venv venv

# 3. Activate it
# On Windows:
venv\Scripts\activate
# On Mac/Linux:
source venv/bin/activate

# 4. Install dependencies
pip install -r requirements.txt

# 5. Create your environment file
cp .env.example .env

# 6. Edit .env and add your API keys:
#    GROQ_API_KEY=gsk_your_groq_key_here
#    HF_API_TOKEN=hf_your_huggingface_token_here

# 7. Run the application
uvicorn app:app --reload --port 8000

Getting API Keys

Groq (for LLM — narrative analysis):

Go to console.groq.com
Sign up (Google login works)
Navigate to API Keys → Create API Key
Copy the key starting with gsk_

HuggingFace (for image generation):

Go to huggingface.co/settings/tokens
Create an account if needed
Create a new token with Read access
Copy the token starting with hf_

Environment File (`.env`)

GROQ_API_KEY=gsk_xxxxxxxxxxxxxxxxxxxxxxxxxxxx
HF_API_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxx
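
A startup check for these two variables could look like the sketch below. It assumes the values from .env have already been loaded into the process environment; how config.py does that is not shown in this README, and the function name `load_api_keys` is hypothetical. The expected prefixes come from the key formats described above.

```python
import os
from typing import Mapping


def load_api_keys(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read both keys and fail early with a clear message if one looks wrong."""
    keys = {}
    for name, prefix in (("GROQ_API_KEY", "gsk_"), ("HF_API_TOKEN", "hf_")):
        value = env.get(name, "").strip()
        if not value:
            raise RuntimeError(f"{name} is not set; add it to .env")
        if not value.startswith(prefix):
            raise RuntimeError(f"{name} should start with {prefix!r}")
        keys[name] = value
    return keys
```

Failing at startup gives a clearer error than an authentication failure halfway through a storyboard run.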

Running

uvicorn app:app --reload --port 8000

Open http://localhost:8000 in your browser.

Project Structure

pitch-visualizer/
│
├── app.py                    # FastAPI routes + pipeline orchestration
├── narrative_analyzer.py     # Stage 1: LLM-powered narrative extraction
├── prompt_builder.py         # Stage 2: Conflict-free prompt assembly
├── image_generator.py        # Stage 3: Async parallel image generation
├── schemas.py                # Pydantic data models
├── config.py                 # API keys, style presets, constants
├── llm_client.py             # Groq/Gemini client with JSON parsing
│
├── templates/
│   ├── index.html            # Input form UI
│   └── storyboard.html       # Storyboard output UI
│
├── static/generated/         # Generated images (per session)
├── examples/                 # Sample narratives + output screenshots
│
├── requirements.txt
├── .env.example
└── README.md

Available Visual Styles

| Style | Description |
|---|---|
| Story Animation | Clean animated look with expressive faces and smooth shading |
| 2D Comic | Bold outlines, cel-shaded coloring, graphic novel aesthetic |
| Watercolor | Soft brush strokes, pastel tones, hand-painted feel |
| Simple & Clean | Minimal flat illustration with soft lighting |

License

MIT
