An AI-powered tool that transforms sales narratives into visual storyboards. Paste a customer success story, and the system automatically extracts characters, settings, emotions, and story phases — then generates a multi-panel illustrated storyboard that brings your pitch to life.
Built for the Darwix AI Assessment — Challenge 2: The Pitch Visualizer.
Input: Sales pitch about a team member struggling with pitch deck creation
Character: Custom — "Yuki, Japanese 20 year old anime style girl"
Style: Story Animation
The system detected a single-character narrative and created a 4-panel story arc:
Daily Grind → Breaking Point → Discovery → Triumph
Input: Sarah's story of drowning in emails and discovering Darwix AI
Character: Not specified — AI created Sarah with full visual description
Style: Story Animation
The system identified Sarah as the sole character from the narrative, generated her appearance, and chose a 5-panel arc:
Overwhelmed → Frustration → Discovery → Transformation → Victory
Input: MedTech Solutions sales team struggling with CRM and data entry
Character: Not specified — AI created 3 team members
Style: Comic
The narrative mentioned "the regional sales team" without naming individuals. The system invented 3 visually distinct characters to represent the team and chose a 5-panel arc:
Struggling → Frustrated → Discovery → Transformation → Success
┌─────────────────────────────────────────────────────────────┐
│ USER INPUT │
│ Sales narrative text + (optional) character + style choice │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 1: NARRATIVE ANALYZER │
│ (narrative_analyzer.py — LLM call) │
│ │
│ Single LLM call (Groq/Llama 3.3 70B) that decides: │
│ │
│ ► Character count: 1-3 (auto-detected from narrative, │
│ or uses user-provided character) │
│ ► Character descriptions: structured fields │
│ (face, hair, clothing, build — front-loaded by importance)│
│ ► Setting: one consistent location for all panels │
│ ► Panel count: 3-8 (based on story complexity) │
│ ► Phase labels: LLM chooses labels that fit the story │
│ (not locked to Problem/Pain/Solution/Result) │
│ ► Per-panel: facial expressions → body language → actions │
│ ► Per-panel: story objects, lighting mood, sales caption │
│ │
│ Retry logic: 3 attempts with validation on each │
│ Banned word filtering: art/style words stripped from output │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 2: PROMPT BUILDER │
│ (prompt_builder.py — NO LLM, pure code) │
│ │
│ Assembles SDXL-optimized prompts by concatenation: │
│ │
│ [Style prefix] + [Scene/emotions] + [Setting] + │
│ [Character tags] + [Objects] + [Lighting] + [Style suffix] │
│ │
│ ► Scene/emotions placed early in prompt (SDXL uses early │
│ tokens for global composition — research-backed) │
│ ► Character tags repeated identically in every prompt │
│ ► Banned words stripped as safety net │
│ ► Style prefix/suffix from config (no conflicts possible) │
│ ► Target: ~100-150 words per prompt (SDXL sweet spot) │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 3: IMAGE GENERATOR │
│ (image_generator.py — HuggingFace API) │
│ │
│ ► Model: Stable Diffusion XL Base 1.0 (via HF Inference) │
│ ► Fallback: Segmind SSD-1B (distilled SDXL) │
│ ► Async parallel generation (all panels at once) │
│ ► Per-panel retry with model loading detection │
│ ► Style-specific negative prompts to prevent conflicts │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 4: STORYBOARD DISPLAY │
│ (FastAPI + Jinja2 HTML templates) │
│ │
│ ► Animated panel-by-panel reveal │
│ ► Color-coded phase labels with contextual icons │
│ ► Narrative arc visualization bar │
│ ► Character and setting info cards │
│ ► Responsive grid layout │
└─────────────────────────────────────────────────────────────┘
The prompt engineering follows a three-layer approach:
- LLM extracts content. Groq/Llama 3.3 analyzes the narrative and outputs structured fields: facial expressions, body language, character actions, scene objects, and lighting mood. Each field is front-loaded (most important details first) and free of any art/style words.
- Code assembles the prompt. The prompt builder concatenates fields in SDXL-optimal order: [style prefix] + [scene/emotions] + [setting] + [character tags] + [objects] + [lighting] + [style suffix]. A banned-word regex strips any style tokens that leaked from the LLM.
- Style tokens are injected by config only. Each visual style (Story Animation, Comic, Watercolor, Simple) has its own prefix, suffix, and negative prompt defined in `config.py`. No style words ever come from the LLM, which eliminates conflicts like "cinematic" appearing in a comic prompt.
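The assembly step can be sketched roughly as follows. This is a minimal illustration, not the project's actual `prompt_builder.py`: the names `STYLE_PRESETS`, `PanelContent`, and `build_prompt`, and the preset strings themselves, are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical preset; the real prefixes and suffixes live in config.py.
STYLE_PRESETS = {
    "comic": {
        "prefix": "comic book art, bold outlines",
        "suffix": "cel-shaded coloring, graphic novel look",
    },
}


@dataclass
class PanelContent:
    """Content fields produced by the LLM for one panel (no style words)."""
    scene: str      # facial expressions, body language, actions
    objects: str    # story objects
    lighting: str   # lighting mood


def build_prompt(style: str, setting: str, character_tags: str,
                 panel: PanelContent) -> str:
    """Concatenate fields in the fixed priority order described above."""
    preset = STYLE_PRESETS[style]
    parts = [
        preset["prefix"],
        panel.scene,
        setting,
        character_tags,   # the same string object is reused for every panel
        panel.objects,
        panel.lighting,
        preset["suffix"],
    ]
    # Skip empty fields so the prompt never contains dangling commas.
    return ", ".join(p.strip() for p in parts if p and p.strip())


tags = "Sarah, shoulder-length brown hair, navy blazer"
panels = [
    PanelContent("exhausted expression, slumped shoulders, typing at a desk",
                 "overflowing inbox on a monitor", "dim evening light"),
    PanelContent("wide smile, upright posture, raising a fist",
                 "clean dashboard on a monitor", "bright morning light"),
]
prompts = [build_prompt("comic", "open-plan office", tags, p) for p in panels]
```

Because the character tags are passed in once and never rewritten, every prompt in `prompts` contains exactly the same tag string.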
Early iterations let the LLM write entire image prompts. This failed because:
- The LLM paraphrased character descriptions differently each panel (inconsistency)
- Style words leaked in ("cinematic" in a comic prompt, breaking the art style)
- Prompts were too long (200+ words, overflowing SDXL's 77-token encoder window)
The final architecture splits responsibilities:
- LLM handles content: characters, emotions, actions, objects (what it's good at)
- Code handles style: prefix/suffix tokens, banned word stripping, prompt assembly (deterministic, conflict-free)
This guarantees that character tags are byte-for-byte identical across panels, and no style words conflict with the chosen art direction.
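SDXL uses two CLIP text encoders, each with a context window of about 77 tokens, and text past that window is typically truncated unless the serving stack chunks long prompts. Words at the beginning of the prompt also tend to have more influence on the final image than words at the end. Our prompt builder therefore places elements in this priority order: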
- Style prefix (sets the art direction immediately)
- Scene/emotions (facial expressions, body language — the story)
- Setting (consistent environment)
- Character appearance (visual consistency)
- Objects (story details)
- Lighting mood (emotional atmosphere)
- Style suffix (reinforces art direction)
The LLM is also instructed to front-load each field — most important details first within every field — so even if later words get less attention from SDXL, the critical information is already encoded.
A fixed 4-panel "Problem → Pain → Solution → Result" structure produced rigid, formulaic storyboards. By letting the LLM choose both the count (3-8) and the phase names, we get story-appropriate arcs:
- A simple pitch generates 3-4 panels: "Struggle → Discovery → Triumph"
- A complex team narrative generates 5-6 panels: "Daily Grind → Breaking Point → Failed Attempt → New Hope → Transformation → Victory"
The LLM sees examples of 3, 4, 5, 6, and 7-panel structures in its prompt, so it understands the range and picks what fits.
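The 3-8 panel range and the 1-3 character range are checked on every attempt, with up to three attempts per analysis. A rough sketch of that loop is below. The project uses Pydantic models for this; the version here uses only the standard library so it stays self-contained, and the field names (`panels`, `characters`, `phase`) and function names are assumptions for illustration.

```python
import json
from typing import Callable

MIN_PANELS, MAX_PANELS = 3, 8
MIN_CHARACTERS, MAX_CHARACTERS = 1, 3


def validate_storyboard(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the plan is usable."""
    problems = []
    panels = data.get("panels", [])
    characters = data.get("characters", [])
    if not MIN_PANELS <= len(panels) <= MAX_PANELS:
        problems.append(f"panel count {len(panels)} outside {MIN_PANELS}-{MAX_PANELS}")
    if not MIN_CHARACTERS <= len(characters) <= MAX_CHARACTERS:
        problems.append(f"character count {len(characters)} outside "
                        f"{MIN_CHARACTERS}-{MAX_CHARACTERS}")
    for i, panel in enumerate(panels, start=1):
        # The label text is free-form, but every panel needs one.
        if not isinstance(panel, dict) or not str(panel.get("phase", "")).strip():
            problems.append(f"panel {i} has no phase label")
    return problems


def analyze_with_retry(call_llm: Callable[[], str], attempts: int = 3) -> dict:
    """Call the LLM until its JSON output validates, up to `attempts` times."""
    last_error = "no attempts made"
    for _ in range(attempts):
        try:
            data = json.loads(call_llm())
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc}"
            continue
        if not isinstance(data, dict):
            last_error = "top-level JSON value is not an object"
            continue
        problems = validate_storyboard(data)
        if not problems:
            return data
        last_error = "; ".join(problems)
    raise ValueError(f"analysis failed after {attempts} attempts: {last_error}")
```

Here `call_llm` stands in for whatever function sends the analysis prompt to Groq and returns the raw response text.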
Art/style words like "realistic", "cinematic", "3d", "illustration" can conflict with the chosen visual style and confuse SDXL. These are prevented at two levels:
- LLM prompt: explicitly lists banned words and instructs the model to avoid them
- Prompt builder: regex-strips any that leak through as a safety net
This dual-layer approach prevents prompts like "comic art style, cinematic lighting, photorealistic" which would produce incoherent images.
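The safety-net layer in the prompt builder might look something like this. The word list is taken from the examples named above and is only a guess at the project's real list; `BANNED_WORDS` and `strip_banned` are illustrative names.

```python
import re

# Assumed list, built from the examples in this README; the real list may differ.
BANNED_WORDS = ["photorealistic", "realistic", "cinematic", "3d", "illustration"]

# Whole-word, case-insensitive match so "unrealistic" or "3days" are left alone.
_BANNED_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in BANNED_WORDS) + r")\b",
    re.IGNORECASE,
)


def strip_banned(text: str) -> str:
    """Remove banned style words, then tidy the commas and spaces left behind."""
    text = _BANNED_RE.sub("", text)
    text = re.sub(r"\s+", " ", text)                    # collapse whitespace
    text = re.sub(r"\s*,(?:\s*,)*\s*", ", ", text)      # collapse runs of commas
    return text.strip(" ,")
```

Note that stripping works word by word, so "cinematic lighting" becomes "lighting" rather than disappearing entirely. That is one reason the LLM-side instruction remains the first line of defense.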
This project uses Stable Diffusion XL Base 1.0 via HuggingFace's free inference API. SDXL is a capable model, but it was released in 2023, and newer models such as FLUX.1 and SD3.5 generally follow prompts more closely. The free inference tier also has rate limits and occasional cold starts (~30s when the model hasn't been used recently).
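Cold starts and rate limits are handled with per-panel retries while all panels are generated in parallel. The sketch below shows one way to structure that. It assumes a loading model is signalled by HTTP 503 and rate limiting by HTTP 429, and it takes the request function as a parameter (in the project this would wrap an `httpx` call) so the example runs without a network. All names are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

# A request function takes a prompt and returns (status_code, body_bytes).
RequestFn = Callable[[str], Awaitable[tuple[int, bytes]]]

# Assumption: 503 means "model still loading", 429 means "rate limited".
RETRYABLE_STATUSES = (429, 503)


async def generate_panel(prompt: str, request: RequestFn,
                         max_attempts: int = 4, base_delay: float = 1.0) -> bytes:
    """Generate one panel, backing off exponentially on retryable errors."""
    for attempt in range(max_attempts):
        status, body = await request(prompt)
        if status == 200:
            return body
        is_last = attempt == max_attempts - 1
        if status in RETRYABLE_STATUSES and not is_last:
            await asyncio.sleep(base_delay * 2 ** attempt)
            continue
        raise RuntimeError(f"generation failed with HTTP {status}")
    raise RuntimeError("max_attempts must be at least 1")


async def generate_all(prompts: list[str], request: RequestFn,
                       **retry_options) -> list[bytes]:
    """Start every panel at once; results come back in prompt order."""
    tasks = (generate_panel(p, request, **retry_options) for p in prompts)
    return list(await asyncio.gather(*tasks))
```

Each panel retries independently, so one slow or failed panel does not restart the others.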
Character consistency across panels remains the hardest challenge. Without img2img or IP-Adapter, each panel is generated independently from text alone. The identical character tags in every prompt help, but SDXL will still vary facial features between panels. This is a fundamental limitation of text-to-image without visual conditioning.
An img2img pipeline would solve the consistency problem:
- Generate Panel 1 via text-to-image (the base scene)
- Feed Panel 1's image and a modified prompt into img2img for each remaining panel
- The `strength` parameter (0.3-0.5) would preserve the base composition while changing mood and expressions
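The steps above can be expressed as a small job plan that a generation backend would then execute. This is only a planning sketch under assumed names (`PanelJob`, `plan_jobs`); it does not call any image model. The `strength` field corresponds to the parameter of the same name that img2img pipelines such as those in `diffusers` accept.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PanelJob:
    index: int                         # 1-based panel number
    mode: str                          # "txt2img" or "img2img"
    prompt: str
    init_panel: Optional[int] = None   # which panel's image seeds this one
    strength: Optional[float] = None   # how far img2img may drift from the seed


def plan_jobs(prompts: list[str], strength: float = 0.4) -> list[PanelJob]:
    """Panel 1 is generated from text; later panels are derived from Panel 1."""
    if not 0.0 < strength <= 1.0:
        raise ValueError("strength must be in (0, 1]")
    jobs = []
    for i, prompt in enumerate(prompts, start=1):
        if i == 1:
            jobs.append(PanelJob(i, "txt2img", prompt))
        else:
            jobs.append(PanelJob(i, "img2img", prompt, init_panel=1, strength=strength))
    return jobs
```

Seeding every later panel from Panel 1, rather than chaining each panel from the previous one, avoids accumulating drift across the storyboard.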
We implemented and tested this approach, but every hosted provider we tried (HuggingFace hf-inference, fal-ai, Replicate) requires paid credits for img2img. Local GPU img2img with SD 1.5 works on a card with 4GB+ of VRAM but adds a heavyweight dependency (PyTorch + diffusers + 4GB model download) that makes the project harder to evaluate.
The tradeoff with img2img is that while consistency improves dramatically, the panels can look too similar to each other — the mood shifts become subtle rather than dramatic. For a sales storyboard where you want a stark visual contrast between "the problem" and "the success", independent text-to-image with strong emotional prompting may actually produce more compelling results.
- FLUX.1 or SD3.5: Modern models with better prompt following and character consistency
- IP-Adapter / InstantID: Feed a character reference image into every generation for much stronger character consistency
- img2img pipeline: Generate base scene, then modify mood/lighting/expressions across panels
- Slide deck export: Generate PowerPoint/PDF alongside the HTML storyboard
- Prompt iteration UI: Let users regenerate individual panels or tweak the LLM's scene descriptions
- Multi-language support: Narrative analysis in languages other than English
| Component | Technology | Purpose |
|---|---|---|
| LLM | Groq + Llama 3.3 70B | Narrative analysis, character/scene extraction |
| Image Gen | SDXL 1.0 (HuggingFace) | Panel image generation |
| Backend | FastAPI | API server, pipeline orchestration |
| Frontend | Jinja2 + HTML/CSS | Server-rendered storyboard UI |
| Validation | Pydantic | Structured data models |
| HTTP | httpx (async) | Parallel image generation |
| JSON | Groq JSON mode | Syntactically valid JSON from the LLM |
- Python 3.10+
- A Groq API key (free, takes 30 seconds)
- A HuggingFace token (free, read access)
# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/pitch-visualizer.git
cd pitch-visualizer
# 2. Create a virtual environment
python -m venv venv
# 3. Activate it
# On Windows:
venv\Scripts\activate
# On Mac/Linux:
source venv/bin/activate
# 4. Install dependencies
pip install -r requirements.txt
# 5. Create your environment file
cp .env.example .env
# 6. Edit .env and add your API keys:
# GROQ_API_KEY=gsk_your_groq_key_here
# HF_API_TOKEN=hf_your_huggingface_token_here
# 7. Run the application
uvicorn app:app --reload --port 8000

Groq (LLM, used for narrative analysis):
- Go to console.groq.com
- Sign up (Google login works)
- Navigate to API Keys → Create API Key
- Copy the key starting with `gsk_`
HuggingFace (for image generation):
- Go to huggingface.co/settings/tokens
- Create an account if needed
- Create a new token with Read access
- Copy the token starting with `hf_`
GROQ_API_KEY=gsk_xxxxxxxxxxxxxxxxxxxxxxxxxxxx
HF_API_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxx
uvicorn app:app --reload --port 8000

Open http://localhost:8000 in your browser.
pitch-visualizer/
│
├── app.py # FastAPI routes + pipeline orchestration
├── narrative_analyzer.py # Stage 1: LLM-powered narrative extraction
├── prompt_builder.py # Stage 2: Conflict-free prompt assembly
├── image_generator.py # Stage 3: Async parallel image generation
├── schemas.py # Pydantic data models
├── config.py # API keys, style presets, constants
├── llm_client.py # Groq/Gemini client with JSON parsing
│
├── templates/
│ ├── index.html # Input form UI
│ └── storyboard.html # Storyboard output UI
│
├── static/generated/ # Generated images (per session)
├── examples/ # Sample narratives + output screenshots
│
├── requirements.txt
├── .env.example
└── README.md
| Style | Description |
|---|---|
| Story Animation | Clean animated look with expressive faces and smooth shading |
| 2D Comic | Bold outlines, cel-shaded coloring, graphic novel aesthetic |
| Watercolor | Soft brush strokes, pastel tones, hand-painted feel |
| Simple & Clean | Minimal flat illustration with soft lighting |
MIT


