bit-soham/StoryCraft

Pitch Visualizer: From Words to Storyboard

An AI-powered tool that transforms sales narratives into visual storyboards. Paste a customer success story, and the system automatically extracts characters, settings, emotions, and story phases — then generates a multi-panel illustrated storyboard that brings your pitch to life.

Built for the Darwix AI Assessment — Challenge 2: The Pitch Visualizer.


Examples

Example 1 — Anime Style with Custom Character

Input: Sales pitch about a team member struggling with pitch deck creation
Character: Custom — "Yuki, Japanese 20 year old anime style girl"
Style: Story Animation

The system detected a single-character narrative and created a 4-panel story arc:
Daily Grind → Breaking Point → Discovery → Triumph


Example 2 — Auto-Generated Character (Single Person)

Input: Sarah's story of drowning in emails and discovering Darwix AI
Character: Not specified — AI created Sarah with full visual description
Style: Story Animation

The system identified Sarah as the sole character from the narrative, generated her appearance, and chose a 5-panel arc:
Overwhelmed → Frustration → Discovery → Transformation → Victory


Example 3 — Auto-Generated Team (Multiple Characters)

Input: MedTech Solutions sales team struggling with CRM and data entry
Character: Not specified — AI created 3 team members
Style: Comic

The narrative mentioned "the regional sales team" without naming individuals. The system invented 3 visually distinct characters to represent the team and chose a 5-panel arc:
Struggling → Frustrated → Discovery → Transformation → Success


Architecture & Pipeline

┌─────────────────────────────────────────────────────────────┐
│                        USER INPUT                            │
│   Sales narrative text + (optional) character + style choice │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              STAGE 1: NARRATIVE ANALYZER                      │
│              (narrative_analyzer.py — LLM call)              │
│                                                              │
│  Single LLM call (Groq/Llama 3.3 70B) that decides:         │
│                                                              │
│  ► Character count: 1-3 (auto-detected from narrative,       │
│    or uses user-provided character)                          │
│  ► Character descriptions: structured fields                 │
│    (face, hair, clothing, build — front-loaded by importance)│
│  ► Setting: one consistent location for all panels           │
│  ► Panel count: 3-8 (based on story complexity)              │
│  ► Phase labels: LLM chooses labels that fit the story       │
│    (not locked to Problem/Pain/Solution/Result)              │
│  ► Per-panel: facial expressions → body language → actions   │
│  ► Per-panel: story objects, lighting mood, sales caption     │
│                                                              │
│  Retry logic: 3 attempts with validation on each             │
│  Banned word filtering: art/style words stripped from output │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              STAGE 2: PROMPT BUILDER                         │
│              (prompt_builder.py — NO LLM, pure code)         │
│                                                              │
│  Assembles SDXL-optimized prompts by concatenation:          │
│                                                              │
│  [Style prefix] + [Scene/emotions] + [Setting] +             │
│  [Character tags] + [Objects] + [Lighting] + [Style suffix]  │
│                                                              │
│  ► Scene/emotions placed early in prompt (SDXL uses early    │
│    tokens for global composition — research-backed)          │
│  ► Character tags repeated identically in every prompt       │
│  ► Banned words stripped as safety net                        │
│  ► Style prefix/suffix from config (no conflicts possible)   │
│  ► Target: ~100-150 words per prompt (SDXL sweet spot)       │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              STAGE 3: IMAGE GENERATOR                        │
│              (image_generator.py — HuggingFace API)          │
│                                                              │
│  ► Model: Stable Diffusion XL Base 1.0 (via HF Inference)   │
│  ► Fallback: Segmind SSD-1B (distilled SDXL)                │
│  ► Async parallel generation (all panels at once)            │
│  ► Per-panel retry with model loading detection              │
│  ► Style-specific negative prompts to prevent conflicts      │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              STAGE 4: STORYBOARD DISPLAY                     │
│              (FastAPI + Jinja2 HTML templates)                │
│                                                              │
│  ► Animated panel-by-panel reveal                            │
│  ► Color-coded phase labels with contextual icons            │
│  ► Narrative arc visualization bar                           │
│  ► Character and setting info cards                          │
│  ► Responsive grid layout                                    │
└─────────────────────────────────────────────────────────────┘
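
The "async parallel generation" and "per-panel retry" bullets in Stage 3 can be sketched roughly as follows. This is a minimal illustration, not the code in image_generator.py: the `generate` callable, the function names, and the retry defaults are all assumptions, and a real client would make the HTTP request to the inference API inside `generate`.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical signature: takes a prompt, returns image bytes, raises on failure.
GenerateFn = Callable[[str], Awaitable[bytes]]


async def generate_panel(
    prompt: str,
    generate: GenerateFn,
    max_attempts: int = 3,
    retry_delay: float = 0.0,
) -> Optional[bytes]:
    """Try one panel up to max_attempts times; return None if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await generate(prompt)
        except Exception:
            if attempt < max_attempts:
                # A real client might wait longer here while the model is loading.
                await asyncio.sleep(retry_delay)
    return None


async def generate_storyboard(
    prompts: list[str], generate: GenerateFn
) -> list[Optional[bytes]]:
    """Start every panel at once; results come back in panel order."""
    tasks = [generate_panel(p, generate) for p in prompts]
    return await asyncio.gather(*tasks)
```

Because each panel retries independently, one slow or failing panel does not block the others, and a `None` entry marks a panel that could not be generated.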

Prompt Engineering Methodology

The prompt engineering follows a three-layer approach:

  1. LLM extracts content — Groq/Llama 3.3 analyzes the narrative and outputs structured fields: facial expressions, body language, character actions, scene objects, and lighting mood. Each field is front-loaded (most important details first) and free of any art/style words.

  2. Code assembles the prompt — The prompt builder concatenates fields in SDXL-optimal order: [style prefix] + [scene/emotions] + [setting] + [character tags] + [objects] + [lighting] + [style suffix]. A banned-word regex strips any style tokens that leaked from the LLM.

  3. Style tokens are injected by config only — Each visual style (Story Animation, Comic, Watercolor, Simple) has its own prefix, suffix, and negative prompt defined in config.py. No style words ever come from the LLM, eliminating conflicts like "cinematic" appearing in a comic prompt.
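
As a rough sketch of layers 2 and 3, the assembly step could look like the code below. The preset strings, the `PanelContent` fields, and `build_prompt` are hypothetical names chosen for illustration; the actual presets live in config.py and are not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical preset values; the real prefix/suffix/negative strings are in config.py.
STYLE_PRESETS = {
    "comic": {
        "prefix": "comic book panel",
        "suffix": "bold outlines, flat colors",
        "negative": "photo, 3d render",
    },
}


@dataclass(frozen=True)
class PanelContent:
    """Content fields produced by the LLM for one panel (no style words)."""
    scene: str
    setting: str
    character_tags: str
    objects: str
    lighting: str


def build_prompt(panel: PanelContent, style: str) -> str:
    """Concatenate fields in the fixed order; style tokens come only from the preset."""
    preset = STYLE_PRESETS[style]
    parts = [
        preset["prefix"],
        panel.scene,
        panel.setting,
        panel.character_tags,
        panel.objects,
        panel.lighting,
        preset["suffix"],
    ]
    return ", ".join(p.strip() for p in parts if p.strip())
```

Since `character_tags` is passed through as an unmodified string, reusing the same value for every panel keeps that part of each prompt identical.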


Key Design Decisions

Why a Hybrid LLM + Code Approach?

Early iterations let the LLM write entire image prompts. This failed because:

  • The LLM paraphrased character descriptions differently each panel (inconsistency)
  • Style words leaked in ("cinematic" in a comic prompt, breaking the art style)
  • Prompts were too long (200+ words, far beyond the 77-token context window of SDXL's CLIP text encoders, so later text was likely truncated or given little weight)

The final architecture splits responsibilities:

  • LLM handles content: characters, emotions, actions, objects (what it's good at)
  • Code handles style: prefix/suffix tokens, banned word stripping, prompt assembly (deterministic, conflict-free)

This guarantees that character tags are byte-for-byte identical across panels, and no style words conflict with the chosen art direction.

Why Front-Loaded Field Ordering?

SDXL uses two CLIP text encoders, each with a context window of 77 tokens. Words at the beginning of the prompt tend to have more influence on the final image than words at the end, and text beyond the window may be truncated. Our prompt builder places elements in this priority order:

  1. Style prefix (sets the art direction immediately)
  2. Scene/emotions (facial expressions, body language — the story)
  3. Setting (consistent environment)
  4. Character appearance (visual consistency)
  5. Objects (story details)
  6. Lighting mood (emotional atmosphere)
  7. Style suffix (reinforces art direction)

The LLM is also instructed to front-load each field — most important details first within every field — so even if later words get less attention from SDXL, the critical information is already encoded.
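
One way to check where each field lands in the assembled prompt is to compute its starting word position. The sketch below is illustrative only: the field names and the 50-word budget are assumptions, and word counts are only a rough proxy for CLIP tokens.

```python
# Priority order from the list above; the key names are hypothetical.
FIELD_ORDER = [
    "style_prefix",
    "scene",
    "setting",
    "character",
    "objects",
    "lighting",
    "style_suffix",
]


def field_word_offsets(fields: dict[str, str]) -> dict[str, int]:
    """Return the index of the first word of each non-empty field in the prompt."""
    offsets: dict[str, int] = {}
    position = 0
    for name in FIELD_ORDER:
        text = fields.get(name, "").strip()
        if not text:
            continue
        offsets[name] = position
        position += len(text.split())
    return offsets


def fields_past_budget(fields: dict[str, str], budget_words: int = 50) -> list[str]:
    """List fields that start at or beyond the (assumed) word budget."""
    return [
        name
        for name, start in field_word_offsets(fields).items()
        if start >= budget_words
    ]
```

A check like this makes it easy to spot when a long scene description pushes character tags or lighting so far back that they may carry little weight.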

Why LLM-Decided Panel Count and Phase Labels?

A fixed 4-panel "Problem → Pain → Solution → Result" structure produced rigid, formulaic storyboards. By letting the LLM choose both the count (3-8) and the phase names, we get story-appropriate arcs:

  • A simple pitch generates 3-4 panels: "Struggle → Discovery → Triumph"
  • A complex team narrative generates 5-6 panels: "Daily Grind → Breaking Point → Failed Attempt → New Hope → Transformation → Victory"

The LLM sees examples of 3, 4, 5, 6, and 7-panel structures in its prompt, so it understands the range and picks what fits.
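
Because the LLM picks the panel count and labels freely, its output needs checking before use. The following is a minimal sketch of validation plus the three-attempt retry mentioned in the pipeline diagram. The JSON keys (`panels`, `phase_label`), the function names, and the `call_llm` callable are assumptions for illustration, not the project's actual schema or client.

```python
import json
from typing import Callable

MIN_PANELS, MAX_PANELS = 3, 8  # the range stated above


def validate_storyboard(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the output is usable."""
    panels = data.get("panels")
    if not isinstance(panels, list):
        return ["'panels' must be a list"]
    problems = []
    if not MIN_PANELS <= len(panels) <= MAX_PANELS:
        problems.append(
            f"expected {MIN_PANELS}-{MAX_PANELS} panels, got {len(panels)}"
        )
    for i, panel in enumerate(panels, start=1):
        label = panel.get("phase_label", "") if isinstance(panel, dict) else ""
        if not isinstance(label, str) or not label.strip():
            problems.append(f"panel {i} has no phase label")
    return problems


def analyze_with_retry(
    call_llm: Callable[[str], str], narrative: str, max_attempts: int = 3
) -> dict:
    """Call the LLM until its JSON output validates, or raise after max_attempts."""
    last_problems: list[str] = []
    for _ in range(max_attempts):
        try:
            data = json.loads(call_llm(narrative))
        except json.JSONDecodeError as exc:
            last_problems = [f"invalid JSON: {exc}"]
            continue
        if not isinstance(data, dict):
            last_problems = ["top-level JSON must be an object"]
            continue
        last_problems = validate_storyboard(data)
        if not last_problems:
            return data
    raise ValueError(
        f"no valid storyboard after {max_attempts} attempts: {last_problems}"
    )
```

Keeping the validator separate from the retry loop means the same checks can be reused in tests without calling the LLM.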

Banned Word System

Art/style words like "realistic", "cinematic", "3d", "illustration" can conflict with the chosen visual style and confuse SDXL. These are prevented at two levels:

  1. LLM prompt: explicitly lists banned words and instructs the model to avoid them
  2. Prompt builder: regex-strips any that leak through as a safety net

This dual-layer approach prevents prompts like "comic art style, cinematic lighting, photorealistic" which would produce incoherent images.
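
The second layer, the regex safety net, could be sketched as below. The word list contains only the four examples named above (the project's real list is presumably longer), and `strip_banned` is a hypothetical name.

```python
import re

# Only the four examples named above; the real banned list is an unknown here.
BANNED_WORDS = ["realistic", "cinematic", "3d", "illustration"]

_BANNED_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in BANNED_WORDS) + r")\b",
    flags=re.IGNORECASE,
)


def strip_banned(text: str) -> str:
    """Remove banned whole words, then tidy leftover spaces and commas."""
    cleaned = _BANNED_RE.sub("", text)
    cleaned = re.sub(r"\s+", " ", cleaned)            # collapse runs of whitespace
    cleaned = re.sub(r"\s*,(?:\s*,)+", ",", cleaned)  # collapse empty comma slots
    cleaned = re.sub(r"\s+,", ",", cleaned)           # no space before a comma
    return cleaned.strip(" ,")
```

Note that this sketch matches whole words only, so a compound such as "photorealistic" would need its own entry in the list.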


Limitations & Future Improvements

Current Model Constraints

This project uses Stable Diffusion XL Base 1.0 via HuggingFace's free inference API. SDXL is a powerful model but was released in 2023 — newer models like FLUX.1 and SD3.5 produce significantly better results. The free inference tier also has rate limits and occasional cold starts (~30s when the model hasn't been used recently).

Character consistency across panels remains the hardest challenge. Without img2img or IP-Adapter, each panel is generated independently from text alone. The identical character tags in every prompt help, but SDXL will still vary facial features between panels. This is a fundamental limitation of text-to-image without visual conditioning.

img2img: The Path Not Taken

An img2img pipeline would solve the consistency problem:

  1. Generate Panel 1 via text-to-image (the base scene)
  2. Feed Panel 1's image + a modified prompt into img2img for Panels 2-4
  3. The strength parameter (0.3-0.5) would preserve the base composition while changing mood and expressions
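
The three steps above can be expressed as a simple job plan. This sketch only describes which panels would use text-to-image and which would use img2img; it does not call any image model. The `GenerationJob` type and `plan_jobs` function are hypothetical names, and the 0.4 default is just a value inside the 0.3-0.5 range mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenerationJob:
    panel: int                         # 1-based panel number
    mode: str                          # "txt2img" or "img2img"
    prompt: str
    init_panel: Optional[int] = None   # panel whose image seeds this one
    strength: Optional[float] = None   # how far img2img may drift from the seed


def plan_jobs(prompts: list[str], strength: float = 0.4) -> list[GenerationJob]:
    """Panel 1 is generated from text; later panels are derived from panel 1."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    jobs: list[GenerationJob] = []
    for number, prompt in enumerate(prompts, start=1):
        if number == 1:
            jobs.append(GenerationJob(panel=1, mode="txt2img", prompt=prompt))
        else:
            jobs.append(
                GenerationJob(
                    panel=number,
                    mode="img2img",
                    prompt=prompt,
                    init_panel=1,
                    strength=strength,
                )
            )
    return jobs
```

One consequence of this plan is that panels 2 onward depend on panel 1, so they cannot start until it finishes, unlike the fully parallel text-to-image pipeline.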

We implemented and tested this approach, but none of the hosted providers we considered (HuggingFace hf-inference, fal-ai, Replicate) offers img2img without paid credits. Running img2img locally with SD 1.5 works on a GPU with 4GB+ of VRAM, but it adds a heavyweight dependency (PyTorch + diffusers + 4GB model download) that makes the project harder to evaluate.

The tradeoff with img2img is that while consistency improves dramatically, the panels can look too similar to each other — the mood shifts become subtle rather than dramatic. For a sales storyboard where you want a stark visual contrast between "the problem" and "the success", independent text-to-image with strong emotional prompting may actually produce more compelling results.

What Would Make This Production-Ready

  • FLUX.1 or SD3.5: Modern models with better prompt following and character consistency
  • IP-Adapter / InstantID: Feed a character reference image into every generation for much stronger identity consistency across panels
  • img2img pipeline: Generate base scene, then modify mood/lighting/expressions across panels
  • Slide deck export: Generate PowerPoint/PDF alongside the HTML storyboard
  • Prompt iteration UI: Let users regenerate individual panels or tweak the LLM's scene descriptions
  • Multi-language support: Narrative analysis in languages other than English

Tech Stack

| Component  | Technology             | Purpose                                        |
|------------|------------------------|------------------------------------------------|
| LLM        | Groq + Llama 3.3 70B   | Narrative analysis, character/scene extraction |
| Image Gen  | SDXL 1.0 (HuggingFace) | Panel image generation                         |
| Backend    | FastAPI                | API server, pipeline orchestration             |
| Frontend   | Jinja2 + HTML/CSS      | Server-rendered storyboard UI                  |
| Validation | Pydantic               | Structured data models                         |
| HTTP       | httpx (async)          | Parallel image generation                      |
| JSON       | Groq JSON mode         | Guaranteed valid JSON from LLM                 |

Setup & Running

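Prerequisites

  • Python 3 with pip and the venv module
  • A Groq API key and a HuggingFace access token (see Getting API Keys below)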

Step-by-Step Setup

# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/pitch-visualizer.git
cd pitch-visualizer

# 2. Create a virtual environment
python -m venv venv

# 3. Activate it
# On Windows:
venv\Scripts\activate
# On Mac/Linux:
source venv/bin/activate

# 4. Install dependencies
pip install -r requirements.txt

# 5. Create your environment file
cp .env.example .env

# 6. Edit .env and add your API keys:
#    GROQ_API_KEY=gsk_your_groq_key_here
#    HF_API_TOKEN=hf_your_huggingface_token_here

# 7. Run the application
uvicorn app:app --reload --port 8000

Getting API Keys

Groq (for LLM — narrative analysis):

  1. Go to console.groq.com
  2. Sign up (Google login works)
  3. Navigate to API Keys → Create API Key
  4. Copy the key starting with gsk_

HuggingFace (for image generation):

  1. Go to huggingface.co/settings/tokens
  2. Create an account if needed
  3. Create a new token with Read access
  4. Copy the token starting with hf_

Environment File (.env)

GROQ_API_KEY=gsk_xxxxxxxxxxxxxxxxxxxxxxxxxxxx
HF_API_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxx
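
A quick sanity check for these two variables might look like the sketch below. It assumes the .env values have already been loaded into the process environment (how the project loads them is not shown here), and `check_api_keys` is a hypothetical helper, not part of config.py. The expected prefixes come from the key formats described above.

```python
import os
from typing import Mapping

# Variable names and prefixes as shown in the .env example above.
REQUIRED_KEYS = {"GROQ_API_KEY": "gsk_", "HF_API_TOKEN": "hf_"}


def check_api_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return a list of problems with the API keys; empty means both look plausible."""
    problems = []
    for name, prefix in REQUIRED_KEYS.items():
        value = env.get(name, "").strip()
        if not value:
            problems.append(f"{name} is not set")
        elif not value.startswith(prefix):
            problems.append(f"{name} does not start with {prefix!r}")
    return problems
```

Running a check like this at startup surfaces a missing or mistyped key immediately instead of as a failed API call midway through the pipeline.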

Running

uvicorn app:app --reload --port 8000

Open http://localhost:8000 in your browser.


Project Structure

pitch-visualizer/
│
├── app.py                    # FastAPI routes + pipeline orchestration
├── narrative_analyzer.py     # Stage 1: LLM-powered narrative extraction
├── prompt_builder.py         # Stage 2: Conflict-free prompt assembly
├── image_generator.py        # Stage 3: Async parallel image generation
├── schemas.py                # Pydantic data models
├── config.py                 # API keys, style presets, constants
├── llm_client.py             # Groq/Gemini client with JSON parsing
│
├── templates/
│   ├── index.html            # Input form UI
│   └── storyboard.html       # Storyboard output UI
│
├── static/generated/         # Generated images (per session)
├── examples/                 # Sample narratives + output screenshots
│
├── requirements.txt
├── .env.example
└── README.md

Available Visual Styles

| Style           | Description                                                  |
|-----------------|--------------------------------------------------------------|
| Story Animation | Clean animated look with expressive faces and smooth shading |
| 2D Comic        | Bold outlines, cel-shaded coloring, graphic novel aesthetic  |
| Watercolor      | Soft brush strokes, pastel tones, hand-painted feel          |
| Simple & Clean  | Minimal flat illustration with soft lighting                 |

License

MIT
