An intelligent video editing assistant that revolutionizes post-production workflows through AI-powered content analysis and automated suggestion generation
Quick Start • AI Models • Documentation • Development
- Project Overview
- Project Structure
- Core Features
- AI Models & Technologies
- Installation & Setup
- Usage Guide
- Architecture
- Configuration
- Development
- Testing
- Lightweight Test Mode
- Performance
- Contributing
- License
VideoCraft/
├── main.py                          # Main Streamlit application
├── config.yaml                      # Configuration settings
├── requirements.txt                 # Python dependencies
├── LICENSE                          # Project license
├── README.md                        # This file
│
├── src/                             # Core source code
│   ├── ai_models/                   # AI/ML models
│   │   ├── emotion_detector.py      # Emotion detection system
│   │   ├── sentiment_analyzer.py    # Text sentiment analysis
│   │   ├── visual_analyzer.py       # Computer vision models
│   │   └── user_learning_system.py  # Personalization AI
│   │
│   ├── processors/                  # Data processing modules
│   │   ├── audio_analyzer.py        # Audio analysis & processing
│   │   ├── scene_detector.py        # Scene change detection
│   │   ├── script_parser.py         # Script/subtitle parsing
│   │   └── video_analyzer.py        # Video content analysis
│   │
│   ├── suggestions/                 # AI suggestion engines
│   │   ├── cut_suggester.py         # Intelligent cut recommendations
│   │   └── transition_recommender.py # Transition suggestions
│   │
│   ├── ui/                          # User interface components
│   │   ├── suggestion_panel.py      # Suggestions display
│   │   └── timeline_viewer.py       # Timeline visualization
│   │
│   └── utils/                       # Utility modules
│       ├── cloud_integration.py     # Cloud processing support
│       ├── file_handler.py          # File I/O operations
│       └── timeline_sync.py         # Timeline synchronization
│
├── docs/                            # Documentation & reports
│   ├── reports/                     # Analysis & performance reports
│   ├── fixes/                       # Development fix logs
│   └── README.md                    # Documentation index
│
├── scripts/                         # Development & testing scripts
│   ├── setup/                       # Setup & installation utilities
│   ├── test_scripts/                # Testing & validation scripts
│   └── README.md                    # Scripts directory guide
│
└── tests/                           # Unit & integration tests
    └── test_*.py                    # Test modules
VideoCraft is a cutting-edge AI-powered video editing assistant designed to streamline and enhance the post-production workflow for content creators, filmmakers, and video editors. By leveraging state-of-the-art machine learning models, VideoCraft automatically analyzes video content, audio tracks, and scripts to provide intelligent suggestions for cuts, transitions, and editing decisions.
- Intelligent Scene Detection: Automatically identifies optimal cut points based on visual content changes
- Emotional Arc Analysis: Tracks emotional progression through dialogue, speech patterns, and visual cues
- Multi-Modal Content Understanding: Combines video, audio, and text analysis for comprehensive insights
- Real-Time Processing: Advanced UI with progress tracking and live status updates
- Professional Workflow Integration: Export suggestions compatible with major editing software
- Content Creators: YouTube creators, social media producers, and digital marketers
- Filmmakers: Independent filmmakers, documentary producers, and video artists
- Editors: Professional video editors looking to accelerate their workflow
- Educators: Film students and video production instructors
- Developers: AI researchers and developers interested in multimodal content analysis
- Visual Content Analysis: Uses computer vision to identify scene boundaries
- Statistical Methods: Histogram-based and pixel difference analysis
- AI-Enhanced Detection: CLIP model integration for semantic scene understanding
- Adaptive Thresholds: Configurable sensitivity for different content types
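The histogram-based method above can be sketched in a few lines of OpenCV. The function below is illustrative only — the threshold, bin counts, and helper name are assumptions, not VideoCraft's actual `scene_detector.py` API:

```python
# Hypothetical sketch of histogram-based scene-boundary scoring with OpenCV.
import cv2
import numpy as np

def histogram_scene_scores(video_path: str, frame_skip: int = 5, threshold: float = 0.5):
    """Return (timestamp, score) pairs where the colour histogram changes sharply."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    prev_hist, boundaries, frame_idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % frame_skip == 0:
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
            cv2.normalize(hist, hist)
            if prev_hist is not None:
                # Bhattacharyya distance: 0 = identical frames, 1 = completely different
                score = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
                if score > threshold:
                    boundaries.append((frame_idx / fps, float(score)))
            prev_hist = hist
        frame_idx += 1
    cap.release()
    return boundaries
```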
- Speech Emotion Recognition: Real-time emotion detection in spoken dialogue
- Speaker Diarization: Automatic identification of speaker changes
- Audio Energy Analysis: Rhythm and pacing analysis for cut timing
- Music Synchronization: Beat detection for music-synced cuts
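A hedged sketch of the energy and beat analysis above using librosa; the helper name and the "quiet moment" percentile heuristic are assumptions rather than the actual `audio_analyzer.py` implementation:

```python
# Illustrative beat and RMS-energy analysis for cut timing.
import librosa
import numpy as np

def audio_cut_candidates(audio_path: str, sample_rate: int = 16000) -> dict:
    y, sr = librosa.load(audio_path, sr=sample_rate, mono=True)

    # Beat positions (seconds) for music-synced cuts
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)

    # Short-term RMS energy; low-energy regions mark natural pauses for cuts
    rms = librosa.feature.rms(y=y)[0]
    rms_times = librosa.times_like(rms, sr=sr)
    quiet_times = rms_times[rms < np.percentile(rms, 10)]

    return {
        "tempo_bpm": float(np.atleast_1d(tempo)[0]),
        "beat_times": beat_times,
        "quiet_times": quiet_times,
    }
```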
- Natural Language Processing: Advanced sentiment and emotion analysis
- Dialogue Parsing: Support for multiple subtitle formats (SRT, VTT, ASS)
- Emotional Beat Detection: Identifies dramatic moments and pacing changes
- Timeline Synchronization: Automatic alignment of script to video timeline
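Timeline alignment starts with parsing the subtitle timestamps. A minimal SRT-parsing sketch is shown below (illustrative only — the real `script_parser.py` also handles VTT/ASS and richer metadata):

```python
# Minimal SRT parser: returns dialogue entries aligned to the video timeline.
import re
from datetime import timedelta

TIME_RE = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def parse_timestamp(ts: str) -> float:
    h, m, s, ms = map(int, TIME_RE.match(ts).groups())
    return timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms).total_seconds()

def parse_srt(path: str) -> list[dict]:
    """Return a list of {'start', 'end', 'text'} dicts in seconds."""
    entries = []
    with open(path, encoding="utf-8") as fh:
        blocks = fh.read().strip().split("\n\n")
    for block in blocks:
        lines = block.strip().splitlines()
        if len(lines) >= 3 and "-->" in lines[1]:
            start, end = [parse_timestamp(t.strip()) for t in lines[1].split("-->")]
            entries.append({"start": start, "end": end, "text": " ".join(lines[2:])})
    return entries
```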
- Multi-Format Support: Video (MP4, AVI, MOV, MKV), Audio (MP3, WAV, AAC), Scripts (TXT, SRT, VTT)
- Large File Handling: Support for video files up to 2GB, audio up to 500MB
- Real-Time Validation: Instant feedback on file types and sizes
- Visual Feedback: Animated upload areas with hover effects
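A small sketch of how the size and format checks above might look; the limits mirror the values in this README, and the constant and function names are assumptions rather than the actual `file_handler.py` API:

```python
# Hedged sketch of upload validation for instant UI feedback.
from pathlib import Path

MAX_SIZE_MB = {"video": 2048, "audio": 500, "script": 0.5}
SUPPORTED_EXT = {
    "video": {".mp4", ".avi", ".mov", ".mkv", ".webm"},
    "audio": {".mp3", ".wav", ".aac", ".flac"},
    "script": {".txt", ".srt", ".vtt", ".ass"},
}

def validate_upload(path: str, kind: str) -> tuple[bool, str]:
    """Return (ok, message) so the UI can show instant feedback."""
    file = Path(path)
    if file.suffix.lower() not in SUPPORTED_EXT[kind]:
        return False, f"Unsupported {kind} format: {file.suffix}"
    size_mb = file.stat().st_size / (1024 * 1024)
    if size_mb > MAX_SIZE_MB[kind]:
        return False, f"{kind.title()} file is {size_mb:.0f} MB (limit {MAX_SIZE_MB[kind]} MB)"
    return True, "OK"
```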
- Step-by-Step Breakdown: Detailed progress indicators for each processing stage
- Visual Progress Bars: Gradient-styled progress indicators with animations
- Status Updates: Live text updates describing current operations
- Error Handling: Comprehensive error reporting with recovery suggestions
- Multi-Criteria Filtering: Filter by confidence, type, time range, emotions
- Priority-Based Sorting: Sort suggestions by importance and relevance
- Interactive Controls: Real-time filter application with immediate results
- Custom Presets: Save and load filter configurations
- Automatic Thumbnail Generation: Video previews generated at optimal timestamps
- Hover Effects: Interactive thumbnail previews with smooth animations
- Responsive Design: Optimized for desktop, tablet, and mobile viewing
- Fast Loading: Efficient thumbnail caching and compression
- Multi-Track Display: Separate tracks for video, audio, and script analysis
- Zoom & Pan: Detailed timeline navigation with smooth interactions
- Suggestion Overlay: Visual markers for all AI-generated suggestions
- Export Integration: Direct export to popular editing software formats
- Multiple File Processing: Analyze multiple videos in sequence
- Format Conversion: Export suggestions in various formats (XML, EDL, CSV)
- Metadata Preservation: Maintain all analysis data and confidence scores
- Integration APIs: Direct integration with Adobe Premiere, DaVinci Resolve, Final Cut Pro
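Of the export formats listed above, CSV is the simplest to illustrate. The helper below is a sketch only — the field names are assumptions, not the exact schema VideoCraft writes:

```python
# Illustrative CSV export of suggestions with confidence scores.
import csv

def export_suggestions_csv(suggestions: list[dict], out_path: str) -> None:
    """Write cut/transition suggestions, sorted by timestamp, to a CSV file."""
    fields = ["timestamp", "type", "confidence", "reason"]
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for suggestion in sorted(suggestions, key=lambda s: s.get("timestamp", 0)):
            writer.writerow(suggestion)
```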
- Model: `openai/clip-vit-base-patch32`
- Purpose: Visual content understanding and scene analysis
- Capabilities:
- Scene boundary detection through visual similarity analysis
- Object and activity recognition in video frames
- Semantic understanding of visual content
- Cross-modal video-text matching
- Technical Details:
- Vision Transformer (ViT) architecture with 32x32 patch size
- 151M parameters optimized for real-time inference
- Multi-scale feature extraction for temporal analysis
- GPU acceleration with automatic CPU fallback
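A minimal sketch of using this checkpoint for frame-to-frame semantic similarity via Hugging Face transformers; how VideoCraft thresholds the similarity internally is not shown here:

```python
# Frame-to-frame CLIP similarity: a sharp drop suggests a scene change.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def frame_similarity(frame_a: Image.Image, frame_b: Image.Image) -> float:
    """Cosine similarity of CLIP image embeddings for two video frames."""
    inputs = processor(images=[frame_a, frame_b], return_tensors="pt").to(device)
    embeddings = model.get_image_features(**inputs)
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return float(embeddings[0] @ embeddings[1])
```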
- Model: `j-hartmann/emotion-english-distilroberta-base`
- Purpose: Analyze emotional content in scripts and dialogue
- Capabilities:
- Multi-class emotion classification (Anger, Disgust, Fear, Joy, Neutral, Sadness, Surprise)
- Confidence scoring for emotional predictions
- Contextual understanding of emotional transitions
- Tuned for English-language dialogue and subtitles
- Technical Details:
- DistilRoBERTa architecture for efficient processing
- Fine-tuned on comprehensive emotion datasets
- 82M parameters with optimized inference pipeline
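The checkpoint can be exercised directly with the transformers pipeline API; the helper below is illustrative:

```python
# Score every emotion class for a line of dialogue.
from transformers import pipeline

emotion_classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    top_k=None,  # return a score for every emotion class
)

def text_emotions(line: str) -> dict[str, float]:
    """Map each emotion label (anger, joy, sadness, ...) to its confidence score."""
    scores = emotion_classifier([line])[0]
    return {item["label"]: round(item["score"], 3) for item in scores}

print(text_emotions("I can't believe we finally made it!"))
```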
- Model: `ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition`
- Purpose: Detect emotions from speech audio
- Capabilities:
- Real-time emotion detection from audio waveforms
- Speaker-independent emotion recognition
- Temporal emotion tracking throughout speech segments
- Robustness to background noise and audio quality variations
- Technical Details:
- Wav2Vec2 architecture with cross-lingual speech representations
- 317M parameters trained on diverse speech datasets
- Support for multiple audio formats and sample rates
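A hedged sketch of chunk-wise speech-emotion inference with this checkpoint through the generic audio-classification pipeline; the chunk length and helper name are assumptions:

```python
# Classify the dominant emotion per fixed-length chunk to track it through the speech.
import librosa
from transformers import pipeline

speech_emotion = pipeline(
    "audio-classification",
    model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition",
)

def emotions_over_time(audio_path: str, chunk_seconds: int = 10) -> list[dict]:
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    step = chunk_seconds * sr
    results = []
    for start in range(0, len(y), step):
        chunk = y[start:start + step]
        top = speech_emotion({"raw": chunk, "sampling_rate": sr})[0]
        results.append({"start": start / sr, "emotion": top["label"], "score": top["score"]})
    return results
```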
- Model: `trpakov/vit-face-expression`
- Purpose: Analyze facial expressions in video content
- Capabilities:
- Real-time facial emotion detection from video frames
- Multi-face tracking and expression analysis
- Temporal consistency in emotion tracking
- Integration with scene change detection
- Technical Details:
- Vision Transformer adapted for facial expression recognition
- Optimized for diverse lighting conditions and face orientations
- Efficient processing with face detection pipeline
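A minimal usage sketch for this checkpoint via the image-classification pipeline; the face detection and cropping that VideoCraft performs upstream is omitted here:

```python
# Top expression label for a pre-cropped face image.
from PIL import Image
from transformers import pipeline

face_expression = pipeline("image-classification", model="trpakov/vit-face-expression")

def frame_expression(face_crop: Image.Image) -> dict:
    """Return the most likely expression and its confidence for one face crop."""
    top = face_expression(face_crop)[0]
    return {"emotion": top["label"], "score": top["score"]}
```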
- Framework: spaCy with custom emotion models
- Purpose: Comprehensive text analysis and processing
- Capabilities:
- Named entity recognition for character identification
- Dependency parsing for dialogue structure analysis
- Sentiment analysis with fine-grained scoring
- Temporal relationship extraction from scripts
- Technical Details:
- Multi-language support with optimized tokenizers
- Custom training pipelines for domain-specific terminology
- Integration with external knowledge bases
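A small spaCy sketch of the entity and dependency analysis described above; the returned fields are illustrative, not the project's actual output schema:

```python
# Extract characters and the main action from a dialogue/scene line.
import spacy

nlp = spacy.load("en_core_web_sm")

def analyze_dialogue_line(line: str) -> dict:
    doc = nlp(line)
    return {
        "characters": [ent.text for ent in doc.ents if ent.label_ == "PERSON"],
        "root_verb": next((tok.lemma_ for tok in doc if tok.dep_ == "ROOT"), None),
        "token_count": len(doc),
    }

print(analyze_dialogue_line("Sarah storms out after John refuses to apologise."))
```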
- Histogram-Based Analysis: Statistical comparison of color distributions
- Optical Flow Detection: Motion vector analysis for scene transitions
- Edge Detection: Structural change analysis using Sobel and Canny filters
- Deep Learning Integration: CLIP-based semantic scene understanding
- Adaptive Sampling: Intelligent frame selection for optimal processing speed
- Multi-Resolution Analysis: Process at multiple scales for different features
- Temporal Consistency: Ensure smooth transitions and avoid false positives
- Memory Optimization: Efficient buffer management for large video files
- Spectral Analysis: MFCC, chroma, and spectral centroid extraction
- Temporal Features: Zero-crossing rate, energy, and rhythm analysis
- Harmonic Analysis: Pitch tracking and harmonic content analysis
- Noise Reduction: Advanced filtering for clean audio analysis
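These spectral and temporal features map directly onto librosa calls; the compact sketch below averages each feature over time purely for brevity:

```python
# Illustrative spectral/temporal feature extraction with librosa.
import librosa
import numpy as np

def extract_audio_features(audio_path: str, sample_rate: int = 16000) -> dict:
    y, sr = librosa.load(audio_path, sr=sample_rate, mono=True)
    return {
        "mfcc_mean": np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13), axis=1),
        "chroma_mean": np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1),
        "spectral_centroid": float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))),
        "zero_crossing_rate": float(np.mean(librosa.feature.zero_crossing_rate(y))),
        "rms_energy": float(np.mean(librosa.feature.rms(y=y))),
    }
```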
- Voice Activity Detection: Accurate speech/non-speech segmentation
- Speaker Clustering: Unsupervised clustering of speaker embeddings
- Change Point Detection: Precise identification of speaker transitions
- Confidence Scoring: Reliability metrics for speaker change suggestions
- Python: Version 3.11 or higher
- Operating System: Windows 10/11, macOS 10.15+, or Linux Ubuntu 18.04+
- Memory: Minimum 8GB RAM (16GB recommended for large videos)
- Storage: At least 5GB free space for models and cache
- GPU: Optional but recommended (CUDA-compatible GPU for acceleration)
# Clone the repository
git clone https://github.com/smirk-dev/VideoCraft.git
cd VideoCraft
# Run automated setup script (Windows)
.\setup_venv.ps1
# For macOS/Linux
chmod +x setup_venv.sh
./setup_venv.sh

# Create virtual environment
python -m venv videocraft_env
# Activate virtual environment
# Windows
.\videocraft_env\Scripts\Activate.ps1
# macOS/Linux
source videocraft_env/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run setup script
python setup.py

# Build Docker image
docker build -t videocraft .
# Run container with GPU support
docker run --gpus all -p 8501:8501 videocraft
# Or CPU-only
docker run -p 8501:8501 videocraft

# AI/ML Framework
torch>=2.6.0 # PyTorch for deep learning
torchvision>=0.23.0 # Computer vision utilities
transformers>=4.48.0 # Hugging Face transformers
sentence-transformers==3.0.1 # Sentence embeddings
# Video/Audio Processing
opencv-python-headless==4.10.0.84 # Computer vision
moviepy==1.0.3 # Video processing
librosa==0.10.2 # Audio analysis
scenedetect==0.6.4 # Scene detection
pydub==0.25.1 # Audio manipulation
# NLP & Text Processing
spacy>=3.8.0 # Natural language processing
nltk>=3.9.1 # Text analysis toolkit
# Data Science
pandas>=2.2.3 # Data manipulation
numpy>=2.0.2,<2.2.0 # Numerical computing
scikit-learn==1.5.1 # Machine learning utilities
scipy>=1.11.0 # Scientific computing
# UI Framework
streamlit==1.36.0 # Web application framework
plotly==5.22.0 # Interactive visualizations
streamlit-timeline==0.0.2 # Timeline components

- PyTorch ≥2.6.0: Required for the CVE-2025-32434 security fix
- NumPy ≥2.0.2: Compatibility with latest PyTorch versions
- Transformers ≥4.48.0: Latest security and performance updates
# Create .env file from template
cp .env.example .env
# Configure settings
VIDEOCRAFT_MODEL_CACHE_DIR=./models
VIDEOCRAFT_MAX_VIDEO_SIZE=2048 # MB
VIDEOCRAFT_ENABLE_GPU=true
VIDEOCRAFT_LOG_LEVEL=INFO

# config.yaml
models:
clip:
model_name: "openai/clip-vit-base-patch32"
device: "auto" # auto, cpu, cuda
cache_dir: "./models"
emotion:
text_model: "j-hartmann/emotion-english-distilroberta-base"
speech_model: "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
processing:
video:
max_fps: 30
frame_skip: 5
max_resolution: [1920, 1080]
audio:
sample_rate: 16000
chunk_duration: 10 # seconds
suggestions:
confidence_threshold: 0.7
max_suggestions: 50

# .streamlit/config.toml
[server]
maxUploadSize = 2048 # 2GB
maxMessageSize = 2048
fileWatcherType = "auto"
[theme]
primaryColor = "#667eea"
backgroundColor = "#ffffff"
secondaryBackgroundColor = "#f8f9ff"
textColor = "#2c3e50"# Activate virtual environment if not already active
source videocraft_env/bin/activate # macOS/Linux
# or
.\videocraft_env\Scripts\Activate.ps1 # Windows
# Start the application
streamlit run main.py

The application will open in your default web browser at http://localhost:8501.
- Supported Formats: MP4, AVI, MOV, MKV, WebM
- Maximum Size: 2GB (configurable)
- Optimal Settings: 1080p resolution, 30fps or less
- Upload Methods:
- Drag and drop into the designated area
- Click "Browse files" button
- Paste file path directly
- Supported Formats: TXT, SRT, VTT, ASS
- Maximum Size: 500KB
- Content: Dialogue, subtitles, or scene descriptions
- Encoding: UTF-8 recommended
- Supported Formats: MP3, WAV, AAC, FLAC
- Maximum Size: 500MB
- Use Cases: Separate audio track or music overlay
- Analyze Video Content: Enable visual scene detection and content analysis
- Analyze Audio: Enable speech emotion and speaker change detection
- Analyze Script: Enable text-based emotion and dialogue analysis
- Emotion Change Threshold (0.1-1.0): How pronounced an emotional change must be to trigger a cut suggestion
- Scene Change Threshold (0.1-1.0): Sensitivity to visual scene transitions
- Minimum Cut Interval (1-30 seconds): Minimum time between suggested cuts
The processing pipeline includes:
- File Preparation: Upload validation and format conversion
- Content Analysis: Video, audio, and script processing
- AI Model Inference: Emotion detection and scene analysis
- Suggestion Generation: Cut and transition recommendations
- Results Compilation: Confidence scoring and ranking
- Visual Progress Bar: Shows completion percentage with gradient styling
- Step-by-Step Updates: Detailed description of current processing stage
- Time Estimation: Approximate remaining processing time
- Error Recovery: Automatic retry and fallback mechanisms
- Range Slider: Set minimum and maximum confidence levels
- Dynamic Updates: Real-time filtering as you adjust settings
- Confidence Scores: Detailed breakdown of model certainty
- Suggestion Types: Filter by scene changes, emotion beats, speaker changes
- Time Range: Focus on specific segments of your video
- Emotion Filtering: Show only specific emotional content types
- Priority Levels: Filter by suggestion importance (High, Medium, Low)
- Speaker Changes Only: Show only suggestions at speaker transitions
- Music-Synced Cuts: Show only cuts aligned with musical elements
- Minimum Cut Length: Filter out suggestions too close together
- Preview Thumbnails: Toggle thumbnail displays for suggestions
- Zoom Controls: Zoom in/out for detailed timeline inspection
- Pan and Scroll: Navigate through long video timelines
- Jump to Time: Click on any point to jump to specific timestamps
- Multi-Track View: Separate lanes for different analysis types
- Selection Tools: Select individual or multiple suggestions
- Batch Operations: Apply actions to multiple suggestions at once
- Export Options: Save suggestions in various formats
- Integration: Direct export to editing software
- Visual Transitions: Cuts based on significant visual changes
- Confidence: Based on visual similarity analysis
- Metadata: Color histogram differences, motion vectors
- Timing: Precise frame-level timestamps
- Text Emotions: Cuts at emotional beats in dialogue
- Speech Emotions: Cuts based on vocal emotional changes
- Visual Emotions: Cuts based on facial expression changes
- Emotional Arc: Tracking emotional progression
- Dialogue Transitions: Automatic cuts at speaker changes
- Voice Recognition: Speaker identification and clustering
- Confidence Scoring: Reliability of speaker change detection
- Context Awareness: Consideration of dialogue flow
- Beat Detection: Cuts synchronized with musical beats
- Audio Energy: Cuts at high/low energy moments
- Silence Detection: Cuts at natural pauses
- Rhythm Analysis: Cuts that maintain pacing
- 90-100%: Extremely confident, almost certainly correct
- 70-89%: High confidence, likely good cut points
- 50-69%: Medium confidence, review recommended
- 30-49%: Low confidence, use with caution
- Below 30%: Very low confidence, manual review required
- Visual Clarity: Image quality and lighting conditions
- Audio Quality: Background noise and recording clarity
- Content Complexity: Number of subjects and scene complexity
- Model Certainty: Inherent model confidence in predictions
VideoCraft/
├── README.md                        # Comprehensive project documentation
├── requirements.txt                 # Python dependencies
├── config.yaml                      # Configuration settings
├── main.py                          # Main Streamlit application
├── setup.py                         # Installation and setup script
├── .env.example                     # Environment variables template
├── LICENSE                          # MIT license
├── ADVANCED_FEATURES.md             # Advanced features documentation
├── DEPENDENCY_RESOLUTION.md         # Dependency management guide
├── IMPLEMENTATION_COMPLETE.md       # Implementation status
├── PERFORMANCE_NOTES.md             # Performance optimization guide
├── UI_IMPROVEMENTS.md               # UI enhancement documentation
├── setup_venv.ps1                   # Windows setup script
│
├── .streamlit/                      # Streamlit configuration
│   └── config.toml                  # UI and server settings
│
├── src/                             # Source code directory
│   ├── __init__.py                  # Package initialization
│   │
│   ├── processors/                  # Content processing modules
│   │   ├── __init__.py
│   │   ├── video_analyzer.py        # Video content analysis
│   │   ├── audio_analyzer.py        # Audio processing and analysis
│   │   ├── script_parser.py         # Text and script analysis
│   │   └── scene_detector.py        # Scene boundary detection
│   │
│   ├── ai_models/                   # AI model implementations
│   │   ├── __init__.py
│   │   ├── emotion_detector.py      # Multi-modal emotion detection
│   │   ├── sentiment_analyzer.py    # Text sentiment analysis
│   │   └── visual_analyzer.py       # Computer vision models
│   │
│   ├── suggestions/                 # Suggestion generation engines
│   │   ├── __init__.py
│   │   ├── cut_suggester.py         # Cut point recommendation
│   │   └── transition_recommender.py # Transition suggestions
│   │
│   ├── ui/                          # User interface components
│   │   ├── __init__.py
│   │   ├── timeline_viewer.py       # Interactive timeline display
│   │   └── suggestion_panel.py      # Suggestion management UI
│   │
│   └── utils/                       # Utility functions
│       ├── __init__.py
│       ├── file_handler.py          # File operations and validation
│       └── timeline_sync.py         # Timeline synchronization
│
├── data/                            # Data storage directory
│   ├── cache/                       # Model and processing cache
│   └── output/                      # Generated outputs and exports
│
└── tests/                           # Test suite
    ├── __init__.py
    ├── test_processors.py           # Processor unit tests
    ├── test_ai_models.py            # AI model tests
    └── test_ui_components.py        # UI component tests
File Upload → Validation → Format Conversion → Temporary Storage
        ↓
Metadata Extraction → Quality Assessment → Processing Queue

Video Analysis:
Raw Video → Frame Extraction → Scene Detection → Feature Extraction → CLIP Analysis

Audio Analysis:
Audio Track → Preprocessing → Feature Extraction → Emotion Detection → Speaker Analysis

Script Analysis:
Text Input → Tokenization → NLP Processing → Emotion Analysis → Timeline Alignment

Parallel Processing:
├── Visual Content Understanding (CLIP)
├── Text Emotion Analysis (RoBERTa)
├── Speech Emotion Detection (Wav2Vec2)
└── Facial Expression Analysis (ViT)
        ↓
Feature Fusion → Temporal Alignment → Confidence Scoring

Multi-Modal Features → Cut Point Detection → Transition Recommendation
        ↓
Confidence Scoring → Ranking → Filtering → Export Formatting
class VideoAnalyzer:
"""Comprehensive video content analysis"""
def __init__(self, config):
self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
self.face_detector = FaceDetector()
self.config = config
def analyze_video_timeline(self, video_path):
"""Analyze video content across timeline"""
def detect_visual_emotions(self, frames):
"""Detect emotions from facial expressions"""
def extract_video_features(self, video_path):
"""Extract comprehensive video features"""class AudioAnalyzer:
"""Advanced audio processing and analysis"""
def __init__(self, config):
self.emotion_model = Wav2Vec2ForSequenceClassification.from_pretrained(
"ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)
self.speaker_detector = SpeakerDetector()
def analyze_speech_emotion(self, audio_path):
"""Detect emotions in speech"""
def detect_speaker_changes(self, audio_path):
"""Identify speaker transition points"""
def extract_audio_features(self, audio_path):
"""Extract spectral and temporal features"""class EmotionDetector:
"""Multi-modal emotion detection system"""
def __init__(self, config):
self.text_model = AutoModelForSequenceClassification.from_pretrained(
"j-hartmann/emotion-english-distilroberta-base"
)
self.speech_model = Wav2Vec2ForSequenceClassification.from_pretrained(
"ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)
def detect_text_emotions(self, text):
"""Analyze emotions in text content"""
def detect_speech_emotions(self, audio):
"""Analyze emotions in speech audio"""
def fuse_emotion_predictions(self, text_emotions, speech_emotions):
"""Combine multi-modal emotion predictions"""graph TD
A[User Upload] โ B[File Validation]
B โ C[Content Extraction]
C โ D[Parallel Analysis]
D โ E[Video Analysis]
D โ F[Audio Analysis]
D โ G[Script Analysis]
E โ H[Feature Fusion]
F โ H
G โ H
H โ I[Suggestion Generation]
I โ J[Ranking & Filtering]
J โ K[UI Display]
K โ L[Export Options]
# Model Configuration
models:
clip:
model_name: "openai/clip-vit-base-patch32"
device: "auto" # auto, cpu, cuda, mps
torch_dtype: "float32"
cache_dir: "./models/clip"
max_batch_size: 8
emotion:
text_model: "j-hartmann/emotion-english-distilroberta-base"
speech_model: "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
face_model: "trpakov/vit-face-expression"
cache_dir: "./models/emotion"
nlp:
spacy_model: "en_core_web_sm"
sentiment_model: "cardiffnlp/twitter-roberta-base-sentiment-latest"
# Processing Configuration
processing:
video:
max_fps: 30
frame_skip: 5 # Process every 5th frame
max_resolution: [1920, 1080]
thumbnail_timestamp: 1.0 # Generate thumbnail at 1 second
scene_detection_method: "combined" # histogram, optical_flow, combined
audio:
sample_rate: 16000
chunk_duration: 10 # seconds
overlap_duration: 1 # seconds
noise_reduction: true
normalize_audio: true
script:
max_text_length: 10000 # characters
emotion_confidence_threshold: 0.5
speaker_detection_method: "regex" # regex, nlp, manual
# Suggestion Configuration
suggestions:
cut_suggestions:
confidence_threshold: 0.7
minimum_confidence: 0.3
max_suggestions: 50
min_cut_interval: 2.0 # seconds
emotion_change_threshold: 0.6
scene_change_threshold: 0.5
transition_suggestions:
enable_smart_transitions: true
transition_types: ["cut", "fade", "dissolve", "wipe"]
context_awareness: true
music_sync: true
# File Handling
file_limits:
video_max_size: 2048 # MB
audio_max_size: 500 # MB
script_max_size: 0.5 # MB
supported_video_formats: ["mp4", "avi", "mov", "mkv", "webm"]
supported_audio_formats: ["mp3", "wav", "aac", "flac"]
supported_script_formats: ["txt", "srt", "vtt", "ass"]
# UI Configuration
ui:
theme: "professional" # professional, creative, minimal
progress_animations: true
thumbnail_previews: true
real_time_updates: true
advanced_filters: true
# Performance Configuration
performance:
use_gpu: true
mixed_precision: true
model_parallelism: false
batch_processing: true
cache_embeddings: true
prefetch_models: true
# Logging Configuration
logging:
level: "INFO" # DEBUG, INFO, WARNING, ERROR
log_file: "videocraft.log"
log_rotation: true
max_log_size: "10MB"
backup_count: 5

[global]
developmentMode = false
[server]
# File upload limits
maxUploadSize = 2048 # 2GB in MB
maxMessageSize = 2048
enableCORS = false
enableXsrfProtection = true
# Performance settings
runOnSave = true
fileWatcherType = "auto"
headless = false
[browser]
serverAddress = "localhost"
serverPort = 8501
gatherUsageStats = false
[theme]
# Professional color scheme
primaryColor = "#667eea" # Primary accent color
backgroundColor = "#ffffff" # Main background
secondaryBackgroundColor = "#f8f9ff" # Secondary background
textColor = "#2c3e50" # Text color
# Advanced theming
font = "sans serif" # Font family
base = "light" # Base theme (light/dark)
[client]
showErrorDetails = true
toolbarMode = "minimal"# Application Settings
VIDEOCRAFT_ENV=production
VIDEOCRAFT_DEBUG=false
VIDEOCRAFT_LOG_LEVEL=INFO
# Model Configuration
VIDEOCRAFT_MODEL_CACHE_DIR=./models
VIDEOCRAFT_USE_GPU=true
VIDEOCRAFT_MIXED_PRECISION=true
# File Settings
VIDEOCRAFT_MAX_VIDEO_SIZE=2048
VIDEOCRAFT_MAX_AUDIO_SIZE=500
VIDEOCRAFT_TEMP_DIR=./temp
# API Keys (if using external services)
HUGGINGFACE_API_KEY=your_hf_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
# Database (for enterprise features)
DATABASE_URL=sqlite:///videocraft.db
# Security
SECRET_KEY=your-secret-key-here
ALLOWED_HOSTS=localhost,127.0.0.1
# Performance
MAX_WORKERS=4
BATCH_SIZE=8
CACHE_TTL=3600

- Processing Options: Toggle analysis types in real-time
- Sensitivity Sliders: Adjust thresholds while processing
- Filter Controls: Modify result filtering dynamically
- Theme Selection: Switch between UI themes instantly
- Model Selection: Choose between different AI models
- Performance Tuning: Adjust batch sizes and GPU usage
- Export Formats: Configure output formats for different editors
- Integration Settings: Set up connections to external tools
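How these configuration layers combine is an implementation detail, but a minimal sketch of loading `config.yaml` and letting environment variables (for example from `.env`) override it could look like this; the key names follow the examples above, while the loader itself is an assumption:

```python
# Hedged sketch: config.yaml as the base, environment variables take precedence.
import os
import yaml

def load_config(path: str = "config.yaml") -> dict:
    with open(path, encoding="utf-8") as fh:
        config = yaml.safe_load(fh)
    if "VIDEOCRAFT_USE_GPU" in os.environ:
        config["performance"]["use_gpu"] = os.environ["VIDEOCRAFT_USE_GPU"].lower() == "true"
    if "VIDEOCRAFT_MAX_VIDEO_SIZE" in os.environ:
        config["file_limits"]["video_max_size"] = int(os.environ["VIDEOCRAFT_MAX_VIDEO_SIZE"])
    return config
```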
- Automatic Detection: Automatically detects and utilizes available GPUs
- Mixed Precision: Uses FP16 for faster inference with minimal accuracy loss
- Memory Management: Efficient GPU memory allocation and cleanup
- Fallback Support: Seamless CPU fallback when GPU unavailable
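A sketch of the device auto-detection and precision fallback described above, using PyTorch; the helper name is illustrative:

```python
# Map the config's "auto"/"cpu"/"cuda"/"mps" setting to an available device.
import torch

def resolve_device(preference: str = "auto") -> torch.device:
    if preference != "auto":
        return torch.device(preference)
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = resolve_device()
# Mixed precision (FP16) only pays off on GPU; stay in FP32 on CPU
dtype = torch.float16 if device.type == "cuda" else torch.float32
```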
- Batch Processing: Process multiple frames/segments simultaneously
- Smart Sampling: Intelligent frame selection to reduce processing time
- Parallel Pipelines: Concurrent video, audio, and text processing
- Caching System: Cache model outputs to avoid reprocessing
- Streaming Processing: Process large files without loading entirely into memory
- Buffer Management: Efficient memory buffers for video frame processing
- Model Optimization: Quantized models for reduced memory usage
- Garbage Collection: Proactive memory cleanup during processing
Video Processing (1080p, 10 minutes):
├── Scene Detection: ~2-3 minutes
├── Audio Analysis: ~1-2 minutes
├── Text Processing: ~30 seconds
└── Total Pipeline: ~4-6 minutes

Model Loading Times:
├── CLIP Model: ~10-15 seconds
├── Emotion Models: ~5-10 seconds
├── NLP Models: ~3-5 seconds
└── Total Initialization: ~20-30 seconds

Memory Usage:
├── Base Application: ~500MB
├── CLIP Model: ~1.5GB
├── Emotion Models: ~800MB
├── Processing Buffers: ~1-2GB
└── Peak Usage: ~4-5GB
- File Size Limits: Optimized for files up to 2GB
- Concurrent Users: Single-user application optimized for desktop use
- Processing Queue: Sequential processing with progress tracking
- Resource Monitoring: Real-time memory and CPU usage tracking
- CPU: Intel Core i5 or AMD Ryzen 5 (4+ cores)
- RAM: 8GB (16GB recommended)
- Storage: 5GB free space (SSD recommended)
- GPU: Optional (Intel UHD, AMD Radeon, or NVIDIA GTX 1050+)
- CPU: Intel Core i7/i9 or AMD Ryzen 7/9 (8+ cores)
- RAM: 16GB+ DDR4
- Storage: 20GB+ SSD with high read/write speeds
- GPU: NVIDIA RTX 3060+ or equivalent with 8GB+ VRAM
- CPU: Intel Core i9 or AMD Ryzen 9 (16+ cores)
- RAM: 32GB+ DDR4/DDR5
- Storage: NVMe SSD with 50GB+ free space
- GPU: NVIDIA RTX 4070+ with 12GB+ VRAM
# Performance tuning in config.yaml
performance:
# GPU settings
use_gpu: true
mixed_precision: true
gpu_memory_fraction: 0.8
# Processing settings
batch_size: 16 # Increase for more GPU memory
num_workers: 8 # Match CPU cores
frame_skip: 3 # Process every 3rd frame for speed
# Caching settings
enable_model_cache: true
enable_result_cache: true
cache_size_limit: "10GB"# Install development dependencies
pip install -r requirements-dev.txt
# Additional development tools
pip install black flake8 pytest pytest-cov pre-commit

# Code formatting with Black
black src/ tests/
# Linting with flake8
flake8 src/ tests/
# Type checking with mypy
mypy src/

# Install pre-commit hooks
pre-commit install
# Run all hooks manually
pre-commit run --all-files

# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test modules
pytest tests/test_processors.py
pytest tests/test_ai_models.py

To enable fast test execution without heavy ML dependencies (torch, transformers, spaCy models, etc.), VideoCraft provides a lightweight mode that stubs expensive components.
Key points:
- Activated automatically during pytest via `tests/conftest.py` (sets `LIGHT_TEST_MODE=1`).
- Skips loading large models; provides simple stub analyzers & suggester implementations.
- Suitable for CI smoke tests and contributor quick feedback loops.
- Full functionality still requires installing all dependencies in `requirements.txt`.
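A minimal sketch of how such a conftest can set the flag before any heavy imports happen (illustrative only — the real `tests/conftest.py` may also register stub fixtures and skip markers):

```python
# tests/conftest.py — illustrative sketch, not the project's actual conftest.
import os

# Set before any src module is imported so model loading can be skipped
os.environ.setdefault("LIGHT_TEST_MODE", "1")
```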
Environment variable: `LIGHT_TEST_MODE=1`

Minimal test dependencies are listed in `requirements-test.txt`:
pip install -r requirements-test.txt
pytest

When you want to run the full stack (with real model inference):
pip install -r requirements.txt
pytest -m "not slow" # (If slow markers are later added)Optional validation that heavy mode works (example):
$env:LIGHT_TEST_MODE=0
pytest tests/test_final_integration.py -k advanced

If a test is intentionally skipped in lightweight mode (e.g., advanced analyzer coverage), it will be reported with a skip reason so you can decide when to run full mode.
Recommended CI matrix (conceptual):
- Job 1 (fast): install `requirements-test.txt`, run core & unit tests (under 1 minute)
- Job 2 (full, optional nightly): install `requirements.txt`, run integration & model tests
An example GitHub Actions snippet (illustrative only):
jobs:
fast-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install minimal deps
run: pip install -r requirements-test.txt
- name: Run fast tests
run: pytest -q
full-tests:
if: github.event_name == 'schedule'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install full deps
run: pip install -r requirements.txt
- name: Run integration tests
run: pytest -m "integration" -q || true # allow failures initiallyThis separation keeps contributor workflows fast while preserving deep coverage in scheduled runs.
# Example test file: tests/test_processors.py
import pytest
from src.processors.video_analyzer import VideoAnalyzer
class TestVideoAnalyzer:
@pytest.fixture
def analyzer(self):
config = {"models": {"clip": {"model_name": "openai/clip-vit-base-patch32"}}}
return VideoAnalyzer(config)
def test_video_analysis(self, analyzer):
# Test video analysis functionality
pass
def test_scene_detection(self, analyzer):
# Test scene detection
pass

- Create Model Class:
# src/ai_models/new_model.py
class NewModel:
def __init__(self, config):
self.model = load_pretrained_model()
self.config = config
def predict(self, input_data):
return self.model(input_data)

- Update Configuration:
# config.yaml
models:
new_model:
model_name: "huggingface/model-name"
device: "auto"- Integration:
# main.py or relevant processor
from src.ai_models.new_model import NewModel
new_model = NewModel(config)
predictions = new_model.predict(data)

- Create Processor Class:
# src/processors/new_processor.py
class NewProcessor:
def __init__(self, config):
self.config = config
def process(self, input_data):
# Processing logic
return processed_data

- Add to Main Pipeline:
# main.py
from src.processors.new_processor import NewProcessor
new_processor = NewProcessor(config)
results = new_processor.process(input_data)

- Create UI Module:
# src/ui/new_component.py
import streamlit as st
class NewComponent:
def __init__(self, config):
self.config = config
def render(self, data):
st.markdown("## New Component")
# Render component

- Integrate in Main UI:
# main.py
new_component = NewComponent(config)
new_component.render(data)

class VideoAnalyzer:
"""
Comprehensive video content analysis using computer vision models.
Methods:
analyze_video_timeline(video_path: str) -> Dict
detect_scenes(video_path: str, method: str) -> List[float]
extract_video_features(video_path: str) -> Dict
detect_visual_emotions(frames: List[np.ndarray]) -> List[Dict]
"""class AudioAnalyzer:
"""
Advanced audio processing and speech analysis.
Methods:
analyze_speech_emotion(audio_path: str) -> List[Dict]
detect_speaker_changes(audio_path: str) -> List[float]
extract_audio_features(audio_path: str) -> Dict
analyze_audio_energy(audio_path: str) -> List[float]
"""class EmotionDetector:
"""
Multi-modal emotion detection system.
Methods:
detect_text_emotions(text: str) -> Dict
detect_speech_emotions(audio: np.ndarray) -> Dict
analyze_facial_emotions(image: np.ndarray) -> Dict
fuse_emotions(text_em: Dict, speech_em: Dict, visual_em: Dict) -> Dict
"""# Build production image
docker build -f Dockerfile.prod -t videocraft:prod .
# Deploy with docker-compose
docker-compose -f docker-compose.prod.yml up -d
# Or use cloud deployment
# AWS: Use ECS or EKS
# Google Cloud: Use Cloud Run or GKE
# Azure: Use Container Instances or AKS

# config/production.yaml
performance:
use_gpu: true
mixed_precision: true
batch_size: 32
logging:
level: "WARNING"
log_file: "/var/log/videocraft.log"
security:
enable_auth: true
max_upload_size: 1024 # Reduced for production

- Bug Reports: Help us identify and fix issues
- Feature Requests: Suggest new functionality or improvements
- Code Contributions: Submit pull requests for bug fixes or features
- Documentation: Improve documentation and examples
- Testing: Add tests or improve test coverage
- Performance: Optimize algorithms and resource usage
- Fork the Repository: Create your own fork of the project
- Create Feature Branch: `git checkout -b feature/amazing-feature`
- Make Changes: Implement your feature or fix
- Add Tests: Ensure your changes are properly tested
- Update Documentation: Update relevant documentation
- Commit Changes: `git commit -m 'Add amazing feature'`
- Push to Branch: `git push origin feature/amazing-feature`
- Create Pull Request: Submit PR with detailed description
- Code Style: Follow PEP 8 and use Black for formatting
- Documentation: Include docstrings for all public methods
- Testing: Maintain >90% test coverage
- Performance: Consider performance impact of changes
- Compatibility: Ensure backward compatibility when possible
## Bug Report Template
**Description**: Brief description of the issue
**Steps to Reproduce**:
1. Step one
2. Step two
3. Step three
**Expected Behavior**: What should happen
**Actual Behavior**: What actually happens
**Environment**:
- OS: [Windows/macOS/Linux]
- Python Version: [3.11.x]
- VideoCraft Version: [x.x.x]
- GPU: [NVIDIA RTX 3060 / None]
**Additional Context**: Any other relevant information

- Core Developers: Lead development and architecture
- Feature Contributors: Major feature implementations
- Bug Hunters: Significant bug reports and fixes
- Documentation Writers: Documentation improvements
- Community Helpers: Support in discussions and issues
Special thanks to:
- Hugging Face: For providing pre-trained models
- Streamlit: For the excellent UI framework
- PyTorch: For the deep learning foundation
- OpenAI: For the CLIP model
- Open Source Community: For inspiration and support
MIT License
Copyright (c) 2025 VideoCraft Development Team
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
- CLIP: MIT License (OpenAI)
- RoBERTa Emotion: Apache 2.0 License
- Wav2Vec2: MIT License (Facebook AI)
- ViT Face Expression: Apache 2.0 License
- PyTorch: BSD 3-Clause License
- Transformers: Apache 2.0 License
- Streamlit: Apache 2.0 License
- OpenCV: Apache 2.0 License
- Librosa: ISC License
- README: Comprehensive project overview (this document)
- API Documentation: Detailed API reference
- Tutorial Videos: Step-by-step usage guides
- FAQ: Frequently asked questions
- GitHub Issues: Bug reports and feature requests
- Discussions: General questions and community help
- Wiki: Community-maintained documentation
- Discord: Real-time chat support (coming soon)
- Consulting: Custom implementation and integration
- Training: Team training and workshops
- Enterprise: Enterprise licensing and support
- Project Maintainer: VideoCraft Development Team
- Email: [email protected]
- GitHub: @smirk-dev/VideoCraft
- Website: https://videocraft.ai
- Real-time Processing: Live video analysis during recording
- Advanced Transitions: AI-generated custom transition effects
- Multi-language Support: Support for non-English content
- Batch Processing: Process multiple videos simultaneously
- Cloud Integration: Cloud-based processing for large files
- Collaborative Editing: Multi-user editing and review
- Advanced AI Models: Latest transformer architectures
- Mobile App: iOS and Android companion apps
- Plugin System: Third-party plugin architecture
- API Service: RESTful API for integration
- Automated Editing: Full video editing with minimal input
- Style Transfer: Apply editing styles from reference videos
- Voice Cloning: Generate voiceovers and dubbing
- 3D Scene Understanding: Advanced spatial video analysis
- AR/VR Integration: Immersive editing experiences
- Processing Speed: 50% faster inference times
- Memory Usage: 30% reduction in memory footprint
- Model Accuracy: Improved confidence scores and accuracy
- User Experience: Enhanced UI responsiveness and feedback
- Compatibility: Support for more video formats and codecs