An intelligent video editing assistant that revolutionizes post-production workflows through AI-powered content analysis and automated suggestion generation
Quick Start • AI Models • Documentation • Development
- Project Overview
- Project Structure
- Core Features
- AI Models & Technologies
- Installation & Setup
- Usage Guide
- Architecture
- Configuration
- Development
- Testing
- Lightweight Test Mode
- Performance
- Contributing
- License
VideoCraft/
├── main.py                          # Main Streamlit application
├── config.yaml                      # Configuration settings
├── requirements.txt                 # Python dependencies
├── LICENSE                          # Project license
├── README.md                        # This file
│
├── src/                             # Core source code
│   ├── ai_models/                   # AI/ML models
│   │   ├── emotion_detector.py      # Emotion detection system
│   │   ├── sentiment_analyzer.py    # Text sentiment analysis
│   │   ├── visual_analyzer.py       # Computer vision models
│   │   └── user_learning_system.py  # Personalization AI
│   │
│   ├── processors/                  # Data processing modules
│   │   ├── audio_analyzer.py        # Audio analysis & processing
│   │   ├── scene_detector.py        # Scene change detection
│   │   ├── script_parser.py         # Script/subtitle parsing
│   │   └── video_analyzer.py        # Video content analysis
│   │
│   ├── suggestions/                 # AI suggestion engines
│   │   ├── cut_suggester.py         # Intelligent cut recommendations
│   │   └── transition_recommender.py # Transition suggestions
│   │
│   ├── ui/                          # User interface components
│   │   ├── suggestion_panel.py      # Suggestions display
│   │   └── timeline_viewer.py       # Timeline visualization
│   │
│   └── utils/                       # Utility modules
│       ├── cloud_integration.py     # Cloud processing support
│       ├── file_handler.py          # File I/O operations
│       └── timeline_sync.py         # Timeline synchronization
│
├── docs/                            # Documentation & reports
│   ├── reports/                     # Analysis & performance reports
│   ├── fixes/                       # Development fix logs
│   └── README.md                    # Documentation index
│
├── scripts/                         # Development & testing scripts
│   ├── setup/                       # Setup & installation utilities
│   ├── test_scripts/                # Testing & validation scripts
│   └── README.md                    # Scripts directory guide
│
└── tests/                           # Unit & integration tests
    └── test_*.py                    # Test modules
VideoCraft is a cutting-edge AI-powered video editing assistant designed to streamline and enhance the post-production workflow for content creators, filmmakers, and video editors. By leveraging state-of-the-art machine learning models, VideoCraft automatically analyzes video content, audio tracks, and scripts to provide intelligent suggestions for cuts, transitions, and editing decisions.
- Intelligent Scene Detection: Automatically identifies optimal cut points based on visual content changes
- Emotional Arc Analysis: Tracks emotional progression through dialogue, speech patterns, and visual cues
- Multi-Modal Content Understanding: Combines video, audio, and text analysis for comprehensive insights
- Real-Time Processing: Advanced UI with progress tracking and live status updates
- Professional Workflow Integration: Export suggestions compatible with major editing software
- Content Creators: YouTube creators, social media producers, and digital marketers
- Filmmakers: Independent filmmakers, documentary producers, and video artists
- Editors: Professional video editors looking to accelerate their workflow
- Educators: Film students and video production instructors
- Developers: AI researchers and developers interested in multimodal content analysis
- Visual Content Analysis: Uses computer vision to identify scene boundaries
- Statistical Methods: Histogram-based and pixel difference analysis
- AI-Enhanced Detection: CLIP model integration for semantic scene understanding
- Adaptive Thresholds: Configurable sensitivity for different content types
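The histogram-based method above can be sketched in a few lines of OpenCV. The function below is illustrative only — the threshold, bin counts, and helper name are assumptions, not VideoCraft's actual `scene_detector.py` API:

```python
# Hypothetical sketch of histogram-based scene-boundary scoring with OpenCV.
import cv2
import numpy as np

def histogram_scene_scores(video_path: str, frame_skip: int = 5, threshold: float = 0.5):
    """Return (timestamp, score) pairs where the colour histogram changes sharply."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    prev_hist, boundaries, frame_idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % frame_skip == 0:
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
            cv2.normalize(hist, hist)
            if prev_hist is not None:
                # Bhattacharyya distance: 0 = identical frames, 1 = completely different
                score = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
                if score > threshold:
                    boundaries.append((frame_idx / fps, float(score)))
            prev_hist = hist
        frame_idx += 1
    cap.release()
    return boundaries
```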
- Speech Emotion Recognition: Real-time emotion detection in spoken dialogue
- Speaker Diarization: Automatic identification of speaker changes
- Audio Energy Analysis: Rhythm and pacing analysis for cut timing
- Music Synchronization: Beat detection for music-synced cuts
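A hedged sketch of the energy and beat analysis above using librosa; the helper name and the "quiet moment" percentile heuristic are assumptions rather than the actual `audio_analyzer.py` implementation:

```python
# Illustrative beat and RMS-energy analysis for cut timing.
import librosa
import numpy as np

def audio_cut_candidates(audio_path: str, sample_rate: int = 16000) -> dict:
    y, sr = librosa.load(audio_path, sr=sample_rate, mono=True)

    # Beat positions (seconds) for music-synced cuts
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)

    # Short-term RMS energy; low-energy regions mark natural pauses for cuts
    rms = librosa.feature.rms(y=y)[0]
    rms_times = librosa.times_like(rms, sr=sr)
    quiet_times = rms_times[rms < np.percentile(rms, 10)]

    return {
        "tempo_bpm": float(np.atleast_1d(tempo)[0]),
        "beat_times": beat_times,
        "quiet_times": quiet_times,
    }
```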
- Natural Language Processing: Advanced sentiment and emotion analysis
- Dialogue Parsing: Support for multiple subtitle formats (SRT, VTT, ASS)
- Emotional Beat Detection: Identifies dramatic moments and pacing changes
- Timeline Synchronization: Automatic alignment of script to video timeline
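Timeline alignment starts with parsing the subtitle timestamps. A minimal SRT-parsing sketch is shown below (illustrative only — the real `script_parser.py` also handles VTT/ASS and richer metadata):

```python
# Minimal SRT parser: returns dialogue entries aligned to the video timeline.
import re
from datetime import timedelta

TIME_RE = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def parse_timestamp(ts: str) -> float:
    h, m, s, ms = map(int, TIME_RE.match(ts).groups())
    return timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms).total_seconds()

def parse_srt(path: str) -> list[dict]:
    """Return a list of {'start', 'end', 'text'} dicts in seconds."""
    entries = []
    with open(path, encoding="utf-8") as fh:
        blocks = fh.read().strip().split("\n\n")
    for block in blocks:
        lines = block.strip().splitlines()
        if len(lines) >= 3 and "-->" in lines[1]:
            start, end = [parse_timestamp(t.strip()) for t in lines[1].split("-->")]
            entries.append({"start": start, "end": end, "text": " ".join(lines[2:])})
    return entries
```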
- Multi-Format Support: Video (MP4, AVI, MOV, MKV), Audio (MP3, WAV, AAC), Scripts (TXT, SRT, VTT)
- Large File Handling: Support for video files up to 2GB, audio up to 500MB
- Real-Time Validation: Instant feedback on file types and sizes
- Visual Feedback: Animated upload areas with hover effects
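A small sketch of how the size and format checks above might look; the limits mirror the values in this README, and the constant and function names are assumptions rather than the actual `file_handler.py` API:

```python
# Hedged sketch of upload validation for instant UI feedback.
from pathlib import Path

MAX_SIZE_MB = {"video": 2048, "audio": 500, "script": 0.5}
SUPPORTED_EXT = {
    "video": {".mp4", ".avi", ".mov", ".mkv", ".webm"},
    "audio": {".mp3", ".wav", ".aac", ".flac"},
    "script": {".txt", ".srt", ".vtt", ".ass"},
}

def validate_upload(path: str, kind: str) -> tuple[bool, str]:
    """Return (ok, message) so the UI can show instant feedback."""
    file = Path(path)
    if file.suffix.lower() not in SUPPORTED_EXT[kind]:
        return False, f"Unsupported {kind} format: {file.suffix}"
    size_mb = file.stat().st_size / (1024 * 1024)
    if size_mb > MAX_SIZE_MB[kind]:
        return False, f"{kind.title()} file is {size_mb:.0f} MB (limit {MAX_SIZE_MB[kind]} MB)"
    return True, "OK"
```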
- Step-by-Step Breakdown: Detailed progress indicators for each processing stage
- Visual Progress Bars: Gradient-styled progress indicators with animations
- Status Updates: Live text updates describing current operations
- Error Handling: Comprehensive error reporting with recovery suggestions
- Multi-Criteria Filtering: Filter by confidence, type, time range, emotions
- Priority-Based Sorting: Sort suggestions by importance and relevance
- Interactive Controls: Real-time filter application with immediate results
- Custom Presets: Save and load filter configurations
- Automatic Thumbnail Generation: Video previews generated at optimal timestamps
- Hover Effects: Interactive thumbnail previews with smooth animations
- Responsive Design: Optimized for desktop, tablet, and mobile viewing
- Fast Loading: Efficient thumbnail caching and compression
- Multi-Track Display: Separate tracks for video, audio, and script analysis
- Zoom & Pan: Detailed timeline navigation with smooth interactions
- Suggestion Overlay: Visual markers for all AI-generated suggestions
- Export Integration: Direct export to popular editing software formats
- Multiple File Processing: Analyze multiple videos in sequence
- Format Conversion: Export suggestions in various formats (XML, EDL, CSV)
- Metadata Preservation: Maintain all analysis data and confidence scores
- Integration APIs: Direct integration with Adobe Premiere, DaVinci Resolve, Final Cut Pro
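Of the export formats listed above, CSV is the simplest to illustrate. The helper below is a sketch only — the field names are assumptions, not the exact schema VideoCraft writes:

```python
# Illustrative CSV export of suggestions with confidence scores.
import csv

def export_suggestions_csv(suggestions: list[dict], out_path: str) -> None:
    """Write cut/transition suggestions, sorted by timestamp, to a CSV file."""
    fields = ["timestamp", "type", "confidence", "reason"]
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for suggestion in sorted(suggestions, key=lambda s: s.get("timestamp", 0)):
            writer.writerow(suggestion)
```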
- Model: `openai/clip-vit-base-patch32`
- Purpose: Visual content understanding and scene analysis
- Capabilities:
- Scene boundary detection through visual similarity analysis
- Object and activity recognition in video frames
- Semantic understanding of visual content
- Cross-modal video-text matching
- Technical Details:
- Vision Transformer (ViT) architecture with 32x32 patch size
- 151M parameters optimized for real-time inference
- Multi-scale feature extraction for temporal analysis
- GPU acceleration with automatic CPU fallback
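A minimal sketch of using this checkpoint for frame-to-frame semantic similarity via Hugging Face transformers; how VideoCraft thresholds the similarity internally is not shown here:

```python
# Frame-to-frame CLIP similarity: a sharp drop suggests a scene change.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def frame_similarity(frame_a: Image.Image, frame_b: Image.Image) -> float:
    """Cosine similarity of CLIP image embeddings for two video frames."""
    inputs = processor(images=[frame_a, frame_b], return_tensors="pt").to(device)
    embeddings = model.get_image_features(**inputs)
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return float(embeddings[0] @ embeddings[1])
```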
- Model: `j-hartmann/emotion-english-distilroberta-base`
- Purpose: Analyze emotional content in scripts and dialogue
- Capabilities:
- Multi-class emotion classification (Anger, Disgust, Fear, Joy, Neutral, Sadness, Surprise)
- Confidence scoring for emotional predictions
- Contextual understanding of emotional transitions
- Tuned for English-language dialogue and subtitles
- Technical Details:
- DistilRoBERTa architecture for efficient processing
- Fine-tuned on comprehensive emotion datasets
- 82M parameters with optimized inference pipeline
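The checkpoint can be exercised directly with the transformers pipeline API; the helper below is illustrative:

```python
# Score every emotion class for a line of dialogue.
from transformers import pipeline

emotion_classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    top_k=None,  # return a score for every emotion class
)

def text_emotions(line: str) -> dict[str, float]:
    """Map each emotion label (anger, joy, sadness, ...) to its confidence score."""
    scores = emotion_classifier([line])[0]
    return {item["label"]: round(item["score"], 3) for item in scores}

print(text_emotions("I can't believe we finally made it!"))
```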
- Model: `ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition`
- Purpose: Detect emotions from speech audio
- Capabilities:
- Real-time emotion detection from audio waveforms
- Speaker-independent emotion recognition
- Temporal emotion tracking throughout speech segments
- Robustness to background noise and audio quality variations
- Technical Details:
- Wav2Vec2 architecture with cross-lingual speech representations
- 317M parameters trained on diverse speech datasets
- Support for multiple audio formats and sample rates
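A hedged sketch of chunk-wise speech-emotion inference with this checkpoint through the generic audio-classification pipeline; the chunk length and helper name are assumptions:

```python
# Classify the dominant emotion per fixed-length chunk to track it through the speech.
import librosa
from transformers import pipeline

speech_emotion = pipeline(
    "audio-classification",
    model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition",
)

def emotions_over_time(audio_path: str, chunk_seconds: int = 10) -> list[dict]:
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    step = chunk_seconds * sr
    results = []
    for start in range(0, len(y), step):
        chunk = y[start:start + step]
        top = speech_emotion({"raw": chunk, "sampling_rate": sr})[0]
        results.append({"start": start / sr, "emotion": top["label"], "score": top["score"]})
    return results
```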
- Model: `trpakov/vit-face-expression`
- Purpose: Analyze facial expressions in video content
- Capabilities:
- Real-time facial emotion detection from video frames
- Multi-face tracking and expression analysis
- Temporal consistency in emotion tracking
- Integration with scene change detection
- Technical Details:
- Vision Transformer adapted for facial expression recognition
- Optimized for diverse lighting conditions and face orientations
- Efficient processing with face detection pipeline
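A minimal usage sketch for this checkpoint via the image-classification pipeline; the face detection and cropping that VideoCraft performs upstream is omitted here:

```python
# Top expression label for a pre-cropped face image.
from PIL import Image
from transformers import pipeline

face_expression = pipeline("image-classification", model="trpakov/vit-face-expression")

def frame_expression(face_crop: Image.Image) -> dict:
    """Return the most likely expression and its confidence for one face crop."""
    top = face_expression(face_crop)[0]
    return {"emotion": top["label"], "score": top["score"]}
```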
- Framework: spaCy with custom emotion models
- Purpose: Comprehensive text analysis and processing
- Capabilities:
- Named entity recognition for character identification
- Dependency parsing for dialogue structure analysis
- Sentiment analysis with fine-grained scoring
- Temporal relationship extraction from scripts
- Technical Details:
- Multi-language support with optimized tokenizers
- Custom training pipelines for domain-specific terminology
- Integration with external knowledge bases
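A small spaCy sketch of the entity and dependency analysis described above; the returned fields are illustrative, not the project's actual output schema:

```python
# Extract characters and the main action from a dialogue/scene line.
import spacy

nlp = spacy.load("en_core_web_sm")

def analyze_dialogue_line(line: str) -> dict:
    doc = nlp(line)
    return {
        "characters": [ent.text for ent in doc.ents if ent.label_ == "PERSON"],
        "root_verb": next((tok.lemma_ for tok in doc if tok.dep_ == "ROOT"), None),
        "token_count": len(doc),
    }

print(analyze_dialogue_line("Sarah storms out after John refuses to apologise."))
```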
- Histogram-Based Analysis: Statistical comparison of color distributions
- Optical Flow Detection: Motion vector analysis for scene transitions
- Edge Detection: Structural change analysis using Sobel and Canny filters
- Deep Learning Integration: CLIP-based semantic scene understanding
- Adaptive Sampling: Intelligent frame selection for optimal processing speed
- Multi-Resolution Analysis: Process at multiple scales for different features
- Temporal Consistency: Ensure smooth transitions and avoid false positives
- Memory Optimization: Efficient buffer management for large video files
- Spectral Analysis: MFCC, chroma, and spectral centroid extraction
- Temporal Features: Zero-crossing rate, energy, and rhythm analysis
- Harmonic Analysis: Pitch tracking and harmonic content analysis
- Noise Reduction: Advanced filtering for clean audio analysis
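These spectral and temporal features map directly onto librosa calls; the compact sketch below averages each feature over time purely for brevity:

```python
# Illustrative spectral/temporal feature extraction with librosa.
import librosa
import numpy as np

def extract_audio_features(audio_path: str, sample_rate: int = 16000) -> dict:
    y, sr = librosa.load(audio_path, sr=sample_rate, mono=True)
    return {
        "mfcc_mean": np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13), axis=1),
        "chroma_mean": np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1),
        "spectral_centroid": float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))),
        "zero_crossing_rate": float(np.mean(librosa.feature.zero_crossing_rate(y))),
        "rms_energy": float(np.mean(librosa.feature.rms(y=y))),
    }
```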
- Voice Activity Detection: Accurate speech/non-speech segmentation
- Speaker Clustering: Unsupervised clustering of speaker embeddings
- Change Point Detection: Precise identification of speaker transitions
- Confidence Scoring: Reliability metrics for speaker change suggestions
- Python: Version 3.11 or higher
- Operating System: Windows 10/11, macOS 10.15+, or Linux Ubuntu 18.04+
- Memory: Minimum 8GB RAM (16GB recommended for large videos)
- Storage: At least 5GB free space for models and cache
- GPU: Optional but recommended (CUDA-compatible GPU for acceleration)
# Clone the repository
git clone https://github.com/smirk-dev/VideoCraft.git
cd VideoCraft
# Run automated setup script (Windows)
.\setup_venv.ps1
# For macOS/Linux
chmod +x setup_venv.sh
./setup_venv.sh

# Create virtual environment
python -m venv videocraft_env
# Activate virtual environment
# Windows
.\videocraft_env\Scripts\Activate.ps1
# macOS/Linux
source videocraft_env/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run setup script
python setup.py

# Build Docker image
docker build -t videocraft .
# Run container with GPU support
docker run --gpus all -p 8501:8501 videocraft
# Or CPU-only
docker run -p 8501:8501 videocraft

# AI/ML Framework
torch>=2.6.0 # PyTorch for deep learning
torchvision>=0.23.0 # Computer vision utilities
transformers>=4.48.0 # Hugging Face transformers
sentence-transformers==3.0.1 # Sentence embeddings
# Video/Audio Processing
opencv-python-headless==4.10.0.84 # Computer vision
moviepy==1.0.3 # Video processing
librosa==0.10.2 # Audio analysis
scenedetect==0.6.4 # Scene detection
pydub==0.25.1 # Audio manipulation
# NLP & Text Processing
spacy>=3.8.0 # Natural language processing
nltk>=3.9.1 # Text analysis toolkit
# Data Science
pandas>=2.2.3 # Data manipulation
numpy>=2.0.2,<2.2.0 # Numerical computing
scikit-learn==1.5.1 # Machine learning utilities
scipy>=1.11.0 # Scientific computing
# UI Framework
streamlit==1.36.0 # Web application framework
plotly==5.22.0 # Interactive visualizations
streamlit-timeline==0.0.2 # Timeline components

- PyTorch ≥2.6.0: Required for the CVE-2025-32434 security fix
- NumPy ≥2.0.2: Compatibility with latest PyTorch versions
- Transformers ≥4.48.0: Latest security and performance updates
# Create .env file from template
cp .env.example .env
# Configure settings
VIDEOCRAFT_MODEL_CACHE_DIR=./models
VIDEOCRAFT_MAX_VIDEO_SIZE=2048 # MB
VIDEOCRAFT_ENABLE_GPU=true
VIDEOCRAFT_LOG_LEVEL=INFO

# config.yaml
models:
clip:
model_name: "openai/clip-vit-base-patch32"
device: "auto" # auto, cpu, cuda
cache_dir: "./models"
emotion:
text_model: "j-hartmann/emotion-english-distilroberta-base"
speech_model: "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
processing:
video:
max_fps: 30
frame_skip: 5
max_resolution: [1920, 1080]
audio:
sample_rate: 16000
chunk_duration: 10 # seconds
suggestions:
confidence_threshold: 0.7
max_suggestions: 50

# .streamlit/config.toml
[server]
maxUploadSize = 2048 # 2GB
maxMessageSize = 2048
fileWatcherType = "auto"
[theme]
primaryColor = "#667eea"
backgroundColor = "#ffffff"
secondaryBackgroundColor = "#f8f9ff"
textColor = "#2c3e50"# Activate virtual environment if not already active
source videocraft_env/bin/activate # macOS/Linux
# or
.\videocraft_env\Scripts\Activate.ps1 # Windows
# Start the application
streamlit run main.py

The application will open in your default web browser at http://localhost:8501.
- Supported Formats: MP4, AVI, MOV, MKV, WebM
- Maximum Size: 2GB (configurable)
- Optimal Settings: 1080p resolution, 30fps or less
- Upload Methods:
- Drag and drop into the designated area
- Click "Browse files" button
- Paste file path directly
- Supported Formats: TXT, SRT, VTT, ASS
- Maximum Size: 500KB
- Content: Dialogue, subtitles, or scene descriptions
- Encoding: UTF-8 recommended
- Supported Formats: MP3, WAV, AAC, FLAC
- Maximum Size: 500MB
- Use Cases: Separate audio track or music overlay
- Analyze Video Content: Enable visual scene detection and content analysis
- Analyze Audio: Enable speech emotion and speaker change detection
- Analyze Script: Enable text-based emotion and dialogue analysis
- Emotion Change Threshold (0.1-1.0): How pronounced an emotional change must be to trigger a cut suggestion
- Scene Change Threshold (0.1-1.0): Sensitivity to visual scene transitions
- Minimum Cut Interval (1-30 seconds): Minimum time between suggested cuts
The processing pipeline includes:
- File Preparation: Upload validation and format conversion
- Content Analysis: Video, audio, and script processing
- AI Model Inference: Emotion detection and scene analysis
- Suggestion Generation: Cut and transition recommendations
- Results Compilation: Confidence scoring and ranking
- Visual Progress Bar: Shows completion percentage with gradient styling
- Step-by-Step Updates: Detailed description of current processing stage
- Time Estimation: Approximate remaining processing time
- Error Recovery: Automatic retry and fallback mechanisms
- Range Slider: Set minimum and maximum confidence levels
- Dynamic Updates: Real-time filtering as you adjust settings
- Confidence Scores: Detailed breakdown of model certainty
- Suggestion Types: Filter by scene changes, emotion beats, speaker changes
- Time Range: Focus on specific segments of your video
- Emotion Filtering: Show only specific emotional content types
- Priority Levels: Filter by suggestion importance (High, Medium, Low)
- Speaker Changes Only: Show only suggestions at speaker transitions
- Music-Synced Cuts: Show only cuts aligned with musical elements
- Minimum Cut Length: Filter out suggestions too close together
- Preview Thumbnails: Toggle thumbnail displays for suggestions
- Zoom Controls: Zoom in/out for detailed timeline inspection
- Pan and Scroll: Navigate through long video timelines
- Jump to Time: Click on any point to jump to specific timestamps
- Multi-Track View: Separate lanes for different analysis types
- Selection Tools: Select individual or multiple suggestions
- Batch Operations: Apply actions to multiple suggestions at once
- Export Options: Save suggestions in various formats
- Integration: Direct export to editing software
- Visual Transitions: Cuts based on significant visual changes
- Confidence: Based on visual similarity analysis
- Metadata: Color histogram differences, motion vectors
- Timing: Precise frame-level timestamps
- Text Emotions: Cuts at emotional beats in dialogue
- Speech Emotions: Cuts based on vocal emotional changes
- Visual Emotions: Cuts based on facial expression changes
- Emotional Arc: Tracking emotional progression
- Dialogue Transitions: Automatic cuts at speaker changes
- Voice Recognition: Speaker identification and clustering
- Confidence Scoring: Reliability of speaker change detection
- Context Awareness: Consideration of dialogue flow
- Beat Detection: Cuts synchronized with musical beats
- Audio Energy: Cuts at high/low energy moments
- Silence Detection: Cuts at natural pauses
- Rhythm Analysis: Cuts that maintain pacing
- 90-100%: Extremely confident, almost certainly correct
- 70-89%: High confidence, likely good cut points
- 50-69%: Medium confidence, review recommended
- 30-49%: Low confidence, use with caution
- Below 30%: Very low confidence, manual review required
- Visual Clarity: Image quality and lighting conditions
- Audio Quality: Background noise and recording clarity
- Content Complexity: Number of subjects and scene complexity
- Model Certainty: Inherent model confidence in predictions
VideoCraft/
├── README.md                        # Comprehensive project documentation
├── requirements.txt                 # Python dependencies
├── config.yaml                      # Configuration settings
├── main.py                          # Main Streamlit application
├── setup.py                         # Installation and setup script
├── .env.example                     # Environment variables template
├── LICENSE                          # MIT license
├── ADVANCED_FEATURES.md             # Advanced features documentation
├── DEPENDENCY_RESOLUTION.md         # Dependency management guide
├── IMPLEMENTATION_COMPLETE.md       # Implementation status
├── PERFORMANCE_NOTES.md             # Performance optimization guide
├── UI_IMPROVEMENTS.md               # UI enhancement documentation
├── setup_venv.ps1                   # Windows setup script
│
├── .streamlit/                      # Streamlit configuration
│   └── config.toml                  # UI and server settings
│
├── src/                             # Source code directory
│   ├── __init__.py                  # Package initialization
│   │
│   ├── processors/                  # Content processing modules
│   │   ├── __init__.py
│   │   ├── video_analyzer.py        # Video content analysis
│   │   ├── audio_analyzer.py        # Audio processing and analysis
│   │   ├── script_parser.py         # Text and script analysis
│   │   └── scene_detector.py        # Scene boundary detection
│   │
│   ├── ai_models/                   # AI model implementations
│   │   ├── __init__.py
│   │   ├── emotion_detector.py      # Multi-modal emotion detection
│   │   ├── sentiment_analyzer.py    # Text sentiment analysis
│   │   └── visual_analyzer.py       # Computer vision models
│   │
│   ├── suggestions/                 # Suggestion generation engines
│   │   ├── __init__.py
│   │   ├── cut_suggester.py         # Cut point recommendation
│   │   └── transition_recommender.py # Transition suggestions
│   │
│   ├── ui/                          # User interface components
│   │   ├── __init__.py
│   │   ├── timeline_viewer.py       # Interactive timeline display
│   │   └── suggestion_panel.py      # Suggestion management UI
│   │
│   └── utils/                       # Utility functions
│       ├── __init__.py
│       ├── file_handler.py          # File operations and validation
│       └── timeline_sync.py         # Timeline synchronization
│
├── data/                            # Data storage directory
│   ├── cache/                       # Model and processing cache
│   └── output/                      # Generated outputs and exports
│
└── tests/                           # Test suite
    ├── __init__.py
    ├── test_processors.py           # Processor unit tests
    ├── test_ai_models.py            # AI model tests
    └── test_ui_components.py        # UI component tests
File Upload → Validation → Format Conversion → Temporary Storage
        ↓
Metadata Extraction → Quality Assessment → Processing Queue

Video Analysis:
Raw Video → Frame Extraction → Scene Detection → Feature Extraction → CLIP Analysis

Audio Analysis:
Audio Track → Preprocessing → Feature Extraction → Emotion Detection → Speaker Analysis

Script Analysis:
Text Input → Tokenization → NLP Processing → Emotion Analysis → Timeline Alignment

Parallel Processing:
├── Visual Content Understanding (CLIP)
├── Text Emotion Analysis (RoBERTa)
├── Speech Emotion Detection (Wav2Vec2)
└── Facial Expression Analysis (ViT)
        ↓
Feature Fusion → Temporal Alignment → Confidence Scoring

Multi-Modal Features → Cut Point Detection → Transition Recommendation
        ↓
Confidence Scoring → Ranking → Filtering → Export Formatting
class VideoAnalyzer:
"""Comprehensive video content analysis"""
def __init__(self, config):
self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
self.face_detector = FaceDetector()
self.config = config
def analyze_video_timeline(self, video_path):
"""Analyze video content across timeline"""
def detect_visual_emotions(self, frames):
"""Detect emotions from facial expressions"""
def extract_video_features(self, video_path):
"""Extract comprehensive video features"""class AudioAnalyzer:
"""Advanced audio processing and analysis"""
def __init__(self, config):
self.emotion_model = Wav2Vec2ForSequenceClassification.from_pretrained(
"ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)
self.speaker_detector = SpeakerDetector()
def analyze_speech_emotion(self, audio_path):
"""Detect emotions in speech"""
def detect_speaker_changes(self, audio_path):
"""Identify speaker transition points"""
def extract_audio_features(self, audio_path):
"""Extract spectral and temporal features"""class EmotionDetector:
"""Multi-modal emotion detection system"""
def __init__(self, config):
self.text_model = AutoModelForSequenceClassification.from_pretrained(
"j-hartmann/emotion-english-distilroberta-base"
)
self.speech_model = Wav2Vec2ForSequenceClassification.from_pretrained(
"ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)
def detect_text_emotions(self, text):
"""Analyze emotions in text content"""
def detect_speech_emotions(self, audio):
"""Analyze emotions in speech audio"""
def fuse_emotion_predictions(self, text_emotions, speech_emotions):
"""Combine multi-modal emotion predictions"""graph TD
A[User Upload] โ B[File Validation]
B โ C[Content Extraction]
C โ D[Parallel Analysis]
D โ E[Video Analysis]
D โ F[Audio Analysis]
D โ G[Script Analysis]
E โ H[Feature Fusion]
F โ H
G โ H
H โ I[Suggestion Generation]
I โ J[Ranking & Filtering]
J โ K[UI Display]
K โ L[Export Options]
# Model Configuration
models:
clip:
model_name: "openai/clip-vit-base-patch32"
device: "auto" # auto, cpu, cuda, mps
torch_dtype: "float32"
cache_dir: "./models/clip"
max_batch_size: 8
emotion:
text_model: "j-hartmann/emotion-english-distilroberta-base"
speech_model: "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
face_model: "trpakov/vit-face-expression"
cache_dir: "./models/emotion"
nlp:
spacy_model: "en_core_web_sm"
sentiment_model: "cardiffnlp/twitter-roberta-base-sentiment-latest"
# Processing Configuration
processing:
video:
max_fps: 30
frame_skip: 5 # Process every 5th frame
max_resolution: [1920, 1080]
thumbnail_timestamp: 1.0 # Generate thumbnail at 1 second
scene_detection_method: "combined" # histogram, optical_flow, combined
audio:
sample_rate: 16000
chunk_duration: 10 # seconds
overlap_duration: 1 # seconds
noise_reduction: true
normalize_audio: true
script:
max_text_length: 10000 # characters
emotion_confidence_threshold: 0.5
speaker_detection_method: "regex" # regex, nlp, manual
# Suggestion Configuration
suggestions:
cut_suggestions:
confidence_threshold: 0.7
minimum_confidence: 0.3
max_suggestions: 50
min_cut_interval: 2.0 # seconds
emotion_change_threshold: 0.6
scene_change_threshold: 0.5
transition_suggestions:
enable_smart_transitions: true
transition_types: ["cut", "fade", "dissolve", "wipe"]
context_awareness: true
music_sync: true
# File Handling
file_limits:
video_max_size: 2048 # MB
audio_max_size: 500 # MB
script_max_size: 0.5 # MB
supported_video_formats: ["mp4", "avi", "mov", "mkv", "webm"]
supported_audio_formats: ["mp3", "wav", "aac", "flac"]
supported_script_formats: ["txt", "srt", "vtt", "ass"]
# UI Configuration
ui:
theme: "professional" # professional, creative, minimal
progress_animations: true
thumbnail_previews: true
real_time_updates: true
advanced_filters: true
# Performance Configuration
performance:
use_gpu: true
mixed_precision: true
model_parallelism: false
batch_processing: true
cache_embeddings: true
prefetch_models: true
# Logging Configuration
logging:
level: "INFO" # DEBUG, INFO, WARNING, ERROR
log_file: "videocraft.log"
log_rotation: true
max_log_size: "10MB"
backup_count: 5

[global]
developmentMode = false
[server]
# File upload limits
maxUploadSize = 2048 # 2GB in MB
maxMessageSize = 2048
enableCORS = false
enableXsrfProtection = true
# Performance settings
runOnSave = true
fileWatcherType = "auto"
headless = false
[browser]
serverAddress = "localhost"
serverPort = 8501
gatherUsageStats = false
[theme]
# Professional color scheme
primaryColor = "#667eea" # Primary accent color
backgroundColor = "#ffffff" # Main background
secondaryBackgroundColor = "#f8f9ff" # Secondary background
textColor = "#2c3e50" # Text color
# Advanced theming
font = "sans serif" # Font family
base = "light" # Base theme (light/dark)
[client]
showErrorDetails = true
toolbarMode = "minimal"# Application Settings
VIDEOCRAFT_ENV=production
VIDEOCRAFT_DEBUG=false
VIDEOCRAFT_LOG_LEVEL=INFO
# Model Configuration
VIDEOCRAFT_MODEL_CACHE_DIR=./models
VIDEOCRAFT_USE_GPU=true
VIDEOCRAFT_MIXED_PRECISION=true
# File Settings
VIDEOCRAFT_MAX_VIDEO_SIZE=2048
VIDEOCRAFT_MAX_AUDIO_SIZE=500
VIDEOCRAFT_TEMP_DIR=./temp
# API Keys (if using external services)
HUGGINGFACE_API_KEY=your_hf_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
# Database (for enterprise features)
DATABASE_URL=sqlite:///videocraft.db
# Security
SECRET_KEY=your-secret-key-here
ALLOWED_HOSTS=localhost,127.0.0.1
# Performance
MAX_WORKERS=4
BATCH_SIZE=8
CACHE_TTL=3600

- Processing Options: Toggle analysis types in real-time
- Sensitivity Sliders: Adjust thresholds while processing
- Filter Controls: Modify result filtering dynamically
- Theme Selection: Switch between UI themes instantly
- Model Selection: Choose between different AI models
- Performance Tuning: Adjust batch sizes and GPU usage
- Export Formats: Configure output formats for different editors
- Integration Settings: Set up connections to external tools
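How these configuration layers combine is an implementation detail, but a minimal sketch of loading `config.yaml` and letting environment variables (for example from `.env`) override it could look like this; the key names follow the examples above, while the loader itself is an assumption:

```python
# Hedged sketch: config.yaml as the base, environment variables take precedence.
import os
import yaml

def load_config(path: str = "config.yaml") -> dict:
    with open(path, encoding="utf-8") as fh:
        config = yaml.safe_load(fh)
    if "VIDEOCRAFT_USE_GPU" in os.environ:
        config["performance"]["use_gpu"] = os.environ["VIDEOCRAFT_USE_GPU"].lower() == "true"
    if "VIDEOCRAFT_MAX_VIDEO_SIZE" in os.environ:
        config["file_limits"]["video_max_size"] = int(os.environ["VIDEOCRAFT_MAX_VIDEO_SIZE"])
    return config
```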
- Automatic Detection: Automatically detects and utilizes available GPUs
- Mixed Precision: Uses FP16 for faster inference with minimal accuracy loss
- Memory Management: Efficient GPU memory allocation and cleanup
- Fallback Support: Seamless CPU fallback when GPU unavailable
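A sketch of the device auto-detection and precision fallback described above, using PyTorch; the helper name is illustrative:

```python
# Map the config's "auto"/"cpu"/"cuda"/"mps" setting to an available device.
import torch

def resolve_device(preference: str = "auto") -> torch.device:
    if preference != "auto":
        return torch.device(preference)
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = resolve_device()
# Mixed precision (FP16) only pays off on GPU; stay in FP32 on CPU
dtype = torch.float16 if device.type == "cuda" else torch.float32
```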
- Batch Processing: Process multiple frames/segments simultaneously
- Smart Sampling: Intelligent frame selection to reduce processing time
- Parallel Pipelines: Concurrent video, audio, and text processing
- Caching System: Cache model outputs to avoid reprocessing
- Streaming Processing: Process large files without loading entirely into memory
- Buffer Management: Efficient memory buffers for video frame processing
- Model Optimization: Quantized models for reduced memory usage
- Garbage Collection: Proactive memory cleanup during processing
Video Processing (1080p, 10 minutes):
├── Scene Detection: ~2-3 minutes
├── Audio Analysis: ~1-2 minutes
├── Text Processing: ~30 seconds
└── Total Pipeline: ~4-6 minutes

Model Loading Times:
├── CLIP Model: ~10-15 seconds
├── Emotion Models: ~5-10 seconds
├── NLP Models: ~3-5 seconds
└── Total Initialization: ~20-30 seconds

Memory Usage:
├── Base Application: ~500MB
├── CLIP Model: ~1.5GB
├── Emotion Models: ~800MB
├── Processing Buffers: ~1-2GB
└── Peak Usage: ~4-5GB
- File Size Limits: Optimized for files up to 2GB
- Concurrent Users: Single-user application optimized for desktop use
- Processing Queue: Sequential processing with progress tracking
- Resource Monitoring: Real-time memory and CPU usage tracking
- CPU: Intel Core i5 or AMD Ryzen 5 (4+ cores)
- RAM: 8GB (16GB recommended)
- Storage: 5GB free space (SSD recommended)
- GPU: Optional (Intel UHD, AMD Radeon, or NVIDIA GTX 1050+)
- CPU: Intel Core i7/i9 or AMD Ryzen 7/9 (8+ cores)
- RAM: 16GB+ DDR4
- Storage: 20GB+ SSD with high read/write speeds
- GPU: NVIDIA RTX 3060+ or equivalent with 8GB+ VRAM
- CPU: Intel Core i9 or AMD Ryzen 9 (16+ cores)
- RAM: 32GB+ DDR4/DDR5
- Storage: NVMe SSD with 50GB+ free space
- GPU: NVIDIA RTX 4070+ with 12GB+ VRAM
# Performance tuning in config.yaml
performance:
# GPU settings
use_gpu: true
mixed_precision: true
gpu_memory_fraction: 0.8
# Processing settings
batch_size: 16 # Increase for more GPU memory
num_workers: 8 # Match CPU cores
frame_skip: 3 # Process every 3rd frame for speed
# Caching settings
enable_model_cache: true
enable_result_cache: true
cache_size_limit: "10GB"# Install development dependencies
pip install -r requirements-dev.txt
# Additional development tools
pip install black flake8 pytest pytest-cov pre-commit

# Code formatting with Black
black src/ tests/
# Linting with flake8
flake8 src/ tests/
# Type checking with mypy
mypy src/

# Install pre-commit hooks
pre-commit install
# Run all hooks manually
pre-commit run --all-files

# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test modules
pytest tests/test_processors.py
pytest tests/test_ai_models.py

To enable fast test execution without heavy ML dependencies (torch, transformers, spaCy models, etc.), VideoCraft provides a lightweight mode that stubs expensive components.
Key points:
- Activated automatically during pytest via `tests/conftest.py` (sets `LIGHT_TEST_MODE=1`).
- Skips loading large models; provides simple stub analyzers & suggester implementations.
- Suitable for CI smoke tests and contributor quick feedback loops.
- Full functionality still requires installing all dependencies in `requirements.txt`.
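A minimal sketch of how such a conftest can set the flag before any heavy imports happen (illustrative only — the real `tests/conftest.py` may also register stub fixtures and skip markers):

```python
# tests/conftest.py — illustrative sketch, not the project's actual conftest.
import os

# Set before any src module is imported so model loading can be skipped
os.environ.setdefault("LIGHT_TEST_MODE", "1")
```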
Environment variable: `LIGHT_TEST_MODE=1`

Minimal test dependencies are listed in `requirements-test.txt`:
pip install -r requirements-test.txt
pytest

When you want to run the full stack (with real model inference):
pip install -r requirements.txt
pytest -m "not slow" # (If slow markers are later added)Optional validation that heavy mode works (example):
$env:LIGHT_TEST_MODE=0
pytest tests/test_final_integration.py -k advanced

If a test is intentionally skipped in lightweight mode (e.g., advanced analyzer coverage), it will be reported with a skip reason so you can decide when to run full mode.
Recommended CI matrix (conceptual):
- Job 1 (fast): install `requirements-test.txt`, run core & unit tests (under 1 minute)
- Job 2 (full, optional nightly): install `requirements.txt`, run integration & model tests
An example GitHub Actions snippet (illustrative only):
jobs:
fast-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install minimal deps
run: pip install -r requirements-test.txt
- name: Run fast tests
run: pytest -q
full-tests:
if: github.event_name == 'schedule'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install full deps
run: pip install -r requirements.txt
- name: Run integration tests
run: pytest -m "integration" -q || true # allow failures initiallyThis separation keeps contributor workflows fast while preserving deep coverage in scheduled runs.
# Example test file: tests/test_processors.py
import pytest
from src.processors.video_analyzer import VideoAnalyzer
class TestVideoAnalyzer:
@pytest.fixture
def analyzer(self):
config = {"models": {"clip": {"model_name": "openai/clip-vit-base-patch32"}}}
return VideoAnalyzer(config)
def test_video_analysis(self, analyzer):
# Test video analysis functionality
pass
def test_scene_detection(self, analyzer):
# Test scene detection
pass

- Create Model Class:
# src/ai_models/new_model.py
class NewModel:
def __init__(self, config):
self.model = load_pretrained_model()
self.config = config
def predict(self, input_data):
return self.model(input_data)

- Update Configuration:
# config.yaml
models:
new_model:
model_name: "huggingface/model-name"
device: "auto"- Integration:
# main.py or relevant processor
from src.ai_models.new_model import NewModel
new_model = NewModel(config)
predictions = new_model.predict(data)

- Create Processor Class:
# src/processors/new_processor.py
class NewProcessor:
def __init__(self, config):
self.config = config
def process(self, input_data):
# Processing logic
return processed_data

- Add to Main Pipeline:
# main.py
from src.processors.new_processor import NewProcessor
new_processor = NewProcessor(config)
results = new_processor.process(input_data)

- Create UI Module:
# src/ui/new_component.py
import streamlit as st
class NewComponent:
def __init__(self, config):
self.config = config
def render(self, data):
st.markdown("## New Component")
# Render component

- Integrate in Main UI:
# main.py
new_component = NewComponent(config)
new_component.render(data)

class VideoAnalyzer:
"""
Comprehensive video content analysis using computer vision models.
Methods:
analyze_video_timeline(video_path: str) -> Dict
detect_scenes(video_path: str, method: str) -> List[float]
extract_video_features(video_path: str) -> Dict
detect_visual_emotions(frames: List[np.ndarray]) -> List[Dict]
"""class AudioAnalyzer:
"""
Advanced audio processing and speech analysis.
Methods:
analyze_speech_emotion(audio_path: str) -> List[Dict]
detect_speaker_changes(audio_path: str) -> List[float]
extract_audio_features(audio_path: str) -> Dict
analyze_audio_energy(audio_path: str) -> List[float]
"""class EmotionDetector:
"""
Multi-modal emotion detection system.
Methods:
detect_text_emotions(text: str) -> Dict
detect_speech_emotions(audio: np.ndarray) -> Dict
analyze_facial_emotions(image: np.ndarray) -> Dict
fuse_emotions(text_em: Dict, speech_em: Dict, visual_em: Dict) -> Dict
"""# Build production image
docker build -f Dockerfile.prod -t videocraft:prod .
# Deploy with docker-compose
docker-compose -f docker-compose.prod.yml up -d
# Or use cloud deployment
# AWS: Use ECS or EKS
# Google Cloud: Use Cloud Run or GKE
# Azure: Use Container Instances or AKS

# config/production.yaml
performance:
use_gpu: true
mixed_precision: true
batch_size: 32
logging:
level: "WARNING"
log_file: "/var/log/videocraft.log"
security:
enable_auth: true
max_upload_size: 1024 # Reduced for production

- Bug Reports: Help us identify and fix issues
- Feature Requests: Suggest new functionality or improvements
- Code Contributions: Submit pull requests for bug fixes or features
- Documentation: Improve documentation and examples
- Testing: Add tests or improve test coverage
- Performance: Optimize algorithms and resource usage
- Fork the Repository: Create your own fork of the project
- Create Feature Branch: `git checkout -b feature/amazing-feature`
- Make Changes: Implement your feature or fix
- Add Tests: Ensure your changes are properly tested
- Update Documentation: Update relevant documentation
- Commit Changes: `git commit -m 'Add amazing feature'`
- Push to Branch: `git push origin feature/amazing-feature`
- Create Pull Request: Submit PR with detailed description
- Code Style: Follow PEP 8 and use Black for formatting
- Documentation: Include docstrings for all public methods
- Testing: Maintain >90% test coverage
- Performance: Consider performance impact of changes
- Compatibility: Ensure backward compatibility when possible
## Bug Report Template
**Description**: Brief description of the issue
**Steps to Reproduce**:
1. Step one
2. Step two
3. Step three
**Expected Behavior**: What should happen
**Actual Behavior**: What actually happens
**Environment**:
- OS: [Windows/macOS/Linux]
- Python Version: [3.11.x]
- VideoCraft Version: [x.x.x]
- GPU: [NVIDIA RTX 3060 / None]
**Additional Context**: Any other relevant information

- Core Developers: Lead development and architecture
- Feature Contributors: Major feature implementations
- Bug Hunters: Significant bug reports and fixes
- Documentation Writers: Documentation improvements
- Community Helpers: Support in discussions and issues
Special thanks to:
- Hugging Face: For providing pre-trained models
- Streamlit: For the excellent UI framework
- PyTorch: For the deep learning foundation
- OpenAI: For the CLIP model
- Open Source Community: For inspiration and support
MIT License
Copyright (c) 2025 VideoCraft Development Team
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
- CLIP: MIT License (OpenAI)
- RoBERTa Emotion: Apache 2.0 License
- Wav2Vec2: MIT License (Facebook AI)
- ViT Face Expression: Apache 2.0 License
- PyTorch: BSD 3-Clause License
- Transformers: Apache 2.0 License
- Streamlit: Apache 2.0 License
- OpenCV: Apache 2.0 License
- Librosa: ISC License
- README: Comprehensive project overview (this document)
- API Documentation: Detailed API reference
- Tutorial Videos: Step-by-step usage guides
- FAQ: Frequently asked questions
- GitHub Issues: Bug reports and feature requests
- Discussions: General questions and community help
- Wiki: Community-maintained documentation
- Discord: Real-time chat support (coming soon)
- Consulting: Custom implementation and integration
- Training: Team training and workshops
- Enterprise: Enterprise licensing and support
- Project Maintainer: VideoCraft Development Team
- Email: [email protected]
- GitHub: @smirk-dev/VideoCraft
- Website: https://videocraft.ai
- Real-time Processing: Live video analysis during recording
- Advanced Transitions: AI-generated custom transition effects
- Multi-language Support: Support for non-English content
- Batch Processing: Process multiple videos simultaneously
- Cloud Integration: Cloud-based processing for large files
- Collaborative Editing: Multi-user editing and review
- Advanced AI Models: Latest transformer architectures
- Mobile App: iOS and Android companion apps
- Plugin System: Third-party plugin architecture
- API Service: RESTful API for integration
- Automated Editing: Full video editing with minimal input
- Style Transfer: Apply editing styles from reference videos
- Voice Cloning: Generate voiceovers and dubbing
- 3D Scene Understanding: Advanced spatial video analysis
- AR/VR Integration: Immersive editing experiences
- Processing Speed: 50% faster inference times
- Memory Usage: 30% reduction in memory footprint
- Model Accuracy: Improved confidence scores and accuracy
- User Experience: Enhanced UI responsiveness and feedback
- Compatibility: Support for more video formats and codecs