
[4.0.1] Pipeline architecture, multi-series isolation & search CLI #132

Open
dam2452 wants to merge 90 commits into main from Multi-Series-Support-&-Data-Isolation-(Preprocessor)

@dam2452 (Owner) commented Feb 9, 2026

Description

Testing

Remember to run the tests when your changes are ready. Workflows can be started from here. Use your branch.

Also, deploy and tests will automatically begin upon PR approval.

Introduce multi-series output layout and central path helpers, plus a large refactor of processor and character modules. Added get_base_output_dir/get_output_path and per-setting get_output_dir methods so CLI commands resolve default output directories by series name; CLI commands were updated to accept --name/series and to use these per-series paths. Moved many processing modules under preprocessor/processors and reorganized character and video code into subpackages (preprocessor/characters/search, preprocessor/characters/face, preprocessor/video/helpers, preprocessor/video/subprocessors). Pipeline steps and orchestrator now use series-aware paths; README updated with multi-series examples and migration notes. Also added core path/processor registry/factory stubs to support the new layout.
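A minimal sketch of how per-series path resolution could look. Only the names `get_base_output_dir` and `get_output_path` come from the commit message; the signatures, the `output_data` root and the validation rule are assumptions made for illustration and may not match the PR's code.

```python
from pathlib import Path

# Assumed root directory; the project resolves the real base path elsewhere.
OUTPUT_ROOT = Path("output_data")


def get_base_output_dir(series_name: str) -> Path:
    """Return the isolated output root for one series (hypothetical signature)."""
    if not series_name or any(sep in series_name for sep in ("/", "\\")):
        raise ValueError(f"invalid series name: {series_name!r}")
    return OUTPUT_ROOT / series_name


def get_output_path(series_name: str, *parts: str) -> Path:
    """Join a relative path onto the series root (hypothetical signature)."""
    return get_base_output_dir(series_name).joinpath(*parts)


# Two series never share a directory, so their artifacts cannot collide.
print(get_output_path("ranczo", "transcoded_videos"))
print(get_output_path("kiepscy", "transcoded_videos"))
```

Rejecting path separators in the series name keeps a value passed on the command line from escaping the output root.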
Move EpisodeScraper and CharacterReferenceDownloader imports from inside functions to module-level imports in preprocessor/cli/pipeline/steps.py to remove import-outside-toplevel usage and clarify dependencies. In preprocessor/core/base_processor.py, adjust the pylint disable placement and reformat the conditional return for file_naming.build_filename; this is a styling/lint change with no functional behavior change.
Introduce CharacterReferenceProcessor.__safe_resize to guard against None/empty images and OpenCV resize errors, returning None on failure. Replace direct cv2.resize calls with the safe helper in reference_processor, skip invalid faces and log warnings. Also reorder an import (init_face_detection) in reference_downloader and apply small formatting fixes (trailing commas) in processor_registry, elastic document/indexer, and frame_processor.
Large refactor: reorganize the preprocessor package into new app, lib and modules subpackages (many files moved/renamed and legacy CLI/processor files removed). Add a pipeline system (Pipeline, StepBuilder, pipeline_factory, pipeline_builder, config_defaults) under preprocessor.app to build/visualize processing pipelines. Introduce many Elasticsearch index mapping constants in ElasticSearchManager (segments, text/video embeddings, episode names, full-episode embeddings, sound events and embeddings). Update usages/imports: reindex_service now imports bot.search.elastic_search_manager.ElasticSearchManager and adjusts the connect call, bot/types imports from preprocessor.config.types, and preprocessor.__main__ uses the relocated console. Overall this commit prepares the codebase for modular pipeline execution and richer ES indexing schemas.
Reduce default scene detection minimum length from 15s to 10s and increase Whisper transcription beam_size from 5 to 10. Updates applied to config defaults, step configs, pipeline factory, and the runtime wrappers (TransNetWrapper and Whisper) so defaults and implementations stay consistent; this allows detection of shorter scenes and uses a larger beam for potentially improved transcription quality at the cost of extra compute.
Introduce per-series configuration and make the pipeline build dynamic from those configs. Added SeriesConfig loader and default/template/kiepscy/ranczo JSON configs; pipeline_factory now constructs StepBuilders from series_config and exposes build/visualize/get_step_configs with a series parameter. CLI and helpers updated to accept a --series arg, handle Docker vs local input/output paths, and apply selective skip rules from series config. Pipeline execution now checks state_manager to skip completed steps and marks steps as started/completed; StateManager uses a per-series state file. Misc: pass series_name into scrapers, add create_progress factory, simplify entrypoint, and adjust base scraper behavior.
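The skip-completed-steps behaviour with a per-series state file can be sketched as follows. The class name `StateManager` appears in the commit message, but the method names, the JSON layout and the `<series>_state.json` file name are assumptions for illustration only.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Set


class StateManager:
    """Tracks completed steps in one JSON file per series (illustrative only)."""

    def __init__(self, state_dir: Path, series_name: str) -> None:
        # Assumed file naming; the real layout may differ.
        self._path = state_dir / f"{series_name}_state.json"
        self._completed: Set[str] = set()
        if self._path.exists():
            self._completed = set(json.loads(self._path.read_text(encoding="utf-8")))

    def is_completed(self, step_id: str) -> bool:
        return step_id in self._completed

    def mark_completed(self, step_id: str) -> None:
        self._completed.add(step_id)
        self._path.write_text(json.dumps(sorted(self._completed)), encoding="utf-8")


def run_steps(steps: Dict[str, Callable[[], None]], state: StateManager) -> List[str]:
    """Run each step unless the state file already records it as done."""
    executed = []
    for step_id, action in steps.items():
        if state.is_completed(step_id):
            continue
        action()
        state.mark_completed(step_id)
        executed.append(step_id)
    return executed
```

Because the state file is keyed by series name, finishing a step for one series does not cause it to be skipped for another.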
Rename FFmpegWrapper internal constants and helper methods to use double-underscore (name-mangled) identifiers and update all call sites accordingly in preprocessor/lib/media/ffmpeg.py. Reorder/relocate helper implementations while preserving behavior. Also switch list[str] annotations to typing.List for compatibility and add the List import in both ffmpeg.py and preprocessor/modules/audio/extraction.py; update the command variable typing in audio extraction to List[str]. No functional changes to encoding logic.
Detect interlaced video and apply deinterlace filter during transcoding; refactor scene detection internals.

- FFmpegWrapper: added ffmpeg idet-based detect_interlacing() and __parse_idet_output(); made __build_video_filter accept deinterlace flag and compose filters; imported re and Tuple typing.
- VideoTranscoderStep: run interlace detection, log results, pass deinterlace flag into FFmpegWrapper, and added pylint disable for long method.
- TransNetWrapper: renamed several helpers to private (__get_video_info, __build_scenes_from_predictions, __create_scene_dict, __frame_to_timecode), adjusted detect_scenes to use the new names, moved cleanup earlier, and ensured GPU memory is cleared.
- Scene detection module: updated call site to use the renamed private video info method.

These changes enable automatic detection of interlaced content and conditional application of a bwdif deinterlacing filter, while tightening internal method visibility and cleaning up resources after model use.
Introduce a force_deinterlace option and make interlace detection more robust. Wire force_deinterlace through config (series_config, step_configs), pipeline_factory, and defaults/kiepscy series configs (defaults=false, kiepscy=true). FFmpegWrapper.detect_interlacing: make analysis_time optional, conditionally include -t, add -an, handle non-zero ffmpeg exit, and parse idet output via a multiline regex to extract TFF/BFF/Progressive. VideoTranscoderStep: when force_deinterlace is enabled skip detection and force bwdif; otherwise run detection, log failures explicitly and proceed without deinterlacing if idet fails; use a deinterlace variable and ensure temp file cleanup by broadening the except to catch BaseException. Also translate several pipeline error messages from Polish to English for clarity.
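The idet parsing and the force/detect decision described above could look roughly like this. The regex targets the "Multi frame detection" summary line that ffmpeg's idet filter writes to stderr; the function names, the returned dictionary and the 10% threshold are assumptions, not the PR's implementation.

```python
import re
from typing import Dict, Optional

# Matches the summary line printed by ffmpeg's idet filter.
_IDET_RE = re.compile(
    r"^.*Multi frame detection:\s*TFF:\s*(\d+)\s*BFF:\s*(\d+)\s*Progressive:\s*(\d+)",
    re.MULTILINE,
)

SAMPLE_STDERR = """\
[Parsed_idet_0 @ 0x55d] Single frame detection: TFF:    40 BFF:     0 Progressive:    55 Undetermined:     5
[Parsed_idet_0 @ 0x55d] Multi frame detection: TFF:    45 BFF:     0 Progressive:    55 Undetermined:     0
"""


def parse_idet_output(stderr: str) -> Optional[Dict[str, int]]:
    """Extract TFF/BFF/Progressive frame counts, or None if idet printed nothing."""
    matches = _IDET_RE.findall(stderr)
    if not matches:
        return None
    tff, bff, progressive = (int(value) for value in matches[-1])
    return {"tff": tff, "bff": bff, "progressive": progressive}


def needs_deinterlace(
    stats: Optional[Dict[str, int]],
    force_deinterlace: bool = False,
    threshold: float = 0.1,  # assumed cut-off, not taken from the PR
) -> bool:
    """Forced mode skips detection; a failed detection means no deinterlacing."""
    if force_deinterlace:
        return True
    if stats is None:
        return False
    total = sum(stats.values())
    if total == 0:
        return False
    return (stats["tff"] + stats["bff"]) / total > threshold


print(needs_deinterlace(parse_idet_output(SAMPLE_STDERR)))
```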
Refactor core pipeline API and CLI, add video discovery and search features.

- Rename Pipeline -> PipelineDefinition and Pipeline runner -> PipelineExecutor; update pipeline API (validate(logger: Optional), get_all_steps, execute_step(s)).
- Extract video discovery logic to preprocessor/app/video_discovery.py and use it in the runner.
- Update pipeline_factory to construct PipelineDefinition and use get_all_steps.
- Replace inline video/path discovery with PathResolver usage in CLI; add new core path_resolver/path_service modules (added files).
- Add a comprehensive async Elasticsearch-based `search` CLI command with multiple search modes (text, semantic, image, hash, character, emotion, object, episode name, stats) and associated clients; compute perceptual hashes when needed.
- Introduce SkipListBuilder for constructing skip lists and integrate into run-all flow.
- Clean up helpers to use PathResolver and simplify context setup.
- Large documentation updates: expand README and SEARCH_GUIDE with new config format, pipeline steps, commands, examples, state management and API key instructions.

These changes reorganize execution flow, improve modularity, and add a full-featured search CLI and better configuration/state handling.
Replace Qwen/Qwen2-VL-8B-Instruct with Qwen/Qwen3-VL-Embedding-8B in defaults and step configs, and translate pipeline step descriptions from Polish to English. Files modified: preprocessor/config/step_configs.py, preprocessor/app/config_defaults.py, preprocessor/app/pipeline_factory.py. No functional changes besides model selection and description text.
Remove non-functional docstrings and a TODO comment across several preprocessor modules (cli/cli_main.py, config/prompts/common_schemas.py, lib/io/metadata.py, modules/scraping/base_scraper.py, modules/search/clients/result_formatters.py, modules/search/indexing.py, modules/text/embeddings.py, modules/vision/embeddings.py). Cosmetic cleanup only — no functional changes.
Rename numerous internal helpers to private (double-underscore) names across the codebase and update call sites accordingly. Remove or trim several unused/legacy routines (state interrupt handling, metadata save, various helper utilities and wrappers), simplify logging wrapper, and add pylint disables where appropriate. These changes are intended to encapsulate implementation details, reduce public surface area and remove dead code without altering high-level behaviors.
Add tools to enforce/fix dataclass field ordering and perform a broad refactor of pipeline/executor and configuration dataclasses.

Changes include:
- Add check_dataclasses.py and fix_dataclasses.py: utilities to detect and automatically reorder dataclass fields so fields without defaults come before those with defaults.
- Pipeline changes: reorganize PipelineDefinition methods (get_all_steps, register, validate, cycle/missing-dependency errors) and improve error messages; add executor helpers in PipelineExecutor/PipelineBuilder (cleanup, execute_step(s), state-marking helpers) to consolidate step lifecycle handling.
- StepBuilder/CLI updates: reorder StepBuilder dataclass fields and relocate validation logic; adjust PipelineContextFactory ordering and helper methods.
- Config refactor: reorder and add fields across many config dataclasses (OutputSubdirs, TranscodeSettings, Whisper/ElevenLabs/Embedding settings, ImageScraper, Elasticsearch, Gemini, Settings, TranscodeConfig, TranscriptionConfig, etc.), rename private env loaders from __from_env to _from_env, add SeriesConfig.load and defaults-loading helpers, and add small behavioral/property changes (e.g. image scraper serpapi_key property).

Overall purpose: ensure consistent dataclass definitions, improve maintainability of pipeline execution and lifecycle, and normalize configuration structures and env-loading conventions.
Simplify BaseProcessor by removing unused helpers and prints, and by adjusting defaults. Removed unused __get_processing_info and __get_temp_files methods and stopped printing processing info during resource loading. Default loglevel changed from logging.DEBUG to numeric 10 to avoid the logging import. When marking a step started, temp files are no longer collected and an empty list is passed to state_manager; several tuple-return syntaxes were simplified to plain returns.
Move modules into a unified lib/ layout and update related imports; create preprocessor.lib.search.clients package and adjust search client imports. Simplify preprocessor.lib.io exports and relocate path_manager/path_resolver/path_service under lib/io. Remove several transcription processors and elevenlabs transcriber. Refactor BaseProcessor API: introduce _finalize, change abstract methods signatures to be implemented by subclasses, centralize missing-output and state sync logic, and adapt path manager import. Update CharacterReferenceDownloader to new BaseProcessor flow (cleanup, _get_expected_outputs, _get_processing_items, _load_resources) and switch to modules-based placement. Update README to add pipeline validation step and reflect new step order. Numerous file renames and import fixes to match the new package layout.
Refactor codebase namespaces and add validation phase.

- Move many modules from preprocessor.lib / preprocessor.modules to preprocessor.services and preprocessor.steps and update imports accordingly (UI, IO, media, episodes, search, transcription, video, etc.).
- Remove legacy package __init__ files under preprocessor/lib and preprocessor/modules.
- Add a VALIDATION phase and register a Validation step in the pipeline (pipeline_factory) using new ValidationConfig.
- Introduce ValidationConfig (anomaly_threshold, episodes_info_json) and ValidationResult dataclass in core.artifacts.
- Add PipelineStep._check_cache_validity helper to centralize cache/skip logic.
- Update various files to use services.* and steps.* paths and adjust typing imports where necessary.

This commit is primarily a namespace reorganization plus the addition of validation plumbing and small core helpers to support it.
Rename many preprocessor step modules to use a consistent *_step.py filename (audio, packaging, search, text, video, vision) and update package __init__.py imports accordingly. This standardizes module names (e.g. separation.py -> separation_step.py, archives.py -> archives_step.py, analysis.py -> analysis_step.py, etc.) so imports reference the new filenames; no functional changes were made aside from renames and import updates.
Introduce a new resolution_analysis pipeline step (and Result type) to analyze source video resolutions before transcode and make transcode depend on it; update README and pipeline diagram to show 21 steps and the new analyze-resolution CLI command. Refactor CLI/search logic by extracting SearchCommandHandler and SearchFilters, centralizing perceptual-hash computation, and unifying async search flow; wire in EmbeddingService/Elasticsearch queries and replace ad-hoc result printing with buffered output. Replace PathResolver usages with PathService (remove old path_resolver), streamline logging/messages (remove emojis and replace Unicode arrows with ASCII), add several validator modules, add small typing and signature improvements, and apply various minor cleanups across AI, face-detection, state management, artifacts, and core modules.
Refactor pipeline execution to support global steps by adding __run_global_step and __run_episode_step in PipelineExecutor and using a special 'all' episode_id for global completion tracking. Add an is_global property (default False) to PipelineStep and mark BaseScraperStep, ResolutionAnalysisStep, and CharacterReferenceStep as global. Remove per-instance _executed flags from scraper and reference steps so step execution and skipping rely on the pipeline's centralized skip/mark logic and context.force_rerun handling. Improve logging and error messages for global step execution.
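A rough sketch of how global and per-episode steps might share one completion store. The `'all'` episode id, `is_global` and `force_rerun` are named in the commit message; the `Step` class, the `execute` function and the set-based store are hypothetical stand-ins for PipelineStep, PipelineExecutor and the state manager.

```python
from typing import Callable, Iterable, List, Set, Tuple

GLOBAL_EPISODE_ID = "all"  # sentinel episode id named in the commit message


class Step:
    """Hypothetical stand-in for a pipeline step."""

    def __init__(self, step_id: str, run: Callable[[str], None], is_global: bool = False) -> None:
        self.step_id = step_id
        self.run = run
        self.is_global = is_global


def execute(
    steps: Iterable[Step],
    episode_ids: Iterable[str],
    completed: Set[Tuple[str, str]],
    force_rerun: bool = False,
) -> List[Tuple[str, str]]:
    """Run global steps once and episode steps per episode, skipping finished work."""
    episodes = list(episode_ids)
    executed = []
    for step in steps:
        targets = [GLOBAL_EPISODE_ID] if step.is_global else episodes
        for episode_id in targets:
            key = (step.step_id, episode_id)
            if key in completed and not force_rerun:
                continue
            step.run(episode_id)
            completed.add(key)
            executed.append(key)
    return executed
```

Keeping the skip decision in the executor is what makes per-instance `_executed` flags on the steps unnecessary.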
Update pipeline module paths to new *_step modules and refactor resolution analysis step. The ResolutionAnalysisStep now uses stricter private method names, improved imports/typing, and writes a detailed JSON report (including counts, labels, per-file info and upscaling metrics) to the output directory. Also replace CharacterReferenceProcessor with CharacterReferenceDownloader (parameter key renamed and series_name passed) and adjust logging/messages accordingly.
Multiple changes to improve video quality, interlacing analysis and transcoding consistency:

- series config: disable forced deinterlace for kiepscy.json.
- FFmpegWrapper: add default audio sample rate (48 kHz), tweak detect_interlacing (default 60s, command ordering), include -ar in transcode, enable sws_flags, embed color metadata and video_track_timescale, adjust aq-strength, enforce closed GOP (strict_gop/forced-idr/no-scenecut), and refine deinterlace/scaler filters for better results.
- ResolutionAnalysisStep: refactor to return richer per-file video_info (width/height, field_order, idet stats, needs_deinterlace, metadata match), run idet per file, validate metadata vs idet, improve logging and include interlacing_analysis in the JSON output.
- VideoTranscoderStep: add informational logs about forced parameters, force target FPS to 24.0, revise bitrate/upscaling heuristics (pixel_ratio/quality_boost and min bitrate adjustments), update deinterlace detection messaging to use idet (first 60s) and handle metadata mismatches, always pass target_fps to transcode, and small signature/lint cleanups.

Overall these changes aim to produce more consistent, color-accurate outputs, more reliable deinterlacing decisions based on idet, and smarter bitrate scaling for upscaling scenarios.
Major refactor to decouple environment/output path handling and simplify CLI/search logic. Moves base output path logic into preprocessor.config.output_paths and introduces Environment.is_docker; adds OutputDirMixin to centralize per-feature output dirs. Introduces SettingsFactory and settings_instance for controlled Settings creation and injects settings into ExecutionContext. Replaces PathManager with PathService and removes several legacy IO modules (detection_io, hashing, path_manager); updates imports across services. Simplifies LLMProvider by unifying combined content building and client usage, and reduces duplication in SearchCommandHandler by adding a generic _execute_search helper; the CLI search command now builds a SearchConfig/params dataclass and delegates execution. FFmpeg wrapper updated to accept a TranscodeParams object and constants/flags were renamed/tuned. Miscellaneous plumbing and cleanup to align modules with the new config/IO patterns.
Major refactor across pipeline, builder, config and CLI modules:

- PipelineDefinition: made internal attributes private, added name property, registration/getters, full validate() that builds a DAG, checks missing deps and cycles, logging on success, execution ordering by topological sort, grouped steps by phase, improved ASCII output and repr.
- PipelineExecutor / pipeline_builder: encapsulated context and steps, use run() to execute per-episode/global steps, forward context to step execution, implemented step progress/completion state updates, improved logging and error handling.
- StepBuilder: freeze dataclass, validate step id and module path on post-init, keep eq/hash/repr.
- Pipeline factory: organize phases with comments, use a single output_dir variable, update transcode config keys and description, minor config/formatting improvements, and rename helper to _get_step_configs.
- CLI and search: make many attributes/methods private, freeze dataclasses for search params, refactor SearchFilters/SearchCommandHandler to use properties and private members, rename/clarify async runner, and tidy up click option formatting and small logging/messages.
- Config dataclasses: make multiple config dataclasses frozen, hide API key in repr by using field(default=None, repr=False), remove unused imports.
- Other small fixes: formatting, minor API and naming cleanups, and consistency improvements across modules.

These changes improve encapsulation, add validation for pipeline definitions, and standardize config and CLI data structures.
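The validation described for PipelineDefinition (missing dependencies, cycles, topological execution order) can be sketched with Kahn's algorithm. The function name, the exception type and the dictionary input are assumptions for illustration; the step ids in the usage line are only examples.

```python
from typing import Dict, List


class PipelineValidationError(ValueError):
    """Raised when the step graph is not a valid DAG (hypothetical name)."""


def execution_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return step ids so that every step comes after its dependencies."""
    missing = {
        dep for deps in dependencies.values() for dep in deps if dep not in dependencies
    }
    if missing:
        raise PipelineValidationError(f"missing dependencies: {sorted(missing)}")

    remaining = {step: set(deps) for step, deps in dependencies.items()}
    order: List[str] = []
    while remaining:
        # Steps whose dependencies are all satisfied; sorted for a stable order.
        ready = sorted(step for step, deps in remaining.items() if not deps)
        if not ready:
            raise PipelineValidationError(f"dependency cycle among: {sorted(remaining)}")
        for step in ready:
            order.append(step)
            del remaining[step]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


print(execution_order({
    "transcode": ["resolution_analysis"],
    "resolution_analysis": [],
    "scene_detection": ["transcode"],
}))
```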
Large refactor across the preprocessor package: convert many instance helpers to @staticmethods, tighten typing, and rename parameters/fields for clarity. Key changes: replace Settings._from_env with Settings.from_env and update SettingsFactory; introduce/adjust FrameRequest fields (frame_number, timestamp) and propagate that type through keyframe strategies; make numerous helper methods static (console, validators, generators, FFmpeg parsers, embedding computation, etc.); normalize method names in emotion and scene detection (init_model/detect_batch/crop_face, get_video_info); make PathService.suffix optional; avoid mutating TranscodeParams by using dataclasses.replace in transcoding step; simplify calls (safe_resize rename and call sites); format search CLI output into a single joinable list. Also includes small typing fixes, minor logic cleanups, and other consistency improvements.
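The frozen-dataclass conventions mentioned in the last two commits (hiding an API key from `repr`, and using `dataclasses.replace` instead of mutating parameters) look like this in isolation. The class and field names below are illustrative, not the PR's actual schema; only `TranscodeParams` as a class name and the 24.0 target FPS come from the commit messages.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class ApiConfig:
    """Illustrative config; repr=False keeps the key out of logs and tracebacks."""
    model: str
    api_key: Optional[str] = field(default=None, repr=False)


@dataclass(frozen=True)
class TranscodeParams:
    """Illustrative subset of transcode parameters."""
    target_fps: float
    deinterlace: bool = False


base = TranscodeParams(target_fps=24.0)
# replace() builds a new instance, so the shared defaults stay untouched.
per_file = replace(base, deinterlace=True)

print(ApiConfig(model="example-model", api_key="secret"))
print(base, per_file)
```

With `frozen=True`, an accidental `base.deinterlace = True` raises `dataclasses.FrozenInstanceError` instead of silently changing settings for every later file.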
@dam2452 changed the title from "Multi series support & data isolation (preprocessor)" to "Pipeline architecture, multi-series isolation & search CLI" on Feb 13, 2026
Introduce a new CLI utility to copy processed series output to NAS storage. The deploy_to_nas script collects files from `archives` and `transcoded_videos` subdirs, supports dry-run, overwrite, and concurrent copying via ThreadPoolExecutor, and auto-resolves the local output_data base path unless overridden. Also add an empty package __init__.py to expose preprocessor.scripts as a module.
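A sketch of the copy loop such a script might use. The `archives` and `transcoded_videos` subdirectories, dry-run, overwrite and ThreadPoolExecutor come from the commit message; the function names are hypothetical, and writing to a temporary `.part` name before renaming is an assumption added here as one common way to keep half-copied files from being picked up, not something the commit message states.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List

SUBDIRS = ("archives", "transcoded_videos")  # subdirs named in the commit message


def copy_one(src: Path, dst: Path, overwrite: bool, dry_run: bool) -> str:
    """Copy a single file and report what happened."""
    if dst.exists() and not overwrite:
        return "skipped"
    if dry_run:
        return "would copy"
    dst.parent.mkdir(parents=True, exist_ok=True)
    tmp = dst.with_name(dst.name + ".part")  # assumed temporary suffix
    shutil.copy2(src, tmp)
    tmp.replace(dst)  # rename only after the copy has finished
    return "copied"


def deploy(
    series_dir: Path,
    nas_dir: Path,
    overwrite: bool = False,
    dry_run: bool = False,
    workers: int = 4,
) -> List[str]:
    """Copy the selected subdirectories to the NAS concurrently."""
    files = sorted(
        path
        for sub in SUBDIRS
        for path in (series_dir / sub).rglob("*")
        if path.is_file()
    )
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(copy_one, f, nas_dir / f.relative_to(series_dir), overwrite, dry_run)
            for f in files
        ]
        return [future.result() for future in futures]
```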
@skelly37 (Collaborator) left a comment

Do as you see fit; I'm not around enough to read all of this. Just make sure the files get temporary extensions, because I didn't feel like hunting for that in here.

dam2452 added 6 commits on March 9, 2026, 12:11
Update VLLM client defaults and runtime settings: change default model from Qwen/Qwen2.5-Coder-7B-Instruct to Qwen/Qwen3.5-9B; tune sampling parameters (add min_p, presence_penalty, set repetition_penalty to 1.0); pass chat_template_kwargs to disable "thinking" in outputs; and increase max_model_len from 131072 to 262144 (update console message to 256K context). These changes enable longer-context runs and adjusted generation behavior.
Introduce a TranscriptionImportStep and wiring to allow importing pre-existing transcriptions instead of running transcription.

- Add a TranscriptionImportStep implementation that supports batch processing, finds and converts 11labs_segmented JSON (and generic JSON), writes normalized transcription artifacts, and provides an output descriptor. It supports season_remap and resolves per-episode files.
- Make pipeline_factory choose between TranscriptionImportStep and runtime TranscriptionStep based on series_config.processing.transcription_import.
- Extend SeriesConfig to parse an optional transcription_import section (TranscriptionImportProcessingConfig) and refactor processing config construction.
- Add new example series config kapitan_bomba.json with transcription_import settings.
- Update step_configs: TranscriptionImportConfig now uses Path and includes season_remap; add apply_boost_on_resize_only flag to TranscodeConfig.
- Video transcoder: add resolution equality check and honor apply_boost_on_resize_only to avoid bitrate boosting when output resolution equals source.

These changes enable importing externally-produced transcriptions and prevent unnecessary bitrate boosts when no resize occurs.
Update series config path and adjust transcoding bitrate logic and logging.

- preprocessor/series_configs/kapitan_bomba.json: switch source_dir from a local Windows Downloads path to /transcriptions/kapitan_bomba.
- preprocessor/steps/video/transcoding_step.py: when config.apply_boost_on_resize_only is set and the video resolution is unchanged, preserve the normalized bitrate (no boost), even if it falls below min_bitrate. Reordered branch logic to ensure same-resolution preservation takes precedence. Adjusted the preserved message text.
- Made __log_transcode_details an instance method so it can call the resolution check, and changed the scale label to emit SAME/UP/DOWN in logs for clearer reporting.

These changes ensure bitrate boosts are applied only when resizing (if configured) and improve logging clarity.
Align dependencies with nightly vLLM, update runtime tuning, and tweak preprocessing defaults.

- Dockerfile: switch to pre-release/nightly vllm wheels, install transformers from main, remove flashinfer and onnxruntime, pin onnxruntime-gpu==1.21.0, and replace CUDA/CuDNN/NCCL packages with cu12 variants (add tolerant uninstall for cu11).
- core/state_reconstruction.py: rename step_instance parameters to step_def for output checks and use step_def.get_output_descriptors().
- series_configs: reduce images_per_character from 5 to 3 in defaults and stop skipping character_scraper for kapitan_bomba (enable character scraping).
- services/ai/clients.py: adjust vLLM sampling and runtime defaults (temperature 0.7->1.0, top_p 0.8->0.95, gpu_memory_utilization 0.95->0.90) and add language_model_only=True.
Allow disabling character reference downloads and harden model loading and face clustering.

- Config: set face clustering and validation default parallel episodes to 1; allow images_per_character >= 0.
- Series config: set kapitan_bomba scraping.character_references.images_per_character to 0 to skip downloads.
- Scraping step: short-circuit reference download when images_per_character == 0 and log the skip.
- Face clustering: handle tiny sample counts (return zeros for <2 samples) and cap min_samples/min_cluster_size to number of samples to avoid errors.
- Emotion utils: add support for EMOTION_MODEL_HOME to load/persist ONNX models from a mounted volume, add retry logic for HTTP 429 rate limits when downloading models, persist packaged model into volume if present, and patch model path resolution accordingly.

These changes make scraping optional for character images, improve robustness in low-sample clustering scenarios, and make emotion model loading resilient and friendly to containerized environments with mounted model volumes.
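The retry-on-429 behaviour for model downloads can be sketched independently of any HTTP library by injecting the fetch call. Everything here is an assumption for illustration: the function name, the `(status, body)` return shape, the attempt count and the exponential backoff policy are not taken from the PR.

```python
import time
from typing import Callable, Tuple


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() until it succeeds, backing off only on HTTP 429."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            return body
        if status != 429:
            raise RuntimeError(f"unexpected HTTP status {status}")
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ... (assumed policy).
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in production the default `time.sleep` applies.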
Introduce per-episode file paths and path helper methods, update validators to validate single episode JSON/JSONL files and use FileValidator via PathService. Key changes:

- Added PathService.get_episode_dir_by_code and get_episode_file_path; expose Validator.validation_reports_dir.
- Switched several output subdirectory names in OutputSubdirs (characters, clusters, frames, hashes, object detection, scene paths, etc.).
- ElasticValidator: search season folder for {ep_code}_*.jsonl, validate individual files and text statistics via PathService/get_base_output_dir; improved error/warning messages.
- FaceCluster, ImageHash, Object, Scene, Transcription and Frame validators: now operate on per-episode files (using PathService.get_episode_file_path or get_episode_dir_by_code) and use FileValidator for JSON integrity; removed old directory-based helpers.
- Document generation and archiving: set min_size_bytes to 0 for many outputs; document generator now handles missing/empty source data by writing empty NDJSON safely instead of early-returning; archives reporting now lists missing document types when skipping.
- SoundEventEmbeddingStep: save empty results when no segments.
- TranscodeConfig: removed apply_boost_on_resize_only and updated VideoTranscoderStep logic to always apply bitrate rules (simplified boost behavior).

These changes migrate validation and IO to a simpler per-episode file layout, improve validation messages, and make document/archiving behavior more explicit for missing or empty inputs.
@dam2452 changed the title from "Pipeline architecture, multi-series isolation & search CLI" to "[4.0.1] Pipeline architecture, multi-series isolation & search CLI" on Mar 13, 2026
dam2452 added 11 commits on March 13, 2026, 09:39
Introduce BaseTranscriptionStep to centralize transcription output descriptors and cache path resolution. Update TranscriptionImportStep and TranscriptionStep to inherit from the new base class and remove duplicated get_output_descriptors/_get_cache_path implementations. Bump VERSION to 4.0.1 and remove an unnecessary pylint disable comment in face_clusterer.py.
Add a new series configuration for a Sejm RP demo. Defines display_name and series_name 'sejm_demo', Elasticsearch index_name 'sejm_demo', scraping settings (character image refs, character and episode URLs), and skips the episode_scraper and character_scraper steps for demo runs.
Set pipeline_mode to "selective" in sejm_demo.json. Change CharacterReferenceDownloader to log "No suitable images found" as a warning instead of an error to reduce noisy error reports when characters have no available reference images.
Introduce a uses_global_completion property (default True) on PipelineStep and update PipelineExecutor to only mark/skip global steps when the flag is set. Disable caching and global completion for the character reference step and forward force_rerun to the downloader. In CharacterReferenceDownloader add a force_rerun option, short-circuit runs when a .exhausted marker exists, and create that marker when no images are saved to avoid repeated futile scrapes. These changes prevent unnecessary global completion bookkeeping for this step and skip re-running searches for characters previously found to be exhausted unless explicitly forced.
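The `.exhausted` marker logic could be sketched as below. The marker file name and the `force_rerun` option come from the commit message; the function name, the injected `search` callable and the file naming are hypothetical.

```python
from pathlib import Path
from typing import Callable, List

EXHAUSTED_MARKER = ".exhausted"  # marker file name from the commit message


def download_references(
    char_dir: Path,
    search: Callable[[], List[bytes]],
    force_rerun: bool = False,
) -> int:
    """Save found images; remember characters for which nothing was found."""
    marker = char_dir / EXHAUSTED_MARKER
    if marker.exists() and not force_rerun:
        return 0  # a previous run already came up empty, skip the scrape
    char_dir.mkdir(parents=True, exist_ok=True)
    images = search()
    for index, data in enumerate(images):
        (char_dir / f"{index:02d}.jpg").write_bytes(data)
    if not images:
        marker.touch()
    elif marker.exists():
        marker.unlink()  # a forced rerun found images after all
    return len(images)
```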
Introduce a configurable search_query_template for character reference scraping. Added the field to SeriesConfig and CharacterReferencesConfig (dataclass) and to the Pydantic CharacterReferenceConfig with a default of "Serial {series_name} {char_name} postać". Updated series defaults and sejm_demo config to include templates (sejm_demo uses "{char_name} poseł"). PipelineFactory and the reference scraping step now pass the template through to the downloader. CharacterReferenceDownloader reads the template from args (with a fallback default) and builds search queries via .format(series_name=..., char_name=...), enabling per-series/custom search query formats.
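As a small illustration of the template mechanism: the default template and the sejm_demo template are quoted from the commit message, while the helper name and the example character names are hypothetical.

```python
from typing import Optional

# Default quoted in the commit message.
DEFAULT_TEMPLATE = "Serial {series_name} {char_name} postać"


def build_search_query(series_name: str, char_name: str, template: Optional[str] = None) -> str:
    """Fill the per-series template; str.format ignores unused keyword arguments."""
    return (template or DEFAULT_TEMPLATE).format(series_name=series_name, char_name=char_name)


print(build_search_query("Ranczo", "Lucy"))
print(build_search_query("sejm_demo", "Jan Kowalski", "{char_name} poseł"))
```

A template may omit `{series_name}`, as the sejm_demo one does, but a placeholder other than the two supplied names would raise a `KeyError`.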
Switch Google image search to RapidAPI: rename config key to google_search_key and read RAPIDAPI_GOOGLE_SEARCH_KEY from env. Replace serpapi client usage in GoogleImageSearch with requests to the RapidAPI Google Search endpoint (update names/error text and result parsing). Add a new SerpApiImageSearch class that retains the original serpapi-based implementation and export it from the package. Update reference_downloader to accept no direct DuckDuckGo fallback (search engine is optional), use settings.image_scraper.google_search_key for premium mode, and implement a browser-based DuckDuckGo i.js response fallback to gather image URLs. Improve download handling by checking content-type, add extraction of og:image when HTML is returned, and move/surface result-sorting utility; also adjust imports and logging accordingly.
Reduce min image dimensions from 600x800 to 60x60 and introduce explicit search engine handling. Default search_engine in CharacterReferenceConfig changed to "normal"; sejm_demo.json now requests "premium" engine and fixes the search query string. DuckDuckGoImageSearch now normalizes results to dicts with 'image' and 'thumbnail'; GoogleImageSearch extracts knowledge panel image first and returns up to max_results. Character reference downloader was simplified to always use a search engine instance (Google when 'premium', otherwise DuckDuckGo), removed the browser-based DuckDuckGo fallback and unused imports, and adjusted arg/variable names for clarity.
Switch image scraping to SerpAPI and browser-based Bing, unify search interface to iterators, and overhaul reference downloader pipeline. Updates include: increase max results (50->100), adjust request/retry delays, rename env key to SERPAPI_API_KEY (settings property serpapi_key), and change sejm_demo search_engine from "premium" to "normal". Added BrowserBingImageSearch (browser scraping), updated DuckDuckGo to yield results with pre-search delay, replaced RapidAPI Google search with serpapi.GoogleSearch, and removed the old serpapi_image_search module. reference_downloader was rewritten to initialize the chosen search engine after launching the browser, stream results, collect candidate images, perform consensus-based face embedding filtering/scoring, decode images via PIL, and save top candidates (removed OG extraction and previous simple validation/sorting logic).
Introduce a browser-driven DuckDuckGo image search (BrowserDuckDuckGoImageSearch) and remove the old DDGS-based DuckDuckGo implementation. Improve Bing browser scraper: add page timeouts, hard alarm timeout, safer scrolling, and return lists instead of yielding generators. Add image_download_timeout setting to config and use it for page navigation when downloading images. Refactor CharacterReferenceDownloader: change candidate collection to return a consensus embedding early when confident, implement consensus clustering/selection, replace previous consensus & scoring helpers with consensus-aware scoring, and guard against insufficient consensus. Update exports to expose the new browser-based search classes and remove the deprecated implementation.
Enable chunked transcription for long audio files by adding a max_chunk_duration_seconds config (default 1800s) and wiring it through the step config and pipeline. WhisperEngine now checks audio duration via ffprobe and, if needed, splits input with ffmpeg into chunks, transcribes each chunk, adjusts segment ids/timestamps (and word timestamps), and concatenates text and segments. Also minor defaults: set max_parallel_episodes=1 in defaults and pipeline. Note: this adds ffmpeg/ffprobe usage and temporary file handling.
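The timestamp and id adjustment when concatenating chunk results can be sketched as follows. The result shape (`text`, `segments`, `words` with `start`/`end`) mirrors Whisper-style output but is an assumption here, as are the function names; the actual `WhisperEngine` code is not shown in this PR description.

```python
from typing import Any, Dict, List


def chunk_offsets(total_duration: float, max_chunk_duration: float = 1800.0) -> List[float]:
    """Start time (seconds) of each chunk for an audio file of the given length."""
    offsets: List[float] = []
    start = 0.0
    while start < total_duration:
        offsets.append(start)
        start += max_chunk_duration
    return offsets


def merge_chunk_transcripts(
    chunk_results: List[Dict[str, Any]],
    offsets: List[float],
) -> Dict[str, Any]:
    """Concatenate per-chunk results, renumbering segment ids and shifting
    segment and word timestamps by each chunk's start offset."""
    merged_segments: List[Dict[str, Any]] = []
    texts: List[str] = []
    next_id = 0
    for result, offset in zip(chunk_results, offsets):
        texts.append(result.get("text", "").strip())
        for segment in result.get("segments", []):
            shifted = dict(segment)
            shifted["id"] = next_id
            shifted["start"] = segment["start"] + offset
            shifted["end"] = segment["end"] + offset
            shifted["words"] = [
                {**word, "start": word["start"] + offset, "end": word["end"] + offset}
                for word in segment.get("words", [])
            ]
            merged_segments.append(shifted)
            next_id += 1
    return {"text": " ".join(t for t in texts if t), "segments": merged_segments}
```

Each chunk is transcribed with timestamps relative to its own start, so the offset must be added to both segment-level and word-level times.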
from bot.services.reindex.zip_extractor import ZipExtractor
from bot.settings import settings
from preprocessor.search.elastic_manager import ElasticSearchManager
from preprocessor.search.elastic_manager import ElasticSearchManager # pylint: disable=no-name-in-module
Collaborator

Suggested change
from preprocessor.search.elastic_manager import ElasticSearchManager # pylint: disable=no-name-in-module
from preprocessor.search.elastic_manager import ElasticSearchManager

dam2452 added 8 commits March 18, 2026 11:32
Stop creating /app/output_data/characters, /app/output_data/scraped_pages, and /app/output_data/processing_metadata in the Dockerfile mkdir step. These runtime/output directories are likely managed or mounted elsewhere, so only model directories (including /models/emotion_model) are created during image build.
Introduce preprocessor/scripts/split_double_episodes.py: a CLI tool to detect and split double-episode video files and renumber episodes per season. It probes video duration with ffprobe, uses a TransNetV2 wrapper for scene detection, and refines split points by scanning for black frames with ffmpeg's blackdetect. Files are renamed using an SxxExx pattern (appends _SPECIAL for single-episode specials) and output to season subdirectories; splitting is performed with ffmpeg (hevc_nvenc settings) while specials are copied. The script supports dry-run mode and configurable options for scene threshold, minimum scene length, and black-frame scan window.
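A minimal sketch of the split-point refinement and the renaming pattern described above. It assumes the black intervals have already been parsed from ffmpeg's `blackdetect` output into `(start, end)` pairs; the function names, the 30-second default window, and the two-digit zero padding are illustrative assumptions, not taken from the script.

```python
from typing import List, Optional, Tuple


def refine_split_point(
    candidate: float,
    black_intervals: List[Tuple[float, float]],
    scan_window: float = 30.0,  # hypothetical default, in seconds
) -> float:
    """Snap a scene-cut candidate to the midpoint of the nearest black
    interval within the scan window; keep the candidate if none is found."""
    best: Optional[Tuple[float, float]] = None  # (distance, midpoint)
    for start, end in black_intervals:
        midpoint = (start + end) / 2
        distance = abs(midpoint - candidate)
        if distance <= scan_window and (best is None or distance < best[0]):
            best = (distance, midpoint)
    return best[1] if best is not None else candidate


def episode_name(season: int, episode: int, special: bool = False) -> str:
    """Build an SxxExx name, with _SPECIAL appended for specials."""
    suffix = "_SPECIAL" if special else ""
    return f"S{season:02d}E{episode:02d}{suffix}"
```

Cutting in the middle of a black interval keeps the cut away from visible frames on either side.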
Add a new compare_scribe_models script to submit and compare ElevenLabs scribe_v1/scribe_v2 transcriptions (with optional Whisper via Docker), poll for results, and save JSON/SRT outputs. Change default ElevenLabs model in config to scribe_v2. Increase diarization speaker cap (num_speakers=32) in ElevenLabs engine requests. Broaden sound-event detection regex to match parentheses or square brackets. Accept both '11labs' and 'elevenlabs' modes when creating the ElevenLabs engine. Add .gitignore entry for preprocessor scripts output.
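For illustration, a sound-event pattern that accepts both bracket styles might look like this. The pattern and helper name are assumptions; the regex actually used in the engine is not quoted in this commit message.

```python
import re
from typing import List

# Matches "(laughter)" or "[music]"; each alternative requires its own
# matching closing bracket, so "(mixed]" is not treated as an event.
SOUND_EVENT_RE = re.compile(r"\(([^()]+)\)|\[([^\[\]]+)\]")


def find_sound_events(text: str) -> List[str]:
    """Return the inner text of every bracketed sound event in a transcript line."""
    return [m.group(1) or m.group(2) for m in SOUND_EVENT_RE.finditer(text)]
```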
Introduce series-wide face clustering and support using labeled cluster folders as character reference sources. Key changes:

- Add SeriesFaceClusteringStep to extract embeddings from all frames, cluster them, create numbered cluster folders and write a _cluster_index.json.
- Add ClusterFolderManager service to create cluster folders, extract dominant face vectors from cluster folders, and manage labeled folder checks.
- Extend CharacterReferenceProcessorStep to support a "clusters" reference_source: validate labeled cluster folders, extract per-character vectors from labeled folders, and emit metadata; preserve existing web-based processing.
- Add configuration fields: CharacterReferencesConfig.source (default 'web') and CharacterReferenceProcessorConfig.reference_source (Literal["web","clusters"]). SeriesConfig parsing uses the new source field.
- Wire pipeline_factory to conditionally run series clustering and adjust step phases/dependencies when reference_source == "clusters"; register steps in correct order.
- Export ClusterFolderManager from services.characters.__init__ and add SeriesFaceClusteringConfig placeholder.

Defaults retain existing web scraping behavior; the new cluster flow enables manual labeling of clusters for improved character reference vectors.
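The `reference_source` switch and the conditional wiring can be sketched as below. Only the field name, its two allowed values, and the `'web'` default come from the description above; the dataclass, the validation, and the step names are hypothetical stand-ins for the real config classes and `pipeline_factory`.

```python
from dataclasses import dataclass
from typing import List, Literal, get_args

ReferenceSource = Literal["web", "clusters"]


@dataclass(frozen=True)
class ReferenceConfigSketch:
    """Hypothetical stand-in for CharacterReferenceProcessorConfig."""
    reference_source: ReferenceSource = "web"

    def __post_init__(self) -> None:
        # Literal is not enforced at runtime, so validate explicitly.
        if self.reference_source not in get_args(ReferenceSource):
            raise ValueError(f"unknown reference_source: {self.reference_source!r}")


def planned_steps(config: ReferenceConfigSketch) -> List[str]:
    """Series clustering is registered only when references come from clusters."""
    steps: List[str] = []
    if config.reference_source == "clusters":
        steps.append("series_face_clustering")  # hypothetical step name
    steps.append("character_reference_processor")  # hypothetical step name
    return steps
```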
Delete the per-episode FaceClusteringStep and its FaceClusterValidator and remove the FaceClusteringConfig and ClusterData artifact. Update pipeline_factory and vision step exports to no longer import or register the removed step/config. Adjust validation wiring and EpisodeStats to stop referencing the removed validator.

Refactor cluster folder handling and face data: ClusterFolderManager now creates separate 'frames' and 'faces' subfolders, ranks frames with bbox info, saves cropped face images via a new _save_face_crop helper, and falls back to legacy folder layout when reading frames. FaceClusterer now includes bbox tuples in extracted face entries to support cropping. These changes prepare for series-level clustering and consolidate face-clustering responsibilities.
Tighten clustering thresholds and improve cluster output handling. Increase min_face_px to 40 and min_det_score to 0.4 to reduce false positives. ClusterFolderManager now collects faces labeled as noise and writes them to a `_noise` folder instead of skipping them. SeriesFaceClusteringStep now imports json and creates empty character label folders from the series' <series>_characters.json (logging the count) to aid downstream workflows; it still writes the cluster index as before.
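A rough sketch of how the tightened thresholds and the `_noise` routing might fit together. It assumes `min_face_px` applies to the shorter side of the bounding box and that noise faces carry a negative label (as in DBSCAN-style clusterers); both are assumptions, as are the function names and the `cluster_NNN` folder format.

```python
from typing import Tuple


def passes_thresholds(
    bbox: Tuple[int, int, int, int],
    det_score: float,
    min_face_px: int = 40,
    min_det_score: float = 0.4,
) -> bool:
    """Keep a face only if it is large enough and detected confidently enough."""
    x1, y1, x2, y2 = bbox
    shorter_side = min(x2 - x1, y2 - y1)
    return shorter_side >= min_face_px and det_score >= min_det_score


def folder_for_label(label: int) -> str:
    """Noise faces go to _noise instead of being dropped."""
    return "_noise" if label < 0 else f"cluster_{label:03d}"
```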
2 participants