Add async parallel processing for batch operations #12

@ahmed-sekka

Description

Implement parallel processing for batch operations so that large document sets are spread across multiple workers instead of being handled one document at a time.

Current behavior

  • Documents processed sequentially
  • Single-threaded

Proposed behavior

# Process with 4 parallel workers
ragctl batch ./documents --workers 4 --output ./chunks/
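
One way the proposed option could be parsed is sketched below. This is a minimal sketch using the standard-library `argparse`; the issue does not say which CLI framework ragctl uses, so `build_parser`, `positive_int`, and the `input_dir` argument name are hypothetical. Only `--workers` / `-j` (default 1) and `--output` come from the issue.

```python
import argparse


def positive_int(value: str) -> int:
    """Parse a worker count and reject values below 1 (hypothetical helper)."""
    count = int(value)
    if count < 1:
        raise argparse.ArgumentTypeError("worker count must be at least 1")
    return count


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser for `ragctl batch` (assumed layout)."""
    parser = argparse.ArgumentParser(prog="ragctl")
    commands = parser.add_subparsers(dest="command", required=True)

    batch = commands.add_parser("batch", help="process a directory of documents")
    batch.add_argument("input_dir", help="directory containing the documents")
    # Default of 1 keeps today's sequential behavior unless the user opts in.
    batch.add_argument("-j", "--workers", type=positive_int, default=1,
                       help="number of parallel workers")
    batch.add_argument("--output", default=None, help="directory for chunk output")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"processing {args.input_dir} with {args.workers} worker(s)")
```

Defaulting to a single worker means existing invocations behave exactly as before.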

Expected improvements

  • 3-5x speedup on multi-core systems
  • Better CPU utilization
  • Configurable worker count

Tasks

  • Add --workers / -j option (default: 1)
  • Implement process pool or thread pool executor
  • Handle errors gracefully per-worker
  • Aggregate results correctly
  • Show per-worker progress
  • Add worker count to history/logs
  • Benchmark and document performance gains
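
The executor, per-worker error handling, and aggregation tasks above could be sketched roughly as follows. This is an illustration under stated assumptions, not ragctl's implementation: `process_document`, `BatchResult`, and `run_batch` are hypothetical names, and the per-document work is a placeholder. The executor class is a parameter so that a thread pool or a process pool can be swapped in.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field


@dataclass
class BatchResult:
    """Aggregated outcome of one batch run (hypothetical structure)."""
    chunks: dict = field(default_factory=dict)   # path -> list of chunks
    errors: dict = field(default_factory=dict)   # path -> error message


def process_document(path: str) -> list:
    """Placeholder for the real per-document work (hypothetical)."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return [f"{path}#chunk0", f"{path}#chunk1"]


def run_batch(paths, workers=1, worker_fn=process_document,
              executor_cls=ThreadPoolExecutor) -> BatchResult:
    """Process every path, keeping failures separate from successes."""
    result = BatchResult()
    with executor_cls(max_workers=workers) as pool:
        # Map each future back to its path so results can be attributed.
        futures = {pool.submit(worker_fn, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                result.chunks[path] = future.result()
            except Exception as exc:
                # One failing document is recorded and does not abort the batch.
                result.errors[path] = str(exc)
    return result


if __name__ == "__main__":
    outcome = run_batch(["a.pdf", "b.bad", "c.pdf"], workers=2)
    print(f"{len(outcome.chunks)} succeeded, {len(outcome.errors)} failed")
```

Because results are keyed by path rather than by completion order, the aggregated output does not depend on which worker finishes first.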

Technical considerations

  • Use concurrent.futures.ProcessPoolExecutor for CPU-bound OCR
  • Use ThreadPoolExecutor for I/O-bound operations
  • Ensure history writing is safe under concurrent workers (thread-safe, and process-safe if a process pool is used)
  • Handle keyboard interrupt gracefully
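
The history-writing and interrupt-handling points above could look something like this. It is a sketch under assumptions: `HistoryWriter`, `run_with_history`, and the JSON-lines history format are hypothetical, and the lock only protects threads within one process. With a process pool, one option would be to write history from the parent process as results arrive.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


class HistoryWriter:
    """Append-only JSON-lines history guarded by a lock (hypothetical)."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._lock = threading.Lock()

    def record(self, entry: dict) -> None:
        line = json.dumps(entry, sort_keys=True)
        # Serialize writers so lines from different threads never interleave.
        with self._lock:
            with open(self._path, "a", encoding="utf-8") as handle:
                handle.write(line + "\n")


def _process_and_record(path, worker_fn, history, workers):
    chunks = worker_fn(path)
    # The worker count is stored with each entry, as the task list asks.
    history.record({"path": path, "chunks": len(chunks), "workers": workers})
    return path


def run_with_history(paths, worker_fn, history, workers=1) -> int:
    """Run worker_fn over paths; on Ctrl-C, cancel work that has not started."""
    pool = ThreadPoolExecutor(max_workers=workers)
    completed = 0
    try:
        futures = [pool.submit(_process_and_record, path, worker_fn, history, workers)
                   for path in paths]
        for future in as_completed(futures):
            future.result()
            completed += 1
    except KeyboardInterrupt:
        # cancel_futures requires Python 3.9+; queued tasks are dropped.
        pool.shutdown(wait=False, cancel_futures=True)
        raise
    finally:
        # Tasks already running are allowed to finish before returning.
        pool.shutdown(wait=True)
    return completed
```

Cancelling only the queued futures lets in-flight documents finish, so the history file never ends on a half-written entry.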

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request)
