⚡ [Feature] Implement automatic time data import from PDF and Excel files #2

@PPeitsch

Problem Description

Currently, users need to manually enter all time records into the system, which is time-consuming and error-prone. Many organizations already generate time reports in PDF or Excel formats that contain all the necessary information (like the example PDF shared). We need a way to automatically import and parse these documents to save users time and reduce entry errors.

Proposed Solution

Implement a file import system that can:

  1. Accept PDF and spreadsheet (XLSX, XLS, CSV) files
  2. Parse several common time report formats
  3. Extract the date, entry and exit times, and observations
  4. Validate the extracted data
  5. Preview the parsed data before importing
  6. Import valid entries into the TimeTrack database
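
As a rough illustration of steps 3 to 5, the sketch below normalizes raw rows into typed records and splits them into importable entries and per-row errors for a preview. The `TimeEntry` shape, the field names, and the ISO date / 24-hour time formats are assumptions for illustration, not the TimeTrack schema. It also assumes shifts do not cross midnight.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, datetime, time


@dataclass(frozen=True)
class TimeEntry:
    """One normalized time record (hypothetical shape, not the TimeTrack schema)."""
    day: date
    entry: time
    exit: time
    observation: str = ""


def parse_row(raw: dict) -> TimeEntry:
    """Convert raw values from a parser into a typed TimeEntry.

    Assumes ISO dates (YYYY-MM-DD) and 24-hour HH:MM times.
    Raises ValueError when a field is missing or malformed.
    """
    try:
        day = datetime.strptime(str(raw["date"]).strip(), "%Y-%m-%d").date()
        entry = datetime.strptime(str(raw["entry"]).strip(), "%H:%M").time()
        exit_ = datetime.strptime(str(raw["exit"]).strip(), "%H:%M").time()
    except KeyError as exc:
        raise ValueError(f"missing field: {exc.args[0]}") from exc
    # Assumption: a shift never crosses midnight, so exit must follow entry.
    if exit_ <= entry:
        raise ValueError("exit time must be after entry time")
    return TimeEntry(day, entry, exit_, str(raw.get("observation", "")).strip())


def build_preview(rows: list[dict]) -> tuple[list[TimeEntry], list[tuple[int, str]]]:
    """Split rows into importable entries and (row_index, error) pairs."""
    valid: list[TimeEntry] = []
    errors: list[tuple[int, str]] = []
    for index, raw in enumerate(rows):
        try:
            valid.append(parse_row(raw))
        except ValueError as exc:
            errors.append((index, str(exc)))
    return valid, errors
```

The preview screen could show `valid` as the rows to be imported and `errors` as rows needing attention, so nothing is written to the database until the user confirms.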

PDF Parser Options

Several libraries could be used for PDF parsing:

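  • tabula-py: Extracts tables from text-based PDFs; it wraps tabula-java, so it requires a Java runtime
  • pdfplumber: Good alternative for text and table extraction that doesn't require Java
  • PyPDF2 (now maintained as pypdf) + regex: Lighter solution for simpler PDFs
  • pytesseract: OCR for scanned PDFs; requires the Tesseract engine, and pages must first be rendered to images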
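
To give a feel for the pdfplumber option, here is a minimal sketch that reads the first detected table on each page and turns it into records. It assumes a text-based PDF whose tables start with a header row on every page; the function names are hypothetical. pdfplumber is imported lazily so the pure helper can be used and tested without the dependency installed.

```python
from __future__ import annotations

from typing import Iterator


def rows_to_records(table: list[list[str | None]]) -> list[dict[str, str]]:
    """Turn an extracted table (header row first) into a list of dicts."""
    if not table:
        return []
    # Header cells become lower-cased keys; missing cells become "".
    header = [(cell or "").strip().lower() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):  # skip fully empty rows
            continue
        records.append(dict(zip(header, cells)))
    return records


def extract_pdf_records(path: str) -> Iterator[dict[str, str]]:
    """Yield records from the first table found on each page of a PDF."""
    import pdfplumber  # lazy import: optional third-party dependency

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            table = page.extract_table()  # None when no table is detected
            if table:
                yield from rows_to_records(table)
```

Scanned PDFs would yield no tables here and would need the OCR route instead.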

Excel Parser Options

For Excel files, we could use:

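  • pandas: Reads Excel and CSV files into DataFrames (read_excel, read_csv); relies on openpyxl or xlrd to open workbooks
  • openpyxl: Pure-Python library for reading and writing .xlsx files
  • xlrd/xlwt: For the legacy .xls format (xlrd 2.0 and later reads only .xls)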
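
A possible starting point for the spreadsheet side, sketched with the standard-library csv module and openpyxl. It assumes the first row is a header and that every cell can be treated as text; the function names are made up for this example. Note that openpyxl returns datetime objects for date cells, so a real importer would format those explicitly instead of calling `str()`.

```python
from __future__ import annotations

import csv
import io
from typing import Iterable


def records_from_rows(rows: Iterable[Iterable[object]]) -> list[dict[str, str]]:
    """Use the first row as the header and stringify every remaining cell."""
    iterator = iter(rows)
    try:
        header = [str(cell or "").strip().lower() for cell in next(iterator)]
    except StopIteration:  # no rows at all
        return []
    records = []
    for row in iterator:
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def read_csv_text(text: str) -> list[dict[str, str]]:
    """Parse CSV content that has already been decoded to text."""
    return records_from_rows(csv.reader(io.StringIO(text)))


def read_xlsx(path: str) -> list[dict[str, str]]:
    """Read the active sheet of an .xlsx workbook."""
    from openpyxl import load_workbook  # lazy import: optional dependency

    # data_only=True returns cached formula results instead of formula text.
    workbook = load_workbook(path, read_only=True, data_only=True)
    try:
        return records_from_rows(workbook.active.iter_rows(values_only=True))
    finally:
        workbook.close()
```

Both readers return the same list-of-dicts shape, which keeps the later validation step independent of the file type.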

Architecture Considerations

To make this system modular and extensible, we should consider a protocol-based plugin architecture, referred to in this issue as the "Model Context Protocol (MCP)" approach:

  1. Create a base ImporterProtocol interface that all parsers implement
  2. Develop context-specific parsers for different file formats and layouts
  3. Implement a factory pattern to select the appropriate parser based on file type and content
  4. Use adapter pattern to normalize all extracted data to a common format
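
The four points above could fit together roughly as follows. This is only a sketch: `ImporterProtocol` is the name proposed in this issue, but its methods (`can_handle`, `parse`), the concrete importer classes, the factory, and the `normalize` adapter are all assumptions. The factory here selects by file extension only; content sniffing would be a later refinement.

```python
from __future__ import annotations

import csv
from pathlib import Path
from typing import Protocol


class ImporterProtocol(Protocol):
    """Interface every parser implements (method names are assumptions)."""

    def can_handle(self, path: Path) -> bool: ...
    def parse(self, path: Path) -> list[dict[str, str]]: ...


class CsvImporter:
    def can_handle(self, path: Path) -> bool:
        return path.suffix.lower() == ".csv"

    def parse(self, path: Path) -> list[dict[str, str]]:
        with path.open(newline="", encoding="utf-8") as handle:
            return [dict(row) for row in csv.DictReader(handle)]


class PdfImporter:
    def can_handle(self, path: Path) -> bool:
        return path.suffix.lower() == ".pdf"

    def parse(self, path: Path) -> list[dict[str, str]]:
        raise NotImplementedError("PDF parsing is out of scope for this sketch")


class ImporterFactory:
    """Returns the first registered importer that accepts the file."""

    def __init__(self, importers: list[ImporterProtocol]) -> None:
        self._importers = importers

    def for_file(self, path: Path) -> ImporterProtocol:
        for importer in self._importers:
            if importer.can_handle(path):
                return importer
        raise ValueError(f"unsupported file type: {path.suffix or path.name}")


def normalize(record: dict[str, str], column_map: dict[str, str]) -> dict[str, str]:
    """Adapter step: rename layout-specific columns to the common field names."""
    return {column_map[key]: value for key, value in record.items() if key in column_map}
```

Because `ImporterProtocol` is a `typing.Protocol`, new parsers only need matching methods and do not have to inherit from a base class. A layout with Spanish headers, for instance, could be adapted with `normalize(record, {"Fecha": "date", "Entrada": "entry", "Salida": "exit"})`.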

Relevant Projects Using MCP

Some projects that could serve as references:

  • Parsito: A modular parsing toolkit using protocol-based design
  • Structlog: Uses a protocol-based approach for configurable logging
  • Pydantic: For data validation and settings management

Implementation Steps

  1. Create a file upload interface in the UI
  2. Implement the base importer protocol and factory
  3. Create PDF importers starting with the most common format
  4. Add Excel importers
  5. Build validation and preview features
  6. Implement the final import process
  7. Add testing with sample files

Questions

  • Should we support a "template" system where users can define custom formats?
  • Do we need to handle continuous imports (e.g., monthly automated imports)?
  • Should we implement a correction system for incorrectly parsed entries?

Acceptance Criteria

  • Users can upload PDF and Excel files through a web interface
  • System correctly parses at least 3 common time report formats
  • Users can preview parsed data before committing to import
  • Duplicate prevention mechanism is in place
  • Malformed or unsupported files are rejected with a clear error message
  • Documentation covers the supported formats and how to use the feature
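
For the duplicate prevention criterion, one simple option is to key each record on its date and times and skip anything already present. The sketch below assumes normalized records with `date`, `entry`, and `exit` fields and a set of keys loaded from existing data; both are illustrative choices, and a unique constraint in the database would be a sensible second line of defence.

```python
from __future__ import annotations

from typing import Iterable

EntryKey = tuple[str, str, str]


def entry_key(record: dict[str, str]) -> EntryKey:
    """Identity of a record for duplicate detection (assumed: date + times)."""
    return (record["date"], record["entry"], record["exit"])


def drop_duplicates(
    incoming: Iterable[dict[str, str]],
    existing_keys: Iterable[EntryKey],
) -> tuple[list[dict[str, str]], list[dict[str, str]]]:
    """Return (fresh, skipped): records to import and records already known.

    Duplicates inside the uploaded file itself are skipped as well.
    """
    seen = set(existing_keys)
    fresh: list[dict[str, str]] = []
    skipped: list[dict[str, str]] = []
    for record in incoming:
        key = entry_key(record)
        if key in seen:
            skipped.append(record)
        else:
            seen.add(key)
            fresh.append(record)
    return fresh, skipped
```

Returning the skipped records, rather than silently dropping them, lets the preview tell the user which rows were ignored and why.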
