Data Ingestion Pipeline #84

@emincalyakaisskar

Description

WHY: As a user, I want the chatbot to give relevant answers, which requires that the data available to the RAG be quickly accessible and kept up to date by a real data ingestion system.

DoD:

  • Identify the data sources for each data type (audio, video, text, web, PDF) in our use case
  • Build one pipeline per data type: Video/Audio Pipeline, PDF Pipeline, Web Pipeline, Plain Text Pipeline
  • Add connectors capable of retrieving data from these sources (with the appropriate access rights, supporting both full retrieval and updates), one connector per pipeline
  • Define triggers according to source type (event-driven, cron, manual)
  • Manage data transformation (Video -> Audio -> Transcript, PDF -> OCR -> Text, HTML parsing -> Text, ...)
  • Define chunking strategies adapted to each data type
  • Quality control
  • Metadata (origin and provenance of data)
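
To make the DoD items above more concrete, here is a minimal sketch of how one pipeline (plain text) could wire a connector, a transformation step, a chunking strategy, a quality check, and provenance metadata together. Every name in it (`Document`, `Chunk`, `Connector`, `InMemoryTextConnector`, `chunk_by_paragraph`, `passes_quality_check`, `run_pipeline`) and every threshold is hypothetical and chosen for illustration only; the issue does not prescribe any of them, and a real connector would read from an actual source rather than from memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Iterable, Protocol


@dataclass
class Document:
    """Raw item returned by a connector (hypothetical shape)."""
    text: str
    source: str      # origin of the data, e.g. a file path or URL
    data_type: str   # "text", "pdf", "web", "audio" or "video"
    fetched_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class Chunk:
    """Unit handed to the RAG index, with its provenance metadata."""
    text: str
    metadata: dict


class Connector(Protocol):
    """One connector per pipeline: it only knows how to fetch documents."""
    def fetch(self) -> Iterable[Document]: ...


class InMemoryTextConnector:
    """Stand-in connector for the sketch; maps a source name to its text."""
    def __init__(self, items: dict[str, str]) -> None:
        self.items = items

    def fetch(self) -> Iterable[Document]:
        for source, text in self.items.items():
            yield Document(text=text, source=source, data_type="text")


def chunk_by_paragraph(text: str, max_chars: int = 200) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of up to max_chars.

    A single paragraph longer than max_chars is kept whole, not split.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def passes_quality_check(chunk_text: str, min_chars: int = 20) -> bool:
    """Very naive quality control: reject chunks that are too short."""
    return len(chunk_text.strip()) >= min_chars


def run_pipeline(
    connector: Connector,
    transform: Callable[[str], str],
    chunker: Callable[[str], list[str]],
) -> list[Chunk]:
    """Fetch -> transform -> chunk -> quality check -> attach metadata."""
    result: list[Chunk] = []
    for doc in connector.fetch():
        text = transform(doc.text)
        for index, piece in enumerate(chunker(text)):
            if not passes_quality_check(piece):
                continue
            result.append(Chunk(
                text=piece,
                metadata={
                    "source": doc.source,
                    "data_type": doc.data_type,
                    "fetched_at": doc.fetched_at.isoformat(),
                    "chunk_index": index,
                },
            ))
    return result
```

Under these assumptions, the other pipelines would differ only in the connector and in the `transform` and `chunker` callables passed to `run_pipeline` (for example, an OCR step for PDFs or a transcript step for audio), while triggers (event-driven, cron, manual) would decide when `run_pipeline` is called.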
