[FR] Kreuzberg as an alternative content extractor to Apache Tika #2507

@cpollmann

Is your feature request related to a problem? Please describe.

According to the .env.example in the opencloud-compose repository, Apache Tika is disabled as a search extractor by default for performance reasons.

Describe the solution you'd like

I would like to propose adding support for an additional content extractor based on the Kreuzberg project.

From Kreuzberg's README:

Extract text and metadata from a wide range of file formats (91+), generate embeddings and post-process at native speeds without needing a GPU.

Flexible deployment – Use as library, CLI tool, REST API server, or MCP server

Describe alternatives you've considered

N/A

Additional context

I am not a developer, so I cannot estimate the effort required to develop such a content extractor for OpenCloud, nor can I validate the quality or production-readiness of the Kreuzberg project. However, I would welcome feedback on the idea in general.

My research showed that the co-founder of Kreuzberg is also located in Berlin! ;)
