Conversation

@Bobholamovic

Related Issues

Proposed Changes:

This is a new feature that adds the official PaddleOCR integration for Haystack, providing a PaddleOCR-VL document converter component. The component leverages PaddleOCR's API for document parsing and supports text extraction from both PDF and image files.

How did you test it?

This PR adds a unit test suite covering initialization, parameter validation, file type inference, and API calls. The tests exercise both PDF and image file processing.

Notes for the reviewer

Checklist

@Bobholamovic Bobholamovic requested a review from a team as a code owner November 26, 2025 12:33
@Bobholamovic Bobholamovic requested review from mpangrazzi and removed request for a team November 26, 2025 12:33
@CLAassistant

CLAassistant commented Nov 26, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the type:documentation Improvements or additions to documentation label Nov 26, 2025
@anakin87 anakin87 self-requested a review November 26, 2025 13:43
@anakin87 anakin87 mentioned this pull request Nov 26, 2025
9 tasks
Member

@anakin87 anakin87 left a comment

Hey... thanks for the implementation!

I created #2569 to track the work to be done.

I don't think that I will be able to review this PR in detail soon, but in the meantime, please add a CI workflow similar to https://github.com/deepset-ai/haystack-core-integrations/blob/main/.github/workflows/anthropic.yml to make sure that tests run in the CI.

@Bobholamovic
Author

Thanks. The CI workflow has been added.

Member

@anakin87 anakin87 left a comment

Thanks for the implementation.

I did a first pass and found some opportunities for improvement...

@@ -0,0 +1,29 @@
loaders:
Member

This is no longer needed. Let's remove this file and keep config_docusaurus.yml only.

fail-fast: false
matrix:
  os: [ubuntu-latest, windows-latest, macos-latest]
  python-version: ["3.9", "3.12"]
Member

The reason for not including 3.13 is that the PaddleOCR package is not compatible with it, right?


[project.urls]
Documentation = "https://github.com/haystack-core-integrations/tree/main/integrations/paddleocr#readme"
Issues = "https://github.com/haystack-core-integrations/paddleocr/issues"
Member

Suggested change
- Issues = "https://github.com/haystack-core-integrations/paddleocr/issues"
+ Issues = "https://github.com/haystack-core-integrations/issues"

dependencies = ["haystack-pydoc-tools", "ruff"]

[tool.hatch.envs.default.scripts]
docs = ["pydoc-markdown pydoc/config.yml"]
Member

Suggested change
- docs = ["pydoc-markdown pydoc/config.yml"]
+ docs = ["pydoc-markdown pydoc/config_docusaurus.yml"]

We recently changed this command in all integrations.

Comment on lines +143 to +144
self,
api_url: str,
Member

Suggested change
- self,
- api_url: str,
+ self,
+ *,
+ api_url: str,
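
For context, a minimal sketch of what the bare `*` in the suggestion does: every parameter after it becomes keyword-only. The class name `ConverterSketch` and the `timeout` parameter are hypothetical stand-ins, not taken from the PR.

```python
class ConverterSketch:
    """Hypothetical stand-in for the converter; only the signature matters here."""

    def __init__(self, *, api_url: str, timeout: float = 30.0):
        # Everything after the bare `*` can only be passed by keyword.
        self.api_url = api_url
        self.timeout = timeout  # hypothetical parameter, for illustration only


# A keyword call works as usual.
converter = ConverterSketch(api_url="https://example.invalid/ocr")

# A positional call is rejected with a TypeError.
positional_error = None
try:
    ConverterSketch("https://example.invalid/ocr")
except TypeError as err:
    positional_error = str(err)
```

Keyword-only parameters let the signature grow or be reordered later without breaking existing callers.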

logger = logging.getLogger(__name__)


FileTypeInput: TypeAlias = Union[Literal["pdf", "image", 0, 1], None]
Member

Since this is meant to be provided by the user and later converted to FileType, can we just restrict it to
FileTypeInput: TypeAlias = Union[Literal["pdf", "image"], None]?
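
A rough sketch of how the restricted alias could map onto the internal type. The `FileType` enum below, its integer values, and the `normalize_file_type` helper are assumptions made for illustration; the PR's actual definitions are not shown in this thread.

```python
from enum import Enum
from typing import Literal, Optional

# User-facing input: only the two string names, or None to infer the type.
FileTypeInput = Optional[Literal["pdf", "image"]]


class FileType(Enum):
    # Hypothetical internal enum; the integer values are an assumption.
    PDF = 0
    IMAGE = 1


def normalize_file_type(file_type: FileTypeInput) -> Optional[FileType]:
    """Convert the user-facing string to the internal enum (None passes through)."""
    if file_type is None:
        return None
    try:
        return FileType[file_type.upper()]
    except (KeyError, AttributeError):
        # KeyError: unknown string; AttributeError: non-string input such as 0 or 1.
        msg = f"file_type must be 'pdf', 'image', or None, got {file_type!r}"
        raise ValueError(msg) from None
```

With this shape, the integer codes stay an internal detail and never appear in the public signature.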

_PDF_EXTENSIONS = {".pdf"}


def _infer_file_type_from_source(source: Union[str, Path, ByteStream], bytestream: ByteStream) -> Optional[FileType]:
Member

Suggested change
- def _infer_file_type_from_source(source: Union[str, Path, ByteStream], bytestream: ByteStream) -> Optional[FileType]:
+ def _infer_file_type_from_source(source: Union[str, Path, ByteStream], mime_type: Optional[str] = None) -> Optional[FileType]:

What about changing the signature to something like this (and adjusting the implementation accordingly)? This would be clearer...
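
One way the adjusted implementation might look, as a sketch only. To stay self-contained it accepts just `str` and `Path` sources (the `ByteStream` branch is left out), and the `FileType` enum, the image extension set, and the MIME checks are assumptions rather than the PR's actual code.

```python
from enum import Enum
from pathlib import Path
from typing import Optional, Union


class FileType(Enum):
    # Hypothetical stand-in for the integration's enum.
    PDF = 0
    IMAGE = 1


_PDF_EXTENSIONS = {".pdf"}
# Assumed list; the PR's actual set of image extensions is not shown here.
_IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".webp"}


def infer_file_type(source: Union[str, Path], mime_type: Optional[str] = None) -> Optional[FileType]:
    """Infer the file type from an explicit MIME type first, then from the file extension."""
    if mime_type:
        normalized = mime_type.lower()
        if normalized == "application/pdf":
            return FileType.PDF
        if normalized.startswith("image/"):
            return FileType.IMAGE

    suffix = Path(source).suffix.lower()
    if suffix in _PDF_EXTENSIONS:
        return FileType.PDF
    if suffix in _IMAGE_EXTENSIONS:
        return FileType.IMAGE
    return None  # Unknown type; the caller decides how to handle it.
```

Passing `mime_type` directly makes it explicit which piece of the bytestream the function depends on, and it keeps the helper easy to unit test.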

)


def download_test_file(url, dest_path, timeout=30):
Member

I would add the test files to this repo. See for example https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/anthropic/tests/test_files

This would remove flakiness due to possible network issues.
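
A small sketch of reading fixtures from a checked-in directory instead of downloading them. The helper name `load_test_file` and the file names are hypothetical.

```python
from pathlib import Path


def load_test_file(test_files_dir: Path, name: str) -> bytes:
    """Read a fixture file checked into the repository; no network access involved."""
    path = test_files_dir / name
    if not path.is_file():
        msg = f"Missing test file: {path}"
        raise FileNotFoundError(msg)
    return path.read_bytes()
```

In a test module, the directory would typically be resolved relative to the module itself, for example `Path(__file__).parent / "test_files"`.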

Member

Would it be possible to include some examples in English instead of Chinese? This would help us understand the model's behavior more easily and make future maintenance simpler.

Comment on lines +392 to +395
@pytest.fixture
def integration_enabled(self):
    """Check if integration tests should run."""
    return bool(os.environ.get("PADDLEOCR_VL_API_URL") and os.environ.get("AISTUDIO_ACCESS_TOKEN"))
Member

Suggested change
- @pytest.fixture
- def integration_enabled(self):
-     """Check if integration tests should run."""
-     return bool(os.environ.get("PADDLEOCR_VL_API_URL") and os.environ.get("AISTUDIO_ACCESS_TOKEN"))

This does not seem to be used. Instead, you are using pytest.mark.skipif, which is consistent with other integrations.



@component
class PaddleOCRVLDocumentConverter:
Member

Could you explain why this is named PaddleOCRVLDocumentConverter instead of a more generic PaddleOCRDocumentConverter? Is this component tied to a specific model, or could it be used with other models?

Labels

topic:CI type:documentation Improvements or additions to documentation

3 participants