feat: Add PaddleOCR-VL document converter #2567

Bobholamovic · 2025-11-26T12:33:10Z

Related Issues

Proposed Changes:

This is a new feature that adds the official PaddleOCR integration for Haystack, providing a PaddleOCR-VL document converter component. The component leverages PaddleOCR's API for document parsing and supports text extraction from both PDF and image files.

How did you test it?

This PR includes a unit test suite covering initialization, parameter validation, file type inference, and API calls. The tests exercise both PDF and image file processing.

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.

CLAassistant · 2025-11-26T12:33:18Z

All committers have signed the CLA.

anakin87

Hey... thanks for the implementation!

I created #2569 to track the work to be done.

I don't think that I will be able to review this PR in detail soon, but in the meantime, please add a CI workflow similar to https://github.com/deepset-ai/haystack-core-integrations/blob/main/.github/workflows/anthropic.yml to make sure that tests run in the CI.

Bobholamovic · 2025-11-27T03:38:14Z

Thanks. The CI workflow has been added.

anakin87

Thanks for the implementation.

I did a first pass and found some opportunities for improvement...

anakin87 · 2025-12-02T09:55:57Z

integrations/paddleocr/pydoc/config.yml

@@ -0,0 +1,29 @@
+loaders:


This is no longer needed. Let's remove this file and keep config_docusaurus.yml only.

anakin87 · 2025-12-02T09:59:39Z

.github/workflows/paddleocr.yml

+      fail-fast: false
+      matrix:
+        os: [ubuntu-latest, windows-latest, macos-latest]
+        python-version: ["3.9", "3.12"]


The reason for not including 3.13 is that the PaddleOCR package is not compatible with it, right?

anakin87 · 2025-12-02T10:01:07Z

integrations/paddleocr/pyproject.toml

+
+[project.urls]
+Documentation = "https://github.com/haystack-core-integrations/tree/main/integrations/paddleocr#readme"
+Issues = "https://github.com/haystack-core-integrations/paddleocr/issues"


Suggested change

-Issues = "https://github.com/haystack-core-integrations/paddleocr/issues"
+Issues = "https://github.com/haystack-core-integrations/issues"

anakin87 · 2025-12-02T10:02:22Z

integrations/paddleocr/pyproject.toml

+dependencies = ["haystack-pydoc-tools", "ruff"]
+
+[tool.hatch.envs.default.scripts]
+docs = ["pydoc-markdown pydoc/config.yml"]


Suggested change

-docs = ["pydoc-markdown pydoc/config.yml"]
+docs = ["pydoc-markdown pydoc/config_docusaurus.yml"]

We recently changed this command in all integrations.

anakin87 · 2025-12-02T10:17:58Z

...src/haystack_integrations/components/converters/paddleocr/paddleocr_vl_document_converter.py

+        self,
+        api_url: str,


Suggested change

-        self,
-        api_url: str,
+        self,
+        *,
+        api_url: str,
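
For context, the bare `*` in this suggestion makes every parameter after it keyword-only. A minimal sketch of the effect, using a hypothetical stand-in class (`ConverterSketch` is not the real component, and the `access_token` parameter is an assumption added for illustration):

```python
from typing import Optional


class ConverterSketch:
    """Hypothetical stand-in, not the real PaddleOCRVLDocumentConverter."""

    def __init__(self, *, api_url: str, access_token: Optional[str] = None) -> None:
        # Everything after the bare `*` must be passed by keyword.
        self.api_url = api_url
        self.access_token = access_token


# Keyword call: accepted.
converter = ConverterSketch(api_url="https://example.com/layout-parsing")

# Positional call: rejected with a TypeError before __init__ runs.
positional_error = None
try:
    ConverterSketch("https://example.com/layout-parsing")
except TypeError as exc:
    positional_error = exc
```

Keyword-only arguments keep call sites readable and let parameters be reordered or added later without breaking existing callers.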

anakin87 · 2025-12-02T10:29:33Z

...src/haystack_integrations/components/converters/paddleocr/paddleocr_vl_document_converter.py

+logger = logging.getLogger(__name__)
+
+
+FileTypeInput: TypeAlias = Union[Literal["pdf", "image", 0, 1], None]


Since this is meant to be provided by the user and later converted to FileType, can we just restrict it to
FileTypeInput: TypeAlias = Union[Literal["pdf", "image"], None]?

anakin87 · 2025-12-02T10:37:46Z

...src/haystack_integrations/components/converters/paddleocr/paddleocr_vl_document_converter.py

+_PDF_EXTENSIONS = {".pdf"}
+
+
+def _infer_file_type_from_source(source: Union[str, Path, ByteStream], bytestream: ByteStream) -> Optional[FileType]:


Suggested change

-def _infer_file_type_from_source(source: Union[str, Path, ByteStream], bytestream: ByteStream) -> Optional[FileType]:
+def _infer_file_type_from_source(source: Union[str, Path, ByteStream], mime_type: Optional[str] = None) -> Optional[FileType]:

What about changing the signature to something like this (and adjusting the implementation accordingly)? This would be clearer...

anakin87 · 2025-12-02T10:43:34Z

integrations/paddleocr/tests/test_paddleocr_vl_document_converter.py

+)
+
+
+def download_test_file(url, dest_path, timeout=30):


I would add the test files to this repo. See for example https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/anthropic/tests/test_files

This would remove flakiness due to possible network issues.

Would it be possible to include some examples in English instead of Chinese? This would help us understand the model's behavior more easily and make future maintenance simpler.

anakin87 · 2025-12-02T10:45:55Z

integrations/paddleocr/tests/test_paddleocr_vl_document_converter.py

+    @pytest.fixture
+    def integration_enabled(self):
+        """Check if integration tests should run."""
+        return bool(os.environ.get("PADDLEOCR_VL_API_URL") and os.environ.get("AISTUDIO_ACCESS_TOKEN"))


Suggested change

-    @pytest.fixture
-    def integration_enabled(self):
-        """Check if integration tests should run."""
-        return bool(os.environ.get("PADDLEOCR_VL_API_URL") and os.environ.get("AISTUDIO_ACCESS_TOKEN"))

This does not seem to be used. Instead, you are using pytest.mark.skipif, which is consistent with other integrations.
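
For reference, a sketch of the `pytest.mark.skipif` pattern mentioned above. The two environment variable names come from the diff; the helper `missing_env_vars`, the test name, and the test body are hypothetical placeholders:

```python
import os

import pytest

REQUIRED_ENV_VARS = ("PADDLEOCR_VL_API_URL", "AISTUDIO_ACCESS_TOKEN")


def missing_env_vars(environ=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


@pytest.mark.skipif(
    bool(missing_env_vars()),
    reason="Set PADDLEOCR_VL_API_URL and AISTUDIO_ACCESS_TOKEN to run integration tests",
)
def test_convert_against_live_api():
    # Placeholder body: a real test would build the converter and call it here.
    api_url = os.environ["PADDLEOCR_VL_API_URL"]
    assert api_url
```

The condition is evaluated once at import time, so the test is reported as skipped (with the reason) rather than silently passing when credentials are absent.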

anakin87 · 2025-12-02T10:49:45Z

...src/haystack_integrations/components/converters/paddleocr/paddleocr_vl_document_converter.py

+
+
+@component
+class PaddleOCRVLDocumentConverter:


Could you explain why this is named PaddleOCRVLDocumentConverter instead of a more generic PaddleOCRDocumentConverter? Is this component tied to a specific model, or could it be used with other models?

Add PaddleOCR-VL document converter

4bcc397

Bobholamovic requested a review from a team as a code owner November 26, 2025 12:33

Bobholamovic requested review from mpangrazzi and removed request for a team November 26, 2025 12:33

github-actions bot added the type:documentation label Nov 26, 2025

Bobholamovic added 2 commits November 26, 2025 20:52

Fix image extensions

95698ee

Add type ignore comment

d3e6020

anakin87 self-requested a review November 26, 2025 13:43

anakin87 mentioned this pull request Nov 26, 2025

Paddle OCR Integration #2569 (Open, 9 tasks)

anakin87 reviewed Nov 26, 2025

Add CI workflow

e8da287

github-actions bot added the topic:CI label Nov 27, 2025

Update paddleocr and paddlex version

c9bbd7b

anakin87 requested changes Dec 2, 2025
