-
Notifications
You must be signed in to change notification settings - Fork 202
feat: Add PaddleOCR-VL document converter #2567
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
anakin87
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey... thanks for the implementation!
I created #2569 to track the work to be done.
I don't think that I will be able to review this PR in detail soon, but in the meantime, please add a CI workflow similar to https://github.com/deepset-ai/haystack-core-integrations/blob/main/.github/workflows/anthropic.yml to make sure that tests run in the CI.
Thanks. The CI workflow has been added. |
anakin87
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the implementation.
I did a first pass and found some opportunities for improvement...
| @@ -0,0 +1,29 @@ | |||
| loaders: | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer needed. Let's remove this file and keep config_docusaurus.yml only.
| fail-fast: false | ||
| matrix: | ||
| os: [ubuntu-latest, windows-latest, macos-latest] | ||
| python-version: ["3.9", "3.12"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for not including 3.13 is that PaddleOCR package is not compatible with it. Right?
|
|
||
| [project.urls] | ||
| Documentation = "https://github.com/haystack-core-integrations/tree/main/integrations/paddleocr#readme" | ||
| Issues = "https://github.com/haystack-core-integrations/paddleocr/issues" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Issues = "https://github.com/haystack-core-integrations/paddleocr/issues" | |
| Issues = "https://github.com/haystack-core-integrations/issues" |
| dependencies = ["haystack-pydoc-tools", "ruff"] | ||
|
|
||
| [tool.hatch.envs.default.scripts] | ||
| docs = ["pydoc-markdown pydoc/config.yml"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| docs = ["pydoc-markdown pydoc/config.yml"] | |
| docs = ["pydoc-markdown pydoc/config_docusaurus.yml"] |
We recently changed this command in all integrations.
| self, | ||
| api_url: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| self, | |
| api_url: str, | |
| self, | |
| *, | |
| api_url: str, |
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| FileTypeInput: TypeAlias = Union[Literal["pdf", "image", 0, 1], None] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is meant to be provided by the user and later converted to FileType, can we just restrict it to
FileTypeInput: TypeAlias = Union[Literal["pdf", "image"], None]?
| _PDF_EXTENSIONS = {".pdf"} | ||
|
|
||
|
|
||
| def _infer_file_type_from_source(source: Union[str, Path, ByteStream], bytestream: ByteStream) -> Optional[FileType]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| def _infer_file_type_from_source(source: Union[str, Path, ByteStream], bytestream: ByteStream) -> Optional[FileType]: | |
| def _infer_file_type_from_source(source: Union[str, Path, ByteStream], mime_type: Optional[str] = None) -> Optional[FileType]: |
What about changing the signature to something like this (and adjusting the implementation accordingly)? This would be clearer...
| ) | ||
|
|
||
|
|
||
| def download_test_file(url, dest_path, timeout=30): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add the test files to this repo. See for example https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/anthropic/tests/test_files
This would remove flakiness due to possible network issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to include some examples in English instead of Chinese? This would help us understand the model's behavior more easily and make future maintenance simpler.
| @pytest.fixture | ||
| def integration_enabled(self): | ||
| """Check if integration tests should run.""" | ||
| return bool(os.environ.get("PADDLEOCR_VL_API_URL") and os.environ.get("AISTUDIO_ACCESS_TOKEN")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| @pytest.fixture | |
| def integration_enabled(self): | |
| """Check if integration tests should run.""" | |
| return bool(os.environ.get("PADDLEOCR_VL_API_URL") and os.environ.get("AISTUDIO_ACCESS_TOKEN")) |
This does not seem to be used. Instead, you are using pytest.mark.skipif, which is consistent with other integrations.
|
|
||
|
|
||
| @component | ||
| class PaddleOCRVLDocumentConverter: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain why PaddleOCRVLDocumentConverter instead of a more generic PaddleOCRDocumentConverter? Is this component tied to a specific model? It might be used with other models?
Related Issues
Proposed Changes:
This is a new feature that adds the official PaddleOCR integration for Haystack, providing a PaddleOCR-VL document converter component. The component leverages PaddleOCR's API for document parsing and supports text extraction from both PDF and image files.
How did you test it?
This PR includes a complete unit test suite including initialization tests, parameter validation, file type inference, API call tests, etc. Tests cover PDF and image file processing.
Notes for the reviewer
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:.