-
Notifications
You must be signed in to change notification settings - Fork 38
Bug/AP-25573 Fix bugs with pandas >= 2.1.0 #56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses bugs that emerged with pandas version 2.1.0 and later, specifically related to how pandas handles extension arrays and struct dict encoded data. The changes fix data corruption issues during pandas-arrow round-trips and ensure compatibility with newer pandas versions.
Changes:
- Fixed type inference issues in PyArrow array creation by explicitly specifying
type=pa.bool_() - Modified
KnimePandasExtensionArray.take()to prevent struct dict encoding corruption when pandas re-batches data - Added empty array handling in
isna()method - Converted test suite from unittest to pytest with parameterized testing for both populated and empty tables
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pytest.ini | Re-enabled pytest class discovery to support class-based tests |
| org.knime.python3.arrow/src/main/python/knime/_arrow/_types.py | Added explicit boolean type to mask array creation |
| org.knime.python3.arrow/src/main/python/knime/_arrow/_table.py | Added TODO comment about struct-dict-encoded array re-batching issue |
| org.knime.python3.arrow/src/main/python/knime/_arrow/_pandas.py | Fixed pandas 2.1.0+ compatibility by handling empty arrays and optimizing take() to prevent encoding corruption |
| org.knime.python3.arrow/src/main/python/knime/_arrow/_dictencoding.py | Added explicit boolean type to mask array creation |
| org.knime.python3.arrow.tests/src/test/python/unittest/test_table.py | Converted from unittest to pytest with parameterized testing for empty/non-empty tables |
| org.knime.python3.arrow.tests/src/test/python/unittest/test_pandas_extension_type.py | Added regression test for struct dict encoding corruption bug |
| org.knime.python3.arrow.tests/src/test/python/unittest/structDictEncodedDataCellsWithBatches.zip | Added test data file for struct dict encoding regression test |
| org.knime.python3.arrow.tests/src/test/python/unittest/emptyGeneratedTestData.zip | Added test data file for empty table testing |
org.knime.python3.arrow/src/main/python/knime/_arrow/_pandas.py
Outdated
Show resolved
Hide resolved
|
@HedgehogCode I've opened a new pull request, #57, to work on those changes. Once the pull request is ready, I'll request review from you. |
ce94bd8 to
239a578
Compare
AP-25573 (Workflow tests fail with pandas 2.3)
AP-25573 (Workflow tests fail with pandas 2.3)
…mpty arrays The usual case works for empty arrays and handles dtypes of extension arrays correctly. AP-25573 (Workflow tests fail with pandas 2.3)
… of int/bool AP-25573 (Workflow tests fail with pandas 2.3)
Otherwise, if the chunked array has no chunks and we attempt to concatenate no arrays, which fails. AP-25573 (Workflow tests fail with pandas 2.3)
239a578 to
6f8f9dd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
org.knime.python3.arrow.tests/src/test/python/unittest/test_pandas_extension_type.py
Show resolved
Hide resolved
…are taken Bug: In pandas 2.1.0+, pd.concat changed its behavior and now calls KnimePandasExtensionArray.take with indices [0,1,2,...,n] to effectively copy the array. When take is called on struct dict encoded arrays, it delegates to storage.take(), which merges chunks of the ChunkedArray without recalculating the struct-dict encoding. This breaks the dictionary structure across chunk boundaries: keys in chunk 2 reference indices that only exist in chunk 1's data array, causing "Cannot read DataCell with empty type information" errors. Fix: Shortcut the specific case where all indices are taken sequentially by returning a copy instead of calling storage.take(). This fixes the pd.concat code path that triggers during arrow → pandas → arrow round-trips. Limitation: This is only a partial fix for the very specific case of taking all indices. The general take() and re-batching behavior for struct dict encoded arrays remains buggy when partial indices are selected or arbitrary re-batching occurs. AP-25573 (Workflow tests fail with pandas 2.3)
…ionArray.take AP-25573 (Workflow tests fail with pandas 2.3)
AP-25573 (Workflow tests fail with pandas 2.3)
6f8f9dd to
e7ba282
Compare
|
|
|
||
| def isna(self): | ||
| if len(self) == 0: | ||
| return np.array([], dtype=bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.bool ?



No description provided.