fix: upsert with null values in join columns #2429

mdwint · 2025-09-04T21:55:31Z

Rationale for this change

This fixes #2426. The upsert method now accepts null values in the join columns.

Are these changes tested?

Yes, I added unit tests.

Are there any user-facing changes?

Yes, the upsert method is user-facing.

mdwint · 2025-09-22T09:18:13Z

@kevinjqliu Could you help me find a reviewer? This is my first contribution, so I'm not sure how to get this noticed.

kevinjqliu

thanks for the PR!

i left a few nit comments

tests/table/test_upsert.py

kevinjqliu · 2025-09-23T19:57:56Z

pyiceberg/table/upsert_util.py


    if len(join_cols) == 1:
-        return In(join_cols[0], unique_keys[0].to_pylist())
+        column = join_cols[0]


is there a way we can simplify the logic here?

i think the primary issue is that the In operator cannot handle Null, is that right?

mdwint · 2025-09-24

Yes, the In operator cannot handle null by design, and the same goes for SQL.

The following SQL never matches rows where x is null, because comparing null against the list yields unknown rather than true:

    WHERE x IN (1, 2, 3, NULL)

Instead it should be this:

    WHERE x IN (1, 2, 3) OR x IS NULL

Testing for null requires IS NULL (or IS NOT NULL), and it's impossible with IN or =.
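The difference can be checked against a real SQL engine. The sketch below uses Python's built-in sqlite3 module with a throwaway table `t` (an illustrative name, not part of this PR). In SQLite at least, the first form is accepted but does not select the null row, while the second form does.

```python
import sqlite3

# Throwaway in-memory table; the name `t` and its contents are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (4,), (None,)])

# NULL inside the IN list: the row where x is null is not selected,
# because `NULL IN (...)` evaluates to unknown rather than true.
in_only = conn.execute("SELECT x FROM t WHERE x IN (1, 2, 3, NULL)").fetchall()

# Explicit IS NULL branch: the null row is selected as well.
in_or_null = conn.execute("SELECT x FROM t WHERE x IN (1, 2, 3) OR x IS NULL").fetchall()

print(sorted(in_only))        # rows with x = 1 and x = 2 only
print((None,) in in_or_null)  # the null row is included
```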

This is the reason for changing the create_match_filter function: we need to build more complex expressions if null is involved. Examples of such expressions are shown in the test cases.

If there's a better way I'm open to changing it, but I believe the added complexity in building filter expressions with null is justified. When null is not involved we produce the same In expression as before.
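To make the shape of these expressions concrete, here is a minimal sketch for the single-column case. It is not the code in this PR: the real create_match_filter returns a pyiceberg BooleanExpression, whereas the hypothetical helper `match_filter_sql` below renders SQL-like text so that it runs with the standard library alone. It assumes the non-null keys are mutually comparable and does no SQL escaping.

```python
from typing import Any, Optional, Sequence


def match_filter_sql(column: str, keys: Sequence[Optional[Any]]) -> str:
    """Render a single-column match filter as SQL-like text (illustration only)."""
    # Deduplicate and sort the non-null keys; assumes they are mutually comparable.
    values = sorted({k for k in keys if k is not None})
    has_null = any(k is None for k in keys)

    parts = []
    if values:
        parts.append(f"{column} IN ({', '.join(repr(v) for v in values)})")
    if has_null:
        # Null needs its own IS NULL branch; it cannot go in the IN list.
        parts.append(f"{column} IS NULL")
    # With no keys at all, nothing can match.
    return " OR ".join(parts) if parts else "FALSE"


print(match_filter_sql("x", [1, 2, 3]))     # x IN (1, 2, 3)
print(match_filter_sql("x", [3, None, 1]))  # x IN (1, 3) OR x IS NULL
```

Without nulls the output is the plain IN form, which mirrors the point that the existing behaviour is unchanged in that case.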

zhongyujiang · 2025-09-24T11:51:05Z

pyiceberg/table/upsert_util.py

 )


 def create_match_filter(df: pyarrow_table, join_cols: list[str]) -> BooleanExpression:


If I understand this correctly, this creates a predicate to test whether a row might exist in the pyarrow_table (matching on join_cols).
And since Null == Any should always return unknown in SQL, can we just filter out any rows from the pyarrow_table where the join_cols fields contain None (we treat None as SQL Null), and then build the match filter from the filtered pyarrow table (using the existing logic for building the match filter)? This would be much simpler.

mdwint · 2025-09-24

Would that mean it's impossible to update rows with null in the join columns, since they are filtered out?
If so, that's not what I was going for. I'd like the solution to pass this test: https://github.com/mdwint/iceberg-python/blob/f818016e5c198581b7d7b11dba2b9ebd414e19bc/tests/table/test_upsert.py#L784-L831

This would be equivalent to the following Spark SQL (using the null-safe equality operator <=>):

    MERGE INTO target_table AS t
    USING source_table AS s
    ON (t.foo <=> s.foo AND t.bar <=> s.bar)
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
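The effect of null-safe equality on the join can be sketched with Python's built-in sqlite3 module. SQLite has no `<=>` operator and no MERGE statement, so this example only counts join matches and uses SQLite's `IS` operator, which treats two nulls as equal in the same way. The table contents and the `val` column are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target_table (foo TEXT, bar INTEGER, val TEXT)")
conn.execute("CREATE TABLE source_table (foo TEXT, bar INTEGER, val TEXT)")
rows_old = [("a", 1, "old"), (None, 2, "old")]
rows_new = [("a", 1, "new"), (None, 2, "new")]
conn.executemany("INSERT INTO target_table VALUES (?, ?, ?)", rows_old)
conn.executemany("INSERT INTO source_table VALUES (?, ?, ?)", rows_new)


def count_matches(op: str) -> int:
    """Count joined rows; `op` is only ever one of the two fixed literals below."""
    query = (
        "SELECT COUNT(*) FROM target_table AS t JOIN source_table AS s "
        f"ON t.foo {op} s.foo AND t.bar {op} s.bar"
    )
    return conn.execute(query).fetchone()[0]


plain_matches = count_matches("=")       # the row with a null foo never matches
null_safe_matches = count_matches("IS")  # null-safe: both rows match

print(plain_matches, null_safe_matches)
```

With `=` the row whose foo is null would fall into the NOT MATCHED branch of a merge and be inserted again; with the null-safe comparison it is matched and updated.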

Thinking through this some more, I ask myself: What should the semantics of upsert be? Should it use = or <=> to test equality? For my use case <=> is right, and I also find it most intuitive, but does that mean it should be the default?

I see several options:

1. Make <=> the default. Users who don't want to update nulls can filter them out themselves before calling upsert. The status quo is crashing, so there are no existing users expecting a different behaviour.

2. Make = the default. This means I can't achieve my goal, and I'll need to reimplement upsert myself. It also means new rows will be inserted for every row containing null in the join columns. This is unintuitive to me, but who knows, someone might want it?

3. Add an argument to upsert to select the comparison operator. Maximum flexibility, more work to implement.
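The practical difference between these options can be sketched in plain Python over lists of dicts. This is not pyiceberg's upsert: the function, the `null_safe` parameter, and the sample rows are all hypothetical and only illustrate how the choice of comparison changes the outcome.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def keys_match(a: Row, b: Row, join_cols: Sequence[str], null_safe: bool) -> bool:
    """Compare join columns with `=` semantics or, if null_safe, `<=>` semantics."""
    for col in join_cols:
        if a[col] is None or b[col] is None:
            # `=`: a null never matches. `<=>`: null matches only another null.
            if not (null_safe and a[col] is None and b[col] is None):
                return False
        elif a[col] != b[col]:
            return False
    return True


def upsert(target: List[Row], source: List[Row], join_cols: Sequence[str],
           null_safe: bool = True) -> Tuple[int, int]:
    """Update matching rows in place, append the rest; return (updated, inserted)."""
    existing = list(target)  # match only against rows present before the upsert
    updated = inserted = 0
    for src in source:
        matches = [row for row in existing if keys_match(row, src, join_cols, null_safe)]
        if matches:
            for row in matches:
                row.update(src)
            updated += 1
        else:
            target.append(dict(src))
            inserted += 1
    return updated, inserted


def sample_target() -> List[Row]:
    return [{"foo": "a", "bar": 1, "val": "old"}, {"foo": None, "bar": 2, "val": "old"}]


source = [{"foo": "a", "bar": 1, "val": "new"}, {"foo": None, "bar": 2, "val": "new"}]

print(upsert(sample_target(), source, ["foo", "bar"], null_safe=True))   # (2, 0)
print(upsert(sample_target(), source, ["foo", "bar"], null_safe=False))  # (1, 1)
```

With `null_safe=False` the row with a null key is inserted a second time instead of being updated, which is the behaviour described under option 2.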

tests/table/test_upsert.py

mdwint force-pushed the fix/upsert-with-nulls branch from 76fc451 to 075a966 on September 5, 2025 07:15

mdwint marked this pull request as ready for review September 5, 2025 09:34

mdwint mentioned this pull request Sep 5, 2025

Upsert with None values fails on "Invalid literal value: None" #2426

Open

3 tasks

mdwint force-pushed the fix/upsert-with-nulls branch 3 times, most recently from 236b0c8 to 479d11d on September 5, 2025 09:52

kevinjqliu reviewed Sep 23, 2025

zhongyujiang reviewed Sep 24, 2025

mdwint added 6 commits October 13, 2025 13:18

fix: upsert with null values in join columns

1321c1b

test: add test cases for create_match_filter

4df175f

fix: respect null values in inner join in get_rows_to_update

763042f

fix: type hints

a32bb06

test: add test case for create_match_filter without null

9af1b3d

test: separate test_upsert_with_nulls_in_join_columns

6d772b9

mdwint force-pushed the fix/upsert-with-nulls branch from 9e323e6 to 6d772b9 on October 13, 2025 11:20

rambleraptor reviewed Oct 13, 2025

tests/table/test_upsert.py (outdated, resolved)

test: unroll parametrized tests for clarity

cf3d68e

4 participants
