Skip to content

Conversation

@mdwint
Copy link

@mdwint mdwint commented Sep 4, 2025

Rationale for this change

This fixes #2426. The upsert method now supports null values to be passed in the join columns.

Are these changes tested?

Yes, I added unit tests.

Are there any user-facing changes?

Yes, the upsert method is user-facing.

@mdwint mdwint force-pushed the fix/upsert-with-nulls branch from 76fc451 to 075a966 Compare September 5, 2025 07:15
@mdwint mdwint marked this pull request as ready for review September 5, 2025 09:34
@mdwint mdwint force-pushed the fix/upsert-with-nulls branch 3 times, most recently from 236b0c8 to 479d11d Compare September 5, 2025 09:52
@mdwint
Copy link
Author

mdwint commented Sep 22, 2025

@kevinjqliu Could you help me find a reviewer? This is my first contribution, so I'm not sure how to get this noticed.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR!

i left a few nit comments


if len(join_cols) == 1:
return In(join_cols[0], unique_keys[0].to_pylist())
column = join_cols[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way we can simplify the logic here?

i think the primary issue is that the In operator cannot handle Null, is that right?

Copy link
Author

@mdwint mdwint Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the In operator cannot handle null by design, and this goes for SQL as well.

The following SQL is invalid:

WHERE x IN (1, 2, 3, NULL)

Instead it should be this:

WHERE x IN (1, 2, 3) OR x IS NULL

Testing for null requires IS NULL (or IS NOT NULL), and it's impossible with IN or =.

This is the reason for changing the create_match_filter function: we need to build more complex expressions if null is involved. Examples of such expressions are shown in the test cases.

If there's a better way I'm open to changing it, but I believe the added complexity in building filter expressions with null is justified. When null is not involved we produce the same In expression as before.

)


def create_match_filter(df: pyarrow_table, join_cols: list[str]) -> BooleanExpression:
Copy link
Contributor

@zhongyujiang zhongyujiang Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this correctly, this is creating a predicate to test whether a row might exist in the pyarrow_table (matching on join_cols).
And since Null == Any should always return unknown in SQL, can we just filter out any rows from the pyarrow_table where the join_cols fields contain None(we treat None as SQL Null), and then build the match filter based on the filtered pyarrow table (using the existing logic for building the match filter)? This would be much simpler.

Copy link
Author

@mdwint mdwint Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that mean it's impossible to update rows with null in the join columns, since they are filtered out?
If so, that's not what I was going for. I'd like the solution to pass this test: https://github.com/mdwint/iceberg-python/blob/f818016e5c198581b7d7b11dba2b9ebd414e19bc/tests/table/test_upsert.py#L784-L831

This would be equivalent to the following Spark SQL (using the null-safe equality operator <=>):

MERGE INTO target_table AS t
USING source_table AS s
ON (t.foo <=> s.foo AND t.bar <=> s.bar)
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Copy link
Author

@mdwint mdwint Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking through this some more, I ask myself: What should the semantics of upsert be? Should it use = or <=> to test equality? For my use case <=> is right, and I also find it most intuitive, but does that mean it should be the default?

I see several options:

  • Make <=> the default. Users who don't want to update nulls can filter them out themselves before calling upsert. The status quo is crashing, so there are no existing users expecting a different behaviour.
  • Make = the default. This means I can't achieve my goal, and I'll need to reimplement upsert myself. It also means new rows will be inserted for every row containing null in the join columns. This is unintuitive to me, but who knows someone might want it?
  • Add an argument to upsert to select the comparison operator. Maximum flexibility, more work to implement.

@mdwint mdwint force-pushed the fix/upsert-with-nulls branch from 9e323e6 to 6d772b9 Compare October 13, 2025 11:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Upsert with None values fails on "Invalid literal value: None"

4 participants