- 
                Notifications
    You must be signed in to change notification settings 
- Fork 381
fix: upsert with null values in join columns #2429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
76fc451    to
    075a966      
    Compare
  
    236b0c8    to
    479d11d      
    Compare
  
    | @kevinjqliu Could you help me find a reviewer? This is my first contribution, so I'm not sure how to get this noticed. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the PR!
i left a few nit comments
|  | ||
| if len(join_cols) == 1: | ||
| return In(join_cols[0], unique_keys[0].to_pylist()) | ||
| column = join_cols[0] | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way we can simplify the logic here?
i think the primary issue is that the In operator cannot handle Null, is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the In operator cannot handle null by design, and this goes for SQL as well.
The following SQL is invalid:
WHERE x IN (1, 2, 3, NULL)Instead it should be this:
WHERE x IN (1, 2, 3) OR x IS NULLTesting for null requires IS NULL (or IS NOT NULL), and it's impossible with IN or =.
This is the reason for changing the create_match_filter function: we need to build more complex expressions if null is involved. Examples of such expressions are shown in the test cases.
If there's a better way I'm open to changing it, but I believe the added complexity in building filter expressions with null is justified. When null is not involved we produce the same In expression as before.
| ) | ||
|  | ||
|  | ||
| def create_match_filter(df: pyarrow_table, join_cols: list[str]) -> BooleanExpression: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand this correctly, this is creating a predicate to test whether a row might exist in the pyarrow_table (matching on join_cols).
And since Null == Any should always return unknown in SQL, can we just filter out any rows from the pyarrow_table where the join_cols fields contain None(we treat None as SQL Null), and then build the match filter based on the filtered pyarrow table (using the existing logic for building the match filter)? This would be much simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would that mean it's impossible to update rows with null in the join columns, since they are filtered out?
If so, that's not what I was going for. I'd like the solution to pass this test:  https://github.com/mdwint/iceberg-python/blob/f818016e5c198581b7d7b11dba2b9ebd414e19bc/tests/table/test_upsert.py#L784-L831
This would be equivalent to the following Spark SQL (using the null-safe equality operator <=>):
MERGE INTO target_table AS t
USING source_table AS s
ON (t.foo <=> s.foo AND t.bar <=> s.bar)
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking through this some more, I ask myself: What should the semantics of upsert be? Should it use = or <=> to test equality? For my use case <=> is right, and I also find it most intuitive, but does that mean it should be the default?
I see several options:
- Make <=>the default. Users who don't want to update nulls can filter them out themselves before callingupsert. The status quo is crashing, so there are no existing users expecting a different behaviour.
- Make =the default. This means I can't achieve my goal, and I'll need to reimplementupsertmyself. It also means new rows will be inserted for every row containing null in the join columns. This is unintuitive to me, but who knows someone might want it?
- Add an argument to upsertto select the comparison operator. Maximum flexibility, more work to implement.
9e323e6    to
    6d772b9      
    Compare
  
    
Rationale for this change
This fixes #2426. The upsert method now supports null values to be passed in the join columns.
Are these changes tested?
Yes, I added unit tests.
Are there any user-facing changes?
Yes, the upsert method is user-facing.