feat: bind literals with right type after serde#562
feat: bind literals with right type after serde#562evindj wants to merge 1 commit intoapache:mainfrom
Conversation
068a0d1 to
d8e2630
Compare
3d77607 to
695120b
Compare
695120b to
e1439d8
Compare
wgtmac
left a comment
There was a problem hiding this comment.
Reviewed by Claude Code (claude-opus-4-6) — automated review, not a substitute for human review.
Review Report: PR #562
📄 File: src/iceberg/util/transform_util.cc
Line 113–115 (ParseDay):
- Parity Issue: Java's
isoDateToDaysusesLocalDate.parse(dateString, DateTimeFormatter.ISO_LOCAL_DATE)which accepts any ISO date including 5-digit years and negative years. The C++ implementation manually parses withfrom_charsand assumesdash1is at a fixed position relative to a 4-digit year. For negative years (e.g.,"-0001-01-01"),dash1is found correctly viastr[0] == '-' ? 1 : 0, butstr.size() < 10check will reject valid negative-year dates like"-001-01-01"(9 chars). More critically, the error message on line 131 says "Invalid year in date string" but it fires for any of the threefrom_charsfailures (month or day could be the bad field).
Line 148–168 (ParseTime):
- Logic Issue: The validation order is wrong.
str.size() < 5is checked afterfrom_charsalready readsstr.data() + 3tostr.data() + 5— ifstris shorter than 5 chars, this is UB (out-of-bounds pointer arithmetic). The size check must come first. - Logic Issue:
ParseTimeaccepts"HH:mm"(no seconds), but the colon atstr[2]is never validated. A string like"1200:00"would silently parse hours=1200. - Parity Issue: Java's
ISO_LOCAL_TIMEaccepts nanosecond precision (9 digits) and silently truncates to micros. C++ParseFractionalMicrosrejectsfrac.size() > 6, so strings like"00:00:01.123456789"will error instead of truncating. Suggest adding:// TODO: truncate nanoseconds to micros for ISO_LOCAL_TIME parity
Line 175–195 (ParseTimestampWithZone):
- Parity Issue: Java's
isUTCTimestamptzusesOffsetDateTime.parsewithISO_DATE_TIME, which accepts any UTC offset resolving toZoneOffset.UTC(e.g.,"Z","-00:00"). The C++ implementation only accepts the literal suffix"+00:00"and will reject"Z". Suggest adding:// TODO: accept "Z" and "-00:00" as valid UTC suffixes for full ISO_DATE_TIME parity
📄 File: src/iceberg/expression/json_serde.cc
Line 330–340 (LiteralFromJson, kDate/kTime/kTimestamp/kTimestampTz cases):
- Parity Issue: Java's
SingleValueParser.fromJsononly accepts textual values for date/time/timestamp types. The C++ implementation additionally accepts integer values (json.is_number_integer()). If this is intentional, add a comment explaining the divergence from spec.
Line 395–405 (kDecimal case):
- Parity Issue: Java validates that the parsed
BigDecimal's scale matches the type's scale (seeSingleValueParser.javalines 97–101). The C++ code does not validate thatdec.scale()matchesdec_type.scale(). A string like"123.456"parsed into adecimal(6, 2)type would silently produce a wrong result. Missing:// Missing: validate dec.scale() == dec_type.scale() before constructing Literal
📄 File: src/iceberg/test/transform_util_test.cc
- No negative-year date test (e.g.,
"-0001-01-01"), which would expose the fragile manual parsing inParseDay. - No error-path tests for
ParseTimewith short strings (e.g.,"12") to catch the UB risk noted above. - No test for
ParseTimestampWithZonewith"Z"suffix to document the known limitation.
Summary & Recommendation
Request Changes.
Key blockers:
- UB in
ParseTime— size check occurs after out-of-bounds pointer arithmetic. - Missing decimal scale validation in
LiteralFromJson(parity with JavaSingleValueParser). ParseTimestampWithZonesilently rejects valid UTC formats ("Z","-00:00").
wgtmac
left a comment
There was a problem hiding this comment.
Thanks for updating this PR! I've added some comments from my initial review.
| // For temporal types (date, time, timestamp, timestamp_tz), we support both integer | ||
| // and string representations. | ||
| case TypeId::kDate: | ||
| if (json.is_number_integer()) return Literal::Date(json.get<int32_t>()); |
There was a problem hiding this comment.
I'd recommend not support integer representation like this as the timezone processing is really tricky in C++. We cannot really trust arbitrary integers from timestamp values.
| if (json.is_string()) { | ||
| ICEBERG_ASSIGN_OR_RAISE(auto uuid, Uuid::FromString(json.get<std::string>())); | ||
| return Literal::UUID(uuid); | ||
| } | ||
| return JsonParseError("Cannot parse {} as a uuid value", SafeDumpJson(json)); |
There was a problem hiding this comment.
| if (json.is_string()) { | |
| ICEBERG_ASSIGN_OR_RAISE(auto uuid, Uuid::FromString(json.get<std::string>())); | |
| return Literal::UUID(uuid); | |
| } | |
| return JsonParseError("Cannot parse {} as a uuid value", SafeDumpJson(json)); | |
| if (!json.is_string()) { | |
| return JsonParseError("Cannot parse {} as a uuid value", SafeDumpJson(json)); | |
| } | |
| ICEBERG_ASSIGN_OR_RAISE(auto uuid, Uuid::FromString(json.get<std::string>())); | |
| return Literal::UUID(uuid); |
Let's just be consistent as above? Same for below.
| case TypeId::kDecimal: { | ||
| if (json.is_string()) { | ||
| const auto& dec_type = internal::checked_cast<const DecimalType&>(*type); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto dec, Decimal::FromString(json.get<std::string>())); |
There was a problem hiding this comment.
We need to check the output scale from Decimal::FromString to make sure it is same as in the type.
| } | ||
| case TypeId::kDecimal: { | ||
| const auto& dec_type = internal::checked_cast<const DecimalType&>(*target_type); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto dec, Decimal::FromString(str_val)); |
There was a problem hiding this comment.
Same as my other comment, we need to check the parsed scale against dec_type.scale()
| return InvalidArgument("Invalid date string: '{}'", str); | ||
| } | ||
| int32_t year = 0, month = 0, day = 0; | ||
| auto [_, e1] = std::from_chars(str.data(), str.data() + dash1, year); |
There was a problem hiding this comment.
Let's reuse ParseNumber from string_util.h.
There was a problem hiding this comment.
BTW, it seems that 20x6-03-03 can be parsed without issue since from_chars ignores the non-numeric characters. We may need to check the returned ptr to see if it consumes all input.
| } | ||
| } | ||
|
|
||
| return hours * 3'600 * kMicrosPerSecond + minutes * 60 * kMicrosPerSecond + |
There was a problem hiding this comment.
We need to check hours < 24, minutes < 60, seconds <= 60. It seems that 99:99:99 can be parsed without issue at the moment.
| TransformUtilTest, ParseRoundTripTest, | ||
| ::testing::Values( | ||
| // Day round-trips | ||
| ParseRoundTripParam{"DayEpoch", "1970-01-01", 0, ParseRoundTripParam::kDay}, |
There was a problem hiding this comment.
Thanks for adding these test cases! It would be good to add cases for various invalid values.
expressions serde will convert some types to string but right now, the binding process does not support translating from the string representation back to the right type this PR addresses this gap.
This PR must be landed after #553 is merged.