feat: bind literals with right type after serde #562

Open
evindj wants to merge 1 commit into apache:main from evindj:bound_expressions

@evindj evindj commented Feb 12, 2026

Expression serde converts some types to strings, but the binding process currently cannot translate the string representation back to the right type. This PR addresses that gap.

This PR must be landed after #553 is merged.

@evindj evindj marked this pull request as draft February 12, 2026 05:30
@evindj evindj force-pushed the bound_expressions branch 3 times, most recently from 3d77607 to 695120b Compare March 11, 2026 18:33
@evindj evindj force-pushed the bound_expressions branch from 695120b to e1439d8 Compare March 11, 2026 19:51
@evindj evindj marked this pull request as ready for review March 11, 2026 20:22

@wgtmac wgtmac left a comment

Reviewed by Claude Code (claude-opus-4-6) — automated review, not a substitute for human review.


Review Report: PR #562

📄 File: src/iceberg/util/transform_util.cc

Line 113–115 (ParseDay):

  • Parity Issue: Java's isoDateToDays uses LocalDate.parse(dateString, DateTimeFormatter.ISO_LOCAL_DATE) which accepts any ISO date including 5-digit years and negative years. The C++ implementation manually parses with from_chars and assumes dash1 is at a fixed position relative to a 4-digit year. For negative years (e.g., "-0001-01-01"), dash1 is found correctly via str[0] == '-' ? 1 : 0, but str.size() < 10 check will reject valid negative-year dates like "-001-01-01" (9 chars). More critically, the error message on line 131 says "Invalid year in date string" but it fires for any of the three from_chars failures (month or day could be the bad field).

Line 148–168 (ParseTime):

  • Logic Issue: The validation order is wrong. str.size() < 5 is checked after from_chars already reads str.data() + 3 to str.data() + 5 — if str is shorter than 5 chars, this is UB (out-of-bounds pointer arithmetic). The size check must come first.
  • Logic Issue: ParseTime accepts "HH:mm" (no seconds), but the colon at str[2] is never validated. A string like "1200:00" would silently parse hours=1200.
  • Parity Issue: Java's ISO_LOCAL_TIME accepts nanosecond precision (9 digits) and silently truncates to micros. C++ ParseFractionalMicros rejects frac.size() > 6, so strings like "00:00:01.123456789" will error instead of truncating. Suggest adding:
    // TODO: truncate nanoseconds to micros for ISO_LOCAL_TIME parity
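A minimal sketch of the ordering the ParseTime bullets ask for, not the PR's actual code: the length and the colon at index 2 are checked before any substring is read, and a fraction of up to nine digits is truncated to microseconds instead of being rejected. The names ParseTwoDigits and ParseTimeMicrosSketch are hypothetical, and range checks on the fields are deliberately left out to keep the example focused.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper: accepts exactly two ASCII digits, nothing else.
inline std::optional<int> ParseTwoDigits(std::string_view s) {
  if (s.size() != 2 || s[0] < '0' || s[0] > '9' || s[1] < '0' || s[1] > '9') {
    return std::nullopt;
  }
  return (s[0] - '0') * 10 + (s[1] - '0');
}

// Hypothetical sketch: parses "HH:mm", "HH:mm:ss" or "HH:mm:ss.f{1,9}"
// into microseconds. Field ranges (hours < 24, ...) are not checked here.
inline std::optional<int64_t> ParseTimeMicrosSketch(std::string_view str) {
  // Size and separator are validated before any substring is touched.
  if (str.size() < 5 || str[2] != ':') return std::nullopt;
  auto hours = ParseTwoDigits(str.substr(0, 2));
  auto minutes = ParseTwoDigits(str.substr(3, 2));
  if (!hours || !minutes) return std::nullopt;

  int64_t seconds = 0;
  int64_t micros = 0;
  std::string_view rest = str.substr(5);
  if (!rest.empty()) {
    if (rest.size() < 3 || rest[0] != ':') return std::nullopt;
    auto sec = ParseTwoDigits(rest.substr(1, 2));
    if (!sec) return std::nullopt;
    seconds = *sec;
    rest.remove_prefix(3);
    if (!rest.empty()) {
      // '.' followed by 1 to 9 digits.
      if (rest[0] != '.' || rest.size() < 2 || rest.size() > 10) {
        return std::nullopt;
      }
      rest.remove_prefix(1);
      int digits = 0;
      for (char c : rest) {
        if (c < '0' || c > '9') return std::nullopt;
        // Keep the first six digits; digits 7 to 9 are validated but dropped.
        if (digits < 6) {
          micros = micros * 10 + (c - '0');
          ++digits;
        }
      }
      // Scale short fractions, e.g. ".5" becomes 500000 microseconds.
      for (; digits < 6; ++digits) micros *= 10;
    }
  }
  return ((static_cast<int64_t>(*hours) * 60 + *minutes) * 60 + seconds) *
             1'000'000 +
         micros;
}
```

With this shape, "12" and "1200:00" are both rejected up front, and "00:00:01.123456789" yields 1123456 microseconds.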

Line 175–195 (ParseTimestampWithZone):

  • Parity Issue: Java's isUTCTimestamptz uses OffsetDateTime.parse with ISO_DATE_TIME, which accepts any UTC offset resolving to ZoneOffset.UTC (e.g., "Z", "-00:00"). The C++ implementation only accepts the literal suffix "+00:00" and will reject "Z". Suggest adding:
    // TODO: accept "Z" and "-00:00" as valid UTC suffixes for full ISO_DATE_TIME parity
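One possible shape for the suffix handling, assuming the three spellings named in the report ("+00:00", "-00:00", "Z") are the ones to accept. StripUtcSuffix is a hypothetical helper, not a function in the PR; it only removes the zero-offset suffix and leaves parsing of the remaining local date-time to the existing code.

```cpp
#include <optional>
#include <string_view>

// Hypothetical helper: if `str` ends with a zero-offset suffix, returns the
// part before the suffix; otherwise returns std::nullopt.
inline std::optional<std::string_view> StripUtcSuffix(std::string_view str) {
  constexpr std::string_view kUtcSuffixes[] = {"+00:00", "-00:00", "Z"};
  for (std::string_view suffix : kUtcSuffixes) {
    if (str.size() >= suffix.size() &&
        str.substr(str.size() - suffix.size()) == suffix) {
      return str.substr(0, str.size() - suffix.size());
    }
  }
  return std::nullopt;
}
```

Any other offset, such as "+01:00", still yields std::nullopt, so non-UTC inputs keep failing as they do today.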

📄 File: src/iceberg/expression/json_serde.cc

Line 330–340 (LiteralFromJson, kDate/kTime/kTimestamp/kTimestampTz cases):

  • Parity Issue: Java's SingleValueParser.fromJson only accepts textual values for date/time/timestamp types. The C++ implementation additionally accepts integer values (json.is_number_integer()). If this is intentional, add a comment explaining the divergence from spec.

Line 395–405 (kDecimal case):

  • Parity Issue: Java validates that the parsed BigDecimal's scale matches the type's scale (see SingleValueParser.java lines 97–101). The C++ code does not validate that dec.scale() matches dec_type.scale(). A string like "123.456" parsed into a decimal(6, 2) type would silently produce a wrong result. Missing:
    // Missing: validate dec.scale() == dec_type.scale() before constructing Literal
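To illustrate the missing check without depending on the project's Decimal class, here is a self-contained sketch that derives the scale from a plain decimal string and compares it with the expected one. PlainDecimalScale and ScaleMatches are hypothetical names, and exponent notation is treated as unsupported; in the PR the comparison would presumably use whatever scale Decimal::FromString reports.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper: number of digits after the decimal point in a plain
// decimal string such as "-123.45". Returns std::nullopt for anything else
// (including exponent notation, which this sketch does not handle).
inline std::optional<int32_t> PlainDecimalScale(std::string_view s) {
  if (!s.empty() && (s[0] == '+' || s[0] == '-')) s.remove_prefix(1);
  bool seen_point = false;
  bool seen_digit = false;
  int32_t scale = 0;
  for (char c : s) {
    if (c == '.') {
      if (seen_point) return std::nullopt;  // second '.' is invalid
      seen_point = true;
    } else if (c >= '0' && c <= '9') {
      seen_digit = true;
      if (seen_point) ++scale;
    } else {
      return std::nullopt;
    }
  }
  if (!seen_digit) return std::nullopt;
  return scale;
}

// True only when the string is a plain decimal whose scale equals
// `expected_scale`, e.g. the scale of the target decimal(P, S) type.
inline bool ScaleMatches(std::string_view s, int32_t expected_scale) {
  auto scale = PlainDecimalScale(s);
  return scale.has_value() && *scale == expected_scale;
}
```

Under this check, "123.456" is rejected for a decimal(6, 2) target instead of being accepted with the wrong scale.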

📄 File: src/iceberg/test/transform_util_test.cc

  • No negative-year date test (e.g., "-0001-01-01"), which would expose the fragile manual parsing in ParseDay.
  • No error-path tests for ParseTime with short strings (e.g., "12") to catch the UB risk noted above.
  • No test for ParseTimestampWithZone with "Z" suffix to document the known limitation.

Summary & Recommendation

Request Changes.

Key blockers:

  1. UB in ParseTime — size check occurs after out-of-bounds pointer arithmetic.
  2. Missing decimal scale validation in LiteralFromJson (parity with Java SingleValueParser).
  3. ParseTimestampWithZone silently rejects valid UTC formats ("Z", "-00:00").

@wgtmac wgtmac left a comment

Thanks for updating this PR! I've added some comments from my initial review.

// For temporal types (date, time, timestamp, timestamp_tz), we support both integer
// and string representations.
case TypeId::kDate:
if (json.is_number_integer()) return Literal::Date(json.get<int32_t>());

I'd recommend not supporting the integer representation like this, as timezone processing is really tricky in C++. We cannot really trust arbitrary integers as timestamp values.

Comment on lines +381 to +385
if (json.is_string()) {
ICEBERG_ASSIGN_OR_RAISE(auto uuid, Uuid::FromString(json.get<std::string>()));
return Literal::UUID(uuid);
}
return JsonParseError("Cannot parse {} as a uuid value", SafeDumpJson(json));

Suggested change
- if (json.is_string()) {
-   ICEBERG_ASSIGN_OR_RAISE(auto uuid, Uuid::FromString(json.get<std::string>()));
-   return Literal::UUID(uuid);
- }
- return JsonParseError("Cannot parse {} as a uuid value", SafeDumpJson(json));
+ if (!json.is_string()) {
+   return JsonParseError("Cannot parse {} as a uuid value", SafeDumpJson(json));
+ }
+ ICEBERG_ASSIGN_OR_RAISE(auto uuid, Uuid::FromString(json.get<std::string>()));
+ return Literal::UUID(uuid);

Let's just be consistent with the cases above. Same for the ones below.

case TypeId::kDecimal: {
if (json.is_string()) {
const auto& dec_type = internal::checked_cast<const DecimalType&>(*type);
ICEBERG_ASSIGN_OR_RAISE(auto dec, Decimal::FromString(json.get<std::string>()));

We need to check the output scale from Decimal::FromString to make sure it is the same as the scale in the type.

}
case TypeId::kDecimal: {
const auto& dec_type = internal::checked_cast<const DecimalType&>(*target_type);
ICEBERG_ASSIGN_OR_RAISE(auto dec, Decimal::FromString(str_val));

Same as my other comment: we need to check the parsed scale against dec_type.scale().

return InvalidArgument("Invalid date string: '{}'", str);
}
int32_t year = 0, month = 0, day = 0;
auto [_, e1] = std::from_chars(str.data(), str.data() + dash1, year);

Let's reuse ParseNumber from string_util.h.

BTW, it seems that 20x6-03-03 can be parsed without issue, since from_chars stops at the first non-numeric character and still reports success. We may need to check the returned ptr to see if it consumed all input.
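A small sketch of the full-consumption check described here: the parse only counts as successful when std::from_chars reports no error and the returned pointer equals the end of the input. ParseIntStrict is a hypothetical name; whether ParseNumber in string_util.h already behaves this way is not shown in this thread.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parses `s` as a base-10 int and fails unless every
// character of `s` was consumed.
inline std::optional<int> ParseIntStrict(std::string_view s) {
  int value = 0;
  const char* end = s.data() + s.size();
  auto [ptr, ec] = std::from_chars(s.data(), end, value);
  // ec catches empty or non-numeric input and overflow; ptr != end catches
  // trailing garbage such as the 'x' in "20x6".
  if (ec != std::errc() || ptr != end) return std::nullopt;
  return value;
}
```

Applied to the year field, "20x6" now fails because parsing stops at 'x' with two characters left unread.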

}
}

return hours * 3'600 * kMicrosPerSecond + minutes * 60 * kMicrosPerSecond +

We need to check hours < 24, minutes < 60, seconds <= 60. It seems that 99:99:99 can be parsed without issue at the moment.
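A sketch of the range checks listed in this comment, applied before the fields are combined into microseconds. TimeFieldsToMicros and kMicrosPerSecondSketch are hypothetical names; the upper bound on seconds follows the comment (seconds <= 60) and could be tightened if 60 should not be accepted.

```cpp
#include <cstdint>
#include <optional>

constexpr int64_t kMicrosPerSecondSketch = 1'000'000;

// Hypothetical helper: validates each field, then converts to microseconds
// since midnight. Returns std::nullopt when any field is out of range.
inline std::optional<int64_t> TimeFieldsToMicros(int hours, int minutes,
                                                 int seconds,
                                                 int64_t frac_micros) {
  if (hours < 0 || hours >= 24) return std::nullopt;
  if (minutes < 0 || minutes >= 60) return std::nullopt;
  if (seconds < 0 || seconds > 60) return std::nullopt;  // bound from the comment
  if (frac_micros < 0 || frac_micros >= kMicrosPerSecondSketch) {
    return std::nullopt;
  }
  const int64_t total_seconds =
      (static_cast<int64_t>(hours) * 60 + minutes) * 60 + seconds;
  return total_seconds * kMicrosPerSecondSketch + frac_micros;
}
```

With these checks, the fields of 99:99:99 are rejected instead of producing a value larger than one day.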

TransformUtilTest, ParseRoundTripTest,
::testing::Values(
// Day round-trips
ParseRoundTripParam{"DayEpoch", "1970-01-01", 0, ParseRoundTripParam::kDay},

Thanks for adding these test cases! It would be good to add cases for various invalid values.
