feat: Add incremental scan API with IncrementalAppendScan and Increme…#559
feat: Add incremental scan API with IncrementalAppendScan and Increme…#559WZhuo wants to merge 1 commit intoapache:mainfrom
Conversation
ef8f9b8 to
3214b82
Compare
wgtmac
left a comment
There was a problem hiding this comment.
Review Report: PR #559
(Generated by gemini-cli)
📄 Files: src/iceberg/table_scan.h & src/iceberg/table_scan.cc
Java Counterpart: TableScan.java / IncrementalScan.java
-
Parity Check & API Design: ❌
- CRITICAL API FLAW: The PR makes
BaseIncrementalScanBuilderinherit fromTableScanBuilder. However,TableScanBuilder's methods (likeFilter(),Project(),Select()) returnTableScanBuilder&. - This breaks the fluent builder pattern. If a user tries to chain methods:
It will fail to compile because
table->NewIncrementalAppendScan().value() ->Filter(expr) .FromSnapshot(1); // ERROR: TableScanBuilder has no FromSnapshot method
Filter()returnsTableScanBuilder&. - Furthermore, calling
Build()afterFilter()will invokeTableScanBuilder::Build(), which returns aDataTableScaninstead of the expectedIncrementalAppendScan. - Action: Refactor the builders to use the Curiously Recurring Template Pattern (CRTP) for a common base builder (e.g.,
template <typename Derived> class ScanBuilderBase), so that inherited methods return the correct derived type. Alternatively, override the methods in the derived builders to cast and return*this. - Good catch: Removing
UseBranchfromTableScanBuilderand moving it toBaseIncrementalScanBuilderperfectly matches Java's parity, whereuseBranchonly exists inIncrementalScan.
- CRITICAL API FLAW: The PR makes
-
Style & Comments:
⚠️ - Java uses
fromSnapshotInclusive(long)andfromSnapshotExclusive(long). The PR usesFromSnapshot(int64_t, bool inclusive = false). While acceptable in C++, boolean arguments can cause "boolean blindness" at the call site (e.g.,FromSnapshot(100, true)). Consider using an enum (e.g.,enum class SnapshotBoundary { kInclusive, kExclusive }) or mirroring Java's explicit method names for better readability.
- Java uses
-
Logic Check: ❌
- In
table_scan.cc, theBuild()implementation is hardcoded to return the same error message regardless of theScanType:Action: Iftemplate <typename ScanType> Result<std::unique_ptr<ScanType>> IncrementalScanBuilder<ScanType>::Build() { return NotImplemented("IncrementalAppendScanBuilder is not implemented"); }
ScanTypeisIncrementalChangelogScan, the error message is misleading. Update the error message to be generic (e.g.,"Incremental scan builder is not implemented") or usetypeidor template specialization to provide an accurate message.
- In
-
Design & Conciseness: ✅
- The separation of
DataTableScanand the new incremental scan classes is logically sound and structurally aligned with the Iceberg specification.
- The separation of
-
Test Quality: ❌
- No tests were added for the new incremental builders. While they currently return
NotImplemented, there should be basic instantiation tests to ensure the builders can be created fromTable::NewIncremental...(). - Action: Add a test case that specifically chains a common method (like
Filter) with an incremental method (likeFromSnapshot) to ensure the fluent API compiles and works correctly once the CRTP/inheritance issue is fixed.
- No tests were added for the new incremental builders. While they currently return
📄 Files: src/iceberg/table.h & src/iceberg/table.cc
Java Counterpart: Table.java
- Parity Check: ✅ The new methods
NewIncrementalAppendScan()andNewIncrementalChangelogScan()match Java'sTableinterface. Returning a builder instead of the scan itself aligns with the existing C++ paradigm (NewScan()returningTableScanBuilder). - Style & Comments: ✅ Clean and consistent with the rest of the file.
- Logic Check: ✅
- Design & Conciseness: ✅
- Test Quality:
⚠️ No tests were added to verify thatTable::NewIncrementalAppendScan()successfully returns a builder.
Summary & Recommendation
- Request Changes
- The PR has a critical API design flaw regarding the C++ Builder pattern inheritance. Because
TableScanBuildermethods returnTableScanBuilder&, method chaining inIncrementalScanBuilderis broken. The builder hierarchy needs to be refactored (likely using CRTP) to support fluent chaining correctly. Please also address the hardcoded error message in the template builder and add basic tests to catch compile-time API issues.
src/iceberg/table_scan.h
Outdated
|
|
||
| /// \brief Use the specified branch | ||
| /// \param branch the branch name | ||
| BaseIncrementalScanBuilder& UseBranch(const std::string& branch); |
There was a problem hiding this comment.
UseBranch should also be used in non-incremental scan scenarios ?
There was a problem hiding this comment.
branch is only used in the IncrementalScan, for validating the to_snapshot_id is on the specified branch.
There was a problem hiding this comment.
Technically snapshot read of a table scan should support branch. So I still think it should be kept even if the Java api does not implement this.
There was a problem hiding this comment.
The UseRef api supports both Tag and Branch for a TableScan.
5e3c1d0 to
9d73001
Compare
src/iceberg/table_scan.h
Outdated
| template <class Builder> | ||
| class ICEBERG_EXPORT ScanBuilderBase : public TableScanBuilder { | ||
| public: | ||
| Builder& Option(std::string key, std::string value) { |
There was a problem hiding this comment.
I know that the intention of this class hierarchy design is to reuse code. However, this duplicates all apis of the builder so it does not seem worth the effort.
There was a problem hiding this comment.
Any good suggestions?
There was a problem hiding this comment.
Use a template scan builder class for all and then use std::enable_if to enable functions of incremental scan builder?
…ntalChangelogScan
wgtmac
left a comment
There was a problem hiding this comment.
This review was generated by gemini-cli.
The overall API design in this PR is excellent. You've done a great job aligning with the Java API parity while making smart use of Modern C++ features. In particular:
- Using C++20
requiresconcepts (requires IsIncrementalScan<ScanType>) to restrict methods at compile-time is a much safer and cleaner approach than Java's runtimeUnsupportedOperationException. - Consolidating Java's
fromSnapshotInclusive/fromSnapshotExclusiveinto a singleFromSnapshotwith a defaultinclusiveboolean parameter is a great C++ idiom that reduces the API surface.
However, there is noticeable code duplication in the template implementations in .cc that should be cleaned up. Please see the inline comments for details.
| std::shared_ptr<FileIO> io) { | ||
| ICEBERG_PRECHECK(metadata != nullptr, "Table metadata cannot be null"); | ||
| ICEBERG_PRECHECK(io != nullptr, "FileIO cannot be null"); | ||
| return std::unique_ptr<TableScanBuilder<IncrementalChangelogScan>>( |
There was a problem hiding this comment.
There is noticeable code duplication for the Make method specializations (DataTableScan, IncrementalAppendScan, IncrementalChangelogScan). They are completely identical.
You can remove these three explicit specializations and provide a single generic implementation that relies on the template class ... instantiations at the end of the file:
template <typename ScanType>
Result<std::unique_ptr<TableScanBuilder<ScanType>>>
TableScanBuilder<ScanType>::Make(std::shared_ptr<TableMetadata> metadata,
std::shared_ptr<FileIO> io) {
ICEBERG_PRECHECK(metadata != nullptr, "Table metadata cannot be null");
ICEBERG_PRECHECK(io != nullptr, "FileIO cannot be null");
return std::unique_ptr<TableScanBuilder<ScanType>>(
new TableScanBuilder<ScanType>(std::move(metadata), std::move(io)));
}| Result<std::unique_ptr<IncrementalChangelogScan>> | ||
| TableScanBuilder<IncrementalChangelogScan>::Build() { | ||
| return NotImplemented("IncrementalChangelogScanBuilder is not implemented"); | ||
| } |
There was a problem hiding this comment.
Similar to the Make method, these Build() explicit specializations for IncrementalAppendScan and IncrementalChangelogScan are redundant and can be generalized.
If you consolidate this to a single generic Build() implementation, it would just call ScanType::Make(...), and since IncrementalAppendScan::Make already returns NotImplemented, the error would bubble up naturally. This will help reduce boilerplate code:
template <typename ScanType>
Result<std::unique_ptr<ScanType>> TableScanBuilder<ScanType>::Build() {
ICEBERG_RETURN_UNEXPECTED(CheckErrors());
ICEBERG_RETURN_UNEXPECTED(context_.Validate());
ICEBERG_ASSIGN_OR_RAISE(auto schema, ResolveSnapshotSchema());
return ScanType::Make(metadata_, schema.get(), io_, std::move(context_));
}
No description provided.