
Fix import edge cases for DATE/TIMESTAMP, INT96, CSV quotes and format filtering #819

Open
gengziyand wants to merge 1 commit into apache:develop from gengziyand:fix-tools-import-edge-cases
@gengziyand
Contributor

Summary

This PR fixes several import edge cases in java/tools:

  • Support DATE and TIMESTAMP value conversion before writing Tablets, and pass import time precision from
    TabletBuilder to ValueConverter.
  • Decode legacy Parquet INT96 timestamps from nanos-of-day plus Julian day, and mark them as nanosecond precision in
    auto schema.
  • Resolve timezone offsets from the parsed local datetime and configured ZoneId, instead of using the JVM's current
    DST offset.
  • Make --format filtering in directory mode respect file extensions, so unrelated files are skipped instead of parsed
    as data.
  • Extract Arrow DATE and TIMESTAMP vectors through their actual vector classes.
  • Parse quoted CSV fields with embedded separators and escaped quotes.
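
The INT96 item above can be illustrated with a minimal sketch. It assumes the conventional legacy layout (12 little-endian bytes: 8 bytes of nanos-of-day followed by 4 bytes of Julian day) and is not the code from this PR; the class name `Int96TimestampSketch` and the helper `encode` are hypothetical and exist only for the demonstration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96TimestampSketch {
  // Julian day number of 1970-01-01, the Unix epoch.
  static final long JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588L;
  static final long NANOS_PER_DAY = 86_400L * 1_000_000_000L;

  /** Converts a 12-byte legacy INT96 timestamp to nanoseconds since the Unix epoch. */
  public static long int96ToEpochNanos(byte[] int96) {
    if (int96 == null || int96.length != 12) {
      throw new IllegalArgumentException("INT96 timestamp must be exactly 12 bytes");
    }
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong(); // first 8 bytes
    int julianDay = buf.getInt(); // last 4 bytes
    long daysSinceEpoch = julianDay - JULIAN_DAY_OF_UNIX_EPOCH;
    // Exact arithmetic so values outside the long nanosecond range fail loudly.
    return Math.addExact(Math.multiplyExact(daysSinceEpoch, NANOS_PER_DAY), nanosOfDay);
  }

  /** Hypothetical helper: builds an INT96 value for the demo below. */
  public static byte[] encode(long nanosOfDay, int julianDay) {
    return ByteBuffer.allocate(12)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putLong(nanosOfDay)
        .putInt(julianDay)
        .array();
  }

  public static void main(String[] args) {
    // One day and one nanosecond after the epoch.
    System.out.println(int96ToEpochNanos(encode(1L, 2_440_589))); // prints 86400000000001
  }
}
```

Because the decoded value carries nanosecond resolution, treating the column as nanosecond precision in an inferred schema avoids silently truncating it.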

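The timezone item can be sketched in the same spirit. The example below uses only standard `java.time` calls to look up the offset that applies at the parsed local datetime, rather than the offset in effect when the JVM happens to run. The class and method names (`LocalOffsetSketch`, `offsetFor`, `toEpochMillis`) are hypothetical and do not come from the PR.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class LocalOffsetSketch {
  /** Returns the offset the zone's rules give for this local datetime. */
  public static ZoneOffset offsetFor(LocalDateTime local, ZoneId zone) {
    // For DST gaps and overlaps, ZoneRules returns a "best available" offset.
    return zone.getRules().getOffset(local);
  }

  /** Converts a parsed local datetime to epoch milliseconds using that offset. */
  public static long toEpochMillis(LocalDateTime local, ZoneId zone) {
    return local.toInstant(offsetFor(local, zone)).toEpochMilli();
  }

  public static void main(String[] args) {
    ZoneId zone = ZoneId.of("America/New_York");
    LocalDateTime winter = LocalDateTime.of(2024, 1, 15, 12, 0);
    LocalDateTime summer = LocalDateTime.of(2024, 7, 15, 12, 0);
    System.out.println(offsetFor(winter, zone)); // prints -05:00
    System.out.println(offsetFor(summer, zone)); // prints -04:00
  }
}
```

With this approach a January timestamp imported during July still gets the winter offset, which is the behavior the bullet describes.
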
Tests

mvn '-Dspotless.apply.skip=true' test '-Dtest=ValueConverterTest,TabletBuilderTest,ParquetSourceReaderTest,DateTimeUtilsTest,TsFileToolFormatFilterTest,CsvSourceReaderTest,ArrowSourceReaderTest,TsFileToolEndToEndTest'
mvn '-Dspotless.apply.skip=true' test
