add partition_columns to StructuredDatasetType #364
cosmicBboy wants to merge 3 commits into master
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Codecov Report
@@ Coverage Diff @@
## master #364 +/- ##
=======================================
Coverage 73.71% 73.71%
=======================================
Files 18 18
Lines 1377 1377
=======================================
Hits 1015 1015
Misses 311 311
Partials 51 51
hamersaw left a comment:
@eapolinario please confirm, did we decide this needed to be included in the serialized flyteidl type or could just be read from metadata within flytekit at runtime?
I may be missing something, but we need to include it in the type so that the structured dataset decoder has access to the metadata (unless we want to manually inspect the URI path for multiple directories).
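To make the alternative mentioned above concrete, here is a minimal sketch of what manually inspecting the URI path could look like. It assumes the dataset was written with Hive-style `column=value` directory names; the function name `infer_partition_columns` and the example URI are hypothetical and not part of flyteidl or flytekit.

```python
from typing import List
from urllib.parse import urlparse


def infer_partition_columns(uri: str) -> List[str]:
    """Guess partition columns from Hive-style `column=value` path segments.

    Hypothetical helper for illustration; it is not a flytekit API.
    """
    path = urlparse(uri).path
    columns: List[str] = []
    for segment in path.split("/"):
        # A partition directory looks like "country=US"; plain files do not.
        if "=" in segment and not segment.startswith("="):
            name, _, _ = segment.partition("=")
            if name not in columns:
                columns.append(name)
    return columns


# Hypothetical URI, for illustration only.
example_uri = "s3://my-bucket/data/country=US/year=2020/part-0.parquet"
print(infer_partition_columns(example_uri))
```

A heuristic like this depends on a directory naming convention and cannot distinguish a partition directory from any other path segment that happens to contain `=`, which is one argument for carrying the column names in the type itself.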
Add partition_columns to StructuredDatasetType
Partially addresses flyteorg/flyte#3219
TL;DR
This PR adds a partition_columns property to the StructuredDatasetType protobuf definition. The property records which columns in the dataset (some kind of DataFrame object) are used to partition it into chunks, for example when a pandas.DataFrame is serialized as a parquet file.
Type
Are all requirements met?
Complete description
This change is required to store additional metadata about which columns are used for partitioning. Currently this only meaningfully affects the serialization/deserialization of parquet files, but in the future we could support the partitioning of other serialization formats.
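As a rough illustration of what partitioning by columns means for the stored layout, the sketch below groups rows into Hive-style `column=value` prefixes using only the standard library. It is a simplified model of the layout that `pandas.DataFrame.to_parquet(partition_cols=...)` produces with the pyarrow engine, not the code in this PR; the function name `partition_rows` and the sample rows are assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence


def partition_rows(
    rows: Sequence[Dict[str, Any]], partition_columns: Sequence[str]
) -> Dict[str, List[Dict[str, Any]]]:
    """Group rows under Hive-style prefixes such as "country=US/year=2020".

    Hypothetical helper for illustration; it is not a flytekit API.
    """
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        # The prefix encodes the partition values, in column order.
        prefix = "/".join(f"{col}={row[col]}" for col in partition_columns)
        # Partition columns live in the path, so drop them from the chunk.
        remaining = {k: v for k, v in row.items() if k not in partition_columns}
        groups[prefix].append(remaining)
    return dict(groups)


# Made-up sample data, for illustration only.
sample_rows = [
    {"country": "US", "year": 2020, "value": 1.0},
    {"country": "US", "year": 2021, "value": 2.0},
    {"country": "FR", "year": 2020, "value": 3.0},
    {"country": "US", "year": 2020, "value": 4.0},
]
chunks = partition_rows(sample_rows, ["country", "year"])
for prefix in sorted(chunks):
    print(prefix, chunks[prefix])
```

Because the partition values are moved out of the stored chunks and into the path, a reader needs to know which columns were used in order to reassemble the original table, which is the metadata this change stores on the type.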
Tracking Issue
Partly addresses flyteorg/flyte#3219
Follow-up issue
NA