
[Track] Alternate writers for canonical formats (parquet-mr, parquet-go, ...) #6

@mprammer

Add optional alternate writers for canonical formats, starting with Parquet, alongside the default pyarrow output. File-format research benefits from corpora produced by multiple writer implementations: encoding choices, page sizing, dictionary thresholds, and statistics policies differ per library, and those differences affect downstream compression and pushdown evaluation.

Likely shares machinery with #5; both produce additional sibling artifacts under the slug's output directory.

Per writer

  • New convert stage variant (or generalised stage that dispatches on writer + format).
  • Extend sources.json: per-writer flag and skip-reason, e.g. convert.parquet_java.
  • Update validate_manifest invariants.
  • Outputs at outputs/v{n}/<slug>/<fmt>-<writer>/<slug>.<ext> (e.g. parquet-java/).
  • Regenerate docs/datasets.md and docs/snapshot.json.
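
The per-writer steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `WRITERS` registry, `output_path`, and `convert_flags` are hypothetical names, and the `enabled` / `skip_reason` shape of the `sources.json` entry is an assumption. Only the path layout and the `convert.parquet_java` key come from the issue text.

```python
from pathlib import PurePosixPath

# Hypothetical registry: (format, writer) -> file extension.
# Dispatching on this pair is one way to generalise the convert stage.
WRITERS = {
    ("parquet", "java"): "parquet",
    ("parquet", "go"): "parquet",
}


def output_path(version, slug, fmt, writer):
    """Build outputs/v{n}/<slug>/<fmt>-<writer>/<slug>.<ext>."""
    try:
        ext = WRITERS[(fmt, writer)]
    except KeyError:
        raise ValueError(f"no writer registered for {fmt}-{writer}") from None
    return (
        PurePosixPath("outputs")
        / f"v{version}"
        / slug
        / f"{fmt}-{writer}"
        / f"{slug}.{ext}"
    )


def convert_flags(fmt, writer, enabled, skip_reason=None):
    """Assumed sources.json fragment: a per-writer flag plus a skip reason.

    The invariant checked here (disabled writers must say why, enabled
    writers must not) is a guess at what validate_manifest might enforce.
    """
    if enabled and skip_reason is not None:
        raise ValueError("an enabled writer must not carry a skip reason")
    if not enabled and not skip_reason:
        raise ValueError("a disabled writer needs a skip reason")
    key = f"{fmt}_{writer}"  # e.g. convert.parquet_java
    return {"convert": {key: {"enabled": enabled, "skip_reason": skip_reason}}}
```

For example, `output_path(1, "taxi", "parquet", "java")` yields `outputs/v1/taxi/parquet-java/taxi.parquet`, where `taxi` is a made-up slug. Keeping the registry as the single source of truth means an unknown writer fails early instead of producing a stray directory.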

Writers in scope

  • parquet-mr (Java, since renamed parquet-java upstream): reference writer; subprocess via java -jar.
  • parquet-go: Go-native writer; subprocess.
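
Both writers run as subprocesses, so a thin wrapper that turns failures into a skip reason may be enough. The sketch below is illustrative only: `java_writer_command` and `run_writer` are hypothetical helpers, and the positional `<src> <dst>` arguments describe an assumed converter CLI, not a documented parquet-mr or parquet-go interface.

```python
import shutil
import subprocess
from typing import List, Optional, Sequence


def java_writer_command(jar: str, src: str, dst: str) -> List[str]:
    """Command line for a hypothetical converter jar (argument order assumed)."""
    return ["java", "-jar", jar, src, dst]


def run_writer(cmd: Sequence[str], timeout: float = 600.0) -> Optional[str]:
    """Run an external writer; return None on success, else a skip reason."""
    # Check for the executable first so a missing runtime is reported
    # as a skip rather than surfacing as an exception.
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]} not found on PATH"
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return f"timed out after {timeout:g}s"
    except OSError as exc:
        return f"failed to launch: {exc}"
    if proc.returncode != 0:
        # Keep only the start of stderr so the manifest stays readable.
        return f"exit code {proc.returncode}: {proc.stderr.strip()[:200]}"
    return None
```

Returning a string instead of raising lets the convert stage record the reason in the manifest and continue with the remaining writers.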

Labels

enhancement, tracking-issue
