feat(spark): add Iceberg + MinIO integration example #501
Signed-off-by: digvijay-y <144053736+digvijay-y@users.noreply.github.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver in each of the affected files.
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
Adds a local-development example for using Spark with Apache Iceberg backed by MinIO (S3-compatible) and an Iceberg REST catalog.
Changes:
- Added a runnable PySpark example that creates, writes, and reads an Iceberg table using MinIO + REST catalog.
- Added a docker-compose stack to spin up MinIO, initialize a warehouse bucket, and run the Iceberg REST catalog.
- Documented setup/run/config/teardown steps in the Spark examples README.
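For readers unfamiliar with the setup, the Spark-side configuration for an Iceberg REST catalog backed by MinIO can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the function name, the default endpoints (`localhost:8181`, `localhost:9000`), the warehouse path, and the credentials are all assumptions; only the `lakehouse` catalog name and the `s3.secret-access-key` key appear in the diff under review.

```python
def iceberg_rest_catalog_conf(
    catalog: str = "lakehouse",
    rest_uri: str = "http://localhost:8181",    # assumed REST catalog address
    s3_endpoint: str = "http://localhost:9000",  # assumed MinIO address
    warehouse: str = "s3://warehouse/",          # assumed bucket name
    access_key: str = "minioadmin",              # assumed local-only credentials
    secret_key: str = "minioadmin",
) -> dict[str, str]:
    """Build Spark conf entries for an Iceberg REST catalog that stores data in MinIO."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg SQL extensions (procedures, MERGE INTO, and so on).
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        # Registers the catalog and points it at the REST catalog service.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": rest_uri,
        f"{prefix}.warehouse": warehouse,
        # Uses Iceberg's S3 FileIO against the MinIO endpoint.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        # MinIO is normally addressed path-style rather than virtual-host style.
        f"{prefix}.s3.path-style-access": "true",
        f"{prefix}.s3.access-key-id": access_key,
        f"{prefix}.s3.secret-access-key": secret_key,
    }
```

Each key/value pair would be passed to whatever session builder the example ends up using; keeping the settings in a plain dictionary makes them independent of how the session is created.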
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| examples/spark/iceberg_minio.py | New end-to-end Iceberg + MinIO example script with Spark session configuration and basic table operations |
| examples/spark/docker-compose-iceberg-minio.yml | New docker-compose stack for MinIO + bucket init + Iceberg REST catalog |
| examples/spark/README.md | Documents how to run the Iceberg + MinIO example and its configuration |
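The "basic table operations" mentioned in the table can be pictured as a short sequence of Spark SQL statements. The sketch below only generates the statements; the namespace `demo`, the table `events`, and the two-column schema are hypothetical names chosen for illustration and are not taken from the PR.

```python
def iceberg_demo_statements(
    catalog: str = "lakehouse",
    namespace: str = "demo",   # hypothetical namespace
    table: str = "events",     # hypothetical table
) -> list[str]:
    """Return SQL for the create-namespace, create-table, write, read flow."""
    fq_table = f"{catalog}.{namespace}.{table}"
    return [
        f"CREATE NAMESPACE IF NOT EXISTS {catalog}.{namespace}",
        f"CREATE TABLE IF NOT EXISTS {fq_table} (id BIGINT, name STRING) USING iceberg",
        f"INSERT INTO {fq_table} VALUES (1, 'a'), (2, 'b')",
        f"SELECT id, name FROM {fq_table} ORDER BY id",
    ]


# With an active session named `spark` (not created here), the flow would be:
#     for stmt in iceberg_demo_statements():
#         spark.sql(stmt).show()
```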
tariq-hasan left a comment
Hi @digvijay-y! Thanks for raising the PR. This is a great start. I left a few high-level comments for now. We can go deeper on the review once we align on the approach.
    #!/usr/bin/env python3
    # Copyright 2025 The Kubeflow Authors.
    #
The copyright should be year-less as per the new guidelines.
Suggested change:

    - #!/usr/bin/env python3
    - # Copyright 2025 The Kubeflow Authors.
    - #
    + #!/usr/bin/env python3
    + # Copyright The Kubeflow Authors.
    + #
        .config("spark.sql.catalog.lakehouse.s3.secret-access-key", MINIO_SECRET_KEY)
        .getOrCreate()
    )
The test uses pyspark directly. We need to run the test through the Kubeflow Spark SDK.
The value of this example for #470 is proving the Iceberg path works through the SDK's SparkConnect path end-to-end. A local SparkSession validates Iceberg+Spark but exercises no Kubeflow code, so it can't go in the example harness.
    ```bash
    pip install pyspark==3.5.0
    ```
The pyspark version should flow directly from pyproject.toml. Since the SDK targets Spark 4.0 (Scala 2.13), the Iceberg runtime must be iceberg-spark-runtime-4.0_2.13, and there should be no separate pyspark install at all.
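To illustrate the point about versions flowing from one place, here is a small sketch that derives the Iceberg runtime Maven coordinate from a Spark version string instead of hard-coding it. The helper names are hypothetical, and the Iceberg release used in the usage comment (`1.10.0`) is an assumption that should be checked against the Iceberg release notes for Spark 4.0 support.

```python
from importlib import metadata
from typing import Optional


def installed_pyspark_version() -> Optional[str]:
    """Return the installed pyspark version, or None if pyspark is not installed."""
    try:
        return metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        return None


def iceberg_runtime_coordinate(
    spark_version: str,
    iceberg_version: str,
    scala_version: str = "2.13",
) -> str:
    """Build the iceberg-spark-runtime coordinate matching a Spark version.

    Only the major.minor part of the Spark version is used in the artifact name.
    """
    parts = spark_version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected a version like '4.0.1', got {spark_version!r}")
    spark_minor = ".".join(parts[:2])
    return (
        "org.apache.iceberg:"
        f"iceberg-spark-runtime-{spark_minor}_{scala_version}:{iceberg_version}"
    )


# Example (Iceberg version is an assumption):
#     iceberg_runtime_coordinate("4.0.1", iceberg_version="1.10.0")
```

With this shape, bumping the Spark version in pyproject.toml changes the artifact name automatically, and only the Iceberg release remains an explicit choice.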
    if __name__ == "__main__":
        main()
The example needs to be integrated with the test harness so that the e2e tests verify that it works.
          CATALOG_S3_PATH__STYLE__ACCESS: "true"
        depends_on:
          minio:
            condition: service_healthy
Ideally we would like to have Iceberg/MinIO as part of e2e setup or similar.
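If Iceberg/MinIO does become part of the e2e setup, the harness would need something equivalent to compose's `service_healthy` condition before running the example. A minimal sketch of such a readiness gate is shown below; the function names are hypothetical, and the MinIO address in the usage comment assumes the default port 9000 on localhost.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx HTTP error.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> bool:
    """Call the probe repeatedly until it succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example, using MinIO's liveness endpoint (host and port are assumptions):
#     wait_until_ready(lambda: http_ok("http://localhost:9000/minio/health/live"))
```

Taking the probe as a callable keeps the waiting logic testable without a running MinIO instance, and the same gate can be reused for the REST catalog.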
    print(f"\nGenerated range with {df.count()} rows across {num_executors} executors")
    print(f"Session name: {session_name}")
    df.show(10)
We should split the changes to spark_connect_simple.py into its own PR as they do not touch the Iceberg MinIO example.
What this PR does / why we need it:
Adds a runnable local development example demonstrating SparkClient with Apache Iceberg and MinIO as S3-compatible object storage.
Includes:
- iceberg_minio.py: end-to-end example that creates a namespace and a table, then writes and reads data
- docker-compose-iceberg-minio.yml: one-command local setup
- README.md with setup instructions

Tested with:
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #500
Checklist: