
feat(spark): add Iceberg + MinIO integration example#501

Open
digvijay-y wants to merge 2 commits into
kubeflow:mainfrom
digvijay-y:examples/iceberg-minio
@digvijay-y
Contributor

What this PR does / why we need it:
Adds a runnable local development example demonstrating SparkClient with Apache Iceberg and MinIO as S3-compatible object storage.

Includes:

  • iceberg_minio.py — end-to-end: create namespace, table, write and read data
  • docker-compose-iceberg-minio.yml — one-command local setup
  • Updated README.md with setup instructions
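The create/write/read round trip listed above can be sketched as a sequence of Spark SQL statements. This is only an illustration, not the contents of iceberg_minio.py: the catalog name `lakehouse` appears in the script's configuration, while the namespace `demo`, the table `events`, and the two-column schema are hypothetical placeholders.

```python
def iceberg_demo_statements(catalog: str, namespace: str, table: str) -> list[str]:
    """Return Spark SQL statements for a create/write/read round trip."""
    fq_table = f"{catalog}.{namespace}.{table}"
    return [
        # Namespace first, so the table has somewhere to live.
        f"CREATE NAMESPACE IF NOT EXISTS {catalog}.{namespace}",
        # Hypothetical two-column schema, stored as an Iceberg table.
        f"CREATE TABLE IF NOT EXISTS {fq_table} (id BIGINT, name STRING) USING iceberg",
        # Write a couple of rows.
        f"INSERT INTO {fq_table} VALUES (1, 'alice'), (2, 'bob')",
        # Read them back in a stable order.
        f"SELECT id, name FROM {fq_table} ORDER BY id",
    ]
```

Each statement would be passed to `spark.sql(...)` on a session that already has the Iceberg catalog configured, for example `iceberg_demo_statements("lakehouse", "demo", "events")`.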

Tested with:

  • iceberg-spark-runtime 1.9.1
  • AWS SDK v2 (software.amazon.awssdk:bundle:2.26.24)
  • PySpark 3.5.0
  • MinIO + tabulario/iceberg-rest

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #500

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: digvijay-y <144053736+digvijay-y@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 24, 2026 08:34
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a local-development example for using Spark with Apache Iceberg backed by MinIO (S3-compatible) and an Iceberg REST catalog.

Changes:

  • Added a runnable PySpark example that creates, writes, and reads an Iceberg table using MinIO + REST catalog.
  • Added a docker-compose stack to spin up MinIO, initialize a warehouse bucket, and run the Iceberg REST catalog.
  • Documented setup/run/config/teardown steps in the Spark examples README.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| examples/spark/iceberg_minio.py | New end-to-end Iceberg + MinIO example script with Spark session configuration and basic table operations |
| examples/spark/docker-compose-iceberg-minio.yml | New docker-compose stack for MinIO + bucket init + Iceberg REST catalog |
| examples/spark/README.md | Documents how to run the Iceberg + MinIO example and its configuration |

Signed-off-by: digvijay-y <144053736+digvijay-y@users.noreply.github.com>
Member

@tariq-hasan tariq-hasan left a comment

Hi @digvijay-y! Thanks for raising the PR. This is a great start. I left a few high-level comments for now. We can go deeper on the review once we align on the approach.

Comment on lines +1 to +3
#!/usr/bin/env python3
# Copyright 2025 The Kubeflow Authors.
#

The copyright should be year-less as per the new guidelines.

Suggested change
  #!/usr/bin/env python3
- # Copyright 2025 The Kubeflow Authors.
+ # Copyright The Kubeflow Authors.
  #

    .config("spark.sql.catalog.lakehouse.s3.secret-access-key", MINIO_SECRET_KEY)
    .getOrCreate()
)


The test uses pyspark directly. We need to run the test through the Kubeflow Spark SDK.

The value of this example for #470 is proving the Iceberg path works through the SDK's SparkConnect path end-to-end. A local SparkSession validates Iceberg+Spark but exercises no Kubeflow code, so it can't go in the example harness.
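One way to keep the Iceberg settings independent of how the session is created is to build them as a plain dictionary first. The sketch below assumes the standard Iceberg Spark catalog properties for a REST catalog with S3FileIO; only the `lakehouse` catalog name and the `s3.secret-access-key` key are taken from the script under review. Every argument (REST URI, warehouse location, MinIO endpoint, credentials) is a placeholder, and the hand-off to the SDK is deliberately left out because its API is not shown in this thread.

```python
ICEBERG_EXTENSIONS = "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"


def iceberg_rest_catalog_conf(
    catalog: str,
    rest_uri: str,
    warehouse: str,
    s3_endpoint: str,
    access_key: str,
    secret_key: str,
) -> dict[str, str]:
    """Build Spark conf entries for an Iceberg REST catalog backed by S3-compatible storage."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": ICEBERG_EXTENSIONS,
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": rest_uri,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        # MinIO is normally addressed path-style rather than virtual-host style.
        f"{prefix}.s3.path-style-access": "true",
        f"{prefix}.s3.access-key-id": access_key,
        f"{prefix}.s3.secret-access-key": secret_key,
    }
```

Because the result is plain key/value pairs, the same dictionary could feed a local `SparkSession` builder during development and a Spark Connect based session in the harness, without duplicating the catalog settings.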

Comment thread examples/spark/README.md

```bash
pip install pyspark==3.5.0
```

The pyspark version should flow directly from pyproject.toml. Since the SDK targets Spark 4.0 (Scala 2.13), the Iceberg runtime must be iceberg-spark-runtime-4.0_2.13, and there should be no separate pyspark install at all.
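The artifact naming rule behind this comment can be sketched as a small helper that derives the Maven coordinate from the Spark and Scala versions instead of hard-coding it. The function name is hypothetical, and the Iceberg version is left as a caller-supplied value: whether a given Iceberg release actually publishes a `4.0_2.13` runtime has to be checked against the published artifacts.

```python
def iceberg_runtime_coordinate(
    spark_version: str, scala_version: str, iceberg_version: str
) -> str:
    """Derive the iceberg-spark-runtime Maven coordinate from version strings."""
    # The artifact name uses only major.minor of Spark and Scala.
    spark_minor = ".".join(spark_version.split(".")[:2])
    scala_minor = ".".join(scala_version.split(".")[:2])
    artifact = f"iceberg-spark-runtime-{spark_minor}_{scala_minor}"
    return f"org.apache.iceberg:{artifact}:{iceberg_version}"
```

With the versions this PR was tested against, `iceberg_runtime_coordinate("3.5.0", "2.12", "1.9.1")` yields the `3.5_2.12` runtime, while a Spark 4.0 / Scala 2.13 target yields `iceberg-spark-runtime-4.0_2.13`.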



if __name__ == "__main__":
    main()

The example needs to be integrated with the test harness so that the e2e tests verify it works.

      CATALOG_S3_PATH__STYLE__ACCESS: "true"
    depends_on:
      minio:
        condition: service_healthy

Ideally we would like to have Iceberg/MinIO as part of e2e setup or similar.
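If the services do move into the e2e setup, the `service_healthy` gate from docker-compose would need an equivalent on the test side. A minimal sketch, assuming the setup can reach MinIO over HTTP: the helper names are hypothetical, and the probe is injected so the waiting logic can be exercised without a running container.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool], attempts: int = 30, delay: float = 1.0
) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds after each failure."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

A setup step could then call `wait_until_ready(lambda: http_ok("http://localhost:9000/minio/health/live"))`, where the host and port are assumptions about the local deployment and `/minio/health/live` is MinIO's liveness endpoint.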

print(f"\nGenerated range with {df.count()} rows across {num_executors} executors")
print(f"Session name: {session_name}")
df.show(10)


We should split the changes to spark_connect_simple.py into its own PR as they do not touch the Iceberg MinIO example.

Development

Successfully merging this pull request may close these issues.

examples/spark: Add Iceberg + MinIO integration example

3 participants