sparkparse

identify spark bottlenecks without breaking your neck

what it does

sparkparse parses Apache Spark event logs and provides:

an interactive Dash dashboard for exploring query plans, stage timelines, and task metrics
a CLI for parsing logs into structured Polars DataFrames (CSV, Parquet, Delta, JSON output)
a context manager / decorator for capturing logs from an active SparkSession
(planned) an analyze command that produces structured JSON findings for LLM-assisted analysis
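
For orientation, here is a minimal sketch of the input sparkparse works with. Spark writes its event log as one JSON object per line, each with an "Event" field naming the listener event. The snippet below only counts events by type using the standard library. It is not sparkparse's parser, and the function name and sample records are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def count_event_types(lines: Iterable[str]) -> Counter:
    """Count Spark listener events by the value of their "Event" field."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        counts[event.get("Event", "unknown")] += 1
    return counts


# Hypothetical, heavily trimmed sample; real log records carry many more fields.
sample_log = [
    '{"Event": "SparkListenerApplicationStart", "App Name": "demo"}',
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 0}',
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 0}',
    '{"Event": "SparkListenerApplicationEnd", "Timestamp": 0}',
]

counts = count_event_types(sample_log)
print(counts.most_common())
```

The same function accepts an open file handle, since iterating over a text file yields its lines.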

install

pip install sparkparse

usage

CLI

# parse logs and write output files
sparkparse get --log-dir ./logs --out-format parquet

# launch the dashboard
sparkparse viz --log-dir ./logs

# (planned) produce LLM-friendly analysis JSON
sparkparse analyze --log-dir ./logs
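
The commands above read from a log directory, and Spark only writes event logs when event logging is switched on. A minimal sketch of the two standard Spark settings involved is shown below. The helper name event_log_conf is hypothetical, and this README does not say whether sparkparse's capture helpers set these for you, so treat it as background rather than a required step.

```python
def event_log_conf(log_dir: str) -> dict[str, str]:
    """Return the Spark settings that turn on event logging to log_dir."""
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,  # Spark expects this directory to exist
    }


conf = event_log_conf("./logs")
print(conf)

# With PySpark installed, the settings could be applied like this:
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder
#   for key, value in conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```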

context manager

import sparkparse

with sparkparse.capture_context(spark=spark, action="get") as cap:
    df.groupBy("id").count().show()

parsed = cap._parsed_logs  # ParsedLogDataFrames with .dag and .combined

decorator

import sparkparse

@sparkparse.capture(spark=spark, action="get")
def run_job(spark):
    spark.read.parquet("s3://bucket/data/").groupBy("id").count().show()

result, cap = run_job()

design goals

simplified UI that highlights bottlenecks and their causes
node drill-down for detailed information and metric distribution
generation of base models and DataFrames for extensible analysis
LLM-friendly analysis output for automated bottleneck detection
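
To make the bottleneck goal concrete, here is a rough sketch of one common check: flagging stages where the slowest task runs far longer than the median task, which often points to data skew. It reads task durations straight from SparkListenerTaskEnd events ("Stage ID", and "Launch Time" / "Finish Time" under "Task Info"). This is an illustration only and not how sparkparse detects bottlenecks; the function names, the 3x threshold, and the sample durations are all assumptions.

```python
import json
from collections import defaultdict
from statistics import median


def task_durations_by_stage(lines):
    """Collect task durations (ms) per stage from SparkListenerTaskEnd events."""
    durations = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        info = event["Task Info"]
        durations[event["Stage ID"]].append(info["Finish Time"] - info["Launch Time"])
    return dict(durations)


def skewed_stages(durations, ratio=3.0):
    """Return {stage_id: max/median} for stages whose slowest task is an outlier."""
    flagged = {}
    for stage_id, values in durations.items():
        mid = median(values)
        if mid > 0 and max(values) / mid >= ratio:
            flagged[stage_id] = max(values) / mid
    return flagged


def fake_task_end(stage_id, duration_ms):
    """Build a trimmed, made-up task-end record for the demo below."""
    return json.dumps({
        "Event": "SparkListenerTaskEnd",
        "Stage ID": stage_id,
        "Task Info": {"Launch Time": 1000, "Finish Time": 1000 + duration_ms},
    })


# Stage 0 has one straggler; stage 1 is balanced.
sample = [fake_task_end(0, d) for d in (100, 110, 900)]
sample += [fake_task_end(1, d) for d in (100, 120)]

durations = task_durations_by_stage(sample)
flagged = skewed_stages(durations)
print(flagged)
```

A ratio against the median is used rather than the mean so that a single straggler does not inflate its own baseline.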

development

See IMPLEMENTATION.md for the planned improvement roadmap.

# install with dev dependencies
uv sync --dev

# lint
uv run ruff check sparkparse/

# test (excludes Spark-dependent integration tests)
uv run pytest tests/ --ignore=tests/test.py --ignore=tests/test_capture.py -v

TODOs

structured node details like project columns and scan sources
task box plots on hover
metric capture via context manager / decorator
hotspot highlighting by metrics other than duration (spill, records, etc.)
analyze command with LLM-friendly JSON output
reading from cloud storage
ruff + pyrefly CI
