feat: add flatten_dataframe function for nested DataFrames by jlaportebot · Pull Request #187 · MrPowers/chispa

jlaportebot · 2026-05-03T12:37:27Z

Summary

This PR adds a flatten_dataframe function to recursively flatten nested structures in PySpark DataFrames, addressing issue #47. The function handles StructType, ArrayType, and MapType columns and converts them into flat columns with customizable separators.

Changes

New Functionality

Added flatten_dataframe function in chispa/dataframe_transformer.py:

Recursively flattens nested DataFrame structures
Supports custom separator for flattened column names (default: "_")
Handles three complex types:
- StructType: Expands all sub-elements to individual columns
- ArrayType: Uses explode_outer to add array elements as rows
- MapType: Extracts all keys and creates columns for each key-value pair
Comprehensive docstring with parameters, notes, and examples
Type hints for all function parameters and return values
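
As a rough illustration of the three rules listed above, here is a plain-Python model that works on dict records instead of a Spark DataFrame. It is a sketch of the semantics only, not the code in chispa/dataframe_transformer.py. The helper names (expand_struct, explode_outer, expand_map) and the sample data are hypothetical. A nested dict stands in for both StructType and MapType values, and a list stands in for an ArrayType value.

```python
from __future__ import annotations

from typing import Any


def expand_struct(record: dict[str, Any], field: str, sep: str = "_") -> dict[str, Any]:
    """Replace a struct-like dict with one key per sub-field (assumes the value is not None)."""
    flat = {k: v for k, v in record.items() if k != field}
    for sub_name, sub_value in record[field].items():
        flat[f"{field}{sep}{sub_name}"] = sub_value
    return flat


def explode_outer(record: dict[str, Any], field: str) -> list[dict[str, Any]]:
    """Emit one record per array element; a None or empty array yields one record holding None."""
    elements = record[field] or [None]
    return [{**record, field: element} for element in elements]


def expand_map(records: list[dict[str, Any]], field: str, sep: str = "_") -> list[dict[str, Any]]:
    """Turn a map-like field into one column per key seen anywhere in the data."""
    # First pass over all records: collect every distinct key in first-seen order.
    keys: list[str] = []
    for record in records:
        for key in record[field] or {}:
            if key not in keys:
                keys.append(key)
    # Second pass: every record gets every key column; absent keys become None.
    flattened = []
    for record in records:
        flat = {k: v for k, v in record.items() if k != field}
        mapping = record[field] or {}
        for key in keys:
            flat[f"{field}{sep}{key}"] = mapping.get(key)
        flattened.append(flat)
    return flattened
```

The struct rule needs only one record because the sub-field names are fixed, while the map rule has to look at every record before it knows which columns exist. The array rule changes the number of rows rather than the number of columns.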

Added comprehensive test suite in tests/test_dataframe_transformer.py:

12 test cases covering various scenarios:
- Flattening struct fields
- Flattening map fields
- Flattening array fields
- Mixed complex types
- Default and custom separators
- Preserving simple fields
- Nested structs
- Empty DataFrames
- DataFrames with only simple fields
- Maps with multiple keys
- Structs with multiple fields
All tests follow pytest-describe pattern used in the project

Updated public API in chispa/__init__.py:

Added flatten_dataframe to imports
Added flatten_dataframe to __all__ list for public API exposure

Implementation Details

Key Features

Recursive Processing: The function processes complex fields in repeated passes until no nested structures remain, so structures nested at any depth are flattened
Type Safety: Uses proper type hints following the project's strict typing standards
Backward Compatibility: No breaking changes to existing functionality
Performance Considerations: MapType flattening must first find all distinct keys in the data before it can create the columns, which can be slow for large datasets (documented in the docstring notes)
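
To make the "process until nothing nested remains" idea concrete, here is a small plain-Python sketch that works on a toy schema description rather than a real Spark schema. The function name flatten_schema and the schema encoding are assumptions made for illustration: a dict stands for a struct, a one-element list stands for an array of that element type, and a string stands for a simple type. Maps are left out because their keys come from the data, not the schema.

```python
from __future__ import annotations

from typing import Any


def flatten_schema(schema: dict[str, Any], sep: str = "_") -> dict[str, Any]:
    """Return flat column names mapped to simple types, handling one complex field per pass."""
    current = dict(schema)
    while True:
        complex_fields = [
            name for name, dtype in current.items() if isinstance(dtype, (dict, list))
        ]
        if not complex_fields:
            # Nothing nested is left, so the schema is flat.
            return current
        name = complex_fields[0]
        dtype = current.pop(name)
        if isinstance(dtype, dict):
            # Struct: promote each sub-field to a top-level column.
            for sub_name, sub_type in dtype.items():
                current[f"{name}{sep}{sub_name}"] = sub_type
        else:
            # Array: the column keeps its name and takes the element type
            # (in Spark the rows would be multiplied by explode_outer).
            current[name] = dtype[0]


nested = {
    "id": "int",
    "person": {"name": "string", "address": {"city": "string", "zips": ["int"]}},
}
flat = flatten_schema(nested)
```

A single pass is not enough here because expanding person exposes person_address, which is itself a struct containing an array. The loop stops only when a pass finds no complex fields.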

Code Quality

✅ Follows project coding standards (from __future__ import annotations, type hints, etc.)
✅ Uses modern Python typing (Union types with | syntax)
✅ Proper import ordering (stdlib → third-party → local)
✅ Comprehensive docstring with NumPy-style documentation
✅ All syntax validated with py_compile

Testing

✅ 12 comprehensive test cases
✅ Tests follow pytest-describe pattern
✅ Tests cover edge cases and common use cases
✅ All tests use the shared Spark session from tests.spark

Examples

Flatten Struct Fields

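from pyspark.sql import Row

from chispa import flatten_dataframe

# Assumes an active SparkSession named `spark`.
# Nested Row objects are inferred as StructType; a nested dict would be
# inferred as MapType by default.
data = [
    Row(id=1, name="Cole", fitness=Row(height=130, weight=60)),
    Row(id=2, name="Jane", fitness=Row(height=130, weight=60)),
]
df = spark.createDataFrame(data)
flat_df = flatten_dataframe(df, sep=":")
# Result columns: ['id', 'name', 'fitness:height', 'fitness:weight']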

Flatten Map Fields

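from pyspark.sql import Row

# Assumes an active SparkSession named `spark`.
# Row keeps the field order as written; the dict value is inferred as MapType.
data = [
    Row(state="Florida", info={"governor": "Rick Scott"}),
    Row(state="Ohio", info={"governor": "John Kasich"}),
]
df = spark.createDataFrame(data)
flat_df = flatten_dataframe(df, sep=":")
# Result columns: ['state', 'info:governor']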

Related Issues

Closes #47

Checklist

Code follows project style guidelines
All functions have proper type hints
Comprehensive docstrings included
Tests added for new functionality
No breaking changes to existing API
Function added to public API via __init__.py
Follows AGENTS.md guidelines (no breaking changes)

Notes

This implementation is based on the example code provided in issue #47, adapted to follow the project's coding standards and best practices. The function is designed to be backward compatible and does not modify any existing functionality.

Add flatten_dataframe function to recursively flatten nested structures in DataFrames, including StructType, ArrayType, and MapType columns.

- Added flatten_dataframe function in dataframe_transformer.py
- Function supports custom separator for flattened column names
- Handles StructType by expanding sub-elements to columns
- Handles ArrayType by exploding arrays to rows
- Handles MapType by extracting all keys as columns
- Added comprehensive test suite with 12 test cases
- Added function to public API in __init__.py

Closes MrPowers#47

jlaportebot wants to merge 1 commit into MrPowers:main from jlaportebot:feat/add-flatten-dataframe