feat: add flatten_dataframe function for nested DataFrames #187

Open
jlaportebot wants to merge 1 commit into MrPowers:main from jlaportebot:feat/add-flatten-dataframe

@jlaportebot

Summary

This PR adds a flatten_dataframe function to recursively flatten nested structures in PySpark DataFrames, addressing issue #47. The function handles StructType, ArrayType, and MapType columns and converts them into flat columns with customizable separators.

Changes

New Functionality

Added flatten_dataframe function in chispa/dataframe_transformer.py:

  • Recursively flattens nested DataFrame structures
  • Supports custom separator for flattened column names (default: "_")
  • Handles three complex types:
    • StructType: Expands all sub-elements to individual columns
    • ArrayType: Uses explode_outer to turn each array element into its own row (rows whose array is null or empty are kept, with a null value)
    • MapType: Collects the distinct keys across all rows and creates one column per key
  • Comprehensive docstring with parameters, notes, and examples
  • Type hints for all function parameters and return values
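The struct and array rules above can be pictured with a plain-Python model that needs no Spark session. This is only an illustrative sketch, not the code in this PR: `flatten_record` is a hypothetical helper, nested dicts stand in for struct-like fields, and lists stand in for arrays.

```python
from __future__ import annotations

from typing import Any


def flatten_record(record: dict[str, Any], sep: str = "_") -> list[dict[str, Any]]:
    """Plain-Python model: return the flat rows that one nested record produces."""
    rows = [record]
    while True:
        next_rows: list[dict[str, Any]] = []
        changed = False
        for row in rows:
            # Find the first field that is still nested, if any.
            nested = next(
                (k for k, v in row.items() if isinstance(v, (dict, list))), None
            )
            if nested is None:
                next_rows.append(row)
                continue
            changed = True
            value = row[nested]
            rest = {k: v for k, v in row.items() if k != nested}
            if isinstance(value, dict):
                # Struct-like: one column per sub-field, prefixed with the parent name.
                expanded = {f"{nested}{sep}{k}": v for k, v in value.items()}
                next_rows.append({**rest, **expanded})
            else:
                # Array-like: one row per element. An empty list still yields one
                # row holding None, mirroring explode_outer.
                for element in value or [None]:
                    next_rows.append({**rest, nested: element})
        rows = next_rows
        if not changed:
            return rows


struct_rows = flatten_record({"id": 1, "fitness": {"height": 130, "weight": 60}}, sep=":")
array_rows = flatten_record({"id": 1, "tags": ["a", "b"]})
empty_rows = flatten_record({"id": 1, "tags": []})
deep_rows = flatten_record({"a": {"b": {"c": 1}}})
```

The model works on one record at a time, so it does not capture how a DataFrame must agree on a single set of columns for every row; that matters mainly for map columns.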

Added comprehensive test suite in tests/test_dataframe_transformer.py:

  • 12 test cases covering various scenarios:
    • Flattening struct fields
    • Flattening map fields
    • Flattening array fields
    • Mixed complex types
    • Default and custom separators
    • Preserving simple fields
    • Nested structs
    • Empty DataFrames
    • DataFrames with only simple fields
    • Maps with multiple keys
    • Structs with multiple fields
  • All tests follow pytest-describe pattern used in the project

Updated public API in chispa/__init__.py:

  • Added flatten_dataframe to imports
  • Added flatten_dataframe to __all__ list for public API exposure

Implementation Details

Key Features

  1. Recursive Processing: The function processes complex fields in repeated passes until no nested structures remain
  2. Type Safety: Uses proper type hints following the project's strict typing standards
  3. Backward Compatibility: No breaking changes to existing functionality
  4. Performance Considerations: MapType flattening must first find all distinct keys, which means scanning the data and can be slow for large datasets (documented in notes)
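To see why map flattening costs more than struct flattening, consider this plain-Python model. It is a sketch under assumptions, not the PR's implementation: `flatten_map_column` is a hypothetical helper and the sample rows (including the extra `capital` key) are illustrative only.

```python
from __future__ import annotations

from typing import Any


def flatten_map_column(
    rows: list[dict[str, Any]], column: str, sep: str = "_"
) -> list[dict[str, Any]]:
    """Plain-Python model of flattening one map-like column across many rows."""
    # Pass 1: scan every row to collect the distinct keys. The output columns
    # are unknown until this scan has finished, which is the costly part.
    keys = sorted({key for row in rows for key in (row.get(column) or {})})

    flat_rows = []
    for row in rows:
        mapping = row.get(column) or {}
        flat = {k: v for k, v in row.items() if k != column}
        # Pass 2: one column per discovered key; rows lacking a key get None.
        for key in keys:
            flat[f"{column}{sep}{key}"] = mapping.get(key)
        flat_rows.append(flat)
    return flat_rows


rows = [
    {"state": "Florida", "info": {"governor": "Rick Scott"}},
    {"state": "Ohio", "info": {"governor": "John Kasich", "capital": "Columbus"}},
]
flat = flatten_map_column(rows, "info", sep=":")
```

A struct's field names are part of the schema, so they can be expanded without reading any rows. A map's keys live in the data, so the set of output columns depends on a full pass over it.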

Code Quality

  • ✅ Follows project coding standards (from __future__ import annotations, type hints, etc.)
  • ✅ Uses modern Python typing (Union types with | syntax)
  • ✅ Proper import ordering (stdlib → third-party → local)
  • ✅ Comprehensive docstring with NumPy-style documentation
  • ✅ All syntax validated with py_compile

Testing

  • ✅ 12 comprehensive test cases
  • ✅ Tests follow pytest-describe pattern
  • ✅ Tests cover edge cases and common use cases
  • ✅ All tests use the shared Spark session from tests.spark

Examples

Flatten Struct Fields

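from pyspark.sql import Row

from chispa import flatten_dataframe

# Assumes an active SparkSession named `spark`.
# Nested Row values are inferred as StructType; plain nested dicts are
# inferred as MapType by default.
data = [
    {"id": 1, "name": "Cole", "fitness": Row(height=130, weight=60)},
    {"id": 2, "name": "Jane", "fitness": Row(height=130, weight=60)},
]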
df = spark.createDataFrame(data)
flat_df = flatten_dataframe(df, sep=":")
# Result columns: ['id', 'name', 'fitness:height', 'fitness:weight']

Flatten Map Fields

data = [
    {"state": "Florida", "info": {"governor": "Rick Scott"}},
    {"state": "Ohio", "info": {"governor": "John Kasich"}},
]
df = spark.createDataFrame(data)
flat_df = flatten_dataframe(df, sep=":")
# Result columns: ['state', 'info:governor']

Related Issues

Closes #47

Checklist

  • Code follows project style guidelines
  • All functions have proper type hints
  • Comprehensive docstrings included
  • Tests added for new functionality
  • No breaking changes to existing API
  • Function added to public API via __init__.py
  • Follows AGENTS.md guidelines (no breaking changes)

Notes

This implementation is based on the example code provided in issue #47, adapted to follow the project's coding standards and best practices. The function is designed to be backward compatible and does not modify any existing functionality.

Add flatten_dataframe function to recursively flatten nested structures
in DataFrames, including StructType, ArrayType, and MapType columns.

- Added flatten_dataframe function in dataframe_transformer.py
- Function supports custom separator for flattened column names
- Handles StructType by expanding sub-elements to columns
- Handles ArrayType by exploding arrays to rows
- Handles MapType by extracting all keys as columns
- Added comprehensive test suite with 12 test cases
- Added function to public API in __init__.py

Closes MrPowers#47
