Added trasformation function with unit test cases #847

Sbnikitha · 2025-11-13T06:56:41Z

Implemented transformation functions
Cleaning
to_lower – changes all letters in the text to lowercase.
strip_whitespace – removes spaces from the beginning and end of the text.
squash_whitespace – replaces multiple spaces between words with a single space.
normalize_unicode – fixes and standardizes special or accented characters.
remove_punctuation – removes punctuation marks like commas, periods, and question marks.
map_values – replaces a value using a given dictionary or mapping.
cast_numeric – converts text or other types into numbers safely.
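The cleaning helpers above can be sketched with the standard library alone. This is a minimal illustration based only on the one-line descriptions, not the code in this PR; the `None` pass-through, the `NFKC` default, and the ASCII-only punctuation set are assumptions.

```python
import re
import string
import unicodedata
from typing import Optional


def to_lower(s: Optional[str]) -> Optional[str]:
    # Assumed behavior: None passes through unchanged.
    return None if s is None else s.lower()


def strip_whitespace(s: Optional[str]) -> Optional[str]:
    return None if s is None else s.strip()


def squash_whitespace(s: Optional[str]) -> Optional[str]:
    # Collapse every run of whitespace into a single space.
    return None if s is None else re.sub(r"\s+", " ", s)


def normalize_unicode(s: Optional[str], form: str = "NFKC") -> Optional[str]:
    # NFKC is an assumed default; the PR may use a different form.
    return None if s is None else unicodedata.normalize(form, s)


def remove_punctuation(s: Optional[str]) -> Optional[str]:
    # Removes ASCII punctuation only (string.punctuation).
    if s is None:
        return None
    return s.translate(str.maketrans("", "", string.punctuation))


raw = "  Hello,   WORLD!  "
cleaned = remove_punctuation(strip_whitespace(squash_whitespace(to_lower(raw))))
print(cleaned)
```

Chaining small single-purpose functions like this keeps each step testable on its own.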
Date Transformations
try_parse_date – checks if something is a date and returns it.
extract_date_parts – gives the year, month, day, and weekday from a date.
floor_to_month – changes the date to the first day of the same month.
ceil_to_month – changes the date to the first day of the next month.
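A rough sketch of how the date helpers could behave, using only `datetime`. The signatures, the ISO-8601 string parsing in `try_parse_date`, the dictionary returned by `extract_date_parts`, and the zeroing of the time component are all assumptions made for illustration.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def try_parse_date(value: Any) -> Optional[datetime]:
    # Assumption: datetimes pass through, ISO-8601 strings are parsed,
    # anything else yields None.
    if isinstance(value, datetime):
        return value
    if isinstance(value, str):
        try:
            return datetime.fromisoformat(value)
        except ValueError:
            return None
    return None


def extract_date_parts(dt: datetime) -> Dict[str, int]:
    # weekday(): Monday is 0, Sunday is 6.
    return {"year": dt.year, "month": dt.month, "day": dt.day, "weekday": dt.weekday()}


def floor_to_month(dt: datetime) -> datetime:
    # First day of the same month, at midnight (assumed).
    return dt.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def ceil_to_month(dt: datetime) -> datetime:
    # First day of the next month; December rolls over into January.
    first = floor_to_month(dt)
    if first.month == 12:
        return first.replace(year=first.year + 1, month=1)
    return first.replace(month=first.month + 1)


d = datetime(2024, 12, 15, 10, 30)
print(floor_to_month(d), ceil_to_month(d))
```

The December case is the one worth testing explicitly, since a naive `month + 1` would raise `ValueError`.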
Imputation functions
ImputationReport – keeps a small report showing which method was used to fill missing data.
_numeric_skewness – checks how much the numeric data is skewed (not evenly spread).
choose_imputation_strategy – decides whether to fill missing values using the mean, median, or mode.
compute_imputation_value – actually finds the mean, median, or mode value to use for filling.
fill_nulls_column – fills missing values in one column with the chosen method.
fill_nulls_record – fills missing values in a full record (row) using sample data and gives a small report.
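To make the strategy selection concrete, here is a minimal sketch of choosing between mean, median, and mode. It assumes population skewness as the measure and a `skew_threshold` default of 1.0; both are illustrative choices, and the PR's actual thresholds and signatures may differ.

```python
import statistics
from typing import Any, Optional, Sequence


def numeric_skewness(values: Sequence[float]) -> float:
    # Population skewness: third central moment divided by std cubed.
    n = len(values)
    if n < 3:
        return 0.0
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    third = sum((v - mean) ** 3 for v in values) / n
    return third / var ** 1.5


def choose_imputation_strategy(
    series: Sequence[Any],
    numeric: Optional[bool] = None,
    skew_threshold: float = 1.0,  # assumed default
) -> str:
    present = [v for v in series if v is not None]
    if numeric is None:
        numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
    if not numeric:
        return "mode"
    # Strongly skewed data: the median is less sensitive to the long tail.
    return "median" if abs(numeric_skewness(present)) > skew_threshold else "mean"


def compute_imputation_value(series: Sequence[Any], strategy: str) -> Any:
    present = [v for v in series if v is not None]
    if not present:
        return None
    if strategy == "mean":
        return statistics.fmean(present)
    if strategy == "median":
        return statistics.median(present)
    if strategy == "mode":
        return statistics.mode(present)
    raise ValueError(f"unknown strategy: {strategy}")


symmetric = [1, 2, 3, None, 4]
skewed = [1, 1, 1, 1, 100]
print(choose_imputation_strategy(symmetric), choose_imputation_strategy(skewed))
```

In this sketch the symmetric series falls back to the mean, while the single large outlier in `skewed` pushes the skewness above the threshold and selects the median.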

Math Functions
minmax_scale – scales a number to a new range, usually between 0 and 1.
zscore – finds how far a number is from the average in terms of standard deviation.
clip – keeps a number within a lower and upper limit.
winsorize – limits extreme values to reduce outliers (similar to clip).
log1p_safe – safely applies a log(1+x) transformation without errors.
bucketize – puts a number into a range or group (a bucket).
robust_percentile_scale – scales data between percentiles to reduce outlier effects.
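The scalar math helpers listed above might look roughly like the following. Parameter names, the midpoint fallback when `data_max == data_min`, the `0.0` result for a zero standard deviation, and the use of `bisect_right` for bucketing are assumptions for illustration only.

```python
import bisect
from typing import Sequence, Tuple, Union

Number = Union[int, float]


def clip(x: Number, low: Number, high: Number) -> Number:
    return max(low, min(high, x))


def winsorize(x: Number, low: Number, high: Number) -> Number:
    # Same arithmetic as clip; the bounds are typically percentiles.
    return clip(x, low, high)


def minmax_scale(
    x: Number,
    data_min: Number,
    data_max: Number,
    out_range: Tuple[Number, Number] = (0.0, 1.0),
) -> float:
    a, b = out_range
    if data_max == data_min:
        # Assumed fallback: midpoint of the output range.
        return float(a + (b - a) / 2.0)
    return float((x - data_min) / (data_max - data_min) * (b - a) + a)


def zscore(x: Number, mean: Number, std: Number) -> float:
    # Assumed guard: a zero standard deviation yields 0.0.
    return 0.0 if std == 0 else float((x - mean) / std)


def bucketize(x: Number, edges: Sequence[Number]) -> int:
    # Index of the bucket x falls into, given sorted edges.
    return bisect.bisect_right(list(edges), x)


print(minmax_scale(5, 0, 10), zscore(12, 10, 2), bucketize(5, [0, 10, 20]))
```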

Summary by CodeRabbit

Release Notes

New Features
- Introduced transforms utilities library with functions for string cleaning (lowercasing, whitespace normalization, punctuation removal), date operations (parsing, part extraction, month floor/ceil), numeric scaling and transformations, and data imputation strategies.
- Unified public API for transforms accessible via airbyte_cdk.utils.transforms.
Documentation
- Added comprehensive development guide for the Airbyte Python CDK.
Tests
- Added comprehensive test coverage for all transforms utilities.

coderabbitai · 2025-11-13T07:01:36Z

📝 Walkthrough

Walkthrough

This PR adds a comprehensive suite of data transformation utilities to the Airbyte CDK across four modules: mathematical scaling functions, string normalization utilities, date handling functions, and data imputation logic. A new __init__.py consolidates these into a unified public API. Full test coverage is provided for each module, and a development guide is added for AI agent reference.

Changes

Cohort / File(s)	Change Summary
Documentation `.github/copilot-instructions.md`	Adds a comprehensive development guide for the Airbyte Python CDK detailing project overview, core components, data flow, development conventions, and common workflows. Serves as inline documentation for AI agents.
Math Transforms `airbyte_cdk/utils/transforms/math.py`, `airbyte_cdk/test/utils/transforms/test_math.py`	Introduces seven numeric transformation utilities: minmax_scale, zscore, clip, winsorize, log1p_safe, bucketize, and robust_percentile_scale. Includes comprehensive test coverage for typical usage and edge cases (e.g., division-by-zero, boundary conditions).
String Cleaning Transforms `airbyte_cdk/utils/transforms/cleaning.py`, `airbyte_cdk/test/utils/transforms/test_cleaning.py`	Adds seven string normalization and type-casting utilities: to_lower, strip_whitespace, squash_whitespace, normalize_unicode, remove_punctuation, map_values, and cast_numeric. Tests cover typical and edge cases including None handling and error modes.
Date Transforms `airbyte_cdk/utils/transforms/date.py`, `airbyte_cdk/test/utils/transforms/test_date.py`	Provides four date handling functions: try_parse_date, extract_date_parts, floor_to_month, and ceil_to_month. Tests validate datetime handling, edge cases, and month-boundary behavior.
Imputation Transforms `airbyte_cdk/utils/transforms/impute.py`, `airbyte_cdk/test/utils/transforms/test_impute.py`	Implements data imputation utilities including ImputationReport dataclass, strategy selection logic, and per-column/record-level filling functions. Tests cover strategy inference, numeric skewness detection, and multi-column imputation workflows.
Public API `airbyte_cdk/utils/transforms/__init__.py`	Consolidates and re-exports 24 symbols (functions and classes) from math, cleaning, date, and impute submodules via a centralized `__all__` list, establishing a unified public interface.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant fill_nulls_record
    participant choose_imputation_strategy
    participant compute_imputation_value
    
    Caller->>fill_nulls_record: record, columns, samples, strategies
    
    loop For each column
        alt explicit strategy provided
            fill_nulls_record->>compute_imputation_value: series, strategy
        else infer strategy
            fill_nulls_record->>choose_imputation_strategy: series, numeric, skew_threshold
            choose_imputation_strategy-->>fill_nulls_record: strategy ("mean"/"median"/"mode")
            fill_nulls_record->>compute_imputation_value: series, strategy
        end
        
        compute_imputation_value-->>fill_nulls_record: imputation_value
        fill_nulls_record->>fill_nulls_record: apply value if field is None<br/>create ImputationReport
    end
    
    fill_nulls_record-->>Caller: updated_record, [ImputationReport, ...]
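The flow in the diagram can be sketched in Python as follows. The `ImputationReport` fields and the two private helpers are hypothetical stand-ins (simplified versions of `choose_imputation_strategy` and `compute_imputation_value`); only the overall loop structure is taken from the diagram.

```python
import statistics
from dataclasses import dataclass
from typing import Any, Dict, List, Mapping, Optional, Sequence, Tuple


@dataclass
class ImputationReport:
    # Hypothetical fields; the PR's dataclass may differ.
    field: str
    strategy: str
    value: Any


def _infer_strategy(series: Sequence[Any]) -> str:
    # Simplified stand-in for choose_imputation_strategy.
    present = [v for v in series if v is not None]
    numeric = bool(present) and all(isinstance(v, (int, float)) for v in present)
    return "mean" if numeric else "mode"


def _compute(series: Sequence[Any], strategy: str) -> Any:
    # Simplified stand-in for compute_imputation_value.
    present = [v for v in series if v is not None]
    if not present:
        return None
    if strategy == "mean":
        return statistics.fmean(present)
    if strategy == "median":
        return statistics.median(present)
    return statistics.mode(present)


def fill_nulls_record(
    record: Mapping[str, Any],
    columns: Sequence[str],
    samples: Mapping[str, Sequence[Any]],
    strategies: Optional[Mapping[str, str]] = None,
) -> Tuple[Dict[str, Any], List[ImputationReport]]:
    out = dict(record)  # do not mutate the caller's record
    reports: List[ImputationReport] = []
    for col in columns:
        series = samples.get(col, [])
        # An explicit strategy wins; otherwise infer one from the samples.
        strategy = (strategies or {}).get(col) or _infer_strategy(series)
        value = _compute(series, strategy)
        if out.get(col) is None:
            out[col] = value
        reports.append(ImputationReport(col, strategy, value))
    return out, reports


record = {"age": None, "city": "Oslo"}
samples = {"age": [10, 20, 30, None], "city": ["Oslo", "Rome", "Oslo"]}
filled, reports = fill_nulls_record(record, ["age", "city"], samples)
print(filled, [r.strategy for r in reports])
```

Returning a new dictionary plus a list of reports keeps the function free of side effects, which makes the per-column decisions easy to assert on in tests.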

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Heterogeneous changes across four distinct utility modules with different purposes and logic densities requiring separate reasoning for each
Moderate complexity in impute.py — strategy selection logic, skewness calculation, and multi-column coordination warrant careful review
Edge case handling — division-by-zero safeguards, None propagation, and error modes in cast_numeric and numeric functions should be verified
No structural changes to existing code, purely additive, which reduces overall complexity
Comprehensive test coverage provides confidence but tests themselves need validation

Consider focusing extra attention on:

The numeric skewness calculation and strategy selection thresholds in impute.py — do the defaults align with intended use cases, wdyt?
Error handling consistency across modules, particularly how None is propagated vs. raising exceptions
The cast_numeric error modes ("default", "none", "raise", and implicit ignore) — is the behavior clear and complete, wdyt?
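One way to make the `cast_numeric` error modes explicit is sketched below. The parameter names `on_error` and `default`, the `None` pass-through, and the exact fallback behavior are assumptions; the point is only that each mode, including the implicit "ignore" case, gets a visible branch.

```python
from typing import Any


def cast_numeric(value: Any, on_error: str = "ignore", default: Any = None) -> Any:
    """Convert value to int or float.

    on_error modes (hypothetical names):
      "raise"   re-raise the conversion error
      "none"    return None
      "default" return the supplied default
      anything else ("ignore") returns the original value unchanged
    """
    if value is None:
        return None  # assumed: None passes through
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value
    try:
        text = str(value).strip()
        try:
            return int(text)
        except ValueError:
            return float(text)
    except ValueError:
        if on_error == "raise":
            raise
        if on_error == "none":
            return None
        if on_error == "default":
            return default
        return value


print(cast_numeric("42"), cast_numeric(" 3.5 "), cast_numeric("abc"))
```

If the "ignore" mode really does hand back the original value, the return annotation has to admit non-numeric types, which is relevant to the `comparison-overlap` errors reported by the linter for the cleaning tests.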

Suggested labels

enhancement

Suggested reviewers

maxi297
brianjlai

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title mentions 'trasformation function' (with a typo) and 'unit test cases', which broadly aligns with the PR's addition of multiple transformation utilities and comprehensive tests across math, cleaning, date, and imputation modules.

coderabbitai

Actionable comments posted: 14

🧹 Nitpick comments (4)

airbyte_cdk/utils/transforms/__init__.py (1)
17-29: Consider adding spaces after commas in __all__ for consistency?

The __all__ list items are missing spaces after commas. While this works fine, adding spaces would align with PEP 8 style conventions and improve readability. Wdyt?

Apply this diff if you'd like to improve the formatting:
 __all__ = [
     # math
-    "minmax_scale","zscore","clip","winsorize","log1p_safe",
-    "bucketize","robust_percentile_scale",
+    "minmax_scale", "zscore", "clip", "winsorize", "log1p_safe",
+    "bucketize", "robust_percentile_scale",
     # cleaning
-    "to_lower","strip_whitespace","squash_whitespace",
-    "normalize_unicode","remove_punctuation","map_values","cast_numeric",
+    "to_lower", "strip_whitespace", "squash_whitespace",
+    "normalize_unicode", "remove_punctuation", "map_values", "cast_numeric",
     # date
-    "try_parse_date","extract_date_parts","floor_to_month","ceil_to_month",
+    "try_parse_date", "extract_date_parts", "floor_to_month", "ceil_to_month",
     # impute
-    "ImputationReport","choose_imputation_strategy",
-    "compute_imputation_value","fill_nulls_column","fill_nulls_record",
+    "ImputationReport", "choose_imputation_strategy",
+    "compute_imputation_value", "fill_nulls_column", "fill_nulls_record",
 ]
airbyte_cdk/utils/transforms/math.py (3)
7-11: Consider potential floating-point precision issues with equality check?

Line 9 uses == to compare floats (data_max == data_min). While this usually works when the same values are passed in, floating-point arithmetic can sometimes lead to precision issues. Would using a small epsilon for comparison be more robust, or is exact equality the intended behavior here? Wdyt?
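For illustration, a tolerance-based variant could use `math.isclose` instead of `==`. This is a sketch, not a proposed patch: the `rel_tol` parameter and the midpoint fallback are assumptions, and exact equality may well be the intended behavior.

```python
import math
from typing import Tuple


def minmax_scale(
    x: float,
    data_min: float,
    data_max: float,
    out_range: Tuple[float, float] = (0.0, 1.0),
    rel_tol: float = 1e-12,  # hypothetical tolerance parameter
) -> float:
    a, b = out_range
    lo, hi = float(data_min), float(data_max)
    # Treat a nearly zero-width input range as degenerate.
    if math.isclose(hi, lo, rel_tol=rel_tol, abs_tol=0.0):
        return a + (b - a) / 2.0  # assumed fallback: midpoint
    return (float(x) - lo) / (hi - lo) * (b - a) + a


# 0.1 + 0.2 differs from 0.3 by one rounding error, so == sees a nonzero width.
almost_equal_max = 0.1 + 0.2
print(almost_equal_max == 0.3, minmax_scale(0.3, 0.3, almost_equal_max))
```

With exact equality, the same call would divide by a width of roughly 5.6e-17 and snap the result to one end of the output range.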

22-28: Document the error-handling behavior of log1p_safe?

The function returns the original value float(x) when an exception occurs (line 28). While this "safe" pattern prevents crashes, users might not expect this behavior. Adding a docstring to clarify when and why the original value is returned would help. Wdyt about adding documentation for this?
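A docstring along these lines would cover the fallback. The body is a sketch reconstructed from the review comment; the set of caught exceptions is an assumption.

```python
import math
from typing import Union

Number = Union[int, float]


def log1p_safe(x: Number) -> float:
    """Return log(1 + x), or float(x) unchanged if the log cannot be computed.

    math.log1p raises ValueError for x <= -1. Instead of propagating the
    error, this function returns the input as a float, so callers should be
    aware that out-of-domain values come back untransformed.
    """
    try:
        return math.log1p(float(x))
    except (ValueError, OverflowError):  # assumed exception set
        return float(x)


print(log1p_safe(0), log1p_safe(-2))
```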

36-50: Consider using a local variable instead of reassigning the parameter?

On line 46, the parameter x is reassigned when clip_outliers=True. While this works, it can make the code slightly harder to follow. Would you consider using a local variable like scaled_x to preserve the original parameter? This could improve clarity. Wdyt?

Example:
def robust_percentile_scale(
    x: Number,
    p_low_value: Number,
    p_high_value: Number,
    out_range: Tuple[Number, Number] = (0.0, 1.0),
    clip_outliers: bool = True,
) -> float:
    a, b = out_range
    lo, hi = float(p_low_value), float(p_high_value)
    # Leave the parameter x untouched and work on a local copy instead.
    scaled_x = clip(float(x), lo, hi) if clip_outliers else float(x)
    width = hi - lo
    if width == 0:
        # Degenerate percentile range: return the midpoint of out_range.
        return float(a + (b - a) / 2.0)
    return float(((scaled_x - lo) / width) * (b - a) + a)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8764296 and f08ebad.

⛔ Files ignored due to path filters (1)

.DS_Store is excluded by !**/.DS_Store

📒 Files selected for processing (10)

.github/copilot-instructions.md (1 hunks)
airbyte_cdk/test/utils/transforms/test_cleaning.py (1 hunks)
airbyte_cdk/test/utils/transforms/test_date.py (1 hunks)
airbyte_cdk/test/utils/transforms/test_impute.py (1 hunks)
airbyte_cdk/test/utils/transforms/test_math.py (1 hunks)
airbyte_cdk/utils/transforms/__init__.py (1 hunks)
airbyte_cdk/utils/transforms/cleaning.py (1 hunks)
airbyte_cdk/utils/transforms/date.py (1 hunks)
airbyte_cdk/utils/transforms/impute.py (1 hunks)
airbyte_cdk/utils/transforms/math.py (1 hunks)

🧰 Additional context used

🧠 Learnings (3)

📚 Learning: 2024-12-11T16:34:46.319Z

Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

.github/copilot-instructions.md

📚 Learning: 2024-11-15T01:04:21.272Z

Learnt from: aaronsteers
Repo: airbytehq/airbyte-python-cdk PR: 58
File: airbyte_cdk/cli/source_declarative_manifest/_run.py:62-65
Timestamp: 2024-11-15T01:04:21.272Z
Learning: The files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from another repository, and changes to these files should be minimized or avoided when possible to maintain consistency.

Applied to files:

.github/copilot-instructions.md

📚 Learning: 2024-12-11T16:34:46.319Z

Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.

Applied to files:

.github/copilot-instructions.md

🧬 Code graph analysis (5)

airbyte_cdk/test/utils/transforms/test_cleaning.py (1)

airbyte_cdk/utils/transforms/cleaning.py (7)

to_lower (7-8)

strip_whitespace (10-11)

squash_whitespace (13-16)

normalize_unicode (18-19)

remove_punctuation (22-25)

map_values (27-28)

cast_numeric (30-44)

airbyte_cdk/test/utils/transforms/test_date.py (1)

airbyte_cdk/utils/transforms/date.py (4)

try_parse_date (4-8)

extract_date_parts (10-14)

floor_to_month (16-20)

ceil_to_month (22-28)

airbyte_cdk/test/utils/transforms/test_impute.py (1)

airbyte_cdk/utils/transforms/impute.py (6)

_numeric_skewness (17-26)

choose_imputation_strategy (28-45)

compute_imputation_value (47-62)

fill_nulls_column (64-72)

fill_nulls_record (74-94)

ImputationReport (11-15)

airbyte_cdk/utils/transforms/__init__.py (4)

airbyte_cdk/utils/transforms/math.py (7)

minmax_scale (7-11)

zscore (13-14)

clip (16-17)

winsorize (19-20)

log1p_safe (22-28)

bucketize (30-34)

robust_percentile_scale (36-50)

airbyte_cdk/utils/transforms/cleaning.py (7)

to_lower (7-8)

strip_whitespace (10-11)

squash_whitespace (13-16)

normalize_unicode (18-19)

remove_punctuation (22-25)

map_values (27-28)

cast_numeric (30-44)

airbyte_cdk/utils/transforms/date.py (4)

try_parse_date (4-8)

extract_date_parts (10-14)

floor_to_month (16-20)

ceil_to_month (22-28)

airbyte_cdk/utils/transforms/impute.py (5)

ImputationReport (11-15)

choose_imputation_strategy (28-45)

compute_imputation_value (47-62)

fill_nulls_column (64-72)

fill_nulls_record (74-94)

airbyte_cdk/test/utils/transforms/test_math.py (1)

airbyte_cdk/utils/transforms/math.py (7)

minmax_scale (7-11)

zscore (13-14)

clip (16-17)

winsorize (19-20)

log1p_safe (22-28)

bucketize (30-34)

robust_percentile_scale (36-50)

🪛 GitHub Actions: Linters

airbyte_cdk/test/utils/transforms/test_cleaning.py

[error] 13-13: Function is missing a return type annotation [no-untyped-def]

[error] 13-13: Use "-> None" if function does not return a value

[error] 28-28: Function is missing a return type annotation [no-untyped-def]

[error] 28-28: Use "-> None" if function does not return a value

[error] 43-43: Function is missing a return type annotation [no-untyped-def]

[error] 43-43: Use "-> None" if function does not return a value

[error] 59-59: Function is missing a return type annotation [no-untyped-def]

[error] 59-59: Use "-> None" if function does not return a value

[error] 78-78: Function is missing a return type annotation [no-untyped-def]

[error] 78-78: Use "-> None" if function does not return a value

[error] 96-96: Function is missing a return type annotation [no-untyped-def]

[error] 96-96: Use "-> None" if function does not return a value

[error] 113-113: Function is missing a return type annotation [no-untyped-def]

[error] 113-113: Use "-> None" if function does not return a value

[error] 131-131: Non-overlapping equality check (left operand type: "int | float | None", right operand type: "Literal['']") [comparison-overlap]

[error] 132-132: Non-overlapping equality check (left operand type: "int | float | None", right operand type: "Literal[' ']") [comparison-overlap]

[error] 136-136: Non-overlapping equality check (left operand type: "int | float | None", right operand type: "str") [comparison-overlap]
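Both error kinds in this file have a short fix, sketched below with a hypothetical `parse_int` function rather than the PR's own tests. Adding `-> None` addresses `no-untyped-def`, and asserting against a value the declared return type can actually hold avoids `comparison-overlap`, which mypy reports when strict equality checking is enabled.

```python
from typing import Optional


def parse_int(text: str) -> Optional[int]:
    # Hypothetical function under test.
    try:
        return int(text)
    except ValueError:
        return None


def test_parse_int_valid() -> None:  # explicit "-> None" satisfies no-untyped-def
    assert parse_int("7") == 7


def test_parse_int_blank() -> None:
    # Writing `parse_int("") == ""` would compare Optional[int] with str,
    # which mypy flags as a non-overlapping equality check.
    assert parse_int("") is None


test_parse_int_valid()
test_parse_int_blank()
```

If the function under test can genuinely return a string, the better fix is to widen its return annotation so the tests and the signature agree.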

airbyte_cdk/test/utils/transforms/test_date.py

[error] 11-11: Function is missing a return type annotation [no-untyped-def]

[error] 11-11: Use "-> None" if function does not return a value

[error] 22-22: Function is missing a return type annotation [no-untyped-def]

[error] 22-22: Use "-> None" if function does not return a value

[error] 39-39: Function is missing a return type annotation [no-untyped-def]

[error] 39-39: Call to untyped function "floor_to_month" in typed context [no-untyped-call]

[error] 46-46: Call to untyped function "floor_to_month" in typed context [no-untyped-call]

[error] 50-50: Call to untyped function "floor_to_month" in typed context [no-untyped-call]

[error] 53-53: Call to untyped function "floor_to_month" in typed context [no-untyped-call]

[error] 54-54: Call to untyped function "floor_to_month" in typed context [no-untyped-call]

airbyte_cdk/test/utils/transforms/test_impute.py

[error] 12-12: Function is missing a return type annotation [no-untyped-def]

[error] 12-12: Use "-> None" if function does not return a value

[error] 26-26: Function is missing a return type annotation [no-untyped-def]

[error] 26-26: Use "-> None" if function does not return a value

[error] 68-68: Argument 3 to "fill_nulls_record" has incompatible type "dict[str, object]"; expected "Mapping[str, Sequence[Any]]" [arg-type]

[error] 107-107: Argument 3 to "fill_nulls_record" has incompatible type "dict[str, object]"; expected "Mapping[str, Sequence[Any]]" [arg-type]

[error] 115-115: Argument 3 to "fill_nulls_record" has incompatible type "dict[str, object]"; expected "Mapping[str, Sequence[Any]]" [arg-type]

[error] 46-46: Function is missing a return type annotation [no-untyped-def]

[error] 46-46: Use "-> None" if function does not return a value
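The `arg-type` errors above typically appear when a dictionary literal mixes value types and mypy infers `dict[str, object]`. One possible fix, assuming that is the cause here, is to annotate the test data explicitly. The `count_missing` function is a hypothetical stand-in for any function that accepts `Mapping[str, Sequence[Any]]`.

```python
from typing import Any, Dict, Mapping, Sequence


def count_missing(samples: Mapping[str, Sequence[Any]]) -> Dict[str, int]:
    # Hypothetical consumer with the same parameter type as in the error message.
    return {name: sum(v is None for v in values) for name, values in samples.items()}


# The explicit annotation stops mypy from inferring dict[str, object]
# for a literal whose values are lists of different element types.
samples: Dict[str, Sequence[Any]] = {
    "age": [10, None, 30],
    "city": ["Oslo", "Rome", None],
}

missing = count_missing(samples)
print(missing)
```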

airbyte_cdk/utils/transforms/__init__.py

[error] 1-1: I001 Import block is un-sorted or un-formatted. Organize imports.

[error] 15-15: I001 Import block is un-sorted or un-formatted. Organize imports.

airbyte_cdk/utils/transforms/date.py

[error] 4-4: Function is missing a return type annotation [no-untyped-def]

[error] 10-10: Function is missing a type annotation for one or more arguments [no-untyped-def]

[error] 16-16: Function is missing a type annotation [no-untyped-def]

[error] 22-22: Function is missing a type annotation [no-untyped-def]

airbyte_cdk/test/utils/transforms/test_math.py

[error] 14-14: Function is missing a return type annotation [no-untyped-def]

[error] 14-14: Use "-> None" if function does not return a value

[error] 33-33: Function is missing a return type annotation [no-untyped-def]

[error] 33-33: Use "-> None" if function does not return a value

[error] 45-45: Function is missing a return type annotation [no-untyped-def]

[error] 45-45: Use "-> None" if function does not return a value