A Python command-line tool for scrubbing sensitive data from FOCUS billing files with consistent, reproducible mappings.
⚠️ DISCLAIMER: This tool provides automated scrubbing of common PII patterns in billing data, but cannot guarantee complete removal of all personally identifiable information. The output data should always be reviewed by a human before sharing or publishing to ensure all sensitive information has been properly anonymized. Users are responsible for verifying the scrubbed data meets their security and privacy requirements.
- Account ID Scrubbing: Consistently maps account IDs (numeric, UUIDs, ARNs) across all columns
- Name Anonymization: Replaces account names with stellar-themed generated names
- Date Shifting: Shifts date/datetime values by a configurable number of days
- Commitment Discount IDs: Intelligently scrubs complex IDs containing embedded account numbers and UUIDs
- Tag Scrubbing: Scrambles tag values while preserving keys by default; optionally scrambles keys as well
- Resource ID Scrubbing: Handles AWS ARNs, Azure Resource IDs, and OCI OCIDs with intelligent pattern-based scrambling
Metric columns (costs, quantities, usage amounts) are intentionally left unchanged. This is a deliberate design decision for the following reasons:
- Interdependency Complexity: Metric columns are mathematically related (e.g., `BilledCost = ConsumedQuantity × UnitPrice`). Adjusting one metric would require proportional adjustments to maintain consistency, which is extremely complex across all FOCUS metric relationships.
- Reversibility Risk: Any systematic scaling of metrics could potentially be reversed by correlating the scrubbed dataset with publicly available pricing data, providing a false sense of security without actual data protection.
- Data Integrity: Altering metrics would break the mathematical consistency of the dataset, making it unsuitable for realistic testing, validation, or development scenarios where accurate cost calculations are required.
- Limited PII Value: Unlike account IDs and names, aggregate cost and usage metrics typically don't constitute personally identifiable information on their own.
Recommendation: If metric values are sensitive in your use case, consider using a representative subset of your data, applying additional manual obfuscation, or generating synthetic data with realistic distributions rather than relying on automated metric scrubbing.
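The reversibility risk above can be made concrete with a small sketch. Assuming an attacker knows one public unit price that appears in the data (the function name and numbers below are illustrative, not part of the tool):

```python
# Sketch: why uniformly scaling metrics is reversible (illustrative only).
def recover_scale_factor(scrubbed_billed_cost: float,
                         scrubbed_quantity: float,
                         public_unit_price: float) -> float:
    """Infer a hidden uniform scale factor from one known unit price."""
    # In FOCUS data, BilledCost = ConsumedQuantity * UnitPrice, so the
    # implied unit price in the scrubbed data divided by the public
    # price reveals any uniform scale factor applied to costs.
    implied_unit_price = scrubbed_billed_cost / scrubbed_quantity
    return implied_unit_price / public_unit_price

# Original row: 100 units at $0.10/unit -> $10.00 billed.
# "Scrubbed" by scaling all costs by a secret factor of 3.7:
factor = recover_scale_factor(scrubbed_billed_cost=37.0,
                              scrubbed_quantity=100.0,
                              public_unit_price=0.10)
print(factor)  # approximately 3.7: the secret factor is recovered
```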
- Component-Level Mappings: Centralized mapping engine ensures consistency across all columns
  - `NumberId`: Maps numeric IDs (e.g., 12-digit AWS account IDs)
  - `UUID`: Maps UUIDs to new random UUIDs
  - `Name`: Maps names to stellar-themed names (e.g., "Nebula Alpha")
  - `ProfileCode`: Maps dash-separated codes preserving structure
- Same account ID (e.g., `658755425446`) maps to the same value whether it appears:
  - As a standalone `SubAccountId`
  - Embedded in a `CommitmentDiscountId` ARN
  - In any other account-related column
- Export mappings to ensure consistency across multiple processing runs
- Load mappings from previous runs to maintain referential integrity
- Input formats: `.csv`, `.csv.gz`, `.parquet`
- Output formats: `csv-gzip`, `parquet`, `sql`
- Process single files or entire directories
- Preserves directory structure in output
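Input format handling presumably keys off the file extension. A small sketch of that dispatch (an assumption about how the io logic might classify files, not the tool's actual code):

```python
from pathlib import Path

def detect_input_format(path: Path) -> str:
    """Classify a FOCUS input file by extension (illustrative sketch)."""
    name = path.name.lower()
    if name.endswith(".csv.gz"):
        return "csv-gzip"   # check the double suffix before plain .csv
    if name.endswith(".csv"):
        return "csv"
    if name.endswith(".parquet"):
        return "parquet"
    raise ValueError(f"Unsupported input format: {path}")

print(detect_input_format(Path("billing.csv.gz")))  # csv-gzip
```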
- `focus_scrub/focus_scrub/cli.py` - CLI entrypoint
- `focus_scrub/focus_scrub/io.py` - File discovery + read/write logic
- `focus_scrub/focus_scrub/scrub.py` - Deterministic column replacement engine
- `focus_scrub/focus_scrub/handlers.py` - Reusable handler registry + dataset-to-column mapping
- `focus_scrub/focus_scrub/mapping/` - Mapping infrastructure
  - `engine.py` - Central MappingEngine for consistent component mappings
  - `collector.py` - MappingCollector for tracking column-level mappings
Handlers delegate to the shared MappingEngine to ensure consistency:
- AccountIdHandler: Decomposes complex values (ARNs), extracts components (account IDs, UUIDs), and maps each via the engine
- StellarNameHandler: Maps account names to stellar-themed names
- CommitmentDiscountIdHandler: Delegates to AccountIdHandler with shared engine
- DateReformatHandler: Shifts dates by configured number of days
Any column without a configured handler is passed through unchanged.
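The date-shifting behavior of DateReformatHandler can be sketched in a few lines (the function below is illustrative; the handler's real parsing and formatting may differ):

```python
from datetime import datetime, timedelta

def shift_iso_datetime(value: str, days: int) -> str:
    """Shift an ISO 8601 timestamp by a number of days (sketch)."""
    dt = datetime.fromisoformat(value)
    # Re-serialize so the formatting convention is preserved.
    return (dt + timedelta(days=days)).isoformat()

print(shift_iso_datetime("2024-01-15T00:00:00", 30))  # 2024-02-14T00:00:00
```

Shifting every date column by the same offset preserves relative ordering and billing-period durations, which is why a single `--date-shift-days` value suffices.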
- Install Poetry:

```shell
curl -sSL https://install.python-poetry.org | python3 -
```

- Install project dependencies:

```shell
poetry install
```

Process files without exporting mappings:

```shell
poetry run focus-scrub <input_path> <output_path> --dataset CostAndUsage
```

Export mappings for reuse in subsequent runs:

```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --export-mappings mappings.json
```

The exported JSON contains:

- `column_mappings`: Per-column old→new value mappings
- `component_mappings`: Component-level mappings (NumberId, UUID, Name, ProfileCode)
Reuse mappings from a previous run to ensure consistency:

```shell
poetry run focus-scrub input2/ output2/ \
  --dataset CostAndUsage \
  --load-mappings mappings.json
```

Shift all date columns by 30 days:

```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --date-shift-days 30
```

Specify output format (default is parquet):

```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --output-format csv-gzip
```

Available formats:

- `parquet` (default): Apache Parquet columnar format
- `csv-gzip`: Compressed CSV files
- `sql`: SQL INSERT statements for database loading
The SQL output format generates CREATE TABLE DDL and bulk INSERT statements optimized for database loading:
```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --output-format sql
```

Features:

- Generates CREATE TABLE IF NOT EXISTS statement with inferred column types
- Adds an auto-incrementing `id` column as primary key
- Generates bulk INSERT statements (1000 rows per batch)
- Automatically derives table name from filename (sanitizes hyphens, spaces, periods → underscores)
- Properly escapes single quotes in string values
- Handles NULL values correctly
- Includes row count in header comments
Example output:

```sql
-- FOCUS Scrubbed Data
-- Table: focus_data
-- Rows: 2500

CREATE TABLE IF NOT EXISTS focus_data (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    BillingAccountId TEXT,
    BillingAccountName TEXT,
    Cost DOUBLE PRECISION
);

INSERT INTO focus_data (BillingAccountId, BillingAccountName, Cost)
VALUES
    ('123456789012', 'Nebula Alpha', 100.50),
    ('987654321098', 'Stellar Beta', 250.75),
    ...;

INSERT INTO focus_data (BillingAccountId, BillingAccountName, Cost)
VALUES
    ...
```

Column type mapping:

- Integer columns → `BIGINT`
- Float/Double columns → `DOUBLE PRECISION`
- Boolean columns → `BOOLEAN`
- DateTime columns → `TIMESTAMP`
- Date columns → `DATE`
- All other types → `TEXT`
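The mapping table above amounts to a simple dtype lookup with a `TEXT` fallback. A sketch (function name and dtype strings are illustrative, assuming pandas-style dtype names):

```python
def sql_type_for(dtype: str) -> str:
    """Map a column dtype name to a SQL type per the table above (sketch)."""
    mapping = {
        "int64": "BIGINT",
        "float64": "DOUBLE PRECISION",
        "bool": "BOOLEAN",
        "datetime64[ns]": "TIMESTAMP",
        "date": "DATE",
    }
    # Everything unrecognized falls back to TEXT.
    return mapping.get(dtype, "TEXT")

print(sql_type_for("float64"))  # DOUBLE PRECISION
```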
Use case: Ideal for loading scrubbed data directly into databases (PostgreSQL, MySQL, SQLite, etc.) using standard SQL import tools. The CREATE TABLE statement ensures the table exists before inserting data, and the auto-incrementing primary key provides a unique identifier for each row.
By default, the SQL table name is derived from the input filename. You can specify a custom table name using the --sql-table-name option:
```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --output-format sql \
  --sql-table-name my_billing_data
```

This will generate SQL with your custom table name:

```sql
CREATE TABLE IF NOT EXISTS my_billing_data (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    ...
);

INSERT INTO my_billing_data (...)
VALUES ...;
```

Note: The custom table name is still sanitized (hyphens, spaces, periods → underscores) to ensure SQL compatibility.
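The sanitization rule described above (hyphens, spaces, periods → underscores) is a one-line substitution. A sketch, not the tool's actual code:

```python
import re

def sanitize_table_name(name: str) -> str:
    """Replace hyphens, periods, and spaces with underscores (sketch)."""
    return re.sub(r"[-. ]", "_", name)

print(sanitize_table_name("my-billing.data 2024"))  # my_billing_data_2024
```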
Remove custom columns (starting with x_ or oci_) from the output:
```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --remove-custom-columns
```

Note: The FOCUS spec states that custom columns should start with `x_`, but OCI uses the `oci_` prefix. Both patterns are recognized and removed when this option is enabled.
By default, the x_Discounts column is automatically dropped from the output. This is a known data generator custom column that is commonly removed.
To keep x_Discounts, explicitly pass an empty list:
```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --drop-columns
```

To drop additional specific columns:

```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --drop-columns x_Discounts x_CustomField x_AnotherField
```

This is useful when you want to remove specific custom columns but keep others. For example, to drop only x_CustomField while keeping x_Discounts:

```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --drop-columns x_CustomField
```

You can combine `--drop-columns` with `--remove-custom-columns`, though the latter will remove all custom columns anyway:

```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --remove-custom-columns \
  --drop-columns x_Discounts  # Redundant but harmless
```

By default, the Tags column handler preserves tag keys and only scrubs the values. To also scrub the tag keys:
```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --scrub-tag-keys
```

Default behavior (preserve keys):
```
Input:  {"environment":"production","owner":"team-alpha"}
Output: {"environment":"qspevdujpo","owner":"ufbn-bmqib"}
```
With --scrub-tag-keys (scrub both keys and values):
```
Input:  {"environment":"production","owner":"team-alpha"}
Output: {"fowjsponfou":"qspevdujpo","pxofs":"ufbn-bmqib"}
```
Note: When `--scrub-tag-keys` is enabled, the same key will consistently map to the same scrambled key across all tags and all rows, ensuring referential integrity is maintained.
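The sample outputs above are consistent with shifting each letter forward one position in the alphabet while leaving punctuation untouched. This sketch reproduces that behavior for illustration; the tool's actual scrambling algorithm may differ:

```python
def scramble(text: str) -> str:
    """Shift each lowercase letter forward one place, wrapping z -> a."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + 1) % 26 + ord("a")))
        else:
            out.append(ch)  # keep '-', digits, etc. unchanged
    return "".join(out)

print(scramble("team-alpha"))  # ufbn-bmqib
```

Note that a fixed letter shift is trivially reversible; it hides values from casual inspection, not from a determined reader.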
To only shift dates without scrubbing any other sensitive data, use the --dates-only option. This is useful for re-dating FOCUS datasets for testing or analysis while preserving the original data:
```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --date-shift-days 30 \
  --dates-only
```

What happens:
- Date columns are shifted by the specified number of days
- All other columns (account IDs, names, tags, etc.) remain unchanged
- Useful for time-based testing without data scrubbing
Example use case:
```shell
# Re-date a dataset to current month for demo purposes
poetry run focus-scrub historical_data/ demo_data/ \
  --dataset CostAndUsage \
  --date-shift-days 365 \
  --dates-only
```

A complete multi-run workflow:

```shell
# First run: Process files and export mappings
poetry run focus-scrub datafiles/AWS datafiles_out/AWS \
  --dataset CostAndUsage \
  --output-format parquet \
  --date-shift-days 30 \
  --export-mappings mappings/aws_mappings.json

# Second run: Process more files using same mappings
poetry run focus-scrub datafiles/AWS_batch2 datafiles_out/AWS_batch2 \
  --dataset CostAndUsage \
  --load-mappings mappings/aws_mappings.json
```

Available datasets:

- `CostAndUsage` - Standard cost and usage data
- `ContractCommitment` - Contract commitment data
In handlers.py:
- Register the handler factory in `HANDLER_FACTORIES`:

```python
HANDLER_FACTORIES: dict[str, HandlerFactory] = {
    "DateReformat": _build_date_reformat_handler,
    "AccountId": _build_account_id_handler,
    "StellarName": _build_stellar_name_handler,
    "YourNewHandler": _build_your_new_handler,
}
```

- Map columns to handlers in `DATASET_COLUMN_HANDLER_NAMES`:

```python
"CostAndUsage": {
    "BillingAccountId": "AccountId",
    "BillingAccountName": "StellarName",
    "YourColumn": "YourNewHandler",
}
```
- MappingEngine creates consistent mappings for primitive components:
  - Numeric IDs always map to the same random numeric ID
  - UUIDs always map to the same random UUID
  - Names always map to the same stellar name
- Handlers decompose complex values and use the engine for each component:
  - ARN `arn:aws:ec2:us-east-1:658755425446:reserved-instances/uuid` →
    - Account `658755425446` → maps via `engine.map_number_id()`
    - UUID → maps via `engine.map_uuid()`
    - Result: `arn:aws:ec2:us-east-1:752426551655:reserved-instances/new-uuid`
- Consistency is maintained because the engine remembers all mappings:
  - Same input value always produces the same output
  - Works across all columns and all files in a run
  - Can be exported and reloaded for future runs
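The decomposition steps above can be sketched in a few lines. The regexes and plain dicts below stand in for the real MappingEngine and are illustrative only:

```python
import re

# 12-digit account IDs and dashed hex UUIDs, as embedded in AWS ARNs.
ACCOUNT_RE = re.compile(r"\b\d{12}\b")
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)

def scrub_arn(arn: str,
              number_map: dict[str, str],
              uuid_map: dict[str, str]) -> str:
    """Replace embedded account IDs and UUIDs, keeping the ARN shape."""
    arn = ACCOUNT_RE.sub(lambda m: number_map.get(m.group(), m.group()), arn)
    arn = UUID_RE.sub(lambda m: uuid_map.get(m.group(), m.group()), arn)
    return arn

arn = ("arn:aws:ec2:us-east-1:658755425446:reserved-instances/"
       "ed12ad8c-0000-0000-0000-000000000000")
out = scrub_arn(
    arn,
    {"658755425446": "752426551655"},
    {"ed12ad8c-0000-0000-0000-000000000000":
     "c741d6b8-0000-0000-0000-000000000000"},
)
print(out)
```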
Before:

```
BillingAccountId: 658745821254
SubAccountId: 658755425446
BillingAccountName: MyBillingAccount
CommitmentDiscountId: arn:aws:ec2:us-east-1:658755425446:reserved-instances/ed12ad8c-...
```

After:

```
BillingAccountId: 736035721513
SubAccountId: 752426551655
BillingAccountName: Nebula Iota
CommitmentDiscountId: arn:aws:ec2:us-east-1:752426551655:reserved-instances/c741d6b8-...
```
Note: The account ID 658755425446 consistently maps to 752426551655 in both the SubAccountId column and within the CommitmentDiscountId ARN.
Run the test suite:

```shell
make test
# or
poetry run pytest tests/ -v
```

Run tests with coverage report:
```shell
make coverage
```

This will run all tests and display a coverage report in the terminal showing:
- Overall coverage percentage
- Coverage by file
- Missing line numbers
Generate an HTML coverage report:
```shell
make coverage-html
```

This creates an interactive HTML report in `htmlcov/index.html` that you can open in your browser to see detailed line-by-line coverage information.
Current coverage: ~89%
Coverage requirement: The test suite enforces a minimum coverage threshold of 80%. Tests will fail if coverage drops below this threshold, both locally and in CI/CD pipelines. This ensures code quality is maintained as the project evolves.
Run all code quality checks:
```shell
make check
```

This runs:
- Linting (ruff)
- Format checking (ruff format)
- Type checking (mypy)
- Tests with coverage
Format code:
```shell
make format
```

Install pre-commit hooks to automatically run checks before each commit:

```shell
make pre-commit-install
```

Or run manually:

```shell
make pre-commit-run
```

Remove generated files and caches:

```shell
make clean
```

This removes:
- Python cache files (`__pycache__`, `.pyc`)
- Test cache (`.pytest_cache`)
- Type checking cache (`.mypy_cache`)
- Linter cache (`.ruff_cache`)
- Coverage files (`.coverage`, `coverage.xml`, `htmlcov/`)