Releases: databrickslabs/dqx

v0.15.0

13 Jun 20:47
e7b1109

What's Changed

  • Added LLM-generated AI explanations for row-level anomaly detection (#1129). The has_no_row_anomalies check now attaches a plain-language ai_explanation to each flagged row under _dq_info[].anomaly, describing the likely cause, business impact, suggested action, the top contributing features, and the group's size and average severity. Explanations are generated via the Spark ai_query function against a Databricks Model Serving endpoint, with no extra Python dependencies and no driver-side LLM calls. Anomalous rows are grouped by segment and top contributing features so the model is called once per group, keeping cost predictable on large datasets (bounded by max_groups). AI explanations are enabled by default and do not require additional settings. New parameters, all with sensible defaults, include enable_ai_explanation, ai_explanation_llm_model_config, redact_columns (to keep sensitive columns out of the prompt and grouping), and max_groups. If the serving endpoint is unreachable, explanations are left null with a warning and scoring still succeeds. LLMModelConfig also gains max_tokens, temperature, timeout, and max_retries to bound LLM cost and latency and to expose tuning parameters to users when needed.
  • Added stratified sampling to the profiler (#1240). The profiler now accepts a sample_by option to perform stratified sampling based on column values. Users control the sampling fraction with either a single sample_fraction applied equally across all strata, or a dictionary mapping each stratum to its own fraction. When sample_by is omitted, the profiler continues to use uniform sampling across all rows.
  • Added a new row-level check function, is_valid_email, to validate email addresses (#1158). The check validates email addresses against a pragmatic, ReDoS-safe subset of RFC 5321/5322. Like the IP-address checks, it ignores null values (no violation reported).
  • Added geofencing checks (#1164). Five new row-level geospatial checks validate topological relationships between a column geometry and a reference geometry: is_geo_contains, is_geo_covers, is_geo_intersects, is_geo_touches, and is_geo_within. By default they use exact, meter-level precision built on the ST_* family of functions; is_geo_covers and is_geo_intersects additionally support an approximate mode built on H3_* cell indexing with a configurable resolution for faster checks on large datasets. The reference geometry can be a literal WKT/WKB/EWKT/EWKB value or another column, with optional try_to_geometry conversion of either side. Running these checks requires Databricks serverless compute or runtime 17.1 or above.
  • Added support for metrics-only writes (#1236). save_results_in_table and the corresponding workflow path can now persist summary metrics without requiring an output or quarantine table, supporting observability-focused pipelines that only need the metrics table. Batch observations are triggered before metrics are saved so the metrics table is populated correctly, and streaming and no-observer cases now raise explicit errors. Existing configurations with an output or quarantine table are unaffected.
  • Allow custom check failure messages (#1092). DQRule now accepts an optional message_expr parameter that lets users define custom failure messages as a Spark Column or a SQL expression string. The same option is supported for checks defined declaratively in metadata (YAML/JSON), specified as a top-level message_expr key on the check definition alongside criticality and check. When omitted, the default message behavior is preserved; when provided, the custom message replaces the default message for failed rows.
  • Added a Query Results Cookbook and aligned stored check names and fingerprints (#1193). A new reference page provides "copy-paste" SQL and PySpark recipes for querying DQX result tables (summary metrics, output, quarantine, and checks) to trace errors and warnings across runs, rows, and check definitions. To make the cookbook's fingerprint and name joins reliable, checks saved without an explicit name now store the same autogenerated name and name-inclusive rule_fingerprint that apply_checks writes to _errors/_warnings (named checks and for_each_column rules are byte-identical to before). Requesting summary metrics via metrics_config without a configured observer now fails fast with an InvalidParameterError instead of silently skipping the metrics table.
  • Added in-app language switching to DQX Studio (#1172). DQX Studio now ships with four locales (English, Brazilian Portuguese, Italian, and Spanish), selectable from a new Preferences card on the user's Profile page. The choice is persisted per-browser via localStorage with no server-side or table changes, and the change is frontend-only. Non-English translations are AI-assisted and not yet reviewed by native speakers.
  • DQX Studio: replaced the apx build framework with first-party build and dev scripts (#1223). The app no longer depends on the apx package. scripts/build_app.py generates the FastAPI OpenAPI schema, runs orval, builds the frontend with Vite, and produces the application wheel (with a build-tagged local-version segment so successive deploys at the same commit always reinstall fresh code). scripts/dev.py runs uvicorn with reload alongside the Vite dev server, forwarding signals and tearing down both processes together. The bundle and warehouse-grant scripts were updated to support both bundle-managed and external (reuse) SQL warehouse modes. There is no runtime behavior change in the app itself.
  • DQX Studio: added Lakebase storage backend to improve app latency with declarative storage and destroy protection (#1173). Schemas, the wheels volume, and the Lakebase instance and logical database are now declared in the bundle with prevent_destroy lifecycle protection, and make app-bind adopts pre-existing resources. OLTP tables (rules, settings, RBAC, comments, schedules) move to Postgres via a migration runner, while analytical tables (validation runs, profiling, quarantine, metrics) stay on Delta. Error, warning, and input row counts from the DQX observer are now persisted and surfaced in the UI, label badges and label filtering were added to rule selection and scheduling, and a Spark Connect Observation.get mutability bug that overwrote total row counts was fixed.
  • Fixed quarantine-only writes when no output table is configured (#1183). apply_checks_and_save_in_table and apply_checks_by_metadata_and_save_in_table previously raised AttributeError when called with output_config=None and a quarantine_config. output_config is now optional and skipped when unset, so quarantine-only runs write just the invalid records; passing neither configuration raises a clear InvalidParameterError.
  • Allow special characters in catalog and schema names (#1232). The validation regex for storage locations now accepts catalog and schema names that contain characters such as hyphens, which were previously rejected.
  • Fixed installation when the anomaly-detection workflow is absent (#1194). Installation no longer fails when the Anomaly Trainer workflow is not present; its presence is now checked before it is appended to the workflow.
  • Fixed data contract rule generation without the [llm] extra (#1191). DQLLMEngine was imported unconditionally in contract_rules_generator.py purely for a type annotation, causing an ImportError when the [llm] extra was not installed and producing a misleading "install datacontract-cli" error. The import is now guarded behind TYPE_CHECKING, so generate_rules_from_contract(..., process_text_rules=False) works without the [llm] extra.
  • Added an installation wizard reference and promoted DQX Studio as the recommended no-code option (#1229).
  • Added a data drift detection guide to the profiling documentation (#1205).
  • Renamed user to client_id in the LakebaseChecksStorageConfig documentation to match the actual configuration field (#1201).
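
The stratified sampling option from #1240 can be pictured with a small pure-Python sketch. This is an illustration of the idea only, not the profiler's implementation: the function name, the seed argument, and the choice to drop strata missing from the dictionary are assumptions of this example (PySpark's DataFrame.sampleBy treats unspecified strata the same way).

```python
import random


def stratified_sample(rows, sample_by, sample_fraction, seed=42):
    """Keep each row with the probability assigned to its stratum.

    sample_fraction is either one float applied to every stratum or a dict
    mapping a stratum value to its own fraction. Strata that are missing
    from the dict get fraction 0.0 (an assumption of this sketch).
    """
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        stratum = row[sample_by]
        if isinstance(sample_fraction, dict):
            fraction = sample_fraction.get(stratum, 0.0)
        else:
            fraction = sample_fraction
        # rng.random() is in [0.0, 1.0), so 1.0 keeps every row and 0.0 keeps none.
        if rng.random() < fraction:
            sampled.append(row)
    return sampled


rows = [{"region": "EU", "id": i} for i in range(100)]
rows += [{"region": "US", "id": i} for i in range(100)]

# Per-stratum fractions: keep all EU rows, drop all US rows.
eu_only = stratified_sample(rows, "region", {"EU": 1.0, "US": 0.0})
# A single fraction applied equally across all strata.
everything = stratified_sample(rows, "region", 1.0)
```

Fractions strictly between 0 and 1 give an approximate, not exact, share of each stratum, because every row is kept or dropped independently.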

BREAKING CHANGES!

  • Row-level anomaly detection rule now computes SHAP feature contributions by default — enable_contributions defaults to True (was False), adding scoring cost (requires the shap library, already included in the [anomaly] extra). Set enable_contributions=False to restore the previous behaviour. (#1129)
  • Row-level anomaly detection now generates AI explanations by default — enable_ai_explanation defaults to True, so existing anomaly checks will make LLM calls against a Databricks Model Serving endpoint (default databricks-claude-sonnet-4-5) and incur cost. This requires Foundation Model APIs to be available in the workspace; if the endpoint is unreachable, explanations are skipped (null) with a warning rather than failing. Set enable_ai_explanation=False to opt out entirely. (#1129)
  • The `dq...

v0.14.0

05 May 21:17
8bdca67

What's Changed

  • ML-based row-level anomaly detection (#990, #1055, #1062). DQX now offers ML-based row anomaly detection that automatically identifies unusual rows in data without requiring manually specified thresholds, enabling the detection of issues missed by rule-based checks. Users provide recent representative data, and DQX trains an Isolation Forest model that flags rows deviating from typical patterns at scoring time, with auto-discovery of relevant columns and segmentation where appropriate, plus per-row explanations of why a record was flagged. The feature integrates with MLflow for model registry, supports both training and scoring workflows, and complements existing rule-based and aggregate checks.
  • DQX Studio app (Beta): MVP release of the DQX App (#1090) (#1040) (#1050) (#1034). DQX Studio is the no-code Databricks App for authoring and managing data quality rules through a browser-based UI. It provides AI-assisted rule generation, in-app dry-run validation, scheduled rule execution with run history and per-check summary metrics, role-based access control (Admin, Approver, Author, Viewer, plus an orthogonal Runner role) backed by Databricks workspace groups, and a contextual AI assistant integrated into the UI. The bundle provisions all required resources automatically (app, SQL warehouse, task-runner job, schemas, volume) and exposes per-target variables for catalog, admin group, app name, warehouse name, and schema overrides. The app uses On-Behalf-Of (OBO) authentication so end users only see data they can access in Unity Catalog, and validates user-supplied checks with proper HTTP status codes (400 for malformed input). LLM configuration uses the calling user's OBO token on every request to ensure correct identity propagation in the deployed Apps environment.
  • Added AI agent skills for DQX (#1125) (#1056). DQX now ships with Agent Skills under skills/ that teach AI assistants (Databricks Genie Code, Claude Code, Cursor, Copilot, and other tools following the open standard) how to use DQX correctly. The skills cover the public-API capabilities and are accompanied by an AGENTS.md canonical onboarding guide for AI coding agents, with a thin CLAUDE.md redirect for tools that look for it. A new docs guide documents installation and usage for each supported tool.
  • Added has_no_aggr_outliers stateless rolling-window sigma outlier check (#1118). A new dataset-level quality check, has_no_aggr_outliers, has been introduced that detects outliers in time-series aggregates using a stateless rolling-window sigma method. The check is suitable for monitoring metrics such as daily transaction counts, hourly throughput, or any aggregate where deviations from a rolling baseline indicate quality issues, and complements the existing has_no_outliers MAD-based row-level check.
  • Added are_polygons_mutually_disjoint geometry check function (#1061). A new geospatial check, are_polygons_mutually_disjoint, validates whether polygons in a column are mutually disjoint using ST_Intersects. The check supports row_filter, handles nulls and invalid geometries gracefully, and uses native Spark spatial intersections (rather than H3 indexing) for compatibility with Photon's spatial optimizations.
  • Added null-safe support to foreign key check (#1106). The foreign_key check now accepts a null_safe parameter. By default, NULL values in the foreign key columns are ignored (SQL ANSI behavior). When null_safe=True, NULL foreign-key values are matched against NULL reference values. Note: enabling null_safe=True on a previously non-null-safe single-column FK changes the auto-generated rule name (a_not_exists_in_ref_bstruct_a_as_a_not_exists_in_ref_struct_b_as_a) and the violation message format.
  • Added variable substitution support for check definitions (#1078). Check definitions now support {{ placeholder }} syntax for reusable templates, resolved at load time via a new variables parameter on load_checks() and load_checks_from_local_file(), or via default variables passed through ExtraParams at engine construction. The new resolve_variables() utility recursively replaces placeholders in all string fields of check definitions in a single pass and supports scalar types (str, int, float, bool, Decimal, datetime.date, datetime.datetime, datetime.time). Unresolved placeholders are logged as warnings.
  • Added suppress_skipped option and skipped flag for skipped checks (#1063). A new suppress_skipped: bool = False option in ExtraParams allows checks skipped due to missing columns or invalid filters to produce no entry in _errors/_warnings and not cause rows to appear in the invalid DataFrame. Additionally, a new skipped boolean field has been added to dq_result_item_schema so skipped checks can be identified structurally without string-parsing the violation message.
  • Added per-check-name breakdowns to summary metrics (#1097). The DQMetricsObserver now emits a new check_metrics row alongside the existing aggregates (input_row_count, error_row_count, warning_row_count, valid_row_count). The value is a JSON array of structs — one per check — with check_name, error_count, and warning_count, fitting the existing metric_name/metric_value schema without widening it. The change is backward compatible: existing metrics are unchanged and the new row is additive.
  • Added versioning of checks with rule fingerprints (#1044). Checks now carry rule_fingerprint, rule_set_fingerprint, and created_at fields when saved to Delta or Lakebase storage, and rule_set_fingerprint is also stamped on summary metrics so every metric row can be traced back to the exact rule version that produced it. Each save creates a new versioned entry rather than overwriting prior history.
  • Added partition and clustering support for output tables (#1012). The OutputConfig now accepts partition_by and cluster_by fields, allowing users to save DataFrames as partitioned or clustered tables. Liquid clustering is automatically applied the first time checks are saved to a liquid-clustered table, and the integration tests verify both partitioning and clustering behaviour end to end.
  • Added configurable default criticality for profiler job (#1117). The profiler workflow now accepts a parameter to specify the default criticality (error or warn) for generated rules, allowing users to control rule severity at generation time rather than relying on a hardcoded default.
  • Added schema validation rules generation from data contracts (#1043). The data contract rule generator now produces schema-validation rules from ODCS contracts (enabled via generate_schema_validation, defaulting to True), ensuring dataset schemas match contract definitions. A new InvalidPhysicalTypeError provides clearer error handling when physical types are missing or invalid in schema properties.
  • Added end-to-end methods that load checks from storage (#1064). apply_checks_and_save_in_table and apply_checks_by_metadata_and_save_in_table now optionally load checks directly from a storage location (table or file), in addition to the existing option of using preloaded checks. Best-practice documentation has been updated with the recommended end-to-end patterns.
  • Added solution accelerators and industry demos (#1100). New industry-focused accelerators have been added under demos/dqx_demo_industry/: a Banking demo (dqx_banking_demo.py) focused on fraud detection and transaction monitoring, and a rebuilt Fashion demo (dqx_fashion_demo.py) with industry-specific custom check functions and 11 quality rules. The Manufacturing demo has been moved into the same subdirectory for consistency, and the demo documentation has been updated with a new "Industry Accelerators" section.
  • Added intermediate demo for new users (#1041). A new intermediate demo has been added that can be presented in 5–10 minutes and showcases DQX's core functionality to someone seeing it for the first time.
  • Added LLM-friendly documentation with llms.txt generation (#1029). The Docusaurus build now generates AI-accessible documentation in the standardized llms.txt format via the @signalwire/docusaurus-plugin-llms-txt plugin, with hierarchical organization so AI assistants and LLM-powered tools can consume DQX documentation more efficiently.
  • Updated profiler implementation with rules-based profile builders (#1059). The DQX profiler has been refactored around a rules-based approach: profiles are now generated via registered profile builders, making it straightforward to add new profile types without modifying core profiler code.
  • Improv...
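
The {{ placeholder }} substitution described in #1078 can be sketched as a recursive walk over a check definition. This is a simplified stand-in and not the library's resolve_variables(): the function name, the regular expression, and the sample check are assumptions of this example.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def resolve_placeholders(value, variables):
    """Recursively replace {{ name }} placeholders in every string field."""
    if isinstance(value, str):
        def substitute(match):
            name = match.group(1)
            if name not in variables:
                # Unresolved placeholders are logged and left untouched.
                logger.warning("Unresolved placeholder: %s", name)
                return match.group(0)
            return str(variables[name])

        # re.sub scans the string once, so substituted values are not re-expanded.
        return PLACEHOLDER.sub(substitute, value)
    if isinstance(value, dict):
        return {key: resolve_placeholders(item, variables) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_placeholders(item, variables) for item in value]
    return value  # non-string scalars pass through unchanged


# Hypothetical check template, written as the metadata form of a check.
template = [
    {
        "criticality": "{{ severity }}",
        "check": {"function": "is_not_null", "arguments": {"column": "{{ key_column }}"}},
    }
]
resolved_checks = resolve_placeholders(template, {"severity": "error", "key_column": "customer_id"})
```

The same template can then be reused across tables by passing a different variables dictionary for each one.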

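Because the check_metrics value from #1097 is a JSON array with one struct per check, it can be unpacked with a standard JSON parser once it has been read from the metrics table. The sketch below uses a made-up metric_value string; the check names and counts are hypothetical, and it assumes the counts are serialized as JSON numbers.

```python
import json

# Hypothetical metric_value of the check_metrics row.
metric_value = (
    '[{"check_name": "id_is_null", "error_count": 3, "warning_count": 0},'
    ' {"check_name": "age_out_of_range", "error_count": 0, "warning_count": 7}]'
)

# Map each check name to its (error_count, warning_count) pair.
per_check = {
    item["check_name"]: (item["error_count"], item["warning_count"])
    for item in json.loads(metric_value)
}

# Names of checks that produced at least one error.
checks_with_errors = sorted(name for name, (errors, _) in per_check.items() if errors > 0)
```
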
v0.13.0

09 Feb 17:18
99734cd

What's Changed

  • New DQX Data Quality Dashboard (#1019). The data quality dashboard has been significantly enhanced to provide a centralized view of data quality metrics across all tables, allowing users to monitor and track data quality issues with greater ease. The dashboard now consists of three tabs - Data Quality Summary, Data Quality by Table (Time Series), and Data Quality by Table (Full Snapshot) - each catering to different monitoring scenarios, and offers customizable parameters for reporting column names and filtering tables with data quality issues. Additionally, the installation process for the dashboard has been simplified, with options to import it directly to a Workspace or deploy it automatically using the Databricks CLI.
  • DQX App Skeleton (#982). The DQX application (frontend and backend) has been built with a core set of features, including configuration management and AI-assisted rule generation based on natural-language input from users. A comprehensive README documents the application architecture as well as development and deployment workflows. Future versions of DQX will introduce additional functionality (loading/saving rules, rules authoring in graphical form) and provide a streamlined, user-friendly way to deploy the application directly into a Databricks workspace.
  • Added Decimal support to check functions and to min_max generator (#1013) (#1017). The data quality checks have been enhanced to support Python's Decimal type, in addition to int and float, for min/max validation checks, enabling proper data quality checks for decimal-precise financial and scientific data where floating-point precision issues would cause false positives.
  • Added DQX production best practices and fixed datetime limit handling (#997). Practical guidance and best practices for using DQX in production have been added, covering aspects such as storing checks in Delta tables, enforcing access controls, and optimizing rules for performance and scalability. Fixes have also been implemented to address issues related to handling date and datetime limits, particularly when provided as strings.
  • Added new row-level check functions: is_null, is_empty, and is_null_or_empty (#1015). DQX now includes three new check functions, is_null, is_empty, and is_null_or_empty, which enable verification of column values as null, empty strings, or both, complementing existing checks like is_not_null, is_not_empty, and is_not_null_and_not_empty. The functions also support optional arguments, like trim_strings to trim spaces from strings.
  • Added tolerance to equality and non-equality check functions (#1011). The library's quality check functionality has been enhanced to support absolute and relative tolerance parameters for numeric value comparisons in is_equal_to, is_not_equal_to, is_aggr_equal and is_aggr_not_equal checks, allowing for more flexible and precise control over data validation. The introduction of tolerance logic, which checks for absolute and relative differences within specified thresholds via abs_tolerance and rel_tolerance parameters, provides more nuanced comparisons for numeric data.
  • Allow new lines in sql expression checks (#1009). SQL expression check function (sql_expression) has been updated to support new lines in its expression argument, allowing for more complex and formatted SQL expressions.
  • Allow summary metrics with SparkConnect sessions (#1000). The library now supports writing summary metrics directly to a table with SparkConnect sessions, eliminating the need for a classic compute cluster in Dedicated access mode. This change lifts the previous restriction and enables generating summary metrics on Serverless and on all standard clusters with Databricks Runtime 17.3 LTS or higher.
  • Fixed loading checks from a delta table with special characters (#992). The loading checks functionality from a delta table has been fixed to handle special characters in the fully qualified table.
  • Fixed resolution of pii detection check function (#1003). The PII detection check function resolution has been enhanced to support the application of checks defined as metadata (YAML).
  • Fixed serialization/deserialization of row filter parameter for dataset-level rules (#1021). The filter field in a check definition is now correctly pushed down to the check function as row_filter, allowing dataset-level checks to operate on the relevant subset of rows before aggregation. The documentation has been updated to advise users to use the top-level filter condition instead of the row_filter parameter, for consistency.
  • Improved Lakeflow Declarative Pipeline tests (#1010). The Lakeflow Declarative Pipeline (LDP) tests have been enhanced to utilize full Unity Catalog mode, enabling support for writing to arbitrary catalogs and schemas, and performing additional checks to prevent certain operations.
  • Updated Lakebase authentication method (#975). The Lakebase authentication method has been updated to utilize a client ID instead of a username, simplifying its use in the context of a Databricks App. The lakebase_user parameter has been replaced with lakebase_client_id, an optional service principal client ID used to connect to Lakebase, defaulting to the caller's identity if not provided. This change enhances the security and reliability of the authentication process, making it easier to work with Lakebase as a checks storage.
  • Updated handling of metadata columns during schema validation (#1002). The has_valid_schema check has been enhanced to provide more flexibility in schema validation by introducing an optional exclude_columns parameter, allowing users to specify columns to ignore during validation. This parameter can be used to exclude metadata columns or other columns not relevant to schema validation, and it takes precedence over the columns list.
  • Updated product info when missing in config while verifying workspace client (#987). The workspace client configuration has been enhanced to default product information to dqx with the current version when it is missing, ensuring that product information is always set for telemetry purposes.
  • Updated profiler and generator documentation (#1026). The data profiling and quality checks generation feature has been enhanced with updated documentation, providing reference information for data quality profile types and associated rules.
  • Added filter attribute in rules generated from ODCS (#978). The rules generation process has been enhanced with the introduction of a filter attribute in rules generated from Open Data Contract Standard (ODCS), allowing for more flexible and targeted rules creation.
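
The abs_tolerance and rel_tolerance parameters from #1011 can be illustrated with Python's math.isclose. The release notes do not say how the two thresholds are combined, so this sketch assumes a comparison passes when either threshold is met, which is the rule math.isclose applies; the function name is hypothetical.

```python
import math


def is_equal_within_tolerance(actual, expected, abs_tolerance=0.0, rel_tolerance=0.0):
    """Return True when the two numbers are equal within either tolerance.

    math.isclose passes when
    |actual - expected| <= max(rel_tol * max(|actual|, |expected|), abs_tol).
    """
    return math.isclose(actual, expected, rel_tol=rel_tolerance, abs_tol=abs_tolerance)


within_abs = is_equal_within_tolerance(100.0, 100.4, abs_tolerance=0.5)   # difference 0.4
within_rel = is_equal_within_tolerance(100.0, 101.0, rel_tolerance=0.01)  # difference 1.0, allowed 1.01
too_far = is_equal_within_tolerance(100.0, 102.0, rel_tolerance=0.01)     # difference 2.0, allowed 1.02
```

With both tolerances left at zero, the comparison falls back to exact equality.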

Contributors

@mwojtyczka @ghanse @alexott @nehamilak-db @cornzyblack @laurencewells @renardeinside @tlgnr @pierre-monnet @sheeluvikas @ashwin-911 @dwanneruchi @bpm1993 @Jgprog117

Full Changelog: v0.12.0...v0.13.0

v0.12.0

20 Dec 00:13
fff89a7

What's Changed

  • AI-Assisted rules generation from data profiles (#963). AI-assisted data quality rule generation was added, leveraging summary statistics from a profiler to create rules. The DQGenerator class includes a generate_dq_rules_ai_assisted method that can generate rules with or without user-provided input, using summary statistics to inform the rule creation process. This method offers flexibility in rule generation, allowing for both automated and user-guided creation of data quality rules.
  • Added new checks for JSON validation (#616). DQX now includes three new quality checks for JSON data validation, especially useful for validating data coming from streaming systems such as Kafka: is_valid_json, has_json_keys, and has_valid_json_schema. The is_valid_json check verifies whether values in a specified column are valid JSON strings, while the has_json_keys check confirms the presence of specific keys in the outermost JSON object, allowing for optional parameters to require all keys to be present. The has_valid_json_schema check ensures that JSON strings conform to an expected schema, ignoring extra fields not defined in the schema.
  • Added geometry row-level checks (#636). The library has been enhanced with new row-level checks for geometry columns, including checks for area and number of points, such as is_area_not_less_than, is_area_not_greater_than, is_area_equal_to, is_area_not_equal_to, is_num_points_not_less_than, is_num_points_not_greater_than, is_num_points_equal_to, and is_num_points_not_equal_to. These checks allow users to validate geometric data based on specific criteria, with options to specify the spatial reference system (SRID) and use geodesic area calculations. These changes enable more effective validation and quality control of geometric data, and are supported in Databricks serverless compute or runtime versions 17.1 and later.
  • Added support to write using delta table path (#594). The quality check results saving functionality has been enhanced to support saving to Unity Catalog Volume paths, S3, ADLS, or GCS in addition to tables, providing more flexibility in storing and managing results. The save_results_in_table method now accepts output configurations with volume paths, and the OutputConfig object has been updated to support table names with 2 or 3-level namespace, storage paths including Volume paths, S3, ADLS, or GCS, and optional trigger settings for streaming output. Furthermore, the code now supports saving DataFrames to both Delta tables and storage paths, with the save_dataframe_as_table function taking an output_config object that determines whether to save the DataFrame to a table or a path. The functionality includes support for batch and streaming writes, input validation, and error handling, with the existing functionality of saving to Delta tables preserved and new functionality added for saving to storage paths.
  • Extended aggregation check function to support more aggregation types (#951). The aggregation check function has been significantly enhanced to support a wide range of aggregate functions, including 20 curated statistical and percentile-based functions, as well as any Databricks built-in aggregate function, with runtime validation to ensure compatibility and trigger warnings for non-curated functions. The function now accepts an aggr_params parameter to pass parameters to aggregate functions, such as percentile calculations, and supports two-stage aggregation for window-incompatible aggregates like count_distinct. Additionally, the function includes improved error handling, human-readable violation messages, and performance benchmarks for various aggregation scenarios, enabling advanced data quality monitoring and validation capabilities for data engineers and analysts.
  • Added new is_not_in_list check function (#969). A new check function, is_not_in_list, has been added to verify that values in a specified column are not present in a given list of forbidden values, allowing for null values and optional case-insensitive comparisons. This function is suitable for columns that are not of type MapType or StructType, and for optimal performance with large lists of forbidden values, it is recommended to use the foreign_key dataset-level check with the negate argument set to True. The function takes the column to check, the list of forbidden values, and optionally the case sensitivity of the comparison. Its implementation includes input validation and custom error messages, with additional benchmark tests to measure its performance.
  • Improve Generator to emit temporal checks for min/max date & datetime (#624). The data quality generator has been enhanced to support temporal checks for columns with datetime and date types, in addition to numeric types. The generator now creates rules with is_in_range, is_not_less_than, and is_not_greater_than functions based on the provided minimum and maximum limits, ensuring correct comparison by verifying that both limit values are of the same type. This update preserves the existing numeric behavior and introduces support for timestamp and date checks, while maintaining the ability to handle Python numeric types without stringification.
  • Improved sql query check function to make merge columns parameter optional (#945). The sql_query check has been enhanced to support both row-level and dataset-level validation, allowing for more flexible data validation scenarios. In row-level validation, the check joins query results back to the input data to mark specific rows, whereas in dataset-level validation, the check result applies to all rows, making it suitable for aggregate validations with custom metrics. The merge_columns parameter is now optional, and when not provided, the check performs a dataset-level validation, providing a convenient way to validate entire datasets without requiring specific column mappings. Additionally, the check has been made more robust with input validation and error handling, ensuring that users can perform checks at both the row and dataset levels while preventing incorrect usage with informative error messages.
  • Outlier detection numerical values (#944). The has_no_outliers function has been introduced to detect outliers in numeric columns using the Median Absolute Deviation (MAD) method, which calculates the lower and upper limits as median - 3.5 * MAD and median + 3.5 * MAD, respectively, and considers values outside these limits as outliers. The function is designed to work with numeric columns of type int, float, long, and decimal, and it raises an error if the specified column is not of numeric type. The addition of this function enables the detection of outlier numeric values, enhancing the overall data validation capabilities.
  • Library improvements (#966). The library has undergone updates to improve its functionality, performance, and documentation. The has_json_keys function has been updated to treat NULL values as valid, ensuring consistent behavior across ANSI and non-ANSI modes. Additionally, the functionality of saving DataFrames as tables has been improved, with updated regular expression patterns for table names and enhanced handling of streaming and non-streaming DataFrames.
  • Updated has_valid_schema check to accept a reference dataframe or table (#960). The has_valid_schema check has been enhanced to support validation against a reference dataframe or table, in addition to the existing expected schema. This allows users to verify the schema of their input dataframe against a reference dataframe or table by specifying either the ref_df_name or ref_table parameter, with exactly one of expected_schema, ref_df_name, or ref_table required. The check can be performed in strict mode for exact schema matching or in non-strict mode, which permits extra columns, and users can also specify particular columns to validate using the columns parameter. The function's update includes improved parameter validation, ensuring that only one valid schema source is specified, and new test cases have been added to cover various scenarios, including the use of reference tables and dataframes for schema validation, as well as parameter validation logic.
  • Updated dashboards deployment to use standard Lakeview dashboard definitions (#950). The dashboard installer has been updated to use standard Lakeview dashboard definitions.
  • Added null island geometry check function (#613). A new quality check called is_not_null_island has been introduced to verify that values in a specified column are not null island geometries, such as POINT(0 0), POINTZ(0 0 0), or POINTZM(0 0 0 0). The is_not_null_island function requires Databricks serverless compute or runtime version 17.1 or higher.
  • Added float support for range and compare functions (#962). The comparison and validation functions have been enhanced to support float values, in addition to existing support for integers, dates, timestamps, and strings. This update allows for more flexible and nuanced comparisons and range checks, enabling precise and robust validation operations, particularly in scenarios involving decimal or fractional values. The...
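
The MAD limits described for has_no_outliers can be illustrated in plain Python. This is a minimal sketch of the arithmetic only, not the DQX implementation: the function names are hypothetical, and it assumes the MAD is used unscaled (the actual check may differ).

```python
from statistics import median


def mad_limits(values, threshold=3.5):
    """Return (lower, upper) as median -/+ threshold * MAD."""
    med = median(values)
    # MAD: median of the absolute deviations from the median (assumed unscaled).
    mad = median([abs(v - med) for v in values])
    return med - threshold * mad, med + threshold * mad


def find_outliers(values, threshold=3.5):
    """Return the values that fall outside the MAD-based limits."""
    lower, upper = mad_limits(values, threshold)
    return [v for v in values if v < lower or v > upper]


readings = [10, 11, 12, 13, 14, 100]
limits = mad_limits(readings)
outliers = find_outliers(readings)
```

Because the limits are built from medians rather than the mean and standard deviation, a single extreme value such as 100 barely moves them.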

v0.11.1

02 Dec 12:13
d200468

What's Changed

  • Hotfix to update the log level for Spark Connect to suppress DLT telemetry warnings on non-DLT serverless clusters.

Contributors: @mwojtyczka

v0.11.0

01 Dec 23:40
70c00fe

  • Generation of DQX rules from ODCS Data Contracts (#932). The Data Contract Quality Rules Generation feature has been introduced, enabling users to generate data quality rules directly from data contracts following the Open Data Contract Standard (ODCS). This feature supports three types of rule generation: predefined rules derived from schema properties and constraints, explicit DQX rules embedded in the contract, and text-based rules defined in natural language and processed by a Large Language Model (LLM) to generate appropriate checks. The feature provides rich metadata tracing generated rules back to the source contract for lineage and governance, and it can be used to implement federated data governance, standardize data contracts, and maintain version-controlled quality rules alongside schema definitions.
  • AI-Assisted Primary Key Detection and Uniqueness Rules Generation (#934). Introduced AI-assisted primary key detection and uniqueness rules generation. The feature uses Large Language Models (LLMs) to analyze table schemas and metadata, detect single or composite primary keys, and validate the result by checking for duplicate values. The DQProfiler class now includes a detect_primary_keys_with_llm method, which returns a dictionary with the primary key detection result: the table name, success status, detected primary key columns, confidence level, reasoning, and an error message if any. The DQGenerator class has been extended to use uniqueness profiles from the profiler for AI-assisted uniqueness rules generation. A new llm_primary_key_detection configuration option controls whether AI-assisted primary key detection is enabled.
  • AI-Assisted Rules Generation Improvements (#925). The AI-Assisted Rules Generation feature has been enhanced to handle input as a path in addition to a table, and to generate rules with a filter. The generate_dq_rules_ai_assisted method now accepts an InputConfig object, which allows users to specify the location and format of the input data, enabling more flexible input handling and filtering capabilities. The feature includes test cases to verify its functionality, including manual tests, unit tests, and integration tests, and the documentation has been updated with minor changes to reflect the new functionality. Additionally, the code has been modified to capitalize keywords to stabilize integration tests, and the DQGenerator class has been updated to accommodate the changes, allowing users to generate data quality rules from a variety of input sources. The InputConfig class provides a flexible way to configure the input data, including its location and format, and the get_column_metadata function has been introduced to retrieve column metadata from a given location. Overall, these updates aim to enhance the functionality and usability of the AI-assisted rules generation feature, providing more flexibility and accuracy in generating data quality rules.
  • Added case-insensitive comparison support to is_in_list and is_not_null_and_is_in_list checks (#673). The is_in_list and is_not_null_and_is_in_list check functions have been enhanced to support case-insensitive comparison, allowing users to choose between case-sensitive and case-insensitive comparisons via an optional case_sensitive boolean flag that defaults to True. These checks verify if values in a specified column are present in a list of allowed values, with the is_not_null_and_is_in_list check also requiring the values to be non-null. The updated checks provide more flexibility in data validation, enabling users to configure parameters such as the column to check, the list of allowed values, and the case sensitivity flag. However, it is recommended to use the foreign_key dataset-level check for large lists of allowed values or for columns of type MapType or StructType, as these checks are not suitable for such scenarios.
  • Added documentation for using DQX in streaming scenarios with foreach batch (#948). Documentation and example code snippets were added to demonstrate how to apply checks in foreachBatch structured streaming function.
  • Added telemetry to track count of input tables (#954). Added additional telemetry for better tracking of DQX usage to help improve the product.
  • Added support for installing DQX from private PyPI repositories (#930). DQX can now be installed from a company-hosted PyPI mirror, which is necessary for enterprises that block the public PyPI index. Documentation has been added to describe the feature. The tool installation code now automatically uploads dependencies to a workspace when internet access is blocked.
  • Support Custom Folder Installation for CLI Commands (#942). The command-line interface (CLI) has been enhanced to support custom installation folders, providing users with greater flexibility when working with the library. A new --install-folder argument has been introduced, allowing users to specify a custom installation folder when running various CLI commands, such as opening dashboards, workflows, logs, and profiles. This argument overrides the default installation location to support scenarios where the user installs DQX in a custom location. The library's dependency on sqlalchemy has also been updated to require a version greater than or equal to 2.0 and less than 3.0 to avoid dependency issues in older DBRs.
  • Enhancement to end to end tests (#921). The e2e tests have been enhanced to test integration with the dbt transformation framework. Additionally, the documentation for contributing to the project and testing has been updated to simplify the setup process for running tests locally.
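
The case_sensitive flag added to is_in_list and is_not_null_and_is_in_list (#673) can be pictured with plain-Python stand-ins. This is a sketch of the intended semantics only: the helper names are hypothetical, and it assumes that is_in_list does not flag null values, which the description above implies but does not state outright.

```python
def passes_is_in_list(value, allowed, case_sensitive=True):
    """Return True when the value would pass an is_in_list style check."""
    if value is None:
        # Assumption: nulls are not reported as violations by is_in_list.
        return True
    if case_sensitive:
        return value in allowed
    # Case-insensitive mode: compare lower-cased string forms.
    return str(value).lower() in {str(item).lower() for item in allowed}


def passes_is_not_null_and_is_in_list(value, allowed, case_sensitive=True):
    """Same membership test, but a null value is a violation."""
    return value is not None and passes_is_in_list(value, allowed, case_sensitive)


allowed_countries = ["US", "DE", "PL"]
strict_result = passes_is_in_list("us", allowed_countries)
relaxed_result = passes_is_in_list("us", allowed_countries, case_sensitive=False)
```

With the default case_sensitive=True, existing checks keep their previous behaviour; only checks that opt in with case_sensitive=False treat "us" and "US" as the same value.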

BREAKING CHANGES!

  • Renamed level parameter to criticality in generate_dq_rules method of DQGenerator for consistency.
  • Replaced table: str parameter with input_config: InputConfig in profile_table method of DQProfiler for greater flexibility.
  • Replaced table_name: str parameter with input_config: InputConfig in generate_dq_rules_ai_assisted method of DQGenerator for greater flexibility.

Contributors: @dinbab1984, @mwojtyczka, @ghanse, @vb-dbrks, @jominjohny, @AdityaMandiwal

v0.10.0

07 Nov 08:11
6432e45

  • Added Data Quality Summary Metrics (#553). The data quality engine has been enhanced with the ability to track and manage summary metrics for data quality validation, leveraging Spark's Observation feature. A new DQMetricsObserver class has been introduced to manage Spark observations and track summary metrics on datasets checked with the engine. The DQEngine class has been updated to optionally return the Spark observation associated with a given run, allowing users to access and save summary metrics. The engine also supports writing summary metrics to a table using the metrics_config parameter, and a new save_summary_metrics method has been added to save data quality summary metrics to a table. Additionally, the engine has been updated to include a unique run_id field in the detailed per-row quality results, enabling cross-referencing with summary metrics. The changes also include updates to the configuration file to support the storage of summary metrics.
  • LLM assisted rules generation (#577). This release introduces a significant enhancement to the data quality rules generation process with the integration of AI-assisted rules generation using large language models (LLMs). The DQGenerator class now includes a generate_dq_rules_ai_assisted method, which takes user input in natural language and optionally a schema from an input table to generate data quality rules. These rules are then validated for correctness. The AI-assisted rules generation feature supports both programmatic and no-code approaches. Additionally, the feature enables the use of different LLM models and gives the possibility to use custom check functions. The release also includes various updates to the documentation, configuration files, and testing framework to support the new AI-assisted rules generation feature, ensuring a more streamlined and efficient process for defining and applying data quality rules.
  • Added Lakebase checks storage backend (#550). A Lakebase checks storage backend was added, allowing users to store and manage their data quality rules in a centralized Lakebase table, in addition to the existing Delta table storage. The checks_location resolution has been updated to accommodate Lakebase, supporting both table and file storage, with flexible formatting options, including "catalog.schema.table" and "database.schema.table". The Lakebase checks storage backend is configurable through the LakebaseChecksStorageConfig class, which includes fields for instance name, user, location, port, run configuration name, and write mode. This update provides users with more flexibility in storing and loading quality checks, ensuring that checks are saved correctly regardless of the specified location format.
  • Added runtime validation of sql expressions (#625). The data quality check functionality has been enhanced with runtime validation of SQL expressions, ensuring that specified fields can be resolved in the input DataFrame and that SQL expressions are valid before evaluation. If an SQL expression is invalid, the check evaluation is skipped and the results include a check failure with a descriptive message. Additionally, the configuration validation for Unity Catalog volume file paths has been improved to enforce a specific format, preventing invalid configurations and providing more informative error messages.
  • Fixed docs (#598). The documentation build process has undergone significant improvements to enhance efficiency and maintainability.
  • Improved Config Serialization (#676). Several updates have been made to improve the functionality, consistency, and maintainability of the codebase. The configuration loading functionality has been refactored to utilize the ConfigSerializer class, which handles the serialization and deserialization of workspace and run configurations.
  • Restore use of hatch-fancy-pypi-readme to fix images in PyPi (#601). The image source path for the logo in the README has been modified to correctly display the logo image when rendered, particularly on PyPi.
  • Skip check evaluation if columns or filter cannot be resolved in the input DataFrame (#609). DQX now skips check evaluation if columns or filters are incorrect, allowing other checks to proceed even if one rule fails. The DQX engine validates the specified column, columns, and filter fields against the input DataFrame before applying checks, skipping evaluation and providing informative error messages if any fields are invalid.
  • Updated user guide docs (#607). The documentation for quality checking and integration options has been updated to provide accurate and detailed information on supported types and approaches. Quality checking can be performed in-transit (pre-commit), validating data on the fly during processing, or at-rest, checking existing data stored in tables.
  • Improved build process (#618). The hatch version has been updated to 1.15.0 to avoid compatibility issues with click version 8.3 and later, which introduced a bug affecting hatch. Additionally, the project's dependencies have been updated, including bumping the databricks-labs-pytester version from 0.7.2 to 0.7.4, and code refactoring has been done to use a single Lakebase instance for all integration tests, with retry logic added to handle cases where the workspace quota limit for the number of Lakebase instances is exceeded, enhancing the testing infrastructure and improving test reliability. Furthermore, documentation updates have been made to clarify the application of quality checks to data using DQX. These changes aim to improve the efficiency, reliability, and clarity of the project's testing and documentation infrastructure.
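
The skip behaviour described for #609 can be sketched as a pre-validation step that separates checks whose columns resolve from those that do not. The dictionary shape, function name, and message text below are hypothetical and only illustrate the idea; the real engine also validates filter expressions, which this sketch leaves out.

```python
def split_checks(checks, available_columns):
    """Partition checks into runnable ones and skipped ones with a reason."""
    runnable, skipped = [], []
    for check in checks:
        missing = [col for col in check["columns"] if col not in available_columns]
        if missing:
            # The check is not evaluated; a descriptive message is kept instead.
            skipped.append(
                {"name": check["name"], "message": f"unresolved columns: {missing}"}
            )
        else:
            runnable.append(check)
    return runnable, skipped


checks = [
    {"name": "id_is_not_null", "columns": ["id"]},
    {"name": "email_is_not_null", "columns": ["emial"]},  # misspelled on purpose
]
runnable, skipped = split_checks(checks, available_columns={"id", "email"})
```

The point of the design is that one misconfigured rule produces a reported failure rather than aborting the whole run.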

BREAKING CHANGES!

  • Added a new run_id field to the detailed per-row quality results. This may or may not be a breaking change for you, depending on how you use the results today. The run ID is unique and is recorded in both the summary metrics and the detailed quality checking results to enable cross-referencing. When the same DQEngine instance is reused, the run ID stays the same: applying checks again does not generate a new run ID. It changes only when a new engine (and observer, if one is used) is created.
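
The run ID lifetime described above can be modelled with a toy class. ToyEngine is a hypothetical stand-in, not the DQEngine API; it only shows that the ID is tied to the instance rather than to each apply call.

```python
import uuid


class ToyEngine:
    """Stand-in for an engine whose run ID is fixed at construction time."""

    def __init__(self):
        self.run_id = str(uuid.uuid4())

    def apply_checks(self, rows):
        # Every result row is stamped with the instance-level run ID.
        return [{**row, "run_id": self.run_id} for row in rows]


engine = ToyEngine()
first = engine.apply_checks([{"id": 1}])
second = engine.apply_checks([{"id": 2}])
other = ToyEngine().apply_checks([{"id": 3}])
```

Under this model, anyone who needs a distinct run ID per execution would create a new engine (and observer, if used) for each run.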

LIMITATIONS

  • Saving metrics to a table requires using a classic compute cluster in Dedicated Access Mode. This limitation will be lifted once the observations issue is fixed in Spark Connect.

Contributors: @mwojtyczka, @ghanse, @souravg-db2, @vb-dbrks, @alexott, @tlgnr

v0.9.3

03 Oct 16:53
d9d47a9

  • Added support for running checks on multiple tables (#566). Added more flexibility and functionality in running data quality checks, allowing users to run checks on multiple tables in a single method call and as part of Workflows execution. Provided options to run checks for all configured run configs, for a specific run config, or for tables/views matching wildcard patterns. The CLI commands for running workflows have been updated to support these new functionalities. Additionally, new parameters have been added to the configuration file to control the level of parallelism for these operations, such as profiler_max_parallelism and quality_checker_max_parallelism. A new demo has been added to showcase how to use the profiler and apply checks across multiple tables. The changes aim to improve the scalability of DQX.
  • Added New Row-level Checks: IPv6 Address Validation (#578). DQX now includes 2 new row-level checks: validation of IPv6 address (is_valid_ipv6_address check function), and validation if IPv6 address is within provided CIDR block (is_ipv6_address_in_cidr check function).
  • Added New Dataset-level Check: Schema Validation check (#568). The has_valid_schema check function has been introduced to validate whether a DataFrame conforms to a specified schema, with results reported at the row level for consistency with other checks. This function can operate in non-strict mode, where it verifies the existence of expected columns with compatible types, or in strict mode, where it enforces an exact schema match, including column order and types. It accepts parameters such as the expected schema, which can be defined as a DDL string or a StructType object, and optional arguments to specify columns to validate and strict mode.
  • Added New Row-level Checks: Spatial data validations (#581). Specialized data validation checks for geospatial data have been introduced, enabling verification of valid latitude and longitude values, various geometry and geography types, such as points, linestrings, polygons, multipoints, multilinestrings, and multipolygons, as well as checks for Open Geospatial Consortium (OGC) validity, non-empty geometries, and specific dimensions or coordinate ranges. These checks are implemented as check functions, including is_latitude, is_longitude, is_geometry, is_geography, is_point, is_linestring, is_polygon, is_multipoint, is_multilinestring, is_multipolygon, is_ogc_valid, is_non_empty_geometry, has_dimension, has_x_coordinate_between, and has_y_coordinate_between. The addition of these geospatial data validation checks enhances the overall data quality capabilities, allowing for more accurate and reliable geospatial data processing and analysis. Running these checks requires Databricks serverless or cluster with runtime 17.1 or above.
  • Added absolute and relative tolerance to comparison of datasets (#574). The compare_datasets check has been enhanced with the introduction of absolute and relative tolerance parameters, enabling more flexible comparisons of decimal values. These tolerances can be applied to numeric columns.
  • Added detailed telemetry (#561). Telemetry has been enhanced across multiple functionalities to provide better visibility into DQX usage, including which features and checks are used most frequently. This will help us focus development efforts on the areas that matter most to our users.
  • Allow installation in a custom folder (#575). The installation process for the library has been enhanced to offer flexible installation options, allowing users to install the library in a custom workspace folder, in addition to the default user home directory or a global folder. When installing DQX as a workspace tool using the Databricks CLI, users are prompted to optionally specify a custom workspace path for the installation. Allowing a custom installation folder makes it possible to use DQX on group-assigned clusters.
  • Profile subset dataframe (#589). The data profiling feature has been enhanced to allow users to profile and generate rules on a subset of the input data by introducing a filter option, which is a string SQL expression that can be used to filter the input data. This filter can be specified in the configuration file or when using the profiler, providing more flexibility in analyzing subsets of data. The profiler supports extensive configuration options to customize the profiling process, including sampling, limiting, and computing statistics on the sampled data. The new filter option enables users to generate more targeted and relevant rules, and it can be used to focus on particular segments of the data, such as rows that match certain conditions.
  • Added custom exceptions (#582). The codebase now utilizes custom exceptions to handle various error scenarios, providing more specific and informative error messages compared to generic exceptions.
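
The two IPv6 checks added in #578 can be approximated with Python's standard ipaddress module. This is a sketch of what the checks verify, not how DQX implements them on Spark columns: the helper names are hypothetical, and details such as null handling or accepted address forms may differ in the real check functions.

```python
import ipaddress


def is_valid_ipv6(value):
    """Return True if the value parses as an IPv6 address."""
    try:
        ipaddress.IPv6Address(value)
        return True
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False


def is_ipv6_in_cidr(value, cidr):
    """Return True if the value is a valid IPv6 address inside the CIDR block."""
    network = ipaddress.IPv6Network(cidr, strict=False)
    return is_valid_ipv6(value) and ipaddress.IPv6Address(value) in network


valid = is_valid_ipv6("2001:db8::1")
invalid = is_valid_ipv6("192.168.0.1")
inside = is_ipv6_in_cidr("2001:db8::1", "2001:db8::/32")
outside = is_ipv6_in_cidr("2001:db9::1", "2001:db8::/32")
```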

BREAKING CHANGES!

  • Workflows run by default for all run configs from configuration file. Previously, the default behaviour was to run them for a specific run config only.
  • The following deprecated methods are removed from the DQEngine: load_checks_from_local_file, load_checks_from_workspace_file, load_checks_from_table, load_checks_from_installation, save_checks_in_local_file, save_checks_in_workspace_file, save_checks_in_table, save_checks_in_installation, load_run_config. For loading and saving checks, users are advised to use load_checks and save_checks of the DQEngine described here, which support various storage types.

Contributors: @mwojtyczka, @ghanse, @tdikland, @Divya-Kovvuru-0802, @cornzyblack, @STEFANOVIVAS

v0.9.2

05 Sep 19:59
a22ab79

  • Added performance benchmarks (#548). Performance tests are run to ensure that no change degrades performance by more than 25%. Benchmark results are published in the reference section of the documentation. The benchmark covers all check functions, running all functions at once, and applying the same functions to multiple columns at once using foreach column. A new performance GitHub workflow has been introduced to automate performance benchmarking: generating a new benchmark baseline, updating the existing baseline, and running performance tests to compare with the baseline.
  • Declare readme in the project (#547). The project configuration has been updated to include the README file in the released package so that it is visible on PyPI.
  • Fixed deserializing to DataFrame to assign columns properly (#559). The deserialize_checks_to_dataframe function has been enhanced to correctly handle columns for sql_expression by removing the unnecessary check for DQDatasetRule instance and directly verifying if dq_rule_check.columns is not None.
  • Fixed lsql dependency (#564). The lsql dependency has been updated to address a sqlglot dependency issue that arises when imported in artifacts repositories.
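
The 25% degradation threshold mentioned for the performance benchmarks (#548) amounts to comparing each benchmark against its baseline. The sketch below assumes timings are stored as a mapping from benchmark name to seconds; that layout and the function name are hypothetical and do not reflect the project's actual workflow.

```python
def find_regressions(baseline, current, max_slowdown=0.25):
    """Return benchmarks whose relative slowdown exceeds max_slowdown."""
    regressions = {}
    for name, base_time in baseline.items():
        new_time = current.get(name)
        if new_time is None:
            continue  # benchmark missing from the current run; nothing to compare
        slowdown = (new_time - base_time) / base_time
        if slowdown > max_slowdown:
            regressions[name] = slowdown
    return regressions


baseline = {"is_not_null": 1.0, "is_in_list": 2.0}
current = {"is_not_null": 1.2, "is_in_list": 3.0}
regressions = find_regressions(baseline, current)
```

Here is_not_null is about 20% slower and passes, while is_in_list is 50% slower and is reported.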

Contributors: @mwojtyczka @ghanse @cornzyblack @gchandra10

v0.9.1

25 Aug 10:56
ee802c4

0.9.1

  • Added quality checker and end to end workflows (#519). This release introduces a no-code solution for applying checks. The following workflows were added: quality-checker (apply checks and save results to tables) and end-to-end (e2e) workflows (profile input data, generate quality checks, apply the checks, save results to tables). The workflows enable quality checking for data at-rest without the need for code-level integration. They support reference data for checks using tables (e.g., required by foreign key or compare datasets checks) as well as custom Python check functions (mapping of a custom check function to the module path in the workspace or Unity Catalog volume containing the function definition). The workflows handle one run config for each job run. A future release will introduce functionality to execute this across multiple tables. In addition, CLI commands have been added to execute the workflows. Additionally, DQX workflows are now configured to execute using serverless clusters, with an option to use standard clusters as well. InstallationChecksStorageHandler now supports absolute workspace path locations.
  • Added built-in row-level check for PII detection (#486). Introduced a new built-in check for Personally Identifiable Information (PII) detection, which utilizes the Presidio framework and can be configured using various parameters, such as NLP entity recognition configuration. This check can be defined using the does_not_contain_pii check function and can be customized to suit specific use cases. The check requires pii extras to be installed: pip install databricks-labs-dqx[pii]. Furthermore, a new enum class NLPEngineConfig has been introduced to define various NLP engine configurations for PII detection. Overall, these updates aim to provide more robust and customizable quality checking capabilities for detecting PII data.
  • Added equality row-level checks (#535). Two new row-level checks, is_equal_to and is_not_equal_to, have been introduced to enable equality checks on column values, allowing users to verify whether the values in a specified column are equal to or not equal to a given value, which can be a numeric literal, column expression, string literal, date literal, or timestamp literal.
  • Added demo for Spark Structured Streaming (#518). Added demo to showcase usage of DQX with Spark Structured Streaming for in-transit data quality checking. The demo is available as Databricks notebook, and can be run on any Databricks workspace.
  • Added clarification to profiler summary statistics (#523). Added new section on understanding summary statistics, which explains how these statistics are computed on a sampled subset of the data and provides a reference for the various summary statistics fields.
  • Fixed rounding datetimes in the checks generator (#517). The generator has been enhanced to correctly handle midnight values when rounding "up", ensuring that datetime values already at midnight remain unchanged, whereas previously they were rounded to the next day.
  • Added API Docs (#520). The DQX API documentation is generated automatically using docstrings. As part of this change the library's documentation has been updated to follow Google style.
  • Improved test automation by adding end-to-end test for the asset bundles demo (#533).
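
The rounding fix in #517 can be illustrated with the standard datetime module. This is a sketch of the corrected behaviour as described above, not the generator's code: the function name is hypothetical and the example assumes naive (timezone-unaware) datetimes.

```python
from datetime import datetime, time, timedelta


def round_up_to_midnight(value):
    """Round a datetime up to the next midnight, leaving exact midnights as-is."""
    midnight = datetime.combine(value.date(), time.min)
    if value == midnight:
        # Already at midnight: rounding "up" must not move it to the next day.
        return value
    return midnight + timedelta(days=1)


at_midnight = round_up_to_midnight(datetime(2025, 1, 1, 0, 0))
mid_morning = round_up_to_midnight(datetime(2025, 1, 1, 10, 30))
```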

BREAKING CHANGES!

  • ExtraParams was moved from databricks.labs.dqx.rule module to databricks.labs.dqx.config

Contributors: @mwojtyczka @ghanse @renardeinside @cornzyblack @bsr-the-mngrm @dinbab1984 @AdityaMandiwal