Migrate Data Catalog to Dataplex Knowledge Catalog for CDC templates #3927

Open
stankiewicz wants to merge 22 commits into GoogleCloudPlatform:main from stankiewicz:fix_data_catalog_usage

Conversation

@stankiewicz

Contributor

This PR resolves the INVALID_ARGUMENT write operation failure (Project is not allowed to perform write operations due to Data Catalog deprecation) caused by the deprecation of the legacy Google Cloud Data Catalog API.

It migrates the Debezium-to-PubSub CDC pipeline's schema publishing and schema retrieval logic to use the new Dataplex Knowledge Catalog API (com.google.cloud.dataplex.v1.CatalogServiceClient).

Key Changes

  • Dependency Update: Replaced google-cloud-datacatalog with google-cloud-dataplex (v1.90.0) in cdc-common.
  • Schema Aspect Migration: Refactored SchemaUtils.java to convert Apache Beam Schema objects into Dataplex technical schema aspects (represented as google.protobuf.Struct following the system-defined dataplex-types.global.schema format).
  • API Client Update:
    • Replaced all usages of the legacy DataCatalogClient with CatalogServiceClient inside DataCatalogSchemaUtils.java.
    • Migrated createEntryGroup to utilize createEntryGroupAsync as required by the Dataplex SDK for long-running operations.
  • Metadata Restoration: Preserved userSpecifiedSystem and userSpecifiedType logic by mapping them appropriately to the new Dataplex EntrySource object (system field and user_specified_type label).
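
To make the schema aspect bullet concrete, here is a minimal sketch of the shape of that conversion, using only the standard library. It is not the code in SchemaUtils.java: FieldSketch is a hypothetical stand-in for a Beam schema field, and the key names ("fields", "name", "dataType", "mode") are assumptions about the aspect layout that should be checked against the Dataplex documentation. The real implementation would build a google.protobuf.Struct rather than nested maps.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a Beam schema field; this is not the Beam API. */
final class FieldSketch {
  final String name;
  final String type; // e.g. "STRING", "INT64"
  final boolean nullable;

  FieldSketch(String name, String type, boolean nullable) {
    this.name = name;
    this.type = type;
    this.nullable = nullable;
  }
}

final class SchemaAspectSketch {
  private SchemaAspectSketch() {}

  /**
   * Builds a nested map shaped like {"fields": [{name, dataType, mode}, ...]}.
   * The key names are assumptions made for illustration.
   */
  static Map<String, Object> toAspectData(List<FieldSketch> fields) {
    List<Map<String, Object>> converted = new ArrayList<>();
    for (FieldSketch field : fields) {
      Map<String, Object> entry = new LinkedHashMap<>();
      entry.put("name", field.name);
      entry.put("dataType", field.type);
      entry.put("mode", field.nullable ? "NULLABLE" : "REQUIRED");
      converted.add(entry);
    }
    Map<String, Object> aspect = new LinkedHashMap<>();
    aspect.put("fields", converted);
    return aspect;
  }
}
```

Keeping the conversion a pure function of the field list makes it straightforward to unit test without a catalog client.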

@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical failure in the CDC pipeline caused by the deprecation of the legacy Google Cloud Data Catalog API. By migrating the schema publishing and retrieval logic to the Dataplex Knowledge Catalog API, the pipeline maintains compatibility with current Google Cloud infrastructure standards. The changes ensure that metadata, including system types and labels, is correctly preserved and mapped to the new Dataplex entry structures.

Highlights

  • API Migration: Migrated the CDC pipeline's schema management from the deprecated Google Cloud Data Catalog API to the new Dataplex Knowledge Catalog API.
  • Dependency Update: Replaced the google-cloud-datacatalog dependency with google-cloud-dataplex (v1.90.0) in the cdc-common module.
  • Schema Refactoring: Updated SchemaUtils to convert Apache Beam schemas into Dataplex technical schema aspects using protobuf Structs.
  • Asynchronous Operations: Refactored entry group creation to use asynchronous operations as required by the Dataplex SDK.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Contributor

Code Review

This pull request migrates the CDC parent and connector modules from Google Cloud Data Catalog to Google Cloud Dataplex Catalog, updating dependencies, client initializations, and schema mapping utilities to use Dataplex's aspect-based schema model. The review feedback identifies several critical issues, including resource leaks from unclosed CatalogServiceClient instances, an N+1 query performance bottleneck when retrieving entry schemas, improper handling of InterruptedException, and potential NullPointerException risks when parsing protobuf Struct fields without defensive checks.
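
Two of the points above, closing the client and parsing fields defensively, can be sketched together. This is an illustration under stated assumptions, not the PR's code: FakeCatalogClient is a hypothetical stand-in that only models open/close and one lookup, and the "dataType" key is assumed for the example.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a catalog client; it only tracks open/close. */
final class FakeCatalogClient implements AutoCloseable {
  static int openClients = 0;

  FakeCatalogClient() {
    openClients++;
  }

  /** Unknown entries come back without a "dataType" key. */
  Map<String, Object> getField(String entryName) {
    if ("known".equals(entryName)) {
      return Map.<String, Object>of("name", "id", "dataType", "INT64");
    }
    return Map.<String, Object>of("name", "id");
  }

  @Override
  public void close() {
    openClients--;
  }
}

final class DefensiveLookupSketch {
  private DefensiveLookupSketch() {}

  /** Reads "dataType" without assuming the key exists or holds a String. */
  static Optional<String> readDataType(Map<String, Object> field) {
    if (field == null) {
      return Optional.empty();
    }
    Object value = field.get("dataType");
    return (value instanceof String) ? Optional.of((String) value) : Optional.empty();
  }

  /** try-with-resources closes the client on every path, including exceptions. */
  static Optional<String> lookupDataType(String entryName) {
    try (FakeCatalogClient client = new FakeCatalogClient()) {
      return readDataType(client.getField(entryName));
    }
  }
}
```

Returning Optional pushes the "field missing" decision to the caller instead of surfacing it as a NullPointerException deep inside the parsing code.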

@stankiewicz added the improvement (Making existing code better) and bug-fix labels on Jun 17, 2026
@stankiewicz

Contributor Author

/gemini review

@gemini-code-assist (bot) left a comment

Contributor

Code Review

This pull request migrates the CDC common library from Google Cloud Data Catalog to Google Cloud Dataplex Catalog, updating dependencies, clients, and schema mapping utilities to use Dataplex aspects. Feedback on these changes suggests propagating exceptions during asynchronous entry group creation to avoid invalid states, filtering search results in lookupPubSubEntry to verify they belong to the @pubsub entry group, and normalizing data types to lowercase in SchemaUtils to ensure robust, case-insensitive type mapping.
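
The type-normalization and entry-group filtering suggestions could look roughly like the sketch below. Everything here is an assumption made for illustration: the type table is hypothetical and not the mapping in SchemaUtils, and the filter assumes entry names follow the .../entryGroups/{group}/entries/{entry} resource pattern.

```java
import java.util.Locale;
import java.util.Map;

final class ReviewSuggestionsSketch {
  private ReviewSuggestionsSketch() {}

  // Hypothetical lookup table keyed by lowercase type names.
  private static final Map<String, String> TYPES =
      Map.of("string", "STRING", "int64", "INT64", "bool", "BOOLEAN");

  /** Normalizes case and surrounding whitespace before the lookup. */
  static String canonicalType(String raw) {
    if (raw == null) {
      throw new IllegalArgumentException("type must not be null");
    }
    String mapped = TYPES.get(raw.trim().toLowerCase(Locale.ROOT));
    if (mapped == null) {
      throw new IllegalArgumentException("Unsupported type: " + raw);
    }
    return mapped;
  }

  /** True only for entry names that sit inside the @pubsub entry group. */
  static boolean inPubSubGroup(String entryName) {
    return entryName != null && entryName.contains("/entryGroups/@pubsub/entries/");
  }
}
```

Locale.ROOT keeps the lowercase conversion independent of the JVM's default locale, which matters for identifiers containing characters such as "I".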

Comment on lines +209 to 218
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        LOG.error("Interrupted while creating EntryGroup", e);
      } catch (ExecutionException e) {
        if (e.getCause() instanceof AlreadyExistsException) {
          this.entryGroupCreated = true;
        } else {
          LOG.error("Failed to create EntryGroup", e);
        }
      }

Contributor

high

If createEntryGroupAsync fails with an exception other than AlreadyExistsException (or if the thread is interrupted), the error is logged but the exception is swallowed. This allows the constructor of SingleTopicSchemaManager to complete successfully in an invalid state, which will cause subsequent schema updates to fail.

To ensure fail-fast behavior and make debugging easier, we should propagate these failures by throwing a RuntimeException.

      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        LOG.error("Interrupted while creating EntryGroup", e);
        throw new RuntimeException("Interrupted while creating EntryGroup", e);
      } catch (ExecutionException e) {
        if (e.getCause() instanceof AlreadyExistsException) {
          this.entryGroupCreated = true;
        } else {
          LOG.error("Failed to create EntryGroup", e);
          throw new RuntimeException("Failed to create EntryGroup", e);
        }
      }
References
  1. When throwing exceptions in environments where the caller or global exception handler might not log them (such as certain Dataflow templates), log the error message before throwing the exception to ensure the failure is recorded.
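
For anyone who wants to try the fail-fast pattern outside the template, the following self-contained sketch reproduces it with a plain Future. AlreadyExistsSketchException and awaitCreation are hypothetical names introduced here; the real code catches the AlreadyExistsException surfaced by createEntryGroupAsync, and this sketch uses java.util.logging in place of the project's logger.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-in for the client library's AlreadyExistsException. */
final class AlreadyExistsSketchException extends RuntimeException {
  AlreadyExistsSketchException(String message) {
    super(message);
  }
}

final class EntryGroupCreationSketch {
  private static final Logger LOG = Logger.getLogger("EntryGroupCreationSketch");

  private EntryGroupCreationSketch() {}

  /** Returns true if the group exists afterwards; throws on any other failure. */
  static boolean awaitCreation(Future<?> operation) {
    try {
      operation.get();
      return true;
    } catch (InterruptedException e) {
      // Restore the interrupt flag, log, then fail fast.
      Thread.currentThread().interrupt();
      LOG.log(Level.SEVERE, "Interrupted while creating EntryGroup", e);
      throw new RuntimeException("Interrupted while creating EntryGroup", e);
    } catch (ExecutionException e) {
      if (e.getCause() instanceof AlreadyExistsSketchException) {
        return true; // An existing group counts as success.
      }
      LOG.log(Level.SEVERE, "Failed to create EntryGroup", e);
      throw new RuntimeException("Failed to create EntryGroup", e);
    }
  }
}
```

Logging before the throw follows the reference above: the failure is recorded even if the caller never logs the exception.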

@codecov

codecov Bot commented Jun 17, 2026

Codecov Report

❌ Patch coverage is 0% with 244 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.45%. Comparing base (65d422a) to head (e59dea4).
⚠️ Report is 3 commits behind head on main.

Files with missing lines                                Patch %   Lines
...ud/dataflow/cdc/common/DataCatalogSchemaUtils.java   0.00%     152 Missing ⚠️
.../google/cloud/dataflow/cdc/common/SchemaUtils.java   0.00%     91 Missing ⚠️
...d/dataflow/cdc/connector/PubSubChangeConsumer.java   0.00%     1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3927      +/-   ##
============================================
- Coverage     55.59%   55.45%   -0.15%     
+ Complexity     7052     6601     -451     
============================================
  Files          1103     1103              
  Lines         67681    67866     +185     
  Branches       7603     7627      +24     
============================================
+ Hits          37625    37632       +7     
- Misses        27636    27808     +172     
- Partials       2420     2426       +6     

Components                          Coverage Δ
spanner-templates                   88.48% <ø> (-0.02%) ⬇️
spanner-import-export               68.70% <ø> (-0.01%) ⬇️
spanner-live-forward-migration      90.23% <ø> (-0.02%) ⬇️
spanner-live-reverse-replication    84.41% <ø> (-0.03%) ⬇️
spanner-bulk-migration              92.62% <ø> (-0.01%) ⬇️
gcs-spanner-dv                      88.88% <ø> (-0.02%) ⬇️

Files with missing lines                                Coverage Δ
...d/dataflow/cdc/connector/PubSubChangeConsumer.java   0.00% <0.00%> (ø)
.../google/cloud/dataflow/cdc/common/SchemaUtils.java   0.00% <0.00%> (ø)
...ud/dataflow/cdc/common/DataCatalogSchemaUtils.java   0.00% <0.00%> (ø)

... and 7 files with indirect coverage changes

@derrickaw

Contributor

Fixes: #3921

Labels

bug-fix, improvement (Making existing code better), size/XL
