Closed
14 changes: 8 additions & 6 deletions .github/workflows/build-test.yml
@@ -65,12 +65,15 @@ jobs:
version: 10
platform: x64

- name: Set up Google Cloud SDK
uses: google-github-actions/setup-gcloud@v0
- name: Authenticate to Google Cloud
id: auth
uses: google-github-actions/auth@v2
with:
project_id: ${{ secrets.GCP_PROJECT_ID }}
service_account_key: ${{ secrets.GCP_SA_KEY }}
export_default_credentials: true
workload_identity_provider: ${{ secrets.GCP_WIF_PROVIDER }}
service_account: ${{ secrets.GCP_WIF_SA }}

- name: Set up Cloud SDK
uses: google-github-actions/setup-gcloud@v2

- name: Get gcloud CLI info
run: gcloud info
@@ -82,7 +85,6 @@ jobs:
export CLOUDDQ_BIGQUERY_DATASET="clouddq_test_usc1"
export CLOUDDQ_BIGQUERY_REGION="us-central1"
export GOOGLE_APPLICATION_CREDENTIALS=${GOOGLE_APPLICATION_CREDENTIALS}
export GOOGLE_SDK_CREDENTIALS=${GOOGLE_APPLICATION_CREDENTIALS}
export IMPERSONATION_SERVICE_ACCOUNT=${{ secrets.IMPERSONATION_SERVICE_ACCOUNT }}
export GCS_BUCKET_NAME=${{ secrets.GCS_BUCKET_NAME }}
export GCS_BAZEL_CACHE=${{ secrets.BAZEL_CACHE_BUCKET }}
55 changes: 55 additions & 0 deletions USERMANUAL.md
@@ -612,3 +612,58 @@ CloudDQ supports the follow methods for authenticating to GCP:
### Number of Concurrent BigQuery Threads
1. The `--num_threads` CLI argument specifies the number of concurrent BigQuery operations; increasing it can reduce run time.
2. `--num_threads` is optional and defaults to `8` threads. We advise setting it to the number of cores on the machine that runs CloudDQ. One worker thread per core is a reasonable starting point, but the best value also depends on other factors, such as other applications and services running on the same machine.
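
As a rough illustration of the one-thread-per-core advice, the snippet below derives a thread count from the number of online cores and prints the resulting flag. It is a sketch rather than part of CloudDQ: the `NUM_THREADS` variable is hypothetical, and `getconf _NPROCESSORS_ONLN` is assumed to be available, as it is on most Linux and macOS systems.

```bash
# Number of online CPU cores; fall back to the documented default of 8
# if the system cannot report it.
NUM_THREADS="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 8)"

# Print the flag to append to your usual clouddq command line.
echo "--num_threads ${NUM_THREADS}"
```

On a shared machine you may want a lower value than the core count, for the reasons given above.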


### Setting up Workload Identity Federation to run the GitHub Actions

If your organization blocks the export of Service Account JSON keys, you can set up Workload Identity Federation (WIF) to allow GitHub Actions to authenticate securely with Google Cloud.

#### Step 1: Configure WIF in Google Cloud

Run the following `gcloud` commands in your terminal (ensure you are authenticated and have the necessary IAM permissions). Replace placeholders with your actual values:

```bash
# Set your variables
export PROJECT_ID="your-project-id"
export PROJECT_NUMBER="your-project-number" # Numeric project number (not the project ID), shown in the Google Cloud console
export SERVICE_ACCOUNT_EMAIL="your-service-account@your-project-id.iam.gserviceaccount.com"
export REPO_OWNER="your-github-username-or-org"
export REPO_NAME="cloud-data-quality"

# 1. Enable IAM Credentials API
gcloud services enable iamcredentials.googleapis.com --project="${PROJECT_ID}"

# 2. Create the Workload Identity Pool
gcloud iam workload-identity-pools create "github-pool" \
--project="${PROJECT_ID}" \
--location="global" \
--display-name="GitHub Actions Pool"

# 3. Create the Workload Identity Provider
gcloud iam workload-identity-pools providers create-oidc "github-provider" \
--project="${PROJECT_ID}" \
--location="global" \
--workload-identity-pool="github-pool" \
--display-name="GitHub Provider" \
--attribute-mapping="google.subject=assertion.sub,attribute.actor=assertion.actor,attribute.repository=assertion.repository" \
--issuer-uri="https://token.actions.githubusercontent.com" \
--attribute-condition="attribute.repository == '${REPO_OWNER}/${REPO_NAME}'"

# 4. Allow the GitHub repository to impersonate the Service Account
gcloud iam service-accounts add-iam-policy-binding "${SERVICE_ACCOUNT_EMAIL}" \
--project="${PROJECT_ID}" \
--role="roles/iam.workloadIdentityUser" \
--member="principalSet://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/github-pool/attribute.repository/${REPO_OWNER}/${REPO_NAME}" \
--condition=None
```
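
To confirm that the commands above took effect, you can read the resources back. The following is a minimal sketch, assuming the same variables and the `github-pool` / `github-provider` names used in Step 1; the helper name `verify_wif_setup` is hypothetical and the output fields chosen in `--format` are only a suggestion.

```bash
# Hypothetical helper: reads back the provider and the service account's
# IAM policy. It relies on PROJECT_ID and SERVICE_ACCOUNT_EMAIL being
# exported as in the block above, and on an authenticated gcloud.
verify_wif_setup() {
  gcloud iam workload-identity-pools providers describe "github-provider" \
    --project="${PROJECT_ID}" \
    --location="global" \
    --workload-identity-pool="github-pool" \
    --format="yaml(name,attributeCondition,attributeMapping)" &&
  gcloud iam service-accounts get-iam-policy "${SERVICE_ACCOUNT_EMAIL}" \
    --project="${PROJECT_ID}" \
    --format="yaml(bindings)"
}
```

Defining the function runs nothing by itself. Call `verify_wif_setup` afterwards: the provider output should show the `attribute.repository` condition for your repository, and the policy should list a `roles/iam.workloadIdentityUser` binding for the `principalSet://` member.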

#### Step 2: Configure GitHub Repository Secrets

After setting up WIF in GCP, navigate to your GitHub Repository Settings > Secrets and variables > Actions, and add the following **Repository Secrets**:

- **`GCP_WIF_PROVIDER`**: Set value to `projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/github-pool/providers/github-provider` (replace `${PROJECT_NUMBER}` with your actual number).
- **`GCP_WIF_SA`**: Set value to your Service Account email (e.g. `your-service-account@your-project-id.iam.gserviceaccount.com`).
- **`GCP_PROJECT_ID`**: (If not already set) Set value to your GCP Project ID.
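
The secret values can be assembled from the variables used in Step 1. The snippet below is a sketch with made-up example values (`123456789012` is a placeholder, not a real project number). If you do not know your project number, `gcloud projects describe "${PROJECT_ID}" --format="value(projectNumber)"` prints it.

```bash
# Example values only; replace with your own.
PROJECT_NUMBER="123456789012"
SERVICE_ACCOUNT_EMAIL="your-service-account@your-project-id.iam.gserviceaccount.com"

# Full resource name of the provider created in Step 1.
GCP_WIF_PROVIDER="projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/github-pool/providers/github-provider"
GCP_WIF_SA="${SERVICE_ACCOUNT_EMAIL}"

# Print the values to paste into the repository secrets.
printf 'GCP_WIF_PROVIDER=%s\n' "${GCP_WIF_PROVIDER}"
printf 'GCP_WIF_SA=%s\n' "${GCP_WIF_SA}"
```

If you use the GitHub CLI and are authenticated against the repository, `gh secret set GCP_WIF_PROVIDER --body "${GCP_WIF_PROVIDER}"` is one way to store a value without going through the web UI.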

The GitHub Actions workflow (`.github/workflows/build-test.yml`) reads these secrets and uses them to authenticate to Google Cloud through WIF.
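
If authentication fails, one thing worth checking is the job's permissions: `google-github-actions/auth` can only request a GitHub OIDC token when the job grants `id-token: write`. The sketch below greps a workflow file for that setting. The helper name `has_id_token_permission` is hypothetical, the check is a plain text match rather than a YAML parse, and this change does not show whether `build-test.yml` already sets the permission.

```bash
# Hypothetical helper: succeeds if the given workflow file contains an
# `id-token: write` line (for example inside a `permissions:` block).
has_id_token_permission() {
  grep -Eq '^[[:space:]]*id-token:[[:space:]]*write([[:space:]]|$)' "$1"
}

WORKFLOW_FILE="${WORKFLOW_FILE:-.github/workflows/build-test.yml}"
if [ -f "${WORKFLOW_FILE}" ] && has_id_token_permission "${WORKFLOW_FILE}"; then
  echo "id-token: write is set in ${WORKFLOW_FILE}"
else
  echo "id-token: write not found in ${WORKFLOW_FILE}"
fi
```

A text match like this will not recognise `permissions: write-all`, so treat a negative result as a prompt to read the workflow file, not as proof that the permission is missing.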

73 changes: 73 additions & 0 deletions cloudbuild-release-debian11-v2.yaml
@@ -0,0 +1,73 @@
steps:
- name: 'debian:11-slim'
env:
- 'GOOGLE_CLOUD_PROJECT=${_GOOGLE_CLOUD_PROJECT}'
- 'CLOUDDQ_BIGQUERY_DATASET=${_CLOUDDQ_BIGQUERY_DATASET}'
- 'CLOUDDQ_BIGQUERY_REGION=${_CLOUDDQ_BIGQUERY_REGION}'
- 'GCS_BUCKET_NAME=${_GCS_BUCKET_NAME}'
- 'GCS_BAZEL_CACHE=${_GCS_BAZEL_CACHE}'
- 'DATAPLEX_ENDPOINT=${_DATAPLEX_ENDPOINT}'
- 'DATAPLEX_LAKE_NAME=${_DATAPLEX_LAKE_NAME}'
- 'DATAPLEX_REGION_ID=${_DATAPLEX_REGION_ID}'
- 'DATAPLEX_ZONE_ID=${_DATAPLEX_ZONE_ID}'
- 'DATAPLEX_BIGQUERY_DATASET_ID=${_DATAPLEX_BIGQUERY_DATASET_ID}'
- 'DATAPLEX_BUCKET_NAME=${_DATAPLEX_BUCKET_NAME}'
- 'DATAPLEX_TARGET_BQ_DATASET=${_DATAPLEX_TARGET_BQ_DATASET}'
- 'DATAPLEX_TARGET_BQ_TABLE=${_DATAPLEX_TARGET_BQ_TABLE}'
- 'DATAPLEX_TASK_SA=${_DATAPLEX_TASK_SA}'
- 'TAG_NAME=$TAG_NAME'
- 'GCS_RELEASE_BUCKET=${_GCS_RELEASE_BUCKET}'
args:
- '-c'
- >
set -x

DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install -yq
tzdata sudo lsb-release apt-utils curl

source scripts/cloud_build_test_debian.sh "3.9.7"

make check

make test test_dataplex_dq_configs_cache

make test test_dataplex_metadata

make test test_dataplex_metadata_uri_templates

make test test_dataplex_task

make addlicense

source scripts/install_gcloud.sh

gsutil ls gs://$_GCS_RELEASE_BUCKET

gsutil cp clouddq_patched.zip
gs://${_GCS_RELEASE_BUCKET}/build-artifacts/debian11/python3.9/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip

gsutil cp clouddq_patched.zip.hashsum
gs://${_GCS_RELEASE_BUCKET}/build-artifacts/debian11/python3.9/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip.hashsum

gsutil cp clouddq/integration/clouddq_pyspark_driver.py
gs://${_GCS_RELEASE_BUCKET}/build-artifacts/debian11/python3.9/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq_pyspark_driver.py

gsutil cp clouddq_patched.zip
gs://${_GCS_BUCKET_NAME}/build-artifacts/debian11/python3.9/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip

gsutil cp clouddq_patched.zip.hashsum
gs://${_GCS_BUCKET_NAME}/build-artifacts/debian11/python3.9/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip.hashsum

gsutil cp clouddq/integration/clouddq_pyspark_driver.py
gs://${_GCS_BUCKET_NAME}/build-artifacts/debian11/python3.9/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq_pyspark_driver.py
entrypoint: /bin/bash
timeout: 3600s
logsBucket: 'gs://pbalm-test-github-cloud-build'
options:
machineType: E2_HIGHCPU_8
73 changes: 2 additions & 71 deletions cloudbuild-release-debian11.yaml
@@ -1,73 +1,4 @@
steps:
- name: 'debian:11-slim'
env:
- 'GOOGLE_CLOUD_PROJECT=${_GOOGLE_CLOUD_PROJECT}'
- 'CLOUDDQ_BIGQUERY_DATASET=${_CLOUDDQ_BIGQUERY_DATASET}'
- 'CLOUDDQ_BIGQUERY_REGION=${_CLOUDDQ_BIGQUERY_REGION}'
- 'GCS_BUCKET_NAME=${_GCS_BUCKET_NAME}'
- 'GCS_BAZEL_CACHE=${_GCS_BAZEL_CACHE}'
- 'DATAPLEX_ENDPOINT=${_DATAPLEX_ENDPOINT}'
- 'DATAPLEX_LAKE_NAME=${_DATAPLEX_LAKE_NAME}'
- 'DATAPLEX_REGION_ID=${_DATAPLEX_REGION_ID}'
- 'DATAPLEX_ZONE_ID=${_DATAPLEX_ZONE_ID}'
- 'DATAPLEX_BIGQUERY_DATASET_ID=${_DATAPLEX_BIGQUERY_DATASET_ID}'
- 'DATAPLEX_BUCKET_NAME=${_DATAPLEX_BUCKET_NAME}'
- 'DATAPLEX_TARGET_BQ_DATASET=${_DATAPLEX_TARGET_BQ_DATASET}'
- 'DATAPLEX_TARGET_BQ_TABLE=${_DATAPLEX_TARGET_BQ_TABLE}'
- 'DATAPLEX_TASK_SA=${_DATAPLEX_TASK_SA}'
- 'TAG_NAME=$TAG_NAME'
- 'GCS_RELEASE_BUCKET=${_GCS_RELEASE_BUCKET}'
args:
- '-c'
- >
set -x

DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install -yq
tzdata sudo lsb-release apt-utils curl

source scripts/cloud_build_test_debian.sh "3.9.7"

make check

make test test_dataplex_dq_configs_cache

make test test_dataplex_metadata

make test test_dataplex_metadata_uri_templates

make test test_dataplex_task

make addlicense

source scripts/install_gcloud.sh

gsutil ls gs://$_GCS_RELEASE_BUCKET

gsutil cp clouddq_patched.zip
gs://${_GCS_RELEASE_BUCKET}/build-artifacts/debian11/python3.9/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip

gsutil cp clouddq_patched.zip.hashsum
gs://${_GCS_RELEASE_BUCKET}/build-artifacts/debian11/python3.9/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip.hashsum

gsutil cp clouddq/integration/clouddq_pyspark_driver.py
gs://${_GCS_RELEASE_BUCKET}/build-artifacts/debian11/python3.9/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq_pyspark_driver.py

gsutil cp clouddq_patched.zip
gs://${_GCS_BUCKET_NAME}/build-artifacts/debian11/python3.9/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip

gsutil cp clouddq_patched.zip.hashsum
gs://${_GCS_BUCKET_NAME}/build-artifacts/debian11/python3.9/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip.hashsum

gsutil cp clouddq/integration/clouddq_pyspark_driver.py
gs://${_GCS_BUCKET_NAME}/build-artifacts/debian11/python3.9/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq_pyspark_driver.py
entrypoint: /bin/bash
timeout: 3600s
logsBucket: 'gs://dataplex-clouddq-github-cloud-build'
options:
machineType: E2_HIGHCPU_8
args: ['-c', 'echo "This build is deprecated on this trigger. Continuing in GitHub Actions."']
entrypoint: /bin/bash
69 changes: 69 additions & 0 deletions cloudbuild-release-ubuntu18-v2.yaml
@@ -0,0 +1,69 @@
steps:
- name: 'ubuntu:18.04'
env:
- 'GOOGLE_CLOUD_PROJECT=${_GOOGLE_CLOUD_PROJECT}'
- 'CLOUDDQ_BIGQUERY_DATASET=${_CLOUDDQ_BIGQUERY_DATASET}'
- 'CLOUDDQ_BIGQUERY_REGION=${_CLOUDDQ_BIGQUERY_REGION}'
- 'GCS_BUCKET_NAME=${_GCS_BUCKET_NAME}'
- 'GCS_BAZEL_CACHE=${_GCS_BAZEL_CACHE}'
- 'DATAPLEX_ENDPOINT=${_DATAPLEX_ENDPOINT}'
- 'DATAPLEX_LAKE_NAME=${_DATAPLEX_LAKE_NAME}'
- 'DATAPLEX_REGION_ID=${_DATAPLEX_REGION_ID}'
- 'DATAPLEX_ZONE_ID=${_DATAPLEX_ZONE_ID}'
- 'DATAPLEX_BIGQUERY_DATASET_ID=${_DATAPLEX_BIGQUERY_DATASET_ID}'
- 'DATAPLEX_BUCKET_NAME=${_DATAPLEX_BUCKET_NAME}'
- 'DATAPLEX_TARGET_BQ_DATASET=${_DATAPLEX_TARGET_BQ_DATASET}'
- 'DATAPLEX_TARGET_BQ_TABLE=${_DATAPLEX_TARGET_BQ_TABLE}'
- 'DATAPLEX_TASK_SA=${_DATAPLEX_TASK_SA}'
- 'TAG_NAME=$TAG_NAME'
- 'GCS_RELEASE_BUCKET=${_GCS_RELEASE_BUCKET}'
args:
- '-c'
- >
set -x

DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install -yq
tzdata sudo lsb-release apt-utils curl software-properties-common

sudo add-apt-repository ppa:ubuntu-toolchain-r/test

DEBIAN_FRONTEND=noninteractive sudo apt-get update && sudo apt-get install gcc-10 -yq

source scripts/cloud_build_test_ubuntu.sh "3.8.6"

make test test_dataplex_dq_configs_cache

make test test_dataplex_metadata

source scripts/install_gcloud.sh

gsutil ls gs://$_GCS_RELEASE_BUCKET

gsutil cp clouddq_patched.zip
gs://${_GCS_RELEASE_BUCKET}/build-artifacts/ubuntu18.04/python3.8/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip

gsutil cp clouddq_patched.zip.hashsum
gs://${_GCS_RELEASE_BUCKET}/build-artifacts/ubuntu18.04/python3.8/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip.hashsum

gsutil cp clouddq/integration/clouddq_pyspark_driver.py
gs://${_GCS_RELEASE_BUCKET}/build-artifacts/ubuntu18.04/python3.8/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq_pyspark_driver.py

gsutil cp clouddq_patched.zip
gs://${_GCS_BUCKET_NAME}/build-artifacts/ubuntu18.04/python3.8/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip

gsutil cp clouddq_patched.zip.hashsum
gs://${_GCS_BUCKET_NAME}/build-artifacts/ubuntu18.04/python3.8/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip.hashsum

gsutil cp clouddq/integration/clouddq_pyspark_driver.py
gs://${_GCS_BUCKET_NAME}/build-artifacts/ubuntu18.04/python3.8/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq_pyspark_driver.py
entrypoint: /bin/bash
timeout: 3600s
logsBucket: 'gs://pbalm-test-github-cloud-build'
options:
machineType: E2_HIGHCPU_8
69 changes: 2 additions & 67 deletions cloudbuild-release-ubuntu18.yaml
@@ -1,69 +1,4 @@
steps:
- name: 'ubuntu:18.04'
env:
- 'GOOGLE_CLOUD_PROJECT=${_GOOGLE_CLOUD_PROJECT}'
- 'CLOUDDQ_BIGQUERY_DATASET=${_CLOUDDQ_BIGQUERY_DATASET}'
- 'CLOUDDQ_BIGQUERY_REGION=${_CLOUDDQ_BIGQUERY_REGION}'
- 'GCS_BUCKET_NAME=${_GCS_BUCKET_NAME}'
- 'GCS_BAZEL_CACHE=${_GCS_BAZEL_CACHE}'
- 'DATAPLEX_ENDPOINT=${_DATAPLEX_ENDPOINT}'
- 'DATAPLEX_LAKE_NAME=${_DATAPLEX_LAKE_NAME}'
- 'DATAPLEX_REGION_ID=${_DATAPLEX_REGION_ID}'
- 'DATAPLEX_ZONE_ID=${_DATAPLEX_ZONE_ID}'
- 'DATAPLEX_BIGQUERY_DATASET_ID=${_DATAPLEX_BIGQUERY_DATASET_ID}'
- 'DATAPLEX_BUCKET_NAME=${_DATAPLEX_BUCKET_NAME}'
- 'DATAPLEX_TARGET_BQ_DATASET=${_DATAPLEX_TARGET_BQ_DATASET}'
- 'DATAPLEX_TARGET_BQ_TABLE=${_DATAPLEX_TARGET_BQ_TABLE}'
- 'DATAPLEX_TASK_SA=${_DATAPLEX_TASK_SA}'
- 'TAG_NAME=$TAG_NAME'
- 'GCS_RELEASE_BUCKET=${_GCS_RELEASE_BUCKET}'
args:
- '-c'
- >
set -x

DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install -yq
tzdata sudo lsb-release apt-utils curl software-properties-common

sudo add-apt-repository ppa:ubuntu-toolchain-r/test

DEBIAN_FRONTEND=noninteractive sudo apt-get update && sudo apt-get install gcc-10 -yq

source scripts/cloud_build_test_ubuntu.sh "3.8.6"

make test test_dataplex_dq_configs_cache

make test test_dataplex_metadata

source scripts/install_gcloud.sh

gsutil ls gs://$_GCS_RELEASE_BUCKET

gsutil cp clouddq_patched.zip
gs://${_GCS_RELEASE_BUCKET}/build-artifacts/ubuntu18.04/python3.8/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip

gsutil cp clouddq_patched.zip.hashsum
gs://${_GCS_RELEASE_BUCKET}/build-artifacts/ubuntu18.04/python3.8/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip.hashsum

gsutil cp clouddq/integration/clouddq_pyspark_driver.py
gs://${_GCS_RELEASE_BUCKET}/build-artifacts/ubuntu18.04/python3.8/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq_pyspark_driver.py

gsutil cp clouddq_patched.zip
gs://${_GCS_BUCKET_NAME}/build-artifacts/ubuntu18.04/python3.8/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip

gsutil cp clouddq_patched.zip.hashsum
gs://${_GCS_BUCKET_NAME}/build-artifacts/ubuntu18.04/python3.8/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq-executable.zip.hashsum

gsutil cp clouddq/integration/clouddq_pyspark_driver.py
gs://${_GCS_BUCKET_NAME}/build-artifacts/ubuntu18.04/python3.8/main/`date
-I'minutes'`_${TAG_NAME}_${SHORT_SHA}/clouddq_pyspark_driver.py
- name: 'debian:11-slim'
args: ['-c', 'echo "This build is deprecated on this trigger. Continuing in GitHub Actions."']
entrypoint: /bin/bash
timeout: 3600s
logsBucket: 'gs://dataplex-clouddq-github-cloud-build'
options:
machineType: E2_HIGHCPU_8
2 changes: 1 addition & 1 deletion cloudbuild.yaml
@@ -32,4 +32,4 @@ steps:
timeout: 3600s
options:
machineType: 'E2_HIGHCPU_8'
logsBucket: 'gs://dataplex-clouddq-github-cloud-build'
logsBucket: 'gs://pbalm-test-github-cloud-build'
2 changes: 1 addition & 1 deletion clouddq/integration/dataplex/clouddq_dataplex.py
@@ -36,7 +36,7 @@
TARGET_SCOPES = [
"https://www.googleapis.com/auth/cloud-platform",
]
DEFAULT_GCS_BUCKET_NAME = "dataplex-clouddq-artifacts-{gcp_dataplex_region}"
DEFAULT_GCS_BUCKET_NAME = "pbalm-test-artifacts-{gcp_dataplex_region}"


class DATAPLEX_TASK_TRIGGER_TYPE(str, Enum):