This project implements an anomaly detection system using autoencoders trained on Google Cloud Vertex AI. An autoencoder is a neural network that learns to compress data into a lower-dimensional representation and then reconstruct it back to the original form. Anomalies are detected by measuring reconstruction error: data points that the model reconstructs poorly are likely to be anomalous.
The system trains an autoencoder on normal transaction data from BigQuery and uses reconstruction error thresholds to identify anomalous patterns. The training pipeline includes data preprocessing, feature engineering, model training, and evaluation, all orchestrated through Vertex AI.
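The detection step can be sketched in a few lines of numpy. This is an illustrative version of the reconstruction-error idea, not the repo's actual code; the 0.95 cutoff mirrors the `quantile-threshold` default, and the toy data stands in for real model inputs and outputs.

```python
import numpy as np

def reconstruction_errors(x, x_hat):
    """Per-sample mean squared reconstruction error."""
    return np.mean((x - x_hat) ** 2, axis=1)

def flag_anomalies(errors, quantile_threshold=0.95):
    """Flag samples whose error exceeds the given quantile of all errors."""
    threshold = np.quantile(errors, quantile_threshold)
    return errors > threshold, threshold

# Toy data standing in for model inputs and autoencoder outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8))
x_hat = x + rng.normal(scale=0.1, size=x.shape)   # near-perfect reconstruction
x_hat[:5] += 3.0                                  # 5 badly reconstructed rows
errors = reconstruction_errors(x, x_hat)
flags, threshold = flag_anomalies(errors, quantile_threshold=0.95)
```

The five corrupted rows end up with errors far above the 0.95 quantile of the error distribution, so they are the first ones flagged.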
- Configure your training parameters by filling out `config.json`:

  ```
  cp config.example.json config.json
  ```

  Edit `config.json` with your specific parameters:

  - GCP project and data paths
  - Column specifications for your dataset
  - Training hyperparameters
  - Data date ranges and filtering options
- Configure your Vertex AI job specifications by filling out `jobspec.json`:

  ```
  cp jobspec.example.json jobspec.json
  ```

  Edit `jobspec.json` with your compute requirements:

  - Machine type and accelerators
  - Service account
  - Resource allocation
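For illustration, a `jobspec.json` could be generated like this. Every field name and value below is a placeholder loosely modeled on Vertex AI worker pool specs; `jobspec.example.json` is the authoritative schema for this repo.

```python
import json

# Hypothetical jobspec.json contents; field names are placeholders
# loosely modeled on Vertex AI worker pool specs. The authoritative
# schema is jobspec.example.json in this repo.
jobspec = {
    "service_account": "trainer@my-gcp-project.iam.gserviceaccount.com",
    "worker_pool_specs": [
        {
            "replica_count": 1,
            "machine_spec": {
                "machine_type": "n1-standard-8",
                "accelerator_type": "NVIDIA_TESLA_T4",
                "accelerator_count": 1,
            },
        }
    ],
}

with open("jobspec.json", "w") as f:
    json.dump(jobspec, f, indent=2)
```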
- Submit the training job to Vertex AI:

  ```
  python3 submit_job.py
  ```
- `project-id`: GCP project ID
- `gcs-path`: Destination GCS path for model artifacts and temporary data
- `bq-training-data-path`: BigQuery source table path
- `bq-report-path`: BigQuery target table path for reports
- `end-train-date`: Training data end date (YYYY-MM-DD)
- `start-train-interval`: Days before the end date at which the training data window starts (default: 90)
- `validation-interval`: Days for the validation dataset (default: 1)
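As an example, the data parameters above could be filled in like this. All values are placeholders; `config.example.json` defines the real schema.

```python
import json

# Placeholder values for the data parameters; the authoritative
# schema is config.example.json in this repo.
config = {
    "project-id": "my-gcp-project",
    "gcs-path": "gs://my-bucket/anomaly-detection",
    "bq-training-data-path": "my-gcp-project.transactions.raw",
    "bq-report-path": "my-gcp-project.transactions.anomaly_reports",
    "end-train-date": "2024-01-31",
    "start-train-interval": 90,
    "validation-interval": 1,
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```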
- `id-columns`: List of identifier columns
- `drop-columns`: Columns to exclude from training
- `impute-columns`: Columns to impute with 0
- `log-scale-columns`: Columns requiring log normalization
- `stat-encoding-columns`: High-cardinality categorical columns for statistical encoding
- `periodic-columns`: Columns with periodic topology
- `ohe-columns`: Columns to one-hot encode
- `time-column`: Transaction timestamp column
- `learning-rate`: Training learning rate (default: 0.001)
- `n-hidden`: Number of hidden layers (default: 3)
- `latent-dim`: Latent space dimension (a float in (0, 1] is interpreted as a ratio of the input width; an int as an absolute size)
- `activation`: Hidden layer activation function (default: 'relu')
- `quantile-threshold`: Anomaly detection threshold quantile (default: 0.95)
- `epochs`: Training epochs (default: 100)
- `batch_size`: Training batch size (default: 1024)
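The float-vs-int behavior of `latent-dim` can be illustrated with a hypothetical helper; this is not the repo's actual code, just a sketch of the documented semantics.

```python
def resolve_latent_dim(latent_dim, n_features):
    """Interpret latent-dim: a float in (0, 1] is a ratio of the input
    width; anything else is an absolute layer size. Illustrative only."""
    if isinstance(latent_dim, float) and 0.0 < latent_dim <= 1.0:
        return max(1, round(latent_dim * n_features))
    return int(latent_dim)

resolve_latent_dim(0.25, 40)  # ratio: a quarter of 40 input features
resolve_latent_dim(8, 40)     # absolute: an 8-unit bottleneck
```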
- `model-name`: Saved model name (default: 'autoencoder')
- `postfix`: Additional identifier appended to model and report names
- `get-new-data`: Whether to fetch fresh data from BigQuery (default: true). Any value other than "true" is treated as false.
- Google Cloud SDK configured with appropriate permissions
- Access to BigQuery source data
- Vertex AI API enabled
- Required Python dependencies (see requirements.txt)