NATS-Based Data Processing Pipeline

This project consists of a complete data processing pipeline using NATS JetStream for messaging between microservices, with support for both local Docker Compose deployment and cloud-native Kubernetes deployment on Google Kubernetes Engine (GKE).

Note: This repository uses Git submodules. Clone with git clone --recurse-submodules https://github.com/richardr1126/k8s-datacenter-project.git or run git submodule update --init --recursive if already cloned. See Submodule Repositories below for links to individual service repositories.

Architecture

Cloud Deployment (GKE)

The project includes a cost-optimized GKE cluster manager for production deployments:

Cluster Type: Single zonal cluster (us-central1-b)
Node Pools:
- Default Pool: t2d-standard-2 (2 vCPUs, 8GB RAM) for general workloads
- ML Pool: n2d-highcpu-4 (4 vCPUs, 4GB RAM) for sentiment analysis workloads
Cost Optimization: Spot instances (60-91% savings), private nodes with Cloud NAT
Estimated Cost: ~$40-67/month with spot instances

Services

NATS Server: Message broker with JetStream persistence
NATS Firehose Ingest: Ingests data from external sources into NATS streams
NATS Stream Processor: Processes messages with sentiment analysis
Mock Ingest: Generates test data for development
Sentiment Web UI: Real-time visualization dashboard (Next.js)

Submodule Repositories

bsky-sentiment-web - Real-time visualization dashboard (Next.js)
gke-cluster - GKE cluster management and Helm charts
nats-firehose-ingest - Firehose data ingestion service
nats-stream-processor - Sentiment analysis processor

LOCAL DEVELOPMENT

Quick Start (Docker Compose)

Run the complete pipeline:

docker compose up --build

Run with mock data generation:

docker compose --profile mock up --build

Run with debugging tools:

docker compose --profile tools up --build

Run everything (full setup):

docker compose --profile mock --profile tools up --build

Local Service Endpoints

NATS Server:
- Client: localhost:4222
- HTTP Monitoring: localhost:8222
- Cluster: localhost:6222
Firehose Ingest: http://localhost:8081
- Health: http://localhost:8081/health
- Metrics: http://localhost:8081/metrics
Stream Processor 0: http://localhost:8082
- Health: http://localhost:8082/health
- Metrics: http://localhost:8082/metrics
Stream Processor 1: http://localhost:8083
- Health: http://localhost:8083/health
- Metrics: http://localhost:8083/metrics
Sentiment Web UI: http://localhost:3000

Local Monitoring and Debugging

Check service health:

curl http://localhost:8081/health  # Firehose Ingest
curl http://localhost:8082/health  # Stream Processor 0
curl http://localhost:8083/health  # Stream Processor 1

View metrics:

curl http://localhost:8081/metrics  # Firehose Ingest metrics
curl http://localhost:8082/metrics  # Stream Processor 0 metrics
curl http://localhost:8083/metrics  # Stream Processor 1 metrics

NATS monitoring:

Web UI: http://localhost:8222
CLI access: docker compose exec nats-box nats

Interactive NATS debugging:

# Start with tools profile
docker compose --profile tools up -d

# Connect to NATS box
docker compose exec nats-box sh

# Inside the container, you can use NATS CLI:
nats stream list
nats stream info bluesky-posts-dev
nats stream info bluesky-posts-enriched-dev
nats sub "bluesky.posts.dev.>"
nats sub "bluesky.enriched.dev.>"

Local Data Flow

Mock Ingest → publishes test messages to bluesky-posts-dev stream with subject bluesky.posts.dev.*
Firehose Ingest → consumes from external firehose and publishes to bluesky-posts-dev stream with subject bluesky.posts.dev.*
Stream Processors (0 & 1) → consume from bluesky-posts-dev stream (load-balanced), add sentiment/topic analysis, publish to bluesky-posts-enriched-dev stream with subject bluesky.enriched.dev.*
Sentiment Web UI → subscribes to enriched stream and displays real-time updates

Local Development

Individual service development:

Each service can still be developed independently using their local docker-compose files:

# Firehose Ingest only
cd nats-firehose-ingest
docker compose up --build

# Stream Processor only  
cd nats-stream-processor
docker compose up --build

# Sentiment Web UI only
cd bsky-sentiment-web
npm install
npm run dev

Logs:

# View all logs
docker compose logs -f

# View specific service logs
docker compose logs -f nats-firehose-ingest
docker compose logs -f nats-stream-processor-0
docker compose logs -f nats-stream-processor-1
docker compose logs -f mock-ingest
docker compose logs -f bsky-sentiment-web

Cleanup:

# Stop all services
docker compose down

# Remove volumes (clears data)
docker compose down -v

# Remove images
docker compose down --rmi all

CLOUD DEPLOYMENT (GKE)

Prerequisites

Install Google Cloud SDK and authenticate:

gcloud auth login
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

Enable required APIs:

gcloud services enable container.googleapis.com compute.googleapis.com

Install uv package manager and Python dependencies:

# uv is the recommended way to install dependencies
 curl -LsSf https://astral.sh/uv/install.sh | sh
 
 # Install dependencies and run
 uv sync --frozen
 source .venv/bin/activate

Create .env.prod files for each microservice:

IMPORTANT: Before running ./create-all.sh or deploying with Helm, you must create .env.prod files in each microservice directory. These files contain environment variables that will be loaded into Kubernetes secrets.

# Create .env.prod in each service directory:
cp nats-firehose-ingest/.env.example nats-firehose-ingest/.env.prod  # Edit with your values
cp nats-stream-processor/.env.example nats-stream-processor/.env.prod  # Edit with your values
cp bsky-sentiment-web/.env.example bsky-sentiment-web/.env.prod  # Edit with your values

Environment file comparisons:

nats-firehose-ingest:

Variable	.env.example	.env.prod	Notes
`NATS_STREAM_NUM_REPLICAS`	`1`	`3`	Increased for HA in production
All others	Same as .env.prod	See .env.example	Same for dev and prod

nats-stream-processor:

Variable	.env.example	.env.prod	Notes
`NUM_STREAM_REPLICAS`	`1`	`3`	Increased for HA in production
`SENTIMENT_MODEL_CACHE_DIR`	`./models/sentiment`	`/var/cache/models/sentiment`	Absolute path for containerized env
`TOPIC_MODEL_CACHE_DIR`	`./models/topics`	`/var/cache/models/topics`	Absolute path for containerized env
`SENTIMENT_CONFIDENCE_THRESHOLD`	`0.4`	`0.3`	Lower threshold in production for more detections
All others	Same as .env.prod	See .env.example	Same for dev and prod

bsky-sentiment-web:

Variable	.env.example	.env.prod	Notes
`NATS_URL`	Examples provided	`nats://nats.nats.svc.cluster.local:4222`	K8s DNS for production
`OUTPUT_STREAM`	Examples provided	`bluesky-posts-enriched`	Match processor output stream
`OUTPUT_SUBJECT`	Examples provided	`bluesky.enriched`	Match processor output subject

If these files are missing, ./create-all.sh and the individual ./create-secrets.sh scripts will fail.

Create a GKE Cluster

Quick Setup (Automated)

Before running: Ensure all .env.prod files are created in each microservice directory (see Prerequisites).

Deploy the entire pipeline with a single command:

# Run the complete setup
./create-all.sh

This script automatically:

Creates the GKE cluster with Cloud NAT setup
Connects to the cluster
Installs all Kubernetes applications (NATS, Prometheus, CockroachDB, etc.)
Creates Kubernetes secrets from .env.prod files in each microservice
Deploys all microservices (firehose ingest, stream processor, sentiment web UI)

Manual Setup

cd gke-cluster

# Create cluster with default name and spot instances
uv run gke-cluster.py create

# Create cluster with custom name
uv run gke-cluster.py create --name my-cluster

# Create cluster without spot instances (more expensive but more reliable)
uv run gke-cluster.py create --no-spot

Connect to Your Cluster

gcloud container clusters get-credentials cost-optimized-cluster \
  --zone us-central1-b --project YOUR_PROJECT_ID

# Verify nodes
kubectl get nodes

# Check node pools
kubectl get nodes --show-labels | grep pool

Deploy Services with Helm

After your cluster is running, deploy the services using Helm charts:

1. Install K8s Helm Apps

cd gke-cluster/helm

# Run installation script
# 1. Installs cert-manager, NGINX Gateway Fabric, and cluster issuers
# 2. Installs kube-prometheus-stack w/ grafana for monitoring
# 3. Installs prometheus-adapter for custom metrics in K8s HPA
# 4. Installs NATS cluster with JetStream
# 5. Installs CockroachDB for persistent storage
./install-apps.sh

2. Create Secrets and Deploy Application Services

For each service, create secrets and deploy using Helm:

Firehose Ingest:

cd nats-firehose-ingest/charts
./create-secrets.sh

helm upgrade --install nats-firehose-ingest ./nats-firehose-ingest \
  --set image.tag=latest

Stream Processor:

cd nats-stream-processor/charts
./create-secrets.sh

helm upgrade --install nats-stream-processor ./nats-stream-processor \
  --set image.tag=latest

Sentiment Web UI:

cd bsky-sentiment-web/charts
./create-secrets.sh

helm upgrade --install bsky-sentiment-web ./bsky-sentiment-web \
  --set image.tag=latest

Helm Commands Reference

# List all releases
helm list --all-namespaces

# Check release status
helm status nats-firehose-ingest --namespace default

# View release values
helm get values nats-firehose-ingest --namespace default

# Upgrade a release
helm upgrade nats-firehose-ingest ./nats-firehose-ingest \
  --namespace default \
  --values ./nats-firehose-ingest/values.yaml

# Rollback a release
helm rollback nats-firehose-ingest 1 --namespace default

# Uninstall a release
helm uninstall nats-firehose-ingest --namespace default

# Dry-run to see what would be deployed
helm upgrade --install nats-firehose-ingest ./nats-firehose-ingest \
  --namespace default \
  --dry-run --debug

Create Secrets Script Options

Each microservice has a create-secrets.sh script in its charts/ directory. These scripts read from .env.prod files and create Kubernetes secrets.

REQUIRED: Each microservice must have a .env.prod file in its root directory for the secrets script to work.

# Basic usage (uses .env.prod by default)
cd nats-firehose-ingest/charts
bash create-secrets.sh

# Specify custom namespace
bash create-secrets.sh --namespace my-namespace

# Specify custom secret name
bash create-secrets.sh --secret-name my-secret

# Specify custom .env file
bash create-secrets.sh --env-file /path/to/.env

# Dry-run to see what would be created
bash create-secrets.sh --dry-run

# Combine options
bash create-secrets.sh \
  --namespace production \
  --secret-name app-secrets \
  --env-file .env.production \
  --dry-run

Example for each microservice:

# Firehose Ingest
cd nats-firehose-ingest/charts
bash create-secrets.sh  # requires nats-firehose-ingest/.env.prod

# Stream Processor
cd nats-stream-processor/charts
bash create-secrets.sh  # requires nats-stream-processor/.env.prod

# Sentiment Web UI
cd bsky-sentiment-web/charts
bash create-secrets.sh  # requires bsky-sentiment-web/.env.prod

Scale the Cluster

# Scale all pools to 5 nodes each
uv run gke-cluster.py scale --name cost-optimized-cluster --nodes 5

# Scale down to 0 nodes (save money, only pay for control plane)
uv run gke-cluster.py scale --nodes 0

# Scale specific pool only
uv run gke-cluster.py scale --name cost-optimized-cluster --nodes 3 --pool ml-pool

Delete the Cluster

# Delete cluster and all associated resources (PVCs, Cloud NAT, Cloud Router)
uv run gke-cluster.py delete

# Delete specific cluster
uv run gke-cluster.py delete --name my-cluster

List Clusters

uv run gke-cluster.py list

Cloud Service Endpoints

Services are deployed via Helm charts in the gke-cluster/helm/ directory:

NATS cluster with JetStream
Kube-Prometheus stack for monitoring
CockroachDB for persistent storage
Individual service deployments with auto-scaling

Cloud Monitoring and Debugging

Interactive NATS debugging:

# Connect to NATS box in the cluster
kubectl exec -it deployment/nats-box -n nats -- sh

# Inside the container, you can use NATS CLI:
nats stream list
nats stream info bluesky-posts
nats stream info bluesky-posts-enriched
nats sub "bluesky.posts.>"
nats sub "bluesky.enriched.>"

Cloud Data Flow

Mock Ingest → publishes test messages to bluesky-posts stream with subject bluesky.posts.*
Firehose Ingest → consumes from external firehose and publishes to bluesky-posts stream with subject bluesky.posts.*
Stream Processors (0 & 1) → consume from bluesky-posts stream (load-balanced), add sentiment/topic analysis, publish to bluesky-posts-enriched stream with subject bluesky.enriched.*
Sentiment Web UI → subscribes to enriched stream and displays real-time updates

GKE Cluster Details

Node Pools

Default Pool

Machine Type: t2d-standard-2 (2 vCPUs, 8GB RAM)
Purpose: General workloads
Disk: 20GB standard persistent disk
Autoscaling: Manual (use scale command)

ML Pool

Machine Type: n2d-highcpu-4 (4 vCPUs, 4GB RAM)
Purpose: ML inference workloads (optimized for ONNX INT8 models)
Disk: 50GB standard persistent disk
Taint: dedicated=ml:NoSchedule
Autoscaling: Enabled (0-6 nodes)
GCFS: Enabled for image streaming

To deploy to ML pool, add toleration and node selector:

tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "ml"
  effect: "NoSchedule"
nodeSelector:
  cloud.google.com/gke-nodepool: ml-pool

Private Nodes & Cloud NAT

The cluster uses private nodes to save on external IP quota:

Private Nodes: Nodes only have private IPs (no external IPs)
Cloud NAT: Provides outbound internet access for:
- Pulling container images
- Accessing external APIs
- Downloading packages
Security: Nodes are unreachable from the internet
Cost: Adds ~$1-5/month for NAT

Benefits:

✅ No external IP quota needed (only LoadBalancer needs 1 external IP)
✅ Reduced attack surface
✅ All outbound traffic through NAT gateway
✅ Can scale to 9 nodes without external IP quota issues

Cost Optimization Features

Spot Instances

Savings: 60-91% off regular instance prices
Trade-off: Can be terminated with 30-second notice
Best for: Development, testing, fault-tolerant workloads

Machine Type Selection

Default Pool (t2d-standard-2): 2 vCPUs, 8GB RAM
ML Pool (n2d-highcpu-4): 4 vCPUs, 4GB RAM (compute-optimized)
Total capacity: 18 vCPUs, 48GB RAM at full scale (3+6 nodes)

Storage Optimization

Standard Persistent Disk: Cheapest disk option
Sizes: 20GB default, 50GB for ML nodes
Performance: Suitable for most development workloads

Automatic Cleanup

PVC Disks: Automatically deleted with cluster
Cloud NAT: Automatically removed with cluster
No Orphaned Resources: All resources cleaned up on delete

Cost Estimation

Monthly costs (approximate, spot instances):

Default Pool (3 x t2d-standard-2): ~$20-33
ML Pool (3 x n2d-highcpu-4): ~$15-25
Persistent Disk (100GB standard): ~$4
Cloud NAT: ~$1-5
Total: ~$40-67/month

Cost when scaled to 0 nodes:

Control plane: Free (single zonal cluster)
Cloud NAT: ~$1/month (minimal with no traffic)
Total: ~$1/month

Prices may vary by region and are subject to change. Spot instance pricing is typically 60-91% less than regular instances.

MESSAGE FORMATS

Input Message (to sentiment processor):

{
  "uri": "at://test/12345",
  "text": "This is a great day!",
  "author": "test-user-67890",
  "timestamp": "2025-09-30T10:30:00Z"
}

Output Message (from sentiment processor):

{
  "uri": "at://test/12345",
  "text": "This is a great day!",
  "author": "test-user-67890",
  "timestamp": "2025-09-30T10:30:00Z",
  "sentiment": {
    "label": "POSITIVE",
    "score": 0.9234
  },
  "processed_at": "2025-09-30T10:30:01Z"
}

TROUBLESHOOTING

Local Development Issues

Services not connecting to NATS: Check that NATS is healthy before dependent services start
Port conflicts: The unified compose uses different ports (8081, 8082, 8083) to avoid conflicts
Model download: Stream processors download AI models on first run, which may take time
Memory usage: Sentiment analysis requires adequate memory allocation (3GB per processor)

GKE Deployment Issues

Authentication Issues:

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

API Not Enabled:

gcloud services enable container.googleapis.com compute.googleapis.com

Cluster not accessible: Verify private cluster configuration and Cloud NAT setup

PROJECT STRUCTURE

.
├── README.md                          # This file
├── architecture.excalidraw.png        # Architecture diagram
├── docker-compose.yml                 # Local development orchestration
├── gke-cluster/                       # GKE cluster management
│   ├── gke-cluster.py                # Cluster creation/management script
│   ├── README.md                      # GKE-specific documentation
│   ├── pyproject.toml                # Python dependencies
│   └── helm/                          # Kubernetes Helm charts
├── nats-firehose-ingest/             # Firehose data ingestion service
├── nats-stream-processor/            # Sentiment analysis processor
├── nats-stream-processor-topics/     # Topic classification processor
└── bsky-sentiment-web/               # Real-time visualization dashboard

SECURITY BEST PRACTICES

The GKE cluster includes several security optimizations:

Private nodes (no direct internet access to nodes)
Cloud NAT for controlled outbound access
Workload Identity for secure service access
Container-Optimized OS (COS_CONTAINERD)
Uses GKE defaults for RBAC and security policies
Latest Kubernetes version
Cost management enabled for usage tracking

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
bsky-sentiment-web @ da235e7		bsky-sentiment-web @ da235e7
gke-cluster @ 7115572		gke-cluster @ 7115572
nats-firehose-ingest @ 24ae055		nats-firehose-ingest @ 24ae055
nats-stream-processor @ d21670c		nats-stream-processor @ d21670c
.gitmodules		.gitmodules
README.md		README.md
architecture.excalidraw.png		architecture.excalidraw.png
create-all.sh		create-all.sh
docker-compose.yml		docker-compose.yml

Folders and files

Latest commit

History

Repository files navigation

NATS-Based Data Processing Pipeline

Architecture

Cloud Deployment (GKE)

Services

Submodule Repositories

LOCAL DEVELOPMENT

Quick Start (Docker Compose)

Run the complete pipeline:

Run with mock data generation:

Run with debugging tools:

Run everything (full setup):

Local Service Endpoints

Local Monitoring and Debugging

Check service health:

View metrics:

NATS monitoring:

Interactive NATS debugging:

Local Data Flow

Local Development

Individual service development:

Logs:

Cleanup:

CLOUD DEPLOYMENT (GKE)

Prerequisites

Create a GKE Cluster

Quick Setup (Automated)

Manual Setup

Connect to Your Cluster

Deploy Services with Helm

1. Install K8s Helm Apps

2. Create Secrets and Deploy Application Services

Helm Commands Reference

Create Secrets Script Options

Scale the Cluster

Delete the Cluster

List Clusters

Cloud Service Endpoints

Cloud Monitoring and Debugging

Interactive NATS debugging:

Cloud Data Flow

GKE Cluster Details

Node Pools

Default Pool

ML Pool

Private Nodes & Cloud NAT

Cost Optimization Features

Spot Instances

Machine Type Selection

Storage Optimization

Automatic Cleanup

Cost Estimation

Monthly costs (approximate, spot instances):

Cost when scaled to 0 nodes:

MESSAGE FORMATS

Input Message (to sentiment processor):

Output Message (from sentiment processor):

TROUBLESHOOTING

Local Development Issues

GKE Deployment Issues

PROJECT STRUCTURE

SECURITY BEST PRACTICES

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages