feat: add Netbird self-hosted VPN CDK app (control plane + routing peer) for shared [COM-143]#11

Open
vanguille wants to merge 5 commits into main from feature/COM-143-netbird-infra

@vanguille

Member

What

Adds the self-hosted Netbird (WireGuard) VPN as Infrastructure as Code for the autoguru-shared account (791686214595, ap-southeast-2). This is the planned replacement for the Pritunl VPN; the two run in parallel until the routing-peer egress gate (COM-145) passes, so nothing here touches Pritunl.

Per ADR-002 and the Hybrid-ZTNA-Netbird Business Case.

Structure

AWS CDK in C# (matching the rest of our CDK infra), under netbird/. Two independent stacks (no "God" stack):

| Stack | Instance | Purpose |
| --- | --- | --- |
| NetbirdControlPlaneStack | t3.small / 30 GB gp3 | Management, signal, relay, dashboard, Coturn |
| NetbirdRoutingPeerStack | t3.micro / 30 GB gp3 | Routing agent + WireGuard data plane + Cloudflare egress EIP |

Each: dedicated VPC (public subnet + EIP, no NAT), Amazon Linux 2023 + Docker, IMDSv2 required, encrypted EBS, CloudWatch auto-recovery, an SSM-only IAM role (no inbound SSH) scoped to exactly the secret it needs. EC2 user-data lives in netbird/scripts/*.sh and is embedded into the assembly.

DNS (netbird.autoguru.com.au) and the routing-peer origin allowlist are managed in Cloudflare, so the stacks only output the Elastic IPs.

CI

.github/workflows/netbird-deploy.yml is path-scoped to netbird/**: PRs run cdk diff (read-only), deploys are a manual workflow_dispatch (never implicit on merge). Requires repo secret AWS_DEPLOY_ROLE_ARN (a shared-account role assumable via OIDC).

Validation

  • dotnet build -c Release: 0 warnings / 0 errors.
  • cdk synth of both stacks against the real shared account: succeeds.

Notes / follow-ups

  • Draft pending: configure the AWS_DEPLOY_ROLE_ARN repo secret.
  • netbird/scripts/routing-peer-user-data.sh leaves binaries on :latest for the POC, with a TODO to pin them before production cutover.
  • shared is already CDK-bootstrapped (SharedPlatformStack deploys there via CDK), so no cdk bootstrap is needed.

🤖 Generated with Claude Code

…er) for shared [COM-143]

Self-hosted Netbird (WireGuard) for the autoguru-shared account, the planned Pritunl replacement. Two independent CDK stacks (C#) in netbird/: control plane and routing peer, each with a dedicated VPC, EIP, SSM-only role (no SSH), IMDSv2, encrypted EBS and CloudWatch auto-recovery. EC2 user-data is embedded from netbird/scripts. DNS lives in Cloudflare so the stacks only output the EIPs. Path-scoped GitHub workflow: cdk diff on PRs, manual deploy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread netbird/scripts/control-plane-user-data.sh Fixed
Comment thread netbird/scripts/control-plane-user-data.sh Fixed
vanguille and others added 2 commits June 24, 2026 09:31
The Entra application (client) ID, tenant ID and audience in the Netbird control-plane user-data are public OIDC identifiers, not secrets (the client secret is pulled from Secrets Manager at runtime and is never committed). gitleaks' generic-api-key rule flags them on entropy. Add a value-scoped allowlist (not path-scoped, so a real secret in the same file is still caught) and wire the self-scan to use the config.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
configure-aws-credentials, cdk diff and cdk deploy now run only when the AWS_DEPLOY_ROLE_ARN repo secret is set. Until an admin wires it, the workflow runs the dotnet build/compile check and emits a warning instead of failing on missing credentials, so the draft PR check is not misleadingly red.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@vanguille

Member Author

@claude review this

@vanguille vanguille marked this pull request as ready for review June 25, 2026 12:52
@vanguille vanguille requested a review from Copilot June 25, 2026 19:58

Copilot AI left a comment

Pull request overview

Adds a new AWS CDK (C#) application to provision a self-hosted Netbird (WireGuard) VPN in the autoguru-shared AWS account, covering both the control plane and a dedicated routing peer with stable EIPs, plus a path-scoped GitHub Actions workflow for diff/deploy.

Changes:

  • Introduces two CDK stacks (NetbirdControlPlaneStack, NetbirdRoutingPeerStack) provisioning VPC + EC2 + EIP + auto-recovery + least-privilege secrets access.
  • Adds embedded EC2 user-data scripts for control-plane bootstrap prerequisites and routing-peer Docker Compose bring-up.
  • Adds a path-scoped CI workflow for cdk diff on PRs and manual cdk deploy, plus gitleaks allowlisting for non-secret Entra identifiers.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| netbird/scripts/routing-peer-user-data.sh | Bootstraps Docker/Compose, pulls setup key from Secrets Manager, starts routing peer via Compose. |
| netbird/scripts/control-plane-user-data.sh | Bootstraps Docker/Compose and writes Entra OIDC env for manual Netbird control-plane setup via SSM. |
| netbird/README.md | Documents architecture, prerequisites, deploy flow, and post-deploy manual steps. |
| netbird/cdk/Program.cs | Defines CDK app entrypoint and pins deployment env (shared account + region). |
| netbird/cdk/NetbirdRoutingPeerStack.cs | Provisions routing-peer EC2 + VPC + EIP + alarm recovery + setup-key secret + IAM role. |
| netbird/cdk/NetbirdControlPlaneStack.cs | Provisions control-plane EC2 + VPC + EIP + alarm recovery + IAM role reading Entra secret. |
| netbird/cdk/Netbird.Cdk.csproj | Adds CDK project configuration and embeds user-data scripts as resources. |
| netbird/cdk/EmbeddedScript.cs | Helper for reading embedded user-data scripts at synth time. |
| netbird/cdk/cdk.json | Configures CDK app execution and watch settings. |
| netbird/.gitignore | Ignores CDK/.NET outputs and local env files under netbird/. |
| netbird/.gitattributes | Enforces LF endings for the netbird/ subtree (notably for user-data scripts). |
| .gitleaks.toml | Adds narrow allowlist for known non-secret Entra GUIDs used in setup. |
| .github/workflows/netbird-deploy.yml | Adds path-scoped workflow for build + optional AWS-auth + cdk diff/deploy. |
| .github/workflows/gitleaks-self-scan.yml | Wires self-scan workflow to use the repo’s .gitleaks.toml config. |

Comment on lines +8 to +12
# Install Docker Compose plugin
mkdir -p /usr/local/lib/docker/cli-plugins
curl -fsSL https://github.com/docker/compose/releases/latest/download/docker-compose-linux-x86_64 \
  -o /usr/local/lib/docker/cli-plugins/docker-compose
chmod +x /usr/local/lib/docker/cli-plugins/docker-compose
Comment on lines +8 to +12
# Install Docker Compose plugin
mkdir -p /usr/local/lib/docker/cli-plugins
curl -fsSL https://github.com/docker/compose/releases/latest/download/docker-compose-linux-x86_64 \
  -o /usr/local/lib/docker/cli-plugins/docker-compose
chmod +x /usr/local/lib/docker/cli-plugins/docker-compose
Comment on lines +33 to +36
services:
  netbird:
    image: netbirdio/netbird:latest
    container_name: netbird-routing-peer
Comment on lines +61 to +63
sg.AddIngressRule(Peer.AnyIpv4(), Port.Tcp(443), "HTTPS -- management API and dashboard");
sg.AddIngressRule(Peer.AnyIpv4(), Port.Tcp(80), "Lets Encrypt ACME HTTP challenge");
sg.AddIngressRule(Peer.AnyIpv4(), Port.Tcp(33073), "Management gRPC -- peer client connections");
Comment thread netbird/cdk/cdk.json Outdated
…escription note)

- cdk.json: run the CDK app with -c Release so synth matches the CI build instead of an extra Debug compile.
- control-plane-user-data.sh: add the same pre-cutover pin/checksum TODO the routing peer already carries for the Docker Compose plugin.
- NetbirdControlPlaneStack.cs: document that the SG rule description omits the apostrophe in "Lets Encrypt" deliberately, because AWS rejects apostrophes in security-group rule descriptions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@anthony-keller

Member

VPC reachability check: can a Netbird VPN user reach internal RDS (e.g. dev SQL Server)?

Following up on the new-VPC review — I verified this against the live AWS environment (ap-southeast-2), because it determines whether Netbird can actually replace Pritunl for internal access.

Short answer: no. As designed, a VPN user routed through the Netbird VPC cannot reach the dev account's RDS SQL Server. There are three independent barriers, each sufficient on its own.

1. No network route

Both Netbird VPC route tables contain only:

10.0.0.0/16 → local
0.0.0.0/0   → igw (internet)

No route to the dev VPC 10.80.0.0/16. The dev RDS (anz-development-accountplatforms-sqlserver…) is private (PubliclyAccessible: false), so it's reachable only over private networking — which the routing peer doesn't have. Its only non-local route is the internet gateway.
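To make the routing argument concrete, here is a minimal Python sketch of longest-prefix-match route selection applied to the two routes quoted above. The function name `resolve_target` and the sample addresses are hypothetical and chosen for illustration only; real VPC routing has more rule types than this models.

```python
import ipaddress

# Route table as quoted in the comment above: destination CIDR -> target.
NETBIRD_ROUTES = {
    "10.0.0.0/16": "local",
    "0.0.0.0/0": "igw",
}


def resolve_target(routes, destination_ip):
    """Return the target of the most specific route containing destination_ip."""
    ip = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes.items()
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # The longest prefix (most specific route) wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# 10.80.64.10 is a hypothetical address inside the dev VPC (10.80.0.0/16).
print(resolve_target(NETBIRD_ROUTES, "10.80.64.10"))  # igw
# 10.0.1.5 is a hypothetical address inside the Netbird VPC itself.
print(resolve_target(NETBIRD_ROUTES, "10.0.1.5"))     # local
```

Under this model, a packet for the dev VPC only matches the default route, so it heads to the internet gateway, where a private address is unreachable.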

2. No VPC peering / transit gateway

The Netbird VPCs are in zero peering connections, and there's no transit gateway in the estate. Pritunl works only because of peering the new VPC doesn't inherit:

  • pcx-065b7a22feabfbff2 (ACTIVE): shared 10.70.0.0/16 ↔ dev 10.80.0.0/16
  • The dev route tables carry 10.70.0.0/16 → pcx-065b… on all subnets.

Pritunl lives inside the shared 10.70/16 VPC, so it rides that peering into dev. The Netbird VPC sits on its own island.

3. The RDS security group would block it anyway

The dev RDS SG (sg-058de93df38eefa91) allows port 1433 only from:

10.70.0.0/20, 10.70.16.0/20, 10.70.32.0/20   "Allow MSSQL from VPN subnet"   ← shared/Pritunl
10.80.64.0/20, 10.80.80.0/20, 10.80.96.0/20  ← dev's own private subnets

The Netbird VPC CIDR (10.0.0.0/16) is not on that list. Even with routing + peering in place, the SG drops the connection.
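The allowlist check can be sketched in a few lines of Python. The CIDRs are the six sources listed above; the helper `is_admitted` and the two sample host addresses are hypothetical, and the sketch models only CIDR-based ingress rules, not security-group references.

```python
import ipaddress

# Ingress sources for tcp/1433 as listed above for the dev RDS security group.
RDS_ALLOWED_SOURCES = [
    "10.70.0.0/20", "10.70.16.0/20", "10.70.32.0/20",   # shared / Pritunl
    "10.80.64.0/20", "10.80.80.0/20", "10.80.96.0/20",  # dev private subnets
]


def is_admitted(source_ip, allowed_cidrs):
    """True if source_ip falls inside any allowlisted CIDR."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical private IP of a routing peer in the Netbird VPC (10.0.0.0/16).
print(is_admitted("10.0.0.25", RDS_ALLOWED_SOURCES))    # False
# Hypothetical private IP of a host in a shared public subnet.
print(is_admitted("10.70.16.40", RDS_ALLOWED_SOURCES))  # True
```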

Design-intent note: the routing peer is built to route *.autoguru.com.au egress out its static EIP to Cloudflare over the internet. RDS is a private …rds.amazonaws.com endpoint — not public, not behind Cloudflare — so it's outside the routing peer's intended scope entirely.

Side finding: deployment account drift

The two Netbird 10.0.0.0/16 VPCs (NetbirdControlPlaneStack/Vpc and NetbirdRoutingPeerStack/Vpc) are already deployed in the dev account (635532940647), even though Program.cs pins Account = "791686214595" (shared). So either someone test-deployed to dev, or there's drift between where the PR says it deploys and where it was applied. There are live, unmanaged 10.0.0.0/16 VPCs in dev right now that this PR doesn't account for.

What it would take to reach internal RDS over Netbird

If private internal access (e.g. SSMS → dev SQL over the VPN — what Pritunl serves today) is a goal, the isolated-VPC design needs all three of:

  1. A deliberate, unique CIDR (not 10.0.0.0/16, and not identical across both stacks) so the VPC can be peered without collision.
  2. A private path to each workload VPC — pairwise peering (Netbird ↔ dev/test/prod) plus return routes both sides. With no TGW that's an N-way peering mesh per environment — exactly the sprawl a transit gateway avoids.
  3. SG allowlisting — add the routing peer's source CIDR to each workload RDS SG (the peer SNAT/masquerades, so RDS sees its VPC private IP), extending the AccountPlatformStack SharedPublicSubnets pattern.
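As a rough illustration of requirement 1, the sketch below checks a set of VPC CIDRs for pairwise overlap, since VPCs with overlapping CIDRs cannot be peered. The existing CIDRs come from this thread; the dictionary keys, the helper `overlapping_pairs`, and the replacement ranges 10.90.0.0/16 and 10.91.0.0/16 are hypothetical placeholders, not a proposed allocation.

```python
import ipaddress
from itertools import combinations

# CIDRs as reported in this thread (both Netbird stacks use the same range).
VPC_CIDRS = {
    "netbird-control-plane": "10.0.0.0/16",
    "netbird-routing-peer": "10.0.0.0/16",
    "shared": "10.70.0.0/16",
    "dev": "10.80.0.0/16",
}


def overlapping_pairs(cidrs):
    """Return every pair of VPC names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


print(overlapping_pairs(VPC_CIDRS))
# [('netbird-control-plane', 'netbird-routing-peer')]

# Hypothetical replacement ranges, for illustration only.
PROPOSED = dict(VPC_CIDRS, **{
    "netbird-control-plane": "10.90.0.0/16",
    "netbird-routing-peer": "10.91.0.0/16",
})
print(overlapping_pairs(PROPOSED))  # []
```

A check like this could run before synth to catch a colliding CIDR early; it says nothing about ranges used in accounts not listed here.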

Cleaner alternative — the pattern the platform already proves: run the routing peer inside the shared 10.70/16 VPC the way Pritunl does (SharedPlatformStack imports it via Vpc.FromLookup). It then inherits the existing shared↔workload peering mesh and the existing RDS SG allowlists with zero new peering or SG changes.

Key question to settle before merge: if Netbird is only ever meant to front Cloudflare-protected public apps, the isolated VPC is fine. The moment it's expected to replace Pritunl for private/internal access (RDS, Redis, internal hosts), this VPC topology can't do it without the peering + CIDR + SG work above.

(Verified via AWS API in the dev account: route tables rtb-0446a67f…/rtb-076ba68f…, peering pcx-065b7a22feabfbff2, RDS sg-058de93df38eefa91. Happy to confirm the same gap holds for test/prod RDS if useful.)

@anthony-keller

Member

Follow-up: VPC placement (dedicated vs shared) + security-group minimality

Expanding on the reachability comment above, with AWS guidance, a live security-group review, and a recommendation. TL;DR: the dedicated multi-VPC design is not wrong — it's the more-isolated model and it's the right call under circumstances spelled out below. For this deployment, though, placing Netbird in the existing shared-services VPC is the cleaner fit and still meets best practice. Either way the security groups themselves are tight.

The dedicated multi-VPC design is legitimate, and has real security benefits

To be clear up front: isolating the VPN in its own VPC (and splitting control plane vs routing peer into two) is a sound, defensible design. It is the stronger-isolation model and is the right choice when:

  • The VPN must reach a regulated or high-sensitivity tier (PCI CDE, a PII data store). A hard VPC boundary + Transit Gateway with inspection gives two independent controls — route table and security group — between the internet-facing box and the data. Same-VPC collapses that to SG-only, because the VPC local route already connects every subnet.
  • You operate a central network/perimeter account with centralized ingress/egress inspection (AWS Network Firewall, TGW route tables). A dedicated perimeter VPC is the AWS-idiomatic home for edge services at that scale (perimeter in a Network account, centralized edge connectivity via TGW).
  • The host has elevated compromise likelihood and you want hard blast-radius containment. An internet-facing forwarding pivot (ip_forward, NET_ADMIN, host networking, unpinned images) is exactly the workload SEC05-BP01 says to put in its own layer.

The control-plane / routing-peer split into separate VPCs is itself a genuine merit: the control plane never needs a route to any internal network, only the routing peer bridges inward — so isolating them lets each get the minimum. Keep that split regardless of the placement decision.

Why the shared-services VPC is nonetheless the better fit here

SEC05-BP01 "Create network layers" flags as a High-risk anti-pattern "all resources in a single VPC or subnet" — but the key qualifier is without layering by sensitivity. The shared VPC (vpc-064a7525a3bcc4667, 10.70.0.0/16) already layers: public subnets for the edge tier (Pritunl, the public ALB), private-isolated subnets for the data tier (the shared RDS). SEC05 explicitly endorses segmentation via "multiple private subnets, different VPCs in the same account, or different accounts" — subnet-layering within one VPC is a valid network layer, not the anti-pattern.

Given that, the shared VPC is the better pragmatic home because:

  1. It's the purpose-built shared-services / perimeter tier — its stated job is exactly VPN + CI/CD, and it already runs a VPN (Pritunl). Netbird in the public-subnet tier is a like-for-like placement next to the VPN it replaces.
  2. It inherits the vetted connectivity fabric. In the shared public subnets (10.70.0.0/20, 10.70.16.0/20, 10.70.32.0/20), Netbird is already within the source CIDRs that the workload RDS SGs allowlist, and the existing shared↔workload peering already routes them. No parallel peering mesh and no new SG sprawl; fewer hand-built holes means fewer chances to misconfigure, which is itself a security argument.
  3. It enables SG-reference admission. The shared RDS already admits the Pritunl VPN by security-group reference (tighter than any CIDR rule). Same-VPC Netbird could be admitted the same way; a dedicated VPC cannot use SG-references across the peering.
  4. The auth layer is the real control. With Entra SSO + 2FA gating the management surface (see below), the marginal value of route-layer isolation from the shared RDS is small relative to the connectivity complexity a dedicated VPC imposes.

Security-group minimality (live-verified in the deployed account)

Routing peer sg-01e6df197a8046205 — minimal. Ingress is a single rule, udp/51820 (WireGuard) from 0.0.0.0/0 — which must be world-open for peers. Egress all-allow is justified (its function is forwarding/masquerading traffic outbound).

Control plane sg-085b59ad5157feaca — appropriately minimal; every open port maps to a required service: tcp/443 (mgmt API + dashboard), tcp/80 (ACME), tcp/33073 (mgmt gRPC), tcp/10000 (signal), tcp/33080 (relay), tcp+udp/3478 (TURN/STUN), udp/49152–65535 (TURN relay media). All 0.0.0.0/0, which is correct, not a finding — a public VPN control plane accepts connections from arbitrary networks, so these can't be scoped to a corporate CIDR.

The 2FA point is decisive here. The only sensitive listener is the 443 management dashboard/API, and it's gated by Entra SSO + MFA. So the security boundary is the authentication layer, not the network ACL — opening the listener to the world is the standard, accepted design because auth is doing the gatekeeping. This is the same posture as the existing Pritunl VPN.

Optional tightenings (minor, non-blocking): set Coturn min-port/max-port to shrink the 49152–65535 range and match the SG; and use the DNS-01 ACME challenge (DNS is in Cloudflare) to drop inbound tcp/80 entirely.

RDS sg-058de93df38eefa91 — ingress is least-privilege. Only tcp/1433, only from six specific /20s (the shared public/VPN subnets + the dev private subnets); no 0.0.0.0/0, no extra ports. Minor over-grant: the VPN rules trust the whole shared public /20s rather than the VPN host alone — a cross-account limitation that SG-reference admission (point 3 above) would resolve. Egress is the default all-allow, which is standard for RDS (low priority).

Recommendation

  1. Prefer placing Netbird in the shared-services VPC (Vpc.FromLookup, public-subnet tier, both roles), admitted to internal SQL by SG-reference. This meets best practice via subnet layering, reuses the vetted peering + allowlist fabric, and resolves the reachability gap from the earlier comment with no new mesh.
  2. Keep the dedicated multi-VPC design on the table as the future-state model if/when the routing peer must reach a regulated/prod-sensitive tier, or we adopt a central network account with TGW + inspection. It is not wrong — here it mainly costs the reachability/complexity already noted, for isolation that subnet-layering largely already provides.
  3. Regardless of placement: keep the routing peer's reach into the workload accounts (especially prod) least-privilege — specific routes, specific ports, Netbird identity ACLs. And if the dedicated VPCs are retained, still fix the earlier hygiene items (deliberate non-overlapping CIDR, VPC Flow Logs, NACLs).

The security groups are tight enough that none of this is a loose-rule finding; the decision is purely about where the VPN's trust boundary should sit, and both answers are defensible.

(SGs verified live via AWS API: sg-085b59ad5157feaca, sg-01e6df197a8046205, sg-058de93df38eefa91. Pritunl/shared-RDS SGs read from the SharedPlatformStack CDK source.)

…Drata + Pritunl parity)

Addresses the PR review (Anthony) and the Drata controls the dev POC tripped:
- Place both stacks in the shared-services VPC public-subnet tier via Vpc.FromLookup (vpc-064a7525a3bcc4667) instead of a dedicated VPC, matching the Pritunl VPN. This reuses the vetted peering + RDS allowlist fabric (so developers can reach SQL Server RDS), inherits the VPC's flow logs, and keeps the control-plane/routing-peer split.
- AssociatePublicIpAddress on both instances: the shared public subnets do not auto-assign, and user-data needs egress before the EIP associates.
- CPU utilization alarm -> shared Slack topic on both instances (Drata 'Infrastructure Instance CPU Monitored').
- backup=true tag on the stateful control plane (shared AWS Backup plan), matching the platform VPN/RDS convention.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>