feat: add Azure provider support with full infra, provisioning, and Bastion workflows #161

Merged: l50 merged 17 commits into main from feat/azure-provider on May 1, 2026
l50 commented Apr 30, 2026

Key Changes:

  • Implemented Azure provider for infrastructure, provisioning, and validation
  • Added new Terraform modules for Azure networking, VM factory, Bastion, and controller
  • Introduced runcmd and bastion CLI verbs for Azure-native host access
  • Extended CLI, inventory parsing, and doctor checks for Azure support

Added:

  • Azure CLI provider (internal/azure) implementing all provider interfaces, including VM discovery, lifecycle, and fast WinRM-based command execution
  • Azure-specific CLI commands:
    • cli/cmd/runcmd.go: Azure Run Command (stateless, SSM-like shell and command runner)
    • cli/cmd/bastion.go: Native Bastion SSH/RDP/tunnel workflows
  • cli/internal/azure package:
    • Native VM, Run Command, and Bastion management via Azure SDK and CLI
    • WinRM runner with SOCKS5 tunnel through Bastion → controller for fast parallel provisioning and validation
    • Comprehensive unit/integration tests for Azure flows
  • New Terraform modules:
    • terraform-azure-net: VNet, subnets, NAT gateway, NSG
    • terraform-azure-instance-factory: Windows VM with bootstrap, identity
    • terraform-azure-bastion: Optional Bastion host with SKU/tunnel support
    • terraform-azure-controller: In-VNet Ansible controller with ephemeral key handling
  • Azure Terragrunt configuration under infra/azure/goad-deployment/ for full lab deployment
  • Cloud-init and bootstrap templates for controller and lab hosts
  • Documentation for Azure usage, CLI workflows, and tips

Changed:

  • CLI provider selection and infra commands:
    • Added azure as a supported provider and updated flags, help, and region logic
    • infra apply|plan|destroy now support Azure region and optional Bastion/controller flags
    • Unified Terragrunt runner for AWS/Azure, including opt-in module flags
  • Provisioning logic:
    • Added Azure SOCKS5 tunnel setup for in-VNet WinRM/PSRP via Bastion and controller
    • Genericized the SOCKS tunnel helper to work for both Ludus and Azure
  • Inventory parser:
    • Now supports ansible_password for WinRM auth in Azure
    • Improved parsing of host/group entries for compatibility with Azure-generated inventories
  • Doctor checks:
    • Added Azure CLI, Bastion extension, and SSH extension checks
    • Provider-specific help text and environment validation
  • Validator:
    • Increased PowerShell timeout for Azure's longer Run Command latency
    • Output is now streamed as checks complete, not in submission order, for better UX with slow providers
  • SSM/SessionManager interface:
    • Abstracted interactive shell interface so both AWS SSM and Azure Run Command can implement native shells
  • Pre-commit hooks and documentation:
    • Added terraform-docs hook for Azure modules
    • Updated module documentation and top-level Azure provider docs

Removed:

  • Parallel DSC module installation in Ansible Windows role (now installs sequentially to avoid race conditions with Azure/WinRM)
  • Redundant/legacy AWS-specific region and provider logic where Azure now applies

l50 added 8 commits April 29, 2026 15:52
**Added:**

- Documentation for Azure authentication validation proof-of-concept, including
  prerequisites, usage, and validation steps - `infra/azure/README.md`
- Terragrunt configuration to deploy a single Windows Server 2022 VM in Azure for
  authentication testing, with local state and tagging -
  `infra/azure/eastus/auth-validation/terragrunt.hcl`
- Reusable Terraform module to provision a Windows VM in Azure with:
  - Self-contained networking: VNet, subnet, NSG, NIC
  - Resource group and all resources named from a common prefix
  - Configurable admin username, VM size, address space, subnet, image, and tags
  - Random password generation for the admin user
  - Sensitive outputs for admin credentials and resource metadata
  - Provider, version, and variable definitions
  - Comprehensive README and autogenerated documentation
  - Source files: `main.tf`, `network.tf`, `outputs.tf`, `variables.tf`,
    `versions.tf`, and `README.md` in `modules/terraform-azure-vm/`
…labs

**Added:**

- Azure provider implementation for dreadgoad CLI, including:
  - Azure provider registration and interface implementation in Go
  - Azure VM lifecycle operations (discovery, start/stop/destroy, run command)
  - CLI integration for `apply`, `destroy`, `output`, and `validate` actions
- Modularized Azure infrastructure for GOAD labs:
  - `terraform-azure-instance-factory` module for per-VM provisioning with bootstrap
  - `terraform-azure-net` module for VNet, subnets, NAT gateway, and NSG
  - Example Terragrunt structure for Azure lab deployments, including sample
    environment and region config files, and PowerShell bootstrap templates
- Pre-commit hook for `terraform_docs` to enforce module documentation

**Changed:**

- CLI provider flag and help text to include "azure" as a supported provider
- Provider factory and config logic to support Azure region resolution
- Terragrunt host registry path in GOAD deployment to use absolute path for Azure compatibility

**Removed:**

- Legacy `terraform-azure-vm` module in favor of the new composable instance
  and network modules
- Old Azure proof-of-concept Terragrunt configuration for auth validation
…I support

**Added:**

- Azure Bastion support:
  - New `terraform-azure-bastion` Terraform module for optional Bastion host deployment
  - CLI command group `bastion` with subcommands for status, SSH, RDP, and port tunneling
  - Bastion discovery and connection logic in Go (`cli/internal/azure/bastion.go`)
  - Terragrunt configuration for Bastion module with opt-in gating
  - Prerequisite and usage documentation for Bastion workflows
- In-VNet Ansible controller support:
  - New `terraform-azure-controller` Terraform module for an SSH-accessible Ansible controller VM in a private subnet
  - Terragrunt configuration for controller, gated via environment variable or flag
  - Cloud-init template to bootstrap Ansible and dependencies on controller VM
- CLI command group `runcmd` for Azure Run Command:
  - Subcommands to run PowerShell commands across instances or open a REPL-like shell per host
  - Interactive shell simulation over Run Command with $PWD persistence, output capping, and cancellation support
  - Hostname/resource ID resolution for Azure VMs via inventory and live discovery
- Provider interface `InteractiveShell` for abstracting interactive shell support
- Compile-time interface checks for new provider capabilities
- Azure-specific checks in `doctor` for CLI, Bastion extension, and SSH extension

**Changed:**

- Azure provider implementation:
  - Added `StartInteractiveShell` to enable interactive shell sessions via Run Command
  - VM discovery now exposes Azure resource tags for downstream use (e.g., controller key auto-selection)
- AWS provider implementation:
  - Renamed `StartInteractiveSession` to `StartInteractiveShell` for interface consistency
  - Registered as `InteractiveShell` provider
- `infra_cmd.go`:
  - Added `--with-bastion` and `--with-controller` flags to Terragrunt commands for Azure
  - Azure infra actions set environment variables for module gating
  - Refactored Terragrunt module execution to support both AWS and Azure via shared logic
- `doctor`:
  - Runs Azure-specific prerequisite checks when provider is Azure, including CLI, login, Bastion, and SSH extension validation
- Documentation:
  - Expanded Azure provider docs with Bastion and controller workflows, runcmd usage, and REPL caveats
  - Updated module README to describe new Azure modules and usage patterns
- Various Terragrunt configurations for Azure:
  - Updated to include or reference new Bastion and controller modules
  - Standardized `include` blocks for root config consistency
  - Added local variables for Bastion/controller options in environment configs

**Removed:**

- N/A (no logical removals detected; only code refactoring and new features introduced)
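
The `InteractiveShell` capability and the compile-time interface checks listed above could look roughly like the following sketch. Only the names `InteractiveShell` and `StartInteractiveShell` come from this PR; the method signature, the `awsProvider` and `azureProvider` stand-in types, and the `supportsShell` helper are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// InteractiveShell is the optional provider capability named in this PR.
// The method signature is an assumption, not the project's actual one.
type InteractiveShell interface {
	StartInteractiveShell(ctx context.Context, instanceID string) error
}

// awsProvider and azureProvider are hypothetical stand-ins for the real providers.
type awsProvider struct{}
type azureProvider struct{}

func (awsProvider) StartInteractiveShell(ctx context.Context, instanceID string) error {
	fmt.Println("aws: would open a session to", instanceID)
	return nil
}

func (azureProvider) StartInteractiveShell(ctx context.Context, instanceID string) error {
	fmt.Println("azure: would start a Run Command shell on", instanceID)
	return nil
}

// Compile-time checks: the build fails if either type stops satisfying
// the interface, so a rename cannot silently drop the capability.
var (
	_ InteractiveShell = awsProvider{}
	_ InteractiveShell = azureProvider{}
)

// supportsShell reports whether a provider value offers the capability,
// which lets a CLI command refuse cleanly for providers that do not.
func supportsShell(p any) bool {
	_, ok := p.(InteractiveShell)
	return ok
}

func main() {
	for _, p := range []any{awsProvider{}, azureProvider{}, struct{}{}} {
		fmt.Println(supportsShell(p))
	}
}
```

Keeping the capability as a small separate interface means providers without a native shell do not have to carry a stub method.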
**Added:**

- Introduced `azure.ProvisionTunnel` which chains Azure Bastion port-forwarding
  and a SOCKS5 proxy to enable WinRM connectivity from the local machine to
  private Azure VMs via the controller - `cli/internal/azure/provision_tunnel.go`
- Added provider-aware SOCKS5 tunnel selection to provisioning logic, enabling
  support for Azure environments that require Bastion relays for connectivity
- Provided logic to discover the Ansible controller VM and automatically locate
  its ephemeral SSH key for tunneling in Azure

**Changed:**

- Refactored provisioning logic to use a generic `closableTunnel` interface,
  supporting both Ludus and Azure SOCKS5 tunnels
- Updated Azure instance HCL definitions to explicitly specify Windows Server
  2016 or 2019 Datacenter images instead of 2022, improving compatibility
- Improved documentation and comments for provisioning workflows, especially
  around tunnel setup and Ansible connection variables
- Updated Ludus SSH client configuration to support `InsecureIgnoreHostKey`
  and `IdentitiesOnly` flags, enabling reliable SSH through ephemeral Bastion
  tunnels and avoiding SSH agent key exhaustion
- Modified bootstrap script template for Azure VMs to remove redundant TLS 1.2
  SCHANNEL registry tweaks (no longer needed for 2019/2016), and simplified the
  WinRM and firewall setup steps
- Enhanced Azure VM bootstrap extension to use a script hash in the public
  settings, ensuring script re-execution when the template changes

**Removed:**

- Removed unnecessary TLS 1.2 registry fixups from Azure bootstrap script since
  Windows Server 2019/2016 images have appropriate defaults
- Eliminated redundant or outdated comments in provisioning and bootstrap logic
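
A generic tunnel interface with provider-aware selection, as described in this commit, might be sketched as follows. The name `closableTunnel` comes from the commit message; its method set, the `ludusTunnel` and `azureTunnel` types, the `openTunnel` and `provision` helpers, and the assumption that AWS needs no tunnel are all hypothetical.

```go
package main

import "fmt"

// closableTunnel is the interface name used in this PR. Limiting it to
// Close() is an assumption for this sketch.
type closableTunnel interface {
	Close() error
}

// ludusTunnel and azureTunnel are hypothetical stand-ins for the two
// SOCKS5 tunnel implementations.
type ludusTunnel struct{}

func (ludusTunnel) Close() error { return nil }

type azureTunnel struct{}

func (azureTunnel) Close() error { return nil }

// openTunnel picks a tunnel by provider name. A nil tunnel means the
// provider needs none (assumed here for "aws").
func openTunnel(provider string) (closableTunnel, error) {
	switch provider {
	case "ludus":
		return ludusTunnel{}, nil
	case "azure":
		return azureTunnel{}, nil
	case "aws":
		return nil, nil
	default:
		return nil, fmt.Errorf("unknown provider %q", provider)
	}
}

// provision opens the tunnel if one is needed, runs the (omitted)
// provisioning work, and always closes the tunnel afterwards.
func provision(provider string) (bool, error) {
	t, err := openTunnel(provider)
	if err != nil {
		return false, err
	}
	if t != nil {
		defer t.Close()
	}
	// Ansible would run here, pointed at the tunnel's SOCKS5 address.
	return t != nil, nil
}

func main() {
	for _, p := range []string{"ludus", "azure", "aws"} {
		used, err := provision(p)
		fmt.Println(p, used, err)
	}
}
```

The caller only depends on `Close()`, so adding another tunnelled provider would not change the provisioning path.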
…ll logic

**Added:**

- Implemented per-VM serialization for Azure Run Command to avoid 409 errors when
  concurrent commands target the same instance (cli/internal/azure/provider.go)
- Added instance-level mutex management for AzureProvider to ensure only one Run
  Command runs at a time per VM

**Changed:**

- Rewrote DSC module installation logic to install modules sequentially instead of
  checking and installing in parallel, unblocking files after install and improving
  reliability on Windows hosts (ansible/roles/common/tasks/main.yml)
- Updated documentation to describe the new sequential DSC module installation
  approach (ansible/roles/common/README.md)
- Enhanced runChecks in Validator to flush check output to stdout as soon as each
  check completes, giving operators real-time progress feedback (instead of
  in-order submission buffering) (cli/internal/validate/validator.go)
- Improved bootstrap script for Azure VMs to reliably rename the built-in
  Administrator (SID-500) account to 'administrator' before provisioning, ensuring
  compatibility with GOAD playbooks and idempotency across reboots
  (infra/azure/goad-deployment/test/centralus/goad/templates/bootstrap.ps1.tpl)

**Removed:**

- Removed parallel DSC module install steps and logic for async status polling in
  favor of the new sequential approach (ansible/roles/common/tasks/main.yml)
- Removed obsolete steps and documentation regarding checking and installing all
  required modules in parallel (ansible/roles/common/README.md)
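
The per-VM serialization added in this commit can be illustrated with a small keyed-mutex sketch. The `vmLocker` type and its methods are hypothetical names, not the code in `cli/internal/azure/provider.go`. The idea is one mutex per VM ID, so calls to different VMs still run in parallel while calls to the same VM queue up.

```go
package main

import (
	"fmt"
	"sync"
)

// vmLocker hands out one mutex per VM ID (hypothetical helper).
type vmLocker struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// lockFor returns the mutex for vmID, creating it on first use.
func (l *vmLocker) lockFor(vmID string) *sync.Mutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.locks == nil {
		l.locks = make(map[string]*sync.Mutex)
	}
	m, ok := l.locks[vmID]
	if !ok {
		m = &sync.Mutex{}
		l.locks[vmID] = m
	}
	return m
}

// runSerialized runs fn while holding the lock for vmID.
func (l *vmLocker) runSerialized(vmID string, fn func()) {
	m := l.lockFor(vmID)
	m.Lock()
	defer m.Unlock()
	fn()
}

// maxOverlap fires n concurrent calls at one VM and returns the highest
// number of calls observed inside the critical section at once.
func maxOverlap(n int) int {
	var (
		l      vmLocker
		wg     sync.WaitGroup
		mu     sync.Mutex
		active int
		peak   int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.runSerialized("dc01", func() {
				mu.Lock()
				active++
				if active > peak {
					peak = active
				}
				mu.Unlock()
				// The Run Command call would happen here.
				mu.Lock()
				active--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println(maxOverlap(20)) // prints 1: calls to the same VM never overlap
}
```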
…d improve resource cleanup

**Added:**

- Introduced Azure SDK-based implementation for VM discovery, lifecycle, and Run Command, replacing most uses of the `az` CLI
- Added internal WinRM runner for Azure VMs, enabling fast, parallel validator checks via WinRM/NTLM tunneled through Bastion and controller
- Implemented per-VM and global concurrency limits for Azure Managed Run Commands to respect ARM API and VM resource limits
- Added `Drainer` interface to provider abstraction and implemented it for Azure to ensure cleanup goroutines complete before exit
- Provided `SOCKSAddr()` in `ProvisionTunnel` for direct SOCKS5 access by Go WinRM client
- New tests for Azure SDK-based VM lifecycle and Run Command operations (`lifecycle_sdk_test.go`, `runcommand_sdk_test.go`, `vm_sdk_test.go`, `credentials_test.go`)
- Azure provider now accepts and wires through environment name and inventory path for side-channel state required by WinRM
- Terraform controller module: added support for deploying from a custom `source_image_id` (Shared Image Gallery, managed image, etc.)

**Changed:**

- VM lifecycle (start, stop, delete) and discovery now use Azure SDK clients with improved parallelism and error handling
- PowerShell validator timeout increased to 180s to account for Azure Run Command tail latency under concurrency
- Azure Run Command now uses Managed Run Command subresources for safe, parallel execution; old `az vm run-command invoke` path removed from hot path
- AzureProvider refactored to always route validator checks through the new WinRM runner; managed Run Commands remain available for ad-hoc use
- Provider construction now passes environment and inventory path to Azure for WinRM tunnel/inventory integration
- Inventory parser now extracts `ansible_password` and handles quoted values for WinRM authentication
- Documentation and comments updated to clarify the new Azure provider architecture and Terraform controller image logic

**Removed:**

- Eliminated most uses of `az` CLI for hot-path operations in Azure provider
- Removed per-VM mutex/serialization for Azure Run Command (no longer needed with Managed Run Command subresources)
- Deprecated legacy group membership parsing in inventory parser for host lines with key-value pairs (now treated as host definitions)
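
The combination of per-VM and global concurrency limits mentioned above can be sketched with buffered channels used as semaphores. The `limiter` type, its methods, and the limit values are hypothetical; the PR does not state the actual limits or how they are implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// limiter caps concurrency globally and per VM (hypothetical helper).
type limiter struct {
	global chan struct{}
	perVM  int
	mu     sync.Mutex
	vms    map[string]chan struct{}
}

func newLimiter(globalMax, perVMMax int) *limiter {
	return &limiter{
		global: make(chan struct{}, globalMax),
		perVM:  perVMMax,
		vms:    make(map[string]chan struct{}),
	}
}

// vmSlots returns the semaphore for vmID, creating it on first use.
func (l *limiter) vmSlots(vmID string) chan struct{} {
	l.mu.Lock()
	defer l.mu.Unlock()
	s, ok := l.vms[vmID]
	if !ok {
		s = make(chan struct{}, l.perVM)
		l.vms[vmID] = s
	}
	return s
}

// do takes a per-VM slot first and a global slot second, so a call that
// is waiting on a busy VM does not hold global capacity.
func (l *limiter) do(vmID string, fn func()) {
	vm := l.vmSlots(vmID)
	vm <- struct{}{}
	defer func() { <-vm }()
	l.global <- struct{}{}
	defer func() { <-l.global }()
	fn()
}

// peakConcurrency runs callsPerVM calls against each VM and returns the
// highest number of calls observed running at the same time.
func peakConcurrency(l *limiter, vmIDs []string, callsPerVM int) int {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		active int
		peak   int
	)
	for _, id := range vmIDs {
		for i := 0; i < callsPerVM; i++ {
			wg.Add(1)
			go func(id string) {
				defer wg.Done()
				l.do(id, func() {
					mu.Lock()
					active++
					if active > peak {
						peak = active
					}
					mu.Unlock()
					mu.Lock()
					active--
					mu.Unlock()
				})
			}(id)
		}
	}
	wg.Wait()
	return peak
}

func main() {
	l := newLimiter(3, 1) // illustrative limits only
	peak := peakConcurrency(l, []string{"dc01", "dc02", "srv01", "srv02"}, 5)
	fmt.Println(peak <= 3)
}
```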
**Added:**

- Added `useNTLM` flag to host credentials to select WinRM transport based on
  whether the host is a domain controller or member server
- Implemented `isDomainController` helper to detect DCs from inventory groups
- Added debug logging for WinRM client initialization showing auth method used

**Changed:**

- Updated credential loading to set `useNTLM` based on host group, ensuring the
  correct authentication protocol is selected for each host type
- Modified credential construction to strip the `.\` prefix from usernames,
  preventing issues with authentication headers and aligning with WinRM
  library expectations
- Adjusted WinRM client creation to select NTLM or Basic transport dynamically
  according to the `useNTLM` flag, improving compatibility with both DCs and
  member servers
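
The credential handling described in this commit might look like the sketch below. `isDomainController` and `useNTLM` are names from the commit message; the `hostCreds` type, the `newHostCreds` constructor, and the group name `dc` are assumptions. The commit does not say which host type gets which transport, so the mapping here (NTLM for domain controllers, Basic otherwise) is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// hostCreds is a hypothetical shape for per-host WinRM credentials.
type hostCreds struct {
	user     string
	password string
	useNTLM  bool
}

// isDomainController reports whether any inventory group marks the host
// as a DC. Matching on a group named "dc" is an assumption.
func isDomainController(groups []string) bool {
	for _, g := range groups {
		if strings.EqualFold(g, "dc") {
			return true
		}
	}
	return false
}

// newHostCreds strips the local-account ".\" prefix from the username and
// selects the transport from the host's groups.
// ASSUMPTION: DCs use NTLM and member servers use Basic.
func newHostCreds(user, password string, groups []string) hostCreds {
	return hostCreds{
		user:     strings.TrimPrefix(user, `.\`),
		password: password,
		useNTLM:  isDomainController(groups),
	}
}

func main() {
	dc := newHostCreds(`.\administrator`, "example", []string{"domain", "dc"})
	srv := newHostCreds("localuser", "example", []string{"domain", "server"})
	fmt.Println(dc.user, dc.useNTLM)
	fmt.Println(srv.user, srv.useNTLM)
}
```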
**Changed:**

- Clarified VM OS support to include generic Linux, not just Ubuntu 24.04 LTS
- Documented support for booting from Shared Image Gallery images via
  `source_image_id` for prebuilt attacker images
- Noted that Ansible dependencies setup via cloud-init can be skipped if
  using gallery images that include them
- Improved accuracy and flexibility of feature descriptions in the bastion
  module README
dreadnode-renovate-bot (bot) added the area/pre-commit, area/roles, and area/docs labels Apr 30, 2026
l50 added 2 commits April 30, 2026 16:22
**Added:**

- Added installation and version check steps for terraform-docs v0.20.0 in the
  pre-commit GitHub Actions workflow to ensure docs generation is available

**Changed:**

- Updated inventory parser to strip quotes from ansible_user value, ensuring
  consistency with how ansible_password is handled
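
Quote stripping for inventory host variables such as `ansible_user` and `ansible_password` could be sketched as follows. `parseHostLine` and `unquote` are hypothetical helpers, not the project's parser, and the sketch assumes values contain no spaces.

```go
package main

import (
	"fmt"
	"strings"
)

// unquote removes one pair of matching single or double quotes, if present.
func unquote(v string) string {
	if len(v) >= 2 {
		first, last := v[0], v[len(v)-1]
		if first == last && (first == '"' || first == '\'') {
			return v[1 : len(v)-1]
		}
	}
	return v
}

// parseHostLine splits an INI-style inventory host line into the host name
// and its key=value variables. Values with embedded spaces are not handled.
func parseHostLine(line string) (string, map[string]string) {
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return "", nil
	}
	vars := make(map[string]string)
	for _, f := range fields[1:] {
		key, value, ok := strings.Cut(f, "=")
		if !ok {
			continue // not a key=value pair
		}
		vars[key] = unquote(value)
	}
	return fields[0], vars
}

func main() {
	host, vars := parseHostLine(`dc01 ansible_host=10.0.0.4 ansible_user="administrator" ansible_password='example'`)
	fmt.Println(host, vars["ansible_host"], vars["ansible_user"], vars["ansible_password"])
}
```

Treating `ansible_user` and `ansible_password` identically avoids a case where a quoted username reaches the WinRM client with the quotes still attached.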
**Changed:**

- Updated the test to ensure each check's output is a contiguous block (header
  and results) and not interleaved, regardless of check completion order
- Clarified test name and comments to reflect that output is grouped by check,
  not strictly in submission order
- Simplified assertions to check grouping and contiguity of output for each check

**Removed:**

- Removed the setup step for terraform-docs in the pre-commit GitHub Actions
  workflow to streamline dependencies
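
The output behaviour that this test pins down (each check's header and results printed as one contiguous block, in completion order) can be sketched by buffering per check and flushing under a lock. `runChecks` is a name from the PR, but this signature, the `check` type, and the `demoOutput` and `hasBlock` helpers are assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
	"sync"
)

// check is a hypothetical unit of validation work.
type check struct {
	name string
	run  func(w io.Writer)
}

// runChecks runs all checks concurrently. Each check writes into its own
// buffer, and the buffer is flushed to out in one locked write as soon as
// the check finishes, so blocks never interleave.
func runChecks(out io.Writer, checks []check) {
	var (
		wg    sync.WaitGroup
		outMu sync.Mutex
	)
	for _, c := range checks {
		wg.Add(1)
		go func(c check) {
			defer wg.Done()
			var buf bytes.Buffer
			fmt.Fprintf(&buf, "== %s ==\n", c.name)
			c.run(&buf)
			outMu.Lock()
			defer outMu.Unlock()
			out.Write(buf.Bytes())
		}(c)
	}
	wg.Wait()
}

// demoOutput runs two sample checks and returns the combined output.
func demoOutput() string {
	var out bytes.Buffer
	runChecks(&out, []check{
		{"users", func(w io.Writer) { fmt.Fprintln(w, "ok: users present") }},
		{"trusts", func(w io.Writer) { fmt.Fprintln(w, "ok: trusts configured") }},
	})
	return out.String()
}

// hasBlock reports whether a check's header is immediately followed by its body.
func hasBlock(output, name, body string) bool {
	return strings.Contains(output, "== "+name+" ==\n"+body+"\n")
}

func main() {
	fmt.Print(demoOutput())
}
```

The order of the two blocks can differ between runs, which is why the assertions check grouping rather than position.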
…ace fix

**Added:**

- Introduced `terraform-azure-vnet-peering` module with support for bidirectional
  VNet peering, optional remote NSG rules, and configurable inputs/outputs
- Added Terraform files: `main.tf`, `variables.tf`, `outputs.tf`, `versions.tf`,
  and a comprehensive `README.md` with usage and input documentation

**Changed:**

- Added per-VM serialization to WinRM runner to prevent NTLM handshake races,
  using a new `vmLocks` map and locking in `runPS`
- Modified `runScriptText` in script runner to propagate run errors and return
  partial output, ensuring callers can differentiate between template/rendering
  bugs and transport failures

**Removed:**

- Cleared `vmLocks` in WinRM runner's `close()` method to avoid stale locks after
  shutdown
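
Returning partial output together with the run error, so callers can tell rendering bugs from transport failures, might be sketched like this. `runScriptText` is a name from the commit message, but this signature, the use of `text/template`, the `errTransport` sentinel, and the `isTransport` helper are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"text/template"
)

// errTransport is a hypothetical sentinel for transport-level failures.
var errTransport = errors.New("transport failure")

// runScriptText renders tmplText with data and passes the script to run.
// Render errors return no output. Run errors return whatever output was
// captured before the failure, along with the wrapped error.
func runScriptText(tmplText string, data any, run func(script string) (string, error)) (string, error) {
	tmpl, err := template.New("script").Parse(tmplText)
	if err != nil {
		return "", fmt.Errorf("render: %w", err)
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, data); err != nil {
		return "", fmt.Errorf("render: %w", err)
	}
	out, err := run(sb.String())
	if err != nil {
		return out, fmt.Errorf("run: %w", err)
	}
	return out, nil
}

// isTransport reports whether err wraps the transport sentinel.
func isTransport(err error) bool {
	return errors.Is(err, errTransport)
}

func main() {
	// A runner that fails partway through, after producing some output.
	flaky := func(script string) (string, error) {
		return "first line of output", errTransport
	}
	out, err := runScriptText("Get-Item {{.Path}}", map[string]string{"Path": "C:\\"}, flaky)
	fmt.Printf("%q %v %v\n", out, err, isTransport(err))
}
```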
**Added:**

- Added TFD_VERSION environment variable to specify terraform-docs version
- Introduced step to download, extract, and install terraform-docs binary in
  pre-commit workflow, ensuring terraform-docs is available for documentation
  generation and linting
dreadnode-renovate-bot (bot) added the area/github label May 1, 2026
l50 temporarily deployed to terratest May 1, 2026 15:23 with GitHub Actions (inactive)
**Changed:**

- Bumped `TFD_VERSION` environment variable from v0.20.0 to v0.22.0 in the
  pre-commit GitHub Actions workflow to ensure use of the latest
  terraform-docs features and fixes
…ment

**Added:**

- Introduced `.terraform.lock.hcl` files for all Terraform modules to ensure
  consistent provider versions and improve reproducibility across environments
- Locked AWS providers for `terraform-aws-instance-factory` and `terraform-aws-net`
  modules, including `aws`, `http`, and `random` as required
- Locked AzureRM provider for all Azure-related modules, specifying compatible
  versions per module (`terraform-azure-bastion`, `terraform-azure-controller`,
  `terraform-azure-instance-factory`, `terraform-azure-net`, and
  `terraform-azure-vnet-peering`)
- Included additional provider locks for `local` and `tls` in
  `terraform-azure-controller` to match its dependencies
**Changed:**

- Added new provider hashes for AWS, AzureRM, HTTP, Local, Random, and TLS
  providers in multiple `.terraform.lock.hcl` files to ensure compatibility with
  updated provider releases and improve supply chain verification. This change
  affects module lock files for `terraform-aws-instance-factory`, `terraform-aws-net`,
  `terraform-azure-bastion`, `terraform-azure-controller`, `terraform-azure-instance-factory`,
  `terraform-azure-net`, and `terraform-azure-vnet-peering`. No provider versions or
  constraints were changed—only additional hash entries were added for integrity.
**Added:**

- Added task to check all required DSC modules and identify missing ones
- Implemented parallel installation of missing DSC modules using async jobs
- Introduced async status polling to wait for module installations to complete

**Changed:**

- Replaced sequential DSC module installation with a check-and-parallel-install
  workflow to improve efficiency and reliability
- Updated documentation to reflect new module installation logic and async
  process

**Removed:**

- Removed the old sequential DSC module installation task and related looping logic
l50 merged commit 88b2a25 into main May 1, 2026
10 of 12 checks passed
l50 deleted the feat/azure-provider branch May 1, 2026 17:41
l50 added a commit that referenced this pull request May 1, 2026
…astion workflows (#161)
