feat(runner): debug-toggle SSM parameter is ineffective; runner agent doesn't read ACTIONS_RUNNER_DEBUG from process env

**Why is this needed**:

Live verification of issue #55 on PR #57 surfaced a fundamental gap in how the SSM debug toggle (`/jit-runners/runner-log-level`) is wired to the runner agent.

Current behavior:

1. SSM parameter set to `debug` ✅
2. Scaleup Lambda reads SSM with 30s cache ✅
3. Userdata template renders `export ACTIONS_RUNNER_DEBUG=true` and `export ACTIONS_STEP_DEBUG=true` at script level ✅
4. Userdata's `su - runner -c "..."` line passes both vars into the runner user's shell (PR #60 fix) ✅
5. Runner agent process receives the env vars ✅
6. **Runner agent's `_diag/Runner_*.log` content stays at INFO level — no DEBUG / `##[debug]` markers** ❌

Inspection of 1000 log events from a `debug`-toggle-flipped runner (PR #57, runner 733): 100% INFO level, zero DEBUG, zero `##[debug]` markers in Worker output.
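The inspection above can be reproduced with a small script. This is a minimal sketch, not the tooling used for PR #57: the function names are hypothetical, and the assumed line shape (`[<timestamp>Z LEVEL Component] message`) and the `DEBUG` level name should be checked against real `_diag` output before relying on it.

```python
import re
from collections import Counter

# Assumed shape of a _diag log line (not confirmed by the issue):
#   [2026-05-02 10:15:30Z INFO Runner] message
_LEVEL_RE = re.compile(r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}Z\s+(\w+)\s")


def summarize_log_levels(lines):
    """Return (Counter of level names, count of lines containing '##[debug]')."""
    levels = Counter()
    debug_markers = 0
    for line in lines:
        match = _LEVEL_RE.match(line)
        if match:
            levels[match.group(1)] += 1
        if "##[debug]" in line:
            debug_markers += 1
    return levels, debug_markers


def debug_logging_active(lines, debug_levels=("DEBUG",)):
    """True if any line carries a debug level or a '##[debug]' marker.

    debug_levels is an assumption; pass the level names your agent emits.
    """
    levels, debug_markers = summarize_log_levels(lines)
    return debug_markers > 0 or any(levels[name] > 0 for name in debug_levels)
```

Feeding it the exported log events from a toggled runner should return `False` today and `True` once a working mechanism is in place.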

Per [GitHub's enabling-debug-logging docs](https://docs.github.com/en/actions/monitoring-and-troubleshooting-workflows/enabling-debug-logging), `ACTIONS_RUNNER_DEBUG` and `ACTIONS_STEP_DEBUG` are documented to be set as **repository/organization secrets or variables**. The runner agent fetches them from GitHub at job-pickup time, not from its local process environment.

So the entire SSM → Lambda → userdata → su → process env pipeline works correctly, but the agent doesn't read the values from where we put them.

**What would you like to be added**:

Two coordinated changes (one PR is fine):

### 1. Document the current SSM toggle's limitation

The "Debugging silent runner failures" subsection of `docs/troubleshooting.md` currently advertises:

> Flip the SSM toggle to `debug`. … Reproduce the issue, inspect the debug-level log lines (look for `##[debug]` markers in `Worker_*.log`), then revert.

This is misleading. Replace with one of:

- A note that this toggle is ineffective until the underlying mechanism is fixed (recommended now).
- The corrected operator workflow once a real mechanism is in place (recommended after fix).

### 2. Implement a working debug-toggle mechanism

Three options to evaluate:

- **(a) GitHub-side secrets:** scaleup Lambda (or a separate management Lambda) calls `PUT /repos/{owner}/{repo}/actions/secrets/ACTIONS_RUNNER_DEBUG` to set the secret to `true` when the SSM toggle is `debug`, and calls `DELETE` on the same path on flip back. The secret value must be encrypted with the repository public key before upload. Requires the GitHub App repository permission "Secrets" with write access (`secrets: write`). Affects ALL workflow runs in the repo, including unrelated ones.
- **(b) Runner config override:** modify the AMI to include a `.runner` config file with debug logging enabled, OR write an override at boot time. Need to research the correct config key — possibly `traceLogLevel` or `agentLogLevel` in Runner.Listener config XML.
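- **(c) Workflow-level env injection:** users opt in per-workflow by setting `ACTIONS_RUNNER_DEBUG: true` under `env:` at workflow or job level. Doesn't need our infrastructure at all. Caveat: the GitHub docs cited above describe only the secret/variable route, so confirm on a test run that a workflow-level `env:` value is honored before documenting this.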

Option (c) is the lightest weight and most correct — no infrastructure changes needed. Option (a) is operationally cleanest but adds attack surface. Option (b) is most jit-runners-native but requires AMI work.
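To make option (a) concrete, the mapping from SSM value to GitHub REST calls could look roughly like the sketch below. The function name and return shape are hypothetical and it performs no network I/O. A real implementation would also need to fetch the repository public key (`GET /repos/{owner}/{repo}/actions/secrets/public-key`) and send `encrypted_value` plus `key_id` in the `PUT` body, which is omitted here.

```python
def plan_secret_calls(ssm_value, owner, repo, secret_names=("ACTIONS_RUNNER_DEBUG",)):
    """Map the SSM toggle value to (HTTP method, path) pairs for the secrets API.

    Hypothetical helper: "debug" means create/update the secret, any other
    value means remove it. Encrypting and sending the PUT body is out of scope.
    """
    method = "PUT" if ssm_value.strip().lower() == "debug" else "DELETE"
    return [
        (method, f"/repos/{owner}/{repo}/actions/secrets/{name}")
        for name in secret_names
    ]
```

Keeping the planning step pure makes it easy to unit-test the toggle semantics separately from authentication and encryption.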

Recommendation: **document option (c)** as the workaround, and either deprecate the SSM toggle (replacing it with documentation) OR repurpose it for a different runtime control (e.g. CloudWatch agent log level or scaledown thresholds).

**Acceptance criteria**:

- [ ] `docs/troubleshooting.md` no longer advertises the broken SSM debug toggle path.
- [ ] Replacement workflow (option a/b/c above) documented with a reproducible verification recipe.
- [ ] If the SSM parameter `/jit-runners/runner-log-level` is kept, it controls something measurable (either the runner log level or a repurposed control). If not, the parameter is removed via a CFN/Tofu update.
- [ ] Verification on a draft PR confirms the new mechanism produces `##[debug]` markers in Worker output.

**Out of scope**:

- Fixing the SSM cache or userdata env passthrough (already correct per PR #60).
- Removing CloudWatch agent log forwarding (separate concern, working correctly).

**Who is this feature for**:

`jit-runners` operators who currently believe the documented SSM debug toggle works. Right now they would flip the parameter, observe no debug output, and have no signal that something is wrong.

**References**:

- Surfacing context: PR [#57](https://github.com/devopsfactory-io/jit-runners/pull/57) verification of v1.0.0-rc.3 deploy.
- GitHub docs: [Enabling debug logging](https://docs.github.com/en/actions/monitoring-and-troubleshooting-workflows/enabling-debug-logging).
- Originating spec: zettelkasten `Projects/jit-runners/specs/2026-05-02-runner-agent-diagnostics-design.md` (decision Q6, locked option C — needs revisiting).
- Predecessor fixes that brought us here: #58 (sed), #59 (systemctl restart), #60 (su env passthrough).
