
feat(runner): debug-toggle SSM parameter is ineffective; runner agent doesn't read ACTIONS_RUNNER_DEBUG from process env #61

@kaio6fellipe

Description


Why is this needed:

Live verification of issue #55 on PR #57 surfaced a fundamental gap in how the SSM debug toggle (/jit-runners/runner-log-level) is wired to the runner agent.

Current behavior:

  1. SSM parameter set to debug
  2. Scaleup Lambda reads SSM with 30s cache ✅
  3. Userdata template renders export ACTIONS_RUNNER_DEBUG=true and export ACTIONS_STEP_DEBUG=true at script level ✅
  4. Userdata's su - runner -c "..." line passes both vars into the runner user's shell (fixed in PR #60, "fix(userdata): pass ACTIONS_*_DEBUG env vars through 'su - runner'") ✅
  5. Runner agent process receives the env vars ✅
  6. Runner agent's _diag/Runner_*.log content stays at INFO level: no DEBUG lines and no ##[debug] markers ❌
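Steps 1 to 5 can be sketched in Python to make the hand-off concrete. This is a minimal sketch under assumptions: the issue does not say which language the scaleup Lambda uses, the names TtlCache, render_debug_exports and render_su_line are hypothetical, and ./run.sh stands in for the command elided as "..." in step 4. The su helper prefixes the variables onto the command string because su - starts a login shell that clears the caller's environment. Nothing in this path talks to GitHub, which is consistent with step 6 still failing.

```python
import shlex
import time
from typing import Callable, List, Optional

# Variables the userdata exports when the toggle reads "debug" (names from the issue).
DEBUG_VARS = ("ACTIONS_RUNNER_DEBUG", "ACTIONS_STEP_DEBUG")


def _is_debug(level: str) -> bool:
    return level.strip().lower() == "debug"


class TtlCache:
    """Hypothetical helper: caches one fetched value for `ttl` seconds (30s per step 2)."""

    def __init__(self, fetch: Callable[[], str], ttl: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # e.g. an SSM get_parameter call (injected)
        self._ttl = ttl
        self._clock = clock          # injected so the cache is testable
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


def render_debug_exports(level: str) -> List[str]:
    """Script-level export lines (step 3); empty unless the toggle is 'debug'."""
    if not _is_debug(level):
        return []
    return [f"export {name}=true" for name in DEBUG_VARS]


def render_su_line(level: str, inner_cmd: str) -> str:
    """Step 4: 'su - runner' starts a login shell that drops the caller's env,
    so the vars are prefixed onto the command string itself."""
    prefix = " ".join(f"{name}=true" for name in DEBUG_VARS) if _is_debug(level) else ""
    cmd = f"{prefix} {inner_cmd}".strip()
    return f"su - runner -c {shlex.quote(cmd)}"
```

In a Python Lambda the fetcher could be, for example, lambda: boto3.client("ssm").get_parameter(Name="/jit-runners/runner-log-level")["Parameter"]["Value"]; injecting fetch and clock keeps the cache testable without AWS.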

Inspection of 1000 log events from a runner launched after the toggle was flipped (PR #57, runner 733): 100% INFO level, zero DEBUG lines, and zero ##[debug] markers in Worker output.
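The tally behind that inspection can be reproduced with a small script. It is a sketch that assumes diag lines start with a "[date timeZ LEVEL Component]" prefix; that regex is an assumption about the log format rather than a documented contract, and tally_log_levels is a hypothetical helper. Lines that don't match are counted as UNPARSED instead of being silently dropped.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed diag-log prefix, e.g. "[2024-05-01 12:00:00Z INFO Worker] message".
LEVEL_RE = re.compile(r"^\[\S+ \S+ ([A-Z]+) [^\]]*\]")
DEBUG_MARKER = "##[debug]"


def tally_log_levels(lines: Iterable[str]) -> dict:
    """Count events per level plus ##[debug] markers across log lines."""
    levels: Counter = Counter()
    markers = 0
    for line in lines:
        match = LEVEL_RE.match(line)
        levels[match.group(1) if match else "UNPARSED"] += 1
        if DEBUG_MARKER in line:
            markers += 1
    total = sum(levels.values())
    return {
        "total": total,
        "levels": dict(levels),
        # Counter returns 0 for a missing key without inserting it.
        "info_share": levels["INFO"] / total if total else 0.0,
        "debug_markers": markers,
    }
```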

Per GitHub's "Enabling debug logging" docs, ACTIONS_RUNNER_DEBUG and ACTIONS_STEP_DEBUG are meant to be set as repository or organization secrets or variables. The runner agent fetches them from GitHub at job-pickup time, not from its local process environment.

So the entire SSM → Lambda → userdata → su → process env pipeline works correctly, but the agent doesn't read the values from where we put them.

What would you like to be added:

Two coordinated changes (one PR is fine):

1. Document the current SSM toggle's limitation

The "Debugging silent runner failures" subsection of docs/troubleshooting.md currently advertises:

Flip the SSM toggle to debug. … Reproduce the issue, inspect the debug-level log lines (look for ##[debug] markers in Worker_*.log), then revert.

This is misleading. Replace with one of:

  • A note that this toggle is ineffective until the underlying mechanism is fixed (recommended now).
  • The corrected operator workflow once a real mechanism is in place (recommended after fix).

2. Implement a working debug-toggle mechanism

Three options to evaluate:

  • (a) GitHub-side secrets: the scaleup Lambda (or a separate management Lambda) calls PUT /repos/{owner}/{repo}/actions/secrets/ACTIONS_RUNNER_DEBUG to set the secret to true when the SSM toggle is debug, and deletes it when the toggle flips back. Requires the GitHub App's secrets:write permission. Affects ALL workflow runs in the repo, including unrelated ones.
  • (b) Runner config override: modify the AMI to include a .runner config file with debug logging enabled, OR write an override at boot time. The correct config key still needs research; candidates are traceLogLevel or agentLogLevel in the Runner.Listener config XML.
  • (c) Workflow-level env injection: users opt in per-workflow by setting env: ACTIONS_RUNNER_DEBUG: true at workflow or job level. Doesn't need our infrastructure at all.
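For option (a), the toggle logic reduces to choosing between a PUT and a DELETE on the secret endpoint. GitHub's API does not accept the plain value: the PUT body carries encrypted_value (the value sealed with the repository's public key from GET /repos/{owner}/{repo}/actions/secrets/public-key, using a libsodium sealed box) plus that key's key_id. The sketch below injects the sealing function so it runs with the standard library only; plan_debug_secret_request and build_request are hypothetical names, and obtaining the token (e.g. a GitHub App installation token) is left out.

```python
import base64
import json
import urllib.request
from typing import Callable, Optional, Tuple

API_ROOT = "https://api.github.com"
SECRET_NAME = "ACTIONS_RUNNER_DEBUG"


def plan_debug_secret_request(
    owner: str,
    repo: str,
    level: str,
    key_id: str,
    seal: Callable[[bytes], bytes],
) -> Tuple[str, str, Optional[bytes]]:
    """Hypothetical: return (method, url, body) mirroring the SSM toggle onto the secret."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/actions/secrets/{SECRET_NAME}"
    if level.strip().lower() != "debug":
        # Toggle flipped back: remove the secret rather than setting it to false.
        return ("DELETE", url, None)
    # seal() must return the sealed-box ciphertext for the repo's public key.
    encrypted_value = base64.b64encode(seal(b"true")).decode("ascii")
    body = json.dumps({"encrypted_value": encrypted_value, "key_id": key_id}).encode("utf-8")
    return ("PUT", url, body)


def build_request(method: str, url: str, body: Optional[bytes], token: str) -> urllib.request.Request:
    """Wrap the plan in a urllib Request; sending it is left to the caller."""
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if body is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=body, headers=headers, method=method)
```

In production, seal could wrap PyNaCl's SealedBox (as in GitHub's own secret-encryption example), and urllib.request.urlopen(build_request(...)) would send the request. Separating planning from sending keeps the SSM-to-secret mapping testable without network access.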

Option (c) is the lightest-weight and most correct option, since it needs no infrastructure changes. Option (a) is operationally cleanest but adds attack surface. Option (b) is most jit-runners-native but requires AMI work.

Recommendation: document option (c) as the workaround, and either deprecate the SSM toggle (swap for documentation) OR repurpose it for a different runtime control (e.g. CloudWatch agent log level, scaledown thresholds, etc.).

Acceptance criteria:

  • docs/troubleshooting.md no longer advertises the broken SSM debug toggle path.
  • Replacement workflow (option a/b/c above) documented with a reproducible verification recipe.
  • If the SSM parameter /jit-runners/runner-log-level is kept, it controls something measurable (or is repurposed to a different control). If not, the parameter is removed via a CFN/Tofu update.
  • Verification on a draft PR confirms the new mechanism produces ##[debug] markers in Worker output.
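The last criterion can be checked mechanically against a runner's _diag directory after the draft-PR job finishes. A minimal sketch, assuming the Worker_*.log naming used in this issue and read access to the instance's runner directory; debug_marker_counts and debug_logging_verified are hypothetical helpers, and the check passes only if at least one Worker log contains the marker.

```python
from pathlib import Path

DEBUG_MARKER = "##[debug]"


def debug_marker_counts(diag_dir: str, pattern: str = "Worker_*.log") -> dict:
    """Map each matching log file name to its number of ##[debug] markers."""
    files = sorted(Path(diag_dir).glob(pattern))
    if not files:
        # No Worker logs at all usually means the job never ran on this runner.
        raise FileNotFoundError(f"no {pattern} files under {diag_dir}")
    return {
        f.name: f.read_text(encoding="utf-8", errors="replace").count(DEBUG_MARKER)
        for f in files
    }


def debug_logging_verified(diag_dir: str) -> bool:
    """True if any Worker log contains at least one ##[debug] marker."""
    return any(n > 0 for n in debug_marker_counts(diag_dir).values())
```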

Out of scope:

Who is this feature for:

jit-runners operators who currently believe the documented SSM debug toggle works. Right now they would flip the parameter, observe no debug output, and have no signal that something is wrong.

References:

Metadata

Assignees: No one assigned
Type: No type
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests