Why is this needed:
Live verification of issue #55 on PR #57 surfaced a fundamental gap in how the SSM debug toggle (/jit-runners/runner-log-level) is wired to the runner agent.
Current behavior:
SSM parameter set to debug ✅
Scaleup Lambda reads SSM with 30s cache ✅
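Userdata template renders export ACTIONS_RUNNER_DEBUG=true and export ACTIONS_STEP_DEBUG=true at script level ✅
su - runner -c "..." line passes both vars into the runner user's shell (PR #60, fix(userdata): pass ACTIONS_*_DEBUG env vars through 'su - runner') ✅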
Runner agent's _diag/Runner_*.log content stays at INFO level — no DEBUG / ##[debug] markers ❌
Inspection of 1000 log events from a debug-toggle-flipped runner (PR #57, runner 733): 100% INFO level, zero DEBUG, zero ##[debug] markers in Worker output.
Per GitHub's enabling-debug-logging docs, ACTIONS_RUNNER_DEBUG and ACTIONS_STEP_DEBUG are documented to be set as repository/organization secrets or variables. The runner agent fetches them from GitHub at job-pickup time, not from its local process environment.
So the entire SSM → Lambda → userdata → su → process env pipeline works correctly, but the agent doesn't read the values from where we put them.
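The inspection described above could be repeated with a small script along these lines. This is only a sketch: the function names are hypothetical, and the _diag line shape in the regex ("[timestamp LEVEL Source] message") is an assumption that should be checked against a real Runner_*.log before relying on it.

```python
import re
from collections import Counter

# Assumed _diag line shape: "[2024-01-01 00:00:00Z INFO Runner] message".
# Verify against a real Runner_*.log; this pattern is not taken from the issue.
LEVEL_RE = re.compile(r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}Z (?P<level>[A-Z]+) ")
DEBUG_MARKER = "##[debug]"


def summarize_log_events(lines):
    """Count log levels and ##[debug] markers across an iterable of log lines."""
    levels = Counter()
    markers = 0
    for line in lines:
        match = LEVEL_RE.match(line)
        if match:
            levels[match.group("level")] += 1
        if DEBUG_MARKER in line:
            markers += 1
    return {"levels": dict(levels), "debug_markers": markers}


def debug_logging_active(summary):
    """True if any DEBUG-level line or ##[debug] marker was seen."""
    return summary["debug_markers"] > 0 or summary["levels"].get("DEBUG", 0) > 0
```

Feeding it the log events from a toggle-flipped runner should report only INFO lines and zero markers if the gap described here is still present.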
What would you like to be added:
Two coordinated changes (one PR is fine):
1. Document the current SSM toggle's limitation
docs/troubleshooting.md "Debugging silent runner failures" subsection currently advertises:
Flip the SSM toggle to debug. … Reproduce the issue, inspect the debug-level log lines (look for ##[debug] markers in Worker_*.log), then revert.
This is misleading. Replace with one of:
A note that this toggle is ineffective until the underlying mechanism is fixed (recommended now).
The corrected operator workflow once a real mechanism is in place (recommended after fix).
2. Implement a working debug-toggle mechanism
Three options to evaluate:
(a) GitHub-side secrets: scaleup Lambda (or a separate management Lambda) calls PUT /repos/{owner}/{repo}/actions/secrets/ACTIONS_RUNNER_DEBUG to set the secret to true when SSM toggle is debug, removes it on flip back. Requires GitHub App scope secrets:write. Affects ALL workflow runs in the repo, including unrelated ones.
(b) Runner config override: modify the AMI to include a .runner config file with debug logging enabled, OR write an override at boot time. Need to research the correct config key — possibly traceLogLevel or agentLogLevel in Runner.Listener config XML.
(c) Workflow-level env injection: users opt in per-workflow by setting env: ACTIONS_RUNNER_DEBUG: true at workflow or job level. Doesn't need our infrastructure at all.
Option (c) is the lightest weight and most correct — no infrastructure changes needed. Option (a) is operationally cleanest but adds attack surface. Option (b) is most jit-runners-native but requires AMI work.
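To help evaluate option (a), the mapping from SSM value to GitHub API call could look roughly like the sketch below. It is written in Python for illustration only (the scaleup Lambda's language is not stated here), and plan_secret_action and SecretAction are hypothetical names. It only decides which request to make; it does not send it.

```python
from typing import NamedTuple

SECRET_NAME = "ACTIONS_RUNNER_DEBUG"


class SecretAction(NamedTuple):
    method: str  # HTTP method to use against the GitHub REST API
    path: str    # API path for the repository secret


def plan_secret_action(log_level: str, owner: str, repo: str) -> SecretAction:
    """Map the SSM toggle value to the secret operation option (a) describes."""
    path = f"/repos/{owner}/{repo}/actions/secrets/{SECRET_NAME}"
    if log_level.strip().lower() == "debug":
        # The PUT body needs encrypted_value and key_id, obtained via
        # GET /repos/{owner}/{repo}/actions/secrets/public-key; the
        # encryption step is omitted from this sketch.
        return SecretAction("PUT", path)
    # Any other value removes the secret so debug logging is switched off.
    return SecretAction("DELETE", path)
```

Keeping the decision separate from the HTTP call makes the flip-back path easy to test, which matters because a leftover secret would keep debug logging on for every workflow run in the repo.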
Recommendation: document option (c) as the workaround, and either deprecate the SSM toggle (swap for documentation) OR repurpose it for a different runtime control (e.g. CloudWatch agent log level, scaledown thresholds, etc.).
Acceptance criteria:
docs/troubleshooting.md no longer advertises the broken SSM debug toggle path.
Replacement workflow (option a/b/c above) documented with reproducible verification recipe.
If keeping SSM parameter /jit-runners/runner-log-level: it controls something measurable (or is repurposed for a different control). If not: the parameter is removed via a CFN/Tofu update.
Verification on a draft PR confirms the new mechanism produces ##[debug] markers in Worker output.
Out of scope:
Removing CloudWatch agent log forwarding (separate concern, working correctly).
Who is this feature for:
jit-runners operators who currently believe the documented SSM debug toggle works. Right now they would flip the parameter, observe no debug output, and have no signal that something is wrong.
References:
Surfacing context: PR #57 verification of v1.0.0-rc.3 deploy.
Projects/jit-runners/specs/2026-05-02-runner-agent-diagnostics-design.md (decision Q6, locked option C; needs revisiting).