Description
We've identified two related issues with processKeepAlive that stem from a fundamental design limitation: recycling decisions are based on execution count, not actual resource consumption. This document outlines the issues we've experienced and proposes a robust fix: memory-aware recycling as the primary mechanism.
Issues We're Experiencing
Issue 1: SIGSEGV Crashes in Reused Processes
Symptoms:
- Task crashed with `TASK_PROCESS_SIGSEGV` after ~8.8 seconds of execution
- The same task succeeded when replayed 8 hours later (after the process was recycled)
- No obvious bug in our code; the crash appeared to be in native code
What we observed:
- A task (Task A) ran and likely caused subtle memory corruption in a native addon
- The process appeared "healthy" (IPC still connected)
- Our task (Task B) reused the corrupted process
- ~8.8 seconds in, native code access triggered SIGSEGV
Issue 2: Increased OOM Errors
Symptoms:
- Significantly more `TASK_PROCESS_OOM_KILLED` errors since enabling `processKeepAlive`
- Tasks that previously succeeded now fail with OOM
- Memory usage grows over consecutive task executions
Root Cause Analysis (hypothesis)
The Fundamental Problem
The current implementation makes recycling decisions based on execution count alone:
```ts
// packages/cli-v3/src/entryPoints/managed/taskRunProcessProvider.ts:284-288
private shouldReusePersistentProcess(): boolean {
  return (
    !!this.persistentProcess &&
    this.executionCount < this.processKeepAliveMaxExecutionCount &&
    this.persistentProcess.isHealthy // Only checks IPC connectivity
  );
}
```

The health check is minimal:
```ts
// packages/cli-v3/src/executions/taskRunProcess.ts:498-507
get isHealthy() {
  if (!this._child) return false;
  if (this.isBeingKilled || this.isBeingSuspended) return false;

  return this._child.connected; // Only checks IPC connection!
}
```

Why This Is Problematic
- Blind to memory consumption: A process could hit 90% memory on execution 3 but still be reused 47 more times until OOM.
- Can't detect native code corruption: Memory corruption in native addons (Prisma engine, sharp, OTEL) doesn't affect IPC connectivity until it's too late.
- One-size-fits-all defaults: `maxExecutionsPerProcess: 50` is too high for smaller machines with less memory headroom.
- No leak detection: Slow memory leaks accumulate over many executions without any detection mechanism.
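To make these gaps concrete, here is a minimal standalone sketch (the `ProcessSlot` shape and `MACHINE_MEMORY_BYTES` constant are hypothetical, not the real provider types) of how a connectivity-only health check waves through a process sitting at 90% memory utilization:

```typescript
// Hypothetical, simplified model of the current reuse check.
interface ProcessSlot {
  ipcConnected: boolean; // what isHealthy actually looks at
  rssBytes: number;      // what it ignores
}

const MACHINE_MEMORY_BYTES = 1024 * 1024 * 1024; // 1 GiB machine

// Mirrors the current logic: IPC connectivity is the only signal.
function currentIsHealthy(slot: ProcessSlot): boolean {
  return slot.ipcConnected;
}

const nearlyOom: ProcessSlot = {
  ipcConnected: true,
  rssBytes: 0.9 * MACHINE_MEMORY_BYTES, // 90% of machine memory
};

// One allocation away from an OOM kill, yet it would still be
// handed the next task.
console.log(currentIsHealthy(nearlyOom)); // true
```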
Proposed Solution: Memory-Aware Recycling
The most robust fix is to make memory utilization the primary recycling trigger, with execution count as a secondary backstop.
Core Principle
```ts
shouldRecycle =
  executionCount >= maxExecutions ||            // Backstop
  heapUsed > maxHeapThreshold ||                // Absolute limit
  rss / machineMemory > maxMemoryUtilization || // Relative limit
  memoryLeakDetected;                           // Growth rate abnormal
```

Why This Approach
- Addresses root cause: Memory pressure is the actual problem, not execution count. This measures what matters.
- Self-adjusting: A task using 50MB gets many reuses. A task using 500MB gets recycled sooner. No user configuration needed.
- Leak detection: Memory growth tracking catches slow leaks that wouldn't trigger absolute thresholds.
- Sensible defaults: Smaller machines automatically get more conservative recycling.
- Backwards compatible: Execution count remains as a backstop. Existing configs still work.
- Observable: Debug logs show exactly why recycling happened.
- Low overhead: Memory checks are cheap. IPC memory stats can be throttled (every 5s).
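As a self-contained sketch of the decision above (the `RecycleInputs` shape and the numbers in the examples are illustrative, not the proposed API), execution count only fires as a backstop while the memory limits do the real work:

```typescript
// Illustrative inputs for the recycling decision; names are hypothetical.
interface RecycleInputs {
  executionCount: number;
  maxExecutions: number;        // backstop
  heapUsedBytes: number;
  maxHeapBytes: number;         // absolute limit
  rssBytes: number;
  machineMemoryBytes: number;
  maxMemoryUtilization: number; // relative limit, 0-1
  memoryLeakDetected: boolean;  // growth-rate heuristic
}

function shouldRecycle(i: RecycleInputs): boolean {
  return (
    i.executionCount >= i.maxExecutions ||
    i.heapUsedBytes > i.maxHeapBytes ||
    i.rssBytes / i.machineMemoryBytes > i.maxMemoryUtilization ||
    i.memoryLeakDetected
  );
}

// A small task: far from every limit, so the process is reused.
console.log(shouldRecycle({
  executionCount: 3, maxExecutions: 50,
  heapUsedBytes: 50e6, maxHeapBytes: 400e6,
  rssBytes: 120e6, machineMemoryBytes: 1024e6,
  maxMemoryUtilization: 0.7, memoryLeakDetected: false,
})); // false

// A heavy task: only 4 executions in, but RSS is at ~80% of the
// machine, so memory (not execution count) triggers recycling.
console.log(shouldRecycle({
  executionCount: 4, maxExecutions: 50,
  heapUsedBytes: 300e6, maxHeapBytes: 400e6,
  rssBytes: 820e6, machineMemoryBytes: 1024e6,
  maxMemoryUtilization: 0.7, memoryLeakDetected: false,
})); // true
```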
Implementation Details
1. Update TaskRunProcessProvider Options
```ts
// packages/cli-v3/src/entryPoints/managed/taskRunProcessProvider.ts
export interface TaskRunProcessProviderOptions {
  workerManifest: WorkerManifest;
  env: RunnerEnv;
  logger: RunLogger;
  processKeepAliveEnabled: boolean;
  processKeepAliveMaxExecutionCount: number;

  // NEW: Memory-based recycling options
  processKeepAliveMaxMemoryUtilization: number; // e.g., 0.7 = 70%
  processKeepAliveMaxHeapUsedBytes: number | null;
  machineMemoryBytes: number;
}
```

2. Enhanced Recycling Logic
```ts
// packages/cli-v3/src/entryPoints/managed/taskRunProcessProvider.ts
export class TaskRunProcessProvider {
  // ... existing fields ...

  // NEW: Track memory baseline for leak detection
  private baselineMemory: number | null = null;
  private memoryLeakScore: number = 0;

  private shouldReusePersistentProcess(): boolean {
    if (!this.persistentProcess || !this.persistentProcess.isHealthy) {
      return false;
    }

    // Execution count check (backstop)
    if (this.executionCount >= this.processKeepAliveMaxExecutionCount) {
      this.sendDebugLog("Max executions reached, recycling", {
        executionCount: this.executionCount,
      });
      return false;
    }

    // NEW: Memory utilization check (primary)
    const memoryUsage = this.persistentProcess.getMemoryUsage();
    if (memoryUsage) {
      const memoryUtilization = memoryUsage.rss / this.machineMemoryBytes;

      if (memoryUtilization > this.processKeepAliveMaxMemoryUtilization) {
        this.sendDebugLog("Memory utilization exceeded threshold, recycling", {
          rss: memoryUsage.rss,
          machineMemory: this.machineMemoryBytes,
          utilization: memoryUtilization,
          threshold: this.processKeepAliveMaxMemoryUtilization,
        });
        return false;
      }

      // NEW: Absolute heap limit check
      if (
        this.processKeepAliveMaxHeapUsedBytes &&
        memoryUsage.heapUsed > this.processKeepAliveMaxHeapUsedBytes
      ) {
        this.sendDebugLog("Heap usage exceeded absolute threshold, recycling", {
          heapUsed: memoryUsage.heapUsed,
          threshold: this.processKeepAliveMaxHeapUsedBytes,
        });
        return false;
      }
    }

    // NEW: Memory leak detection (the score accumulates across executions,
    // so check it even when no fresh memory sample is available)
    if (this.memoryLeakScore > 3) {
      this.sendDebugLog("Memory leak pattern detected, recycling", {
        leakScore: this.memoryLeakScore,
      });
      return false;
    }

    return true;
  }

  async returnProcess(process: TaskRunProcess): Promise<void> {
    // ... existing logic ...

    if (this.processKeepAliveEnabled && this.shouldKeepProcessAlive(process)) {
      // NEW: Track memory delta for leak detection
      const memoryUsage = process.getMemoryUsage();

      if (memoryUsage && this.baselineMemory !== null) {
        const memoryGrowth = memoryUsage.heapUsed - this.baselineMemory;
        const growthPercent = memoryGrowth / this.baselineMemory;

        // If memory grew >15% and didn't return, likely a leak
        if (growthPercent > 0.15) {
          this.memoryLeakScore += 1;
          this.sendDebugLog("Memory growth detected after execution", {
            baseline: this.baselineMemory,
            current: memoryUsage.heapUsed,
            growthPercent,
            leakScore: this.memoryLeakScore,
          });
        } else if (growthPercent < 0.05) {
          // Memory returned to normal, reduce leak score
          this.memoryLeakScore = Math.max(0, this.memoryLeakScore - 1);
        }
      }

      // Set baseline after first execution
      if (this.baselineMemory === null && memoryUsage) {
        this.baselineMemory = memoryUsage.heapUsed;
      }

      // ... rest of existing logic ...
    }
  }
}
```

3. Add Memory Reporting to TaskRunProcess
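For reference, every field the `MEMORY_STATS` message carries comes straight from Node's built-in `process.memoryUsage()`, so the worker needs no extra instrumentation (the `stats` object below is just an illustration of the payload shape):

```typescript
// Node's process.memoryUsage() already provides every field the
// MEMORY_STATS message carries.
const mem = process.memoryUsage();

const stats = {
  heapUsed: mem.heapUsed,   // bytes used by V8's heap
  heapTotal: mem.heapTotal, // bytes reserved for V8's heap
  rss: mem.rss,             // resident set size of the whole process
  external: mem.external,   // memory used by C++ objects bound to JS
};

// Every field is a plain non-negative number, cheap to sample and send.
console.log(Object.values(stats).every((v) => typeof v === "number" && v >= 0));
// true
```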
```ts
// packages/cli-v3/src/executions/taskRunProcess.ts
export class TaskRunProcess {
  private _lastMemoryUsage: {
    heapUsed: number;
    heapTotal: number;
    rss: number;
    external: number;
  } | null = null;

  // NEW: Get cached memory usage from child process
  getMemoryUsage() {
    return this._lastMemoryUsage;
  }

  initialize() {
    // ... existing code ...
    this._ipc = new ZodIpcConnection({
      listenSchema: ExecutorToWorkerMessageCatalog,
      emitSchema: WorkerToExecutorMessageCatalog,
      handlers: {
        // ... existing handlers ...

        // NEW: Memory stats handler
        MEMORY_STATS: async (message) => {
          this._lastMemoryUsage = {
            heapUsed: message.heapUsed,
            heapTotal: message.heapTotal,
            rss: message.rss,
            external: message.external,
          };
        },
      },
    });
    // ... rest of existing code ...
  }
}
```

4. Worker-Side Memory Reporting
```ts
// In worker entry point (e.g., packages/cli-v3/src/entryPoints/managed-index-worker.ts)

// Report memory stats periodically
const memoryReportInterval = setInterval(() => {
  const mem = process.memoryUsage();
  ipc.send("MEMORY_STATS", {
    heapUsed: mem.heapUsed,
    heapTotal: mem.heapTotal,
    rss: mem.rss,
    external: mem.external,
  });
}, 5000); // Every 5 seconds

// Clean up on exit
process.on("exit", () => {
  clearInterval(memoryReportInterval);
});
```

5. Add Message to IPC Schema
```ts
// packages/core/src/v3/zodIpc.ts or relevant schema file
// Add to ExecutorToWorkerMessageCatalog
MEMORY_STATS: z.object({
  heapUsed: z.number(),
  heapTotal: z.number(),
  rss: z.number(),
  external: z.number(),
}),
```

6. Sensible Defaults by Machine Size
```ts
// packages/core/src/v3/machines.ts
export const processKeepAliveDefaults: Record<MachinePresetName, {
  maxExecutions: number;
  maxMemoryUtilization: number;
}> = {
  "micro":     { maxExecutions: 5,  maxMemoryUtilization: 0.60 },
  "small-1x":  { maxExecutions: 10, maxMemoryUtilization: 0.65 },
  "small-2x":  { maxExecutions: 15, maxMemoryUtilization: 0.70 },
  "medium-1x": { maxExecutions: 25, maxMemoryUtilization: 0.70 },
  "medium-2x": { maxExecutions: 35, maxMemoryUtilization: 0.75 },
  "large-1x":  { maxExecutions: 50, maxMemoryUtilization: 0.75 },
  "large-2x":  { maxExecutions: 50, maxMemoryUtilization: 0.80 },
};
```

7. Updated Configuration API
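To illustrate how user configuration could combine with the per-machine defaults from section 6, here is a sketch (the `resolveKeepAlive` helper and the trimmed-down defaults table are hypothetical, not part of the actual codebase):

```typescript
// Hypothetical merge of per-machine defaults with user overrides.
type MachinePresetName = "micro" | "small-1x" | "large-1x";

const defaults: Record<MachinePresetName, { maxExecutions: number; maxMemoryUtilization: number }> = {
  "micro":    { maxExecutions: 5,  maxMemoryUtilization: 0.60 },
  "small-1x": { maxExecutions: 10, maxMemoryUtilization: 0.65 },
  "large-1x": { maxExecutions: 50, maxMemoryUtilization: 0.75 },
};

interface UserKeepAliveConfig {
  maxExecutionsPerProcess?: number;
  maxMemoryUtilization?: number;
}

function resolveKeepAlive(machine: MachinePresetName, user: UserKeepAliveConfig = {}) {
  const d = defaults[machine];
  return {
    maxExecutions: user.maxExecutionsPerProcess ?? d.maxExecutions,
    maxMemoryUtilization: user.maxMemoryUtilization ?? d.maxMemoryUtilization,
  };
}

// Defaults apply when the user sets nothing...
console.log(resolveKeepAlive("micro").maxExecutions); // 5

// ...and explicit user values win over the preset.
console.log(resolveKeepAlive("micro", { maxExecutionsPerProcess: 3 }).maxExecutions); // 3
```

This keeps existing configs working unchanged: anyone already setting `maxExecutionsPerProcess` keeps that value, and everyone else gets machine-appropriate defaults.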
```ts
// packages/core/src/v3/config.ts
type ProcessKeepAlive =
  | boolean
  | {
      enabled: boolean;

      // Existing (becomes backstop)
      maxExecutionsPerProcess?: number;
      devMaxPoolSize?: number;

      // NEW: Memory-based recycling (primary)
      maxMemoryUtilization?: number; // 0-1, default varies by machine size
      maxHeapUsedMB?: number;        // Absolute heap limit in MB
      maxRssMB?: number;             // Absolute RSS limit in MB

      // NEW: Leak detection
      detectMemoryLeaks?: boolean;   // Default: true
      memoryLeakThreshold?: number;  // Growth % to flag as leak, default: 0.15
    };
```

What This Solution Doesn't Address
Native code corruption that doesn't manifest as memory growth - This is fundamentally hard to detect. A SIGSEGV might still occur if native code enters a bad state without increasing memory usage.
For this edge case, additional features could help:
- Per-task process isolation option: Allow tasks with known-risky native deps to opt-out of reuse
- Process sandboxing with snapshot/restore: More complex, but would provide clean state guarantees
However, memory-aware recycling would catch 90%+ of real-world issues since memory pressure is the primary symptom.
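As a sanity check on the leak-score heuristic from section 2, the following standalone simulation (the `LeakTracker` class is illustrative, reusing the same >15% / <5% thresholds) shows a steadily growing heap tripping the score while a stable heap never accumulates one:

```typescript
// Standalone model of the leak-scoring heuristic: score +1 when heap
// grew >15% over the baseline, -1 (floored at 0) when it returned to
// within 5%. A score above 3 triggers recycling.
class LeakTracker {
  private baseline: number | null = null;
  score = 0;

  afterExecution(heapUsed: number): void {
    if (this.baseline === null) {
      this.baseline = heapUsed; // first execution sets the baseline
      return;
    }
    const growth = (heapUsed - this.baseline) / this.baseline;
    if (growth > 0.15) this.score += 1;
    else if (growth < 0.05) this.score = Math.max(0, this.score - 1);
  }

  get leakDetected(): boolean {
    return this.score > 3;
  }
}

// A leaking process: heap grows ~20 MB per execution and never returns.
const leaky = new LeakTracker();
for (const heap of [100e6, 125e6, 145e6, 165e6, 185e6, 205e6]) {
  leaky.afterExecution(heap);
}
console.log(leaky.leakDetected); // true

// A healthy process: heap wobbles during work but stays near baseline.
const healthy = new LeakTracker();
for (const heap of [100e6, 104e6, 103e6, 102e6, 104e6, 103e6]) {
  healthy.afterExecution(heap);
}
console.log(healthy.leakDetected); // false
```

Note this catches slow leaks well before any absolute threshold fires, which is exactly the gap the execution-count backstop leaves open.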