
issues with processKeepAlive #2716

@andreasasprou

Description

We've identified two related issues with processKeepAlive that stem from a fundamental design limitation: recycling decisions are based on execution count, not actual resource consumption. This document outlines the issues we've experienced and proposes a robust fix: memory-aware recycling as the primary mechanism.


Issues We're Experiencing

Issue 1: SIGSEGV Crashes in Reused Processes

Symptoms:

  • Task crashed with TASK_PROCESS_SIGSEGV after ~8.8 seconds of execution
  • The same task succeeded when replayed 8 hours later (after process was recycled)
  • No obvious bug in our code; the crash appeared to be in native code

What we observed:

  1. A task (Task A) ran and likely caused subtle memory corruption in a native addon
  2. The process appeared "healthy" (IPC still connected)
  3. Our task (Task B) reused the corrupted process
  4. ~8.8 seconds in, an access inside native code triggered the SIGSEGV

Issue 2: Increased OOM Errors

Symptoms:

  • Significantly more TASK_PROCESS_OOM_KILLED errors since enabling processKeepAlive
  • Tasks that previously succeeded now fail with OOM
  • Memory usage grows over consecutive task executions

Root Cause Analysis (hypothesis)

The Fundamental Problem

The current implementation makes recycling decisions based on execution count alone:

// packages/cli-v3/src/entryPoints/managed/taskRunProcessProvider.ts:284-288
private shouldReusePersistentProcess(): boolean {
  return (
    !!this.persistentProcess &&
    this.executionCount < this.processKeepAliveMaxExecutionCount &&
    this.persistentProcess.isHealthy  // Only checks IPC connectivity
  );
}

The health check is minimal:

// packages/cli-v3/src/executions/taskRunProcess.ts:498-507
get isHealthy() {
  if (!this._child) return false;
  if (this.isBeingKilled || this.isBeingSuspended) return false;
  return this._child.connected;  // Only checks IPC connection!
}

Why This Is Problematic

  1. Blind to memory consumption: A process could be at 90% of machine memory after execution 3 yet still be reused 47 more times under the default cap, making an eventual OOM all but certain.

  2. Can't detect native code corruption: Memory corruption in native addons (Prisma engine, sharp, OTEL) doesn't affect IPC connectivity until it's too late.

  3. One-size-fits-all defaults: maxExecutionsPerProcess: 50 is too high for smaller machines with less memory headroom.

  4. No leak detection: Slow memory leaks accumulate over many executions without any detection mechanism.


Proposed Solution: Memory-Aware Recycling

The most robust fix is to make memory utilization the primary recycling trigger, with execution count as a secondary backstop.

Core Principle

shouldRecycle =
  executionCount >= maxExecutions ||           // Backstop
  heapUsed > maxHeapThreshold ||               // Absolute limit
  rss / machineMemory > maxMemoryUtilization || // Relative limit
  memoryLeakDetected;                          // Growth rate abnormal
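
For illustration, the same predicate written as a standalone TypeScript function (the names RecycleInputs and shouldRecycle are illustrative, not the provider's actual API):

// Sketch only: a pure predicate combining the four triggers above.
interface RecycleInputs {
  executionCount: number;
  maxExecutions: number;
  heapUsedBytes: number;
  maxHeapUsedBytes: number | null; // absolute heap cap, if configured
  rssBytes: number;
  machineMemoryBytes: number;
  maxMemoryUtilization: number;    // 0-1, e.g. 0.7 = 70%
  memoryLeakDetected: boolean;
}

function shouldRecycle(i: RecycleInputs): boolean {
  return (
    i.executionCount >= i.maxExecutions ||                                   // backstop
    (i.maxHeapUsedBytes !== null && i.heapUsedBytes > i.maxHeapUsedBytes) || // absolute limit
    i.rssBytes / i.machineMemoryBytes > i.maxMemoryUtilization ||            // relative limit
    i.memoryLeakDetected                                                     // growth rate abnormal
  );
}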

Why This Approach

  1. Addresses root cause: Memory pressure is the actual problem, not execution count. This measures what matters.

  2. Self-adjusting: A task using 50MB gets many reuses. A task using 500MB gets recycled sooner. No user configuration needed.

  3. Leak detection: Memory growth tracking catches slow leaks that wouldn't trigger absolute thresholds.

  4. Sensible defaults: Smaller machines automatically get more conservative recycling.

  5. Backwards compatible: Execution count remains as a backstop. Existing configs still work.

  6. Observable: Debug logs show exactly why recycling happened.

  7. Low overhead: Memory checks are cheap. IPC memory stats can be throttled (every 5s).


Implementation Details

1. Update TaskRunProcessProvider Options

// packages/cli-v3/src/entryPoints/managed/taskRunProcessProvider.ts

export interface TaskRunProcessProviderOptions {
  workerManifest: WorkerManifest;
  env: RunnerEnv;
  logger: RunLogger;
  processKeepAliveEnabled: boolean;
  processKeepAliveMaxExecutionCount: number;

  // NEW: Memory-based recycling options
  processKeepAliveMaxMemoryUtilization: number;  // e.g., 0.7 = 70%
  processKeepAliveMaxHeapUsedBytes: number | null;
  machineMemoryBytes: number;
}

2. Enhanced Recycling Logic

// packages/cli-v3/src/entryPoints/managed/taskRunProcessProvider.ts

export class TaskRunProcessProvider {
  // ... existing fields ...

  // NEW: Track memory baseline for leak detection
  private baselineMemory: number | null = null;
  private memoryLeakScore: number = 0;

  private shouldReusePersistentProcess(): boolean {
    if (!this.persistentProcess || !this.persistentProcess.isHealthy) {
      return false;
    }

    // Execution count check (backstop)
    if (this.executionCount >= this.processKeepAliveMaxExecutionCount) {
      this.sendDebugLog("Max executions reached, recycling", {
        executionCount: this.executionCount,
      });
      return false;
    }

    // NEW: Memory utilization check (primary)
    const memoryUsage = this.persistentProcess.getMemoryUsage();
    if (memoryUsage) {
      const memoryUtilization = memoryUsage.rss / this.machineMemoryBytes;

      if (memoryUtilization > this.processKeepAliveMaxMemoryUtilization) {
        this.sendDebugLog("Memory utilization exceeded threshold, recycling", {
          rss: memoryUsage.rss,
          machineMemory: this.machineMemoryBytes,
          utilization: memoryUtilization,
          threshold: this.processKeepAliveMaxMemoryUtilization,
        });
        return false;
      }

      // NEW: Absolute heap limit check
      if (this.processKeepAliveMaxHeapUsedBytes &&
          memoryUsage.heapUsed > this.processKeepAliveMaxHeapUsedBytes) {
        this.sendDebugLog("Heap usage exceeded absolute threshold, recycling", {
          heapUsed: memoryUsage.heapUsed,
          threshold: this.processKeepAliveMaxHeapUsedBytes,
        });
        return false;
      }

      // NEW: Memory leak detection
      if (this.memoryLeakScore > 3) {
        this.sendDebugLog("Memory leak pattern detected, recycling", {
          leakScore: this.memoryLeakScore,
        });
        return false;
      }
    }

    return true;
  }

  async returnProcess(process: TaskRunProcess): Promise<void> {
    // ... existing logic ...

    if (this.processKeepAliveEnabled && this.shouldKeepProcessAlive(process)) {
      // NEW: Track memory delta for leak detection
      const memoryUsage = process.getMemoryUsage();
      if (memoryUsage && this.baselineMemory !== null) {
        const memoryGrowth = memoryUsage.heapUsed - this.baselineMemory;
        const growthPercent = memoryGrowth / this.baselineMemory;

        // If memory grew >15% and didn't return, likely a leak
        if (growthPercent > 0.15) {
          this.memoryLeakScore += 1;
          this.sendDebugLog("Memory growth detected after execution", {
            baseline: this.baselineMemory,
            current: memoryUsage.heapUsed,
            growthPercent,
            leakScore: this.memoryLeakScore,
          });
        } else if (growthPercent < 0.05) {
          // Memory returned to normal, reduce leak score
          this.memoryLeakScore = Math.max(0, this.memoryLeakScore - 1);
        }
      }

      // Set baseline after first execution
      if (this.baselineMemory === null && memoryUsage) {
        this.baselineMemory = memoryUsage.heapUsed;
      }

      // ... rest of existing logic ...
    }
  }
}
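
The leak detection above is a simple hysteresis counter: heap growth of more than 15% over the post-first-execution baseline increments the score, a return to within 5% of baseline decrements it, and a score above 3 triggers recycling. A standalone sketch of that behaviour (updateLeakScore is a hypothetical helper, not part of the proposed class):

// Illustrative hysteresis counter, equivalent to the returnProcess() logic above.
function updateLeakScore(score: number, baselineHeap: number, heapUsed: number): number {
  const growthPercent = (heapUsed - baselineHeap) / baselineHeap;
  if (growthPercent > 0.15) return score + 1;              // grew and didn't come back: suspicious
  if (growthPercent < 0.05) return Math.max(0, score - 1); // back near baseline: forgive
  return score;                                            // gray zone: unchanged
}

// Example: heap ends up 20/40/60/80% above baseline after four consecutive executions
let score = 0;
const baselineHeap = 100 * 1024 * 1024; // 100 MB, measured after the first execution
for (const heapMB of [120, 140, 160, 180]) {
  score = updateLeakScore(score, baselineHeap, heapMB * 1024 * 1024);
}
// score === 4 > 3, so the next shouldReusePersistentProcess() call recycles the process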

3. Add Memory Reporting to TaskRunProcess

// packages/cli-v3/src/executions/taskRunProcess.ts

export class TaskRunProcess {
  private _lastMemoryUsage: {
    heapUsed: number;
    heapTotal: number;
    rss: number;
    external: number;
  } | null = null;

  // NEW: Get cached memory usage from child process
  getMemoryUsage() {
    return this._lastMemoryUsage;
  }

  initialize() {
    // ... existing code ...

    this._ipc = new ZodIpcConnection({
      listenSchema: ExecutorToWorkerMessageCatalog,
      emitSchema: WorkerToExecutorMessageCatalog,
      handlers: {
        // ... existing handlers ...

        // NEW: Memory stats handler
        MEMORY_STATS: async (message) => {
          this._lastMemoryUsage = {
            heapUsed: message.heapUsed,
            heapTotal: message.heapTotal,
            rss: message.rss,
            external: message.external,
          };
        },
      },
    });

    // ... rest of existing code ...
  }
}
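
One caveat: getMemoryUsage() returns the most recent sample pushed by the child, so it can be up to one reporting interval (5 seconds, see below) stale. That should be acceptable here because recycling decisions are made between executions, not mid-run.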

4. Worker-Side Memory Reporting

// In worker entry point (e.g., packages/cli-v3/src/entryPoints/managed-index-worker.ts)

// Report memory stats periodically
const memoryReportInterval = setInterval(() => {
  const mem = process.memoryUsage();
  ipc.send("MEMORY_STATS", {
    heapUsed: mem.heapUsed,
    heapTotal: mem.heapTotal,
    rss: mem.rss,
    external: mem.external,
  });
}, 5000);  // Every 5 seconds

// Clean up on exit
process.on("exit", () => {
  clearInterval(memoryReportInterval);
});

5. Add Message to IPC Schema

// packages/core/src/v3/zodIpc.ts or relevant schema file

// Add to ExecutorToWorkerMessageCatalog
MEMORY_STATS: z.object({
  heapUsed: z.number(),
  heapTotal: z.number(),
  rss: z.number(),
  external: z.number(),
}),

6. Sensible Defaults by Machine Size

// packages/core/src/v3/machines.ts

export const processKeepAliveDefaults: Record<MachinePresetName, {
  maxExecutions: number;
  maxMemoryUtilization: number;
}> = {
  'micro':      { maxExecutions: 5,  maxMemoryUtilization: 0.60 },
  'small-1x':   { maxExecutions: 10, maxMemoryUtilization: 0.65 },
  'small-2x':   { maxExecutions: 15, maxMemoryUtilization: 0.70 },
  'medium-1x':  { maxExecutions: 25, maxMemoryUtilization: 0.70 },
  'medium-2x':  { maxExecutions: 35, maxMemoryUtilization: 0.75 },
  'large-1x':   { maxExecutions: 50, maxMemoryUtilization: 0.75 },
  'large-2x':   { maxExecutions: 50, maxMemoryUtilization: 0.80 },
};
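
These defaults would be resolved when the provider is constructed. A hypothetical sketch of that glue (resolveKeepAliveOptions does not exist today, and the import assumes the processKeepAliveDefaults table above gets exported from the core package):

// Hypothetical: merge per-machine defaults with any user overrides.
import { processKeepAliveDefaults } from "@trigger.dev/core/v3"; // assumes the table above is exported
import type { MachinePresetName } from "@trigger.dev/core/v3";

function resolveKeepAliveOptions(
  preset: MachinePresetName,
  machineMemoryBytes: number,
  overrides?: { maxExecutionsPerProcess?: number; maxMemoryUtilization?: number }
) {
  const defaults = processKeepAliveDefaults[preset];

  return {
    processKeepAliveMaxExecutionCount:
      overrides?.maxExecutionsPerProcess ?? defaults.maxExecutions,
    processKeepAliveMaxMemoryUtilization:
      overrides?.maxMemoryUtilization ?? defaults.maxMemoryUtilization,
    processKeepAliveMaxHeapUsedBytes: null, // no absolute heap cap unless the user sets one
    machineMemoryBytes,
  };
}

// e.g. a small-2x machine (1 GB): recycle at >70% RSS or after 15 executions
const options = resolveKeepAliveOptions("small-2x", 1024 * 1024 * 1024);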

7. Updated Configuration API

// packages/core/src/v3/config.ts

type ProcessKeepAlive =
  | boolean
  | {
      enabled: boolean;

      // Existing (becomes backstop)
      maxExecutionsPerProcess?: number;
      devMaxPoolSize?: number;

      // NEW: Memory-based recycling (primary)
      maxMemoryUtilization?: number;     // 0-1, default varies by machine size
      maxHeapUsedMB?: number;            // Absolute limit in MB
      maxRssMB?: number;                 // Absolute RSS limit in MB

      // NEW: Leak detection
      detectMemoryLeaks?: boolean;       // Default: true
      memoryLeakThreshold?: number;      // Growth % to flag as leak, default: 0.15
    };
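
From the user's side this could look like the following in trigger.config.ts (a sketch with illustrative values; the exact key name follows however processKeepAlive is currently surfaced in the config, which may differ):

// trigger.config.ts (illustrative values only)
import { defineConfig } from "@trigger.dev/sdk/v3";

export default defineConfig({
  project: "proj_xxxxxxxx", // placeholder project ref
  processKeepAlive: {
    enabled: true,
    maxExecutionsPerProcess: 25, // backstop only
    maxMemoryUtilization: 0.7,   // recycle once RSS exceeds 70% of machine memory
    maxHeapUsedMB: 512,          // absolute heap cap
    detectMemoryLeaks: true,
    memoryLeakThreshold: 0.15,
  },
});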

What This Solution Doesn't Address

Native code corruption that doesn't manifest as memory growth - This is fundamentally hard to detect. A SIGSEGV might still occur if native code enters a bad state without increasing memory usage.

For this edge case, additional features could help:

  • Per-task process isolation option: Allow tasks with known-risky native deps to opt-out of reuse
  • Process sandboxing with snapshot/restore: More complex, but would provide clean state guarantees

However, memory-aware recycling would catch 90%+ of real-world issues since memory pressure is the primary symptom.
