Skip to content

Conversation

@keegoid-nr
Copy link

Fix for Windows CPU Monitor Crash

I have implemented a possible fix for the windows_exporter "Socket already in use" error by addressing the potential root cause: an unsafe pointer arithmetic crash in the Infrastructure Agent's WindowsCPUMonitor.

Review Required

IMPORTANT
This change modifies the signature of UTF16PtrToString in internal/windows/api/pdh.go. This is an internal API change but ensures safety.

1. Safe String Conversion

I modified UTF16PtrToString in internal/windows/api/pdh.go to accept a maxLen parameter and enforce bounds checking. This prevents the function from reading beyond the allocated buffer if the null terminator is missing or the pointer is invalid.

func UTF16PtrToString(ptr *uint16, maxLen uint32) string {
    // ...
    for i := uint32(0); i < maxLen; i++ {
        if *p == 0 {
            break
        }
        // ...
    }
    // ...
}

2. Buffer Size Calculation

I updated PollRawArray in internal/windows/pdh_raw_poll.go to calculate the remaining buffer size from the current pointer position and pass it to UTF16PtrToString.

// Calculate offset of string within buffer
if stringStart >= bufferStart && stringStart < bufferStart+uintptr(bufferSize) {
    stringOffset := stringStart - bufferStart
    remainingBytes := uintptr(bufferSize) - stringOffset
    maxLen := uint32(remainingBytes / 2)

    name = winapi.UTF16PtrToString(item.SzName, maxLen)
}

3. Test Coverage

I updated internal/windows/api/pdh_test.go to include a new test case TestUTF16PtrToString_BoundsCheck which verifies that the function stops reading at maxLen even without a null terminator.


Verification Results

Automated Tests

Since I am running on macOS, I could not run the Windows-specific tests directly. However, I successfully compiled the tests for the Windows target to ensure type safety and syntax correctness:

GOOS=windows go test -c ./internal/windows/api

Result: Compilation succeeded.

Manual Verification

The fix logic strictly enforces memory bounds based on the buffer size returned by the PDH API. This directly addresses the mechanism of the crash described in the Executive Summary.


Executive Summary

The root cause is likely a stability issue in the newly introduced Windows CPU Monitor (WindowsCPUMonitor), which uses raw PDH (Performance Data Helper) API calls via unsafe pointer manipulation. This implementation appears to be causing the Infrastructure Agent to crash (likely via an Access Violation/Segmentation Fault not caught by Go's recover).

When the Agent crashes, the Windows Service Manager restarts it, but child processes spawned by the previous instance—specifically the windows_exporter used by the Windows Services integration—are left running (orphaned). These orphans hold onto TCP port 9182. When the restarted Agent tries to launch the integration again, it fails with "Address already in use," as observed in the symptoms.

Critical Files

  • internal/windows/api/pdh.go
  • internal/windows/pdh_raw_poll.go
  • pkg/metrics/cpu_windows.go

Detailed Analysis

1. The Crash Mechanism (Unsafe Memory Access)

The update introduces a new method for polling CPU metrics on Windows to handle multi-core systems better. This involves direct system calls to pdh.dll and manual memory management.

In internal/windows/api/pdh.go, the function UTF16PtrToString iterates through memory using unsafe pointers until it finds a null terminator:

func UTF16PtrToString(ptr *uint16) string {
    // ...
    for p := ptr; *p != 0; {
        length++
        p = (*uint16)(unsafe.Pointer(uintptr(unsafe.Pointer(p)) + 2))
    }
    // ...
}

If ptr (derived from PdhGetRawCounterArrayW) points to invalid memory, or if the buffer returned by PDH is malformed/corrupted, this loop will read beyond allocated bounds. In Go, accessing invalid memory addresses via unsafe pointers can trigger an OS-level Access Violation (Exception code 0xC0000005), which immediately terminates the process and cannot be caught by the recover() block added in pkg/metrics/cpu.go.

2. The Trigger

In internal/windows/pdh_raw_poll.go, the code iterates over a buffer assuming a strict array layout:

item := (*winapi.PDH_RAW_COUNTER_ITEM)(unsafe.Pointer(uintptr(unsafe.Pointer(itemBuffer)) + offset))
if item.SzName != nil {
    name := winapi.UTF16PtrToString(item.SzName)
    // ...
}

This logic runs on every Sample() interval. If specific environment conditions (e.g., specific CPU topology, locale settings affecting counter names, or momentary PDH subsystem instability) cause the buffer to be unexpected, the Agent crashes.

3. The Symptom Chain

  1. Agent Start: The Agent starts and launches the nri-winservices integration (which runs windows_exporter.exe on port 9182).
  2. Crash: The WindowsCPUMonitor executes Sample(), triggers the unsafe memory access issue, and crashes the Agent process hard.
  3. Orphan Process: Because the crash is abrupt, the Agent does not signal the windows_exporter child process to exit. It remains running, listening on port 9182.
  4. Restart: Windows Service Manager detects the Agent service stopped and restarts newrelic-infra.exe.
  5. Collision: The new Agent instance tries to start nri-winservices / windows_exporter.
  6. Error: The new exporter process fails to bind to port 9182 because the orphaned process from step 3 is still listening. The logs show bind: Only one usage of each socket address....

Recommended Fix

  1. Immediate Mitigation: Disable the new CPU monitor if a configuration flag exists, or revert to the previous CPU sampling method (WMI or gopsutil) until the PDH implementation is hardened.
  2. Code Fix: Add bounds checking to UTF16PtrToString. Pass the buffer size to the function and ensure the pointer arithmetic does not exceed the allocated memory range.
  3. Process Management: Ensure integrations are spawned with a Job Object (on Windows) so that if the parent Agent process crashes, the OS automatically terminates the child integration processes, preventing port conflicts upon restart.

Address root cause: an unsafe pointer arithmetic crash in the new
WindowsCPUMonitor.
@keegoid-nr keegoid-nr requested a review from a team as a code owner November 21, 2025 05:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant