windows_exporter socket in use #2153
Open
+63
−17
Fix for Windows CPU Monitor Crash
I have implemented a possible fix for the windows_exporter "Socket already in use" error by addressing the potential root cause: an unsafe pointer arithmetic crash in the Infrastructure Agent's WindowsCPUMonitor.
Review Required
1. Safe String Conversion
I modified UTF16PtrToString in internal/windows/api/pdh.go to accept a maxLen parameter and enforce bounds checking. This prevents the function from reading beyond the allocated buffer if the null terminator is missing or the pointer is invalid.
2. Buffer Size Calculation
I updated PollRawArray in internal/windows/pdh_raw_poll.go to calculate the remaining buffer size from the current pointer position and pass it to UTF16PtrToString.
3. Test Coverage
I updated internal/windows/api/pdh_test.go to include a new test case, TestUTF16PtrToString_BoundsCheck, which verifies that the function stops reading at maxLen even without a null terminator.
Verification Results
Automated Tests
Since I am running on macOS, I could not run the Windows-specific tests directly. However, I successfully compiled the tests for the Windows target to ensure type safety and syntax correctness:
GOOS=windows go test -c ./internal/windows/api
Result: Compilation succeeded.
Manual Verification
The fix logic strictly enforces memory bounds based on the buffer size returned by the PDH API. This directly addresses the mechanism of the crash described in the Executive Summary.
Executive Summary
The root cause is likely a stability issue in the newly introduced Windows CPU Monitor (WindowsCPUMonitor), which uses raw PDH (Performance Data Helper) API calls via unsafe pointer manipulation. This implementation appears to be causing the Infrastructure Agent to crash (likely via an Access Violation/Segmentation Fault not caught by Go's recover).
When the Agent crashes, the Windows Service Control Manager restarts it, but child processes spawned by the previous instance, specifically the windows_exporter used by the Windows Services integration, are left running (orphaned). These orphans hold onto TCP port 9182. When the restarted Agent tries to launch the integration again, it fails with "Address already in use," as observed in the symptoms.
Critical Files
internal/windows/api/pdh.go
internal/windows/pdh_raw_poll.go
pkg/metrics/cpu_windows.go
Detailed Analysis
1. The Crash Mechanism (Unsafe Memory Access)
The update introduces a new method for polling CPU metrics on Windows to handle multi-core systems better. This involves direct calls into pdh.dll and manual memory management.
In internal/windows/api/pdh.go, the function UTF16PtrToString iterates through memory using unsafe pointers until it finds a null terminator.
If ptr (derived from PdhGetRawCounterArrayW) points to invalid memory, or if the buffer returned by PDH is malformed/corrupted, this loop will read beyond allocated bounds. In Go, accessing invalid memory addresses via unsafe pointers can trigger an OS-level Access Violation (exception code 0xC0000005), which immediately terminates the process and cannot be caught by the recover() block added in pkg/metrics/cpu.go.
2. The Trigger
In internal/windows/pdh_raw_poll.go, the code iterates over a buffer assuming a strict array layout.
This logic runs on every Sample() interval. If environment conditions (e.g., a particular CPU topology, locale settings affecting counter names, or momentary PDH subsystem instability) cause the buffer layout to differ from what the code expects, the Agent crashes.
3. The Symptom Chain
1. The Agent starts the nri-winservices integration (which runs windows_exporter.exe on port 9182).
2. WindowsCPUMonitor executes Sample(), triggers the unsafe memory access issue, and crashes the Agent process hard.
3. The crash does not cause the windows_exporter child process to exit. It remains running, listening on port 9182.
4. The Windows Service Control Manager restarts newrelic-infra.exe.
5. The restarted Agent tries to launch nri-winservices/windows_exporter.
6. The launch fails with bind: Only one usage of each socket address....
Recommended Fix
1. gopsutil) until the PDH implementation is hardened.
2. UTF16PtrToString. Pass the buffer size to the function and ensure the pointer arithmetic does not exceed the allocated memory range.
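The bounds-checked conversion recommended above could look roughly like the following sketch. The function name utf16PtrToStringBounded and its exact signature are assumptions for illustration, not the code in this PR. It relies on unsafe.Slice (Go 1.17+) to limit the view to maxLen code units.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unsafe"
)

// utf16PtrToStringBounded is a hypothetical stand-in for the PR's
// UTF16PtrToString. It reads at most maxLen UTF-16 code units starting at
// ptr and stops early at the first null terminator.
func utf16PtrToStringBounded(ptr *uint16, maxLen int) string {
	if ptr == nil || maxLen <= 0 {
		return ""
	}
	// View exactly maxLen code units; nothing beyond this slice is read.
	units := unsafe.Slice(ptr, maxLen)
	for i, u := range units {
		if u == 0 {
			units = units[:i] // cut at the terminator
			break
		}
	}
	return string(utf16.Decode(units))
}

func main() {
	// A counter name with no null terminator inside the buffer.
	buf := []uint16{'_', 'T', 'o', 't', 'a', 'l'}
	fmt.Println(utf16PtrToStringBounded(&buf[0], len(buf)))
}
```

The bound only helps when the caller passes a maxLen that reflects the real buffer size. A pointer that does not point into the buffer at all is still unsafe to dereference.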