Contributor

@npow commented Oct 22, 2025

Problem

Intermittent JSONDecodeError when multiple environments are resolved concurrently during deployment:

json.decoder.JSONDecodeError: Expecting value: line 1 column 86673 (char 86672)

Root Cause

Race condition in FIFO-based IPC between deployer subprocess and parent process:

  1. Writer side: Subprocess writes JSON to FIFO, but Python's buffered I/O may not flush immediately
  2. Reader side: Parent process reads from FIFO in non-blocking mode
  3. Race: When subprocess exits quickly after close(), reader detects process exit and breaks on empty read
  4. Problem: OS kernel may still have buffered data in pipe that hasn't been delivered yet
  5. Result: Truncated JSON at arbitrary positions (~86KB in the error case)

Solution

Changed read_from_fifo_when_ready() to use a hybrid approach:

  1. Start in non-blocking mode (existing behavior)
  • Use select.poll() to wait for data
  • Can detect subprocess failures early
  • Can timeout if subprocess hangs
  2. Switch to blocking mode once the first data arrives
  • Use fcntl() to remove O_NONBLOCK flag
  • Continue with blocking read() calls
  • POSIX guarantee: Blocking read() returns EOF (0 bytes) ONLY after writer closes AND all kernel pipe buffers are drained
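
The two phases above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `read_fifo_hybrid` and the parameters `timeout_s` and `poll_ms` are hypothetical, and the subprocess liveness check mentioned above is omitted for brevity.

```python
import fcntl
import os
import select
import time


def read_fifo_hybrid(fifo_fd, timeout_s=10.0, poll_ms=100):
    """Hypothetical helper: poll until the first data, then block until EOF.

    Assumes fifo_fd was opened (or set) non-blocking by the caller.
    """
    poller = select.poll()
    poller.register(fifo_fd, select.POLLIN)
    deadline = time.monotonic() + timeout_s
    content = b""

    while True:
        # Phase 1: non-blocking wait, so a hung writer can be timed out.
        if time.monotonic() > deadline:
            raise TimeoutError("no data arrived on the FIFO in time")
        events = poller.poll(poll_ms)
        if not events:
            continue
        try:
            chunk = os.read(fifo_fd, 8192)
        except BlockingIOError:
            continue  # spurious wakeup, poll again
        if not chunk:
            break  # event but zero bytes: writer closed without sending data
        content += chunk

        # Phase 2: clear O_NONBLOCK and read until a true EOF.
        flags = fcntl.fcntl(fifo_fd, fcntl.F_GETFL)
        fcntl.fcntl(fifo_fd, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)
        while True:
            chunk = os.read(fifo_fd, 8192)
            if not chunk:
                break  # empty blocking read: all writers closed, buffer drained
            content += chunk
        break

    return content
```

The point of the design is that once the descriptor is blocking, an empty read can no longer be confused with "no data yet", so the reader does not need to consult the subprocess exit status to decide when to stop.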

@nflx-mf-bot
Collaborator

Netflix internal testing[1398] @ 7594e0f

        # All data read, exit main loop
        break
    else:
        if len(events):

Contributor

When does this happen? So we got some event (like file close?) and no data?

Contributor Author

If poll() returned an event (any event) AND read() returned 0 bytes then it must be EOF:

  • If it was POLLIN (data ready), then read() would have returned data
  • If read() returned 0 bytes despite an event, it must be POLLHUP (writer closed) or a stale POLLIN that resolved to EOF
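
A small experiment along these lines shows the "event but no data" case. It is only a sketch: it uses an anonymous `os.pipe()` as a stand-in for the FIFO, and the exact event flags reported by `poll()` can vary by platform, so the code only relies on an event being reported.

```python
import os
import select

# Stand-in for the FIFO: an anonymous pipe with a non-blocking read end.
r, w = os.pipe()
os.set_blocking(r, False)

os.write(w, b"payload")
os.close(w)  # writer goes away while data is still in the kernel buffer

poller = select.poll()
poller.register(r, select.POLLIN)

# First wakeup: data is pending, so read() returns it.
first_events = poller.poll(1000)
first = os.read(r, 8192)

# Second wakeup: the buffer is empty and no writer remains, so poll()
# still reports an event (commonly POLLHUP) and read() returns zero bytes.
second_events = poller.poll(1000)
second = os.read(r, 8192)

os.close(r)
```

Here `first` holds the written bytes and `second` is empty even though `second_events` is non-empty, which is the branch the question asks about.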

@savingoyal force-pushed the npow/fix-json-decode-error branch from 7594e0f to ba8226f on October 30, 2025 17:30
Comment on lines +139 to +147
    # Now do blocking reads until true EOF
    while True:
        chunk = os.read(fifo_fd, 8192)
        if not chunk:
            # True EOF - all data drained
            break
        content += chunk
    # All data read, exit main loop
    break

This is a very unit-testable chunk of code. Can we add a unit test that:

  1. Triggers the scenario that caused the error.
  2. Fails before this code change.
  3. Passes after this fix.

Contributor Author

We could add unit tests for this chunk of code, but the actual bug is difficult to reproduce reliably because it depends on a race condition: the writer has to close while the kernel buffers still hold data.
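
One possible shape for such a test is sketched below. It does not reproduce the race deterministically. It only checks the invariant that the full payload arrives when the writer exits immediately after writing more than a pipe buffer's worth of data. All names here (`run_fast_exit_writer`, `blocking_reader`, `PAYLOAD_CHARS`) are hypothetical, an anonymous pipe stands in for the FIFO, and the reader under test would be swapped in for `blocking_reader`.

```python
import json
import os
import subprocess
import sys

# Assumed size: larger than a typical 64 KiB pipe buffer.
PAYLOAD_CHARS = 200_000

# Child program: write one JSON document to an inherited fd, then exit at once.
WRITER_CODE = (
    "import json, os, sys\n"
    "with os.fdopen(int(sys.argv[1]), 'wb') as f:\n"
    "    f.write(json.dumps({'k': 'x' * int(sys.argv[2])}).encode())\n"
)


def blocking_reader(fd):
    """Reference reader: blocking reads until an empty read signals EOF."""
    content = b""
    while True:
        chunk = os.read(fd, 8192)
        if not chunk:
            return content
        content += chunk


def run_fast_exit_writer(reader):
    """Run a short-lived writer subprocess and parse what `reader` collects."""
    r, w = os.pipe()
    proc = subprocess.Popen(
        [sys.executable, "-c", WRITER_CODE, str(w), str(PAYLOAD_CHARS)],
        pass_fds=(w,),
    )
    os.close(w)  # the parent's copy must be closed, or EOF never arrives
    try:
        raw = reader(r)
    finally:
        os.close(r)
        proc.wait()
    # A truncated payload would raise json.JSONDecodeError here.
    return json.loads(raw)
```

Calling `run_fast_exit_writer(blocking_reader)` should return a dict whose `"k"` value has `PAYLOAD_CHARS` characters. A reader that stops on process exit rather than on EOF could fail the same check intermittently.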
