- 
                Notifications
    You must be signed in to change notification settings 
- Fork 931
Fix json decode error #2653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Fix json decode error #2653
Conversation
| Netflix internal testing[1398] @ 7594e0f | 
| # All data read, exit main loop | ||
| break | ||
| else: | ||
| if len(events): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When does this happen? So we got some event (like file close?) and no data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If poll() returned an event (any event) AND read() returned 0 bytes then it must be EOF:
- If it was POLLIN (data ready), then read() would have returned data
- If read() returned 0 bytes despite an event, it must be POLLHUP (writer closed) or a stale POLLIN that resolved to EOF
7594e0f    to
    ba8226f      
    Compare
  
    | # Now do blocking reads until true EOF | ||
| while True: | ||
| chunk = os.read(fifo_fd, 8192) | ||
| if not chunk: | ||
| # True EOF - all data drained | ||
| break | ||
| content += chunk | ||
| # All data read, exit main loop | ||
| break | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very unit-testable chunk of code. Can we add unit test to:
- To trigger the scenario that caused the error.
- Verify the unit test fails before this code change.
- Verify the unit test passes after this fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add unit tests for this chunk of code, but the actual bug is difficult to reproduce reliably because of a race condition: the writer has to close while kernel buffers still has data
Problem
Intermittent
JSONDecodeErrorwhen multiple environments are resolved concurrently during deployment:Root Cause
Race condition in FIFO-based IPC between deployer subprocess and parent process:
close(), reader detects process exit and breaks on empty readSolution
Changed
read_from_fifo_when_ready()to use a hybrid approach:select.poll()to wait for datafcntl()to removeO_NONBLOCKflagread()callsread()returns EOF (0 bytes) ONLY after writer closes AND all kernel pipe buffers are drained