fix(agent): handle None response from retry exhaustion to prevent crashes#787
fix(agent): handle None response from retry exhaustion to prevent crashes#787gajeshbhat wants to merge 2 commits intocanonical:mainfrom
Conversation
c04cb3b to
0e8cab0
Compare
|
CI failed because of the policy check. Need maintainers' approval to rerun. |
|
@pedro-avalos @rene-oromtz Can you kindly look at this PR and let me know if my submission will be accepted? |
|
@boukeas Thank you for reviewing my other PR. I have just one more PR up, which is this one. I understand you folks might have other higher-priority items and totally understand if this has to wait, but if you have some comments/questions about this one, I can address them in the meantime. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #787 +/- ##
==========================================
+ Coverage 73.54% 73.59% +0.05%
==========================================
Files 109 109
Lines 10217 10226 +9
Branches 882 885 +3
==========================================
+ Hits 7514 7526 +12
+ Misses 2515 2513 -2
+ Partials 188 187 -1
*This pull request uses carry forward flags. Click here to find out more.
🚀 New features to boost your workflow:
|
|
Run charm test CI check hangs and did not run successfully. The tests I added only cover the None cases (when job_request is None), but the else branches (when job_request is a Response object with status >= 400) are NOT covered yet. I will add tests to enchance coverage. |
565c7be to
7dd9c57
Compare
7dd9c57 to
60ad9aa
Compare
|
Rebased and fixed conflicts. CI Checks pass. This has been open for a while, Can this get a review ? CC: @rene-oromtz @boukeas |
ajzobro
left a comment
There was a problem hiding this comment.
Thank you for your contribution and your patience, we are sorry it has taken so long to get around to reviewing it.
Please add tests for:
post_result() with a response object that has a status_code (lines 202-207)
get_result() with a response object that has a status_code (lines 233-237)
save_artifacts() with a response object that has a status_code (lines 335-340)
| ) | ||
| raise TFServerError("No response received") | ||
| else: | ||
| logger.error( |
There was a problem hiding this comment.
Thank you for adding tests for the new None branch, but could you please also add tests for these other else branches where the responses have status codes.
| "No response received", | ||
| ) | ||
| else: | ||
| logger.error( |
| ) | ||
| raise TFServerError("No response received") | ||
| else: | ||
| logger.error( |
agent/tests/test_client.py
Outdated
| job_id = str(uuid.uuid1()) | ||
|
|
||
| # Mock the session.post to return None (simulating retry exhaustion) | ||
| with patch.object(client.session, "post", return_value=None): |
There was a problem hiding this comment.
We have been moving away from using the with patch syntax in favor of using @patch. decorators. See Rene's feedback on my PR #915 for example.
There was a problem hiding this comment.
Sounds good. I moved my tests to @patch
Thank you for taking a look. I made some changes and pushed them. Let me know if you have any questions. |
@ajzobro CI Checks pass and I have made some changes. Let me know if you have any further questions. Cheers. |
…shes Fix AttributeError when session.post() returns None after retry exhaustion. When the server returns repeated 503 errors, the requests retry mechanism can return None instead of a Response object, causing agents to crash when accessing status_code attribute. Fixed methods: - post_status_update() - post_result() - get_result() - repost_job() - post_artifacts() Resolves agents showing EXITED status in supervisorctl. Fixes canonical#438 (Fixes #CERTTF-474 Internally) Fix AttributeError when session.post() returns None after retry exhaustion. When the server returns repeated 503 errors, the requests retry mechanism can return None instead of a Response object, causing agents to crash when accessing status_code attribute. Fixed methods: - post_status_update() - post_result() - get_result() - save_artifacts() Resolves agents showing EXITED status in supervisorctl. Fixes canonical#438 (Fixes #CERTTF-474 Internally)
Signed-off-by: Gajesh Bhat <gajeshbht@gmail.com>
2976adc to
9f6be42
Compare
|
@ajzobro Actions workflow failed and might need a rerun with your approval. https://github.com/canonical/testflinger/actions/runs/22130217781/job/63968604644?pr=787 |
Description
Fixes issue #438 where Testflinger agents show EXITED status in supervisorctl and cannot recover themselves.
Problem
When the Testflinger server experiences issues (like repeated 503 errors), the agent's HTTP client retry mechanism can return
Noneinstead of a properResponseobject after exhausting all retries. The existing error handling code checkedif not job_request:but then tried to accessjob_request.status_code, causing anAttributeError: 'NoneType' object has no attribute 'status_code'.This unhandled exception caused the agent process to crash, leading to:
Root Cause
The issue stems from the ambiguity in Boolean evaluation of
requests.Responseobjects:Noneevaluates toFalse(retry exhaustion)Responsewith status >= 400 also evaluates toFalse(HTTP error)Both cases triggered the same error handling path, but only the Response case has a
status_codeattribute.Boolean Evaluation Issue
The
requests.Responseobject implements a custom__bool__method that returnsFalsefor HTTP status codes >= 400 (requests documentation). This creates an ambiguous condition where both cases evaluate as falsy:Requests Retry Behaviour
When using
urllib3.util.retry.Retrywithrequests.Session, the retry mechanism can returnNonein edge cases after exhausting all retry attempts, particularly with network-level failures (urllib3 retry documentation). The original code assumed all HTTP operations would return aResponseobject, but this assumption breaks during severe connectivity issues.The fix described in the section below explicitly distinguishes between these two falsy states:
Solution
Updated error handling in 5 client methods to explicitly check for
Nonebefore accessingstatus_code:post_status_update()- The main method causing reported crashespost_result()get_result()repost_job()post_artifacts()Resolved issues
Fixes #438 (Resolves #CERTTF-474 Internally)
Documentation
Bug fix. No documentation update required
Web service API changes
None. Added better error handling in a bugfix
Tests