@paaragon

Summary

Adds error handling for Ray cluster operations in the job status update command, so the command no longer crashes when the Ray operator loses job references.

Details and comments

This PR adds try-except blocks around Ray cluster API calls in update_jobs_statuses.py:

  1. Status retrieval (line 35): Catches RuntimeError when fetching the job status. If an error occurs, logs the issue and marks the job as FAILED.

  2. Logs retrieval (line 76): Catches RuntimeError when fetching the job logs. If an error occurs, logs a warning and continues execution without crashing.

Impact: Prevents command crashes and provides better observability when Ray cluster loses track of jobs.
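
For readers without the diff at hand, the two handlers described above could look roughly like the sketch below. It is not the code from update_jobs_statuses.py: the helper names `fetch_status_or_fail` and `fetch_logs_or_none`, the `JobStatus` enum, and the `get_status` / `get_logs` callables are hypothetical stand-ins for whatever the command actually calls on the Ray cluster.

```python
import logging
from enum import Enum
from types import SimpleNamespace

logger = logging.getLogger("update_jobs_statuses_sketch")


class JobStatus(str, Enum):
    """Hypothetical status values; the real model may define more."""
    RUNNING = "RUNNING"
    FAILED = "FAILED"


def fetch_status_or_fail(job, get_status):
    """Return the status reported by the cluster, or FAILED if the lookup raises."""
    try:
        return get_status(job.ray_job_id)
    except RuntimeError as e:
        # The cluster no longer knows the job: log it and mark the job as FAILED.
        logger.error(
            "Job [%s] with ray_job_id [%s] failed to get status from Ray cluster. "
            "Marking as FAILED. Error: %s",
            job.id,
            job.ray_job_id,
            e,
        )
        return JobStatus.FAILED


def fetch_logs_or_none(job, get_logs):
    """Return the job logs, or None if the lookup raises, without crashing."""
    try:
        return get_logs(job.ray_job_id)
    except RuntimeError as e:
        # Missing logs are not fatal: warn and let the caller continue.
        logger.warning(
            "Job [%s] with ray_job_id [%s] failed to get logs from Ray cluster. Error: %s",
            job.id,
            job.ray_job_id,
            e,
        )
        return None


def lost_job(_ray_job_id):
    """Simulates a cluster that has lost track of the job."""
    raise RuntimeError("job not found")


job = SimpleNamespace(id=1, ray_job_id="example-ray-job-id")
status = fetch_status_or_fail(job, lost_job)
logs = fetch_logs_or_none(job, lost_job)
```

Passing the cluster calls in as arguments keeps the sketch runnable without a Ray cluster, and would also make this behaviour straightforward to cover with a unit test.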

@paaragon paaragon requested a review from a team as a code owner November 26, 2025 09:01
Comment on lines +41 to +45
"Job [%s] with ray_job_id [%s] failed to get status from Ray cluster. "
"Marking as FAILED. Error: %s",
job.id,
job.ray_job_id,
str(e),

Why use %s interpolation when we have f"Job {job.id}" (at least when we write new code)? %s interpolation is more error-prone: you have to convert to string with str, and you have to rearrange the parameters if you change the %s order. Just saying :)
/cc @ElePT @Tansito
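
As a point of comparison for this thread, here is a small standard-library sketch of how the two styles behave when passed to `logging`. The `Probe` class and the logger name are made up for illustration. It is offered as context for the trade-off, not as a ruling on which style the project should use.

```python
import logging


class Probe:
    """Counts how many times it is converted to a string."""

    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return "probe"


logger = logging.getLogger("interpolation_demo")
logger.setLevel(logging.INFO)          # DEBUG messages are disabled
logger.addHandler(logging.NullHandler())
logger.propagate = False

# %s style: arguments are only formatted if the record is actually emitted.
lazy = Probe()
logger.debug("Job [%s]", lazy)

# f-string style: the message is built before logging decides whether to emit it.
eager = Probe()
logger.debug(f"Job [{eager}]")

# %s applies str() to its argument, so an exception can be passed directly.
message = "Error: %s" % RuntimeError("boom")
```

After this runs, `lazy.str_calls` is 0 and `eager.str_calls` is 1, and `message` is `"Error: boom"`. In other words, with the %s style the explicit `str(e)` is optional, while the f-string style is arguably easier to read and reorder.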

@avilches avilches left a comment

Could you please add a test (unit, integration, whatever)?
