I am encountering an issue where submit_job() raises a Job Not Found error, even though the job ID is generated and the job is successfully scheduled and run on Perlmutter. This complicates an automated workflow I am developing, where we need to wait for the results of one task before starting the next one.
From my script job_controller.py:
    try:
        logger.info("Submitting reconstruction job script to Perlmutter.")
        job = self.client.perlmutter.submit_job(job_script)
    except Exception as e:
        logger.error(f"Failed to submit or complete reconstruction job: {e}")
Error log from the exception:
13:11:27.498 | INFO | orchestration.flows.bl832.job_controller - Submitting reconstruction job script to Perlmutter.
13:12:03.894 | ERROR | orchestration.flows.bl832.job_controller - Failed to submit or complete reconstruction job: Job not found: 33821565
It seems like this could arise from one of the SfApiErrors raised by submit_job(), which is defined in sfapi_client/_sync/compute.py.
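As a stopgap on my side, I am considering something like the sketch below. It assumes the error is a timing issue, where the job exists but is not yet visible when it is first looked up; I have not confirmed that. The helpers extract_job_id and retry_lookup are hypothetical names of my own, and lookup stands in for whatever client call fetches an existing job by ID, so none of this is part of sfapi_client.

```python
import re
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Matches the message seen in the log above, e.g. "Job not found: 33821565".
JOB_NOT_FOUND = re.compile(r"Job not found:\s*(\d+)")


def extract_job_id(message: str) -> Optional[str]:
    """Return the job ID from a 'Job not found: <id>' message, or None."""
    match = JOB_NOT_FOUND.search(message)
    return match.group(1) if match else None


def retry_lookup(
    lookup: Callable[[str], T],
    job_id: str,
    attempts: int = 5,
    delay: float = 2.0,
) -> T:
    """Call lookup(job_id), waiting a little longer after each failure.

    `lookup` is a hypothetical callable supplied by the caller, for example
    a thin wrapper around the client call that fetches a job by its ID.
    """
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return lookup(job_id)
        except Exception as exc:  # narrow to the client's error type in real code
            last_error = exc
            time.sleep(delay * (attempt + 1))  # linear backoff
    raise RuntimeError(
        f"Job {job_id} still not found after {attempts} attempts"
    ) from last_error
```

In the except branch of job_controller.py, the idea would be to call extract_job_id(str(e)) and, if it returns an ID, pass it to retry_lookup instead of giving up immediately. This only hides the symptom, so I would still like to understand why submit_job() raises in the first place.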