recovery after remote repository temporarily unavailable #80

@peter-sk

What happened?
We use a remote repository to collect data on related runs from multiple GPU nodes. This had worked fine until last night, when the remote repository was unreachable for a few minutes (likely due to a network issue). Tracked data kept being collected during and after the incident, but at some point the queue filled up and the processes blocked.

What should have happened?
After the incident, aim should have reconnected to the remote repository and delivered the values collected in the queue.

What is the impact of the issue?
Long runs have to be resumed from the last checkpoint (if available) or restarted from scratch (if resuming is not an option). In this case, we lost approx. 200 GPU hours in total.

How might the issue be mitigated?
After an outage, the system should periodically try to reestablish the connection. If manpower is an issue and someone points me to the right place in the code, I might try to implement this myself. This is a showstopper for using remote repositories.
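
To make the proposed mitigation concrete, here is a minimal sketch of the behaviour I have in mind. It is not aim's actual client code: `ReconnectingSender`, `FlakyRemote`, and the `send` callable are hypothetical names used for illustration only, and I am assuming that a failed delivery surfaces as a `ConnectionError`. The worker keeps the undelivered item and retries with capped exponential backoff, so the queue drains once the remote is reachable again.

```python
import queue
import threading
import time


class ReconnectingSender:
    """Hypothetical sketch: drain a bounded queue, retrying failed sends with backoff."""

    def __init__(self, send, maxsize=1000, base_delay=0.5, max_delay=30.0):
        self._send = send  # assumed to raise ConnectionError while the remote is down
        self._queue = queue.Queue(maxsize=maxsize)
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def track(self, item):
        # Blocks while the queue is full; the worker below frees space after recovery.
        self._queue.put(item)

    def _run(self):
        while not self._stop.is_set():
            try:
                item = self._queue.get(timeout=0.1)
            except queue.Empty:
                continue
            delay = self._base_delay
            # Keep retrying the same item until it is delivered or we are shut down.
            while not self._stop.is_set():
                try:
                    self._send(item)
                    break
                except ConnectionError:
                    time.sleep(delay)
                    delay = min(delay * 2, self._max_delay)
            self._queue.task_done()

    def flush(self):
        # Wait until every queued item has been handled by the worker.
        self._queue.join()

    def close(self):
        self._stop.set()
        self._thread.join()


class FlakyRemote:
    """Stand-in for the remote repository: fails a fixed number of times, then recovers."""

    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def __call__(self, item):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("remote repository unreachable")
        self.received.append(item)
```

With something like this in place, a short outage would only delay delivery: `track` may still block while the queue is full, but it unblocks as soon as the worker reconnects, rather than stalling the training process for good. Whether the real fix belongs in a worker like this or in the transport layer is for the maintainers to say.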
