Feature Request: Pipelines #32

## Desired Behavior

I want to define relationships between jobs. To begin with, `for each` and `once for all` relationships should suffice. The connected jobs should form a Directed Acyclic Graph (DAG) so that they can be executed from root nodes to leaf nodes. If I trigger job A, the framework should check whether all preceding jobs exist and, if not, queue them first. The execution order of jobs (e.g. A should start _after_ all preceding jobs have finished) can be defined in SLURM via the `sbatch` command ([Example](https://www.hpc2n.umu.se/documentation/batchsystem/job-dependencies)). Jobs that can be executed in parallel should use the parallelism determined by the SLURM scheduler.
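
As a rough illustration of the ordering described above, the following Python sketch derives a submission order from a dependency map and builds an `sbatch` call that waits for the parent jobs. The job names, the `dependencies` map, and the helper names `submission_order` and `sbatch_command` are hypothetical and not part of SEML. The choice of `afterok` (start only if the parents succeeded) is also an assumption; `afterany` would start once the parents have finished regardless of exit status.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each job name maps to the jobs it depends on.
dependencies = {
    "A": {"B", "C"},
    "B": {"D"},
    "C": {"D"},
    "D": set(),
}


def submission_order(deps):
    """Return job names so that every job comes after all jobs it depends on."""
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    return list(TopologicalSorter(deps).static_order())


def sbatch_command(script, parent_job_ids):
    """Build an sbatch argument list that waits for all parent jobs to succeed."""
    cmd = ["sbatch", "--parsable"]  # --parsable makes sbatch print the job id
    if parent_job_ids:
        ids = ":".join(str(job_id) for job_id in parent_job_ids)
        cmd.append("--dependency=afterok:" + ids)
    cmd.append(script)
    return cmd


order = submission_order(dependencies)
```

Jobs without a path between them (here `B` and `C`) carry no dependency on each other, so the scheduler is free to run them in parallel.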

## Implementation Suggestion

It would be easiest to use an existing framework that covers these features. Such frameworks often also provide a nice frontend etc. However, I fear that the current yaml configuration files are hardly compatible with existing solutions.

If we decide to extend SEML, I suggest:
- Each job is defined in a separate yaml file (specifying its own running time estimate).
- In the "seml block", dependencies on other jobs can be defined with a relation type per dependency:
    -  `for_each_of`: If the yaml of the current job defines `n` jobs, and `path/to/a.yaml` defines `m` jobs, then this results in a total of `m * n` jobs.
    -  `once_for_all_of`: If the yaml of the current job defines `n` jobs, and `path/to/b.yaml` defines `k` jobs, then this results in a total of `n` jobs.
    -  Combined: If the yaml of the current job defines `n` jobs, `path/to/a.yaml` defines `m` jobs, and `path/to/b.yaml` defines `k` jobs (both with relation type `for_each_of`), then this results in a total of `k * m * n` jobs.
- The run method shall receive a dictionary describing its lineage. Based on this information, the user shall be responsible for loading the right artifacts.
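
The counting rules in the list above could be sketched in Python as follows. This is only an illustration under assumptions: the function `expand_jobs`, its arguments, and the shape of the lineage dictionary (dependency path mapped to the upstream job, or to the list of all upstream jobs for a once-for-all dependency) are hypothetical and not an existing SEML API.

```python
from itertools import product


def expand_jobs(own_configs, for_each=(), once_for_all=()):
    """Pair every own config with a lineage dict.

    for_each:     (path, upstream_jobs) pairs; each one multiplies the job count.
    once_for_all: (path, upstream_jobs) pairs; passed as a whole, count unchanged.
    """
    shared = {path: list(jobs) for path, jobs in once_for_all}
    paths = [path for path, _ in for_each]
    job_lists = [jobs for _, jobs in for_each]
    expanded = []
    for config in own_configs:
        # One job per combination of upstream jobs from the for_each dependencies.
        # With no for_each dependency, product() yields a single empty combination.
        for combo in product(*job_lists):
            lineage = dict(shared)
            lineage.update(zip(paths, combo))
            expanded.append((config, lineage))
    return expanded


own = [{"lr": 0.1}, {"lr": 0.01}]   # n = 2 (made-up configs)
a_jobs = ["a0", "a1", "a2"]         # m = 3 (made-up job ids)
b_jobs = ["b0", "b1", "b2", "b3"]   # k = 4 (made-up job ids)

per_a = expand_jobs(own, for_each=[("path/to/a.yaml", a_jobs)])
once_b = expand_jobs(own, once_for_all=[("path/to/b.yaml", b_jobs)])
per_a_and_b = expand_jobs(
    own, for_each=[("path/to/a.yaml", a_jobs), ("path/to/b.yaml", b_jobs)]
)
```

Here `per_a` holds `m * n` entries, `once_b` holds `n`, and `per_a_and_b` holds `k * m * n`. The second element of each entry is the lineage that the run method would receive.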

Example:
```yaml
seml:
    ...
    dependencies:
         - for_each_of: path/to/a.yaml
         - once_for_all_of: path/to/b.yaml
```
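
A minimal Python sketch of how such a `dependencies` block might be read once the yaml has been loaded into a dictionary. The function `parse_dependencies` and the validation rules are assumptions for illustration, not existing SEML behavior; the elided `...` part of the block is left out.

```python
ALLOWED_RELATIONS = {"for_each_of", "once_for_all_of"}


def parse_dependencies(seml_block):
    """Return (relation, path) pairs from an already parsed seml block (a dict)."""
    parsed = []
    for entry in seml_block.get("dependencies", []):
        if len(entry) != 1:
            raise ValueError(f"expected exactly one relation per entry, got {entry!r}")
        # Each list entry is a single-key mapping: relation type -> yaml path.
        ((relation, path),) = entry.items()
        if relation not in ALLOWED_RELATIONS:
            raise ValueError(f"unknown relation type: {relation!r}")
        parsed.append((relation, path))
    return parsed


# The example block above, as it would look after loading the yaml.
seml_block = {
    "dependencies": [
        {"for_each_of": "path/to/a.yaml"},
        {"once_for_all_of": "path/to/b.yaml"},
    ]
}
pairs = parse_dependencies(seml_block)
```

Rejecting unknown relation types early keeps a typo in the yaml from silently changing the number of jobs that get queued.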

## References

Here are some references to other pipeline frameworks (mostly for inspiration):
- [Snakemake](https://snakemake.readthedocs.io/en/stable/)
- [Luigi](https://github.com/spotify/luigi)
- [Kubeflow](https://www.kubeflow.org/docs/pipelines/pipelines-quickstart/)
- [Azure Data Factory](https://docs.microsoft.com/en-us/azure/data-factory/#:~:text=Azure%20Data%20Factory%20is%20Azure's,with%20full%20compatibility%20in%20ADF.)

