I am trying to understand the intended workflow for OpusCleaner.
Suppose I want to build some MT systems. I fire up OpusCleaner, download some data, apply cleaning rules until I am happy, then upload the data to the cluster for training. Maybe I come back the next day and want to create a new version of this data set, or maybe I want to train a completely different MT system.
For this, would it be useful if OC had the notion of a "project"? I open a project, add files to it, set some project-wide rules and parameters, then maybe some data set specific parameters. If I then want to work on a different MT system, then I open a different project. I can copy the project file onto a different server, and initialise it (by downloading the files). Maybe projects could have versions, so I can track which data/rule set I used.
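
To make the idea a bit more concrete, here is a rough sketch of what a project file could hold. None of this is existing OpusCleaner functionality: the `Project` and `Dataset` classes, the rule names (`max_length`, `deduplicate`) and the URL are all made up for illustration, and JSON is just one possible format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Dataset:
    # Hypothetical: one entry per data set added to the project.
    name: str
    url: str  # where to fetch the file from when initialising on a new server
    overrides: dict = field(default_factory=dict)  # data set specific parameters


@dataclass
class Project:
    # Hypothetical: the thing that would be saved as the "project file".
    name: str
    version: int = 1  # bumped when the data or rule set changes
    rules: dict = field(default_factory=dict)  # project-wide rules and parameters
    datasets: list = field(default_factory=list)

    def effective_rules(self, dataset_name):
        """Project-wide rules, with the data set's own parameters taking precedence."""
        for ds in self.datasets:
            if ds.name == dataset_name:
                return {**self.rules, **ds.overrides}
        raise KeyError(dataset_name)

    def dumps(self):
        """Serialise the project so it can be copied to another server."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def loads(cls, text):
        """Rebuild a project from its serialised form."""
        raw = json.loads(text)
        raw["datasets"] = [Dataset(**d) for d in raw.get("datasets", [])]
        return cls(**raw)


# Example: one project with a project-wide rule set and one override.
project = Project(name="en-de-baseline",
                  rules={"max_length": 150, "deduplicate": True})
project.datasets.append(
    Dataset("corpus-a", "https://example.org/corpus-a.gz", {"max_length": 100}))

# Round-trip through the project file, as if copied to a different server.
restored = Project.loads(project.dumps())
```

Initialising the project on the new server would then mean downloading each `url` listed in the file, and a `version` field would let me record which data/rule set a given MT system was trained on.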