Resource agnostic (HPC or local) pipeline#44

Draft
ZJaume wants to merge 28 commits into main from lumify

Conversation

@ZJaume ZJaume commented Aug 19, 2025

No description provided.

mbanon and others added 7 commits July 10, 2025 13:27
A resource-agnostic (HPC or local) implementation for monolingual data
analytics. In theory, this implementation should run seamlessly: whether
an HPC is available should only change how HyperQueue is set up. In the
local case, just start an hq server and an hq worker; in the HPC case,
start an hq server and an hq allocation queue.
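As a sketch, the two setups differ only in these commands (the HyperQueue subcommands come from its CLI, but the Slurm flags here are assumptions, not taken from this PR):

```shell
#!/bin/sh
# Print the HyperQueue setup commands for a given environment.
# The local case needs only a server plus a manually started worker;
# the HPC case replaces the worker with an automatic allocation queue
# that asks the scheduler for workers on demand.
hq_setup_commands() {
    case "$1" in
        local)
            echo "hq server start &"
            echo "hq worker start &"
            ;;
        hpc)
            echo "hq server start &"
            # Illustrative flags only; adjust to the cluster's scheduler.
            echo "hq alloc add slurm --time-limit 2h -- --partition=standard"
            ;;
    esac
}

hq_setup_commands local
```

Everything downstream of this setup stays identical, which is what makes the pipeline resource-agnostic.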

All the procedures that were part of the map have been grouped into a
single script. Parallelization should preferably be line-based rather
than procedure-based, which would be more complicated.
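A minimal illustration of line-based batching (the file names, batch size, and the 01.map script name are assumptions for the sketch):

```shell
#!/bin/sh
# Split the corpus into fixed-size line batches; each batch then becomes
# one independent task that runs the whole grouped map script, rather
# than one task per procedure.
make_batches() {
    input="$1"
    lines_per_batch="$2"
    prefix="$3"
    # -d gives numeric suffixes: prefix00, prefix01, ...
    split -l "$lines_per_batch" -d "$input" "$prefix"
}
```

Each resulting batch file would then be submitted as its own task, e.g. `hq submit ./01.map batch.00`.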

The reduce has also been grouped into a single script, but with each
step in its own bash function, so they can all be wrapped into the same
pipeline: concatenate each batch from the map, apply the function, write
to a tmp file, then move it to the permanent location.
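The pipeline shape can be sketched like this, assuming a simple dedup step (the function names are placeholders, not the PR's actual reduce steps):

```shell
#!/bin/sh
# Each reduce step lives in its own function so any of them can slot
# into the same pipeline shape.
dedup() { sort -u; }

# Concatenate the map batches, apply one step, write to a tmp file, then
# move the result into its permanent place only once it is complete, so
# a crashed run never leaves a half-written output behind.
run_reduce() {
    out="$1"
    shift
    tmp="$(mktemp)"
    cat "$@" | dedup > "$tmp"
    mv "$tmp" "$out"
}
```

The tmp-then-move step matters: `mv` within a filesystem is atomic, so consumers never see a partial file.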

The readcorpus and readdocument positional arguments have been
reordered, so the call can be just script.py <lang>, using stdin/stdout
by default.
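A hypothetical sketch of the reordered argument parsing (the actual readcorpus/readdocument parsers may differ):

```python
import argparse
import sys


def build_parser():
    # Language code comes first; input/output are optional positionals
    # that fall back to stdin/stdout, so `script.py <lang>` alone works
    # inside a pipe.
    parser = argparse.ArgumentParser()
    parser.add_argument("lang", help="language code of the corpus")
    parser.add_argument("input", nargs="?", type=argparse.FileType("r"),
                        default=sys.stdin)
    parser.add_argument("output", nargs="?", type=argparse.FileType("w"),
                        default=sys.stdout)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.lang, file=sys.stderr)
```

With `nargs="?"`, explicit file paths still work when given, so the reordering does not remove the old calling style, only its obligation.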

The Singularity container used to run each task has been added, and its
Dockerfile has some GPU stuff commented out for now (or not).
Everything in here should now be in 01.map and 02.reduce
