Resource agnostic (HPC or local) pipeline#44

Draft
ZJaume wants to merge 28 commits into main from lumify

Conversation

@ZJaume ZJaume commented Aug 19, 2025

No description provided.

mbanon and others added 7 commits July 10, 2025 13:27
A resource-agnostic (HPC or local) implementation for monolingual data
analytics. In theory, this implementation should run seamlessly: whether
an HPC is available should only change how HyperQueue is set up. In the
local case, just start an hq server and an hq worker; in the HPC case,
start an hq server and an hq allocation queue.
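As a sketch, the two setups differ only in these commands (the HyperQueue subcommands come from its CLI, but the Slurm flags here are assumptions, not taken from this PR):

```shell
#!/bin/sh
# Print the HyperQueue setup commands for a given environment.
# The local case needs only a server plus a manually started worker;
# the HPC case replaces the worker with an automatic allocation queue
# that asks the scheduler for workers on demand.
hq_setup_commands() {
    case "$1" in
        local)
            echo "hq server start &"
            echo "hq worker start &"
            ;;
        hpc)
            echo "hq server start &"
            # Illustrative flags only; adjust to the cluster's scheduler.
            echo "hq alloc add slurm --time-limit 2h -- --partition=standard"
            ;;
    esac
}

hq_setup_commands local
```

Everything downstream of this setup stays identical, which is what makes the pipeline resource-agnostic.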

All the procedures that were part of the map have been grouped into a
single script. Parallelization should preferably be line-based rather
than procedure-based, which would be more complicated.
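A minimal illustration of line-based batching (the file names, batch size, and the 01.map script name are assumptions for the sketch):

```shell
#!/bin/sh
# Split the corpus into fixed-size line batches; each batch then becomes
# one independent task that runs the whole grouped map script, rather
# than one task per procedure.
make_batches() {
    input="$1"
    lines_per_batch="$2"
    prefix="$3"
    # -d gives numeric suffixes: prefix00, prefix01, ...
    split -l "$lines_per_batch" -d "$input" "$prefix"
}
```

Each resulting batch file would then be submitted as its own task, e.g. `hq submit ./01.map batch.00`.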

The reduce has also been grouped into a single script, but with each
step in its own bash function, so they can all be wrapped into the same
pipeline: concatenate each batch from the map, apply the function, write
to a tmp file, then move it to the permanent location.
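The pipeline shape can be sketched like this, assuming a simple dedup step (the function names are placeholders, not the PR's actual reduce steps):

```shell
#!/bin/sh
# Each reduce step lives in its own function so any of them can slot
# into the same pipeline shape.
dedup() { sort -u; }

# Concatenate the map batches, apply one step, write to a tmp file, then
# move the result into its permanent place only once it is complete, so
# a crashed run never leaves a half-written output behind.
run_reduce() {
    out="$1"
    shift
    tmp="$(mktemp)"
    cat "$@" | dedup > "$tmp"
    mv "$tmp" "$out"
}
```

The tmp-then-move step matters: `mv` within a filesystem is atomic, so consumers never see a partial file.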

The readcorpus and readdocument positional arguments have been
reordered, so the call can be just script.py <lang>, using stdin/stdout
by default.
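A hypothetical sketch of the reordered argument parsing (the actual readcorpus/readdocument parsers may differ):

```python
import argparse
import sys


def build_parser():
    # Language code comes first; input/output are optional positionals
    # that fall back to stdin/stdout, so `script.py <lang>` alone works
    # inside a pipe.
    parser = argparse.ArgumentParser()
    parser.add_argument("lang", help="language code of the corpus")
    parser.add_argument("input", nargs="?", type=argparse.FileType("r"),
                        default=sys.stdin)
    parser.add_argument("output", nargs="?", type=argparse.FileType("w"),
                        default=sys.stdout)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.lang, file=sys.stderr)
```

With `nargs="?"`, explicit file paths still work when given, so the reordering does not remove the old calling style, only its obligation.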

The Singularity container used to run each task has been added, and its
Dockerfile has some GPU stuff commented out for now (or not).
Everything in here should now be in 01.map and 02.reduce
