
Add dclm-core-22 to jupiter #42

Merged
geoalgo merged 4 commits into main from harsh/dclm-core-22-jupiter on Mar 6, 2026

Conversation

@harshraj172 (Collaborator)

To add lighteval for dclm on JUPITER we had to add a separate jupiter-lighteval.def, as including the installations in the same .def file for Jupiter created dependency conflicts: Jupiter uses ARM64, and many of lighteval's runtime dependencies (spacy, underthesea, pyvi, etc.) lack pre-built aarch64 wheels.
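A separate definition file keeps the conflicting dependencies isolated. As a rough illustration only (the base image and package list below are assumptions, not the contents of the actual jupiter-lighteval.def in this PR), such a file could look like:

```
Bootstrap: docker
From: arm64v8/python:3.11-slim

%post
    # Illustrative sketch: keep lighteval's aarch64-problematic deps
    # (spacy, underthesea, pyvi, ...) in their own container instead
    # of the shared Jupiter .def file
    pip install --no-cache-dir lighteval
```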

@geoalgo (Collaborator) left a comment

LGTM, but there are some conflicts to resolve.

- task: include_base_44_ukrainian
  subset: Ukrainian

dclm-core-22:
Collaborator:

Could you add the dataset field to either the task group or the tasks? Otherwise dataset pre-downloading (before job submission) won't be available, and the jobs will fail unless the compute nodes have internet access. You can check global-mmlu-eu or generic-multilingual for an example of how this looks.
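For illustration, a task-group entry with a dataset field might look roughly like the following; the field layout and the placeholder dataset id are assumptions based on this comment, so check global-mmlu-eu or generic-multilingual for the real schema:

```
dclm-core-22:
  # Hypothetical: declaring the HF dataset here lets it be
  # pre-downloaded before job submission, so compute nodes
  # don't need internet access
  dataset: some-org/some-dataset
  tasks:
    - task: include_base_44_ukrainian
      subset: Ukrainian
```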


build-push:
needs: setup-lambda
strategy:
Collaborator:

We moved to SkyPilot + Lambda Labs for the container build workflow, so the actual workflows are now split between .github/workflows and .github/sky.

@timurcarstensen (Collaborator)

Thanks for the contribution and for making this work for Jupiter, @harshraj172! Could you update your branch with the latest changes from main? We changed a few things about the GitHub Actions workflows (left you a comment about that) and testing.

SINGULARITY_ARGS: "--nv --contain --env PYTHONNOUSERSITE=1"
EVAL_CONTAINER_IMAGE: "lm-eval-jupiter.sif"
LIGHTEVAL_CONTAINER_IMAGE: "lighteval-jupiter.sif"
SINGULARITY_ARGS: "--nv --contain --env PYTHONNOUSERSITE=1 --env SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt"
Collaborator:

You'll probably also have to integrate this into the container-downloading logic in oellm/utils.py::_ensure_singularity_image; otherwise it'll only download the lm-eval container.
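A minimal sketch of the idea, assuming the downloader picks an image per harness. The helper below is hypothetical and only illustrates mapping both harnesses to their .sif images so both get pulled on Jupiter; it is not the real code in oellm/utils.py::_ensure_singularity_image:

```python
# Hypothetical sketch: map each evaluation harness to its Singularity
# image so the pre-download step fetches both, not just lm-eval.
# Image names are taken from the env vars shown in this PR's diff.
HARNESS_IMAGES = {
    "lm-eval": "lm-eval-jupiter.sif",
    "lighteval": "lighteval-jupiter.sif",
}

def select_container_image(harness: str) -> str:
    """Return the .sif filename for a harness, raising on unknown names."""
    try:
        return HARNESS_IMAGES[harness]
    except KeyError:
        raise ValueError(f"unknown harness: {harness!r}")
```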

@harshraj172 force-pushed the harsh/dclm-core-22-jupiter branch from 23c87ea to 35a1f83 on March 5, 2026 16:19
@harshraj172 (Collaborator, Author)

Some changes in this PR. @timurcarstensen @geoalgo @JeniaJitsev

  • Switched from lighteval to lm-eval-harness. The DCLM paper uses LLM Foundry for evaluation, which computes log-probs for multiple-choice and schema tasks; lm-eval handles these the same way, whereas lighteval uses a different approach for some of them. Pinned to lm-eval==0.4.9.2 because >v0.4.10 has a regression that breaks agieval_lsat_ar in few-shot mode. This also required pinning transformers<5.0.0 and datasets<4.0.0 for compatibility.
  • 21 of the 22 CORE tasks are covered. The missing one is jeopardy: lm-eval has a jeopardy task, but it's broken (the task config doesn't properly handle the generation format). Planning to look into fixing this upstream. One other thing: DCLM evaluates on SQuAD v1, but lm-eval only ships squadv2, so that task's numbers won't be directly comparable to DCLM baselines. All other tasks match the paper's few-shot counts, metrics, and evaluation types exactly. Tested everything with qwen3-0.6B-base on Jupiter.

@geoalgo (Collaborator) left a comment

Thank you, @harshraj172! Can you rebase so that we can merge to mainline? My main comment is about not tying a specific version just to Jupiter; otherwise it looks good to go once updated with mainline.

Collaborator:

Are those intended? If not, let's remove them.

Comment on lines +16 to +17
uv pip install --system --break-system-packages "lm-eval==0.4.9.2" \
"transformers>=4.43.2,<5.0.0" "datasets<4.0.0" wandb sentencepiece tiktoken accelerate
Collaborator:

This part is tricky: we do not want to tie the version per container, as we would get different results across clusters. If you need a specific version for a fixed benchmark, the best option is a custom environment, which can be set up by following https://github.com/OpenEuroLLM/oellm-cli/blob/main/docs/VENV.md (happy to help if there is any issue with this; I tested it and it worked on LUMI, for instance).

Suggested change
uv pip install --system --break-system-packages "lm-eval==0.4.9.2" \
"transformers>=4.43.2,<5.0.0" "datasets<4.0.0" wandb sentencepiece tiktoken accelerate
uv pip install --system --break-system-packages lm-eval \
"transformers<=4.53.0" "datasets<4.0.0" wandb sentencepiece tiktoken accelerate

@harshraj172 (Collaborator, Author), Mar 5, 2026

Yes, this is tricky. If we don't pin the versions, agieval_lsat_ar will break. Do you think we should mention somewhere in docs/VENV.md that people should use a venv if they want to run all 21 tasks (otherwise only 20 will run)?

Collaborator:

Thanks for rebasing. My only remaining point is about not pinning the version in the Jupiter container.

What you suggest is a good idea. We could also add another requirements file with those versions, named something like requirements-venv-dclm.txt (in addition to https://github.com/OpenEuroLLM/oellm-cli/blob/main/requirements-venv.txt), and add an entry in the README.md (or VENV.md) explaining how to evaluate this.
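As a sketch, such a requirements-venv-dclm.txt could simply capture the pins discussed in this PR; the exact file contents below are an assumption, with the versions taken from the install command in the review thread:

```
# Hypothetical requirements-venv-dclm.txt: pins the versions discussed
# in this PR so the full 21-task DCLM CORE suite runs reproducibly
# (agieval_lsat_ar breaks on lm-eval >0.4.10)
lm-eval==0.4.9.2
transformers>=4.43.2,<5.0.0
datasets<4.0.0
wandb
sentencepiece
tiktoken
accelerate
```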

Collaborator Author:

Thanks for the review. I resolved the comments.

Collaborator:

Those binaries are probably not intended; can you remove them?

Collaborator Author:

oh, my bad.

Comment on lines +16 to +17
uv pip install --system --break-system-packages lm-eval \
wandb sentencepiece tiktoken accelerate
Collaborator:

Isn't it missing datasets and some other dependencies?

uv pip install --system --break-system-packages \
    lm-eval \
    transformers \
    "datasets<4.0.0" \
    wandb \
    sentencepiece \
    tiktoken \
    accelerate \
    nltk

If so, let's merge this PR to add DCLM support and update the Jupiter container in a follow-up, since the PR has been open for a long time.

@harshraj172 (Collaborator, Author), Mar 6, 2026

@geoalgo I found that lm-eval installs these already. Also, nltk isn't required now, as it was only used with lighteval earlier.

@geoalgo merged commit 1b75fcd into main on Mar 6, 2026
3 checks passed
@geoalgo deleted the harsh/dclm-core-22-jupiter branch on March 6, 2026 09:18