harshraj172 commented on Feb 13, 2026
- Adds the config to pretrain an open-sci architecture model with the MixtureVitae dataset on jupiter

```
# Conflicts:
#   README.md
#   config/autoexp.yaml
#   config/container/jupiter.yaml
#   config/slurm/jupiter.yaml
#   config/sweep/minimal.yaml
#   oellm_autoexp/backends/megatron_backend.py
#   scripts/run_autoexp_container.py
```
Ideally, make this use a common `$OELLM_DATASETS_TOKENIZED_DIR`-based path, so that this config works across clusters. (You can keep the absolute path as a comment so people know where to copy it from on another cluster.)
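A sketch of what that could look like in the config (the dataset filename and key are hypothetical placeholders; only the `OELLM_DATASETS_TOKENIZED_DIR` variable comes from this thread):

```yaml
# Portable form: resolve the dataset root from the environment.
# Keep the absolute path as a comment so people know where to copy
# the data from on another cluster, e.g.:
#   /p/scratch/.../datasets_tokenized/mixturevitae  (jupiter)
data_path: ${oc.env:OELLM_DATASETS_TOKENIZED_DIR}/mixturevitae
```

This way the same sweep config can run on any cluster that exports the variable, instead of hard-coding one cluster's filesystem layout.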
As a side note, we should unify the container names/paths, so that we don't have to change the config again and again.
But that's more of a general problem.
```yaml
env:
  CUDA_DEVICE_MAX_CONNECTIONS: "1"
  PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
  NCCL_SOCKET_IFNAME: ib0
```
These options might actually be useful in general on jupiter, right?
So we could add them to slurm/jupiter.yaml?
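A sketch of the suggested move (assuming `config/slurm/jupiter.yaml` accepts an `env:` mapping like the sweep config does; the surrounding keys are placeholders):

```yaml
# config/slurm/jupiter.yaml (sketch): cluster-wide env defaults live here
# instead of being repeated in every sweep config.
env:
  CUDA_DEVICE_MAX_CONNECTIONS: "1"
  PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
  NCCL_SOCKET_IFNAME: ib0
```

Individual sweep configs would then only override these values when an experiment genuinely needs something different.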
```yaml
gres: "gpu:4"
gpu_bind: "none"
time: "12:00:00"
partition: booster
```
I typically "externalize" these to env variables: `${oc.env:SLURM_PARTITION,booster}` / `${oc.env:SLURM_ACCOUNT,jureap59}`.
This way, one can update the project more easily.
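In context, the externalized fields might look like this (a sketch; the `account` value `jureap59` is taken from the example above and may differ per project):

```yaml
# Cluster-specific values resolved via OmegaConf's env resolver,
# falling back to the current jupiter defaults when the variable is unset.
partition: ${oc.env:SLURM_PARTITION,booster}
account: ${oc.env:SLURM_ACCOUNT,jureap59}
gres: "gpu:4"
gpu_bind: "none"
time: "12:00:00"
```

Exporting `SLURM_PARTITION`/`SLURM_ACCOUNT` in a cluster profile then updates every config at once, without editing the YAML files.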
If you update submodules/Megatron-LM, you should re-run the dataclass/config generation scripts (scripts/generate_megatron_config.py, scripts/generate_megatron_dataclass.py).
Thanks for removing those!