Open-Sci+MixtureVitae on Jupiter #28

Open: harshraj172 wants to merge 4 commits into main from harsh/jupiter_setup

@harshraj172

  • Adds the config to pretrain an open-sci architecture model on the MixtureVitae dataset on Jupiter

# Conflicts:
#	README.md
#	config/autoexp.yaml
#	config/container/jupiter.yaml
#	config/slurm/jupiter.yaml
#	config/sweep/minimal.yaml
#	oellm_autoexp/backends/megatron_backend.py
#	scripts/run_autoexp_container.py

Collaborator

Ideally, make this use a common $OELLM_DATASETS_TOKENIZED_DIR-based path so that this config works across clusters. (You can keep the absolute path as a comment so that people know where to copy it from on another cluster.)
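
A minimal sketch of what that could look like, assuming the config is resolved by OmegaConf (as the ${oc.env:...} syntax used later in this review suggests). The key name data_path and the subdirectory mixturevitae are hypothetical placeholders, not names taken from this PR:

```yaml
# Hypothetical key and subdirectory, for illustration only.
# Keep the original absolute path as a comment so others know where to copy from:
# data_path: <absolute path on the original cluster>
data_path: ${oc.env:OELLM_DATASETS_TOKENIZED_DIR}/mixturevitae
```

Without a default after the variable name, OmegaConf raises an error when OELLM_DATASETS_TOKENIZED_DIR is unset, which makes a missing cluster setup visible early.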

Collaborator

As a side note, we should unify the container names/paths so that we don't have to change the config again and again.
But that's more of a general problem.

env:
  CUDA_DEVICE_MAX_CONNECTIONS: "1"
  PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
  NCCL_SOCKET_IFNAME: ib0

Collaborator

These options might actually be useful in general on Jupiter, right?
So we could add them to slurm/jupiter.yaml?

gres: "gpu:4"
gpu_bind: "none"
time: "12:00:00"
partition: booster

Collaborator

I typically "externalize" these to env variables: ${oc.env:SLURM_PARTITION,booster} / ${oc.env:SLURM_ACCOUNT,jureap59}
This way, one can "update" the project more easily.
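
A sketch of how that might look in the Slurm config, using OmegaConf's oc.env resolver with a fallback default. The account key name is an assumption; the other keys and the default values are taken from the snippet and comment above:

```yaml
gres: "gpu:4"
gpu_bind: "none"
time: "12:00:00"
# Falls back to "booster" when SLURM_PARTITION is unset.
partition: ${oc.env:SLURM_PARTITION,booster}
# Key name `account` is assumed; falls back to "jureap59" when SLURM_ACCOUNT is unset.
account: ${oc.env:SLURM_ACCOUNT,jureap59}
```

With this, exporting SLURM_ACCOUNT before launching would override the project without editing the file.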

Collaborator

If you update submodules/Megatron-LM, you should re-run the scripts for dataclass/config generation (scripts/generate_megatron_config.py, scripts/generate_megatron_dataclass.py).

Collaborator

Thanks for removing those!
