Changes from all commits
142 commits
1ca42e3
can only merge to main from dev
NouamaneTazi Apr 14, 2025
0dbf24d
Fix UnBoundLocalError in `clm_collator.py` (#339)
c8ef Apr 14, 2025
d4e9daf
InitScalingMethod
NouamaneTazi Apr 14, 2025
6e7f0fa
InitScalingMethod
NouamaneTazi Apr 14, 2025
24d07e5
eval
NouamaneTazi Apr 16, 2025
438257a
try adding lightevalrunner to trainer
NouamaneTazi Apr 16, 2025
4f8a350
amend
NouamaneTazi Apr 16, 2025
c9c479d
amend
NouamaneTazi Apr 16, 2025
190a6b9
amend
NouamaneTazi Apr 17, 2025
004a89c
amend
NouamaneTazi Apr 17, 2025
b4cbb55
amend
NouamaneTazi Apr 17, 2025
d39872b
amend
NouamaneTazi Apr 17, 2025
feb818a
.
NouamaneTazi Apr 17, 2025
025f314
amend
NouamaneTazi Apr 17, 2025
abe75af
amend
NouamaneTazi Apr 17, 2025
bd50c66
.
NouamaneTazi Apr 17, 2025
2227432
qos to low
eliebak Apr 17, 2025
b62cacd
add nanotron_path
eliebak Apr 17, 2025
802fad6
some fix: logs, and config
eliebak Apr 17, 2025
895354a
cp instead of sync
eliebak Apr 17, 2025
55a5d3e
eval_interval
NouamaneTazi Apr 17, 2025
298492e
serialize sanity checks
NouamaneTazi Apr 17, 2025
4219ec8
add output dir and s3_save path in the config
eliebak Apr 17, 2025
f1780ec
add output dir and s3_save path in the config
eliebak Apr 17, 2025
016760e
fix s3 only if define
eliebak Apr 17, 2025
85138ca
fixes
NouamaneTazi Apr 17, 2025
0390de2
Merge branch 'nouamane/lighteval' of https://github.com/huggingface/n…
NouamaneTazi Apr 17, 2025
fefb560
add requeue
eliebak Apr 17, 2025
f1160f1
move moe from qwen modeling to src/nn
xrsrke Apr 18, 2025
bb8ac96
add groupedmlp
xrsrke Apr 18, 2025
ebdd55d
add token permute and unpermute
xrsrke Apr 18, 2025
3178944
fix num_tokens_per_expert counting < num_experts
xrsrke Apr 18, 2025
0a04f34
fix init and init scaling factor and run evals in background (#349)
NouamaneTazi Apr 18, 2025
d8b2717
[Feature] Implement CUDA event-based timing for improved GPU performa…
grewalsk Apr 18, 2025
9095a9d
amend previous pr (#354)
NouamaneTazi Apr 18, 2025
4558036
add wandb with lighteval and fix eval interval
eliebak Apr 18, 2025
17b5284
Merge branch 'nouamane/lighteval' of github.com:huggingface/nanotron …
eliebak Apr 18, 2025
4c39f62
inference qwen moe seems to work
zzhhjjj Apr 18, 2025
20f0196
update readme
zzhhjjj Apr 19, 2025
e472f77
fix router's weight initialization and wrong hidden size for non-moe …
xrsrke Apr 19, 2025
76ff5c7
add source for router weight and router logits in float32
xrsrke Apr 19, 2025
b5ea942
fix this little space :(
eliebak Apr 20, 2025
8851256
Merge branch 'main' into dev_qwen_moe
xrsrke Apr 20, 2025
863f45f
Merge branch 'dev' of https://github.com/huggingface/nanotron into de…
NouamaneTazi Apr 21, 2025
11f7997
fixes
NouamaneTazi Apr 21, 2025
81c31ba
.
NouamaneTazi Apr 21, 2025
2e79d24
.
NouamaneTazi Apr 21, 2025
bcf1658
add parametrize grouped mlp in column and row linear
xrsrke Apr 22, 2025
2b3a59d
add logging per-param grad norm
xrsrke Apr 22, 2025
abbb4a1
fix conversation fail due to buffer on cpu
xrsrke Apr 23, 2025
561ca6b
folder_path should always have s3 when using s3 (fix consumed tokens …
NouamaneTazi Apr 23, 2025
fc003ac
MoE without token dropping (#355)
xrsrke Apr 23, 2025
e04c5f1
add ep, ep+tp, ep+dp process group initialization
xrsrke Apr 24, 2025
009b8f7
add all-to-all, but not finished tests
xrsrke Apr 24, 2025
ba2ba84
Nouamane/lighteval (#356)
NouamaneTazi Apr 24, 2025
309470f
remove ep dimension from non-moe
xrsrke Apr 24, 2025
4cdba21
add the first all-to-all of ep
xrsrke Apr 24, 2025
a48aca2
add 2nd all-to-all
xrsrke Apr 25, 2025
fd740e0
fix ep inference (not correct outputs), shard expert weights in load_…
xrsrke Apr 25, 2025
2539352
fix runtime generation with ep=1
xrsrke Apr 25, 2025
adbe7db
refactor token dispatching out of moe
xrsrke Apr 26, 2025
fb20017
partially working dispatching (incorrect order in the resulting dispa…
xrsrke Apr 27, 2025
97c56a4
add and fix all-to-all tests for custom input/output splits
xrsrke Apr 27, 2025
628b53e
add moe token dispatcher test
xrsrke Apr 27, 2025
5d513ac
fix all-to-all token dispatching with permute before all-to-all
xrsrke Apr 27, 2025
f0cef96
add permutation after all-to-all
xrsrke Apr 27, 2025
4602753
add the 2nd all-to-all + unpermutation
xrsrke Apr 28, 2025
bc5baff
add num_local_dispatched_tokens_per_expert
xrsrke Apr 28, 2025
06d2afe
clean token dispatching test
xrsrke Apr 28, 2025
81fc02d
fix "sizes[i] <= index && index < sizes[i] && "index out of bounds"" …
xrsrke Apr 28, 2025
0d39fe5
add a complete expert parallelism (_compute_expert_outputs), pass tests
xrsrke Apr 28, 2025
a852c53
expert parallelism pass tests sometimes
xrsrke Apr 29, 2025
a6c1e23
expert parallelism pass unit tests
xrsrke Apr 29, 2025
6d1c636
clean moe unit tests
xrsrke Apr 29, 2025
ede9d4a
add tests for permute, and unpermute
xrsrke Apr 29, 2025
15e8169
add topk>1 for all-to-all token dispatching's permute
xrsrke Apr 29, 2025
1acdbf1
all-to-all token [dispatching+permute] and [undispatching+unpermute] …
xrsrke Apr 29, 2025
076bc71
clean up moe layer, token dispatching + tests
xrsrke Apr 29, 2025
b706c70
add mp_pg process group to moe
xrsrke Apr 29, 2025
7e0d4e3
add a fast implementation of getting dispatched routing indices
xrsrke Apr 30, 2025
4261e99
clean up moe
xrsrke Apr 30, 2025
a1e23b4
another basic clean up
xrsrke Apr 30, 2025
3e566d9
merge moe process group to parallel context
xrsrke Apr 30, 2025
232bbc0
fix incorrect shape in grouped_gemm if there is only one of the local…
xrsrke Apr 30, 2025
110f318
add benchmark moe
xrsrke May 1, 2025
f2a0ba1
remove a cpu sync in .permute
xrsrke May 2, 2025
14335f1
clean up
xrsrke May 2, 2025
069016b
add haojun's permute
xrsrke May 2, 2025
9495184
add expert logging
xrsrke May 3, 2025
a58fd49
clean up
xrsrke May 5, 2025
8b1f03c
add DEBUG_MOE env variable for moe logging
xrsrke May 5, 2025
50ca006
moe without router passes gradient tests
xrsrke May 5, 2025
bc49afc
moe's shard expert pass tests
xrsrke May 5, 2025
f573f94
ep=8 and ep=1's loss curves match
xrsrke May 6, 2025
c7f5c2c
clean up moe tests
xrsrke May 6, 2025
7fa7e0d
add separate process groups for edp and pg in all_reduce
xrsrke May 7, 2025
07ca0b7
execute edp and ep separately in DistributedDataParallel's bucketing …
xrsrke May 7, 2025
3ddac31
clean
xrsrke May 7, 2025
1277185
add moe hidden_size, and shared expert hidden size to moe config
xrsrke May 8, 2025
6b0e333
add counting moe active params
xrsrke May 8, 2025
f39f5c2
fix resuming with new data mixture
NouamaneTazi May 8, 2025
44eebac
offsets must be in samples not tokens
NouamaneTazi May 8, 2025
c8239fb
remove moe_logging, model grads logging, clean up
xrsrke May 9, 2025
246604c
Merge branch 'dev' into dev_qwen_moe_with_ep
xrsrke May 9, 2025
eadf97f
rename use_haojun_permute to use_torch_permute
xrsrke May 9, 2025
74316d8
sanity check local files when dataset_read_path
NouamaneTazi May 11, 2025
a7a16a8
better error for new stage
NouamaneTazi May 11, 2025
17dad0a
rmsnorm
NouamaneTazi May 11, 2025
3ef78f0
sliding window
NouamaneTazi May 11, 2025
2670849
causal SWA
NouamaneTazi May 11, 2025
84b3f55
add te implem
NouamaneTazi May 12, 2025
4a64da4
Merge branch 'dev_qwen_moe_with_ep' of https://github.com/huggingface…
NouamaneTazi May 12, 2025
543072e
init works
NouamaneTazi May 12, 2025
ab649f0
training works
NouamaneTazi May 12, 2025
83e28d5
Revert "rmsnorm"
NouamaneTazi May 12, 2025
b4142da
rope_seq_len_interpolation_factor
NouamaneTazi May 12, 2025
d99aa4b
add moe ep
xrsrke May 12, 2025
aa4e754
add saving moe checkpoints
xrsrke May 13, 2025
f0d410f
add resume moe checkpoints
xrsrke May 13, 2025
e04f6be
adapt edp to te moe
xrsrke May 13, 2025
d36905b
fix edp+ep saving/resume checkpoint
xrsrke May 13, 2025
d24e603
use all_gather_into_tensor in token dispatcher
xrsrke May 14, 2025
559d532
use fused permute, megablock's grouped gemm, and fuse multiplying exp…
xrsrke May 16, 2025
5d36aef
check init, fwd in case of ep=1
NouamaneTazi May 16, 2025
df8cd20
add all2all
NouamaneTazi May 17, 2025
135c715
loss goes down after fixing data labels
NouamaneTazi May 17, 2025
98df8cc
fix bias_activation_fusion and set to True by default
NouamaneTazi May 17, 2025
88c2ee0
disable collect_env
NouamaneTazi May 19, 2025
acad09b
preprocess_data
NouamaneTazi May 19, 2025
1a8fb35
fix bug
NouamaneTazi May 19, 2025
d678ae7
assert in case of top_k=1
NouamaneTazi May 19, 2025
e233a7d
add SimpleTokenDataset
NouamaneTazi May 19, 2025
7499256
add timer for debug
zzhhjjj May 20, 2025
55fb95f
fix all-to-all
xrsrke May 20, 2025
aea7910
scripts/scaling_moe_benchmark.py
xrsrke May 21, 2025
d30ae0f
fix no gradients, and expert device has no tokens
xrsrke May 21, 2025
88d4140
add compute moe params
xrsrke May 21, 2025
9e687c4
fix sync_tied_weights_gradients without gradient accumulator
xrsrke May 21, 2025
06faa91
fix sync_tied_weights_gradients
xrsrke May 21, 2025
ce5f61c
fix cannot import name 'Qwen2Config'
xrsrke May 21, 2025
af69364
resolve merge conflicts
xrsrke Jun 11, 2025
cc09277
resolve merge conflicts, fix timer in training loop
xrsrke Jun 11, 2025
12 changes: 12 additions & 0 deletions Makefile
@@ -20,3 +20,15 @@ test:
		--color=yes \
		--verbose \
		examples/llama/tests/

install-moe:
	pip install --no-build-isolation git+https://github.com/fanshiqing/grouped_gemm@main

test-moe:
	pytest --color=yes --verbose tests/test_moe_dispatcher.py
	pytest --color=yes --verbose tests/test_moe.py
	pytest --color=yes --verbose tests/test_distributed_primitives.py::test_all_to_all

run-sanity-moe:
	CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file /fsx/phuc/new_workspace/snippets/experiment_configs/qwen_moe/exp0a0_sanity_dense.yaml
	CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file /fsx/phuc/new_workspace/snippets/experiment_configs/qwen_moe/exp0b0_sanity_moe_ep8.yaml
3 changes: 2 additions & 1 deletion README.md
@@ -98,7 +98,7 @@ CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-
The model will be saved in the `checkpoints` directory as specified in the config file.

> [!NOTE]
> You can use `examples/config_tiny_llama.py` to generate your own training config
> You can use `examples/config_tiny_llama.py` to generate your own training config

For detailed instructions on training your first model, check out our [Your First Training guide](docs/your-first-training.md). For multi-node training with Slurm, see our [Multi-Node Training guide](docs/multi-node-training.md).

@@ -173,6 +173,7 @@ We currently support the following features:
- [x] Custom module checkpointing for large models
- [x] Spectral µTransfer parametrization for scaling up neural networks
- [x] Mamba example
- [x] CUDA event-based timing for accurate GPU performance measurement

And we have on our roadmap:
- [ ] FP8 training
92 changes: 92 additions & 0 deletions docs/cuda_event_timing.md
@@ -0,0 +1,92 @@
# CUDA Event-Based Timing in Nanotron

## Overview

Nanotron now uses CUDA events for timing GPU operations instead of CPU-based timing with `time.time()`. This change provides several benefits:

1. **More accurate measurement of GPU execution time**: CUDA events are recorded directly on the GPU timeline, providing more precise timing of GPU operations.
2. **Reduced need for explicit CUDA synchronization**: CPU-based timing requires synchronization between CPU and GPU to get accurate measurements, which can introduce overhead and affect performance.
3. **Lower overhead**: CUDA event-based timing has minimal impact on the execution of GPU operations.
4. **Better performance monitoring**: More accurate timing leads to better performance analysis and optimization.

## Implementation Details

The implementation uses `torch.cuda.Event` with `enable_timing=True` to create start and end events that are recorded on the GPU timeline. The elapsed time is then calculated using `start_event.elapsed_time(end_event)`, which returns the time in milliseconds.
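
As a rough illustration of the mechanism, here is a minimal sketch using only public PyTorch APIs (this is not the actual Nanotron timer code; `time_gpu_op` and the tensor sizes are placeholders):

```python
import torch

def time_gpu_op(work, warmup: int = 3) -> float:
    """Time a GPU callable with CUDA events; returns seconds (illustrative sketch)."""
    for _ in range(warmup):
        work()  # warm up kernels and the allocator
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()  # recorded on the GPU timeline, no host sync needed here
    work()
    end.record()
    end.synchronize()  # wait only at the point where we read the measurement
    return start.elapsed_time(end) / 1000.0  # elapsed_time() reports milliseconds

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")
    print(f"matmul: {time_gpu_op(lambda: x @ x) * 1000:.2f} ms")
```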

### Key Changes

1. **Default Timer Type**: The default timer type in `src/nanotron/logging/timers.py` has been changed from `TimerType.CPU` to `TimerType.CUDA`.

2. **Iteration Timing**: The iteration timing in `trainer.py` now uses CUDA events instead of `time.time()`.

3. **Synchronization Control**: By default, CUDA event-based timers do not force synchronization unless explicitly requested with `cuda_sync=True`.

## Usage

### Basic Usage

```python
# Create and use a CUDA timer (default)
with nanotron_timer("my_operation"):
    # Your GPU operation here
    ...

# Explicitly specify CUDA timing
with nanotron_timer("my_operation", timer_type="cuda"):
    # Your GPU operation here
    ...

# For CPU-only operations, you can still use CPU-based timing
with nanotron_timer("cpu_operation", timer_type="cpu"):
    # Your CPU operation here
    ...

# As a decorator with default CUDA timing
@nanotron_timer
def my_function():
    # Your GPU operation here
    ...

# As a decorator with custom name
@nanotron_timer("custom_name")
def my_function():
    # Your GPU operation here
    ...

# As a decorator with CPU timing
@nanotron_timer(timer_type=TimerType.CPU)
def my_cpu_function():
    # Your CPU operation here
    ...
```

### Advanced Usage

```python
# Start and end a timer manually
timer = nanotron_timer("my_operation")
timer.start()
# Your operation here
timer.end()

# Get the elapsed time in seconds
elapsed_time = timer.elapsed

# Get the total time across all calls
total_time = timer.total_time

# Get the average time per call
avg_time = timer.average_time
```

## Considerations

1. **Synchronization**: By default, CUDA event-based timers do not force synchronization to avoid overhead. If you need more accurate timing at the cost of performance, you can set `cuda_sync=True`.

2. **Units**: CUDA events measure time in milliseconds, but the timer API converts this to seconds for consistency with the previous CPU-based timing.

3. **Fallback**: If CUDA is not available, the timer automatically falls back to CPU-based timing; a minimal sketch of this units conversion and fallback behavior follows below.
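
A minimal sketch of what such a seconds-based API with a CPU fallback could look like (illustrative only; `SimpleTimer` is a hypothetical name, not the class used in Nanotron):

```python
import time
import torch

class SimpleTimer:
    """Illustrative timer: CUDA events when a GPU is available, time.time() otherwise."""

    def __init__(self, cuda_sync: bool = False):
        self.use_cuda = torch.cuda.is_available()
        self.cuda_sync = cuda_sync
        self.elapsed = 0.0  # always stored in seconds, as described above

    def start(self):
        if self.use_cuda:
            self._start = torch.cuda.Event(enable_timing=True)
            self._end = torch.cuda.Event(enable_timing=True)
            self._start.record()
        else:
            self._t0 = time.time()

    def end(self):
        if self.use_cuda:
            self._end.record()
            if self.cuda_sync:
                torch.cuda.synchronize()  # optional full sync: more deterministic, but slower
            self._end.synchronize()  # required before reading elapsed_time()
            self.elapsed = self._start.elapsed_time(self._end) / 1000.0  # ms -> s
        else:
            self.elapsed = time.time() - self._t0
```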

## Performance Impact

Using CUDA events for timing instead of CPU-based timing with synchronization can significantly reduce overhead, especially in distributed training scenarios with thousands of GPUs.
138 changes: 138 additions & 0 deletions examples/OLMoE-1B-7B-0924-test.yml
@@ -0,0 +1,138 @@
checkpoints:
  checkpoint_interval: 1000
  checkpoints_path: /fsx/nouamane/checkpoints
  checkpoints_path_is_shared_file_system: false
  load_lr_scheduler: true
  load_optimizer: true
  resume_checkpoint_path: null
  save_final_state: true
  save_initial_state: false
data_stages:
- data:
    # dataset:
    #   dataset_folder:
    #   - /fsx/loubna/datasets/llama_tokenized/fineweb-edu/merged
    #   dataset_max_tokens: null
    #   dataset_read_path: null
    #   dataset_weights: null
    #   pad_samples_to_global_batch_size: false
    #   return_positions: true
    #   shuffle_files: false
    #   skip_in_stream: false
    #   token_size_in_bytes: 4
    #   tokenizer_name: meta-llama/Llama-3.2-1B
    #   use_old_brrr_dataloader: false
    #   vocab_size: 128256
    num_loading_workers: 1
    seed: 6198
  name: Stable Training Stage
  start_training_step: 1
general:
  benchmark_csv_path: null
  consumed_train_samples: null
  ignore_sanity_checks: true
  project: olmoe
  run: olmoe-test
  seed: 6198
  step: null
lighteval: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
  metrics_logging: null
model:
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.02
    # scaling_method: NONE
  make_vocab_size_divisible_by: 1
  model_config:
    _attn_implementation: flash_attention_2
    _fused_rms_norm: true
    _fused_rotary_emb: true
    _use_doc_masking: true
    _use_qkv_packed: true
    attention_bias: false
    bos_token_id: 1
    eos_token_id: 0
    flex_attention_mask: null
    hidden_act: silu
    hidden_size: 2048
    initializer_range: 0.02
    intermediate_size: 2048
    is_qwen2_config: true
    max_position_embeddings: 4096
    no_rope_layer: null
    num_attention_heads: 16
    num_hidden_layers: 2
    num_key_value_heads: 16
    pad_token_id: 1
    pretraining_tp: 1
    rms_norm_eps: 1.0e-06
    rope_interleaved: false
    rope_scaling: null
    rope_theta: 10000.0
    sliding_window_size: 20
    tie_word_embeddings: false
    use_cache: true
    vocab_size: 128256
    z_loss_enabled: false
    moe_config:
      num_experts: 8
      top_k: 2
      moe_hidden_size: 2048
      moe_intermediate_size: 1024 # output_multiplier=0.5 for swiglu
      # shared_expert_hidden_size: 2048
      # shared_expert_intermediate_size: 1024
      # router_aux_loss_coef: 0.01
      # enable_shared_expert: false
      # token_dispatcher_type: allgather
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 1.0e-4
    lr_decay_starting_step: null
    lr_decay_steps: 31998
    lr_decay_style: cosine
    lr_warmup_steps: 500
    lr_warmup_style: linear
    min_decay_lr: 4.0e-5
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-8
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.1
  weight_decay_exclude_named_params: []
  zero_stage: 0
parallelism:
  context_parallel_size: 1
  dp: 2
  expert_parallel_size: 1
  expert_data_parallel_size: 2
  pp: 1
  pp_engine: 1f1b
  recompute_layer: false
  tp: 1
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
  tp_recompute_allgather: true
  enabled_moe: true
profiler: null
s3_upload: null
# tokenizer:
#   tokenizer_max_length: null
#   tokenizer_name_or_path: meta-llama/Llama-3.2-1B
#   tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 4
  sequence_length: 4096
  train_steps: 1000
  val_check_interval: -1
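
As a back-of-the-envelope check of what this `moe_config` implies, assuming SwiGLU experts with three weight matrices each (router, shared expert, and embeddings excluded; the helper below is illustrative, not part of Nanotron):

```python
def moe_ffn_params(hidden_size=2048, moe_intermediate_size=1024, num_experts=8, top_k=2):
    # SwiGLU expert: gate, up and down projections -> 3 * hidden * intermediate weights
    per_expert = 3 * hidden_size * moe_intermediate_size
    total = num_experts * per_expert  # parameters stored per MoE layer
    active = top_k * per_expert       # parameters actually used per token
    return per_expert, total, active

print(moe_ffn_params())  # (6291456, 50331648, 12582912) per layer, weights only
```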
22 changes: 15 additions & 7 deletions examples/config_qwen.py
@@ -30,7 +30,7 @@
    "410m": (24, 1024, 16, 16, 4096), # ~410M params
    # Small to medium models
    "1b": (16, 2048, 16, 16, 5632), # ~1B params
    "3b": (28, 2048, 16, 2, 11008), # ~3B params
    "3b": (36, 2048, 16, 4, 11008), # ~3B params
    # Standard sizes
    "7b": (32, 4096, 32, 32, 11008), # ~7B params
    "13b": (40, 5120, 40, 40, 13824), # ~13B params
@@ -47,7 +47,7 @@ def get_args():
    parser.add_argument(
        "--model",
        choices=MODEL_SIZES.keys(),
        default="custom",
        default="3b",
        help="Model size to generate config for (e.g., 7b, 13b)",
    )
    parser.add_argument(
@@ -76,6 +76,10 @@ def get_args():
    tokens_group.add_argument("--mbs", type=int, default=3, help="Micro batch size")
    tokens_group.add_argument("--acc", type=int, default=1, help="Batch accumulation per replica")

    # checkpoints
    checkpoints_group = parser.add_argument_group("checkpoints")
    checkpoints_group.add_argument("--ckpt-save", type=int, default=10, help="Checkpoint save interval")

    args = parser.parse_args()
    return args

@@ -108,7 +112,7 @@ def get_model_config(model_size: str) -> Qwen2Config:
        is_qwen2_config=True,
        pad_token_id=None,
        _attn_implementation="flash_attention_2",
        sliding_window_size=20,
        _use_doc_masking=True,
    )


@@ -154,7 +158,7 @@ def calculate_parameters(model_config: Qwen2Config) -> str:

def create_config(model_config: Qwen2Config, args: argparse.Namespace) -> Config:
    learning_rate = LRSchedulerArgs(
        learning_rate=3e-4, lr_warmup_steps=2, lr_warmup_style="linear", lr_decay_style="cosine", min_decay_lr=1e-5
        learning_rate=3e-4, lr_warmup_steps=2000, lr_warmup_style="linear", lr_decay_style="cosine", min_decay_lr=0
    )
    parallelism = ParallelismArgs(
        dp=args.dp,
@@ -175,7 +179,7 @@ def create_config(model_config: Qwen2Config, args: argparse.Namespace) -> Config
    )
    optimizer = OptimizerArgs(
        zero_stage=args.zero,
        weight_decay=0.01,
        weight_decay=0.1,
        clip_grad=1.0,
        accumulate_grad_in_fp32=True,
        learning_rate_scheduler=learning_rate,
@@ -192,7 +196,7 @@ def create_config(model_config: Qwen2Config, args: argparse.Namespace) -> Config

    return Config(
        general=GeneralArgs(project="debug", run=args.run, seed=seed, ignore_sanity_checks=args.no_sanity),
        checkpoints=CheckpointsArgs(checkpoints_path=checkpoints_path, checkpoint_interval=10),
        checkpoints=CheckpointsArgs(checkpoints_path=checkpoints_path, checkpoint_interval=args.ckpt_save),
        parallelism=parallelism,
        model=ModelArgs(init_method=RandomInit(std=0.025), model_config=model_config),
        # tokenizer=TokenizerArgs("HuggingFaceTB/cosmo2-tokenizer"),
@@ -219,7 +223,11 @@ def create_config(model_config: Qwen2Config, args: argparse.Namespace) -> Config
    world_size = args.dp * args.tp * args.pp * args.cp
    if world_size <= 8:
        print(
            f"CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node={world_size} run_train.py --config-file {args.out}"
            f"ENABLE_TIMERS=1 DEBUG_CPU=1 STATS_SAMPLING_INTERVAL_IN_SEC=1 CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node={world_size} run_train.py --config-file {args.out}"
        )
        print("You can also use environment variables for more debugging:")
        print(" - ENABLE_TIMERS=1: Enable detailed timing information")
        print(" - DEBUG_CPU=1: Log CPU and memory usage statistics")
        print(" - STATS_SAMPLING_INTERVAL_IN_SEC=1: Set sampling interval for metrics collection")
    else:
        print("Checkout slurm_launcher.py to launch a multi-node job")