Hi, thank you for releasing the code, models, and data.
I am trying to reproduce the BERT OSM tag pretraining experiment using:
- the released training script
  nlp_pretraining/osm_tag_embed_from_scratch.py
- the released dataset
  relevant_tags_pairs_dataset_parquet_sharded_20_rep.tar.gz
- the released initial model
  bert-144-l6-reset-custom-tokenizer
However, I am still seeing a significant gap compared with the released BERT checkpoint, so I would like to confirm some information and details.
From the released training artifacts and script, I understand the main setup to be roughly:
- CachedMultipleNegativesSymmetricRankingLoss
- num_train_epochs = 4
- learning_rate = 1e-3
- warmup_ratio = 0.1
- fp16 = True
- BatchSamplers.NO_DUPLICATES
The main difference I can clearly identify on my side is the batch configuration. I am training on 8x RTX 4090 (24 GB) with a mini-batch size of 320 and a batch size of 1600, whereas the released script uses 1024 and 5120, respectively. I am not sure whether this alone explains the reproduction gap, or whether I need to adjust the learning rate or other hyperparameters to compensate for the smaller batch size.
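As a point of reference, two common heuristics for compensating for a smaller batch are linear and square-root learning-rate scaling. The snippet below is only a minimal sketch of that arithmetic, assuming that 1600 and 5120 are directly comparable batch sizes; the helper name `scaled_learning_rates` is hypothetical and not part of the released script. Neither rule is confirmed to suit this loss, where the batch size also determines the number of in-batch negatives.

```python
import math


def scaled_learning_rates(base_lr, base_batch, new_batch):
    """Return learning-rate candidates under two common scaling heuristics."""
    ratio = new_batch / base_batch
    return {
        "linear": base_lr * ratio,           # lr proportional to batch size
        "sqrt": base_lr * math.sqrt(ratio),  # lr proportional to sqrt(batch size)
    }


# Values taken from this issue: the script uses lr=1e-3 with batch size 5120,
# while the reproduction run uses batch size 1600 (assumed comparable).
candidates = scaled_learning_rates(base_lr=1e-3, base_batch=5120, new_batch=1600)
for rule, lr in candidates.items():
    print(f"{rule}: {lr:.3e}")
# linear: 3.125e-04
# sqrt: 5.590e-04
```

These are starting points for a small sweep, not values the released run is known to have used.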
Then I would like to confirm a few details.
- Is bert-144-l6-reset-custom-tokenizer fully randomly initialized from a fresh BertConfig, or does it inherit any pretrained weights in any form?
  From the released weight statistics, it looks consistent with random BERT initialization (std around 0.02), but I want to confirm this explicitly.
- Are there any important training details not fully captured by osm_tag_embed_from_scratch.py that may materially affect the final model? For example:
  - exact number of GPUs used (I guess the original experiment used four?)
  - exact global batch size
  - seed / data seed
  - exact sentence-transformers / transformers versions
  - whether shard order was fixed/sorted before loading the parquet files
  - whether the published checkpoint was trained with the script exactly as released
- Is the released BERT checkpoint bert-144-osm-tags-embed-from_scratch_torch19 expected to be reproducible from the released dataset + released init model + released training script alone, or were there additional environment/code details involved?
  One thing that also confused me is that newer transformers versions complain about eval_strategy="steps" without eval_dataset, so I am not fully sure whether the released run used a slightly different code path or library version.
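To make the version question concrete, the library versions of an environment can be recorded with a short script like the one below. This is a minimal sketch using only the Python standard library; the function name `report_versions` and the choice of packages to list are assumptions for illustration, not something taken from the released code.

```python
import platform
from importlib import metadata


def report_versions(packages=("sentence-transformers", "transformers", "torch")):
    """Collect installed versions of the packages relevant to this run."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is absent from this environment.
            versions[name] = "not installed"
    return versions


env = report_versions()
for name, version in env.items():
    print(f"{name}: {version}")
```

Comparing this output between the original training environment and a reproduction environment would show whether the eval_strategy behaviour comes from a library version difference.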
Any clarification would be very helpful. Thanks a lot.