Question about reproducing BERT OSM tag pretraining

Hi, thank you for releasing the code, models, and data.

I am trying to reproduce the BERT OSM tag pretraining experiment using:
- `nlp_pretraining/osm_tag_embed_from_scratch.py`
- the released dataset `relevant_tags_pairs_dataset_parquet_sharded_20_rep.tar.gz`
- the released initial model `bert-144-l6-reset-custom-tokenizer`

However, I am still seeing a significant gap compared with the released BERT checkpoint, so I would like to confirm some details of the original setup.

From the released training artifacts and script, I understand the main setup to be roughly:
- `CachedMultipleNegativesSymmetricRankingLoss`
- `num_train_epochs = 4`
- `learning_rate = 1e-3`
- `warmup_ratio = 0.1`
- `fp16 = True`
- `BatchSamplers.NO_DUPLICATES`

The main difference I can clearly identify on my side is the batch-related setting. I am training on 8x RTX 4090 (24 GB) with a mini-batch size of 320 and a batch size of 1600, whereas the released script uses 1024 and 5120. I am not sure whether this alone explains the reproduction gap, or whether I need to adjust the learning rate or other hyperparameters to compensate for the smaller batch size.
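To make the learning-rate question concrete, here is a minimal sketch of the two common rules of thumb (linear and square-root scaling) applied to my numbers. The function name `scaled_learning_rate` is hypothetical, and I am assuming the 1600 and 5120 values are directly comparable (same number of devices). Neither rule is confirmed to be appropriate for this loss: with in-batch negatives, a smaller batch also means fewer negatives per anchor, which a learning-rate change may not compensate for.

```python
import math


def scaled_learning_rate(base_lr, base_batch, new_batch, rule="linear"):
    """Rescale a learning rate for a different batch size.

    "linear" multiplies by new_batch / base_batch,
    "sqrt" multiplies by the square root of that ratio.
    Both are general heuristics, not settings from the released script.
    """
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


# Values from this issue: the script uses lr=1e-3 with batch size 5120,
# my run uses batch size 1600.
reference_lr = 1e-3
reference_batch = 5120
my_batch = 1600

for rule in ("linear", "sqrt"):
    lr = scaled_learning_rate(reference_lr, reference_batch, my_batch, rule)
    print(f"{rule}: {lr:.3e}")
```

I have not verified that either scaled value closes the gap; this is only the adjustment I would try first.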

I would also like to confirm a few details.

1. Is `bert-144-l6-reset-custom-tokenizer` fully randomly initialized from a fresh `BertConfig`, or does it inherit any pretrained weights in any form?
   From the statistics of the released weights, it looks consistent with random BERT initialization (std around 0.02), but I want to confirm this explicitly.

2. Are there any important training details not fully captured by `osm_tag_embed_from_scratch.py` that may materially affect the final model?
   For example:
   - exact number of GPUs used (I guess the original experiment used four?)
   - exact global batch size
   - seed / data seed
   - exact `sentence-transformers` / `transformers` versions
   - whether shard order was fixed/sorted before loading the parquet files
   - whether the published checkpoint was trained with the script exactly as released

3. Is the released BERT checkpoint `bert-144-osm-tags-embed-from_scratch_torch19` expected to be reproducible from the released dataset + released init model + released training script alone, or were there additional environment/code details involved?
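For the library-version part of item 2, a small sketch like the following would let both sides compare environments. The helper name `collect_versions` and the default package list are my own assumptions, not something taken from the released code.

```python
from importlib import metadata
import platform


def collect_versions(packages=("sentence-transformers", "transformers",
                               "torch", "datasets", "accelerate")):
    """Return a dict mapping 'python' and each package name to its version.

    Packages that are not installed are reported as "not installed"
    instead of raising an error.
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Print one "name==version" line per entry for pasting into an issue.
for name, version in collect_versions().items():
    print(f"{name}=={version}")
```

I am happy to post the output from my environment if that helps narrow things down.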

One thing that also confused me is that newer `transformers` versions complain about `eval_strategy="steps"` without `eval_dataset`, so I am not fully sure whether the released run used a slightly different code path or library version.
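One possible workaround is sketched below, assuming evaluation can simply be disabled when no `eval_dataset` is passed. The helper `resolve_eval_strategy` and the `training_kwargs` dict are hypothetical and only mirror the hyperparameters listed above; I do not know whether the released run did anything like this. Older `transformers` releases name this argument `evaluation_strategy` instead.

```python
def resolve_eval_strategy(requested, eval_dataset):
    """Fall back to "no" when there is no dataset to evaluate on."""
    if eval_dataset is None:
        return "no"
    return requested


# Keyword arguments intended for the training-arguments object.
# Only eval_strategy differs from the setup listed in this issue.
training_kwargs = {
    "num_train_epochs": 4,
    "learning_rate": 1e-3,
    "warmup_ratio": 0.1,
    "fp16": True,
    "eval_strategy": resolve_eval_strategy("steps", eval_dataset=None),
}

print(training_kwargs["eval_strategy"])
```

If the original run did pass an evaluation set, this fallback would change the training loop's logging but should not change the optimizer steps themselves.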

Any clarification would be very helpful. Thanks a lot.



