Question about reproducing BERT OSM tag pretraining #5

@xxn967

Hi, thank you for releasing the code, models, and data.

I am trying to reproduce the BERT OSM tag pretraining experiment using:

  • nlp_pretraining/osm_tag_embed_from_scratch.py
  • the released dataset relevant_tags_pairs_dataset_parquet_sharded_20_rep.tar.gz
  • the released initial model bert-144-l6-reset-custom-tokenizer

However, I am still seeing a significant gap compared with the released BERT checkpoint, so I would like to confirm some information and details.

From the released training artifacts and script, I understand the main setup to be roughly:

  • CachedMultipleNegativesSymmetricRankingLoss
  • num_train_epochs = 4
  • learning_rate = 1e-3
  • warmup_ratio = 0.1
  • fp16 = True
  • BatchSamplers.NO_DUPLICATES

The main difference I can clearly identify on my side is the batch configuration. I am training on 8x RTX 4090 (24 GB) with mini-batch size 320 and batch size 1600, while the released script uses 1024 and 5120. I am not sure whether this alone explains the reproduction gap, or whether I need to adjust the learning rate or other hyperparameters to compensate for the smaller batch size.
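
To make the batch-size gap concrete, here is a small sketch of two common heuristics for rescaling the learning rate when the batch shrinks (linear and square-root scaling). Neither is confirmed as the rule behind the released checkpoint. The function name `scaled_lr` is my own, and the negative counts assume that negatives come only from the same batch and are not gathered across GPUs.

```python
import math


def scaled_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "linear") -> float:
    """Rescale a learning rate for a different batch size (heuristic only)."""
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


# Values taken from this issue: released script vs. my run.
BASE_LR, BASE_BATCH, MY_BATCH = 1e-3, 5120, 1600

lr_linear = scaled_lr(BASE_LR, BASE_BATCH, MY_BATCH, "linear")
lr_sqrt = scaled_lr(BASE_LR, BASE_BATCH, MY_BATCH, "sqrt")

# In-batch negatives per anchor in one direction (assumption: same-batch only).
negatives_released = BASE_BATCH - 1
negatives_mine = MY_BATCH - 1

print(f"linear: {lr_linear:.3e}, sqrt: {lr_sqrt:.3e}")
print(f"negatives per anchor: {negatives_released} vs {negatives_mine}")
```

Under these assumptions the two rules suggest a learning rate of roughly 3.1e-4 or 5.6e-4 instead of 1e-3, and each anchor sees 1599 negatives instead of 5119, so the smaller batch changes both the optimizer step and the difficulty of the contrastive task.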

Then I would like to confirm a few details.

  1. Is bert-144-l6-reset-custom-tokenizer fully randomly initialized from a fresh BertConfig, or does it inherit any pretrained weights in any form?
    From the released weights statistics, it looks consistent with random BERT initialization (std around 0.02), but I want to confirm this explicitly.

  2. Are there any important training details not fully captured by osm_tag_embed_from_scratch.py that may materially affect the final model?
    For example:

    • exact number of GPUs used (I guess the original experiment used four?)
    • exact global batch size
    • seed / data seed
    • exact sentence-transformers / transformers versions
    • whether shard order was fixed/sorted before loading the parquet files
    • whether the published checkpoint was trained with the script exactly as released

  3. Is the released BERT checkpoint bert-144-osm-tags-embed-from_scratch_torch19 expected to be reproducible from the released dataset + released init model + released training script alone, or were there additional environment/code details involved?
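
For context on the shard-order, seed, and version questions above, this is a minimal sketch of how I am trying to pin those factors on my side. It is not taken from the released script: the helper names (`list_shards`, `shuffled_shards`, `installed_versions`), the directory `data/shards`, and the seed value are all hypothetical.

```python
import random
from importlib import metadata
from pathlib import Path


def list_shards(data_dir: str, pattern: str = "*.parquet") -> list[str]:
    """Return shard paths in sorted order, independent of filesystem listing order."""
    return sorted(str(p) for p in Path(data_dir).glob(pattern))


def shuffled_shards(shards: list[str], seed: int) -> list[str]:
    """Shuffle shards reproducibly with a private RNG, leaving global state untouched."""
    rng = random.Random(seed)
    out = list(shards)
    rng.shuffle(out)
    return out


def installed_versions(packages: list[str]) -> dict[str, str]:
    """Record installed package versions; missing packages map to 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    shards = list_shards("data/shards")  # hypothetical directory
    print(shuffled_shards(shards, seed=42))  # hypothetical seed
    print(installed_versions(["sentence-transformers", "transformers", "torch"]))
```

If the original run loaded shards in unsorted glob order, or with a different seed or library version, two runs of the same script could see the data in a different order, which is why I am asking about these details.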

One thing that also confused me is that newer transformers versions complain about eval_strategy="steps" without eval_dataset, so I am not fully sure whether the released run used a slightly different code path or library version.
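
As a workaround on my side, I am choosing the evaluation strategy based on whether an evaluation set exists, roughly as sketched below. The helper `pick_eval_strategy` is my own and not part of the released script; the only library detail I am relying on is that `eval_strategy` accepts the string values "no" and "steps" (older transformers releases named this argument `evaluation_strategy`).

```python
from typing import Optional, Sized


def pick_eval_strategy(eval_dataset: Optional[Sized]) -> str:
    """Return "steps" only when there is data to evaluate on, otherwise "no"."""
    if eval_dataset is None or len(eval_dataset) == 0:
        return "no"
    return "steps"


# Hypothetical usage: build the keyword arguments before creating the training args.
eval_dataset = None  # no evaluation split in my reproduction
training_kwargs = {
    "num_train_epochs": 4,
    "learning_rate": 1e-3,
    "warmup_ratio": 0.1,
    "fp16": True,
    "eval_strategy": pick_eval_strategy(eval_dataset),
}
print(training_kwargs["eval_strategy"])
```

This avoids the complaint in my environment, but it also means my run may differ from the released one if that run did evaluate during training.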

Any clarification would be very helpful. Thanks a lot.
