Conversation

@niklasmei
Collaborator

This adds a module for unsupervised pretraining using BERT-style mask prediction.

I originally made it to pretrain a model that returns a latent view of the input sequences along with a single vector representing the whole sequence. That was because I used a model based on DeepIce, where I had the cls-token and the processed sequence. Some variable names still reflect this original use.

In the version here, providing a vector that summarizes the input data is optional; it is only used to predict a summary feature (the total charge in an event, in the standard case).

It is important that the model to be pretrained does not change the number of sequence elements, beyond optionally adding a summary vector such as a cls-token. Other than that, this pretraining module should be indifferent to the model being pretrained.
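The BERT-style masking described above can be sketched roughly as follows (a minimal illustration, not the PR's actual code; the function name and the choice of masking whole feature rows are assumptions):

```python
import torch


def mask_sequence(x, mask_prob=0.15, mask_value=0.0):
    """Randomly mask whole feature rows of a sequence, BERT-style.

    Returns the masked sequence, the boolean mask, and the original
    values at the masked positions (the reconstruction targets).
    """
    mask = torch.rand(x.shape[0]) < mask_prob
    targets = x[mask].clone()
    x_masked = x.clone()
    x_masked[mask] = mask_value
    return x_masked, mask, targets
```

A backbone that preserves the number of sequence elements can then be applied to `x_masked`, and its outputs at the masked positions compared against `targets`.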

Collaborator

@Aske-Rosted left a comment


Hi Niklas, thanks for this contribution, and sorry for taking so long to have a look at it.

In its current form, the code puts all the functionality in one large file. GraphNeT is structured in a modular way, such that the user can quickly swap out detector classes, model backbones, tasks, etc.

Therefore, in order to accept this new functionality, it has to be separated out such that each of the new pieces of functionality is put in its respective location. For example, the loss functions defined on l.23 and l.33 should be moved to src/graphnet/training/loss_functions.py and make use of the LossFunction base class.

Functionality should also not duplicate code already available within GraphNeT. Using the loss functions as an example: dense_mse_loss should probably either inherit from MSELoss, with the different behavior enabled by an argument to the existing class, or the scatter functionality should be a wrapper taking a loss function as input, or be functionality enabled in the base LossFunction class.
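The wrapper option could look roughly like this (a sketch only, not GraphNeT's actual LossFunction API; the class name and the per-event scatter-mean are illustrative assumptions):

```python
import torch


class ScatterLossWrapper(torch.nn.Module):
    """Wrap an elementwise loss and average it per event.

    `batch` assigns each element (node/pulse) to an event, as in
    torch_geometric's `Data.batch` attribute.
    """

    def __init__(self, elementwise_loss):
        super().__init__()
        self.loss = elementwise_loss

    def forward(self, prediction, target, batch):
        per_element = self.loss(prediction, target)  # shape: (n_elements,)
        n_events = int(batch.max()) + 1
        # Scatter-mean: sum the per-element losses into their events,
        # divide by the number of elements per event.
        sums = torch.zeros(n_events).scatter_add_(0, batch, per_element)
        counts = torch.zeros(n_events).scatter_add_(
            0, batch, torch.ones_like(per_element)
        )
        return (sums / counts).mean()  # mean over events
```

Any elementwise loss (e.g. an unreduced squared error) could then be dropped in without duplicating the scatter logic.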

Since this is new functionality, it would also be nice to have a minimal working mock example/test running on some of the available example data, which can be used for code checks to ensure stability under future development, as well as to give users insight into how the functionality is used.

If you have questions about how to best start this process or anything you find to be unclear then I would be happy to try and provide answers.

@Aske-Rosted Aske-Rosted self-assigned this Jan 23, 2026
@niklasmei
Collaborator Author

Hi Aske, thanks for the review.

With the last commit I removed the loss functions and added a Negative Cosine Loss in the file you specified. For the MSE I now use the pre-existing MSELoss.
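A negative cosine similarity loss of this kind (as used in self-supervised setups such as SimSiam) can be sketched as follows; the function name is illustrative, not necessarily what was committed:

```python
import torch
import torch.nn.functional as F


def negative_cosine_loss(prediction, target):
    """Negative cosine similarity, averaged over the batch.

    Evaluates to -1 for perfectly aligned vectors and +1 for opposite
    ones, so minimizing it pulls the predicted directions towards the
    targets.
    """
    return -F.cosine_similarity(prediction, target, dim=-1).mean()
```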

The rest of the modules in my file I left untouched, because I feel they make sense there. An exception might be standard_maskpred_net, which one could argue should move elsewhere. But at the same time, it is only for the case where a user does not want to specify their own network for the charge prediction.

As to the simple example: I put it in the examples section just to have it somewhere; we can move it wherever you think best. All the example does is run on some data to check that everything works and to illustrate how the module is supposed to be used. I also put in one line showing how to save the state of the model that is to be pretrained. There is only one small problem with this example: I put the file path to the data as "/ptmp/mpp/nikme/graphnet/data/examples/sqlite/prometheus/prometheus-events.db", which is obviously not generally usable, and I would be grateful for a tip on how to generalize that.

Otherwise, I hope my modules are better now, but let me know if there is anything else I can do.

@Aske-Rosted
Collaborator


Thanks a lot for the updates and the quick follow-up.

Concerning the prometheus database path: looking at the other examples,

```python
from graphnet.constants import EXAMPLE_DATA_DIR

db_path = f"{EXAMPLE_DATA_DIR}/sqlite/prometheus/prometheus-events.db"
```

should give you the correct path to the prometheus event database.

I would like to come back to the structural issue I raised in my previous review. The current implementation still places several distinct pieces of functionality in a single file. An overview of the different components that make up a model can be found here.

The current location, src/graphnet/models/gnn/pretraining_maskpred.py, is where backbones (specific neural-network architecture classes) are defined. The way I see it, the additional functionality you want to add is the following:

  1. Data augmentation: a masking of the input data, applied (after?) the data_representation has been applied to the raw input from the files.
  2. A way to keep the masked-out data so they can be used as targets for a task.
  3. A task that takes a model output along with a target in the shape of whatever data_representation was used.
  4. An optional alteration of the masked-out data (like the custom_charge_target).

These are the different functionalities; we have to determine where each of them lives within the pre-existing framework.

There are several ways to accomplish this in a way that is compliant with the current framework. The approach I would suggest is the following:

A data_representation that takes another data_representation as input and applies the augmentation to it, either saving the masked targets in the same data object, if possible, or returning an additional data object alongside the usual one. One could then create a task which handles how the output of the backbone is connected to the target of reconstructing parts of the masked-out input features. The model class might have to be slightly altered so that it listens for Tasks which require information apart from the output of the backbone.
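The wrapping idea might look roughly like this (a sketch under assumptions: the class name and the attributes `x`, `mask_target`, and `feature_mask` are illustrative, not GraphNeT's actual API; `inner` stands in for any existing data_representation, and the data object is anything with an `x` feature tensor, such as torch_geometric's `Data`):

```python
import torch


class MaskedRepresentation(torch.nn.Module):
    """Wrap another data representation, mask some of its features,
    and stash the original values on the same data object so a Task
    can later use them as targets."""

    def __init__(self, inner, mask_prob=0.15, mask_value=0.0):
        super().__init__()
        self.inner = inner
        self.mask_prob = mask_prob
        self.mask_value = mask_value

    def forward(self, data):
        data = self.inner(data)  # build the usual representation first
        mask = torch.rand(data.x.shape[0]) < self.mask_prob
        data.mask_target = data.x[mask].clone()  # targets for the task
        data.feature_mask = mask
        data.x = data.x.clone()
        data.x[mask] = self.mask_value  # overwrite masked rows
        return data
```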

It is my impression that this will produce less duplicate code and works within the current data_representation -> backbone -> Task framework that most users are familiar with, but I am open to discussion.

I want to emphasize that I am well aware that this is quite a lot of work. However, if we want these additional features to be available in GraphNeT, they have to be implemented in a way that works with the modular framework of GraphNeT.

Another concern I have with the current implementation is that, unless I am missing something, in the case of the KNN data_representation the current code does not seem to re-compute the edges between the nodes. This is not an issue for the simple model used here, but it could lead to incorrect behavior if these edges were used in the chosen backbone.
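For concreteness: if masked features include the coordinates the graph was built from, the kNN edges would in principle have to be rebuilt after masking. A plain-torch sketch of such a recomputation (the helper name is hypothetical; GraphNeT's actual KNN representation is not shown here):

```python
import torch


def knn_edges(pos, k):
    """Recompute directed k-nearest-neighbour edges from positions.

    Returns an edge_index of shape (2, n * k), self-loops excluded.
    """
    dist = torch.cdist(pos, pos)          # pairwise distances (n, n)
    dist.fill_diagonal_(torch.inf)        # exclude self-loops
    neighbours = dist.topk(k, largest=False).indices  # (n, k)
    source = neighbours.reshape(-1)
    target = torch.arange(pos.shape[0]).repeat_interleave(k)
    return torch.stack([source, target])
```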

Again feel free to ask further questions and I will try to answer to the best of my ability.

@niklasmei
Collaborator Author

niklasmei commented Jan 26, 2026

Thanks again for the feedback.

I feel I should clarify that my pretraining model is meant to be used similarly to the Standard_Model or the Normalizing_Flow, in the sense that the user supplies their data_representation and backbone (no task in the usual sense, since there is no typical supervised target). Therefore I don't quite understand why I would reframe my augmentation as a data_representation. All it should do is mask some features and extract the original values, along with a sense of their position within the input tensors in the shape of the mask. I want to stress that the augmentation does not touch the graph structure of the input data and really only overwrites some feature values. Therefore I also don't think the edges need to be recalculated. I am sure it would be possible to do all this with a data_representation, but again I don't see the point, as the structure stays untouched. Maybe I am just missing something, so please let me know if I am wrong here.

What I suggest that I could do for now is the following:

  1. Put the main pretraining_maskpred.py in the models folder instead of models/gnn. Maybe I caused some confusion about the intended use of my pretraining framework by putting it next to backbones like Dynedge.
  2. Try to make the handling of the loss calculation happen in a Task.

Please let me know your thoughts, especially regarding the augmentation. In the meantime, I will look into how I could do it as a data_representation.

@Aske-Rosted
Collaborator

Aske-Rosted commented Jan 27, 2026

If you are not "removing" entire nodes from the data_representation, then I agree with you that the structure is unchanged. That does, however, introduce a discussion about fill values. For the log-scaled charge, the time, and the x, y, z coordinates, the value 0 is a quantity that can legitimately come up in the data_representation, so using it as the mask_value might not be the best approach. This is not an issue in the same way for BERT, which usually works on string sequences. Your learned_value implementation does address this in part.
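The learned-value idea can be sketched as a trainable mask token replacing the fixed fill value (an illustrative sketch, not the PR's actual learned_value code; the class and attribute names are assumptions):

```python
import torch


class LearnedMask(torch.nn.Module):
    """Replace masked feature rows with a learnable vector.

    Avoids the ambiguity of a fixed fill value such as 0, which is a
    legitimate value for log-scaled charge, time, or position.
    """

    def __init__(self, n_features):
        super().__init__()
        self.mask_token = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x, mask):
        x = x.clone()
        x[mask] = self.mask_token  # broadcast the learned row
        return x
```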

If you are open to it, I would love to have a call where we can discuss this more in depth. At the current stage there are still a number of things that are unclear to me.

@niklasmei
Collaborator Author


I am of course open to a call; that would probably be beneficial for me anyway, so that I can better understand what to improve in my modules. Just let me know when you are available.
