Conversation

@niklasmei
Collaborator

This adds a module for unsupervised pretraining using BERT-style mask prediction.

I originally made it to pretrain a model that returns a latent view of the input sequences along with a single vector representing the whole sequence. That was because I used a model based on DeepIce, where I had the cls-token and the processed sequence. Some variable names still reflect this original use.

In the version here, providing a vector that summarizes the input data is optional; it is only used to predict a summary feature (the total charge in an event, in the standard case).

It is important that the model to be pretrained does not change the number of sequence elements, beyond optionally adding a summary vector such as a cls-token. Other than that, this pretraining module should be indifferent to the model being pretrained.
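The BERT-style masking described above can be sketched roughly as follows (a minimal illustration, not the PR's actual code; the function name and the choice of masking whole feature rows are assumptions):

```python
import torch


def mask_sequence(x, mask_prob=0.15, mask_value=0.0):
    """Randomly mask whole feature rows of a sequence, BERT-style.

    Returns the masked sequence, the boolean mask, and the original
    values at the masked positions (the reconstruction targets).
    """
    mask = torch.rand(x.shape[0]) < mask_prob
    targets = x[mask].clone()
    x_masked = x.clone()
    x_masked[mask] = mask_value
    return x_masked, mask, targets
```

A backbone that preserves the number of sequence elements can then be applied to `x_masked`, and its outputs at the masked positions compared against `targets`.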

Collaborator

@Aske-Rosted left a comment


Hi Niklas, thanks for this contribution, and sorry for taking so long to have a look at it.

In its current form, the code puts all the functionality in one large file. GraphNeT is structured in a modular way, such that the user can quickly swap out detector classes, model backbones, tasks, etc.

Therefore, in order to accept this new functionality, it has to be separated out such that each of the new pieces of functionality is put in its respective location. For example, the loss functions defined on l.23 and l.33 should be moved to src/graphnet/training/loss_functions.py and make use of the LossFunction base class.

Functionality should also not duplicate code already available within GraphNeT. Using the loss functions as an example: dense_mse_loss should probably either inherit from MSELoss, with the different behavior enabled by an argument to the existing class, or the scatter functionality should be a wrapper taking a loss function as input, or be functionality enabled in the base LossFunction class.
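The wrapper option could look roughly like this (a sketch only, not GraphNeT's actual LossFunction API; the class name and the per-event scatter-mean are illustrative assumptions):

```python
import torch


class ScatterLossWrapper(torch.nn.Module):
    """Wrap an elementwise loss and average it per event.

    `batch` assigns each element (node/pulse) to an event, as in
    torch_geometric's `Data.batch` attribute.
    """

    def __init__(self, elementwise_loss):
        super().__init__()
        self.loss = elementwise_loss

    def forward(self, prediction, target, batch):
        per_element = self.loss(prediction, target)  # shape: (n_elements,)
        n_events = int(batch.max()) + 1
        # Scatter-mean: sum the per-element losses into their events,
        # divide by the number of elements per event.
        sums = torch.zeros(n_events).scatter_add_(0, batch, per_element)
        counts = torch.zeros(n_events).scatter_add_(
            0, batch, torch.ones_like(per_element)
        )
        return (sums / counts).mean()  # mean over events
```

Any elementwise loss (e.g. an unreduced squared error) could then be dropped in without duplicating the scatter logic.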

Since this is new functionality, it would also be nice to have a minimal working mock example/test running on some of the available example data, which can be used for code checks to ensure stability under future development, as well as to give users insight into how the functionality is used.

If you have questions about how to best start this process or anything you find to be unclear then I would be happy to try and provide answers.

@Aske-Rosted Aske-Rosted self-assigned this Jan 23, 2026
@niklasmei
Collaborator Author

Hi Aske, thanks for the review.

With the last commit I removed the loss functions and added a Negative Cosine Loss in the file you specified. For the MSE I now use the pre-existing MSELoss.
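A negative cosine similarity loss of this kind (as used in self-supervised setups such as SimSiam) can be sketched as follows; the function name is illustrative, not necessarily what was committed:

```python
import torch
import torch.nn.functional as F


def negative_cosine_loss(prediction, target):
    """Negative cosine similarity, averaged over the batch.

    Evaluates to -1 for perfectly aligned vectors and +1 for opposite
    ones, so minimizing it pulls the predicted directions towards the
    targets.
    """
    return -F.cosine_similarity(prediction, target, dim=-1).mean()
```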

The rest of the modules in my file I left untouched, because I feel they make sense there. An exception might be standard_maskpred_net, which one could argue should move elsewhere. But at the same time, it is only for the case where a user does not want to specify their own network for the charge prediction.

As to the simple example: I put it in the examples section just to have it somewhere; we can move it wherever you think best. All the example does is run on some data to check that everything works and to illustrate how the module is supposed to be used. I also put in one line showing how to save the state of the model that is to be pretrained. There is only one small problem with this example: I put the file path to the data as "/ptmp/mpp/nikme/graphnet/data/examples/sqlite/prometheus/prometheus-events.db", which is obviously not generally usable, and I would be grateful for a tip on how to generalize that.

Otherwise, I hope my modules are better now, but let me know if there is anything else I can do.

@Aske-Rosted
Collaborator


Thanks a lot for the updates and the quick follow-up.

Concerning the prometheus database path: looking at the other examples,

```python
from graphnet.constants import EXAMPLE_DATA_DIR

db_path = f"{EXAMPLE_DATA_DIR}/sqlite/prometheus/prometheus-events.db"
```

should give you the correct path to the prometheus event database.

I would like to come back to the structural issue I raised in my previous review. The current implementation still places several distinct pieces of functionality in a single file. An overview of the different components that make up a model can be found here.

The current location, src/graphnet/models/gnn/pretraining_maskpred.py, is where backbones (specific neural-network architecture classes) are defined. The way I see it, the additional functionality you want to add is the following:

  1. Data augmentation: a masking of the input data, applied (after?) the data_representation has been applied to the raw input from the files.
  2. A way to keep the masked-out data so they can be used as targets for a task.
  3. A task that takes a model output along with a target in the shape of whatever data_representation was used.
  4. An optional alteration of the masked-out data (like the custom_charge_target).

These are the different functionalities; we have to determine where each of them lives within the pre-existing framework.

There are several ways to accomplish this in a way that is compliant with the current framework. The approach I would suggest is the following:

A data_representation that takes another data_representation as input and applies the augmentation to it, either saving the masked targets in the same data object, if possible, or returning an additional data object alongside the usual one. One could then create a task which handles how the output of the backbone is connected to the target of reconstructing parts of the masked-out input features. The model class might have to be slightly altered so that it listens for Tasks which require information apart from the output of the backbone.
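The wrapping idea might look roughly like this (a sketch under assumptions: the class name and the attributes `x`, `mask_target`, and `feature_mask` are illustrative, not GraphNeT's actual API; `inner` stands in for any existing data_representation, and the data object is anything with an `x` feature tensor, such as torch_geometric's `Data`):

```python
import torch


class MaskedRepresentation(torch.nn.Module):
    """Wrap another data representation, mask some of its features,
    and stash the original values on the same data object so a Task
    can later use them as targets."""

    def __init__(self, inner, mask_prob=0.15, mask_value=0.0):
        super().__init__()
        self.inner = inner
        self.mask_prob = mask_prob
        self.mask_value = mask_value

    def forward(self, data):
        data = self.inner(data)  # build the usual representation first
        mask = torch.rand(data.x.shape[0]) < self.mask_prob
        data.mask_target = data.x[mask].clone()  # targets for the task
        data.feature_mask = mask
        data.x = data.x.clone()
        data.x[mask] = self.mask_value  # overwrite masked rows
        return data
```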

It is my impression that this will produce less duplicate code and works within the current data_representation -> backbone -> Task framework that most users are familiar with, but I am open to discussion.

I want to emphasize that I am well aware that this is quite a lot of work. However, if we want these additional features to be available in GraphNeT, they have to be implemented in a way that works with the modular framework of GraphNeT.

Another concern I have with the current implementation is that, unless I am missing something, in the case of the KNN data_representation the current code does not seem to re-compute the edges between the nodes. This is not an issue for the simple model used here, but it could lead to incorrect behavior if these edges were used in the chosen backbone.
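For concreteness: if masked features include the coordinates the graph was built from, the kNN edges would in principle have to be rebuilt after masking. A plain-torch sketch of such a recomputation (the helper name is hypothetical; GraphNeT's actual KNN representation is not shown here):

```python
import torch


def knn_edges(pos, k):
    """Recompute directed k-nearest-neighbour edges from positions.

    Returns an edge_index of shape (2, n * k), self-loops excluded.
    """
    dist = torch.cdist(pos, pos)          # pairwise distances (n, n)
    dist.fill_diagonal_(torch.inf)        # exclude self-loops
    neighbours = dist.topk(k, largest=False).indices  # (n, k)
    source = neighbours.reshape(-1)
    target = torch.arange(pos.shape[0]).repeat_interleave(k)
    return torch.stack([source, target])
```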

Again feel free to ask further questions and I will try to answer to the best of my ability.

@niklasmei
Collaborator Author

niklasmei commented Jan 26, 2026

Thanks again for the feedback.

I feel I should clarify that my pretraining model is meant to be used similarly to the Standard_Model or the Normalizing_Flow, in the sense that the user supplies their data_representation and backbone (no task in the usual sense, since there is no typical supervised target). Therefore I don't quite understand why I would reframe my augmentation as a data_representation. All it should do is mask some features and extract the original values, along with a sense of their position within the input tensors in the shape of the mask. I want to stress that the augmentation does not touch the graph structure of the input data and really only overwrites some feature values. Therefore I also don't think the edges need to be recalculated. I am sure it would be possible to do all this with a data_representation, but again I don't see the point, as the structure stays untouched. Maybe I am just missing something, so please let me know if I am wrong here.

What I suggest that I could do for now is the following:

  1. Put the main pretraining_maskpred.py in the models folder instead of models/gnn. Maybe I caused some confusion about the intended use of my pretraining framework by putting it next to backbones like Dynedge.
  2. Try to make the handling of the loss calculation happen in a Task.

Please let me know your thoughts, especially regarding the augmentation. In the meantime, I will look into how I could do it as a data_representation.

@Aske-Rosted
Collaborator

Aske-Rosted commented Jan 27, 2026

If you are not "removing" entire nodes from the data_representation, then I agree with you that the structure is unchanged. That does, however, introduce a discussion about fill values. For the log-scaled charge, the time, and the x, y, z coordinates, the value 0 is a quantity that can legitimately come up in the data_representation, so using it as the mask_value might not be the best approach. This is not an issue in the same way for BERT, which usually works on string sequences. Your learned_value implementation does address this in part.
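The learned-value idea can be sketched as a trainable mask token replacing the fixed fill value (an illustrative sketch, not the PR's actual learned_value code; the class and attribute names are assumptions):

```python
import torch


class LearnedMask(torch.nn.Module):
    """Replace masked feature rows with a learnable vector.

    Avoids the ambiguity of a fixed fill value such as 0, which is a
    legitimate value for log-scaled charge, time, or position.
    """

    def __init__(self, n_features):
        super().__init__()
        self.mask_token = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x, mask):
        x = x.clone()
        x[mask] = self.mask_token  # broadcast the learned row
        return x
```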

If you are open to it, I would love to have a call where we can discuss this more in depth. At the current stage there are still a number of things that are unclear to me.

@niklasmei
Collaborator Author


I am of course open to a call; that would probably be beneficial for me anyway, so that I can better understand what to improve in my modules. Just let me know when you are available.
