New regression and classification datasets for ontology pre-training #130
base: dev
Conversation
…lubility regression
…tments for regression tasks and classification tasks
add loading from checkpoint pretrained model fix
I added some comments. It would be great if you could have a look at them. Also, you have added quite a number of config files. Some seem to be very specific (e.g. an ELECTRA config with a different learning rate for a specific experiment). My suggestion would be to either remove those configs (and publish them in a paper-specific Zenodo archive, or mention the parameters in the paper) or group them so that new users don't get overwhelmed (e.g. all MoleculeNet dataset configs could be one folder).
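For illustration, such a grouping could look something like the layout below (folder and file names here are purely hypothetical, not taken from the PR):

```
configs/
  data/
    moleculenet/
      bace.yml
      bbbp.yml
      tox21.yml
  model/
    electra.yml
```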
chebai/loss/semantic.py
Outdated
```diff
  use_sigmoidal_implication: bool = False,
- weight_epoch_dependent: Union[bool | tuple[int, int]] = False,
+ weight_epoch_dependent: Union[bool, Tuple[int, int]] = False,
+ weight_epoch_dependent: Union[bool, Tuple[int, int]] = False,
```
Why does `weight_epoch_dependent` appear twice here?
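(For what it's worth, if both lines really end up in the same signature, Python rejects it at compile time, so this is more than a style nit. A quick demonstration with generic names, not the actual class:

```python
# Python rejects duplicate parameter names outright:
compile("def f(x=1, x=2): pass", "<demo>", "exec")
# SyntaxError: duplicate argument 'x' in function definition
```
)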
chebai/models/base.py
Outdated
```diff
  if self.pass_loss_kwargs:
      loss_kwargs = loss_kwargs_candidates
-     loss_kwargs["current_epoch"] = self.trainer.current_epoch
+     # loss_kwargs["current_epoch"] = self.trainer.current_epoch
```
Why is this commented out? AFAIK we don't have any loss function at the moment that needs this (it was added for some experimental semantic loss features that didn't perform well). Does this break anything?
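For context, `current_epoch` exists to support epoch-dependent loss weighting (cf. `weight_epoch_dependent` in the diff above). A minimal sketch of how such a weight could be computed; the linear ramp and the helper name are assumptions, not the repo's actual logic:

```python
from typing import Tuple, Union

def epoch_weight(
    current_epoch: int,
    weight_epoch_dependent: Union[bool, Tuple[int, int]] = False,
) -> float:
    """Hypothetical helper: scale factor for an auxiliary loss term."""
    if not isinstance(weight_epoch_dependent, tuple):
        return 1.0  # treat plain booleans as "no ramp" in this sketch
    start, end = weight_epoch_dependent
    if current_epoch <= start:
        return 0.0
    if current_epoch >= end:
        return 1.0
    # ramp linearly from 0 to 1 between the two epochs
    return (current_epoch - start) / (end - start)
```

The commented-out line in `base.py` would then be the place where the trainer hands its epoch to such a loss term.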
chebai/models/electra.py
Outdated
```diff
- from chebai.loss.semantic import DisjointLoss as ElectraChEBIDisjointLoss  # noqa
+ # TODO: put back in before pull request
+ # from chebai.loss.semantic import DisjointLoss as ElectraChEBIDisjointLoss  # noqa
```
I guess you wanted to uncomment this :)
This will be a problem for merging. I have added new SMILES tokens on a different branch (from PubChem), so the new PubChem-pretrained model (and all models based on that) will depend on those tokens.
Are the tokens you added here actually used by a model, or are they just artifacts?
I have removed the part in question and will open an issue to look into what is going on with this.
Is there a reason for deleting this file?
restructuring of config files
fixing small issues from merging
addressed all comments
chebai/preprocessing/reader.py
Outdated
```diff
  def _get_token_index(self, token: str) -> int:
      """Returns a unique number for each token, automatically adds new tokens."""
+     print(str(token))
```
I assume this is a leftover from debugging?
Yes, will remove it!
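To make the merge concern above concrete: readers like this typically treat the token file as an append-only list whose positions double as embedding-row indices. A rough sketch of the pattern (simplified, not the repo's exact code):

```python
from typing import Dict

class TokenRegistrySketch:
    """Append-only token registry; an index identifies an embedding row."""

    def __init__(self) -> None:
        self._index: Dict[str, int] = {}  # token -> position of first appearance

    def _get_token_index(self, token: str) -> int:
        """Returns a unique number for each token, automatically adds new tokens."""
        if token not in self._index:
            self._index[token] = len(self._index)  # new tokens go to the end
        return self._index[token]
```

Because indices are positional, two branches that each append different new tokens assign conflicting numbers, which is why tokens added on one branch can break models pretrained on the other.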
Lint issues fixed. Unit tests most likely need to be adjusted; missing labels might cause issues in some places.
The unit tests can be fixed by adjusting the mock data for the Tox21Challenge dataset. You just need to add the `missing_labels` key. Below are the fixed functions for `Tox21ChallengeMockData`:

```python
@staticmethod
def data_in_dict_format() -> List[Dict]:
    data_list = [
        {
            "labels": [None, None, None, None, None, None, None, None, None, 0, None, None],
            "ident": "25848",
        },
        {
            "labels": [0, None, None, 1, None, None, None, None, None, None, None, None],
            "ident": "2384",
        },
        {
            "labels": [0, None, 0, None, None, None, None, None, None, None, None, None],
            "ident": "27102",
        },
        {
            "labels": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            "ident": "26792",
        },
        {
            "labels": [None, None, None, None, None, None, None, 1, None, 1, None, None],
            "ident": "26401",
        },
        {
            "labels": [None, None, None, None, None, None, None, None, None, None, None, None],
            "ident": "25973",
        },
    ]
    for dict_ in data_list:
        dict_["features"] = Tox21ChallengeMockData.FEATURE_OF_SMILES
        dict_["group"] = None
        # missing labels get added here
        if any(label is None for label in dict_["labels"]):
            dict_["missing_labels"] = [label is None for label in dict_["labels"]]
    return data_list
```

```python
@staticmethod
def get_setup_processed_output_data() -> List[Dict]:
    """
    Returns mock processed data used for testing the `setup_processed` method.

    The data contains molecule identifiers and their corresponding toxicity labels
    for multiple endpoints. Each dictionary in the list represents a molecule with
    its associated labels, features, and group information.

    Returns:
        List[Dict]: A list of dictionaries where each dictionary contains:
            - "features": The SMILES features of the molecule.
            - "labels": A list of toxicity endpoint labels (0, 1, or None).
            - "ident": The sample identifier.
            - "group": None (default value for the group key).
    """
    # Endpoint order: "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER",
    # "NR-ER-LBD", "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53"
    data_list = [
        {
            "labels": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            "ident": "NCGC00260869-01",
        },
        {
            "labels": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            "ident": "NCGC00261776-01",
        },
        {
            "labels": [None, None, None, None, None, None, None, None, None, None, None, None],
            "ident": "NCGC00261380-01",
        },
        {
            "labels": [0, 0, 0, None, 0, 0, 0, 0, 0, 0, None, 1],
            "ident": "NCGC00261842-01",
        },
        {
            "labels": [0, 0, 1, None, 1, 1, 1, None, 1, 1, None, 1],
            "ident": "NCGC00261662-01",
        },
        {
            "labels": [0, 0, None, None, 1, 0, 0, 1, 0, 0, 1, 1],
            "ident": "NCGC00261190-01",
        },
    ]
    complete_list = []
    for dict_ in data_list:
        complete_list.append(
            {
                "features": Tox21ChallengeMockData.FEATURE_OF_SMILES,
                **dict_,
                "group": None,
            }
        )
        # add missing labels
        if any(label is None for label in dict_["labels"]):
            complete_list[-1]["missing_labels"] = [
                label is None for label in dict_["labels"]
            ]
    return complete_list
```
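As a usage note, the point of the `missing_labels` mask is that partially labelled Tox21 samples should not contribute loss on unlabelled endpoints. A sketch of how such a mask is typically applied (not the repo's actual loss code; `masked_bce` is a made-up name):

```python
import torch

def masked_bce(
    logits: torch.Tensor,   # (batch, 12) raw model outputs
    labels: torch.Tensor,   # (batch, 12) 0/1 targets; missing entries filled with 0
    missing: torch.Tensor,  # (batch, 12) boolean, True where the label was None
) -> torch.Tensor:
    """BCE averaged over observed endpoints only."""
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none"
    )
    keep = ~missing
    # guard against batches where every endpoint is missing
    return (loss * keep).sum() / keep.sum().clamp(min=1)
```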