New regression and classification datasets for ontology pre-training #130
base: dev
Conversation
…lubility regression
…tments for regression tasks and classification tasks
add loading from checkpoint pretrained model fix
I added some comments. It would be great if you could have a look at them. Also, you have added quite a number of config files. Some seem to be very specific (e.g. an ELECTRA config with a different learning rate for a specific experiment). My suggestion would be to either remove those configs (and publish them in a paper-specific Zenodo archive, or mention the parameters in the paper) or group them so that new users don't get overwhelmed (e.g. all MoleculeNet dataset configs could be one folder).
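For illustration, such a grouping could look something like the layout below (folder and file names here are purely hypothetical, not taken from the PR):

```
configs/
  data/
    moleculenet/
      bace.yml
      bbbp.yml
      tox21.yml
  model/
    electra.yml
```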
chebai/loss/semantic.py
Outdated
```diff
  use_sigmoidal_implication: bool = False,
- weight_epoch_dependent: Union[bool | tuple[int, int]] = False,
+ weight_epoch_dependent: Union[bool, Tuple[int, int]] = False,
+ weight_epoch_dependent: Union[bool, Tuple[int, int]] = False,
```
Why does `weight_epoch_dependent` appear twice here?
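(For what it's worth, if both lines really end up in the same signature, Python rejects it at compile time, so this is more than a style nit. A quick demonstration with generic names, not the actual class:

```python
# Python rejects duplicate parameter names outright:
compile("def f(x=1, x=2): pass", "<demo>", "exec")
# SyntaxError: duplicate argument 'x' in function definition
```
)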
chebai/models/base.py
Outdated
```diff
  if self.pass_loss_kwargs:
      loss_kwargs = loss_kwargs_candidates
-     loss_kwargs["current_epoch"] = self.trainer.current_epoch
+     # loss_kwargs["current_epoch"] = self.trainer.current_epoch
```
Why is this commented out? AFAIK we don't have any loss function at the moment that needs this (it was added for some experimental semantic loss features that didn't perform well). Does this break anything?
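For context, `current_epoch` exists to support epoch-dependent loss weighting (cf. `weight_epoch_dependent` in the diff above). A minimal sketch of how such a weight could be computed; the linear ramp and the helper name are assumptions, not the repo's actual logic:

```python
from typing import Tuple, Union

def epoch_weight(
    current_epoch: int,
    weight_epoch_dependent: Union[bool, Tuple[int, int]] = False,
) -> float:
    """Hypothetical helper: scale factor for an auxiliary loss term."""
    if not isinstance(weight_epoch_dependent, tuple):
        return 1.0  # treat plain booleans as "no ramp" in this sketch
    start, end = weight_epoch_dependent
    if current_epoch <= start:
        return 0.0
    if current_epoch >= end:
        return 1.0
    # ramp linearly from 0 to 1 between the two epochs
    return (current_epoch - start) / (end - start)
```

The commented-out line in `base.py` would then be the place where the trainer hands its epoch to such a loss term.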
chebai/models/electra.py
Outdated
```diff
- from chebai.loss.semantic import DisjointLoss as ElectraChEBIDisjointLoss  # noqa
+ # TODO: put back in before pull request
+ # from chebai.loss.semantic import DisjointLoss as ElectraChEBIDisjointLoss  # noqa
```
I guess you wanted to uncomment this :)
This will be a problem for merging. I have added new SMILES tokens on a different branch (from PubChem), so the new PubChem-pretrained model (and all models based on that) will depend on those tokens.
Are the tokens you added here actually used by a model, or are they just artifacts?
I have removed the part in question and will open an issue to look into what is going on with this.
Is there a reason for deleting this file?
restructuring of config files
fixing small issues from merging
addressed all comments
chebai/preprocessing/reader.py
Outdated
```diff
  def _get_token_index(self, token: str) -> int:
      """Returns a unique number for each token, automatically adds new tokens."""
+     print(str(token))
```
I assume this is a leftover from debugging?
Yes, will remove it!
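To make the merge concern above concrete: readers like this typically treat the token file as an append-only list whose positions double as embedding-row indices. A rough sketch of the pattern (simplified, not the repo's exact code):

```python
from typing import Dict

class TokenRegistrySketch:
    """Append-only token registry; an index identifies an embedding row."""

    def __init__(self) -> None:
        self._index: Dict[str, int] = {}  # token -> position of first appearance

    def _get_token_index(self, token: str) -> int:
        """Returns a unique number for each token, automatically adds new tokens."""
        if token not in self._index:
            self._index[token] = len(self._index)  # new tokens go to the end
        return self._index[token]
```

Because indices are positional, two branches that each append different new tokens assign conflicting numbers, which is why tokens added on one branch can break models pretrained on the other.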
Lint issues fixed. Unit tests most likely need to be adjusted; missing labels might cause issues in some places.
The unit tests can be fixed by adjusting the mock data for the Tox21Challenge dataset. You just need to add the `missing_labels` key. Below are the fixed functions for `Tox21ChallengeMockData`:

```python
@staticmethod
def data_in_dict_format() -> List[Dict]:
    data_list = [
        {
            "labels": [None, None, None, None, None, None, None, None, None, 0, None, None],
            "ident": "25848",
        },
        {
            "labels": [0, None, None, 1, None, None, None, None, None, None, None, None],
            "ident": "2384",
        },
        {
            "labels": [0, None, 0, None, None, None, None, None, None, None, None, None],
            "ident": "27102",
        },
        {
            "labels": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            "ident": "26792",
        },
        {
            "labels": [None, None, None, None, None, None, None, 1, None, 1, None, None],
            "ident": "26401",
        },
        {
            "labels": [None, None, None, None, None, None, None, None, None, None, None, None],
            "ident": "25973",
        },
    ]
    for dict_ in data_list:
        dict_["features"] = Tox21ChallengeMockData.FEATURE_OF_SMILES
        dict_["group"] = None
        # missing labels get added here
        if any(label is None for label in dict_["labels"]):
            dict_["missing_labels"] = [label is None for label in dict_["labels"]]
    return data_list
```

```python
@staticmethod
def get_setup_processed_output_data() -> List[Dict]:
    """
    Returns mock processed data used for testing the `setup_processed` method.

    The data contains molecule identifiers and their corresponding toxicity labels
    for multiple endpoints. Each dictionary in the list represents a molecule with
    its associated labels, features, and group information.

    Returns:
        List[Dict]: A list of dictionaries where each dictionary contains:
            - "features": The SMILES features of the molecule.
            - "labels": A list of toxicity endpoint labels (0, 1, or None).
            - "ident": The sample identifier.
            - "group": None (default value for the group key).
    """
    # Endpoint order: "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER",
    # "NR-ER-LBD", "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53"
    data_list = [
        {
            "labels": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            "ident": "NCGC00260869-01",
        },
        {
            "labels": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            "ident": "NCGC00261776-01",
        },
        {
            "labels": [None, None, None, None, None, None, None, None, None, None, None, None],
            "ident": "NCGC00261380-01",
        },
        {
            "labels": [0, 0, 0, None, 0, 0, 0, 0, 0, 0, None, 1],
            "ident": "NCGC00261842-01",
        },
        {
            "labels": [0, 0, 1, None, 1, 1, 1, None, 1, 1, None, 1],
            "ident": "NCGC00261662-01",
        },
        {
            "labels": [0, 0, None, None, 1, 0, 0, 1, 0, 0, 1, 1],
            "ident": "NCGC00261190-01",
        },
    ]
    complete_list = []
    for dict_ in data_list:
        complete_list.append(
            {
                "features": Tox21ChallengeMockData.FEATURE_OF_SMILES,
                **dict_,
                "group": None,
            }
        )
        # add missing labels
        if any(label is None for label in dict_["labels"]):
            complete_list[-1]["missing_labels"] = [
                label is None for label in dict_["labels"]
            ]
    return complete_list
```
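As a usage note, the point of the `missing_labels` mask is that partially labelled Tox21 samples should not contribute loss on unlabelled endpoints. A sketch of how such a mask is typically applied (not the repo's actual loss code; `masked_bce` is a made-up name):

```python
import torch

def masked_bce(
    logits: torch.Tensor,   # (batch, 12) raw model outputs
    labels: torch.Tensor,   # (batch, 12) 0/1 targets; missing entries filled with 0
    missing: torch.Tensor,  # (batch, 12) boolean, True where the label was None
) -> torch.Tensor:
    """BCE averaged over observed endpoints only."""
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none"
    )
    keep = ~missing
    # guard against batches where every endpoint is missing
    return (loss * keep).sum() / keep.sum().clamp(min=1)
```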