Releases: FunctionLab/selene
0.6.0
- `config_utils.py`: Selene now saves additional information on each run. Specifically, we now save the version of Selene that the latest run used, make a copy of the input configuration file, and save this along with the model architecture file in the output directory. This adds a new dependency to Selene, the package `ruamel.yaml`.
- `H5Dataloader` and `_H5Dataset`: Previously `H5Dataloader` had a number of arguments that were used to then initialize `_H5Dataset` internally. One major change in this version is that we now ask that users initialize `_H5Dataset` explicitly and then pass it to `H5Dataloader` as a class argument. This makes the two classes consistent with the PyTorch specifications for `Dataset` and `DataLoader` classes, enabling them to be compatible with different data parallelization configurations supported by PyTorch and the PyTorch Lightning framework.
- `_H5Dataset` class initialization optional arguments:
  - `unpackbits` can now be specified separately for sequences and targets by way of `unpackbits_seq` and `unpackbits_tgt`.
  - `use_seq_len` enables subsetting to the center `use_seq_len` length of the sequences in the dataset.
  - `shift` (particularly paired with `use_seq_len`) allows for retrieving sequences shifted from the center position by `shift` bases. Note that `shift` currently only allows shifting in one direction, depending on whether you pass in a positive or negative integer.
- `GenomicFeaturesH5`: This is a new targets class to handle continuous-valued targets, stored in an HDF5 file, that can be retrieved based on genomic coordinate. Once again, genomic regions are stored in a tabix-indexed .bed file, with the main change being that the BED file now specifies, for each genomic region, the index of the row in the HDF5 matrix that contains all the target values to predict. If multiple target rows are returned for a query region, the average of those rows is returned.
- `RandomPositionsSampler`:
  - `exclude_chrs`: Added a new optional argument which by default excludes all nonstandard chromosomes (`exclude_chrs=['_']`) by ignoring all chromosomes with an underscore in the name. Pass in a list of chromosomes or substrings to exclude. When loading possible sampling positions, the class now iterates through the `exclude_chrs` list and checks, for each substring `s` in the list, whether `s in chrom`; if so, it skips that chromosome entirely.
  - Internal function `_retrieve` now takes in an optional argument `strand` (default `None`) to enable explicit retrieval of a sequence at `chrom, position` for a specific side. The default behavior of the `RandomPositionsSampler` class remains the same, with the strand side randomly selected for each genomic position sampled.
- `PerformanceMetrics`:
  - Now supports `spearmanr` and `pearsonr` from `scipy.stats`. Room for improvement to generalize this class in the future.
  - The `update` function now has an optional argument `scores` which can pass in a subset of the metrics as `list(str)` to compute.
- `TrainModel`:
  - `self.step` starts from `self._start_step`, which is non-zero if loaded from a Selene-saved checkpoint.
  - Removed call to `self._test_metrics.visualize` in `evaluate` since the visualize method does not generalize well.
- `NonStrandSpecific`: Can now handle a model that returns two outputs in `forward`, by taking either the mean or max of each of the two individual outputs across their forward and reverse predictions. A custom `NonStrandSpecific` class is recommended for more specific cases.
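The `exclude_chrs` substring check described for `RandomPositionsSampler` above can be illustrated with a minimal sketch. The helper `filter_chromosomes` and the chromosome names below are hypothetical and are not part of Selene; the sketch only mirrors the "`s in chrom`" rule stated in these notes.

```python
def filter_chromosomes(chroms, exclude_chrs=("_",)):
    """Keep chromosomes whose name contains none of the excluded substrings.

    Hypothetical helper for illustration only; not a Selene function.
    """
    kept = []
    for chrom in chroms:
        # Skip the chromosome entirely if any excluded substring occurs in its name.
        if any(s in chrom for s in exclude_chrs):
            continue
        kept.append(chrom)
    return kept


# Illustrative chromosome names (assumed, not taken from any Selene dataset).
chroms = ["chr1", "chr2", "chrX", "chrM", "chr1_KI270706v1_random", "chrUn_GL000195v1"]

# Default: drop every name that contains an underscore.
default_kept = filter_chromosomes(chroms)

# Custom: also drop chrM and chrX.
custom_kept = filter_chromosomes(chroms, exclude_chrs=["_", "chrM", "chrX"])
```

Because the rule is a substring match, an entry such as `"chr1"` would, under this sketch, also exclude `chr10` through `chr19`, so exact chromosome names need some care.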
0.5.3
Adjust dependency requirements (NumPy, Cython lower bound requirement)
0.5.2
Fixes a NumPy/Cython type error causing build issues with Python 3.9+
0.5.0
Version 0.5.0
New functionality
- `sampler.MultiSampler`: `MultiSampler` accepts any Selene sampler for each of the train, validation, and test partitions, where previously `MultiFileSampler` only accepted `FileSampler`s. We will deprecate `MultiFileSampler` in our next major release.
- `DataLoader`: Parallel data loading based on PyTorch's `DataLoader` class, which can be used with Selene's `MultiSampler` and `MultiFileSampler` classes (see `sampler.SamplerDataLoader`, `sampler.H5DataLoader`).
- To support parallelism via multiprocessing, the sampler that `SamplerDataLoader` uses needs to be picklable. To enable this, file-opening operations are delayed until a method that needs the file is called. There is no change to the API, and setting `init_unpicklable=True` in `__init__` for `Genome` and all `OnlineSampler` classes will fully reproduce the functionality in `selene_sdk<=0.4.8`.
- `sampler.RandomPositionsSampler`: added support for `center_bin_to_predict` taking in a list/tuple of two integers to specify the region from which to query the targets. That is, `center_bin_to_predict` by default (`center_bin_to_predict=<int>`) queries targets based on the center bin size, but can be specified as start and end integers that are not at the center if desired.
- `EvaluateModel`: accepts a list of metrics (by default computing ROC AUC and average precision) with which to evaluate the test dataset.
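The idea of delaying file opening so that a sampler stays picklable can be sketched generically as follows. `LazyFileReader` is a hypothetical class written for this illustration; it is not Selene's implementation and does not use any Selene API. It only shows why an object that has not yet opened its file can be pickled and sent to a worker process.

```python
import os
import pickle
import tempfile


class LazyFileReader:
    """Hypothetical reader (not a Selene class) that opens its file on first use."""

    def __init__(self, path):
        self.path = path
        self._handle = None  # nothing is opened yet, so the object is picklable

    def _ensure_open(self):
        # Open lazily: the handle is created only when a method needs it.
        if self._handle is None:
            self._handle = open(self.path, "r")
        return self._handle

    def first_line(self):
        handle = self._ensure_open()
        handle.seek(0)
        return handle.readline().rstrip("\n")

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None


def demo():
    """Pickle a reader before it opens its file, then read from the copy."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "regions.bed")
        with open(path, "w") as f:
            f.write("chr1\t0\t100\n")
        reader = LazyFileReader(path)
        # Round-trip through pickle, as happens when an object is sent to a worker.
        clone = pickle.loads(pickle.dumps(reader))
        line = clone.first_line()  # the copy opens its own handle here
        clone.close()
        return line
```

Had the file been opened in `__init__`, the `pickle.dumps` call would fail, since open file objects cannot be pickled.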
Usage
- Command-line interface (CLI): You can now run the CLI directly with `python -m selene_sdk` (if you have cloned the repository, make sure you have locally installed `selene_sdk` via `python setup.py install`, or that `selene_sdk` is in the same directory as your script or added to `PYTHONPATH`). Developers can make a copy of the `selene_sdk/cli.py` script and use it the same way that `selene_cli.py` was used in earlier versions of Selene (`python -u cli.py <config-yml> [--lr]`).
Bug fixes
- `EvaluateModel`: `use_features_ord` allows you to evaluate a trained model on only a subset of chromatin features (targets) predicted by the model. If you are using a `FileSampler` for your test dataset, you now have the option to pass in a subsetted matrix; however, this matrix must be ordered the same way as `features` (the original targets prediction ordering) and not in the same ordering as `use_features_ord`. The final model predictions and targets (`test_predictions.npz` and `test_targets.npz`) will nonetheless be output according to the `use_features_ord` list and ordering.
- `MatFileSampler`: Previously the `MatFileSampler` reset the pointer to the start of the matrix too early (going back to the first sample before we had finished sampling the whole matrix).
- CLI learning rate: Edge cases (e.g. not specifying the learning rate via CLI or config) previously were not handled correctly and did not throw an informative error.
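The two orderings involved in `use_features_ord` can be easy to mix up, so here is a small sketch of the column bookkeeping. It assumes a subsetted targets matrix whose columns follow the original `features` order, and reorders them into the `use_features_ord` order. The function `reorder_columns` and the feature names are hypothetical; this is not Selene's code.

```python
def reorder_columns(rows, features, use_features_ord):
    """Reorder columns from `features` order into `use_features_ord` order.

    Hypothetical helper for illustration only; not a Selene function.
    `rows` holds only the selected features, in the original `features` order.
    """
    wanted = set(use_features_ord)
    # Column layout of the subsetted matrix: the original order, filtered.
    subset_in_feature_order = [f for f in features if f in wanted]
    column_of = {name: i for i, name in enumerate(subset_in_feature_order)}
    order = [column_of[name] for name in use_features_ord]
    return [[row[i] for i in order] for row in rows]


# Illustrative feature names (assumed for this example).
features = ["CTCF", "DNase", "H3K4me3", "H3K27ac"]
use_features_ord = ["H3K27ac", "CTCF"]

# Subsetted targets, columns in `features` order: CTCF, H3K27ac.
targets_subset = [[1, 0], [0, 1]]

# Columns are now in `use_features_ord` order: H3K27ac, CTCF.
reordered = reorder_columns(targets_subset, features, use_features_ord)
```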
0.4.8
Enhancements
- PyTorch now has flexible state dict loading, which allows users more flexibility in loading models that were trained with older/newer versions of PyTorch. Selene has been updated to use this parameter.
- Added HeartENN model architecture ahead of publication.
0.4.7
Bugfixes:
- Use `self.use_cuda` in `get_predict` for raw sequence input in the `AnalyzeSequences` class.
0.4.6
Updates
- Allow users to pass in individual sequences to `get_predictions` in the `AnalyzeSequences` class and get the model prediction directly (as opposed to having it be written to an output file).
0.4.5
Updates
- Specify upper & lower bounds for Selene's torch dependency
- Add '.' as a valid delimiter for VCF multiallelic parsing
- Allow users to evaluate on subsets of features in EvaluateModel
Bugfixes:
- `BASES_ARR` type consistency (specify as a list only) and resetting for Lua-trained model vs. Selene-trained model.
0.4.4
Updates
- Refactored variant effect prediction to simplify the code
- Removed `contains_unk` column from output of `get_predictions_from_fasta` in `AnalyzeSequences` class
Bugfixes
- Fixed variant effect prediction handling for odd-length sequences
0.4.3
Updates:
- Add a column `contains_unk` to BED/VCF predictions. This boolean column indicates whether a sequence contains any unknown bases.
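As a rough illustration of what a `contains_unk` flag expresses, the sketch below marks a sequence as containing unknown bases when any character falls outside A, C, G, and T. That definition of "unknown" and the function itself are assumptions made for this example; Selene's own check may differ.

```python
# Assumed set of known bases for this sketch.
KNOWN_BASES = frozenset("ACGT")


def contains_unk(sequence, known_bases=KNOWN_BASES):
    """Return True if any base in `sequence` is not a known base.

    Hypothetical helper for illustration only; comparison is case-insensitive.
    """
    return any(base not in known_bases for base in sequence.upper())


flags = [contains_unk(seq) for seq in ["ACGTACGT", "ACGTNNGT", "acgt"]]
```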
Bugfixes:
- MultiModelWrapper can be used with CUDA.