What is your issue?
Current state
Currently, xbatcher v0.3.0's `BatchGenerator` is an all-in-one class that does too many things, and more features are planned. The 400+ lines of code at https://github.com/xarray-contrib/xbatcher/blob/v0.3.0/xbatcher/generators.py are not easy to understand and contribute to without spending a few hours. To make things more maintainable and future-proof, we might need a major refactor.
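For concreteness, this is roughly how the current all-in-one API is used (the parameter names match v0.3.0, the values are just illustrative); the comments mark which component of the proposed split would own each parameter:

```python
import xarray as xr
import xbatcher

# Toy dataset: a 20x20 grid.
ds = xr.Dataset(
    {"var": (("x", "y"), [[float(i + j) for j in range(20)] for i in range(20)])}
)

# One constructor currently handles every concern, from slicing to batching:
bgen = xbatcher.BatchGenerator(
    ds,
    input_dims={"x": 10, "y": 10},   # chip size           -> Slicer concern
    input_overlap={"x": 5, "y": 5},  # overlap over chips  -> Slicer concern
    batch_dims={},                   # batching dims       -> Batcher concern
    concat_input_dims=False,         # stacking behaviour  -> Batcher concern
    preload_batch=True,              # eager loading       -> Batcher concern
)
for batch in bgen:
    ...  # each batch is an xarray object ready to hand to an ML framework
```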
Proposal
Split `BatchGenerator` into two (or more) subcomponents. Specifically:
- A `Slicer` that does the slicing/subsetting/cropping/tiling/chipping of a multi-dimensional `xarray` object.
- A `Batcher` that groups together the pieces from the `Slicer` into batches of data.
These are the parameters from the current `BatchGenerator` that would be handled by each component (a sketch of the split follows the list):

`Slicer`:
- `input_dims`
- `input_overlap`

`Batcher`:
- `batch_dims`
- `concat_input_dims`
- `preload_batch`
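A minimal sketch of what the split could look like. The class names follow the proposal, but everything else (constructor signatures, the iterator protocol, and a `batch_size` parameter standing in for the `batch_dims`/`concat_input_dims` machinery) is assumed for illustration, not a settled design:

```python
import itertools
from typing import Iterator

import xarray as xr


class Slicer:
    """Yields fixed-size chips cut from a multi-dimensional xarray object."""

    def __init__(self, ds, input_dims, input_overlap=None):
        self.ds = ds
        self.input_dims = input_dims
        self.input_overlap = input_overlap or {}

    def __iter__(self) -> Iterator[xr.Dataset]:
        # For each dim, slide a window of length `size`, stepping by
        # `size - overlap` so consecutive chips overlap as requested.
        starts_per_dim = []
        for dim, size in self.input_dims.items():
            step = size - self.input_overlap.get(dim, 0)
            starts_per_dim.append(
                [(dim, slice(s, s + size))
                 for s in range(0, self.ds.sizes[dim] - size + 1, step)]
            )
        for selection in itertools.product(*starts_per_dim):
            yield self.ds.isel(dict(selection))


class Batcher:
    """Groups chips from any iterable (e.g. a Slicer) into batches."""

    def __init__(self, slices, batch_size, preload_batch=True):
        self.slices = slices
        self.batch_size = batch_size
        self.preload_batch = preload_batch

    def __iter__(self) -> Iterator[xr.Dataset]:
        buffer = []
        for chip in self.slices:
            buffer.append(chip.load() if self.preload_batch else chip)
            if len(buffer) == self.batch_size:
                yield xr.concat(buffer, dim="sample")
                buffer = []
        if buffer:  # emit the final, possibly smaller, batch
            yield xr.concat(buffer, dim="sample")
```

Because `Batcher` only needs an iterable of chips, anything (a filter, a splitter, a cache) can be slotted in between the two stages.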
Benefits
- A NaN checker could be inserted in between `Slicer` and `Batcher` (see the sketch after this list)
- All the extra logic on deleting/adding extra dimensions can be done on the `Batcher` side, or in a step post-`Batcher`
- Allows for creating train/val/test splits after `Slicer` but before `Batcher`, xref Verde & Xbatcher -> Any connections / shared use? #78
  - Also, some people shuffle after getting slices of data, while others may shuffle after batches are created, xref Add ability to shuffle (and reshuffle) batches #170
- Streaming data for performance reasons
  - In torchdata, it is possible to have the `Slicer` run in parallel with the `Batcher`. E.g. with a batch_size of 128, the `Slicer` would load data for up to 128 chips and pass it on to the `Batcher`, which feeds it to the ML model while the next round of data processing happens, all without loading everything into memory. See https://github.com/orgs/xarray-contrib/projects/1
- Flexibility in which step to cache things at
  - At Cache batches #109, the proposal was to cache things after the `Batcher`, once the batches have already been generated. Sometimes though, people might want to set `batch_size` as a hyperparameter in their ML experimentation, in which case the cache should be done after the `Slicer`.
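To make the first few benefits concrete, here is a hedged sketch (using the hypothetical `Slicer`/`Batcher` interfaces from above) of a NaN filter and a train/val split slotted between the two stages; `drop_nan_chips` and `train_val_split` are made-up helpers, not proposed API:

```python
import numpy as np


def drop_nan_chips(chips):
    """Filter step slotted between Slicer and Batcher: skip chips with NaNs."""
    for chip in chips:
        if not chip.to_array().isnull().any():
            yield chip


def train_val_split(chips, val_fraction=0.2, seed=42):
    """Randomly route each chip into a train or a validation stream."""
    rng = np.random.default_rng(seed)
    train, val = [], []
    for chip in chips:
        (val if rng.random() < val_fraction else train).append(chip)
    return train, val


chips = drop_nan_chips(Slicer(ds, input_dims={"x": 10, "y": 10}))
train_chips, val_chips = train_val_split(chips)
train_batches = Batcher(train_chips, batch_size=4)
val_batches = Batcher(val_chips, batch_size=4)
```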
Cons
- May result in the current one-liner becoming a multi-liner (see the comparison below)
- Could lead to some backwards incompatibility/breaking changes
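The first con, in concrete (hypothetical) terms:

```python
# Today: a single call configures everything.
bgen = xbatcher.BatchGenerator(ds, input_dims={"x": 10}, preload_batch=True)

# After the split: two (or more) explicit, composable steps.
bgen = Batcher(Slicer(ds, input_dims={"x": 10}), batch_size=4, preload_batch=True)
```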