ShufflingSampler can lead to significantly different free energies compared to default sampler

In distributed execution, the ShufflingSampler may [sample duplicate data points](https://github.com/tvlearn/tvo/blob/master/tvo/utils/data.py#L82) to ensure synchronized batch processing on each worker. Each duplicated data point then contributes twice to the E- and M-steps.
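To illustrate how such duplicates can arise, here is a minimal sketch of index padding for equally sized shards. It is not the code in `tvo/utils/data.py`; the function name `padded_shard_indices` and the padding strategy (repeat the first shuffled indices, similar in spirit to what `torch.utils.data.DistributedSampler` does) are assumptions made for illustration.

```python
import math
import random


def padded_shard_indices(n_samples, n_workers, seed=0):
    """Hypothetical helper: split shuffled indices into equally sized shards.

    Assumes n_samples >= n_workers. If n_samples is not divisible by
    n_workers, the index list is padded with repeats of its first entries,
    so some data points appear twice across the shards.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    per_worker = math.ceil(n_samples / n_workers)
    total = per_worker * n_workers
    # Pad with already-used indices so every worker gets `per_worker` items.
    indices += indices[: total - n_samples]

    # Worker `rank` takes every n_workers-th index, starting at `rank`.
    return [indices[rank:total:n_workers] for rank in range(n_workers)]


shards = padded_shard_indices(n_samples=10, n_workers=4)
flat = [i for shard in shards for i in shard]
n_duplicates = len(flat) - len(set(flat))
print(n_duplicates)  # 10 samples on 4 workers -> 12 slots -> 2 duplicates
```

With 10 data points and 4 workers, each worker receives 3 indices, so 2 of the 12 slots must be filled with repeated data points.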

In its current version, the `Trainer` includes the terms associated with duplicate data points when evaluating free energies (e.g., [here](https://github.com/tvlearn/tvo/blob/master/tvo/trainer/Trainer.py#L110), [here](https://github.com/tvlearn/tvo/blob/master/tvo/trainer/Trainer.py#L253), [here](https://github.com/tvlearn/tvo/blob/master/tvo/trainer/Trainer.py#L279), [here](https://github.com/tvlearn/tvo/blob/master/tvo/trainer/Trainer.py#L294)). This can lead to significantly different results compared to sequential execution, which does not use the ShufflingSampler and hence has no duplicate data points. For example, for an SSSC-House benchmark ($\sigma=50$, $D=144$, $H=512$, $|K|=30$), I observed a free energy difference on the order of 7.
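The effect on the summed free energy can be sketched in a few lines. This is only an illustration under stated assumptions: the per-data-point values are made up, and `total_free_energy` is a hypothetical helper, not a function of the `Trainer`.

```python
def total_free_energy(per_point_F, indices, skip_duplicates):
    """Hypothetical helper: sum per-data-point free energies over `indices`.

    With skip_duplicates=True, each data point index is counted only once.
    """
    seen = set()
    total = 0.0
    for idx in indices:
        if skip_duplicates and idx in seen:
            continue  # ignore padded repeats of an already counted point
        seen.add(idx)
        total += per_point_F[idx]
    return total


# Made-up per-data-point free energies for 5 data points.
per_point_F = [-3.5, -4.0, -2.5, -5.0, -3.0]
# Index list as a padding sampler might produce it: index 0 appears twice.
padded_indices = [0, 1, 2, 3, 4, 0]

F_with_duplicates = total_free_energy(per_point_F, padded_indices, skip_duplicates=False)
F_unique = total_free_energy(per_point_F, padded_indices, skip_duplicates=True)
print(F_with_duplicates, F_unique)
```

Counting each index once recovers the sum that a sequential run over the 5 data points would give, whereas the naive sum is shifted by the free energy of every duplicated point.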

Furthermore, the duplicate data points add extra terms to the Theta updates (in the `update_param_batch` methods), such that the M-step results differ from those obtained in the sequential execution setting.
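As a toy illustration of the M-step effect, consider a parameter that is updated as an average over data points. The sketch below is not one of the `update_param_batch` methods; `mean_update` and the per-sample weights are hypothetical, and zero weights for padded repeats are just one possible way to neutralize them.

```python
def mean_update(data, indices, weights=None):
    """Hypothetical M-step-like update: weighted mean over selected points."""
    if weights is None:
        weights = [1.0] * len(indices)
    numerator = sum(w * data[i] for i, w in zip(indices, weights))
    denominator = sum(weights)
    return numerator / denominator


data = [1.0, 2.0, 3.0, 6.0]
# Index 3 is repeated once to pad the last batch.
padded_indices = [0, 1, 2, 3, 3]

theta_sequential = mean_update(data, [0, 1, 2, 3])
theta_padded = mean_update(data, padded_indices)
# Assumed fix: give the padded repeat zero weight.
theta_masked = mean_update(data, padded_indices, weights=[1.0, 1.0, 1.0, 1.0, 0.0])
print(theta_sequential, theta_padded, theta_masked)
```

The padded update is pulled towards the duplicated data point, while the masked update matches the sequential result.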

ShufflingSampler can lead to significantly different free energies compared to default sampler #40
