Hi, thanks for your excellent work on TTRL — it has been extremely helpful for our research!
I have two questions about the training procedure, and would really appreciate your clarification:
- Does TTRL training stop when the reward converges?
  In your experiments, do you monitor the reward curve and stop when it reaches convergence / plateau?
  Or do you train for a fixed number of epochs regardless of reward stabilization?
- Should test data be used only once in TTRL?
  Since TTRL (Test-Time RL) generates rollouts conditioned on the test dataset, is it recommended that each test sample be used exactly once?
  Or is it acceptable for the same test sample to appear repeatedly across multiple training epochs?