Thank you for the impressive tech report on C-RADIOv4.
I have a few questions regarding the integration with SAM3 and the general training setup:
SAM3 Backbone Adaptation:
You mentioned that C-RADIOv4 can be used as a direct replacement for the SAM3 vision encoder. Aside from the multi-teacher distillation losses described in the paper, did you perform any additional training or fine-tuning specifically to adapt C-RADIOv4 to the SAM3 decoder? If so, could you provide more details on how this adaptation process was conducted?
Distillation:
What dataset, and at what scale, was used for the distillation process, and is there one you would recommend? Could you also share the approximate computational resources required for the SAM3 distillation?