In Figure 2 of the paper, it says: "For efficiency, we use local attention within “mask units” (Fig. 4, 5) for the first two stages and global attention for the rest." However, when I checked the code, I found that stage 3 and stage 4 still use windowed local attention (tokens only interact within their partitioned window). So where is the "global attention for the rest"?
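To make the question concrete, here is a minimal sketch (not the repository's actual code; `toy_attention`, `windowed_attention`, and `window_size` are hypothetical names) of the two modes I am asking about. Note that one common way to implement "global for the rest" is to keep the windowed code path but let the window span all tokens, so my question is whether the later stages do that, or whether they really restrict interaction to sub-windows:

```python
import torch

def toy_attention(x: torch.Tensor) -> torch.Tensor:
    """Plain self-attention over the last two dims: (..., N, C)."""
    scale = x.shape[-1] ** -0.5
    attn = (x @ x.transpose(-2, -1)) * scale
    return attn.softmax(dim=-1) @ x

def windowed_attention(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Local attention: tokens only interact inside their own window.

    If window_size <= 0 or covers the whole sequence, this degenerates
    to global attention without changing the code path.
    """
    B, N, C = x.shape
    if window_size <= 0 or window_size >= N:
        return toy_attention(x)  # effectively global attention
    # Partition tokens into non-overlapping windows, attend within each.
    x = x.view(B, N // window_size, window_size, C)
    x = toy_attention(x)
    return x.view(B, N, C)

x = torch.randn(2, 16, 8)
local_out = windowed_attention(x, window_size=4)   # stage-1/2 style: local
global_out = windowed_attention(x, window_size=0)  # stage-3/4 style: global?
```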