- Uses 22k data pairs focusing on textual accuracy.
- Uses 11k pairs with a balance of textual and visual rewards (see the reward-mix sketch after this list).
- The SFT stage requires 60 hours of training on 16 H800 GPUs. The RL stages take an additional 34 hours on 24 H800 GPUs [11] (see the GPU-hour conversion after this list).
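The "balance of textual and visual rewards" in the second item suggests a weighted combination of two reward signals. Below is a minimal Python sketch of one such mix; the function name, signature, and equal default weight are illustrative assumptions, not details from the source.

    # Hypothetical sketch: convex combination of a textual-accuracy reward
    # and a visual reward. The 0.5 default weight is an assumption; the
    # source only states that the two reward types are balanced.
    def combined_reward(text_reward: float, visual_reward: float,
                        alpha: float = 0.5) -> float:
        # alpha = 1.0 recovers a purely textual reward (as in the 22k-pair
        # setting); alpha = 0.5 weights the two signals equally.
        return alpha * text_reward + (1.0 - alpha) * visual_reward

    # Example: a response scoring 0.9 on text accuracy and 0.6 visually
    # gets a balanced reward of 0.75.
    print(combined_reward(0.9, 0.6))  # 0.75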
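For a rough sense of the compute budget, the wall-clock figures above convert to GPU-hours as follows (assuming full use of all listed GPUs for the stated durations):

    SFT:   60 h x 16 H800 = 960 GPU-hours
    RL:    34 h x 24 H800 = 816 GPU-hours
    Total: ~1,776 H800 GPU-hours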
Multimodal Structured Reinforcement Learning (MSRL)
- The model is tested on subsets ranging from 200k to 2.8 million samples.
- Increasing the data from 2M to 2.8M samples yields no further performance gains, confirming the plateau [22].