1️⃣ Build a solid RL env with Verifiers (Prime Intellect)
2️⃣ Generate synthetic data: <200 games sampled from GPT-5-mini playing in the env
3️⃣ SFT warm-up to teach format
4️⃣ Group-based RL (CISPO) against opponents making 20-70% random moves
5️⃣ RL again with stronger opponents (0-25% random moves) + 1.25 temperature to push exploration and shake off suboptimal strategies
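
A rough sketch of how the two RL stages could be wired up, not the actual training code. Everything named here (`Stage`, `STAGES`, `sample_random_rate`, `opponent_move`, `group_advantages`) is hypothetical and does not come from Verifiers or any other library. It assumes the random-move rate is drawn once per game from the stage's range, that the first RL stage samples at temperature 1.0 (the post only gives 1.25 for the second stage), and that "group-based" means rewards are mean-centered within a group of rollouts, which is one common choice. The CISPO loss itself is left out.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, TypeVar

Move = TypeVar("Move")


@dataclass(frozen=True)
class Stage:
    name: str
    min_random: float   # lowest opponent random-move rate in this stage
    max_random: float   # highest opponent random-move rate in this stage
    temperature: float  # sampling temperature for the policy being trained


# Ranges come from the post. The 1.0 temperature for the first stage is an
# assumption; only the second stage's 1.25 is stated.
STAGES = [
    Stage("rl_weak_opponents", 0.20, 0.70, 1.0),
    Stage("rl_strong_opponents", 0.00, 0.25, 1.25),
]


def sample_random_rate(stage: Stage, rng: random.Random) -> float:
    """Draw one random-move rate per game (assumed, not confirmed by the post)."""
    return rng.uniform(stage.min_random, stage.max_random)


def opponent_move(
    legal_moves: Sequence[Move],
    best_move_fn: Callable[[Sequence[Move]], Move],
    random_rate: float,
    rng: random.Random,
) -> Move:
    """Play a uniformly random legal move with probability random_rate,
    otherwise defer to the stronger move picker."""
    if rng.random() < random_rate:
        return rng.choice(list(legal_moves))
    return best_move_fn(legal_moves)


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """Mean-center rewards within one group of rollouts from the same start."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Lower random rates in the second stage make the opponent harder to beat, and the higher temperature widens what the policy tries against it.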