Scaling Pedagogical Pre-training to 10 Billion Tokens
New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.
We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.
The result is codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.
We trained codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.
Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.
I'm releasing OpenCS2 a 11TB dataset of around 5000 hours of counter strike gameplay recording. - HD resolution - 1280×720 · 32 fps - For each frame keyboard and mouse + world state (player position, velocity, weapon ...) - HD Stereo audio - All 10 players perspective