MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling Paper • 2602.11761 • Published Feb 12 • 8
Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection Paper • 2604.02819 • Published Apr 3 • 1
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices Paper • 2605.10933 • Published May 11 • 4
Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It Paper • 2606.11052 • Published 13 days ago • 16
HypeNet Collection The models for the paper: Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts • 2 items • Updated Apr 28
Article Welcome Gemma 4: Frontier multimodal intelligence on device +5 merve, pcuenq, sergiopaniego, burtenshaw, Steveeeeeeen, alvarobartt, SaylorTwift • Apr 2 • 909
Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts Paper • 2601.22156 • Published Jan 29 • 15