Streaming Structured Inference with Flash-SemiCRF
Abstract
Flash-SemiCRF enables exact semi-Markov CRF inference on long sequences and large label sets by replacing the stored edge-potential tensor with on-the-fly computation from prefix sums and a streaming forward-backward pass.
Semi-Markov Conditional Random Fields (semi-CRFs) assign labels to segments of a sequence rather than to individual positions, enabling exact inference over segment-level features and principled uncertainty estimates at their boundaries. However, existing implementations must materialize a large edge potential tensor whose size grows with sequence length, maximum segment length, and label count, becoming prohibitive for speech-scale state spaces and intractable at genomic scales where sequences can exceed 100,000 positions. This memory bottleneck has limited the adoption of exact segment-level inference for long sequences and large label sets. We identify that the core inefficiency is materializing edge potentials that can instead be evaluated on-the-fly from a compact prefix-sum array, and make several improvements. First, replacing the stored edge tensor with prefix-sum lookup reduces the memory footprint by a factor proportional to the product of segment length and label count. Second, a streaming forward-backward pass with checkpoint-boundary normalization keeps working memory sublinear in sequence length while preserving exact gradients. Third, zero-centered cumulative scores control numerical drift and induce an adaptive duration prior under label imbalance. We integrate these ideas into Flash-SemiCRF, a fused Triton kernel that enables exact semi-CRF inference on previously intractable problem sizes. Available at https://github.com/biobenkj/flash-semicrf.
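The prefix-sum idea from the abstract can be illustrated in a few lines of NumPy. This is a minimal sketch, not the paper's fused Triton kernel; the function names are ours, and we assume (as the abstract states) that a segment's emission score is the sum of per-position label scores, so each edge potential is an O(1) difference of prefix sums plus a transition term:

```python
import numpy as np

def prefix_sums(emissions):
    """emissions: (T, C) per-position label scores -> (T+1, C) cumulative array P,
    with P[t] = sum of emissions[0:t]. This O(TC) array replaces the
    O(TKC^2) edge tensor."""
    T, C = emissions.shape
    P = np.zeros((T + 1, C))
    np.cumsum(emissions, axis=0, out=P[1:])
    return P

def edge_score(P, trans, s, t, c_prev, c):
    """On-the-fly potential of an edge: a segment covering positions [s, t)
    with label c, preceded by a segment labeled c_prev. O(1) per edge."""
    return (P[t, c] - P[s, c]) + trans[c_prev, c]
```

Because every edge is recoverable from `P` and the (C, C) transition matrix, the (T, K, C, C) tensor never needs to be materialized, which is the claimed memory reduction by a factor proportional to K·C.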
Community
Flash-SemiCRF makes exact semi-Markov CRF inference tractable at genomic scale (T > 10⁵) by avoiding the O(TKC²) edge tensor, where T is sequence length, K is the maximum segment duration, and C is the number of labels, and taking a streaming approach instead. Three ideas drive the implementation: (1) prefix-sum decomposition enables O(1)-per-edge evaluation from an O(TC) cumulative score array, so the full edge tensor is never stored; (2) ring-buffer streaming keeps forward-backward working memory at O(KC), independent of sequence length; and (3) sublinear gradient checkpointing at √(TK) intervals preserves exact gradients. A globally centered emission baseline simultaneously controls numerical drift and induces an adaptive, data-dependent duration prior that regularizes segmentation under label imbalance. The system is implemented as a fused Triton kernel. On TIMIT phoneme segmentation, Flash-SemiCRF achieves a 25× training speedup and a 178× inference speedup over pytorch-struct's linear scan, the only existing exact backend capable of running K = 30 without running out of memory. We further prove that banded and block-sparse matrix formats cannot exploit the bounded segment length K in tree-structured formulations: fill-in under composition drives intermediates to near-dense within O(log(T/K)) tree levels.
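The ring-buffer streaming idea (point 2 above) can be sketched as a plain NumPy forward pass. This is our own illustrative code under the same assumptions as the paper's setup (segment score = summed per-position emissions plus a label-transition term); the actual implementation is a fused Triton kernel with checkpointed backward, which this sketch omits. The key property shown is that the forward recursion only ever reads the last K alpha rows, so a (K+1, C) ring buffer suffices:

```python
import numpy as np

def logsumexp(x, axis=0):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True)), axis=axis)

def streaming_log_partition(emissions, trans, K):
    """Exact semi-CRF log-partition with O(K*C) working memory.

    emissions: (T, C) per-position label scores; trans: (C, C).
    alpha[t, c] = log-sum over segmentations of positions [0, t) whose
    final segment carries label c. Only the last K+1 alpha rows are live,
    held in a ring buffer indexed by t mod (K+1).
    """
    T, C = emissions.shape
    # (T+1, C) prefix sums: each segment's emission score is an O(1) difference.
    P = np.vstack([np.zeros((1, C)), np.cumsum(emissions, axis=0)])
    ring = np.full((K + 1, C), -np.inf)      # ring[t % (K+1)] stores alpha[t]
    for t in range(1, T + 1):
        cand = np.empty((min(K, t), C))      # one candidate row per duration k
        for i, k in enumerate(range(1, min(K, t) + 1)):
            s = t - k
            seg = P[t] - P[s]                # summed emissions of segment [s, t), per label
            if s == 0:
                inner = np.zeros(C)          # sequence start: no incoming transition
            else:
                prev = ring[s % (K + 1)]     # alpha[s]; still live since t - s <= K
                inner = logsumexp(prev[:, None] + trans, axis=0)
            cand[i] = inner + seg
        ring[t % (K + 1)] = logsumexp(cand, axis=0)
    return logsumexp(ring[T % (K + 1)], axis=0)
```

Note the ring-buffer invariant: alpha[s] written at step s is only overwritten at step s + K + 1, and every read targets s ≥ t − K, so no live value is clobbered. A real implementation would additionally checkpoint ring states every √(TK) steps to recompute intermediates exactly during the backward pass.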