HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
Abstract
HiMu is a training-free framework for long-form video question answering that uses a hierarchical logic tree decomposition and lightweight experts to efficiently select relevant video frames while preserving temporal structure and cross-modal bindings.
Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench, and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32–512 frames while requiring roughly 10× fewer FLOPs.
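To make the composition step concrete, the sketch below shows how a logic tree whose leaves hold per-frame expert scores could be folded bottom-up into a single satisfaction curve with fuzzy operators. The node schema, operator choices (min for AND, max for OR, cumulative max for temporal sequencing), and all names are illustrative assumptions, not HiMu's reference implementation.

```python
# Minimal sketch (assumptions, not HiMu's reference code): evaluate a logic tree
# of atomic predicates bottom-up into a per-frame satisfaction curve.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Node:
    op: str                        # "leaf", "and", "or", or "after"
    predicate: str | None = None   # atomic predicate text (leaves only)
    expert: str | None = None      # e.g. "clip", "ocr", "asr" (leaves only)
    children: list["Node"] = field(default_factory=list)

def evaluate(node: Node, leaf_scores: dict[str, np.ndarray]) -> np.ndarray:
    """Compose normalized per-frame expert scores up the tree with fuzzy operators."""
    if node.op == "leaf":
        return leaf_scores[node.predicate]        # values in [0, 1], one per frame
    curves = [evaluate(c, leaf_scores) for c in node.children]
    if node.op == "and":
        return np.minimum.reduce(curves)          # all sub-events hold at the frame
    if node.op == "or":
        return np.maximum.reduce(curves)          # any sub-event holds
    if node.op == "after":                        # children[1] must follow children[0]
        return np.minimum(np.maximum.accumulate(curves[0]), curves[1])
    raise ValueError(f"unknown op: {node.op}")

# "What did the speaker say after the chart appeared?"
tree = Node("after", children=[
    Node("leaf", predicate="a chart is on screen", expert="ocr"),
    Node("leaf", predicate="the speaker is talking", expert="asr"),
])
leaf_scores = {
    "a chart is on screen": np.random.rand(120),  # stand-ins for real expert scores
    "the speaker is talking": np.random.rand(120),
}
satisfaction = evaluate(tree, leaf_scores)
top16 = np.argsort(satisfaction)[::-1][:16]       # frames handed to the LVLM
```

Under this reading, selecting frames is just an argsort over the composed curve, so the whole query-dependent part stays a few vector operations per tree node rather than an LVLM call per frame.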
Community
TL;DR: HiMu is a training-free frame selector for long video QA that gets the best of both worlds — structured reasoning without expensive LVLM calls.
The problem: When answering questions about long videos, you need to pick the right frames. Fast methods (like CLIP retrieval) work okay for simple queries but fail on compositional ones ("What did the speaker say after the chart appeared?"). Smart methods (agent-based selectors) handle these well but cost 10–100× more compute.
How HiMu works: One cheap LLM call breaks the question into a logic tree of simple checks — visual appearance (CLIP), object detection, OCR, speech (ASR), and audio (CLAP). Each check scores every frame independently, then fuzzy logic combines the scores into a final ranking. No iterative LVLM reasoning needed.
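To make that concrete, here is a rough sketch of what one such check could look like: a CLIP expert scores every sampled frame against a single atomic predicate, and the curve is min-max normalized and temporally smoothed before being combined with the other checks. The checkpoint, ~1 fps sampling, and 5-frame smoothing window are assumptions for illustration, not HiMu's actual settings.

```python
# Rough sketch of one lightweight "check": CLIP scores every sampled frame
# against an atomic predicate. Checkpoint, sampling rate, and smoothing window
# are illustrative assumptions, not HiMu's configuration.
import cv2
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sample_frames(path: str, every_n: int = 30) -> list[np.ndarray]:
    """Keep one frame every `every_n` frames (~1 fps for a 30 fps video)."""
    cap, frames, idx = cv2.VideoCapture(path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

@torch.no_grad()
def score_frames(frames: list[np.ndarray], predicate: str) -> np.ndarray:
    """Per-frame CLIP similarity, min-max normalized and temporally smoothed."""
    inputs = processor(text=[predicate], images=frames, return_tensors="pt", padding=True)
    sims = model(**inputs).logits_per_image.squeeze(-1).numpy()
    sims = (sims - sims.min()) / (sims.max() - sims.min() + 1e-8)
    return np.convolve(sims, np.ones(5) / 5, mode="same")   # 5-frame moving average

# scores = score_frames(sample_frames("lecture.mp4"), "a bar chart is on screen")
```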
Results: Matches or beats agent-based methods on Video-MME, LongVideoBench, and HERBench-Lite — at a fraction of the cost. Works as a drop-in module in front of any LVLM.
Paper: https://arxiv.org/abs/2603.18558
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Event-Anchored Frame Selection for Effective Long-Video Understanding (2026)
- FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering (2026)
- HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models (2026)
- Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning (2026)
- Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding (2026)
- Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering (2026)
- Triage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models (2026)