HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
Abstract
HiMu is a training-free framework for long-form video question answering that uses a hierarchical logic tree decomposition and lightweight experts to efficiently select relevant video frames while preserving temporal structure and cross-modal bindings.
Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench, and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32–512 frames while requiring roughly 10× fewer FLOPs.
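To make the composition step concrete, the sketch below shows how a logic tree whose leaves hold per-frame expert scores could be folded bottom-up into a single satisfaction curve with fuzzy operators. The node schema, operator choices (min for AND, max for OR, cumulative max for temporal sequencing), and all names are illustrative assumptions, not HiMu's reference implementation.

```python
# Minimal sketch (assumptions, not HiMu's reference code): evaluate a logic tree
# of atomic predicates bottom-up into a per-frame satisfaction curve.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Node:
    op: str                        # "leaf", "and", "or", or "after"
    predicate: str | None = None   # atomic predicate text (leaves only)
    expert: str | None = None      # e.g. "clip", "ocr", "asr" (leaves only)
    children: list["Node"] = field(default_factory=list)

def evaluate(node: Node, leaf_scores: dict[str, np.ndarray]) -> np.ndarray:
    """Compose normalized per-frame expert scores up the tree with fuzzy operators."""
    if node.op == "leaf":
        return leaf_scores[node.predicate]        # values in [0, 1], one per frame
    curves = [evaluate(c, leaf_scores) for c in node.children]
    if node.op == "and":
        return np.minimum.reduce(curves)          # all sub-events hold at the frame
    if node.op == "or":
        return np.maximum.reduce(curves)          # any sub-event holds
    if node.op == "after":                        # children[1] must follow children[0]
        return np.minimum(np.maximum.accumulate(curves[0]), curves[1])
    raise ValueError(f"unknown op: {node.op}")

# "What did the speaker say after the chart appeared?"
tree = Node("after", children=[
    Node("leaf", predicate="a chart is on screen", expert="ocr"),
    Node("leaf", predicate="the speaker is talking", expert="asr"),
])
leaf_scores = {
    "a chart is on screen": np.random.rand(120),  # stand-ins for real expert scores
    "the speaker is talking": np.random.rand(120),
}
satisfaction = evaluate(tree, leaf_scores)
top16 = np.argsort(satisfaction)[::-1][:16]       # frames handed to the LVLM
```

Under this reading, selecting frames is just an argsort over the composed curve, so the whole query-dependent part stays a few vector operations per tree node rather than an LVLM call per frame.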
Community
TL;DR: HiMu is a training-free frame selector for long video QA that gets the best of both worlds — structured reasoning without expensive LVLM calls.
The problem: When answering questions about long videos, you need to pick the right frames. Fast methods (like CLIP retrieval) work okay for simple queries but fail on compositional ones ("What did the speaker say after the chart appeared?"). Smart methods (agent-based selectors) handle these well but cost 10–100× more compute.
How HiMu works: One cheap LLM call breaks the question into a logic tree of simple checks — visual appearance (CLIP), object detection, OCR, speech (ASR), and audio (CLAP). Each check scores every frame independently, then fuzzy logic combines the scores into a final ranking. No iterative LVLM reasoning needed.
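To make that concrete, here is a rough sketch of what one such check could look like: a CLIP expert scores every sampled frame against a single atomic predicate, and the curve is min-max normalized and temporally smoothed before being combined with the other checks. The checkpoint, ~1 fps sampling, and 5-frame smoothing window are assumptions for illustration, not HiMu's actual settings.

```python
# Rough sketch of one lightweight "check": CLIP scores every sampled frame
# against an atomic predicate. Checkpoint, sampling rate, and smoothing window
# are illustrative assumptions, not HiMu's configuration.
import cv2
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sample_frames(path: str, every_n: int = 30) -> list[np.ndarray]:
    """Keep one frame every `every_n` frames (~1 fps for a 30 fps video)."""
    cap, frames, idx = cv2.VideoCapture(path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

@torch.no_grad()
def score_frames(frames: list[np.ndarray], predicate: str) -> np.ndarray:
    """Per-frame CLIP similarity, min-max normalized and temporally smoothed."""
    inputs = processor(text=[predicate], images=frames, return_tensors="pt", padding=True)
    sims = model(**inputs).logits_per_image.squeeze(-1).numpy()
    sims = (sims - sims.min()) / (sims.max() - sims.min() + 1e-8)
    return np.convolve(sims, np.ones(5) / 5, mode="same")   # 5-frame moving average

# scores = score_frames(sample_frames("lecture.mp4"), "a bar chart is on screen")
```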
Results: Matches or beats agent-based methods on Video-MME, LongVideoBench, and HERBench-Lite — at a fraction of the cost. Works as a drop-in module in front of any LVLM.
Paper: https://arxiv.org/abs/2603.18558
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Event-Anchored Frame Selection for Effective Long-Video Understanding (2026)
- FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering (2026)
- HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models (2026)
- Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning (2026)
- Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding (2026)
- Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering (2026)
- Triage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models (2026)