Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

Abstract

Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video and multimodal large language models as task guidance assistants. A crucial capability for real-world task guidance is the ability to intervene proactively as soon as a mistake is apparent. To evaluate this capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. To address the lack of training data for this task, we introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised examples showing proactive interventions. Fine-tuning on Ego-CoMist yields gains, especially for smaller and more efficient video LLMs well suited for assistance on edge devices.

Ego-MC-Bench: Mistake Corrections

Ego-MC-Bench evaluates whether AI assistants can intervene at the right time and with the right feedback to prevent mistakes. The benchmark contains expert-provided instruction-feedback pairs in real-world kitchen scenarios, testing both when an intervention should happen and what the assistant should say.

Recording Setup. Recordings capture user actions from complementary viewpoints so that each mistake can be timestamped as soon as it becomes apparent.

Why This Is Hard

F1 scores for state-of-the-art video LLMs on Ego-MC-Bench

Current state-of-the-art video LLMs show very poor mistake intervention capabilities. Even Gemini-3-Flash reaches a mistake intervention F1 score of only 0.18 on per-recipe steps, highlighting the challenge of detecting mistakes at the right time while producing useful corrective feedback. The task is hard because it combines the challenges of perception, memory, temporal grounding, anticipation, and proactive communication.
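
To make the metric concrete, here is a minimal Python sketch of how timing-based precision, recall, and F1 could be computed for streamed interventions. This page does not specify the benchmark's matching protocol, so the `Intervention` type, the greedy nearest-in-time matching, and the 5-second `tolerance_s` window are all illustrative assumptions rather than the official evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Intervention:
    time_s: float  # when the assistant (or expert) speaks, in seconds
    text: str      # the corrective feedback


def match_interventions(predicted, reference, tolerance_s=5.0):
    """Greedily pair each reference intervention with the closest unused
    prediction within +/- tolerance_s seconds (hypothetical protocol)."""
    unused = sorted(predicted, key=lambda p: p.time_s)
    pairs = []
    for ref in sorted(reference, key=lambda r: r.time_s):
        in_window = [p for p in unused
                     if abs(p.time_s - ref.time_s) <= tolerance_s]
        if in_window:
            best = min(in_window, key=lambda p: abs(p.time_s - ref.time_s))
            unused.remove(best)  # each prediction can match at most once
            pairs.append((best, ref))
    return pairs


def precision_recall_f1(num_matched, num_predicted, num_reference):
    """Standard precision, recall, and F1 (harmonic mean) from counts."""
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_reference if num_reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, invented for illustration only.
reference = [Intervention(12.0, "Use the small pan."),
             Intervention(40.0, "Lower the heat.")]
predicted = [Intervention(13.5, "Switch to the small pan."),
             Intervention(25.0, "Stir the sauce."),
             Intervention(70.0, "Add salt.")]

pairs = match_interventions(predicted, reference)
precision, recall, f1 = precision_recall_f1(
    len(pairs), len(predicted), len(reference))
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy example only one of the three predictions falls within 5 seconds of a reference intervention, so precision is 1/3, recall is 1/2, and F1 is 0.4. A model that speaks too often loses precision and one that stays silent loses recall, which is why F1 penalizes both failure modes. Text-similarity scores for the feedback would presumably be computed on the matched pairs, but that too is an assumption here.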

Ego-CoMist: Counterfactual Mistakes

Counterfactual mistake annotation in Ego-CoMist

A major bottleneck is the lack of appropriate procedural activity videos with mistakes, despite the abundance of cooking and instructional video datasets. Ego-CoMist addresses this by transforming non-interactive cooking videos into supervised examples with counterfactual mistakes and corrective feedback.
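
As a rough illustration of the idea, the Python sketch below turns step annotations from a mistake-free video into a supervised example by swapping in the instruction of a different step, so that the observed action no longer follows the given instruction. This page does not describe the actual Ego-CoMist pipeline, so everything here is hypothetical: the `Step` and `CounterfactualExample` types, the instruction-swap strategy, the templated feedback, and the choice of the clip midpoint as the intervention time.

```python
from dataclasses import dataclass
import random


@dataclass
class Step:
    start_s: float
    end_s: float
    instruction: str  # what the person in the video actually does


@dataclass
class CounterfactualExample:
    clip_start_s: float
    clip_end_s: float
    given_instruction: str  # counterfactual instruction shown to the model
    intervene_at_s: float   # supervision target: when to speak
    feedback: str           # supervision target: what to say


def make_counterfactual(steps, index, rng):
    """Pair the clip of steps[index] with another step's instruction, so the
    observed (correct) action becomes a mistake relative to that instruction."""
    observed = steps[index]
    alternatives = [s for s in steps
                    if s.instruction != observed.instruction]
    if not alternatives:
        raise ValueError("need a step with a different instruction")
    intended = rng.choice(alternatives)
    feedback = (f"You were asked to: {intended.instruction}. "
                f"Instead you are doing: {observed.instruction}.")
    return CounterfactualExample(
        clip_start_s=observed.start_s,
        clip_end_s=observed.end_s,
        given_instruction=intended.instruction,
        # Assumption: the mismatch is apparent by the middle of the clip.
        intervene_at_s=(observed.start_s + observed.end_s) / 2,
        feedback=feedback,
    )


# Toy annotations, invented for illustration only.
steps = [Step(0.0, 10.0, "chop the onion"),
         Step(10.0, 25.0, "heat oil in the pan"),
         Step(25.0, 40.0, "add the onion")]
example = make_counterfactual(steps, index=1, rng=random.Random(0))
print(example.given_instruction, "->", example.feedback)
```

The appeal of a construction like this is that the video itself is untouched: only the instruction and the supervision targets change, so existing non-interactive cooking datasets can be reused without recording new mistakes.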

Counterfactuals Improve Intervention Capabilities

F1 score gains from fine-tuning on Ego-CoMist+

Fine-tuning on Ego-CoMist+ substantially improves mistake intervention capabilities. Qwen3.5-2B reaches an F1 score of 0.20 on per-recipe steps, up from 0.04 for the off-the-shelf model and above the 0.18 of Gemini-3-Flash, showing strong gains for small models that are practical for low-latency and edge-deployed assistants.

Streaming Evaluation on Ego-MC-Bench

	Instruction	Mistake
Method	IC-Acc ↑	Prec. ↑	Rec. ↑	F1 ↑	BERT ↑	ROUGE-L ↑
Per-recipe step
InternVL3.5-38B	3.9	0.00	0.00	0.00	0.000	0.000
Qwen2.5-VL-32B	27.3	0.00	0.00	0.00	0.000	0.000
Qwen3-VL-8B	30.7	0.00	0.00	0.00	0.000	0.000
VideoLLaMA3-7B	31.8	0.00	0.00	0.00	0.000	0.000
Qwen3.5-2B	0.0	0.02	0.29	0.04	0.184	0.130
Qwen3.5-9B	6.1	0.07	0.33	0.11	0.201	0.137
Qwen3.5-27B	45.5	0.12	0.17	0.14	0.206	0.137
Qwen3-VL-32B	6.8	0.10	0.34	0.16	0.068	0.092
VideoLLM-online	2.7	0.02	0.38	0.05	0.265	0.201
LiveCC	1.6	0.03	0.43	0.06	0.248	0.196
Gemini-2.5-Flash	24.6	0.17	0.20	0.18	0.180	0.135
Gemini-3-Flash	32.7	0.18	0.18	0.18	0.126	0.102
Full recipes
Qwen3.5-27B	30.3	0.05	0.13	0.07	0.201	0.136
Qwen3-VL-32B	6.8	0.04	0.28	0.07	0.061	0.090
Gemini-3-Flash	10.6	0.05	0.20	0.08	0.097	0.091

Both proprietary and open models struggle to intervene reliably: the best per-recipe-step F1 is 0.18, and on full recipes, where errors can compound across steps, no model exceeds 0.08.

Streaming Evaluation on Ego-MC-Bench (Training on Counterfactuals)

	Instruction	Mistake
Method	IC-Acc ↑	Prec. ↑	Rec. ↑	F1 ↑	BERT ↑	ROUGE-L ↑
Per-recipe step
ProAssist	3.0	0.31	0.09	0.14	0.281	0.173
Qwen3.5-2B (QICD)	28.9	0.75	0.05	0.10	0.359	0.218
Qwen3.5-2B (Ego-CoMist)	36.1	0.34	0.11	0.12	0.359	0.229
Qwen3-VL-2B (Ego-CoMist+)	30.4	0.40	0.10	0.16	0.335	0.219
Qwen3.5-0.8B (Ego-CoMist+)	30.5	0.29	0.03	0.06	0.339	0.183
Qwen3.5-2B (Ego-CoMist+)	37.1	0.39	0.14	0.20	0.444	0.272
Full recipes
ProAssist	0.2	0.00	0.00	0.00	0.000	0.000
Qwen3.5-2B (QICD)	8.2	0.18	0.04	0.06	0.347	0.161
Qwen3.5-2B (Ego-CoMist+)	19.7	0.30	0.08	0.13	0.433	0.278

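Training with counterfactual mistakes improves both instruction completion and intervention quality. On full recipes, Qwen3.5-2B (Ego-CoMist+) reaches an IC-Acc of 19.7 and an F1 of 0.13, whereas no model in the preceding table exceeds an F1 of 0.08. The gains are strongest for efficient 2B-scale models, making Ego-CoMist+ useful for practical assistants.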

Citation

@article{bhattacharyya2026streaming,
    title={Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?},
    author={Bhattacharyya, Apratim and Mahajan, Shweta and Haresh, Sanjay and Yasarla, Rajeev and Pourreza, Reza and Liu, Litian and Garrepalli, Risheek and Memisevic, Roland},
    journal={arXiv preprint arXiv:2606.09547},
    year={2026}
}