Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods.
To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition. Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking. Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) reveals critical deficiencies in complex scenarios. To address these deficiencies, we propose FRACTAM, a baseline framework that uses a "Decouple-Anchor-Reason" paradigm to construct explicit cross-modal evidence chains and improve mainstream models' performance on complex strategic tasks.
Constructed from high-pressure social strategy games involving deception, reasoning, and voting-based elimination (e.g., Werewolf).
Captures the dynamic evolution of intents grounded in key facts during prolonged interactions, with 154 to 555 speech segments per session.
Shifts the learning objective from superficial guessing to tracking complex derivation chains based on explicitly annotated historical hard evidence.
Contains precisely synchronized video and audio modalities, capturing subtle cross-modal leaks (e.g., micro-expressions vs. verbal claims).
Records foundational background metrics for individual utterances, including participant identity, basic emotional state, emotion intensity, and a subjectivity vs. objectivity label.
Targets long-range multimodal discourse analysis. Annotates key strategic events, confidence scores, and, critically, modality inconsistencies (cross-modal incongruence during interactions).
A structured approach that combines both annotation layers with ground truth to precisely locate key contextual cues, guiding models to reconstruct logical chains and infer deceptive behavior; a sketch of the resulting record structure follows below.
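The snippet below is a minimal, hypothetical sketch of what a two-tier MISID record might look like. All field names (e.g., `emotion_intensity`, `evidence_refs`) are assumptions made for exposition rather than the released schema; consult the dataset files for the authoritative format.

```python
# Hypothetical illustration of the two-tier annotation scheme.
# Field names below are assumptions, not the released schema.

tier1_utterance = {               # Tier 1: per-utterance background metrics
    "utterance_id": "game03_t127",
    "speaker": "Player_4",            # participant identity
    "emotion": "calm",                # basic emotional state
    "emotion_intensity": 2,           # e.g., an ordinal 1-5 scale (assumed)
    "subjectivity": "subjective",     # subjective claim vs. objective fact
}

tier2_discourse = {               # Tier 2: long-range discourse annotations
    "event": "identity_claim",        # key strategic event
    "confidence": 0.85,               # annotated confidence score
    "modality_inconsistency": {       # cross-modal incongruence
        "verbal": "claims to be a Villager",
        "visual": "gaze aversion and a brief micro-expression",
    },
    "evidence_refs": ["game03_t098", "game03_t104"],  # anchored hard evidence
}
```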
To prevent visual hallucinations driven by textual priors, FRACTAM disables early cross-modal attention: MLLMs independently decode the visual and audio signals into objective, factual text descriptions, so textual logic cannot dominate perception.
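As a rough illustration of this decoupling, the sketch below decodes each stream in isolation before any fusion. `vision_mllm` and `audio_model` are hypothetical callables standing in for whatever captioning and ASR backends are used; the prompt text is ours, not the paper's.

```python
# Minimal sketch of Stage 1 (Decouple), assuming `vision_mllm` and
# `audio_model` are user-supplied callables. Each modality is decoded
# into factual text independently, so no stream's language prior can
# bias the perception of another.

def decouple(video_frames, audio_clip, vision_mllm, audio_model):
    # Describe only what is visible; the transcript is never shown here.
    visual_facts = vision_mllm(
        frames=video_frames,
        prompt=(
            "List only directly observable facts (gestures, gaze, "
            "expressions). Do not speculate about intent."
        ),
    )
    # Transcribe speech and describe prosody without seeing the video.
    audio_facts = audio_model(audio_clip)
    return {"visual": visual_facts, "audio": audio_facts}
```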
A dual-stage recall mechanism (Lexical + Semantic search with Reciprocal Rank Fusion) followed by a Cross-Encoder reranker isolates sparse causal variables from dense historical noise spanning hundreds of turns.
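The sketch below shows one way such a dual-stage recall could be wired up, assuming `lexical_rank`, `semantic_rank`, and `cross_encoder` are supplied by the user (e.g., BM25, a bi-encoder, and a cross-encoder reranker). The RRF constant `k = 60` is the value commonly used in the literature, not necessarily FRACTAM's setting.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:                 # each: list of turn ids, best first
        for rank, turn_id in enumerate(ranking, start=1):
            scores[turn_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def anchor(query, history, lexical_rank, semantic_rank, cross_encoder, top_k=8):
    # Stage A: two cheap recallers each scan the full multi-hundred-turn
    # history and return turn ids ranked best-first.
    fused = rrf_fuse([lexical_rank(query, history),
                      semantic_rank(query, history)])
    candidates = fused[: top_k * 4]          # shortlist for the expensive stage
    # Stage B: a cross-encoder scores (query, turn) pairs jointly and keeps
    # only the few turns that plausibly carry causal evidence.
    scored = cross_encoder([(query, history[t]) for t in candidates])
    order = sorted(zip(candidates, scored), key=lambda pair: -pair[1])
    return [turn_id for turn_id, _ in order[:top_k]]
```

RRF is a convenient fusion choice here because it combines heterogeneous rankings (BM25 scores vs. cosine similarities) purely by rank, with no score calibration needed.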
Explicit cross-modal causal chains are constructed, and the reasoning model is constrained to follow them when generating the final fact determination and hidden-intent analysis.
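A minimal sketch of how the retrieved evidence might be serialized into a chain the reasoner must cite. Here `reasoner` is a placeholder for any instruction-following LLM, and the prompt wording is illustrative rather than the paper's.

```python
# Sketch of Stage 3 (Reason), assuming `reasoner` is any instruction-
# following LLM callable and each evidence turn is a dict produced by
# the earlier stages. The prompt wording is illustrative only.

def reason(query, evidence_turns, perceptual_facts, reasoner):
    # Serialize the retrieved turns into an explicit, numbered chain.
    chain = "\n".join(
        f"[E{i}] turn {t['id']}: \"{t['text']}\" | visual: {t['visual']}"
        for i, t in enumerate(evidence_turns, start=1)
    )
    prompt = (
        f"Perceptual facts:\n{perceptual_facts}\n\n"
        f"Evidence chain:\n{chain}\n\n"
        f"Question: {query}\n"
        "Cite an [E#] for every inference step, then state "
        "(1) the fact determination and (2) the hidden-intent analysis."
    )
    return reasoner(prompt)
```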
| Dataset | Intent Depth | Multimodal (T/A/V) | Strategic Env. | Fact-based Causal Annot. | Turns |
|---|---|---|---|---|---|
| MCIC | Explicit | Text Only | ✗ | ✗ | 10-30 |
| SLURP | Explicit | Text/Audio | ✗ | ✗ | 1-10 |
| MIntRec | Explicit | Text/Audio/Video | ✗ | ✗ | 1-10 |
| MIntRec 2.0 | Explicit | Text/Audio/Video | ✗ | ✗ | 10-20 |
| MECPE | Explicit | Text/Audio/Video | ✗ | ✓ | 10-40 |
| Genesis | Explicit | Text/Audio/Video | ✗ | ✓ | 100-500 |
| CSC | Implicit | Audio Only | ✗ | ✗ | 1-10 |
| Bag-of-Lies | Implicit | Text/Audio/Video | ✗ | ✗ | 1-10 |
| MELD | Implicit | Text/Audio/Video | ✗ | ✗ | 10-100 |
| CraigslistBargain | Implicit | Text Only | ✓ | ✗ | 10-30 |
| IntentQA | Implicit | Text/Video | ✗ | ✓ | 1-10 |
| Diplomacy | Implicit | Text Only | ✓ | ✗ | 100-600 |
| MISID (Ours) | Implicit | Text/Audio/Video | ✓ | ✓ | 154-555 |
Table 1: MISID provides unprecedented depth by combining multi-turn, multimodal dynamics with fact-based causal reasoning in complex strategic scenarios.
If you find our dataset or framework useful in your research, please cite our paper: