MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games

Shufang Lin*¹, Muyang Chen*¹, Xiabing Zhou², Rongrong Zhang², Dayou Zhang†², Fangxin Wang¹
¹The Chinese University of Hong Kong, Shenzhen, Guangdong, China
²Capital Normal University, Beijing, China
Figure 1: MISID Benchmark Overview. (Top) A multi-participant strategic dialogue timeline exhibiting hidden tactics. (Bottom) Our multi-dimensional annotation scheme and fact-based reasoning paradigm.

Abstract

Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods.

To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition. Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking. Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) reveals critical deficiencies in these complex scenarios. We therefore propose FRACTAM, a baseline framework that uses a "Decouple-Anchor-Reason" paradigm to construct explicit cross-modal evidence chains and improve the performance of mainstream models on complex strategic tasks.

Dataset Statistics

Speech segments: 3,962
Total duration: 9.15 hours
Participants: 15
Avg. utterances per game: 374.7

Key Features

Complex Strategic Environment

Constructed from high-pressure social strategy games involving deception, reasoning, and voting-based elimination (e.g., Werewolf).

Multi-turn Dynamics

Captures the dynamic evolution of intents grounded in key facts during prolonged interactions, with the number of speech segments per game ranging from 154 to 555.

Fact-based Causal Annotation

Shifts the learning objective from superficial guessing to tracking complex derivation chains based on explicitly annotated historical hard evidence.

Multimodal Synchronization

Contains precisely synchronized video and audio modalities, capturing subtle cross-modal leaks (e.g., micro-expressions vs. verbal claims).

Annotation Scheme

Figure 2: Distribution of Multi-dimensional Annotations. Distributions of 5-point scale scores for four metrics (left), categorical emotions (center), and speech durations across game roles (right).

Annotation Layers

Layer 1: Utterance-Level (Micro-states)

Records foundational background metrics for individual utterances, including participant identity, basic emotional state, emotion intensity, and a subjectivity vs. objectivity judgment.

Layer 2: Turn-Level (Macro-discourse)

Targets long-range multimodal discourse analysis. Annotates key strategic events, confidence scores, and, critically, modality inconsistencies (cross-modal incongruence during interactions).

Fact-based Reasoning Paradigm

A structured approach that combines both layers with ground truth to precisely locate key contextual cues, guiding models to reconstruct logical chains and infer deceptive behaviors.
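
For concreteness, the sketch below shows one way a two-tier annotation record could be represented in Python. The class and field names are illustrative assumptions and do not reflect the released MISID schema.

# Illustrative sketch of the two-tier annotation scheme; field names are
# hypothetical and do not mirror the released MISID files.
from dataclasses import dataclass, field
from typing import List

@dataclass
class UtteranceAnnotation:          # Layer 1: micro-states per utterance
    speaker_id: str                 # participant identity
    emotion: str                    # categorical emotion, e.g. "neutral"
    emotion_intensity: int          # 1-5 scale
    subjectivity: float             # subjective (1.0) vs. objective (0.0)

@dataclass
class TurnAnnotation:               # Layer 2: macro-discourse per turn
    strategic_event: str            # key strategic event label
    confidence: int                 # 1-5 confidence score
    modality_inconsistency: bool    # cross-modal incongruence flag
    evidence_utterances: List[int] = field(default_factory=list)  # indices of annotated hard evidence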

FRACTAM Baseline Framework

Figure 3: Overall Architecture of the FRACTAM Framework. The pipeline standardizes multimodal inputs into objective text, retrieves historical evidence, and constructs explicit logical chains.

"Decouple-Anchor-Reason" Paradigm

Stage 1: Unimodal Fact Decoupling

To prevent textual priors from inducing visual hallucinations, FRACTAM disables early cross-modal attention: the MLLM independently decodes visual and audio signals into objective, factual text descriptions, mitigating the dominance of textual logic.
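
A minimal sketch of this decoupling step is shown below; vision_model and audio_model are hypothetical stand-ins for whatever captioning and transcription backends an implementation would use, and the prompts are assumptions rather than FRACTAM's actual templates.

# Illustrative sketch of Stage 1: each modality is decoded to objective text
# independently, so textual priors cannot steer the visual/audio description.
# `vision_model` and `audio_model` are hypothetical backends.
def decouple_unimodal_facts(video_clip, audio_clip, vision_model, audio_model):
    visual_facts = vision_model.describe(
        video_clip,
        prompt="List only observable facial expressions, gaze, and gestures. No interpretation.",
    )
    audio_facts = audio_model.transcribe(
        audio_clip,
        prompt="Transcribe speech and note prosodic cues (pauses, pitch shifts) factually.",
    )
    # The two factual descriptions are only merged downstream, in Stage 3.
    return {"visual": visual_facts, "audio": audio_facts}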

Stage 2: Hybrid Long-range Fact Anchoring

A dual-stage recall mechanism (Lexical + Semantic search with Reciprocal Rank Fusion) followed by a Cross-Encoder reranker isolates sparse causal variables from dense historical noise spanning hundreds of turns.
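
A minimal sketch of this stage follows, assuming generic lexical and semantic retrievers plus a cross-encoder scorer (all hypothetical stand-ins); the Reciprocal Rank Fusion step uses the standard 1/(k + rank) formulation.

# Illustrative sketch of Stage 2: fuse lexical and semantic rankings with
# Reciprocal Rank Fusion (RRF), then rerank the fused shortlist with a
# cross-encoder. Retriever/reranker objects are hypothetical stand-ins.
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked lists of utterance ids (best first)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def anchor_historical_facts(query, history, lexical_retriever, semantic_retriever,
                            cross_encoder, recall_k=50, final_k=8):
    lexical_hits = lexical_retriever.search(query, history, top_k=recall_k)
    semantic_hits = semantic_retriever.search(query, history, top_k=recall_k)
    candidates = reciprocal_rank_fusion([lexical_hits, semantic_hits])[:recall_k]
    # The cross-encoder scores each (query, utterance) pair jointly for precise reranking.
    scored = [(cross_encoder.score(query, history[i]), i) for i in candidates]
    scored.sort(reverse=True)
    return [i for _, i in scored[:final_k]]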

Stage 3: Chain-of-Evidence Reasoning

Explicit cross-modal causal chains are constructed. The reasoning model is forced to follow these explicit evidence chains to generate the final fact determination and hidden intent analysis.
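
The sketch below shows one plausible way to serialize the anchored facts into an explicit evidence chain for the reasoning model; the template wording is an assumption, not FRACTAM's exact prompt.

# Illustrative sketch of Stage 3: the anchored facts are serialized into an
# explicit evidence chain that the reasoning model must cite before concluding.
# The prompt template is a hypothetical approximation, not the released one.
def build_chain_of_evidence_prompt(target_turn, unimodal_facts, anchored_facts):
    evidence_lines = [
        f"[E{idx}] {fact}" for idx, fact in enumerate(anchored_facts, start=1)
    ]
    return (
        "You are analyzing a strategic deception game.\n"
        f"Current turn: {target_turn}\n"
        f"Visual facts: {unimodal_facts['visual']}\n"
        f"Audio facts: {unimodal_facts['audio']}\n"
        "Historical evidence:\n" + "\n".join(evidence_lines) + "\n"
        "Step by step, cite evidence ids [E*] to justify each inference, "
        "then state the final fact determination and the speaker's hidden intent."
    )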

Comparison with Existing Datasets

Dataset            | Depth    | Multimodal (T/A/V) | Turn Length
MCIC               | Explicit | Text Only          | 10-30
SLURP              | Explicit | Text/Audio         | 1-10
MIntRec            | Explicit | Text/Audio/Video   | 1-10
MIntRec 2.0        | Explicit | Text/Audio/Video   | 10-20
MECPE              | Explicit | Text/Audio/Video   | 10-40
Genesis            | Explicit | Text/Audio/Video   | 100-500
CSC                | Implicit | Audio Only         | 1-10
Bag-of-Lies        | Implicit | Text/Audio/Video   | 1-10
MELD               | Implicit | Text/Audio/Video   | 10-100
CraigslistBargain  | Implicit | Text Only          | 10-30
IntentQA           | Implicit | Text/Video         | 1-10
Diplomacy          | Implicit | Text Only          | 100-600
MISID (Ours)       | Implicit | Text/Audio/Video   | 154-555

Table 1: MISID provides unprecedented depth by combining multi-turn, multimodal dynamics with fact-based causal reasoning in complex strategic scenarios.

Citation

If you find our dataset or framework useful in your research, please cite our paper:

@inproceedings{lin2026misid,
  title={MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games},
  author={Lin, Shufang and Chen, Muyang and Zhou, Xiabing and Zhang, Rongrong and Zhang, Dayou and Wang, Fangxin},
  booktitle={Under Review},
  year={2026},
}