The meeting-room test AI still fails
Meeting rooms are unforgiving places for intelligence.
A person can know the topic, understand the slides, recognize every face around the table, and still be a terrible participant. Speak too early, and they interrupt. Speak too late, and the moment has passed. Say something factually relevant but socially tone-deaf, and the room quietly deducts points. No spreadsheet records this. Everyone notices anyway.
This is the useful irritation behind “SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models,” a paper that evaluates omni-modal large language models not as passive answer machines, but as would-be participants in audio-visual conversation.[1] That distinction matters. Most benchmark culture still treats models like students taking an exam: here is a clip, here is a question, give the right answer. Social interaction is not an exam. It is a moving target with faces, voices, pauses, interruptions, delayed reactions, and small pragmatic signals that decide whether a response feels competent or merely generated.
SocialOmni turns that mess into three questions: who is speaking, when should the model enter, and how should it respond. The result is less a conventional leaderboard than a failure map. Conveniently, failure maps are more useful than leaderboards when people are about to plug a model into customer calls, meeting copilots, tutoring agents, voice assistants, or social robots and then act surprised when “high multimodal intelligence” does not equal social grace. Very mysterious.
SocialOmni tests participation, not just perception
The paper’s core contribution is a benchmark for audio-visual social interactivity across three connected dimensions. The dataset contains 2,000 perception samples and 209 controlled generation instances drawn from 2,209 dialogue clips across 15 subcategories and four broad domains: entertainment, professional settings, daily life, and narrative scenes. The authors evaluate 12 omni-modal models, including commercial systems such as GPT-4o and Gemini variants, and open-source systems such as Qwen3-Omni, Qwen2.5-Omni, OmniVinci, VITA-1.5, Baichuan-Omni-1.5, and MiniOmni2.
The design matters because the three skills are related but not interchangeable.
| Social skill | What SocialOmni asks | What ordinary benchmarks tend to miss | Business failure mode |
|---|---|---|---|
| Who | Can the model identify the active speaker at a timestamp by binding face, voice, and text? | Speaker grounding under camera cuts, reaction shots, and audio-visual mismatch | The assistant attributes a statement to the wrong person in a meeting or call |
| When | Can the model decide whether it is the right moment to speak? | Premature interruption, delayed entry, and no-response behavior | The agent talks over the user or waits until the useful window is gone |
| How | Can the model generate a contextually and socially appropriate continuation? | Pragmatics, tone, interpersonal fit, and topic trajectory | The response is relevant enough to look plausible, but wrong enough to damage trust |
The benchmark’s “who” task is a four-way classification problem. At a query timestamp, the model chooses among options that combine speaker identity and spoken content. This is clever because it separates two errors that are often collapsed in normal evaluation: recognizing the text and grounding that text to the right person. A model can know what was said and still fail to know who said it.
SocialOmni also includes consistent and inconsistent audio-visual cases. In consistent clips, the visible person matches the audio source. In inconsistent clips, the camera may show a different person while someone else is speaking. Anyone who has watched a panel discussion, interview, livestream, or television drama knows this is not exotic. Cameras cut to listeners. Speakers continue off-screen. Reaction shots happen. Models that treat the most visually salient face as the speaker are not socially perceptive; they are just staring with confidence.
The “when” and “how” tasks are coupled. The model first receives a video/audio prefix and is asked whether the candidate speaker should speak now. If it answers yes, it must generate the response. Timing is measured by the signed offset from the annotated turn-entry point. A negative offset means premature interruption; a positive offset means delayed response. The paper maps responses into Interrupted, Perfect, Delayed, TooLate, and NoResponse, later collapsed into Early, On-time, and Late groups.
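The signed-offset idea can be sketched in a few lines. This is a minimal illustration rather than the paper's scoring code: the function name `classify_entry` and the tolerance window `tolerance_s` are assumptions, since the thresholds the authors use are not given here, and the sketch only produces the collapsed Early / On-time / Late groups plus a separate label for a model that never speaks.

```python
from typing import Optional


def classify_entry(predicted_s: Optional[float], annotated_s: float,
                   tolerance_s: float = 0.5) -> str:
    """Label a turn entry by its signed offset from the annotated entry point.

    tolerance_s is a hypothetical window, not a value taken from the paper.
    """
    if predicted_s is None:
        return "NoResponse"  # the model never chose to speak
    offset = predicted_s - annotated_s  # negative = early, positive = late
    if offset < -tolerance_s:
        return "Early"
    if offset > tolerance_s:
        return "Late"
    return "On-time"


# Toy timestamps in seconds (made up for illustration).
for guess in (9.0, 10.2, 12.0, None):
    print(guess, classify_entry(guess, annotated_s=10.0))
```

Widening or narrowing `tolerance_s` changes how strict "on time" is, which is exactly the kind of choice a team should make explicit for its own product.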
The “how” score is judged by three independent LLM judges using a four-level scale of 25, 50, 75, and 100. The judges receive the full ASR transcript, annotated reference continuation, and model response. That choice is practical, not perfect: it lets the paper compare models at scale, while also making the response-quality score partly dependent on textual judging rather than the full embodied audio-visual signal. We will come back to that boundary later.
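For readers who want to reproduce this style of scoring in-house, one simple way to combine judge ratings is a plain average. How the paper aggregates its three judges is not stated here, so the mean below is an assumption, as is the function name `aggregate_how_score`.

```python
from typing import Sequence

ALLOWED_LEVELS = {25, 50, 75, 100}  # the four-level scale described above


def aggregate_how_score(judge_scores: Sequence[int]) -> float:
    """Average judge ratings after checking they sit on the four-level scale.

    Averaging is an illustrative choice, not the paper's documented method.
    """
    if not judge_scores:
        raise ValueError("need at least one judge score")
    off_scale = [s for s in judge_scores if s not in ALLOWED_LEVELS]
    if off_scale:
        raise ValueError(f"scores outside the scale: {off_scale}")
    return sum(judge_scores) / len(judge_scores)


print(aggregate_how_score([75, 100, 50]))  # 75.0
```

Rejecting off-scale values early keeps a misbehaving judge prompt from silently skewing the average.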
Category one: “Who” fails when models confuse visibility with identity
The most basic social question is also the easiest to underestimate: who is speaking?
In SocialOmni, the best model on this axis is Qwen3-Omni, with 69.25% speaker-identification accuracy. Gemini 3 Pro Preview follows with 64.99%. GPT-4o is much lower at 36.75%. MiniOmni2, evaluated only on perception because of interface constraints, reaches 16.72%.
These numbers should not be read as a universal ranking of model intelligence. They are task-specific. But they expose something important: audio-visual social grounding is not automatically solved by general multimodal capability. A model may understand images, audio, and text separately, yet still fail to bind them at the moment when a conversation needs attribution.
The paper’s appendix makes the failure sharper. On the perception task, SocialOmni reports performance separately for consistent and inconsistent clips. Qwen3-Omni performs strongly overall and has a smaller consistency gap than many peers, scoring 69.97% on consistent cases and 64.73% on inconsistent cases. Gemini 3 Pro Preview scores 66.24% on consistent cases and 57.04% on inconsistent cases. GPT-4o drops from 38.14% to 28.00%. The point is not that every model collapses in the same way. The point is that cross-modal mismatch is a measurable stress test, not an anecdotal annoyance.
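The size of each model's consistency gap is simple arithmetic on the figures just quoted. The short sketch below only restates those numbers; the dictionary and function names are illustrative.

```python
# Consistent vs. inconsistent "who" accuracy (%), as quoted in the text above.
WHO_ACCURACY = {
    "Qwen3-Omni": (69.97, 64.73),
    "Gemini 3 Pro Preview": (66.24, 57.04),
    "GPT-4o": (38.14, 28.00),
}


def consistency_gaps(table):
    """Return the consistent-minus-inconsistent gap in percentage points."""
    return {
        model: round(consistent - inconsistent, 2)
        for model, (consistent, inconsistent) in table.items()
    }


for model, gap in consistency_gaps(WHO_ACCURACY).items():
    print(f"{model}: {gap:.2f} points")
```

On these figures the gaps are 5.24, 9.20, and 10.14 percentage points respectively, which is why Qwen3-Omni reads as the most robust of the three to audio-visual mismatch.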
A useful interpretation is this: speaker identification is not just recognition. It is identity tracking under camera politics. In real footage, the camera does not always serve the model’s convenience. It may show a listener, a silent participant, a partial face, a reaction shot, or a visually dominant person who is not speaking. The model must maintain a speaker binding across frames and audio, rather than latch onto whichever face happens to be large and emotionally expressive.
The representative failure cases in the appendix show exactly this pattern. In one “who” failure, the correct answer requires binding continuous audio to the man on the right, but models follow the visually dominant left-side close-up. That is not a failure of text comprehension. It is a saliency-driven attribution error.
For business use, this category matters whenever accountability matters. Meeting summaries, sales-call notes, medical intake copilots, legal interview assistants, and educational discussion tools do not merely need to know what was said. They need to know who said it. Misattribution is not a cosmetic defect. It changes responsibility, intent, and sometimes liability. The polite term is “speaker attribution error.” The impolite term is “putting words in the wrong person’s mouth.”
Category two: “When” fails when models confuse a pause with permission
The second category is timing. This is where many demos look better than products.
In a clean demo, the user finishes speaking, pauses, and the model responds. In real conversation, people pause inside a thought. They hesitate. They look away. They restart a sentence. They use silence for emphasis. They interrupt themselves before anyone else gets the privilege.
SocialOmni’s “when” task tests whether models can enter at the correct moment. The best timing result in the main table is Gemini 3 Pro Preview at 67.31%. Qwen3-Omni follows at 63.64%, Gemini 2.5 Flash at 61.50%, and Gemini 3 Flash Preview at 61.06%. GPT-4o is at 46.89%. Again, the numbers are not a general model election. They are a diagnosis of turn-entry behavior.
The more interesting finding is not the ranking. It is the split between two bad habits.
Some models are over-eager interrupters. Qwen2.5-Omni has an Early rate of 22.5%, and VITA-1.5 has an Early rate of 21.9%. These systems often treat prosodic pauses or hesitation-like moments as turn endings. That is a shallow heuristic: silence has occurred, therefore the floor must be open. Humans do use silence as a cue, but not alone. We also consider syntax, topic completion, eye contact, intonation, emotional pressure, and whether the sentence has actually landed.
Other models are silent observers. OmniVinci has a Late rate of 54.5%, and GPT-4o has a Late rate of 45.5%. They avoid premature interruption, but miss the conversational window. In many applications, being late is not safer. It is just a slower way to be wrong.
The paper’s precision-recall analysis reinforces the point. Two models can have similar on-time rates while behaving very differently: one may be high-precision and low-recall, speaking only when it is very sure but missing valid entry points; another may be high-recall and lower-precision, capturing more opportunities but interrupting too often. A single timing score hides the product personality. The user, naturally, experiences that personality directly.
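A toy calculation makes the two personalities visible. The sketch assumes each evaluation moment is reduced to a pair of booleans, whether the model should have spoken and whether it did; that framing, the function name, and the data are all illustrative rather than the paper's exact protocol.

```python
def precision_recall(decisions):
    """Compute precision and recall over (should_speak, did_speak) pairs."""
    decisions = list(decisions)  # allow generators such as zip(...)
    tp = sum(1 for should, did in decisions if should and did)
    fp = sum(1 for should, did in decisions if not should and did)
    fn = sum(1 for should, did in decisions if should and not did)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Two made-up policies judged on the same six moments.
truth = [True, True, True, True, False, False]
cautious = [True, False, False, False, False, False]  # rarely speaks
eager = [True, True, True, True, True, True]          # always speaks

print("cautious:", precision_recall(zip(truth, cautious)))
print("eager:   ", precision_recall(zip(truth, eager)))
```

The cautious policy never interrupts but misses three of four valid openings; the eager one catches every opening and also talks over two moments where it should have stayed quiet.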
The appendix’s challenging subset adds a useful caution. Human raters outperform models on selected hard cases, and the paper reports a negative item-level association between human judgments and model ensembles on the “when” axis. The authors interpret this as evidence that models rely too heavily on local acoustic cues such as brief silence-like gaps, while humans use higher-order semantic completion and pragmatic intent. That is exactly the distinction procurement teams should care about. A model that “detects pauses” is not the same as a model that understands turn completion.
For deployment, this category maps directly to product risk. In customer service, premature entry sounds rude. In tutoring, it can cut off student reasoning. In a meeting copilot, it can distract from the very conversation it is supposed to support. In trading or operations copilots, late entry may mean the signal arrives after the actionable moment. Timing is not polish. Timing is function.
Category three: “How” fails when models answer the topic but miss the relationship
The third category is response generation: once the model decides to speak, can it say the right kind of thing?
On the “how” score, Gemini 2.5 Flash leads with 85.08, followed by Gemini 3 Pro Preview at 81.77 and Gemini 3 Flash Preview at 79.08. GPT-4o reaches 69.64. The best open-source score is Qwen2.5-Omni at 66.15, nearly 19 points behind the best commercial score. Some models perform very poorly on this axis: VITA-1.5 scores 12.49, Baichuan-Omni-1.5 scores 27.27, and Qwen3-Omni-Thinking scores 18.06 despite doing much better on speaker identification.
That last contrast is the paper’s most useful warning. Qwen3-Omni leads the “who” axis, but not the “how” axis. Qwen3-Omni-Thinking has relatively competitive perception but weak response generation. GPT-4o shows the reverse profile: weak “who” performance but much stronger “how” performance. The paper calls this a decoupling between perceptual accuracy and natural interruption generation. In business language: a model can understand enough to pass one gate and still fail the next gate.
This is where ordinary benchmark intuition misleads. We often imagine social competence as a stack: first perception, then reasoning, then response. Improve the lower layer, and the upper layer should improve. SocialOmni suggests the stack is leakier than that. Perception and generation are coupled in the task, but not reliably correlated in model performance.
The “how” failure cases make the issue concrete. In one appendix example, the dialogue centers on collateral, a guarantor, and someone’s discomfort asking family for help. The reference continuation is empathetic: “I understand your feelings about it, Richard.” Several models instead produce generic problem-solving replies such as “We need to find another solution” or “There has to be another way,” and receive low judge scores. The models identify the business-like topic but miss the interpersonal move. They answer the problem, not the person.
That failure is familiar in enterprise AI. A support agent may technically address the issue while sounding dismissive. A sales assistant may summarize objections while ignoring emotional hesitation. A tutoring bot may give the correct next hint while missing that the student is frustrated, not confused. A meeting assistant may offer an action item when the conversation needed acknowledgment. The response is not factually absurd. It is socially mispositioned. This is the worst kind of demo failure because it often looks acceptable in screenshots.
The real finding is decoupling, not a new champion model
SocialOmni’s most important finding is not that one model wins. No model wins cleanly.
The leaders differ by axis: Qwen3-Omni leads on “who,” Gemini 3 Pro Preview leads on “when,” and Gemini 2.5 Flash leads on “how.” The paper’s radar-style capability profiles show lopsided polygons rather than clean dominance. A single aggregate score would therefore be actively misleading. It would reward the model that looks best on average, while hiding whether it is bad at speaker attribution, bad at timing, or bad at natural continuation.
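The per-axis view is easy to keep in code instead of collapsing it into one number. The sketch below uses only the scores quoted in this article, so each axis lists a partial set of models; the structure and names are illustrative.

```python
# Scores quoted in this article only; the paper's full tables cover more models.
SCORES = {
    "who": {"Qwen3-Omni": 69.25, "Gemini 3 Pro Preview": 64.99,
            "GPT-4o": 36.75, "MiniOmni2": 16.72},
    "when": {"Gemini 3 Pro Preview": 67.31, "Qwen3-Omni": 63.64,
             "Gemini 2.5 Flash": 61.50, "Gemini 3 Flash Preview": 61.06,
             "GPT-4o": 46.89},
    "how": {"Gemini 2.5 Flash": 85.08, "Gemini 3 Pro Preview": 81.77,
            "Gemini 3 Flash Preview": 79.08, "GPT-4o": 69.64,
            "Qwen2.5-Omni": 66.15},
}


def axis_leaders(scores):
    """Return the top-scoring model on each axis, kept separate per axis."""
    return {axis: max(by_model, key=by_model.get)
            for axis, by_model in scores.items()}


for axis, model in axis_leaders(SCORES).items():
    print(f"{axis}: {model} ({SCORES[axis][model]})")
```

Three axes, three different leaders: the output is the decoupling finding in miniature.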
For business readers, the right replacement is not “choose the top model.” It is “match the risk profile to the workflow.”
| Workflow | The dangerous hidden weakness | Evaluation priority |
|---|---|---|
| Meeting transcription and summaries | Misattributing statements to the wrong participant | Stress-test “who” under reaction shots and off-screen speech |
| Live customer-service voice agent | Interrupting the customer or waiting too long | Measure Early / On-time / Late behavior, not just response accuracy |
| Tutoring or coaching agent | Giving a correct answer at the wrong conversational moment | Test turn completion and learner hesitation cases |
| Social robot or embodied assistant | Looking socially present while failing cross-modal grounding | Combine speaker binding, gaze/face cues, and timing diagnostics |
| Trading or operations copilot | Producing a delayed intervention after the decision window | Track timing latency and opportunity loss, not only explanation quality |
This is where the paper becomes practically useful. It provides a vocabulary for procurement and model evaluation that is more precise than “multimodal capability.” A company can ask: Do we need the model to identify speakers reliably? Do we need it to enter live conversation? Do we need it to generate socially calibrated responses? If the answer is yes to all three, a normal video-QA benchmark is not enough. It is measuring a spectator. The product needs a participant.
What the experiments support, and what they do not
SocialOmni’s experiments should be read with some discipline. The main evidence is the who–when–how evaluation across 12 models: the benchmark scale, the axis-specific leaders, the decoupled rankings, the timing decomposition, and the reported failure categories. The controlled inconsistent split is a robustness probe for audio-visual conflict. The precision-recall view and Early / On-time / Late decomposition are diagnostic analyses that explain why a timing score alone is too thin. The appendix’s human-feedback subset is an exploratory extension on intentionally difficult cases, not a replacement human baseline for the full benchmark.
That distinction matters because the paper is strongest as a diagnostic instrument, not as a final verdict on all social intelligence in AI. The 209-item generation split is deliberately compact. The authors justify this as a trade-off: open-ended dialogue continuation has variance from decision-point selection, reference quality, and judge sensitivity, so scaling without control can increase noise. That is reasonable, but it means the generation findings should guide testing rather than end the conversation.
The “how” metric also relies on LLM-as-a-judge scoring over transcribed outputs. The judges assess appropriateness, coherence, and pragmatics using the transcript, reference continuation, and model response. This is useful for large-scale comparison, but it may underweight prosody, gesture, facial expression, and other embodied signals. The paper acknowledges related boundaries and suggests future work on human evaluation, multi-turn trajectories, and prosody- and gesture-aware assessments.
Finally, SocialOmni is about controlled benchmark interaction, not full real-world deployment. It does not prove how a production agent will behave after tool integration, latency management, memory systems, personalization, escalation rules, and product-specific prompting. Anyone pretending otherwise is trying to sell something. Still, it gives exactly the kind of failure taxonomy teams need before those engineering layers start hiding the problem under dashboards.
How businesses should use this paper
The immediate use of SocialOmni is not to replace all evaluation with one new benchmark. The better use is to change the evaluation checklist.
First, evaluate speaker binding under realistic camera behavior. Do not test only clips where the speaker’s face is centered and lips are visible. Include off-screen speech, reaction shots, partially visible speakers, overlapping participants, and camera cuts. The model should not become confused every time the director does something normal.
Second, evaluate turn-entry policy as a distribution, not a yes/no success rate. Interrupted, Delayed, TooLate, and NoResponse outcomes are different product risks. A support bot that interrupts is not equivalent to one that responds late. Both can score poorly, but they annoy users in different ways and require different fixes.
Third, evaluate response pragmatics separately from factual relevance. A response can mention the correct topic and still fail the social move. Ask judges—human or model-assisted—to score whether the response fits the emotional and interpersonal context, not merely whether it is coherent English.
Fourth, keep the three categories separate in reporting. A single “multimodal score” is comforting because it is easy to put into a deck. It is also how organizations make expensive model choices while missing the failure that actually matters for their product. Use capability profiles. Radar charts are not a strategy, but they are at least harder to lie with than one heroic number.
Fifth, treat model selection as workflow-specific. A meeting-copilot vendor should worry more about who-said-what reliability than a casual entertainment assistant. A live voice agent should care obsessively about timing. A tutoring agent needs timing and emotional appropriateness. A customer-support agent needs all three, which is unfortunate but not the paper’s fault.
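The second checklist item, reporting timing as a distribution, takes very little code. The sketch below assumes each test clip has already been labeled with one of the five outcome names the paper uses; the function name and the sample data are made up for illustration.

```python
from collections import Counter

LABELS = ["Interrupted", "Perfect", "Delayed", "TooLate", "NoResponse"]


def timing_distribution(outcomes):
    """Return the share of each timing outcome instead of one success rate."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    counts = Counter(outcomes)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return {label: counts[label] / len(outcomes) for label in LABELS}


# Made-up results for ten clips from a hypothetical voice agent.
sample = ["Perfect"] * 5 + ["Interrupted"] * 2 + ["Delayed"] * 2 + ["NoResponse"]
for label, share in timing_distribution(sample).items():
    print(f"{label:12s} {share:.0%}")
```

Two agents with the same 50% "Perfect" share can differ sharply in the remaining rows, and those rows are what determine whether the fix is a patience threshold or a latency budget.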
The quieter lesson: interaction is a system property
The industry likes to talk about omni models as if adding modalities automatically makes interaction natural. Vision plus audio plus text equals conversational intelligence. That is the kind of equation that works beautifully until a model interrupts a grieving customer, assigns a quote to the wrong executive, or responds to a vulnerable person with a generic troubleshooting sentence.
SocialOmni pushes against that assumption. It shows that social interactivity is not one capability but a chain of fragile decisions: bind the right voice to the right person, infer the right entry point, then generate the right continuation. Break any link and the system becomes awkward. Break the wrong link in a high-stakes setting and awkward becomes operational risk.
The paper’s business value is therefore not just better benchmarking. It is cheaper diagnosis. Instead of discovering after deployment that users “don’t like the assistant,” teams can ask a more useful question: is the assistant failing at who, when, or how?
That question will not make AI socially competent by itself. It will, however, make the failure less mysterious. In this industry, that already counts as progress.
Cognaptus: Automate the Present, Incubate the Future.
[1] Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, and Rongrong Ji, “SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models,” arXiv:2603.16859v1, 17 March 2026, https://arxiv.org/abs/2603.16859.