Cameras are easy. Audits are not.
That is the useful irritation inside FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis, a new benchmark for testing multimodal large language models on commercial-kitchen compliance monitoring [1]. The paper is not asking whether a model can watch a kitchen video and say something vaguely sensible about hygiene. Many systems can now do that, at least with enough confidence to impress a demo audience and mildly alarm the legal department.
FoodMonitor asks something more operationally serious: can the model identify which rule was violated, describe the non-compliant condition, localise the responsible person when there is one, and produce structured evidence that a human auditor could inspect?
The answer, for now, is: not very well.
The best-performing model in the benchmark, Doubao-Seed-2.0-Pro, reaches a $C_{\text{score}}$ of only 0.360. That number is not a small cosmetic blemish on an otherwise triumphant progress chart. It is the central evidence of the paper. When compliance analysis is treated as rule-grounded, attributable, spatially anchored detection, today’s strong video-capable MLLMs stop looking like auditors and start looking like talented interns who noticed something was wrong but cannot reliably say who did what, under which rule, and with what evidence.
Delightful, in the way procurement surprises often are.
The score is low because the task is finally realistic
FoodMonitor’s most important move is not simply that it adds another dataset to the video-understanding pile. The pile is already tall enough to require a safety inspection of its own.
The contribution is that it changes the shape of the task. Traditional video anomaly detection benchmarks often ask whether an event is normal or abnormal. Kitchen-action datasets ask what action is being performed. General MLLM video benchmarks often use question answering to test whether models understand scenes, events, or temporal relationships.
Compliance monitoring is a different animal. It is not satisfied by “someone is doing food preparation” or “the scene looks unhygienic.” It needs a structured claim:
| Requirement | What the model must produce | Why it matters operationally |
|---|---|---|
| Rule grounding | The violated category or check item | Compliance teams need policy linkage, not visual commentary |
| Behaviour or condition description | A natural-language explanation of the violation | Human reviewers need to understand why the alert exists |
| Person attribution | Spatial-temporal anchors for the worker involved | Accountability requires assigning evidence to the right person |
| Environment-channel detection | Scene-level violations such as bins, surfaces, or floors | Not every compliance issue belongs to an individual |
| Structured output | JSON-like predictions rather than free prose | Downstream systems need parsable audit records |
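To make the last row concrete, here is a minimal sketch of what a parsable two-channel record could look like. The class and field names (`category`, `description`, `anchors`, `t`, `box`) are illustrative assumptions, not the schema the paper prescribes.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Anchor:
    t: float                                  # timestamp in seconds
    box: tuple[float, float, float, float]    # (x1, y1, x2, y2), assumed pixel coords


@dataclass
class EnvironmentDetection:
    category: str       # rule category or check item
    description: str    # natural-language explanation


@dataclass
class PersonDetection:
    category: str
    description: str
    anchors: list[Anchor]   # spatial-temporal evidence for one worker


@dataclass
class ClipPrediction:
    environment: list[EnvironmentDetection] = field(default_factory=list)
    persons: list[PersonDetection] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, lists, and tuples
        return json.dumps(asdict(self), indent=2)
```

A record built this way can be stored, diffed, and reviewed like any other audit artefact, which is the practical difference between a detection and a remark.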
FoodMonitor contains 477 one-minute commercial-kitchen videos and 3,307 violation annotations. The videos come from environments such as public catering services, school cafeterias, and factory canteens. The dataset includes 1,031 person instances, with 78.4% of clips containing multiple people. That multi-person condition is not an incidental detail. It is where the easy version of video compliance quietly dies.
A system that can describe a kitchen scene is not necessarily a system that can track the worker in the red box, recognise that the relevant violation occurred across specific moments, and avoid blaming the wrong person. In a compliance setting, “mostly nearby” is not evidence. It is paperwork with a lawsuit attached.
FoodMonitor measures evidence production, not anomaly vibes
The dataset uses a dual-channel annotation structure. Person-level violations cover individual behaviours such as improper dress, mask problems, unsafe handling, hand hygiene failures, and personal habits. Environment-level violations cover scene-wide issues such as dirty surfaces, uncovered waste bins, floor contamination, and cleaning-tool placement.
This dual-channel design is practical. Some violations require accountability: a worker failed to wear a mask properly, touched hair without washing hands, or mishandled food. Other violations are environmental: a bin lacks a lid, a shelf is disorganised, a work surface is dirty. A useful compliance system needs to handle both.
The annotation process is also revealing. FoodMonitor uses a five-stage pipeline: person tracking, temporal behaviour captioning, checklist generation from codified food-safety rules, visual verification plus environment assessment, and human quality assurance. The authors use models in the pipeline, but they do not pretend that model output alone is a gold standard. Human expert verification is the quality gate.
That matters because the paper is about explainable compliance, not merely automated labelling. The annotations connect a violation to a rule category, a description, and, for person-level cases, tracking information over time. The benchmark therefore tests whether MLLMs can produce evidence-like outputs rather than charismatic scene summaries.
The distribution is also deeply uneven, which is exactly what one should expect in real compliance. Personal Hygiene & Dress Code dominates with 2,381 violations, or 72.0% of the dataset. Work Surfaces & Shelves contribute 412, and Waste Disposal contributes 278. Meanwhile, Hand Hygiene and Personal Habits together appear only 7 times.
That imbalance is not a flaw to be waved away. It is the job. In regulated environments, common violations are common, rare violations may still be critical, and a system that learns only the fat middle of the distribution is not a compliance system. It is a very expensive mask detector wearing a blazer.
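The arithmetic behind that imbalance is worth seeing once. The sketch below uses only the counts quoted in this article; the `other_categories` remainder is derived by subtraction and is not a figure taken from the paper.

```python
TOTAL_VIOLATIONS = 3307

# Counts as quoted above; the last entry combines two rare categories.
counts = {
    "Personal Hygiene & Dress Code": 2381,
    "Work Surfaces & Shelves": 412,
    "Waste Disposal": 278,
    "Hand Hygiene + Personal Habits": 7,
}

# Share of all annotations, in percent, rounded to one decimal place.
shares_pct = {name: round(100 * n / TOTAL_VIOLATIONS, 1) for name, n in counts.items()}

# Annotations belonging to categories not itemised in this article.
other_categories = TOTAL_VIOLATIONS - sum(counts.values())

for name, pct in shares_pct.items():
    print(f"{name}: {pct}%")
print(f"Other categories: {other_categories} annotations")
```

The rare categories round to roughly a fifth of one percent of the annotations, which is why a headline score says very little about them.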
The evaluation protocol is the paper’s real machinery
The benchmark protocol is where FoodMonitor becomes more than a dataset release.
Each model receives 60 frames sampled at 1.0 FPS from a 60-second video, plus the complete document of 27 compliance check items. The model must output structured predictions in two channels: environment-level detections and person-level detections. Person-level detections require spatial-temporal anchors, represented as timestamped bounding boxes, with at least 10 anchors.
This is important because the protocol prevents a model from surviving on fluent description alone. It has to place claims into a structure. It has to map observations to rule categories. It has to provide anchors that can be compared against ground truth.
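As a rough illustration of what the protocol asks of a person-level detection, the sketch below generates the sampled timestamps and checks that a set of anchors is well formed. It assumes sampling starts at t = 0, that anchors are `(t, (x1, y1, x2, y2))` pairs, and that the 10-anchor minimum applies per detection; none of those details is confirmed by the description above.

```python
MIN_ANCHORS = 10  # minimum anchor count described in the protocol


def frame_timestamps(duration_s: float = 60.0, fps: float = 1.0) -> list[float]:
    """Timestamps of sampled frames, assuming the first frame is at t = 0."""
    n_frames = int(duration_s * fps)
    return [i / fps for i in range(n_frames)]


def is_well_formed(anchors, duration_s: float = 60.0) -> bool:
    """Check anchor count, timestamp range, and box geometry."""
    if len(anchors) < MIN_ANCHORS:
        return False
    for t, (x1, y1, x2, y2) in anchors:
        if not 0.0 <= t <= duration_s:
            return False          # timestamp falls outside the clip
        if x2 <= x1 or y2 <= y1:
            return False          # degenerate or inverted box
    return True
```

A check like this only establishes that the output can be scored. Whether it is right is the job of the matching stages.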
For environment violations, matching is category-first and then semantic. A predicted environmental issue must belong to the same category as a ground-truth issue, then a text comparison model judges whether the description refers to the same underlying problem. One-to-one matching prevents duplicate predictions from inflating the score.
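A minimal sketch of category-first, one-to-one matching follows. Two things here are assumptions: the greedy first-match assignment (the paper may use a different assignment strategy), and `toy_judge`, a crude word-overlap stand-in for the text comparison model that exists only so the example runs.

```python
def toy_judge(pred_desc: str, gt_desc: str) -> bool:
    """Placeholder semantic judge: True if the descriptions share >= 2 words."""
    shared = set(pred_desc.lower().split()) & set(gt_desc.lower().split())
    return len(shared) >= 2


def match_environment(preds, gts, same_issue=toy_judge):
    """preds and gts are lists of (category, description).

    Returns (true_positives, false_positives, false_negatives).
    """
    used_gt = set()   # each ground-truth item can be matched at most once
    tp = 0
    for p_cat, p_desc in preds:
        for j, (g_cat, g_desc) in enumerate(gts):
            if j in used_gt or p_cat != g_cat:
                continue                      # category gate comes first
            if same_issue(p_desc, g_desc):    # then the semantic check
                used_gt.add(j)
                tp += 1
                break
    return tp, len(preds) - tp, len(gts) - tp
```

With one ground-truth uncovered bin and two near-identical predictions about it, only the first prediction matches and the second is counted as a false positive, which is how one-to-one matching stops repetition from buying score.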
For person violations, FoodMonitor adds a tougher first stage: localisation. A predicted violation must first align with the right ground-truth person instance. The benchmark computes Intersection-over-Union between predicted anchors and the closest ground-truth tracking points within a temporal tolerance of 0.5 seconds. Localisation succeeds only if the mean IoU reaches the threshold of 0.3 and temporal coverage reaches 0.6. Only then does the prediction proceed to semantic matching.
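The localisation gate can be sketched as below, using the thresholds stated above (0.5 s tolerance, mean IoU of 0.3, coverage of 0.6). The precise definitions are assumptions: here, temporal coverage is the fraction of predicted anchors that have a ground-truth tracking point within tolerance, and mean IoU is averaged over those temporally matched anchors. The paper's exact formulation may differ.

```python
def iou(a, b) -> float:
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def localises(pred_anchors, gt_track, tol=0.5, iou_thr=0.3, cov_thr=0.6) -> bool:
    """pred_anchors and gt_track are lists of (t, box)."""
    if not pred_anchors or not gt_track:
        return False
    ious = []
    for t, box in pred_anchors:
        # closest ground-truth tracking point in time
        gt_t, gt_box = min(gt_track, key=lambda g: abs(g[0] - t))
        if abs(gt_t - t) <= tol:
            ious.append(iou(box, gt_box))
    if not ious:
        return False
    coverage = len(ious) / len(pred_anchors)   # assumed coverage definition
    mean_iou = sum(ious) / len(ious)
    return mean_iou >= iou_thr and coverage >= cov_thr
```

In a multi-person clip this function would be evaluated against each ground-truth track, and a prediction that passes for none of them never reaches the semantic stage.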
This two-stage design is the paper’s most useful diagnostic device. It separates two failures that are often conveniently blended together in AI demos:
- The model may fail to point to the right person.
- The model may point to the right person but misunderstand the violation.
Those are different engineering problems. One is spatial perception and tracking. The other is rule-grounded semantic reasoning. Mixing them together gives management a single disappointing score. Separating them gives technical teams something they can actually fix.
The overall $C_{\text{score}}$ is the average of the environment F1 and person F1:
$$ C_{\text{score}} = \frac{1}{2}(F_{1,\text{env}} + F_{1,\text{per}}) $$
That equal weighting is not a claim that every deployment should value both channels equally. It is a benchmark design choice. A school cafeteria, a central kitchen, and a factory canteen may assign different operational costs to missed person-level versus environment-level violations. But as a research protocol, the average makes the point cleanly: compliance requires both systemic and attributable evidence.
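For teams that do want to weight the channels differently, the adjustment is a one-liner. In the sketch below, the default reproduces the benchmark's equal-weight average; the `w_env` parameter is a hypothetical extension for deployment-specific analysis and is not part of FoodMonitor's protocol.

```python
def c_score(f1_env: float, f1_per: float, w_env: float = 0.5) -> float:
    """Weighted mean of environment and person F1.

    w_env = 0.5 gives the benchmark's equal-weight C_score.
    Any other value is a local, non-benchmark choice.
    """
    if not 0.0 <= w_env <= 1.0:
        raise ValueError("w_env must lie in [0, 1]")
    return w_env * f1_env + (1.0 - w_env) * f1_per


# Channel F1 values reported for Doubao-Seed-2.0-Pro: 0.459 (env), 0.261 (person)
benchmark = c_score(0.459, 0.261)              # equal weights, as in the paper
person_heavy = c_score(0.459, 0.261, w_env=0.3)  # illustrative re-weighting only
print(round(benchmark, 3), round(person_heavy, 3))
```

Shifting weight toward the person channel lowers the composite, because that is the weaker channel. Re-weighted numbers are not comparable with published $C_{\text{score}}$ values and should be labelled accordingly.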
The main result: models see more than they can prove
The main evidence comes from the benchmark table comparing 11 MLLMs, including closed-source and open-source models. All are evaluated in thinking mode under the same input and output protocol.
| Model | $C_{\text{score}}$ | Environment F1 | Person F1 | Interpretation |
|---|---|---|---|---|
| Doubao-Seed-2.0-Pro | 0.360 | 0.459 | 0.261 | Best overall, still weak in absolute terms |
| Gemini-3-Flash | 0.316 | 0.376 | 0.256 | Second-best overall; person F1 close to the leader |
| Doubao-Seed-2.0-Lite | 0.298 | 0.421 | 0.175 | Stronger environment channel than person channel |
| GLM-4.6V | 0.279 | 0.382 | 0.177 | Best open-source result in the table |
| GLM-4.6V-Flash | 0.135 | 0.242 | 0.028 | Person-level attribution nearly collapses |
The pattern is consistent. Models perform better on environment violations than on person violations. That is exactly what the task design predicts. Environment violations often involve scene-level states: a waste bin, a dirty shelf, a surface condition. Person-level violations require locating the right individual over time and tying a specific rule violation to that person.
The models also show higher precision than recall across most channels. In plain business language: the alerts they raise are more trustworthy than their coverage. A prediction, once made, has a fair chance of being correct, but many actual violations never produce a prediction at all. This is conservative behaviour. It may be tolerable for a triage assistant that surfaces possible issues to a human reviewer. It is not tolerable for a system marketed as autonomous compliance assurance, unless the word “assurance” has recently been redefined to mean “occasionally confident.”
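For readers who want the definitions in front of them, the sketch below computes precision, recall, and F1 from match counts. The counts in the example are invented to show a conservative profile and are not taken from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions, with zero-division guarded as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts: 50 alerts raised, 30 correct, 70 real violations missed.
p, r, f1 = precision_recall_f1(tp=30, fp=20, fn=70)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this invented case, six in ten alerts are right but only three in ten violations are caught. F1, being a harmonic mean, is pulled toward the weaker of the two, so low recall cannot be hidden behind respectable precision.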
Closed-source models generally lead the ranking, with Doubao-Seed-2.0-Pro, Gemini-3-Flash, and Doubao-Seed-2.0-Lite taking the top three $C_{\text{score}}$ values. But the best open-source model, GLM-4.6V, reaches 0.279, which is competitive with some closed-source alternatives. The paper’s evidence does not support a lazy proprietary-versus-open-source morality play. It supports a more useful conclusion: model family matters, but the compliance bottleneck is not solved by simply buying the strongest general MLLM available.
The task itself is structurally hard.
Person-level compliance breaks at two different seams
FoodMonitor’s error decomposition is where a business reading of the paper should spend its time. Table 5 is not an ablation. It is diagnostic evidence: it explains why the main scores are low by splitting person-detection failures into localisation and semantic components.
The paper reports four process metrics for person detection:
| Metric | What it measures | Operational reading |
|---|---|---|
| IMR | Fraction of predictions that successfully localise to a ground-truth person instance | Can the system point to the right person? |
| SHR | Fraction of localised predictions that semantically match the violation | Once it points correctly, does it understand the rule breach? |
| $r_{\text{loc}}$ | Share of false positives caused by localisation failure | Are wrong alerts mostly spatial mistakes? |
| $r_{\text{sem}}$ | Share of false positives caused by semantic mismatch | Are wrong alerts mostly rule-interpretation mistakes? |
The authors identify two broad failure modes.
The first is localisation-dominated failure. Here, a model often fails to align predictions with the correct ground-truth person. Gemini-3-Pro, GLM-4.6V, and Qwen3-VL-8B-Thinking are listed as representative examples. The symptoms are low instance match rate and high localisation false-positive ratio. GLM-4.6V, for example, has an IMR of 0.385 and $r_{\text{loc}}$ of 0.805. The model may be reasoning about the scene, but its claim cannot be attached reliably to the correct person.
The second is semantic-dominated failure. Here, the model can often localise the person but then misidentifies the violation. Doubao-Seed-2.0-Pro and Doubao-Seed-2.0-Lite are representative. Doubao-Seed-2.0-Pro has a high IMR of 0.729 but an SHR of 0.459 and $r_{\text{sem}}$ of 0.593. In other words, the best overall model is relatively good at pointing, but still frequently wrong about what rule-relevant violation it has seen.
That distinction is commercially important. A localisation-dominated system needs better visual grounding, tracking, camera calibration, or specialised detection modules. A semantic-dominated system needs better rule mapping, context interpretation, temporal reasoning, and perhaps domain-specific fine-tuning. Treating both as “the model is inaccurate” is technically true and managerially useless.
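The four process metrics can be sketched from three outcome counts. This assumes every person-level prediction ends in exactly one of three states (true positive, localised but semantically wrong, or not localised), so that $r_{\text{loc}} + r_{\text{sem}} = 1$. That is an interpretation of the metric descriptions, not a definition quoted from the paper, and the function names and the 0.5 cut-off in `dominant_failure` are hypothetical.

```python
def process_metrics(n_tp: int, n_sem_fail: int, n_loc_fail: int) -> dict[str, float]:
    """IMR, SHR, r_loc, r_sem from per-prediction outcome counts."""
    n_pred = n_tp + n_sem_fail + n_loc_fail
    n_localised = n_tp + n_sem_fail          # passed the localisation gate
    n_fp = n_sem_fail + n_loc_fail           # all false positives
    return {
        "IMR": n_localised / n_pred if n_pred else 0.0,
        "SHR": n_tp / n_localised if n_localised else 0.0,
        "r_loc": n_loc_fail / n_fp if n_fp else 0.0,
        "r_sem": n_sem_fail / n_fp if n_fp else 0.0,
    }


def r_sem_from_rates(imr: float, shr: float) -> float:
    """r_sem implied by IMR and SHR under the three-outcome assumption."""
    sem_fail = imr * (1.0 - shr)   # localised, then semantically wrong
    loc_fail = 1.0 - imr           # never localised
    total_fp = sem_fail + loc_fail
    return sem_fail / total_fp if total_fp else 0.0


def dominant_failure(r_loc: float) -> str:
    """Label the failure mode by which share of false positives is larger."""
    return "localisation" if r_loc > 0.5 else "semantic"
```

Under this reading, the Doubao-Seed-2.0-Pro figures quoted above are internally consistent: `r_sem_from_rates(0.729, 0.459)` rounds to the 0.593 the article cites. That offers some reassurance that the three-outcome interpretation is close to the paper's, though it is not proof.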
Figure 3 serves as an illustrative case study rather than a second quantitative result. It compares ground truth with a Gemini-3-Flash prediction. The ground truth includes person violations such as not wearing gloves and not washing hands after touching hair, with no environment violation. The model detects some correct issues but adds false positives, including mask and hair-net claims, and invents an environment violation about disorganised shelves. The figure makes the failure mode tangible: the model is not blind, but its evidence is contaminated by confident extras.
In compliance, confident extras are not harmless. A false accusation against a worker is not just model noise. It is an accountability error.
The benchmark is about auditability, not surveillance theatre
The obvious but wrong reading of FoodMonitor is that it shows current MLLMs are “bad at food safety.” That is too broad and too convenient. The paper tests a specific capability: explainable compliance analysis in 60-second commercial-kitchen surveillance clips, under a structured output and matching protocol.
The better reading is that FoodMonitor exposes the missing layer between video understanding and compliance automation.
A model can produce a plausible description of a kitchen scene. That is not the same as producing a defensible compliance record. A defensible record needs rule linkage, evidence localisation, instance attribution, and semantic specificity. It must survive human review. It must be usable in an appeal. It must avoid confusing the person who violated the rule with the person standing next to them. Tedious requirements, admittedly, but regulation has never been famous for appreciating vibes.
This changes the buyer’s checklist.
| Buyer question | Weak vendor answer | Stronger requirement |
|---|---|---|
| Can the system detect violations? | “Yes, our model understands video.” | Show precision, recall, and F1 by violation channel |
| Can it identify responsible workers? | “It highlights people in the scene.” | Show person-instance matching performance over time |
| Can it explain the violation? | “It generates natural-language summaries.” | Map every alert to a rule category and evidence anchors |
| Can it reduce audit burden? | “It automates monitoring.” | Measure reviewer time saved after false positives and missed violations |
| Can it support accountability? | “It logs alerts.” | Preserve structured, reviewable, appeal-friendly evidence trails |
FoodMonitor does not prove that compliance AI is unusable. It proves that compliance AI should be evaluated as an evidence system, not an anomaly classifier with a nicer interface.
That is a much less glamorous sentence. It is also the one buyers should probably print.
What Cognaptus infers for business use
The paper directly shows that current MLLMs struggle with FoodMonitor’s structured compliance task. It shows that environment detection is easier than person-level attribution. It shows that low recall is common. It shows that person-level errors split into localisation-dominated and semantic-dominated modes. It shows that even the best benchmark score is low.
From that, Cognaptus infers several practical implications.
First, compliance AI should begin as human-in-the-loop audit support, not autonomous enforcement. The systems may be useful for prioritising review, surfacing candidate violations, and creating structured drafts of evidence records. But the benchmark results do not support replacing human auditors where person attribution and rule interpretation matter.
Second, architecture should be modular. A single general-purpose MLLM watching sampled frames is unlikely to be the whole answer. A production system may need dedicated person tracking, camera-specific calibration, specialised object and PPE detectors, temporal event logic, rule engines, and MLLM-based explanation layers. The MLLM may be the narrator and reasoner, but it should not be the only witness.
Third, procurement should demand diagnostic metrics, not just aggregate accuracy. The difference between IMR and SHR is the difference between “the model cannot point” and “the model cannot interpret.” Those lead to different remediation plans, different vendor commitments, and different deployment risks.
Fourth, compliance products should be evaluated against the long tail. FoodMonitor’s rare hand-hygiene and personal-habit categories are precisely the kinds of events that can matter despite low frequency. A system that performs well only on common dress-code issues may still be useful, but only if it is sold honestly as a narrow assistant rather than a comprehensive compliance monitor.
Fifth, structured output matters because it turns AI output into workflow input. JSON detections, rule categories, timestamps, and bounding boxes can be reviewed, stored, escalated, disputed, and audited. Free-text scene descriptions are easier to demo. Structured evidence is easier to operate.
The business value, if it arrives, will not be “AI watches kitchens.” Cameras already do that. The value will be cheaper triage, more consistent evidence capture, faster incident review, and better prioritisation of human inspection. Less cinematic, more useful. The usual trade.
Where the paper’s evidence stops
FoodMonitor should not be overread.
The benchmark uses 477 standardised 60-second clips, sampled at 1.0 FPS for model input. Real deployments may involve different camera angles, lighting, occlusion, frame rates, kitchen layouts, local rules, and operational incentives. A model’s FoodMonitor score is therefore not a full deployment forecast.
The dataset is also heavily imbalanced. That reflects real compliance distributions, but it complicates interpretation across categories. The paper gives strong aggregate evidence and useful channel-level diagnostics, but it does not establish equal reliability for every specific rule item. Rare categories remain especially hard to judge from headline metrics.
The semantic matching step uses a text comparison model to decide whether predicted and ground-truth descriptions refer to the same issue. That is a reasonable evaluation design for paraphrased natural-language violations, but it introduces another model-mediated judgement layer. In production, organisations would still need governance around what counts as equivalent wording, sufficient evidence, and actionable non-compliance.
Finally, FoodMonitor is a benchmark, not a business process study. It does not measure auditor productivity, incident reduction, legal defensibility, staff acceptance, or cost of deployment. Those are downstream questions. The paper gives the technical reality check before the business case. One might call that basic hygiene.
The real lesson is not that models failed; it is that the benchmark got serious
FoodMonitor is useful because it refuses the lazy version of compliance AI.
The lazy version says: give a powerful video model a rulebook, ask it to watch the kitchen, and enjoy the future. The benchmarked version says: produce structured detections, map them to rules, distinguish person and environment violations, localise responsible individuals, explain the observed breach, and accept that errors need to be decomposed before they can be fixed.
Under that more realistic standard, current MLLMs are not ready to serve as independent compliance auditors. They are closer to imperfect evidence assistants: sometimes useful, often conservative, frequently confused about person-level attribution, and still prone to semantic overreach.
That is not a disappointing result. It is a clarifying one.
The next generation of compliance AI will not be won by prettier video summaries. It will be won by systems that can turn messy visual scenes into auditable claims, with enough structure for humans to trust, challenge, and improve them. FoodMonitor gives the field a benchmark for that harder problem.
Which is fortunate, because the world does not need more AI systems that can say “this looks concerning.” It needs systems that can explain what happened, where, to whom, under which rule, and how sure they are.
A compliance monitor, in other words. Not a very confident gossip camera.
Cognaptus: Automate the Present, Incubate the Future.
[1] Ruihao Xu, Xingming Shui, Jingxuan Niu, Yiqin Wang, Jilin Yu, Haoji Zhang, and Yansong Tang, “FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis,” arXiv:2605.24503, 2026. https://arxiv.org/abs/2605.24503