Mind the Representation Gap: Why Enterprise AI Fails Before It Thinks

Enterprise AI has developed a charming habit: whenever a system fails, someone suggests using a larger model.

The chatbot misread a customer complaint? Bigger model. The autonomous system struggled with a new sensor configuration? Bigger model. The video classifier understood the objects but missed the actual message? Bigger model, possibly with a more expensive logo.

Sometimes scale helps. Often, the real failure happened earlier. The model did not fail because it lacked intelligence in the abstract. It failed because the world was handed to it in the wrong shape.

That is the shared lesson across three very different recent papers: UniTrans, a universal any-to-any translator for heterogeneous collaborative perception; SomaliWeb v1, a quality-filtered Somali corpus with a matched tokenizer and language-identification benchmark; and VidMsg, a benchmark for implicit message inference in short videos.¹²³ One paper lives in connected vehicles, one in low-resource language infrastructure, and one in short-form video understanding. On the surface, this looks like an awkward dinner party. Underneath, they are all about the same enterprise problem: representation.

Before AI can reason, retrieve, recommend, fuse, classify, or act, something must decide what the input means as a usable signal. That “something” might be a feature translator, a corpus-cleaning pipeline, a tokenizer, a language identifier, an embedding space, or a benchmark. It is not decorative plumbing. It is the contract between messy reality and machine reasoning.

And contracts, as everyone in business eventually learns, are where the pain hides.

The common problem: reality does not arrive model-ready

The three papers are connected by a simple but under-appreciated problem: deployed AI systems rarely receive clean, compatible, task-ready input.

In collaborative driving, vehicles may share intermediate bird’s-eye-view features, but those features come from different sensors, encoders, architectures, and manufacturers. The receiving vehicle cannot simply assume that another agent’s feature map speaks the same internal language.

In Somali NLP, web text exists, but “exists” is doing a lot of unpaid labor. The text may contain duplicates, near-duplicates, mojibake, language-identification errors, weak documentation, dialect gaps, and tokenizer inefficiency. Treating a multilingual corpus partition as “good enough” is not data strategy. It is faith-based preprocessing.

In short-video understanding, the visible content may be easy to describe while the intended message remains implicit. A clip can show food, exercise, travel, a workday, or a street scene; the actual communicative point may be self-care, discipline, environmental responsibility, independence, distrust of consumerism, or some other message not literally shown on screen.

The common pattern is this:

AI systems do not only need better reasoning. They need better intermediate representations of what the world is asking them to reason over.

That is the spine of the article. Not three paper summaries. One logic chain.

Three papers, three layers of the representation stack

Paper	Representation problem	Intervention	What it shows	What it does not prove
UniTrans	Heterogeneous vehicles expose incompatible intermediate perception features	A universal any-to-any feature translator using modality-intrinsic codes and mapping-conditioned parameter instantiation	Feature compatibility can be learned as a translation problem, reducing the need for retraining when new agents appear	It does not prove unrestricted real-road safety or universal robustness outside the tested collaborative perception settings
SomaliWeb v1	Low-resource language data is under-documented, noisy, duplicated, and inefficiently tokenized	A six-stage reproducible corpus pipeline, matched BPE-16K tokenizer, and Somali LID benchmark	Corpus quality and tokenization are measurable infrastructure, not clerical cleanup	It explicitly does not claim downstream language-model gains yet; tokenizer fertility is a proxy, not a full model-quality result
VidMsg	Video systems can describe surface content but miss the implicit message	A message-first benchmark for retrieval and multiple-choice QA over short videos	Evaluation must target communicative meaning, not only objects, captions, actions, or topics	Its labels are consensus-based and culturally situated, not absolute truth about every possible interpretation

Seen together, the papers form a three-step chain:

Translate incompatible internal signals.
Curate and compress raw symbolic data into usable language infrastructure.
Evaluate models at the level of meaning the business actually needs.

This is representation infrastructure. Not glamorous. Not as tweetable as “agentic superintelligence.” More likely to prevent embarrassing deployment failures. Naturally, it receives less applause.

Step 1: translation — when systems do not speak the same feature language

UniTrans begins from a practical deployment problem in collaborative perception. Connected vehicles can share intermediate features to improve perception beyond each agent’s local sensor range. That is attractive because intermediate fusion balances bandwidth and perception performance. The catch is that real-world agents are heterogeneous. Different vehicles may use different sensors, LiDAR encoders, camera pipelines, voxel settings, model capacities, and feature spaces.

Existing approaches often handle this by training pairwise adapters or using a shared protocol space. Both help, but both run into deployment friction. New agents require new adapters, protocol adjustment, retraining, or fine-tuning. In a multi-manufacturer environment, that is not merely inconvenient. It collides with data access, model privacy, integration cost, and update cycles.

UniTrans reframes the issue as any-to-any feature translation. Instead of training a separate adapter for every source-target feature mapping, it learns a modality-intrinsic latent space and uses it to instantiate a mapping-specific translator on the fly. The method combines three main pieces: a Modality-Intrinsic Encoder, a Modality Mapping Router, and a Translator Parameter Bank. The important business idea is not the acronym collection, though academia does enjoy feeding the acronym farm. The important idea is that compatibility becomes a learned, reusable layer.

In the paper’s experiments, UniTrans is tested on OPV2V-H and DAIR-V2X, with 30 modality categories and held-out emerging modalities. It reports stronger average performance than one-to-one adaptation, protocol-space, and classic MoE baselines under its zero-shot any-to-any evaluation setup. It also highlights an efficiency point: instead of executing multiple experts and mixing outputs at inference time, UniTrans linearly combines expert parameters to synthesize one translator for the inferred mapping.
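The efficiency point is easier to see in a toy. The sketch below is not the UniTrans architecture: it assumes a hypothetical bank of three tiny linear experts and made-up router weights, and it only contrasts the two cost structures. Mixing parameters first means one translator runs per input; mixing outputs means every expert runs. For purely linear experts the two give the same answer, which is why the toy can check one against the other. With nonlinear translators they would generally differ.

```python
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Apply a linear map w to a feature vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def mix_parameters(bank: List[Matrix], weights: List[float]) -> Matrix:
    """Linearly combine expert weight matrices into ONE translator matrix."""
    rows, cols = len(bank[0]), len(bank[0][0])
    return [
        [sum(a * w[i][j] for a, w in zip(weights, bank)) for j in range(cols)]
        for i in range(rows)
    ]


def mix_outputs(bank: List[Matrix], weights: List[float], x: List[float]) -> List[float]:
    """MoE-style alternative: run EVERY expert, then mix their outputs."""
    outs = [matvec(w, x) for w in bank]
    return [sum(a * o[i] for a, o in zip(weights, outs)) for i in range(len(outs[0]))]


# Hypothetical bank of three 2x2 linear experts (illustrative values only).
bank = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
weights = [0.5, 0.25, 0.25]  # assumed router output for one source-target mapping
x = [1.0, 2.0]               # stand-in for a source feature

translator = mix_parameters(bank, weights)  # synthesized once per mapping
y_param = matvec(translator, x)             # one forward pass
y_moe = mix_outputs(bank, weights, x)       # three forward passes

print(y_param, y_moe)
```

The design choice the toy illustrates is where the mixing happens: once the translator for a mapping is synthesized, it can be reused for every feature that follows the same mapping.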

For enterprise AI, this generalizes beyond autonomous driving. Any multi-vendor or multi-agent AI architecture faces a similar problem. Internal representations may be technically valid but semantically incompatible. One vendor’s “risk score,” one OCR model’s confidence vector, one sensor model’s feature embedding, one regional language model’s tokenization, and one recommender’s user-interest vector may not be interchangeable just because they are all arrays with numbers in them.

A representation translator is useful when the business wants interoperability without repeatedly rebuilding the whole system.

But UniTrans also shows the boundary of this idea. Translation is not magic. It depends on learned regularities in a modality space, coverage of training modalities, and task-specific validation. The lesson is not “universal adapters solve integration.” The lesson is: when AI systems must collaborate, feature compatibility should be engineered and tested as a first-class component.

Step 2: curation and compression — when “available data” is not usable data

SomaliWeb v1 sits in a very different domain, but it attacks the same class of problem. Here the issue is not that vehicles cannot share features. The issue is that low-resource language data may be present in large multilingual corpora without being independently curated, documented, or optimized.

The paper builds a Somali-only pretraining corpus from HPLT v2, CC100, and Somali Wikipedia. It applies a six-stage pipeline (source aggregation, byte-exact deduplication, normalization and length filtering, language identification, MinHash near-duplicate removal, and character-n-gram quality filtering) and releases the result with a matched tokenizer. The final corpus contains 819,322 documents and about 303 million whitespace-approximated tokens.
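The early stages of such a pipeline are simple enough to sketch. The following is a minimal illustration, not the paper's code: the function names, the NFC normalization choice, and the 20-character minimum are assumptions made for the example. It covers only byte-exact deduplication, normalization, and length filtering, in that order.

```python
import hashlib
import unicodedata
from typing import Iterable, List


def normalize(text: str) -> str:
    """Unicode-normalize (NFC, an assumed choice) and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_corpus(docs: Iterable[str], min_chars: int = 20) -> List[str]:
    """Byte-exact dedup, then normalization, then a length filter."""
    seen = set()
    kept: List[str] = []
    for doc in docs:
        # Hash the raw UTF-8 bytes so only byte-identical documents collide.
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        text = normalize(doc)
        if len(text) < min_chars:  # hypothetical threshold
            continue
        kept.append(text)
    return kept


# Toy strings, not real corpus data.
docs = [
    "First toy document with enough characters.",
    "First toy document with enough characters.",
    "too short",
    "Second   toy document,\nwith messy   whitespace.",
]
cleaned = clean_corpus(docs)
print(cleaned)
```

Language identification, near-duplicate removal, and quality filtering would follow these steps. They are left out here because each depends on models or thresholds that the article does not specify.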

The paper’s most useful contribution for business readers is not just the artifact. It is the audit mentality.

The authors measure concrete quality defects in HPLT v2’s “cleaned” Somali partition: 17.27% of documents are byte-exact duplicates, 56.06% contain mojibake that ftfy can fix, and 10.67% of the byte-unique documents are near-duplicates. That is a polite way of saying that “cleaned” does not always mean “clean.” Sometimes it means someone swept the floor and left the dust under the rug, but with version control.
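An audit of this kind comes down to a few ratios. The sketch below shows how two of them could be computed on a toy corpus: the share of exact duplicates, and the share of near-duplicates among the remaining unique documents, estimated with a small pure-Python MinHash. The shingle size, the number of hash seeds, the 0.8 threshold, and the all-pairs comparison are assumptions for illustration, not the paper's settings. A production pipeline would normally use locality-sensitive hashing instead of comparing every pair.

```python
import hashlib
from typing import Dict, List, Set


def shingles(text: str, n: int = 5) -> Set[str]:
    """Character n-grams of a document (n is an assumed setting)."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash(text: str, num_perm: int = 64) -> List[int]:
    """One minimum hash value per seed; seeds stand in for permutations."""
    grams = shingles(text)
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode("utf-8")).digest()[:8], "big")
            for g in grams
        )
        for seed in range(num_perm)
    ]


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def audit(docs: List[str], threshold: float = 0.8) -> Dict[str, float]:
    # Exact string match, which for UTF-8 text is the same as byte-exact.
    unique = list(dict.fromkeys(docs))
    exact_dup_rate = 1 - len(unique) / len(docs)
    sigs = [minhash(d) for d in unique]
    kept: List[int] = []
    near_dups = 0
    for i, sig in enumerate(sigs):
        # Greedy: a document is a near-duplicate if it resembles one already kept.
        if any(estimated_jaccard(sig, sigs[j]) >= threshold for j in kept):
            near_dups += 1
        else:
            kept.append(i)
    return {"exact_dup_rate": exact_dup_rate, "near_dup_rate": near_dups / len(unique)}


# Toy corpus: one exact repeat, one near-repeat, one unrelated document.
base = "the quick brown fox jumps over the lazy dog near the river bank today"
report = audit([base, base, base + "!", "zzzzzzzzzzzzzzzzzzzzzzzz"])
print(report)
```

Note the denominators. The exact-duplicate rate is measured over all documents, while the near-duplicate rate is measured over byte-unique documents only, which is how the article states the two figures.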

SomaliWeb also trains a BPE-16K tokenizer matched to the cleaned corpus. On FLORES-200 Somali devtest, the tokenizer emits 40.2% fewer tokens than GPT-4’s cl100k_base. The paper is careful about the claim: this is a tokenizer-fertility result, not proof that downstream language-model perplexity or task performance will automatically improve. That limitation matters. It keeps the paper grounded.
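The arithmetic behind a figure like this is short. The sketch below defines fertility as tokens per whitespace word and computes the relative token reduction between two tokenizers on the same text. The counts are invented, chosen only so the result reproduces the 40.2% figure. They are not the paper's measurements.

```python
def fertility(num_tokens: int, num_words: int) -> float:
    """Average number of tokens emitted per whitespace-delimited word."""
    return num_tokens / num_words


def relative_reduction(candidate_tokens: int, baseline_tokens: int) -> float:
    """Share of tokens saved by the candidate tokenizer versus the baseline."""
    return 1 - candidate_tokens / baseline_tokens


# Hypothetical counts for the same evaluation text (illustrative only).
words = 400
baseline_tokens = 1000   # stand-in for a general-purpose tokenizer
candidate_tokens = 598   # stand-in for a language-matched tokenizer

saving = relative_reduction(candidate_tokens, baseline_tokens)
print(f"baseline fertility:  {fertility(baseline_tokens, words):.3f}")
print(f"candidate fertility: {fertility(candidate_tokens, words):.3f}")
print(f"token reduction:     {saving:.1%}")
```

Fewer tokens for the same text means more text fits in a fixed context window. As the paper itself notes, that is a proxy and not a measure of model quality.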

For business, this is the representation layer in its most familiar but most neglected form: data quality and tokenization.

A company building regional AI products cannot simply ask, “Do we have data?” It must ask:

Is the data actually in the target language?
Are duplicates inflating the training signal?
Are encoding errors corrupting text?
Is the tokenizer making the language artificially expensive?
Are dialects covered or silently excluded?
Are there PII risks that general-purpose scrubbers do not handle?
Is the pipeline reproducible enough to audit?

The difference between “data exists” and “data is a usable representation substrate” is the difference between a prototype demo and a maintainable product. SomaliWeb v1 does not claim to solve Somali NLP end-to-end. It does something more modest and more valuable: it shows what a documented language-data asset should look like before model training begins.

That matters because in enterprise AI, bad data infrastructure does not always create visible errors immediately. It creates quiet degradation: higher inference costs, worse regional performance, brittle retrieval, cultural blind spots, and false confidence in benchmark averages that hide weak markets.

A model does not understand a language better because the business inserted that language into a slide deck. Terrible, I know.

Step 3: semantic testing — when the visible content is not the business meaning

VidMsg moves the representation problem from features and tokens to communicative meaning. It asks whether video-language systems can infer the implicit message of short internet-native videos.

This is important because many real video workflows do not care only about what appears on screen. Search, recommendation, moderation, advertising analysis, campaign monitoring, brand safety, education, and social listening often care about what a video is trying to communicate.

A caption might say: “A person prepares a healthy meal after work.” The message might be: “Food is a form of self-care,” “Routine creates discipline,” or “Lifestyle medicine matters beyond symptoms.” These are not identical. If a recommendation system confuses them, the user experience changes. If a moderation or brand-safety system confuses them, the business risk changes. If an advertising analytics tool confuses them, the strategy deck becomes expensive fiction.

VidMsg contains 400 YouTube-derived clips across 9 topic areas and 52 fine-grained target messages. Its construction pipeline is message-first: an LLM generates indirect search scenarios from target messages; candidate videos are retrieved; human annotators retain clips that convey the intended message without being too explicit. The benchmark then evaluates both bidirectional message-clip retrieval and multiple-choice QA.

The paper’s results show that strong contemporary video-language and retrieval models still struggle. Standard retrieval models optimized for visible content or caption-like alignment perform relatively poorly. The proposed VidVec-Msg baseline, optimized with synthetic clip-storyline-message pairs, improves retrieval performance, but the benchmark remains far from saturated. In multiple-choice QA, even strong models show uneven performance across topics and often confuse semantically close messages.
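For readers who have not run a retrieval evaluation, here is a minimal sketch of what "bidirectional" means in practice. It assumes a similarity matrix between clips and messages and counts a query as a hit when any correct item appears in its top k results. The metric name, the toy scores, and the gold labels are all hypothetical. The article does not give the paper's exact metric definitions.

```python
from typing import List, Set


def hit_rate_at_k(sim: List[List[float]], gold: List[Set[int]], k: int) -> float:
    """Share of queries with at least one relevant candidate in the top k.

    sim[i][j] is the similarity between query i and candidate j;
    gold[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for row, relevant in zip(sim, gold):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if relevant & set(ranked[:k]):
            hits += 1
    return hits / len(sim)


def transpose(m: List[List[float]]) -> List[List[float]]:
    """Swap query and candidate roles to evaluate the opposite direction."""
    return [list(col) for col in zip(*m)]


# Toy scores: 3 clips (rows) against 2 candidate messages (columns).
clip_to_msg = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.3, 0.7],  # this clip's true message is 0, but the model prefers 1
]
clip_gold = [{0}, {1}, {0}]  # correct message for each clip
msg_gold = [{0, 2}, {1}]     # correct clips for each message

c2m = hit_rate_at_k(clip_to_msg, clip_gold, k=1)
m2c = hit_rate_at_k(transpose(clip_to_msg), msg_gold, k=1)
print(f"clip->message hit@1: {c2m:.2f}, message->clip hit@1: {m2c:.2f}")
```

The two directions can disagree, as they do here. A system can look acceptable when asked "what is this clip saying?" and weaker when asked "which clips say this?", so reporting both is the safer habit.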

This is the third link in the representation chain: evaluation must match the level of meaning the application needs.

If your business workflow needs “what message does this video communicate?”, then measuring object recognition, caption quality, or coarse topic accuracy is not enough. Those are proxy tasks. Sometimes useful. Often dangerously comforting.

VidMsg also carries an important warning. Implicit message understanding is subjective. It varies across viewers, cultures, contexts, and assumptions. The authors frame the labels as consensus-based rather than exhaustive ground truth. That distinction should survive any business translation of the paper. Message inference can help video search and analysis, but it can also become a tool for profiling, persuasion, or overconfident cultural interpretation. Lovely little knife, sharp on both sides.

The larger framework: representation contracts

The three papers suggest a practical framework for enterprise AI: every deployed system needs a representation contract.

A representation contract defines how raw reality becomes a model-usable signal, what assumptions are made during that transformation, and how the resulting representation is evaluated.

Contract question	Why it matters	Example from the papers
What is the signal boundary?	The system must know whether it is operating on raw data, intermediate features, tokens, embeddings, captions, or messages	UniTrans operates on intermediate BEV features; SomaliWeb operates on web text and tokenization; VidMsg operates on implicit video messages
What compatibility problem exists?	Inputs may be valid individually but incompatible collectively	Vehicle feature spaces differ; multilingual corpora hide Somali-specific defects; caption-like video embeddings miss message-level meaning
What transformation is applied?	The bridge between raw input and model reasoning must be explicit	Feature translation, corpus filtering, tokenizer training, message-first data construction
What evidence validates the representation?	Evaluation must test the representation layer, not only the final model output	AP under held-out modalities, pipeline retention and tokenizer fertility, retrieval and QA over implicit messages
What are the uncertainty boundaries?	Business deployment needs limits, not just headline performance	Safety-critical validation, no downstream LM evaluation yet, consensus-based message labels
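One way to keep these five questions from staying rhetorical is to make them a record that every pipeline has to fill in. The sketch below is a hypothetical structure, not something any of the three papers propose. The class name, field names, and the `unanswered` helper are invented for illustration. The example values paraphrase the SomaliWeb entries from the tables above, with the last field left blank on purpose.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class RepresentationContract:
    """One field per contract question; all names are illustrative."""
    signal_boundary: str
    compatibility_problem: str
    transformation: str
    validation_evidence: str
    uncertainty_boundaries: str

    def unanswered(self) -> List[str]:
        """Names of contract questions that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


contract = RepresentationContract(
    signal_boundary="web text and tokenization",
    compatibility_problem="multilingual corpora hide Somali-specific defects",
    transformation="corpus filtering and tokenizer training",
    validation_evidence="pipeline retention and tokenizer fertility",
    uncertainty_boundaries="",  # deliberately blank to show the check
)

missing = contract.unanswered()
print(missing)
```

A review gate could refuse to ship any pipeline whose contract still has unanswered fields. That is a crude check, but it turns a governance aspiration into something a build script can enforce.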

This framework is useful because it prevents one of the most common AI governance mistakes: evaluating the visible model while ignoring the invisible representation layer.

Many enterprise AI failures are not dramatic hallucinations. They are quieter:

A multilingual assistant performs well in English but becomes expensive and weak in under-curated languages.
A video analytics tool captures the topic but misses the implied stance.
A multi-agent operational system connects components whose internal signals are not meaningfully aligned.
A retrieval pipeline embeds documents but fails because the chunk, metadata, language filter, or tokenization strategy distorts what should be retrieved.
A model appears robust in benchmark conditions but collapses when deployed across new vendors, sensors, formats, or regional data sources.

These failures are representation failures wearing model-failure costumes.

What business leaders should take from this

The practical lesson is not that every company needs to build UniTrans, SomaliWeb, or VidMsg. Most do not. The lesson is that AI deployment should budget for representation engineering.

That means several concrete habits.

First, treat integration as semantic integration, not only API integration. Two systems can exchange data and still misunderstand each other. A JSON field is not a shared concept. A feature vector is not automatically portable. A confidence score is not a universal unit of trust.

Second, audit data assets before model selection. If a regional market matters, inspect language coverage, duplication, encoding quality, tokenization cost, dialect coverage, and privacy risks. The model choice comes later. Buying a stronger model before checking the corpus is like buying a better coffee machine for a kitchen with no clean water.

Third, evaluate at the business meaning layer. If the application needs implicit intent, sentiment, risk, persuasion, evidence quality, or communicative message, then evaluate that directly. Proxy metrics are acceptable only when everyone remembers they are proxies. This is rare, because proxies are convenient and humans enjoy convenience right up to the lawsuit.

Fourth, document uncertainty boundaries. UniTrans is impressive within collaborative perception benchmarks, but safety-critical deployment needs additional validation. SomaliWeb gives a strong data artifact, but downstream LM gains remain future work. VidMsg exposes message-level gaps, but implicit interpretation is culturally and contextually sensitive. Mature AI operations should preserve these limits instead of sanding them down into marketing copy.

The strategic point: representation is infrastructure

The deeper business implication is that representation infrastructure may become one of the major differentiators in enterprise AI.

Base models are increasingly accessible. APIs are accessible. Open models are accessible. Tooling is improving. What remains scarce is not the ability to call a model. It is the ability to prepare the right representation of the problem before the call, and to test whether that representation supports the intended action.

In operational systems, that means translation layers between heterogeneous agents.

In language products, it means curated corpora, efficient tokenizers, language identifiers, metadata discipline, and privacy-aware processing.

In media systems, it means benchmarks and embeddings that represent what content communicates, not only what it depicts.

In knowledge work, it means document structures, chunking strategies, retrieval schemas, source provenance, and evaluation tasks that align with decision needs.

The common lesson is almost annoyingly practical: better AI often starts before the model.

That is not a rejection of model progress. It is a reminder that model progress lands inside systems. Systems have interfaces, assumptions, data contracts, failure modes, and evaluation layers. A powerful model placed on top of a weak representation pipeline will still produce confident nonsense, just with better grammar and a nicer invoice.

Conclusion: build the bridge before blaming the brain

UniTrans, SomaliWeb v1, and VidMsg do not solve the same task. That is exactly why they work well together as a business lesson.

UniTrans shows that heterogeneous agents need translation before collaboration. SomaliWeb shows that language data needs reproducible curation and compression before modeling. VidMsg shows that video understanding needs evaluation at the level of implied meaning before deployment into search, recommendation, or analysis.

Together, they point to a broader principle:

Reliable AI depends on engineered representation layers that translate, clean, compress, and test reality before models are asked to reason over it.

The next time an AI system fails, the useful first question may not be, “Which larger model should we use?”

It may be, “What did we actually hand the model as the world?”

Less glamorous. More useful. Deeply unfair to keynote slides.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yang Li, Weize Li, Quan Yuan, Congzhang Shao, Guiyang Luo, Yunqi Ba, Xuanhan Zhu, Xinyuan Ding, Xiaoyuan Fu, and Jinglin Li, “One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception,” arXiv:2605.17907, 2026, https://arxiv.org/html/2605.17907.
² Khalid Yusuf Dahir, “SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark,” arXiv:2605.18232, 2026, https://arxiv.org/html/2605.18232.
³ Issar Tzachor, Michael Green, and Rami Ben-Ari, “VidMsg: A Benchmark for Implicit Message Inference in Short Videos,” arXiv:2606.03635, 2026, https://arxiv.org/html/2606.03635.
