Cheap Signals, Expensive Insights: Rethinking AI Evaluation with Tensor Factorization

Budget is where evaluation systems usually lose their innocence.

A team wants to compare several models across hundreds or thousands of prompts. The obvious answer is human evaluation. The less obvious invoice arrives later: annotator time, reviewer fatigue, prompt coverage gaps, inconsistent judgments, and the slow realization that “we evaluated the model” often means “we averaged away the only differences that mattered.”

So the team turns to automated judges. LLM-as-a-Judge is cheaper, faster, and available at scale. It is also biased, template-sensitive, and occasionally very confident about things that a human reviewer would gently throw into the recycling bin.

This is the uncomfortable gap addressed by Felipe Maia Polo, Aida Nematzadeh, Virginia Aglietti, Adam Fisch, and Isabela Albuquerque in “Rich Insights from Cheap Signals: Efficient Evaluations via Tensor Factorization.”¹ The paper does not claim that cheap automated judges can simply replace human judgment. That would be convenient. It would also be false in exactly the boring way most convenient claims are false.

The better idea is subtler: use cheap autorater labels to learn the latent structure of models, prompts, and raters, then use a small amount of human labeling to align that structure to human judgment. In other words, do not crown the cheap judge as king. Make it work as unpaid statistical labor.

That distinction is the paper’s business value.

The real problem is not scoring models; it is locating their capability surface

Most AI evaluation still behaves as if a model has one performance level. This is administratively pleasant. It gives procurement teams a number, benchmark pages a ranking, and executives a slide.

But generative models do not fail uniformly. A model can be strong on concise technical prompts, weak on open-ended reasoning, reliable on short image-generation instructions, and fragile when asked to count objects or render text. The average score hides this landscape.

Fine-grained evaluation tries to recover that landscape at the prompt level or within narrow prompt groups. The paper’s examples include text-to-image prompts in Gecko, language-generation tasks in BigGen Bench, and pairwise preference data from LMArena. The objective is not merely “which model wins?” but:

where one model beats another;
which prompt categories are genuinely coherent;
where uncertainty is still too wide to make a confident decision;
whether a new model can be assessed using autorater data before buying a full human-labeling campaign.

The catch is that fine-grained evaluation needs many labels. If every model-prompt-rater combination must be judged by humans, the label count grows with the product of the three, and the evaluation budget expands faster than the patience of the finance department. Autoraters solve the scale problem, but not the alignment problem.

The paper’s mechanism is designed precisely for this mismatch: abundant weak signals plus scarce strong signals.

The tensor is the bookkeeping device that makes the trick possible

The paper models evaluation as a three-way interaction among models, prompts, and raters. That structure is naturally a tensor.

For a model $i$, prompt $j$, and rater $k$, the paper defines a latent capability value:

$$ \Psi_{i,j,k} $$

This is not directly the observed rating. It is the underlying capability of model $i$ on prompt $j$ as perceived through rater $k$.

For single-sided scoring, the effective advantage is simply the model’s capability on that prompt:

$$ \Delta_{i,j,k} = \Psi_{i,j,k} $$

For side-by-side evaluation, where the rater compares model $i_1$ against model $i_0$, the index $i$ stands for the pair $(i_0, i_1)$ and the outcome depends on the difference:

$$ \Delta_{i,j,k} = \Psi_{i_1,j,k} - \Psi_{i_0,j,k} $$

This lets the same framework handle both pointwise ratings and pairwise preference judgments.

The key assumption is that $\Psi$ is not arbitrary. Model performance, prompt demand, and rater preference are assumed to interact through a relatively small number of latent factors. Formally, the paper uses CP tensor factorization:

$$ \Psi_{i,j,k} = \sum_{r=1}^{R} \Theta_{i,r} A_{j,r} \Gamma_{k,r} $$

Here, $\Theta$ represents model-side latent factors, $A$ represents prompt-side latent factors, and $\Gamma$ represents rater-side latent factors. The factors are not treated as clean psychological categories. The paper is careful here: the practical goal is accurate estimation of the capability tensor, not storytelling about what latent dimension number seven “really means.”

That is wise. Latent dimensions are useful servants and terrible dinner guests.

The low-rank structure matters because it links observations. If an autorater sees many models on many prompts, it helps learn model and prompt representations. If a small human-labeled calibration set is then added, the system does not need to learn everything from scratch. It only needs to learn how the human rater maps onto the already-learned latent space.
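To make the factorization concrete, here is a minimal NumPy sketch of the CP form and the two kinds of advantage defined earlier. The sizes, the random factor values, and the variable names (`theta`, `alpha`, `gamma`, `psi`) are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 models, 6 prompts, 3 raters, rank R = 2.
n_models, n_prompts, n_raters, rank = 4, 6, 3, 2

theta = rng.normal(size=(n_models, rank))   # model-side factors (Theta)
alpha = rng.normal(size=(n_prompts, rank))  # prompt-side factors (A)
gamma = rng.normal(size=(n_raters, rank))   # rater-side factors (Gamma)

# Psi[i, j, k] = sum_r Theta[i, r] * A[j, r] * Gamma[k, r]
psi = np.einsum("ir,jr,kr->ijk", theta, alpha, gamma)

# Single-sided advantage: the capability of model 1 on prompt 2 under rater 0.
delta_single = psi[1, 2, 0]

# Side-by-side advantage of model 1 over model 0 on the same prompt and rater.
delta_pair = psi[1, 2, 0] - psi[0, 2, 0]

print(psi.shape)  # (4, 6, 3)
```

In this toy setup the full tensor has 72 entries, but the three factor matrices hold only 26 numbers. That compression is what lets observations from one rater inform estimates for another.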

The two-stage fitting process is the economic engine

The paper’s method has a clean sequence.

First, it learns from autorater labels. These are plentiful but imperfect. In this stage, the model estimates representations for models, prompts, and autoraters by minimizing the negative log-likelihood of autorater observations. The observations are ordinal labels, so the paper uses an ordered-logit-style likelihood with rater-specific cutoffs.
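For readers who have not met an ordered logit before, the sketch below shows one standard parameterization, in which $P(Y \le c) = \sigma(\tau_c - \Delta)$ and each rater has its own increasing cutoffs $\tau$. The paper's exact parameterization may differ; the function names, cutoff values, and observations here are assumptions chosen for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def ordinal_probs(delta: float, cutoffs: list[float]) -> list[float]:
    """Probability of each ordered label 0..len(cutoffs) under an ordered logit.

    Assumes cutoffs are strictly increasing, with P(Y <= c) = sigmoid(cutoffs[c] - delta).
    """
    cdf = [sigmoid(c - delta) for c in cutoffs] + [1.0]
    probs = []
    prev = 0.0
    for value in cdf:
        probs.append(value - prev)
        prev = value
    return probs

def neg_log_likelihood(observations, cutoffs_by_rater) -> float:
    """Sum of -log P(label) over (delta, rater, label) triples."""
    total = 0.0
    for delta, rater, label in observations:
        total -= math.log(ordinal_probs(delta, cutoffs_by_rater[rater])[label])
    return total

# Hypothetical raters with three labels (0, 1, 2): one strict, one lenient.
cutoffs_by_rater = {"strict": [0.5, 2.0], "lenient": [-1.5, 0.0]}
obs = [(0.3, "strict", 0), (0.3, "lenient", 2), (-0.8, "lenient", 1)]
print(round(neg_log_likelihood(obs, cutoffs_by_rater), 3))
```

The same latent advantage of 0.3 is most likely to earn the lowest label from the strict rater and the highest from the lenient one, which is the job the rater-specific cutoffs are there to do.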

Second, it freezes most of those learned representations and fits the human rater parameters using the limited human-labeled set. This is the calibration stage. The paper explicitly frames this as analogous to transfer learning: autorater data pretrains useful representations, and scarce human data aligns them to the target judgment distribution.
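The calibration stage is easier to picture once you notice that, with the model and prompt factors frozen, the human-perceived advantage is linear in the human rater's embedding. The sketch below is a deliberately simplified version: it uses binary side-by-side outcomes with no ties and no cutoffs, plain gradient descent, and simulated data. The frozen factors, `true_gamma`, and the step size are all made-up assumptions, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prompts, rank = 200, 2

# Stage 1 output, treated as frozen here. These are made-up stand-ins,
# not factors learned from real autorater data.
theta = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.5, -1.0]])
alpha = rng.normal(size=(n_prompts, rank))
n_models = theta.shape[0]

# Simulate a small human-labeled set of side-by-side comparisons.
true_gamma = np.array([1.5, -0.5])  # hypothetical human rater embedding
n_labels = 300
i1 = rng.integers(0, n_models, n_labels)
i0 = (i1 + rng.integers(1, n_models, n_labels)) % n_models  # never equals i1
j = rng.integers(0, n_prompts, n_labels)

# Delta = sum_r (Theta[i1, r] - Theta[i0, r]) * A[j, r] * Gamma[h, r],
# which is linear in the human rater embedding Gamma[h].
features = (theta[i1] - theta[i0]) * alpha[j]
p_win = 1.0 / (1.0 + np.exp(-features @ true_gamma))
wins = (rng.random(n_labels) < p_win).astype(float)

# Stage 2: update only the human rater embedding; theta and alpha stay fixed.
gamma_h = np.zeros(rank)
for _ in range(2000):
    pred = 1.0 / (1.0 + np.exp(-features @ gamma_h))
    gamma_h -= 0.5 * features.T @ (pred - wins) / n_labels

print(gamma_h)  # expected to land near true_gamma, up to sampling noise
```

Only two numbers are fitted from the 300 human labels. Everything else was paid for with cheap labels, which is the whole economic argument in miniature.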

There is also an optional fine-tuning stage, where all parameters can be updated using human labels. The paper reports that this often improves point prediction when enough human labels exist per prompt, especially in settings like Gecko where multiple annotations per prompt are available. But the trade-off is important: after fine-tuning, the standard confidence intervals derived for the simpler two-stage estimator no longer directly apply.

That is not a minor footnote. It is the difference between “we predict better” and “we can attach interpretable uncertainty to the prediction.” In business use, those are different products.

Stage	What it learns	Main benefit	Practical risk
Autorater pretraining	Model, prompt, and autorater latent representations	Uses cheap labels at scale	Learns structure only if autoraters contain useful signal
Human calibration	Human rater embedding and cutoffs	Aligns cheap structure to human judgment	Too few human labels can make calibration unstable
Optional fine-tuning	All parameters adjusted on human labels	Can improve point prediction	Standard confidence intervals no longer directly apply

The mechanism is therefore not “replace humans.” It is “use humans where they are most valuable: calibration.”

That should sound familiar to anyone who has paid for expert review. Experts are expensive because they are scarce. The economic question is not how to avoid them; it is how to stop wasting them on repetitive labeling that a weaker system can partially structure in advance.

Confidence intervals are not decoration; they are the decision layer

A useful evaluation system should not merely say Model A is above Model B. It should say whether that difference is large enough to trust.

The paper derives confidence intervals for prompt-level and category-level capability estimates after human calibration. More importantly, it uses simultaneous confidence intervals when comparing across many models or leaderboards. This matters because prompt-level evaluation invites multiple comparisons. If you test enough prompts, some differences will look significant by accident. Statistics, like sales dashboards, becomes increasingly creative when nobody corrects for multiplicity.
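A small sketch shows why the correction matters. It uses a Bonferroni adjustment, which is a conservative stand-in chosen here for simplicity; this article does not describe how the paper constructs its simultaneous intervals, so treat the method, the estimates, and the standard errors below as assumptions.

```python
from statistics import NormalDist

def bonferroni_intervals(estimates, std_errors, level=0.95):
    """Simultaneous normal intervals via a Bonferroni correction (an assumed stand-in)."""
    m = len(estimates)
    alpha = 1.0 - level
    z = NormalDist().inv_cdf(1.0 - alpha / (2 * m))
    return [(e - z * s, e + z * s) for e, s in zip(estimates, std_errors)]

# Hypothetical prompt-level differences between two models, with standard errors.
diffs = [0.40, 0.05, -0.30, 0.22]
ses = [0.10, 0.10, 0.13, 0.10]

z_single = NormalDist().inv_cdf(0.975)
for d, s, (lo, hi) in zip(diffs, ses, bonferroni_intervals(diffs, ses)):
    naive = abs(d) > z_single * s   # per-prompt 95% interval excludes zero
    corrected = lo > 0 or hi < 0    # simultaneous interval excludes zero
    print(f"{d:+.2f}  naive={naive}  simultaneous={corrected}")
```

With these made-up numbers, three of the four differences look significant one at a time, but only the largest survives once the intervals are required to hold jointly.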

For category-specific evaluation, the paper introduces a “reference composite.” Instead of naïvely averaging prompts in a category, it uses the leading direction in the prompt embedding space to summarize the dominant shared skill. The cohesion of a prompt group is measured by how much of the group’s variation is captured by that leading direction.

This is a useful move. It asks whether a category is actually a category, rather than a convenient folder name invented by a benchmark designer.

The appendix strengthens this point with permutation tests. For cohesive Gecko and BigGen Bench groups, p-values are generally low; for non-cohesive groups, p-values are much higher. Gecko’s broad “landmarks” group, for example, is treated as less cohesive than narrower prompt groups. In BigGen Bench, all groups considered in the table have the same size, yet some are still non-cohesive. Size alone is not the explanation.

For business evaluation, that means taxonomy quality should be tested, not assumed. A dashboard category called “reasoning” may be coherent. It may also be a bucket where unrelated tasks have been politely forced to socialize.
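The cohesion idea and the permutation check described above can be sketched with a singular value decomposition. This is an illustration under assumptions: the score is computed on uncentered embeddings, the null distribution is built from random prompt groups of the same size, and the embeddings are simulated. The paper's exact definitions may differ.

```python
import numpy as np

def cohesion(embeddings: np.ndarray) -> float:
    """Share of squared variation captured by the leading singular direction."""
    s = np.linalg.svd(embeddings, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

def permutation_p_value(all_embeddings, group_idx, n_perm=1000, seed=0):
    """Fraction of random same-size groups at least as cohesive as the given group."""
    rng = np.random.default_rng(seed)
    observed = cohesion(all_embeddings[group_idx])
    size = len(group_idx)
    hits = 0
    for _ in range(n_perm):
        sample = rng.choice(len(all_embeddings), size=size, replace=False)
        if cohesion(all_embeddings[sample]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

rng = np.random.default_rng(42)
rank = 4
# Hypothetical prompt embeddings: 300 generic prompts plus 20 sharing one direction.
generic = rng.normal(size=(300, rank))
direction = np.array([1.0, 0.5, 0.0, 0.0])
tight = np.outer(rng.normal(size=20), direction) + 0.1 * rng.normal(size=(20, rank))
prompts = np.vstack([generic, tight])

tight_idx = np.arange(300, 320)  # a group built to be cohesive
loose_idx = np.arange(0, 20)     # an arbitrary folder of unrelated prompts
print(cohesion(prompts[tight_idx]), permutation_p_value(prompts, tight_idx))
print(cohesion(prompts[loose_idx]), permutation_p_value(prompts, loose_idx))
```

Both groups have 20 prompts, so any gap in the score comes from structure rather than size, which mirrors the point about equally sized BigGen Bench groups.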

The experiments test different claims, not one giant victory lap

The paper’s empirical section is best read as a sequence of tests with different purposes. Collapsing them into “the method works better” loses the point.

Evidence	Likely purpose	What it supports	What it does not prove
Predictive loss against Constant, Prompt-specific/IRT-style, and P2L baselines	Main predictive evidence	Autorater-assisted tensor structure improves human-label prediction under limited labels	Universal superiority across all future domains
Category-specific rankings with 10% human annotations	Practical demonstration	Fine-grained leaderboards can be recovered with sparse human calibration	That every benchmark category is meaningful
Prompt-level model comparisons	Diagnostic use case	The method identifies where model differences concentrate	That all prompt-level conclusions are equally stable
Held-out model prediction	Cold-start extension	Autorater data can estimate average score or win-rate difference without human labels for that model	That no human labels are ever needed in the system
Cohesion tests and full-data appendix plots	Robustness and interpretability checks	Category structure and interval behavior can be inspected	A fully automated benchmark design procedure
Rank and autorater-fraction sensitivity tests	Sensitivity/ablation	Performance depends on rank choice and autorater diversity	A closed-form deployment recipe
Autorater prompt/persona appendix	Implementation detail	Diversity of automated raters is engineered through templates and personas	That any random set of judge prompts will suffice

This distinction matters because the paper’s most useful result is not a leaderboard. It is a workflow for making leaderboards more diagnostic.

What the main results actually say

The paper evaluates the method across three datasets.

Gecko(S) is a text-to-image alignment benchmark with roughly 1,000 prompts and about 18,000 pairwise human annotations across four image-generation models. BigGen Bench contributes 695 English-language instances across 77 tasks and nine capabilities, producing 2,780 human-annotated data points across four models. LMArena contributes nearly 5,000 filtered human preference matches among ten selected state-of-the-art language models.

Autorater collection differs across settings. For BigGen Bench, the authors aggregate existing automated ratings, resulting in 15 autoraters. For Gecko and LMArena, they build custom autoraters using Gemini 2.5 Flash-Lite, varying single-sided and side-by-side templates as well as personas and criteria. Gecko uses 8 autoraters; LMArena uses 24. They also sample multiple ratings per input and use all replicas during fitting.

The first major result is predictive. In test cross-entropy loss, the proposed method and its fine-tuned variant outperform the Constant baseline, the Prompt-specific baseline, and the P2L baseline where applicable. The Constant baseline is essentially what happens when prompt heterogeneity and auxiliary autorater data are ignored; in pairwise settings, it reduces to a Bradley–Terry-style model. The Prompt-specific baseline allows prompt variation but does not use auxiliary autorater data. P2L, used for LMArena, brings in prompt embeddings from a separately trained model.
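For reference, the Bradley–Terry form that the Constant baseline reduces to in pairwise settings gives each model a single strength shared across all prompts. A minimal sketch, with hypothetical model names and strength values:

```python
import math

def bt_win_prob(strength_a: float, strength_b: float) -> float:
    """Bradley-Terry probability that model A is preferred over model B."""
    return 1.0 / (1.0 + math.exp(-(strength_a - strength_b)))

# Hypothetical per-model strengths: one scalar each, the same for every prompt.
strengths = {"model_a": 0.8, "model_b": 0.2}
p = bt_win_prob(strengths["model_a"], strengths["model_b"])
print(round(p, 3))
```

Because nothing in this expression depends on the prompt, the predicted win probability is identical for a counting prompt and a text-rendering prompt. That is the heterogeneity the other methods are allowed to model.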

The interpretation is not complicated: prompt specificity helps, but prompt specificity plus auxiliary autorater structure helps more, especially when human annotations are scarce. Gecko is the partial exception: there, the Prompt-specific baseline and the fine-tuned model do well once enough human labels exist, because Gecko has multiple annotations per prompt. That is a useful boundary, not an embarrassment.

The second major result is category-specific evaluation with only 10% of human annotations. In Gecko, this corresponds to fewer than two human labels per prompt on average; in BigGen Bench, fewer than half a human label per prompt on average. Despite that sparse calibration, the method recovers category-level rankings with simultaneous 95% confidence intervals.
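The per-prompt figures follow from the dataset sizes quoted earlier. As a back-of-the-envelope check, using the rounded counts of about 18,000 annotations over roughly 1,000 prompts for Gecko and 2,780 annotations over 695 instances for BigGen Bench:

$$ \frac{0.10 \times 18{,}000}{1{,}000} \approx 1.8 < 2, \qquad \frac{0.10 \times 2{,}780}{695} = 0.4 < 0.5 $$

The Gecko value is approximate because the underlying counts are rounded.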

One Gecko example is especially clear: Imagen ties SDXL on a compositional-language category but performs significantly worse on an additive-counting category. That is exactly the kind of difference an aggregate leaderboard buries with a tasteful little shovel.

The third result is prompt-level comparison. With only 10% of human annotations, the method compares Imagen and Muse across Gecko prompts and LLaMa-2-13b against GPT-3.5-Turbo across BigGen Bench prompts. The pattern is interpretable: Imagen’s advantages over Muse are often connected to text rendering, while Muse shows advantages in object counting. In BigGen Bench, GPT-3.5-Turbo shows a significant advantage on reasoning-related prompts, while LLaMa-2-13b has limited areas of advantage and matches GPT-3.5-Turbo on many instruction-following and safety prompts.

This is where the paper becomes operationally interesting. It does not merely say one model is better. It says where substitution might be safe, where specialization matters, and where a model router could use a cheaper or faster model without blindly degrading quality.

The LMArena comparison sharpens that point. Using the full set of human annotations, the authors compare LLaMa-3.3-70b-Instruct with Gemini-2.5-Pro and estimate that LLaMa beats Gemini on about 8% of prompts and ties it on about 24%. The paper interprets this as meaning Gemini could be substituted by LLaMa in roughly 32% of cases without loss. That statement should be read carefully: “without loss” here depends on the confidence interval logic and the specific prompt distribution, not on a universal law of model interchangeability.
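One plausible reading of that interval logic, sketched below, labels each prompt by where the interval on the cheaper-minus-stronger difference sits relative to zero. The rule, the function name, and the intervals are assumptions for illustration, not the paper's published procedure.

```python
def classify(lower: float, upper: float) -> str:
    """Label a prompt from the interval on (cheaper model minus stronger model)."""
    if lower > 0:
        return "win"
    if upper < 0:
        return "loss"
    return "tie"

# Hypothetical simultaneous intervals for five prompts.
intervals = [(0.1, 0.6), (-0.2, 0.3), (-0.9, -0.4), (-0.1, 0.2), (-0.7, -0.1)]
labels = [classify(lo, hi) for lo, hi in intervals]

share = {name: labels.count(name) / len(labels) for name in ("win", "tie", "loss")}
substitutable = share["win"] + share["tie"]
print(share, substitutable)
```

Under this rule a tie means the interval contains zero, so a wide interval from thin data also counts as a tie. A router built on it would want a width threshold as well.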

The fourth result is held-out model prediction. The paper withholds all human labels for a model and tries to predict its average score or win-rate difference using the learned structure and autorater scores. Across Gecko, BigGen Bench, and LMArena, predicted values track ground truth closely; in Gecko and LMArena, the sign of the performance difference is mostly preserved. This supports the cold-start use case: a new model can be assessed using autorater labels before human labels are purchased for that model.
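The cold-start logic can be sketched as a two-step projection: estimate the new model's factors from autorater signal alone, then read the result through the frozen human rater factors. To keep the example short it swaps the ordinal likelihood for ordinary least squares on continuous scores, which is a simplification of my own; every factor value and noise level below is simulated.

```python
import numpy as np

rng = np.random.default_rng(7)
n_prompts, n_auto, rank = 50, 5, 2

alpha = rng.normal(size=(n_prompts, rank))    # frozen prompt factors
gamma_auto = rng.normal(size=(n_auto, rank))  # frozen autorater factors
gamma_human = np.array([1.0, 0.4])            # frozen human rater factors

theta_new_true = np.array([0.7, -1.2])        # unknown in practice

# Design matrix: one row per (prompt, autorater) pair, entries A[j, r] * Gamma[k, r].
design = (alpha[:, None, :] * gamma_auto[None, :, :]).reshape(-1, rank)
auto_scores = design @ theta_new_true + 0.3 * rng.normal(size=design.shape[0])

# Estimate the new model's factors from autorater signal only.
theta_new_hat, *_ = np.linalg.lstsq(design, auto_scores, rcond=None)

# Predicted human-perceived capability on every prompt, and its average.
psi_human_pred = (alpha * gamma_human) @ theta_new_hat
psi_human_true = (alpha * gamma_human) @ theta_new_true
print(psi_human_pred.mean(), psi_human_true.mean())
```

No human label for the new model appears anywhere in the fit. The human information enters only through `gamma_human`, which was calibrated on other models.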

Finally, the paper includes exploratory diagnostics. One example shows that SD1.5 performs relatively better on shorter Gecko prompts, using prompt length as a rough proxy for complexity. Another appendix analysis tries to explain the LLaMa-versus-Gemini performance gap using LMArena prompt tags. The specificity and code tags favor LLaMa in that regression, while the domain-knowledge tag carries a negative coefficient. But the regression’s $R^2$ is only 0.011, meaning the tags explain about 1% of the variance in the gap, so this is best read as a light interpretability probe, not a grand theory of model behavior. The paper does not oversell it. We should not do the authors the disservice of overselling it for them.

The business interpretation: cheaper diagnosis, not cheaper truth

For an AI product team, the direct lesson is not “use LLM judges and save money.” That is the shallow reading.

The better workflow is:

Generate broad autorater coverage across models, prompts, and evaluation templates.
Use this cheap coverage to learn latent model and prompt representations.
Purchase a smaller human calibration set.
Estimate human-aligned prompt-level and category-level performance.
Use confidence intervals to decide where differences are actionable.
Repeat data collection where uncertainty remains too wide.

This changes the role of evaluation from score production to decision support.

Business task	How this method helps	What still needs human control
Model selection	Compares models by prompt category, not only average score	Define deployment-relevant categories and risk tolerance
Model routing	Identifies prompts where a cheaper model may tie or beat a stronger model	Validate router decisions on real traffic
Regression testing	Detects capability drops in specific prompt families	Maintain stable test distributions
Vendor evaluation	Estimates model performance under sparse human labels	Ensure autoraters are not biased toward one vendor’s style
Benchmark design	Tests whether categories are cohesive	Redesign broad or incoherent categories
Annotation budgeting	Uses human labels as calibration rather than brute-force scoring	Decide where uncertainty is costly enough to justify more labels

The strongest business case is model routing. If a company pays for premium model calls across all prompts, but a cheaper model performs equally well on a subset, the savings are obvious. The hard part is not imagining the savings. Anyone with a spreadsheet can do that. The hard part is knowing which subset is safe. This paper offers a statistical route toward that answer.

The second strong case is evaluation governance. Many organizations now use automated judges informally. A prompt template here, a judge model there, a spreadsheet of scores, a few vibes pretending to be methodology. The paper’s framework gives this practice a more disciplined shape: autoraters become auxiliary measurement instruments, human labels become calibration data, and uncertainty becomes visible.

That is an improvement over pretending the judge prompt is objective because it contains the word “rubric.”

Where the method should not be overused

The paper’s limitations are not cosmetic. They define the safe operating zone.

First, the framework assumes a low-rank capability tensor and an ordinal logit observation model with rater-specific cutoffs. If the real evaluation landscape is too irregular, the latent structure may compress away important differences.

Second, autorater data help only when they retain some correlation with human preferences and provide enough diversity. If all autoraters share the same blind spot, the tensor will learn that blind spot beautifully. Statistical elegance does not disinfect biased measurement.

Third, side-by-side human templates identify relative capabilities, not absolute cross-prompt scores. The paper explicitly notes that values are not freely comparable across prompts in pairwise settings. A prompt-level win-rate difference is not the same thing as a universal capability coordinate.

Fourth, the confidence intervals rely on approximations and treat first-stage autorater-learned parameters as effectively fixed. The paper argues this is reasonable when autorater labels are abundant, but the resulting intervals can be too narrow if autorater data are limited. Optional human fine-tuning further complicates standard uncertainty guarantees.

Fifth, held-out model prediction still requires autorater observations for the held-out model. The method can reduce human annotation for a new model; it does not evaluate a model in a vacuum. There is no free lunch, only a discounted lunch with assumptions printed on the receipt.

These boundaries do not weaken the paper. They make it deployable. A method with clear failure modes is more useful than a benchmark score with none admitted.

The important shift: human labels become leverage, not volume

The paper’s most useful contribution is not tensor factorization by itself. It is the evaluation design philosophy behind it.

Traditional evaluation asks: how many human labels do we need to score everything?

This paper asks: how can cheap signals learn enough structure so that scarce human labels only need to align the structure?

That shift is practical. It matches how modern AI systems are actually deployed: many models, many prompt types, frequent model updates, limited evaluation budgets, and growing pressure to justify why one model is used instead of another.

The result is not a magical replacement for human evaluation. It is a way to make human evaluation less wasteful. Humans remain the standard. Autoraters become scaffolding. Tensor factorization supplies the geometry. Confidence intervals decide whether a difference deserves action.

That is a more mature evaluation stack than “let us ask a judge model and hope its tone sounds rigorous.”

For Cognaptus readers, the takeaway is straightforward: the next generation of AI evaluation will not be built on bigger leaderboards alone. It will be built on measurement systems that know where models differ, where categories are coherent, where uncertainty remains, and where cheap signals can safely reduce expensive human work.

Average scores will still exist. They are too convenient to die. But if a business is deciding which model to deploy, route, monitor, or retire, convenience is not enough.

The expensive insight is knowing when the cheap signal can be trusted.

Cognaptus: Automate the Present, Incubate the Future.

¹ Felipe Maia Polo, Aida Nematzadeh, Virginia Aglietti, Adam Fisch, and Isabela Albuquerque, “Rich Insights from Cheap Signals: Efficient Evaluations via Tensor Factorization,” arXiv:2603.02029, 2026.
