Tensor Factorization

Budget is where evaluation systems usually lose their innocence. A team wants to compare several models across hundreds or thousands of prompts. The obvious answer is human evaluation. The less obvious invoice arrives later: annotator time, reviewer fatigue, prompt coverage gaps, inconsistent judgments, and the slow realization that “we evaluated the model” often means “we averaged away the only differences that mattered.” ...