Stop Scaling the Wrong Thing

TL;DR for operators: Most AI performance failures are not solved by scaling the most visible knob. Three recent papers make the same uncomfortable point from different angles. A controlled image-classification study finds that more data gives more stable generalization gains than simply increasing model complexity, while added visual priors help only when the architecture can use them.[1] A document parsing benchmark shows that frontier VLMs and specialized parsers still fail on expert documents with dense layouts, formulas, tables, music notation, rotation, and long-document reading order.[2] A LoRA optimization paper argues that adapter performance is often limited not by rank alone, but by a mis-scaled LoRA scaling factor, usually treated as a small implementation detail because apparently we needed another reminder that details run the building.[3] ...

June 29, 2026 · 14 min · Zelina

The Model Is Not the Medical System

TL;DR for operators: Health AI does not fail only because the model is weak. It fails because the model learned the wrong context, explained the wrong thing, protected the wrong boundary, retrieved the wrong evidence, or performed beautifully in the one language where the evaluation happened to be convenient. Two recent arXiv papers make that point from opposite ends of the same operational chain. One builds an explainable, privacy-aware framework for detecting career-related depression and anxiety among university students, using structured student data, facial-behavior features, multimodal fusion, label smoothing, federated learning, and attribution methods.[1] The other builds MMed-Bench-IR, a multilingual medical information retrieval benchmark designed to test cross-lingual medical alignment, concept discrimination, and evidence retrieval across six languages and three tasks.[2] ...

June 27, 2026 · 17 min · Zelina

The White Coat Is Not the Treatment

TL;DR for operators: Belmadani et al. study a question every serious enterprise LLM team eventually meets after the prototype stops looking magical: which adaptation bill is actually worth paying?[1] In French medical question answering, they compare continual pretraining (CPT), supervised fine-tuning (SFT), and CPT followed by SFT across Gemma, Mistral, and Llama-family models, with general, instruction-tuned, and medical initializations. ...

June 27, 2026 · 20 min · Zelina

Think Twice, Halt Once

TL;DR for operators: The current enterprise mistake is treating “reasoning” as a personality trait of a model. It is not. It is a process: decompose the task, inspect the evidence, decide what matters, test counterarguments, synthesize a position, and stop before the machine starts producing beautifully cited nonsense. Two recent papers expose that process from opposite ends. Hedge-Bench defines a realistic demand signal: open-ended financial reasoning tasks derived from hedge fund analyst work, graded against expert analytical moves and source-grounded claims.[1] It finds that frontier agents remain weak on this kind of work, with the best model achieving only a limited perfect-score rate and with stronger exploration often bringing more hallucination along for the ride. Delightful. The junior analyst has read the filings, opened the spreadsheet, and still occasionally invents the economy. ...

June 26, 2026 · 18 min · Zelina

When 'Check the AC' Becomes the Hard Part

TL;DR for operators: Smart-home assistants do not fail only when users are vague. They fail when users become efficient. The PEC-Home paper studies a familiar pattern: after repeated interaction, people stop saying the whole thing. “Please turn on the air conditioner in the bedroom and set it to 26 degrees at 10 PM” eventually becomes “check the AC” or “handle that thing.” Humans manage this because shared context, identity, place, and prior routines do the missing work. Current LLM assistants are much less charming under that burden. ...

June 25, 2026 · 19 min · Zelina

The Code Agent Wasn’t Self-Correcting. The Test Harness Was.

TL;DR for operators: Code agents do not become reliable because they are asked politely to “fix the bug.” They become more useful when they are placed inside a loop that can run their output, return structured failure evidence, and decide how many further attempts are worth buying. That is the practical point of Zhang and Kothari’s paper, Unlocking LLM Code Correction with Iterative Feedback Loops.[1] The authors evaluate four LLMs across Python and Java using LeetCode problems, then move from ordinary one-shot performance to an automated correction loop: generate code, execute it, feed back compiler/runtime/testcase information, and repeat up to ten iterations. ...

June 22, 2026 · 17 min · Zelina

The Missing Ingredient Wasn’t Vision: NutriMLLM and the Data Recipe for Micronutrient AI

TL;DR for operators: Food-image nutrition AI is usually sold as a vision problem: recognise the meal, estimate the portion, output the nutrients, preferably with a pleasant progress spinner. NutriMLLM suggests that this is only half right. The harder missing piece is not necessarily seeing the food. It is knowing the full nutrient profile once the food is identified. ...

June 19, 2026 · 19 min · Zelina

Statecraft, Not Scorecards: Why Reliable AI Lives on the Path

TL;DR for operators: AI reliability is increasingly a path problem, not a score problem. One paper argues that post-training methods such as supervised fine-tuning, reinforcement learning, and on-policy distillation should be understood by asking where supervision is applied in the model’s state space.[1] Another argues that GUI-agent software evaluation fails when a single unsuccessful rollout is treated as proof of a broken application, even though the evaluator has only inspected one path through a larger UI state graph.[2] ...

June 15, 2026 · 3 min · Zelina

Raw Is Not Ready: Why Reliable AI Needs Evidence Architecture

Production AI has entered its awkward teenage phase. It can speak fluently, see impressively, forecast usefully, and still fail in ways that make operators quietly reach for the manual override. The problem is not simply that models are too small, not enough tokens have been burned, or someone forgot to add “think step by step” to a prompt. The deeper problem is that many AI systems are being asked to reason directly from raw inputs that have not yet been converted into the right operational form. ...

June 12, 2026 · 14 min · Zelina

None Taken: Why Video AI Must Learn When No Answer Is Correct

A camera sees the scene. The model reads the question. The options look reasonable. One of them must be right. That last sentence is the problem. Many enterprise video-AI workflows are built around this quiet assumption. A model reviews a warehouse clip and chooses the most likely safety violation. It watches a customer interaction and classifies the complaint. It checks a manufacturing video and identifies the defect category. The system may be wrong, of course, but the menu is treated as complete. The correct answer is assumed to be hiding somewhere among the choices, waiting for the model to point at it with sufficient confidence. ...

June 10, 2026 · 17 min · Zelina