LLM Reasoning

Beyond the Linear Ceiling: Why Non-Linearity Is the Next Frontier in PEFT

More Rank Is Not Always More Capacity. Fine-tuning teams love a simple knob. If the model underperforms, increase rank. If the adapter looks too small, increase rank. If the downstream task is hard, increase rank again and call it strategy. This is comforting because rank is measurable, budgetable, and easy to explain in a meeting. Unfortunately, reality has its usual habit of being less cooperative. ...

Mirror, Mirror on the LLM: Teaching Models to Think About Their Thinking

Evidence is not the same as judgment. Anyone who has watched an AI assistant work through a multi-document question has seen the strange version of this failure. The model finds the relevant fact. It even says something that looks like the right answer. Then, a few paragraphs later, it invents an extra condition, follows that condition with great confidence, and lands somewhere else. ...

Beyond Chain-of-Thought: When Models Start Arguing with Themselves

The mirror test is more useful than another monologue. Mirror. That is where the paper’s argument becomes easy to see. Ask a multimodal model to generate an image of a plush lion in front of a mirror. The generated image may look plausible at first glance. Then ask the same model’s understanding branch whether the image actually matches the prompt. The model may say no: if the lion faces the camera, the mirror should mostly show its back. The generator has produced the scene; the understander has rejected it. ...

Thoughts in Motion: From Static Prompts to Self-Optimizing Reasoning Graphs

A workflow looks harmless until it starts waiting on itself. One LLM call asks for a plan. Another evaluates the plan. A third revises the result. A fourth retrieves evidence. Somewhere in the middle, three subtasks could have run at the same time, two repeated calls could have been reused, and one prompt should probably have been tuned before anyone proudly called the system “agentic.” Instead, the whole thing runs as a neat little chain: expensive, slow, and quietly brittle. Very elegant, in the way a traffic jam is elegant if viewed from a drone. ...

Thinking in New Directions: When LLMs Learn to Evolve Their Own Concepts

A familiar business scene: a team has already tried the standard AI improvement kit. Better prompts. More examples. Chain-of-thought. Self-consistency. A small agent wrapper. Maybe even a heroic tree-of-thought workflow that burns compute like a startup burns runway. The model improves, but not in the way the team hoped. It can explain more. It can sample more. It can retry more. Yet when the task requires a new abstraction — a hidden rule in a grid, a nested logical constraint, a multi-step scientific relation, a variable-binding trick in math — the model still behaves like someone confidently rearranging old furniture in a room that needs a new door. ...

Mind Your Mode: Why One Reasoning Style Is Never Enough

Enterprise workflows rarely fail because nobody “thought step by step.” They fail because the wrong kind of thinking is applied for too long. A compliance analyst does not review an incident report the same way she reconciles a spreadsheet. A software engineer does not debug production latency with the same mindset used to design a product roadmap. A CFO does not evaluate a warehouse automation proposal by “being creative” all the way through, unless the board has a strong appetite for interpretive dance. ...

Confidence Is Not Truth, But It Can Steer: When LLMs Learn When to Stop

Stop. Every production LLM workflow eventually meets the same boring question: should the model answer now, think again, or throw away the current path and try something else? That question sounds less glamorous than “build a bigger model.” It is also closer to where real deployment costs live. Reasoning models can improve by sampling more answers, extending chains of thought, or running repeated critique-and-revision loops. The bill, naturally, arrives in tokens, latency, GPU capacity, and engineering patience. The last item is rarely benchmarked, perhaps because it would make too many papers look expensive. ...

When LLMs Lose the Plot: Diagnosing Reasoning Instability at Inference Time

Mistakes are easy to audit after the fact. That is why most AI evaluation still behaves like a mildly disappointed teacher: wait for the final answer, mark it right or wrong, and pretend the interesting part happened at the end. But in real LLM workflows, the damage often starts earlier. A model begins with a plausible line of reasoning, then drifts. It changes route without noticing. It over-explains a wrong intermediate step. It doubles back, patches the logic, and sometimes recovers. Other times it gracefully walks into a wall, with the confidence of a consultant holding a laser pointer. ...

Conformal Thinking: Teaching LLMs When to Stop Thinking

Thinking is not free. That sentence should not need explaining to anyone who has paid an inference bill, waited for a reasoning model to finish its theatrical inner monologue, or watched an AI agent spend half its budget trying to solve a task it was never going to solve. Reasoning models have become better at using more tokens. They have not automatically become better at knowing when more tokens have stopped helping. ...

Identity Crisis: How a Trivial Trick Teaches LLMs to Think Backwards

Facts are rude. They rarely arrive in the direction your software needs them. A customer database may know that Alice reports to Bob, while the compliance officer asks, “Who reports to Bob?” A product catalog may store that SKU-17 belongs to Category X, while the chatbot receives, “Show me all products in Category X.” A medical knowledge base may encode one directional relation, while the user asks for the inverse. Humans treat these as the same fact seen from opposite ends. Language models, being very expensive autocomplete machines with a talent for plausible theater, do not always share our confidence. ...