Cover image

Longer Yet Dumber: Why LLMs Fail at Catching Their Own Coding Mistakes

TL;DR for operators Code review usually starts after code exists. FPBench argues that this is already too late. The paper behind FPBench tests whether large language models can detect faulty premises in code-generation requests before obediently producing code from them.1 The answer is awkward. Many models can identify the flaw when explicitly told to check the question first, but most do not do so proactively. They behave less like careful engineers and more like very fast interns with a tragic respect for bad tickets. ...

August 6, 2025 · 14 min · Zelina
Cover image

From Autocomplete to Autonomy: How LLM Code Agents are Rewriting the SDLC

TL;DR for operators The useful question is no longer “Can an LLM write code?” It can. Often quite well, occasionally with the confidence of a junior developer who has just discovered Stack Overflow and caffeine. The better question is: which parts of the software development lifecycle can be safely handed to an agentic workflow, and under what controls? ...

August 4, 2025 · 17 min · Zelina
Cover image

Learning to Struggle: Teaching LLMs to Code Like Real Students

TL;DR for operators ParaStudent asks a sharper question than “Can an LLM solve programming homework?” It asks whether an LLM can generate code that looks like it came from a real novice: incomplete, inconsistent, stylistically awkward, and improving over time.1 The key empirical surprise is that GPT-4.1 is often too competent to be realistic. In the high-resolution experiment, GPT-4.1 produces pass rates of 96.7% on familiar problems and 100.0% on new problems, while real student submissions average 9.8% and 12.1% respectively at the evaluated next-submission points. A fine-tuned Qwen-2.5 Coder 7B model, called qwen-student, comes much closer to real student behaviour across pass rate, PEP 8 violations, style score, embedding distance, and incremental edit patterns. The paper’s business relevance is not “AI will replace students,” which would be a rather grim product roadmap. The useful pathway is synthetic student behaviour for training tutor agents, testing feedback systems, building benchmarks, and stress-testing interventions where real student data is scarce or sensitive. The boundary is material. ParaStudent works best when the model has seen related problems from the same course. Generalisation to new problems is weaker, and the high-resolution setup predicts the next submission using real prior attempts rather than generating an entire student journey from scratch. For edtech teams, the takeaway is simple: if the product depends on modelling learners, correctness is the wrong north star. The right question is whether the system can represent how learners fail, revise, and partially recover. Homework code is supposed to look a little broken Student code is not merely worse professional code. It has its own texture. ...

July 19, 2025 · 17 min · Zelina