Code Generation

Hook, Line, and Import: How RAG Lets Attackers Snare Your Code

Imports look harmless until they become procurement. A developer asks an AI assistant for a plotting snippet. The assistant returns clean-looking Python, a few lines of explanation, and an import statement for matplotlib_safe. The name sounds prudent. Safer is good. Safer is what the security team keeps asking for, usually in meetings that could have been static analysis. ...

Guard Rails > Horsepower: Why Environment Scaffolding Beats Bigger Models

A demo is cheap. Ask an AI agent to build a web app, watch it spin up a cheerful interface, click a few buttons, and everyone briefly pretends software engineering has been solved. Then production begins. The app boots but stores nothing. The database schema exists but the handler quietly forgets foreign keys. The UI looks plausible until the first state transition. The test suite passes because it checked the page title, not the workflow. Somewhere, a dashboard reports “success.” Somewhere else, a user discovers the thing is an elegant cardboard storefront. ...

Longer Yet Dumber: Why LLMs Fail at Catching Their Own Coding Mistakes

TL;DR for operators: Code review usually starts after code exists. FPBench argues that this is already too late. The paper behind FPBench tests whether large language models can detect faulty premises in code-generation requests before obediently producing code from them.[1] The answer is awkward. Many models can identify the flaw when explicitly told to check the question first, but most do not do so proactively. They behave less like careful engineers and more like very fast interns with a tragic respect for bad tickets. ...

From Autocomplete to Autonomy: How LLM Code Agents are Rewriting the SDLC

TL;DR for operators: The useful question is no longer “Can an LLM write code?” It can. Often quite well, occasionally with the confidence of a junior developer who has just discovered Stack Overflow and caffeine. The better question is: which parts of the software development lifecycle can be safely handed to an agentic workflow, and under what controls? ...

Learning to Struggle: Teaching LLMs to Code Like Real Students

TL;DR for operators: ParaStudent asks a sharper question than “Can an LLM solve programming homework?” It asks whether an LLM can generate code that looks like it came from a real novice: incomplete, inconsistent, stylistically awkward, and improving over time.[1]

The key empirical surprise is that GPT-4.1 is often too competent to be realistic. In the high-resolution experiment, GPT-4.1 produces pass rates of 96.7% on familiar problems and 100.0% on new problems, while real student submissions average 9.8% and 12.1%, respectively, at the evaluated next-submission points. A fine-tuned Qwen-2.5 Coder 7B model, called qwen-student, comes much closer to real student behaviour across pass rate, PEP 8 violations, style score, embedding distance, and incremental edit patterns.

The paper’s business relevance is not “AI will replace students,” which would be a rather grim product roadmap. The useful pathway is synthetic student behaviour for training tutor agents, testing feedback systems, building benchmarks, and stress-testing interventions where real student data is scarce or sensitive.

The boundary is material. ParaStudent works best when the model has seen related problems from the same course. Generalisation to new problems is weaker, and the high-resolution setup predicts the next submission using real prior attempts rather than generating an entire student journey from scratch.

For edtech teams, the takeaway is simple: if the product depends on modelling learners, correctness is the wrong north star. The right question is whether the system can represent how learners fail, revise, and partially recover.

Homework code is supposed to look a little broken

Student code is not merely worse professional code. It has its own texture. ...