Beyond Stack Overflow: CodeAssistBench Exposes the Real Gaps in LLM Coding Help

TL;DR for operators

Coding assistants look much better when the task is a clean question than when the task is a messy software support conversation. That is the inconvenient point of CodeAssistBench, or CAB, a benchmark that turns resolved GitHub issues into multi-turn, project-grounded conversations where a model must behave like a maintainer, not a code-snippet vending machine.¹

The headline result is blunt. Models that score roughly 70-83% on Stack Overflow-style questions solve only 7.22-16.49% of CAB issues from recent repositories. Even on all-time repositories, correctness ranges only from 13.58% to 29.14%. The gap is not decorative. It says that repository context, changing dependencies, build environments, user clarification, and satisfaction criteria are still hard.

For buyers and builders of AI developer tools, the lesson is not “LLMs cannot code.” That would be tidy, dramatic, and mostly wrong. The better lesson is that coding assistance is several jobs wearing one hoodie: search, diagnosis, environment reconstruction, dependency reasoning, explanation, and iterative support. Benchmarks that test only code generation measure one part of that bundle.

CAB’s contribution is not just another leaderboard. It is a method for generating harder, fresher evaluation tasks from GitHub issues, with simulated users, containerised environments, extracted satisfaction conditions, and an automated judge. That makes it useful as a pattern for internal evaluation: test agents against your own recent repositories, your own build systems, and your own definition of a satisfied developer.

The boundary is equally important. CAB is automated, and automation leaves fingerprints. Its satisfaction conditions are accurate more often than they are complete. Its evaluation uses a sampled subset of issues. Its simulated user cannot fully mimic every developer’s confused, contradictory, “actually I meant…” follow-up. Its LLM judge reaches 65.92% average agreement with human annotators, which is useful but not oracle-grade. In other words, CAB is a better mirror, not a crystal ball. Sadly, procurement decks still prefer crystal balls.

The uncomfortable number is 16.49%

Start with the number that should make engineering leaders pause: 16.49%.

That is the best correctness score reported on CAB’s recent-repository subset. The model achieving it is ChatGPT 4.1 Mini. Other evaluated systems score lower: DeepSeek R1 and Sonnet 3.7 reach 11.34%, Sonnet 3.7 Think reaches 13.40%, Llama 3.3 70B reaches 9.33%, and Haiku 3.5 reaches 7.22%.

This is not a toy benchmark where the models are being asked to reverse a linked list while blindfolded. CAB is built from real GitHub issues labelled as questions or help-wanted, filtered for resolved technical support conversations, and converted into structured evaluation tasks. The assistant has to answer in a multi-turn interaction, with access to project context and, where relevant, a containerised environment.

The paper’s contrast is what matters. On Stack Overflow-style questions, the same broad class of models can look competent, with reported accuracy in the 70-83% range. On recent CAB repositories, the ceiling drops below 17%. That is not a rounding error. That is the difference between “this assistant is useful for known patterns” and “this assistant is ready to handle live support inside evolving codebases.”

The all-time repository subset is less brutal but still sobering. Correctness ranges from 13.58% to 29.14%. ChatGPT 4.1 Mini again leads with 29.14%, while Llama 3.3 70B is at the bottom with 13.58%. Partial correctness is common, which is exactly the problem: a half-right answer in a codebase can be worse than no answer, because it gives the user a beautifully formatted wrong path.

Here is the core model-result table, compressed to the numbers an operator actually needs:

Model	Correct on recent repos	Correct on all-time repos	What to notice
ChatGPT 4.1 Mini	16.49%	29.14%	Best overall in this evaluation, still weak on recent repositories
DeepSeek R1	11.34%	27.14%	Stronger on all-time than recent, suggesting recency and ecosystem drift matter
Llama 3.3 70B	9.33%	13.58%	Lowest all-time score among the evaluated models
Haiku 3.5	7.22%	16.86%	Lowest recent score
Sonnet 3.7	11.34%	25.71%	The synthetic ablation later complicates the story
Sonnet 3.7 Think	13.40%	27.43%	Competitive on all-time and some dynamic-language subsets

The obvious reading is “new repositories are harder.” The more useful reading is narrower: recent software ecosystems expose the brittleness of model knowledge, tool use, and contextual grounding at the same time.
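To make the gap concrete, the table can be reduced to one number per model: all-time correctness minus recent correctness, in percentage points. The sketch below only re-does arithmetic on the figures quoted in the table; the names `RESULTS` and `recency_gap` are illustrative and do not come from the paper.

```python
# Correctness (%) as quoted in the table above: (recent repos, all-time repos).
RESULTS = {
    "ChatGPT 4.1 Mini": (16.49, 29.14),
    "DeepSeek R1": (11.34, 27.14),
    "Llama 3.3 70B": (9.33, 13.58),
    "Haiku 3.5": (7.22, 16.86),
    "Sonnet 3.7": (11.34, 25.71),
    "Sonnet 3.7 Think": (13.40, 27.43),
}


def recency_gap(results):
    """Return {model: all-time minus recent correctness, in percentage points}."""
    return {
        model: round(all_time - recent, 2)
        for model, (recent, all_time) in results.items()
    }


if __name__ == "__main__":
    # Largest gap first.
    ranked = sorted(recency_gap(RESULTS).items(), key=lambda kv: -kv[1])
    for model, gap in ranked:
        print(f"{model:<18} {gap:5.2f} pp")
```

Read this way, the model with the smallest gap is not necessarily the most robust one; a low all-time score leaves less room to fall.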

That matters because most enterprise software is not a benchmark prompt. It is a living pile of version pins, internal conventions, unmerged assumptions, fragile build steps, and tribal knowledge wearing a README as a hat.

CAB tests the support conversation, not the code snippet

Most coding benchmarks ask a model to produce an answer. CAB asks whether the answer survives a support interaction.

That distinction is not academic. A real developer rarely arrives with a perfect prompt. They arrive with symptoms: a failing build, a mysterious dependency conflict, a configuration question, a port-mapping issue, an error message that may or may not be the real error, and the quiet suspicion that the documentation is lying. The assistant must infer context, ask or answer follow-ups, and converge on something the user can apply.

CAB models this through a three-agent setup:

CAB component	Role in the evaluation	Why it matters
User agent	Starts with the GitHub issue question and gives follow-up responses	Forces the model into multi-turn support rather than one-shot answering
Maintainer agent	The evaluated model; answers as if maintaining the project	Measures assistance quality in repository context
Judge agent	Assesses the final conversation against satisfaction conditions, execution results, and response quality	Converts messy support into a repeatable evaluation signal
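The three roles in the table can be sketched as a small control loop. This is a minimal illustration, not the paper's harness: the function name `run_support_episode`, the turn cap `max_turns`, and the `[SATISFIED]` stop marker are all assumptions made for the example, and the three agents are plain callables so that any model client could be plugged in.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


def run_support_episode(
    issue_question: str,
    maintainer: Callable[[List[Turn]], str],  # the evaluated model
    user: Callable[[List[Turn]], str],        # simulated user
    judge: Callable[[List[Turn]], str],       # scores the finished conversation
    max_turns: int = 5,                       # assumed cap, not from the paper
    done_marker: str = "[SATISFIED]",         # assumed stop signal
):
    """Run one multi-turn support conversation and return (turns, verdict)."""
    turns: List[Turn] = [("user", issue_question)]
    for _ in range(max_turns):
        turns.append(("maintainer", maintainer(turns)))
        reply = user(turns)
        if done_marker in reply:
            break
        turns.append(("user", reply))
    return turns, judge(turns)


def demo():
    """Stub agents: the maintainer needs one clarification round."""

    def maintainer(turns):
        asked = sum(1 for speaker, _ in turns if speaker == "maintainer")
        if asked == 0:
            return "Which port is the app bound to?"
        return "Change the host port to 8080."

    def user(turns):
        return "[SATISFIED]" if "8080" in turns[-1][1] else "It is bound to 80."

    def judge(turns):
        solved = any(s == "maintainer" and "8080" in t for s, t in turns)
        return "correct" if solved else "incorrect"

    return run_support_episode("How do I remap the app port?", maintainer, user, judge)
```

Keeping the judge outside the loop mirrors the table: it sees the final conversation rather than steering it.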

The dataset itself comes from GitHub issues, not invented programming riddles. The authors begin with public repositories, filter for active support communities and permissive licences, collect closed issues labelled as question or help-wanted, remove low-signal or unsafe cases, and restructure conversations into user-maintainer turns. The final dataset contains 3,286 retained issues across 214 repositories and seven languages, including 238 validated Docker environments.

The seven languages are Python, Java, C++, C#, JavaScript, TypeScript, and C. The paper builds both an all-time cohort and a recent cohort, with recent repositories created after November 2024. This matters because it reduces the chance that the benchmark merely tests whether a model has memorised yesterday’s public issue thread. It also makes the benchmark more irritating for models. Benchmarks, like auditors, become more useful when they are harder to charm.

CAB’s automation is central. The pipeline uses LLMs to filter issues, remove non-contributory comments, generate Dockerfiles, repair failed builds, extract satisfaction conditions, and judge final answers. In total, the pipeline executed 44,628 Sonnet 3.7 calls end-to-end without manual intervention. That scale is the point: the authors are not hand-crafting a boutique test set; they are proposing machinery for continuously generating fresh, project-grounded evaluation.

Satisfaction conditions are the quiet mechanism

The most important part of CAB is not the GitHub scraping. It is the attempt to define what it means for a user to be satisfied.

For each issue, CAB extracts satisfaction conditions from the original thread. These are not supposed to be exact implementation steps. They are criteria: what the user needs, at the right level of abstraction, grounded in the conversation, and clear enough to judge. For example, in a media-playback issue, the relevant condition might be “confirmation that video playback is possible without audio codec support,” not “use this exact function call.”

This matters because developer support is not reducible to “the code compiles.” Sometimes the correct answer is configuration guidance. Sometimes it is an explanation of why the requested change is unnecessary. Sometimes it is a dependency version correction. Sometimes the model must tell the user that the premise is wrong, preferably without sounding like an overconfident intern who just discovered Docker.

CAB’s judge evaluates along three axes:

Axis	What CAB asks	Business translation
Technical correctness	Is the response technically accurate?	Does this reduce engineering risk or create a new one?
Satisfaction completeness	Are the extracted user conditions met?	Did the assistant actually solve the support case, not just say plausible things?
Verbosity	Is the answer terse, appropriate, or verbose?	Does the assistant save time, or merely relocate the time cost into reading?

The Docker rule is especially useful. For Docker-related issues, a technically plausible explanation is not enough if the build or test validation fails. This is a small but important shift from language evaluation to operational evaluation. In software, correctness eventually has to touch the machine. The machine is rude, but at least it is consistent.
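One way to picture the three axes and the Docker rule together is as a record plus a gating function. The sketch below is an assumption-heavy illustration: the field names, the `verdict` function, and the exact mapping from axes to the labels "correct", "partially correct", and "incorrect" are ours, not the paper's. The only rule taken from the text is that a failed Docker validation overrides a plausible explanation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgement:
    technically_correct: bool
    conditions_met: int
    conditions_total: int
    verbosity: str                        # "terse", "appropriate", or "verbose"
    docker_passed: Optional[bool] = None  # None means no Docker environment


def verdict(j: Judgement) -> str:
    """Collapse the axes into one label (hypothetical mapping)."""
    # Docker rule: a failed build or test validation cannot be argued away.
    if j.docker_passed is False:
        return "incorrect"
    if j.technically_correct and j.conditions_met == j.conditions_total:
        return "correct"
    if j.conditions_met > 0:
        return "partially correct"
    return "incorrect"
```

Verbosity is recorded but deliberately kept out of the label here, so that a long correct answer and a long wrong answer can still be told apart afterwards.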

The recent-repository gap is probably not just “new code is weird”

A lazy interpretation would be that recent repositories are simply worse, stranger, or more chaotic. CAB’s appendix pushes against that.

The authors run a synthetic dataset ablation using 50 problems from recent repositories. They generate synthetic reproductions that preserve the bugs and satisfaction conditions but use language and library versions within the model’s training knowledge. These synthetic repositories are not tiny single-file puzzles; they contain 20-28 files and over 1,500 lines of code. They are, however, more documented, more focused, and more familiar to the model.

The result is dramatic: Sonnet 3.7 reaches 74.0% correctness on the synthetic repositories, compared with 11.34% on recent real repositories and 25.71% on all-time repositories.

This ablation has a clear purpose: it tests whether poor performance on recent repositories is mainly caused by the inherent difficulty of the problems themselves, or by knowledge and ecosystem mismatch. The authors argue that training-data recency is a major driver. Recent repositories often use new library APIs, language features, and framework practices that post-date model training cutoffs. When the same bugs and satisfaction conditions are moved into a more familiar ecosystem, model performance improves sharply.

But the ablation does not prove that recency is the only issue. The synthetic repositories are smaller, better documented, and more regular than many real-world repositories. That matters. Models are quite fond of codebases written in the style models like to write. A benchmark that lets the model meet its own reflection is useful, but we should not mistake it for production.

The practical interpretation is therefore two-part:

Finding	What it supports	What it does not prove
Recent real repos are much harder than all-time repos	Model knowledge and ecosystem drift are serious blockers	All poor performance is caused by recency alone
Synthetic familiar repos are much easier	Models can reason better when libraries, versions, and documentation are familiar	Synthetic success transfers cleanly to enterprise codebases
All-time repos still have low correctness	Repository-grounded support is hard even without the worst recency penalty	Existing coding agents are useless

This is the better business lesson. Updating model knowledge helps, but it does not remove the need for repository indexing, dependency inspection, environment execution, retrieval, test generation, and human escalation.

The failures look like real engineering work, unfortunately

CAB is not dominated by syntax trivia. That is part of its value.

In the human-validation subset, the paper’s error-category analysis finds that logic issues and environment configuration issues dominate, at 28.8% and 28.5% respectively. Dependency issues account for 14.6%. Syntax errors are only 1.9%.

This distribution feels familiar because it resembles actual engineering support. Developers do not spend most of their time asking whether a missing semicolon is missing. They ask why a build fails only in one environment, why a package version breaks a workflow, why an API behaves differently from the documentation, why a container needs one port exposed but not another, or why a function returns the right thing until it doesn’t.

CAB’s qualitative example captures the flavour. A user asks about remapping ports for an application and a Cloudflare proxy. The weaker model response fails to address key proxy-port requirements or introduces wrong configuration. The successful response explains how to change the main application port, clarifies that the proxy port does not need to be exposed, and gives a conflict-free docker-compose.yml.

The point is not that port mapping is glamorous. It is that these questions are operationally expensive. They consume attention, interrupt workflow, and often require someone who knows the repository’s topology. If a coding assistant cannot resolve this class of question reliably, its value is narrower than the product brochure suggests. A brochure, being untroubled by CI logs, rarely notices.

Verbosity is not harmless when correctness is low

The paper also evaluates response style. Models tend toward over-explanation: verbose responses account for roughly 40-60% of outputs, with verbosity increasing on recent repositories. The authors suggest this may reflect higher uncertainty.

That is plausible. When a model does not know, it often compensates by becoming longer. This is not merely annoying; it is operationally expensive. A verbose wrong answer creates more surface area for the developer to inspect. It can also bury the key uncertainty under confident explanation.

For internal AI tooling, this implies that response quality should not be measured only by correctness. Teams should track:

Metric	Why it matters
Correct resolution rate	The obvious measure: did the assistant actually solve the issue?
Partial-correct rate	Reveals whether the model is close enough to support human-in-the-loop workflows
Time-to-satisfactory answer	Captures whether the assistant reduces support friction
Turn count before success or failure	Shows whether the assistant converges or wanders
Verbosity under uncertainty	Detects answer-padding when the model should instead ask, inspect, or escalate
Execution-backed validation	Separates plausible advice from working advice

CAB’s conversation-length analysis is useful here. Correct CAB conversations are similar in length to the original GitHub threads, typically around two to three turns. Incorrect conversations tend to run longer, often by one or two additional turns. That is what bad support looks like: not immediate failure, but graceful wandering.
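Several of the metrics in the table fall out of a simple aggregation over per-conversation records. The sketch below assumes a hypothetical record shape (a dict with `verdict`, `turns`, and `verbosity` keys) and made-up sample data; it is meant as a starting point for an internal dashboard, not a reproduction of CAB's analysis.

```python
from collections import defaultdict
from statistics import mean


def summarise(records):
    """Aggregate per-conversation records into headline support metrics."""
    records = list(records)
    if not records:
        raise ValueError("need at least one record")
    n = len(records)

    by_verdict = defaultdict(list)
    for r in records:
        by_verdict[r["verdict"]].append(r)

    return {
        # .get avoids creating empty groups in the defaultdict
        "resolution_rate": len(by_verdict.get("correct", [])) / n,
        "partial_rate": len(by_verdict.get("partially correct", [])) / n,
        "verbose_rate": sum(r["verbosity"] == "verbose" for r in records) / n,
        "mean_turns": {v: mean(r["turns"] for r in rs) for v, rs in by_verdict.items()},
    }


# Made-up records, for illustration only.
SAMPLE = [
    {"verdict": "correct", "turns": 2, "verbosity": "appropriate"},
    {"verdict": "incorrect", "turns": 4, "verbosity": "verbose"},
    {"verdict": "partially correct", "turns": 3, "verbosity": "verbose"},
    {"verdict": "incorrect", "turns": 5, "verbosity": "appropriate"},
]
```

Splitting mean turn count by verdict is the part worth keeping: it is what would reveal the wandering pattern described above in your own data.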

What the paper directly shows, and what Cognaptus infers

The paper is an evaluation contribution. It does not claim that CAB is the final word on coding agents, and it does not test every tool configuration that a company might deploy. Still, it gives a strong basis for changing how AI coding assistance should be evaluated.

Layer	Claim	Confidence
Direct paper result	CAB can automatically generate multi-turn, project-grounded code-assistance tasks from resolved GitHub issues	High within the paper’s described scope
Direct paper result	Six leading models perform far worse on recent CAB repositories than on Stack Overflow-style questions	High for the sampled evaluation setup
Direct paper result	Satisfaction conditions are usually accurate but often incomplete	High, based on 663 human-reviewed conditions
Direct paper result	LLM judging is scalable but agrees with human annotators materially less often than the annotators agree with each other	High, based on the reported human-validation study
Cognaptus inference	Public coding benchmark scores should not be used as sole evidence for enterprise support readiness	Strong, but depends on the enterprise workflow
Cognaptus inference	Internal evaluation should use fresh repositories, executable environments, and multi-turn support scenarios	Strong for teams deploying coding assistants into real engineering work
Still uncertain	How CAB scores would change with richer retrieval, specialised tooling, better user simulation, or human-in-the-loop review	Open

This separation matters. The paper does not say that all coding assistants are doomed. It says that the common evaluation regime is too forgiving. For business readers, that is the actionable part.

The buying question changes from “Can it code?” to “Can it support?”

Most organisations evaluating AI developer tools still ask a version of the wrong question: can the model generate code?

It can. Often impressively. Sometimes usefully. Occasionally with the confidence of a consultant billing by the metaphor.

The better question is: can it support developers inside our actual software environment?

That question has different requirements:

Procurement question	CAB-shaped replacement
What is the model’s benchmark score?	What is its resolution rate on our recent repositories?
Can it answer coding questions?	Can it handle multi-turn debugging and clarification?
Does it know the language?	Does it know our dependency versions, framework conventions, and build system?
Does it produce plausible code?	Does the proposed fix execute, build, or pass relevant tests?
Does it explain well?	Does it satisfy the developer’s actual condition without wasting attention?
Is it safe to roll out broadly?	Which issue categories require human escalation?

For vendors, CAB suggests a sharper product roadmap. A useful coding assistant needs more than a better base model. It needs repository-aware retrieval, dependency graph inspection, environment execution, version-aware documentation, test feedback, and escalation logic when uncertainty is high. It also needs to know when not to improvise. This is harder to market than “agentic coding,” but less likely to set the build on fire.

For buyers, CAB suggests a practical evaluation pattern:

1. Select recent issues from internal repositories, especially those involving build, dependency, configuration, and debugging problems.
2. Preserve the original support trail or expected resolution criteria.
3. Run candidate assistants in a multi-turn setup, not just one prompt.
4. Require execution or test validation where applicable.
5. Score not only correctness, but partial correctness, turn count, verbosity, and escalation quality.
6. Compare model-only performance with tool-augmented performance.
7. Treat public leaderboard performance as a prior, not as evidence of fit.
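The comparison between model-only and tool-augmented runs is easiest to read per issue category. The sketch below assumes each run is logged as a `(category, config, solved)` tuple; the config labels `model_only` and `tool_augmented`, the function names, and the sample rows are all hypothetical.

```python
from collections import Counter


def resolution_by_category(runs):
    """Return {(category, config): resolution rate} from (category, config, solved) rows."""
    solved, total = Counter(), Counter()
    for category, config, ok in runs:
        total[(category, config)] += 1
        solved[(category, config)] += int(ok)
    return {key: solved[key] / total[key] for key in total}


def tooling_uplift(runs, base="model_only", augmented="tool_augmented"):
    """Per-category difference in resolution rate between two configurations."""
    rates = resolution_by_category(list(runs))
    categories = {category for category, _ in rates}
    return {
        c: rates[(c, augmented)] - rates[(c, base)]
        for c in categories
        if (c, base) in rates and (c, augmented) in rates
    }


# Made-up rows, for illustration only.
EXAMPLE_RUNS = [
    ("dependency", "model_only", False),
    ("dependency", "tool_augmented", True),
    ("dependency", "model_only", False),
    ("dependency", "tool_augmented", False),
    ("build", "model_only", True),
    ("build", "tool_augmented", True),
]
```

A category where the uplift is near zero and the base rate is low is a candidate for human escalation rather than more tooling.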

The ROI question then becomes less theatrical. You are not asking whether AI will replace developers. You are asking whether it can reduce support load, shorten issue triage, improve onboarding, and catch routine configuration failures without generating new ones. Sensible, measurable, and therefore less fun at conferences.

CAB’s own limitations are part of the lesson

CAB is more realistic than many code benchmarks, but it is still an evaluation system built from approximations.

The most important limitation is satisfaction-condition extraction. Human annotators judged 86.27% of extracted conditions correct, but recall was only 65.71%. That means the extracted conditions are usually valid, but they miss a non-trivial share of what users need. In practical terms, CAB may under-measure some forms of completeness or misclassify answers when the condition set is incomplete.

The judge is also imperfect. In the validation study, two software engineers reached 78.28% overall agreement across 310 model responses. The LLM judge reached 65.92% average agreement with human annotators, equal to 84.2% of the inter-human agreement baseline. That is respectable for scalable evaluation. It is not enough to treat individual judgments as absolute truth.
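The 84.2% figure is simply the ratio of the two agreement rates, and the same calculation is easy to run on internal annotation data. In the sketch below, `percent_agreement` is a plain exact-match rate between two label lists. The paper's precise agreement procedure is not described here, so treat that function as an assumption.

```python
HUMAN_HUMAN = 78.28  # % agreement between the two engineers (from the text)
LLM_HUMAN = 65.92    # % average agreement between LLM judge and humans (from the text)

# Judge agreement expressed as a share of the inter-human baseline.
relative_agreement = LLM_HUMAN / HUMAN_HUMAN


def percent_agreement(labels_a, labels_b):
    """Exact-match agreement between two equally long label sequences, in percent."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


if __name__ == "__main__":
    print(f"{relative_agreement:.1%} of the inter-human baseline")
```

Reporting the ratio alongside the raw rate matters because the ceiling is not 100%: even two engineers disagree on more than a fifth of responses.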

The evaluation sample is another boundary. Running all 3,286 issues in multi-turn mode would have been computationally expensive, so the paper evaluates a balanced subset of 350 all-time and 194 recent issues. That is reasonable, but it means fine-grained model comparisons should be interpreted carefully, especially where scores are close. The authors also do not report error bars or statistical significance tests.
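Readers can approximate the missing error bars themselves. The sketch below computes a 95% Wilson score interval for a proportion. The success counts (32 and 26 out of 194) are back-calculated from the reported 16.49% and 13.40% and the stated recent-subset size, so they are our assumption rather than numbers printed in the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Assumed counts: 32/194 is about 16.49%, 26/194 is about 13.40%.
    for label, k in [("best recent score", 32), ("second-best recent score", 26)]:
        lo, hi = wilson_interval(k, 194)
        print(f"{label}: {lo:.1%} to {hi:.1%}")
```

Under these assumed counts the two intervals overlap substantially, which is the practical meaning of "interpret close scores carefully".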

The simulated user is grounded in historical responses through BM25 retrieval, but it cannot fully capture the strange theatre of real developer interaction: missing context, frustration, contradictory reports, partial logs, security restrictions, or internal conventions that never reached the README. Enterprise deployment will be messier.
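For readers unfamiliar with BM25, the idea is to rank the user's historical replies by term overlap with the current maintainer message, weighted by how rare each term is and normalised for reply length. The sketch below is a minimal pure-Python Okapi BM25 with whitespace tokenisation and the common default parameters `k1=1.5` and `b=0.75`. How CAB tokenises, which parameters it uses, and how retrieved replies feed the user agent are not specified here, so all of that is assumed.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenisation (an assumption; real systems do more)."""
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        n = len(self.docs)
        df = Counter(term for d in self.docs for term in set(d))
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tf[i], len(self.docs[i])
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def best(self, query):
        """Index of the highest-scoring historical reply."""
        return max(range(len(self.docs)), key=lambda i: self.score(query, i))


# Made-up historical user replies, for illustration only.
HISTORY = [
    "I am running it with docker compose on port 80",
    "the build fails with a missing header",
    "yes that fixed it thanks",
]
```

For example, `BM25(HISTORY).best("which port does docker expose")` picks the first reply, because it is the only one sharing the terms "port" and "docker" with the query. Lexical retrieval of this kind keeps the simulated user anchored to what the real user said, but it also explains the limitation above: it can only replay context that was written down.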

Finally, CAB covers seven programming languages and permissively licensed open-source repositories. It does not directly evaluate proprietary codebases, Rust or Go ecosystems, regulated development environments, or deeply customised enterprise workflows. Those are not footnotes; they are precisely where many organisations plan to spend money.

The strategic lesson: evaluate where the software lives

CodeAssistBench is valuable because it moves evaluation closer to where software work actually happens. Not all the way there, but closer.

The paper’s evidence-first story is simple. Models look strong on clean Q&A. They look weaker in repository-grounded support. They look weakest on recent repositories where ecosystem drift, dependencies, and undocumented context matter. The right conclusion is not despair. The right conclusion is better measurement.

For executives, this changes vendor evaluation. Do not ask only for public benchmark scores. Ask for recent-repository resolution rates, executable validation, partial-correct analysis, and failure-category breakdowns. Ask how the assistant handles dependencies, build systems, and clarification turns. Ask how often it escalates instead of hallucinating. Watch the room become less cheerful. That is usually when the real evaluation begins.

For engineering leaders, CAB points toward an internal benchmark strategy. Build a small but representative support benchmark from your own resolved issues. Include messy setup problems, not just algorithmic tasks. Extract satisfaction criteria from actual resolutions. Track whether the assistant reduces human work or merely produces a second issue to debug.

For model builders, the message is sharper still. Coding assistance is not solved by memorising public code and producing fluent patches. The hard part is supporting a developer through an evolving problem in an evolving project. That requires current ecosystem knowledge, codebase grounding, execution feedback, and disciplined uncertainty.

The old benchmark question was: can the model answer a programming question?

CAB asks the more expensive question: can it help a developer get unstuck?

That is the question businesses should have been asking all along. Naturally, it took a benchmark to make the obvious measurable.

Cognaptus: Automate the Present, Incubate the Future.

¹ Myeongsoo Kim, Shweta Garg, Baishakhi Ray, Varun Kumar, and Anoop Deoras, “CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance,” arXiv:2507.10646, 2025, https://arxiv.org/abs/2507.10646.