Learning to Struggle: Teaching LLMs to Code Like Real Students

TL;DR for operators

ParaStudent asks a sharper question than “Can an LLM solve programming homework?” It asks whether an LLM can generate code that looks like it came from a real novice: incomplete, inconsistent, stylistically awkward, and improving over time.¹
The key empirical surprise is that GPT-4.1 is often too competent to be realistic. In the high-resolution experiment, GPT-4.1 produces pass rates of 96.7% on familiar problems and 100.0% on new problems, while real student submissions average 9.8% and 12.1% respectively at the evaluated next-submission points.
A fine-tuned Qwen-2.5 Coder 7B model, called qwen-student, comes much closer to real student behaviour across pass rate, PEP 8 violations, style score, embedding distance, and incremental edit patterns.
The paper’s business relevance is not “AI will replace students,” which would be a rather grim product roadmap. The useful pathway is synthetic student behaviour for training tutor agents, testing feedback systems, building benchmarks, and stress-testing interventions where real student data is scarce or sensitive.
The boundary is material. ParaStudent works best when the model has seen related problems from the same course. Generalisation to new problems is weaker, and the high-resolution setup predicts the next submission using real prior attempts rather than generating an entire student journey from scratch.
For edtech teams, the takeaway is simple: if the product depends on modelling learners, correctness is the wrong north star. The right question is whether the system can represent how learners fail, revise, and partially recover.

Homework code is supposed to look a little broken

Student code is not merely worse professional code. It has its own texture.

A novice solution may pass no tests but still reveal a useful misconception. Another may be syntactically valid but logically inverted. A third may contain a perfectly reasonable idea buried under overlong conditionals, missing base cases, inconsistent variable names, and the kind of indentation that makes linters quietly reconsider their life choices.

That texture matters because modern AI tutors are increasingly expected to diagnose the learner, not just solve the assignment. A tutor that only recognises correct code is not a tutor. It is an answer key with nicer typography.

ParaStudent starts from this gap. The authors study whether LLMs can generate student-like code using real timestamped submissions from UC Berkeley’s CS 61A introductory programming course. The dataset spans four semesters, 5,478 students, 22 assignments, 33 problems, and 689,023 code submissions. To reduce contamination from post-ChatGPT behaviour, the authors use Spring 2021 through Fall 2022 data and exclude more recent semesters.

The study compares two broad strategies. One is prompting: tell a model to act like a student. The other is fine-tuning: adapt a code model on real student submission streams. The tested models are qwen-student, a fine-tuned Qwen-2.5 Coder 7B; qwen-inst, the instruction-tuned Qwen-2.5 Coder 7B; and GPT-4.1 as a proprietary prompted baseline.

This is where the paper becomes useful. Prompting a capable model to “be a student” mostly teaches it to cosplay weakness. Fine-tuning teaches it some of the distributional habits of actual struggle.

The smartest model is too polished to be the learner

The paper’s most operator-relevant result is not that fine-tuning improves performance. That part is familiar. The interesting result is that better code generation can be worse student simulation.

In the high-resolution experiment, the model must predict the next submission in a student’s sequence, conditioned on previous attempts. This is the closest setup to what an AI tutor might face when observing a learner over time. The results are blunt:

Test setting	Real students pass rate	GPT-4.1 pass rate	qwen-inst pass rate	qwen-student pass rate	Interpretation
New students, old problems	9.8%	96.7%	24.6%	10.5%	GPT-4.1 solves too much; qwen-student closely matches student correctness.
New students, new problems	12.1%	100.0%	41.6%	6.3%	qwen-student is still closer than prompted models, but underpredicts student correctness on unseen problems.

This is not a ranking of coding assistants. If the job is to ship working Python, GPT-4.1 is obviously preferable. If the job is to simulate an introductory student’s next attempt, near-perfect correctness is a defect.
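To make "closest to students" concrete, here is a minimal sketch that turns the pass rates in the table above into absolute gaps, in percentage points. The numbers are copied from the table; the function name `pass_rate_gaps` and the dictionary layout are illustrative choices, not anything the paper defines.

```python
# Pass rates (%) copied from the high-resolution table above.
PASS_RATES = {
    "new students, old problems": {
        "students": 9.8, "GPT-4.1": 96.7, "qwen-inst": 24.6, "qwen-student": 10.5,
    },
    "new students, new problems": {
        "students": 12.1, "GPT-4.1": 100.0, "qwen-inst": 41.6, "qwen-student": 6.3,
    },
}


def pass_rate_gaps(rates):
    """Absolute gap, in percentage points, between each model and real students."""
    target = rates["students"]
    return {
        model: round(abs(value - target), 1)
        for model, value in rates.items()
        if model != "students"
    }


for setting, rates in PASS_RATES.items():
    gaps = pass_rate_gaps(rates)
    closest = min(gaps, key=gaps.get)
    print(f"{setting}: {gaps} -> closest: {closest}")
```

Scored this way, the objective is distance from the student baseline rather than distance from 100%, which is why the most capable model ends up furthest away.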

That distinction is easy to miss because most LLM evaluation culture rewards capability. ParaStudent flips the objective. The best model is not the one that knows the answer. It is the one that fails in the right neighbourhood.

The same pattern appears in the low-resolution experiment, where the authors reduce each submission stream to first, middle, and last attempts. On familiar problems, qwen-student achieves the lowest average embedding distance and highest coverage of the student code distribution: 0.058 average KNN distance and 71.9% coverage. The prompted Qwen baseline is worse on both metrics. GPT-4.1 sometimes sits closer on new-problem embedding distance, but that advantage is not the whole story because its functionality profile is far too clean.

In other words, semantic proximity alone can be misleading. A generated program can be “near” student code in embedding space and still be pedagogically unrealistic if it jumps straight to correctness. This is the sort of measurement trap edtech teams should recognise before they build dashboards around a single metric and congratulate themselves into irrelevance.
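For readers who want to see what the two embedding metrics might look like in code, here is a rough sketch. It assumes average KNN distance means the mean cosine distance from each generated embedding to its k nearest real-student embeddings, and coverage means the share of real-student embeddings that have a generated embedding within some radius. Those definitions, the `radius` parameter, and the toy vectors are assumptions for illustration; the paper's exact formulation may differ.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def avg_knn_distance(generated, real, k=1):
    """Mean distance from each generated vector to its k nearest real vectors."""
    totals = []
    for g in generated:
        nearest = sorted(cosine_distance(g, r) for r in real)[:k]
        totals.append(sum(nearest) / len(nearest))
    return sum(totals) / len(totals)


def coverage(generated, real, radius):
    """Fraction of real vectors with at least one generated vector within radius."""
    covered = sum(
        1 for r in real
        if any(cosine_distance(g, r) <= radius for g in generated)
    )
    return covered / len(real)


# Toy 2-D "embeddings" for illustration only (not data from the paper).
real = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
generated = [(1.0, 0.1), (0.9, 1.0)]

print(round(avg_knn_distance(generated, real, k=1), 4))
print(coverage(generated, real, radius=0.05))
```

The toy data shows why the two numbers need to be read together: every generated point sits close to some real point, so the KNN distance is small, yet one region of the real distribution is never reached, so coverage is incomplete.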

ParaStudent measures realism across four dimensions, not one

The paper’s strongest methodological contribution is the evaluation design. The authors do not ask whether generated code passes tests and then call it a day. They measure student-likeness across four dimensions:

Dimension	What the paper measures	Why it matters operationally
Semantics	Code embeddings, cosine similarity, KNN distance, coverage	Checks whether generated code occupies the same broad solution space as real submissions.
Functionality	Autograder pass rate and error type distribution	Captures whether the model makes plausible novice mistakes instead of merely producing correct code.
Style	PEP 8 violations, verbosity, AST depth, AST width, AST nodes, aggregate style score	Distinguishes polished solutions from the awkward structure of novice programs.
Progress	Doctest improvement, style progression, Levenshtein edit distance between attempts	Tests whether generated submissions evolve like learning, not like unrelated samples.

This matters because “student-like” is not a scalar property. A model can match students on style but not errors. It can match errors but not revisions. It can make mistakes, but the wrong kind of mistakes. Randomly breaking correct code is not pedagogy; it is vandalism with an API key.
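The style row in the table above can be approximated with Python's standard `ast` module. The sketch below extracts line count, total AST node count, and statement nesting depth for a single submission. It is not the paper's composite style score: `nesting_depth` is a simplified stand-in for the paper's AST depth measure, and both sample programs are invented for illustration.

```python
import ast


def nesting_depth(node):
    """Maximum nesting of statement nodes (def, if, for, return, ...)."""
    own = 1 if isinstance(node, ast.stmt) else 0
    child_depths = [nesting_depth(c) for c in ast.iter_child_nodes(node)]
    return own + max(child_depths, default=0)


def style_features(source):
    """Rough structural features of one submission.

    Raises SyntaxError if the source cannot be parsed.
    """
    tree = ast.parse(source)
    return {
        "lines": len(source.strip().splitlines()),
        "ast_nodes": sum(1 for _ in ast.walk(tree)),
        "nesting_depth": nesting_depth(tree),
    }


# Two hypothetical submissions with the same behaviour but different shape.
tidy = "def is_even(n):\n    return n % 2 == 0\n"
novice = (
    "def is_even(n):\n"
    "    if n % 2 == 0:\n"
    "        if True:\n"
    "            return True\n"
    "    else:\n"
    "        return False\n"
)

print(style_features(tidy))
print(style_features(novice))
```

Both functions return the same answers, so a pass-rate metric cannot tell them apart. The structural features can, which is the case for treating style as its own dimension. Submissions that fail to parse need separate handling, since `ast.parse` raises on them.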

ParaStudent’s evaluation gives operators a more useful diagnostic vocabulary. If a synthetic student is intended for tutor-agent training, then its value depends on which dimension the tutor must learn from. A feedback model trained on final answers needs a different simulator from a tutor trained to respond after each failed attempt.

That distinction is especially important for coding education. A wrong first attempt can reveal more about a learner than a correct final answer. The model must preserve the intermediate mess because that is where instruction happens.

The evidence says fine-tuning captures the mess, especially on familiar ground

The paper’s evidence is best read as a sequence of increasingly demanding tests.

Experiment 1, the low-resolution setting, is the coarse test. It asks whether models can reproduce student submissions at the first, middle, and last stages. This is the main evidence for distributional alignment. It shows that qwen-student better overlaps with real student embedding distributions on familiar problems, especially at the first and last stages. It also better matches error profiles, while GPT-4.1 produces mostly functional code from the start.

Experiment 2, the high-resolution setting, is the more business-relevant test. It asks whether models can predict the next submission given previous attempts. This is the main evidence for learning-trajectory simulation. Here qwen-student aligns most closely with students across pass rate, PEP 8 violations, style score, and embedding distance.

The appendix then plays three supporting roles:

Appendix component	Likely purpose	What it supports	What it does not prove
No-context results	Sensitivity test	Checks whether patterns hold without student-specific context.	Does not show full deployment readiness because real tutor systems may have richer context.
Fine-tuning ablations across model variants	Ablation / implementation check	Tests whether the phenomenon is tied only to one specific model choice.	Does not replace a broad model-family benchmark; the paper itself says compute limited the comparison.
Different problem sets	Robustness test	Checks whether results persist on alternative, more introductory test problems.	Does not prove generalisation to other courses, languages, institutions, or difficulty levels.

The ablations are useful but should not be inflated into a second thesis. They suggest that fine-tuning can improve student-like alignment across variants, but the main story remains the comparison between fine-tuned qwen-student and prompted baselines under low- and high-resolution settings.

The strongest evidence is still this: qwen-student does not merely add errors. It reproduces a more realistic combination of errors, style, and incremental edits.

On familiar problems in the high-resolution setting, qwen-student’s pass rate is 10.5% against a student average of 9.8%. Its PEP 8 violations average 7.00 against student 6.92. Its style score is 0.70 against student 0.64. Its cosine distance MAE is 0.02, compared with 0.10 for GPT-4.1 and 0.07 for qwen-inst.

That is not just “worse code.” It is better-matched bad code. There is a difference, and product teams should tattoo it somewhere near the evaluation plan.

Style is not decoration; it is part of the learner model

Style metrics may look secondary next to pass rate, but in this paper they carry real signal.

Novice programmers do not merely fail tests. They organise code differently. They may use more verbose constructions, deeper nesting, inconsistent formatting, and non-standard structures. ParaStudent measures this through PEP 8 violations, code length, AST shape, and a composite style score derived from verbosity and AST features.

In the low-resolution final-stage comparison on familiar problems, students average 7.49 PEP 8 violations. qwen-student with context averages 7.18, closer than GPT-4.1 at 5.91 and qwen-inst at 5.60. For style score, students average 0.89; qwen-student reaches 0.33 with context and 0.41 without context, while prompted models remain lower and flatter.

This is imperfect alignment, not magic. But it is directionally important. Prompted models tend to generate code that is too tidy. They may insert errors, but the surrounding code still smells like a competent model intentionally pretending not to know something. Real students are less curated. Their mistakes are embedded in a broader pattern of structure, naming, verbosity, and revision.

That broader pattern is what a tutor system needs to understand. A student who repeatedly produces deeply nested conditionals needs different feedback from a student who understands decomposition but misses edge cases. If the simulator erases style, the downstream tutor may learn to diagnose only correctness. That is how one builds an expensive hint machine and mistakes it for teaching.

Progression is the business-critical layer

The most important part of ParaStudent is not that it can generate a plausible wrong answer. It is that it tries to model change.

In the high-resolution experiment, the authors examine pass-rate progression, style progression, and edit distance across normalised submission steps. qwen-student most closely tracks the student curve. GPT-4.1 stays near perfect. qwen-inst remains comparatively static. qwen-student makes smaller, more incremental edits, which is closer to real student behaviour.

That incremental pattern is operationally valuable because tutoring is sequential. A real learning system must decide whether a student is stuck, exploring, regressing, copying, or converging. Static snapshots are not enough.
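Edit size between consecutive attempts is straightforward to measure. The sketch below implements character-level Levenshtein distance in plain Python and applies it to a short, hypothetical submission stream. The article does not say whether the paper computes the distance over characters, tokens, or lines, so character level is an assumption here.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def edit_sizes(attempts):
    """Edit distance between each pair of consecutive attempts."""
    return [levenshtein(x, y) for x, y in zip(attempts, attempts[1:])]


# Hypothetical submission stream for illustration only.
attempts = [
    "def total(xs):\n    return sum(x)\n",
    "def total(xs):\n    return sum(xs)\n",
    "def total(xs):\n    return sum(xs) if xs else 0\n",
]
print(edit_sizes(attempts))
```

A stream of small edits like this one looks like a learner patching a single idea. A simulator that rewrites the whole function at every step would produce much larger distances, even if each individual attempt looked plausible on its own.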

A useful student simulator should therefore produce sequences like this:

The first attempt reflects a partial understanding.
The next attempt changes something locally, sometimes improving, sometimes not.
The code gradually moves toward functionality.
The student’s stylistic habits persist across attempts.
The revision size itself carries information.

ParaStudent gets closer to that behaviour than prompted baselines. But the setup should be read carefully. In Experiment 2, models predict the next submission conditioned on ground-truth prior attempts. The paper is not claiming that qwen-student can autonomously roll out a full semester of realistic student behaviour with minimal supervision. That harder autoregressive simulation is left for future work.
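The difference between the two regimes is easier to see side by side. In the sketch below, `next_step_predictions` mirrors the Experiment 2 setup, where every prediction is conditioned on real prior attempts, while `autoregressive_rollout` feeds the model's own outputs back in as history. The `Predictor` interface and `toy_predictor` are hypothetical stand-ins, not the paper's code or any real model API.

```python
from typing import Callable, List

# Hypothetical interface: a list of prior attempts in, the next attempt out.
Predictor = Callable[[List[str]], str]


def next_step_predictions(real_stream: List[str], predict_next: Predictor) -> List[str]:
    """Experiment-2-style setup: every prediction sees only REAL prior attempts."""
    return [predict_next(real_stream[:i]) for i in range(1, len(real_stream))]


def autoregressive_rollout(first_attempt: str, predict_next: Predictor, steps: int) -> List[str]:
    """Harder setup: each prediction becomes history, so errors can compound."""
    history = [first_attempt]
    for _ in range(steps):
        history.append(predict_next(history))
    return history


def toy_predictor(history: List[str]) -> str:
    """Stand-in for a model: appends a marker to the latest attempt."""
    return history[-1] + "'"


real = ["a", "b", "c"]
print(next_step_predictions(real, toy_predictor))
print(autoregressive_rollout("a", toy_predictor, steps=2))
```

In the first function, a bad prediction at one step cannot contaminate the next, because the real attempt replaces it in the history. In the second, nothing corrects drift, which is why strong next-step results do not automatically carry over to full-trajectory generation.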

For operators, this means the near-term product use is not “generate unlimited fake learners.” The stronger use case is controlled simulation around observed or seeded learner histories: training tutor responses, testing feedback policies, filling benchmark gaps, and evaluating whether a diagnostic model can respond to plausible intermediate attempts.

The business value is tutor rehearsal, not answer generation

The practical value of ParaStudent sits in a narrow but important lane.

Most coding-assistant businesses optimise for fewer errors. Edtech platforms, by contrast, sometimes need realistic errors. Not because errors are desirable, but because teaching systems must learn to respond to them.

The business pathway looks like this:

What the paper directly shows	Cognaptus business inference	Remaining uncertainty
Fine-tuned models better match student code trajectories than prompted baselines in this course setting.	Edtech firms can use fine-tuned simulators to create more realistic training and evaluation data for tutor agents.	Transfer to other courses, languages, institutions, and age groups remains unproven.
GPT-4.1 overpredicts correctness and produces overly polished outputs for student simulation.	General-purpose frontier models may be poor default synthetic-student engines unless constrained or adapted.	Stronger prompting, tool constraints, or reasoning-model comparisons may change the balance.
Multi-dimensional metrics reveal differences hidden by pass rate alone.	Tutor-platform evaluation should include style, error type, and edit progression, not just solution correctness.	Which metrics best predict real learning outcomes is still an open product question.
qwen-student better captures incremental edits in next-submission prediction.	Simulators could support intervention testing: “What feedback should the tutor give after this kind of attempt?”	The paper does not prove that tutors trained with this data improve actual student outcomes.
Realistic student-code generation creates misuse and privacy risks.	Deployment requires governance: anonymisation, access controls, anti-cheating design, and careful output policies.	Privacy-preserving fine-tuning and misuse-resistant release strategies need further work.

For assessment vendors, this can support benchmark generation. If a grading or feedback model only sees clean solutions and obvious failures, it will perform badly on the middle zone where real students live. Synthetic student submissions can expand that middle zone.

For AI tutor companies, it can support rehearsal. A tutor agent can be tested against plausible student trajectories before being exposed to actual learners. The question becomes: does the tutor give useful feedback after a logical error? Does it over-help? Does it reveal the answer? Does it adapt when the next attempt improves only slightly?

For universities and training providers, the value may be safer experimentation. Real student data is sensitive, sparse in some courses, and institution-specific. A calibrated simulator could allow early-stage testing of feedback policies without immediately touching live classrooms.

But the word “calibrated” is doing work. Synthetic data is not automatically useful because it is synthetic. It is useful only if it preserves the behavioural structure relevant to the downstream decision.

The misconception: prompting can imitate struggle

The paper pushes against a common assumption: a sufficiently strong LLM can simulate a student if the prompt is explicit enough.

The prompt can say: “You are a student in an introductory Python course.” It can request imperfect code. It can ask for mistakes. It can include prior attempts. Yet the prompted models still tend to produce outputs that are too correct, too stable, or too polished.

This should not surprise anyone who has watched instruction-tuned models behave. They are trained to be helpful, concise, and correct. Those traits are excellent for productivity and awkward for simulating a learner who misunderstands recursion on a Tuesday night.

Fine-tuning changes the model’s default behaviour by exposing it to actual submission distributions. It learns not only that students make mistakes, but which mistakes appear with which problem types, how attempts evolve, how style persists, and how code changes between submissions.

That does not mean prompting is useless. It means prompting alone may be the wrong control surface for behavioural realism. If the desired output is a structured human process rather than a single answer, the model needs examples of the process.

This is a broader lesson beyond education. Many enterprise AI systems do not need the model to be maximally expert. They need it to model a role, a workflow, a failure mode, or a user population. In those settings, “more capable” can quietly mean “less representative.” Annoying, but reality often is.

Where the result applies, and where it should not be stretched

ParaStudent is promising, but its scope is deliberately tight.

First, the data comes from one introductory programming course at one institution. CS 61A is a serious, well-instrumented course with large-scale autograder logs. That is excellent for research. It is not proof that the method transfers cleanly to a bootcamp, a high-school Java class, an enterprise SQL course, or a data-science notebook environment.

Second, old-problem and new-problem results behave differently. On familiar problems, qwen-student is consistently strong. On new problems, alignment weakens. In low-resolution results, GPT-4.1 can be competitive or better on some embedding proximity measures. In high-resolution results, qwen-student remains closer across the summary metrics, but it underpredicts correctness on new problems, with a 6.3% pass rate against students’ 12.1%.

Third, the high-resolution experiment uses ground-truth prior attempts. That is a strong supervision regime. It tests next-step realism, not open-ended simulation of a complete learner history. For many practical tutoring systems, next-step realism is already valuable. But it is not the same as generating a fully autonomous synthetic student over an entire course.

Fourth, privacy is not solved by fine-tuning. The authors explicitly note that their standard LoRA fine-tuning setup does not provide privacy guarantees. Any real deployment would need careful anonymisation, access controls, and possibly privacy-preserving fine-tuning. In education, “we trained on student data” is not a casual sentence. It is a governance meeting.

Finally, there is an academic-integrity risk. A model that can produce realistic student-like code could be misused to generate plausible homework submissions. The danger is not that it writes perfect code. The danger is that it writes imperfect code with believable fingerprints. Cheating, like software, also has product-market fit.

What edtech teams should do differently

The immediate lesson for operators is not “fine-tune Qwen and call it a strategy.” The lesson is to evaluate learner simulation as a behavioural modelling problem.

A reasonable evaluation checklist would look like this:

Product question	Better evaluation target
Can the simulator produce wrong code?	Does it reproduce the right error-type distribution for the problem and stage?
Can it generate varied submissions?	Does it cover the real student code distribution in embedding space?
Does it look like student code?	Does it match style features such as verbosity, AST structure, and formatting irregularity?
Can it model learning?	Do pass rate, style score, and edit distance evolve like real submission streams?
Can it generalise?	Does performance hold for new students, new problems, and easier or harder problem sets?
Can it be deployed safely?	Are privacy, misuse, and access-control risks addressed before release?

This changes what “good” means. A tutor-training simulator should not be evaluated by how often it solves the task. It should be evaluated by whether it creates realistic instructional moments.
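As one example of an evaluation target from the table above, the error-type question can be turned into a single number by comparing label distributions. The sketch below uses total variation distance, chosen here for simplicity and not taken from the paper. The error labels and counts are hypothetical.

```python
from collections import Counter


def distribution(labels):
    """Normalise a list of error-type labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 means identical, 1 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical labels for illustration only (not data from the paper).
students = ["logic"] * 6 + ["runtime"] * 3 + ["syntax"] * 1
simulator = ["logic"] * 5 + ["runtime"] * 3 + ["syntax"] * 2
too_clean = ["pass"] * 9 + ["logic"] * 1

print(round(total_variation(distribution(students), distribution(simulator)), 2))
print(round(total_variation(distribution(students), distribution(too_clean)), 2))
```

A simulator that mostly passes scores far from the student distribution even though its code is "better," which is exactly the inversion the evaluation plan needs to capture. In practice the comparison would be run per problem and per stage, as the table suggests.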

That phrase matters: instructional moments. The simulator’s job is not to be a student. Its job is to produce situations in which a tutor, grader, or feedback model can be tested. It is a rehearsal partner.

For Cognaptus-style automation strategy, the implication is broader. Automation does not always mean making the expert faster. Sometimes it means manufacturing the messy edge cases that allow the expert system to become reliable. In education, the edge case is not rare. It is the student.

The real lesson is about modelling process, not lowering quality

ParaStudent is easy to misread as a paper about making LLMs worse at coding. That is not the point.

The point is that realism and competence are different objectives. A model that produces flawless code may be a useful assistant and a bad synthetic learner. A model that produces broken code may still be useless if the brokenness is random. ParaStudent’s contribution is to show that fine-tuning on real student trajectories can move a model toward the specific kind of imperfection that teaching systems need to understand.

That is a more mature view of AI in education. The goal is not to paste a chatbot onto a course and hope the branding department says “personalised.” The goal is to model the learner’s path with enough fidelity that feedback can be tested, tutors can be trained, and interventions can be improved before they reach real students.

The paper does not solve AI tutoring. It does something more modest and more useful: it reminds us that learning is not a final answer. It is a sequence of attempts, each carrying evidence about what the student currently understands.

A model that can represent that sequence is not just generating code. It is learning to struggle on command.

That sounds inefficient. For education, it may be the whole point.

Cognaptus: Automate the Present, Incubate the Future.

¹ Mihran Miroyan, Rose Niousha, Joseph E. Gonzalez, Gireeja Ranade, and Narges Norouzi, “ParaStudent: Generating and Evaluating Realistic Student Code by Teaching LLMs to Struggle,” arXiv:2507.12674v2, 2025, https://arxiv.org/abs/2507.12674.
