School of Thought: How Fine-Tuned Open LLMs Are Challenging the Giants in Education

TL;DR for operators

A useful AI education product does not always need the largest model in the room. Sometimes it needs a smaller model that has been taught one job properly and then told, firmly, not to hand students the answer on a silver platter.

The paper behind this article studies exactly that: whether supervised fine-tuning can make open-source models good enough to explain C programming errors for novice students. The authors use real CS1/2 error logs from DCC Help, generate 40,000 structured explanations with GPT-4.1, fine-tune Qwen3-4B, Llama-3.1-8B, and Qwen3-32B using QLoRA, then compare them against base models, GPT-4.1, and the original deployed DCC Help responses.

The headline is not “open models beat GPT-4.1”. They generally do not. GPT-4.1 still has the best expert mean rank: 3.09, compared with 4.12 for SFT-Qwen-4B and 4.27 for SFT-Llama-8B. But the fine-tuned small models improve sharply over their base versions, outperform the existing DCC Help baseline, and get close enough on several expert-rated criteria that the operational question changes.

The old question was: “Can we afford to use frontier APIs everywhere?” The better question is: “Which tasks are repetitive, structured, private, and valuable enough to deserve a specialised local model?”

For edtech vendors, universities, and enterprise learning teams, that is the practical takeaway. A compact open model can be viable when the task is narrow, the data is real, the response format is constrained, and the evaluation rubric reflects the actual teaching goal. The fine print is not decorative. This evidence does not prove that small open models can replace broad tutoring systems, improve grades, or generalise to every programming language and classroom. It proves something narrower—and therefore more useful.

The compiler error is the product surface

Programming education has a wonderfully cruel user interface: the compiler error. A novice writes code, presses run, and receives a message that may be technically precise while still sounding like a tiny legal notice from a machine that has never met a beginner.

That makes compiler-error explanation a natural test case for educational AI. The student has a concrete problem. The model receives code, error context, and sometimes runtime state. The desired response is not an essay, not a full solution, and not a philosophical meditation on semicolons. It should clarify the error, identify plausible causes, and give a hint that leaves the student with some thinking still to do.

That last clause matters. In education, the best answer is not always the most complete answer. A model that immediately provides the fixed code may look helpful in a demo and be pedagogically useless in practice. Very on brand for AI: maximising the metric right up to the point where the human stops learning.

The paper, “Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools”, asks whether smaller open-source models can be specialised for this exact job.¹ It is a comparison paper, and it should be read that way. The interesting story is not model size in isolation. It is the ladder of alternatives:

Operational choice	What it means in this paper	Why it matters
Use an existing deployed tool	Original DCC Help responses generated by GPT-3.5 Turbo or GPT-4o mini	Practical baseline, not a theoretical strawman
Use a frontier proprietary model	GPT-4.1	Strong quality benchmark, but with API cost, privacy, and dependency concerns
Use a base open model	Qwen3-4B, Llama-3.1-8B, Qwen3-32B without fine-tuning	Tests whether open models work out of the box
Fine-tune an open model	SFT versions of Qwen3-4B, Llama-3.1-8B, Qwen3-32B	Tests whether specialisation narrows the quality gap

This is the right frame for operators because procurement decisions are rarely “Which model is smartest in general?” They are usually “Which system is good enough for this workflow, at this cost, under these privacy and reliability constraints?”

The paper does not train a tutor; it trains a feedback appliance

The authors build their dataset from DCC Help usage at a large Australian university across five teaching periods between September 2023 and February 2025. The raw logs are substantial: about 180,000 compile-time invocations and 50,000 run-time invocations. Each example includes student code, error context, and the original DCC Help response. The data is pre-processed and redacted to remove identifying features where possible.

From the first two teaching sessions, they sample 40,000 examples for training. This sampling is not simply random across the whole pool. They cap the number of compile-time and run-time examples per teaching week before sampling, because student errors are not evenly distributed. Assignment deadlines have a way of producing both panic and data skew. The resulting training set has roughly a 3:1 ratio of compile-time to run-time examples, with total example lengths ranging from 300 to 4,000 tokens and a mean of 849 tokens.
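The capping step can be sketched as follows. This is a minimal illustration, not the authors' code: the field names `week` and `kind`, the cap of 60, and the toy log are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def capped_sample(examples, cap_per_week, n_total, seed=0):
    """Cap each (week, kind) bucket, then draw up to n_total examples at random."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[(ex["week"], ex["kind"])].append(ex)

    pool = []
    for key in sorted(buckets):
        items = buckets[key]
        rng.shuffle(items)
        pool.extend(items[:cap_per_week])  # keep at most cap_per_week per bucket

    rng.shuffle(pool)
    return pool[:n_total]

# Toy log (hypothetical): week 2 is a deadline week with a spike of compile errors.
logs = (
    [{"week": 1, "kind": "compile", "id": i} for i in range(50)]
    + [{"week": 2, "kind": "compile", "id": i} for i in range(500)]
    + [{"week": 2, "kind": "runtime", "id": i} for i in range(80)]
)
sample = capped_sample(logs, cap_per_week=60, n_total=120)
per_bucket = Counter((ex["week"], ex["kind"]) for ex in sample)
```

Without the cap, the deadline week would supply roughly 80% of a uniform random sample; with it, no single week-and-kind bucket can contribute more than 60 examples.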

The training labels are generated by GPT-4.1 with temperature set to 0. This is important. The smaller models are not learning from human-written explanations. They are learning from a structured GPT-4.1 teacher. The prompt asks for three parts:

1. a short, jargon-free clarification of the error message;
2. one or two short sentences about potential causes;
3. one or two short sentences of debugging guidance, without giving the solution in code.
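A team adopting the same three-part format might want a cheap structural check before a response reaches a student. The sketch below is one hypothetical way to do that; the function names, the sentence splitter, and the code-detection heuristics are assumptions for illustration and do not come from the paper.

```python
import re

# Rough heuristics for "this looks like C code": fenced blocks, lines ending in
# a semicolon, or lines starting with common C constructs. Deliberately crude.
CODE_HINTS = re.compile(
    r"```|;\s*$|^\s*(#include|int |for\s*\(|while\s*\()", re.MULTILINE
)

def count_sentences(text):
    """Count sentences by splitting after '.', '!' or '?' followed by whitespace."""
    return len([s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s])

def check_response(clarification, causes, guidance):
    """Return a list of format problems; an empty list means the shape looks right."""
    problems = []
    if not clarification.strip():
        problems.append("missing clarification")
    if not 1 <= count_sentences(causes) <= 2:
        problems.append("causes should be one or two sentences")
    if not 1 <= count_sentences(guidance) <= 2:
        problems.append("guidance should be one or two sentences")
    if CODE_HINTS.search(guidance):
        problems.append("guidance appears to contain code")
    return problems

good = check_response(
    "The compiler expected a semicolon.",
    "A statement may be missing its terminator. It could also be a stray character.",
    "Look at the line just before the one reported. Compare it with other statements.",
)
bad = check_response(
    "The compiler expected a semicolon.",
    "A statement is missing its terminator.",
    "Change it to:\nint x = 5;",
)
```

A check like this only catches shape violations. It says nothing about whether the explanation is correct or pedagogically sound, which is what the paper's rubric is for.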

That format is the product design hiding inside the method section. The paper is not just “fine-tune a model and hope”. It constrains the response into a pedagogical pattern: explain enough, guide enough, stop before solving the exercise.

The fine-tuned models are Qwen3-4B, Llama-3.1-8B, and Qwen3-32B. All are trained for one epoch at a learning rate of $2 \times 10^{-5}$ using QLoRA through the Unsloth fine-tuning API on a single Nvidia A100 GPU with 80GB VRAM. QLoRA matters operationally because it freezes the base model, quantises it to 4-bit, and trains adapter parameters. Translation: the expensive part stays mostly intact, and the specialised behaviour is added in a lighter-weight way. Less glamorous than “AGI tutor”, much more likely to survive a budget meeting.
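To make "lighter-weight" concrete, here is a back-of-envelope sketch of the arithmetic behind low-rank adapters and 4-bit weights. The layer width of 4096, the rank of 16, and the round 8-billion parameter count are illustrative assumptions, not values reported in the article, and the memory estimate ignores quantisation constants, activations, and optimiser state.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def full_params(d_in, d_out):
    """Parameters in the frozen dense weight matrix the adapter sits beside."""
    return d_in * d_out

def approx_weight_gb(n_params, bits):
    """Approximate storage for the weights alone at a given bit width."""
    return n_params * bits / 8 / 1e9

d, r = 4096, 16                       # hypothetical layer width and adapter rank
adapter = lora_params(d, d, r)        # 131072 trainable parameters
frozen = full_params(d, d)            # 16777216 frozen parameters
ratio = adapter / frozen              # 0.0078125, i.e. under 1% of the layer

gb_4bit = approx_weight_gb(8_000_000_000, 4)    # 4.0 GB for a nominal 8B model
gb_16bit = approx_weight_gb(8_000_000_000, 16)  # 16.0 GB at 16-bit precision
```

Under these assumptions, each adapted layer trains under one percent as many parameters as it freezes, and the frozen weights take roughly a quarter of their 16-bit footprint.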

The evaluation uses the remaining three teaching periods. The authors construct an 8,000-example evaluation set: about 5,600 compile-time and 2,400 run-time examples. Each example receives responses from eight systems: the original DCC Help response, GPT-4.1, three base open models, and three fine-tuned open models. That produces 64,000 model responses.

The paper then evaluates at two scales. Human experts assess a smaller sample, while an LLM-as-judge ensemble evaluates the full set. This split is sensible. Human evaluation is expensive but grounded. LLM judging is scalable but bias-prone, and the paper is refreshingly explicit about that nuisance rather than pretending the robot jury has descended from Mount Objectivity.

How to read the evidence without over-reading it

The paper contains several kinds of evidence, and they do different jobs. Mixing them together produces the usual AI-reading fog: everything sounds impressive, and somehow nothing is actionable.

Evidence component	Likely purpose	What it supports	What it does not prove
Expert ranking of 100 examples	Main comparative evidence	Whether humans prefer fine-tuned responses over baselines	Real classroom learning gains
Rubric-based expert annotations	Main quality evidence	Which pedagogical qualities improve: correctness, clarity, no overhelp, etc.	Universal quality across courses or languages
LLM-as-judge evaluation over 64,000 responses	Scaled comparison and sensitivity check	Broad relative patterns across many examples	Absolute pedagogical truth
Gwet’s AC1 reliability checks	Robustness check for evaluation trustworthiness	Which metrics are more or less reliable	That weak-agreement metrics are settled
Base vs SFT comparison	Ablation-like evidence for fine-tuning effect	Whether training changed model behaviour usefully	Why each individual improvement occurred internally
DCC Help comparison	Deployment baseline comparison	Whether SFT models can beat an existing real tool	Whether they beat all commercial systems
GPT-4.1 comparison	Frontier benchmark comparison	How close specialised open models get to a strong proprietary model	That open models are general replacements for GPT-4.1

This matters because the paper’s strongest operational claim is not the same as its most eye-catching one.

The eye-catching claim is that SFT-Qwen-4B performs within 0.10 of GPT-4.1 on all expert-judged metrics and has a mean rank only 1.03 worse. That is impressive.

The stronger operational claim is that SFT-Qwen-4B and SFT-Llama-8B beat their base versions and the deployed DCC Help baseline by enough to justify treating fine-tuned small models as practical replacements for a narrow feedback tool. That is not just impressive. That is potentially useful.

GPT-4.1 still wins the beauty contest

Let us remove the easiest misunderstanding first. The fine-tuned open models do not generally beat GPT-4.1.

In the expert ranking table, GPT-4.1 has the best mean rank at 3.09. SFT-Qwen-32B follows at 3.60, base Qwen-32B at 3.62, SFT-Qwen-4B at 4.12, and SFT-Llama-8B at 4.27. The base Qwen-4B, original DCC Help, and base Llama-8B trail behind at 5.50, 5.67, and 6.13 respectively.

The win-rate comparisons tell the same story. Against GPT-4.1, the fine-tuned models are usually still behind:

Fine-tuned model	Win-rate vs GPT-4.1, compile-time	Win-rate vs GPT-4.1, run-time
SFT-Qwen-4B	0.34	0.36
SFT-Llama-8B	0.36	0.44
SFT-Qwen-32B	0.36	0.46

A win-rate below 0.5 means experts preferred GPT-4.1 more often. So no, this is not a clean “small open model dethrones frontier model” story. Anyone selling it that way has probably also discovered a revolutionary new way to relabel a bar chart.
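For readers who want to reproduce this kind of summary on their own evaluation data, the sketch below derives mean rank and pairwise win-rate from per-example rankings. The rankings are toy data, the system names are placeholders, and counting a tie as half a win is an assumption; the article does not say how the paper handles ties.

```python
from statistics import mean

def mean_ranks(rankings):
    """Average rank per system across examples (1 = best)."""
    systems = rankings[0].keys()
    return {s: mean(r[s] for r in rankings) for s in systems}

def win_rate(rankings, a, b):
    """Share of examples where system a is ranked better than system b.

    Ties count as half a win for each side (an assumption for this sketch).
    """
    wins = sum(1 for r in rankings if r[a] < r[b])
    ties = sum(1 for r in rankings if r[a] == r[b])
    return (wins + 0.5 * ties) / len(rankings)

# Toy expert rankings for four examples and three hypothetical systems.
rankings = [
    {"frontier": 1, "sft": 2, "base": 3},
    {"frontier": 2, "sft": 1, "base": 3},
    {"frontier": 1, "sft": 3, "base": 2},
    {"frontier": 1, "sft": 2, "base": 3},
]
ranks = mean_ranks(rankings)
sft_vs_frontier = win_rate(rankings, "sft", "frontier")
sft_vs_base = win_rate(rankings, "sft", "base")
```

In this toy set the fine-tuned system loses to the frontier system three times out of four but beats its base version three times out of four, which is the same qualitative shape as the paper's results.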

But GPT-4.1’s victory is not the end of the analysis. It is the beginning of the procurement question. A frontier model can be better and still not be the right default for every embedded educational interaction. The paper’s value is in showing how much of the performance gap can be closed when the task is narrow and the model is specialised.

The smaller models learn the job

The most important comparison is not SFT model versus GPT-4.1. It is SFT model versus its own base version. That isolates the practical effect of supervised fine-tuning.

Here the small models do well. SFT-Llama-8B is preferred over base Llama-8B in 74% of compile-time cases and 72% of run-time cases. SFT-Qwen-4B is preferred over base Qwen-4B in 70% of compile-time cases and 58% of run-time cases. The run-time result for Qwen-4B is less dramatic, but still above chance.

The expert rubric shows the same direction. For SFT-Qwen-4B, expert-rated correctness rises to 0.89, selectivity to 0.84, completeness to 0.77, clarity to 0.76, novice appropriateness to 0.82, no-overhelp to 0.92, and Socratic guidance to 0.65. GPT-4.1 remains stronger on several criteria, but the fine-tuned 4B model becomes recognisably competent across the rubric.

For SFT-Llama-8B, the pattern is similar: correctness 0.88, selectivity 0.78, completeness 0.71, clarity 0.72, novice appropriateness 0.79, no-solution 0.85, no-overhelp 0.87, and Socratic guidance 0.56.

The business interpretation is straightforward. Fine-tuning is not merely polishing the wording. It appears to move the small models into a different operational class. Before fine-tuning, base Llama-8B sits at the bottom of the expert ranking with a mean rank of 6.13. After fine-tuning, SFT-Llama-8B reaches 4.27. Base Qwen-4B has a mean rank of 5.50. After fine-tuning, SFT-Qwen-4B reaches 4.12.

That is the difference between “interesting toy” and “candidate component”. Not a universal component. Not a magical teaching assistant. A candidate component.

The deployed baseline is the quiet casualty

The paper’s most commercially relevant comparison may be the least glamorous one: fine-tuned models versus the existing DCC Help baseline.

All three SFT models are preferred over DCC Help by experts. SFT-Qwen-4B beats DCC Help in 68% of compile-time cases and 78% of run-time cases. SFT-Llama-8B beats it in 62% and 74%. SFT-Qwen-32B beats it in 70% and 82%.

That matters because DCC Help is not a strawman. It is an existing compiler-integrated tool whose historical responses were generated by GPT-3.5 Turbo or GPT-4o mini. In operational terms, the paper is not saying, “A toy open model beat a toy baseline.” It is saying, “A specialised open model can outperform the kind of deployed AI feedback system institutions might actually be using.”

The expert metric comparison reinforces this. The fine-tuned models improve over DCC Help on nearly all metrics, with the authors specifically highlighting gains of at least 20% in clarity and no-overhelp. The exception is LLM-judged completeness, where DCC Help scores better. But that exception is not especially stable. Expert–LLM agreement on completeness is extremely weak, with AC1 at 0.05, suggesting that the LLM-judge process may be measuring something unreliable for that criterion.

This is one of those moments where the methods section saves the results section from becoming a slogan. The paper does not simply report “completeness is worse” and leave the reader to panic. It shows that the measurement itself is suspect. Completeness is a tricky educational criterion: a response can be complete enough for learning without being exhaustive, and exhaustive enough to become counterproductive. Machines, like some committee members, can struggle with that distinction.

The 32B model complicates the simple scaling story

If the article were only about small models getting better through fine-tuning, it would be neat. The Qwen3-32B result makes it messier, and therefore more interesting.

Qwen3-32B is already strong without fine-tuning. In expert ranking, base Qwen-32B has a mean rank of 3.62, essentially tied with SFT-Qwen-32B at 3.60. In direct win-rate comparisons, SFT-Qwen-32B is not clearly preferred over base Qwen-32B: it wins only 46% of compile-time and 46% of run-time comparisons against its base counterpart.

This suggests a ceiling effect for supervised fine-tuning, at least under this setup. A strong enough base model may already possess much of the task capability, and one epoch of SFT on GPT-4.1-style outputs may not add much. It may even trade off some desirable behaviours depending on how the judging is done.

But the 32B result has another operational implication. The authors argue that Qwen3-32B’s strong performance suggests it could potentially replace GPT-4.1 as the generator of future training datasets. That would matter for privacy. If an institution can use a strong open model to generate labels internally, it reduces reliance on sending sensitive student code and context to a third-party API during dataset construction.

That is an inference worth treating carefully. The paper shows Qwen3-32B performs competitively in this setting. It does not fully test a pipeline where Qwen3-32B generates the training labels and smaller models learn from those labels. But the direction is plausible: frontier APIs may be needed less as the “teacher model” once strong open models are good enough for the labelling stage.

The judge panel is useful, not holy

The paper uses a three-model LLM-as-judge ensemble: GPT-4.1, Gemini-2.5-Flash, and Qwen3-32B. Each judge provides binary decisions across the eight rubric criteria, and the final verdict requires unanimity. This is a conservative design: a criterion counts as satisfied only if all three judges mark it as satisfied.
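The unanimity rule is simple to express in code. The sketch below contrasts it with a majority vote to show why it is the stricter choice; the judge names, criterion keys, and votes are placeholders rather than data from the paper.

```python
def unanimous_verdict(judge_votes):
    """A criterion passes only if every judge marked it True."""
    criteria = next(iter(judge_votes.values())).keys()
    return {c: all(votes[c] for votes in judge_votes.values()) for c in criteria}

def majority_verdict(judge_votes):
    """A criterion passes if more than half of the judges marked it True."""
    criteria = next(iter(judge_votes.values())).keys()
    n = len(judge_votes)
    return {
        c: sum(votes[c] for votes in judge_votes.values()) * 2 > n
        for c in criteria
    }

# Hypothetical votes from three judges on two rubric criteria.
votes = {
    "judge_a": {"correctness": True, "no_solution": True},
    "judge_b": {"correctness": True, "no_solution": False},
    "judge_c": {"correctness": True, "no_solution": True},
}
strict = unanimous_verdict(votes)
lenient = majority_verdict(votes)
```

Here a single dissenting judge is enough to fail `no_solution` under unanimity, while a majority vote would have passed it.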

The authors also validate agreement between expert and LLM annotations using Gwet’s AC1. The agreement pattern is informative:

Criterion	Expert agreement	Expert–LLM agreement	Practical reading
Correctness	0.84	0.75	Relatively trustworthy
Selectivity	0.56	0.57	Moderately useful
Completeness	0.56	0.05	LLM-judge result should be treated cautiously
Clarity	0.20	0.46	Human agreement itself is weak
Novice appropriate	0.62	0.71	Reasonably usable
No solution	0.50	0.74	Useful, especially for detecting direct answers
No overhelp	0.72	0.71	Stronger signal
Socratic	-0.12	0.08	Do not over-interpret
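For readers unfamiliar with Gwet's AC1, here is a minimal sketch of the statistic for the simplest case: two raters and a binary label. It uses the standard definitions, with chance agreement $2\pi(1-\pi)$ where $\pi$ is the average proportion of positive labels. The paper's setup may involve more raters, and the ratings below are invented to show how AC1 behaves when almost every response passes a criterion.

```python
def gwet_ac1(r1, r2):
    """Gwet's AC1 for two raters and binary (0/1) labels."""
    n = len(r1)
    pa = sum(a == b for a, b in zip(r1, r2)) / n   # observed agreement
    pi = (sum(r1) / n + sum(r2) / n) / 2           # mean prevalence of label 1
    pe = 2 * pi * (1 - pi)                         # chance agreement under AC1
    return (pa - pe) / (1 - pe)

def cohen_kappa(r1, r2):
    """Cohen's kappa for the same data, shown only for contrast."""
    n = len(r1)
    pa = sum(a == b for a, b in zip(r1, r2)) / n
    p1, p2 = sum(r1) / n, sum(r2) / n
    pe = p1 * p2 + (1 - p1) * (1 - p2)
    return (pa - pe) / (1 - pe)

# Invented ratings: both raters pass 9 of 10 responses and disagree on two.
rater_1 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
rater_2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
ac1 = gwet_ac1(rater_1, rater_2)       # about 0.76
kappa = cohen_kappa(rater_1, rater_2)  # about -0.11
```

With 80% raw agreement and heavily skewed labels, kappa goes negative while AC1 stays near 0.76. That sensitivity to prevalence is a common reason for preferring AC1 on pass/fail rubrics where most responses pass.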

The table is a quiet methodological gift. It tells operators which metrics can support decisions and which are still wobbly.

Correctness, novice appropriateness, no-solution, and no-overhelp look more stable. Completeness and Socratic guidance are much more fragile. That does not make them unimportant. It means they are harder to measure reliably. Any product team that blindly optimises for an LLM-judged Socratic score should probably be asked to sit quietly and reflect on what “Socratic” means.

The LLM-judge design also has self-bias risks. GPT-4.1 generated the training dataset and served as one of the judges. GPT-4.1 and Qwen3-32B also appeared both as candidate systems and judges. The authors try to mitigate this with a diverse ensemble and unanimity, but they acknowledge that the higher LLM-judged performance of GPT-4.1 and base Qwen-32B suggests self-bias may remain.

That limitation does not invalidate the paper. It narrows what should be trusted. The expert rankings and expert rubric scores carry more weight for the core claim. The LLM-judge results help scale the comparison, but they should not be treated as a final court of pedagogical appeal.

The business value is controlled specialisation, not open-source ideology

There is a lazy version of the open-source AI argument: open models are cheaper, therefore use open models. That is not wrong, exactly. It is just undercooked.

The stronger business argument is controlled specialisation. This paper shows a pattern that many applied AI teams should recognise:

1. Find a high-volume, repetitive interaction.
2. Capture real user context, not synthetic toy examples.
3. Define a response structure that encodes the service goal.
4. Generate or curate high-quality target responses.
5. Fine-tune a compact model using parameter-efficient methods.
6. Evaluate against a rubric that reflects actual user value.
7. Keep the frontier model as a benchmark, not necessarily as the production default.

For education, the high-volume interaction is novice debugging. For enterprise learning, it could be explaining policy violations in training simulations, guiding employees through internal tools, or giving structured feedback on repetitive procedural tasks. For developer platforms, it could be build errors, test failures, or configuration mistakes. The pattern travels further than the specific C compiler setting, but only if the target task shares the same properties: narrow, structured, repetitive, and evaluable.

The ROI logic is not just inference cost. It is also privacy, availability, latency, and behavioural stability. A local or institution-controlled model avoids sending sensitive student code to a third-party service. It is less exposed to rate limits, vendor outages, and sudden model behaviour changes. It can be versioned, audited, and adapted to institutional teaching norms. None of that makes the model smarter. It makes the system more governable. In production, governable often beats dazzling.

Still, operators should not confuse “fine-tuned on real logs” with “automatically safe for students”. The response format in this paper does heavy work. The instruction to avoid code solutions, keep language short, and address the student directly is part of the pedagogical design. The model is not merely smaller. It is boxed into the right kind of helpfulness.

What the paper directly shows, what Cognaptus infers, and what remains uncertain

A clean operator reading needs three separate layers: what the paper directly shows, what Cognaptus infers from it, and what remains uncertain.

Layer	Claim	Status
What the paper directly shows	SFT improves Qwen3-4B and Llama-3.1-8B for C programming-error explanations compared with their base models	Directly supported
What the paper directly shows	Fine-tuned models outperform the deployed DCC Help baseline on expert rankings and nearly all rubric metrics	Directly supported
What the paper directly shows	GPT-4.1 remains the strongest overall expert-ranked model	Directly supported
What the paper directly shows	Qwen3-32B gains little from SFT under this setup	Directly supported
Cognaptus inference	Specialised open models can replace frontier APIs for some narrow educational feedback workflows	Reasonable, with validation
Cognaptus inference	Strong open models may increasingly act as internal teacher models for dataset generation	Plausible, not fully tested here
Still uncertain	Whether these systems improve learning outcomes in live classrooms	Not established
Still uncertain	Whether results generalise beyond one institution, C programming, and CS1/2 contexts	Not established
Still uncertain	Whether LLM judges can reliably assess subtle pedagogical traits such as Socratic quality	Weakly supported at best

This separation is important because the commercial temptation is obvious. The moment a 4B model approaches GPT-4.1 on expert metrics, someone will want to declare a general replacement cycle. That would be premature. The paper supports replacement for a bounded workflow, not substitution across the whole educational stack.

The boundary: this is not a general tutor hiding in a compiler

The most important limitation is scope. The dataset comes from real student errors, which is a strength, but from one large Australian university and a specific CS1/2 C programming environment. Different languages, curricula, assignment styles, student backgrounds, and debugging tools may produce different error distributions.

The second limitation is the teacher model. GPT-4.1 generated the 40,000 training explanations. That means the fine-tuned models inherit not only its useful structure but also its style and biases. If GPT-4.1 tends to phrase hints in a particular way, the small models learn that. This is distillation, not independent pedagogical discovery.

The third limitation is evaluation. Expert review is meaningful but limited in scale. LLM-as-judge evaluation is broad but vulnerable to self-bias and weak agreement on some criteria. Completeness and Socratic guidance are especially hard to treat as settled.

The fourth limitation is outcome evidence. The paper evaluates response quality, not downstream student learning. It does not show that students debug faster, repeat fewer errors, retain concepts better, or become less dependent on help. Those are the next-level product questions. In education, a better explanation is a promising intermediate metric, not the whole game.

Finally, the original DCC Help responses may have been disadvantaged because they did not conform to the common three-part response structure used by the generated responses. The authors note this possibility. In evaluation, format can masquerade as quality. Anyone who has sat through a vendor demo already knows this, but it is nice to see the paper say it politely.

The real lesson: model choice follows workflow design

The useful conclusion is not that small models are “as good as” large models. The useful conclusion is that workflow design can change the model requirement.

A general tutoring assistant is hard. A structured error-explanation system is easier. A system that receives code, error context, call stack, and variable state, then produces a three-part explanation with no direct solution, is not trying to be an all-purpose teacher. It is trying to do one controlled thing well.

That is where fine-tuned open models become strategically interesting. They do not need to win every benchmark. They need to be good enough on the exact interaction that drives cost, privacy risk, and product usage. In this paper, SFT-Qwen-4B and SFT-Llama-8B move close enough to the frontier model, and past the deployed baseline, to deserve serious consideration.

This is the part many AI strategies still miss. The cheapest model is not the one with the lowest token price. It is the one that can be governed, evaluated, deployed, updated, and trusted for the actual job. Sometimes that will be GPT-4.1. Sometimes it will be a compact open model with a very specific education and no illusions of grandeur.

The giants are not toppled. They are being made less necessary in places where the work has been properly bounded. For operators, that is better than a revolution. It is a deployment plan.

Cognaptus: Automate the Present, Incubate the Future.

¹ Lorenzo Lee Solano, Charles Koutcheme, Juho Leinonen, Alexandra Vassar, and Jake Renzella, “Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools,” arXiv:2507.05305, 2025.
