Feedback, Not Freefall: Why LLM Writing Tools Need a Teacher in the Loop

Feedback is expensive.

Anyone who has managed a classroom, a content team, a training programme, or a junior analyst cohort knows the pattern. The first draft is rarely the problem. The problem is the second draft, because the second draft requires specific feedback, delivered in language the learner can act on, without exhausting the person giving it. Multiply that by thirty students, ten assignments, uneven ability levels, and a calendar that refuses to become more generous. Suddenly “just give everyone personalised feedback” becomes one of those ideas beloved by people who do not have to do it.

So the obvious AI pitch writes itself: let the LLM generate the feedback. The less obvious question is whether that makes the learner better, the teacher irrelevant, or the whole class quietly more generic.

Wang et al.’s paper, “Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale,” is useful because it does not treat that as a branding problem.¹ It studies a deployed K-12 writing platform in which the LLM, teacher, and student each have a defined role. The LLM generates preliminary feedback. The teacher screens, edits, removes, and adds suggestions. The student revises from the teacher-mediated version. The paper then traces not just whether scores improve, but which suggestions survive, which suggestions students adopt, and where extra linguistic complexity stops helping.

That last part matters. The lazy interpretation of AI feedback is that more intervention means more improvement. More comments, more vocabulary, more structure, more polish. Very efficient. Also sometimes wrong.

The system is a workflow, not a chatbot

The paper’s most important contribution is not that an LLM can comment on student writing. We are well past the point where “the model can produce plausible feedback” counts as a revelation. The contribution is the division of labour.

The deployed system follows a four-stage cycle:

1. A student writes an initial draft for a standardised K-12 writing task.
2. The LLM grades the draft and generates preliminary suggestions.
3. A teacher reviews the draft and raw LLM suggestions, keeping useful feedback, changing tone, removing weak or redundant advice, and adding missing pedagogical scaffolding.
4. The student receives the final teacher-mediated feedback and produces a revised draft.

This design is quietly different from the usual “student plus chatbot” arrangement. The LLM is not positioned as a tutor with direct authority over a child. It is positioned as a generative engine inside a review workflow. The teacher is not asked to handcraft every comment from scratch. The teacher is asked to validate, refine, contextualise, and intervene where professional judgement matters.

That distinction is not cosmetic. It changes where the system’s risk sits.

A direct LLM writing assistant pushes uncertainty downstream to the student: “Here is advice; good luck knowing whether it is appropriate.” A triadic workflow pushes uncertainty into a teacher review layer before it reaches the learner. Less glamorous, yes. Also closer to how serious operational systems survive contact with reality.

Role	What the paper’s system assigns to it	Operational meaning
LLM	Generate draft feedback and preliminary ratings	Scale the repetitive first pass
Teacher	Filter, refine, add, and align feedback	Convert generic output into pedagogically usable guidance
Student	Revise from mediated suggestions	Keep learning as an active revision task
Evaluation pipeline	Track writing change and feedback uptake	Measure whether the workflow changes behaviour, not just whether it emits advice

The business translation is simple: the product is not “AI feedback”. The product is the handoff.

The measurement framework asks what changed in the writing

The study evaluates writing through Systemic Functional Linguistics, splitting development into three broad functions of language.

The ideational function covers vocabulary and syntax: lexical richness and syntactic diversity. The textual function covers how the text hangs together: semantic dispersion and semantic shift. The interpersonal function covers emotion and moral framing: emotional spectrum and moral alignment.

That is a more useful frame than a single essay score. A total score can rise because the writing is cleaner, longer, more emotionally varied, more logically connected, or merely more aligned with evaluator preferences. The paper tries to separate those channels.

It also builds a pipeline for feedback uptake. Final teacher-mediated suggestions are compared with raw LLM suggestions using sentence embeddings. If a final suggestion is close enough to the raw LLM version, it is treated as retained LLM feedback. If not, it is treated as teacher-revised or teacher-added feedback. The revised student essay is then compared with the original draft to identify changed sentences, and an attention-based semantic matching process estimates whether each suggestion was adopted.
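The retained-versus-mediated split described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes a bag-of-words vector as a stand-in for real sentence embeddings, and the 0.8 threshold and the names `embed`, `cosine`, and `classify` are hypothetical choices made for the example.

```python
import math
from collections import Counter

# Hypothetical cut-off; the paper's actual threshold is not reproduced here.
SIMILARITY_THRESHOLD = 0.8


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: lower-cased word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(final_suggestion: str, raw_llm_suggestions: list) -> str:
    """Label a final suggestion by its closest raw LLM suggestion."""
    final_vec = embed(final_suggestion)
    best = max(
        (cosine(final_vec, embed(raw)) for raw in raw_llm_suggestions),
        default=0.0,  # no raw suggestions means nothing could be retained
    )
    return "retained_llm" if best >= SIMILARITY_THRESHOLD else "teacher_mediated"


raw = ["Add more vivid adjectives to describe the garden."]
print(classify("Add more vivid adjectives to describe the garden.", raw))
print(classify("Explain why the grandmother's silence matters before the ending.", raw))
```

In a real deployment the threshold would need calibrating against human judgements of "same suggestion", since a cut-off that is too loose credits the LLM with teacher work and one that is too tight does the reverse.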

This is not a perfect window into student cognition. A student may internalise advice without producing a sentence that neatly matches it. A similar revision may happen for reasons other than the feedback. But the pipeline is more informative than simply counting comments and hoping the score moves. Hope, as usual, is not a metric.

The paper’s evidence falls into several categories:

Evidence component	Likely purpose	What it supports	What it does not prove
Dataset statistics and pre/post grades	Main evidence	The deployed workflow is associated with higher revised-draft scores at scale	That the same effect would appear in every school, genre, model, or long-term setting
SFL linguistic metrics	Main mechanism evidence	Writing changes across vocabulary, syntax, coherence, emotion, and moral framing	That every dimension is equally desirable in every genre
Feedback share and uptake rates	Mechanism evidence	LLMs provide volume; teachers improve actionability	That teacher mediation is always cost-effective under all staffing models
Teacher-effort comparison	Operational evidence	Reviewing LLM feedback is far lighter than creating feedback from scratch	That review effort captures all cognitive load or quality assurance costs
Fixed-effects regressions	Main inferential evidence	Adopted suggestions from both LLM and teacher relate to score gains after controls	Full causal isolation from every unobserved classroom or motivation factor
Proficiency-quartile correlations in Figure 2	Sensitivity / exploratory mechanism test	The value of linguistic expansion declines as baseline proficiency rises	The exact intervention threshold for every learner
Appendix annotation pipeline	Implementation detail and validity support	Emotion and moral metrics were produced through a calibrated LLM annotation process	Human-level ground truth for all affective and moral interpretations

That last row deserves attention. The interpersonal metrics are generated through an automated annotation pipeline. The authors segment essays into sentences, filter for emotion or moral relevance, use an A-B-A model workflow with GPT-4.1 and Claude Sonnet 4, calibrate prompts against human annotations, and then run large-scale annotation with validation and completeness checks. The reported agreement scores are respectable: Cohen’s kappa reaches 0.61 for emotion and 0.67 for moral classification during calibration, while full-run inter-agent consistency is very high.

Still, those values do not magically turn affective and moral interpretation into granite. They make the measurement process more disciplined. They do not remove the fact that “emotional spectrum” and “moral alignment” are model-mediated constructs. Useful, but not sacred tablets from Mount Validity.

The scale makes the result harder to dismiss

The dataset is large by education-intervention standards: 57,954 essays, 28,977 student-task pairs, 10,195 students, 1,602 writing tasks, and 120 schools, spanning May 2023 to July 2025. Each student-task pair includes an initial draft, raw LLM feedback, teacher-mediated final suggestions, and a revised draft.

The score changes are clear. LLM-assigned grades rise from 80.74 to 85.75, a gain of 5.01 points. Teacher-assigned grades rise from 75.17 to 80.34, a gain of 5.17 points. Both improvements are statistically significant.

The paper also reports improvements across all six linguistic dimensions:

Dimension	Initial mean	Revised mean	Reported change
Lexical richness	0.850	0.875	+3.313%
Syntactic diversity	0.696	0.701	+1.334%
Semantic dispersion	0.483	0.486	+1.556%
Semantic shift	0.524	0.528	+1.641%
Emotional spectrum	1.115	1.231	+7.719%
Moral alignment	0.480	0.536	+5.735%

The biggest relative gains are in the interpersonal dimension: emotional spectrum and moral alignment. The paper further reports a shift toward more approach-oriented and pro-social narrative features: anticipation, joy, trust, and fear increase; anger and disgust decrease; sanctity, care, and loyalty increase; harm, cheating, subversion, and degradation decline. Authority and fairness also decrease, which the authors interpret as a move toward more egalitarian narrative styles.

That is interesting, but it should be handled carefully. The result does not mean the system manufactures virtue. It means revised essays, as classified by the paper’s annotation pipeline, contain broader emotional registers and a different distribution of moral categories after teacher-mediated feedback. That may reflect better narrative maturity. It may also reflect alignment with school writing norms. In education, those are often neighbours. Occasionally they share a driveway.

The teacher’s contribution is adoption, not volume

The feedback breakdown is the paper’s cleanest operational insight.

In the final suggestion pool, 58.158% of suggestions are retained LLM feedback and 41.842% are teacher-mediated refinements. The LLM supplies the majority of feedback volume. That is what LLMs are good at: producing plausible candidates quickly and without requesting a coffee break.

But adoption tells a different story.

Overall, students adopt 84.688% of retained LLM suggestions and 91.895% of teacher-mediated suggestions. Teacher-mediated suggestions show higher adoption in every dimension, although for semantic shift the difference is not statistically significant. This is especially important because students could not directly distinguish whether a suggestion originated from the LLM or the teacher. The higher adoption rate is therefore not simply a “teacher brand” effect. It suggests that the teacher-mediated suggestions were more actionable, better aligned, or better phrased for student revision.

This is where the “teacher in the loop” phrase usually becomes vague and sentimental. Here it has a measurable operational meaning: teacher mediation improves the conversion rate from suggestion to revision.

Dimension	LLM share of final suggestions	Teacher-mediated share	LLM uptake	Teacher-mediated uptake
Lexical richness	53.037%	46.963%	85.368%	91.761%
Syntactic diversity	62.847%	37.153%	79.369%	93.539%
Semantic dispersion	76.280%	23.720%	84.573%	91.311%
Semantic shift	39.140%	60.860%	92.621%	93.701%
Emotional spectrum	54.075%	45.925%	88.926%	94.026%
Moral alignment	98.912%	1.088%	82.699%	84.463%
Overall	58.158%	41.842%	84.688%	91.895%

The distribution also reveals a useful division of labour. The LLM dominates moral-alignment suggestions by volume, while teachers contribute relatively more to semantic shift, the dimension tied to local logic and transitions. That is not a trivial distinction. LLMs can easily propose “expand the idea”, “add richer content”, “strengthen the theme”. Teachers are often better at seeing whether the student’s argument or story is actually moving.

For product design, this implies that a review interface should not merely show teachers a pile of AI-generated comments. It should distinguish categories of feedback, surface where teacher intervention historically improves uptake, and reduce review friction where the LLM is already producing stable low-risk suggestions. Otherwise the system is just moving work into a prettier inbox. The enterprise software industry has already perfected that mistake.
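One way to "surface where teacher intervention improves uptake" is simply to rank dimensions by the uptake gap. The sketch below uses the uptake percentages from the table above; the function name `uptake_gaps` is illustrative, and the ranking ignores sample sizes and statistical significance.

```python
# Uptake rates (percent) copied from the table above: (LLM, teacher-mediated).
UPTAKE = {
    "Lexical richness": (85.368, 91.761),
    "Syntactic diversity": (79.369, 93.539),
    "Semantic dispersion": (84.573, 91.311),
    "Semantic shift": (92.621, 93.701),
    "Emotional spectrum": (88.926, 94.026),
    "Moral alignment": (82.699, 84.463),
}


def uptake_gaps(uptake: dict) -> list:
    """Return (dimension, gap in percentage points), largest gap first."""
    gaps = [(dim, round(teacher - llm, 3)) for dim, (llm, teacher) in uptake.items()]
    return sorted(gaps, key=lambda item: item[1], reverse=True)


for dimension, gap in uptake_gaps(UPTAKE):
    print(f"{dimension:<20} {gap:+.3f} pp")
```

On these figures the gap is widest for syntactic diversity (about 14 points) and narrowest for semantic shift (about 1 point), which is a reasonable first guess at where review time buys the most adoption.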

The workload result is real, but it is not a licence to remove judgement

The paper estimates teacher effort under two scenarios: creating feedback from scratch versus modifying LLM-generated feedback. Teacher effort is measured as the volume of modifications to LLM output plus manual additions.

The contrast is stark. In the creation scenario, total teacher effort is 3.951. In the modification scenario, it falls to 0.095, which the paper describes as nearly a 40-fold decrease. The largest creation burden is lexical richness at 1.400, followed by semantic shift at 0.520 and syntactic diversity at 0.290. Under modification, effort becomes tiny across most categories.

There is one exception: moral alignment. Teacher effort rises from 0.011 in creation to 0.026 in modification. That looks small in absolute terms, but it is conceptually important. When the LLM supplies draft feedback, teachers spend less time inventing ordinary writing advice and relatively more time checking sensitive educational alignment.
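The arithmetic behind those two contrasts is worth checking by hand. The snippet below only divides the figures quoted above; the variable names are illustrative and the effort units are whatever the paper's measure uses.

```python
# Effort figures quoted above (in the paper's effort units).
total_creation = 3.951
total_modification = 0.095
moral_creation = 0.011
moral_modification = 0.026

fold_decrease = total_creation / total_modification
moral_ratio = moral_modification / moral_creation

print(f"Overall effort falls by a factor of {fold_decrease:.1f}")
print(f"Moral-alignment effort is multiplied by {moral_ratio:.2f}")
```

The overall ratio comes out at roughly 41.6, and moral-alignment effort a little more than doubles, albeit from a very small base.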

That is the right shape for human-in-the-loop systems. Automation should absorb repeatable generation; humans should concentrate on ambiguous, high-stakes, context-sensitive judgement. If the opposite happens, the organisation has not automated work. It has automated noise and reserved the drudgery for humans. A bold strategy, frequently encountered.

The practical lesson is not “teachers can now do almost nothing”. The lesson is that teachers can be moved from first-pass production to exception handling and quality control. That has staffing implications, training implications, and audit implications.

A school or education platform deploying this kind of system would need to ask:

Which feedback categories require mandatory teacher review?
Which categories can be approved quickly after confidence checks?
Which student profiles need more or less scaffolding?
How are teacher edits logged so the model, rubric, or prompt strategy can improve?
When should the system deliberately reduce intervention?

The last question is where the paper becomes more interesting.

More linguistic expansion eventually stops paying

The fixed-effects regression analysis supports a two-channel story. Adopted suggestions from both LLM and teacher predict writing-quality improvement. Teacher-mediated suggestions show a higher marginal relationship with grade improvement than retained LLM suggestions. The models include student, teacher, and task fixed effects, which strengthens the interpretation by controlling for persistent differences across those units.
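For readers unfamiliar with fixed effects, the idea can be shown with a deliberately small sketch. It is not the authors' specification: the paper uses student, teacher, and task fixed effects with multiple regressors, while this example uses one grouping (student), one regressor (adopted suggestions), and synthetic data constructed so that the true slope is 2.

```python
from collections import defaultdict


def within_slope(rows):
    """One-way fixed-effects (within) estimator for a single regressor.

    rows: iterable of (group, x, y). Group means are subtracted from x and y,
    which removes any constant per-group effect before fitting the slope.
    """
    rows = list(rows)
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # group -> [sum_x, sum_y, count]
    for group, x, y in rows:
        acc = sums[group]
        acc[0] += x
        acc[1] += y
        acc[2] += 1

    num = den = 0.0
    for group, x, y in rows:
        sum_x, sum_y, n = sums[group]
        x_dev = x - sum_x / n
        y_dev = y - sum_y / n
        num += x_dev * y_dev
        den += x_dev * x_dev
    if den == 0:
        raise ValueError("x does not vary within any group")
    return num / den


# Synthetic illustration only: each student has a different baseline
# (the fixed effect), and every adopted suggestion adds 2 points by construction.
baselines = {"s1": 1.0, "s2": 6.0, "s3": -3.0}
toy_rows = [
    (student, adopted, base + 2.0 * adopted)
    for student, base in baselines.items()
    for adopted in (0, 1, 2, 3)
]
print(within_slope(toy_rows))
```

The point of the demeaning step is that a student who always scores high, or a task that is always graded generously, cannot masquerade as an effect of adopted feedback.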

The dimensional regression results add texture. Lexical richness, emotional spectrum, and moral alignment are positive predictors in the teacher-grade model. Semantic shift is positive at a weaker significance level. Syntactic diversity is not significant for teacher grades, though it is positive for LLM grades. Semantic dispersion is negative for LLM grades and slightly negative but not significant for teacher grades.

Translated out of metric language: better vocabulary, broader emotional expression, and richer moral framing appear to help. Local flow may help. But simply spreading across more semantic ground can hurt. A student who adds more themes, examples, and conceptual breadth may not produce a better essay. Sometimes they produce an essay with more places to get lost.

Figure 2 then splits the relationship between linguistic gains and grade gains by baseline proficiency quartiles. This is best read as a sensitivity or exploratory mechanism test, not a second causal theorem. The pattern is nevertheless important: the positive correlation between linguistic expansion and grade improvement tends to decline from lower to higher proficiency quartiles. For syntactic diversity, semantic dispersion, and semantic shift, the correlation turns negative in the highest quartile.

This is the paper’s ceiling effect.

Lower-performing students benefit from expansion because they need more vocabulary, more structure, more transitions, more expressive range. Higher-performing students may already have enough of those. More complexity can become over-writing. More structure can become stiffness. More thematic breadth can become drift. The model, naturally, may keep recommending “more” because “more” is a wonderfully easy thing for a model to produce.

That is why the paper’s conclusion points toward dynamic, proficiency-aware collaboration. The system should not apply the same feedback intensity to every learner. It should taper, redirect, or specialise intervention as proficiency rises.
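The quartile analysis behind that recommendation can be sketched as follows. Everything here is an assumption made for illustration: the record layout, the equal-sized binning, and especially the toy numbers, which are invented to show a declining pattern and are not the paper's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_by_quartile(records):
    """records: (baseline_score, linguistic_gain, grade_gain) tuples.

    Students are ranked by baseline score and cut into four bins of equal
    size (the last bin absorbs any remainder); the correlation between
    linguistic gain and grade gain is then computed inside each bin.
    """
    ranked = sorted(records, key=lambda r: r[0])
    size = len(ranked) // 4
    result = {}
    for q in range(4):
        end = (q + 1) * size if q < 3 else len(ranked)
        chunk = ranked[q * size:end]
        result[f"Q{q + 1}"] = pearson([r[1] for r in chunk], [r[2] for r in chunk])
    return result


# Invented toy records, NOT the paper's data.
toy = [
    (55, 1, 1), (58, 2, 2), (60, 3, 3),
    (68, 1, 1), (70, 2, 3), (72, 3, 2),
    (80, 1, 2), (82, 2, 1), (84, 3, 2),
    (92, 1, 3), (94, 2, 2), (96, 3, 1),
]
print(correlation_by_quartile(toy))
```

A platform that already logs baseline scores and per-dimension gains could run this kind of split on its own data before deciding where to taper feedback.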

In business terms, this is adaptive service design. Entry-level users need scaffolding. Advanced users need selective diagnosis. Treating them the same is not personalisation. It is segmentation with the lights off.

The misconception: AI feedback does not replace the teacher; it changes the teacher’s bottleneck

The common argument about LLMs in education tends to collapse into two theatrical positions. One side imagines automated tutors delivering infinite personalised instruction to every child. The other imagines teachers heroically defending the last human outpost against robot essay polishers. Both are emotionally satisfying. Neither is how useful systems are usually built.

The paper supports a more boring and therefore more valuable interpretation: LLMs reduce the cost of generating candidate feedback, while teachers increase the probability that feedback becomes usable revision.

This matters for any business building AI-assisted expert workflows, not just education technology.

The same pattern appears in legal drafting, sales coaching, customer-support QA, clinical documentation, analyst training, and compliance review. The model can generate a first pass. The expert still determines what is relevant, safe, timed correctly, and worth acting on. The measurable business unit is not model output. It is accepted, implemented, outcome-improving output.

That means adoption rate is a better product metric than generation volume. A platform bragging about “10,000 AI suggestions generated” is mostly confessing to a storage problem. The better question is: how many suggestions were reviewed, retained, modified, adopted, and associated with measurable improvement?

The paper’s uptake pipeline is not directly portable to every domain, but the logic is. Track the journey from model output to expert mediation to user action to outcome. Without that trace, “AI assistance” remains a mood.

What an education-AI product should take from this

The strongest product implication is not a student-facing chatbot. It is an assisted-feedback platform with structured teacher mediation.

A serious implementation would likely need five layers.

First, a generation layer that produces draft feedback mapped to known pedagogical dimensions: vocabulary, syntax, content breadth, logical flow, emotional expression, moral or thematic framing. The categories may differ by curriculum, but the product must know what kind of advice it is generating.

Second, a review layer that makes teacher action cheap. Teachers should be able to approve, edit, reject, merge, and add suggestions quickly. The interface should learn where teacher edits are common and where the LLM is reliably retained.

Third, a provenance layer that records which suggestions came from the model, which were modified, which were teacher-added, and which reached students. This is not bureaucracy. It is the difference between product learning and product theatre.

Fourth, an uptake layer that estimates whether students acted on feedback. The method can vary, but the principle should not: feedback has value only when it changes the next attempt.

Fifth, an adaptive scaffolding layer that changes intervention intensity by proficiency. Novices may need expansion. Advanced learners may need pruning, prioritisation, and restraint. A strong student does not necessarily need more adjectives. Revolutionary thought, but we persist.
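The provenance and uptake layers above might share a record like the one below. This is a hypothetical schema, not something the paper specifies: the class and field names are invented, and the three-way `Origin` split is finer than the paper's two-way distinction between retained LLM feedback and teacher-mediated feedback.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    LLM_RETAINED = "llm_retained"      # raw LLM suggestion kept as-is
    TEACHER_EDITED = "teacher_edited"  # LLM suggestion rewritten by the teacher
    TEACHER_ADDED = "teacher_added"    # written by the teacher from scratch


@dataclass
class SuggestionRecord:
    suggestion_id: str
    origin: Origin
    delivered: bool  # did it reach the student after review?
    adopted: bool    # was a matching revision detected?


def adoption_rate_by_origin(records):
    """Share of delivered suggestions that were adopted, per origin."""
    delivered = {}
    adopted = {}
    for rec in records:
        if not rec.delivered:
            continue  # rejected suggestions never had a chance to be adopted
        delivered[rec.origin] = delivered.get(rec.origin, 0) + 1
        adopted[rec.origin] = adopted.get(rec.origin, 0) + int(rec.adopted)
    return {origin: adopted[origin] / count for origin, count in delivered.items()}


# Toy log for illustration only.
log = [
    SuggestionRecord("a", Origin.LLM_RETAINED, delivered=True, adopted=True),
    SuggestionRecord("b", Origin.LLM_RETAINED, delivered=True, adopted=False),
    SuggestionRecord("c", Origin.TEACHER_EDITED, delivered=True, adopted=True),
    SuggestionRecord("d", Origin.TEACHER_ADDED, delivered=False, adopted=False),
]
rates = adoption_rate_by_origin(log)
for origin, rate in rates.items():
    print(origin.value, f"{rate:.0%}")
```

Keeping origin, delivery, and adoption on the same record is what makes the adoption-rate comparison possible later; reconstructing provenance after the fact is usually much harder.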

Product design choice	Paper-grounded rationale	Business implication
Keep teachers in the workflow	Teacher-mediated suggestions show higher uptake	Sell workload leverage, not replacement
Track suggestion provenance	The study separates retained LLM feedback from teacher-mediated feedback	Enables QA, training, audit, and model improvement
Measure uptake, not just feedback volume	Student adoption is central to the mechanism	Aligns product metrics with learning behaviour
Vary feedback intensity by proficiency	Higher-proficiency learners show diminishing returns from expansion	Supports adaptive pricing, differentiated UX, and better outcomes
Treat sensitive dimensions as review-heavy	Moral alignment is the only category where teacher effort rises under modification	Human review should concentrate where context and norms matter

The broader enterprise lesson is equally blunt. AI workflow design should preserve expert judgement where judgement changes adoption, not where nostalgia demands it. The teacher stays in the circuit because the evidence says the circuit works better that way.

Boundaries: where the result should not be over-sold

This is a strong applied study, not a universal law of learning.

The first boundary is domain. The study is about K-12 writing revision, not mathematics tutoring, foreign-language speaking, university research writing, workplace training, or therapy-adjacent coaching. The triadic mechanism may generalise, but the metrics and intervention effects should not be copy-pasted across domains as if context were an optional plugin.

The second boundary is time. The study focuses on short-term revision cycles. It does not settle long-term effects such as whether teachers become over-reliant on model suggestions, whether students internalise feedback or merely comply with it, or whether repeated exposure changes writing style diversity over months or years.

The third boundary is measurement. The interpersonal results depend on automated emotion and moral annotation. The authors do more validation work than many papers do, but those constructs remain mediated by model-based classification. Businesses should treat such metrics as decision support, not as moral X-rays.

The fourth boundary is equity. The paper uses fixed effects and includes many schools, but it does not fully resolve how teacher AI literacy, school resources, student background, or socioeconomic variation shape outcomes. In deployment, those differences tend to arrive with invoices attached.

The fifth boundary is model dependence. The underlying LLMs matter. A workflow that performs well with one model, prompt regime, or review culture may behave differently when the model changes, costs shift, or a vendor quietly updates behaviour. Organisations that cannot observe those changes are not operating an AI system. They are participating in a subscription mystery.

The real lesson is controlled delegation

The paper’s title asks whether LLMs in K-12 writing are a double-edged sword or a sharp tool. The answer is: neither by default. A tool becomes sharp when the workflow controls where it cuts.

The LLM is valuable because it generates draft feedback cheaply and at scale. The teacher is valuable because mediation makes feedback more adoptable and keeps sensitive judgement in human hands. The student is valuable because revision remains an active learning act rather than a polished output delivery service. Remove any of the three, and the mechanism changes.

For businesses, the lesson is not confined to classrooms. The paper is a case study in controlled delegation: use AI to expand first-pass capacity, use experts to shape actionability, and use outcome tracing to learn when more intervention helps and when it starts making things worse.

That is less exciting than replacing everyone with a chatbot. It is also much closer to how durable AI systems will be built.

Cognaptus: Automate the Present, Incubate the Future.

¹ Canran Wang, Yuwen Yang, Zhen Wang, Ming Ma, Chentai Wang, Ding Yu, Keman Huang, and Xiaoyong Du, “Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale,” arXiv:2605.30200, 2026, https://arxiv.org/abs/2605.30200.
