Ethics Isn’t a Footnote: Teaching NLP Responsibility the Hard Way

Training usually ends with a green tick.

Employees watch a video, answer several questions whose correct responses are not exactly mysterious, and confirm that they understand the policy. The organization records completion. Everyone returns to work with roughly the same judgment they had before, plus one more certificate in the learning-management system.

This format is efficient when the objective is proving that information was delivered. It is less convincing when the objective is preparing people to recognize ambiguous harms, question familiar practices, explain technical trade-offs, and accept responsibility for decisions whose consequences are not yet fully visible.

A paper by Malvina Nissim, Viviana Patti, and Beatrice Savoldi documents a rather different approach. Their six-week course on ethical aspects of natural language processing treats responsibility as something students must repeatedly practise: examining data choices, auditing published research, debating contested claims, interviewing experts, teaching schoolchildren, and eventually creating reusable educational products.¹

The paper does not present a controlled experiment proving that this approach produces more ethical practitioners. Its evidence is descriptive: four years of course development across three universities, multiple teaching formats, student reflections, and a surprisingly diverse collection of completed projects.

That boundary matters. The paper is not an efficacy trial.

It is, however, a useful account of how ethics education can evolve from discussing responsibility to making students perform it in front of people who depend on their explanations.

The Course Begins by Making Invisible Choices Visible

The course was first introduced in the 2021–2022 academic year for final-year Bachelor of Information Science students at the University of Groningen. It was later adapted for master’s students in linguistics, language technologies, and digital humanities at the universities of Pavia and Turin.

Its standard structure spans six weeks, provides five to six European Credit Transfer and Accumulation System credits, and includes roughly 28 to 36 contact hours. Each week combines a two-hour lecture with a two-hour seminar containing hands-on activities.

The sequence covers much of the NLP development and deployment pipeline:

Week	Central concern	Questions made difficult
1	Social implications and values in AI research	Which values are embedded in seemingly technical choices?
2	NLP-specific challenges and research-community practices	What responsibilities belong to researchers, reviewers, and institutions?
3	Bias, measurement, and debiasing	What exactly is being measured, and does the method travel beyond English?
4	Downstream applications and users	Who may be harmed, excluded, misled, or affected by dual use?
5	Data, annotation, ownership, and documentation	Who produced the data, who transformed it, and who disappears from the account?
6	Evaluation, interpretation, and reporting	Does a reported score support the claim being made about real-world capability?

The curriculum is broad, but its more important feature is structural: lectures introduce concepts, while activities force students to encounter the complications hidden inside those concepts.

During one lecture, for example, a student transcribes the classroom conversation. The class then examines the resulting text. Does it belong to the speakers, the person typing, or the instructors who requested it? Why were some details omitted? Why were certain spellings, punctuation marks, and line breaks chosen? Were those transformations neutral?

“Data” initially appears to be the least controversial part of the pipeline. It is simply the material models consume. After the exercise, data becomes the output of numerous human decisions concerning selection, representation, ownership, and interpretation.

This is a small activity with a useful mechanism. It changes the learner’s position from observer to implicated participant.

The same principle appears in the weekly assignments. Students inspect ethics statements in published ACL papers and compare them with a responsible-research checklist. They perform emotion-annotation tasks and confront disagreement over supposedly correct labels. They analyse sensational reporting and attempt to explain disputed NLP claims to non-specialists.

These exercises do not provide a single ethical rule that resolves every case. They expose where rules become incomplete, where metrics conceal assumptions, and where apparently ordinary workflows distribute responsibility so widely that nobody feels entirely responsible.

That is harder to assess than memorisation. It is also closer to the problem organizations actually have.
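The annotation disagreement mentioned above can be made concrete with a chance-corrected agreement score. The sketch below computes Cohen's kappa for two annotators. The emotion labels and the six example items are invented for illustration and do not come from the course or the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Share of items on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labelled at random
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of six sentences (illustrative data only).
annotator_1 = ["joy", "anger", "joy", "sadness", "joy", "anger"]
annotator_2 = ["joy", "sadness", "joy", "sadness", "anger", "anger"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))  # approximately 0.5
```

The two annotators agree on four of six items, yet kappa is only about 0.5 once chance agreement is removed. The score says nothing about which label is "correct", which is the point the exercise makes.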

2021–2022: Students Had to Question the People Defining the Rules

The first edition ended with interviews between student groups and experienced NLP researchers, many of whom were members of the recently established ACL Ethics Committee.

This final project was not an isolated conversation arranged at the end of the course. From the first week, students developed questions based on each new topic. They revised those questions with instructors and classmates, tested their clarity, assigned interview roles, conducted the discussion, and produced a report combining the interview with their own commentary and individual reflections.

The sequence matters:

encounter an ethical issue;
form a question;
discover that the question is vague or loaded;
revise it;
ask someone with a different perspective;
interpret the answer rather than merely record it.

A traditional examination would test whether students remembered what an ethics committee does. The interview required them to decide what was worth asking the committee, explain why it mattered, and respond when an expert’s answer did not fit neatly into the course material.

It also weakened a quiet but common classroom assumption: that the instructors’ framing is the neutral version of the subject. The authors explicitly acknowledge that their own teaching inevitably reflects their perspectives. Interviews gave students access to additional interpretations and made disagreement part of the learning process.

The first edition therefore treated ethical competence partly as inquiry. Responsibility began with knowing how to formulate a question that could reveal a neglected consequence, an unstated assumption, or an institutional limitation.

The project was engaging, but difficult to scale. Senior researchers are not an infinitely renewable educational resource. The next version of the course moved responsibility toward a different audience.

2022–2024: Explaining AI to Schoolchildren Raised the Stakes

For the second and third Groningen editions, student groups prepared and delivered presentations to high-school classes.

The change was partly practical: expert availability could not be assumed every year. It was also a response to the public arrival of ChatGPT in late 2022. Language technology was no longer primarily a specialist subject. Schoolchildren were already using it, while often having limited opportunities to understand how it worked, what it could not reliably do, or what happened to the information they supplied.

Students were now required to teach.

Before presenting, groups rehearsed in class. Instructors and peers checked factual accuracy, reviewed the handling of sensitive topics, and tested interactive components such as quizzes and live polls. The instructors also attended the school presentations.

The audience transformed the assignment.

A university student can write “algorithmic bias is a complex sociotechnical problem” and sound adequately informed. A teenager may ask what that sentence means, who was harmed, how the system became biased, and why anyone released it. The presenter must then choose examples, remove unnecessary jargon, preserve important qualifications, and answer without hiding behind abstraction.

This is the course’s clearest use of learning by teaching. Educational research cited by the authors associates preparing to teach with the protégé effect: learners may understand and retain material more effectively when they expect to explain it to others.

The paper does not independently test that effect. It reports student experiences consistent with it, including an informal comment from one participant who felt that preparing the school presentation forced an unusually thorough review of the material.

The more interesting mechanism is not simply repetition. Teaching creates accountability to an audience.

When students prepare for an exam, the likely cost of misunderstanding falls on their grade. When they teach younger students, misunderstanding can become somebody else’s misunderstanding. Simplification becomes necessary but dangerous. Claims must be clear without becoming false. Uncertainty must be communicated without making the lesson useless.

That pressure resembles an important professional responsibility. Technical teams rarely communicate only with other technical specialists. They explain systems to product managers, executives, regulators, customers, affected communities, and employees expected to use the systems correctly.

The ability to detect a problem is valuable. The ability to explain it accurately enough for another person to act is operationally valuable.

Adaptations in Pavia and Turin also revealed which parts of the model were portable. Because those courses were delivered in compressed or differently structured formats, students sometimes prepared complete presentations without visiting schools. Target audiences also varied by project.

These versions lacked the full experience of teaching a real external audience. They retained the preparation process: selecting material, anticipating questions, and designing explanations for a defined group.

The distinction is useful. Preparing to teach may still improve understanding, but the real audience adds friction that a mock audience cannot fully reproduce. Real people interrupt, misunderstand, challenge, and care about different details than the presenter expected.

The inconvenience is part of the pedagogy.

2024–2025: Responsibility Became a Product That Others Could Reuse

In the fourth edition, the final project changed again. Student groups created educational outreach products for audiences of their choosing.

The resulting artifacts included:

card games about privacy, bias, responsibility, danger, and hype;
an illustrated book explaining chatbot risks and benefits to primary-school children;
podcast episodes examining Meta’s use of data for AI training;
informative leaflets for schoolchildren and residents of care homes;
surveys and street interviews about public perceptions of AI;
an interactive “Build Your Own Chatbot” simulation in which budget-dependent choices affect data sources, filtering, model behaviour, and debiasing.

The projects are colourful, but their educational role is more serious than “making ethics fun.”

Creating a reusable product forces students to encode their understanding into an object that must continue working when its creators are absent.

A presentation can be rescued by a skilled speaker who notices confusion and explains again. A card game, leaflet, podcast, or interactive demo has to carry its own assumptions, instructions, boundaries, and warnings. The creator must decide what the audience needs to know, what can be simplified, and what must not be simplified.

Consider the interactive chatbot builder. Users make choices about budget, training data, content filtering, system behaviour, and debiasing. This turns ethical reflection into a sequence of constrained development decisions. The learner does not merely receive the message that responsible AI is important. The learner sees how resource limits and design priorities shape available choices.
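A rough sense of how such a simulation might work can be given in a few lines. This is a minimal sketch under stated assumptions, not the students' implementation: the option names, costs, consequence texts, and the `build_chatbot` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    cost: int
    consequence: str


# Hypothetical menu: every name, cost, and consequence is invented for illustration.
MENU = {
    "data": [
        Option("scraped web text", 1, "cheap, but provenance and consent are unclear"),
        Option("licensed, documented corpus", 4, "costly, but sources can be accounted for"),
    ],
    "filtering": [
        Option("none", 0, "harmful content may reach users"),
        Option("keyword blocklist", 1, "catches obvious cases, misses context"),
        Option("human review", 3, "more careful, but slower and more expensive"),
    ],
    "debiasing": [
        Option("skip", 0, "known biases stay unmeasured"),
        Option("audit and mitigate", 3, "biases are measured, though not eliminated"),
    ],
}


def build_chatbot(budget, choices):
    """Check one choice per stage against the budget and report the trade-offs."""
    selected = {}
    for stage, name in choices.items():
        matches = [option for option in MENU[stage] if option.name == name]
        if not matches:
            raise ValueError(f"unknown option {name!r} for stage {stage!r}")
        selected[stage] = matches[0]
    spent = sum(option.cost for option in selected.values())
    if spent > budget:
        raise ValueError(f"choices cost {spent}, but the budget is {budget}")
    return {
        "spent": spent,
        "remaining": budget - spent,
        "consequences": {stage: option.consequence for stage, option in selected.items()},
    }


result = build_chatbot(
    budget=5,
    choices={
        "data": "licensed, documented corpus",
        "filtering": "keyword blocklist",
        "debiasing": "skip",
    },
)
print(result["remaining"], result["consequences"]["debiasing"])
```

With a budget of 5, paying for documented data leaves nothing for a bias audit. The learner meets the trade-off as a constraint to work within, not as a slogan.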

The card games use different mechanisms.

A quartet game organizes ethics topics into recognizable categories. A fill-in-the-blank game uses humour to begin conversations about data, bias, danger, and hype, accompanied by explanatory materials and a disclaimer that the game is not an authoritative source. A debate game assigns positions and rewards persuasive discussion, reflecting a domain in which disagreement cannot always be settled by retrieving one correct answer.

These artifacts make the course more scalable in one specific sense: a completed assignment can become an input to future teaching and public outreach. One quartet game was later used with teenagers during a public research event.

The course had moved through three forms of responsibility:

Edition	Final-project form	Responsibility practised
2021–2022	Interview experts	Ask informed, critical questions and interpret competing perspectives
2022–2024	Teach schoolchildren	Explain technical and ethical issues accurately to a real audience
2024–2025	Build reusable outreach products	Encode responsible explanations into artifacts that work without the creator present

This evolution is the paper’s strongest contribution. It shows how a course can progress beyond adding ethical topics to an otherwise conventional technical curriculum. The assessment itself becomes a practice environment for ethical judgment and communication.

The Paper Shows Feasibility and Engagement, Not Reduced Harm

The authors describe the course and its outputs as highly successful. The paper provides several reasons to take that judgment seriously.

The course has continued across multiple years. It has been adapted across institutions, degree levels, student backgrounds, and formats ranging from a six-week programme to an intensive six-day course. Students completed demanding public-facing projects. Their reflections describe changes in how they interpret AI claims, discuss technology, and think beyond model accuracy. Several projects produced tangible materials suitable for later reuse.

This is meaningful evidence of feasibility, adaptability, engagement, and productive output.

It is not evidence that graduates later make better deployment decisions, prevent more harms, or reduce legal and reputational risk.

Evidence in the paper	What it supports	What it does not establish
Four years of course iterations	The approach can be sustained and refined over time	That refinement caused better learning outcomes
Adaptation across three institutions and several programmes	The design is portable across teaching contexts	That it works equally well for every audience
Completed interviews, presentations, games, podcasts, leaflets, and demos	Students can translate course ideas into substantial outputs	That every output is accurate, effective, or behaviour-changing
Student testimonials and reflections	Participants perceived the course as valuable and thought-provoking	The magnitude or persistence of learning gains
Prior research supporting learning by teaching	A plausible pedagogical mechanism	Direct causal evidence from this course
Public reuse of at least one game	Some outputs have value beyond grading	Broad outreach impact or improved public decision-making

There is no reported control group, pre-course and post-course assessment, longitudinal follow-up, standardized judgment test, or measurement of professional behaviour. The paper also does not quantify whether particular activities outperform lectures, essays, or other active-learning approaches.

This is not a minor technical omission to mention apologetically before accepting the conclusion anyway. It defines the conclusion.

The paper directly shows a well-developed model for practising responsibility. It offers rich examples of implementation. It does not show how much more responsible students become.

For educators and organizations, the correct response is neither dismissal nor immediate celebration. It is replication with measurement.

What Businesses Can Copy Without Pretending a Company Is a University

Corporate AI-ethics training often suffers from an objective mismatch.

The organization wants employees to notice risk, escalate concerns, challenge misleading claims, document trade-offs, and communicate limitations. The training then rewards completion, recall, and agreement with predetermined statements.

A company does not need to reproduce a university course exactly. It can copy the mechanisms.

1. Organize Ethics Around the System Lifecycle

The course covers choices from data collection and annotation through model evaluation, reporting, downstream use, and public communication.

For businesses, this means replacing a generic “responsible AI” module with role-specific questions attached to actual work:

Where did the data come from?
Whose language or behaviour is poorly represented?
What does the evaluation metric fail to capture?
Which users bear the cost of errors?
What claims will marketing make from this result?
Who can stop deployment when evidence is weak?
How will affected people contest an output?

Ethics becomes easier to operationalize when it appears at decision points rather than in a separate annual presentation.
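One way to attach the questions above to decision points is to store them per lifecycle stage and list what is still unanswered before a review proceeds. The sketch below assumes a hypothetical grouping: the stage names, the assignment of questions to stages, and the `open_questions` helper are illustrative and do not come from the paper.

```python
# Hypothetical stage names; the grouping of questions is an assumption.
LIFECYCLE_QUESTIONS = {
    "data collection": [
        "Where did the data come from?",
        "Whose language or behaviour is poorly represented?",
    ],
    "evaluation": [
        "What does the evaluation metric fail to capture?",
    ],
    "deployment review": [
        "Which users bear the cost of errors?",
        "Who can stop deployment when evidence is weak?",
    ],
    "communication": [
        "What claims will marketing make from this result?",
    ],
    "post-deployment": [
        "How will affected people contest an output?",
    ],
}


def open_questions(stage, answers):
    """Return the questions for a stage that have no non-empty answer yet."""
    return [
        question
        for question in LIFECYCLE_QUESTIONS[stage]
        if not answers.get(question, "").strip()
    ]


# Example: a team has documented the data source but not representation gaps.
answers = {"Where did the data come from?": "Licensed support transcripts"}
print(open_questions("data collection", answers))
```

An answered question is not necessarily a well-answered one. A structure like this only shows where nobody has written anything down yet.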

2. Make Employees Inspect Real Artifacts

The course asks students to analyse ethics statements, annotation tasks, reporting practices, and design choices.

Organizations can do the same with internal materials: model cards, risk assessments, user complaints, incident reports, evaluation dashboards, product copy, data documentation, and escalation records.

The objective is not to locate a villain in every document. It is to develop the habit of noticing what the document makes visible and what it quietly leaves outside the frame.

3. Require Teams to Explain Systems to a Defined Audience

The school presentations reveal a useful test: can a technically informed person explain the system’s limitations to someone who has to make a decision based on that explanation?

An internal equivalent could require teams to prepare different briefings for:

senior executives approving deployment;
frontline employees using model outputs;
compliance and legal reviewers;
customers providing data;
people affected by automated decisions;
incident-response teams handling failures.

A single explanation rarely works for all six. Preparing the versions exposes different risks, assumptions, and missing information.

4. Turn Training Outputs Into Reusable Governance Assets

The fourth course edition produces objects that can outlive the assignment.

Companies could structure training so that participants create scenario libraries, decision games, short explainers, onboarding modules, evaluation checklists, or simulations based on real operational dilemmas. The strongest products could be reviewed and reused.

This does not automatically make the materials correct. It changes training from pure consumption into governed production.

5. Separate Practice From Formal Approval

The course’s weekly assignments are mandatory but ungraded, with feedback feeding into the final project. This creates room to explore uncertain questions without requiring every early attempt to look polished.

Organizations need a similar distinction. Employees require low-stakes environments where they can expose confusion, debate edge cases, and practise escalation before participating in formal deployment reviews.

When every discussion is attached to approval, performance evaluation, or legal exposure, people become skilled at producing safe-looking paperwork. That is not identical to becoming skilled at recognizing risk.

A Business Adaptation Needs Measurement the Course Did Not Provide

The paper gives organizations a design pattern, not a ready-made return-on-investment claim.

A business implementation should measure at least four levels:

Level	Example question	Possible evidence
Participation	Did employees complete the activities?	Attendance, completion, artifact delivery
Understanding	Can they identify and explain relevant risks?	Scenario-based assessments before and after training
Behaviour	Do they change decisions or escalation practices?	Review comments, documented trade-offs, escalation quality
Operational outcome	Are preventable failures reduced?	Incident patterns, complaint handling, audit findings, remediation costs
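For the understanding level, a before-and-after scenario assessment reduces to a paired comparison per participant. The sketch below assumes a hypothetical 0 to 10 scoring scale and invented participant scores. It illustrates the arithmetic only and is not a validated instrument.

```python
from statistics import mean


def paired_gains(pre, post):
    """Per-participant score change; both dicts must cover the same people."""
    if pre.keys() != post.keys():
        raise ValueError("pre and post scores must cover the same participants")
    return {person: post[person] - pre[person] for person in pre}


# Hypothetical scenario-assessment scores on a 0-10 scale (illustrative data only).
pre_scores = {"p01": 3, "p02": 5, "p03": 4, "p04": 6}
post_scores = {"p01": 5, "p02": 5, "p03": 7, "p04": 5}

gains = paired_gains(pre_scores, post_scores)
mean_gain = mean(list(gains.values()))
improved = sum(1 for change in gains.values() if change > 0)

print(gains)
print(mean_gain, improved)
```

A positive mean gain shows that scores moved, not that the training moved them. Without a comparison group and a later follow-up, the result stays at the same descriptive level as the paper's own evidence.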

The first level is easiest and least interesting. Most compliance programmes stop there because completion is measurable, comparable, and comfortable.

The harder questions concern transfer. Does someone who successfully debates bias in a workshop notice a biased sampling decision three months later? Does a product manager who explains uncertainty well during training resist a misleading launch claim under commercial pressure? Does a team escalate a foreseeable harm early enough to change the design?

Those outcomes depend on more than training. Incentives, deadlines, authority, governance processes, and leadership behaviour can overwhelm even excellent instruction. A course may teach people to recognize a problem while the organization teaches them not to delay release.

There are also practical boundaries to the course model.

Real audiences and open-ended products require instructor time, logistical coordination, factual review, and careful handling of sensitive material. The paper’s own course adaptations show that format constraints can remove some of the richest elements, such as presenting to an external audience. Corporate versions will face the same temptation to retain the visible activity while removing the inconvenience that gives it value.

A reusable artifact can also preserve a mistake. Any internally produced game, guide, simulation, or explanation needs expert review, version control, ownership, and retirement rules. “Created by employees during training” is not a quality standard.

Finally, the approach is designed to foster reflection in a domain without standardized answers. Some business obligations do have definite legal or procedural requirements. Hands-on ethical inquiry should complement those requirements, not replace them with an endless seminar in which every policy becomes debatable.

Responsibility Becomes Real When Someone Else Depends on the Explanation

The course described in “Practising Responsibility” did not begin with card games and podcasts. It began as a six-week attempt to integrate ethical reflection into NLP education. Over four years, its final project evolved from interviewing experts, to teaching schoolchildren, to building educational products for wider audiences.

That timeline reveals the underlying idea.

Responsibility is not acquired by adding more ethical vocabulary to technical work. It develops when people must examine ordinary choices, confront disagreement, explain consequences, and create something that another person may rely upon.

The paper offers no controlled evidence that its students later become better practitioners. Its artifacts and testimonials cannot support that claim. What it does offer is a credible answer to a more immediate design question: what would ethics education look like if the objective were practice rather than acknowledgement?

It would be interactive, inconvenient, audience-aware, and difficult to complete by clicking “Next.”

That sounds less scalable than a compliance video.

It also sounds considerably more like responsibility.

Cognaptus: Automate the Present, Incubate the Future.

¹ Malvina Nissim, Viviana Patti, and Beatrice Savoldi, “Practising Responsibility: Ethics in NLP as a Hands-on Course,” arXiv:2512.24825, 2025, https://arxiv.org/abs/2512.24825.
