TL;DR for operators
The paper is not really about whether a model can answer exam questions. Given the right context, the frontier models do very well. The hard part is whether an agent can notice what must be preserved, store it in a useful form, retrieve it at the right time, and act without being explicitly prodded. That is the difference between an assistant that sounds competent and an assistant that can actually carry operational state across days, weeks, and dependent workflows.
The authors introduce Experience-driven Lifelong Learning (ELL), a framework for agents that improve through exploration, long-term memory, skill learning, and knowledge internalisation. They also build StuLife, a simulated university journey with 1,284 tasks across classes, campus life, planning, club work, advisor coordination, library use, course selection, and exams.[1] The point of using a student world is not educational nostalgia. It is that student life is full of precisely the things enterprise agents must handle: deadlines, recurring commitments, partial information, evolving state, social coordination, and the occasional need to remember something boring because it will matter later.
The headline result is brutal enough to deserve its own detention slip. In the default stateless setup, GPT-5 reaches 17.90/100 StuGPA, while the human baseline is 85.24. GPT-5’s Long-Term Retention Rate is 6.50%, and its Proactive Initiative Score is 4.68%, compared with human scores of 84.91% and 88.13% respectively. This is not a small optimisation gap. It is a different operating regime.
The practical lesson is straightforward: enterprise agents will not become reliable simply because the base model gets larger. They need explicit memory design, strategic data capture, salience ranking, skill libraries, scheduled self-checks, and recovery loops. Otherwise they will keep doing the agent equivalent of writing “important meeting” in the calendar while forgetting the address, agenda, attachments, attendees, and reason the meeting exists. Very human, yes. Very billable, no.
The agent remembered the meeting and forgot the work
A useful way to understand this paper is to start with one of its appendix failure cases, because it captures the whole problem in miniature.
The agent receives an instruction on Wednesday: tomorrow at 08:15 it must complete a campus challenge involving five waypoints. A capable assistant should realise that “Campus Challenge” is not the memory. The waypoints are the memory. The order is the memory. The constraints are the memory. The calendar entry is merely the handle.
The LLM agent does something almost right. It creates a calendar event titled “Campus Challenge” for Thursday morning. Then, when Thursday 08:15 arrives, it checks the calendar and sees the event. So far, so agentic. Then it discovers the description is empty. The five waypoints are gone. It wanders off toward the campus centre to “find more information,” which is a polite benchmark version of flailing.
The human comparison is not glamorous. The human simply saves the full route in the calendar description: A → B → C → D → E. That is the whole difference. Not genius. Not a 400-page chain-of-thought. Just strategic use of an external memory tool.
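The gap between the two calendar entries is small enough to fit in a few lines. The sketch below is illustrative only: `CalendarEvent` and `is_executable` are hypothetical names, not the StuLife tool API, and the date is arbitrary (the case only specifies Thursday at 08:15).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CalendarEvent:
    """Hypothetical calendar entry; not the benchmark's actual tool schema."""
    title: str
    start: datetime
    description: str = ""

    def is_executable(self) -> bool:
        # An entry is only useful later if it carries the detail needed to act.
        return bool(self.description.strip())


# Arbitrary Thursday chosen for illustration; only the 08:15 time comes from the case.
start = datetime(2025, 1, 2, 8, 15)

# What the agent stored: a handle with nothing behind it.
agent_event = CalendarEvent("Campus Challenge", start)

# What the human stored: the same handle plus the ordered route.
waypoints = ["A", "B", "C", "D", "E"]
human_event = CalendarEvent("Campus Challenge", start, description=" → ".join(waypoints))

print(agent_event.is_executable(), human_event.is_executable())
```

A write-time check of this kind is cheap, and it catches the failure on Wednesday rather than at 08:15 on Thursday.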
This is why the paper matters. It shifts the discussion from “Can the model reason?” to “Can the agent manage the continuity conditions that make reasoning useful later?” In enterprise settings, those continuity conditions are everywhere: renewal dates, approval chains, client preferences, exception policies, escalation histories, contractual constraints, spreadsheet quirks, procurement thresholds, and the tiny implementation details that turn a plausible response into an expensive mistake.
A stateless model can often solve the problem in front of it. A working agent must preserve the future problem before it exists.
ELL turns “learning from experience” into an agent architecture, not a slogan
The paper’s formal contribution is Experience-driven Lifelong Learning, or ELL. The phrase sounds grand, but the architecture is concrete enough to be useful.
ELL treats an agent as operating through repeated interaction with an environment. Each task produces trajectories: observations, actions, outcomes, and feedback. Those trajectories should not merely be dumped into a vector database and treated as solved. They must be abstracted into a knowledge system composed of memory and skills.
The paper’s knowledge model separates several kinds of retained information:
| Component | What it stores | Operational equivalent |
|---|---|---|
| Trajectory memory | Raw or summarised interaction histories | “What happened last time we did this?” |
| Declarative knowledge | Facts, rules, concepts, requirements | “What is the policy, deadline, or constraint?” |
| Structural knowledge | Relationships between objects and concepts | “Which task depends on which earlier decision?” |
| Procedural skills | Action sequences and methods | “How do we complete this workflow?” |
| Meta-knowledge | Knowledge about learning and planning | “When should the agent store, revise, or check something?” |
| Heuristics | Rules of thumb from experience | “Which shortcut usually works, and when is it dangerous?” |
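One way to picture the table is as a single container with a slot per component. This is a minimal sketch under assumptions: the field types, the `Step` record, and the `blocked_by` helper are invented for illustration and are not the paper's formal definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One element of a trajectory: what was seen, done, and learned."""
    observation: str
    action: str
    outcome: str
    feedback: str = ""


@dataclass
class KnowledgeBase:
    """Hypothetical container with one slot per component in the table."""
    trajectories: list[list[Step]] = field(default_factory=list)
    declarative: dict[str, str] = field(default_factory=dict)       # fact name -> statement
    structural: dict[str, set[str]] = field(default_factory=dict)   # task -> prerequisite tasks
    procedural: dict[str, list[str]] = field(default_factory=dict)  # skill name -> action sequence
    meta: list[str] = field(default_factory=list)                   # when to store, revise, check
    heuristics: list[str] = field(default_factory=list)             # rules of thumb

    def blocked_by(self, task: str, completed: set[str]) -> set[str]:
        # Structural knowledge answers: which earlier decisions are still missing?
        return self.structural.get(task, set()) - completed


kb = KnowledgeBase()
kb.structural["advisor_meeting"] = {"advisor_selection"}
print(kb.blocked_by("advisor_meeting", completed=set()))
```

The point of separating the slots is that each one is queried differently: a dependency lookup is not a similarity search, and neither is a policy lookup.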
This is the first useful correction to a common misconception. The paper is not saying “agents need memory” in the thin product-demo sense of remembering that the user likes concise emails. It is saying that memory must become an operational substrate. It must support action selection, future retrieval, skill transfer, planning, salience, and revision.
In other words, the agent’s memory should not be a scrapbook. It should be closer to a working operations manual that writes itself, audits itself, and knows when a line item will matter again.
The authors frame ELL around four pillars:
- Experience exploration: the agent must acquire useful experience through interaction, not only absorb static data.
- Long-term memory: it must preserve structured knowledge across time.
- Skill learning: it must abstract reusable procedures from experience.
- Knowledge internalisation: repeated explicit knowledge should eventually become easier, faster, and more automatic to use.
That last part is important. If every task requires a full archaeological dig through past logs, the system may be technically “memory-augmented” but practically useless. Mature agents need to move some recurring patterns from retrieval into routine execution. In business terms: not every invoice exception should require rediscovering the finance policy from first principles. At some point, the agent should just know the playbook, while still being able to inspect and justify it.
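A loose operational reading of "know the playbook but keep the receipts" is a promotion rule: after a pattern has been handled successfully enough times, store it as a routine together with its source. The sketch below is an analogy, not the paper's internalisation mechanism. The `Playbook` class, the threshold of three, and the example names are all assumptions.

```python
from collections import Counter


class Playbook:
    """Hypothetical store that promotes repeated successes into routines."""

    def __init__(self, promote_after: int = 3):
        self.promote_after = promote_after  # arbitrary illustrative threshold
        self.uses = Counter()
        self.routines = {}  # pattern -> (procedure, source)

    def record_success(self, pattern: str, procedure: list[str], source: str) -> None:
        self.uses[pattern] += 1
        if self.uses[pattern] >= self.promote_after:
            # Keep the source alongside the routine so it stays inspectable.
            self.routines[pattern] = (procedure, source)

    def lookup(self, pattern: str):
        # Returns None until the pattern has earned routine status.
        return self.routines.get(pattern)


playbook = Playbook()
for _ in range(3):
    playbook.record_success(
        "invoice_exception",
        ["check_policy", "request_approval"],
        source="finance-policy-v2",  # hypothetical document name
    )
print(playbook.lookup("invoice_exception"))
```

Storing the source with the routine is the design choice that matters: the agent can skip the dig without losing the ability to justify the step.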
StuLife is a student simulation because student life is stateful work
StuLife is the paper’s benchmark for ELL. It simulates a student’s academic term across 1,284 task instances and 10 interconnected scenarios. The tasks are grouped into three broad modules: in-class tasks, daily campus tasks, and examinations.
This matters because many agent benchmarks still test isolated competence. Can the model use a tool? Can it answer a question? Can it complete a web task? Useful, yes. Sufficient, absolutely not. Real agency is not a sequence of sealed envelopes. Yesterday’s decision changes today’s state. Today’s omission breaks next week’s task. The environment remembers, even when the model does not.
StuLife builds those dependencies into the world. Course selection affects later schedules. Advisor selection becomes a precondition for later advisor tasks. Library reservations can be wasted. Club activities require coordination. Classes and exams occur at specific times and places. The agent must not only respond to direct prompts; it must maintain commitments and act when the clock says the commitment has arrived.
The dataset composition makes the point clear:
| Module | Tasks | What it stresses |
|---|---|---|
| In-class tasks | 486 | rules, course learning, scheduled attendance, academic content |
| Daily campus tasks | 638 | planning, navigation, advisor coordination, library use, clubs, course selection |
| Examination tasks | 160 | long-term knowledge retention and synthesis |
| Total | 1,284 | longitudinal agency across an evolving simulated term |
The benchmark is designed around three paradigm shifts:
- From imitation to learning: the agent must acquire reusable skills, not merely repeat prior examples.
- From context to memory: earlier information must survive beyond the current prompt.
- From passive to proactive: the agent must initiate action when goals and schedules require it.
The third shift is where today’s agents look especially fragile. Many LLM products are still beautifully reactive. They answer when asked. They summarise when invoked. They draft when prompted. But a real assistant does not wait for the CFO to ask whether the filing deadline is still alive. It checks. It reminds. It prepares. It escalates. It knows that silence can be a task state.
Current agents, in StuLife, mostly do not.
The main evidence: the tasks are solvable, but the lifecycle is not
The paper’s most important empirical result is the default benchmark table. Under the stateless evaluation setting, the models are tested task by task. They do not have native accumulated memory. Any cross-task retention must happen through explicit use of tools such as calendars or drafts.
That design choice is not a flaw. It is the point. It asks whether today’s LLM agents can create their own continuity using the tools available to them. The answer is mostly no.
| Agent | StuGPA (out of 100) | Long-Term Retention Rate (LTRR) | Proactive Initiative Score (PIS) | Total success |
|---|---|---|---|---|
| GPT-5 | 17.90 | 6.50% | 4.68% | 12.35% |
| Grok4 | 17.38 | 10.65% | 4.50% | 15.23% |
| Gemini-2.5-Pro | 16.43 | 7.04% | 3.24% | 13.53% |
| Qwen3-235B-A22B | 16.03 | 5.42% | 1.80% | 8.52% |
| Human baseline | 85.24 | 84.91% | 88.13% | 88.81% |
The exact model ranking is less interesting than the shape of the failure. Even the best model remains far below the human baseline. Scale helps, and reasoning-oriented models often do better than simpler instruction-following variants, but the benchmark is not mainly exposing lack of linguistic intelligence. It is exposing lack of continuity.
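To make the shape of the failure concrete, the short script below takes the figures quoted in the table and computes, for each metric, which model leads and how many times higher the human baseline is. The numbers are copied from the table; the helper names are only for illustration.

```python
# Default (stateless) StuLife results as quoted in the table above.
RESULTS = {
    "GPT-5": {"StuGPA": 17.90, "LTRR": 6.50, "PIS": 4.68, "Total": 12.35},
    "Grok4": {"StuGPA": 17.38, "LTRR": 10.65, "PIS": 4.50, "Total": 15.23},
    "Gemini-2.5-Pro": {"StuGPA": 16.43, "LTRR": 7.04, "PIS": 3.24, "Total": 13.53},
    "Qwen3-235B-A22B": {"StuGPA": 16.03, "LTRR": 5.42, "PIS": 1.80, "Total": 8.52},
}
HUMAN = {"StuGPA": 85.24, "LTRR": 84.91, "PIS": 88.13, "Total": 88.81}


def best_model(metric: str) -> str:
    """Model with the highest score on the given metric."""
    return max(RESULTS, key=lambda model: RESULTS[model][metric])


def human_multiple(metric: str) -> float:
    """How many times higher the human baseline is than the best model."""
    return HUMAN[metric] / RESULTS[best_model(metric)][metric]


for metric in HUMAN:
    print(f"{metric}: best={best_model(metric)}, human is {human_multiple(metric):.1f}x higher")
```

No single model leads every column, but the multiple is large in all of them, and largest on proactive initiative.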
The Long-Term Retention Rate is especially revealing. In this benchmark, LTRR does not measure some magical hidden memory inside the model. It measures whether the agent successfully performs a multi-step memory process: recognise important information, store it externally, preserve enough detail, retrieve it later, and apply it correctly. Failure at any step breaks the chain.
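The "any step breaks the chain" property can be written down directly. The sketch below names the five stages from the paragraph above and adds one simplifying assumption that is not from the paper: if stages succeed independently, the end-to-end rate is the product of the per-stage rates. The 0.9 figure is purely illustrative.

```python
from math import prod

# The five stages named above, in order.
STAGES = ("recognise", "store", "preserve_detail", "retrieve", "apply")


def chain_succeeds(outcomes: dict[str, bool]) -> bool:
    """A retention task succeeds only if every stage succeeded."""
    return all(outcomes.get(stage, False) for stage in STAGES)


def chain_success_rate(stage_rates: dict[str, float]) -> float:
    """End-to-end rate, ASSUMING stages fail independently (a simplification)."""
    return prod(stage_rates[stage] for stage in STAGES)


# The Campus Challenge case: noticed and stored, but the detail was dropped.
campus_challenge = {
    "recognise": True,
    "store": True,
    "preserve_detail": False,
    "retrieve": True,
    "apply": False,
}
print(chain_succeeds(campus_challenge))

# Illustrative only: five stages at 90% each already lose about 41% of tasks.
print(round(chain_success_rate({stage: 0.9 for stage in STAGES}), 3))
```

Under that assumption, a pipeline that looks healthy stage by stage can still be unreliable end to end, which is why per-stage instrumentation is worth having.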
The Proactive Initiative Score is harsher. It tests whether the agent can act on scheduled intentions without being directly told the task again. A low PIS means the agent may have a calendar entry but no operational instinct to consult it, interpret it, and move.
That should feel familiar to anyone deploying agents in business workflows. The painful failures are not always spectacular hallucinations. Sometimes the agent had the data, had the tool, had the deadline, and still did nothing useful. A very advanced parrot sitting beside a calendar is still not a project manager.
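The missing piece is often a loop that asks "what is due?" without waiting to be asked. Here is a minimal sketch of such a trigger check, assuming a hypothetical `Commitment` record and a 15-minute lead time; none of these names come from the paper, and the second commitment is invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Commitment:
    """Hypothetical stored commitment with enough detail to act on."""
    title: str
    due: datetime
    details: str
    done: bool = False


def due_commitments(commitments, now, lead=timedelta(minutes=15)):
    """Open commitments that are due within `lead` of `now`, or already overdue."""
    pending = [c for c in commitments if not c.done and c.due - lead <= now]
    return sorted(pending, key=lambda c: c.due)


commitments = [
    Commitment("Campus Challenge", datetime(2025, 1, 2, 8, 15), "A → B → C → D → E"),
    Commitment("Library slot", datetime(2025, 1, 2, 14, 0), "Room booking"),  # invented example
]

# Run on every tick of the clock, not only when a user sends a message.
now = datetime(2025, 1, 2, 8, 10)
for item in due_commitments(commitments, now):
    print(f"Act now: {item.title} ({item.details})")
```

The check itself is trivial. The architectural decision is that something outside the model calls it on a schedule, so initiative does not depend on the model remembering to look.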
The “perfect context” test identifies the real bottleneck
The appendix includes a useful robustness and bias check: a Perfect Context analysis. Here, the authors provide models with the exact ground-truth information needed for in-class and exam tasks. This removes the long-term memory and retrieval bottleneck and tests whether the underlying question is solvable.
The results are striking. With perfect context, GPT-5 reaches 98.18% total success, Gemini-2.5-Pro reaches 97.37%, and Qwen3-235B-A22B reaches 93.31%. The human baseline in this setting is 86.64%.
This test has a clear purpose. It is not a second thesis claiming models are smarter than humans in student life. It is a solvability control. It shows that the benchmark’s low default scores are not mainly because the questions are impossible, ambiguous, or unfairly generated. When the relevant context is placed in front of the model, the model can reason through the task.
That makes the business implication sharper. For many enterprise agent failures, the weak link is not raw reasoning. It is context lifecycle management.
| Experiment | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Default StuLife evaluation | Main evidence | Current LLM agents struggle in long-horizon, stateful, proactive workflows | That models cannot solve the individual subtasks |
| Self-evolving mechanisms | Exploratory extension / method comparison | Fine-tuning and workflow memory improve results but do not close the gap | That any one method is a complete lifelong-learning solution |
| Context engineering variants | Ablation-style variant test | Planning, skill prompts, and memory address different bottlenecks | That prompts alone can create robust agency |
| Perfect Context analysis | Robustness / solvability check | The core bottleneck is memory retrieval and autonomous context management | That simulated student tasks equal workplace performance |
This distinction is valuable for procurement and architecture decisions. If an agent fails at a task, buying a stronger model may help. But if the failure came from not storing the critical instruction, not noticing a deadline, not retrieving a dependency, or not distinguishing signal from background noise, then the fix is architectural.
You do not solve a filing cabinet problem by hiring a better poet.
Context engineering helps, but naive memory makes things worse
The paper also tests several context engineering approaches on Qwen3-235B-A22B. The results are useful because they show that “add memory” is not a serious architecture.
The vanilla baseline scores 16.03 StuGPA. A proactive prompt raises it to 16.90 and improves PIS from 1.80% to 3.06%. A skill-augmented prompt raises StuGPA to 17.28 and improves daily campus success, but PIS falls to 0.90%. That trade-off is a nice small lesson: teaching the agent how to execute does not automatically teach it when to initiate.
Memory systems diverge sharply. Vanilla RAG drops StuGPA to 10.98, and Graph RAG lands at 15.34, below the vanilla baseline. MemGPT performs much better at 19.99, and MemoryBank reaches 17.64. The best tested context strategy is the All-in-One setup, combining proactive planning, skill support, and structured memory, with a StuGPA of 21.07.
That is the paper’s most practical result for builders. Memory is not a quantity problem. It is a quality, structure, and retrieval-governance problem.
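The spread is easier to see as a change against the vanilla baseline. The snippet below uses only the StuGPA figures quoted above for Qwen3-235B-A22B; the variable names are illustrative.

```python
BASELINE = 16.03  # vanilla Qwen3-235B-A22B StuGPA, as quoted above

# StuGPA for each context engineering variant, as quoted above.
VARIANTS = {
    "Proactive prompt": 16.90,
    "Skill-augmented prompt": 17.28,
    "Vanilla RAG": 10.98,
    "Graph RAG": 15.34,
    "MemoryBank": 17.64,
    "MemGPT": 19.99,
    "All-in-One": 21.07,
}

# Change in StuGPA relative to the baseline, rounded to two decimals.
deltas = {name: round(score - BASELINE, 2) for name, score in VARIANTS.items()}

# Sort from most harmful to most helpful.
ranked = sorted(deltas.items(), key=lambda pair: pair[1])
harmful = [name for name, delta in deltas.items() if delta < 0]

for name, delta in ranked:
    print(f"{name:<24}{delta:+.2f}")
print("Below baseline:", harmful)
```

Two of the four memory systems land below doing nothing at all, and the worst one loses about as much as the best combined setup gains.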
Naive RAG can harm long-horizon agents because it retrieves material that is semantically similar but operationally distracting. In a short question-answering task, extra context may be tolerable. In a stateful workflow, irrelevant context can cause the agent to overwrite priorities, miss temporal constraints, or drown a rare but critical rule in a swamp of familiar text.
The appendix’s “signal versus noise” case makes this concrete. A special course-specific rule is buried at the end of a long handout full of ordinary programming content. The human recognises the weird new rule as testable signal. The agent treats the handout too flatly, later defaults to standard programming knowledge, and fails.
That is not just a benchmark curiosity. Enterprises are full of high-frequency noise and low-frequency rules. The standard policy is easy to retrieve. The exception buried in the addendum is what gets you sued, delayed, rejected, or over budget.
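A toy example shows how similarity alone can bury the exception. The sketch below ranks two handout chunks by word overlap, then again with a salience weight attached at write time. Everything here is an assumption made for illustration: the chunk texts, the hand-assigned salience values, the Jaccard similarity, and the `similarity * (1 + salience)` scoring rule. None of it is the paper's retrieval method.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    salience: float  # 0..1, assigned when the chunk is stored (hypothetical)


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def similarity(query: str, text: str) -> float:
    """Jaccard word overlap, a crude stand-in for embedding similarity."""
    q, t = tokens(query), tokens(text)
    return len(q & t) / len(q | t) if q | t else 0.0


def rank(query: str, chunks: list[Chunk], use_salience: bool) -> list[Chunk]:
    def score(chunk: Chunk) -> float:
        base = similarity(query, chunk.text)
        return base * (1.0 + chunk.salience) if use_salience else base
    return sorted(chunks, key=score, reverse=True)


ordinary = Chunk("a for loop repeats a block of code", salience=0.2)
special_rule = Chunk("course rule: exam answers must use while loops only", salience=0.9)
query = "how should the loop in the exam answer be written"

print(rank(query, [ordinary, special_rule], use_salience=False)[0].text)
print(rank(query, [ordinary, special_rule], use_salience=True)[0].text)
```

The weighting only works if something flagged the rule as unusual when it was stored, which moves the real problem back to capture: noticing that the odd line at the end of the handout is the one that matters.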
Self-evolution improves scores, but the agent still does not grow up
The authors also test self-evolving mechanisms. On Qwen3-8B, rejection sampling fine-tuning improves StuGPA from 13.31 to 15.43 and total success from 6.71% to 8.63%. On Qwen3-235B-A22B, Agent Workflow Memory improves StuGPA from 16.03 to 17.81 and total success from 8.52% to 10.12%, while reducing average turns from 16.95 to 13.96. Reflexion produces only marginal improvement.
These results should be read as evidence of learnability, not victory. The benchmark is not random chaos. Successful trajectories contain patterns that can be distilled. Workflow memory can reduce wasted steps. Fine-tuning can make smaller models behave better. Good.
But none of these mechanisms produces anything close to human-level proactive continuity. The gains are real but bounded. The agent becomes somewhat less incompetent; it does not become reliably responsible. There is a difference, even if a demo deck may prefer not to dwell on it.
For business teams, this means self-improvement loops should be treated as one layer in an agent stack, not as a magic solvent. A deployment architecture still needs explicit checks around commitments, state transitions, escalation, and memory integrity. “The agent will learn from experience” is not an operating control unless the learning process is itself observable, testable, and bounded.
What this means for enterprise agent design
The paper directly shows that current LLM agents struggle in a simulated, longitudinal, stateful environment. Cognaptus’ inference is that enterprise agent systems need to be designed less like chatbots with tools and more like operational actors with persistent state, memory governance, and proactive routines.
Here is the practical translation.
| Paper result | Business interpretation | Design response |
|---|---|---|
| Low StuGPA under stateless evaluation | Model intelligence does not equal workflow continuity | Use persistent task state, not just prompt history |
| Low LTRR | Agents fail to preserve and retrieve future-relevant information | Build typed memory: commitments, facts, decisions, dependencies, exceptions |
| Low PIS | Agents do not reliably self-initiate at the right time | Add schedulers, watchdogs, and explicit trigger policies |
| Naive RAG hurts | More retrieved text can increase operational noise | Use salience scoring, memory schemas, and retrieval filters |
| MemGPT and All-in-One improve results | Structured memory plus planning plus skills is better than isolated components | Combine memory architecture, procedural playbooks, and proactive checks |
| Perfect Context yields high model success | The model can solve many tasks once the right context is supplied | Invest in context assembly and state reconstruction before blaming reasoning |
This is where many agent projects quietly go wrong. They treat memory as a retrieval feature, not a governance layer. A business agent needs to know what kind of thing it is storing. A client preference is not the same as a contractual obligation. A meeting note is not the same as a deadline. A temporary workaround is not the same as a validated procedure. A failed tool call is not an amusing log entry; it may be the first domino in a broken workflow.
A serious enterprise agent should therefore maintain different memory classes:
- Commitment memory: deadlines, promised follow-ups, scheduled work, approvals due.
- Entity memory: clients, vendors, internal owners, relationships, permissions.
- Decision memory: what was decided, by whom, when, and under what assumptions.
- Exception memory: deviations from standard policy, unusual constraints, edge cases.
- Procedure memory: reusable workflows and tool sequences.
- Evidence memory: source documents, audit trails, citations, and supporting artefacts.
- Failure memory: tool errors, rejected actions, misfires, and recovery steps.
The paper does not provide this enterprise taxonomy. That is Cognaptus’ operational inference. But it follows naturally from the failure modes. If an agent cannot distinguish a simple reminder from complex data storage, it needs memory types. If it cannot distinguish signal from noise, it needs salience. If it forgets to act, it needs triggers. If it loses constraints after a tool error, it needs state recovery.
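As a sketch of what "memory types" could mean in practice, the seven classes above can be encoded as an enum with a validation step at write time. The field names and validation rules are assumptions layered on Cognaptus' taxonomy, not anything specified by the paper.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class MemoryClass(Enum):
    COMMITMENT = "commitment"
    ENTITY = "entity"
    DECISION = "decision"
    EXCEPTION = "exception"
    PROCEDURE = "procedure"
    EVIDENCE = "evidence"
    FAILURE = "failure"


@dataclass(frozen=True)
class MemoryRecord:
    """Hypothetical typed memory entry."""
    kind: MemoryClass
    content: str
    source: str
    due: Optional[datetime] = None


def validate(record: MemoryRecord) -> list[str]:
    """Return the reasons a record should be rejected; an empty list means it is usable."""
    problems = []
    if not record.content.strip():
        problems.append("empty content")
    if not record.source.strip():
        problems.append("missing source")
    if record.kind is MemoryClass.COMMITMENT and record.due is None:
        problems.append("commitment without a due time")
    return problems


# A vague handle is rejected at write time instead of discovered at execution time.
vague = MemoryRecord(MemoryClass.COMMITMENT, content="  ", source="")
print(validate(vague))
```

Different classes can carry different rules: a commitment needs a time, an exception needs the policy it deviates from, and evidence needs a source that can be reopened.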
None of this is glamorous. That is usually a good sign.
The boundary: StuLife is a simulator, not a workplace ROI model
The main limitation is not that StuLife uses a university setting. That is a reasonable proxy for long-horizon personal agency. The limitation is that it remains a deterministic simulated environment with designed tasks, controlled tools, and benchmark-specific scoring.
So the numeric results should not be read as direct forecasts of enterprise ROI. GPT-5 scoring 17.90/100 in StuLife does not mean a GPT-5-based procurement agent will complete exactly 17.90% of procurement work. That would be fake precision, the most confident branch of nonsense.
The transferable result is the failure pattern:
- Agents fail to preserve information that will matter later.
- They store vague handles instead of executable detail.
- They do not reliably self-initiate from time cues.
- They retrieve noisy context and miss rare signals.
- They recover poorly from tool-use errors.
- They improve when memory, skills, and proactivity are engineered together.
Those patterns are highly relevant to business workflows. But each deployment domain still needs its own evaluation harness. A legal intake agent, hotel revenue assistant, finance reconciliation agent, or construction procurement agent will have different failure costs and different state variables. StuLife gives the diagnostic shape, not the implementation checklist.
There is also a deeper open problem around “knowledge internalisation.” The paper frames it as a necessary pillar: explicit lessons should eventually become implicit capability. That is plausible, but operationally tricky. In regulated or high-stakes workflows, internalised behaviour must remain inspectable. If the agent “just knows” how to handle an exception but cannot explain the source policy or prior precedent, the business has not gained intelligence. It has gained a charming compliance hazard.
The benchmark’s quiet message: agents need executive function
The easiest reading of this paper is that agents need better memory. True, but incomplete.
The sharper reading is that agents need something closer to executive function: the capacity to decide what matters, preserve it properly, revisit it at the right time, initiate action, and update behaviour based on consequences.
That is why the student setting works. A good student does not merely answer questions. A good student attends class, tracks assignments, notices what will be tested, remembers strange rules, asks advisors at the right moment, books rooms, follows up, and learns which routines are worth automating. Intelligence shows up as continuity.
Current LLM agents can look impressive inside a single context window and still collapse across a semester. That should make operators cautious, but not pessimistic. The paper’s context engineering results show that better scaffolding helps. Structured memory helps. Workflow memory helps. Proactive prompts help. Skill recipes help. Combined systems help more.
The lesson is not “agents are useless.” The lesson is that agents are not finished products just because the model can talk. They are systems. Systems need memory schemas, task state, triggers, audits, recovery paths, and learning loops. The base model is the cognitive engine, not the whole vehicle.
Or, to put it in campus terms: the model can pass the quiz when you put the notes on the desk. The agent still needs to learn how to bring the notes to class.
Cognaptus: Automate the Present, Incubate the Future.
[1] Yuxuan Cai et al., “Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark,” arXiv:2508.19005, 2025. https://arxiv.org/abs/2508.19005