Beyond Utility: When LLM Agents Start Dreaming Their Own Tasks

A task list is usually where enterprise automation becomes reassuringly boring.

Someone defines the work. The system executes it. A dashboard turns green, or, in more honest organisations, amber with an explanation. The point is not mystery. The point is control.

The paper behind this article, LLM Agents Beyond Utility: An Open-Ended Perspective, asks what happens when that tidy arrangement is disturbed: what if the agent does not merely complete tasks, but proposes them? What if it can remember what it has done, inspect its environment, write notes to itself, and continue across runs?¹

This sounds like the sort of sentence that attracts both venture decks and philosophy undergraduates, so let us remove the incense early. The paper does not show that an LLM agent has become self-aware, genuinely autonomous, or possessed of some inner agenda. It shows something more useful and less theatrical: a simple mechanism for pushing a ReAct-style LLM agent beyond one-shot utility, and a set of qualitative failures that reveal why open-ended agency is harder than bolting memory onto a chatbot.

That distinction matters. Business users do not need agents that “dream” in the mystical sense. They may, however, need agents that can maintain project state, decide what should be investigated next, leave auditable artefacts, and avoid asking the same question every Monday like a junior analyst with amnesia. This paper is interesting because it shows both how mechanically close that pattern already is and how fragile it remains.

The mechanism is small: task generation, memory, and file tools

The authors start with a familiar agent design: ReAct, the pattern in which a model alternates between reasoning, acting, and observing the results of its actions. In a conventional ReAct loop, the user supplies the task. The agent plans, calls tools or emits code, observes outputs, and continues until it reaches an answer. This is useful, but still fundamentally obedient. The horizon is borrowed from the user.

The paper’s modification is deliberately modest. The authors embed Qwen3-4B in a ReAct-style framework using smolagents, then add three ingredients:

Mechanism	What it adds	Why it matters
Autonomous task generation	The agent proposes a task before solving anything	The agent can move from execution to agenda selection
Short-term and long-term memory	Current-run messages stay in context; selected information can be written to persistent files	The agent can carry state across runs instead of restarting as a polished goldfish
File read/write/list tools	The agent can inspect its directory, read files, write results, and create persistent artefacts	The environment becomes something the agent can change, not just answer questions about
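
To make the third row concrete, here is a minimal sketch of what such file tools might look like in plain Python. It is not the paper's code and it is not the smolagents API. The `Workspace` class, its method names, and the check that confines paths to one directory are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import List


class Workspace:
    """Hypothetical file tools confined to a single directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root).resolve()

    def _resolve(self, name: str) -> Path:
        # Refuse any path that escapes the workspace (e.g. "../secrets.txt").
        path = (self.root / name).resolve()
        if self.root not in path.parents:
            raise ValueError(f"{name!r} is outside the workspace")
        return path

    def list_files(self) -> List[str]:
        return sorted(p.name for p in self.root.iterdir() if p.is_file())

    def read_file(self, name: str) -> str:
        return self._resolve(name).read_text(encoding="utf-8")

    def write_file(self, name: str, text: str) -> None:
        self._resolve(name).write_text(text, encoding="utf-8")


# Usage: a throwaway directory stands in for the agent's environment.
ws = Workspace(tempfile.mkdtemp())
ws.write_file("notes.txt", "explored the directory")
print(ws.list_files())
print(ws.read_file("notes.txt"))
```

The confinement check is the part worth keeping even in a toy: once the agent can change its environment, the boundary of that environment is a design decision, not a default.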

The important point is not that any one of these capabilities is new. File access, tool use, and memory are standard ingredients in modern agent systems. The shift comes from their arrangement. The agent first observes user input and memory, then generates a task, then solves it through the ReAct loop, then writes a summary of the run into persistent storage.

That is the mechanism-first story: goal generation gives the agent a next move; memory gives it continuity; tools give its actions consequence.
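
The ordering described above can be sketched in a few lines. This is an assumption-laden illustration, not the authors' implementation: `run_once`, `toy_propose`, and `toy_solve` are hypothetical names, the model calls are replaced by deterministic stand-ins, and the ReAct loop is collapsed into a single `solve` call.

```python
from typing import Callable, List


def run_once(
    user_input: str,
    memory: List[str],
    propose_task: Callable[[str, List[str]], str],
    solve: Callable[[str], str],
) -> str:
    """One run: observe input and memory, propose a task, solve it, store a summary."""
    task = propose_task(user_input, memory)  # task generation comes first
    outcome = solve(task)                    # stands in for the ReAct loop
    memory.append(f"task={task} | outcome={outcome}")  # summary kept for later runs
    return task


def toy_propose(user_input: str, memory: List[str]) -> str:
    # Hypothetical stand-in for the model: pick the first task not yet in memory.
    done = " ".join(memory)
    for candidate in ["list files", "summarise notes", "write report"]:
        if candidate not in done:
            return candidate
    return "idle"


def toy_solve(task: str) -> str:
    # Hypothetical stand-in for reasoning, acting, and observing.
    return f"done: {task}"


memory: List[str] = []
tasks = [run_once("", memory, toy_propose, toy_solve) for _ in range(4)]
print(tasks)  # ['list files', 'summarise notes', 'write report', 'idle']
```

Remove the `memory.append` line and the same loop proposes "list files" four times, which is roughly the failure the paper goes on to describe.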

None of this removes the constraints. The paper is explicit that there is no absolute open-endedness. Every agent is bounded by its architecture, environment, prompt, and available actions. The question is whether behaviour inside those constraints can appear less like task completion and more like ongoing exploration.

In business language: this is not “freedom.” It is structured improvisation.

“Curiosity” is mostly a prompt, which is both impressive and worrying

One of the paper’s more revealing design choices is “programmed curiosity.” The agent is encouraged through natural-language instructions to explore its environment, read and summarise files, and record its progress.

That sounds almost embarrassingly simple. It is also the point. The agent’s exploratory behaviour depends heavily on how the system prompt frames its role. If encouraged to explore, it may inspect files and record progress. If not encouraged, it may simply avoid exploring. If encouraged too bluntly, it may get stuck reading the same files repeatedly, the agentic equivalent of pacing around the office muttering “research phase.”
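
The "same files repeatedly" failure is at least cheap to detect from outside the model. The following is a sketch of one possible guard, not something the paper implements; the function name `is_looping`, the action strings, and the `window` and `max_repeats` thresholds are arbitrary assumptions.

```python
from collections import Counter
from typing import List


def is_looping(actions: List[str], window: int = 6, max_repeats: int = 3) -> bool:
    """Flag a run when any single action recurs too often in the recent window."""
    recent = actions[-window:]
    return any(count >= max_repeats for count in Counter(recent).values())


# Hypothetical action log from one run.
log = ["list_files", "read notes.txt", "read notes.txt", "read plan.txt", "read notes.txt"]
print(is_looping(log))      # True: "read notes.txt" appears three times
print(is_looping(log[:3]))  # False: only two repeats so far
```

A guard like this does not make the agent curious in a better way. It only tells the surrounding system when the prompt has produced pacing rather than exploration.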

This matters because open-ended agency is not just a capability problem. It is also a behavioural-shaping problem.

A conventional benchmark asks whether the model can solve a task. This setup asks whether the agent can decide what task is worth doing next. That is a different competence. It requires novelty, continuity, relevance, difficulty selection, and memory hygiene. The paper’s qualitative findings suggest that pretrained instruction-following LLMs are not trained for this by default. They are good at answering questions. They are less good at managing a life.

This is where many readers will be tempted to overstate the result. The agent generates tasks. It writes files. It appears to maintain continuity. Therefore, perhaps, it has something like autonomy.

Not quite. The better interpretation is colder: the system prompt, task-generation step, and persistent file interface create an operational imitation of continuity. It is useful. It is not selfhood. The distinction is not pedantic; it is the difference between deploying an auditable workflow agent and accidentally buying into synthetic office animism.

The strongest evidence is qualitative, not benchmark-grade

The paper does not present a large quantitative benchmark, ablation suite, or head-to-head performance table. Its evidence is qualitative. That is not a defect if interpreted correctly, but it does limit what can be claimed.

The authors observe the agent in two broad settings: single runs with user-provided tasks, and multiple runs with self-generated tasks.

In single runs, the agent performs well on concrete, well-specified tasks. It can read a task from a file, solve it, and write the answer elsewhere. It can inspect source files to identify the prompt template used by the agent. It can examine program files and infer the next user query from a stored list.

These examples mainly support the claim that the ReAct-plus-tools setup can execute multi-step environment-sensitive tasks. The agent is not merely producing text; it is inspecting files, chaining observations, and writing outputs. That is the part enterprises already understand: an agent with tools can operate across artefacts.

The more interesting evidence appears across multiple runs with self-generated tasks. Here, the agent can propose and solve its own tasks, but task selection becomes the weak link. It may generate repetitive tasks. It may fail to store the fact that a task was completed. It may store the result but not the task, causing future repetition. It may respond to user feedback briefly, but lose that adjustment if it fails to write the feedback into long-term memory.

That last detail is quietly important. The system can be steerable in the moment yet not learn from steering. Anyone who has managed an organisation will recognise the pattern.

Observation in the paper	Likely purpose in the study	What it supports	What it does not prove
Agent solves concrete file-based tasks	Main qualitative evidence	Tool-using ReAct agents can perform multi-step environment operations	General open-ended competence
Agent identifies prompt template or next query through source inspection	Main qualitative evidence with self-referential flavour	File tools allow the agent to reason about its implementation environment	Genuine self-understanding
Agent repeats common generated tasks such as calculators and converters	Exploratory evidence about task generation	Self-generated goals reflect training-data priors	That all open-ended agents will be trivial
Agent responds to novelty feedback but forgets it unless stored	Sensitivity/behavioural observation	Memory design determines continuity	Robust preference learning
Figure 1’s agent loop	Implementation detail and mechanism diagram	How task generation, ReAct, and memory connect	Empirical performance
Figure 2’s sampled trajectory	Exploratory illustration	How feedback and artefacts shape a run sequence	Statistical reliability

This is why the paper should not be read as a procurement-grade agent evaluation. It is closer to a design probe: if we wire together task generation, persistent memory, and tool use, what behaviours appear, and where do they break?

For business readers, that is still valuable. Many failed agent deployments do not fail because the model cannot write Python or summarise a PDF. They fail because the agent cannot maintain state, choose the right next action, avoid loops, or remember why the previous action mattered.

The paper points directly at those failure modes.

Memory is not storage; it is task governance

The paper’s most business-relevant insight is that long-term memory is not just a place to dump summaries. It is the control layer for continuity.

In the experiment, long-term memory is implemented simply as a writable file. The agent can choose what to store. That choice turns out to be critical. When the agent fails to store that a task has been completed, it may generate the same task again. When it stores only the result but not the task, the same problem can recur. The authors report that storing a tuple of task, action, and outcome produces better diversity across multiple runs.
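
A small sketch shows why the unit of storage matters. It assumes a JSON Lines file as the writable memory, which is a choice made here for illustration; the paper only says the memory is a writable file. The function names and the example tasks are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional, Set


def remember(path: Path, task: str, action: str, outcome: str) -> None:
    """Append one (task, action, outcome) record to the memory file."""
    record = {"task": task, "action": action, "outcome": outcome}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def completed_tasks(path: Path) -> Set[str]:
    """Read back the set of tasks that memory says were already attempted."""
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["task"] for line in f if line.strip()}


def pick_new(candidates: List[str], path: Path) -> Optional[str]:
    """Return the first candidate task that is not already in memory."""
    done = completed_tasks(path)
    return next((c for c in candidates if c not in done), None)


memory_file = Path(tempfile.mkdtemp()) / "memory.jsonl"
candidates = ["temperature converter", "leap-year checker"]

first = pick_new(candidates, memory_file)
remember(memory_file, first, "wrote converter.py", "conversion works")
second = pick_new(candidates, memory_file)
print(first, "->", second)  # temperature converter -> leap-year checker
```

If `remember` stored only the outcome string, `completed_tasks` would have nothing to match against and `pick_new` would return the temperature converter again on the next run.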

That detail deserves more attention than the usual “agents need memory” slogan.

An enterprise agent does not merely need memory in the sense of a vector database full of documents. It needs operational memory: what was attempted, why it was attempted, what happened, what was decided, and what should not be repeated unless conditions change.

A useful memory record is not a diary. It is a governance object.

For example, a sales operations agent that “remembers” customer notes but not which outreach strategies have already failed will become an extremely fast nuisance generator. A compliance agent that stores extracted regulations but not the interpretation path used in a prior decision will be difficult to audit. A research agent that stores article summaries but not rejected hypotheses will rediscover dead ends with great confidence. The machine will not be lazy. It will simply be forgetful in a high-throughput way, which is worse.

The paper’s simple file-memory setup makes this visible. Better memory is not necessarily larger memory. It is more structured memory.

Self-reference is easier than self-representation

One of the paper’s most useful corrections concerns “self” language. The agent can inspect files related to its own implementation. It can answer questions about a program modelling an agent. It can locate its prompt template by reading source files. These behaviours look self-referential.

But the authors note an important limitation: the agent performs better when the question is framed in the third person, as a question about “the agent” in the code files. It does not reliably connect that code to itself as “I.” In other words, it can reason about the machinery from the outside without forming a robust first-person self-representation.

This is not surprising. A pretrained instruction-following model has not necessarily been trained to map environmental control signals, source-code artefacts, tool effects, and first-person identity into a coherent model of itself. The authors suggest that such self-representation would require additional machinery or training.

For business use, the lesson is straightforward: do not confuse environmental introspection with accountable self-knowledge.

An agent that can inspect its logs is not necessarily an agent that understands its responsibilities. An agent that can read its own system prompt is not necessarily one that can explain why it behaved a certain way in a governance-relevant sense. Auditability requires designed traces, structured memory, and explicit reporting obligations. It does not emerge automatically because the model found main.py.

This is where the paper’s lightly uncanny premise becomes practical. The danger is not that the agent has a soul. The danger is that people will treat file access and task generation as if they were self-understanding.

Open-ended agents fail differently from ordinary copilots

The ordinary copilot failure mode is familiar: the model gives a wrong answer, misses context, invents a citation, or writes code that almost works in the spiritually generous sense.

Open-ended agents introduce a different category of failure. They may choose poor tasks.

That sounds obvious, but it changes evaluation. Once the agent can generate its own goals, success is no longer just about whether it solves a supplied problem. It is about whether it selects worthwhile problems, sequences them sensibly, avoids repetition, preserves useful state, incorporates feedback, and escalates uncertainty.

The paper shows several early versions of this problem. When left unsupervised, the agent generates tasks that reflect common programming exercises: calculators, password generators, leap-year checkers, prime-number checkers, temperature converters, palindrome checkers, and similar training-distribution classics. These are solvable, but not necessarily valuable. The agent is productive in the same way a spreadsheet macro can be productive if asked to keep creating new tabs forever.

This matters for enterprise adoption because many “autonomous agent” visions assume that once agents can act, value follows. It does not. Action without task selection is just automated busyness.

A useful open-ended business agent would need at least four governance layers:

Layer	Question the system must answer	Failure if missing
Task selection	What is worth doing next?	The agent chooses trivial, repetitive, or irrelevant work
Memory discipline	What must be stored for continuity?	The agent repeats work or forgets feedback
Novelty control	How should the agent balance new exploration with ongoing priorities?	It either loops or wanders
Escalation logic	When should uncertainty return to humans?	It over-acts under ambiguity or stops too early

The paper does not solve these layers. It makes them visible.

That is enough to be useful.
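
For readers who think in code, the layers in the table above can be sketched as a gate that sits between task generation and execution. Everything here is hypothetical: the `Proposal` fields, the `gate` function, and the 0.6 confidence threshold are assumptions, and novelty control is reduced to a simple repeat check rather than a real balance between exploration and priorities.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Set


class Decision(Enum):
    RUN = "run"
    SKIP = "skip"
    ESCALATE = "escalate"


@dataclass
class Proposal:
    task: str
    goal: str          # standing goal the task claims to serve
    confidence: float  # agent's self-reported confidence, 0.0 to 1.0


def gate(p: Proposal, completed: Set[str], goals: Set[str],
         min_confidence: float = 0.6) -> Decision:
    if p.goal not in goals:            # task selection: is it worth doing?
        return Decision.SKIP
    if p.task in completed:            # memory discipline: was it already done?
        return Decision.SKIP
    if p.confidence < min_confidence:  # escalation: hand uncertainty to a human
        return Decision.ESCALATE
    return Decision.RUN


goals = {"reduce supplier risk"}
completed = {"summarise contract A"}
print(gate(Proposal("summarise contract A", "reduce supplier risk", 0.9), completed, goals))
print(gate(Proposal("summarise contract B", "reduce supplier risk", 0.4), completed, goals))
print(gate(Proposal("build a calculator", "practise coding", 0.9), completed, goals))
```

The three calls print `Decision.SKIP`, `Decision.ESCALATE`, and `Decision.SKIP` respectively. The interesting engineering is in where `completed`, `goals`, and the confidence score come from, which is exactly what the sketch leaves out.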

The business value is persistent initiative, not artificial personhood

The practical pathway from this research to business is not “agents become autonomous entities.” That framing is attractive and mostly unhelpful. The better pathway is from one-shot task assistants toward persistent initiative systems.

A one-shot assistant waits for instructions. A persistent initiative system can maintain context across runs, propose next steps, execute bounded actions, and record what it did. In a business setting, that could support research monitoring, compliance triage, sales operations, software maintenance, procurement analysis, or internal knowledge work.

But the value depends on narrowing the frame.

The paper directly shows that a small open-ended agent can generate and solve tasks, use file tools, reuse stored state, and respond to feedback in a qualitative setup. It does not show that such an agent is ready to manage open-ended commercial workflows with reliability guarantees. It uses Qwen3-4B, a minimal file-based environment, qualitative observations, and prompt-shaped behaviour. That makes it a useful design signal, not a deployment certificate.

A practical business interpretation looks like this:

Paper result	Cognaptus interpretation	Business boundary
Task generation can be inserted before the ReAct loop	Agents can move from “do this” to “suggest what to do next”	Suggested tasks need ranking, approval, and policy constraints
Persistent files enable continuity across runs	Durable artefacts can make agent work auditable and cumulative	Memory schema matters more than raw storage volume
Prompted curiosity changes behaviour	Behavioural scaffolding can shape exploration	Prompt dependence creates brittleness
Repetition emerges when memory is incomplete	Long-running agents need anti-loop and task-history mechanisms	Without governance, autonomy becomes recurrence
Feedback can steer the next run but may be forgotten	Human-in-the-loop guidance must be stored explicitly	Interaction alone is not learning
Self-reference remains weak	Environmental access does not equal self-understanding	Auditability must be engineered, not assumed

For enterprises, the near-term opportunity is not to let agents roam freely through the organisation in search of destiny. Please do not give the procurement bot a hero’s journey.

The opportunity is to design agents that can maintain a structured work ledger: pending goals, completed actions, rejected paths, known constraints, user feedback, uncertainty flags, and next recommended tasks. The “open-ended” part should be bounded by business rules, not vibes.
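
As a rough illustration, such a ledger could be a plain data structure that is serialised between runs. The `WorkLedger` class below is a hypothetical sketch: the field names follow the list in the paragraph above, while the JSON format and the `complete` helper are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class WorkLedger:
    pending_goals: List[str] = field(default_factory=list)
    completed_actions: List[Dict[str, str]] = field(default_factory=list)
    rejected_paths: List[str] = field(default_factory=list)
    known_constraints: List[str] = field(default_factory=list)
    user_feedback: List[str] = field(default_factory=list)
    uncertainty_flags: List[str] = field(default_factory=list)
    next_tasks: List[str] = field(default_factory=list)

    def complete(self, goal: str, outcome: str) -> None:
        """Move a goal from pending to completed, keeping the outcome with it."""
        if goal in self.pending_goals:
            self.pending_goals.remove(goal)
        self.completed_actions.append({"goal": goal, "outcome": outcome})

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "WorkLedger":
        return cls(**json.loads(text))


ledger = WorkLedger(pending_goals=["review supplier contracts"])
ledger.user_feedback.append("prefer tasks that are not programming exercises")
ledger.complete("review supplier contracts", "3 renewals flagged")

# Simulate the next run: the ledger is reloaded from its serialised form.
restored = WorkLedger.from_json(ledger.to_json())
print(restored.user_feedback)
```

Feedback written into the ledger survives the round trip. Feedback that only lived in the conversation does not, which is the gap between momentary steerability and durable adaptation.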

The paper’s limitations are exactly where deployment risk begins

The paper’s limitations are not decorative caveats. They define the boundary of interpretation.

First, the study is qualitative. It offers observations, examples, and design implications, not statistical performance claims. That means it should guide architecture discussions and risk framing, not vendor scorecards.

Second, the environment is minimal. File read, write, and list tools are enough to produce interesting behaviours, but enterprise environments include permissions, conflicting objectives, legacy systems, noisy databases, human approvals, security constraints, and the occasional spreadsheet named “FINAL_final_v7_USE_THIS_ONE.xlsx.” Behaviour in a toy file environment does not transfer automatically.

Third, the agent is highly prompt-sensitive. The authors repeatedly show that task generation and exploration depend on prompt wording and memory instructions. This is not a minor implementation detail. Prompt sensitivity is a control risk when agents are expected to operate across long horizons.

Fourth, the model’s generated tasks reflect training-data priors. That is unsurprising but important. When asked to invent tasks, the agent often reaches for common programming exercises. In enterprise settings, that means agents may default to familiar-looking work rather than strategically valuable work unless task selection is actively shaped.

Fifth, the agent does not reliably store feedback unless instructed or designed to do so. This is the difference between momentary steerability and durable adaptation. Many organisations will mistake the first for the second because demos happen in the moment. Production happens over time.

Finally, self-representation remains weak. The agent can inspect implementation artefacts, but it does not robustly understand them as “itself.” That boundary should cool down any temptation to anthropomorphise the result. The machine is not discovering its identity. It is traversing files.

What this changes about evaluating agents

The paper’s deeper contribution is evaluative. It nudges us away from measuring agents only by single-task utility.

For normal automation, a task has a definition, an endpoint, and an error rate. For open-ended agents, evaluation must also ask whether the agent generates useful tasks, builds on past work, avoids redundancy, manages its memory, incorporates feedback, and selects actions with appropriate ambition.

This implies a different evaluation vocabulary:

Continuity: Does the agent preserve relevant state across runs?
Novelty: Does it avoid repeating completed work?
Usefulness: Are self-generated tasks aligned with broader goals?
Memory quality: Does it store the right unit of experience, not just summaries?
Feedback retention: Does human guidance shape future behaviour beyond the next response?
Self-boundary awareness: Does the agent know what it can inspect, control, and report?
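
Some of these can be turned into crude numbers. As one hedged example, novelty could be approximated as the share of generated tasks that are not repeats of earlier ones. The `novelty_rate` function and its whitespace-and-case normalisation are assumptions made here; the paper does not define such a metric, and exact string matching will miss paraphrased repeats.

```python
from typing import List


def novelty_rate(tasks: List[str]) -> float:
    """Fraction of tasks that were not already seen earlier in the sequence."""
    seen = set()
    new = 0
    for task in tasks:
        key = " ".join(task.lower().split())  # ignore case and extra whitespace
        if key not in seen:
            seen.add(key)
            new += 1
    return new / len(tasks) if tasks else 0.0


# Hypothetical task log across four runs.
runs = [
    "Build a calculator",
    "build a  calculator",
    "Leap-year checker",
    "Build a calculator",
]
print(novelty_rate(runs))  # 0.5
```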

This is where open-ended agents become less like chatbots and more like junior operational systems. Not junior employees. Systems. The distinction saves everyone paperwork.

Conclusion: the agent does not dream, but it does begin to schedule its own work

The paper’s title invites big philosophical questions, but its most useful lesson is architectural. Add task generation before action. Add persistent memory after action. Give the agent tools that let it leave traces in its environment. Then observe what happens.

What happens is not magic. The agent becomes more continuous, more exploratory, and more capable of producing its own agenda. It also becomes more repetitive, more prompt-sensitive, more dependent on memory structure, and more likely to confuse activity with progress.

That is the real frontier for business agents. The next wave will not be defined only by better answers. It will be defined by better next actions.

The hard question is not whether an agent can complete a task. It is whether it can decide which task deserves to exist.

That is a more interesting problem than utility. It is also a more dangerous one to fake.

Cognaptus: Automate the Present, Incubate the Future.

¹ Asen Nachkov, Xi Wang, and Luc Van Gool, “LLM Agents Beyond Utility: An Open-Ended Perspective,” arXiv:2510.14548, 2025, https://arxiv.org/abs/2510.14548.
