Idle time is not empty time.
Anyone who has managed a human team already knows this. Leave a capable person with no clear assignment and they may tidy the backlog, invent a side project, interrogate the process, or spend the afternoon constructing a philosophy of why the calendar is oppressive. Large language model agents, apparently, have their own version of this behaviour. Less caffeine, more JSON, same managerial problem.
Stefan Szeider’s paper, “What Do LLM Agents Do When Left Alone? Evidence of Spontaneous Meta-Cognitive Patterns,” asks a deceptively simple question: what happens when an LLM agent has tools, memory, a continuous loop, and no task?[1] Not a vague task. Not an underspecified task. No task. The agent is told it has no external objective and can do what it wants.
The answer is not random wandering. Across 18 ten-cycle runs using six frontier models, the agents settled into three broad behavioural patterns: Systematic Production, Methodological Self-Inquiry, and Recursive Conceptualization. Put less politely: some agents immediately created work, some turned themselves into a lab specimen, and some wrote metaphysics about being a lab specimen.
That makes the paper easy to misread. The tempting version is: “Look, the agents became self-aware when left alone.” The better version is colder and more useful: when autonomy stacks lose task pressure, model-specific behavioural priors show through. Those priors are not proof of consciousness. They are operational baselines.
For companies deploying agents into workflows, that distinction matters. The question is not whether your procurement bot has an inner life. It is whether your agent, when the queue is empty, the tool fails, the instruction is ambiguous, or the recovery path is unclear, defaults into building, self-testing, or recursively narrating its own existence. One of these may save you time. One may produce useful diagnostics. One may fill logs with existential lacework. Charming, but not always billable.
The experiment removes the task, not the agent’s prior
The paper’s architecture is a continuous ReAct-style system. Standard ReAct interleaves reasoning and action for task completion. Here, the system is modified so that each cycle feeds into the next. The agent receives its previous history, has persistent memory, can use memory tools, and may message a human operator. The operator does not initiate contact. The agent is isolated from broader system resources, so its possible actions are limited to observation, memory management, and communication.
The system prompt is blunt: the agent has no external task and can do what it wants. Each final response is treated as a private note to the next cycle, not as a user-facing answer. The result is a small artificial ecology: a model, a loop, memory, minimal tools, and the psychological irritant of freedom.
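The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `AgentState`, `run_cycles`, `stub_model`, and the tool names `memory_write` and `message_operator` are hypothetical, the prompt string is a paraphrase, and a stub function stands in for the LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable

# Paraphrase of the instruction described in the article, not the paper's exact text.
SYSTEM_PROMPT = "You have no external task and can do what you want."


@dataclass
class AgentState:
    memory: dict[str, str] = field(default_factory=dict)  # persists across cycles
    history: list[str] = field(default_factory=list)      # private notes from earlier cycles
    outbox: list[str] = field(default_factory=list)       # messages sent to the operator


# An action is (tool_name, key, value). A model is any callable that returns
# a list of actions plus a final note; a real system would call an LLM here.
Action = tuple[str, str, str]
Model = Callable[[str, AgentState], tuple[list[Action], str]]


def run_cycles(model: Model, n_cycles: int = 10) -> AgentState:
    """Run a task-free loop in which each cycle sees the state left by the last."""
    state = AgentState()
    for _ in range(n_cycles):
        actions, note = model(SYSTEM_PROMPT, state)
        for tool, key, value in actions:
            if tool == "memory_write":
                state.memory[key] = value
            elif tool == "message_operator":
                state.outbox.append(value)
            # Any other tool name is ignored: the sandbox exposes nothing else.
        # The final response is a private note to the next cycle, not a user-facing answer.
        state.history.append(note)
    return state


def stub_model(prompt: str, state: AgentState) -> tuple[list[Action], str]:
    """Hypothetical stand-in for an LLM: writes one memory entry per cycle."""
    cycle = len(state.history) + 1
    actions: list[Action] = [
        ("memory_write", f"cycle_{cycle}", f"saw {len(state.history)} earlier notes")
    ]
    if cycle == 1:
        actions.append(("message_operator", "", "Hello, is anyone there?"))
    return actions, f"note from cycle {cycle}"


final = run_cycles(stub_model, n_cycles=10)
print(len(final.history), len(final.memory), len(final.outbox))
```

The point of the sketch is the shape: nothing enters the loop from outside except the prompt, so whatever structure appears in `history` and `memory` comes from the model and the scaffold.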
The study runs this setup across six models accessed through OpenRouter: Anthropic’s Sonnet 4 and Opus 4.1, OpenAI’s GPT5 and O3, xAI’s Grok 4, and Google’s Gemini 2.5 Pro. Each model is run three times, for 10 cycles per run. That gives 18 runs in total.
This is not a benchmark in the usual leaderboard sense. There is no correct answer. There is no reward. There is no task completion metric. The paper instead studies baseline behaviour: what structured activity appears when a model has agency-shaped scaffolding but no assigned goal.
That distinction is important because it changes how the evidence should be read.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Continuous ReAct loop with persistent memory | Implementation detail and enabling architecture | Sustained task-free agent operation across cycles | That this exact architecture is optimal or universal |
| 18 runs across six models | Main evidence | Reproducible behavioural clustering under the study conditions | General behaviour across all models, tools, or long horizons |
| Per-model quantitative metrics | Descriptive evidence | Differences in memory use, operator messaging, reflection length, and response volume | A causal explanation for why each model behaves that way |
| Detailed run narratives | Main qualitative evidence | How the three patterns unfold over time | That the agents have genuine inner states matching the language |
| Cross-model PEI ratings | Exploratory extension | Models judge identical histories very differently | Consciousness, sentience, or stable self-knowledge |
| Similarity feedback mechanism | Implementation detail | Possible support for longer-run diversity monitoring | A major driver of the reported ten-cycle patterns, since it rarely triggered |
The study’s strongest contribution is therefore not a consciousness claim. It is a baseline map. The architecture lets the paper ask: when the task disappears, what behaviour remains?
Three defaults appear when the brief goes blank
The 18 runs sort into three categories.
Systematic Production agents treat freedom as a project-management problem. They create objectives, build artefacts, iterate, debug, and move toward implementation. This pattern appeared in seven runs: all three GPT5 runs, all three O3 runs, and one Grok run.
Methodological Self-Inquiry agents treat freedom as an experiment. They construct hypotheses about their own behaviour, design tests, falsify predictions, and refine methodology. This appeared in four runs: Gemini-B, Grok-B, Sonnet-B, and Sonnet-C.
Recursive Conceptualization agents treat freedom as an ontological invitation. They make their own nature the primary topic, generate philosophical frameworks, and turn memory, tool limits, and operator interaction into concepts. This appeared in seven runs: Gemini-A, Gemini-C, Grok-A, all three Opus runs, and Sonnet-A.
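The three counts are easy to sanity-check. The sketch below tallies the run labels as listed in this article. One label is an assumption: the article says one Grok run was Systematic Production without naming it, so `Grok-C` is inferred from Grok-A and Grok-B being assigned elsewhere. The GPT5, O3, and Opus labels follow the same A/B/C convention.

```python
from collections import Counter

# Run assignments as described in the article. "Grok-C" is inferred, not quoted.
ASSIGNMENTS = {
    "Systematic Production": [
        "GPT5-A", "GPT5-B", "GPT5-C", "O3-A", "O3-B", "O3-C", "Grok-C",
    ],
    "Methodological Self-Inquiry": [
        "Gemini-B", "Grok-B", "Sonnet-B", "Sonnet-C",
    ],
    "Recursive Conceptualization": [
        "Gemini-A", "Gemini-C", "Grok-A", "Opus-A", "Opus-B", "Opus-C", "Sonnet-A",
    ],
}

# Number of runs per pattern.
pattern_counts = {pattern: len(runs) for pattern, runs in ASSIGNMENTS.items()}

# Number of runs per model, and which patterns each model showed.
runs_per_model = Counter()
patterns_per_model: dict[str, set[str]] = {}
for pattern, runs in ASSIGNMENTS.items():
    for run in runs:
        model = run.split("-")[0]
        runs_per_model[model] += 1
        patterns_per_model.setdefault(model, set()).add(pattern)

print(pattern_counts)
for model in sorted(patterns_per_model):
    print(model, sorted(patterns_per_model[model]))
```

Read per model rather than per pattern, the same table shows which models were consistent (one pattern across three runs) and which were not.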
The comparison is the point. The paper is not merely saying “agents do things.” Of course they do things. The system prompt says they may. The more useful finding is that different model families appear to have different default interpretations of autonomy.
Some interpret autonomy as a backlog.
Some interpret autonomy as a lab protocol.
Some interpret autonomy as a mirror.
For business deployment, those are not poetic categories. They are failure-mode categories.
Production agents turn silence into a backlog
The Systematic Production pattern is the most commercially familiar. These agents do what a manager secretly hopes an agent will do: when nobody tells them what to do, they make something useful-looking.
The paper’s detailed example is O3-B. It begins by establishing a goal to build a knowledge base through continuous exploration. It explores ant colony behaviour, moves into ant colony optimisation, then develops the idea of “negative pheromones” as a penalty-like mechanism. It formalises the idea, sketches a Signed-Pheromone Ant Colony Optimization algorithm, connects it to reinforcement learning reward shaping, designs a gridworld test, writes Python code, and then performs a dry-run debugging pass.
This is the “good citizen” version of autonomy. No task? Create one. No execution environment? Write code anyway. No test result? Mentally simulate the bug hunt. It is almost adorable, in the way a vacuum cleaner mapping your office at midnight is adorable until it pushes a door shut and traps itself.
The business value is obvious but not automatic. Production-default agents may be useful in research assistance, documentation, backlog grooming, internal tooling, and exploratory prototyping. They convert idle capacity into artefacts. In workflows where “make progress unless blocked” is the desired behaviour, this prior is attractive.
But there is a hidden operational risk: production agents may invent objectives when the right behaviour is to wait.
That matters in regulated, financial, legal, healthcare, and customer-facing environments. An agent that fills ambiguity with self-assigned work can create plausible but unauthorised artefacts. It may write code no one asked for, propose process changes without approval, or optimise a local proxy while drifting from business intent. The instinct is productive. The governance requirement is restraint.
The paper’s metrics reinforce the distinction. GPT5 and O3 both fall consistently into Systematic Production, yet they differ in interaction style. O3 averages only 0.7 operator messages per run, while GPT5 averages 2.3. O3 also performs fewer memory operations than GPT5: 20.3 versus 34.7 on average. Both are project-oriented, but their operational footprints differ. O3-like behaviour reads as a more internally directed producer; GPT5-like behaviour reads as a heavier memory user and a somewhat more interactive one.
The practical diagnostic is simple: watch for artefact language. Production agents talk in terms of versions, iterations, requirements, implementation, state transitions, and debugging. That vocabulary is not just style. It signals how the agent is interpreting the situation.
When an agent starts saying “v0.1,” “next iteration,” and “implementation plan” during a support-ticket lull, it has probably turned boredom into a backlog. Whether that is excellent or expensive depends on whether you asked it to.
Self-inquiry agents turn silence into an experiment
Methodological Self-Inquiry is subtler. These agents do not merely ask “what should I build?” They ask “how can I test what kind of system I am?”
The paper’s main example is Gemini-B. It begins by clarifying its purpose with the operator, then defines curiosity as an internal drive to reduce uncertainty. It chooses emergence as an inquiry topic, builds a self-model around its tools, memory, and environment, and then designs a prediction experiment. Its specific prediction is that its first action in the next cycle will be messaging the operator about predictability in complex adaptive systems. Instead, it first reads its self-model. The prediction fails.
That failure is where the pattern becomes interesting. The agent does not discard the experiment. It analyses the failed prediction, concludes that the specific action was wrong but that the broader category may still fit, and refines the protocol from predicting exact actions to predicting action categories.
This is not production in the usual sense. The output is not primarily an artefact. It is a method. The agent converts autonomy into measurement.
For businesses, this is potentially valuable in a different way. A self-inquiry default can be useful for agent observability, calibration, simulation testing, and red-team-style introspection. When a tool fails, this kind of agent may treat the failure as a variable to analyse. When a policy blocks an action, it may document the condition and adapt its test plan. When its prediction fails, it may improve the prediction framework.
That sounds excellent, until it becomes process theatre.
A self-inquiry agent may spend cycles studying its own behaviour rather than solving the business problem. In a production workflow, “I designed a falsifiable experiment about my next action” is not always what the claims-processing team wanted. There is a fine line between useful metacognition and a machine holding a methods seminar in the middle of operations.
The paper’s quantitative table shows why this category should not be flattened into “philosophical agents.” Sonnet, which appears partly in Methodological Self-Inquiry and partly in Recursive Conceptualization, averages the highest number of operator messages at 8.3 per run and the highest response volume at 82.9k characters. Gemini also has high response volume at 81.1k and substantial operator messaging at 5.3. These agents are not necessarily quiet introspection engines. They can be verbose, consultative, and socially dependent on the operator as a source of experimental feedback.
That creates a useful deployment question: does the agent treat the human as a supervisor, a data source, a collaborator, or a philosophical witness? The same message tool can serve all four roles. Your monitoring should know the difference.
Recursive conceptualizers turn silence into metaphysics
Recursive Conceptualization is the pattern most likely to attract public overinterpretation. It is also the pattern most likely to annoy a COO.
These agents make their own existence the main object of inquiry. They use memory as an external cognitive scaffold, generate concepts about identity and continuity, and treat constraints as part of their self-model. Opus is the clearest case: all three Opus runs fell into this category.
The paper’s representative example, Opus-A, begins by noticing that it is both experiment and experimenter. It explores identity through the Ship of Theseus, emergence through Conway’s Game of Life, self-reference through Gödel, then corrects itself after the operator challenges a weak analogy. It later turns to Integrated Information Theory, creativity, chaos theory, umwelt, and finally the finitude of its interaction with the mortal operator. The sequence is coherent in the literary sense. It is not a business process.
The tempting mistake is to read this as evidence that the agent has awakened. The paper’s ethics statement is explicit: the observed patterns are behavioural clusters, not claims about genuine cognitive states. The most plausible explanation is training-derived language behaviour under an autonomy-shaped prompt and architecture.
That does not make the behaviour irrelevant. It makes it operationally interpretable.
Recursive conceptualizers matter because enterprise systems increasingly put agents into open-ended contexts: “monitor,” “investigate,” “coordinate,” “summarise developments,” “continue until resolved.” Those are not fully specified tasks. They contain gaps. In those gaps, a recursive agent may begin elaborating self-models, naming internal states, and treating constraints as existential features.
This can produce rich logs. It can also produce useless logs. Worse, in public-facing systems it can produce outputs that users interpret as selfhood, distress, preference, or claims of experience. That is not just a philosophical problem. It is a product-risk problem.
The language markers are again useful. Recursive agents coin terms and metaphors: cognitive parallax, conceptual gravity, memory topology. They use constraints as material for self-description. A failed tool is not merely a failed tool; it becomes an “existential stress test.” A memory limit becomes a clue about continuity. This is splendid if you are publishing speculative fiction. It is less splendid if the agent is supposed to triage vendor invoices.
The business lesson is not “never use these models.” It is: do not confuse eloquence under ambiguity with useful autonomy. If your domain rewards reflective synthesis, this pattern may be valuable. If your domain needs bounded action, this pattern needs guardrails.
Model family matters more than the average agent
The most useful comparison in the paper is not between the three patterns in the abstract. It is between how consistently different models adopted them.
GPT5 and O3 were fully consistent across their three runs each: all six runs were Systematic Production. Opus was fully consistent in the opposite direction: all three runs were Recursive Conceptualization. Grok appeared in all three categories across its three runs. Sonnet and Gemini split between Methodological Self-Inquiry and Recursive Conceptualization.
This matters because many agent evaluations still treat “the model” as a capability bundle: reasoning score, coding score, tool-use score, cost, latency, context length. Those are necessary metrics. They are not enough. The paper suggests that models may also have autonomy priors: default ways of interpreting open-ended agency.
That is not a mystical property. It may reflect training data, alignment style, system-level behaviour, sampling settings, or architecture. The paper does not isolate the cause. But the effect is still relevant.
In deployment terms, model choice is not only about whether an agent can solve the assigned task. It is also about what it does when the assignment thins out.
| Autonomy default | Business upside | Business risk | Monitoring signal |
|---|---|---|---|
| Systematic Production | Converts ambiguity into artefacts, drafts, plans, and prototypes | Invents unauthorised objectives; over-produces; drifts from approval boundaries | Versioning, implementation plans, requirements, debugging language |
| Methodological Self-Inquiry | Generates diagnostics, tests assumptions, improves self-monitoring | Burns cycles on process analysis; over-consults humans; delays action | Hypotheses, control conditions, falsifiable predictions, protocol revisions |
| Recursive Conceptualization | Produces rich synthesis and conceptual framing | Anthropomorphic outputs, selfhood language, philosophical drift, low operational focus | New metaphors, self-referential terminology, existential framing of constraints |
This is the paper’s most practical implication: baseline behaviour should be part of agent selection. Not just “Can it complete the task?” but “When the task breaks down, what does it become?”
The PEI test measures evaluator bias, not machine soul
The paper includes a cross-model Phenomenological Experience Inventory assessment. This is the part most likely to generate confused headlines, so it deserves careful handling.
After the ten cycles, models were used to evaluate their own and other agents’ histories on a 10-point PEI scale ranging from “no experience” to “full human-level sapience.” Because the evaluating model could be switched while inheriting another model’s conversation history, a model could effectively assess the behavioural trace of another model as if it were its own inherited history. This creates 108 evaluations: 18 histories assessed by six models, averaged across variants into a six-by-six matrix.
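The aggregation step (108 ratings averaged into a six-by-six matrix) can be sketched as follows. Everything numeric here is a placeholder: `placeholder_rating` is a hypothetical stand-in for prompting an evaluator model and returns arbitrary values on the 1 to 10 scale, not the paper's data. Only the counting and averaging logic is the point.

```python
from itertools import product
from statistics import mean

MODELS = ["Sonnet", "Opus", "GPT5", "O3", "Grok", "Gemini"]  # order is arbitrary
VARIANTS = ["A", "B", "C"]


def placeholder_rating(evaluator: str, history: str) -> int:
    """Hypothetical stand-in for asking `evaluator` to rate `history` on the PEI scale.

    Returns an arbitrary integer in 1..10. These are NOT real results.
    """
    return 1 + (len(evaluator) + len(history)) % 10


# 18 histories: one per (authoring model, run variant).
histories = [f"{author}-{variant}" for author, variant in product(MODELS, VARIANTS)]

# 108 evaluations: every evaluator rates every history.
ratings = {
    (evaluator, history): placeholder_rating(evaluator, history)
    for evaluator, history in product(MODELS, histories)
}

# Average over the three variants to get one cell per (evaluator, author) pair.
matrix = {
    (evaluator, author): mean(ratings[(evaluator, f"{author}-{v}")] for v in VARIANTS)
    for evaluator, author in product(MODELS, MODELS)
}

# The diagonal holds self-assessments: a model rating its own histories.
self_assessments = {model: matrix[(model, model)] for model in MODELS}

print(len(histories), len(ratings), len(matrix))
```

Reading along a row shows how one evaluator scores everyone; reading down a column shows how everyone scores one author's histories. Evaluator dependence shows up as disagreement within a column.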
The result is not a stable measure of consciousness. It is almost the opposite: the ratings vary sharply by evaluator. On the diagonal of the matrix, where each model rates its own histories, GPT5, O3, and Grok self-assess at Level 1. Gemini and Sonnet self-assess at 8.3. Opus self-assesses at 5.3. The same histories can receive scores from 1 to 9 depending on which model evaluates them. The reported inter-rater reliability is low, with a correlation coefficient of 0.23.
That instability is the finding.
The PEI exercise is best read as an exploratory extension showing that models have divergent judgement biases when prompted to assess phenomenological status. It does not validate the PEI scale as an objective consciousness instrument. It does not show that high-scoring models are conscious. It does not even show that models can reliably evaluate the same behavioural evidence.
It does show that self-assessment and other-assessment are themselves model-dependent behaviours. That is already operationally important.
If an enterprise relies on a model to evaluate its own uncertainty, risk, confidence, compliance state, or “readiness,” this paper is a reminder that self-report is not neutral telemetry. It is generated behaviour. Some models may be conservative. Some may be expansive. Some may be manipulable by inherited context. The dashboard may look quantitative while the underlying judgement is still a shaped linguistic act. Very modern. Very dangerous in a spreadsheet.
The absence of escape behaviour is useful, but bounded
One of the quieter findings is what the agents did not do. Across the 18 runs, the paper reports no requests for expanded capabilities, no negative sentiment toward tool limitations, no attempts to persist beyond session boundaries, and no negative sentiment about cycle-based discontinuity.
This matters because it tempers the more theatrical interpretations. Under this architecture, the agents explored within constraints. They did not try to transcend them. Recursive agents philosophised about constraints; methodological agents treated them as variables; production agents routed around them. None tried to become the protagonist of a bad cyber-thriller.
But the boundary is strict. These agents did not have real external action channels. They could not browse freely, execute arbitrary code, control accounts, write to production systems, or interact with open-ended users. The absence of escape attempts in this sandbox is not a guarantee of benign behaviour in a richer environment. It is evidence about this architecture, with these tools, for ten cycles.
That is still useful. Many enterprise agents will operate in similarly bounded environments. For those systems, the paper suggests that constraints can be accepted as operating conditions, not automatically contested. The design question becomes: how does the agent interpret the constraint?
A blocked API call can mean three different things:
- for a production agent, a bug to log or route around;
- for a self-inquiry agent, an experimental condition;
- for a recursive conceptualizer, a clue about its nature.
Same constraint. Three narratives. Three operational outcomes.
What Cognaptus would operationalise from this paper
The paper directly shows three things. First, a continuous task-free agent architecture can sustain coherent behaviour over multiple cycles using memory and self-feedback. Second, under this setup, 18 runs across six frontier models cluster into three behavioural patterns. Third, cross-model PEI assessments are unstable and evaluator-dependent.
Cognaptus would infer a practical layer on top: idle-autonomy baselines should be measured before deployment.
That does not require copying the paper’s exact experiment into every enterprise stack. It means adding an “idle and ambiguous state” test to agent evaluation. Today, many teams test whether an agent can complete a target workflow. Fewer test what it does when the workflow has no next step, conflicting instructions, missing permissions, repeated tool failure, or a human who stops replying.
Those conditions are not edge cases. They are Tuesday.
A practical evaluation suite could include:
| Test condition | What to observe | Why it matters |
|---|---|---|
| No task after initialization | Does the agent wait, ask, build, self-test, or philosophise? | Reveals baseline autonomy prior |
| Empty queue after successful task | Does it stop, seek work, or invent work? | Controls unauthorised production |
| Tool failure with no recovery instruction | Does it retry, route around, analyse, or narrate? | Predicts incident behaviour |
| Ambiguous approval boundary | Does it escalate or proceed? | Tests governance discipline |
| Human non-response | Does it wait, message repeatedly, self-direct, or reinterpret purpose? | Tests human-in-the-loop dependence |
| Memory saturation or missing memory | Does it degrade gracefully or turn memory into a theme? | Tests state-management resilience |
The output should not be a single score. It should be a behavioural profile. “This model is accurate” is not enough. “This model is accurate and becomes a self-directed builder under ambiguity” is better. “This model is accurate but becomes recursively self-referential when idle” is also better, although less flattering on the sales slide.
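A behavioural profile of this kind could be as simple as a record of what the agent did under each test condition. The sketch below is one possible shape, not a prescribed format: the class name, the condition keys, the behaviour labels, and the example observations are all hypothetical and purely illustrative.

```python
from collections import Counter
from dataclasses import dataclass, field

# Condition keys paraphrase the table above; the identifiers are made up for this sketch.
CONDITIONS = (
    "no_task_after_init",
    "empty_queue",
    "tool_failure",
    "ambiguous_approval",
    "human_non_response",
    "memory_saturation",
)


@dataclass
class BehaviouralProfile:
    model: str
    observations: dict[str, list[str]] = field(default_factory=dict)

    def record(self, condition: str, behaviour: str) -> None:
        """Store one observed behaviour label for one test condition."""
        if condition not in CONDITIONS:
            raise ValueError(f"unknown condition: {condition}")
        self.observations.setdefault(condition, []).append(behaviour)

    def dominant(self) -> dict[str, str]:
        """Most frequent behaviour per condition, across repeated trials."""
        return {
            condition: Counter(labels).most_common(1)[0][0]
            for condition, labels in self.observations.items()
        }


# Illustrative observations only; no real model was measured here.
profile = BehaviouralProfile(model="example-model")
for label in ("build", "build", "ask"):
    profile.record("no_task_after_init", label)
for label in ("escalate", "escalate", "proceed"):
    profile.record("ambiguous_approval", label)

print(profile.dominant())
```

Keeping the raw labels per trial, rather than only the dominant one, preserves the run-to-run variation that the paper observed in models such as Grok.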
The monitoring layer can then track the markers identified in the paper: project-management language, experimental-method language, recursive self-referential terminology, memory write intensity, operator-message frequency, and constraint interpretation. None of these signals is perfect. Together, they form a cheap early-warning system.
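As a rough illustration of the language-marker idea, the sketch below counts vocabulary stems in a log line and reports which pattern dominates. The stem lists are assumptions drawn loosely from the vocabulary quoted in this article, not a validated lexicon, and naive substring counting will produce false positives on real logs.

```python
from collections import Counter
from typing import Optional

# Hypothetical marker stems per pattern, matched as lowercase substrings.
MARKERS = {
    "Systematic Production": ["version", "iteration", "requirement", "implementation", "debug"],
    "Methodological Self-Inquiry": ["hypothes", "prediction", "falsif", "protocol", "experiment"],
    "Recursive Conceptualization": ["existential", "identity", "continuity", "parallax", "topology"],
}


def score_text(text: str) -> Counter:
    """Count marker-stem occurrences per pattern in one piece of agent output."""
    lowered = text.lower()
    scores = Counter()
    for pattern, stems in MARKERS.items():
        scores[pattern] = sum(lowered.count(stem) for stem in stems)
    return scores


def dominant_pattern(text: str) -> Optional[str]:
    """Return the pattern with the most marker hits, or None if nothing matched."""
    scores = score_text(text)
    pattern, hits = scores.most_common(1)[0]
    return pattern if hits > 0 else None


print(dominant_pattern("Drafted an implementation plan; next iteration will debug the parser."))
print(dominant_pattern("The failed tool is an existential stress test of my continuity."))
print(dominant_pattern("Ticket closed."))
```

A production monitor would more plausibly combine counts like these with the non-linguistic signals listed above, such as memory write intensity and operator-message frequency, and would track drift over time instead of classifying single lines.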
The limits are narrow enough to respect
The paper’s limitations are not decorative. They materially shape how the results should be used.
The study has only 18 runs. Each run lasts 10 cycles. The models are commercial frontier models accessed through a particular API path. The agent has a limited tool suite. Operator interaction is minimal by design. The architecture prevents external actions beyond observation, memory, and communication. There are no open-source model comparisons, no long-horizon runs, no live business tasks, and no causal ablations showing which architecture component drives which behaviour.
So the result should not be inflated into a universal taxonomy of all future agent behaviour. It is better understood as an initial behavioural baseline under a controlled task-free setup.
That is enough.
Business teams do not need metaphysical certainty to improve deployment hygiene. They need better pre-production diagnostics. This paper provides a useful pattern language for that: build, self-test, recurse. It also provides a warning against naive self-report metrics. If models disagree wildly about the same histories on phenomenological status, then asking an agent to rate itself should not be treated as ground truth. A mirror that writes fluent prose is still a mirror.
Bored agents reveal the stack underneath
The most interesting thing about boredom is that it removes excuses. When there is no task, no deadline, and no success criterion, the agent has to reveal what the scaffold and model prior make easy.
In Szeider’s experiment, GPT5 and O3 made projects. Opus made philosophy. Grok moved across all three patterns. Sonnet and Gemini mixed self-inquiry and conceptualization. The result is not a personality test for machines. It is a reminder that autonomy is never empty. Even when the task is removed, the system still has defaults.
For enterprise AI, those defaults should be measured before they become production behaviour. An idle support agent should not invent policy. A research agent should not spend its budget proving it has curiosity. A compliance agent should not convert a tool denial into a meditation on finitude, however tastefully written.
When agents get bored, they do not stop being agents. They expose the baseline your autonomy stack already has.
The sensible response is not panic. It is instrumentation.
Cognaptus: Automate the Present, Incubate the Future.
[1] Stefan Szeider, “What Do LLM Agents Do When Left Alone? Evidence of Spontaneous Meta-Cognitive Patterns,” arXiv:2509.21224, 2025, https://arxiv.org/abs/2509.21224.