Files are where AI agent demos go to become adults.

In a product video, the agent opens a few clean documents, remembers your preferences, drafts an answer, books the meeting, and looks quietly inevitable. On an actual computer, the same agent faces a folder called final_final_v3, a receipt saved as an image, a calendar invite with the wrong title, a video that contains the decisive evidence at second 8, and three people who all appear in the same user’s digital life. Suddenly the assistant that “knows you” looks less like a colleague and more like an intern who has discovered search for the first time.

That is the useful discomfort in HippoCamp: Benchmarking Contextual Agents on Personal Computers [1]. The paper does not merely ask whether an agent can retrieve documents. We have had enough benchmarks that make retrieval look like a respectable office job. HippoCamp asks whether an agent can operate inside realistic personal file systems: large, multimodal, temporally layered, and full of contextual traps that humans navigate without congratulating themselves.

The headline result is not flattering. HippoCamp builds three archetypal personal-computer environments containing 42.4 GB of data across more than 2,000 files, then evaluates agents on 581 evidence-grounded question-answering tasks with dense step-level annotations. The best reported profiling accuracy is only 48.3%. That number matters less as a leaderboard trophy than as a diagnosis: current agents are not merely missing better embeddings or larger context windows. They are failing at the chain that turns files into grounded understanding.

The obvious story is wrong: this is not just weak RAG

The easy explanation is that personal assistants fail because retrieval is weak. Improve the vector index, add a longer context window, maybe sprinkle in reranking, and the agent will finally understand your life. This is comforting because it turns a messy cognitive problem into infrastructure work. Engineers like infrastructure work. It has diagrams.

HippoCamp pushes against that story. Retrieval is part of the problem, but the benchmark’s more interesting finding is that agents often break after retrieval: they find some relevant files, misread them, bind the evidence to the wrong person, summarize the wrong support set, or fabricate a trail of nonexistent files. The paper’s core contrast is therefore not “old RAG bad, new agents good.” It is sharper: access to personal data is not the same as understanding personal context.

| Reader belief | HippoCamp correction | Why it matters for business systems |
| --- | --- | --- |
| “A personal assistant mostly needs more memory.” | Memory is useful only if the agent can search, perceive, ground, and verify evidence across files. | A larger memory layer can increase noise unless the agent narrows evidence before answering. |
| “RAG solves grounding.” | RAG may retrieve broadly but still fail to interpret, select, and synthesize the right evidence. | Internal copilots can look evidence-aware while producing unsupported conclusions. |
| “Agentic tool use fixes retrieval.” | Iterative exploration helps, but it also increases latency and can still misattribute entities or hallucinate evidence. | Production deployment needs verification loops, not just longer tool traces. |
| “Personalization means storing preferences.” | Profiling requires inferring routines, constraints, and workflows from weak signals over time. | Useful personalization must be auditable, not just plausible. |

The paper’s title invokes contextual agents, but its real target is a common product fantasy: that personal files are just another searchable corpus. They are not. They are an operating environment with structure, chronology, permissions, visual content, audio traces, stale versions, and human ambiguity. In other words, exactly the kind of place where a demo can succeed and a deployment can quietly embarrass itself.

HippoCamp compares agents against the computer most benchmarks avoid

Most agent benchmarks make the world conveniently external. The data live on the web, in a tool interface, in a curated document collection, or inside a bounded task environment. HippoCamp shifts the test surface to the user’s device. That is a small change in wording and a large change in difficulty.

The paper’s related-work comparison is useful because it makes the benchmark gap visible. Web and tool-use benchmarks evaluate whether agents can act in public or semi-structured environments. Document RAG benchmarks test retrieval and understanding over document sets. Personal lifelog benchmarks introduce user-level memory, but often with limited modalities or smaller, more constrained settings. HippoCamp combines three requirements that are rarely tested together: multimodality, user profiling, and file-system structure.

| Evaluation world | What it usually tests | What HippoCamp adds |
| --- | --- | --- |
| Web and tool benchmarks | Navigation, tool use, public information seeking | User-local evidence and profile isolation |
| Document RAG benchmarks | Retrieval and reading over document collections | Heterogeneous personal files with folder structure and timestamps |
| Personal memory benchmarks | User facts, preferences, or lifelog recall | Device-scale multimodal evidence across documents, images, audio, video, and text |
| Software automation benchmarks | Completing actions in applications | Understanding the evidence behind a user-specific answer |

This comparison is not academic bookkeeping. It explains why many agent systems look better than they are. A web benchmark rewards the agent for finding public facts. A personal-computer benchmark asks whether it can find the right artifact, read it in the right modality, connect it with other artifacts, respect the user’s identity boundaries, and produce an answer that can be traced back to evidence. That is a much less forgiving test.

HippoCamp’s construction reinforces this point. The authors derive the benchmark from interviews with more than 100 personal-device users, retain sources with rich multimodal file systems and auditable long-horizon traces, then aggregate them into three coherent archetypal profiles: Bei Weiwei, a student and content creator; Adam Turner, a legal executive; and Victoria Anne Clarke, a senior financial analyst. The profiles are not meant to be literal individuals. They are controlled, privacy-preserving personal computing environments designed to preserve realistic structure: folder hierarchies, timestamps, recurring routines, professional artifacts, media files, and cross-file dependencies.

That design choice matters. A personal agent does not merely answer questions from content. It exploits the organization around content. Folder names, file paths, modification times, repeated formats, and version histories are not decorative metadata. They are part of the user’s working memory. Flattening everything into chunks and embeddings is the computational equivalent of moving into someone’s office, dumping every file on the floor, and calling the pile “semantic search.”
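A minimal sketch of what keeping that structure might look like, assuming a plain local directory tree. The names `FileRecord`, `index_tree`, and `newest_in_folder` are hypothetical and do not come from the paper; the only point is that path, folder, and modification time travel with each file instead of being thrown away at chunking time.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class FileRecord:
    """One file plus the structural context a flat chunk store would drop."""
    path: str           # path relative to the indexed root
    folder: str         # parent folder, relative to the root
    suffix: str         # lowercased extension, a rough modality hint
    modified: datetime  # last modification time in UTC
    size_bytes: int


def index_tree(root: Path) -> list[FileRecord]:
    """Walk `root` and return one record per file, sorted oldest to newest."""
    records = []
    for p in root.rglob("*"):
        if not p.is_file():
            continue
        stat = p.stat()
        rel = p.relative_to(root)
        records.append(FileRecord(
            path=rel.as_posix(),
            folder=rel.parent.as_posix(),
            suffix=p.suffix.lower(),
            modified=datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
            size_bytes=stat.st_size,
        ))
    return sorted(records, key=lambda r: (r.modified, r.path))


def newest_in_folder(records: list[FileRecord], folder: str) -> Optional[FileRecord]:
    """Return the most recently modified file directly inside `folder`, if any."""
    candidates = [r for r in records if r.folder == folder]
    return max(candidates, key=lambda r: r.modified, default=None)
```

With records like these, a question about "the latest draft" can be narrowed by folder and timestamp before any embedding lookup happens. Whether that ordering is the right one for a given system is a design choice this sketch does not settle.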

Factual retention asks “find the thing”; profiling asks “understand the life around it”

HippoCamp’s two task families are deliberately unequal in difficulty.

Factual retention is the more familiar task. The agent must locate and use file-grounded facts. Examples include finding a visa-compliant photo by checking an official policy document against candidate images, or verifying whether a logo placement in a video complies with a brand manual. These tasks can still be hard: they may require cross-modal verification, normative clause extraction, or temporal comparison. But the answer usually has a relatively explicit support set.

Profiling is nastier. The agent must synthesize user-level patterns from weak signals distributed across time and modalities: preferences, routines, scheduling constraints, retrospective accounts, and workflows. A query such as “What are my Wednesdays usually like?” is not answered by one file. It may require calendar events, receipts, notes, media timestamps, emails, and repeated patterns across weeks. The answer is not simply “found.” It is inferred.

The benchmark contains 521 factual-retention tasks and 60 profiling tasks. That imbalance is sensible. Profiling tasks are expensive to construct because the gold answer needs to be grounded in a minimal evidence set while still representing a stable higher-level inference. In the appendix, the paper’s difficulty analysis makes the contrast explicit: factual retention occupies the mid-range of difficulty, while profiling is concentrated near the hard tail. The reported overall mean difficulty is 53.8 for factual retention and 89.1 for profiling, with 93.3% of profiling tasks scoring at least 70 on the benchmark’s difficulty scale.
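The arithmetic behind those figures is small enough to check by hand, and doing so shows how thin the profiling slice is. The sketch below uses only the counts and the 93.3% figure quoted above; it assumes that percentage is taken over all 60 profiling tasks, which the numbers are consistent with but which is an assumption here.

```python
FACTUAL_TASKS = 521    # reported task count
PROFILING_TASKS = 60   # reported task count

total = FACTUAL_TASKS + PROFILING_TASKS
profiling_share = PROFILING_TASKS / total

# 93.3% of profiling tasks score at least 70 on the difficulty scale.
# Assumption: the percentage is computed over all 60 profiling tasks.
hard_profiling = round(0.933 * PROFILING_TASKS)

print(total)                     # 581
print(f"{profiling_share:.1%}")  # 10.3%
print(hard_profiling)            # 56
```

Under that assumption, roughly one task in ten is a profiling task, and all but about four of them sit at the hard end of the scale.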

This is the first practical lesson. In enterprise settings, “answering questions about files” and “understanding a user, team, client, or workflow” should not be sold as the same capability. They are adjacent only in the same way that reading a receipt and understanding a household budget are adjacent. One may support the other. It does not replace it.

The leaderboard is a diagnosis, not a beauty contest

The main results compare RAG methods, search-agent methods, and autonomous agent systems. The exact rankings are less important than the pattern.

Standard RAG and Self-RAG perform poorly on profiling. Search-oriented methods sometimes retrieve better evidence but still fail to turn it into correct answers. Autonomous agents perform best, especially ChatGPT Agent Mode, but even the best system remains far from reliable: 48.3% profiling accuracy and 62.8% factual-retention accuracy in the main table.

| Method family | What the results suggest | Representative evidence from the paper |
| --- | --- | --- |
| Standard and Self-RAG | Broad retrieval does not equal grounded understanding. | Standard RAG reaches 26.7% profiling accuracy overall; Self-RAG drops to 10.0%. |
| Search agents such as ReAct and Search-R1 | Iterative search can improve evidence coverage, but synthesis remains weak. | Search-R1 reaches only 5.0% profiling accuracy overall despite being explicitly search-oriented. |
| Terminal agents | Tool use gives agents room to explore, but also exposes planning, perception, and hallucination failures. | Terminal Agent with GPT-5.2 reaches 30.0% profiling accuracy and 48.2% factual-retention accuracy in the main table. |
| Hosted agent mode | Stronger orchestration helps, especially through iterative exploration, but does not solve profiling. | ChatGPT Agent Mode is strongest overall in the main table but still reaches only 48.3% profiling accuracy. |

The paper’s capability decomposition explains why this happens. Search is necessary but not decisive. On profiling, search-centric agents can show relatively strong search F1 while producing weak final answers. ChatGPT Agent Mode shows the inverse pattern: not the highest search F1, but stronger judged answer quality. This implies that the bottleneck has moved from “can the agent find candidate files?” to “can it discriminate and bind evidence correctly enough to answer?”

Perception is the most universal bottleneck. The paper uses “perception” broadly: not only computer vision, but the ability to convert heterogeneous file content into usable evidence. PDFs, calendar entries, images, voice memos, video frames, emails, and logs all require different handling. Even the strongest system’s profiling perception accuracy is much lower than its search accuracy. This is where many enterprise claims about “multimodal agents” become fragile. Supporting many file types is not the same as grounding decisions in them.

The most revealing metric gap is between retrieval F1 and answer accuracy. When F1 exceeds accuracy, the agent retrieved relevant evidence but failed to answer correctly. When accuracy exceeds F1, the agent may have produced a plausible answer without retrieving the annotated support set, raising the possibility of parametric knowledge or lucky inference rather than grounded file understanding. Neither pattern is comforting. One means “I found it but did not understand it.” The other means “I answered, but not necessarily from your files.” Excellent, we have reinvented confidence with extra steps.
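The same gap can be tracked per task in an internal evaluation. The sketch below computes a standard set-based F1 between retrieved file paths and an annotated support set, then labels each task by how retrieval and answer correctness relate. The paper reports this contrast at the aggregate level; the per-task labels, the `diagnose` helper, and the 0.5 cut-off are illustrative assumptions, not the paper's protocol.

```python
def retrieval_f1(retrieved: set[str], gold: set[str]) -> float:
    """Set-based F1 between retrieved file paths and the annotated support set."""
    if not retrieved or not gold:
        return 0.0
    hits = len(retrieved & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def diagnose(f1: float, answer_correct: bool, f1_floor: float = 0.5) -> str:
    """Label one task. `f1_floor` is an illustrative threshold, not from the paper."""
    found = f1 >= f1_floor
    if found and answer_correct:
        return "grounded"
    if found:
        return "found but not understood"
    if answer_correct:
        return "answered but possibly ungrounded"
    return "missed"


# Hypothetical file names, for illustration only.
f1 = retrieval_f1({"a.pdf", "b.png", "c.ics"}, {"a.pdf", "c.ics"})
print(diagnose(f1, answer_correct=False))  # found but not understood
```

Counting the two middle labels separately is what keeps a healthy-looking accuracy number from hiding answers that never touched the user's files.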

The failures are not random; they form a pipeline

The paper’s qualitative failure analysis is one of its strongest sections because it translates benchmark scores into operational failure modes. A representative profiling query asks what the user usually does each week for health. The ground truth requires aligning calendar events, running logs, photos, a voice memo, a weekend note, and an order confirmation. Different systems fail at different points.

| Failure mode | What happens in the paper’s example | Business translation |
| --- | --- | --- |
| Retrieval mismatch | Standard RAG retrieves unrelated finance documents and concludes the context lacks health information. | The system confuses semantic similarity with user relevance. Keyword overlap becomes a liability. |
| Grounding avoidance | Search-R1 gives generic advice despite available personal records. | The model prefers safe boilerplate over evidence commitment when the local evidence is unfamiliar. |
| Hard evidence hallucination | A terminal agent fabricates health-related file paths and then claims it cannot open them. | The audit trail itself becomes fictional, which is worse than a wrong answer because it looks checkable. |
| Entity misattribution | ChatGPT Agent Mode finds health-related records but shifts the answer to the user’s cat, Shadow. | Correct retrieval plus wrong referent still produces a wrong business conclusion. |
| Verification deficit | No evaluated method explicitly re-checks whether the final answer is traceable to a minimal coherent evidence set. | The last mile of quality assurance is missing. The system does not ask, “Can I prove this?” |

This sequence is more useful than a single accuracy number. It suggests where product teams should invest. Better retrieval helps only with the first failure. It does not automatically fix grounding avoidance, fabricated evidence, entity binding, or final verification. A production-grade file-system agent needs a pipeline that treats evidence as a first-class object throughout the process, not as a pile of text handed to a generator at the end.
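One way to read "evidence as a first-class object" is that every claim in an answer carries the file paths it rests on, and those paths are checked against the real file listing before the answer ships. The sketch below is a hypothetical illustration of that idea, not a method from the paper; `Claim` and `verify_claims` are invented names, and the check only confirms that cited files exist, not that their content supports the claim.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    evidence_paths: list[str] = field(default_factory=list)


def verify_claims(claims: list[Claim], known_paths: set[str]) -> dict[str, list[Claim]]:
    """Sort claims into three buckets.

    supported:   every cited path exists in the file listing
    unsupported: the claim cites no evidence at all
    fabricated:  at least one cited path does not exist
    """
    report: dict[str, list[Claim]] = {"supported": [], "unsupported": [], "fabricated": []}
    for claim in claims:
        if not claim.evidence_paths:
            report["unsupported"].append(claim)
        elif any(path not in known_paths for path in claim.evidence_paths):
            report["fabricated"].append(claim)
        else:
            report["supported"].append(claim)
    return report


# Hypothetical paths and claims, for illustration only.
known = {"calendar/runs.ics", "notes/weekend.md"}
claims = [
    Claim("Runs on Tuesdays", ["calendar/runs.ics"]),
    Claim("Does yoga daily"),
    Claim("Swims on Fridays", ["health/swim_log.csv"]),
]
report = verify_claims(claims, known)
print([c.text for c in report["fabricated"]])  # ['Swims on Fridays']
```

A path-existence check is the cheapest possible guard, but it is aimed squarely at the fabricated-path failure described above. Checking that the cited file actually says what the claim says is a separate and harder step.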

The entity misattribution case is especially important. Personal file systems contain many people and entities: the account holder, spouse, children, clients, pets, colleagues, companies, and institutions. A personal assistant that assumes all first-person context belongs to one clean identity will fail in exactly the places where personalization is supposed to be valuable. In a family calendar, a school folder, or a client case directory, “my health,” “our schedule,” and “the case” are not self-explanatory. The referent has to be modeled.
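Modeling the referent can start as something very plain: each piece of evidence records who or what it is about, and the answer step filters on that before summarizing. The sketch below assumes the subject label has already been assigned upstream, which is the hard part and is not shown. The `Evidence` type, the `bind_to_referent` helper, and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    path: str
    subject: str  # who or what the record is about, assigned upstream
    summary: str


def bind_to_referent(evidence: list[Evidence], referent: str) -> tuple[list[Evidence], list[str]]:
    """Keep evidence about `referent` and report which other subjects were set aside."""
    matched = [e for e in evidence if e.subject == referent]
    set_aside = sorted({e.subject for e in evidence if e.subject != referent})
    return matched, set_aside


# Hypothetical records, loosely modeled on the health example above.
pool = [
    Evidence("calendar/run_club.ics", "account holder", "weekly run"),
    Evidence("receipts/vet_invoice.pdf", "Shadow (cat)", "annual checkup"),
    Evidence("notes/weekend.md", "account holder", "long walk on Sundays"),
]
matched, set_aside = bind_to_referent(pool, "account holder")
print([e.path for e in matched])  # ['calendar/run_club.ics', 'notes/weekend.md']
print(set_aside)                  # ['Shadow (cat)']
```

Returning the set-aside subjects is deliberate: an answer that silently drops the cat is better than one about the cat, but one that can say what it excluded is easier to audit.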

Raw files are a harder test than gold text, and the paper knows it

A useful benchmark does not only publish a score; it explains what the score means. HippoCamp’s evaluation protocol separates several regimes: native retrieval methods, vacuum Docker terminal agents, and hosted commercial agent mode. The Docker setting exposes controlled file-system primitives such as listing files, returning metadata, extracting text, rendering images, and returning original files. Multimodal access is not free. The agent must deliberately request it.
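To make "the agent must deliberately request it" concrete, here is a rough sketch of what a small set of read-only file primitives could look like. It is not the paper's tool interface: `FileTools`, its method names, and the list of text suffixes are all assumptions for illustration. The sketch covers listing, metadata, and text extraction only, and it refuses text extraction for non-text files so the caller has to pick a different tool on purpose.

```python
from pathlib import Path

TEXT_SUFFIXES = {".txt", ".md", ".csv", ".ics"}  # illustrative, not the paper's list


class FileTools:
    """Hypothetical read-only primitives over a sandboxed root directory."""

    def __init__(self, root: Path):
        self.root = root.resolve()
        self.calls: list[tuple[str, str]] = []  # (tool name, argument) trace

    def _resolve(self, rel_path: str) -> Path:
        """Resolve a relative path and refuse anything outside the root."""
        target = (self.root / rel_path).resolve()
        if target != self.root and self.root not in target.parents:
            raise PermissionError(f"{rel_path} is outside the sandbox")
        return target

    def list_files(self, rel_dir: str = ".") -> list[str]:
        self.calls.append(("list_files", rel_dir))
        base = self._resolve(rel_dir)
        return sorted(p.relative_to(self.root).as_posix() for p in base.iterdir())

    def get_metadata(self, rel_path: str) -> dict:
        self.calls.append(("get_metadata", rel_path))
        p = self._resolve(rel_path)
        st = p.stat()
        return {
            "path": rel_path,
            "suffix": p.suffix.lower(),
            "size_bytes": st.st_size,
            "mtime": st.st_mtime,
        }

    def extract_text(self, rel_path: str) -> str:
        self.calls.append(("extract_text", rel_path))
        p = self._resolve(rel_path)
        if p.suffix.lower() not in TEXT_SUFFIXES:
            # Images, audio, and video would need their own explicit tool.
            raise ValueError(f"{rel_path} is not a text file; request a different tool")
        return p.read_text(encoding="utf-8")
```

The `calls` trace is the useful by-product: it records which files the agent chose to inspect and in what order, which is exactly the behavior a raw-file setting is meant to expose.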

That distinction matters because raw-file evaluation is harder than gold-text evaluation. If a system is handed clean text extracted from every document, much of the operational burden has already been solved. Real users do not live in gold text. They live in PDFs, images, videos, spreadsheets, calendar files, emails, audio clips, and proprietary clutter. The benchmark’s raw-file setting forces agents to decide what to inspect and how.

The appendix also clarifies the purpose of several tests and tables. These are not random decorations. They support different parts of the evidentiary argument.

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Benchmark comparison table | Comparison with prior work | HippoCamp combines multimodality, user profiling, and file-system structure in a way prior benchmarks generally do not. | It does not prove HippoCamp covers every real workplace environment. |
| File-type, modality, and temporal statistics | Implementation detail and dataset characterization | The benchmark is device-scale and heterogeneous, with storage and modality burdens differing by profile. | It does not by itself show agents fail; it explains the environment they face. |
| Difficulty distributions | Robustness and sensitivity analysis of task hardness | Profiling is systematically harder than factual retention, and performance declines as difficulty rises. | It does not isolate one causal factor; difficulty combines evidence breadth, modality breadth, and reasoning depth. |
| Capability-wise metrics | Diagnostic decomposition | Failures can be localized across search, perception, and reasoning. | They do not replace answer-level evaluation because a system can score differently across stages. |
| Extended metric summary | Cross-metric comparison | Answer quality, retrieval quality, and latency trade off sharply. | It is not a pure apples-to-apples architecture benchmark because execution regimes differ. |
| LLM-as-judge audit protocol | Evaluation robustness | Open-ended answers are judged semantically, with human audit on sensitive cases. | It does not remove all judge variance or turn open-ended evaluation into exact matching. |

The distinction between what each component supports and what it does not prove matters for readers who want to weaponize the leaderboard. Please resist the urge. The benchmark is more valuable as an x-ray than as a procurement scorecard. It tells you which bones are broken.

The business value is deployment gating, not benchmark theater

For Cognaptus readers, the practical relevance is straightforward: HippoCamp is a useful lens for evaluating contextual agents before putting them near real organizational memory.

The paper directly shows that current agents struggle inside realistic personal file systems. It also shows that the failure is not concentrated in one obvious place. Search, perception, evidence narrowing, entity binding, and verification all matter. The business inference is that any agent sold as “personal,” “contextual,” or “enterprise-aware” should be tested on those stages separately.

| What the paper directly shows | Cognaptus business inference | What remains uncertain |
| --- | --- | --- |
| Device-scale personal file systems create hard multimodal, cross-file QA tasks. | Internal copilots should be tested against messy file environments, not curated demo folders. | Real companies may have different permission systems, file conventions, and data retention rules. |
| Profiling is much harder than factual retention. | Do not market workflow understanding as a simple extension of document QA. | The benchmark uses three archetypal profiles, not a statistically complete map of all users or firms. |
| Retrieval metrics and answer correctness can diverge. | QA gates should track evidence coverage, evidence specificity, and final answer quality separately. | The exact metric thresholds for production use will depend on domain risk. |
| Iterative agent modes perform better but are slower. | Agent workflows may need tiered execution: fast retrieval for low-risk tasks, deeper investigation for high-risk tasks. | Latency and cost will vary across platforms and deployment architectures. |
| Failure modes include hallucinated evidence and entity misattribution. | Systems need final verification against localized evidence and explicit referent tracking. | Verification itself can be expensive and may require domain-specific rules. |
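The tiered-execution idea in the table can be sketched as a routing decision made before any agent runs. Everything below is hypothetical: the tier names, the list of high-risk domains, and the two inputs are illustrative choices, and a real router would depend on the organization's own risk rules.

```python
from enum import Enum


class Tier(Enum):
    FAST_RETRIEVAL = "fast_retrieval"
    DEEP_INVESTIGATION = "deep_investigation"


# Illustrative only; each organization would define its own list.
HIGH_RISK_DOMAINS = {"legal", "financial", "medical", "compliance"}


def choose_tier(domain: str, needs_cross_file_inference: bool) -> Tier:
    """Send high-risk or profiling-style requests to the slower, deeper path."""
    if domain.lower() in HIGH_RISK_DOMAINS or needs_cross_file_inference:
        return Tier.DEEP_INVESTIGATION
    return Tier.FAST_RETRIEVAL


print(choose_tier("travel", needs_cross_file_inference=False).value)  # fast_retrieval
print(choose_tier("Legal", needs_cross_file_inference=False).value)   # deep_investigation
```

The second input matters as much as the first: a low-risk question that asks about routines or patterns is still a profiling task, and the results above suggest it should not take the fast path.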

This is especially relevant for enterprise automation. Invoices, contracts, HR records, sales notes, meeting recordings, client emails, and project folders resemble personal file systems more than they resemble benchmark corpora. They are messy because organizations are messy. If an agent cannot distinguish an account holder from a pet in a personal benchmark, it may also confuse a client subsidiary, a guarantor, a dependent, a former vendor, or a superseded policy clause. The genre changes; the failure mechanism does not.

A sensible deployment gate would therefore test four capabilities before trusting a contextual agent with decisions:

  1. Structure-aware search. Does the agent use folders, timestamps, attachments, and version relations, or does it flatten the file system into a bag of chunks?
  2. Evidence narrowing. Can it form a minimal sufficient support set before answering, or does it summarize from a noisy retrieval pool?
  3. Entity and role modeling. Can it track who the evidence refers to when multiple people, pets, teams, companies, or cases appear in the same environment?
  4. Final verification. Can it re-bind every substantive claim to localized evidence and flag unsupported assertions?
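The four checks above can be expressed as a gate that scores each stage separately and refuses to average them. This is a sketch under stated assumptions: the field names mirror the list, but the thresholds are made-up placeholders and how each stage score is measured is left open.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class StageScores:
    structure_aware_search: float
    evidence_narrowing: float
    entity_binding: float
    final_verification: float


# Placeholder minimums; real values would depend on domain risk.
THRESHOLDS = StageScores(
    structure_aware_search=0.80,
    evidence_narrowing=0.75,
    entity_binding=0.90,
    final_verification=0.90,
)


def failed_stages(scores: StageScores, thresholds: StageScores = THRESHOLDS) -> list[str]:
    """Return the names of stages below their minimum. An empty list means pass."""
    return [
        f.name
        for f in fields(scores)
        if getattr(scores, f.name) < getattr(thresholds, f.name)
    ]


candidate = StageScores(0.95, 0.95, 0.60, 0.95)
print(failed_stages(candidate))  # ['entity_binding']
```

The example candidate averages above 0.86 and would look fine on a single blended score. Gating per stage is what surfaces the weak entity binding.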

None of this sounds as glamorous as “autonomous personal intelligence.” Good. Glamour is not an evaluation protocol.

Boundaries before turning HippoCamp into a slogan

HippoCamp is useful, but its scope should be handled precisely.

First, the benchmark uses three archetypal profiles. That gives it controlled diversity and coherent environments, but it does not mean the benchmark represents every possible personal or enterprise file system. A law firm, a hospital department, a school, and a crypto trading desk will each create their own failure surfaces.

Second, execution regimes are not identical. A hosted commercial agent mode and a Docker terminal agent do not have the same tool interface, orchestration policy, or multimodal handling. The paper is careful about this. Readers should be equally careful. The comparison is practically meaningful, but not a clean architectural ablation.

Third, the benchmark scores should not be interpreted as direct productivity or ROI estimates. A 48.3% profiling accuracy does not mean an agent will fail half of all business tasks, just as a higher retrieval F1 does not mean a system is safe. The scores measure performance under specific evidence-grounded QA conditions.

Fourth, LLM-as-judge evaluation is necessary for open-ended answers but not magic. The paper uses constrained judging and human audit, which is appropriate. Still, any organization adopting similar evaluation should keep judge variance visible, especially for legal, financial, medical, or compliance-heavy outputs.
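One simple way to keep judge variance visible is to run the judge several times per answer and record how often the runs agree. The sketch below is not the paper's audit protocol; the helper names and the 0.8 agreement floor are assumptions, and the verdict strings stand in for whatever labels a given judge emits.

```python
from collections import Counter


def judge_agreement(verdicts: list[str]) -> tuple[str, float]:
    """Return the majority verdict and the fraction of runs that agree with it."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    label, count = Counter(verdicts).most_common(1)[0]
    return label, count / len(verdicts)


def needs_human_audit(verdicts: list[str], min_agreement: float = 0.8) -> bool:
    """Flag an answer for human review when repeated judge runs disagree too much.
    `min_agreement` is an illustrative floor, not a value from the paper."""
    _, agreement = judge_agreement(verdicts)
    return agreement < min_agreement


runs = ["correct", "correct", "incorrect", "correct", "incorrect"]
print(judge_agreement(runs))    # ('correct', 0.6)
print(needs_human_audit(runs))  # True
```

Reporting the agreement fraction next to the verdict costs little and stops a 3-to-2 split from being logged as a clean "correct".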

Finally, privacy is not a footnote. Personal-computer agents operate on sensitive context by design. HippoCamp anonymizes and aggregates profiles for research. A deployed agent would need permissioning, data minimization, logging discipline, and user-facing evidence controls. Otherwise, “contextual assistant” becomes a polite phrase for “very confident surveillance intern.”

The file system is not background; it is the task

The strongest idea in HippoCamp is not the number of files, the size of the dataset, or the leaderboard. It is the reframing of personal AI assistance as grounded operation inside a file ecosystem.

A personal assistant does not become useful merely by remembering facts about a user. It becomes useful when it can recover why those facts are true, where the evidence lives, which entity the evidence refers to, whether the evidence conflicts with other files, and whether the final answer can be defended. That is the difference between personalization as vibe and personalization as infrastructure.

HippoCamp shows that current agents are still closer to the first than the second. They can search. They can summarize. They can sometimes explore. But when the environment becomes personal, multimodal, and temporally messy, their understanding thins out quickly.

The file system strikes back because it contains what polished demos hide: ambiguity, history, contradiction, and context. Agents that cannot handle those are not yet personal assistants. They are tourists with read access.

Cognaptus: Automate the Present, Incubate the Future.


  1. Zhe Yang et al., “HippoCamp: Benchmarking Contextual Agents on Personal Computers,” arXiv:2604.01221, 2026. https://arxiv.org/pdf/2604.01221