TL;DR for operators

Smart-home assistants do not fail only when users are vague. They fail when users become efficient.

The PEC-Home paper studies a familiar pattern: after repeated interaction, people stop saying the whole thing. “Please turn on the air conditioner in the bedroom and set it to 26 degrees at 10 PM” eventually becomes “check the AC” or “handle that thing.” Humans manage this because shared context, identity, place, and prior routines do the missing work. Current LLM assistants are much less charming under that burden.

PEC-Home turns this into an execution benchmark. It contains 1,780 simulated smart-home dialogues from 1,424 personas, with four levels of progressively elliptical commands across two practical ambiguity types: multi-user preference conflicts and dynamic preferences that change with environment state.[1] The target is not conversational niceness. The assistant must output the exact room, device method, and parameters needed to execute the intended operation.

The result is operationally blunt. Zero-shot and in-context prompting collapse on high ellipsis. RAG helps substantially because history matters, but even GPT-4o with RAG drops from around 93% execution accuracy on complete multi-user commands to 55.71% at the most elliptical level, and from 91.57% to 50.70% in dynamic preference scenarios. That is not a rounding error. That is a product liability wearing a friendly wake word.

The practical lesson is not “add memory.” It is: evaluate whether the assistant can select the right memory, ignore the wrong memory, resolve whose preference applies, adapt to state changes, and still produce the exact executable action. Memory without disambiguation is just a junk drawer with embeddings.

The failure starts when users stop being polite to the interface

A new smart assistant usually receives complete sentences. Users are still training the machine, and the machine is still earning permission to be trusted. So the command is explicit:

“Turn on the air conditioner in the bedroom and set the temperature to 26 degrees at 10 PM.”

Then the relationship matures. The user has said this before. The room is obvious. The time is routine. The device is known. The preference has become shared. The command shrinks:

“Set the AC to 26 tonight.”

Then:

“Make it comfortable before I sleep.”

Then:

“You know the AC, right? Handle it tonight.”

This is not sloppy language. It is efficient language. Human dialogue naturally compresses repeated references once shared context has been established. The PEC-Home paper calls this “progressively elliptical commands,” and that framing matters because it separates the problem from ordinary ambiguity.

A one-off ambiguous command asks the model to infer missing intent from the current utterance. A progressively elliptical command asks the model to reconstruct what the user and assistant have built together over time. The missing content is not floating in semantic space. It is buried in prior turns, user identity, household context, environmental state, and learned preferences.

The misconception the paper usefully punctures is simple: a smart-home assistant does not become context-aware merely because it has a larger model, a vector database, or a retrieval tool strapped to the side. Those help. They do not solve the mechanism.

The mechanism is context compression. The user compresses the command because the relationship is supposed to carry the missing information. If the system cannot recover that information, the better the user experience becomes for a human, the worse it becomes for the assistant. Progress, naturally, arrives disguised as a bug report.

PEC-Home makes ellipsis an execution problem, not a vibes problem

PEC-Home is built around a specific operational mapping. A user command must be converted into an executable device operation: room, device-method, and parameters. In simplified form, the assistant must move from a natural-language utterance to something like:

bedroom.air_conditioner.set_temperature(26)

That distinction is important. Many assistant evaluations reward plausible interpretation. PEC-Home rewards correct execution. “Close enough” is not good enough if the wrong room gets cooled, the wrong user’s comfort preference is applied, or a time parameter arrives in a format the API rejects. The thermostat, famously, does not grade on semantic intent.
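To make the target concrete, here is a minimal sketch of that mapping as a data structure plus a parser. The `Operation` class, the `parse_operation` helper, and the `room.device.method(args)` string shape are illustrative assumptions modeled on the simplified example above, not the paper's actual schema.

```python
import ast
import re
from dataclasses import dataclass

# Assumed call shape: room.device.method(arg1, arg2, ...)
_CALL = re.compile(r"^\s*(\w+)\.(\w+)\.(\w+)\((.*)\)\s*$")


@dataclass(frozen=True)
class Operation:
    room: str
    device: str
    method: str
    params: tuple


def parse_operation(text: str) -> Operation:
    """Split 'room.device.method(args)' into its four parts."""
    match = _CALL.match(text)
    if match is None:
        raise ValueError(f"not an executable operation: {text!r}")
    room, device, method, raw_args = match.groups()
    # literal_eval accepts only Python literals, so arbitrary code is rejected.
    params = ast.literal_eval(f"({raw_args},)") if raw_args.strip() else ()
    return Operation(room, device, method, params)


op = parse_operation("bedroom.air_conditioner.set_temperature(26)")
print(op.room, op.device, op.method, op.params)
```

Anything that fails to parse, or parses into the wrong room or parameter, counts as a miss under this framing, however sensible the assistant's paraphrase sounded.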

The dataset contains two smart-home scenarios that expose different forms of missing information:

| Scenario | What becomes ambiguous | Why it matters operationally |
| --- | --- | --- |
| Multi-user preferences | The same phrase can mean different things for different people. One person’s “comfortable temperature” may not match another’s. | The assistant must retrieve the right user-specific history, not merely the most semantically similar prior command. |
| Dynamic user preferences | The same user’s preference may change with weather, humidity, time, or other environmental state. | The assistant must combine dialogue history with sensor context, not replay yesterday’s setting because it sounds familiar. |

The design is clean because it does not treat “context” as one blob. It splits context into at least three hard things: identity, history, and state. That is closer to how deployed assistants fail. The issue is rarely that the system has no memory at all. The issue is that it remembers something adjacent, applies it confidently, and leaves the user wondering why the bathroom light is now part of the evening routine.
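A toy sketch of why identity, history, and state have to be combined rather than consulted one at a time. The record layout, the humidifier example, and the "nearest past humidity" rule are all assumptions made for illustration; the paper does not prescribe this resolution rule.

```python
from typing import List, Optional, Tuple

# Hypothetical history records: (user, humidity_percent_at_the_time, level_chosen)
History = List[Tuple[str, float, str]]


def resolve_level(user: str, humidity_now: float, history: History) -> Optional[str]:
    """Pick the speaker's own past setting made under the most similar humidity."""
    own = [record for record in history if record[0] == user]  # identity
    if not own:
        return None  # nothing to go on: better to ask than to guess
    closest = min(own, key=lambda record: abs(record[1] - humidity_now))  # state
    return closest[2]


history = [
    ("alice", 30.0, "high"),
    ("alice", 65.0, "off"),
    ("bob", 32.0, "low"),
]

print(resolve_level("alice", 35.0, history))  # high
print(resolve_level("bob", 35.0, history))    # low
print(resolve_level("carol", 35.0, history))  # None
```

The same short command from Alice and Bob at the same humidity resolves to different settings, and the same command from Alice on a humid day resolves differently again.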

Four ellipsis levels turn household shorthand into a measurable gradient

PEC-Home defines four command levels. Each level removes more of the explicit information that a device-control system needs.

| Level | What the user still says | What has been omitted | Product interpretation |
| --- | --- | --- | --- |
| Lv1 | Room, device, operation, and parameters are present. | Nothing essential. | Standard command understanding. This is the demo-friendly case. |
| Lv2 | Device, operation, and parameters remain. | Room is usually omitted. | The assistant must recover spatial context. |
| Lv3 | Device remains, but operation and parameters become vague. | Room, operation detail, and parameter detail are missing or ambiguous. | The assistant must infer both what to do and how to configure it. |
| Lv4 | Usually only the device is explicitly referenced. | Room, operation, and parameters are mostly absent. | The assistant must rely heavily on shared history and preference conventions. |
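One way to read the table is as a growing list of slots the assistant must recover from context. The sketch below assumes a hypothetical upstream parser that fills a dictionary with whatever the utterance states explicitly; the slot names are illustrative, not the benchmark's schema.

```python
REQUIRED_SLOTS = ("room", "device", "operation", "params")


def slots_to_recover(parsed: dict) -> list:
    """Return the slots the utterance left out and context must supply."""
    return [slot for slot in REQUIRED_SLOTS if parsed.get(slot) is None]


# Roughly what each level leaves on the table, per the descriptions above.
lv1 = {
    "room": "bedroom",
    "device": "air_conditioner",
    "operation": "set_temperature",
    "params": {"temperature": 26},
}
lv2 = {**lv1, "room": None}          # room dropped
lv4 = {"device": "air_conditioner"}  # only the device survives

print(slots_to_recover(lv1))  # []
print(slots_to_recover(lv2))  # ['room']
print(slots_to_recover(lv4))  # ['room', 'operation', 'params']
```

Lv3 is harder to express this way because its slots are vague rather than absent, which is part of why it is a useful level to test.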

The paper’s example progression is exactly the sort of thing product teams like to market as “natural interaction.” At Lv1, the user fully specifies an air conditioner, a bedroom, a temperature, and a time. By Lv4, the command only gestures at the air conditioner and expects the assistant to know the rest.

That expectation is not exotic. It is the point of personalization.

PEC-Home’s statistics show the intended compression. Average command length falls from about 42 tokens at Lv1 to about 28 tokens at Lv4 in both the multi-user and dynamic-preference subsets. Room mentions collapse after Lv1. Operation mentions also drop sharply, reaching 4.49% at Lv4 in multi-user preferences and 6.04% in dynamic preferences. Device references remain relatively common, which is why Lv4 is not pure silence. It is worse: it is a small clue attached to a large missing context.

The dataset construction is simulated, not scraped from real households. The authors build a virtual environment with 12 device types, more than 50 executable methods, and more than 350 personalized methods allocated across rooms. GPT-4o generates the natural-language commands from function-call-style operations and personas. The authors then manually validate sampled operations and dialogues: the 500 sampled function-call operations are reported as 100% correct and 96% aligned with environmental states, and the 200 sampled dialogues are reported as 94.5% compliant with the defined command standards.

This does not make PEC-Home a perfect mirror of production homes. It makes it a controlled stress test. That is exactly what benchmarks are supposed to be before someone wires them to appliances.

The main result: RAG helps, but shorthand still wins too often

The main evidence is Table 3: ten LLMs evaluated across zero-shot prompting, in-context learning, and RAG. The models include LLaMA3-8B, Mistral-7B, Gemma2 variants, Qwen2.5 variants, GPT-4o, and DeepSeek-V3. The metrics are Execution Accuracy and F1.

Execution Accuracy is the better headline metric here because it asks whether the operation would functionally do the right thing in the virtual environment. F1 is stricter about output format and operation matching. The paper uses both because smart-home execution has two layers: understanding the user and producing an interface-compatible command. Sadly, products need both. Reality remains unfashionably strict.

The broad pattern is consistent:

| Condition | What happens | Interpretation |
| --- | --- | --- |
| Zero-shot prompting | Models perform moderately on complete commands, then collapse as ellipsis increases. | Without history, the missing information stays missing. Astonishing, but worth measuring. |
| In-context learning | Examples improve complete-command handling for many models, but do little for highly elliptical commands. | Demonstrations of the format are not the same as access to the user’s actual interaction history. |
| RAG | Performance improves sharply at elliptical levels, but still declines as commands become more compressed. | Retrieval is necessary but not sufficient. The hard part is selecting and applying the right context. |

GPT-4o with RAG gives a useful anchor. In the multi-user preference task, its Execution Accuracy is 93.07 at Lv1, 92.70 at Lv2, 76.05 at Lv3, and 55.71 at Lv4. In the dynamic preference task, it moves from 91.57 at Lv1 to 88.34 at Lv2, 61.24 at Lv3, and 50.70 at Lv4.

That is the practical shape of the problem. The assistant handles complete commands. It survives mild shorthand. It becomes unreliable when the user expects the relationship to carry the meaning.

The paper also shows that RAG can make smaller models competitive. Qwen2.5-7B with RAG reaches 91.57 Execution Accuracy at Lv2 in the dynamic preference task and 48.50 at Lv4 in the multi-user task. But the important finding is not that one model wins a row. It is that even the strongest RAG-equipped systems remain far below their complete-command performance at the highest ellipsis level.

For business readers, the useful number is not the leaderboard score. It is the drop. If the assistant’s success rate falls by roughly 37 to 41 percentage points when the user moves from explicit commands to household shorthand, the product is not ready for the kind of user familiarity it is trying to create.
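The 37-to-41-point range comes straight from the GPT-4o with RAG figures quoted above. A short check, using only those reported Execution Accuracy values:

```python
# Execution Accuracy for GPT-4o with RAG, as quoted in this article.
scores = {
    "multi-user": {"Lv1": 93.07, "Lv2": 92.70, "Lv3": 76.05, "Lv4": 55.71},
    "dynamic": {"Lv1": 91.57, "Lv2": 88.34, "Lv3": 61.24, "Lv4": 50.70},
}

# Absolute drop in percentage points from complete commands to Lv4.
drops = {task: levels["Lv1"] - levels["Lv4"] for task, levels in scores.items()}

for task, drop in drops.items():
    print(f"{task}: {drop:.2f} points lost between Lv1 and Lv4")
# multi-user: 37.36 points lost between Lv1 and Lv4
# dynamic: 40.87 points lost between Lv1 and Lv4
```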

A delightful strategic contradiction: the better the assistant trains users to rely on it, the more elliptical their commands become, and the more the evaluation must shift from generic language understanding to longitudinal context recovery.

Bigger models buy room, not immunity

The paper reports that larger models can use dialogue history better on less elliptical commands, especially around Lv2. This makes sense. If the user omits the room but still gives the device, operation, and parameter, the model has a relatively bounded retrieval-and-resolution task. It must find where this recurring instruction belongs.

At Lv4, the advantage weakens. The current command may mention only the device. A larger model can reason over more possibilities, but it still needs the correct historical binding: which user, which room, which prior routine, which environmental condition, which parameter convention.

That is why “just use a bigger model” is the laziest expensive answer. Larger models improve parts of the interpretation stack. They do not replace the stack.

The comparison also reveals a product design lesson: latency constraints matter. The authors explicitly exclude reasoning models because home assistants require rapid responses, and reasoning models typically introduce delays by generating longer reasoning trajectories. This is not a trivial footnote. Smart-home control lives in the awkward zone where the user wants context-sensitive intelligence and immediate execution. The system cannot spend a small philosophical afternoon deciding whether “handle the AC” refers to comfort, energy use, sleep, humidity, or the argument from Tuesday.

The analysis section tests remedies, not a second thesis

The paper’s Section 6 is best read as a set of diagnostic extensions. These are not separate claims competing with the main result. They test common product instincts: add memory, fine-tune the model, and use tool-based agents.

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Preloaded irrelevant memory | Robustness and sensitivity test | Memory quality and filtering affect elliptical command interpretation. Noisy memory makes context use harder. | It does not establish an optimal memory architecture. |
| Fine-tuning Qwen2.5-7B and Gemma2-9B | Adaptation ablation | Fine-tuning improves low-ellipsis behavior but does not solve highly elliptical commands. | It does not rule out all possible training regimes or specialized architectures. |
| Sasha and SAGE tool-based assistants | Comparison with prior smart-home agent styles | Current tool-based approaches can handle some cases but are inconsistent across ellipsis levels. | It does not prove tools are useless; it shows naive or existing tool orchestration is insufficient. |
| Error analysis on sampled Qwen2.5-7B RAG failures | Failure-mode diagnosis | The largest observed issue is ignoring or failing to use history correctly. | It is based on 100 sampled errors from one model-method setting, not every system. |

This distinction matters because otherwise the paper could be misread as saying “RAG failed,” “fine-tuning failed,” or “tools failed.” That is too crude. The better interpretation is that each remedy addresses part of the mechanism.

RAG gives the model access to history. It does not guarantee that the retrieved memory is relevant, current, user-specific, or correctly applied.

Fine-tuning teaches the model the task distribution. It does not insert the missing context at inference time.

Tools provide action interfaces and memory operations. They do not guarantee that the agent invokes the right tool at the right moment or resolves conflicting context correctly.

In other words, every obvious fix improves the plumbing. The hard part is still deciding which water should flow through it.

Noisy memory turns personalization into retrieval risk

The memory experiment is especially useful because it pokes the fantasy version of personalization. In practical systems, memory does not remain clean. Users issue many commands. Households contain repeated devices, recurring routines, conflicting preferences, abandoned habits, and random one-off requests. A vector database full of “things the user once said” is not a personalization layer. It is evidence storage with a future confusion budget.

The paper tests this by adding irrelevant preloaded memory and observing performance across ellipsis levels for Sasha, SAGE, and RAG variants. The qualitative result is straightforward: as irrelevant memory appears, systems struggle more to filter and use the right dialogue history, especially as commands become more elliptical.

That is an important boundary between retrieval and interpretation. In a complete command, the utterance itself constrains the action. In an elliptical command, the retrieval result can dominate the action. A bad memory match is not background noise. It becomes the instruction.
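A small sketch of how a bad match becomes the instruction. It uses word-overlap (Jaccard) similarity as a stand-in for embedding similarity, and the `Memory` records, users, and operations are invented for illustration; none of this is the paper's retrieval setup.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Memory:
    user: str
    utterance: str
    operation: str  # hypothetical executable form


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, standing in for an embedding score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(query: str, memories: List[Memory], user: Optional[str] = None) -> Optional[Memory]:
    """Return the most similar memory, optionally restricted to one user."""
    pool = [m for m in memories if user is None or m.user == user]
    return max(pool, key=lambda m: jaccard(query, m.utterance), default=None)


memories = [
    Memory("mom", "handle the ac tonight", "bedroom.air_conditioner.set_temperature(24)"),
    Memory("dad", "set the ac in the study to 26", "study.air_conditioner.set_temperature(26)"),
]

query = "handle the ac"  # spoken by dad
print(retrieve(query, memories).operation)              # mom's routine wins on similarity
print(retrieve(query, memories, user="dad").operation)  # dad's own history
```

Similarity alone hands dad the bedroom at 24 degrees. Scoping by speaker is only one filter; freshness and environmental state would need the same treatment.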

For enterprise AI, the analogy is immediate. Replace smart-home commands with workflow requests:

“Send the usual report.”

“Use the client’s preferred format.”

“Route it the way we did last time.”

“Apply the APAC exception.”

These are elliptical business commands. They are only safe when the system knows which “usual,” which client, which last time, and which exception. Otherwise, RAG becomes a confident intern with access to too many folders. We have all met the type.

Fine-tuning improves the easy part

The fine-tuning experiment uses a 5:1:4 train-validation-test split and fine-tunes Qwen2.5-7B and Gemma2-9B with LoRA. The result is instructive: performance improves on low-ellipsis commands but collapses on highly elliptical ones.
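For a sense of scale, here is what a 5:1:4 split looks like if it is applied at the dialogue level to all 1,780 dialogues. That granularity, the shuffling, and the seed are assumptions for illustration; the article does not state how the authors partitioned the data.

```python
import random


def split_5_1_4(items, seed=0):
    """Shuffle and split into train/validation/test at a 5:1:4 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 5 // 10
    n_val = n * 1 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_5_1_4(range(1780))
print(len(train), len(val), len(test))  # 890 178 712
```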

This is exactly what should happen if the bottleneck is missing contextual binding rather than generic domain unfamiliarity. Fine-tuning can teach the model the schema, the device domain, and the style of outputs. It can make complete commands easier. It can make mildly incomplete commands less surprising.

But Lv4 is not primarily a domain adaptation problem. It is a stateful reference problem. The current utterance does not contain enough information. The model needs the right prior interaction and the right selection rule.

A product team that fine-tunes on historical commands may see strong offline improvements on explicit or semi-explicit requests and still fail in the moments that feel most personalized to the user. This is how AI systems become impressive in evaluation demos and faintly haunted in kitchens.

Tool agents still need a policy for when memory matters

The tool-based comparison uses Sasha and SAGE with Qwen2.5-7B-Instruct. Both represent current LLM-based smart-home assistant patterns: tool use, planning, memory access, device control, and environmental data. This sounds like the correct architecture family. It probably is. The scores are still uneven.

SAGE records 51.40 Execution Accuracy at Lv1 and only 4.49 at Lv4 in the multi-user task; in dynamic preferences, it goes from 52.00 to 5.33. Sasha does better at Lv4: 43.26 in multi-user preferences and 41.33 in dynamic preferences. But Sasha also behaves oddly at Lv2, scoring lower there than at Lv3 or Lv4 in both scenarios.

The paper interprets this as a tool-invocation problem: Sasha may fail to decide when to invoke external memory for moderately elliptical commands. That is the product lesson. Tool availability is not tool governance. An assistant needs a policy for when to retrieve, what to retrieve, how much to trust it, and when to ask a clarifying question.
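Such a policy can be written down explicitly rather than left to the model's mood. The sketch below is one hypothetical rule set, not something the paper proposes: execute when nothing is missing, retrieve once when something is, and ask when memory returns nothing or returns conflicting candidates.

```python
from enum import Enum
from typing import List


class Action(Enum):
    EXECUTE = "execute"
    RETRIEVE = "retrieve"
    CLARIFY = "clarify"


def next_action(missing_slots: List[str], candidates: List[str], already_retrieved: bool) -> Action:
    """Decide the next step for a (possibly elliptical) command.

    candidates: distinct-or-not operations that memory suggests for this command.
    """
    if not missing_slots:
        return Action.EXECUTE   # the utterance is complete on its own
    if not already_retrieved:
        return Action.RETRIEVE  # something is missing: consult memory first
    if len(set(candidates)) == 1:
        return Action.EXECUTE   # memory agrees on a single operation
    return Action.CLARIFY       # no evidence, or conflicting evidence


print(next_action([], [], False))
print(next_action(["room"], [], False))
print(next_action(["room"], ["bedroom.ac.on()", "study.ac.on()"], True))
```

The point of making the rule explicit is that it can be tested per ellipsis level, which is exactly where Sasha's Lv2 dip suggests the decision goes wrong.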

Otherwise, the agent has the same problem as many enterprise copilots: every tool exists, and no one is clearly in charge.

The error analysis says the assistant is not merely formatting badly

The appendix error analysis is small but revealing. The authors manually examine 100 randomly sampled error cases from Qwen2.5-7B with RAG and categorize primary failure modes:

| Failure mode | Share of sampled errors | Operational meaning |
| --- | --- | --- |
| Ignoring History | 38% | The model treats the elliptical command too much like a standalone command. |
| Room Missing | 22% | The device/action may be right, but the spatial target is absent. |
| Room Error | 17% | The model picks the wrong room due to ambiguous context switching. |
| Parameter Error | 13% | The value or format is wrong, such as time format or vague parameter interpretation. |
| Parameter Missing | 10% | The operation is incomplete because required parameters are absent. |

This is more useful than a leaderboard because it tells operators where to instrument their systems.

The largest category is not “model cannot speak JSON.” It is “model fails to use history.” Room-related errors together account for 39% of sampled failures. Parameter errors and missing parameters account for another 23%. The pattern says the assistant is failing at binding: binding utterances to prior context, rooms, user-specific meanings, and API-ready values.
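The grouped figures follow directly from the table. A quick tally, using only the shares reported above:

```python
# Share of the 100 sampled errors per primary failure mode, as reported above.
shares = {
    "Ignoring History": 38,
    "Room Missing": 22,
    "Room Error": 17,
    "Parameter Error": 13,
    "Parameter Missing": 10,
}

room_total = sum(v for k, v in shares.items() if k.startswith("Room"))
parameter_total = sum(v for k, v in shares.items() if k.startswith("Parameter"))

print(room_total)            # 39
print(parameter_total)       # 23
print(sum(shares.values()))  # 100
```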

The paper also makes a subtle metric distinction. Execution Accuracy tolerates outer formatting mistakes when the operation is functionally correct, treating syntax wrappers as an engineering issue. But parameter values remain strict because APIs are strict. If the system outputs a time or value in the wrong format, the device call can fail. This is a reasonable distinction. Curly braces can be patched. A wrong room cannot.
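A rough sketch of that distinction: peel off outer wrappers before comparing, but leave the call and its parameter values untouched. The wrapper list and the string comparison are assumptions for illustration and do not reproduce the paper's scoring procedure.

```python
FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable

# Opening wrapper character mapped to its expected closing character.
_PAIRS = {"[": "]", "{": "}", '"': '"', "'": "'", "`": "`"}


def strip_wrappers(text: str) -> str:
    """Remove code fences and matching outer brackets or quotes."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        kept = [ln for ln in cleaned.splitlines() if not ln.strip().startswith(FENCE)]
        cleaned = "\n".join(kept).strip()
    while len(cleaned) >= 2 and _PAIRS.get(cleaned[0]) == cleaned[-1]:
        cleaned = cleaned[1:-1].strip()
    return cleaned


def functionally_equal(predicted: str, gold: str) -> bool:
    """Tolerate outer formatting; stay strict on the call and its values."""
    return strip_wrappers(predicted) == strip_wrappers(gold)


gold = 'bedroom.air_conditioner.set_timer("22:00")'
fenced = FENCE + "python\n" + gold + "\n" + FENCE

print(functionally_equal(fenced, gold))              # True: wrapper only
print(functionally_equal("{" + gold + "}", gold))    # True: wrapper only
print(functionally_equal('bedroom.air_conditioner.set_timer("10 PM")', gold))  # False
```

The `set_timer` method name is hypothetical. The last case is the one that matters: the intent is arguably right, and the device call would still fail.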

What the paper directly shows

The paper directly shows four things.

First, progressively elliptical commands are a distinct evaluation problem for smart-home assistants. They are not identical to ordinary ambiguous commands because the missing content is created by prior interaction.

Second, PEC-Home provides a controlled simulated benchmark for that problem, including 1,780 dialogues, four levels of ellipsis, multi-user preference conflicts, dynamic environmental preferences, and executable target operations.

Third, evaluated LLMs degrade as commands become more elliptical. RAG improves results substantially, but it does not restore complete-command performance at high ellipsis.

Fourth, the common remedies tested (memory under irrelevant preloaded entries, fine-tuning, and tool-based assistants) expose different weaknesses rather than eliminating the core failure.

That is the evidence. Everything beyond it should be treated as business inference, not as something the authors measured in deployed homes.

What Cognaptus infers for business use

The business implication is not limited to smart homes. PEC-Home is a compact version of a broader assistant problem: users compress instructions once they believe a system shares context.

The moment an AI system becomes embedded in a workflow, users stop saying everything. They rely on convention, relationship history, team norms, project state, and prior decisions. That is where many assistants become brittle.

A useful operational framework looks like this:

| Design requirement | Smart-home example | Enterprise analogue |
| --- | --- | --- |
| User-specific memory selection | “Comfortable” differs between family members. | “Standard format” differs by client or manager. |
| State-aware interpretation | Humidifier preference depends on humidity or weather. | Approval routing depends on region, contract stage, or risk level. |
| Context freshness | Yesterday’s routine may not apply today. | Last quarter’s pricing rule may be obsolete. |
| Exact action generation | Device method and parameters must match the API. | Workflow action must match system schema and permissions. |
| Clarification policy | Ask when room or user preference is uncertain. | Escalate when context conflicts or evidence is stale. |

This suggests a sharper evaluation agenda. Do not only test whether the assistant understands explicit requests. Test whether it survives natural compression.

A production-grade assistant should be evaluated on command trajectories, not isolated prompts. It should be tested under household or team-level memory noise. It should include conflicting user preferences. It should include changed environmental or business state. It should require exact executable outputs. And it should measure when the assistant asks a clarifying question instead of guessing, because a graceful “which room?” is often better than automated nonsense at scale.
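A minimal harness along those lines might look like the sketch below. The case tuple layout, the convention that an assistant returns `None` to ask a clarifying question, and the toy `cautious_assistant` are all assumptions made for illustration.

```python
from collections import defaultdict


def evaluate(cases, assistant):
    """Tally correct, wrong, and clarified outcomes per ellipsis level.

    cases: iterable of (level, history, utterance, gold_operation).
    assistant: callable(history, utterance) -> operation string, or None to clarify.
    """
    stats = defaultdict(lambda: {"correct": 0, "wrong": 0, "clarified": 0})
    for level, history, utterance, gold in cases:
        output = assistant(history, utterance)
        if output is None:
            stats[level]["clarified"] += 1
        elif output == gold:
            stats[level]["correct"] += 1
        else:
            stats[level]["wrong"] += 1
    return {level: dict(counts) for level, counts in stats.items()}


def cautious_assistant(history, utterance):
    """Toy policy: act only when the room is named, otherwise ask."""
    if "bedroom" in utterance:
        return "bedroom.air_conditioner.set_temperature(26)"
    return None


GOLD = "bedroom.air_conditioner.set_temperature(26)"
cases = [
    ("Lv1", [], "set the bedroom ac to 26", GOLD),
    ("Lv4", ["set the bedroom ac to 26"], "handle the ac", GOLD),
]

report = evaluate(cases, cautious_assistant)
print(report)
```

Reporting clarifications separately from wrong executions keeps a cautious assistant from looking identical to a reckless one.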

The same principle applies to RAG design. Retrieval should not be judged only by whether it finds semantically similar text. It should be judged by whether the retrieved evidence is the right evidence for this user, this time, this state, and this action. Semantic similarity is not authority. It is only a candidate list with better manners.

The boundary: PEC-Home is a risk proxy, not production telemetry

The main limitation is that PEC-Home is simulated. The authors explain why: real smart-home interaction data is privacy-sensitive and difficult to collect at scale. They mitigate this with a pilot study, cognitive-science grounding, GPT-4o generation, manually defined ellipsis levels, and human validation. That is reasonable. It does not make the dataset the same as lived household behavior.

There are three practical boundaries.

First, the generated commands may be cleaner than real speech. Real users interrupt themselves, use nicknames, mix languages, change their minds, and produce commands that would make a benchmark designer quietly close the laptop.

Second, the virtual environment is broad but still constrained. Twelve device types and explicit methods are enough for useful evaluation, not enough to represent every deployed smart-home configuration.

Third, the benchmark measures execution in a defined environment. Real deployments add permissions, device failures, sensor errors, household routines, safety constraints, and user frustration. The paper does not measure those. It opens the door to measuring them properly.

So the right business use of PEC-Home is not to quote its scores as expected production failure rates. The right use is to copy its evaluation shape: progressive ellipsis, longitudinal context, competing preferences, dynamic state, noisy memory, and exact execution.

That is the part worth stealing. Politely, of course.

The product lesson is that personalization must survive shorthand

Most assistants are evaluated as if users remain first-time users forever. PEC-Home evaluates what happens after users become familiar enough to stop over-explaining.

That is the correct pressure test for personalized AI.

An assistant that only works when users fully specify room, device, operation, and parameter is not context-aware. It is a voice-operated form. A better assistant must understand when a short command is safe to execute, when it needs history, when history is noisy, when multiple users conflict, when environmental state changes the intended action, and when a clarifying question is cheaper than a wrong operation.

The paper’s most useful contribution is therefore not just a dataset. It is a reminder that natural interaction creates missing information by design. If the system cannot reconstruct that information, the product will fail precisely when the user starts treating it as intelligent.

That is an inconvenient place for the failure to live. Which is usually where the important failures prefer to live.

Cognaptus: Automate the Present, Incubate the Future.


1. Yingyu Shan, Zeming Liu, Silin Li, Boao Qian, Jiashu Yao, Yuhang Guo, and Haifeng Wang, “PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes,” arXiv:2606.18636, 2026.