A high electricity bill arrives. You ask your smart-home assistant what happened.

It checks the meter data, explains that the electric-vehicle charger ran during peak-rate hours, and recommends a cheaper schedule. Useful.

Then you ask how much the new schedule will save next month. The assistant retrieves the tariff, forecasts consumption, applies export credits from the solar panels, and confidently reports a number.

Less useful—especially when the number is wrong.

Both requests look like ordinary conversation. Behind the interface, however, they are radically different tasks. One asks an agent to retrieve a known state and execute a bounded action. The other requires a chain of data selection, unit conversion, forecasting, tariff interpretation, and financial calculation. Fluency conceals the difference rather well. Arithmetic is less easily impressed.

A study by Tianzhi He and Farrokh Jazizadeh examines this distinction through a prototype large-language-model agent for building energy management.[1] The system connects natural-language interaction with household energy data, structured memory, analytical tools, and simulated device-control interfaces.

The headline result is not simply that an LLM can help manage a smart building. It is that the agent’s reliability changes sharply depending on what “help” requires.

It achieved response accuracy of 97% on memory tasks, 98% on general support, and 86% on device status and control. Cost-management accuracy fell to 49%, despite those queries consuming more time, more tokens, and more tools.

The practical question is therefore not whether a building can talk back. It is which parts of the conversation should be allowed to affect devices, schedules, and money.

One Assistant Is Being Asked to Perform Six Different Jobs

Traditional building energy management systems connect sensors, meters, databases, analytical models, control equipment, and user interfaces. Their weakness is rarely the complete absence of data. It is that the data often reaches occupants through dashboards designed for people who regard kilowatt-hours as a friendly conversational topic.

The proposed agent replaces much of that interaction burden with natural language. A resident can ask about consumption, request a schedule, inspect a device, save a preference, or seek technical support without navigating several disconnected interfaces.

The paper organizes the system into three modules:

| Module | Operational role | Prototype implementation |
|---|---|---|
| Perception | Observes the building and its available resources | Historical energy data, current meter readings, device metadata, and device states |
| Brain | Interprets requests, retrieves context, reasons, and selects tools | GPT-4o, profiles, instructions, long-term memory, file search, and code execution |
| Action | Communicates results and changes the environment | Natural-language responses, visualizations, schedules, and simulated device commands |

The architecture matters because the LLM is not expected to contain every answer internally. It acts as a coordinator.

A typical request follows a workflow resembling:

User request → intent classification → context retrieval → tool selection → tool execution → response or action

For device operations, the prototype uses a particularly important sequence:

Synchronize available devices → query the current device state → execute a validated command

That sequence prevents the model from blindly acting on an imagined appliance. In the evaluation, the agent detected an offline kettle instead of pretending to turn it on. It also recognized that a requested television did not exist in one test home.
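
The sync, query, execute sequence can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `SimulatedHome`, `Device`, `control_device`, and the device names and state fields are all hypothetical, chosen only to show how each step can refuse to proceed before a command reaches a device.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    online: bool
    state: dict


class SimulatedHome:
    """Hypothetical stand-in for a device backend; not the paper's API."""

    def __init__(self, devices):
        self._devices = {d.name: d for d in devices}

    def sync_devices(self):
        # Step 1: refresh the list of devices that actually exist.
        return sorted(self._devices)

    def query_state(self, name):
        # Step 2: read the current state of one known device.
        return self._devices[name]

    def execute(self, name, **changes):
        # Step 3: apply an already validated command to the device state.
        self._devices[name].state.update(changes)
        return dict(self._devices[name].state)


def control_device(home, name, **changes):
    """Run sync -> query -> execute, refusing unknown or offline devices."""
    if name not in home.sync_devices():
        return {"ok": False, "reason": f"no device named '{name}'"}
    device = home.query_state(name)
    if not device.online:
        return {"ok": False, "reason": f"'{name}' is offline"}
    unsupported = [key for key in changes if key not in device.state]
    if unsupported:
        return {"ok": False, "reason": f"unsupported parameters: {unsupported}"}
    return {"ok": True, "state": home.execute(name, **changes)}


home = SimulatedHome([
    Device("living_room_light", online=True, state={"power": "on", "brightness": 50}),
    Device("kettle", online=False, state={"power": "off"}),
])

light_result = control_device(home, "living_room_light", brightness=75)
kettle_result = control_device(home, "kettle", power="on")
tv_result = control_device(home, "television", power="off")
```

In this sketch the kettle request stops at the state query and the television request stops at the sync step, so neither reaches `execute`. The checks live in ordinary code, which means the model does not need to remember to perform them.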

This is what grounding looks like in practice. It is less cinematic than giving a model “agency,” but considerably more useful.

The agent was tested across six primary task categories:

  1. Energy consumption and analysis
  2. Cost management
  3. Device status and control
  4. Device scheduling and automation
  5. Long-term memory
  6. General information and support

Those categories share a conversational interface. They do not share the same operational risk, reasoning burden, or reliability.

The Benchmark Exposes a Reliability Frontier

The researchers evaluated the prototype using four single-family homes from the Pecan Street residential energy dataset. Each home contributed one month of 15-minute energy data and had a different set of monitored appliances. All four included photovoltaic generation and electric-vehicle charging.

The benchmark contained 120 predefined queries across six primary and 24 secondary intent categories. Each query was tested against every home, producing 480 observations.

The evaluation measured more than whether the final answer sounded plausible. It recorded:

  • response latency;
  • primary and secondary intent-classification accuracy;
  • tool-call selection and execution;
  • final-response accuracy;
  • token usage and estimated inference cost.

Final responses received a score of 1 when correct, 0.5 when partially correct, and 0 when incorrect. The authors compared analytical and numerical answers against manually prepared benchmark logic and ground-truth outputs.

The category-level results reveal the central comparison:

| Task category | Response accuracy | Average processing time | Average total tokens |
|---|---|---|---|
| General information and support | 98% | 13 seconds | 14,217 |
| Memory | 97% | 12 seconds | 14,130 |
| Device status and control | 86% | 19 seconds | 29,187 |
| Energy consumption and analysis | 77% | 27 seconds | 35,625 |
| Device scheduling and automation | 74% | 14 seconds | 21,763 |
| Cost management | 49% | 49 seconds, or 34 seconds after removing two extreme outliers | 52,761 |

The prototype averaged 79% response accuracy across all categories. Its average processing time was 23 seconds, and its tool-call accuracy reached 94%.

That last number is easy to misread.

A 94% tool-call accuracy does not mean the agent produced a correct result 94% of the time. It means the agent was generally good at selecting the tools expected for the task. Correctly opening a calculator is not the same as correctly calculating the bill.

The gap becomes visible when tool use and final answers are compared:

| Task category | Tool-call accuracy | Final-response accuracy |
|---|---|---|
| Energy consumption and analysis | 96% | 77% |
| Cost management | 90% | 49% |
| Device status and control | 90% | 86% |
| Device scheduling and automation | 92% | 74% |
| Memory | 98% | 97% |
| General information and support | 98% | 98% |

For memory and support tasks, choosing the correct operation usually leads to the correct result. For cost management, the agent often selects the appropriate tools and still fails somewhere inside the analytical chain.
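
The size of that gap is easy to compute from the reported figures. The short script below uses only the percentages from the table above; the variable names are illustrative.

```python
# Figures copied from the table above, in percent:
# (tool-call accuracy, final-response accuracy)
results = {
    "Energy consumption and analysis": (96, 77),
    "Cost management": (90, 49),
    "Device status and control": (90, 86),
    "Device scheduling and automation": (92, 74),
    "Memory": (98, 97),
    "General information and support": (98, 98),
}

# Percentage points lost between choosing the right tools and giving the right answer.
gaps = {category: tool - response for category, (tool, response) in results.items()}
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)

for category, gap in ranked:
    print(f"{category:<34} gap = {gap:>2} points")
```

Cost management loses 41 points between tool selection and final answer, while general support loses none. The gap, rather than either column on its own, marks where the analytical chain breaks.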

The important deployment boundary is therefore not between “tool-using” and “non-tool-using” agents. It is between workflows in which the tool largely determines the correct outcome and workflows in which the model must decide how to transform, combine, and interpret several intermediate outputs.

Bounded Operations Give the Agent Less Room to Be Creatively Wrong

Device-control tasks performed relatively well because the available actions were structured.

The agent could synchronize the device list, inspect the state of a specific appliance, identify supported parameters, and submit a command through a predefined function. The backend constrained what could happen.

Consider the request to set a living-room light to a brightness suitable for reading. The agent identified that brightness was adjustable, observed a current level of 50, and changed it to 75. For a request to operate an offline kettle, it stopped and recommended checking the network connection. When asked to turn off a television that was not listed, it did not fabricate a successful action.

These examples involve inference, but the inference occurs inside a bounded environment. The model can choose among available devices, states, and commands. It is not being asked to invent the underlying mechanics.

Single-device status checks and ordinary operations averaged less than 14 seconds. More complex group operations and custom configurations took approximately 25 and 32 seconds, respectively, partly because they required several tool calls.

The same principle helps explain the strong memory performance.

The prototype did not treat long-term memory as an unlimited archive of conversational fragments. It converted relevant preferences into structured entries stored in JSON. A request such as remembering a preferred air-conditioning setting became a normalized rule that could later be retrieved, changed, or deleted through specific memory tools.
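
A minimal sketch of what such a structured memory could look like follows. The class, method names, key, and field names are assumptions made for illustration; the paper states only that preferences were stored as structured JSON entries and managed through specific memory tools.

```python
import json


class PreferenceMemory:
    """Hypothetical structured memory; the schema is illustrative, not the paper's."""

    def __init__(self):
        self._entries = {}

    def save(self, key, device, setting, value):
        # Store a normalized rule rather than the raw sentence the user typed.
        self._entries[key] = {"device": device, "setting": setting, "value": value}

    def retrieve(self, key):
        # Returns None when no preference is stored under this key.
        return self._entries.get(key)

    def update(self, key, value):
        if key not in self._entries:
            raise KeyError(key)
        self._entries[key]["value"] = value

    def delete(self, key):
        # Returns True if an entry was removed, False if nothing was stored.
        return self._entries.pop(key, None) is not None

    def to_json(self):
        return json.dumps(self._entries, indent=2, sort_keys=True)


memory = PreferenceMemory()
memory.save("ac_preferred_temp", device="air_conditioner", setting="temperature_c", value=24)
memory.update("ac_preferred_temp", value=23)
print(memory.to_json())
```

Each operation is a small function with one job, so a memory request leaves the model little to improvise beyond choosing the operation and filling in its arguments.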

Memory tasks achieved 97% response accuracy and generally required fewer tokens and less processing time than the analytical categories.

General support also performed well because many questions could be answered directly or through a limited device-status check. Explaining available air-conditioning modes is substantially easier than forecasting their future operating cost.

Across these categories, the agent succeeds when the surrounding system reduces the number of decisions it must improvise.

Cost Management Fails as a Chain, Not as a Single Calculation

Cost-management requests produced the weakest results: an average response-accuracy score of 49%.

That percentage deserves interpretation. Among 80 cost-related responses, 21 were fully correct, 37 were partially correct, and 22 were incorrect. Because partial answers received half credit, the resulting average was approximately 49%.
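
The arithmetic can be checked directly from the counts and the benchmark's scoring rule. Nothing below is new data; the variable names are illustrative.

```python
from fractions import Fraction

# Counts reported for the 80 cost-management responses.
counts = {"correct": 21, "partial": 37, "incorrect": 22}
# Benchmark scoring rule: 1 point, 0.5 points, and 0 points.
points = {"correct": Fraction(1), "partial": Fraction(1, 2), "incorrect": Fraction(0)}

total = sum(counts.values())
score = sum(counts[kind] * points[kind] for kind in counts) / total
fully_correct_rate = Fraction(counts["correct"], total)

print(f"responses: {total}")
print(f"average score: {float(score):.1%}")
print(f"fully correct: {float(fully_correct_rate):.1%}")
```

The average works out to 39.5 points out of 80. The same counts show that only 21 of the 80 responses, about 26%, were fully correct, so the 49% figure leans heavily on half credit.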

The agent was not failing to produce anything. It was frequently producing answers with some correct logic and at least one consequential mistake.

A cost question may require the agent to:

  1. identify the correct appliance or energy stream;
  2. select the relevant time period;
  3. aggregate interval-level measurements;
  4. distinguish power from energy and convert units correctly;
  5. retrieve peak and off-peak tariffs;
  6. treat exported solar generation as a credit rather than a charge;
  7. forecast future consumption when necessary;
  8. communicate the result clearly.

Every step introduces another opportunity for a locally plausible decision to corrupt the final answer.

The paper provides several examples. In one case, the agent classified exported energy generation as a cost instead of a credit. In another, it selected the wrong record when interpreting a request about air conditioning. Some prediction responses mislabeled power measurements as energy after failing to convert from kilowatts to kilowatt-hours.
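
Two of those mistakes, the power-versus-energy confusion and the sign of exported energy, are the kind that fixed code rules out. The sketch below assumes 15-minute readings of average grid power in kilowatts, with positive values meaning import and negative values meaning export. The tariff rates, the peak window, and the sign convention are all invented for illustration and do not come from the paper or the dataset.

```python
from datetime import datetime

INTERVAL_HOURS = 0.25  # one 15-minute reading covers a quarter of an hour

# Hypothetical tariff, invented for illustration only.
PEAK_RATE = 0.30       # $/kWh imported during peak hours
OFF_PEAK_RATE = 0.10   # $/kWh imported at other times
EXPORT_CREDIT = 0.05   # $/kWh credited for energy exported to the grid
PEAK_HOURS = range(14, 20)  # 14:00 to 19:59, an assumed peak window


def interval_cost(timestamp, grid_kw):
    """Cost of one interval; positive grid_kw is import, negative is export."""
    energy_kwh = abs(grid_kw) * INTERVAL_HOURS  # convert power (kW) to energy (kWh)
    if grid_kw < 0:
        return -energy_kwh * EXPORT_CREDIT  # export is a credit, so it lowers the bill
    rate = PEAK_RATE if timestamp.hour in PEAK_HOURS else OFF_PEAK_RATE
    return energy_kwh * rate


readings = [
    (datetime(2024, 1, 1, 3, 0), 2.0),    # off-peak import
    (datetime(2024, 1, 1, 12, 0), -4.0),  # midday solar export
    (datetime(2024, 1, 1, 18, 0), 6.0),   # peak import, such as EV charging
]

total_cost = sum(interval_cost(ts, kw) for ts, kw in readings)
print(f"total: ${total_cost:.2f}")
```

With the conversion and the credit sign encoded once, neither depends on what the model decides during a particular conversation.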

Cost-information tasks achieved only 30% accuracy, while cost-prediction tasks reached 50%.

The forecasting behavior is particularly revealing. Because the instructions did not prescribe a forecasting method, the agent independently selected different approaches across 20 energy-prediction responses:

| Forecasting method selected by the agent | Number of responses |
|---|---|
| Random forest regression | 10 |
| Linear regression | 6 |
| ARIMA | 2 |
| Simple moving average | 2 |

This is not evidence that the agent intelligently matched each method to the data. The models were generally simple, trained on only one month of historical observations, and sometimes used little more than timestamps as predictive features.

The variation instead exposes a reproducibility problem. Two users can ask equivalent questions and receive answers generated by different analytical methods, without being told why one method was selected over another.

The LLM is functioning as an unsupervised junior analyst: energetic, adaptable, and occasionally inspired. Finance departments generally prefer the analyst to use an approved spreadsheet.

Selecting More Tools Does Not Automatically Produce More Intelligence

Cost-management queries required an average of 3.56 tool calls, compared with 0.92 for general support and 1.60 for memory. They also consumed the most tokens and took the longest to complete.

The paper’s correlation analysis found that greater tool usage and token consumption were associated with lower response accuracy. The authors attribute this pattern to task complexity: difficult requests require more tools, more context, and more intermediate reasoning, creating more places for errors to enter.

That interpretation is plausible, but the correlation should not be mistaken for proof that tool calls themselves reduce accuracy. Hard tasks both require more tools and produce more errors. The analysis identifies an association; it does not isolate its cause.

Still, the operational implication is useful.

Adding tools expands what an agent can attempt. It does not guarantee that the agent can reliably coordinate them.

A file-search function may retrieve the correct tariff. A code interpreter may execute valid Python. A device API may expose accurate states. The final answer can remain wrong if the agent combines those outputs incorrectly.

This distinction matters well beyond energy management. Enterprises frequently evaluate agents by checking whether they can call internal systems. A more meaningful test asks whether the entire workflow produces a correct, repeatable, and auditable outcome after those calls have been made.

The paper also compares performance across the four homes using one-way analysis of variance. Almost every measured performance difference had a p-value above 0.05; the exception was the rate at which the agent performed explicit intent classification.

This test is best understood as a robustness check across the selected testbeds. It suggests that the prototype did not collapse when appliance configurations and household data changed. It does not establish that the system will generalize across commercial buildings, different countries, live device ecosystems, or unstructured user behavior.

Four houses can demonstrate consistency. They cannot grant universal building intelligence by committee vote.

The Most Valuable Role Is Orchestration, Not Unsupervised Calculation

The paper presents the LLM as the brain of the building-energy agent. For practical deployment, “brain” may be too generous and slightly unhelpful.

A safer interpretation is that the LLM serves as the conversational orchestration layer. It translates an occupant’s request into a structured task, retrieves relevant context, selects approved services, explains results, and asks for clarification when necessary.

Specialized systems should remain responsible for operations where correctness depends on precise and repeatable computation.

| Deployment zone | Suitable tasks | Appropriate LLM role | Required controls |
|---|---|---|---|
| Lower risk | Guidance, troubleshooting, preference retrieval, device-status checks | Interpret requests and explain grounded results | Device-state validation, access controls, logging |
| Controlled risk | Scheduling, group-device operations, consumption summaries, visualizations | Coordinate tools and propose actions | Preview, user confirmation, approved templates, rollback |
| Higher risk | Billing calculations, tariff optimization, cost forecasting, autonomous financial decisions | Explain inputs and outputs from validated services | Deterministic calculation engine, reconciliation checks, audit trail, human approval |

This division does not make the LLM less important. It places the model where its flexibility is valuable and removes it from work where variation is merely another word for inconsistency.
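
One way to read the deployment zones is as a routing policy applied before the model acts. The sketch below is an assumption-laden illustration: the intent names, the `decide` function, and its return values are hypothetical, and a real system would also need the logging, rollback, and audit controls listed in the table.

```python
from enum import Enum


class Zone(Enum):
    LOWER = "lower risk"
    CONTROLLED = "controlled risk"
    HIGHER = "higher risk"


# Hypothetical intent names mapped to the three deployment zones.
INTENT_ZONES = {
    "device_status": Zone.LOWER,
    "preference_retrieval": Zone.LOWER,
    "scheduling": Zone.CONTROLLED,
    "group_device_operation": Zone.CONTROLLED,
    "cost_forecast": Zone.HIGHER,
    "billing_calculation": Zone.HIGHER,
}


def decide(intent, user_confirmed=False, human_approved=False):
    """Return how a request may proceed; unknown intents default to higher risk."""
    zone = INTENT_ZONES.get(intent, Zone.HIGHER)
    if zone is Zone.LOWER:
        return "answer_directly"
    if zone is Zone.CONTROLLED:
        return "execute" if user_confirmed else "show_preview_and_ask"
    # Higher risk: the number must come from a validated service and be approved.
    return "explain_validated_result" if human_approved else "request_human_approval"


print(decide("device_status"))
print(decide("scheduling"))
print(decide("cost_forecast"))
```

Defaulting unrecognized intents to the strictest zone is a deliberate choice in this sketch: a request the system cannot classify gets the most oversight, not the least.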

A production cost-management workflow, for example, could use the LLM to clarify whether the user wants historical cost, projected cost, or potential savings. The request would then be converted into a structured specification and sent to a validated calculation service with fixed unit-conversion rules, approved tariffs, explicit treatment of export credits, and a documented forecasting model.
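
A rough sketch of that hand-off is shown below. Everything in it is hypothetical: the `CostRequest` schema, the three request kinds, the flat tariff, and the seven-day mean used as the forecasting rule. The moving average is used only because it is short and deterministic, not because it is a recommended forecasting method.

```python
from dataclasses import dataclass
from datetime import date

ALLOWED_KINDS = {"historical_cost", "projected_cost", "potential_savings"}


@dataclass(frozen=True)
class CostRequest:
    """Hypothetical request schema the LLM would fill in; not from the paper."""
    kind: str
    start: date
    end: date
    appliance: str = "whole_home"

    def __post_init__(self):
        # Reject anything outside the approved request types and date logic.
        if self.kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown request kind: {self.kind}")
        if self.end < self.start:
            raise ValueError("end date precedes start date")


def forecast_daily_kwh(history_kwh, window=7):
    """One fixed, documented rule: the mean of the last `window` daily totals."""
    recent = history_kwh[-window:]
    return sum(recent) / len(recent)


def projected_cost(request, history_kwh, flat_rate):
    """Deterministic projection: the same inputs always give the same number."""
    if request.kind != "projected_cost":
        raise ValueError("this service only handles projected_cost requests")
    days = (request.end - request.start).days + 1  # inclusive date range
    return round(days * forecast_daily_kwh(history_kwh) * flat_rate, 2)


request = CostRequest("projected_cost", date(2024, 2, 1), date(2024, 2, 29))
history = [30.0, 28.0, 32.0, 30.0, 31.0, 29.0, 30.0]  # illustrative daily kWh
estimate = projected_cost(request, history, flat_rate=0.20)
print(f"projected cost: ${estimate:.2f}")
```

In this arrangement the model's only job is to produce a valid `CostRequest`. The forecasting rule and the rate are fixed in code, so two users who ask equivalent questions receive the same method and the same answer.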

The LLM could explain the result afterward. It should not decide halfway through that today feels like a random-forest day.

The authors suggest that multi-agent workflows may improve performance by assigning specialized tasks to separate agents. That is a reasonable research direction, but specialization alone does not guarantee reliability. A collection of LLM agents can still pass errors politely between themselves.

For business deployment, the more important distinction is between probabilistic reasoning and deterministic responsibility. Forecasting, billing, safety constraints, and device permissions need named owners, testable logic, and clear failure behavior—whether they are implemented as agents, services, or conventional software.

The Benchmark Is More Valuable Than a Polished Demonstration

The prototype’s three-module architecture is sensible, but architectures for sensing, reasoning, and action are not rare. The paper’s more useful contribution is the multidimensional benchmark.

A polished demonstration can show an agent successfully answering one carefully selected question. This study asks 120 questions across analytical, operational, memory, and support tasks, then records accuracy, latency, tool use, and computational cost.

That evaluation exposes differences a demonstration would conveniently hide:

  • The agent can select tools correctly while producing an incorrect answer.
  • Straightforward memory operations can outperform apparently sophisticated analytics.
  • A vague visualization request may produce a technically valid but useless chart.
  • More reasoning time and more tokens do not necessarily buy greater accuracy.
  • Backend implementation quality can affect measured agent performance.

The visualization results illustrate the third point. Of 40 requested visualizations, 16 were rated accurate, 21 partially correct, and three incorrect. Some outputs were mathematically valid but uninformative, such as displaying an entire month of consumption as a single bar.

The agent also generated 33 visualizations without being explicitly asked. Some improved the explanation; others added little. Proactivity, like autonomy, sounds better before someone measures its usefulness.

The study therefore offers a useful evaluation principle: assess an agent by task class and workflow, not by a single overall accuracy score.

A system averaging 79% accuracy may be acceptable as a support assistant and unacceptable as a billing analyst. The average does not resolve the decision. The category breakdown does.

What the Study Shows—and What It Does Not

The evidence supports several direct conclusions.

The prototype can connect natural-language requests with real residential energy data, structured memory, analytical tools, and simulated device operations. It performs especially well on bounded memory and support tasks. Its sync-query-execute workflow helps ground device actions. Multi-step analytical tasks, particularly cost management, remain substantially less reliable and more computationally demanding.

Several practical claims remain untested.

First, the evaluation uses one month of historical data from each of four U.S. single-family homes. All four homes included solar generation and electric-vehicle charging, but they do not represent the diversity of residential and commercial buildings.

Second, the energy data is real, while current device states and action interfaces are simulated. The experiment does not test live hardware failures, communication delays, conflicting commands, occupant interruptions, or physical safety consequences.

Third, the queries are predefined. The study does not include end users interacting naturally over extended conversations, nor does it measure satisfaction, usability, trust, or whether occupants follow the recommendations.

Fourth, the benchmark evaluates responses and operations, not achieved energy savings. It does not establish that deploying the agent reduces consumption, lowers bills, improves comfort, or produces a positive return on investment.

Fifth, some errors came from the prototype infrastructure itself. Scheduling-information accuracy, for example, was reduced by an inconsistency in the simulated API. Agent performance cannot be separated entirely from the quality of the tools and data supplied to it.

Finally, the experiment uses GPT-4o in zero-shot mode and prices inference according to rates available during the evaluation. Model performance, latency, and cost will change. The paper’s enduring contribution is therefore the evaluation framework and the observed task hierarchy, not the exact dollar cost of a query.

Safety and privacy also remain design requirements rather than tested outcomes. A building agent may infer occupant routines from energy data and directly affect devices that influence comfort or well-being. The paper appropriately argues that users must retain ultimate control, but it does not evaluate the controls needed to guarantee that outcome.

A Talking Building Still Needs a Calculator and a Permission System

The appealing vision is a building that understands ordinary language, remembers preferences, diagnoses problems, manages devices, and explains its decisions.

The study shows that parts of this vision are already technically plausible. The agent was effective when retrieving structured information, managing memory, providing support, and operating through validated device tools.

The same system became less dependable when a request required several analytical transformations. Cost management consumed the most resources and produced the least reliable answers. The agent often knew which tools to use. It simply could not consistently manage everything that happened after opening them.

That is the useful correction to the assumption that grounding an LLM in building data makes it equally trustworthy across tasks.

A well-designed building agent should speak naturally, understand context, and coordinate operations. It should also know when to hand a problem to a deterministic calculation engine, request confirmation, or decline to act.

The future smart building may talk back.

It should still keep a calculator, a ledger, and a permission system nearby.

Cognaptus: Automate the Present, Incubate the Future.


  1. Tianzhi He and Farrokh Jazizadeh, “Context-aware LLM-based AI Agents for Human-centered Energy Management Systems in Smart Buildings,” arXiv:2512.25055, 2025. https://arxiv.org/abs/2512.25055