Grid Chat: When Your Battery Negotiates With the Power Market

At 5 p.m., the grid wants help.

The evening peak is approaching, the aggregator needs 3 kW of flexibility between 17:00 and 19:00, and one household in the portfolio looks promising. In the old demand-response world, this might become a price alert, an app notification, or a silent automated command. The household either complies, ignores it, or discovers later that the “smart” system has made a decision that feels less smart when dinner, laundry, or comfort is involved.

In the paper “Conversational Demand Response: Bidirectional Aggregator-Prosumer Coordination through Agentic AI,” the exchange looks different.¹

The aggregator asks. The household energy system evaluates. The battery optimizer checks whether the request is actually feasible. The resident receives a plain-language explanation: the event is fully feasible, the net benefit is €1.13, the battery state of charge changes from 30% to 43%, and there is no comfort impact. The resident approves. The household commits. The aggregator updates the portfolio.

That is the whole idea in miniature: demand response not as command, not as blind automation, and not as another charming notification to be dismissed with the rest of digital civilization’s mosquito swarm. It becomes a conversation grounded in optimization.

The important word is grounded.

This paper is not saying that a large language model should directly control your battery, EV charger, or heat pump. That would be a bold way to rediscover liability law. The authors propose something more useful and less theatrical: LLM-based agents act as the communication and orchestration layer, while technical feasibility remains handled by a Home Energy Management System, asset-specific sub-agents, and a Mixed-Integer Linear Programming optimizer.

So the real contribution is not “the grid gets a chatbot.”

The real contribution is a coordination pattern: natural-language negotiation at the edge of a mathematically constrained energy system.

The 17:00–19:00 request is the architecture in one scene

Start with the case.

An operator tells the aggregator: the portfolio needs 3 kW of flexibility between 17:00 and 19:00. The aggregator agent does not ask every household to please be heroic for the energy transition. It queries its portfolio, identifies a suitable registered household, and dispatches a contextualized request.

That request arrives at the prosumer-side Home Energy Management System, or HEMS. The HEMS does not simply say yes. It switches into demand-response event mode, interprets the request, and decides which household asset should evaluate it. In the demonstrated case, the battery sub-agent is the relevant specialist.

The battery sub-agent then calls an optimizer. This matters because a battery’s ability to provide flexibility is not a simple yes/no property. It depends on state of charge, expected PV production, household demand, efficiency losses, degradation cost, feed-in tariff, electricity prices, DR compensation, and whether earlier time slots are already locked because part of the day has already elapsed. Physics, unfortunately, remains stubbornly pre-LLM.

The optimizer compares two schedules:

a baseline schedule without the demand-response commitment;
a schedule with the 17:00–19:00 discharge obligation imposed.

That dual-solve procedure turns “Can you help the grid?” into a quantified trade-off. If the full request is feasible, the system can report the economics. If it is not feasible, the optimizer can search for the maximum deliverable level. The LLM then translates that technical result into a message the resident can understand before approving or rejecting participation.

This is the paper’s strongest design choice. The LLM does not replace the energy model. It narrates and coordinates around it.
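
The dual-solve procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's MILP: `toy_solver` is a hypothetical stand-in that returns a feasibility flag and a schedule cost from an invented rule, the step-down search is one simple way to look for a maximum deliverable level, and every name and number in the block is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Solve:
    feasible: bool
    cost_eur: float  # daily schedule cost before any DR compensation


def evaluate_request(solve: Callable[[float], Solve], target_kw: float,
                     hours: float, comp_eur_per_kwh: float,
                     step_kw: float = 0.5) -> dict:
    """Compare baseline and DR-constrained solves; step down if infeasible."""
    baseline = solve(0.0)  # first solve: no DR obligation
    kw = target_kw
    while kw > 0:
        constrained = solve(kw)  # second solve: discharge obligation imposed
        if constrained.feasible:
            compensation = kw * hours * comp_eur_per_kwh
            extra_cost = constrained.cost_eur - baseline.cost_eur
            return {
                "deliverable_kw": kw,
                "full_request": kw == target_kw,
                "net_benefit_eur": round(compensation - extra_cost, 2),
            }
        kw = round(kw - step_kw, 6)  # try a smaller commitment
    return {"deliverable_kw": 0.0, "full_request": False, "net_benefit_eur": 0.0}


def toy_solver(dr_kw: float) -> Solve:
    # Hypothetical stand-in for the MILP: at most 4 kW can be sustained,
    # and each committed kW adds EUR 0.02 of rescheduling cost.
    return Solve(feasible=dr_kw <= 4.0, cost_eur=1.50 + 0.02 * dr_kw)


full = evaluate_request(toy_solver, target_kw=3.0, hours=2.0, comp_eur_per_kwh=0.20)
partial = evaluate_request(toy_solver, target_kw=5.0, hours=2.0, comp_eur_per_kwh=0.20)
print(full)
print(partial)
```

The useful property is that the language layer only ever sees the returned dictionary. It can phrase the result for the resident, but it cannot change whether the request was feasible.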

Conversational demand response solves an interface problem, not an optimization problem

Demand response has never lacked clever optimization.

The harder problem is that residential participation depends on trust, attention, and perceived control. Households do not behave like perfectly responsive batteries with Wi-Fi. They have routines, discomfort thresholds, habits, confusion, and a limited appetite for reading tariff logic at 6 p.m.

Traditional residential demand response usually falls into two imperfect categories. The table below sets them beside the conversational alternative.

Approach	What it gives the system	What it takes away from the household
Fully automated control	Low-friction dispatch and scalable execution	Transparency, agency, and sometimes trust
Price alerts or one-way notifications	User choice and low technical intrusion	Attention, comprehension, and timely response
Conversational Demand Response	A structured negotiation layer between aggregator and household	Still requires careful design, validation, and privacy controls

Conversational Demand Response, or CDR, aims to occupy the middle space. The aggregator still needs scalable coordination. The prosumer still needs agency. The household system still needs rigorous feasibility checks. CDR connects these layers through bidirectional natural-language interaction.

Downstream, the aggregator can send a flexibility request and receive a commitment or refusal. Upstream, the household can update preferences, asset availability, or constraints without waiting for a formal re-enrollment process.

This upstream direction is easy to underestimate. In the paper’s second demonstration, the prosumer tells the HEMS: “I’m away on holiday next week. Maximize revenue from my battery and EV.” The HEMS updates the household preference and notifies the aggregator, which updates portfolio planning.

That is not a dramatic exchange. It is operationally important precisely because it is mundane. A household’s flexibility profile changes whenever someone travels, buys an EV, installs a heat pump, changes work hours, or decides comfort matters more than compensation this week. If aggregators cannot hear those changes, their portfolio model becomes a polite fiction with a dashboard.

CDR turns preference changes into routable operational information.
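
One way to picture "routable operational information" is as a small structured message that the HEMS emits after interpreting the resident's sentence. The sketch below is an assumption-laden illustration: the three update kinds mirror the upstream scenarios named in the paper's benchmark (availability, preference, new asset), but the `UpstreamUpdate` and `Portfolio` classes, the field names, and the household identifier are all hypothetical. The natural-language classification step itself is outside the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum


class UpdateKind(Enum):
    AVAILABILITY = "availability"
    PREFERENCE = "preference"
    NEW_ASSET = "new_asset"


@dataclass
class UpstreamUpdate:
    household_id: str
    kind: UpdateKind
    payload: dict


@dataclass
class Portfolio:
    profiles: dict = field(default_factory=dict)

    def apply(self, update: UpstreamUpdate) -> dict:
        """Merge one upstream update into the household's flexibility profile."""
        profile = self.profiles.setdefault(
            update.household_id,
            {"assets": [], "preferences": {}, "availability": {}},
        )
        if update.kind is UpdateKind.NEW_ASSET:
            profile["assets"].append(update.payload["asset"])
        elif update.kind is UpdateKind.PREFERENCE:
            profile["preferences"].update(update.payload)
        else:
            profile["availability"].update(update.payload)
        return profile


portfolio = Portfolio()
# "I'm away on holiday next week. Maximize revenue from my battery and EV."
profile = portfolio.apply(UpstreamUpdate(
    household_id="hh-001",
    kind=UpdateKind.PREFERENCE,
    payload={"objective": "maximize_revenue", "scope": ["battery", "ev"]},
))
portfolio.apply(UpstreamUpdate("hh-001", UpdateKind.NEW_ASSET, {"asset": "heat_pump"}))
print(profile)
```

Keeping the update typed means the aggregator stores a preference, not a transcript, which also limits how much household detail has to leave the home.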

The two-tier system keeps market logic and household feasibility separate

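The architecture has two main tiers: the aggregator side and the household side. Within the household tier, the HEMS orchestrator delegates to asset-level sub-agents, and the battery sub-agent in turn calls an optimization tool.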

Layer	Agent	Main responsibility	What it should not do
Aggregator side	Aggregator agent	Convert market or operator needs into household-level flexibility requests; manage portfolio updates	Estimate detailed household feasibility without local asset context
Household side	HEMS orchestrator	Interpret requests, delegate to asset sub-agents, present options to the resident	Pretend every asset has the same operating logic
Asset level	Battery or appliance sub-agents	Assess feasibility for specific loads or devices	Make portfolio-level market decisions
Optimization tool	Battery MILP optimizer	Compute baseline and DR-constrained battery schedules	Hold a conversation or infer user intent

This separation is not just clean software design. It is politically and commercially important.

The aggregator should know enough to coordinate a portfolio, but not necessarily enough to inspect every household’s intimate operational details. The household should evaluate feasibility locally, because that is where the asset state, user preferences, and physical constraints live. The LLM interface makes the exchange legible without centralizing every detail into the aggregator’s brain.

For utilities and aggregators, this suggests a practical deployment principle: do not build one giant agent that “understands the grid and the household.” Build role-specific agents with limited responsibilities and tool access.

The aggregator asks for flexibility.

The household proves whether it can provide it.

The resident gets the final explanation.

This is boring in the best possible way.
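
The principle of role-specific agents with limited tool access can be sketched as an allow-list checked before any tool call. The role names follow the table above; the tool names, the `call_tool` helper, and the lambda stand-ins are hypothetical and are not taken from the paper's implementation.

```python
from typing import Callable, Dict, Set

# Hypothetical allow-list: each role may only call tools that match its job.
ROLE_TOOLS: Dict[str, Set[str]] = {
    "aggregator": {"query_portfolio", "send_request", "update_portfolio"},
    "hems": {"delegate_to_subagent", "present_to_resident", "notify_aggregator"},
    "battery_subagent": {"run_battery_optimizer"},
}


class ToolNotAllowed(PermissionError):
    """Raised when an agent role requests a tool outside its allow-list."""


def call_tool(role: str, tool: str, registry: Dict[str, Callable], **kwargs):
    """Dispatch a tool call only if the role is permitted to use that tool."""
    if tool not in ROLE_TOOLS.get(role, set()):
        raise ToolNotAllowed(f"{role!r} may not call {tool!r}")
    return registry[tool](**kwargs)


# Toy tool implementations so the example runs end to end.
registry = {
    "run_battery_optimizer": lambda target_kw: {"feasible": target_kw <= 4.0},
    "query_portfolio": lambda min_kw: ["hh-001"],
}

result = call_tool("battery_subagent", "run_battery_optimizer", registry, target_kw=3.0)

try:
    call_tool("aggregator", "run_battery_optimizer", registry, target_kw=3.0)
    blocked = False
except ToolNotAllowed:
    blocked = True  # the aggregator cannot reach into household optimization
print(result, blocked)
```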

The battery optimizer is where the conversation earns its credibility

The battery component is the technical anchor of the paper.

A battery cannot be evaluated like a simple appliance that shifts from one time slot to another. Its schedule couples decisions across the whole day. Charge now, discharge later, preserve state of charge, avoid excessive degradation, respect charge and discharge limits, handle PV production, and do not accidentally invent free energy because a language model became enthusiastic.

The authors use a Mixed-Integer Linear Programming optimizer over a 24-hour horizon with 15-minute resolution. The objective considers net electricity cost, battery degradation, and peak discharge regularization. The constraints enforce energy balance, state-of-charge dynamics, capacity limits, and operational restrictions such as preventing battery-to-grid export outside DR events.
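
For readers who want to see what such a formulation typically looks like, here is a generic textbook-style sketch that matches the description above. It is not the paper's exact model. All symbols are assumptions introduced for illustration: $E_t$ is stored energy, $p^{\mathrm{ch}}_t$ and $p^{\mathrm{dis}}_t$ are charge and discharge power, $g^{\mathrm{imp}}_t$ and $g^{\mathrm{exp}}_t$ are grid import and export, $L_t$ is household load, $PV_t$ is PV output, $c_t$ is the electricity price, $f$ the feed-in tariff, $d$ the degradation cost, $\lambda$ a regularization weight, $u_t$ a binary charge/discharge mode, $\Delta t = 0.25$ h, and $\mathcal{T}_{\mathrm{DR}}$ the set of event slots. The last two constraints are one possible encoding of the export restriction and the DR obligation.

```latex
\begin{aligned}
\min \quad & \sum_{t=1}^{96} \left( c_t\, g^{\mathrm{imp}}_t - f\, g^{\mathrm{exp}}_t + d\, p^{\mathrm{dis}}_t \right) \Delta t + \lambda\, P^{\max} \\
\text{s.t.} \quad & g^{\mathrm{imp}}_t - g^{\mathrm{exp}}_t = L_t - PV_t + p^{\mathrm{ch}}_t - p^{\mathrm{dis}}_t \\
& E_{t+1} = E_t + \left( \eta_c\, p^{\mathrm{ch}}_t - \frac{p^{\mathrm{dis}}_t}{\eta_d} \right) \Delta t \\
& E^{\min} \le E_t \le E^{\max} \\
& 0 \le p^{\mathrm{ch}}_t \le \bar{P}\, u_t, \qquad 0 \le p^{\mathrm{dis}}_t \le \bar{P}\,(1 - u_t), \qquad u_t \in \{0, 1\} \\
& p^{\mathrm{dis}}_t \le P^{\max} \\
& g^{\mathrm{exp}}_t \le PV_t \qquad \text{for } t \notin \mathcal{T}_{\mathrm{DR}} \\
& p^{\mathrm{dis}}_t \ge P^{\mathrm{DR}} \qquad \text{for } t \in \mathcal{T}_{\mathrm{DR}}
\end{aligned}
```

The binary $u_t$ is what makes the problem mixed-integer: it prevents the solver from charging and discharging in the same slot to exploit efficiency terms.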

For the proof-of-concept simulation, the battery has 15 kWh capacity, 8 kW charge/discharge rate, 92% round-trip efficiency, 20–90% state-of-charge bounds, and an initial state of charge of 30%. The economic assumptions include a €0.04/kWh feed-in tariff, €0.015/kWh degradation cost, and €0.20/kWh DR compensation. PV forecast is 8.5 kWh/day, and the model runs at 96 slots per day.

Those details matter because the paper is not merely showing that an LLM can produce a fluent answer. It shows that the answer can be backed by a structured optimization result.
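
The stated parameters allow a quick back-of-envelope check of the 17:00–19:00 request. The sketch below is deliberately crude and is not the optimizer: it ignores PV, household load, efficiency losses, and prices, and the function and field names are hypothetical. It only asks how much energy the event needs compared with what is stored above the minimum state of charge.

```python
CAPACITY_KWH = 15.0
SOC_MIN, SOC_MAX, SOC_INIT = 0.20, 0.90, 0.30
MAX_POWER_KW = 8.0
SLOT_HOURS = 0.25  # 15-minute resolution, 96 slots per day


def slot_index(hhmm: str) -> int:
    """Map a clock time such as '17:00' to its 15-minute slot index."""
    hours, minutes = map(int, hhmm.split(":"))
    return (hours * 60 + minutes) // 15


def quick_check(target_kw: float, start: str, end: str, soc: float = SOC_INIT) -> dict:
    """Rough energy budget for a DR event; ignores PV, load, and losses."""
    slots = slot_index(end) - slot_index(start)
    event_kwh = target_kw * slots * SLOT_HOURS
    stored_kwh = (soc - SOC_MIN) * CAPACITY_KWH          # energy above the SoC floor
    usable_band_kwh = (SOC_MAX - SOC_MIN) * CAPACITY_KWH  # full usable window
    return {
        "slots": slots,
        "event_kwh": event_kwh,
        "within_power_limit": target_kw <= MAX_POWER_KW,
        "fits_usable_band": event_kwh <= usable_band_kwh,
        "needs_precharge": event_kwh > stored_kwh,
        "shortfall_kwh": round(max(0.0, event_kwh - stored_kwh), 3),
    }


check = quick_check(3.0, "17:00", "19:00")
print(check)
```

Under these simplifications the event needs 6 kWh, while only about 1.5 kWh sits above the 20% floor at a 30% state of charge. That is one way to see why a feasibility answer has to come from a schedule over the whole day and not from the current battery level alone.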

In the 17:00–19:00 case, the optimizer compares normal self-consumption against a DR-constrained schedule. The constrained schedule pre-charges the battery to a higher state of charge before the event, discharges during the requested window, and then experiences a rebound through increased grid import afterward. That rebound is not a bug. It is the cost of participation.

The HEMS can then tell the resident something economically meaningful: participation is worthwhile because compensation exceeds the additional cost created by changing the battery schedule.

That is the difference between an AI assistant and an AI-shaped brochure.
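
The reported figures can be tied together with a rough calculation. It rests on two assumptions that the article does not confirm: that compensation is paid on the full energy delivered over the window, and that the reported net benefit equals compensation minus the extra cost of the DR-constrained schedule relative to the baseline. Here $P$ is the requested power, $T$ the event duration, $r$ the DR compensation rate, and $C_{\mathrm{DR}}$ and $C_{\mathrm{base}}$ the costs of the two schedules.

```latex
\begin{aligned}
\text{net benefit} &= P \cdot T \cdot r - \left( C_{\mathrm{DR}} - C_{\mathrm{base}} \right) \\
P \cdot T \cdot r &= 3\ \mathrm{kW} \times 2\ \mathrm{h} \times 0.20\ \text{EUR/kWh} = 1.20\ \text{EUR} \\
C_{\mathrm{DR}} - C_{\mathrm{base}} &\approx 1.20\ \text{EUR} - 1.13\ \text{EUR} = 0.07\ \text{EUR}
\end{aligned}
```

The implied extra cost of roughly €0.07 is an inference from the quoted numbers, not a value reported in the paper.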

The evidence shows an operational loop, not market-scale performance

The results section contains two kinds of evidence.

First, the authors demonstrate end-to-end exchanges: one downstream aggregator-initiated DR dispatch and one upstream prosumer-initiated profile update. Second, they benchmark six scenarios to test whether the agentic workflow stays within conversational latency and cost bounds.

These tests are useful, but they need to be read correctly.

Test or result	Likely purpose	What it supports	What it does not prove
3 kW request from 17:00–19:00	Main end-to-end demonstration	The architecture can route a market request through aggregator, HEMS, optimizer, resident approval, and portfolio update	That real households will participate more often
Holiday revenue-maximization update	Upstream capability demonstration	Prosumer preferences can flow back into portfolio planning	That aggregators can safely automate all preference interpretation
High-target 5 kW scenario	Stress-style variant within the benchmark	The optimizer-agent loop can adapt when business-as-usual scheduling is insufficient	That the system handles all extreme grid events
Five repeated runs per scenario	Basic variability check	Outputs and latency are reasonably consistent in the tested setup	Production reliability under many concurrent households
Under-12-second completion across scenarios	Computational feasibility evidence	The prototype can sustain conversational-speed interaction in isolated runs	Field-ready scalability, privacy, or regulatory acceptability

The benchmark results are still interesting.

Direction	Scenario	Iterations	Tool calls	Tokens	Time
Downstream	Acceptance	3.6 ± 0.5	2.6 ± 0.5	23.4k ± 4.0k	8.3s ± 2.1s
Downstream	Rejection	5.0 ± 0.7	3.6 ± 0.5	34.2k ± 6.0k	9.8s ± 1.9s
Downstream	High target	3.4 ± 0.5	2.4 ± 0.5	21.8k ± 4.1k	7.8s ± 2.2s
Upstream	Availability	1	0	1.3k ± 0.0k	1.3s ± 0.8s
Upstream	Preference	1	0	1.0k ± 0.0k	1.7s ± 1.6s
Upstream	New asset	1	0	1.6k ± 0.1k	1.3s ± 1.2s

The downstream cases are heavier because they require reasoning cycles and tool calls. The rejection case is the most expensive: the system still performs the full feasibility workflow, then composes an explanation for why the resident declined and relays that outcome to the aggregator. That is exactly where a conversational layer adds work. Refusal is not a missing value; it is a decision with context.

The upstream cases are much lighter. They require classification and routing, not optimization. This difference is important for product design. A commercial CDR interface would likely handle many more lightweight profile updates than full real-time DR negotiations. Treating all interactions as equally expensive would overstate the operating cost.

Still, the benchmark is a prototype test. It shows that the loop can complete quickly in isolated scenarios using GPT-OSS-120B served through Cerebras inference. It does not show what happens when thousands of households are negotiating, rejecting, asking follow-up questions, or providing contradictory preference updates at the same time.

The paper is convincing as architecture.

It is not yet evidence of market-scale deployment.
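
The point about lightweight and heavy interactions can be made concrete with the mean token counts from the benchmark table. The sketch below computes an expected token load per interaction for a given traffic mix. The token figures come from the table; the mix itself (ten downstream acceptances for every ninety upstream updates) is a purely hypothetical assumption, and `expected_tokens_k` is an illustrative helper.

```python
# Mean tokens per interaction, in thousands, from the benchmark table.
TOKENS_K = {
    "acceptance": 23.4,
    "rejection": 34.2,
    "high_target": 21.8,
    "availability": 1.3,
    "preference": 1.0,
    "new_asset": 1.6,
}


def expected_tokens_k(mix: dict) -> float:
    """Weighted mean tokens (thousands) for a mix of scenario counts."""
    total = sum(mix.values())
    if total <= 0:
        raise ValueError("mix must contain at least one interaction")
    return sum(TOKENS_K[name] * count for name, count in mix.items()) / total


# Hypothetical traffic: mostly profile updates, occasional DR negotiation.
hypothetical_mix = {"acceptance": 10, "availability": 30, "preference": 30, "new_asset": 30}
mixed = expected_tokens_k(hypothetical_mix)

# Naive estimate that prices every interaction like a downstream acceptance.
naive = TOKENS_K["acceptance"]
print(f"mixed: {mixed:.2f}k tokens, naive: {naive:.1f}k tokens")
```

With this invented mix, the average interaction costs a small fraction of a full negotiation. The real ratio depends on traffic patterns that only a field deployment would reveal.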

The business value is not “AI for energy”; it is lower coordination friction

For utilities, aggregators, and smart-home platforms, the business pathway is fairly clear.

Residential flexibility is valuable only when it can be mobilized repeatedly. Repeated participation depends on more than compensation. Households need to understand what is being asked, what they earn, what they lose, and whether they remain in control. CDR directly targets that coordination friction.

The paper suggests five practical business shifts.

Business problem	CDR mechanism	Practical meaning	Remaining uncertainty
Low DR participation	Plain-language explanations before commitment	More households may be willing to participate when trade-offs are visible	No field trial proves behavior change yet
Poor portfolio visibility	Upstream preference and asset updates	Aggregators can maintain fresher flexibility profiles	Requires robust identity, consent, and data governance
Opaque incentive logic	Compensation linked to specific asset actions and market conditions	Residents can see why a reward exists, not just receive a number	Explanation quality must be tested with real users
Rigid dispatch	Household-local feasibility checks	Aggregator requests can adapt to actual asset state	Multi-household coordination remains untested
Customer support burden	Conversational follow-up and rejection handling	Some clarification work can be automated	Poorly designed agents may create new support problems

The immediate ROI story is not that conversational agents magically create energy flexibility. They do not. The flexibility is already sitting in batteries, EV chargers, heat pumps, and shiftable loads.

The ROI story is that conversational coordination may make that flexibility easier to access without making users feel as if they joined a remote-control experiment.

This matters for platform strategy. A smart-home vendor might see CDR as a premium energy assistant. A utility might treat it as a customer engagement layer for demand flexibility programs. An aggregator might use it to improve portfolio commitment quality. A DERMS provider might embed it as an interface between optimization engines and end users.

But the same warning applies across all of these use cases: the agent should not be sold as the decision engine. The safer and more credible product is an agentic interface around existing forecasting, optimization, consent, and settlement systems.

In other words: let the model explain the trade-off. Do not let it freelance the grid.

The misconception: the AI is not “controlling the home”

The most tempting reading of the paper is also the least useful one: “LLMs will control household energy devices.”

That framing misses the architecture.

The LLM-based agents coordinate, translate, route, and explain. They call tools. They delegate to sub-agents. They present options. They update portfolio records. But the feasibility of the battery response is computed by an optimizer with explicit constraints. Execution remains inside the HEMS workflow and requires approval in the demonstrated scenario.

That distinction matters because energy systems are safety-critical and contract-heavy. A fluent model answer is not a valid operating schedule. A resident saying “maximize revenue while I’m away” is not automatically a legally complete authorization for every possible dispatch event. A natural-language explanation is not the same as settlement-grade proof.

The useful future is therefore not a household chatbot with a dangerous amount of confidence.

The useful future is a layered system:

formal optimization for feasibility;
structured agent workflows for routing and tool use;
natural-language explanation for user understanding;
explicit approval and policy constraints for control;
auditable logs for accountability.

That stack is less glamorous than “AI runs your home.” It is also much closer to something a utility might eventually be able to defend in front of regulators, customers, and its own lawyers.
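
The approval and audit layers of that stack can be sketched as a single gate that every proposed dispatch must pass. This is a minimal illustration with hypothetical names and fields: `commit_dispatch` and the proposal dictionary are assumptions for the example, and a real system would need durable, tamper-evident storage in place of an in-memory list.

```python
import json
from datetime import datetime, timezone


def commit_dispatch(proposal: dict, resident_approved: bool, audit_log: list) -> bool:
    """Commit only feasible, approved proposals; log every decision either way."""
    if not proposal.get("feasible", False):
        outcome = "rejected_infeasible"      # the optimizer said no
    elif not resident_approved:
        outcome = "declined_by_resident"     # the human said no
    else:
        outcome = "committed"
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "proposal": proposal,
        "resident_approved": resident_approved,
        "outcome": outcome,
    }
    audit_log.append(json.dumps(entry, sort_keys=True))  # append-only record
    return outcome == "committed"


audit_log: list = []
proposal = {"target_kw": 3.0, "start": "17:00", "end": "19:00", "feasible": True}

committed = commit_dispatch(proposal, resident_approved=True, audit_log=audit_log)
declined = commit_dispatch(proposal, resident_approved=False, audit_log=audit_log)
infeasible = commit_dispatch({**proposal, "feasible": False}, True, audit_log)
print(committed, declined, infeasible, len(audit_log))
```

Nothing in the gate depends on what the language model said. The decision uses the optimizer's feasibility flag and the resident's explicit approval, and the log records refusals as carefully as commitments.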

The limits are product requirements in disguise

The paper is careful about its boundaries. The evaluation is proof-of-concept. It uses a single representative household setup, isolated sessions, and simulated exchanges. There is no field trial comparing CDR against conventional DR interfaces. There is no evidence yet that households would participate more often, understand incentives better, or maintain trust over months of repeated events.

The concurrency question is also open. All scenarios complete in under 12 seconds, but operational deployment would involve many households, simultaneous requests, retries, failures, and follow-up questions. A CDR platform would need queueing logic, fallback automation, rate limits, monitoring, and graceful degradation when LLM services slow down or fail.
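
Graceful degradation, at its simplest, means the resident still gets a correct message when the language model is slow or unavailable. The sketch below assumes a hypothetical `llm_explain` callable that raises on timeout or failure, and falls back to a fixed template built from the optimizer result. The function names and result fields are illustrative assumptions.

```python
from typing import Callable, Tuple


def template_explanation(result: dict) -> str:
    """Plain fallback text built only from structured optimizer output."""
    status = "feasible" if result["feasible"] else "not feasible"
    return (
        f"Request for {result['target_kw']} kW from {result['start']} to "
        f"{result['end']} is {status}. Estimated net benefit: "
        f"EUR {result['net_benefit_eur']:.2f}."
    )


def explain_with_fallback(llm_explain: Callable[[dict], str], result: dict) -> Tuple[str, str]:
    """Return (text, source); use the template if the LLM call fails or is empty."""
    try:
        text = llm_explain(result)
        if isinstance(text, str) and text.strip():
            return text, "llm"
    except Exception:
        pass  # timeouts, rate limits, and service errors all degrade the same way
    return template_explanation(result), "template"


def failing_llm(result: dict) -> str:
    raise TimeoutError("inference service did not respond")  # simulated outage


result = {"target_kw": 3.0, "start": "17:00", "end": "19:00",
          "feasible": True, "net_benefit_eur": 1.13}
text, source = explain_with_fallback(failing_llm, result)
print(source, "->", text)
```

The fallback is less friendly than a generated explanation, but it carries the same numbers, which is what the resident needs in order to decide.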

Privacy is another practical boundary. The paper’s architectural separation helps because household feasibility can remain local to the HEMS. But any real implementation would still need to define which data leaves the household, what the aggregator stores, how consent is represented, and how preference updates are audited.

Then there is the issue of explanation quality. The paper demonstrates that technical results can be translated into natural language. It does not prove that average users will interpret those explanations correctly, trust them appropriately, or make better decisions because of them.

These are not reasons to dismiss the paper. They are the next product checklist.

Boundary	Why it matters for deployment
Single-household proof of concept	Portfolio-scale coordination may introduce congestion, conflicting commitments, and aggregate forecast errors
No field-trial engagement data	The main business value depends on human participation, not just technical feasibility
Cloud-based inference setup	Cost, latency, and privacy depend on deployment architecture
Limited asset scope in demonstration	Real homes combine EVs, batteries, heat pumps, appliances, and messy user routines
Natural-language authorization	Product teams must distinguish preference, approval, contract change, and complaint
No settlement-layer validation	DR rewards need auditable measurement and verification

The paper opens a plausible design path. It does not remove the need for engineering discipline.

Terribly inconvenient. Also true.

From smart grid to negotiable grid

The deeper lesson of this paper is that infrastructure coordination is becoming conversational because infrastructure itself is becoming distributed.

A centralized power plant does not need to discuss dinner plans. A million households with batteries, EVs, rooftop solar, and comfort preferences do. The grid increasingly depends on assets it does not fully own and cannot casually command. That changes the interface problem.

Conversational Demand Response is one answer to that problem. It says the aggregator should be able to express a market need in plain operational terms. The household should be able to evaluate that request locally. The resident should see the cost, benefit, and comfort impact before committing. Preference changes should flow upward before the portfolio model becomes stale.

That is not just a nicer app interface. It is a different coordination grammar.

The current paper demonstrates this grammar in a small, controlled setting: one household, representative battery parameters, two end-to-end exchange types, and six benchmarked scenarios. The results are modest but meaningful. Downstream negotiations complete in seconds. Upstream updates are lighter. The optimizer grounds the conversation. The agents keep the interaction legible.

The business implication is equally modest and meaningful: the first value of agentic AI in energy may not be autonomous control. It may be making existing control systems understandable enough that people keep using them.

The grid does not need to become chatty for the sake of being chatty.

But when flexibility depends on consent, context, and repeated participation, silence is expensive.

The battery may not negotiate like a trader. The HEMS may not reason like a human. The aggregator may not become a charming conversationalist.

Still, the architecture points to a future where the household is no longer a passive endpoint in demand response. It becomes a participant with a local model, a negotiating interface, and a memory of its own constraints.

That is the real “grid chat.”

Not small talk.

Coordination.

Cognaptus: Automate the Present, Incubate the Future.

¹ Reda El Makroum, Sebastian Zwickl-Bernhard, Lukas Kranzl, and Hans Auer, “Conversational Demand Response: Bidirectional Aggregator-Prosumer Coordination through Agentic AI,” arXiv:2603.06217v1, 6 March 2026.
