Building simulation is not glamorous work. It is a room full of configuration files, simulator interfaces, reward functions, time-series outputs, and small mistakes that quietly invalidate a week of analysis. The industry likes to talk about intelligent buildings. The less marketable truth is that before a building can be intelligent, someone has to wire the experiment together correctly.
That is the real bottleneck behind the paper on AutoB2G, a large language model-driven agentic framework for automated building–grid co-simulation [1]. The obvious reading is that this is another reinforcement-learning-for-buildings paper. That reading is tempting, tidy, and slightly wrong.
The more useful reading is this: AutoB2G is a study of whether natural language can become a reliable interface for building an executable simulation workflow, not merely for writing a few lines of plausible code. The paper connects CityLearn V2 with Pandapower, uses a directed acyclic graph to keep the codebase structurally honest, and then lets a multi-agent system called SOCIA generate, execute, evaluate, and repair the simulator.
That distinction matters. A building controller that improves voltage behavior is interesting. A system that makes complex building–grid experiments easier to generate, debug, and repeat is more operationally important. One improves a model. The other compresses the workflow around the model.
The real bottleneck is not the controller; it is the experiment
Building energy research already has simulation tools. CityLearn supports building-side control experiments. EnergyPlus can generate high-fidelity building data. Pandapower can model distribution-network power flow. GridLearn previously connected CityLearn V1 with grid models. So the problem is not that every ingredient is missing.
The problem is that the ingredients do not automatically become a meal. Someone still needs to decide which building data are used, how buildings map to grid buses, which controller is trained, which reward function is applied, what metrics are exported, which grid elements are modified, and whether the output files actually correspond to the requested experiment. This is where many AI-for-infrastructure projects quietly lose time. The model may be clever; the workflow is still artisanal.
AutoB2G attacks that unglamorous layer. It extends CityLearn V2 into a building-to-grid environment, then wraps the simulation workflow in an LLM-based automation system. A user can describe a task in natural language, such as training a SAC controller, testing it on an IEEE 33-bus distribution network, adding a reactive power shunt, computing bus voltages and line loadings, and exporting plots and CSV files. The framework’s job is to turn that request into an executable simulation program.
This is why the paper should not be read as “LLMs help with energy simulation.” That is too broad to be useful. The sharper claim is that LLM agents need two forms of scaffolding before they can automate serious simulation work: structural retrieval, so they select the right modules in the right order, and iterative repair, so they can correct the inevitable errors created by long dependency chains.
AutoB2G has three moving parts, but one workflow
The framework combines three layers. Each layer solves a different failure mode.
| Layer | What it does | Failure mode it addresses | Business interpretation |
|---|---|---|---|
| Building–grid co-simulation | Connects CityLearn V2 building control with Pandapower grid analysis | Buildings optimize locally while grid impacts remain secondary | Enables grid-aware digital-twin experiments rather than isolated energy-control demos |
| DAG-based agentic retrieval | Represents functions and dependencies as a structured codebase | LLM retrieves useful fragments but composes them in the wrong order | Makes automation auditable and less dependent on heroic prompt wording |
| SOCIA + textual-gradient descent | Generates, executes, evaluates, and patches simulator code | One-shot code generation fails under multi-step configuration constraints | Turns simulation building into an iterative repair process, not a single guess |
The important word in that table is not “LLM.” It is “workflow.” AutoB2G does not assume that the model knows the simulator. It assumes the model must be forced to operate within a structured environment. This is a healthier design philosophy. LLMs are impressive pattern machines; they are not naturally careful simulation engineers. Shocking, I know.
First, buildings and grids must share the same state
In many building-control settings, the controller optimizes building-side outcomes such as cost, comfort, emissions, or peak demand. Those are legitimate objectives, but they do not automatically tell us whether a distribution network is happier. A cluster of buildings can look efficient locally while creating voltage or loading problems upstream.
AutoB2G changes the simulation loop. At each simulation step, building-level electricity demand from CityLearn is mapped to load buses in a Pandapower distribution network. Pandapower runs power-flow analysis and returns grid states, such as bus voltage magnitudes. Those grid variables can then enter the observation space and the reward design for the building-control agent.
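The two hand-offs in that loop can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the building-to-bus mapping, the squared-deviation penalty, and the weight of 10 are all assumptions, and the voltages are passed in as plain numbers where a real run would read them from the power-flow solver.

```python
from collections import defaultdict


def aggregate_bus_loads(building_demand_kw, building_to_bus):
    """Sum per-building net demand (kW) onto the bus each building is attached to."""
    bus_load_kw = defaultdict(float)
    for building, demand_kw in building_demand_kw.items():
        bus_load_kw[building_to_bus[building]] += demand_kw
    return dict(bus_load_kw)


def grid_aware_reward(energy_cost, bus_voltages_pu, voltage_weight=10.0, nominal_pu=1.0):
    """Negative energy cost minus a penalty on squared voltage deviation.

    The penalty form and the weight are illustrative assumptions.
    """
    deviation = sum((v - nominal_pu) ** 2 for v in bus_voltages_pu)
    return -energy_cost - voltage_weight * deviation


# Hypothetical step: three buildings, two of them sharing bus 5.
demand = {"building_1": 2.5, "building_2": -0.5, "building_3": 1.0}
mapping = {"building_1": 5, "building_2": 5, "building_3": 12}
bus_loads = aggregate_bus_loads(demand, mapping)

# Voltages are made up here; in a co-simulation they come back from the grid model.
flat_reward = grid_aware_reward(energy_cost=3.0, bus_voltages_pu=[1.0, 1.0, 1.0])
stressed_reward = grid_aware_reward(energy_cost=3.0, bus_voltages_pu=[1.0, 1.06, 1.08])
```

With identical energy cost, the stressed case earns a lower reward than the flat one, which is the whole point: the controller now pays for what it does to the feeder.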
In plain terms, the building controller is no longer playing a private game. Its actions affect the grid, and the grid state feeds back into the next decision. That is the difference between energy management as spreadsheet optimization and energy management as system participation.
The paper also adds grid-side evaluation metrics: voltage admissibility, thermal loading limits, N–1 resilience, and short-circuit current. This matters because reinforcement learning does not provide the same explicit guarantees as mathematical optimization. If the controller is learning through interaction, the evaluation system needs to ask harder questions than “Did the bill go down?” A cheap electricity bill attached to a stressed feeder is not a great victory. It is just a problem with nicer branding.
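Two of those four checks are simple enough to sketch without a solver. The snippet below is an illustration under stated assumptions: the 0.95 to 1.05 p.u. voltage band and the 100% loading limit are common conventions rather than values taken from the paper, and the function names are hypothetical. N–1 resilience and short-circuit current need a grid model and are left out.

```python
def voltage_violations(bus_voltages_pu, v_min=0.95, v_max=1.05):
    """Return the buses whose voltage magnitude falls outside [v_min, v_max]."""
    return {bus: v for bus, v in bus_voltages_pu.items() if not (v_min <= v <= v_max)}


def overloaded_lines(line_loading_percent, limit_percent=100.0):
    """Return the lines loaded above the thermal limit."""
    return {line: pct for line, pct in line_loading_percent.items() if pct > limit_percent}


def grid_report(bus_voltages_pu, line_loading_percent):
    """Combine both checks into one pass/fail summary."""
    bad_buses = voltage_violations(bus_voltages_pu)
    bad_lines = overloaded_lines(line_loading_percent)
    return {
        "voltage_ok": not bad_buses,
        "thermal_ok": not bad_lines,
        "bad_buses": bad_buses,
        "bad_lines": bad_lines,
    }


# Made-up results for a three-bus, two-line toy case.
report = grid_report(
    bus_voltages_pu={0: 1.00, 1: 0.97, 2: 0.93},
    line_loading_percent={0: 62.0, 1: 104.5},
)
```

In this toy case bus 2 and line 1 both fail, so a low bill would not be enough to call the run a success.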
Then the codebase becomes a map, not a pile of snippets
A normal retrieval-augmented generation system can retrieve relevant code or documentation. That helps when the problem is missing knowledge. It is less sufficient when the problem is missing order.
Building–grid simulation is not a bag of independent helper functions. Data generation may need to happen before CityLearn configuration. The grid network must be created before power-flow evaluation. A reward function must match the observation and action structure. Output management depends on what the simulation actually records. A model can retrieve all the right pieces and still fail by assembling them in a subtly wrong sequence.
AutoB2G addresses this by organizing the codebase as a directed acyclic graph. Functions become nodes. Dependencies become edges. Each node carries attributes such as input interfaces, output interfaces, stage, mandatory or optional status, and a natural-language description of its role. A valid execution sequence must respect topological order: if function $f_j$ requires outputs from $f_i$, then $f_i$ must appear earlier.
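A toy version of that structure fits in a screenful. The sketch below assumes four hypothetical functions and derives edges by matching each node's declared inputs to whichever node produces them; the node names and attribute layout are illustrative, not the paper's codebase. Python's standard `graphlib` supplies the topological sort.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class FunctionNode:
    name: str
    stage: str
    mandatory: bool
    inputs: frozenset
    outputs: frozenset
    description: str


# Hypothetical nodes; the real codebase would have many more.
NODES = [
    FunctionNode("generate_data", "data", False, frozenset(), frozenset({"dataset"}),
                 "Create building time series"),
    FunctionNode("configure_env", "setup", True, frozenset({"dataset"}), frozenset({"env"}),
                 "Configure the building environment"),
    FunctionNode("build_grid", "setup", True, frozenset(), frozenset({"network"}),
                 "Create the distribution network"),
    FunctionNode("run_power_flow", "evaluation", True, frozenset({"env", "network"}),
                 frozenset({"grid_states"}), "Evaluate grid states"),
]


def dependency_map(nodes):
    """Map each function to the set of functions that produce its inputs."""
    producer = {out: node.name for node in nodes for out in node.outputs}
    return {node.name: {producer[i] for i in node.inputs} for node in nodes}


def respects_topological_order(sequence, depends_on):
    """True if every dependency appears in the sequence before its consumer."""
    position = {name: i for i, name in enumerate(sequence)}
    return all(
        dep in position and position[dep] < position[name]
        for name in sequence
        for dep in depends_on[name]
    )


DEPENDS_ON = dependency_map(NODES)
valid_order = list(TopologicalSorter(DEPENDS_ON).static_order())
bad_order = ["run_power_flow", "build_grid", "generate_data", "configure_env"]
```

Here `valid_order` passes the check and `bad_order` fails it, even though both contain exactly the same four functions. That is the "missing order" problem in miniature.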
The business translation is simple: the system does not merely search for relevant code; it plans a valid path through the codebase. When retrieved modules do not form a valid DAG, a validator produces feedback about missing dependencies or violated constraints, and the agent revises its proposal.
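What such validator feedback might look like can be sketched as follows. This is an assumption-heavy illustration: the module names, the message wording, and the `validate_plan` function are hypothetical, and the paper's validator may report violations differently.

```python
# Hypothetical dependency map: each module lists the modules it needs first.
DEPENDS_ON = {
    "generate_data": set(),
    "configure_citylearn": {"generate_data"},
    "build_grid": set(),
    "train_controller": {"configure_citylearn"},
    "run_power_flow": {"build_grid", "train_controller"},
    "export_results": {"run_power_flow"},
}
MANDATORY = {"build_grid", "run_power_flow", "export_results"}


def validate_plan(plan, depends_on, mandatory):
    """Return textual feedback for the agent; an empty list means the plan is valid."""
    feedback = []
    position = {name: i for i, name in enumerate(plan)}
    for name in sorted(mandatory - set(position)):
        feedback.append(f"missing mandatory module: {name}")
    for name in plan:
        for dep in sorted(depends_on.get(name, ())):
            if dep not in position:
                feedback.append(f"{name} requires {dep}, which was not retrieved")
            elif position[dep] > position[name]:
                feedback.append(f"{name} must run after {dep}")
    return feedback


# A proposal with one ordering error and one missing dependency.
proposal = ["build_grid", "run_power_flow", "train_controller", "export_results"]
messages = validate_plan(proposal, DEPENDS_ON, MANDATORY)
```

The proposal yields two messages: the power flow is scheduled before the controller it depends on, and the controller needs an environment configuration that was never retrieved. Messages of this kind are specific enough for an agent to revise the plan rather than start over.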
That is the core difference between “LLM with context” and “LLM inside an executable architecture.” Context is a library. A DAG is a traffic system. Without the second, the first can still cause accidents.
Finally, SOCIA treats code as something to repair, not something to admire
The third layer is SOCIA, the multi-agent orchestration framework used to construct the simulator. The agents have specialized roles: a workflow manager coordinates the process; a code generator produces and patches code; a simulation executor compiles and runs it; a result evaluator checks structural and functional consistency; and a feedback generator converts failures into repair instructions.
The paper describes this repair process as textual-gradient descent. The phrase sounds like it was designed to irritate both software engineers and mathematicians, but the mechanism is useful. Instead of numeric gradients, the system uses structured language feedback as the backward signal.
The paper defines a feasible set of programs as those satisfying constraints such as syntax validity, compilation success, and interface consistency. It then defines a loss over violations of those constraints.
A fully compliant program has zero loss. A program with violated constraints gets feedback. The feedback generator identifies what went wrong, where the problem likely sits, and what minimal correction is needed. The code generator patches the program, then a projection step applies self-validation and repair to restore feasibility.
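The shape of that loop can be shown with a deliberately small stand-in. Everything here is an assumption made for illustration: the "program" is a plain dictionary, the loss is simply the number of violated constraints, the bus numbers are invented, and the `patch` function is hard-coded where the paper uses LLM agents to write the fix.

```python
REQUESTED_SHUNT_BUS = 17  # hypothetical requirement taken from the user's task


def evaluate(program):
    """Return textual feedback for every violated constraint (empty list = feasible)."""
    feedback = []
    if program["shunt_bus"] != REQUESTED_SHUNT_BUS:
        feedback.append(
            f"shunt placed at bus {program['shunt_bus']}, expected bus {REQUESTED_SHUNT_BUS}"
        )
    if "bus_voltage_csv" not in program["exports"]:
        feedback.append("bus voltage CSV export is missing")
    return feedback


def loss(program):
    """Illustrative loss: the number of violated constraints."""
    return len(evaluate(program))


def patch(program, message):
    """Stand-in for the code generator: apply one minimal fix for one message."""
    fixed = {**program, "exports": list(program["exports"])}
    if message.startswith("shunt placed"):
        fixed["shunt_bus"] = REQUESTED_SHUNT_BUS
    elif message.startswith("bus voltage CSV"):
        fixed["exports"].append("bus_voltage_csv")
    return fixed


def repair(program, max_iterations=5):
    """Evaluate, patch the first reported violation, and repeat until the loss is zero."""
    history = [loss(program)]
    for _ in range(max_iterations):
        feedback = evaluate(program)
        if not feedback:
            break
        program = patch(program, feedback[0])
        history.append(loss(program))
    return program, history


draft = {"shunt_bus": 3, "exports": ["plots"]}
repaired, loss_history = repair(draft)
```

The loss history for this draft runs 2, 1, 0: one targeted fix per iteration, each tied to a specific message, with the original draft left untouched for comparison.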
This is not magic. It is structured debugging with language as the coordination layer. The useful point is that AutoB2G does not bet everything on the first generated answer. It assumes the first answer will often be wrong, then designs a loop around that fact. Quite a radical idea in AI: expect errors, then build machinery to reduce them.
The main experiment tests workflow reliability, not just simulation output
The paper evaluates four configurations: a single LLM without retrieval, SOCIA without retrieval, a single LLM with agentic retrieval, and SOCIA with agentic retrieval. All methods use the same experimental setup and are tested with the OpenAI GPT-5 API, according to the paper.
The tasks are grouped into simple, medium, and complex levels. Simple tasks are single-stage workflows. Medium tasks involve model training and evaluation. Complex tasks require multi-stage data preparation, policy training, grid-side analysis, configuration changes, and result organization. This task design is sensible because the paper’s central claim is not that LLMs can write a small script. We already knew that. The question is whether they can survive dependency complexity without slowly turning the simulator into soup.
The first result is task success rate.
| Method | Simple tasks | Medium tasks | Complex tasks |
|---|---|---|---|
| LLM | 0.90 ± 0.08 | 0.77 ± 0.12 | 0.53 ± 0.19 |
| SOCIA | 0.93 ± 0.09 | 0.83 ± 0.12 | 0.73 ± 0.09 |
| LLM + AR | 0.97 ± 0.05 | 0.80 ± 0.08 | 0.67 ± 0.05 |
| SOCIA + AR | 1.00 ± 0.00 | 0.93 ± 0.09 | 0.83 ± 0.05 |
The pattern is more important than any single number. The single LLM begins reasonably well at 0.90 on simple tasks, then drops to 0.53 on complex tasks. SOCIA plus agentic retrieval still reaches 0.83 on complex tasks. That is not perfection, but it is a meaningful difference: the hardest setting is exactly where structure and iterative repair matter most.
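The second metric is code score, which rewards the structural correctness of the generated program rather than only whether the task finished.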
This metric is useful because a generated program can finish a task while still including unnecessary modules, missing required components, or producing code that is hard to inspect. In business settings, “it ran” is not enough. The analyst still needs to understand whether the result came from the requested workflow or from a machine-generated detour wearing a confident expression.
| Method | Simple tasks | Medium tasks | Complex tasks |
|---|---|---|---|
| LLM | 0.69 ± 0.08 | 0.66 ± 0.05 | 0.44 ± 0.08 |
| SOCIA | 0.82 ± 0.14 | 0.78 ± 0.05 | 0.67 ± 0.09 |
| LLM + AR | 0.72 ± 0.08 | 0.74 ± 0.07 | 0.73 ± 0.13 |
| SOCIA + AR | 1.00 ± 0.00 | 0.84 ± 0.07 | 0.88 ± 0.09 |
Here the gap becomes sharper. On complex tasks, the single LLM scores 0.44, while SOCIA plus agentic retrieval reaches 0.88. That result supports the paper’s mechanism-first argument: retrieval alone helps ground the model; multi-agent iteration helps repair it; the combination is stronger because it attacks both knowledge selection and execution correctness.
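The gaps in both tables are easy to recompute. The snippet below only subtracts the reported means; it ignores the ± spreads, so it is a reading aid and not a significance test. The helper names are hypothetical.

```python
# Reported means as (simple, medium, complex), copied from the two tables.
SUCCESS = {
    "LLM": (0.90, 0.77, 0.53),
    "SOCIA": (0.93, 0.83, 0.73),
    "LLM + AR": (0.97, 0.80, 0.67),
    "SOCIA + AR": (1.00, 0.93, 0.83),
}
CODE_SCORE = {
    "LLM": (0.69, 0.66, 0.44),
    "SOCIA": (0.82, 0.78, 0.67),
    "LLM + AR": (0.72, 0.74, 0.73),
    "SOCIA + AR": (1.00, 0.84, 0.88),
}


def simple_to_complex_drop(table):
    """How much each method loses between simple and complex tasks."""
    return {method: round(simple - complex_, 2) for method, (simple, _, complex_) in table.items()}


def complex_gain_over(table, baseline="LLM"):
    """Gain of each method over the baseline on complex tasks."""
    base = table[baseline][2]
    return {m: round(v[2] - base, 2) for m, v in table.items() if m != baseline}


success_drop = simple_to_complex_drop(SUCCESS)
success_gain = complex_gain_over(SUCCESS)
code_gain = complex_gain_over(CODE_SCORE)
```

On complex tasks the combined system gains 0.30 in success rate and 0.44 in code score over the single LLM. The single LLM loses 0.37 in success rate between simple and complex tasks, against 0.17 for SOCIA plus agentic retrieval.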
The appendix-like tests are not side decoration; they explain why the system works
The paper includes several forms of evidence. They should not all be read as the same kind of result.
| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Success-rate comparison across LLM, SOCIA, LLM + AR, SOCIA + AR | Main evidence | Structure plus iteration improves task completion, especially for complex workflows | General reliability across every simulator or domain |
| Code-score comparison | Main evidence | The framework improves correctness and avoids irrelevant or missing components | That code is production-ready without review |
| TGD decision log | Mechanism illustration / implementation detail | The repair loop can identify a concrete configuration error and issue targeted fixes | Statistical robustness by itself |
| Success rates over TGD iterations | Repair-loop evidence | Early iterations repair many failures; harder tasks converge more slowly | That more iterations will always solve remaining failures |
| Platform comparison with CityLearn V1, CityLearn V2, and GridLearn | Comparison with prior work | AutoB2G combines natural-language configuration, grid modeling, voltage control, and CityLearn V2 features | Numerical superiority over GridLearn, which the authors explicitly avoid claiming |
| Voltage dynamics, voltage distribution, and net-load tables | Simulation demonstration | The generated environment can evaluate grid-aware control behavior | Universal performance of the learned control policy |
This distinction is important because the paper is easy to overread. The strongest evidence is the automation result: complex-task success and code-score improvements. The grid-control results show that the generated co-simulation can produce meaningful grid-side evaluation, but they are not a sweeping claim that this particular RL policy is the new king of voltage control. The authors themselves note that direct numerical comparison with GridLearn is not the goal because the CityLearn versions differ.
That restraint is useful. It keeps the paper from pretending that every result must be a leaderboard.
The grid results show whether the workflow can support real system questions
After testing automation reliability, the paper demonstrates the simulation environment itself. AutoB2G is compared with CityLearn V1, CityLearn V2, and GridLearn. Its claimed coverage is broader: natural-language configuration, building modeling, grid modeling, voltage control, customized datasets, EV integration, thermal dynamics, and occupant behavior.
The grid-side evaluation uses settings close to GridLearn: an IEEE 33-bus distribution network, voltage-control objectives, and a building replication simplification to scale the system. Baseline and RL-based strategies are executed. Under the RL setting, the controller coordinates building-level actions with grid constraints and keeps voltages closer to the nominal value of 1.0 p.u.
The more interesting part is the behavioral interpretation. In over-voltage conditions, the controlled strategy shifts net load toward higher positive load levels. In the table reported by the paper, the share of operating points in the $[1, 3)$ kW interval rises from 49.7% uncontrolled to 76.3% controlled under over-voltage. That means the controller increases flexible building demand when extra demand can help reduce voltage rise.
Under under-voltage conditions, the behavior reverses. The controlled strategy concentrates more mass in the $[-1, 1)$ kW interval: 74.3% controlled versus 54.4% uncontrolled. It also reduces the share in the $[1, 3)$ kW interval from 41.1% to 23.8%. In less table-shaped English: when voltage is low, the controller reduces load pressure.
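Shares like these come from binning net-load samples into half-open intervals. A minimal sketch of that calculation follows, with invented sample values and an invented third bin; only the $[-1, 1)$ and $[1, 3)$ kW intervals are taken from the text above.

```python
def interval_shares(net_load_kw, edges):
    """Percentage of samples falling in each half-open interval [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for value in net_load_kw:
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                counts[i] += 1
                break
    total = len(net_load_kw)
    return {
        f"[{edges[i]}, {edges[i + 1]})": 100.0 * counts[i] / total
        for i in range(len(counts))
    }


# Made-up net-load samples in kW; real values would come from the simulation output.
samples = [-0.5, 0.2, 0.9, 1.0, 1.5, 2.9, 3.0, 0.0]
shares = interval_shares(samples, edges=[-1, 1, 3, 5])
```

Read this way, the reported over-voltage shift from 49.7% to 76.3% is a move of 26.6 percentage points into the $[1, 3)$ kW bin, and the under-voltage case moves 19.9 points into $[-1, 1)$ kW while $[1, 3)$ kW loses 17.3 points.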
This is the part of the paper where building control becomes grid participation. The building is no longer a passive consumer with a clever thermostat. It becomes a flexible asset that can modulate demand in response to grid conditions. That is the business story behind the technical workflow.
What AutoB2G directly shows, and what Cognaptus infers
The paper directly shows that, within its CityLearn–Pandapower setting, AutoB2G can automate natural-language-driven building–grid simulation tasks more reliably than a single LLM baseline. It also shows that DAG-based retrieval and SOCIA-style iterative repair are complementary. Complex workflows are where the combined system earns its keep.
Cognaptus’s business inference is narrower and more useful than “LLM agents will automate energy.” A better inference is that simulation-heavy organizations may be able to reduce the cost of scenario construction. That includes energy-management teams, digital-twin vendors, utilities experimenting with demand response, real-estate operators testing flexible-load strategies, and software companies building interfaces for technical simulation platforms.
The value pathway is workflow compression:
| Practical step | Operational effect | ROI relevance |
|---|---|---|
| Natural-language task description | Less manual setup for standard experiments | Shorter cycle from question to scenario |
| DAG-constrained retrieval | Fewer missing dependencies and wrong module combinations | Lower debugging burden and easier audit trail |
| Agentic execution and repair | Errors become structured feedback instead of dead ends | Higher reuse of simulation templates |
| Grid-side metrics | Building-control results can be evaluated against grid constraints | Better screening before pilot projects |
| Exportable plots and CSV outputs | Results are easier to inspect, compare, and communicate | Faster reporting and decision review |
Notice what is not in that table: replacing domain experts. AutoB2G does not eliminate the need for people who understand buildings, grids, control objectives, or simulation validity. It changes where expertise is used. Instead of spending attention on repetitive workflow wiring, experts can spend more attention on whether the scenario is meaningful, whether the assumptions are defensible, and whether the result should influence a real deployment.
For any serious organization, that is a better automation target anyway. Replacing judgment is the fantasy. Reducing clerical friction around judgment is the product.
The remaining failures are the most honest part of the paper
AutoB2G still fails in a non-trivial share of complex cases. The paper identifies two main sources. First, cross-module dependencies are tightly coupled. Even when the correct modules appear, small inconsistencies in configuration or execution order can break the pipeline. Second, natural-language instructions are ambiguous. Some requirements are implicit, and the model may infer extra operations that the user did not intend.
These failure modes are not minor. They are exactly the kind of failures that matter in infrastructure simulation. A misplaced shunt, an incorrect bus index, an unverified line-loading threshold, or a silent mismatch between a reward function and evaluation metric can produce results that look professional while being wrong. That is not automation. That is a very well-formatted trap.
The TGD example is therefore revealing. In the decision log, SOCIA detects a reactive power compensation element placed at the wrong bus and flags it as materially affecting the grid results. It then recommends patch instructions, verification tests, and re-running the relevant simulations. This does not prove that the system catches every serious error. It does show the right design instinct: do not just regenerate code; create localized repair instructions tied to execution evidence.
That is the difference between “try again” and “fix the part that violates the requirement.” The former is a slot machine. The latter is engineering.
Where this should not be overread
The boundary is clear. AutoB2G is validated within a limited set of platforms and experimental settings: CityLearn, Pandapower, specific workflow modules, and controlled task categories. The paper does not establish robustness across arbitrary building simulators, power-system tools, interface standards, control methods, or messy enterprise data environments.
It also does not remove the need for verification. A generated simulator should still be reviewed, especially when outputs inform investment, operational policy, safety studies, or regulatory decisions. The code-score metric rewards structural correctness, but production reliability requires a wider set of controls: versioned scenario definitions, unit tests, reproducibility checks, data validation, numerical sanity checks, and human review of assumptions.
Finally, the paper’s grid-side demonstration should be interpreted as feasibility evidence for the co-simulation pipeline, not as a universal benchmark of RL control performance. The authors explicitly frame part of the comparison as consistency under comparable settings rather than direct numerical superiority over GridLearn.
These boundaries do not weaken the paper. They make the contribution easier to place. AutoB2G is not the final form of autonomous infrastructure simulation. It is a credible prototype of the scaffolding such systems will need.
The larger lesson: prompts are not enough; architecture does the boring work
The fashionable version of this story would say that natural language is becoming the new programming interface. That is partly true, but incomplete. AutoB2G suggests a more grounded conclusion: natural language becomes useful for serious simulation only when wrapped in architecture that understands dependencies, execution order, validation, and repair.
The lesson generalizes beyond buildings and grids. Many business workflows have the same shape: multiple tools, hidden dependencies, configurable modules, fragile execution order, and output formats that must match downstream analysis. Financial stress testing, supply-chain scenario simulation, manufacturing process modeling, regulatory compliance checks, and clinical operations planning all share versions of this problem.
In each case, the hard part is not asking an LLM to produce a plausible answer. The hard part is making sure the answer is generated through the right operational pathway.
AutoB2G is valuable because it points away from prompt theatre and toward executable scaffolding. The future of enterprise AI will not be won by the longest prompt or the most theatrical agent persona. It will be won by systems that make models operate inside well-defined structures, catch their mistakes, and repair them with evidence.
Less magic. More plumbing. Better results.
Cognaptus: Automate the Present, Incubate the Future.
[1] Borui Zhang, Nariman Mahdavi, Subbu Sethuvenkatraman, Shuang Ao, and Flora Salim, “AutoB2G: A Large Language Model-Driven Agentic Framework For Automated Building–Grid Co-Simulation,” arXiv:2603.26005, 2026. https://arxiv.org/abs/2603.26005