TL;DR for operators
HVAC optimisation is not really about “setting the right temperature”. That is the version suitable for brochure copy and mildly insulting procurement decks. The harder problem is deciding when comfort, occupancy, outdoor conditions, and electricity prices should overrule one another.
The paper behind this article proposes a human-in-the-loop reinforcement learning controller for HVAC systems.[1] Its main idea is simple enough to be useful: when occupants override the system, that feedback should not merely fix the current moment. It should also teach the controller what went wrong, so future decisions require fewer overrides.
The system combines four ingredients: a PPO-based reinforcement learning policy, a feedback buffer that records recent user overrides, an occupancy prediction model based on past occupancy and time-of-day patterns, and dynamic wholesale electricity prices. The controller then decides whether the HVAC should be on or off, while the heating or cooling mode is assumed to be preset.
The strongest business reading is not “AI replaces HVAC professionals”. Please. The better reading is that HVAC work is moving toward adaptive control design: feedback engineering, occupancy-data governance, tariff-aware scheduling, and commissioning systems that learn without annoying everyone in the building.
The evidence is promising but bounded. The experiments use a 30-day residential occupancy dataset, weather data for Clayton in Melbourne, Victorian wholesale electricity prices from AEMO, simulated occupant feedback, 15-minute timesteps, 23 training days, and 7 test days. This is not yet a multi-building deployment. The paper shows a mechanism worth watching, not a magic box ready to be bolted onto every chiller plant by Tuesday.
The thermostat was never the real interface
A thermostat looks like a simple device because the user interface is simple. A number goes up, a number goes down, and everyone pretends the building is now under control. Behind that tiny display, however, sits a bad negotiation.
The occupant wants comfort now. The operator wants lower energy cost. The grid would prefer that load shifts away from expensive or constrained periods. The building has thermal inertia. Occupancy is uncertain. Weather is impolite. Electricity prices move. The thermostat, heroic little rectangle that it is, is asked to settle all of this with a fixed setpoint and perhaps a schedule.
This is why the paper matters. It does not treat HVAC control as a one-time settings problem. It treats it as a loop.
The system observes indoor temperature, outdoor temperature, time of day, forecasted outdoor temperature, occupancy information, forecasted occupancy, recent feedback, current wholesale electricity rate, and forecasted market rates. From that state, the controller chooses an action: HVAC on or HVAC off. The paper assumes the direction of conditioning, heating or cooling, is preset by occupants; the AI is deciding timing, not redesigning the entire mechanical system from orbit.
That distinction matters. The contribution is not a grand unified theory of building climate. It is a practical control framework for deciding when to run the system under uncertainty, while learning from people when the decision is wrong.
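To make the state and action concrete, here is a minimal sketch of how the listed observations might be packed into one vector. The field names, the forecast lengths, and the `Observation` class itself are illustrative assumptions; the paper's actual encoding is not given in this article.

```python
from dataclasses import dataclass
from typing import List

HVAC_OFF, HVAC_ON = 0, 1  # the only two actions the controller chooses between


@dataclass
class Observation:
    """Hypothetical container for the state described above."""
    indoor_temp_c: float
    outdoor_temp_c: float
    time_of_day_h: float                 # 0.0 to 24.0
    outdoor_temp_forecast_c: List[float]
    occupied: bool
    occupancy_forecast: List[float]      # probabilities per future timestep
    recent_feedback: List[int]           # 0/1 flags per past timestep
    price_now: float
    price_forecast: List[float]

    def to_vector(self) -> List[float]:
        # Flatten everything into the flat list a policy network would consume.
        return [
            self.indoor_temp_c,
            self.outdoor_temp_c,
            self.time_of_day_h,
            *self.outdoor_temp_forecast_c,
            float(self.occupied),
            *self.occupancy_forecast,
            *map(float, self.recent_feedback),
            self.price_now,
            *self.price_forecast,
        ]
```

Nothing here is clever. The point is that the policy sees comfort history, occupancy, and price side by side, and answers with a single on/off bit.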
The loop has two jobs: fix the moment and train the policy
The clever part is the role assigned to feedback. In many “human-in-the-loop” systems, the phrase means humans are dragged in whenever automation gets nervous. Very reassuring. Very scalable, obviously.
Here, feedback has a cleaner technical function. When an occupant overrides the HVAC decision, the override immediately affects the control action. If the system would keep the HVAC off but the occupant indicates discomfort, the final control can turn it on. That is the immediate correction.
But the same feedback is also stored in a feedback buffer. The buffer records recent override signals over a horizon equivalent to four hours, using 15-minute timesteps. The learning policy then sees this history in future states. So the override is not just “someone complained”; it becomes structured evidence about recent discomfort.
That gives feedback two channels, an immediate correction and a learning signal, which show up as four mechanisms:
| Mechanism | What happens operationally | Why it matters |
|---|---|---|
| Immediate override | User feedback changes the current HVAC control decision | Occupants are not trapped inside a bad automated decision |
| Feedback buffer | Recent overrides become part of the state seen by the RL policy | Repeated discomfort becomes something the policy can learn to avoid |
| Reward penalty | Frequent recent feedback increases discomfort cost | The controller is pushed away from strategies that require constant human correction |
| No-feedback reward signal | When occupied and no feedback is received, the model receives a small stabilising signal | The system is encouraged not to twitch unnecessarily when users appear satisfied |
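The mechanisms in the table can be sketched in a few lines. This is an illustration under stated assumptions: the 16-slot length follows from four hours at 15-minute timesteps, but the flip-on-override rule, the `penalty` and `calm_bonus` weights, and all names are hypothetical and not taken from the paper.

```python
from collections import deque

TIMESTEP_MIN = 15
BUFFER_HOURS = 4
BUFFER_LEN = BUFFER_HOURS * 60 // TIMESTEP_MIN  # 16 slots


class FeedbackBuffer:
    """Rolling record of recent overrides (1 = override, 0 = none)."""

    def __init__(self, length: int = BUFFER_LEN):
        self._slots = deque([0] * length, maxlen=length)

    def record(self, override: bool) -> None:
        # Appending to a full deque silently drops the oldest slot.
        self._slots.append(1 if override else 0)

    def as_state(self) -> list:
        return list(self._slots)

    def recent_rate(self) -> float:
        return sum(self._slots) / len(self._slots)


def final_action(policy_action: int, override: bool) -> int:
    # Immediate correction. Assumption: an override flips the on/off choice.
    return 1 - policy_action if override else policy_action


def discomfort_cost(buffer: FeedbackBuffer, occupied: bool, feedback_now: bool,
                    penalty: float = 1.0, calm_bonus: float = 0.1) -> float:
    # Hypothetical weights: frequent recent feedback raises the cost,
    # and an occupied timestep with no complaint earns a small reduction.
    cost = penalty * buffer.recent_rate()
    if occupied and not feedback_now:
        cost -= calm_bonus
    return cost
```

The same override therefore appears three times: in the action actually applied, in the state the policy sees next, and in the cost it is trained against.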
This is the paper’s real mechanism. The controller does not need a detailed physiological comfort model using skin temperature, clothing insulation, humidity sensors, and other ways to make a building feel like a low-budget medical trial. It uses overrides as a pragmatic signal.
That does not mean comfort is “objective”. In the simulation, feedback is generated probabilistically from deviations around a comfort temperature of 22°C with a ±3°C comfort range. So the human behaviour is modelled, not collected from actual occupants during deployment. The paper is careful enough to expose the mechanism; business readers should be careful enough not to confuse simulated feedback with office workers actually mashing a thermostat app during a heatwave.
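For intuition, a simulated occupant of this kind might look like the sketch below. The 22°C comfort temperature and ±3°C range come from the paper's setup as described above; the shape of the probability curve (zero inside the band, then a linear ramp over `ramp_c` degrees) is purely an assumption for illustration, as are the function names.

```python
import random

COMFORT_TEMP_C = 22.0
COMFORT_RANGE_C = 3.0


def feedback_probability(indoor_temp_c: float, ramp_c: float = 3.0) -> float:
    """Chance that a simulated occupant overrides at this temperature.

    Assumption: no feedback inside the comfort band, then a linear rise
    that reaches 1.0 once the excess deviation equals ramp_c degrees.
    """
    excess = abs(indoor_temp_c - COMFORT_TEMP_C) - COMFORT_RANGE_C
    if excess <= 0:
        return 0.0
    return min(1.0, excess / ramp_c)


def simulate_feedback(indoor_temp_c: float, rng: random.Random) -> bool:
    # Draw one Bernoulli sample from the probability above.
    return rng.random() < feedback_probability(indoor_temp_c)
```

A real occupant is under no obligation to follow this curve, which is rather the point of the caveat.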
The reward function turns comfort into an operating trade-off
The controller is trained with Proximal Policy Optimization, or PPO. For a business reader, the important part is not the acronym. It is the reward structure.
The paper defines total cost as a weighted combination of discomfort cost and energy cost.
The parameter $\beta$ controls the relative importance of discomfort versus energy cost. A higher $\beta$ makes the system care more about comfort. A lower $\beta$ makes it care more about energy cost.
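One form consistent with that description is a convex combination of the two costs. Treat the exact shape and the symbol names as an assumption made here for illustration; the paper's own notation may differ.

$$
C_{\text{total}} = \beta \, C_{\text{discomfort}} + (1 - \beta) \, C_{\text{energy}}, \qquad 0 \le \beta \le 1
$$

Under this form, $\beta = 1$ ignores the bill entirely and $\beta = 0$ ignores the occupants entirely. Neither extreme is an operating policy anyone should admit to.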
This is not merely a mathematical decoration. It is the operating policy in disguise.
A hotel, hospital, office tower, warehouse, and university building should not use the same $\beta$. In some spaces, comfort violations are commercial risk. In others, they are tolerable if they avoid peak-price energy use. The paper’s framework makes that trade-off explicit, which is precisely where many “smart building” discussions become suspiciously vague.
The energy cost is also concrete. The paper models HVAC energy cost as a function of the final controlled action, HVAC power consumption, timestep duration, and the wholesale electricity price. The experiment sets HVAC power consumption at 3.5 kW and uses 15-minute timesteps. That means the model is not just minimising runtime. It is trying to decide whether runtime is worth paying for at a particular moment.
That is the difference between a schedule and a market-aware controller.
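The per-timestep arithmetic is worth doing once by hand. At 3.5 kW, a 15-minute timestep consumes 0.875 kWh when the HVAC is on. The sketch below assumes the wholesale price is quoted in dollars per MWh, which is how AEMO publishes spot prices; the function name and signature are illustrative.

```python
HVAC_POWER_KW = 3.5
TIMESTEP_HOURS = 0.25  # 15 minutes


def energy_cost(action_on: int, price_per_mwh: float) -> float:
    """Cost in dollars of one timestep, given the final on/off action."""
    energy_kwh = HVAC_POWER_KW * TIMESTEP_HOURS * action_on  # 0.875 kWh if on
    return energy_kwh * price_per_mwh / 1000.0               # $/MWh to $/kWh


# Running for one timestep at $100/MWh costs about 8.75 cents;
# the same timestep during a $2,000/MWh spike costs about $1.75.
```

The runtime is identical in both cases. Only the price changed, which is exactly the information a fixed schedule never sees.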
Occupancy prediction is useful, but only when comfort actually matters
The paper tests four scenarios. These are not four product tiers. They are controlled experimental conditions designed to ask how much occupancy information helps.
| Scenario | What the controller knows | Likely purpose in the paper |
|---|---|---|
| S1: Perfect prediction | Current and future occupancy are known | Idealised ceiling for occupancy-aware control |
| S2: No occupancy prediction | Occupancy is excluded from the state | Tests whether feedback alone can compensate for missing occupancy data |
| S3: Current occupancy only | The controller sees only current occupancy | Tests the value of immediate occupancy without forecasting |
| S4: Realistic predicted occupancy | Current occupancy plus predicted future occupancy probabilities | Tests the practical value of imperfect prediction |
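The four scenarios differ only in which occupancy features reach the policy. A minimal sketch, assuming the features are simply concatenated into the state; the function and argument names are hypothetical.

```python
from typing import List, Sequence


def occupancy_features(scenario: str,
                       current: bool,
                       true_future: Sequence[bool],
                       predicted_future: Sequence[float]) -> List[float]:
    """Return the occupancy part of the state for scenarios S1 to S4."""
    if scenario == "S1":   # perfect prediction: actual future occupancy
        return [float(current)] + [float(x) for x in true_future]
    if scenario == "S2":   # no occupancy information at all
        return []
    if scenario == "S3":   # current occupancy only
        return [float(current)]
    if scenario == "S4":   # current occupancy plus predicted probabilities
        return [float(current)] + [float(p) for p in predicted_future]
    raise ValueError(f"unknown scenario: {scenario}")
```

With a two-hour horizon at 15-minute timesteps, S1 and S4 each add eight future values to the state, while S2 adds none and leans on the feedback buffer instead.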
The occupancy predictor itself uses a two-layer bidirectional LSTM. It looks at historical occupancy and time-of-day features, then predicts future occupancy probabilities over the horizon. In the experiment, the chosen prediction horizon is two hours. The authors say longer horizons were considered but brought diminishing returns while increasing state complexity and training overhead.
This is a useful operational detail. A building does not need a prophecy engine. It needs enough foresight to preheat or precool without wasting energy on rooms nobody will use. Two hours is plausible because HVAC decisions have thermal lag, but uncertainty expands quickly when people are involved, because people insist on behaving like people.
The results show a subtle pattern. Perfect prediction, S1, performs best across discomfort weights. That is not shocking; perfect information has a habit of looking clever. The more useful comparison is S3 versus S4. When discomfort weight is low or moderate, specifically at $\beta = 0.1$, $0.3$, and $0.5$, S3 and S4 show similar total costs. In other words, adding realistic occupancy prediction does not automatically help when the system is mostly optimising energy cost.
But as discomfort becomes more important, S4 begins to outperform S3. The paper reports that S4’s occupancy prediction accuracy is 92.52%, and with that level of accuracy it can better anticipate occupancy changes under higher discomfort weights.
The business implication is refreshingly inconvenient: occupancy prediction is not always worth the same amount. Its value rises when comfort matters enough to justify anticipation. If your operating policy treats comfort as secondary, a fancy predictor may mostly become an expensive way to decorate a dashboard.
The rule-based controller is comfortable because it is not paying attention
The paper compares the HITL approach with two benchmarks: a rule-based controller and an optimisation controller.
The rule-based controller is familiar. Turn HVAC on when people are home. Turn it off when they leave. Maintain indoor temperature when occupied. Ignore dynamic electricity prices. It is the kind of thing that makes sense until energy invoices begin their small career as horror literature.
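A rule of this kind fits in a dozen lines, which is part of its appeal. The sketch below is one plausible reading, not the paper's benchmark code: the simple thermostat comparison, the `mode` argument, and the default setpoint are assumptions.

```python
def rule_based_action(occupied: bool, indoor_temp_c: float,
                      mode: str = "heating", setpoint_c: float = 22.0) -> int:
    """Return 1 (HVAC on) or 0 (HVAC off). Price is deliberately not an input."""
    if not occupied:
        return 0                                  # nobody home: off
    if mode == "heating":
        return 1 if indoor_temp_c < setpoint_c else 0
    if mode == "cooling":
        return 1 if indoor_temp_c > setpoint_c else 0
    raise ValueError(f"unknown mode: {mode}")
```

Note what is missing from the signature. The rule cannot respond to a price spike because it is never told one is happening.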
The optimisation controller is different. It uses a Gurobi rolling-horizon optimisation model with the same thermal dynamics as the RL environment. But it assumes perfect prediction of future outdoor temperatures, wholesale prices, and occupancy. It also uses explicit comfort constraints and a 24-hour prediction horizon. In plain language: it gets information that real systems do not get.
This makes the comparison interesting, but also easy to misread. The optimisation controller is not a normal commercial baseline. It is closer to an idealised ceiling under unrealistic foresight. The rule-based controller is closer to everyday practice, but it ignores market prices.
The temperature metrics add a useful correction to the paper’s broader optimism:
| Method | Temperature violation probability | MAE to 22°C setpoint | What this means |
|---|---|---|---|
| HITL | 10.24% | 1.82°C | More flexible, but not the tightest comfort controller |
| Optimisation | 0.00% | 1.60°C | Strong comfort performance under perfect prediction and explicit constraints |
| Rule-based | 3.13% | 0.49°C | Very close to setpoint, but not market-responsive |
This table is the part a lazy summary would probably mishandle. The HITL controller does not dominate the rule-based controller on pure temperature precision. The rule-based controller has lower MAE and lower violation probability than HITL. Why? Because a simple rule can keep temperature near the setpoint by running the system whenever occupancy says it should. Comfort is easy when one ignores the bill.
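Both metrics are easy to compute from a temperature trace. The sketch below assumes they are evaluated on occupied timesteps only and that a violation means leaving the 22°C ± 3°C band; the paper's exact definitions may differ, and the function name is illustrative.

```python
from typing import Sequence, Tuple


def comfort_metrics(temps_c: Sequence[float], occupied: Sequence[bool],
                    setpoint_c: float = 22.0,
                    comfort_range_c: float = 3.0) -> Tuple[float, float]:
    """Return (violation probability, MAE to setpoint) over occupied steps."""
    occ = [t for t, o in zip(temps_c, occupied) if o]
    if not occ:
        return 0.0, 0.0
    violations = sum(1 for t in occ if abs(t - setpoint_c) > comfort_range_c)
    mae = sum(abs(t - setpoint_c) for t in occ) / len(occ)
    return violations / len(occ), mae
```

A controller can score well on MAE simply by running whenever someone is present, which is why these numbers need to be read next to the cost figures and not on their own.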
The figure-level evidence tells the other half of the story. In the paper’s cost analysis, the rule-based method is shown with much higher energy cost than the optimisation reference line, while HITL scenarios vary with the discomfort weight and occupancy information. The operational question is therefore not “which controller hugs 22°C most tightly?” It is “which controller can trade comfort against energy price without requiring perfect forecasts or explicit comfort models?”
That is where HITL earns attention.
The figures are evidence, not decoration
The paper’s experimental section does several jobs, and mixing them together would blur the interpretation.
| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Figure 2(a): cost box plots across scenarios and $\beta$ values | Main scenario evidence | Shows how occupancy information and comfort weighting affect total, energy, and discomfort costs across 25 RL runs | Does not prove field performance in deployed buildings |
| Figure 2(b): temperature trajectories | Behavioural illustration | Shows how selected controllers maintain temperature over time under occupied and unoccupied periods | Does not by itself establish general performance |
| Figure 2(c): controller decisions and feedback signals | Mechanism illustration | Shows the interaction among wholesale market price, HVAC decisions, and feedback events | Does not prove actual humans would provide feedback in the same way |
| Table 1: violation probability and MAE | Comfort-performance comparison | Clarifies that HITL trades off comfort precision against cost-aware flexibility | Does not show HITL is always more comfortable than rule-based control |
| Figure 3: sensitivity to feedback probability cap | Robustness/sensitivity test | Tests whether policies trained under ideal feedback remain stable when users are less responsive | Does not model delayed, strategic, inconsistent, or interface-driven human behaviour |
Figure 3 is especially worth placing in the correct category. It is not a second thesis about human psychology. It is a robustness test.
The authors train the RL agent with maximum feedback probability $p^{max}=1.0$, representing an ideal case where occupants always provide feedback under discomfort. During evaluation, they test caps from 0.50 up to 1.0. The lower caps simulate occupants who may not always provide feedback even when uncomfortable.
The result is broadly stable. Total cost remains relatively stable across caps, although the paper notes slight decreases at lower feedback frequency because reduced feedback can lower the measured discomfort cost. Temperature MAE remains stable across scenarios. S2, which has no occupancy information and relies most heavily on feedback, is the most sensitive, with maximum MAE fluctuation around 12%. S1, with perfect occupancy prediction, has less than 0.5% variation in MAE.
This is a useful but narrow result. It suggests the system does not collapse when feedback becomes less frequent. It does not prove that real users will provide clean, timely, honest, or consistent feedback. Humans are rarely so considerate to control theory.
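The cap itself is a small operation. The sketch below assumes it clips each raw feedback probability at $p^{max}$, which is one natural reading of "cap"; scaling the probabilities instead would be another, and the function names are hypothetical.

```python
from typing import Sequence


def capped_probability(raw_p: float, p_max: float) -> float:
    """Clip a raw feedback probability at the cap p_max."""
    if not 0.0 <= p_max <= 1.0:
        raise ValueError("p_max must be between 0 and 1")
    return min(raw_p, p_max)


def expected_feedback_events(raw_probs: Sequence[float], p_max: float) -> float:
    # Expected number of overrides over a run of uncomfortable timesteps.
    return sum(capped_probability(p, p_max) for p in raw_probs)


# Three uncomfortable timesteps with raw probabilities 0.2, 0.8 and 1.0
# yield about 2.0 expected overrides uncapped and about 1.2 with a 0.5 cap.
```

Fewer expected overrides means less measured discomfort for the same indoor temperatures, which is why total cost can drift down slightly as the cap falls.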
The career shift is from equipment operation to control-loop stewardship
The existing career story around AI and HVAC is often framed as replacement. That is dramatic, easy to sell, and mostly the wrong lesson.
This paper points to a different shift. HVAC careers become more valuable when professionals can govern adaptive systems rather than merely install equipment and adjust schedules.
The emerging work has several layers.
First, someone has to define the operating trade-off. The $\beta$ parameter is not a lab curiosity; it represents a management decision. A facility team must decide how aggressively the system should chase energy savings versus comfort stability. That decision will vary by building, tenant, weather season, contract structure, and reputational risk.
Second, someone has to design the feedback channel. A human override is only useful if it is captured cleanly and interpreted properly. If the interface is annoying, users will stop giving feedback. If it is too easy, they may generate noisy signals. If it is invisible, the system learns from silence, which is not always consent. Delightful little governance problem, that one.
Third, occupancy prediction becomes a facilities data problem. The paper uses ARAS residential activity data and constructs occupancy status from “Going Out” activity labels, aggregating the home as unoccupied only when all residents are away. Real commercial buildings will use badge data, motion sensors, booking systems, Wi-Fi association, access control, or privacy-preserving alternatives. Each source has gaps, biases, and legal implications.
Fourth, energy literacy becomes part of HVAC work. A controller that responds to wholesale market rates requires operators to understand tariffs, price pass-through, demand charges, and whether the building is actually exposed to the market signal being modelled. A site on a flat tariff cannot capture the same value as a site with dynamic pricing or demand-response incentives.
Finally, commissioning does not end at handover. A learning controller must be monitored. Has it learned the wrong preference? Is it responding to stale occupancy patterns? Are overrides falling because comfort improved, or because users gave up? This is where the “smart” building quietly becomes a management discipline rather than a software feature.
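One way to ask the "gave up or got comfortable" question in practice is to compare override rates with measured temperature violations across two periods. This is an operational sketch that is not drawn from the paper; the dictionary keys, the threshold, and the function name are all hypothetical.

```python
from typing import Mapping


def silent_discomfort_flag(prev: Mapping[str, float], curr: Mapping[str, float],
                           drop_ratio: float = 0.5) -> bool:
    """Flag periods where overrides fell sharply but comfort did not improve.

    prev and curr each hold 'override_rate' and 'violation_rate' for a period.
    """
    overrides_dropped = curr["override_rate"] <= prev["override_rate"] * drop_ratio
    comfort_not_better = curr["violation_rate"] >= prev["violation_rate"]
    return overrides_dropped and comfort_not_better
```

A raised flag does not prove users have given up. It does suggest someone should walk the floor before congratulating the controller.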
The business value is demand response without a full comfort-model circus
Cognaptus’ practical inference is straightforward: the value pathway is not “AI makes HVAC smart”. That sentence should be escorted out of the building.
The value pathway is:
- Use external signals such as weather and electricity prices where reliable feeds already exist.
- Use internal occupancy prediction only where it improves the comfort-cost trade-off enough to justify the privacy and integration burden.
- Use occupant overrides as low-friction feedback instead of building expensive predefined comfort models.
- Train a controller to reduce repeated overrides while shifting HVAC runtime away from expensive periods where possible.
- Monitor the system as an adaptive operating process, not a one-time installation.
The financial case will depend on tariff exposure, occupancy volatility, HVAC responsiveness, comfort tolerance, and the cost of integration. The paper does not provide a deployment ROI model, and it should not be treated as one. It does, however, identify a plausible architecture for buildings where energy cost and comfort cannot be handled by fixed schedules.
That architecture is especially relevant for facilities with variable occupancy and meaningful energy-price exposure: campuses, serviced offices, hotels, certain residential portfolios, and commercial buildings participating in demand-response schemes. It is less compelling for small sites with flat tariffs, predictable occupancy, and low automation maturity. In those cases, basic scheduling and maintenance may still beat elaborate AI theatre. Sad for the pitch deck, excellent for the budget.
Boundaries that matter before anyone buys this
The paper is worth reading because it is mechanism-rich. It is also bounded in ways that matter.
The experiment is simulation-based. Occupant feedback is simulated through a probabilistic model based on temperature deviation from a set comfort temperature. That is a reasonable experimental choice, but it is not the same as observing real occupants using a live feedback interface.
The dataset is narrow. Occupancy comes from 30 days of residential activity data, with 23 days used for training and 7 for testing. Weather data comes from Clayton, Melbourne, for May 2023. Wholesale price data comes from Victoria over the same period. This gives alignment across inputs, but not broad generalisation across climates, building types, seasons, or tariff designs.
The HVAC action space is simplified. The controller decides on or off, while the heating or cooling mode is preset. Many commercial systems involve variable air volume, staged equipment, chilled water loops, multi-zone constraints, humidity requirements, ventilation standards, and maintenance restrictions. Buildings are very good at making simple models feel personally attacked.
The optimisation benchmark has perfect foresight. That is useful as an upper reference, but not a fair picture of what most deployed optimisation systems can know. Meanwhile, the rule-based controller performs well on temperature metrics but ignores dynamic prices. So each benchmark is informative for a different reason.
The sensitivity test covers incomplete feedback, not the full weirdness of human feedback. Real users may delay responses, overuse controls, underreport discomfort, respond differently by time of day, or avoid feedback tools altogether. The authors explicitly identify delayed feedback and privacy-preserving occupancy data as future work. Those are not minor details; they are deployment questions.
Cool heads prevail when the loop is designed, not merely automated
The best way to read this paper is as a control-loop design proposal.
The occupant is not removed. The operator is not removed. The market is not ignored. The building is not assumed to be perfectly predictable. Instead, the system tries to assemble a workable loop: observe, predict, decide, receive feedback, update, and repeat.
That is less glamorous than “autonomous buildings”. It is also more believable.
For HVAC professionals, the career message is not that mechanical knowledge becomes obsolete. It is that mechanical knowledge becomes insufficient on its own. The valuable operator will understand thermal behaviour, sensor reliability, user feedback, occupancy modelling, tariff exposure, and AI policy monitoring. In other words, the job moves from keeping machines running to keeping a learning control system honest.
Cool heads prevail, yes. But mostly because someone still has to design the loop.
Cognaptus: Automate the Present, Incubate the Future.
[1] Xinyu Liang, Frits de Nijs, Buser Say, and Hao Wang, “Human-in-the-Loop AI for HVAC Management Enhancing Comfort and Energy Efficiency,” arXiv:2505.05796, 2025, https://arxiv.org/abs/2505.05796.