TL;DR for operators

EV charging optimization has a small, rude problem: the most important variable is often the one the operator does not know. A plugged-in car may leave in twenty minutes or three hours. That difference determines whether the controller can wait for cheap electricity or must charge immediately like an anxious intern with a deadline.

The paper by Gabriele, Pavirani, Karimi Madahi, and Develder proposes a decision-focused reinforcement learning framework for controlled EV charging under unknown departure times [1]. The system uses a forecaster to estimate the remaining session duration, but the twist is where the forecaster receives its discipline. It is not trained only to minimize prediction error. It is also trained through feedback from the charging policy, so its predictions become useful to the downstream decision.

The operational lesson is not “use RL for EV charging.” That sentence is now so broad it could power a conference booth by itself. The sharper point is this: when a forecast is used inside a controller, the right objective may not be statistical accuracy. The right objective may be avoiding expensive operational mistakes.

In the paper’s experiments, a conventional RL controller with an MSE-trained departure forecaster reduces unmet demand compared with RL that has no departure information. But decision-focused variants do better: the best mixed variant reduces unmet demand from 7.3% to 6.3%, reduces unsatisfied EVs from 33 to 21 out of 120 test sessions, and improves total reward from -151.00 to -142.85. Since reward is the negative of charging cost plus undercharging penalty, “better” means less negative, not magical revenue appearing from a charging socket.

The result has a clean business pathway. Charging networks, fleet depots, workplace chargers, and energy platforms should not judge predictive modules only by forecast error. They should also ask whether the forecast changes dispatch decisions in a way that improves cost, service completion, penalty exposure, and customer satisfaction. The boundary is equally clean: this is a simulated study based on real hospital charging data, not a multi-site production deployment with stochastic prices, charger congestion, feeder constraints, angry drivers, and procurement committees. One must not confuse an elegant mechanism with a finished operating model. Tempting, yes. Wrong, still wrong.

The missing variable is not time; it is freedom to wait

A charging controller is not deciding whether electricity is good. It is deciding when electricity is worth buying.

If the vehicle will stay parked until noon, the controller can wait for a cheaper price window. If the vehicle leaves at 9:00, waiting is not optimization; it is customer dissatisfaction with equations attached. Departure time is therefore not just a feature. It defines the feasible control space.

The paper frames EV charging as a Markov Decision Process. Each charging session has an arrival time, departure time, and requested energy. The controller observes measurable information such as electricity price, remaining energy demand, and time of day. It chooses a binary action at each 30-minute step: charge at full power or idle. In the experiments, charging power is fixed at $P_c = 6.5$ kW.

The reward function makes the trade-off explicit. Charging costs money at the current electricity price. Leaving the vehicle undercharged triggers a penalty: a fixed cost plus a squared penalty for unmet energy. In business language, the controller is balancing energy arbitrage against failed service delivery. Cheap electricity is lovely. A half-charged car at departure is less lovely. Sophisticated, I know.
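To make the trade-off concrete, here is a minimal Python sketch of a reward with this shape. The 30-minute step and the 6.5 kW power come from the setup described above. The function names, the price unit (€/kWh), and the constants `FIXED_PENALTY` and `QUADRATIC_WEIGHT` are illustrative assumptions, not values taken from the paper.

```python
STEP_HOURS = 0.5          # 30-minute control step (from the setup above)
CHARGE_POWER_KW = 6.5     # fixed charging power P_c (from the setup above)
FIXED_PENALTY = 1.0       # hypothetical fixed cost for any undercharged departure
QUADRATIC_WEIGHT = 0.1    # hypothetical weight on squared unmet energy


def step_reward(action: int, price_eur_per_kwh: float) -> float:
    """Reward for one step: minus the cost of the energy bought (action is 0 or 1)."""
    energy_kwh = action * CHARGE_POWER_KW * STEP_HOURS
    return -price_eur_per_kwh * energy_kwh


def departure_penalty(unmet_kwh: float) -> float:
    """Penalty at departure: zero if fully charged, else fixed plus quadratic."""
    if unmet_kwh <= 0:
        return 0.0
    return FIXED_PENALTY + QUADRATIC_WEIGHT * unmet_kwh ** 2


def session_reward(actions, prices, requested_kwh: float) -> float:
    """Total reward: the negative of (charging cost plus departure penalty)."""
    delivered = sum(a * CHARGE_POWER_KW * STEP_HOURS for a in actions)
    cost = -sum(step_reward(a, p) for a, p in zip(actions, prices))
    unmet = max(0.0, requested_kwh - delivered)
    return -(cost + departure_penalty(unmet))
```

Under these made-up constants, skipping an expensive step only pays off if the car is still there for a cheaper one. That is exactly the bet the controller cannot evaluate without a departure estimate.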

The departure time is the awkward part. In many real charging environments, operators do not know exactly when a user will unplug. That uncertainty matters because the controller’s state includes a remaining session duration estimate. Without it, the agent cannot distinguish “safe to wait” from “stop admiring the tariff curve and charge the car.”

The paper changes what the forecaster is paid to be right about

The obvious solution is to train a departure-time forecaster and feed its estimate into the RL policy. That is also the obvious trap.

A conventional forecaster is trained to reduce prediction error, typically with mean squared error. It wants the predicted session duration to be close to the true session duration. That sounds reasonable until the prediction enters a controller. In a control problem, not all errors have equal operational cost.

Overestimating departure time can be disastrous. If the forecaster predicts that the car will stay longer than it actually does, the controller may delay charging and miss the user’s energy requirement. Underestimating departure time may be statistically “wrong” but operationally safer, because it pushes the controller to charge earlier.

That is the paper’s core mechanism. The forecaster should not be a detached weather reporter. It is part of the control loop. Its job is not merely to describe the world. Its job is to help choose the action.

The proposed framework uses Soft Actor-Critic, an off-policy actor-critic RL method that optimizes expected return while encouraging exploration through policy entropy. The charging policy receives a state that includes the forecasted remaining session duration. The forecaster predicts total session duration at each timestep using inputs such as time information, energy already charged, and unmet energy. The remaining duration is then computed by subtracting elapsed time.
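The state construction can be sketched in a few lines of Python. The field names and the `ChargingState` container are hypothetical, and flooring the remaining duration at zero is an assumption of this sketch (the forecast can fall below the elapsed time), not something the paper is quoted as doing.

```python
from dataclasses import dataclass


@dataclass
class ChargingState:
    """What the policy sees at one timestep (field names are illustrative)."""
    price_eur_per_kwh: float
    remaining_demand_kwh: float
    hour_of_day: float
    remaining_duration_h: float  # derived from the forecast, not observed


def remaining_duration(predicted_total_h: float, elapsed_h: float) -> float:
    """Forecast total session length minus elapsed time, floored at zero."""
    return max(0.0, predicted_total_h - elapsed_h)


state = ChargingState(
    price_eur_per_kwh=0.20,
    remaining_demand_kwh=9.75,
    hour_of_day=9.0,
    remaining_duration_h=remaining_duration(predicted_total_h=3.0, elapsed_h=1.0),
)
```

Only the last field depends on the forecaster, which is why its training objective matters so much for the policy.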

The key training objective blends two losses:

$$ L_{\text{total}} = \beta L_R + (1-\beta)L_{DF} $$

$L_R$ is the normal regression loss. In the implementation, it is MSE against the true session duration. $L_{DF}$ is the decision-focused component, derived from the RL policy’s downstream objective. The parameter $\beta$ controls the blend. At $\beta = 1$, the system becomes conventional regression-based forecasting. At $\beta = 0$, the forecaster is trained only through the decision-focused signal.

This is not cosmetic. The blended loss changes what the forecast is allowed to care about. The model can learn predictions that are not merely close to the truth in isolation, but useful for the controller’s actual charging decisions.
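As a rough illustration of the blend, the Python sketch below computes $L_{\text{total}}$ from its two parts. The regression term is plain MSE, as in the text. The decision-focused term is treated as an opaque number handed over by the RL side; how the paper derives it from the policy objective is not reproduced here, and the function names are assumptions.

```python
def mse(predictions, targets) -> float:
    """Mean squared error between predicted and true session durations."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def blended_loss(beta: float, regression_loss: float, decision_loss: float) -> float:
    """L_total = beta * L_R + (1 - beta) * L_DF, with beta in [0, 1]."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * regression_loss + (1.0 - beta) * decision_loss


l_r = mse([3.0, 2.0], [2.0, 2.0])      # 0.5
l_df = 2.0                             # placeholder for the decision-focused signal
conventional = blended_loss(1.0, l_r, l_df)   # pure regression: equals l_r
pure_df = blended_loss(0.0, l_r, l_df)        # pure decision-focused: equals l_df
```

The two endpoints recover the $\beta = 1$ and $\beta = 0$ cases described above; everything in between is a weighted compromise.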

A slightly conservative forecast can be better than an accurate-looking delay

The paper’s most intuitive evidence is a single-session example. The figure showing it should be read as an illustration of the mechanism, not as the main aggregate proof.

In the example, the decision-focused forecaster with $\beta = 0.4$ predicts a slightly shorter session than reality: it expects departure around 11:10, while the true departure is around 11:20. That conservative estimate pushes the RL agent to begin charging at 8:30, and the EV completes charging before departure.

The conventional forecaster with $\beta = 1$ does the more dangerous thing. It overestimates the available session duration, predicting departure around 12:05. The controller waits too long, and the session ends undercharged.

This is the entire paper in miniature. A forecast can be “worse” in the ordinary statistical sense and better in the operational sense. The dashboard metric that wins the modeling contest may lose the customer.
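The example can be replayed with a toy stand-in for the controller. The Python sketch below uses a "charge as late as the forecast allows" rule, which is a deliberately simple heuristic and not the paper's SAC policy. The departure times come from the example above. The requested energy of 16.25 kWh is hypothetical, chosen so the conservative forecast reproduces the 8:30 start. The sketch also assumes the car arrived before charging starts and that a half-finished step delivers nothing.

```python
import math

STEP_MIN = 30                                  # control step length in minutes
ENERGY_PER_STEP_KWH = 6.5 * STEP_MIN / 60      # 3.25 kWh per full-power step


def hhmm(text: str) -> int:
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)


def latest_start_schedule(requested_kwh: float, predicted_departure: int) -> list[int]:
    """Start times of charging steps that finish as late as the forecast allows."""
    steps_needed = math.ceil(requested_kwh / ENERGY_PER_STEP_KWH)
    last_end = (predicted_departure // STEP_MIN) * STEP_MIN  # align to the step grid
    first_start = last_end - steps_needed * STEP_MIN
    return [first_start + i * STEP_MIN for i in range(steps_needed)]


def delivered_kwh(schedule: list[int], true_departure: int) -> float:
    """Energy from steps that complete before the car actually leaves."""
    completed = sum(1 for start in schedule if start + STEP_MIN <= true_departure)
    return completed * ENERGY_PER_STEP_KWH


REQUESTED_KWH = 16.25          # hypothetical request: exactly five charging steps
TRUE_DEPARTURE = hhmm("11:20")

for label, forecast in [("beta=0.4", "11:10"), ("beta=1", "12:05")]:
    schedule = latest_start_schedule(REQUESTED_KWH, hhmm(forecast))
    unmet = REQUESTED_KWH - delivered_kwh(schedule, TRUE_DEPARTURE)
    print(label, "starts at minute", schedule[0], "unmet kWh:", unmet)
```

With these assumptions, the 11:10 forecast starts charging at minute 510 (8:30) and delivers everything. The 12:05 forecast starts at minute 570 (9:30) and leaves 6.5 kWh unmet when the car departs at 11:20. Ten minutes of pessimism cost nothing; forty-five minutes of optimism cost two charging steps.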

There is a broader AI systems lesson here. Predictive accuracy is not the same as decision quality when predictions are consumed by downstream policies. That is true in EV charging, but also in inventory allocation, staffing, loan servicing, preventive maintenance, supply-chain routing, claims triage, and any other system where a model output becomes an action. The model is not writing poetry. It is moving resources.

The main experiment tests whether decision-shaped prediction beats accuracy-shaped prediction

The experiment is compact. That is both useful and limiting.

The authors use a real-world dataset of EV charging sessions from a hospital public parking facility, extracting arrival times, session durations, and charging requirements. From the 20 most frequent EV users, they model EV features using statistical distributions and generate simulation samples. The training dataset contains 350 sessions, and the test set contains $N = 120$ sessions. Results are averaged over four experiment runs.

The setup assumes a fixed charging power of 6.5 kW and a single deterministic time-varying price profile. The price profile is part of the implementation environment, not a general proof about electricity markets. It gives the controller a reason to delay charging when prices are low later, but it does not test market volatility, demand charges, feeder constraints, or stochastic wholesale pricing.

The baselines are well chosen for the paper’s question:

| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| BAU immediate charging | Operational reference point | Shows the cost of always charging immediately | Does not represent optimized control |
| RL with no forecast | Missing-information baseline | Shows what happens when the policy lacks departure-time information | Does not isolate forecasting quality |
| RL with MSE forecaster, $\beta = 1$ | Conventional forecast-plus-control baseline | Tests whether ordinary prediction helps RL | Does not test decision-focused training |
| RL with real departure | Oracle-style upper reference | Shows the value of perfect departure information | Not deployable unless users reveal departure time accurately |
| DF-RL with $\beta \in \{0, 0.2, 0.4, 0.6, 0.8\}$ | Sensitivity and ablation over loss weighting | Shows the effect of blending regression and decision-focused learning | Does not establish a universally optimal $\beta$ |
| Single-session figure | Mechanism illustration | Explains why a conservative forecast can improve charging completion | Not aggregate evidence by itself |
| Appendix price profile | Implementation detail | Defines the tariff environment used in the experiment | Does not test price-profile robustness |

That last column matters. The paper is not claiming the method is production-ready across every charging market. It is testing a mechanism: if a forecaster is trained with downstream control feedback, can it produce better charging actions than the same basic controller using a conventionally trained forecast?

The answer, in this setup, is yes.

BAU is expensive, no-forecast RL is cheap, and both are incomplete answers

The results table is more interesting than a simple “method beats baseline” story.

Business-as-usual immediate charging produces zero unmet demand and zero unsatisfied EVs, but total charging cost is €168.43. This is the brute-force service-quality strategy: charge immediately, avoid customer pain, ignore arbitrage opportunities. Reliable, but not exactly Mensa-level energy management.

RL without departure forecasts cuts charging cost to €60.94 on average, but creates 33 unsatisfied EVs and 14.7% unmet demand. It saves money by waiting, but it does not know when waiting becomes stupid. Its total reward is -166.88, barely better than BAU’s -168.43, because penalties eat much of the charging-cost gain.

RL with a conventional forecaster improves the picture. Charging cost rises to €93.42, unmet demand falls to 7.3%, penalty cost falls from 105.94 to 57.58, and total reward improves to -151.00. So, yes, departure-time forecasting helps. The paper is not arguing that prediction is useless. It is arguing that prediction should be trained for the job it is actually doing.

The real-departure baseline shows the ceiling created by information quality. With true departure time, the RL controller reaches €102.17 in charging cost, only 5 unsatisfied EVs, 0.6% unmet demand, and total reward of -115.48. That gap is important. Decision-focused forecasting improves the conventional forecast baseline, but it does not magically remove uncertainty. Reality remains inconvenient. A lesser publication might have tried to hide that. This one leaves the oracle in the table, which is useful.
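The arithmetic behind these totals is easy to check. The short Python sketch below recomputes total reward as the negative of charging cost plus penalty, using only figures quoted above. Two values are inferred rather than quoted: BAU's zero penalty (it has zero unmet demand and its reward equals minus its cost) and the oracle's penalty, which is backed out from its reward and cost.

```python
# Charging cost (EUR), penalty cost, and total reward as quoted in the text above.
# BAU's penalty of 0.00 is inferred from its zero unmet demand.
REPORTED = {
    "BAU immediate charging": (168.43, 0.00, -168.43),
    "RL, no forecast": (60.94, 105.94, -166.88),
    "RL, MSE forecaster": (93.42, 57.58, -151.00),
}


def total_reward(charging_cost: float, penalty_cost: float) -> float:
    """Reward is the negative of everything the operator pays."""
    return -(charging_cost + penalty_cost)


def implied_penalty(reward: float, charging_cost: float) -> float:
    """Back out the penalty when only reward and charging cost are quoted."""
    return -reward - charging_cost


for name, (cost, penalty, reward) in REPORTED.items():
    assert abs(total_reward(cost, penalty) - reward) < 0.01, name

oracle_penalty = implied_penalty(-115.48, 102.17)
print(f"implied oracle penalty: {oracle_penalty:.2f}")
```

The implied oracle penalty of about 13.31 is a derived figure, not one quoted from the paper. It suggests that perfect departure information spends more on energy than any forecast-based variant and gets most of that back in avoided penalties.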

The best DF-RL variants improve service without reverting to panic charging

The decision-focused variants with a blended loss, where $0 < \beta < 1$, are the main result.

The best-performing mixed variant in user-focused terms is $\beta = 0.2$. The table compares it with the conventional forecaster and the other mixed variants:

| Method | Charging cost (€) | Unsatisfied EVs | Unmet demand (%) | Total reward | Penalty cost |
|---|---|---|---|---|---|
| RL with conventional forecaster, $\beta = 1$ | 93.42 ± 14.06 | 33 ± 11.67 | 7.3 ± 5.2 | -151.00 ± 22.20 | 57.58 ± 36.26 |
| DF-RL, $\beta = 0.2$ | 93.79 ± 12.36 | 21 ± 8.14 | 6.3 ± 4.3 | -142.85 ± 6.59 | 51.16 ± 16.87 |
| DF-RL, $\beta = 0.4$ | 91.90 ± 14.46 | 24 ± 13.32 | 6.6 ± 5.6 | -142.85 ± 14.02 | 50.96 ± 26.04 |
| DF-RL, $\beta = 0.6$ | 94.69 ± 15.45 | 24 ± 12.42 | 6.6 ± 5.6 | -142.90 ± 12.61 | 48.21 ± 27.73 |
| DF-RL, $\beta = 0.8$ | 92.62 ± 14.36 | 27 ± 10.26 | 7.3 ± 4.0 | -144.14 ± 6.82 | 51.52 ± 20.99 |

Against the conventional forecaster, $\beta = 0.2$ reduces unsatisfied EVs from 33 to 21 and unmet demand from 7.3% to 6.3%. That is not a giant absolute movement in unmet demand, but it is operationally meaningful because the same controller is now failing fewer sessions while keeping charging cost almost unchanged. It is not buying service quality by simply charging everything immediately.

Total reward improves from -151.00 to -142.85. Since the reward is the negative of charging cost plus penalties, this is a reduction in the combined pain of energy expense and undercharging penalties. The authors report this as a 5% reward improvement compared with conventional forecasting, and about 14% less total unmet energy.
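Those percentages can be reproduced from the rounded figures quoted here. The Python sketch below uses the unmet-demand percentages as a stand-in for total unmet energy, which is an assumption of the sketch; the paper may compute its figure from raw energy totals.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction from baseline to improved (both positive magnitudes)."""
    return (baseline - improved) / baseline


# Reward magnitudes: smaller is better, because reward is negative cost.
reward_gain = relative_reduction(151.00, 142.85)
unmet_gain = relative_reduction(7.3, 6.3)
unsatisfied_gain = relative_reduction(33, 21)

print(f"reward: {reward_gain:.1%}")            # about 5.4%
print(f"unmet demand: {unmet_gain:.1%}")       # about 13.7%
print(f"unsatisfied EVs: {unsatisfied_gain:.1%}")  # about 36.4%
```

The first two land on the reported 5% and roughly 14%. The third is not quoted by the authors here, but it is the number an operator will probably notice first: about a third fewer drivers leaving undercharged.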

The result is also not monotonic in the most naïve way. Higher $\beta$ gives more weight to the regression term. Lower $\beta$ gives more weight to decision-focused feedback. In the table, $\beta = 0.2$ has fewer unsatisfied EVs than $\beta = 0.8$ (21 versus 27), while $\beta = 0.8$ lands at the conventional forecaster’s 7.3% unmet demand with a similar charging cost. The useful region is not “throw away accuracy.” It is “blend accuracy with decision consequence.”

This distinction is not academic fussiness. It is the difference between decision-focused learning and metric vandalism.

The pure decision-focused ablation is the warning label

The $\beta = 0$ result is the paper’s quiet warning, and it deserves attention.

At $\beta = 0$, the forecaster ignores regression accuracy entirely and learns only through the decision-focused objective. That variant performs badly: charging cost falls to €55.70, but unsatisfied EVs rise to 48, unmet demand jumps to 24.9%, penalty cost reaches 121.76, and total reward falls to -177.46. It is worse than the no-forecast RL baseline on unmet demand and total reward.

The authors explain this as a training issue: without forecaster-loss guidance, the forecaster and RL agent still converge but need more exploration and additional training iterations. That interpretation is plausible, but the operational reading is simpler.

A forecast used in control still needs grounding. Decision feedback can reshape the forecast toward usefulness, but if the model no longer has any direct pressure to remain connected to the target variable, the policy may receive unstable or misleading signals. The controller cannot build a reliable charging policy on vibes, even if those vibes are end-to-end differentiable.

For businesses, this is a valuable anti-hype result. Decision-focused learning is not permission to discard predictive accuracy. It is permission to demote predictive accuracy from supreme ruler to one member of the cabinet. A modest constitutional monarchy, perhaps.

The business value is fewer failed service outcomes, not prettier forecasts

The immediate business interpretation is straightforward: forecast modules should be evaluated by downstream service outcomes.

For an EV charging operator, the relevant metrics are not only departure-time MSE. They include:

| Operational question | Paper metric proxy | Business interpretation |
|---|---|---|
| Did the customer leave undercharged? | Unsatisfied EVs | Service failure count |
| How much requested energy was missed? | Unmet demand (%) | Severity of failure |
| How expensive was the charging plan? | Charging cost (€) | Energy procurement exposure |
| Did the controller balance cost and service? | Total reward | Combined objective quality |
| How much penalty came from undercharging? | Penalty cost | Implied dissatisfaction or contractual loss |

That mapping is more important than the particular hospital dataset. The paper’s broader value is architectural: when a model feeds a controller, the model’s training objective should reflect the controller’s consequences.

In production, this suggests a practical design pattern:

  1. Define the operational failure explicitly. For charging, it is unmet energy at departure. For inventory, it may be stockout. For staffing, it may be uncovered demand. For credit operations, it may be missed intervention windows.

  2. Train the predictor with a hybrid objective. Keep a conventional regression or classification loss, but add a decision-aware component tied to downstream reward, penalty, constraint violation, or service outcome.

  3. Evaluate the combined system, not the model in isolation. A forecast with slightly worse MSE may be better if it reduces the costly class of errors.

  4. Sweep the loss-weighting parameter. The $\beta = 0$ ablation makes clear that end-to-end cleverness can become end-to-end nonsense if the predictor loses grounding.

  5. Keep an oracle or perfect-information baseline where possible. The real-departure baseline shows how much value remains trapped in information uncertainty. That is useful for deciding whether to improve modeling, change user interfaces, or simply ask drivers for departure estimates.
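Steps 3 and 4 of the pattern above can be sketched as a small selection routine in Python. The per-$\beta$ figures are the averages quoted in this article. The selection rule itself (a cap on unmet demand, then best total reward, then fewest unsatisfied EVs) and the 7.0% threshold are hypothetical choices for illustration, not the paper's procedure.

```python
from typing import Optional

# beta -> (unsatisfied EVs, unmet demand %, total reward), averages quoted above.
SWEEP = {
    0.0: (48, 24.9, -177.46),
    0.2: (21, 6.3, -142.85),
    0.4: (24, 6.6, -142.85),
    0.6: (24, 6.6, -142.90),
    0.8: (27, 7.3, -144.14),
    1.0: (33, 7.3, -151.00),
}


def pick_beta(sweep: dict, max_unmet_pct: float) -> Optional[float]:
    """Best beta under a service constraint, or None if nothing qualifies."""
    feasible = {b: r for b, r in sweep.items() if r[1] <= max_unmet_pct}
    if not feasible:
        return None
    # Higher total reward wins; ties go to the variant failing fewer sessions.
    return max(feasible, key=lambda b: (feasible[b][2], -feasible[b][0]))


chosen = pick_beta(SWEEP, max_unmet_pct=7.0)   # hypothetical service threshold
```

With these inputs the routine settles on $\beta = 0.2$, breaking the reward tie with $\beta = 0.4$ on service failures. Tightening the cap below anything the sweep achieved returns `None`, which is the routine's polite way of suggesting a better forecaster or a better user interface.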

The last point is not trivial. If the gap between decision-focused forecasting and real departure remains large, a business might gain more from better data collection than from a fancier algorithm. Sometimes the best model is a user prompt with a default departure time and a reminder. Deep learning hates this sort of sentence, but operations people should enjoy it.

What the paper directly shows, what Cognaptus infers, and what remains uncertain

The paper directly shows that, in its simulated charging environment, blended decision-focused RL variants can improve downstream control performance relative to RL with a conventionally trained departure-time forecaster. It also shows that departure-time information matters: the real-departure baseline is much better than all forecast-based methods, and the conventional forecaster improves unmet demand relative to no forecast.

Cognaptus infers a broader design lesson: forecasting should be treated as part of the decision system, not as an isolated modeling artifact. If a forecast is consumed by a policy, optimizer, or human workflow, then the evaluation protocol should include downstream decision quality. This is the difference between building a model that wins a leaderboard and building a model that avoids annoying customers in a parking lot.

The uncertainty is mostly about scale and realism. The study uses 20 frequent users from one hospital charging context, generated simulation samples, a 350-session training set, a 120-session test set, four runs, fixed 6.5 kW charging power, and one deterministic price profile. It does not test heterogeneous charger power, queues, multiple simultaneous vehicles competing for constrained capacity, stochastic tariffs, demand charges, distribution-grid constraints, user-entered departure preferences, or adversarial user behavior. It also does not provide a full production architecture for monitoring drift, retraining the forecaster, or handling rare user patterns.

Those are not fatal limitations. They are scope boundaries. A bounded mechanism can still be useful when read correctly. The mistake would be to treat a five-page conference paper as if it had quietly solved EV infrastructure operations while nobody was looking.

The useful forecast is the one that changes the action in the right direction

The paper’s title says “forecasting what matters,” and for once the title is not merely doing decorative labor.

What matters is not whether the forecaster minimizes duration error in a vacuum. What matters is whether its output causes the charging controller to make better decisions under uncertainty. In the paper’s EV setting, that means charging early enough to avoid unmet demand while still exploiting price variation where possible.

The single-session example captures the point cleanly. The decision-focused forecast is slightly conservative. It predicts an earlier departure than reality. That small “wrongness” gives the controller a better action. The conventional forecast is more relaxed, predicts too much available time, and the car leaves undercharged. Congratulations: the forecast was perhaps more comfortable. The user was not.

For operators, the lesson is not to prefer inaccurate forecasts. The lesson is to price the errors correctly. Overestimating available time is worse than underestimating it when the penalty is a failed charging session. A good training objective should know that.
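One blunt way to see what "pricing the errors" means is a hand-weighted asymmetric loss, sketched below in Python. This is a simpler stand-in, not the paper's decision-focused loss: the paper lets the controller's feedback shape the forecaster, whereas this sketch hard-codes the asymmetry. The function name and the weight `over_weight=3.0` are hypothetical.

```python
def asymmetric_squared_error(predicted_h: float, true_h: float,
                             over_weight: float = 3.0) -> float:
    """Squared error that charges extra when the session is predicted too long."""
    error = predicted_h - true_h
    weight = over_weight if error > 0 else 1.0   # overestimates risk undercharging
    return weight * error ** 2


too_short = asymmetric_squared_error(predicted_h=2.0, true_h=3.0)  # 1.0
too_long = asymmetric_squared_error(predicted_h=4.0, true_h=3.0)   # 3.0
```

Both forecasts miss by one hour, and plain MSE would score them identically. The asymmetric version charges three times as much for the one that would tempt the controller to wait.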

This is where decision-focused learning earns its keep. It turns forecasting from a descriptive exercise into a control-aware component. The model is no longer asked only, “What will happen?” It is asked, “What prediction helps the system choose the least regrettable action?”

In enterprise AI, that question is usually the adult one.

Cognaptus: Automate the Present, Incubate the Future.


  1. Giuseppe Gabriele, Fabio Pavirani, Seyed Soroush Karimi Madahi, and Chris Develder, “Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times,” arXiv:2606.19199v1, 2026, https://arxiv.org/abs/2606.19199