Molding the Future: How DRL is Revolutionizing Process Optimization

TL;DR for operators

Factory optimisation usually begins with a polite fiction: if the process makes good parts, the business must be doing well. Injection molding knows better. A technically acceptable part can still be produced at the wrong pressure, at the wrong cycle time, during the wrong electricity tariff window, with just enough mold wear to make the accountant quietly unhappy.

The paper behind this article, DRL-Based Injection Molding Process Parameter Optimization for Adaptive and Profitable Production, proposes a deep reinforcement learning framework that treats injection molding as an economic control problem rather than a purity contest for product quality.¹ The system learns to adjust ten process parameters while observing environmental variables and time-of-use electricity prices. Its reward function includes sale revenue, resin cost, mold wear, electricity cost, product quality, and cycle time. That is the important move. The algorithmic choices—SAC and PPO—matter, but the mechanism is the reward design.

The headline result is not that DRL humiliates genetic algorithms. It does not. The genetic algorithm often earns slightly more profit in the paper’s virtual deployment tests. The practical result is sharper: DRL gets very close to GA-level profit while making decisions fast enough for real production use. In a fixed-condition comparison, GA takes 21.201 seconds, while SAC and PPO take 0.421 seconds and 0.287 seconds respectively after offline training. In the 24-hour seasonal deployment tests, GA still produces the highest profit, but requires 781.0 minutes of computation in the spring scenario, compared with 15.5 minutes for SAC and 10.4 minutes for PPO.

For operators, the business value is not “AI magic in molding.” Please, the machines have suffered enough. The value is a deployable margin controller: a system that can adjust pressure, speed, position, and hold time as temperature, humidity, cycle time, defect risk, electricity tariffs, and mold-wear economics shift. The strongest use case is high-volume, low-margin production where a small per-cycle difference becomes meaningful only because the cycle repeats thousands of times.

The boundary is equally important. The paper validates the framework through surrogate-model-based virtual deployment, not a full closed-loop production rollout. The dataset contains 2,794 samples from one instrumented injection molding testbed producing one ABS cosmetic container cap. The data are not public. The electricity tariff assumptions are based on South Korean seasonal time-of-use pricing. The method is promising, but not yet a universal injection molding autopilot with a lab coat and delusions of grandeur.

The real innovation is the accounting function, not the neural network

Most AI process-control papers orbit the same planet: reduce defects, predict quality, stabilise output. That is useful, but incomplete. In real manufacturing, the process engineer is not paid to make quality scores glow aesthetically on a dashboard. The job is to make acceptable products, repeatedly, profitably, and without burning margin through material waste, excessive cycle time, electricity costs, or avoidable tool wear.

This paper’s key contribution is to place those economic variables inside the control objective. The authors define profit per production cycle as revenue from good cavities minus resin cost, mold cost, and electricity cost. There are four cavities per cycle. Each good cavity is assigned a unit price of $0.20. Resin cost is fixed at $0.04 per cavity. Mold cost increases when maximum injection pressure crosses 140 bar. Electricity cost depends on maximum pressure and time-of-use tariff.

That turns a familiar process-control question into a more uncomfortable one:

Should the machine use a setting that protects quality if it also increases pressure, cycle time, electricity cost, and mold wear more than the product margin can justify?

This is exactly where conventional quality-first optimisation becomes too narrow. A process setting can be technically “good” and economically mediocre. The paper’s reinforcement learning agent is rewarded not merely for avoiding defects, but for producing profitable output over a 10-minute production interval. The reward scales profit by the number of cycles that fit into that interval, so cycle time becomes economically visible rather than a secondary metric politely waiting outside the model.

The authors formulate the reward approximately as:

$$ r_t = \frac{600}{T} \times \left[\text{revenue from good cavities} - \text{resin cost} - \text{mold cost} - \text{electricity cost}\right] $$

Here, $T$ is cycle time in seconds. The $600/T$ term converts per-cycle economics into a 10-minute production reward: a 40-second cycle, for example, fits 15 cycles into the interval. This is not decorative math. It is the bridge from “make a good part” to “make enough good parts, at a cost that still leaves the margin with the business rather than with the electricity provider.”
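To make the reward concrete, here is a minimal Python sketch of the formula. The unit price, resin cost, cavity count, and 140 bar threshold come from the figures quoted above. The shapes of `mold_cost` and `electricity_cost`, their coefficients, and the choice to charge resin on all four cavities are assumptions made for illustration, not the paper's actual cost models.

```python
from dataclasses import dataclass

CAVITIES_PER_CYCLE = 4          # from the article
UNIT_PRICE = 0.20               # $ per good cavity, from the article
RESIN_COST = 0.04               # $ per cavity, from the article
PRESSURE_THRESHOLD_BAR = 140.0  # mold cost rises above this, from the article
INTERVAL_SECONDS = 600.0        # the 10-minute reward window


@dataclass
class CycleOutcome:
    good_cavities: int       # 0..4 good parts in this cycle
    cycle_time_s: float      # T in the formula
    max_pressure_bar: float  # maximum injection pressure
    tariff_per_kwh: float    # $/kWh for the current tariff period


def mold_cost(max_pressure_bar: float, penalty_per_bar: float = 0.001) -> float:
    # ASSUMPTION: linear penalty above the threshold; the real form may differ.
    return penalty_per_bar * max(0.0, max_pressure_bar - PRESSURE_THRESHOLD_BAR)


def electricity_cost(max_pressure_bar: float, tariff_per_kwh: float,
                     kwh_per_bar: float = 0.001) -> float:
    # ASSUMPTION: energy per cycle scales linearly with maximum pressure.
    return tariff_per_kwh * kwh_per_bar * max_pressure_bar


def reward(o: CycleOutcome) -> float:
    revenue = UNIT_PRICE * o.good_cavities
    resin = RESIN_COST * CAVITIES_PER_CYCLE  # ASSUMPTION: resin is spent on every cavity
    per_cycle = (revenue - resin
                 - mold_cost(o.max_pressure_bar)
                 - electricity_cost(o.max_pressure_bar, o.tariff_per_kwh))
    # Scale per-cycle profit by the number of cycles that fit in 10 minutes.
    return (INTERVAL_SECONDS / o.cycle_time_s) * per_cycle
```

Even with placeholder coefficients, the structure shows why cycle time and pressure cannot hide: both appear directly in the number the agent is paid in.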

The electricity component is also unusually operational. The paper uses South Korean time-of-use electricity prices across spring/fall, summer, and winter. Summer on-peak power is especially expensive at $0.2345/kWh, compared with $0.0995/kWh off-peak. Winter on-peak is $0.2101/kWh, while spring/fall on-peak is $0.1527/kWh. These tariffs are not background decoration; they become part of the state the agent observes and part of the cost it tries to manage.

That makes the agent less like a static parameter recommender and more like a process economist with a pressure gauge.

The agent sees process settings, shop-floor weather, and the price of electricity

The decision problem is framed as a Markov Decision Process. The state includes three categories of information: current process parameters, environmental variables, and electricity price state.

The process parameters are ten controllable settings: three injection speeds, three injection pressures, three injection positions, and one hold time. The environmental variables are machine temperature, machine humidity, factory temperature, and factory humidity. Electricity pricing is encoded as a nine-dimensional one-hot vector representing the combination of season and tariff period.
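A small sketch of how such a state could be assembled, assuming plain concatenation. The dimensions (10 + 4 + 9) follow the description above. The period names, the ordering of the one-hot slots, and the absence of any normalisation are assumptions for illustration.

```python
from typing import List, Sequence

SEASONS = ("spring_fall", "summer", "winter")
PERIODS = ("off_peak", "mid_peak", "on_peak")  # ASSUMPTION: hypothetical period names


def tariff_one_hot(season: str, period: str) -> List[float]:
    # One slot per (season, period) pair: 3 x 3 = 9 dimensions.
    index = SEASONS.index(season) * len(PERIODS) + PERIODS.index(period)
    vec = [0.0] * (len(SEASONS) * len(PERIODS))
    vec[index] = 1.0
    return vec


def build_state(process_params: Sequence[float],
                environment: Sequence[float],
                season: str,
                period: str) -> List[float]:
    if len(process_params) != 10:
        raise ValueError("expected 10 process parameters")
    if len(environment) != 4:
        raise ValueError("expected 4 environmental variables")
    # ASSUMPTION: simple concatenation with no scaling.
    return list(process_params) + list(environment) + tariff_one_hot(season, period)


state = build_state([0.0] * 10, [25.0, 40.0, 22.0, 45.0], "summer", "on_peak")
print(len(state))
```

The one-hot encoding keeps the tariff categorical: the agent learns a response per season and period rather than assuming prices sit on a smooth scale.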

The action is a continuous adjustment to the ten process parameters. This matters because injection molding control is not naturally a menu of discrete choices. The operator is adjusting a continuous process surface, not selecting “Option B: profit, medium rare.” The agent’s actions are rescaled into practical parameter adjustments and clipped within operational bounds.
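The rescale-and-clip step can be sketched in a few lines. The assumption here is that the policy emits values in [-1, 1] per parameter, which are multiplied by a per-parameter step size and then clipped to operating bounds. The function name, the step sizes, and the bounds in the usage line are hypothetical.

```python
from typing import List, Sequence


def apply_action(params: Sequence[float],
                 action: Sequence[float],
                 step_sizes: Sequence[float],
                 lower: Sequence[float],
                 upper: Sequence[float]) -> List[float]:
    updated = []
    for p, a, step, lo, hi in zip(params, action, step_sizes, lower, upper):
        a = max(-1.0, min(1.0, a))         # ASSUMPTION: raw policy output lives in [-1, 1]
        candidate = p + a * step           # rescale into physical units
        updated.append(max(lo, min(hi, candidate)))  # clip to operational bounds
    return updated


# Hypothetical single-parameter example: a pressure setting bounded at 110 bar.
print(apply_action([100.0], [1.0], [20.0], [50.0], [110.0]))
```

Clipping is what keeps an enthusiastic policy from asking the machine for settings outside the range the data ever covered.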

The architecture has two phases. First, offline training. The agent interacts not with a physical machine, but with a virtual environment built from surrogate models. Second, deployment. The trained policy receives current process, environment, and tariff inputs, then outputs parameter adjustments quickly enough to be usable.

This offline/online split is essential. Training a reinforcement learning system directly on a molding machine would be expensive, slow, and occasionally theatrical in the bad sense. So the authors use real production testbed data to build surrogate models that simulate quality and cycle time. The DRL agent learns inside that surrogate world before being evaluated in virtual deployment scenarios.
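As a rough picture of that surrogate world, the sketch below wires two stand-in models into a Gym-style loop without depending on any RL library. The class name, the callable signatures, and the idea that the quality model returns a count of good cavities are assumptions for illustration. Environmental variables and tariffs are left out to keep it short.

```python
from typing import Callable, List, Sequence, Tuple


class SurrogateMoldingEnv:
    """Gym-style environment backed by surrogate models (illustrative only)."""

    STEPS_PER_EPISODE = 144  # 24 hours of ten-minute steps

    def __init__(self,
                 quality_model: Callable[[Sequence[float]], int],
                 cycle_time_model: Callable[[Sequence[float]], float],
                 profit_fn: Callable[[int, float, Sequence[float]], float],
                 initial_params: Sequence[float]) -> None:
        self.quality_model = quality_model
        self.cycle_time_model = cycle_time_model
        self.profit_fn = profit_fn
        self.initial_params = list(initial_params)
        self.params: List[float] = list(initial_params)
        self.t = 0

    def reset(self) -> List[float]:
        self.t = 0
        self.params = list(self.initial_params)
        return list(self.params)

    def step(self, action: Sequence[float]) -> Tuple[List[float], float, bool]:
        # Apply the adjustment; no physical machine is involved.
        self.params = [p + a for p, a in zip(self.params, action)]
        good = self.quality_model(self.params)          # surrogate: good cavities
        cycle_time = self.cycle_time_model(self.params)  # surrogate: seconds per cycle
        reward = self.profit_fn(good, cycle_time, self.params)
        self.t += 1
        done = self.t >= self.STEPS_PER_EPISODE
        return list(self.params), reward, done


# Stand-in surrogates so the sketch runs; real ones would be trained models.
env = SurrogateMoldingEnv(
    quality_model=lambda p: 4,
    cycle_time_model=lambda p: 40.0,
    profit_fn=lambda good, t, p: (600.0 / t) * (0.20 * good - 0.04 * 4),
    initial_params=[0.0] * 10,
)
state = env.reset()
total, done = 0.0, False
while not done:
    state, reward, done = env.step([0.0] * 10)
    total += reward
print(f"steps: {env.t}, episode reward: {total:.2f}")
```

Everything the agent experiences passes through those two callables, which is why the quality of the surrogates sets the ceiling on the quality of the policy.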

| Component | What the paper directly builds | Operational meaning | Main boundary |
| --- | --- | --- | --- |
| Profit reward | Revenue minus resin, mold, and electricity costs, scaled by cycle time | Optimises margin per production interval, not just part acceptability | Cost assumptions are study-specific |
| State | Process parameters, temperature, humidity, electricity price category | Allows adaptation to shop-floor and tariff variation | Environmental scenarios remain within collected-data bounds |
| Action | Continuous adjustment of 10 process parameters | Fits real control variables better than discrete action menus | Clipping and step size affect convergence |
| Surrogate environment | Quality classifier and cycle-time regressor | Enables offline DRL training without live machine experimentation | Surrogate fidelity governs deployment credibility |
| Deployment policy | Deterministic action selection from trained SAC/PPO policies | Gives repeatable decisions for identical states | Tested in virtual deployment, not full physical closed loop |

The point is not that the model “understands” injection molding like a veteran technician. It does not need to. The claim is narrower and more testable: given a learned approximation of how settings affect quality and cycle time, and given a profit function that captures key operating costs, the agent can learn adjustments that preserve quality while improving economic performance under changing conditions.

That is a much better claim than industrial AI usually gives us. It is smaller, less glamorous, and therefore more useful.

The surrogate models are the factory where the agent learns

The paper uses 2,794 samples collected from a fully instrumented injection molding testbed. The product is a circular cosmetic container cap made from ABS. Each sample includes ten controllable process parameters, four environmental variables, and a binary quality label assigned through automated visual inspection.

The process parameters were varied using an L81 orthogonal array design. That detail is not just methodological housekeeping. It means the dataset was designed to cover a broad parameter space rather than merely recording whatever the machine happened to do on a Tuesday afternoon when everyone was tired.

Two surrogate models are then used inside the DRL environment. The quality classifier predicts whether the product is good or defective. That classifier comes from the authors’ prior work. The cycle-time regressor is developed in this paper, because cycle time becomes necessary once the reward function is profit per production interval rather than quality alone.

For cycle-time prediction, the authors compare 18 machine-learning models using PyCaret and 10-fold cross-validation. LightGBM performs best, with RMSE 0.1468, MAE 0.0632, and $R^2 = 0.9743$. Random Forest and Extra Trees are close behind with $R^2$ values of 0.9719 and 0.9710. Linear models perform far worse, which is not shocking; injection molding is not famous for politely linear relationships.
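For readers who want the three metrics pinned down, here is a dependency-free sketch of RMSE, MAE, and $R^2$, plus one simple way to split 2,794 samples into 10 folds. The authors used PyCaret; this is not their pipeline, and the interleaved fold assignment is an assumption chosen for brevity.

```python
import math
from typing import List, Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((a - mean) ** 2 for a in y_true)               # total variation
    return 1.0 - ss_res / ss_tot


def kfold_indices(n_samples: int, k: int = 10) -> List[List[int]]:
    # ASSUMPTION: interleaved assignment; libraries usually shuffle first.
    return [list(range(i, n_samples, k)) for i in range(k)]


folds = kfold_indices(2794, 10)
print([len(f) for f in folds])
```

Each fold takes a turn as the held-out set while the model trains on the other nine, so the reported scores are averages over predictions on unseen samples.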

The purpose of this surrogate-model test is implementation support, not the paper’s main economic evidence. It establishes that the virtual environment has a plausible cycle-time model, which is necessary before DRL training can mean anything. It does not prove that the final DRL policy will perform identically on a physical production line.

That distinction matters. A strong surrogate is not a factory. It is a carefully trained stand-in. Useful, yes. Omniscient, no.

SAC and PPO learn the same economic game in different styles

The authors train two actor-critic reinforcement learning algorithms: Soft Actor-Critic and Proximal Policy Optimization. SAC is off-policy and can reuse past experience through replay, which tends to help sample efficiency. PPO is on-policy and updates from recently collected trajectories, which often gives stable training but can require more samples and may explore less broadly.

In training, each episode represents a 24-hour period divided into 144 ten-minute steps. The agents train for 1,250 episodes, or 180,000 steps. The virtual environmental scenarios include temperature and humidity variation, with electricity price fluctuations introduced over time. Training is conducted on an Intel Core i9-13900K CPU using an OpenAI Gym-based framework.

The training curves show both algorithms converging toward an average reward of roughly 6.3. SAC converges slightly faster and reduces defective cavities faster. PPO, however, performs slightly better in a one-time adjustment evaluation after convergence, with an average profit of 958.99 versus 953.76 for SAC in the predefined spring/fall scenario.

This is a useful reminder that algorithm rankings are rarely a royal succession. SAC explores more broadly. PPO can reach strong deterministic adjustment behaviour quickly in some settings. The operational question is not “Which acronym is morally superior?” The operational question is: which policy gives reliable profit-improving adjustments within the time and computational constraints of the plant?

Later comparisons make that clearer. Under fixed environmental conditions—14°C, 45% relative humidity, and spring off-peak electricity pricing—the authors test nine initial parameter cases. Six begin with no defective cavities but different profit levels. Three begin with defects. Both SAC and PPO move the process toward profit-maximising settings within a small number of adjustment steps. SAC reaches average profit of 6.800 in 10 steps; PPO reaches 6.799 in 7 steps.

The defective starting cases are especially important. Case 9 begins at a profit of -2.323 with three defective cavities. After one adjustment, SAC brings it to 6.680 and PPO to 6.728. This is not merely fine-tuning. It is a move from economically broken settings to high-profit feasible settings. The paper does not dwell on that drama, because academic prose has a long-standing allergy to drama, but operators should notice it.

Genetic algorithms win narrowly where time does not matter, which is not where factories live

The paper compares DRL with a genetic algorithm using the same surrogate models and the same profit function. This is the comparison most readers will be tempted to oversimplify. Resist the urge. The GA is not a straw man. It is a strong static optimiser and often finds slightly higher profit.

Under fixed conditions, GA converges to an average profit of 6.799 after 20 generations, using 800 profit evaluations. SAC and PPO achieve similar profits: 6.800 and 6.799 respectively. The online time difference is the more important number. SAC takes 0.421 seconds over 10 steps, PPO takes 0.287 seconds over 7 steps, and GA takes 21.201 seconds over 20 generations.
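To show what “search from scratch” means in practice, here is a generic textbook genetic algorithm in Python. It is not the authors' implementation. The population size of 40 is inferred from 800 evaluations over 20 generations, and the selection, crossover, and mutation choices and the toy objective are all assumptions for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple


def genetic_search(
    fitness: Callable[[Sequence[float]], float],
    bounds: Sequence[Tuple[float, float]],
    pop_size: int = 40,           # ASSUMPTION: 800 evaluations / 20 generations
    generations: int = 20,        # from the article
    mutation_scale: float = 0.1,  # ASSUMPTION: fraction of each parameter range
    seed: int = 0,
) -> Tuple[List[float], float, List[float]]:
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    history: List[float] = []
    best_score, best = float("-inf"), population[0]
    for _ in range(generations):
        # Score every candidate once: pop_size evaluations per generation.
        scored = sorted(((fitness(ind), ind) for ind in population),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best = scored[0]
        history.append(scored[0][0])
        parents = [ind for _, ind in scored[: pop_size // 2]]  # truncation selection
        children = [list(scored[0][1])]  # elitism: carry the best forward unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = []
            for (lo, hi), gene_a, gene_b in zip(bounds, a, b):
                gene = gene_a if rng.random() < 0.5 else gene_b     # uniform crossover
                gene += rng.gauss(0.0, mutation_scale * (hi - lo))  # Gaussian mutation
                child.append(min(hi, max(lo, gene)))                # stay within bounds
            children.append(child)
        population = children
    return best, best_score, history


# Toy stand-in for the surrogate-based profit function (NOT the paper's).
def toy_profit(x: Sequence[float]) -> float:
    return -sum((v - 0.3) ** 2 for v in x)


best, score, history = genetic_search(toy_profit, [(0.0, 1.0)] * 10)
print(f"best toy profit after {len(history)} generations: {score:.4f}")
```

The point to notice is structural: every new operating condition means running this whole loop again, whereas a trained policy answers with a handful of forward passes.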

The authors note that the target product’s cycle time is approximately 39 seconds. That means GA’s 21-second optimisation consumes more than half the available cycle time before an expert has reviewed the recommendation or entered parameters into the molding machine. A method that is theoretically elegant but operationally late is not optimisation. It is a memo.
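The arithmetic behind that complaint is short enough to check by hand or in a few lines of Python. The latencies and the roughly 39-second cycle come from the figures above; the helper name `cycle_share` is just for this sketch.

```python
CYCLE_TIME_S = 39.0  # approximate cycle time of the target product

latency_s = {"SAC": 0.421, "PPO": 0.287, "GA": 21.201}


def cycle_share(latency: float, cycle_time: float = CYCLE_TIME_S) -> float:
    # Fraction of one production cycle consumed by the optimiser.
    return latency / cycle_time


for name, seconds in latency_s.items():
    print(f"{name}: {seconds:.3f} s = {100 * cycle_share(seconds):.1f}% of one cycle")
```

Whatever share is left after the optimiser finishes is what remains for expert review and manual parameter entry.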

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Training curves for SAC and PPO | Main training evidence | Both agents learn stable reward-improving policies; SAC reduces defects faster | Real-world closed-loop deployment |
| Nine fixed-condition initial cases | Robustness across starting points | DRL can recover from both profitable and defective initial settings | Universal robustness outside studied parameter bounds |
| GA comparison under fixed conditions | Baseline comparison | DRL reaches near-GA profit with far lower online time | That DRL always beats GA in profit |
| 24-hour seasonal deployment | Main deployment evidence | DRL remains close to GA across spring, summer, winter virtual scenarios | Performance under all factories, products, tariffs, and materials |
| Step-size comparison | Sensitivity test | Larger adjustment steps are more computationally efficient and generally preferable | That large steps are always safe for every machine/process |

This is the paper’s central business lesson. GA is allowed to be slightly better in profit because it is solving a different operational problem: search from scratch under a given condition. DRL pays its search cost upfront during offline training. Once trained, it can infer adjustments quickly as conditions change.

For real-time process control, that trade-off is the whole story. The shop floor does not award medals for optimums that arrive after the relevant cycle has passed.

The seasonal deployment test turns optimisation into operations

The most business-relevant experiment is the 24-hour virtual deployment across seasonal scenarios. The authors evaluate SAC, PPO, and GA under spring, summer, and winter conditions. Temperature and humidity follow seasonal profiles, electricity prices follow time-of-use schedules, and production is assumed to run continuously for 24 hours on a single injection molding machine.

All three methods show the same seasonal ranking of cumulative profit: spring is highest, winter second, summer lowest. That ranking follows the electricity pricing structure. Summer is expensive, and the model’s economics notice. This is exactly what a profit-aware controller should do. It should not treat August energy costs as a philosophical inconvenience.

The results are close:

| Metric | SAC | PPO | GA |
| --- | --- | --- | --- |
| Spring profit (cavities) | $958.88 (8,644) | $958.33 (8,640) | $959.69 (8,652) |
| Summer profit (cavities) | $915.63 (8,744) | $914.68 (8,736) | $915.87 (8,748) |
| Winter profit (cavities) | $930.85 (8,824) | $929.85 (8,816) | $932.66 (8,844) |
| Computational time, spring scenario | 15.5 min | 10.4 min | 781.0 min |

GA wins on total profit and production volume in every season. But the gain is small. In spring, GA beats SAC by $0.81 over 24 hours and PPO by $1.36. In summer, GA beats SAC by $0.24 and PPO by $1.19. In winter, GA beats SAC by $1.81 and PPO by $2.81.

Now compare that with the compute burden. GA requires 781.0 minutes in the spring scenario. SAC requires 15.5 minutes, PPO 10.4 minutes. The paper reports this as a major advantage for DRL because GA must rerun a full optimisation routine when conditions change, while the DRL model uses a trained policy.
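Putting the two comparisons side by side takes only a few lines. The profits and timings are the reported 24-hour values; the helper names are hypothetical.

```python
profit = {
    "spring": {"SAC": 958.88, "PPO": 958.33, "GA": 959.69},
    "summer": {"SAC": 915.63, "PPO": 914.68, "GA": 915.87},
    "winter": {"SAC": 930.85, "PPO": 929.85, "GA": 932.66},
}
spring_minutes = {"SAC": 15.5, "PPO": 10.4, "GA": 781.0}


def gap_to_ga(season: str, method: str) -> float:
    # Dollars of 24-hour profit given up relative to GA.
    return round(profit[season]["GA"] - profit[season][method], 2)


def speedup_vs_ga(method: str) -> float:
    # How many times less compute the method needs in the spring scenario.
    return spring_minutes["GA"] / spring_minutes[method]


for method in ("SAC", "PPO"):
    gaps = {season: gap_to_ga(season, method) for season in profit}
    print(f"{method}: gap to GA {gaps}, about {speedup_vs_ga(method):.0f}x less compute")
```

The gaps stay under three dollars a day while the compute ratio runs to roughly fifty times or more, which is the trade the rest of this section is about.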

The exact dollar values are study-specific and small because this is one machine, one product, one day, and one modelled pricing environment. But the structure of the result is more general. In high-volume manufacturing, the relevant comparison is not “Which method wins in a one-off static search?” It is “Which method can keep making acceptable economic decisions as the world changes around the process?”

Factories are, annoyingly, located in the world.

The step-size experiment is a sensitivity test, not a second thesis

The paper also compares different adjustment step sizes. This section is easy to misunderstand as a minor hyperparameter detail. It is more useful than that. Step size is an operational design choice: how aggressively should the agent adjust process parameters?

The authors compare the large step-size condition used in the main seasonal deployment with smaller step sizes. For small steps, they test two cases: the same number of steps as before, and double the number of steps. The pattern is sensible. Smaller steps often reduce performance when the number of steps is held constant. Doubling the number of steps helps recover performance, but increases computational time.
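A toy one-parameter picture explains the pattern. Suppose a setting must move a fixed distance to reach its best value and each step can move at most `step_size`. The numbers below are hypothetical and chosen only to show the mechanism, not taken from the paper.

```python
import math


def steps_to_reach(distance: float, step_size: float) -> int:
    # Minimum number of capped adjustments needed to cover the distance.
    return math.ceil(distance / step_size)


def remaining_after(distance: float, step_size: float, n_steps: int) -> float:
    # Shortfall left if the step budget runs out first.
    return max(0.0, distance - step_size * n_steps)


# Hypothetical example: the setting is 30 units away from its best value.
for label, step in (("large", 5.0), ("small", 2.5)):
    print(label, steps_to_reach(30.0, step),
          remaining_after(30.0, step, 10), remaining_after(30.0, step, 20))
```

Halving the step doubles the steps needed, so a fixed step budget can leave the process short of its target, and doubling the budget closes the gap at the price of more computation.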

For SAC, large steps over 10 steps produce $958.88, $915.63, and $930.85 across spring, summer, and winter. Small steps over 10 steps reduce summer and winter profit to $912.91 and $928.64. Small steps over 20 steps recover some performance, but require 30.5 minutes instead of 15.5.

For PPO, large steps over 7 steps produce $958.33, $914.68, and $929.85. Small steps over 7 steps reduce spring and summer to $952.81 and $909.70, though winter rises slightly to $930.76. The authors correctly avoid overclaiming that winter bump. They treat consistent performance across environments as more important than one isolated improvement. This is a small but welcome act of statistical maturity. We should encourage it; it is becoming rare wildlife.

The practical takeaway is not “always use large steps.” It is more disciplined: in this setup, larger adjustment steps gave a better trade-off between convergence, performance, and computation. For a real factory, the acceptable step size would also depend on machine safety, process stability, operator trust, actuator constraints, and the cost of overshooting into defect-producing territory.

What Cognaptus infers for business use

The paper directly shows that a profit-aware DRL framework can be trained offline using surrogate quality and cycle-time models, then evaluated in virtual deployment scenarios where it achieves near-GA profitability with far lower online optimisation time. It also shows that the reward design can integrate operational costs that many quality-focused optimisation systems leave outside the model.

Cognaptus infers a broader business pattern: industrial AI becomes more valuable when it optimises the economic unit that managers actually care about. In injection molding, that unit is not simply defect rate. It is profitable output per unit time under changing cost and environmental conditions.

That inference leads to three practical design principles.

First, process AI should include cost structure as a first-class variable. Resin, electricity, maintenance, tool wear, labour constraints, scrap handling, and cycle time cannot be bolted on after the model has already optimised a purely technical score. By then, the model has already learned the wrong game.

Second, real-time usefulness should be measured against production cadence. The GA comparison is valuable because it exposes a common industrial AI failure: optimisation that is excellent in a notebook and awkward beside a machine. If the process cycle is around 39 seconds, a 21-second optimiser may already be too slow once expert review and manual input are considered.

Third, offline training is commercially acceptable only if the surrogate is trustworthy. The DRL agent is only as good as the factory it learns inside. If the surrogate model misses key causal relationships, under-samples rare but costly defect regimes, or fails to represent seasonal/environmental shifts, the learned policy may optimise a beautiful simulation and a disappointing production line. Nobody needs more beautiful disappointments.

Where the evidence stops

The strongest boundary is that this is not a full physical closed-loop deployment. The DRL models are trained and evaluated through surrogate models and virtual seasonal scenarios. The paper’s experimental evidence is credible for a framework demonstration, but it is not the same as proving long-term autonomous operation on multiple machines, products, materials, and plants.

The second boundary is data scope. The dataset has 2,794 samples from one instrumented testbed and one ABS cosmetic container cap. That is enough to show a worked method; it is not enough to claim universal generalisation across injection molding. Different materials, molds, part geometries, machines, sensors, and operator practices can change the relationship between pressure, speed, cooling, cycle time, and quality.

The third boundary is economic specificity. The profit function includes resin, mold, and electricity costs, but other factories may need labour, downtime, machine depreciation, scrap logistics, order priority, contractual penalties, maintenance windows, carbon pricing, or energy demand charges. The framework can be customised, but customisation is work, not a slide bullet.

The fourth boundary is surrogate fidelity. The paper itself recognises this. If the surrogate environment is inaccurate because of measurement errors, limited data, biased coverage, or unmodelled process dynamics, the DRL policy can learn suboptimal behaviour. Future work suggested by the authors includes expanding datasets, synthetic data through multi-fidelity simulation, transfer learning, few-shot learning, and multi-agent DRL for more complex parameter spaces.

That last point is worth taking seriously. As parameter count and interaction complexity rise, a single-agent setup may struggle. A multi-agent architecture could let specialised agents handle different parameter subsets while coordinating toward a global objective. That is promising, but also another layer of coordination risk. Multi-agent systems are excellent at producing emergent behaviour, which is charming until the behaviour emerges inside your production margin.

The useful lesson is margin control under time pressure

This paper is not important because it adds another DRL acronym to manufacturing. The industry has enough acronyms. They breed in conference proceedings.

It is important because it changes the optimisation target from technical acceptability to operational profit. The model is asked to make good products, quickly, under changing environmental and electricity-price conditions, while accounting for resin, energy, and mold wear. That is closer to how factories actually live.

The genetic algorithm comparison is the right kind of humbling. GA remains a strong optimiser and often finds slightly better profit. But DRL’s advantage is deployment speed after offline training. In a production setting where decisions must fit inside cycle time and still leave room for human verification, near-optimal and fast can beat theoretically optimal and late.

The broader lesson for industrial AI teams is simple enough to be dangerous: optimise the business process, not the proxy metric. But doing that properly requires machinery-specific data, cost-aware reward design, reliable surrogate models, and sober validation. Otherwise, “profit-aware AI” becomes another dashboard pretending that margins improve when the colour gradients look expensive.

In this paper, the promise is narrower and better. A DRL controller does not need to replace every process engineer. It can become a fast economic recommender that keeps quality, cycle time, electricity, and tool wear in the same decision loop.

That is a useful future for factory AI: less theatre, more throughput, and just enough accounting to keep the mold honest.

Cognaptus: Automate the Present, Incubate the Future.

¹ Joon-Young Kim, Jecheon Yu, Heekyu Kim, and Seunghwa Ryu, “DRL-Based Injection Molding Process Parameter Optimization for Adaptive and Profitable Production,” arXiv:2505.10988, 2025. https://arxiv.org/pdf/2505.10988
