ODEs Without the Drama: How FPGAs Finally Make Physical AI Practical at the Edge

Battery. It is a wonderfully effective way to end an argument about elegant algorithms.

A wearable device may benefit from learning how its surrounding physical system changes over time. It may even need an interpretable equation rather than another black-box prediction. But if one model update consumes more energy than the device stores, theoretical elegance becomes a rather expensive form of decoration.

The MERINDA paper begins with this uncomfortable constraint. Its illustrative automated-insulin-delivery scenario assumes a smartwatch-sized battery containing roughly 4,000 joules. A GPU-based model-recovery update consumes 29,943.68 joules. MERINDA’s FPGA implementation uses 261.79 joules—enough, on the paper’s arithmetic, for approximately fifteen updates per charge.¹
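The per-charge arithmetic is easy to check. The sketch below uses only the three figures quoted above and assumes, as a simplification, that the entire battery is available for model updates; the function name is illustrative.

```python
# Figures quoted in the paragraph above, all in joules.
BATTERY_J = 4_000.0
GPU_UPDATE_J = 29_943.68
FPGA_UPDATE_J = 261.79


def updates_per_charge(battery_j: float, update_j: float) -> int:
    """Whole model-recovery updates that fit into one battery charge."""
    return int(battery_j // update_j)


gpu_updates = updates_per_charge(BATTERY_J, GPU_UPDATE_J)
fpga_updates = updates_per_charge(BATTERY_J, FPGA_UPDATE_J)
print(gpu_updates, fpga_updates)  # prints: 0 15
```

A real wearable also spends energy on sensing, display, and radio, so the usable count would be lower than this upper bound.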

That comparison is memorable, but it can also encourage the wrong conclusion. MERINDA is not evidence that an FPGA is simply a thriftier drop-in replacement for a GPU. The paper’s more useful contribution comes earlier: it redesigns model recovery so that the expensive parts become suitable for deeply parallel hardware.

The hardware gain follows the architectural change. Reverse that order, and there is very little drama-free ODE computing to discuss.

Model recovery asks for an equation, not merely a prediction

Much of edge AI performs model learning. Given recent sensor readings, a neural network predicts what comes next. This may be sufficient for anomaly detection, short-horizon forecasting, or routine control.

Model recovery asks a harder question: what governing relationship produced the observations?

A physical system can be represented as:

$$ \dot{X} = h(X, U, \theta) $$

Here, $X$ describes the observed state, $U$ represents external inputs, and $\theta$ contains the coefficients defining the system’s dynamics. Model recovery attempts to infer those coefficients from measurements.
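As a concrete instance of this template, consider the classical Lotka–Volterra predator-prey system, one of the benchmarks the paper uses. It has no external input, so $U$ is empty, and the state is $X = (x, y)$:

$$ \dot{x} = \alpha x - \beta x y, \qquad \dot{y} = \delta x y - \gamma y $$

Here $\theta = (\alpha, \beta, \gamma, \delta)$, and recovering the model means estimating those four coefficients from measured traces of $x$ and $y$. The symbol names follow the common textbook convention and are not taken from the paper.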

The distinction matters because two systems can produce similar short-term predictions while behaving very differently after conditions change. An interpretable recovered model can expose which variables interact, which terms matter, and how the system may respond outside its recent operating range.

That is useful in environments where operators need more than a forecast:

An industrial monitor may need to distinguish ordinary variation from a change in the machine’s physical behaviour.
An autonomous system may need to update its internal model after wear, damage, or environmental change.
A medical device may need to reason about changing patient dynamics rather than repeat a pattern learned from yesterday.

Recovery is therefore attractive precisely where mistakes are expensive. Unfortunately, it is also much more computationally awkward than ordinary prediction.

Neural ODEs put the difficult computation in the wrong place

State-of-the-art recovery approaches such as EMILY and PINN+SR rely on Neural Ordinary Differential Equation layers. These layers represent continuously evolving dynamics and repeatedly invoke numerical integration while the model is being trained.

That design has a natural appeal. Physical systems evolve continuously; Neural ODEs provide a mathematically appropriate way to represent them.

The difficulty is computational shape.

A numerical ODE solver proceeds through a sequence of dependent steps. Later calculations depend on earlier ones, and the solver may need to adjust its behaviour as the learned dynamics change. Model recovery makes the problem harder because the underlying equation coefficients are themselves being revised during training.

FPGAs are highly effective when computation can be expressed as a predictable stream of parallel operations. They are considerably less enthusiastic about irregular iterative procedures with changing control flow. Hardware, like management, prefers meetings with an agenda.

Previous ODE accelerators often assume a fixed architecture or static coefficients. Model recovery violates both assumptions. The depth and behaviour of the computation can change as training searches for the governing equation.

MERINDA’s central move is therefore not to accelerate the existing Neural ODE layer more aggressively. It replaces that layer with a different computational mechanism that approximates the same role while exposing far more parallelism.

MERINDA moves the solver away from the centre of the architecture

MERINDA uses neural-flow theory to replace the Neural ODE layer with a discretized recurrent alternative based on a Gated Recurrent Unit, or GRU.

Under the paper’s stated conditions, a recurrent neural flow can approximate the Neural ODE trajectory. The GRU handles the forward evolution of the observed dynamics, while a dense layer approximates the inverse mapping needed to recover candidate equation coefficients.
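The division of labour between the GRU and the dense layer can be sketched in plain Python. This is a minimal illustration rather than the paper's implementation: the layer sizes, the random untrained weights, the helper names, and the choice of the `h' = (1 - z) * h + z * candidate` gate convention are all assumptions made for the example.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gate(W, U, b, x, h, act):
    """Compute act(W x + U h + b) element-wise."""
    return [act(a + c + d) for a, c, d in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x, h, p):
    """One GRU update: h' = (1 - z) * h + z * candidate."""
    z = gate(p["Wz"], p["Uz"], p["bz"], x, h, sigmoid)   # update gate
    r = gate(p["Wr"], p["Ur"], p["br"], x, h, sigmoid)   # reset gate
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = gate(p["Wh"], p["Uh"], p["bh"], x, rh, math.tanh)
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def init_params(n_in, n_hidden, n_coeffs, rng):
    """Untrained, randomly initialised weights (illustration only)."""
    p = {}
    for g in "zrh":
        p["W" + g] = random_matrix(n_hidden, n_in, rng)
        p["U" + g] = random_matrix(n_hidden, n_hidden, rng)
        p["b" + g] = [0.0] * n_hidden
    p["Wd"] = random_matrix(n_coeffs, n_hidden, rng)     # dense head
    p["bd"] = [0.0] * n_coeffs
    return p


def propose_coefficients(trace, p, n_hidden):
    """Run the GRU over the trace, then map the final state to coefficients."""
    h = [0.0] * n_hidden
    for x in trace:
        h = gru_step(x, h, p)
    coeffs = [a + b for a, b in zip(matvec(p["Wd"], h), p["bd"])]
    return h, coeffs


rng = random.Random(0)
params = init_params(n_in=2, n_hidden=4, n_coeffs=5, rng=rng)
trace = [[math.sin(0.1 * t), math.cos(0.1 * t)] for t in range(20)]
hidden, coeffs = propose_coefficients(trace, params, n_hidden=4)
```

Every operation inside `gru_step` is a fixed-size multiply-accumulate followed by an element-wise function, which is the kind of regular arithmetic the later hardware sections depend on. With untrained weights the proposed coefficients are meaningless; the sketch shows only the data path.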

The resulting recovery process can be summarized as:

Sensor traces and external inputs
              ↓
     GRU-based neural flow
              ↓
 Dense layer proposes coefficients
              ↓
 Threshold-based sparsity removes weak terms
              ↓
     Candidate physical equation
              ↓
 Runge–Kutta simulation reconstructs the trace
              ↓
 Reconstruction error guides training

This distinction is important: MERINDA does not remove ODE solving altogether.

The recovered coefficients and initial conditions are still passed through a Runge–Kutta solver. The resulting trajectory is compared with the observed measurements, and the reconstruction error contributes to training.
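A minimal sketch of that reconstruction check follows, assuming a classical fixed-step fourth-order Runge–Kutta scheme and a Lotka–Volterra candidate equation. The coefficient values, step size, initial state, and function names are illustrative choices, not figures from the paper, and the "observed" trace is generated synthetically so the example is self-contained.

```python
def lotka_volterra(state, theta):
    """Candidate right-hand side: dx/dt = a*x - b*x*y, dy/dt = d*x*y - g*y."""
    x, y = state
    a, b, d, g = theta
    return [a * x - b * x * y, d * x * y - g * y]


def axpy(state, scale, slope):
    """Return state + scale * slope, element-wise."""
    return [s + scale * k for s, k in zip(state, slope)]


def rk4_step(f, state, theta, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state, theta)
    k2 = f(axpy(state, 0.5 * dt, k1), theta)
    k3 = f(axpy(state, 0.5 * dt, k2), theta)
    k4 = f(axpy(state, dt, k3), theta)
    return [s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]


def simulate(f, x0, theta, dt, steps):
    """Integrate from x0 and return the full trajectory."""
    traj = [list(x0)]
    for _ in range(steps):
        traj.append(rk4_step(f, traj[-1], theta, dt))
    return traj


def mse(a, b):
    """Mean squared error between two trajectories of equal shape."""
    total = sum((u - v) ** 2 for ra, rb in zip(a, b) for u, v in zip(ra, rb))
    return total / (len(a) * len(a[0]))


# Illustrative coefficients; the "observed" trace is synthetic.
TRUE_THETA = (1.0, 0.1, 0.075, 1.5)
X0 = [10.0, 5.0]
observed = simulate(lotka_volterra, X0, TRUE_THETA, dt=0.01, steps=500)

good_guess = simulate(lotka_volterra, X0, TRUE_THETA, dt=0.01, steps=500)
bad_guess = simulate(lotka_volterra, X0, (1.2, 0.1, 0.075, 1.5), dt=0.01, steps=500)

loss_good = mse(observed, good_guess)
loss_bad = mse(observed, bad_guess)
```

The candidate with the correct coefficients reproduces the trace exactly, while the perturbed one accumulates error. That difference is the signal a training loop can use to revise the proposed coefficients.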

What changes is the role of the solver. It is no longer embedded as the dominant adaptive Neural ODE layer responsible for representing the evolving latent dynamics. The GRU-based flow takes over that computationally troublesome role, while the remaining solver is used to test whether the proposed sparse equation reconstructs the data.

That is a subtler claim than “MERINDA eliminates ODE solvers,” but it is also the commercially useful one. Expensive iterative computation does not always need to disappear. Sometimes it needs to be moved out of the critical path.

Sparse coefficients keep the recovered physics readable

The dense layer initially considers a large collection of possible nonlinear terms. For a second-order system, these candidates might include individual variables, squared terms, interactions between variables, and external inputs.

Most recovery methods assume that only a small subset of those candidate terms belongs in the true governing equation. MERINDA applies threshold-based dropout to suppress weak coefficient estimates and retain a sparse result.
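One simple reading of that step is hard thresholding: any coefficient whose magnitude falls below a cut-off is set to zero. The sketch below assumes that reading; the candidate terms, the coefficient values, and the threshold are made up for illustration and are not the paper's exact rule or data.

```python
def sparsify(coeffs, threshold):
    """Zero out coefficients whose magnitude falls below the threshold."""
    return {term: (value if abs(value) >= threshold else 0.0)
            for term, value in coeffs.items()}


def active_terms(coeffs):
    """Names of the terms that survive sparsification."""
    return [term for term, value in coeffs.items() if value != 0.0]


# Made-up estimates for one equation of a two-variable system.
estimate = {"x": 0.98, "y": 0.03, "x*y": -0.11, "x^2": 0.004, "y^2": -0.02}

sparse = sparsify(estimate, threshold=0.05)
print(active_terms(sparse))  # prints: ['x', 'x*y']
```

The threshold is a design choice: set too low, spurious terms survive; set too high, genuine but small physical effects are discarded.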

The paper illustrates this process using a Lotka–Volterra system. Early training iterations produce a crowded candidate equation containing many incorrect terms. As training continues and low-valued coefficients are removed, the recovered equation becomes substantially sparser.

This example is best understood as a mechanism demonstration, not an independent robustness result. It shows how MERINDA can move from a broad candidate library toward a compact equation. It does not establish that the same threshold will reliably identify every physical system under noisy field conditions.

Still, the mechanism matters. Without sparsity, the model may reconstruct the observations while producing an equation too complicated to inspect or trust. A physically interpretable model should not require a committee to explain every coefficient.

The FPGA becomes useful only after the computation becomes regular

Once the Neural ODE layer has been replaced with GRU-based computation, MERINDA can exploit the strengths of an FPGA.

The authors implement several hardware-level optimizations:

Loop pipelining with an initiation interval of one. A new GRU computation can begin every clock cycle rather than waiting for the previous sequence to complete.
Full unrolling of inner operations. Calculations across input and hidden dimensions are expanded into parallel multiply–accumulate operations.
Streaming dataflow between stages. Independent stages execute concurrently and exchange data through first-in, first-out buffers.
On-chip storage of frequently reused values. Hidden states and temporary values remain in registers or block RAM rather than repeatedly travelling to off-chip memory.
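The effect of an initiation interval of one can be illustrated with the standard first-order cycle-count model for a pipelined loop. The iteration count and pipeline depth below are hypothetical, and the model is idealized: it ignores stalls, memory contention, and any loop-carried dependency that would force a larger interval.

```python
def sequential_cycles(iterations, depth):
    """Each iteration runs to completion before the next one starts."""
    return iterations * depth


def pipelined_cycles(iterations, depth, ii=1):
    """First result after `depth` cycles, then one more every `ii` cycles."""
    if iterations <= 0:
        return 0
    return depth + ii * (iterations - 1)


N_ITER = 200   # hypothetical number of loop iterations
DEPTH = 12     # hypothetical pipeline depth, in clock cycles

seq = sequential_cycles(N_ITER, DEPTH)   # 2400 cycles
pipe = pipelined_cycles(N_ITER, DEPTH)   # 211 cycles
print(seq, pipe, round(seq / pipe, 1))
```

As the iteration count grows, the pipelined cost approaches one cycle per iteration regardless of depth, which is why a regular, predictable loop body matters more than raw clock speed.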

These choices explain the efficiency gains more convincingly than the word “FPGA” alone.

An FPGA is not automatically fast. It is fast when the workload can be transformed into a stable pipeline with predictable data movement. MERINDA first changes the algorithmic structure, then specializes that structure for the hardware.

This also explains why the paper should not be read as a simple hardware benchmark. The GPU implementation uses TensorFlow and Keras on an NVIDIA RTX 6000, while the FPGA version is built from scratch in C++ using high-level synthesis and integrated with a PYNQ-Z2 board.

The comparison is between two different hardware-software systems, not merely two chips executing identical code. That limits any claim that the FPGA alone caused the improvement. It also reinforces the actual lesson: co-design can outperform the convenient habit of deploying an unchanged framework implementation.

The evidence supports efficiency with an accuracy trade-off

The paper presents several experimental components. They do not all serve the same purpose.

| Evidence | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Reconstruction error against EMILY and PINN+SR | Comparison with prior work | MERINDA achieves competitive error on four benchmark systems | Equal accuracy under one controlled head-to-head experiment |
| Coefficient evolution for Lotka–Volterra | Mechanism illustration | Dense outputs and sparsity can converge toward a compact equation | General robustness to noise, drift, or poor identifiability |
| FPGA and GPU results across hidden sizes | Main performance evidence and sensitivity test | FPGA-based MR substantially reduces time, energy, and memory, with higher error | FPGA is universally faster or more accurate |
| FPGA resource-utilization table | Implementation and scaling detail | Larger configurations consume substantial FPGA resources | Unlimited scaling to larger models |
| MILP-selected configurations | Exploratory deployment framework | Platform, task, and hyperparameters can be selected jointly | A fully validated production deployment optimizer |

The prior-work accuracy comparison is encouraging. On the four listed recovery benchmarks, MERINDA reports the following reconstruction mean squared errors:

| Benchmark | EMILY | PINN+SR | MERINDA |
| --- | --- | --- | --- |
| Lotka–Volterra | 0.03 | 0.05 | 0.03 |
| Chaotic Lorenz | 1.70 | 2.11 | 1.68 |
| F8 Cruiser | 4.20 | 6.90 | 5.10 |
| Pathogenic Attack | 14.30 | 21.40 | 15.10 |

MERINDA matches or slightly improves upon the reported comparison methods on the first two systems. On F8 Cruiser and Pathogenic Attack, it performs worse than EMILY but better than PINN+SR.

These figures justify describing MERINDA as competitive with existing model-recovery methods. They do not justify pretending the methods are indistinguishable. The comparison values for EMILY and PINN+SR are taken from prior work rather than generated through a newly controlled rerun in this study.

That is sufficient for positioning a new architecture. It is thinner evidence for declaring universal accuracy equivalence.

The largest gains appear where the original computation was most hostile

At a hidden size of 16, the model-recovery comparison is striking:

| Metric | GPU MR | FPGA MR | Interpretation |
| --- | --- | --- | --- |
| Training time | 163.51 s | 55.23 s | FPGA is 2.96× faster |
| Energy consumption | 29,943.68 J | 261.79 J | FPGA uses about 114× less energy |
| DRAM footprint | 5,862.32 MB | 211.29 MB | FPGA uses about 28× less DRAM |
| Average error | 3.179 | 5.3678 | GPU remains more accurate |

The efficiency improvement is substantial. So is the accuracy difference.
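The ratios in the table's interpretation column can be reproduced directly from the hidden-size-16 figures. The snippet is plain arithmetic on those reported values; the dictionary layout and function name are illustrative.

```python
# (GPU MR, FPGA MR) at hidden size 16, as reported in the table above.
ROWS = {
    "training_time_s": (163.51, 55.23),
    "energy_j": (29_943.68, 261.79),
    "dram_mb": (5_862.32, 211.29),
    "average_error": (3.179, 5.3678),
}


def gpu_over_fpga(metric):
    """Ratio of the GPU figure to the FPGA figure for one metric."""
    gpu, fpga = ROWS[metric]
    return gpu / fpga


ratios = {name: gpu_over_fpga(name) for name in ROWS}
for name, value in ratios.items():
    print(f"{name}: GPU/FPGA = {value:.2f}")
```

For time, energy, and memory a ratio above one favours the FPGA. For average error the ratio falls below one, which is the same arithmetic pointing the other way: the GPU's error is the smaller number.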

Across all four tested hidden sizes, the FPGA implementation of model recovery trains faster and uses dramatically less energy and memory than the GPU version. Across those same configurations, the GPU reports lower model-recovery error.

That is not an inconvenient footnote. It defines the deployment decision.

An application with a strict energy budget may rationally accept an error of 5.37 instead of 3.18. An offline engineering analysis seeking the best possible recovered equation may not. “Comparable” accuracy becomes meaningful only after an application defines how much error it can tolerate.

The hidden-size sweep also corrects the tempting belief that FPGAs are simply faster for everything. For ordinary model learning, the GPU trains faster at every tested hidden size. At hidden size 128, GPU model learning takes 29.41 seconds, while the FPGA takes 66.74 seconds.

The FPGA’s decisive speed advantage appears in model recovery, where the original solver-heavy design is particularly ill-suited to parallel execution. The benefit is therefore workload-specific, not a referendum on GPUs.

Configuration changes the headline number

The paper reports both a 2.96× and a 1.68× training-time improvement. These refer to different hidden sizes.

At hidden size 16, FPGA model recovery takes 55.23 seconds against the GPU’s 163.51 seconds, producing the 2.96× result. At hidden size 128, it takes 88.5 seconds against 149.14 seconds, producing the 1.68× result.

The energy comparison also deserves careful reading. The paper’s prose states that the hidden-size-128 GPU configuration consumes 49,375.12 joules, while Table III lists 27,375.12 joules. Using the table value, the FPGA’s 434.09 joules represents roughly a 63× reduction, not 114×.

The 114× energy reduction is cleanly supported by the hidden-size-16 figures: 29,943.68 joules versus 261.79 joules.
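Because the headline ratio depends on which figures are paired, it helps to compute each pairing explicitly. The snippet uses only the joule values quoted above. The last entry shows what the prose figure would imply; it is arithmetic only and says nothing about which of the two GPU values is correct.

```python
FPGA_128_J = 434.09
GPU_128_TABLE_J = 27_375.12   # Table III value, as quoted above
GPU_128_PROSE_J = 49_375.12   # value in the paper's prose, as quoted above
GPU_16_J, FPGA_16_J = 29_943.68, 261.79


def reduction(gpu_j, fpga_j):
    """How many times less energy the FPGA run uses."""
    return gpu_j / fpga_j


by_source = {
    "hidden 16": reduction(GPU_16_J, FPGA_16_J),
    "hidden 128, table value": reduction(GPU_128_TABLE_J, FPGA_128_J),
    "hidden 128, prose value": reduction(GPU_128_PROSE_J, FPGA_128_J),
}
for label, value in by_source.items():
    print(f"{label}: {value:.1f}x")
```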

This numerical inconsistency does not erase the efficiency result. Even the lower ratio is commercially material. It does mean that readers should attach each headline improvement to its specific configuration rather than treating all maximum figures as one representative deployment.

Numbers rarely become less useful when given a denominator and a row label.

FPGA capacity is a design constraint, not an infinite resource

The FPGA resource table provides an important scaling boundary.

As the tested matrix size increases from 16 to 128:

lookup-table utilization rises from 11.75% to 54.16%;
lookup-table RAM rises from 6.75% to 67.54%;
digital-signal-processing utilization reaches 72.73% by size 32 and remains there;
block RAM rises from 11.43% to 34.29%.

The implementation still fits within the tested board, but larger configurations consume substantial portions of available resources.

This reinforces the paper’s broader argument. Edge deployment is not achieved by selecting the largest configuration the board can tolerate. It requires choosing a model, task, and hardware arrangement that jointly satisfy the operational constraint.

In other words, the design space needs a decision process.

The MILP framework chooses which kind of intelligence to deploy

MERINDA adds a mixed-integer optimization framework for selecting among:

platform: FPGA or GPU;
task: model learning, physics-guided model learning, or full model recovery;
hidden size;
training epochs;
sequence length.

The objective balances estimated power and memory use while enforcing limits on reconstruction error and execution time. Ridge-regression surrogate functions estimate the performance of candidate configurations, and an optimizer searches the resulting design space.
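The shape of that decision can be sketched with a deliberately small stand-in. Instead of ridge-regression surrogates and a MILP solver, the example below searches exhaustively over just two candidates, using the hidden-size-16 model-recovery measurements reported earlier in this article. Energy is used in place of power, and the equal weights and the constraint values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    energy_j: float
    memory_mb: float
    error: float
    time_s: float


# Hidden-size-16 model-recovery figures reported earlier in the article.
CANDIDATES = (
    Candidate("GPU model recovery", 29_943.68, 5_862.32, 3.179, 163.51),
    Candidate("FPGA model recovery", 261.79, 211.29, 5.3678, 55.23),
)


def select(candidates: Sequence[Candidate], max_error: float, max_time_s: float,
           w_energy: float = 1.0, w_memory: float = 1.0) -> Optional[Candidate]:
    """Cheapest candidate that satisfies the error and time limits.

    Exhaustive search stands in for the MILP solver; the weights are
    hypothetical and simply add energy (J) and memory (MB).
    """
    feasible = [c for c in candidates
                if c.error <= max_error and c.time_s <= max_time_s]
    if not feasible:
        return None
    return min(feasible,
               key=lambda c: w_energy * c.energy_j + w_memory * c.memory_mb)


loose = select(CANDIDATES, max_error=6.0, max_time_s=200.0)
strict = select(CANDIDATES, max_error=4.0, max_time_s=200.0)
impossible = select(CANDIDATES, max_error=2.0, max_time_s=200.0)
print(loose.name, "|", strict.name, "|", impossible)
```

Loosening the error limit admits the FPGA option, which then wins on cost; tightening it leaves only the GPU; tightening it further leaves nothing. The constraint, not the platform, drives the answer.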

The selected examples reveal three distinct operating modes:

| Deployment objective | Selected direction | Paper result | Practical interpretation |
| --- | --- | --- | --- |
| Fast approximate monitoring | FPGA + model learning | Error 5.00, time 20.00 s | Use when rapid detection matters more than recovering an equation |
| Resource-aware physical insight | FPGA + model recovery | Error 5.40, time 446.37 s | Use when local interpretability is valuable but resources remain constrained |
| Highest-accuracy recovery | GPU + model recovery | Error 1.00, time 804.90 s | Use offline when equation fidelity dominates latency and energy |

This may be the paper’s most operationally relevant idea.

The important question is not, “Should the company use an FPGA or a GPU?” It is, “Which kind of intelligence must run at which point in the workflow?”

A system can use inexpensive model learning for continuous monitoring, trigger FPGA-based recovery when behaviour changes, and send difficult cases to a GPU for slower high-accuracy analysis. Hardware becomes part of a tiered decision architecture rather than a single procurement choice.

The optimizer itself remains an early framework. The paper shows only one of the eighteen stated surrogate functions and does not provide extensive out-of-sample validation of the selected configurations. It should therefore be read as a credible design method, not a ready-made production scheduler.

The business value is local diagnosis, not merely cheaper computation

The direct result of MERINDA is lower resource consumption for a redesigned model-recovery workflow.

The business interpretation is broader.

1. Architecture review should come before hardware procurement

A company may discover that its edge workload is expensive not because the model is large, but because one iterative component forces sequential execution.

Buying a smaller accelerator or quantizing the existing model may deliver incremental gains. Replacing the structurally hostile component may change the deployment economics entirely.

The first question for edge-AI teams should therefore be:

Which operation prevents the workload from becoming a predictable pipeline?

That question is less fashionable than comparing TOPS ratings. It is also more likely to save money.

2. Online and offline intelligence can use different standards

MERINDA’s results support a split architecture:

Online edge models prioritize responsiveness, energy efficiency, and sufficient accuracy.
Offline models prioritize the most accurate recovered dynamics and can tolerate higher compute cost.
Escalation rules decide when the edge result is uncertain enough to require offline analysis.
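An escalation rule of this kind can be very small. The sketch below is one hypothetical shape for it, not something the paper specifies: the function name, the action labels, and both thresholds are assumptions, and real limits would have to come from the application's own error tolerance.

```python
def next_action(monitor_error, recovery_error=None, *,
                drift_limit, recovery_limit):
    """Decide which tier should handle the current situation.

    monitor_error   error of the cheap always-on model
    recovery_error  error of the locally recovered equation, if one exists
    """
    if monitor_error <= drift_limit:
        return "continue_monitoring"
    if recovery_error is None:
        return "run_edge_recovery"
    if recovery_error <= recovery_limit:
        return "accept_edge_model"
    return "escalate_offline"


# Hypothetical thresholds, for illustration only.
LIMITS = {"drift_limit": 2.0, "recovery_limit": 5.5}

print(next_action(1.2, **LIMITS))
print(next_action(3.4, **LIMITS))
print(next_action(3.4, 5.4, **LIMITS))
print(next_action(3.4, 7.9, **LIMITS))
```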

This is particularly relevant for industrial equipment and autonomous systems. A local device does not necessarily need to produce the final authoritative engineering model. It may only need enough interpretable evidence to detect that the system has changed and initiate the correct response.

3. Accuracy should be expressed as an operational constraint

The GPU produces lower recovery error. The FPGA uses dramatically fewer resources.

Neither result identifies the correct business choice until error is translated into consequences.

A tolerable reconstruction error for predictive maintenance may be unacceptable for a safety-critical controller. Conversely, paying for the lowest possible error may be wasteful when the model is only being used to rank inspection priorities.

MERINDA’s MILP framing encourages teams to specify error, latency, memory, and energy limits explicitly. That is healthier than selecting the most accurate model and discovering later that it cannot leave the data centre.

4. Interpretable local models can reduce dependence on connectivity

When a device can update a compact physical model locally, it may continue diagnosing changing behaviour when cloud access is delayed, costly, or unavailable.

This could be valuable for remote industrial assets, mobile robotics, and autonomous monitoring systems. It may also reduce the amount of raw sensor data that must be continuously transmitted.

These are reasonable business inferences from the paper’s architecture and resource results. The paper does not directly measure network savings, operational uptime, or total cost of ownership.

The insulin-pump scenario is an illustration, not clinical validation

The paper’s automated-insulin-delivery example makes the energy constraint easy to understand. It also requires precise interpretation.

The study uses fourteen glucose–insulin time series derived from the OhioT1D virtual-patient dataset. Each series contains 200 samples covering sixteen hours and forty minutes, with varied meal timing, carbohydrate intake, and insulin delivery.

That supports experimentation on changing glucose–insulin dynamics. It does not establish that MERINDA is ready to control an insulin pump used by patients.

A real medical deployment would require evidence covering measurement noise, unexpected patient behaviour, model failures, safety constraints, hardware reliability, and regulatory validation. The paper does not attempt that programme.

The battery calculation demonstrates feasibility in an energy-budget sense: under the stated GPU consumption, a single update would need more than seven times the assumed 4,000-joule battery, whereas the FPGA implementation fits roughly fifteen updates into one charge. It does not demonstrate clinical safety or continuous operation in a commercial wearable.

The same distinction applies elsewhere. MERINDA presents a path toward local model recovery. It does not remove the validation burden attached to whatever decisions the recovered model will influence.

What the paper shows, what Cognaptus infers, and what remains uncertain

| Category | Assessment |
| --- | --- |
| Directly shown | Replacing the Neural ODE component with a GRU-based neural-flow architecture enables a deeply pipelined FPGA implementation with substantially lower energy and memory use and faster model-recovery training in the tested configurations. |
| Directly shown | MERINDA produces competitive reconstruction errors on four listed nonlinear-system benchmarks, while GPU-based model recovery remains more accurate in the platform comparison. |
| Directly shown | Platform, task type, and hyperparameters produce materially different trade-offs, making a universal “best platform” an unhelpful concept. |
| Cognaptus inference | Organisations can divide physical-AI workloads into continuous monitoring, resource-aware local recovery, and high-accuracy offline recovery rather than forcing one architecture to perform every role. |
| Cognaptus inference | The most valuable optimization opportunity may be replacing sequential algorithmic components before compressing models or purchasing new hardware. |
| Still uncertain | How MERINDA performs under noisy, drifting, incomplete, or adversarial field data. |
| Still uncertain | Whether its efficiency gains persist across different FPGA boards, GPU implementations, larger systems, and end-to-end deployed applications. |
| Still uncertain | Whether the recovered equations satisfy the safety, reliability, and certification requirements of mission-critical commercial systems. |

Practical boundaries before declaring physical AI solved

MERINDA offers strong evidence for a design principle, but several boundaries matter when considering deployment.

First, the paper assumes that the underlying physical model is identifiable. If different coefficient combinations can produce indistinguishable observations, no architecture can reliably recover the correct equation from those observations alone.

Second, the neural-flow replacement is presented as approximately equivalent under initial-condition and invertibility requirements. The empirical results are promising, but the paper does not establish universal end-to-end equivalence for every model-recovery problem.

Third, the hardware comparison combines architectural redesign, custom FPGA engineering, and different software stacks. It demonstrates the value of co-design, but it cannot isolate the value of the FPGA from the value of replacing the solver-heavy architecture and implementing it carefully.

Fourth, accuracy degradation is not hypothetical. GPU model recovery reports lower error across the tested hidden sizes. Any deployment must decide whether the FPGA’s efficiency gain justifies that difference.

Fifth, FPGA engineering has its own cost. Custom high-level-synthesis development, resource planning, verification, and maintenance require skills that are not captured in the paper’s energy or runtime tables. An efficient device can still support an expensive engineering programme.

These boundaries narrow the claim without weakening the central lesson.

Physical AI becomes practical when the algorithm respects the machine

The easiest version of the MERINDA story is that FPGAs make model recovery cheaper.

The more useful version is that hardware cannot rescue an algorithm whose most important operation resists the hardware’s strengths.

MERINDA begins by changing that operation. It replaces the solver-heavy Neural ODE layer with a GRU-based neural flow, uses a dense layer and sparsity mechanism to recover compact equations, retains a smaller ODE-simulation role for reconstruction, and then maps the regularized computation onto a deeply pipelined FPGA design.

The result is not an unconditional victory over GPUs. GPUs remain faster for some ordinary learning tasks and more accurate for the tested model-recovery configurations. The result is a credible method for making interpretable physical modelling available inside a much tighter energy and memory envelope.

For businesses building edge intelligence, that is the real takeaway: do not begin by asking which chip should run the model.

Begin by asking whether the model was designed to run anywhere except the laboratory.

Cognaptus: Automate the Present, Incubate the Future.

¹ Bin Xu, Ayan Banerjee, and Sandeep Gupta, “Enabling Physical AI at the Edge: Hardware-Accelerated Recovery of System Dynamics,” arXiv:2512.23767, 2025. https://arxiv.org/abs/2512.23767
