When Wings Meet Transformers: Neural Surrogates at Mach Speed

A wing team has one expensive habit: asking CFD again

A design team is trying to improve a wing. Not the poetic version of a wing, with clean curves and heroic renderings, but the irritating engineering version: span, taper ratio, sweep angle, root chord, velocity, angle of attack, shocks, vortices, boundary layers, and drag that refuses to behave politely.

The team changes the geometry. Then it waits for Computational Fluid Dynamics. It adjusts the angle of attack. Then it waits again. It explores a promising sweep angle. More waiting. Eventually someone suggests evaluating a wider design space, and everyone in the room silently hears the sound of compute budget evaporating.

This is the business problem behind “Going with the Speed of Sound: Pushing Neural Surrogates into Highly-turbulent Transonic Regimes.” The paper introduces Emmi-Wing, a large public CFD dataset for 3D wings in subsonic and transonic regimes, and benchmarks neural surrogates on the job aerospace engineers actually care about: exploring drag–lift trade-offs without rerunning high-fidelity simulation for every candidate design.¹

The important claim is not “AI replaces CFD.” That would be convenient, dramatic, and mostly wrong. The stronger claim is narrower and more useful: a well-trained neural surrogate may let engineers screen wing designs quickly, approximate flow fields and aerodynamic coefficients, trace useful Pareto fronts, and reserve expensive CFD for validation rather than first-pass exploration. Less Hollywood. More productivity.

And, as usual, the less glamorous version is the one business readers should actually pay attention to.

The paper turns one slow workflow into a reusable learning problem

Traditional CFD solves the governing equations again for each geometry and operating condition. The paper reframes that workflow as a mapping problem:

Input	Learned output	Engineering use
Wing geometry and inflow parameters	Surface and volumetric flow fields	Inspect pressure, shear stress, velocity, and vorticity
Predicted surface fields	Lift and drag coefficients	Compare aerodynamic efficiency
Many candidate designs	Approximate drag–lift Pareto front	Screen promising designs before CFD validation

The dataset is the first major contribution. Emmi-Wing contains 29,727 CFD simulation cases, each based on a NACA0012 airfoil extruded into a 3D wing and varied across four geometry parameters and two inflow parameters. The geometry parameters are root chord, span, taper ratio, and sweep angle; the inflow parameters are freestream velocity and angle of attack. The paper reports sampling ranges of root chord $[0.7, 1.2]$ m, span $[1.0, 1.5]$ m, taper ratio $[0.4, 0.7]$, sweep angle $[0, 40]^\circ$, velocity $[150, 300]$ m/s, and angle of attack $[-10, 10]^\circ$.

This matters because many earlier aerospace ML datasets are 2D airfoil datasets. Useful, yes. Enough for transonic 3D wing design? Not really. A 2D airfoil benchmark cannot express wingtip vortices, 3D shock structures, or the particular ways drag and lift misbehave when geometry and inflow interact. Training neural surrogates only on that world is like training a driver in a parking lot and then handing them a mountain road. Technically there is still a steering wheel.
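For readers who want the reported sampling ranges in a form they can poke at, here is a minimal Python sketch. It stores the ranges, draws one hypothetical case, and converts the velocity bounds into approximate Mach numbers. Two things are assumptions rather than facts from the paper: the uniform sampling in `sample_case`, and the speed of sound of 340 m/s (roughly sea-level air), which is used only to show why 150 to 300 m/s spans subsonic and transonic flow.

```python
import random

# Sampling ranges as reported in the article (units in comments).
RANGES = {
    "root_chord": (0.7, 1.2),    # m
    "span": (1.0, 1.5),          # m
    "taper_ratio": (0.4, 0.7),   # dimensionless
    "sweep_deg": (0.0, 40.0),    # degrees
    "velocity": (150.0, 300.0),  # m/s
    "aoa_deg": (-10.0, 10.0),    # degrees
}

# ASSUMPTION: sea-level-like speed of sound, not a value taken from the paper.
SPEED_OF_SOUND = 340.0  # m/s


def sample_case(rng):
    """Draw one hypothetical case; uniform sampling is an assumption."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}


def mach(velocity, speed_of_sound=SPEED_OF_SOUND):
    """Approximate Mach number for a given freestream velocity."""
    return velocity / speed_of_sound


rng = random.Random(0)
case = sample_case(rng)
mach_lo, mach_hi = (mach(v) for v in RANGES["velocity"])
print(f"Mach range: {mach_lo:.2f} to {mach_hi:.2f}")
```

Under that assumed speed of sound, the velocity range works out to roughly Mach 0.44 to 0.88.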

The authors use OpenFOAM-v2506 with steady-state compressible RANS simulations, the rhoSimpleFoam solver, a perfect-gas assumption, and the Spalart–Allmaras turbulence model. The resulting dataset includes both surface fields—pressure and wall shear stress—and volumetric fields, including pressure, velocity, and vorticity. That matters because a surrogate that only predicts a final drag number is a response surface with a nicer haircut. A surrogate that predicts flow fields gives engineers something closer to diagnostic visibility.

The benchmark is not just accuracy; it is whether the model survives unfamiliar wings

The paper evaluates four surrogate families: PointNet, a standard Transformer, Transolver, and AB-UPT. The evaluation is deliberately structured around generalization. The authors split the data into training, validation, two in-distribution test regimes, and an out-of-distribution regime built from the outer regions of the parameter space.
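The paper's exact split rule is not reproduced in this article, so the following Python sketch shows only one plausible way to carve an "outer region" out of a box-shaped parameter space. The `MARGIN` value and the "any parameter close to its bound" rule are hypothetical choices for illustration, not the authors' procedure.

```python
# Parameter bounds from the article; the split rule below is an ASSUMPTION.
BOUNDS = {
    "root_chord": (0.7, 1.2),
    "span": (1.0, 1.5),
    "taper_ratio": (0.4, 0.7),
    "sweep_deg": (0.0, 40.0),
    "velocity": (150.0, 300.0),
    "aoa_deg": (-10.0, 10.0),
}

MARGIN = 0.1  # hypothetical: outer 10% of each normalized range


def is_outer(case, bounds=BOUNDS, margin=MARGIN):
    """Return True if any parameter sits in the outer band of its range."""
    for name, (lo, hi) in bounds.items():
        t = (case[name] - lo) / (hi - lo)  # normalize to [0, 1]
        if t < margin or t > 1.0 - margin:
            return True
    return False


# A case in the middle of every range, and one with sweep near its upper bound.
center = {name: 0.5 * (lo + hi) for name, (lo, hi) in BOUNDS.items()}
edge = dict(center, sweep_deg=39.0)

print(is_outer(center), is_outer(edge))
```

Cases flagged by a rule like this would be held out as the harder test regime, while the interior supplies training, validation, and in-distribution test data.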

That split choice is important. If a model performs well only on random held-out samples near familiar designs, it may be useful as a compression trick but not as a design tool. Engineering teams do not optimize by asking, “Can we predict designs almost identical to yesterday’s?” They ask, “Can we move toward better designs without stepping off a cliff?”

The benchmark table gives the first answer. On out-of-distribution cases, AB-UPT matches or beats the strongest transformer-style baselines on every reported field and is clearly better on vorticity, the high-variance field where the task becomes less forgiving.

OOD relative L2 error	PointNet	Transformer	Transolver	AB-UPT
Surface pressure $p_s$	0.120	0.009	0.008	0.008
Wall shear stress $\tau$	0.586	0.060	0.055	0.055
Volume pressure $p_v$	0.115	0.008	0.007	0.007
Velocity $u$	0.402	0.056	0.050	0.049
Vorticity $\omega$	0.543	0.182	0.156	0.126

The surface-field story is almost boring: transformer-like models perform similarly, and PointNet lags badly. The volume-field story is more interesting. AB-UPT’s advantage is clearest where the flow is harder to compress into a smooth prediction. Vorticity is not a decorative output; it is one of the places where 3D flow complexity shows up. If the model fails there, it may still predict a headline coefficient but lose the physics that makes the coefficient meaningful.
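The metric in the table is a relative L2 error. A minimal Python sketch of the usual definition follows, assuming the error is the norm of the difference divided by the norm of the reference field over all points and components. Whether the paper averages per sample or normalizes per channel is not stated in this article, so treat the exact aggregation as an assumption.

```python
import numpy as np


def relative_l2(pred, target):
    """Relative L2 error: ||pred - target|| / ||target|| over all entries."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.linalg.norm(pred - target) / np.linalg.norm(target)


# Toy field (made up): 1000 points, 3 components; prediction is 5% too large.
rng = np.random.default_rng(0)
target = rng.normal(size=(1000, 3))
pred = 1.05 * target

err = relative_l2(pred, target)
print(f"{err:.3f}")
```

A prediction that is uniformly 5% too large scores 0.05 on this metric, which gives a rough feel for the scale of the numbers in the table.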

The coefficient results are stronger still. For the OOD test set, AB-UPT achieves $R^2 = 1.000$ for lift coefficient $C_L$ and $R^2 = 0.998$ for drag coefficient $C_D$. Those numbers should not be read as universal aerospace truth. They are results under this dataset, solver setup, parameterization, and test construction. But inside that box, the signal is unusually clean: the surrogate is not merely producing pretty flow-field images; it is preserving the aerodynamic quantities needed for design screening.
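For reference, here is a short Python sketch of how an $R^2$ value can be computed from CFD coefficients and surrogate predictions. It uses the coefficient-of-determination form; the article does not say whether the paper uses this form or a squared correlation, so that choice is an assumption, and the numbers below are made up.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up lift coefficients: reference values and slightly noisy predictions.
cl_cfd = np.array([0.10, 0.20, 0.30, 0.40])
cl_pred = cl_cfd + np.array([0.01, -0.01, 0.01, -0.01])

r2 = r2_score(cl_cfd, cl_pred)
print(f"{r2:.3f}")
```

In this toy case, errors of 0.01 on values spread between 0.1 and 0.4 give an $R^2$ of about 0.992, so a reported 0.998 for drag implies very tight agreement relative to the spread of the test set.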

The Pareto front is where this becomes a design workflow

The paper’s most business-relevant move is not the standard benchmark. It is the parameter scan and design-optimization demonstration.

The authors create an additional 248 evaluation cases using a wing geometry not present in the original 29,727 cases. They then sweep angle of attack and sweep angle to test whether AB-UPT can recover drag–lift behavior beyond the ordinary training distribution. This is not a minor appendix curiosity. This is the closest the paper gets to the real design question: can the model help trace the frontier between “more lift” and “less drag” when engineers are exploring candidate designs?
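To make "tracing the frontier" concrete, the following Python sketch extracts the non-dominated set from a list of $(C_D, C_L)$ pairs, treating lower drag and higher lift as better. The function name and the sample points are illustrative and not taken from the paper.

```python
def pareto_front(points):
    """Return non-dominated (C_D, C_L) pairs, sorted by increasing drag.

    A design is kept only if no other design has drag at least as low
    and lift at least as high.
    """
    front = []
    best_cl = float("-inf")
    # Sort by drag ascending; on ties, put the higher-lift design first.
    for cd, cl in sorted(points, key=lambda p: (p[0], -p[1])):
        if cl > best_cl:  # strictly more lift than any lower-drag design
            front.append((cd, cl))
            best_cl = cl
    return front


# Made-up candidate designs as (C_D, C_L).
candidates = [
    (0.0160, 0.290),
    (0.0180, 0.330),
    (0.0170, 0.280),  # dominated: more drag and less lift than the first
    (0.0200, 0.320),  # dominated by (0.0180, 0.330)
    (0.0165, 0.304),
]

front = pareto_front(candidates)
print(front)
```

With a surrogate, the candidate list can be dense because each evaluation is cheap; the front then tells engineers which designs are worth sending to CFD.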

For the parameter scans, the paper reports $R^2 = 0.911$ for $C_L$ and $R^2 = 0.804$ for $C_D$ with AB-UPT. That is notably weaker than the OOD test-set coefficient result, which is exactly why the scan is useful. The parameter scan is a stress test, not a victory lap. It shows both promise and strain.

The qualitative finding is more useful than pretending the correlation numbers settle everything. AB-UPT reproduces the drag–lift Pareto front well for much of the scan. It shows only minor deviations for angles of attack up to $\alpha \approx 20^\circ$, far outside the training range of $[-10^\circ, 10^\circ]$. It also captures the tangent of the drag–lift Pareto front up to sweep angle $\Lambda = 50^\circ$, even though the training sweep-angle range ends at $40^\circ$. Beyond that, especially for larger sweep angles such as $\Lambda = 70^\circ$ and high angle-of-attack regimes, divergence grows.

That pattern is exactly what one should want from a serious engineering paper: not “the model generalizes magically,” but “here is where it holds, here is where it bends, and here is where it starts coughing.” Very inconsiderate of the hype cycle, but useful.

The optimization demo shows screening value, not certification value

The paper then uses AB-UPT for a rapid design exploration and optimization demo. The authors adapt the workflow so the model can take a CAD-style representation: they generate STL geometry differentiably from the parametric NACA0012 wing, use lower-resolution geometry inputs for the model, query higher-resolution surface fields, and compute lift and drag coefficients from the predicted fields.

This is an important mechanism. If every new candidate design still requires expensive meshing before the surrogate can operate, the workflow loses much of its advantage. The paper’s CAD-to-surrogate path makes the design loop closer to continuous exploration.
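The last step of that path, turning predicted surface fields into lift and drag coefficients, can be sketched in Python as a discrete surface integral. This is a generic textbook formulation, not the paper's code. The assumptions are: `normals` point from the wing surface into the fluid, `tau` is the shear traction exerted by the fluid on the wall, the freestream lies in the x-z plane with z pointing up, and `s_ref` is whatever reference area the analysis uses.

```python
import numpy as np


def force_coefficients(p, tau, normals, areas, rho, u_inf, alpha_deg,
                       s_ref, p_inf=0.0):
    """Integrate pressure and shear over surface cells; return (C_L, C_D).

    p: (N,) pressure, tau: (N, 3) shear traction on the wall,
    normals: (N, 3) unit normals into the fluid, areas: (N,) cell areas.
    """
    p = np.asarray(p, dtype=float)
    tau = np.asarray(tau, dtype=float)
    normals = np.asarray(normals, dtype=float)
    areas = np.asarray(areas, dtype=float)

    # Traction on the body: pressure pushes against the normal, shear adds.
    traction = -(p - p_inf)[:, None] * normals + tau
    force = (traction * areas[:, None]).sum(axis=0)

    # Drag is along the freestream, lift is perpendicular to it (ASSUMED axes).
    a = np.deg2rad(alpha_deg)
    drag_dir = np.array([np.cos(a), 0.0, np.sin(a)])
    lift_dir = np.array([-np.sin(a), 0.0, np.cos(a)])

    q = 0.5 * rho * u_inf**2  # dynamic pressure
    return force @ lift_dir / (q * s_ref), force @ drag_dir / (q * s_ref)


# Toy check: a flat plate of area 1 with a top and a bottom face.
cl, cd = force_coefficients(
    p=[1.0, 3.0],                          # top, bottom pressure
    tau=[[0.1, 0.0, 0.0], [0.1, 0.0, 0.0]],
    normals=[[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]],
    areas=[1.0, 1.0],
    rho=1.0, u_inf=2.0, alpha_deg=0.0, s_ref=1.0,
)
print(f"C_L={cl:.3f}, C_D={cd:.3f}")
```

In the toy plate, higher pressure underneath produces a net upward force and the shear on both faces produces drag, giving $C_L = 1.0$ and $C_D = 0.1$. Sign conventions for wall shear stress differ between solvers, so a real pipeline would need to check them against the CFD export.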

The authors test three optimization methods—gradient-based optimization with Adam, evolutionary search with CMA-ES, and Bayesian optimization—each for about two minutes on an H100 GPU, with the search bounded to the training range. The best lift-to-drag ratios are close:

Method	Steps	$C_D$	$C_L$	$C_L/C_D$
Gradient	900	0.0179	0.3281	18.36
Evolutionary	2700	0.0165	0.3040	18.43
Bayesian	100	0.0163	0.3002	18.43
Best in dataset	—	0.0160	0.2905	18.12

The direct reading of the table is that the surrogate finds configurations with slightly higher lift-to-drag ratios than the best design already present in the dataset. The cautious reading is just as important: this is not proof that an aircraft manufacturer can now skip CFD validation, wind-tunnel testing, certification, or physics. The optimization remains bounded to the training range, uses a simplified NACA0012-derived wing family, and relies on the fidelity of the CFD simulations used to train the model.

The business interpretation is therefore not “replace the solver.” It is “change when the solver is used.”
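What "bounded to the training range" means mechanically can be shown with a small Python sketch. It is deliberately simpler than anything in the paper: a random-perturbation search that accepts a candidate only if it scores better, with every candidate clipped back into the bounds. It is not Adam, CMA-ES, or Bayesian optimization, and `toy_lift_to_drag` is a hypothetical stand-in for a surrogate, not a model of any real wing.

```python
import random

# Two of the article's parameter ranges, used here as hard search bounds.
BOUNDS = {"sweep_deg": (0.0, 40.0), "aoa_deg": (-10.0, 10.0)}


def toy_lift_to_drag(x):
    """HYPOTHETICAL objective with a peak at sweep=25, aoa=4 (made up)."""
    return (18.0
            - 0.01 * (x["sweep_deg"] - 25.0) ** 2
            - 0.2 * (x["aoa_deg"] - 4.0) ** 2)


def clip(x):
    """Project a candidate back into the allowed box."""
    return {k: min(max(v, BOUNDS[k][0]), BOUNDS[k][1]) for k, v in x.items()}


def bounded_search(score, start, steps=500, step_frac=0.05, seed=0):
    """Accept-if-better random search; candidates never leave BOUNDS."""
    rng = random.Random(seed)
    best = clip(start)
    best_score = score(best)
    for _ in range(steps):
        cand = clip({
            k: v + rng.gauss(0.0, step_frac * (BOUNDS[k][1] - BOUNDS[k][0]))
            for k, v in best.items()
        })
        cand_score = score(cand)
        if cand_score > best_score:
            best, best_score = cand, cand_score
    return best, best_score


start = {"sweep_deg": 0.0, "aoa_deg": 0.0}
best, best_score = bounded_search(toy_lift_to_drag, start)
print(best, round(best_score, 2))
```

The clipping step is the important part: the search can only propose designs the surrogate was trained to interpolate, which is why the result is a screening candidate rather than a validated optimum.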

A practical workflow might look like this:

Stage	Old workflow	Surrogate-assisted workflow
Early exploration	Run CFD repeatedly across candidate designs	Use surrogate to screen many candidates
Pareto-front discovery	Expensive and sparse	Dense, fast, approximate
Diagnostic review	Inspect CFD fields case by case	Inspect predicted fields and flag anomalies
Final validation	CFD and domain review	Still CFD and domain review
Business impact	Slow iteration	Cheaper narrowing of the search space

This is where the paper becomes relevant beyond aerospace. Many industrial AI deployments fail because they promise to remove expert workflows. The more durable pattern is different: move expensive expert tools later in the funnel, after cheaper models have reduced the search space.

The artifact finding is a small clue with large operational meaning

One of the more interesting observations is almost hidden inside the results. The paper notes that some surface friction fields in the CFD data contain non-physical streaks, likely numerical artifacts. AB-UPT does not reproduce these high-frequency streaks; instead, it predicts smoother friction fields. The authors attribute this to neural networks’ bias toward low-frequency components and suggest that the model may help detect anomalies during data curation.

This is not the main evidence for aerodynamic optimization. It is better understood as an exploratory extension. Still, it is operationally interesting.

In many engineering workflows, simulation data is treated as ground truth because it is expensive, formal, and generated by serious-looking software. But expensive data can still contain numerical artifacts. A surrogate trained across many cases can sometimes behave like a consistency detector: it learns the common structure of valid simulations and struggles where the input case is contaminated or poorly converged.

The appendix makes this point more concretely. The authors describe using AB-UPT prediction error as an additional quality-control signal to flag failed or ambiguous CFD cases. Manual inspection confirmed issues such as spurious flow patterns, inadequate boundary-layer mesh resolution, or early numerical divergence.

That does not mean the neural model is “more correct than CFD.” Please do not put that on a slide unless you enjoy being gently destroyed by aerodynamicists. It means the surrogate may become part of the data-quality workflow: not the judge, but the suspicious analyst who notices when a supposedly valid simulation smells wrong.
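The quality-control idea, using surrogate prediction error as a flag for suspicious CFD cases, can be sketched in a few lines of Python. The thresholding rule here (median plus `k` times the median absolute deviation) and the toy error values are assumptions for illustration; the article does not state the criterion the authors used.

```python
import numpy as np


def flag_suspicious(errors, k=5.0):
    """Return indices of cases whose error is far above the typical level.

    Uses median + k * MAD as a robust threshold (an ASSUMED rule).
    """
    errors = np.asarray(errors, dtype=float)
    med = np.median(errors)
    mad = np.median(np.abs(errors - med))
    threshold = med + k * mad
    return np.flatnonzero(errors > threshold), threshold


# Made-up per-case relative errors; one case is wildly off.
case_errors = [0.008, 0.009, 0.007, 0.010, 0.008, 0.150, 0.009]

flagged, threshold = flag_suspicious(case_errors)
print(flagged.tolist(), round(threshold, 4))
```

Flagged cases are candidates for manual inspection, not automatic rejection: a large error can mean a bad simulation, but it can also mean a valid case the surrogate simply handles poorly.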

The appendix is mostly engineering support, not a second thesis

The appendices serve different purposes, and mixing them together would overstate the paper.

Paper component	Likely purpose	What it supports	What it does not prove
Main benchmark table	Main evidence	AB-UPT is strongest among tested surrogates, especially for vorticity	Universal superiority across all CFD regimes
OOD coefficient correlations	Main evidence	Lift and drag remain highly aligned with CFD on the OOD test set	Certified prediction outside the dataset assumptions
Parameter scans	Robustness / stress test	Pareto-front behavior remains useful under wider sweeps	Reliable extrapolation under all extreme geometries
Optimization demo	Exploratory workflow demonstration	Surrogate-based screening can find promising designs quickly	Production-ready aircraft optimization
Conditioning ablation	Implementation diagnostic	Input representation and conditioning choices affect model behavior	A general theory of surrogate architecture
Failed-case detection	Data-quality extension	Prediction error may help identify problematic CFD samples	Neural models can replace convergence analysis

This distinction matters because a paper like this is easy to oversell. The exciting part is not that every result is equally conclusive. It is that the results form a coherent workflow: build a 3D transonic dataset, train a strong surrogate, verify field and coefficient prediction, stress-test Pareto behavior, and then show how fast design exploration might work.

That workflow is the contribution. Not a single magic number.

The boundary is narrow, and that is why the paper is credible

The most likely reader misconception is straightforward: if AB-UPT works this well, neural surrogates are ready to replace aerospace CFD.

No. The paper does not show that.

It uses steady-state RANS simulations, not full unsteady high-fidelity physics. The wing geometry family is relatively simple and derived from NACA0012, not a full commercial aircraft configuration with all the delightful complications engineers get paid to suffer through. The solver setup itself introduces uncertainty: mesh quality, turbulence-model assumptions, convergence criteria, and solver fidelity all matter, especially in transonic regimes where shocks and turbulence can turn small numerical choices into large practical differences.

The authors are explicit about these limits. OpenFOAM is attractive because it is accessible and automatable, but accessibility is not the same thing as final authority. Steady-state RANS reduces data-generation cost, but it does not capture inherently unsteady phenomena. The benchmark is valuable because it is public, 3D, transonic, and operationally structured; it is not a complete substitute for industrial validation.

So the business conclusion should be framed carefully:

Directly shown by the paper	Cognaptus interpretation	Still uncertain
Emmi-Wing provides ~30K 3D sub-/transonic CFD cases with geometry and inflow variation	Aerospace ML finally gets a more serious public benchmark	How well models transfer to richer aircraft geometries
AB-UPT performs strongly on fields and coefficients, especially vorticity	Neural surrogates can support early diagnostic screening	Robustness under unsteady, higher-fidelity, or different solver regimes
Parameter scans preserve useful Pareto-front behavior within a wider but bounded stress test	Surrogates can accelerate design-space exploration	Reliability in true extrapolation and certification workflows
Optimization demo finds slightly better lift-to-drag candidates than those in the dataset	Surrogate loops may shorten early-stage iteration	Whether the candidates remain optimal after independent CFD and experimental validation

In plain business language: this paper supports cheaper search, not cheaper truth. That is still a serious result. In fact, it is often where the best ROI lives.

The real shift is from simulation-as-verdict to simulation-as-validation

The quiet change signaled by Emmi-Wing is architectural. For years, CFD has been treated as the place where candidate designs go to receive judgment. The surrogate-assisted workflow suggests another pattern: use neural models to explore the design space densely, identify promising regions, detect suspicious simulations, and then send the finalists back to CFD.

That is a different allocation of expensive computation. It does not weaken engineering rigor; it changes where rigor is applied.

The paper’s contribution is therefore bigger than “AB-UPT did well on a benchmark.” It shows a plausible design loop for a class of engineering problems where the number of possible candidates is large, the simulator is expensive, and the final answer must still be validated by serious physics.

Aerospace is just a particularly unforgiving test case. If neural surrogates can be useful in transonic wing flows near the speed of sound, where shocks, vortices, and 3D effects make the problem genuinely unpleasant, then the broader lesson for industrial AI is clear: the next wave of useful automation may not come from replacing expert tools, but from making expert tools less lonely, less overused, and less burdened with every mediocre candidate design.

The wing still needs physics. It may just need fewer full CFD sermons before engineers know which designs deserve one.

Cognaptus: Automate the Present, Incubate the Future.

¹ Fabian Paischer, Leo Cotteleer, Yann Dreze, Richard Kurle, Dylan Rubini, Maurits Bleeker, Tobias Kronlachner, and Johannes Brandstetter, “Going with the Speed of Sound: Pushing Neural Surrogates into Highly-turbulent Transonic Regimes,” arXiv:2511.21474, 2025, https://arxiv.org/abs/2511.21474. The dataset is released at https://huggingface.co/datasets/EmmiAI/Emmi-Wing.
