Face avatars fail in a very human way: they look fine until someone actually uses their face.

A slight smile is easy. A frontal view is easy. A polite corporate-video expression, the kind that says “I am excited to join this quarterly alignment session,” is also easy enough. The real test begins when the mouth opens wide, the eyebrows compress, the head rotates, the teeth appear, the skin folds, and the avatar must still look like a person rather than a damp sticker stretched over a mesh.

That is the problem behind TexAvatars, a paper by Jaeseong Lee, Junyeong Ahn, Taewoong Kang, and Jaegul Choo on photorealistic Gaussian head avatars [1]. The paper is not merely asking whether 3D Gaussian Splatting can render a nicer face. It asks a more operational question: how should an avatar representation deform when expression moves beyond the tidy examples it saw during training?

The tempting answer is “more neural capacity.” Give the model a bigger network, more offsets, more appearance parameters, more data, more everything. This is the familiar AI reflex: when something breaks, add scale and hope the geometry apologizes.

TexAvatars makes a quieter argument. The failure is not only a data problem. It is also a representation problem. Existing Gaussian head-avatar methods tend to fall between two imperfect choices. Analytic mesh rigging is stable and interpretable, but too rigid for fine facial details. Fully learned or heavily neural texel-space deformation is flexible, but can drift when expressions become extreme. TexAvatars tries to keep the useful part of both: neural prediction for local expression-dependent detail, analytic mesh-aware Jacobians for deformation control, and a UV-space design that makes neighboring texels behave like they know they belong to the same face.

That last phrase sounds obvious. It is not. In avatar systems, even “neighboring” can be a lie.

The real enemy is not blur; it is coordinate confusion

A head avatar has to solve two jobs at once.

First, it must represent appearance: skin color, wrinkles, facial hair, teeth, mouth cavities, opacity changes, and view-dependent effects. Second, it must move those represented elements under expression and pose. A good avatar does not merely paste a sharp texture onto a moving head. It must decide how the underlying primitives stretch, rotate, compress, and reveal or hide detail as the face changes.

In 3D Gaussian Splatting, the scene is represented as Gaussian primitives. Each primitive has attributes such as position, rotation, scale, opacity, and color. For head avatars, those Gaussians need to be driven by a facial model, often a 3D Morphable Model such as FLAME. Existing systems commonly choose one of two broad approaches.

| Approach | What it gives you | What breaks |
| --- | --- | --- |
| Mesh-driven analytic rigging | Stable deformation, interpretable control, good extrapolation under pose and expression | Fine expression details are limited; fixed triangle binding can create discontinuities or blobby deformation |
| Texel/neural regression | Smooth UV-space prediction, CNN-friendly structure, expressive local detail | Learned offsets can become unstable under extreme deformation; heuristic regularization can suppress useful detail |

The misconception is that these failures are mainly about insufficient data or model size. The paper’s sharper diagnosis is that the model needs the right division of labor. Neural networks should predict local appearance and local Gaussian attributes. Geometry should still control how those attributes enter 3D space.

The reason is simple enough to be annoying: a face is not just an image. It is a deforming surface with local frames, triangle boundaries, cavities, and anisotropic stretch. If a system predicts global offsets directly, the size of the regression target, and with it the training signal, grows with the deformation. When the face moves far from the canonical state, the model is asked to learn both semantic expression detail and geometric displacement at the same time. That is a generous invitation to instability.

TexAvatars instead predicts local Gaussian attributes in UV space, then lifts them into 3D using mesh-aware Jacobians. In other words, the network describes what should happen locally; the rigging tells it how that local description should deform geometrically.
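The lifting step can be sketched with the standard rule for pushing a Gaussian through an affine map: the mean goes through the Jacobian and the covariance is sandwiched by it. This is a minimal illustration of that rule, not the paper's implementation; the function names, the `origin` argument, and the example values are assumptions.

```python
from typing import List, Tuple

Mat = List[List[float]]
Vec = List[float]

def matmul(A: Mat, B: Mat) -> Mat:
    """Plain matrix product A @ B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A: Mat) -> Mat:
    return [list(row) for row in zip(*A)]

def matvec(A: Mat, v: Vec) -> Vec:
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def lift_gaussian(J: Mat, origin: Vec, local_mean: Vec, local_cov: Mat) -> Tuple[Vec, Mat]:
    """Map a local Gaussian into 3D with an affine map x -> J x + origin.

    mean_3d = origin + J @ local_mean
    cov_3d  = J @ local_cov @ J^T
    """
    mean = [o + d for o, d in zip(origin, matvec(J, local_mean))]
    cov = matmul(matmul(J, local_cov), transpose(J))
    return mean, cov

# Hypothetical example: the surface stretches by 2x along x at this texel.
J_stretch = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
iso_cov = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
mean_3d, cov_3d = lift_gaussian(J_stretch, [1.0, 0.0, 0.0], [0.1, 0.0, 0.0], iso_cov)
print(mean_3d, cov_3d[0][0], cov_3d[1][1])
```

The point of the sketch is the division of labor: the network only has to supply `local_mean` and `local_cov`, while the stretch along x comes entirely from the mesh-derived matrix.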

A compact way to read the mechanism is this:

1. Expression / pose / image-animation signal
2. CNN decoders predict local Gaussian attributes in UV space
3. Mesh-derived Jacobians are remapped into UV space
4. Local attributes are transformed into coherent 3D Gaussians
5. 3D Gaussian Splatting renders the animated head

The important design choice is not merely “use UV maps.” Many systems already use UV space. The important choice is when geometry enters the pipeline. TexAvatars does not let UV-space neural regression float freely and then hope it behaves. It injects mesh-aware deformation into the texel representation before interpolation creates trouble.

TexAvatars gives neural texels a geometric leash

The paper’s first contribution is local flexible Gaussians in texel space. Instead of binding each Gaussian rigidly to a specific mesh triangle, TexAvatars predicts expression-conditioned local attributes using CNN decoders. The geometry decoder predicts local position, rotation, scale, and opacity. The appearance decoder predicts RGB color and is conditioned on view information as well.

This gives the model flexibility. Wrinkles, folds, mouth details, and opacity changes do not have to be hard-coded into a fixed mesh deformation. They can vary with expression.
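To make the decoder outputs concrete, here is a minimal sketch of the attributes one texel might carry and the covariance they imply. The factorization of the covariance into rotation and scale is the standard 3D Gaussian Splatting parameterization; the class name, field names, and the (w, x, y, z) quaternion convention are assumptions for illustration rather than the paper's code.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z), assumed unit length

def quat_to_matrix(q: Quat) -> List[List[float]]:
    """Convert a unit quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

@dataclass
class LocalGaussian:
    position: Vec3   # local offset predicted by the geometry decoder
    rotation: Quat   # local orientation
    scale: Vec3      # per-axis extent
    opacity: float
    color: Vec3      # RGB from the appearance decoder

    def covariance(self) -> List[List[float]]:
        """Sigma = R S S^T R^T with S = diag(scale)."""
        R = quat_to_matrix(self.rotation)
        s2 = [s * s for s in self.scale]
        return [[sum(R[i][k] * s2[k] * R[j][k] for k in range(3))
                 for j in range(3)] for i in range(3)]

# Hypothetical texel: elongated along its local y axis, then turned 90 degrees about z.
half = math.sqrt(0.5)
g = LocalGaussian((0.0, 0.0, 0.0), (half, 0.0, 0.0, half), (1.0, 2.0, 3.0), 0.9, (0.8, 0.6, 0.5))
cov = g.covariance()
print(cov[0][0], cov[1][1], cov[2][2])
```

Because everything here is expressed in a local frame, the same attributes can be reused as the head moves; only the transform applied afterwards changes.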

But the flexibility is deliberately bounded. The predicted attributes live in local coordinates, not in an unconstrained global deformation space. This matters because local coordinates keep the learning problem better conditioned. The paper’s analysis argues that global offset prediction can make gradients scale with global displacement, while local prediction transformed through near-isometric mesh Jacobians keeps the error signal more controlled.

That is the technical core of the paper’s business relevance. A production avatar system does not merely need a model that looks good on validation frames. It needs one that does not panic when the user does something outside the polite training distribution. Stability under expression extrapolation is not academic decoration. It is the difference between a digital double that survives real usage and one that becomes a horror-comedy asset at the first wide-open mouth.
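The conditioning argument can be illustrated with a toy calculation. This is not the paper's derivation; it is a sketch under simple assumptions: one Gaussian center, a rigid motion made of a rotation about z plus a translation, and a squared-error loss evaluated at a zero prediction. All names and values are hypothetical.

```python
import math
from typing import List, Tuple

Vec = List[float]

def rotate_z(v: Vec, angle: float) -> Vec:
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1], v[2]]

def norm(v: Vec) -> float:
    return math.sqrt(sum(x * x for x in v))

def regression_targets(p: Vec, o: Vec, angle: float, t: Vec) -> Tuple[Vec, Vec]:
    """Return (global_target, local_target) for one Gaussian center.

    The deformed position is R(angle) @ (p + o) + t, where p is the canonical
    point and o is the small expression-dependent detail offset.
    """
    moved = rotate_z([pi + oi for pi, oi in zip(p, o)], angle)
    deformed = [m + ti for m, ti in zip(moved, t)]
    global_target = [d - pi for d, pi in zip(deformed, p)]  # must absorb the motion
    local_target = list(o)                                  # motion handled by the rig
    return global_target, local_target

def initial_grad_norm(target: Vec) -> float:
    """Gradient norm of ||pred - target||^2 with respect to pred, at pred = 0."""
    return 2.0 * norm(target)

p = [0.05, 0.0, 0.0]
o = [0.001, 0.0, 0.0]
small = regression_targets(p, o, 0.0, [0.0, 0.0, 0.0])
large = regression_targets(p, o, 0.5, [0.1, 0.0, 0.0])
print(initial_grad_norm(small[0]), initial_grad_norm(large[0]))
print(initial_grad_norm(small[1]), initial_grad_norm(large[1]))
```

In this toy setup the global target balloons with the head motion while the local target stays the same size, which is the intuition behind letting geometry carry the displacement.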

The second contribution is the piece that makes the hybrid work: the Quasi-Phong Jacobian Field.

The Quasi-Phong Jacobian Field makes triangle boundaries less stupid

UV maps are convenient because they unwrap a 3D surface into a 2D domain where CNNs can operate. The catch is that UV adjacency does not always mean clean geometric adjacency. A texel may sit near another texel in UV space while their corresponding local frames are defined by different mesh triangles. If the system interpolates local attributes directly across triangle boundaries, it can blend values that do not share the same semantic coordinate system.

This is where avatar engineering becomes less glamorous and more useful.

A naive TexAvatars-like method would predict local attributes in UV space, sample them with bilinear interpolation, and then transform them into 3D using the Jacobian of the associated face. That still leaves a discontinuity problem: each triangle has its own local frame, so crossing a triangle boundary can create inconsistent deformation.

TexAvatars reverses the order. It remaps triangle-wise Jacobians into the UV domain and uses bilinear sampling to form a smoother Jacobian field. Then local attributes are transformed through this field. The analogy to Phong shading is apt: instead of accepting piecewise-flat behavior per triangle, the method blends deformation information across nearby surface regions.

The paper calls this a Quasi-Phong Jacobian Field. The name is slightly grand, but the idea is practical: smooth the deformation field in texel space so the neural attributes are interpolated in a geometrically meaningful domain.

This is the paper’s best mechanism-first lesson. The model is not winning because it uses a fashionable renderer. It wins because it carefully decides which quantities are safe to interpolate.
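The difference between per-triangle and smoothed Jacobians can be sketched on a tiny "Jacobian texture". This is a minimal illustration that assumes plain linear blending of matrix entries and uses 2x2 matrices for brevity; the function names are hypothetical, and whether the paper post-processes the blended matrices is not something this sketch claims to know. `sample_nearest` stands in for using the Jacobian of whichever triangle owns a texel.

```python
import math
from typing import List

Mat = List[List[float]]
Texture = List[List[Mat]]  # tex[row][col] holds one Jacobian matrix

def lerp_mat(A: Mat, B: Mat, t: float) -> Mat:
    return [[(1 - t) * a + t * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sample_bilinear(tex: Texture, u: float, v: float) -> Mat:
    """Bilinear lookup; texel centers sit at integer (u, v) coordinates."""
    h, w = len(tex), len(tex[0])
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    c0, r0 = int(u), int(v)
    c1, r1 = min(c0 + 1, w - 1), min(r0 + 1, h - 1)
    fu, fv = u - c0, v - r0
    top = lerp_mat(tex[r0][c0], tex[r0][c1], fu)
    bottom = lerp_mat(tex[r1][c0], tex[r1][c1], fu)
    return lerp_mat(top, bottom, fv)

def sample_nearest(tex: Texture, u: float, v: float) -> Mat:
    """Piecewise-constant lookup, like one fixed Jacobian per triangle."""
    h, w = len(tex), len(tex[0])
    c = min(max(math.floor(u + 0.5), 0), w - 1)
    r = min(max(math.floor(v + 0.5), 0), h - 1)
    return tex[r][c]

# Two neighboring texels owned by triangles with different stretch along x.
J_a = [[1.0, 0.0], [0.0, 1.0]]
J_b = [[2.0, 0.0], [0.0, 1.0]]
tex = [[J_a, J_b]]

jump_nearest = sample_nearest(tex, 0.51, 0.0)[0][0] - sample_nearest(tex, 0.49, 0.0)[0][0]
jump_bilinear = sample_bilinear(tex, 0.51, 0.0)[0][0] - sample_bilinear(tex, 0.49, 0.0)[0][0]
print(jump_nearest, jump_bilinear)
```

Crossing the boundary flips the piecewise-constant Jacobian from one triangle's stretch to the other's, while the bilinear field changes only slightly over the same step. That is the Phong-shading analogy in miniature.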

| Quantity | Naive risk | TexAvatars’ correction |
| --- | --- | --- |
| Local Gaussian position and covariance | UV interpolation can mix attributes defined in incompatible triangle frames | Transform using remapped Jacobians before global texel sampling |
| Mesh deformation | Triangle-wise fields can be discontinuous | Smooth Jacobian field in UV space |
| Fine expression appearance | FLAME expression parameters may miss wrinkles or subtle muscle activation | Add an image-animation expression embedding |
| High-frequency detail | Pixel losses can reward smooth but blurry reconstructions | Add delayed VGG perceptual loss after training stabilizes |

This is not a generic “hybrid is better” story. It is more specific: hybrid systems work only when the boundary between analytic geometry and neural prediction is drawn in the right place.

The extra expression code is there because FLAME does not know every wrinkle

The paper uses FLAME expression and pose parameters, but it also adds an expression-related embedding from an image-animation model. This is not decorative conditioning. It addresses a real ambiguity.

Two frames can share similar FLAME expression parameters but differ in visible microstructure: a wrinkle appears, the glabellar region tightens, the cheek fold changes, or the mouth cavity reveals detail differently. Standard 3DMM expression spaces are useful, but they do not encode every surface-level consequence of facial muscle activation.

TexAvatars therefore conditions its decoders on an additional image-animation expression code. In the ablation, removing this component produces only small changes in LPIPS but a clearer qualitative loss in wrinkles and subtle skin deformation. This is exactly the kind of result that metric-only readers often mishandle. A tiny metric movement does not mean the component is useless. It may mean the metric is not sensitive to the perceptual feature that matters for believability.

The same logic applies to the VGG perceptual loss. The paper reports that VGG loss is activated only after 300K iterations out of 600K, because applying it from the beginning destabilizes PSNR and SSIM. That scheduling detail is easy to skip, but it tells us something important: high-frequency realism is useful only after the coarse geometry and reconstruction are stable. Start optimizing for texture too early, and the model can chase the wrong signal. Very human, really.
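The delayed schedule is easy to express as a step-gated loss weight. A minimal sketch follows: the 300K start comes from the article's description, while the weight value, the function names, and the choice to switch on at exactly step 300,000 are assumptions.

```python
def perceptual_weight(step: int, start_step: int = 300_000, weight: float = 0.1) -> float:
    """Return 0 before start_step, then a constant weight.

    weight=0.1 is a placeholder value, not a number from the paper.
    """
    return weight if step >= start_step else 0.0

def total_loss(pixel_loss: float, perceptual_loss: float, step: int) -> float:
    """Pixel term is always on; the perceptual term joins once training has settled."""
    return pixel_loss + perceptual_weight(step) * perceptual_loss

for step in (0, 299_999, 300_000, 600_000):
    print(step, total_loss(pixel_loss=1.0, perceptual_loss=2.0, step=step))
```

A hard switch is the simplest reading of "activated only after"; a ramp would be an equally plausible variant, but nothing in the article says which one the authors use.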

The experiments test deployment pressure, not just pretty frames

The paper evaluates TexAvatars against five baselines: GaussianAvatars, SurFhead, RGBAvatar, GEM, and RGCA. The useful part is not just the leaderboard. It is the structure of the tests.

| Test or evidence block | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Novel Expression held-out split | Main evidence for generalizing to expressions withheld from training | TexAvatars remains competitive or best across LPIPS and PSNR, though GaussianAvatars has the top SSIM | It does not prove robustness to all real-world capture conditions |
| FREE extreme-expression split | Stress test for unconstrained expression and head motion | The hybrid representation performs better where extrapolation is hard | It is still within the NeRSemble-style controlled data ecosystem |
| Novel View split | Comparison with prior work under camera viewpoint changes | TexAvatars improves SSIM and PSNR and matches best LPIPS | It does not solve lighting, specular control, or dynamic hair |
| Component ablations | Mechanism audit | Global UV sampling, VGG loss, Jacobian deformation, and image-animation embedding each support a different part of the mechanism | Single-run ablations do not fully quantify variance across training seeds |
| Supplementary qualitative tests | Robustness and visual diagnosis | The method better preserves wrinkles, mouth detail, facial hair boundaries, and cross-reenactment detail | Qualitative evidence remains dependent on selected visual examples |

This is why a mechanism-first reading is better than a metric-first summary. The results make most sense after the architecture is understood. TexAvatars is not merely “better.” It is better in the places where the mechanism predicts it should be better: extreme expressions, triangle-boundary deformation, anisotropic stretching, and fine expression-dependent detail.

The numbers are strong, but not a universal landslide

The headline result is that TexAvatars achieves the best or near-best performance across the three evaluation settings. But the details matter.

In the paper’s Table 1, lower LPIPS is better, while higher SSIM and PSNR are better.

| Evaluation setting | TexAvatars result | Closest or notable baseline | Interpretation |
| --- | --- | --- | --- |
| Novel Expression held-out, LPIPS | 0.048 ± 0.013 | RGCA: 0.050 ± 0.013 | Small numerical edge over a strong texel-neural baseline |
| Novel Expression held-out, SSIM | 0.894 ± 0.030 | GaussianAvatars: 0.897 ± 0.028 | Analytic rigging remains very stable on structural similarity |
| Novel Expression held-out, PSNR | 25.61 ± 2.10 | RGCA: 25.55 ± 2.07 | Slight improvement, not a dramatic separation |
| FREE, LPIPS | 0.077 ± 0.017 | RGCA: 0.086 ± 0.025 | Stronger advantage in the harder extreme-expression split |
| FREE, SSIM | 0.861 ± 0.033 | GEM: 0.863 ± 0.034 | TexAvatars is close but not top on this metric |
| FREE, PSNR | 22.84 ± 2.05 | RGCA: 22.68 ± 2.59 | Best reconstruction fidelity under unconstrained motion |
| Novel View, LPIPS | 0.030 ± 0.005 | RGCA: 0.030 ± 0.006 | Essentially matched at the top |
| Novel View, SSIM | 0.947 ± 0.013 | RGCA: 0.943 ± 0.013 | Clear but modest improvement |
| Novel View, PSNR | 35.15 ± 1.33 | RGCA: 34.24 ± 1.23 | Meaningful gain in view synthesis |

This table prevents two lazy readings.

The first lazy reading is that TexAvatars simply dominates everything everywhere. It does not. GaussianAvatars still scores the highest SSIM on the held-out novel-expression split, and GEM edges TexAvatars on FREE SSIM. Different metrics reward different properties. SSIM may favor structural smoothness. LPIPS may better reflect perceptual realism. PSNR can favor pixel-level fidelity but punish high-frequency perceptual choices. No single number carries the whole argument.

The second lazy reading is that the gains are too small to matter. That is also wrong. Avatar realism often breaks locally: mouth interiors, wrinkles, hair boundaries, stretched nasal lines, teeth separation. A method can show modest aggregate metric improvement while fixing precisely the artifacts that users notice first. The paper’s qualitative comparisons are important because they show the failure modes behind the numbers: blobby deformation, lost nasal lines, missing wrinkles, weak oral cavity structure, and poorer cross-reenactment detail.
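One way to judge whether a PSNR gap is "small" is to convert it into a ratio of mean squared errors. The sketch below uses the standard PSNR definition; the function names are hypothetical, and the conversion is only a rough reading here because the table reports PSNR averaged over frames, which is not the same as the PSNR of an averaged MSE.

```python
import math

def psnr(mse: float, peak: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error and peak value."""
    return 10.0 * math.log10(peak * peak / mse)

def mse_ratio_from_psnr_gap(gap_db: float) -> float:
    """MSE_a / MSE_b when PSNR_a - PSNR_b = gap_db and both share the same peak."""
    return 10.0 ** (-gap_db / 10.0)

# PSNR pairs (TexAvatars, RGCA) copied from the results table above.
gaps = {
    "novel_expression": 25.61 - 25.55,
    "free": 22.84 - 22.68,
    "novel_view": 35.15 - 34.24,
}
for split, gap in gaps.items():
    print(split, round(gap, 2), round(mse_ratio_from_psnr_gap(gap), 3))
```

Under this reading, the 0.91 dB Novel View gap corresponds to roughly 19% lower squared error, while the 0.06 dB held-out gap is between 1% and 2%. The aggregate numbers really are modest on some splits, which is why the local artifacts matter to the argument.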

The FREE split is the most business-relevant part of the evaluation. It contains longer, less constrained sequences with arbitrary expressions and head motions. In polite benchmark conditions, many methods can survive. Under expressive use, representation assumptions get exposed. TexAvatars’ stronger FREE performance is therefore the result to watch.

The ablations read like a mechanism audit

The ablation table is especially useful because each removed component corresponds to a specific claim in the method.

| FREE ablation | LPIPS ↓ | SSIM ↑ | PSNR ↑ | What the test is really asking |
| --- | --- | --- | --- | --- |
| Full TexAvatars | 0.077 ± 0.017 | 0.861 ± 0.033 | 22.84 ± 2.05 | Does the full hybrid design work under extreme expression? |
| Without VGG loss | 0.096 ± 0.021 | 0.859 ± 0.034 | 22.66 ± 2.24 | Does perceptual supervision help preserve high-frequency detail? |
| Without Global UV | 0.096 ± 0.028 | 0.855 ± 0.035 | 22.46 ± 2.39 | Does geometry-aware global UV sampling prevent semantic misalignment? |
| Without Jacobian | 0.080 ± 0.018 | 0.859 ± 0.035 | 22.77 ± 2.13 | Does full Jacobian deformation improve anisotropic stretch and reduce blob artifacts? |
| Without image-animation embedding | 0.078 ± 0.017 | 0.853 ± 0.034 | 22.69 ± 1.97 | Does the auxiliary expression signal capture details missing from FLAME? |

The biggest LPIPS drops come from removing Global UV sampling and VGG loss: in both cases LPIPS worsens from 0.077 to 0.096. Removing Global UV sampling also costs the most PSNR (22.84 to 22.46). That supports two separate claims. Global UV sampling is central to coherent deformation; VGG loss helps recover perceptual detail that pixel-level objectives may smooth away.
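The ablation deltas can be recomputed directly from the table. The sketch below copies the mean values from the FREE ablation rows; the dictionary keys and function names are made up for illustration, and the ± spreads are deliberately left out, so the ranking is indicative rather than a significance test.

```python
from typing import Dict, List, Tuple

Metrics = Tuple[float, float, float]  # (LPIPS, SSIM, PSNR) means from the table

FREE_ABLATIONS: Dict[str, Metrics] = {
    "full": (0.077, 0.861, 22.84),
    "no_vgg_loss": (0.096, 0.859, 22.66),
    "no_global_uv": (0.096, 0.855, 22.46),
    "no_jacobian": (0.080, 0.859, 22.77),
    "no_image_animation_embedding": (0.078, 0.853, 22.69),
}

def deltas_vs_full(table: Dict[str, Metrics], baseline: str = "full") -> Dict[str, Metrics]:
    """Ablated minus full, rounded to 3 decimals to hide float noise."""
    base = table[baseline]
    return {
        name: tuple(round(v - b, 3) for v, b in zip(vals, base))
        for name, vals in table.items()
        if name != baseline
    }

def rank_by_lpips_increase(table: Dict[str, Metrics]) -> List[str]:
    """Ablations ordered from largest to smallest LPIPS increase."""
    d = deltas_vs_full(table)
    return sorted(d, key=lambda name: d[name][0], reverse=True)

deltas = deltas_vs_full(FREE_ABLATIONS)
for name in rank_by_lpips_increase(FREE_ABLATIONS):
    print(name, deltas[name])
```

Sorting by a different column changes the story: by SSIM, the image-animation embedding shows the largest drop in this table. Most of these deltas are of similar size to, or smaller than, the reported ± spreads, which is one more reason to read them alongside the qualitative comparisons.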

The Jacobian ablation is subtler. Removing the Jacobian only modestly worsens aggregate metrics, but the qualitative results show blob-like artifacts and weaker handling of directional stretch. This is a useful reminder for technical buyers: an ablation can matter operationally even when its average metric penalty looks small. If the artifact appears on the mouth, nose, or eye region during a sales demo, no one will care that the LPIPS movement was modest. The face has veto power.

The image-animation embedding also looks numerically small in LPIPS, but the paper shows that removing it can erase wrinkles or subtle skin deformation. This supports a narrower interpretation: the embedding is not the main geometric stabilizer. It is a detail disambiguator.

So the ablations do not form a second thesis. They reinforce the first one. TexAvatars needs stable geometry, coherent UV sampling, and expression-conditioned appearance detail because each solves a different failure mode.

The business value is not “better avatars”; it is fewer catastrophic expressions

For companies thinking about telepresence, XR meetings, virtual influencers, digital doubles, gaming characters, or VFX previsualization, the practical question is not whether TexAvatars is a deployable product tomorrow. It is not presented as that. The practical question is what design principle should influence avatar pipelines.

The answer is: do not let neural flexibility replace geometric discipline.

| Paper result | Directly shown | Cognaptus business inference | Boundary |
| --- | --- | --- | --- |
| Hybrid texel-3D representation improves extreme-expression rendering | Better or near-best metrics and qualitative results across held-out, FREE, and novel-view tests | Avatar systems should treat expression extrapolation as a representation problem, not only a dataset problem | Evidence is from controlled multi-view capture, not casual phone capture |
| Quasi-Phong Jacobian Field smooths deformation across texels | Ablations and qualitative comparisons support reduced boundary and stretch artifacts | Geometry-aware interpolation can reduce visible failure modes in high-stakes facial regions | It does not solve every appearance effect, especially specular highlights |
| Image-animation embedding helps fine details | Removing it can reduce wrinkles and subtle skin effects | Production systems may need auxiliary expression signals beyond standard 3DMM parameters | Added signals may entangle with non-face dynamics such as hair |
| Real-time rendering is reported | 50.85 FPS under the paper’s benchmark protocol | The representation is potentially compatible with interactive avatar use | Training still takes about 16 hours per identity in the reported setup |
| Memory and hardware requirements are moderate | Training uses 6–10 GB GPU memory; authors cite feasibility on a 12 GB GPU | The method is not limited to massive research clusters | Capture and data curation remain nontrivial |

This matters because many business discussions about avatars over-index on output quality. Output quality is visible, but failure tolerance is what determines whether a system can be used repeatedly. A digital human that works only under controlled expressions is a demo asset. A digital human that survives unusual expressions, head rotations, and mouth shapes starts to become infrastructure.

That is also where ROI thinking becomes less glamorous. The value is not merely prettier renderings. It is lower cleanup cost, fewer unusable takes, less manual correction, more reliable cross-driving, and fewer uncanny moments during interactive use. The best avatar system is not the one that looks stunning in one selected frame. It is the one that does not betray you during frame 8,217.
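The real-time figure from the table above can be turned into a per-frame budget with simple arithmetic. The 50.85 FPS value is the reported one; the display targets in the sketch are assumptions chosen for illustration, and the calculation ignores everything outside rendering, such as tracking and network latency.

```python
from typing import Dict, Iterable

def frame_time_ms(fps: float) -> float:
    """Average time available per frame, in milliseconds."""
    return 1000.0 / fps

def meets_targets(fps: float, targets_hz: Iterable[float]) -> Dict[float, bool]:
    """Check a measured frame rate against a list of refresh-rate targets."""
    return {hz: fps >= hz for hz in targets_hz}

REPORTED_FPS = 50.85               # from the paper's benchmark protocol
ASSUMED_TARGETS_HZ = (30.0, 60.0)  # hypothetical product requirements

print(round(frame_time_ms(REPORTED_FPS), 2))
print(meets_targets(REPORTED_FPS, ASSUMED_TARGETS_HZ))
```

At the reported rate each frame takes about 19.7 ms, which clears a 30 Hz target but not a 60 Hz one. Whether that counts as "real-time" depends on the product, not the paper.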

The practical boundary: capture rigs still matter

TexAvatars is strong, but its deployment boundary is clear.

The method assumes tracked meshes and controlled multi-view data. The paper follows the NeRSemble-style protocol, using 15 cameras for training and reserving one near-frontal camera for novel-view testing. It also depends on broad expression coverage. This makes the system much less casual than few-shot or monocular talking-head generation.

That boundary is not a weakness in the research. It is part of the product category. TexAvatars is closer to high-fidelity personalized avatar reconstruction than to “upload one selfie and become a metaverse executive,” which remains a sentence that should make everyone nervous.

The paper’s limitations are also concrete.

First, dynamic hair is not modeled well. The authors note that hair motion can entangle with expression embeddings, producing jitter during cross-reenactment. This is not a minor issue for consumer avatars, where hair often occupies prime visual territory and has the audacity to move.

Second, tongue articulation remains limited. The supplementary material describes a custom UV layout that includes teeth and a tongue mesh to improve oral structure separation, but the broader pipeline still depends on tracked 3DMM geometry, and standard FLAME does not include tongue articulation. So the model can improve mouth cavity realism, but it is not a full speech-articulation system.

Third, specular appearance remains difficult. Eye glints, tooth gloss, and facial shine are only partially handled by view-dependent appearance modeling. For premium digital doubles, relighting and explicit specular control will matter.

Fourth, the number of Gaussians is fixed for training stability. That makes the system less adaptive in high-frequency regions such as pores or facial hair. Adaptive allocation could help, but would introduce another stability problem. There is always another bill to pay.

Finally, identity misuse deserves attention. The authors note that controlled multi-camera capture and broad expression data reduce the casual misuse risk compared with few-shot methods, but high-fidelity likeness reconstruction still raises obvious concerns around consent, impersonation, watermarking, and verification. For business deployment, rights management is not a compliance afterthought. It is part of the product architecture.

The strategic lesson: hybrid AI needs the right interface between learning and structure

TexAvatars is valuable beyond head avatars because it illustrates a broader AI engineering pattern.

When systems must behave under distribution shift, the question is not simply whether to use neural models or analytic structure. The question is where the interface belongs.

Put too much burden on analytic rigging, and the avatar becomes stable but dull. Put too much burden on neural offsets, and the avatar becomes expressive but unreliable. TexAvatars draws the interface in a more useful place: neural decoders predict local expression-dependent attributes; mesh-aware Jacobians control deformation; UV-space sampling provides spatial continuity; auxiliary expression embeddings recover details that the base face model misses.

That is the real contribution. Not “UV maps plus Gaussians.” Not “another avatar benchmark win.” The paper shows that photorealistic avatars need representations that respect both the convenience of neural texture space and the discipline of geometry.

A face is a hostile benchmark because humans are exquisitely trained to detect when it goes wrong. We may forgive a blurry background. We do not forgive a mouth that forgets how teeth work.

TexAvatars does not solve the whole digital human problem. It does something more useful: it clarifies why the next generation of avatar systems should not choose between learned appearance and analytic rigging. They need both, coupled carefully enough that UV maps stop pretending geometry is optional.

Cognaptus: Automate the Present, Incubate the Future.


[1] Jaeseong Lee, Junyeong Ahn, Taewoong Kang, and Jaegul Choo, “TexAvatars: Hybrid Texel-3D Representations for Stable Rigging of Photorealistic Gaussian Head Avatars,” arXiv:2512.21099, 2025. https://arxiv.org/abs/2512.21099