Opening — Why this matters now

Photorealistic digital humans have quietly become infrastructure. Telepresence, XR collaboration, virtual production, and real‑time avatars all demand faces that are not just pretty, but stable under abuse: extreme expressions, wild head poses, and cross‑identity reenactment. The industry’s dirty secret is that many state‑of‑the‑art avatars look convincing—until you ask them to smile too hard.

The TexAvatars paper enters this space with a clear thesis: realism without rigging discipline collapses under extrapolation, while rigid analytic rigging without learning looks plasticky. The solution, unsurprisingly, is neither extreme.

Background — Two camps, one recurring failure mode

Recent Gaussian‑based head avatars largely split into two families:

| Paradigm | Strength | Failure Mode |
|---|---|---|
| Analytic mesh‑rigged Gaussians | Stable, interpretable deformation | Poor fine detail, weak non‑linear expressions |
| Texel / neural regression avatars | Smooth, high‑frequency detail | Unstable under extreme or OOD expressions |

Mesh‑bound methods inherit physical plausibility from 3DMMs, but their linear or near‑linear deformations struggle with wrinkles, folds, and anisotropic stretch. Texel‑space CNN methods, meanwhile, enjoy spatial continuity but often discard mesh geometry almost entirely, hoping neural offsets will “figure it out.” They often don’t—especially outside the training manifold.

TexAvatars starts from a blunt observation: UV space is smooth, but not geometric. Adjacent texels may correspond to distant 3D regions. Treating UV interpolation as if it were physically meaningful is the original sin behind many texel‑based artifacts.
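
A toy NumPy sketch makes the failure concrete (all coordinates below are made up for illustration): two texels that are neighbors in UV can sit on opposite sides of a seam, so blending their 3D quantities produces a point that belongs to neither region.

```python
import numpy as np

# Two texels adjacent in UV space (e.g., straddling a UV island seam)...
uv_a, uv_b = np.array([0.49, 0.50]), np.array([0.51, 0.50])

# ...that map to distant 3D surface points (hypothetical coordinates).
p3d_a = np.array([0.02, 0.11, 0.05])    # e.g., a point near the lips
p3d_b = np.array([-0.04, -0.09, 0.12])  # e.g., a point near the jaw

# Naive UV-space interpolation of 3D quantities blends across the seam:
blended = 0.5 * (p3d_a + p3d_b)
print(np.linalg.norm(uv_a - uv_b))    # 0.02  -- tiny UV distance
print(np.linalg.norm(p3d_a - p3d_b))  # ~0.22 -- large 3D distance
print(blended)                        # lies on neither surface region
```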

Analysis — What TexAvatars actually does differently

1. Local attributes are learned, not global offsets

Instead of predicting canonical‑to‑deformed Gaussian offsets (the usual unstable route), TexAvatars regresses local Gaussian attributes in texel space:

  • Local position, rotation, scale, and opacity
  • View‑dependent color
  • All conditioned on expression, pose, and a lightweight expression embedding

Crucially, these predictions live in bounded local coordinates, not global 3D space. This keeps gradients well‑behaved even under large facial motions.
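
A minimal PyTorch sketch of what such a texel‑space head could look like; the 1×1 convolution, channel layout, and tanh bound are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LocalGaussianHead(nn.Module):
    """Sketch of a texel-space head regressing *local* Gaussian attributes.
    Sizes and activations are assumptions; the view-dependent color head
    is omitted for brevity."""
    def __init__(self, feat_dim: int = 64, pos_bound: float = 0.05):
        super().__init__()
        self.pos_bound = pos_bound  # max local offset, in triangle-local units
        # 11 channels: 3 position + 4 quaternion + 3 log-scale + 1 opacity
        self.head = nn.Conv2d(feat_dim, 11, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> dict:
        out = self.head(feats)  # (B, 11, H, W) over the UV texture grid
        pos, quat, log_scale, opacity = out.split([3, 4, 3, 1], dim=1)
        return {
            # tanh keeps local positions in a bounded box around the texel,
            # so gradients stay well-behaved under large facial motion
            "pos": self.pos_bound * torch.tanh(pos),
            "rot": nn.functional.normalize(quat, dim=1),   # unit quaternion
            "scale": torch.exp(log_scale.clamp(max=2.0)),  # bounded above
            "opacity": torch.sigmoid(opacity),
        }
```

Because every raw output is squashed through a bounded activation, a bad prediction misplaces a splat slightly instead of flinging it across the scene.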

2. Geometry is re‑introduced via Jacobians — properly

Each mesh triangle already defines a deformation Jacobian. TexAvatars unwraps these triangle‑wise Jacobians into UV space, forming what the authors call a Quasi‑Phong Jacobian Field.

Conceptually:

Local Gaussian (UV) → Jacobian‑transformed → Global 3D Gaussian

By interpolating Jacobians in UV space—rather than interpolating already‑deformed Gaussians—the method achieves smooth, cross‑triangle deformation while preserving mesh semantics. It’s an elegant inversion of the usual pipeline, and one that pays dividends in stability.
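
As a sketch of the two endpoints of that pipeline, the snippet below builds per‑triangle Jacobians from edge frames and applies texel‑sampled Jacobians to local means. The UV rasterization and interpolation of those Jacobians (the quasi‑Phong part) is elided, and the frame construction is an assumption rather than the paper's exact formulation:

```python
import torch

def triangle_jacobians(v_canon, v_def, faces):
    """Per-triangle deformation Jacobians (canonical -> deformed).
    v_canon, v_def: (V, 3) vertex positions; faces: (F, 3) indices."""
    def frames(v):
        tri = v[faces]                            # (F, 3, 3) triangle corners
        e1, e2 = tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]
        n = torch.cross(e1, e2, dim=-1)
        n = n / n.norm(dim=-1, keepdim=True)      # unit normal completes the frame
        return torch.stack([e1, e2, n], dim=-1)   # (F, 3, 3)
    Dc, Dd = frames(v_canon), frames(v_def)
    return Dd @ torch.linalg.inv(Dc)              # maps canonical frame to deformed

def to_world(J_sampled, mu_local, anchors):
    """Lift local Gaussian means with Jacobians sampled per texel.
    J_sampled: (N, 3, 3), mu_local: (N, 3), anchors: (N, 3) surface points."""
    return anchors + torch.einsum("nij,nj->ni", J_sampled, mu_local)
```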

3. Learning and rigging are cleanly separated

Semantic variation (wrinkles, subtle muscle activation, mouth cavity appearance) is handled by neural decoders. Geometric control (how things move in 3D) remains analytic. This separation is not just aesthetic—it bounds error propagation and makes extrapolation far less brittle.
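
That error‑bounding claim is easy to demonstrate. Below is a hypothetical composition (the names and the tanh bound are mine, not the paper's API) showing that no matter how pathological the network output, a splat cannot leave a small neighborhood of its mesh anchor:

```python
import torch

def compose(mu_local, J, anchors, bound=0.05):
    """Single hand-off point between learning and rigging.
    mu_local comes from a network; J and anchors from the analytic rig."""
    mu_local = bound * torch.tanh(mu_local)        # learning stays bounded
    world = anchors + torch.einsum("nij,nj->ni", J, mu_local)
    # ||world - anchors|| <= ||J||_2 * bound, so even a pathological
    # prediction distorts detail without breaking the rig.
    return world

world = compose(torch.randn(4, 3) * 100,           # adversarially large outputs
                torch.eye(3).expand(4, 3, 3),
                torch.zeros(4, 3))
print(world.norm(dim=-1))                          # still <= 0.05 * sqrt(3)
```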

Findings — The numbers (and faces) back it up

Across held‑out expressions, long FREE‑motion sequences, and novel views, TexAvatars consistently outperforms prior methods.

Quantitative snapshot

| Method | LPIPS ↓ | SSIM ↑ | PSNR ↑ |
|---|---|---|---|
| GaussianAvatars | 0.123 | 0.858 | 22.01 |
| RGCA | 0.086 | 0.854 | 22.68 |
| TexAvatars | 0.077 | 0.861 | 22.84 |

More telling than the metrics are the qualitative results: stable teeth, believable wrinkles, coherent mouth interiors, and no “Gaussian soup” when faces stretch or rotate aggressively.

Ablations confirm that removing either the global UV Jacobian field or the local‑space parameterization quickly degrades quality—usually in exactly the ways the theory predicts.

Implications — Why this matters beyond faces

TexAvatars is not just a better head avatar. It’s a design pattern:

  • Learn locally, deform analytically
  • Interpolate structure, not outcomes
  • Bound learning, let geometry scale

These ideas generalize cleanly to other articulated, high‑detail surfaces: hands, full‑body avatars, even non‑human deformables. For practitioners, the work is also refreshingly pragmatic—real‑time performance, modest GPU requirements, and no baroque auxiliary networks.

The limitations are acknowledged and honest: no dynamic hair, limited tongue articulation, weak specular modeling. None are structural dead‑ends.

Conclusion — A rare case of restraint paying off

TexAvatars succeeds because it resists the temptation to let neural networks do everything. By forcing learning to respect geometry—rather than overwrite it—the authors achieve a rare combination: expressiveness and stability under stress.

In a field addicted to bigger models and looser constraints, this paper is a reminder that sometimes the smartest move is to put the math back where it belongs.

Cognaptus: Automate the Present, Incubate the Future.