The magic behind OmniAvatar isn’t just in its motion—it’s in the meticulous training pipeline and rigorous evaluation metrics that power its realism. Here’s a closer look at how the model was built and validated.

Training Data: Curated, Filtered, and Massive

OmniAvatar trains on a carefully filtered subset of the AVSpeech dataset (Ephrat et al., 2018), a publicly available corpus with over 4,700 hours of speech-aligned video. To ensure lip-sync precision and high visual quality:

  • SyncNet and Q-Align are used to filter out clips with poor lip synchronization or low image fidelity (a filtering sketch follows this list).
  • Result: a curated dataset of 774,207 clips, totaling ~1,320 hours of content.
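
A minimal sketch of this two-stage filter, assuming hypothetical `syncnet_confidence` and `qalign_score` scoring helpers and illustrative thresholds (the paper's exact cut-offs are not reproduced here):

```python
def keep_clip(clip_path,
              syncnet_confidence,        # callable: clip path -> lip-sync confidence score
              qalign_score,              # callable: clip path -> image-quality score
              sync_threshold=3.0,        # illustrative cut-off, not the paper's value
              quality_threshold=3.5):    # illustrative cut-off, not the paper's value
    """Return True if a clip passes both the lip-sync and visual-quality filters."""
    return (syncnet_confidence(clip_path) >= sync_threshold
            and qalign_score(clip_path) >= quality_threshold)
```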

From this refined set:

  • 100 samples are reserved for semi-body test evaluation.
  • The remaining clips are used for LoRA-based fine-tuning on top of the Wan2.1-T2V-14B foundation model.

OmniAvatar also includes a second test set from HDTF (Zhang et al., 2021) for facial-only benchmarks.
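
A minimal sketch of how the two evaluation splits and the training pool fit together; the clip lists are hypothetical inputs, and the random 100-sample hold-out is illustrative rather than the paper's exact selection procedure:

```python
import random

def build_splits(avspeech_clips, hdtf_clips, seed: int = 0):
    """Split the curated AVSpeech clips into train / semi-body test,
    and keep HDTF as a separate facial-only benchmark."""
    rng = random.Random(seed)
    clips = list(avspeech_clips)
    rng.shuffle(clips)
    semi_body_test = clips[:100]    # 100 held-out AVSpeech samples
    train_set = clips[100:]         # remaining clips for LoRA fine-tuning
    facial_test = list(hdtf_clips)  # HDTF facial-only benchmark
    return train_set, semi_body_test, facial_test
```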

Architecture & Efficiency

  • The model builds on Wan2.1-T2V-14B, a large-scale open diffusion transformer.
  • Training resolution: 480p video, gradually mixed with higher-resolution samples.
  • Uses LoRA (rank 128, alpha 64) for efficient fine-tuning; a configuration sketch follows this list.
  • Training runs on 64 A100 GPUs, with cached video latents and text embeddings to cut redundant encoding work.
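
A minimal sketch of an equivalent LoRA configuration using Hugging Face `peft`; the target module names are assumptions for a diffusion-transformer backbone and may not match OmniAvatar's actual training code:

```python
from peft import LoraConfig

# LoRA hyperparameters reported for OmniAvatar; everything else is illustrative.
lora_config = LoraConfig(
    r=128,                                                # LoRA rank
    lora_alpha=64,                                        # LoRA alpha
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # hypothetical attention projections
    lora_dropout=0.0,
)
# The adapters are trained while the Wan2.1-T2V-14B base weights remain frozen.
```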

Metrics: How Quality is Judged

To validate output realism, temporal coherence, and audio-visual sync, OmniAvatar adopts the following metrics:

| Metric | Purpose | OmniAvatar Result |
|--------|---------|-------------------|
| FID ↓ | Image-level realism | 37.3 (HDTF) |
| FVD ↓ | Video-level realism | 382 (HDTF) |
| Sync-C ↑ | Lip-sync correlation | 7.62 (HDTF) |
| Sync-D ↓ | Lip-sync distance | 8.14 (HDTF) |
| IQA ↑ | Image quality (human-judgment proxy) | 3.82 |
| ASE ↑ | Aesthetic quality | 2.41 |

(↓ lower is better, ↑ higher is better.)

These results outperform strong baselines such as EchoMimic, HunyuanAvatar, and FantasyTalking across the key evaluation dimensions.
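
As one example of how such scores are computed, here is a minimal sketch of frame-level FID using torchmetrics; the frame tensors are hypothetical inputs, and FVD, Sync-C/Sync-D, IQA, and ASE each rely on their own dedicated feature extractors that are not shown here:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def frame_fid(real_frames: torch.Tensor, fake_frames: torch.Tensor) -> float:
    """Compute FID between uint8 frame batches of shape (N, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_frames, real=True)    # frames sampled from ground-truth videos
    fid.update(fake_frames, real=False)   # frames rendered by the model
    return fid.compute().item()
```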

Long-Video Support & Identity Preservation

Generating continuous avatars that remain stable in identity and motion requires:

  • Reference Image Embedding: Anchors the avatar’s appearance.
  • Frame Overlapping Strategy: Smooths transitions in long-form video.
  • Prefix Latents: Retain motion context across consecutive clips.

The inference pipeline supports batch-wise segment generation with overlapping regions to suppress temporal artifacts, as sketched below.
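
A minimal sketch of that segment-by-segment loop, assuming a hypothetical `generate_segment` callable and latent tensors shaped (frames, channels, height, width); the linear cross-fade over the overlap is an illustrative choice, not necessarily the paper's exact blending rule:

```python
import torch

def generate_long_video(audio_chunks, ref_latent, generate_segment, overlap: int = 4):
    """Generate a long clip segment by segment, carrying the last `overlap`
    latent frames of each segment forward as the prefix of the next."""
    segments, prefix = [], None
    for chunk in audio_chunks:
        # Every segment is conditioned on the reference-image latent (identity anchor)
        # and on the prefix latents carried over from the previous segment.
        latents = generate_segment(audio=chunk, ref=ref_latent, prefix=prefix)
        if segments:
            # Linearly cross-fade the overlapping frames to smooth the seam.
            w = torch.linspace(0.0, 1.0, overlap).view(-1, 1, 1, 1)
            latents[:overlap] = (1 - w) * segments[-1][-overlap:] + w * latents[:overlap]
            segments[-1] = segments[-1][:-overlap]   # drop the duplicated frames
        segments.append(latents)
        prefix = latents[-overlap:].clone()          # motion context for the next segment
    return torch.cat(segments, dim=0)
```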

Summary: Why This Matters

Training and metrics aren’t just back-end details—they determine whether avatars resonate as authentic. OmniAvatar’s careful data curation, efficient LoRA adaptation, and exhaustive metric validation make it the current benchmark in expressive, controllable full-body avatar generation.


Cognaptus: Automate the Present, Incubate the Future