The magic behind OmniAvatar isn’t just in its motion—it’s in the meticulous training pipeline and rigorous evaluation metrics that power its realism. Here’s a closer look at how the model was built and validated.

Training Data: Curated, Filtered, and Massive

OmniAvatar trains on a carefully filtered subset of the AVSpeech dataset (Ephrat et al., 2018), a publicly available corpus with over 4,700 hours of speech-aligned video. To ensure lip-sync precision and high visual quality:

  • SyncNet and Q-Align are used to filter out clips with poor lip synchronization or low image fidelity (a filtering sketch follows this list).
  • Result: a curated dataset of 774,207 clips, totaling ~1,320 hours of content.
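
A minimal sketch of this two-stage filter, assuming hypothetical `syncnet_confidence` and `qalign_score` scoring helpers and illustrative thresholds (the paper's exact cut-offs are not reproduced here):

```python
def keep_clip(clip_path,
              syncnet_confidence,        # callable: clip path -> lip-sync confidence score
              qalign_score,              # callable: clip path -> image-quality score
              sync_threshold=3.0,        # illustrative cut-off, not the paper's value
              quality_threshold=3.5):    # illustrative cut-off, not the paper's value
    """Return True if a clip passes both the lip-sync and visual-quality filters."""
    return (syncnet_confidence(clip_path) >= sync_threshold
            and qalign_score(clip_path) >= quality_threshold)
```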

From this refined set:

  • 100 samples are reserved for semi-body test evaluation.
  • The remaining clips are used for LoRA-based fine-tuning on top of the Wan2.1-T2V-14B foundation model.

OmniAvatar also includes a second test set from HDTF (Zhang et al., 2021) for facial-only benchmarks.
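
A minimal sketch of how the two evaluation splits and the training pool fit together; the clip lists are hypothetical inputs, and the random 100-sample hold-out is illustrative rather than the paper's exact selection procedure:

```python
import random

def build_splits(avspeech_clips, hdtf_clips, seed: int = 0):
    """Split the curated AVSpeech clips into train / semi-body test,
    and keep HDTF as a separate facial-only benchmark."""
    rng = random.Random(seed)
    clips = list(avspeech_clips)
    rng.shuffle(clips)
    semi_body_test = clips[:100]    # 100 held-out AVSpeech samples
    train_set = clips[100:]         # remaining clips for LoRA fine-tuning
    facial_test = list(hdtf_clips)  # HDTF facial-only benchmark
    return train_set, semi_body_test, facial_test
```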

Architecture & Efficiency

  • The model builds on Wan2.1-T2V-14B, a large-scale open diffusion transformer.
  • Training resolution: 480p video, gradually mixed with higher-resolution samples.
  • Uses LoRA (rank 128, alpha 64) for efficient fine-tuning; a configuration sketch follows this list.
  • Training runs on 64 A100 GPUs, with cached video latents and text embeddings to cut redundant encoding work.
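
A minimal sketch of an equivalent LoRA configuration using Hugging Face `peft`; the target module names are assumptions for a diffusion-transformer backbone and may not match OmniAvatar's actual training code:

```python
from peft import LoraConfig

# LoRA hyperparameters reported for OmniAvatar; everything else is illustrative.
lora_config = LoraConfig(
    r=128,                                                # LoRA rank
    lora_alpha=64,                                        # LoRA alpha
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # hypothetical attention projections
    lora_dropout=0.0,
)
# The adapters are trained while the Wan2.1-T2V-14B base weights remain frozen.
```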

Metrics: How Quality is Judged

To validate output realism, temporal coherence, and audio-visual sync, OmniAvatar adopts the following metrics:

| Metric | Purpose | OmniAvatar Result |
|--------|---------|-------------------|
| FID ↓ | Image-level realism | 37.3 (HDTF) |
| FVD ↓ | Video-level realism | 382 (HDTF) |
| Sync-C ↑ | Lip-sync correlation | 7.62 (HDTF) |
| Sync-D ↓ | Lip-sync distance | 8.14 (HDTF) |
| IQA ↑ | Image quality (human-judgment proxy) | 3.82 |
| ASE ↑ | Aesthetic quality | 2.41 |

(↓ lower is better, ↑ higher is better.)

These results outperform strong baselines such as EchoMimic, HunyuanAvatar, and FantasyTalking across the key evaluation dimensions.
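
As one example of how such scores are computed, here is a minimal sketch of frame-level FID using torchmetrics; the frame tensors are hypothetical inputs, and FVD, Sync-C/Sync-D, IQA, and ASE each rely on their own dedicated feature extractors that are not shown here:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def frame_fid(real_frames: torch.Tensor, fake_frames: torch.Tensor) -> float:
    """Compute FID between uint8 frame batches of shape (N, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_frames, real=True)    # frames sampled from ground-truth videos
    fid.update(fake_frames, real=False)   # frames rendered by the model
    return fid.compute().item()
```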

Long-Video Support & Identity Preservation

Generating continuous avatars that remain stable in identity and motion requires:

  • Reference Image Embedding: Anchors the avatar’s appearance.
  • Frame Overlapping Strategy: Smooths transitions in long-form video.
  • Prefix Latents: Retain motion context across consecutive clips.

The inference pipeline supports batch-wise segment generation with overlapping regions to suppress temporal artifacts, as sketched below.
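
A minimal sketch of that segment-by-segment loop, assuming a hypothetical `generate_segment` callable and latent tensors shaped (frames, channels, height, width); the linear cross-fade over the overlap is an illustrative choice, not necessarily the paper's exact blending rule:

```python
import torch

def generate_long_video(audio_chunks, ref_latent, generate_segment, overlap: int = 4):
    """Generate a long clip segment by segment, carrying the last `overlap`
    latent frames of each segment forward as the prefix of the next."""
    segments, prefix = [], None
    for chunk in audio_chunks:
        # Every segment is conditioned on the reference-image latent (identity anchor)
        # and on the prefix latents carried over from the previous segment.
        latents = generate_segment(audio=chunk, ref=ref_latent, prefix=prefix)
        if segments:
            # Linearly cross-fade the overlapping frames to smooth the seam.
            w = torch.linspace(0.0, 1.0, overlap).view(-1, 1, 1, 1)
            latents[:overlap] = (1 - w) * segments[-1][-overlap:] + w * latents[:overlap]
            segments[-1] = segments[-1][:-overlap]   # drop the duplicated frames
        segments.append(latents)
        prefix = latents[-overlap:].clone()          # motion context for the next segment
    return torch.cat(segments, dim=0)
```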

Summary: Why This Matters

Training and metrics aren’t just back-end details—they determine whether avatars resonate as authentic. OmniAvatar’s careful data curation, efficient LoRA adaptation, and exhaustive metric validation make it the current benchmark in expressive, controllable full-body avatar generation.


Cognaptus: Automate the Present, Incubate the Future