The magic behind OmniAvatar isn’t just in its motion—it’s in the meticulous training pipeline and rigorous evaluation metrics that power its realism. Here’s a closer look at how the model was built and validated.
Training Data: Curated, Filtered, and Massive
OmniAvatar trains on a carefully filtered subset of the AVSpeech dataset (Ephrat et al., 2018), a publicly available corpus with over 4,700 hours of speech-aligned video. To ensure lip-sync precision and high visual quality:
- SyncNet screens clips for audio-visual lip synchronization, while Q-Align scores image fidelity; clips that fail either check are discarded (a filtering sketch follows this list).
- Result: a curated dataset of 774,207 clips, totaling roughly 1,320 hours of content.
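A minimal sketch of this filtering pass: `syncnet_confidence` and `qalign_score` are hypothetical wrappers around the two models, and the threshold values are illustrative, not the paper's actual cutoffs.

```python
from pathlib import Path

SYNC_CONF_MIN = 3.0  # illustrative SyncNet confidence floor, not from the paper
QALIGN_MIN = 3.5     # illustrative Q-Align quality floor, not from the paper

def syncnet_confidence(clip: Path) -> float:
    """Stub: replace with a real SyncNet forward pass over the clip."""
    return 0.0

def qalign_score(clip: Path) -> float:
    """Stub: replace with a real Q-Align quality prediction."""
    return 0.0

def keep_clip(clip: Path) -> bool:
    """A clip survives only if it clears both the sync and quality gates."""
    return (syncnet_confidence(clip) >= SYNC_CONF_MIN
            and qalign_score(clip) >= QALIGN_MIN)

kept = [p for p in Path("avspeech_clips").glob("*.mp4") if keep_clip(p)]
print(f"retained {len(kept)} clips")
```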
From this refined set:
- 100 samples are reserved for semi-body test evaluation.
- The remaining clips are used for LoRA-based fine-tuning on top of the Wan2.1-T2V-14B foundation model.
OmniAvatar also includes a second test set from HDTF (Zhang et al., 2021) for facial-only benchmarks.
Architecture & Efficiency
- The model builds on Wan2.1-T2V-14B, a large-scale open diffusion transformer.
- Training resolution: 480p video, gradually mixed with higher-resolution samples.
- Uses LoRA with rank 128 and alpha 64 for parameter-efficient fine-tuning (see the sketch after this list).
- Training runs on 64 A100 GPUs, with video latents and text embeddings precomputed and cached to avoid redundant computation.
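As a sketch of what such an adapter setup could look like with the Hugging Face `peft` library; the stand-in module and the `target_modules` names are assumptions, since the post does not specify the Wan2.1 layer layout or training code.

```python
import torch.nn as nn
from peft import LoraConfig, inject_adapter_in_model

class AttentionStandIn(nn.Module):
    """Tiny stand-in for one Wan2.1 attention block (illustrative only)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

config = LoraConfig(
    r=128,          # LoRA rank, as reported above
    lora_alpha=64,  # LoRA scaling factor, as reported above
    target_modules=["to_q", "to_k", "to_v"],  # hypothetical module names
)

base = AttentionStandIn()
for p in base.parameters():
    p.requires_grad = False  # freeze base weights; only the adapters train

model = inject_adapter_in_model(config, base)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable LoRA parameters: {trainable:,}")
```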
Metrics: How Quality Is Judged
To validate output realism, temporal coherence, and audio-visual sync, OmniAvatar adopts:
| Metric | Purpose | Better | Result |
|---|---|---|---|
| FID | Image-level realism | ↓ | 37.3 (HDTF) |
| FVD | Video-level realism | ↓ | 382 (HDTF) |
| Sync-C | Lip-sync confidence | ↑ | 7.62 (HDTF) |
| Sync-D | Lip-sync distance | ↓ | 8.14 (HDTF) |
| IQA | Image quality (human-preference proxy) | ↑ | 3.82 |
| ASE | Aesthetic quality | ↑ | 2.41 |
These results outperform strong baselines such as EchoMimic, HunyuanAvatar, and FantasyTalking across the board.
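To make one of these concrete: FID compares feature statistics of real and generated frames, and lower is better. Below is a minimal sketch using `torchmetrics` (one common implementation; the paper's exact evaluation protocol, including frame sampling and resolution, isn't stated in this post).

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # standard InceptionV3 features

# Placeholder uint8 image batches of shape (N, 3, H, W); in practice these
# would be frames from HDTF ground truth and from OmniAvatar's outputs.
real_frames = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_frames = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_frames, real=True)
fid.update(fake_frames, real=False)
print(f"FID: {fid.compute().item():.2f}")  # lower means closer to real frames
```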
Long-Video Support & Identity Preservation
Generating continuous avatars that remain stable in identity and motion requires:
- Reference Image Embedding: Anchors the avatar’s appearance.
- Frame Overlapping Strategy: Smooths transitions in long-form video.
- Prefix Latents: Carry motion context across clip boundaries.
The inference pipeline generates video segment by segment, with overlapping regions between batches to suppress temporal artifacts; a simplified sketch of this loop follows.
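The loop below sketches that idea under simplified assumptions: `generate_segment` is a placeholder for one diffusion pass, and the seam is smoothed with a generic linear cross-fade rather than OmniAvatar's actual prefix-latent conditioning, so it illustrates the control flow only.

```python
import numpy as np

SEG_LEN, OVERLAP = 48, 8  # frames per segment / overlap size (illustrative)

def generate_segment(prefix_latents, audio_chunk, ref_embed):
    """Placeholder for one denoising pass; returns (frames, new prefix latents)."""
    frames = np.zeros((SEG_LEN, 480, 480, 3), dtype=np.float32)
    latents = np.zeros((OVERLAP, 64), dtype=np.float32)
    return frames, latents

def generate_long_video(audio_chunks, ref_embed):
    video, prefix = [], None  # prefix latents carry motion context forward
    for chunk in audio_chunks:
        frames, prefix = generate_segment(prefix, chunk, ref_embed)
        if video:
            # Cross-fade the overlapping frames to hide the segment seam.
            w = np.linspace(0.0, 1.0, OVERLAP)[:, None, None, None]
            video[-1][-OVERLAP:] = (1 - w) * video[-1][-OVERLAP:] + w * frames[:OVERLAP]
            frames = frames[OVERLAP:]
        video.append(frames)
    return np.concatenate(video)
```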
Summary: Why This Matters
Training and metrics aren't just back-end details; they determine whether avatars read as authentic. OmniAvatar's careful data curation, efficient LoRA adaptation, and thorough metric validation make it a current benchmark for expressive, controllable full-body avatar generation.
Cognaptus: Automate the Present, Incubate the Future