- Provider: Meta AI
- License: Apache 2.0 (permissive, commercial-friendly)
- Access: Open weights on Hugging Face
- Architecture: Vision Transformer (ViT-Large/14)
- Training: Self-supervised (no labels)
📝 Overview
DINOv2 ViT-L/14 is a high-quality self-supervised vision foundation model released by Meta AI. Unlike task-specific CNNs trained on labeled data, DINOv2 learns rich, transferable visual representations that can be reused across a wide range of computer vision tasks without additional labeling.
It is widely used as a drop-in visual backbone for the following (a minimal embedding sketch follows this list):
- Image retrieval and similarity search
- Object discovery and segmentation
- Multimodal systems (vision + language)
- Robotics and perception pipelines
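The snippet below is a minimal sketch of the embedding workflow via 🤗 Transformers: load the model, run one image through it, and take the CLS token as a global image embedding. The checkpoint id `facebook/dinov2-large` and the file name `example.jpg` are assumptions; substitute your own.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed checkpoint id; adjust to the repo you actually use.
CKPT = "facebook/dinov2-large"

processor = AutoImageProcessor.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT).eval()

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The CLS token serves as a global image embedding (1024-dim for ViT-L/14).
embedding = outputs.last_hidden_state[:, 0]
print(embedding.shape)  # torch.Size([1, 1024])
```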
⚙️ Technical Specs
- Model Family: DINOv2
- Backbone: Vision Transformer (ViT-Large)
- Patch Size: 14×14
- Embedding Dimension: 1024
- Training Data: LVD-142M, a large-scale curated corpus of ~142M images
- Training Method: Self-distillation with no labels (DINO)
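The patch size and embedding dimension above fix the model's token geometry: for a 224×224 input, (224 / 14)² = 256 patch tokens plus one CLS token, each 1024-dimensional. A minimal sketch of this, assuming the `facebook/dinov2-large` checkpoint and a dummy input tensor:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/dinov2-large").eval()

# Stand-in for a preprocessed 224x224 RGB image.
pixel_values = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = model(pixel_values=pixel_values)

# (224 / 14)^2 = 256 patch tokens + 1 CLS token = 257 tokens of dim 1024.
tokens = out.last_hidden_state                        # (1, 257, 1024)
patch_grid = tokens[:, 1:].reshape(1, 16, 16, 1024)   # dense spatial features
print(tokens.shape, patch_grid.shape)
```

The reshaped patch grid is the kind of dense feature map that segmentation and object-discovery pipelines typically consume, while the CLS token is used for retrieval and classification.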
🚀 Deployment
- Hugging Face Repo: https://huggingface.co/facebook/dinov2-large
- Frameworks: 🤗 Transformers, PyTorch, ONNX
- Use Cases: Image embeddings, retrieval, zero-shot transfer, feature extraction
- Hardware: GPU recommended for batch embedding; CPU viable for single-image inference
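As an end-to-end sketch of the retrieval use case: embed a gallery and a query in batches (on GPU when available), L2-normalize the CLS embeddings, and rank by cosine similarity, which reduces to a matrix product on unit-norm vectors. The image paths below are placeholders, and the checkpoint id is the one assumed above.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-large")
model = AutoModel.from_pretrained("facebook/dinov2-large").to(device).eval()

@torch.no_grad()
def embed(images):
    """Return L2-normalized CLS embeddings for a batch of PIL images."""
    inputs = processor(images=images, return_tensors="pt").to(device)
    cls = model(**inputs).last_hidden_state[:, 0]
    return F.normalize(cls, dim=-1)

# Placeholder paths; replace with your own gallery and query images.
gallery = embed([Image.open(p) for p in ["a.jpg", "b.jpg", "c.jpg"]])
query = embed([Image.open("query.jpg")])

# With unit-norm vectors, cosine similarity is a plain matrix product.
scores = query @ gallery.T          # shape (1, 3)
best = int(scores.argmax(dim=-1))   # index of the most similar gallery image
```

Normalizing once at embedding time keeps the search side simple: a gallery of N images becomes an (N, 1024) matrix, and top-k retrieval is a single `topk` over the score matrix.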