CLIP ViT-B/32
A widely used multimodal model from OpenAI that learns joint image–text embeddings, enabling zero-shot image classification, search, and multimodal applications.
A widely used multimodal model from OpenAI that learns joint image–text embeddings, enabling zero-shot image classification, search, and multimodal applications.
A multi-modal foundation model by DeepSeek AI, integrating vision and language for high-performance tasks including OCR, captioning, and visual reasoning.