Vision-Language

CLIP ViT-B/32

A widely used multimodal model from OpenAI that learns joint image–text embeddings, enabling zero-shot image classification, search, and multimodal applications.

DeepSeek-V3

A multi-modal foundation model by DeepSeek AI, integrating vision and language for high-performance tasks including OCR, captioning, and visual reasoning.