Provider: Google
License: Gemma License (responsible-use terms; commercial use permitted subject to Google's prohibited-use policy)
Access: Open weights available via Kaggle and Hugging Face
Architecture: Multimodal Transformer combining a SigLIP vision encoder with a Gemma 2 language decoder
Modalities: Image + Text
🔍 Overview
PaliGemma 2 is Google's state-of-the-art open vision-language model, designed for fine-grained multimodal reasoning. It builds on the PaLI family (including PaLI-X) and succeeds PaliGemma v1, offering stronger instruction following and image-text alignment while remaining fast and lightweight.
Key Features:
- 🖼️ Vision-Language Integration: Combines SigLIP vision features with Gemma-based language modeling
- 🌐 Task Generalist: Supports VQA, image captioning, OCR, and instruction-based visual tasks via short task-prefix prompts (see the sketch after this list)
- ⚡ Efficient: Compact enough for multimodal research and accessible deployment
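The pretrained checkpoints are steered with short task prefixes rather than free-form instructions. Below is a minimal sketch of the prompt formats documented for the PaliGemma family; they are assumed to carry over to PaliGemma 2, so verify the exact strings against the model card:

```python
# Task-prefix prompts used across the PaliGemma family (per the PaliGemma
# docs); treat the exact strings as an assumption to verify for PaliGemma 2.
TASK_PROMPTS = {
    "captioning": "caption en",                       # short English caption
    "vqa":        "answer en What is in this image?", # visual question answering
    "ocr":        "ocr",                              # transcribe text in the image
    "detection":  "detect cat",                       # boxes for the named object
}
```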
⚙️ Technical Details
- Architecture: SigLIP (vision encoder) + Gemma 2 (language decoder)
- Pretraining Tasks: Captioning, VQA, OCR, alignment-based reasoning
- Context Length: Image tokens from the vision encoder are prepended to the text prompt, so the decoder attends jointly over long prompts and dense image detail (see the token-budget sketch after this list)
- Model Size: Released in 3B, 10B, and 28B parameter variants (Gemma 2 2B/9B/27B decoders plus a ~400M SigLIP encoder), lightweight by VLM standards
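The fusion mechanism is straightforward: SigLIP encodes the image into a fixed grid of patch tokens that are prepended to the text tokens. A back-of-the-envelope sketch of the resulting image-token budget, assuming SigLIP's 14-pixel patches:

```python
# Image tokens seen by the Gemma decoder, assuming 14px SigLIP patches.
PATCH_SIZE = 14
for resolution in (224, 448, 896):  # the three released input resolutions
    n_tokens = (resolution // PATCH_SIZE) ** 2
    print(f"{resolution}px -> {n_tokens} image tokens")
# 224px -> 256 image tokens
# 448px -> 1024 image tokens
# 896px -> 4096 image tokens
```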
🚀 Deployment
- Model Access: PaliGemma 2 on Kaggle
- Compatible With: Hugging Face Transformers, JAX/Flax, Google AI Studio
- Use Cases: Multimodal assistants, document intelligence, multilingual OCR-VQA pipelines
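A minimal inference sketch with Hugging Face Transformers. The checkpoint id below follows the published naming scheme but is an assumption to verify on the Hub, and downloading the weights requires accepting the Gemma terms first:

```python
# Minimal PaliGemma 2 captioning sketch (assumes a recent transformers
# release and an accepted Gemma license on the Hugging Face Hub).
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-224"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

image = Image.open("example.jpg")  # any local RGB image
prompt = "<image>caption en"       # explicit <image> placeholder + task prefix
inputs = (
    processor(text=prompt, images=image, return_tensors="pt")
    .to(torch.bfloat16)   # casts only the floating-point pixel tensors
    .to(model.device)
)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Drop the prompt tokens and decode only the newly generated caption.
caption = processor.decode(
    output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(caption)
```

The same call pattern covers the other task prefixes; only the prompt string changes.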