
Google has taken a major step forward in embedding technology, one of the most critical building blocks of modern AI infrastructure. Announced on March 10, 2026, Gemini Embedding 2 is the company's first natively multimodal embedding model, now available in public preview through the Gemini API and Vertex AI. Built on the Gemini architecture, the model goes beyond previous text-only embedding systems by mapping text, images, videos, audio, and PDF documents into a single unified embedding space.
One of the model's standout capabilities is its support for interleaved multimodal inputs within a single request. Developers can combine different data types, such as an image paired with a text description, and the model captures the semantic relationships between them. On the technical side, Gemini Embedding 2 supports up to 8,192 input tokens for text, up to 6 images per request in PNG/JPEG format, up to 120 seconds of MP4/MOV video, native audio processing without transcription, and PDF documents up to 6 pages. With semantic understanding across more than 100 languages, the model targets use cases including retrieval-augmented generation (RAG), semantic search, sentiment analysis, and data clustering.
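What a unified embedding space buys you in practice is that a text query vector can be compared directly against vectors produced from images, audio, or PDFs. A minimal sketch of that retrieval step, using tiny mock vectors in place of real 3,072-dimensional model output (the document IDs and vector values here are illustrative, not Gemini API output):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Mock embeddings standing in for real model output; in practice each
# vector would come from the embedding API, one per input (text, image,
# video, audio, or PDF), all living in the same space.
corpus = {
    "image:sunset.png":  [0.9, 0.1, 0.0],
    "audio:meeting.mp3": [0.1, 0.8, 0.2],
    "pdf:report.pdf":    [0.0, 0.2, 0.9],
}
query_vec = [0.85, 0.15, 0.05]  # mock embedding of a text query like "a sunset photo"

best = max(corpus, key=lambda k: cosine_similarity(query_vec, corpus[k]))
print(best)  # → image:sunset.png (the image is nearest to the text query)
```

Because every modality maps into one space, the same similarity function ranks images, audio, and documents against a text query with no per-modality plumbing.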
The model also incorporates Matryoshka Representation Learning (MRL), which enables flexible scaling of embedding vector dimensions. While the default output is 3,072 dimensions, developers can reduce this to 1,536 or 768 to optimize for storage and latency requirements. Google reports that the model outperforms leading alternatives across text, image, and video tasks, while introducing robust speech embedding capabilities that set a new performance benchmark in the multimodal embedding space.
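Because MRL-trained embeddings front-load information into the leading dimensions, a shorter vector can be obtained client-side by truncating and renormalizing rather than re-embedding. A hedged sketch of that step (the helper name and the 8-dimension toy vector are illustrative; the real default output is 3,072 dimensions):

```python
import math

def truncate_embedding(vec, dims):
    # MRL concentrates semantic information in the leading components,
    # so keeping only the first `dims` values preserves most of it.
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    # Renormalize to unit length so cosine similarity stays well scaled.
    return [x / norm for x in head]

# Toy 8-dim stand-in for a full 3,072-dim embedding.
full = [0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
short = truncate_embedding(full, 4)  # e.g. 3,072 -> 768 in production

print(len(short))                  # 4
print(sum(x * x for x in short))   # ~1.0 after renormalization
```

The trade-off is purely storage and latency versus fidelity: a 768-dimension index is a quarter the size of the 3,072-dimension default, at some cost in retrieval quality.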
Gemini Embedding 2 is already integrated with widely used frameworks and vector database tools including LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vector Search. Early access partners have reported significant improvements, particularly in video retrieval scenarios where text queries can now surface untranscribed visual content. For teams building RAG pipelines, semantic search systems, or large-scale data management solutions, the availability of a single model handling all major modalities represents a meaningful simplification of previously complex multi-model architectures.
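The simplification described above amounts to this: one embedding model means one index for every modality instead of one pipeline per data type. A toy in-memory store makes the idea concrete (a real deployment would use one of the databases named above; all IDs and vectors here are mock values):

```python
import heapq
import math

class MiniVectorStore:
    """Toy in-memory index holding vectors from every modality.

    Stands in for a real vector database (Weaviate, Qdrant, ChromaDB, ...);
    with a unified embedding model, one index replaces per-modality silos.
    """

    def __init__(self):
        self.items = []  # list of (doc_id, unit-normalized vector)

    def add(self, doc_id, vec):
        norm = math.sqrt(sum(x * x for x in vec))
        self.items.append((doc_id, [x / norm for x in vec]))

    def query(self, vec, k=2):
        # Normalize the query, then rank by dot product (cosine similarity).
        norm = math.sqrt(sum(x * x for x in vec))
        q = [x / norm for x in vec]
        scored = ((sum(a * b for a, b in zip(q, v)), doc_id)
                  for doc_id, v in self.items)
        return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

store = MiniVectorStore()
# Mock vectors; in production each would come from the same embedding model,
# regardless of whether the source was video, text, or a PDF.
store.add("video:demo.mp4", [0.9, 0.1, 0.1])
store.add("text:faq.md",    [0.1, 0.9, 0.1])
store.add("pdf:spec.pdf",   [0.1, 0.1, 0.9])

hits = store.query([0.8, 0.2, 0.1], k=2)
print(hits)  # → ['video:demo.mp4', 'text:faq.md']
```

This is also how text-to-video retrieval works without transcription: the untranscribed video's embedding simply sits in the same index the text query is searched against.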