Implementing Multimodal Understanding and Semantic Search with ImageBind
Discover how I enhanced SummarizeMe by implementing multimodal understanding and semantic search with ImageBind.
Antje Barth
Amazon Employee
Published Oct 30, 2024
A few weeks ago, I shared how I used AI tools to build SummarizeMe, an app that transforms meeting recordings into concise summaries.
In this post, I'll show you how I've enhanced the app by implementing multimodal understanding and semantic search using a Weaviate vector store and ImageBind. This addition allows me to store and semantically search video summaries using both text and visual content.
💻 Jump directly to the GitHub repo.
I chose several key technologies for this implementation:
- Weaviate: An open-source vector database that supports multimodal data
- ImageBind: A multimodal embedding model that can understand relationships between different types of media by creating a single embedding space for all supported modalities
- multi2vec-bind: Weaviate's module that implements ImageBind for vector embeddings
- Docker: For running the Weaviate and multi2vec-bind services locally
The vector store is configured using Docker Compose. The main configuration for running Weaviate locally is in the weaviate-docker-compose.yaml file.

Note: For brevity, only a portion of the code is shown here. You can find the full code in GitHub.
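To give you an idea of the shape of that file, here is a minimal sketch of a Weaviate plus multi2vec-bind setup along the lines of Weaviate's documented configuration. The image version tags and extra settings are illustrative assumptions and may differ from the file in the repo.

```yaml
# Minimal sketch of a Weaviate + multi2vec-bind setup (illustrative; the repo's
# weaviate-docker-compose.yaml may use different versions and extra settings).
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.5   # version tag is an assumption
    ports:
      - "8080:8080"     # REST API
      - "50051:50051"   # gRPC, used by the v4 Python client
    environment:
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      ENABLE_MODULES: "multi2vec-bind"                  # activate the ImageBind module
      DEFAULT_VECTORIZER_MODULE: "multi2vec-bind"
      BIND_INFERENCE_API: "http://multi2vec-bind:8080"  # where Weaviate reaches ImageBind
  multi2vec-bind:
    image: cr.weaviate.io/semitechnologies/multi2vec-bind:imagebind
    environment:
      ENABLE_CUDA: "0"   # set to "1" if a GPU is available
```

With a file like this in place, running docker compose -f weaviate-docker-compose.yaml up -d brings up both services locally.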
Weaviate's vector store supports both textual and video embeddings, which enables multimodal search without requiring separate models. The core vector store operations are implemented in vector_store_index.py. Let's break down the key components.

The save_video_in_vector_store function handles the complete video storage pipeline by creating Weaviate collections, converting videos to base64, and storing metadata alongside ImageBind-processed embeddings.

Note: For brevity, only a portion of the code is shown here. You can find the full code in GitHub.
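As a rough illustration, here is what such a function can look like with the Weaviate Python client v4. The collection name VideoSummaries, the property names, and the choice to vectorize only the video blob are my assumptions for the sketch, not necessarily what vector_store_index.py does.

```python
import base64
from datetime import datetime, timezone

import weaviate
from weaviate.classes.config import Configure, DataType, Property


# Illustrative sketch only -- collection and property names are assumptions,
# not necessarily those used in vector_store_index.py.
def save_video_in_vector_store(video_path: str, collection_name: str = "VideoSummaries") -> None:
    client = weaviate.connect_to_local()
    try:
        # Create the collection on first use, vectorized by the multi2vec-bind (ImageBind) module.
        if not client.collections.exists(collection_name):
            client.collections.create(
                collection_name,
                properties=[
                    Property(name="name", data_type=DataType.TEXT),
                    Property(name="timestamp", data_type=DataType.TEXT),
                    Property(name="video", data_type=DataType.BLOB),
                ],
                # Only the video blob is embedded here; the text properties are stored as metadata.
                vectorizer_config=Configure.Vectorizer.multi2vec_bind(video_fields=["video"]),
            )

        # Weaviate expects BLOB properties as base64-encoded strings.
        with open(video_path, "rb") as f:
            video_b64 = base64.b64encode(f.read()).decode("utf-8")

        client.collections.get(collection_name).data.insert({
            "name": video_path,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "video": video_b64,
        })
    finally:
        client.close()
```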
For retrieval, the get_related_videos function accepts natural language queries and performs semantic search across the vector embeddings, returning the top 5 most relevant videos with distance metrics and metadata.

Note: For brevity, only a portion of the code is shown here. You can find the full code in GitHub.
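Again as a hedged sketch with the v4 client, reusing the assumed collection name from above (the real function in vector_store_index.py may be structured differently):

```python
import weaviate
from weaviate.classes.query import MetadataQuery


# Illustrative sketch only -- reuses the assumed "VideoSummaries" collection from above.
def get_related_videos(query: str, collection_name: str = "VideoSummaries", limit: int = 5) -> list[dict]:
    client = weaviate.connect_to_local()
    try:
        collection = client.collections.get(collection_name)
        # near_text embeds the query with the same ImageBind model that embedded the videos,
        # so a plain text query can be compared against video vectors directly.
        response = collection.query.near_text(
            query=query,
            limit=limit,
            return_metadata=MetadataQuery(distance=True),
        )
        return [
            {
                "name": obj.properties.get("name"),
                "timestamp": obj.properties.get("timestamp"),
                "distance": obj.metadata.distance,  # smaller = more similar (cosine distance)
            }
            for obj in response.objects
        ]
    finally:
        client.close()
```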
Together, these functions create a seamless experience where users can naturally search through video content without worrying about exact matches or specific keywords.
I integrated the vector store functionality directly into SummarizeMe's main workflow, defined in app.py. After generating a video summary, the application automatically stores the video in Weaviate and performs a sample search to demonstrate the retrieval capabilities. If you run app.py, SummarizeMe will:
- Prompt you for the input file path
- Transcribe the audio/video using Amazon Transcribe
- Summarize the transcription and extract key points and action items using Anthropic's Claude 3 Haiku in Amazon Bedrock (a sketch of this call follows the list)
- Create a video summary using HeyGen (feat. my avatar!)
- Store the video summary in the vector store
- Search related videos using semantic queries
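For the Bedrock step, here is a minimal sketch of what the summarization call can look like with boto3's Converse API. The model ID is the public Claude 3 Haiku identifier, but the prompt wording, region, and inference settings are assumptions rather than the exact code in app.py.

```python
import boto3

# Hypothetical sketch of the summarization step -- prompt wording, region, and
# inference settings are assumptions; see app.py in the repo for the real call.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def summarize_transcript(transcript: str) -> str:
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{
            "role": "user",
            "content": [{
                "text": "Summarize the following meeting transcript and list the "
                        "key points and action items:\n\n" + transcript
            }],
        }],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
```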
Here is what you see if you run app.py in the Terminal app.

For demonstration purposes, when running app.py, I not only ingest the generated meeting summary video but also include two test videos: one of a playful cat and another of a running dog. This allows us to see how ImageBind's multimodal understanding works in practice. When performing a search with the phrase "action items", the meeting summary video shows a smaller distance (around 0.80), while the cat and dog videos show a larger distance (around 0.87-0.90). Smaller distance values indicate higher similarity; larger values indicate lower similarity. The default distance metric in Weaviate is cosine distance.
Note that I did not embed the actual transcription with the video or any other text metadata, other than a timestamp. We're truly performing cross-modal retrieval, finding similar videos based on the text query.
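Sticking with the hypothetical get_related_videos sketch from earlier, that sample search boils down to something like this:

```python
# Sample cross-modal search, reusing the get_related_videos sketch from earlier.
for video in get_related_videos("action items"):
    # In the run described above, the meeting summary came back at a distance of
    # roughly 0.80, while the cat and dog test videos landed around 0.87-0.90
    # (smaller cosine distance = more similar to the query).
    print(f"{video['name']}  distance={video['distance']:.2f}")
```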
By the way, I've also tested the new Amazon Q Developer inline chat feature in my VSCode IDE while building this code. For instance, since I often forget how to create timestamps 🙈, I selected the relevant line of code, pressed ⌘ + I on Mac (or Ctrl + I on Windows), and asked Q Developer to help me out.
Inline chat in Amazon Q Developer lets you describe issues or ideas directly in the code editor and returns a diff so you can see exactly what code will be added and removed.
ImageBind, developed by Meta AI, is a groundbreaking AI model that can understand and create connections between six different modalities: images (including video), text, audio, depth, thermal (infrared), and IMU (motion) data.
Unlike traditional multimodal models that often require explicit paired training data (e.g., images with captions), ImageBind learns to bind different modalities through what's called "emergent alignment." This means it can understand relationships between modalities even when they weren't explicitly trained together.
ImageBind's key characteristics include:
- Joint Embedding Space
  - All modalities are mapped to the same high-dimensional vector space
  - This allows for direct comparison between different types of media
  - Enables "cross-modal" retrieval (e.g., finding videos using text queries)
- Zero-Shot Capabilities
  - Can handle new combinations of modalities without additional training
  - Understands relationships that weren't explicitly shown during training
- Unified Architecture
  - Uses a single model for all modalities
  - More efficient than maintaining separate models for each modality
  - Reduces complexity in deployment and maintenance
In SummarizeMe, I'm using ImageBind through Weaviate's multi2vec-bind module, defined in weaviate-docker-compose.yaml.

Note: For brevity, only a portion of the code is shown here. You can find the full code in GitHub.
When I store a video, ImageBind:
- Processes the video content to understand visual elements, motion, and temporal relationships
- Creates embeddings that capture the semantic meaning of the video
- Processes any associated text (like timestamps or names)
- Maps both the video and text embeddings to the same vector space
This allows for sophisticated queries like:
- Finding videos with similar visual content
- Retrieving videos based on textual descriptions
- Identifying thematically related content across different meetings
Let me break down how semantic search processes and retrieves video content in two distinct phases.
Weaviate indexes both video and text embeddings simultaneously to facilitate cross-modal retrieval. When a new video summary is created, the system:
- Converts videos to base64 encoding for standardized storage
- Uses ImageBind to process the video content and create visual embeddings
- Embeds text metadata using the same model for unified understanding
- Stores both embeddings in Weaviate's vector space
When a user searches for content, the system:
- Converts user queries into the same embedding space
- Leverages Weaviate to conduct similarity searches across both text and video embeddings
- Ranks results based on distance (the default distance metric in Weaviate is cosine; smaller distance values indicate higher similarity)
- Returns matched videos with contextual metadata
This approach gives SummarizeMe two key capabilities:
- Multimodal Understanding: The system interprets relationships between text and video content, enabling more intelligent, concept-based search.
- Semantic Search: Rather than relying on exact matches, the system finds conceptually similar content.
By incorporating ImageBind through Weaviate, I've created a system that understands the content of my video summaries, making information retrieval more natural and effective.
- 💻 The complete SummarizeMe code is available in this GitHub repo.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.