Implementing Multimodal Understanding and Semantic Search with ImageBind
Discover how I enhanced SummarizeMe by implementing multimodal understanding and semantic search with ImageBind.
Antje Barth
Amazon Employee
Published Oct 30, 2024
A few weeks ago, I shared how I used AI tools to build SummarizeMe, an app that transforms meeting recordings into concise summaries.
In this post, I'll show you how I've enhanced the app by implementing multimodal understanding and semantic search using a Weaviate vector store and ImageBind. This addition allows me to store and semantically search video summaries using both text and visual content.
💻 Jump directly to the GitHub repo.
I chose several key technologies for this implementation:
- Weaviate: An open-source vector database that supports multimodal data
- ImageBind: A multimodal embedding model that can understand relationships between different types of media by creating a single embedding space for all supported modalities
- multi2vec-bind: Weaviate's module that implements ImageBind for vector embeddings
- Docker: For running the Weaviate and multi2vec-bind services locally
The vector store is configured using Docker Compose. The main configuration for running Weaviate locally is in the weaviate-docker-compose.yaml file.

Note: For brevity, only a portion of the code is shown here. You can find the full code in GitHub.
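To give you an idea of the shape of that file, here is a minimal sketch of a Weaviate plus multi2vec-bind setup along the lines of Weaviate's documented configuration. The image version tags and extra settings are illustrative assumptions and may differ from the file in the repo.

```yaml
# Minimal sketch of a Weaviate + multi2vec-bind setup (illustrative; the repo's
# weaviate-docker-compose.yaml may use different versions and extra settings).
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.5   # version tag is an assumption
    ports:
      - "8080:8080"     # REST API
      - "50051:50051"   # gRPC, used by the v4 Python client
    environment:
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      ENABLE_MODULES: "multi2vec-bind"                  # activate the ImageBind module
      DEFAULT_VECTORIZER_MODULE: "multi2vec-bind"
      BIND_INFERENCE_API: "http://multi2vec-bind:8080"  # where Weaviate reaches ImageBind
  multi2vec-bind:
    image: cr.weaviate.io/semitechnologies/multi2vec-bind:imagebind
    environment:
      ENABLE_CUDA: "0"   # set to "1" if a GPU is available
```

With a file like this in place, running docker compose -f weaviate-docker-compose.yaml up -d brings up both services locally.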
Weaviate's vector store supports both textual and video embeddings, which enables multimodal search without requiring separate models. The core vector store operations are implemented in vector_store_index.py. Let's break down the key components.

The save_video_in_vector_store function handles the complete video storage pipeline by creating Weaviate collections, converting videos to base64, and storing metadata alongside ImageBind-processed embeddings.

Note: For brevity, only a portion of the code is shown here. You can find the full code in GitHub.
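As a rough illustration, here is what such a function can look like with the Weaviate Python client v4. The collection name VideoSummaries, the property names, and the choice to vectorize only the video blob are my assumptions for the sketch, not necessarily what vector_store_index.py does.

```python
import base64
from datetime import datetime, timezone

import weaviate
from weaviate.classes.config import Configure, DataType, Property


# Illustrative sketch only -- collection and property names are assumptions,
# not necessarily those used in vector_store_index.py.
def save_video_in_vector_store(video_path: str, collection_name: str = "VideoSummaries") -> None:
    client = weaviate.connect_to_local()
    try:
        # Create the collection on first use, vectorized by the multi2vec-bind (ImageBind) module.
        if not client.collections.exists(collection_name):
            client.collections.create(
                collection_name,
                properties=[
                    Property(name="name", data_type=DataType.TEXT),
                    Property(name="timestamp", data_type=DataType.TEXT),
                    Property(name="video", data_type=DataType.BLOB),
                ],
                # Only the video blob is embedded here; the text properties are stored as metadata.
                vectorizer_config=Configure.Vectorizer.multi2vec_bind(video_fields=["video"]),
            )

        # Weaviate expects BLOB properties as base64-encoded strings.
        with open(video_path, "rb") as f:
            video_b64 = base64.b64encode(f.read()).decode("utf-8")

        client.collections.get(collection_name).data.insert({
            "name": video_path,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "video": video_b64,
        })
    finally:
        client.close()
```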
For retrieval, the get_related_videos function accepts natural language queries and performs semantic search across the vector embeddings, returning the top 5 most relevant videos with distance metrics and metadata.

Note: For brevity, only a portion of the code is shown here. You can find the full code in GitHub.
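Again as a hedged sketch with the v4 client, reusing the assumed collection name from above (the real function in vector_store_index.py may be structured differently):

```python
import weaviate
from weaviate.classes.query import MetadataQuery


# Illustrative sketch only -- reuses the assumed "VideoSummaries" collection from above.
def get_related_videos(query: str, collection_name: str = "VideoSummaries", limit: int = 5) -> list[dict]:
    client = weaviate.connect_to_local()
    try:
        collection = client.collections.get(collection_name)
        # near_text embeds the query with the same ImageBind model that embedded the videos,
        # so a plain text query can be compared against video vectors directly.
        response = collection.query.near_text(
            query=query,
            limit=limit,
            return_metadata=MetadataQuery(distance=True),
        )
        return [
            {
                "name": obj.properties.get("name"),
                "timestamp": obj.properties.get("timestamp"),
                "distance": obj.metadata.distance,  # smaller = more similar (cosine distance)
            }
            for obj in response.objects
        ]
    finally:
        client.close()
```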
Together, these functions create a seamless experience where users can naturally search through video content without worrying about exact matches or specific keywords.
I integrated the vector store functionality directly into SummarizeMe's main workflow, defined in app.py. After generating a video summary, the application automatically stores the video in Weaviate and performs a sample search to demonstrate the retrieval capabilities. If you run app.py, SummarizeMe will:
- Prompt you for the input file path
- Transcribe the audio/video using Amazon Transcribe
- Summarize the transcription and extract key points and action items using Anthropic's Claude 3 Haiku in Amazon Bedrock (a sketch of this call follows the list)
- Create a video summary using HeyGen (feat. my avatar!)
- Store the video summary in the vector store
- Search related videos using semantic queries
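For the Bedrock step, here is a minimal sketch of what the summarization call can look like with boto3's Converse API. The model ID is the public Claude 3 Haiku identifier, but the prompt wording, region, and inference settings are assumptions rather than the exact code in app.py.

```python
import boto3

# Hypothetical sketch of the summarization step -- prompt wording, region, and
# inference settings are assumptions; see app.py in the repo for the real call.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def summarize_transcript(transcript: str) -> str:
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{
            "role": "user",
            "content": [{
                "text": "Summarize the following meeting transcript and list the "
                        "key points and action items:\n\n" + transcript
            }],
        }],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
```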
Here is what you see if you run app.py in the Terminal app.

For demonstration purposes, when running app.py, I not only ingest the generated meeting summary video but also include two test videos: one of a playful cat and another of a running dog. This allows us to see how ImageBind's multimodal understanding works in practice. When performing a search with the phrase "action items", the meeting summary video shows a smaller distance (around 0.80), while the cat and dog videos show a larger distance (around 0.87-0.90). Smaller distance values indicate higher similarity; larger values indicate lower similarity. The default distance metric in Weaviate is cosine distance.
Note that I did not embed the actual transcription with the video or any other text metadata, other than a timestamp. We're truly performing cross-modal retrieval, finding similar videos based on the text query.
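Sticking with the hypothetical get_related_videos sketch from earlier, that sample search boils down to something like this:

```python
# Sample cross-modal search, reusing the get_related_videos sketch from earlier.
for video in get_related_videos("action items"):
    # In the run described above, the meeting summary came back at a distance of
    # roughly 0.80, while the cat and dog test videos landed around 0.87-0.90
    # (smaller cosine distance = more similar to the query).
    print(f"{video['name']}  distance={video['distance']:.2f}")
```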
By the way, I've also tested the new Amazon Q Developer inline chat feature in my VSCode IDE while building this code. For instance, since I often forget how to create timestamps 🙈, I selected the relevant line of code, pressed ⌘ + I on Mac (or Ctrl + I on Windows), and asked Q Developer to help me out.
Inline chat in Amazon Q Developer lets you describe issues or ideas directly in the code editor and returns a diff so you can see exactly what code will be added and removed.
ImageBind, developed by Meta AI, is a groundbreaking AI model that can understand and create connections between six different modalities: images (including video), text, audio, depth, thermal (infrared), and IMU (motion) data.
Unlike traditional multimodal models that often require explicit paired training data (e.g., images with captions), ImageBind learns to bind different modalities through what's called "emergent alignment." This means it can understand relationships between modalities even when they weren't explicitly trained together.
ImageBind's key characteristics include:
- Joint Embedding Space
  - All modalities are mapped to the same high-dimensional vector space
  - This allows for direct comparison between different types of media
  - Enables "cross-modal" retrieval (e.g., finding videos using text queries)
- Zero-Shot Capabilities
  - Can handle new combinations of modalities without additional training
  - Understands relationships that weren't explicitly shown during training
- Unified Architecture
  - Uses a single model for all modalities
  - More efficient than maintaining separate models for each modality
  - Reduces complexity in deployment and maintenance
In SummarizeMe, I'm using ImageBind through Weaviate's multi2vec-bind module, defined in weaviate-docker-compose.yaml.

Note: For brevity, only a portion of the code is shown here. You can find the full code in GitHub.
When I store a video, ImageBind:
- Processes the video content to understand visual elements, motion, and temporal relationships
- Creates embeddings that capture the semantic meaning of the video
- Processes any associated text (like timestamps or names)
- Maps both the video and text embeddings to the same vector space
This allows for sophisticated queries like:
- Finding videos with similar visual content
- Retrieving videos based on textual descriptions
- Identifying thematically related content across different meetings
Let me break down how semantic search processes and retrieves video content in two distinct phases.
Weaviate indexes both video and text embeddings simultaneously to facilitate cross-modal retrieval. When a new video summary is created, the system:
- Converts videos to base64 encoding for standardized storage
- Uses ImageBind to process the video content and create visual embeddings
- Embeds text metadata using the same model for unified understanding
- Stores both embeddings in Weaviate's vector space
When a user searches for content, the system:
- Converts user queries into the same embedding space
- Leverages Weaviate to conduct similarity searches across both text and video embeddings
- Ranks results based on distance (the default distance metric in Weaviate is cosine; smaller distance values indicate higher similarity)
- Returns matched videos with contextual metadata
This approach gives SummarizeMe two key capabilities:
- Multimodal Understanding: The system interprets relationships between text and video content, enabling more intelligent, concept-based search.
- Semantic Search: Rather than relying on exact matches, the system finds conceptually similar content.
By incorporating ImageBind through Weaviate, I've created a system that understands the content of my video summaries, making information retrieval more natural and effective.
- 💻 The complete SummarizeMe code is available in this GitHub repo.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.