Multimodal AI with Llama 3.2 on Amazon Bedrock

This blog explores Meta's Llama 3.2 multimodal models on Amazon Bedrock, highlighting OCR, diagram analysis, predictive maintenance, and multimodal RAG applications.

Published Oct 5, 2024
Last Modified Oct 8, 2024
Meta's Llama 3.2 is a new collection of large language models (LLMs) that are now available on Amazon Bedrock. Llama 3.2 represents an important advancement in multimodal AI capabilities, combining sophisticated language processing with powerful image understanding.
The Llama 3.2 models come in different sizes, from small and medium-sized vision-enabled models (11B and 90B parameters) to lightweight text-only models (1B and 3B) optimized for edge and mobile devices. These models excel not only at language tasks, but also at image-related applications, going beyond what was previously possible with open-source multimodal models.
The availability of Llama 3.2 on Amazon Bedrock allows developers and researchers to easily use these advanced AI models within Amazon's robust and scalable cloud infrastructure. This integration opens up new opportunities to create innovative applications that leverage the multimodal capabilities of Llama 3.2, such as visual reasoning, image-guided text generation, and enhanced user experiences. This blog post provides an overview of Llama 3.2 11B's multimodal capabilities on Amazon Bedrock. You can refer to the GitHub repo for examples of the following four use cases:
  • OCR — Simple text extraction and extraction from nested structures
  • Diagram analysis — Comparing molar mass versus boiling point for some fictitious organic compounds to demonstrate capabilities beyond its training data
  • Predictive maintenance — Detecting dents and repairs in cars from images
  • Multi-modal RAG (Retrieval-Augmented Generation) — Allowing users to supply both text and images as input for querying, comparing, and analyzing data.
If you are interested in reviewing the above use cases with the Anthropic Claude Sonnet model, refer to my blog here.

Summary of Llama 3.2 11B model

  • Multimodal model - input text and image. Suitable for use cases requiring image analysis, document processing, and multimodal chatbots.
  • Max tokens: 128K
  • Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
As of now, Meta's Llama 3.2 11B model is available via cross-region inference. Head to the documentation page for details on coverage in your region and to get the cross-region inference profile identifier.
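As a quick way to find that identifier, the sketch below lists the system-defined inference profiles in a region and filters for Llama 3.2 entries. This is a minimal sketch assuming a recent boto3 version with the ListInferenceProfiles API; verify the response field names and the region against the current Bedrock documentation.

```python
import boto3

# Control-plane Bedrock client (not bedrock-runtime); the region is illustrative.
bedrock = boto3.client("bedrock", region_name="us-east-1")

# List the system-defined (cross-region) inference profiles available in this
# region and keep only the Llama 3.2 entries.
profiles = bedrock.list_inference_profiles()["inferenceProfileSummaries"]
for profile in profiles:
    if "llama3-2" in profile["inferenceProfileId"]:
        print(profile["inferenceProfileId"], "-", profile["inferenceProfileName"])

# The printed profile ID is what you pass as modelId when calling the
# converse or invoke_model APIs shown later in this post.
```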

Bedrock Playground, Two APIs - Converse API and Invoke Model

You can access the Llama 3.2 model from the Bedrock playground - Text or Chat.
Amazon Bedrock - Playground
Llama 3.2 models support both the Amazon Bedrock invoke_model and converse APIs. You can get the API specifications from the references section.
Below is an example of the Converse API. As you can see, it is a standardized API specification that works the same way across models:
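Here is a minimal sketch of a Converse API call that sends an image of the manual page along with a text prompt. The inference profile ID, region, file name, and prompt are illustrative assumptions, not values from the original post; confirm the identifier available in your region.

```python
import boto3

# Cross-region inference profile ID for Llama 3.2 11B (assumption -- confirm
# the exact identifier for your region in the Bedrock console or docs).
MODEL_ID = "us.meta.llama3-2-11b-instruct-v1:0"

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Illustrative input image: a page from the automobile manual.
with open("user_manual_page.png", "rb") as f:
    image_bytes = f.read()

response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    messages=[
        {
            "role": "user",
            "content": [
                {"text": "Extract the recommended tire pressure table from the TPMS section of this page."},
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            ],
        }
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```

Because the Converse API uses the same message schema across model providers, the same call works with other multimodal models on Bedrock by swapping out the modelId.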
If you want to access the model via the invoke_model API, you can do the following:
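A comparable sketch with invoke_model is shown below. Unlike converse, invoke_model uses each provider's native request and response format; for Meta Llama models that means a prompt string plus generation parameters. The chat-template wrapping, model ID, and prompt here are assumptions for illustration.

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Native Meta Llama request body: a prompt string (wrapped in the Llama 3
# chat template) plus generation parameters.
body = {
    "prompt": (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
        "Summarize what a Tire Pressure Monitoring System does."
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    ),
    "max_gen_len": 512,
    "temperature": 0.2,
    "top_p": 0.9,
}

response = bedrock_runtime.invoke_model(
    modelId="us.meta.llama3-2-11b-instruct-v1:0",  # assumption: cross-region profile ID
    body=json.dumps(body),
)

result = json.loads(response["body"].read())
print(result["generation"])
```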

Exploring Multimodal Capabilities

Let's explore a couple of multimodal capabilities of the Llama 3.2 11B model. This page is from an automobile manual, vertically divided into two sections. The right portion covers the Tire Pressure Monitoring System (TPMS) and includes a table with recommended tire pressures.
User manual
Llama is able to extract the specific section when asked:
Extracted table
The model also demonstrates its ability to describe images for predictive maintenance when it reviewed the image below.
SRS
Below is its observation:
For the rest of the use cases, you can find the code in this GitHub repo.

References


Thank you for taking the time to read and engage with this article. Your support in the form of following me and sharing the article is highly valued and appreciated. The views expressed in this article are my own and do not necessarily represent the views of my employer. If you have any feedback or topics you want me to cover, please reach me at https://www.linkedin.com/in/gopinathk/
 
