
Multimodal AI with Llama 3.2 on Amazon Bedrock

This blog explores Meta's Llama 3.2 multimodal models on Bedrock, highlighting OCR, diagram analysis, predictive maintenance, and multimodal AI apps.

Published Oct 5, 2024
Last Modified Oct 8, 2024
Meta's Llama 3.2 is a new collection of large language models (LLMs) now available on Amazon Bedrock. Llama 3.2 represents an important advancement in multimodal AI capabilities, combining sophisticated language processing with powerful image understanding.
The Llama 3.2 models come in different sizes, from small and medium-sized vision-enabled models (11B and 90B parameters) to lightweight text-only models (1B and 3B) optimized for edge and mobile devices. These models excel not only at language tasks, but also at image-related applications, going beyond what was previously possible with open-source multimodal models.
The availability of Llama 3.2 on Amazon Bedrock allows developers and researchers to easily use these advanced AI models within Amazon's robust and scalable cloud infrastructure. This integration opens up new opportunities to create innovative applications that leverage the multimodal capabilities of Llama 3.2, such as visual reasoning, image-guided text generation, and enhanced user experiences. This blog post gives an overview of the Llama 3.2 11B model's multimodal capabilities on Amazon Bedrock. You can reference the Github repo for examples of the following four use cases:
  • OCR — Simple text extraction and extraction from nested structures
  • Diagram analysis — Comparing molar mass versus boiling point for some fictitious organic compounds to demonstrate its capabilities beyond its training data
  • Predictive maintenance — Detecting dents and repairs in cars from images
  • Multi-modal RAG (Retrieval-Augmented Generation) — Allowing users to supply both text and images as input for querying, comparing, and analyzing data.
If you are interested in reviewing the above use cases with the Anthropic Claude Sonnet model, refer to my blog here.

Summary of Llama 3.2 11B model

  • Multimodal model - input text and image. Suitable for use cases requiring image analysis, document processing, and multimodal chatbots.
  • Max tokens: 128K
  • Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
As of now, Meta's Llama 3.2 11B model is available via cross-region inference. Head to the documentation page for details on coverage in your region and to get the cross-region inference profile identifier.

Bedrock Playground, Two APIs - Converse API and Invoke Model

You can access the Llama 3.2 model from the Bedrock playground - Text or Chat.
[Image: Amazon Bedrock - Playground]
Llama 3.2 models support both Amazon Bedrock APIs: invoke_model and converse. You can get the API specifications from the references section.
Below is an example of the Converse API. As you can see, it is a standardized API specification:
```python
import boto3

# Assumes studio_region and meta_llama_model_id are defined elsewhere.
session = boto3.Session()
br = session.client("bedrock-runtime", region_name=studio_region)

prompts = ["Command: Write an overview on abrasives, body fillers, glazes used for auto repair."]
text_prompts = []
for p in prompts:
    text_prompts.append({"text": p})

messages = [{"role": "user", "content": []}]
messages[0]["content"].extend(text_prompts)

inference_config = {"maxTokens": 2048, "temperature": 1.0, "topP": 0.9}

response = br.converse(
    modelId=meta_llama_model_id,
    messages=messages,
    inferenceConfig=inference_config,
)
```
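The Converse API returns a structured response whose generated text sits inside the assistant message's content blocks. Below is a minimal sketch of pulling that text out, using a hypothetical sample payload in place of a live call (the field values are illustrative; the structure mirrors the Converse response shape):

```python
# Hypothetical sample payload mirroring the Converse response structure;
# a live br.converse(...) call returns the same shape.
sample_response = {
    "output": {
        "message": {
            "role": "assistant",
            "content": [{"text": "Abrasives are materials used to sand and shape surfaces."}],
        }
    },
    "stopReason": "end_turn",
    "usage": {"inputTokens": 25, "outputTokens": 310, "totalTokens": 335},
}

def extract_text(response: dict) -> str:
    """Concatenate the text blocks of the assistant message."""
    parts = response["output"]["message"]["content"]
    return "".join(p.get("text", "") for p in parts)

print(extract_text(sample_response))
```

Because the Converse API is model-agnostic, the same extraction works unchanged if you later swap in a different model ID.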
If you want to access the model via the invoke_model API, you do the following:
```python
import json

# Assumes br (a bedrock-runtime client) and meta_llama_model_id are defined.
native_request = {
    "prompt": "Command: Write an overview on abrasives, body fillers, glazes used for auto repair.",
    "max_gen_len": 512,
    "temperature": 0.5,
}

# Convert the native request to JSON.
request = json.dumps(native_request)

# Invoke the model with the request.
response = br.invoke_model(modelId=meta_llama_model_id, body=request)
```
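Unlike converse, invoke_model returns the model's native JSON wrapped in a streaming body. Here is a sketch of decoding it, using an in-memory stub in place of the real streaming body; the field names follow Llama's native response format, and the values are illustrative:

```python
import io
import json

# Stub standing in for the streaming body that invoke_model returns.
raw = json.dumps({
    "generation": "Abrasives smooth and level surfaces before painting.",
    "prompt_token_count": 18,
    "generation_token_count": 120,
    "stop_reason": "stop",
}).encode("utf-8")
response = {"body": io.BytesIO(raw)}

# Read and decode the body, then pull out the generated text.
model_output = json.loads(response["body"].read())
print(model_output["generation"])
```

With a real call, `response["body"]` is the boto3 streaming body, but `.read()` and `json.loads` work the same way.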

Exploring Multimodal Capabilities

Let's explore a couple of multimodal capabilities of the Llama 3.2 11B model. The page below is from an automobile manual, vertically divided into two sections. The right portion covers the Tire Pressure Monitoring System (TPMS) and includes a table with recommended tire pressures.
[Image: User manual]
Llama is able to extract the specific section when asked:
[Image: Extracted table]
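To send an image like this to the model, the Converse API accepts image blocks alongside text blocks in the same user message. Below is a minimal sketch of building such a request; the image bytes and prompt are illustrative, and the final converse call is commented out because it requires live AWS credentials and the model ID from earlier:

```python
def build_image_message(image_bytes: bytes, prompt: str, fmt: str = "png") -> list:
    """Build a Converse-API message combining an image block and a text block."""
    return [{
        "role": "user",
        "content": [
            {"image": {"format": fmt, "source": {"bytes": image_bytes}}},
            {"text": prompt},
        ],
    }]

# Illustrative placeholder bytes; in practice, read them from an image file.
messages = build_image_message(
    b"\x89PNG...", "Extract the recommended tire pressure table from this page."
)
# response = br.converse(modelId=meta_llama_model_id, messages=messages)
```

Note that the image goes in as raw bytes, not base64, when using the Converse API through boto3.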
The model's ability to describe images is also useful for predictive maintenance. Here is how it handled the dashboard image below:
[Image: SRS]
Below is its observation:
The image shows a car dashboard displaying various warning lights. The "SRS Malfunction Service Required" light is illuminated, indicating a problem with the vehicle's Supplemental Restraint System (SRS), also known as the airbag system.

**Possible Causes**

* A malfunction in the SRS system, such as a faulty airbag sensor or a issue with the airbag itself
* A loose or corroded electrical connection
* A short circuit in the SRS wiring harness
* An issue with the SRS computer or control module

**Recommended Actions**

* Have a professional mechanic inspect the vehicle and diagnose the problem
* Repair or replace any faulty components identified during the inspection
* Follow the manufacturer's recommendations for resetting the SRS system after repairs are made

**Importance of Prompt Action**

Ignoring the warning light can result in:

* A decrease in the effectiveness of the airbags in the event of an accident
* A risk of injury or death if the airbags do not deploy properly
* Further damage to the vehicle's electrical system

**Conclusion**

The illuminated "SRS Malfunction Service Required" light indicates a serious issue with the vehicle's airbag system. Prompt attention from a professional mechanic is crucial to ensure the safety of the occupants and prevent further damage to the vehicle.
For the rest of the use cases, you can find the code in this Github repo.

References


Thank you for taking the time to read and engage with this article. Your support in the form of following me and sharing the article is highly valued and appreciated. The views expressed in this article are my own and do not necessarily represent the views of my employer. If you have any feedback or topics you want me to cover, please reach me at https://www.linkedin.com/in/gopinathk/
 
