AWS | Community | Reduce hallucinations using feedback - Amazon Bedrock multi-modal capabilities

Multi-modal generative AI capabilities of Amazon Bedrock provide an alternative, easy on-ramp into the world of image analysis and object recognition. However, hallucinations are common when using generative AI. This blog presents a technique that uses generative AI to gather feedback on its own output and reduce hallucinations.

(Disclaimer: The content below indicates the art of the possible. There are opportunities to refactor and reduce duplicate code, optimize runtime efficiency, and improve the prompts.)

Step 1: Set Up Your Environment

First, ensure you have Python installed on your system. For this project, you'll also need to install boto3, which is the Amazon Web Services (AWS) SDK for Python. This library allows you to create, configure, and manage AWS services, such as the Amazon Bedrock models.

pip install boto3

- I will use Anthropic Claude 3 Sonnet for this exercise. You can find details of supported models and regions here: https://docs.aws.amazon.com/bedrock/latest/userguide/models-regions.html.

- Set up model access for your account and region following the link here: https://catalog.workshops.aws/building-with-amazon-bedrock/en-US/prerequisites/bedrock-setup.

The following snippet of code imports required libraries and sets up basic logging:

import json
import logging
import base64
import boto3

from botocore.exceptions import ClientError
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

Step 2: Define a function for running a multi-modal prompt

I will define a function for invoking the Amazon Bedrock API. This function is used both for identifying objects in the image and for validating whether those objects are truly present in the image.

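def run_multi_modal_prompt(bedrock_runtime, messages, max_tokens):
    # Build the request body in the Anthropic Messages format used on Bedrock
    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": messages
        }
    )
    # model_id is a module-level variable defined in Step 3
    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    # The response body is a stream; read it and parse the JSON payload
    response_body = json.loads(response.get('body').read())
    return response_body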
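
To exercise the request shape without calling AWS, a minimal sketch can swap in a stub client. The FakeBedrockRuntime class and the explicit model_id parameter below are assumptions made for illustration and are not part of the original code. The stub only mimics the invoke_model(body=..., modelId=...) call and the body.read() access used above.

```python
import io
import json


def run_multi_modal_prompt(bedrock_runtime, model_id, messages, max_tokens):
    """Variant of the function above that takes model_id explicitly (assumption)."""
    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": messages,
        }
    )
    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    return json.loads(response.get("body").read())


class FakeBedrockRuntime:
    """Hypothetical stand-in for the boto3 client; records the last request."""

    def __init__(self, canned_text):
        self.canned_text = canned_text
        self.last_request = None

    def invoke_model(self, body, modelId):
        self.last_request = {"modelId": modelId, "body": json.loads(body)}
        payload = {"content": [{"type": "text", "text": self.canned_text}]}
        # The real client returns a streaming body; BytesIO offers the same read().
        return {"body": io.BytesIO(json.dumps(payload).encode("utf-8"))}


fake_client = FakeBedrockRuntime('{"objects": {"Sofa": "center of the room"}}')
messages = [{"role": "user", "content": [{"type": "text", "text": "Describe the image."}]}]
result = run_multi_modal_prompt(fake_client, "example-model-id", messages, 100)
print(result["content"][0]["text"])
```

Passing the model ID as an argument, rather than reading a global, also makes it easier to try a different model for each step.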

Step 3: Define the AI model, prompt, and image identification function

For this blog, I will use the Anthropic Claude 3 Sonnet model. Depending on the size and complexity of the image, you may need to adjust the max_tokens parameter. The prompt states what is expected of the model and the schema of the response. Since object detection is often used for downstream processing, specifying a schema makes output validation and parsing simpler.

model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'
max_tokens = 2000
prompt_text_identify_objects = """
Accurately identify each object and its location in this image. 
If there are multiple objects of the same type, identify each separately. 
After identifying, double-check that each identified object is present in the given picture. 
Ignore architectural features like floor, wall, etc. Give response in JSON format using double quotes.
Sample JSON format: {"objects": {"Object-1 Name": "Object-1 location", "Object-2 name": "Object-2 location"}}. 
This is only a sample JSON document with two placeholder items. You will likely have more objects in the image.
"""
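
Because the prompt fixes a schema, the parsed output can be checked before downstream use. The following is a minimal sketch of such a check; the parse_objects helper is a hypothetical name and is not part of the original code. It assumes the model returns a JSON document shaped like the sample in the prompt.

```python
import json


def parse_objects(json_string):
    """Parse the model output and check it matches {"objects": {name: location}}.

    Raises ValueError when the document does not follow the expected schema.
    """
    document = json.loads(json_string)  # json.JSONDecodeError is a ValueError
    if not isinstance(document, dict) or "objects" not in document:
        raise ValueError('missing top-level "objects" key')
    objects = document["objects"]
    if not isinstance(objects, dict):
        raise ValueError('"objects" must map object names to locations')
    for name, location in objects.items():
        if not isinstance(location, str):
            raise ValueError(f"location for {name!r} is not a string")
    return objects


good = '{"objects": {"Sofa": "center of the room", "Lamp": "next to the sofa"}}'
print(parse_objects(good))

bad = '{"objects": ["Sofa", "Lamp"]}'
try:
    parse_objects(bad)
except ValueError as error:
    print(f"rejected: {error}")
```

Raising ValueError for every schema problem gives the caller a single exception type to retry on.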

Using the above parameters, define the function that identifies the objects in the image. Note that the input image needs to be base64 encoded and passed as part of the API payload.

def get_objects_from_model(input_image, prompt_text):
    try:
        bedrock_runtime = boto3.client(service_name='bedrock-runtime')
        with open(input_image, "rb") as image_file:
            content_image = base64.b64encode(image_file.read()).decode('utf8')

        message = {
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": content_image}},
                {"type": "text", "text": prompt_text}
            ]
        }

        messages = [message]
        response = run_multi_modal_prompt(bedrock_runtime, messages, max_tokens)
        logger.debug(response)

        content = response["content"]
        first_item = content[0]
        text = first_item["text"]

        # Extract the JSON string from the text
        start = text.index("{")
        end = text.rindex("}") + 1
        json_string = text[start:end]
        logger.debug(json_string)

        text_json = json.loads(json_string)
        objects = text_json["objects"]
        return objects
    except Exception as e:
        logger.error("Exception occurred: %s", e)   
        return None
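
The function above hard-codes "image/jpeg" as the media type, which matches the sample image. For other files, one option is to infer the media type from the file name. The sketch below assumes a hypothetical build_image_block helper, and the set of supported types is an assumption that should be checked against the model documentation.

```python
import base64
import mimetypes

# Image formats assumed to be accepted by the model; verify against the model docs.
SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def build_image_block(file_name, image_bytes):
    """Return an image content block with a media type inferred from the file name."""
    media_type, _ = mimetypes.guess_type(file_name)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported or unknown image type for {file_name}: {media_type}")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(image_bytes).decode("utf8"),
        },
    }


# Placeholder bytes stand in for real image data in this illustration.
block = build_image_block("sample-image.jpg", b"abc")
print(block["source"]["media_type"], block["source"]["data"])
```

The same block could then be shared by get_objects_from_model and validate_objects, which would also remove some of the duplicated code.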

Step 4: Define the prompt and function to validate objects

The validation prompt asks the model to classify each object as present, not present, or unsure. The name of each identified object is appended to the prompt below.

prompt_text_validate_objects = "Answer if the given object can be identified in the image.  If the identified object is present in the image, respond with 'Yes'. If the identified objects are not present in the image, respond with 'No'. If you are unsure, respond with 'Unsure'. Give response in JSON format.  Use double quotes for constructing json objects. Don't add newline characters.  Does the image contain "

This function verifies whether an object identified by the get_objects_from_model function is actually present in the image.

def validate_objects(input_image, prompt_text):
    try:
        bedrock_runtime = boto3.client(service_name='bedrock-runtime')

        with open(input_image, "rb") as image_file:
            content_image = base64.b64encode(image_file.read()).decode('utf8')

        message = {
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": content_image}},
                {"type": "text", "text": prompt_text}
            ]
        }

        messages = [message]

        response = run_multi_modal_prompt(bedrock_runtime, messages, max_tokens)
        logger.debug(response)

        content = response["content"]
        logger.debug(content)
        return content
    except Exception as e:
        logger.error("Exception occurred: %s", e)   
        return None
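
The validation prompt asks for JSON but does not name the key, so the reply may need some normalization before it is compared with 'Yes'. The sketch below shows one way to do that; extract_answer is a hypothetical helper, and falling back to "Unsure" on unparseable output is an assumption rather than the original behavior.

```python
import json

VALID_ANSWERS = ("Yes", "No", "Unsure")


def extract_answer(text):
    """Return "Yes", "No", or "Unsure"; fall back to "Unsure" when parsing fails."""
    try:
        # Tolerate extra prose around the JSON document
        start = text.index("{")
        end = text.rindex("}") + 1
        document = json.loads(text[start:end])
    except ValueError:
        return "Unsure"
    # The model may use either "response" or "answer" as the key
    for key in ("response", "answer"):
        value = document.get(key)
        if isinstance(value, str):
            normalized = value.strip().capitalize()
            if normalized in VALID_ANSWERS:
                return normalized
    return "Unsure"


print(extract_answer('{"response": "Yes"}'))
print(extract_answer('Here is the result: {"answer": "no"}'))
print(extract_answer("I cannot tell."))
```

Since only a 'Yes' keeps an object in the list, treating unparseable replies as "Unsure" errs on the side of removing the object.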

Step 5: Putting it all together and running

The main() function runs image identification followed by feedback evaluation. Although I tried to "enforce" specific output schemas using detailed prompts, the model's output occasionally does not comply with the specified schema. To account for variations in the model's response, two techniques are showcased here.

1. Retry on failure: Occasionally, the object schema returned by the model in the get_objects_from_model function varies, which causes parsing and data extraction to fail. Simply re-running the model invocation provides better outcomes. To avoid cost overruns, the number of attempts is limited to a specific number (3 in this blog).

2. Alternate pattern matching: In the validate_objects function, the model occasionally returns values mapped to different keys. To account for this variation, alternate keys are checked during evaluation.
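
The retry technique can be sketched as a small reusable wrapper. The call_with_retries helper and the flaky_identify stand-in below are hypothetical and only illustrate the control flow; no model is invoked.

```python
import logging

logger = logging.getLogger(__name__)


def call_with_retries(operation, max_attempts=3):
    """Call operation() until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = operation()
        except ValueError as error:  # e.g. json.JSONDecodeError from malformed output
            logger.warning("Attempt %d of %d raised: %s", attempt, max_attempts, error)
            continue
        if result is not None:
            return result
        logger.warning("Attempt %d of %d returned nothing", attempt, max_attempts)
    return None


# Stand-in for a model call that fails twice before succeeding (illustration only).
attempts_made = []


def flaky_identify():
    attempts_made.append(1)
    if len(attempts_made) < 3:
        return None
    return {"Sofa": "center of the room"}


objects = call_with_retries(flaky_identify)
print(objects, len(attempts_made))
```

The wrapper treats both a raised ValueError and a None result as a failed attempt, so it works whether the wrapped function raises or swallows its parsing errors.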

def main():
    input_image_1 = "./sample-image.jpg"

    # The JSON output from the model can be inconsistent, so try up to 3 times.
    # get_objects_from_model logs any exception and returns None on failure.
    max_retries = 3
    objects = None
    for attempt in range(1, max_retries + 1):
        objects = get_objects_from_model(input_image_1, prompt_text_identify_objects)
        if objects is not None:
            logger.info(json.dumps(objects, indent=4))
            break
        logger.error("Attempt %d of %d did not return objects.", attempt, max_retries)

    if objects is None:
        logger.error("Maximum retries reached. Exiting.")
        return

    keys = list(objects.keys())
    logger.info(keys)
    key_list_to_be_removed = []

    for key in keys:
        validation_result = validate_objects(input_image_1, prompt_text_validate_objects + str(key) + " ? ")

        response_value = None
        try:
            text_field = validation_result[0]["text"]
            validation_result_inner_dict = json.loads(text_field)
            # The model sometimes uses the key "response" and sometimes "answer"
            response_value = validation_result_inner_dict.get("response", validation_result_inner_dict.get("answer"))
        except (TypeError, KeyError, IndexError, AttributeError, ValueError) as e:
            # Treat a missing or unparseable validation result as "not confirmed"
            logger.error("Could not parse validation result for %s: %s", key, e)

        if response_value != "Yes":
            key_list_to_be_removed.append(key)

    logger.info("Objects to be removed :" + str(key_list_to_be_removed))

    for key in key_list_to_be_removed:
        objects.pop(key)

    logger.info(json.dumps(objects, indent=4))

if __name__ == "__main__":
    main()

Run the code from the command line:

$ python multimodal-image-analysis-with-bedrock.py

Sample output for the image (sample-image.jpg) provided in the git repository:

INFO:__main__:{
    "Ceiling light fixtures": "On the ceiling, multiple irregular-shaped pendant light fixtures",
    "Curtains": "Hanging on windows near the outdoor scenery",
    "Sofas": "Two gray sofas in the center of the living room",
    "Coffee table": "A rectangular wooden coffee table between the sofas",
    "Potted plant": "A potted plant near the corner of the room",
    "Floor lamp": "A standing floor lamp next to one of the sofas",
    "Area rug": "A large dark-colored area rug underneath the coffee table",
    "Wall art": "A framed artwork hanging on the wall",
    "Books": "Books placed on the coffee table"
}
INFO:__main__:['Ceiling light fixtures', 'Curtains', 'Sofas', 'Coffee table', 'Potted plant', 'Floor lamp', 'Area rug', 'Wall art', 'Books']
INFO:__main__:Objects to be removed :['Books']
INFO:__main__:{
    "Ceiling light fixtures": "On the ceiling, multiple irregular-shaped pendant light fixtures",
    "Curtains": "Hanging on windows near the outdoor scenery",
    "Sofas": "Two gray sofas in the center of the living room",
    "Coffee table": "A rectangular wooden coffee table between the sofas",
    "Potted plant": "A potted plant near the corner of the room",
    "Floor lamp": "A standing floor lamp next to one of the sofas",
    "Area rug": "A large dark-colored area rug underneath the coffee table",
    "Wall art": "A framed artwork hanging on the wall"
}

In the above output, "Books" is initially identified as an object in the given image. During feedback processing with the validate_objects function, the model did not confirm that "Books" is present in the image, so it was removed from the list. The lists of objects before and after validation are logged to the console. Note that if you run this code, your output might differ from the output above.

Summary

This blog presents a technique for identifying objects in images and then seeking feedback from the same model to reduce hallucinations. The schema validation and fail-safes implemented in main() could instead be included in the get_objects_from_model() and validate_objects() functions. Invoking the model repeatedly for validation increases accuracy but also increases the cost of the overall solution. Using different Bedrock models for different steps could help balance accuracy and cost. The full code for this article can be found at https://github.com/gopinaath/multimodal-image-analysis-bedrock.
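
To make the cost trade-off concrete, the sketch below counts model invocations per step. It relies on the fact that the code above makes one validation call per identified object. Pairing the identification model with a smaller model for validation is an untested assumption shown only for illustration, and count_invocations is a hypothetical helper.

```python
# Model IDs per step. The identification model is the one used in this post;
# using a smaller model for validation is an assumption, not a recommendation.
MODEL_IDS = {
    "identify": "anthropic.claude-3-sonnet-20240229-v1:0",
    "validate": "anthropic.claude-3-haiku-20240307-v1:0",
}


def count_invocations(object_count, identify_attempts=1):
    """Return the number of calls per model; one validation call per object."""
    return {
        MODEL_IDS["identify"]: identify_attempts,
        MODEL_IDS["validate"]: object_count,
    }


# The sample run above identified 9 objects on the first attempt.
for model, calls in count_invocations(9).items():
    print(f"{model}: {calls} call(s)")
```

Whether a smaller model validates objects as reliably would need to be measured on your own images.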

If you'd like to dive deeper into this, see the hands-on workshop Building with Amazon Bedrock.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
