Adding LLM capabilities to Crawl4AI in 30 Minutes

As builders we are always on the lookout for efficient tools, I recently stumbled upon Crawl4AI, an open-source web scraping library that simplifies web crawling and data extraction for large language models (LLMs) and AI applications.

Excited to test it out, I cloned the repository and began exploring its capabilities. However, I quickly realized there was a catch: Crawl4AI didn't support Amazon Bedrock, which meant i couldn't leverage the models I'm used to using.

Instead of abandoning the tool or spending days figuring out how to add support myself, I saw an opportunity.

Could I use Amazon Q, an AI-powered coding assistant, to integrate Bedrock into Crawl4AI? This challenge would not only solve my immediate need but also test the limits of AI-assisted coding in real-world scenarios.

In this post, I'll walk you through how I was able to update the code in just 30 minutes.

Understanding Crawl4AI

Crawl4AI is an impressive open-source tool that simplifies web crawling and data extraction for AI applications. Its core strength lies in its ability to efficiently collect and process web data, making it an invaluable asset for developers working with AI systems.

Here's a quick example of Crawl4AI in action:

1
2
3
4
5
6
7
8
9
10
11
12
13
from crawl4ai import WebCrawler

# Create an instance of WebCrawler
crawler = WebCrawler()

# Warm up the crawler (load necessary models)
crawler.warmup()

# Run the crawler on a URL
result = crawler.run(url="https://aws.amazon.com/blogs/aws/category/artificial-intelligence/generative-ai/")

# Print the extracted content
print(result.markdown)

While exploring Crawl4AI's LLM capabilities, I hit a roadblock no support for Amazon Bedrock. Adding this support typically requires understandingCrawl4AI's codebase a task that could take hours, especially with unfamiliar code.

Enter Amazon Q's /dev feature, a potential game-changer for tackling such challenges.

The Solution: Leveraging Amazon Q /dev

With the goal of adding Amazon Bedrock support, I decided to put Amazon Q /dev to the test. Here's how the process unfolded:

1. Analyzing the Repository

Looking at the README, I identified the `LLMExtractionStrategy` module as the key area to inspect . The extraction_strategy.py file contained the perform_completion_with_backoff function, which seemed to be where the LLM was being invoked.

1
2
3
4
5
6
7
8
9
10
11
12
13
result = crawler.run(
        url=url,
        word_count_threshold=1,
        extraction_strategy= LLMExtractionStrategy(
            provider= "anthropic/claude-3.5-sonnet", api_token = os.getenv('ANTHROPIC_API_KEY'), 
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
            Do not miss any models in the entire content. One extracted model JSON format should look like this: 
            {"model_name": "claude-3.5-sonnet", "input_fee": "US$15.00 / 1M tokens", "output_fee": "US$3.00 / 1M tokens"}."""
        ),            
        bypass_cache=True,
    )

2. Engaging Amazon Q /dev

With a clear target in mind, I turned to Amazon Q /dev. I crafted a prompt explaining my objective:

I need to update the repo to allow for Amazon Bedrock to be a provider. The idea is that we will have bedrock as a provide and a model_id. Then we will have the code us that for making an request in with a new "perform_completion_with_backoff" function for bedrock. This would use this function for the api request. ...

Within seconds, Q /dev generated a code snippet that looked promising. It included the necessary imports, a new function to call the code i provided to invoke Amazon Bedrock, and modifications to the existing code to incorporate the new provider.

Image not found

Amazon Q /dev in action

3. Refining the Generated Code

While the initial output from Amazon Q /dev was impressive, it wasn't perfect. I noticed two issues that required attention:

1. The generated code hard coded specific Bedrock models, which could limit future flexibility.

2. It included explicit AWS credential handling, which is unnecessary when using boto3's automatic credential management.

I promptly fed this feedback back to Q /dev, asking for a more flexible and AWS-friendly approach. The revised code addressed these concerns, providing a more robust and future-proof solution.

Good start, the thing with Bedrock is that there are always new models, so lets not constrain it by "hard coding" the model e.g bedrock/anthropic.claude-v2:1

Also boto3 will get the creds automatically in the environment so lets not do os.getenv("AWS_ACCESS_KEY_ID")

Image not found

4. Implementing and Testing

With the refined code in hand, I integrated it into the Crawl4AI codebase. To test the new functionality, I set up a sample script to extract information from the AWS news blog using the newly added Bedrock support.

Image not found

Upon running the test, I encountered an error:

"content": "cannot access local variable 'content' where it is not associated with a value"

This hiccup showcased the importance of developer intuition in the AI-assisted coding process. Upon inspection, I realized that the library was trying to extract content from the LLM response in a way that worked for other providers but not for the Amazon Bedrock implementation.

To resolve this issue, I added a simple conditional statement:

1
2
3
4
if self.provider.startswith("bedrock"):
    content = response
else:
    content = response.choices[0].message.content

This small adjustment bridged the gap between the existing code and our new Bedrock implementation.

5. Success!

With this final piece in place, I ran the test script again. This time, it worked flawlessly, successfully extracting information from the AWS news blog using the newly added Bedrock support.

Image not found

crawl4ai working with Amazon Bedrock

Why This Matters for Developers

This experience with Amazon Q /dev highlights three key implications for software development:

Rapid Integration in Unfamiliar Territory: In just 30 minutes, I added significant new functionality to a codebase I had never worked with before. This dramatic speed of integration opens up new possibilities for quick prototyping and feature additions, even when facing unfamiliar code.
AI as a Coding Partner, Not a Replacement: Amazon Q /dev served as an intelligent assistant, handling initial implementation details. However, my developer expertise was crucial for understanding where to integrate the code and how to debug issues. This synergy between AI assistance and human oversight represents a new paradigm in coding efficiency.
Accelerated Learning and Adaptation: By working with AI-generated code in a new codebase, I gained insights into both Crawl4AI's structure and leveraging Amazon Bedrock's API. This process not only solved an immediate problem but also served as a powerful learning tool, demonstrating how AI can help developers quickly adapt to new technologies and codebases.

These three aspects showcase how AI tools like Amazon Q /dev can dramatically accelerate the development process while emphasizing the ongoing importance of human expertise in guiding and refining AI-assisted solutions.

Getting Started with Amazon Q /dev

The most exciting part? Amazon Q /dev is free to use and doesn't require an AWS account. you can install it VScode right now.

Remember, while AI-assisted coding is powerful, it's not a replacement for developer expertise. As demonstrated in this post, understanding the problem domain, being able to debug issues, and knowing how to integrate AI-generated code into existing systems are still crucial skills.

Have you tried Amazon Q /dev? How has it impacted your development process? Share your experiences in the comments below!

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.

Select your cookie preferences

Site Terms, Privacy, and more.

Adding LLM capabilities to Crawl4AI in 30 Minutes

Learn how AI tools can speed up your development process

Understanding Crawl4AI

The Solution: Leveraging Amazon Q /dev

1. Analyzing the Repository

2. Engaging Amazon Q /dev

3. Refining the Generated Code

4. Implementing and Testing

5. Success!

Why This Matters for Developers

Getting Started with Amazon Q /dev

1 Comment