How to develop a tag-based web searcher using AI

This service makes saved web pages easier to manage and find by tagging them automatically and supporting tag-based search.

Hyunjoong Shin
Amazon Employee
Published Apr 18, 2024

Technologies used

  • Streamlit
  • LangChain
  • Amazon Bedrock (anthropic.claude-instant-v1)
  • Amazon OpenSearch Service

Purpose of development

As the number of websites and web pages has grown rapidly in recent years, it has become harder for individuals to manage the pages they save. With a large collection, finding and reusing a page is difficult because the main content of each page has to be checked one by one.

To address this, the service takes a web page URL entered by the user and uses an AI model to produce key information and a suitable tag for that page. Specifically, it summarizes the overall content of the page and extracts a representative keyword as a tag. Because this information is stored with each page as a brief description and a tag, users can search and manage their pages much more efficiently; selecting a tag immediately shows the related pages.

The service is an AI-based tagging solution for managing saved web pages, and it is expected to be useful to individuals and companies that keep large numbers of them.

Brief description of the feature

  1. Users register a web page they want to bookmark by entering its URL.
  2. Once the page is registered and the app refreshes, the AI analyzes the page and automatically generates a suitable tag.
  3. Users can click a generated tag to find other pages with the same tag.
  • Tag-based search lets you quickly browse pages with similar content.
  • The tag and summary stored for each page also let you grasp the content of a saved page at a glance.
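
The register-then-browse flow above can be pictured with a small in-memory sketch. The `Page` class, the `pages_with_tag` helper, and the sample records are hypothetical and used only for illustration; the actual service keeps these records in OpenSearch and gets the description and tag from the model.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """One bookmarked page: the three fields the service stores."""
    site_url: str
    description: str
    tags: str  # a single lowercase keyword


def pages_with_tag(pages, tag):
    """Return every saved page whose tag equals the selected tag."""
    return [p for p in pages if p.tags == tag]


# Hypothetical sample data standing in for the search index.
saved = [
    Page("https://www.amazon.com/", "Online store home page.", "products"),
    Page("https://example.com/shop", "A small shop front.", "products"),
    Page("https://example.com/blog", "Personal blog posts.", "blog"),
]

# Clicking the "products" tag corresponds to this lookup.
for page in pages_with_tag(saved, "products"):
    print(page.site_url, "-", page.description)
```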

Step 1. Configure AWS credentials

  • Create a new IAM user.
  • Attach the AmazonBedrockFullAccess and AmazonOpenSearchServiceFullAccess policies to the IAM user you created.
  • In a command prompt, run aws configure and enter the access key of the user created above.
  • For the region, choose the region in which the OpenSearch domain was created.
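
The region requirement in the last bullet can be checked with a short script. This is a minimal sketch using only the standard library: `configured_region` and `endpoint_region` are hypothetical helper names, the sample values are made up, and the endpoint is assumed to follow the `https://search-<name>-<id>.<region>.es.amazonaws.com/` pattern.

```python
import configparser
from urllib.parse import urlparse


def configured_region(config_text, profile="default"):
    """Read the region for a profile from the text of an AWS config file."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Non-default profiles are stored as "[profile <name>]" in ~/.aws/config.
    section = profile if profile == "default" else f"profile {profile}"
    return parser.get(section, "region", fallback=None)


def endpoint_region(endpoint):
    """Extract the region from a domain endpoint in the assumed format."""
    host = urlparse(endpoint).hostname or ""
    parts = host.split(".")
    # Expected parts: [search-..., <region>, "es", "amazonaws", "com"]
    if len(parts) >= 5 and parts[-3:] == ["es", "amazonaws", "com"]:
        return parts[-4]
    return None


# Made-up sample values; replace them with your own config text and endpoint.
sample_config = "[default]\nregion = us-west-1\noutput = json\n"
sample_endpoint = "https://search-mydomain-abc123.us-west-1.es.amazonaws.com/"

if configured_region(sample_config) == endpoint_region(sample_endpoint):
    print("Configured region matches the OpenSearch endpoint.")
else:
    print("Region mismatch: rerun aws configure with the domain's region.")
```

To check a real setup, the config text would typically be read from the `.aws/config` file in your home directory.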


Step 2. Configure OpenSearch

  • Open OpenSearch Dashboards and go to Management > Security > Roles, then search for all_access.
  • In all_access, open Mapped users and click Manage mapping.
  • Map the ARN of the IAM user created in the previous step.
  • In OpenSearch Dashboards, go to Management > Dev Tools.
  • Create the index with the following settings.
PUT site_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_custom_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "stop",
              "snowball"
            ]
          }
        }
      }
    }
  }
}
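
As written, the index settings define my_custom_analyzer, but no field mapping refers to it, so documents are indexed with OpenSearch's default dynamic mappings. One possible way to put the analyzer to use is sketched below: the script builds a request body that keeps the settings above and adds a `mappings` section. The mapping choices (analyzing `description` with the custom analyzer, storing `site_url` and `tags` as `keyword`) are assumptions for illustration and are not part of the original setup.

```python
import json

# Analyzer settings copied from the index definition above.
settings = {
    "index": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "stop", "snowball"],
                }
            }
        }
    }
}

# Assumed mappings (not in the original article).
mappings = {
    "properties": {
        "site_url": {"type": "keyword"},  # exact-match URL
        "description": {"type": "text", "analyzer": "my_custom_analyzer"},
        "tags": {"type": "keyword"},  # one exact tag per page
    }
}

index_body = {"settings": settings, "mappings": mappings}

# Print a body that could be pasted after "PUT site_index" in Dev Tools.
print(json.dumps(index_body, indent=2))
```

Mappings like these have to be supplied when the index is created, so this body would replace the settings-only request rather than follow it.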
  • Still in Dev Tools, index the following example document.
PUT site_index/_doc/1
{
  "site_url": "https://www.amazon.com/",
  "description": "This is a 1 line summary of the amazon homepage.",
  "tags": "products"
}
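
To confirm that the example document can be found by its tag, a search body can be generated and pasted into Dev Tools. The sketch below mirrors the query shape the application uses later in this article; `build_tag_query` is a hypothetical helper name introduced here for illustration.

```python
import json


def build_tag_query(tag):
    """Build a search body that matches documents by tag."""
    return {
        "_source": ["site_url", "description", "tags"],
        "query": {"bool": {"must": [{"match": {"tags": tag}}]}},
    }


# Print a request that can be pasted into Dev Tools as-is.
print("GET site_index/_search")
print(json.dumps(build_tag_query("products"), indent=2))
```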

Step 3. Set up the code

  • Open an IDE such as PyCharm.
  • Install the required libraries as follows.
pip install streamlit
pip install streamlit-pills
pip install requests
pip install requests_aws4auth
pip install boto3
pip install langchain
pip install langchain-community
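
The application code in this step also imports `streamlit_pills`, `requests`, `langchain_core`, and `langchain_community`, so it helps to confirm that every module is importable before running the app. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the module list is taken from the import statements in the code below.

```python
import importlib.util

# Top-level modules imported by the application code in this step.
REQUIRED_MODULES = [
    "streamlit",
    "streamlit_pills",
    "requests",
    "requests_aws4auth",
    "boto3",
    "langchain",
    "langchain_core",
    "langchain_community",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```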
  • Save the code below in a Python file.
  • Change the region and host in the code to match your environment.
import boto3
import requests
import streamlit as st
from requests_aws4auth import AWS4Auth
from streamlit_pills import pills

from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_community.chat_models import BedrockChat

def opensearch_post(index, datatype, site_url, description, tags):
    """Index one document (URL, summary, tag) into OpenSearch."""
    host = '<Opensearch Domain endpoint>'  # domain endpoint with trailing /
    region = '<Opensearch Domain Region>'  # e.g. us-west-1
    service = 'es'
    # Sign the request with SigV4 using the credentials set by aws configure.
    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)
    url = host + index + '/' + datatype
    headers = {"Content-Type": "application/json"}
    document = {"site_url": site_url, "description": description, "tags": tags}
    r = requests.post(url, auth=awsauth, json=document, headers=headers)
    r.raise_for_status()  # surface permission or request errors early
    return r.json()


def insert_url(site_url):
    """Ask Claude on Bedrock for a one-line summary and a tag, then index them."""
    bedrock_runtime = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1",  # change to a region where the model is enabled
    )
    model_kwargs = {
        # "max_tokens_to_sample": 2048,
        "temperature": 0.0,
        "top_k": 250,
        "top_p": 1,
        "stop_sequences": ["\n\nHuman"],
    }
    model_id = "anthropic.claude-instant-v1"
    # The parser's format instructions ask the model to reply with a JSON object.
    parser = JsonOutputParser()
    prompt = PromptTemplate(
        template="Answer the user query.\n{format_instructions}\n{query}\n",
        input_variables=["query"],
        partial_variables={
            "format_instructions": parser.get_format_instructions()},
    )
    model = BedrockChat(
        client=bedrock_runtime,
        model_id=model_id,
        model_kwargs=model_kwargs,
    )
    chain = prompt | model | parser
    # Only the URL is sent to the model; the page itself is not downloaded here.
    query = "Summarize the document at the site url below in 1 line and extract 1 lowercase noun as a keyword. Put the summary in 'description' and the keyword in 'tags'. site url: {}".format(site_url)
    response = chain.invoke({"query": query})
    # The model may return the tag as a list; store a single string either way.
    tags = response['tags']
    if isinstance(tags, list):
        tags = tags[0]
    opensearch_post("site_index", "_doc", site_url, response['description'], tags)

def configure(host, region, service):
    """Build SigV4 auth for the given region and service (helper, not called below)."""
    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)
    return awsauth

# @st.cache_data
def opensearch_requests(index, datatype, query):
    """Send a search request to OpenSearch and return the parsed JSON response."""
    host = '<Opensearch Domain endpoint>'  # domain endpoint with trailing /
    region = '<Opensearch Domain Region>'  # e.g. us-west-1
    service = 'es'

    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)

    url = host + index + '/' + datatype
    headers = {"Content-Type": "application/json"}

    # OpenSearch accepts a query body on GET requests to _search.
    r = requests.get(url, auth=awsauth, json=query, headers=headers)
    r.raise_for_status()  # surface permission or request errors early
    return r.json()

# Load the stored URLs and tags to build the tag list and detect duplicates.
tag_lst, site_url_lst = [], []
query = {
    "_source": ["site_url", "tags"],
    "size": 1000,  # OpenSearch returns only 10 hits by default; adjust to your data volume
    "query": {
        "match_all": {}
    }
}
response = opensearch_requests("site_index", "_search", query)
for hit in response['hits']['hits']:
    site_url_lst.append(hit['_source']['site_url'])
    tag_lst.append(hit['_source']['tags'])

with st.form("my_form"):
    st.title("Register websites and search by tag")

    url = st.text_input("Please enter the site URL")
    submit = st.form_submit_button("Submit")
    if submit:
        if url not in site_url_lst:
            insert_url(url)
            st.success("The URL has been registered.")
            st.rerun()  # reload so the new tag appears in the list
        else:
            st.error("This URL is already registered.")

# Show each distinct tag as a clickable pill.
tag_lst = sorted(set(tag_lst))
selected = pills("Category", tag_lst)

if selected:
    query = {
        "_source": ["site_url", "description", "tags"],
        "query": {
            "bool": {
                "must": [
                    {"match": {"tags": selected}}
                ]
            }
        }
    }
    response = opensearch_requests("site_index", "_search", query)
    # Render each matching page in its own bordered box.
    for hit in response['hits']['hits']:
        container = st.container(border=True)
        container.write(hit['_source']['site_url'])
        container.write(hit['_source']['description'])
  • Once the application code is saved, run it with the command below.
streamlit run <your python file>.py
  • If everything is configured correctly, the app opens in your browser.
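
Note that the prompt in `insert_url` passes only the URL to the model. A Bedrock model invocation does not fetch web pages by itself, so the summary and tag rest on whatever the model can infer from the URL. One possible refinement, sketched below under that assumption, is to download the page, strip the markup, and put the visible text into the query. `extract_text` and `build_query` are hypothetical helpers, and the 4,000-character limit is an arbitrary choice.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, max_chars=4000):
    """Return the page's visible text, truncated to max_chars characters."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)[:max_chars]


def build_query(site_url, page_text):
    """Compose a query that carries the page text, not just its URL."""
    return (
        "Summarize the document below in one line and extract one lowercase "
        "noun as a keyword. Return the summary under 'description' and the "
        "keyword under 'tags'.\n"
        f"site url: {site_url}\n"
        f"document: {page_text}"
    )


# Made-up HTML standing in for a downloaded page.
sample_html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Shop</h1><p>Buy books.</p><script>var x = 1;</script></body></html>"
)
print(build_query("https://example.com/", extract_text(sample_html)))
```

In the app, the HTML could come from `requests.get(site_url, timeout=10).text`, and the result of `build_query` would take the place of the `query` string passed to `chain.invoke`.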

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
