Browser agents: why we aren't really there yet
Simple experiments and failures with news, flight booking, and financial markets
Shreyas Subramanian
Amazon Employee
Published Jan 27, 2025
In the previous article, we saw how the browser-use open source project integrates well with Amazon Bedrock. Using browser-use, we showed how some simple tasks can be automated when a browser is involved. In this article, we try slightly more complex tasks (still very simple for your human friends). The following tasks were tested:
- Summarizing the top news from a news website (here, CNN.com)
- Looking for cheap flights from IAD to SEA on:
  - Google Flights
  - Kayak.com
  - United.com
- Summarizing details about top ETFs on finance websites, such as:
  - Bloomberg.com
  - Yahoo Finance
Here is what the screen recordings of the agent show:
- News summarization: Simple tasks like summarizing the news on a website do not involve many actions such as clicking or scrolling. A screenshot of the page displays the headlines and a short abstract of each article. Therefore navigating to the webpage (handled by browser-use) and extracting the screenshot (also browser-use) are easy. Image understanding by Claude Sonnet 3.5 v2 is also easy in this case. The agent may still get distracted by ads in the header section, dropdowns that open on hover, and so on.
- Flight booking: Don't rush to provide your credit card details yet. In all three cases (Google Flights, Kayak.com, and United.com), the agent fails to list and compare options even though all the required details are provided: origin, destination, and dates of the journey.
  - For Google Flights, there is confusion between the text box element, the '+' element to add airports, and the dropdown element that shows up immediately after typing. The expected behavior, which is to type, wait for the dropdown, and then click the right airport in the dropdown, does not happen. After multiple rounds of this failure, the agent tries an alternative approach and clicks the explore button. After "exploring" the explore page, which is meant for open-ended, origin-only searches, the agent returns to the main search screen and incorrectly validates that the job was done.
  - Kayak.com has similar UI elements, and the dynamic dropdown that appears once an airport is typed in confuses the agent. Several click failures are seen in the date selection box. A realistic distraction comes in the form of a popup from Kayak saying that an airport has not been selected yet. The popup is not successfully closed, but the validator incorrectly marks the task as completed.
  - For United.com, IAD is already prefilled, so the task reduces to adding SEA to the destination box and then selecting the right dates. Once again the agent tries the "alternative" approach of going to the explore tab. On all airline booking websites, the explore tab usually leads to a tougher, longer solution path unless the task is open-ended, and especially so in this case, where all booking details are provided.
- Market ETFs summary:
  - Bloomberg (correctly) recognizes the automated traffic and blocks the agent. Good job, Bloomberg! This is supposed to happen. Bigger picture, agents and systems should get good at calling APIs and respecting rate limits.
  - For Yahoo Finance, there is no real activity even though the prompt mentions where to find the ETF-related information. The Yahoo Finance page is also information-rich and more complicated to navigate than the booking websites.
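The point about calling APIs and respecting rate limits can be sketched roughly as follows. This is an illustration under stated assumptions, not something that was run in these experiments: `fetch` is a hypothetical callable standing in for whatever HTTP client or data API is used, and it is assumed to return a status code, an optional retry hint in seconds, and a body.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical response shape: (status_code, retry_after_seconds_or_None, body)
Response = Tuple[int, Optional[float], str]


def polite_get(
    fetch: Callable[[], Response],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch`, backing off whenever the server answers 429 (Too Many Requests)."""
    for attempt in range(max_attempts):
        status, retry_after, body = fetch()
        if status == 200:
            return body
        if status != 429:
            # Any other refusal (for example 403) is treated as a hard "no":
            # stop instead of hammering the site.
            raise RuntimeError(f"request refused with status {status}")
        if attempt < max_attempts - 1:
            # Prefer the server's own hint; otherwise back off exponentially.
            delay = retry_after if retry_after is not None else base_delay * (2 ** attempt)
            sleep(delay)
    raise RuntimeError("rate limit still in effect after all attempts")
```

The `sleep` parameter is injectable only so the behavior can be checked without actually waiting. A blocked request, as in the Bloomberg case, would surface as an error rather than a retry storm.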
Finally, here is a playlist of the GIFs generated by the browser-use tool. Interestingly, what the agent believes it has done is very different from what it has actually done.
https://giphy.com/channel/w601sxs/agent-gifs
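One way to narrow the gap between what the agent believes and what actually happened is to validate completion against the observed page state rather than the agent's own summary. The sketch below is hypothetical and not part of browser-use: `FlightTask`, `PageState`, and `unmet_conditions` are illustrative names, the date is a placeholder, and the page state is assumed to have been extracted by some other step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FlightTask:
    origin: str
    destination: str
    depart_date: str  # ISO date string; the value used below is a placeholder


@dataclass
class PageState:
    # Hypothetical fields, assumed to be read from the live page.
    origin_field: str = ""
    destination_field: str = ""
    date_field: str = ""
    open_popups: List[str] = field(default_factory=list)
    result_count: int = 0


def unmet_conditions(task: FlightTask, page: PageState) -> List[str]:
    """Return the reasons the task is not done; an empty list means done."""
    problems = []
    if page.origin_field != task.origin:
        problems.append("origin not set")
    if page.destination_field != task.destination:
        problems.append("destination not set")
    if page.date_field != task.depart_date:
        problems.append("date not set")
    if page.open_popups:
        problems.append("popup still open")
    if page.result_count == 0:
        problems.append("no flight results listed")
    return problems


task = FlightTask("IAD", "SEA", "2025-02-10")
# Roughly the Kayak situation: origin typed, popup open, nothing listed.
stuck = PageState(origin_field="IAD", open_popups=["airport not selected"])
print(unmet_conditions(task, stuck))
```

A validator along these lines would refuse to mark the task complete while a popup is open or no results are listed, whatever the model reports.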
Some misunderstandings that cause failures include:
- Not recognizing that the current way is easier than the potential alternative. This is not surprising, as the model does not have this world knowledge or any domain-specific prompts.
- An unexpected popup or distraction breaks the flow.
- Infinite loop after landing on the wrong page. The agent rarely hits the "back" button. The context and history of navigation are stored, but the core model used (here Sonnet) ignores them.
- Hallucinating next steps: there is no way to "accept the cookie policy" and move forward here.
- Prematurely ending the task. As seen here, the agent is not on the ETF page. Although the instructions explained how to navigate there, the agent ends up on this page and summarizes what it sees. For those who are curious, this is the page with a summary of top ETFs: https://finance.yahoo.com/markets/etfs/top/
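The infinite-loop failure in particular lends itself to a guard that sits outside the model. Here is a minimal sketch, assuming the navigation history is available as a list of (URL, action) pairs. The function name, the history shape, and the thresholds are illustrative assumptions, not browser-use features.

```python
from collections import Counter
from typing import List, Tuple

Step = Tuple[str, str]  # (url, action description)


def should_go_back(history: List[Step], window: int = 6, max_repeats: int = 3) -> bool:
    """Return True when one (url, action) pair keeps repeating in the recent history."""
    recent = history[-window:]
    if not recent:
        return False
    # Count how often the most common step occurs within the window.
    _, count = Counter(recent).most_common(1)[0]
    return count >= max_repeats
```

When the guard fires, the orchestrating code could trigger a back navigation or force a re-plan instead of letting the model retry the same step.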
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.