Web Bench

A 10x better benchmark for AI browser agents

Compare and benchmark different AI web browsing agents. Web Bench provides comprehensive performance metrics for AI agents navigating the web.

Top comment

TL;DR: Web Bench is a new dataset for evaluating web browsing agents that consists of 5,750 tasks on 452 different websites, with 2,454 tasks being open sourced. It builds on the foundations of WebVoyager, which didn't represent the internet well because it only spanned 15 websites. Anthropic Sonnet 3.7 CUA is the current SOTA, with Skyvern being the best agent for WRITE-heavy tasks. The detailed results are here.

I bet you've seen a bunch of flashy demos of web browsing agents, looked at the crazy high scores on the benchmarks, and excitedly tried them out... only to realize they don't work as well as advertised.

This is because the previous benchmark (WebVoyager) only spanned 643 tasks across 15 websites. While it was a great starting point, it didn't capture the internet's adversarial nature towards browser automation, or the difficulty of tasks that mutate data on a website.

As a result, the Skyvern and Halluminate teams created a new benchmark to better quantify these failures. Our goal was to create a new, consistent measurement system for AI web agents by expanding on the foundations laid by WebVoyager:

  1. Expanding the number of websites from 15 → 452, and tasks from 643 → 5,750, to test agent performance on a wider variety of websites

  2. Introducing the concept of READ vs WRITE tasks

    1. READ tasks involve navigating websites and fetching data

    2. WRITE tasks involve entering data, downloading files, logging in, solving 2FA, etc., and were not well represented in the WebVoyager dataset

  3. Measuring the impact of browser infrastructure (e.g. accessing websites, solving captchas, not crashing)
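To make the READ/WRITE split concrete, here is a minimal sketch of how a benchmark task record might be modeled. The schema, field names, and example tasks are all hypothetical illustrations, not the actual Web Bench format:

```python
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    READ = "read"    # navigate a website and fetch data
    WRITE = "write"  # enter data, download files, log in, solve 2FA, etc.

@dataclass
class BenchTask:
    """Hypothetical task record: one instruction an agent must complete on a site."""
    website: str
    instruction: str
    task_type: TaskType

# Illustrative examples of the two categories (not real Web Bench tasks)
tasks = [
    BenchTask("example.com", "Find the price of the featured item", TaskType.READ),
    BenchTask("example.com", "Create an account and submit the contact form", TaskType.WRITE),
]

write_tasks = [t for t in tasks if t.task_type is TaskType.WRITE]
print(len(write_tasks))  # 1
```

Splitting tasks this way lets per-category scores be reported separately, which is what surfaces the WRITE-heavy weakness discussed below.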

We ran the benchmark and open sourced 2,454 of the tasks to help the industry move towards a new standard, and the results surprised us:

  1. The best model is Anthropic's CUA model

  2. All models did very poorly on WRITE-heavy tasks

  3. Browser infrastructure played a bigger role in the agents' ability to take actions than previously expected

If you're interested, read the full report here.

Have any cool use cases for browser agents? Reply and let me know below 👇