BenchSpan is a benchmarking platform for AI agents. Running benchmarks is slow, expensive, and fragile. We fix that. Onboard your agent once (we onboarded Claude Code in 37 lines), run any benchmark in parallel in the cloud, and get every result in one place your whole team can see. When runs fail halfway, rerun just what broke. Compare runs side by side to see exactly where your agent is improving. Stop fighting your benchmarks and start shipping your agent.
Hey PH 👋, Ritesh from BenchSpan here.
We were building AI agents and needed to know if they were getting better. Sounds simple. It wasn't.
Every benchmark assumed a different interface, so it took days of glue code just to get running. Full suites took 14 hours on a laptop. A single failure at 72% burned $600 in tokens, and we'd start from scratch. Nobody on the team trusted anyone else's numbers because nobody ran the same config. And results? Scattered across CSVs, messages, and spreadsheets nobody could find.
We realized we were spending more time fighting our benchmarks than improving our agent. So we built the tool we wished existed.
How it works
1. Onboard your agent. Write a small bash script that passes the benchmark's standard inputs to your agent (a rough sketch follows this list).
2. Pick a benchmark and run it.
3. Results flow in automatically. Scores, trajectories, errors, timing. Everything captured and tagged with your agent's commit hash so you can compare runs side by side.
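Step 1 is just a thin adapter. Here's a minimal sketch of what one could look like, assuming the harness passes the task prompt and working directory as arguments; the variable names and the `claude -p` invocation are illustrative, not BenchSpan's actual contract:

```bash
#!/usr/bin/env bash
# Hypothetical adapter sketch -- the exact inputs BenchSpan passes may differ.
# Assumption: task prompt arrives as $1, the instance's checked-out code as $2.
set -euo pipefail

TASK_PROMPT="$1"
WORKDIR="$2"

cd "$WORKDIR"

# Run the agent once, non-interactively, against the task.
# (Claude Code's print mode is one example of an agent CLI you could wrap.)
claude -p "$TASK_PROMPT"
```

Anything your agent can do from a shell can be wrapped the same way, which is why there's no framework lock-in.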
What you get
- Any agent that runs via bash. No framework lock-in. No interface to conform to. One-time setup.
- Massively parallel execution. Every instance runs in its own Docker container. 500 instances that took 14 hours on a laptop finish in a fraction of the time.
- Rerun only what failed. Network error on 37 instances? Rerun those 37. Join the results. Stop paying twice.
- Identical environments, every time. Same Docker image, same config, tagged with the exact commit hash. No more "works on my machine."
- One source of truth. Every run, every result, every trajectory — tagged, searchable, comparable. The whole team sees the same thing.
- Smoke tests. Run 5 instances to validate your setup before kicking off a 500-instance run. Catch bugs cheap.
If you're benchmarking agents and have feedback, I'm in the comments 👇