
Rippletide Eval CLI

Rippletide CLI is an evaluation tool for AI agents

Analytics
Developer Tools
Artificial Intelligence

Rippletide CLI is an interactive terminal tool to evaluate AI agent endpoints directly from your command line. It generates questions from the agent’s knowledge, supports predefined questions for reproducible benchmarking, and delivers clear hallucination KPIs. Get instant feedback on performance with real-time progress, automatic evaluation, and detailed reports.

Top comment

As an early engineer at Rippletide, I've spent countless hours testing AI agents and getting frustrated with all the vague performance metrics.

That's why we built Rippletide CLI: a terminal tool that lets you benchmark your AI agent directly from the command line. It generates questions from the agent's own knowledge, supports reproducible test sets, and gives clear KPIs on hallucinations.

Everything runs in real time with automatic evaluation and detailed reports, so you actually see where your agent struggles.

Would love to hear what the PH community thinks and get feedback from fellow AI builders 🚀

Comment highlights

Very interesting. Something like Langfuse? Or do you have a different goal?

Voted! How does the tool automatically generate test questions from an agent’s "knowledge"? Is it based on RAG retrieval results or existing documents? Is the generation logic controllable?

Hi,

The idea of a terminal-first evaluation tool for AI agents is solid: it gives developers clear metrics on hallucination rates and agent behavior before production, instead of opaque dashboards or manual tests. That’s a real pain point when you’re building autonomous systems and need trustworthy, reproducible signals early in the cycle.

One area the current page could sharpen is connecting those technical outputs to business impact: for example, how much time teams save on regression testing, or how improved evaluation correlates with deployment confidence and fewer escapes/bugs in prod builds. Right now it’s clear what it does, but not yet why this matters to teams beyond engineers.

I help early SaaS and developer tool founders refine their positioning and landing copy so technical products are presented as clear productivity and risk-reduction stories that convert. If you decide to tighten how Rippletide’s value is communicated around measurable engineering outcomes, I can help craft that messaging.

Best,

Paul

Congrats on the launch! Love how Rippletide Eval CLI brings reproducible, CLI-native evals and concrete hallucination KPIs for serious agent teams.

May I ask if there is a way to evaluate a human's ability to utilize AI tools?

Congrats on the launch! Love the Rippletide CLI-first approach; benchmarking agents where they actually run feels way more honest than abstract dashboards. How granular are the hallucination KPIs, and can you trace them back to specific prompts or knowledge gaps?

A lot of eval stacks lean on LLM-as-a-judge and people struggle with score variance and trust: what is Rippletide’s core approach to scoring hallucinations, and how do you handle the hardest case—when the agent’s answer is partially correct, partially unsupported, or the “truth” isn’t explicitly in the knowledge source?

Congrats on the launch! A lot of the time, evaluation metrics are heavily tied to the use case. Will the generated questions and evals be specifically tuned for particular scenarios/industries?

Does the benchmarking feature allow us to compare historical runs side-by-side to track drift over time?

Hi all, very excited to present an Agent evaluation module today!

As AI engineers, my team and I struggled to reliably tell whether the latest version of an agent was actually performing well or not.
So we built a module to evaluate agents, and we’re open-sourcing the hallucination measurement part.

How it works:

1 – Connect your agent
Use our CLI to provide your agent endpoint (localhost works).
Connect the data your agent needs. Today, we support PostgreSQL databases, internal APIs, and Pinecone as a vector store. If you’d like to add a new source, feel free to open a PR on the repo.
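
To make step 1 concrete, here is a minimal sketch of the kind of localhost endpoint you could point the CLI at, assuming a simple JSON question-in / answer-out contract. The route name, payload shape, and answer_question helper are illustrative assumptions, not the CLI’s actual contract:

```python
# Minimal sketch of a localhost agent endpoint (assumed contract:
# POST a JSON question, get a JSON answer back). This is not the
# Rippletide CLI's actual request/response schema.
from flask import Flask, request, jsonify

app = Flask(__name__)

def answer_question(question: str) -> str:
    # Placeholder for your agent logic (LLM call, RAG pipeline, etc.).
    return "..."

@app.route("/ask", methods=["POST"])
def ask():
    payload = request.get_json(force=True)
    answer = answer_question(payload.get("question", ""))
    return jsonify({"answer": answer})

if __name__ == "__main__":
    # Point the CLI at this endpoint, e.g. http://127.0.0.1:8000/ask
    app.run(host="127.0.0.1", port=8000)
```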

2 – Launch tests
Tests are automatically generated to evaluate your agent’s behavior and make sure no possible wrong behavior is left out, so you stay safe. You can also add your own test set if needed.
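
If you bring your own test set, the sketch below shows one hypothetical way to pin questions (and optional expected facts) so every run benchmarks the same behavior; the actual file format the CLI consumes is not specified here:

```python
# Hypothetical shape of a user-supplied, reproducible test set.
# The real format the Rippletide CLI accepts may differ.
import json

CUSTOM_TESTS = [
    {
        "question": "What is the refund window for annual plans?",
        "expected_facts": ["The refund window is 30 days."],
    },
    {
        "question": "Which regions is the service available in?",
        "expected_facts": ["The service is available in the EU and the US."],
    },
]

with open("custom_tests.json", "w") as f:
    json.dump(CUSTOM_TESTS, f, indent=2)
```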

3 – Understand what failed
For each test question, we check every fact in the agent’s answer and verify whether it has a reference in our graph. We then explain where additional data is needed to improve your agent.
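
As a rough illustration of that checking step, here is a minimal sketch assuming an answer can be split into atomic facts and each fact is either supported by a reference or not. extract_facts and is_supported are hypothetical placeholders, not Rippletide’s graph-based implementation:

```python
# Toy fact-checking loop: split answers into facts, flag facts with no
# supporting reference, and report the unsupported share as a
# hallucination rate. A real system would use an LLM/parser for claim
# extraction and a knowledge-graph lookup instead of substring matching.
from typing import List

def extract_facts(answer: str) -> List[str]:
    # Placeholder: naive sentence split standing in for claim extraction.
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported(fact: str, references: List[str]) -> bool:
    # Placeholder: substring match standing in for a graph reference check.
    return any(fact.lower() in ref.lower() for ref in references)

def hallucination_rate(answers: List[str], references: List[str]) -> float:
    facts = [f for a in answers for f in extract_facts(a)]
    if not facts:
        return 0.0
    unsupported = [f for f in facts if not is_supported(f, references)]
    return len(unsupported) / len(facts)

rate = hallucination_rate(
    answers=["The refund window is 30 days. Support is available 24/7."],
    references=["The refund window is 30 days for annual plans."],
)
print(f"hallucination rate: {rate:.1%}")  # 1 of 2 facts unsupported -> 50.0%
```

A deployment threshold like the one below would then be a simple comparison against this rate over a full test run.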

You can then improve the agent on the Rippletide platform and re-test it. We believe that when an agent reaches less than 1% hallucinations, it can be deployed in production. Some use cases require 0.1% or even 0.01%, depending on volume or industry.

Feel free to ask any questions, or reach out if you’d like to know more about what we’re building.

Cheers,
Yann