Vibe training for AI agent reliability. Describe what your agent should and should not do — Plurai generates training data, validates it, and deploys a custom model in minutes. It feels like vibe coding, but for evaluation and guardrails. No labeled data. No annotation pipeline. No prompt engineering. Under the hood, small language models deliver sub-100ms latency, 8x lower cost than GPT-as-judge, and over 43% fewer failures. Always on, not sampled. Built on published research (BARRED).
Hey Product Hunt, Ilan from Plurai here.
We spent the last year on a research problem: can you train a production-grade eval or guardrail from just a task description, with no labeled data and no annotation pipeline?
Turns out you can. We call it vibe-training.
Most teams today rely on LLM-as-a-judge. It never fully converges, it breaks on edge cases, and at 100ms per call it collapses economically at scale. So teams sample instead of evaluating everything, and failures happen between the samples, invisibly.
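To make the sampling trade-off concrete, here is a back-of-the-envelope sketch. Every number in it (traffic, per-call prices, sample rate) is an assumed illustration, not real pricing; the only figure taken from above is the 8x cost ratio.

```python
# Back-of-the-envelope: sampled LLM-as-judge vs. always-on small model.
# All numbers are illustrative assumptions, not published pricing or traffic.

TRAFFIC_PER_DAY = 1_000_000      # agent interactions per day (assumed)
JUDGE_COST = 0.004               # $/call for a frontier-LLM judge (assumed)
SLM_COST = JUDGE_COST / 8        # the 8x cost reduction claimed above
SAMPLE_RATE = 0.05               # a typical sampling regime (assumed)

sampled_cost = TRAFFIC_PER_DAY * SAMPLE_RATE * JUDGE_COST
always_on_cost = TRAFFIC_PER_DAY * SLM_COST
missed = int(TRAFFIC_PER_DAY * (1 - SAMPLE_RATE))

print(f"sampled judge: ${sampled_cost:,.0f}/day, {missed:,} interactions never evaluated")
print(f"always-on SLM: ${always_on_cost:,.0f}/day, every interaction evaluated")
```

Even when the always-on bill comes out somewhat higher in absolute terms, the 95% of traffic the sampled regime never inspects is exactly where the invisible failures live.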
Plurai lets you describe what your agent should and should not do. The platform generates training data, validates it through a multi-agent debate process, and deploys a custom small language model in minutes.
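For a feel of the workflow, here is a hypothetical sketch of what that could look like from the caller's side. The client calls, endpoints, and payload fields are illustrative assumptions, not Plurai's published API:

```python
# Hypothetical sketch of the vibe-training flow; endpoints and fields are
# assumptions for illustration, not Plurai's documented API.
import requests

API = "https://app.plurai.ai/api/v1"   # assumed base URL for illustration
HEADERS = {"Authorization": "Bearer <PLURAI_API_KEY>"}

# 1. Describe in plain language what the agent should and should not do.
spec = {
    "name": "billing-agent-guardrail",
    "should": ["answer refund questions from the documented 30-day policy"],
    "should_not": ["promise refunds outside policy",
                   "reveal internal discount codes"],
}

# 2. One call kicks off data generation, debate validation, and training.
job = requests.post(f"{API}/vibe-train", json=spec, headers=HEADERS).json()

# 3. The deployed small model then scores any interaction trace.
verdict = requests.post(
    f"{API}/models/{job['model_id']}/evaluate",
    json={"trace": "user: can I get a refund after 60 days?\nagent: sure!"},
    headers=HEADERS,
).json()
print(verdict)  # e.g. {"pass": False, "violated": "should_not[0]"}
```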
Results against a GPT-5 LLM-as-judge baseline: over 43% fewer failures, 8x lower cost, sub-100ms latency. Cheap and fast enough to run on every interaction, not just a sample.
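If those numbers hold, the check can sit inline on the hot path rather than in an offline batch. A minimal sketch, assuming a hypothetical evaluate endpoint, a boolean pass field, and a fail-open policy on timeout:

```python
# Minimal sketch of an always-on inline guardrail with a hard latency budget.
# The endpoint and response shape are assumptions, not a documented contract.
import requests

EVALUATE_URL = "https://app.plurai.ai/api/v1/models/<model_id>/evaluate"  # hypothetical

def guarded_reply(trace: str, candidate: str) -> str:
    """Check every candidate reply; sub-100ms model latency fits a 250ms budget."""
    try:
        r = requests.post(
            EVALUATE_URL,
            json={"trace": trace, "candidate": candidate},
            timeout=0.25,  # hard budget so the guardrail can't stall the agent
        )
        if not r.json().get("pass", True):
            return "Sorry, I can't help with that."  # blocked by the guardrail
    except requests.RequestException:
        pass  # fail open here; failing closed is the stricter alternative
    return candidate
```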
The research behind it is public.
Try it free at https://app.plurai.ai. I'd love to hear what eval problem you're working on.
Congrats on the launch! Does it work with all LLMs that provide fine-tuning capabilities?
If this actually reduces hallucinations or cost + policy violations at scale, that's huge!
That's where most of the pain is for me
The multi-turn simulation piece is interesting. Single-prompt evals are easy, but most real failures happen across a sequence of interactions. If this actually captures that well, that’s a meaningful step up from most eval tooling I’ve seen.
So it prevents AI agents from purchasing overpriced courses, right? :D
Are your evaluation algorithms backed by science? Do you have any peer-reviewed papers?
There is a lot of noise in this space.
You've mentioned 43% fewer failures. Was that averaged across task types, or does the industry have specific benchmarks for that?
Oh, this looks really cool, esp the idea of running evals on every interaction (not just samples). Just curious how it performs on more subjective tasks though))) And congrats on the launch, btw :)
the 'LLM as judge breaks at 100ms per call' pain is exactly where most eval pipelines silently rot. you end up with a sampling regime nobody actually trusts. the part i'm curious about is calibration in the wild: when the small model and the original llm-judge disagree on a real production trace, who do you trust, and how do you surface that disagreement to the team? that's usually where these systems either become real or quietly turn into shelfware.
sampling-only eval has a real blind spot: anything that doesn't repeat doesn't get caught. ran into the same thing building eval flows for an AI form filler we work on — by the time a flaky failure shows up twice, you've already shipped it.
the part i can't quite picture is how the multi-agent debate establishes ground truth without existing failure modes — adversarial generation against the task spec is one read, test-time disagreement is the other. one of those would explain how the BARRED setup actually converges.
well, as an AI leader educator, I must say I take my hat off to this. incredible work
The "always on, not sampled" part is what makes this interesting. When I was running engineering at scale, sampling-based quality checks gave us a false sense of security - the failures always happened in the gaps between samples. The LLM-as-judge approach has the same problem but worse: it's expensive enough that teams only run it on a fraction of requests, and the edge cases it misses are exactly the ones that blow up in production. Sub 100ms with small models changes the economics enough to actually evaluate everything. Curious about the cold start experience - when someone describes a new guardrail in plain language, how much iteration does it typically take before the generated eval catches the subtle violations versus just the obvious ones?
Hello world, I'm the product behind the product :)
Vibe training is here to make model training accessible — and to help your agents and LLM apps actually work in production.
Also, we obsessed over both the tech and the UX, so we can't wait to hear your feedback!
About Plurai on Product Hunt
“Vibe-train evals and guardrails tailored to your use case”
Plurai launched on Product Hunt on April 29th, 2026 and earned 331 upvotes and 124 comments, taking #1 Product of the Day.
Plurai was featured in API (98.1k followers), Developer Tools (511.6k followers) and Artificial Intelligence (467.1k followers) on Product Hunt. Together, these topics include over 165.7k products, making this a competitive space to launch in.
Who hunted Plurai?
Plurai was hunted by fmerian. A “hunter” on Product Hunt is the community member who submits a product to the platform, uploading the images and the link and tagging the makers behind it.