Metoro is an AI SRE for systems running in Kubernetes. Metoro autonomously monitors your environment, detecting incidents in real time. After it detects an incident, it root-causes the issue and opens a pull request to fix it. You just get pinged with the fix. Metoro brings its own telemetry via eBPF at the kernel level, which means no code changes or configuration are required. A single helm install and you're up and running in less than 5 minutes.
Hey PH! We're Chris & @ece_kayan, the founders of Metoro.
We built Metoro because dealing with production issues is still far too manual.
Teams are shipping faster than ever with AI, but when something breaks, engineers still end up jumping between dashboards, logs, traces, infra state, and code changes just to figure out what happened and how to fix it.
We started working on this back in 2023 during YC’s S23 batch, and learned a hard lesson from customers early on: generalized AI SRE doesn't work reliably for two reasons.
Every system is different. The architecture is different. Some teams run on VMs, some on Lambdas, some on managed services, some on Kubernetes, others on mixtures of all of them.
On top of that, telemetry is usually inconsistent. Some services have traces, some don’t. Some have structured logs, some barely log at all. Metrics are named differently everywhere.
This means that teams need to spend weeks or even months writing system docs, adding runbooks, and instrumenting services before the AI SRE can be useful. That wasn't workable.
So we took a different approach.
With Metoro, we generate telemetry ourselves at the kernel level using eBPF. That gives us consistent telemetry out of the box with zero code changes required. No waiting around for teams to instrument services. No huge observability blind spots.
And because Metoro is built specifically for Kubernetes, the agent already understands the environment it’s operating in. It doesn’t need to learn a brand new architecture every time.
The result is an AI SRE that works out of the box in under 5 minutes.
We automatically monitor your infrastructure and applications. When we detect an issue, we investigate and root-cause it. Once we have the root cause, we automatically generate a pull request to fix it, whether that's application code or infrastructure configuration. Detect, root cause, fix.
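For anyone curious what the "single helm install" setup would look like in practice, here's a rough sketch. The repo URL, chart name, namespace, and `apiKey` value below are placeholders for illustration, not Metoro's actual values:

```shell
# Add the chart repository (URL is a placeholder, not Metoro's real repo)
helm repo add metoro https://example.com/metoro-helm-charts
helm repo update

# Install the agent into its own namespace
# (release name, chart name, and values are illustrative)
helm install metoro-agent metoro/metoro \
  --namespace metoro --create-namespace \
  --set apiKey=<your-api-key>
```

Because the agent collects telemetry with eBPF at the kernel level, this one install is the whole onboarding step: no sidecars, SDKs, or per-service instrumentation to roll out afterwards.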
We’re really excited to be launching on Product Hunt today 🚀
We’d love for you to check it out, try it, and ask us anything. Whether that’s about Metoro, Kubernetes observability, or AI in the SRE space.
@chrisbattarbee Interesting direction. Most tools stop at alerts and dashboards, going into auto-fix is a big step. How do you handle edge cases where the issue isn’t clearly defined or spans multiple services?
@chrisbattarbee Generating telemetry at the kernel level with eBPF to remove the instrumentation overhead is a strong approach. That part makes a lot of sense, especially given how inconsistent telemetry can be across services and teams.
The part that feels much harder is the auto-fix layer. In real systems, issues are rarely isolated. You often have partial signals, cascading failures, or symptoms that look like root causes. In those cases, even getting the diagnosis right is non-trivial, let alone generating a fix that is safe to apply.
How do you validate that a generated PR is actually safe in production and not just technically correct in isolation? For example, avoiding cases where the fix resolves one symptom but introduces regressions elsewhere or conflicts with existing infra assumptions.
I’ve been working in a similar space on the code side with Codoki.ai (AI code review and automated fixes), and even at that level, ensuring suggestions are reliable and not contextually wrong is a constant challenge, especially as systems get larger and more complex. So pushing this into infra-level auto-remediation is a big step.
Would be interesting to understand how you’re handling validation, rollback strategies, or confidence scoring before applying fixes.
Congrats on the launch.
@chrisbattarbee and @ece_kayan Good stuff! In my experience, you’re spot on about how heterogeneous and inconsistent observability is in practice. I’m going to try it out and might ping you for a chat.
Congrats, looking forward to trying it.
Is it just Kubernetes, or does it work on apps too?
love the s23 batch background. it’s clear you guys learned a lot from the 'generalized ai' failure. how does the agent handle 'false positives' in a noisy environment where some services are naturally spikey?
Nice! I think we could use this at Asteroid. I'm interested to know how you've thought about keeping it secure when things go wrong
This is a very compelling direction, moving from observability to actual autonomous remediation is a huge step for SRE workflows.
Love the idea of going from detection → root cause → PR with a fix, especially without requiring code changes. The eBPF + zero-config setup makes it even more impressive.
We also launched on Product Hunt today — building Ogoron, an AI system that automatically generates and maintains test coverage as products evolve. Different part of the lifecycle, but very aligned in spirit: reducing the manual overhead of keeping complex systems reliable :)
Good luck with the launch!
AI SRE using eBPF to collect telemetry definitely seems like the way to go - I was dreaming of such a solution. Could you onboard me, @chrisbattarbee? Looks amazing, would love to have a chat and test it!
Looks promising!! Can't wait to try this out. Quick question: If eBPF can see all requests in the cluster, how do you avoid accidentally collecting or shipping sensitive data from them? That’d be one of my first concerns in prod.
the autonomy angle is appealing. my concern is auto-PRs that fix one incident and quietly regress something else - without a human gate somewhere, that's a hard failure category to catch.
Hey Chris, that lesson about generalized AI SRE not working because every system is different and telemetry is inconsistent sounds like it came from real pain. Was there a specific customer or incident where you watched the AI completely miss the problem because the data just wasn’t there or didn’t line up?
The way you approached this with setting up consistent telemetry as a first step makes this very promising.
I wonder if I can also use it to monitor some longer term trends in the metrics?
Where is telemetry data stored when using Metoro (cloud vs self-hosted)? Do you support running on Azure Kubernetes Service (AKS), and are there any limitations?
Does it work well with many scheduled jobs/tasks for which the code is in a large monorepo?