
SCRAPR

The data layer for the agentic web

Productivity
API
Artificial Intelligence

SCRAPR is a new approach to web data extraction. Instead of relying on fragile DOM selectors or heavy browser automation, SCRAPR looks at how modern websites actually load their data and extracts structured responses directly from those sources. The goal is to make web data pipelines faster, more reliable, and easier to maintain. Right now SCRAPR is an early-stage MVP, and we’re looking for developers, data teams, and AI builders who need clean structured data from websites.
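SCRAPR's internals aren't public, but the general technique the description points at — pulling from the JSON endpoints a page already loads its data from, rather than scraping the rendered HTML — can be sketched roughly as below. The endpoint URL and field names are hypothetical, not SCRAPR's actual API.

```python
import json
from urllib.request import Request, urlopen


def extract_records(payload: dict) -> list[dict]:
    """Map a site's internal JSON response to clean records.

    The field names here are illustrative; a real response's shape is
    whatever the site's own frontend consumes. The point is that the data
    arrives already structured, so there are no CSS/XPath selectors to break.
    """
    return [
        {"name": item["title"], "price": item["price"]}
        for item in payload.get("items", [])
    ]


def fetch_products(url: str) -> list[dict]:
    """Call the endpoint the page itself fetches data from (found by
    watching its network traffic) -- no browser, no DOM parsing."""
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return extract_records(json.load(resp))


# Example with a captured response body instead of a live request:
sample = {"items": [{"title": "Widget", "price": 9.99, "sku": "W-1"}]}
print(extract_records(sample))  # [{'name': 'Widget', 'price': 9.99}]
```

Because the mapping step is separate from the fetch, a redesign of the site's frontend leaves the pipeline untouched as long as the underlying endpoint keeps its shape.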

Top comment

I built SCRAPR after running into the same problem again and again:

Getting structured data from websites is still way harder than it should be.

Most tools fall into two buckets:

• Browser automation (Puppeteer / Selenium) — slow and expensive
• Traditional scrapers — fragile and constantly breaking

SCRAPR tries a different approach.

Instead of rendering pages or parsing messy HTML, it focuses on how websites actually load their data and extracts structured responses from there.

The goal is to make web data extraction more reliable — especially for AI pipelines and data workflows.

It’s still early (MVP stage), and I’m looking for builders who want to try it and give feedback.

Comment highlights

The "data layer for the agentic web" framing is interesting - curious how you're handling anti-bot countermeasures that vary by target site. Are you routing through rotating proxies or using something more sophisticated on the infrastructure side? Asking because this seems like it gets complicated fast at scale.

Wait, also @gabe, how is this even allowed under the Product Hunt launch rules? This is just a Vercel site with a waitlist.

I thought the Product Hunt rules said no waitlists.

This is such a smart pivot from the usual DOM-parsing headaches! As a dev who's spent way too many hours fixing scrapers because of a tiny CSS change, focusing on the data responses directly sounds like a lifesaver. How do you handle sites with heavy anti-bot protections or obfuscated API endpoints?

Intercepting network calls instead of rendering pages is a smart approach. Way less fragile than the usual scraping setups. What kinds of sites have been trickiest to support so far?

So what happens when the API changes?

Sites like LinkedIn also use server-side rendering and hydration, so won't this approach fail on most websites?

Great implementation! Is the live demo on the website operable? I can't seem to enter text into the fields. Early access requested!

Really smart approach to web scraping. Focusing on where data actually comes from rather than relying on DOM selectors is a much more resilient strategy. Most scraping tools break the moment a site updates its frontend, so anchoring to underlying API calls makes a lot of sense.

Curious about how you handle rate limiting and sites that aggressively block automated access. Either way, congrats on the launch!

Smart approach intercepting the underlying API calls instead of fighting the DOM. I've built data pipelines that relied on traditional scraping and the maintenance burden of broken selectors is brutal. Curious -- do you have plans for a schema definition layer where users can map the intercepted responses to a consistent output format? That would make it really useful for feeding structured data into AI workflows.

The interception approach is clever, way faster than spinning up a headless browser for every request. Have you thought about a batch endpoint where you can throw a list of URLs at it in one call? Anytime I've built a scraping pipeline for a project, the single-URL-at-a-time loop is where things get slow and annoying to manage.

Does this handle things like fingerprinting and bot detection? Awesome that you're coming at this from a new angle!

Most scrapers fight the rendered HTML. This goes upstream to where the data actually comes from, am I understanding that right? That's quite interesting.

What gets me most is the stability angle. Anything built on CSS selectors or DOM structure breaks the moment a site redesigns its front-end. If you're anchored to the underlying API calls instead, that problem should mostly disappear.

I'm building an AI platform that pulls structured data into its pipeline, so this is genuinely relevant to me. The edge case I keep running into with this type of approach: sites that sign their internal API requests dynamically, session tokens, HMAC signatures, that kind of thing. How does SCRAPR handle those? That's usually where it gets complicated in my experience.
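For readers unfamiliar with the pattern this comment describes: a site's frontend may attach an HMAC signature, derived from the request and a per-session secret, to each internal API call, so replaying the bare URL fails. A minimal sketch of that server-side-enforced scheme — the header names and signing format are illustrative, not any particular site's:

```python
import hashlib
import hmac
import time


def sign_request(path: str, session_secret: bytes) -> dict:
    """Build headers the way a site's frontend might: an HMAC-SHA256 over
    the path plus a timestamp, keyed by a per-session secret.

    Without the secret (usually minted during a real session), a scraper
    cannot forge a valid signature, which is why directly calling such an
    endpoint gets rejected even though the URL is known.
    """
    ts = str(int(time.time()))
    message = f"{path}:{ts}".encode()
    signature = hmac.new(session_secret, message, hashlib.sha256).hexdigest()
    return {"X-Timestamp": ts, "X-Signature": signature}


headers = sign_request("/api/feed", b"per-session-secret")
print(sorted(headers))  # ['X-Signature', 'X-Timestamp']
```

The timestamp also lets the server reject replays of old signed requests, which is the other half of why these endpoints are hard to automate against.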

Sounds cool. Would love to try it out for example on https://www.maxxi.art/events/categories/mostre/

This approach is super clever — basically doing what I always do manually in Chrome DevTools Network tab (hunting for those fetch/GraphQL calls) but automated 😮

Does the engine just statically analyze the page source to find those internal API requests, or does it use AI/LLM in some way to detect and reconstruct the right endpoints even on tricky sites?

And how well does it handle completely arbitrary URLs — like, throw any random modern site at it and it still finds the clean data source reliably?
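The manual DevTools workflow this comment describes — hunting the Network tab for fetch/GraphQL calls — can be approximated programmatically with a heuristic scan of a page's captured traffic, e.g. a HAR export. Everything below (thresholds, the analytics filter) is an illustrative guess at how such detection might work, not how SCRAPR does it:

```python
def find_data_endpoints(har: dict) -> list[str]:
    """Heuristic scan of a HAR capture for likely internal data endpoints:
    JSON responses with a non-trivial body, skipping obvious tracking noise."""
    candidates = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        mime = content.get("mimeType", "")
        size = content.get("size", 0)
        url = entry.get("request", {}).get("url", "")
        # Arbitrary cutoffs: real detection would need smarter ranking.
        if "json" in mime and size > 512 and "analytics" not in url:
            candidates.append(url)
    return candidates


sample_har = {
    "log": {"entries": [
        {"request": {"url": "https://example.com/api/feed"},
         "response": {"content": {"mimeType": "application/json",
                                  "size": 20480}}},
        {"request": {"url": "https://cdn.example.com/app.js"},
         "response": {"content": {"mimeType": "text/javascript",
                                  "size": 90000}}},
    ]}
}
print(find_data_endpoints(sample_har))  # ['https://example.com/api/feed']
```

The open question the comment raises — whether simple heuristics like this suffice on arbitrary sites, or whether an LLM is needed to reconstruct trickier endpoints — is exactly where such a tool's difficulty lies.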

This is such a clean solution to a problem that's been annoying developers forever. Rooting for you!

Looks cool — but how well does it actually handle hard targets like Cloudflare, JS-heavy sites, proxies, and rate limits in the real world?

Hey Sukrit, that frustration of scraping tools either being slow and fragile or breaking constantly on modern sites is painfully real. Was there a specific project where you watched your scraper break for the tenth time on some JS-heavy page and thought okay, there has to be a completely different approach?

How does this engine handle JavaScript-heavy or dynamic content without a browser, and what mechanisms ensure data accuracy when the source website changes its layout?