

Canonizr

Precise document extraction for your agents — zero retention

API
Open Source
Alpha

Accurate document parsing for high-quality outputs. Upload any file — PDFs, legacy Word docs, scanned, multilingual, handwritten, or chart-heavy — and get clean text out, with no single word silently dropped, so your pipelines don't break when models or policies change. We extract and normalise your files' data so you can plug it straight into OpenClaw or any other agent, LLM, or pipeline. Zero data retention. Encrypted in transit and at rest. Use open source or hosted.

Top comment

Hi, I’m Maria! We built Canonizr and made it open source because document pipelines shouldn’t depend on one provider’s pricing decisions.

We had already solved complex data extraction reliably for our Health Data Avatar (multi-language, messy, high-stakes), and were planning to make it available to everyone one day. But last week accelerated things.

A lot of you lost workflows you’d built carefully. $200/month became $1,000–5,000/month overnight, with 24 hours’ notice — for the exact same usage. Then you migrated. And the document quality tanked.

It's what our team has been highlighting all along: the real bottleneck is almost always unreliable, suboptimal data extraction — especially for complex formats — because language models weren't built for layout parsing or precise extraction. And many people don't even notice, because randomly missing 5% of a PDF can still be acceptable for some use cases.

You often don’t even see what your agent actually received.

Scanned PDFs with mixed columns: traditional OCR transposes numbers.
Multilingual documents with Arabic: RTL text silently reverses.
Tables in financial reports: cells flatten into linear text, rows merge, meaning inverts.
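To make the table failure above concrete, here is a minimal illustration of my own (not Canonizr's code): naive linear extraction loses row boundaries, while a structure-preserving extractor keeps them, so a downstream model can still tell which figure belongs to which row.

```python
# Hypothetical illustration: the same financial table, extracted two ways.
table = [
    ["Quarter", "Revenue", "Costs"],
    ["Q1", "1.2M", "0.9M"],
    ["Q2", "1.5M", "1.1M"],
]

# Naive linear extraction: cells flatten into one stream of text, so
# "Q1 1.2M 0.9M Q2" can no longer be grouped back into rows reliably.
flattened = " ".join(cell for row in table for cell in row)

# Structure-preserving extraction: rows stay rows (here, Markdown), so
# each revenue figure remains attached to its quarter.
structured = "\n".join("| " + " | ".join(row) + " |" for row in table)

print(flattened)
print(structured)
```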

We don’t think anyone should accept that compromise. If your documents aren’t properly structured, models miss information, outputs degrade, and costs explode.

Canonizr is a model-agnostic file parsing layer. Drop any file — 30+ formats including the ones LLMs struggle with. Get structured, clean output. Works with Claude, GPT-4o, Gemini, Llama, whatever you run next year when the landscape shifts again. Runs locally — your documents never leave your environment. Built-in PII detection so you can redact before you ever hit an LLM call.
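The "redact before you ever hit an LLM call" step can be pictured as a pre-LLM pass over the extracted text. This is a deliberately simplified regex sketch of the idea — my own illustration, not Canonizr's detector, which a real product would implement far more robustly:

```python
import re

# Hypothetical, simplified patterns; a real PII detector covers many more
# types (names, addresses, IDs) and uses more than regular expressions.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace detected PII spans with typed placeholders before any LLM call."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact maria@example.com or +1 415-555-0100 for details."))
```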

Two ways to run Canonizr:

Local (free, open-source): One command installs everything — Docling, LibreOffice, Gemma 4, zero external calls. GDPR-compliant by architecture. Your documents never leave your hardware. MIT licence, fork it, own it.


Hosted API: We handle the infrastructure. You send documents and get back structured context. Zero retention — documents are deleted after parsing. Encrypted in transit and at rest.
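As a sketch of what calling the hosted API might look like — the endpoint URL, header, and field names below are all assumptions of mine, not published API details — a client could assemble the "send a document, ask for PII redaction" request like this:

```python
# Hypothetical request builder for the hosted API. The URL, auth header,
# and field names are illustrative assumptions, not documented values.
def build_parse_request(file_path: str, api_key: str, redact_pii: bool = False) -> dict:
    return {
        "url": "https://api.canonizr.example/v1/parse",  # assumed endpoint
        "headers": {"Authorization": f"Bearer {api_key}"},
        "files": {"document": file_path},
        "data": {"redact_pii": str(redact_pii).lower()},
    }

req = build_parse_request("report.pdf", "sk-demo", redact_pii=True)
print(req["url"], req["data"]["redact_pii"])
```

Under the zero-retention model described above, the response would carry the structured text back and the uploaded document would be deleted after parsing.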


Would love to hear:
what broke in your workflows this week?

Comment highlights

For parsing, can I give a prompt to describe the format I want and what I need or don't need from the content?
And do you also parse video?

@maria_sergeeva1 Super timely.

Claude changes broke a lot of doc-based workflows - this feels like a clean way to restore that without extra infra.

Keen to try it on a few messy PDFs 👀
I also like the fact I don't have to do the setup myself! Such a time saving for me...!

Is there an API? We need to recognize scanned documents in one of our projects.

Hi, I'm Hex! We're bringing you the solution to your broken OpenClaw pipelines ASAP. You can run Canonizr locally right now - it's free and open source. The API will be live later today for those of you running on restricted hardware and locked-down environments.