Chunking heavily impacts retrieval quality when working with LLMs. Preprocess splits documents into optimal chunks of text. We split PDF and Office files based on the original document structure and content semantics.
👋Hello, Product Hunt community,
I hope you are all doing well 😀
I am Nicola, co-founder of Preprocess. In 2018 I founded Pigro (https://pigro.ai/) with Nicolò. Through our work at Pigro.ai we gained experience with document chunking and decided to create Preprocess.
Preprocess is our solution for document preprocessing tailored for Large Language Models (LLMs).
Recognizing the challenges in document preprocessing for LLMs, we developed Preprocess to automate and optimize this critical step. Our goal is to provide a reliable, efficient, and easy-to-integrate solution that meets the diverse needs of our users.
Preprocess is ideal for data scientists, AI developers, and organizations implementing Retrieval-Augmented Generation (RAG) systems. It simplifies the ingestion pipeline, allowing you to focus on building intelligent applications without the hassle of manual preprocessing.
Key Features 🛠️
- Intelligent Parsing and Chunking: Automatically processes various document types, preserving the original structure and semantics.
- High-Quality Table and Image Extraction: Accurately extracts and formats tables and images for seamless integration.
- Support for Multiple Formats: Handles PDFs, Word documents, Excel sheets, presentations, HTML, and plain text files.
We offer a Free Tier that allows you to preprocess up to 10 documents per day, each up to 10 pages/credits, with no time limit. Our flexible credit-based model ensures you only pay for what you need.
We're committed to continuous improvement and would love your thoughts on Preprocess. Please share your experiences and suggestions to help us serve you better.
Preprocess looks really useful! Sorting and preparing documents for AI can be a hassle, so having an automated tool sounds like a big help. How well does it handle messy documents with mixed formats?
The focus on automating document preprocessing for LLMs is indeed a crucial step that can save a lot of time and effort for data scientists and developers. The variety of supported document formats and features like intelligent parsing and chunking seem incredibly practical.
Congrats on the launch! Best wishes and sending lots of wins :) @nicola_abbasciano
Document chunking has been one of our biggest RAG headaches; our homegrown solution just splits by character count. Being able to split PDFs based on document structure could fix our context relevance issues.
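To make the difference concrete, here is a minimal Python sketch contrasting the two approaches mentioned in this comment. It is not Preprocess's method or API: the function names are hypothetical, and it assumes the document has already been converted to plain text in which section headings are marked Markdown-style with `#`. Real PDF and Office files would need layout analysis to find that structure first.

```python
import re
from typing import List

# Assumption: sections are marked with Markdown-style headings ("# Title").
HEADING = re.compile(r"^#{1,6} ")


def chunk_by_chars(text: str, size: int) -> List[str]:
    """Naive baseline: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_headings(text: str) -> List[str]:
    """Start a new chunk at each heading line so every section stays intact."""
    chunks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        # A heading closes the chunk collected so far and opens a new one.
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that ended up empty (for example, leading blank lines).
    return [c for c in chunks if c]


doc = "# Refunds\nRefunds take 5 days.\n# Shipping\nShipping is free over $50."

print(chunk_by_chars(doc, 20))   # first chunk ends mid-sentence
print(chunk_by_headings(doc))    # one chunk per section
```

With the character-count splitter, the first chunk stops in the middle of the refund sentence, so a retriever may return half an answer. The heading-based splitter keeps each section together with its title, which is the kind of context a structure-aware chunker aims to preserve.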
Congrats @nicola_abbasciano and team! Super useful solution nowadays to avoid reinventing the wheel in every AI product!
Sounds like this would break up our big files in a smarter way so the AI can find the right answers.