LiteParse — the lightweight sibling of LlamaParse. PDFs, DOCX, PPTX, XLSX, images → clean text or structured JSON with bounding boxes. Runs on your laptop, your container, or a client's VPC. The POPIA-safe ingestion layer for RAG.
LiteParse is a Node CLI (lit) and TypeScript library from LlamaIndex that parses unstructured documents into text or structured JSON. It accepts PDFs natively and routes Office formats through LibreOffice and images through ImageMagick — everything converges on a PDF parse pass with optional OCR. The output is either flat text (for quick extraction) or JSON with per-block text, bounding boxes, and confidence scores (for RAG chunking, layout-aware retrieval, or overlay rendering).
It is the local sibling of LlamaParse, LlamaIndex's cloud parser. Same authorship, same parse quality floor, different deployment shape: LiteParse runs on your machine, inside a container, inside a client VPC — wherever the documents live. No API key, no upload, no telemetry. That constraint is the whole feature.
OCR ships with Tesseract.js bundled (zero setup) and can delegate to any HTTP OCR server that speaks the LiteParse protocol — EasyOCR and PaddleOCR wrappers are ready-made in the LlamaIndex repo. Batch mode walks a directory. Screenshot mode renders pages as PNG for multimodal agents that need visual layout, not just text.
Data residency by construction. The parser runs where the documents live. Documents containing POPIA-protected personal information never transit a foreign cloud. Legal, HR, and healthcare use-cases that reject LlamaParse SaaS accept LiteParse.
Zero-cost at any volume. No per-page pricing, no rate limits. Parse a million pages — the cost is CPU time, not API invoices. Relevant for ingestion-heavy discovery-phase projects.
Same output shape as LlamaParse. JSON schema is compatible enough that an ingestion pipeline written for LlamaParse swaps in LiteParse with minimal changes. Start local for pilots; upgrade to LlamaParse cloud only where parse quality becomes the binding constraint.
Everything that isn't already a PDF gets wrapped to PDF first, then parsed. That single-lane pipeline means parse behaviour is consistent across formats — and it means LibreOffice and ImageMagick are hard dependencies for non-PDF inputs.
| Category | Extensions | Path |
|---|---|---|
| PDF | .pdf | Native. Text-layer first; OCR fallback for scans. |
| Word | .doc .docx .docm .odt .rtf | LibreOffice headless → PDF → parse. |
| PowerPoint | .ppt .pptx .pptm .odp | LibreOffice headless → PDF → parse. |
| Spreadsheets | .xls .xlsx .xlsm .ods .csv .tsv | LibreOffice for binary; direct parse for CSV/TSV. |
| Images | .jpg .jpeg .png .gif .bmp .tiff .webp .svg | ImageMagick wrap → PDF → OCR. |
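The single-lane routing described above can be sketched as a pure dispatch on file extension. This is an illustrative sketch only — the type and function names are assumptions, not LiteParse's actual internals:

```typescript
// Illustrative sketch of the format-routing logic — names are
// assumptions, not LiteParse's actual API.
type Route = "native-pdf" | "libreoffice" | "imagemagick" | "direct";

const OFFICE = new Set([
  ".doc", ".docx", ".docm", ".odt", ".rtf",
  ".ppt", ".pptx", ".pptm", ".odp",
  ".xls", ".xlsx", ".xlsm", ".ods",
]);
const IMAGES = new Set([
  ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg",
]);

export function routeFor(ext: string): Route {
  const e = ext.toLowerCase();
  if (e === ".pdf") return "native-pdf";             // parsed directly
  if (e === ".csv" || e === ".tsv") return "direct"; // no PDF wrap needed
  if (OFFICE.has(e)) return "libreoffice";           // headless convert → PDF → parse
  if (IMAGES.has(e)) return "imagemagick";           // wrap → PDF → OCR
  throw new Error(`unsupported extension: ${ext}`);
}
```

Everything except PDF and delimited text converges on a PDF parse pass — which is exactly why LibreOffice and ImageMagick are hard dependencies for those lanes.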
Setup. The Node package is one install; the native deps are the friction. Get them right once per machine and forget them.
```sh
# Core CLI
npm i -g @llamaindex/liteparse

# Office formats (DOCX, PPTX, XLSX) — LibreOffice converts to PDF
brew install --cask libreoffice   # macOS
apt-get install libreoffice       # Debian/Ubuntu

# Image inputs (JPG, PNG, TIFF, etc.)
brew install imagemagick          # macOS
apt-get install imagemagick       # Debian/Ubuntu

# Verify
lit --version
```
lit parse for one file. lit batch-parse for a directory. lit screenshot when you need page images for a multimodal agent. A config file when you want to stop passing the same flags every time.
Parse a single file. Default output is plain text to stdout. Pass --format json for bounding boxes, confidence scores, and per-block structure — the shape RAG pipelines want.
```sh
# Plain text extraction
lit parse document.pdf

# Structured JSON — the RAG-friendly output
lit parse document.pdf --format json -o output.json

# Target pages only — faster for appendix-heavy PDFs
lit parse document.pdf --target-pages "1-5,10,15-20"

# Text-native PDF — skip OCR for a 3–10× speedup
lit parse document.pdf --no-ocr

# High-DPI render for small text or scanned docs
lit parse document.pdf --dpi 300
```
Batch-parse a directory. Point lit batch-parse at input and output paths. Recurse with --recursive. Filter by extension with --extension. Output mirrors input structure — each source file produces a text or JSON sibling in the output directory.
```sh
# All supported formats, flat output
lit batch-parse ./input ./output

# PDFs only, recursive — the common ingestion pattern
lit batch-parse ./input ./output --extension .pdf --recursive
```
Page screenshots are for multimodal agents — contracts with signatures, forms with checkboxes, scientific papers with diagrams. Render the page as PNG, hand it to Claude or GPT-4V alongside the parsed text, and the model sees what a human sees.
```sh
# All pages
lit screenshot document.pdf -o ./screenshots

# Specific pages, high-DPI PNG
lit screenshot document.pdf --pages "1,3,5" --dpi 300 --format png -o ./screenshots
```
For repeated runs with the same settings — typical in a scheduled ingestion job — commit a liteparse.config.json next to the script. Same keys as the CLI flags, camelCase in JSON:
```json
{
  "ocrEnabled": true,
  "ocrLanguage": "en",
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "maxPages": 1000
}
```
Run with lit parse document.pdf --config liteparse.config.json. Flags on the command line override the config — convenient for one-off reprocessing of a specific file.
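The override behaviour amounts to simple precedence: config-file values form the base, CLI flags win. An illustrative sketch of that merge — not LiteParse internals:

```typescript
// Illustrative precedence sketch — not LiteParse internals.
// CLI flags (right operand) override config-file values (left operand).
type Config = Record<string, unknown>;

export function effectiveConfig(fileConfig: Config, cliFlags: Config): Config {
  // Spread order gives the CLI flags the last word
  return { ...fileConfig, ...cliFlags };
}
```

So `--dpi 300` on the command line beats `"dpi": 150` in liteparse.config.json, while untouched keys keep their config-file values.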
Tesseract.js is bundled — zero setup, acceptable accuracy on clean English text. For non-English, scanned-scan-of-a-scan, or handwriting, delegate OCR to an external server. The LiteParse repo ships EasyOCR and PaddleOCR wrappers that speak the protocol out of the box.
| Flag | Purpose |
|---|---|
| (default) | Tesseract.js bundled. English out of the box. |
| `--ocr-language fra` | ISO code for Tesseract language pack. First run downloads the trained data. |
| `--ocr-server-url <url>` | Delegate to an external HTTP OCR server (EasyOCR, PaddleOCR, custom). |
| `--no-ocr` | Disable OCR entirely — 3–10× faster on text-native PDFs. |
External OCR protocol. Any HTTP server implementing the following contract can be plugged in — swap OCR engines without touching the ingestion pipeline.
```sh
# Endpoint: POST /ocr
# Accepts:  multipart with `file` + `language`
# Returns:  { "results": [ { "text": "Hello", "bbox": [x1, y1, x2, y2], "confidence": 0.98 } ] }

# Use it:
lit parse scan.pdf --ocr-server-url http://localhost:8828/ocr
```
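As a sketch, a minimal Node server honouring this contract could look like the following. Everything here is illustrative: `recognize()` is a stub standing in for a real engine, and multipart parsing of the `file` and `language` fields is elided (a real server would use a parser such as busboy):

```typescript
import { createServer, type IncomingMessage, type ServerResponse } from "node:http";

// One recognized region, in the wire format shown above
interface OcrResult {
  text: string;
  bbox: [number, number, number, number]; // [x1, y1, x2, y2]
  confidence: number;
}

// Shape engine output into the JSON body the protocol specifies
export function toOcrResponse(results: OcrResult[]): string {
  return JSON.stringify({ results });
}

// Stub standing in for a real engine (EasyOCR, PaddleOCR, custom)
async function recognize(_image: Buffer, _language: string): Promise<OcrResult[]> {
  return [{ text: "Hello", bbox: [10, 10, 120, 40], confidence: 0.98 }];
}

async function handler(req: IncomingMessage, res: ServerResponse) {
  if (req.method !== "POST" || req.url !== "/ocr") {
    res.writeHead(404);
    res.end();
    return;
  }
  // NOTE: real multipart parsing of `file` + `language` is elided here
  const chunks: Buffer[] = [];
  for await (const c of req) chunks.push(c as Buffer);
  const results = await recognize(Buffer.concat(chunks), "en");
  res.writeHead(200, { "content-type": "application/json" });
  res.end(toOcrResponse(results));
}

// Guarded so importing this module doesn't start a listener
if (process.env.RUN_OCR_SERVER) {
  createServer(handler).listen(8828);
}
```

Because the contract is just one POST endpoint returning that JSON shape, swapping Tesseract for EasyOCR or PaddleOCR is a URL change, not a pipeline change.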
Non-Latin scripts (Arabic, Chinese, Amharic), handwriting, or low-quality scans — Tesseract's ceiling is low on any of the three. EasyOCR handles 80+ languages out of the box; PaddleOCR is state-of-the-art on Chinese and has strong structure-aware models. Both run locally, so the data-residency argument still holds.
The cost: one more service to run. On a single VM that's cheap; at fleet scale it starts mattering. Start with Tesseract, measure error rate on a sample, upgrade only if the error rate breaks the downstream use case.
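One dependency-free way to measure that error rate on a sample with known ground truth is character error rate: edit distance divided by reference length. A small sketch:

```typescript
// Levenshtein edit distance, single-row dynamic programming
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => i);
  for (let j = 1; j <= b.length; j++) {
    let prev = dp[0]; // D[i-1][j-1] for the loop below
    dp[0] = j;
    for (let i = 1; i <= a.length; i++) {
      const tmp = dp[i];
      dp[i] = Math.min(
        dp[i] + 1,                               // insertion
        dp[i - 1] + 1,                           // deletion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1)   // substitution
      );
      prev = tmp;
    }
  }
  return dp[a.length];
}

// Character error rate: edits needed per ground-truth character.
// 0 is perfect; anything above ~0.05 is usually visible downstream.
export function charErrorRate(ocrText: string, groundTruth: string): number {
  if (groundTruth.length === 0) return ocrText.length === 0 ? 0 : 1;
  return levenshtein(ocrText, groundTruth) / groundTruth.length;
}
```

Run it over a few hand-transcribed pages per document class; if Tesseract's CER is already below what the downstream use case tolerates, the extra OCR service buys nothing.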
LiteParse runs where the documents are. The JSON output lands in R2. A Cloudflare Worker pulls chunks, embeds them with Workers AI, and writes vectors to Vectorize. The pipeline respects data residency while still getting the latency benefits of edge infrastructure.
```ts
// Node-side ingestion script — run on a box with file-system access
import { parse } from "@llamaindex/liteparse";
import { readdir } from "node:fs/promises";

const files = await readdir("./contracts");

for (const file of files) {
  const result = await parse(`./contracts/${file}`, {
    outputFormat: "json",
    ocrEnabled: true,
    dpi: 150,
    preciseBoundingBox: true,
  });

  // Each page is an array of blocks with text + bbox
  const chunks = result.pages.flatMap(page =>
    page.blocks.map(b => ({
      text: b.text,
      sourceFile: file,
      page: page.number,
      bbox: b.bbox,
    }))
  );

  // Push to R2 as JSONL for downstream embedding
  // (putToR2 is your own upload helper — e.g. an S3-compatible PUT to an R2 bucket)
  await putToR2(`parsed/${file}.jsonl`, chunks);
}
```
Why parse off-edge and embed on-edge. LiteParse has native deps (pdfjs, LibreOffice, ImageMagick) that do not run in Cloudflare Workers — parsing has to happen in a Node container, a VM, or a user machine. Embedding is the opposite: it's stateless linear algebra behind a network call, a perfect fit for Workers AI at the edge. The split is natural: heavy I/O and native deps on the parsing side, stateless compute on the serving side.
Most of these come from the native dependency chain. The Node CLI is clean; LibreOffice, ImageMagick, and Tesseract are where the friction lives.
A 100-page scanned PDF with default Tesseract.js takes minutes, not seconds. Use --no-ocr for text-native PDFs, or delegate to an external OCR server for batch jobs. Profile before optimising the parser — the parser isn't the problem, OCR is.
If lit parse doc.docx hangs or errors, LibreOffice is either missing or running as a GUI instance that blocks the headless invocation. pkill soffice and retry. On servers, ensure LibreOffice was installed as a CLI-only package.
Fresh Debian/Ubuntu installs of ImageMagick ship with a security policy that refuses to read PDF. If image parsing errors with "not authorized", remove the <policy domain="coder" rights="none" pattern="PDF" /> line from /etc/ImageMagick-6/policy.xml (or the ImageMagick-7 equivalent on newer installs). Known issue, well-documented upstream.
For non-English docs, pass --ocr-language <iso> and accept that the first run will be slow — the Tesseract trained data for that language has to download on demand. Pre-warm the cache on build machines so production runs don't stall on language-pack fetch.
The bbox tuple is in PDF points (72 DPI). If you render a page screenshot at 300 DPI and try to overlay bounding boxes, multiply by dpi / 72. Every visual-overlay integration hits this on first attempt.
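The conversion is one multiplication per coordinate. A sketch, assuming the [x1, y1, x2, y2] bbox layout from the OCR protocol above:

```typescript
// Bounding boxes arrive in PDF points (72 per inch); screenshots render
// at an arbitrary DPI. Scale by dpi / 72 before overlaying.
type BBox = [number, number, number, number]; // [x1, y1, x2, y2]

export function scaleBbox(bbox: BBox, dpi: number): BBox {
  const s = dpi / 72;
  return [bbox[0] * s, bbox[1] * s, bbox[2] * s, bbox[3] * s];
}
```

So a block at [72, 36, 144, 72] points lands at [300, 150, 600, 300] pixels on a 300-DPI render.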
batch-parse is sequential. Large directories process one file at a time. For throughput, shard the input directory and run multiple lit batch-parse processes — or drop to the TypeScript API and use Promise.all with a concurrency limiter. There's no built-in parallelism flag.
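A minimal concurrency limiter for the TypeScript-API route — generic and dependency-free; pass it the `parse` call from the library:

```typescript
// Run at most `limit` jobs at once; results keep input order.
export async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  async function worker() {
    // Each worker pulls the next unclaimed index until the queue drains
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker)
  );
  return results;
}

// e.g.: await mapWithConcurrency(files, 4, f => parse(f, { outputFormat: "json" }))
```

A limit of 3–4 usually saturates a laptop-class CPU before OCR becomes the bottleneck; tune against your own profile rather than guessing.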
LiteParse is the right call when data cannot leave the machine. When that constraint lifts, the decision gets more interesting.
pdftotext from poppler is 50 lines of Bash and no Node dependency.

LiteParse is the ingestion entry point for every RAG use case in the tree. It produces clean text; the rest of the stack turns that text into retrieval, reasoning, and compounding agent capability.
The compounding play is this: LiteParse doesn't produce value on its own — it removes a blocker. Every skill downstream that needs structured text from unstructured documents (legal contract review, healthcare record intake, recruitment CV parsing, ERP document capture, procurement OCR) runs into the same first-mile problem: get the text out cleanly, without sending the document anywhere. LiteParse is the skill you load before any of those skills have a chance of working inside a POPIA-bounded deployment.
The agent angle: an agent can invoke lit parse as a shell command, then pass the parsed JSON back into the context window for reasoning. The skill is compatible with Claude Code, Cursor, and any agent runtime with shell access. No API wrapper, no MCP server required — just a CLI that returns structured output. That simplicity is deliberate; the value compounds where the parsed output lands, not in the parsing step itself.
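A sketch of that invocation path from an agent runtime's side — assuming JSON goes to stdout when no -o flag is given (an assumption), with the argv builder split out so the command shape is testable without the CLI installed:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

// Build the argv for a structured-JSON parse. Flags mirror the CLI
// examples above; separated out so it can be unit-tested offline.
export function buildParseArgs(file: string, pages?: string): string[] {
  const args = ["parse", file, "--format", "json"];
  if (pages) args.push("--target-pages", pages);
  return args;
}

// Agent-side invocation: shell out to `lit`, read JSON from stdout.
// Requires the CLI on PATH; assumes JSON prints to stdout without -o.
export async function parseForAgent(file: string): Promise<unknown> {
  const { stdout } = await promisify(execFile)("lit", buildParseArgs(file));
  return JSON.parse(stdout);
}
```

The parsed JSON then goes straight into the context window — no wrapper service sits between the agent and the document.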
LlamaIndex ships the canonical docs and the OCR-server wrappers. The npm page has the latest release notes; the GitHub repo is where issues and new format support land.