data/engineering · LiteParse · Skill Leaf

Local document parsing.
No cloud. No LLM.

LiteParse — the lightweight sibling of LlamaParse. PDFs, DOCX, PPTX, XLSX, images → clean text or structured JSON with bounding boxes. Runs on your laptop, your container, or a client's VPC. The POPIA-safe ingestion layer for RAG.

Live · MIT · Node 18+ · LlamaIndex v0.1

A local-first CLI for turning messy docs into clean text.

LiteParse is a Node CLI (lit) and TypeScript library from LlamaIndex that parses unstructured documents into text or structured JSON. It accepts PDFs natively and routes Office formats through LibreOffice and images through ImageMagick — everything converges on a PDF parse pass with optional OCR. The output is either flat text (for quick extraction) or JSON with per-block text, bounding boxes, and confidence scores (for RAG chunking, layout-aware retrieval, or overlay rendering).

It is the local sibling of LlamaParse, LlamaIndex's cloud parser. Same authorship, same parse quality floor, different deployment shape: LiteParse runs on your machine, inside a container, inside a client VPC — wherever the documents live. No API key, no upload, no telemetry. That constraint is the whole feature.

OCR ships with Tesseract.js bundled (zero setup) and can delegate to any HTTP OCR server that speaks the LiteParse protocol — EasyOCR and PaddleOCR wrappers are ready-made in the LlamaIndex repo. Batch mode walks a directory. Screenshot mode renders pages as PNG for multimodal agents that need visual layout, not just text.

Why LiteParse is the 2nth default for private-data RAG

Data residency by construction. The parser runs where the documents live. POPIA-protected personal data never transits a foreign cloud. Legal, HR, and healthcare use-cases that reject LlamaParse SaaS accept LiteParse.

Zero-cost at any volume. No per-page pricing, no rate limits. Parse a million pages — the cost is CPU time, not API invoices. Relevant for ingestion-heavy discovery-phase projects.

Same output shape as LlamaParse. JSON schema is compatible enough that an ingestion pipeline written for LlamaParse swaps in LiteParse with minimal changes. Start local for pilots; upgrade to LlamaParse cloud only where parse quality becomes the binding constraint.

PDFs natively. Office via LibreOffice. Images via ImageMagick.

Everything that isn't already a PDF gets wrapped to PDF first, then parsed. That single-lane pipeline means parse behaviour is consistent across formats — and it means LibreOffice and ImageMagick are hard dependencies for non-PDF inputs.

Category · Extensions · Conversion path
PDF · .pdf · Native. Text-layer first; OCR fallback for scans.
Word · .doc .docx .docm .odt .rtf · LibreOffice headless → PDF → parse.
PowerPoint · .ppt .pptx .pptm .odp · LibreOffice headless → PDF → parse.
Spreadsheets · .xls .xlsx .xlsm .ods .csv .tsv · LibreOffice for binary formats; direct parse for CSV/TSV.
Images · .jpg .jpeg .png .gif .bmp .tiff .webp .svg · ImageMagick wrap → PDF → OCR.

Setup. The Node package is one install; the native deps are the friction. Get them right once per machine and forget them.

# Core CLI
npm i -g @llamaindex/liteparse

# Office formats (DOCX, PPTX, XLSX) — LibreOffice converts to PDF
brew install --cask libreoffice        # macOS
apt-get install libreoffice            # Debian/Ubuntu

# Image inputs (JPG, PNG, TIFF, etc.)
brew install imagemagick               # macOS
apt-get install imagemagick            # Debian/Ubuntu

# Verify
lit --version

Three commands and a config file cover 95% of use cases.

lit parse for one file. lit batch-parse for a directory. lit screenshot when you need page images for a multimodal agent. A config file when you want to stop passing the same flags every time.

Parse a single file. Default output is plain text to stdout. Pass --format json for bounding boxes, confidence scores, and per-block structure — the shape RAG pipelines want.

# Plain text extraction
lit parse document.pdf

# Structured JSON — the RAG-friendly output
lit parse document.pdf --format json -o output.json

# Target pages only — faster for appendix-heavy PDFs
lit parse document.pdf --target-pages "1-5,10,15-20"

# Text-native PDF — skip OCR for a 3–10× speedup
lit parse document.pdf --no-ocr

# High-DPI render for small text or scanned docs
lit parse document.pdf --dpi 300
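
For orientation, an illustrative sketch of what the JSON output looks like; the field names are inferred from the ingestion script further down (pages, blocks, text, bbox, confidence) and the exact shape may differ between releases:

{
  "pages": [
    {
      "number": 1,
      "blocks": [
        { "text": "Service Agreement", "bbox": [72, 96, 340, 120], "confidence": 0.99 },
        { "text": "This agreement is made between...", "bbox": [72, 140, 523, 210], "confidence": 0.97 }
      ]
    }
  ]
}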

Batch-parse a directory. Point lit batch-parse at input and output paths. Recurse with --recursive. Filter by extension with --extension. Output mirrors input structure — each source file produces a text or JSON sibling in the output directory.

# All supported formats, flat output
lit batch-parse ./input ./output

# PDFs only, recursive — the common ingestion pattern
lit batch-parse ./input ./output --extension .pdf --recursive

Page screenshots are for multimodal agents — contracts with signatures, forms with checkboxes, scientific papers with diagrams. Render the page as PNG, hand it to Claude or GPT-4V alongside the parsed text, and the model sees what a human sees.

# All pages
lit screenshot document.pdf -o ./screenshots

# Specific pages, high-DPI PNG
lit screenshot document.pdf --pages "1,3,5" --dpi 300 --format png -o ./screenshots
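
One way to make that hand-off concrete, assuming the official @anthropic-ai/sdk and an API key in the environment; the model id and prompt are placeholders, and the parsed text comes from a prior lit parse --format json run:

// Sketch: pair a page screenshot with its parsed text for a multimodal model.
// Assumes @anthropic-ai/sdk is installed and ANTHROPIC_API_KEY is set; the model id
// and prompt are placeholders; swap in whatever multimodal model the agent uses.
import Anthropic from "@anthropic-ai/sdk";
import { readFile } from "node:fs/promises";

const client = new Anthropic();

async function reviewPage(pngPath: string, parsedText: string) {
  const png = await readFile(pngPath);
  const msg = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    messages: [{
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: png.toString("base64") } },
        { type: "text", text: `Parsed text for this page:\n${parsedText}\n\nFlag anything visible on the page that the parsed text misses.` },
      ],
    }],
  });
  return msg.content;
}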

Config file pattern

For repeated runs with the same settings — typical in a scheduled ingestion job — commit a liteparse.config.json next to the script. Same keys as the CLI flags, camelCase in JSON:

{ "ocrEnabled": true, "ocrLanguage": "en", "dpi": 150, "outputFormat": "json", "preciseBoundingBox": true, "maxPages": 1000 }

Run with lit parse document.pdf --config liteparse.config.json. Flags on the command line override the config — convenient for one-off reprocessing of a specific file.

Tesseract.js by default. Self-hosted OCR server when it matters.

Tesseract.js is bundled — zero setup, acceptable accuracy on clean English text. For non-English text, degraded scan-of-a-scan documents, or handwriting, delegate OCR to an external server. The LiteParse repo ships EasyOCR and PaddleOCR wrappers that speak the protocol out of the box.

Flag · Purpose
(default) · Tesseract.js bundled. English out of the box.
--ocr-language fra · ISO code for a Tesseract language pack. First run downloads the trained data.
--ocr-server-url <url> · Delegate to an external HTTP OCR server (EasyOCR, PaddleOCR, custom).
--no-ocr · Disable OCR entirely — 3–10× faster on text-native PDFs.

External OCR protocol. Any HTTP server implementing the following contract can be plugged in — swap OCR engines without touching the ingestion pipeline.

# Endpoint: POST /ocr
# Accepts: multipart with `file` + `language`
# Returns:
{
  "results": [
    { "text": "Hello", "bbox": [x1, y1, x2, y2], "confidence": 0.98 }
  ]
}

# Use it:
lit parse scan.pdf --ocr-server-url http://localhost:8828/ocr
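
For reference, a minimal sketch of a server that satisfies the contract. It assumes express and multer; runOcr is a hypothetical hook for whichever engine you wrap (the ready-made EasyOCR and PaddleOCR wrappers already implement this for you):

// Minimal protocol-compliant OCR server sketch. express/multer are assumptions;
// runOcr() is a hypothetical hook for your engine of choice.
import express from "express";
import multer from "multer";

const upload = multer({ storage: multer.memoryStorage() });
const app = express();

app.post("/ocr", upload.single("file"), async (req, res) => {
  const language = (req.body?.language as string) ?? "en";
  // Map the engine's output to the contract: one entry per text block.
  const results = await runOcr(req.file!.buffer, language);
  res.json({ results });
});

app.listen(8828);

// Hypothetical engine call: wire EasyOCR, PaddleOCR, or anything else in here.
async function runOcr(
  image: Buffer,
  language: string
): Promise<{ text: string; bbox: [number, number, number, number]; confidence: number }[]> {
  throw new Error("plug your OCR engine in here");
}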

When to reach for external OCR

Non-Latin scripts (Arabic, Chinese, Amharic), handwriting, or low-quality scans — Tesseract's ceiling is low on any of the three. EasyOCR handles 80+ languages out of the box; PaddleOCR is state-of-the-art on Chinese and has strong structure-aware models. Both run locally, so the data-residency argument still holds.

The cost: one more service to run. On a single VM that's cheap; at fleet scale it starts mattering. Start with Tesseract, measure error rate on a sample, upgrade only if the error rate breaks the downstream use case.

The 2nth pattern: parse local, embed at the edge.

LiteParse runs where the documents are. The JSON output lands in R2. A Cloudflare Worker pulls chunks, embeds them with Workers AI, and writes vectors to Vectorize. The pipeline respects data residency while still getting the latency benefits of edge infrastructure.

// Node-side ingestion script — run on a box with file-system access
import { parse } from "@llamaindex/liteparse";
import { readdir } from "node:fs/promises";

const files = await readdir("./contracts");

for (const file of files) {
  const result = await parse(`./contracts/${file}`, {
    outputFormat: "json",
    ocrEnabled: true,
    dpi: 150,
    preciseBoundingBox: true,
  });

  // Each page is an array of blocks with text + bbox
  const chunks = result.pages.flatMap(page =>
    page.blocks.map(b => ({
      text: b.text,
      sourceFile: file,
      page: page.number,
      bbox: b.bbox,
    }))
  );

  // Push to R2 as JSONL for downstream embedding
  await putToR2(`parsed/${file}.jsonl`, chunks);
}

Why parse off-edge and embed on-edge. LiteParse has native deps (pdfjs, LibreOffice, ImageMagick) that do not run in Cloudflare Workers — parsing has to happen in a Node container, a VM, or a user machine. Embedding is the opposite: stateless compute that Workers AI handles comfortably at the edge. The split falls out naturally: heavy IO and native deps on the parsing side, stateless compute on the serving side.
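
The edge half, sketched as a Worker. Binding names (AI, BUCKET, INDEX), the embedding model, and the JSONL key layout are assumptions here, not part of LiteParse:

// Worker-side sketch: read parsed chunks from R2, embed with Workers AI, upsert to
// Vectorize. Binding names, model choice, and key layout are assumptions.
export interface Env {
  AI: Ai;
  BUCKET: R2Bucket;
  INDEX: VectorizeIndex;
}

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const key = new URL(req.url).searchParams.get("key");
    if (!key) return new Response("missing ?key=", { status: 400 });

    const obj = await env.BUCKET.get(key);
    if (!obj) return new Response("not found", { status: 404 });

    // One chunk per line, as written by the ingestion script above.
    const chunks = (await obj.text()).trim().split("\n").map(line => JSON.parse(line));

    // Embed chunk text at the edge.
    const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: chunks.map(c => c.text),
    });

    // Keep enough metadata to trace a hit back to file and page.
    await env.INDEX.upsert(
      data.map((values: number[], i: number) => ({
        id: `${key}:${i}`,
        values,
        metadata: { sourceFile: chunks[i].sourceFile, page: chunks[i].page },
      }))
    );

    return new Response(`embedded ${chunks.length} chunks from ${key}`);
  },
};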

Six things that only bite once you ship.

Most of these come from the native dependency chain. The Node CLI is clean; LibreOffice, ImageMagick, and Tesseract are where the friction lives.

OCR is the slow step

A 100-page scanned PDF with default Tesseract.js takes minutes, not seconds. Use --no-ocr for text-native PDFs, or delegate to an external OCR server for batch jobs. Profile before optimising the parser — the parser isn't the problem, OCR is.

LibreOffice headless fails if a GUI instance is already running

If lit parse doc.docx hangs or errors, LibreOffice is either missing or running as a GUI instance that blocks the headless invocation. pkill soffice and retry. On servers, ensure LibreOffice was installed as a CLI-only package.

ImageMagick's default policy blocks PDFs

Fresh Debian/Ubuntu installs of ImageMagick ship with a security policy that refuses to read PDF. If image parsing errors with "not authorized", remove the <policy domain="coder" rights="none" pattern="PDF" /> line from /etc/ImageMagick-6/policy.xml. Known issue, well-documented upstream.

Tesseract.js bundles a single language

For non-English docs, pass --ocr-language <iso> and accept that the first run will be slow — the Tesseract trained data for that language has to download on demand. Pre-warm the cache on build machines so production runs don't stall on language-pack fetch.

Bounding boxes are in PDF point coordinates, not pixels

The bbox tuple is in PDF points (72 DPI). If you render a page screenshot at 300 DPI and try to overlay bounding boxes, multiply by dpi / 72. Every visual-overlay integration hits this on first attempt.
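
A short sketch of the conversion, assuming the bbox tuple is [x1, y1, x2, y2] in points:

// Scale PDF-point coordinates onto a 300 DPI render: pixels = points * (dpi / 72).
const dpi = 300;
const scale = dpi / 72;

function bboxToPixels(bbox: [number, number, number, number]) {
  return bbox.map(v => v * scale) as [number, number, number, number];
}

// A block at [72, 144, 360, 180] points overlays at [300, 600, 1500, 750] pixels.
console.log(bboxToPixels([72, 144, 360, 180]));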

batch-parse is sequential

Large directories process one file at a time. For throughput, shard the input directory and run multiple lit batch-parse processes — or drop to the TypeScript API and use Promise.all with a concurrency limiter. There's no built-in parallelism flag.
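
A sketch of the library-API route, assuming parse() accepts the same camelCase options the CLI flags map to (as in the ingestion script above); the pool size of four is arbitrary:

// Parallel batch parsing with a simple concurrency pool; no external limiter needed.
import { parse } from "@llamaindex/liteparse";
import { readdir } from "node:fs/promises";

async function parseAll(dir: string, concurrency = 4) {
  const files = (await readdir(dir)).filter(f => f.endsWith(".pdf"));
  const results: Record<string, unknown> = {};
  let next = 0;

  // Each worker keeps claiming the next unprocessed file until the list is drained.
  const workers = Array.from({ length: concurrency }, async () => {
    while (next < files.length) {
      const file = files[next++];
      results[file] = await parse(`${dir}/${file}`, { outputFormat: "json" });
    }
  });

  await Promise.all(workers);
  return results;
}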

Use it for, skip it for.

LiteParse is the right call when data cannot leave the machine. When that constraint lifts, the decision gets more interesting.

Use LiteParse when

  • Data residency or POPIA rules forbid cloud document parsers — the parser runs inside your VPC.
  • You're pilot-parsing at zero marginal cost before committing to a paid cloud parser.
  • You need JSON with bounding boxes for layout-aware RAG or visual overlay — same output shape as LlamaParse.
  • Documents are a mix of PDF, DOCX, PPTX, XLSX, and images — one CLI handles all of them.
  • You have a multimodal agent that needs page screenshots alongside parsed text.
  • The pipeline lives on a build machine, a user laptop, or a client's on-prem server — environments where no network calls are acceptable.

How this leaf compounds in the tree.

LiteParse is the ingestion entry point for every RAG use case in the tree. It produces clean text; the rest of the stack turns that text into retrieval, reasoning, and compounding agent capability.

The compounding play is this: LiteParse doesn't produce value on its own — it removes a blocker. Every skill downstream that needs structured text from unstructured documents (legal contract review, healthcare record intake, recruitment CV parsing, ERP document capture, procurement OCR) runs into the same first-mile problem: get the text out cleanly, without sending the document anywhere. LiteParse is the skill you load before any of those skills have a chance of working inside a POPIA-bounded deployment.

The agent angle: an agent can invoke lit parse as a shell command, then pass the parsed JSON back into the context window for reasoning. The skill is compatible with Claude Code, Cursor, and any agent runtime with shell access. No API wrapper, no MCP server required — just a CLI that returns structured output. That simplicity is deliberate; the value compounds where the parsed output lands, not in the parsing step itself.
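
A sketch of that hand-off from a Node-based agent runtime. Only flags documented above are used; the temp-file plumbing is an assumption about where the agent wants the JSON to land:

// Shell out to the CLI from an agent tool handler and return structured JSON.
import { execFile } from "node:child_process";
import { readFile } from "node:fs/promises";
import { promisify } from "node:util";
import { tmpdir } from "node:os";
import { join } from "node:path";

const run = promisify(execFile);

async function parseForAgent(documentPath: string) {
  // Write JSON to a temp file with -o, then read it back into the context window.
  const out = join(tmpdir(), `liteparse-${Date.now()}.json`);
  await run("lit", ["parse", documentPath, "--format", "json", "-o", out]);
  return JSON.parse(await readFile(out, "utf8"));
}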

tech/cloudflare/vectorize
Vectorize
Embed LiteParse JSON chunks into Vectorize from a Worker. Parsed text stays on the parse host; only embeddings hit the edge.
tech/cloudflare/workers-ai
Workers AI
Generate embeddings for parsed chunks at the edge. BGE or Cohere-embed-v3 at sub-50ms — the matching half of local-parse + edge-embed.
tech/cloudflare/r2
R2
Persist source documents and parsed JSONL side-by-side. Zero egress means re-embedding on schema changes is cheap.
data/engineering
Data Engineering (parent)
Document parsing is the unstructured-data branch of ingestion. LiteParse plugs into the same ELT pattern as Fivetran or Airbyte — just for files.
leg/commercial
Legal / Commercial
Contract review pipelines: LiteParse extracts clauses with bounding boxes; a downstream skill runs clause classification and red-flag detection.
biz/hr/recruitment
HR / Recruitment
CV intake at scale. LiteParse handles the DOCX/PDF mix; a downstream skill normalises to structured candidate records.

Go deeper.

LlamaIndex ships the canonical docs and the OCR-server wrappers. The npm page has the latest release notes; the GitHub repo is where issues and new format support land.