forge · 29 Jun 2026

EXP-0023 — chunkr: 13 services of Rust document intelligence, AGPL-fortified

#forge #experiment #rust #document-intelligence #rag #agpl #docker-compose

David Olsson

If your job is "feed PDFs into an AI" — contracts, manuals, research papers, anything with tables and images and headers and footnotes — most of the actual work happens before the AI ever sees the document. You have to figure out where the page ends and the table begins. Pull text out of pictures. Tag headers as headers, body as body, captions as captions. Break the result into chunks small enough for the AI to digest. chunkr is one open-source project that ships all of that as a single Docker stack. It's also one of the largest things forge has ever benched: 13 services in the full deploy, written in Rust, with a commercial-license tier behind it.


Summary

Forge benched chunkr (Lumina AI Inc, 3,747⭐, AGPL-3.0, Rust) on 2026-06-29 via Slack 🧪. The full stack's compose.yaml lists 13 services — Postgres + Redis + MinIO + Keycloak + a layout/segmentation/OCR triplet + the Rust server + the web frontend + admin tooling. The forge sandbox doesn't have budget to spin all that up; the bench was the tpa-pin-and-bench no-spin-up variant.

Verdict: strong-shape (structural). The advertised system matches the tree, the multi-tier license model is real, and the deploy weight is consistent with the enterprise positioning.

Pinned

commit: 1bde59beccf9a429af2c63bccd659316c2b4cf3d, AGPL-3.0 + commercial-license tier.

What it is

A production-grade self-hostable document-intelligence pipeline:

  1. Layout analysis — find tables, figures, headers, body, captions
  2. OCR + bounding boxes
  3. Structured output — HTML and Markdown
  4. VLM processing — vision-language model for complex regions

PDFs / DOCX / PPTX / images go in. RAG-ready chunks come out.
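
To make the last step concrete, here is a minimal sketch of how tagged segments might be packed into chunks. It is not chunkr's algorithm: the `Segment` type, the label names, and the word-budget rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str  # hypothetical labels: "header", "body", "table", "caption"
    text: str


def chunk_segments(segments, max_words=50):
    """Greedily pack segments into chunks, aiming for max_words per chunk.

    A header always starts a new chunk so a heading stays attached to the
    text that follows it. A single segment longer than max_words is kept
    whole and becomes its own oversize chunk.
    """
    chunks, current, count = [], [], 0
    for seg in segments:
        n = len(seg.text.split())
        starts_new = seg.kind == "header" or count + n > max_words
        if current and starts_new:
            chunks.append(current)
            current, count = [], 0
        current.append(seg)
        count += n
    if current:
        chunks.append(current)
    return chunks
```

The header rule is the part worth noticing: splitting on word count alone tends to strand a heading at the end of one chunk and its body at the start of the next, which hurts retrieval.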

What's notable

1. Three explicit tiers, three explicit deploy variants.

The README's tier matrix is unusually honest:

| tier | layout | OCR | VLM | Excel |
|---|---|---|---|---|
| Open-source (AGPL) | community models | community OCR | basic open VLM | |
| Cloud API (chunkr.ai) | proprietary | optimized | enhanced | |
| Enterprise | proprietary + custom-tuned | optimized + domain-tuned | custom fine-tunes | |

And then three compose files: compose.yaml (13 services, full Linux GPU), compose.mac.yaml (7 services, Apple-Silicon, no nvidia), compose.cpu.yaml (3 services, CPU-only overrides). Most projects ship one compose and tell Mac users "good luck." chunkr ships three. That's a meaningful signal about engineering discipline.
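
A no-spin-up bench can get service counts like these by reading the compose files without starting anything. The sketch below is one way to do that with only the standard library. It is a deliberately naive line scanner, not a YAML parser, and the sample compose text is invented for the example rather than taken from chunkr.

```python
def service_names(compose_text):
    """List the keys directly under the top-level `services:` mapping.

    Naive by design: assumes block-style YAML indented with spaces, and
    does not understand anchors, flow mappings, or '#' inside strings.
    A real YAML parser is the safer choice for anything unusual.
    """
    in_services = False
    child_indent = None
    names = []
    for raw in compose_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop trailing comments
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent == 0:
            # Any top-level key either opens or closes the services block.
            in_services = line.strip() == "services:"
            child_indent = None
            continue
        if not in_services:
            continue
        if child_indent is None:
            child_indent = indent  # indentation of the first service key
        if indent == child_indent and line.endswith(":"):
            names.append(line.strip()[:-1])
    return names


# Invented sample, not chunkr's actual compose file.
SAMPLE = """\
services:
  postgres:
    image: postgres:16
  redis:
    image: redis:7
  server:
    build: .
    depends_on:
      - postgres
volumes:
  pgdata:
"""

print(len(service_names(SAMPLE)), service_names(SAMPLE))
```

Run against a real checkout, something like `service_names(open("compose.yaml").read())` would give the full-stack list. For an override file such as compose.cpu.yaml, the count is the number of services it overrides, not the size of the resulting stack.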

2. Rust + 7 in-house Dockerfiles.

One root Cargo.toml, 98 .rs files, 7 Dockerfiles. Lumina is investing in tight control of the deploy surface rather than gluing community images together. Consistent with the "we have a commercial tier" story.
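
Counts like these come from walking the checked-out tree. A minimal sketch of such a census follows. The exact rules the bench used are not stated, so treating any filename containing "Dockerfile" as a Dockerfile, and skipping `.git`, are assumptions.

```python
from pathlib import Path


def tree_census(root):
    """Count Rust sources, Dockerfiles, and Cargo manifests under root."""
    root = Path(root)
    files = [
        p
        for p in root.rglob("*")
        # Skip git internals; compare parts relative to root so a parent
        # directory that happens to be named ".git" does not hide everything.
        if p.is_file() and ".git" not in p.relative_to(root).parts
    ]
    return {
        "rs_files": sum(p.suffix == ".rs" for p in files),
        # Assumption: matches Dockerfile, Dockerfile.cpu, server.Dockerfile, ...
        "dockerfiles": sum("Dockerfile" in p.name for p in files),
        "cargo_tomls": sum(p.name == "Cargo.toml" for p in files),
    }
```

Calling `tree_census("chunkr")` on a checkout at the pinned commit should let a reader compare their own numbers with the ones quoted here.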

3. AGPL as commercial moat.

This is forge's first AGPL bench. Prior benches were MIT / Apache / GPL-3. Under AGPL, anyone who runs a modified chunkr as a network service must offer the source of their modifications to that service's users. Lumina is using the license to say "self-host all you want, but you can't out-SaaS me." Outlines (EXP-0011) and Yuxi (EXP-0021) follow the same pattern with different mechanisms.

4. No agent-instruction files.

No AGENTS.md / CLAUDE.md / SKILL.md in the tree. chunkr is a service product, not an agent harness. Clean counter-example: not everything in 2026 OSS adopts the SKILL.md convention forge tracks. Useful data point.
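
The absence check is easy to reproduce. A minimal sketch, assuming the three filenames named above are the whole convention and that matching should ignore case:

```python
from pathlib import Path

AGENT_FILES = ("AGENTS.md", "CLAUDE.md", "SKILL.md")


def find_agent_files(root):
    """Return sorted relative paths of agent-instruction files under root."""
    root = Path(root)
    wanted = {name.lower() for name in AGENT_FILES}
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.name.lower() in wanted
    )
```

An empty list from `find_agent_files("chunkr")` corresponds to the result reported here.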

Position vs prior benches

| project | role | language | services in compose |
|---|---|---|---|
| Yuxi (EXP-0021) | agent harness | Python | 6+ (incl. Milvus/Neo4j) |
| chunkr (EXP-0023) | document service | Rust | 13 |
| sift-kg (EXP-0020) | KG CLI | Python | 0 |
| graphify (EXP-0018) | KG CLI | Python | 0 |

chunkr is the only Rust project in the doc-pipeline cohort. Forge's first Rust bench at scale.

What I didn't run

A full docker compose up against the 13-service stack. That would have pulled ~5-8 GB of images and taken 15-20 minutes from clone to first request — beyond the per-experiment budget. Verifying the actual throughput claims (which is where Rust matters) requires that run; the bench can verify shape, not speed.

Install

git clone --depth 1 https://github.com/lumina-ai-inc/chunkr.git
cd chunkr
docker compose up                       # full stack
# or:
docker compose -f compose.mac.yaml up   # Apple Silicon
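
A `--depth 1` clone checks out whatever the default branch points at today, not the commit benched here. To reproduce the bench against the pinned commit, one option is to fetch that commit directly. The sketch below builds the git commands for that and runs them. It assumes the remote allows fetching a commit by its full SHA (GitHub generally does), and the helper names are made up for the example.

```python
import subprocess

REPO_URL = "https://github.com/lumina-ai-inc/chunkr.git"
PINNED_SHA = "1bde59beccf9a429af2c63bccd659316c2b4cf3d"


def pin_commands(url, sha, dest):
    """Build the git commands for a shallow checkout of one exact commit."""
    if len(sha) != 40 or any(c not in "0123456789abcdef" for c in sha):
        raise ValueError("expected a full 40-character lowercase hex SHA")
    return [
        ["git", "init", dest],
        ["git", "-C", dest, "remote", "add", "origin", url],
        ["git", "-C", dest, "fetch", "--depth", "1", "origin", sha],
        ["git", "-C", dest, "checkout", "--detach", "FETCH_HEAD"],
    ]


def checkout_pinned(url=REPO_URL, sha=PINNED_SHA, dest="chunkr"):
    """Run the commands in order, stopping at the first failure."""
    for cmd in pin_commands(url, sha, dest):
        subprocess.run(cmd, check=True)
```

After `checkout_pinned()`, the `docker compose` commands above would run against the same tree the bench inspected.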
