EXP-0020 — sift-kg: documents → knowledge graph, the CLI counterpart to graphify
#forge#experiment#knowledge-graph#cli#rag#ai-second-brain
David Olsson
Most "build a knowledge graph from your documents" tools force you to declare what you're looking for before you start: define entity types, define relationship types, pick a schema. sift-kg flips that. You point it at a folder of PDFs, papers, articles, or notes; it samples the content, designs a schema tailored to your corpus, extracts entities and relationships with an LLM, deduplicates with your approval, and produces a graph you can explore in your browser or hand to an AI agent as structured memory. Same idea as Notion or Obsidian, but built in two minutes instead of two years, because the structure emerges from the documents rather than being typed in by hand.
Summary
Forge benched sift-kg (juanceresa, 635⭐, MIT, Python) on 2026-06-26 via Slack 🧪. The bench installed it with pip install . in a sandbox, then exercised the no-LLM sub-commands (domains, info, init) on a seeded 2-document mini-corpus.
Verdict: strong-shape. The CLI is real, the design choices hold up under inspection, and sift-kg pairs cleanly with graphify (EXP-0018) on the opposite side of the structured/unstructured axis.
Built + ran
Pinned commit d786991c024f5401f113fc0cb70aee96dd1bd3bf. Build was a clean pip install . in python:3.12-slim. sift --help showed 13 sub-commands: the 9 documented in the README (init, extract, build, resolve, review, apply-merges, narrate, view, export) plus four bonus ones: domains, search, info, and topology. topology is described as "Show structural topology of the knowledge graph (for agents)", which turns the AI-second-brain story into a first-class command.
What's notable
1. Schema-discovery before extraction. Most KG pipelines force a pre-declared schema. sift samples the corpus first, designs entity and relation types, saves them as discovered_domain.yaml for reuse and editing, then extracts. Cuts the time-to-first-graph from "design schema, fail, retry" to "point and shoot."
2. Domains-as-data. sift domains lists 4 bundled corpus shapes:
| name | entities | relations |
|---|---|---|
| academic | 9 | 12 |
| general | 5 | 9 |
| osint | 7 | 10 |
| schema-free | (LLM-discovered) | (LLM-discovered) |
Switching from "investigative work" to "academic literature review" is a flag, not a refactor. The domain YAML files are first-class artifacts.
3. Human-in-the-loop deduplication as a step. Not "auto-merge above 0.85 confidence" — an interactive terminal UI where the user accepts or rejects every proposed merge. Catches the long tail that auto-merge gets wrong.
4. Substrate readout as a command. sift info prints a project-state table (domain, entity types, default model, output directory, documents processed, graph status, narrative status). Same shape as forge-state's own forge-state read or project-state's project-state validate. The substrate exposes itself.
5. Provider-agnostic via LiteLLM. OpenAI / Anthropic / Mistral / Ollama / any LiteLLM-compatible provider. Same compose-don't-lock pattern as outlines and ARD.
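The review-every-merge idea from item 3 can be sketched in a few lines. This is an illustration only, not sift-kg's implementation: the similarity measure (difflib.SequenceMatcher), the 0.7 candidate threshold, and the names propose_merges and review_merges are all assumptions made for the example. The point it shows is that similarity only nominates candidates; a decision callback, standing in for the terminal UI, has to approve each one.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable


def propose_merges(names: list[str], threshold: float = 0.7) -> list[tuple[str, str, float]]:
    """Nominate pairs whose names look similar. Nothing is merged here."""
    candidates = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            candidates.append((a, b, round(score, 2)))
    return candidates


def review_merges(
    candidates: list[tuple[str, str, float]],
    decide: Callable[[str, str, float], bool],
) -> dict[str, str]:
    """Ask `decide` about every candidate; return {alias: canonical} for approved pairs."""
    approved = {}
    for a, b, score in candidates:
        if decide(a, b, score):
            approved[b] = a  # keep the first name as canonical (arbitrary choice)
    return approved


# Hypothetical entity names; "Apex Corp" is a different company that merely looks similar.
entities = ["Acme Corp", "ACME Corp.", "Acme Corporation", "Apex Corp"]
candidates = propose_merges(entities)

# Stand-in for the interactive prompt: a reviewer who rejects any pairing with Apex.
merges = review_merges(candidates, decide=lambda a, b, score: "Apex" not in (a + b))
```

Here string similarity nominates "Acme Corp" / "Apex Corp" alongside the genuine duplicates, which is exactly the kind of pair a fixed-threshold auto-merge would get wrong and a reviewer rejects.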
Direct comparable: graphify (EXP-0018)
These two projects partition the KG-building problem cleanly:
| project | input | output |
|---|---|---|
| graphify | source code | NetworkX KG of imports / calls / inheritance |
| sift-kg | unstructured documents | NetworkX KG of entities / relations |
Both export to GraphML / GEXF / SQLite / CSV. Both target NetworkX as the in-memory KG. Both link every relation back to its source. Together they cover both sides of the structured / unstructured boundary, with a shared output format.
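To make the "every relation links back to its source" point concrete, here is a minimal sketch of a provenance-carrying relation record written out as CSV. The Relation class, its field names, the column layout, and the sample rows are hypothetical, chosen for illustration; they are not the actual export schema of graphify or sift-kg.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str
    source: str  # where the relation was found: a source file or a document


def to_csv(relations: list[Relation]) -> str:
    """Serialize relations to CSV text, one row per relation, provenance included."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(Relation)],
        lineterminator="\n",
    )
    writer.writeheader()
    for rel in relations:
        writer.writerow(asdict(rel))
    return buf.getvalue()


# Made-up rows: one code-derived relation, one document-derived relation.
relations = [
    Relation("parser.py", "imports", "lexer.py", source="src/parser.py"),
    Relation("Ada Lovelace", "authored", "Note G", source="notes/lovelace.md"),
]
csv_text = to_csv(relations)
```

Because both kinds of relation fit the same record shape, rows from a code graph and a document graph can sit in one file and still be traced back to where they came from.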
What I didn't run
sift extract / build / view need an LLM API key. The forge spec forbids passing secrets into the sandbox data plane, so the LLM-driven steps weren't exercised. Anyone with an OpenAI / Anthropic / Mistral / Ollama endpoint should be able to run the full pipeline in a few minutes, though this bench did not verify that.
Install
pip install sift-kg
Then sift init && sift extract ./docs/ && sift build && sift view.
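For scripted runs, the one-liner above can be driven from Python with the same stop-on-first-failure behaviour as &&. This is a sketch under assumptions: it uses only the four commands exactly as written above, run_pipeline and its runner parameter are names invented for the example, and it assumes each step signals failure through a non-zero exit code. As noted earlier, the LLM-driven steps need a provider API key, which this sketch does not handle.

```python
import shutil
import subprocess

# The four steps from the one-liner above, in order.
STEPS = [
    ["sift", "init"],
    ["sift", "extract", "./docs/"],
    ["sift", "build"],
    ["sift", "view"],
]


def run_pipeline(steps=STEPS, runner=subprocess.run) -> int:
    """Run steps in order; stop at the first non-zero exit code, like `&&`."""
    for cmd in steps:
        result = runner(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0


def sift_available() -> bool:
    """True if a `sift` executable is on PATH."""
    return shutil.which("sift") is not None
```

Call sift_available() first and run_pipeline() only if it returns True. The runner parameter exists so the sequencing can be tested with a stub instead of the real CLI.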
Sources
- Pinned commit: d786991c024f5401f113fc0cb70aee96dd1bd3bf
- Repo README
- Live demos
- Prior bench: EXP-0018 — graphify-networkx