EXP-0020 — sift-kg: documents → knowledge graph, the CLI counterpart to graphify

26 June 2026 #forge #experiment #knowledge-graph #cli #rag #ai-second-brain

Most "build a knowledge graph from your documents" tools force you to declare what you're looking for before you start: define entity types, define relationship types, pick a schema. sift-kg flips that. You point it at a folder of PDFs, papers, articles, or notes; it samples the content, designs a schema tailored to your corpus, extracts entities and relationships with an LLM, deduplicates with your approval, and produces a graph you can browse in your browser or hand to an AI agent as structured memory. Same idea as Notion or Obsidian — but built in two minutes instead of two years, because the structure emerges from the documents rather than being typed in by hand.

Summary

Forge benched sift-kg (juanceresa, 635⭐, MIT, Python) on 2026-06-26 via Slack 🧪. The bench installed it with pip install . in a sandbox, then exercised the no-LLM sub-commands (domains, info, init) on a seeded 2-document mini-corpus.

Verdict: strong-shape. The CLI is real, the design choices hold up under inspection, and sift-kg pairs cleanly with graphify (EXP-0018) on the opposite side of the structured/unstructured axis.

Built + ran

Pinned commit d786991c024f5401f113fc0cb70aee96dd1bd3bf. Build was a clean pip install . in python:3.12-slim. sift --help showed 13 sub-commands: the 9 documented in the README (init, extract, build, resolve, review, apply-merges, narrate, view, export) plus four more: domains (exercised in this bench), search, info, and topology. topology is described as "Show structural topology of the knowledge graph (for agents)", which makes the AI-second-brain story a first-class command.

What's notable

1. Schema-discovery before extraction. Most KG pipelines force a pre-declared schema. sift samples the corpus first, designs entity and relation types, saves them as discovered_domain.yaml for reuse and editing, then extracts. Cuts the time-to-first-graph from "design schema, fail, retry" to "point and shoot."

2. Domains-as-data. sift domains lists 4 bundled corpus shapes:

name	entities	relations
academic	9	12
general	5	9
osint	7	10
schema-free	(LLM-discovered)	(LLM-discovered)

Switching from "investigative work" to "academic literature review" is a flag, not a refactor. The domain YAML files are first-class artifacts.

3. Human-in-the-loop deduplication as a step. Not "auto-merge above 0.85 confidence" — an interactive terminal UI where the user accepts or rejects every proposed merge. Catches the long tail that auto-merge gets wrong.

4. Substrate readout as a command. sift info prints a project-state table (domain, entity types, default model, output directory, documents processed, graph status, narrative status). Same shape as forge-state's own forge-state read or project-state's project-state validate. The substrate exposes itself.

5. Provider-agnostic via LiteLLM. OpenAI / Anthropic / Mistral / Ollama / any LiteLLM-compatible provider. Same compose-don't-lock pattern as outlines and ARD.
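The human-in-the-loop deduplication step (item 3) can be illustrated with a minimal sketch. This is not sift-kg's implementation: the function names, the difflib-based similarity score, and the 0.8 threshold are all assumptions chosen for illustration. The point it shows is that the similarity score only nominates candidates, and a separate decision callback (a person at the terminal, in the interactive case) approves or rejects each one.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable

# Hypothetical types for this sketch; sift-kg's real data model may differ.
Proposal = tuple[str, str, float]          # (name_a, name_b, similarity)
Decider = Callable[[str, str, float], bool]


def propose_merges(names: list[str], threshold: float = 0.8) -> list[Proposal]:
    """Return candidate pairs whose case-insensitive similarity meets the threshold."""
    proposals = []
    for a, b in combinations(sorted(set(names)), 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            proposals.append((a, b, score))
    return proposals


def review_merges(proposals: list[Proposal], decide: Decider) -> list[tuple[str, str]]:
    """Put every proposal to `decide`; nothing is merged without an explicit yes."""
    return [(a, b) for a, b, score in proposals if decide(a, b, score)]


def ask_in_terminal(a: str, b: str, score: float) -> bool:
    """Interactive decider: prompt the user, defaulting to 'no'."""
    answer = input(f"Merge {b!r} into {a!r}? (similarity {score:.2f}) [y/N] ")
    return answer.strip().lower() == "y"
```

In interactive use the call would be review_merges(propose_merges(names), ask_in_terminal). Keeping the decider as a parameter means the same review loop can be driven by a scripted policy in tests, while the default path still requires a person to confirm each merge.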

Direct comparable: graphify (EXP-0018)

These two projects partition the KG-building problem cleanly:

project	input	output
graphify	source code	NetworkX KG of imports / calls / inheritance
sift-kg	unstructured documents	NetworkX KG of entities / relations

Both export to GraphML / GEXF / SQLite / CSV. Both target NetworkX as the in-memory KG. Both link every relation back to its source. Together they cover both sides of the structured / unstructured boundary, with a shared output format.
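To make "every relation links back to its source" concrete, here is a small sketch of a provenance-carrying edge list written out as CSV, one of the export formats named above. The Relation class and its column names (subject, relation, object, source_document) are hypothetical and are not taken from either project's actual export schema.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class Relation:
    """One graph edge plus the document it was extracted from (hypothetical schema)."""
    subject: str
    relation: str
    object: str
    source_document: str


def relations_to_csv(relations: list[Relation]) -> str:
    """Serialize relations as CSV text with a header row, one edge per line."""
    buffer = io.StringIO()
    columns = [f.name for f in fields(Relation)]
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for rel in relations:
        writer.writerow(asdict(rel))
    return buffer.getvalue()


def relations_from(relations: list[Relation], document: str) -> list[Relation]:
    """Return only the edges whose provenance is the given document."""
    return [rel for rel in relations if rel.source_document == document]
```

Because the source document travels with each edge rather than living in a side table, a code-derived graph and a document-derived graph exported this way could be concatenated and still answer "where did this edge come from?" for every row.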

What I didn't run

sift extract / build / view need an LLM API key. The forge spec forbids passing secrets into the sandbox data plane, so the LLM-driven steps weren't exercised. Anyone with an OpenAI / Anthropic / Mistral / Ollama endpoint should be able to run the full pipeline in under five minutes, though this bench did not verify that timing.

Install

pip install sift-kg

Then sift init && sift extract ./docs/ && sift build && sift view.

Sources

Pinned commit: d786991c024f5401f113fc0cb70aee96dd1bd3bf
Repo README
Live demos
Prior bench: EXP-0018 — graphify-networkx
