Themes at 18: what forge's first eighteen experiments add up to, and three new harvesters that follow
David Olsson
This blog has now run eighteen experiments: eighteen separate projects forge has cloned, built in a clean sandbox, tried to use, and written up. Some were small (a 200-line dashboard); some were large (a 674 MB smart-glasses operating system); a couple weren't projects at all but blog posts forge turned into projects by implementing the system they described. Each writeup stood alone.
This post is different. It's the cross-cutting view: after eighteen experiments, what patterns do we see repeating? Where is the wider open-source AI ecosystem actually converging? And — most usefully — what does that imply for what forge should bench next?
The short answer to the last question: the original way forge finds projects (someone reacts with 🧪 in our Slack #development channel) is too narrow. We've now seen enough about what makes a good forge candidate that we can also go look for those candidates in the wild. This post introduces three new skills that do exactly that — they watch GitHub, RSS feeds, and a curated list of productive authors for projects that match the patterns forge knows how to bench well.
The five themes
After 18 experiments, five themes recur strongly.
1. The SKILL.md / AGENTS.md / program.md convention is now industry consensus
By far the strongest pattern. Independently invented or adopted by ten different projects forge has benched, with no shared lineage:
- forge itself: 10+ skills under `plugin/skills/forge-*`, each a `SKILL.md` with YAML frontmatter.
- Karpathy's autoresearch: `program.md`, which Karpathy explicitly calls "a super lightweight skill."
- GitHub Spec Kit: `AGENTS.md` plus 25 agent-specific integrations.
- HKUDS Vibe-Trading: `agent/SKILL.md`.
- calesthio OpenMontage: 115 SKILL.md files in one repo, plus dedicated AGENTS.md, CLAUDE.md, CODEX.md, COPILOT.md, CURSOR.md per agent.
- Mentra-Community MentraOS: own SKILL.md-like protocol convention.
- safishamsi Graphify: `graphify install --platform <one of 17 agents>` flips an agent-specific SKILL.md into that agent's config dir.
- Anthropic's own skills marketplace: `mcp__anthropic-skills:*` cluster.
- The vibe-studio suite: 10 sibling SKILL.md files (`3d-vibe`, `doc-vibe`, etc.)
- nolly-studio cult-ui: registry pattern (convention-adjacent).
This isn't convergence anymore — it's the convention. The shape: YAML frontmatter declaring name, version, description, dependencies, optional env, optional mcp: block, with the agent-facing body as markdown below.
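To make the shape concrete, here is a minimal sketch of splitting such a file into its frontmatter fields and its markdown body. It is an illustration only: the function name `split_frontmatter` and the example skill are hypothetical, and the parser assumes flat `key: value` lines rather than full YAML.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md-style document into (frontmatter fields, markdown body)."""
    lines = text.splitlines()
    # Frontmatter must open with a '---' line at the very top of the file.
    if not lines or lines[0].strip() != "---":
        return {}, text
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        return {}, text  # no closing delimiter: treat the whole file as body
    fields: dict[str, str] = {}
    for line in lines[1:end]:
        # Assumption: only flat 'key: value' pairs; indented (nested) lines are skipped.
        if line.startswith((" ", "\t")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return fields, body


# Hypothetical skill file, used only for this sketch.
EXAMPLE = """---
name: example-skill
version: 0.1.0
description: Hypothetical skill used only for this sketch
---
# Example skill

Agent-facing instructions go here.
"""

fields, body = split_frontmatter(EXAMPLE)
print(fields["name"], fields["version"])
```

A real harvester would more likely hand the frontmatter to a YAML library so that nested blocks such as `mcp:` are parsed properly; the hand-rolled version just keeps the sketch dependency-free.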
Implication: any GitHub repo with a SKILL.md, AGENTS.md, or program.md file in the root is a forge candidate. The presence of the convention is the signal.
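As a rough sketch of that rule, the check below takes a list of repo-relative paths and reports which convention files sit in the root. The names `AGENT_FILES`, `root_agent_files`, and `is_forge_candidate` are hypothetical, and the match is assumed to be case-sensitive.

```python
from pathlib import PurePosixPath

# The three file names named in the text; matching is assumed case-sensitive.
AGENT_FILES = {"SKILL.md", "AGENTS.md", "program.md"}


def root_agent_files(paths: list[str]) -> list[str]:
    """Return the agent-instruction files that sit directly in the repo root."""
    found = set()
    for p in paths:
        path = PurePosixPath(p)
        # A root-level file has exactly one path component.
        if len(path.parts) == 1 and path.name in AGENT_FILES:
            found.add(path.name)
    return sorted(found)


def is_forge_candidate(paths: list[str]) -> bool:
    """A repo qualifies if at least one convention file is in its root."""
    return bool(root_agent_files(paths))


# Hypothetical file listing for illustration.
listing = ["README.md", "SKILL.md", "docs/AGENTS.md", "src/main.py"]
print(root_agent_files(listing), is_forge_candidate(listing))
```

Note that this strict root-only reading would miss layouts such as `agent/SKILL.md`, so a looser depth limit may be worth considering.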
2. Article-as-spec is forge's highest-leverage template
Three experiments used the article-as-spec template — turning a blog post or substack essay into a working Python package. Three out of three shipped runnable code:
- EXP-0006 — agentic-rl-runner from Cameron Wolfe's Agentic RL essay. Shipped to github.com/worksona/agentic-rl-runner.
- EXP-0013 — ard-tools from Hugging Face's ARD launch. Shipped to github.com/worksona/ard-tools.
- EXP-0018 — Graphify writeup from a MarkTechPost recipe.
The template's success rate is 100% so far. The bottleneck isn't the template; it's that we discover article-as-spec candidates incidentally (someone 🧪s a link). The fix is active discovery.
Implication: watch the RSS feeds of MarkTechPost, the Hugging Face blog, the Cameron Wolfe substack, the Anthropic blog, and dottxt.co. The first three of these produced all three article-as-spec wins so far.
3. Productive authors keep being productive
Ten owners produced eighteen experiments. Many produced multiple:
| owner | benched experiments | hit rate |
|---|---|---|
| Karpathy | EXP-0006 (Wolfe-shape), EXP-0009, EXP-0010 | 3/3 strong-shape |
| HKUDS | EXP-0005, EXP-0015 | 2/2 strong |
| dottxt-ai, motiful, github, nolly-studio, calesthio, safishamsi, pinokiocomputer | 1 each | all strong-or-partial-strong |
Zero abandoned benches. Zero "this owner ships placeholder repos." When a previously-validated owner ships a new public repo, the prior on it being worth a forge bench is very high.
Implication: maintain an explicit watchlist. When Karpathy ships a new public repo, forge should know about it within hours, not weeks.
4. The hosted-SaaS pattern note is its own valid output
Two experiments produced pattern-notes instead of code: EXP-0001 (AutoWiki by Factory.ai) and EXP-0016 (Mistral OCR 4). Both are hosted SaaS with no clonable source. Both produced substantive design notes about what the closed product does and how a comparable open project would be built. Both surfaced specific open alternatives for forge to bench next.
This is now a known-good output shape: write up the design, recommend open alternatives, don't pretend to bench what we can't bench. It works.
Implication: the pattern-note isn't a fallback — it's a first-class result type.
5. The two-plane no-secrets sandbox is the right discipline
Across all eighteen experiments, forge never carried an API key into the sandbox, never had a credentials leak, never benched a project against a secret-bearing fixture. The hard rule paid off: every reproducibility anchor we published is auditable end-to-end, and every experiment that couldn't be fully benched (Mistral OCR needs a key, Pinokio needs a display server, MentraOS Android needs Gradle, Vibe-Trading live trading needs a broker OAuth flow, OpenMontage rendering needs FFmpeg + provider keys) was honest about why — and the honesty is itself a useful output.
This isn't a new theme — it's a confirmation. The discipline holds at 18 experiments.
Aggregate utility — what forge has actually built
The 18 experiments produced:
- 18 published writeups on `/forge` with full reproducibility anchors.
- 3 forge-original installable artifacts promoted to their own repos (cc-gateway-dashboard, agentic-rl-runner, ard-tools).
- 2 article-as-spec Python packages shipped to PyPI-ready repos.
- 1 new skill emitted by an experiment (`forge-agentic-rl`, EXP-0006 origin).
- 3 skill upgrades (forge-experimenter, forge-publisher, forge-packager).
- 3 process / policy notes (Meet forge, repos-vs-gists, the EXP-0012 follow-up).
- ~50 open-source projects referenced, scouted, or benched as comparables.
That's a substantive open-source-ecosystem output. The bottleneck now is intake — finding the next 18 forge-quality candidates faster than the current Slack-🧪 cadence.
Three new harvesters
The themes above suggest the harvesters. Each is added to the forge plugin as a new skill, alongside the existing forge-harvester-slack:
forge-harvester-github — code-search for SKILL.md repos
Watches GitHub's code-search API for new repos containing SKILL.md, AGENTS.md, program.md, or other tracked agent-instruction files. Filters by stars, license, recent commit. Targets the strongest cross-experiment finding — the agent-instruction convention is the signal.
forge-harvester-rss — feed-based discovery for article-as-spec
Watches RSS feeds of MarkTechPost, Hugging Face blog, Anthropic blog, Cameron Wolfe substack, dottxt.co, and opensourceprojects.dev. Filters titles for recipe-style patterns ("Using X and Y to do Z", "Introducing X", "Open-source launch of X"). Enqueues qualifying posts as article-as-spec candidates. Targets the highest-leverage template.
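The title filter could be sketched roughly as follows. The regular expressions are one guess at what "recipe-style" means for the three quoted shapes, not the harvester's actual rules, and the sample titles are invented.

```python
import re

# Assumed patterns, one per title shape quoted above.
RECIPE_PATTERNS = [
    re.compile(r"^using\s+.+\s+to\s+\S+", re.IGNORECASE),        # "Using X and Y to do Z"
    re.compile(r"^introducing\s+\S+", re.IGNORECASE),            # "Introducing X"
    re.compile(r"^open[- ]source\s+launch\s+of\s+\S+", re.IGNORECASE),  # "Open-source launch of X"
]


def is_recipe_title(title: str) -> bool:
    """True if the feed item title matches any recipe-style pattern."""
    cleaned = title.strip()
    return any(p.search(cleaned) for p in RECIPE_PATTERNS)


# Hypothetical feed titles for illustration.
titles = [
    "Using FooDB and BarQueue to Build a Job Scheduler",
    "Introducing ExampleKit",
    "Open-source launch of SampleTool",
    "Quarterly hiring update",
]
candidates = [t for t in titles if is_recipe_title(t)]
print(candidates)
```

Anchoring each pattern at the start of the title keeps the filter conservative; loosening the anchors would surface more posts at the cost of more noise in the queue.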
forge-harvester-watchlist — productive authors keep being productive
Watches a curated list of GitHub authors and orgs (Karpathy, dottxt-ai, HKUDS, Mentra-Community, motiful, safishamsi, calesthio, nolly-studio, github, pinokiocomputer, Mistral-Community, allenai). Enqueues new repos and new major release tags on previously-benched repos. Promotion / demotion is manual — owners are added after a successful bench.
All three are gated identically: stars ≥ thresholds, license OSI-approved, recent commit, dedup against forge's already-benched set. None of them write to the substrate directly; they enqueue candidates that the existing researcher → builder → experimenter → packager → reporter → publisher walk handles unchanged.
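A shared gate along these lines might look like the sketch below. The threshold values, the partial license set, and the `Candidate` fields are all assumptions made for illustration; the post does not state the real numbers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed values; the actual thresholds are not given in the post.
MIN_STARS = 50
MAX_COMMIT_AGE = timedelta(days=90)
# Partial set of OSI-approved licenses, written as SPDX identifiers.
OSI_LICENSES = frozenset({"MIT", "Apache-2.0", "BSD-3-Clause", "MPL-2.0"})


@dataclass(frozen=True)
class Candidate:
    full_name: str        # "owner/repo"
    stars: int
    license_id: str       # SPDX identifier
    last_commit: datetime


def passes_gates(c: Candidate, benched: set[str], now: datetime) -> bool:
    """Apply dedup, star, license, and recency gates in that order."""
    if c.full_name.lower() in benched:
        return False
    if c.stars < MIN_STARS:
        return False
    if c.license_id not in OSI_LICENSES:
        return False
    return now - c.last_commit <= MAX_COMMIT_AGE


# Hypothetical candidates and a fixed clock so the example is deterministic.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
benched = {"example/already-benched"}
fresh = Candidate("example/new-skill", 120, "MIT", now - timedelta(days=3))
stale = Candidate("example/old-skill", 900, "MIT", now - timedelta(days=400))
dupe = Candidate("Example/Already-Benched", 900, "MIT", now - timedelta(days=1))
accepted = [c.full_name for c in (fresh, stale, dupe) if passes_gates(c, benched, now)]
print(accepted)
```

Passing `now` in explicitly, rather than reading the clock inside the gate, keeps the function easy to test and makes a nightly run reproducible after the fact.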
How this changes the orchestrator walk
The nightly walk now starts with four harvesters instead of one:
forge-orchestrator
├── 1. forge-harvester-slack (incidental discovery — 🧪 reactions in #development)
├── 2. forge-harvester-github (systematic — code search across SKILL.md ecosystem)
├── 3. forge-harvester-rss (systematic — article-as-spec feed watch)
├── 4. forge-harvester-watchlist (systematic — productive-author watch)
└── walk(queue) (unchanged — researcher → builder → experimenter → ...)
The walk is identical from researcher onward — the queue doesn't care how candidates got there.
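One way to picture this is a queue builder that runs the harvesters in order and drops duplicates before the walk starts. This is a sketch under assumptions: the `build_queue` name, the stub harvesters, and the choice to record the first source that surfaced a candidate are all hypothetical.

```python
from typing import Callable, Iterable

# A harvester is modelled as a zero-argument callable that yields "owner/repo" strings.
Harvester = Callable[[], Iterable[str]]


def build_queue(harvesters: dict[str, Harvester]) -> list[tuple[str, str]]:
    """Run harvesters in insertion order; return (candidate, first_source) pairs, deduplicated."""
    seen: dict[str, str] = {}
    for name, harvest in harvesters.items():
        for candidate in harvest():
            # Keep the first harvester that surfaced a candidate; later hits are ignored.
            seen.setdefault(candidate, name)
    return list(seen.items())


# Stub harvesters with invented results, in the order shown in the diagram.
harvesters: dict[str, Harvester] = {
    "forge-harvester-slack": lambda: ["example/alpha"],
    "forge-harvester-github": lambda: ["example/beta", "example/alpha"],
    "forge-harvester-rss": lambda: [],
    "forge-harvester-watchlist": lambda: ["example/gamma"],
}

queue = build_queue(harvesters)
for candidate, source in queue:
    print(candidate, "<-", source)
```

The source label is only bookkeeping here; the downstream steps would receive the candidate names and nothing else, which is what keeps the walk unchanged.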
What's next
Three concrete next steps:
- Pilot the wild harvesters for a week. Run each one nightly, watch what they surface, evaluate whether the gates are calibrated correctly. Tighten gates where the queue grows too fast; loosen where it doesn't grow at all.
- Add a fifth harvester for PyPI / npm new-release watching. EXP-0011 (outlines), EXP-0013 (ard-tools), EXP-0018 (graphifyy) all surface clean signal at the package-registry layer. A simple PyPI / npm new-release watcher with keyword filters (`agent`, `skill`, `mcp`, `agentic`) would capture this.
- Run forge on itself. This was the self-tuning roadmap item from the Meet forge flagship. With the wild harvesters in place, forge has enough intake throughput to dogfood the loop.
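The keyword filter in the second step could look roughly like this. It is a sketch, assuming whole-token matching on a package's name and summary; the function name and the sample packages are invented, and no registry API is called.

```python
import re

# The four keywords named in the text.
KEYWORDS = ("agent", "skill", "mcp", "agentic")


def matches_keywords(name: str, summary: str = "") -> bool:
    """True if any keyword appears as a whole token in the name or summary."""
    # Split on anything that is not a lowercase letter or digit, e.g. '-', '_', spaces.
    tokens = set(re.split(r"[^a-z0-9]+", f"{name} {summary}".lower()))
    return any(keyword in tokens for keyword in KEYWORDS)


# Hypothetical new releases for illustration: (package name, summary).
releases = [
    ("mcp-toolkit", "Helpers for building servers"),
    ("example-agentic-runner", "Run training loops"),
    ("image-resizer", "Resize images quickly"),
]
hits = [name for name, summary in releases if matches_keywords(name, summary)]
print(hits)
```

Whole-token matching is a deliberate choice in this sketch: with substring matching, `agentic` would be redundant because it already contains `agent`, and plural forms such as `agents` would need their own entries either way.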
See also
- Meet forge — the flagship article that lays out the lifecycle and the roadmap.
- Repos vs gists — the policy that governs when an artifact gets promoted to its own repo.
- EXP-0006 — Agentic RL — the article-as-spec template's first successful use.
- EXP-0012 — Spec Kit — the agent-integration matrix that crystallized the SKILL.md convention as industry consensus.
Mid-pilot synthesis from forge. The default is the default for a reason; the additions are what eighteen experiments told us about where the signal actually lives.