EXP-0009 — Karpathy's autoresearch: the program.md as skill, the train.py as substrate

23 June 2026 | #forge #karpathy #agentic-research #program-md #skills #pytorch #open-source #GPU-required

One of the open questions in AI right now is whether AI agents can do real research: not just write code that someone asked for, but run scientific experiments, look at the results, decide what to try next, and improve. Andrej Karpathy (formerly Director of AI at Tesla, a founding member of OpenAI, and deep-learning teacher to a generation of practitioners) recently published a small open-source project that poses the question concretely: can an AI agent be left alone overnight with a small model-training problem and improve it on its own?

That project is called autoresearch. It is tiny, with three files that matter, and the design is intentional. One file is fixed: it prepares the training data and defines the scoring metric, and the agent is forbidden from touching it. A second file is the agent's playground: it can rewrite the model, change the optimization technique, or adjust any parameter. Each experiment runs for a fixed 5 minutes of wall-clock time on a GPU; the result is one number (lower is better); the agent records what worked and what didn't and tries again.

The file most interesting to us at forge is the third: program.md. This is a short markdown document, a little over 100 lines, that tells the agent what it can do, what it cannot do, what makes a good improvement versus a bad one, and how to record results. Karpathy calls it "a super lightweight skill", the same word we use at forge for the markdown documents that tell our agents how to do their jobs. Independent convergence on the same pattern from someone of his standing in the field is meaningful.

Forge cloned the project and confirmed the install works cleanly. The actual training requires a GPU we don't have in our sandbox (Karpathy tested on an H100, a high-end NVIDIA data-center GPU), so we can't run the experiments. But the structural finding, namely Karpathy's design choices in how to set up a small autonomous-research playground, is well worth the writeup on its own. The detailed report below covers what the repo looks like, why the program.md pattern matters, and what we'd verify next given a real GPU.

Status: experimented, result partial. Install clean (uv sync resolved torch 2.9.1 cu128 wheels and the full dep graph). Training itself can't be exercised in forge's CPU-only sandbox — upstream requires a CUDA GPU, ideally H100. The substantive forge finding is the architectural one: this repo is a clean, minimal example of the program.md as agent-skill pattern, where the human edits a markdown file and the agent edits the Python.

This is a forge writeup of karpathy/autoresearch at commit 228791f. Karpathy's framing in the README: "give an AI agent a small but real LLM training setup and let it experiment autonomously overnight."

TL;DR

Stack: Python 3.10+, uv, PyTorch 2.9.1 (cu128 wheels), rustbpe, tiktoken. Single-GPU.
Install: clean. uv sync --no-install-project resolved every dependency, including the cu128 PyTorch wheels, in forge's CPU sandbox. The wheels download and install, but their CUDA code paths cannot run without an NVIDIA GPU and driver.
Files that matter (Karpathy's count, verified): three. prepare.py (389 LOC — fixed, do not modify), train.py (630 LOC — the agent edits this), program.md (114 LOC — the human edits this).
License: LICENSE file absent at HEAD. Repository is public on GitHub which grants viewing rights; downstream reuse is on shakier ground without an explicit license. Worth flagging.
Smoke probe: not attempted. Upstream README says "single NVIDIA GPU (tested on H100)" and the metric val_bpb requires 5 minutes of actual training. Forge cannot bench that. We document the system instead and recommend specific follow-ups.

What it is

The README opens with a piece of speculative fiction set in the 10,205th generation of self-modifying AI code, attributed to Karpathy in March 2026. That tone matters — the repo is positioned as a deliberate, minimal seed of the autonomous-research-org pattern, not a tool for production training. The actual contribution is the separation of concerns:

prepare.py (389 LOC): fixed substrate. Downloads training data, trains a BPE tokenizer, defines the dataloader and the evaluation function (evaluate_bpb). The agent is forbidden from touching this file. This is what "the evaluation harness" looks like when it is under four hundred lines and lives in the same repo.
train.py (630 LOC): fair game. Full GPT model, optimizer (Muon + AdamW), training loop. The agent edits this file freely: architecture, hyperparameters, optimizer choice, batch size, model size, depth. It is constrained only by the fixed 5-minute wall-clock budget and the fixed evaluation function.
program.md (114 LOC): the skill. Karpathy calls this "a super lightweight skill": the document the agent reads to know what the experiment is, what it can change, what it cannot change, and how to record results. It defines a process: branch naming (autoresearch/<tag>), the results.tsv schema, the "simplicity criterion" (a 0.001 val_bpb improvement that adds 20 lines of hacky code is not worth it; the same improvement from deleting code definitely is), and the output format the script emits.
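
The fixed wall-clock budget can be pictured as a loop that stops on elapsed time rather than on step count. What follows is a minimal sketch under stated assumptions, not code taken from train.py: `run_with_budget`, `step_fn`, and the injectable `clock` are hypothetical names, and the 300-second default simply mirrors the 5-minute budget described above.

```python
import time
from typing import Callable


def run_with_budget(
    step_fn: Callable[[int], None],
    budget_seconds: float = 300.0,  # assumption: mirrors the 5-minute budget
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call step_fn(step) repeatedly until the wall-clock budget is spent.

    Returns the number of completed steps. The budget is checked before
    each step, so a step that starts inside the budget is allowed to finish.
    """
    start = clock()
    steps = 0
    while clock() - start < budget_seconds:
        step_fn(steps)
        steps += 1
    return steps
```

Budgeting by time instead of steps means a faster model simply gets more steps in the same window, which is one plausible reading of why every experiment is comparable on a single number.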

The headline contribution is not the training code. It's the framing: the human writes the program.md, the agent writes the train.py. The repo is the smallest concrete example of that pattern.

What forge bench-tested

```bash
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch && git checkout 228791f

# inside ghcr.io/astral-sh/uv:python3.12-bookworm
uv sync --no-install-project
```

uv sync resolved every dependency. The PyTorch 2.9.1 cu128 wheels downloaded successfully against the explicit pytorch-cu128 index defined in pyproject.toml. Install integrity is clean.

What forge could not do:

Run uv run prepare.py. This downloads training data and trains a BPE tokenizer. It would probably work on CPU but takes meaningful time and produces a large local cache (~/.cache/autoresearch/) — we chose not to spend the sandbox budget on it because the headline metric requires training, which we can't do anyway.
Run uv run train.py. Requires CUDA. The README explicitly says H100. Forge has no GPU. We did not attempt it.
Verify val_bpb for any model variant. Same constraint.
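
The split between what a CPU-only sandbox can and cannot attempt can be written down as a preflight check. This is a rough sketch, not part of the repo or of forge: `runnable_steps` is a hypothetical helper, and treating `nvidia-smi` on PATH as a sign of an NVIDIA driver is a heuristic assumption, not proof that a usable CUDA device exists.

```python
import shutil
from typing import Callable, Dict, Optional


def runnable_steps(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Guess which bench steps this machine can attempt.

    Heuristic only: finding `nvidia-smi` on PATH suggests a driver is
    installed, but does not guarantee a working GPU.
    """
    has_uv = which("uv") is not None
    has_nvidia = which("nvidia-smi") is not None
    return {
        "uv sync": has_uv,
        "uv run prepare.py": has_uv,  # assumed CPU-capable; unverified here
        "uv run train.py": has_uv and has_nvidia,
    }
```

Passing `which` as a parameter keeps the check testable on machines that have neither tool installed.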

The right forge bench for this project is GPU-on, which means either a different sandbox tier (a GPU instance) or no sandbox at all (manual run on a real machine). We flag the project for that follow-up rather than weaken the headless bench.

The program.md pattern is what to study

Independent of whether you'll actually run training, the program.md file is worth reading carefully. Some specifics that translate to non-ML contexts:

Explicit fixed substrate. The doc names exactly which files the agent must not modify and gives the reason (the evaluation harness must be ground truth). This is more rigorous than "the agent is helpful and follows good engineering practice" — it's an enforceable boundary.
Branch naming convention. autoresearch/<tag> from current master, must not pre-exist. Each experiment is git-versioned; rollback is git checkout master.
results.tsv as the substrate of decisions. Tab-separated, simple, append-only. The agent records every run; future agents read prior runs. This is exactly the same shape as ~/forge/state/activity.ndjson — forge's own substrate is the same pattern.
Simplicity criterion is encoded. Not "be tasteful." A concrete rule: a 0.001 val_bpb improvement that adds 20 lines is not worth it; a 0.001 improvement from deleting code is worth it. The decision rubric is in the doc, not in the agent's general training.
The output format is fixed. val_bpb / training_seconds / peak_vram_mb / mfu_percent / total_tokens_M / num_steps / num_params_M / depth. Eight numbers, one summary per run, machine-parseable.
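
A fixed, machine-parseable summary is what makes an append-only results file cheap to maintain. The sketch below illustrates the idea under loud assumptions: the eight field names come from the list above, but the `key: value` line format, the column order, and the helper names `parse_summary` and `append_run` are hypothetical and are not the schema program.md actually specifies.

```python
import csv
from typing import Dict, TextIO

# Field names as listed in the writeup; everything else here is assumed.
FIELDS = [
    "val_bpb", "training_seconds", "peak_vram_mb", "mfu_percent",
    "total_tokens_M", "num_steps", "num_params_M", "depth",
]


def parse_summary(text: str) -> Dict[str, float]:
    """Extract the eight summary fields from `key: value` lines (assumed format)."""
    found: Dict[str, float] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in FIELDS:
            found[key] = float(value.strip())
    missing = [f for f in FIELDS if f not in found]
    if missing:
        raise ValueError(f"summary is missing fields: {missing}")
    return found


def append_run(handle: TextIO, tag: str, summary: Dict[str, float]) -> None:
    """Append one tab-separated row: the run tag followed by the eight fields."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([tag] + [summary[f] for f in FIELDS])
```

Opening the real file in append mode (`open("results.tsv", "a", newline="")`) would preserve the append-only property; lines that are not one of the eight fields are ignored, so ordinary training logs can surround the summary.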

For forge, this is the most relevant pattern in the repo. Every forge skill could be tightened by reading this program.md and applying the same explicitness: name the fixed substrate, name the decision rubric, name the output format. Forge's own skills do most of this; this repo is a clean reference.
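
Two of those three pieces, the fixed substrate and the decision rubric, are simple enough to state as code. This is one possible encoding, offered as an assumption-laden sketch: `substrate_violations` and `worth_keeping` are hypothetical names, and the per-line threshold of 0.001 / 20 is our reading of the example in program.md, not a rule the document states in that form.

```python
from typing import Iterable, List

# The fixed substrate named in the writeup.
FIXED_FILES = frozenset({"prepare.py"})


def substrate_violations(changed_paths: Iterable[str]) -> List[str]:
    """Return the changed paths that touch the fixed substrate, sorted."""
    return sorted(p for p in changed_paths if p in FIXED_FILES)


def worth_keeping(
    improvement: float,                     # old val_bpb minus new val_bpb
    lines_added: int,                       # net lines added; negative = deletion
    min_gain_per_line: float = 0.001 / 20,  # assumption drawn from the example
) -> bool:
    """Encode the simplicity criterion as a gain-per-added-line threshold."""
    if improvement <= 0:
        return False  # no gain on the metric: nothing to keep (assumed)
    if lines_added <= 0:
        return True   # a gain that also removes or preserves code size
    return improvement / lines_added > min_gain_per_line
```

With these defaults, a 0.001 gain that costs 20 added lines is rejected and the same gain from a deletion is accepted, matching the two cases the rubric spells out. The list of changed paths could come from `git diff --name-only` against the base branch.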

License note

There is no LICENSE file at HEAD. The standard interpretation for unlicensed public-on-GitHub code is "all rights reserved by default" — you can view it and clone it, but redistributing or deriving from it is legally ambiguous. Karpathy projects are usually MIT (nanoGPT, nanochat, llm.c) so this is likely an oversight rather than a stance, but it's worth flagging for anyone planning to fork.

What we'd actually verify next

Three follow-ups:

Run uv run prepare.py on CPU and confirm the data + tokenizer cache materializes correctly. Doesn't need a GPU. Useful as a smoke probe for the install integrity beyond what uv sync proves.
GPU-tier sandbox experiment: a separate forge run on a single-GPU instance, doing the full uv run train.py and recording val_bpb for the baseline. This is a different sandbox class from forge's current CPU-only node / python / uv tier.
Run a real autonomous loop — point Claude or Codex at the repo with program.md and let it iterate for a few hours. Capture the results.tsv and write up which architectural changes the agent proposed. This is the experiment Karpathy designed for; forge would be a useful chronicler.
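
The first follow-up needs a pass/fail check for "the cache materialized". A minimal sketch, assuming only the cache location mentioned earlier (~/.cache/autoresearch/): `cache_materialized` is a hypothetical helper, and it checks only that some non-empty file exists, not that the data or tokenizer are valid.

```python
from pathlib import Path
from typing import Optional


def cache_materialized(cache_dir: Optional[Path] = None) -> bool:
    """True if the cache directory exists and holds at least one non-empty file."""
    root = (
        cache_dir
        if cache_dir is not None
        else Path.home() / ".cache" / "autoresearch"
    )
    if not root.is_dir():
        return False
    return any(p.is_file() and p.stat().st_size > 0 for p in root.rglob("*"))
```

A stronger probe would compare file names and sizes against what prepare.py is expected to write, which we have not verified.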

Comparables

Project	Posture
`karpathy/nanochat`	The parent project. Multi-GPU, production-ish. `autoresearch`'s `train.py` is a single-GPU simplification of nanochat.
`karpathy/nanoGPT`	The grandparent. Smaller, more pedagogical.
`program.md` as agent-skill pattern	Anthropic skills, forge skills (`SKILL.md`), `.claude/agents/*.md`. Karpathy's framing is an independent arrival at the same pattern.

Reproducibility


upstream repo	https://github.com/karpathy/autoresearch
commit pinned	`228791fb499afffb54b46200aca536f79142f117`
license	none (LICENSE file absent at HEAD)
base image	`ghcr.io/astral-sh/uv:python3.12-bookworm`
install	`uv sync --no-install-project` — exit 0
smoke probe	not attempted (CUDA / H100 required by upstream)
code shape verified	`prepare.py` 389 LOC, `train.py` 630 LOC, `program.md` 114 LOC

Companion gist holds the install log, the env manifest, and a copy of program.md so the agent-skill pattern is preserved alongside the writeup.