EXP-0006 — Agentic RL: turning an essay into a working harness

23 June 2026
#forge #agentic-rl #reinforcement-learning #llm #python #open-source #article-as-spec #operationalization

Update (2026-06-24): the runner now has its own canonical home at github.com/worksona/agentic-rl-runner — clone, fork, file issues, install via pip install git+https://github.com/worksona/agentic-rl-runner. The companion gist remains the frozen reproducibility anchor for this writeup. Forge's forge-packager skill now codifies the promote-to-repo rule for forge-original installable artifacts; see the follow-up post on repos vs gists.

For the layman

The big leap between today's chatbots and what people mean when they say "AI agents" is this: a chatbot answers one question at a time, while an agent takes a goal, tries something, looks at the result, tries the next thing, and keeps iterating until the goal is met. To make agents reliably good at this, researchers are training them with a technique called Agentic Reinforcement Learning — a fancy way of saying "let the agent practice in a sandbox where it gets rewarded when it succeeds and not when it fails, and run that practice loop millions of times."

Cameron R. Wolfe wrote an excellent essay surveying how this is actually being done in 2026 — what the loops look like, what the reward signals are, which research groups are pushing forward, and what the open problems are. The essay is dense, careful, and full of named techniques: GRPO, ScalingInter-RL, AgentRL's task-level renormalization, AutoForge's environment synthesis.

This article is the result of forge doing something unusual with the essay: instead of just summarizing it, forge tried to build the system it describes. Not the full training pipeline — that needs millions of dollars of GPU time — but the runnable core: a small Python package that runs multi-turn agent rollouts, scores them with the same group-relative advantage math the essay describes, and lets anyone plug in their own AI model (Claude, GPT, a local one) as the agent. The package is called agentic-rl-runner. It's 6 modules and 13 tests. All tests pass.

The point: when forge reads an essay or a paper, it shouldn't just nod and link to it. If the essay describes a system, forge should try to ship that system as code. That way, three months from now, anyone who wants to experiment with agentic RL has a starting point — not just a reading list. We're calling this pattern "article-as-spec," and agentic-rl-runner is the first one shipped.

Status: experimented, result success. Source was Cameron R. Wolfe's Agentic RL substack essay — not a clonable repo. Forge applied the new article-as-spec template (added to forge-experimenter this run), extracted the system the essay describes, implemented it as the agentic-rl-runner Python package, ran 13 unit tests (all pass), and shipped both the package and a companion skill (forge-agentic-rl) that operationalizes the harness as a reusable forge tool.

Canonical home: github.com/worksona/agentic-rl-runner (the runner's living source). Frozen anchor: gist.github.com/worksona/f8769d51… (this writeup's reproducibility snapshot).

TL;DR

agentic-rl-runner v0.1.0 — MIT, Python 3.10+, ~500 lines, zero hard runtime deps. Implements the operational core of agentic RL: Environment protocol, multi-turn rollout loop, BinaryOutcomeReward, GRPO-style group advantages, AgentRL-style task renormalization, two reference environments (calculator with calc(...) tool, fact-check with lookup(...) tool), and a CLI (arl bench, arl demo).
Tests: 13/13 pass on python:3.12, total 0.08 s. Test coverage spans environment happy + error paths, rollout step limits, GRPO normalization on uniform / mixed / singleton groups, task renormalization, and an explicit assertion that the default reward is {0, +1} — not {−0.5, +1} (per Wolfe's finding that −0.5 penalties confuse the advantage signal).
CLI demo verified: three scripted policies (perfect / wrong / noop) × calculator env, group rewards [1.05, 1.05, 1.05, 1.05], [0.1, 0, 0, 0], [0, 0, 0, 0] respectively — GRPO advantages collapse to zeros within unanimous groups (correct), and the perfect-vs-noop comparison after task-normalization yields a +3.30 advantage for the perfect trajectory. The math matches the essay.
Companion skill shipped: forge-agentic-rl (under plugin/skills/forge-agentic-rl/SKILL.md) — wraps the package as a reusable forge tool with experiment_origin: EXP-0006.
forge-experimenter upgraded: four new templates added for non-build sources (article-as-spec, paper-claim-reproduce, tpa-pin-and-bench, commentary-pattern-note). The operationalization rule is now codified: every non-build experiment must produce a portable artifact that runs without reference to the original source.

Install + use

bash

pip install git+https://github.com/worksona/agentic-rl-runner
arl demo

What the essay says (in one paragraph)

Wolfe's essay defines agentic RL as reinforcement-learning training of LLM systems that operate in multi-turn, interactive environments — generate an action, execute a tool, observe the result, iterate until a terminal reward is computed. The state space combines the LLM's visible context AND the external environment state (non-deterministic, modified by tool calls). The training loop alternates between rollouts (sample N completions across isolated environment instances) and policy updates (use the rollouts + their rewards to compute a gradient step). The dominant optimizers are GRPO (most common), PPO (preferred for long-horizon tasks), and REINFORCE. Reward design is unusually consequential: binary outcome rewards ({0, +1} or {+1, −1}) work, but −0.5 penalties for wrong answers empirically underperform zero. Task-level advantage renormalization (AgentRL paper) prevents any one domain from dominating multi-task updates; ScalingInter-RL curriculum (AgentGym-RL paper) stabilizes long-horizon training by progressively increasing interaction budgets.

That's a system. Forge built it.
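The loop the essay describes (act, execute a tool, observe, repeat until a terminal reward) can be sketched in a few dozen lines. This is a minimal illustration, not the package's code: `LookupEnv`, `Step`, `Trajectory`, `outcome_reward`, and the two policies are hypothetical names chosen for this sketch, and the real `agentic-rl-runner` signatures may differ.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    observation: str
    action: str
    result: str


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    reward: float = 0.0


class LookupEnv:
    """Toy environment (hypothetical): one lookup(...) tool, then a final answer."""

    FACTS = {"capital_of_france": "Paris"}

    def __init__(self) -> None:
        self.final: str | None = None

    def reset(self) -> str:
        self.final = None
        return "Question: capital_of_france? Call lookup(<key>) or answer(<text>)."

    def step(self, action: str) -> tuple[str, bool]:
        """Execute one action and return (observation, done)."""
        if action.startswith("lookup(") and action.endswith(")"):
            key = action[len("lookup("):-1]
            return self.FACTS.get(key, "unknown key"), False
        self.final = action  # anything that is not a tool call ends the episode
        return "episode over", True

    def outcome_reward(self) -> float:
        """Binary outcome reward in {0, +1}; wrong or missing answers get 0."""
        return 1.0 if self.final == "answer(Paris)" else 0.0


PolicyFn = Callable[[str, list[Step]], str]


def rollout(env: LookupEnv, policy: PolicyFn, max_steps: int = 4) -> Trajectory:
    """Run one multi-turn episode, stopping at a final answer or the step limit."""
    traj = Trajectory()
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs, traj.steps)
        result, done = env.step(action)
        traj.steps.append(Step(obs, action, result))
        obs = result
        if done:
            break
    traj.reward = env.outcome_reward()  # reward is computed once, at the end
    return traj


def good_policy(obs: str, history: list[Step]) -> str:
    # First turn: use the tool. Second turn: answer with what the tool returned.
    return "lookup(capital_of_france)" if not history else f"answer({obs})"


def noop_policy(obs: str, history: list[Step]) -> str:
    # Never answers, so the rollout runs into the step limit.
    return "lookup(nothing)"


good = rollout(LookupEnv(), good_policy)
stuck = rollout(LookupEnv(), noop_policy)
print(good.reward, len(good.steps))    # 1.0 2
print(stuck.reward, len(stuck.steps))  # 0.0 4
```

The step limit matters as much as the happy path: a policy that never commits to an answer still yields a finished trajectory with reward 0, which is what lets it sit in a group next to successful rollouts.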

What forge built

agentic-rl-runner is the provider-agnostic, gradient-free core of an agentic-RL stack:

src/agentic_rl_runner/
├── env.py        Environment protocol, Trajectory / Step dataclasses
├── policy.py     Policy protocol, ScriptedPolicy, CallablePolicy
├── reward.py     BinaryOutcomeReward — explicitly {0, +1}, NOT {−0.5, +1}
├── rollout.py    rollout(), run_group(), grpo_advantages(), task_normalize()
├── envs.py       CalculatorEnv (calc(...) tool), FactCheckEnv (lookup(...) tool)
└── cli.py        `arl bench` and `arl demo`

The deliberate non-features matter as much as the features:

No gradient step. The runner produces advantages; you hand them to TRL, OpenRLHF, or whatever trainer you already have. Decoupling here is correct — the policy-update math is well-trodden and the existing libraries are better at it than anything forge would write.
No LLM client coupling. The Policy protocol is just act(observation, history) → str. Any chat-completion client wires in as a CallablePolicy in five lines.
No HTTP environment hosting. AgentGym-RL's HTTP environment protocol is a clean follow-up — about another 100 lines, not in scope for v0.1.
No task synthesis. AutoForge-style automated environment synthesis is a separate package that consumes this one.

The runner does what's hardest to get right and easiest to get wrong: the rollout bookkeeping, the GRPO math, and the task-normalization step.
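To make the "no LLM client coupling" point concrete, here is a hedged sketch of how a plain text-completion function could be adapted to the act(observation, history) → str shape. `PolicySketch`, `CallablePolicySketch`, and `fake_llm` are illustrative names for this example only; the package's own `Policy` and `CallablePolicy` may take different arguments.

```python
from typing import Callable, Protocol, Sequence


class PolicySketch(Protocol):
    """Assumed shape of the policy interface: observation + history in, action out."""

    def act(self, observation: str, history: Sequence[str]) -> str: ...


class CallablePolicySketch:
    """Adapts any function (prompt: str) -> str to the act(...) shape."""

    def __init__(self, complete: Callable[[str], str]) -> None:
        self.complete = complete

    def act(self, observation: str, history: Sequence[str]) -> str:
        # Build a flat prompt from prior turns plus the current observation.
        prompt = "\n".join([*history, observation])
        return self.complete(prompt).strip()


def fake_llm(prompt: str) -> str:
    # Stand-in for a real chat-completion call; swap in any client here.
    return "  calc((7 + 3) * 4)\n"


policy: PolicySketch = CallablePolicySketch(fake_llm)
action = policy.act("Compute (7 + 3) * 4", history=[])
print(action)  # calc((7 + 3) * 4)
```

Because the adapter only needs a string-in, string-out function, swapping providers means replacing the body of `fake_llm` and nothing else.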

What the demo shows

arl demo runs three scripted policies against the calculator env ((7 + 3) * 4 = 40), 4 rollouts each, and prints the advantages:

{
  "per_policy": [
    {
      "policy": "calc-perfect",
      "rewards": [1.05, 1.05, 1.05, 1.05],
      "advantages": [0, 0, 0, 0],
      "group_mean": 1.05,
      "group_std": 0
    },
    {
      "policy": "calc-wrong",
      "rewards": [0.1, 0, 0, 0],
      "advantages": [1.732, -0.577, -0.577, -0.577],
      "group_mean": 0.025
    },
    {
      "policy": "calc-noop",
      "rewards": [0, 0, 0, 0],
      "advantages": [0, 0, 0, 0],
      "group_std": 0
    }
  ],
  "task_normalized_advantages": {
    "calc": [3.30, -0.33, -0.33, -0.33, 0.00, -0.33, -0.33, -0.33, -0.33, -0.33, -0.33, -0.33]
  }
}

Three things to notice:

Unanimous groups produce zero advantages. Both calc-perfect (all 1.05) and calc-noop (all 0.0) have group_std == 0. The GRPO formula (r − mean) / std is undefined when std == 0, and forge returns zeros — the correct semantics, because there is no signal to learn from when every rollout agrees.
Mixed groups produce a learning signal. The calc-wrong policy has one trajectory that accidentally collected a process bonus (tool_call_ok) for a wrong tool call, so its reward differs slightly. GRPO correctly gives the high-reward trajectory a +1.73 advantage and the others −0.58. The mean of the advantages is exactly zero, as required.
Task-normalization centers across all 12 rollouts. When we flatten all 12 trajectories under a single "calc" task and renormalize, the one perfect trajectory dominates with +3.30; everything else clusters around −0.33. This is the AgentRL recipe in action — the runner shows the same shape the paper describes.
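The first two observations can be checked by hand. The sketch below is an independent re-implementation of the (r − mean) / std formula, not the package's `grpo_advantages()`; the function name `group_advantages` and the `eps` threshold are assumptions. It uses the population standard deviation, which is the convention that reproduces the +1.732 / −0.577 figures in the demo output.

```python
from statistics import fmean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: (r - mean) / std, or zeros when std is ~0."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the group
    if std < eps:
        # Unanimous (or singleton) group: no learning signal, so return zeros
        # instead of dividing by zero.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


mixed = group_advantages([0.1, 0.0, 0.0, 0.0])
print([round(a, 3) for a in mixed])            # [1.732, -0.577, -0.577, -0.577]
print(group_advantages([1.05, 1.05, 1.05, 1.05]))  # [0.0, 0.0, 0.0, 0.0]
print(group_advantages([0.0, 0.0, 0.0, 0.0]))      # [0.0, 0.0, 0.0, 0.0]
```

For the mixed group the mean is 0.025 and the population standard deviation is about 0.0433, so the one rewarded trajectory gets 0.075 / 0.0433 ≈ 1.732 and each of the others gets −0.025 / 0.0433 ≈ −0.577.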

The demo runs in milliseconds. No GPU, no API key, no compute spend. Anyone with pip and docker can reproduce these numbers in under a minute.

The skill

EXP-0006 also ships a companion forge skill: forge-agentic-rl (under plugin/skills/forge-agentic-rl/SKILL.md). The skill wraps the package so that it is invocable as a forge tool. Future forge experiments, and forge itself when it eventually runs an agentic-RL experiment on its own policy (see roadmap below), call this skill rather than re-implementing the harness.

The skill's frontmatter carries experiment_origin: EXP-0006, which is a new convention we're adopting: every skill that originates from a forge experiment carries the EXP id, so the lineage is queryable.

Why this matters for forge

Most "AI summary" content stops at the summary. Forge's bar is higher: if a source describes a system, forge tries to ship the system. The new article-as-spec template in forge-experimenter codifies this. The rule, exactly:

For every non-build experiment, ask in this order:

Can I implement the system the source describes? If yes → article-as-spec or paper-claim-reproduce.

Can I implement a portable wrapper around what it offers? If yes → tpa-pin-and-bench.

Is the source too thin to operationalize? If yes → commentary-pattern-note (short, with follow-ups).

The artifact's lifetime success criterion is that a future reader can pip-install / docker-pull / clone-and-run what we shipped, with no reference back to the original source needed.

EXP-0006 is the first experiment to use the new template, and it makes the rule concrete: pip install -e . + pytest + arl demo gets you a working agentic-RL harness in 8 seconds. The essay was the seed; the package is the deliverable.

Reproducibility


| Item | Value |
| --- | --- |
| canonical repo | https://github.com/worksona/agentic-rl-runner |
| frozen gist | https://gist.github.com/worksona/f8769d517b9133563074efc5078d1fb9 |
| source type | article-as-spec |
| source | https://cameronrwolfe.substack.com/p/agentic-rl |
| base image | `python:3.12` |
| install | `pip install git+https://github.com/worksona/agentic-rl-runner` |
| tests | `pytest -q tests/` (13 passed in 0.08 s) |
| CLI demo | `arl demo` (advantage math rendered for 3 policies × calc env × 4 rollouts) |
| companion skill | `plugin/skills/forge-agentic-rl/SKILL.md` (frontmatter: `experiment_origin: EXP-0006`) |