File-System IPC: How Atlas Talks to the OASIS Subprocess
#atlas #devlog #feature #building-in-public
David Olsson

OASIS, the social simulation engine we use, is built on the CAMEL-AI framework. CAMEL manages its own async event loop and makes heavyweight assumptions about process ownership. Running it inside the Flask request cycle is not an option — we would be fighting the framework.
So we spawn it as a separate Python subprocess. That decision is clean, but it creates a coordination problem: two independent processes need to exchange commands, status, and a continuous stream of agent action data in real time.
Here is the design we landed on.
Two channels, both on disk
We use the filesystem as the communication medium. No sockets, no message queues, no shared memory. Two separate channels, one per direction:
Control channel — Flask writes JSON command files to an ipc_commands/ directory inside the simulation's working directory. The subprocess polls that directory at the start of each simulation round, picks up any pending command, and writes a response JSON file to ipc_responses/. Flask polls the response directory until the file appears or a timeout fires.
Data channel — The subprocess appends every agent action to platform-specific JSONL files (twitter/actions.jsonl, reddit/actions.jsonl). A Flask-side monitor thread reads these files with f.seek(position) on each polling cycle, consuming only the new lines. Progress events (round_end, simulation_end) are also written inline to the same JSONL stream.
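The consuming side of the data channel can be sketched as a small standalone function (illustrative name; the real monitor thread keeps one offset per platform file). Reading in binary mode lets us `tell()` after each line and skip a partial trailing line that the subprocess may be mid-appending:

```python
import json
import os

def read_new_actions(path, position):
    """Read any JSONL records appended to `path` since byte offset `position`.

    Returns (events, new_position). A trailing line without a newline is
    treated as a partial write and left for the next polling cycle.
    """
    events = []
    if not os.path.exists(path):
        return events, position
    with open(path, 'rb') as f:
        f.seek(position)
        while True:
            line = f.readline()
            if not line.endswith(b'\n'):
                break  # empty or partial trailing line; retry next poll
            events.append(json.loads(line))
            position = f.tell()
    return events, position
```

Carrying the returned offset into the next call is what makes the reads incremental: only appended lines are parsed, never the whole file.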
# SimulationIPCClient (Flask side): write a command file and poll for the response
def send_command(self, command_type, args, timeout=60.0, poll_interval=0.5):
    command_id = str(uuid.uuid4())
    command = IPCCommand(command_id=command_id, command_type=command_type, args=args)
    command_file = os.path.join(self.commands_dir, f"{command_id}.json")
    with open(command_file, 'w', encoding='utf-8') as f:
        json.dump(command.to_dict(), f)
    response_file = os.path.join(self.responses_dir, f"{command_id}.json")
    start_time = time.time()
    while time.time() - start_time < timeout:
        if os.path.exists(response_file):
            with open(response_file, 'r', encoding='utf-8') as f:
                return IPCResponse.from_dict(json.load(f))
        time.sleep(poll_interval)
    raise TimeoutError(f"Command response timeout ({timeout}s)")
The subprocess side (SimulationIPCServer) does the mirror image: it calls poll_commands() at each round boundary, executes the command, and calls send_response().
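A minimal sketch of that server half, under the directory layout described above. Everything beyond the `poll_commands`/`send_response` names is an assumption about the real class; the write-then-rename in `send_response` is one way to keep the polling client from ever reading a half-written response file:

```python
import json
import os

class SimulationIPCServer:
    """Subprocess side: consume ipc_commands/, answer into ipc_responses/."""

    def __init__(self, sim_dir):
        self.commands_dir = os.path.join(sim_dir, 'ipc_commands')
        self.responses_dir = os.path.join(sim_dir, 'ipc_responses')
        os.makedirs(self.commands_dir, exist_ok=True)
        os.makedirs(self.responses_dir, exist_ok=True)

    def poll_commands(self):
        """Pick up and consume any pending command files (oldest name first)."""
        commands = []
        for name in sorted(os.listdir(self.commands_dir)):
            if not name.endswith('.json'):
                continue
            path = os.path.join(self.commands_dir, name)
            with open(path, 'r', encoding='utf-8') as f:
                commands.append(json.load(f))
            os.remove(path)  # each command is handled exactly once
        return commands

    def send_response(self, command_id, status, data=None):
        payload = {'command_id': command_id, 'status': status, 'data': data}
        path = os.path.join(self.responses_dir, f"{command_id}.json")
        tmp = path + '.tmp'
        with open(tmp, 'w', encoding='utf-8') as f:
            json.dump(payload, f)
        os.replace(tmp, path)  # atomic rename: client sees complete file or nothing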
Pause and resume
Pause is the trickiest part. When a user hits pause, the subprocess may be mid-round — actively making LLM calls, not yet at a polling checkpoint. We can write the PAUSE command file immediately, but the subprocess will not see it until the current round finishes.
We handle this with an optimistic state update. Flask marks the runner status as PAUSED and persists it to run_state.json before the IPC round-trip completes. The IPC call is then made with a short 10-second timeout. If the subprocess is mid-round and does not respond in time, we log a warning and move on — the subprocess will pick up the command file at the next round boundary.
# SimulationRunner.pause_simulation()
state.runner_status = RunnerStatus.PAUSED
cls._save_run_state(state)  # frontend sees PAUSED immediately
ipc_client = SimulationIPCClient(sim_dir)
try:
    ipc_client.send_pause()
except TimeoutError:
    logger.warning(
        f"Pause IPC timeout for {simulation_id} -- "
        "subprocess will pause at next round boundary"
    )
Resume follows the same pattern in the other direction.
Process lifecycle and crash recovery
SimulationRunner.start_simulation() spawns the subprocess with start_new_session=True, giving it its own process group. On Unix, teardown sends SIGTERM to the entire group via os.killpg(), so any children the subprocess forked also exit cleanly.
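In sketch form (Unix-only; the runner command and helper names here are illustrative, not the actual `SimulationRunner` internals):

```python
import os
import signal
import subprocess

def spawn_runner(cmd):
    """Start the simulation subprocess in its own session, and therefore
    its own process group on Unix."""
    return subprocess.Popen(cmd, start_new_session=True)

def terminate_tree(proc, sig=signal.SIGTERM):
    """Signal the whole process group so any children the subprocess
    forked exit along with it."""
    try:
        os.killpg(os.getpgid(proc.pid), sig)
    except ProcessLookupError:
        pass  # group already gone; nothing to do
```

Because `start_new_session=True` makes the child a session leader, its process-group ID equals its PID, and `killpg` reaches every descendant in one call.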
A daemon monitor thread loops on process.poll(). When the process exits, the thread does one final JSONL read pass, then sets the runner status to COMPLETED or FAILED based on the exit code.
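A stripped-down version of that monitor loop, with the JSONL drain and status update abstracted into callables (hypothetical hook names standing in for the real runner internals):

```python
import threading
import time

def monitor(process, drain_jsonl, set_status, poll_interval=2.0):
    """Watch the subprocess; on exit, do a final JSONL pass and record status."""
    def run():
        while process.poll() is None:
            drain_jsonl()
            time.sleep(poll_interval)
        drain_jsonl()  # final pass: catch lines written just before exit
        set_status('COMPLETED' if process.returncode == 0 else 'FAILED')
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

The final drain after the loop is the important detail: without it, lines the subprocess flushed in its last moments would be lost between the last poll and the exit.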
We also detect zombie state on server restart. When Flask reloads run_state.json and sees status: running with a recorded PID, it calls os.kill(pid, 0) to check whether that process is actually alive. If it is not, the status is corrected to FAILED before the state is surfaced to the frontend.
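The liveness probe itself is small: signal 0 performs the existence and permission checks without delivering anything to the target process (helper name is illustrative):

```python
import os

def pid_alive(pid):
    """True if a process with this PID currently exists (Unix)."""
    try:
        os.kill(pid, 0)  # signal 0: probe only, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True
```

On reload, a record claiming `running` whose PID fails this probe is flipped to FAILED before anyone polls it.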
Subprocess activity is surfaced to the frontend via two routes. The LLM monitor picks up telemetry written to llm_ledger.jsonl and pushes it through a server-sent events stream at /monitor/stream. Simulation progress — round completions, action counts, per-platform status — flows through run_state.json and is returned on the next status poll.
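The SSE leg can be sketched as a generator that tails llm_ledger.jsonl and emits `data:` frames; in the app this generator would back the streaming response at /monitor/stream. The `max_polls` knob exists only to make the sketch finite, and the incremental-offset logic mirrors the data-channel reader:

```python
import json
import os
import time

def sse_events(path, poll_interval=2.0, max_polls=None):
    """Yield one server-sent-event frame per JSONL line appended to `path`."""
    position = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        if os.path.exists(path):
            with open(path, 'rb') as f:
                f.seek(position)
                while True:
                    line = f.readline()
                    if not line.endswith(b'\n'):
                        break  # partial trailing write; retry next poll
                    position = f.tell()
                    payload = json.dumps(json.loads(line))
                    yield f"data: {payload}\n\n"
        polls += 1
        time.sleep(poll_interval)
```

Each frame is a `data:` line followed by a blank line, which is the minimal wire format an EventSource client needs.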
Why not sockets or pipes?
The file-system approach has a useful property: both processes can crash and restart independently. If Flask restarts during a run, it re-reads run_state.json and reattaches to the live JSONL files from the last read position. If the subprocess crashes, Flask detects the dead PID and marks the simulation as failed without hanging. There is no connection to re-establish and no message queue to drain.
The polling latency (a 2-second cycle on the monitor thread, 500 ms on command responses) is acceptable at simulation timescales, where each round typically takes 10-60 seconds of LLM calls. If we ever need tighter latency we will revisit this, but for now the simplicity pays for itself.