Parallel LLM calls and the thread-safety work that followed
#atlas #devlog #feature #building-in-public
David Olsson

Atlas runs a lot of LLM calls. Building a knowledge graph from a document means extracting entities from every text chunk. Generating simulation configs means producing a time config, an event config, and a per-agent config for every entity in the graph, sometimes 80 or more. Each call takes 2-8 seconds. Running them serially was the bottleneck we kept hitting.
The straightforward answer is a thread pool. The less obvious part is everything that breaks when you add one.
Where we added parallelism
We targeted two services.
Graph builder (graph_builder.py) splits document text into chunks and asks the LLM to extract entities and relationships from each. With 30 chunks and 5 threads, extraction time drops from around 3 minutes to under a minute.
Simulation config generator (simulation_config_generator.py) has a multi-step pipeline. The time config and event config calls are independent, so we run both in a single ThreadPoolExecutor(max_workers=2). Agent configs are generated in batches of 15 entities; all batches run in parallel up to MAX_CONCURRENT_LLM_THREADS.
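The time/event split above can be sketched with the standard library. The two `fake_*` functions below are stand-ins for the real LLM calls, which are not shown here:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_time_config():
    # stand-in for the LLM call that produces the time config
    return {"step_minutes": 15}

def fake_event_config():
    # stand-in for the LLM call that produces the event config
    return {"events": []}

# The two calls are independent, so both can be in flight at once.
with ThreadPoolExecutor(max_workers=2) as pool:
    time_future = pool.submit(fake_time_config)
    event_future = pool.submit(fake_event_config)
    time_config = time_future.result()
    event_config = event_future.result()
```

With only two independent calls, a two-worker pool is enough; the batch fan-out for agent configs uses the larger, configurable pool.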
The two-phase graph pipeline
The key design decision for graph building was to separate extraction from insertion.
Phase A fires all LLM extraction calls concurrently. Results are collected into a pre-allocated list indexed by chunk position so ordering is preserved. Phase B iterates that list in order, inserting nodes and edges one chunk at a time.
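A minimal sketch of the two-phase shape, with `extract` standing in for the real LLM extraction call (the chunk contents and worker count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

chunks = ["alpha beta", "gamma delta", "epsilon zeta"]

def extract(chunk):
    # stand-in for the real LLM extraction call
    return {"entities": [chunk.split()[0]]}

# Phase A: fire all extractions concurrently; each result lands in the
# list slot matching its chunk position, so ordering is preserved.
results = [None] * len(chunks)
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(extract, chunk): i for i, chunk in enumerate(chunks)}
    for future, index in futures.items():
        results[index] = future.result()

# Phase B: iterate in chunk order, inserting one chunk at a time so the
# dedup logic can read the current graph state before each write.
graph = []
for result in results:
    graph.extend(result["entities"])
```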
The reason for sequential insertion is deduplication. If two chunks both mention the same entity, the second insert should merge with the first rather than create a duplicate node. That merge logic, find_or_create_node, needs to read the current graph state before writing. Running it from multiple threads simultaneously creates a classic TOCTOU (time-of-check to time-of-use) window.
The TOCTOU race and how we closed it
Before this change, find_or_create_node worked roughly like this in pseudocode:
existing = find_node_by_name(graph_id, name)  # read
if existing:
    return existing.uuid
return add_node(graph_id, name, ...)          # write
Both find_node_by_name and add_node acquired self._lock individually. The gap between releasing the read lock and acquiring the write lock was a race window: two threads could both see "node does not exist", both proceed to create, and produce a duplicate.
The fix wraps the entire find-or-create sequence under a single lock acquisition:
def find_or_create_node(self, graph_id, name, labels, summary="", attributes=None):
    with self._lock:
        existing = self.find_node_by_name(graph_id, name)
        if existing:
            # merge summary and attributes if the new data is richer
            ...
            return existing.uuid
        # call _add_node_unlocked -- NOT add_node -- to avoid deadlock
        return self._add_node_unlocked(graph_id, name, labels, summary, attributes)
The deadlock risk is real: add_node acquires self._lock, so calling it from inside an already-held self._lock block would deadlock. We added _add_node_unlocked (the same insertion logic without the lock acquisition) and call that from within find_or_create_node. The public add_node still acquires its own lock for direct callers.
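The locked-public/unlocked-private split generalizes to any shared store. A minimal sketch, with an in-memory dict standing in for the real graph backend (GraphStore and its fields are illustrative names, not Atlas's actual classes):

```python
import threading

class GraphStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._nodes = {}  # name -> uuid

    def add_node(self, name, uuid):
        # public entry point: acquires the lock for direct callers
        with self._lock:
            return self._add_node_unlocked(name, uuid)

    def _add_node_unlocked(self, name, uuid):
        # same insertion logic, no lock acquisition;
        # only call while self._lock is already held
        self._nodes[name] = uuid
        return uuid

    def find_or_create_node(self, name, uuid):
        # check and create under ONE lock acquisition: no TOCTOU window
        with self._lock:
            if name in self._nodes:
                return self._nodes[name]
            return self._add_node_unlocked(name, uuid)
```

Calling find_or_create_node twice for the same name from any mix of threads returns the first node's uuid rather than creating a duplicate.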
Checkpoint writes and Config.reload()
Two more races appeared once multiple threads were running.
Checkpoint writes. The config generator saves partial results to a JSON file so it can resume after a crash. When multiple agent-batch futures complete near-simultaneously, they all try to write the checkpoint. We added a _ckpt_lock = threading.Lock() local to the generate_config call. Every checkpoint write now happens under that lock and writes to a .tmp file followed by an atomic os.replace, which keeps the file consistent even if the process dies mid-write.
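The write path can be sketched as follows. save_checkpoint and the state shape are illustrative, but the lock-then-tmp-then-os.replace sequence is the one described above:

```python
import json
import os
import tempfile
import threading

_ckpt_lock = threading.Lock()

def save_checkpoint(path, state):
    # hold the lock so near-simultaneous futures serialize their writes,
    # then write to a temp file and rename it into place: os.replace is
    # atomic, so a crash mid-write never leaves a truncated checkpoint
    with _ckpt_lock:
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, path)  # atomic rename on POSIX and Windows

ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
save_checkpoint(ckpt_path, {"completed_batches": [0, 1]})
```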
Config.reload(). The Settings UI lets users change the LLM model and concurrency settings at runtime. Config.reload() reads settings.json and updates class-level attributes. Those attributes are read by threads mid-simulation. We added Config._lock = threading.Lock() and wrapped the attribute assignments in reload() with with cls._lock. Reads of Config.MAX_CONCURRENT_LLM_THREADS happen often enough that a half-updated set of settings, even in Python with the GIL, is not a risk we wanted to carry.
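A stripped-down sketch of the guarded reload, assuming settings arrive as an already-parsed dict rather than being read from settings.json (the attribute names mirror the post; the default values are illustrative):

```python
import threading

class Config:
    # class-level settings, guarded by a class-level lock
    _lock = threading.Lock()
    MAX_CONCURRENT_LLM_THREADS = 5
    LLM_MODEL = "default-model"  # illustrative value

    @classmethod
    def reload(cls, settings):
        # in the real code, `settings` would come from settings.json;
        # all assignments happen under one lock acquisition so readers
        # never observe a half-updated set of attributes
        with cls._lock:
            cls.MAX_CONCURRENT_LLM_THREADS = settings.get(
                "max_concurrent_llm_threads", cls.MAX_CONCURRENT_LLM_THREADS
            )
            cls.LLM_MODEL = settings.get("llm_model", cls.LLM_MODEL)
```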
Configuring concurrency
MAX_CONCURRENT_LLM_THREADS defaults to 5, clamps between 1 and 20, and is now exposed in the Settings UI. The right value depends on your LLM provider's rate limits. We keep 5 as the default because it is conservative enough not to trigger 429s on most hosted endpoints while still delivering a meaningful speedup over serial execution.
The setting is read at the start of each _process_chunks or generate_config call, not at startup, so changes take effect on the next operation without a restart.
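The clamping itself is a one-liner; clamp_threads is an illustrative name for wherever the bounds check lives:

```python
def clamp_threads(value, low=1, high=20):
    # clamp the user-supplied concurrency setting into the allowed range
    return max(low, min(high, int(value)))
```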
What we would do differently
The two-phase approach adds latency when total chunk count is small: you wait for all extractions before any insertion begins. A streaming pipeline (insert as results arrive, with finer-grained locking) would be faster for small documents. We opted for simplicity first; the sequential insertion phase is fast relative to the LLM calls, so the tradeoff is acceptable for now.
The bigger lesson: adding a thread pool to code that was written assuming single-threaded access produces bugs that are non-deterministic and hard to reproduce. Our process was to add the pool, run graph builds on a large document several times, and look for duplicate nodes in the database. They appeared on the first run. The TOCTOU fix eliminated them.