Dante Bronte: Chemical Structure Extraction from Pharmaceutical PDFs
#worksona #atomic47 #pharmaceutical #cheminformatics #ai #pdf #computer-vision
David Olsson
Dante Bronte is a React/Vite/TypeScript application that extracts chemical structure diagrams from pharmaceutical PDFs using AI vision models. Users upload a PDF, the application renders pages via a PDF.js Web Worker, and GPT-4o or Claude vision models detect and localize chemical structure regions, returning bounding boxes over each detected structure.
An interactive bounding-box editor lets chemists refine detections with undo/redo. Detection is not treated as final: the editor is a correction interface, not an exception handler. Confirmed crops are sent to the Dante Decimer service for SMILES string generation, and results are stored in IndexedDB alongside source metadata. The accumulated IndexedDB store constitutes a project-level compound library that the broader Dante stack consumes for validation and enrichment.
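The undo/redo behaviour of the editor can be sketched as a snapshot history. This is a minimal illustration, not the application's actual implementation: the `BoxHistory` class and its method names are hypothetical, and it assumes each edit produces a complete new array of boxes for the page.

```typescript
interface BoundingBox { x: number; y: number; width: number; height: number; }

// Hypothetical history container: every entry is a full snapshot of a page's boxes.
class BoxHistory {
  private past: BoundingBox[][] = [];
  private future: BoundingBox[][] = [];
  private present: BoundingBox[];

  constructor(initial: BoundingBox[]) {
    this.present = initial;
  }

  get boxes(): BoundingBox[] {
    return this.present;
  }

  // Record an edit: the current state becomes undoable and any redo branch is discarded.
  apply(next: BoundingBox[]): void {
    this.past.push(this.present);
    this.present = next;
    this.future = [];
  }

  // Step back one edit; returns false when there is nothing to undo.
  undo(): boolean {
    const previous = this.past.pop();
    if (previous === undefined) return false;
    this.future.push(this.present);
    this.present = previous;
    return true;
  }

  // Reapply the most recently undone edit; returns false when there is nothing to redo.
  redo(): boolean {
    const next = this.future.pop();
    if (next === undefined) return false;
    this.past.push(this.present);
    this.present = next;
    return true;
  }
}
```

Storing whole snapshots keeps the logic simple at the cost of memory, which is likely acceptable when a page holds only a handful of boxes.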
Why is it useful?
Pharmaceutical AI pipelines require structured chemical data, but most published chemistry lives in PDFs: patents, regulatory filings, synthesis papers, clinical study reports. These documents are not structured databases; chemical structures appear as rasterized diagrams embedded in dense text, formatted for human readers. Converting that content to machine-readable SMILES is the bottleneck before any computational chemistry workflow can begin.
Manual extraction is the established alternative. A trained chemist working through a dense patent can spend hours per document, identifying structures, redrawing them in a structure editor, and exporting SMILES. Vision-model detection reduces that to minutes per document; the cited reduction is 90 to 95 percent of transcription time. The bounding-box editor preserves chemist judgment at the point where it adds the most value: confirming and correcting detections rather than performing the initial scan.
The choice to expose the bounding-box editor rather than auto-confirming detections reflects a deliberate design position. Vision model accuracy on complex multi-ring structures or structures with non-standard notation is high but not perfect. The correction loop is not overhead; it is the quality gate.
How and where does it apply?
Bronte is the first stage in the Dante pipeline: Bronte extracts and confirms structures, Decimer converts crops to SMILES, and downstream Dante tooling validates and enriches the compound library. Pharmaceutical R&D teams use it to build compound libraries from literature at a scale that manual extraction cannot reach. Patent analysis workflows apply it to batches of patents, extracting all structures across a technology area in a session rather than over weeks.
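```typescript
import OpenAI from 'openai';

interface BoundingBox { x: number; y: number; width: number; height: number; }

// Assumed client setup (not shown in the original); supply the API key and
// runtime options appropriate to your environment.
const openai = new OpenAI();

const detectStructures = async (pageImageBase64: string): Promise<BoundingBox[]> => {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: [{
        // The rendered page, passed inline as a PNG data URL.
        type: 'image_url',
        image_url: { url: `data:image/png;base64,${pageImageBase64}` }
      }, {
        type: 'text',
        text: 'Identify all chemical structure diagrams. Return bounding boxes as [{x, y, width, height}] in pixel coordinates.'
      }]
    }]
  });
  // The reply is expected to be a JSON array; a missing reply means no detections.
  return JSON.parse(response.choices[0].message.content ?? '[]');
};
```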
The detection function takes a base64-encoded page image and returns an array of bounding boxes. The prompt instructs the model to return pixel coordinates directly, which the editor overlays on the rendered page image without a coordinate transformation step. Switching between GPT-4o and Claude is a configuration choice: both models are supported, and the interface contract is the same bounding box array regardless of which model is active.
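Because the contract is just an array of boxes, one way to keep it stable across models is to validate the raw reply before it reaches the editor. The sketch below is illustrative only: `parseBoundingBoxes` is a hypothetical helper, and it assumes a reply may arrive wrapped in surrounding prose or a code fence rather than as bare JSON.

```typescript
interface BoundingBox { x: number; y: number; width: number; height: number; }

const isFiniteNumber = (v: unknown): v is number =>
  typeof v === 'number' && Number.isFinite(v);

// Hypothetical helper: turns a raw model reply into validated boxes,
// returning an empty array rather than throwing on malformed input.
const parseBoundingBoxes = (raw: string | null | undefined): BoundingBox[] => {
  if (!raw) return [];

  // Assumption: the JSON array may be surrounded by other text,
  // so keep only the outermost [...] span.
  const start = raw.indexOf('[');
  const end = raw.lastIndexOf(']');
  if (start === -1 || end <= start) return [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];

  const boxes: BoundingBox[] = [];
  for (const item of parsed) {
    if (typeof item !== 'object' || item === null) continue;
    const { x, y, width, height } = item as Record<string, unknown>;
    // Keep only entries with four finite numbers and a positive area.
    if (
      isFiniteNumber(x) && isFiniteNumber(y) &&
      isFiniteNumber(width) && isFiniteNumber(height) &&
      width > 0 && height > 0
    ) {
      boxes.push({ x, y, width, height });
    }
  }
  return boxes;
};
```

A detection function for either model could pass its reply text through this helper in place of a bare `JSON.parse`, so a malformed reply shows up in the editor as zero detections instead of a thrown exception.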