You Can't Solve Regulatory Writing by Throwing a Frontier Model at the Problem
Why GPT, Gemini, and Claude alone will never be accurate enough for FDA submissions — and how a purpose-built stack of document parsers, domain-specific SLMs, and FDA context graphs changes the equation entirely.
01 We Tried Feeding CSRs Into GPT and Gemini. It Doesn't Work.
Frontier models are trained on the entire internet. For FDA submissions, that's not a feature — it's the root cause of every hallucination.
When frontier models like GPT-4, Gemini, or Claude draft regulatory documents, they draw on a "world model" — a compressed representation of everything from Wikipedia articles to Reddit threads to medical journals. This makes them remarkably versatile, but it also makes them fundamentally unreliable for high-stakes, domain-specific work. A model that "knows" everything knows nothing well enough for an FDA submission.
The core issue isn't intelligence — it's scope. Regulatory writing demands that every sentence be grounded in specific evidence from specific source documents. A frontier model, by contrast, is statistically inclined to blend what it "remembers" from pre-training with what you've provided in context. The result is text that reads convincingly but conflates, fabricates, or subtly distorts — the very definition of hallucination.
Peer-reviewed research, including studies published in Nature, has demonstrated that even LLMs built explicitly for medical purposes remain vulnerable to domain-specific hallucinations, and that these often arise from reasoning failures rather than mere knowledge gaps. In pharmacovigilance — a domain directly adjacent to regulatory writing — hallucinations are especially consequential because inaccuracies can directly affect patient safety.
Pumping source documents through a frontier model and hoping for the best is not a strategy. It is a liability. Every hallucinated claim that reaches an FDA reviewer is a potential Complete Response Letter — and each CRL costs months and millions.
02 "Just Add RAG" Is Not a Pharma Strategy — We Learned That the Hard Way
Everyone's first instinct is to bolt RAG onto a frontier model. We tried it. The retrieval quality on real CSRs and protocols was unacceptable.
The standard RAG approach — chunk documents, embed them into vectors, retrieve relevant chunks, feed them to an LLM — sounds reasonable in theory. In practice, it breaks down in several critical ways when applied to pharmaceutical regulatory documents.
Chunking Destroys Document Structure
Regulatory documents have deep hierarchical structure — sections, subsections, cross-references, tables within narrative context. Standard chunking shatters these relationships. A safety finding in Section 12.2 that references an endpoint defined in Section 6.1 becomes two disconnected text fragments.
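To make the failure concrete, here is a minimal Python sketch of the fixed-size chunking a standard RAG pipeline performs. The CSR snippet, section numbers, and chunk size are invented for illustration; no real pipeline is being quoted.

```python
# Minimal sketch: naive fixed-size chunking, as in a standard RAG
# pipeline. The CSR text and section numbers are invented examples.
csr_text = (
    "6.1 Endpoints. The primary safety endpoint is the incidence of "
    "treatment-emergent adverse events (TEAEs) through Week 12. [...] "
    "12.2 Safety Results. TEAEs, as defined in Section 6.1, occurred "
    "in 12.3% of subjects."
)

def chunk(text: str, size: int = 16) -> list[str]:
    """Split on whitespace into fixed-size token windows."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

for i, c in enumerate(chunk(csr_text)):
    print(f"chunk {i}: {c}")

# The finding in 12.2 and the definition in 6.1 land in different
# chunks. A similarity search for the safety finding retrieves a
# fragment whose key term is defined somewhere it cannot see, and
# page/section provenance was never captured in the first place.
```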
Whole-Document Retrieval Causes Noise
When RAG retrieves from whole documents rather than semantically precise segments, the LLM gets "distracted" by irrelevant content. Studies confirm this: models degrade when answers are buried within long retrieved passages, often ignoring retrieved content entirely in favor of parametric memory.
Citations Become Guesswork
When the retrieval unit is a 512-token chunk, the model can at best point you to a vague region of a document. Exact page numbers, section references, and verbatim excerpts — the provenance that FDA reviewers need — are lost at the chunking stage.
Complex Queries Need Multi-Step Reasoning
FDA submissions require synthesis across multiple documents and sections — "What adverse events in the CSR contradict the safety narrative in Module 2?" Basic RAG cannot handle this multi-hop retrieval; it retrieves surface-level matches, not reasoned connections.
Peer-reviewed evaluations confirm these limitations. In pharmaceutical applications specifically, RAG performance was found to be "mixed and ultimately insufficient despite two-stage filtering approaches," with models sporadically generating confabulated descriptions not present in the source context. The RAG paradigm itself has inherent limitations: weaknesses in any of its components (chunking, embedding, retrieval, generation) contribute directly to hallucination.
03 So We Built the Stack That Should Exist Between Your Documents and the LLM
Asthra doesn't replace frontier models. It builds every layer of intelligence they're missing — the parsing, the domain understanding, the structured retrieval — so they can actually be trusted for regulatory work.
The insight behind Asthra's architecture is simple but non-obvious: the quality of a regulatory draft depends almost entirely on what happens before the language model generates a single word. The parsing, structuring, retrieval, and scoping of source data is where the real technical challenge lies — and where most tools take fatal shortcuts.
Asthra's stack has four distinct layers, each purpose-built for the pharmaceutical regulatory domain.
Our VLM Parser Reads a CSR the Way a Senior Writer Does — Layout, Tables, and All
Pharmaceutical documents are not just text. They are complex visual artifacts: multi-column layouts, nested tables within tables, figures with captions that reference distant sections, headers that encode hierarchical structure, footnotes that carry critical caveats. Traditional OCR or text extraction treats these as flat character streams and loses the structural intelligence that makes a document mean something.
Asthra's Vision-Language Models (VLMs) interpret documents holistically — analyzing visual layout, textual content, and semantic relationships in a single processing step. Rather than chunking a 200-page Clinical Study Report into blind 512-token fragments, the VLM produces a structured document tree where every node retains its exact position in the document hierarchy, its page number, and its relationship to surrounding elements.
This is what enables Asthra's multi-level citations downstream. Because the parser preserves exact provenance at the extraction stage, every claim generated later can point back to a specific document, page, and verbatim excerpt — not an approximate chunk.
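As a sketch of what such a structure-preserving parse could look like, here is a hypothetical document-tree node in Python. The class name, fields, and the Table 14.3.1 example are illustrative, not Asthra's actual schema; the point is that page and hierarchy travel with every node.

```python
# Hypothetical document-tree node (illustrative, not Asthra's schema):
# every element keeps its hierarchy, page, and verbatim text, so any
# downstream claim can cite document -> page -> exact excerpt.
from dataclasses import dataclass, field

@dataclass
class DocNode:
    kind: str                      # "section", "table", "footnote", ...
    label: str                     # e.g. "Table 14.3.1 (Adverse Events)"
    page: int                      # exact page in the source document
    text: str = ""                 # verbatim content for citations
    children: list["DocNode"] = field(default_factory=list)

    def citation(self) -> str:
        """Multi-level citation: unit, page, verbatim excerpt."""
        return f'{self.label}, p. {self.page}: "{self.text}"'

table = DocNode("table", "Table 14.3.1 (Adverse Events)", page=87,
                text="TEAEs: 12.3% (n=47/382)")
section = DocNode("section", "12.2 Safety Results", page=85,
                  children=[table])
print(table.citation())
# Table 14.3.1 (Adverse Events), p. 87: "TEAEs: 12.3% (n=47/382)"
```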
Most AI tools treat document parsing as a solved problem and use off-the-shelf extractors. In reality, it is the single most important technical differentiator. A mature VLM parsing system is the foundation everything else depends on.
A Model That Only Knows Pharma Filings Beats One That Knows Everything
If a frontier model is a world model — trained on everything — then Asthra's Small Language Models (SLMs) are domain models, purpose-trained to understand the specific structures, conventions, and boundary conditions of pharmaceutical regulatory filings.
These SLMs encode deep knowledge of what makes a regulatory document work: the CTD module hierarchy, ICH E3 and E6 conventions, cross-section consistency requirements, and the boundary conditions that FDA reviewers actually check.
Research consistently shows that specialized models outperform general-purpose models in domain-specific tasks. A specialized system like DrugGPT consistently beats GPT-4 on pharmaceutical queries, not because it's a "bigger" model, but because its parameters are allocated entirely to the domain that matters. Asthra's SLMs apply this principle to the specific task of understanding how pharma documents are structured and how regulatory agencies interpret them.
The SLMs don't generate the final submission text — they understand the source material at a level that no general-purpose model can match, creating the semantic foundation for precise retrieval downstream.
We Don't Search Your Documents — We Build a Knowledge Graph Across Them
This is where Asthra's architecture diverges most fundamentally from conventional approaches. Instead of a flat vector store where chunks float without relationships, Asthra constructs an FDA Context Graph — a structured knowledge representation where every piece of evidence is connected to its source, its regulatory context, and its relationships to other evidence in the submission package.
The difference is decisive. When the generation layer needs to draft a safety summary, it doesn't perform a vague similarity search across thousands of text fragments. Instead, it traverses the graph to find exactly the relevant adverse event data, the statistical analyses that contextualize it, the protocol sections that define how it was collected, and the regulatory requirements that dictate how it must be presented — all with full provenance intact.
Think of it this way: basic RAG is like searching a library by finding books whose covers mention your keyword. The FDA Context Graph is like having a librarian who has read every book, understands how they relate to each other, and can pull the exact paragraph you need with a full citation chain.
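A toy sketch of the difference, with an invented graph and invented edge names (Asthra's actual representation is certainly richer): retrieval here is an explicit traversal of typed relationships, so every hop is inspectable.

```python
# Toy context graph (node IDs and edge types invented for illustration):
# evidence is retrieved by traversing typed edges, not by similarity.
graph: dict[str, dict[str, list[str]]] = {
    "csr:table_14.3.1_AEs": {
        "collected_per": ["protocol:sec_9.2_sae_definitions"],
        "analyzed_in":   ["sap:sec_5.1_safety_analysis"],
    },
    "protocol:sec_9.2_sae_definitions": {
        "consistent_with": ["ib:sec_6_safety_profile"],
    },
}

def scope_evidence(start: str, hops: int = 2) -> set[str]:
    """Collect every evidence node reachable within `hops` edges."""
    found = {start}
    if hops == 0:
        return found
    for targets in graph.get(start, {}).values():
        for t in targets:
            found |= scope_evidence(t, hops - 1)
    return found

# Drafting a safety summary pulls exactly the linked evidence,
# CSR -> Protocol -> IB, with the traversal path as the audit trail.
print(sorted(scope_evidence("csr:table_14.3.1_AEs")))
```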
Graph-based retrieval has been shown to deliver significantly better handling of multi-hop queries, richer contextual background, and — crucially for regulated industries — explainable reasoning paths. Microsoft's GraphRAG research confirmed these advantages in controlled testing. Asthra takes this approach further by building the graph with FDA-specific domain intelligence from Layer 2, creating what we call Asthra FDA RAG++ — a retrieval system that is simultaneously more precise and more traceable than anything achievable with generic vector retrieval.
The LLM Only Touches Pre-Scoped, Pre-Cited Evidence — That's Why It Doesn't Hallucinate
Only at this final stage does a frontier language model enter the picture. And the difference is transformative: instead of processing hundreds of pages of raw documents, the LLM receives precisely scoped, pre-structured context from the FDA Context Graph — complete with explicit source attribution and regulatory framing.
This is why Asthra can use the power of frontier models (eloquent generation, nuanced synthesis, format adherence) without inheriting their weaknesses (hallucination, distraction, source confusion). The model isn't being asked to understand your regulatory documents — that work has already been done by Layers 1 through 3. It's being asked to draft from a curated, cited evidence package. The scope of the generation task is narrowed to the point where hallucination risk drops dramatically.
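A minimal sketch of what that generation contract could look like, assuming a simple evidence-package format (the field names and instruction wording are hypothetical, not Asthra's actual prompt): the model is handed only cited evidence and an explicit rule to draft nothing beyond it.

```python
# Minimal sketch of a scoped generation prompt (format hypothetical):
# the LLM sees only pre-cited evidence, never raw source documents.
evidence_package = [
    {"fact": "TEAEs occurred in 12.3% of subjects (n=47/382)",
     "source": "CSR, Table 14.3.1 (Adverse Events), p. 87"},
    {"fact": "SAEs defined per protocol criteria",
     "source": "Protocol, Section 9.2, p. 41"},
]

def build_prompt(section: str, evidence: list[dict]) -> str:
    lines = [
        f"Draft the {section} using ONLY the evidence items below.",
        "Attach the given citation to every claim you make.",
        "If no evidence item supports a statement, do not write it.",
        "",
    ]
    for i, item in enumerate(evidence, start=1):
        lines.append(f"[E{i}] {item['fact']}  (source: {item['source']})")
    return "\n".join(lines)

print(build_prompt("Module 2.7.4 safety summary", evidence_package))
```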
04 No Amount of Prompt Engineering Gets You Here — We Tried That Too
We spent our first year learning what shortcuts don't work. Every layer in our stack exists because the obvious approach failed on real pharma documents.
| The Real Problem | What Happens With Frontier Models | What Asthra Does Differently |
|---|---|---|
| A CSR has nested tables inside narrative sections, with footnotes that change the meaning of the data above | ✗ PDF-to-text extraction flattens the table, strips the footnote, and feeds the LLM a meaningless character stream | ✓ VLMs process visual layout + text together — the table structure, its footnotes, and the surrounding narrative are preserved as a single semantic unit |
| An ICH E3-compliant CSR has 16+ sections with strict inter-section dependencies (e.g., Section 11 must reconcile with Section 12.2) | ✗ A world model has no concept of eCTD structure — it'll write Section 11 without checking what Section 12.2 says | ✓ SLMs encode CTD hierarchy, ICH E3/E6 rules, and cross-section consistency requirements as domain knowledge |
| Drafting a Module 2.7.4 safety summary requires pulling adverse event data from the CSR, cross-checking it against the protocol's SAE definitions, and aligning with the IB | ◐ Vector search returns the 10 most "similar" chunks — which may be from the wrong document or the wrong section of the right document | ✓ FDA Context Graph traverses across CSR → Protocol → IB with relationship-aware retrieval, pulling exactly the linked evidence |
| An FDA reviewer needs to verify that "treatment-emergent AEs occurred in 12.3% of subjects" actually appears in the source TFLs | ✗ Best case: the model cites "the CSR." Reviewer must manually search 200+ pages to find the actual table | ✓ Three-level citation: Table 14.3.1 (Adverse Events) → Page 87 → Exact row showing "TEAEs: 12.3% (n=47/382)" |
| A 200-page CSR + 80-page protocol + 50-page SAP = 500K+ tokens of context needed for a single submission module | ✗ Feeding everything into the context window is where hallucination peaks — the model conflates data from unrelated sections and fabricates plausible-sounding connections | ✓ The graph scopes retrieval to only the specific nodes needed per draft section — the LLM never sees irrelevant data |
| Module 2.5 (Clinical Overview) must synthesize findings across multiple studies, reconciling different statistical approaches and patient populations | ✗ LLM has no mechanism to traverse across studies — it generates a plausible narrative that may conflate Study A's efficacy data with Study B's safety population | ✓ Graph encodes per-study boundaries — cross-study synthesis follows explicit relationship edges, never implicit model "reasoning" |
| The FDA issued 200+ CRLs between 2020 and 2024, often citing deficiencies that were predictable from the submission structure itself | ✗ No knowledge of historical FDA feedback patterns — the model has no way to flag "the FDA always asks about X in this submission type" | ✓ Mock FDA audit powered by 100+ approved submissions — flags likely reviewer questions before you file |
| Every new IND or NDA project requires re-configuring templates, re-mapping source documents, and re-training the tool on your data | ◐ Days of prompt engineering and template configuration per project — even for submission types the team has done before | ✓ Zero config — select submission type in the Word add-in, connect your sources, Asthra's domain models already know the structure |
The critical point is this: these layers are not independent improvements — they are compounding. The VLM parser produces structured trees that the SLMs can reason over. The SLMs produce domain-enriched representations that the Context Graph can organize. The Context Graph produces precisely scoped evidence packages that the frontier LLM can draft from reliably. Remove any layer and the entire system degrades. This is why bolting a better prompt onto an off-the-shelf model will never match a purpose-built stack.
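The compounding is easiest to see as a pipeline. The sketch below uses throwaway stubs (every function name and return shape is invented) purely to show how each layer's output becomes the next layer's input.

```python
# Shape of the four-layer pipeline (all names and return values are
# invented stubs, for illustration only): each layer consumes the
# structured output of the layer before it.
def vlm_parse(pdf: str) -> dict:                    # Layer 1: document tree
    return {"tree": pdf, "provenance": True}

def slm_annotate(trees: list[dict]) -> list[dict]:  # Layer 2: domain labels
    return [{**t, "domain_labels": ["ICH-E3"]} for t in trees]

def build_context_graph(nodes: list[dict]) -> list[dict]:  # Layer 3
    return nodes  # in reality: typed nodes and edges, per the sketch above

def scoped_draft(section: str, evidence: list[dict]) -> str:  # Layer 4
    return f"Draft of {section} from {len(evidence)} cited evidence nodes"

def draft_section(source_pdfs: list[str], section: str) -> str:
    trees = [vlm_parse(p) for p in source_pdfs]
    graph = build_context_graph(slm_annotate(trees))
    evidence = [n for n in graph if n["provenance"]]  # scope to cited nodes
    return scoped_draft(section, evidence)

print(draft_section(["csr.pdf", "protocol.pdf", "sap.pdf"], "Module 2.7.4"))
```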
05 Your Team Ships Submissions Faster, With Citations They Can Actually Defend
All of this engineering exists for one reason: so your regulatory writers spend time on scientific judgment, not fighting tools or verifying AI claims.
If You're a Biotech, Each Approval Is Your Company's Inflection Point
Each FDA approval can increase your valuation by 1.5–2x. You probably have one shot, limited headcount, and a timeline that doesn't forgive rework. You need a first draft you can actually trust from day one — no weeks of template configuration, no manual source labeling. And with our mock FDA audit built on 100+ real approved submissions, you're not submitting blind and hoping for the best. You know what the reviewer is likely to flag before they flag it.
If You're Big Pharma, Every Day in Review Is $500K to $1.4M Walking Out the Door
The math is simple: a draft with built-in, page-level citations clears internal review cycles faster than one your QA team has to manually verify claim by claim against 200-page source documents. Asthra's zero-configuration design means your teams don't re-do template setup for every new IND or NDA — the domain models already know the filing structure. That's days saved on setup alone, on every project, across every team.
The FDA Is Already Reviewing Submissions With AI — Your Documents Should Be Ready for That
The FDA deployed its own generative AI model internally in 2025 and rolled out agentic AI capabilities across the agency by December. Their tools summarize adverse events, compare product labels, and flag inconsistencies across submissions. Documents that are well-structured, precisely cited, and machine-readable will have a natural advantage as these AI-assisted review processes mature. Asthra's architecture produces exactly this kind of document — not by coincidence, but because we built the stack specifically for how regulatory documents need to be consumed, by humans and machines alike.
The tools that will define the next era of regulatory writing won't be the ones with the biggest language model. They'll be the ones with the most intelligent layer between the source documents and the language model. That layer is where accuracy, traceability, and trust are built — or lost.
06 Don't Pick Your Regulatory AI by the LLM on the Label
Frontier models are a necessary ingredient. They're about 10% of what it takes to actually get this right.
The pharmaceutical industry is at a crossroads. AI-powered regulatory writing tools are proliferating, and the temptation to adopt the one with the most impressive demo or the most recognizable LLM brand is real. But demos don't survive contact with a 200-page Clinical Study Report, and brand recognition doesn't prevent hallucination.
What matters is what happens between your source documents and the generated draft. Do your documents get shredded into context-free chunks, or parsed into structured trees with full provenance? Does your retrieval system perform blind similarity matching, or traverse a purpose-built knowledge graph that understands FDA filing conventions? Does your agent look at everything at once (maximizing distraction and hallucination), or scope down to precisely the evidence needed for each section?
Asthra's stack — VLM-powered parsing, FDA-domain SLMs, structured context graphs, and scoped frontier-model generation — represents a fundamentally different answer to these questions. It is not a wrapper around an API. It is a purpose-built regulatory intelligence system that turns frontier AI from a liability into an advantage.
The best regulatory AI isn't the biggest model. It's the most precisely scoped one.