Client: AstraZeneca
Solution: Retrieval-Augmented Generation
Service: AI + Machine Learning

In heavily regulated, research-driven industries like pharmaceuticals, institutional knowledge is a premier competitive advantage. While traditional enterprise data strategies focus on structuring transaction logs and data tables, the true value often remains trapped inside unstructured documents—operational logs, compliance records, and dense scientific studies generated through daily research.
To unlock this latent potential, Datatonic partnered with AstraZeneca to build an enterprise-grade, high-fidelity Retrieval-Augmented Generation (RAG) pipeline. The team engineered a robust, natural language assistant capable of parsing, indexing, and querying a massive corporate knowledge base. This solution enables data professionals and scientists to fetch instant, comprehensive, and grounded answers directly from complex private data sources.
The primary obstacle centered on the sheer variety and dense complexity of AstraZeneca’s scientific datasets. AstraZeneca has thousands of long scientific documents within a pharmaceutical sub-domain, with an ongoing stream of new files being created daily.
These files are completely non-templated and highly diverse—spanning documents, legacy PDFs, extensive spreadsheets, HTML, high-resolution imagery, and handwritten notes—with individual files frequently scaling to hundreds of megabytes and over 1,000 pages.
Standard, out-of-the-box RAG pipelines fail when confronted with these complex scientific characteristics.
Furthermore, regional boundaries restricted automated data source synchronization to a maximum of 1,000 documents, blocking a straightforward managed import strategy.
To build a system capable of sustaining future data volume fluctuations without requiring heavy manual maintenance, Datatonic designed a modular, fully serverless architecture.
The end-to-end framework decouples document pulling, pre-processing, metadata engineering, and Amazon Bedrock ingestion into independent, non-blocking components, following the DRY (Don’t Repeat Yourself) principle.
To give every file a single parsing path, Datatonic implemented an automated pre-processing step that converts all incoming multi-format files into portable, parser-friendly PDFs. A Dockerized command-line application drives the pipeline and routes each incoming file through Python script routines.
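The routing idea in this pre-processing step can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not Datatonic's implementation: the extension table, the routine names (`office_to_pdf`, `html_to_pdf`, `image_to_pdf`, `passthrough`), and the helper functions are all hypothetical, and the actual conversion tools are not named in the case study.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the name of a conversion routine.
# The real pipeline's routines are not described in the case study.
ROUTES = {
    ".pdf": "passthrough",
    ".docx": "office_to_pdf",
    ".xlsx": "office_to_pdf",
    ".html": "html_to_pdf",
    ".png": "image_to_pdf",
    ".jpg": "image_to_pdf",
}


def pick_route(key: str) -> str:
    """Return the conversion routine name for an object key, by extension."""
    suffix = PurePosixPath(key).suffix.lower()
    return ROUTES.get(suffix, "unsupported")


def target_key(key: str) -> str:
    """Return the key of the standardized PDF produced for an input key."""
    return str(PurePosixPath(key).with_suffix(".pdf"))
```

Keeping the routing table as plain data means a new file format only needs a new entry and a matching routine, which fits the modular design described earlier.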
To eliminate retrieval ambiguity across highly similar documents, Datatonic introduced an architectural modification to the synchronization cycle. During the ingestion sequence, the framework triggers an optional Transformation Lambda between the semantic chunking and embedding phases.
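The case study does not say what the transformation step does to each chunk. One plausible use, assumed here purely for illustration, is to prefix every chunk with document-level identifiers so that near-identical passages from different documents embed differently. The sketch below shows only that enrichment logic in Python; the function names and metadata fields are hypothetical, and the event and response contract that Amazon Bedrock expects from such a Lambda is deliberately not reproduced.

```python
from typing import Dict, List


def enrich_chunk(text: str, metadata: Dict[str, str]) -> str:
    """Prefix a chunk with a one-line header built from document metadata.

    Keys are sorted so the header is deterministic for a given document.
    """
    header = " | ".join(f"{key}: {value}" for key, value in sorted(metadata.items()))
    return f"[{header}]\n{text}" if header else text


def transform_chunks(chunks: List[str], metadata: Dict[str, str]) -> List[str]:
    """Apply the same document-level header to every chunk of one document."""
    return [enrich_chunk(chunk, metadata) for chunk in chunks]
```

For example, two studies that both contain the sentence "Dose was 5 mg." would produce distinct chunk texts once each carries its own hypothetical `study_id` header.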
To bypass the 1,000-document data source limitation tied to automated synchronization, Datatonic developed a custom batch sync engine. A central relational table acts as a synchronization backlog, tracking newly modified, converted, or deleted documents.
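A backlog-driven batch sync of this kind can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `sqlite3` stands in for the relational database, the `sync_backlog` table and its columns are hypothetical, and `sync_fn` is a placeholder for whatever call actually triggers ingestion of a batch.

```python
import sqlite3
from typing import Callable, List

BATCH_LIMIT = 1000  # per-sync document cap mentioned in the text


def init_backlog(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) backlog table if it does not exist."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS sync_backlog (
               doc_key TEXT PRIMARY KEY,
               action  TEXT NOT NULL,              -- e.g. 'upsert' or 'delete'
               synced  INTEGER NOT NULL DEFAULT 0
           )"""
    )


def record_change(conn: sqlite3.Connection, doc_key: str, action: str) -> None:
    """Register a new, converted, or deleted document as pending."""
    conn.execute(
        "INSERT OR REPLACE INTO sync_backlog (doc_key, action, synced) VALUES (?, ?, 0)",
        (doc_key, action),
    )


def next_batch(conn: sqlite3.Connection, limit: int = BATCH_LIMIT) -> List[str]:
    """Return up to `limit` pending document keys."""
    rows = conn.execute(
        "SELECT doc_key FROM sync_backlog WHERE synced = 0 ORDER BY doc_key LIMIT ?",
        (limit,),
    ).fetchall()
    return [row[0] for row in rows]


def mark_synced(conn: sqlite3.Connection, doc_keys: List[str]) -> None:
    """Flag a batch as processed so it is not picked up again."""
    conn.executemany(
        "UPDATE sync_backlog SET synced = 1 WHERE doc_key = ?",
        [(key,) for key in doc_keys],
    )


def drain(conn: sqlite3.Connection, sync_fn: Callable[[List[str]], None],
          limit: int = BATCH_LIMIT) -> int:
    """Sync all pending documents in batches; return the number of batches."""
    batches = 0
    while True:
        batch = next_batch(conn, limit)
        if not batch:
            return batches
        sync_fn(batch)  # placeholder for the real ingestion trigger
        mark_synced(conn, batch)
        batches += 1
```

Because each batch is marked as synced only after `sync_fn` returns, a failure partway through leaves the remaining documents in the backlog for the next run.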
The RAG implementation delivered an accurate, scalable platform that unlocked AstraZeneca’s unstructured pharmaceutical knowledge repository while staying entirely within managed cloud infrastructure.
Read the full article on Medium.