Accelerating Research with RAG at Scale for AstraZeneca

Client: AstraZeneca
Solution: Retrieval-Augmented Generation
Service: AI + Machine Learning

In heavily regulated, research-driven industries like pharmaceuticals, institutional knowledge is a premier competitive advantage. While traditional enterprise data strategies focus on structuring transaction logs and data tables, the true value often remains trapped inside unstructured documents—operational logs, compliance records, and dense scientific studies generated through daily research.

To unlock this latent potential, Datatonic partnered with AstraZeneca to build an enterprise-grade, high-fidelity Retrieval-Augmented Generation (RAG) pipeline. The team engineered a robust, natural language assistant capable of parsing, indexing, and querying a massive corporate knowledge base. This solution enables data professionals and scientists to fetch instant, comprehensive, and grounded answers directly from complex private data sources.

 

The Challenge

The primary obstacle was the sheer variety and density of AstraZeneca’s scientific datasets. AstraZeneca holds thousands of long scientific documents within a pharmaceutical sub-domain, and new files are created daily.

These files are completely non-templated and highly diverse—spanning documents, legacy PDFs, extensive spreadsheets, HTML, high-resolution imagery, and handwritten notes—with individual files frequently scaling to hundreds of megabytes and over 1,000 pages.

Standard, out-of-the-box RAG pipelines fail when confronted with complex scientific characteristics:

  • Complex Technical Sentences: Pharmaceutical documentation contains multi-clause, highly technical sentences that lose their meaning without extensive surrounding context.
  • Interconnected Concepts: Scientific ideas and hypotheses build upon one another sequentially across consecutive paragraphs, making strict character cuts ineffective.
  • Data Interpretation Patterns: Tables, charts, and chemical stability readouts frequently span multiple pages, requiring an unbroken structural link to remain accurate.
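
To illustrate why strict character cuts struggle with this kind of text, the sketch below compares a fixed-size splitter with a simple sentence-aware splitter that carries one sentence of overlap between chunks. It is a minimal illustration, not the chunking strategy used in the project; the function names, the regex-based sentence split, and the sample text are all assumptions made for the example.

```python
import re


def fixed_chunks(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, regardless of sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str, max_chars: int, overlap: int = 1) -> list[str]:
    """Group whole sentences into chunks, repeating the last `overlap`
    sentences at the start of the next chunk to preserve context.

    `max_chars` is a soft limit: a sentence is never split, so a chunk can
    exceed it when the carried-over sentences plus the next one are long.
    """
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption;
    # real scientific text needs a more careful splitter).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "The compound was stable at 25 C. "
    "Degradation began after 6 months. "
    "This suggests a shelf-life limit."
)
for piece in fixed_chunks(sample, 40):
    print("fixed   :", repr(piece))
for piece in sentence_chunks(sample, 70):
    print("sentence:", repr(piece))
```

The fixed-size splitter breaks a sentence mid-word, while the sentence-aware version keeps each sentence whole and repeats the previous one, so a conclusion stays attached to the observation it depends on.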

 

Furthermore, regional restrictions capped automated data source synchronization at 1,000 documents, ruling out a straightforward managed import strategy.

 

The Solution

To build a system capable of sustaining future data volume fluctuations without requiring heavy manual maintenance, Datatonic designed a modular, fully serverless architecture.

The end-to-end framework decouples document pulling, pre-processing, metadata engineering, and Amazon Bedrock ingestion into independent, non-blocking components, following the DRY (Don’t Repeat Yourself) principle.

 

  1. File Standardization and Metadata Enrichment

To establish a one-size-fits-all parsing solution, Datatonic implemented an automated pre-processing step that standardizes all incoming multi-format files into high-portability, parser-friendly PDFs. Powered by a Dockerized command-line application, the pipeline routes incoming files through Python script routines:

  • Images (.jpg, .jpeg, .png): Processed using the Pillow library.
  • Office & Text Docs (.docx, .xlsx, .txt, .csv, .html): Converted using headless LibreOffice.
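
The routing described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names are hypothetical, passing existing PDFs through unchanged is an assumption, and `soffice` is assumed to be the LibreOffice binary available on the container's PATH. Pillow is imported lazily so the routing logic can be exercised without it installed.

```python
from pathlib import Path
import subprocess

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
OFFICE_EXTS = {".docx", ".xlsx", ".txt", ".csv", ".html"}


def pick_converter(path: Path) -> str:
    """Return the conversion route for a file, based on its extension."""
    ext = path.suffix.lower()
    if ext == ".pdf":
        return "passthrough"  # assumption: existing PDFs need no conversion
    if ext in IMAGE_EXTS:
        return "pillow"
    if ext in OFFICE_EXTS:
        return "libreoffice"
    raise ValueError(f"unsupported file type: {ext or path.name}")


def libreoffice_command(src: Path, out_dir: Path) -> list[str]:
    """Build the headless LibreOffice command that converts `src` to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(src)]


def convert_to_pdf(src: Path, out_dir: Path) -> Path:
    """Convert one file to PDF and return the path of the resulting PDF."""
    route = pick_converter(src)
    if route == "passthrough":
        return src
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / (src.stem + ".pdf")
    if route == "pillow":
        from PIL import Image  # lazy import: only needed for image inputs
        with Image.open(src) as img:
            # PDF output needs a mode without an alpha channel, so use RGB.
            img.convert("RGB").save(target, "PDF")
    else:
        subprocess.run(libreoffice_command(src, out_dir), check=True)
    return target
```

Keeping the routing decision in a pure function separate from the conversion side effects makes it easy to unit-test which files go where before any converter is invoked.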

 

  2. Injecting Broader Context via Transformation Lambdas

To eliminate retrieval ambiguity across highly similar documents, Datatonic introduced an architectural modification to the synchronization cycle. During the ingestion sequence, the framework triggers an optional Transformation Lambda between the semantic chunking and embedding phases.
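
As a rough sketch of what such a transformation step might do, the example below prepends document-level context to every chunk before it would be embedded. The event shape (`title`, `summary`, `chunks`) is deliberately simplified and hypothetical; it is not the contract Amazon Bedrock uses for transformation functions, and the header format is an assumption chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int


def inject_context(chunks: list[Chunk], title: str, summary: str) -> list[Chunk]:
    """Return new chunks whose text starts with a document-level header."""
    header = f"Document: {title}\nSummary: {summary}\n---\n"
    return [Chunk(text=header + c.text, page=c.page) for c in chunks]


def handler(event: dict, context: object = None) -> dict:
    """Lambda-style entry point over a simplified, hypothetical event shape."""
    chunks = [Chunk(**c) for c in event["chunks"]]
    enriched = inject_context(chunks, event["title"], event["summary"])
    return {"chunks": [{"text": c.text, "page": c.page} for c in enriched]}


example_event = {
    "title": "Stability study A",  # made-up sample values
    "summary": "Long-term stability results for one formulation.",
    "chunks": [{"text": "Assay remained within specification.", "page": 12}],
}
print(handler(example_event)["chunks"][0]["text"])
```

With the header attached, two near-identical passages from different studies produce different embeddings, which is the kind of disambiguation the step is meant to provide.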

 

  3. Custom Multi-Threaded Batch Ingestion

To bypass the 1,000-document data source limitation tied to automated synchronization, Datatonic developed a custom batch sync engine. A central relational table acts as a synchronization backlog, tracking newly modified, converted, or deleted documents.
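
The backlog pattern can be sketched as follows, assuming an in-memory SQLite table in place of the real relational store and a placeholder `ingest_batch` function in place of the real ingestion call. The table name, column names, batch size, and worker count are all hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_backlog() -> sqlite3.Connection:
    """Create an in-memory backlog table (stand-in for the relational store)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sync_backlog ("
        "doc_id TEXT PRIMARY KEY, "
        "action TEXT NOT NULL, "  # e.g. 'modified', 'converted', 'deleted'
        "status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def pending_batches(conn: sqlite3.Connection, batch_size: int) -> list[list[tuple]]:
    """Read all pending rows and split them into fixed-size batches."""
    rows = conn.execute(
        "SELECT doc_id, action FROM sync_backlog "
        "WHERE status = 'pending' ORDER BY doc_id"
    ).fetchall()
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def ingest_batch(batch: list[tuple]) -> list[str]:
    """Hypothetical placeholder for the real ingestion call.

    Returns the ids of the documents it handled.
    """
    return [doc_id for doc_id, _action in batch]


def run_sync(conn: sqlite3.Connection, batch_size: int = 2, workers: int = 4) -> int:
    """Ingest pending batches in parallel, then mark them done."""
    done: list[str] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for handled in pool.map(ingest_batch, pending_batches(conn, batch_size)):
            done.extend(handled)
    # The database is only touched from the calling thread, so the SQLite
    # connection is never shared across worker threads.
    conn.executemany(
        "UPDATE sync_backlog SET status = 'done' WHERE doc_id = ?",
        [(doc_id,) for doc_id in done],
    )
    conn.commit()
    return len(done)
```

Because progress is recorded in the table rather than in memory, an interrupted run can resume from whatever is still marked pending.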


 

Our impact

The RAG implementation delivered an accurate, scalable platform that unlocked AstraZeneca’s unstructured pharmaceutical knowledge repository while staying within managed cloud infrastructure.

  • 90% Reduction in Ingestion Complexity: Standardizing data inputs directly into PDFs eliminated the need to develop, scale, and maintain dozens of brittle, format-specific parsing microservices.
  • Robust, High-Throughput Scaling: The decoupled, modular pipeline successfully ingested the initial foundation of 20,000+ documents (processing more than 1.5 billion input tokens) and natively absorbs daily incremental updates without pipeline blocks or manual oversight.
  • Tiered Cost Optimization: The platform runs on a serverless stack and tiers its foundation models: premium multi-modal parsers are reserved for complex visual layouts, while ultra-low-cost models handle metadata embeddings and summary injection. This balances high data volumes against a predictable, cost-efficient operational footprint.

Read the full article on Medium.
