Accelerating Research with RAG at Scale for AstraZeneca

Client: AstraZeneca

Solution: Retrieval-Augmented Generation

Service: AI + Machine Learning

In heavily regulated, research-driven industries like pharmaceuticals, institutional knowledge is a premier competitive advantage. While traditional enterprise data strategies focus on structuring transaction logs and data tables, the true value often remains trapped inside unstructured documents—operational logs, compliance records, and dense scientific studies generated through daily research.

To unlock this latent potential, Datatonic partnered with AstraZeneca to build an enterprise-grade, high-fidelity Retrieval-Augmented Generation (RAG) pipeline. The team engineered a robust, natural language assistant capable of parsing, indexing, and querying a massive corporate knowledge base. This solution enables data professionals and scientists to fetch instant, comprehensive, and grounded answers directly from complex private data sources.

The Challenge

The primary obstacle was the variety and density of AstraZeneca’s scientific datasets. AstraZeneca holds thousands of long scientific documents within a pharmaceutical sub-domain, and new files are created daily.

These files follow no template and are highly diverse: text documents, legacy PDFs, extensive spreadsheets, HTML, high-resolution imagery, and handwritten notes. Individual files frequently reach hundreds of megabytes and more than 1,000 pages.

Standard, out-of-the-box RAG pipelines tend to break down on three characteristics of this scientific content:

Complex Technical Sentences: Pharmaceutical documentation contains multi-clause, highly technical sentences that lose their meaning without extensive surrounding context.
Interconnected Concepts: Scientific ideas and hypotheses build upon one another sequentially across consecutive paragraphs, making strict character cuts ineffective.
Data Interpretation Patterns: Tables, charts, and chemical stability readouts frequently span multiple pages, requiring an unbroken structural link to remain accurate.

Furthermore, regional restrictions capped automated data source synchronization at 1,000 documents, which ruled out a straightforward managed import strategy.

The Solution

To build a system that can absorb future fluctuations in data volume without heavy manual maintenance, Datatonic designed a modular, fully serverless architecture.

The end-to-end framework decouples document pulling, pre-processing, metadata engineering, and Amazon Bedrock ingestion into independent, non-blocking components, following the DRY (Don’t Repeat Yourself) principle so that shared logic lives in one place.

File Standardization and Metadata Enrichment

To avoid maintaining a separate parser for every format, Datatonic implemented an automated pre-processing step that converts all incoming multi-format files into portable, parser-friendly PDFs. The step runs as a Dockerized command-line application that routes each incoming file to a Python conversion routine based on its type:

Images (.jpg, .jpeg, .png): Processed using the Pillow library.
Office & Text Docs (.docx, .xlsx, .txt, .csv, .html): Converted using headless LibreOffice.
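
The routing above can be sketched as follows. This is a minimal illustration rather than Datatonic's actual code: the function names, the pass-through handling of files that are already PDFs, and the use of the `soffice` binary name are assumptions. Pillow is imported lazily so that the routing logic can be read and exercised on its own.

```python
import subprocess
from pathlib import Path

# Extension groups taken from the list above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
OFFICE_EXTS = {".docx", ".xlsx", ".txt", ".csv", ".html"}


def pick_converter(path: Path) -> str:
    """Return the name of the conversion route for a file (hypothetical helper)."""
    ext = path.suffix.lower()
    if ext == ".pdf":
        return "passthrough"  # assumption: existing PDFs are copied unchanged
    if ext in IMAGE_EXTS:
        return "pillow"
    if ext in OFFICE_EXTS:
        return "libreoffice"
    raise ValueError(f"unsupported file type: {ext}")


def libreoffice_command(src: Path, out_dir: Path) -> list[str]:
    """Build the headless LibreOffice command that converts src to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(src)]


def convert_to_pdf(src: Path, out_dir: Path) -> Path:
    """Convert one file to PDF and return the path of the result."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / (src.stem + ".pdf")
    route = pick_converter(src)
    if route == "passthrough":
        target.write_bytes(src.read_bytes())
    elif route == "pillow":
        from PIL import Image  # lazy import: only needed for image inputs
        with Image.open(src) as img:
            # PDF output needs a mode without an alpha channel.
            img.convert("RGB").save(target, "PDF")
    else:
        subprocess.run(libreoffice_command(src, out_dir), check=True)
    return target
```

Keeping the route selection in a separate function from the conversion itself makes it possible to check how a new file type would be handled without having LibreOffice or Pillow installed.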

Injecting Broader Context via Transformation Lambdas

To eliminate retrieval ambiguity across highly similar documents, Datatonic introduced an architectural modification to the synchronization cycle. During the ingestion sequence, the framework triggers an optional Transformation Lambda between the semantic chunking and embedding phases.
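
As a rough sketch of the idea, the function below prepends document-level context to each chunk before it is embedded, so that near-identical passages from different documents stay distinguishable. The event shape (`title`, `summary`, `chunks`) and the header format are hypothetical and chosen for readability. A real transformation Lambda exchanges chunk files through S3 in the format defined by the service, which is deliberately left out here.

```python
def enrich_chunk(chunk_text: str, doc_title: str, doc_summary: str) -> str:
    """Prefix one chunk with context about the document it came from."""
    header = f"Document: {doc_title}\nSummary: {doc_summary}\n---\n"
    return header + chunk_text


def handler(event: dict, context=None) -> dict:
    """Simplified Lambda-style entry point (hypothetical event format).

    Expects {"title": str, "summary": str, "chunks": [{"text": str, ...}]}
    and returns the same chunks with enriched text. Input dicts are not mutated.
    """
    title = event["title"]
    summary = event["summary"]
    enriched = [
        {**chunk, "text": enrich_chunk(chunk["text"], title, summary)}
        for chunk in event["chunks"]
    ]
    return {"chunks": enriched}
```

Because the header is added after chunking, chunk boundaries are unaffected, while the text that gets embedded carries the broader document context.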

Custom Multi-Threaded Batch Ingestion

To bypass the 1,000-document data source limitation tied to automated synchronization, Datatonic developed a custom batch sync engine. A central relational table acts as a synchronization backlog, tracking newly modified, converted, or deleted documents.
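
The backlog pattern can be illustrated with a small sketch. Everything here is an assumption made for the example: the table schema, the `pending`/`done` status values, the batch size, and `ingest_batch`, which is a stand-in for whatever call submits documents for ingestion. SQLite is used only so that the example is self-contained.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_backlog(conn: sqlite3.Connection) -> None:
    """Create the backlog table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE backlog ("
        " doc_id TEXT PRIMARY KEY,"
        " action TEXT NOT NULL,"  # e.g. 'upsert' or 'delete'
        " status TEXT NOT NULL DEFAULT 'pending')"
    )


def pending_batches(conn: sqlite3.Connection, batch_size: int) -> list:
    """Split all pending rows into fixed-size batches."""
    rows = conn.execute(
        "SELECT doc_id, action FROM backlog "
        "WHERE status = 'pending' ORDER BY doc_id"
    ).fetchall()
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def ingest_batch(batch: list) -> list:
    """Placeholder for the real ingestion call; returns the ids it handled."""
    return [doc_id for doc_id, _action in batch]


def run_sync(conn: sqlite3.Connection, batch_size: int = 25, workers: int = 4) -> int:
    """Ingest pending batches in parallel and mark them done in the backlog."""
    done = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Worker threads only run ingest_batch; the database connection is
        # used solely from the calling thread.
        for finished in pool.map(ingest_batch, pending_batches(conn, batch_size)):
            conn.executemany(
                "UPDATE backlog SET status = 'done' WHERE doc_id = ?",
                [(doc_id,) for doc_id in finished],
            )
            done += len(finished)
    conn.commit()
    return done
```

Tracking status in the table rather than in memory means an interrupted run can resume from the rows that are still pending.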

Our Impact

The RAG implementation delivered an accurate, scalable platform that unlocked AstraZeneca’s unstructured pharmaceutical knowledge repository while staying entirely on managed cloud infrastructure.

90% Reduction in Ingestion Complexity: Standardizing data inputs directly into PDFs eliminated the need to develop, scale, and maintain dozens of brittle, format-specific parsing microservices.
Robust, High-Throughput Scaling: The decoupled, modular pipeline successfully ingested the initial foundation of 20,000+ documents (processing more than 1.5 billion input tokens) and natively absorbs daily incremental updates without pipeline blocks or manual oversight.
Tiered Cost Optimization: The serverless stack tiers its foundation models, reserving premium multi-modal parsers for complex visual layouts and using ultra-low-cost models for metadata embeddings and summary injection. This lets the platform process high data volumes while keeping operating costs predictable.

Read the full article on Medium.
