Natural Language Processing: Text Summarisation for Long, Domain-Specific Texts

Author: Hiru Ranasinghe, Machine Learning Engineer

Natural language processing (NLP) is one of the most prominent application areas in AI, with the global NLP market projected to be worth $48.46 billion by 2026. The pandemic has expedited market growth even further, especially in the healthcare sector, and this rapid expansion is leading to new advancements in the field.

One of these is the growing prevalence of transformer language models, such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer 3 (GPT-3). These models have been trained on massive quantities of data and dramatically improve performance on a wide range of NLP tasks, paving the way for novel and complex applications. Combined with the flexibility and scalability of cloud computing, transformer models have brought cutting-edge NLP solutions ever closer to businesses.

There are important applications of NLP in the healthcare industry, such as extracting insights from unstructured formats like clinical notes and lab results. Text summarisation is one such example, where large pieces of text are distilled into a concise summary by a language model and can aid time-sensitive decisions. For instance, independent reviews of the medical necessity of prescribed treatments might be requested by a patient or their insurer. In this scenario, the reviewers are required to examine hundreds of documents before coming to a conclusion, which can be extremely time-consuming and can also lead to human error. Applying text summarisation here gives the reviewer a concise digest of the documents, helping them reach a decision more quickly and with less risk of overlooking something.

This blog will cover some common challenges with using text summarisation on longer and domain-specific texts, and how to overcome them.

What Makes Language Models so Powerful? 

Language models are usually pre-trained on very large corpora of text (e.g., BERT was trained on 3.3 billion words) using tasks such as missing word prediction. As a result, the hidden layers of the model learn a general understanding of language, which can then be reused for the downstream task (in our case text summarisation) through transfer learning.
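To make "missing word prediction" concrete, here is a minimal sketch of how one training example could be built: some tokens are hidden, and the model is asked to recover them. This is a simplification under stated assumptions. It splits on whitespace rather than using a real tokeniser, and the names mask_tokens, mask_prob and seed are illustrative rather than part of any library. BERT itself selects about 15% of tokens and does not always replace a selected token with the mask symbol.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Build one masked-word-prediction example.

    Returns (masked_tokens, labels). labels[i] holds the original token
    wherever a mask was applied, and None everywhere else.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must predict this word
        else:
            masked.append(tok)
            labels.append(None)   # nothing to predict at this position
    return masked, labels

# Illustrative sentence; whitespace splitting stands in for a real tokeniser.
tokens = "the patient was prescribed a course of antibiotics".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the labels come from the text itself, no manual annotation is needed, which is what makes pre-training on billions of words practical.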

Extractive and Abstractive Summarisation

In NLP, there are two forms of text summarisation: extractive and abstractive.

Extractive summarisation is the process of selecting, verbatim from the source, a predefined number of sentences that are most important for understanding the text.

Abstractive summarisation, on the other hand, generates a summary consisting of novel sentences by either rephrasing or using words not found within the original text. The task of abstractive summarisation is much more complex than extractive summarisation, as it requires the language model to not only comprehend the message conveyed by the text but also generate a summary that encapsulates the general understanding of the original text.


Source Text: The Australian Grand Prix will be the first to feature four separate DRS zones around the lap this weekend as F1 returns to Albert Park. The Albert Park track map on the official Formula 1 website has been updated to show that the circuit now features four DRS zones to aid overtaking. The Melbourne circuit has undergone major re-profiling since the last Australian Grand Prix held at the venue in 2019, with a number of corners being made wider and faster in a bid to improve racing. This will be the first time a grand prix has been held with four separate DRS activation points around the circuit.

Extractive Summarisation: The Australian Grand Prix will be the first to feature four separate DRS zones around the lap this weekend as F1 returns to Albert Park. The Melbourne circuit has undergone major re-profiling since the last Australian Grand Prix held at the venue in 2019, with a number of corners being made wider and faster in a bid to improve racing. 

Abstractive Summarisation: The Australian Grand Prix will feature four DRS zones, which is the first track to do this. Since the last Grand Prix in 2019, corners of the Albert Park track have been made wider and faster to improve racing.

However, a problem arises when we try to summarise very long pieces of text. This is because large pre-trained models usually have a limit to the length of text they can ingest. For example, BERT-base (the most popular version of BERT) can only take up to 512 tokens as input. Tokens are often word pieces rather than whole words, so the limit is usually reached in fewer than 512 words. This poses a problem when one wants to summarise a body of text longer than this.

How to Summarise Long + Domain-Specific Text

Given the massive focus on COVID-19 over the past couple of years, we wanted to put text summarisation to the test in the context of coronavirus, and use NLP to summarise COVID-19 research papers. This provided us with an exciting challenge, as research papers not only contain large, complex bodies of text, but also use domain-specific language. With most popular models limited to 512 tokens of input, we set out to build a model that uses a mix of approaches to overcome the limitations of existing models.

Our Approach

Combining extractive and abstractive summarisation allows us to summarise longer bodies of text, by using extractive summarisation to reduce the word count of the original text to a length short enough to be input into the abstractive summariser. The extractive step ensures that only the most important sentences are retained, effectively acting as a filter for the abstractive step. Then, the shortened text is used to produce an abstractive summary of the original text. 
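The two-stage idea can be sketched as a small function that filters first and paraphrases second. This is an illustrative outline, not our production code: summarise_long_text and its parameters are hypothetical names, and the three helpers passed in below (keeping the first k sentences, counting whitespace-separated words, and returning the text unchanged) are deliberately trivial placeholders for a real extractive model, tokeniser and abstractive model.

```python
from typing import Callable, List

def summarise_long_text(
    sentences: List[str],
    extract: Callable[[List[str], int], List[str]],
    abstract: Callable[[str], str],
    count_tokens: Callable[[str], int],
    num_sentences: int,
    max_input_tokens: int = 512,
) -> str:
    """Filter the text extractively, then summarise the result abstractively."""
    # Stage 1: keep only the most important sentences.
    shortened = " ".join(extract(sentences, num_sentences))
    # The filtered text must fit the abstractive model's input limit.
    if count_tokens(shortened) > max_input_tokens:
        raise ValueError("Extractive summary is still too long; keep fewer sentences.")
    # Stage 2: rewrite the filtered text as a cohesive summary.
    return abstract(shortened)

# Placeholder components, for illustration only.
lead_k = lambda sents, k: sents[:k]        # stand-in extractive step
word_count = lambda text: len(text.split())  # stand-in token counter
echo = lambda text: text                   # stand-in abstractive step

sentences = [f"Sentence number {i} about the study." for i in range(200)]
summary = summarise_long_text(sentences, lead_k, echo, word_count, num_sentences=5)
print(summary)
```

Passing the components in as functions keeps the length check in one place, so either stage can be swapped without touching the other.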

Fig 1: The process of creating an extractive summary before an abstractive summary

General-purpose language models, such as BERT, can struggle with domain-specific language such as medical terminology. As a result, an abstractive summariser built on one would find it difficult to produce text in a similar style to medical research papers. Consequently, we needed a solution to train our model on domain-specific language.

Our next steps walk you through how we solved this, and how you can implement a similar solution to finetune models for your purpose.

Getting Started

1. Pick an Appropriate Dataset

To produce more accurate summaries of COVID-19 papers, our solution required finetuning the model on the language of the domain. To do this, we needed a training set with example texts as the inputs and a summary of each as the “label”, which serves as the ground truth the model learns to reproduce. In our case, we acquired the COVID-19 Open Research Dataset from Kaggle, with 500,000+ scholarly articles related to COVID-19, SARS-CoV-2 and other coronaviruses. The article abstract was split from the main body of text so we could use the abstract as our “label” and the main body of text as the input to summarise.
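As a rough illustration of this split, the sketch below turns paper records into (input, label) pairs. The field names "abstract" and "body_text" and the flat-string record layout are assumptions made for the example (the real dataset files are more deeply nested), and to_training_pair is a hypothetical helper. Papers missing either part are skipped, since they cannot provide both an input and a label.

```python
from typing import Dict, List, Optional, Tuple

def to_training_pair(paper: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (input_text, label) for one paper, or None if it is unusable."""
    abstract = (paper.get("abstract") or "").strip()
    body = (paper.get("body_text") or "").strip()
    if not abstract or not body:
        return None  # no label or no input: cannot be used for training
    # The body is the text to summarise; the abstract is the target summary.
    return body, abstract

# Hypothetical records, for illustration only.
papers: List[Dict[str, str]] = [
    {"abstract": "We analyse transmission data.",
     "body_text": "Coronaviruses are a family of viruses. We collected case reports."},
    {"abstract": "", "body_text": "A paper with no abstract."},
    {"abstract": "Only an abstract.", "body_text": ""},
]

pairs = [pair for pair in map(to_training_pair, papers) if pair is not None]
print(len(pairs))
```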

2. Extractive Summary

The extractive summariser did not need to be trained, as we used an unsupervised algorithm to generate the summary. To generate an extractive summary we:

a. Used the Hugging Face implementation of BERT to encode each sentence into a vector.

b. Applied the K-means clustering algorithm (an unsupervised algorithm) to group the sentence vectors into a predefined number of clusters.

c. Selected, from each cluster, the sentence whose vector is closest to the cluster centroid. Each cluster contains a group of sentences around a particular topic, and the sentence closest to the centroid is assumed to encapsulate the meaning of all the sentences within the cluster.

Together, the selected sentences form the extractive summary. It is important to note that the number of sentences in the extractive summary is determined by the number of clusters we ask the K-means algorithm to find.
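The steps above can be sketched in a few dozen lines. This is a minimal illustration rather than our implementation: to stay self-contained it replaces the BERT sentence encoder with simple bag-of-words counts, uses a small hand-written k-means for the clustering step, and starts the centroids from evenly spaced sentences so the result is repeatable. All function names here are hypothetical.

```python
from collections import Counter
from typing import List

def embed(sentences: List[str]) -> List[List[float]]:
    """Placeholder encoder: bag-of-words counts (a stand-in for BERT vectors)."""
    tokenised = [s.lower().split() for s in sentences]
    vocab = sorted({w for toks in tokenised for w in toks})
    return [[float(Counter(toks)[w]) for w in vocab] for toks in tokenised]

def sq_dist(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors: List[List[float]], k: int, iterations: int = 20) -> List[List[float]]:
    """Tiny k-means that returns the k cluster centroids."""
    # Deterministic start: centroids taken from evenly spaced sentences.
    step = len(vectors) / k
    centroids = [list(vectors[int(i * step)]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def extractive_summary(sentences: List[str], num_sentences: int) -> List[str]:
    """Pick the sentence closest to each cluster centroid, in document order."""
    if not sentences or num_sentences < 1:
        return []
    vectors = embed(sentences)
    k = min(num_sentences, len(sentences))
    chosen = set()
    for centroid in kmeans(vectors, k):
        chosen.add(min(range(len(vectors)), key=lambda i: sq_dist(vectors[i], centroid)))
    return [sentences[i] for i in sorted(chosen)]

sentences = [
    "The vaccine trial enrolled adult volunteers.",
    "The vaccine trial reported mild side effects.",
    "The vaccine trial measured antibody levels.",
    "Hospital admissions rose during winter.",
    "Hospital admissions fell after the lockdown.",
    "Hospital admissions were highest in cities.",
]
summary = extractive_summary(sentences, num_sentences=2)
print(summary)
```

Two centroids can occasionally share the same nearest sentence, in which case the summary contains fewer sentences than requested. Returning the chosen sentences in their original order keeps the extract readable for the abstractive step.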

3. Abstractive Summary

The abstractive step, in contrast, was finetuned. We used an autoregressive model that takes the extractive summary as input and paraphrases it into a more cohesive summary. For our solution, we trained the Bidirectional and Auto-Regressive Transformers (BART) model, using the abstract of each research paper as the target output. The abstract was omitted from the input into the extractive summariser to avoid data leakage.

Training Language Models on Vertex AI 

To train the abstractive BART model, we curated a training dataset consisting of the extractive summary of the original research paper as inputs and the respective abstract as the target label. After experimenting, we found 6 epochs with a learning rate of 2e-4 to be the optimal hyperparameters for the BART model. It took 12,000 training steps for us to reach a training loss that we were happy with.
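For readers planning a similar run, the relationship between epochs and training steps is simple arithmetic. The sketch below is illustrative only: the dataset size and batch size are hypothetical values chosen so that the numbers line up with 6 epochs and 12,000 steps. Neither value is reported in this post, and the calculation assumes one optimiser step per batch with no gradient accumulation.

```python
import math

def total_training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Number of optimiser steps when every batch triggers one update."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

EPOCHS = 6            # from the post
LEARNING_RATE = 2e-4  # from the post

# Hypothetical values, not taken from the post.
NUM_EXAMPLES = 16_000
BATCH_SIZE = 8

steps = total_training_steps(NUM_EXAMPLES, BATCH_SIZE, EPOCHS)
print(steps)
```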

Many other researchers working on this problem have found compute resources to be a limiting factor in model performance, but we benefited from the scalability of Vertex AI Workbench on GCP, which let us train on an A100 GPU and significantly reduced training time.

Fig 2: Flowchart of Model Fine Tuning

Fig 3: Flowchart of Inference

Using this model, we are able to summarise complex research papers into simple abstracts, which only require some minor grammatical edits and sense checking. This has a wide range of applications, not only in the healthcare sector, but also in many other fields, including academia, law, and business.

Try the model out for yourself with our “Covid Research Paper Summariser” demo here.

We hope you enjoyed this blog! Part two of this blog series will cover how we deployed a REST API using Vertex AI endpoints to serve our model on the demo site. We will also cover how we developed a React frontend and Express.js server to communicate with the API using Google Cloud App Engine.

Get in touch to find out how your business could benefit from using Natural Language Processing models!
