Extracting Features from Free-Form Text
Experts estimate that 80-90% of data in organisations is unstructured, and the amount of unstructured data such as free-form text, images, audio and video is increasing every single day. Therefore, there is a tremendous amount of information available in the form of unstructured data just waiting to be unlocked.
Natural Language Processing (NLP) enables us to process and analyse large amounts of natural language data. Extracting features from text has wide-ranging applications, from document classification based on category to sentiment analysis of product reviews.
NLP has the scope to improve diagnostics in healthcare, aid decisions in financial markets, improve customer service in retail, gauge brand impressions in advertising, streamline recruitment, and customise media content. It can help a company understand consumer reactions to a specific product launch and evaluate the public’s reaction to various campaigns, leading to insights about potential improvements. With NLP, you can take information from news articles, online product reviews, social media posts, blogs, forums and emails, and extract structured information from it.
In this post, we’ll explore an off-the-shelf method for obtaining features from free-form text using Google’s Cloud Natural Language. It provides a fast and easy way of deriving insights from unstructured text using the latest natural language processing methods available from Google. You can try it out for free by supplying sample text to be analysed and explore the sentiment of the text, the named entities found in it, information about its syntax and the categories associated with it. The syntax and sentiment analysis can be used to extract linguistic information from any text to obtain the overall feelings expressed in it. The entity analysis can be used to obtain known entities from the text, such as landmarks, public figures, events, products, organisations and so on. Sentiment analysis can be combined with the entity analysis to obtain sentiments for each named entity and gain insight into how each entity is influencing the overall sentiment of the document.
These features can provide a distilled version of the free-form text that lends itself well to being input into a machine learning model.
Let’s explore in more depth the results that can be obtained from the API! As an example, I have picked the following Google Chromecast review.
“I’m very happy to leave 5 stars as this small USB power cable is just perfect for those that are unable to have a power socket close to hand. It is powering my chromecast without any issues so far (& I don’t expect any). Not much else to say expect maybe to ask the question as to why Google don’t supply one in the box as standard.”
Analysing the review for sentiment gives the following results.
We can see that the overall sentiment of the review is positive, apart from the last sentence where the reviewer is being critical. The category it is most confident about is “/Computers & Electronics/Consumer Electronics” and it is correct.
There are multiple ways to access the API: client libraries, service APIs and a command-line interface. In this guide we will be using the Python client. If you want to follow along, make sure you have exported the service account key for a project which has the Cloud Natural Language API enabled. You can export it like this:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
Right, let’s get started! First things first, let’s import all of the libraries we’ll be using and instantiate the Natural Language client:
```python
import pandas as pd

from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

# instantiate the Cloud Natural Language client
language_client = language.LanguageServiceClient()
```
Then, let’s define some helper functions which will let us display the results more easily.
```python
def show_sentiments(annotations, n):
    score = annotations.document_sentiment.score
    magnitude = annotations.document_sentiment.magnitude
    for index, sentence in enumerate(annotations.sentences[:n]):
        sentence_sentiment = sentence.sentiment.score
        print('Sentence {} has a sentiment score of {:.3f}'.format(
            index + 1, sentence_sentiment))
        print('Sentence text:\n{}\n'.format(sentence.text.content))
    print('Overall Sentiment: score of {:.3f} with magnitude of {:.3f}'.format(
        score, magnitude))

def show_categories(categories, n):
    for category in categories.categories[:n]:
        print('=' * 20)
        print('name: {}'.format(category.name))
        print('confidence: {:.3f}'.format(category.confidence))

def show_sentences(syntax, n):
    for sentence in syntax.sentences[:n]:
        print('=' * 20)
        print('{}'.format(sentence.text.content))

def show_tokens(syntax, n):
    for token in syntax.tokens[:n]:
        print('=' * 20)
        print('text: {}'.format(token.text.content))
        print('part_of_speech:\n{}'.format(
            ' ' + str(token.part_of_speech).replace('\n', '\n ')))
        print('lemma: {}'.format(token.lemma))

def show_entities(entities, n):
    for entity in entities.entities[:n]:
        print('=' * 20)
        print(' name: {0}'.format(entity.name))
        print(' type: {0}'.format(entity.Type.Name(entity.type)))
        print(' metadata: {0}'.format(dict(entity.metadata)))
        print(' salience: {:.4f}'.format(entity.salience))

def show_entity_sentiments(entity_sentiments, n):
    for entity in entity_sentiments.entities[:n]:
        print('=' * 20)
        print(' name: {0}'.format(entity.name))
        print(' type: {0}'.format(entity.Type.Name(entity.type)))
        print(' score: {:.3f}'.format(entity.sentiment.score))
        print(' magnitude: {:.3f}'.format(entity.sentiment.magnitude))
```
The dataset we will be using is made available by Kaggle and can be found at this link. It consists of music album reviews published by Pitchfork magazine. The data comes as a SQLite database, so to proceed we will need to extract the tables and save them as CSV files. The following script does just that. To use it, specify the data directory as a command-line argument and run it:
```sh
#!/bin/sh
data_dir=$1

echo "Exporting content table from sqlite database to $data_dir/content.csv"
sqlite3 -header -csv $data_dir/database.sqlite "select * from content;" > $data_dir/content.csv

echo "Exporting genres table from sqlite database to $data_dir/genres.csv"
sqlite3 -header -csv $data_dir/database.sqlite "select * from genres;" > $data_dir/genres.csv

echo "Exporting labels table from sqlite database to $data_dir/labels.csv"
sqlite3 -header -csv $data_dir/database.sqlite "select * from labels;" > $data_dir/labels.csv

echo "Exporting reviews table from sqlite database to $data_dir/reviews.csv"
sqlite3 -header -csv $data_dir/database.sqlite "select * from reviews;" > $data_dir/reviews.csv
```
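Assuming you saved the script as export_tables.sh (the filename is my choice), you would run it like so:

```sh
sh export_tables.sh /path/to/data
```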
Now, let’s load the CSV files into pandas dataframes:
```python
# the sqlite3 -csv export above is comma-delimited, so the default
# delimiter is correct here
content = pd.read_csv('content.csv', header=0)
genres = pd.read_csv('genres.csv', header=0)
labels = pd.read_csv('labels.csv', header=0)
reviews = pd.read_csv('reviews.csv', header=0)
```
and print a preview of each of the tables to get an idea of what the data looks like:
reviews.head()
genres.head()
labels.head()
content.head()
We might want to join up the datasets on reviewid later on to get a better idea of the context for each review, but for now, let’s just use the content dataset and try inputting the free review text as is into Google’s Natural Language API to see what insights we can get right away.
Now, let’s pick a review text and explore the Natural Language API. I’ve picked a review about The Beatles’ Let It Be album. You are welcome to search for your favourite band/artist on https://pitchfork.com and copy the review ID from the URL. For example, for The Beatles, the URL is https://pitchfork.com/reviews/albums/13430-let-it-be/ and the review ID is 13430.
```python
review_id = 13430
review_text = content[content.reviewid == review_id]['content'].values[0]
print('{}...'.format(review_text[:205]))
```
Let’s load the review text as a document that we can pass directly to the language client methods:
document = types.Document(content=review_text, type=enums.Document.Type.PLAIN_TEXT)
Content classification
The Natural Language API can be used for analysing documents and obtaining lists of content categories that apply to the text in the document.
```python
categories = language_client.classify_text(document=document)
show_categories(categories, 5)
```
The API managed to pick up the broad categories that the review falls under. For The Beatles’ album, the categories are “Arts & Entertainment”, “Music & Audio” and more specifically “Rock Music”. Depending on which review you picked, you may see different categories. See whether the API can classify your text correctly if you describe an animal or an object without using the word for it! Below is the text I experimented with.
“These animals are domesticated felines of the genus felidae who are smaller than the majority of their wild cousins who share the genus. Efficient hunters with speed, sharp teeth, retractable claws, superior hearing, eyes which can function in near-complete darkness and a wide variety of vocalizations. Primarily short-haired, but with long-haired variations resulting from selective breeding. Markings can vary from a single solid color to various stripes, swirls, spots and differently-colored extremities, primarily in tones of brown, black, orange, red, white and grey.
They are popular as pets and generally considered sleek, beautiful, intelligent, fastidious and aloof, but are also known for their independence, playfulness and the purr, a vocalization that can indicate contentedness, security and affection, as well as distress and self-comfort. Humans find the purr comforting and studies have shown that the purr reduces stress in humans.”
See whether you can find out which words contribute the most to the API’s confidence that “/Pets & Animals/Pets/Cats” is a category it should associate with the text.
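As a rough sketch of how you might run that experiment yourself (reusing the client and the show_categories helper from earlier; the variable animal_text is my own and is assumed to hold the full description quoted above):

```python
# classify the animal description; animal_text should contain the full
# two paragraphs quoted above (truncated here for brevity)
animal_text = ('These animals are domesticated felines of the genus felidae '
               'who are smaller than the majority of their wild cousins ...')
animal_document = types.Document(content=animal_text,
                                 type=enums.Document.Type.PLAIN_TEXT)
animal_categories = language_client.classify_text(document=animal_document)
show_categories(animal_categories, 5)
```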
Syntactic analysis
The Natural Language API provides a powerful set of tools for analysing and parsing text through syntactic analysis. Syntactic analysis consists of sentence extraction (breaking up the stream of text into a series of sentences) and tokenisation (breaking up the stream of text into a series of tokens, where each token usually corresponds to a single word).
syntax = language_client.analyze_syntax(document=document)
The first 5 sentences:
show_sentences(syntax, 5)
The first 5 tokens:
show_tokens(syntax, 5)
The text field contains the text data associated with the token. The part_of_speech field provides grammatical information, including morphological information about the token, such as the token’s tense, person, number, gender, etc. (for more information, refer to the documentation). The lemma field contains the “root” word upon which this word is based, which allows you to canonicalise word usage within your text. For example, the words “write”, “writing”, “wrote” and “written” are all based on the same lemma (“write”). Plural and singular forms are also based on lemmas: “house” and “houses” both refer to the same lemma.
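A quick, illustrative way to see lemmatisation in action (reusing the client from earlier; the sample sentence is my own):

```python
# map each inflected form of "write" back to its lemma
write_doc = types.Document(content='I write. I am writing. I wrote. I have written.',
                           type=enums.Document.Type.PLAIN_TEXT)
write_syntax = language_client.analyze_syntax(document=write_doc)
for token in write_syntax.tokens:
    print('{:10} -> {}'.format(token.text.content, token.lemma))
```

Each of the verb forms should come back with the lemma “write”.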
There are other, more advanced fields available too. For more information, refer to the Natural Language API documentation.
Sentiment analysis
sentiments = language_client.analyze_sentiment(document=document)
The sentiments for the first 5 sentences are:
show_sentiments(sentiments, 5)
We can see that it is really easy to get started and obtain meaningful results very quickly. The API detected the language of the text automatically and preprocessed it without us having to do any cleaning.
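You can check the detected language directly, since the sentiment response includes a language field; for our English review this should print ‘en’:

```python
# the analyze_sentiment response reports the detected language code
print(sentiments.language)
```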
Guide to interpreting sentiment analysis values
The score of a document’s sentiment indicates the overall emotion of a document. The magnitude of a document’s sentiment indicates how much emotional content is present within the document, and this value is often proportional to the length of the document.
It is important to note that Cloud Natural Language indicates differences between positive and negative emotion in a document, but does not identify specific positive and negative emotions. For example, “angry” and “sad” are both considered negative emotions. However, when Cloud Natural Language analyses text that is considered “angry”, or text that is considered “sad”, the response only indicates that the sentiment in the text is negative, not “sad” or “angry”.
A document with a neutral score (around 0.0) may indicate a low-emotion document, or may indicate mixed emotions, with both high positive and negative values which cancel each other out. Generally, you can use magnitude values to disambiguate these cases, as truly neutral documents will have a low magnitude value, while mixed documents will have higher magnitude values.
When comparing documents to each other (especially documents of different length), make sure to use the magnitude values to calibrate your scores, as they can help you gauge the relevant amount of emotional content.
The chart below shows some sample values and how to interpret them:
| Sentiment | Sample Values |
| --- | --- |
| Clearly Positive | score: 0.8, magnitude: 3.0 |
| Clearly Negative | score: -0.6, magnitude: 4.0 |
| Neutral | score: 0.1, magnitude: 0.0 |
| Mixed | score: 0.0, magnitude: 4.0 |
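To make the table concrete, here is a small illustrative helper that maps a (score, magnitude) pair onto those four labels. The cutoff values are arbitrary choices of mine for demonstration, not thresholds prescribed by the API:

```python
def interpret_sentiment(score, magnitude,
                        score_cutoff=0.25, magnitude_cutoff=1.0):
    # cutoffs are illustrative only; tune them against your own data
    if score >= score_cutoff:
        return 'clearly positive'
    if score <= -score_cutoff:
        return 'clearly negative'
    if magnitude < magnitude_cutoff:
        return 'neutral'
    return 'mixed'

print(interpret_sentiment(0.8, 3.0))  # clearly positive
print(interpret_sentiment(0.0, 4.0))  # mixed
```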
Entity analysis
We can also perform entity analysis on our review text. Entity analysis provides information about entities in the text, which generally refer to named “things” such as famous individuals, landmarks, common objects, etc.
entities = language_client.analyze_entities(document=document)
The first 5 entities, with their names, types, metadata and salience, are shown below.
show_entities(entities, 5)
The type field indicates the type of the entity (e.g. person, location, event, etc.). This information helps distinguish and/or disambiguate entities, and can be used for writing patterns or extracting information. For example, a type value can help distinguish similarly named entities such as “Lawrence of Arabia”, tagged as a WORK_OF_ART (film), from “T. E. Lawrence”, tagged as a PERSON.
The metadata field contains source information about the entity’s knowledge repository and may contain wikipedia_url and mid (a machine-generated identifier corresponding to the entity’s Google Knowledge Graph entry, which remains unique across languages and can be used to tie together entities from different languages).
The salience field indicates the importance or relevance of this entity to the entire document text. This score can assist information retrieval and summarisation by prioritising salient entities. Scores closer to 0.0 are less important, while scores closer to 1.0 are highly important.
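For example, one simple way to use salience is to keep only the entities above some cutoff; the 0.01 threshold below is an arbitrary choice of mine:

```python
# keep entities above a salience cutoff, most salient first
salient_entities = sorted((e for e in entities.entities if e.salience > 0.01),
                          key=lambda e: e.salience, reverse=True)
for e in salient_entities:
    print('{:.4f}  {}'.format(e.salience, e.name))
```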
Entity sentiment analysis
We can also combine named entities and sentiment analysis and obtain sentiments for each named entity.
entity_sentiments = language_client.analyze_entity_sentiment(document=document)
The first 5 entity sentiments, with their names, types, scores and magnitudes, are shown below.
show_entity_sentiments(entity_sentiments, 5)
Extracting structured information
Now that we’ve explored the Natural Language API, we can start thinking about how to extract more structured information from the text. We may need it to conform to a schema or we may need to create additional features for a machine learning model. The set of features of interest will vary depending on the use case, but for the purposes of this exercise, let’s assume we want to be able to predict the review score. The overall sentiment of the text and the categories it falls under could make good features for this task. Let’s join up the data and add the additional features.
First, let’s merge content with reviews because they are one-to-one.
df = pd.merge(content.dropna(), reviews.dropna(), how='inner', on='reviewid')
Next, let’s group the labels per review and list them as comma-separated values, so that we can join them to the dataframe from the previous step.
```python
review_labels = labels.dropna().groupby(['reviewid'])['label'].apply(', '.join).reset_index()
df = pd.merge(df, review_labels, how='inner', on='reviewid')
```
Finally, let’s do the same for genres.
```python
review_genres = genres.dropna().groupby(['reviewid'])['genre'].apply(', '.join).reset_index()
df = pd.merge(df, review_genres, how='inner', on='reviewid')
```
Great, we have all the review information in one dataframe now. Let’s have a peek.
df.head()
Now, let’s augment the dataframe above with some additional features. In particular, the sentiment of the text and the categories the text falls under would make useful features. We can also add a comma-separated list of the most important entities in the text as a feature.
Running classification on the entire dataframe might take a while and we might run into API rate limiting issues, so selecting a small subset of the data is a good way to demonstrate the process without spending too long waiting for requests to complete.
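If you do run into rate limits, one simple (if crude) mitigation is to pause between API calls; the half-second delay below is an arbitrary choice of mine, and the wrapper is only a sketch. You could use it to wrap the per-review helpers we define next, e.g. with_pause(get_category):

```python
import time

def with_pause(func, pause_seconds=0.5):
    # wrap an API-calling function so that each call is followed by a pause
    def wrapped(content):
        result = func(content)
        time.sleep(pause_seconds)
        return result
    return wrapped
```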
Let’s create a mini dataframe with just the first 10 rows.
```python
# take a copy so that adding columns later does not trigger pandas'
# SettingWithCopyWarning
mini_df = df[:10].copy()
```
Then, let’s run classification on the text for each one and add it to mini_df as a column.
```python
def get_category(content):
    document = types.Document(content=content,
                              type=enums.Document.Type.PLAIN_TEXT)
    categories = language_client.classify_text(document=document)
    return ', '.join([c.name for c in categories.categories])

mini_df_categories = mini_df.content.apply(get_category)
mini_df['categories'] = mini_df_categories
```
Next, let’s obtain the sentiment scores and magnitudes for each review and add that as a column too.
```python
def get_sentiment_scores(content):
    document = types.Document(content=content,
                              type=enums.Document.Type.PLAIN_TEXT)
    sentiments = language_client.analyze_sentiment(document=document)
    score = sentiments.document_sentiment.score
    magnitude = sentiments.document_sentiment.magnitude
    return pd.Series([score, magnitude])

# applying a function that returns a Series yields a DataFrame
mini_df_sentiment_scores = mini_df.content.apply(get_sentiment_scores)
mini_df_sentiment_scores.columns = ['score', 'magnitude']
mini_df = pd.concat([mini_df, mini_df_sentiment_scores], axis=1)
```
Finally, let’s add an entities column which includes the 5 entities with the highest salience scores.
```python
def get_top_entities(content):
    document = types.Document(content=content,
                              type=enums.Document.Type.PLAIN_TEXT)
    entities = language_client.analyze_entities(document=document)
    return ', '.join([e.name for e in entities.entities[:5]])

top_entities = mini_df.content.apply(get_top_entities)
mini_df['entities'] = top_entities
mini_df.head()
```
That’s it! We have added a few additional features which a machine learning model trying to predict the review score can use. These features are much easier to encode and feed into a model than free-form text is. The Natural Language API has done most of the heavy lifting for us. We have just scratched the surface of what could be done with Cloud Natural Language and the potential use cases are far-reaching!
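As a final, purely illustrative step (not part of the pipeline above), the comma-separated categories column could be expanded into indicator columns with pandas before being fed to a model:

```python
# one indicator column per category; str.get_dummies splits on the separator
category_features = mini_df['categories'].str.get_dummies(sep=', ')
model_input = pd.concat([mini_df[['score', 'magnitude']], category_features],
                        axis=1)
model_input.head()
```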