This recipe classifies text data into multiple topics, with a view to extracting primary and secondary topics, amongst other applications. The approach could be extended to claim cause classification problems in (life) insurance.
…How do claim cause categories change over the lifetime of a claim? Could we distinguish between primary and secondary causes? Could we automate the allocation of a claim to a cause category and does this help limit the operational risk of mis-classification?…
Most (life) insurers have some form of categorisation of claims into cause. This is often an input from the claims assessor into the admin system. Some businesses categorise claims into primary and secondary causes.
This analysis considers methods of modelling topics or categories based upon the underlying text. This is useful where the classifications do not exist or we want to consider different classifications e.g. evolving claim drivers or secondary causes. Here we use customer review data, but the approach could be applied to claim file text in an insurance context.
The approach could be extended by pairing raw cause text with existing cause mappings i.e. the models could be defined to learn the categories based upon existing classifications - that form of the problem is a more standard categorisation problem. More categorisation problems of that type are described in the recipes in the Actuaries’ Analytical Cookbook, Natural Language Processing sections.
The methods below are not exhaustive but are rather intended to be illustrative examples of topic modelling concepts to stimulate thinking. The articles in the further reading section expand on some of these thoughts.
A few articles of interest:
Topic extraction using LDA discusses using Latent Dirichlet Allocation (LDA) models. It highlights the simplifying assumptions of the models. The example dataset used in the article already has topics assigned to the data, showing the extension to classification problems.
Topic modelling with NMF An example of topic modelling using Non-Negative Matrix Factorization (NMF).
Topic modelling with BERT An example of topic modelling using Bidirectional Encoder Representations from Transformers (BERT).
Topic analysis discusses topic analysis in detail, similar to the article above.
Multi-label text classification and Multi-Class Text Classification Model Comparison and Selection consider a variety of classifier models to train against an existing label column. Includes methods of testing model accuracy.
Part of speech tagging sets out to extract and visualize part of speech in a piece of text. Could be extended to building a classification model on, for example, only nouns within the text.
Set up python in Rstudio with Reticulate is an article setting out simple steps for using Python within Rstudio (Posit).
This section sets up the environment and lists some of the packages used in the recipe. For more on calling Python from within RStudio (Posit), see further reading.
# calling python from r
library(reticulate)
# create environment if does not exist
# conda_create("r-reticulate")
# py_config() # to check configuration
# activate environment
use_condaenv("r-reticulate", required=TRUE) # set environment before running any Python chunks
# if not already installed, install the below. if env specified, can drop envname parameter
# py_install("pandas",envname = "r-reticulate")
# py_install("numpy",envname = "r-reticulate")
# ...etc
Libraries:
# import some standard libraries
import pandas as pd
import numpy as np
import random
The sections below analyse customer review data sourced from Kaggle: Amazon Customer Reviews.
This dataset is not appropriate for publication and has been anonymised, but is nevertheless useful for illustrative purposes here.
Extracting a sample of 20k reviews from the set. Of interest are the ‘Summary’ and ‘Text’ as these are the text of the reviews. The analysis is conducted on the ‘Text’ column.
# full dataset saved as reviews_clean
# data_full = pd.read_csv('reviews_clean.csv')
# len(data_full)
# we're taking a sample from the dataset, already extracted to reviews_sample
# data = data_full.sample(n=20000)
# data.to_csv(r'.\reviews_sample.csv', index = False, header=True)
# import sample
data = pd.read_csv('reviews_sample.csv')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 150)
print('Data head')
print('-------------------------------------------')
data.head()
print('-------------------------------------------')
Data head
-------------------------------------------
Id ProductId UserId HelpfulnessNumerator HelpfulnessDenominator Score Time \
0 128563 B007L3NVKU A2CH00OW75H2OL 0 0 5 1348272000
1 427149 B001PAS5GK A3923DNTARI2V1 1 2 5 1265328000
2 513073 B001E6EE4C AR676SSCX7JFH 0 0 5 1258416000
3 358466 B005BYP7RG AEC0I4XOMJJ72 2 2 5 1338249600
4 495680 B0098WV8F2 AWZ84TQT9AI2Z 0 0 5 1333065600
Summary Text
0 Kentuckygirl Newman's is really the best, however, I buy th...
1 great transaction Fast delivery, very well priced and defintely ...
2 Great cereal! More fiber, without the disappo... This cereal is absolutely fantastic...best fib...
3 Fresh, tasty & Good Price These tasted very good and the best price I ca...
4 PB2 rocks! The best PB ever. I use PB2 to make protein sh...
-------------------------------------------
Extracting column names:
print('Column names')
print('-------------------------------------------')
# List out column names
list(data.columns)
Column names
-------------------------------------------
['Id', 'ProductId', 'UserId', 'HelpfulnessNumerator', 'HelpfulnessDenominator',
'Score', 'Time', 'Summary', 'Text']
The review columns are converted to lists for processing.
# Convert comment/ text column to list
text = data['Text'].values.tolist()
text_summary = data['Summary'].values.tolist()
Printing a sample of 5 reviews:
review_sample = data.sample(5)
# The DataFrame 'sample' method selects, without replacement by default,
# the number of rows you want from the data.
for i in range(len(review_sample)):
    print(review_sample.iloc[i,:])
Id 296872
ProductId B0002MLAEQ
UserId A1DC1O4VX6AHPP
HelpfulnessNumerator 6
HelpfulnessDenominator 6
Score 1
Time 1312070400
Summary frequent problem batches of this food
Text We were purchasing this food regularly for our...
Name: 7597, dtype: object
Id 381854
ProductId B002UMD9KO
UserId A2MUGFV2TDQ47K
HelpfulnessNumerator 2
HelpfulnessDenominator 3
Score 1
Time 1305331200
Summary Tang Energy FROZEN Beverage Drink Mix
Text The item description is incorrect. The actual ...
Name: 8491, dtype: object
Id 54739
ProductId B0040XD4RO
UserId A1FBJL96M6T248
HelpfulnessNumerator 1
HelpfulnessDenominator 1
Score 5
Time 1316390400
Summary The elusive Bran Flakes
Text I am delighted with my purchase. It used to be...
Name: 7830, dtype: object
Id 4485
ProductId B001EHDMY4
UserId A1YELVCOA6WEDQ
HelpfulnessNumerator 3
HelpfulnessDenominator 3
Score 5
Time 1261526400
Summary Best Tea
Text I love this tea. I have issues with added flo...
Name: 2294, dtype: object
Id 257695
ProductId B000B1X5IM
UserId A3F411JNZ4LTUX
HelpfulnessNumerator 0
HelpfulnessDenominator 0
Score 4
Time 1318896000
Summary Nice oilive oil
Text Purchased this because our daughter traveled t...
Name: 14824, dtype: object
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.
Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: “dog” and “bone” will appear more often in documents about dogs, “cat” and “meow” will appear in documents about cats, and “the” and “is” will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words.
The “topics” produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is. It involves various techniques of dimensionality reduction(mostly non-linear) and unsupervised learning like LDA, SVD, autoencoders etc.
Source: Wikipedia
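The "10% cats, 90% dogs" intuition can be checked with a few lines of arithmetic. The sketch below is illustrative only: the two topics, their word probabilities and the function name `expected_word_probs` are made up for this example and are not taken from the review data or from any library.

```python
# Two hypothetical topics, each a probability distribution over a tiny vocabulary.
topics = {
    "dogs": {"dog": 0.4, "bone": 0.4, "the": 0.1, "is": 0.1},
    "cats": {"cat": 0.4, "meow": 0.4, "the": 0.1, "is": 0.1},
}

# A document that is 90% about dogs and 10% about cats.
doc_topic_mix = {"dogs": 0.9, "cats": 0.1}

def expected_word_probs(topic_mix, topic_word_probs):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    probs = {}
    for topic, weight in topic_mix.items():
        for word, p in topic_word_probs[topic].items():
            probs[word] = probs.get(word, 0.0) + weight * p
    return probs

word_probs = expected_word_probs(doc_topic_mix, topics)
dog_share = word_probs["dog"] + word_probs["bone"]
cat_share = word_probs["cat"] + word_probs["meow"]
print(word_probs)
print("dog words per cat word:", dog_share / cat_share)
```

Under these assumed numbers, the common words "the" and "is" keep the same probability whatever the topic mix, while the topic-specific words scale with the document's topic proportions.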
Below is an exploratory word cloud of the words in the review text:
from wordcloud import WordCloud, STOPWORDS
from matplotlib import pyplot as plt
import matplotlib.colors as mcolors
cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]
cloud = WordCloud(
    background_color='white',
    stopwords=set(STOPWORDS),
    max_words=200,
    colormap='tab10',
    contour_color='steelblue',
    max_font_size=40,
    random_state=42
)
# Note: str(data['Text']) is the printed (and, for a long column, truncated) form of the column;
# ' '.join(data['Text'].astype(str)) would pass the full review text instead.
cloud.generate(str(data['Text']))
plt.gca().imshow(cloud)
plt.gca().axis('off')
plt.margins(x=10, y=10)
plt.tight_layout()
plt.savefig('img/prev_topic.png', bbox_inches='tight') # or use plt.show()
A few summary statistics on the length of the reviews, in characters and in words, using code from the NLP section of the Actuaries Institute Cookbook:
# From the NLP recipe section...
# Check the length of different reviews.
# A list comprehension is used to produce a list of how long each review is
# in characters.
review_length_characters = [len(t) for t in text]
# Print summary statistics for the number of characters in each review.
print('The longest character length in a review is {:,}.'.format(max(review_length_characters)))
print('The shortest character length in a review is {:,}.'.format(min(review_length_characters)))
print('The average character length of reviews is {:.0f}.'.format(np.mean(review_length_characters)))
print('The median character length of reviews is {:.0f}.'.format(np.median(review_length_characters)))
print()
# A list comprehension is used to produce a list of how long each review is
# in words.
# The str.split() function breaks a string by approximate word breaks.
review_length_words = [len(t.split()) for t in text]
# Print summary statistics for the number of words in each review.
print('The longest word length in a review is {:,}.'.format(max(review_length_words)))
print('The shortest word length in a review is {:,}.'.format(min(review_length_words)))
print('The average word length of reviews is {:.0f}.'.format(np.mean(review_length_words)))
print('The median word length of reviews is {:.0f}.'.format(np.median(review_length_words)))
The longest character length in a review is 11,321.
The shortest character length in a review is 32.
The average character length of reviews is 436.
The median character length of reviews is 303.
The longest word length in a review is 1,901.
The shortest word length in a review is 7.
The average word length of reviews is 80.
The median word length of reviews is 57.
A histogram of the character lengths of the review text shows that most reviews are shorter than, say, 750 characters.
# Histogram of comment lengths
plt.clf()
text_length = data['Text'].str.len()
hist = text_length.hist(bins = np.arange(0,2000,50))
plt.gca().axis('on')
plt.title('Review lengths, histogram')
plt.xlabel('text length (characters)')
plt.tight_layout()
plt.show()
Extracting some of the shortest reviews and the longest:
# Print some examples of the shortest and longest reviews.
review_and_length = [(t[:300],len(t.split())) for t in text]
short_reviews = list(filter(lambda c: c[1] < 12, review_and_length))
long_reviews = list(filter(lambda c: c[1] > 1900, review_and_length))
print('Sample short reviews:')
print(*short_reviews,sep='\n')
print('\nSample longest review (limit to 300 characters):')
print(long_reviews)
Sample short reviews:
('Way too sweet and poor texture for beef jerky.', 9)
('Great! Make it all of the time.', 7)
('These are the best tasting refried beans I have ever tasted..', 11)
('I think the word we are looking for here is DELICACY.', 11)
('Very good product. Have to admit, I added nuts and raisins.', 11)
('Incomparably better than regular tuna. My favorite for all uses.', 10)
('Great snack. Consistently crunchy. Hard to stop at one bag.', 10)
Sample longest review (limit to 300 characters):
[('Lipton Black Pearl Tea. Ahhhh. Now only a bloody fool would equate Black
Pearl with Pirates and the Caribbean, which no doubt is equated to that hack
Jerry Bruckheimer and that hack corporation with that sissy mouse. Truth be
told, the Black Pearl refers to something far darker and ominous. The ',
1901)]
The following chunk sets up the NLP model, importing the libraries for text cleaning and the model build:
# Regex
import re
# Stopwords
import nltk
nltk.download('stopwords')
# word_tokenize and WordNetLemmatizer (used below) may also need:
# nltk.download('punkt')    # 'punkt_tab' in newer NLTK releases
# nltk.download('wordnet')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.append('a')
stop_words.append('br')
stop_words.append('www')
stop_words.append('http')
# Lemmatisation
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Gensim - topic modelling, document indexing and
# similarity retrieval with large corpora.
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim_models
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
The sections below are built off existing code sourced from Kaggle on topic modelling.
Taking the raw text data and processing it into words, removing stopwords (and short words), stemming and lemmatizing:
filtered_text = []
# Here looping through the smaller 20k sample, but can extend to the full list
for t in text:
    filtered_sentence = ""
    stemmed_list = []
    lemmatized_list = []
    sentence = str(t)
    # Data Cleansing: replace punctuation with spaces
    sentence = re.sub(r'[^\w\s]', ' ', sentence)
    # Removing numbers
    sentence = re.sub(r'[0-9]', '', sentence)
    # Tokenization
    words = nltk.word_tokenize(sentence)
    # Convert the tokens into lowercase
    words = [w.lower() for w in words]
    # Stop words removal
    words = [w for w in words if w not in stop_words]
    # Stemming
    for word in words:
        stemmed_word = stemmer.stem(word)
        stemmed_list.append(stemmed_word)
    # Lemmatization (applied to the stemmed tokens)
    for s_word in stemmed_list:
        lemmatized_word = lemmatizer.lemmatize(s_word)
        lemmatized_list.append(lemmatized_word)
    # Drop tokens shorter than 3 characters
    lemmatized_list = [i for i in lemmatized_list if len(i) >= 3]
    filtered_text.append(lemmatized_list)
Reviewing the processed text shows the impact of removing stopwords etc.
Note that the data prep can lead to the loss of some descriptive data, e.g. the fifth review references ‘PB’, an abbreviation for ‘peanut butter’, but our prep phase stripped two-letter words (and digits) from the data.
print('Row count, pre:',format(len(text)))
print('-------------------------------------------')
# confirm all of the 20k reviews processed
print('Row count, post:',format(len(filtered_text)))
print('-------------------------------------------')
# sample text from 6th and 500th review, pre and post-processing
print('Pre-processed 6th review:',format(text[5]))
print('-------------------------------------------')
print('Processed 6th review:',format(filtered_text[5]))
print('-------------------------------------------')
print('Pre-processed 500th review:',format(text[499]))
print('-------------------------------------------')
print('Processed 500th review:',format(filtered_text[499]))
Row count, pre: 20000
-------------------------------------------
Row count, post: 20000
-------------------------------------------
Pre-processed 6th review: I love this tea. I am a long-time tea drinker, and
purchased tea bags for longer than I want to count (or think about). A couple
of years ago I decided to go for it...make loose tea, and deal with mess of
leaves. And yes, I did know about strainers and tea balls, but I live in no
man-s land. If it isn't prepackaged, it's unfit to ingest according to the
natives. Good utensils are hard to find.<br /><br />I finally found a
wonderful BIG mesh tea ball. Ahhh, a tea drinker's best friend, right after a
good English teapot.<br /><br />The mesh ball in question is like the one sold
here as a rule, by the way.<br /><br />Tea is now a breeze to make, and
absolutely delicious.<br /><br /> The loose packaged tea put out by Lipton is
wonderful. It has a beautiful taste, unlike their teabags, which can be very
iffy. I've had some that tasted like floor sweepings.<br /><br />This loose
tea seems amazingly consistent in flavor. I recommend it without reservation.
And isn't Amazon just lovely...? They'll bring it right to your door. How
much more could you ask?<br /><br />Earlier I mentioned an English teapot. I
favor Sadler's because of the excellent glaze, and the good material they use
in the functional brown teapots they make.<br /><br />A good teapot matters.
Get yourself one, a nice tea ball, and some Lipton Loose tea and you will
likely never use teabags again. If the tea is too strong, do it the way it's
done in England; keep a very lightly boiling kettle going, and dilute with
boiling water to your taste.
-------------------------------------------
Processed 6th review: ['love', 'tea', 'long', 'time', 'tea', 'drinker',
'purchas', 'tea', 'bag', 'longer', 'want', 'count', 'think', 'coupl', 'year',
'ago', 'decid', 'make', 'loo', 'tea', 'deal', 'mess', 'leav', 'know',
'strainer', 'tea', 'ball', 'live', 'man', 'land', 'prepackag', 'unfit',
'ingest', 'accord', 'nativ', 'good', 'utensil', 'hard', 'find', 'final',
'found', 'wonder', 'big', 'mesh', 'tea', 'ball', 'ahhh', 'tea', 'drinker',
'best', 'friend', 'right', 'good', 'english', 'teapot', 'mesh', 'ball',
'question', 'like', 'one', 'sold', 'rule', 'way', 'tea', 'breez', 'make',
'absolut', 'delici', 'loo', 'packag', 'tea', 'put', 'lipton', 'wonder',
'beauti', 'tast', 'unlik', 'teabag', 'iffi', 'tast', 'like', 'floor', 'sweep',
'loo', 'tea', 'seem', 'amazingli', 'consist', 'flavor', 'recommend', 'without',
'reserv', 'amazon', 'love', 'bring', 'right', 'door', 'much', 'could', 'ask',
'earlier', 'mention', 'english', 'teapot', 'favor', 'sadler', 'excel', 'glaze',
'good', 'materi', 'use', 'function', 'brown', 'teapot', 'make', 'good',
'teapot', 'matter', 'get', 'one', 'nice', 'tea', 'ball', 'lipton', 'loo',
'tea', 'like', 'never', 'use', 'teabag', 'tea', 'strong', 'way', 'done',
'england', 'keep', 'lightli', 'boil', 'kettl', 'dilut', 'boil', 'water',
'tast']
-------------------------------------------
Pre-processed 500th review: I remembered this cereal from my childhood and had
to buy it. It is still delicious. The first half of the cereal is made of
larger clumps of krispies with the second half of more broken up, smaller
pieces. I thought this was a result of shipping, but after buying this in a
store, it is simply how it is made.<br /><br />The cereal is closer to the
expiration date than most stores, but one person can easily eat it all before
it even comes close to being questionable. I've never had the problem of
freshness.<br /><br />This is a high quality cereal and one I truly miss. I
will be buying this again and again.
-------------------------------------------
Processed 500th review: ['rememb', 'cereal', 'childhood', 'buy', 'still',
'delici', 'first', 'half', 'cereal', 'made', 'larger', 'clump', 'krispi',
'second', 'half', 'broken', 'smaller', 'piec', 'thought', 'result', 'ship',
'buy', 'store', 'simpli', 'made', 'cereal', 'closer', 'expir', 'date', 'store',
'one', 'person', 'easili', 'eat', 'even', 'come', 'close', 'question', 'never',
'problem', 'fresh', 'high', 'qualiti', 'cereal', 'one', 'truli', 'miss', 'buy']
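One possible way to limit the loss of short abbreviations such as ‘PB’ is to expand them before the cleaning steps run. The sketch below is an assumption-laden illustration rather than part of the recipe: the `ABBREVIATIONS` lookup, the `expand_abbreviations` helper and the example sentence are all hypothetical, and the cleaning shown is a simplified stand-in for the full pipeline (no stop word removal, stemming or lemmatisation).

```python
import re

# Hypothetical lookup of short abbreviations to protect; the entries are
# illustrative and would need to be curated for the data at hand.
ABBREVIATIONS = {"pb": "peanut butter"}

def expand_abbreviations(sentence, lookup=ABBREVIATIONS):
    """Replace whole-word abbreviations, optionally followed by digits (e.g. 'PB2')."""
    def _swap(match):
        return lookup[match.group(1).lower()]
    pattern = r"\b(" + "|".join(map(re.escape, lookup)) + r")\d*\b"
    return re.sub(pattern, _swap, sentence, flags=re.IGNORECASE)

# Made-up sentence in the style of the reviews.
raw = "The best PB ever. I use PB2 in smoothies."
expanded = expand_abbreviations(raw)

# Simplified cleaning: punctuation to spaces, digits removed, lowercase,
# then keep only tokens of three or more characters.
cleaned = re.sub(r"[0-9]", "", re.sub(r"[^\w\s]", " ", expanded)).lower().split()
tokens = [w for w in cleaned if len(w) >= 3]
print(tokens)
```

In an insurance setting the same idea might apply to short medical or product abbreviations in claim file text, although the lookup would have to be built and reviewed by someone who knows the data.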
Looking at the 20 most common words by count shows that the reviews understandably contain some form of opinion - we could consider stripping out adjectives to attempt to address this. There is also some clear indication of the types of products under review, which is of interest to us.
from collections import Counter
# Print the 20 most common words across the whole corpus of reviews.
word_count = Counter([word for t in filtered_text for word in t])
print("{:<6} {:>12}".format("Word", "Count"))
print("{:<6} {:>12}".format("----", "----------"))
for word, count in word_count.most_common(20):
    print("{:<6} {:>12,}".format(word, count))
# 'most_common' is a helpful method that can be applied to Counter.
Word Count
---- ----------
like 10,155
tast 9,322
flavor 7,759
good 7,245
product 7,180
love 6,550
one 6,483
use 6,153
coffe 6,002
tri 5,994
great 5,802
food 5,568
tea 5,540
get 4,962
make 4,386
would 4,382
amazon 3,833
eat 3,801
buy 3,686
dog 3,615
Using the filtered text to create a corpus and dictionary for use in the model, where
Corpus: Is the body of text used to train the model, here the processed reviews, each stored as a bag of words i.e. a list of (word id, count) pairs.
Dictionary: Is the collection of distinct words used to train the model, here the words within the processed reviews, each mapped to an integer id.
# Create Dictionary
id2word = corpora.Dictionary(filtered_text)
# Create Corpus
texts = filtered_text
# Term Document Frequency
corpus = [id2word.doc2bow(doc) for doc in texts]
# View
# print(corpus), large
print('Dictionary, 100th word:', format(id2word[99]))
print('-------------------------------------------')
print('Length of corpus:',format(len(corpus)))
print('-------------------------------------------')
print('Readable format of corpus (term-frequency):', format([[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]))
Dictionary, 100th word: iffi
-------------------------------------------
Length of corpus: 20000
-------------------------------------------
Readable format of corpus (term-frequency): [[('afford', 1), ('alway', 1),
('best', 1), ('brand', 1), ('buy', 1), ('cheaper', 1), ('coffe', 1), ('could',
1), ('howev', 1), ('newman', 2), ('nowaday', 1), ('purchas', 1), ('realli', 1),
('would', 1)]]
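To make the (word id, count) structure of the corpus concrete, the sketch below builds the same kind of output by hand using only the standard library. It is an illustration of the idea, not gensim's implementation: the toy documents, the plain dict `token2id` and the helper `to_bow` are assumptions for this example, and the order in which ids are assigned may differ from gensim's.

```python
from collections import Counter

# Toy pre-processed documents (lists of tokens), standing in for filtered_text.
docs = [
    ["tea", "love", "tea", "ball"],
    ["cereal", "love", "cereal", "buy"],
]

# Dictionary: one integer id per distinct token, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bow_corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow_corpus)
```

Word order within each review is discarded in this representation; only the counts are passed to the topic model.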
Fitting a Latent Dirichlet Allocation (LDA) model. This method models each review as a mixture of topics and each topic as a distribution over words, with Dirichlet priors on both distributions. A further example application here.
There are a number of alternative model forms, in particular when the problem is viewed as a more standard categorisation/ classification problem where the predicted value is the label and the model is fit to pre-labelled data e.g. logistic regression. This recipe does not consider the relative suitability of the model - this could be explored further. Model parameterisation could be further refined, e.g. considering bi-grams rather than single words.
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=10,
                                            random_state=0,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
Print out the key words for each of the topics:
# Print the keywords in the topics
from pprint import pprint
pprint(lda_model.print_topics())
[(0,
'0.149*"food" + 0.107*"dog" + 0.034*"chicken" + 0.018*"ingredi" + '
'0.018*"babi" + 0.017*"bowl" + 0.017*"grain" + 0.014*"ginger" + 0.012*"eat" '
'+ 0.012*"valu"'),
(1,
'0.042*"like" + 0.028*"get" + 0.026*"good" + 0.024*"would" + 0.019*"realli" '
'+ 0.018*"eat" + 0.017*"littl" + 0.016*"well" + 0.015*"one" + 0.015*"love"'),
(2,
'0.073*"free" + 0.040*"bean" + 0.032*"gluten" + 0.028*"ice" + 0.028*"bread" '
'+ 0.025*"recip" + 0.024*"mix" + 0.023*"honey" + 0.022*"bake" + '
'0.017*"tuna"'),
(3,
'0.236*"tea" + 0.053*"green" + 0.044*"coconut" + 0.012*"clear" + '
'0.012*"extrem" + 0.011*"cherri" + 0.011*"creami" + 0.010*"tini" + '
'0.010*"stash" + 0.010*"describ"'),
(4,
'0.097*"tast" + 0.079*"flavor" + 0.060*"coffe" + 0.028*"like" + 0.026*"cup" '
'+ 0.023*"drink" + 0.022*"sweet" + 0.019*"sugar" + 0.018*"enjoy" + '
'0.016*"cooki"'),
(5,
'0.039*"buy" + 0.038*"amazon" + 0.033*"order" + 0.033*"price" + 0.027*"find" '
'+ 0.026*"store" + 0.022*"cat" + 0.020*"year" + 0.019*"purchas" + '
'0.015*"ship"'),
(6,
'0.056*"packag" + 0.037*"oil" + 0.021*"arriv" + 0.021*"prefer" + '
'0.019*"soup" + 0.018*"compani" + 0.017*"bitter" + 0.017*"ounc" + '
'0.017*"plea" + 0.017*"fill"'),
(7,
'0.124*"chocol" + 0.083*"bar" + 0.062*"butter" + 0.047*"peanut" + '
'0.035*"popcorn" + 0.026*"stop" + 0.021*"saw" + 0.020*"cover" + '
'0.017*"throw" + 0.016*"eaten"'),
(8,
'0.037*"product" + 0.034*"use" + 0.030*"tri" + 0.030*"great" + 0.023*"make" '
'+ 0.021*"love" + 0.019*"time" + 0.019*"one" + 0.018*"bag" + 0.015*"better"'),
(9,
'0.032*"mix" + 0.028*"delici" + 0.024*"add" + 0.022*"hot" + 0.022*"milk" + '
'0.019*"fresh" + 0.019*"water" + 0.019*"cook" + 0.019*"cereal" + '
'0.017*"serv"')]
doc_lda = lda_model[corpus]
Extract the topics for a sample review. This shows a review about tea for which the highest-probability topic is topic 8 (~36%), followed by topic 1 (~21%) and topic 3 (~16%). We’ll see in the wordcloud below that topic 3 might be more appropriate and that it might be possible to refine the model by removing certain words e.g. adjectives.
print('Pre-processed 6th review:',format(text[5]))
print('-------------------------------------------')
print('Processed 6th review:',format(filtered_text[5]))
print('-------------------------------------------')
print('6th review corpus:',format(corpus[5]))
print('-------------------------------------------')
print('6th review topic probability:')
lda_model.get_document_topics(corpus[5], minimum_probability=None, minimum_phi_value=None, per_word_topics=False)
Pre-processed 6th review: I love this tea. I am a long-time tea drinker, and
purchased tea bags for longer than I want to count (or think about). A couple
of years ago I decided to go for it...make loose tea, and deal with mess of
leaves. And yes, I did know about strainers and tea balls, but I live in no
man-s land. If it isn't prepackaged, it's unfit to ingest according to the
natives. Good utensils are hard to find.<br /><br />I finally found a
wonderful BIG mesh tea ball. Ahhh, a tea drinker's best friend, right after a
good English teapot.<br /><br />The mesh ball in question is like the one sold
here as a rule, by the way.<br /><br />Tea is now a breeze to make, and
absolutely delicious.<br /><br /> The loose packaged tea put out by Lipton is
wonderful. It has a beautiful taste, unlike their teabags, which can be very
iffy. I've had some that tasted like floor sweepings.<br /><br />This loose
tea seems amazingly consistent in flavor. I recommend it without reservation.
And isn't Amazon just lovely...? They'll bring it right to your door. How
much more could you ask?<br /><br />Earlier I mentioned an English teapot. I
favor Sadler's because of the excellent glaze, and the good material they use
in the functional brown teapots they make.<br /><br />A good teapot matters.
Get yourself one, a nice tea ball, and some Lipton Loose tea and you will
likely never use teabags again. If the tea is too strong, do it the way it's
done in England; keep a very lightly boiling kettle going, and dilute with
boiling water to your taste.
-------------------------------------------
Processed 6th review: ['love', 'tea', 'long', 'time', 'tea', 'drinker',
'purchas', 'tea', 'bag', 'longer', 'want', 'count', 'think', 'coupl', 'year',
'ago', 'decid', 'make', 'loo', 'tea', 'deal', 'mess', 'leav', 'know',
'strainer', 'tea', 'ball', 'live', 'man', 'land', 'prepackag', 'unfit',
'ingest', 'accord', 'nativ', 'good', 'utensil', 'hard', 'find', 'final',
'found', 'wonder', 'big', 'mesh', 'tea', 'ball', 'ahhh', 'tea', 'drinker',
'best', 'friend', 'right', 'good', 'english', 'teapot', 'mesh', 'ball',
'question', 'like', 'one', 'sold', 'rule', 'way', 'tea', 'breez', 'make',
'absolut', 'delici', 'loo', 'packag', 'tea', 'put', 'lipton', 'wonder',
'beauti', 'tast', 'unlik', 'teabag', 'iffi', 'tast', 'like', 'floor', 'sweep',
'loo', 'tea', 'seem', 'amazingli', 'consist', 'flavor', 'recommend', 'without',
'reserv', 'amazon', 'love', 'bring', 'right', 'door', 'much', 'could', 'ask',
'earlier', 'mention', 'english', 'teapot', 'favor', 'sadler', 'excel', 'glaze',
'good', 'materi', 'use', 'function', 'brown', 'teapot', 'make', 'good',
'teapot', 'matter', 'get', 'one', 'nice', 'tea', 'ball', 'lipton', 'loo',
'tea', 'like', 'never', 'use', 'teabag', 'tea', 'strong', 'way', 'done',
'england', 'keep', 'lightli', 'boil', 'kettl', 'dilut', 'boil', 'water',
'tast']
-------------------------------------------
6th review corpus: [(2, 1), (7, 1), (11, 1), (18, 4), (24, 1), (30, 1), (34,
3), (36, 3), (38, 2), (43, 1), (45, 1), (58, 3), (59, 1), (62, 2), (63, 1),
(64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 4), (71, 1), (72,
1), (73, 2), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1),
(81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 2), (87, 1), (88, 1), (89,
2), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1),
(98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1),
(106, 1), (107, 2), (108, 1), (109, 1), (110, 1), (111, 4), (112, 2), (113, 1),
(114, 1), (115, 1), (116, 1), (117, 2), (118, 1), (119, 1), (120, 1), (121, 1),
(122, 2), (123, 1), (124, 1), (125, 1), (126, 1), (127, 1), (128, 1), (129, 2),
(130, 1), (131, 1), (132, 1), (133, 1), (134, 1), (135, 1), (136, 1), (137,
13), (138, 2), (139, 4), (140, 1), (141, 1), (142, 1), (143, 1), (144, 1),
(145, 1), (146, 1), (147, 2), (148, 1), (149, 1)]
-------------------------------------------
6th review topic probability:
[(1, 0.20550294), (3, 0.16463855), (4, 0.09305331), (5, 0.073654056), (6,
0.020393608), (8, 0.35805163), (9, 0.064970925)]
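The topic probabilities above can be turned into a primary and a secondary topic by ranking them. The sketch below is one simple way to do this, not a prescribed method: the helper name `primary_and_secondary` and the `min_prob` threshold of 10% are assumptions chosen for illustration.

```python
# Topic probabilities for the 6th review, copied from the output above.
doc_topics = [(1, 0.20550294), (3, 0.16463855), (4, 0.09305331), (5, 0.073654056),
              (6, 0.020393608), (8, 0.35805163), (9, 0.064970925)]

def primary_and_secondary(topic_probs, min_prob=0.10):
    """Rank topics by probability and return (primary, secondary) topic ids.

    min_prob is an illustrative threshold: a secondary topic is only reported
    when its probability is at least this large, otherwise None is returned.
    """
    ranked = sorted(topic_probs, key=lambda pair: pair[1], reverse=True)
    primary = ranked[0][0] if ranked else None
    secondary = ranked[1][0] if len(ranked) > 1 and ranked[1][1] >= min_prob else None
    return primary, secondary

primary, secondary = primary_and_secondary(doc_topics)
print("primary topic:", primary, "| secondary topic:", secondary)
```

For claim data, the same ranking could supply a candidate primary and secondary cause per claim, with the threshold tuned so that a secondary cause is only suggested when the model gives it meaningful weight.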
Model test metrics extracted below, see this article for more on these metrics - it also covers off using coherence to optimize the number of topics modelled. Useful for model comparison.
Model perplexity: “[Y]ou can think of the perplexity metric as measuring how probable some new unseen data is given the model that was learned earlier. That is to say, how well does the model represent or reproduce the statistics of the held-out data.”
Coherence: Where there is “semantic similarity between high scoring words in the topic.”
# Compute the per-word likelihood bound (log perplexity)
print('\nPerplexity: ', lda_model.log_perplexity(corpus)) # log_perplexity returns a per-word likelihood bound: closer to zero is better.
Perplexity: -7.756790623989108
# Calculate and display the coherence score
coherence_model_lda = CoherenceModel(model=lda_model, corpus=corpus, coherence='u_mass')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Coherence Score: -3.518711645765282
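The figure printed as "Perplexity" is the value returned by gensim's `log_perplexity`, which is a per-word likelihood bound rather than the perplexity itself. The gensim documentation gives the relationship as perplexity = 2^(-bound), so a rough conversion, using the value printed above, might look as follows:

```python
# Per-word likelihood bound as printed by lda_model.log_perplexity(corpus) above.
log_perplexity_bound = -7.756790623989108

# gensim documents perplexity as 2 ** (-bound); lower perplexity is better,
# which corresponds to a bound closer to zero.
perplexity = 2 ** (-log_perplexity_bound)
print("Perplexity: {:.1f}".format(perplexity))
```

Note that the bound here was computed on the training corpus; for the "unseen data" interpretation quoted above, it would need to be evaluated on held-out reviews.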
Plotting word clouds for each of the topics:
# Visualize the topics
import matplotlib.colors as mcolors
cols = [color for name, color in mcolors.TABLEAU_COLORS.items()] # more colors: 'mcolors.XKCD_COLORS'
cloud = WordCloud(stopwords=stop_words,
                  background_color='white',
                  width=2500,
                  height=1800,
                  max_words=30,
                  colormap='tab10',
                  contour_color='steelblue',
                  color_func=lambda *args, **kwargs: cols[i])
topics = lda_model.show_topics(formatted=False)
fig, axes = plt.subplots(5, 2, figsize=(10,10), sharex=True, sharey=True)
for i, ax in enumerate(axes.flatten()):
    fig.add_subplot(ax)
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    plt.gca().imshow(cloud)
    plt.gca().set_title('Topic ' + str(i), fontdict=dict(size=16))
    plt.gca().axis('off')
plt.subplots_adjust(wspace=20, hspace=20)
plt.axis('off')
plt.margins(x=10, y=10)
plt.tight_layout()
plt.show()
It is clear from the key words and wordclouds that the grouping is reasonably effective for certain topics (like topic 4, roughly “drinks”), but not for others (like topic 1). We could consider pairing the text with existing categorisation data and training against that. Alternatively, we could strip back the tokens to remove adjectives and focus the model on nouns, since we are not that interested in sentiment.
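As a sketch of the noun-only idea, the filter below keeps tokens whose part-of-speech tag marks a noun. It assumes the tokens have already been tagged with Penn Treebank tags, in the (token, tag) format that `nltk.pos_tag` returns; the helper name `keep_nouns` and the hand-tagged example are hypothetical.

```python
def keep_nouns(tagged_tokens):
    """Keep tokens whose Penn Treebank tag marks a noun (NN, NNS, NNP, NNPS)."""
    return [token for token, tag in tagged_tokens if tag.startswith("NN")]

# Hand-tagged example in the (token, tag) format that nltk.pos_tag returns;
# in practice the tags would come from a tagger run on the un-stemmed tokens.
tagged = [("wonderful", "JJ"), ("mesh", "NN"), ("tea", "NN"), ("ball", "NN"),
          ("absolutely", "RB"), ("delicious", "JJ"), ("teapots", "NNS")]

print(keep_nouns(tagged))
```

Tagging would likely need to happen before stemming, since a tagger may not recognise stemmed forms such as 'delici' or 'tast'.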
We could also consider changing the number of topics modelled to optimize model coherence.
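A search over the number of topics could be organised along the lines below. This is a minimal sketch under stated assumptions: `select_num_topics` is a hypothetical helper, the fitting and scoring steps are passed in as functions, and the scores used to run the example are made-up values rather than model results. It also assumes a coherence measure where higher is better.

```python
def select_num_topics(candidates, fit_model, score_model):
    """Fit one model per candidate topic count and return (best_k, all_scores).

    fit_model(k) returns a fitted model; score_model(model) returns a coherence
    score where higher is better. Both callables are supplied by the caller.
    """
    scores = {k: score_model(fit_model(k)) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Stand-in callables so the sketch runs on its own. With gensim these might be:
#   fit_model   = lambda k: gensim.models.ldamodel.LdaModel(
#                     corpus=corpus, id2word=id2word, num_topics=k, random_state=0)
#   score_model = lambda m: CoherenceModel(
#                     model=m, corpus=corpus, coherence='u_mass').get_coherence()
dummy_scores = {5: -3.9, 10: -3.5, 15: -3.7}   # made-up values, not model results
best_k, scores = select_num_topics([5, 10, 15],
                                   fit_model=lambda k: k,
                                   score_model=lambda model: dummy_scores[model])
print(best_k, scores)
```

Refitting the model for each candidate can be slow on the full dataset, so a coarse grid of candidate topic counts is a reasonable starting point.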
In insurance applications, the outcome will rely upon the amount of data we have for each distinct topic e.g. the model may be able to distinguish cancers from mental illness but increasing the number of topics assessed will not necessarily lead to further refined mental illness topics.
The review text used here was chosen as it is broadly analogous to a selection of insurance claim cause texts, however there is likely to be a much greater variety in the products reviewed (and hence hard to derive broad topics) compared to the claim cause categories in insurance data.