Exploring topic analysis

Classifying text data into multiple topics with a view to extracting primary and secondary topics, amongst other applications. Could be extended to claim cause classification problems in (life) insurance.

Pat Reen https://www.linkedin.com/in/patrick-reen/
2022-09-16

Background and application

…How do claim cause categories change over the lifetime of a claim? Could we distinguish between primary and secondary causes? Could we automate the allocation of a claim to a cause category and does this help limit the operational risk of mis-classification?…

Most (life) insurers have some form of categorisation of claims into cause. This is often an input from the claims assessor into the admin system. Some businesses categorise claims into primary and secondary causes.

This analysis considers methods of modelling topics or categories based upon the underlying text. This is useful where the classifications do not exist or we want to consider different classifications e.g. evolving claim drivers or secondary causes. Here we use customer review data, but the approach could be applied to claim file text in an insurance context.

The approach could be extended by pairing raw cause text with existing cause mappings i.e. the models could be defined to learn the categories based upon existing classifications - that form of the problem is a more standard categorisation problem. Categorisation problems of that type are described in the recipes in the Actuaries’ Analytical Cookbook, Natural Language Processing sections.
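
As a hedged illustration of that supervised framing (not part of this recipe), the sketch below assumes a labelled dataset with hypothetical columns 'cause_text' and 'cause_code' and fits a simple TF-IDF plus logistic regression classifier with scikit-learn:

# Minimal sketch of the supervised framing: learn existing cause categories
# from raw cause text. 'claims_causes.csv', 'cause_text' and 'cause_code'
# are hypothetical placeholders, not data used elsewhere in this recipe.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

claims = pd.read_csv('claims_causes.csv')

X_train, X_test, y_train, y_test = train_test_split(
    claims['cause_text'], claims['cause_code'], test_size=0.2, random_state=0)

clf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('model', LogisticRegression(max_iter=1000))
])
clf.fit(X_train, y_train)
print('Hold-out accuracy:', clf.score(X_test, y_test))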

The methods below are not exhaustive but are rather intended as illustrative examples of topic modelling concepts to stimulate thinking. The articles in the further reading section expand on some of these thoughts.

Further reading

A few articles of interest:

Libraries

Setting up the environment and listing some of the packages used in the recipe. For more on calling Python from within RStudio/Posit, see further reading.

# calling python from r
library(reticulate) 

# create environment if does not exist
# conda_create("r-reticulate") 
# py_config() # to check configuration

# activate environment
use_condaenv("r-reticulate", required=TRUE) # set environment before running any Python chunks

# if not already installed, install the below. if env specified, can drop envname parameter

# py_install("pandas",envname = "r-reticulate")
# py_install("numpy",envname = "r-reticulate")
# ...etc

Libraries:


# import some standard libraries
import pandas as pd
import numpy as np
import random

Data

Extraction

The sections below analyse customer review data sourced from Kaggle: Amazon Customer Reviews.

Real claims data would not be appropriate for publication; this anonymised, publicly available review dataset is nevertheless useful for illustrative purposes here.

Extracting a sample of 20k reviews from the set. Of interest are the ‘Summary’ and ‘Text’ columns as these contain the review text. The analysis is conducted on the ‘Text’ column.

Show code
# full dataset saved as reviews_clean
# data_full = pd.read_csv('reviews_clean.csv')
# len(data_full)

# we're taking a sample from the dataset, already extracted to reviews_sample
# data = data_full.sample(n=20000)
# data.to_csv(r'.\reviews_sample.csv', index = False, header=True)

# import sample
data = pd.read_csv('reviews_sample.csv')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 150)
print('Data head')
print('-------------------------------------------')
data.head()
print('-------------------------------------------')
Data head
-------------------------------------------
       Id   ProductId          UserId  HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
0  128563  B007L3NVKU  A2CH00OW75H2OL                     0                       0      5  1348272000   
1  427149  B001PAS5GK  A3923DNTARI2V1                     1                       2      5  1265328000   
2  513073  B001E6EE4C   AR676SSCX7JFH                     0                       0      5  1258416000   
3  358466  B005BYP7RG   AEC0I4XOMJJ72                     2                       2      5  1338249600   
4  495680  B0098WV8F2   AWZ84TQT9AI2Z                     0                       0      5  1333065600   

                                             Summary                                               Text  
0                                       Kentuckygirl  Newman's is really the best, however, I buy th...  
1                                  great transaction  Fast delivery, very well priced and defintely ...  
2  Great cereal!  More fiber, without the disappo...  This cereal is absolutely fantastic...best fib...  
3                          Fresh, tasty & Good Price  These tasted very good and the best price I ca...  
4                                         PB2 rocks!  The best PB ever. I use PB2 to make protein sh...  
-------------------------------------------

Extracting column names:

Show code
print('Column names')
print('-------------------------------------------')
# List out column names
list(data.columns)
Column names
-------------------------------------------
['Id', 'ProductId', 'UserId', 'HelpfulnessNumerator', 'HelpfulnessDenominator',
'Score', 'Time', 'Summary', 'Text']

The review columns are converted to lists for processing.

# Convert comment/ text column to list
text = data['Text'].values.tolist()
text_summary = data['Summary'].values.tolist()

Printing a sample of 5 reviews:

Show code
review_sample = data.sample(5)
  # DataFrame.sample selects, without replacement, the number of rows
  # you want from the DataFrame.

for i in range(len(review_sample)):
  print(review_sample.iloc[i,:])
Id                                                                   296872
ProductId                                                        B0002MLAEQ
UserId                                                       A1DC1O4VX6AHPP
HelpfulnessNumerator                                                      6
HelpfulnessDenominator                                                    6
Score                                                                     1
Time                                                             1312070400
Summary                               frequent problem batches of this food
Text                      We were purchasing this food regularly for our...
Name: 7597, dtype: object
Id                                                                   381854
ProductId                                                        B002UMD9KO
UserId                                                       A2MUGFV2TDQ47K
HelpfulnessNumerator                                                      2
HelpfulnessDenominator                                                    3
Score                                                                     1
Time                                                             1305331200
Summary                               Tang Energy FROZEN Beverage Drink Mix
Text                      The item description is incorrect. The actual ...
Name: 8491, dtype: object
Id                                                                    54739
ProductId                                                        B0040XD4RO
UserId                                                       A1FBJL96M6T248
HelpfulnessNumerator                                                      1
HelpfulnessDenominator                                                    1
Score                                                                     5
Time                                                             1316390400
Summary                                             The elusive Bran Flakes
Text                      I am delighted with my purchase. It used to be...
Name: 7830, dtype: object
Id                                                                     4485
ProductId                                                        B001EHDMY4
UserId                                                       A1YELVCOA6WEDQ
HelpfulnessNumerator                                                      3
HelpfulnessDenominator                                                    3
Score                                                                     5
Time                                                             1261526400
Summary                                                            Best Tea
Text                      I love this tea.  I have issues with added flo...
Name: 2294, dtype: object
Id                                                                   257695
ProductId                                                        B000B1X5IM
UserId                                                       A3F411JNZ4LTUX
HelpfulnessNumerator                                                      0
HelpfulnessDenominator                                                    0
Score                                                                     4
Time                                                             1318896000
Summary                                                     Nice oilive oil
Text                      Purchased this because our daughter traveled t...
Name: 14824, dtype: object

Topic modelling

What is topic modelling?

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.

Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: “dog” and “bone” will appear more often in documents about dogs, “cat” and “meow” will appear in documents about cats, and “the” and “is” will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words.

The “topics” produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is. It involves various techniques of dimensionality reduction (mostly non-linear) and unsupervised learning like LDA, SVD, autoencoders etc.

Source: Wikipedia

Wordcloud

Below is an exploratory word cloud of the words in the review text:

from wordcloud import WordCloud, STOPWORDS
from matplotlib import pyplot as plt
import matplotlib.colors as mcolors

cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]

cloud = WordCloud(
    background_color='white',
    stopwords=set(STOPWORDS),
    max_words=200,
    colormap='tab10',
    contour_color='steelblue',
    max_font_size=40, 
    random_state=42
)

cloud.generate(' '.join(data['Text'].astype(str)))  # join all review text into a single string
plt.gca().imshow(cloud)
plt.gca().axis('off')
plt.margins(x=10, y=10)
plt.tight_layout()

plt.savefig('img/prev_topic.png', bbox_inches='tight') # or use plt.show()

Review lengths

A few summary stats on the length of the reviews and words within the reviews using code from the Actuaries Institute Cookbook, NLP section:

Show code
# From the NLP recipe section...

# Check the length of different reviews.
# A list constructor is used to produce a list of how long each review is
# in characters. 
review_length_characters = [len(t) for t in text]

# Print summary statistics for the number of characters in each review.
print('The longest character length in a review is {:,}.'.format(max(review_length_characters)))
print('The shortest character length in a review is {:,}.'.format(min(review_length_characters)))
print('The average character length of reviews is {:.0f}.'.format(np.mean(review_length_characters)))
print('The median character length of reviews is {:.0f}.'.format(np.median(review_length_characters)))
print()

# A list constructor is used to produce a list of how long each review is
# in words.
review_length_words = [len(t.split()) for t in text]
# The str.split() function breaks a string by approximate word breaks. 

## Print summary statistics for the number of words in each review.
print('The longest word length in a review is {:,}.'.format(max(review_length_words)))
print('The shortest word length in a review is {:,}.'.format(min(review_length_words)))
print('The average word length of reviews is {:.0f}.'.format(np.mean(review_length_words)))
print('The median word length of reviews is {:.0f}.'.format(np.median(review_length_words)))
The longest character length in a review is 11,321.
The shortest character length in a review is 32.
The average character length of reviews is 436.
The median character length of reviews is 303.

The longest word length in a review is 1,901.
The shortest word length in a review is 7.
The average word length of reviews is 80.
The median word length of reviews is 57.

A histogram of the character lengths of the review text shows that most reviews are shorter than, say, 750 characters.

# Histogram of comment lengths
plt.clf()
text_length = data['Text'].str.len()
hist = text_length.hist(bins = np.arange(0,2000,50))

plt.gca().axis('on')
plt.title('Review lengths, histogram')
plt.xlabel('text length (characters)')
plt.tight_layout()
plt.show()

Extracting some of the shortest reviews and the longest:

Show code
# Print some examples of the shortest and longest reviews.
review_and_length = [(t[:300],len(t.split())) for t in text]
short_reviews = list(filter(lambda c: c[1] < 12, review_and_length))
long_reviews = list(filter(lambda c: c[1] > 1900, review_and_length))
print('Sample short reviews:')
print(*short_reviews,sep='\n')
print('\nSample longest review (limit to 300 characters):')
print(long_reviews)
Sample short reviews:
('Way too sweet and poor texture for beef jerky.', 9)
('Great!  Make it all of the time.', 7)
('These are the best tasting refried beans I have ever tasted..', 11)
('I think the word we are looking for here is DELICACY.', 11)
('Very good product.  Have to admit, I added nuts and raisins.', 11)
('Incomparably better than regular tuna.  My favorite for all uses.', 10)
('Great snack.  Consistently crunchy.  Hard to stop at one bag.', 10)

Sample longest review (limit to 300 characters):
[('Lipton Black Pearl Tea.  Ahhhh.  Now only a bloody fool would equate Black
Pearl with Pirates and the Caribbean, which no doubt is equated to that hack
Jerry Bruckheimer and that hack corporation with that sissy mouse.  Truth be
told, the Black Pearl refers to something far darker and ominous.  The ',
1901)]

Model preamble

The following chunk sets up the NLP model, importing the libraries used for text cleaning and the model build:

# Regex
import re

# Stopwords and tokeniser data
import nltk
nltk.download('stopwords')
nltk.download('punkt')  # required by nltk.word_tokenize used below
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.append('a')
stop_words.append('br')
stop_words.append('www')
stop_words.append('http')

# Lemmatisation
nltk.download('wordnet')  # required by WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Stemming 
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

# Gensim -  topic modelling, document indexing and 
#           similarity retrieval with large corpora.
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim_models  

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

Text processing

The sections below are built off existing code sourced from Kaggle on topic modelling.

Taking the raw text data and processing it into words, removing stopwords (and short words), stemming and lemmatizing:

filtered_text = []     

# Here looping through the smaller 20k sample, but can extend to the full list
for t in text:
  
  filtered_sentence = ""
  stemmed_list = []
  lemmatized_list = []
  
  sentence = str(t)
  
  # Data Cleansing
  sentence = re.sub(r'[^\w\s]', ' ', sentence)
  
  # Removing numbers
  sentence = re.sub(r'[0-9]', '', sentence)
  
  # Tokenization
  words = nltk.word_tokenize(sentence)
  
  # Convert the tokens into lowercase: lower_tokens
  words = [w.lower() for w in words]
  
  # Stop words removal
  words = [w for w in words if not w in stop_words]
  
  # Stemming
  for word in words:
    stemmed_word = stemmer.stem(word)
    stemmed_list.append(stemmed_word)
  
  # Lemmatization
  for s_word in stemmed_list:
    lemmatized_word = lemmatizer.lemmatize(s_word)
    lemmatized_list.append(lemmatized_word)
  
  lemmatized_list = [i for i in lemmatized_list if len(i) >= 3]
  
  filtered_text.append(lemmatized_list) 

Review processed text

Reviewing the processed text shows the impact of removing stopwords etc.

Note that the data prep can lead to the loss of some descriptive data: for example, comment 5 referenced ‘PB’, an abbreviation for ‘peanut butter’, but the prep phase stripped out words of fewer than three letters.
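
One possible (hedged) workaround is to expand known domain abbreviations before the length filter is applied; the abbreviation dictionary below is purely illustrative:

# Expand known short abbreviations before filtering out words of fewer than
# three letters; the mapping below is illustrative only.
abbreviations = {'pb': 'peanut butter', 'pb2': 'powdered peanut butter'}

def expand_abbreviations(tokens):
    expanded = []
    for tok in tokens:
        if tok in abbreviations:
            expanded.extend(abbreviations[tok].split())
        else:
            expanded.append(tok)
    return expanded

# e.g. expand_abbreviations(['love', 'pb2', 'shake'])
# -> ['love', 'powdered', 'peanut', 'butter', 'shake']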

Show code
print('Row count, pre:',format(len(text)))
print('-------------------------------------------')
# confirm all of the 20k reviews processed
print('Row count, post:',format(len(filtered_text)))
print('-------------------------------------------')

# sample text from 6th and 500th review, pre and post-processing
print('Pre-processed 6th review:',format(text[5]))
print('-------------------------------------------')
print('Processed 6th review:',format(filtered_text[5]))
print('-------------------------------------------')
print('Pre-processed 500th review:',format(text[499]))
print('-------------------------------------------')
print('Processed 500th review:',format(filtered_text[499]))
Row count, pre: 20000
-------------------------------------------
Row count, post: 20000
-------------------------------------------
Pre-processed 6th review: I love this tea.  I am a long-time tea drinker, and
purchased tea bags for longer than I want to count (or think about).  A couple
of years ago I decided to go for it...make loose tea, and deal with mess of
leaves.  And yes, I did know about strainers and tea balls, but I live in no
man-s land.  If it isn't prepackaged, it's unfit to ingest according to the
natives.  Good utensils are hard to find.<br /><br />I finally found a
wonderful BIG mesh tea ball.  Ahhh, a tea drinker's best friend, right after a
good English teapot.<br /><br />The mesh ball in question is like the one sold
here as a rule, by the way.<br /><br />Tea is now a breeze to make, and
absolutely delicious.<br /><br /> The loose packaged tea put out by Lipton is
wonderful.  It has a beautiful taste, unlike their teabags, which can be very
iffy.  I've had some that tasted like floor sweepings.<br /><br />This loose
tea seems amazingly consistent in flavor.  I recommend it without reservation.
And isn't Amazon just lovely...?  They'll bring it right to your door.  How
much more could you ask?<br /><br />Earlier I mentioned an English teapot.  I
favor Sadler's because of the excellent glaze, and the good material they use
in the functional brown teapots they make.<br /><br />A good teapot matters.
Get yourself one, a nice tea ball, and some Lipton Loose tea and you will
likely never use teabags again.  If the tea is too strong, do it the way it's
done in England; keep a very lightly boiling kettle going, and dilute with
boiling water to your taste.
-------------------------------------------
Processed 6th review: ['love', 'tea', 'long', 'time', 'tea', 'drinker',
'purchas', 'tea', 'bag', 'longer', 'want', 'count', 'think', 'coupl', 'year',
'ago', 'decid', 'make', 'loo', 'tea', 'deal', 'mess', 'leav', 'know',
'strainer', 'tea', 'ball', 'live', 'man', 'land', 'prepackag', 'unfit',
'ingest', 'accord', 'nativ', 'good', 'utensil', 'hard', 'find', 'final',
'found', 'wonder', 'big', 'mesh', 'tea', 'ball', 'ahhh', 'tea', 'drinker',
'best', 'friend', 'right', 'good', 'english', 'teapot', 'mesh', 'ball',
'question', 'like', 'one', 'sold', 'rule', 'way', 'tea', 'breez', 'make',
'absolut', 'delici', 'loo', 'packag', 'tea', 'put', 'lipton', 'wonder',
'beauti', 'tast', 'unlik', 'teabag', 'iffi', 'tast', 'like', 'floor', 'sweep',
'loo', 'tea', 'seem', 'amazingli', 'consist', 'flavor', 'recommend', 'without',
'reserv', 'amazon', 'love', 'bring', 'right', 'door', 'much', 'could', 'ask',
'earlier', 'mention', 'english', 'teapot', 'favor', 'sadler', 'excel', 'glaze',
'good', 'materi', 'use', 'function', 'brown', 'teapot', 'make', 'good',
'teapot', 'matter', 'get', 'one', 'nice', 'tea', 'ball', 'lipton', 'loo',
'tea', 'like', 'never', 'use', 'teabag', 'tea', 'strong', 'way', 'done',
'england', 'keep', 'lightli', 'boil', 'kettl', 'dilut', 'boil', 'water',
'tast']
-------------------------------------------
Pre-processed 500th review: I remembered this cereal from my childhood and had
to buy it.  It is still delicious.  The first half of the cereal is made of
larger clumps of krispies with the second half of more broken up, smaller
pieces.  I thought this was a result of shipping, but after buying this in a
store, it is simply how it is made.<br /><br />The cereal is closer to the
expiration date than most stores, but one person can easily eat it all before
it even comes close to being questionable.  I've never had the problem of
freshness.<br /><br />This is a high quality cereal and one I truly miss.  I
will be buying this again and again.
-------------------------------------------
Processed 500th review: ['rememb', 'cereal', 'childhood', 'buy', 'still',
'delici', 'first', 'half', 'cereal', 'made', 'larger', 'clump', 'krispi',
'second', 'half', 'broken', 'smaller', 'piec', 'thought', 'result', 'ship',
'buy', 'store', 'simpli', 'made', 'cereal', 'closer', 'expir', 'date', 'store',
'one', 'person', 'easili', 'eat', 'even', 'come', 'close', 'question', 'never',
'problem', 'fresh', 'high', 'qualiti', 'cereal', 'one', 'truli', 'miss', 'buy']

Most common words

Looking at the 20 most common words by count shows that the reviews understandably contain some form of opinion - we could consider stripping out adjectives to attempt to address this. There is also some clear indication of the types of products under review, which is of interest to us.

from collections import Counter
# Print the 20 most common words across the whole corpus of complaints.
word_count = Counter([word for t in filtered_text for word in t])

print("{:<6} {:>12}".format("Word", "Count"))
print("{:<6} {:>12}".format("----", "----------"))
for word, count in word_count.most_common(20): print("{:<6} {:>12,}".format(word, count))
# 'most_common' is a helpful method that can be applied to Counter.
Word          Count
----     ----------
like         10,155
tast          9,322
flavor        7,759
good          7,245
product        7,180
love          6,550
one           6,483
use           6,153
coffe         6,002
tri           5,994
great         5,802
food          5,568
tea           5,540
get           4,962
make          4,386
would         4,382
amazon        3,833
eat           3,801
buy           3,686
dog           3,615

Dictionary, corpus and model

Using the filtered text to create a dictionary and corpus for use in the model, where the dictionary maps each unique token to an integer id and the corpus represents each review as a bag-of-words list of (token id, frequency) pairs:

Show code
# Create Dictionary
id2word = corpora.Dictionary(filtered_text)

# Create Corpus
texts = filtered_text

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
# print(corpus), large
print('Dictionary, 100th word:', format(id2word[99]))
print('-------------------------------------------')
print('Length of corpus:',format(len(corpus)))
print('-------------------------------------------')
print('Readable format of corpus (term-frequency):', format([[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]))
Dictionary, 100th word: iffi
-------------------------------------------
Length of corpus: 20000
-------------------------------------------
Readable format of corpus (term-frequency): [[('afford', 1), ('alway', 1),
('best', 1), ('brand', 1), ('buy', 1), ('cheaper', 1), ('coffe', 1), ('could',
1), ('howev', 1), ('newman', 2), ('nowaday', 1), ('purchas', 1), ('realli', 1),
('would', 1)]]

Fitting a Latent Dirichlet Allocation (LDA) model. This method models each review as a mixture of topics and each topic as a distribution over words, with Dirichlet priors on both. A further example application here.

There are a number of alternative model forms, in particular when the problem is viewed as a more standard categorisation/ classification problem where the predicted value is the label and the model is fit to pre-labelled data e.g. logistic regression. This recipe does not consider the relative suitability of the model - this could be explored further. Model parameterisation could be further refined, e.g. considering bi-grams rather than single words.
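
As a hedged sketch of the bi-gram refinement mentioned above (not applied in the model below), gensim’s Phrases/Phraser can join frequently co-occurring tokens into single terms; the min_count and threshold values are illustrative and would need tuning:

# Join frequently co-occurring tokens (e.g. 'peanut' + 'butter' -> 'peanut_butter')
# before building the dictionary and corpus; parameters are illustrative.
from gensim.models import Phrases
from gensim.models.phrases import Phraser

bigram = Phrases(filtered_text, min_count=20, threshold=10)
bigram_phraser = Phraser(bigram)
filtered_text_bigrams = [bigram_phraser[doc] for doc in filtered_text]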

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=10, 
                                           random_state=0,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

Print out the key words for each of the topics:

Show code
# Print the keywords in the topics
from pprint import pprint
pprint(lda_model.print_topics())
[(0,
  '0.149*"food" + 0.107*"dog" + 0.034*"chicken" + 0.018*"ingredi" + '
  '0.018*"babi" + 0.017*"bowl" + 0.017*"grain" + 0.014*"ginger" + 0.012*"eat" '
  '+ 0.012*"valu"'),
 (1,
  '0.042*"like" + 0.028*"get" + 0.026*"good" + 0.024*"would" + 0.019*"realli" '
  '+ 0.018*"eat" + 0.017*"littl" + 0.016*"well" + 0.015*"one" + 0.015*"love"'),
 (2,
  '0.073*"free" + 0.040*"bean" + 0.032*"gluten" + 0.028*"ice" + 0.028*"bread" '
  '+ 0.025*"recip" + 0.024*"mix" + 0.023*"honey" + 0.022*"bake" + '
  '0.017*"tuna"'),
 (3,
  '0.236*"tea" + 0.053*"green" + 0.044*"coconut" + 0.012*"clear" + '
  '0.012*"extrem" + 0.011*"cherri" + 0.011*"creami" + 0.010*"tini" + '
  '0.010*"stash" + 0.010*"describ"'),
 (4,
  '0.097*"tast" + 0.079*"flavor" + 0.060*"coffe" + 0.028*"like" + 0.026*"cup" '
  '+ 0.023*"drink" + 0.022*"sweet" + 0.019*"sugar" + 0.018*"enjoy" + '
  '0.016*"cooki"'),
 (5,
  '0.039*"buy" + 0.038*"amazon" + 0.033*"order" + 0.033*"price" + 0.027*"find" '
  '+ 0.026*"store" + 0.022*"cat" + 0.020*"year" + 0.019*"purchas" + '
  '0.015*"ship"'),
 (6,
  '0.056*"packag" + 0.037*"oil" + 0.021*"arriv" + 0.021*"prefer" + '
  '0.019*"soup" + 0.018*"compani" + 0.017*"bitter" + 0.017*"ounc" + '
  '0.017*"plea" + 0.017*"fill"'),
 (7,
  '0.124*"chocol" + 0.083*"bar" + 0.062*"butter" + 0.047*"peanut" + '
  '0.035*"popcorn" + 0.026*"stop" + 0.021*"saw" + 0.020*"cover" + '
  '0.017*"throw" + 0.016*"eaten"'),
 (8,
  '0.037*"product" + 0.034*"use" + 0.030*"tri" + 0.030*"great" + 0.023*"make" '
  '+ 0.021*"love" + 0.019*"time" + 0.019*"one" + 0.018*"bag" + 0.015*"better"'),
 (9,
  '0.032*"mix" + 0.028*"delici" + 0.024*"add" + 0.022*"hot" + 0.022*"milk" + '
  '0.019*"fresh" + 0.019*"water" + 0.019*"cook" + 0.019*"cereal" + '
  '0.017*"serv"')]
Show code
doc_lda = lda_model[corpus]

Extract the topics for a sample comment. This shows a comment about tea that is attributed mostly to topic 8 (~36% probability, a generic ‘product/use’ topic), with topic 1 at ~21% and topic 3 at ~16%. We’ll see in the wordclouds below that topic 3 might be more appropriate and that it might be possible to refine the model by removing certain words e.g. adjectives.

Show code
print('Pre-processed 6th review:',format(text[5]))
print('-------------------------------------------')
print('Processed 6th review:',format(filtered_text[5]))
print('-------------------------------------------')
print('6th review corpus:',format(corpus[5]))
print('-------------------------------------------')
print('6th review topic probability:')
lda_model.get_document_topics(corpus[5], minimum_probability=None, minimum_phi_value=None, per_word_topics=False)
Pre-processed 6th review: I love this tea.  I am a long-time tea drinker, and
purchased tea bags for longer than I want to count (or think about).  A couple
of years ago I decided to go for it...make loose tea, and deal with mess of
leaves.  And yes, I did know about strainers and tea balls, but I live in no
man-s land.  If it isn't prepackaged, it's unfit to ingest according to the
natives.  Good utensils are hard to find.<br /><br />I finally found a
wonderful BIG mesh tea ball.  Ahhh, a tea drinker's best friend, right after a
good English teapot.<br /><br />The mesh ball in question is like the one sold
here as a rule, by the way.<br /><br />Tea is now a breeze to make, and
absolutely delicious.<br /><br /> The loose packaged tea put out by Lipton is
wonderful.  It has a beautiful taste, unlike their teabags, which can be very
iffy.  I've had some that tasted like floor sweepings.<br /><br />This loose
tea seems amazingly consistent in flavor.  I recommend it without reservation.
And isn't Amazon just lovely...?  They'll bring it right to your door.  How
much more could you ask?<br /><br />Earlier I mentioned an English teapot.  I
favor Sadler's because of the excellent glaze, and the good material they use
in the functional brown teapots they make.<br /><br />A good teapot matters.
Get yourself one, a nice tea ball, and some Lipton Loose tea and you will
likely never use teabags again.  If the tea is too strong, do it the way it's
done in England; keep a very lightly boiling kettle going, and dilute with
boiling water to your taste.
-------------------------------------------
Processed 6th review: ['love', 'tea', 'long', 'time', 'tea', 'drinker',
'purchas', 'tea', 'bag', 'longer', 'want', 'count', 'think', 'coupl', 'year',
'ago', 'decid', 'make', 'loo', 'tea', 'deal', 'mess', 'leav', 'know',
'strainer', 'tea', 'ball', 'live', 'man', 'land', 'prepackag', 'unfit',
'ingest', 'accord', 'nativ', 'good', 'utensil', 'hard', 'find', 'final',
'found', 'wonder', 'big', 'mesh', 'tea', 'ball', 'ahhh', 'tea', 'drinker',
'best', 'friend', 'right', 'good', 'english', 'teapot', 'mesh', 'ball',
'question', 'like', 'one', 'sold', 'rule', 'way', 'tea', 'breez', 'make',
'absolut', 'delici', 'loo', 'packag', 'tea', 'put', 'lipton', 'wonder',
'beauti', 'tast', 'unlik', 'teabag', 'iffi', 'tast', 'like', 'floor', 'sweep',
'loo', 'tea', 'seem', 'amazingli', 'consist', 'flavor', 'recommend', 'without',
'reserv', 'amazon', 'love', 'bring', 'right', 'door', 'much', 'could', 'ask',
'earlier', 'mention', 'english', 'teapot', 'favor', 'sadler', 'excel', 'glaze',
'good', 'materi', 'use', 'function', 'brown', 'teapot', 'make', 'good',
'teapot', 'matter', 'get', 'one', 'nice', 'tea', 'ball', 'lipton', 'loo',
'tea', 'like', 'never', 'use', 'teabag', 'tea', 'strong', 'way', 'done',
'england', 'keep', 'lightli', 'boil', 'kettl', 'dilut', 'boil', 'water',
'tast']
-------------------------------------------
6th review corpus: [(2, 1), (7, 1), (11, 1), (18, 4), (24, 1), (30, 1), (34,
3), (36, 3), (38, 2), (43, 1), (45, 1), (58, 3), (59, 1), (62, 2), (63, 1),
(64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 4), (71, 1), (72,
1), (73, 2), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1),
(81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 2), (87, 1), (88, 1), (89,
2), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1),
(98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1),
(106, 1), (107, 2), (108, 1), (109, 1), (110, 1), (111, 4), (112, 2), (113, 1),
(114, 1), (115, 1), (116, 1), (117, 2), (118, 1), (119, 1), (120, 1), (121, 1),
(122, 2), (123, 1), (124, 1), (125, 1), (126, 1), (127, 1), (128, 1), (129, 2),
(130, 1), (131, 1), (132, 1), (133, 1), (134, 1), (135, 1), (136, 1), (137,
13), (138, 2), (139, 4), (140, 1), (141, 1), (142, 1), (143, 1), (144, 1),
(145, 1), (146, 1), (147, 2), (148, 1), (149, 1)]
-------------------------------------------
6th review topic probability:
[(1, 0.20550294), (3, 0.16463855), (4, 0.09305331), (5, 0.073654056), (6,
0.020393608), (8, 0.35805163), (9, 0.064970925)]

Model test metrics are extracted below; see this article for more on these metrics - it also covers using coherence to optimize the number of topics modelled (a sketch of that idea follows the metrics). These metrics are useful for model comparison.

# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Calculating and displaying the coherence score
coherence_model_lda = CoherenceModel(model=lda_model, corpus=corpus, coherence='u_mass')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Perplexity:  -7.756790623989108

Coherence Score:  -3.518711645765282
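
A hedged sketch of using coherence to choose the number of topics, as referenced above; the candidate grid and settings are illustrative, and re-fitting the model several times on 20k reviews will take some time:

# Fit an LDA model for several candidate topic counts and compare u_mass
# coherence; the least negative (highest) score suggests a reasonable choice.
coherence_by_k = {}
for k in [5, 10, 15, 20]:
    model_k = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word,
                                              num_topics=k, random_state=0, passes=10)
    cm = CoherenceModel(model=model_k, corpus=corpus, coherence='u_mass')
    coherence_by_k[k] = cm.get_coherence()

print(coherence_by_k)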

Plotting word clouds for each of the topics:

Show code
# Visualize the topics
import matplotlib.colors as mcolors

cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]  # more colors: 'mcolors.XKCD_COLORS'

cloud = WordCloud(stopwords=stop_words,
                  background_color='white',
                  width=2500,
                  height=1800,
                  max_words=30,
                  colormap='tab10',
                  contour_color='steelblue',
                  color_func=lambda *args, **kwargs: cols[i])

topics = lda_model.show_topics(formatted=False)

fig, axes = plt.subplots(5, 2, figsize=(10,10), sharex=True, sharey=True)

for i, ax in enumerate(axes.flatten()):
    fig.add_subplot(ax)
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    plt.gca().imshow(cloud)
    plt.gca().set_title('Topic ' + str(i), fontdict=dict(size=16))
    plt.gca().axis('off')
Show code
plt.subplots_adjust(wspace=20, hspace=20)
plt.axis('off')
Show code
plt.margins(x=10, y=10)
plt.tight_layout()
plt.show()

Observations

It is clear from the key words and wordclouds that there is somewhat effective grouping for certain topics (like topic 4, roughly “drinks”), but not for others (like topic 1). We could consider pairing with existing categorisation data and training against that. Alternatively, we could strip back the tokens to remove adjectives and focus the model on nouns - we are not that interested in sentiment.
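
A hedged sketch of that noun-only idea, using nltk’s part-of-speech tagger on the raw tokens before stemming; the tagger data download and the example output are assumptions:

# Keep only tokens tagged as nouns (NN, NNS, NNP, NNPS); applied before
# stemming, since the tagger works better on unstemmed words.
nltk.download('averaged_perceptron_tagger')

def keep_nouns(tokens):
    return [w for w, tag in nltk.pos_tag(tokens) if tag.startswith('NN')]

# e.g. keep_nouns(['love', 'tea', 'long', 'time', 'drinker']) might return
# ['tea', 'time', 'drinker'], depending on the tagger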

We could also consider changing the number of topics modelled, optimizing model coherence.

In insurance applications, the outcome will rely upon the amount of data we have for each distinct topic e.g. the model may be able to distinguish cancers from mental illness but increasing the number of topics assessed will not necessarily lead to further refined mental illness topics.

The review text used here was chosen as it is broadly analogous to a selection of insurance claim cause texts; however, there is likely to be much greater variety in the products reviewed (making it harder to derive broad topics) than in the claim cause categories in insurance data.