Topic Modeling Company Reviews with LDA¶
Surveys and open-ended feedback are among the many data types and datasets that we may come into contact with as I/Os. Whether it's the open-ended section of an annual engagement survey, feedback from annual reviews, or customer feedback, the text that is provided is often difficult to do much with at scale. However, there are unsupervised machine learning methods that give us a glimpse into how to make sense of this data. In the previous article I worked through how we might use LSA to accomplish the task of topic modeling, along with a brief look at the data we'll be using today and a background on turning words into vectors. If you would like more detail on word vectorization, or on processing the data to be used by LDA, I'd refer you to the article linked above.
For this article I'll walk through another topic modeling technique known as Latent Dirichlet Allocation (LDA). As a reminder, the three techniques covered in this series are:
- Singular Value Decomposition (SVD), which Latent Semantic Analysis (LSA) is based on.
- Latent Dirichlet Allocation (LDA)
- Cluster Analysis (K-Means)
We will again examine the Cons from the Glassdoor Reviews of retailers we extracted in an earlier article to compare to what we found with LSA.
import numpy as np
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter("ignore", category=PendingDeprecationWarning)
import seaborn as sns
import pandas as pd
from sklearn.feature_extraction import stop_words
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# load the data
df = pd.read_csv('data/glassdoor_data.csv')
df['cons'] = df['cons'].astype(str)
LDA¶
Latent Dirichlet Allocation is another method for topic modeling: a "generative probabilistic model" in which the topic probabilities provide an explicit representation of the total response set. The first publication applying LDA to machine learning came from a few of the biggest names in the field: David Blei, Andrew Ng, and the Michael Jordan....oh, you know a different famous Michael Jordan besides the computer scientist from UC-Berkeley? Here is the original article.
Discuss the background of LDA in simple terms.¶
I think the original article does a good job of outlining the basic premise of LDA, but I'll attempt to go a bit deeper. The key idea, paraphrased from the original article, is that:
- Documents are represented as random mixtures over latent topics.
- Each latent topic is characterized by a distribution over words.
By specifying a set number of topics, we derive a latent layer. This latent layer acts as an intermediary: words connect to the latent topics, and the latent topics connect to the documents, or in this case the individual responses.
For those interested in a deeper conceptual understanding of how LDA works, I'd recommend this article, from which I borrowed this image; I feel it does a nice job of helping to visualize the document-to-latent-topic and word-to-latent-topic relationships.
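To make this concrete, here is a minimal toy sketch of the generative story LDA assumes. Note that the vocabulary, topic labels, and probabilities below are made up purely for illustration and are not drawn from our Glassdoor data.
import numpy as np
rng = np.random.default_rng(0)
# Toy vocabulary and two made-up topics; each topic is a distribution over words (rows sum to 1)
toy_vocab = np.array(["pay", "hours", "manager", "training", "customers"])
topic_word = np.array([[0.50, 0.30, 0.10, 0.05, 0.05],   # a "pay/scheduling"-flavored topic
                       [0.05, 0.05, 0.40, 0.30, 0.20]])  # a "management/training"-flavored topic
# Each document (response) is a mixture over topics, drawn from a Dirichlet prior
doc_topic = rng.dirichlet(alpha=[0.5, 0.5])
# Generate a 10-word "response": pick a latent topic for each word, then a word from that topic
words = []
for _ in range(10):
    z = rng.choice(2, p=doc_topic)                 # latent topic assignment for this word
    words.append(rng.choice(toy_vocab, p=topic_word[z]))
print(doc_topic)
print(" ".join(words))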
For the purposes of this article we will again leverage a scikit-learn implementation of the algorithm.
Word Vectorization¶
First we'll need to vectorize the responses. An outline of vectorization was discussed in the previous article, so I'd point readers to that article if a review is needed.
In an effort to first replicate the SVD/LSA model from the first article, we will use the tf-idf methodology for vectorizing our responses. However, Blei, et al. (2003) explicitly mention in their paper that tf-idf may not be necessary for LDA given the probabilistic nature of the model, so we will also compare the results from the tf-idf LDA to those extracted with a count vectorizer methodology.
Gridsearch¶
Many machine learning models have parameters that can be set before training; these are often referred to as hyperparameters. You can accept the default values or test different combinations of values. One method for testing these combinations is a grid search. GridSearchCV stands for Grid Search with Cross-Validation, and it performs an exhaustive search, commonly referred to in computer science as a brute force approach: it tests every combination of the values provided. If the number of parameters is limited and the sets of values are small, this can be accomplished relatively quickly. However, as the number of parameters and the values tested increase, the number of combinations can quickly become extremely large. In instances like this a randomized search or a Bayesian search may be preferred. For this example we are only testing two hyperparameters (learning decay and number of topics), so we will leverage the brute force method.
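As an aside, if the parameter grid were much larger, a randomized search is one common alternative to the brute force approach. Here is a minimal sketch of what that could look like; the parameter ranges and n_iter are placeholders for illustration, and we won't actually run this here.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import RandomizedSearchCV
# Sample a fixed number of random parameter combinations instead of testing them all
param_distributions = {'n_components': list(range(5, 21)), 'learning_decay': [0.5, 0.7, 0.9]}
random_search = RandomizedSearchCV(LatentDirichletAllocation(), param_distributions=param_distributions,
                                   n_iter=10, random_state=42)
# random_search.fit(vectors)  # would be fit on the vectors the same way as the grid search below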
vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df['cons']).todense()
vectors.shape
vocab = np.array(vectorizer.get_feature_names())
# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.model_selection import GridSearchCV
# Define Search Param
search_params = {'n_components': [6, 8, 10, 15, 20], 'learning_decay': [.5, .7, .9]}
# Init the Model
lda = LDA()
# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)
# Do the Grid Search
model.fit(vectors)
# Best Model
best_lda_model = model.best_estimator_
# Model Parameters
print("Best Model's Params: ", model.best_params_)
# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)
# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(vectors))
# Get log likelihoods from the grid search results
n_topics = [6, 8, 10, 15, 20]
cv_results = model.cv_results_
log_likelihoods_5 = [round(score) for score, params in zip(cv_results['mean_test_score'], cv_results['params']) if params['learning_decay'] == 0.5]
log_likelihoods_7 = [round(score) for score, params in zip(cv_results['mean_test_score'], cv_results['params']) if params['learning_decay'] == 0.7]
log_likelihoods_9 = [round(score) for score, params in zip(cv_results['mean_test_score'], cv_results['params']) if params['learning_decay'] == 0.9]
# Plot Topics by Log Likelihood
plt.figure(figsize=(10, 6))
plt.plot(n_topics, log_likelihoods_5, label='0.5')
plt.plot(n_topics, log_likelihoods_7, label='0.7')
plt.plot(n_topics, log_likelihoods_9, label='0.9')
plt.title("Choosing Optimal LDA Model")
plt.xlabel("Num Topics")
plt.ylabel("Log Likelihood Scores")
plt.legend(title='Learning decay', loc='best');
We can see above that the best learning decay was 0.9 and the ideal number of topics was 6, so we'll go with 6 topics (rather than the 10 we used with SVD), but we'll stick with the top 8 words per topic like we did with SVD.
# Tweak the two parameters below
number_topics = 6
number_words = 8
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1, learning_decay=0.9)
%time lda.fit(vectors)
# Helper function
def print_topics(model, n_top_words):
    words = vocab
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, number_words)
As you can see, we get a lot of the same general topics that SVD gave us, with topics focused on management, pay, advancement, etc. In my experience the topics from LDA tend to be a bit easier to interpret, but one of the downsides of all topic modeling is that while you've been able to group the words together, it's still relatively difficult to do much with those groupings.
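One small step toward making the topics more actionable is to tag each response with its most probable topic and then slice the reviews by topic. Here is a minimal sketch, assuming the fitted lda and vectors from above; the dominant_topic column name is just a hypothetical label introduced for illustration.
# Topic probabilities for every response, then the most probable (dominant) topic per response
doc_topic_dist = lda.transform(vectors)
df['dominant_topic'] = doc_topic_dist.argmax(axis=1)
# How many responses fall mostly into each topic
print(df['dominant_topic'].value_counts())
# Peek at a few responses whose dominant topic is topic 0
print(df.loc[df['dominant_topic'] == 0, 'cons'].head())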
Predicting a Specific Response¶
Let's look at which topics LDA distributes the final two responses under.
print(df['cons'][21451])
print("-----------")
print(df['cons'][21452])
lda.transform(vectors[-2:])
So, according to the LDA model the first comment we see above falls mostly into topic 3 and the second comment falls mostly into topic 1. This could be an interesting topic of exploration to see if human labelers would agree with this.
Data Visualization¶
One thing we did not focus on with LSA is visualizing the topics. One interesting way to visualize unsupervised learning results is to use another dimensionality reduction technique known as t-distributed stochastic neighbor embedding, or t-SNE. There is a fun package I recently discovered called pyLDAvis that leverages t-SNE to make interactive visualizations. It takes as input the model, your vectors, the vectorizer you used, and the multi-dimensional scaling technique you want to use; in our case we will use t-SNE.
You can hover over each of the topic bubbles and the top-30 most relevant words will change to reflect the topic. Feel free to give it a try :)
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda, vectors, vectorizer, mds='tsne')
panel
As you can see, these topics line up well with the topics identified above, and topic 1 here is essentially identical to Topic 4 shown above. However, what immediately stands out is that, while we have 6 distinct topics, about 70% of the words used in the responses fall into topic 1, with each of the remaining 5 topics accounting for roughly 5-6%. In my experience this is fairly typical: there is generally one large topic that accounts for 50-60% of the responses, and the remaining topics are actually pretty distinct but account for much less of the overall responses. We'll likely see this again when we do K-means clustering on this same dataset in the next article.
Regardless, pyLDAvis is a great interactive way to visualize your topics. As you hover over each of the topic bubbles you can see the top 30 most relevant terms as well as the estimated term frequency within each topic (which gives you an idea of actual vs. expected frequency).
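If you want to share the interactive panel with people who aren't working in a notebook, pyLDAvis can also write it out as a standalone HTML file. A quick sketch, with a made-up filename:
# Save the interactive visualization to a standalone HTML file (hypothetical filename)
pyLDAvis.save_html(panel, 'lda_cons_topics.html')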
CountVectorizer¶
There is conflicting research on whether one should use tf-idf or raw counts for LDA. As mentioned above, the authors of LDA hint that tf-idf may not be necessary due to the probabilistic nature of the model, but others have found that tf-idf enhances the interpretability of topics, so I figured we'd just try both and compare :)
vectorizer = CountVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df['cons']).todense()
vectors.shape
# Tweak the two parameters below
number_topics = 6
number_words = 8
vocab = np.array(vectorizer.get_feature_names())
# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1, learning_decay=0.9)
lda.fit(vectors)
# Helper function
def print_topics(model, n_top_words):
    words = vocab
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, number_words)
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda, vectors, vectorizer, mds='tsne')
panel
# predictions
print(df['cons'][21451])
print("-----------")
print(df['cons'][21452])
lda.transform(vectors[-2:])
In my opinion, counts do seem to create more interpretable topics than tf-idf here. The predicted categories for the two responses seem to align better, and the probability mass is spread more evenly across topics than in the tf-idf predictions.
But this may be specific to the problem, so it may make sense to try both when doing LDA.
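If you want a slightly more systematic comparison than eyeballing the topics, here is a minimal sketch that fits the same LDA specification on both vectorizations and compares perplexity. Keep in mind that perplexity is only a rough, imperfect proxy (and comparing it across different vectorizations is debatable), so human review of the topics still matters most.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Fit the same LDA specification on both vectorizations and compare perplexity
for name, vec in [('tf-idf', TfidfVectorizer(stop_words='english')), ('counts', CountVectorizer(stop_words='english'))]:
    X = vec.fit_transform(df['cons'])
    lda_cmp = LatentDirichletAllocation(n_components=6, learning_decay=0.9, n_jobs=-1, random_state=42)
    lda_cmp.fit(X)
    print(name, 'perplexity:', round(lda_cmp.perplexity(X), 1))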
Recap:¶
LDA is another option for topic modeling, and in general I'd consider it the most popular choice for topic modeling in the data science community. The next and final article on topic modeling will focus on K-means clustering as a third option for clustering unlabeled data into distinct groups.