Clustering Company Reviews with K-means¶
My previous two articles have focused on popular topic modeling techniques, including LSA and LDA.
This article will focus on a 3rd unsupervised learning technique that can be used to categorize open-ended text responses.
Cluster analysis offers another way for us to identify categories or topics. LSA and LDA are generally considered "soft" clustering techniques: each response is partially assigned to every cluster, much like softmax produces a probability for each class, and to pick a single "category" you take the argmax of those probabilities. K-means, on the other hand, is considered "hard" clustering: each response is assigned to exactly one cluster. In my experience this typically leaves only about 40-50% of the target clusters making sense, with the largest cluster absorbing the remaining 50-60% of responses. Of course, as the number of clusters increases, the number of responses left in that largest cluster goes down; in the extreme you could set k equal to the number of responses, in which case each cluster would contain exactly one response.
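To make the distinction concrete, here is a tiny illustrative sketch (the numbers are invented, not from our data): a soft method returns a probability for each topic and we take the argmax to get a single category, while a hard method returns a single label directly.

import numpy as np

# hypothetical "soft" output (e.g. LDA) for one response: a probability per topic
soft_assignment = np.array([0.15, 0.70, 0.15])
print(soft_assignment.argmax())   # 1 -> we'd report topic 1 as the category

# hypothetical "hard" output (e.g. k-means) for the same response: one label, no partial membership
hard_assignment = 1
print(hard_assignment)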
Explaining K-means Clustering¶
K-means is actually a relatively easy algorithm to understand. Put simply, k-means clustering starts with k randomly initialized centroids, where k is chosen by the user. It then iterates between two steps: assigning each data point to its nearest centroid and then moving each centroid to the center of the points assigned to it, with the goal of minimizing the total distance between each data point and its nearest centroid. The typical distance metric used for k-means is euclidean distance.
This gif does a good job of illustrating the centroid adjustments after each iteration for k = 3.
As you can see after each iteration data points are re-assigned based off of their proximity to each centroid.
For more specifics on k-means I'd point you to the following article.
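If you prefer code to gifs, below is a bare-bones NumPy sketch of the two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its points). It runs on toy 2-D data and skips the details a real implementation needs (convergence checks, empty-cluster handling, smarter initialization); it's only here to illustrate the loop, not what we'll actually use later.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                          # toy 2-D data
k = 3
centroids = X[rng.choice(len(X), k, replace=False)]    # start from k random data points

for _ in range(10):                                    # fixed number of iterations for simplicity
    # assignment step: label each point with its nearest centroid (euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # update step: move each centroid to the mean of the points assigned to it
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centroids)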
Now let's load our packages and our data.
# import our packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # built-in English stop word list
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# load the data
df = pd.read_csv('data/glassdoor_data.csv')
df['cons'] = df['cons'].astype(str)
We will again use TfidfVectorizer from scikit-learn to vectorize our responses, and to remain consistent we will use the cons data from our company reviews again. I walked through how CountVectorizer and TfidfVectorizer work in a previous article.
vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df['cons']).toarray()   # dense TF-IDF matrix (responses x terms)
print(vectors.shape)
vocab = np.array(vectorizer.get_feature_names_out())        # the unique terms, in column order
Again, we have 21,453 total reviews/responses and 11,986 unique words after removing the stopwords.
Identifying the optimal # of clusters¶
Like many other unsupervised techniques, choosing an optimal k is not an exact science. What we can do is run the k-means algorithm in a loop over different values of k and plot the within-cluster sum of squares (WCSS) for each set of clusters. We can then identify where there is a natural elbow, much like we would do when plotting eigenvalues for exploratory factor analysis.
from sklearn.cluster import KMeans

wcss = []
for i in range(3, 12):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(vectors)
    wcss.append(kmeans.inertia_)  # inertia_ is the within-cluster sum of squares for this k

plt.plot(range(3, 12), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
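If the WCSS/inertia terminology feels abstract, it is just the sum of squared euclidean distances from each response to the centroid of its assigned cluster. As a sanity check we can compute it by hand for the last fit from the loop above (k = 11) and compare it to scikit-learn's inertia_; this is purely illustrative and a little memory-hungry on a dense matrix of this size.

# manually recompute the within-cluster sum of squares for the last fit in the loop
manual_wcss = 0.0
for j in range(kmeans.n_clusters):
    members = vectors[kmeans.labels_ == j]                       # responses assigned to cluster j
    manual_wcss += ((members - kmeans.cluster_centers_[j]) ** 2).sum()
print(manual_wcss, kmeans.inertia_)                              # should match (up to floating point)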
When we look at the plot of WCSS we see a decent elbow at around 8 clusters, so let's use that as our number of unique clusters for the cons data.
Another issue with an algorithm like k-means is that it can converge to a local optimum rather than the global optimum. To lower the probability of this happening, the scikit-learn implementation uses an initialization technique known as k-means++, which you can read about in more detail in this article.
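Very roughly, k-means++ spreads the initial centroids out: the first centroid is a uniformly random data point, and each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. Here is a simplified sketch of that idea, purely for intuition (scikit-learn's actual implementation adds refinements such as trying several candidate seeds at each step):

def kmeans_plus_plus_init(X, k, rng):
    """Simplified k-means++ seeding -- the core idea only, not sklearn's exact implementation."""
    centroids = [X[rng.integers(len(X))]]           # first centroid: a uniformly random point
    d2 = ((X - centroids[0]) ** 2).sum(axis=1)      # squared distance to nearest centroid so far
    for _ in range(k - 1):
        probs = d2 / d2.sum()                       # far-away points are more likely to be picked
        new_centroid = X[rng.choice(len(X), p=probs)]
        centroids.append(new_centroid)
        d2 = np.minimum(d2, ((X - new_centroid) ** 2).sum(axis=1))
    return np.array(centroids)

Seeds that start far apart are less likely to collapse into the same local optimum, which is why this initialization tends to produce better and more stable solutions than purely random starts.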
So, let's run k-means with 8 clusters, using the k-means++ centroid initialization, 40 random restarts, and no more than 1,500 iterations per run.
num_clusters = 8
kmeans = KMeans(n_clusters = num_clusters, init='k-means++', max_iter=1500, n_init=40, random_state=0)
kmeans.fit(vectors) # fit the vectors using KMeans
kmeans.cluster_centers_.shape # examine the cluster shape
We can see by looking at the shape of cluster_centers_ that we have 8 clusters, each living in an 11,986-dimensional space, which matches the number of columns representing our unique words.
Next we will create a new column in our dataframe that has the value for the cluster that each response was assigned to. This allows us to do some investigating, which we will do next.
df['kmeans_cluster'] = kmeans.labels_ #assign labels as an additional dataframe column
df['kmeans_cluster'].value_counts() # examine how many responses fall into each cluster
cluster_0 = 12854              # responses assigned to cluster 0 (from value_counts above)
total_responses = 21453        # total number of reviews
round(cluster_0/total_responses, 3)
As I mentioned earlier, you can see that roughly 60% of the responses are still assigned to cluster 0, which means the algorithm really couldn't identify much of a difference between those responses. If we had increased our k, surely some of those would have moved into the additional clusters, but in general a lot of those responses are just difficult to categorize by focusing only on individual words.
Investigating the Clusters¶
Like EFA, the clusters don't come with topic titles. In order to identify the topics you have to read some of the responses. Let's read the first 10 responses for cluster 1.
df[df['kmeans_cluster']==1]['cons'].head(10)
We can see here that cluster 1 is discussing hours: not enough hours, the types of hours worked, etc. In general, the references are about work hours.
Let's look at cluster 6.
df[df['kmeans_cluster']==6]['cons'].head(10)
Cluster 6 makes a lot of references to customers, with statements about rude customers, mean customers, bad customers, etc.
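Reading raw responses is the most reliable way to label a cluster, but a quick shortcut (assuming the vocab array and the fitted kmeans object from above) is to print the highest-weighted terms in each cluster centroid and use them as a rough summary:

# ten terms with the largest centroid weight for each of the 8 clusters
for i, centroid in enumerate(kmeans.cluster_centers_):
    top_terms = vocab[centroid.argsort()[::-1][:10]]
    print(f"cluster {i}: {', '.join(top_terms)}")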
Responses by store¶
Let's first get the total responses by company and then we can examine the raw counts by company for each cluster.
df['company'].value_counts()
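Rather than pulling value_counts() cluster by cluster, we can also cross-tabulate company against cluster in one shot; normalizing by row shows what share of each company's cons fall into each cluster (a quick pandas sketch using the same dataframe as above):

# share of each company's responses that falls into each cluster
pd.crosstab(df['company'], df['kmeans_cluster'], normalize='index').round(3)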
Investigating¶
One cool thing about assigning each response to a cluster is that it lets you pretty easily examine trends across stores. Let's take a quick look.
We have relatively equal representation of reviews across all companies outside of Giant Eagle, so we would expect similar representation in each cluster. Let's look at cluster 6 specifically.
df[df['kmeans_cluster']==6]['company'].value_counts()
This shows us that when it comes to customer-related cons, our two fast food companies have the most comments falling into this cluster. Recall that this cluster focused on rude/poor experiences with customers.
What about cluster 1, around hours?
df[df['kmeans_cluster']==1]['company'].value_counts()
Cluster 1 seems to be much more equally spread across all companies with a slight edge to Target and Publix.
Cluster 2 talks a lot about work/life balance. Let's see which companies had responses that fell into that cluster the most.
df[df['kmeans_cluster']==2]['company'].value_counts()
Here we have a much larger proportion of responses that are in the retail/grocery space.
Finally, let's look at cluster 3, which is around pay, with a specific emphasis on low pay as I'm sure many would have assumed.
df[df['kmeans_cluster']==3]['company'].value_counts()
I was a bit surprised by the companies with the most responses falling into this cluster, but the two with the lowest (Publix and Costco) make sense to me. Kroger, McDonald's, and Taco Bell have a much larger proportion of their responses falling into this category than the rest, and Costco has by far the lowest.
Instead of looking strictly at value counts, which can be a bit boring, we could build a count plot. I built one below to help us visualize cluster 3.
cluster_3 = df[df['kmeans_cluster'] == 3]
sns.set(rc={'figure.figsize': (12, 8)})
plot = sns.countplot(x=cluster_3['kmeans_cluster'], hue=cluster_3['company'],
                     hue_order=cluster_3['company'].value_counts().index)
plot.set_title("# of Responses Focused on Pay");
What I like about clustering methods for identifying topics is that each response is assigned to a specific cluster instead of being spread across every topic or cluster. This is most certainly an over-simplification of most of these responses, but it does allow you to examine trends across departments, companies, etc. much more easily than you can with topic modeling methods like LSA and LDA.
Recap:¶
K-means clustering allows us to use unsupervised learning to assign each response to a specific category and then read the responses within each category to understand their general topic. It's helpful when you need each response assigned to a single category, but because every response must fall into exactly one cluster, it can be an overly simplistic way to understand the general trends across all responses. In those instances the topic modeling techniques discussed in earlier articles may be a better fit.
In my next few articles we'll stay within NLP and shift to supervised learning, where we'll try a number of techniques to predict the overall rating each employee provided.