Clustering Company Reviews with K-means

My previous two articles have focused on popular topic modeling techniques, including LSA and LDA.

This article will focus on a third unsupervised learning technique that can be used to categorize open-ended text responses: k-means clustering.


Cluster analysis offers another way to identify categories or topics. LSA and LDA are usually considered "soft" clustering techniques: each response is partially assigned to every cluster, much like softmax produces a probability for each class, and you assign the final category by taking the argmax. K-means, by contrast, is considered "hard" clustering: each response is assigned to exactly one cluster.

In my experience, this typically means only about 40-50% of the responses end up in clusters that make sense, while the largest cluster absorbs the remaining 50-60%. Of course, as the number of clusters increases, the share of responses in the largest cluster goes down; taken to the extreme, you could have N clusters, where N is the number of responses, and each cluster would contain exactly one response.

Explaining K-means Clustering

K-means is actually a relatively easy algorithm to understand. Put simply, k-means clustering starts with k randomly initialized centroids, where k is chosen by the user. It then iterates between two steps: assign each data point to its nearest centroid, then move each centroid to the center of the points assigned to it, in an effort to minimize the total distance of all data points from their nearest centroid. The typical distance metric used for k-means is Euclidean distance.
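
To make that loop concrete, here's a minimal NumPy sketch of the assign-and-update cycle. The function name and structure are my own illustration, not scikit-learn's implementation.

In [ ]:
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    # bare-bones sketch of the k-means loop described above
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # start with k centroids drawn at random from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assignment step: label each point with its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels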

This gif does a good job of illustrating the centroid adjustments after each iteration for k = 3.

[Animation: k-means centroid updates over successive iterations, k = 3]

As you can see, after each iteration data points are re-assigned based on their proximity to each centroid.

For more specifics on k-means I'd point you to the following article.

Now let's load our packages and our data.

In [1]:
# import our packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
In [2]:
# load the data
df = pd.read_csv('data/glassdoor_data.csv')
df['cons'] = df['cons'].astype(str)

We will again use TfidfVectorizer from scikit-learn to vectorize our responses, and to remain consistent we will use the cons data from our company reviews again. I walked through how CountVectorizer and TfidfVectorizer work in a previous article.

In [3]:
vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df['cons']).toarray()  # dense array of tf-idf weights
print(vectors.shape)

vocab = np.array(vectorizer.get_feature_names())
(21453, 11986)

Again, we have 21,453 total reviews/responses and 11,986 unique words after removing the stopwords.

Identifying the Optimal Number of Clusters

Like many other unsupervised techniques, choosing an optimal k is not an exact science. What we can do is run the k-means algorithm in a loop with different values of k and plot the within-cluster sum of squares (WCSS) for each solution. We can then look for a natural elbow, much as we would when plotting eigenvalues for exploratory factor analysis.

In [5]:
from sklearn.cluster import KMeans
In [ ]:
wcss = []

# fit k-means for a range of k values and record the within-cluster sum of squares (inertia)
for i in range(3, 12):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(vectors)
    wcss.append(kmeans.inertia_)
In [9]:
plt.plot(range(3,12), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
Out[9]:
Text(0,0.5,'WCSS')

When we look at the plot of WCSS we see a decent elbow at around 8 clusters, so let's use that as our number of unique clusters for the cons data.

Another issue with an algorithm like k-means is converging to a local rather than a global optimum. To lower the probability of this happening, the scikit-learn implementation leverages an initialization technique known as k-means++, which you can read about in more detail in this article.
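
As a rough sketch of the idea (my own illustrative function, not the library's code): the first centroid is drawn uniformly at random, and each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, which spreads the seeds out.

In [ ]:
def kmeanspp_seeds(X, k, seed=0):
    # illustrative sketch of k-means++ seeding, not scikit-learn's implementation
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = [X[rng.integers(len(X))]]  # first seed: uniform at random
    for _ in range(k - 1):
        # squared distance from each point to its nearest existing seed
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # far-away points are proportionally more likely to be picked next
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)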

So, let's run k-means with 8 clusters, using the k-means++ centroid initialization, for no more than 1,500 iterations.

In [6]:
num_clusters = 8

kmeans = KMeans(n_clusters = num_clusters, init='k-means++', max_iter=1500, n_init=40, random_state=0)
In [7]:
kmeans.fit(vectors) # fit the vectors using KMeans
Out[7]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=1500,
    n_clusters=8, n_init=40, n_jobs=1, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)
In [30]:
kmeans.cluster_centers_.shape # examine the cluster shape
Out[30]:
(8, 11986)

We can see by looking at the shape of cluster_centers_ that we have 8 clusters in a space of 11,986 dimensions, the same number of columns we have representing each unique word.

Next we will create a new column in our dataframe that has the value for the cluster that each response was assigned to. This allows us to do some investigating, which we will do next.

In [8]:
df['kmeans_cluster'] = kmeans.labels_ #assign labels as an additional dataframe column
In [9]:
df['kmeans_cluster'].value_counts() # examine how many responses fall into each cluster
Out[9]:
0    12854
5     2649
1     2138
3     1598
6     1138
2      441
4      410
7      225
Name: kmeans_cluster, dtype: int64
In [10]:
cluster_0 = 12854
total_responses = 21453

round(cluster_0/total_responses,3)
Out[10]:
0.599

As I mentioned earlier, you can see that roughly 60% of the responses are still assigned to cluster 0, which means the algorithm really couldn't identify much of a difference between those responses. If we had increased our k, surely some of those would have gone into the additional clusters, but in general a lot of those responses are just difficult to categorize by focusing only on individual words.
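
Rather than hard-coding the counts as above, you can also get the proportions straight from value_counts:

In [ ]:
# proportion of responses in each cluster, computed directly
df['kmeans_cluster'].value_counts(normalize=True).round(3)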

Investigating the Clusters

Like EFA, the clusters don't come with topic titles; to identify the topics you have to read some of the responses. Let's read the first 10 responses assigned to cluster 1.

In [15]:
df[df['kmeans_cluster']==1]['cons'].head(10)
Out[15]:
23             not enough hours for experienced workers
26    dealing with con artists (at least at customer...
41               working weekends sucks. too many hours
45           Hours, Management, union dues are not fun.
49    little to no communication from management abo...
56    Part-timers aren't given many hours unless exp...
57    Part-timers aren't given many hours unless exp...
62                   long hours paper cuts really hurt!
82    After the holidays, they really cut back on th...
96    standing in the same spot all day, not enough ...
Name: cons, dtype: object

We can see here that cluster 1 generally discusses hours: not enough hours, the types of hours worked, etc. The common thread is work hours.
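
Reading responses is the surest way to name a cluster, but a quick complementary check (a sketch using the vocab array we built earlier) is to look at the highest-weighted vocabulary terms in each centroid:

In [ ]:
# top-weighted vocabulary terms for each cluster centroid,
# as a sanity check on the themes we infer from reading responses
for i, centroid in enumerate(kmeans.cluster_centers_):
    top_terms = vocab[np.argsort(centroid)[::-1][:10]]
    print(i, ', '.join(top_terms))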

Let's look at cluster 6.

In [17]:
df[df['kmeans_cluster']==6]['cons'].head(10)
Out[17]:
17     Customers unfriendly, drive through difficult ...
80          Some customers are difficult but its retail.
97        The managers were very condescending and rude.
145    It's retail...so you are going to have obnoxio...
152    No issues with the company as a whole. Some ma...
154      cranky customers, working with moody co workers
161    As what you would expect sometimes customers c...
168    they basically bow down to any customers over ...
221             Low wage, rude customers, bad management
234         Corporate is rude and demeaning to employees
Name: cons, dtype: object

Cluster 6 makes a lot of references to customers: rude customers, mean customers, bad customers, etc.

Responses by store

Let's first get the total responses by company and then we can examine the raw counts by company for each cluster.

In [46]:
df['company'].value_counts()
Out[46]:
tbell     2509
sams      2509
wmt       2509
kr        2500
push      2500
mcd       2500
tgt       2500
cost      2500
geagle    1426
Name: company, dtype: int64

Investigating Trends Across Companies

One cool thing about assigning each response to a specific cluster is that it allows you to pretty easily examine trends across stores. Let's take a quick look.

We have relatively equal representation of reviews across all companies outside of Giant Eagle, so we would expect similar representation in each cluster. Let's look at cluster 6 specifically.

In [23]:
df[df['kmeans_cluster']==6]['company'].value_counts()
Out[23]:
mcd       368
tbell     214
wmt       132
kr        106
push       87
tgt        70
sams       60
geagle     59
cost       42
Name: company, dtype: int64

This shows us that when it comes to customer-related cons, our two fast-food companies have the most comments falling into that cluster. Recall that this cluster focused on rude customers and poor customer experiences.
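
One caveat: Giant Eagle has far fewer reviews overall, so raw counts slightly favor the larger samples. A quick way to adjust (a sketch) is to divide each company's cluster count by its total number of reviews:

In [ ]:
# share of each company's own reviews that landed in cluster 6
cluster_6_counts = df[df['kmeans_cluster'] == 6]['company'].value_counts()
company_totals = df['company'].value_counts()
(cluster_6_counts / company_totals).sort_values(ascending=False).round(3)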

What about cluster 1, around hours?

In [24]:
df[df['kmeans_cluster']==1]['company'].value_counts()
Out[24]:
tgt       364
push      335
tbell     267
wmt       260
kr        208
sams      196
cost      187
mcd       186
geagle    135
Name: company, dtype: int64

Cluster 1 seems to be much more evenly spread across all companies, with a slight edge to Target and Publix.

Cluster 2 talks a lot about work/life balance. Let's see which companies had the most responses fall into that cluster.

In [25]:
df[df['kmeans_cluster']==2]['company'].value_counts()
Out[25]:
wmt       99
kr        88
tgt       84
sams      61
push      31
tbell     27
cost      21
geagle    16
mcd       14
Name: company, dtype: int64

Here we have a much larger proportion of responses from companies in the retail/grocery space.

Finally, let's look at cluster 3, which is around pay, with a specific emphasis on low pay as I'm sure many would have assumed.

In [26]:
df[df['kmeans_cluster']==3]['company'].value_counts()
Out[26]:
kr        362
mcd       269
tbell     243
tgt       158
wmt       153
sams      134
geagle    117
push      115
cost       47
Name: company, dtype: int64

I was a bit surprised by the companies with the most responses in this cluster, but the two with the fewest (Publix and Costco) make sense to me. Kroger, McDonald's, and Taco Bell have a much larger proportion of their responses falling into this category than the rest, and Costco has by far the lowest.
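
If you'd rather see all of these breakdowns at once instead of pulling value counts cluster by cluster, a pandas crosstab gives the full company-by-cluster table in one call:

In [ ]:
# full company-by-cluster contingency table in a single call
pd.crosstab(df['company'], df['kmeans_cluster'])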

Instead of looking strictly at value counts, which can be a bit boring, we can build a count plot. I built one below to help us visualize cluster 3.

In [17]:
cluster_3 = df[df['kmeans_cluster']==3]
sns.set(rc={'figure.figsize':(12,8)})
plot = sns.countplot(cluster_3['kmeans_cluster'], hue=cluster_3['company'],
                     hue_order=cluster_3['company'].value_counts().index)
plot.set_title("# of Responses Focused on Pay");

What I like about clustering methods for identifying topics is that each response is assigned to a single cluster instead of being part of every topic. This is most certainly an over-simplification of most responses, but it allows you to examine trends across departments, companies, etc. much more easily than you can with topic modeling methods like LSA and LDA.

Recap:


K-means clustering allows us to use unsupervised learning to assign each response to a specific category and then read the responses within each category to understand their general topic. It's helpful when you need each response assigned to a single category, but because each response must fall into exactly one cluster, it can be an overly simplistic way to understand the general trends across all responses. When that matters, the topic modeling techniques discussed in earlier articles may be a better fit.

In my next few articles we'll stay within NLP and shift to supervised learning, where we'll try a number of techniques to predict the overall rating each employee provided.
