Topic Modeling Company Reviews with SVD


Surveys and open-ended feedback are among the many data types and datasets we come into contact with as I/Os. Whether it's the open-ended section of an annual engagement survey, feedback from annual reviews, or customer feedback, the text that is provided is often difficult to do much with at scale.

However, there are unsupervised machine learning methods that give us a way to make sense of this data. In the next 3 articles I'll outline 3 different unsupervised techniques for topic modeling. This first article will focus on the technique of Singular Value Decomposition (SVD).

  1. Singular Value Decomposition (SVD), which Latent Semantic Analysis (LSA) is based on.
  2. Latent Dirichlet Allocation (LDA)
  3. Cluster Analysis (K-Means)

We will examine the Cons section of the Glassdoor reviews of retailers that we extracted in an earlier article.

Unsupervised Learning

Before we start, let's quickly define what unsupervised learning is. In most machine learning problems there is something you are trying to predict, or a target. This may be job performance, job satisfaction, customer satisfaction, a hit in baseball, or a goal in hockey/soccer. We refer to these as supervised learning problems because the features can be used to optimize for a specific target. In many other instances we do not have a specific target; we are just trying to make sense of some data. When you have a set of features without a target or outcome variable, this is known as unsupervised learning. While the term isn't common in our psychology stats courses, most people should be familiar with factor analysis, which is actually an unsupervised learning method. For the purpose of this article we won't be using a target; we'll be trying to structure the responses to the cons question into something that is interpretable at scale.
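To make that distinction concrete, here's a minimal sketch (using scikit-learn and made-up numeric data, not our review text): a supervised model needs both the features and a target, while an unsupervised model only ever sees the features.

from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
import numpy as np

X = np.random.rand(100, 5)        # features only
y = np.random.rand(100)           # a target/outcome variable

LinearRegression().fit(X, y)      # supervised: features are used to predict the target
KMeans(n_clusters=3).fit(X)       # unsupervised: no target, we just look for structure in X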


So....let's get started!!

In [1]:
import numpy as np
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
import pandas as pd
# note: in newer versions of scikit-learn the stop word list lives in
# sklearn.feature_extraction.text as ENGLISH_STOP_WORDS
from sklearn.feature_extraction import stop_words
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
In [2]:
# load the data
df = pd.read_csv('data/glassdoor_data.csv')
In [3]:
df['cons'] = df['cons'].astype(str)

Let's briefly look at the data we'll be using.

We can first calculate the length of each response to get an idea of how long most of the responses to this question actually are.

In [4]:
df['cons_len'] = df['cons'].apply(lambda x: len(x.split(" ")))  # rough word count per response
In [5]:
df['cons_len'].describe()
Out[5]:
count    21453.000000
mean        22.890272
std         46.628156
min          1.000000
25%          6.000000
50%         10.000000
75%         23.000000
max       1508.000000
Name: cons_len, dtype: float64
In [6]:
sns.distplot(df['cons_len'],bins=25);

We can see that the longest response is 1508 words while the shortest is a single word, and the distribution is heavily skewed toward responses of fewer than 25 words, with a few outliers like that 1508-word response. Let's look at how many responses have over 100 words.

In [7]:
len(df[df['cons_len']>100])
Out[7]:
672

So, while the distribution is heavily skewed, we do have 672 reviews with more than 100 words, though that only makes up ~3% of the total reviews.

Let's look at a few random responses. I'd show you the max one as well, but it would end up being a wall of text :)

In [8]:
df['cons'][755]
Out[8]:
"If you are not one of the chosen or have family in corporate level, there is no advancement. I had 5 different District Leaders in the 3 years at my location. The last three got worse and the last one didn't have a clue without being told. There were other applicants with more experience that were past over. Many were surprised she received the position. Got shafted on the previous years bonus by discharging me 5 days before it was scheduled to be payed."
In [9]:
df['cons'][1435]
Out[9]:
"Almost all of the pros to working for Sams have diminished. Hours are inconsistent. Management doesn't listen and only cares about getting their bonus. They fire employees for members not wanting their upgrade to a plus membership. The only people that the plus membership benefits are businesses. They push and push for these upgrades and credits.\nHere is the big downer...the pay. recently everyone went up to 10.50...in our store at least. As for associates that have been there for years, they got the same raise. The thing that really sucks now is that new people who just stared make as much as the people before them. The pay raises being 2-3% every year for everyone at the same time is flat out ridiculous. I have put almost two years of my life into this place and I wanted to achieve more but not if I am now being treated as a new hire."
In [10]:
df['cons'][13]
Out[10]:
'Boring and tedious work. Not real good other than for a summer job, for a college or high school student.'

So, as we can see, we have some common cons, but it looks like we also have more unique responses around lack of advancement, pay, and so on. Let's see if we can use matrix decomposition strategies to make some logical sense of all of these.

Turning Words into Data

Before we begin topic modeling, let's briefly cover a couple of ways we can turn words into data we can analyze in the first place. When it comes to differentiating responses, we care about the uniqueness of the words used. Because of this, common words like "the", "and", "a", etc. aren't valuable. Think of it as similar to an item in an assessment where everyone answers "strongly agree": if there is no variance in an item, it often doesn't make sense to include it. Similarly, if a word is too common across responses, it doesn't make sense to include it.

Now, one caveat I want to call out is that this is really only true when using a bag-of-words NLP method, where the order of the words isn't relevant (which is what we will be using in this article). When you start using deep learning models, the order of the words, and all of the words, can add value. We'll discuss deep learning models for NLP in depth at a later time.

For now let's look at a few of the stop words from scikit-learn's stop_words.

In [11]:
len(list(stop_words.ENGLISH_STOP_WORDS))
Out[11]:
318

So we can see that scikit-learn's list of English stopwords contains 318 words. Let's quickly look at the first 12 to get an idea of the type of words included.

In [12]:
sorted(list(stop_words.ENGLISH_STOP_WORDS))[:12]
Out[12]:
['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along']

We will remove the stopwords from all of our vectors moving forward to avoid including data that doesn't differentiate responses.
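As a quick illustration of what that removal looks like (using a made-up sentence rather than one of the actual responses), we can filter tokens against scikit-learn's list directly:

example = "there is no appreciation for good work and no room for growth"
kept = [w for w in example.split() if w not in stop_words.ENGLISH_STOP_WORDS]
print(kept)   # roughly: ['appreciation', 'good', 'work', 'room', 'growth']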

Now let's talk about ways to structure this data so we can run machine learning algorithms on it.

Dummy-Coding/One-Hot Encoding

The simplest way to represent words as data is to essentially one-hot encode each word, where a variable/column is added for every word that appears across all the responses.

[Image: one-hot encoded document-term matrix for the first three responses]

As an example, using the first 3 responses from this dataset we can see that the first respondent used the terms appreciation, away, bucks, and chances. These are all encoded with a 1 under that vocabulary word using one-hot encoding. The second respondent did not use appear, appreciation, and away, so all of those are encoded as a zero, but they did use the term better and so on.

We can do this in scikit-learn using CountVectorizer. It might look something like this.

In [13]:
text = df['cons'][:3]
In [14]:
vectorizer = CountVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(text).todense()
vectors.shape
Out[14]:
(3, 62)
In [15]:
vocab = np.array(vectorizer.get_feature_names())
print(vocab[20:25])
print(vocab[22])
['fee' 'felt' 'good' 'growth' 'hard']
good
In [16]:
vectors
Out[16]:
matrix([[1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
         0, 3, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 2, 0, 0,
         1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 2, 0, 2, 0, 1, 2, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
         2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0,
         0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
        [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0,
         0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
         0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]],
       dtype=int64)
In [17]:
vectors[0,22]
Out[17]:
3

We can see the word at index 22 is good and the first response has a 3 in that column, so let's take a look at that response and confirm it includes the word good 3 times.

In [18]:
df['cons'][0]
Out[18]:
'no respect for employees, minimum wages for a whole lot of physical work which includes lifting 50pounds, no appreciation for good work, chances are the manager of your dept. would be a high school drop-out so no room for intellectual growth. They will pay you minimum wages and a good portion of it would be deducted for the union which will take away 50 bucks as initiation fee and 7$ every week, the union is good for nothing & will never help you.'

As we can see this is indeed accurate.
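If you'd like to inspect the full document-term matrix with the vocabulary as column labels, a small convenience step like this can help (just a sketch, reusing the vectors and vocab objects created above):

dtm = pd.DataFrame(np.asarray(vectors), columns=vocab)   # rows = responses, columns = vocabulary words
dtm.head()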

If you'd like to read more about representing word counts as vectors, this article goes into more depth on the subject. But we can immediately see a concern with this methodology: it doesn't tell us much about the uniqueness of the words used in specific responses.

To help with this we have what is known as Term Frequency-Inverse Document Frequency, or TF-IDF. The math behind it is fairly straightforward: the TF is the number of times a term appears in a document divided by the total number of terms in the document, and the IDF is calculated by taking log(N/n), where N is the total number of documents and n is the number of documents the term in question appears in. You can see how this effectively makes common words less important and rare words more important. We'll walk through the same example above using TF-IDF, this time with scikit-learn's TfidfVectorizer.
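To make the formula concrete, here is a minimal hand-rolled sketch of the calculation on the same 3 responses (crude whitespace tokenization and none of the smoothing or normalization that TfidfVectorizer applies by default, so the numbers won't match scikit-learn's exactly):

docs = [d.split() for d in text]        # the same 3 responses, crudely tokenized
N = len(docs)                           # total number of documents

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)     # term frequency within this document
    n = sum(term in d for d in docs)    # number of documents containing the term
    return tf * np.log(N / n)           # scale by inverse document frequency

tf_idf('good', docs[0])                 # assumes the term appears in at least one document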


In [19]:
vectorizer = TfidfVectorizer(stop_words='english')
In [20]:
vectors = vectorizer.fit_transform(text).todense()
vectors.shape
Out[20]:
(3, 62)
In [21]:
vectors.argmax()
Out[21]:
83

The argmax is at flat position 83, which falls in the second row (the second response) at word index 21. We can figure this out because each row is 62 columns long: 83 - 62 = 21 :)
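NumPy can also do that bookkeeping for us; np.unravel_index converts a flat position back into a (row, column) pair for a given shape:

np.unravel_index(vectors.argmax(), vectors.shape)   # (1, 21): second response, word index 21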

In [22]:
vectors[1,21]
Out[22]:
0.5067206204978569
In [23]:
vocab[21]
Out[23]:
'felt'
In [24]:
df['cons'][1]
Out[24]:
'Sometime I felt like it was hard to work your way up, especially starting at such a low position. I often felt that the upper management thought they were better than me.'

So, by looking at the argmax and the same index in the vocabulary, we can see that the word with the highest TF-IDF value across the first 3 responses in the cons column is felt.

Now let's get into how we can use TF-IDF for the purposes of topic modeling.

Singular Value Decomposition (SVD)

A quick note: much of what I will talk about regarding SVD can be found in this great course by Rachel Thomas on NLP, and in some cases I will simply reiterate examples and resources she has provided.

The SVD algorithm factorizes a matrix into one matrix with orthogonal columns (U) and one with orthogonal rows (V), along with a diagonal matrix (S) that contains the relative importance of each factor, so that the original matrix can be reconstructed as U @ diag(s) @ Vh (we will verify this below).

[Image: SVD factorization of a matrix into U, S, and V (source: Facebook Research, Fast Randomized SVD)]

SVD is an exact decomposition, since the matrices it creates are big enough to fully cover the original matrix. SVD is extremely widely used in linear algebra, and specifically in data science.

This article goes into much more depth regarding the math behind SVD for those interested.

But, to quickly summarize:

U is the matrix that represents the documents/responses × the topics, which would look a bit like this.

[Image: document × topic matrix U]

S is the diagonal topic × topic matrix, which would look a bit like this.

[Image: diagonal topic × topic matrix S]

V is the topic × term matrix, which would look a bit like this.

[Image: topic × term matrix V]

By leveraging the vectors of these matrices we can identify similar words and similar documents, and if we focus exclusively on the V matrix we can identify the words with the largest weights associated with each topic.
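If the shapes are hard to picture, a toy example may help (a random 4 × 5 "documents × terms" matrix, unrelated to our reviews):

A = np.random.rand(4, 5)                      # pretend: 4 documents x 5 terms
U, s, Vh = linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vh.shape)             # (4, 4) (4,) (4, 5)
print(np.allclose(U @ np.diag(s) @ Vh, A))    # True: the decomposition is exact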

Latent Semantic Analysis (LSA) uses SVD. You will sometimes hear topic modeling referred to as LSA.

In [25]:
vectorizer = TfidfVectorizer(stop_words='english')
In [26]:
vectors = vectorizer.fit_transform(df['cons']).todense()
vectors.shape
Out[26]:
(21453, 11986)

You can see here that we have 21,453 responses and 11,986 unique tokens or words that were used across the responses.

Let's next create a vocab object and examine a subset of the vocab.

In [27]:
vocab = np.array(vectorizer.get_feature_names())
In [28]:
vocab[3000:3020]
Out[28]:
array(['deployment', 'deposit', 'deposited', 'depot', 'depressed',
       'depressing', 'depression', 'dept', 'depth', 'depths',
       'deptpartments', 'depts', 'derogatory', 'described', 'describes',
       'describing', 'description', 'desensitized', 'deserve', 'deserved'],
      dtype='<U17')

SVD

For the purposes of this first implementation we will use SciPy's SVD and we will time it to show exactly how long it takes.

In [29]:
%time U, s, Vh = linalg.svd(vectors, full_matrices=False)
CPU times: user 39min 1s, sys: 16.3 s, total: 39min 17s
Wall time: 9min 53s
In [30]:
print(U.shape, s.shape, Vh.shape)
(21453, 11986) (11986,) (11986, 11986)

We can clearly see that SVD takes a long time. The reason is that a full decomposition doesn't pick out a small number of topics for us: it returns as many components as the smaller dimension of the matrix, which here is 11,986 (you can see this in the shapes of s and Vh above), hence the reference to exact decomposition above.

Let's actually confirm that these matrices are an exact decomposition.

  • We can first create an array of the reconstructed vectors by matrix-multiplying U, the diagonal matrix of s, and Vh
  • Then we can confirm that the reconstructed vectors and the original vectors are equal by using NumPy's allclose method
In [31]:
reconstructed_vectors = U @ np.diag(s) @ Vh
np.allclose(reconstructed_vectors,vectors)
Out[31]:
True

The result of np.allclose evaluates to True, which confirms that the reconstruction matches the original matrix (within floating-point tolerance).

Now that we've run SVD and confirmed it is an exact decomposition, let's create a function that extracts the top words for each topic and examine what the top 10 topics look like. We'll discuss a more efficient way to use SVD a little later.

In [48]:
num_top_words = 8

def show_topics(a):
    # for each topic vector, take the indices of the num_top_words largest weights
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = [top_words(t) for t in a]
    return [' '.join(t) for t in topic_words]
In [49]:
plt.plot(s[:12])   # plot the first 12 singular values to see how quickly their importance drops off
Out[49]:
[<matplotlib.lines.Line2D at 0x7f3e8cf5d278>]
In [50]:
show_topics(Vh[:10])
Out[50]:
['hours work pay management time low long hard',
 'work hours time long life balance hard cons',
 'hours long pay low week cut standing holidays',
 'work pay balance life low hard great weekends',
 'customers rude management work life balance poor bad',
 'management time poor bad employees hard benefits communication',
 'management bad cons hours poor think balance life',
 'cons job working really don great company good',
 'bad time say hard really work good balance',
 'people don managers job employees good like lot']

As you can see, we passed the first 10 rows of the Vh matrix to the show_topics function, which gives us the top 10 topics.

If you look at the words in the topics, they make sense: we see topics around management, pay, customers, hours, etc. And this is despite the fact that this is an unsupervised algorithm - which is to say, we never actually told the algorithm how our documents are grouped.

But part of the problem is that the standard implementation of SVD takes too long. Is there another option? Yes, and it is known as Truncated SVD. Instead of calculating all of the columns, we calculate only the vectors corresponding to the largest singular values.

Shortcomings of classical algorithms for decomposition:

  • Matrices are often "stupendously big"
  • Data are often missing or inaccurate. Why spend extra computational resources when imprecision of input limits precision of the output?
  • Data transfer now plays a major role in time of algorithms. Techniques that require fewer passes over the data may be substantially faster, even if they require more flops (flops = floating point operations).
  • Important to take advantage of GPUs.

(source: Halko)

Advantages of randomized algorithms:

  • Inherently stable
  • Performance guarantees do not depend on subtle spectral properties
  • Needed matrix-vector products can be done in parallel

(source: Halko)

Truncated SVD

In [91]:
from sklearn import decomposition
In [100]:
%time u, s, v = decomposition.randomized_svd(vectors, 10)
CPU times: user 7.87 s, sys: 35.8 ms, total: 7.9 s
Wall time: 2.02 s
In [101]:
show_topics(v)
Out[101]:
['hours work pay management time low long hard',
 'pay low management rude customers bad poor better',
 'hours long pay low week cut standing holidays',
 'work pay balance life low hard great weekends',
 'cons think time pay really job low great',
 'customers rude cons work think balance life long',
 'time hard customers rude pay low benefits busy',
 'think time management balance life poor bad work',
 'bad time say hard really work good customers',
 'people don managers job employees like good lot']

So, we went from a wall time of ~600 seconds to a wall time of about 2 seconds by using Truncated SVD.
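scikit-learn also wraps this randomized approach in a TruncatedSVD estimator, which can be handy inside a pipeline. A sketch of the equivalent call, reusing the vectors and show_topics objects from above, might look like this:

svd = decomposition.TruncatedSVD(n_components=10, algorithm='randomized')
doc_topics = svd.fit_transform(vectors)   # document x topic matrix, analogous to u scaled by the singular values
show_topics(svd.components_)              # topic x term matrix, analogous to v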

Recap:


One option for topic modeling is to leverage an oldie but a goodie from linear algebra: SVD. As Gilbert Strang, the MIT mathematics professor made "internet famous" by his Linear Algebra OCW course, says...

"SVD is not nearly as famous as it should be."

We can clearly see that by leveraging linear algebra concepts we can derive topics that make conceptual sense given the types of responses we examined.

Another few options to consider for future research:

  • Use lemmatization to further standardize the word vectors.
  • Look at topics for specific companies, as it's likely that some of these companies have unique pros and cons of working for them.
  • Examine other text columns like "pros" and "advice to management".
  • Topic modeling works best when topics are fairly distinct. It would be interesting to stack the columns together (i.e. pros, cons, and advice to management) and see if 3 basic topics come out that align with the label of the column.
  • Try a larger n-gram range. For simplicity we used the default TfidfVectorizer implementation, which is an n-gram range of (1,1). If we had included bi-grams we might have seen additional clarity, but we would also have greatly increased the number of unique words/word sets (a quick sketch of what that change looks like follows below).
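As an illustration of that last point (the variable names here are just for the example), including bi-grams is a one-parameter change to the vectorizer, at the cost of a much larger vocabulary:

bigram_vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
bigram_vectors = bigram_vectorizer.fit_transform(df['cons'])
bigram_vectors.shape   # expect far more columns than the 11,986 single words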

In the next article we will examine topic modeling through the lens of LDA.
