Glassdoor Exploratory Analysis

A few months ago I scraped Glassdoor for roughly 2.5k reviews for each of 9 companies. In this article I'm going to do some exploratory data analysis (EDA). In future articles I'll build ML models to try to predict ratings and explore NLP models as well.

What is EDA and why do we do it?

Oftentimes we are given data and have a rough understanding of what is included, but don't know much beyond that. EDA is the step where we explore the data. This typically includes looking at descriptive statistics for your numerical data as well as exploring response types and patterns in your categorical data. It might also involve creating plots and charts to explore and find nuggets hidden within the data. It's the first step toward eventually providing useful and actionable insights from your data.
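A first pass usually starts with a handful of one-liners. Here's a minimal sketch (not run here, and assuming the data is already loaded into a pandas dataframe called df, as it will be below):

In [ ]:
# Minimal first-pass EDA sketch (not run here)
df.info()                             # column names, dtypes, and non-null counts
df.describe()                         # descriptive statistics for numerical columns
df['company'].value_counts()          # response counts for a categorical column
df.isnull().sum()                     # missing values per column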

In [1]:
#import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

First let's load our dataframe and take a look at the features we have.

In [2]:
df = pd.read_csv('data/glassdoor_data.csv')
In [16]:
df.head(2)
Out[16]:
date employee_title location employee_status review_title years_at_company helpful pros cons advice_to_mgmt rating_overall rating_balance rating_culture rating_career rating_comp rating_mgmt company city state
0 2008-07-22 Stock Clerk Fairlawn, OH Former Employee Working at Giant Eagle is not that "Giant NaN 0 flexible schedule, location, proximity from ho... no respect for employees, minimum wages for a ... When your employees quit their menial jobs, as... 2 3.0 2.5 2.5 2.5 NaN geagle Fairlawn OH
1 2008-09-22 Stock Clerk Pittsburgh, PA Former Employee good place to work NaN 0 Very good scheduling and management. They will... Sometime I felt like it was hard to work your ... remember the little people 4 4.0 4.0 3.5 3.5 NaN geagle Pittsburgh PA

One thing we'll want to do is parse the location column; as you can see, it contains both the city and the state, so let's split those out right now.

First let's write a quick lambda function that replaces the missing locations (scraped as 'no ') with 'no city, no state' so our later split functions work.

In [4]:
df['location'] = df['location'].apply(lambda x: 'no city, no state' if x=='no ' else x)

Next let's write two lambda functions that split the location on the comma: the city takes the first element of the resulting list and the state takes the second.

In [5]:
df['location']=df['location'].astype(str) # ensure all inputs into this column are in the string format
df['city'] = df['location'].apply(lambda x: x.split(',')[0])
df['state'] = df['location'].apply(lambda x: x.split(',')[1])
In [6]:
df[['city','state','location']].head()
Out[6]:
city state location
0 Fairlawn OH Fairlawn, OH
1 Pittsburgh PA Pittsburgh, PA
2 Pittsburgh PA Pittsburgh, PA
3 Pittsburgh PA Pittsburgh, PA
4 Pittsburgh PA Pittsburgh, PA
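As an aside, the same split can be done in a single vectorized step with the pandas .str accessor. A sketch equivalent to the two lambdas above (it deliberately keeps the leading space in state, since later cells filter on values like ' CA' and ' no state'):

In [ ]:
# Sketch: vectorized equivalent of the two split lambdas above (not run here)
city_state = df['location'].astype(str).str.split(',', expand=True)
df['city'] = city_state[0]
df['state'] = city_state[1]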

Let's remove the rows with the "no state" designation and create a new dataframe to examine review counts by state.

In [7]:
states_only = df[df['state']!=' no state']
In [8]:
new_df = pd.DataFrame({'count' : states_only.groupby('state').size()}).reset_index()
In [9]:
new_df['state'] = new_df['state'].str.strip() # strip any extra space that might cause issues with the map
In [10]:
new_df.head()
Out[10]:
state count
0 AK 15
1 AL 162
2 AR 241
3 AZ 186
4 CA 960

As you can see, this gets us exactly what we want. Now we can make a map showing the number of reviews by state.

Map of Reviews

When it comes to developing interactive plots, some packages provide better support than others. For interactive maps one great option is Plotly, which is actually both a company and a package. They provide paid solutions as well as open-source solutions for interactive graphing. If you are using it at an individual level, the open-source options will be more than sufficient in most instances.

In [11]:
import plotly.plotly as py
import plotly.graph_objs as go 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
In [21]:
data = dict(type='choropleth',
            locations = new_df['state'],
            locationmode = 'USA-states',
            colorscale= 'Viridis',
            reversescale=True,
            z = new_df['count'],
            colorbar = {'title':"Reviews Count"}
            ) 
In [22]:
layout = dict(title = 'Reviews by State',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )

Quick note: the website builder I am using is giving me trouble producing the interactive maps. You should be able to produce them by following the method above within your own Jupyter notebook. I'm looking into how to get them to publish here as well, but for the time being I am going to use static images of the maps.

In [23]:
choromap = go.Figure(data = [data],layout = layout)
In [42]:
#iplot(choromap)
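As an aside, the plotly.plotly import above only exists in Plotly releases before 4.0 (that module later moved to the separate chart_studio package). If you're on a newer Plotly, roughly the same offline map can be built with plotly.express; a sketch, assuming the new_df built above:

In [ ]:
# Sketch: the same choropleth with plotly.express on Plotly >= 4 (not run here)
import plotly.express as px

fig = px.choropleth(
    new_df,
    locations='state',                    # two-letter state codes
    locationmode='USA-states',
    color='count',
    scope='usa',
    color_continuous_scale='Viridis_r',   # reversed Viridis, matching reversescale=True above
    labels={'count': 'Reviews Count'},
    title='Reviews by State',
)
fig.show()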

[Image: choropleth map of reviews by state]

Now we can see that the northeast and northwest corners of the country have very few reviews, while California, Texas, and Florida have the most.

Next let's look at a few different companies to see if we can spot some trends. Let's compare Walmart and Target.
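Since we'll repeat the same group-by and map construction for each company, the pattern could be wrapped in a small helper. Here's a sketch built from the same data/layout dicts used above (the function name is my own); the cells below spell each step out individually instead:

In [ ]:
# Sketch: helper wrapping the per-company choropleth pattern used below (not run here)
def company_review_map(reviews, company_code, company_name):
    subset = reviews[reviews['company'] == company_code]
    grpby = pd.DataFrame({'count': subset.groupby('state').size()}).reset_index()
    grpby['state'] = grpby['state'].str.strip()  # strip spaces so state codes match the map

    data = dict(type='choropleth',
                locations=grpby['state'],
                locationmode='USA-states',
                colorscale='Viridis',
                reversescale=True,
                z=grpby['count'],
                colorbar={'title': company_name + " Reviews Count"})
    layout = dict(title=company_name + ' Reviews by State',
                  geo=dict(scope='usa', showlakes=True, lakecolor='rgb(85,173,240)'))
    return go.Figure(data=[data], layout=layout)

# e.g. iplot(company_review_map(states_only, 'wmt', 'Walmart'))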

Walmart Map

In [26]:
wmt = states_only[states_only['company']=='wmt']
In [27]:
wmt_grpby = pd.DataFrame({'count' : wmt.groupby('state').size()}).reset_index()
wmt_grpby['state'] = wmt_grpby['state'].str.strip() # strip any extra space that might cause issues with the map
In [28]:
data = dict(type='choropleth',
            locations = wmt_grpby['state'],
            locationmode = 'USA-states',
            colorscale= 'Viridis',
            reversescale = True,
            z = wmt_grpby['count'],
            colorbar = {'title':"Walmart Reviews Count"}
            ) 

layout = dict(title = 'Walmart Reviews by State',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )
In [41]:
choromap = go.Figure(data = [data],layout = layout)
#iplot(choromap)

[Image: choropleth map of Walmart reviews by state]

As you can see, most of the Walmart reviews come from Arkansas (129), Texas (117), and California (89). This makes sense, as Arkansas is home to Walmart's HQ and both AR and TX have a large number of Walmart locations.

Target Map

In [30]:
tgt = states_only[states_only['company']=='tgt']
tgt_grpby = pd.DataFrame({'count' : tgt.groupby('state').size()}).reset_index()
tgt_grpby['state'] = tgt_grpby['state'].str.strip() # strip any extra space that might cause issues with the map
In [33]:
data = dict(type='choropleth',
            locations = tgt_grpby['state'],
            locationmode = 'USA-states',
            colorscale= 'Viridis',
            reversescale=True,
            z = tgt_grpby['count'],
            colorbar = {'title':"Target Reviews Count"}
            ) 

layout = dict(title = 'Target Reviews by State',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )
In [35]:
choromap = go.Figure(data = [data],layout = layout)
#iplot(choromap)

[Image: choropleth map of Target reviews by state]

For Target, the most reviews come from California (162), Minnesota (106), and Texas (82). Again this makes sense, as Target's HQ is in Minneapolis.

Now how about more of a regional player, like Publix?

Publix Map

In [36]:
push = states_only[states_only['company']=='push']
push_grpby = pd.DataFrame({'count' : push.groupby('state').size()}).reset_index()
push_grpby['state'] = push_grpby['state'].str.strip() # strip any extra space that might cause issues with the map
In [37]:
data = dict(type='choropleth',
            locations = push_grpby['state'],
            locationmode = 'USA-states',
            colorscale= 'Viridis',
            reversescale=True,
            z = push_grpby['count'],
            colorbar = {'title':"Publix Reviews Count"}
            ) 

layout = dict(title = 'Publix Reviews by State',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )
In [40]:
choromap = go.Figure(data = [data],layout = layout)
#iplot(choromap)

[Image: choropleth map of Publix reviews by state]

For the most part this makes sense. We know that Publix was founded in central Florida and the majority of its footprint is in Florida, with a few locations in surrounding states. What is interesting is that there are a few reviews coming from states that don't have a Publix. Let's see what the individuals from California had to say.

In [26]:
push[push['state']==' CA'].head()
Out[26]:
date employee_title location employee_status review_title years_at_company helpful pros cons advice_to_mgmt rating_overall rating_balance rating_culture rating_career rating_comp rating_mgmt company city state
5257 2017-04-28 Software Engineer Los Angeles, CA Current Employee Software engineer More than 3 years 0 Great environment, friendly people and very or... The perks arent that great NaN 5 5.0 4.0 5.0 4.0 4.0 push Los Angeles CA
5529 2017-05-25 NaN Costa Mesa, CA Current Employee Accountant Review NaN 0 Less Hierarchy\nGood benefits\nPay well to emp... Unqualified employee\nUntransparent office com... Employee appreciation can be improve 4 3.0 3.0 3.0 3.0 3.0 push Costa Mesa CA

We can see here that both of these are corporate roles, so it is possible that Publix allowed them to work remotely, or perhaps they relocated before completing the review.

Most Common Job Titles?

We know that many of these jobs are in service retail, but let's see which job titles reviewers list most often across all companies.

In [27]:
df.groupby(['employee_title'])['employee_title'].count().reset_index(name='count').sort_values(['count'], ascending=False).head(10)
Out[27]:
employee_title count
256 Cashier 1101
466 Crew Member 322
831 Front End Cashier 304
546 Deli Clerk 210
1692 Sales Associate/Cashier 207
1689 Sales Associate 190
87 Assistant Manager 142
1513 Pharmacy Technician 141
1993 Team Member 136
1842 Shift Manager 123

We can see that cashier is the heavy favorite, at roughly 3.5x more than crew member, but there are several cashier-type titles in the list, such as cashier, front end cashier, and sales associate/cashier. Can we find out how many people have the word cashier anywhere in their employee title?

The first thing we'll want to do is ensure the column is a string so we can apply the .lower() string method to the titles, putting them all in the same case. Then it's as simple as seeing if each string contains the word cashier and getting a count.

In [45]:
df['employee_title'] = df['employee_title'].astype(str)
df['employee_title_low'] = df['employee_title'].apply(lambda x: x.lower())
In [46]:
df['employee_title_low'] = df['employee_title_low'].astype(str)

Let's go ahead and look at all the instances where cashier is included in the job title.

In [47]:
df[df['employee_title_low'].str.contains('cashier')]['employee_title'].count()
Out[47]:
2148
In [55]:
# Collect every lowercased title that contains the word 'cashier'
cashier_titles = [title for title in df['employee_title_low'] if 'cashier' in title]

Let's see how many different job titles we have.

In [57]:
len(set(cashier_titles))
Out[57]:
189

So we have 189 different job titles with the word cashier in them. Let's look at some of them.

In [62]:
list(set(cashier_titles))[:15]
Out[62]:
[' customer service cashier',
 ' front end (cashier assistant)',
 ' cashier/greeter',
 ' checkout cashier',
 ' cashier and crew member',
 ' cashier and salesperson',
 ' cashier/front team member',
 ' costco wholesale cashier assistant (front end assistant)',
 ' cashier/barista',
 ' front end cashier(formerly eagles nest attendant)',
 " i'm customer service staff cashier and event planner",
 ' third shift cashier',
 ' front end cashier assistant',
 ' overnight cashier/stocker',
 ' refund cashier']

Now let's write a small lambda that checks whether the word cashier appears in the job title, so we can create an additional column identifying whether or not it is a cashier job.

In [31]:
df['cashier'] = df['employee_title_low'].apply(lambda x: 1 if 'cashier' in x else 0)
In [32]:
df[df['cashier']==1]['company'].value_counts()
Out[32]:
tbell     336
sams      287
push      268
kr        229
cost      227
mcd       219
wmt       218
geagle    186
tgt       178
Name: company, dtype: int64

Now we can see that Taco Bell had the most reviews that included the word cashier in the job title and Target had the fewest.

Average Overall Ratings

Let's go ahead and look at the overall ratings across companies and then look at a few different ways to visualize these ratings.

In [33]:
df.groupby('company')['rating_overall'].describe().sort_values('mean',ascending=False)
Out[33]:
count mean std min 25% 50% 75% max
company
cost 2500.0 3.982800 1.086543 1.0 3.0 4.0 5.0 5.0
push 2500.0 3.750400 1.162343 1.0 3.0 4.0 5.0 5.0
tgt 2500.0 3.429600 1.162060 1.0 3.0 4.0 4.0 5.0
mcd 2500.0 3.298000 1.239440 1.0 3.0 3.0 4.0 5.0
tbell 2509.0 3.272220 1.256415 1.0 2.0 3.0 4.0 5.0
wmt 2509.0 3.206058 1.219601 1.0 2.0 3.0 4.0 5.0
sams 2509.0 3.159028 1.185631 1.0 2.0 3.0 4.0 5.0
kr 2500.0 3.142400 1.243680 1.0 2.0 3.0 4.0 5.0
geagle 1426.0 3.128331 1.192231 1.0 2.0 3.0 4.0 5.0

We can see here that Costco had both the highest mean and the lowest standard deviation, so people generally agree that Costco is a good company to work for. On the opposite end, Giant Eagle had the lowest mean, with a standard deviation somewhere in the middle.

Is there a good way to visualize not only the means, but also the distributions?

Violin Plots

The beauty of violin plots is that they show the kernel density estimate on the outside of the plot as well as a box plot on the interior. The white dot represents the median value.

In [34]:
sns.set(rc={'figure.figsize':(18,6)})
sns.violinplot(x='company', y='rating_overall', data=df,palette='Set3');

From this graph we can see that the majority of both Costco's and Publix's responses are either a 4 or a 5, while for most other companies the bulk of responses sit around a 3.

Rating of Culture

One thing I think is extremely important in these types of jobs is the culture rating. There isn't a large difference in the quality of these roles, and in most instances that is true for the pay as well, so one would think the work culture would have a huge impact on how much people enjoy their time with these companies. With that said, let's investigate the culture ratings.

In [35]:
df.groupby('company')['rating_culture'].describe().sort_values('mean',ascending=False)
Out[35]:
count mean std min 25% 50% 75% max
company
cost 2161.0 3.932901 1.223094 1.0 3.0 4.0 5.0 5.0
push 2164.0 3.840573 1.327193 1.0 3.0 4.0 5.0 5.0
tgt 1957.0 3.370465 1.363866 1.0 2.0 4.0 5.0 5.0
tbell 2083.0 3.113298 1.444999 1.0 2.0 3.0 4.0 5.0
geagle 1222.0 3.064648 1.328326 1.0 2.0 3.0 4.0 5.0
kr 2097.0 3.049118 1.376940 1.0 2.0 3.0 4.0 5.0
sams 2179.0 3.031207 1.356865 1.0 2.0 3.0 4.0 5.0
mcd 1774.0 3.001691 1.383774 1.0 2.0 3.0 4.0 5.0
wmt 2041.0 2.950514 1.385499 1.0 2.0 3.0 4.0 5.0

We can see that Costco and Publix have the highest culture ratings while Walmart and McDonald's have the lowest, which differs a bit from the ordering of the overall ratings. Let's see whether Walmart has a higher percentage of ratings coming from former employees compared to Costco.

In [36]:
df['employee_status_num'] = df['employee_status'].apply(lambda x: 1 if x == 'Current Employee ' else 0)
df['employee_status_num'] = df['employee_status_num'].astype(float)
In [37]:
df_wmt = df[df['company']=='wmt']
df_wmt['employee_status_num'].value_counts()[0]/df_wmt['employee_status_num'].count()
Out[37]:
0.5137504982064568

Roughly 51% of Walmart's ratings come from previous employees. Let's see what that looks like for Costco.

In [38]:
df_cost = df[df['company']=='cost']
df_cost['employee_status_num'].value_counts()[0]/df_cost['employee_status_num'].count()
Out[38]:
0.4464

Compare that with Costco, where less than 45% of the employees leaving ratings are previous employees. We can check whether this difference in proportions is statistically significant using a two-proportion z-test.

In [39]:
print('Walmart: ', df_wmt['employee_status_num'].value_counts()[0])
print('Costco: ', df_cost['employee_status_num'].value_counts()[0])
Walmart:  1289
Costco:  1116
In [40]:
from statsmodels.stats.proportion import proportions_ztest

count = np.array([1289, 1116])   # former-employee reviews (Walmart, Costco)
nobs = np.array([2509, 2500])    # total reviews per company (from the overall describe table above)
stat, pval = proportions_ztest(count, nobs)
print('{0:0.5f}'.format(pval))
0.00000

We can see that the difference is highly significant: a much larger share of Walmart's ratings come from people who have left the organization than is the case for Costco.
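For intuition, here's what proportions_ztest computes under the hood. A sketch of the two-proportion z statistic done by hand, using the totals above (2,509 Walmart reviews, 2,500 Costco reviews):

In [ ]:
# Sketch: the two-proportion z-test by hand (not run here)
from scipy.stats import norm

former = np.array([1289, 1116])        # former-employee reviews (Walmart, Costco)
totals = np.array([2509, 2500])        # total reviews per company

p1, p2 = former / totals               # sample proportions
p_pool = former.sum() / totals.sum()   # pooled proportion under H0: p1 == p2
se = np.sqrt(p_pool * (1 - p_pool) * (1 / totals[0] + 1 / totals[1]))

z = (p1 - p2) / se                     # z statistic (roughly 4.8 here)
p_value = 2 * norm.sf(abs(z))          # two-sided p-value
print(z, p_value)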

Rating Correlations

One final thing we can explore is the relationship between the rating types. My hypothesis is that they will all be fairly highly correlated, given that the ratings come from the same reviewer and the topics are closely related. But let's take a look to confirm.

In [67]:
corr_matrix = df[['rating_overall','rating_balance','rating_culture','rating_career','rating_comp','rating_mgmt']]
sns.heatmap(corr_matrix.corr(),annot=True, cmap='viridis')
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f62d8d79ef0>

As anticipated, the ratings are all fairly highly correlated, with only two pairs at 0.50 or below.

What's Next?

There are clearly a lot of ways we can examine this data. One interesting angle would be to look at the career ratings as well. People who feel they have a career with a company may be more likely to stay and/or more likely to provide a high overall rating. It also might be interesting to examine the job titles of reviewers who give their organization a high career rating; my hypothesis is that those organizations would show a wider variety of job titles.
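As a rough starting point for those ideas, a sketch (not run here) using the columns already in the dataframe:

In [ ]:
# Sketch: a first look at the career-rating ideas above (not run here)
# average career rating by company
df.groupby('company')['rating_career'].describe().sort_values('mean', ascending=False)

# most common job titles among reviews that rate career prospects a 4 or higher
high_career = df[df['rating_career'] >= 4]
high_career.groupby('employee_title_low').size().sort_values(ascending=False).head(10)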

If anyone has any other ideas for EDA I'd be happy to add to this or provide the data. In future articles I'll build ML models to predict the overall rating (and perhaps a few of the other ratings), and I'll also show how to use NLP on the free-text comments to create additional features for the models.
