Glassdoor Exploratory Analysis

A few months ago I scraped Glassdoor for roughly 2.5k reviews for each of 9 companies. In this article I'm going to do some exploratory data analysis (EDA). In future articles I'll build ML models to try to predict ratings and explore NLP models as well.

What is EDA and why do we do it?

Oftentimes we are given data and have a rough understanding of what is included, but don't know much beyond that. EDA is the step where we explore the data. This typically includes looking at descriptive statistics for your numerical data as well as exploring response types and patterns in your categorical data. It might also involve creating plots and charts to explore and find nuggets hidden within the data. It's the first step toward eventually providing useful and actionable insights from your data.
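A first pass usually starts with a handful of one-liners. Here's a minimal sketch (not run here, and assuming the data is already loaded into a pandas dataframe called df, as it will be below):

In [ ]:
# Minimal first-pass EDA sketch (not run here)
df.info()                             # column names, dtypes, and non-null counts
df.describe()                         # descriptive statistics for numerical columns
df['company'].value_counts()          # response counts for a categorical column
df.isnull().sum()                     # missing values per column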

In [1]:
#import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

First let's load our dataframe and take a look at the features we have.

In [2]:
df = pd.read_csv('data/glassdoor_data.csv')
In [16]:
df.head(2)
Out[16]:
date employee_title location employee_status review_title years_at_company helpful pros cons advice_to_mgmt rating_overall rating_balance rating_culture rating_career rating_comp rating_mgmt company city state
0 2008-07-22 Stock Clerk Fairlawn, OH Former Employee Working at Giant Eagle is not that "Giant NaN 0 flexible schedule, location, proximity from ho... no respect for employees, minimum wages for a ... When your employees quit their menial jobs, as... 2 3.0 2.5 2.5 2.5 NaN geagle Fairlawn OH
1 2008-09-22 Stock Clerk Pittsburgh, PA Former Employee good place to work NaN 0 Very good scheduling and management. They will... Sometime I felt like it was hard to work your ... remember the little people 4 4.0 4.0 3.5 3.5 NaN geagle Pittsburgh PA

One thing we'll want to do is parse the location column; as you can see, it contains both the city and the state, so let's split those out right now.

First let's write a quick lambda function that replaces the missing locations (scraped as 'no ') with 'no city, no state' so our later split functions work.

In [4]:
df['location'] = df['location'].apply(lambda x: 'no city, no state' if x=='no ' else x)

Next let's write two lambda functions that split the location on the comma: the city takes the first element of the resulting list and the state takes the second.

In [5]:
df['location']=df['location'].astype(str) # ensure all inputs into this column are in the string format
df['city'] = df['location'].apply(lambda x: x.split(',')[0])
df['state'] = df['location'].apply(lambda x: x.split(',')[1])
In [6]:
df[['city','state','location']].head()
Out[6]:
city state location
0 Fairlawn OH Fairlawn, OH
1 Pittsburgh PA Pittsburgh, PA
2 Pittsburgh PA Pittsburgh, PA
3 Pittsburgh PA Pittsburgh, PA
4 Pittsburgh PA Pittsburgh, PA
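As an aside, the same split can be done in a single vectorized step with the pandas .str accessor. A sketch equivalent to the two lambdas above (it deliberately keeps the leading space in state, since later cells filter on values like ' CA' and ' no state'):

In [ ]:
# Sketch: vectorized equivalent of the two split lambdas above (not run here)
city_state = df['location'].astype(str).str.split(',', expand=True)
df['city'] = city_state[0]
df['state'] = city_state[1]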

Let's remove the rows with the "no state" designation and create a new dataframe to examine review counts by state.

In [7]:
states_only = df[df['state']!=' no state']
In [8]:
new_df = pd.DataFrame({'count' : states_only.groupby('state').size()}).reset_index()
In [9]:
new_df['state'] = new_df['state'].str.strip() # strip any extra space that might cause issues with the map
In [10]:
new_df.head()
Out[10]:
state count
0 AK 15
1 AL 162
2 AR 241
3 AZ 186
4 CA 960

As you can see, this gets us exactly what we want. Now we can make a map showing the number of reviews by state.

Map of Reviews

When it comes to developing interactive plots, some packages provide better support than others. For interactive maps one great option is Plotly, which is actually both a company and a package. They provide paid solutions as well as open-source solutions for interactive graphing. If you are using it at an individual level, the open-source options will be more than sufficient in most instances.

In [11]:
import plotly.plotly as py
import plotly.graph_objs as go 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
In [21]:
data = dict(type='choropleth',
            locations = new_df['state'],
            locationmode = 'USA-states',
            colorscale= 'Viridis',
            reversescale=True,
            z = new_df['count'],
            colorbar = {'title':"Reviews Count"}
            ) 
In [22]:
layout = dict(title = 'Reviews by State',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )

Quick note: the website builder I am using is giving me trouble producing the interactive maps. You should be able to produce them by following the method above within your own Jupyter notebook. I'm looking into how to get them to publish here as well, but for the time being I am going to use static images of the maps.

In [23]:
choromap = go.Figure(data = [data],layout = layout)
In [42]:
#iplot(choromap)
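As an aside, the plotly.plotly import above only exists in Plotly releases before 4.0 (that module later moved to the separate chart_studio package). If you're on a newer Plotly, roughly the same offline map can be built with plotly.express; a sketch, assuming the new_df built above:

In [ ]:
# Sketch: the same choropleth with plotly.express on Plotly >= 4 (not run here)
import plotly.express as px

fig = px.choropleth(
    new_df,
    locations='state',                    # two-letter state codes
    locationmode='USA-states',
    color='count',
    scope='usa',
    color_continuous_scale='Viridis_r',   # reversed Viridis, matching reversescale=True above
    labels={'count': 'Reviews Count'},
    title='Reviews by State',
)
fig.show()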

[Image: choropleth map of reviews by state]

Now we can see that the northeast and northwest corners of the country have very few reviews, while California, Texas, and Florida have the most.

Next let's look at a few different companies to see if we can spot some trends. Let's compare Walmart and Target.
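Since we'll repeat the same group-by and map construction for each company, the pattern could be wrapped in a small helper. Here's a sketch built from the same data/layout dicts used above (the function name is my own); the cells below spell each step out individually instead:

In [ ]:
# Sketch: helper wrapping the per-company choropleth pattern used below (not run here)
def company_review_map(reviews, company_code, company_name):
    subset = reviews[reviews['company'] == company_code]
    grpby = pd.DataFrame({'count': subset.groupby('state').size()}).reset_index()
    grpby['state'] = grpby['state'].str.strip()  # strip spaces so state codes match the map

    data = dict(type='choropleth',
                locations=grpby['state'],
                locationmode='USA-states',
                colorscale='Viridis',
                reversescale=True,
                z=grpby['count'],
                colorbar={'title': company_name + " Reviews Count"})
    layout = dict(title=company_name + ' Reviews by State',
                  geo=dict(scope='usa', showlakes=True, lakecolor='rgb(85,173,240)'))
    return go.Figure(data=[data], layout=layout)

# e.g. iplot(company_review_map(states_only, 'wmt', 'Walmart'))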

Walmart Map

In [26]:
wmt = states_only[states_only['company']=='wmt']
In [27]:
wmt_grpby = pd.DataFrame({'count' : wmt.groupby('state').size()}).reset_index()
wmt_grpby['state'] = wmt_grpby['state'].str.strip() # strip any extra space that might cause issues with the map
In [28]:
data = dict(type='choropleth',
            locations = wmt_grpby['state'],
            locationmode = 'USA-states',
            colorscale= 'Viridis',
            reversescale = True,
            z = wmt_grpby['count'],
            colorbar = {'title':"Walmart Reviews Count"}
            ) 

layout = dict(title = 'Walmart Reviews by State',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )
In [41]:
choromap = go.Figure(data = [data],layout = layout)
#iplot(choromap)

[Image: choropleth map of Walmart reviews by state]

As you can see, most of the Walmart reviews come from Arkansas (129), Texas (117), and California (89). This makes sense, as Arkansas is home to Walmart's HQ and both AR and TX have a large number of Walmart locations.

Target Map

In [30]:
tgt = states_only[states_only['company']=='tgt']
tgt_grpby = pd.DataFrame({'count' : tgt.groupby('state').size()}).reset_index()
tgt_grpby['state'] = tgt_grpby['state'].str.strip() # strip any extra space that might cause issues with the map
In [33]:
data = dict(type='choropleth',
            locations = tgt_grpby['state'],
            locationmode = 'USA-states',
            colorscale= 'Viridis',
            reversescale=True,
            z = tgt_grpby['count'],
            colorbar = {'title':"Target Reviews Count"}
            ) 

layout = dict(title = 'Target Reviews by State',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )
In [35]:
choromap = go.Figure(data = [data],layout = layout)
#iplot(choromap)

[Image: choropleth map of Target reviews by state]

For Target, the most reviews come from California (162), Minnesota (106), and Texas (82). Again this makes sense, as Target's HQ is in Minneapolis.

Now how about more of a regional player, like Publix?

Publix Map

In [36]:
push = states_only[states_only['company']=='push']
push_grpby = pd.DataFrame({'count' : push.groupby('state').size()}).reset_index()
push_grpby['state'] = push_grpby['state'].str.strip() # strip any extra space that might cause issues with the map
In [37]:
data = dict(type='choropleth',
            locations = push_grpby['state'],
            locationmode = 'USA-states',
            colorscale= 'Viridis',
            reversescale=True,
            z = push_grpby['count'],
            colorbar = {'title':"Publix Reviews Count"}
            ) 

layout = dict(title = 'Publix Reviews by State',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )
In [40]:
choromap = go.Figure(data = [data],layout = layout)
#iplot(choromap)

[Image: choropleth map of Publix reviews by state]

For the most part this makes sense. We know that Publix was founded in central Florida and the majority of its footprint is in Florida, with a few locations in surrounding states. What is interesting is that there are a few reviews coming from states that don't have a Publix. Let's see what the individuals from California had to say.

In [26]:
push[push['state']==' CA'].head()
Out[26]:
date employee_title location employee_status review_title years_at_company helpful pros cons advice_to_mgmt rating_overall rating_balance rating_culture rating_career rating_comp rating_mgmt company city state
5257 2017-04-28 Software Engineer Los Angeles, CA Current Employee Software engineer More than 3 years 0 Great environment, friendly people and very or... The perks arent that great NaN 5 5.0 4.0 5.0 4.0 4.0 push Los Angeles CA
5529 2017-05-25 NaN Costa Mesa, CA Current Employee Accountant Review NaN 0 Less Hierarchy\nGood benefits\nPay well to emp... Unqualified employee\nUntransparent office com... Employee appreciation can be improve 4 3.0 3.0 3.0 3.0 3.0 push Costa Mesa CA

We can see here that both of these are corporate roles, so it is possible that Publix allowed them to work remotely, or perhaps they relocated before completing the review.

Most Common Job Titles?

We know that many of these jobs are in service retail, but let's see which job titles reviewers list most often across all companies.

In [27]:
df.groupby(['employee_title'])['employee_title'].count().reset_index(name='count').sort_values(['count'], ascending=False).head(10)
Out[27]:
employee_title count
256 Cashier 1101
466 Crew Member 322
831 Front End Cashier 304
546 Deli Clerk 210
1692 Sales Associate/Cashier 207
1689 Sales Associate 190
87 Assistant Manager 142
1513 Pharmacy Technician 141
1993 Team Member 136
1842 Shift Manager 123

We can see that cashier is the heavy favorite, at roughly 3.5x more than crew member, but there are several cashier-type titles in the list, such as cashier, front end cashier, and sales associate/cashier. Can we find out how many people have the word cashier anywhere in their employee title?

The first thing we'll want to do is ensure the column is a string so we can apply the .lower() string method to the titles, putting them all in the same case. Then it's as simple as seeing if each string contains the word cashier and getting a count.

In [45]:
df['employee_title'] = df['employee_title'].astype(str)
df['employee_title_low'] = df['employee_title'].apply(lambda x: x.lower())
In [46]:
df['employee_title_low'] = df['employee_title_low'].astype(str)

Let's go ahead and look at all the instances where cashier is included in the job title.

In [47]:
df[df['employee_title_low'].str.contains('cashier')]['employee_title'].count()
Out[47]:
2148
In [55]:
# Collect every lowercased title that contains the word 'cashier'
cashier_titles = [title for title in df['employee_title_low'] if 'cashier' in title]

Let's see how many different job titles we have.

In [57]:
len(set(cashier_titles))
Out[57]:
189

So we have 189 different job titles with the word cashier in them. Let's look at some of them.

In [62]:
list(set(cashier_titles))[:15]
Out[62]:
[' customer service cashier',
 ' front end (cashier assistant)',
 ' cashier/greeter',
 ' checkout cashier',
 ' cashier and crew member',
 ' cashier and salesperson',
 ' cashier/front team member',
 ' costco wholesale cashier assistant (front end assistant)',
 ' cashier/barista',
 ' front end cashier(formerly eagles nest attendant)',
 " i'm customer service staff cashier and event planner",
 ' third shift cashier',
 ' front end cashier assistant',
 ' overnight cashier/stocker',
 ' refund cashier']

Now let's write a small lambda that checks whether the word cashier appears in the job title, so we can create an additional column identifying whether or not it is a cashier job.

In [31]:
df['cashier'] = df['employee_title_low'].apply(lambda x: 1 if 'cashier' in x else 0)
In [32]:
df[df['cashier']==1]['company'].value_counts()
Out[32]:
tbell     336
sams      287
push      268
kr        229
cost      227
mcd       219
wmt       218
geagle    186
tgt       178
Name: company, dtype: int64

Now we can see that Taco Bell had the most reviews that included the word cashier in the job title and Target had the fewest.

Average Overall Ratings

Let's go ahead and look at the overall ratings across companies and then look at a few different ways to visualize these ratings.

In [33]:
df.groupby('company')['rating_overall'].describe().sort_values('mean',ascending=False)
Out[33]:
count mean std min 25% 50% 75% max
company
cost 2500.0 3.982800 1.086543 1.0 3.0 4.0 5.0 5.0
push 2500.0 3.750400 1.162343 1.0 3.0 4.0 5.0 5.0
tgt 2500.0 3.429600 1.162060 1.0 3.0 4.0 4.0 5.0
mcd 2500.0 3.298000 1.239440 1.0 3.0 3.0 4.0 5.0
tbell 2509.0 3.272220 1.256415 1.0 2.0 3.0 4.0 5.0
wmt 2509.0 3.206058 1.219601 1.0 2.0 3.0 4.0 5.0
sams 2509.0 3.159028 1.185631 1.0 2.0 3.0 4.0 5.0
kr 2500.0 3.142400 1.243680 1.0 2.0 3.0 4.0 5.0
geagle 1426.0 3.128331 1.192231 1.0 2.0 3.0 4.0 5.0

We can see here that Costco had both the highest mean and the lowest standard deviation, so people generally agree that Costco is a good company to work for. On the opposite end, Giant Eagle had the lowest mean, with a standard deviation somewhere in the middle.

Is there a good way to visualize not only the means, but also the distributions?

Violin Plots

The beauty of violin plots is that they show the kernel density estimate on the outside of the plot as well as a box plot on the interior. The white dot represents the median value.

In [34]:
sns.set(rc={'figure.figsize':(18,6)})
sns.violinplot(x='company', y='rating_overall', data=df,palette='Set3');

From this graph we can see that the majority of both Costco's and Publix's responses are either a 4 or a 5, while for most other companies the bulk of responses sit around a 3.

Rating of Culture

One thing I think is extremely important in these types of jobs is the culture rating. There isn't a large difference in the quality of these roles, and in most instances that is true for the pay as well, so one would think the work culture would have a huge impact on how much people enjoy their time with these companies. With that said, let's investigate the culture ratings.

In [35]:
df.groupby('company')['rating_culture'].describe().sort_values('mean',ascending=False)
Out[35]:
count mean std min 25% 50% 75% max
company
cost 2161.0 3.932901 1.223094 1.0 3.0 4.0 5.0 5.0
push 2164.0 3.840573 1.327193 1.0 3.0 4.0 5.0 5.0
tgt 1957.0 3.370465 1.363866 1.0 2.0 4.0 5.0 5.0
tbell 2083.0 3.113298 1.444999 1.0 2.0 3.0 4.0 5.0
geagle 1222.0 3.064648 1.328326 1.0 2.0 3.0 4.0 5.0
kr 2097.0 3.049118 1.376940 1.0 2.0 3.0 4.0 5.0
sams 2179.0 3.031207 1.356865 1.0 2.0 3.0 4.0 5.0
mcd 1774.0 3.001691 1.383774 1.0 2.0 3.0 4.0 5.0
wmt 2041.0 2.950514 1.385499 1.0 2.0 3.0 4.0 5.0

We can see that Costco and Publix have the highest culture ratings while Walmart and McDonald's have the lowest, which differs a bit from the ordering of the overall ratings. Let's see whether Walmart has a higher percentage of ratings coming from former employees compared to Costco.

In [36]:
df['employee_status_num'] = df['employee_status'].apply(lambda x: 1 if x == 'Current Employee ' else 0)
df['employee_status_num'] = df['employee_status_num'].astype(float)
In [37]:
df_wmt = df[df['company']=='wmt']
df_wmt['employee_status_num'].value_counts()[0]/df_wmt['employee_status_num'].count()
Out[37]:
0.5137504982064568

Roughly 51% of Walmart's ratings come from previous employees. Let's see what that looks like for Costco.

In [38]:
df_cost = df[df['company']=='cost']
df_cost['employee_status_num'].value_counts()[0]/df_cost['employee_status_num'].count()
Out[38]:
0.4464

Compare that with Costco, where less than 45% of the employees leaving ratings are previous employees. We can check whether this difference in proportions is statistically significant using a two-proportion z-test.

In [39]:
print('Walmart: ', df_wmt['employee_status_num'].value_counts()[0])
print('Costco: ', df_cost['employee_status_num'].value_counts()[0])
Walmart:  1289
Costco:  1116
In [40]:
from statsmodels.stats.proportion import proportions_ztest

count = np.array([1289, 1116])   # former-employee reviews (Walmart, Costco)
nobs = np.array([2509, 2500])    # total reviews per company (from the overall describe table above)
stat, pval = proportions_ztest(count, nobs)
print('{0:0.5f}'.format(pval))
0.00000

We can see that the difference is highly significant: a much larger share of Walmart's ratings come from people who have left the organization than is the case for Costco.
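For intuition, here's what proportions_ztest computes under the hood. A sketch of the two-proportion z statistic done by hand, using the totals above (2,509 Walmart reviews, 2,500 Costco reviews):

In [ ]:
# Sketch: the two-proportion z-test by hand (not run here)
from scipy.stats import norm

former = np.array([1289, 1116])        # former-employee reviews (Walmart, Costco)
totals = np.array([2509, 2500])        # total reviews per company

p1, p2 = former / totals               # sample proportions
p_pool = former.sum() / totals.sum()   # pooled proportion under H0: p1 == p2
se = np.sqrt(p_pool * (1 - p_pool) * (1 / totals[0] + 1 / totals[1]))

z = (p1 - p2) / se                     # z statistic (roughly 4.8 here)
p_value = 2 * norm.sf(abs(z))          # two-sided p-value
print(z, p_value)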

Rating Correlations

One final thing we can explore is the relationship between the rating types. My hypothesis is that they will all be fairly highly correlated, given that the ratings come from the same reviewer and the topics are closely related. But let's take a look to confirm.

In [67]:
corr_matrix = df[['rating_overall','rating_balance','rating_culture','rating_career','rating_comp','rating_mgmt']]
sns.heatmap(corr_matrix.corr(),annot=True, cmap='viridis')
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f62d8d79ef0>

As anticipated, the ratings are all fairly highly correlated, with only two pairs at 0.50 or below.

What's Next?

There are clearly a lot of ways we can examine this data. One interesting angle would be to look at the career ratings as well. People who feel they have a career with a company may be more likely to stay and/or more likely to provide a high overall rating. It also might be interesting to examine the job titles of reviewers who give their organization a high career rating; my hypothesis is that those organizations would show a wider variety of job titles.
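As a rough starting point for those ideas, a sketch (not run here) using the columns already in the dataframe:

In [ ]:
# Sketch: a first look at the career-rating ideas above (not run here)
# average career rating by company
df.groupby('company')['rating_career'].describe().sort_values('mean', ascending=False)

# most common job titles among reviews that rate career prospects a 4 or higher
high_career = df[df['rating_career'] >= 4]
high_career.groupby('employee_title_low').size().sort_values(ascending=False).head(10)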

If anyone has any other ideas for EDA I'd be happy to add to this or provide the data. In future articles I'll build ML models to predict the overall rating (and perhaps a few of the other ratings), and I'll also show how to use NLP on the free-text comments to create additional features for the models.
