Glassdoor Exploratory Analysis¶
A few months ago I scraped Glassdoor for roughly 2.5k reviews per company across 9 companies. In this article I'm going to do some exploratory data analysis (EDA). In future articles I'll build ML models to try to predict ratings and explore NLP models as well.
What is EDA and why do we do it?¶
Often we are given data with only a rough idea of what it contains. EDA is the step where we explore the data. This typically includes looking at descriptive statistics for your numerical data as well as exploring response types and patterns in your categorical data. It also might involve creating plots and charts to find nuggets hidden within the data. It's the first step toward eventually providing useful, actionable insights from your data.
#import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
First let's load our dataframe and take a look at the features we have.
df = pd.read_csv('data/glassdoor_data.csv')
df.head(2)
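Alongside head(), a quick pass with info() and describe() gives a feel for the column types, missing values, and the numeric rating distributions. A minimal sketch (the rating column names match the ones used later in this notebook):
# Column dtypes, non-null counts, and memory usage
df.info()
# Summary statistics for the numeric rating columns
df[['rating_overall','rating_balance','rating_culture','rating_career','rating_comp','rating_mgmt']].describe()
# Review counts per company and missing values per column
df['company'].value_counts()
df.isnull().sum()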
One thing we'll want to do is parse the location, since each entry contains both the city and the state, so let's do that right now.
First let's write a quick lambda function that replaces the placeholder left for missing locations with 'no city, no state' so our later split functions work.
df['location'] = df['location'].apply(lambda x: 'no city, no state' if x=='no ' else x)  # 'no ' is the placeholder the scraper left for missing locations
Next let's write two lambda functions that split the location on the comma: the city takes the first element of the resulting list and the state takes the second.
df['location']=df['location'].astype(str) # ensure all inputs into this column are in the string format
df['city'] = df['location'].apply(lambda x: x.split(',')[0])
df['state'] = df['location'].apply(lambda x: x.split(',')[1])
df[['city','state','location']].head()
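As an aside, the same split can be done in one vectorized step; a quick sketch that should produce the same two columns (with the leading space on the state still present):
# Split on the first comma and expand into two new columns
df[['city', 'state']] = df['location'].str.split(',', n=1, expand=True)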
Let's drop the rows with the "no state" designation and create a new dataframe with the review counts by state.
states_only = df[df['state']!=' no state']
new_df = pd.DataFrame({'count' : states_only.groupby('state').size()}).reset_index()
new_df['state'] = new_df['state'].str.strip() # strip any extra space that might cause issues with the map
new_df.head()
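As a quick aside, the same per-state counts could also be pulled straight from value_counts; a minimal equivalent:
# Counts of reviews per state, sorted largest to smallest
states_only['state'].str.strip().value_counts().head()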
The groupby gives us exactly what we want. Now we can make a map looking at the number of reviews by state.
Map of Reviews¶
When it comes to developing interactive plots, some packages provide better support than others. For interactive maps one great option is Plotly, which is both a company and a package. They provide paid as well as open source solutions for interactive graphing. If you are using it at an individual level, the open source options will be more than sufficient in most instances.
# only the offline pieces are needed here; the online plotly.plotly module isn't used
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
data = dict(type='choropleth',
            locations=new_df['state'],
            locationmode='USA-states',
            colorscale='Viridis',
            reversescale=True,
            z=new_df['count'],
            colorbar={'title': "Reviews Count"})
layout = dict(title='Reviews by State',
              geo=dict(scope='usa',
                       showlakes=True,
                       lakecolor='rgb(85,173,240)'))
Quick note: the website builder I am using is giving me trouble producing the maps. You should be able to produce interactive maps by following the method above within your own Jupyter notebook. I'm currently looking into how to get them to publish as well; for the time being I'll use images of the maps.
choromap = go.Figure(data = [data],layout = layout)
#iplot(choromap)
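As a follow-up to the note above about publishing: if the interactive figure needs to live outside the notebook, one option is to write it to a standalone HTML file that opens in any browser (a sketch using the offline plot function imported earlier; the filename is arbitrary):
# Write the figure to a self-contained HTML file instead of rendering inline
plot(choromap, filename='reviews_by_state.html', auto_open=False)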
Now we can see that up in the northeast and northwest areas of the country we have very few reviews. In California, Texas, and Florida we have the most reviews.
Next let's quickly compare a few different companies to see if any trends stand out. Let's start with Walmart and Target.
Walmart Map¶
wmt = states_only[states_only['company']=='wmt']
wmt_grpby = pd.DataFrame({'count' : wmt.groupby('state').size()}).reset_index()
wmt_grpby['state'] = wmt_grpby['state'].str.strip() # strip any extra space that might cause issues with the map
data = dict(type='choropleth',
            locations=wmt_grpby['state'],
            locationmode='USA-states',
            colorscale='Viridis',
            reversescale=True,
            z=wmt_grpby['count'],
            colorbar={'title': "Walmart Reviews Count"})
layout = dict(title='Walmart Reviews by State',
              geo=dict(scope='usa',
                       showlakes=True,
                       lakecolor='rgb(85,173,240)'))
choromap = go.Figure(data = [data],layout = layout)
#iplot(choromap)
As you can see, most of our Walmart reviews come from Arkansas (129), Texas (117), and California (89). This makes sense, as Arkansas is home to Walmart's HQ and both AR and TX have a large number of Walmart locations.
Target Map¶
tgt = states_only[states_only['company']=='tgt']
tgt_grpby = pd.DataFrame({'count' : tgt.groupby('state').size()}).reset_index()
tgt_grpby['state'] = tgt_grpby['state'].str.strip() # strip any extra space that might cause issues with the map
data = dict(type='choropleth',
            locations=tgt_grpby['state'],
            locationmode='USA-states',
            colorscale='Viridis',
            reversescale=True,
            z=tgt_grpby['count'],
            colorbar={'title': "Target Reviews Count"})
layout = dict(title='Target Reviews by State',
              geo=dict(scope='usa',
                       showlakes=True,
                       lakecolor='rgb(85,173,240)'))
choromap = go.Figure(data = [data],layout = layout)
#iplot(choromap)
For Target the most reviews come from California (162), Minnesota (106), and Texas (82). Again this makes sense, as Target's HQ is in Minneapolis.
Now how about more of a regional player, like Publix?
Publix Map¶
push = states_only[states_only['company']=='push']
push_grpby = pd.DataFrame({'count' : push.groupby('state').size()}).reset_index()
push_grpby['state'] = push_grpby['state'].str.strip() # strip any extra space that might cause issues with the map
data = dict(type='choropleth',
            locations=push_grpby['state'],
            locationmode='USA-states',
            colorscale='Viridis',
            reversescale=True,
            z=push_grpby['count'],
            colorbar={'title': "Publix Reviews Count"})
layout = dict(title='Publix Reviews by State',
              geo=dict(scope='usa',
                       showlakes=True,
                       lakecolor='rgb(85,173,240)'))
choromap = go.Figure(data = [data],layout = layout)
#iplot(choromap)
For the most part this makes sense. We know that Publix was founded in central Florida and that the majority of their footprint is in Florida, with a few locations in surrounding states. What is interesting is that a few reviews come from states that don't have a Publix. Let's see what the individuals from California had to say.
push[push['state']==' CA'].head()
We can see here that both of these jobs are corporate roles, so it is possible that Publix allowed them to work remotely, or perhaps they moved before they completed the review.
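Before moving on, a quick refactoring note: since we built essentially the same choropleth three times, the steps could be wrapped in a small helper. A sketch (the function name and signature are my own, not from the original notebook):
def state_review_map(reviews, title, colorbar_title):
    # Count reviews by state, stripping stray whitespace from the state codes
    counts = reviews.groupby(reviews['state'].str.strip()).size().reset_index(name='count')
    data = dict(type='choropleth',
                locations=counts['state'],
                locationmode='USA-states',
                colorscale='Viridis',
                reversescale=True,
                z=counts['count'],
                colorbar={'title': colorbar_title})
    layout = dict(title=title,
                  geo=dict(scope='usa', showlakes=True, lakecolor='rgb(85,173,240)'))
    return go.Figure(data=[data], layout=layout)

# Example: rebuild the Publix map in one call
# iplot(state_review_map(states_only[states_only['company']=='push'], 'Publix Reviews by State', 'Publix Reviews Count'))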
Most Common Job Titles?¶
We know that many of these jobs are in service retail, but let's see which job titles reviewers list most often across all companies.
df.groupby(['employee_title'])['employee_title'].count().reset_index(name='count').sort_values(['count'], ascending=False).head(10)
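As a quick aside, the same top-10 list can be produced a little more directly with value_counts:
df['employee_title'].value_counts().head(10)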
We can see that cashier is the heavy favorite, at roughly 3.5x more than crew member, but there are several variations of the cashier title, such as cashier, front end cashier, and sales associate/cashier. Can we find out how many people have the word cashier somewhere in their employee title?
The first thing we'll want to do is ensure the column is a string, so we can apply the .lower() string method and get all of the titles in the same case. Then it's as simple as checking whether the string contains the word cashier and getting a count.
df['employee_title'] = df['employee_title'].astype(str)
df['employee_title_low'] = df['employee_title'].str.lower()
Let's go ahead and look at all the instances where cashier is included in the job title.
df[df['employee_title_low'].str.contains('cashier')]['employee_title'].count()
# Collect every lowercased title that contains the word cashier
cashier_titles = []
for title in df['employee_title_low']:
    if 'cashier' in title:
        cashier_titles.append(title)
Let's see how many different job titles we have.
len(set(cashier_titles))
So we have 189 different job titles with the word cashier in them. Let's look at some of them.
list(set(cashier_titles))[:15]
Now let's write a little lambda that flags every case where the word cashier appears in the job title, so we can create an additional column identifying whether or not it is a cashier job.
df['cashier'] = df['employee_title_low'].apply(lambda x: 1 if 'cashier' in x else 0)
df[df['cashier']==1]['company'].value_counts()
Now we can see that Taco Bell had the most reviews that included the word cashier in the job title and Target had the fewest.
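The same flagging pattern extends naturally to other title keywords; a sketch, where the keyword list is purely illustrative rather than part of the original analysis:
# Build one indicator column per keyword (illustrative keyword list)
for kw in ['cashier', 'crew', 'manager']:
    df['title_' + kw] = df['employee_title_low'].str.contains(kw).astype(int)
# How many reviews fall into each bucket
df[['title_cashier', 'title_crew', 'title_manager']].sum()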
Average Overall Ratings¶
Let's go ahead and look at the overall ratings across companies and then look at a few different ways to visualize these ratings.
df.groupby('company')['rating_overall'].describe().sort_values('mean',ascending=False)
We can see here that Costco had both the highest mean and the lowest standard deviation, so people in general agree Costco is a good company to work for. On the opposite end Giant Eagle had the lowest mean, with a standard deviation somewhere in the middle.
Is there a good way to visualize not only the means, but also the distributions?
Violin Plots¶
The beauty of violin plots is that they show the kernel density estimate on the outside of the plot as well as a box plot on the interior. The white dot represents the median value.
sns.set(rc={'figure.figsize':(18,6)})
sns.violinplot(x='company', y='rating_overall', data=df,palette='Set3');
From this graph we can see that the majority of both Costco and Publix's responses are either a 4 or a 5, while for most other companies the majority of responses are in the 3 range.
Rating of Culture¶
One thing I think is extremely important for these job types is the culture rating. There isn't a large difference in the nature of these roles from company to company, and in most instances that is true for the pay as well, so one would think the work culture has a huge impact on how much people enjoy their time with these companies. With that said, let's investigate the culture ratings.
df.groupby('company')['rating_culture'].describe().sort_values('mean',ascending=False)
We can see that Costco and Publix have the highest culture ratings while Walmart and McDonald's have the lowest, which differs somewhat from the overall ratings. Let's see whether Walmart has a higher percentage of ratings coming from former employees compared to Costco.
df['employee_status_num'] = df['employee_status'].apply(lambda x: 1 if x == 'Current Employee ' else 0)  # note the trailing space in the scraped status text
df['employee_status_num'] = df['employee_status_num'].astype(float)
df_wmt = df[df['company']=='wmt']
df_wmt['employee_status_num'].value_counts()[0]/df_wmt['employee_status_num'].count()
Roughly 51% of the Walmart ratings come from previous employees. Let's see what that looks like for Costco.
df_cost = df[df['company']=='cost']
df_cost['employee_status_num'].value_counts()[0]/df_cost['employee_status_num'].count()
Compare that with Costco, where less than 45% of the employees leaving ratings are previous employees. We can check whether this difference in proportions is statistically significant using a proportions z-test.
print('Walmart: ', df_wmt['employee_status_num'].value_counts()[0])
print('Costco: ', df_cost['employee_status_num'].value_counts()[0])
from statsmodels.stats.proportion import proportions_ztest
count = np.array([1289, 1116])   # former-employee review counts for Walmart and Costco (printed above)
nobs = np.array([2041, 2161])    # number of observations for each company
stat, pval = proportions_ztest(count, nobs)
print('{0:0.5f}'.format(pval))
We can see that this is highly significant: a much larger share of the Walmart ratings come from people who have left the organization compared to the Costco ratings.
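For intuition, here is roughly what proportions_ztest does under the hood, computed by hand with a pooled proportion (a sketch reusing the same counts as above):
from scipy.stats import norm
count = np.array([1289, 1116])
nobs = np.array([2041, 2161])
p1, p2 = count / nobs                      # sample proportions
p_pool = count.sum() / nobs.sum()          # pooled proportion under H0: p1 == p2
se = np.sqrt(p_pool * (1 - p_pool) * (1/nobs[0] + 1/nobs[1]))
z = (p1 - p2) / se
pval = 2 * (1 - norm.cdf(abs(z)))          # two-sided p-value
print('{0:0.5f}'.format(pval))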
Rating Correlations¶
One final thing we can explore is the relationship between the rating types. My hypothesis is that all of them will be fairly highly correlated, given the ratings come from the same reviewer and the topics are closely related. Let's take a look to confirm.
rating_cols = df[['rating_overall','rating_balance','rating_culture','rating_career','rating_comp','rating_mgmt']]
sns.heatmap(rating_cols.corr(), annot=True, cmap='viridis');
As you can see, and as anticipated, the ratings are fairly highly correlated, with only two pairs at 0.50 or below.
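To pull the strongest relationships out programmatically rather than eyeballing the heatmap, the upper triangle of the correlation matrix can be stacked and sorted; a quick sketch using the columns above:
corr = rating_cols.corr()
# Keep each pair once by masking everything except the upper triangle
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
upper.stack().sort_values(ascending=False)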
What's Next?¶
There are clearly a lot of ways we can examine this data. One interesting angle would be to look at the career ratings as well. People who feel they have a career with a company may be more likely to stay and/or more likely to provide a high overall rating. It also might be interesting to examine job titles for those who give the organization a high career rating; my hypothesis is that there would be a wider variety of job titles in those organizations.
If anyone has other ideas for EDA I'd be happy to add to this or provide the data. In future articles I'll build ML models to predict the overall rating (and perhaps a few others), and I'll also show how to use NLP on the comments that were left to create additional features for the model.