Web Scraping Glassdoor for Employee Reviews


In this article I'll walk through how to scrape Glassdoor for reviews. In a future article I'll examine the data more closely, but this one will focus specifically on the scraping. The process is specific to Glassdoor, and I'll be leveraging another GitHub user's repository via a fork. I'll link to his repo, since he deserves the credit for it. This article consists of the following steps:

  1. Forking and cloning the repo
  2. Downloading the requirements and setting up your directory
  3. Glassdoor account and secret.json file
  4. Writing the bash script
  5. Making the bash script executable
  6. Running the executable bash script
  7. Merging the data
  8. A quick look at the data

1. Forking the Repo

The first thing you want to do is go to his glassdoor repository. Once you are there, you will see the option to fork his repo in the top right. If you have a GitHub account, I recommend forking it. Otherwise, you can click the Clone or download button and either git clone the web URL or download the zip and save it in your directory.

[Image: the Fork button in the top right of the repo page]

You should have a directory with glassdoor-review-scraper in it now.
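For reference, the git clone route looks roughly like this (the username below is a placeholder for your own fork or the original author's account):

# clone your fork (or the original repo) into your working directory
git clone https://github.com/<your-username>/glassdoor-review-scraper.git
cd glassdoor-review-scraper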

2. Downloading Requirements

This is a well-documented repo, so the author provides a requirements.txt file. You can follow the instructions in his README.md and run pip install -r requirements.txt in your shell, like this:

[Image: running pip install -r requirements.txt in the shell]

  • This installs pandas and Selenium, which automates the browser/Python interaction.
  • After that, you will need to download chromedriver and place it in the same directory the main.py file is located in.

Note: when I first started experimenting I was getting errors and realized I didn't have a recent enough version of Chrome. You need at least Chrome 70; I had Chrome 66, which was causing the errors.
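Putting this step together in the shell, it looks roughly like this (the download path is hypothetical; grab the chromedriver build that matches your Chrome version):

pip install -r requirements.txt
# place the chromedriver you downloaded next to main.py
mv ~/Downloads/chromedriver .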

3. Glassdoor Account and .json File

  • Ensure you have a Glassdoor account. If it's an account you actually use, I'd recommend creating a separate one, since there is always the possibility it gets flagged for a terms-of-use issue.
  • Open up a text editor (I recommend Sublime Text) and provide it with your username and password, like this:

[Image: the secret.json contents with username and password]

  • Save the file as secret.json in the same directory main.py is located in; a sketch of what that file can contain follows.
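Here is a minimal sketch of creating that file from the shell. I'm assuming the keys are username and password, so double-check the exact field names against the author's README:

# the values below are placeholders for your own credentials
cat > secret.json << 'EOF'
{
    "username": "your_glassdoor_email@example.com",
    "password": "your_password"
}
EOF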

Now we have everything we need to begin. All the hard work is already done by the author, so let's take a quick look at his code.

  • He has a file titled schema.py that shows the table schema we will get from this scrape; let's take a quick look at it:

[Image: the table schema defined in schema.py]

So, we'll get a lot of great stuff, including the date of the review, the employee's title, their location, the years with the company, their pros and cons (in text) and several ratings. A lot of data to work with.

Now, as I've mentioned several times, his documentation is great. He walks us through exactly how to do one scrape via the shell.

[Image: the README's example of a single scrape from the shell]

He even documents all of his terminal logger arguments, which usually isn't covered as well.

[Image: the README's documentation of the terminal arguments]

This tells us the following:

  • --headless runs Chrome without opening a visible browser window
  • -u or --url is the link to the company's reviews page on Glassdoor
  • -l or --limit sets the maximum number of reviews to scrape from that URL
  • -f or --file is the output file you want to save the scraped data to

Those are all the arguments we will use, but there are additional arguments you can pass for the max and min review dates as well.
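Putting those together, a single scrape from the shell looks something like this (the URL is a placeholder; use the actual Glassdoor reviews link for the company you want):

python main.py --headless --url "<glassdoor-reviews-url>" --limit 1000 -f company_reviews.csv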

This is perfect for one URL, but in his notes he mentions it takes about 25 minutes to scrape 1,000 reviews (I personally think it's faster than that). Either way, that's a long time to wait on a single company, so let's build a bash script that automatically starts the next scrape after the previous one completes.

4. The Bash Script

  • The first thing we want to do is open Sublime Text, or your text editor of choice.
  • To make it a bash script, start the file with #!/bin/bash.
  • Then add the terminal commands we want to run. For the purposes of this article I decided to scrape a bunch of entry-level hourly companies; I figured it might be fun to compare them in a later article.
    • I copied each company's reviews URL, pasted it into the script, and changed the output file to reflect that company's name. I also limited each scrape to 2,500 reviews, which I figured would be enough. (A sketch of the script appears below.)
  • So these are the companies I'm pulling:
    • Walmart
    • Target
    • Publix
    • Kroger
    • Safeway
    • Costco
    • Sam's Club
    • Giant Eagle
    • McDonald's
    • Taco Bell

[Image: the bash script with one scrape command per company]
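Here is a rough sketch of what that script can look like. The URLs are placeholders rather than the real Glassdoor links, so paste in each company's actual reviews URL and repeat the pattern for the rest of the list:

#!/bin/bash
# one scrape per company; each command starts only after the previous one finishes
python main.py --headless --url "<walmart-reviews-url>" --limit 2500 -f walmart_reviews.csv
python main.py --headless --url "<target-reviews-url>" --limit 2500 -f target_reviews.csv
python main.py --headless --url "<publix-reviews-url>" --limit 2500 -f publix_reviews.csv
# ...and so on for Kroger, Safeway, Costco, Sam's Club, Giant Eagle, McDonald's, and Taco Bell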

Note about Windows: here is a link that walks you through how to create bash scripts in Windows.

5. Making the Bash Script Executable

If you look at the files in your directory, you will notice that they are all in white text. To make the bash script executable, you have to type in the following command: chmod +x my_bash_script, like this:

[Image: running chmod +x on the bash script]

This changes the file mode from non-executable to executable via chmod +x. Now when you look at the files in your directory, the bash file should be a different color; mine is green.
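For reference, the exact commands look like this, assuming the script is saved as scraping_bash (the name used in the run step below):

chmod +x scraping_bash
ls    # the script should now show up in a different color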

[Image: the directory listing with the bash script now shown in green]

Now we have an executable bash script, so let's go run it.

6. Running the Bash Script

Running the bash script is relatively simple. You can try just typing the name of your bash script from inside the directory, but I had to prefix mine with ./, as in ./scraping_bash, to get it to run, like this:

[Image: running ./scraping_bash in the terminal]

After this, you'll likely want to walk away from your computer for a few hours, depending on how many company pages you decided to scrape and how many reviews from each site.
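If you'd rather not leave the terminal window open the whole time, one option (assuming a Unix-like shell) is to run the script in the background and log its output:

nohup ./scraping_bash > scrape.log 2>&1 &
tail -f scrape.log    # watch progress; Ctrl+C stops the tail, not the scrape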

7. Merging the Data

The first thing we need to do is load all the data and then add a company column before we concatenate everything together.

In [1]:
#import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
In [2]:
wmt = pd.read_csv('data/walmart_reviews.csv')
tgt = pd.read_csv('data/target_reviews.csv')
sams = pd.read_csv('data/sams_reviews.csv')
cost = pd.read_csv('data/costco_reviews.csv')
kr = pd.read_csv('data/kroger_reviews.csv')
geagle = pd.read_csv('data/giant_eagle_reviews.csv')
push = pd.read_csv('data/publix_reviews.csv')
mcd = pd.read_csv('data/mcd_reviews.csv')
tbell = pd.read_csv('data/tbell_reviews.csv') 
In [3]:
# tag each DataFrame with its company abbreviation before combining
df_list = [wmt, tgt, sams, cost, kr, geagle, push, mcd, tbell]
df_list_str = ['wmt', 'tgt', 'sams', 'cost', 'kr', 'geagle', 'push', 'mcd', 'tbell']
for name, frame in zip(df_list_str, df_list):
    frame['company'] = name
In [4]:
# stack all of the labeled DataFrames into one; ignore_index avoids duplicate row labels
full_df = pd.concat(df_list, ignore_index=True)

8. A Quick Look at the Data

In [5]:
full_df.groupby('company')['rating_overall'].describe()
Out[5]:
         count      mean       std  min  25%  50%  75%  max
company
cost    2500.0  3.982800  1.086543  1.0  3.0  4.0  5.0  5.0
geagle  1429.0  3.127362  1.192536  1.0  2.0  3.0  4.0  5.0
kr      2500.0  3.142400  1.243680  1.0  2.0  3.0  4.0  5.0
mcd     2500.0  3.298000  1.239440  1.0  3.0  3.0  4.0  5.0
push    2500.0  3.750400  1.162343  1.0  3.0  4.0  5.0  5.0
sams    2509.0  3.159028  1.185631  1.0  2.0  3.0  4.0  5.0
tbell   2509.0  3.272220  1.256415  1.0  2.0  3.0  4.0  5.0
tgt     2500.0  3.429600  1.162060  1.0  3.0  4.0  4.0  5.0
wmt     2509.0  3.206058  1.219601  1.0  2.0  3.0  4.0  5.0
In [60]:
# boxplot of overall rating by company
sns.set(rc={'figure.figsize': (18, 6)})
sns.boxplot(x='company', y='rating_overall', data=full_df, palette='Set3');
In [61]:
# violin plot of overall rating by company
sns.set(rc={'figure.figsize': (18, 6)})
sns.violinplot(x='company', y='rating_overall', data=full_df, palette='Set3');

Finally

We can save the combined data to a CSV if we want to use it later, which we will, since we'll explore this dataset further in future posts.

In [6]:
full_df.to_csv('data/glassdoor_data.csv',index=False)