Data Wrangling Survey Data in Python

As all social scientists who have collected data know, the responses are often far from perfect. You have to make decisions about whether to include data based on a number of factors.

I've gotten a lot of questions from I/Os about what I/O-specific work looks like in Python, and the process is very similar to what you traditionally do, just with different syntax. This article will walk through how we cleaned the data we used for the SIOP ML Competition.

Recently, as hosts of the 2019 SIOP Machine Learning Competition, our team had to identify criteria for inclusion in the final dataset. We wanted to ensure that respondents took the task seriously, so we had to judge whether or not an individual's responses were high quality. But we couldn't manually review all 2000+ of them, so we had to devise a strategy to do it at scale. That's where manipulating data, or data wrangling/munging, whichever term you want to use, comes into play. In this article I will walk through how we approached this, given our criteria, using the original SIOP ML dataset.

Note: Because the competition is still going on at the time of writing, I will not do any data viz or discuss any specific items (which isn't relevant to general cleaning anyway), but I will walk through the cleaning process we took.

However, I want to add that because of Shaker International's support of open source and open research, they have decided to make all of the data available after the competition is completed. This includes the 5 open-ended SJIs, the 60 item-level responses, and several demographic variables that were collected. The total sample size is 1688.

Data Wrangling

Import Packages

The first thing we always do is import our packages. We'll only be using NumPy and Pandas for this process. I also import warnings because I sometimes get ugly red warnings about keyword argument deprecation, which just means newer versions of a library have changed their keywords.

You can find the exact keywords and inputs a method will take by visiting the documentation site, or in Jupyter by hitting shift+tab while your cursor is inside the method's parentheses. The pandas DataFrame.describe() method is a good example.

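For example, once pandas is imported (next cell) you can pull up that same documentation inline; the trailing ? syntax is IPython/Jupyter-specific, while help() works in any Python session:

In [ ]:
# the trailing ? is IPython/Jupyter-only; help() works anywhere
pd.DataFrame.describe?
help(pd.DataFrame.describe)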

In [9]:
# import our packages
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
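
One caveat on that filter: 'ignore' with no arguments silences every warning, including ones you might want to see. A slightly narrower sketch that only hides deprecation-style warnings:

In [ ]:
# narrower alternative: only silence deprecation-style warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)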

Load the Dataframe

In [5]:
df = pd.read_excel('data/01_14_19_num_content.xlsx')

Next, let's get a count of how many respondents we have in the original dataset by using the .count() method.

In [6]:
df['Respondent ID'].count()
Out[6]:
1404

We have 1404 individuals in our dataset to begin with.
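
One thing worth knowing about .count(): it only counts non-null values in a column. To count every row regardless of missing data, len(df) or df.shape[0] are safer; a quick sketch:

In [ ]:
# .count() excludes missing values; these two count every row
len(df)
df.shape[0]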

Overview of the survey and the data

The survey consisted of 5 open-ended responses and 65 Likert-type items, along with a few demographic questions. The first thing we want to do is identify all the people who completed the survey. The questions were all mandatory up until the demographics section, so let's use the last Likert-type item as the indicator of whether or not someone completed the portion of the survey we need.

How would we do that? Well, let's create a new column in the dataframe that flags whether the last personality item is null, and assign it to a new variable called completed_survey by writing the code below.

In [12]:
df['completed_survey']=df['Last_P_Item'].isnull().apply(lambda x: 0 if x==True else 1)

The .isnull() method returns a boolean of True or False for each row in the column. The pandas .apply() method is extremely powerful for data wrangling, and I use it often. Additionally, when I first learned about lambda functions I thought I'd never need them; it turns out I was extremely wrong. They are very useful for writing straightforward functions in very little code. The lambda above is a very short if/else statement: return 0 if x is True (meaning the item is missing), otherwise return 1.
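
As an aside, this particular flag can also be built without .apply() at all; a minimal vectorized sketch using .notnull() and .astype():

In [ ]:
# vectorized equivalent: notnull() is True for completes,
# and astype(int) converts True/False into 1/0
df['completed_survey'] = df['Last_P_Item'].notnull().astype(int)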

In [28]:
df['completed_survey'].value_counts()
Out[28]:
1    1188
0     216
Name: completed_survey, dtype: int64

We can now see that 216 people did not answer all the required questions and 1188 did. So let's subset the dataframe to only include the people that completed the survey.

In [29]:
df_completes = df[df['completed_survey']==1]
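
One practical note: df_completes here is a slice of df, so the column assignments we make to it below can trigger pandas' SettingWithCopyWarning (part of why I filter warnings up top). If you'd rather avoid that, a sketch with .copy() makes it an independent dataframe:

In [ ]:
# .copy() makes df_completes an independent dataframe, so later
# column assignments won't raise SettingWithCopyWarning
df_completes = df[df['completed_survey']==1].copy()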

Data Wrangling the Open Ends

In our request for data we asked that all responses be at least 3-4 sentences long, so we want to flag responses that aren't very long. Those people did not follow the instructions, and their data will not be very useful for the purposes of the competition. We decided fewer than 15 words was a reasonable cutoff for flagging. So let's look at how we would do this.

Here is what I want this cell to accomplish: I want to ensure that all the open-ended responses are of type string (so I just set them to that to be safe), then I want to loop through all of the open ends and get the word count, and finally I want to assign a new variable a value of 1 if the response was at least 15 words or a 0 if it was not.

I've mentioned it before, but you can chain methods together in Python. So instead of creating two variables I can just chain both .apply() methods together and create one variable (I'll show the unchained version for contrast right after the cell). Then I will walk through each of the steps in a bit more detail.

In [30]:
open_ends = ['Q1','Q2','Q3','Q4','Q5'] # make a list of the open-ended items

for i in open_ends:
    df_completes[i]=df_completes[i].astype(str)
    df_completes[i+'_good']=df_completes[i].apply(lambda x: len(x.split())).apply(lambda x: 1 if x >=15 else 0)
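
For contrast, here is a sketch of what that last chained line would look like unchained, for a single item; word_counts is just an illustrative temporary name, not something from the original code:

In [ ]:
# unchained equivalent for Q1: two steps, one temporary variable
word_counts = df_completes['Q1'].apply(lambda x: len(x.split()))
df_completes['Q1_good'] = word_counts.apply(lambda x: 1 if x >= 15 else 0)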

Let's go through the for loop in a bit more detail.

  • First, we cast each open end with .astype(str) to ensure we can use the .split() method, which I'll talk about in a second.

  • Second, we assign a new feature/variable named after the open end plus an underscore and "good", so the end result will be, for example, Q1_good.

  • Third, let's look at the first lambda function. It takes the length of x.split(), where x represents whatever is in the specific "cell"; .split() breaks a string apart on whatever delimiter you give it, and the default with an empty call is whitespace. Let's see how it works below.

Example of what .split() is doing

In [35]:
string_1 = 'I think all I/Os should learn Python because it is so versatile and useful'

print("This is what the split method outputs: ",string_1.split())
print("This is how many words are in the string_1 object: ",len(string_1.split()))
This is what the split method outputs:  ['I', 'think', 'all', 'I/Os', 'should', 'learn', 'Python', 'because', 'it', 'is', 'so', 'versatile', 'and', 'useful']
This is how many words are in the string_1 object:  14
  • Fourth, we used another lambda if/else statement to assign the variable a value of 1 if the number of words was greater than or equal to 15; otherwise we assigned it a 0.

After we identified which open-ended responses met our criteria, we want to identify whether all of an individual's responses met the word length requirement. Again we are going to use chaining: we first sum along the row axis and then check whether that sum equals 5.

In [31]:
df_completes['all_meet_word_length_req']=df_completes[['Q1_good','Q2_good','Q3_good','Q4_good','Q5_good']].sum(axis=1).apply(lambda x: 1 if x==5 else 0)

We are first summing across the rows of these 5 variables and then applying an if/else lambda function where the sum has to equal 5 (meaning all responses were at least 15 words).
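
If you'd rather skip the lambda, a comparison-based sketch does the same thing; comparing the row sum to 5 yields booleans and .astype(int) converts them:

In [ ]:
# equivalent without the lambda: the == comparison returns
# booleans, and astype(int) converts them to 1/0
good_cols = ['Q1_good','Q2_good','Q3_good','Q4_good','Q5_good']
df_completes['all_meet_word_length_req'] = (df_completes[good_cols].sum(axis=1)==5).astype(int)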

Current Count

Let's look at how many additional people were flagged for not meeting the word length requirement.

In [32]:
df_completes['all_meet_word_length_req'].value_counts()
Out[32]:
1    1098
0      90
Name: all_meet_word_length_req, dtype: int64

As you can see, this process found 90 individuals who did not meet the requirement of having written at least 15 words for each response.

Data Wrangling Likert Items

In addition to flagging open-end length, we also added 5 attention-check items to ensure that respondents were paying attention as they went through the Likert items. So let's do something similar for each of those items as well.

Because each of these is slightly different, we can't take quite as much advantage of the efficiency we had above, where we just ran everything through a for loop. But we can still take advantage of chaining.

In this instance we gave people credit if they answered strongly disagree or disagree to each of the "opinion-based" checks, and for the questions where we asked them to mark a specific response we required exactly that response. These are all relatively straightforward if/else lambdas compared to the above. Then we again took the sum across the row, but this time we gave some leeway: respondents needed to pass at least 4 of the 5 attention checks to get credit, so they were only flagged if they missed 2 or more.

In [33]:
df_completes['Attention_1']=df_completes['Attention_Item_1'].apply(lambda x: 1 if x==5 else 0)
df_completes['Attention_2']=df_completes['Attention_Item_2'].apply(lambda x: 1 if x<=2 else 0)
df_completes['Attention_3']=df_completes['Attention_Item_3'].apply(lambda x: 1 if x==3 else 0)
df_completes['Attention_4']=df_completes['Attention_Item_4'].apply(lambda x: 1 if x<=2 else 0)
df_completes['Attention_5']=df_completes['Attention_Item_5'].apply(lambda x: 1 if x<=2 else 0)
df_completes['Good_Attention_Checks']=df_completes[['Attention_1','Attention_2','Attention_3','Attention_4','Attention_5']].sum(axis=1).apply(lambda x: 1 if x>=4 else 0)
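
This isn't what we actually ran, but if the list of attention checks ever grew, a dict-driven sketch like the one below would keep the pass rules in one place; the checks dict and its lambdas are purely illustrative:

In [ ]:
# hypothetical refactor: map each attention item to its pass rule,
# so adding a sixth check is a one-line change
checks = {
    'Attention_Item_1': lambda x: x == 5,   # "mark strongly agree"
    'Attention_Item_2': lambda x: x <= 2,   # opinion-based: expect disagreement
    'Attention_Item_3': lambda x: x == 3,   # "mark neutral"
    'Attention_Item_4': lambda x: x <= 2,
    'Attention_Item_5': lambda x: x <= 2,
}
for n, (item, rule) in enumerate(checks.items(), start=1):
    # bind rule as a default argument so each lambda keeps its own rule
    df_completes['Attention_'+str(n)] = df_completes[item].apply(lambda x, r=rule: 1 if r(x) else 0)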

Putting it all together

We said earlier we wanted them to pass the attention check section and meet the word length requirements, so now we have to put it all together. Again, let's sum those two columns; if the sum equals 2 we'll say they are valid, and if not we'll discard them as poor respondents.

In [34]:
df_completes['good_attention_&_length']=df_completes[['Good_Attention_Checks','all_meet_word_length_req']].sum(axis=1).apply(lambda x: 1 if x==2 else 0)
df_completes['good_attention_&_length'].value_counts()
Out[34]:
1    1059
0     129
Name: good_attention_&_length, dtype: int64
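
As a side note, summing two 0/1 flags and checking whether the result equals 2 is just a logical AND; an equivalent sketch using pandas' & operator:

In [ ]:
# logical-AND equivalent of summing the two flags and checking for 2
df_completes['good_attention_&_length'] = (
    (df_completes['Good_Attention_Checks']==1)
    & (df_completes['all_meet_word_length_req']==1)
).astype(int)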

We started with 1404 individuals, and after applying our cleaning criteria we now have 1059 quality responses to include in our final dataset.

Note: This isn't all the data that was used in the competition; this was just the one main collection effort. The final sample consisted of 1688 examples.

We were able to do all of that in about 15 lines of code (if you ignore the value_counts calls I used to show what it had done), and I'm not even an efficient programmer. I think this really shows the power of using a full programming language instead of SPSS syntax. Plus, did I mention Python and R are FREE!!!

TLDR;

In [13]:
df = pd.read_excel('data/01_14_19_num_content.xlsx')
df['completed_survey']=df['Is original, comes up with new ideas at work.'].isnull().apply(lambda x: 0 if x==True else 1)
df_completes = df[df['completed_survey']==1]
open_ends = ['Q1','Q2','Q3','Q4','Q5']
for i in open_ends:
    df_completes[i]=df_completes[i].astype(str)
    df_completes[i+'_good']=df_completes[i].apply(lambda x: len(x.split())).apply(lambda x: 1 if x >=15 else 0)
df_completes['all_meet_word_length_req']=df_completes[['Q1_good','Q2_good','Q3_good','Q4_good','Q5_good']].sum(axis=1).apply(lambda x: 1 if x==5 else 0)
df_completes['Attention_1']=df_completes['For attention purposes please mark agree strongly.'].apply(lambda x: 1 if x==5 else 0)
df_completes['Attention_2']=df_completes['I have never used a computer.'].apply(lambda x: 1 if x<=2 else 0)
df_completes['Attention_3']=df_completes['For attention purposes please  mark neutral; no opinion.'].apply(lambda x: 1 if x==3 else 0)
df_completes['Attention_4']=df_completes['I do not understand a word of english at work.'].apply(lambda x: 1 if x<=2 else 0)
df_completes['Attention_5']=df_completes['I can run two miles in two minutes.'].apply(lambda x: 1 if x<=2 else 0)
df_completes['Good_Attention_Checks']=df_completes[['Attention_1','Attention_2','Attention_3','Attention_4','Attention_5']].sum(axis=1).apply(lambda x: 1 if x>=4 else 0)
df_completes['good_attention_&_length']=df_completes[['Good_Attention_Checks','all_meet_word_length_req']].sum(axis=1).apply(lambda x: 1 if x==2 else 0)

In future articles I plan to go into more detail about data wrangling, data exploration (EDA), data viz during exploration, etc. But I wanted to briefly show I/Os and other social scientists used to working in SPSS or SAS that it's quick and easy to clean data in Python too. This data, while very relevant to I/Os, is being used as part of the competition, and I don't want to do any data exploration for them :) As I mentioned above, after the competition Shaker International plans to open source this data for all to explore and learn from.

Additional Resources:

I know I reference Jake VanderPlas' work a lot, but he has another great chapter on Pandas in his Python Data Science Handbook.

Another great resource for Pandas is Wes McKinney's book Python for Data Analysis; he's the guy who actually started the Pandas library and wrote the large majority of it.