Natural Language Processing Using the 2019 SIOP Machine Learning Competition Data¶
In recent years academics and practitioners alike have realized that while structured surveys and assessments are valuable, employees and candidates can provide much richer data via the written or spoken word. In this article I'll leverage the 2019 SIOP Machine Learning Competition dataset to explore using NLP to create features to predict outcomes of interest. In future articles I'll create additional features, use bag-of-words representations, and even leverage deep learning methods. I hope this is engaging for readers because the competition comes with a benchmark, the leaderboard, so we can see how each method ranks against the other teams.
The 2019 SIOP Machine Learning Competition¶
The 2019 SIOP Machine Learning Competition was an NLP task where the goal was to predict someone's Big 5 trait scores from a series of 5 open-ended situational judgment items, each designed to elicit one of the Big 5 traits. The metric of interest was the mean correlation across all traits. The training set had 1,088 respondents who each answered 5 questions, for a total of 5,440 open-ended responses; the public and private leaderboard datasets each had 300 respondents, for 1,500 open-ended responses apiece. Below are some screenshots from the competition website.
Personality Traits¶
For the purpose of this article I'm going to assume most readers are familiar with the Big 5 personality traits. For those who are not, Wikipedia is a good starting point, and for simplicity you can remember the traits by the acronym OCEAN.
Items¶
As mentioned each item was designed to elicit a specific trait. Let's quickly look at the items under the trait they were designed to elicit.
Openness To Experience¶
- "The company closed a deal with a client from Norway and asks who would like to volunteer to be involved on the project. That person would have to learn some things about the country and culture but doesn't necessarily need to travel. Would you find this experience enjoyable or boring? Why?"
Conscientiousness¶
- "You have a project due in two weeks. Your workload is light leading up to the due date. You have confidence in your ability to handle the project, but are aware sometimes your boss gives you last tasks that can take significant amounts of time and attention. How would you handle this project and why?"
Extraversion¶
- "You and a colleague have had a long day at work and you just find out you have been invited to a networking meeting with one of your largest clients. Your colleague is leaning towards not going and if they don't go you won’t know anyone there. What would you do and why?"
Agreeableness¶
- "The company closed a deal with a client from Norway and asks who would like to volunteer to be involved on the project. That person would have to learn some things about the country and culture but doesn't necessarily need to travel. Would you find this experience enjoyable or boring? Why?"
Neuroticism¶
- "Your manager just gave you some negative feedback at work. You don’t agree with the feedback and don’t believe that it is true. Yet the feedback could carry real consequences (e.g., losing your annual bonus). How do you feel about this situation? What would you do?"
Let's start the way we would any project: by importing some of our packages and doing a little bit of Exploratory Data Analysis (EDA).
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# load the data
df = pd.read_csv('data/siop_ml_train_full.csv')
list(df.columns)
Looking at the columns included in the dataset, we can see the 5 open-ended responses, the trait scores for each respondent, and a label column I added so the data could be split identically to the way it was split for the competition. The first thing we want to do is ensure that each of the open-ended response columns contains strings. It's a small formatting step that could cause errors later if we don't do it up front, so I'll create a list of column names and then write a quick for loop to cast them all to strings.
oes = ['open_ended_1',
'open_ended_2',
'open_ended_3',
'open_ended_4',
'open_ended_5']
for i in oes:
    df[i] = df[i].astype(str)
Examining the Criterion¶
Let's look at the trait level score distributions.
df[['E_Scale_score','A_Scale_score','O_Scale_score','C_Scale_score','N_Scale_score']].describe()
What do we immediately notice?¶
- Everyone is pretty positive; hardly any of the trait means are near the scale midpoint of 3.
- The conscientiousness scale is extremely skewed, with a mean score of over 4.4 out of 5. This could cause problems for us: with such a high mean and small standard deviation there is little variance left to predict (the histograms below make the skew easy to see).
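To see the skew visually, a quick set of histograms does the job (a simple sketch using the plotting libraries we already imported):
# Histograms of the five trait scores; conscientiousness piles up near the top of the scale
trait_cols = ['E_Scale_score', 'A_Scale_score', 'O_Scale_score', 'C_Scale_score', 'N_Scale_score']
df[trait_cols].hist(figsize=(12, 6), bins=20)
plt.tight_layout()
plt.show()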
Examining the Predictors¶
Let's take a look at the responses as well.
Let's look at length, common words, etc.
For simplicity's sake let's just focus on the neuroticism item. Another option would be to concatenate all five responses together and ignore the fact that they answer different prompts, but for this article we'll keep them separate (though a quick one-liner for the concatenated version is sketched below).
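For reference, here's that one-liner; all_text is just a hypothetical column name we won't use again:
# Combine all five open-ended responses into a single text field per respondent
df['all_text'] = df[oes].apply(' '.join, axis=1)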
Length¶
Let's write a quick for loop that leverages the oes list above and create new columns for the length of each open-ended response.
for i in oes:
    col = str(i) + "_len"
    df[col] = df[i].apply(lambda x: len(x.split(" ")))
df['open_ended_4_len'].describe()
So the longest response to the neuroticism prompt is 235 words and the shortest is 15, with a median response length of 49 words.
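A quick way to compare lengths across all five prompts is a box plot over the *_len columns we just created (a simple sketch):
# Compare response lengths across the five open-ended prompts
len_cols = [c + "_len" for c in oes]
df[len_cols].plot(kind='box', figsize=(10, 5))
plt.ylabel('Response length (words)')
plt.show()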
Most Common Words¶
To get the word counts we'll want to do a small bit of pre-processing that removes punctuation and stopwords, as it will almost certainly be the case that a few stopwords like "I", "the", etc. will be the most common if we do not.
- The first cell creates a list of all responses to open_ended_4 and then joins them all together.
- Then the function text_process removes punctuation and stopwords.
- Then we can use FreqDist and word_tokenize from nltk to produce our list of the most common words.
oe_4 = df['open_ended_4'].tolist() # neuroticism
oe_4_corpus = " ".join(oe_4)
import nltk
from nltk.corpus import stopwords
import string
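One setup note (not in the original notebook): nltk's stopword list and punkt tokenizer have to be downloaded once before stopwords.words('english') and word_tokenize will work.
# One-time nltk resource downloads
nltk.download('stopwords')
nltk.download('punkt')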
# Pre-processing the data
def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Removes all punctuation
    2. Lowercases the text and removes all stopwords
    3. Returns the cleaned text as a single string
    """
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]
    # Join the characters again to form the string
    nopunc = ''.join(nopunc)
    nopunc = nopunc.lower()
    # Now just remove any stopwords
    return " ".join([word for word in nopunc.split() if word not in stopwords.words('english')])
oe_4_corpus_clean = text_process(oe_4_corpus)
from nltk.probability import FreqDist
from nltk import word_tokenize
words = word_tokenize(oe_4_corpus_clean)
fdist = FreqDist(words)
fdist
So we can see that after we remove the stop words the most used words are "would" (which is part of the prompt and, in my opinion, should be added as an additional stop word), "feedback", "manager", and "ask". Nothing readily jumps off the page thus far, but it's nevertheless interesting data to have.
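If you want a tidier view than the raw FreqDist repr, most_common returns the top word/count pairs directly, and FreqDist also has a built-in plot method (a quick sketch):
print(fdist.most_common(15))  # top 15 word/count pairs after stopword removal
fdist.plot(20)                # frequency plot of the 20 most common words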
Creating Features¶
The first thing you have to do in machine learning is create features you think will provide meaningful signal for predicting your criterion of interest. One way to do that is to leverage others' work; for the purposes of this article we will leverage a pre-trained sentiment analysis package.
Sentiment Analysis¶
Sentiment analysis is a fairly simple concept: a trained sentiment model can analyze a piece of written (or transcribed spoken) text and identify whether it is positive, negative, or neutral. A good overview of sentiment analysis in action is available here, but for the focus of this article I will assume you are generally familiar with sentiment and its uses.
One potential use is for the sentiment proportions to be used as predictors for each of the trait scores.
We could go through the effort of collecting data from several sources (Yelp, IMDb, etc.) and building our own sentiment analysis algorithm, or we can leverage what is already available to make it much easier. For the purposes of this article we will use VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner.
Let's first install VADER, which we can do by following the documentation in the GitHub repo. For quick reference, you can pip or conda install packages inside a Jupyter Notebook by putting an exclamation point at the beginning of the line; this runs the line as a terminal command (you can read more here). For the purposes of this article I am going to comment mine out, as I have already installed the package.
#!pip install vaderSentiment
After we install the package we can follow the documentation from the repo and load the Sentiment Analyzer by running the following command.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
Now let's test it out on a response. Let's first pick a few written responses from our dataframe. We can do that by indexing into a column. [:2] will get us the first two responses from a column.
sentences = df['open_ended_1'][:2]; sentences
Now let's take the code from the repo and feed it the sentences.
analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<65} {}".format(sentence, str(vs)))
Output¶
You can see here we get 4 outputs from the sentiment analyzer.
- Negative
- Neutral
- Positive
- Compound
Negative, neutral, and positive outputs provide the proportions of text that fall into each category, and compound is a value between -1 and 1 that captures the overall intensity. This is great, but we'll have to write our own function to save all of these, since this loop just prints each result and then overwrites it. So how can we write a function we can reuse?
We are going to feed it a list of texts, so let's make that the input to the function. We'll loop through that list and save the analyzer's output for each element to another list. That looks like this:
def get_sentiment(text_list):
    json_list = []
    for i in text_list:
        sentiment = analyzer.polarity_scores(i)
        json_list.append(sentiment)
    return json_list
get_sentiment(sentences)
Now, what do we get as output if we run it on our sentences object? We get a list of dictionaries, which we clearly need to parse. Luckily pandas makes that easy with json_normalize.
# In newer pandas (>= 1.0) json_normalize lives at the top level: from pandas import json_normalize
from pandas.io.json import json_normalize
Let's first turn all of the text rows into a list.
oe_1 = df['open_ended_1'].tolist() # agreeableness
oe_2 = df['open_ended_2'].tolist() # conscientiousness
oe_3 = df['open_ended_3'].tolist() # extraversion
oe_4 = df['open_ended_4'].tolist() # neuroticism
oe_5 = df['open_ended_5'].tolist() # openness to experience
Then let's run the function on each set of text, use json_normalize to turn the dictionary into a pandas dataframe, and rename each column to be specific to the text column it came from.
Open_Ended_1¶
This open-ended question is designed to elicit agreeableness; refer to the beginning of the article for the exact question.
oe_1_sentiment = get_sentiment(oe_1)
oe_1_sent_df = json_normalize(oe_1_sentiment)
oe_1_sent_df.columns = ['oe_1_compound','oe_1_neg','oe_1_neu','oe_1_pos']
oe_1_sent_df.head()
Open_Ended_2¶
This open-ended question is designed to elicit conscientiousness; refer to the beginning of the article for the exact question.
oe_2_sentiment = get_sentiment(oe_2)
oe_2_sent_df = json_normalize(oe_2_sentiment)
oe_2_sent_df.columns = ['oe_2_compound','oe_2_neg','oe_2_neu','oe_2_pos']
Open_Ended_3¶
This open-ended question is designed to elicit extraversion; refer to the beginning of the article for the exact question.
oe_3_sentiment = get_sentiment(oe_3)
oe_3_sent_df = json_normalize(oe_3_sentiment)
oe_3_sent_df.columns = ['oe_3_compound','oe_3_neg','oe_3_neu','oe_3_pos']
Open_Ended_4¶
This open-ended question is designed to elicit neuroticism; refer to the beginning of the article for the exact question.
oe_4_sentiment = get_sentiment(oe_4)
oe_4_sent_df = json_normalize(oe_4_sentiment)
oe_4_sent_df.columns = ['oe_4_compound','oe_4_neg','oe_4_neu','oe_4_pos']
Open_Ended_5¶
This open-ended question is designed to elicit openness to experience; refer to the beginning of the article for the exact question.
oe_5_sentiment = get_sentiment(oe_5)
oe_5_sent_df = json_normalize(oe_5_sentiment)
oe_5_sent_df.columns = ['oe_5_compound','oe_5_neg','oe_5_neu','oe_5_pos']
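As an aside, those five nearly identical cells could be collapsed into a loop. A sketch (not in the original notebook) that produces the same oe_N_* column names, renaming by key rather than by position so it doesn't depend on the column order json_normalize happens to return:
oe_texts = [oe_1, oe_2, oe_3, oe_4, oe_5]
sent_dfs = []
for idx, texts in enumerate(oe_texts, start=1):
    sent_df = json_normalize(get_sentiment(texts))
    # rename neg/neu/pos/compound to oe_N_neg, oe_N_neu, etc.
    sent_df = sent_df.rename(columns=lambda c: "oe_{}_{}".format(idx, c))
    sent_dfs.append(sent_df)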
Then let's concatenate all 6 of the dataframes back together.
new_df = pd.concat([df,oe_1_sent_df,oe_2_sent_df,oe_3_sent_df,oe_4_sent_df,oe_5_sent_df],axis=1)
new_df.to_csv('data/vader_features.csv',index=False)
Features/Variables to Include¶
As mentioned earlier, for the purposes of this article we will focus only on sentiment features, so the inputs to our model will be the proportions of each response that are negative, neutral, and positive, plus the compound score.
Later we'll consider some other features and see which ones appear most important.
X = new_df[['oe_5_compound','oe_5_neg','oe_5_neu','oe_5_pos','oe_4_compound','oe_4_neg','oe_4_neu','oe_4_pos','oe_3_compound','oe_3_neg','oe_3_neu','oe_3_pos',
'oe_2_compound','oe_2_neg','oe_2_neu','oe_2_pos','oe_1_compound','oe_1_neg','oe_1_neu','oe_1_pos','label']]
ys = new_df[['E_Scale_score','A_Scale_score','O_Scale_score','C_Scale_score','N_Scale_score','label']]
new_df.label.unique()
Remember we have 3 different datasets here, so we'll need to separate them into train, dev, and test and then drop the label column from each dataframe.
# .copy() so the in-place drops below don't trigger SettingWithCopyWarning
X_train = X[X['label']=='train'].copy()
y_train = ys[ys['label']=='train'].copy()
X_dev = X[X['label']=='dev'].copy()
y_dev = ys[ys['label']=='dev'].copy()
X_test = X[X['label']=='test'].copy()
y_test = ys[ys['label']=='test'].copy()
y_train.drop(columns='label',inplace=True)
y_dev.drop(columns='label',inplace=True)
y_test.drop(columns='label',inplace=True)
X_train.drop(columns='label',inplace=True)
X_dev.drop(columns='label',inplace=True)
X_test.drop(columns='label',inplace=True)
Ridge Regression¶
Ridge regression is essentially Ordinary Least Squares with L2 regularization. In simple terms, this helps us avoid overfitting our models to the data: L2 regularization penalizes large coefficients and shrinks them toward zero. Unlike L1 regularization, it doesn't force coefficients all the way to zero (which is why L1 can also act as a feature selection technique). The following article goes into much more depth on ridge regression. We will use the sklearn ridge regression implementation, and for simplicity's sake we'll use a penalization term of 1.0, which is the default.
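For reference, sklearn's Ridge fits the coefficients $w$ by minimizing a penalized least-squares objective, where $\alpha$ is the penalization term (1.0 by default):

$$\min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2$$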
Let's build a simple function that takes in the training and testing data as well as the label column and returns a correlation coefficient.
from sklearn.linear_model import Ridge
def run_ridge(X_train, X_test, y_train, y_test, y_label):
    ridge = Ridge()
    ridge.fit(X_train, y_train[y_label])
    test_preds = ridge.predict(X_test)
    test = pd.DataFrame(y_test[y_label])
    test['Pred_score'] = test_preds
    return test.corr().values[0][1]
Development Set¶
i.e. the Public Leaderboard
# Extraversion
E_pred = run_ridge(X_train, X_dev, y_train, y_dev, ['E_Scale_score'])
# Agreeableness
A_pred = run_ridge(X_train, X_dev, y_train, y_dev, ['A_Scale_score'])
# Conscientiousness
C_pred = run_ridge(X_train, X_dev, y_train, y_dev, ['C_Scale_score'])
# Openness
O_pred = run_ridge(X_train, X_dev, y_train, y_dev, ['O_Scale_score'])
# Neuroticism
N_pred = run_ridge(X_train, X_dev, y_train, y_dev, ['N_Scale_score'])
print("mean correlation: ", round(np.array([E_pred,A_pred,C_pred,O_pred,N_pred]).mean(),3))
So this would have put us right around 20th place on the public leaderboard. Not bad for spending just a few minutes cleaning the data and setting up an out-of-the-box ridge regression leveraging only some simple sentiment features.
Let's look at which traits were better and worse.
np.array([E_pred,A_pred,C_pred,O_pred,N_pred])
By looking at the individual scores we can see that ridge regression is best at predicting agreeableness and by far the worst at predicting conscientiousness.
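As an aside, we left the penalty at its default value of 1.0. If you wanted to squeeze a bit more out of this model, sklearn's RidgeCV will cross-validate the penalty for you; a quick sketch (not something tuned for this article):
from sklearn.linear_model import RidgeCV

# Search a small grid of penalty values using the built-in (leave-one-out) cross-validation
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0])
ridge_cv.fit(X_train, y_train['A_Scale_score'])
print(ridge_cv.alpha_)  # the selected penalty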
What about some other machine learning algorithms?
Random Forest¶
Next, let's look at one of the more popular machine learning algorithms today: random forests. In fairly simple terms, the random forests algorithm uses bagging, taking random subsets of both the data and the features and fitting a tree-based model on each to predict the label. All of these trees are then aggregated into an ensemble model, relying on a wisdom-of-the-crowds effect. For more specifics I refer you to this article.
We will use the sklearn implementation of the algorithm.
One quick note: with ridge regression I originally included the length features and found that they slightly lowered the correlations for each of the traits, so I excluded them. For random forests I found the opposite, so we'll include them here.
X = new_df[['oe_5_compound','oe_5_neg','oe_5_neu','oe_5_pos','oe_4_compound','oe_4_neg','oe_4_neu','oe_4_pos','oe_3_compound','oe_3_neg','oe_3_neu','oe_3_pos',
'oe_2_compound','oe_2_neg','oe_2_neu','oe_2_pos','oe_1_compound','oe_1_neg','oe_1_neu','oe_1_pos','label','open_ended_1_len','open_ended_2_len','open_ended_3_len',
'open_ended_4_len','open_ended_5_len']]
ys = new_df[['E_Scale_score','A_Scale_score','O_Scale_score','C_Scale_score','N_Scale_score','label']]
# .copy() again so the in-place drops don't trigger SettingWithCopyWarning
X_train = X[X['label']=='train'].copy()
y_train = ys[ys['label']=='train'].copy()
X_dev = X[X['label']=='dev'].copy()
y_dev = ys[ys['label']=='dev'].copy()
X_test = X[X['label']=='test'].copy()
y_test = ys[ys['label']=='test'].copy()
y_train.drop(columns='label',inplace=True)
y_dev.drop(columns='label',inplace=True)
y_test.drop(columns='label',inplace=True)
X_train.drop(columns='label',inplace=True)
X_dev.drop(columns='label',inplace=True)
X_test.drop(columns='label',inplace=True)
from sklearn.ensemble import RandomForestRegressor
def run_rf(X_train, X_test, y_train, y_test, y_label):
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
    rf.fit(X_train, y_train[y_label])
    test_preds = rf.predict(X_test)
    test = pd.DataFrame(y_test[y_label])
    test['Pred_score'] = test_preds
    return test.corr().values[0][1]
# Extraversion
E_pred = run_rf(X_train, X_dev, y_train, y_dev, ['E_Scale_score'])
# Agreeableness
A_pred = run_rf(X_train, X_dev, y_train, y_dev, ['A_Scale_score'])
# Conscientiousness
C_pred = run_rf(X_train, X_dev, y_train, y_dev, ['C_Scale_score'])
# Openness
O_pred = run_rf(X_train, X_dev, y_train, y_dev, ['O_Scale_score'])
# Neuroticism
N_pred = run_rf(X_train, X_dev, y_train, y_dev, ['N_Scale_score'])
print(np.array([E_pred,A_pred,C_pred,O_pred,N_pred]))
print("mean correlation: ", round(np.array([E_pred,A_pred,C_pred,O_pred,N_pred]).mean(),3))
Using these specific features, random forests did not perform anywhere near as well as ridge regression overall, but you'll notice we did get a slight lift on openness to experience. What about gradient boosted trees?
Gradient Boosted Trees¶
The final algorithm we will try is typically considered the most powerful. Unlike random forests, which grows its trees independently via bagging, gradient boosting builds trees sequentially: each new tree is fit to correct the errors the current ensemble is still making, which "boosts" the overall performance of the combined model. For more specifics I refer you to this article.
We will not actually be using sklearn for the gradient boosting algorithm (although the most recent versions of sklearn do ship a histogram-based gradient boosting estimator, HistGradientBoostingRegressor, that is reported to be on par with XGBoost and LightGBM). Instead we will focus on XGBoost. Luckily for us the developers built its API to mirror the sklearn interface, which makes using it almost exactly the same.
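For the curious, a minimal sketch of that sklearn alternative, assuming scikit-learn 0.21+ (where it still sits behind an experimental import):
# Experimental import required in scikit-learn 0.21-0.23
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor

hgb = HistGradientBoostingRegressor()
hgb.fit(X_train, y_train['A_Scale_score'])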
from xgboost import XGBRegressor
def run_xgb(X_train, X_test, y_train, y_test, y_label):
    xgb = XGBRegressor(n_jobs=-1)
    xgb.fit(X_train, y_train[y_label])
    test_preds = xgb.predict(X_test)
    test = pd.DataFrame(y_test[y_label])
    test['Pred_score'] = test_preds
    return test.corr().values[0][1]
# Extraversion
E_pred = run_xgb(X_train, X_dev, y_train, y_dev, ['E_Scale_score'])
# Agreeableness
A_pred = run_xgb(X_train, X_dev, y_train, y_dev, ['A_Scale_score'])
# Conscientiousness
C_pred = run_xgb(X_train, X_dev, y_train, y_dev, ['C_Scale_score'])
# Openness
O_pred = run_xgb(X_train, X_dev, y_train, y_dev, ['O_Scale_score'])
# Neuroticism
N_pred = run_xgb(X_train, X_dev, y_train, y_dev, ['N_Scale_score'])
print(np.array([E_pred,A_pred,C_pred,O_pred,N_pred]))
print("mean correlation: ", round(np.array([E_pred,A_pred,C_pred,O_pred,N_pred]).mean(),3))
So, if we combine our best trait scores from each model it would be ridge regression for the following:
- extraversion
- agreeableness
- conscientiousness
- neuroticism
and random forests for:
- openness
# Best per-trait dev-set correlations from the models above, ordered (E, A, C, O, N)
np.array([0.25960968, 0.35637792, 0.09777237, 0.24055125, 0.21310606]).mean()
This gets us an average correlation of 0.2335, a slight improvement over what ridge regression was able to provide on its own.
Recap:¶
- We examined the data from the 2019 SIOP Machine Learning Competition
- We implemented an off-the-shelf sentiment package (VADER) to produce features we could feed into our machine learning algorithms
- We tried 3 different algorithms: Ridge Regression, Random Forests, and XGBoost
- Ridge Regression performed the best and provided us with an average correlation that would have put us in 20th place on the public leaderboard out of 39 teams
- Conscientiousness's high mean and small standard deviation made it the most difficult trait to predict, with a top correlation of 0.098
In the next article we'll explore another off-the-shelf feature creation package and compare the two. We can then combine all the features together and see if we can beat our current best of 20th place.