As I/O psychologists, one of our main functions is to predict job performance. What job performance means differs from role to role, but we can usually arrive at a decent understanding of which behaviors and competencies differentiate good performers from poor ones. Our goal is then to use the available data to predict whether an employee or candidate will be a good performer or a poor one.
We can look at the game of baseball as a microcosm for this: a hit is good and an out is bad. For simplicity's sake we will treat all hits as equal (single, double, triple, home run) and any ball in play that results in an out as bad.
So, if we could get this data, how would we structure it to try to predict hits, and...can we actually predict them?
Introducing Statcast...it's powered by AWS :) This dataset allows us to examine parts of the game that are much closer to behaviors. It has an amazing amount of pitch data, but it also includes things like launch angle, exit velocity, whether or not there was a shift, and where the ball was hit on the field, all of which matter for understanding whether a ball in play resulted in a hit or an out.
The first thing we need to do is pip install pybaseball. I've already done so, so I'm going to leave it commented out. Then we import the packages we need.
#!pip install pybaseball
from pybaseball import statcast
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
After that, let's download the 2019 season's data. Pybaseball makes downloading the Statcast data about as easy as possible. You can look at this to get an idea of the variables/features available. These are the variables I thought might be useful:
- launch_angle
- launch_speed
- hc_y
- hc_x
- if_fielding_alignment (need to dummy code)
- of_fielding_alignment (need to dummy code)
- outs_when_up
- home_team (need to dummy code)
- stand (recode)
- hit_location
- p_throws (recode)
hc_x/hc_y are the coordinates where the ball lands. if/of_fielding_alignment indicate whether there was a shift in the infield or outfield, outs_when_up could affect how easily a force out is applied, home_team allows us to account for park factors, stand is whether the batter is a lefty or a righty, hit_location is the position of the first player to touch the ball, and p_throws is the throwing hand of the pitcher. That should be a good start.
This simple query downloads all of the data for a specific time period. I chose the dates of the 2019 MLB regular season.
hitter_stats = statcast('2019-03-31','2019-10-04')
Let's first take a look at the data we have.
len(hitter_stats)
len(hitter_stats['player_name'].unique())
So there are over 723k pitches thrown in the 2019 season by 829 different pitchers.
Let's look at a few pitchers. I'm a Cardinals fan, so let's look at a few of the Cardinals young arms.
- Jack Flaherty had a great first season in the big leagues, posting a FIP of 3.46, good enough for 15th best in the league, and a WAR (wins above replacement) of 6.0.
- Jordan Hicks was having a great second season in the league with a FIP of 3.21 when he went down with a Tommy John injury.
jack_f = hitter_stats[hitter_stats['player_name']=='Jack Flaherty']
jordan_h = hitter_stats[hitter_stats['player_name']=='Jordan Hicks']
# Jack Flaherty's four-seam fastball: horizontal (pfx_x) vs. vertical (pfx_z) movement
plt.scatter(jack_f[jack_f['pitch_type']=='FF']['pfx_x'],jack_f[jack_f['pitch_type']=='FF']['pfx_z'])
# Jordan Hicks' sinker: horizontal (pfx_x) vs. vertical (pfx_z) movement
plt.scatter(jordan_h[jordan_h['pitch_type']=='SI']['pfx_x'],jordan_h[jordan_h['pitch_type']=='SI']['pfx_z'])
Above are two scatterplots of the horizontal (X) and vertical (Z) movement of two different pitchers' pitches. The first is Jack Flaherty's fastball. You can see some horizontal movement averaging around -0.5 and some vertical movement averaging around 1.2. Contrast that with Jordan Hicks' sinker. He actually has a lot of horizontal movement averaging around -1.3, but has less vertical movement averaging around 0.9, which is pretty interesting because one is a fastball and the other is a sinker. Hicks' movement at 100+MPH has been discussed a lot.
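If you would rather not eyeball the scatterplots, a quick check of the exact averages (using the same jack_f/jordan_h frames and pfx columns from above) looks like this:
# mean horizontal (pfx_x) and vertical (pfx_z) movement for each pitch
print(jack_f[jack_f['pitch_type']=='FF'][['pfx_x','pfx_z']].mean())
print(jordan_h[jordan_h['pitch_type']=='SI'][['pfx_x','pfx_z']].mean())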
Here is a link to Jordan Hicks' sinker in action
He won the 2018 and 2019 PitchingNinja Award for Most Unfair At Bat. Here's a link to the 2019 AB
Predicting hits
Now let's go ahead and focus on the hitter statistics because, after all, our goal is to predict whether or not a ball put in play will be a hit.
Let's select those features along with the type and events columns. The type column tells us whether the pitch was a ball, a strike, or put in play, and events is the result of the plate appearance.
hit = hitter_stats[['launch_angle','launch_speed','hc_x','hc_y','if_fielding_alignment','of_fielding_alignment','outs_when_up','home_team',
'stand','hit_location','p_throws','type','events']]
Now we only want the balls in play, so let's subset the dataset to only have the balls in play.
in_play = hit[hit['type']=='X']
Next we want to make sure our label, or outcome variable, is accurate. Right now this is what events contains.
in_play['events'].unique()
We can see that while all of these are balls in play, there are singles, doubles, and home runs, along with fielder's choices, errors, and sacrifices. So let's recode these as hits or outs.
in_play['hit'] = in_play['events'].apply(lambda x: 1 if x in('single','double','home_run','triple') else 0)
How many balls were put in play in 2019?
len(in_play)
What was the hardest ball put in play in 2019? The weakest?
print('hardest hit ball: ',in_play['launch_speed'].max())
print('softest hit ball: ',in_play['launch_speed'].min())
What would we guess about which of these, if either, was a hit? My initial hypothesis is that the soft one was not a hit and the hard one probably was, although it could have been a line drive hit right at someone.
in_play[in_play['launch_speed']== 118.9]
in_play[in_play['launch_speed']== 7.6]
So there were actually two balls hit at 118.9 MPH, and both were hits. The ball hit at 7.6 MPH was not a hit. The two batters who hit the ball 118.9 MPH: Giancarlo Stanton and Vladimir Guerrero Jr. Honestly, no surprises there. The 7.6 MPH ball was hit by Willi Castro, a shortstop for the Detroit Tigers.
Cleaning Data for input into the model
- Recodes
- Dummy Codes
For our models to be able to consume the data, we need to recode some features and dummy code others. Let's recode a few features first, then dummy code the remaining categorical features.
This first one identifies whether or not the pitcher and hitter are throwing/hitting from the right side.
# Recodes
in_play['right_handed_batter'] = np.where(in_play['stand'].values =="R",1,0)
in_play['right_handed_pitcher'] = np.where(in_play['p_throws'].values =="R",1,0)
# drop all variables that will be recoded or dummy coded
in_play_new = in_play.drop(['if_fielding_alignment','of_fielding_alignment','home_team','stand','type','events','p_throws'],axis=1)
in_play_new['hit_location'].unique()
in_play_new['hit_location'].isna().sum()
We can see here that a certain percentage of hit locations are np.nan. After some quick digging I identified these as home runs. So we can recode these as hit location 10, but this is clearly cheating at this point, because now we know that every ball with a hit location of 10 is a hit (perfect signal). Let's still use it for the time being.
in_play_new['hit_location'].replace(np.nan, 10,inplace=True)
# dummy code and concatenate (chain the concats so each set of dummies is kept,
# and use prefixes so the infield/outfield alignment columns don't collide)
train = pd.concat([in_play_new, pd.get_dummies(in_play['if_fielding_alignment'], prefix='if')],axis=1)
train = pd.concat([train, pd.get_dummies(in_play['of_fielding_alignment'], prefix='of')],axis=1)
train = pd.concat([train, pd.get_dummies(in_play['home_team'])],axis=1)
# make sure there are no NAs
train.dropna(inplace=True)
x_train = train.drop(['hit'],axis=1)
y_train = train['hit']
y_train.mean()
This is interesting, because at first a .334 batting average seems a bit high, but it's the batting average only for balls that were put in play. So we now know that by predicting an out every single time we'd have a classification model that is about 66.6% accurate.
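If you want to sanity-check that majority-class baseline rather than take my word for it, one quick sketch is scikit-learn's DummyClassifier run through the same 10-fold cross-validation on the x_train/y_train we just built:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# always predict the majority class (an out); accuracy should land right around 0.666
dummy_cv = cross_val_score(DummyClassifier(strategy='most_frequent'),
                           x_train, y_train, scoring='accuracy', cv=10)
print("Majority-class baseline accuracy: ", dummy_cv.mean())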
Area Under the Curve
This is a good introduction to area under the curve, but the basic idea is that when evaluating model performance on an imbalanced target, accuracy can often be less important than the order of the probabilities the model assigns.
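To make that concrete, here's a tiny toy example (made-up numbers, not our baseball data): a model whose probabilities rank every hit above every out earns a perfect AUC even though, at a naive 0.5 threshold, its accuracy is no better than always predicting an out.
from sklearn.metrics import roc_auc_score, accuracy_score
import numpy as np

# toy data: 6 outs (0) and 3 hits (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
# this model ranks every hit above every out, but all of its probabilities sit below 0.5,
# so a 0.5 threshold predicts "out" for everything
proba = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45])
print('Accuracy at a 0.5 threshold:', accuracy_score(y_true, (proba >= 0.5).astype(int)))  # 0.667, same as always guessing "out"
print('AUC:', roc_auc_score(y_true, proba))  # 1.0, because the ordering is perfect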
Logistic Regression as our baseline
Let's start by considering a more traditional model, logistic regression. For more information on the mathematics behind it, this article may be useful.
We'll start off with a fairly simple k-fold cross-validation using 10 folds, then average the area under the curve across the folds to get our overall AUC.
# cross-validate basic Logistic Regression model to get a baseline AUC
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
lr_cv = cross_val_score(LogisticRegression(),
x_train,y_train,scoring='roc_auc',cv=10)
print("Logistic Regression cross-validated mean AUC: ",lr_cv.mean())
So, we have a baseline area under the curve of 0.883. That's a pretty good start, but remember we almost certainly have perfect signal from the balls coded as being hit to "fielder 10".
K-Nearest Neighbors
K-Nearest Neighbors is another algorithm that may be useful. It uses Euclidean distance and a set of neighbors (the nearest data points) to label a target. The basic premise is that if an observation's features are similar to the features of observations we know were hits, then it's likely to be a hit, and vice versa. For more specifics on the KNN algorithm check out this article. It makes sense that balls in play most like other balls in play will have a similar result, but the problem is identifying the ideal number of neighbors. Let's write a for loop to find it.
from sklearn.neighbors import KNeighborsClassifier

# try every even number of neighbors from 2 to 100
neighbors = np.linspace(2,100,50).astype(int)
cv_means = []
for n in neighbors:
    knn_cv = cross_val_score(KNeighborsClassifier(n_neighbors=n),
                             x_train,y_train,scoring='roc_auc',cv=10)
    cv_means.append(knn_cv.mean())
np.asarray(cv_means).argmax()
We can see here that the argmax, i.e. the index of the largest value in the array, is position 32. We can index into the neighbors array at position 32 to find out how many neighbors that corresponds to.
neighbors[32]
So 66 neighbors was ideal in cross-validation; let's go ahead and use that.
from sklearn.neighbors import KNeighborsClassifier
knn_cv = cross_val_score(KNeighborsClassifier(n_neighbors=66),
x_train,y_train,scoring='roc_auc',cv=10)
print("K-Nearest Neighbors cross-validated mean AUC: ",knn_cv.mean())
An AUC of 0.96113 is pretty impressive for such a simple algorithm, but now it's time to bring in the big boys: Random Forests and Gradient Boosted Trees.
Random Forests
Next, let's look at one of the more popular "machine learning" algorithms today: Random Forests. In fairly simple terms, the Random Forests algorithm uses bagging: it takes random subsets of both the rows and the features and fits a decision tree on each to predict the label. All of these trees are then aggregated into an ensemble model, leaning on the wisdom-of-the-crowds idea. For more specifics I refer you to this article.
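To make the bagging idea concrete before handing things over to sklearn, here is a minimal hand-rolled sketch on the x_train/y_train we built above: fit a handful of shallow trees on bootstrapped rows and random feature subsets, then average their predicted probabilities. The real RandomForestClassifier does all of this (and much more) for us; this is only meant to show the mechanics.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(42)
n_trees, n_rows, n_cols = 25, len(x_train), x_train.shape[1]
probas = []
for _ in range(n_trees):
    # bootstrap sample of rows and a random subset of features for this tree
    rows = rng.choice(n_rows, size=n_rows, replace=True)
    cols = rng.choice(n_cols, size=int(np.sqrt(n_cols)), replace=False)
    tree = DecisionTreeClassifier(max_depth=8, random_state=0)
    tree.fit(x_train.iloc[rows, cols], y_train.iloc[rows])
    probas.append(tree.predict_proba(x_train.iloc[:, cols])[:, 1])
# the ensemble prediction is just the average vote across all the trees
ensemble_proba = np.mean(probas, axis=0)
print('Hand-rolled bagging AUC (in-sample, mechanics only):', roc_auc_score(y_train, ensemble_proba))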
We will use sklearn's implementation of the algorithm.
# cross-validate basic Random Forest model to get an AUC
from sklearn.ensemble import RandomForestClassifier
rf_cv = cross_val_score(RandomForestClassifier(),
x_train,y_train,scoring='roc_auc',cv=10)
print("Random Forest cross-validated mean AUC: ",rf_cv.mean())
XGBoost's implementation of Gradient Boosted Trees
The final algorithm we will try is typically considered the most powerful. Unlike random forests, which develop trees independently via bagging, gradient boosting takes a sequential approach: each new tree focuses on the examples the current ensemble handles poorly. By doing this it boosts the overall performance of the combined trees. For more specifics I refer you to this article.
We will not be using sklearn for the gradient boosting algorithm (although I have heard that recent versions of sklearn do have a gradient boosting implementation on par with XGBoost and LightGBM). For this we will focus on XGBoost. Luckily for us, the developers built the API to mirror sklearn's, which makes using it almost exactly the same.
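As an aside, if you did want to stay inside sklearn, the newer implementation I'm referring to is HistGradientBoostingClassifier. A minimal sketch, assuming scikit-learn 0.24 or later (the 0.21-0.23 releases also required an explicit experimental import), would look like this:
# sklearn's histogram-based gradient boosting; on sklearn 0.21-0.23 you would first need
# from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

hgb_cv = cross_val_score(HistGradientBoostingClassifier(),
                         x_train, y_train, scoring='roc_auc', cv=10)
print("HistGradientBoosting cross-validated mean AUC: ", hgb_cv.mean())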
# cross-validate basic XGBoost model to get an AUC
from xgboost import XGBClassifier
xgb_cv = cross_val_score(XGBClassifier(),
x_train,y_train,scoring='roc_auc',cv=10)
print("XGBoost cross-validated mean AUC: ",xgb_cv.mean())
WOW!!
Nearest Neighbors outperformed both XGBoost and Random Forests. Let's go ahead and run XGBoost and Nearest Neighbors on a train/test split.
Let's use a 65/35 split.
from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(x_train,y_train,test_size=.35, random_state=101)
print (len(X_train), len(X_test), len(X_train)+ len(X_test))
K-Nearest Neighbors
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors=66)
knn.fit(X_train,y_train)
knn_proba = knn.predict_proba(X_test)
# Predictions###############
knn_proba_2 = knn_proba[:,1]
knn_pred = knn.predict(X_test)
print('K-Nearest Neighbors Accuracy:',accuracy_score(y_test, knn_pred))
print('K-Nearest Neighbors AUC:',roc_auc_score(y_test,knn_proba_2))
from sklearn.metrics import classification_report
print(classification_report(y_test,knn_pred))
XGBoost
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train,y_train)
xgb_proba = xgb.predict_proba(X_test)
# Predictions###############
xgb_proba_2 = xgb_proba[:,1]
xgb_pred = xgb.predict(X_test)
print('XGBoost Accuracy:',accuracy_score(y_test, xgb_pred))
print('XGBoost AUC:',roc_auc_score(y_test,xgb_proba_2))
from sklearn.metrics import classification_report
print(classification_report(y_test,xgb_pred))
So you can see that while K-Nearest Neighbors has a slightly higher AUC, XGBoost outperforms it when it comes to accuracy; both are very good and drastically outperform logistic regression. One final thing we can do is a randomized grid search for XGBoost. We tuned the hyperparameter for KNN by identifying the ideal number of nearest neighbors, but so far we're using XGBoost's default settings. Can we get a little more accuracy and AUC by picking better hyperparameters?
Let's give it a try.
from scipy.stats import randint as sp_randint
from time import time
from sklearn.model_selection import RandomizedSearchCV
# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
# specify parameters and distributions to sample from
xgb = XGBClassifier(n_jobs=-1)
param_dist = {"learning_rate":np.linspace(.01,.15,15),
'min_child_weight': sp_randint(1,7),
"max_depth": sp_randint(3,10),
"reg_lambda":[0.0,0.1,1,3],
"reg_alpha":[0,0.1,1,3],
"n_estimators":[250,500]}
# run randomized search
n_iter_search = 50
random_search = RandomizedSearchCV(xgb, param_distributions=param_dist,
n_iter=n_iter_search,scoring='roc_auc',verbose=3)
start = time()
random_search.fit(x_train,y_train)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
" parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)
The results of the grid search show us that the best cross-validated AUC from the randomized search was 0.975.
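Rather than copying the winning hyperparameters over by hand like I do below, you can also pull them straight off the fitted search object:
print(random_search.best_score_)          # best mean cross-validated AUC
print(random_search.best_params_)         # the hyperparameters that produced it
best_xgb = random_search.best_estimator_  # an XGBClassifier already refit on all of x_train/y_train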
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
xgb = XGBClassifier(max_depth=7,learning_rate=0.0799999999,min_child_weight=5,n_estimators=500,reg_alpha=3,reg_lambda=0.0)
xgb.fit(X_train,y_train)
xgb_proba = xgb.predict_proba(X_test)
# Predictions###############
xgb_proba_2 = xgb_proba[:,1]
xgb_pred = xgb.predict(X_test)
print('XGBoost Accuracy:',accuracy_score(y_test, xgb_pred))
print('XGBoost AUC:',roc_auc_score(y_test,xgb_proba_2))
from sklearn.metrics import classification_report
print(classification_report(y_test,xgb_pred))
So, we achieved a similar AUC on our holdout sample using the hyperparameters obtained from the grid search and we got our accuracy up to almost 94%. Let's take a look at which features were deemed most important.
feats = {} # a dict to hold feature_name: feature_importance
for feature, importance in zip(X_train.columns, xgb.feature_importances_):
    feats[feature] = importance  # add the name/value pair
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
from pylab import rcParams
rcParams['figure.figsize']=16,8
figure = importances.sort_values(by='Gini-importance').iloc[-15:,:].plot(kind='bar', rot=45)
So we can see that by far the most important feature is hit location, which makes a ton of sense: I recoded home runs as hit location 10, and balls first touched by outfielders are much more likely to be hits than balls first touched by infielders. Let's go ahead and remove that feature along with hc_y, since it essentially encodes how far the ball was hit. Then let's create a feature indicating whether the batter/pitcher matchup was same-handed or opposite-handed, as evidence typically shows hitters can see the ball slightly longer when the pitcher is coming across their body, which tends to favor opposite-handed matchups.
train = train.drop(['hit_location','hc_y'],axis=1)
x_train = train.drop(['hit'],axis=1)
y_train = train['hit']
x_train['opp_matchup'] = x_train.apply(lambda x: 0 if x['right_handed_batter'] ==x['right_handed_pitcher'] else 1,axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(x_train,y_train,test_size=.35, random_state=101)
print (len(X_train), len(X_test), len(X_train)+ len(X_test))
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train,y_train)
xgb_proba = xgb.predict_proba(X_test)
# Predictions###############
xgb_proba_2 = xgb_proba[:,1]
xgb_pred = xgb.predict(X_test)
print('XGBoost Accuracy:',accuracy_score(y_test, xgb_pred))
print('XGBoost AUC:',roc_auc_score(y_test,xgb_proba_2))
We can see that we lost a fair amount of both accuracy and AUC by removing hit location and the y-coordinate indicating how far the ball traveled.
Let's redo our grid search with the new smaller model.
start = time()
random_search.fit(x_train,y_train)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
" parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
xgb = XGBClassifier(max_depth=9,learning_rate=0.1,min_child_weight=5,n_estimators=250,reg_alpha=3,reg_lambda=1)
xgb.fit(X_train,y_train)
xgb_proba = xgb.predict_proba(X_test)
# Predictions###############
xgb_proba_2 = xgb_proba[:,1]
xgb_pred = xgb.predict(X_test)
print('XGBoost Accuracy:',accuracy_score(y_test, xgb_pred))
print('XGBoost AUC:',roc_auc_score(y_test,xgb_proba_2))
The amount of lift we got from the grid search is actually pretty surprising to me. We had about a 3% lift in both accuracy and AUC. Let's look at the feature importance now.
feats = {} # a dict to hold feature_name: feature_importance
for feature, importance in zip(X_train.columns, xgb.feature_importances_):
    feats[feature] = importance  # add the name/value pair
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
from pylab import rcParams
rcParams['figure.figsize']=16,8
figure = importances.sort_values(by='Gini-importance').iloc[-10:,:].plot(kind='bar', rot=45)
Our final feature importance graph indicates that launch angle and launch speed are the most important features. This is exactly what MLB teams have found as well.
You can also see that a lot of parks make the top ten with decent-sized Gini importances. Off the top of my head I would guess the following:
- Hitter-friendly parks: Boston, Houston, Cincinnati, Colorado
- Pitcher-friendly parks: Detroit and Saint Louis
Let's quickly look at the batting averages for a few of those parks.
print('Colorado batting average of balls in play:',round(in_play[in_play['home_team']=='COL']['hit'].mean(),3))
print('Saint Louis batting average of balls in play:',round(in_play[in_play['home_team']=='STL']['hit'].mean(),3))
There's a pretty dramatic difference there.
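If you want the full ranking rather than a two-park spot check, the same in_play frame gives it in one line:
# batting average on balls in play by home park, lowest to highest
print(in_play.groupby('home_team')['hit'].mean().sort_values().round(3))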
So what have we done? We've used machine learning models to predict "job performance" in the game of baseball. Our original model, which included the hit location and the y-coordinate for how far the ball was hit, achieved almost 94% accuracy and an area under the curve of almost 98% in determining whether a ball in play was a hit or an out across the entire 2019 MLB season.
When we removed both of those features we were still able to achieve an impressive 88% accuracy and 94% area under the curve.
Next steps?
- Build a model that includes who was on defense as certain teams are better defensively than others.
- Build a model to predict what type of hit it will be...a multi-class approach would make sense here (a rough sketch of the target is below).
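For that second next step, here is a rough sketch of what the multi-class target could look like, built from the events column we already have (hit_type is just a hypothetical name):
# 0 = out, 1 = single, 2 = double, 3 = triple, 4 = home run
hit_type_map = {'single': 1, 'double': 2, 'triple': 3, 'home_run': 4}
in_play['hit_type'] = in_play['events'].map(hit_type_map).fillna(0).astype(int)
print(in_play['hit_type'].value_counts())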
Takeaways for the I/O world
On the surface this may not seem valuable to an I/O practitioner trying to predict job performance or job satisfaction, but the preprocessing, data manipulation, and model building are the same steps you will want to follow to build a good, accurate model, and as I mentioned above, baseball data isn't all that different from organizational data. At the end of the day we are using inputs to predict job performance.