Multiple Regression in Python and R¶
Linear Regression in Python Using Statsmodels¶
Let's Regression!!
Isn't that what everybody says after their first day in grad methods?
As you will see from the output statsmodels was built with a lot of similarities between SPSS & R's lm. One important thing to notice about statsmodels is by default it does not include a constant in the linear model, so you will need to add the constant to get the same results as you would get in SPSS or R.
Importing Packages¶
Have to import our relevant packages
import pandas as pd # dataframe library
import numpy as np # linear algebra library
from scipy import stats # stats library from scipy
import statsmodels.api as sm # importing statsmodels.api but giving it an alias, so you don't have to type statsmodels.api every time
Does data matter?¶
I often get people from I/O or other social sciences telling me that they don't see how many of the publicly available datasets are applicable to them. In my opinion all datasets are equally valuable when it comes to learning exploration and data manipulation techniques and predictive algorithms. The data below is about housing, but you could just as easily replace the features about crime rate and # of rooms with cognitive ability and extraversion and instead of predicting median house value you are predicting job performance The linear model works the exact same way.
Load the data¶
This is the Boston housing training data. It's an easy way to build a straightforward regression model that is comparable across packages and languages.
df = pd.read_csv('data/train.csv')
df.head() # shows top 5 by default by putting a number in the parantheses I can increase or decrease that amount
Here's a description of each variable/feature as provided by the link above.¶
Data description¶
The Boston data frame has 506 rows and 14 columns.
This data frame contains the following columns:
crim per capita crime rate by town.
zn proportion of residential land zoned for lots over 25,000 sq.ft.
indus proportion of non-retail business acres per town.
chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox nitrogen oxides concentration (parts per 10 million).
rm average number of rooms per dwelling.
age proportion of owner-occupied units built prior to 1940.
dis weighted mean of distances to five Boston employment centres.
rad index of accessibility to radial highways.
tax full-value property-tax rate per $10,000.
ptratio pupil-teacher ratio by town.
black 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat lower status of the population (percent).
medv median value of owner-occupied homes in $1000s.
Let's define our IVs and DV.
- In this instance we'll use everything but the ID, black, and medv, which will be the DV as our X.
- We'll use median value as our Y.
x_sm = df.drop(columns=['ID','black','medv'])
y_sm = df['medv']
- Again, remember we assigned statsmodels an alias of sm, so we are going to add a constant to sm
- Next, we are going to assign a variable named model the statsmodels ordinary least squares class and we are going to call the .fit() method to fit the data to the class. This will result in the results of this being held in the model variable.
x_sm = sm.add_constant(x_sm)
model = sm.OLS(y_sm,x_sm).fit()
Let's look at a summary of the model output¶
As social scientists this has everything we need. We can look at the R-sqaure to understand how much of the variance in the DV the IVs explain. We can look at the IV coefficients and the significance of these coefficients.
model.summary()
You can see we get F-statistic, R-square, adjusted R-square, coefficients, and p-values of each coefficient in a similar output to SPSS or R.¶
A Regression in R¶
Quick note about running R in a Python notebook.
- you will need to pip install rpy2
- Then use the %load_ext rpy2.ipython magic command (as shown below)
- Then in every cell you want to run R code you will begin the cell with %%R (as shown below)
import warnings #this just ignores warning messages I was getting a few warnings about data structure of the R import
warnings.filterwarnings('ignore')
# enables the %%R magic, this is necessary to run R code in a python kernel, not necessary if you are working in an R kernel
%load_ext rpy2.ipython
First thing we do is load the reader library and the data¶
While running R code within a python notebook you need to use %%R at the top of every cell. Again this is not necessary if you are using an R kernel.
%%R
library(readr)
train <- read_csv('data/train.csv')
Then we subset the data to include the same variables above¶
%%R
X <- subset(train, select = -c(black,ID))
Similar to python you assign the linear model function to model. In R you can tell the model specifically to use everything but the variable listed as the y. You then need to assign the data the lm function is using.
%%R
model <- lm(medv ~., data=X)
%%R
summary(model)
Similarly we have the F-statistic,R-square, adjusted R-square, coefficients, p-values, etc.¶
Machine Learning¶
So far we've been focused on replicating typical stats processes within python and R. However there is this entire field/buzzword/"all the cool kids are doing it" term called machine learning. There really isn't a difference, honestly. The main difference is the school of thought and the focus of the process. When it comes to statistics the focus is largely on interpretability and understanding. Examining the coefficient weights, importance of predictors, etc.
When it comes to machine learning the focus is solely on prediction. They don't really care as much about understanding what is predicting, they care about the MSE or the R-square, whatever the metric of importance is.
The scikit learn package is a machine learning-based package, which means the .summary() methods aren't readily available although you can still extract the same information. I will show how to do so next.
Linear Regression in Scikit Learn¶
df = pd.read_csv('data/train.csv')## load the dataset
Similar to statsmodels we need to add the constant.
In Pandas if you assign a dataframe's column with a specific # it acts as adding a scalar.
df['constant']=1.
df.head()
As you can see there is now an additional column called constant that is all 1s.
import the LinearRegression class from sklearn package
from sklearn.linear_model import LinearRegression
x = df.drop(columns=['ID','black','medv'])# dropping the 3 columns as we did earlier
y = df['medv'] # assigning the median value as the DV as we did earlier
assigning the linear regression class to the variable model and then applying the .fit() method off of the model variable to the x and y variables and assigning that to the variable lr_model
model = LinearRegression()
lr_model = model.fit(x,y)
Let's look at the R-square, which is defined as the .score() method. Notice, even the language suggests that there is a competition to predict. Score
print ('R-square: ',round(model.score(x,y),3))
Let's look at the Intercept/Constant:
print('Intercept/Constant: ',round((lr_model.intercept_),3))
Finally, let's create a dataframe that shows zips the coefficients to their feature names, so we can examine the coefficients.
pd.DataFrame(list(zip(x.columns,lr_model.coef_)),columns=['features','estimated_Coefficients'])
The TLDR Version¶
#statsmodels
import statsmodels.api as sm
df = pd.read_csv('data/train.csv')
x_sm = df.drop(columns=['ID','black','medv'])
y_sm = df['medv']
x_sm = sm.add_constant(x_sm)
model = sm.OLS(y_sm,x_sm).fit()
model.summary()
%%R
#R
library(readr)
train <- read_csv('data/train.csv')
X <- subset(train, select = -c(black,ID))
model <- lm(medv ~., data=X)
summary(model)
#scikitlearn
from sklearn.linear_model import LinearRegression
df = pd.read_csv('data/train.csv')
x = df.drop(columns=['ID','black','medv'])
y = df['medv']
model = LinearRegression()
lr_model = model.fit(x,y)
print ('R-square: ',round(model.score(x,y),3))
print('Intercept/Constant: ',round((lr_model.intercept_),3))
pd.DataFrame(list(zip(x.columns,lr_model.coef_)),columns=['features','estimated_Coefficients'])
So as you can see there are several ways to do the same thing. All of which are fewer lines of code than SPSS!!!¶
Links to documentation: