Multiple Regression in Python and R


Linear Regression in Python Using Statsmodels

Let's Regression!!

Isn't that what everybody says after their first day in grad methods?


As you will see from the output, statsmodels was built with a lot of similarities to SPSS and R's lm(). One important thing to notice about statsmodels is that, by default, it does not include a constant in the linear model, so you will need to add the constant yourself to get the same results you would get in SPSS or R (there's a quick illustration of this right after the imports below).

Importing Packages

First, we have to import our relevant packages.

In [3]:
import pandas as pd # dataframe library
import numpy as np # linear algebra library
from scipy import stats # stats library from scipy
import statsmodels.api as sm # importing statsmodels.api but giving it an alias, so you don't have to type statsmodels.api every time
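
As a quick illustration of the constant issue mentioned above, here is a minimal sketch on made-up data (not the housing data used below) showing how the estimates change once you add the constant:

In [ ]:
# made-up data (purely illustrative) with a true intercept of 5 and slope of 2
np.random.seed(0)
x_toy = np.random.normal(size=100)
y_toy = 5 + 2 * x_toy + np.random.normal(size=100)

# without add_constant the regression line is forced through the origin
no_const = sm.OLS(y_toy, x_toy).fit()
# add_constant appends a column of 1s, so an intercept gets estimated
with_const = sm.OLS(y_toy, sm.add_constant(x_toy)).fit()

print(no_const.params)    # slope only, no intercept
print(with_const.params)  # intercept (near 5) and slope (near 2)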

Does data matter?

I often get people from I/O or other social sciences telling me that they don't see how many of the publicly available datasets are applicable to them. In my opinion, all datasets are equally valuable when it comes to learning data exploration and manipulation techniques and predictive algorithms. The data below is about housing, but you could just as easily replace features like crime rate and number of rooms with cognitive ability and extraversion, and instead of predicting median house value you'd be predicting job performance. The linear model works exactly the same way.

Load the data

This is the Boston housing training data. It's an easy way to build a straightforward regression model that is comparable across packages and languages.

In [4]:
df = pd.read_csv('data/train.csv')
In [5]:
df.head() # shows the top 5 rows by default; by putting a number in the parentheses I can increase or decrease that amount
Out[5]:
ID crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0 1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
3 5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
4 7 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9
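
A few other pandas one-liners are handy for sanity-checking the data before modeling; a quick sketch:

In [ ]:
print(df.shape)    # (number of rows, number of columns) in this training split
print(df.dtypes)   # data type of each column
df.describe()      # summary statistics for each numeric column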

Data description

The full Boston data frame has 506 rows and 14 columns; the training split loaded above contains 333 of those rows (plus an ID column).

This data frame contains the following columns:

  • crim per capita crime rate by town.

  • zn proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus proportion of non-retail business acres per town.

  • chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • nox nitrogen oxides concentration (parts per 10 million).

  • rm average number of rooms per dwelling.

  • age proportion of owner-occupied units built prior to 1940.

  • dis weighted mean of distances to five Boston employment centres.

  • rad index of accessibility to radial highways.

  • tax full-value property-tax rate per $10,000.

  • ptratio pupil-teacher ratio by town.

  • black 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

  • lstat lower status of the population (percent).

  • medv median value of owner-occupied homes in $1000s.

Let's define our IVs and DV.

  • In this instance we'll use everything but ID, black, and medv as our X (the IVs).
  • We'll use medv (median home value) as our Y (the DV).
In [6]:
x_sm = df.drop(columns=['ID','black','medv'])
y_sm = df['medv']
  • Again, remember we assigned statsmodels.api the alias sm, so we call sm.add_constant() to add a constant column to our X.
  • Next, we assign statsmodels' ordinary least squares (OLS) class to a variable named model and call the .fit() method to fit the model to the data. The fitted results end up held in the model variable.
In [7]:
x_sm = sm.add_constant(x_sm)
model = sm.OLS(y_sm,x_sm).fit()

Let's look at a summary of the model output

As social scientists, this output has everything we need. We can look at the R-square to understand how much of the variance in the DV the IVs explain, and we can look at the IV coefficients and the significance of those coefficients.

In [8]:
model.summary()
Out[8]:
OLS Regression Results
Dep. Variable: medv R-squared: 0.724
Model: OLS Adj. R-squared: 0.714
Method: Least Squares F-statistic: 70.07
Date: Fri, 18 Jan 2019 Prob (F-statistic): 5.58e-82
Time: 08:02:55 Log-Likelihood: -995.49
No. Observations: 333 AIC: 2017.
Df Residuals: 320 BIC: 2066.
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 40.2724 6.085 6.619 0.000 28.301 52.244
crim -0.0982 0.053 -1.864 0.063 -0.202 0.005
zn 0.0487 0.017 2.836 0.005 0.015 0.082
indus 0.0438 0.075 0.581 0.561 -0.104 0.192
chas 4.0569 1.164 3.486 0.001 1.767 6.346
nox -17.0387 4.911 -3.470 0.001 -26.700 -7.377
rm 3.5993 0.525 6.855 0.000 2.566 4.632
age -0.0017 0.017 -0.099 0.921 -0.035 0.032
dis -1.5677 0.268 -5.847 0.000 -2.095 -1.040
rad 0.3267 0.083 3.934 0.000 0.163 0.490
tax -0.0133 0.005 -2.895 0.004 -0.022 -0.004
ptratio -0.8408 0.168 -5.013 0.000 -1.171 -0.511
lstat -0.6203 0.064 -9.621 0.000 -0.747 -0.493
Omnibus: 121.432 Durbin-Watson: 1.083
Prob(Omnibus): 0.000 Jarque-Bera (JB): 453.682
Skew: 1.572 Prob(JB): 3.05e-99
Kurtosis: 7.776 Cond. No. 1.16e+04


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

You can see we get the F-statistic, R-square, adjusted R-square, coefficients, and p-values for each coefficient in an output similar to SPSS or R.
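
If you would rather pull those numbers out programmatically than read them off the summary table, the fitted results object exposes them as attributes; a quick sketch using the model we just fit:

In [ ]:
print(model.rsquared)       # R-square
print(model.rsquared_adj)   # adjusted R-square
print(model.fvalue)         # F-statistic
print(model.f_pvalue)       # p-value of the F-statistic
print(model.params)         # coefficients, including the constant
print(model.pvalues)        # p-value for each coefficient
print(model.conf_int())     # 95% confidence intervals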

A Regression in R

Quick note about running R in a Python notebook.

  • you will need to pip install rpy2
  • Then use the %load_ext rpy2.ipython magic command (as shown below)
  • Then in every cell you want to run R code you will begin the cell with %%R (as shown below)


In [10]:
import warnings # ignore warning messages; I was getting a few warnings about the data structure of the R import
warnings.filterwarnings('ignore')
In [11]:
# enables the %%R magic, this is necessary to run R code in a python kernel, not necessary if you are working in an R kernel
%load_ext rpy2.ipython

First thing we do is load the readr library and the data

While running R code within a Python notebook you need to put %%R at the top of every cell. Again, this is not necessary if you are using an R kernel.

In [12]:
%%R 
library(readr)
train <- read_csv('data/train.csv')

Then we subset the data to the same variables as above (keeping medv in the frame, since lm() needs the DV in the data it is given)

In [13]:
%%R
X <- subset(train, select = -c(black,ID))

Similar to Python, you assign the linear model to a variable named model. In R's formula syntax, medv ~ . tells lm() to use medv as the DV and every other variable in the data as an IV. You then tell lm() which data to use via the data argument.

In [14]:
%%R 
model <- lm(medv ~., data=X)
In [15]:
%%R
summary(model)
Call:
lm(formula = medv ~ ., data = X)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.9251  -2.8487  -0.5153   1.6668  24.5410 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  40.272413   6.084780   6.619 1.53e-10 ***
crim         -0.098221   0.052685  -1.864 0.063195 .  
zn            0.048667   0.017161   2.836 0.004861 ** 
indus         0.043766   0.075267   0.581 0.561333    
chas          4.056862   1.163744   3.486 0.000559 ***
nox         -17.038710   4.910673  -3.470 0.000592 ***
rm            3.599301   0.525072   6.855 3.68e-11 ***
age          -0.001693   0.017110  -0.099 0.921238    
dis          -1.567704   0.268120  -5.847 1.23e-08 ***
rad           0.326749   0.083067   3.934 0.000103 ***
tax          -0.013272   0.004584  -2.895 0.004047 ** 
ptratio      -0.840796   0.167708  -5.013 8.87e-07 ***
lstat        -0.620346   0.064480  -9.621  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.906 on 320 degrees of freedom
Multiple R-squared:  0.7243,	Adjusted R-squared:  0.714 
F-statistic: 70.07 on 12 and 320 DF,  p-value: < 2.2e-16

Similarly, we have the F-statistic, R-square, adjusted R-square, coefficients, p-values, etc.
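
As an aside, statsmodels also has an R-style formula interface (statsmodels.formula.api) that, like lm(), adds the intercept for you. Here's a minimal sketch; note that the formula syntax doesn't support R's "." shorthand, so the right-hand side is built from the column names:

In [ ]:
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('data/train.csv')
X = df.drop(columns=['ID','black'])  # keep medv in the frame, like the R subset above

# spell out 'medv ~ crim + zn + ...' since there is no '.' shorthand
rhs = ' + '.join(col for col in X.columns if col != 'medv')
formula_model = smf.ols('medv ~ ' + rhs, data=X).fit()
formula_model.summary()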

Machine Learning

So far we've been focused on replicating typical stats processes within Python and R. However, there is this entire field/buzzword/"all the cool kids are doing it" term called machine learning. There really isn't much of a difference, honestly. The main difference is the school of thought and the focus of the process. In statistics the focus is largely on interpretability and understanding: examining the coefficient weights, the importance of predictors, and so on.

When it comes to machine learning, the focus is largely on prediction. Practitioners don't care as much about understanding what is doing the predicting; they care about the MSE, the R-square, or whatever the metric of importance is.

The scikit-learn package is a machine-learning package, which means a .summary() method isn't readily available, although you can still extract the same information. I will show how to do so next.

Linear Regression in Scikit Learn


In [17]:
df = pd.read_csv('data/train.csv')## load the dataset

Similar to statsmodels, we can add a constant column ourselves (note that scikit-learn's LinearRegression fits an intercept by default, which is why the constant column ends up with a coefficient of zero further down).

In pandas, if you assign a single number to a dataframe column, that scalar is broadcast to every row of the column.

In [18]:
df['constant']=1.
df.head()
Out[18]:
ID crim zn indus chas nox rm age dis rad tax ptratio black lstat medv constant
0 1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0 1.0
1 2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6 1.0
2 4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4 1.0
3 5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2 1.0
4 7 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9 1.0

As you can see there is now an additional column called constant that is all 1s.

Import the LinearRegression class from the sklearn package

In [19]:
from sklearn.linear_model import LinearRegression
In [20]:
x = df.drop(columns=['ID','black','medv'])# dropping the 3 columns as we did earlier
y = df['medv'] # assigning the median value as the DV as we did earlier

Assign the LinearRegression class to the variable model, then call the .fit() method on the x and y variables and assign the fitted result to the variable lr_model.

In [21]:
model = LinearRegression()
lr_model = model.fit(x,y)

Let's look at the R-square, which is available via the .score() method. Notice that even the language suggests there is a competition to predict: score.

In [22]:
print ('R-square: ',round(model.score(x,y),3))
R-square:  0.724
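
Since machine-learning folks often care about MSE as much as R-square, here is a small sketch of how you might compute it with scikit-learn's metrics module, using the model we just fit (in-sample, purely for illustration):

In [ ]:
from sklearn.metrics import mean_squared_error

preds = lr_model.predict(x)  # in-sample predictions
print('MSE: ', round(mean_squared_error(y, preds), 3))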

Let's look at the Intercept/Constant:

In [23]:
print('Intercept/Constant: ',round((lr_model.intercept_),3))
Intercept/Constant:  40.272

Finally, let's create a dataframe that zips the coefficients to their feature names, so we can examine the coefficients.

In [24]:
pd.DataFrame(list(zip(x.columns,lr_model.coef_)),columns=['features','estimated_Coefficients'])
Out[24]:
features estimated_Coefficients
0 crim -0.098221
1 zn 0.048667
2 indus 0.043766
3 chas 4.056862
4 nox -17.038710
5 rm 3.599301
6 age -0.001693
7 dis -1.567704
8 rad 0.326749
9 tax -0.013272
10 ptratio -0.840796
11 lstat -0.620346
12 constant 0.000000
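
If you want a quick sense of which features carry the largest weights (keeping in mind these are unstandardized coefficients on very different scales), you could sort that same table by absolute value; a small sketch:

In [ ]:
coef_df = pd.DataFrame(list(zip(x.columns, lr_model.coef_)), columns=['features','estimated_Coefficients'])
# order by absolute coefficient size, largest first
coef_df.reindex(coef_df['estimated_Coefficients'].abs().sort_values(ascending=False).index)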

The TLDR Version

In [ ]:
#statsmodels
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv('data/train.csv')
x_sm = df.drop(columns=['ID','black','medv'])
y_sm = df['medv']
x_sm = sm.add_constant(x_sm)
model = sm.OLS(y_sm,x_sm).fit()
model.summary()
In [ ]:
%%R
#R
library(readr)
train <- read_csv('data/train.csv')
X <- subset(train, select = -c(black,ID))
model <- lm(medv ~., data=X)
summary(model)
In [ ]:
#scikitlearn
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_csv('data/train.csv')
x = df.drop(columns=['ID','black','medv'])
y = df['medv']
model = LinearRegression()
lr_model = model.fit(x,y)
print ('R-square: ',round(model.score(x,y),3))
print('Intercept/Constant: ',round((lr_model.intercept_),3))
pd.DataFrame(list(zip(x.columns,lr_model.coef_)),columns=['features','estimated_Coefficients'])

So as you can see, there are several ways to do the same thing, all of which take fewer lines of code than SPSS!!!

Links to documentation:
