Multiple Regression in Python and R


Linear Regression in Python Using Statsmodels

Let's Regression!!

Isn't that what everybody says after their first day in grad methods?


As you will see from the output, statsmodels was built with a lot of similarities to SPSS and R's lm(). One important thing to notice about statsmodels is that, by default, it does not include a constant in the linear model, so you will need to add the constant yourself to get the same results you would get in SPSS or R (there's a quick illustration of this right after the imports below).

Importing Packages

First, we have to import our relevant packages.

In [3]:
import pandas as pd # dataframe library
import numpy as np # linear algebra library
from scipy import stats # stats library from scipy
import statsmodels.api as sm # importing statsmodels.api but giving it an alias, so you don't have to type statsmodels.api every time
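
As a quick illustration of the constant issue mentioned above, here is a minimal sketch on made-up data (not the housing data used below) showing how the estimates change once you add the constant:

In [ ]:
# made-up data (purely illustrative) with a true intercept of 5 and slope of 2
np.random.seed(0)
x_toy = np.random.normal(size=100)
y_toy = 5 + 2 * x_toy + np.random.normal(size=100)

# without add_constant the regression line is forced through the origin
no_const = sm.OLS(y_toy, x_toy).fit()
# add_constant appends a column of 1s, so an intercept gets estimated
with_const = sm.OLS(y_toy, sm.add_constant(x_toy)).fit()

print(no_const.params)    # slope only, no intercept
print(with_const.params)  # intercept (near 5) and slope (near 2)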

Does data matter?

I often get people from I/O or other social sciences telling me that they don't see how many of the publicly available datasets are applicable to them. In my opinion, all datasets are equally valuable when it comes to learning data exploration and manipulation techniques and predictive algorithms. The data below is about housing, but you could just as easily replace features like crime rate and number of rooms with cognitive ability and extraversion, and instead of predicting median house value you'd be predicting job performance. The linear model works exactly the same way.

Load the data

This is the Boston housing training data. It's an easy way to build a straightforward regression model that is comparable across packages and languages.

In [4]:
df = pd.read_csv('data/train.csv')
In [5]:
df.head() # shows the top 5 rows by default; by putting a number in the parentheses I can increase or decrease that amount
Out[5]:
ID crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0 1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
3 5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
4 7 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9
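
A few other pandas one-liners are handy for sanity-checking the data before modeling; a quick sketch:

In [ ]:
print(df.shape)    # (number of rows, number of columns) in this training split
print(df.dtypes)   # data type of each column
df.describe()      # summary statistics for each numeric column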

Data description

The full Boston data frame has 506 rows and 14 columns; the training split loaded above contains 333 of those rows (plus an ID column).

This data frame contains the following columns:

  • crim per capita crime rate by town.

  • zn proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus proportion of non-retail business acres per town.

  • chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • nox nitrogen oxides concentration (parts per 10 million).

  • rm average number of rooms per dwelling.

  • age proportion of owner-occupied units built prior to 1940.

  • dis weighted mean of distances to five Boston employment centres.

  • rad index of accessibility to radial highways.

  • tax full-value property-tax rate per $10,000.

  • ptratio pupil-teacher ratio by town.

  • black 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

  • lstat lower status of the population (percent).

  • medv median value of owner-occupied homes in $1000s.

Let's define our IVs and DV.

  • In this instance we'll use everything but ID, black, and medv as our X (the IVs).
  • We'll use medv (median home value) as our Y (the DV).
In [6]:
x_sm = df.drop(columns=['ID','black','medv'])
y_sm = df['medv']
  • Again, remember we assigned statsmodels.api the alias sm, so we call sm.add_constant() to add a constant column to our X.
  • Next, we assign statsmodels' ordinary least squares (OLS) class to a variable named model and call the .fit() method to fit the model to the data. The fitted results end up held in the model variable.
In [7]:
x_sm = sm.add_constant(x_sm)
model = sm.OLS(y_sm,x_sm).fit()

Let's look at a summary of the model output

As social scientists, this output has everything we need. We can look at the R-square to understand how much of the variance in the DV the IVs explain, and we can look at the IV coefficients and the significance of those coefficients.

In [8]:
model.summary()
Out[8]:
OLS Regression Results
Dep. Variable: medv R-squared: 0.724
Model: OLS Adj. R-squared: 0.714
Method: Least Squares F-statistic: 70.07
Date: Fri, 18 Jan 2019 Prob (F-statistic): 5.58e-82
Time: 08:02:55 Log-Likelihood: -995.49
No. Observations: 333 AIC: 2017.
Df Residuals: 320 BIC: 2066.
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 40.2724 6.085 6.619 0.000 28.301 52.244
crim -0.0982 0.053 -1.864 0.063 -0.202 0.005
zn 0.0487 0.017 2.836 0.005 0.015 0.082
indus 0.0438 0.075 0.581 0.561 -0.104 0.192
chas 4.0569 1.164 3.486 0.001 1.767 6.346
nox -17.0387 4.911 -3.470 0.001 -26.700 -7.377
rm 3.5993 0.525 6.855 0.000 2.566 4.632
age -0.0017 0.017 -0.099 0.921 -0.035 0.032
dis -1.5677 0.268 -5.847 0.000 -2.095 -1.040
rad 0.3267 0.083 3.934 0.000 0.163 0.490
tax -0.0133 0.005 -2.895 0.004 -0.022 -0.004
ptratio -0.8408 0.168 -5.013 0.000 -1.171 -0.511
lstat -0.6203 0.064 -9.621 0.000 -0.747 -0.493
Omnibus: 121.432 Durbin-Watson: 1.083
Prob(Omnibus): 0.000 Jarque-Bera (JB): 453.682
Skew: 1.572 Prob(JB): 3.05e-99
Kurtosis: 7.776 Cond. No. 1.16e+04


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

You can see we get the F-statistic, R-square, adjusted R-square, coefficients, and p-values for each coefficient in an output similar to SPSS or R.
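
If you would rather pull those numbers out programmatically than read them off the summary table, the fitted results object exposes them as attributes; a quick sketch using the model we just fit:

In [ ]:
print(model.rsquared)       # R-square
print(model.rsquared_adj)   # adjusted R-square
print(model.fvalue)         # F-statistic
print(model.f_pvalue)       # p-value of the F-statistic
print(model.params)         # coefficients, including the constant
print(model.pvalues)        # p-value for each coefficient
print(model.conf_int())     # 95% confidence intervals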

A Regression in R

Quick note about running R in a Python notebook.

  • you will need to pip install rpy2
  • Then use the %load_ext rpy2.ipython magic command (as shown below)
  • Then in every cell you want to run R code you will begin the cell with %%R (as shown below)


In [10]:
import warnings # ignore warning messages; I was getting a few warnings about the data structure of the R import
warnings.filterwarnings('ignore')
In [11]:
# enables the %%R magic, this is necessary to run R code in a python kernel, not necessary if you are working in an R kernel
%load_ext rpy2.ipython

First thing we do is load the readr library and the data

While running R code within a Python notebook you need to put %%R at the top of every cell. Again, this is not necessary if you are using an R kernel.

In [12]:
%%R 
library(readr)
train <- read_csv('data/train.csv')

Then we subset the data to the same variables as above (keeping medv in the frame, since lm() needs the DV in the data it is given)

In [13]:
%%R
X <- subset(train, select = -c(black,ID))

Similar to Python, you assign the linear model to a variable named model. In R's formula syntax, medv ~ . tells lm() to use medv as the DV and every other variable in the data as an IV. You then tell lm() which data to use via the data argument.

In [14]:
%%R 
model <- lm(medv ~., data=X)
In [15]:
%%R
summary(model)
Call:
lm(formula = medv ~ ., data = X)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.9251  -2.8487  -0.5153   1.6668  24.5410 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  40.272413   6.084780   6.619 1.53e-10 ***
crim         -0.098221   0.052685  -1.864 0.063195 .  
zn            0.048667   0.017161   2.836 0.004861 ** 
indus         0.043766   0.075267   0.581 0.561333    
chas          4.056862   1.163744   3.486 0.000559 ***
nox         -17.038710   4.910673  -3.470 0.000592 ***
rm            3.599301   0.525072   6.855 3.68e-11 ***
age          -0.001693   0.017110  -0.099 0.921238    
dis          -1.567704   0.268120  -5.847 1.23e-08 ***
rad           0.326749   0.083067   3.934 0.000103 ***
tax          -0.013272   0.004584  -2.895 0.004047 ** 
ptratio      -0.840796   0.167708  -5.013 8.87e-07 ***
lstat        -0.620346   0.064480  -9.621  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.906 on 320 degrees of freedom
Multiple R-squared:  0.7243,	Adjusted R-squared:  0.714 
F-statistic: 70.07 on 12 and 320 DF,  p-value: < 2.2e-16

Similarly, we have the F-statistic, R-square, adjusted R-square, coefficients, p-values, etc.
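
As an aside, statsmodels also has an R-style formula interface (statsmodels.formula.api) that, like lm(), adds the intercept for you. Here's a minimal sketch; note that the formula syntax doesn't support R's "." shorthand, so the right-hand side is built from the column names:

In [ ]:
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('data/train.csv')
X = df.drop(columns=['ID','black'])  # keep medv in the frame, like the R subset above

# spell out 'medv ~ crim + zn + ...' since there is no '.' shorthand
rhs = ' + '.join(col for col in X.columns if col != 'medv')
formula_model = smf.ols('medv ~ ' + rhs, data=X).fit()
formula_model.summary()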

Machine Learning

So far we've been focused on replicating typical stats processes within Python and R. However, there is this entire field/buzzword/"all the cool kids are doing it" term called machine learning. There really isn't much of a difference, honestly. The main difference is the school of thought and the focus of the process. In statistics the focus is largely on interpretability and understanding: examining the coefficient weights, the importance of predictors, and so on.

When it comes to machine learning, the focus is largely on prediction. Practitioners don't care as much about understanding what is doing the predicting; they care about the MSE, the R-square, or whatever the metric of importance is.

The scikit-learn package is a machine-learning package, which means a .summary() method isn't readily available, although you can still extract the same information. I will show how to do so next.

Linear Regression in Scikit Learn


In [17]:
df = pd.read_csv('data/train.csv')## load the dataset

Similar to statsmodels, we can add a constant column ourselves (note that scikit-learn's LinearRegression fits an intercept by default, which is why the constant column ends up with a coefficient of zero further down).

In pandas, if you assign a single number to a dataframe column, that scalar is broadcast to every row of the column.

In [18]:
df['constant']=1.
df.head()
Out[18]:
ID crim zn indus chas nox rm age dis rad tax ptratio black lstat medv constant
0 1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0 1.0
1 2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6 1.0
2 4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4 1.0
3 5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2 1.0
4 7 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9 1.0

As you can see there is now an additional column called constant that is all 1s.

Import the LinearRegression class from the sklearn package

In [19]:
from sklearn.linear_model import LinearRegression
In [20]:
x = df.drop(columns=['ID','black','medv'])# dropping the 3 columns as we did earlier
y = df['medv'] # assigning the median value as the DV as we did earlier

Assign the LinearRegression class to the variable model, then call the .fit() method on the x and y variables and assign the fitted result to the variable lr_model.

In [21]:
model = LinearRegression()
lr_model = model.fit(x,y)

Let's look at the R-square, which is available via the .score() method. Notice that even the language suggests there is a competition to predict: score.

In [22]:
print ('R-square: ',round(model.score(x,y),3))
R-square:  0.724
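
Since machine-learning folks often care about MSE as much as R-square, here is a small sketch of how you might compute it with scikit-learn's metrics module, using the model we just fit (in-sample, purely for illustration):

In [ ]:
from sklearn.metrics import mean_squared_error

preds = lr_model.predict(x)  # in-sample predictions
print('MSE: ', round(mean_squared_error(y, preds), 3))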

Let's look at the Intercept/Constant:

In [23]:
print('Intercept/Constant: ',round((lr_model.intercept_),3))
Intercept/Constant:  40.272

Finally, let's create a dataframe that zips the coefficients to their feature names, so we can examine the coefficients.

In [24]:
pd.DataFrame(list(zip(x.columns,lr_model.coef_)),columns=['features','estimated_Coefficients'])
Out[24]:
features estimated_Coefficients
0 crim -0.098221
1 zn 0.048667
2 indus 0.043766
3 chas 4.056862
4 nox -17.038710
5 rm 3.599301
6 age -0.001693
7 dis -1.567704
8 rad 0.326749
9 tax -0.013272
10 ptratio -0.840796
11 lstat -0.620346
12 constant 0.000000
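
If you want a quick sense of which features carry the largest weights (keeping in mind these are unstandardized coefficients on very different scales), you could sort that same table by absolute value; a small sketch:

In [ ]:
coef_df = pd.DataFrame(list(zip(x.columns, lr_model.coef_)), columns=['features','estimated_Coefficients'])
# order by absolute coefficient size, largest first
coef_df.reindex(coef_df['estimated_Coefficients'].abs().sort_values(ascending=False).index)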

The TLDR Version

In [ ]:
#statsmodels
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv('data/train.csv')
x_sm = df.drop(columns=['ID','black','medv'])
y_sm = df['medv']
x_sm = sm.add_constant(x_sm)
model = sm.OLS(y_sm,x_sm).fit()
model.summary()
In [ ]:
%%R
#R
library(readr)
train <- read_csv('data/train.csv')
X <- subset(train, select = -c(black,ID))
model <- lm(medv ~., data=X)
summary(model)
In [ ]:
#scikitlearn
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_csv('data/train.csv')
x = df.drop(columns=['ID','black','medv'])
y = df['medv']
model = LinearRegression()
lr_model = model.fit(x,y)
print ('R-square: ',round(model.score(x,y),3))
print('Intercept/Constant: ',round((lr_model.intercept_),3))
pd.DataFrame(list(zip(x.columns,lr_model.coef_)),columns=['features','estimated_Coefficients'])

So as you can see, there are several ways to do the same thing, all of which take fewer lines of code than SPSS!!!

Links to documentation:
