Examining Data in Python via Pandas


One of the most common uses of SPSS is to quickly look at the descriptives of multiple variables at once. Below is a simple example output from SPSS.

[Image: example SPSS descriptives output]

Does this look familiar?

Aside from correlations, this is one of the most common syntax commands I've traditionally used within SPSS. This article will walk you through producing extremely similar output in Python. I'll also show you how to use other common methods like crosstabs and correlation matrices. Finally, I'll introduce you to a great package that lets you investigate the data even further with a few additional lines of code.

Let's explore some data!

First, let's import pandas, numpy, and matplotlib (with inline plotting enabled).

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

Quick note about Pandas

Pandas is a package started by Wes McKinney while he was working as a quant at a hedge fund; he wanted a better way to work with data than Excel macros had to offer.

The Dataset

This is a loan dataset, so it isn't perfectly suited to the social sciences, especially I/O. Still, paying or not paying a loan is a human behavior, much like many of the variables we study as social scientists. Job performance, employee engagement, job satisfaction, and so on all have predictors, and because this dataset has human behavior as the outcome (whether or not the loan was fully paid), it closely resembles the problems we'd be likely to encounter.

This is actually lendingclub.com data which you can read more about here. Here is what each of the variables/columns represent:

  • credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
  • purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
  • int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
  • installment: The monthly installments owed by the borrower if the loan is funded.
  • log.annual.inc: The natural log of the self-reported annual income of the borrower.
  • dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
  • fico: The FICO credit score of the borrower.
  • days.with.cr.line: The number of days the borrower has had a credit line.
  • revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
  • revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
  • inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
  • delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
  • pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
In [4]:
df = pd.read_csv('data/loan_data.csv') ## load the data
In [5]:
columns = list(df.columns) #give us a list of the columns for easy reference

Let's write a few quick lines of code to figure out each variable's data type, because we can't get descriptives on object-type variables, which typically represent strings.

In [6]:
data_type = []
for i in df.columns:
    data_type.append(df[i].dtype)
pd.DataFrame(list(zip(df.columns, data_type)),columns=['variables','data_types'])
Out[6]:
variables data_types
0 credit.policy int64
1 purpose object
2 int.rate float64
3 installment float64
4 log.annual.inc float64
5 dti float64
6 fico int64
7 days.with.cr.line float64
8 revol.bal int64
9 revol.util float64
10 inq.last.6mths int64
11 delinq.2yrs int64
12 pub.rec int64
13 not.fully.paid int64

So we can quickly see that every variable aside from purpose should give us usable descriptives.
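Incidentally, pandas can produce the same variable/dtype pairing in one line via the .dtypes attribute. Here's a minimal sketch using a tiny stand-in DataFrame (the loan data itself isn't loaded here):

```python
import pandas as pd

# Tiny stand-in for the loan data
df = pd.DataFrame({'purpose': ['credit_card', 'all_other'],
                   'fico': [707, 682],
                   'int.rate': [0.11, 0.12]})

# .dtypes returns a Series of dtypes indexed by column name --
# the same information as the loop above, in one line
dtype_table = df.dtypes.reset_index()
dtype_table.columns = ['variables', 'data_types']
print(dtype_table)
```

Either approach works; the loop version just makes the mechanics more explicit.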

Descriptive Statistics

How do we get descriptives in Python? We can do this with the .describe() method.

Note: I will use the round function to make the data look nicer.

In [7]:
round(df.describe(),2)
Out[7]:
credit.policy int.rate installment log.annual.inc dti fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec not.fully.paid
count 9578.0 9578.00 9578.00 9578.00 9578.00 9578.00 9578.00 9578.00 9578.00 9578.00 9578.00 9578.00 9578.00
mean 0.8 0.12 319.09 10.93 12.61 710.85 4560.77 16913.96 46.80 1.58 0.16 0.06 0.16
std 0.4 0.03 207.07 0.61 6.88 37.97 2496.93 33756.19 29.01 2.20 0.55 0.26 0.37
min 0.0 0.06 15.67 7.55 0.00 612.00 178.96 0.00 0.00 0.00 0.00 0.00 0.00
25% 1.0 0.10 163.77 10.56 7.21 682.00 2820.00 3187.00 22.60 0.00 0.00 0.00 0.00
50% 1.0 0.12 268.95 10.93 12.66 707.00 4139.96 8596.00 46.30 1.00 0.00 0.00 0.00
75% 1.0 0.14 432.76 11.29 17.95 737.00 5730.00 18249.50 70.90 2.00 0.00 0.00 0.00
max 1.0 0.22 940.14 14.53 29.96 827.00 17639.96 1207359.00 119.00 33.00 13.00 5.00 1.00

SPSS won't let you add a column to a descriptives analysis unless its data type is an int or float. In Python you can include such a column, but it simply won't appear in the output. Given what we saw above, we should expect descriptive statistics for every column except purpose, and that's exactly what we get. This also gives us more than the base descriptives command in SPSS, since we see the quartiles as well.
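If you do want a summary of purpose, describe() can be pointed at object columns explicitly via its include parameter. A small sketch with stand-in data:

```python
import pandas as pd

# Stand-in data; describe() skips object columns by default
df = pd.DataFrame({'purpose': ['credit_card', 'debt_consolidation', 'credit_card'],
                   'fico': [707, 682, 712]})

# include='object' summarizes the categorical column instead:
# count, number of unique values, most frequent value, and its frequency
cat_summary = df.describe(include='object')
print(cat_summary)
```

This is the closest pandas gets to "descriptives" for a string variable; for full frequency tables you'd reach for value_counts() or a crosstab.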

If we wanted to look at specific columns we would just subset them like this:

In [8]:
round(df[['log.annual.inc','fico']].describe(),2)
Out[8]:
log.annual.inc fico
count 9578.00 9578.00
mean 10.93 710.85
std 0.61 37.97
min 7.55 612.00
25% 10.56 682.00
50% 10.93 707.00
75% 11.29 737.00
max 14.53 827.00

But I hear you saying: that's great and all, but I've used SPSS for so long that I prefer the stacked output to the wide output (variables as rows instead of columns).

Well... the great thing about object-oriented programming is that the table is an object we can manipulate.

So... let's think of an easy way to do that. What if we just transposed the object?

In [37]:
round(df.describe().transpose(),2)
Out[37]:
count mean std min 25% 50% 75% max
credit.policy 9578.0 0.80 0.40 0.00 1.00 1.00 1.00 1.00
int.rate 9578.0 0.12 0.03 0.06 0.10 0.12 0.14 0.22
installment 9578.0 319.09 207.07 15.67 163.77 268.95 432.76 940.14
log.annual.inc 9578.0 10.93 0.61 7.55 10.56 10.93 11.29 14.53
dti 9578.0 12.61 6.88 0.00 7.21 12.66 17.95 29.96
fico 9578.0 710.85 37.97 612.00 682.00 707.00 737.00 827.00
days.with.cr.line 9578.0 4560.77 2496.93 178.96 2820.00 4139.96 5730.00 17639.96
revol.bal 9578.0 16913.96 33756.19 0.00 3187.00 8596.00 18249.50 1207359.00
revol.util 9578.0 46.80 29.01 0.00 22.60 46.30 70.90 119.00
inq.last.6mths 9578.0 1.58 2.20 0.00 0.00 1.00 2.00 33.00
delinq.2yrs 9578.0 0.16 0.55 0.00 0.00 0.00 0.00 13.00
pub.rec 9578.0 0.06 0.26 0.00 0.00 0.00 0.00 5.00
not.fully.paid 9578.0 0.16 0.37 0.00 0.00 0.00 0.00 1.00
fico_binned 9578.0 2.35 0.75 1.00 2.00 2.00 3.00 5.00

Now it looks identical to the SPSS output above!

Crosstabs

Another really common output I've used in SPSS is the crosstabs view. It looks a little like this:

[Image: example SPSS crosstabs output]

Just like descriptives, pandas has you covered for crosstabs as well. Most of these variables are continuous by nature, so to make a crosstab viewable, let's quickly bin one that might be interesting to look at. We'll use np.linspace and np.digitize:

  • np.linspace looks at the min and max of the fico score and creates 5 evenly spaced cut points between them. (Note that 5 cut points define 4 equal-width intervals plus the max itself, which is why the top bin below contains a single borrower; pass 6 to get 5 full bins.)
  • np.digitize then returns, for each borrower, the index of the bin their score falls into.
  • Finally, I build a crosstab so we can examine the number of individuals that either paid back their loan (0) or did not pay back their loan (1).
In [9]:
bins = np.linspace(df['fico'].min(), df['fico'].max(), 5)
df['fico_binned'] = np.digitize(df['fico'],bins)
pd.crosstab(df['not.fully.paid'],df['fico_binned'],margins=True)
Out[9]:
fico_binned 1 2 3 4 5 All
not.fully.paid
0 659 4138 2625 622 1 8045
1 244 915 336 38 0 1533
All 903 5053 2961 660 1 9578

You can see a trend where the proportion of borrowers who did not pay back their loans becomes smaller and smaller as we move from the lowest bins to the highest. That makes sense: people with lower fico scores typically have them for a reason; perhaps they are less fiscally responsible with credit.
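That trend is easier to see as proportions than as raw counts. pd.crosstab accepts a normalize argument; normalize='columns' divides each count by its bin total. A minimal sketch with stand-in data:

```python
import pandas as pd

# Stand-in data: two fico bins, one unpaid loan in each
df = pd.DataFrame({'not.fully.paid': [0, 0, 1, 0, 0, 1],
                   'fico_binned':    [1, 1, 1, 2, 2, 2]})

# normalize='columns' converts counts to within-bin proportions,
# so each column sums to 1
props = pd.crosstab(df['not.fully.paid'], df['fico_binned'],
                    normalize='columns')
print(round(props, 2))
```

On the real data, this would show the unpaid-loan share per fico bin directly, rather than making you divide counts in your head.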

Correlation Matrix

[Image: example SPSS correlation matrix output]

As I/O psychologists, examining the correlations among our variables is probably our bread and butter, and this SPSS output reflects that. Lucky for us, that's also extremely simple to do in pandas using the pandas.DataFrame.corr method.

In [25]:
round(df.corr(),3)
Out[25]:
credit.policy int.rate installment log.annual.inc dti fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec not.fully.paid fico_binned
credit.policy 1.000 -0.294 0.059 0.035 -0.091 0.348 0.099 -0.188 -0.104 -0.536 -0.076 -0.054 -0.158 0.345
int.rate -0.294 1.000 0.276 0.056 0.220 -0.715 -0.124 0.093 0.465 0.203 0.156 0.098 0.160 -0.658
installment 0.059 0.276 1.000 0.448 0.050 0.086 0.183 0.234 0.081 -0.010 -0.004 -0.033 0.050 0.077
log.annual.inc 0.035 0.056 0.448 1.000 -0.054 0.115 0.337 0.372 0.055 0.029 0.029 0.017 -0.033 0.108
dti -0.091 0.220 0.050 -0.054 1.000 -0.241 0.060 0.189 0.337 0.029 -0.022 0.006 0.037 -0.227
fico 0.348 -0.715 0.086 0.115 -0.241 1.000 0.264 -0.016 -0.541 -0.185 -0.216 -0.148 -0.150 0.921
days.with.cr.line 0.099 -0.124 0.183 0.337 0.060 0.264 1.000 0.229 -0.024 -0.042 0.081 0.072 -0.029 0.244
revol.bal -0.188 0.093 0.234 0.372 0.189 -0.016 0.229 1.000 0.204 0.022 -0.033 -0.031 0.054 -0.013
revol.util -0.104 0.465 0.081 0.055 0.337 -0.541 -0.024 0.204 1.000 -0.014 -0.043 0.067 0.082 -0.496
inq.last.6mths -0.536 0.203 -0.010 0.029 0.029 -0.185 -0.042 0.022 -0.014 1.000 0.021 0.073 0.149 -0.175
delinq.2yrs -0.076 0.156 -0.004 0.029 -0.022 -0.216 0.081 -0.033 -0.043 0.021 1.000 0.009 0.009 -0.200
pub.rec -0.054 0.098 -0.033 0.017 0.006 -0.148 0.072 -0.031 0.067 0.073 0.009 1.000 0.049 -0.136
not.fully.paid -0.158 0.160 0.050 -0.033 0.037 -0.150 -0.029 0.054 0.082 0.149 0.009 0.049 1.000 -0.143
fico_binned 0.345 -0.658 0.077 0.108 -0.227 0.921 0.244 -0.013 -0.496 -0.175 -0.200 -0.136 -0.143 1.000

We could also turn it into a nicer looking heatmap visual by importing the seaborn package.

There are many cmaps available (viridis, coolwarm, and magma, to name a few), but you can also create your own using matplotlib's LinearSegmentedColormap functionality.
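A custom colormap can be built from a list of colors with LinearSegmentedColormap.from_list; the name and colors below are arbitrary choices for illustration. A minimal sketch:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LinearSegmentedColormap

# Interpolate smoothly from blue through white to red
custom_cmap = LinearSegmentedColormap.from_list('blue_white_red',
                                                ['blue', 'white', 'red'])

# Use it anywhere a cmap string is accepted, e.g. a heatmap-style plot
data = np.random.RandomState(0).rand(5, 5)
plt.imshow(data, cmap=custom_cmap)
plt.colorbar()
```

A diverging map like this is a natural fit for correlation heatmaps, since zero sits at the neutral middle color.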

In [34]:
import seaborn as sns
plt.figure(figsize=(12,10))
sns.heatmap(df.corr(),cmap='viridis',annot=True);

Of course, it would already be impressive if this were all pandas had to offer, but there's much more.

Beyond these, there are plenty of other important pandas methods I've leaned on over the past few years.

There's also a great external package I want to talk about that allows you to quickly look at most of your data using the most common EDA (Exploratory Data Analysis) methods.

Pandas Profiling

Note: in order to use pandas profiling (which we will explore next) you will need to install the package. You can do so by running the following command in your terminal/Anaconda prompt:

conda install -c conda-forge pandas-profiling
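If you use pip rather than conda, the equivalent command (assuming the package name hasn't changed since this was written) would be:

```shell
pip install pandas-profiling
```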

Let's see what additional features pandas profiling has to offer!

As you can see below the pandas profiling package offers just about everything you could ever want to quickly look through your data.

Here's a quick list:

  • Number of variables
  • Number of observations
  • Percent of missing data
  • Size of the data in memory
  • data types (numerical, categorical, boolean, etc.)
  • A flag on features/variables with a large # of zero values
  • Common statistics (like seen in the .describe() pandas method)
  • Histograms for each feature/variable
  • Common values
  • Extreme values
  • Pearson correlation
  • Spearman correlation

As far as understanding your data goes, this is about as good as it gets.

You are going to have to scroll down a little ways given the amount of output pandas profiling produces from just one line of code, but please do, so we can see if anything jumps out at us.

In [36]:
import pandas_profiling
pandas_profiling.ProfileReport(df)
Out[36]:

Overview

Dataset info

Number of variables 14
Number of observations 9578
Total Missing (%) 0.0%
Total size in memory 1.0 MiB
Average record size in memory 112.0 B

Variables types

Numeric 11
Categorical 1
Boolean 2
Date 0
Text (Unique) 0
Rejected 0
Unsupported 0

Warnings

Variables

credit.policy
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Mean 0.80497
Value Count Frequency (%)  
1 7710 80.5%
 
0 1868 19.5%
 

days.with.cr.line
Numeric

Distinct count 2687
Unique (%) 28.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 4560.8
Minimum 178.96
Maximum 17640
Zeros (%) 0.0%

Quantile statistics

Minimum 178.96
5-th percentile 1320
Q1 2820
Median 4140
Q3 5730
95-th percentile 9330
Maximum 17640
Range 17461
Interquartile range 2910

Descriptive statistics

Standard deviation 2496.9
Coef of variation 0.54748
Kurtosis 1.9379
Mean 4560.8
MAD 1897.7
Skewness 1.1557
Sum 43683000
Variance 6234700
Memory size 74.9 KiB
Value Count Frequency (%)  
3660.0 50 0.5%
 
3630.0 48 0.5%
 
3990.0 46 0.5%
 
4410.0 44 0.5%
 
3600.0 41 0.4%
 
2550.0 38 0.4%
 
4080.0 38 0.4%
 
3690.0 37 0.4%
 
1800.0 37 0.4%
 
4020.0 35 0.4%
 
Other values (2677) 9164 95.7%
 

Minimum 5 values

Value Count Frequency (%)  
178.95833330000002 1 0.0%
 
180.04166669999998 3 0.0%
 
181.0 1 0.0%
 
183.04166669999998 1 0.0%
 
209.04166669999998 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
16260.0 1 0.0%
 
16350.0 1 0.0%
 
16652.0 1 0.0%
 
17616.0 1 0.0%
 
17639.95833 1 0.0%
 

delinq.2yrs
Numeric

Distinct count 11
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.16371
Minimum 0
Maximum 13
Zeros (%) 88.3%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 1
Maximum 13
Range 13
Interquartile range 0

Descriptive statistics

Standard deviation 0.54621
Coef of variation 3.3365
Kurtosis 71.433
Mean 0.16371
MAD 0.28913
Skewness 6.0618
Sum 1568
Variance 0.29835
Memory size 74.9 KiB
Value Count Frequency (%)  
0 8458 88.3%
 
1 832 8.7%
 
2 192 2.0%
 
3 65 0.7%
 
4 19 0.2%
 
5 6 0.1%
 
6 2 0.0%
 
7 1 0.0%
 
13 1 0.0%
 
11 1 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0 8458 88.3%
 
1 832 8.7%
 
2 192 2.0%
 
3 65 0.7%
 
4 19 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
6 2 0.0%
 
7 1 0.0%
 
8 1 0.0%
 
11 1 0.0%
 
13 1 0.0%
 

dti
Numeric

Distinct count 2529
Unique (%) 26.4%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 12.607
Minimum 0
Maximum 29.96
Zeros (%) 0.9%

Quantile statistics

Minimum 0
5-th percentile 1.27
Q1 7.2125
Median 12.665
Q3 17.95
95-th percentile 23.65
Maximum 29.96
Range 29.96
Interquartile range 10.738

Descriptive statistics

Standard deviation 6.884
Coef of variation 0.54606
Kurtosis -0.90036
Mean 12.607
MAD 5.796
Skewness 0.023941
Sum 120750
Variance 47.389
Memory size 74.9 KiB
Value Count Frequency (%)  
0.0 89 0.9%
 
10.0 19 0.2%
 
0.6 16 0.2%
 
13.16 13 0.1%
 
19.2 13 0.1%
 
15.1 13 0.1%
 
12.0 13 0.1%
 
6.0 13 0.1%
 
13.28 12 0.1%
 
10.8 12 0.1%
 
Other values (2519) 9365 97.8%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 89 0.9%
 
0.01 1 0.0%
 
0.02 1 0.0%
 
0.03 1 0.0%
 
0.04 2 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
29.72 1 0.0%
 
29.74 1 0.0%
 
29.9 1 0.0%
 
29.95 1 0.0%
 
29.96 1 0.0%
 

fico
Numeric

Distinct count 44
Unique (%) 0.5%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 710.85
Minimum 612
Maximum 827
Zeros (%) 0.0%

Quantile statistics

Minimum 612
5-th percentile 657
Q1 682
Median 707
Q3 737
95-th percentile 782
Maximum 827
Range 215
Interquartile range 55

Descriptive statistics

Standard deviation 37.971
Coef of variation 0.053416
Kurtosis -0.42231
Mean 710.85
MAD 31.264
Skewness 0.47126
Sum 6808486
Variance 1441.8
Memory size 74.9 KiB
Value Count Frequency (%)  
687 548 5.7%
 
682 536 5.6%
 
692 498 5.2%
 
697 476 5.0%
 
702 472 4.9%
 
707 444 4.6%
 
667 438 4.6%
 
677 427 4.5%
 
717 424 4.4%
 
662 414 4.3%
 
Other values (34) 4901 51.2%
 

Minimum 5 values

Value Count Frequency (%)  
612 2 0.0%
 
617 1 0.0%
 
622 1 0.0%
 
627 2 0.0%
 
632 6 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
807 45 0.5%
 
812 33 0.3%
 
817 6 0.1%
 
822 5 0.1%
 
827 1 0.0%
 

inq.last.6mths
Numeric

Distinct count 28
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.5775
Minimum 0
Maximum 33
Zeros (%) 38.0%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 1
Q3 2
95-th percentile 5
Maximum 33
Range 33
Interquartile range 2

Descriptive statistics

Standard deviation 2.2002
Coef of variation 1.3948
Kurtosis 26.288
Mean 1.5775
MAD 1.4949
Skewness 3.5842
Sum 15109
Variance 4.8411
Memory size 74.9 KiB
Value Count Frequency (%)  
0 3637 38.0%
 
1 2462 25.7%
 
2 1384 14.4%
 
3 864 9.0%
 
4 475 5.0%
 
5 278 2.9%
 
6 165 1.7%
 
7 100 1.0%
 
8 72 0.8%
 
9 47 0.5%
 
Other values (18) 94 1.0%
 

Minimum 5 values

Value Count Frequency (%)  
0 3637 38.0%
 
1 2462 25.7%
 
2 1384 14.4%
 
3 864 9.0%
 
4 475 5.0%
 

Maximum 5 values

Value Count Frequency (%)  
27 1 0.0%
 
28 1 0.0%
 
31 1 0.0%
 
32 1 0.0%
 
33 1 0.0%
 

installment
Numeric

Distinct count 4788
Unique (%) 50.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 319.09
Minimum 15.67
Maximum 940.14
Zeros (%) 0.0%

Quantile statistics

Minimum 15.67
5-th percentile 65.559
Q1 163.77
Median 268.95
Q3 432.76
95-th percentile 756.27
Maximum 940.14
Range 924.47
Interquartile range 268.99

Descriptive statistics

Standard deviation 207.07
Coef of variation 0.64894
Kurtosis 0.13791
Mean 319.09
MAD 165.63
Skewness 0.91252
Sum 3056200
Variance 42879
Memory size 74.9 KiB
Value Count Frequency (%)  
317.72 41 0.4%
 
316.11 34 0.4%
 
319.47 29 0.3%
 
381.26 27 0.3%
 
662.68 27 0.3%
 
156.1 24 0.3%
 
320.95 24 0.3%
 
669.33 23 0.2%
 
334.67 23 0.2%
 
188.02 23 0.2%
 
Other values (4778) 9303 97.1%
 

Minimum 5 values

Value Count Frequency (%)  
15.67 1 0.0%
 
15.69 1 0.0%
 
15.75 1 0.0%
 
15.76 1 0.0%
 
15.91 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
916.95 2 0.0%
 
918.02 2 0.0%
 
922.42 1 0.0%
 
926.83 2 0.0%
 
940.14 1 0.0%
 

int.rate
Numeric

Distinct count 249
Unique (%) 2.6%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.12264
Minimum 0.06
Maximum 0.2164
Zeros (%) 0.0%

Quantile statistics

Minimum 0.06
5-th percentile 0.0774
Q1 0.1039
Median 0.1221
Q3 0.1407
95-th percentile 0.167
Maximum 0.2164
Range 0.1564
Interquartile range 0.0368

Descriptive statistics

Standard deviation 0.026847
Coef of variation 0.21891
Kurtosis -0.22432
Mean 0.12264
MAD 0.021441
Skewness 0.16442
Sum 1174.6
Variance 0.00072076
Memory size 74.9 KiB
Value Count Frequency (%)  
0.1253 354 3.7%
 
0.0894 299 3.1%
 
0.1183 243 2.5%
 
0.1218 215 2.2%
 
0.0963 210 2.2%
 
0.1114 206 2.2%
 
0.08 198 2.1%
 
0.1287 197 2.1%
 
0.1148 193 2.0%
 
0.0932 187 2.0%
 
Other values (239) 7276 76.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.06 8 0.1%
 
0.0639 4 0.0%
 
0.0676 9 0.1%
 
0.0705 23 0.2%
 
0.0712 9 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
0.2052 4 0.0%
 
0.2086 6 0.1%
 
0.209 2 0.0%
 
0.2121 7 0.1%
 
0.2164 2 0.0%
 

log.annual.inc
Numeric

Distinct count 1987
Unique (%) 20.7%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 10.932
Minimum 7.5475
Maximum 14.528
Zeros (%) 0.0%

Quantile statistics

Minimum 7.5475
5-th percentile 9.9179
Q1 10.558
Median 10.929
Q3 11.291
95-th percentile 11.918
Maximum 14.528
Range 6.9809
Interquartile range 0.73288

Descriptive statistics

Standard deviation 0.61481
Coef of variation 0.056239
Kurtosis 1.609
Mean 10.932
MAD 0.46903
Skewness 0.028668
Sum 104710
Variance 0.37799
Memory size 74.9 KiB
Value Count Frequency (%)  
11.00209984 308 3.2%
 
10.81977828 248 2.6%
 
10.59663473 224 2.3%
 
10.30895266 224 2.3%
 
10.71441777 221 2.3%
 
11.22524339 196 2.0%
 
11.15625052 165 1.7%
 
10.77895629 149 1.6%
 
10.91508846 147 1.5%
 
11.08214255 146 1.5%
 
Other values (1977) 7550 78.8%
 

Minimum 5 values

Value Count Frequency (%)  
7.547501682999999 1 0.0%
 
7.60090246 1 0.0%
 
8.101677747 1 0.0%
 
8.160518247 1 0.0%
 
8.188689124 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
13.71015004 2 0.0%
 
13.99783211 1 0.0%
 
14.12446477 1 0.0%
 
14.18015367 1 0.0%
 
14.52835448 1 0.0%
 

not.fully.paid
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Mean 0.16005
Value Count Frequency (%)  
0 8045 84.0%
 
1 1533 16.0%
 

pub.rec
Numeric

Distinct count 6
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.062122
Minimum 0
Maximum 5
Zeros (%) 94.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 1
Maximum 5
Range 5
Interquartile range 0

Descriptive statistics

Standard deviation 0.26213
Coef of variation 4.2196
Kurtosis 38.781
Mean 0.062122
MAD 0.11699
Skewness 5.1264
Sum 595
Variance 0.06871
Memory size 74.9 KiB
Value Count Frequency (%)  
0 9019 94.2%
 
1 533 5.6%
 
2 19 0.2%
 
3 5 0.1%
 
5 1 0.0%
 
4 1 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0 9019 94.2%
 
1 533 5.6%
 
2 19 0.2%
 
3 5 0.1%
 
4 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
1 533 5.6%
 
2 19 0.2%
 
3 5 0.1%
 
4 1 0.0%
 
5 1 0.0%
 

purpose
Categorical

Distinct count 7
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Value Count Frequency (%)  
debt_consolidation 3957 41.3%
 
all_other 2331 24.3%
 
credit_card 1262 13.2%
 
home_improvement 629 6.6%
 
small_business 619 6.5%
 
major_purchase 437 4.6%
 
educational 343 3.6%
 

revol.bal
Numeric

Distinct count 7869
Unique (%) 82.2%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 16914
Minimum 0
Maximum 1207359
Zeros (%) 3.4%

Quantile statistics

Minimum 0
5-th percentile 127.7
Q1 3187
Median 8596
Q3 18250
95-th percentile 57654
Maximum 1207359
Range 1207359
Interquartile range 15062

Descriptive statistics

Standard deviation 33756
Coef of variation 1.9958
Kurtosis 259.66
Mean 16914
MAD 15560
Skewness 11.161
Sum 162001946
Variance 1139500000
Memory size 74.9 KiB
Value Count Frequency (%)  
0 321 3.4%
 
255 10 0.1%
 
298 10 0.1%
 
682 9 0.1%
 
346 8 0.1%
 
182 6 0.1%
 
1085 6 0.1%
 
2229 6 0.1%
 
8035 5 0.1%
 
6 5 0.1%
 
Other values (7859) 9192 96.0%
 

Minimum 5 values

Value Count Frequency (%)  
0 321 3.4%
 
1 5 0.1%
 
2 2 0.0%
 
3 1 0.0%
 
4 2 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
407794 1 0.0%
 
508961 1 0.0%
 
602519 1 0.0%
 
952013 1 0.0%
 
1207359 1 0.0%
 

revol.util
Numeric

Distinct count 1035
Unique (%) 10.8%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 46.799
Minimum 0
Maximum 119
Zeros (%) 3.1%

Quantile statistics

Minimum 0
5-th percentile 1.1
Q1 22.6
Median 46.3
Q3 70.9
95-th percentile 94
Maximum 119
Range 119
Interquartile range 48.3

Descriptive statistics

Standard deviation 29.014
Coef of variation 0.61998
Kurtosis -1.1165
Mean 46.799
MAD 24.835
Skewness 0.059985
Sum 448240
Variance 841.84
Memory size 74.9 KiB
Value Count Frequency (%)  
0.0 297 3.1%
 
0.5 26 0.3%
 
0.3 22 0.2%
 
73.7 22 0.2%
 
47.8 22 0.2%
 
3.3 21 0.2%
 
0.1 21 0.2%
 
0.2 20 0.2%
 
0.7 20 0.2%
 
1.0 20 0.2%
 
Other values (1025) 9087 94.9%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 297 3.1%
 
0.04 1 0.0%
 
0.1 21 0.2%
 
0.2 20 0.2%
 
0.3 22 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
106.2 1 0.0%
 
106.4 1 0.0%
 
106.5 1 0.0%
 
108.8 1 0.0%
 
119.0 1 0.0%
 

Correlations

Sample

credit.policy purpose int.rate installment log.annual.inc dti fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec not.fully.paid
0 1 debt_consolidation 0.1189 829.10 11.350407 19.48 737 5639.958333 28854 52.1 0 0 0 0
1 1 credit_card 0.1071 228.22 11.082143 14.29 707 2760.000000 33623 76.7 0 0 0 0
2 1 debt_consolidation 0.1357 366.86 10.373491 11.63 682 4710.000000 3511 25.6 1 0 0 0
3 1 debt_consolidation 0.1008 162.34 11.350407 8.10 712 2699.958333 33667 73.2 1 0 0 0
4 1 credit_card 0.1426 102.92 11.299732 14.97 667 4066.000000 4740 39.5 0 1 0 0

Some quick notes from the descriptives:

  1. We can clearly see that a number of these variables are incredibly skewed.
  2. The correlation matrix provides us with a few interesting (but understandable) findings:
    • Pretty strong negative correlation between interest rate and fico score
    • Pretty strong positive correlation between credit policy and fico score

While this is great for a brief overview, we should probably do a bit more digging. For that, let's use a package built on top of matplotlib called seaborn.

Let's look at the relationship between fico score and credit policy a bit closer using graphs.

In [21]:
import seaborn as sns

plt.figure(figsize=(10,6))
df[df['credit.policy']==1]['fico'].hist(alpha=0.5,color='blue',
                                              bins=30,label='Credit.Policy=1')
df[df['credit.policy']==0]['fico'].hist(alpha=0.5,color='red',
                                              bins=30,label='Credit.Policy=0')
plt.legend()
plt.xlabel('FICO');

We can clearly see that far more people in this dataset have a credit policy of 1. Even so, fico scores skew lower for those with a credit policy of 0 and higher for those with a credit policy of 1.

We had one categorical variable that we didn't get much data on. Let's use seaborn to create a quick categorical plot of it and see if anything stands out.

In [23]:
plt.figure(figsize=(11,7))
sns.countplot(x='purpose',hue='not.fully.paid',data=df,palette='Set2');

It doesn't look like anything really stands out to me; the bars look pretty proportional between paid and unpaid loans across purposes.
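We can back up that eyeball check with numbers: pd.crosstab with normalize='index' gives, for each purpose, the share of loans that were or weren't fully paid. A minimal sketch with stand-in data (on the real data you'd pass df['purpose'] and df['not.fully.paid']):

```python
import pandas as pd

# Stand-in data with two loan purposes
df = pd.DataFrame({'purpose': ['credit_card', 'credit_card', 'credit_card',
                               'small_business', 'small_business'],
                   'not.fully.paid': [0, 0, 1, 0, 1]})

# normalize='index' converts each row (purpose) to proportions,
# so similar rows mean similar repayment rates across purposes
shares = pd.crosstab(df['purpose'], df['not.fully.paid'], normalize='index')
print(round(shares, 2))
```

If the default rates really are proportional across purposes, the rows of this table will look nearly identical.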

This was a quick overview of some of the features pandas, pandas profiling, matplotlib, and seaborn have to offer for exploring data. I plan to do a fuller EDA in a future article, but I wanted to show you what pandas has to offer and how it provides many capabilities similar to SPSS, plus a lot more.

If you want to learn exactly how to do everything I did here in more detail and at a slower pace, I'd highly recommend the Python data science bootcamp by Jose Portilla that I mentioned here. My plan for this site is to walk you through cool features I think add value to the social scientist's tool belt, not necessarily to teach you pandas or the plotting libraries from scratch.