Examining Data in Python via Pandas¶
One of the most common uses of SPSS is to quickly look at the descriptives of multiple variables at once. Below is a simple example output from SPSS.
Does this look familiar?¶
Aside from correlations, this is one of the most common syntax commands I've traditionally used within SPSS. This article will walk you through how to produce extremely similar output in Python. I'll also show you how to use other common methods like crosstabs and correlation matrices. Finally, I'll introduce you to a great package that allows you to do even more investigation of the data with a few additional lines of code.
Let's explore some data!
First, let's import pandas, numpy, and matplotlib (with inline plotting enabled):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Quick note about Pandas¶
Pandas is a package started by Wes McKinney, who was working as a quant at a hedge fund and wanted a better way to work with data than Excel macros had to offer.
The Dataset¶
This is a loan dataset, so it isn't perfectly suited to the social sciences (especially I/O), but you can easily see how paying or not paying a loan is a human behavior, similar to many of the variables we study as social scientists. Job performance, employee engagement, job satisfaction, etc. all have predictors, and the fact that this data has human behavior as the outcome (whether or not the loan was fully paid) means it's very similar to the problems we'd be likely to encounter.
This is actually LendingClub.com data, which you can read more about here. Here is what each of the variables/columns represents:
- credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
- purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
- int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
- installment: The monthly installments owed by the borrower if the loan is funded.
- log.annual.inc: The natural log of the self-reported annual income of the borrower.
- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
- fico: The FICO credit score of the borrower.
- days.with.cr.line: The number of days the borrower has had a credit line.
- revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
- revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
- inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
- pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
- not.fully.paid: 1 if the loan was not fully paid back, and 0 if it was (this is the outcome variable).
df = pd.read_csv('data/loan_data.csv')  # load the data
columns = list(df.columns)  # a list of the column names for easy reference
Let's write a few quick lines of code to figure out which data type each variable is, because we can't get descriptives on object-type variables, which typically represent strings.
data_type = []
for i in df.columns:
    data_type.append(df[i].dtype)

pd.DataFrame(list(zip(df.columns, data_type)), columns=['variables','data_types'])
| | variables | data_types |
|---|---|---|
0 | credit.policy | int64 |
1 | purpose | object |
2 | int.rate | float64 |
3 | installment | float64 |
4 | log.annual.inc | float64 |
5 | dti | float64 |
6 | fico | int64 |
7 | days.with.cr.line | float64 |
8 | revol.bal | int64 |
9 | revol.util | float64 |
10 | inq.last.6mths | int64 |
11 | delinq.2yrs | int64 |
12 | pub.rec | int64 |
13 | not.fully.paid | int64 |
So we can quickly see that every variable aside from purpose should give us usable descriptives.
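As a quick aside, pandas can give you the same information in one line through the dtypes attribute; here is a minimal sketch (the reset_index/rename step is only there to mimic the table above):

df.dtypes  # a Series of each column's dtype, indexed by column name

# optionally, reshape it into a DataFrame that mirrors the table above
df.dtypes.reset_index().rename(columns={'index': 'variables', 0: 'data_types'})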
Descriptive Statistics¶
How do we get descriptives in Python? We can do this by using the .describe() method.
Note: I will use the round function to make the data look nicer.
round(df.describe(),2)
| | credit.policy | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 9578.0 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 |
mean | 0.8 | 0.12 | 319.09 | 10.93 | 12.61 | 710.85 | 4560.77 | 16913.96 | 46.80 | 1.58 | 0.16 | 0.06 | 0.16 |
std | 0.4 | 0.03 | 207.07 | 0.61 | 6.88 | 37.97 | 2496.93 | 33756.19 | 29.01 | 2.20 | 0.55 | 0.26 | 0.37 |
min | 0.0 | 0.06 | 15.67 | 7.55 | 0.00 | 612.00 | 178.96 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
25% | 1.0 | 0.10 | 163.77 | 10.56 | 7.21 | 682.00 | 2820.00 | 3187.00 | 22.60 | 0.00 | 0.00 | 0.00 | 0.00 |
50% | 1.0 | 0.12 | 268.95 | 10.93 | 12.66 | 707.00 | 4139.96 | 8596.00 | 46.30 | 1.00 | 0.00 | 0.00 | 0.00 |
75% | 1.0 | 0.14 | 432.76 | 11.29 | 17.95 | 737.00 | 5730.00 | 18249.50 | 70.90 | 2.00 | 0.00 | 0.00 | 0.00 |
max | 1.0 | 0.22 | 940.14 | 14.53 | 29.96 | 827.00 | 17639.96 | 1207359.00 | 119.00 | 33.00 | 13.00 | 5.00 | 1.00 |
SPSS won't let you request descriptives for a column whose data type isn't numeric (int or float). In Python you can include such a column, but nothing will be shown for it. Given what we saw above, we should expect descriptive statistics for every column except purpose, and that's exactly what we get. This also gives us more than the base DESCRIPTIVES command in SPSS, since we get the quartiles as well.
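If you do want a quick summary of purpose, describe can also be pointed at the non-numeric columns; a small sketch:

df.describe(include=['object'])  # count, unique, top, and freq for the categorical column(s)
df['purpose'].value_counts()     # or just count each loan purpose directly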
If we wanted to look at specific columns we would just subset them like this:
round(df[['log.annual.inc','fico']].describe(),2)
| | log.annual.inc | fico |
|---|---|---|
count | 9578.00 | 9578.00 |
mean | 10.93 | 710.85 |
std | 0.61 | 37.97 |
min | 7.55 | 612.00 |
25% | 10.56 | 682.00 |
50% | 10.93 | 707.00 |
75% | 11.29 | 737.00 |
max | 14.53 | 827.00 |
But I hear you saying: that's great and all, but I've used SPSS for so long that I like the stacked output better than the wide output (variables as rows instead of columns)¶
Well... the great thing about object-oriented programming is that the table is an object that can be manipulated.
So... let's think about an easy way to do that. What if we just transposed the object?
round(df.describe().transpose(),2)
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
credit.policy | 9578.0 | 0.80 | 0.40 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 |
int.rate | 9578.0 | 0.12 | 0.03 | 0.06 | 0.10 | 0.12 | 0.14 | 0.22 |
installment | 9578.0 | 319.09 | 207.07 | 15.67 | 163.77 | 268.95 | 432.76 | 940.14 |
log.annual.inc | 9578.0 | 10.93 | 0.61 | 7.55 | 10.56 | 10.93 | 11.29 | 14.53 |
dti | 9578.0 | 12.61 | 6.88 | 0.00 | 7.21 | 12.66 | 17.95 | 29.96 |
fico | 9578.0 | 710.85 | 37.97 | 612.00 | 682.00 | 707.00 | 737.00 | 827.00 |
days.with.cr.line | 9578.0 | 4560.77 | 2496.93 | 178.96 | 2820.00 | 4139.96 | 5730.00 | 17639.96 |
revol.bal | 9578.0 | 16913.96 | 33756.19 | 0.00 | 3187.00 | 8596.00 | 18249.50 | 1207359.00 |
revol.util | 9578.0 | 46.80 | 29.01 | 0.00 | 22.60 | 46.30 | 70.90 | 119.00 |
inq.last.6mths | 9578.0 | 1.58 | 2.20 | 0.00 | 0.00 | 1.00 | 2.00 | 33.00 |
delinq.2yrs | 9578.0 | 0.16 | 0.55 | 0.00 | 0.00 | 0.00 | 0.00 | 13.00 |
pub.rec | 9578.0 | 0.06 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 5.00 |
not.fully.paid | 9578.0 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
fico_binned | 9578.0 | 2.35 | 0.75 | 1.00 | 2.00 | 2.00 | 3.00 | 5.00 |
Now the layout looks just like the SPSS output above!
Crosstabs¶
Another really common output I've used in SPSS is the crosstabs view. It looks a little like this:
Just like descriptives, pandas has you covered for crosstabs as well. Most of these variables are continuous by nature, so in order to make a crosstab viewable, let's quickly bin one that might be interesting to look at. We'll use np.linspace and np.digitize:
- np.linspace looks at the min and the max of the fico score and creates 5 evenly spaced cut points spanning that range.
- Then np.digitize returns the index of the bin each fico score falls into.
- Finally, I build a crosstab so we can examine the number of individuals who either paid back their loan (0) or did not pay back their loan (1).
bins = np.linspace(df['fico'].min(), df['fico'].max(), 5)
df['fico_binned'] = np.digitize(df['fico'],bins)
pd.crosstab(df['not.fully.paid'],df['fico_binned'],margins=True)
not.fully.paid \ fico_binned | 1 | 2 | 3 | 4 | 5 | All
---|---|---|---|---|---|---
0 | 659 | 4138 | 2625 | 622 | 1 | 8045 |
1 | 244 | 915 | 336 | 38 | 0 | 1533 |
All | 903 | 5053 | 2961 | 660 | 1 | 9578 |
You can see a trend where the proportion of borrowers who did not pay back their loans gets smaller and smaller as we move from the lowest bins to the highest bins, which makes sense: people with lower fico scores typically have them for a reason, perhaps because they are less fiscally responsible with credit.
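To put numbers on that trend, crosstab can return column proportions instead of raw counts via its normalize argument; a quick sketch:

# share of loans in each fico bin that were or were not fully paid
round(pd.crosstab(df['not.fully.paid'], df['fico_binned'], normalize='columns'), 3)

Each column now sums to 1, so the row labeled 1 is the proportion of loans in that bin that were not fully paid.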
Correlation Matrix¶
As an I/O psychologist, this SPSS output is probably our bread and butter: examining the correlations among your variables. Lucky for us, that's also extremely simple to do in pandas using the pandas.DataFrame.corr method.
round(df.corr(),3)
| | credit.policy | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid | fico_binned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
credit.policy | 1.000 | -0.294 | 0.059 | 0.035 | -0.091 | 0.348 | 0.099 | -0.188 | -0.104 | -0.536 | -0.076 | -0.054 | -0.158 | 0.345 |
int.rate | -0.294 | 1.000 | 0.276 | 0.056 | 0.220 | -0.715 | -0.124 | 0.093 | 0.465 | 0.203 | 0.156 | 0.098 | 0.160 | -0.658 |
installment | 0.059 | 0.276 | 1.000 | 0.448 | 0.050 | 0.086 | 0.183 | 0.234 | 0.081 | -0.010 | -0.004 | -0.033 | 0.050 | 0.077 |
log.annual.inc | 0.035 | 0.056 | 0.448 | 1.000 | -0.054 | 0.115 | 0.337 | 0.372 | 0.055 | 0.029 | 0.029 | 0.017 | -0.033 | 0.108 |
dti | -0.091 | 0.220 | 0.050 | -0.054 | 1.000 | -0.241 | 0.060 | 0.189 | 0.337 | 0.029 | -0.022 | 0.006 | 0.037 | -0.227 |
fico | 0.348 | -0.715 | 0.086 | 0.115 | -0.241 | 1.000 | 0.264 | -0.016 | -0.541 | -0.185 | -0.216 | -0.148 | -0.150 | 0.921 |
days.with.cr.line | 0.099 | -0.124 | 0.183 | 0.337 | 0.060 | 0.264 | 1.000 | 0.229 | -0.024 | -0.042 | 0.081 | 0.072 | -0.029 | 0.244 |
revol.bal | -0.188 | 0.093 | 0.234 | 0.372 | 0.189 | -0.016 | 0.229 | 1.000 | 0.204 | 0.022 | -0.033 | -0.031 | 0.054 | -0.013 |
revol.util | -0.104 | 0.465 | 0.081 | 0.055 | 0.337 | -0.541 | -0.024 | 0.204 | 1.000 | -0.014 | -0.043 | 0.067 | 0.082 | -0.496 |
inq.last.6mths | -0.536 | 0.203 | -0.010 | 0.029 | 0.029 | -0.185 | -0.042 | 0.022 | -0.014 | 1.000 | 0.021 | 0.073 | 0.149 | -0.175 |
delinq.2yrs | -0.076 | 0.156 | -0.004 | 0.029 | -0.022 | -0.216 | 0.081 | -0.033 | -0.043 | 0.021 | 1.000 | 0.009 | 0.009 | -0.200 |
pub.rec | -0.054 | 0.098 | -0.033 | 0.017 | 0.006 | -0.148 | 0.072 | -0.031 | 0.067 | 0.073 | 0.009 | 1.000 | 0.049 | -0.136 |
not.fully.paid | -0.158 | 0.160 | 0.050 | -0.033 | 0.037 | -0.150 | -0.029 | 0.054 | 0.082 | 0.149 | 0.009 | 0.049 | 1.000 | -0.143 |
fico_binned | 0.345 | -0.658 | 0.077 | 0.108 | -0.227 | 0.921 | 0.244 | -0.013 | -0.496 | -0.175 | -0.200 | -0.136 | -0.143 | 1.000 |
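If you mainly care about how each variable relates to the outcome, you can also pull out and sort a single column of that matrix; a minimal sketch:

# correlation of every numeric variable with the outcome, sorted
round(df.corr()['not.fully.paid'].sort_values(ascending=False), 3)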
We could also turn it into a nicer-looking heatmap by importing the seaborn package.
There are many cmaps available (here are a few), but you can also create your own using matplotlib's linear segmented colormap functionality.
import seaborn as sns
plt.figure(figsize=(12,10))
sns.heatmap(df.corr(),cmap='viridis',annot=True);
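And if you'd rather build your own colormap, as mentioned above, matplotlib's LinearSegmentedColormap can construct one from a list of colors; a small sketch (the blue-white-red choice is just an example):

from matplotlib.colors import LinearSegmentedColormap

# build a simple diverging colormap and reuse it in the same heatmap call
my_cmap = LinearSegmentedColormap.from_list('blue_white_red', ['blue', 'white', 'red'])
plt.figure(figsize=(12,10))
sns.heatmap(df.corr(), cmap=my_cmap, annot=True);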
Of course, this alone would be great even if it were all pandas had to offer, but there's plenty more.
Other important methods I've used over the past few years are:¶
- Groupby
- Pivot Tables
- Along with many more (a quick sketch of the first two follows below)
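To give you a feel for the first two, here is a minimal sketch on this dataset (the particular groupings are just examples, not part of the analysis above):

# groupby: average interest rate for each loan purpose
round(df.groupby('purpose')['int.rate'].mean(), 3)

# pivot table: mean fico score by purpose, split by repayment outcome
round(pd.pivot_table(df, values='fico', index='purpose', columns='not.fully.paid', aggfunc='mean'), 1)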
There's also a great external package I want to talk about that allows you to quickly look at most of your data using the most common EDA (Exploratory Data Analysis) methods.
Pandas Profiling¶
Note: in order to use pandas profiling (which we will explore next), you will need to install the package. You can do so by running the following command in your terminal/Anaconda prompt:
conda install -c conda-forge pandas-profiling
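If you use pip rather than conda, the equivalent install should (as far as I know) just be:
pip install pandas-profiling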
Let's see what additional features pandas profiling has to offer!
As you can see below the pandas profiling package offers just about everything you could ever want to quickly look through your data.
Here's a quick list:
- Number of variables
- Number of observations
- Percent of missing data
- Size of the data in memory
- data types (numerical, categorical, boolean, etc.)
- A flag on features/variables with a large # of zero values
- Common statistics (like seen in the .describe() pandas method)
- Histograms for each feature/variable
- Common values
- Extreme values
- Pearson correlation
- Spearman correlation
As far as understanding your data goes, this is about as good as it gets.
You are going to have to scroll down a little way given the amount of output pandas profiling produces from just one line of code, but please do, so we can see if anything jumps out at us.
import pandas_profiling
pandas_profiling.ProfileReport(df)
Overview
Dataset info
Number of variables | 14 |
---|---|
Number of observations | 9578 |
Total Missing (%) | 0.0% |
Total size in memory | 1.0 MiB |
Average record size in memory | 112.0 B |
Variables types
Numeric | 11 |
---|---|
Categorical | 1 |
Boolean | 2 |
Date | 0 |
Text (Unique) | 0 |
Rejected | 0 |
Unsupported | 0 |
Warnings
- delinq.2yrs has 8458 / 88.3% zeros
- inq.last.6mths has 3637 / 38.0% zeros
- pub.rec has 9019 / 94.2% zeros
- revol.bal has 321 / 3.4% zeros
- revol.util has 297 / 3.1% zeros
Variables
credit.policy
Boolean
Distinct count | 2 |
---|---|
Unique (%) | 0.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Mean | 0.80497 |

Value | Count | Frequency (%)
---|---|---
1 | 7710 | 80.5%
0 | 1868 | 19.5%
days.with.cr.line
Numeric
Distinct count | 2687 |
---|---|
Unique (%) | 28.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 4560.8 |
---|---|
Minimum | 178.96 |
Maximum | 17640 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 178.96 |
---|---|
5-th percentile | 1320 |
Q1 | 2820 |
Median | 4140 |
Q3 | 5730 |
95-th percentile | 9330 |
Maximum | 17640 |
Range | 17461 |
Interquartile range | 2910 |
Descriptive statistics
Standard deviation | 2496.9 |
---|---|
Coef of variation | 0.54748 |
Kurtosis | 1.9379 |
Mean | 4560.8 |
MAD | 1897.7 |
Skewness | 1.1557 |
Sum | 43683000 |
Variance | 6234700 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
3660.0 | 50 | 0.5%
3630.0 | 48 | 0.5%
3990.0 | 46 | 0.5%
4410.0 | 44 | 0.5%
3600.0 | 41 | 0.4%
2550.0 | 38 | 0.4%
4080.0 | 38 | 0.4%
3690.0 | 37 | 0.4%
1800.0 | 37 | 0.4%
4020.0 | 35 | 0.4%
Other values (2677) | 9164 | 95.7%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
178.95833330000002 | 1 | 0.0%
180.04166669999998 | 3 | 0.0%
181.0 | 1 | 0.0%
183.04166669999998 | 1 | 0.0%
209.04166669999998 | 1 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
16260.0 | 1 | 0.0%
16350.0 | 1 | 0.0%
16652.0 | 1 | 0.0%
17616.0 | 1 | 0.0%
17639.95833 | 1 | 0.0%
delinq.2yrs
Numeric
Distinct count | 11 |
---|---|
Unique (%) | 0.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 0.16371 |
---|---|
Minimum | 0 |
Maximum | 13 |
Zeros (%) | 88.3% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 0 |
95-th percentile | 1 |
Maximum | 13 |
Range | 13 |
Interquartile range | 0 |
Descriptive statistics
Standard deviation | 0.54621 |
---|---|
Coef of variation | 3.3365 |
Kurtosis | 71.433 |
Mean | 0.16371 |
MAD | 0.28913 |
Skewness | 6.0618 |
Sum | 1568 |
Variance | 0.29835 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0 | 8458 | 88.3%
1 | 832 | 8.7%
2 | 192 | 2.0%
3 | 65 | 0.7%
4 | 19 | 0.2%
5 | 6 | 0.1%
6 | 2 | 0.0%
7 | 1 | 0.0%
13 | 1 | 0.0%
11 | 1 | 0.0%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0 | 8458 | 88.3%
1 | 832 | 8.7%
2 | 192 | 2.0%
3 | 65 | 0.7%
4 | 19 | 0.2%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
6 | 2 | 0.0%
7 | 1 | 0.0%
8 | 1 | 0.0%
11 | 1 | 0.0%
13 | 1 | 0.0%
dti
Numeric
Distinct count | 2529 |
---|---|
Unique (%) | 26.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 12.607 |
---|---|
Minimum | 0 |
Maximum | 29.96 |
Zeros (%) | 0.9% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 1.27 |
Q1 | 7.2125 |
Median | 12.665 |
Q3 | 17.95 |
95-th percentile | 23.65 |
Maximum | 29.96 |
Range | 29.96 |
Interquartile range | 10.738 |
Descriptive statistics
Standard deviation | 6.884 |
---|---|
Coef of variation | 0.54606 |
Kurtosis | -0.90036 |
Mean | 12.607 |
MAD | 5.796 |
Skewness | 0.023941 |
Sum | 120750 |
Variance | 47.389 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0.0 | 89 | 0.9%
10.0 | 19 | 0.2%
0.6 | 16 | 0.2%
13.16 | 13 | 0.1%
19.2 | 13 | 0.1%
15.1 | 13 | 0.1%
12.0 | 13 | 0.1%
6.0 | 13 | 0.1%
13.28 | 12 | 0.1%
10.8 | 12 | 0.1%
Other values (2519) | 9365 | 97.8%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0.0 | 89 | 0.9%
0.01 | 1 | 0.0%
0.02 | 1 | 0.0%
0.03 | 1 | 0.0%
0.04 | 2 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
29.72 | 1 | 0.0%
29.74 | 1 | 0.0%
29.9 | 1 | 0.0%
29.95 | 1 | 0.0%
29.96 | 1 | 0.0%
fico
Numeric
Distinct count | 44 |
---|---|
Unique (%) | 0.5% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 710.85 |
---|---|
Minimum | 612 |
Maximum | 827 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 612 |
---|---|
5-th percentile | 657 |
Q1 | 682 |
Median | 707 |
Q3 | 737 |
95-th percentile | 782 |
Maximum | 827 |
Range | 215 |
Interquartile range | 55 |
Descriptive statistics
Standard deviation | 37.971 |
---|---|
Coef of variation | 0.053416 |
Kurtosis | -0.42231 |
Mean | 710.85 |
MAD | 31.264 |
Skewness | 0.47126 |
Sum | 6808486 |
Variance | 1441.8 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
687 | 548 | 5.7%
682 | 536 | 5.6%
692 | 498 | 5.2%
697 | 476 | 5.0%
702 | 472 | 4.9%
707 | 444 | 4.6%
667 | 438 | 4.6%
677 | 427 | 4.5%
717 | 424 | 4.4%
662 | 414 | 4.3%
Other values (34) | 4901 | 51.2%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
612 | 2 | 0.0%
617 | 1 | 0.0%
622 | 1 | 0.0%
627 | 2 | 0.0%
632 | 6 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
807 | 45 | 0.5%
812 | 33 | 0.3%
817 | 6 | 0.1%
822 | 5 | 0.1%
827 | 1 | 0.0%
inq.last.6mths
Numeric
Distinct count | 28 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 1.5775 |
---|---|
Minimum | 0 |
Maximum | 33 |
Zeros (%) | 38.0% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 1 |
Q3 | 2 |
95-th percentile | 5 |
Maximum | 33 |
Range | 33 |
Interquartile range | 2 |
Descriptive statistics
Standard deviation | 2.2002 |
---|---|
Coef of variation | 1.3948 |
Kurtosis | 26.288 |
Mean | 1.5775 |
MAD | 1.4949 |
Skewness | 3.5842 |
Sum | 15109 |
Variance | 4.8411 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0 | 3637 | 38.0%
1 | 2462 | 25.7%
2 | 1384 | 14.4%
3 | 864 | 9.0%
4 | 475 | 5.0%
5 | 278 | 2.9%
6 | 165 | 1.7%
7 | 100 | 1.0%
8 | 72 | 0.8%
9 | 47 | 0.5%
Other values (18) | 94 | 1.0%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0 | 3637 | 38.0%
1 | 2462 | 25.7%
2 | 1384 | 14.4%
3 | 864 | 9.0%
4 | 475 | 5.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
27 | 1 | 0.0%
28 | 1 | 0.0%
31 | 1 | 0.0%
32 | 1 | 0.0%
33 | 1 | 0.0%
installment
Numeric
Distinct count | 4788 |
---|---|
Unique (%) | 50.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 319.09 |
---|---|
Minimum | 15.67 |
Maximum | 940.14 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 15.67 |
---|---|
5-th percentile | 65.559 |
Q1 | 163.77 |
Median | 268.95 |
Q3 | 432.76 |
95-th percentile | 756.27 |
Maximum | 940.14 |
Range | 924.47 |
Interquartile range | 268.99 |
Descriptive statistics
Standard deviation | 207.07 |
---|---|
Coef of variation | 0.64894 |
Kurtosis | 0.13791 |
Mean | 319.09 |
MAD | 165.63 |
Skewness | 0.91252 |
Sum | 3056200 |
Variance | 42879 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
317.72 | 41 | 0.4%
316.11 | 34 | 0.4%
319.47 | 29 | 0.3%
381.26 | 27 | 0.3%
662.68 | 27 | 0.3%
156.1 | 24 | 0.3%
320.95 | 24 | 0.3%
669.33 | 23 | 0.2%
334.67 | 23 | 0.2%
188.02 | 23 | 0.2%
Other values (4778) | 9303 | 97.1%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
15.67 | 1 | 0.0%
15.69 | 1 | 0.0%
15.75 | 1 | 0.0%
15.76 | 1 | 0.0%
15.91 | 1 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
916.95 | 2 | 0.0%
918.02 | 2 | 0.0%
922.42 | 1 | 0.0%
926.83 | 2 | 0.0%
940.14 | 1 | 0.0%
int.rate
Numeric
Distinct count | 249 |
---|---|
Unique (%) | 2.6% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 0.12264 |
---|---|
Minimum | 0.06 |
Maximum | 0.2164 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 0.06 |
---|---|
5-th percentile | 0.0774 |
Q1 | 0.1039 |
Median | 0.1221 |
Q3 | 0.1407 |
95-th percentile | 0.167 |
Maximum | 0.2164 |
Range | 0.1564 |
Interquartile range | 0.0368 |
Descriptive statistics
Standard deviation | 0.026847 |
---|---|
Coef of variation | 0.21891 |
Kurtosis | -0.22432 |
Mean | 0.12264 |
MAD | 0.021441 |
Skewness | 0.16442 |
Sum | 1174.6 |
Variance | 0.00072076 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0.1253 | 354 | 3.7%
0.0894 | 299 | 3.1%
0.1183 | 243 | 2.5%
0.1218 | 215 | 2.2%
0.0963 | 210 | 2.2%
0.1114 | 206 | 2.2%
0.08 | 198 | 2.1%
0.1287 | 197 | 2.1%
0.1148 | 193 | 2.0%
0.0932 | 187 | 2.0%
Other values (239) | 7276 | 76.0%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0.06 | 8 | 0.1%
0.0639 | 4 | 0.0%
0.0676 | 9 | 0.1%
0.0705 | 23 | 0.2%
0.0712 | 9 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
0.2052 | 4 | 0.0%
0.2086 | 6 | 0.1%
0.209 | 2 | 0.0%
0.2121 | 7 | 0.1%
0.2164 | 2 | 0.0%
log.annual.inc
Numeric
Distinct count | 1987 |
---|---|
Unique (%) | 20.7% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 10.932 |
---|---|
Minimum | 7.5475 |
Maximum | 14.528 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 7.5475 |
---|---|
5-th percentile | 9.9179 |
Q1 | 10.558 |
Median | 10.929 |
Q3 | 11.291 |
95-th percentile | 11.918 |
Maximum | 14.528 |
Range | 6.9809 |
Interquartile range | 0.73288 |
Descriptive statistics
Standard deviation | 0.61481 |
---|---|
Coef of variation | 0.056239 |
Kurtosis | 1.609 |
Mean | 10.932 |
MAD | 0.46903 |
Skewness | 0.028668 |
Sum | 104710 |
Variance | 0.37799 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
11.00209984 | 308 | 3.2%
10.81977828 | 248 | 2.6%
10.59663473 | 224 | 2.3%
10.30895266 | 224 | 2.3%
10.71441777 | 221 | 2.3%
11.22524339 | 196 | 2.0%
11.15625052 | 165 | 1.7%
10.77895629 | 149 | 1.6%
10.91508846 | 147 | 1.5%
11.08214255 | 146 | 1.5%
Other values (1977) | 7550 | 78.8%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
7.547501682999999 | 1 | 0.0%
7.60090246 | 1 | 0.0%
8.101677747 | 1 | 0.0%
8.160518247 | 1 | 0.0%
8.188689124 | 1 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
13.71015004 | 2 | 0.0%
13.99783211 | 1 | 0.0%
14.12446477 | 1 | 0.0%
14.18015367 | 1 | 0.0%
14.52835448 | 1 | 0.0%
not.fully.paid
Boolean
Distinct count | 2 |
---|---|
Unique (%) | 0.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Mean | 0.16005 |

Value | Count | Frequency (%)
---|---|---
0 | 8045 | 84.0%
1 | 1533 | 16.0%
pub.rec
Numeric
Distinct count | 6 |
---|---|
Unique (%) | 0.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 0.062122 |
---|---|
Minimum | 0 |
Maximum | 5 |
Zeros (%) | 94.2% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 0 |
95-th percentile | 1 |
Maximum | 5 |
Range | 5 |
Interquartile range | 0 |
Descriptive statistics
Standard deviation | 0.26213 |
---|---|
Coef of variation | 4.2196 |
Kurtosis | 38.781 |
Mean | 0.062122 |
MAD | 0.11699 |
Skewness | 5.1264 |
Sum | 595 |
Variance | 0.06871 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0 | 9019 | 94.2%
1 | 533 | 5.6%
2 | 19 | 0.2%
3 | 5 | 0.1%
5 | 1 | 0.0%
4 | 1 | 0.0%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0 | 9019 | 94.2%
1 | 533 | 5.6%
2 | 19 | 0.2%
3 | 5 | 0.1%
4 | 1 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
1 | 533 | 5.6%
2 | 19 | 0.2%
3 | 5 | 0.1%
4 | 1 | 0.0%
5 | 1 | 0.0%
purpose
Categorical
Distinct count | 7 |
---|---|
Unique (%) | 0.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |

Value | Count | Frequency (%)
---|---|---
debt_consolidation | 3957 | 41.3%
all_other | 2331 | 24.3%
credit_card | 1262 | 13.2%
home_improvement | 629 | 6.6%
small_business | 619 | 6.5%
major_purchase | 437 | 4.6%
educational | 343 | 3.6%
revol.bal
Numeric
Distinct count | 7869 |
---|---|
Unique (%) | 82.2% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 16914 |
---|---|
Minimum | 0 |
Maximum | 1207359 |
Zeros (%) | 3.4% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 127.7 |
Q1 | 3187 |
Median | 8596 |
Q3 | 18250 |
95-th percentile | 57654 |
Maximum | 1207359 |
Range | 1207359 |
Interquartile range | 15062 |
Descriptive statistics
Standard deviation | 33756 |
---|---|
Coef of variation | 1.9958 |
Kurtosis | 259.66 |
Mean | 16914 |
MAD | 15560 |
Skewness | 11.161 |
Sum | 162001946 |
Variance | 1139500000 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0 | 321 | 3.4%
255 | 10 | 0.1%
298 | 10 | 0.1%
682 | 9 | 0.1%
346 | 8 | 0.1%
182 | 6 | 0.1%
1085 | 6 | 0.1%
2229 | 6 | 0.1%
8035 | 5 | 0.1%
6 | 5 | 0.1%
Other values (7859) | 9192 | 96.0%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0 | 321 | 3.4%
1 | 5 | 0.1%
2 | 2 | 0.0%
3 | 1 | 0.0%
4 | 2 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
407794 | 1 | 0.0%
508961 | 1 | 0.0%
602519 | 1 | 0.0%
952013 | 1 | 0.0%
1207359 | 1 | 0.0%
revol.util
Numeric
Distinct count | 1035 |
---|---|
Unique (%) | 10.8% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 46.799 |
---|---|
Minimum | 0 |
Maximum | 119 |
Zeros (%) | 3.1% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 1.1 |
Q1 | 22.6 |
Median | 46.3 |
Q3 | 70.9 |
95-th percentile | 94 |
Maximum | 119 |
Range | 119 |
Interquartile range | 48.3 |
Descriptive statistics
Standard deviation | 29.014 |
---|---|
Coef of variation | 0.61998 |
Kurtosis | -1.1165 |
Mean | 46.799 |
MAD | 24.835 |
Skewness | 0.059985 |
Sum | 448240 |
Variance | 841.84 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0.0 | 297 | 3.1%
0.5 | 26 | 0.3%
0.3 | 22 | 0.2%
73.7 | 22 | 0.2%
47.8 | 22 | 0.2%
3.3 | 21 | 0.2%
0.1 | 21 | 0.2%
0.2 | 20 | 0.2%
0.7 | 20 | 0.2%
1.0 | 20 | 0.2%
Other values (1025) | 9087 | 94.9%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0.0 | 297 | 3.1%
0.04 | 1 | 0.0%
0.1 | 21 | 0.2%
0.2 | 20 | 0.2%
0.3 | 22 | 0.2%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
106.2 | 1 | 0.0%
106.4 | 1 | 0.0%
106.5 | 1 | 0.0%
108.8 | 1 | 0.0%
119.0 | 1 | 0.0%
Correlations

(The Pearson and Spearman correlation heatmaps appear here as images in the rendered report.)

Sample

| | credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | debt_consolidation | 0.1189 | 829.10 | 11.350407 | 19.48 | 737 | 5639.958333 | 28854 | 52.1 | 0 | 0 | 0 | 0 |
1 | 1 | credit_card | 0.1071 | 228.22 | 11.082143 | 14.29 | 707 | 2760.000000 | 33623 | 76.7 | 0 | 0 | 0 | 0 |
2 | 1 | debt_consolidation | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.000000 | 3511 | 25.6 | 1 | 0 | 0 | 0 |
3 | 1 | debt_consolidation | 0.1008 | 162.34 | 11.350407 | 8.10 | 712 | 2699.958333 | 33667 | 73.2 | 1 | 0 | 0 | 0 |
4 | 1 | credit_card | 0.1426 | 102.92 | 11.299732 | 14.97 | 667 | 4066.000000 | 4740 | 39.5 | 0 | 1 | 0 | 0 |
Some quick notes from the descriptives:
- We can clearly see that a number of these variables are incredibly skewed (a quick way to confirm this is sketched below).
- The correlation matrix provides us with a few interesting (but understandable) findings:
  - A strong negative correlation between interest rate and fico score (-0.72)
  - A moderate positive correlation between credit policy and fico score (0.35)
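If you want to double-check the skew without scrolling back through the report, pandas can compute skewness for every numeric column directly; a minimal sketch:

round(df.skew(numeric_only=True), 2)  # skewness of each numeric column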
While this is great for a brief overview, we should probably do a bit more digging. For that, let's use a package built on top of matplotlib called seaborn.
Let's look at the relationship between fico score and credit policy a bit closer using graphs.
import seaborn as sns
plt.figure(figsize=(10,6))
df[df['credit.policy']==1]['fico'].hist(alpha=0.5, color='blue',
                                        bins=30, label='Credit.Policy=1')
df[df['credit.policy']==0]['fico'].hist(alpha=0.5, color='red',
                                        bins=30, label='Credit.Policy=0')
plt.legend()
plt.xlabel('FICO');
We can clearly see that there are a lot more people with a credit policy of 1 in this dataset, but even so, there is a skew towards lower fico scores for those with a credit policy of 0 and higher scores for those with a credit policy of 1.
We had one variable that was categorical, that we didn't get much data on. Let's use seaborn to create a quick categorical plot of that to see if anything stands out.
plt.figure(figsize=(11,7))
sns.countplot(x='purpose',hue='not.fully.paid',data=df,palette='Set2');
Nothing really stands out to me; all of the bars look pretty proportional between paid and not-fully-paid loans.
This was a quick overview of some of the features pandas, pandas profiling, matplotlib, and seaborn have to offer for exploring data. I plan to do a fuller EDA in a future article, but I wanted to show you what pandas has to offer and how it provides many capabilities similar to SPSS, plus a lot more.¶
If you want to learn exactly how to do everything I did here in a bit more detail and at a slower pace, I'd highly recommend the Python data science bootcamp by Jose Portilla that I mentioned here, since my plan for this site is to walk you through cool features that I think add value to the social scientist's tool belt, not necessarily to teach you how to use pandas or the plotting libraries from scratch.