Examining Data in Python via Pandas¶
One of the most common uses of SPSS is to quickly look at the descriptives of multiple variables at once. Below is a simple example output from SPSS.
Does this look familiar?¶
Aside from correlations, this is one of the most common syntax commands I've traditionally used within SPSS. This article will walk you through how to produce extremely similar output in Python. I'll also show you how to use other common methods like crosstabs and correlation matrices. Finally, I'll introduce you to a great package that allows you to do even more investigation of the data with a few additional lines of code.
Let's explore some data!
First, let's import pandas, numpy, and matplotlib (with inline plotting enabled):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Quick note about Pandas¶
Pandas is a package started by Wes McKinney, who was working as a quant at a hedge fund and wanted a better way to work with data than Excel macros had to offer.
The Dataset¶
This is a loan dataset, so it isn't perfectly suited to the social sciences (especially I/O), but you can easily see how paying or not paying a loan is a human behavior, similar to many of the variables we study as social scientists. Job performance, employee engagement, job satisfaction, etc. all have predictors, and the fact that this data has human behavior as the outcome (whether or not the loan was fully paid) means it's very similar to the problems we'd be likely to encounter.
This is actually LendingClub.com data, which you can read more about here. Here is what each of the variables/columns represents:
- credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
- purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
- int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
- installment: The monthly installments owed by the borrower if the loan is funded.
- log.annual.inc: The natural log of the self-reported annual income of the borrower.
- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
- fico: The FICO credit score of the borrower.
- days.with.cr.line: The number of days the borrower has had a credit line.
- revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
- revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
- inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
- pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
- not.fully.paid: 1 if the loan was not fully paid back, and 0 if it was (this is the outcome variable).
df = pd.read_csv('data/loan_data.csv')  # load the data
columns = list(df.columns)  # a list of the column names for easy reference
Let's write a few quick lines of code to figure out which data type each variable is, because we can't get descriptives on object-type variables, which typically represent strings.
data_type = []
for i in df.columns:
    data_type.append(df[i].dtype)

pd.DataFrame(list(zip(df.columns, data_type)), columns=['variables','data_types'])
| | variables | data_types |
|---|---|---|
0 | credit.policy | int64 |
1 | purpose | object |
2 | int.rate | float64 |
3 | installment | float64 |
4 | log.annual.inc | float64 |
5 | dti | float64 |
6 | fico | int64 |
7 | days.with.cr.line | float64 |
8 | revol.bal | int64 |
9 | revol.util | float64 |
10 | inq.last.6mths | int64 |
11 | delinq.2yrs | int64 |
12 | pub.rec | int64 |
13 | not.fully.paid | int64 |
So we can quickly see that every variable aside from purpose should give us usable descriptives.
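As a quick aside, pandas can give you the same information in one line through the dtypes attribute; here is a minimal sketch (the reset_index/rename step is only there to mimic the table above):

df.dtypes  # a Series of each column's dtype, indexed by column name

# optionally, reshape it into a DataFrame that mirrors the table above
df.dtypes.reset_index().rename(columns={'index': 'variables', 0: 'data_types'})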
Descriptive Statistics¶
How do we get descriptives in Python? We can do this by using the .describe() method.
Note: I will use the round function to make the data look nicer.
round(df.describe(),2)
| | credit.policy | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 9578.0 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 | 9578.00 |
mean | 0.8 | 0.12 | 319.09 | 10.93 | 12.61 | 710.85 | 4560.77 | 16913.96 | 46.80 | 1.58 | 0.16 | 0.06 | 0.16 |
std | 0.4 | 0.03 | 207.07 | 0.61 | 6.88 | 37.97 | 2496.93 | 33756.19 | 29.01 | 2.20 | 0.55 | 0.26 | 0.37 |
min | 0.0 | 0.06 | 15.67 | 7.55 | 0.00 | 612.00 | 178.96 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
25% | 1.0 | 0.10 | 163.77 | 10.56 | 7.21 | 682.00 | 2820.00 | 3187.00 | 22.60 | 0.00 | 0.00 | 0.00 | 0.00 |
50% | 1.0 | 0.12 | 268.95 | 10.93 | 12.66 | 707.00 | 4139.96 | 8596.00 | 46.30 | 1.00 | 0.00 | 0.00 | 0.00 |
75% | 1.0 | 0.14 | 432.76 | 11.29 | 17.95 | 737.00 | 5730.00 | 18249.50 | 70.90 | 2.00 | 0.00 | 0.00 | 0.00 |
max | 1.0 | 0.22 | 940.14 | 14.53 | 29.96 | 827.00 | 17639.96 | 1207359.00 | 119.00 | 33.00 | 13.00 | 5.00 | 1.00 |
SPSS won't let you request descriptives for a column whose data type isn't numeric (int or float). In Python you can include such a column, but nothing will be shown for it. Given what we saw above, we should expect descriptive statistics for every column except purpose, and that's exactly what we get. This also gives us more than the base DESCRIPTIVES command in SPSS, since we get the quartiles as well.
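If you do want a quick summary of purpose, describe can also be pointed at the non-numeric columns; a small sketch:

df.describe(include=['object'])  # count, unique, top, and freq for the categorical column(s)
df['purpose'].value_counts()     # or just count each loan purpose directly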
If we wanted to look at specific columns we would just subset them like this:
round(df[['log.annual.inc','fico']].describe(),2)
| | log.annual.inc | fico |
|---|---|---|
count | 9578.00 | 9578.00 |
mean | 10.93 | 710.85 |
std | 0.61 | 37.97 |
min | 7.55 | 612.00 |
25% | 10.56 | 682.00 |
50% | 10.93 | 707.00 |
75% | 11.29 | 737.00 |
max | 14.53 | 827.00 |
But I hear you saying: that's great and all, but I've used SPSS for so long that I like the stacked output better than the wide output (variables as rows instead of columns)¶
Well... the great thing about object-oriented programming is that the table is an object that can be manipulated.
So... let's think about an easy way to do that. What if we just transposed the object?
round(df.describe().transpose(),2)
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
credit.policy | 9578.0 | 0.80 | 0.40 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 |
int.rate | 9578.0 | 0.12 | 0.03 | 0.06 | 0.10 | 0.12 | 0.14 | 0.22 |
installment | 9578.0 | 319.09 | 207.07 | 15.67 | 163.77 | 268.95 | 432.76 | 940.14 |
log.annual.inc | 9578.0 | 10.93 | 0.61 | 7.55 | 10.56 | 10.93 | 11.29 | 14.53 |
dti | 9578.0 | 12.61 | 6.88 | 0.00 | 7.21 | 12.66 | 17.95 | 29.96 |
fico | 9578.0 | 710.85 | 37.97 | 612.00 | 682.00 | 707.00 | 737.00 | 827.00 |
days.with.cr.line | 9578.0 | 4560.77 | 2496.93 | 178.96 | 2820.00 | 4139.96 | 5730.00 | 17639.96 |
revol.bal | 9578.0 | 16913.96 | 33756.19 | 0.00 | 3187.00 | 8596.00 | 18249.50 | 1207359.00 |
revol.util | 9578.0 | 46.80 | 29.01 | 0.00 | 22.60 | 46.30 | 70.90 | 119.00 |
inq.last.6mths | 9578.0 | 1.58 | 2.20 | 0.00 | 0.00 | 1.00 | 2.00 | 33.00 |
delinq.2yrs | 9578.0 | 0.16 | 0.55 | 0.00 | 0.00 | 0.00 | 0.00 | 13.00 |
pub.rec | 9578.0 | 0.06 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 5.00 |
not.fully.paid | 9578.0 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
fico_binned | 9578.0 | 2.35 | 0.75 | 1.00 | 2.00 | 2.00 | 3.00 | 5.00 |
Now the layout looks just like the SPSS output above!
Crosstabs¶
Another really common output I've used in SPSS is the crosstabs view. It looks a little like this:
Just like descriptives, pandas has you covered for crosstabs as well. Most of these variables are continuous by nature, so in order to make a crosstab viewable, let's quickly bin one that might be interesting to look at. We'll use np.linspace and np.digitize:
- np.linspace looks at the min and the max of the fico score and creates 5 evenly spaced cut points spanning that range.
- Then np.digitize returns the index of the bin each fico score falls into.
- Finally, I build a crosstab so we can examine the number of individuals who either paid back their loan (0) or did not pay back their loan (1).
bins = np.linspace(df['fico'].min(), df['fico'].max(), 5)
df['fico_binned'] = np.digitize(df['fico'],bins)
pd.crosstab(df['not.fully.paid'],df['fico_binned'],margins=True)
not.fully.paid \ fico_binned | 1 | 2 | 3 | 4 | 5 | All
---|---|---|---|---|---|---
0 | 659 | 4138 | 2625 | 622 | 1 | 8045 |
1 | 244 | 915 | 336 | 38 | 0 | 1533 |
All | 903 | 5053 | 2961 | 660 | 1 | 9578 |
You can see a trend where the proportion of borrowers who did not pay back their loans gets smaller and smaller as we move from the lowest bins to the highest bins, which makes sense: people with lower fico scores typically have them for a reason, perhaps because they are less fiscally responsible with credit.
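To put numbers on that trend, crosstab can return column proportions instead of raw counts via its normalize argument; a quick sketch:

# share of loans in each fico bin that were or were not fully paid
round(pd.crosstab(df['not.fully.paid'], df['fico_binned'], normalize='columns'), 3)

Each column now sums to 1, so the row labeled 1 is the proportion of loans in that bin that were not fully paid.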
Correlation Matrix¶
As an I/O psychologist, this SPSS output is probably our bread and butter: examining the correlations among your variables. Lucky for us, that's also extremely simple to do in pandas using the pandas.DataFrame.corr method.
round(df.corr(),3)
| | credit.policy | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid | fico_binned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
credit.policy | 1.000 | -0.294 | 0.059 | 0.035 | -0.091 | 0.348 | 0.099 | -0.188 | -0.104 | -0.536 | -0.076 | -0.054 | -0.158 | 0.345 |
int.rate | -0.294 | 1.000 | 0.276 | 0.056 | 0.220 | -0.715 | -0.124 | 0.093 | 0.465 | 0.203 | 0.156 | 0.098 | 0.160 | -0.658 |
installment | 0.059 | 0.276 | 1.000 | 0.448 | 0.050 | 0.086 | 0.183 | 0.234 | 0.081 | -0.010 | -0.004 | -0.033 | 0.050 | 0.077 |
log.annual.inc | 0.035 | 0.056 | 0.448 | 1.000 | -0.054 | 0.115 | 0.337 | 0.372 | 0.055 | 0.029 | 0.029 | 0.017 | -0.033 | 0.108 |
dti | -0.091 | 0.220 | 0.050 | -0.054 | 1.000 | -0.241 | 0.060 | 0.189 | 0.337 | 0.029 | -0.022 | 0.006 | 0.037 | -0.227 |
fico | 0.348 | -0.715 | 0.086 | 0.115 | -0.241 | 1.000 | 0.264 | -0.016 | -0.541 | -0.185 | -0.216 | -0.148 | -0.150 | 0.921 |
days.with.cr.line | 0.099 | -0.124 | 0.183 | 0.337 | 0.060 | 0.264 | 1.000 | 0.229 | -0.024 | -0.042 | 0.081 | 0.072 | -0.029 | 0.244 |
revol.bal | -0.188 | 0.093 | 0.234 | 0.372 | 0.189 | -0.016 | 0.229 | 1.000 | 0.204 | 0.022 | -0.033 | -0.031 | 0.054 | -0.013 |
revol.util | -0.104 | 0.465 | 0.081 | 0.055 | 0.337 | -0.541 | -0.024 | 0.204 | 1.000 | -0.014 | -0.043 | 0.067 | 0.082 | -0.496 |
inq.last.6mths | -0.536 | 0.203 | -0.010 | 0.029 | 0.029 | -0.185 | -0.042 | 0.022 | -0.014 | 1.000 | 0.021 | 0.073 | 0.149 | -0.175 |
delinq.2yrs | -0.076 | 0.156 | -0.004 | 0.029 | -0.022 | -0.216 | 0.081 | -0.033 | -0.043 | 0.021 | 1.000 | 0.009 | 0.009 | -0.200 |
pub.rec | -0.054 | 0.098 | -0.033 | 0.017 | 0.006 | -0.148 | 0.072 | -0.031 | 0.067 | 0.073 | 0.009 | 1.000 | 0.049 | -0.136 |
not.fully.paid | -0.158 | 0.160 | 0.050 | -0.033 | 0.037 | -0.150 | -0.029 | 0.054 | 0.082 | 0.149 | 0.009 | 0.049 | 1.000 | -0.143 |
fico_binned | 0.345 | -0.658 | 0.077 | 0.108 | -0.227 | 0.921 | 0.244 | -0.013 | -0.496 | -0.175 | -0.200 | -0.136 | -0.143 | 1.000 |
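If you mainly care about how each variable relates to the outcome, you can also pull out and sort a single column of that matrix; a minimal sketch:

# correlation of every numeric variable with the outcome, sorted
round(df.corr()['not.fully.paid'].sort_values(ascending=False), 3)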
We could also turn it into a nicer-looking heatmap by importing the seaborn package.
There are many cmaps available (here are a few), but you can also create your own using matplotlib's linear segmented colormap functionality.
import seaborn as sns
plt.figure(figsize=(12,10))
sns.heatmap(df.corr(),cmap='viridis',annot=True);
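And if you'd rather build your own colormap, as mentioned above, matplotlib's LinearSegmentedColormap can construct one from a list of colors; a small sketch (the blue-white-red choice is just an example):

from matplotlib.colors import LinearSegmentedColormap

# build a simple diverging colormap and reuse it in the same heatmap call
my_cmap = LinearSegmentedColormap.from_list('blue_white_red', ['blue', 'white', 'red'])
plt.figure(figsize=(12,10))
sns.heatmap(df.corr(), cmap=my_cmap, annot=True);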
Of course, this alone would be great even if it were all pandas had to offer, but there's plenty more.
Other important methods I've used over the past few years are:¶
- Groupby
- Pivot Tables
- Along with many more (a quick sketch of the first two follows below)
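To give you a feel for the first two, here is a minimal sketch on this dataset (the particular groupings are just examples, not part of the analysis above):

# groupby: average interest rate for each loan purpose
round(df.groupby('purpose')['int.rate'].mean(), 3)

# pivot table: mean fico score by purpose, split by repayment outcome
round(pd.pivot_table(df, values='fico', index='purpose', columns='not.fully.paid', aggfunc='mean'), 1)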
There's also a great external package I want to talk about that allows you to quickly look at most of your data using the most common EDA (Exploratory Data Analysis) methods.
Pandas Profiling¶
Note: in order to use pandas profiling (which we will explore next), you will need to install the package. You can do so by running the following command in your terminal/Anaconda prompt:
conda install -c conda-forge pandas-profiling
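If you use pip rather than conda, the equivalent install should (as far as I know) just be:
pip install pandas-profiling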
Let's see what additional features pandas profiling has to offer!
As you can see below the pandas profiling package offers just about everything you could ever want to quickly look through your data.
Here's a quick list:
- Number of variables
- Number of observations
- Percent of missing data
- Size of the data in memory
- data types (numerical, categorical, boolean, etc.)
- A flag on features/variables with a large # of zero values
- Common statistics (like seen in the .describe() pandas method)
- Histograms for each feature/variable
- Common values
- Extreme values
- Pearson correlation
- Spearman correlation
As far as understanding your data goes, this is about as good as it gets.
You are going to have to scroll down a little way given the amount of output pandas profiling produces from just one line of code, but please do, so we can see if anything jumps out at us.
import pandas_profiling
pandas_profiling.ProfileReport(df)
Overview
Dataset info
Number of variables | 14 |
---|---|
Number of observations | 9578 |
Total Missing (%) | 0.0% |
Total size in memory | 1.0 MiB |
Average record size in memory | 112.0 B |
Variables types
Numeric | 11 |
---|---|
Categorical | 1 |
Boolean | 2 |
Date | 0 |
Text (Unique) | 0 |
Rejected | 0 |
Unsupported | 0 |
Warnings
- delinq.2yrs has 8458 / 88.3% zeros
- inq.last.6mths has 3637 / 38.0% zeros
- pub.rec has 9019 / 94.2% zeros
- revol.bal has 321 / 3.4% zeros
- revol.util has 297 / 3.1% zeros
Variables
credit.policy
Boolean
Distinct count | 2 |
---|---|
Unique (%) | 0.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Mean | 0.80497 |

Value | Count | Frequency (%)
---|---|---
1 | 7710 | 80.5%
0 | 1868 | 19.5%
days.with.cr.line
Numeric
Distinct count | 2687 |
---|---|
Unique (%) | 28.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 4560.8 |
---|---|
Minimum | 178.96 |
Maximum | 17640 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 178.96 |
---|---|
5-th percentile | 1320 |
Q1 | 2820 |
Median | 4140 |
Q3 | 5730 |
95-th percentile | 9330 |
Maximum | 17640 |
Range | 17461 |
Interquartile range | 2910 |
Descriptive statistics
Standard deviation | 2496.9 |
---|---|
Coef of variation | 0.54748 |
Kurtosis | 1.9379 |
Mean | 4560.8 |
MAD | 1897.7 |
Skewness | 1.1557 |
Sum | 43683000 |
Variance | 6234700 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
3660.0 | 50 | 0.5%
3630.0 | 48 | 0.5%
3990.0 | 46 | 0.5%
4410.0 | 44 | 0.5%
3600.0 | 41 | 0.4%
2550.0 | 38 | 0.4%
4080.0 | 38 | 0.4%
3690.0 | 37 | 0.4%
1800.0 | 37 | 0.4%
4020.0 | 35 | 0.4%
Other values (2677) | 9164 | 95.7%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
178.95833330000002 | 1 | 0.0%
180.04166669999998 | 3 | 0.0%
181.0 | 1 | 0.0%
183.04166669999998 | 1 | 0.0%
209.04166669999998 | 1 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
16260.0 | 1 | 0.0%
16350.0 | 1 | 0.0%
16652.0 | 1 | 0.0%
17616.0 | 1 | 0.0%
17639.95833 | 1 | 0.0%
delinq.2yrs
Numeric
Distinct count | 11 |
---|---|
Unique (%) | 0.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 0.16371 |
---|---|
Minimum | 0 |
Maximum | 13 |
Zeros (%) | 88.3% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 0 |
95-th percentile | 1 |
Maximum | 13 |
Range | 13 |
Interquartile range | 0 |
Descriptive statistics
Standard deviation | 0.54621 |
---|---|
Coef of variation | 3.3365 |
Kurtosis | 71.433 |
Mean | 0.16371 |
MAD | 0.28913 |
Skewness | 6.0618 |
Sum | 1568 |
Variance | 0.29835 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0 | 8458 | 88.3%
1 | 832 | 8.7%
2 | 192 | 2.0%
3 | 65 | 0.7%
4 | 19 | 0.2%
5 | 6 | 0.1%
6 | 2 | 0.0%
7 | 1 | 0.0%
13 | 1 | 0.0%
11 | 1 | 0.0%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0 | 8458 | 88.3%
1 | 832 | 8.7%
2 | 192 | 2.0%
3 | 65 | 0.7%
4 | 19 | 0.2%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
6 | 2 | 0.0%
7 | 1 | 0.0%
8 | 1 | 0.0%
11 | 1 | 0.0%
13 | 1 | 0.0%
dti
Numeric
Distinct count | 2529 |
---|---|
Unique (%) | 26.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 12.607 |
---|---|
Minimum | 0 |
Maximum | 29.96 |
Zeros (%) | 0.9% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 1.27 |
Q1 | 7.2125 |
Median | 12.665 |
Q3 | 17.95 |
95-th percentile | 23.65 |
Maximum | 29.96 |
Range | 29.96 |
Interquartile range | 10.738 |
Descriptive statistics
Standard deviation | 6.884 |
---|---|
Coef of variation | 0.54606 |
Kurtosis | -0.90036 |
Mean | 12.607 |
MAD | 5.796 |
Skewness | 0.023941 |
Sum | 120750 |
Variance | 47.389 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0.0 | 89 | 0.9%
10.0 | 19 | 0.2%
0.6 | 16 | 0.2%
13.16 | 13 | 0.1%
19.2 | 13 | 0.1%
15.1 | 13 | 0.1%
12.0 | 13 | 0.1%
6.0 | 13 | 0.1%
13.28 | 12 | 0.1%
10.8 | 12 | 0.1%
Other values (2519) | 9365 | 97.8%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0.0 | 89 | 0.9%
0.01 | 1 | 0.0%
0.02 | 1 | 0.0%
0.03 | 1 | 0.0%
0.04 | 2 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
29.72 | 1 | 0.0%
29.74 | 1 | 0.0%
29.9 | 1 | 0.0%
29.95 | 1 | 0.0%
29.96 | 1 | 0.0%
fico
Numeric
Distinct count | 44 |
---|---|
Unique (%) | 0.5% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 710.85 |
---|---|
Minimum | 612 |
Maximum | 827 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 612 |
---|---|
5-th percentile | 657 |
Q1 | 682 |
Median | 707 |
Q3 | 737 |
95-th percentile | 782 |
Maximum | 827 |
Range | 215 |
Interquartile range | 55 |
Descriptive statistics
Standard deviation | 37.971 |
---|---|
Coef of variation | 0.053416 |
Kurtosis | -0.42231 |
Mean | 710.85 |
MAD | 31.264 |
Skewness | 0.47126 |
Sum | 6808486 |
Variance | 1441.8 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
687 | 548 | 5.7%
682 | 536 | 5.6%
692 | 498 | 5.2%
697 | 476 | 5.0%
702 | 472 | 4.9%
707 | 444 | 4.6%
667 | 438 | 4.6%
677 | 427 | 4.5%
717 | 424 | 4.4%
662 | 414 | 4.3%
Other values (34) | 4901 | 51.2%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
612 | 2 | 0.0%
617 | 1 | 0.0%
622 | 1 | 0.0%
627 | 2 | 0.0%
632 | 6 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
807 | 45 | 0.5%
812 | 33 | 0.3%
817 | 6 | 0.1%
822 | 5 | 0.1%
827 | 1 | 0.0%
inq.last.6mths
Numeric
Distinct count | 28 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 1.5775 |
---|---|
Minimum | 0 |
Maximum | 33 |
Zeros (%) | 38.0% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 1 |
Q3 | 2 |
95-th percentile | 5 |
Maximum | 33 |
Range | 33 |
Interquartile range | 2 |
Descriptive statistics
Standard deviation | 2.2002 |
---|---|
Coef of variation | 1.3948 |
Kurtosis | 26.288 |
Mean | 1.5775 |
MAD | 1.4949 |
Skewness | 3.5842 |
Sum | 15109 |
Variance | 4.8411 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0 | 3637 | 38.0%
1 | 2462 | 25.7%
2 | 1384 | 14.4%
3 | 864 | 9.0%
4 | 475 | 5.0%
5 | 278 | 2.9%
6 | 165 | 1.7%
7 | 100 | 1.0%
8 | 72 | 0.8%
9 | 47 | 0.5%
Other values (18) | 94 | 1.0%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0 | 3637 | 38.0%
1 | 2462 | 25.7%
2 | 1384 | 14.4%
3 | 864 | 9.0%
4 | 475 | 5.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
27 | 1 | 0.0%
28 | 1 | 0.0%
31 | 1 | 0.0%
32 | 1 | 0.0%
33 | 1 | 0.0%
installment
Numeric
Distinct count | 4788 |
---|---|
Unique (%) | 50.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 319.09 |
---|---|
Minimum | 15.67 |
Maximum | 940.14 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 15.67 |
---|---|
5-th percentile | 65.559 |
Q1 | 163.77 |
Median | 268.95 |
Q3 | 432.76 |
95-th percentile | 756.27 |
Maximum | 940.14 |
Range | 924.47 |
Interquartile range | 268.99 |
Descriptive statistics
Standard deviation | 207.07 |
---|---|
Coef of variation | 0.64894 |
Kurtosis | 0.13791 |
Mean | 319.09 |
MAD | 165.63 |
Skewness | 0.91252 |
Sum | 3056200 |
Variance | 42879 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
317.72 | 41 | 0.4%
316.11 | 34 | 0.4%
319.47 | 29 | 0.3%
381.26 | 27 | 0.3%
662.68 | 27 | 0.3%
156.1 | 24 | 0.3%
320.95 | 24 | 0.3%
669.33 | 23 | 0.2%
334.67 | 23 | 0.2%
188.02 | 23 | 0.2%
Other values (4778) | 9303 | 97.1%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
15.67 | 1 | 0.0%
15.69 | 1 | 0.0%
15.75 | 1 | 0.0%
15.76 | 1 | 0.0%
15.91 | 1 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
916.95 | 2 | 0.0%
918.02 | 2 | 0.0%
922.42 | 1 | 0.0%
926.83 | 2 | 0.0%
940.14 | 1 | 0.0%
int.rate
Numeric
Distinct count | 249 |
---|---|
Unique (%) | 2.6% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 0.12264 |
---|---|
Minimum | 0.06 |
Maximum | 0.2164 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 0.06 |
---|---|
5-th percentile | 0.0774 |
Q1 | 0.1039 |
Median | 0.1221 |
Q3 | 0.1407 |
95-th percentile | 0.167 |
Maximum | 0.2164 |
Range | 0.1564 |
Interquartile range | 0.0368 |
Descriptive statistics
Standard deviation | 0.026847 |
---|---|
Coef of variation | 0.21891 |
Kurtosis | -0.22432 |
Mean | 0.12264 |
MAD | 0.021441 |
Skewness | 0.16442 |
Sum | 1174.6 |
Variance | 0.00072076 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0.1253 | 354 | 3.7%
0.0894 | 299 | 3.1%
0.1183 | 243 | 2.5%
0.1218 | 215 | 2.2%
0.0963 | 210 | 2.2%
0.1114 | 206 | 2.2%
0.08 | 198 | 2.1%
0.1287 | 197 | 2.1%
0.1148 | 193 | 2.0%
0.0932 | 187 | 2.0%
Other values (239) | 7276 | 76.0%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0.06 | 8 | 0.1%
0.0639 | 4 | 0.0%
0.0676 | 9 | 0.1%
0.0705 | 23 | 0.2%
0.0712 | 9 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
0.2052 | 4 | 0.0%
0.2086 | 6 | 0.1%
0.209 | 2 | 0.0%
0.2121 | 7 | 0.1%
0.2164 | 2 | 0.0%
log.annual.inc
Numeric
Distinct count | 1987 |
---|---|
Unique (%) | 20.7% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 10.932 |
---|---|
Minimum | 7.5475 |
Maximum | 14.528 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 7.5475 |
---|---|
5-th percentile | 9.9179 |
Q1 | 10.558 |
Median | 10.929 |
Q3 | 11.291 |
95-th percentile | 11.918 |
Maximum | 14.528 |
Range | 6.9809 |
Interquartile range | 0.73288 |
Descriptive statistics
Standard deviation | 0.61481 |
---|---|
Coef of variation | 0.056239 |
Kurtosis | 1.609 |
Mean | 10.932 |
MAD | 0.46903 |
Skewness | 0.028668 |
Sum | 104710 |
Variance | 0.37799 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
11.00209984 | 308 | 3.2%
10.81977828 | 248 | 2.6%
10.59663473 | 224 | 2.3%
10.30895266 | 224 | 2.3%
10.71441777 | 221 | 2.3%
11.22524339 | 196 | 2.0%
11.15625052 | 165 | 1.7%
10.77895629 | 149 | 1.6%
10.91508846 | 147 | 1.5%
11.08214255 | 146 | 1.5%
Other values (1977) | 7550 | 78.8%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
7.547501682999999 | 1 | 0.0%
7.60090246 | 1 | 0.0%
8.101677747 | 1 | 0.0%
8.160518247 | 1 | 0.0%
8.188689124 | 1 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
13.71015004 | 2 | 0.0%
13.99783211 | 1 | 0.0%
14.12446477 | 1 | 0.0%
14.18015367 | 1 | 0.0%
14.52835448 | 1 | 0.0%
not.fully.paid
Boolean
Distinct count | 2 |
---|---|
Unique (%) | 0.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Mean | 0.16005 |

Value | Count | Frequency (%)
---|---|---
0 | 8045 | 84.0%
1 | 1533 | 16.0%
pub.rec
Numeric
Distinct count | 6 |
---|---|
Unique (%) | 0.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 0.062122 |
---|---|
Minimum | 0 |
Maximum | 5 |
Zeros (%) | 94.2% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 0 |
95-th percentile | 1 |
Maximum | 5 |
Range | 5 |
Interquartile range | 0 |
Descriptive statistics
Standard deviation | 0.26213 |
---|---|
Coef of variation | 4.2196 |
Kurtosis | 38.781 |
Mean | 0.062122 |
MAD | 0.11699 |
Skewness | 5.1264 |
Sum | 595 |
Variance | 0.06871 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0 | 9019 | 94.2%
1 | 533 | 5.6%
2 | 19 | 0.2%
3 | 5 | 0.1%
5 | 1 | 0.0%
4 | 1 | 0.0%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0 | 9019 | 94.2%
1 | 533 | 5.6%
2 | 19 | 0.2%
3 | 5 | 0.1%
4 | 1 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
1 | 533 | 5.6%
2 | 19 | 0.2%
3 | 5 | 0.1%
4 | 1 | 0.0%
5 | 1 | 0.0%
purpose
Categorical
Distinct count | 7 |
---|---|
Unique (%) | 0.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |

Value | Count | Frequency (%)
---|---|---
debt_consolidation | 3957 | 41.3%
all_other | 2331 | 24.3%
credit_card | 1262 | 13.2%
home_improvement | 629 | 6.6%
small_business | 619 | 6.5%
major_purchase | 437 | 4.6%
educational | 343 | 3.6%
revol.bal
Numeric
Distinct count | 7869 |
---|---|
Unique (%) | 82.2% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 16914 |
---|---|
Minimum | 0 |
Maximum | 1207359 |
Zeros (%) | 3.4% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 127.7 |
Q1 | 3187 |
Median | 8596 |
Q3 | 18250 |
95-th percentile | 57654 |
Maximum | 1207359 |
Range | 1207359 |
Interquartile range | 15062 |
Descriptive statistics
Standard deviation | 33756 |
---|---|
Coef of variation | 1.9958 |
Kurtosis | 259.66 |
Mean | 16914 |
MAD | 15560 |
Skewness | 11.161 |
Sum | 162001946 |
Variance | 1139500000 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0 | 321 | 3.4%
255 | 10 | 0.1%
298 | 10 | 0.1%
682 | 9 | 0.1%
346 | 8 | 0.1%
182 | 6 | 0.1%
1085 | 6 | 0.1%
2229 | 6 | 0.1%
8035 | 5 | 0.1%
6 | 5 | 0.1%
Other values (7859) | 9192 | 96.0%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0 | 321 | 3.4%
1 | 5 | 0.1%
2 | 2 | 0.0%
3 | 1 | 0.0%
4 | 2 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
407794 | 1 | 0.0%
508961 | 1 | 0.0%
602519 | 1 | 0.0%
952013 | 1 | 0.0%
1207359 | 1 | 0.0%
revol.util
Numeric
Distinct count | 1035 |
---|---|
Unique (%) | 10.8% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 46.799 |
---|---|
Minimum | 0 |
Maximum | 119 |
Zeros (%) | 3.1% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 1.1 |
Q1 | 22.6 |
Median | 46.3 |
Q3 | 70.9 |
95-th percentile | 94 |
Maximum | 119 |
Range | 119 |
Interquartile range | 48.3 |
Descriptive statistics
Standard deviation | 29.014 |
---|---|
Coef of variation | 0.61998 |
Kurtosis | -1.1165 |
Mean | 46.799 |
MAD | 24.835 |
Skewness | 0.059985 |
Sum | 448240 |
Variance | 841.84 |
Memory size | 74.9 KiB |

Value | Count | Frequency (%)
---|---|---
0.0 | 297 | 3.1%
0.5 | 26 | 0.3%
0.3 | 22 | 0.2%
73.7 | 22 | 0.2%
47.8 | 22 | 0.2%
3.3 | 21 | 0.2%
0.1 | 21 | 0.2%
0.2 | 20 | 0.2%
0.7 | 20 | 0.2%
1.0 | 20 | 0.2%
Other values (1025) | 9087 | 94.9%

Minimum 5 values

Value | Count | Frequency (%)
---|---|---
0.0 | 297 | 3.1%
0.04 | 1 | 0.0%
0.1 | 21 | 0.2%
0.2 | 20 | 0.2%
0.3 | 22 | 0.2%

Maximum 5 values

Value | Count | Frequency (%)
---|---|---
106.2 | 1 | 0.0%
106.4 | 1 | 0.0%
106.5 | 1 | 0.0%
108.8 | 1 | 0.0%
119.0 | 1 | 0.0%
Correlations

(The Pearson and Spearman correlation heatmaps appear here as images in the rendered report.)

Sample

| | credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | debt_consolidation | 0.1189 | 829.10 | 11.350407 | 19.48 | 737 | 5639.958333 | 28854 | 52.1 | 0 | 0 | 0 | 0 |
1 | 1 | credit_card | 0.1071 | 228.22 | 11.082143 | 14.29 | 707 | 2760.000000 | 33623 | 76.7 | 0 | 0 | 0 | 0 |
2 | 1 | debt_consolidation | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.000000 | 3511 | 25.6 | 1 | 0 | 0 | 0 |
3 | 1 | debt_consolidation | 0.1008 | 162.34 | 11.350407 | 8.10 | 712 | 2699.958333 | 33667 | 73.2 | 1 | 0 | 0 | 0 |
4 | 1 | credit_card | 0.1426 | 102.92 | 11.299732 | 14.97 | 667 | 4066.000000 | 4740 | 39.5 | 0 | 1 | 0 | 0 |
Some quick notes from the descriptives:
- We can clearly see that a number of these variables are incredibly skewed (a quick way to confirm this is sketched below).
- The correlation matrix provides us with a few interesting (but understandable) findings:
  - A strong negative correlation between interest rate and fico score (-0.72)
  - A moderate positive correlation between credit policy and fico score (0.35)
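If you want to double-check the skew without scrolling back through the report, pandas can compute skewness for every numeric column directly; a minimal sketch:

round(df.skew(numeric_only=True), 2)  # skewness of each numeric column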
While this is great for a brief overview, we should probably do a bit more digging. For that, let's use a package built on top of matplotlib called seaborn.
Let's look at the relationship between fico score and credit policy a bit closer using graphs.
import seaborn as sns
plt.figure(figsize=(10,6))
df[df['credit.policy']==1]['fico'].hist(alpha=0.5, color='blue',
                                        bins=30, label='Credit.Policy=1')
df[df['credit.policy']==0]['fico'].hist(alpha=0.5, color='red',
                                        bins=30, label='Credit.Policy=0')
plt.legend()
plt.xlabel('FICO');
We can clearly see that there are a lot more people with a credit policy of 1 in this dataset, but even so, there is a skew towards lower fico scores for those with a credit policy of 0 and higher scores for those with a credit policy of 1.
We had one variable that was categorical, that we didn't get much data on. Let's use seaborn to create a quick categorical plot of that to see if anything stands out.
plt.figure(figsize=(11,7))
sns.countplot(x='purpose',hue='not.fully.paid',data=df,palette='Set2');
Nothing really stands out to me; all of the bars look pretty proportional between paid and not-fully-paid loans.
This was a quick overview of some of the features pandas, pandas profiling, matplotlib, and seaborn have to offer for exploring data. I plan to do a fuller EDA in a future article, but I wanted to show you what pandas has to offer and how it provides many capabilities similar to SPSS, plus a lot more.¶
If you want to learn exactly how to do everything I did here in a bit more detail and at a slower pace, I'd highly recommend the Python data science bootcamp by Jose Portilla that I mentioned here, since my plan for this site is to walk you through cool features that I think add value to the social scientist's tool belt, not necessarily to teach you how to use pandas or the plotting libraries from scratch.