Why use NumPy? Why should I care about linear algebra as an I/O psychologist?

NumPy (Numerical Python) is a linear algebra library for Python. But why use it? Python handles numbers natively. Let's do some basic math in Python to show you what I am talking about.

In [134]:
print(7*3) 
print(7**3)
lst1 = [1,2,3,4,5]
lst2 = [1,2,4,8,16]
addition_list = []

# this would be equivalent to adding two columns in excel
for i in range(len(lst1)):
    addition_list.append(lst1[i]+lst2[i])
print(addition_list)
21
343
[2, 4, 7, 12, 21]

Enter NumPy


Quick note about NumPy

NumPy was created by Travis Oliphant, who also went on to found Anaconda.
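As a quick preview of where we're headed, the Excel-style column addition from the loop above collapses to a single line in NumPy (a sketch, using the same two lists):

```python
import numpy as np

lst1 = np.array([1, 2, 3, 4, 5])
lst2 = np.array([1, 2, 4, 8, 16])

# element-wise addition, no loop needed
print(lst1 + lst2)  # [ 2  4  7 12 21]
```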

So again, why use NumPy? It seems basic Python has this taken care of. Well, here are a few reasons:

1. Memory: NumPy arrays take up less space than Python list objects.

While this is important, it's not a huge deal with most of the datasets we use.
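To make the memory point concrete, here's a rough comparison (a sketch; exact byte counts vary by platform and Python version):

```python
import sys
import numpy as np

n = 1000
lst = list(range(n))
arr = np.arange(n)

# a list stores pointers to full Python int objects;
# the array stores raw machine integers in one contiguous block
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
array_bytes = arr.nbytes

print(list_bytes, array_bytes)  # the list side is several times larger
```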

2. Speed: NumPy's operations run as vectorized loops in compiled C code (and it supports broadcasting), which makes computation much faster.

Let's take a look

First we import numpy and assign it the alias np, as this is the standard Python convention.

In [19]:
import numpy as np

Now let's set the size for a large vector.

In [136]:
# 100,000 should do the trick
size_of_vec = 100000

Next let's write a simple function that creates X and Y variables, each consisting of the numbers 0 through 99,999. We'll then iterate through those lists index by index and add them together into a variable called Z.

In [142]:
def pure_python_version():
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = [X[i] + Y[i] for i in range(len(X))]
    return Z

Next let's build a function using NumPy that does the exact same thing.

In [144]:
def numpy_version():
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y
    return Z

Now let's use the %timeit magic function to time the speed of our code. First we will look at the pure python version, then we will look at the NumPy version.

In [147]:
%timeit t1 = pure_python_version()
22.9 ms ± 734 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [148]:
%timeit t2 = numpy_version()
679 µs ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As you can see, NumPy is roughly 33x faster at adding these two variables together.

Basic NumPy

So as you can see it is much faster. But why should we even care about this as social scientists? What does NumPy even do that's relevant to our work? Like many people going into the social sciences, I didn't have a strong math background. I certainly never took a course in linear algebra. But if you look at what we do in Excel or SPSS, we are often doing basic linear algebra.

For example, let's say we collected data from 20 respondents on a 5-item Extraversion battery. What would that data look like in Excel or SPSS? It would have 20 rows and 5 columns, with each individual's responses to the 5 items.

Let's create that array right now.

First we want to create a random array of integers with a size of 100 (20 people x 5 responses). One caveat: np.random.randint excludes its upper bound, so randint(1, 5) actually draws values from 1 through 4; for a true 1-5 scale you would use randint(1, 6). Second we want to reshape it into a 20 x 5 matrix.

The beauty of Python is that you can chain methods together, so we can do all of that in one line of code. We'll assign the output to a variable named extraversion.

In [164]:
extraversion = np.random.randint(1,5,100).reshape(20,5)
In [165]:
extraversion
Out[165]:
array([[3, 1, 4, 2, 2],
       [4, 2, 1, 3, 3],
       [4, 2, 4, 3, 3],
       [1, 2, 1, 2, 4],
       [3, 1, 1, 3, 2],
       [1, 1, 4, 3, 4],
       [4, 1, 1, 4, 1],
       [2, 1, 1, 1, 1],
       [3, 2, 2, 2, 3],
       [1, 2, 4, 3, 2],
       [2, 1, 2, 4, 3],
       [3, 1, 2, 4, 2],
       [3, 2, 2, 4, 3],
       [3, 3, 3, 1, 3],
       [2, 4, 1, 3, 2],
       [3, 2, 4, 3, 2],
       [1, 1, 3, 1, 4],
       [3, 4, 3, 4, 3],
       [1, 1, 2, 1, 3],
       [1, 2, 3, 4, 2]])
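One caveat: the values above are random, so re-running the cell produces a different matrix. Seeding the generator makes the example reproducible; NumPy's newer Generator API is one way to do it (a sketch, not from the original cell, that also gives a true 1-5 scale):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator: same draws every run
extraversion = rng.integers(1, 6, 100).reshape(20, 5)  # integers 1-5 (upper bound exclusive)

print(extraversion.shape)  # (20, 5)
```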

Now we have our responses for each respondent. Is this starting to look familiar? It looks extremely similar to an Excel sheet, just without column headers. In fact, if we wanted to, we could import Pandas and turn this into a replica of an Excel sheet.

In [186]:
import pandas as pd #alias
pd.DataFrame(extraversion,columns=['item_1','item_2','item_3','item_4','item_5']).head()
Out[186]:
item_1 item_2 item_3 item_4 item_5
0 3 1 4 2 2
1 4 2 1 3 3
2 4 2 4 3 3
3 1 2 1 2 4
4 3 1 1 3 2

But the focus of this post is on NumPy, so let's return to our array!

Now, let's imagine we are asked to give our professor the mean extraversion score for each respondent. How would you do that? If it were a Python list, you would have to add up each item in the list and divide by the length of the list, something like this.

In [176]:
total = 0
for i in list(extraversion[0]):
    total += i
print (total/len(extraversion[0]))
2.4

So what that loop did was the equivalent of computing the mean for one row of the NumPy array. You'd need to repeat it for extraversion[0] through extraversion[19] to get everyone's mean. That seems pretty time consuming and inefficient.
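To make that inefficiency concrete, here's what computing every respondent's mean with plain loops would look like (a sketch with a seeded stand-in array, since the values above are random):

```python
import numpy as np

rng = np.random.default_rng(0)
extraversion = rng.integers(1, 5, (20, 5))  # stand-in 20x5 response matrix

means = []
for row in extraversion:        # one pass per respondent
    total = 0
    for item in row:            # one pass per item
        total += item
    means.append(total / len(row))

print(len(means))  # 20, one mean per respondent
```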

That's where NumPy comes in. Using NumPy, I can sum each row and divide by the number of items as if it were a basic math problem.

Intuitively, this is what is happening when you use a linear algebra package like NumPy; dividing the vector of row sums by a single scalar is specifically an example of broadcasting.


In [180]:
np.sum(extraversion,axis=1)/extraversion.shape[1]
Out[180]:
array([2.4, 2.6, 3.2, 2. , 2. , 2.6, 2.2, 1.2, 2.4, 2.4, 2.4, 2.4, 2.8,
       2.6, 2.4, 2.8, 2. , 3.4, 1.6, 2.4])
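The broadcasting in that line is the division: a length-20 array divided by the scalar 5, with the scalar stretched across every element. A minimal sketch using the first three row sums from the array above:

```python
import numpy as np

sums = np.array([12, 13, 16])  # row sums for three respondents
means = sums / 5               # the scalar 5 is broadcast across the whole array
print(means)                   # [2.4 2.6 3.2]
```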

That one line did what about 5 lines of code would do in pure Python. And it ignores the fact that NumPy comes with many built-in methods. It's actually as simple as this:

In [181]:
np.mean(extraversion,axis=1)
Out[181]:
array([2.4, 2.6, 3.2, 2. , 2. , 2.6, 2.2, 1.2, 2.4, 2.4, 2.4, 2.4, 2.8,
       2.6, 2.4, 2.8, 2. , 3.4, 1.6, 2.4])
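As a usage note, the axis argument flips the computation: axis=1 averages across a row (one mean per respondent), while axis=0 averages down a column (one mean per item, handy for checking how items behave across the sample). A sketch with a seeded stand-in array:

```python
import numpy as np

rng = np.random.default_rng(7)
extraversion = rng.integers(1, 5, (20, 5))    # stand-in 20x5 response matrix

person_means = np.mean(extraversion, axis=1)  # shape (20,): one mean per respondent
item_means = np.mean(extraversion, axis=0)    # shape (5,): one mean per item

print(person_means.shape, item_means.shape)   # (20,) (5,)
```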

Hopefully you now have an understanding of how and why linear algebra and the NumPy package are important to us, and how they relate to your research/work as an I/O psychologist.

The basic mathematical understanding is often taken for granted when learning how to calculate means or sum numbers for multiple respondents, but it's interesting to think about how it works when it comes to programming.

If you are interested in learning more about how to use NumPy, Jake VanderPlas has a great book titled Python Data Science Handbook.

He made all of his notebooks available for free on his GitHub account. They are great resources for digging in further.

This article wasn't meant to be all-inclusive and show everything NumPy has to offer, but more so to be a quick overview of its benefits and why I think it's relevant to the social sciences.

This is a very quick article that provides many of the highlights of NumPy functionality. Definitely worth taking a look to learn more.
