Speed Test

I thought it might be useful for those working day-to-day within Python to learn from some of the testing I do when I learn about new packages or methods. So I may have some shorter articles that come out more frequently as well. This article will focus on two things.

  1. A new dataframe concept that allows you to use multiple CPU cores
  2. A vectorization method for applying changes to your dataframe

Meet Modin

Modin is potentially the answer we have all been waiting for: a way to work with "big data" in memory. From the Modin GitHub repo:

[Image: description of Modin from its GitHub README]

The fact that you can use multiple cores in parallel should greatly increase the speed, but let's give it a try on some of my data.

The data we will be using is the entire Statcast dataset for the 2019 MLB season. At 360MB the file isn't huge, but it is relatively large for a standard csv. Let's load it in pandas, then load it in Modin, and see how the two compare.

In [1]:
import pandas as pd
import numpy as np
In [2]:
%%time
pd_df = pd.read_csv('../statcast_data/2019_mlb_stats.csv')
CPU times: user 4.7 s, sys: 280 ms, total: 4.98 s
Wall time: 4.98 s
In [3]:
import modin.pandas as md_pd
In [4]:
%%time
mdn_df = md_pd.read_csv('../statcast_data/2019_mlb_stats.csv')
CPU times: user 144 ms, sys: 18.3 ms, total: 162 ms
Wall time: 2.85 s
In [5]:
pd_df.shape
Out[5]:
(723346, 90)

So we can see Modin has promise. The wall time to load the 360MB file is roughly halved (2.85 s vs 4.98 s), the CPU time drops dramatically because the work is spread across cores, and this is a relatively large dataframe of 723.3k rows by 90 columns.
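The load-time comparison is easy to reproduce on any machine. The sketch below is a minimal, hypothetical setup: it generates a throwaway CSV rather than using the Statcast file, and times plain pandas; swapping the import for `import modin.pandas as pd` is all Modin requires.

```python
import time
import numpy as np
import pandas as pd  # for Modin, change to: import modin.pandas as pd

# Build a throwaway CSV so the timing is reproducible without the Statcast file.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "stand": rng.choice(["R", "L"], size=100_000),
    "launch_speed": rng.normal(88, 10, size=100_000),
})
demo.to_csv("demo_stats.csv", index=False)

# Time the load; with Modin you would change only the import above.
start = time.perf_counter()
loaded = pd.read_csv("demo_stats.csv")
elapsed = time.perf_counter() - start
print(f"Loaded {loaded.shape[0]:,} rows x {loaded.shape[1]} cols in {elapsed:.3f}s")
```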

Now let's look at a few different ways to process data.

3 different ways to process data

1. Write a function and use .apply()

This involves writing an ordinary named function and passing it to .apply() to execute on the column. The simple function below returns 1 if the value is "R" and 0 if it is not, which is useful for creating an integer feature to include in an ML model.

In [6]:
def stance(stand):
    # 1 for a right-handed batter ("R"), 0 otherwise
    if stand == "R":
        return 1
    else:
        return 0
In [7]:
%%time
pd_df['right_handed_batter'] = pd_df['stand'].apply(stance)
CPU times: user 164 ms, sys: 16 ms, total: 180 ms
Wall time: 176 ms

2. Write a lambda function within .apply()

This is similar to the first .apply() example but lets us use a more concise, inline function that isn't intended to be reused elsewhere.

In [8]:
%%time
pd_df['right_handed_batter'] = pd_df['stand'].apply(lambda x: 1 if x=="R" else 0)
CPU times: user 218 ms, sys: 0 ns, total: 218 ms
Wall time: 210 ms

We can see that the results of the first two are almost identical, with wall times of roughly 180-210 milliseconds.

3. Use np.where()

np.where() leverages numpy and allows us to vectorize the computation. Let's see if we can get some speed up by taking advantage of numpy vectorization.

In [9]:
%%time
pd_df['right_handed_batter'] = np.where(pd_df['stand'].values == "R", 1, 0)
CPU times: user 14.4 ms, sys: 185 µs, total: 14.6 ms
Wall time: 13.3 ms

We can clearly see this is a lot faster. np.where() is roughly 13x faster on the exact same data.
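Before trading .apply() for np.where() in real code, it's worth confirming the two produce identical results. Here is a quick sanity check on a small synthetic column (hypothetical data, not the Statcast file):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"stand": ["R", "L", "R", "R", "L"]})

# The .apply() route and the vectorized np.where() route
via_apply = df["stand"].apply(lambda x: 1 if x == "R" else 0)
via_where = pd.Series(np.where(df["stand"].values == "R", 1, 0))

# Both encodings agree element-for-element
assert via_apply.tolist() == via_where.tolist() == [1, 0, 1, 1, 0]
```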

Try the same thing on modin dataframes

We know Modin loads data faster than pandas, but what about computation on the dataframes? Let's look at the numbers for each of those as well.

In [10]:
%%time
mdn_df['right_handed_batter'] = mdn_df['stand'].apply(stance)
CPU times: user 25.4 ms, sys: 3.82 ms, total: 29.2 ms
Wall time: 30.5 ms
In [11]:
%%time
mdn_df['right_handed_batter'] = mdn_df['stand'].apply(lambda x: 1 if x=="R" else 0)
CPU times: user 209 ms, sys: 7.87 ms, total: 217 ms
Wall time: 922 ms
In [12]:
%%time
mdn_df['right_handed_batter'] = np.where(mdn_df['stand'].values == "R", 1, 0)
CPU times: user 13.1 s, sys: 351 ms, total: 13.5 s
Wall time: 14.2 s

So we can see that for Modin there are vast differences across the implementations. The traditional named function runs extremely quickly, almost as fast as np.where() on pandas; the lambda function is much slower; and most surprising of all is np.where() within Modin, whose wall time is roughly 75x slower than the .apply() function within pandas.

I reached out to the development team, and they explained that when you call np.where() on a Modin dataframe, Modin first converts it to a numpy array and must then convert the result back into a distributed pandas series. In effect it collects the partitions, merges them, runs np.where(), and re-splits the data for distribution, which causes the high overhead.
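Given that explanation, one way to sidestep the round trip is to keep the whole computation inside the DataFrame API, so Modin never has to materialize a numpy array. A minimal sketch (shown with plain pandas, since Modin mirrors the pandas API; I haven't benchmarked this exact line on Modin):

```python
import pandas as pd  # Modin users would import modin.pandas instead

df = pd.DataFrame({"stand": ["R", "L", "R"]})

# Boolean comparison + integer cast stays inside the (potentially distributed)
# Series API, avoiding the collect/merge/re-split that np.where() triggers in Modin.
df["right_handed_batter"] = df["stand"].eq("R").astype(int)
print(df["right_handed_batter"].tolist())  # [1, 0, 1]
```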

How does this compare to R?

A question came up about whether the .apply() method is similar to the family of apply() functions in R. I'm not completely familiar with R, but after some investigating it appears they are very similar, so I decided it would be interesting to compare speeds between R and Python. For whatever reason the R kernel was running exceptionally slowly in Jupyter, so I ran it in RStudio; the results are below. Loading the dataframe in R took about 2 seconds longer than pandas and about 4 seconds longer than Modin. The lapply() function took 624 milliseconds, compared to roughly 200ms for .apply() and 13.3ms for np.where(), so .apply() is roughly 3x faster and np.where() is roughly 47x faster.

[Image: R timing results from RStudio]

Takeaways

  • Modin has promise for dealing with large dataframes across CPUs. I'm excited to see where the project goes.
  • There are often multiple ways to solve the same problem in Python, and as part of continuous learning and refactoring it's often worth exploring the pros and cons of these different methods.
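As a concrete way to explore those pros and cons, here is a small timing harness using the standard-library timeit module. It is a sketch on synthetic data; the column name and sizes are placeholders, and absolute numbers will vary by machine.

```python
import timeit
import numpy as np
import pandas as pd

# Synthetic stand-in for the Statcast 'stand' column
rng = np.random.default_rng(0)
df = pd.DataFrame({"stand": rng.choice(["R", "L"], size=200_000)})

def stance(s):
    return 1 if s == "R" else 0

candidates = {
    "named function + apply": lambda: df["stand"].apply(stance),
    "lambda + apply": lambda: df["stand"].apply(lambda x: 1 if x == "R" else 0),
    "np.where (vectorized)": lambda: np.where(df["stand"].values == "R", 1, 0),
}

# Best of 3 runs for each approach; lower is better.
timings = {name: min(timeit.repeat(fn, number=1, repeat=3))
           for name, fn in candidates.items()}
for name, t in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} {t * 1000:8.1f} ms")
```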