Matrix Multiplication from Scratch in Python¶
Recently, in an effort to better understand deep learning architectures, I've been taking Jeremy Howard's new course, which he so eloquently termed "Impractical Deep Learning".
I really agree with his teaching philosophy: it helps to first see something working in action, and after you have seen it in action it can be extremely beneficial to see how exactly it works. This post focuses on doing matrix multiplication from scratch, first in pure Python and then progressively handing the work off to C-backed implementations that greatly speed up the calculation.
This website does an amazing job of helping visualize what happens when we do matrix multiplication.
%load_ext autoreload
%autoreload 2
%matplotlib inline
Let's start by building a small unit-testing helper that lets us check whether we are getting the same calculations each time.
To try it out, we'll set TEST = 'test' and compare it against the string 'test'.
import operator
def test(a, b, cmp, cname=None):
    if cname is None: cname = cmp.__name__
    assert cmp(a, b), f"{cname}:\n{a}\n{b}"

def test_eq(a, b): test(a, b, operator.eq, '==')
TEST = 'test'
test_eq(TEST,'test')
We receive no errors, so we can conclude that the two values are in fact equal.
Downloading Data¶
The first thing we want to do is download or create some data to play with. For the purposes of matrix multiplication, the MNIST dataset is fine.
#export
from pathlib import Path
from IPython.core.debugger import set_trace
from fastai import datasets
import pickle, gzip, math, torch, matplotlib as mpl
import matplotlib.pyplot as plt
from torch import tensor
import warnings
warnings.filterwarnings('ignore')
MNIST_URL='http://deeplearning.net/data/mnist/mnist.pkl'
path = datasets.download_data(MNIST_URL, ext='.gz'); path
After we download it, let's unzip it and give it a look.
with gzip.open(path, 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
x_train,y_train,x_valid,y_valid # numpy array
n,c = x_train.shape
x_train, x_train.shape, y_train, y_train.shape, y_train.min(), y_train.max()
We can see this is a NumPy array with dtype=float32. Since I am trying to learn more and more PyTorch, I'll map each of the datasets to a torch.Tensor, which is extremely similar but allows calculations to be run on the GPU by moving the data to CUDA.
Use map to convert each dataset to a tensor:
x_train,y_train,x_valid,y_valid = map(tensor, (x_train,y_train,x_valid,y_valid))
n,c = x_train.shape
x_train, x_train.shape, y_train, y_train.shape, y_train.min(), y_train.max()
So as you can see, we still have all of the same information; it is just held in a torch tensor instead of a NumPy array. Let's use our unit-testing function to confirm that we have the dataset size, shape, and min and max y-values we are expecting.
assert x_train.shape[0]==y_train.shape[0]==50000
test_eq(x_train.shape[1],28*28)
test_eq(y_train.min(),0)
test_eq(y_train.max(),9)
Next, let's take a look at the data. Each image is stored as a flat 784-pixel vector, which we can reshape to 28x28 to view it. MNIST is a series of images of the digits 0 through 9, so let's have a look.
mpl.rcParams['image.cmap'] = 'gray'
img = x_train[4]
img.view(28,28).type()
plt.imshow(img.view((28,28)));
So we can clearly see this is a 9, and we can confirm it by looking at the associated y value.
y_train[4]
Initial Python Model¶
Let's build a small model, for which we need to initialize random weights and a bias vector. Dividing the weights by math.sqrt(784) keeps the scale of the outputs roughly comparable to that of the inputs (a simplified Xavier initialization).
weights = torch.randn(784,10)/math.sqrt(784); weights
bias = torch.zeros(10); bias
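Together these are the parameters of a simple linear layer: eventually we want to compute x @ weights + bias for a whole batch of images. As a minimal sketch (preds is just an illustrative name, and @ is PyTorch's built-in matrix-multiply operator, which is exactly what we'll be re-implementing by hand below):
preds = x_train @ weights + bias  # (50000,784) @ (784,10) + (10,) -> (50000,10)
preds.shape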
Matrix Multiplication in Python¶
Now let's do matrix multiplication in pure Python. We can do this by leveraging for loops.
It will take the following steps:
- define a function that takes in two inputs
- unpack the shape of each input into its rows and columns
- assert that the number of columns of the first input equals the number of rows of the second input, since that is the compatibility requirement for matrix multiplication
- create a new result matrix of zeros, using torch.zeros, of size a's rows by b's columns
- loop over a's rows and b's columns, and for each output position accumulate into c the products of the entries of a and b at the corresponding positions
Let's see what it looks like below
def matmul(a,b):
    ar,ac = a.shape # n_rows * n_cols
    br,bc = b.shape # n_rows * n_cols
    assert ac==br # a's columns must equal b's rows
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            for k in range(ac): # or br, they are equal
                c[i,j] += a[i,k] * b[k,j]
    return c
Let's test this on a small subset of the data, 50 rows of the 50,000, and let's time it.
m1 = x_valid[:50]
m2 = weights
m1.shape,m2.shape
%time t1=matmul(m1, m2)
This takes a very long time¶
As you can see, calculating 50 of these rows using Python for loops took us 5.66 seconds. Remember, that was 1/1000 of the dataset. If we multiply roughly 6 seconds by 1000 we get about 6,000 seconds to complete the matrix multiplication in Python, which is over an hour and a half. A pretty long time to wait. So, let's see if we can speed that up.
The way to make Python faster is to.....remove Python. If we hand the loops over to PyTorch we can take advantage of ATen, a tensor library for PyTorch written in C++.
So, let's see how we can do that.
We'll leverage PyTorch here; we could do the same with NumPy, but PyTorch behaves very similarly and has easy access to the GPU.
Enter elementwise operations¶
One thing PyTorch is good at is elementwise operations; here's an example below:
a = tensor([10., 6, -4])
b = tensor([2., 8, 7])
a,b
a*b # elementwise multiplication
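Tacking .sum() onto an elementwise product gives us a dot product, which is exactly what the innermost k loop above was computing one term at a time:
(a*b).sum() # elementwise multiply, then sum: the dot product of a and b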
So let's use elementwise multiplication to get rid of that third for loop over k: for each output cell we multiply all of row i of a by all of column j of b and take the sum.
def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            c[i,j] = (a[i,:] * b[:,j]).sum() # multiply all of column j by all of row i and sum it
    return c
%timeit -n 10 _=matmul(m1, m2)
Now we are at 9.28 ms instead of ~6 seconds, a speedup of roughly 650x.
Now let's check whether the two results are extremely close to one another (i.e. calculating the same thing). We use allclose rather than exact equality because floating-point results can differ slightly depending on the order of operations.
def near(a,b): return torch.allclose(a, b, rtol=1e-3, atol=1e-5)
def test_near(a,b): test(a,b,near)
test_near(t1,matmul(m1, m2))
# we didn't get an error, so we can confirm they were the same (or extremely close)
Can we go faster?¶
Introducing Broadcasting¶
Numpy Documentation on broadcasting
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.
*documentation borrowed from Jeremy Howard
Broadcasting in action:
a
a + 1
How did that work? a is a rank-1 tensor and 1 is a scalar; you shouldn't be able to do linear algebra on those, as they aren't the same size. What happened is that the scalar 1 was broadcast into a rank-1 tensor of size 3 and then added elementwise.
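Broadcasting a scalar works with other elementwise operators too, comparisons for example:
a > 0 # the scalar 0 is broadcast across all three elements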
Let's look at a tensor example!
m = tensor([[1.,2.,3.],[4.,5.,6.],[7.,8.,9.]])
m.shape
c = tensor([10.,20.,30.])
c.shape
m+c
What version of c is being used? We can see it with expand_as:
c.expand_as(m)
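A nice property of broadcasting is that no data is actually copied: the expanded c is just a view over the original three values. As a quick check, using PyTorch's storage() and stride():
t = c.expand_as(m)
t.storage()  # still only the 3 underlying values
t.stride()   # (0, 1): moving down a row advances 0 elements, so each row repeats c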
def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # unsqueeze adds a trailing unit axis so row i, shape (ac,1), broadcasts against b, shape (ac,bc)
        c[i] = (a[i,:].unsqueeze(-1) * b).sum(dim=0)
    return c
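To see what unsqueeze is doing here, we can trace the shapes for a single row (using the m1 and m2 defined earlier):
row = m1[0]                        # shape (784,)
row.unsqueeze(-1).shape            # (784, 1): a trailing unit axis is added
(row.unsqueeze(-1) * m2).shape     # broadcasts against (784, 10) -> (784, 10)
(row.unsqueeze(-1) * m2).sum(dim=0).shape  # sum down the 784 axis -> (10,)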
%timeit -n 10 _=matmul(m1, m2)
Now we are ~3,200x as fast as pure Python.
Can we make it even faster?¶
This is where my understanding starts to thin out, but there is something called einsum (Einstein summation) which can make matrix multiplication even faster.
So, what does it look like? It's actually fairly straightforward: the string 'ik,kj->ij' tells torch.einsum to take two matrices with dimensions ik and kj and produce ij from them, summing over the shared index k; the two matrices in this instance are a and b.
def matmul(a,b): return torch.einsum('ik,kj->ij', a, b)
%timeit -n 10 _=matmul(m1, m2)
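As before, we can use our test_near helper to confirm that einsum computes the same result as the original triple-loop version:
test_near(t1, matmul(m1, m2)) # no error means the results match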
Now we are about 100,000x faster than pure Python.
AND...that's without using the GPU. This is all on the CPU.
Let's test the entire dataset on CPU vs. GPU¶
x_train.shape
CPU¶
%timeit -n 10 _=matmul(x_train, m2)
So it took 10.1 milliseconds to do the matrix multiplication on the full 50k-row training set.
GPU¶
Let's first see what type of GPU I have. I'm using my personal computer, so I don't have anything like a Tesla K80.
torch.cuda.get_device_name(0)
x_train_cuda = x_train.cuda()
m2_cuda = m2.cuda()
%timeit -n 10 _=matmul(x_train_cuda, m2_cuda)
This took ~1.7 milliseconds, so the GPU in this instance was ~6x faster, and in my experience this difference grows as the sample size gets larger.
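One caveat worth noting: CUDA kernels launch asynchronously, so naive timings of GPU code can be misleading. A more careful timing (a sketch, using torch.cuda.synchronize to wait for queued GPU work) would look something like:
torch.cuda.synchronize() # make sure prior GPU work is finished before timing starts
%timeit -n 10 _=matmul(x_train_cuda, m2_cuda); torch.cuda.synchronize()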
So in this blog post we learned how to iterate on Python code and leverage the lower-level C and C++ libraries behind NumPy and PyTorch to speed up computation.¶