Matrix Multiplication from Scratch in Python

Recently, in an effort to better understand deep learning architectures, I've been taking Jeremy Howard's new course, which he so eloquently termed "Impractical Deep Learning".

I really agree with his teaching philosophy: it helps to first see something working in action, and once you have, it can be extremely beneficial to dig into how exactly it works. This post focuses on doing matrix multiplication from scratch, first in pure Python and then progressively handing work off to C-backed implementations that greatly speed up the calculation.

This website does an amazing job of helping visualize what happens when we do matrix multiplication.

(Images: step-by-step visualizations of matrix multiplication from the site above.)
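
In case those images don't render, here's the core idea as a tiny worked example in plain Python: each entry of the result is the dot product of a row of the first matrix with a column of the second.

A = [[1, 2, 3],
     [4, 5, 6]]          # 2x3
B = [[ 7,  8],
     [ 9, 10],
     [11, 12]]           # 3x2

# C[i][j] is the dot product of row i of A with column j of B,
# e.g. C[0][0] = 1*7 + 2*9 + 3*11 = 58
C = [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(2)]
     for i in range(2)]
print(C)  # [[58, 64], [139, 154]]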

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

Let's start by building a simple unit-test helper that lets us check that we are getting the same calculations each time.

To test it out we'll set TEST = 'test' and give it a try.

In [2]:
import operator

def test(a,b,cmp,cname=None):
    # assert that cmp(a,b) holds; on failure, report the comparison name and both values
    if cname is None: cname=cmp.__name__
    assert cmp(a,b),f"{cname}:\n{a}\n{b}"

def test_eq(a,b): test(a,b,operator.eq,'==')
In [3]:
TEST = 'test'
test_eq(TEST,'test')

We receive no errors, so we can conclude that the two values are in fact equal.
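
For illustration, if the values didn't match, the assert would fail and report both values (a hypothetical example, not a cell from the notebook):

test_eq(TEST, 'nope')
# AssertionError: ==:
# test
# nope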

Downloading Data

The first thing we want to do is download some data to play with; for the purposes of matrix multiplication, the MNIST dataset is fine.

In [4]:
#export
from pathlib import Path
from IPython.core.debugger import set_trace
from fastai import datasets
import pickle, gzip, math, torch, matplotlib as mpl
import matplotlib.pyplot as plt
from torch import tensor
import warnings
warnings.filterwarnings('ignore')

MNIST_URL='http://deeplearning.net/data/mnist/mnist.pkl'
In [5]:
path = datasets.download_data(MNIST_URL, ext='.gz'); path
Out[5]:
PosixPath('/home/nick/.fastai/data/mnist.pkl.gz')

After we download it, let's unzip it and give it a look.

In [6]:
with gzip.open(path, 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
In [7]:
x_train,y_train,x_valid,y_valid # numpy array
n,c = x_train.shape
x_train, x_train.shape, y_train, y_train.shape, y_train.min(), y_train.max()
Out[7]:
(array([[0., 0., 0., 0., ..., 0., 0., 0., 0.],
        [0., 0., 0., 0., ..., 0., 0., 0., 0.],
        [0., 0., 0., 0., ..., 0., 0., 0., 0.],
        [0., 0., 0., 0., ..., 0., 0., 0., 0.],
        ...,
        [0., 0., 0., 0., ..., 0., 0., 0., 0.],
        [0., 0., 0., 0., ..., 0., 0., 0., 0.],
        [0., 0., 0., 0., ..., 0., 0., 0., 0.],
        [0., 0., 0., 0., ..., 0., 0., 0., 0.]], dtype=float32),
 (50000, 784),
 array([5, 0, 4, 1, ..., 0, 8, 4, 8]),
 (50000,),
 0,
 9)

We can see this is a NumPy array with dtype=float32. Since I am trying to learn more and more PyTorch, I'll map each of the datasets to a torch tensor, which is extremely similar; the main difference is that a tensor also allows calculations to be run on the GPU by passing it to cuda.

Use map to convert each dataset to a tensor:

In [8]:
x_train,y_train,x_valid,y_valid = map(tensor, (x_train,y_train,x_valid,y_valid))
n,c = x_train.shape
x_train, x_train.shape, y_train, y_train.shape, y_train.min(), y_train.max()
Out[8]:
(tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 torch.Size([50000, 784]),
 tensor([5, 0, 4,  ..., 8, 4, 8]),
 torch.Size([50000]),
 tensor(0),
 tensor(9))

So as you can see we still have all of the same information; it is now just held in a torch tensor instead of a NumPy array. Let's use our unit-testing function to confirm that we have the dataset size, shape, and min and max y-values we are expecting.

In [9]:
assert x_train.shape[0]==y_train.shape[0]==50000
test_eq(x_train.shape[1],28*28)
test_eq(y_train.min(),0)
test_eq(y_train.max(),9)

Next, let's take a look at the data. Each image is stored as a flat 784-pixel vector that can be reshaped to 28x28 for viewing, and MNIST is a series of images of the digits 0 through 9. Let's have a look.

In [10]:
mpl.rcParams['image.cmap'] = 'gray'
img = x_train[4]
plt.imshow(img.view((28,28)));

So we can clearly see this is a 9 and we can confirm this by looking at the associated y value.

In [11]:
y_train[4]
Out[11]:
tensor(9)

Initial Python Model

Let's build a small model, for which we need to initialize random weights and a bias vector.

In [12]:
weights = torch.randn(784,10)/math.sqrt(784); weights
Out[12]:
tensor([[-0.0187,  0.0308, -0.0073,  ..., -0.0297,  0.0221, -0.0167],
        [-0.0416,  0.0027, -0.0576,  ...,  0.0168,  0.0054,  0.0534],
        [ 0.0024,  0.0142, -0.0206,  ...,  0.0536, -0.0267, -0.0699],
        ...,
        [ 0.0115, -0.0352, -0.0629,  ..., -0.0159,  0.0060, -0.0176],
        [-0.0328, -0.0531, -0.0413,  ..., -0.0258, -0.0377,  0.0149],
        [ 0.0118,  0.0311, -0.0458,  ...,  0.0149, -0.0096,  0.0465]])
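
A quick aside: dividing by math.sqrt(784) is a simplified Xavier-style initialization. It keeps the scale of the layer's outputs near 1 when the 784 inputs have unit variance, since each output is then a sum of 784 terms with variance 1/784. A sanity check on synthetic inputs (my sketch, not part of the original notebook):

x = torch.randn(100, 784)   # fake batch with unit-variance features
(x @ weights).std()         # should be close to 1.0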
In [13]:
bias = torch.zeros(10); bias
Out[13]:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Matrix Multiplication in Python

Now let's do matrix multiplication in pure Python. We can do this by leveraging for loops.

It will take the following steps:

  1. Define a function that takes in two inputs.
  2. Unpack the shape of each input into its number of rows and columns.
  3. Assert that the number of columns of the first input equals the number of rows of the second, since, as we saw above, that is what matrix multiplication requires.
  4. Create a result matrix of zeros, using torch.zeros, sized a's rows by b's columns.
  5. Loop over a's rows (i) and b's columns (j), and for each position accumulate into c[i,j] the products a[i,k] * b[k,j] over the range of a's columns (k).

Let's see what it looks like below.

In [14]:
def matmul(a,b):
    ar,ac = a.shape # n_rows * n_cols
    br,bc = b.shape # n_rows * n_cols
    assert ac==br # a's columns must equal b's rows
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            for k in range(ac): # or br
                c[i,j] += a[i,k] * b[k,j]
    return c

Let's test this on a small subset of the data, 50 rows of the 50,000, and time it.

In [15]:
m1 = x_valid[:50]
m2 = weights
In [16]:
m1.shape,m2.shape
Out[16]:
(torch.Size([50, 784]), torch.Size([784, 10]))
In [17]:
%time t1=matmul(m1, m2)
CPU times: user 6.07 s, sys: 0 ns, total: 6.07 s
Wall time: 6.07 s

This takes a very long time.

As you can see, calculating just 50 rows with Python for loops took about 6 seconds, and remember that was only 1/1000 of the dataset. If we multiply 6 seconds by 1,000 we get roughly 6,000 seconds to complete the full matrix multiplication in pure Python, which is over an hour and a half. A pretty long time to wait, so let's see if we can speed that up.

The way to make Python faster is to... remove Python. If we hand the loops over to PyTorch we can take advantage of ATen, a tensor library for PyTorch written in C++.

So, let's see how we can do that.

Let's leverage PyTorch here (we could do the same with NumPy); the two act very similarly, but PyTorch gives us easy access to the GPU.

Enter elementwise operations

One thing PyTorch is very good at is elementwise operations, which run over whole tensors at C speed. Here's an example below:

In [18]:
a = tensor([10., 6, -4])
b = tensor([2., 8, 7])
a,b
Out[18]:
(tensor([10.,  6., -4.]), tensor([2., 8., 7.]))
In [19]:
a*b # elementwise multiplication
Out[19]:
tensor([ 20.,  48., -28.])

So let's use elementwise multiplication to get rid of that innermost for loop over k: for each output cell we multiply all of row i by all of column j and take the sum.

In [20]:
def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            c[i,j] = (a[i,:] * b[:,j]).sum() # elementwise-multiply row i of a by column j of b, then sum
    return c
In [21]:
%timeit -n 10 _=matmul(m1, m2)
9.28 ms ± 476 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now we are at 9.28 ms instead of 6 seconds, a speedup of roughly 650x.

Now let's check to see if they are extremely close to one another (i.e. calculating the same thing).

In [22]:
def near(a,b): return torch.allclose(a, b, rtol=1e-3, atol=1e-5)
def test_near(a,b): test(a,b,near)
In [23]:
test_near(t1,matmul(m1, m2)) 
# we didn't get an error, so we can confirm they were the same (or extremely close)

Can we go faster?

Introducing Broadcasting

Numpy Documentation on broadcasting

The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.

*documentation borrowed from Jeremy Howard

Broadcasting in action

In [24]:
a
Out[24]:
tensor([10.,  6., -4.])
In [25]:
a + 1
Out[25]:
tensor([11.,  7., -3.])

How did that work? a is a rank-1 tensor and 1 is a scalar; strictly speaking you shouldn't be able to add those, since they aren't the same size. What happened is that the scalar 1 was broadcast into a rank-1 tensor of size 3, which was then added elementwise.
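
In other words, it behaves as if the 1 had been written out three times (a quick illustration, reusing a from above):

a + tensor([1., 1., 1.])  # same result as a + 1
# tensor([11.,  7., -3.])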

Let's look at a tensor example!

In [26]:
m = tensor([[1.,2.,3.],[4.,5.,6.],[7.,8.,9.]])
m.shape
Out[26]:
torch.Size([3, 3])
In [27]:
c = tensor([10.,20.,30.])
c.shape
Out[27]:
torch.Size([3])
In [28]:
m+c
Out[28]:
tensor([[11., 22., 33.],
        [14., 25., 36.],
        [17., 28., 39.]])

Which version of c is being used here? We can see it by expanding c to m's shape:

In [29]:
c.expand_as(m)
Out[29]:
tensor([[10., 20., 30.],
        [10., 20., 30.],
        [10., 20., 30.]])
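
The version below relies on unsqueeze, which inserts a new axis of size 1 so that broadcasting can replace the two inner loops. A quick illustration of the shapes, reusing m and c from above:

c.unsqueeze(-1)   # shape [3, 1]: c as a column
# tensor([[10.],
#         [20.],
#         [30.]])

c.unsqueeze(-1) * m   # [3, 1] broadcasts against [3, 3]
# tensor([[ 10.,  20.,  30.],
#         [ 80., 100., 120.],
#         [210., 240., 270.]])

(c.unsqueeze(-1) * m).sum(dim=0)   # summing down dim 0 gives c @ m
# tensor([300., 360., 420.])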
In [30]:
def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        c[i]   = (a[i,:].unsqueeze(-1) * b).sum(dim=0) # unsqueeze turns row i into an (ac,1) column that broadcasts across b; summing over dim 0 gives row i of the result
    return c
In [31]:
%timeit -n 10 _=matmul(m1, m2)
1.9 ms ± 99.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now we are ~3,200x as fast as pure Python

Can we make it even faster?

This is where my understanding starts to get fuzzy, but there is something called einsum (Einstein summation notation) which can make matrix multiplication even faster.

So, what does it look like? It's actually fairly straightforward: the string 'ik,kj->ij' tells torch.einsum to take two matrices of shapes ik and kj and produce ij from them, where the two matrices are a and b in this instance.

In [38]:
def matmul(a,b): return torch.einsum('ik,kj->ij', a, b)
In [39]:
%timeit -n 10 _=matmul(m1, m2)
66.5 µs ± 7.58 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now we are roughly 90,000x faster than pure Python

AND...that's without using the GPU. This is all on the CPU.
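
For completeness, PyTorch also has a built-in matrix multiply (the @ operator, i.e. torch.matmul) that dispatches to heavily optimized BLAS routines. A quick sanity check that it agrees with our einsum version (my addition, not from the original notebook):

test_near(matmul(m1, m2), m1 @ m2)  # no error: einsum and the built-in agree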

Let's test the entire dataset on CPU vs. GPU

In [34]:
x_train.shape
Out[34]:
torch.Size([50000, 784])

CPU

In [35]:
%timeit -n 10 _=matmul(x_train, m2)
10.1 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So it took 10.1 milliseconds to do the matrix multiplication on the full 50,000-row dataset.

GPU

Let's first see what kind of GPU I have. I'm using my personal computer, so I don't have anything like a Tesla K80.

In [40]:
torch.cuda.get_device_name(0)
Out[40]:
'GeForce GTX 1070 with Max-Q Design'
In [36]:
x_train_cuda = x_train.cuda()
m2_cuda = m2.cuda()
In [37]:
%timeit -n 10 _=matmul(x_train_cuda, m2_cuda)
The slowest run took 333.79 times longer than the fastest. This could mean that an intermediate result is being cached.
1.69 ms ± 4.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This took ~1.7 milliseconds, so the GPU in this instance was ~6x faster, and from my experience this gap grows as the sample size gets larger.

So in this blog post we learned how to iterate on Python code, progressively handing computation off to the lower-level C and C++ libraries behind NumPy and PyTorch to speed things up.
