Python is slow, so why are we using it for Data Science?

Author: Mirosław “miro662” Błażej, Python developer and Master’s student in IT at AGH UST. LinkedIn GitHub

Nowadays, Python is the most popular programming language for data-science-related tasks, and many of those tasks are computation-heavy. One of them is matrix multiplication – a basic linear algebra operation used by almost all ML algorithms, for example to compute the activations of a neural network layer from the activations of the previous one.
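
To make this concrete, here is a tiny sketch (with made-up numbers, not taken from any real network) of how one dense layer computes its activations from the previous layer’s; with a whole batch of inputs, this becomes a full matrix-matrix multiplication:

# one dense layer: new activations = weights * previous activations + bias
W = [[0.2, -0.5],
     [0.7, 0.1]]        # 2x2 weight matrix (made-up values)
a_prev = [1.0, 2.0]     # activations of the previous layer
b = [0.1, -0.2]         # biases

a_new = [
    sum(W[i][j] * a_prev[j] for j in range(len(a_prev))) + b[i]
    for i in range(len(W))
]
print(a_new)  # roughly [-0.7, 0.7]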

Manual implementation in Python

Let’s write our own implementation of matrix multiplication in Python. We will implement two functions: random_matrix, which generates a matrix of a given size filled with random elements, and multiply_matrices, which multiplies two such matrices:

from random import random


def random_matrix(rows, cols):
    """Generates a random matrix represented as a list of lists"""
    return [[random() for _ in range(cols)] for _ in range(rows)]


def multiply_matrices(A, B):
    """Multiplies two matrices represented as lists of lists"""
    A_rows, A_cols = len(A), len(A[0])
    B_rows, B_cols = len(B), len(B[0])
    assert A_cols == B_rows
    # the result has as many rows as A and as many columns as B
    C = [[0.0 for _ in range(B_cols)] for _ in range(A_rows)]
    for i in range(A_rows):
        for j in range(A_cols):
            for k in range(B_cols):
                C[i][k] += A[i][j] * B[j][k]
    return C
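
Before benchmarking, a quick sanity check on a tiny, hand-picked example (not from the original benchmark) confirms that the function computes what we expect:

A_small = [[1.0, 2.0], [3.0, 4.0]]
B_small = [[5.0, 6.0], [7.0, 8.0]]
# expected result: [[19.0, 22.0], [43.0, 50.0]]
print(multiply_matrices(A_small, B_small))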

Now, let’s generate two random matrices and multiply them using our function. We will benchmark it using the %timeit magic (available in IPython and Jupyter):

SIZE = 100
A = random_matrix(SIZE, SIZE)
B = random_matrix(SIZE, SIZE)
%timeit multiply_matrices(A, B)

On my (pretty decent) machine, this operation took around 136 ms – a lot of time for such small matrices. If this had been the only way to do it, Python would have been rendered useless for data science. Thankfully, there is a much better – and much simpler – way.

Using NumPy

One of the greatest strengths of Python is its wide collection of packages – collections of modules that we can use in our projects. One of them is NumPy (https://numpy.org/), an essential package that provides matrices and many operations on them. Let’s rewrite our matrix-multiplication code using NumPy:

import numpy as np
A = np.random.random((SIZE, SIZE))
B = np.random.random((SIZE, SIZE))
%timeit A @ B
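
As a side note, we can check that both implementations agree (a minimal sketch, assuming multiply_matrices from above is still in scope; it expects lists of lists, so we convert the NumPy arrays first):

C_manual = multiply_matrices(A.tolist(), B.tolist())
C_numpy = A @ B
print(np.allclose(C_manual, C_numpy))  # True, up to floating-point rounding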

Not only is this code much faster (on my machine, ~136 ms for the manual implementation vs ~59.3 µs for NumPy – more than a 2000× speedup), it is also much more readable.

But why is code that uses NumPy so fast? There are two main reasons:

  • Our computation-heavy multiply_matrices function is written in Python. This is not the case for NumPy. It uses highly optimized routines written in much faster languages like C or Fortran to perform its matrix operations.
  • Using Python lists is not the best idea. They are heterogeneous data structures (they can store elements of different types) that support operations like inserting new elements and deleting existing ones. These features come with a performance cost and are completely unnecessary for our operations on matrices. NumPy provides a new data structure – ndarray – optimized for storing and operating on n-dimensional arrays such as vectors, matrices, or other tensors; see the sketch below.
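
We can see both points from a Python shell (a minimal sketch; np.show_config() prints NumPy’s build information, including which BLAS/LAPACK implementation it delegates to):

import numpy as np

arr = np.random.random((3, 3))
print(arr.dtype)                   # float64 – every element has the same fixed-size type
print(arr.shape)                   # (3, 3)
print(arr.flags['C_CONTIGUOUS'])   # True – elements live in one contiguous block of memory
print(arr.nbytes)                  # 72 bytes: 9 elements * 8 bytes each

np.show_config()                   # the compiled BLAS/LAPACK backend NumPy uses for matrix math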

So, why are we using Python?

Python is a simple language with great tooling, an enormous community, and a lot of packages available. But it is too slow for running the computation-heavy algorithms that are crucial for any data science task. However, there is a simple way around this limitation: we can use libraries like NumPy, which provide a nice Python API on top of code written in faster languages.

We get the performance of those languages without their complexity, while still benefiting from the Python ecosystem of packages, tooling, and community. When using Python for data science, we not only shouldn’t reinvent the wheel – we mustn’t.
