Speed up Python with Numba

In the last article, I discussed which Python design decisions limit its performance. In this article, I will demonstrate Numba, a package that works around those limitations to deliver high performance in specific use cases, particularly functions that perform computations on NumPy ndarrays.

“Numba is a just-in-time compiler for Python that works best on code that uses NumPy arrays, functions, and loops.”

Numba Docs


How exactly does a Just-In-Time (JIT) compiler work? Previously, I mentioned that the virtual machines of statically typed languages like Java and C# optimize the translation of their intermediate form (similar to Python bytecode) into machine code at runtime to boost performance. Selected parts of the code are optimized and then compiled to machine code that the CPU runs directly. The first compilation, and any later recompilation (when the JIT decides to introduce optimizations based on runtime analysis), adds time overhead. After compilation, the compiled form is cached and executes at native machine code speed.

Why does CPython not have a JIT compiler?

Python is a general-purpose language; you can use it for scripts or at an interactive Python prompt. However, a JIT compiler needs some warm-up time, and the first execution can be significantly slower because of the compilation stage, which is undesirable for short-lived programs. Additionally, as I explained in the previous article, Python is very dynamic, which prevents a JIT compiler from deducing most optimizations.

How to use Numba?

Using Numba with existing code is very easy – all we need to do is decorate a selected function with one of Numba's decorators.

“When a call is made to a Numba-decorated function, it is compiled to machine code ‘just-in-time’ for execution, and all or part of your code can subsequently run at native machine code speed!”

Numba Docs

But it has limitations…

Numba provides its JIT compiler for selected Python code that is essential for the application’s performance, which, of course, comes at a cost. Numba supports only a subset of Python and NumPy features and works best with simple types such as numbers and arrays. When Numba tries to compile your code, it first tries to work out the types of all the variables in use, so that it can generate a type-specific implementation of your code that can be compiled down to machine code. A function that can return an int or a string depending on an if statement? Forget it.

Hands-on Numba

@jit or @njit?

All we need to do is to decorate a function with one of the Numba decorators. There are two essential decorators – @jit and @njit.

What’s the difference? @njit is shorthand for calling @jit with the nopython option set to True, that is, @jit(nopython=True), which instructs Numba to operate in nopython mode only.

This mode tries to compile the decorated function so that it will run entirely without the involvement of the Python interpreter (this is what we want). If it is impossible, an exception will be raised to notify us that we need to change the code to be Numba compatible. 

When using @jit, Numba also tries to compile the function. If that fails, however, it silently gives up and runs the code via the interpreter, with the added cost of Numba's internal overhead. This fallback describes older releases: since Numba 0.59, @jit itself defaults to nopython mode. To make sure your code is optimized regardless of the version, please use @njit.

from numba import jit
import numpy as np

x = np.arange(100).reshape(10, 10)

@jit(nopython=True) # Set "nopython" mode for best performance, equivalent to @njit
def go_fast(a): # Function is compiled to machine code when called the first time
    trace = 0.0
    for i in range(a.shape[0]):   # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace              # Numba likes NumPy broadcasting


The first call to a Numba-decorated function

Each time you call a Numba-decorated function with new types of arguments, the compilation process takes place. 

from numba import jit

@jit
def f(x, y):
    # A somewhat trivial example
    return x + y

>>> f(1, 2)
3
>>> f(1j, 2)
(2+1j)

Above, we called the same function with different argument types. Numba treats the two calls separately and compiles a distinct specialization for each combination of argument types.

The first execution of a Numba decorated function with specific types can be slow due to the compilation process. It is crucial to bear that in mind. Whenever we want to measure the performance of Numba, do remember to run the benchmark function with the specific argument types before setting the timer. 

Is Pandas allowed?

As I mentioned, Numba has its limitations – it works well with code using NumPy ndarrays and Python loops, but it can’t handle, for example, Pandas data frames. When Numba encounters Pandas, it can’t do much, so we must eliminate Pandas utilities from the compiled code. Otherwise, we will get an error when using @njit, or Numba won’t optimize the code at all when it runs in object mode. You can find the complete list of supported features in the Numba documentation.

from numba import jit
import pandas as pd

x = {'a': [1, 2, 3], 'b': [20, 30, 40]}

@jit(forceobj=True)  # object mode: the code runs, but without any speed-up
def use_pandas(a): # Function will not benefit from Numba jit
    df = pd.DataFrame.from_dict(a) # Numba doesn't know about pd.DataFrame
    df += 1                        # Numba doesn't understand what this is
    return df.cov()                # or this!


In such a situation, we can convert the data from Pandas to NumPy and apply a Numba decorator to a new function that contains only the lines that pose a performance bottleneck, leaving the rest of the code outside. Numba is meant for array-oriented and math-heavy Python code. Before decorating a function, make sure that all the features it uses are supported.

Numba-compatible classes with @jitclass

User-defined types are also not supported as is. We can decorate a class with the @jitclass decorator, which causes all its methods to be compiled by Numba. In this case, we need to specify Numba types for all fields used in the class. The complete list of Numba types is available in the Numba documentation. Be careful, as Numba does not support all class features. Using @jitclass should be avoided when possible.

import numpy as np
from numba import int32, float32    # import the types
from numba.experimental import jitclass

spec = [
    ('value', int32),               # a simple scalar field
    ('array', float32[:]),          # an array field
]

@jitclass(spec)
class Bag(object):
    def __init__(self, value):
        self.value = value
        self.array = np.zeros(value, dtype=np.float32)

    @property
    def size(self):
        return self.array.size

    def increment(self, val):
        for i in range(self.size):
            self.array[i] += val
        return self.array

    @staticmethod
    def add(x, y):
        return x + y

n = 21
mybag = Bag(n)

Can Numba bypass the GIL?

The GIL (Global Interpreter Lock) is a lock (mutex) set on the interpreter, preventing multiple threads from executing Python bytecode simultaneously. However, running Numba-compiled functions does not involve the Python interpreter, so the GIL can be released when entering a Numba-compiled function. As a result, such functions can run concurrently with other Python code. Note that this is possible in nopython mode only, as the functions must be compiled. To release the GIL when executing Numba-decorated functions, use @jit(nogil=True).

Automatic parallelization with @njit(parallel=True)

Setting the parallel option to True in nopython mode will cause Numba to parallelize operations such as NumPy array operations and NumPy reduction functions (all supported operations are listed in the Numba documentation).

Another feature granted by this option is the possibility to parallelize loops by using the explicit parallel prange loops:

“One can use Numba’s prange instead of range to specify that a loop can be parallelized. The user is required to make sure that the loop does not have cross iteration dependencies except for supported reductions.”

Numba Docs

from numba import njit, prange

@njit(parallel=True)
def prange_test(A):
    s = 0
    # Without "parallel=True" in the jit-decorator
    # the prange statement is equivalent to range
    for i in prange(A.shape[0]):
        s += A[i]
    return s

As mentioned above, the iterations cannot be dependent on each other.

Remember that parallel programming poses additional risks, such as race conditions when multiple threads write simultaneously to the elements specified by a slice or index.

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def prange_wrong_result(x):
    n = x.shape[0]
    y = np.zeros(4)
    for i in prange(n):
        # accumulating into the same element of `y` from different
        # parallel iterations of the loop results in a race condition
        y[:] += x[i]

    return y

Caching compilation results

Numba compiles a function only once per combination of argument types. Subsequent calls are faster because recompilation is usually unnecessary. Nevertheless, all compiled code is discarded when the Python program exits. To prevent that, you can set the cache option to True, as in @jit(cache=True), to store compiled functions in a file-based cache.


Numba is a powerful tool that speeds up performance-sensitive Python code without resorting to other programming languages like C. All we need to do is decorate a given function with one of the Numba decorators explained in this article. However, the code to be optimized must meet the requirements listed in the Numba documentation. When measuring Numba performance, please remember that the first execution of a Numba-decorated function is slower due to the compilation process.

What’s next?

In the next article, I will introduce another way to speed up Python: the Cython compiler.


About the author

Paweł Golik

.Net Developer | MSc student in Data Science | General topics of interest: Machine Learning & Software Engineering 
