Super fast Python: Numba

Published on: Dec 25, 2022

Super fast Python (Part-5): Numba

This is the fifth and last post in the series on Python performance and optimization. The series covers built-in libraries, low-level code conversion, and alternative Python implementations that speed up Python. The other posts in this series are:

(Part-1): Why Python is slow?
(Part-2): Good practices to write fast Python code
(Part-3): Multi-processing in Python
(Part-4): Use Cython to get speed as fast as C
(Part-5): Use Numba to speed up Python Functions (this post)

In the last post, on using Cython to speed up Python code, we discussed writing Python code in a C style, compiling it separately into an object file, and importing the generated file directly into Python. But not everyone is comfortable writing C-style code, and some may not know C at all. In such cases, Numba is an alternative way to get fast code. Numba translates Python code to machine code that, when optimized correctly, executes almost as fast as C/C++.

What is Numba?

Numba is a JIT (just-in-time) compiler that takes Python bytecode and compiles it directly into machine code using the LLVM compiler infrastructure. A JIT compiler compiles code into machine code at run time and caches the generated machine code so that later calls execute faster. Numba likewise takes a Python function and generates machine code, which is then called directly instead of being interpreted as bytecode on every call. Numba works best for numerical calculations, array and Numpy operations, and loops. With Numba, we can write vectorized operations and parallelized loops that run on either CPU or GPU.

Decorators are one of the ways to invoke JIT compilation. Numba provides different decorators to compile code in different modes; the most common ones are:

@jit - invoke JIT compilation for the provided function
@njit - the @jit decorator with strict no-python mode enabled
@vectorize - convert normal functions into Numpy-like ufuncs
@guvectorize - create generalized ufuncs for higher dimensional arrays
@stencil - make a function behave as a kernel for a stencil-like operation

Some of these decorators also accept options that configure the JIT compilation behavior:

nopython
parallel
cache
nogil
fastmath
boundscheck
error_model
target (for example, target='cuda' with @vectorize)

Numba @jit

The @jit decorator takes the Python function that needs to be compiled to machine code. The first time the decorated function is called, Numba compiles it, caches the machine code in memory, and uses that machine code directly for execution. Because compilation takes time, the first call incurs some latency. Subsequent calls in the same process with the same argument types reuse the cached machine code instead of re-compiling.

Let's consider the following simple function solve_expression as an example. solve_expression takes some arguments, checks some conditions, and calculates the final polynomial expression.

def solve_expression(x, a, b, c, d):
    A, B, C, D = a, b, c, d
    if a > 10.1:
        A = 2 * a
    if 2.6 <= b < 8.3:
        B = b - 1/b
    if c > 4.5:
        C = 4
    if d < 9.0:
        D = d ** 2

    return A*(x**3) + B*(x**2) + C*(x) + D

Now, use the @jit decorator to compile this function into machine code:

from numba import jit

@jit
def solve_expression(x, a, b, c, d):
    A, B, C, D = a, b, c, d
    if a > 10.1:
        A = 2 * a
    if 2.6 <= b < 8.3:
        B = b - 1/b
    if c > 4.5:
        C = 4
    if d < 9.0:
        D = d ** 2

    return A*(x**3) + B*(x**2) + C*(x) + D

In the code snippet, we have imported the jit function decorator and decorated solve_expression with it.

x, a, b, c, d = 2, 13, 1.2, 4, 7

res = solve_expression(x, a, b, c, d)

We can inspect the type-annotated Intermediate Representation (IR) of the compiled function using solve_expression.inspect_types().

Because @jit defers compilation until the first call to the function, the call solve_expression(x, a, b, c, d) above takes some time to execute. Later calls with the same argument types will be fast.

Now compare the speeds of the normal Python function and the JIT-decorated function. The original, uncompiled Python function is available as solve_expression.py_func.

%%timeit
res = solve_expression.py_func(x, a, b, c, d)

'''Output
912 ns ± 3.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
'''

The normal Python function takes approx. 900 nanoseconds.

%%timeit
res = solve_expression(x, a, b, c, d)

'''Output
277 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
'''

And the Numba version takes approx. 280 nanoseconds, which makes it roughly 3x faster than the pure Python function.

Compilation options

We can pass several options to the @jit decorator to configure the compilation behavior:

nopython: if True, enables no-python mode, so the compiled code runs without involving the Python interpreter
nogil: if True, releases the GIL inside the compiled function (only in no-python mode); useful for concurrent execution with threads
cache: if True, stores the compiled code in an on-disk cache so that later runs of the program load it instead of re-compiling
parallel: if True, enables automatic parallelization of supported operations (requires no-python mode)
fastmath: if True, allows faster but less strict floating-point operations

Apart from these, there are some more options available. You can check all options at @jit reference.

Lazy and Eager compilation

Note that, because Python function arguments can be of any type, Numba compiles the function for the specific argument types it is called with. If the function is later called with a new combination of argument types, Numba compiles another version for those types.

Lazy compilation

The solve_expression() function above accepts arguments of any type. For a function like this, Numba decides how to optimize based on the argument types at the first call. In this way, Numba infers the argument types and compiles a separate specialization of the same function for each combination of types.

For example, suppose we change the argument types like this:

# previous values, x, a, b, c, d = 2, 13, 1.2, 4, 7
# new values
x, a, b, c, d = 3.9, 12, 5, 9.1, 14

res = solve_expression(x, a, b, c, d)

The previously compiled version expects x=int, a=int, b=float, c=int, d=int, but in the latest call x and c are floats and b is an int. So Numba compiles a new version of the function for the new argument types and keeps the earlier one. This mode of compilation is called lazy compilation because Numba compiles for a specific set of argument types only when it encounters them.

Eager compilation

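In eager compilation, we specify the function signatures (argument types and return type) up front, and Numba compiles a version for each signature when the function is decorated rather than at the first call. To support several type combinations, pass the signatures as a list, with the lowest precision first. Numba type names follow the Numpy type conventions with different precision levels.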

@jit(['int32(int32, int32)',
      'i4(int32, int64)',
      '(f4, f8)',
      'f8(f4, f4)'])
def func1(a, b):
    return a + b

In the function above, we passed the signatures as a list of strings. In each signature, the return type comes first, followed by the argument types in parentheses. The return type may be omitted, as in '(f4, f8)', in which case Numba infers it automatically. With explicit signatures, Numba does not compile any further specializations: calling the function with argument types that match none of the listed signatures raises an error.

@njit or @jit(nopython=True)

Numba @jit operates in two modes: nopython and object. In nopython mode, the compiled code runs without the Python interpreter and is much faster. In object mode, Numba handles all values as Python objects and goes through the interpreter's C API, so performance is about the same as calling the function without @jit.

Normally with the @jit decorator, Numba first tries to compile in nopython mode. If any part of the code cannot be compiled because it uses features that Numba does not support, such as a heterogeneous dictionary or some string methods, Numba falls back to object mode. (This fallback applies to Numba versions before 0.59; from 0.59 onward, @jit compiles in nopython mode by default.) Object mode can still improve performance when loops are involved. If there is nothing to optimize, @jit in object mode runs slower than the normal Python version because of the extra call overhead that Numba adds.

We can enforce strict nopython mode by passing nopython=True to @jit. Numba also provides a shorthand for this: the @njit decorator is an alias for @jit(nopython=True).

from numba import jit, njit

@jit(nopython=True)
def f(a, b):
    return a + b

# or

@njit
def f(a, b):
    return a + b

In nopython mode, if any code requires object mode, Numba raises an error instead of falling back. The goal should be to write functions that compile in strict no-python mode.

Numba and Numpy

Numba works best with Numpy arrays and supports a subset of Numpy features in no-python mode.

import numpy as np
from numba import njit

@njit(['int64[:](int64[:, :], int64[:, :])'])
def f(a, b):
    c = np.empty(a.shape[0], dtype='int64')

    for i in range(a.shape[0]):
        c[i] = a[i].sum() * b[i].sum()

    return c

The function takes two 2D Numpy int64 arrays as arguments and returns a 1D int64 array. For each row, it multiplies the sum of that row of a by the sum of the same row of b.

x, y, n = 1000, 1000, 1_000_000
l, h = 0, 100
a = np.random.randint(l, h, n).reshape(x, y)
b = np.random.randint(l, h, n).reshape(x, y)

If we compare the plain Numpy calculation with the Numba compiled function,

%%timeit
# plain Numpy calculation
res = a.sum(axis=1) * b.sum(axis=1)

'''Output:
1.4 ms ± 7.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
'''

%%timeit
# Numba compiled function
res = f(a, b)

'''Output:
1.19 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
'''

The Numba compiled version takes 1.19 ms, which is slightly faster than the plain Numpy calculation at 1.4 ms.

@vectorize decorator

One of the main reasons for Numpy's speed is that most Numpy functions are ufuncs (universal functions), which are vectorized and implemented in Numpy's compiled layer. Sometimes no existing Numpy function does what we need, and we write a workaround by combining several Numpy functions or a custom loop. Such an operation is not optimized as a whole, so we may lose the speed that Numpy provides. To solve this problem, we can turn our own function into a ufunc and get vectorization and speed back.

With @vectorize, Numba lets us create custom vectorized universal functions. We write a function that takes scalar values and returns a scalar value; when it is applied to Numpy arrays, the array elements are passed in one at a time and the surrounding loop is generated automatically.

from numba import vectorize

@vectorize(['float64(int64, int64)'])
def v_expr(a, b):
    return (a**2) + (a*3) + (b/2) + 10

Here, we created a simple universal function that takes two scalar values and returns a calculated expression.

n = 1_000_000
l, h = 0, 100
a = np.random.randint(l, h, n)
b = np.random.randint(l, h, n)

The speed comparison of the Numpy expression and the vectorized ufunc gives:

%%timeit
# general way of calculating Numpy expression
res = (a**2) + (a*3) + (b/2) + 10

'''Output:
8.66 ms ± 71.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
'''

%%timeit
# calculate expression with ufunc
res = v_expr(a, b)

'''Output:
2.17 ms ± 96.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
'''

The optimized function v_expr takes 2.17 ms, which is about 4x faster than the normal calculation with Numpy.

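In this post, we discussed the basic Numba decorators and compared Numba compiled functions with normal Python functions. Other features of Numba that were listed earlier but not covered in detail, such as @guvectorize, @stencil, and GPU targets, are worth exploring.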

Remember that not every function can be passed to Numba, as there are some limitations:

Numba supports only a subset of Python. Features such as general Python classes and multi-dimensional dictionaries are not fully supported yet.
Object mode compilation can be slower than normal Python in some cases, so it is better to measure the speed of both the normal and the compiled code.
External libraries like Pandas are not supported.
As Numba re-implements some Numpy APIs, their behavior may differ from Numpy in some cases.