Super fast Python: Why Python is Slow?

Published on: Nov 7, 2022

Python is an interpreted, high-level, dynamically typed programming language. Developers prefer Python because it is easy to learn, flexible, quick to develop in, readable, and easy to write, among many other reasons. From relative obscurity in the 2000s to the most used programming language today, Python has come a long way with the help of a strong community. But there has always been debate over choosing Python for computation-intensive tasks like Machine Learning, as Python is generally slow compared to widely used languages like C/C++ and Java. Even with computation-efficient libraries and packages like Numpy, plain Python remains slow for general usage.

Python's core development team has been working to make Python as fast as C/C++. They have set a goal of making each Python release significantly faster than the one before. The current Python 3.11 is 10-60 percent faster than Python 3.10. Let's hope we reach a state where Python is at least at the same speed level as Java, if not C++.

One common practice to tackle speed issues in production is to upgrade the hardware or scale up the cloud infrastructure, which increases the project budget. While Python's core development team works on improving Python itself, it is up to us to leverage the core libraries and good coding practices to make our code faster on the developer's end, and eventually use fewer resources and less budget.

This is the first post in the series on Python performance and optimization. The series covers the use of built-in libraries, low-level code conversions, and alternative Python implementations to speed up Python.


Why is Python slow?

Before looking into how we can optimize Python code, we should first look at why Python is slow.

Some of the reasons for Python's slowness lie in the design of its core details: how code is executed, how types are tracked, and how memory is managed.

Python implementation by interpretation

Python, the language, is a set of syntax rules for writing programs. Executing the written code is the job of a programming language implementation, which comes in two broad categories: compilation and interpretation. CPython is a Python implementation written in C that takes the interpretation approach to execute Python code. CPython is the default runtime and reference implementation of Python, and there are other runtimes like PyPy, Cython, Jython, etc., that take different execution approaches for different use cases. CPython is both an interpreter (as it is widely described) and a compiler: it compiles Python code to Python bytecode and then interprets that bytecode on the specific platform using the Python Virtual Machine (PVM).
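
We can see both steps from within Python itself. Here is a minimal sketch using the built-in compile() and exec() functions (the source string is just an illustration):

# CPython's two steps in miniature: compile source to a bytecode
# code object, then let the interpreter execute that bytecode.
source = "x = 2 + 3\nprint(x)"
code_obj = compile(source, "<example>", "exec")  # compilation step
exec(code_obj)                                   # interpretation step, prints 5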

CPython uses a Global Interpreter Lock (GIL) on each CPython interpreter process. This means that, within a single process, only a single thread executes Python bytecode at a time. We will see later why this behavior is both good and bad.

As interpretation is usually slower than compilation, one might expect Python to be slow, but only, say, 2x-3x slower than C/C++. That is not the case: Java, which also interprets bytecode, is still many times faster than Python, which raises the question of what other factors contribute to Python's slowness.

By changing the Python runtime from interpretation to compilation, with Cython and embedded C code, or to just-in-time (JIT) compilation with PyPy, we can improve the speed of Python many times over. We will talk about Cython, and how to use it with Python, in a later article.

With dynamic typing comes the problem

One of the beautiful features of Python is dynamic typing, and many people like it that way. But with dynamic typing comes an additional burden on the interpreter: it has to keep track of the type of every variable, which leaves less room for optimization. As Java is statically typed, its compiler can optimize the bytecode it generates, and the runtime can execute it faster than Python can.

a = 1          # a as int
b = a * 2      # b as int

a = 'python '  # a as string
# c as a string: the result of an
# operation between a string and an int
c = a * b

In the above snippet, the interpreter has to keep track of the type of a from top to bottom. If Python did not support dynamic typing, we would have to declare a type for every variable, and an operation like c = a * b (a string multiplied by an int) would not be possible. So, along with this flexibility from Python, there is some overhead in the interpreter too.
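
We can glimpse this overhead with the standard library's dis module. In this small sketch (the function name mul is just for illustration), the multiplication compiles to one generic instruction (BINARY_MULTIPLY, or BINARY_OP on Python 3.11+) that must resolve the operand types only when it runs:

import dis

def mul(a, b):
    return a * b

# The same bytecode is produced no matter what types a and b will hold:
# the generic multiply instruction checks the types when it executes.
dis.dis(mul)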

That is an object, this too

In Python, everything is an object, even functions. If you remember how objects are referenced in C++ (yes, the pointers), apply the same concept in Python to everything, including the built-in primitive (not exactly primitive) types, which are objects too and are given special treatment.

Assigning memory to objects works by allocating actual memory to hold the object and giving the variable a reference that points to the real object in memory. Every time a variable changes its value, instead of changing the value in memory, a new object is created and the variable is pointed at the location of the object holding the new value. As CPython is implemented in C, objects created in Python are PyObjects, the C struct that underlies all Python objects.

>>> a = 10
>>> print(id(a))
9801536

>>> a = 12
>>> print(id(a))
9801600

We can see that the address of a changes every time we change/assign its value, because of how Python manages variables and their values. In C/C++, a variable has a fixed address of its own rather than an address per value, and when the value changes, it simply overwrites the previous value at that address.

Python has to create new objects, keep their references, delete unused objects, and repeat the cycle. This continuous cycle of object creation and deletion whenever variable values change makes the runtime slow.

A major overhead that needs to be addressed is Python's way of handling collections like lists. In C, we create arrays with fixed memory: the address of the array is fixed, and memory for all the elements is allocated contiguously starting from that fixed point. But in Python, an object is created for each value anywhere in memory (not contiguous), and the list holds only references to these objects. Working with these arbitrary memory locations makes things complex and eventually takes more time.

>>> l = [1, 2, 3, 4]
>>> for i in l:
...     print(id(i))

OUTPUT:
-------

9801248
9801280
9801312
9801344

In the above snippet, we can observe that the addresses of the elements in the list are neither the same nor contiguous.

One of the reasons why Numpy is faster is that it allocates a fixed, contiguous block of memory for the array's elements. The snippet below prints the memory address reported for each element in the array, and they are the same, meaning Numpy creates the array in contiguous memory that is easy to operate on.

>>> import numpy as np
>>> a = np.array([1, 2, 3, 4])
>>> for i in a:
...     print(i.__array_interface__['data'][0])

OUTPUT:
-------

37936064
37936064
37936064
37936064

GIL, good and bad

In C, there is no built-in support for garbage collection; one has to manually de-allocate/free memory. Unlike C, Python does garbage collection using a mechanism called reference counting. For every PyObject, Python keeps a count of how many references point to the current object. When no references point to an object anymore, Python frees the memory by deleting the unused object.
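
We can watch reference counting at work with sys.getrefcount from the standard library. A minimal sketch (note that getrefcount itself adds one temporary reference while it inspects the object):

import sys

a = []   # one reference to the list: a
b = a    # a second reference: b

# Prints 3: a, b, plus the temporary reference created by
# passing the object into getrefcount().
print(sys.getrefcount(a))

del b
# Prints 2: only a and the temporary reference remain.
print(sys.getrefcount(a))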

Earlier we mentioned that the GIL locks the current process so that only a single thread runs the Python interpreter at a time. This does not mean we cannot use more threads, but multi-threading brings less benefit and takes more effort than in other languages, because the threads still take turns holding the GIL to execute bytecode. This limits multi-threading in Python to situations where threads wait for external work to complete, like I/O operations, network requests, etc.

When multiple threads access the same resource, it is hard to keep the reference counts of objects consistent. The GIL is required in CPython to ensure thread safety, as it allows only a single thread to run at a time. As Python is a general-purpose language, in most cases there is no need for multi-threading, and the GIL is not a problem.

But when there is a need to optimize CPU-intensive tasks, one must use multi-processing to utilize all the cores and reduce the overall runtime. We will talk about different use cases for multi-threading and multi-processing in a later part of this series.
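
As a rough illustration of the difference, here is a hedged sketch (the function count_down and the task sizes are made up for this example, and exact timings will vary by machine):

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def count_down(n):
    # A purely CPU-bound loop: no I/O, so threads gain nothing here.
    while n > 0:
        n -= 1

def timed(executor_cls, n=5_000_000, tasks=4):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(count_down, [n] * tasks))
    return time.perf_counter() - start

if __name__ == "__main__":
    # Threads take roughly as long as running the tasks one by one,
    # because the GIL lets only one thread execute bytecode at a time.
    print("threads:  ", timed(ThreadPoolExecutor))
    # Each process gets its own interpreter (and its own GIL), so
    # CPU-bound work can actually run on all cores in parallel.
    print("processes:", timed(ProcessPoolExecutor))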


Now we have learned that a lot of things contribute to the slowness of Python compared to other languages. We cannot change the design of Python, but we can change how we write code: by using good libraries, better code patterns, and alternative implementations.

In the upcoming posts, we will discuss how to use both internal and external libraries like multiprocessing, Cython, and Numba to make Python blazingly fast.