Published on: Nov 9, 2022
Super fast Python (Part-2): Good practices
In the earlier post on why Python is slow?, we discussed how Python's slowness stems from the internal design of some essential components, such as the GIL, dynamic typing, and interpretation.
In this blog, we discuss some good practices that can speed up Python code significantly.
This is the second post in the series on Python performance and optimization. The series covers the use of built-in libraries, low-level code conversions, and alternative Python implementations to speed up Python. The other posts included in this series are
- (Part-1): Why Python is slow?
- (Part-2): Good practices to write fast Python code (this post)
- (Part-3): Multi-processing in Python
- (Part-4): Use Cython to get speed as fast as C
- (Part-5): Use Numba to speed up Python Functions and Numeric calculations
The following sections describe various good practices one can use to speed up Python (by 30% or more in some cases) without any external support like PyPy, Cython, NumPy, etc.
Python good practices for super fast code
Use built-in data structures and libraries
As Python's built-in data types are implemented directly in C, using built-in types like list, dict, set, and tuple, rather than custom classes we define, really helps the program run faster.
Also, use built-in functions and libraries for common operations like counting duplicates, summing all list elements, or finding the maximum element, because these are already written in C and compiled, which makes them run faster than equivalent custom functions we write.
```python
from random import randint

rand_nums = [randint(1, 100) for _ in range(100000)]
```
Create 100000 random numbers between 1 and 100.
```python
%%timeit
cc = 0
for i in range(len(rand_nums)):
    cc += rand_nums[i]
```
output
```
6.02 ms ± 94 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Manually summing all the numbers with a loop takes approximately 6 ms.
```python
%%timeit
cc = sum(rand_nums)
```
output
```
332 µs ± 15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Using the built-in sum() takes only about 0.3 ms.
Local Variables vs Global Variables
When we call a function (a routine), the system pauses execution at the call site in the current routine (say, main()) and places the called function on top of the call stack. Now imagine we have defined numerous global variables and made multiple function calls: the interpreter must keep all these global variables available to every routine on the call stack at all times. Python provides a lookup mechanism for both local and global variables, but globals live in a dictionary that is searched at runtime, while locals are resolved with a fast, array-based lookup, so accessing a global variable inside a hot loop takes more time than accessing a local one.
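As a quick illustration (a minimal sketch; exact timings will vary by machine and Python version), we can time the same loop with the variable read as a global versus copied into a local first:

```python
from timeit import timeit

x = 5  # a module-level (global) variable

def add_global():
    total = 0
    for _ in range(100000):
        total += x  # global (dictionary) lookup on every iteration
    return total

def add_local():
    y = x  # bind the global to a local variable once
    total = 0
    for _ in range(100000):
        total += y  # fast local lookup on every iteration
    return total

print("global:", timeit(add_global, number=100))
print("local: ", timeit(add_local, number=100))
```

On most CPython builds the local version comes out measurably faster, which is why copying a frequently used global into a local variable before a tight loop is a common micro-optimization.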
Import the sub-modules and functions directly
When importing a module to use its sub-modules, classes, or functions, import them directly instead of importing just the module. Accessing objects with the . operator triggers a dictionary lookup via __getattribute__. If we access the object through the module many times using ., that adds up and increases the program's running time.
```python
# instead of this
import abc
def_obj = abc.Def()

# do this
from abc import Def
def_obj = Def()
```
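For example (a small sketch using the standard math module; absolute numbers will differ across machines), timing math.sqrt against a directly imported sqrt shows the cost of the repeated attribute lookup:

```python
from timeit import timeit
import math
from math import sqrt

# attribute lookup on the module object happens on every call
t_attr = timeit("math.sqrt(2.0)", globals={"math": math}, number=1_000_000)

# the function reference was resolved once at import time
t_direct = timeit("sqrt(2.0)", globals={"sqrt": sqrt}, number=1_000_000)

print(f"math.sqrt: {t_attr:.3f}s, sqrt: {t_direct:.3f}s")
```

The difference per call is tiny, but inside loops that run millions of times it becomes noticeable.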
Limit the usage of '.'
Speaking of the lookup time for a module's object references, the same applies to referencing the properties and methods of an object (both custom and built-in).
```python
%%timeit
ll = []
for i in range(len(rand_nums)):
    ll.append(rand_nums[i])
```
output
```
6.67 ms ± 84.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Appending elements with list.append() inside the loop takes more time than the following code, where the method is first assigned to a variable (functions are first-class citizens in Python) and that variable is called inside the loop. This simple practice avoids resolving the function reference with '.' on every iteration and thus limits the dictionary lookups.
```python
%%timeit
ll = []
ll_append = ll.append
for i in range(len(rand_nums)):
    ll_append(rand_nums[i])
```
output
```
5.45 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Avoid writing unnecessary functions
It's good to have code separability by using functions for each independent task. But, as function calls in Python are relatively more expensive than in C/C++ (due to boxing and unboxing of dynamic variables, among other factors), limit writing functions for unnecessary cases like one-liners.
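To see the call overhead concretely (the square helper here is hypothetical, used only for illustration), compare computing a one-liner through a function call against inlining the same expression:

```python
from timeit import timeit

def square(x):  # a one-liner wrapped in a function
    return x * x

def with_call(nums):
    return [square(n) for n in nums]  # one function call per element

def inlined(nums):
    return [n * n for n in nums]  # same work, no call overhead

nums = list(range(10_000))
print("with call:", timeit(lambda: with_call(nums), number=100))
print("inlined:  ", timeit(lambda: inlined(nums), number=100))
```

The inlined version typically wins because it skips creating and tearing down a stack frame for every element.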
Don't wrap lambdas around functions
Overuse or misuse of lambdas is not a good practice. It's common to wrap an existing function inside a lambda that does nothing the function couldn't do on its own.
Consider the following two functions for sorting a list based on absolute values.
```python
def fun_sort_with_lambda(l):
    return sorted(l, key=lambda x: abs(x))

def fun_sort_without_lambda(l):
    return sorted(l, key=abs)
```
If we look at the CPython bytecode for the above functions, one with a lambda expression passed as the key and one with the abs function object as the key,
```python
>>> from dis import dis
>>> dis(fun_sort_with_lambda)
  2           0 LOAD_GLOBAL              0 (sorted)
              2 LOAD_FAST                0 (l)
              4 LOAD_CONST               1 (<code object <lambda> at 0x7fc51b3a19d0, file "<ipython-input-62-c4147c242c71>", line 2>)
              6 LOAD_CONST               2 ('fun_sort_with_lambda.<locals>.<lambda>')
              8 MAKE_FUNCTION            0
             10 LOAD_CONST               3 (('key',))
             12 CALL_FUNCTION_KW         2
             14 RETURN_VALUE

Disassembly of <code object <lambda> at 0x7fc51b3a19d0, file "<ipython-input-62-c4147c242c71>", line 2>:
  2           0 LOAD_GLOBAL              0 (abs)
              2 LOAD_FAST                0 (x)
              4 CALL_FUNCTION            1
              6 RETURN_VALUE

>>> dis(fun_sort_without_lambda)
  2           0 LOAD_GLOBAL              0 (sorted)
              2 LOAD_FAST                0 (l)
              4 LOAD_GLOBAL              1 (abs)
              6 LOAD_CONST               1 (('key',))
              8 CALL_FUNCTION_KW         2
             10 RETURN_VALUE
```
we see that for fun_sort_with_lambda, an additional function object is generated for the lambda. We can avoid this extra function creation by passing the function object directly, as in fun_sort_without_lambda.
List comprehension is fast
When building or transforming list-like data structures, a list comprehension is usually faster than traditional approaches like an explicit loop with append() or map()/filter() calls with lambdas.
```python
%%timeit
rand_nums = []
for _ in range(1000):
    rand_nums.append(randint(1, 100))
```
```
603 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
The list comprehension version of the above code snippet runs faster.
```python
%%timeit
rand_nums = [randint(1, 100) for _ in range(1000)]
```
output
```
565 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
The optimization practices are not limited to the above approaches. We can also measure how much time each part of the program takes using profiling tools like cProfile and then change the slow spots to run faster. In the next blog, we discuss how to improve Python computing efficiency using multiprocessing.
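As a starting point (a minimal sketch; the function being profiled here is just a stand-in for real application code), cProfile can be run programmatically and its statistics printed sorted by cumulative time:

```python
import cProfile
import io
import pstats
from random import randint

def build_and_sum():
    nums = [randint(1, 100) for _ in range(100_000)]
    return sum(nums)

profiler = cProfile.Profile()
profiler.enable()
result = build_and_sum()
profiler.disable()

# collect and print the five most expensive entries by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The report shows per-function call counts and times, which quickly points at the hot spots worth rewriting with the practices above.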