Getty Images/iStockphoto
Speed up Python and NumPy by avoiding the conversion tax
Data and memory transfers in Python come with a hidden performance tax. Here's how to use NumPy for optimal performance by avoiding jumps across a hidden line of conversions.
There is a phenomenon in the Python programming language that affects the efficiency of data representation and memory. I call it the "invisible line."
This invisible line might seem innocuous at first, but it can significantly impact application performance, especially if the data stored in an application's memory must move between code managed by different software platforms. This issue is particularly pronounced when data is exchanged between Python and a native C++ runtime, a situation that is ubiquitous in the AI and machine learning space.
Consider a simple algorithm known as the prefix sum or cumulative sum. The algorithm for prefix sum is quite simple.
For the sake of simplicity, we create a list of 1 million ones. Then we iterate through it, updating each element to be the sum of itself and the preceding element. We now have the prefix sum of 1 million ones, which is a list from 1 to 1 million and one.
We can express it in Python as follows:
list = [1] * 1_000_000
for i in range(1, len(list)):
list[i] += list[i-1]
To measure time spent on each element, we augment the script with a few time measuring statements:
from time import time_ns
list = [1] * 1_000_000
tik = time_ns()
for i in range(1, len(list)):
list[i] += list[i-1]
tok = time_ns()
print(f"Time spent per element: {(tok - tik) / len(list)} ns")
On my trusty laptop (11th Gen Intel Core i7-1165G7 at 2.80GHz × 8, running Ubuntu 22.04.3 LTS and Python 3.11.5), the result clocks in at approximately 52.88 nanoseconds (ns) per element.
Is this good? Perhaps, but it's hard to know if you are going fast if you are the only one on the road. So, we'll attempt the same computation with NumPy and establish a comparison.
What is NumPy? NumPy is a fundamental package for scientific computing, widely used by Python developers. Underneath the Python API, it performs computations in C or Fortran to help speed up Python.
NumPy comes with a built-in prefix sum implementation, which is as follows:
import numpy as np
list = [1] * 1_000_000
tik = time_ns()
list = np.cumsum(list)
tok = time_ns()
print(f"Time spent per element: {(tok - tik) / len(list)} ns")
This NumPy version performs admirably, clocking in at around 28.77 ns per element -- almost two times faster than the pure Python rendition. Comparison established -- we have a clear winner.
However, before we clap ourselves on the back and move on, can we go even faster? Let's change our script a bit and replace the Python list with a NumPy array:
import numpy as np
list = np.full(1_000_000, 1)
tik = time_ns()
list = np.cumsum(list)
tok = time_ns()
print(f"Time spent per element: {(tok - tik) / len(list)} ns")
Lo and behold, we now have a blazing 2.43 ns per element. That's more than 10 times faster than the previous NumPy version, and over 20 times compared to the Python implementation.
How does this happen?
Let's demystify the concept of the "invisible line." Consider how we generate data in Python, for example:
list = [1] * 1_000_000
Python stores the data in its appropriate data representation and memory space. However, packages such as NumPy are implemented in systems programming languages such as C, Rust or Fortran. These languages interface with Python, convert the Python data representation to their own and transfer the data from Python's memory space to native memory space. This process of crossing a line is unseen, thanks to Python's excellent ergonomics, but that conceals the expenses associated with such transfers and makes them deceptively easy to overlook.
Here is another extreme example to drive the point home:
list = np.full(1_000_000, 1)
tik = time_ns()
for i in range(1, len(list)):
list[i] += list[i-1]
tok = time_ns()
print(f"Time spent per element: {(tok - tik) / len(list)} ns")
This NumPy implementation clocks in almost three times slower than the pure Python version with 148.81 ns per element, as it crosses the invisible line on every iteration.
In essence, NumPy's efficiency dance gracefully navigates the invisible line. So, as you waltz through the world of NumPy, keep the invisible line in your mind for optimal performance.
Meanwhile, I am exploring the new Mojo programming language that can do even better. The simple for loop implementation of prefix sum results in 0.337532 ns per element, and a more elaborate SIMD-based implementation clocks at 0.188 ns per element.
Maxim Zaks is principal software engineer at Yoyo Labs, with expertise in data engineering, creating mobile apps as well as front-end and back-end web apps.