Although Python is one of the world’s most popular programming languages, it isn't without flaws. The biggest one? Probably its lack of speed. Python code is compiled to bytecode by CPython and then executed by an interpreter. Easy to learn? Yes. Speedy? No.
The bytecode is cached in a .pyc file until you edit your source code, so the first run is always a fraction longer while CPython compiles it; the second and subsequent runs are better for judging speed.
Here are five ways to improve your Python code. I've tested them using Python 3.7 and 3.9, and I've included some code examples for you to try out. All the timing is done in nanoseconds with the perf_counter_ns function from the time module.
Be Pythonic
It's very easy, coming from other programming languages, to write code that cuts against the grain. Use for loops and index-by-index array access, for example, and your code will run slowly. The Python way of doing things (a.k.a. Pythonic) is to use features like map, list comprehensions and generators, along with built-in functions like sum and range.
For instance, here are two ways of totalling all the integers below 100 that are divisible by 3:
from time import perf_counter_ns

def main():
    # Non-Pythonic: explicit loop with a divisibility test
    start = perf_counter_ns()
    total = 0
    for i in range(1, 100):
        if (i % 3) == 0:
            total += i
    end = perf_counter_ns()
    print(f"Non-Pythonic Total of divisible by 3= {total}")
    print(f"Time took {end-start}")

    # Pythonic: let range() step by 3 and sum() do the work
    start = perf_counter_ns()
    total = sum(range(3, 100, 3))
    end = perf_counter_ns()
    print(f"Pythonic Total of divisible by 3= {total}")
    print(f"Time took {end-start}")

if __name__ == "__main__":
    main()
This gives you:
Non-Pythonic Total of divisible by 3= 1683
Time took 13300
Pythonic Total of divisible by 3= 1683
Time took 2900
These are second-run times. First runs were 14,500 and 3,000 nanoseconds, between 3.5 and 9 percent longer. Note that the Pythonic version is almost five times faster.
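If you prefer something closer in shape to the original loop, a generator expression gives the same total and is still Pythonic, though it won't be as quick as handing a plain range to sum:

    total = sum(i for i in range(1, 100) if i % 3 == 0)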
Use memoization
This sounds more complicated than it really is. If the cost of calling a function is high, memoization means the results are cached; later calls with the same parameter values are served from the cache instead of being recomputed. This can give massive speed improvements.
The functools package has an lru_cache decorator, which you can apply to the function you wish to memoize. In the example below, fib is a simple non-memoized Fibonacci function, and fib(35) does a lot of additions; mfib is the memoized version.
from time import perf_counter_ns
from functools import lru_cache

def main():
    # Plain recursive Fibonacci
    def fib(n):
        return n if n < 2 else fib(n-1) + fib(n-2)

    # Same function, memoized with lru_cache
    @lru_cache(maxsize=None)
    def mfib(n):
        return n if n < 2 else mfib(n-1) + mfib(n-2)

    start = perf_counter_ns()
    print(f"Non-memoized fib()={fib(35)}")
    end = perf_counter_ns()
    print(f"Time took {end-start}")

    start = perf_counter_ns()
    print(f"Memoized fib()={mfib(35)}")
    end = perf_counter_ns()
    print(f"Time took {end-start}")

if __name__ == "__main__":
    main()
The results speak for themselves:
Non-memoized fib()=9227465
Time took 2905175700
Memoized fib()=9227465
Time took 148700
That's almost 20,000 times faster!
I was curious about the number of additions. The total for fib(x) follows the recurrence #(x) = #(x-1) + #(x-2) + 1, where #(n) is the number of additions needed for fib(n). Starting with fib(2), that gives the sequence 1, 2, 4, 7, 12…, and for fib(35) there are a whopping 14,930,351 additions.
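If you want to check that figure yourself, a short sketch that simply iterates the recurrence does the job (the function name is mine, not from the code above):

    def additions(n):
        # additions(0) = additions(1) = 0, then apply #(x) = #(x-1) + #(x-2) + 1
        counts = [0, 0]
        for _ in range(2, n + 1):
            counts.append(counts[-1] + counts[-2] + 1)
        return counts[n]

    print(additions(35))  # 14930351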
Developer Oren Tosh reckons he can improve on this by sneakily using a Dictionary subclass with the __missing__ dunder method.
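I haven't reproduced his exact code, but the general idea looks something like this: when a key is missing, the __missing__ method computes the value, stores it and returns it, so the dict itself becomes the cache (the class name here is my own):

    class FibCache(dict):
        # Called automatically whenever a key isn't already in the dict
        def __missing__(self, n):
            result = self[n] = n if n < 2 else self[n-1] + self[n-2]
            return result

    fib = FibCache()
    print(fib[35])  # 9227465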
Code it in C
This is not always easy to do. You have to know C and how it interfaces with Python. Also, there may be only a few cases where coding in C will do the trick. It helps that CPython is written in C.
The ctypes module provides C-compatible data types and their Python mappings. It also lets you make calls into the operating system's libraries, but you should be comfortable working at a fairly low level and know C, including arrays, structs and pointers, before you go there.
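As a taste of what that looks like, here's a minimal sketch that calls the C library's strlen through ctypes (it assumes a system where find_library can locate the C runtime, such as Linux or macOS):

    import ctypes
    from ctypes.util import find_library

    # Load the C standard library (the exact name and path vary by platform)
    libc = ctypes.CDLL(find_library("c"))

    # Declare argument and return types so ctypes converts values correctly
    libc.strlen.argtypes = [ctypes.c_char_p]
    libc.strlen.restype = ctypes.c_size_t

    print(libc.strlen(b"hello world"))  # 11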
Compile Python
Machine code that you create when you compile code is always going to run faster than interpreted bytecode. You'll find several Python compilers available, including Numba, Nuitka, Cython and the alternative interpreter PyPy. I haven't tried all of these, but I suggest you optimize your Python code before you try compiling it. Numba's compiler is JIT (Just-In-Time) and also provides GPU-powered acceleration.
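To give a flavour of Numba (assuming it's installed, e.g. with pip install numba; the function below is my own illustration), you decorate a numeric function with @njit and it gets compiled to machine code the first time it's called:

    from numba import njit

    @njit
    def total_divisible_by_3(limit):
        # A plain Python loop, but compiled to machine code by Numba
        total = 0
        for i in range(1, limit):
            if i % 3 == 0:
                total += i
        return total

    print(total_divisible_by_3(100))  # 1683; the first call includes compilation time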
Use ‘From’ when possible
It's all too easy to write import package every time. But it makes more sense to use from when you can and bring in just the function(s) you need. Why import twenty functions when you only need one? For short programs like this, you probably won't notice a difference, but as your program grows you might start to:
from time import perf_counter_ns

def main():
    start = perf_counter_ns()
    n = 10
    fact = 1
    for i in range(1, n+1):
        fact = fact * i
    end = perf_counter_ns()
    print(f"The factorial of {n} is {fact}")
    print(f"Time took {end-start}")

if __name__ == "__main__":
    main()
First run of this was 4,200 nanoseconds, but subsequent runs were around 3,900. Also, don't forget you can put imports inside functions so the import only happens when the function is first called.
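One small reason the from form can help in hot code: calling through the module name costs an extra attribute lookup each time. A quick illustration (exact timings will vary):

    import time
    from time import perf_counter_ns

    t1 = time.perf_counter_ns()   # looks up perf_counter_ns on the time module, then calls it
    t2 = perf_counter_ns()        # calls the already-bound name directly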
Conclusion
If there's one method to single out, it's memoization: use it to get the best speed from Python. But you'll probably become a better Python programmer if you learn the Pythonic approach.
It's interesting to see plans to make CPython up to five times faster over the next four releases. Whether that's doable is anyone's guess, but if it happens it will keep Python in the number one spot for popularity for a very long time.
Another proposed method of increasing Python's speed is making multithreaded code faster by removing the GIL (Global Interpreter Lock). This is a mutex that allows only one thread at a time to hold control of the Python interpreter, so a thread that doesn't hold the GIL can't access Python variables and objects.