Python Multithreading - You Could Be Wasting Time
Python is an excellent scripting language that is especially well suited to Linux applications. You may have heard that Python has multi-threading support. This is true, and it can dramatically improve your application's performance, as demonstrated by the two examples below, which simulate connecting to 4 different network computers.
Single Threaded Implementation
#!/usr/bin/env python3
import time

def connect():
    # Simulate a slow network connection with a 3 second delay
    print("connecting...")
    time.sleep(3)
    print("connected.")

# Single threaded: each connection blocks until the previous one finishes
for i in range(4):
    connect()
Multithreaded Implementation
#!/usr/bin/env python3
import time
import threading

def connect():
    # Simulate a slow network connection with a 3 second delay
    print("connecting...")
    time.sleep(3)
    print("connected.")

# Multi threaded: all 4 connections run concurrently
threads = []
for i in range(4):
    t = threading.Thread(target=connect)
    threads.append(t)
    t.start()

# Wait for every thread to finish before exiting
for t in threads:
    t.join()
The single-threaded implementation takes about 12 seconds to run, while the multi-threaded one takes just 3, because each thread spends almost all of its time sleeping. This approach also works well for file I/O, not just networking.
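If you want to verify the timings yourself, here is a minimal sketch that wraps both versions with time.perf_counter() (the connect() function is the same one defined above):
#!/usr/bin/env python3
import time
import threading

def connect():
    print("connecting...")
    time.sleep(3)
    print("connected.")

# Time the single-threaded version
start = time.perf_counter()
for i in range(4):
    connect()
print(f"single-threaded: {time.perf_counter() - start:.1f} seconds")

# Time the multi-threaded version
start = time.perf_counter()
threads = [threading.Thread(target=connect) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"multi-threaded: {time.perf_counter() - start:.1f} seconds")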
Global Interpreter Lock (GIL)
Unfortunately, Python suffers from the Global Interpreter Lock (GIL), which allows only one thread to execute Python bytecode at a time. For all intents and purposes, a CPU-bound Python program runs on a single thread, behaving more like the event loop in JavaScript than like truly parallel threads. To demonstrate this, try running the code below:
#!/usr/bin/env python3
import threading

def thread_test():
    # Busy loop that should keep one CPU core fully occupied
    print("starting thread")
    counter = 0
    while True:
        counter += 1

threads = []
for i in range(4):
    t = threading.Thread(target=thread_test)
    threads.append(t)
    t.start()
The program runs indefinitely, so you can monitor your CPU to see how hard it is working. You might expect all 4 cores to be maxed out, but because of the GIL only about one core's worth of CPU time is actually used.
Multiprocessing
If you have a CPU-intensive application, you can work around the GIL in Python by using multiprocessing instead of multithreading. This runs multiple processes, each with its own interpreter and its own GIL, instead of multiple threads within a single process. Luckily for us, Python ships with a multiprocessing library that automates most of this, so it is fairly simple to implement.
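As a quick contrast with the threading demo above, here is a minimal sketch that swaps threading.Thread for multiprocessing.Process in the same busy loop; running it should max out all 4 cores, since each process has its own GIL:
#!/usr/bin/env python3
import multiprocessing

def process_test():
    # Busy loop: each process has its own interpreter and GIL,
    # so all 4 loops can run on separate cores at once
    print("starting process")
    counter = 0
    while True:
        counter += 1

if __name__ == '__main__':
    processes = []
    for i in range(4):
        p = multiprocessing.Process(target=process_test)
        processes.append(p)
        p.start()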
Below is an example that uses multiprocessing [source]. It takes a file path as an argument and executes each line of that file as a shell command in its own process. It never runs more processes than multiprocessing.cpu_count() reports, which is the number of logical CPUs, i.e. the number of threads your CPU can execute in parallel.
#!/usr/bin/env python3
"""
Script that takes a path to a file as an argument, and executes each
line in that file with a process pool equal in size to the cpu_count
of the computer, thus fully utilizing the CPU.
"""
import multiprocessing
import subprocess
import sys

def process_line(line_command):
    # Run one command from the file in its own shell
    subprocess.check_call(line_command, shell=True)

if __name__ == '__main__':
    num_cpu = multiprocessing.cpu_count()

    # Fetch the path of the commands file, which contains
    # each command we wish to run on a separate line
    cmds_fp = sys.argv[1]
    with open(cmds_fp, 'r') as f:
        lines = [line.strip() for line in f]

    # Distribute the commands across a pool of worker processes
    with multiprocessing.Pool(num_cpu) as job_pool:
        job_pool.map(process_line, lines)
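On Python 3 the same idea can also be expressed with the standard library's concurrent.futures module; a minimal sketch of an equivalent script:
#!/usr/bin/env python3
import subprocess
import sys
from concurrent.futures import ProcessPoolExecutor

def process_line(line_command):
    subprocess.check_call(line_command, shell=True)

if __name__ == '__main__':
    with open(sys.argv[1], 'r') as f:
        lines = [line.strip() for line in f]

    # ProcessPoolExecutor defaults to os.cpu_count() workers;
    # list() forces the lazy map so any command errors surface here
    with ProcessPoolExecutor() as pool:
        list(pool.map(process_line, lines))
Either script can then be run as, for example, ./run_commands.py commands.txt (the file names here are just examples), where the file holds one shell command per line.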
Related Video
Below is a video from Jack of Some demonstrating that Python is technically a multithreaded language, showing how the GIL may get in your way, and explaining why this may be less of an issue if you are using libraries like NumPy, which can release the GIL during heavy computation.
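As a small illustration of that last point, here is a sketch (assuming NumPy is installed; results also depend on how your BLAS library is configured) where threads each perform a large matrix multiplication. Because NumPy can release the GIL inside such calls, the threads can genuinely overlap:
#!/usr/bin/env python3
import threading
import numpy as np

def multiply():
    # Large matrix multiplication; NumPy can release the GIL
    # while the underlying BLAS routine does the work
    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)
    np.dot(a, b)

threads = [threading.Thread(target=multiply) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()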
Conclusion
Python multithreading is useful for removing I/O bottlenecks, but if your application is CPU intensive and cannot use multiprocessing, you may need to use another language, such as Java or Rust, instead.
References
- The two code examples for demonstrating multithreading are based on Darren's blog - Basic Python Multithreading
First published: 16th August 2018