Programster's Blog

Tutorials focusing on Linux, programming, and open source

Python Multithreading - You Could Be Wasting Time

Python is an excellent scripting language that is especially well suited to Linux applications. You may have heard that Python supports multithreading. It does, and threads can dramatically improve the performance of an application that spends most of its time waiting on I/O. The two examples below demonstrate this by simulating connections to 4 different computers on a network.

Single Threaded Implementation

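#!/usr/bin/env python3
import time

def connect():
    # Simulate a slow network connection with a 3 second wait.
    print("connecting...")
    time.sleep(3)
    print("connected.")

# Single threaded: each connection waits for the previous one to finish.
for i in range(4):
    connect()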

Multithreaded Implementation

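#!/usr/bin/env python3
import time
import threading

def connect():
    # Simulate a slow network connection with a 3 second wait.
    print("connecting...")
    time.sleep(3)
    print("connected.")

# Multithreaded: start all four connections, then wait for them to finish.
threads = []
for i in range(4):
    t = threading.Thread(target=connect)
    threads.append(t)
    t.start()

for t in threads:
    t.join()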

You will see that the single-threaded implementation takes about 12 seconds to run, but the multithreaded one takes just 3, because the four waits overlap instead of running back to back. The same approach works well for file I/O, not just networking.
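If you would rather measure the difference than watch the clock, the comparison can be sketched roughly as follows. The helper names `run_sequential` and `run_threaded` and the `delay` parameter are my own, not part of the examples above, and the delay is shortened to half a second so the script finishes quickly.

```python
import threading
import time

def connect(delay):
    # Stand-in for a blocking network call; `delay` is a hypothetical parameter.
    time.sleep(delay)

def run_sequential(count, delay):
    """Call connect() `count` times in a row and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(count):
        connect(delay)
    return time.perf_counter() - start

def run_threaded(count, delay):
    """Run `count` connect() calls in separate threads and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=connect, args=(delay,)) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread before stopping the clock
    return time.perf_counter() - start

if __name__ == "__main__":
    print("sequential: %.2fs" % run_sequential(4, 0.5))
    print("threaded:   %.2fs" % run_threaded(4, 0.5))
```

The sequential time should come out close to the sum of the delays, and the threaded time close to a single delay.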

Global Interpreter Lock (GIL)

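Unfortunately, the standard Python interpreter (CPython) has a global interpreter lock (GIL), which allows only one thread to execute Python code at a time. Threads can overlap while they wait on I/O, but for CPU-bound work the program is, for all intents and purposes, running on a single thread. In that respect it is closer to the event loop in JavaScript than to truly parallel threads. To demonstrate this, try running the code below: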

#!/usr/bin/env python3

import threading

def thread_test():
    print("starting thread")
    counter = 0

    # Busy loop that never finishes, to keep the CPU working.
    while True:
        counter = 2

threads = []
for i in range(4):
    t = threading.Thread(target=thread_test)
    threads.append(t)
    t.start()

The program runs until you kill it, giving you time to monitor your CPU and see how hard it is working. You might expect 4 cores to be maxed out, but because of the GIL the total load stays at roughly that of a single core.
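A finite version of the same experiment makes the effect easier to measure. The sketch below is only an illustration: `count_down`, `time_sequential`, and `time_threaded` are names I made up, and the loop size is arbitrary. On a standard CPython build with the GIL, the two timings usually come out similar. On the newer free-threaded builds, the threaded run may well be faster.

```python
import threading
import time

def count_down(n):
    # Pure-Python busy work: it never waits on I/O, so it needs the GIL to make progress.
    while n > 0:
        n -= 1

def time_sequential(jobs, n):
    """Run count_down(n) `jobs` times in a row and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(jobs):
        count_down(n)
    return time.perf_counter() - start

def time_threaded(jobs, n):
    """Run count_down(n) in `jobs` threads at once and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    jobs, n = 4, 2_000_000  # arbitrary sizes, chosen to finish in a few seconds
    print("sequential: %.2fs" % time_sequential(jobs, n))
    print("threaded:   %.2fs" % time_threaded(jobs, n))
```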

Multiprocessing

If you have a CPU-intensive application, you can work around the GIL in Python by using multiprocessing instead of multithreading. This means running multiple processes instead of multiple threads in a single process; each process has its own interpreter and its own GIL. Luckily for us, Python's standard library includes a multiprocessing module that automates this, so it is fairly simple to implement.

Below is an example that uses multiprocessing [source]. It takes a file path as an argument and executes each line of that file as a command in a separate process. It runs only as many worker processes at a time as your CPU count, which is the number of threads your CPU can run in parallel.

#!/usr/bin/env python3
"""
Script that takes a path to a file as an argument, and executes each line in that file
with a process pool equal in size to the cpu_count of the computer, thus fully utilizing the CPU.
"""

import multiprocessing
import subprocess
import sys

def process_line(line_command):
    # Run one line of the file as a shell command; raises if the command fails.
    subprocess.check_call(line_command, shell=True)

if __name__ == "__main__":
    # Fetch the commands file path, which contains each command
    # we wish to run on a separate line.
    cmds_fp = sys.argv[1]
    with open(cmds_fp, 'r') as cmds_file:
        lines = [line.strip() for line in cmds_file if line.strip()]

    num_cpu = multiprocessing.cpu_count()
    with multiprocessing.Pool(num_cpu) as job_pool:
        job_pool.map(process_line, lines)
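The script above hands its work to external shell commands. When the CPU-heavy work is a Python function instead, the same `Pool.map` call can run it across processes directly. The following is a minimal sketch under that assumption. The `count_primes` function and the list of limits are invented for illustration and are not from the original source.

```python
import math
import multiprocessing

def count_primes(limit):
    """Count the primes below `limit` by trial division (deliberately CPU-heavy)."""
    count = 0
    for candidate in range(2, limit):
        # A number is prime if no divisor from 2 up to its integer square root divides it.
        if all(candidate % d for d in range(2, math.isqrt(candidate) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    limits = [20_000, 40_000, 60_000, 80_000]  # hypothetical workload
    # Each call to count_primes runs in a worker process with its own GIL.
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        results = pool.map(count_primes, limits)
    for limit, found in zip(limits, results):
        print("primes below %d: %d" % (limit, found))
```

The `if __name__ == "__main__":` guard matters here: on platforms that start worker processes by re-importing the script, it stops each worker from trying to create a pool of its own.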

Conclusion

Python multithreading is useful for removing I/O bottlenecks, but if your application is CPU-intensive and cannot use multiprocessing, you may need to use another language, such as Java, instead.

References