If you buy a laptop nowadays, it is more often than not a quad-core (or better) computer, sometimes with Hyper-Threading Technology and Turbo Boost. However, this article is not about these fancy terms. The important thing for software engineers is to know how to take advantage of this growing computing power.

You bought a Ferrari & drove it like a Fiat.

-- Zlatan Ibrahimović on his days in Barcelona

The Question

If you've taken an operating systems course, you may know what processes and threads are. There are tons of tutorials and articles defining and comparing these concepts.

A process, in the simplest terms, is an executing program. One or more threads run in the context of the process. A thread is the basic unit to which the operating system allocates processor time. A thread can execute any part of the process code, including parts currently being executed by another thread.

-- https://docs.microsoft.com/en-us/windows/win32/procthread/processes-and-threads

However, I personally find these definitions difficult to apply in practice: I know my laptop has several cores, and I know programs may run in multiple processes and/or multiple threads, but how do they work together, and which one should I choose?

Process vs. Thread

Before we address the question, let's look at processes and threads from a programming perspective. In a naive view, your program is a process, and it runs on one core. It contains one thread that follows the instructions you gave. This thread lives inside the process, which holds more information (the context) than the thread itself.

However, as CPUs become increasingly powerful, a single-threaded process does not make the most of the processor. If we split the work into several threads, the CPU is fast enough to switch between them, executing each of them in turn for a very short period of time. It then looks as if they are all running at once. This is known as concurrency.
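A minimal sketch of this idea, assuming each "task" is just a wait (simulated here with time.sleep, which stands in for something like a network call). The function names are made up for illustration. Three waits of 0.2 seconds started in three threads finish in roughly 0.2 seconds overall rather than 0.6, because the threads overlap their waiting:

```python
import threading
import time


def wait_for_io(delay, results, index):
    # Simulate an I/O wait (for example, a network call) with sleep.
    time.sleep(delay)
    results[index] = delay


def run_concurrently(delays):
    # Start one thread per task and measure the total wall-clock time.
    results = [None] * len(delays)
    threads = [
        threading.Thread(target=wait_for_io, args=(delay, results, i))
        for i, delay in enumerate(delays)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, time.perf_counter() - start


results, elapsed = run_concurrently([0.2, 0.2, 0.2])
print(f"finished {len(results)} tasks in {elapsed:.2f}s")
```

The exact timing varies by machine, but the total should stay close to the longest single wait rather than the sum of all waits.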

As you may have guessed, since our CPUs are now multicore, why not let the program run on all cores? This is where multiprocessing comes into play. In standard CPython, the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time, so the threads of a single process cannot spread CPU-heavy work across cores. Processes are much heavier than threads, since each one has its own memory and its own interpreter, but they do not share a GIL, so the operating system can schedule them on different cores. Each core has its own computing power, so the CPU can execute as many processes simultaneously as it has cores. This is called parallelism.
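As a rough illustration, assuming a pool of 4 workers and a made-up helper named square_with_pid, the sketch below hands eight small jobs to a ProcessPoolExecutor and reports which process handled each one. The worker process IDs differ from the parent's, which shows the work really leaves the main process:

```python
import os
from concurrent.futures import ProcessPoolExecutor


def square_with_pid(n):
    # Return the result together with the ID of the process that computed it.
    return n * n, os.getpid()


if __name__ == "__main__":
    print(f"parent process: {os.getpid()}")
    with ProcessPoolExecutor(max_workers=4) as executor:
        # map() returns results in the order the inputs were given.
        for value, pid in executor.map(square_with_pid, range(8)):
            print(f"{value} computed by process {pid}")
```

Which worker picks up which job is up to the pool, so the mapping of results to process IDs will differ between runs.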

Now that we know how processes and threads work in real life, the remaining question is: how do I choose which one to use?

A Pythonic Answer

To answer the question, I will share two real-life examples that I came across and solved with multithreading and multiprocessing respectively. Hopefully, these examples will shed light on a generalizable answer.

Batch API Calls With Multithreading

One scenario is that I needed to send a bunch of API calls to a third-party API. More specifically, I wanted to fetch 50 users' data for my website. Here were my requirements:

I did not care about the order of the requests, which makes the job a good fit for concurrency or parallelism.
I wanted it to finish as soon as possible; otherwise, my users would be left waiting.

I chose to use multithreading instead of multiprocessing, since:

I do not have a 50-core computer.
Each task is very light.
I don't really need any computation on my side. (Querying the data is done by the API host; I'm just requesting the result, so my threads spend most of their time waiting on the network.)

And here's how I did it (nothing explains better than code):

from concurrent.futures import ThreadPoolExecutor, as_completed

# placeholder: replace with the 50 real user IDs
user_ids = range(50)


def fetch_user_info(user_id):
    # call third party API here and return the response
    pass


with ThreadPoolExecutor(max_workers=20) as executor:
    # submit each job and map future to user id
    future_to_id = {
        executor.submit(fetch_user_info, user_id): user_id
        for user_id in user_ids
    }
    # collect each job as soon as it finishes, in completion order
    for future in as_completed(future_to_id):
        user_id = future_to_id[future]  # match result back to user
        try:
            res = future.result()  # re-raises any exception from the call
        except Exception as exc:
            print(f"user {user_id} failed: {exc}")
            continue
        # TODO: process the result

Excuse my semi-pseudocode; here is some explanation.

The ThreadPoolExecutor provides a context manager, so you do not have to worry about starting the threads or shutting them down once the work finishes, which is a very neat feature.
The ThreadPoolExecutor, as its name hints, provides a thread "pool", and you can specify the maximum number of threads that run together via the max_workers arg. You can dispatch more than max_workers jobs into the pool, but it only runs max_workers of them at a time. The right value depends on how much of your resources you want to use and, more importantly in my case, on how frequently the third-party API allows you to send requests (I tried 50 and got banned).
executor.submit returns a Future object for each job, and the concurrent.futures.as_completed function yields each Future as soon as its task is completed. This allows us to interact with the asynchronous code in a synchronous style.
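To see these three points in action without a real API, here is a self-contained sketch. The function fake_fetch_user_info, the pool size of 5, and the simulated failure for user 13 are all made up for illustration. It tracks how many calls are in flight at once, so you can check that the pool never exceeds max_workers, and it shows that future.result() re-raises an exception from the worker:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 5  # assumed pool size for this demo
stats = {"active": 0, "peak": 0}
stats_lock = threading.Lock()


def fake_fetch_user_info(user_id):
    # Stand-in for a third-party API call; records how many calls overlap.
    with stats_lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    try:
        time.sleep(0.05)  # pretend to wait for the network
        if user_id == 13:
            raise RuntimeError("simulated API error")
        return {"id": user_id}
    finally:
        with stats_lock:
            stats["active"] -= 1


results, errors = {}, {}
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    future_to_id = {
        executor.submit(fake_fetch_user_info, user_id): user_id
        for user_id in range(20)
    }
    for future in as_completed(future_to_id):
        user_id = future_to_id[future]
        try:
            results[user_id] = future.result()
        except RuntimeError as exc:
            errors[user_id] = exc

print(f"{len(results)} succeeded, {len(errors)} failed, peak {stats['peak']}")
```

Twenty jobs are submitted, but the peak number of overlapping calls stays at or below MAX_WORKERS, which is the knob to turn when an API rate limit is the real constraint.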

Text Processing With Multiprocessing

In this case, I was trying to perform some Natural Language Processing (NLP) locally via an NLP library:

I was given a list of articles, and the NLP library would analyze them and extract some specific terms.
The NLP job was very expensive in terms of computing power. It would take a lot of CPU resources.

Here's what I did:

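import os
from concurrent.futures import ProcessPoolExecutor, as_completed

# read the pool size from the environment (variable name assumed here);
# fall back to the number of logical cores
MAX_WORKERS = int(os.environ.get("MAX_WORKERS", os.cpu_count() or 1))

# placeholder: replace with the real article files
filenames = []


def run_nlp(filename):
    # execute NLP
    pass


# the guard keeps worker processes from re-running the pool on import
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=MAX_WORKERS) as executor:
        future_to_file = {
            executor.submit(run_nlp, filename): filename for filename in filenames
        }

        for future in as_completed(future_to_file):
            res = future.result()  # read the future object for result
            filename = future_to_file[future]  # match result back to filename
            # TODO: process the result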

Note on code:

I did not provide the res parsing code, but it can be handled in the same way as in the prior example.
It's almost identical to the previous example, apart from using a different "Pool Executor". With this example, I'm merely hoping to provide a different scenario rather than implementation.
I set the max_workers via an environment variable which I'll explain below.

Exploiting the CPU

The reason I configure the max_workers via an environment variable is that in this case, the level of parallelism does not depend on my use case, but on the computing power instead. I had three different environments:

local: on my laptop
development: on a dev server
production: on a production server

My laptop has 4 cores, with Turbo Boost. I initially set MAX_WORKERS to 4 locally, which makes sense in that I can allocate the heavy jobs 4 at a time, one to each of the 4 cores. My dev server also has 4 cores and the production server has 16 cores. So I set MAX_WORKERS to 4, 4 and 16 respectively.

I found out that the program ran much faster locally than on the dev server. But of course, both setups got trounced by the production server. With a little bit of research, I found out that the dev server uses a cheaper CPU with a lower clock rate.

Then I tried something else, which led to an interesting discovery. I increased MAX_WORKERS locally and the results actually improved. The improvement topped out at MAX_WORKERS=8, which took about half the time of MAX_WORKERS=4. Beyond that, adding more workers made no further difference.

However, this pattern does not exist on the dev server. This is a little bit weird, isn't it? Then it dawned on me that it must have something to do with the laptop's CPU features. Turbo Boost lets each core run above its base clock rate for a short period of time, but it does not add cores. Hyper-Threading, on the other hand, presents each physical core to the operating system as two logical cores, which would explain why the laptop kept improving up to 8 workers and appeared to have twice the cores. The dev server, with its cheaper CPU, presumably lacks this capability.

So my takeaway from this example is that MAX_WORKERS can be set to the number of cores in the computer to make the most of its power. Depending on the weight of your tasks and the capabilities of the cores, you may even set MAX_WORKERS to a multiple of the number of cores. Finding the best configuration requires some benchmarking.
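One way to do that benchmarking is sketched below. The workload cpu_heavy is a made-up stand-in for the real NLP job, and the job sizes and worker counts are arbitrary choices. Note that os.cpu_count() reports logical cores, so on a machine with Hyper-Threading it is typically twice the number of physical cores:

```python
import os
import time
from concurrent.futures import ProcessPoolExecutor


def cpu_heavy(n):
    # Stand-in for an expensive job: sum of squares below n.
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_pool(max_workers, jobs):
    # Run all jobs in a process pool of the given size; return seconds taken.
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(cpu_heavy, jobs))
    return time.perf_counter() - start


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # logical cores as seen by the OS
    jobs = [500_000] * 8
    for workers in sorted({1, cores // 2 or 1, cores, cores * 2}):
        print(f"max_workers={workers}: {time_pool(workers, jobs):.2f}s")
```

Running this on each environment gives a quick picture of where adding workers stops paying off, which is a better guide than the core count on the spec sheet.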

Conclusion

With the two examples, we can come to this simple rule of thumb:

If you want to launch light-weight jobs in a batch, multithreading is your friend.
If your tasks are resource-intensive (they require lots of computation), consider multiprocessing.

There are other restrictions, such as whether your requests are context independent. Recall that, as mentioned earlier in this article, a process contains the context for its threads. In other words, the threads share the same context. This can be a plus as well as a source of trouble, depending on the case. We may delve into this matter in another post.
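As a small taste of what sharing context means, here is a sketch in which four threads update the same dictionary. The names counter and add_many are made up for illustration. The shared state is the plus; the lock is the price, since without it concurrent updates to the same value can be lost:

```python
import threading

counter = {"value": 0}  # state shared by every thread in the process
counter_lock = threading.Lock()


def add_many(times):
    for _ in range(times):
        # The lock makes the read-modify-write step safe across threads.
        with counter_lock:
            counter["value"] += 1


threads = [threading.Thread(target=add_many, args=(1000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter["value"])  # 4000
```

Separate processes would each get their own copy of counter, so the same pattern needs explicit communication between them instead.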

Finally, let's take some time to appreciate the beauty of Python, which provides elegant solutions and patterns for these problems.

Process or Thread, that is the question