> I may have missed something but I couldn’t figure out how to get the multi-threaded performance out of Python
Multiprocessing. The answer is to use the python multiprocessing module, or to spin up multiple processes behind wsgi or whatever.
> Historically I’ve written several services that load up some big datastructure (10s or 100s of GB), then expose an HTTP API on top of it.
Use the python multiprocessing module. If you've already written it with the threading module, it is a drop in replacement. Your data structure will live in shared memory and can be accessed by all processes concurrently without incurring the wrath of the GIL.
Obviously this does not fix the issue of Python just being super slow in general. It just lets you max out all your CPU cores instead of having just one core at 100% all the time.
Multiprocessing is not a real solution; it’s a break-glass procedure for when you just need to throw some cores at something without any hope of reliability. Unless something has changed since I used python, it is essentially a wrapper around fork().
This means you need to deal with stuck/dead processes. I’ve used multiprocessing extensively and once you hit a certain amount of usage, even in a pool, you just get hangs and unresponsive processes.
I’ve also written a huge amount of Cython wrapped c++ code which releases the GIL. This never hangs and I can multithread there all I want without issue.
Why would they get stuck/dead and why wouldn't that happen with threads which might be even worse as they're more tightly bound? At least with zombies or inactive processes you can detect and kill them externally - if needs be.
Haven't played with multiprocess at scale, so am genuinely interested.
If subprocesses die (segfault maybe) it isn't uncommon for them to not be cleaned up and/or cause the parent process to hang while it waits for the zombie to respond. That's one I experienced last week on Python 3.9. A thread that experienced that would likely kill the parent process or maybe even exit with a stacktrace. Way easier to debug, and doesn't require me to search through running tasks and manually kill them after each debug cycle.
My impression is that the multiprocessing module is a heroic effort, but unfortunately making the whole system work transparently across multiple OSs and architectures is a nearly insurmountable problem.
concurrent.futures provides a nice interface, but it uses multiprocessing or multithreading under the hood depending on which executor you use:
> The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously. ProcessPoolExecutor uses the multiprocessing module, which allows it to side-step the Global Interpreter Lock but also means that only picklable objects can be executed and returned.
Your trouble seems to involve not understanding how to set up signal handlers, which ProcessPoolExecutor handles for you and exposes via a BrokenProcessPool exception.
> Derived from BrokenExecutor (formerly RuntimeError), this exception class is raised when one of the workers of a ProcessPoolExecutor has terminated in a non-clean fashion (for example, if it was killed from the outside).
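To make the BrokenProcessPool behaviour concrete, here is a minimal sketch, not anyone's production setup. It assumes a Unix system with the "fork" start method, and it uses `os._exit` as a stand-in for a worker that segfaults or gets killed; the function name `run_guarded` is made up for illustration.

```python
import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def run_guarded():
    """Submit a task that kills its worker and report how the pool reacts."""
    # Assumption: "fork" is available (Linux/BSD); it keeps the sketch self-contained.
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=1, mp_context=ctx) as pool:
        # os._exit ends the worker without any cleanup, which the executor
        # treats as an unclean termination (similar to a segfault or kill -9).
        future = pool.submit(os._exit, 1)
        try:
            # A timeout is set as well, so a stuck worker cannot block forever.
            future.result(timeout=30)
        except BrokenProcessPool:
            return "pool broken"
    return "ok"


if __name__ == "__main__":
    print(run_guarded())
```

The parent gets an exception it can handle instead of hanging on a dead child. A broken pool cannot be reused, so recovery usually means creating a fresh executor.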
Always setting a timeout on every IPC or network operation helps immensely. IIRC the multiprocessing module allows that everywhere, but defaults to waiting forever in a couple of places.
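A small sketch of what "always set a timeout" can look like with a multiprocessing queue. The helper name `get_or_default` and the timeout values are made up for illustration; the same idea applies to `Process.join(timeout=...)` and `AsyncResult.get(timeout=...)`.

```python
import multiprocessing
import queue


def get_or_default(q, timeout_s, default=None):
    """Wait at most timeout_s seconds for an item instead of blocking forever."""
    try:
        return q.get(timeout=timeout_s)
    except queue.Empty:
        # multiprocessing.Queue raises queue.Empty when the timeout expires.
        return default


def demo():
    q = multiprocessing.Queue()
    first = get_or_default(q, 0.1, default="timed out")  # nothing queued yet
    q.put("result")
    second = get_or_default(q, 5.0)  # item arrives well within the timeout
    return first, second


if __name__ == "__main__":
    print(demo())
```

Without the `timeout` argument, the first `get` would block indefinitely, which is exactly the kind of hang described above.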
Zombies don't respond, they merely have to be wait()'d for. Which should take microseconds at most.
I've seen orphaned processes sometimes idle, sometimes busy doing god knows what.
But Zombies OTOH are rarely a problem, and should be able to be dealt with easily.
Perhaps the desire of Python to be Windows compatible militates against some design more suitable for Unix.
If processes were a universal substitute for threads we wouldn't have threads. That reasoning only gets stronger when you apply python's heavy limitations, but it gets the most strength when you experience the awkwardness of multiprocessing firsthand.
There isn't much difference on Linux between threads and processes that share memory. Multiprocessing is fine, it's just slightly more isolated threads.
Multiprocessing is a very good solution for scatter-and-gather (or map/reduce) type workloads:
for example, ssh to 1000 machines, run some commands, grab the output, analyze it, take some action based on it, etc.
If you are managing a fleet of machines and have some tasks to do on each machine, then multiprocessing is a life saver.
There is a "fork" mode and a "spawn" mode. Fork (the default on Linux) tends to result in broken process pools as you say; spawn seems to work a lot better, but the performance is worse.
I’m not a huge fan of Cython and the like. It seems to be more natural to open a tcp connection to a c/c++ program and let that do the heavy lifting. Anything else seems like not a proper UNIX style solution.
I want to warn people against multiprocessing in python though.
If you're thinking about parallelizing your Python process, chances are your Python code is CPU-bound. That's when you should stop and think, is Python really the right tool for this job?
From experience, translating a Python program into C++ or Rust often gives a speed-up of around 100x, without introducing threads. Go probably has a similar level of speed-up. So while you can throw a lot of time fighting Python to get it to consume 16x the compute resources for a 10x speed-up, you could often instead spend a similar amount of time rewriting the program for a 100x speed-up with the same compute resources. And then you could parallelize your Go/Rust/C++ program for another 10x, if necessary.
Of course, this is highly dependent on what you're actually doing. Maybe your Python code isn't the bottleneck, maybe your code spends 99% of its time in datastructure operations implemented in C and you need to parallelize it. Or maybe your use-case is one where you could use pypy and get the required speed-up. I just recognize from my own experience the temptation of parallelizing some Python code because it's slow, only to find that the parallelized version isn't that much faster (my computer is just hotter and louder), and then giving in and rewriting the code in C++.
The first thing you should do is profile the code (py-spy is my preferred option) and see if there are any obvious hotspots. Then I'd actually look at the code, and understand what the structure is. For example, are you making lots of unnecessary copies of data? Are you recomputing something expensive you can store (functools.cache is one line and can make things much faster at the cost of memory)?
Once you've done that, you should be familiar enough with the code to know which bits are worth using multiprocessing on (i.e. the large embarrassingly parallel bits), which, if they are a significant part of your code, should scale near linearly.
The other thing to check is which libraries you are using (and what your dependencies are using). numpy now includes openblas (though mkl may be faster for your use case), but sometimes you can achieve large speedups through choosing a different library, or by ensuring a library's optional speedups are actually being built.
>Use the python multiprocessing module. If you've already written it with the threading module, it is a drop in replacement. Your data structure will live in shared memory
Only if it can be immutable. So it can't be shared and changed by multiple processes as needed (with synchronization).
And even if you can have it mostly immutable, if you need to refresh it (e.g. after some time read a newer large file from disk to load into your data structure), you can't without restarting the whole server and processes.
So, it could work for this case, but it's hardly a general solution for the problem.
For this use case it would be better to put the data in a shared SQLite database than relying on multiprocessing CoW.
Even accessing objects from the shared memory would cause the reference counter to increment and the data would be copied, causing a memory usage explosion.
Nowadays multiprocessing is rarely the answer. Between all the gotchas (memory usage can be horrific, have to be careful what you modify, etc.) it's almost never the right answer.
Nowadays numba is usually a better solution for when you want to run some computationally expensive python code that itself calls numpy, etc.
For the parent commenter's use case though that wouldn't be a great solution either. In general, Python does not have an optimal way of operating on a shared data structure across OS threads and certainly not in a way that doesn't require forking the interpreter.
You have to be much more careful about what you modify when using multithreading, so I'm not sure what you mean by that.
A lot of people here mention that sharing data is much easier with multithreading, but doing this without races is not easy.
You can't just use the values from different threads like you would in normal code; you need to synchronize access with locks, which can be difficult to do correctly and can harm performance in a lot of cases.
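A minimal sketch of the kind of lock-guarded access meant here. The `Counter` class and the thread counts are made up for illustration.

```python
import threading


class Counter:
    """Shared mutable state guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The read-modify-write below is not atomic on its own;
        # the lock makes it safe when several threads call it at once.
        with self._lock:
            self.value += 1


def hammer(counter, times):
    for _ in range(times):
        counter.increment()


def run(n_threads=4, times=10_000):
    counter = Counter()
    threads = [
        threading.Thread(target=hammer, args=(counter, times))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run())
```

Every access path to `value` has to go through the lock, and the lock itself serializes the hot path. That is the performance cost mentioned above.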
I think a lot of the people who complain about the GIL are going to become acutely aware of why it was useful when they attempt to use GIL-less multithreading, and realize that removing it wasn't as great as it sounded at first!
In my experience, most problems are inherently synchronous with lots of mutable state and complex data dependencies, or inherently parallel with lots of tasks that can run independently. Problems that can be easily parallelized already work fine with multiprocessing! Problems that can't be easily parallelized are not something you can just slap some threading on to get more performance, and will require a lot of work to keep state synced!
This is just my opinion though and I'm sure there are plenty of domains that I don't have experience with that will benefit from no-GIL python!
> Problems that can be easily parallelized already work fine with multiprocessing!
Yeah, except afaik you pay more in context switches, and sharing is more cumbersome. Also, each process's language runtime is likely working with less information, and you end up using more memory on multiple runtime instances.
Frankly I'd just use Java or Go at that point and not even bother
Multithreading is hard but once you have been doing it a while, it becomes easy and most importantly, it’s stable.
When you have to deal with processes, there’s a lot of external factors out of your control because processes are much more visible and carry a lot of extra baggage.
Hard multithreading problems are fun. Hard multi-process problems are just tedious.
As I understand it on Linux processes and threads are implemented in almost the same way, just that threads share memory. I've heard it said several times that the idea that processes are "heavier" is a bit of a myth. I guess they need to allocate heap space and threads don't. I'm not an expert, just mentioning because it sounded like you might be believing something which is at odds with what people say about processes and threads on Linux.
I'm not a Linux kernel dev but I think this is true! Not sure what's up with the downvotes.
You can create a process/thread chimera with certain system calls, and get something that is in-between a thread and process if you want, which is neat but maybe not that useful.
Creating processes on Linux is actually much faster than people seem to realize. I can spawn at least a few thousand a second from a quick test of spawning bash instances.
Not sure why this is directed at my comment-- I didn't touch on synchronization.
Yes, locks like mutexes, semaphores, etc. and approaches like atomics, lockfree datastructures come into play when writing multithreaded code. There's no getting around that.
> In my experience, most problems are inherently synchronous with lots of mutable state and complex data dependencies, or inherently parallel with lots of tasks that can run independently. Problems that can be easily parallelized already work fine with multiprocessing!
This is a hot take though: most problems that are truly embarrassingly parallel don't work as well as you'd think w/ multiprocessing. There's a ton of overhead there, and when you do need synchronization steps (e.g. in reductions) it can get pretty messy.
I don't mean to pile on to what ghshephard already posted, but I'm afraid you've been breaking the site guidelines repeatedly lately - not just here but these:
... as well as others. Can you please not do this? We're trying for something different here. If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
> Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.
> When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."
Over quite some time I've become convinced the multiprocessing module is better than an optional GIL removal.
It may leave many useful bits on the table (compared to pure multithreaded coding, like C++/pthreads) but I've still been able to get it to scale my application performance (CPU-bound, large-memory) to the number of cores of even large boxes (96+ vCPUs). IIRC the concurrent.futures library was key to being productive.
20 years ago I would have said differently, as at the time IronPython demonstrated a real alternative to CPython that was faster and fully multithreaded (including the container classes).
Sure, with multiprocessing you can get 96 python processes running at 100% CPU while sharing a large dataset.
Only problem is that 99% of that CPU usage is for serializing/deserializing IPC messages and total throughput would have been higher using a single process.
There are use-cases for multiprocessing. As long as data sharing between processes is insignificant, it can be quite performant. Just like using a bash-wrapper script that orchestrates a bunch of python (or other) processes.
Whatever happened to ironpython? I used to do a lot of C# development and remember dabbling with ironpython back in the day. It seemed like it was important to Microsoft, .Net added the whole concept of dynamic data types mostly to support ironpython and ironruby. But I never really used python much until recently, so of course when I finally needed to do python I looked for ironpython and it doesn’t appear to be a thing anymore.
It looks like Microsoft abandoned these dynamic language implementations in 2010. Maintaining parallel implementations of two complex, mature scripting languages is a huge feat. It would take some very expensive talent. That said, IronPython was loved by those who used it, which means it captured them in the DotNet ecosystem. Perhaps that win was not enough for Microsoft to continue the project. Ideally, Python foundation should "own" (and fund) Jython and IronPython development, but that takes (a lot of) money. (Sorry, I'm much less familiar with Ruby and IronRuby.)
It is still a thing, but it's open source now instead of maintained by Microsoft. There was a release that finally supports Python 3 in December last year.
I don't know how useful it is really, if you really want performance then you probably shouldn't choose Python to begin with, or you use the libraries which may not be compatible with IronPython. These days it barely takes me longer to build a simple script in C# than in Python either.
It's so so.
Python's core value is its huge stack of libs. And the most important ones fall down with IronPython because they rely on C and so on.
When we needed python/C# interop it was better to use python.net and integrate that way. Annoying to set up, but when it works you can get both to work seamlessly.
I don't really partake in programming "wars", but the idea of launching a set of separate processes instead of separate threads to do a bunch of IOs has always seemed weird to me. Yes, I have built software using Python. Yes, I have done things as you suggest. Now I use asyncio, since the syntax has matured and I finally understand coroutines, runners, tasks etc. Let's see where the GIL-less Python takes us.
I'm confused. If you're doing a "bunch of IOs" then that's the situation where people use threads in Python, not processes. The argument for processes in Python is CPU-bound workloads.
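A minimal sketch of the I/O-bound case with threads. `fake_io` is a made-up stand-in for a blocking network or disk call; `time.sleep` releases the GIL much like real blocking I/O does.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(task_id):
    """Stand-in for a blocking request; the thread waits without holding the GIL."""
    time.sleep(0.05)
    return task_id * 2


def fetch_all(task_ids, max_workers=8):
    # Threads overlap the waiting time, so total wall time is roughly one
    # sleep per batch of max_workers tasks rather than one sleep per task.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_io, task_ids))


if __name__ == "__main__":
    print(fetch_all(range(8)))
```

If `fake_io` were pure-Python number crunching instead, the threads would contend for the GIL. That is the case where processes are the usual suggestion.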
Yup. I work at the Space Telescope Science Institute, where we maintain pipelines for astronomical data that move petabytes, among other things. All of the heavy lifting is done in Python.
Loading 100GB into RAM and then calling fork() is just painting a giant OOM Killer target on your back. It'll work until something breaks the CoWs or the parent gets restarted while some forks still linger or other fun things like that.
Threads make it transparent to the OS that this memory really must be shared between compute tasks.
While that does sometimes happen, I find the risk to be overstated. Most simple "allocate a large, complex data structure (e.g. dict of vectors of dataclasses) before creating a multiprocessing.Pool/Process/concurrent.futures.ProcessPoolExecutor and then refer to parts of it in the executor's jobs" work that deals in GBs of data does not suffer from copy-on-write-induced OOM issues in my experience. If the data in the shared memory isn't mutated in python, the refcount mutations are rarely enough to dirty more than a fraction of a percent of pages (though there are pathological allocation/reference schemes where that's not true).
If you do have memory issues, calling 'gc.freeze()' right before creating your multiprocessing.Pool/Process/concurrent.futures.ProcessPoolExecutor is sufficient to mitigate refcount-related page dirtying in the vast majority of cases. In the small remaining minority of cases, 'gc.disable()' as suggested by the freeze docs[1] may help. If that still doesn't do it, or if your page-dirtying is due to actual mutations of data (not just refcounts), it may be time to reach for actual shared memory instead[2][3].
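A minimal sketch of the pattern described above, assuming Linux (or another platform where the "fork" start method is available). `BIG_TABLE`, `lookup` and `run_lookups` are made-up names; the dict stands in for the large read-only structure.

```python
import gc
import multiprocessing

# Hypothetical stand-in for a large read-only structure built before forking.
BIG_TABLE = {i: i * i for i in range(100_000)}


def lookup(key):
    # Runs in a forked child and reads the inherited global. Only the small
    # key and the result cross the process boundary via pickling.
    return BIG_TABLE[key]


def run_lookups(keys, workers=2):
    # Move everything allocated so far into the permanent generation so the
    # garbage collector in the children leaves those objects alone.
    gc.freeze()
    ctx = multiprocessing.get_context("fork")  # assumption: Unix with fork
    with ctx.Pool(workers) as pool:
        return pool.map(lookup, keys)


if __name__ == "__main__":
    print(run_lookups([2, 3, 10]))
```

Reference-count updates in the children can still dirty some pages, as noted above. `gc.freeze()` only addresses the collector's share of that.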
This exists, but one of two things happens, and either one still significantly slows things down. Either 1) you generate multiple python instances or 2) you push the code to a different language. Both are cumbersome and have significant effects. The latter is more common in computational libraries like numpy or pytorch, but in this respect it is more akin to python being a wrapper for C/C++/Cuda. Your performance is directly related to the percentage of time your code spends within those computation blocks; otherwise you get hammered by IO operations.
You have to manually set up shared memory with its own API that has its own limitations, right? I thought some seamless integration was a new feature, but AFAICT, transfers between processes still lead to things being pickled and copied. Am I wrong?
Only partially. When you send things to a multiprocessing.Pool/concurrent.futures.ProcessPoolExecutor, they're pickled and copied. "Sending" happens when passing arguments to e.g. "multiprocessing.Pool.apply_async()", "multiprocessing.Queue.put()" or "concurrent.futures.ProcessPoolExecutor.submit()".
However, there are two other ways to share data into your multiprocessing processes:
1. Copy-on-write via fork(2). In this mode, globally-visible data structures in Python that were created before your Pool/ProcessPoolExecutor are made accessible to code in child processes for (nearly) free, with no pickling, and no copying unless they are mutated in the child process. Two caveats here, which I've discussed in other comments on this thread: mutation may occur via garbage collection even if you don't explicitly change fork-shared data in Python[1]; and fork(2) is not used by default in multiprocessing on MacOS or Windows[2].
2. Using explicit shared memory data structures provided by Multiprocessing[3][4]. These do not incur the overhead (in CPU or copied memory) that pickle-based IPC does, but they are not without complexity or cost.
Unfortunately, truly "seamless integration" is not really possible with multiprocessing, so users will have to use one or more of the above strategies according to their application needs.
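For the explicit shared memory option, a minimal sketch using `multiprocessing.shared_memory` (Python 3.8+). To stay self-contained it attaches to the block from the same process; normally the attach step would run in a different process that was handed the block's name. `roundtrip` is a made-up helper and assumes a non-empty payload.

```python
from multiprocessing import shared_memory


def roundtrip(payload: bytes) -> bytes:
    """Write bytes into a named shared block and read them back via a second handle."""
    # Creator side: allocate a named block and copy the data in once.
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        # The block may be rounded up to a page size, so slice explicitly.
        shm.buf[: len(payload)] = payload
        # Consumer side (normally another process): attach by name.
        # Only the short name has to be sent over; the data is not pickled.
        view = shared_memory.SharedMemory(name=shm.name)
        try:
            return bytes(view.buf[: len(payload)])
        finally:
            view.close()
    finally:
        shm.close()
        shm.unlink()  # release the block once every user is done with it


if __name__ == "__main__":
    print(roundtrip(b"hello"))
```

The complexity shows up quickly: the block is raw bytes, so anything richer than arrays or buffers needs its own layout, and someone has to own the `unlink()` call.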
If you have a non trivial application, multiprocessing just takes a lot of memory. Every child process that you create duplicates the parent memory. There are some interesting hacks like gc.freeze that exploit the copy-on-write behavior of fork to reduce memory, but ultimately you can only create a few hundred processes compared to thousands of threads because of memory consumption.
>If you have a non trivial application, multiprocessing just takes a lot of memory. Every child process that you create duplicates the parent memory.
Not really, unless you want to alter it. The OS uses copy on write behind the scenes for forked processes, so will use the same memory locations already loaded until/if you modify that. So parent memory isn't really duplicated.
As for any new memory allocated by each child process, that's its own.
Unfortunately the generational GC modifies bits all over the heap, so you have to use some tricks to really leverage copy on write (as the commenter alludes to).
The situation is a bit more complicated than this. Child processes usually don't duplicate parent memory, but that does happen on certain platforms (MacOS and Windows) on some Pythons. Additionally, the situation regarding unexpected page dirtying of copy-on-write memory is nuanced as well, which some of the sibling comments allude to.
I'll copy the tl;dr from another comment I've made nearby:
There are three main ways to share data into your multiprocessing processes:
1. By sending that data to them with IPC/pickling/copying, e.g. via "multiprocessing.Pool.apply_async()", "multiprocessing.Queue.put()" or "concurrent.futures.ProcessPoolExecutor.submit()".
2. Copy-on-write via fork(2). In this mode, globally-visible data structures in Python that were created before your Pool/ProcessPoolExecutor are made accessible to code in child processes for (nearly) free, with no pickling, and no copying unless they are mutated in the child process. Two caveats here, which I've discussed in other comments on this thread: mutation may occur via garbage collection even if you don't explicitly change fork-shared data in Python[1]; and fork(2) is not used by default in multiprocessing on MacOS or Windows[2].
3. Using explicit shared memory data structures provided by Multiprocessing[3][4].
Multiprocessing is great. But then every process keeps its own copy of hundreds of gigabytes of stuff. May be okay, depending on how many processes you spawn.
If the bulk of the data is immutable (or at least never mutated), it can be safely shared though, via shared memory.
> every process keeps its own copy of hundreds of gigabytes of stuff. May be okay, depending on how many processes you spawn
That depends on how you're using multiprocessing. If you're using the "spawn" multiprocessing-start method (which was set to the default on MacOS a few years ago[1], unfortunately), then every process re-starts python from the beginning of your program and does indeed have its own copy of anything not explicitly shared.
However, the "fork" and "forkserver" start methods make everything available in python before your multiprocessing.Pool/Process/concurrent.futures.ProcessPoolExecutor was created accessible for "free" (really: via fork(2)'s copy-on-write semantics) in the child processes without any added memory overhead. "fork" is the default startup mode on everything other than MacOS/Windows[2].
I find that those differing defaults are responsible for a lot of FUD around memory management regarding multiprocessing (some of which can be found in these comments!); folks who are watching memory while using multiprocessing on MacOS or Windows observe massively different memory consumption behavior than folks on Linux/BSD (which includes folks validating in Docker on MacOS/Windows). There's an additional source of FUD among folks who used Python on MacOS before the default was changed from "fork" to "spawn" and who assume the prior behavior still exists when it does not.
This sometimes results in the humorously counterintuitive situation of someone testing some Python code in Docker on MacOS/Windows observing far better performance inside Docker (and its accompanying virtual machine) than they observe when running that same code natively directly on the host operating system.
If you're on MacOS (not Windows) and wish to use the "fork" or "forkserver" behaviors of multiprocessing for memory sharing, do "export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES" in your shell before starting Python (modifying os.environ or calling os.putenv() in Python will not work), and then call "multiprocessing.set_start_method("fork", force=True)" in your entry point. Per the linked GitHub issue below, this can occasionally cause issues, but in my experience it does so rarely if ever.
Is what you're describing only true of the "Framework" Python build on MacOS? It sounds like that's the case from a quick read of the issue you linked. I would say that people should basically never use the "Framework" Python on MacOS. (There's some insanity IIRC where matplotlib wants you to use the Framework build? But that's matplotlib)
You can check the default process-start method of your Python's multiprocessing by running this command: "python -c 'import multiprocessing; print(multiprocessing.get_start_method())'"
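The same check can be done from inside a program, and a start method can be chosen per pool without changing the global default. A minimal sketch; `pick_context` and its preference order are made up for illustration.

```python
import multiprocessing


def pick_context(preferred=("fork", "forkserver", "spawn")):
    """Return a context for the first preferred start method this platform offers."""
    available = multiprocessing.get_all_start_methods()
    for method in preferred:
        if method in available:
            # get_context() does not touch the process-wide default,
            # unlike set_start_method().
            return multiprocessing.get_context(method)
    raise RuntimeError(f"none of {preferred} available; platform offers {available}")


if __name__ == "__main__":
    print("default:", multiprocessing.get_start_method())
    print("available:", multiprocessing.get_all_start_methods())
    print("picked:", pick_context().get_start_method())
```

The returned context exposes `Pool`, `Process`, `Queue` and friends, and it can be passed to `ProcessPoolExecutor` via `mp_context`.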
Python is also going to get a JIT eventually, so they’re fixing that too! One of the concerns with no gil was that it would make certain optimisations harder for the JIT, but it’s very cool to see both being worked on.
> Multiprocessing. The answer is to use the python multiprocessing module, or to spin up multiple processes behind wsgi or whatever.
I assume mod_wsgi under apache was not the answer here due to memory constraints. That being said, why not serve from disk and use redis for a cache? This should work well unless the queries had high cardinality.