*But the NT kernel is much more sophisticated and powerful than Linux* That does...

bitwize · on June 8, 2016

Actually it does, as it mentioned that the extra parameters are for things like async callbacks and partial results.

The I/O model that Windows supports is a strict superset of the Unix I/O model. Windows supports true async I/O, allowing process to start I/O operations and wait on an object like an I/O completion port for them to complete. Multiple threads can share a completion port, allowing for useful allocation of thread pools instead of thread-per-request.

In Unix all I/O is synchronous; asynchronicity must be faked by setting O_NONBLOCK and buzzing in a select loop, interleaving bits of I/O with other processing. It adds complexity to code to simulate what Windows gives you for real, for free. And sometimes it breaks down; if I/O is hung on a device the kernel considers "fast" like a disk, that process is hosed until the operation completes or errors out.

piscisaureus · on June 8, 2016

I wrote the windows bits for libuv (node.js' async i/o library), so I have extensive experience with asynchronous I/O on Windows, and my experience doesn't back up parent's statement.

Yes, it's true that many APIs would theoretically allow kernel-level asynchronous I/O, but in practice the story is not so rosy.

* Asynchronous disk I/O is in practice often not actually asynchronous. Some of these cases are documented (https://support.microsoft.com/en-us/kb/156932), but asychronous I/O also actually blocks in cases that are not listed in that article (unless the disk cache is disabled). This is the reason that node.js always uses threads for file i/o.

* For sockets, the downside of the 'completion' model that windows is that the user must pre-allocate a buffer for every socket that it wants to receive data on. Open 10k sockets and allocate a 64k receive buffer for all of them - that adds up quickly. The unix epoll/kqueue/select model is much more memory-efficient.

* Many APIs may support asynchronous operation, but there are blatant omissions too. Try opening a file without blocking, or reading keyboard input.

* Windows has many different notification mechanisms, but none of them are both scalable and work for all types of events. You can use completion ports for files and sockets (the only scalable mechanism), but you need to use events for other stuff (waiting for a process to exit), and a completely different API to retrieve GUI events. That said, unix uses signals in some cases which are also near impossible to get right.

* Windows is overly modal. You can't use asynchronous operations on files that are open in synchronous mode or vice versa. That mode is fixed when the file/pipe/socket is created and can't be changed after the fact. So good luck if a parent process passes you a synchronous pipe for stdout - you must special case for all possible combinations.

* Not to mention that there aren't simple 'read' and 'write' operations that work on different types of I/O streams. Be ready to ReadFileEx(), Recv(), ReadConsoleInput() and whatnot.

IMO the Windows designers got the general idea to support asynchronous I/O right, but they completely messed up all the details.

trentnelson · on June 9, 2016

You're completely missing how the NT I/O subsystem works, and how to use it optimally.

> * Asynchronous disk I/O is in practice often not actually asynchronous. Some of these cases are documented (https://support.microsoft.com/en-us/kb/156932), but asychronous I/O also actually blocks in cases that are not listed in that article (unless the disk cache is disabled). This is the reason that node.js always uses threads for file i/o.

The key to NT asynchronous I/O is understanding that the cache manager, memory manager and file system drivers all work in harmony to allow a ReadFile() request to either immediately return the data if it is available in the cache, and if not, indicate to the caller that an overlapped operation has been started.

Things like extending a file, opening a file, that's not typically hot-path stuff. If you're doing a network oriented socket server, you would submit such a blocking operation to a separate thread pool (I set up separate thread pools for wait events, separate to the normal I/O completion thread pools), and then that I/O thread moves on to the next completion packet in its queue.

> * For sockets, the downside of the 'completion' model that windows is that the user must pre-allocate a buffer for every socket that it wants to receive data on. Open 10k sockets and allocate a 64k receive buffer for all of them - that adds up quickly. The unix epoll/kqueue/select model is much more memory-efficient.

Well that's just flat out wrong. You can set your socket buffer size as large or as small as you want. For PyParallel I don't even use an outgoing send buffer.

Also, the new registered I/O model in 8+ is a much better way to handle socket buffers without the constant memcpy'ing between kernel and user space.

> IMO the Windows designers got the general idea to support asynchronous I/O right, but they completely messed up all the details.

I disagree. Write a kernel driver on Linux and NT and you'll see how much more superior the NT I/O subsystem is.

haberman · on June 9, 2016

> The key to NT asynchronous I/O is understanding that the cache manager, memory manager and file system drivers all work in harmony to allow a ReadFile() request to either immediately return the data if it is available in the cache, and if not, indicate to the caller that an overlapped operation has been started.

The Microsoft article cited above (https://support.microsoft.com/en-us/kb/156932) directly contradicts you:

> Be careful when coding for asynchronous I/O because the system reserves the right to make an operation synchronous if it needs to. Therefore, it is best if you write the program to correctly handle an I/O operation that may be completed either synchronously or asynchronously.

Microsoft is directly saying that it reserves the right to violate the guarantee you are counting on at any time, and it documents several known cases of this. You can try to guess when this will happen and put those I/O operations on a different thread pool, but you're just playing whack-a-mole. And you're violating Microsoft's own recommendations.

trentnelson · on June 9, 2016

That's not a particularly good article with regards to high performance techniques.

You wouldn't be using compression or encryption for a file that you wanted to be able to submit asynchronous file I/O writes to in a highly concurrent network server. Those have to be synchronous operations. You'd do everything you can to use TransmitFile() on the hot path.

If you need to sequentially write data, wanted to employ encryption or compression, and reduce the likelihood of your hot-path code blocking, you'd memory map file-sector-aligned chunks at a time, typically in a windowed fashion, such that when you consume the next one you submit threadpool work to prepare the one after that (which would extend the file if necessary, create the file mapping, map it as a view, and then do an interlocked push to the lookaside list that the hot-path thread will use).

I use that technique, and also submit prefaults in a separate threadpool for the page ahead of the next page as I consume records I'm writing to. Before you can write to a page, it needs to be faulted in, and that's a synchronous operation, so you'd architect it to happen ahead of time, before you need it, such that your hot-path code doesn't get blocked when it writes to said page.

That works incredibly well, especially when you combine it with transparent NTFS compression, because the file system driver and the memory manager are just so well integrated.

If you wanted to do scatter/gather random I/O asynchronously, you'd pre-size the file ahead of time, then simply dispatch asynchronous writes for everything, possibly leveraging SetFileIoOverlappedRange such that the kernel locks all the necessary sections into memory ahead of time.

And finally, what's great about I/O completion ports in general is they are self-aware of their concurrency. The rule is always "never block". But sometimes, blocking is inevitable. Windows can detect when a thread that was servicing an I/O completion port has blocked and will automatically mark another thread as runnable so the overall concurrency of the server isn't impacted (or rather, other network clients aren't impacted by a thread's temporary blocking). The only service that's affected is to the client that triggered whatever blocking I/O call there was -- it would be indistinguishable (from a latency perspective) to other clients, because they're happily being picked up by the remaining threads in the thread pool.

I describe that in detail here: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...

> > Be careful when coding for asynchronous I/O because the system reserves the right to make an operation synchronous if it needs to. Therefore, it is best if you write the program to correctly handle an I/O operation that may be completed either synchronously or asynchronously.

That's not the best wording they've used given the article is also talking about blocking. If you've followed my guidelines above, a synchronous return is actually advantageous for file I/O because it means your request was served directly from the cache, and no overlapped I/O operation had to be posted.

And you know all of the operations that will block (and they all make sense when you understand what the kernel is doing behind the scenes), so you just don't do them on the hot path. It's pretty straight forward.

4ad · on June 9, 2016

> Write a kernel driver on Linux and NT and you'll see how much more superior the NT I/O subsystem is.

I wrote Windows drivers and file systems for about 10 years, and Unix drivers and file systems also for about 10 years.

I'd rather practice substance agriculture for the rest of my life than deal with Windows drivers again.

trentnelson · on June 9, 2016

Yeah it's not a simple affair at all. It's a lot easier these days though, and the static verifier stuff is very good.

thwarted · on June 9, 2016

I disagree. Write a kernel driver on Linux and NT and you'll see how much more superior the NT I/O subsystem is.

Can programming against the userspace interface the I/O subsystem really be compared to programming against the kernel driver interface to I/O subsystem? In Linux, kernel drivers have access to structures, services, and layers that userspace doesn't. And can these be compared between a monolithic and a micro-kernel approach, other than what has been debated ad nauseam for micro/monolithic kernels in general (not just used for I/O)?

trentnelson · on June 9, 2016

I didn't make my point particularly well there to be honest. Writing an NT driver is incredibly more complicated than an equivalent Linux one, because your device needs to be able to handle different types of memory buffers, support all the Irp layering quirks, etc.

I just meant that writing an NT kernel driver will really give you an appreciation of what's going on behind the scenes in order to facilitate awesome userspace things like overlapped I/O, threadpool completion routines, etc.

SwellJoe · on June 8, 2016

I have never worked with systems-level Windows programming, so I don't know the answer to this...but, how is what you're describing better than epoll or aio in Linux or kqueues on the BSDs?

I'm guessing you're coming from the opposite position of ignorance I am (i.e. you've worked on Windows, but not Linux or other modern UNIX), though, since "setting O_NONBLOCK and buzzing in a select loop, interleaving bits of I/O with other processing" doesn't describe anything developed in many, many years. 15 years ago select was already considered ancient.

MarkSweep · on June 8, 2016

I think IO Completion Ports [1] in Windows are pretty similar to kqueue [2] in FreeBSD and Event Ports [3] in Illumos & Solaris. All of them are unified methods methods for getting change notifications on IO events and file system changes. Event Ports and kqueue also handle unix signals and timers.

Windows will also take care of managing a thread pool to handle the event completion callbacks by means of BindIoCompletionCallback [4]. I don't think kqueue or Event Ports has a similar facility.

[1]: https://msdn.microsoft.com/en-us/library/windows/desktop/aa3... [2]: https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2 [3]: https://illumos.org/man/3C/port_create [4]: https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...

trentnelson · on June 9, 2016

BindIoCompletionCallback is very old, the new threadpool APIs should be used, e.g.: https://github.com/pyparallel/pyparallel/blob/branches/3.3-p...

Regarding the differences between IOCP and epoll/kqueue, it all comes down to completion-oriented versus readiness-oriented.

https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...

MarkSweep · on June 9, 2016

The documentation for the new thread pool APIs in Windows say that it is more performant than the old ones. Do you have any experience transitioning from the old thread pool to the new one, and if so did you notice performance differences?

dblohm7 · on June 9, 2016

Last decade I wrote a server based on the old API and BindIoCompletionCallback. I can't comment on the performance differences, but the newer threadpool API is much better to work with.

IMO one of the biggest problems with the old API is that there is only one pool. This can cause crazy deadlocks due to dependencies between work items executing in the pool. The new API allows you to create isolated pools.

The old API did not give you the ability to "join" on its work items before shutting down. You could roll your own solution if you knew what you were doing, but more naive developers would get burned.

You also get much finer grained control over resources.

trentnelson · on June 9, 2016

I'm really lucky to have decided to dig into this stuff (for PyParallel) around ~2011, by which stage Windows Vista (where all the new APIs had been introduced) had been out for like, 5-6 years, so no, I never had to deal with the old APIs, thankfully.

I really sympathize with Len Holgate though... he's had to nurse Server Framework through all of that transition, which must have been painful: http://www.serverframework.com/asynchronousevents/2016/06/an...

trentnelson · on June 8, 2016

To quote myself:

> The “Why Windows?” (or “Why not Linux?”) question is one I get asked the most, but it’s also the one I find hardest to answer succinctly without eventually delving into really low-level kernel implementation details. >

> You could port PyParallel to Linux or OS X -- there are two parts to the work I’ve done: a) the changes to the CPython interpreter to facilitate simultaneous multithreading (platform agnostic), and b) the pairing of those changes with Windows kernel primitives that provide completion-oriented thread-agnostic high performance I/O. That part is obviously very tied to Windows currently. >

> So if you were to port it to POSIX, you’d need to implement all the scaffolding Windows gives you at the kernel level in user space. (OS X Grand Central Dispatch was definitely a step in the right direction.) So you’d have to manage your threadpools yourself, and each thread would have to have its own epoll/kqueue event loop. The problem with adding a file descriptor to a per-thread event loop’s epoll/kqueue set is that it’s just not optimal if you want to continually ensure you’re saturating your hardware (either CPU cores or I/O). You need to be able to disassociate the work from the worker. The work is the invocation of the data_received() callback, the worker is whatever thread is available at the time the data is received. As soon as you’ve bound a file descriptor to a per-thread set, you prevent thread migration >

> Then there’s the whole blocking file I/O issue on UNIX. As soon as you issue a blocking file I/O call on one of those threads, you have one thread less doing useful work, which means you’re increasing the time before any other file descriptors associated with that thread’s multiplex set can be served, which adversely affects latency. And if you’re using the threads == ncpu pattern, you’re going to have idle CPU cycles because, say, only 6 out of your 8 threads are in a runnable state. So, what’s the answer? Create 16 threads? 32? The problem with that is you’re going to end up over-scheduling threads to available cores, which results in context switching, which is less optimal than having one (and only one) runnable thread per core. I spend some time discussing that in detail here: https://speakerdeck.com/trent/parallelism-and-concurrency-wi.... (The best example of how that manifests as an issue in real life is `make –jN world` -- where N is some magic number derived from experimentation, usually around ncpu X 2. Too low, you’ll have idle CPUs at some point, too high and the CPU is spending time doing work that isn’t directly useful. There’s no way to say `make –j[just-do-whatever-you-need-to-do-to-either-saturate-my-I/O-channels-or-CPU-cores-or-both]`.) >

> Alternatively, you’d have to rely on AIO on POSIX for all of your file I/O. I mean, that’s basically how Oracle does it on UNIX – shared memory, lots of forked processes, and “AIO” direct-write threads (bypassing the filesystem cache – the complexities of which have thwarted previous attempts on Linux to implement non-blocking file I/O). But we’re talking about a highly concurrent network server here… so you’d have to implement userspace glue to synchronize the dispatching of asynchronous file I/O and the per-thread non-blocking socket epoll/kqueue event loops… just… ugh. Sure, it’s all possible, but imagine the complexity and portability issues, and how much testing infrastructure you’d need to have. It makes sense for Oracle, but it’s not feasible for a single open source project. The biggest issue in my mind is that the whole thing just feels like forcing a square peg through a round hole… the UNIX readiness file descriptor I/O model just isn’t well suited to this sort of problem if you want to optimally exploit your underlying hardware. >

> Now, with Windows, it’s a completely different situation. The whole kernel is architected around the notion of I/O completion and waitable events, not “file descriptor readiness”. This seems subtle but it pervades every single aspect of the system. The cache manager is tightly linked to the memory management and I/O manager – once you factor in asynchronous I/O this becomes incredibly important because of the way you need to handle memory locking for the duration of the I/O request and the conditions for synchronously serving data from the cache manager versus reading it from disk. The waitable events aspect is important too – there’s not really an analog on UNIX. Then there’s the notion of APCs instead of signals which again, are fundamentally different paradigms. The digger you deep the more you appreciate the complexity of what Windows is doing under the hood. >

> What was fantastic about Vista+ is that they tied all of these excellent primitives together via the new threadpool APIs, such that you don’t need to worry about creating your own threads at any point. You just submit things to the threadpool – waitable events, I/O or timers – and provide a C callback that you want to be called when the thing has completed, and Windows takes care of everything else. I don’t need to continually check epoll/kqueue sets for file descriptor readiness, I don’t need to have signal handlers to intercept AIO or timers, I don’t need to offload I/O to specific I/O threads… it’s all taken care of, and done in such a way that will efficiently use your underlying hardware (cores and I/O bandwidth), thanks to the thread-agnosticism of Windows I/O model (which separates the work from the worker). >

> Is there something simple that could be added to Linux to get a quick win? Or would it require architecting the entire kernel? Is there an element of convergent evolution, where the right solution to this problem is the NT/VMS architecture, or is there some other way of solving it? I’m too far down the Windows path now to answer that without bias. The next 10 years are going to be interesting, though.

https://groups.google.com/forum/#!topic/framework-benchmarks...

wahern · on June 9, 2016

"The problem with adding a file descriptor to a per-thread event loop’s epoll/kqueue set is that it’s just not optimal if you want to continually ensure you’re saturating your hardware (either CPU cores or I/O)"

Both epoll and kqueue permit multiple threads to poll the same event set. Normally you do this in tandem with edge-triggered readiness (EPOLLET on Linux, EV_CLEAR on BSD) so that only one thread will dequeue an event.

How do think IOCP is implemented in Windows? There's a thread pool in the kernel which _literally_ polls a shared event queue. It's just hidden so you can pretend it's magical. But conceptually it works almost identically to how you would do it in Unix.

The benefit of IOCP is that it's a native API. It's warts and shortcomings notwithstanding, developers never even need to think about how it's actually implemented. Whereas with epoll and kqueue you either have to roll your own framework, or select from various third-party options. Seeing how the sausage is made can turn some people off. But just because you don't see the gory details doesn't mean it's implemented using magical fairy dust.

There's much to recommend Windows, and many things the NT kernel conceptually gets right. But IOCP vs polling? The only real difference architecturally is how much of the stack sits in user-space vs kernel-space, and how much of the stack is delivered by Microsoft (all of if in the case of IOCP) vs other sources (in Linux, glibc does AIO, while all the event loop and callback code is provided by various libraries or written yourself).

Putting more of the stack in kernel-space doesn't magically make it easier to perform optimizations. That's marketing speak and kernel fetishism. You have to first show why those optimizations can't be achieved elsewhere, like in the I/O or process scheduler. Various Linux components traditionally are more performant (e.g. process scheduling) than in Windows, so many of the optimizations wrt IOCP is arguably clawing back performance lost elsewhere in the system.

zxcvcxz · on June 8, 2016

http://pyparallel.org/wrk-rps-comparison2.svg

According to your website pretty much every other technology runs better on Linux than it does on Windows, and of course pyparallel runs better than everything you tested.

How can I run these tests my self? I specifically want to test it against golang.

trentnelson · on June 9, 2016

See here: https://www.reddit.com/r/programming/comments/3jhv80/pyparal...

Source: https://github.com/tpn/pyparallel-tefb

Go-specific comment thread: https://www.reddit.com/r/programming/comments/3jhv80/pyparal...

zxcvcxz · on June 9, 2016

I don't see where you take advantage of go routines and I don't see a real world use case unless it's operating on concurrent connections.

https://www.reddit.com/r/programming/comments/3jhv80/pyparal...

heh..

trentnelson · on June 9, 2016

Patches welcome?

abaines · on June 9, 2016

I believe Linux since 2.5 has 'proper' asynchronous system calls, io_getevents(2) and co.

Further information: https://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt

inopinatus · on June 9, 2016

Based on what you describe here, I'd say your comparative understanding is about two decades out of date. Async I/O has been a capability in various Unix and Unix-alike kernels for that long.

trentnelson · on June 9, 2016

As in signal based AIO? Have you ever tried to use it in a high performance network server, where you want to have reads also satisfied from the cache if possible?

Because that is like pulling teeth on UNIX. See: https://groups.google.com/forum/#!topic/framework-benchmarks...