tl;dr
To give a definite answer, more tests would be necessary. But circumstantial evidence suggests Linux is the OS used practically exclusively in the ultra-low-latency community, which also routinely processes Mpps workloads. That does not mean it is impossible with Windows, but Windows will probably lag behind quite a bit, even if Mpps numbers turn out to be achievable. That needs testing to be ascertained, as does e.g. the (CPU) cost at which those numbers can be achieved.
N.B. This is not an answer I intend to accept. It is intended to give anyone interested in an answer to the question some hints about where we stand and where to investigate further.
Len Holgate, who according to Google seems to be the only one who has tested RIO to get more performance out of Windows networking (and published the results), just clarified in a comment on his blog that he was using a single IP/port combo for sending the UDP packets.
In other words, his results should be somewhat comparable to the single-core figures in tests on Linux (although he is using 8 threads - which, without having checked his code yet, seems harmful for performance when handling just a single UDP packet stream without any heavy per-packet processing; and he mentions that only a few threads are actually used, which would make sense). That is despite him saying:
> I wasn't trying that hard to get maximum performance just to compare relative performance between old and new APIs and so I wasn't that thorough in my testing.
But what is giving up the (relative) comfort zone of standard IOCP for the rougher RIO world, if not "trying hard"? At least as far as a single UDP packet stream is concerned.
I guess what he means - as he did try various design approaches in several tests of RIO - is that he did not, e.g., fine-tune NIC settings to squeeze out the last bit of performance. In the case of Receive Buffer Size, for example, that could have a huge positive impact on UDP receive performance and packet loss figures.
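The Receive Buffer Size Len could have tuned is a NIC driver property, but the same idea exists one level up, at the socket: enlarging the socket receive buffer so a burst of packets is not silently dropped before the application reads them. A minimal sketch (SO_RCVBUF is standard; note the OS may clamp the requested size, e.g. to `net.core.rmem_max` on Linux):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Request a 4 MiB receive buffer; at Mpps rates a short burst can
# overflow the default-sized buffer and silently drop datagrams.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)

# The kernel may grant less than requested (or, on Linux, double the
# request for bookkeeping overhead) - always read back the effective value.
effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(effective)
sock.close()
```

Reading the value back matters: a setsockopt that "succeeds" is no guarantee the kernel actually granted what was asked for.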
The problem however when trying to directly compare his results with those of other Linux/Unix/BSD tests is this: most tests, when trying to push the "packets per second" boundary, use the smallest possible packet/frame size, i.e. an Ethernet frame of 64 bytes. Len tested 1024-byte packets (-> a 1070-byte frame), which can get you much higher "bits per second" figures, but may not push the pps boundary as far as it could with smaller packets. So it would not be fair to compare these figures as is.
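The 1070-byte frame figure can be checked by adding up the headers involved (assuming plain IPv4 with no IP options and no VLAN tag):

```python
UDP_PAYLOAD = 1024
UDP_HEADER = 8    # source/dest port, length, checksum
IPV4_HEADER = 20  # without options
ETH_HEADER = 14   # dest MAC, src MAC, EtherType
ETH_FCS = 4       # frame check sequence

frame = UDP_PAYLOAD + UDP_HEADER + IPV4_HEADER + ETH_HEADER + ETH_FCS
print(frame)  # 1070
```

The same arithmetic explains the 64-byte figure for the minimum-size case: 64 bytes is the smallest legal Ethernet frame, leaving only 18 bytes of UDP payload once the headers above are subtracted.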
Summing up the results of my quest into Windows UDP receive performance so far:
- Practically no one is using Windows when developing ultra-low-latency and/or high-throughput applications these days; they are using Linux
- Practically all performance tests and reports with actual results (i.e. not mere product advertisement) these days are on Linux or BSD (thanks Len for being a pioneer and giving us at least one point of reference!)
- Is UDP (standard sockets) on Windows faster/slower than on Linux? I can't tell yet; I would have to do my own testing
- Is high-performance UDP (RIO vs netmap) on Windows faster/slower than on Linux? Linux easily handles full 10GbE line speed with a single core at 900MHz. Windows, in the best published case, is able to go up to 43% of line rate, or 492kpps, for a large UDP packet size of 1024 bytes; bps figures for smaller sizes will probably be significantly worse, although pps figures will probably rise (unless interrupt handling or some other kernel-space overhead is the limiting factor)
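The 43% figure can be reproduced from the theoretical 10GbE line rate. On the wire, each frame additionally costs a 7-byte preamble, a 1-byte start-of-frame delimiter, and a 12-byte inter-frame gap (20 bytes total):

```python
LINE_RATE = 10e9    # 10GbE, in bits per second
WIRE_OVERHEAD = 20  # preamble (7) + SFD (1) + inter-frame gap (12), bytes

def line_rate_pps(frame_bytes):
    """Maximum packets per second at line rate for a given frame size."""
    return LINE_RATE / ((frame_bytes + WIRE_OVERHEAD) * 8)

print(round(line_rate_pps(64)))    # 64-byte minimum frames: ~14.88 Mpps
print(round(line_rate_pps(1070)))  # Len's 1070-byte frames: ~1.147 Mpps

# 492 kpps out of ~1.147 Mpps is ~43% of line rate
print(round(492_000 / line_rate_pps(1070) * 100))  # ~43
```

This also shows why small packets are the harder test: full line rate at 64-byte frames means handling roughly 13x more packets per second than at 1070-byte frames.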
As to why they use Linux: developing solutions that involve kernel changes like netmap or RIO - necessary when pushing performance to the limits - is near impossible with a closed system like Windows, unless your paychecks happen to come out of Redmond, or you have some special contract with Microsoft. Which is why RIO is an MS product.
Finally, just to give a few extreme examples of what I discovered was and is going on in Linux land:
Already 15 years ago, some were receiving 680kpps using an 800MHz Pentium III CPU with a 133MHz front-side bus on a 1GbE NIC. Edit: they were using Click, a kernel-mode router that bypasses much of the standard network stack, i.e. they "cheated".
In 2013, Argon Design managed to get
> tick to trade latencies as low as 35ns [nanoseconds]
Btw they also claim that
> The vast majority of existing computing code for trading today is written for Linux on x86 processor architectures.
and Argon use the Arista 7124FX switch, which (in addition to an FPGA) has an OS
> built on top of a standard Linux kernel.