I am investigating whether I can implement an HPC app on Windows that receives small UDP multicast datagrams (mostly 100-400 bytes) at a high rate, using a dozen or up to maybe 200 multicast groups (so that I can scale to multiple cores using MSI-X and RSS), does some processing per packet, and then sends it out. Sending via TCP, I got as far as I needed to (6.4 Gb/s) without hitting a wall, but receiving datagrams at high pps rates turned out to be an issue.

In a recent test on a high-spec NUMA machine (2x12 cores) with a 2-port 10 Gb Ethernet NIC on Windows Server 2012 R2, I was only able to receive hundreds of thousands of UDP datagrams per second. The test used early drop, i.e. it discarded the data without actually processing it, to take the processing overhead of my app out of the equation and see how fast reception alone gets. The kernel part of the 12 multicast groups tested seemed to get distributed across 8 or 10 cores of one NUMA node (Max RSS queues was set to 16). This was with a .NET app, though, so native apps should be able to go faster.

But even Len Holgate only managed to receive UDP packets at 500kpps in his high-performance Windows RIO tests, using a UDP payload of 1024 bytes.

In QLogic's whitepaper (OS under test not mentioned) the limits for "multi-threaded super-small packet routing" (so that includes both receiving and subsequent sending?) are set at 5.7Mpps. In articles on Linux networking, the limits are set at 1Mpps to 2Mpps per core (reportedly scaling up more or less linearly), or even 15Mpps with special solutions that bypass the kernel.

E.g. netmap

can generate traffic at line rate (14.88Mpps) on a 10GigE link with just a single core running at 900Mhz. This equals to about 60-65 clock cycles per packet, and scales well with cores and clock frequency (with 4 cores, line rate is achieved at less than 450 MHz). Similar rates are reached on the receive side.
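
The figures in that quote can be checked with a bit of arithmetic. The sketch below assumes minimum-size 64-byte Ethernet frames plus the usual 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter and inter-frame gap); the function names are made up for this illustration.

```python
# Per-frame overhead on the wire that is not part of the Ethernet frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def max_frames_per_second(link_bps: float, frame_bytes: int) -> float:
    """Theoretical rate of back-to-back frames of frame_bytes (including FCS)."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


def cycles_per_packet(cpu_hz: float, pps: float) -> float:
    """CPU cycles available per packet on a single core clocked at cpu_hz."""
    return cpu_hz / pps


line_rate = max_frames_per_second(10e9, 64)
print(f"{line_rate / 1e6:.2f} Mpps")                                # 14.88 Mpps
print(f"{cycles_per_packet(900e6, line_rate):.1f} cycles/packet")   # 60.5 cycles/packet
```

So 14.88Mpps is simply the 10GigE line rate for minimum-size frames, and a 900 MHz core has roughly 60 cycles to spend on each of them.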

So how far can I take (the latest versions of) Windows / Windows Server, in particular receiving UDP multicast as described in the leading paragraph?
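
For reference, the kind of early-drop receive test mentioned above can be sketched roughly as follows. This is a minimal Python illustration, not the .NET code used in my test; the helper names (`make_mreq`, `open_receiver`, `count_datagrams`), the group addresses and the port are made up for the example. An interpreted receive loop like this will top out far below the rates discussed here, so it only shows the method: join the groups, receive, discard, count.

```python
import socket
import time


def make_mreq(group: str, iface: str = "0.0.0.0") -> bytes:
    """Pack an ip_mreq structure: group address followed by local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface)


def open_receiver(groups, port, iface="0.0.0.0"):
    """Open one UDP socket bound to port and join every multicast group in groups."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    for group in groups:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        make_mreq(group, iface))
    return sock


def count_datagrams(sock, seconds, bufsize=2048):
    """Receive and immediately discard datagrams; return the observed rate in pps."""
    buf = bytearray(bufsize)
    sock.settimeout(0.1)  # wake up regularly so the deadline is honoured
    count = 0
    start = time.perf_counter()
    deadline = start + seconds
    while time.perf_counter() < deadline:
        try:
            sock.recv_into(buf)  # early drop: the payload is never looked at
            count += 1
        except socket.timeout:
            pass
    return count / (time.perf_counter() - start)


# Example use (group addresses and port are placeholders):
# sock = open_receiver(["239.1.1.1", "239.1.1.2"], 5000)
# print(f"{count_datagrams(sock, 5.0):.0f} pps")
```

A single socket like this funnels all groups through one receive loop. To spread the load across cores, one would use several sockets and threads or processes, which this sketch leaves out.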

Edit: There's a Cloudflare blog post, with an interesting comment section, on how to do it on Linux: How to receive a million packets per second, and there's the corresponding Hacker News comments page.

  • @Ramhound In theory, it is probably possible in Windows. But how is it possible in practice? By now I've come across quite a few reports from people achieving these levels in Linux on standard hardware, but not a single one getting anywhere close in Windows. And how do you think I could reduce the scope of the question? It's just this: "What are the highest UDP multicast receive rates in Windows?". The bulk of the text in my question is just examples that ought to show it's possible with Linux - and that I've done my homework. Commented Jun 2, 2015 at 23:16
  • @Ramhound 'If its possible on Linux its possible on Windows'. I respectfully disagree. One system that instantly comes to mind is iptables. Yeah, good luck mimicking that system on Windows. ^_^ Commented Jun 5, 2015 at 0:44
  • I wasn't actually trying that hard, so you could always take all of the code that I have available for the RIO testing I did and continue pushing. Commented Jun 5, 2015 at 12:36

3 Answers


According to Microsoft, tests in their lab showed that "on a particular server in early testing" of RIO, they were able to handle

  • 2Mpps without loss in Windows Server 2008R2, i.e. without RIO
  • 4Mpps on (pre-release) Windows Server 8 using RIO

Screenshot from that video (44:33):

[screenshot of the presentation slide]

So the answer to my question "Is it possible to process millions of datagrams per second with Windows?" would be: Yes, and apparently it was even before RIO, in Windows Server 2008R2.

But official figures, especially on unreleased software, have to be taken with a pinch of salt. And with only the sparse information given in this presentation, many questions remain about the test, and hence about how to properly interpret the results. The most relevant ones are:

  1. Are the figures for Sending? Receiving? Or maybe for Routing (i.e. Receive + Send)?
  2. What packet size? -> Probably the lowest possible, as is generally done when trying to get pps figures to brag about
  3. How many connections (if TCP) / packet streams (if UDP)? -> Probably as many as necessary to distribute the workload so all cores present can be used
  4. What test setup? Machine and NIC specs and wiring

The first one is crucial, because Sends and Receives require different steps and can show substantial differences in performance. For the other questions, we can probably assume that the lowest packet size, with at least one connection/packet stream per core, was used on a high-spec machine to get the maximum possible Mpps figures.

Edit: I just stumbled upon an Intel document on High Performance Packet Processing on Linux, and according to that, the (Linux)

platform can sustain a transaction rate of about 2M transactions per second

using the standard Linux networking stack (on a physical host with 2x8 cores). A transaction in this request/reply test includes both

  1. reception of a UDP packet
  2. subsequent forwarding of that packet

(using netperf's netserver). The test was running 100 transactions in parallel. There are many more details in the paper, for those interested. I wish we had something like this for Windows to compare... Anyway, here's the most relevant chart for that request/reply test:

[chart of the request/reply test results]

  • How are your successes in comparing the Windows and Linux network stacks? Is Windows as bad as it sounds? 4Mpps overall is not that much if they were using all cores, since Linux is capable of pushing 1-2 Mpps per core with excellent scaling. I'm afraid Windows Server could lag behind by as much as 10x in overall scaling (Mpps) on multicore configurations with hundreds of CPU cores (think Epyc, etc.). And to date no one has tried to put Windows Server to an honest test... Commented Jun 15, 2020 at 20:55
  • @DmitrySychov My tests of UDP packet reception on Windows vs Linux at the time showed that we were able to receive about 10 times as many packets without drops in Linux. I can't say if this has changed, 5 years on. We are using Linux for this kind of thing. Commented Jun 18, 2020 at 23:10
  • Thank you for your comment - really appreciated! Given the lack of proper testing, the almost non-existent RIO adoption, and the inability of MS to provide proper benchmarks, we were forced to switch to Linux too. BTW, Linux brings exciting new stuff (including registered buffers support) with the 5+ kernels, called io_uring - check it out if you haven't already. :) Regards, Dmitry Commented Jul 14, 2020 at 21:14


To give a definite answer, more tests seem necessary. But circumstantial evidence suggests Linux is the OS used practically exclusively in the ultra low latency community, which also routinely processes Mpps workloads. That does not mean it is impossible with Windows, but Windows will probably lag behind quite a bit, even though it may be possible to achieve Mpps numbers. But that needs testing to be ascertained, and e.g. to figure out at what (CPU) cost those numbers can be achieved.

N.B. This is not an answer I intend to accept. It is intended to give anyone interested in an answer to the question some hints about where we stand and where to investigate further.

Len Holgate, who according to Google seems to be the only one who has tested RIO to get more performance out of Windows networking (and published the results), just clarified in a comment on his blog that he was using a single IP/port combination for sending the UDP packets.

In other words, his results should be somewhat comparable to the single-core figures in tests on Linux. He is using 8 threads, though, which (without my having checked his code yet) seems harmful for performance when handling just a single UDP packet stream without any heavy processing of the packets. He mentions that only a few threads are actually used, which would make sense. That is despite him saying:

I wasn't trying that hard to get maximum performance just to compare relative performance between old and new APIs and so I wasn't that thorough in my testing.

But what is giving up the (relative) comfort zone of standard IOCP for the rougher RIO world other than "trying hard", at least as far as a single UDP packet stream is concerned?

I guess what he means - as he did try various design approaches in several tests of RIO - is that he did not, for example, fine-tune NIC settings to squeeze out the last bit of performance. In the case of Receive Buffer Size, for instance, such tuning could potentially have a huge positive impact on UDP receive performance and packet loss figures.

The problem, however, when trying to directly compare his results with those of other Linux/Unix/BSD tests is this: Most tests, when trying to push the "packets per second" boundary, use the smallest possible packet/frame size, i.e. an Ethernet frame of 64 bytes. Len tested 1024 byte packets (-> a 1070 byte frame), which (especially for No-Nagle UDP) can get you much higher "bits per second" figures, but may not push the pps boundary as far as it could with smaller packets. So it would not be fair to compare these figures as is.

Summing up the results of my quest into Windows UDP receive performance so far:

  • No one really is using Windows when trying to develop ultra low latency and/or high throughput applications; these days they are using Linux
  • Practically all performance tests and reports with actual results (i.e. not mere product advertisement) these days are on Linux or BSD (thanks Len for being a pioneer and giving us at least one point of reference!)
  • Is UDP (standard sockets) on Windows faster/slower than on Linux? I can't tell yet, I would have to do my own testing
  • Is high-performance UDP (RIO vs netmap) on Windows faster/slower than on Linux? Linux easily handles full 10Gb line speed with a single core at 900MHz. Windows, in the best case published, is able to go up to 43% of line speed, or 492kpps, for a large UDP packet size of 1024 bytes. The bps figures for smaller sizes will probably be significantly worse, although pps figures will probably rise (unless interrupt handling or some other kernel-space overhead is the limiting factor).
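
To make the relation between the 1024 byte payload, the 1070 byte frame and the 43% figure explicit, here is a small sketch. It assumes UDP over IPv4 without IP options, untagged Ethernet, and the usual 20 bytes of preamble plus inter-frame gap per frame; the function names are made up for this illustration. The arithmetic is consistent with reading the 43% as the share of the 10Gb link that 492kpps of such frames occupy.

```python
ETH_HEADER = 14     # destination MAC, source MAC, EtherType
ETH_FCS = 4         # frame check sequence
IPV4_HEADER = 20    # assuming no IP options
UDP_HEADER = 8
WIRE_OVERHEAD = 20  # preamble + start-of-frame delimiter (8) and inter-frame gap (12)


def frame_bytes(udp_payload: int) -> int:
    """Ethernet frame size for a UDP/IPv4 payload, padded to the 64-byte minimum."""
    return max(64, udp_payload + UDP_HEADER + IPV4_HEADER + ETH_HEADER + ETH_FCS)


def link_share(pps: float, udp_payload: int, link_bps: float = 10e9) -> float:
    """Fraction of the link occupied by pps frames carrying udp_payload bytes each."""
    return pps * (frame_bytes(udp_payload) + WIRE_OVERHEAD) * 8 / link_bps


print(frame_bytes(1024))                   # 1070
print(f"{link_share(492_000, 1024):.1%}")  # 42.9%
```

By the same arithmetic, the same 492kpps with minimum-size 64-byte frames would occupy only about 3% of the link, which is why pps and bps figures cannot be compared across packet sizes.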

As to why they use Linux: that must be because developing solutions that involve kernel changes like netmap or RIO - necessary when pushing performance to the limits - is near impossible with a closed system like Windows, unless your paychecks happen to come out of Redmond, or you have some special contract with Microsoft. That is why RIO is an MS product.

Finally, just to give a few extreme examples of what I discovered was and is going on in Linux land:

Already 15 years ago, some were receiving 680kpps using an 800 MHz Pentium III CPU with a 133 MHz front-side bus on a 1GbE NIC. Edit: They were using Click, a kernel-mode router that bypasses much of the standard network stack, i.e. they "cheated".

In 2013, Argon Design managed to get

tick to trade latencies as low as 35ns [nanoseconds]

Btw they also claim that

The vast majority of existing computing code for trading today is written for Linux on x86 processor architectures.

and Argon use the Arista 7124FX switch, which (in addition to an FPGA) has an OS

built on top of a standard Linux kernel.


You will surely need to measure different configurations and scenarios. As far as I know, this can be done with gear from two companies, IXIA and Spirent. They offer hardware-based traffic generators able to pump traffic at line speed, as well as ramp tests with which you can detect the rate at which your particular system collapses. The devices are expensive, but you can rent them.
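
The idea behind such a ramp test can be sketched in a few lines. This is only an illustration of the principle, not the interface of any IXIA or Spirent product: `measure_loss` stands for whatever drives the traffic generator at a given rate and reports the loss ratio, and `fake_receiver` is a made-up stand-in so the sketch can run on its own.

```python
from typing import Callable


def ramp_test(measure_loss: Callable[[int], float],
              start_pps: int, step_pps: int, max_pps: int,
              loss_threshold: float = 0.0) -> int:
    """Return the highest offered rate (pps) whose loss ratio stayed within the threshold.

    Returns 0 if even the starting rate exceeds the threshold.
    """
    best = 0
    rate = start_pps
    while rate <= max_pps:
        if measure_loss(rate) > loss_threshold:
            break  # the system under test started dropping packets
        best = rate
        rate += step_pps
    return best


# Hypothetical stand-in for a real measurement: a receiver that handles up to
# 500 kpps and drops everything offered above that.
def fake_receiver(offered_pps: int, capacity_pps: int = 500_000) -> float:
    return max(0, offered_pps - capacity_pps) / offered_pps


print(ramp_test(fake_receiver, 100_000, 50_000, 2_000_000))  # 500000
```

In a real test each step would run long enough for buffers to fill, and the step size determines how precisely the collapse point is located.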
