10

My question is mostly about Linux and contemporary x86 hardware.

Clarifying the terms

Async, event-based programming: spawning a fixed number of threads and using some user-space scheduling mechanism. Callbacks are usually handled in a continuation style, so the callback executes in the thread of the consumer, not the producer.

Thread-based, blocking programming: using classic OS threads with mutexes and semaphores. Waiting is handled by blocking the calling thread. Although it is not usually done this way, in my view this requires cooperative threads to be effective.
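To make the two definitions concrete, here is a minimal C++ sketch of both styles. All names are made up purely for illustration (compile with -pthread); no particular framework is implied.

    // Async, event-based: a fixed worker thread pulls callbacks off a queue
    // and runs them, so the continuation executes on the consumer's thread.
    // Thread/blocking: the caller simply blocks until the work is finished.
    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    int main() {
        std::queue<std::function<void()>> events;
        std::mutex m;
        std::condition_variable cv;
        bool done = false;

        std::thread worker([&] {
            for (;;) {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return done || !events.empty(); });
                if (events.empty()) return;
                auto cb = std::move(events.front());
                events.pop();
                lk.unlock();
                cb();                       // continuation runs here, in the worker
            }
        });

        {
            std::lock_guard<std::mutex> lk(m);
            events.push([] { std::cout << "handled as an event\n"; });
        }
        cv.notify_one();

        std::thread t([] { std::cout << "handled by a dedicated thread\n"; });
        t.join();                           // waiting = blocking the calling thread

        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
        worker.join();
    }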

The status quo from my POV

It seems like more and more people are turning away from the classic synchronization/cooperation primitives for threading and from blocking (as in putting the current thread to sleep) in general. Their objections seem to be mainly the following (the remarks in parentheses are my thoughts on the matter):

  1. Blocking synchronization is prone to deadlocks (async code can livelock too)

  2. Too many threads cause oversubscription, and the OS scheduler makes latency sky-high (fixable?)

  3. Libraries that use threading do not provide any way of cooperating, thus causing the problem above

  4. High cost of spinning up a thread (perceived?)

  5. Not usable if the sleep/wakeup overhead is noticeable compared to the active time (true)

#5 is not fixable without patching the kernel, and #3 could be fixed by better teaching.

The can of worms

Async systems usually rely on events that trigger something else, thus cascading. Even if async, event-based programming fixed the problems above, it creates harder issues that blocking, thread-based programming solves:

  • Starvation (very hard to achieve fairness without OS support, which in turn utilizes hardware clocks to enforce CPU time slices)

  • Can break when only one worker is allocated (this is mostly true of lock-free queues; nothing guarantees progress when the single worker never yields CPU time to another task, although coroutines solve this)

  • Exclusivity and ABA problems (with OS threads, no thread can be scheduled on multiple CPUs at the same time, while with event-based programming some event might fire more often than needed)

  • Complexity goes either into the library (which, if it has a poor interface, pretty much doubles the mental load) or into the user code

Why reinvent the wheel?

For most complaints I have seen about thread-based blocking programming, I couldn't find any reason why they couldn't be solved with the classic approach. The only real problem is when a thread needs to run for such a short time that the typical ~10 microsecond wakeup delay becomes noticeable, which usually means I/O-bound applications. In all other cases I found that non-blocking code is usually less efficient on the normal (non-contended) path, because it does more work to ensure atomicity. It seems like almost all problems could be solved by using blocking in a better way (don't hold a lock for more than a handful of operations) and by exposing parallelism instead of just spawning threads (algorithms could use task queues, for example; see the sketch below). Real-time is not possible without preemption, so async event-driven code is not real-time by definition. Even the coroutine-based frameworks pretty much duplicate the OS scheduler and synchronization primitives.
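As a minimal sketch of the "task queue with short lock holds" idea (the names and numbers are illustrative, not from any particular library): the queue mutex protects only the push/pop, and the actual work runs unlocked.

    #include <atomic>
    #include <iostream>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <thread>
    #include <vector>

    std::queue<int> tasks;
    std::mutex qm;
    std::atomic<long> total{0};

    std::optional<int> pop_task() {
        std::lock_guard<std::mutex> lk(qm);   // lock held for a handful of operations only
        if (tasks.empty()) return std::nullopt;
        int t = tasks.front();
        tasks.pop();
        return t;
    }

    int main() {
        for (int i = 0; i < 1000; ++i) tasks.push(i);

        unsigned n = std::thread::hardware_concurrency();
        if (n == 0) n = 2;                    // hardware_concurrency() may return 0

        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([] {
                while (auto t = pop_task())
                    total += *t;              // the actual work happens outside the lock
            });
        for (auto& w : workers) w.join();

        std::cout << total << '\n';           // 0+1+...+999 = 499500
    }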

Examples when threads are superior

On one occasion I had to use only 2 CPU cores for a stream of tasks, where some tasks take about 100 times longer than the others. Two of those running at once would pretty much stop everything else from making progress. I did the following:

  1. Allocate two threads for lightweight tasks and make the producers put lightweight tasks into a queue

  2. Spin up a thread for every heavy task

  3. Put everything under SCHED_RR with a time slice set to the upper bound of a lightweight task, so that even while heavy tasks run, the lighter ones can make progress

With async, event-based programming this would require support from the callee and would have a predefined set of preemption points, while the OS-based approach is essentially a coroutine with as many preemption points as possible, albeit with less control over where exactly it gets preempted.
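A rough sketch of steps 2-3 using raw pthreads, with two caveats worth stating: SCHED_RR normally requires CAP_SYS_NICE/root, and on Linux the round-robin time slice is a system-wide knob (/proc/sys/kernel/sched_rr_timeslice_ms), not something set per thread. The heavy task here is just a busy-loop stand-in.

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>
    #include <cstring>

    void* heavy_entry(void*) {
        // stand-in for one heavy task from the stream
        for (volatile long i = 0; i < 100000000L; ++i) {}
        return nullptr;
    }

    int main() {
        pthread_t tid;
        pthread_create(&tid, nullptr, heavy_entry, nullptr);   // step 2: a thread per heavy task

        sched_param sp{};
        sp.sched_priority = sched_get_priority_min(SCHED_RR);  // step 3: put it under SCHED_RR
        int rc = pthread_setschedparam(tid, SCHED_RR, &sp);
        if (rc != 0)
            std::fprintf(stderr, "pthread_setschedparam: %s\n", std::strerror(rc));

        pthread_join(tid, nullptr);
    }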

Question

Since async, event-driven programming offers so few benefits, why is there a shift towards it? What am I missing?

14
  • 8
    I think part of the problem with this question is that it conflates asynchrony with threading. Asynchrony is just a natural feature for any system that relies on some external service (e.g. database, other microservice, etc.), so a program with just one thread can't do anything with that operation until the external service completes. In the meantime, that thread can either block/sleep or continue doing something else useful asynchronously. Also remember that many systems these days are split into many microservices and often rely heavily on SaaS/PaaS solutions. Commented Nov 15, 2021 at 21:16
  • 1
    This article by an author of Scala's biggest async framework has a good explanation. Commented Nov 16, 2021 at 6:02
  • 1
    Surely you are missing the obvious. The async approach is for non-CPU tasks. You don't spin up a new thread because there is no work for the thread to do except wait.
    – Ewan
    Commented Nov 16, 2021 at 15:01
  • 5
    Downvoting because this isn't a question, it's an argument. OP already has their answer (async is bad) and simply wants evidence to support it.
    – user949300
    Commented Nov 16, 2021 at 17:46
  • 5
    You are not making the difference between the two perceived paradigms very clear. Nor do I see how this would be bound to Linux or any particular processor family. It may help to provide typical (code) examples and to be a lot more concise, leaving out the essay about your personal assessments. There may or may not be any merit to them, either way they stray from the question. Now you will have people arguing with you over your statements rather than listing reasons. Commented Nov 16, 2021 at 19:07

8 Answers

18

The "async" approach better facilitates human reasoning.

When most people drive, they don't need to concern themselves with how every element of the car is interacting. Forget the tires - when you turn the steering wheel, the car turns. Sure, there are exceptional cases where you need to remember that the tires exist, and you need to keep the big picture in your head to avoid big errors. But for the most part, that's background knowledge that won't help you get to the store and back.

Technology grows to cover the pain points

In computing as a whole, things tend to move from "easy for the machine" to "easy for the person". User-friendliness (noting that developers are the users of programming technologies) is obviously valuable, so as technology and discipline develop, patterns tend toward the human.

In programming languages, for example, we've generally moved from "low" to "high"; machine to assembly to C to C++ to C#, for example (not that each is specifically an iteration of its predecessor, but as a general direction). Early languages tended to be more dynamic, because it's easy to interpret a dynamic language, whereas many modern languages tend to come with more compile-time checks and pre-processing. Obviously dynamic and low-level languages are still valuable - especially in specific situations - but the benefits of things like generic types, garbage collection, and syntactic sugar are undeniable for most software development.

Separate the model from the mechanism

Also look at cloud computing - have you heard the buzzword "serverless"? Obviously there's still a server, but the details of its hardware can be safely ignored in most situations. With architectural patterns like microservices, there's more and more push toward completely isolating different areas of an application domain; each piece is maintained as a black box and coordination is only required at the edges. This is supported and enforced by cloud resources like Azure Functions or AWS Lambda that may not run anywhere near one another in time and space. Coordinating all of these pieces turns out to be a distraction from software design and development rather than a fundamental building block.

This extends naturally to concurrent programming. It's easier to think "pick up here when data is available" than it is to think about scheduling whose turn it is and ensuring that timing works out. Software design principles suggest focusing on a single responsibility, and part of that is keeping the details of coordination out of a given chain of events.

Features like async/await or promise .then() callbacks build upon low-level thread coordination the way loops build on goto. They behave in a structured, consistent way that is more limited and carries somewhat more overhead, but they are ultimately more predictable and quicker to use in typical scenarios. When I write getValue().then(useValue), I can assume that code will execute in that order as time progresses, albeit with a time disconnect. Maybe the actual thread runs other code in the meantime - why should I care!
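To put that "structured layer over raw coordination" point in C++ terms: standard std::future has no then(), so the sketch below layers a hypothetical one on top of std::async. getValue and useValue are the made-up names from the paragraph above.

    #include <future>
    #include <iostream>
    #include <utility>

    template <typename T, typename F>
    auto then(std::future<T> fut, F continuation) {
        // The continuation runs "after" the value exists; which thread does the
        // waiting is hidden inside the helper, just like goto hides inside a loop.
        return std::async(std::launch::async,
                          [fut = std::move(fut), continuation]() mutable {
                              return continuation(fut.get());
                          });
    }

    int getValue() { return 42; }
    void useValue(int v) { std::cout << "got " << v << '\n'; }

    int main() {
        auto done = then(std::async(std::launch::async, getValue), useValue);
        done.get();   // only needed so main doesn't exit before the chain finishes
    }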

6
  • 1
    I personally found async (via callbacks) reasonably "logical", because when I first learned systems programming, a long time ago, that was how things were done. I suspect that my experience is rare: for most other programmers async code is hard to reason about. That's why languages have added keywords like await, which effectively turn async code into synchronous code, for easier reasoning.
    – user949300
    Commented Nov 16, 2021 at 17:50
  • 2
    @user949300 I think terms are being used differently all around here. Continuation-like callbacks and keywords async and await are both what I think the querent describes as "async event based programming", but with different levels of technology and sugar. This is in contrast to "using classic OS threads and mutexes and semaphores" where you think in terms of active threads and the coordination and scheduling between them. Commented Nov 16, 2021 at 17:55
  • Thank you for the answer. Now I can more easily see the difference between async and event based programming. It is just that I saw those used together more often than not that I confused the two for being one. Commented Nov 16, 2021 at 18:18
  • The more I thought about this, the more I seem to lose the difference between async and blocking. The use cases I have in mind are creating a promise then doing multiple thens on the future from it. In the case of blocking threads, it is just a context switch. Given enough time spent active, the context switch will not be noticeable. While on the other hand, if it is checking if the result is ready periodically, then the last paragraph wouldn't apply. Am I missing something? I'm not saying async is useless, just that it has more limited uses than what I hear about. Commented Nov 16, 2021 at 18:41
  • 2
    @Incomputable The way I understand it is that things like async and await or promise then callbacks builds upon low-level thread coordination the way loops build on low-level gotos. They translate in a structured, consistent way that is more limited and with more overhead, but are ultimately more predictable and quick to use. When I write getValue().then(useValue), I can assume that code will execute in that order as time progresses, albeit with a time disconnect. Notably the thread could run other code in the meantime - why should I care! Commented Nov 16, 2021 at 19:21
10

Fundamentally, the answer comes down to the fact that threads are not free: there is overhead associated with each thread. I found this article, written by former Intel engineer Arch D. Robison, which explains this using the TBB scheduler, which I think you mentioned you are familiar with.

The impact of having too many threads comes in two ways. First, partitioning a fixed amount of work among too many threads gives each thread too little work that the overhead of starting and terminating threads swamps the useful work. Second, having too many threads running incurs overhead from the way they share finite hardware resources.

Most significantly for this discussion, the cost of switching can be very significant. In our discussion, you mentioned a thread switching time of 10 microseconds. If we assume that's correct for the moment, consider that on a 1GHz machine (e.g. an Athlon circa 2000), that's 10,000 clock cycles. Robison explains part of why this is so costly:

Modern processors rely heavily on cache memory, which can be about 10 to 100 times faster than main memory. Accesses that hit in cache are not only much faster; they also consume no bandwidth from the memory bus. Caches are fast, but finite. When the cache is full, a processor must evict data from the cache to make room for new data. Typically, the choice for eviction is the least recently used data, which is typically data from an earlier time slice. Thus software threads tend to evict each other’s data,

Moreover:

On Linux, starting and terminating an Intel® TBB task is about 18 times faster than starting and terminating a thread. On Windows, the ratio is more than 100.

Effectively, these higher-level abstractions can manage the work more efficiently and avoid much of the overhead of thread switching. One thing to consider is that this probably doesn't matter much if you are running one heavyweight process on a machine; these improvements are designed to support things running at scale. Once you hit a certain number of threads on a given machine, it spends more time switching thread contexts than executing program logic. By eliminating that thread-switching overhead, I can get more productive work done on the same hardware.
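As a rough illustration of the task-based style those numbers refer to, assuming oneTBB is installed (link with -ltbb); the work inside the tasks is a placeholder. The pool reuses a fixed set of worker threads instead of creating an OS thread per work item.

    #include <tbb/task_group.h>
    #include <atomic>
    #include <iostream>

    int main() {
        std::atomic<long> sum{0};
        tbb::task_group tg;

        for (int i = 0; i < 10000; ++i)
            tg.run([i, &sum] { sum += i; });   // a task, not a new OS thread

        tg.wait();                             // join the whole batch
        std::cout << sum << '\n';
    }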

8
  • Thank you for taking the time to answer my question! I will delete my question because I gained a lot of insight into what's going on and how to write a much more insightful question. Sorry that this answer will probably go to waste (my new question will refute each and every point in this answer, so if you want to read about it, stay tuned, probably next week). Well, it seems like I cannot delete it. But either way, I would like to make great amendments to it. I guess I should ask mods for that. Commented Nov 16, 2021 at 16:54
  • 5
    @Incomputable I don't see any reason to delete this. If you have this question, someone else will surely have it too. You can ask another question regardless.
    – JimmyJames
    Commented Nov 16, 2021 at 18:12
  • 1
    in asynchronous model worker threads "manually" switch contexts much more often than in thread-per-request/thread-per-connection, so in asynchronous model there are way more cache misses and invalidations, thus the facts you pointed are against the asynchronous model. Furthermore, there's nothing that stops us from maintaining a thread-pool in thread-per-request/connection model as well to avoid frequent thread creation/destruction: just the maximum limit should be much bigger.
    – morgwai
    Commented Nov 17, 2021 at 19:47
  • @morgwai Can you provide some docs/evidence of increased cache misses?
    – JimmyJames
    Commented Nov 17, 2021 at 19:55
  • 2
    I would be honestly astonished if a single thread running coroutines had more cache misses than multiple threads context-switching. The coroutines would have to be as heavy-weight as a thread execution context for this to seem plausible. I guess there are corner cases like each coroutine fills L1 and they never share any memory ...
    – Useless
    Commented Nov 17, 2021 at 20:08
8

A lot of event driven programming is because the applications are event driven. You're running code to handle a button click, or a received packet, or a timer expiring. That sort of code is actually more awkward to write in a synchronous style. While you are correct that long-running CPU-bound synchronous code should get its own thread instead of manually specifying suspend points, most code is relatively short-lived IO-bound event handlers.

In my opinion, another reason for the growth of async programming is the growth of JavaScript programming. JavaScript is single-threaded, so you need async to do any sort of concurrency, and people looked for better and better patterns to do so, borrowing and lending good ideas between other languages.

Another reason is threads are much heavier than fibers. You might ask why we don't just create lighter weight threads. As this great article explains, one big reason is that threads are where your call stack is, and the garbage collector needs to be aware of the call stack in order to determine what memory is reachable. This tracking adds per-thread overhead. Likewise at the OS level for things like swapping.

5

Why is there a shift towards asynchronous and event driven programming?

My explanation would be that the tooling has become better. While many operating systems have supported asynchronous IO for a long time, it has often been difficult to use effectively. Many modern languages feature some form of easy-to-use coroutine/async/await support that lowers the barrier to entry.

This kind of programming is well suited to doing lots of IO, and this has increased in importance as processors have become faster while IO latency has not improved as much. Just blocking the thread while doing IO is not a great solution: UI programs would freeze while the UI thread blocks, and server programs would have higher overhead due to all the blocked threads.

This does not mean that regular thread based programming is not relevant. I would consider the techniques complementary. When doing CPU intensive stuff you obviously need threads to do the processing.

5
  • 2
    "When doing CPU intensive stuff that's embarrassingly parallel" There are still workloads that don't thread well
    – Caleth
    Commented Nov 16, 2021 at 16:31
  • Thank you for taking the time to answer my question! Unfortunately I will probably delete it because I gained a lot more understanding on the subject that provide much stronger arguments against event based programming. Sorry. Commented Nov 16, 2021 at 16:55
  • @Incomputable Why don't you submit an answer with your findings? Commented Nov 16, 2021 at 17:37
  • 1
    @VincentSavard, if the question will avoid close/reopen battles, then I will do that. I didn't think of this, thanks for your suggestion. Commented Nov 16, 2021 at 17:39
  • @Caleth Or if responsiveness should be assured. Commented Nov 17, 2021 at 14:44
4

IME you're arguing from a false premise.

Your "classic" threads with blocking I/O model was popularized by Java, which was the first widely-used language to have standardized built-in thread support, and which did not initially support non-blocking I/O at all. Threaded blocking I/O was the only form of (de)multiplexing it supported.

I was already writing C++ code on UNIX using the traditional select/poll non-blocking reactor style while Java was gaining popularity, and I saw the explosion of Java documentation & tutorials - together with similar-enough syntax - encourage people to try the same style in C++. From my perspective this was an aberration, caused entirely by the popularity of the newer language and the particular order in which features were added.

The reactor-style approach is more work, because you have to manually track lots of state that would otherwise exist as local variables in an almost-always-blocked thread. However, it has always been vastly more efficient for I/O bound systems. When Java eventually got NIO, it became possible to do efficient (de)multiplexing in the same style there as well.
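For readers who haven't seen it, this is roughly what that reactor style looks like: a bare-bones sketch only, with error handling and the per-connection state tracking omitted. The table of handlers replaces the local variables a blocked thread would have kept.

    #include <poll.h>
    #include <unistd.h>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    int main() {
        // fd -> "what to do when this fd becomes readable"
        std::unordered_map<int, std::function<void(int)>> handlers;
        handlers[STDIN_FILENO] = [](int fd) {
            char buf[4096];
            ssize_t n = read(fd, buf, sizeof buf);
            (void)n;                          // a real server would parse and track state here
        };

        for (;;) {                            // a real program would also have a shutdown path
            std::vector<pollfd> fds;
            for (auto& [fd, handler] : handlers)
                fds.push_back(pollfd{fd, POLLIN, 0});

            poll(fds.data(), fds.size(), -1); // block until some fd is ready

            for (auto& p : fds)
                if (p.revents & POLLIN)
                    handlers[p.fd](p.fd);     // dispatch; never block inside a handler
        }
    }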

The newer async event-based style essentially provides better language support for this, removing the need for most custom callback classes and manually managing what was effectively coroutine state.


NB. I'm not really addressing async/await style in this answer, just the underlying assumption about multi-threaded blocking I/O versus single threaded non-blocking I/O. The fact that the latter is simplified by using coroutines doesn't mean it needs an async thread pool - except, perhaps, for spinning off long-running operations so it can get back to I/O.

1

Verifying the absence of deadlocks is easier with async-driven code than with synchronous blocking code. In a synchronous system with multiple threads and multiple mutexes, how does one verify correctness? There is manual code inspection; however, this cannot be automated. Subsequent code changes may introduce race conditions and deadlocks. Async structures do not depend on other threads being correct and are thus easier to test and verify with an automated test.
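To make that concrete, here is the classic lock-ordering hazard that manual inspection has to catch; each function looks fine on its own, and a test run will usually still pass. (Acquiring both with std::scoped_lock lk(a, b); in both places would fix it.)

    #include <mutex>
    #include <thread>

    std::mutex a, b;

    void transfer_left()  {
        std::lock_guard<std::mutex> la(a);
        std::lock_guard<std::mutex> lb(b);   // locks a, then b
    }

    void transfer_right() {
        std::lock_guard<std::mutex> lb(b);
        std::lock_guard<std::mutex> la(a);   // locks b, then a: opposite order
    }

    int main() {
        std::thread t1(transfer_left), t2(transfer_right);
        t1.join();
        t2.join();   // usually returns; occasionally hangs forever under the right timing
    }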

2
  • Async code opens up another problem then: the async operation could get preempted in the middle of an update, and then another coroutine/fiber/whatever wants to read from that object. Neither async nor blocking threading solves the sharing problem, but async is superior in cases where the context switch becomes noticeable, as with IO operations. Blocking threading, on the other hand, is usually a lot more efficient in terms of battery usage and the like, because it sleeps. Commented Nov 16, 2021 at 18:06
  • 1
    @Incomputable While anync is not free from issues, consider this article called the 'Top 20 C++ multithreading mistakes and how to avoid them' Note that it's just the Top 20! There are so many ways you can get it wrong. More crucially (IMO) there's really no way to test multi-threaded code exhaustively. Even if it works in testing on one system (hard enough) there are effectively unlimited scenarios. Even if you could test them all, there still could be bugs that will appear on other hardware.
    – JimmyJames
    Commented Nov 16, 2021 at 18:26
1

The biggest problem with the asynchronous approach is that it's difficult to read (callback hell) and error prone (deadlocks, race conditions, etc.) unless advanced language support is available. One example is async/await in C# or Node, which makes request-processing code look "linear" and similar to thread-per-request/thread-per-connection.
The asynchronous approach may also be a bit slower due to synchronization overhead, but for IO-bound applications this is almost always insignificant.
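A small illustration of the readability gap (the fetch* functions are made-up synchronous stubs standing in for async operations; no real API is implied):

    #include <functional>
    #include <iostream>

    struct User   { int id; };
    struct Orders { int count; };

    void fetchUser(int id, std::function<void(User)> cb)     { cb(User{id}); }
    void fetchOrders(User u, std::function<void(Orders)> cb) { cb(Orders{u.id * 3}); }
    void render(Orders o)                                     { std::cout << o.count << " orders\n"; }

    int main() {
        // Callback style: each step nests inside the previous one ("callback hell").
        fetchUser(7, [](User u) {
            fetchOrders(u, [](Orders o) {
                render(o);
            });
        });
        // Blocking or async/await style reads top to bottom instead, e.g.:
        //   User u = await fetchUser(7); Orders o = await fetchOrders(u); render(o);
    }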

OTOH, the main problem with thread-per-request/thread-per-connection is (or used to be) that services with many users need a lot of threads, which, depending on the specific hardware+OS platform, may not be feasible (more on that later).

Update: I also used to think that the asynchronous approach results in more cache misses, but thanks to the discussion in this very thread I've realized it's the same for both the asynchronous and the thread-per-request/thread-per-connection models. In both models context switching happens almost exclusively at points where async/blocking calls are issued. When responses to these calls return, there will be close to 100% cache misses in both models. Finally, in both models nearly all accesses hit the cache during execution of synchronous non-blocking code: as most requests are waiting for their async/blocking calls to return, they don't compete for cache with the few that remain active.

NOW SOME HISTORICAL BACKGROUND:

Before the mobile revolution, this was rarely a problem: very few internet services ever experienced more than 100 concurrent requests, so for almost everyone except Google, Yahoo, AOL etc. the thread-per-request/thread-per-connection model was sufficient. Things changed drastically in the late 2000s: within just a few years roughly 50% of the population of the developed world started to access internet services dozens to hundreds of times a day. Suddenly, even moderately popular services had hundreds of concurrent requests during peak hours.
This also happened to be roughly the same time when shared server hosting became cheap and popular and every hosting provider was copying Google's approach to data-center design: replace a few very expensive mainframes with hundreds of cheap, easily replaceable, almost-consumer-grade machines (early Google clusters were built from decommissioned office PCs). Google claimed that this approach was more cost effective and fault tolerant, which was true, but as Google's code base was a tightly guarded secret at that time, they didn't mention that it comes with another cost...

THE LIMIT ON THREAD COUNT ON 32-BIT HARDWARE:

Most of the hardware at that time, apart from the mainframes people were migrating away from, was still 32-bit (64-bit CPUs like AMD's Opterons were available, but still far more expensive). This limits the virtual address space to 4GB (2^32), of which usually a bit over 3GB is available to user space.
On most platforms, the virtual address space (not the actual physical memory) holding a thread's stack needs to be reserved when the thread is created and cannot be expanded or relocated later. As some programs use deep, non-tail recursion, the default stack reservation is big on most platforms: on Linux it's 2MB (RHEL even had variants with 10MB). Even assuming zero heap usage, this allows about 3GB/2MB = 1500 threads on one machine (and only ~300 if the stack reservation is 10MB). Given typical heap usage, virtual memory limits were usually reached at about 500-900 concurrent requests.
For more info see this excellent article by Ulrich Drepper.
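For completeness, that per-thread stack reservation is only a default; it can be shrunk at thread-creation time, for example via pthread attributes. A minimal sketch (the handler is a placeholder, and its deepest call chain must fit in the smaller stack):

    #include <pthread.h>
    #include <cstdio>

    void* handler(void*) {
        // request-handling code goes here
        return nullptr;
    }

    int main() {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 256 * 1024);   // 256 KB instead of the ~2 MB default

        pthread_t tid;
        if (pthread_create(&tid, &attr, handler, nullptr) != 0)
            std::fprintf(stderr, "pthread_create failed\n");
        else
            pthread_join(tid, nullptr);

        pthread_attr_destroy(&attr);
    }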

TOWARDS ASYNC MODEL:

The solution (and the cost that Google hadn't mentioned) was precisely the switch to the asynchronous server model: using a small, fixed pool of threads that switch between requests whenever the current one performs some blocking operation. This made it possible to handle even 10,000 concurrent requests on the same machines that previously melted at fewer than 1000 due to the problems described above. Everyone was excited and a trend followed: we got async/await in F# in 2007, async support in Java Servlets in 2009, async/await in C# in 2011, and so on...
"And for the time it was good"

MIGRATION TO 64-BIT HARDWARE:

Sometime in the early 2010s, AFAIR, 64-bit CPUs became cheap enough to replace all the 32-bit ones, at least on servers. A 64-bit address space is huge: even with a 10MB stack reservation per thread, it's enough for on the order of a trillion threads, so it's basically not a concern for the time being.

CONCLUSIONS:

So why is the asynchronous style still so popular? There are some cases where it definitely is still useful:

  • processing events on IoT or embedded devices that may still be using 32-bit CPUs.
  • it may be more convenient for GUI event handling or, for that matter, any other situation with multiple independent events that modify some common state. Since access to the common state has to be properly synchronized anyway, most of the disadvantages of the asynchronous model would affect the thread-per-request/connection model as well.

However, for standard IO-bound server programming (your service fetches data from a DB or talks to a non-local dependent micro-service) with request-response processing (like HTTP or some RPC) that is mostly stateless (little data exchange and synchronization between separate requests), I don't think the asynchronous model is of much use anymore. IMO it should be avoided in languages that don't have good support for async programming (like Java or C++) unless it is genuinely beneficial.
Of course there's nothing wrong with using it in languages that have, for example, an async/await mechanism, like C# or Node. In fact, .NET Core's Kestrel server is asynchronous by design and its thread pool size equals the number of available CPU cores. Node was single-threaded for a long time.

Nevertheless, for reasons I don't entirely understand, the async model is still quite popular in Java in cases where it doesn't bring any value anymore (like servlets fetching data from a DB). I suppose it may be the perpetuated belief that maintaining a pool of several thousand threads is still very expensive, which is no longer true.
For example, Jetty can easily scale up to 32k threads without any tuning on a typical modern Linux laptop (see this article). For most apps and typical server hardware, a 32k thread limit significantly exceeds the number of concurrent requests a machine can squeeze in anyway, due to the physical memory needed to fit all the requests' data on the server's heap.

12
  • If you have enough threads that they only fit in virtual memory with a 64-bit address space, they're practically guaranteed never to be in cache when they wake up. This might be absolutely fine for many use cases, but for many others it's completely unacceptable. I'm pretty sure that having different non-functional requirements than you does not make me a chimpanzee.
    – Useless
    Commented Nov 17, 2021 at 20:03
  • @Useless first, please don't take it personally ;-) the experiment was actually projected towards corporate rules and authors were pointing that humans often fall a victim of such bias as well. I myself have fallen into it few times at least. Now for the memory: current machines don't have enough physical RAM to handle enough requests to fill virtual address space. The biggest successful squeeze per machine in real life service I've seen was about 10_000. Generally the amount of cache should be proportional to the amount of RAM, that's why putting 256GB RAM into a main-board with small...
    – morgwai
    Commented Nov 17, 2021 at 20:11
  • I'm not upset about the simian simile, just pointing out that my performance constraints are real, even though they're different to yours.
    – Useless
    Commented Nov 17, 2021 at 20:14
  • 1
    @JimmyJames yes :) I've just updated my answer regarding the cache hits. Thanks again for the fruitful discussion!
    – morgwai
    Commented Nov 17, 2021 at 21:51
  • 1
    Yes, this has been very informative. My belief/argument is that the push has been around scalability. The idea being that an (IO bound) webserver can handle more concurrent requests using async requests. Your points have made me question some of my assumptions around this which is great.
    – JimmyJames
    Commented Nov 17, 2021 at 21:55
0

Just answering the "asynchronous" part of your question

Here are some technical reasons why "asynchronous" programming is advancing

  • In Android you must handle web-service requests/downloads in separate background threads to avoid a frozen user interface (or to prevent Android from killing your app because it is not responding)
  • Tools/libraries make it very easy and memory-efficient to implement asynchronous code
    • JavaScript with the Promise API makes async programming easy. Running JS code on a NodeJS server makes executing the promises very efficient.
    • In Java 8 and later there is the CompletableFuture API, which can be used similarly to JS promises
  • In scalable, parallel, distributed big-data computing there is the MapReduce programming model, which requires "asynchronous" programming
