
If I plug a line or USB microphone into my PC, I can record audio at 44.1kHz. That's one sample of audio about every 23 microseconds. The audio wave must be sampled and played back precisely or it will sound distorted. How do non-realtime OSes manage the very time-sensitive process of audio recording and playback with such high sample rates? Is the process any different whether the audio is played/recorded on motherboard audio versus USB speakers/microphone?

  • The only thing that needs to happen on a timely basis is getting the data into storage quickly enough. For a non-RTOS like Windows, this is trivial given that the CPU cycles in the GHz and the memory data bus is measured in Gb/s.
    – Paul
    Commented Jan 28, 2015 at 4:44
  • en.wikipedia.org/wiki/Digital-to-analog_converter ... The OS doesn't have to do much (RTOS or not) when the DAC does the heavy lifting :)
    – txtechhelp
    Commented Jan 28, 2015 at 5:10

1 Answer


If I plug a line or USB microphone into my PC, I can record audio at 44.1kHz. That's one sample of audio about every 23 microseconds. The audio wave must be sampled and played back precisely or it will sound distorted. How do non-realtime OSes manage the very time-sensitive process of audio recording and playback with such high sample rates? Is the process any different whether the audio is played/recorded on motherboard audio versus USB speakers/microphone?

It comes down to buffering.

44.1 kHz does not mean that there is an analog signal that has to be timed perfectly as it travels from the sound card (USB, PCIe, on-board, it doesn't matter) to the CPU. The signal is digital. A Pulse Code Modulation (PCM) stream at 44,100 Hz simply means that, for each second of audio, there are 44,100 "data points" in the signal. The signal is quantized, meaning that it is basically a sequence of numbers. The size of those numbers (8-bit, 16-bit, etc.) is determined by the sample format.
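To make those numbers concrete, here is a rough Python sketch (the 440 Hz tone and the mono, 16-bit format are just illustrative choices) that quantizes one second of audio into 44,100 signed 16-bit "data points" and prints the resulting data rate:

    import math

    SAMPLE_RATE = 44100      # samples per second (44.1 kHz)
    BITS_PER_SAMPLE = 16     # sample format: signed 16-bit

    # One second of a 440 Hz tone is just SAMPLE_RATE integers.
    samples = [
        int(32767 * math.sin(2 * math.pi * 440.0 * n / SAMPLE_RATE))
        for n in range(SAMPLE_RATE)
    ]

    bytes_per_second = SAMPLE_RATE * BITS_PER_SAMPLE // 8
    print(len(samples), "samples,", bytes_per_second, "bytes per second (mono)")
    # -> 44100 samples, 88200 bytes per second (mono)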

Now consider that, once the audio is quantized, it doesn't matter how fast or slow it's captured: since it's just a bunch of data points ("samples") of the audio, the system could capture it equally well at 1 sample per microsecond, 1 sample per second, or 1 sample per year. As long as your computer knows what the intended sample rate is, it can collect the data points at any speed it pleases, store them on disk (or in RAM, or anywhere else), and reproduce them with 100% fidelity later. That's the beauty of digital signal processing. The only real constraint is that, if you are capturing audio in "real time" from a microphone, you must be able to transfer the stream of samples to the computer and process them at least as quickly as they arrive. But you don't have to process each individual sample as a separate transfer.
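As a rough illustration of that independence, the sketch below (Python standard library only; the file name "capture.wav" is made up) stores such samples in a WAV file. The frame rate written into the file header, not the speed at which the samples happened to be collected, is what determines how they are reproduced later:

    import math
    import struct
    import wave

    SAMPLE_RATE = 44100
    # The same one-second 440 Hz tone as in the previous sketch.
    samples = [
        int(32767 * math.sin(2 * math.pi * 440.0 * n / SAMPLE_RATE))
        for n in range(SAMPLE_RATE)
    ]

    with wave.open("capture.wav", "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav.setframerate(SAMPLE_RATE)  # the *intended* playback rate
        # How quickly the samples reached us is irrelevant; only the
        # declared frame rate controls how the file plays back.
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))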

While audio work does require fairly precise timing so that the human ear cannot perceive audible "lag", a digital audio source such as a sound card or USB microphone never transmits each individual sample to the CPU, one at a time, as it is captured. The problem is that every transfer from the audio source to the CPU carries significant overhead.

The overhead is much lower on PCI Express than on USB (in fact, full-speed USB organizes transfers into frames of about 1 millisecond -- far too coarse-grained for sending 44,100 samples individually), but in either case you aren't going to send 44,100 individual samples, each one by itself, in one second.
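A quick back-of-the-envelope check of why per-sample transfers can't work over USB (the 1 ms figure is the full-speed USB frame interval; treating it as the only opportunity to move data is a simplification):

    SAMPLE_RATE = 44100            # samples per second
    USB_FRAMES_PER_SECOND = 1000   # one full-speed USB frame every 1 ms

    # If data can only move once per frame, each frame must carry a
    # batch of samples rather than a single one.
    samples_per_frame = SAMPLE_RATE / USB_FRAMES_PER_SECOND
    print(samples_per_frame)       # -> 44.1 samples per frame, on average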

To resolve this, we use buffering. Buffering does introduce latency, but the goal is to make the buffer small enough that the user can tolerate the delay for their use case, yet large enough that the preemptive multi-tasking scheduler of your non-RTOS kernel can "cut off" any other CPU-hogging tasks and give your audio stack a chance to process the samples that have piled up.

The basic idea of buffering is that you have a sequence of memory locations where the bits representing a certain number of samples (usually several thousand out of those 44,100) are queued up. Once the buffer is full, or nearly full, the sound device raises an interrupt, which tells the CPU that the buffer is ready to be read. Then, when the kernel gets to it, it sets up a Direct Memory Access (DMA) transfer that copies those samples into system memory. From there, the program doing the sound capture can process the samples. The process is similar, but roughly reversed, for audio playback.

So if you have a buffer of 50 milliseconds (1/20th of a second), which is not at all uncommon, each buffer holds 44100 / 20 = 2205 samples. Instead of receiving 44,100 interrupts per second (which would overwhelm the CPU with the sheer overhead of receiving and processing them), the CPU only receives 20 interrupts per second. Much better.
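The same arithmetic for a few example buffer sizes (the sizes are arbitrary illustrations, not values any particular driver uses):

    SAMPLE_RATE = 44100

    for buffer_ms in (5, 10, 50, 100):
        samples_per_buffer = SAMPLE_RATE * buffer_ms // 1000
        interrupts_per_second = 1000 / buffer_ms
        print(f"{buffer_ms:>4} ms buffer -> {samples_per_buffer:>5} samples, "
              f"{interrupts_per_second:.0f} interrupts/s, "
              f"~{buffer_ms} ms of added latency")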

Here is a "hands-on" walkthrough of an example:

  1. The user of a computer requests that a program begin recording audio from the attached USB microphone.
  2. After several layers of software send the right commands to the sound card, the digital signal processor (DSP) within the microphone's audio chipset (eventually) turns on. This DSP takes the analog signal coming from the microphone and measures its level at many points per second -- say, 44,100 times per second, or 44.1 kHz. Imagine placing a piece of paper over the top of a very precisely machined circular disc (like a frisbee), then placing dots on your paper along the perimeter of the circle. Four dots, one per quadrant, don't look much like a circle on paper, but a few hundred dots start to take shape, and many thousands of dots look a LOT like a circle. This is what sampling does (with quantization turning each measured level into a number): it takes so many "dots" that the result "drawn" onto the paper looks very, very close to the original "shape" of the analog wave.
  3. Each sample that the DSP generates gets captured into a small sample buffer that resides within the audio chipset of the sound card (I use the term "sound card" to refer to any digital audio interface connected to a digital computer -- so, USB, PCI, PCIe, all are relatively the same in this aspect.)
  4. Once there are "enough" samples, the sound card lets the computer know that it can take the samples out of the buffer. If the computer doesn't take them in time, the buffer will eventually be overwritten by the next set of samples, and the lost audio is known as a "dropout" (the recording will have an audible "pop" or gap in it). This can happen if the CPU is very busy; a small simulation of this overrun appears after this list.
  5. The computer's hardware drivers copy the samples out of the card's buffer into system memory. From there, applications can do whatever they want with them: writing them to disk, encoding them as MP3, sending them over the network, or just letting them pile up in RAM.
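The walkthrough above can be simulated in miniature. The Python sketch below is purely illustrative (the buffer size, the 1 ms production step, and the artificial "CPU busy" window are all made up for the demo): a loop stands in for the DSP filling the card's small on-board buffer, and a periodic "drain" stands in for the driver's DMA transfer; when the drain is delayed, the buffer overruns and samples are lost, which is the dropout described in step 4.

    from collections import deque

    SAMPLE_RATE = 44100
    DEVICE_BUFFER = 2205            # samples the "sound card" holds (~50 ms)

    buffer = deque(maxlen=DEVICE_BUFFER)
    dropped = 0

    # Simulate one second of capture, one millisecond at a time.
    for ms in range(1000):
        # Steps 2-3: the DSP produces ~44 new samples per millisecond and
        # places them in the card's on-board buffer.
        for n in range(SAMPLE_RATE // 1000):
            if len(buffer) == DEVICE_BUFFER:
                dropped += 1        # overwritten before being read: a dropout
            buffer.append((ms, n))

        # Steps 4-5: the driver normally drains the buffer every 50 ms, but
        # pretend the CPU was too busy to service it between 300 and 450 ms.
        cpu_busy = 300 <= ms < 450
        if ms % 50 == 49 and not cpu_busy:
            buffer.clear()          # stand-in for the DMA copy into RAM

    print("samples dropped:", dropped)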

Since we're assuming that the operating system is not real-time, there is always the potential that the CPU will have so many tasks competing for its time that it simply won't get to read or write the audio data before the sound card moves on to the next set of samples. Having a predictable scheduler implementation in the operating system kernel, such as one that divides CPU time into fixed time slices, makes it possible to minimize (though never completely eliminate) the likelihood of dropouts. For example, if HypotheticalOS has a preemptive kernel scheduler that hands out 1 millisecond time slices and distributes them fairly among competing tasks, then the task handling your audio I/O is likely to get:

  • 1 millisecond of CPU time per second if there are 1,000 tasks competing for time (scheduled once per second)
  • 2 milliseconds of CPU time per second if there are 500 tasks competing for time (scheduled twice per second)
  • ...
  • 500 milliseconds of CPU time per second if there are 2 tasks competing for time (scheduled 500 times per second)

(The above assumes a uniprocessor and a fully saturated CPU; things get complicated with SMP.)
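A back-of-the-envelope check of that list, under the same simplifying assumptions (uniprocessor, fully saturated CPU, perfectly fair 1 ms slices):

    SLICE_MS = 1                    # hypothetical scheduler time slice

    for competing_tasks in (1000, 500, 100, 2):
        # With N equally greedy tasks, the audio task gets roughly 1/N of
        # the CPU, i.e. about (1000 / N) one-millisecond slices per second.
        slices_per_second = 1000 // competing_tasks
        cpu_ms_per_second = slices_per_second * SLICE_MS
        print(f"{competing_tasks:>5} tasks -> {slices_per_second:>3} slices/s, "
              f"{cpu_ms_per_second:>3} ms of CPU per second for audio")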

So you can see that a 100 millisecond buffer means that the kernel scheduler must give your audio I/O stack (which includes both kernel and userspace components, usually) at least 10 time slices to do work, for each second of audio. And if the audio pipeline requires more than one time slice to do its work, then it needs that many more time slices per second.

And hopefully those time slices are evenly distributed: if you got 50 time slices within the first 50 ms of a given second, and then had zero time slices for the remaining 950 ms, with a 100 ms buffer, the user is going to experience nearly a full second of dropout! Yuck.

To diverge a bit, I should note that there are other ways to do audio I/O, but this is the most traditional way it is done on typical desktops. On GNU/Linux, there is a sound server called PulseAudio that tells the sound card to stop sending interrupts and simply grabs data from the sound buffer whenever it needs to, based on the size of the buffer it's using (which can change over time to adapt to higher CPU demand or to lower latency requested by applications!). This technique is called timer-based scheduling, and it depends on a very good scheduler implementation in the Linux kernel.
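For flavor, here is the shape of that idea in Python. This is not PulseAudio's actual implementation, just a hypothetical sketch: instead of waiting for a per-buffer interrupt, the "server" sleeps for roughly a buffer's worth of time and then pulls whatever has accumulated.

    import time

    SAMPLE_RATE = 44100
    buffer_ms = 50                  # could be grown or shrunk at runtime

    def samples_available():
        # Stand-in for asking the driver how far the hardware has written;
        # a real sound server queries the device's position, not a constant.
        return SAMPLE_RATE * buffer_ms // 1000

    for _ in range(5):              # a few iterations instead of forever
        # No interrupt: just sleep until about one buffer's worth of audio
        # should have accumulated, then collect it in one go.
        time.sleep(buffer_ms / 1000)
        print("woke up, pulled", samples_available(), "samples")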

  • Great answer. Would be nice to add a reference to (or briefly explain) dedicated A/D devices which allow lower buffer times, combined with technology such as ASIO. This is what really enables live recording.
    – slhck
    Commented Jan 28, 2015 at 15:49
  • I'm no expert on ASIO; my background is more along the lines of ALSA, DirectSound, and Win32 MME... I think an overview of the different sound APIs is out of scope for the question. Basically, some are awesome and some are pretty bad. WASAPI is a great compromise that brings regular, casual Intel HDA codecs on motherboards much closer to the low latency of ASIO for audio production. Yada yada yada. I could go on and on. Not really on-topic for the Q. Commented Jan 28, 2015 at 16:52
