
For the purpose of this question, I downloaded the video at:

It downloaded as "3Uyndrm.mp4", 4,762 kB

In Windows 11, I used File Explorer > Details to view its properties:


In these data, the Data rate is the video bitrate and

  • Total bitrate = Audio Bit rate + Video Data rate

This is confirmed by calculating the size of the file from these numbers:

File size = 1,765,000 bits/sec * 22 seconds / 8 / 1024 ≈ 4,740 kB

But there's something I don't understand about these numbers.

We can calculate the number of pixels per second in the video:

(854 * 480) pixels/frame * 30 frames/sec = 12,297,600 pixels/sec

From that, we can get the number of video bits per pixel:

1,635,000 bits/sec / 12,297,600 pixels/sec / 8 bits/byte = 0.017 bytes/pixel
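For reference, this arithmetic can be double-checked with a short script (all numbers taken from the File Explorer properties above):

```python
# Verify the file-size and bits-per-pixel arithmetic from the question.
total_bitrate = 1_765_000      # bits/s (audio bitrate + video data rate)
video_bitrate = 1_635_000      # bits/s (the video "Data rate")
duration = 22                  # seconds

file_size_kb = total_bitrate * duration / 8 / 1024
print(f"file size ≈ {file_size_kb:,.0f} kB")   # ≈ 4,741 kB, close to the 4,762 kB download

pixels_per_sec = 854 * 480 * 30
bits_per_pixel = video_bitrate / pixels_per_sec
print(f"{bits_per_pixel:.3f} bits/pixel = {bits_per_pixel / 8:.3f} bytes/pixel")
```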

Does this make sense? It means that the data for the video are a tiny fraction of a byte per pixel in each frame. I would have thought that each pixel would require at least three bytes for its color values. Reducing that from 3 to 0.017 would be more than 99% compression, which is larger than any compression ratio I've ever heard of.

Is there something wrong with my calculation?

  • Video compression is almost magical, right? What's in the video? Lots of movement? Scene changes? What's the codec used? What's the pixel format?
    – Daniel B
    Commented Mar 15 at 20:03
  • @DanielB - I provided a link to the video so you could see exactly what video the calculation is for.
    – NewSites
    Commented Mar 15 at 20:13
  • The chroma is sub-sampled in a YUV 4:2:0 pattern, so only one pair of U/V chroma samples per 4 luma (Y) samples. mediainfo calculates it as 0.133 Bits/(Pixel*Frame). (mediaarea.net/en/MediaInfo - open source free software). YUV420 is standard for video outside of high-bitrate formats used while recording and editing, or cinema. (i.e. any .mp4 you find on the internet except from video nerds demoing different formats; e.g. ffmpeg with x264 can encode and decode h.264 with High 4:2:2 or High 4:4:4 profile en.wikipedia.org/wiki/Advanced_Video_Coding#Profiles) Commented Mar 16 at 8:05
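The 4:2:0 arithmetic in the comment above can be sketched as follows (raw sample counts only, before any actual compression):

```python
# YUV 4:2:0 stores one U and one V chroma sample per 2x2 block of luma (Y)
# samples: per 4 pixels that's 4 Y + 1 U + 1 V = 6 samples at 8 bits each.
raw_bits_per_pixel = (4 + 1 + 1) * 8 / 4
print(raw_bits_per_pixel)              # 12.0 bits/pixel raw, vs 24 for RGB

# The encoded stream comes out far below even that:
measured = 1_635_000 / (854 * 480 * 30)
print(f"{measured:.3f}")               # 0.133, matching mediainfo's Bits/(Pixel*Frame)
```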

1 Answer


Depending on the video, the codec stores key frames that are essentially compressed images in a lossy format similar to JPEG. These are i-frames, which contain actual image data.

JPEG already achieves somewhere in the order of 10:1 compression with minimal loss of quality; newer video codecs are probably at least as good.

If the next few frames contain a lot of very similar data but a small amount of movement you can simply have a bit of data that says "move these areas of the image by x pixels" and then only compress actual new image data that wasn't already on screen. This is a p-frame, or a predicted frame based on a previous one.

If you have a series of successive p-frames after an i-frame you could essentially wipe out an entire series of frames. You would be reducing 8 full frames of data down to 1 single compressed frame along with a small amount of data giving transforms and mathematical calculations.

At 25 frames per second, if you have an i-frame every eighth frame, you could reduce the amount of data very significantly: potentially to somewhere around 1/8th, plus some new parts of the image.
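That estimate can be sketched numerically. The per-frame costs here are hypothetical, chosen only to illustrate the idea, not measured from any real encoder:

```python
# Rough group-of-pictures savings: 1 i-frame followed by 7 p-frames.
# Assume (hypothetically) each p-frame costs ~10% of a full i-frame.
i_cost = 1.0
p_cost = 0.10
gop_cost = i_cost + 7 * p_cost      # 8 frames encoded with prediction
baseline = 8 * i_cost               # 8 independent i-frames
print(f"relative size: {gop_cost / baseline:.2%}")   # 21.25% of the all-i size
```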

Then there are b-frames, bi-directionally predicted frames. These can look back at previous frames, but they can also look forward to upcoming ones. If you know that there is new data in the next full image then you can use that data to encode even less data in the current b-frame and rely on a reference frame even further ahead.

These predicted frames can massively reduce the amount of actual encoded data, but at the cost of increased processing power required to encode and decode the video. You need a lot of processing power to look backwards and forwards across the series of images, figure out all the similarities and differences, and apply transforms, blends, blurs, motion and so on.

It also costs the decoder because you need to buffer at least two i-frames in order to re-create all the frames in-between.

By offloading a lot of data into equations detailing transforms and motions you can reduce the bitrate far, far below what you would expect, and achieve compression ratios far higher, especially for video without much in the way of motion or changes between frames.
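Putting numbers to this for the video in the question, as a rough sanity check rather than an exact accounting (raw 8-bit RGB at 854x480 and 30 fps, versus the measured 1,635,000 bits/sec video stream):

```python
# Overall compression ratio: raw RGB frames vs the encoded h.264 stream.
raw_bps = 854 * 480 * 30 * 24      # 24 bits/pixel uncompressed RGB
encoded_bps = 1_635_000            # measured video data rate
ratio = raw_bps / encoded_bps
print(f"raw: {raw_bps / 1e6:.0f} Mbit/s, ratio ≈ {ratio:.0f}:1")   # roughly 180:1
```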

The video you linked shows some artefacts of highly compressed i-frames and has very little in the way of motion. It could easily get very high compression ratios.

You can get more information on the basics of the process at Wikipedia: Video compression picture types

  • Interesting. Could you say what you mean by "artefacts of highly compressed i-frames"?
    – NewSites
    Commented Mar 15 at 21:12
  • 1
    I expanded it as much as I can on my display and watched it about a dozen times and don't see what you're talking about, so I guess it takes a trained eye to see it. But I'll take your word for it. I take your answer to mean that there is nothing wrong with my calculation and it does make sense. A big surprise to learn about such a high level of compression. Thank you.
    – NewSites
    Commented Mar 15 at 22:33
  • 1
    h.264 I-frames (and JPEG) are lossy not lossless. (Unless you encoded with x264 -qp 0 lossless encoding!) Lossless compression (like PNG or lossless h.264) has a much worse compression ratio than this file compressed with x264 CRF=23 which is the default. (Looks like preset=faster, for subme=4 and ref=2 among other things, saving CPU time but getting a worse tradeoff between bitrate and quality. And profile=baseline instead of the default high, so no CABAC (only the less efficient CAVLC entropy coder for the final bitstream) and no B frames, so another maybe 10% worse qual / bitrate.) Commented Mar 16 at 8:14
  • 1
    Also, B frames can use previous P or even B frames as references (x264 --b-pyramid is on by default), not just the previous I frame. Also, your example I-frame interval is insanely frequent. This actual video was encoded with keyint=250 keyint_min=25, so a minimum interval of 25 frames between I frames. (Actually x264 can use an I frame more often, but it won't be a key frame; it won't stop the encoder from referencing earlier frames.) Most P frames use another very recent P frame as a reference, although the encoder can keep multiple old frames around as references. (@NewSites) Commented Mar 16 at 8:19
  • 1
    I think all modern video codecs have P (and B-frames if used) reference the last (or a recent) P-frame, rather than increasingly large changes from the last I-frame. Your answer doesn't say either way for P frame, but implies B frames just reference between two I frames. Having the decoder keep multiple reference pictures to choose from is also not unique to h.264, nor is a much larger keyframe interval. Some online video gets encoded with very short keyframe intervals like maybe 1 second or sometimes even half a second for easier seek, but that's still at least 12 frames. And 90 is common. Commented Mar 16 at 19:15
