How Video Encoding Works

Video encoding is one of the most computationally intensive processing tasks: compressing and encoding a 1-hour raw video to AVC (the most common video codec) can take up to 2 hours, depending on your hardware (e.g. processor, GPU, RAM) or your cloud provider’s. Different codecs (AVC, HEVC, AV1, VP9, etc.) and output configurations (resolution, bitrate, quality, color depth, etc.) entail different computing workloads, and the pattern is usually “the higher the quality and compression, the higher the encoding complexity”. But why is encoding so hard?

Generally, the video encoding process has three phases: Prediction, Transform and Encoding. The aim is to produce a compressed representation of an input video that can be later played by a decoder. Let’s get a quick overview of each phase.

Prediction

During the Prediction phase, the algorithm tries to find common blocks and patterns among neighboring frames. The goal is to link the data contained in self-contained frames (called “I frames”) to the frames that reference them (called “P frames” and “B frames”). When a reference (or “prediction”) is found, the redundant information is dropped and the algorithm has successfully compressed a piece of the video. The process continues until all the frames have been scanned and compressed, where possible, by referencing blocks both within the same frame and across multiple frames (the picture below represents the prediction process).
Every group of frames that share predictions is called a “GOP” (simply, “Group Of Pictures”!). The size of a GOP depends on the sequence (a scene change will end the GOP) and the distribution type (live streaming requires GOPs of 0.5 to 2 seconds, while for video on demand they can be longer). To learn more about GOPs and frame types, visit OTTverse’s guide on I, P, and B-frames.
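To make the idea of a “prediction” more concrete, here is a minimal sketch of block matching, one common way of finding a similar block in a neighboring frame. It is an illustration under simplifying assumptions (tiny grayscale frames, an exhaustive search in a small window), not how any particular encoder does it; the helper names `sad`, `get_block`, `best_match` and `make_frame` are made up for this example.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, size):
    """Cut a size x size block out of a frame (a list of pixel rows)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_match(reference, target_block, top, left, size, search=2):
    """Try every offset in a small window and keep the cheapest one."""
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= height - size and 0 <= x <= width - size:
                cost = sad(get_block(reference, y, x, size), target_block)
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best  # (cost, vertical offset, horizontal offset)


def make_frame(square_top, square_left, size=6):
    """A dark frame with one bright 2x2 square, standing in for real content."""
    frame = [[0] * size for _ in range(size)]
    for y in range(square_top, square_top + 2):
        for x in range(square_left, square_left + 2):
            frame[y][x] = 200
    return frame


reference = make_frame(1, 1)  # earlier frame
current = make_frame(2, 3)    # same square, moved down 1 and right 2

block = get_block(current, 2, 3, 2)
cost, dy, dx = best_match(reference, block, 2, 3, 2)

# The residual is what is left after subtracting the predicted block.
predicted = get_block(reference, 2 + dy, 3 + dx, 2)
residual = [[c - p for c, p in zip(row_c, row_p)]
            for row_c, row_p in zip(block, predicted)]

print("motion vector:", (dy, dx), "cost:", cost)
print("residual:", residual)
```

When the match is perfect, the encoder only needs to store the offset (the motion vector) instead of the pixels; anything the prediction misses ends up in the residual, which is what the next phase works on.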

Transform

The second phase is Transform: all the residual blocks (i.e., parts of the video that aren’t predicted) are “translated” into a non-graphical language with a pattern and a set of coefficients. The result is a collection of colors and instructions that will be used later, during the decoding process, to systematically rebuild the frames. This greatly reduces the amount of data that has to be stored.
To understand this process, imagine the job of a painter: the action of picking colors from the palette and applying them to the canvas is what happens in the decoding process; the Transform phase does exactly the reverse – taking the colors off the canvas and tidily placing them on the palette.
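As a rough illustration of what “a pattern and a set of coefficients” can look like, the sketch below applies a one-dimensional discrete cosine transform (DCT) to a row of sample values. The article doesn't name a specific transform, so this is an assumption: codecs such as AVC and HEVC use integer approximations of DCT-like transforms on two-dimensional blocks, and the floating-point version here is only meant to show the principle.

```python
import math


def dct_1d(samples):
    """Orthonormal DCT-II: turns samples into frequency coefficients."""
    n = len(samples)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(samples))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        coeffs.append(scale * total)
    return coeffs


def idct_1d(coeffs):
    """Inverse of dct_1d: rebuilds the samples from the coefficients."""
    n = len(coeffs)
    samples = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        samples.append(total)
    return samples


# A smooth gradient, typical of a flat image region (made-up values).
row = [10, 11, 12, 13, 14, 15, 16, 17]
coeffs = dct_1d(row)
rebuilt = idct_1d(coeffs)

print([round(c, 2) for c in coeffs])
print([round(s, 2) for s in rebuilt])
```

For smooth content most of the signal ends up in the first few coefficients, so the remaining ones can be stored coarsely or dropped; that is the “tidy palette” in the painter analogy, and the inverse transform is the decoder putting the colors back on the canvas.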

Encoding

Finally, all the values and parameters from the previous steps (the predictions, the transform coefficients and the information about the structure of the compressed data) are encoded in a computer-readable format. The result is a compressed bitstream that can be stored and transmitted: it’s what we know as “video”.
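One simple way to see how values become a bitstream is a variable-length code, where small (frequent) numbers get short codes. The sketch below uses unsigned order-0 Exp-Golomb coding, a scheme AVC and HEVC use for many header values; real encoders rely on more sophisticated entropy coders for the bulk of the data, so treat this as an illustration of the idea rather than a description of a full encoder.

```python
def exp_golomb_encode(value):
    """Unsigned order-0 Exp-Golomb code of a non-negative integer, as a bit string."""
    if value < 0:
        raise ValueError("value must be non-negative")
    binary = bin(value + 1)[2:]               # binary form of value + 1
    return "0" * (len(binary) - 1) + binary   # prefix with len - 1 zeros


def exp_golomb_decode(bits):
    """Decode a string of concatenated Exp-Golomb codes back into integers."""
    values, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":   # count the leading zeros of this code
            zeros += 1
            pos += 1
        values.append(int(bits[pos:pos + zeros + 1], 2) - 1)
        pos += zeros + 1
    return values


values = [0, 1, 3, 0, 7]   # made-up values, e.g. quantized coefficients
bitstream = "".join(exp_golomb_encode(v) for v in values)

print(bitstream)
print(exp_golomb_decode(bitstream))
```

Because no code is a prefix of another, the decoder can split the stream back into values without any separators.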

(If you’re adventurous enough, here are two VCodex overviews of AVC and HEVC encoding.)

The encoding process described above is probably complicated for you to crack, but – guess what – it is hard for computers as well! Image and video files are large compared to other categories of data, and data size is a bottleneck in processing performance, affecting execution time and, when a data transfer is involved (as in cloud encoding services), network transmission time.
To grasp how much data is needed to store a (compressed) video, consider the three following ways of representing a scene from the famous Big Buck Bunny cartoon (not sure if kids watch this type of content today, but video engineers do): plain text, image and video. Each of these representations conveys the same story snippet in a different fashion – stimulating the reader’s imagination (text), displaying a screenshot of the scene (image), and displaying the full sequence with audio (video). While the text has a size of only 397 bytes, the image is 618,149 bytes and the video 6,992,559 bytes. No doubt the video is much richer in detail and more enjoyable to consume, but one could argue that this doesn’t justify taking up more than 17,000 times the storage of the text representation!

An enormous, fluffy, and utterly adorable grey-furred rabbit is heartlessly harassed by the ruthless, loud, bullying gang of a flying squirrel, who is determined to squash his happiness. While he's peacefully strolling through a meadow, the gang throws acorns and nuts at him while he's standing helpless: after a dodged shot, he succumbs to a barrage of pesky bullets that whack him and stun him.

Text
397 Bytes

JPEG Image
618,149 Bytes

10-second video
6,992,549 Bytes
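For reference, the ratios behind this comparison can be worked out from the byte counts listed above. The snippet also derives the average bitrate of the 10-second clip, which is simple arithmetic on the same figures (audio included).

```python
# Sizes as listed above, in bytes.
text_bytes = 397
image_bytes = 618_149
video_bytes = 6_992_549
video_seconds = 10

image_vs_text = image_bytes / text_bytes
video_vs_text = video_bytes / text_bytes
video_vs_image = video_bytes / image_bytes

# Average bitrate of the clip in megabits per second.
video_mbps = video_bytes * 8 / video_seconds / 1_000_000

print(f"image / text:  {image_vs_text:,.0f}x")
print(f"video / text:  {video_vs_text:,.0f}x")
print(f"video / image: {video_vs_image:.1f}x")
print(f"average video bitrate: {video_mbps:.2f} Mbps")
```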

So, encoding videos means handling these large volumes of data while trying to shrink them into smaller files with complex algorithms: that’s no child’s play!
Video engineers are well aware of the computing capacity and time required to process videos into different formats. They account for performance and quality constraints when advising their CTOs about the choice of an encoding solution for a particular use case.

Encoding without ByteNite

Commonly, video encoding jobs are run on single machines that process the workloads sequentially (possibly with multithreading and multiprocessing). Video is streamed from the source and fed into the encoding process: each frame of the input video is scanned and goes through the three phases described above, forming GOPs that are transformed and encoded. The maximum throughput is constrained by the machine’s capacity, and the encoding speed, calculated by dividing the input video duration by the total processing time, varies according to the type of video and the encoding parameters. To understand this better, the following examples show how different output configurations affect encoding speed (using the open-source encoder FFmpeg):

Encoding job #1 (Meridian)
•  Input duration: 12’
•  Input size: 851 MB
•  Output codec: AVC
•  Output bitrate: 750 kbps
•  Output resolution: 480p
Total encoding time: 4’48”
Encoding speed = 12’/4’48” = 2.5x
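The encoding speed used in these examples is the input duration divided by the encoding time, so values above 1x mean “faster than real time”. A small helper (the function names are just for illustration) makes the calculation explicit:

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes/seconds duration to seconds."""
    return minutes * 60 + seconds


def encoding_speed(input_seconds, encoding_seconds):
    """Input duration divided by encoding time (above 1.0 is faster than real time)."""
    return input_seconds / encoding_seconds


# Job #1: a 12-minute input encoded in 4 minutes 48 seconds.
job1_speed = encoding_speed(to_seconds(12), to_seconds(4, 48))
print(f"{job1_speed:.1f}x")
```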

Encoding job #2 (Unbelievable Beauty)
•  Input duration: 61’
•  Input size: 8.23 GB
•  Output codec: HEVC
•  Output bitrate: 1.5 Mbps
•  Output resolution: 1080p
Total encoding time: 72’
Encoding speed = 61’/72’ = 0.8x
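For readers who want to try similar configurations, here is a sketch of how the two jobs could be expressed as FFmpeg commands. The exact commands used for these tests aren't given in this article, so everything below is an assumption: the file names are hypothetical, `libx264` and `libx265` are FFmpeg's usual software encoders for AVC and HEVC, and audio and preset options are left at their defaults. The code only builds and prints the commands; it doesn't run FFmpeg.

```python
import shlex

# FFmpeg's common software encoders for each codec (an assumption about the setup).
ENCODERS = {"AVC": "libx264", "HEVC": "libx265"}


def build_ffmpeg_command(source, output, codec, bitrate, height):
    """Return an FFmpeg argument list for one bitrate-targeted output."""
    return [
        "ffmpeg",
        "-i", source,                  # input file
        "-c:v", ENCODERS[codec],       # video encoder
        "-b:v", bitrate,               # target video bitrate
        "-vf", f"scale=-2:{height}",   # resize, keeping the aspect ratio
        output,
    ]


# Hypothetical file names; bitrates and resolutions are taken from the two jobs.
job1 = build_ffmpeg_command("meridian.mp4", "meridian_480p.mp4",
                            "AVC", "750k", 480)
job2 = build_ffmpeg_command("unbelievable_beauty.mp4",
                            "unbelievable_beauty_1080p.mp4",
                            "HEVC", "1500k", 1080)

for command in (job1, job2):
    print(shlex.join(command))
```

Passing the list to `subprocess.run` would start the encode on a machine where FFmpeg is installed.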

You can see a drop in the second encoding speed due to a different input video and set of encoding parameters (in particular, the HEVC codec is more than twice as slow as AVC). There are several reasons why one codec, resolution, or bitrate is preferred over another, and encoding speed takes a back seat when there are requirements for video quality, size and playability. Nevertheless, overall video efficiency always comes at the expense of encoding time. Without ByteNite, the only way to break the complexity curse is to develop more efficient codecs (e.g., LCEVC), optimize encoding software, or, lastly, boost the performance by allocating more powerful resources (CPUs, GPUs, RAM, etc.).
When you encode on premise, the optimization above comes at your own expense. When you encode in the cloud, your provider has already integrated the latest encoding software and absorbed all sunk costs, and bills you according to the total minutes of video output, or uses another pricing model (check out this really well-made guide on cloud encoding pricing). However, by using cloud encoding SaaS like Bitmovin, AWS MediaConvert or Zencoder, you won’t be able to change the processing speed because the workload is managed by the provider (that’s the double-edged sword of software as a service). Therefore, even when technology constraints are not an issue, there are provider-specific product limitations that won’t let you get the most out of your processing performance.

Encoding with ByteNite

ByteNite tackles the encoding complexity problem from another perspective: instead of focusing on vertical speed, it breaks down longer videos into multiple chunks that are encoded in parallel by our network of devices. Such inter-machine parallelization, a.k.a. grid computing, makes it possible to handle longer videos and higher workloads without involving powerful and expensive machines. It’s a bit like diluting a bitter medicine in a glass of water!
Surprisingly, the video segmentation process is anything but cumbersome or slow: our software automatically detects potential GOPs (remember them from earlier in the article?), extracts them from the input video – as it’s streamed from the source – and sends them to different devices for encoding. Since different GOPs don’t share frames or predictions, the Assembler (that’s what we call ByteNite’s module for merging data chunks) simply glues processed video chunks next to each other, just a few seconds after all the chunks have been processed.
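The split, encode and assemble flow described above can be sketched in a few lines. This is a toy model, not ByteNite's actual software: frames are represented only by their type, GOP boundaries are assumed to be known and closed (each chunk starts at an I frame and references nothing outside itself), threads stand in for separate devices, and `encode_chunk` is a placeholder for a real encoder call.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_gops(frame_types):
    """Start a new chunk at every I frame (assumes closed GOPs)."""
    gops = []
    for index, frame_type in enumerate(frame_types):
        if frame_type == "I" or not gops:
            gops.append([])
        gops[-1].append((index, frame_type))
    return gops


def encode_chunk(chunk):
    """Placeholder for encoding one chunk on one worker device."""
    return [f"enc({index}:{frame_type})" for index, frame_type in chunk]


def encode_in_parallel(frame_types, workers=4):
    """Encode each GOP independently, then glue the results back in order."""
    gops = split_into_gops(frame_types)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, whatever order they finish in.
        encoded_chunks = list(pool.map(encode_chunk, gops))
    return [frame for chunk in encoded_chunks for frame in chunk]


frames = list("IPBBPIPPBIBP")   # made-up sequence of frame types
gops = split_into_gops(frames)
result = encode_in_parallel(frames)

print(len(gops), "chunks")
print(result)
```

The assembling step is a plain concatenation precisely because the chunks are independent; if chunks shared predictions, they could not be encoded separately and merged this cheaply.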
This process turns out to be up to 10x faster than standard encoding. The test results below illustrate this for the two jobs presented above.

Encoding job #1 (Meridian) on ByteNite
•  Input duration: 12’
•  Input size: 851 MB
•  Output codec: AVC
•  Output bitrate: 750 kbps
•  Output resolution: 480p
Total encoding time: 2’15”
Encoding speed = 12’/2’15” = 5.3x

Encoding job #2 (Unbelievable Beauty) on ByteNite
•  Input duration: 61’
•  Input size: 8.23 GB
•  Output codec: HEVC
•  Output bitrate: 1.5 Mbps
•  Output resolution: 1080p
Total encoding time: 7’32”
Encoding speed = 61’/7’32” = 8.1x
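To compare the two approaches directly, the speed-up can be computed as the single-machine encoding time divided by the ByteNite encoding time, using the total encoding times reported for each job:

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes/seconds duration to seconds."""
    return minutes * 60 + seconds


# (single-machine time, ByteNite time) for each job, in seconds.
jobs = {
    "Meridian (AVC, 480p)": (to_seconds(4, 48), to_seconds(2, 15)),
    "Unbelievable Beauty (HEVC, 1080p)": (to_seconds(72), to_seconds(7, 32)),
}

speedups = {name: single / distributed
            for name, (single, distributed) in jobs.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x faster")
```

The longer, heavier HEVC job benefits far more than the short AVC one, which fits the idea that longer videos yield more chunks to spread across devices.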

TL;DR

Video encoding is a complex processing activity, requiring many computations and powerful hardware. Video broadcasting companies and associations are attacking this problem by optimizing encoding software and developing more efficient codecs, but, ultimately, the core processing activity is still the same. ByteNite has built distributed computing software that uses the inherent segmentation of encoded video (the Groups Of Pictures) to tame encoding complexity, by having each device process a small amount of video simultaneously. This way, videos can be encoded up to 10x faster.
