CSA302
Lecture 9 - Compression Techniques: H.261 and MPEG
References:
Steinmetz, R. and Nahrstedt, K. (1995). Multimedia: Computing, Communications & Applications. Prentice Hall. Chapter 6.
Aravind, R. et al. (1993). Image and Video Coding Standards.
H.261 and H.263
H.261 and H.263 are video transmission standards. H.261 was developed for low-bit-rate ISDN services,
specifically to provide video-conferencing and ISDN video telephony.
Although developed with ISDN in mind, H.261 encoded data can be transported
over other networks. H.263 was also designed for low bit rate
transmission, and includes some improvements and changes to H.261.
The main requirement of video-conferencing is that, in order to maintain the
illusion of motion, a minimum of 16 frames per second must be digitized
from an analog source, compressed, transmitted, received, decompressed, and
finally displayed. Usually, only a head-and-shoulders shot of the
speakers involved in the video-conferencing session is transmitted.
Consequently, the background tends to remain stable and the amount of movement is
generally small. The data rate required to support this is
200 Kbit/s. However, in video-conferencing systems used, for example, to allow
students to attend a virtual classroom, there is less of a
restriction on movement, and the camera will track the lecturer's movement
around the classroom. In this case, the background does not remain stable,
and there is more freedom of movement. The downside is that a data
rate of 6 Mbit/s is required.
In 1984, the CCITT Study Group XV (Transmission Systems and Equipment)
established a "Specialists' Group on Coding for Visual Telephony". The
H.261 standard stabilized in 1989 on a codec which could deliver
compressed video over 1 to 30 ISDN channels of 64 Kbit/s each (hence the
informal name p×64). H.261 is geared towards cross-platform, real-time video
communication over low-bit-rate channels, and so has some interesting design properties.
H.261 is a complete specification only for the video decoder. It simply describes the
H.261 compressed video data stream. Third-party encoders (which can
be written for any platform) must produce the data stream as defined by
H.261. H.261 takes advantage of the fact that in one-to-one video-conferencing
sessions, head shots of the participants will be transmitted.
Consequently, the differences between successive video frames will be
minimal. Only the differences from a previous frame need to be communicated to the
receiver. As the communication is real-time, and capturing the analog video
signal in the first place already carries a heavy overhead, the
operation of the coder and decoder needs to be simple. H.261 is a compromise
between coding performance, real-time requirements, implementation
simplicity, and system robustness. The system needs to be robust, because users
will avoid using a brittle system, and the codec operation and
implementation must introduce as little delay as possible into the end-to-end
communication in order to support real-time interaction. The transport speed
is a large factor - imposing minimum specifications which would require a
high bit-rate would limit the practical implementation of the codec to the few
institutions able to afford high-speed communications. As video
transmission is expected to occur over narrowband networks, the coding
structures and parameters are geared towards low bit-rates.
H.263, on the other hand, is a standard proposed in 1996 to replace H.261,
and is one of the best techniques available. The differences between H.261
and H.263 will be highlighted below.
Coding Structures and Components
H.261 allows only 2 picture formats - CIF (common intermediate format) and
QCIF (Quarter-CIF), both component video. CIF frames are made up of one
luminance signal (Y) and 2 chrominance signals (CB and
CR). The chrominance signals represent colour differences (see the note on
YCBCR at the end of these notes). The Y frame size for CIF is 288 lines of 352
pixels each. CB and CR are sub-sampled to 176 pixels per line and 144
lines per frame. QCIF is exactly half the CIF resolution in each dimension. All
H.261 codecs must support QCIF, although support for CIF is optional. The frame
aspect ratio for both formats is 4:3 (the same as standard TV). The
frame rate is 30 non-interlaced frames per second.
H.263 supports 5 resolutions: CIF, QCIF, SQCIF, 4CIF, and 16CIF. SQCIF has
about half the resolution of QCIF. 4CIF and 16CIF have 4 and 16 times the
resolution of CIF.
The compressed H.261 bit-stream is composed of 4 layers: the Picture Layer,
the Group of Blocks Layer, the Macro Block Layer, and the Block Layer. The
Picture Layer contains its own header information followed by the data of a
number of the lower layers.
The main feature of H.261 is that compression is achieved by a combination of
inter-frame prediction, block transformation and quantization.
Data units the size of 8x8 pixels are used for the representation of the
Y, CB and CR
components. A macro block is the result of combining 4 blocks of the
Y matrix with one block each of the CB and
CR components. A group of blocks (GOB) is defined to
contain 33 macro blocks, and finally a CIF contains 12 GOBs (QCIF contains 3 GOBs).
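As a quick check on these numbers, the sketch below (plain Python; the 4:2:0 sampling and 8 bits per sample follow the note on YCBCR at the end of these notes) works out how many macro blocks and GOBs a CIF frame contains, and what the uncompressed data rate would be at 30 frames per second - which shows why compression is essential on 64 Kbit/s channels.

# Structure of a CIF frame under H.261 (4:2:0 sampling, 8 bits per sample).
Y_W, Y_H = 352, 288                    # luminance (Y) resolution for CIF
C_W, C_H = 176, 144                    # chrominance (CB, CR) resolution after sub-sampling

mb_cols, mb_rows = Y_W // 16, Y_H // 16        # a macro block covers 16x16 luminance pixels
macro_blocks = mb_cols * mb_rows               # 22 * 18 = 396 macro blocks per frame
gobs = macro_blocks // 33                      # 396 / 33 = 12 GOBs, as stated above
blocks_per_macro_block = 4 + 1 + 1             # 4 Y blocks + 1 CB block + 1 CR block

samples_per_frame = Y_W * Y_H + 2 * (C_W * C_H)
uncompressed_bits_per_second = samples_per_frame * 8 * 30   # roughly 36.5 Mbit/s

print(macro_blocks, gobs, blocks_per_macro_block, uncompressed_bits_per_second)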
Coding Algorithms
H.261 uses both interframe and intraframe coding. During interframe coding,
redundant information is not transmitted, because it will be available to the
decoder in previous or subsequent frames. The macro blocks of the current
frame are compared with those of the previous frame. A simple H.261
implementation compares only macro blocks at the same positions
in the previous and current images to compute a prediction error. An advanced implementation will perform
motion compensation to "track" the movement of a block of pixels from one
image to the next using a motion vector. In both cases, the result will be
a DPCM-coded (Differential Pulse Code Modulation-coded) macro block.
The motion vector (which will always be zero-length in the simple H.261
implementation) and DPCM-coded macro block (if and only if its value exceeds
a certain threshold) are subsequently processed. The motion
vector is entropy-coded. Macro blocks which require further coding are transformed using DCT and the
coefficients are quantized and variable-length coded.
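A minimal sketch of the simple (zero motion vector) case is given below, using numpy; the threshold value is an illustrative choice, since the standard leaves that decision to the encoder.

import numpy as np

THRESHOLD = 1500   # illustrative value only; H.261 does not fix this

def code_macro_block(prev, curr, y, x, size=16, threshold=THRESHOLD):
    """Compare a macro block with the co-located one in the previous frame.

    Returns None when the prediction error is below the threshold (nothing
    needs to be sent); otherwise returns the DPCM-coded macro block, which
    would then be DCT-transformed, quantized and variable-length coded.
    """
    target = curr[y:y + size, x:x + size].astype(np.int32)
    reference = prev[y:y + size, x:x + size].astype(np.int32)
    error = target - reference          # DPCM: difference from the previous frame
    if np.abs(error).sum() < threshold:
        return None                     # motion vector is (0, 0) and no update is coded
    return error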
For intraframe coding, each block of 8x8 pixels is transformed into 64
coefficients using DCT (similar to JPEG). The AC- and DC-coefficients are
quantized and entropy-coded.
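The following sketch shows the intraframe path for a single block, building the 8x8 DCT from an explicit basis matrix; the single quantization step size is a simplification of the quantizer actually defined by the standard.

import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix, so that C @ block @ C.T is the 2-D DCT."""
    j = np.arange(n)[None, :]           # sample index
    k = np.arange(n)[:, None]           # frequency index
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * k / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)
    return C

C = dct_matrix()

def code_intra_block(block, step=16):
    """Transform an 8x8 pixel block into 64 coefficients and quantize them."""
    coeffs = C @ block.astype(np.float64) @ C.T    # energy gathers in the top-left corner
    return np.round(coeffs / step).astype(int)     # most coefficients become zero

# A smooth block: only the DC and a few low-frequency AC coefficients survive.
block = np.tile(np.arange(130, 138), (8, 1))
print(code_intra_block(block))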
H.263 has four negotiable coding modes, which may be used
separately or together (except for the advanced prediction mode,
which requires the unrestricted motion vector mode):
- Syntax-based arithmetic coding mode
uses the more efficient arithmetic coder, instead of variable-length coding
- PB-frame mode
uses the frame "averaging" technique used in MPEG
- Unrestricted motion vector mode
if movement occurs at the edges of the picture, the normal motion vector cannot be used
efficiently. However, the unrestricted motion vector places the picture on
a "canvas", so that references can be made to locations on the canvas which
are not within the dimensions of the picture
- Advanced prediction mode
allows four motion vectors per macro block and uses overlapped block motion
compensation, which produces smoother predictions at block boundaries
H.261/H.263 Data Stream and Characteristics
The data stream has a hierarchical structure composed of several layers, and it
includes information for error correction. Each image has a 5-bit (8-bit in
H.263) temporal reference number. Using a command sent from the application to
the decoder, it is possible to freeze the last image displayed as a still
image. The encoder can also send commands to switch between still and
moving images.
MPEG
MPEG is actually a series of standards for video and audio encoding.
The major differences in the standards are the targeted application
areas. MPEG-1 is suitable for low-bit-rate transmission, up to a maximum
of about 1.5 Mb/s. MPEG-2 is suitable for higher bit-rate environments,
such as HDTV. MPEG-4 is suitable for interactive TV, as audio-visual
scenes can be described as audio-visual objects which have relations in
space and time. Additionally, MPEG-4 allows the integration of media
objects of different types.
We will concentrate primarily on MPEG-1 which makes the storage and
transmission of audio-visual data more efficient.
MPEG-1 is optimized to work with video resolutions of 352x240 at 30 fps
(NTSC) and 352x288 at 25 fps (PAL) with a post-compression maximum data
transfer requirement of 1.5 Mb/s, although it is not limited to
these frame resolutions and bit-rates. At the optimal frame resolutions
and bit-rates, the MPEG-1 decoded video is comparable in quality to VHS
video.
The MPEG standard mandates real-time decoding, although it does not
stipulate whether the encoder should be symmetric or asymmetric. With a
real-time encoder, MPEG-1 can be used for video-conferencing. If the encoding
is asymmetric, then the source video will have been compressed in advance,
stored, and can be transmitted and decompressed many times. This suits a
video library.
The MPEG standard has three parts:
- Part 1: Synchronization and multiplexing of video and audio
- Part 2: Video
- Part 3: Audio
Video
To achieve compression, MPEG exploits the spatial and temporal redundancy
present in the video signal. The primary requirement of MPEG is that it
achieves the best quality decoded video at the given bit-rate. In addition,
different multimedia applications may impose other requirements. For
example, there may be the requirement to access any portion of the
bit-stream in a short time, for rapid access to any decoded video frame,
especially if the storage medium supports random access. It might also be
desirable to edit encoded bit-streams directly while preserving their decodability.
Compression Algorithm Overview
Exploiting spatial redundancy
The compression technique employed by MPEG uses a combination of techniques
from the JPEG and H.261 compression standards. Video is a sequence of still
images, and as such it would be desirable to use JPEG compression, especially
to support direct access to individual frames. However, the resulting data stream
would violate the maximum bandwidth requirement of 1.5Mb/s. Bear in mind
that if the decoding of a single frame is dependent on the decompression of
all the preceding frames, then direct access to individual frames is not
possible. However, in order to support rapid access, MPEG encodes certain
frames using only dependencies within the frame itself, so that their decoding
does not depend on any other frame; these are known as I-frames. Decoding of other frames (known as P- and
B-frames) will be dependent on the prior decoding of an I-frame. As with
JPEG and H.261, I-frames are first divided into non-overlapping 8x8 pixel
blocks. Each block is DCT transformed, which results in a DCT block of 64
coefficients, where most of the energy in the original image is
concentrated into the top left-hand corner of the DCT block. A quantizer
is then applied which reduces the precision of the coefficient values,
setting most of them to zero. An entropy-coder then reduces the length of
the data stream for that block.
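To make that final step concrete, the sketch below performs the zig-zag scan and run-length pairing that a quantized block (such as the one produced in the earlier DCT sketch) goes through before entropy coding; the actual variable-length code tables are defined by the standards and are not reproduced here.

def zigzag_order(n=8):
    """Visit an n x n block in zig-zag order, starting at the DC coefficient."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_length(quantized):
    """Pair each non-zero AC coefficient with the run of zeros preceding it."""
    scan = [quantized[r][c] for r, c in zigzag_order(len(quantized))]
    dc, ac = scan[0], scan[1:]
    pairs, run = [], 0
    for value in ac:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return dc, pairs   # the (run, value) pairs are then mapped to variable-length codes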
Exploiting temporal redundancy
Temporal redundancy results from a high-degree of correlation between
adjacent frames. In both H.261 and MPEG, it is expected that differences
between successive frames will be small, and that motion can be detected by
finding where a block of pixels in one image has shifted to in the subsequent frame -
implying a temporal relationship between the blocks. A block of pixels,
called a target block, in a frame is compared with a set of blocks
of the same dimensions in the previous frame (the reference frame).
The block in the reference frame which best matches the target block is used
as the prediction for the latter. This best matching block is associated
with a motion vector that describes the displacement between it and the
target block. As in H.261, the motion vector is encoded and transmitted
along with the prediction error. The prediction error is transmitted using
the DCT-based intraframe encoding technique described above. Although the
motion vector can be zero-length in simple implementations of H.261, in MPEG
full advantage is taken of motion compensation. Predictive frames are known
as P-frames in MPEG.
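The sketch below shows the block-matching step itself: an exhaustive search over a small window of the reference frame (the ±7-pixel range is an illustrative choice, not a value mandated by MPEG), returning the motion vector of the best-matching block together with the prediction error that would then be DCT-coded.

import numpy as np

def motion_estimate(reference, frame, y0, x0, size=16, search=7):
    """Find the motion vector minimising the sum of absolute differences (SAD)."""
    target = frame[y0:y0 + size, x0:x0 + size].astype(np.int32)
    best_sad, best_mv, best_error = np.inf, (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + size > reference.shape[0] or x + size > reference.shape[1]:
                continue                          # candidate block falls outside the frame
            candidate = reference[y:y + size, x:x + size].astype(np.int32)
            sad = int(np.abs(target - candidate).sum())
            if sad < best_sad:
                best_sad, best_mv, best_error = sad, (dy, dx), target - candidate
    return best_mv, best_error    # the vector is entropy-coded; the error is DCT-coded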
Bidirectional temporal prediction
Bidirectional temporal prediction, also called motion compensated
interpolation, is a key feature of MPEG video. Some of the video frames are
encoded using two reference frames, one in the past and one in the future.
A block in these frames can be predicted by another block from the past
reference frame (forward prediction), from the future reference frame
(backward prediction), or from an average of two blocks, one from each
reference frame (interpolation). In every case, the block from the
reference frame is associated with a motion vector, so that two motion
vectors are associated with interpolation. B-frames are never themselves
used as reference frames. The advantage of bidirectional prediction is that
typically higher compression ratios are achievable than with forward
prediction alone: the same picture quality can be achieved using fewer
encoded bits. However, as encoding may be based on frames
that have not yet been seen, a delay is introduced, because frames must be
encoded out of sequence. There is additional encoding complexity because
block matching must be performed on two reference frames.
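The sketch below, which reuses the hypothetical motion_estimate function from the previous example, chooses between forward, backward and interpolated prediction for one block of a B-frame; a real encoder would also weigh the cost of transmitting one motion vector against two, which is omitted here.

import numpy as np

def predict_b_block(past, future, frame, y0, x0, size=16):
    """Choose forward, backward or interpolated prediction for a B-frame block."""
    target = frame[y0:y0 + size, x0:x0 + size].astype(np.int32)
    fwd_mv, fwd_error = motion_estimate(past, frame, y0, x0, size)
    bwd_mv, bwd_error = motion_estimate(future, frame, y0, x0, size)
    fwd_pred = target - fwd_error               # reconstruct the two reference blocks
    bwd_pred = target - bwd_error
    interp_pred = (fwd_pred + bwd_pred) // 2    # average of past and future blocks
    candidates = {
        "forward": (fwd_error, (fwd_mv,)),
        "backward": (bwd_error, (bwd_mv,)),
        "interpolated": (target - interp_pred, (fwd_mv, bwd_mv)),   # two motion vectors
    }
    mode = min(candidates, key=lambda m: np.abs(candidates[m][0]).sum())
    error, motion_vectors = candidates[mode]
    return mode, motion_vectors, error          # the error is DCT-coded as before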
The MPEG bit-stream structure
MPEG defines the structure of the MPEG data stream and how a decoder
should decode it. It does not specify how the encoder should produce the
data stream in the first place. This leaves encoders free to be
tailored to the application areas in which they will be used.
To support this flexibility, the data structure imposed on the
data stream is constructed in several layers, each performing a different
logical function. At the top of the hierarchy is the video sequence layer,
which contains basic parameters, such as the size of the video frames, the
frame rate, the bit rate, and other global parameters.
Inside the video sequence layer is the Group of Pictures (GOP) layer, which
provides support for random access, fast search and editing. A video
sequence is divided into a series of GOPs, where each GOP contains an
I-frame followed by an arrangement of P-frames and B-frames. Random access
and fast search are enabled by the I-frames (and D-frames, which are
DC-coefficients only). MPEG allows GOPs to be of
arbitrary structure and length. The GOP Layer is the basic unit for editing
an MPEG video bit stream.
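Because a B-frame depends on a reference frame that is displayed after it, the coding (transmission) order of a GOP differs from its display order. The sketch below reorders one hypothetical GOP pattern; the pattern itself is only an example, since MPEG allows GOPs of arbitrary structure and length.

def coding_order(display_order):
    """Reorder a GOP so that every B-frame follows both of its reference frames."""
    out, pending_b = [], []
    for position, frame_type in enumerate(display_order):
        if frame_type == "B":
            pending_b.append((position, frame_type))   # held back until the next reference
        else:                                          # an I- or P-frame (a reference frame)
            out.append((position, frame_type))
            out.extend(pending_b)
            pending_b = []
    return out

# Display order I B B P B B P becomes coding order I P B B P B B.
print(coding_order(list("IBBPBBP")))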
The compressed bits produced by encoding a frame in a GOP constitute
the picture layer. The picture layer first contains information on the type
of frame that is present (I, P, or B) and the position of the frame in the
display order. The bits corresponding to the motion vector are packaged in
the macro block layer, and those corresponding to the DCT unit are packaged
in the block layer. In between the macro block layer and the picture layer
is the slice layer, which is used mainly for resynchronization during the
decoding of a frame in the event of bit errors. Prediction registers used
in the differential encoding of the motion vector are reset at the
beginning of a slice.
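A small sketch of the differential encoding just mentioned: each motion vector is sent as a difference from the previous one, and the prediction register is reset to zero at the start of every slice, so an error cannot propagate beyond the slice in which it occurs (the grouping of vectors into slices here is purely illustrative).

def encode_motion_vectors(slices):
    """Differentially encode motion vectors, resetting the predictor per slice.

    `slices` is a list of slices, each a list of (dy, dx) motion vectors.
    """
    coded = []
    for vectors in slices:
        prev_dy, prev_dx = 0, 0              # prediction register reset at slice start
        for dy, dx in vectors:
            coded.append((dy - prev_dy, dx - prev_dx))
            prev_dy, prev_dx = dy, dx
    return coded

# Two slices: the second starts again from (0, 0), so a bit error in the first
# slice cannot corrupt the vectors decoded from the second.
print(encode_motion_vectors([[(1, 2), (1, 3)], [(4, 0), (5, 1)]]))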
MPEG Audio encoding
MPEG audio compression uses sub-band coding to achieve high compression
ratios whilst maintaining audio quality.
Sub-band coding is a flexible method of encoding audio samples from a
variety of different sources. The field of psychoacoustics has determined
that a loud audio signal masks quieter signals at nearby frequencies, as well
as quieter signals occurring just before or after it in time, so that the
human ear cannot detect them. Sub-band coding techniques identify audio
components that would be undetectable by the human ear anyway, and discard
them, thus reducing storage space requirements.
First, a time-frequency mapping (e.g., FFT) decomposes the input signal into
subbands. A psychoacoustic
model looks at these subbands as well as the original signal, and
determines masking thresholds using psychoacoustic information. Using
these masking thresholds, each of the subband samples is quantized and
encoded so as to keep the quantization noise below the masking
threshold. The final step is to assemble all these quantized samples
into frames, so that the decoder can parse and decode them without losing synchronization.
Decoding is easier, since there is no need for a psychoacoustic model.
The frames are unpacked, subband samples are decoded, and a
frequency-time mapping turns them back into a single output audio
signal.
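The sketch below reduces the quantization step to its core idea: given per-subband signal levels and masking thresholds (both hypothetical numbers here, not the output of a real psychoacoustic model), bits are allocated so that the quantization noise stays below the mask, and fully masked subbands cost nothing at all.

import math

def allocate_bits(signal_db, mask_db, max_bits=15):
    """Give each subband just enough bits to keep quantization noise below its mask.

    Each extra bit of quantizer resolution lowers the noise floor by roughly
    6 dB, so the allocation is driven by the signal-to-mask ratio (SMR).
    """
    bits = []
    for signal, mask in zip(signal_db, mask_db):
        smr = signal - mask              # how far the signal sits above its masking threshold
        if smr <= 0:
            bits.append(0)               # fully masked: the subband can be discarded
        else:
            bits.append(min(max_bits, math.ceil(smr / 6.02)))
    return bits

# Hypothetical levels for four subbands: the third is masked and is thrown away.
print(allocate_bits(signal_db=[60, 48, 20, 35], mask_db=[30, 40, 25, 34]))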
MPEG-1 Audio is really a group of three different sub-band coding (SBC)
schemes, called layers. Each layer is a self-contained SBC coder with its own
time-frequency mapping, psychoacoustic model, and quantizer.
Layer 1 is the simplest, but gives the poorest
compression. Layer 3 is the most complicated and difficult to compute,
but gives the best compression. The idea is that an application of
MPEG-1 Audio can use whichever layer gives the best tradeoff between
computational burden and compression performance. Audio can be encoded
in any one layer. A standard MPEG decoder for any layer is also able to
decode lower (simpler) layers of encoded audio. MPEG-1 Audio is intended
to take a PCM audio signal sampled at a rate of 32, 44.1 or 48 kHz, and
encode it at a bit rate of 32 to 192 Kbits per audio channel (depending
on layer).
See Introduction
to Sub-band Coding in MPEG-1 for more information.
Synchronizing Audio and Video in MPEG-1
The audio and video data streams are interleaved into a single system stream.
Time stamps derived from a common system clock are embedded in the stream so
that the decoder can keep the audio and video in step.
Note on YCBCR
The RGB representation of pixels is expensive in terms of storage space and
transmission bandwidth. YCBCR is a transformation of the RGB signals into a
cheaper representation format which takes advantage of the fact that the
human eye is more sensitive to luminance (related to brightness)
information than it is to chrominance (colour) information. Essentially, Y
is the light intensity, CB is the blue colour
difference signal, and CR is the red colour difference
signal (in both cases, the difference is taken from the luminance signal). The information for
CB and CR can either be
complete (in which case it is as expensive as RGB), or else it can be
reduced to half or one quarter of the original chrominance information.
In order to know what chrominance information can be thrown away, it is
necessary to look at groups of pixels. Usually, groups of four 8x8 blocks of
pixels are processed in order to obtain the closest mapping from the RGB
to the YCBCR colour
space. If no chrominance information is thrown away, there will be a 4:4:4
video format (4 Y blocks, 4 CB blocks and 4
CR blocks). Otherwise, the formats will be 4:2:2
or 4:2:0 depending on whether the chrominance information is reduced by one-half
or three-quarters. H.261 and MPEG, as well as most other video applications
that use this colour space, use 4:2:0 video. This means that whereas the
original RGB information would have required a 12 block representation
space, 4:2:0 video requires only 6 - achieving an immediate 50% reduction
in space and bandwidth requirements.
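As an illustration of the space saving, the sketch below converts an RGB frame to YCBCR using the ITU-R BT.601 luminance weights commonly used with these formats, and then sub-samples the chrominance by two in each direction to give 4:2:0 data; the offsets and rounding are simplified.

import numpy as np

def rgb_to_ycbcr_420(rgb):
    """Convert an RGB image (H x W x 3, values 0..255) into 4:2:0 Y, CB and CR planes."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b       # luminance
    cb = 0.564 * (b - y) + 128                  # blue colour difference signal
    cr = 0.713 * (r - y) + 128                  # red colour difference signal
    # 4:2:0: average each 2x2 block of chrominance samples, keeping one quarter of them.
    cb = cb.reshape(cb.shape[0] // 2, 2, cb.shape[1] // 2, 2).mean(axis=(1, 3))
    cr = cr.reshape(cr.shape[0] // 2, 2, cr.shape[1] // 2, 2).mean(axis=(1, 3))
    return y, cb, cr

# For a CIF-sized frame the Y plane stays 288x352 while CB and CR shrink to 144x176,
# halving the total number of samples compared with RGB.
y, cb, cr = rgb_to_ycbcr_420(np.random.rand(288, 352, 3) * 255)
print(y.shape, cb.shape, cr.shape)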