CSA302
Lecture 9 - Compression Techniques: H.261 and MPEG
References:
Steinmetz, R. and Nahrstedt, K. (1995). Multimedia: Computing, Communications & Applications. Prentice Hall. Chapter 6.
Aravind, R. et al. (1993). Image and Video Coding Standards.
H.261 and H.263
H.261 and H.263 are video transmission standards. H.261 was developed for low-bit-rate ISDN services,
specifically to provide video-conferencing and ISDN video telephony.
Although developed with ISDN in mind, H.261 encoded data can be transported
over other networks. H.263 was also designed for low bit rate
transmission, and includes some improvements and changes to H.261.
The main requirement of video-conferencing is that, in order to maintain the
illusion of motion, a minimum of 16 frames per second must be digitized
from an analog source, compressed, transmitted, received, decompressed, and
finally displayed. Usually, only a head-and-shoulders shot of the
speakers involved in the video-conferencing session is transmitted.
Consequently, the background tends to remain stable and the amount of movement is
generally small. The data rate required to support this is
200 Kbit/s. However, in video-conferencing systems used, for example, to allow
students to attend a virtual classroom, there is less of a
restriction on movement, and the camera will track the lecturer's movement
around the classroom. In this case, the background does not remain stable,
and there is more freedom of movement. The downside is that a data
rate of 6 Mbit/s is required.
In 1984, the CCITT Study Group XV (Transmission Systems and Equipment)
established a "Specialists' Group on Coding for Visual Telephony". The
H.261 standard stabilized in 1989 on a codec which could deliver
compressed video over 1 to 30 ISDN channels of 64 Kbit/s each (hence the
informal name p×64). H.261 is geared towards cross-platform, real-time video
communication over low-bit-rate channels, and so has some interesting design properties.
H.261 is a complete specification only for the video decoder. It simply describes the
H.261 compressed video data stream. Third-party encoders (which can
be written for any platform) must produce the data stream as defined by
H.261. H.261 takes advantage of the fact that in one-to-one video-conferencing
sessions, head shots of the participants will be transmitted.
Consequently, the differences between successive video frames will be
minimal. Only the differences from a previous frame need to be communicated to the
receiver. As the communication is real-time, and capturing the analog video
signal in the first place already carries a heavy overhead, the
operation of the coder and decoder needs to be simple. H.261 is a compromise
between coding performance, real-time requirements, implementation
simplicity, and system robustness. The system needs to be robust, because users
will avoid using a brittle system, and the codec operation and
implementation must introduce as little delay as possible into the end-to-end
communication in order to support real-time interaction. The transport speed
is a large factor - imposing minimum specifications which would require a
high bit-rate would limit the practical implementation of the codec to the few
institutions able to afford high-speed communications. As video
transmission is expected to occur over narrowband networks, the coding
structures and parameters are geared towards low bit-rates.
H.263, on the other hand, is a standard proposed in 1996 to replace H.261,
and is one of the best techniques available. The differences between H.261
and H.263 will be highlighted below.
Coding Structures and Components
H.261 allows only 2 picture formats - CIF (common intermediate format) and
QCIF (Quarter-CIF), both component video. CIF frames are made up of one
luminance signal (Y) and 2 chrominance signals (CB and
CR). The chrominance signals represent colour differences (see the note on
YCBCR at the end of these notes). The Y frame size for CIF is 288 lines of 352
pixels each. CB and CR are sub-sampled to 176 pixels per line and 144
lines per frame. QCIF is exactly half the CIF resolution in each dimension. All
H.261 codecs must support QCIF, although support for CIF is optional. The frame
aspect ratio for both formats is 4:3 (the same as standard TV). The
frame rate is 30 non-interlaced frames per second.
H.263 supports 5 resolutions: CIF, QCIF, SQCIF, 4CIF, and 16CIF. SQCIF has
about half the resolution of QCIF. 4CIF and 16CIF have 4 and 16 times the
resolution of CIF.
The compressed H.261 bit-stream is composed of 4 layers: the Picture Layer,
the Group of Blocks Layer, the Macro Block Layer, and the Block Layer. The
Picture Layer contains its own header information followed by the data of a
number of the lower layers.
The main feature of H.261 is that compression is achieved by a combination of
inter-frame prediction, block transformation and quantization.
Data units the size of 8x8 pixels are used for the representation of the
Y, CB and CR
components. A macro block is the result of combining 4 blocks of the
Y matrix with one block each of the CB and
CR components. A group of blocks (GOB) is defined to
contain 33 macro blocks, and finally a CIF contains 12 GOBs (QCIF contains 3 GOBs).
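As a quick check on these numbers, the sketch below (plain Python; the 4:2:0 sampling and 8 bits per sample follow the note on YCBCR at the end of these notes) works out how many macro blocks and GOBs a CIF frame contains, and what the uncompressed data rate would be at 30 frames per second - which shows why compression is essential on 64 Kbit/s channels.

# Structure of a CIF frame under H.261 (4:2:0 sampling, 8 bits per sample).
Y_W, Y_H = 352, 288                    # luminance (Y) resolution for CIF
C_W, C_H = 176, 144                    # chrominance (CB, CR) resolution after sub-sampling

mb_cols, mb_rows = Y_W // 16, Y_H // 16        # a macro block covers 16x16 luminance pixels
macro_blocks = mb_cols * mb_rows               # 22 * 18 = 396 macro blocks per frame
gobs = macro_blocks // 33                      # 396 / 33 = 12 GOBs, as stated above
blocks_per_macro_block = 4 + 1 + 1             # 4 Y blocks + 1 CB block + 1 CR block

samples_per_frame = Y_W * Y_H + 2 * (C_W * C_H)
uncompressed_bits_per_second = samples_per_frame * 8 * 30   # roughly 36.5 Mbit/s

print(macro_blocks, gobs, blocks_per_macro_block, uncompressed_bits_per_second)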
Coding Algorithms
H.261 uses both interframe and intraframe coding. During interframe coding,
redundant information is not transmitted, because it will be available to the
decoder in previous or subsequent frames. The macro blocks of the current
frame are compared with those of the previous frame. A simple H.261
implementation compares only macro blocks at the same positions
in the previous and current images to compute a prediction error. An advanced implementation will perform
motion compensation to "track" the movement of a block of pixels from one
image to the next using a motion vector. In both cases, the result will be
a DPCM-coded (Differential Pulse Code Modulation-coded) macro block.
The motion vector (which will always be zero-length in the simple H.261
implementation) and DPCM-coded macro block (if and only if its value exceeds
a certain threshold) are subsequently processed. The motion
vector is entropy-coded. Macro blocks which require further coding are transformed using DCT and the
coefficients are quantized and variable-length coded.
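A minimal sketch of the simple (zero motion vector) case is given below, using numpy; the threshold value is an illustrative choice, since the standard leaves that decision to the encoder.

import numpy as np

THRESHOLD = 1500   # illustrative value only; H.261 does not fix this

def code_macro_block(prev, curr, y, x, size=16, threshold=THRESHOLD):
    """Compare a macro block with the co-located one in the previous frame.

    Returns None when the prediction error is below the threshold (nothing
    needs to be sent); otherwise returns the DPCM-coded macro block, which
    would then be DCT-transformed, quantized and variable-length coded.
    """
    target = curr[y:y + size, x:x + size].astype(np.int32)
    reference = prev[y:y + size, x:x + size].astype(np.int32)
    error = target - reference          # DPCM: difference from the previous frame
    if np.abs(error).sum() < threshold:
        return None                     # motion vector is (0, 0) and no update is coded
    return error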
For intraframe coding, each block of 8x8 pixels is transformed into 64
coefficients using DCT (similar to JPEG). The AC- and DC-coefficients are
quantized and entropy-coded.
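The following sketch shows the intraframe path for a single block, building the 8x8 DCT from an explicit basis matrix; the single quantization step size is a simplification of the quantizer actually defined by the standard.

import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix, so that C @ block @ C.T is the 2-D DCT."""
    j = np.arange(n)[None, :]           # sample index
    k = np.arange(n)[:, None]           # frequency index
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * k / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)
    return C

C = dct_matrix()

def code_intra_block(block, step=16):
    """Transform an 8x8 pixel block into 64 coefficients and quantize them."""
    coeffs = C @ block.astype(np.float64) @ C.T    # energy gathers in the top-left corner
    return np.round(coeffs / step).astype(int)     # most coefficients become zero

# A smooth block: only the DC and a few low-frequency AC coefficients survive.
block = np.tile(np.arange(130, 138), (8, 1))
print(code_intra_block(block))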
H.263 has four negotiable coding modes, which may be used
separately or together (except for the advanced prediction mode,
which requires the unrestricted motion vector mode):
- Syntax-based arithmetic coding mode
uses the more efficient arithmetic coder, instead of variable-length coding
- PB-frame mode
uses the frame "averaging" technique used in MPEG
- Unrestricted motion vector mode
if movement occurs at the edges of the picture, the normal motion vector cannot be used
efficiently. However, the unrestricted motion vector places the picture on
a "canvas", so that references can be made to locations on the canvas which
are not within the dimensions of the picture
- Advanced prediction mode
allows four motion vectors per macro block and uses overlapped block motion
compensation, which produces smoother predictions at block boundaries
H.261/H.263 Data Stream and Characteristics
The data stream has a hierarchical structure composed of several layers, and it
includes information for error correction. Each image has a 5-bit (8-bit in
H.263) temporal reference number. Using a command sent from the application to
the decoder, it is possible to freeze the last image displayed as a still
image. The encoder can also send commands to switch between still and
moving images.
MPEG
MPEG is actually a series of standards for video and audio encoding.
The major differences in the standards are the targeted application
areas. MPEG-1 is suitable for low-bit-rate transmission, up to a maximum
of about 1.5 Mb/s. MPEG-2 is suitable for higher bit-rate environments,
such as HDTV. MPEG-4 is suitable for interactive TV, as audio-visual
scenes can be described as audio-visual objects which have relations in
space and time. Additionally, MPEG-4 allows the integration of media
objects of different types.
We will concentrate primarily on MPEG-1 which makes the storage and
transmission of audio-visual data more efficient.
MPEG-1 is optimized to work with video resolutions of 352x240 at 30 fps
(NTSC) and 352x288 at 25 fps (PAL) with a post-compression maximum data
transfer requirement of 1.5 Mb/s, although it is not limited to
these frame resolutions and bit-rates. At the optimal frame resolutions
and bit-rates, the MPEG-1 decoded video is comparable in quality to VHS
video.
The MPEG standard mandates real-time decoding, although it does not
stipulate whether the encoder should be symmetric or asymmetric. With a
real-time encoder, MPEG-1 can be used for video-conferencing. If the encoding
is asymmetric, then the source video will have been compressed in advance,
stored, and can be transmitted and decompressed many times. This suits a
video library.
The MPEG standard has three parts:
- Part 1: Synchronization and multiplexing of video and audio
- Part 2: Video
- Part 3: Audio
Video
To achieve compression, MPEG exploits the spatial and temporal redundancy
present in the video signal. The primary requirement of MPEG is that it
achieves the best quality decoded video at the given bit-rate. In addition,
different multimedia applications may impose other requirements. For
example, there may be the requirement to access any portion of the
bit-stream in a short time, for rapid access to any decoded video frame,
especially if the storage medium supports random access. It might also be
desirable to edit encoded bit-streams directly while preserving their decodability.
Compression Algorithm Overview
Exploiting spatial redundancy
The compression technique employed by MPEG uses a combination of techniques
from the JPEG and H.261 compression standards. Video is a sequence of still
images, and as such it would be desirable to use JPEG compression, especially
to support direct access to individual frames. However, the resulting data stream
would violate the maximum bandwidth requirement of 1.5Mb/s. Bear in mind
that if the decoding of a single frame is dependent on the decompression of
all the preceding frames, then direct access to individual frames is not
possible. However, in order to support rapid access, MPEG encodes certain
frames using only dependencies within the frame itself, so that their decoding
does not depend on any other frame; these are known as I-frames. Decoding of other frames (known as P- and
B-frames) will be dependent on the prior decoding of an I-frame. As with
JPEG and H.261, I-frames are first divided into non-overlapping 8x8 pixel
blocks. Each block is DCT transformed, which results in a DCT block of 64
coefficients, where most of the energy in the original image is
concentrated into the top left-hand corner of the DCT block. A quantizer
is then applied which reduces the precision of the coefficient values,
setting most of them to zero. An entropy-coder then reduces the length of
the data stream for that block.
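To make that final step concrete, the sketch below performs the zig-zag scan and run-length pairing that a quantized block (such as the one produced in the earlier DCT sketch) goes through before entropy coding; the actual variable-length code tables are defined by the standards and are not reproduced here.

def zigzag_order(n=8):
    """Visit an n x n block in zig-zag order, starting at the DC coefficient."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_length(quantized):
    """Pair each non-zero AC coefficient with the run of zeros preceding it."""
    scan = [quantized[r][c] for r, c in zigzag_order(len(quantized))]
    dc, ac = scan[0], scan[1:]
    pairs, run = [], 0
    for value in ac:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return dc, pairs   # the (run, value) pairs are then mapped to variable-length codes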
Exploiting temporal redundancy
Temporal redundancy results from a high-degree of correlation between
adjacent frames. In both H.261 and MPEG, it is expected that differences
between successive frames will be small, and that motion can be detected by
finding where a block of pixels in one image has shifted to in the subsequent frame -
implying a temporal relationship between the blocks. A block of pixels,
called a target block, in a frame is compared with a set of blocks
of the same dimensions in the previous frame (the reference frame).
The block in the reference frame which best matches the target block is used
as the prediction for the latter. This best matching block is associated
with a motion vector that describes the displacement between it and the
target block. As in H.261, the motion vector is encoded and transmitted
along with the prediction error. The prediction error is transmitted using
the DCT-based intraframe encoding technique described above. Although the
motion vector can be zero-length in simple implementations of H.261, in MPEG
full advantage is taken of motion compensation. Predictive frames are known
as P-frames in MPEG.
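The sketch below shows the block-matching step itself: an exhaustive search over a small window of the reference frame (the ±7-pixel range is an illustrative choice, not a value mandated by MPEG), returning the motion vector of the best-matching block together with the prediction error that would then be DCT-coded.

import numpy as np

def motion_estimate(reference, frame, y0, x0, size=16, search=7):
    """Find the motion vector minimising the sum of absolute differences (SAD)."""
    target = frame[y0:y0 + size, x0:x0 + size].astype(np.int32)
    best_sad, best_mv, best_error = np.inf, (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + size > reference.shape[0] or x + size > reference.shape[1]:
                continue                          # candidate block falls outside the frame
            candidate = reference[y:y + size, x:x + size].astype(np.int32)
            sad = int(np.abs(target - candidate).sum())
            if sad < best_sad:
                best_sad, best_mv, best_error = sad, (dy, dx), target - candidate
    return best_mv, best_error    # the vector is entropy-coded; the error is DCT-coded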
Bidirectional temporal prediction
Bidirectional temporal prediction, also called motion compensated
interpolation, is a key feature of MPEG video. Some of the video frames are
encoded using two reference frames, one in the past and one in the future.
A block in these frames can be predicted by another block from the past
reference frame (forward prediction), from the future reference frame
(backward prediction), or from an average of two blocks, one from each
reference frame (interpolation). In every case, the block from the
reference frame is associated with a motion vector, so that two motion
vectors are associated with interpolation. B-frames are never themselves
used as reference frames. The advantage of bidirectional prediction is that
typically higher compression ratios are achievable than with forward
prediction alone: the same picture quality can be achieved using fewer
encoded bits. However, as encoding may be based on frames
that have not yet been seen, a delay is introduced, because frames must be
encoded out of sequence. There is additional encoding complexity because
block matching must be performed on two reference frames.
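The sketch below, which reuses the hypothetical motion_estimate function from the previous example, chooses between forward, backward and interpolated prediction for one block of a B-frame; a real encoder would also weigh the cost of transmitting one motion vector against two, which is omitted here.

import numpy as np

def predict_b_block(past, future, frame, y0, x0, size=16):
    """Choose forward, backward or interpolated prediction for a B-frame block."""
    target = frame[y0:y0 + size, x0:x0 + size].astype(np.int32)
    fwd_mv, fwd_error = motion_estimate(past, frame, y0, x0, size)
    bwd_mv, bwd_error = motion_estimate(future, frame, y0, x0, size)
    fwd_pred = target - fwd_error               # reconstruct the two reference blocks
    bwd_pred = target - bwd_error
    interp_pred = (fwd_pred + bwd_pred) // 2    # average of past and future blocks
    candidates = {
        "forward": (fwd_error, (fwd_mv,)),
        "backward": (bwd_error, (bwd_mv,)),
        "interpolated": (target - interp_pred, (fwd_mv, bwd_mv)),   # two motion vectors
    }
    mode = min(candidates, key=lambda m: np.abs(candidates[m][0]).sum())
    error, motion_vectors = candidates[mode]
    return mode, motion_vectors, error          # the error is DCT-coded as before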
The MPEG bit-stream structure
MPEG defines the structure of the MPEG data stream and how a decoder
should decode it. It does not specify how the encoder should produce the
data stream in the first place. This leaves encoders free to be
tailored to the application areas in which they will be used.
To support this flexibility, the data structure imposed on the
data stream is constructed in several layers, each performing a different
logical function. At the top of the hierarchy is the video sequence layer,
which contains basic parameters, such as the size of the video frames, the
frame rate, the bit rate, and other global parameters.
Inside the video sequence layer is the Group of Pictures (GOP) layer, which
provides support for random access, fast search and editing. A video
sequence is divided into a series of GOPs, where each GOP contains an
I-frame followed by an arrangement of P-frames and B-frames. Random access
and fast search are enabled by the I-frames (and D-frames, which are
DC-coefficients only). MPEG allows GOPs to be of
arbitrary structure and length. The GOP Layer is the basic unit for editing
an MPEG video bit stream.
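Because a B-frame depends on a reference frame that is displayed after it, the coding (transmission) order of a GOP differs from its display order. The sketch below reorders one hypothetical GOP pattern; the pattern itself is only an example, since MPEG allows GOPs of arbitrary structure and length.

def coding_order(display_order):
    """Reorder a GOP so that every B-frame follows both of its reference frames."""
    out, pending_b = [], []
    for position, frame_type in enumerate(display_order):
        if frame_type == "B":
            pending_b.append((position, frame_type))   # held back until the next reference
        else:                                          # an I- or P-frame (a reference frame)
            out.append((position, frame_type))
            out.extend(pending_b)
            pending_b = []
    return out

# Display order I B B P B B P becomes coding order I P B B P B B.
print(coding_order(list("IBBPBBP")))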
The compressed bits produced by encoding a frame in a GOP constitute
the picture layer. The picture layer first contains information on the type
of frame that is present (I, P, or B) and the position of the frame in the
display order. The bits corresponding to the motion vector are packaged in
the macro block layer, and those corresponding to the DCT unit are packaged
in the block layer. In between the macro block layer and the picture layer
is the slice layer, which is used mainly for resynchronization during the
decoding of a frame in the event of bit errors. Prediction registers used
in the differential encoding of the motion vector are reset at the
beginning of a slice.
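A small sketch of the differential encoding just mentioned: each motion vector is sent as a difference from the previous one, and the prediction register is reset to zero at the start of every slice, so an error cannot propagate beyond the slice in which it occurs (the grouping of vectors into slices here is purely illustrative).

def encode_motion_vectors(slices):
    """Differentially encode motion vectors, resetting the predictor per slice.

    `slices` is a list of slices, each a list of (dy, dx) motion vectors.
    """
    coded = []
    for vectors in slices:
        prev_dy, prev_dx = 0, 0              # prediction register reset at slice start
        for dy, dx in vectors:
            coded.append((dy - prev_dy, dx - prev_dx))
            prev_dy, prev_dx = dy, dx
    return coded

# Two slices: the second starts again from (0, 0), so a bit error in the first
# slice cannot corrupt the vectors decoded from the second.
print(encode_motion_vectors([[(1, 2), (1, 3)], [(4, 0), (5, 1)]]))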
MPEG Audio encoding
MPEG audio compression uses sub-band coding to achieve high compression
ratios whilst maintaining audio quality.
Sub-band coding is a flexible method of encoding audio samples from a
variety of different sources. The field of psychoacoustics has determined
that a loud audio signal masks quieter signals at nearby frequencies, as well
as quieter signals occurring just before or after it in time, so that the
human ear cannot detect them. Sub-band coding techniques identify audio
components that would be undetectable by the human ear anyway, and discard
them, thus reducing storage space requirements.
First, a time-frequency mapping (e.g., FFT) decomposes the input signal into
subbands. A psychoacoustic
model looks at these subbands as well as the original signal, and
determines masking thresholds using psychoacoustic information. Using
these masking thresholds, each of the subband samples is quantized and
encoded so as to keep the quantization noise below the masking
threshold. The final step is to assemble all these quantized samples
into frames, so that the decoder can parse and decode them without losing synchronization.
Decoding is easier, since there is no need for a psychoacoustic model.
The frames are unpacked, subband samples are decoded, and a
frequency-time mapping turns them back into a single output audio
signal.
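The sketch below reduces the quantization step to its core idea: given per-subband signal levels and masking thresholds (both hypothetical numbers here, not the output of a real psychoacoustic model), bits are allocated so that the quantization noise stays below the mask, and fully masked subbands cost nothing at all.

import math

def allocate_bits(signal_db, mask_db, max_bits=15):
    """Give each subband just enough bits to keep quantization noise below its mask.

    Each extra bit of quantizer resolution lowers the noise floor by roughly
    6 dB, so the allocation is driven by the signal-to-mask ratio (SMR).
    """
    bits = []
    for signal, mask in zip(signal_db, mask_db):
        smr = signal - mask              # how far the signal sits above its masking threshold
        if smr <= 0:
            bits.append(0)               # fully masked: the subband can be discarded
        else:
            bits.append(min(max_bits, math.ceil(smr / 6.02)))
    return bits

# Hypothetical levels for four subbands: the third is masked and is thrown away.
print(allocate_bits(signal_db=[60, 48, 20, 35], mask_db=[30, 40, 25, 34]))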
MPEG-1 Audio is really a group of three different sub-band coding (SBC)
schemes, called layers. Each layer is a self-contained SBC coder with its own
time-frequency mapping, psychoacoustic model, and quantizer.
Layer 1 is the simplest, but gives the poorest
compression. Layer 3 is the most complicated and difficult to compute,
but gives the best compression. The idea is that an application of
MPEG-1 Audio can use whichever layer gives the best tradeoff between
computational burden and compression performance. Audio can be encoded
in any one layer. A standard MPEG decoder for any layer is also able to
decode lower (simpler) layers of encoded audio. MPEG-1 Audio is intended
to take a PCM audio signal sampled at a rate of 32, 44.1 or 48 kHz, and
encode it at a bit rate of 32 to 192 Kbits per audio channel (depending
on layer).
See Introduction
to Sub-band Coding in MPEG-1 for more information.
Synchronizing Audio and Video in MPEG-1
The audio and video data streams are interleaved into a single system stream.
Time stamps derived from a common system clock are embedded in the stream so
that the decoder can keep the audio and video in step.
Note on YCBCR
The RGB representation of pixels is expensive in terms of storage space and
transmission bandwidth. YCBCR is a transformation of the RGB signals into a
cheaper representation format which takes advantage of the fact that the
human eye is more sensitive to luminance (related to brightness)
information than it is to chrominance (colour) information. Essentially, Y
is the light intensity, CB is the blue colour
difference signal, and CR is the red colour difference
signal (in both cases, the difference is taken from the luminance signal). The information for
CB and CR can either be
complete (in which case it is as expensive as RGB), or else it can be
reduced to half or one quarter of the original chrominance information.
In order to know what chrominance information can be thrown away, it is
necessary to look at groups of pixels. Usually, groups of four 8x8 blocks of
pixels are processed in order to obtain the closest mapping from the RGB
to the YCBCR colour
space. If no chrominance information is thrown away, there will be a 4:4:4
video format (4 Y blocks, 4 CB blocks and 4
CR blocks). Otherwise, the formats will be 4:2:2
or 4:2:0 depending on whether the chrominance information is reduced by one-half
or three-quarters. H.261 and MPEG, as well as most other video applications
that use this colour space, use 4:2:0 video. This means that whereas the
original RGB information would have required a 12 block representation
space, 4:2:0 video requires only 6 - achieving an immediate 50% reduction
in space and bandwidth requirements.
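As an illustration of the space saving, the sketch below converts an RGB frame to YCBCR using the ITU-R BT.601 luminance weights commonly used with these formats, and then sub-samples the chrominance by two in each direction to give 4:2:0 data; the offsets and rounding are simplified.

import numpy as np

def rgb_to_ycbcr_420(rgb):
    """Convert an RGB image (H x W x 3, values 0..255) into 4:2:0 Y, CB and CR planes."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b       # luminance
    cb = 0.564 * (b - y) + 128                  # blue colour difference signal
    cr = 0.713 * (r - y) + 128                  # red colour difference signal
    # 4:2:0: average each 2x2 block of chrominance samples, keeping one quarter of them.
    cb = cb.reshape(cb.shape[0] // 2, 2, cb.shape[1] // 2, 2).mean(axis=(1, 3))
    cr = cr.reshape(cr.shape[0] // 2, 2, cr.shape[1] // 2, 2).mean(axis=(1, 3))
    return y, cb, cr

# For a CIF-sized frame the Y plane stays 288x352 while CB and CR shrink to 144x176,
# halving the total number of samples compared with RGB.
y, cb, cr = rgb_to_ycbcr_420(np.random.rand(288, 352, 3) * 255)
print(y.shape, cb.shape, cr.shape)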