Data compression has come of age in the last 20 years. Both the quantity and the quality of the literature in this field provide ample proof of this. There are many known methods for data compression. They are based on different ideas, are suitable for different types of data, and produce different results, but they all share the same principle: they compress data by removing redundancy from the source file. This report discusses the different types of data compression, the advantages of data compression, and the procedures of data compression.
2.0 DATA COMPRESSION
Data compression is important today because of the sheer volume of data transferred within networks; reducing the size of that data makes its transfer considerably easier. This section explains and compares lossy and lossless compression techniques.
2.1 LOSSLESS DATA COMPRESSION
Lossless data compression makes use of data compression algorithms that allow the exact original data to be reconstructed from the compressed data. This contrasts with lossy data compression, which does not allow the exact original data to be reconstructed. Lossless data compression is used in many applications.
Lossless compression is used when it is vital that the original and the decompressed data be identical, or when no assumption can be made about whether a deviation from the original would be acceptable.
Most lossless compression programs implement two kinds of algorithms: one that generates a statistical model for the input data, and another that maps the input data to bit strings using this model in such a way that “probable” (e.g. frequently encountered) data produces shorter output than “improbable” data. Often, only the former algorithm is named, while the latter is implied (through common use, standardization, etc.) or left unspecified.
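The two-stage structure can be sketched as follows. This is a minimal illustration, not a production coder: the model is a simple frequency ranking, and the coder assigns a deliberately naive unary-style prefix code (real programs would use Huffman or arithmetic coding here); the function names are chosen for this example.

```python
from collections import Counter

def build_model(data: str) -> list[str]:
    """Stage 1: statistical model -- symbols ranked by frequency."""
    return [sym for sym, _ in Counter(data).most_common()]

def encode(data: str, ranked: list[str]) -> str:
    """Stage 2: map symbols to prefix-free bit strings, giving the
    more probable symbols shorter codes (a unary-style code, chosen
    only for simplicity)."""
    codes = {sym: "1" * i + "0" for i, sym in enumerate(ranked)}
    return "".join(codes[sym] for sym in data)

ranked = build_model("aaabbc")   # 'a' is most frequent, 'c' least
bits = encode("aaabbc", ranked)  # 'a'->'0', 'b'->'10', 'c'->'110'
```

Note that only the first stage depends on the data’s statistics; the second stage is a fixed mapping driven by the model, which is why the modelling algorithm is usually the one named.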
2.2 LOSSY DATA COMPRESSION
A lossy data compression technique is one where compressing data and then decompressing it retrieves data that may be different from the original, but is “close enough” to be useful in some way.
There are two basic lossy compression schemes:
First is lossy transform codecs, where samples of picture or sound are taken, chopped into small segments, transformed into a new basis space, and quantized. The resulting quantized values are then entropy coded.
Second is lossy predictive codecs, where previous and/or subsequent decoded data is used to predict the current sound sample or image frame.
In some systems the two methods are used, with transform codecs being used to compress the error signals generated by the predictive stage.
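The quantization stage described above is where the loss occurs. The sketch below, a deliberately simplified stand-in for a real transform codec (no basis transform is applied, and the step size is arbitrary), shows how quantizing samples to a coarser grid discards information while keeping the reconstruction close to the original:

```python
def quantize(samples, step):
    """Lossy stage: snap each sample to the nearest multiple of
    `step`. The resulting small integers would then be entropy coded."""
    return [round(s / step) for s in samples]

def dequantize(indices, step):
    """Reconstruct approximate samples; the original values are gone."""
    return [i * step for i in indices]

samples = [0.93, 0.21, 0.47, 0.80]
indices = quantize(samples, step=0.25)     # [4, 1, 2, 3]
restored = dequantize(indices, step=0.25)  # [1.0, 0.25, 0.5, 0.75]
```

Each restored value differs from the original by at most half the step size, which is the “close enough” trade-off lossy schemes accept in exchange for having only a few small integers to encode.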
The advantage of lossy methods over lossless methods is that in some cases a lossy method can produce a much smaller compressed file than any known lossless method, while still meeting the requirements of the application.
Lossless compression schemes are reversible, so the original data can be reconstructed exactly, while lossy schemes accept some loss of data in order to achieve higher compression.
In practice, lossy data compression also reaches a point beyond which repeated compression no longer helps, although an extremely lossy algorithm, one which, for example, always removes the last byte of a file, can keep “compressing” a file until it is empty.
2.3 LOSSLESS vs. LOSSY DATA COMPRESSION
Lossless and lossy data compression are two methods used to compress data. Each technique has its individual uses. A comparison between the two techniques can be summarised as follows [4-5]:
The lossless technique keeps the source unchanged during compression, while in the lossy technique some change to the original source is expected, although the result remains very close to the original.
The lossless technique is a reversible process, which means that the original data can be reconstructed. The lossy technique, however, is irreversible, owing to the loss of some data during compression.
The lossless technique produces a larger compressed file than the lossy technique.
Lossy technique is mostly used for images and sound.
3.0 DATA COMPRESSION TECHNIQUES
Data compression means storing data in a way that requires less space than the original representation; generally, it saves space by reducing the size of the data. This section explains the Huffman coding and Lempel-Ziv-Welch (LZW) compression techniques.
3.1 HUFFMAN CODING
Huffman coding is an entropy encoding method used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file), where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper “A Method for the Construction of Minimum-Redundancy Codes”.
Huffman coding implements a special method for choosing the representation for each symbol, resulting in a prefix code (sometimes called a “prefix-free code”, meaning the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol) that expresses the most common source symbols using shorter strings of bits than are used for less common source symbols.
The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols, n. A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and optionally, a link to a parent node which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain symbol weight, links to two child nodes and the optional link to a parent node.
The process starts with the leaf nodes containing the probabilities of the symbols they represent. A new node whose children are the two nodes with the smallest probabilities is then created, such that the new node’s probability equals the sum of its children’s probabilities. With the two nodes merged into one (and thus no longer considered separately) and the new node now under consideration, the procedure is repeated until only one node remains: the root of the Huffman tree.
The simplest construction algorithm uses a priority queue in which the node with the lowest probability is given the highest priority:
1. Create a leaf node for each symbol and add it to the priority queue.
2. While there is more than one node in the queue:
Remove the two nodes of highest priority (lowest probability) from the queue.
Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes’ probabilities.
Add the new node to the queue.
3. The remaining node is the root node and the tree is complete.
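The three steps above can be sketched directly with Python’s `heapq` priority queue. This is a compact illustration rather than a full compressor (it assumes the input contains at least two distinct symbols, and it tracks each subtree as a list of symbol/code pairs instead of explicit tree nodes):

```python
import heapq
from collections import Counter

def huffman_codes(data: str) -> dict[str, str]:
    """Build a Huffman code table by repeatedly merging the two
    lowest-weight nodes in a priority queue."""
    # Step 1: one leaf per symbol; the counter `i` breaks weight ties.
    heap = [(w, i, [(sym, "")])
            for i, (sym, w) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    # Step 2: merge until one node remains.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Prepend one bit to every code in each merged subtree.
        merged = ([(s, "0" + c) for s, c in left] +
                  [(s, "1" + c) for s, c in right])
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    # Step 3: the sole remaining node is the root; read off the codes.
    return dict(heap[0][2])

codes = huffman_codes("aaaabbc")  # 'a' gets the shortest code
```

Because the two least-probable nodes are always merged first, the most frequent symbol ends up nearest the root and therefore receives the shortest bit string, exactly the property described above.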
3.2 LEMPEL-ZIV-WELCH (LZW) COMPRESSION
Lempel-Ziv-Welch (LZW) is a data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as a development of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is designed to be fast to implement but is not usually optimal because it performs only limited analysis of the data.
LZW can also be called a substitutional or dictionary-based encoding algorithm. The algorithm builds a data dictionary (also called a translation table or string table) of strings occurring in the uncompressed data stream. Patterns of data (substrings) are identified in the data stream and matched to entries in the dictionary. If a substring is not present in the dictionary, a code phrase is created based on the data content of the substring and stored in the dictionary, and the phrase is written to the compressed output stream.
When a reoccurrence of a substring is found in the data, the phrase of the substring already stored in the dictionary is written to the output. Because the phrase value has a physical size that is smaller than the substring it represents, data compression is achieved.
Decoding LZW data is the reverse of encoding. The decompressor reads the code from the stream and adds the code to the data dictionary if it is not already there. The code is then translated into the string it represents and is written to the uncompressed output stream.
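Both directions can be sketched as follows. This is a textbook-style illustration over a 256-symbol byte alphabet with an unbounded, fixed-width-agnostic dictionary, not the exact variants used by GIF or TIFF (which add clear codes, variable-width codes, and dictionary-size limits):

```python
def lzw_encode(data: str) -> list[int]:
    """Grow the longest known string one symbol at a time; when the
    extension is unknown, emit the code for the known prefix and add
    the extended string to the dictionary."""
    dictionary = {chr(i): i for i in range(256)}
    w, out = "", []
    for ch in data:
        if w + ch in dictionary:
            w += ch
        else:
            out.append(dictionary[w])
            dictionary[w + ch] = len(dictionary)
            w = ch
    if w:
        out.append(dictionary[w])
    return out

def lzw_decode(codes: list[int]) -> str:
    """Rebuild the same dictionary while decoding, so it never needs
    to be stored alongside the compressed data."""
    dictionary = {i: chr(i) for i in range(256)}
    w = dictionary[codes[0]]
    out = [w]
    for code in codes[1:]:
        # Special case: the code can refer to the entry currently
        # being built (the cScSc pattern), which is w + w[0].
        entry = dictionary[code] if code in dictionary else w + w[0]
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[0]
        w = entry
    return "".join(out)
```

Running `lzw_decode(lzw_encode(s))` returns `s` for any input, and repeated substrings in the input collapse to single codes, which is where the compression comes from.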
LZW goes beyond most dictionary-based compressors because it is not necessary to keep the dictionary to decode the LZW data stream. This can save quite a bit of space when storing the LZW-encoded data.
TIFF, among other file formats, applies the same method for graphic files. In TIFF, the pixel data is packed into bytes before being presented to LZW, so an LZW source byte might be a pixel value, part of a pixel value, or several pixel values, depending on the image’s bit depth and number of colour channels.
GIF requires each LZW input symbol to be a pixel value. Because GIF allows 1- to 8-bit deep images, there are between 2 and 256 LZW input symbols in GIF, and the LZW dictionary is initialized accordingly. It is not important how the pixels might have been packed into storage; LZW will deal with them as a sequence of symbols .
The TIFF approach does not work very well for odd-size pixels, because packing the pixels into bytes creates byte sequences that do not match the original pixel sequences, and any patterns in the pixels are obscured. If pixel boundaries and byte boundaries agree (e.g., two 4-bit pixels per byte, or one 16-bit pixel every two bytes), then TIFF’s method works well.
The GIF approach works better for odd-size bit depths, but it is difficult to extend it to more than eight bits per pixel because the LZW dictionary must become very large to achieve useful compression on large input alphabets.
If variable-width codes are implemented, the encoder and decoder must be careful to change the width at the same points in the encoded data, or they will disagree about where the boundaries between individual codes fall in the stream.
In conclusion, because one cannot hope to compress everything, all compression algorithms must assume that there is some bias in the input messages, so that some inputs are more likely than others; that is, there will always be some unbalanced probability distribution over the possible messages. Most compression algorithms base this “bias” on the structure of the messages, e.g. an assumption that repeated characters are more likely than random characters, or that large white patches occur in “typical” images. Compression is therefore all about probability.