Idea of compression?
With plenty of disk space available, compression doesn't seem to affect most of us.
That is the wrong impression: compression is used behind the scenes in computer systems quite often.
Even your voice is compressed when you speak on the phone: the telephone company can achieve a vastly superior utilization of its resources if it compresses voice data before transporting it.
Lossless Compression: The ultimate free lunch
There are two very different types of compression: lossless and lossy. A lossless compression algorithm can't produce dramatic savings on every file, but a good algorithm will produce substantial savings on certain common types of files.
Humans do this all the time without even thinking about it. Take a weekly calendar: listing every hour for 5 days would take 40 entries, but you might just say something like "Monday and Tuesday are full, and I am booked from 1pm to 3pm on Thursday and Friday, but otherwise available."
The basic idea is to find parts of the data that are identical to each other and use some kind of trick to describe those parts more efficiently.
AAAAAAAAAAAAAAAAAAAAABCBCBCBCBCBCBCBCBCBCAAAAAADEFDEFDEF
This can be written as 21A,10BC,6A,3DEF. In this case 56 characters are compressed to only 16, less than one-third of the original. This is called run-length encoding: it encodes a "run" of repetitions with the "length" of that run.
Run Length Encoding
RLE is only useful for compressing specific types of data, and it is used mostly in combination with other algorithms like Huffman coding. The problem with RLE is that the repetitions have to be adjacent; it is no help when other data sits between the repeated parts, e.g. ABXABYAB.
Fax machines can take advantage of RLE, since faxes are black-and-white documents that get converted to a large number of dots, each dot being either black or white.
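To make the idea concrete, here is a minimal sketch of a single-character run-length encoder in Python (the function names rle_encode and rle_decode are made up for illustration). It collapses runs of one repeated character, like a fax row of black and white dots; spotting multi-character repeats such as the 10BC above takes a cleverer matcher, which is where the next trick comes in.

def rle_encode(text):
    """Collapse each run of identical characters into "<count><char>"."""
    runs = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        runs.append(f"{j - i}{text[i]}")
        i = j
    return ",".join(runs)

def rle_decode(encoded):
    """Expand runs like "12W,3B" back into the original string."""
    return "".join(int(run[:-1]) * run[-1] for run in encoded.split(","))

# One row of a fax-like image: W = white dot, B = black dot.
row = "W" * 12 + "B" * 3 + "W" * 20 + "B" * 2 + "W" * 9
packed = rle_encode(row)
print(packed)                       # 12W,3B,20W,2B,9W -- 46 dots in 16 characters
assert rle_decode(packed) == row    # lossless: the exact row comes back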
Same-as-Earlier Trick
DJKSEMSIT-SM-DJKSEMSIT-ALXDF-O-DJKSEMSIT-ALXDF-DJKSEMSIT-EW-ALXDF
The dashes are just put in so that we can read it; in the actual sequence there are no dashes.
56 characters.
Think of instructions to go back and copy: b for back, c for copy, so b10c6 means "go back 10 characters and copy 6 characters from that point."
DJKSEMSIT-SM-b11c9-ALXDF-O-b15c14-b14c9-EW-b16c5
40 characters
There is one more interesting twist. Take FG-FG-FG-FG-FG-FG-FG-FG, 8 repetitions. It can be written as FG-FG-FG-FG-b8c8, or more aggressively as FG-b2c14: the copy is allowed to run past the material that existed when it started, because each copied character has already been written down by the time it is needed again.
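Here is a sketch of how a decompressor might expand these back-and-copy instructions (assuming the b<back>c<count> notation above; the function name expand is made up). Copying one character at a time is exactly what makes the overlapping FG-b2c14 case work.

import re

def expand(compressed):
    """Expand literal characters plus b<back>c<count> copy instructions."""
    out = []
    pos = 0
    for match in re.finditer(r"b(\d+)c(\d+)", compressed):
        out.extend(compressed[pos:match.start()])        # pass literals through
        back, count = int(match.group(1)), int(match.group(2))
        start = len(out) - back
        for k in range(count):
            out.append(out[start + k])                    # char by char, so the copy may
        pos = match.end()                                 # overlap what it is producing
    out.extend(compressed[pos:])
    return "".join(out)

assert expand("FGb2c14") == "FG" * 8    # back 2, copy 14: the overlapping case
assert expand("DJKSEMSITSMb11c9ALXDFOb15c14b14c9EWb16c5") == (
    "DJKSEMSIT" + "SM" + "DJKSEMSIT" + "ALXDF" + "O" +
    "DJKSEMSIT" + "ALXDF" + "DJKSEMSIT" + "EW" + "ALXDF")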
Shorter-Symbol Trick
Computers store everything as numbers and interpret each number as a letter according to some fixed table. Say "a" is represented by 27, "b" by 28, "c" by 29; then abc is stored as 272829. Capital A is 01 and Z is 26, lowercase "a" is 27 and "z" is 52, and less common symbols, such as accented letters and punctuation, take the codes from 80 to 99 (so "é", for example, is 82). Then "Meet your fiancé there." is stored as 1331314600514147440032352740298200463431443166.
So this message is 46 digits long.
This trick is something humans do all the time without even thinking about it. USA stands for United States of America: we use a 3-letter code in place of the full 24-letter phrase. How about abbreviating "The sky is blue in color"? The key difference is that one of these phrases is used much more often than the other, and we save a lot more by abbreviating a frequently used phrase than one that is rarely used.
The letters e and t are very commonly used, so let's cut their codes down to a single digit each: e becomes 8 and t becomes 9.
With those shorter codes, "Meet your fiancé there." becomes
1388900514147440032352740298200934844866
There is a fatal flaw: the computer doesn't store spaces between the individual codes. Look at the first five digits, 13889; they can be interpreted as 13-8-8-9, or 13-88-9, or 13-8-89.
There is no way to tell whether a digit starts a one-digit or a two-digit code. We have to make a sacrifice: some of our codes will actually get longer. The ambiguous two-digit codes that start with 8 or 9 become three-digit codes starting with 7. Now anything starting with a 7 is a three-digit code, anything starting with an 8 or 9 is a single-digit code, and anything starting with 0, 1, 2, 3, 4, 5, or 6 is a two-digit code, the same as before.
With this scheme, "Meet your fiancé there." becomes 13889005141474400323527402978200934844866
That brings us from 46 digits down to only 41.
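A small sketch of a decoder for this digit scheme may help (the code table below is a made-up fragment, just big enough for the example sentence): the leading digit alone tells the decoder how many digits the current code occupies, which is what removes the ambiguity.

# Partial code table for illustration: space is 00, capitals run from 01,
# most lowercase letters keep their old two-digit codes, "e" and "t" are
# shortened to 8 and 9, and the old 8x/9x codes move to three digits
# starting with 7 (so é, formerly 82, becomes 782).
CODES = {
    "00": " ", "13": "M", "27": "a", "29": "c", "32": "f", "34": "h",
    "35": "i", "40": "n", "41": "o", "44": "r", "47": "u", "51": "y",
    "66": ".", "8": "e", "9": "t", "782": "é",
}

def decode(digits):
    """Read left to right; the first digit of each code fixes its length."""
    out, i = [], 0
    while i < len(digits):
        if digits[i] in "89":      # single-digit codes for the most common letters
            width = 1
        elif digits[i] == "7":     # three-digit codes for the rare symbols
            width = 3
        else:                      # everything else is still a two-digit code
            width = 2
        out.append(CODES[digits[i:i + width]])
        i += width
    return "".join(out)

print(decode("13889005141474400323527402978200934844866"))   # Meet your fiancé there.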
ZIP
Step 1: apply the same-as-earlier trick, replacing repeats with back-and-copy instructions.
Step 2: find the most common symbols in the transformed data and construct a table of shorter codes for them.
Step 3: transform the data again by translating it into the numeric codes defined in step 2.
The table is stored together with the transformed data, and the result is a ZIP file. In a real ZIP file the original file is broken up into chunks, and each chunk can have its own numeric code table.
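To get a feel for that pipeline without writing it from scratch, Python's standard zlib module implements DEFLATE, the same combination of the same-as-earlier trick and Huffman's shorter-symbol codes that ZIP files typically use (the sample text here is just a stand-in):

import zlib

# Repetitive text gives the same-as-earlier trick plenty of repeats to find.
line = "Monday and Tuesday are full; booked from 1pm to 3pm on Thursday and Friday. "
text = (line * 20).encode("utf-8")

packed = zlib.compress(text, 9)          # DEFLATE: LZ77-style matching + Huffman codes
print(len(text), "bytes ->", len(packed), "bytes")
assert zlib.decompress(packed) == text   # lossless: exactly the original comes back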
Lossy Compression
For images or audio, as long as the result looks the same or sounds the same, it doesn't really matter whether the stored file is exactly the same as, say, the file that stores that song on a compact disc. Sometimes lossy compression is used in a much more extreme way: compression is applied aggressively to make the file size of a video or image very small, resulting in low-quality video and images.
Leave-it-out Trick
This is a simple and useful trick. First understand how a black-and-white picture is stored: as a grid of greyscale pixels, with the count of pixels set by the resolution. For a 320 x 240 picture the number of pixels is rather large: 320 * 240 = 76,800. The extremely simple technique is to ignore, or leave out, every second row and every second column of pixels. This results in a 160 x 120 picture whose size is only one-quarter of the original. Reduce it again and the size is only about 6% of the original.
But when the picture is decompressed back to its original size, the computer has to guess the colors of the missing pixels. Most of the visual features may be retained, but there is a definite loss of quality and detail, especially in complex areas.
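As a sketch of the leave-it-out trick (using nested lists of greyscale values as a stand-in for a real image format), the compressor simply drops every second row and column, and the decompressor guesses them back, here by duplicating each kept pixel:

def shrink(pixels):
    """Keep only every second row and every second column."""
    return [row[::2] for row in pixels[::2]]

def enlarge(pixels):
    """Guess the missing pixels back by repeating each kept pixel as a 2x2 block."""
    out = []
    for row in pixels:
        doubled = [p for p in row for _ in (0, 1)]   # duplicate each column
        out.append(doubled)
        out.append(list(doubled))                    # duplicate each row
    return out

# A tiny 4x4 greyscale "image": 0 is black, 255 is white.
image = [
    [  0,  50, 100, 150],
    [ 10,  60, 110, 160],
    [ 20,  70, 120, 170],
    [ 30,  80, 130, 180],
]
small = shrink(image)    # 2x2: one-quarter of the original pixels
big = enlarge(small)     # back to 4x4, but the fine detail is gone for good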
Real systems do indeed leave out information to achieve lossy compression, but they are much more careful about what they leave out. The full JPEG strategy is complicated, but its basic flavor is fairly straightforward. JPEG divides the whole image into small squares of 8 x 8 pixels, and each square is compressed separately. If a square happens to be all one color, the entire square can be represented by a single number. If the square is mostly the same color, with a few very slight differences, the computer can decide to represent it by a single number anyway, at the cost of a small error. JPEG works well with pictures; MP3 and AAC generally use the same high-level approach as JPEG, but for audio.
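A very rough sketch of that basic flavor (not the real JPEG algorithm, which transforms each 8x8 block with a discrete cosine transform before discarding detail): split the image into 8x8 blocks and, whenever a block is nearly uniform, keep only its average value.

def compress_block(block, tolerance=10):
    """Represent a nearly uniform block by its average value, else keep it whole."""
    flat = [p for row in block for p in row]
    avg = sum(flat) // len(flat)
    if all(abs(p - avg) <= tolerance for p in flat):
        return avg        # one number instead of 64 pixels, with only a small error
    return block          # too much detail in this block: keep every pixel

def compress_image(pixels, size=8):
    """Split the image into size x size squares and compress each one separately."""
    blocks = []
    for top in range(0, len(pixels), size):
        for left in range(0, len(pixels[0]), size):
            block = [row[left:left + size] for row in pixels[top:top + size]]
            blocks.append(compress_block(block))
    return blocks

flat = [[200] * 8 for _ in range(8)]                              # a featureless block
busy = [[(r * 37 + c * 11) % 256 for c in range(8)] for r in range(8)]
image = [f + b for f, b in zip(flat, busy)]                       # 8 x 16 test image
print(compress_image(image))   # the flat block collapses to 200; the busy block stays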
Origin of Compression Algorithms
The same-as-earlier trick is the main compression method used in ZIP files and is known as the LZ77 algorithm. It was invented by two Israeli computer scientists, Abraham Lempel and Jacob Ziv, and published in 1977.
Claude Shannon, the Bell Labs scientist, founded the field of information theory with his 1948 paper. His ideas figure importantly not only in error-correcting codes but also in the emergence of compression algorithms.
Error-correcting codes and compression algorithms are two sides of the same coin. Error-correcting codes can be viewed as a principled way of adding redundancy to a message or file.
Compression algorithms do the opposite: they remove redundancy from a message or file.
Shannon, in his seminal 1948 paper, included a description of one of the earliest compression techniques, Shannon-Fano coding, which is a particular way of implementing the shorter-symbol trick. It was a student of Robert Fano who solved the problem of achieving optimal compression for each individual symbol. The student was David Huffman, and his technique, now known as Huffman coding, is another example of the shorter-symbol trick.
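As a sketch of the idea behind Huffman's solution (not the construction as presented in the book; the helper name huffman_codes is made up), the standard greedy approach repeatedly merges the two least frequent groups of symbols, so the most frequent symbols end up with the shortest bit strings:

import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-free code in which frequent symbols get shorter bit strings."""
    # Each heap entry is (frequency, tie-breaker, {symbol: code_so_far}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, low = heapq.heappop(heap)             # the two least frequent groups...
        f2, i, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))   # ...merge into one group
    return heap[0][2]

sentence = "Meet your fiancé there."
codes = huffman_codes(sentence)
bits = "".join(codes[ch] for ch in sentence)
print(len(bits), "bits, versus", 8 * len(sentence), "bits at 8 bits per symbol")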