C++ program to compress text and images, inspired by lack of storage on my phone
Technologies Used: C++, Data structures, Encoding, compression, decompression
I kept on running out of storage on my phone and became curious about how compression works
I knew that compression was a method of encoding from larger to smaller, but was not sure how this was done. When trying create a compression mapping myself,
I realized that my approach would be ambiguous to decompress. Would the bits 1010101101 be decoded into
10 10101 101 or 1010 10 110 1 ? These would result in different
decompressed results.
I then found Huffman encoding. Huffman encoding removes ambiguity with prefix codes where the code that one character is mapped to will not be the prefix
of the code another character is mapped to. If "a" is 11 , then there will not be any character whose code start with
11 .
I built the prefix code tree by determining the frequency of each character (for text) and pixel value (for images), and creating a Huffman tree from it. This was done using a min-priority queue that created a specific tree where the leaves were the characters and the branches represented the encodings. Traveling left was a '0', traveling right was a '1'.
Image of prefix-code tree for "test".
Following this tree, the letters of "test" will be encoded as follows:
Because the tree is generated from a min-heap (by letter frequency), it takes into account more commonly used letters and those appear closer to the root, which is why 't' has a shorter code. This allows for frequently used letters to take up less space (1-bit in this case).
The letters are encoded, but their meaning still needed to be communicated with the program performing the decoding. This could have been Development
by encoding the character-to-coding table, but it could also be done by directly encoding the tree. This is especially easy because each node of a Huffman
tree will have a valence of 2, allowing the tree to be easily decoded recursively. This also prevents the need for a delimeter between header and main body
within the compressed file.
From the "test" encoding example, the encoded header appears as follows:
01t01e1s
It can be interpreted as follows: The path to the leaf node -> leaf node specified by a 1 -> character
(The characters will be encoded into binary)
After I created the compressed file, I also wanted to be able to decompress it. To do so, I had to recreate the header from the compressed file, create the same Huffman tree from that header, and then use that tree to decode the compressed body.
letter | code ' | 11110011100 q | 11110011101 j | 1111001111 x | 111100110 z | 11110010 w | 0011110 k | 0011111 v | 1111000 f | 001110 y | 100010 b | 100011 h | 111101 p | 00100 m | 00101 g | 00110 u | 10000 d | 11000 c | 11001 l | 11111 o | 0110 t | 0111 n | 1001 r | 1010 a | 1011 s | 1101 i | 1110 e | 000 \n| 010
Encodings of the letters within the dictionary
0001e001p1m01g01f01w1k01\n01o1t00001u01y1b1n01r1a0001d1c1s01i0001v01z01x001'1q1j1h1l
Compressed Header (characters are in ascii for demonstration)
--------------------- Compression Stats --------------------- Uncompressed Bits: 6355472 Compressed Bits: 3388691 Compression %: 46.6807
Compressed Header (characters are in ascii for demonstration)
Something that was important for me to understand was that any encodings larger than 8 bits were less efficient than ASCII. But even though there are a few of those in this example, the most frequently used characters are less than 8 bits, allowing for a decrease in space consumed.
To apply the same algorithm to the image, I would be compressing the pixel intensity values which - just like characters - are also 0-255. This meant I just needed to add code to read in the image pixels, and send those pixels into my compression program.
Image I am compressing (379x248)
1 126 1 22 1 106 1 36 2 106 1 85 3 75 1 187 1 106 1 215 2 35 1 106 1 59 3 36 1 36 1 187 1 22 1 75 1 89 1 59 1 35 2 89 1 35 1 35 2 89 1 187 1 35 2 11 2 25 1 11 3 35 1 135 1 161 1 110 1 133 1 125 1 184 1 110 1 161 1 184 2 44 1 184 1 24 1 57 1 161 1 59 1 138 2 98 1 44 2 59 1 119 1 101 2 89 1 75 1 158 1 108 4 72 1 1 3 35 1 35 2 11 1 72 3 11 11 2 1 2 2 35 1 25 1 72 1 11 1 60 1 60 1 11 2 158 1 194 5 108 1 158 0 17 1 187 1 22 1 59 1 50 2 85 6 85 1 63 1 85 2 239 1 106 1 138 1 187 2 106 1 63 1 63 1 138 1 89 2 187 1 22 4 187 4 59 7 35 5 126 1 2 1 3 1 126 1 72 1 25 1 1 3 72 1 59 2 179 1 125 1 110 1 104 1 224 1 110 4 214 1 184 4 107 1 180 1 107 1 24 1 224 2 147 1 87 1 139 1 110 3 98 1 85 4 59 1 101 2 62 2 75 2 158 1 194 3 1 1 25 1 35 1 60 1 20 1 1 5 2 1 25 1 2 1 35 1 25 1 20 1 25 4 158 1 60 1 158 1 194 1 0 1 126 1 50 1 85 2 106 2 138 2 36 1 85 2 106 2 36 1 63 1 75 1 22 3 36 4 138 1 44 1 59 1 22 1 101 1 187 1 89 2 101 1 187 1 35 2 50 1 75 2 45 1 50 1 75 4 25 1 50 1 59 2 44 1 180 2 58 1 224 2 180 1 110 1 184 1
Decompressed pixel intensity (1st row)
--------------------- Compression Stats --------------------- Uncompressed Bits: 751936 Compressed Bits: 304477 Compression %: 59.5076
Compression Stats
Something that was important for me to understand was that any encodings larger than 8 bits were less efficient than ASCII. But even though there are a few of those in this example, the most frequently used characters are less than 8 bits, allowing for a decrease in space consumed.
victortaksheyev@gmail.com