Huffman Coding Algorithm
Developed by David A.Huffman, This is a greedy algorithm used for lossless data compression. This algorithm uses a variable-length encoding scheme to reduce the number of bits utilized.
What is encoding and why is it necessary?
Encoding is the process of converting data from one form to another. It involves transmitting or storing a sequence of characters efficiently.
We are very well aware that everything in a computer is represented in binary form ie. stream of 0s and 1s. The audios, videos, images, characters all are encoded in the form of binary numbers in machine code. The most common example is using UTF-8 encoding ie. using ASCII codes in 8-bit binary format. For example alphabet, ‘A’ is represented by 01000001 ie. 8-bit binary form of ASCII code of ‘A’, 65.
Types of encoding:
Our aim is to reduce the number of bits required in this encoding. This is fulfilled by Huffman’s algorithm.
Compression using fixed-length codes:
Let's assume we have to transfer a message with just five different letters A, B, C, D, E in it.Then we can assign a 3-bit code to each letter rather than using 8 bit ASCII codes.
This compression works fine only when the frequency of these letters is small which is not the case in real-life scenarios.
Compression using variable-length codes:
The output from Huffman’s algorithm can be viewed as a variable-length code. The idea behind this is to assign the smallest Huffman code to the character with the highest frequency and the largest to the one with the least frequency. For most cases, This encoding helps in lossless compression by reducing the number of bits utilized.
But how are these codes assigned?
- We Build a Huffman Tree from input characters.
- Traverse the Tree and assign codes to characters.
Creating a Huffman tree
Steps to construct the binary tree
- Build a min-heap of all leaf nodes. The leaf nodes are unique input characters.
- Initially choose the two nodes with the least frequency
- Create a new parent node with a frequency equal to the sum of the two node's frequencies. Make the first extracted node as its left child and the other extracted node as its right child.
- Add this node to the min-heap.
- Repeat the above steps until the heap contains only one node. The remaining node is the root node and the tree is complete.
Let's take an example and construct the Huffman tree. In the following figure, the bubble in green represents the leaf nodes and their values are the frequency of occurrence of the unique letters
Traversing the tree extract codes:
- Start from the root node and,
You can follow any of the two conventions (but the same must be followed for all leaf nodes):
- Add 0 to the array while traversing the left child and add 1 to the array while traversing the right child
- Add 1 to the array while traversing the left child and add 0 to the array while traversing the right child.
Main Properties of Huffman coding:
- All unique characters are at the leaf node of the tree
- Unique Prefix Property: Code assigned to a particular character is not a prefix of any other character. Like in the above example B is assigned code 00 and as you can notice from the table no other code starts with 00. This helps in avoiding ambiguity while decoding the message..
- Huffman codes optimal when all the probabilities in frequency distribution are integral powers of 1/2
Time complexity analysis:
Since this algorithm uses min Heap data structure for implementing priority queue, the complexity is:
O(N log N)
N is the number of unique characters
- Transmitting fax and text.
- Multimedia codecs such as JPEG and MP3
- Compressing HTTP headers.
- Used by conventional compression formats like PKZIP, GZIP, etc.
The disadvantages of the Huffman code.
- If the ensemble changes that means the frequency and probability change. Then this coding option is not optimal.
- Another instance when Huffman coding is not optimal is when probabilities are not in the negative power of two.
- By using Huffman coding, we can achieve a lower compression ratio, this does not make it suitable for encoding digital images.
- It’s a relatively slow process because it includes two passes: firstly, building the statistical model, basically the coding part, and second, The Encoding part.
- It is difficult to detect whether encoded data is corrupt as codes have variable lengths.