λ
home posts study uses projects

Huffman Coding

Jul 18, 2024

compression

Huffman Trees are a way to losslessly compress text by encoding the probabilities of each character in a tree. This allows you to use less bits for more frequently used characters, thus saving space. It is the most efficient way to compress text if you look at one character at a time.

You build the tree by going over the text input and storing the frequency of each character in a priority queue. You then start pairing up the 2 least frequently used characters (the 2 top-most elements in the queue) in a new node, whose value is the sum of their frequencies. You repeat this process recursively until you’ve merged all nodes into one tree (you’ll be left with one node in the queue).

After you’ve built the tree you then encode the original input text into a new stream of bits. You build a dictionary of each letter and its corresponding binary representation. This is done by iterating over the tree (for each leaf node) and storing a 0 or a 1 depending on whether you went left or right. You compress the data by going over each character and using the dictionary, writing a 0 or a 1. You can then send the compressed data and the translation dictionary to the recipient.

To decompress the data you to the same operations in reverse. By going over the input stream and using the dictionary you can easily lookup the corresponding character.

The huffman code dictionary isn’t fixed length (thus takes up less space as you don’t have to use padding) and therefore posses the prefix property.

Example

Huffman_coding_visualisation.svg

References

LλBS

All systems up and running

Commit c0daef5, deployed on Oct 01, 2024.