Huffman Coding

Huffman coding is a lossless data compression algorithm. In this algorithm, variable length codes are assigned to input different characters. The code length is related to the frequency of character use. The most frequently used characters have the shortest codes, while longer codes are used for the least frequently used characters.

There are mainly two parts. The first is to create the Huffman tree, and the other is to traverse the tree to find the codes.

For example, consider some strings 'YYYZXXYYX', the frequency of character Y is greater than X, and the frequency of character Z is the smallest. Therefore, the code length of Y is less than X, and the code length of X will be less than Z.

The complexity of assigning codes based on the frequency of each character is O(n log n)

Input　-String of different characters, for example, 'ACCEBFFFFAAXXBLKE'.
Output　-The codes for different characters:

Data: K, Frequency:　1, Code: 0000
Data: L, Frequency:　1, Code: 0001
Data: E, Frequency:　2, Code: 001
Data: F, Frequency:　4, Code: 01
Data: B, Frequency:　2, Code:　100
Data: C, Frequency:　2, Code:　101
Data: X, Frequency:　2, Code:　110
Data: A, Frequency:　3, Code:　111

Algorithm

huffmanCoding(string)

Input-String of different characters.

Output-The code for each character.

Begin
　　　define a node with character, frequency, left and right child of the node for Huffman tree.
　　　create a list 'freq' to store the frequency of each character, initially all are 0
　　　for each character c in the string do
　　　　　　increase the frequency for character ch in freq list.
　　　done
　　　for all type of character ch do
　　　　　　if the frequency of ch is non zero then add ch and its frequency as a node of priority queue Q.
　　　done
　　　while Q is not empty do
　　　　　　remove item from Q and assign it to left child of node
　　　　　　remove item from Q and assign to the right child of node
　　　　　　traverse the node to find the assigned code
　　　done
End

traverseNode(n: node, code)

Input-Huffman tree node n, and the code assigned in the last call　

Output-code assigned to each character

if left child of node n ≠ φ then
　　　traverseNode(leftChild(n), code+’0’)　//traverse through the left child
　　　traverseNode(rightChild(n), code+’1’)　//traverse through the right child
else
　　　display the character and data of current node.

Example

#include<iostream>
#include<queue>
#include<string>
using namespace std;
struct node{
　　　int freq;
　　　char data;
　　　const node　*child0,　*child1;
　　　node(char d, int f =　-1{　//assign values in the node
　　　　　　data = d;
　　　　　　freq = f;
　　　　　　child0 = NULL;
　　　　　　child1　= NULL;
　　　}
　　　node(const node　*c0, const node　*c1{
　　　　　　data = 0;
　　　　　　freq = c0->freq　+　c1->freq;
　　　　　　child0=c0;
　　　　　　child1=c1;
　　　}
　　　bool operator<(const node &a) const {　//< 运算符用于在队列中查找优先级
　　　　　　return freq > a.freq;
　　　}
　　　void traverse(string code = "")const{
　　　　　　if(child0!=NULL){
　　　　　　　　　child0->traverse(code+'0');　//以代码作为左子节点添加0
　　　　　　　　　child1->traverse(code+'1);　//添加　1　以代码作为右子节点
　　　　　　}else{
　　　　　　　　　cout << "Data: " << data << ", Frequency: " << freq << ", Code: " << code << endl;
　　　　　　}
　　　}
};
void huffmanCoding(string str){
　　　priority_queue<node> qu;
　　　int frequency[256];
　　　for(int i = 0; i <256; i++)
　　　　　　frequency[i] = 0;　//清除所有频率
　　　for(int i = 0; i < str.size(); i++{
　　　　　　frequency[int(str[i])]++;　//增加频率
　　　}
　　　for(int i = 0; i <256; i++{
　　　　　　if(frequency[i]){
　　　　　　　　　qu.push(node(i, frequency[i]));
　　　　　　}
　　　}
　　　while(qu.size() >1{
　　　　　　node　*c0 = new node(qu.top());　//获取左子节点并从队列中移除
　　　　　　qu.pop();
　　　　　　node　*c1　= new node(qu.top());　//获取右子节点并从队列中移除
　　　　　　qu.pop();
　　　　　　qu.push(node(c0, c1));　//将两个子节点的频率相加，然后再次加入队列
　　　}
　　　cout << "The Huffman Code: " << endl;
　　　qu.top().traverse();　//遍历树以获取代码
}
main(){
　　　string str = "ACCEBFFFFAAXXBLKE";　//arbitrary string to get frequency
　　　huffmanCoding(str);
}

Output Result

The Huffman Code:
Data: K, Frequency:　1, Code: 0000
Data: L, Frequency:　1, Code: 0001
Data: E, Frequency:　2, Code: 001
Data: F, Frequency:　4, Code: 01
Data: B, Frequency:　2, Code:　100
Data: C, Frequency:　2, Code:　101
Data: X, Frequency:　2, Code:　110
Data: A, Frequency:　3, Code:　111

Basic Tutorial