What is Huffman coding?
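Huffman coding is an entropy coding algorithm used for lossless data compression. It was invented in 1952 by the American computer scientist David Albert Huffman.
Huffman coding encodes source symbols (such as a single letter) with a variable-length code table. The table is built from an estimate of how often each source symbol occurs: symbols that occur often get shorter codes, and symbols that occur rarely get longer codes. This lowers the average (expected) length of the encoded string, which is how the data is compressed without loss.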
What is Huffman coding used for?
It is used to compress and decompress data.
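In legacy encodings such as ASCII (for English) and GBK (for Chinese), an English letter or a digit takes 1 byte, and a Chinese character takes 2 bytes. In the Unicode encodings the sizes are:
- in UTF-8, a commonly used Chinese character takes 3 bytes and an English letter takes 1 byte;
- in UTF-16, a commonly used Chinese character takes 2 bytes (4 bytes for characters outside the Basic Multilingual Plane) and an English letter takes 2 bytes;
- in UTF-32, every character takes 4 bytes.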
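The byte counts above can be checked with a minimal Kotlin sketch. The helper name `encodedSize` is made up for this illustration; the big-endian charset variants are chosen on the assumption that we do not want a byte-order mark counted in the size.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: number of bytes `text` occupies in the named charset.
fun encodedSize(text: String, charsetName: String): Int =
    text.toByteArray(Charset.forName(charsetName)).size

fun main() {
    // Big-endian variants are used so that no byte-order mark is added.
    for (name in listOf("UTF-8", "UTF-16BE", "UTF-32BE")) {
        val english = encodedSize("A", name)
        val chinese = encodedSize("中", name)
        println("$name: 'A' = $english bytes, '中' = $chinese bytes")
    }
}
```

With the sample characters `A` and `中`, this should report 1 and 3 bytes for UTF-8, 2 and 2 for UTF-16, and 4 and 4 for UTF-32.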
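Let us review what a byte is:
A byte is a unit of data size. One byte equals 8 bits, and a bit is a single binary digit, 0 or 1.
The space taken by any data can be measured in bytes. For example, in Java: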
- a char takes 2 bytes,
- a short takes 2 bytes,
- an int takes 4 bytes,
- a float takes 4 bytes,
- a long or a double takes 8 bytes.
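The same sizes can be read from Kotlin, whose basic types map to the Java primitives on the JVM. This is a small sketch that uses the standard `SIZE_BYTES` constants; the function name `primitiveSizes` is an assumption made for the example.

```kotlin
// Hypothetical helper: byte sizes of the types listed above,
// taken from the SIZE_BYTES constants of the Kotlin standard library.
fun primitiveSizes(): Map<String, Int> = mapOf(
    "Char" to Char.SIZE_BYTES,
    "Short" to Short.SIZE_BYTES,
    "Int" to Int.SIZE_BYTES,
    "Float" to Float.SIZE_BYTES,
    "Long" to Long.SIZE_BYTES,
    "Double" to Double.SIZE_BYTES
)

fun main() {
    for ((type, bytes) in primitiveSizes()) {
        println("$type: $bytes bytes (${bytes * 8} bits)")
    }
}
```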
Code implementation
In the code tree, taking a left branch contributes a 0 to the code and taking a right branch contributes a 1.
Goal: build the code tree by repeatedly merging the least frequent nodes:
- step1: take the 2 nodes (initially single chars) with the lowest frequency
- step2: make them the two children of a new node whose value is the sum of their frequencies
- step3: put the new node back among the candidates and repeat from step1 until a single tree remains
Let us understand the algorithm with an example.
    package _Algorithm.HuffmanCode

    import java.util.PriorityQueue

    // Tree node. The original listing uses HuffmanNode without showing it,
    // so this definition is an assumption.
    class HuffmanNode {
        var data: Int = 0              // frequency of the char (or sum for inner nodes)
        var c: Char = '-'              // the char; '-' marks an inner node
        var left: HuffmanNode? = null
        var right: HuffmanNode? = null
    }

    class HuffmanCoding {
        // Recursive function that prints the Huffman code of each leaf
        // while traversing the tree.
        private fun printCode(root: HuffmanNode?, s: String) {
            if (root == null) return
            if (root.left == null && root.right == null && Character.isLetter(root.c)) {
                println("${root.c}:$s")
                return
            }
            // Going left appends "0" to the code; going right appends "1".
            printCode(root.left, s + "0")
            printCode(root.right, s + "1")
        }

        fun test() {
            val n = 6
            val charArray = charArrayOf('a', 'b', 'c', 'd', 'e', 'f')
            val charfreq = intArrayOf(5, 9, 12, 13, 16, 45)
            // Min-heap ordered by frequency.
            val priorityQueue = PriorityQueue<HuffmanNode>(n, MyComparator())
            for (i in 0 until n) {
                val node = HuffmanNode()
                node.c = charArray[i]
                node.data = charfreq[i]
                priorityQueue.add(node)
            }
            // Root of the tree built so far.
            var root: HuffmanNode? = null
            while (priorityQueue.size > 1) {
                val x = priorityQueue.poll()   // lowest frequency
                val y = priorityQueue.poll()   // second lowest frequency
                // New inner node f whose frequency is the sum of its two children.
                val f = HuffmanNode()
                f.data = x.data + y.data
                f.c = '-'
                f.left = x
                f.right = y
                // f is the root of the tree so far; put it back in the queue.
                root = f
                priorityQueue.add(f)
            }
            printCode(root, "")
        }
    }

    class MyComparator : Comparator<HuffmanNode> {
        // compareTo avoids the overflow risk of subtracting the two frequencies.
        override fun compare(o1: HuffmanNode, o2: HuffmanNode): Int =
            o1.data.compareTo(o2.data)
    }
Printed result
f:0
c:100
d:101
a:1100
b:1101
e:111
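Because no code in this table is a prefix of another, a bit string can be decoded by reading bits until they match a code. The following is a minimal sketch of compression and decompression with the printed table; `codes`, `encode`, and `decode` are names invented for this illustration and are not part of the listing above.

```kotlin
// Code table copied from the printed result above.
val codes: Map<Char, String> = mapOf(
    'f' to "0", 'c' to "100", 'd' to "101",
    'a' to "1100", 'b' to "1101", 'e' to "111"
)

// Replace each char with its code and concatenate the codes.
fun encode(text: String, table: Map<Char, String>): String =
    text.map { table.getValue(it) }.joinToString("")

// Read bits one by one; emit a char as soon as the bits read so far match a code.
fun decode(bits: String, table: Map<Char, String>): String {
    val byCode = table.entries.associate { (ch, code) -> code to ch }
    val out = StringBuilder()
    val current = StringBuilder()
    for (bit in bits) {
        current.append(bit)
        val ch = byCode[current.toString()]
        if (ch != null) {
            out.append(ch)
            current.setLength(0)
        }
    }
    require(current.isEmpty()) { "trailing bits do not form a complete code" }
    return out.toString()
}

fun main() {
    val bits = encode("face", codes)
    println(bits)                 // prints 01100100111
    println(decode(bits, codes))  // prints face
}
```

The sketch keeps the bits as a `String` of `'0'` and `'1'` characters for readability; a real compressor would pack them into bytes.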
Compression result
From the data:
val charArray = charArrayOf('a', 'b', 'c', 'd', 'e', 'f')
val charfreq = intArrayOf(5, 9, 12, 13, 16, 45)
we get the following result:
Number of bits without Huffman coding:
Total number of characters = sum of frequencies = 5 + 9 + 12 + 13 + 16 + 45 = 100;
assuming a fixed-length code of 1 byte = 8 bits per character, total number of bits = 100 * 8 = 800;
With Huffman coding the codes are:
f:0 //code length is 1
c:100 //code length is 3
d:101
a:1100
b:1101
e:111
so total number of bits =
freq(f) * code_length(f) + freq(c) * code_length(c) + freq(d) * code_length(d) + freq(a) * code_length(a) +
freq(b) * code_length(b) + freq(e) * code_length(e) =
45*1 + 12*3 + 13*3 + 5*4 + 9*4 + 16*3 = 224
Bits saved: 800 - 224 = 576, which is 72% of the fixed-length size.
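The arithmetic above can be reproduced with a short Kotlin sketch. The names `charFreq`, `codeTable`, and `totalBits` are assumptions made for this example; the frequencies and codes are the ones shown earlier.

```kotlin
// Frequencies and codes taken from the example above.
val charFreq: Map<Char, Int> = mapOf(
    'a' to 5, 'b' to 9, 'c' to 12, 'd' to 13, 'e' to 16, 'f' to 45
)
val codeTable: Map<Char, String> = mapOf(
    'f' to "0", 'c' to "100", 'd' to "101",
    'a' to "1100", 'b' to "1101", 'e' to "111"
)

// Sum of frequency * code length over all chars.
fun totalBits(freq: Map<Char, Int>, codes: Map<Char, String>): Int =
    freq.map { (ch, f) -> f * codes.getValue(ch).length }.sum()

fun main() {
    val fixedBits = charFreq.values.sum() * 8   // 8 bits per char, as assumed above
    val huffmanBits = totalBits(charFreq, codeTable)
    println("fixed-length: $fixedBits bits")           // 800
    println("Huffman: $huffmanBits bits")              // 224
    println("saved: ${fixedBits - huffmanBits} bits")  // 576
}
```

Dividing the Huffman total by the 100 characters gives the average code length, 2.24 bits per character instead of 8.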