堆排序(Heap Sort)之所以让人魂牵梦萦,是因为其实现过程比较复杂,不但包括堆的构造,还包括堆的析构。无论是堆的构造(初始化)还是析构,都离不开堆的调整。而调整堆的方法又分为两种,一种是向上游(Swim),另一种就是往下沉(Sink)。向上游(Swim)方法通常用在堆的构造过程中,而往下沉(Sink)通常用在堆的析构过程中。在正式介绍堆排序之前,有必要先讲讲堆的基本概念。
P.S.在前段时间找工作的过程中,Intel OTC的一个兄弟就问过我堆排序。可惜我当时对堆排序没有很深入的理解,因为没有很好地把握堆的调整过程。直到我后来看了大师Robert Sedgewick和Kevin Wayne合著的《算法》第4版第2章(2.4 优先级队列),才恍然大悟!大师们用的是向上游(Bottom-up reheapify (swim))和往下沉(Top-down reheapify (sink))来描述堆的调整过程,可谓形象生动至极!相比之下,在某些国产数据结构教科书中,使用FilterUp()和FilterDown()函数来描述和实现堆的向上调整与向下调整就显得晦涩得多。为了表达对大师们的顶礼膜拜之情,我这周果断地购买了《算法》第4版的英文版。所以, "既要好读书,更要读好书!" |
关于堆的定义,维基百科是这么说滴,
A heap is a specialized tree-based data structure that satisfies the heap
property: If A is a parent node of B, then the key (the value) of node A is
ordered with respect to the key of node B with the same ordering applying
across the heap.
A heap can be classified further as either a "max heap" or a "min heap".
In a max heap, the keys of parent nodes are always greater than
or equal to those of the children and the highest key is in the root node.
In a min heap, the keys of parent nodes are less than
or equal to those of the children and the lowest key is in the root node.
什么是堆(Heap)?
堆本质上是一棵二叉树,而且是完全二叉树。 (注:从严格意义上讲,堆可以是N(>=2)叉树,为简单起见,这里只讨论堆为二叉树的情况。)
堆的分类
堆分为大顶堆(Max Heap)和小顶堆(Min Heap)。
什么是大顶堆(Max Heap)?
大顶堆中的每个结点的关键字都不小于孩子结点(如果有的话)的关键字。 例如, (图片来源戳这里)
什么是小顶堆(Min Heap)?
小顶堆中的每个结点的关键字都不大于孩子结点(如果有的话)的关键字。 例如,(感谢贤妻将上面的Max Heap图PS成Min Heap图,背景透明,高大上!)
堆到底有什么用?
堆(Heap)能很好地支持优先级队列(Priority Queue)的基本操作。 而优先级队列的应用是非常广泛的,例如:
- 操作系统的调度器实现
- 网络传输中的路由选择
- ... ...
- 针对急诊大厅里的病号们而设计的就诊调度程序(试想想,如果没有实现优先级队列,那该何等可怕?!)
那么,什么是优先级队列(Priority Queue)?
优先级队列毫无疑问也是队列,表现形式为FIFO(Fisrt In, First Out)线性结构。但与普通队列不同的是,优先级队列的删除操作比较特殊,总是以队列元素的优先级高低为准,比如总是删除优先级最高的元素或者总是删除优先级最低的元素。而优先级队列的插入操作跟普通队列的插入操作一样,也就是不限制元素的优先级,任何时刻任何优先级的元素都可以入队。
在熟悉了堆的基本概念后,下面以大顶堆(Max Heap)为例讲述堆排序(Heap Sort)。
堆排序(Heap Sort)的基本思想
堆排序(Heap Sort)是一种树形选择排序。
- 堆的构造:根据初始数组(a[0..n-1])构造一个完全二叉树,保证所有的父结点的关键字都>=它的孩子结点的关键字。
- 堆的析构:交换最后一个叶子结点(a[n-1])和树根结点(a[0])(关键字最大),输出结点(a[n-1]),然后把余下的整棵树(a[0..n-2])重新调整为大顶堆。
将析构过程循环往复,当输出完最后一个结点后,这个数组(a[0..n-1])的关键字就满足从小到大的排列顺序了。
典型的堆排序看起来是这样子滴,(图片来源戳这里)
文章Heap Data Structures对大顶堆的排序过程有非常好的论述,摘要如下:
1. 大顶堆的构造算法(Max Heap Construction Algorithm) // Swim
Step 1: Create a new node at the end of heap.
Step 2: Assign new value to the node.
Step 3: Compare the value of this child node with its parent.
Step 4: If value of parent is less than child, then swap them.
Step 5: Repeat step 3 & 4 until Heap property holds.
2. 大顶堆的删除(析构)算法(Max Heap Deletion Algorithm) // Sink
Step 1: Remove root node. Step 2: Move the last element of last level to root. Step 3: Compare the value of this child node with its parent. Step 4: If value of parent is less than child, then swap them. Step 5: Repeat step 3 & 4 until Heap property holds.
上面的动画我个人非常喜欢,因为对于理解堆排序实在是太贴心了(不得不承认,老外写的文章就是可读性好,最大的特点就是将复杂的东西尽可能地简单化和形象化,很值得我们学习和借鉴)。好了,该上具体的代码实现啦! (为方便理解,下面给出略微冗长(i.e.不是非常精简)但是可读性良好的C代码实现)
1. 数组与堆的映射表示
/* * Assume the size of an array is 9, e.g. * * Array: a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8] // n = 9 * * Heap : a[0] --- Level 0 --- * / * / * / * a[1] a[2] --- Level 1 --- * / / * / / * a[3] a[4] a[5] a[6] --- Level 2 --- * / * a[7] a[8] --- Level 3 --- * * For node a[i], i = 0, 1, 2, ..., N-1 * * Index of its Left Child = (i + 1) * 2 - 1 = i * 2 + 1 * Index of its Right Child = (i + 1) * 2 = i * 2 + 2 * Index of its Parent = (i + 1) / 2 - 1 = (i - 1) / 2 * * C code: * * int getIndexOfLeftChild(i) { return i * 2 + 1; } * int getIndexOfRightChild(i) { return i * 2 + 2; } * int getIndexParent(i) { return (i - 1) / 2; } * int hasLeftChild(n, i) { return ((i * 2 + 1) < n); } * int hasRightChild(n, i) { return ((i * 2 + 2) < n); } * int hasParent(i) { return (i > 0); } * int isLeafNode(n, i) { return (i >= n / 2); } */
2. 第1个版本的heapsort()函数 [只使用swim(),未使用sink()]
1 static void exchange(int a[], int i, int j) 2 { 3 int t = a[i]; 4 a[i] = a[j]; 5 a[j] = t; 6 } 7 8 static int getIndexOfParent(int i) 9 { 10 return (i - 1) / 2; 11 } 12 13 static void swim(int a[], size_t n, int k) 14 { 15 #define ISNOTROOT(k) (k != 0) 16 while (ISNOTROOT(k)) { 17 int parent = getIndexOfParent(k); 18 if (a[k] <= a[parent]) 19 break; 20 21 /* swim only if a[k] > a[parent] */ 22 exchange(a, k, parent); 23 k = parent; 24 } 25 } 26 27 static void constructMaxHeap(int a[], size_t n) 28 { 29 for (int i = 0; i < n; i++) 30 swim(a, i, i); 31 } 32 33 void heapsort(int a[], size_t n) 34 { 35 constructMaxHeap(a, n); // firstly construct a max heap 36 // 37 while (n > 0) { // note a[0] is always the max element 38 exchange(a, 0, n - 1); // swap a[0], a[n-1] then 39 // a[n-1] is the max one 40 n--; // decrease the size of a[] 41 constructMaxHeap(a, n); // re-construct the max heap by walking 42 // a[0 .. n-2] 43 } // 44 }
说明: 对于L13-L25的swim()函数,a[0 .. k-1]总是满足大顶堆的结构,那么将a[k]插入大顶堆的时候,采用swim(向上游)的方法。如果a[k]比它的parent的关键字要大,就向上游一把(即交换a[k]和它的parent),如此循环往复,直到a[0 .. k]满足大顶堆结构为止。
如果输入的数组为: int a[] = {0, 1, 2, 3}, 则构造堆的过程可用图推演如下:
3. 第2个版本的heapsort()函数 [既使用swim()也使用sink()]
1 static void exchange(int a[], int i, int j) 2 { 3 int t = a[i]; 4 a[i] = a[j]; 5 a[j] = t; 6 } 7 8 static int getIndexOfParent(int i) 9 { 10 return (i - 1) / 2; 11 } 12 13 static int getIndexOfLeftChild(int i) 14 { 15 return i * 2 + 1; 16 } 17 18 static int getIndexOfRightChild(int i) 19 { 20 return i * 2 + 2; 21 } 22 23 static void swim(int a[], size_t n, int k) 24 { 25 #define ISNOTROOT(k) (k != 0) 26 while (ISNOTROOT(k)) { 27 int parent = getIndexOfParent(k); 28 if (a[k] <= a[parent]) 29 break; 30 31 /* swim only if a[k] > a[parent] */ 32 exchange(a, k, parent); 33 k = parent; 34 } 35 } 36 37 static void sink(int a[], size_t n, int k) 38 { 39 #define ISLEAFNODE(n, k) (k >= n / 2) 40 while (!ISLEAFNODE(n, k)) { 41 int left = getIndexOfLeftChild(k); 42 int right = getIndexOfRightChild(k); 43 44 int child = -1; /* set to be an invalid index on purpose */ 45 if (left < n && right < n) 46 child = (a[left] > a[right]) ? left : right; 47 else if (left < n && right >= n) 48 child = left; 49 else if (left >= n && right < n) 50 child = right; 51 else 52 child = k; 53 54 if (a[k] >= a[child]) 55 break; 56 57 /* sink only if max of children of a[k] say a[child] > a[k] */ 58 exchange(a, k, child); 59 k = child; 60 } 61 } 62 63 static void constructMaxHeap(int a[], size_t n) 64 { 65 for (int i = 0; i < n; i++) 66 swim(a, i, i); 67 } 68 69 void heapsort(int a[], size_t n) 70 { 71 constructMaxHeap(a, n); // firstly construct a max heap 72 // 73 while (n > 0) { // note a[0] is always the max element 74 exchange(a, 0, n - 1); // swap a[0], a[n-1] then 75 // a[n-1] is the max one 76 n--; // decrease the size of a[] 77 sink(a, n, 0); // re-build the max heap by using sink() 78 } // while walking a[0 .. n-2] 79 }
说明: L37-L61中的sink()函数,关键是找出比结点a[k]关键字大的孩子结点。如果a[k]有两个孩子,那么就取出两个孩子结点中关键字最大的那一个孩子结点与a[k]进行比较。 设找出的孩子结点为a[child], 如果a[k] < a[child], 就交换a[k]和a[child]。新的k就是child, 如此循环往复,直到a[k]不需要再往下沉为止。(如果a[k]为叶子结点,显然不需要再往下沉。)
如果使用函数constructMaxHeap()完成了堆的构造后的数组为: int a[] = {3, 2, 1, 0}, 则接下来堆排序的过程可用图推演如下:
4. 完整的实现
o heapsort.c
1 #include <stdio.h> 2 #include <stdlib.h> 3 4 /* 5 * Assume the size of an array is 9, e.g. 6 * 7 * Array: a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8] // n = 9 8 * 9 * Heap : a[0] --- Level 0 --- 10 * / 11 * / 12 * / 13 * a[1] a[2] --- Level 1 --- 14 * / / 15 * / / 16 * a[3] a[4] a[5] a[6] --- Level 2 --- 17 * / 18 * a[7] a[8] --- Level 3 --- 19 * 20 * For node a[i], i = 0, 1, 2, ..., N-1 21 * 22 * Index of its Left Child = (i + 1) * 2 - 1 = i * 2 + 1 23 * Index of its Right Child = (i + 1) * 2 = i * 2 + 2 24 * Index of its Parent = (i + 1) / 2 - 1 = (i - 1) / 2 25 * 26 * C code: 27 * 28 * int getIndexOfLeftChild(i) { return i * 2 + 1; } 29 * int getIndexOfRightChild(i) { return i * 2 + 2; } 30 * int getIndexParent(i) { return (i - 1) / 2; } 31 * int hasLeftChild(n, i) { return ((i * 2 + 1) < n); } 32 * int hasRightChild(n, i) { return ((i * 2 + 2) < n); } 33 * int hasParent(i) { return (i > 0); } 34 * int isLeafNode(n, i) { return (i >= n / 2); } 35 */ 36 37 static void show(int a[], size_t n) 38 { 39 for (int i = 0; i < n; i++) 40 printf("%-2c ", a[i]); 41 printf(" "); 42 } 43 44 static void exchange(int a[], int i, int j) 45 { 46 int t = a[i]; 47 a[i] = a[j]; 48 a[j] = t; 49 } 50 51 static int getIndexOfParent(int i) 52 { 53 return (i - 1) / 2; 54 } 55 56 static int getIndexOfLeftChild(int i) 57 { 58 return i * 2 + 1; 59 } 60 61 static int getIndexOfRightChild(int i) 62 { 63 return i * 2 + 2; 64 } 65 66 static void swim(int a[], size_t n, int k) 67 { 68 #define ISNOTROOT(k) (k != 0) 69 while (ISNOTROOT(k)) { 70 int parent = getIndexOfParent(k); 71 if (a[k] <= a[parent]) 72 break; 73 74 /* swim only if a[k] > a[parent] */ 75 exchange(a, k, parent); 76 k = parent; 77 } 78 } 79 80 static void sink(int a[], size_t n, int k) 81 { 82 #define ISLEAFNODE(n, k) (k >= n / 2) 83 while (!ISLEAFNODE(n, k)) { 84 int left = getIndexOfLeftChild(k); 85 int right = getIndexOfRightChild(k); 86 87 int child = -1; /* set to invalid index on purpose */ 88 if (left < n && right < n) 89 child = (a[left] > a[right]) ? left : right; 90 else if (left < n && right >= n) 91 child = left; 92 else if (left >= n && right < n) 93 child = right; 94 else 95 child = k; 96 97 if (a[k] >= a[child]) 98 break; 99 100 /* sink only if max of children of a[k] say a[child] > a[k] */ 101 exchange(a, k, child); 102 k = child; 103 } 104 } 105 106 static void constructMaxHeap(int a[], size_t n) 107 { 108 for (int i = 0; i < n; i++) 109 swim(a, i, i); 110 } 111 112 #ifdef _USE_SWIM_ONLY_ 113 void heapsort(int a[], size_t n) 114 { 115 constructMaxHeap(a, n); 116 117 while (n > 0) { 118 exchange(a, 0, n - 1); 119 120 n--; 121 constructMaxHeap(a, n); 122 } 123 } 124 #else 125 void heapsort(int a[], size_t n) 126 { 127 constructMaxHeap(a, n); 128 129 while (n > 0) { 130 exchange(a, 0, n - 1); 131 132 n--; 133 sink(a, n, 0); 134 } 135 } 136 #endif 137 138 int main(int argc, char *argv[]) 139 { 140 if (argc < 2) { 141 fprintf(stderr, "Usage: %s <C1> [C2] ... ", argv[0]); 142 return -1; 143 } 144 145 argc--; 146 argv++; 147 148 int n = argc; 149 int *a = (int *)malloc(sizeof(int) * n); 150 #define VALIDATE(p) do { if (p == NULL) return -1; } while (0) 151 VALIDATE(a); 152 153 for (int i = 0; i < n; i++) 154 *(a+i) = argv[i][0]; 155 156 printf(" "); 157 for (int i = 0; i < n; i++) 158 printf("%-2d ", i); 159 printf(" "); 160 161 printf("Before sorting: "); show(a, n); 162 heapsort(a, n); 163 printf("After sorting: "); show(a, n); 164 165 free(a); a = NULL; 166 return 0; 167 }
o 编译并测试
$ gcc -g -Wall -m32 -std=c99 -o heapsort heapsort.c $ ./heapsort 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 Before sorting: 0 1 2 3 4 5 6 7 8 9 After sorting: 0 1 2 3 4 5 6 7 8 9 $ ./heapsort 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 Before sorting: 9 8 7 6 5 4 3 2 1 0 After sorting: 0 1 2 3 4 5 6 7 8 9 $ ./heapsort S O R T E X A M P L E 0 1 2 3 4 5 6 7 8 9 10 Before sorting: S O R T E X A M P L E After sorting: A E E L M O P R S T X $ gcc -g -Wall -m32 -std=c99 -D_USE_SWIM_ONLY_ -o heapsort heapsort.c heapsort.c:80:13: warning: ‘sink’ defined but not used [-Wunused-function] static void sink(int a[], size_t n, int k) ^ $ ./heapsort S O R T E X A M P L E 0 1 2 3 4 5 6 7 8 9 10 Before sorting: S O R T E X A M P L E After sorting: A E E L M O P R S T X
最后,有必要提一下堆排序的时间复杂度和空间复杂度。 注意: 堆排序是一种不稳定的排序算法。
Worst-case performance O(N * logN)
Best-case performance O(N)
Average performance O(N * logN)
Worst-case space complexity O(1) auxiliary
参考资料
- HeapSort (array Based) implementation in Java
- Heap Data Structures
- Priority Queues
- Heapsort from wikipedia
小结:
到此为止,我们已经完全弄明白了神秘的堆排序(Heap Sort)。 归结起来,只要充分掌握了调整堆的两种方法(向上游(Swim)和往下沉(Sink)),堆排序其实很简单。行文至此,我忽然觉得,做人和治学当如堆排序,因为"既要沉得下去,又要浮得上来"。下一节将介绍归并排序(Merge Sort)。