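Python and Data Structures [4] -> Hash Tables [2] -> Open Addressing and Rehashing in Python

Open addressing hashing and rehashing

Open addressing
Rehashing
Code implementation

1 Open addressing hashing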

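The previous article resolved insertion collisions with separate chaining. Open addressing is another way to resolve collisions in a hash table.

When a collision occurs, open addressing tries alternative cells until it finds an empty one. In general, the cells h₀(X), h₁(X), h₂(X), ... are tried in succession, where h_i(X) = (Hash(X) + F(i)) mod TableSize and F(0) = 0. The function F(i) is the probing strategy used to resolve collisions.

There are three basic probing strategies in open addressing:

Linear probing: the probe offset grows linearly; the basic form is F(i) = i
Quadratic probing: the probe offset grows quadratically; the basic form is F(i) = i²
Double hashing: the probe offset is determined by a second hash function; the basic form is F(i) = i*Hash₂(X), commonly with Hash₂(X) = R - (X mod R), where R is a prime smaller than TableSize.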
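
To make the three strategies concrete, the sketch below lists the first few cells h_i(X) each one would try. It assumes Hash(X) = X mod TableSize, and the helper names (probe_sequence, largest_prime_below, and the three step functions) are made up for this illustration rather than taken from the implementation later in the article.

```python
def largest_prime_below(n):
    """Largest prime strictly below n (assumes n > 2)."""
    candidate = n - 1
    while any(candidate % d == 0 for d in range(2, int(candidate ** 0.5) + 1)):
        candidate -= 1
    return candidate


def probe_sequence(x, table_size, f, count=5):
    """First `count` cells h_i(x) = (Hash(x) + F(i)) mod TableSize."""
    home = x % table_size  # assumption: Hash(x) = x mod TableSize
    return [(home + f(i, x, table_size)) % table_size for i in range(count)]


def linear(i, x, size):
    return i


def quadratic(i, x, size):
    return i * i


def double_hash(i, x, size):
    r = largest_prime_below(size)  # R: a prime smaller than TableSize
    return i * (r - x % r)         # i * Hash2(x)


print(probe_sequence(15, 11, linear))       # [4, 5, 6, 7, 8]
print(probe_sequence(15, 11, quadratic))    # [4, 5, 8, 2, 9]
print(probe_sequence(15, 11, double_hash))  # [4, 10, 5, 0, 6]
```

All three sequences start at the same home cell (i = 0) and only diverge once a collision forces a second probe.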

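2 Rehashing

In an open addressing table that uses quadratic probing, operations become slow once the table gets too full, and an insertion may even fail. A rehash solves this problem.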
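
Why an insertion can fail is easier to see with numbers. The sketch below (reachable_cells is a hypothetical helper, not part of the article's code) collects every cell that F(i) = i² can reach from one home cell. With a prime table size, only about half of the cells are reachable, so a key can be rejected while free cells still exist elsewhere.

```python
def reachable_cells(home, table_size):
    """Distinct cells that quadratic probing F(i) = i*i can ever visit from `home`."""
    # i*i mod table_size repeats with period table_size, so one period is enough.
    return sorted({(home + i * i) % table_size for i in range(table_size)})


print(reachable_cells(0, 7))        # [0, 1, 2, 4]
print(len(reachable_cells(0, 11)))  # 6
```

In a table of size 7, a key whose home cell is 0 can only ever land in cells 0, 1, 2 or 4. If those four are occupied, the insertion fails even though cells 3, 5 and 6 are empty. Keeping the table at most half full avoids this situation when the table size is prime.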

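Rehashing creates a new hash table whose size is the first prime greater than twice the size of the original table, and then hashes every value of the original table into the new one.

This operation is expensive, but it does not happen often, and many insertions must already have taken place before each rehash, so in practice the cost is not as bad as it looks.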
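
A rough way to check this claim is to count how many elements get moved by rehashing over a long run of insertions. The sketch below is a simplified model, not the article's implementation: it rehashes when the load factor exceeds 0.5 and grows the table to 2 * size + 1 without searching for a prime. The function name count_rehash_moves is made up for this example.

```python
def count_rehash_moves(n_inserts, initial_size=11, max_load=0.5):
    """Total number of element moves caused by rehashing during n_inserts insertions."""
    size, stored, moves = initial_size, 0, 0
    for _ in range(n_inserts):
        stored += 1
        if stored / size > max_load:
            moves += stored      # every stored element is re-inserted once
            size = size * 2 + 1  # simplification: skip the search for a prime
    return moves


n = 10_000
moves = count_rehash_moves(n)
print(moves, moves / n)  # 12282 1.2282
```

Under this model the rehashes add fewer than two extra moves per insertion on average, which is why the occasional expensive rehash is acceptable.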

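There are several common choices for when to rehash:

Rehash as soon as the load factor reaches one half
Rehash only when an insertion fails
Rehash when the load factor reaches a chosen threshold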

3 Code implementation

Complete code

from functools import partial as pro
from math import ceil, sqrt
from hash_table import HashTable, kmt_hashing


class RehashError(Exception):
    pass


class OpenAddressingHashing(HashTable):
    def __init__(self, size, hs, pb, fn=None, rf=0.5):
        self._array = [None for i in range(size)]
        self._get_hashing = hs                       # hash function generator
        self._hashing = hs(size) if not fn else fn   # active hash function
        self._probing = pb                           # probing function F(i)
        self._rehashing_factor = rf                  # load factor that triggers a rehash

    def _sniffing(self, item, num, hash_code=None):
        # Avoid redundant hashing: reuse the hash code passed in by the caller.
        # Compare with None explicitly, because a hash code of 0 is valid.
        if hash_code is None:
            hash_code = self._hashing(item)
        return (hash_code + self._probing(num, item, self.size)) % self.size

    def _get_rehashing_size(self):
        # First prime greater than twice the current size.
        size = self.size * 2 + 1
        while not is_prime(size):
            size += 1
        return size

    def rehashing(self, size=None, fn=None):
        if not size:
            size = self._get_rehashing_size()
        # size * load_factor is the number of stored elements.
        if size <= (self.size * self.load_factor):
            raise RehashError('Rehash size is too small!')
        array = self._array
        self._array = [None for i in range(size)]
        self._hashing = self._get_hashing(size) if not fn else fn
        self.insert(filter(lambda x: x is not None, array))

    def find(self, item):
        hash_code = ori_hash_code = self._hashing(item)
        collision_count = 1
        value = self._array[hash_code]

        # Bind the original hash code once so that repeated probes skip the hashing step.
        collision_handler = pro(self._sniffing, hash_code=ori_hash_code)

        while value is not None and value != item:
            hash_code = collision_handler(item, collision_count)
            value = self._array[hash_code]
            collision_count += 1
        return value, hash_code

    def _insert(self, item):
        if item is None:
            return
        value, hash_code = self.find(item)
        if value is None:
            self._array[hash_code] = item
        if self.load_factor > self._rehashing_factor:
            self.rehashing()


def is_prime(num):  # O(sqrt(n)) algorithm
    if num < 2:
        raise Exception('Invalid number.')
    if num == 2:
        return True
    for i in range(2, ceil(sqrt(num))+1):
        if num % i == 0:
            return False
    return True


def linear_probing(x, *args):
    return x


def square_probing(x, *args):
    return x**2


def double_hashing(x, item, size, *args):
    # R is the largest prime below the table size.
    r = size - 1
    while not is_prime(r):
        r -= 1
    return x * (r - (item % r))


def test(h):
    print('\nShow hash table:')
    h.show()

    print('\nInsert values:')
    h.insert(range(9))
    h.show()

    print('\nInsert value (existed):')
    h.insert(1)
    h.show()

    print('\nInsert value (collided):')
    h.insert(24, 47)
    h.show()

    print('\nFind value:')
    print(h.find(7))
    print('\nFind value (not existed):')
    print(h.find(77))

    print('\nLoad factor is:', h.load_factor)


if __name__ == '__main__':
    test(OpenAddressingHashing(11, kmt_hashing, linear_probing))
    print(30*'-')
    test(OpenAddressingHashing(11, kmt_hashing, square_probing))
    print(30*'-')
    test(OpenAddressingHashing(11, kmt_hashing, double_hashing))


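Step-by-step explanation

First, import the required modules, along with the hash table class and hash function (see Related reading at the end for their implementations), and define a rehash exception.

from functools import partial as pro
from math import ceil, sqrt
from hash_table import HashTable, kmt_hashing


class RehashError(Exception):
    pass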

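Define an open addressing hash table class. Its parameters are the initial table size size, a hash function generator hs, a probing function pb, an explicit hash function fn, and a rehashing factor rf. When fn is given, it is used as the hash function; otherwise the generator hs is called with the table size to obtain a default hash function.

class OpenAddressingHashing(HashTable):
    def __init__(self, size, hs, pb, fn=None, rf=0.5):
        self._array = [None for i in range(size)]
        self._get_hashing = hs
        self._hashing = hs(size) if not fn else fn
        self._probing = pb
        self._rehashing_factor = rf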
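
The generator kmt_hashing comes from the hash_table module, which is not shown here. As a rough idea of what a hash function generator looks like, here is a hypothetical stand-in named mod_hashing. It is an assumption for illustration only, not the actual kmt_hashing code.

```python
def mod_hashing(size):
    """Hypothetical generator: returns a hash function bound to one table size."""
    def hashing(item):
        return item % size
    return hashing


hash_11 = mod_hashing(11)
hash_23 = mod_hashing(23)
print(hash_11(24), hash_23(24))  # 2 1
```

Passing a generator rather than a finished function lets the table build a fresh hash function whenever its size changes, which is what rehashing needs.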

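Define the _sniffing method, which computes the next cell to probe.

Note: To avoid redundant computation, the method accepts an optional hash code. Across the successive probes of one insertion the original hash code does not change, so the caller can fix this argument with a partial function (see find below) and skip repeated calls to the hash function. For an expensive hash function this shortens probing time. The check uses "is None" rather than truthiness, because 0 is a valid hash code.

    def _sniffing(self, item, num, hash_code=None):
        # Avoid redundant hashing: reuse the hash code passed in by the caller.
        if hash_code is None:
            hash_code = self._hashing(item)
        return (hash_code + self._probing(num, item, self.size)) % self.size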

Define the _get_rehashing_size method, which computes the size of the new table for a rehash: the first prime greater than twice the current table size.

    def _get_rehashing_size(self):
        size = self.size * 2 + 1
        while not is_prime(size):
            size += 1
        return size
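
As a quick check of the sizes this rule produces, here is a standalone sketch. The names next_table_size and is_prime_int are made up for this example, and the primality test uses math.isqrt (Python 3.8+) instead of the floating-point sqrt used in the article.

```python
from math import isqrt


def is_prime_int(n):
    """Trial division up to the integer square root."""
    if n < 2:
        return False
    return all(n % d for d in range(2, isqrt(n) + 1))


def next_table_size(size):
    """First prime greater than twice the current size."""
    candidate = size * 2 + 1
    while not is_prime_int(candidate):
        candidate += 1
    return candidate


print(next_table_size(11))  # 23
print(next_table_size(23))  # 47
print(next_table_size(47))  # 97
```

Starting from the table size 11 used in the tests, the first rehash therefore yields a table of 23 cells.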

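Define the rehashing method, which performs the rehash:

If no new size is specified, compute it with the default rule,
If the requested size is not larger than the number of stored elements, raise a rehash exception,
Keep a reference to the original array, create a new array, and update the hash function,
Traverse the original array and insert its elements into the new array using the new hash function.

    def rehashing(self, size=None, fn=None):
        if not size:
            size = self._get_rehashing_size()
        # size * load_factor is the number of stored elements.
        if size <= (self.size * self.load_factor):
            raise RehashError('Rehash size is too small!')
        array = self._array
        self._array = [None for i in range(size)]
        self._hashing = self._get_hashing(size) if not fn else fn
        self.insert(filter(lambda x: x is not None, array))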
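
The same steps can be tried on a bare Python list without the HashTable base class. This is a minimal sketch under stated assumptions: the hash is item mod table size, collisions use linear probing, and the helper names insert_linear and rehash are invented for the example.

```python
def insert_linear(table, item):
    """Place item using linear probing; assumes at least one free cell."""
    size = len(table)
    index = item % size  # assumption: hash is item mod table size
    while table[index] is not None and table[index] != item:
        index = (index + 1) % size
    table[index] = item


def rehash(table, new_size):
    """Return a new table of new_size holding every element of `table`."""
    stored = [x for x in table if x is not None]
    if new_size <= len(stored):
        raise ValueError('new size is too small')
    new_table = [None] * new_size
    for item in stored:
        insert_linear(new_table, item)  # positions depend on the new size
    return new_table


old = [None] * 5
for value in (1, 6, 11):  # all three hash to cell 1 when the size is 5
    insert_linear(old, value)
new = rehash(old, 11)
print(old)  # [None, 1, 6, 11, None]
print(new)  # [11, 1, None, None, None, None, 6, None, None, None, None]
```

In the small table the three values form one cluster; after the rehash each of them sits in its own home cell.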

Define the find method, which looks up an element in the hash table.

Note: A partial function wraps the probing method so that the hash code is computed only once. The loop keeps probing new cells until it reaches either the target element or None, and then returns that value (the element or None) together with the index of the cell.

    def find(self, item):
        hash_code = ori_hash_code = self._hashing(item)
        collision_count = 1
        value = self._array[hash_code]

        # Bind the original hash code once so that repeated probes skip the hashing step.
        collision_handler = pro(self._sniffing, hash_code=ori_hash_code)

        while value is not None and value != item:
            hash_code = collision_handler(item, collision_count)
            value = self._array[hash_code]
            collision_count += 1
        return value, hash_code
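
The loop in find stops only at an empty cell or at a match, so it relies on an empty cell always being reachable; the automatic rehash at a load factor of 0.5 is what normally keeps that true. If that guarantee were ever lost (for example with a very high rehashing factor), one option is to cap the number of probes. The function below is a hypothetical variant on a plain list, assuming the hash is item mod table size; it is not part of the article's class.

```python
def find_bounded(table, item, probing):
    """Probe at most len(table) cells; return (value, index) or (None, None)."""
    size = len(table)
    home = item % size  # assumption: hash is item mod table size
    for i in range(size):
        index = (home + probing(i)) % size
        value = table[index]
        if value is None or value == item:
            return value, index
    return None, None  # probe budget exhausted, no empty cell was reached


full = [7, 1, 2, 3, 4, 5, 6]  # a completely full table of size 7
print(find_bounded(full, 3, lambda i: i))   # (3, 3)
print(find_bounded(full, 14, lambda i: i))  # (None, None)
```

With the unbounded loop, the second lookup would never terminate on this full table.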

Define the _insert method. The only difference from an ordinary insert is that a rehash is performed once the load factor exceeds the rehashing factor.

    def _insert(self, item):
        if item is None:
            return
        value, hash_code = self.find(item)
        if value is None:
            self._array[hash_code] = item
        if self.load_factor > self._rehashing_factor:
            self.rehashing()

Define a primality test that checks whether a value is prime, with time complexity O(sqrt(n)).

def is_prime(num):  # O(sqrt(n)) algorithm
    if num < 2:
        raise Exception('Invalid number.')
    if num == 2:
        return True
    for i in range(2, ceil(sqrt(num))+1):
        if num % i == 0:
            return False
    return True

Next, define the three probing functions: linear probing, quadratic probing, and double hashing.

def linear_probing(x, *args):
    return x


def square_probing(x, *args):
    return x**2


def double_hashing(x, item, size, *args):
    # R is the largest prime below the table size.
    r = size - 1
    while not is_prime(r):
        r -= 1
    return x * (r - (item % r))

Finally, define a test function and run it against each of the three probing functions.

def test(h):
    print('\nShow hash table:')
    h.show()

    print('\nInsert values:')
    h.insert(range(9))
    h.show()

    print('\nInsert value (existed):')
    h.insert(1)
    h.show()

    print('\nInsert value (collided):')
    h.insert(24, 47)
    h.show()

    print('\nFind value:')
    print(h.find(7))
    print('\nFind value (not existed):')
    print(h.find(77))

    print('\nLoad factor is:', h.load_factor)


if __name__ == '__main__':
    test(OpenAddressingHashing(11, kmt_hashing, linear_probing))
    print(30*'-')
    test(OpenAddressingHashing(11, kmt_hashing, square_probing))
    print(30*'-')
    test(OpenAddressingHashing(11, kmt_hashing, double_hashing))

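The test steps and the corresponding results for the three probing functions are as follows.

Creating the initial hash table

    print('\nShow hash table:')
    h.show()

All three give the same result: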

Show hash table:
[0] None
[1] None
[2] None
[3] None
[4] None
[5] None
[6] None
[7] None
[8] None
[9] None
[10] None

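Next, insert enough elements to push the load factor past one half. The table then rehashes automatically, growing from 11 to 23 cells: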

Insert values:
[0] 0
[1] 1
[2] 2
[3] 3
[4] 4
[5] 5
[6] 6
[7] 7
[8] 8
[9] None
[10] None
[11] None
[12] None
[13] None
[14] None
[15] None
[16] None
[17] None
[18] None
[19] None
[20] None
[21] None
[22] None

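Next, insert an element that already exists

    print('\nInsert value (existed):')
    h.insert(1)
    h.show()

The result is unchanged.

Then insert two elements that cause collisions. The three results are shown below; the different probing functions place the elements in different cells of the table.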

Linear_probing | Square_probing | Double_hashing 
[0] 0          | [0] 0          | [0] 0
[1] 1          | [1] 1          | [1] 1
[2] 2          | [2] 2          | [2] 2
[3] 3          | [3] 3          | [3] 3
[4] 4          | [4] 4          | [4] 4
[5] 5          | [5] 5          | [5] 5
[6] 6          | [6] 6          | [6] 6
[7] 7          | [7] 7          | [7] 7
[8] 8          | [8] 8          | [8] 8
[9] 24         | [9] None       | [9] None
[10] 47        | [10] 24        | [10] None
[11] None      | [11] None      | [11] 47
[12] None      | [12] None      | [12] None
[13] None      | [13] None      | [13] None
[14] None      | [14] None      | [14] None
[15] None      | [15] None      | [15] 24
[16] None      | [16] None      | [16] None
[17] None      | [17] 47        | [17] None
[18] None      | [18] None      | [18] None
[19] None      | [19] None      | [19] None
[20] None      | [20] None      | [20] None
[21] None      | [21] None      | [21] None
[22] None      | [22] None      | [22] None
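
The positions of 24 and 47 in each column can be reproduced by hand. The sketch below assumes that kmt_hashing behaves like item mod table size, which is consistent with the output shown here but is not confirmed by this article; place, double_step and largest_prime_below are names invented for this check.

```python
def largest_prime_below(n):
    """Largest prime strictly below n (assumes n > 2)."""
    candidate = n - 1
    while any(candidate % d == 0 for d in range(2, int(candidate ** 0.5) + 1)):
        candidate -= 1
    return candidate


def place(table, item, step):
    """Insert item and return its cell (assumes a free cell is reachable)."""
    size = len(table)
    home = item % size  # assumption: hash is item mod table size
    index, i = home, 0
    while table[index] is not None:
        i += 1
        index = (home + step(i, item, size)) % size
    table[index] = item
    return index


def double_step(i, item, size):
    r = largest_prime_below(size)  # 19 for a table of 23 cells
    return i * (r - item % r)


steps = {
    'linear': lambda i, item, size: i,
    'square': lambda i, item, size: i * i,
    'double': double_step,
}

positions = {}
for name, step in steps.items():
    table = list(range(9)) + [None] * 14  # 23 cells holding 0..8
    positions[name] = (place(table, 24, step), place(table, 47, step))
print(positions)
# {'linear': (9, 10), 'square': (10, 17), 'double': (15, 11)}
```

Both 24 and 47 have home cell 1 in a table of 23 cells, so every difference between the columns comes from the probing function alone.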

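Finally, test the find method and read the load factor. Because the probing functions differ, a search for a value that does not exist ends at a different cell in each case.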

--------------------------------------
Linear_probing  
--------------------------------------
Find value:
(7, 7)

Find value (not existed):
(None, 11)

Load factor is: 0.4782608695652174

--------------------------------------
Square_probing
--------------------------------------
Find value:
(7, 7)

Find value (not existed):
(None, 9)

Load factor is: 0.4782608695652174

--------------------------------------
Double_hashing 
--------------------------------------
Find value:
(7, 7)

Find value (not existed):
(None, 21)

Load factor is: 0.4782608695652174

Related reading

1. Hash tables

2. Separate chaining

Original article: https://www.cnblogs.com/stacklike/p/8298563.html