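While reading code today I took a closer look at hashing and went back over my notes from school. The two main points are the hash bucket count and the hash algorithm.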
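1. Hash bucket counts are currently always prime (a prime close to double the previous one)
Take a hash function
H( c ) = c % N;
When N is composite, the simplest example being a power of two 2^n, say 2^3 = 8, we get
H( 11100 (binary) ) = H( 28 ) = 4
H( 10100 (binary) ) = H( 20 ) = 4
Here the 4th bit of c (counting from the right) has no effect: whatever value it takes, H( c ) comes out the same. In fact c % 8 keeps only the low 3 bits, so the 4th bit and every bit above it never take part in computing H( c ). H( c ) then fails to reflect all of c, which raises the chance of collisions.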
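The effect is easy to check with a small C sketch. It counts how many distinct buckets a set of keys falls into under H( c ) = c % N. The helper name count_distinct_buckets and the 64-bucket limit are assumptions made for illustration, not something taken from the code I was reading.

```c
#include <stddef.h>

/* Hypothetical helper: count how many distinct buckets the keys occupy
 * when the bucket index is key % n. Only supports 1 <= n <= 64. */
unsigned count_distinct_buckets(const unsigned *keys, size_t count, unsigned n)
{
    unsigned char used[64] = {0};
    unsigned distinct = 0;
    size_t i;

    if (n == 0 || n > 64)
        return 0;

    for (i = 0; i < count; ++i) {
        unsigned bucket = keys[i] % n;
        if (!used[bucket]) {
            used[bucket] = 1;
            ++distinct;
        }
    }
    return distinct;
}
```

With the keys 4, 12, 20, 28, 36, 44, 52, 60 (all ending in binary 100 and differing only in the higher bits), N = 8 puts every key into bucket 4, while the prime N = 7 spreads them over all seven buckets.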
A table of such primes, most of them close to twice the previous entry:
7ul, 23ul, 53ul, 97ul, 193ul, 389ul, 769ul, 1543ul, 3079ul, 6151ul, 12289ul, 24593ul, 49157ul, 98317ul, 196613ul, 393241ul, 786433ul, 1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul, 50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul, 1610612741ul, 3221225473ul, 4294967291ul
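One plausible way to use such a table is to pick the smallest prime that is at least the number of buckets wanted. The sketch below does that with a binary search over the first ten entries only. The names prime_list and next_bucket_count are hypothetical, and falling back to the largest entry when the request exceeds the table is my own assumption.

```c
#include <stddef.h>

/* First ten entries of the prime table listed above. */
static const unsigned long prime_list[] = {
    7ul, 23ul, 53ul, 97ul, 193ul, 389ul, 769ul, 1543ul, 3079ul, 6151ul
};
static const size_t prime_count = sizeof prime_list / sizeof prime_list[0];

/* Hypothetical helper: return the smallest table entry >= wanted,
 * or the largest entry if wanted is beyond the end of the table. */
unsigned long next_bucket_count(unsigned long wanted)
{
    size_t lo = 0, hi = prime_count;

    /* Binary search for the first entry that is not less than wanted. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (prime_list[mid] < wanted)
            lo = mid + 1;
        else
            hi = mid;
    }
    return prime_list[lo < prime_count ? lo : prime_count - 1];
}
```

For example, growing a table that currently has 97 buckets would ask for 2 * 97 = 194 and get 389 back, the next prime in the list.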
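2. Hash algorithms:
1. DJB Hash, also known as the Times 33 hash
Times33 is very simple: it just keeps multiplying by 33, hash(i) = hash(i-1) * 33 + str[i]. Times33 does well on both speed and how evenly it spreads its output.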
unsigned int DJBHash(const char* str, unsigned int length)
{
    unsigned int hash = 5381;
    unsigned int i = 0;

    for (i = 0; i < length; ++str, ++i) {
        /* (hash << 5) + hash is hash * 32 + hash done with a shift, i.e. hash * 33. */
        hash = ((hash << 5) + hash) + (*str);
    }
    return hash;
}
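The BKDR Hash comes from Kernighan and Ritchie's "The C Programming Language". It uses a seed such as 31, 131, 1313, 13131, 131313, etc. as the multiplier when computing the hash. Why these particular values work well is a question for the mathematicians.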
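For comparison with DJBHash, here is a minimal sketch of a BKDR-style hash in C. It follows the commonly circulated form of the function. The seed 131, the starting value 0, and the cast to unsigned char are my assumptions, not details taken from the book.

```c
/* Sketch of a BKDR-style string hash: multiply the running hash by the
 * seed and add the next byte. */
unsigned int BKDRHash(const char *str, unsigned int length)
{
    unsigned int seed = 131; /* 31 131 1313 13131 131313 etc.. */
    unsigned int hash = 0;   /* assumed starting value */
    unsigned int i;

    for (i = 0; i < length; ++str, ++i) {
        /* Cast so that bytes >= 0x80 are added as positive values. */
        hash = (hash * seed) + (unsigned char)(*str);
    }
    return hash;
}
```

The structure is the same as Times33, only the multiplier and the initial value differ: for "ab" the result is 97 * 131 + 98 = 12805.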
MurmurHash: high performance, low collision rate
/*
 * The following hash function is based on MurmurHash64A(), placed into the
 * public domain by Austin Appleby.  See http://murmurhash.googlepages.com/ for
 * details.
 */
JEMALLOC_INLINE uint64_t
hash(const void *key, size_t len, uint64_t seed)
{
    const uint64_t m = UINT64_C(0xc6a4a7935bd1e995);
    const int r = 47;
    uint64_t h = seed ^ (len * m);
    const uint64_t *data = (const uint64_t *)key;
    const uint64_t *end = data + (len/8);
    const unsigned char *data2;

    assert(((uintptr_t)key & 0x7) == 0);

    while(data != end) {
        uint64_t k = *data++;

        k *= m;
        k ^= k >> r;
        k *= m;

        h ^= k;
        h *= m;
    }

    data2 = (const unsigned char *)data;
    /* Every case deliberately falls through to the next one. */
    switch(len & 7) {
    case 7: h ^= ((uint64_t)(data2[6])) << 48;
    case 6: h ^= ((uint64_t)(data2[5])) << 40;
    case 5: h ^= ((uint64_t)(data2[4])) << 32;
    case 4: h ^= ((uint64_t)(data2[3])) << 24;
    case 3: h ^= ((uint64_t)(data2[2])) << 16;
    case 2: h ^= ((uint64_t)(data2[1])) << 8;
    case 1: h ^= ((uint64_t)(data2[0]));
            h *= m;
    }

    h ^= h >> r;
    h *= m;
    h ^= h >> r;

    return (h);
}
Common hash algorithms include CRC-32, MD5, SHA-1, SM3, and the widely used SHA-2 family (SHA-224, SHA-256, SHA-384, SHA-512).