• Bloom Filter js version All In One


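    Bloom filter, JavaScript version

    Bloom filter

    Purpose: quickly filter a massive data set and determine whether a given item might be present in it.

    Advantage: it stores its state as binary bits, so lookups are very fast. ✅

    Disadvantage: a "not present" answer is 100% accurate, but a "present" answer is unreliable and may be a false positive. ❌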
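The asymmetry between the two answers can be seen in a small sketch. This is an illustrative toy, not the code of either library shown later: the names `fnv1a` and `TinyBloomFilter` are hypothetical, and seeding FNV-1a to obtain several hash functions is an assumption made here for brevity.

```javascript
// 32-bit FNV-1a over a string. XOR-ing `seed` into the offset basis is a
// shortcut assumed here to get several hash functions from one routine.
function fnv1a(str, seed) {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  }
  return h;
}

class TinyBloomFilter {
  constructor(bitCount, hashCount) {
    this.bitCount = bitCount;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(Math.ceil(bitCount / 8));
  }

  // One bit position per hash function.
  indexes(value) {
    const out = [];
    for (let i = 0; i < this.hashCount; i++) {
      out.push(fnv1a(String(value), i) % this.bitCount);
    }
    return out;
  }

  add(value) {
    for (const idx of this.indexes(value)) {
      this.bits[idx >> 3] |= 1 << (idx & 7);
    }
  }

  // false means "definitely not added"; true means only "possibly added".
  has(value) {
    return this.indexes(value).every(
      idx => (this.bits[idx >> 3] & (1 << (idx & 7))) !== 0
    );
  }
}

// Every added item is found again: there are no false negatives.
const filter = new TinyBloomFilter(1024, 3);
const words = ['alice', 'bob', 'carl'];
words.forEach(w => filter.add(w));
const allFound = words.every(w => filter.has(w));

// A deliberately undersized filter (8 bits, 1 hash). After these eight inserts
// every bit is set, so any query is reported as "possibly present".
const saturated = new TinyBloomFilter(8, 1);
for (let i = 0; i < 8; i++) saturated.add('item' + i);
const falsePositive = saturated.has('never-added'); // true, although never added
```

The second filter is sized badly on purpose; with a sensible bit count the same query would most likely return false.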

    A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.
    False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set".
    Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant);
    the more items added, the larger the probability of false positives.
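The counting variant mentioned above can be sketched by replacing each bit with a small counter, so that removal decrements what insertion incremented. The names `hashWithSeed` and `CountingBloomFilter` are hypothetical, and the 8-bit counters that stop at 255 are a simplification assumed for this example.

```javascript
// Seeded 32-bit FNV-1a (the seeding scheme is an assumption of this sketch).
function hashWithSeed(str, seed) {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

class CountingBloomFilter {
  constructor(slotCount, hashCount) {
    this.slotCount = slotCount;
    this.hashCount = hashCount;
    this.counters = new Uint8Array(slotCount); // one small counter per slot
  }

  slots(value) {
    const out = [];
    for (let i = 0; i < this.hashCount; i++) {
      out.push(hashWithSeed(String(value), i) % this.slotCount);
    }
    return out;
  }

  add(value) {
    for (const s of this.slots(value)) {
      // Stop at 255 instead of wrapping around; a saturated counter can no
      // longer be decremented safely, which a real implementation must handle.
      if (this.counters[s] < 255) this.counters[s]++;
    }
  }

  has(value) {
    return this.slots(value).every(s => this.counters[s] > 0);
  }

  // Returns false when the value was definitely never added.
  remove(value) {
    if (!this.has(value)) return false;
    for (const s of this.slots(value)) this.counters[s]--;
    return true;
  }
}

const counting = new CountingBloomFilter(256, 3);
counting.add('foo');
const afterAdd = counting.has('foo');     // true
const removed = counting.remove('foo');   // true
const afterRemove = counting.has('foo');  // false: all counters are back to 0
```

Removing a value that only appears present because of a false positive would corrupt the counters, so removal should be limited to values known to have been added.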

    https://en.wikipedia.org/wiki/Bloom_filter

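    Use cases

    These use cases share one requirement: how do you check whether a single item exists in a massive data set?

    • In word-processing software, checking whether an English word is spelled correctly;
    • At the FBI, checking whether a suspect's name is already on the suspect list;
    • In a web crawler, checking whether a URL has already been visited;
    • Spam filtering in mail services such as Gmail.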

    demos

    https://www.npmjs.com/package/bloomfilter

    https://github.com/jasondavies/bloomfilter.js/blob/master/bloomfilter.js

    // Bloom Filter 
    (function(exports) {
      exports.BloomFilter = BloomFilter;
      exports.fnv_1a = fnv_1a;
    
      var typedArrays = typeof ArrayBuffer !== "undefined";
    
      // Creates a new bloom filter.  If *m* is an array-like object, with a length
      // property, then the bloom filter is loaded with data from the array, where
      // each element is a 32-bit integer.  Otherwise, *m* should specify the
      // number of bits.  Note that *m* is rounded up to the nearest multiple of
      // 32.  *k* specifies the number of hashing functions.
      function BloomFilter(m, k) {
        var a;
        if (typeof m !== "number") a = m, m = a.length * 32;
    
        var n = Math.ceil(m / 32),
            i = -1;
        this.m = m = n * 32;
        this.k = k;
    
        if (typedArrays) {
          var kbytes = 1 << Math.ceil(Math.log(Math.ceil(Math.log(m) / Math.LN2 / 8)) / Math.LN2),
              array = kbytes === 1 ? Uint8Array : kbytes === 2 ? Uint16Array : Uint32Array,
              kbuffer = new ArrayBuffer(kbytes * k),
              buckets = this.buckets = new Int32Array(n);
          if (a) while (++i < n) buckets[i] = a[i];
          this._locations = new array(kbuffer);
        } else {
          var buckets = this.buckets = [];
          if (a) while (++i < n) buckets[i] = a[i];
          else while (++i < n) buckets[i] = 0;
          this._locations = [];
        }
      }
    
      // See http://willwhim.wpengine.com/2011/09/03/producing-n-hash-functions-by-hashing-only-once/
      BloomFilter.prototype.locations = function(v) {
        var k = this.k,
            m = this.m,
            r = this._locations,
            a = fnv_1a(v),
            b = fnv_1a(v, 1576284489), // The seed value is chosen randomly
            x = a % m;
        for (var i = 0; i < k; ++i) {
          r[i] = x < 0 ? (x + m) : x;
          x = (x + b) % m;
        }
        return r;
      };
    
      BloomFilter.prototype.add = function(v) {
        var l = this.locations(v + ""),
            k = this.k,
            buckets = this.buckets;
        for (var i = 0; i < k; ++i) buckets[Math.floor(l[i] / 32)] |= 1 << (l[i] % 32);
      };
    
      BloomFilter.prototype.test = function(v) {
        var l = this.locations(v + ""),
            k = this.k,
            buckets = this.buckets;
        for (var i = 0; i < k; ++i) {
          var b = l[i];
          if ((buckets[Math.floor(b / 32)] & (1 << (b % 32))) === 0) {
            return false;
          }
        }
        return true;
      };
    
      // Estimated cardinality.
      BloomFilter.prototype.size = function() {
        var buckets = this.buckets,
            bits = 0;
        for (var i = 0, n = buckets.length; i < n; ++i) bits += popcnt(buckets[i]);
        return -this.m * Math.log(1 - bits / this.m) / this.k;
      };
    
      // http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
      function popcnt(v) {
        v -= (v >> 1) & 0x55555555;
        v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
        return ((v + (v >> 4) & 0xf0f0f0f) * 0x1010101) >> 24;
      }
    
      // Fowler/Noll/Vo hashing.
      // Nonstandard variation: this function optionally takes a seed value that is incorporated
      // into the offset basis. According to http://www.isthe.com/chongo/tech/comp/fnv/index.html
      // "almost any offset_basis will serve so long as it is non-zero".
      function fnv_1a(v, seed) {
        var a = 2166136261 ^ (seed || 0);
        for (var i = 0, n = v.length; i < n; ++i) {
          var c = v.charCodeAt(i),
              d = c & 0xff00;
          if (d) a = fnv_multiply(a ^ d >> 8);
          a = fnv_multiply(a ^ c & 0xff);
        }
        return fnv_mix(a);
      }
    
      // a * 16777619 mod 2**32
      function fnv_multiply(a) {
        return a + (a << 1) + (a << 4) + (a << 7) + (a << 8) + (a << 24);
      }
    
      // See https://web.archive.org/web/20131019013225/http://home.comcast.net/~bretm/hash/6.html
      function fnv_mix(a) {
        a += a << 13;
        a ^= a >>> 7;
        a += a << 3;
        a ^= a >>> 17;
        a += a << 5;
        return a & 0xffffffff;
      }
    })(typeof exports !== "undefined" ? exports : this);
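Two ideas in the listing above may be easier to follow in isolation: `locations` derives all k bit positions from only two base hashes (index i is a + i * b modulo m), and `size` estimates how many distinct items were added from the number of set bits. The following is a minimal restatement of those two steps; the function names `doubleHashIndexes` and `estimateCardinality` are hypothetical, and the base hashes are passed in as plain integers rather than computed with FNV.

```javascript
// Derive k bit positions from two base hashes a and b:
//   index_i = (a + i * b) mod m, normalised to be non-negative,
// because % in JavaScript keeps the sign of the dividend.
function doubleHashIndexes(a, b, k, m) {
  const out = [];
  let x = a % m;
  for (let i = 0; i < k; i++) {
    out.push(x < 0 ? x + m : x);
    x = (x + b) % m;
  }
  return out;
}

// Estimate the number of distinct items from the count of set bits:
//   n ≈ -(m / k) * ln(1 - setBits / m)
function estimateCardinality(setBits, m, k) {
  return -(m * Math.log(1 - setBits / m)) / k;
}

const positive = doubleHashIndexes(7, 3, 4, 32);  // [7, 10, 13, 16]
const negative = doubleHashIndexes(-5, 9, 3, 32); // [27, 4, 13]

// One item with 3 hashes sets about 3 bits, so the estimate is close to 1.
const estimate = estimateCardinality(3, 1024, 3);
```

Hashing twice instead of k times keeps `add` and `test` cheap when k grows.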
    

    https://www.npmjs.com/package/bloom-filters

    https://github.com/Callidon/bloom-filters

    https://github.com/Callidon/bloom-filters/blob/master/src/bloom/bloom-filter.ts

    // Bloom Filter 
    
    /* file : bloom-filter.ts
    MIT License
    
    Copyright (c) 2017 Thomas Minier & Arnaud Grall
    
    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:
    
    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.
    
    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.
    */
    
    import ClassicFilter from '../interfaces/classic-filter'
    import BaseFilter from '../base-filter'
    import BitSet from './bit-set'
    import {AutoExportable, Field, Parameter} from '../exportable'
    import {optimalFilterSize, optimalHashes} from '../formulas'
    import {HashableInput} from '../utils'
    
    /**
     * A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970,
     * that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.
     *
     * Reference: Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426.
     * @see {@link http://crystal.uta.edu/~mcguigan/cse6350/papers/Bloom.pdf} for more details about classic Bloom Filters.
     * @author Thomas Minier
     * @author Arnaud Grall
     */
    @AutoExportable<BloomFilter>('BloomFilter', ['_seed'])
    export default class BloomFilter
      extends BaseFilter
      implements ClassicFilter<HashableInput>
    {
      @Field()
      public _size: number
    
      @Field()
      public _nbHashes: number
    
      @Field<BitSet>(
        f => f.export(),
        data => {
          // create the bitset from new and old array-based exported structure
          if (Array.isArray(data)) {
            const bs = new BitSet(data.length)
            data.forEach((val: number, index: number) => {
              if (val !== 0) {
                bs.add(index)
              }
            })
            return bs
          } else {
            return BitSet.import(data as {size: number; content: string})
          }
        }
      )
      public _filter: BitSet
    
      /**
       * Constructor
       * @param size - The number of cells
       * @param nbHashes - The number of hash functions used
       */
      constructor(
        @Parameter('_size') size: number,
        @Parameter('_nbHashes') nbHashes: number
      ) {
        super()
        if (nbHashes < 1) {
          throw new Error(
            `A BloomFilter cannot uses less than one hash function, while you tried to use ${nbHashes}.`
          )
        }
        this._size = size
        this._nbHashes = nbHashes
        this._filter = new BitSet(size)
      }
    
      /**
       * Create an optimal bloom filter providing the maximum of elements stored and the error rate desired
       * @param  nbItems      - The maximum number of item to store
       * @param  errorRate  - The error rate desired for a maximum of items inserted
       * @return A new {@link BloomFilter}
       */
      public static create(nbItems: number, errorRate: number): BloomFilter {
        const size = optimalFilterSize(nbItems, errorRate)
        const hashes = optimalHashes(size, nbItems)
        return new this(size, hashes)
      }
    
      /**
       * Build a new Bloom Filter from an existing iterable with a fixed error rate
       * @param items - The iterable used to populate the filter
       * @param errorRate - The error rate, i.e. 'false positive' rate, targeted by the filter
       * @param seed - The random number seed (optional)
       * @return A new Bloom Filter filled with the iterable's elements
       * @example
       * ```js
       * // create a filter with a false positive rate of 0.1
       * const filter = BloomFilter.from(['alice', 'bob', 'carl'], 0.1);
       * ```
       */
      public static from(
        items: Iterable<HashableInput>,
        errorRate: number,
        seed?: number
      ): BloomFilter {
        const array = Array.from(items)
        const filter = BloomFilter.create(array.length, errorRate)
        if (typeof seed === 'number') {
          filter.seed = seed
        }
        array.forEach(element => filter.add(element))
        return filter
      }
    
      /**
       * Get the optimal size of the filter
       * @return The size of the filter
       */
      get size(): number {
        return this._size
      }
    
      /**
       * Get the number of bits currently set in the filter
       * @return The filter length
       */
      public get length(): number {
        return this._filter.bitCount()
      }
    
      /**
       * Add an element to the filter
       * @param element - The element to add
       * @example
       * ```js
       * const filter = BloomFilter.create(15, 0.1);
       * filter.add('foo');
       * ```
       */
      public add(element: HashableInput): void {
        const indexes = this._hashing.getIndexes(
          element,
          this._size,
          this._nbHashes,
          this.seed
        )
        for (let i = 0; i < indexes.length; i++) {
          this._filter.add(indexes[i])
        }
      }
    
      /**
       * Test an element for membership
       * @param element - The element to look for in the filter
       * @return False if the element is definitely not in the filter, True if the element might be in the filter
       * @example
       * ```js
       * const filter = BloomFilter.create(15, 0.1);
       * filter.add('foo');
       * console.log(filter.has('foo')); // output: true
       * console.log(filter.has('bar')); // output: false
       * ```
       */
      public has(element: HashableInput): boolean {
        const indexes = this._hashing.getIndexes(
          element,
          this._size,
          this._nbHashes,
          this.seed
        )
        for (let i = 0; i < indexes.length; i++) {
          if (!this._filter.has(indexes[i])) {
            return false
          }
        }
        return true
      }
    
      /**
       * Get the current false positive rate (or error rate) of the filter
       * @return The current false positive rate of the filter
       * @example
       * ```js
       * const filter = new BloomFilter(15, 0.1);
       * console.log(filter.rate()); // output: something around 0.1
       * ```
       */
      public rate(): number {
        return Math.pow(1 - Math.exp(-this.length / this._size), this._nbHashes)
      }
    
      /**
       * Check if another Bloom Filter is equal to this one
       * @param  other - The filter to compare to this one
       * @return True if they are equal, false otherwise
       */
      public equals(other: BloomFilter): boolean {
        if (this._size !== other._size || this._nbHashes !== other._nbHashes) {
          return false
        }
        return this._filter.equals(other._filter)
      }
    }
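The `create` method above relies on `optimalFilterSize` and `optimalHashes`, whose bodies are not part of this listing. As a rough guide, the textbook sizing formulas can be sketched as follows. This is an assumption about what such helpers typically compute, not a copy of the library's `formulas` module, and the names `optimalBits`, `optimalHashCount` and `expectedFalsePositiveRate` are hypothetical.

```javascript
// Bits needed to hold n items at target false positive rate p:
//   m = ceil(-n * ln(p) / (ln 2)^2)
function optimalBits(n, p) {
  return Math.ceil(-(n * Math.log(p)) / (Math.LN2 * Math.LN2));
}

// Number of hash functions that minimises the error for m bits and n items:
//   k = ceil((m / n) * ln 2)
function optimalHashCount(m, n) {
  return Math.ceil((m / n) * Math.LN2);
}

// Expected false positive rate once n items have been inserted:
//   (1 - e^(-k * n / m))^k
function expectedFalsePositiveRate(m, k, n) {
  return Math.pow(1 - Math.exp(-(k * n) / m), k);
}

const m = optimalBits(1000, 0.01);                  // 9586 bits
const k = optimalHashCount(m, 1000);                // 7 hash functions
const rate = expectedFalsePositiveRate(m, k, 1000); // close to 0.01
```

Note that this textbook rate uses the number of inserted items, whereas the `rate()` method in the listing plugs in the number of set bits (`this.length`), so the two need not produce the same value.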
    

    refs

    https://www.cnblogs.com/xgqfrms/p/13490357.html



    ©xgqfrms 2012-2020

    Original article. Copyright ©️xgqfrms. Reproduction prohibited; infringement will be pursued. ⚠️


  • Original article: https://www.cnblogs.com/xgqfrms/p/16355146.html