• The HyperLogLog Algorithm


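    What is HyperLogLog?

    • An algorithm for estimating cardinality (the number of distinct elements) over very large data sets. It uses very little memory and keeps the error small, but the result is an approximation rather than an exact count (in Kylin, a bitmap can be used when exact precision is required).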

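    Why should data practitioners understand it?

    • It is tied to some of our day-to-day work, and understanding the principle helps us do that work better.
      • Frameworks that use the HyperLogLog algorithm:
        • Redis
        • Apache Kylin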

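    Ways to understand it

    • There are two ways to understand it

      • In the ideal case, a set of values is hashed onto [0, 1] so that every pair of neighbouring points is the same distance d apart; the cardinality of the set is then 1/d.

        • In practice the points are rarely spaced this evenly, so the cardinality can only be approximated by bucketing and taking kmax (integration?).
        • Bucketing splits the data into m groups; each group takes the value at its k-th position, kmax is the largest of these values across all groups, and (k - 1)/kmax is the estimated cardinality.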
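      • Thinking of it as coin flipping

        • One trial consists of flipping a coin until tails first appears; tails is recorded as 1 and each heads is recorded as 0.

        • When the number of flips k is large, the probability that tails never appears is essentially 0.

        • Carried over to cardinality, the idea is: the number n of 0s before the first 1 can be used to estimate the cardinality.

        • When the cardinality is roughly 2^{n+1}, the coin-flip probabilities can be summed as:

          \[ \frac{1}{2}\cdot 1+\frac{1}{4}\cdot 2+\frac{1}{8}\cdot 3+\cdots \]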
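The coin-flip picture can be sketched in a few lines of Java. This is only an illustration of the intuition, not the estimator used by any library: the class and method names (`CoinFlipIntuition`, `zerosBeforeFirstOne`, `maxRun`, `crudeEstimate`) are made up for this example, and the inputs are assumed to be already-hashed 32-bit values.

```java
final class CoinFlipIntuition {

    /** Number of 0 bits before the first 1 bit of a 32-bit hash (heads before the first tails). */
    static int zerosBeforeFirstOne(int hash) {
        return Integer.numberOfLeadingZeros(hash);
    }

    /** Longest run of leading zeros observed over all hashed values. */
    static int maxRun(int[] hashes) {
        int max = 0;
        for (int h : hashes) {
            max = Math.max(max, zerosBeforeFirstOne(h));
        }
        return max;
    }

    /** Crude single-register guess: a longest run of n zeros suggests about 2^(n+1) distinct values. */
    static long crudeEstimate(int longestRun) {
        return 1L << (longestRun + 1);
    }

    public static void main(String[] args) {
        // Three hand-picked "hashes"; the second one starts with 15 zero bits.
        int[] hashes = {0x5A5A5A5A, 0x00012345, 0x7FFFFFFF};
        int n = maxRun(hashes);
        System.out.println("longest run = " + n + ", crude estimate = " + crudeEstimate(n));
    }
}
```

With only three values the sketch reports a longest run of 15 and a guess of 65536, which shows how noisy a single observation is. That variance is the reason the algorithm splits the data into many buckets and combines them.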

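    Algorithm pseudocode

    [Image: HyperLogLog pseudocode]

    • Summary of the flow:
      • Hash each element to a 32-bit value and find the position of its leftmost 1-bit
      • Initialize m registers, m ∈ [2^4, 2^16]
      • For each bucket, keep the largest leftmost-1 position (leading-zero count) seen so far
      • Compute the raw cardinality estimate and correct it according to its magnitude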
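The four steps above can be sketched as a small self-contained class. This is a minimal illustration under stated assumptions, not the Clearspring implementation: `MiniHyperLogLog` and its methods are hypothetical names, the `mix` function (the 32-bit MurmurHash3 finalizer) stands in for a real hash function, registers are stored one per byte for readability, and only the small-range (linear counting) correction is applied.

```java
final class MiniHyperLogLog {
    private final int b;            // number of index bits
    private final int m;            // number of registers, m = 2^b
    private final byte[] registers; // one register per bucket, initially 0

    MiniHyperLogLog(int b) {
        if (b < 4 || b > 16) {
            throw new IllegalArgumentException("b must be in [4, 16]");
        }
        this.b = b;
        this.m = 1 << b;
        this.registers = new byte[m];
    }

    /** Stand-in hash (assumption): the 32-bit MurmurHash3 finalizer, a bijection on ints. */
    static int mix(int x) {
        x ^= x >>> 16;
        x *= 0x85ebca6b;
        x ^= x >>> 13;
        x *= 0xc2b2ae35;
        x ^= x >>> 16;
        return x;
    }

    void offer(int value) {
        offerHashed(mix(value));
    }

    /** Top b bits pick the bucket; the rank is the position of the leftmost 1 in the remaining bits. */
    void offerHashed(int hash) {
        int j = hash >>> (32 - b);
        int rest = hash << b;
        int rank = Math.min(Integer.numberOfLeadingZeros(rest), 32 - b) + 1;
        if (rank > registers[j]) {
            registers[j] = (byte) rank;
        }
    }

    /** Bias constant alpha_m from the HyperLogLog paper. */
    private static double alpha(int m) {
        switch (m) {
            case 16: return 0.673;
            case 32: return 0.697;
            case 64: return 0.709;
            default: return 0.7213 / (1 + 1.079 / m);
        }
    }

    long cardinality() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += 1.0 / (1L << r);
            if (r == 0) {
                zeros++;
            }
        }
        double estimate = alpha(m) * m * m / sum; // harmonic-mean based raw estimate
        if (estimate <= 2.5 * m && zeros > 0) {
            // Small range: fall back to linear counting on the empty registers.
            return Math.round(m * Math.log((double) m / zeros));
        }
        return Math.round(estimate);
    }

    public static void main(String[] args) {
        MiniHyperLogLog hll = new MiniHyperLogLog(12);
        for (int i = 0; i < 100_000; i++) {
            hll.offer(i);
            hll.offer(i); // duplicates do not change the registers
        }
        System.out.println("estimated distinct values: " + hll.cardinality());
    }
}
```

Because every register only ever keeps a maximum, offering the same value again leaves the sketch unchanged, which is what makes the structure a distinct counter.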

    An open-source Java implementation of HyperLogLog

    /*
     * Copyright (C) 2012 Clearspring Technologies, Inc.
     *
     * Licensed under the Apache License, Version 2.0 (the "License");
     * you may not use this file except in compliance with the License.
     * You may obtain a copy of the License at
     *
     * http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    
    package com.clearspring.analytics.stream.cardinality;
    
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInput;
    import java.io.DataInputStream;
    import java.io.DataOutput;
    import java.io.DataOutputStream;
    import java.io.Externalizable;
    import java.io.IOException;
    import java.io.ObjectInput;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutput;
    import java.io.Serializable;
    
    import com.clearspring.analytics.hash.MurmurHash;
    import com.clearspring.analytics.util.Bits;
    import com.clearspring.analytics.util.IBuilder;
    
    /**
     * Java implementation of HyperLogLog (HLL) algorithm from this paper:
     * <p/>
     * http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
     * <p/>
     * HLL is an improved version of LogLog that is capable of estimating
     * the cardinality of a set with accuracy = 1.04/sqrt(m) where
     * m = 2^b.  So we can control accuracy vs space usage by increasing
     * or decreasing b.
     * <p/>
     * The main benefit of using HLL over LL is that it only requires 64%
     * of the space that LL does to get the same accuracy.
     * <p/>
     * This implementation implements a single counter.  If a large (millions)
     * number of counters are required you may want to refer to:
     * <p/>
     * http://dsiutils.di.unimi.it/
     * <p/>
     * It has a more complex implementation of HLL that supports multiple counters
     * in a single object, drastically reducing the java overhead from creating
     * a large number of objects.
     * <p/>
     * This implementation leveraged a javascript implementation that Yammer has
     * been working on:
     * <p/>
     * https://github.com/yammer/probablyjs
     * <p>
     * Note that this implementation does not include the long range correction function
     * defined in the original paper.  Empirical evidence shows that the correction
     * function causes more harm than good.
     * </p>
     * <p/>
     * <p>
     * Users have different motivations to use different types of hashing functions.
     * Rather than try to keep up with all available hash functions and to remove
     * the concern of causing future binary incompatibilities this class allows clients
     * to offer the value in hashed int or long form.
     * This way clients are free to change their hash function on their own time line.
     * We recommend using Google's Guava Murmur3_128 implementation as it provides good
     * performance and speed when high precision is required.
     * In our tests the 32bit MurmurHash function included in this project is faster and
     * produces better results than the 32 bit murmur3 implementation google provides.
     * </p>
     */
    public class HyperLogLog implements ICardinality, Serializable {
    
        // Register set (one small counter per bucket)
        private final RegisterSet registerSet;
        private final int log2m;
        private final double alphaMM;
    
    
        /**
         * Create a new HyperLogLog instance using the specified standard deviation.
         *
         * @param rsd - the relative standard deviation for the counter.
         *            smaller values create counters that require more space
         *            (a trade-off between precision and space).
         */
        public HyperLogLog(double rsd) {
            this(log2m(rsd));
        }
    
        private static int log2m(double rsd) {
            return (int) (Math.log((1.106 / rsd) * (1.106 / rsd)) / Math.log(2));
        }
    
        private static double rsd(int log2m) {
            return 1.106 / Math.sqrt(Math.exp(log2m * Math.log(2)));
        }
    
        private static double logBase(double exponent, double base) {
            return Math.log(exponent) / Math.log(base);
        }
    
        private static int accuracyToLog2m(double accuracy) {
            return Math.toIntExact(2 * Math.round(logBase(1.04 / (1 - accuracy), 2)));
        }
    
        private static void validateLog2m(int log2m) {
            if (log2m < 0 || log2m > 30) {
                throw new IllegalArgumentException("log2m argument is "
                                                   + log2m + " and is outside the range [0, 30]");
            }
        }
    
        /**
         * Create a new HyperLogLog instance.  The log2m parameter defines the accuracy
         * of the counter.  The larger the log2m the better the accuracy.<p/>
         * accuracy = 1 - 1.04/sqrt(2^log2m)
         *
         * @param log2m - the number of bits to use as the basis for the HLL instance
         */
        public HyperLogLog(int log2m) {
            this(log2m, new RegisterSet(1 << log2m));
        }
    
        /**
         * Creates a new HyperLogLog instance using the given registers (deprecated).
         * Used for unmarshalling a serialized
         * instance and for merging multiple counters together.
         *
         * @param registerSet - the initial values for the register set
         */
        @Deprecated
        public HyperLogLog(int log2m, RegisterSet registerSet) {
            validateLog2m(log2m);
            this.registerSet = registerSet;
            this.log2m = log2m;
            int m = 1 << this.log2m;
    
            alphaMM = getAlphaMM(log2m, m);
        }
    
        @Override
        public boolean offerHashed(long hashedValue) {
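            // j becomes the binary address determined by the first log2m bits of x.
            // >>> is the unsigned right shift: the vacated high bits are always filled
            // with 0, whether hashedValue is positive or negative, so j is simply the
            // top log2m bits of the hash read as a non-negative integer. In terms of
            // the signed shift >> this is:
            /*
                if (hashedValue >= 0) {
                    j = hashedValue >> (Long.SIZE - log2m)              // = hashedValue / 2^(Long.SIZE - log2m)
                } else {
                    j = (hashedValue >> (Long.SIZE - log2m)) + 2^log2m  // the sign bit counts as an ordinary bit
                }
            */
            // j will be in the range [0, 2^log2m)
            // r is the rank computed from the remaining bits; updateIfGreater compares r with
            // the value stored in bucket j and overwrites the bucket only if r is larger.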
            final int j = (int) (hashedValue >>> (Long.SIZE - log2m));
            final int r = Long.numberOfLeadingZeros((hashedValue << this.log2m) | (1 << (this.log2m - 1)) + 1) + 1;
            return registerSet.updateIfGreater(j, r);
        }
    
        @Override
        public boolean offerHashed(int hashedValue) {
            // j becomes the binary address determined by the first b log2m of x
            // j will be between 0 and 2^log2m
            final int j = hashedValue >>> (Integer.SIZE - log2m);
            final int r = Integer.numberOfLeadingZeros((hashedValue << this.log2m) | (1 << (this.log2m - 1)) + 1) + 1;
            return registerSet.updateIfGreater(j, r);
        }
    
        @Override
        public boolean offer(Object o) {
            final int x = MurmurHash.hash(o);
            return offerHashed(x);
        }
    
    
        @Override
        public long cardinality() {
            double registerSum = 0;
            int count = registerSet.count;
            double zeros = 0.0;
            for (int j = 0; j < registerSet.count; j++) {
                int val = registerSet.get(j);
                registerSum += 1.0 / (1 << val);
                if (val == 0) {
                    zeros++;
                }
            }
    
            double estimate = alphaMM * (1 / registerSum);
    
            if (estimate <= (5.0 / 2.0) * count) {
                // Small Range Estimate
                return Math.round(linearCounting(count, zeros));
            } else {
                return Math.round(estimate);
            }
        }
    
        @Override
        public int sizeof() {
            return registerSet.size * 4;
        }
    
        @Override
        public byte[] getBytes() throws IOException {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            DataOutput dos = new DataOutputStream(baos);
            writeBytes(dos);
            baos.close();
    
            return baos.toByteArray();
        }
    
        private void writeBytes(DataOutput serializedByteStream) throws IOException {
            serializedByteStream.writeInt(log2m);
            serializedByteStream.writeInt(registerSet.size * 4);
            for (int x : registerSet.readOnlyBits()) {
                serializedByteStream.writeInt(x);
            }
        }
    
        /**
         * Add all the elements of the other set to this set.
         * <p/>
         * This operation does not imply a loss of precision.
         *
         * @param other A compatible Hyperloglog instance (same log2m)
         * @throws CardinalityMergeException if other is not compatible
         */
        public void addAll(HyperLogLog other) throws CardinalityMergeException {
            if (this.sizeof() != other.sizeof()) {
                throw new HyperLogLogMergeException("Cannot merge estimators of different sizes");
            }
    
            registerSet.merge(other.registerSet);
        }
    
        @Override
        public ICardinality merge(ICardinality... estimators) throws CardinalityMergeException {
            HyperLogLog merged = new HyperLogLog(log2m, new RegisterSet(this.registerSet.count));
            merged.addAll(this);
    
            if (estimators == null) {
                return merged;
            }
    
            for (ICardinality estimator : estimators) {
                if (!(estimator instanceof HyperLogLog)) {
                    throw new HyperLogLogMergeException("Cannot merge estimators of different class");
                }
                HyperLogLog hll = (HyperLogLog) estimator;
                merged.addAll(hll);
            }
    
            return merged;
        }
    
        private Object writeReplace() {
            return new SerializationHolder(this);
        }
    
        /**
         * This class exists to support Externalizable semantics for
         * HyperLogLog objects without having to expose a public
         * constructor, public write/read methods, or pretend final
         * fields aren't final.
         *
         * In short, Externalizable allows you to skip some of the more
         * verbose meta-data default Serializable gets you, but still
         * includes the class name. In that sense, there is some cost
         * to this holder object because it has a longer class name. I
         * imagine people who care about optimizing for that have their
         * own work-around for long class names in general, or just use
         * a custom serialization framework. Therefore we make no attempt
         * to optimize that here (eg. by raising this from an inner class
         * and giving it an unhelpful name).
         */
        private static class SerializationHolder implements Externalizable {
    
            HyperLogLog hyperLogLogHolder;
    
            public SerializationHolder(HyperLogLog hyperLogLogHolder) {
                this.hyperLogLogHolder = hyperLogLogHolder;
            }
    
            /**
             * required for Externalizable
             * (deserialization instantiates the class through a public no-arg constructor)
             */
            public SerializationHolder() {
    
            }
    
            @Override
            public void writeExternal(ObjectOutput out) throws IOException {
                hyperLogLogHolder.writeBytes(out);
            }
    
            @Override
            public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
                hyperLogLogHolder = Builder.build(in);
            }
    
            private Object readResolve() {
                return hyperLogLogHolder;
            }
        }
    
        public static class Builder implements IBuilder<ICardinality>, Serializable {
            private static final long serialVersionUID = -2567898469253021883L;
    
            private final double rsd;
            private transient int log2m;
    
            /**
             * Uses the given RSD percentage to determine how many bytes the constructed HyperLogLog will use.
             * @deprecated Use {@link #withRsd(double)} instead. This builder's constructors did not match the (already
             * themselves ambiguous) constructors of the HyperLogLog class, but there is no way to make them match without
             * risking behavior changes downstream.
             */
            @Deprecated
            public Builder(double rsd) {
                this.log2m = log2m(rsd);
                validateLog2m(log2m);
                this.rsd = rsd;
            }
    
            /** This constructor is private to prevent behavior change for ambiguous usages. (Legacy support). */
            private Builder(int log2m) {
                this.log2m = log2m;
                validateLog2m(log2m);
                this.rsd = rsd(log2m);
            }
    
            private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
                in.defaultReadObject();
                this.log2m = log2m(rsd);
            }
    
            @Override
            public HyperLogLog build() {
                return new HyperLogLog(log2m);
            }
    
            @Override
            public int sizeof() {
                int k = 1 << log2m;
                return RegisterSet.getBits(k) * 4;
            }
    
            public static Builder withLog2m(int log2m) {
                return new Builder(log2m);
            }
    
            public static Builder withRsd(double rsd) {
                return new Builder(rsd);
            }
    
            public static Builder withAccuracy(double accuracy) { return new Builder(accuracyToLog2m(accuracy)); }
    
            public static HyperLogLog build(byte[] bytes) throws IOException {
                ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
                return build(new DataInputStream(bais));
            }
    
            public static HyperLogLog build(DataInput serializedByteStream) throws IOException {
                int log2m = serializedByteStream.readInt();
                int byteArraySize = serializedByteStream.readInt();
                return new HyperLogLog(log2m,
                        new RegisterSet(1 << log2m, Bits.getBits(serializedByteStream, byteArraySize)));
            }
        }
    
        @SuppressWarnings("serial")
        protected static class HyperLogLogMergeException extends CardinalityMergeException {
    
            public HyperLogLogMergeException(String message) {
                super(message);
            }
        }
    
        protected static double getAlphaMM(final int p, final int m) {
            // See the paper.
            switch (p) {
                case 4:
                    return 0.673 * m * m;
                case 5:
                    return 0.697 * m * m;
                case 6:
                    return 0.709 * m * m;
                default:
                    return (0.7213 / (1 + 1.079 / m)) * m * m;
            }
        }
    
        protected static double linearCounting(int m, double V) {
            return m * Math.log(m / V);
        }
    }
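To make the precision versus space trade-off in the listing concrete, the sketch below re-implements its two private sizing helpers, `log2m(rsd)` and `rsd(log2m)`, as standalone functions. The class name `HllSizing` and the method names are made up for this example; the constant 1.106 is copied from the listing, which differs from the 1.04/sqrt(m) figure quoted in its Javadoc.

```java
final class HllSizing {

    /** Same arithmetic as log2m(rsd) in the listing: the result is truncated towards zero. */
    static int log2mForRsd(double rsd) {
        return (int) (Math.log((1.106 / rsd) * (1.106 / rsd)) / Math.log(2));
    }

    /** Inverse direction, as in rsd(log2m) in the listing: rsd = 1.106 / sqrt(2^log2m). */
    static double rsdForLog2m(int log2m) {
        return 1.106 / Math.sqrt(1 << log2m);
    }

    /** Number of registers m = 2^log2m. */
    static int registers(int log2m) {
        return 1 << log2m;
    }

    public static void main(String[] args) {
        for (double requested : new double[] {0.05, 0.02, 0.01}) {
            int log2m = log2mForRsd(requested);
            System.out.printf("requested rsd=%.3f -> log2m=%d, m=%d, achieved rsd=%.4f%n",
                    requested, log2m, registers(log2m), rsdForLog2m(log2m));
        }
    }
}
```

Because of the truncation, a requested rsd of 0.01 maps to log2m = 13 (8192 registers), whose rsd by the same formula is about 0.0122, slightly worse than requested. Callers who need a hard bound may prefer to choose log2m explicitly.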
    
    
  • Original article: https://www.cnblogs.com/ronnieyuan/p/13776894.html