• hive随机采样


    hive> select * from account limit 10;
    OK
    account.accountname     account.accid   account.platid  account.dateid  account.createtime
    1004210 1004210 6       20180116        2018-01-16 10:39:50.0
    20946754        20946754        0       20170913        2017-09-13 10:02:37.0
    20946766        20946766        0       20170901        2017-09-01 16:51:30.0
    20946793        20946793        0       20171117        2017-11-17 16:51:07.0
    20946796        20946796        0       20180110        2018-01-10 13:30:46.0
    20946962        20946962        0       20171219        2017-12-19 15:43:14.0
    20957641        20957641        0       20171117        2017-11-17 17:44:58.0
    20957642        20957642        0       20171220        2017-12-20 15:32:21.0
    20963649        20963649        6       20171220        2017-12-20 10:13:57.0
    20963674        20963674        33      20171219        2017-12-19 22:59:39.0

    只通过limit河南保证数据随机的返回,通常情况下,此种方式会按文件的顺序返回,没到达随机采样的目的。


    scala> hivcon.sql("select * from gamedw.account order by rand() limit 10").show
    +-----------+--------+------+--------+--------------------+                    
    |accountname|   accid|platid|  dateid|          createtime|
    +-----------+--------+------+--------+--------------------+
    |   22780583|22780583|    18|20180117|2018-01-17 07:27:...|
    |   22777237|22777237|     6|20180116|2018-01-16 20:52:...|
    |   22868751|22868751|     1|20180208|2018-02-08 10:20:...|
    |   22993835|22993835|    83|20180327|2018-03-27 19:48:...|
    |   22779551|22779551|     9|20180117|2018-01-17 00:33:...|
    |   22796970|22796970|     1|20180119|2018-01-19 17:53:...|
    |   22713995|22713995|    33|20171227|2017-12-27 21:05:...|
    |   22811076|22811076|     6|20180123|2018-01-23 13:16:...|
    |   22715294|22715294|    15|20171228|2017-12-28 01:59:...|
    |   22843107|22843107|    33|20180129|2018-01-29 20:13:...|
    +-----------+--------+------+--------+--------------------+

    scala> hivcon.sql("select * from gamedw.account order by rand() limit 10").show
    +-----------+--------+------+--------+--------------------+                    
    |accountname|   accid|platid|  dateid|          createtime|
    +-----------+--------+------+--------+--------------------+
    |   22809368|22809368|    35|20180122|2018-01-22 20:55:...|
    |   22810321|22810321|    18|20180123|2018-01-23 03:42:...|
    |   22850089|22850089|    15|20180201|2018-02-01 07:16:...|
    |   23040545|23040545|    27|20180415|2018-04-15 10:45:...|
    |   21095611|21095611|     3|20180118|2018-01-18 19:02:...|
    |   22787794|22787794|     9|20180117|2018-01-17 22:10:...|
    |   22701928|22701928|     6|20171226|2017-12-26 14:42:...|
    |   23123308|23123308|     6|20180526|2018-05-26 02:02:...|
    |   23037031|23037031|    27|20180413|2018-04-13 23:52:...|
    |   22798775|22798775|    27|20180120|2018-01-20 00:32:...|
    +-----------+--------+------+--------+--------------------+

    使用rand()确实可以实现真正意义的随机,但性能不高。为了实现总排序,Hive必须将所有数据强制传输到单个reducer。该reducer将对整个数据集进行排序。这很不好。

    scala> hivcon.sql("select * from gamedw.account sort by rand() limit 10").show
    +-----------+--------+------+--------+--------------------+
    |accountname|   accid|platid|  dateid|          createtime|
    +-----------+--------+------+--------+--------------------+
    |   22708141|22708141|    15|20171226|2017-12-26 22:13:...|
    |   22780855|22780855|    15|20180117|2018-01-17 08:25:...|
    |   22781579|22781579|    35|20180117|2018-01-17 09:55:...|
    |   22775456|22775456|     1|20180116|2018-01-16 18:35:...|
    |   22705922|22705922|    18|20171226|2017-12-26 20:18:...|
    |   22773860|22773860|   138|20180116|2018-01-16 16:09:...|
    |   22771885|22771885|     4|20180116|2018-01-16 14:05:...|
    |   22633182|22633182|     0|20171202|2017-12-02 04:07:...|
    |   22228137|22228137|     0|20170828|2017-08-28 17:55:...|
    |   22593212|22593212|     0|20171121|2017-11-21 18:38:...|
    +-----------+--------+------+--------+--------------------+

    scala> hivcon.sql("select * from gamedw.account sort by rand() limit 10").show
    +-----------+--------+------+--------+--------------------+
    |accountname|   accid|platid|  dateid|          createtime|
    +-----------+--------+------+--------+--------------------+
    |   22703713|22703713|    15|20171226|2017-12-26 18:02:...|
    |   22766790|22766790|     4|20180116|2018-01-16 10:11:...|
    |   22728340|22728340|    27|20171231|2017-12-31 19:59:...|
    |   22770064|22770064|     1|20180116|2018-01-16 12:13:...|
    |   22705604|22705604|    15|20171226|2017-12-26 19:58:...|
    |   22771835|22771835|    21|20180116|2018-01-16 14:01:...|
    |   22720562|22720562|    33|20171229|2017-12-29 11:56:...|
    |   22779785|22779785|    83|20180117|2018-01-17 01:36:...|
    |   22705319|22705319|    33|20171226|2017-12-26 19:46:...|
    |   22717315|22717315|    27|20171228|2017-12-28 17:04:...|
    +-----------+--------+------+--------+--------------------+ 

    Hive有一个非标准SQL“sort by”子句,它只在单个reducer中排序,并且不保证数据跨多个reducers中排序.

    Hive的将数据拆分为多个reducer的方法是未定义的。它可能是真正随机的,它可能基于文件顺序,它可能基于数据中的某些值。Hive如何在reducers中实现limit子句也是未定义的。也许它按顺序从reducer中获取数据 - 即,reducer 0中的所有数据,然后全部来reducer1,等等。也许它通过它们循环并将所有内容混合在一起。

    假设reduce 的key是基于数据列,而limit子句是reducers的顺序。然后样品会非常倾斜。

    解决方案是另一个非标准的Hive功能:“distribute by”。对于reduce key不是由查询结构确定的查询(没有“group by”,没有join),可以准确指定reduce key的内容。

    scala> hivcon.sql("select * from gamedw.account distribute by rand() sort by rand() limit 10").show
    +-----------+--------+------+--------+--------------------+                    
    |accountname|   accid|platid|  dateid|          createtime|
    +-----------+--------+------+--------+--------------------+
    |   22808048|22808048|    27|20180122|2018-01-22 14:15:...|
    |   22984607|22984607|    60|20180324|2018-03-24 00:56:...|
    |   22775548|22775548|    15|20180116|2018-01-16 18:40:...|
    |   22386073|22386073|     4|20170914|2017-09-14 22:33:...|
    |   22709109|22709109|    18|20171226|2017-12-26 23:06:...|
    |   22713147|22713147|     6|20171227|2017-12-27 18:55:...|
    |   22847395|22847395|    15|20180131|2018-01-31 09:21:...|
    |   22863523|22863523|    50|20180206|2018-02-06 10:06:...|
    |   22772443|22772443|     3|20180116|2018-01-16 14:41:...|
    |   23063884|23063884|    27|20180424|2018-04-24 19:25:...|
    +-----------+--------+------+--------+--------------------+

    scala> hivcon.sql("select * from gamedw.account distribute by rand() sort by rand() limit 10").show
    +-----------+--------+------+--------+--------------------+                    
    |accountname|   accid|platid|  dateid|          createtime|
    +-----------+--------+------+--------+--------------------+
    |   22702272|22702272|     6|20171226|2017-12-26 15:32:...|
    |   22588521|22588521|     0|20171121|2017-11-21 17:00:...|
    |   23045191|23045191|    27|20180416|2018-04-16 22:31:...|
    |   22997042|22997042|    83|20180328|2018-03-28 21:17:...|
    |   22801335|22801335|    18|20180120|2018-01-20 17:59:...|
    |   22789961|22789961|     6|20180118|2018-01-18 09:51:...|
    |   22884362|22884362|     3|20180214|2018-02-14 00:01:...|
    |   22786025|22786025|     1|20180117|2018-01-17 18:45:...|
    |   22769132|22769132|    19|20180116|2018-01-16 11:32:...|
    |   23045125|23045125|    27|20180416|2018-04-16 23:12:...|
    +-----------+--------+------+--------+--------------------+

    作为最后一次优化,可以在map-side做一些过滤。如果表的总大小是已知的,轻松设置一个随机阈值条件来进行数据过滤,如下:

    scala> hivcon.sql("select * from gamedw.account where rand()<0.01 distribute by rand() sort by rand() limit 10").show
    +-----------+--------+------+--------+--------------------+                    
    |accountname|   accid|platid|  dateid|          createtime|
    +-----------+--------+------+--------+--------------------+
    |   23120261|23120261|    15|20180524|2018-05-24 12:05:...|
    |   22726833|22726833|     4|20171231|2017-12-31 09:53:...|
    |   22731460|22731460|     4|20180101|2018-01-01 22:15:...|
    |   22379995|22379995|     4|20170913|2017-09-13 10:01:...|
    |   22767301|22767301|    21|20180116|2018-01-16 10:25:...|
    |   22701994|22701994|    21|20171226|2017-12-26 14:52:...|
    |   22792831|22792831|    60|20180118|2018-01-18 20:09:...|
    |   23145016|23145016|    15|20180608|2018-06-08 05:04:...|
    |   22702216|22702216|    21|20171226|2017-12-26 15:28:...|
    |   22858386|22858386|    33|20180204|2018-02-04 10:19:...|
    +-----------+--------+------+--------+--------------------+

     抽样可以从被抽取的数据中估计和推断出整体的特性,是科学实验、质量检验、社会调查普遍采用的一种经济有效的工作和研究方法。当数据量特别大时,对全体数据进行处理存在困难时,抽样就显得尤其重要了

    Hive支持桶表抽样和块抽样:

    桶表抽样的语法如下:

       table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

    TABLESAMPLE子句允许用户编写用于数据抽样而不是整个表的查询,该子句出现FROM子句中,可用于任何表中。桶编号从1开始,colname表明抽取样本的列,可以是非分区列中的任意一列,或者使用rand()表明在整个行中抽取样本而不是单个列。在colname上分桶的行随机进入1到y个桶中,返回属于桶x的行。

    下例返回3000个桶中第2个桶的数据。

    hive> select * from gamedw.account TABLESAMPLE(BUCKET 2 OUT OF 3000 on rand()) s;
    OK
    s.accountname   s.accid s.platid        s.dateid        s.createtime
    22701274        22701274        33      20171226        2017-12-26 13:17:31.0
    22702663        22702663        33      20171226        2017-12-26 16:30:12.0
    22714480        22714480        27      20171227        2017-12-27 22:20:26.0
    22729211        22729211        4       20180101        2018-01-01 00:37:46.0
    22772179        22772179        18      20180116        2018-01-16 14:23:18.0
    22774547        22774547        4       20180116        2018-01-16 17:12:40.0
    22779739        22779739        1       20180117        2018-01-17 01:20:23.0
    22787395        22787395        4       20180117        2018-01-17 21:27:29.0
    22789721        22789721        15      20180118        2018-01-18 08:27:52.0
    22793281        22793281        35      20180118        2018-01-18 21:14:48.0
    22795002        22795002        15      20180119        2018-01-19 10:01:58.0
    22797054        22797054        3       20180119        2018-01-19 18:12:43.0
    22804855        22804855        138     20180121        2018-01-21 15:38:33.0
    22813246        22813246        1       20180124        2018-01-24 00:09:43.0
    22877117        22877117        3       20180211        2018-02-11 13:33:44.0
    23045597        23045597        21      20180417        2018-04-17 04:02:17.0
    23060405        23060405        15      20180423        2018-04-23 07:07:09.0
    23075002        23075002        191     20180430        2018-04-30 17:24:44.0
    23076553        23076553        158     20180501        2018-05-01 15:45:16.0
    23085391        23085391        18      20180506        2018-05-06 10:53:59.0
    Time taken: 0.938 seconds, Fetched: 20 row(s)

    通常情况下,TABLESAMPLE将会扫描整个表然后抽取样本,显然这种做法效率不是很高。

    替代方法是,由于在使用CLUSTERED BY时指定了分桶的列,如果抽样时TABLESAMPLE子句中指定的列匹配CLUSTERED BY子句中的列,TABLESAMPLE只扫描表中要求的分区。

    与rand()相比,rand()每次执行的结果都不同,指定分桶列后,每次执行结果相同

    TABLESAMPLE(BUCKET 3 OUT OF 32 on rand()):  account表在创建时使用了CLUSTEREDBY accid  INTO 32 BUCKETS,那么下面的语句将返回第3个和第19个簇中的行,因为每个桶由(32/16)=2个簇组成。为什么是3和19呢,因为要返回的是第3个桶,而每个桶由原来的2个簇组成,第3个桶就由原来的第3个和19个簇组成,根据简单的哈希算法(3%16=19%16)。

    hive> select * from gamedw.account TABLESAMPLE(BUCKET 5 OUT OF 3000 on accid) s;
    OK
    s.accountname   s.accid s.platid        s.dateid        s.createtime
    21078004        21078004        3       20180116        2018-01-16 20:21:32.0
    22380004        22380004        4       20170913        2017-09-13 10:02:13.0
    22386004        22386004        27      20170914        2017-09-14 22:15:23.0
    22392004        22392004        4       20170917        2017-09-17 00:36:06.0
    22671004        22671004        0       20171215        2017-12-15 11:58:57.0
    22701004        22701004        33      20171226        2017-12-26 12:49:32.0
    22704004        22704004        4       20171226        2017-12-26 18:22:16.0
    22707004        22707004        15      20171226        2017-12-26 21:20:56.0
    22710004        22710004        15      20171227        2017-12-27 01:45:43.0
    22722004        22722004        27      20171229        2017-12-29 19:49:04.0
    22767004        22767004        1       20180116        2018-01-16 10:17:55.0
    22770004        22770004        19      20180116        2018-01-16 12:10:56.0
    22773004        22773004        138     20180116        2018-01-16 15:11:39.0
    22779004        22779004        3       20180116        2018-01-16 23:19:48.0
    22782004        22782004        18      20180117        2018-01-17 10:38:42.0
    22803004        22803004        1       20180120        2018-01-20 23:59:56.0
    22812004        22812004        6       20180123        2018-01-23 18:36:55.0
    22821004        22821004        21      20180124        2018-01-24 20:38:20.0
    22830004        22830004        4       20180126        2018-01-26 03:44:23.0
    22887004        22887004        15      20180214        2018-02-14 23:16:07.0
    22911004        22911004        83      20180223        2018-02-23 18:11:57.0
    22971004        22971004        1       20180318        2018-03-18 17:01:03.0
    22977004        22977004        19      20180320        2018-03-20 22:22:38.0
    23043004        23043004        27      20180416        2018-04-16 09:23:30.0
    Time taken: 0.926 seconds, Fetched: 24 row(s)

    使用块抽样

    hive> select * from gamedw.account TABLESAMPLE(0.01 percent) s;
    OK
    s.accountname   s.accid s.platid        s.dateid        s.createtime
    1004210 1004210 6       20180116        2018-01-16 10:39:50.0
    20946754        20946754        0       20170913        2017-09-13 10:02:37.0
    20946766        20946766        0       20170901        2017-09-01 16:51:30.0
    20946793        20946793        0       20171117        2017-11-17 16:51:07.0
    20946796        20946796        0       20180110        2018-01-10 13:30:46.0
    20946962        20946962        0       20171219        2017-12-19 15:43:14.0
    20957641        20957641        0       20171117        2017-11-17 17:44:58.0
    20957642        20957642        0       20171220        2017-12-20 15:32:21.0
    Time taken: 0.873 seconds, Fetched: 8 row(s)

    hive> select * from gamedw.account TABLESAMPLE(1k) s;
    OK
    s.accountname   s.accid s.platid        s.dateid        s.createtime
    1004210 1004210 6       20180116        2018-01-16 10:39:50.0
    20946754        20946754        0       20170913        2017-09-13 10:02:37.0
    20946766        20946766        0       20170901        2017-09-01 16:51:30.0
    20946793        20946793        0       20171117        2017-11-17 16:51:07.0
    20946796        20946796        0       20180110        2018-01-10 13:30:46.0
    20946962        20946962        0       20171219        2017-12-19 15:43:14.0
    20957641        20957641        0       20171117        2017-11-17 17:44:58.0
    20957642        20957642        0       20171220        2017-12-20 15:32:21.0
    20963649        20963649        6       20171220        2017-12-20 10:13:57.0
    20963674        20963674        33      20171219        2017-12-19 22:59:39.0
    20964032        20964032        15      20171221        2017-12-21 17:04:03.0
    20964042        20964042        33      20171221        2017-12-21 20:06:33.0
    20964052        20964052        3       20171221        2017-12-21 14:17:28.0
    20964128        20964128        0       20171123        2017-11-23 15:52:38.0
    20964153        20964153        3       20170906        2017-09-06 16:39:35.0
    20964307        20964307        18      20180101        2018-01-01 17:10:39.0
    20964461        20964461        18      20180116        2018-01-16 13:14:22.0
    20964769        20964769        0       20170825        2017-08-25 17:03:35.0
    20965027        20965027        18      20180116        2018-01-16 11:16:07.0
    20965038        20965038        33      20171226        2017-12-26 20:02:26.0
    Time taken: 0.588 seconds, Fetched: 20 row(s)

    hive> select * from gamedw.account TABLESAMPLE(30 rows) s;
    OK
    s.accountname   s.accid s.platid        s.dateid        s.createtime
    1004210 1004210 6       20180116        2018-01-16 10:39:50.0
    20946754        20946754        0       20170913        2017-09-13 10:02:37.0
    20946766        20946766        0       20170901        2017-09-01 16:51:30.0
    20946793        20946793        0       20171117        2017-11-17 16:51:07.0
    20946796        20946796        0       20180110        2018-01-10 13:30:46.0
    20946962        20946962        0       20171219        2017-12-19 15:43:14.0
    20957641        20957641        0       20171117        2017-11-17 17:44:58.0
    20957642        20957642        0       20171220        2017-12-20 15:32:21.0
    20963649        20963649        6       20171220        2017-12-20 10:13:57.0
    20963674        20963674        33      20171219        2017-12-19 22:59:39.0
    20964032        20964032        15      20171221        2017-12-21 17:04:03.0
    20964042        20964042        33      20171221        2017-12-21 20:06:33.0
    20964052        20964052        3       20171221        2017-12-21 14:17:28.0
    20964128        20964128        0       20171123        2017-11-23 15:52:38.0
    20964153        20964153        3       20170906        2017-09-06 16:39:35.0
    20964307        20964307        18      20180101        2018-01-01 17:10:39.0
    20964461        20964461        18      20180116        2018-01-16 13:14:22.0
    20964769        20964769        0       20170825        2017-08-25 17:03:35.0
    20965027        20965027        18      20180116        2018-01-16 11:16:07.0
    20965038        20965038        33      20171226        2017-12-26 20:02:26.0
    20965124        20965124        18      20171226        2017-12-26 11:10:57.0
    20965187        20965187        33      20171226        2017-12-26 22:35:32.0
    20965282        20965282        33      20171226        2017-12-26 11:50:23.0
    20965301        20965301        33      20180121        2018-01-21 17:34:58.0
    20965326        20965326        33      20171228        2017-12-28 10:22:44.0
    20965646        20965646        18      20180119        2018-01-19 21:13:33.0
    20965650        20965650        18      20180116        2018-01-16 11:08:33.0
    20965815        20965815        18      20180116        2018-01-16 13:59:43.0
    20965938        20965938        18      20180116        2018-01-16 10:01:19.0
    20966100        20966100        18      20180122        2018-01-22 14:25:31.0
    Time taken: 0.898 seconds, Fetched: 30 row(s)

  • 相关阅读:
    嵌入式系统WinCE下应用程序GUI界面开发【转】
    pidstat 命令详解
    shell脚本检查是否存在tun0虚拟网卡,若不不存在服务器更改port,并重启服务器,客户端修改port,并重新启动客户端
    阿里云k8s部署zookeeper集群
    Kubernetes StatefulSet
    dubbo-admin管理平台搭建
    k8s pv,pvc无法删除问题
    Linux配置AndroidSDK&Jenkins远程部署
    CentOS的vsftp修改默认配置路径方法
    Java语法糖3:泛型
  • 原文地址:https://www.cnblogs.com/playforever/p/9560043.html
Copyright © 2020-2023  润新知