HBase and Phoenix Usage Examples


    # HBase Operations

    ## Basic operations

    ### Create a table

    Examples:
     
      hbase> create 't1', {NAME => 'f1', VERSIONS => 5}
      hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
      hbase> # The above in shorthand would be the following:
      hbase> create 't1', 'f1', 'f2', 'f3'
      hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}
      hbase> create 't1', 'f1', {SPLITS => ['10', '20', '30', '40']}
      hbase> create 't1', 'f1', {SPLITS_FILE => 'splits.txt'}
      hbase> # Optionally pre-split the table into NUMREGIONS, using
      hbase> # SPLITALGO ("HexStringSplit", "UniformSplit" or classname)
      hbase> create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
    
    create 'dbname:newsinfo_anticheat_user_tag_data', 'user', 'device'
    

    ### Describe a table

    hbase(main):006:0> describe "dbname:newsinfo_anticheat_user_tag_data"  
    

    ### Insert a few records

    put 'dbname:newsinfo_anticheat_user_tag_data', '5a483b8769e9560001f9d1b9_20181224', 'user:phone', '190eb638-185d-3e58-a009-fcd69f67b8ac'
    put 'dbname:newsinfo_anticheat_user_tag_data', '596f51ba4e94d9000170e1ff_20181224', 'user:phone', '17086385281'
    

    ### Scan all data (scan)

    hbase(main):014:0> scan 'dbname:newsinfo_anticheat_user_tag_data'
    ROW                                              COLUMN+CELL        
    596f51ba4e94d9000170e1ff_20181223               column=user:phone, timestamp=1545646705395, value=17086385281                       
    
    596f51ba4e94d9000170e1ff_20181224               column=user:phone, timestamp=1545646716425, value=17086385281                      
    

    ### Get data (get)

    #### Get all data for a row

    get 'dbname:newsinfo_anticheat_user_tag_data', '5a483b8769e9560001f9d1b9_20181224'
    

    #### Get all data for a row and column family

    get 'dbname:newsinfo_anticheat_user_tag_data', '5a483b8769e9560001f9d1b9_20181224','user'
    

    #### Get data for a row, column family, and column

    get 'dbname:newsinfo_anticheat_user_tag_data', '5a483b8769e9560001f9d1b9_20181224','user:dt'
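
    For reference, the same kind of read can be issued through the Java client API. Below is a minimal Scala sketch (HBase 1.x client); the ZooKeeper quorum value is a placeholder, and the row/column mirror the put shown above:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes

    object GetExample {
      def main(args: Array[String]): Unit = {
        val conf = HBaseConfiguration.create()
        conf.set("hbase.zookeeper.quorum", "zklist") // placeholder quorum
        val conn = ConnectionFactory.createConnection(conf)
        val table = conn.getTable(TableName.valueOf("dbname:newsinfo_anticheat_user_tag_data"))

        // Fetch one column of one row (equivalent to the shell get above)
        val get = new Get(Bytes.toBytes("5a483b8769e9560001f9d1b9_20181224"))
        get.addColumn(Bytes.toBytes("user"), Bytes.toBytes("phone"))
        val result = table.get(get)
        println(Bytes.toString(result.getValue(Bytes.toBytes("user"), Bytes.toBytes("phone"))))

        table.close()
        conn.close()
      }
    }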
    

    ## Pre-splitting

    By default, HBase creates a table with a single region. When data is imported, every HBase client writes to that one region until it grows large enough to be split. One way to speed up bulk writes is to pre-create a number of empty regions, so that incoming data is load-balanced across the cluster according to the region boundaries. Shell command:

    create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

    Alternatively, use the RegionSplitter utility:

    bin/hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f info

    Parameters:
    test_table is the table name
    HexStringSplit is the split algorithm
    -c 10 splits the table into 10 regions
    -f info is the column family
    

    This pre-splits the table into 15 regions and avoids the cost of automatic splits that otherwise happen when a storefile reaches the split threshold. A further benefit is that, with a well-designed rowkey, requests are spread (roughly) evenly across the regions, maximizing I/O efficiency. Pre-splitting does require a fairly large maximum region size, controlled by hbase.hregion.max.filesize. Its default is 10 GB per region (the default moved from 256 MB to 1 GB to 10 GB across versions 0.90, 0.92, and 0.94.3). Be careful, though, when MapReduce reads HBase via TableInputFormat: each region becomes one map task, so if the table holds less than 10 GB only a single map is launched, wasting a lot of resources. In that case consider lowering the parameter, or keep the pre-split regions and split them manually when they approach the threshold.
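
    The same pre-split create can also be done from code. A minimal Scala sketch using the HBase 1.x admin API (the table/family names 't1'/'f1' follow the shell example above; connection settings are assumed to come from hbase-site.xml on the classpath):

    import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.util.RegionSplitter

    object PreSplitCreate {
      def main(args: Array[String]): Unit = {
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val admin = conn.getAdmin

        // 14 split keys computed by HexStringSplit -> 15 regions
        val splitKeys = new RegionSplitter.HexStringSplit().split(15)

        val desc = new HTableDescriptor(TableName.valueOf("t1"))
        desc.addFamily(new HColumnDescriptor("f1"))
        admin.createTable(desc, splitKeys)

        admin.close()
        conn.close()
      }
    }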

    # Mapping an Existing HBase Table to Phoenix

    ## Mapping with a Phoenix view

    ### Initial creation

    Look at the existing HBase table dbname:newsinfo_anticheat_tag_data:

    hbase(main):003:0> scan 'dbname:newsinfo_anticheat_tag_data'
    ROW                                              COLUMN+CELL                                                         
     596f51ba4e94d9000170e1ff_20181223               column=user:dt, timestamp=1545647095916, value=20181223                                                                                       
     596f51ba4e94d9000170e1ff_20181223               column=user:phone, timestamp=1545646705395, value=17086385281                                                                                 
     596f51ba4e94d9000170e1ff_20181224               column=user:dt,
    

    Since Phoenix 4.10, column mapping has been reworked: a new encoding mechanism is used and columns are no longer mapped to HBase by column name. If you only need to query the data, it is strongly recommended to map the table with a Phoenix view, since dropping the view does not affect the underlying HBase data. The syntax is as follows:

    0: jdbc:phoenix:dsrv2.heracles.sohuno.com,dme>
    use "dbname";
    create view "newsinfo_anticheat_user_tag_data"("ROW" varchar primary key, "user"."dt" varchar , "user"."phone" varchar) ;
    

    The HBase rowkey is used as the primary key ("ROW"). The table name, column family, and column names must be wrapped in double quotes because HBase is case-sensitive; without the quotes, Phoenix automatically converts the lower-case identifiers to upper case when creating the view.
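
    The view can then be queried from code through the Phoenix JDBC driver. A minimal Scala sketch (the ZooKeeper quorum in the JDBC URL is a placeholder; substitute this cluster's quorum as used in the sqlline examples later in this post):

    import java.sql.DriverManager

    object PhoenixViewQuery {
      def main(args: Array[String]): Unit = {
        // placeholder quorum and znode parent
        val conn = DriverManager.getConnection("jdbc:phoenix:zk-host1,zk-host2:2181:/hbase")
        val stmt = conn.createStatement()
        // double-quoted identifiers keep the lower-case HBase names intact
        val rs = stmt.executeQuery(
          """select "ROW", "dt", "phone" from "dbname"."newsinfo_anticheat_user_tag_data" limit 10""")
        while (rs.next()) {
          println(s"""${rs.getString("ROW")}  ${rs.getString("dt")}  ${rs.getString("phone")}""")
        }
        rs.close()
        stmt.close()
        conn.close()
      }
    }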

    ### Re-mapping after adding a column in HBase

    HBase shell:

    # add a new column user:did_count
    put 'dbname:test2', '5a483b8769e9560001f9d1b9_20181224', 'user:did_count', '100'
    

    Phoenix SQL:

    -- drop the view
    drop view "newsinfo_anticheat_user_tag_data";

    -- recreate the view, adding the new column
    use "dbname";
    create view "newsinfo_anticheat_user_tag_data"("ROW" varchar primary key, "user"."dt" varchar, "user"."phone" varchar, "user"."did_count" varchar);
    

    Query again and the new column's data is now visible.

    ## Mapping with a Phoenix table

    ### Create the mapping table

    If a table mapping is required, column encoding must be disabled (which degrades query performance), as follows:

    use "dbname";
    create table "newsinfo_anticheat_user_tag_data"("ROW" varchar primary key, "user"."dt" varchar , "user"."phone" varchar)  column_encoded_bytes=0;
    

    Note: when the mapping table is dropped, the corresponding HBase data is deleted as well, so use DROP TABLE with great care!

    # Phoenix Secondary Indexes

    Create a secondary index:

    create index my_index2 on MY_TABLE(V1) include(v2);
    

    # Adding a Column in HBase

    put 'dbname:test2', '5a483b8769e9560001f9d1b9_20181224', 'user:did_count', '100'
    

    # Exporting a CSV File from Phoenix

    [@dudbname103113.heracles.sohuno.com ~]$ /opt/work/phoenix-4.13.1-HBase-1.3/bin/sqlline.py --help
    usage: sqlline.py [-h] [-v VERBOSE] [-c COLOR] [-fc FASTCONNECT]
                      [zookeepers] [sqlfile]
    
    Launches the Apache Phoenix Client.
    
    positional arguments:
      zookeepers            The ZooKeeper quorum string
      sqlfile               A file of SQL commands to execute
    
    optional arguments:
      -h, --help            show this help message and exit
      -v VERBOSE, --verbose VERBOSE
                            Verbosity on sqlline.
      -c COLOR, --color COLOR
                            Color setting for sqlline.
      -fc FASTCONNECT, --fastconnect FASTCONNECT
                            Fetch all schemas on initial connection
    

    Create a sqlfile containing the SQL statements to run, as shown below:

    [@dudbname103113.heracles.sohuno.com ~]$ cat /home/dbname/data/dev_xdf/phoneix/sqlfile.txt
    use "dbname";
    
    select * from "newsinfo_anticheat_blacklist_data" limit 5;
    

    Run the command and redirect the query output to a CSV file:

    /opt/work/phoenix-4.13.1-HBase-1.3/bin/sqlline.py dsrv2.heracles.sohuno.com,dmeta2.heracles.sohuno.com,drm2.heracles.sohuno.com,dmeta1.heracles.sohuno.com /home/dbname/data/dev_xdf/phoneix/sqlfile.txt >> /home/dbname/data/dev_xdf/phoneix/sqlout.csv
    

    Check the contents of the exported CSV file:

    [@dudbname103113.heracles.sohuno.com ~]$ cat /home/dbname/data/dev_xdf/phoneix/sqlout.csv
    997/997 (100%) Done
    +---------------------------+-----+---------------------------------------+-----------+
    |          rowkey           | dt  |                  did                  |  did_dt   |
    +---------------------------+-----+---------------------------------------+-----------+
    | 51bd61df  |     | 76b46d96-a986-36b6-864c-4558d250e6ad  | 20190114  |
    | 58a272ec427ad50001849b2b  |     | dd4768f4-5928-37bc-93d9-d700595434a1  | 20190129  |
    | 58c246850c0e580001f4104a  |     | 9a79728e-3d0c-3f39-bad4-40a093ec27ea  | 20190122  |
    | 596628720c0e5800018c8d2c  |     | f92979768431f3bc6dcf352ac67e5e5d      | 20190129  |
    | 5969a1a5c6e6dd0001a27f35  |     | bb60dc61-69ec-3eaa-9ade-772122c8ac88  | 20190129  |
    +---------------------------+-----+---------------------------------------+-----------+
    

    # Phoenix Queries

    0: jdbc:phoenix:dsrv2.heracles.sohuno.com,dme> select * from "device_tag_data" where "dt"='20190210' and TO_NUMBER("user_count")>10 LIMIT 2;
    +------------------------------------------------+-----------+-------------+----------+------------+--------------+-----------+--------------+-----+
    |                      ROW                       |    dt     | user_count  | fenshen  | gyroscope  | android_ver  | hotcloud  | ad_exposure  | ad_ |
    +------------------------------------------------+-----------+-------------+----------+------------+--------------+-----------+--------------+-----+
    | 03215fc6-5520-3e49-b1e5-f6d72024fa62_20190210  | 20190210  | 11          | 1        |            |              | 1         |              |     |
    | 0a340e16-f0dd-3be0-8632-8ac1895d2e6a_20190210  | 20190210  | 12          | 1        |            | 8.1.0        | 1         |              |     |
    +------------------------------------------------+-----------+-------------+----------+------------+--------------+-----------+--------------+-----+
    

    # Writing to HBase

    ## Spark bulkload

    def main(args: Array[String]): Unit = {
    
      // arguments: run date, HBase table name, HDFS temp path
      val today = args(0)
      val hbaseTabName=args(1)
      val hdfsTmpPath = args(2)
    
      // create the SparkContext and SparkSession
      val sparkconf = new SparkConf().setAppName("UserMetrics")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      val sc = new SparkContext(sparkconf)
      val spark = SparkSessionSingleton.getInstance(sc.getConf)
    
      // HBase configuration
      val conf = new Configuration()
      conf.set("hbase.zookeeper.quorum", "zklist")
      conf.set("hbase.zookeeper.property.clientPort", "2181")
      conf.set("zookeeper.znode.parent", "/hbase ")
      conf.set(TableOutputFormat.OUTPUT_TABLE, hbaseTabName)
    
      // build the RDD of (rowkey, Put) pairs
      val device_rdd = get_device_rdd(spark, today)

      // bulk-load the device metrics into HBase
      spark_bulkload_to_hbase(device_rdd, conf, hdfsTmpPath, hbaseTabName)
    
      sc.stop()
    }
    
    def get_device_rdd(spark: SparkSession,today: String) :RDD[(ImmutableBytesWritable, Put)] = {
      val sql = ""
      val data = spark.sql(sql)
    
      val ad_exposure_click_rdd = data.rdd.map(record => {
        val did = record.getString(0)
        val rowkey = did+"_"+today
        val put = new Put(Bytes.toBytes(rowkey))
    
        try{
          put.addColumn(Bytes.toBytes("device"), Bytes.toBytes("dt"), Bytes.toBytes(today))
          val user_count = record.get(1)
          put.addColumn(Bytes.toBytes("device"), Bytes.toBytes("user_count"), Bytes.toBytes(user_count.toString))
        }catch {
          case e: Exception => println(s"${e.printStackTrace()}")
        }
        (new ImmutableBytesWritable, put)
      })
      ad_exposure_click_rdd
    }
    
    /**
      * Write the RDD to HBase directly through TableOutputFormat
      */
    def spark_direct_bulkload_to_hbase(rdd: RDD[(ImmutableBytesWritable, Put)], conf: Configuration,path: String): Unit ={
      val job = Job.getInstance(conf)
      job.setOutputKeyClass(classOf[ImmutableBytesWritable])
      job.setOutputValueClass(classOf[Result])
      job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    
      //save to hbase hfile
      job.getConfiguration.set("mapred.output.dir", path)
      rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)
    }
    
    /**
      * Generate HFiles on HDFS first, then bulk-load them into HBase
      */
    def spark_bulkload_to_hbase(rdd: RDD[(ImmutableBytesWritable, Put)], conf: Configuration,path: String,hbaseTabName: String): Unit ={
      val myTable = new HTable(conf, hbaseTabName)
      // Save Hfiles on HDFS
      rdd.saveAsNewAPIHadoopFile(path, classOf[ImmutableBytesWritable], classOf[Result], classOf[TableOutputFormat[ImmutableBytesWritable]], conf)
    
      //Bulk load Hfiles to Hbase
      val bulkLoader = new LoadIncrementalHFiles(conf)
      bulkLoader.doBulkLoad(new Path(path), myTable)
    }
    

    ## HBase API

    data.rdd.foreachPartition(
      partitionRecords => {
        val conn = getHBaseConn(hbaseTabName) // get an HBase connection (a sketch of this helper follows the block)
        val tName = TableName.valueOf(hbaseTabName)
        val table = conn.getTable(tName)
        partitionRecords.foreach(record => {
    
          val userid = record.getString(0)
          val rowkey = userid + "_" + today
          val put = new Put(Bytes.toBytes(rowkey))
    
          try {
            put.addColumn(Bytes.toBytes("user"), Bytes.toBytes("dt"), Bytes.toBytes(today))
    
            val ad_exposure = record.get(1)
            if (ad_exposure != None && ad_exposure != null) {
              put.addColumn(Bytes.toBytes("user"), Bytes.toBytes("ad_exposure"), Bytes.toBytes(ad_exposure.toString))
            }
            val ad_click_rate = record.get(2)
            if (ad_click_rate != None && ad_click_rate != null) {
              put.addColumn(Bytes.toBytes("user"), Bytes.toBytes("ad_click_rate"), Bytes.toBytes(ad_click_rate.toString))
            }
            Try(table.put(put)).getOrElse(table.close()) // write the Put to HBase; on failure close the table (requires scala.util.Try)
          } catch {
            case e: Exception => println(s"================================ ${e.printStackTrace()}")
          }
        })
        table.close() // close the table once the partition's data has been written
        conn.close()
      })
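
    The getHBaseConn helper used above is not shown in the original snippet; a minimal sketch of what it might look like is given below (the quorum and port are placeholders, and the table-name argument is kept only to match the call site):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

    def getHBaseConn(hbaseTabName: String): Connection = {
      // the connection is per-cluster, not per-table; hbaseTabName is unused in this sketch
      val conf = HBaseConfiguration.create()
      conf.set("hbase.zookeeper.quorum", "zklist")            // placeholder quorum
      conf.set("hbase.zookeeper.property.clientPort", "2181")
      ConnectionFactory.createConnection(conf)
    }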
    

    # Exporting HBase Data

    ## HBase shell

    Export rows filtered by scan conditions:

    hbase(main):006:0> scan 'dbname:newsinfo_anticheat_blacklist_data',{COLUMNS => 'user:dt',LIMIT=>1}
    ROW                                        COLUMN+CELL
    59c925dbcb8e580001ddd21c                   column=user:dt, timestamp=1547805110793, value=20190114
    

    Export to a file:

    echo " scan 'dbname:newsinfo_anticheat_blacklist_data',{COLUMNS => 'user:dt',LIMIT=>1}" | hbase shell > ./hbase.csv
    

    ## Exporting via Hive

    Sometimes we need to query the user-profile data that already lives in HBase through Hive, i.e. make the HBase data queryable from Hive, without shuttling it back and forth with tools such as Sqoop or DataX. This can be done by creating an associated table in Hive that reads the data from HBase: the table is created in HBase and an external table is defined in Hive on top of it, so data written through Hive is also reflected in HBase.

    # Accessing HBase Data via a Hive Mapping Table

    After a table has been created in HBase, the data can only be queried with scan in the HBase shell, which is inconvenient for anyone used to SQL. Instead, we can create a Hive table mapped onto the HBase table and access the data through Hive, as in the following example:

    1. The HBase table dbname:device_tag_data already exists:

    hbase(main):067:0> scan "dbname:device_tag_data",LIMIT=>1
    ROW                                COLUMN+CELL          
    0000039f-2d6e-3140-bb97-0d7294cfa4fe_2019 column=device:dt, timestamp=1547637050603, value=20190114                              
    0000039f-2d6e-3140-bb97-0d7294cfa4fe_2019 column=device:gyroscope, timestamp=1547631713244, value=103                                   
    0000039f-2d6e-3140-bb97-0d7294cfa4fe_2019 column=device:user_count, timestamp=1547631631653, value=1
    0000039f-2d6e-3140-bb97-0d7294cfa4fe_2019 column=device:android_ver, timestamp=1547638332436, value=6.0
    

    2. Create a Hive mapping table linked to the HBase table:

    CREATE EXTERNAL TABLE dbname.device_tag_data(
    key string, 
    dt string,
    user_count string,
    fenshen string,
    gyroscope string,
    android_ver string,
    hotcloud string,
    ad_exposure string,
    ad_click_rate string
    ) 
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
    WITH SERDEPROPERTIES
    ("hbase.columns.mapping" = 
    ":key,device:dt,device:user_count,device:fenshen,device:gyroscope,device:android_ver,device:hotcloud,device:ad_exposure,device:ad_click_rate")
    TBLPROPERTIES("hbase.table.name" = "dbname:device_tag_data");
    

    The key settings are hbase.table.name (the HBase table name) and hbase.columns.mapping (the one-to-one mapping between HBase columns and Hive columns); with these configured, the HBase data can be read and written from Hive. Note that the hbase.columns.mapping value must not contain spaces or line breaks between fields.

    3. Query the data through Hive:

    0: jdbc:hive2://10.31.103.113:10003/dbname> select * from dbname.device_tag_data where key='0000039f-2d6e-3140-bb97-0d7294cfa4fe_20190114';
    

    Note: the dbname.device_tag_data table accessed here is a virtual table; the data itself is stored in HBase. With the Hive/HBase integration, queries issued from Hive go straight to HBase. Although no dedicated benchmark was run, querying HBase directly like this does not feel very reliable, and frequent analytical workloads would certainly multiply the load on HBase. To avoid that, we can create a new, empty Hive table, load the queried data into it, and run all subsequent analysis against the new table.

    4. Create a native Hive table:

    CREATE TABLE dbname.device_tag_data2(
    key string, 
    dt string,
    user_count string,
    fenshen string,
    gyroscope string,
    android_ver string,
    hotcloud string,
    ad_exposure string,
    ad_click_rate string
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE 
    LOCATION 'hdfs://ns/user/dbname/hive/online/device_tag_data2';
    

    5. Load the data from the HBase-backed table into the native table:

    INSERT OVERWRITE TABLE dbname.device_tag_data2 select * from dbname.device_tag_data;
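
    If you prefer to drive this copy from Spark, consistent with the Spark code earlier in this post, a minimal sketch is shown below; it assumes a SparkSession with Hive support and the hive-hbase-handler jars on the classpath:

    import org.apache.spark.sql.SparkSession

    object CopyDeviceTagData {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("CopyDeviceTagData")
          .enableHiveSupport()
          .getOrCreate()

        // same copy as step 5, executed through Spark SQL
        spark.sql("INSERT OVERWRITE TABLE dbname.device_tag_data2 SELECT * FROM dbname.device_tag_data")

        spark.stop()
      }
    }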
    

    Done! From here on, all complex queries and analysis can be carried out on the new Hive table.
