Deploying and Using the Hive 1.2.2 Data Warehouse on a Hadoop 2.7.3 Cluster


    HBase is a distributed, column-oriented NoSQL database built on HDFS. It stores data in tables made up of rows and columns, with columns grouped into column families. HBase does not provide a SQL-like query language; to query it with SQL you can use Phoenix, which translates SQL into HBase scans and operations, or you can use Hive, the data warehouse tool covered here, with HBase as Hive's storage backend.

    Hive is a data warehouse that runs on top of Hadoop. It maps structured data files to database tables and provides a simple SQL-like query language called HQL, translating queries into MapReduce jobs. This makes it convenient to query and analyze data with SQL, and it suits data that changes infrequently. Hive's underlying storage can be files on HDFS or tables in HBase.
    The two are complementary technologies built on Hadoop and can be combined to cover different kinds of business workloads: use Hive for offline batch analytics and statistics, and HBase for online queries.


    1. Install Hive from the binary package
    Download: http://mirrors.shuosc.org/apache/hive/stable/apache-hive-1.2.2-bin.tar.gz
    tar -zxf apache-hive-1.2.2-bin.tar.gz

    Configure environment variables

    # vi /etc/profile
    HIVE_HOME=/data/yunva/apache-hive-1.2.2-bin
    PATH=$PATH:$HIVE_HOME/bin
    export HIVE_HOME PATH
    # source /etc/profile
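The profile additions can be sanity-checked in any shell. A minimal sketch (the install path is the tutorial's own; adjust it to your layout):

```shell
# Append Hive's bin directory to PATH and confirm it took effect.
# /data/yunva/apache-hive-1.2.2-bin is the install path used in this tutorial.
export HIVE_HOME=/data/yunva/apache-hive-1.2.2-bin
export PATH=$PATH:$HIVE_HOME/bin
echo "$PATH" | grep -q "apache-hive-1.2.2-bin/bin" && echo "hive on PATH"
```

If the final line prints nothing, the export lines were not sourced into the current shell.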

    2. Install MySQL (here MariaDB) to store Hive's metadata (because of resource constraints, it was installed on a separate server in this setup)

    # yum install -y mariadb mariadb-server
    # systemctl start mariadb

    Create the Hive metastore database and a connection user in MySQL

    mysql>create database hive;
    mysql>grant all on *.* to 'hive'@'%' identified by 'hive';
    mysql>flush privileges;

    3. Configure Hive

    cd /data/yunva/apache-hive-1.2.2-bin/conf
    cp hive-default.xml.template hive-default.xml

    Configure Hive's MySQL connection settings (hive-site.xml is created here and holds only the settings we override)
    # vim hive-site.xml

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://10.10.11.214:3306/hive?createDatabaseIfNotExist=true</value>
            <description>JDBC connect string for a JDBC metastore</description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.jdbc.Driver</value>
            <description>Driver class name for a JDBC metastore</description>  
        </property>          
    
        <property> 
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>hive</value>
            <description>username to use against metastore database</description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>hive</value>
            <description>password to use against metastore database</description>
        </property>
    </configuration>

    4. Install the JDBC driver that lets Java connect to MySQL
    Download: https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.1.45.tar.gz
    Extract the archive and copy mysql-connector-java-5.1.45-bin.jar into /data/yunva/apache-hive-1.2.2-bin/lib

    5. Start the Hive service

    # hive --service metastore &
    
    [root@test3 apache-hive-1.2.2-bin]# ps -ef|grep hive
    root      4302  3176 99 14:09 pts/0    00:00:06 /usr/java/jdk1.8.0_65/bin/java -Xmx256m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/data/yunva/hadoop-2.7.3/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/data/yunva/hadoop-2.7.3 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /data/yunva/apache-hive-1.2.2-bin/lib/hive-service-1.2.2.jar org.apache.hadoop.hive.metastore.HiveMetaStore
    root      4415  3176  0 14:09 pts/0    00:00:00 grep hive
    [root@test3 apache-hive-1.2.2-bin]# jps
    15445 HRegionServer
    4428 Jps
    4302 RunJar # Hive's metastore shows up as a RunJar process

    Client setup (the client machine needs a working Hadoop environment)
    scp -P 48490 -r apache-hive-1.2.2-bin 10.10.114.112:/data/yunva

    Configure environment variables:
    vim /etc/profile

    # hive client
    HIVE_HOME=/data/yunva/apache-hive-1.2.2-bin
    PATH=$PATH:$HIVE_HOME/bin
    export HIVE_HOME PATH

    # vi hive-site.xml (or reuse the server's configuration unchanged, in which case this machine simply becomes a second Hive server)

    <configuration>
    <!-- connect to the Hive metastore over Thrift -->
       <property>
           <name>hive.metastore.uris</name>
            <value>thrift://hive_server_ip:9083</value>
       </property>
    </configuration>

    A quick test:
    Running the hive command opens the interactive shell:

    [root@test3 apache-hive-1.2.2-bin]# hive
    
    Logging initialized using configuration in jar:file:/data/yunva/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
    hive> show databases;
    OK
    default
    Time taken: 1.158 seconds, Fetched: 1 row(s)
    
    hive> create database yunvatest;
    hive> use yunvatest;
    OK
    Time taken: 0.021 seconds
    hive> show databases;
    OK
    default
    yunvatest
    Time taken: 0.225 seconds, Fetched: 2 row(s)
    hive> create table table_test(id string,name string);
    OK
    Time taken: 0.417 seconds
    hive> show tables;
    OK
    table_test
    Time taken: 0.033 seconds, Fetched: 1 row(s)

    6. Common Hive SQL commands
    6.1 First create a test database

    hive> create database test;
    hive> use test;

    Create table tb1 with the field delimiter set to the tab character (otherwise loads insert NULLs)

    hive> create table tb1(id int,name string) row format delimited fields terminated by '\t';

    To create another table with the same structure as tb1:
    hive> create table table2 like tb1;

    Check the table structure:
    hive> describe table2;
    OK
    id int
    name string
    Time taken: 0.126 seconds, Fetched: 2 row(s)

    6.2 Import data into a Hive table from a local file
    First create the data file; the fields must be separated by tab characters:

    # cat seasons.txt
    1    spring
    2    summer
    3    autumn
    4    winter
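Copying text from a web page often turns tabs into spaces, which makes the subsequent load insert NULLs. It is safer to generate the file with printf so the separators are guaranteed to be real tabs; a small sketch (file name as in the tutorial):

```shell
# Write the four rows with literal tab separators (\t), then verify that
# every line splits into exactly two tab-delimited fields.
printf '1\tspring\n2\tsummer\n3\tautumn\n4\twinter\n' > seasons.txt
awk -F'\t' 'NF != 2 { bad = 1 } END { exit bad }' seasons.txt && echo "seasons.txt is tab-separated"
```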

    Then load the data:
    hive> load data local inpath '/root/seasons.txt' overwrite into table tb1;
    Check that the import succeeded

    hive> select * from tb1;
    OK
    1    spring
    2    summer
    3    autumn
    4    winter

    6.3 Import data into a Hive table from HDFS:

    List the directories under the HDFS root
    hadoop fs -ls /

    Create a /test directory
    hadoop fs -mkdir /test
    Use put to write the file into /test as siji.txt
    hadoop fs -put /root/seasons.txt /test/siji.txt

    View the contents of siji.txt

    # hadoop fs -cat /test/siji.txt
    17/12/06 14:54:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    1    spring
    2    summer
    3    autumn
    4    winter

    hive> load data inpath '/test/siji.txt' overwrite into table table2;
    Loading data to table test.table2
    Table test.table2 stats: [numFiles=1, numRows=0, totalSize=36, rawDataSize=0]
    OK
    Time taken: 0.336 seconds

    Check that the import succeeded

    hive> select * from table2;
    OK
    1    spring
    2    summer
    3    autumn
    4    winter
    Time taken: 0.074 seconds, Fetched: 4 row(s)

    6.4 The commands above cover basic tables. To improve processing performance, Hive provides a partitioning mechanism; the key points of partitioned tables are:

    1>. The partition layout is specified when the table is created
    2>. A table can have one or more partitions, i.e. the data is divided into chunks
    3>. Partition keys appear as columns in the table schema but hold no actual row data
    Advantage of partitioned tables: rows are routed into partitions by their partition-key values, which narrows the range a query must scan and improves retrieval speed and processing performance
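Concretely, each partition is just a subdirectory under the table's directory, and a query with a partition predicate only has to read the matching subdirectories. A local sketch of that layout (directory and file names are illustrative, mirroring the warehouse paths shown later):

```shell
# Simulate a partitioned table's directory layout locally.
mkdir -p tb_demo/dt=2017-12-06 tb_demo/dt=2017-12-07
echo "1,spring" > tb_demo/dt=2017-12-06/data.txt
echo "1,spring" > tb_demo/dt=2017-12-07/data.txt
# "Partition pruning": a filter on dt means only one directory is scanned.
ls -d tb_demo/dt=2017-12-06
```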

    6.5 Single-partition table:
    Create a single-partition table tb2 (only one level of subdirectories under the table's HDFS directory):
    hive> create table tb2(id int,name string) partitioned by (dt string) row format delimited fields terminated by '\t';

    Note: dt is the name of the partition column.

    Load data from a file into the Hive partitioned table, specifying the target partition (the table must already exist)

    hive> load data local inpath '/root/seasons.txt' into table tb2 partition (dt='2017-12-06');
    hive> load data local inpath '/root/seasons.txt' into table tb2 partition (dt='2017-12-07');

    View the table data

    hive> select * from tb2;
    OK
    1    spring    2017-12-06
    2    summer    2017-12-06
    3    autumn    2017-12-06
    4    winter    2017-12-06
    1    spring    2017-12-07
    2    summer    2017-12-07
    3    autumn    2017-12-07
    4    winter    2017-12-07
    Time taken: 0.086 seconds, Fetched: 8 row(s)

    Check how the table's directory changed in the HDFS warehouse

    [root@test4_haili_dev ~]# hadoop fs -ls -R /user/hive/warehouse/test.db/tb2
    17/12/06 15:09:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    drwxrwxrwx   - root supergroup          0 2017-12-06 15:07 /user/hive/warehouse/test.db/tb2/dt=2017-12-06
    -rwxrwxrwx   3 root supergroup         36 2017-12-06 15:07 /user/hive/warehouse/test.db/tb2/dt=2017-12-06/seasons.txt
    drwxrwxrwx   - root supergroup          0 2017-12-06 15:07 /user/hive/warehouse/test.db/tb2/dt=2017-12-07
    -rwxrwxrwx   3 root supergroup         36 2017-12-06 15:07 /user/hive/warehouse/test.db/tb2/dt=2017-12-07/seasons.txt

    You can see that the data loaded into tb2 was placed into different directories by date


    6.6 Multi-partition table:
    Create a multi-partition table table3 (one level of subdirectories under the table's HDFS directory, each with its own subdirectories below)

    hive> create table table3(id int,name string) partitioned by (dt string,location string) row format delimited fields terminated by '\t';

    Load data from files into the Hive partitioned table, specifying the target partitions

    hive> load data local inpath '/root/seasons.txt' into table table3 partition (dt='2017-12-06',location='guangzhou');
    hive> load data local inpath '/root/seasons.txt' into table table3 partition (dt='2017-12-07',location='shenzhen');

    View the table data

    hive> select * from table3;
    OK
    1    spring    2017-12-06    guangzhou
    2    summer    2017-12-06    guangzhou
    3    autumn    2017-12-06    guangzhou
    4    winter    2017-12-06    guangzhou
    1    spring    2017-12-07    shenzhen
    2    summer    2017-12-07    shenzhen
    3    autumn    2017-12-07    shenzhen
    4    winter    2017-12-07    shenzhen

    Check how the table's directory changed in the HDFS warehouse

    [root@test3 yunva]# hadoop fs -ls -R /user/hive/warehouse/test.db/table3
    17/12/06 15:22:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    drwxrwxrwx   - root supergroup          0 2017-12-06 15:19 /user/hive/warehouse/test.db/table3/dt=2017-12-06
    drwxrwxrwx   - root supergroup          0 2017-12-06 15:19 /user/hive/warehouse/test.db/table3/dt=2017-12-06/location=guangzhou
    -rwxrwxrwx   3 root supergroup         36 2017-12-06 15:19 /user/hive/warehouse/test.db/table3/dt=2017-12-06/location=guangzhou/seasons.txt
    drwxrwxrwx   - root supergroup          0 2017-12-06 15:20 /user/hive/warehouse/test.db/table3/dt=2017-12-07
    drwxrwxrwx   - root supergroup          0 2017-12-06 15:20 /user/hive/warehouse/test.db/table3/dt=2017-12-07/location=shenzhen
    -rwxrwxrwx   3 root supergroup         36 2017-12-06 15:20 /user/hive/warehouse/test.db/table3/dt=2017-12-07/location=shenzhen/seasons.txt

    You can see that each first-level dt partition directory is subdivided further into location partitions.

    View the table's partition list
    hive> show partitions table3;
    OK
    dt=2017-12-06/location=guangzhou
    dt=2017-12-07/location=shenzhen
    Time taken: 0.073 seconds, Fetched: 2 row(s)
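A two-level partition is simply a nested directory (dt=.../location=...), so even a filter on dt alone prunes the scan to one dt subtree, including all of its location children. A local sketch of that nesting (paths are illustrative):

```shell
# Simulate the two-level partition layout of table3 locally.
mkdir -p t3_demo/dt=2017-12-06/location=guangzhou t3_demo/dt=2017-12-07/location=shenzhen
# A predicate on dt alone selects a whole subtree: the dt directory
# plus its location= subdirectory both match.
find t3_demo -type d -path '*dt=2017-12-06*'
```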

    Query data by partition

    hive> select name from table3 where dt='2017-12-06';
    OK
    spring
    summer
    autumn
    winter
    Time taken: 0.312 seconds, Fetched: 4 row(s)

    Rename a partition
    hive> alter table table3 partition (dt='2017-12-06',location='guangzhou') rename to partition(dt='20171206',location='shanghai');

    Drop a partition
    hive> alter table table3 drop partition(dt='2017-12-06',location='guangzhou');
    OK
    Time taken: 0.113 seconds
    The dropped data can no longer be queried:
    hive> select name from table3 where dt='2017-12-06';
    OK
    Time taken: 0.078 seconds

    Search for tables by name pattern
    hive> show tables 'tb*';
    OK
    tb1
    tb2

    Add a new column to a table

    hive> alter table tb1 add columns (comment string);
    OK
    Time taken: 0.106 seconds
    hive> describe tb1;
    OK
    id                      int                                         
    name                    string                                      
    comment                 string                                      
    Time taken: 0.079 seconds, Fetched: 3 row(s)

    Rename a table
    hive> alter table tb1 rename to new_tb1;
    OK
    Time taken: 0.095 seconds
    hive> show tables;
    OK
    new_tb1
    table2
    table3
    tb2

    Drop a table
    hive> drop table new_tb1;
    OK
    Time taken: 0.094 seconds
    hive> show tables;
    OK
    table2
    table3
    tb2

    Original post: https://www.cnblogs.com/reblue520/p/7993026.html