• Hadoop入门(三):Sqoop和Hive的使用


    1安装Hive

    1.1下载解压

    wget http://mirrors.cnnic.cn/apache/hive/stable/hive-0.10.0.tar.gz

    tar xzvfhive-0.10.0.tar.gz

    1.2配置环境变量

    exportHIVE_HOME=/usr/local/src/hive-0.10.0

    export PATH=$HIVE_HOME/bin:$PATH

    1.3建立Hive仓库目录

    hadoop fs -mkdir/tmp

    hadoop fs -mkdir/user/hive/warehouse

    hadoop fs -chmodg+w /tmp

    hadoop fs -chmodg+w /user/hive/warehouse


    1.4启动命令行

    通过hive命令进入命令行,操作与MySQL的命令行类似:



    2安装Sqoop

    2.1下载解压

    下载适合Hadoop 0.20版本的Sqoop:

    wget http://mirrors.cnnic.cn/apache/sqoop/1.4.3/sqoop-1.4.3.bin__hadoop-0.20.tar.gz

    tar -xvf sqoop-1.4.3.bin__hadoop-0.20.tar.gz

    2.2配置环境变量

    export SQOOP_HOME=/usr/local/src/sqoop-1.4.3.bin__hadoop-0.20

    export PATH=$SQOOP_HOME/bin:$PATH

    export HADOOP_COMMON_HOME=/home/admin/hadoop-0.20.2

    export HADOOP_MAPRED_HOME=/home/admin/hadoop-0.20.2


    3用Sqoop导入数据到HIVE

    3.1导入HDFS

    我们从MySQL数据库中导入一张表的数据来测试一下Sqoop是否配置成功。首先上传mysql-connector-java-5.1.23.jar到sqoop的lib文件夹下,然后在sqoop/bin下执行下列命令:

    sqoop import--connect jdbc:mysql://ip/database --table tb1 --username user -P

    ===============================================================================

    Warning: /usr/lib/hbase does not exist!HBase imports will fail.

    Please set $HBASE_HOME to the root of yourHBase installation.

    Enter password:

    13/06/07 16:51:46 INFOmanager.MySQLManager: Preparing to use a MySQL streaming resultset.

    13/06/07 16:51:46 INFO tool.CodeGenTool: Beginning codegeneration

    13/06/07 16:51:48 INFO manager.SqlManager:Executing SQL statement: SELECT t.* FROM `tb1` AS t LIMIT 1

    13/06/07 16:51:48 INFO manager.SqlManager:Executing SQL statement: SELECT t.* FROM `tb1` AS t LIMIT 1

    13/06/07 16:51:48 INFOorm.CompilationManager: HADOOP_MAPRED_HOME is /home/admin/hadoop-0.20.2

    13/06/07 16:51:48 INFOorm.CompilationManager: Found hadoop core jar at:/home/admin/hadoop-0.20.2/hadoop-0.20.2-core.jar

    Note:/tmp/sqoop-root/compile/44c4b6c5ac57de04b487eb90633ac33e/tb1.java uses oroverrides a deprecated API.

    Note: Recompile with -Xlint:deprecation fordetails.

    13/06/07 16:51:54 INFO orm.CompilationManager:Writing jar file:/tmp/sqoop-root/compile/44c4b6c5ac57de04b487eb90633ac33e/tb1.jar

    13/06/07 16:51:54 WARNmanager.MySQLManager: It looks like you are importing from mysql.

    13/06/07 16:51:54 WARNmanager.MySQLManager: This transfer can be faster! Use the --direct

    13/06/07 16:51:54 WARNmanager.MySQLManager: option to exercise a MySQL-specific fast path.

    13/06/07 16:51:54 INFOmanager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)

    13/06/07 16:51:54 INFO mapreduce.ImportJobBase:Beginning import of tb1

    13/06/07 16:51:57 INFOdb.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM`tb1`

    13/06/07 16:51:59 INFO mapred.JobClient:Running job: job_201306071651_0001

    13/06/07 16:52:00 INFOmapred.JobClient:  map 0% reduce 0%

    13/06/07 16:52:38 INFOmapred.JobClient:  map 50% reduce 0%

    13/06/07 16:52:44 INFOmapred.JobClient:  map 100% reduce 0%

    13/06/07 16:52:46 INFO mapred.JobClient:Job complete: job_201306071651_0001

    13/06/07 16:52:46 INFO mapred.JobClient:Counters: 5

    13/06/07 16:52:46 INFOmapred.JobClient:   Job Counters

    13/06/07 16:52:46 INFOmapred.JobClient:     Launched map tasks=2

    13/06/07 16:52:46 INFOmapred.JobClient:   FileSystemCounters

    13/06/07 16:52:46 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=212

    13/06/07 16:52:46 INFOmapred.JobClient:   Map-Reduce Framework

    13/06/07 16:52:46 INFOmapred.JobClient:     Map input records=2

    13/06/07 16:52:46 INFOmapred.JobClient:     Spilled Records=0

    13/06/07 16:52:46 INFO mapred.JobClient:     Map output records=2

    13/06/07 16:52:46 INFOmapreduce.ImportJobBase: Transferred 212 bytes in 51.383 seconds (4.1259bytes/sec)

    13/06/07 16:52:46 INFOmapreduce.ImportJobBase: Retrieved 2 records.

    ===============================================================================

    数据文件默认被导入到当前用户文件夹下表名对应的文件夹了:

     

    Sqoop默认会同时启动四个Map任务来加速数据导入,可以通过-m 1命令来强制只启动一个map任务,这样就只会在HDFS中生成一个数据文件了。因为tb1表目前就两条数据,所以一共产生两个文件,查看下生成的文件内容:


    3.2创建Hive表

    首先在hive命令行中创建tb1表。注意hive支持的数据类型有限,并且一定要设置表的分隔符为逗号,否则Hive默认分隔符为Ctrl+A。

    CREATE TABLE tb1(

      id int,

     ......

    ) row format delimited fields terminated by ‘,’;

    也可以通过下面的命令让Sqoop根据MySQL表结构自动创建出Hive表:

    sqoop create-hive-table --connect jdbc:mysql://ip/database --table tb1 --hive-table tb1 --username user -P

    3.3导入Hive

    现在导入HDFS中的文件到Hive,注意Hive从HDFS导入数据后,会将HDFS中的文件/user/root/tb1移动到/user/hive/tb1:

             LOADDATA INPATH '/user/root/tb1/part-m-*' OVERWRITE INTO TABLE tb1

    3.4一条强大的命令

    上面的从MySQL导出数据到HDFS、创建Hive表格、导入数据到Hive三步,可以直接用一条Sqoop命令完成:

    sqoop import--connect jdbc:mysql://ip/database --table tb1 --username user -P  --hive-import

    4用HiveQL做分析

    待续...... 


    参考资料

    Hive安装

    https://cwiki.apache.org/confluence/display/Hive/GettingStarted

    http://sqoop.apache.org/docs/1.99.1/Installation.html

  • 相关阅读:
    javaSE基础(三)
    javaSE基础(二)
    javaSE基础(一)
    文件目录爬虫
    前自增 与 后自增
    查找 与 排序 总结
    python 使用 grpc
    python3.7 安装 uwsgi
    go
    go
  • 原文地址:https://www.cnblogs.com/xiaomaohai/p/6157738.html
Copyright © 2020-2023  润新知