• 使用sqoop把mysql数据导入hive


    使用sqoop把mysql数据导入hive

    export HADOOP_COMMON_HOME=/hadoop
    export HADOOP_MAPRED_HOME=/hadoop
     
    cp /hive/lib/mysql-connector-java-5.1.25-bin.jar     /sqoop/lib/

    share表第一列为自增主键 share_id,share_id<1000的数据共有999条:

    mysql> SELECT COUNT(*) FROM share WHERE share_id<1000;
    +----------+
    | COUNT(*) |
    +----------+
    |      999 |
    +----------+
    1 ROW IN SET (0.00 sec)

    使用sqoop导入到hive

    [hduser@www lib]$ /sqoop/bin/sqoop import --connect jdbc:mysql://localhost/shipincon --table share --username root --password xxx --hive-import --where "share_id<1000"  -- --default-character-set=utf8
     
     
    Warning: /usr/lib/hbase does not exist! HBase imports will fail.
    Please set $HBASE_HOME to the root of your HBase installation.
    		--password 密码   比较危险,可以改成 -P   (Read password from console)
    13/07/12 18:42:58 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
    		默认是用hive的分隔符 ctrl+a 可以使用 --fields-terminated-by 来更改
    		插一句:在console vim中如何输入ctrl+a:  在输入模式下先按ctrl+v,再按ctrl+a即可,会显示成^A
    13/07/12 18:42:58 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
    13/07/12 18:42:58 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
    13/07/12 18:42:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
    13/07/12 18:42:58 INFO tool.CodeGenTool: Beginning code generation
    13/07/12 18:42:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `share` AS t LIMIT 1
    13/07/12 18:42:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `share` AS t LIMIT 1
    13/07/12 18:42:58 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop
    Note: /tmp/sqoop-hduser/compile/720a3072de33c8a826cdf96d9eced686/share.java uses or overrides a deprecated API.
    Note: Recompile with -Xlint:deprecation for details.
    13/07/12 18:43:00 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hduser/compile/720a3072de33c8a826cdf96d9eced686/share.jar
    13/07/12 18:43:00 WARN manager.MySQLManager: It looks like you are importing from mysql.
    	如果没有blob clob longvarbinary字段,可以使用 --direct 速度也许会快一些。    text字段不清楚。
    13/07/12 18:43:00 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
    13/07/12 18:43:00 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
    13/07/12 18:43:00 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
    13/07/12 18:43:00 INFO mapreduce.ImportJobBase: Beginning import of share
    13/07/12 18:43:02 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`share_id`), MAX(`share_id`) FROM `share` WHERE ( share_id<1000 )
    13/07/12 18:43:02 INFO mapred.JobClient: Running job: job_201307121751_0001
    13/07/12 18:43:03 INFO mapred.JobClient:  map 0% reduce 0%
    13/07/12 18:43:12 INFO mapred.JobClient:  map 50% reduce 0%
    13/07/12 18:43:17 INFO mapred.JobClient:  map 100% reduce 0%
    13/07/12 18:43:18 INFO mapred.JobClient: Job complete: job_201307121751_0001
    13/07/12 18:43:18 INFO mapred.JobClient: Counters: 18
    13/07/12 18:43:18 INFO mapred.JobClient:   Job Counters 
    13/07/12 18:43:18 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=20723
    13/07/12 18:43:18 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
    13/07/12 18:43:18 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
    		################################ map tasks  =4 就会生成四个文件
    13/07/12 18:43:18 INFO mapred.JobClient:     Launched map tasks=4
    13/07/12 18:43:18 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
    13/07/12 18:43:18 INFO mapred.JobClient:   File Output Format Counters 
    13/07/12 18:43:18 INFO mapred.JobClient:     Bytes Written=202094
    13/07/12 18:43:18 INFO mapred.JobClient:   FileSystemCounters
    13/07/12 18:43:18 INFO mapred.JobClient:     HDFS_BYTES_READ=455
    13/07/12 18:43:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=260292
    13/07/12 18:43:18 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=202094
    13/07/12 18:43:18 INFO mapred.JobClient:   File Input Format Counters 
    13/07/12 18:43:18 INFO mapred.JobClient:     Bytes Read=0
    13/07/12 18:43:18 INFO mapred.JobClient:   Map-Reduce Framework
    13/07/12 18:43:18 INFO mapred.JobClient:     Map input records=999
    13/07/12 18:43:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=179216384
    13/07/12 18:43:18 INFO mapred.JobClient:     Spilled Records=0
    13/07/12 18:43:18 INFO mapred.JobClient:     CPU time spent (ms)=3360
    13/07/12 18:43:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=65011712
    13/07/12 18:43:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1519448064
    13/07/12 18:43:18 INFO mapred.JobClient:     Map output records=999
    13/07/12 18:43:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=455
    13/07/12 18:43:18 INFO mapreduce.ImportJobBase: Transferred 197.3574 KB in 18.0516 seconds (10.9329 KB/sec)
    		共999行记录
    13/07/12 18:43:18 INFO mapreduce.ImportJobBase: Retrieved 999 records.
    13/07/12 18:43:19 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `share` AS t LIMIT 1
    13/07/12 18:43:19 WARN hive.TableDefWriter: Column add_time had to be cast to a less precise type in Hive
    		add_time的精度变小了。。为嘛?  mysql中 `add_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    mysql> select add_time from share limit 1;
    +---------------------+
    | add_time            |
    +---------------------+
    | 2012-04-08 17:17:13 |
    +---------------------+
    1 row in set (0.00 sec)
    到了hive中会变成:
    2012-04-08 17:17:13.0
    	貌似精度是变大了。
    不过据说某些数据库的timestamp字段支持到了毫秒的精度。	
    13/07/12 18:43:19 INFO hive.HiveImport: Removing temporary files from import process: hdfs://localhost:54310/user/hduser/share/_logs
    13/07/12 18:43:19 INFO hive.HiveImport: Loading uploaded data into Hive
    13/07/12 18:43:21 INFO hive.HiveImport: WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
    13/07/12 18:43:22 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-0.10.0.jar!/hive-log4j.properties
    13/07/12 18:43:22 INFO hive.HiveImport: Hive history file=/tmp/hduser/hive_job_log_hduser_201307121843_1382245494.txt
    13/07/12 18:43:25 INFO hive.HiveImport: OK
    13/07/12 18:43:25 INFO hive.HiveImport: Time taken: 2.389 seconds
    13/07/12 18:43:25 INFO hive.HiveImport: Loading data to table default.share
    13/07/12 18:43:26 INFO hive.HiveImport: Table default.share stats: [num_partitions: 0, num_files: 5, num_rows: 0, total_size: 202094, raw_data_size: 0]
    13/07/12 18:43:26 INFO hive.HiveImport: OK
    13/07/12 18:43:26 INFO hive.HiveImport: Time taken: 0.726 seconds
    13/07/12 18:43:26 INFO hive.HiveImport: Hive import complete.
    13/07/12 18:43:26 INFO hive.HiveImport: Export directory is empty, removing it.
    [hduser@www lib]$ 
     
     
     
    --default-character-set=utf8之前还有 --
     
    Arguments to mysqldump and other subprograms may be supplied
    after a '--' on the command line.

    去hive中看看数据:

    [hduser@www lib]$ /hive/bin/hive
    WARNING: org.apache.hadoop.metrics.jvm.EventCounter IS deprecated. Please USE org.apache.hadoop.log.metrics.EventCounter IN ALL the log4j.properties files.
    Logging initialized USING configuration IN jar:file:/usr/LOCAL/hive/lib/hive-common-0.10.0.jar!/hive-log4j.properties
    Hive history file=/tmp/hduser/hive_job_log_hduser_201307121854_1566000179.txt
    hive> USE shipincon;
    FAILED: Error IN metadata: ERROR: The DATABASE shipincon does NOT exist.
    FAILED: Execution Error, RETURN code 1 FROM org.apache.hadoop.hive.ql.EXEC.DDLTask
    hive> SHOW DATABASES;
    OK
    DEFAULT
    TIME taken: 0.244 seconds
    hive> USE DEFAULT;
    OK
    TIME taken: 0.018 seconds
    hive> SHOW TABLES;
    OK
    a
    cleanmaster
    cm
    login_info
    login_interm
    share
    TIME taken: 0.155 seconds
    hive> SELECT COUNT(*) FROM sahre;
    FAILED: SemanticException [Error 10001]: Line 1:21 TABLE NOT found 'sahre'
    hive> SELECT COUNT(*) FROM share 
        > ;
    Total MapReduce jobs = 1
    Launching Job 1 OUT OF 1
    NUMBER OF reduce tasks determined at compile TIME: 1
    IN ORDER TO CHANGE the average LOAD FOR a reducer (IN bytes):
      SET hive.EXEC.reducers.bytes.per.reducer=<number>
    IN ORDER TO LIMIT the maximum NUMBER OF reducers:
      SET hive.EXEC.reducers.MAX=<number>
    IN ORDER TO SET a constant NUMBER OF reducers:
      SET mapred.reduce.tasks=<number>
    Starting Job = job_201307121751_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201307121751_0002
    KILL Command = /usr/LOCAL/hadoop/libexec/../bin/hadoop job  -KILL job_201307121751_0002
    Hadoop job information FOR Stage-1: NUMBER OF mappers: 1; NUMBER OF reducers: 1
    2013-07-12 18:54:37,120 Stage-1 map = 0%,  reduce = 0%
    2013-07-12 18:54:40,143 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec
    2013-07-12 18:54:41,154 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec
    2013-07-12 18:54:42,162 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec
    2013-07-12 18:54:43,168 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec
    2013-07-12 18:54:44,175 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec
    2013-07-12 18:54:45,181 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec
    2013-07-12 18:54:46,192 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec
    2013-07-12 18:54:47,199 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec
    2013-07-12 18:54:48,208 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 0.82 sec
    2013-07-12 18:54:49,219 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.74 sec
    2013-07-12 18:54:50,238 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.74 sec
    MapReduce Total cumulative CPU TIME: 1 seconds 740 msec
    Ended Job = job_201307121751_0002
    MapReduce Jobs Launched: 
    Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.74 sec   HDFS READ: 202547 HDFS WRITE: 4 SUCCESS
    Total MapReduce CPU TIME Spent: 1 seconds 740 msec
    OK
    999
    TIME taken: 18.573 seconds
    hive> SHOW CREATE TABLE share;
    OK
    CREATE  TABLE share(
      share_id BIGINT, 
      add_time string, 
      tag string)
    COMMENT 'Imported by sqoop on 2013/07/12 19:48:33'
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY 'u0001' 
      LINES TERMINATED BY '
    ' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      'hdfs://localhost:54310/user/hive/warehouse2/share'
    TBLPROPERTIES (
      'numPartitions'='0', 
      'numFiles'='4', 
      'transient_lastDdlTime'='1373629961', 
      'totalSize'='227161', 
      'numRows'='0', 
      'rawDataSize'='0')
    TIME taken: 0.292 seconds
     
    hive>

    再去dfs里面看看:

    [hduser@www lib]$ /hadoop/bin/hadoop dfs -ls /user/hive/warehouse2/share
    Found 5 items
    -rw-r--r--   1 hduser supergroup          0 2013-07-12 18:43 /user/hive/warehouse2/share/_SUCCESS
    -rw-r--r--   1 hduser supergroup      53226 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00000
    -rw-r--r--   1 hduser supergroup      50185 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00001
    -rw-r--r--   1 hduser supergroup      50634 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00002
    -rw-r--r--   1 hduser supergroup      48049 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00003

    乍一看,像是4个partition。 可是细看文件名,并不是partition的那种结构:
    比如:/user/hive/warehouse2/cm/country_name=VN_en

    把文件复制到本地:

    /hadoop/bin/hadoop dfs -get /user/hive/warehouse2/share  ~/
    [hduser@www share]$ cd ~/share
    [hduser@www share]$ wc -l *
       250 part-m-00000
       250 part-m-00001
       249 part-m-00002
       250 part-m-00003
         0 _SUCCESS
       999 total

    –hive-overwrite 可以覆盖hive中的现有数据。
    –hive-table
    Sets the table name to use when importing to Hive.
    可以把数据import到某个分区中:
    –hive-partition-key Name of a hive field to partition are sharded on
    –hive-partition-value String-value that serves as partition key for this imported into hive in this job.

    可以写个cron,周期性地把share表的新增数据导入到hive中了: –append

    参考:

    http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html

    http://www.mysqlperformanceblog.com/2013/07/11/mysql-and-hadoop/

  • 相关阅读:
    【转载】用opencv中的CvEM做GMM的训练
    关于OpenCV做图像处理内存释放的一些问题
    关于C++ vector的拷贝
    VS2015创建的C++程序在Debug模式下不能调试
    OpenCV中IplImage和Mat间的相互转换
    视频资源下载方法[download video resources]
    关于win10安装VisualSVN遇到的一个问题及解决办法
    win7启动出现蓝屏STOP: 0X0000007B
    CentOS下挂在NTFS分区
    硬盘类型和Linux的分区
  • 原文地址:https://www.cnblogs.com/sunwubin/p/3554243.html
Copyright © 2020-2023  润新知