Sqoop is a commonly used tool for offline synchronization of relational database data into a data warehouse.
There are two ways to import with Sqoop:
1) Import into HDFS first, then load the data into a table
2) Import directly into Hive
Method 1: Import into HDFS first, then load into the table
Step 1: Use Sqoop to import the data of a MySQL table into HDFS.
Export the first 10 rows of the test table, keeping only the three columns id, name, and text.
The data is stored under the HDFS directory /user/hadoop.
bin/sqoop import \
  --connect jdbc:mysql://127.0.0.1:3306/dbtest \
  --username root \
  --password root \
  --query 'select id, name, text from test where $CONDITIONS LIMIT 10' \
  --target-dir /user/hadoop \
  --delete-target-dir \
  --num-mappers 1 \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --direct \
  --fields-terminated-by ' '
If the source database is MySQL, you can add the --direct option; it makes exporting data from a MySQL table faster, because it uses MySQL's own export utility (mysqldump) instead of plain JDBC.
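For a whole-table import, which is the more typical case for --direct, the same idea can be sketched with --table, --columns, and --where instead of --query. The command below reuses the connection and names from the example above; the WHERE clause is only an illustration and is not an exact equivalent of LIMIT 10:
bin/sqoop import \
  --connect jdbc:mysql://127.0.0.1:3306/dbtest \
  --username root \
  --password root \
  --table test \
  --columns 'id,name,text' \
  --where 'id <= 10' \
  --target-dir /user/hadoop \
  --delete-target-dir \
  --num-mappers 1 \
  --direct \
  --fields-terminated-by ' '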
Step 2: Start Hive and create a table in it.
drop table if exists default.hive_test;
create table default.hive_test(
  id int,
  name string,
  text string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
Step 3: Load the data from HDFS into Hive. (After the load, the original data is no longer in the HDFS source directory, because LOAD DATA INPATH moves the files into the Hive warehouse.)
load data inpath '/user/hadoop' into table default.hive_test ;
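A quick way to confirm that the files were moved rather than copied (an extra check, not part of the original steps) is to list the source directory after the load:
hdfs dfs -ls /user/hadoop   # after the load this no longer shows the imported part files; Hive moved them into its warehouse directory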
Step 4: Query the hive_test table.
select * from hive_test;
Method 2: Import directly into Hive
Step 1: Create and edit a script file: vi hivetest.sql
use test;
drop table if exists test.hive_test;
create table test.hive_test(
  id int,
  name string,
  text string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
Step 2: Execute the SQL script when starting Hive:
hive -f /root/hivetest.sql
[root@cdh01 ~]# hive -f /root/hivetest.sql
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/hive-common-2.1.1-cdh6.3.2.jar!/hive-log4j2.properties Async: false
OK
Time taken: 4.816 seconds
OK
Time taken: 0.19 seconds
OK
Time taken: 0.725 seconds
[root@cdh01 ~]#
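To double-check that the script created the table, a quick inline query can be run from the shell (an extra verification step, not part of the original write-up):
hive -e 'use test; show tables;'
hive -e 'desc test.hive_test;'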
Step 3: Run Sqoop to import directly into Hive. Note that the IP in the JDBC URL cannot be localhost.
sqoop import \
  --connect jdbc:mysql://IP:3306/test \
  --username root \
  --password root \
  --table test \
  --fields-terminated-by ' ' \
  --delete-target-dir \
  --num-mappers 1 \
  --hive-import \
  --hive-database test \
  --hive-table hive_test
Parameter descriptions:
# (required) Sqoop import; -D passes a Hadoop property, here the name of the YARN queue the cluster should use
sqoop import -D mapred.job.queue.name=q
# (required) MySQL JDBC connection string: xxxx host/yy database?MySQL options (separate multiple options with &)
# The tinyInt1isBit=false option fixes TINYINT(1) columns imported from MySQL ending up as null in Hive (tinyint)
--connect jdbc:mysql:xxxx/yy?tinyInt1isBit=false
# (required) username, password, and the source table
--username xx --password xx --table xx --delete-target-dir
# (optional) the default file format is textfile; parquet is also possible
--as-textfile
# (optional) select only the columns you want
--columns 'id,title'
# (required) import into Hive
--hive-import
# (required) field delimiter; it must match the delimiter of the target Hive table
--fields-terminated-by ' '
# (required) Hive database and table name
--hive-database xx --hive-table xx
# (required) turn null string and non-string values into real Hive NULLs
--null-string '\N' --null-non-string '\N'
# (optional) partition information for the Hive table; skip this if the table is not partitioned
--hive-partition-key dt --hive-partition-value $day
# how to write into Hive
--hive-overwrite --num-mappers 1
# (required) drop \n, \r, and \01 from string fields when importing into Hive
--hive-drop-import-delims
# Sqoop generates Java files during the import; specifying their output directory makes cleanup and management easier
--outdir xxx
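Putting these options together, a complete command could look roughly like the sketch below. The queue name, column list, partition value, and --outdir path are placeholders chosen for illustration rather than values from the original example; the partition options only make sense if the target table is partitioned; and the doubled backslashes in --null-string/--null-non-string follow the escaping convention shown in the Sqoop user guide:
sqoop import -D mapred.job.queue.name=default \
  --connect 'jdbc:mysql://192.168.230.101:3306/test?tinyInt1isBit=false' \
  --username root --password root \
  --table test \
  --columns 'id,name,text' \
  --as-textfile \
  --delete-target-dir \
  --fields-terminated-by ' ' \
  --null-string '\\N' --null-non-string '\\N' \
  --hive-import \
  --hive-database test --hive-table hive_test \
  --hive-partition-key dt --hive-partition-value '2020-08-14' \
  --hive-overwrite \
  --hive-drop-import-delims \
  --num-mappers 1 \
  --outdir /tmp/sqoop_codegen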
# Create the target table at import time (this particular flag applies to HBase imports; the Hive counterpart is --create-hive-table, listed below)
--hbase-create-table
Hive import related parameters:
--hive-database            Hive database name
--hive-table               Hive table name
--hive-home                Override $HIVE_HOME
--hive-import              Import data into Hive, using Hive's default delimiters
--hive-overwrite           Overwrite existing data in the Hive table
--create-hive-table        Create the Hive table; if the table already exists, the job fails!
--hive-table [table]       Set the table name to use in Hive
--hive-drop-import-delims  Drop \n, \r, and \01 from string fields when importing into Hive
--hive-delims-replacement  Replace \n, \r, and \01 in string fields with a user-defined string when importing into Hive
--hive-partition-key       Key of the Hive partition
--hive-partition-value     Value of the Hive partition
--map-column-hive          Type mapping: map SQL column types to Hive column types
--query 'select * from test where id >10 and $CONDITIONS'   Free-form SQL query; the $CONDITIONS token is mandatory
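As a concrete illustration of the --query form (the --split-by column, target directory, and mapper count below are assumptions for the sketch, not values from the original):
sqoop import \
  --connect jdbc:mysql://192.168.230.101:3306/test \
  --username root --password root \
  --query 'select * from test where id > 10 and $CONDITIONS' \
  --split-by id \
  --target-dir /user/hadoop/test_query \
  --delete-target-dir \
  --fields-terminated-by ' ' \
  --num-mappers 2 \
  --hive-import \
  --hive-database test --hive-table hive_test
With --query, --target-dir is required, and --split-by is needed whenever more than one mapper is used.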
Execution:
[root@cdh01 ~]# sqoop import \
> --connect jdbc:mysql://192.168.230.101:3306/test \
> --username root \
> --password root \
> --table test \
> --fields-terminated-by ' ' \
> --delete-target-dir \
> --num-mappers 5 \
> --hive-import \
> --hive-database test \
> --hive-table hive_test
Warning: /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/08/14 14:47:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.2
20/08/14 14:47:43 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/08/14 14:47:43 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/08/14 14:47:43 INFO tool.CodeGenTool: Beginning code generation
Fri Aug 14 14:47:44 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
20/08/14 14:47:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1
20/08/14 14:47:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1
20/08/14 14:47:45 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
20/08/14 14:48:28 INFO mapreduce.Job: Job job_1597287071548_0008 running in uber mode : false
20/08/14 14:48:28 INFO mapreduce.Job: map 0% reduce 0%
20/08/14 14:48:49 INFO mapreduce.Job: map 20% reduce 0%
20/08/14 14:49:08 INFO mapreduce.Job: map 40% reduce 0%
20/08/14 14:49:25 INFO mapreduce.Job: map 60% reduce 0%
20/08/14 14:49:42 INFO mapreduce.Job: map 80% reduce 0%
20/08/14 14:50:00 INFO mapreduce.Job: map 100% reduce 0%
20/08/14 14:50:00 INFO mapreduce.Job: Job job_1597287071548_0008 completed successfully
20/08/14 14:50:00 INFO mapreduce.Job: Counters: 33
20/08/14 14:50:05 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/d2cee9ab-0c69-4748-9259-45a1ed4e38b2
20/08/14 14:50:05 INFO session.SessionState: Created local directory: /tmp/root/d2cee9ab-0c69-4748-9259-45a1ed4e38b2
20/08/14 14:50:05 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/d2cee9ab-0c69-4748-9259-45a1ed4e38b2/_tmp_space.db
20/08/14 14:50:05 INFO conf.HiveConf: Using the default value passed in for log id: d2cee9ab-0c69-4748-9259-45a1ed4e38b2
20/08/14 14:50:05 INFO session.SessionState: Updating thread name to d2cee9ab-0c69-4748-9259-45a1ed4e38b2 main
20/08/14 14:50:05 INFO conf.HiveConf: Using the default value passed in for log id: d2cee9ab-0c69-4748-9259-45a1ed4e38b2
20/08/14 14:50:06 INFO ql.Driver: Compiling command(queryId=root_20200814145005_35ad04c9-8567-4295-b490-96d4b495a085): CREATE TABLE IF NOT EXISTS `test`.`hive_test` ( `id` INT, `name` STRING, `text` STRING) COMMENT 'Imported by sqoop on 2020/08/14 14:50:01' ROW FORMAT DELIMITED FIELDS TERMINATED BY '