Configuring and using the parameters of Sqoop shell commands (including the options for handling duplicate primary keys)

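1. Introduction to Sqoop

Sqoop translates the Sqoop command written by the user into a MapReduce (MR) program. The MR program either reads data from a relational database and writes it to HDFS, or reads data from HDFS and writes it to a relational database.

To read data from a relational database in an MR program, the input format must be set to DBInputFormat.

To write data to a relational database from an MR program, the output format must be set to DBOutputFormat.

The MR program run by a Sqoop import or export command has only a Map phase and no Reduce phase, because the job only transfers data and does not need to merge or sort it.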

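2. Importing data with Sqoop (from a relational database into HDFS)

The examples below use the table t_emp in the MySQL database test on host hadoop102.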

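2.1 Importing directly into HDFS

2.1.1 Full-table import (or partial import)

# --connect / --username / --password: JDBC URL, user name and password of the relational database
# --table: the source table
# --target-dir: HDFS directory where the imported data is stored
# --delete-target-dir: delete the target directory first if it already exists
# --fields-terminated-by: field delimiter used in the files written to HDFS
# --num-mappers: number of MapTasks to start (default 4)
# --split-by: column used to split the data set; each MapTask handles one slice
# --columns: for a partial import, the columns to import; separate multiple
#            columns with commas and do not add spaces
bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/test \
--username root \
--password 123 \
--table t_emp \
--target-dir /sqoopTest \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 2 \
--split-by id \
--columns id,name,age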
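
To make the interaction between --split-by and --num-mappers more concrete, the sketch below prints the id ranges that each MapTask would roughly be responsible for. Sqoop first queries the minimum and maximum of the split column and then divides that range among the mappers. The helper name split_ranges and the even integer division are assumptions made for illustration; the boundaries Sqoop actually computes may differ.

```shell
#!/bin/sh
# Illustrative only: approximate how the range [min, max] of the split column
# is divided among n mappers. split_ranges is a hypothetical helper, not part of Sqoop.
split_ranges() {
  min=$1; max=$2; n=$3
  span=$(( max - min + 1 ))
  i=0
  while [ "$i" -lt "$n" ]; do
    lo=$(( min + span * i / n ))
    hi=$(( min + span * (i + 1) / n - 1 ))
    echo "id >= $lo AND id <= $hi"
    i=$(( i + 1 ))
  done
}

# ids 1..10 split across 2 mappers
split_ranges 1 10 2
```

If the split column is heavily skewed, some MapTasks receive far more rows than others, which is why an evenly distributed column such as an auto-increment primary key is usually the better choice for --split-by.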

2.1.2 Filtering the imported data with the --where option

# --where: the filter condition; it is best to wrap it in quotes
bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/test \
--username root \
--password 123 \
--table t_emp \
--where 'id>6' \
--target-dir /sqoopTest \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 1 \
--split-by id

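2.1.3 Importing with a query statement

# --query: it is best to wrap the query in single quotes. If double quotes are
#          used, $CONDITIONS must be preceded by an escape character (\$CONDITIONS)
#          so that the shell does not treat it as one of its own variables.
bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/test \
--username root \
--password 123 \
--query 'select * from t_emp where id>3 and $CONDITIONS' \
--target-dir /sqoopTest \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 1 \
--split-by id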

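Notes:

1. --query takes the place of --table, --columns and --where, so they should not be combined with it:

  --query and --table must never be used together.

  When --where and --query are both present, --where is ignored.

  When --columns and --query are both present, --columns still takes effect.

2. --query must be accompanied by --target-dir.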
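
The quoting rule for $CONDITIONS can be checked without running Sqoop at all. The following sketch only builds the query string in three ways and prints what the shell would actually pass on; the variable names single, escaped and unescaped are chosen here for illustration.

```shell
#!/bin/sh
# Make sure the shell has no variable named CONDITIONS, as in a normal session.
unset CONDITIONS

# Single quotes: the shell leaves $CONDITIONS alone, so Sqoop receives the placeholder.
single='select * from t_emp where id>3 and $CONDITIONS'

# Double quotes with a backslash: same result as single quotes.
escaped="select * from t_emp where id>3 and \$CONDITIONS"

# Double quotes without a backslash: the shell expands $CONDITIONS to an empty string.
unescaped="select * from t_emp where id>3 and $CONDITIONS"

echo "single:    $single"
echo "escaped:   $escaped"
echo "unescaped: $unescaped"
```

In the third case the placeholder is gone before Sqoop ever sees the query, so Sqoop has nothing to replace with its own split condition.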

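2.2 Importing into Hive

# --fields-terminated-by: without an explicit delimiter the data is stored with
#                         Hive's default delimiter (the non-printing character \001),
#                         which is awkward to work with later, so setting one is recommended
# --hive-import: import into Hive
# --hive-overwrite: overwrite the existing data; without it the data is appended
# --hive-table: name of the target Hive table
bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/test \
--username root \
--password 123 \
--query 'select * from t_emp where id>3 and $CONDITIONS' \
--target-dir /sqoopTest \
--fields-terminated-by "\t" \
--delete-target-dir \
--hive-import \
--hive-overwrite \
--hive-table t_emp \
--num-mappers 1 \
--split-by id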

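Under the hood this still happens in two steps: the data is first imported from the relational database into HDFS, and then moved from HDFS into Hive, at which point the files under the HDFS target directory are removed.

Note: if the table does not exist in Hive it is created automatically, but the column types are generated automatically, so creating the table by hand is still recommended.

The two steps can also be run separately.

First import into HDFS: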

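#!/bin/bash
# Assumes $sqoop (path to the sqoop executable) and $do_date (the date used in
# the target path) are set earlier in the script.
# $1 = table name, $2 = query text
# --compress / --compression-codec lzop: compress the output using the lzop codec
# --null-string / --null-non-string: write NULL values of string and non-string
#                                    columns as \N so that Hive reads them as NULL
import_data(){
$sqoop import \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root \
--password 123 \
--target-dir /origin_data/gmall/db/$1/$do_date \
--delete-target-dir \
--query "$2 and \$CONDITIONS" \
--num-mappers 1 \
--fields-terminated-by '\t' \
--compress \
--compression-codec lzop \
--null-string '\N' \
--null-non-string '\N'
}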

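Then use the load data command to load the files into Hive.

Note: this script handles NULL values explicitly. Hive stores NULL as "\N" in its underlying files, whereas MySQL stores NULL as NULL. To keep the data consistent on both sides, use --input-null-string and --input-null-non-string when exporting, and --null-string and --null-non-string when importing.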
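
The load data step mentioned above could look roughly like the following. This is a sketch under assumptions: the helper build_load_sql is hypothetical, the Hive table is assumed to have the same name as the source table and to exist already with a matching field delimiter, and the table name and date in the call are example values.

```shell
#!/bin/sh
# Hypothetical helper: build the HiveQL that loads one day's import directory
# into a Hive table. It assumes the Hive table has the same name as the
# source table, which the original script does not state.
build_load_sql() {
  table=$1
  do_date=$2
  echo "LOAD DATA INPATH '/origin_data/gmall/db/$table/$do_date' OVERWRITE INTO TABLE $table;"
}

# Print the statement; t_emp and the date are example values.
build_load_sql t_emp 2020-03-24

# To run it against Hive (requires a working Hive installation):
# hive -e "$(build_load_sql t_emp 2020-03-24)"
```

Printing the statement first makes it easy to check the path before Hive moves the files out of the import directory.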

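2.3 Importing into HBase

# --hbase-create-table: create the HBase table if it does not exist
# --hbase-table: name of the table in HBase
# --hbase-row-key: column of the imported data to use as the row key
# --column-family: column family to import into
bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/test \
--username root \
--password 123 \
--query 'select * from t_emp where id>3 and $CONDITIONS' \
--target-dir /sqoopTest \
--delete-target-dir \
--hbase-create-table \
--hbase-table "t_emp" \
--hbase-row-key "id" \
--column-family "info" \
--num-mappers 2 \
--split-by id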

1. When automatic table creation is enabled, an error is reported if the Sqoop and HBase versions are incompatible:

20/03/24 13:51:24 INFO mapreduce.HBaseImportJob: Creating missing HBase table t_emp
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HTableDescriptor.addFamily(Lorg/apache/hadoop/hbase/HColumnDescriptor;)V

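In that case, either create the table manually or recompile Sqoop from source.

2. To import into multiple column families, the command has to be run several times, one column family per run.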
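
Running one import per column family can be scripted with a simple loop. The sketch below is a dry run that only prints the commands. The mapping of columns to the families info and job is hypothetical (it assumes t_emp also has the columns deptid and empno), and each run includes id because it serves as the row key.

```shell
#!/bin/sh
# Dry run: print one sqoop import command per column family instead of running it.
# The family-to-column mapping below is hypothetical.
build_cmd() {
  family=${1%%:*}   # text before the first colon
  cols=${1#*:}      # text after the first colon
  echo "bin/sqoop import --connect jdbc:mysql://hadoop102:3306/test" \
       "--username root --password 123 --table t_emp --columns $cols" \
       "--hbase-table t_emp --hbase-row-key id --column-family $family" \
       "--num-mappers 1"
}

for mapping in "info:id,name,age" "job:id,deptid,empno"; do
  build_cmd "$mapping"
done
```

Once the printed commands look right, they can be executed one after another; the HBase table and both column families need to exist beforehand if automatic creation is not used.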

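3. Exporting

Export data from HDFS into a relational database.

3.1 When the SQL table is empty

# --table: the target table; it must be created in advance
# --export-dir: HDFS path of the data to export
# --input-fields-terminated-by: field delimiter of the data on HDFS
bin/sqoop export \
--connect 'jdbc:mysql://hadoop102:3306/test?useUnicode=true&characterEncoding=utf-8' \
--username root \
--password 123 \
--table t_emp2 \
--num-mappers 1 \
--export-dir /user/hive/warehouse/t_emp \
--input-fields-terminated-by "\t"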
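
Because --input-fields-terminated-by has to match the delimiter that the files on HDFS really use, it can help to check a sample record before exporting. The sketch below does this on two made-up records held in a shell variable; the record values are examples only, and in practice the sample would be read from HDFS as shown in the comment.

```shell
#!/bin/sh
# Two example records in the tab-delimited layout that the export expects.
# In practice the sample would come from: hdfs dfs -cat /user/hive/warehouse/t_emp/* | head
sample=$(printf '5\tjack\t30\n6\trose\t28\n')

# Count the fields of the first record when splitting on tabs.
fields=$(printf '%s\n' "$sample" | awk -F '\t' 'NR == 1 { print NF }')
echo "fields per record: $fields"
```

If the count is 1, the file is most likely using a different delimiter than the one passed to Sqoop, and the export would fail to parse the records.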

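3.2 When the table is not empty

If the primary key of a row being inserted conflicts with the primary key of a row already in the table, an error is reported: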

Duplicate entry '5' for key 'PRIMARY'

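In plain SQL this can be handled with:

INSERT INTO t_emp2 VALUES (5,'jack',30,3,1111)
ON DUPLICATE KEY UPDATE NAME=VALUES(NAME),deptid=VALUES(deptid),
empno=VALUES(empno);

This means that when an insert hits a duplicate primary key, the existing row is updated instead of a new row being inserted.

With Sqoop, one of the following two modes can be enabled.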

3.2.1 updateonly mode

bin/sqoop export \
--connect 'jdbc:mysql://hadoop103:3306/mydb?useUnicode=true&characterEncoding=utf-8' \
--username root \
--password 123456 \
--table t_emp2 \
--num-mappers 1 \
--export-dir /hive/t_emp \
--input-fields-terminated-by "\t" \
--update-key id

With --update-key alone (the default mode, updateonly), rows whose key already exists in the table are updated, but rows whose key does not exist are not inserted.

3.2.2 allowinsert mode

bin/sqoop export \
--connect 'jdbc:mysql://hadoop103:3306/mydb?useUnicode=true&characterEncoding=utf-8' \
--username root \
--password 123456 \
--table t_emp2 \
--num-mappers 1 \
--export-dir /hive/t_emp \
--input-fields-terminated-by "\t" \
--update-key id \
--update-mode allowinsert

In this mode, rows whose key already exists are updated, and rows whose key does not exist are inserted as well.
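
As a rough mental model of the two modes, the sketch below prints the kind of SQL statement each one corresponds to on MySQL. The statement shapes are an assumption for illustration, with a shortened column list; the helper export_sql_shape is hypothetical, and the statements Sqoop really issues should be confirmed in the binlog.

```shell
#!/bin/sh
# Print the approximate SQL shape behind each --update-mode (illustrative only;
# the column list is shortened and the exact statements are an assumption).
export_sql_shape() {
  case "$1" in
    updateonly)
      echo "UPDATE t_emp2 SET name=?, age=? WHERE id=?" ;;
    allowinsert)
      echo "INSERT INTO t_emp2 (id, name, age) VALUES (?, ?, ?) ON DUPLICATE KEY UPDATE name=VALUES(name), age=VALUES(age)" ;;
    *)
      echo "unknown mode: $1" >&2
      return 1 ;;
  esac
}

export_sql_shape updateonly
export_sql_shape allowinsert
```

An UPDATE with a WHERE clause on the key simply matches nothing when the key is absent, which is consistent with updateonly silently skipping new rows.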

3.3 How to see what the export command actually executes

3.3.1 Configure /etc/my.cnf

[mysqld]
# enable the binary log (binlog)
log-bin=mysql-bin

3.3.2 Restart the MySQL service

3.3.3 Go to /var/lib/mysql and run

sudo mysqlbinlog mysql-bin.000001

Reference blog

https://my.oschina.net/u/1765168/blog/1593343

Original article: https://www.cnblogs.com/yangxusun9/p/12558683.html