一,sqoop安装
1.解压源码包 2.配置环境变量 3.在bin目录下的 /bin/configsqoop 注释掉check报错信息 4.配置conf目录下 /conf/sqoop-env.sh 配置hadoop和hive家目录 5.导入依赖的jar包至lib目录下 mysql-connector-java-5.1.46-bin.jar /share/hadoop/common/hadoop-common-2.7.2.jar 6.连接测试
二、命令介绍
version import list-databases list-tables create-hive-table 在hive中仅创建表结构 --connect 连接url ,在hive-site.xml中配置连接参数 --username --password --table --columns --query where and $CONDITIONS (没有就会报错) --delete-target-dir 目标文件若存在在删除 --target-dir --fields-terminated-by 默认逗号 --lines-terminated-by 默认换行符 --hive-import 以Hive的方式进行导入 --hive-database 导入到指定的Hive库中 --external-table-dir <hdfs path>:以外部表的形式创建 --hive-table 导入到指定的Hive表中 --hive-overwrite 是否覆盖导入 --split-by 指定分任务的操作列名 -m {d} 指定maptask数目(默认4)
三、数据导入
1.数据导入到HDFS
1.指定命令 -- (声明配置项 - 指定参数) 2.命令语句 -> 翻译成程序 -> 打成jar包 3.将jar包(核心程序 - 依赖jar包)提交至集群 4.简单的数据导出:只有Map阶段 -> 数据源:关系型数据库的数据(结构化的数据) 5.单表数据导出:将数据全部查出,拼接分隔符输出,Reduce阶段原样输出 6.默认使用4个MapTask执行 -> 产生了四个结果文件 -> 通过-m参数实现 7.默认使用第一列作为分割任务的列 -> 1(MIN(id))-100(MAX(id)) -> 确定固定的区间 -> --split-by指定分割列 8.在相应的目录下生成结果 -> 使用SQL(聚合函数运算或UDAF) -> -m 1
2.数据导入到HIVE
- 先将数据文件导入到HDFS上,产生一个临时文件
- 成功后将数据上传到HIVE,成功则删除hdfs上的临时文件
1.导入至HDFS ->临时文件的路径:当前执行sqoop命令的家目录 -> 目录名称:不指定自动生成/也可以手动指定 2.导入为HDFS时指定的部分参数会被Hive导入时进行读取(分隔符) 3.如果目标表不存在则创建(数据表结构信息(数据源的表结构) - 命令指定的相关参数) 4.如果目标表存在:默认追加导入,数据结构保持一致,目标Hive表的相关参数(分隔符) 5.导入数据 -> 将中间目录的文件夹下的数据文件移动至内部表目录下
3.打印信息相关理解
1 sqoop import --connect jdbc:mysql://sz01:3306/test --username root --password root --table make --delete-target-dir --target-dir /home/haha 2 18/09/18 11:36:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7 1.sqoop版本 3 18/09/18 11:36:00 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 4 18/09/18 11:36:00 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset. 2.获取mysql的结果集 5 18/09/18 11:36:00 INFO tool.CodeGenTool: Beginning code generation 3.sql代码生成 6 18/09/18 11:36:01 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `make` AS t LIMIT 1 7 18/09/18 11:36:01 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `make` AS t LIMIT 1 8 18/09/18 11:36:01 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/bigdata/hadoop-2.7.2 4.hadoop-->MR 9 Note: /tmp/sqoop-bigdata/compile/5b8add033feeb0f72f4b3eac44ff00e9/make.java uses or overrides a deprecated API. 10 Note: Recompile with -Xlint:deprecation for details. 11 18/09/18 11:36:04 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-bigdata/compile/5b8add033feeb0f72f4b3eac44ff00e9/make.jar 5生成jar文件 12 18/09/18 11:36:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 13 18/09/18 11:36:05 INFO tool.ImportTool: Destination directory /home/haha is not present, hence not deleting. 6.确认目标路径 14 18/09/18 11:36:05 WARN manager.MySQLManager: It looks like you are importing from mysql. 15 18/09/18 11:36:05 WARN manager.MySQLManager: This transfer can be faster! Use the --direct 16 18/09/18 11:36:05 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path. 17 18/09/18 11:36:05 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql) 18 18/09/18 11:36:05 INFO mapreduce.ImportJobBase: Beginning import of make 7.开始导入数据 19 18/09/18 11:36:05 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 8.开始替换MR 20 18/09/18 11:36:05 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 21 18/09/18 11:36:06 INFO client.RMProxy: Connecting to ResourceManager at sz01/192.168.18.130:8032 9.连接RM 22 18/09/18 11:36:11 INFO db.DBInputFormat: Using read commited transaction isolation 23 18/09/18 11:36:11 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `make` 10.划分任务 24 18/09/18 11:36:11 INFO db.IntegerSplitter: Split size: 67; Num splits: 4 from: 1 to: 272 25 18/09/18 11:36:11 INFO mapreduce.JobSubmitter: number of splits:4 26 18/09/18 11:36:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1537076834131_0011 11.提交JOB 27 18/09/18 11:36:12 INFO impl.YarnClientImpl: Submitted application application_1537076834131_0011 28 18/09/18 11:36:12 INFO mapreduce.Job: The url to track the job: http://sz01:8088/proxy/application_1537076834131_0011/ 29 18/09/18 11:36:12 INFO mapreduce.Job: Running job: job_1537076834131_0011 30 18/09/18 11:36:27 INFO mapreduce.Job: Job job_1537076834131_0011 running in uber mode : false 12.执行job 31 18/09/18 11:36:27 INFO mapreduce.Job: map 0% reduce 0% 32 18/09/18 11:36:40 INFO mapreduce.Job: map 25% reduce 0% 33 18/09/18 11:36:41 INFO mapreduce.Job: map 50% reduce 0% 34 18/09/18 11:36:48 INFO mapreduce.Job: map 75% reduce 0% 35 18/09/18 11:36:50 INFO mapreduce.Job: map 100% reduce 0% 36 18/09/18 11:36:51 INFO mapreduce.Job: Job job_1537076834131_0011 completed successfully 37 18/09/18 11:36:52 INFO mapreduce.Job: Counters: 31 13.打印counters 38 File System Counters 39 FILE: Number of bytes read=0 40 FILE: Number of bytes written=548044 41 FILE: Number of read operations=0 42 FILE: Number of large read operations=0 43 FILE: Number of write operations=0 44 HDFS: Number of bytes read=405 45 HDFS: Number of bytes written=2279 46 HDFS: Number of read operations=16 47 HDFS: Number of large read operations=0 48 HDFS: Number of write operations=8 49 Job Counters 50 Killed map tasks=1 51 Launched map tasks=4 52 Other local map tasks=4 53 Total time spent by all maps in occupied slots (ms)=34012 54 Total time spent by all reduces in occupied slots (ms)=0 55 Total time spent by all map tasks (ms)=34012 56 Total vcore-milliseconds taken by all map tasks=34012 57 Total megabyte-milliseconds taken by all map tasks=34828288 58 Map-Reduce Framework 59 Map input records=184 60 Map output records=184 61 Input split bytes=405 62 Spilled Records=0 63 Failed Shuffles=0 64 Merged Map outputs=0 65 GC time elapsed (ms)=470 66 CPU time spent (ms)=5830 67 Physical memory (bytes) snapshot=417902592 68 Virtual memory (bytes) snapshot=8252805120 69 Total committed heap usage (bytes)=121896960 70 File Input Format Counters 71 Bytes Read=0 72 File Output Format Counters 73 Bytes Written=2279 74 18/09/18 11:36:52 INFO mapreduce.ImportJobBase: Transferred 2.2256 KB in 46.2286 seconds (49.2985 bytes/sec) 75 18/09/18 11:36:52 INFO mapreduce.ImportJobBase: Retrieved 184 records. 76 77 78 18/09/18 13:24:59 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table make 79 18/09/18 13:24:59 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `make` AS t LIMIT 1 80 18/09/18 13:24:59 INFO hive.HiveImport: Loading uploaded data into Hive 81 18/09/18 13:25:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 executed 82 83 Logging initialized using configuration in jar:file:/home/bigdata/apache-hive-1.2.2-bin/lib/hive-jdbc-1.2.2-standalone.jar!/hive-log4j.properties 84 OK 85 Time taken: 5.158 seconds 86 Loading data to table user1.make 87 Table user1.make stats: [numFiles=4, totalSize=2279] 88 OK 89 Time taken: 1.244 seconds
四、日志记录
1.自己配置
conf目录下放入log4j.properties -> 日志级别,日志文件产生的目录以及名称
lib目录下放入log4j相关jar包
2.sqoop拉取数据时,实际是MR和HIVE数据导入导出
可查看hadoop的logs/userlogs 和hive的日志