1. Download sqoop-1.4.6-bin.tar.gz and unpack it.
2. Edit conf/sqoop-env.sh and set the following variables:
export HADOOP_COMMON_HOME=/usr/local/hadoop-2.6.3
export HADOOP_MAPRED_HOME=/usr/local/hadoop-2.6.3
export HBASE_HOME=/usr/local/hbase-1.1.3
export HIVE_HOME=/usr/local/hive-2.0.0
#export ZOOCFGDIR=
Alternatively, set the same variables in the user's own environment.
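For example (a minimal sketch, assuming a bash login shell and the same install paths as above), the equivalent settings in ~/.bashrc would be:
$ cat >> ~/.bashrc <<'EOF'
export HADOOP_COMMON_HOME=/usr/local/hadoop-2.6.3
export HADOOP_MAPRED_HOME=/usr/local/hadoop-2.6.3
export HBASE_HOME=/usr/local/hbase-1.1.3
export HIVE_HOME=/usr/local/hive-2.0.0
EOF
$ source ~/.bashrc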
II. Using Sqoop
Sqoop does its work through the various tools under bin/.
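For example, the list of available tools (import, export, list-databases, list-tables, and so on) can be printed with the built-in help tool, and each tool documents its own options:
$ bin/sqoop help
$ bin/sqoop help import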
1. Connecting to a database
Arguments:
Argument | Description
--connect <jdbc-uri> | Specify JDBC connect string |
--connection-manager <class-name> | Specify connection manager class to use |
--driver <class-name> | Manually specify JDBC driver class to use |
--hadoop-mapred-home <dir> | Override $HADOOP_MAPRED_HOME |
--help | Print usage instructions |
--password-file | Set path for a file containing the authentication password |
-P | Read password from console |
--password <password> | Set authentication password |
--username <username> | Set authentication username |
--verbose | Print more information while working |
--connection-param-file <filename> | Optional properties file that provides connection parameters |
--relaxed-isolation | Set connection transaction isolation to read uncommitted for the mappers |
$ sqoop import --connect jdbc:mysql://database.example.com/employees
In the --connect argument, do not use localhost as the hostname; otherwise every node would query the database on its own machine.
The secure way to authenticate is to write the database password into a file under /home/${user} and give it 400 permissions, for example:
$ sqoop import --connect jdbc:mysql://database.example.com/employees --username venkatesh --password-file ${user.home}/.password
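The password file itself can be created like this (a minimal sketch; the password value is a placeholder):
$ echo -n "your_db_password" > ${HOME}/.password
$ chmod 400 ${HOME}/.password
The -n matters: a trailing newline written into the file would become part of the password.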
2. Importing data into HDFS
Arguments:
--append | Append data to an existing dataset in HDFS |
--as-avrodatafile | Imports data to Avro Data Files |
--as-sequencefile | Imports data to SequenceFiles |
--as-textfile | Imports data as plain text (default) |
--as-parquetfile | Imports data to Parquet Files |
--boundary-query <statement> | Boundary query to use for creating splits (default: select min(<split-by>), max(<split-by>) from <table name>) |
--columns <col,col,col…> | Columns to import from the table, e.g. --columns "name,employee_id,jobtitle" |
--delete-target-dir | Delete the import target directory if it exists |
--direct | Use direct connector if exists for the database |
--fetch-size <n> | Number of entries to read from database at once. |
--inline-lob-limit <n> | Set the maximum size for an inline LOB |
-m,--num-mappers <n> | Use n map tasks to import in parallel |
-e,--query <statement> | Import the results of statement |
--split-by <column-name> | Column of the table used to split work units (spreads the load across mappers on that column). Cannot be used with the --autoreset-to-one-mapper option. |
--autoreset-to-one-mapper | Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with the --split-by <col> option. |
--table <table-name> | Table to read |
--target-dir <dir> | HDFS destination dir |
--warehouse-dir <dir> | HDFS parent for table destination |
--where <where clause> | WHERE clause to use during import |
-z,--compress | Enable compression |
--compression-codec <c> | Use Hadoop codec (default gzip) |
--null-string <null-string> | The string to be written for a null value for string columns |
--null-non-string <null-string> | The string to be written for a null value for non-string columns |
Examples:
bin/sqoop list-databases --connect jdbc:mysql://yangxw:3306/mysql --username root --password root
bin/sqoop import --connect jdbc:mysql://yangxw:3306/classicmodels --username root --password root --table customers --target-dir /mysql_hadoop
$ sqoop import --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' --split-by a.id --target-dir /user/foo/joinresults
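A further sketch combining a few of the options from the table above (--columns, --where, --split-by, -m; the column names here are assumptions for illustration, not taken from the actual schema):
$ sqoop import --connect jdbc:mysql://yangxw:3306/classicmodels --username root --password root --table customers --columns "customerNumber,customerName,country" --where "country = 'USA'" --split-by customerNumber -m 4 --target-dir /mysql_hadoop/customers_usa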
Other options:
Controlling column types:
$ sqoop import ... --map-column-java id=String,value=Integer
Incremental imports:
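A minimal sketch of an append-mode incremental import (the check column and last value here are placeholders):
$ bin/sqoop import --connect jdbc:mysql://yangxw:3306/classicmodels --username root --password root --table orders --target-dir /mysql_hadoop/orders_inc --incremental append --check-column orderNumber --last-value 10100
--check-column names the column examined for new rows, --incremental chooses between append and lastmodified mode, and --last-value is the largest value already imported.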
Large objects (BLOB, CLOB):
LOBs smaller than 16 MB are stored inline with the rest of the data; larger objects are written to files in the _lobs subdirectory of the import target, in a format different from regular row data, and each stored object can hold up to 2^63 bytes.
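The 16 MB threshold is the default of the --inline-lob-limit option listed above; a sketch of raising it to 32 MB (the value is in bytes, and the table name is reused from the earlier examples purely for illustration):
$ bin/sqoop import --connect jdbc:mysql://yangxw:3306/classicmodels --username root --password root --table productlines --target-dir /mysql_hadoop/productlines --inline-lob-limit 33554432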
3. Importing into Hive
Steps of a Hive import:
DBMS -> HDFS -> Hive (LOAD DATA INPATH)
Arguments:
--hive-home <dir> | Override $HIVE_HOME |
--hive-import | Import tables into Hive (uses Hive's default delimiters if none are set) |
--hive-overwrite | Overwrite existing data in the Hive table |
--create-hive-table | If set, the job will fail if the target Hive table already exists. By default this property is false. |
--hive-table <table-name> | Sets the table name to use when importing to Hive |
--hive-drop-import-delims | Drops \n, \r, and \01 from string fields when importing to Hive |
--hive-delims-replacement | Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive |
--hive-partition-key | Name of the Hive field on which the partitions are sharded |
--hive-partition-value <v> | String value that serves as the partition key for this job's import into Hive |
--map-column-hive <map> | Override default mapping from SQL type to Hive type for configured columns |
Example:
bin/sqoop import --connect jdbc:mysql://yangxw:3306/classicmodels --username root --password root --table products --hive-import --create-hive-table
If the source data is compressed, the import into Hive may not be splittable (so it cannot run in parallel); lzop-compressed files, however, can be split and processed in parallel.
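A sketch of a partitioned Hive import using --hive-partition-key and --hive-partition-value from the table above (the partition column name and value are made up for illustration):
$ bin/sqoop import --connect jdbc:mysql://yangxw:3306/classicmodels --username root --password root --table orders --hive-import --hive-table orders_part --hive-partition-key load_date --hive-partition-value '2016-03-24'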
4. Importing into HBase
Related arguments:
--column-family <family> | Sets the target column family for the import |
--hbase-create-table | If specified, create missing HBase tables |
--hbase-row-key <col> | Specifies which input column to use as the row key. If the input table contains a composite key, <col> must be a comma-separated list of composite key attributes |
--hbase-table <table-name> | Specifies an HBase table to use as the target instead of HDFS |
--hbase-bulkload | Enables bulk loading |
Sqoop writes the data into HBase by turning each input row into an HBase Put operation. By default the split key is used as the row key; if none is defined, Sqoop tries the primary key. If the source table has a composite key, --hbase-row-key must be set to that comma-separated composite key. If the target table or column family does not exist in HBase, the job fails; adding --hbase-create-table fixes this. When --hbase-create-table is not used, --column-family must still be set, and all output columns go into that single column family.
Each value is serialized as a UTF-8 string before being written to HBase, and null values in columns other than the row key are skipped. To reduce the load on the region servers, bulk loading (--hbase-bulkload) can be used; a sketch of that variant follows the example below.
Example:
- bin/sqoop import --connect jdbc:mysql://yangxw:3306/classicmodels --username root --password root --table orders --target-dir /mysql_hadoop/orders4 --hbase-table orders --column-family orders --hbase-create-table
- This failed with the following error (the HBase table could not be created):
16/03/24 18:30:23 INFO mapreduce.HBaseImportJob: Creating missing HBase table orders
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HTableDescriptor.addFamily(Lorg/apache/hadoop/hbase/HColumnDescriptor;)V
at org.apache.sqoop.mapreduce.HBaseImportJob.jobSetup(HBaseImportJob.java:222)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:264)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
at org.apache.sqoop.manager.MySQLManager.importTable(MySQLManager.java:118)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
This is probably caused by a version incompatibility between Hadoop and HBase: http://www.aboutyun.com/thread-12236-1-1.html
As a workaround, create the HBase table first in the HBase shell:
hbase(main):002:0> create 'orders','CF1'
0 row(s) in 1.6730 seconds
=> Hbase::Table - orders
Then run the import again:
bin/sqoop import --connect jdbc:mysql://yangxw:3306/classicmodels --username root --password root --table orders --target-dir /mysql_hadoop/orders5 --hbase-table orders --column-family CF1
This time it succeeded.
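The bulk-load variant mentioned above would look like this (a sketch reusing the table and column family just created):
$ bin/sqoop import --connect jdbc:mysql://yangxw:3306/classicmodels --username root --password root --table orders --hbase-table orders --column-family CF1 --hbase-bulkload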
5. $CONDITIONS when importing from a database into Hadoop
When the free-form --query string is wrapped in double quotes, a backslash must be written before $CONDITIONS (i.e. \$CONDITIONS) so that the shell does not expand it as a variable; with single quotes, $CONDITIONS can be used as-is.
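For example, the double-quoted form of the earlier free-form query would be (same query, only the quoting changes):
$ sqoop import --query "SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE \$CONDITIONS" --split-by a.id --target-dir /user/foo/joinresults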