SQOOP
Sqoop is an open-source tool for moving data between the Hadoop ecosystem and traditional relational databases (e.g. MySQL). It can import data from a relational database (MySQL, Oracle, Postgres, etc.) into HDFS, and export HDFS data back into a relational database.
On versions: Sqoop 1.99.7 belongs to the Sqoop2 line, while the latest Sqoop1 release is 1.4.7. Compared with Sqoop1, Sqoop2 introduces a client/server architecture, which makes deployment more complex but reduces coupling.
I am running Hadoop 2.8.1 here, so I use Sqoop2 accordingly.
Data source
There is plenty of open data available online; here we use the 2008 US airline on-time (departure/arrival) data as raw material. For the field definitions, see the dataset's documentation.
Quick analysis
Throughout, delay = departure delay (DepDelay).
Regular MySQL queries (downloading the data and creating the table are covered below).
The 10 departure airports (Origin) with the largest average delay:
mysql> select Origin,avg(DepDelay) from airline2008 group by Origin order by avg(DepDelay) desc limit 10;
+--------+---------------+
| Origin | avg(DepDelay) |
+--------+---------------+
| ACK    |       27.3720 |
| PUB    |       27.0000 |
| CEC    |       23.0941 |
| PIR    |       22.8000 |
| AKN    |       21.5603 |
| OTH    |       21.0893 |
| SPI    |       20.3290 |
| LMT    |       19.3796 |
| ACY    |       18.0690 |
| EWR    |       17.6917 |
+--------+---------------+
10 rows in set (18.91 sec)

-- check:
-- select DepDelay,ArrDelay from airline2008 where Origin = "ACK" order by DepDelay desc limit 10;
-- select avg(DepDelay) from airline2008 where Origin = "ACK";
The days with the most flights delayed over 2h at a given airport (DCA in the query below):
mysql> select Month,DayOfMonth,count(*) as delatOver2hCount from airline2008 where Origin = "DCA" and DepDelay > 120 group by Month,DayOfMonth order by count(*) desc limit 10;
+-------+------------+------------------+
| Month | DayOfMonth | delatOver2hCount |
+-------+------------+------------------+
|     7 |         13 |               37 |
|     6 |          4 |               33 |
|     6 |         16 |               27 |
|     7 |         23 |               26 |
|     7 |         27 |               26 |
|     5 |         31 |               26 |
|    12 |         20 |               23 |
|     5 |          2 |               15 |
|    12 |         21 |               15 |
|     7 |          6 |               15 |
+-------+------------+------------------+
10 rows in set (4.82 sec)
- If the data is large, it needs to be imported into HDFS and analyzed with big-data components such as Hive. Here we use Sqoop to move the MySQL data into HDFS.
Data preprocessing
- Download the CSV
Decompress it with bzip2 -d zipfile.bz2 and rename the result airline2008.csv; mysqlimport derives the target table name from the file name, so it must match the table. A sketch follows.
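A minimal sketch of this step, assuming the downloaded archive is named 2008.csv.bz2 (adjust to the actual file name):

bzip2 -d 2008.csv.bz2            # yields 2008.csv
mv 2008.csv airline2008.csv      # mysqlimport maps airline2008.csv to table airline2008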
- Create the MySQL database and table
CREATE DATABASE IF NOT EXISTS local_db default charset utf8 COLLATE utf8_general_ci;
USE local_db;
create table airline2008 (
Year int,
Month int,
DayofMonth int,
DayOfWeek int,
DepTime int,
CRSDepTime int,
ArrTime int,
CRSArrTime int,
UniqueCarrier varchar(5),
FlightNum int,
TailNum varchar(8),
ActualElapsedTime int,
CRSElapsedTime int,
AirTime int,
ArrDelay int,
DepDelay int,
Origin varchar(3),
Dest varchar(3),
Distance int,
TaxiIn int,
TaxiOut int,
Cancelled int,
CancellationCode varchar(1),
Diverted varchar(1),
CarrierDelay int,
WeatherDelay int,
NASDelay int,
SecurityDelay int,
LateAircraftDelay int
);
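To run the DDL above, save it to a file and feed it to the mysql client; the file name create_airline2008.sql is just an assumption:

mysql -uroot -p < create_airline2008.sql
mysql -uroot -p local_db -e "show tables;"   # should list airline2008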
- Load the CSV into the MySQL table
mysqlimport --ignore-lines=1 --fields-terminated-by=, --local -uroot -p local_db airline2008.csv
Then add an auto-increment primary key (FIRST is optional; it places the id column first, which is easier to read): ALTER TABLE `airline2008` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
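A quick sanity check after the import, as a sketch (the 2008 dataset has on the order of 7 million rows):

mysql -uroot -p local_db -e "select count(*) from airline2008;"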
Importing into HDFS with Sqoop
Install Sqoop2
Download Sqoop 1.99.7 from the official site, unpack it, and point a $SQOOP_HOME environment variable at the unpacked directory, as sketched below.
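A sketch of the environment setup; the install path is an assumption for this machine:

export SQOOP_HOME=/usr/local/lib/sqoop-1.99.7-bin-hadoop200   # assumed unpack location
export PATH=$PATH:$SQOOP_HOME/bin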
Edit the Sqoop configuration
Modify $SQOOP_HOME/conf/sqoop.properties as follows:
# point this at my Hadoop installation's configuration directory
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/usr/local/lib/hadoop-2.8.1/etc/hadoop
Modify `$SQOOP_HOME/server/catalina.properties` as follows (note: a multi-line value in a .properties file needs a trailing backslash on every line but the last):
common.loader=\
/usr/local/lib/hadoop-2.8.1/share/hadoop/common/*.jar,\
/usr/local/lib/hadoop-2.8.1/share/hadoop/common/lib/*.jar,\
/usr/local/lib/hadoop-2.8.1/share/hadoop/hdfs/*.jar,\
/usr/local/lib/hadoop-2.8.1/share/hadoop/hdfs/lib/*.jar,\
/usr/local/lib/hadoop-2.8.1/share/hadoop/mapreduce/*.jar,\
/usr/local/lib/hadoop-2.8.1/share/hadoop/mapreduce/lib/*.jar,\
/usr/local/lib/hadoop-2.8.1/share/hadoop/tools/*.jar,\
/usr/local/lib/hadoop-2.8.1/share/hadoop/tools/lib/*.jar,\
/usr/local/lib/hadoop-2.8.1/share/hadoop/yarn/*.jar,\
/usr/local/lib/hadoop-2.8.1/share/hadoop/yarn/lib/*.jar,\
/usr/local/lib/hadoop-2.8.1/share/hadoop/httpfs/tomcat/lib/*.jar
Download the mysql-connector JAR and drop it into $SQOOP_HOME/server/lib, as sketched below.
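A sketch; the exact Connector/J version is an assumption, any reasonably recent 5.1.x jar should do:

cp mysql-connector-java-5.1.45-bin.jar $SQOOP_HOME/server/lib/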
- Basic shell operations
Start the server first, then enter the interactive shell with `sqoop2-shell` (or `sqoop.sh client`):
sqoop2-tool verify      # verify the configuration
sqoop.sh server start   # start the server
sqoop.sh server stop    # stop the server
sqoop.sh client         # start the client shell
- Key shell operations
These mainly cover creating and maintaining connectors, links, and jobs; you can list the available connectors first, as sketched below.
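A quick check inside the shell; show connector should list at least generic-jdbc-connector and hdfs-connector:

sqoop:000> show connector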
- Create the links
sqoop:000> set server --host localhost --port 12000 --webapp sqoop
Server is set successfully
sqoop:000> show version --all
client version:
Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
server version:
Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
API versions:
[v1]
sqoop:000> create link -connector generic-jdbc-connector
Creating link for connector with name generic-jdbc-connector
Please fill following values to create new link object
Name: mysql-link
Database connection
Driver class: com.mysql.jdbc.Driver
Connection String: jdbc:mysql://localhost/local_db
Username: root
Password: ******
Fetch Size:
Connection Properties:
There are currently 0 values in the map:
entry# protocol=tcp
There are currently 1 values in the map:
protocol = tcp
entry#
SQL Dialect
Identifier enclose:
Mon Jan 22 16:00:38 CST 2018 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
New link was successfully created with validation status WARNING and name mysql-link
sqoop:000> create link -connector hdfs-connector
Creating link for connector with name hdfs-connector
Please fill following values to create new link object
Name: hdfs-link
HDFS cluster
URI: hdfs://localhost:9000
Conf directory:
Additional configs::
There are currently 0 values in the map:
entry#
New link was successfully created with validation status OK and name hdfs-link
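Both links can be verified from the shell before building the job:

sqoop:000> show link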
- Create the job
sqoop:000> create job -f "mysql-link" -t "hdfs-link"
Creating job for links with from name mysql-link and to name hdfs-link
Please fill following values to create new job object
Name: mysql-hdfs
Database source
Schema name: local_db
Table name: airline2008
SQL statement:
Column names:
There are currently 0 values in the list:
element#
Partition column: id
Partition column nullable:
Boundary query:
Incremental read
Check column:
Last value:
Target configuration
Override null value:
Null value:
File format:
0 : TEXT_FILE
1 : SEQUENCE_FILE
2 : PARQUET_FILE
Choose: 0
Compression codec:
0 : NONE
1 : DEFAULT
2 : DEFLATE
3 : GZIP
4 : BZIP2
5 : LZO
6 : LZ4
7 : SNAPPY
8 : CUSTOM
Choose: 0
Custom codec:
Output directory: /hadoop/sqoop
Append mode:
Throttling resources
Extractors:
Loaders:
Classpath configuration
Extra mapper jars:
There are currently 0 values in the list:
element#
New job was successfully created with validation status OK and name mysql-hdfs
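The job definition can be confirmed with show job:

sqoop:000> show job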
- Start the job
sqoop:000> start job -n mysql-hdfs
This completes the MySQL-to-HDFS import.
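While a job is running, its progress can be polled from the shell; to the best of my knowledge the 1.99.7 form is:

sqoop:000> status job -n mysql-hdfs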
- Because the airline2008 table is large the export takes a while, and I did not save that run's output, so the demonstration below uses a small demo table (departments) saved locally earlier. The result:
sqoop:000> start job -n mysql-hdfs
Submission details
Job Name: mysql-hdfs
Server URL: http://localhost:12000/sqoop/
Created by: hadoop
Creation date: 2018-01-23 18:09:07 CST
Lastly updated by: hadoop
External ID: job_local1619449777_0003
http://localhost:8080/
Source Connector schema: Schema{name=`hive`.`departments`,columns=[
Text{name=dept_no,nullable=true,type=TEXT,charSize=null},
Text{name=dept_name,nullable=true,type=TEXT,charSize=null}]}
2018-01-23 18:09:07 CST: BOOTING - Progress is not available
$ hdfs dfs -ls /hadoop/sqoop
Found 10 items
-rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/0d38a9bd-3740-4c45-b73f-d34b70679596.txt
-rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/1b1db902-1595-4756-b552-c146a49600ea.txt
-rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/1c847041-46fa-420d-a8fa-feb74e82a7f6.txt
-rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/1fdb7a44-23fd-4cf0-9b44-f7293e72fdc8.txt
-rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/5862a9bb-4b07-46d4-9dfa-f588b3e20d19.txt
-rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/58e7cde4-e129-4fee-a08b-48a00e026e5d.txt
-rw-r--r-- 3 hadoop supergroup 189 2018-01-23 18:09 /hadoop/sqoop/8cd3761a-6070-474f-be08-e6c2609bc6ff.txt
-rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/9eae0db2-6171-4856-9182-e4137dfd6911.txt
-rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/cec1349f-2e18-4227-98dd-4f50ef6abd2d.txt
-rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/dd7e8b46-2406-436d-8e47-3f39f643a4e3.txt
$ hdfs dfs -cat /hadoop/sqoop/8cd3761a-6070-474f-be08-e6c2609bc6ff.txt
'd001','Marketing'
'd002','Finance'
'd003','Human Resources'
'd004','Production'
'd005','Development'
'd006','Quality Management'
'd007','Sales'
'd008','Research'
'd009','Customer Service'
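Most of the ten output files are empty here because the demo table is tiny. A quick way to inspect and count everything that was exported, as a sketch:

hdfs dfs -cat /hadoop/sqoop/*.txt          # dump all part files
hdfs dfs -cat /hadoop/sqoop/*.txt | wc -l  # count exported rows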
- Notes
- When creating the HDFS link, the URI port must match what is configured in Hadoop's core-site.xml (fs.defaultFS).
- You may need to add the following proxy-user settings to core-site.xml (if the Sqoop2 server runs as a user other than root, substitute that user; restart HDFS afterwards, see the sketch after this list):
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
- Remember to turn on verbose mode so errors can be diagnosed promptly:
set option --name verbose --value true
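A sketch of restarting HDFS after editing core-site.xml, assuming the standard sbin scripts and a $HADOOP_HOME variable:

$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh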