Notes on Sqoop2 Operations

SQOOP

Sqoop is an open-source tool mainly used to move data between the Hadoop ecosystem and traditional databases (such as MySQL). It can import data from a relational database (e.g. MySQL, Oracle, Postgres) into HDFS, and it can also export data from HDFS back into a relational database.
On versions: Sqoop 1.99.7 belongs to the Sqoop2 line, while the latest Sqoop1 release is 1.4.7. Compared with Sqoop1, Sqoop2 introduces a client-server architecture: deployment becomes more involved, but the coupling is looser.
I am using Hadoop 2.8.1 here, and accordingly use Sqoop2.

Data Source

There is plenty of open data online; here we use the 2008 US flight departure/arrival data. For the field descriptions, see

http://stat-computing.org/dataexpo/2009/the-data.html

Quick Analysis

Convention: delay = departure delay (DepDelay).

  • Regular MySQL queries (see later sections for downloading the data and creating the table)

    1. The 10 origins with the highest average departure delay

      mysql> select Origin,avg(DepDelay) from airline2008 group by Origin order by avg(DepDelay) desc limit 10;
      +--------+---------------+
      | Origin | avg(DepDelay) |
      +--------+---------------+
      | ACK    |       27.3720 |
      | PUB    |       27.0000 |
      | CEC    |       23.0941 |
      | PIR    |       22.8000 |
      | AKN    |       21.5603 |
      | OTH    |       21.0893 |
      | SPI    |       20.3290 |
      | LMT    |       19.3796 |
      | ACY    |       18.0690 |
      | EWR    |       17.6917 |
      +--------+---------------+
      10 rows in set (18.91 sec)
      
      -- check:
      select DepDelay,ArrDelay from airline2008 where Origin = "ACK" order by DepDelay desc limit 10;
      select avg(DepDelay) from airline2008 where Origin = "ACK";
      
  2. Days with the most flights delayed over two hours at a given airport (DCA in the query below)

    mysql> select Month,DayOfMonth,count(*) as delayOver2hCount from airline2008 where Origin = "DCA" and DepDelay > 120 group by Month,DayOfMonth order by count(*) desc limit 10;
    +-------+------------+------------------+
    | Month | DayOfMonth | delayOver2hCount |
    +-------+------------+------------------+
    |     7 |         13 |               37 |
    |     6 |          4 |               33 |
    |     6 |         16 |               27 |
    |     7 |         23 |               26 |
    |     7 |         27 |               26 |
    |     5 |         31 |               26 |
    |    12 |         20 |               23 |
    |     5 |          2 |               15 |
    |    12 |         21 |               15 |
    |     7 |          6 |               15 |
    +-------+------------+------------------+
    10 rows in set (4.82 sec)
    
  • If the data is large, it should be imported into HDFS and analyzed with big-data components such as Hive. Here we use Sqoop to move the MySQL data into HDFS.

Data Preprocessing

  1. Download the CSV

http://stat-computing.org/dataexpo/2009/2008.csv.bz2

Decompress it with `bzip2 -d 2008.csv.bz2` and rename the result to airline2008.csv.
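
For reference, a minimal shell sketch of this step (assuming wget and bzip2 are available):

    wget http://stat-computing.org/dataexpo/2009/2008.csv.bz2
    bzip2 -d 2008.csv.bz2            # yields 2008.csv
    mv 2008.csv airline2008.csv      # mysqlimport derives the table name from the file name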

  2. Create the MySQL database and table

CREATE DATABASE IF NOT EXISTS local_db default charset utf8 COLLATE utf8_general_ci;

create table airline2008 (
  Year int,
  Month int,
  DayofMonth int,
  DayOfWeek int,
  DepTime  int,
  CRSDepTime int,
  ArrTime int,
  CRSArrTime int,
  UniqueCarrier varchar(5),
  FlightNum int,
  TailNum varchar(8),
  ActualElapsedTime int,
  CRSElapsedTime int,
  AirTime int,
  ArrDelay int,
  DepDelay int,
  Origin varchar(3),
  Dest varchar(3),
  Distance int,
  TaxiIn int,
  TaxiOut int,
  Cancelled int,
  CancellationCode varchar(1),
  Diverted varchar(1),
  CarrierDelay int,
  WeatherDelay int,
  NASDelay int,
  SecurityDelay int,
  LateAircraftDelay int
);
  3. Load the CSV into the MySQL table
    mysqlimport --ignore-lines 1 --fields-terminated-by=, --local -uroot -p local_db airline2008.csv
    Then add an auto-increment primary key, which Sqoop can later use as the partition column (optional: FIRST puts the id column first, which is easier to scan):
    ALTER TABLE `airline2008` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
    Details:

https://dev.mysql.com/doc/refman/5.6/en/mysqlimport.html
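
As a quick sanity check after the import (a sketch; the 2008 file holds roughly 7 million flight records):

    mysql -uroot -p local_db -e 'SELECT COUNT(*) FROM airline2008;'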

  4. Import into HDFS with Sqoop

    • Install Sqoop2
      Grab Sqoop 1.99.7 from the official site, unpack it, and set the $SQOOP_HOME environment variable, e.g. as sketched below.
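
      For example (a sketch; the tarball below is the Hadoop-2.x binary build from the Apache archive, adjust paths to taste):

      wget https://archive.apache.org/dist/sqoop/1.99.7/sqoop-1.99.7-bin-hadoop200.tar.gz
      tar -xzf sqoop-1.99.7-bin-hadoop200.tar.gz -C /usr/local/lib
      export SQOOP_HOME=/usr/local/lib/sqoop-1.99.7-bin-hadoop200
      export PATH=$PATH:$SQOOP_HOME/bin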

    • Edit the Sqoop configuration

      Edit $SQOOP_HOME/conf/sqoop.properties as follows:

      # point this at the local Hadoop configuration directory
      org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/usr/local/lib/hadoop-2.8.1/etc/hadoop

Edit `$SQOOP_HOME/server/catalina.properties` as follows:

    common.loader=\
    /usr/local/lib/hadoop-2.8.1/share/hadoop/common/*.jar,\
    /usr/local/lib/hadoop-2.8.1/share/hadoop/common/lib/*.jar,\
    /usr/local/lib/hadoop-2.8.1/share/hadoop/hdfs/*.jar,\
    /usr/local/lib/hadoop-2.8.1/share/hadoop/hdfs/lib/*.jar,\
    /usr/local/lib/hadoop-2.8.1/share/hadoop/mapreduce/*.jar,\
    /usr/local/lib/hadoop-2.8.1/share/hadoop/mapreduce/lib/*.jar,\
    /usr/local/lib/hadoop-2.8.1/share/hadoop/tools/*.jar,\
    /usr/local/lib/hadoop-2.8.1/share/hadoop/tools/lib/*.jar,\
    /usr/local/lib/hadoop-2.8.1/share/hadoop/yarn/*.jar,\
    /usr/local/lib/hadoop-2.8.1/share/hadoop/yarn/lib/*.jar,\
    /usr/local/lib/hadoop-2.8.1/share/hadoop/httpfs/tomcat/lib/*.jar

Download the mysql-connector JAR and drop it into $SQOOP_HOME/server/lib, for instance as sketched below.
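
For example (a sketch; the connector version is an assumption, pick one matching your MySQL server):

    # hypothetical version; any mysql-connector-java 5.1.x should work with MySQL 5.x
    wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.45/mysql-connector-java-5.1.45.jar
    cp mysql-connector-java-5.1.45.jar $SQOOP_HOME/server/lib/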

- Basic shell operations
Type `sqoop2-shell` (or `sqoop.sh client`) to enter the Sqoop2 shell; the server must be started first:

        sqoop2-tool verify     # verify the configuration
        sqoop.sh server start  # start the server
        sqoop.sh server stop   # stop the server
        sqoop.sh client        # start the client shell

- Key shell operations
Mainly the creation and maintenance of connectors, links, and jobs.

    - Create the links

            sqoop:000> set server --host localhost --port 12000 --webapp sqoop
            Server is set successfully
            sqoop:000> show version --all
            client version:
              Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb 
              Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
            server version:
              Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb 
              Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
            API versions:
              [v1]
            sqoop:000> create link -connector generic-jdbc-connector
            Creating link for connector with name generic-jdbc-connector
            Please fill following values to create new link object
            Name: mysql-link

            Database connection

            Driver class: com.mysql.jdbc.Driver
            Connection String: jdbc:mysql://localhost/local_db
            Username: root
            Password: ******
            Fetch Size: 
            Connection Properties: 
            There are currently 0 values in the map:
            entry# protocol=tcp
            There are currently 1 values in the map:
            protocol = tcp
            entry# 

            SQL Dialect

            Identifier enclose:  
            Mon Jan 22 16:00:38 CST 2018 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
            New link was successfully created with validation status WARNING and name mysql-link

            sqoop:000> create link -connector hdfs-connector
            Creating link for connector with name hdfs-connector
            Please fill following values to create new link object
            Name: hdfs-link

            HDFS cluster

            URI: hdfs://localhost:9000
            Conf directory: 
            Additional configs:: 
            There are currently 0 values in the map:
            entry# 
            New link was successfully created with validation status OK and name hdfs-link
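
      To confirm that both links were created, list them:

            sqoop:000> show link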
    - Create the job

            sqoop:000> create job -f "mysql-link" -t "hdfs-link"
            Creating job for links with from name mysql-link and to name hdfs-link
            Please fill following values to create new job object
            Name: mysql-hdfs

            Database source

            Schema name: local_db
            Table name: airline2008
            SQL statement: 
            Column names: 
            There are currently 0 values in the list:
            element# 
            Partition column: id
            Partition column nullable: 
            Boundary query: 

            Incremental read

            Check column: 
            Last value: 

            Target configuration

            Override null value: 
            Null value: 
            File format: 
              0 : TEXT_FILE
              1 : SEQUENCE_FILE
              2 : PARQUET_FILE
            Choose: 0
            Compression codec: 
              0 : NONE
              1 : DEFAULT
              2 : DEFLATE
              3 : GZIP
              4 : BZIP2
              5 : LZO
              6 : LZ4
              7 : SNAPPY
              8 : CUSTOM
            Choose: 0
            Custom codec: 
            Output directory: /hadoop/sqoop
            Append mode: 

            Throttling resources

            Extractors: 
            Loaders: 

            Classpath configuration

            Extra mapper jars: 
            There are currently 0 values in the list:
            element# 
            New job was successfully created with validation status OK  and name mysql-hdfs
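
      Likewise, the new job can be listed with:

            sqoop:000> show job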

    - Start the job

            sqoop:000> start job -n mysql-hdfs

        This completes the import of the MySQL data into HDFS.
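
        While it runs, progress can be polled from the same shell, and a running job can be stopped:

            sqoop:000> status job -n mysql-hdfs
            sqoop:000> stop job -n mysql-hdfs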
    - Because the airline2008 table is quite large, the export takes a while, and I forgot to save the output last time; the demo below therefore uses a previously saved local table (departments). Results:

            sqoop:000> start job -n mysql-hdfs
            Submission details
            Job Name: mysql-hdfs
            Server URL: http://localhost:12000/sqoop/
            Created by: hadoop
            Creation date: 2018-01-23 18:09:07 CST
            Lastly updated by: hadoop
            External ID: job_local1619449777_0003
                    http://localhost:8080/
            Source Connector schema: Schema{name= hive . departments ,columns=[
                    Text{name=dept_no,nullable=true,type=TEXT,charSize=null},
                    Text{name=dept_name,nullable=true,type=TEXT,charSize=null}]}
            2018-01-23 18:09:07 CST: BOOTING  - Progress is not available


            $ hdfs dfs -ls /hadoop/sqoop
            Found 10 items
            -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/0d38a9bd-3740-4c45-b73f-d34b70679596.txt
            -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/1b1db902-1595-4756-b552-c146a49600ea.txt
            -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/1c847041-46fa-420d-a8fa-feb74e82a7f6.txt
            -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/1fdb7a44-23fd-4cf0-9b44-f7293e72fdc8.txt
            -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/5862a9bb-4b07-46d4-9dfa-f588b3e20d19.txt
            -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/58e7cde4-e129-4fee-a08b-48a00e026e5d.txt
            -rw-r--r--   3 hadoop supergroup        189 2018-01-23 18:09 /hadoop/sqoop/8cd3761a-6070-474f-be08-e6c2609bc6ff.txt
            -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/9eae0db2-6171-4856-9182-e4137dfd6911.txt
            -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/cec1349f-2e18-4227-98dd-4f50ef6abd2d.txt
            -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/dd7e8b46-2406-436d-8e47-3f39f643a4e3.txt

            $ hdfs dfs -cat /hadoop/sqoop/8cd3761a-6070-474f-be08-e6c2609bc6ff.txt
            'd001','Marketing'
            'd002','Finance'
            'd003','Human Resources'
            'd004','Production'
            'd005','Development'
            'd006','Quality Management'
            'd007','Sales'
            'd008','Research'
            'd009','Customer Service'


- Notes
    - When creating the HDFS link, the port must match the one configured in Hadoop's core-site.xml
    - The following may need to be added to core-site.xml so the Sqoop server can impersonate users (restart sketch below):

            <property>
              <name>hadoop.proxyuser.root.hosts</name>
              <value>*</value>
            </property>
            <property>
              <name>hadoop.proxyuser.root.groups</name>
              <value>*</value>
            </property>
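
      The proxy-user settings only take effect after restarting HDFS, e.g. (a sketch, assuming the stock sbin scripts):

            $HADOOP_HOME/sbin/stop-dfs.sh
            $HADOOP_HOME/sbin/start-dfs.sh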
    - Remember to turn verbose on; it makes troubleshooting much easier:

        set option --name verbose --value true
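
    The current value can be checked with:

        show option --name verbose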