• Hadoop 1: NCDC Data Preparation


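    This post walks through the NCDC data preparation described in the book Hadoop: The Definitive Guide, which builds the big-data environment used in the later posts.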

    Environment

    3-node Hadoop 2.7.3 cluster; java version "1.8.0_111"

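    1 Download the data

    Download the historical weather data for the 20th and 21st centuries from NCDC. The site names one folder per year; each folder contains N gzip-compressed files (*.op.gz), one per weather station, holding that year's weather data, plus one tar file that bundles the whole year. For 1971, for example: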

    034700-99999-1971.op.gz
    035623-99999-1971.op.gz
    035833-99999-1971.op.gz
    035963-99999-1971.op.gz
    036880-99999-1971.op.gz
    040180-16201-1971.op.gz
    061800-99999-1971.op.gz
    080870-99999-1971.op.gz
    gsod_1971.tar
    

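    Each *1971.op.gz file holds the daily records of one station for that year, and *1971.tar bundles all of that year's *.op.gz files. Downloading the tar file and unpacking it is therefore enough to get a full year of weather data. The script below downloads the tar files for 1902 through 2017: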

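    #!/bin/bash
    # Download one tar file per year into the local data directory
    cd /home/lanstonwu/hapood/ncdc || exit 1
    for i in {1902..2017}
    do
        # Quote the patterns so wget, not the shell, expands the wildcards
        wget --execute robots=off -r -np -nH --cut-dirs=4 -R "index.html*" "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/$i/*.tar"
    done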
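
    A download of this size can fail partway, so it may be worth checking which years are missing before moving on. The following is a minimal sketch, assuming the files keep the gsod_<year>.tar naming shown above; the function name missing_years is hypothetical and not part of the original post.

```shell
#!/bin/sh
# List the years in [start, end] whose gsod_<year>.tar is missing or empty.
# Arguments: directory, first year, last year (all supplied by the caller).
missing_years() {
    dir=$1
    year=$2
    end=$3
    while [ "$year" -le "$end" ]; do
        # -s is true only when the file exists and is larger than zero bytes
        [ -s "$dir/gsod_$year.tar" ] || echo "$year"
        year=$((year + 1))
    done
}

# Example: check the download directory used above
# missing_years /home/lanstonwu/hapood/ncdc 1902 2017
```

    Any year it prints can be fetched again by running the wget line for that year only.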
    

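    2 Upload the data

    For convenience, once the download finishes it is recommended to use Hadoop to merge each year's weather data into a single file. The downloaded data sits on the local disk, so it has to be uploaded to HDFS before Hadoop can process it in parallel: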

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    
    /**
     * Uploads local files to HDFS on the Hadoop cluster. If a permission error is
     * reported, set the environment variable with export HADOOP_USER_NAME=hadoop
     * and run again.
     *
     * @author lanstonwu
     */
    public class UpLoadFile {
        public static void main(String[] args) throws IOException {
            // HDFS target directory
            String target = "hdfs://192.168.56.12:9000/gsod";
            // Local source directory
            File file = new File("/home/lanstonwu/hapood/ncdc");
            if (!file.exists()) {
                System.out.println("The local directory does not exist!");
                return;
            }
            // listFiles() returns null when the path is not a directory
            File[] files = file.listFiles();
            if (files == null || files.length == 0) {
                System.out.println("The folder is empty!");
                return;
            }
            Configuration config = new Configuration();
            // Returns the FileSystem for this URI's scheme and authority
            FileSystem fs = FileSystem.get(URI.create(target), config);
            for (File file2 : files) { // iterate over the local directory
                if (file2.isDirectory()) {
                    System.out.println("Directory: " + file2.getAbsolutePath() + "," + file2.getName());
                } else {
                    System.out.println("File: " + file2.getAbsolutePath() + ",name:" + file2.getName());
                    // Open the local file for reading
                    FileInputStream fis = new FileInputStream(file2);
                    // Create an FSDataOutputStream at the indicated Path
                    OutputStream os = fs.create(new Path(target + "/" + file2.getName()));
                    // Copy the data; the last argument closes both streams when done
                    IOUtils.copyBytes(fis, os, 4096, true);
                    System.out.println("Copy finished...");
                }
            }
        }
    }
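
    The same upload can also be done from the command line. The following is a rough sketch, not the author's method: it wraps hadoop fs -mkdir -p and hadoop fs -put in a hypothetical function upload_dir, and the DRY_RUN switch is an assumption added here so the commands can be previewed without a cluster.

```shell
#!/bin/sh
# Upload every regular file in a local directory to an HDFS directory.
# With DRY_RUN=1 the hadoop commands are printed instead of executed.
upload_dir() {
    src=$1
    dest=$2
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    $run hadoop fs -mkdir -p "$dest"
    for f in "$src"/*; do
        # Skip subdirectories, as the Java program above does
        [ -f "$f" ] || continue
        $run hadoop fs -put "$f" "$dest/"
    done
}

# Example (paths taken from the Java program above):
# upload_dir /home/lanstonwu/hapood/ncdc hdfs://192.168.56.12:9000/gsod
```

    One difference to keep in mind: hadoop fs -put refuses to replace a file that already exists unless -f is given, while fs.create(Path) in the Java program overwrites by default.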
    

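    3 Merge the data

    Hadoop handles large files better than many small ones, so the gzip-compressed files inside each year's tar are merged into a single file. Because the job only merges data, map tasks are enough and no reduce is needed; Hadoop Streaming runs the work in parallel. First prepare the list of files to process: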

    $ vi ncdc_file_list.txt
    
    hdfs://gp-sdw1:9000/gsod/gsod_1977.tar
    hdfs://gp-sdw1:9000/gsod/gsod_1978.tar
    hdfs://gp-sdw1:9000/gsod/gsod_1979.tar
    hdfs://gp-sdw1:9000/gsod/gsod_1980.tar
    hdfs://gp-sdw1:9000/gsod/gsod_1981.tar
    hdfs://gp-sdw1:9000/gsod/gsod_1982.tar
    hdfs://gp-sdw1:9000/gsod/gsod_1983.tar
    .....
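
    Typing more than a hundred lines by hand is error-prone, so the list can also be generated. This is a small sketch under the assumption that every year in the range was uploaded as gsod_<year>.tar under the same HDFS prefix; make_file_list is a hypothetical helper name.

```shell
#!/bin/sh
# Print one HDFS URI per line for gsod_<year>.tar, for each year in [start, end].
make_file_list() {
    prefix=$1
    year=$2
    end=$3
    while [ "$year" -le "$end" ]; do
        echo "$prefix/gsod_$year.tar"
        year=$((year + 1))
    done
}

# Example: write the list for the years downloaded earlier
# make_file_list hdfs://gp-sdw1:9000/gsod 1902 2017 > ncdc_file_list.txt
```

    The function does not check that each file exists in HDFS, so years that failed to download should be removed from the result.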
    

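    The file list records every file to be processed, one file per line; Hadoop Streaming reads it line by line and passes each line to the map function. Next, write the map script; each step is numbered and commented: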

    #!/bin/bash
    
    HADOOP_HOME=/opt/hadoop/2.7.3
    cd /tmp
    #1 NLineInputFormat gives a single line: key is offset, value is the HDFS file
    read offset hdfile
    
    #2 retrieve the file from HDFS
    echo "reporter:status:Retrieving $hdfile" >&2
    $HADOOP_HOME/bin/hadoop fs -get "$hdfile" .
    
    #3 get the short name from the tar file
    target=`basename "$hdfile" .tar`
    
    #4 create a directory named after target
    mkdir "$target"
    
    #5 un-tar the local file into the target directory
    tar xvf "$target.tar" -C "$target"
    
    #6 un-gzip each local file and append it to one merged file
    echo "reporter:status:Un-gzipping $target" >&2
    for file in "$target"/*
    do
        gunzip -c "$file" >> "$target.all"
        echo "reporter:status:Processed $file" >&2
    done
    
    #7 put the gzipped version into HDFS
    echo "reporter:status:Gzipping $target and putting in HDFS" >&2
    gzip -c "$target.all" | $HADOOP_HOME/bin/hadoop fs -put - "/ncdc_year_gz/$target.gz"
    
    #8 remove the local files
    rm -Rf "$target"
    rm -f "$target.all"
    rm -f "$target.tar"
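
    Steps 1 and 3 of the script can be tried locally without a cluster. The sketch below feeds one line in the offset<TAB>URI shape that the script expects and prints the derived name; target_name is a hypothetical helper used only for this illustration.

```shell
#!/bin/sh
# Derive the per-year working name from one "offset<TAB>uri" input line.
target_name() {
    # read splits on whitespace: the first field is the offset, the rest the URI
    read -r offset hdfile
    basename "$hdfile" .tar
}

printf '0\thdfs://gp-sdw1:9000/gsod/gsod_1971.tar\n' | target_name
# prints: gsod_1971
```

    The same name is then reused for the working directory, the merged .all file and the final .gz file in HDFS.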
    

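    The script reads one line of the list (step 1), fetches that file from HDFS to the local disk (step 2), derives the short name from the file name (step 3), creates a directory with that name (step 4), unpacks the year's data into the directory (step 5), decompresses each file in a loop and appends it to a single merged file (step 6), compresses the merged file and uploads it to the HDFS directory ncdc_year_gz (step 7), and deletes the local directory and files (step 8). The reporter lines return status information, which makes the mapper easier to monitor. Note: the HADOOP_HOME variable must be set in the script. If it is not, every call to hadoop has to use the full path, because the HADOOP_HOME configured in the operating system is not visible at run time, and the job fails with the following error: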

    No such file or directory
    PipeMapRed.waitOutputThreads(): subprocess failed with code 127
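
    Exit status 127 is what a shell reports when the command it was asked to run cannot be found. The sketch below reproduces that status with a deliberately missing path and adds a guard a mapper script could call first; the path /nonexistent and the function name require_hadoop are assumptions made for this illustration.

```shell
#!/bin/sh
# Running a path that does not exist makes the shell exit with status 127.
status=0
sh -c '/nonexistent/bin/hadoop version' 2>/dev/null || status=$?
echo "exit status: $status"
# prints: exit status: 127

# Guard: fail early with a clear message when the hadoop binary is missing.
require_hadoop() {
    if [ ! -x "$1/bin/hadoop" ]; then
        echo "hadoop not found under '$1'" >&2
        return 1
    fi
}

# Example: require_hadoop "$HADOOP_HOME" || exit 1
```

    With an unset HADOOP_HOME, $HADOOP_HOME/bin/hadoop expands to /bin/hadoop, which is why the task reports "No such file or directory".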
    

    4 Run the mapper

    Upload the prepared NCDC file list to HDFS (the cluster nodes need to read it):

    $ hadoop fs -put ncdc_file_list.txt /
    

    Run the map job:

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -D mapred.reduce.tasks=0 \
      -D mapred.map.tasks.speculative.execution=false \
      -D mapred.task.timeout=12000000 \
      -input /ncdc_file_list.txt \
      -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
      -output output \
      -mapper load_ncdc_map.sh \
      -file /home/hadoop/script/load_ncdc_map.sh
    

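    This disables reduce and speculative execution, sets the task timeout, uses the prepared NCDC list file as input, and passes the map script through -mapper and -file.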

    17/10/01 13:05:36 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
    packageJobJar: [/home/hadoop/script/load_ncdc_map.sh, /tmp/hadoop-unjar708897410907700502/] [] /tmp/streamjob2755689666173396550.jar tmpDir=null
    17/10/01 13:05:37 INFO client.RMProxy: Connecting to ResourceManager at gp-sdw1/192.168.56.12:8032
    17/10/01 13:05:37 INFO client.RMProxy: Connecting to ResourceManager at gp-sdw1/192.168.56.12:8032
    17/10/01 13:05:38 INFO mapred.FileInputFormat: Total input paths to process : 1
    17/10/01 13:05:38 INFO mapreduce.JobSubmitter: number of splits:114
    17/10/01 13:05:38 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
    17/10/01 13:05:38 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
    17/10/01 13:05:38 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
    17/10/01 13:05:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1506832924184_0001
    17/10/01 13:05:39 INFO impl.YarnClientImpl: Submitted application application_1506832924184_0001
    17/10/01 13:05:39 INFO mapreduce.Job: The url to track the job: http://gp-sdw1:8088/proxy/application_1506832924184_0001/
    17/10/01 13:05:39 INFO mapreduce.Job: Running job: job_1506832924184_0001
    17/10/01 13:05:46 INFO mapreduce.Job: Job job_1506832924184_0001 running in uber mode : false
    17/10/01 13:05:46 INFO mapreduce.Job:  map 0% reduce 0%
    17/10/01 13:06:00 INFO mapreduce.Job:  map 1% reduce 0%
    17/10/01 13:06:04 INFO mapreduce.Job:  map 2% reduce 0%
    17/10/01 13:06:09 INFO mapreduce.Job:  map 3% reduce 0%
    17/10/01 13:06:12 INFO mapreduce.Job:  map 4% reduce 0%
    17/10/01 13:06:17 INFO mapreduce.Job:  map 5% reduce 0%
    17/10/01 13:06:23 INFO mapreduce.Job:  map 6% reduce 0%
    17/10/01 13:06:25 INFO mapreduce.Job:  map 7% reduce 0%
    17/10/01 13:06:28 INFO mapreduce.Job:  map 11% reduce 0%
    17/10/01 13:06:32 INFO mapreduce.Job:  map 12% reduce 0%
    17/10/01 13:06:34 INFO mapreduce.Job:  map 13% reduce 0%
    17/10/01 13:06:37 INFO mapreduce.Job:  map 14% reduce 0%
    17/10/01 13:06:38 INFO mapreduce.Job:  map 17% reduce 0%
    17/10/01 13:06:39 INFO mapreduce.Job:  map 19% reduce 0%
    17/10/01 13:06:56 INFO mapreduce.Job:  map 20% reduce 0%
    17/10/01 13:07:02 INFO mapreduce.Job:  map 21% reduce 0%
    17/10/01 13:07:12 INFO mapreduce.Job:  map 22% reduce 0%
    17/10/01 13:07:14 INFO mapreduce.Job:  map 23% reduce 0%
    17/10/01 13:07:16 INFO mapreduce.Job:  map 24% reduce 0%
    17/10/01 13:07:17 INFO mapreduce.Job:  map 25% reduce 0%
    17/10/01 13:07:52 INFO mapreduce.Job:  map 27% reduce 0%
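
    The log above flags three property names and the -file option as deprecated and names their replacements. A version of the command using those replacements might look like the sketch below; it has not been run against the cluster in this post. The wrapper run_merge_job and the HADOOP_CMD variable are hypothetical and exist only so the arguments can be previewed with echo.

```shell
#!/bin/sh
# Submit the map-only merge job using the replacement option names from the log.
# HADOOP_CMD defaults to "hadoop"; set it to "echo" to preview the arguments.
run_merge_job() {
    streaming_jar=$1
    "${HADOOP_CMD:-hadoop}" jar "$streaming_jar" \
        -D mapreduce.job.reduces=0 \
        -D mapreduce.map.speculative=false \
        -D mapreduce.task.timeout=12000000 \
        -files /home/hadoop/script/load_ncdc_map.sh \
        -input /ncdc_file_list.txt \
        -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
        -output output \
        -mapper load_ncdc_map.sh
}

# Example (the jar path depends on the installation):
# run_merge_job "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar
```

    -files is a generic option, so it is placed together with the -D settings ahead of the streaming options such as -input and -mapper.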
    

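    [Figure: ncdc_mapper]
    The Status shown in the figure is the message returned by the reporter lines of the map script. Once the map job finishes, check the merged files in HDFS: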

    $ hadoop fs -ls /ncdc_year_gz
    
    -rw-r--r--   3 hadoop supergroup   14809707 2017-10-01 13:11 /ncdc_year_gz/gsod_1966.gz
    -rw-r--r--   3 hadoop supergroup   14771822 2017-10-01 13:13 /ncdc_year_gz/gsod_1967.gz
    -rw-r--r--   3 hadoop supergroup   13592592 2017-10-01 13:12 /ncdc_year_gz/gsod_1968.gz
    -rw-r--r--   3 hadoop supergroup   20475061 2017-10-01 13:14 /ncdc_year_gz/gsod_1969.gz
    -rw-r--r--   3 hadoop supergroup   20012492 2017-10-01 13:14 /ncdc_year_gz/gsod_1970.gz
    -rw-r--r--   3 hadoop supergroup   11205341 2017-10-01 13:12 /ncdc_year_gz/gsod_1971.gz
    -rw-r--r--   3 hadoop supergroup    4556815 2017-10-01 13:11 /ncdc_year_gz/gsod_1972.gz
    -rw-r--r--   3 hadoop supergroup   21961972 2017-10-01 13:18 /ncdc_year_gz/gsod_1974.gz
    -rw-r--r--   3 hadoop supergroup   23030229 2017-10-01 13:18 /ncdc_year_gz/gsod_1976.gz
    -rw-r--r--   3 hadoop supergroup   23293175 2017-10-01 13:18 /ncdc_year_gz/gsod_1978.gz
    -rw-r--r--   3 hadoop supergroup   24564712 2017-10-01 13:18 /ncdc_year_gz/gsod_1980.gz
    -rw-r--r--   3 hadoop supergroup   29662599 2017-10-01 13:19 /ncdc_year_gz/gsod_1988.gz
    -rw-r--r--   3 hadoop supergroup   29092407 2017-10-01 13:19 /ncdc_year_gz/gsod_1993.gz
    -rw-r--r--   3 hadoop supergroup   25363736 2017-10-01 13:19 /ncdc_year_gz/gsod_1994.gz
    -rw-r--r--   3 hadoop supergroup   22179093 2017-10-01 13:19 /ncdc_year_gz/gsod_1995.gz
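
    A quick way to sanity-check the result is to count the merged files and add up their sizes. The sketch below assumes the column layout of the listing above (size in column 5, path in column 8); summarize_ls is a hypothetical helper name.

```shell
#!/bin/sh
# Read `hadoop fs -ls` lines on stdin and print the file count and total bytes.
# Lines without a .gz path in column 8 (such as a "Found N items" header) are skipped.
summarize_ls() {
    awk '$8 ~ /\.gz$/ { n++; bytes += $5 }
         END { printf "%d files, %.0f bytes\n", n, bytes }'
}

# Example: hadoop fs -ls /ncdc_year_gz | summarize_ls
```

    Comparing the file count with the number of lines in ncdc_file_list.txt shows whether every year produced an output file.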
    
  • Original article: https://www.cnblogs.com/lanston/p/hadoop_ncdc_data_prepare.html