• MapReduce Experiment - Data Cleaning - Stage One


    Result file field description:

    Ip: 106.39.41.166 (city)

    Date: 10/Nov/2016:00:01:02 +0800 (date)

    Day: 10 (day)

    Traffic: 54 (traffic)

    Type: video (type: video or article)

    Id: 8701 (id of the video or article)

    Requirements:

    1. Data cleaning: clean the data as specified below, then load the cleaned data into a Hive database.

    Two-stage cleaning:

    1) Stage one: extract the required fields from the raw log, for example:

    ip:    199.30.25.88

    time:  10/Nov/2016:00:01:03 +0800

    traffic:  62

    article: article/11325

    video: video/3235

    2) Stage two: refine the extracted fields:

    ip ---> city (cityIP)

    date ---> time: 2016-11-10 00:01:03

    day: 10

    traffic: 62

    type: article/video

    id: 11325
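
    For example, the raw log line 106.39.41.166,10/Nov/2016:00:01:02 +0800,10,54,video,8701 becomes the cleaned, tab-separated record 106.39.41.166  2016-11-10 00:01:02  10  54  video  8701.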

    3) Hive table schema:

    create table data(ip string, time string, day string, traffic bigint, type string, id string)

    2. Data processing:

    · Top 10 most-visited videos/articles by access count (video/article); see the sketch after this list

    · Top 10 most popular courses by city (ip)

    · Top 10 most popular courses by traffic (traffic)

    3. Data visualization: load the statistics into a MySQL database and present them graphically.
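
    The three Top 10 statistics belong to the later data-processing stage and are not implemented in this post. As a rough illustration only, below is a minimal sketch of how the first one (access count per video/article) could be computed from the cleaned, tab-separated records. The class name TopTenSketch and the single-reducer assumption are illustrative, not part of the lab code:

    import java.io.IOException;
    import java.util.Map.Entry;
    import java.util.TreeMap;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopTenSketch {

        // Map: emit (type/id, 1) for every cleaned, tab-separated record.
        public static class CountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text outKey = new Text();

            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] f = value.toString().split("\t"); // ip, time, day, traffic, type, id
                outKey.set(f[4] + "/" + f[5]);             // e.g. video/3235
                context.write(outKey, ONE);
            }
        }

        // Reduce: sum the counts per key, keep only the 10 largest in a TreeMap,
        // and emit them from cleanup(). Keys with equal counts overwrite each
        // other in this simple sketch; a real job would use a composite key.
        public static class TopTenReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final TreeMap<Integer, String> topTen = new TreeMap<>();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context) {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                topTen.put(sum, key.toString());
                if (topTen.size() > 10) {
                    topTen.remove(topTen.firstKey()); // drop the current smallest
                }
            }

            @Override
            protected void cleanup(Context context) throws IOException, InterruptedException {
                // Emit from largest to smallest count.
                for (Entry<Integer, String> e : topTen.descendingMap().entrySet()) {
                    context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
                }
            }
        }
    }

    The driver would mirror the main() of DataClean below, with IntWritable as the output value class and job.setNumReduceTasks(1) so a single reducer sees all keys.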

    Stage one:

    /**
     * MapReduce experiment - data cleaning - stage one
     * Gao Zewei, 19.11.20
     */
    package classtest3;
    
    import java.io.IOException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    public class DataClean {
        
        static String INPUT_PATH="hdfs://192.168.57.128:9000/testhdfs1026/run/input/DataClean.txt";
        static String OUTPUT_PATH="hdfs://192.168.57.128:9000/testhdfs1026/run/output/DataClean";
        
        
        /*
         * Input data format:
         *     Ip,Date,Day,Traffic,Type,Id
         *     106.39.41.166,10/Nov/2016:00:01:02 +0800,10,54,video,8701
         */

        public static final SimpleDateFormat FORMAT = new SimpleDateFormat("d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH); // source time format
        public static final SimpleDateFormat dateformat1 = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); // target time format
        
        // Parse one raw line into a String[] {ip, date, day, traffic, type, id}.
        // String line --> String[]
        public static String[] parse(String line){
    
            String ip = parseIP(line);
            String date = parseTime(line);
            String day = parseDay(line);
            String traffic = parseTraffic(line);
            String type = parseType(line);
            String id = parseId(line);
            return new String[]{ip,date,day,traffic,type,id};
        }
        
        //Ip
        private static String parseIP(String line) {     
            String ip =line.split(",")[0].trim();
            return ip;
        }
        
        //Date
        private static String parseTime(String line) {
            // the raw date string, e.g. "10/Nov/2016:00:01:02 +0800"
            String time = line.split(",")[1].trim();
            // strip the trailing " +0800" timezone suffix
            final int f = time.indexOf(" ");
            String time1 = time.substring(0, f);
            Date date = parseDateFormat(time1);
            return dateformat1.format(date);
        }
        // parse the raw date string into a Date
        private static Date parseDateFormat(String string){
            Date parse = null;
            try{
                parse = FORMAT.parse(string); // parse() converts the string into a Date using FORMAT's pattern
            }catch (Exception e){
                e.printStackTrace();
            }
            return parse;
        }
        
        //Day
        private static String parseDay(String line) {     
            String day =line.split(",")[2].trim();
            return day;
        }
        
        //Traffic
        private static String parseTraffic(String line) {     
            String traffic = line.split(",")[3].trim();
            return traffic;
        }
        
        //Type
        private static String parseType(String line) {     
            String type = line.split(",")[4].trim();
            return type;
        }
        
        //Id
        private static String parseId(String line) {     
            String id =line.split(",")[5].trim();
            return id;
        }
    
        /*
         * Mapper:
         * extract the required fields from each raw log line and refine them
         */
        public static class Map extends
                Mapper<LongWritable,Text,Text,NullWritable>{
            
            public static Text word = new Text();
            public void map(LongWritable key,Text value,Context context)
                    throws IOException, InterruptedException{
                String line = value.toString();
                String arr[] = parse(line);
                word.set(arr[0] + "\t" + arr[1] + "\t" + arr[2] + "\t" + arr[3] + "\t" + arr[4] + "\t" + arr[5]);
                context.write(word,NullWritable.get());
            }
        }
        
        // Reducer: identical cleaned records collapse into a single output line (deduplication)
        public static class Reduce extends
                Reducer<Text,NullWritable,Text,NullWritable>{
            public void reduce(Text key, Iterable<NullWritable> values,Context context) 
                    throws IOException, InterruptedException {
                context.write(key, NullWritable.get());
            } 
        }
        
        
        
        public static void main(String[] args) throws Exception{
            Path inputpath=new Path(INPUT_PATH);
            Path outputpath=new Path(OUTPUT_PATH);
            Configuration conf=new Configuration();
            System.out.println("Start");
            Job job=Job.getInstance(conf);        
            job.setJarByClass(DataClean.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, inputpath);
            FileOutputFormat.setOutputPath(job,outputpath);
            
            boolean flag = job.waitForCompletion(true);
            System.out.println(flag);
            System.exit(flag? 0 : 1);
        }
    
    }
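
    Two practical notes: FileOutputFormat requires that the output directory not already exist in HDFS, so delete hdfs://192.168.57.128:9000/testhdfs1026/run/output/DataClean before re-running the job; and assuming the project is exported as a jar (the name DataClean.jar below is illustrative), it can be submitted with:

    [root@localhost 桌面]# hadoop jar DataClean.jar classtest3.DataClean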

    Knowledge point 1: using SimpleDateFormat

    SimpleDateFormat formats and parses dates and times.
    Example:
    import java.util.Date;
    import java.text.SimpleDateFormat;

    public class SimpleDateFormat1 {
      public static void main(String[] args){
        Date date = new Date();
        String dat = date.toString();
        System.out.println(dat);
        String strDateFormat = "yyyy-MM-dd HH:mm:ss";
        SimpleDateFormat sdf = new SimpleDateFormat(strDateFormat);
        System.out.println(sdf.format(date));
      }
    }
    Output:
      Tue Nov 19 18:55:29 CST 2019
      2019-11-19 18:55:29

    Knowledge point 2: lastIndexOf() vs. indexOf()

    lastIndexOf() returns the index of the last occurrence of a substring, or -1 if it is not found.
      e.g. "ABCDABCD".lastIndexOf("BC") returns 5
      "ABCDABCD".lastIndexOf("DE") returns -1
    indexOf() returns the index of the first occurrence of a substring, or -1 if it is not found.
      e.g. "ABCDABCD".indexOf("BC") returns 1
      "ABCDABCD".indexOf("B") returns 1
      "ABCDABCD".indexOf("DE") returns -1

    Loading the data into Hive:

    Hive operations:
    hive> create table if not exists data(
        > dip string,
        > dtime string,
        > dday string,
        > dtraffic bigint,
        > dtype string,
        > did string)
        > row format delimited fields terminated by ',' lines terminated by '\n';
    [root@localhost 桌面]# hadoop fs -get hdfs://localhost:9000/testhdfs1026/run/input/DataClean.txt /usr/local
    hive> load data local inpath '/usr/local/DataClean.txt' into table data;
    hive> select * from data limit 3;