• Exercise: Flink Interview Question


    Problem description:

    App user click logs, with columns: time, user ID, product code, clicked feature code, email, province/city, elapsed time, and parameter details. Use Flink batch processing to cleanse the data and compute windowed statistics. Sample data:

    data

    Notes:

    1. Columns are separated by commas, and the detail parameter is JSON (which itself contains commas; one way to split such a line safely is sketched below)
    2. Some rows contain dirty data
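
    Because the detail parameter is JSON and carries commas of its own, a plain split(",") also cuts through the JSON body. Below is a minimal parsing sketch for such a line; it assumes, as the sample data implies, that only the last column contains embedded commas (the sample row and class name are illustrative only, not from the exercise data):

    public class LineSplitDemo {
        public static void main(String[] args) {
            // Illustrative row: 7 plain columns, then a JSON detail containing commas
            String line = "2022-01-01 00:00:00,u1,p1,f1,a@b.com,Beijing,12,{\"level\":\"info\",\"ver\":1}";
            // Limit the split to 8 parts so everything after the 7th comma
            // (the JSON detail) stays in one piece
            String[] fields = line.split(",", 8);
            System.out.println(fields[7]); // -> {"level":"info","ver":1}
        }
    }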

    Environment:

    1. The machine has internet access, and the IDEA development environment is on the test machine's desktop; you must add the Flink dependencies yourself (Flink 1.11.1 or later is required)
    2. The sample data is available at

    链接: https://pan.baidu.com/s/18n4PjyXHsrXwWz6rdBzKIw?pwd=ct6c 

    Extraction code: ct6c

    1. Note: avoid restarting the machine during the test, or your work may be reset
    2. The test lasts 2 hours

    pom

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>org.example</groupId>
        <artifactId>FlinkTrue</artifactId>
        <version>1.0-SNAPSHOT</version>
        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
            <maven.compiler.source>1.8</maven.compiler.source>
            <maven.compiler.target>1.8</maven.compiler.target>
            <flink.version>1.13.0</flink.version>
            <hadoop.version>3.1.3</hadoop.version>
            <scala.version>2.12</scala.version>
            <scala.binary.version>2.11</scala.binary.version>
        </properties>
    
        <dependencies>
    
            <!--flink-java-core-stream-clients -->
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-core</artifactId>
                <version>${flink.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-java</artifactId>
                <version>${flink.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-streaming-java_${scala.version}</artifactId>
                <version>${flink.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-clients_${scala.version}</artifactId>
                <version>${flink.version}</version>
            </dependency>
    
            <dependency>
                <groupId>org.projectlombok</groupId>
                <artifactId>lombok</artifactId>
                <version>1.16.22</version>
            </dependency>
    
    
            <!--jedis-->
            <dependency>
                <groupId>redis.clients</groupId>
                <artifactId>jedis</artifactId>
                <version>3.3.0</version>
            </dependency>
    
            <!--fastjson-->
            <dependency>
                <groupId>com.alibaba</groupId>
                <artifactId>fastjson</artifactId>
                <version>1.2.60</version>
            </dependency>
    
    
            <!--flink SQL table api-->
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-table-api-java-bridge_${scala.version}</artifactId>
                <version>${flink.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-table-planner-blink_${scala.version}</artifactId>
                <version>${flink.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-table-common</artifactId>
                <version>${flink.version}</version>
            </dependency>
    
            <!--cep-->
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-cep-scala_${scala.version}</artifactId>
                <version>${flink.version}</version>
            </dependency>
    
            <!--csv-->
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-csv</artifactId>
                <version>${flink.version}</version>
            </dependency>
    
            <!--sink kafka-->
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-connector-kafka_${scala.version}</artifactId>
                <version>${flink.version}</version>
            </dependency>
    
            <!--sink hadoop hdfs-->
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-connector-filesystem_${scala.binary.version}</artifactId>
                <version>1.4.2</version>
            </dependency>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-hadoop-compatibility_${scala.version}</artifactId>
                <version>${flink.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-client</artifactId>
                <version>${hadoop.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
                <version>${hadoop.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-hdfs</artifactId>
                <version>${hadoop.version}</version>
            </dependency>
    
            <!--sink mysql-->
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-jdbc_${scala.version}</artifactId>
                <version>1.9.2</version>
            </dependency>
            <dependency>
                <groupId>mysql</groupId>
                <artifactId>mysql-connector-java</artifactId>
                <version>5.1.38</version>
            </dependency>
    
        <!--sink to HBase-->
    
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-hbase_${scala.version}</artifactId>
                <version>1.8.1</version>
            </dependency>
    
            <dependency>
                <groupId>org.apache.hbase</groupId>
                <artifactId>hbase-client</artifactId>
                <version>2.4.3</version>
            </dependency>
    
        <!--Guava: core libraries widely used across Google's Java projects-->
            <dependency>
                <groupId>com.google.guava</groupId>
                <artifactId>guava</artifactId>
                <version>30.1.1-jre</version>
            </dependency>
    
        <!--jdbc sink clickhouse (Jackson excluded to avoid dependency conflicts)-->
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-connector-jdbc_${scala.version}</artifactId>
                <version>${flink.version}</version>
            </dependency>
            <dependency>
                <groupId>ru.yandex.clickhouse</groupId>
                <artifactId>clickhouse-jdbc</artifactId>
                <version>0.2.4</version>
                <exclusions>
                    <exclusion>
                        <groupId>com.fasterxml.jackson.core</groupId>
                        <artifactId>jackson-databind</artifactId>
                    </exclusion>
                    <exclusion>
                        <groupId>com.fasterxml.jackson.core</groupId>
                        <artifactId>jackson-core</artifactId>
                    </exclusion>
                </exclusions>
            </dependency>
    
        <!-- Flink Redis connector (Bahir) -->
            <dependency>
                <groupId>org.apache.bahir</groupId>
                <artifactId>flink-connector-redis_2.11</artifactId>
                <version>1.0</version>
            </dependency>
    
    
            <!--sink es-->
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-connector-elasticsearch7_${scala.version}</artifactId>
                <version>1.10.1</version>
            </dependency>
    
        </dependencies>
    
    </project>

    bean

    import lombok.AllArgsConstructor;
    import lombok.Data;
    import lombok.NoArgsConstructor;
    
    
    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public class Log {
        // time
        private Long time;
        // user ID
        private String uid;
        // product code
        private String sid;
        // clicked feature code
        private String exeid;
        // email
        private String email;
        // province/city
        private String province;
        // elapsed time
        private String spend;
        // parameter details
        private String detail;
    }
    import lombok.AllArgsConstructor;
    import lombok.NoArgsConstructor;
    
    @lombok.Data
    @NoArgsConstructor
    @AllArgsConstructor
    public class Data {
        private Long time;
    //    private Integer isman;
        private String level;
    //    private Integer ts;
    //    private Integer ver;
        private Integer count;
    }
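
    Since the detail column is JSON, fastjson can also bind it directly onto the Data bean instead of reading fields one at a time. A minimal sketch, assuming a detail payload with a level field as the exercise describes (the literal JSON below is illustrative only):

    import com.alibaba.fastjson.JSON;

    public class DetailBindDemo {
        public static void main(String[] args) {
            // Illustrative detail payload; in the exercise it comes from the log's last column
            String detail = "{\"level\":\"info\",\"ver\":1}";
            // fastjson maps matching field names onto the bean; unknown fields are ignored
            Data data = JSON.parseObject(detail, Data.class);
            System.out.println(data.getLevel()); // -> info
        }
    }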

    test

    import com.alibaba.fastjson.JSON;
    import com.alibaba.fastjson.JSONObject;
    import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.functions.AggregateFunction;
    import org.apache.flink.api.common.functions.FilterFunction;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStreamSource;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.util.Collector;
    
    import java.text.SimpleDateFormat;
    import java.time.Duration;
    
    public class FlinkTrue {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(1);
    
    //        DataStreamSource<String> data = env.readTextFile(FlinkTrue.class.getClassLoader().getResource("simple.etl.csv").getPath());
            DataStreamSource<String> data = env.readTextFile("C:\\Users\\liuyuan\\Desktop\\实训3\\FlinkTrue\\src\\main\\resources\\simple.etl.csv");
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            // Event time is the default since Flink 1.12, so the deprecated
            // setStreamTimeCharacteristic(TimeCharacteristic.EventTime) call is not needed
        SingleOutputStreamOperator<Log> mapped = data.map(new MapFunction<String, Log>() {
            @Override
            public Log map(String s) throws Exception {
                String[] split = s.split(",");
                Log log = new Log();
                if (split.length >= 11) {
                    try {
                        // The JSON detail contains commas, so a plain split cuts it apart;
                        // re-join everything from the 8th field onward
                        log = new Log(sdf.parse(split[0]).getTime(), split[1], split[2], split[3], split[4], split[5], split[6],
                                split[7] + "," + split[8] + "," + split[9] + "," + split[10]);
                    } catch (Exception e) {
                        // Unparseable time: leave log.time null so the filter below drops the row
                    }
                }
                return log;
            }
        });
        SingleOutputStreamOperator<Log> filtered = mapped.filter(new FilterFunction<Log>() {
            @Override
            public boolean filter(Log log) throws Exception {
                // Dirty rows never got a parsed time, so drop anything with a null time
                return log.getTime() != null;
            }
        });
    
        SingleOutputStreamOperator<Data> mapData = filtered.map(new MapFunction<Log, Data>() {
            @Override
            public Data map(Log log) throws Exception {
                String detail = log.getDetail();
                // Strip quote characters and rely on fastjson's lenient parsing of the detail
                String s = detail.replaceAll("[\"]", "");
                JSONObject jsonObject = JSON.parseObject(s);
                // count starts at 0; the window aggregate below does the counting
                return new Data(log.getTime(), String.valueOf(jsonObject.get("level")), 0);
            }
        });
    
        // Watermark: 5 seconds of bounded out-of-orderness, per requirement 4 (later records are dropped)
            SingleOutputStreamOperator<Data> stream = mapData.assignTimestampsAndWatermarks(WatermarkStrategy.<Data>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner(new SerializableTimestampAssigner<Data>() {
                        @Override
                        public long extractTimestamp(Data data, long l) {
                            return data.getTime();
                        }
                    }));
    
    
        SingleOutputStreamOperator<String> aggregate = stream.keyBy(d -> d.getLevel())
                .window(TumblingEventTimeWindows.of(Time.seconds(3)))
                .aggregate(new AggregateFunction<Data, Integer, Integer>() {
                    @Override
                    public Integer createAccumulator() {
                        return 0;
                    }

                    @Override
                    public Integer add(Data data, Integer acc) {
                        // One more occurrence of this level in the current window
                        return acc + 1;
                    }

                    @Override
                    public Integer getResult(Integer acc) {
                        return acc;
                    }

                    @Override
                    public Integer merge(Integer acc, Integer other) {
                        // Combine partial counts instead of returning null
                        return acc + other;
                    }
                }, new ProcessWindowFunction<Integer, String, String, TimeWindow>() {

                    @Override
                    public void process(String key, Context context, Iterable<Integer> iterable, Collector<String> collector) throws Exception {
                        long start = context.window().getStart();
                        long end = context.window().getEnd();
                        Integer count = iterable.iterator().next();
                        // Required output format: timewindow: key: count:
                        collector.collect("timewindow: " + start + "-" + end + " key: " + key + " count: " + count);
                    }
                });
    
            aggregate.print();
    
    
        // 1. Drop or repair records as necessary
        // 2. Use the first column of the sample data (time) as the event time
        // 3. Build a 3s window and count the occurrences of each level in the detail parameter
        // 4. Set the watermark to 5s to filter out some late data

        // 5. Deprecated methods are not allowed
        // 6. Implement in Java
        // Console output: window time + level + count, e.g. timewindow: key: count:
    
            env.execute();
    
        }
    }
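
    The 5s watermark above simply drops records that arrive too late, which is what requirement 4 asks for. If you also want to inspect what gets discarded, Flink's side output for late data is the idiomatic tool. A minimal sketch, assuming the stream of Data records built above (the tag name and print labels are illustrative):

    import org.apache.flink.api.common.functions.AggregateFunction;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.OutputTag;

    // Inside main(), after `stream` has timestamps and watermarks assigned:
    final OutputTag<Data> lateTag = new OutputTag<Data>("late-data") {};

    SingleOutputStreamOperator<Integer> counted = stream
            .keyBy(Data::getLevel)
            .window(TumblingEventTimeWindows.of(Time.seconds(3)))
            .sideOutputLateData(lateTag)   // route late records here instead of silently dropping them
            .aggregate(new AggregateFunction<Data, Integer, Integer>() {
                @Override public Integer createAccumulator()         { return 0; }
                @Override public Integer add(Data d, Integer acc)    { return acc + 1; }
                @Override public Integer getResult(Integer acc)      { return acc; }
                @Override public Integer merge(Integer a, Integer b) { return a + b; }
            });

    counted.print("on-time");
    counted.getSideOutput(lateTag).print("late");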

    Grading criteria:

    Assessment area                   A                             B                                              C
    Data cleansing & JSON handling    Correct result, clean code    Correct result, clean code with minor flaws    Not met
    Flink window usage                Correct result, clean code    Correct result, clean code with minor flaws    Not met
    Flink watermark usage             Correct result, clean code    Correct result, clean code with minor flaws    Not met

  • Source: https://www.cnblogs.com/chang09/p/16382638.html