Flink Common Operators: Code Examples (Scala and Java)


    map in Scala

    def main(args: Array[String]): Unit = {
      val env = ExecutionEnvironment.getExecutionEnvironment
      mapFunction(env)
    }
    
    def mapFunction(env: ExecutionEnvironment):Unit = {
      val data = env.fromCollection(List(1,2,3,4,5))
      data.map((x:Int)=>x+1).print()
    }
    
    Output:
    2
    3
    4
    5
    6
    

    Scala syntax shorthand (these four forms are equivalent):

    data.map((x:Int)=>x+1).print()
    println("----")
    data.map((x)=>x+1).print()
    println("----")
    data.map(x=>x+1).print()
    println("----")
    data.map(_+1).print()
    

    map in Java

    public static void main(String[] args) throws Exception{
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            mapFunction(env);
        }
    
        public static void mapFunction(ExecutionEnvironment env) throws Exception{
            List<Integer> list = new ArrayList<Integer>();
            for (int i = 1; i <= 5; i++) {
                list.add(i);
            }
            env.fromCollection(list).map(new MapFunction<Integer, Integer>() {
                @Override
                public Integer map(Integer input) {
                    return input + 1;
                }
            }).print();
        }
        
    Output:
    2
    3
    4
    5
    6
    

    filter in Scala

    The filter operator keeps only the elements that satisfy the predicate.

    def main(args: Array[String]): Unit = {
      val env = ExecutionEnvironment.getExecutionEnvironment
      filterFunction(env)
    }
    
    def filterFunction(env: ExecutionEnvironment):Unit = {
      env.fromCollection(List(1,2,3,4,5))
        .map(_+1)
        .filter(_>3)
        .print()
    }
    
    Output:
    4
    5
    6
    

    filter in Java

    public static void main(String[] args) throws Exception{
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        filterFunction(env);
    }
    
    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<Integer>();
        for (int i = 1; i <= 5; i++) {
            list.add(i);
        }
        env.fromCollection(list).map(new MapFunction<Integer, Integer>() {
            @Override
            public Integer map(Integer input) throws Exception {
                return input + 1;
            }
        }).filter(new FilterFunction<Integer>() {
            @Override
            public boolean filter(Integer input) throws Exception{
                return input > 3;
            }
        }).print();
    }
    
    Output:
    4
    5
    6
    

    mapPartition in Scala

    What mapPartition does: where map invokes the user function once per element, mapPartition invokes it once per partition.

    import scala.util.Random
    // a database utility object, used here to simulate acquiring connections
    object DBUtils {
      def getConnection() = {
        // acquire a database connection (simulated with a random id)
        new Random().nextInt(10)
      }

      def returnConnection(connection: String) = {
        // save the data to the database
      }
    }

    With map, every element triggers a request for a database connection; requests that frequent can bring the database down. mapPartition avoids this: it requests one connection per partition of data, and with the parallelism set appropriately it reduces the request pressure on the database.
    
    def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
        //filterFunction(env)
        mapPartitionFunction(env)
      }
    
      def mapPartitionFunction(env:ExecutionEnvironment):Unit = {
        val students = new ListBuffer[String]
        for(i <-1 to 100) {
          students.append("student " + i)
        }
    
        val data = env.fromCollection(students).setParallelism(4)
    
        data.mapPartition(x=>{
          val connection = DBUtils.getConnection()
          println(connection + "......")
          x
        }).print();
    
    //    data.map(x=>{
    //      // every element written to the database first needs a connection
    //      val connection = DBUtils.getConnection() + "...."
    //
    //      // save the data to the DB
    //      DBUtils.returnConnection(connection)
    //    }).print();
    
      }
    
    The result: map requests a connection 100 times, while mapPartition requests one only 4 times, greatly reducing the load on the database.
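    The 100-versus-4 count can be checked without a Flink cluster. The following plain-Java sketch (not Flink code; the even chunking is a simplified stand-in for Flink's partitioning) counts how many times a connection would be acquired under each style:

    ```java
    public class ConnectionCountSketch {

        // Returns {map-style acquisitions, mapPartition-style acquisitions}.
        static int[] simulate(int elements, int parallelism) {
            int mapStyle = 0;
            int partitionStyle = 0;

            // map-style: every element acquires its own connection
            for (int i = 0; i < elements; i++) {
                mapStyle++;                       // one getConnection() per element
            }

            // mapPartition-style: one connection shared by all elements of a partition
            int chunk = (elements + parallelism - 1) / parallelism;
            for (int start = 0; start < elements; start += chunk) {
                partitionStyle++;                 // one getConnection() per partition
                // ... write elements [start, start + chunk) with this one connection
            }
            return new int[]{mapStyle, partitionStyle};
        }

        public static void main(String[] args) {
            int[] counts = simulate(100, 4);
            System.out.println("map: " + counts[0] + ", mapPartition: " + counts[1]);
            // prints: map: 100, mapPartition: 4
        }
    }
    ```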
    

    mapPartition in Java

    public static void main(String[] args) throws Exception{
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    
        mapPartition(env);
    }
    
    public static void mapPartition(ExecutionEnvironment env) throws Exception {
        List<String> list = new ArrayList<String>();
    
        for (int i = 1; i <=100 ; i++) {
            list.add("Student " + i);
        }
        DataSource<String> data = env.fromCollection(list);
    
        data.map(new MapFunction<String, String>(){
            @Override
            public String map(String input) throws Exception {
                String connection = DBUtils.getConnection() + "";
                System.out.println("connection: [ " + connection + " ]");
                DBUtils.returnConnection(connection);
                return input;
            }
        }).print();
    }
    

    Now the same logic with mapPartition:

    public static void main(String[] args) throws Exception{
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    
        mapPartition(env);
    }
    
    public static void mapPartition(ExecutionEnvironment env) throws Exception {
        List<String> list = new ArrayList<String>();
    
        for (int i = 1; i <=100 ; i++) {
            list.add("Student " + i);
        }
        DataSource<String> data = env.fromCollection(list).setParallelism(4);
    
        data.mapPartition(new MapPartitionFunction<String, String>() {
            @Override
            public void mapPartition(Iterable<String> inputs, Collector<String> collector) {
                String connection =  DBUtils.getConnection() + "";
                System.out.println("connect: [ " + connection + " ]");
                // forward the elements with collector.collect(...) when a downstream
                // operator needs them; omitted here so only the connections are shown
                DBUtils.returnConnection(connection);
            }
        }).print();
    }
    
    Output:
    connect: [ 9 ]
    connect: [ 2 ]
    connect: [ 2 ]
    connect: [ 3 ]
    
    Only 4 connections are created.
    

    first(n) in Scala

    def main(args: Array[String]): Unit = {
      val env = ExecutionEnvironment.getExecutionEnvironment
      firstFunction(env)
    }
    
    def firstFunction(env: ExecutionEnvironment) : Unit = {
      val info = ListBuffer[(Int,String)]()
      info.append((1,"Hadoop"))
      info.append((1,"Spark"))
      info.append((1,"Flink"))
      info.append((2,"Java"))
      info.append((2,"Spring"))
      info.append((3,"Linux"))
      info.append((4,"VUE"))
    
      val data = env.fromCollection(info)
    
      data.first(3).print()
    Output:
    (1,Hadoop)
    (1,Spark)
    (1,Flink)
    
    data.groupBy(0).first(2).print()
    Output:
    (3,Linux)
    (1,Hadoop)
    (1,Spark)
    (2,Java)
    (2,Spring)
    (4,VUE)
    
    data.groupBy(0).sortGroup(1,Order.DESCENDING).first(2).print()
    Output:
    (3,Linux)
    (1,Spark)
    (1,Hadoop)
    (2,Spring)
    (2,Java)
    (4,VUE)
    
    data.groupBy(0).sortGroup(1,Order.ASCENDING).first(2).print()
    Output:
    (3,Linux)
    (1,Flink)
    (1,Hadoop)
    (2,Java)
    (2,Spring)
    (4,VUE)
    
    }
    
    

    first in Java

    public static void main(String[] args) throws Exception{
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    
        firstFunction(env);
    }
    
    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info = new ArrayList<Tuple2<Integer, String>>();
        info.add(new Tuple2(1,"Hadoop"));
        info.add(new Tuple2(1,"Spark"));
        info.add(new Tuple2(1,"Flink"));
        info.add(new Tuple2(2,"Java"));
        info.add(new Tuple2(2,"Spring"));
        info.add(new Tuple2(3,"Linux"));
        info.add(new Tuple2(4,"VUE"));
    
        DataSource<Tuple2<Integer,String>> data = env.fromCollection(info);
    
        data.first(3).print();
        System.out.println("~~~~~~~");
        data.groupBy(0).first(2).print();
        System.out.println("~~~~~~~");
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
        System.out.println("~~~~~~~");
        data.groupBy(0).sortGroup(1, Order.ASCENDING).first(2).print();
    }
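    To see why each group contributes at most two rows, here is a plain-Java (non-Flink) sketch of the groupBy(0).sortGroup(1, Order.DESCENDING).first(2) pipeline; the Object[] "tuples" and the helper name are illustrative, not Flink API, and unlike Flink the groups here come out in key order:

    ```java
    import java.util.*;

    public class GroupFirstSketch {

        // groupBy(field 0), sortGroup(field 1, descending), first(2) -- as plain Java
        static List<String> firstTwoPerGroupDesc(List<Object[]> tuples) {
            Map<Object, List<Object[]>> groups = new TreeMap<>();
            for (Object[] t : tuples) {
                groups.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t);
            }
            List<String> out = new ArrayList<>();
            for (List<Object[]> group : groups.values()) {
                // sort each group on field 1, descending
                group.sort((a, b) -> ((String) b[1]).compareTo((String) a[1]));
                for (int i = 0; i < Math.min(2, group.size()); i++) {
                    out.add("(" + group.get(i)[0] + "," + group.get(i)[1] + ")");
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<Object[]> info = Arrays.asList(
                    new Object[]{1, "Hadoop"}, new Object[]{1, "Spark"}, new Object[]{1, "Flink"},
                    new Object[]{2, "Java"}, new Object[]{2, "Spring"});
            System.out.println(firstTwoPerGroupDesc(info));
            // prints: [(1,Spark), (1,Hadoop), (2,Spring), (2,Java)]
        }
    }
    ```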
    

    flatMap in Scala

    flatMap: takes one element and produces zero, one, or more elements.

    def main(args: Array[String]): Unit = {
      val env = ExecutionEnvironment.getExecutionEnvironment
    
      flatMapFunction(env)
    }
    
    def flatMapFunction(env: ExecutionEnvironment) : Unit = {
      val info = ListBuffer[String]()
      info.append("hadoop,spark")
      info.append("flink,spark")
      info.append("hadoop,flink,spark")
    
      env.fromCollection(info).flatMap(_.split(",")).print()
      
    Output:
    hadoop
    spark
    flink
    spark
    hadoop
    flink
    spark
      env.fromCollection(info).flatMap(_.split(",")).map((_,1)).groupBy(0).sum(1).print()
    Output:
    (hadoop,2)
    (flink,2)
    (spark,3)
    
    }
    
    

    flatMap in Java

    public static void main(String[] args) throws Exception{
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    
        flatMapFunction(env);
    }
    
    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> list = new ArrayList<String>();
        list.add("spark,hadoop,flink");
        list.add("sqoop,flink,spark");
        list.add("storm,flink");
    
        DataSource<String> data = env.fromCollection(list);
    
        data.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String input, Collector<String> collector) throws Exception{
                String splits[] = input.split(",");
                for (String split : splits){
                    collector.collect(split);
                }
            }
        }).map(new MapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String s) throws Exception {
                return new Tuple2<String, Integer>(s,1);
            }
        }).groupBy(0).sum(1).print();
    }
    Output:
    (hadoop,1)
    (flink,3)
    (sqoop,1)
    (spark,2)
    (storm,1)
    Tip: write this pattern out a few times until it is second nature.
    

    distinct in Scala

    def main(args: Array[String]): Unit = {
      val env = ExecutionEnvironment.getExecutionEnvironment
    
      distinctFunction(env)
    }
    
    def distinctFunction(env: ExecutionEnvironment) :Unit = {
      val info = ListBuffer[String]()
      info.append("hadoop,spark")
      info.append("flink,spark")
      info.append("hadoop,flink,spark")
    
      env.fromCollection(info).flatMap(_.split(",")).distinct().print()
    }
    
    Output:
    hadoop
    flink
    spark
    

    distinct in Java

    public static void main(String[] args) throws Exception{
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    
        distinctFunction(env);
    }
    
    public static void distinctFunction(ExecutionEnvironment env) throws Exception {
        List<String> list = new ArrayList<String>();
        list.add("spark,hadoop,flink");
        list.add("sqoop,flink,spark");
        list.add("storm,flink");
    
        DataSource<String> data = env.fromCollection(list);
        data.flatMap(new FlatMapFunction<String, String>(){
            @Override
            public void flatMap(String input, Collector<String> collector) throws Exception {
                String[] splits = input.split(",");
                for (String split : splits){
                    collector.collect(split);
                }
            }
        }).distinct().print();
    }
    Output:
    hadoop
    flink
    sqoop
    spark
    storm
    

    join in Scala

    val result = input1.join(input2).where(0).equalTo(1)
    Explanation: 0 is a field index into the first input, 1 a field index into the second;
    field 0 of input1 is joined against field 1 of input2.
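    Those index semantics can be sketched in plain Java (not Flink API; a simple hash join, with Object[] standing in for tuples). where(0).equalTo(1) means: build a table keyed on field 1 of the right input and probe it with field 0 of the left input:

    ```java
    import java.util.*;

    public class JoinSketch {

        // where(0).equalTo(1): field 0 of each left tuple must equal field 1 of a right tuple
        static List<String> join(List<Object[]> left, List<Object[]> right) {
            Map<Object, Object[]> rightByKey = new HashMap<>();
            for (Object[] r : right) {
                rightByKey.put(r[1], r);            // index the right side on field 1
            }
            List<String> out = new ArrayList<>();
            for (Object[] l : left) {
                Object[] r = rightByKey.get(l[0]);  // probe with field 0 of the left side
                if (r != null) {
                    out.add("(" + l[0] + "," + l[1] + "," + r[0] + ")");
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<Object[]> names  = Arrays.asList(new Object[]{1, "A"}, new Object[]{2, "B"});
            List<Object[]> cities = Arrays.asList(new Object[]{"X", 1}, new Object[]{"Y", 3});
            System.out.println(join(names, cities));
            // prints: [(1,A,X)]  -- only the key 1 appears on both sides
        }
    }
    ```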

    def main(args: Array[String]): Unit = {
      val env = ExecutionEnvironment.getExecutionEnvironment
    
      joinFunction(env)
    }
    
    def joinFunction(env: ExecutionEnvironment):Unit = {
      val info1 = ListBuffer[(Int,String)]()  // id, name
      info1.append((1,"张三"))
      info1.append((2,"李四"))
      info1.append((3,"王五"))
      info1.append((4,"小强"))
    
      val info2 = ListBuffer[(Int,String)]()  // id, city
      info2.append((1,"北京"))
      info2.append((2,"上海"))
      info2.append((3,"成都"))
      info2.append((5,"武汉"))
    
      val data1 = env.fromCollection(info1)
      val data2 = env.fromCollection(info2)
    
      data1.join(data2).where(0).equalTo(0).apply((first,second)=>{
        (first._1,first._2,second._2)
      }).print();
    }
    Output:
    (3,王五,成都)
    (1,张三,北京)
    (2,李四,上海)
    

    join in Java

    public static void main(String[] args) throws Exception{
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    
        joinFunction(env);
    }
    
    public static void joinFunction(ExecutionEnvironment env) throws Exception{
        List <Tuple2<Integer, String>> info1 = new ArrayList<Tuple2<Integer, String>>();
        info1.add(new Tuple2(1,"张三"));  // id, name
        info1.add(new Tuple2(2,"李四"));
        info1.add(new Tuple2(3,"王五"));
        info1.add(new Tuple2(4,"小强"));
    
        List <Tuple2<Integer, String>> info2 = new ArrayList<Tuple2<Integer, String>>();
        info2.add(new Tuple2(1,"北京"));  // id, city
        info2.add(new Tuple2(2,"上海"));
        info2.add(new Tuple2(3,"成都"));
        info2.add(new Tuple2(5,"杭州"));
    
        DataSource<Tuple2<Integer,String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer,String>> data2 = env.fromCollection(info2);
    
        data1.join(data2).where(0).equalTo(0).with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,String>, Tuple3<Integer,String,String>>(){
            @Override
            public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception{
                return new Tuple3<Integer, String, String>(first.f0, first.f1, second.f1);
            }
        }).print();
    }
    
    Output:
    (3,王五,成都)
    (1,张三,北京)
    (2,李四,上海)
    

    leftOuterJoin in Scala

      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
    
        outjoinFunction(env)
      }
    
      def outjoinFunction(env: ExecutionEnvironment):Unit = {
        val info1 = ListBuffer[(Int,String)]()  //编号  名字
        info1.append((1,"张三"))
        info1.append((2,"李四"))
        info1.append((3,"王五"))
        info1.append((4,"小强"))
    
        val info2 = ListBuffer[(Int,String)]()  //编号  城市
        info2.append((1,"北京"))
        info2.append((2,"上海"))
        info2.append((3,"成都"))
        info2.append((5,"武汉"))
    
        val data1 = env.fromCollection(info1)
        val data2 = env.fromCollection(info2)
    
        data1.leftOuterJoin(data2).where(0).equalTo(0).apply((first,second)=> {
          if (second == null) {
            (first._1, first._2, "null")
          } else {
            (first._1, first._2, second._2)
          }
        }).print();
    }
    Output:
    (3,王五,成都)
    (1,张三,北京)
    (2,李四,上海)
    (4,小强,null)
    

    leftOuterJoin in Java

    public static void main(String[] args) throws Exception{
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    
        outerjoinFunction(env);
    }
    
    public static void outerjoinFunction(ExecutionEnvironment env) throws Exception{
        List <Tuple2<Integer, String>> info1 = new ArrayList<Tuple2<Integer, String>>();
        info1.add(new Tuple2(1,"张三"));  // id, name
        info1.add(new Tuple2(2,"李四"));
        info1.add(new Tuple2(3,"王五"));
        info1.add(new Tuple2(4,"小强"));
    
        List <Tuple2<Integer, String>> info2 = new ArrayList<Tuple2<Integer, String>>();
        info2.add(new Tuple2(1,"北京"));  // id, city
        info2.add(new Tuple2(2,"上海"));
        info2.add(new Tuple2(3,"成都"));
        info2.add(new Tuple2(5,"杭州"));
    
        DataSource<Tuple2<Integer,String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer,String>> data2 = env.fromCollection(info2);
    
        data1.leftOuterJoin(data2).where(0).equalTo(0).with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,String>, Tuple3<Integer,String,String>>(){
            @Override
            public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception{
                if (second == null) {
                    return new Tuple3<Integer, String, String>(first.f0,first.f1,"null");
                } else {
                    return new Tuple3<Integer, String, String>(first.f0, first.f1, second.f1);
                }
            }
        }).print();
    }
    Output:
    (3,王五,成都)
    (1,张三,北京)
    (2,李四,上海)
    (4,小强,null)
    

    cross in Scala

    def main(args: Array[String]): Unit = {
      val env = ExecutionEnvironment.getExecutionEnvironment
    
      crossFunction(env)
    }
    
    def crossFunction(env: ExecutionEnvironment):Unit = {
      val info1 = ListBuffer[String]()
      info1.append("长城")
      info1.append("长安")
    
      val info2 = ListBuffer[Int]()
      info2.append(1)
      info2.append(2)
      info2.append(3)
    
      val data1 = env.fromCollection(info1)
      val data2 = env.fromCollection(info2)
    
      data1.cross(data2).print()
    }
    Output:
    (长城,1)
    (长城,2)
    (长城,3)
    (长安,1)
    (长安,2)
    (长安,3)
    

    cross in Java

    public static void main(String[] args) throws Exception{
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    
        crossFunction(env);
    }
    
    public static void crossFunction(ExecutionEnvironment env) throws Exception{
        List<String> info1 = new ArrayList<String>();
        info1.add("张三");  
        info1.add("李四");
    
        List<Integer> info2 = new ArrayList<Integer>();
        info2.add(1);
        info2.add(2);
        info2.add(3);
    
        DataSource<String> data1 = env.fromCollection(info1);
        DataSource<Integer> data2 = env.fromCollection(info2);
    
        data1.cross(data2).print();
    }
    Output:
    (张三,1)
    (张三,2)
    (张三,3)
    (李四,1)
    (李四,2)
    (李四,3)
    

    Sink in Scala

    writeAsText writes the DataSet out as text; with the sink parallelism set to 3, the path becomes a directory containing one file per parallel task.

    def main(args: Array[String]): Unit = {
      val env = ExecutionEnvironment.getExecutionEnvironment
      val data = 1.to(10)
      val text = env.fromCollection(data)
      val path = "/Users/zhiyingliu/tmp/flink/ouput"
      text.writeAsText(path,WriteMode.OVERWRITE).setParallelism(3)
      env.execute("sinkTest")
    }
    

    Sink in Java

    public static void main(String[] args) throws Exception{
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    
        List<Integer> info = new ArrayList<Integer>();
    
        for (int i = 0; i < 10; i++) {
            info.add(i);
        }
    
        DataSource<Integer> data = env.fromCollection(info);
    
        String filePath  = "/Users/zhiyingliu/tmp/flink/ouput-java/";
        data.writeAsText(filePath,FileSystem.WriteMode.OVERWRITE);
        env.execute("java-sink");
    }
    
    Original post: https://www.cnblogs.com/bigband/p/13588683.html