• Chapter 5: The Flink Streaming API


    1 Environment

    getExecutionEnvironment

    Creates an execution environment that represents the context of the current program. If the program is run standalone, this method returns a local execution environment; if the program is submitted to a cluster through the command-line client, it returns that cluster's execution environment. In other words, getExecutionEnvironment decides which environment to return based on how the job is being run, which makes it the most common way to create an execution environment.

    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    If no parallelism is set, the value configured in flink-conf.yaml is used; the default is 1.
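    If you prefer to fix the parallelism in code rather than rely on flink-conf.yaml, a one-line sketch on the environment created above (the value 4 is arbitrary, not from the original):

    // Override the flink-conf.yaml / default parallelism for this job; 4 is only an example
    env.setParallelism(4)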

    createLocalEnvironment

    Returns a local execution environment; the default parallelism must be specified when calling it.

    val env = StreamExecutionEnvironment.createLocalEnvironment(1)

    createRemoteEnvironment

    Returns a cluster execution environment and submits the Jar to the remote server. When calling it you must specify the JobManager's hostname/IP and port, as well as the Jar package(s) to run on the cluster.

    val env = ExecutionEnvironment.createRemoteEnvironment("jobmanager-hostname", 6123, "C://jar//flink//wordcount.jar")
    

      

    2 Source

    Create a Kafka utility class

    import java.util.Properties
    
    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
    
    object MyKafkaUtil {
    
      val prop = new Properties()
    
      prop.setProperty("bootstrap.servers", "hadoop1:9092")
      prop.setProperty("group.id", "gmall")
    
      // Build a Kafka consumer for the given topic, deserializing each record as a plain string
      def getConsumer(topic: String): FlinkKafkaConsumer011[String] = {
        val myKafkaConsumer = new FlinkKafkaConsumer011[String](topic, new SimpleStringSchema(), prop)
        myKafkaConsumer
      }
    }
    

    Add the main application class StartupApp

    import org.apache.flink.streaming.api.scala._
    
    object StartupApp {
      def main(args: Array[String]): Unit = {
        val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    
        val kafkaConsumer = MyKafkaUtil.getConsumer("GMALL_STARTUP")
    
        val dstream: DataStream[String] = environment.addSource(kafkaConsumer)
    
        dstream.print()
    
        environment.execute()
      }
    }
    

     

    How Flink plus Kafka achieves exactly-once semantics

     

    Flink uses checkpoints to record how far the data has been processed.

       

    The JobManager coordinates the TaskManagers to take checkpoints, which are stored in a StateBackend. The default StateBackend is in-memory, but it can be switched to a file-based backend for durable persistence.

    The execution process is effectively a two-phase commit: as each operator finishes, it performs a "pre-commit"; only after the sink operation completes is the "commit confirmation" issued. If execution fails, the pre-committed work is discarded.

    If the job crashes, it is recovered from the StateBackend, and only operations that were confirmed (committed) can be restored.
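    As a minimal sketch (the 5-second interval and the HDFS path are assumptions, not from the original), checkpointing and a file-based StateBackend can be enabled on the environment like this:

    import org.apache.flink.runtime.state.filesystem.FsStateBackend
    import org.apache.flink.streaming.api.CheckpointingMode
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
    
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // take a checkpoint every 5 seconds with exactly-once guarantees
    env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE)
    // persist checkpoint state to a file system (example path) instead of the default in-memory backend
    env.setStateBackend(new FsStateBackend("hdfs://hadoop1:9000/flink/checkpoints"))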

    3 Transform

    Transformation operators

    3.1 map

    val streamMap = stream.map { x => x * 2 }
    

    3.2 flatMap    

    val streamFlatMap = stream.flatMap{
        x => x.split(" ")
    }

    3.3 Filter

    val streamFilter = stream.filter{
        x => x == 1
    }
    

     

    3.4 KeyBy

    DataStream → KeyedStream: when keying by a field position, the input must be a Tuple type. It logically splits a stream into disjoint partitions, each containing the elements that share the same key; internally this is implemented with hashing.
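    A small self-contained sketch (the tuple values are made up for illustration) of keying a tuple stream by its first field:

    import org.apache.flink.api.java.tuple.Tuple
    import org.apache.flink.streaming.api.scala._
    
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val words: DataStream[(String, Int)] = env.fromElements(("a", 1), ("b", 1), ("a", 1))
    // keyBy(0) hashes on the first tuple field, so both ("a", ...) records land in the same partition
    val keyed: KeyedStream[(String, Int), Tuple] = words.keyBy(0)
    keyed.sum(1).print()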

    3.5 Reduce

    KeyedStream → DataStream: a rolling aggregation on a keyed stream. It merges the current element with the previous aggregation result and emits a new value; the resulting stream contains the result of every aggregation step, not just the final result.

    // compute the running count per channel
    val startUplogDstream: DataStream[StartUpLog] = dstream.map{ JSON.parseObject(_, classOf[StartUpLog]) }
    val keyedStream: KeyedStream[(String, Int), Tuple] = startUplogDstream.map(startuplog => (startuplog.ch, 1)).keyBy(0)
    // reduce behaves like a running sum here
    keyedStream.reduce { (ch1, ch2) =>
      (ch1._1, ch1._2 + ch2._2)
    }.print().setParallelism(1)

    How does Flink keep these accumulated values?

    Flink is a stateful stream-processing framework, and "state" here covers two aspects:

    1)  Operator state mainly records how far the data has progressed through the pipeline and is used to guarantee exactly-once semantics.

    2)  Keyed state mainly holds the accumulated values produced during the computation.

    Both kinds of state are saved into the StateBackend via the checkpoint mechanism; the StateBackend can keep state in memory (the default) or in files on disk.
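    To make keyed state concrete, here is a minimal sketch (the class name, state name, and usage are illustrative assumptions, not from the original) of a RichMapFunction that keeps a per-key running count in a ValueState, which Flink checkpoints into the StateBackend:

    import org.apache.flink.api.common.functions.RichMapFunction
    import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
    import org.apache.flink.configuration.Configuration
    
    // hypothetical example: keeps one counter per key (e.g. per channel)
    class ChannelCounter extends RichMapFunction[(String, Int), (String, Int)] {
    
      // keyed state: one value per key, stored in the configured StateBackend
      private var countState: ValueState[Int] = _
    
      override def open(parameters: Configuration): Unit = {
        countState = getRuntimeContext.getState(
          new ValueStateDescriptor[Int]("channel-count", classOf[Int]))
      }
    
      override def map(in: (String, Int)): (String, Int) = {
        // in Scala, value() unboxes to 0 before the first update
        val newCount = countState.value() + in._2
        countState.update(newCount)
        (in._1, newCount)
      }
    }

    It must run on a keyed stream, e.g. keyedStream.map(new ChannelCounter()).print().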

    3.6 Split and Select

    Split

     

    DataStream → SplitStream: splits one DataStream into two or more DataStreams according to some criteria.

    Select

    SplitStream → DataStream: retrieves one or more DataStreams from a SplitStream.

     

    Requirement: separate the appstore channel data from all other channels, producing two streams.

     

    // split the appstore channel from the other channels into two separate streams
    val splitStream: SplitStream[StartUpLog] = startUplogDstream.split { startUplog =>
      if ("appstore" == startUplog.ch) {
        List(startUplog.ch)
      } else {
        List("other")
      }
    }
    val appStoreStream: DataStream[StartUpLog] = splitStream.select("appstore")
    appStoreStream.print("apple:").setParallelism(1)
    val otherStream: DataStream[StartUpLog] = splitStream.select("other")
    otherStream.print("other:").setParallelism(1)
    

     

    3.7 Connect and CoMap

    Figure: the Connect operator

    DataStream, DataStream → ConnectedStreams: connects two data streams while preserving their types. After being connected, the two streams are merely placed inside one stream; each keeps its own data and form unchanged, and the two remain independent of each other.

     CoMap,CoFlatMap

    ConnectedStreams → DataStream: operates on ConnectedStreams with the same purpose as map and flatMap, applying a separate map or flatMap function to each of the two streams inside the ConnectedStreams.

    // connect the two streams and print the result
    val connStream: ConnectedStreams[StartUpLog, StartUpLog] = appStoreStream.connect(otherStream)
    val allStream: DataStream[String] = connStream.map(
      (log1: StartUpLog) => log1.ch,
      (log2: StartUpLog) => log2.ch
    )
    allStream.print("connect::")
    

      

    3.8 Union

    DataStream → DataStream: unions two or more DataStreams into a new DataStream containing all elements of all input streams. Note: if you union a DataStream with itself, every element appears twice in the resulting DataStream.

    // union the two streams and print the result
    val unionStream: DataStream[StartUpLog] = appStoreStream.union(otherStream)
    unionStream.print("union:::")

    Differences between Connect and Union:

    1. Before a Union, the two streams must have the same type; Connect allows different types, which can then be reconciled into one type in the subsequent coMap.

    2. Connect can only operate on two streams, while Union can operate on more than two.

    4 Sink

    Flink has no foreach-style method like Spark's that lets the user iterate over records for arbitrary output. All output to external systems has to go through a Sink, and the final output step of a job typically looks like this:

       myDstream.addSink(new MySink(xxxx)) 
    

    The framework officially provides sinks for a number of systems; for anything else, the user has to implement a custom sink.

    4.1 Kafka

    pom.xml

    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka-0.11 -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
        <version>1.7.0</version>
    </dependency>
    

    Add a method to MyKafkaUtil

    // brokerList is assumed to point at the same Kafka cluster as the consumer above
    val brokerList = "hadoop1:9092"
    
    def getProducer(topic: String): FlinkKafkaProducer011[String] = {
      new FlinkKafkaProducer011[String](brokerList, topic, new SimpleStringSchema())
    }

    Add the sink in the main function

    val myKafkaProducer: FlinkKafkaProducer011[String] = MyKafkaUtil.getProducer("channel_sum")
    
    sumDstream.map(chCount => chCount._1 + ":" + chCount._2).addSink(myKafkaProducer)

    4.2 Redis

    <!-- https://mvnrepository.com/artifact/org.apache.bahir/flink-connector-redis -->
    <dependency>
        <groupId>org.apache.bahir</groupId>
        <artifactId>flink-connector-redis_2.11</artifactId>
        <version>1.0</version>
    </dependency>
    
    import org.apache.flink.streaming.connectors.redis.RedisSink
    import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
    import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
    
    object MyRedisUtil {
    
      val conf = new FlinkJedisPoolConfig.Builder().setHost("hadoop1").setPort(6379).build()
    
      def getRedisSink(): RedisSink[(String, String)] = {
        new RedisSink[(String, String)](conf, new MyRedisMapper)
      }
    
      class MyRedisMapper extends RedisMapper[(String, String)] {
        override def getCommandDescription: RedisCommandDescription = {
          // write each (field, value) pair into the Redis hash "channel_count"
          new RedisCommandDescription(RedisCommand.HSET, "channel_count")
          // for a plain key/value store, use: new RedisCommandDescription(RedisCommand.SET)
        }
    
        override def getValueFromData(t: (String, String)): String = t._2
    
        override def getKeyFromData(t: (String, String)): String = t._1
      }
    }
    

    Call it in the main function

    sumDstream.map(chCount => (chCount._1, chCount._2 + "")).addSink(MyRedisUtil.getRedisSink())

    4.3 Elasticsearch 

    pom.xml

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-elasticsearch6_2.11</artifactId>
        <version>1.7.0</version>
    </dependency>
    
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.3</version>
    </dependency>
    

      

    Add MyEsUtil

    import java.util
    
    import com.alibaba.fastjson.{JSON, JSONObject}
    import org.apache.flink.api.common.functions.RuntimeContext
    import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
    import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
    import org.apache.http.HttpHost
    import org.elasticsearch.action.index.IndexRequest
    import org.elasticsearch.client.Requests
    
    object MyEsUtil {
    
      val httpHosts = new util.ArrayList[HttpHost]
      httpHosts.add(new HttpHost("hadoop1", 9200, "http"))
      httpHosts.add(new HttpHost("hadoop2", 9200, "http"))
      httpHosts.add(new HttpHost("hadoop3", 9200, "http"))
    
      def getElasticSearchSink(indexName: String): ElasticsearchSink[String] = {
        val esFunc = new ElasticsearchSinkFunction[String] {
          override def process(element: String, ctx: RuntimeContext, indexer: RequestIndexer): Unit = {
            println("trying to save: " + element)
            val jsonObj: JSONObject = JSON.parseObject(element)
            val indexRequest: IndexRequest = Requests.indexRequest().index(indexName).`type`("_doc").source(jsonObj)
            indexer.add(indexRequest)
            println("saved 1 record")
          }
        }
    
        val sinkBuilder = new ElasticsearchSink.Builder[String](httpHosts, esFunc)
    
        // maximum number of buffered actions before a bulk flush
        sinkBuilder.setBulkFlushMaxActions(10)
    
        sinkBuilder.build()
      }
    }
    

      

    Call it in the main method

    // send the detail records to Elasticsearch
    val esSink: ElasticsearchSink[String] = MyEsUtil.getElasticSearchSink("gmall0503_startup")
    
    dstream.addSink(esSink)
    

      

    4.4 JDBC custom sink

    <!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.44</version>
    </dependency>
    
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>druid</artifactId>
        <version>1.1.10</version>
    </dependency>

    Add MyJdbcSink

    import java.sql.{Connection, PreparedStatement}
    import java.util.Properties
    
    import javax.sql.DataSource
    
    import com.alibaba.druid.pool.DruidDataSourceFactory
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
    
    class MyJdbcSink(sql: String) extends RichSinkFunction[Array[Any]] {
    
      val driver = "com.mysql.jdbc.Driver"
      val url = "jdbc:mysql://hadoop2:3306/gmall2019?useSSL=false"
      val username = "root"
      val password = "123123"
      val maxActive = "20"
    
      var connection: Connection = null
    
      // open the connection once per parallel sink instance
      override def open(parameters: Configuration): Unit = {
        val properties = new Properties()
        properties.put("driverClassName", driver)
        properties.put("url", url)
        properties.put("username", username)
        properties.put("password", password)
        properties.put("maxActive", maxActive)
    
        val dataSource: DataSource = DruidDataSourceFactory.createDataSource(properties)
        connection = dataSource.getConnection()
      }
    
      // called once for every incoming element
      override def invoke(values: Array[Any]): Unit = {
        val ps: PreparedStatement = connection.prepareStatement(sql)
        println(values.mkString(","))
        for (i <- 0 until values.length) {
          ps.setObject(i + 1, values(i))
        }
        ps.executeUpdate()
        ps.close()
      }
    
      override def close(): Unit = {
        if (connection != null) {
          connection.close()
        }
      }
    }

    Add to the main method

    Save the detail records to MySQL:

    val startUplogDstream: DataStream[StartUpLog] = dstream.map{ JSON.parseObject(_, classOf[StartUpLog]) }
    
    val jdbcSink = new MyJdbcSink("insert into z_startup values(?,?,?,?,?)")
    startUplogDstream.map(startuplog => Array(startuplog.mid, startuplog.uid, startuplog.ch, startuplog.area, startuplog.ts)).addSink(jdbcSink)
    

      

      

      

      
