Flink | Table API | SQL


     Table API and SQL

        The Table API is a unified relational API for stream and batch processing: a Table API program can run on streaming or batch input without any modification.

    The Table API is a superset of the SQL language and is designed specifically for Apache Flink. It is a language-integrated API for Scala and Java: unlike regular SQL, where queries are specified as strings, Table API queries are defined in a language-embedded style in Java or Scala, with IDE support such as auto-completion and syntax validation.

    Add the Maven dependency

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table_2.11</artifactId>
        <version>1.7.0</version>
    </dependency>
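
    The examples below also use a Kafka source (FlinkKafkaConsumer011) and the Scala DataStream API, and parse JSON with Alibaba fastjson; a sketch of the extra dependencies this implies (artifact names and versions assumed to follow the Flink 1.7.0 / Scala 2.11 setup above, fastjson version omitted):

    <!-- assumed companion dependencies for the examples in this post -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.11</artifactId>
        <version>1.7.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
        <version>1.7.0</version>
    </dependency>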

    Constructing the table environment

    A simple filter / select example

    import org.apache.flink.api.scala._ // Scala implicit conversions

    def main(args: Array[String]): Unit = {
      // similar to SparkContext
      val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
      val myKafkaConsumer: FlinkKafkaConsumer011[String] = MyKafkaUtil.getConsumer("GMALL_STARTUP")
      val dstream: DataStream[String] = env.addSource(myKafkaConsumer)
      val startupLogDstream: DataStream[StartupLog] = dstream.map{ jsonString => JSON.parseObject(jsonString, classOf[StartupLog]) }

      // similar to SparkSession
      val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(env)

      // convert the stream into a Table
      val startupLogTable: Table = tableEnv.fromDataStream(startupLogDstream)

      // With the Table API, the SQL operations all become method calls.
      // If the field used in filter is not in the select list, filter must come before select:
      //   .filter("ch = 'appstore'").select("mid,uid,ts")
      val table: Table = startupLogTable.select("mid,ch").filter("ch = 'appstore'") // project mid, ch and keep only rows where ch = 'appstore'

      // A Table cannot be printed directly; convert it back into a stream first.
      // import org.apache.flink.table.api.scala._ is needed for the implicit conversions.
      val midchDataStream: DataStream[(String, String)] = table.toAppendStream[(String, String)]
      midchDataStream.print()
      env.execute()
    }

    Dynamic tables

    If the elements of the stream are a case class, a table can be generated directly from the case class's structure, or the fields can be named individually in field order:

    tableEnv.fromDataStream(startupLogDstream)
    tableEnv.fromDataStream(startupLogDstream, 'mid, 'uid, ...)
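
    The StartupLog case class itself is not shown in the post; a minimal sketch consistent with the fields used in the examples below (field types are assumptions) might look like:

    // hypothetical reconstruction -- field names match the examples, types are guesses
    case class StartupLog(mid: String, uid: String, appid: String, area: String,
                          os: String, ch: String, logType: String, vs: String,
                          logDate: String, logHour: String, logHourMinute: String,
                          ts: Long)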

    The resulting dynamic table can then be converted back into a stream for output:

    table.toAppendStream[(String,String)]

    Fields

     Prefix a field name with a single quote to reference it as a field, e.g. 'name, 'mid, 'amount.
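
    With these symbol-based field references, the earlier string-based query can also be written in the expression DSL (a sketch; it assumes import org.apache.flink.table.api.scala._ is in scope):

    // same query as before, written with field expressions instead of strings
    val table: Table = startupLogTable
      .select('mid, 'ch)
      .filter('ch === "appstore") // === is the Table API equality expression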

    Counting appstore-channel records every 10 seconds

    Method 1: Table API implementation

    // count per channel every 10 seconds
    def main(args: Array[String]): Unit = {
      // similar to SparkContext
      val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

      // switch the time characteristic to event time
      env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

      val myKafkaConsumer: FlinkKafkaConsumer011[String] = MyKafkaUtil.getConsumer("GMALL_STARTUP")
      val dstream: DataStream[String] = env.addSource(myKafkaConsumer)

      val startupLogDstream: DataStream[StartupLog] = dstream.map{ jsonString => JSON.parseObject(jsonString, classOf[StartupLog]) }

      // declare how to extract the event time and generate watermarks
      val startupLogWithEventTimeDStream: DataStream[StartupLog] = startupLogDstream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[StartupLog](Time.seconds(0L)) {
        override def extractTimestamp(element: StartupLog): Long = {
          element.ts
        }
      }).setParallelism(1)

      // similar to SparkSession
      val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(env)

      // convert the stream into a Table; 'ts.rowtime declares ts as the table's event-time field,
      // and every field must line up with the fields of the StartupLog class
      val startupTable: Table = tableEnv.fromDataStream(startupLogWithEventTimeDStream , 'mid,'uid,'appid,'area,'os,'ch,'logType,'vs,'logDate,'logHour,'logHourMinute,'ts.rowtime)

      // Table API solution: count per channel once every 10 seconds
      // 1. group by   2. use a window   3. use event time to drive the window
      // ① .window() opens a window on the table, which requires a declared time field;
      // ② .window(Tumble over 10000.millis on 'ts as 'tt): a tumbling window that fires every 10 s
      //    over the time field 'ts, aliased as 'tt; 'tt must appear in groupBy
      val resultTable: Table = startupTable.window(Tumble over 10000.millis on 'ts as 'tt).groupBy('ch, 'tt).select('ch, 'ch.count)

      // Convert the Table back into a stream.
      // After a group by, toAppendStream can no longer be used; use toRetractStream instead.
      val resultDstream: DataStream[(Boolean, (String, Long))] = resultTable.toRetractStream[(String, Long)]
      // .filter(_._1) keeps only the rows flagged true. The output looks like
      //   (true,(tencent,539)) (false,(tencent,539)) -- the old count is retracted as each new record arrives
      //   (true,(tencent,540))
      resultDstream.filter(_._1).print()

      env.execute()
    }
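
    The post omits the import list; a plausible set for these examples (assuming Flink 1.7, fastjson for JSON parsing, and the project-local MyKafkaUtil and StartupLog) would be:

    import com.alibaba.fastjson.JSON
    import org.apache.flink.streaming.api.TimeCharacteristic
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
    import org.apache.flink.streaming.api.scala._      // DataStream, StreamExecutionEnvironment, TypeInformation implicits
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
    import org.apache.flink.table.api.{Table, TableEnvironment}
    import org.apache.flink.table.api.scala._          // StreamTableEnvironment, Tumble and the expression DSL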

    Counting appstore-channel records every 10 seconds

    Method 2: SQL implementation

     See the official documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/sql.html

    def main(args: Array[String]): Unit = {
      // similar to SparkContext
      val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

      // switch the time characteristic to event time
      env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

      val myKafkaConsumer: FlinkKafkaConsumer011[String] = MyKafkaUtil.getConsumer("GMALL_STARTUP")
      val dstream: DataStream[String] = env.addSource(myKafkaConsumer)

      val startupLogDstream: DataStream[StartupLog] = dstream.map{ jsonString => JSON.parseObject(jsonString, classOf[StartupLog]) }

      // declare how to extract the event time and generate watermarks
      val startupLogWithEventTimeDStream: DataStream[StartupLog] = startupLogDstream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[StartupLog](Time.seconds(0L)) {
        override def extractTimestamp(element: StartupLog): Long = {
          element.ts
        }
      }).setParallelism(1)

      // similar to SparkSession
      val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(env)

      // convert the stream into a Table
      val startupTable: Table = tableEnv.fromDataStream(startupLogWithEventTimeDStream , 'mid,'uid,'appid,'area,'os,'ch,'logType,'vs,'logDate,'logHour,'logHourMinute,'ts.rowtime)

      // Method 1: Table API (shown above)
      // val resultTable: Table = startupTable.window(Tumble over 10000.millis on 'ts as 'tt).groupBy('ch,'tt).select('ch, 'ch.count)

      // Method 2: SQL
      val resultSQLTable: Table = tableEnv.sqlQuery(
        "select ch, count(ch) from " + startupTable + " group by ch, TUMBLE(ts, INTERVAL '10' SECOND)")

      // Convert the Table back into a stream.
      // After a group by, toAppendStream can no longer be used; use toRetractStream instead.
      val resultDstream: DataStream[(Boolean, (String, Long))] = resultSQLTable.toRetractStream[(String, Long)]
      // The output looks like (true,(tencent,539)) (false,(tencent,539)) (true,(tencent,540)):
      // the false record retracts the expired count as each new record arrives.
      // .filter(_._1) keeps only the rows flagged true.
      resultDstream.filter(_._1).print()

      env.execute()
    }
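
    If the window boundaries are also needed in the result, Flink SQL exposes TUMBLE_START / TUMBLE_END as auxiliary group window functions; a sketch, not part of the original example:

    // also emit the start timestamp of each 10-second window
    val resultWithWindow: Table = tableEnv.sqlQuery(
      "select ch, count(ch), TUMBLE_START(ts, INTERVAL '10' SECOND) " +
      "from " + startupTable + " group by ch, TUMBLE(ts, INTERVAL '10' SECOND)")
    val windowedDstream: DataStream[(Boolean, (String, Long, java.sql.Timestamp))] =
      resultWithWindow.toRetractStream[(String, Long, java.sql.Timestamp)]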

    About group by

     If group by is used, the table can only be converted into a stream with toRetractStream:

      val rDstream: DataStream[(Boolean, (String, Long))] = table.toRetractStream[(String,Long)]

      The first Boolean field returned by toRetractStream marks the record: true means it is the latest data, false means it is retracted, out-of-date data:

      val rDstream: DataStream[(Boolean, (String, Long))] = table.toRetractStream[(String,Long)]
      rDstream.filter(_._1).print()

    If the query includes a time window, the time field must be included in the group by.

      val table: Table = startupLogTable.filter("ch ='appstore'").window(Tumble over 10000.millis on 'ts as 'tt).groupBy('ch ,'tt).select("ch,ch.count ")

    About time windows

    To use a time window, the time attribute must be declared up front. For processing time, simply append an extra field when creating the dynamic table:

    val startupLogTable: Table = tableEnv.fromDataStream(startupLogWithEtDstream,'mid,'uid,'appid,'area,'os,'ch,'logType,'vs,'logDate,'logHour,'logHourMinute,'ps.proctime)

      For event time, declare which existing field is the rowtime when creating the dynamic table:

    val startupLogTable: Table = tableEnv.fromDataStream(startupLogWithEtDstream,'mid,'uid,'appid,'area,'os,'ch,'logType,'vs,'logDate,'logHour,'logHourMinute,'ts.rowtime)

      A tumbling window can then be declared with Tumble over 10000.millis on 'ts:

      val table: Table = startupLogTable.filter("ch ='appstore'").window(Tumble over 10000.millis on 'ts as 'tt).groupBy('ch ,'tt).select("ch,ch.count ")
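
     Besides Tumble, the Table API also provides sliding and session windows over a declared time field; a sketch using the same fields as above:

      // sliding window: 1-minute windows evaluated every 10 seconds
      val slidingTable: Table = startupLogTable
        .window(Slide over 1.minutes every 10.seconds on 'ts as 'tt)
        .groupBy('ch, 'tt)
        .select('ch, 'ch.count)

      // session window: a new window starts after 30 seconds of inactivity per channel
      val sessionTable: Table = startupLogTable
        .window(Session withGap 30.seconds on 'ts as 'tt)
        .groupBy('ch, 'tt)
        .select('ch, 'ch.count)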

     

Original article: https://www.cnblogs.com/shengyang17/p/12247026.html