• Combining PyFlink and sklearn: PyFlink can call sklearn from a UDF that is executed through SQL. In other words, SQL can now support AI models, no different from flowsql!


    PyFlink + Sklearn: Machine Learning Prediction on Streaming Data

    1. Overview

    The idea is inspired by Alink, Alibaba's unified batch-stream machine learning framework; see one of the Alink articles for background.

    The overall approach is to process streaming data with PyFlink, define a user-defined function (UDF) in PyFlink, and have the UDF call Scikit-learn to make predictions, scoring each record as it arrives.

    The concrete steps are:

    1. Train a model with sklearn and save it to disk;
    2. Send streaming data through Kafka (a minimal producer sketch follows this list);
    3. Consume the Kafka data in Flink;
    4. Define a UDF that loads the saved model and predicts on each incoming record;
    5. Print the prediction results.
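
    Step 2 (sending data into Kafka) is not shown in the code below, so here is a minimal producer sketch. It assumes the kafka-python package is installed and that the topic name (myTopic), broker address (localhost:9092), and CSV format match the Kafka source DDL defined later; the values sent are arbitrary test data.

    import time
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    # the source table reads CSV with a single FLOAT column, so each message
    # is one number serialized as text, e.g. b'0.2'
    for x in [0.1, 0.2, 0.3, 0.15, 0.25]:
        producer.send('myTopic', str(x).encode('utf-8'))
        time.sleep(1)  # send one record per second to simulate a stream

    producer.flush()
    producer.close()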

    2. Versions

    Python: 3.8.8

    PyFlink: 1.13.0

    Scikit-learn: 0.24.1

    3. Code Walkthrough

    1. Train a model and save it

    clf = DecisionTreeClassifier()
    clf.fit(X, y)
    
    with open('model.pickle', 'wb') as f:
        pickle.dump(clf, f)

    2. Create the Flink streaming environment

    env = StreamExecutionEnvironment.get_execution_environment()
    env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=env_settings)

    3. Create the Kafka source table

    kafka_source_ddl = """ 
        create table kafka_source ( 
         X FLOAT
        ) with ( 
          'connector' = 'kafka', 
          'topic' = 'myTopic',
          'properties.bootstrap.servers' = 'localhost:9092',
          'properties.group.id' = 'myGroup',
          'scan.startup.mode' = 'earliest-offset',
          'format' = 'csv'
        )
    """
    t_env.execute_sql(kafka_source_ddl)
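
    Note that the Kafka SQL connector is not bundled with PyFlink, so the connector JAR must be visible to the job before this DDL can be used. A minimal sketch of one way to register it (the file path and the exact artifact name/version are placeholders, not from the original article):

    # add the Kafka SQL connector JAR to the job's classpath;
    # the path and artifact name below are illustrative assumptions
    t_env.get_config().get_configuration().set_string(
        'pipeline.jars',
        'file:///path/to/flink-sql-connector-kafka_2.11-1.13.0.jar')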

    4. Load the saved model and define a UDF for prediction

    with open('model.pickle', 'rb') as f:
        clf = pickle.load(f)
    
    @udf(input_types=DataTypes.FLOAT(), result_type=DataTypes.FLOAT())
    def predict(X):
        # reshape the single value into the (n_samples, n_features) shape sklearn expects
        X = pd.Series([X]).values.reshape(-1, 1)
        y_pred = clf.predict(X)
        # predict() returns an array; return the single prediction as a Python float
        return float(y_pred[0])
    
    t_env.create_temporary_function('predict', predict)
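
    Because predict is registered as a temporary function, it can also be invoked directly from SQL, which is the point made at the top about SQL supporting AI models. A small sketch, equivalent to the Table API query in the next step:

    # alternative: call the registered UDF from plain SQL
    sql_result = t_env.sql_query('SELECT X, predict(X) AS y_pred FROM kafka_source')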

    5. Read the Kafka data, predict, and print the results

    result = t_env.from_path('kafka_source').select('X, predict(X) as y_pred')
    data = t_env.to_append_stream(result, Types.ROW([Types.FLOAT(), Types.FLOAT()]))
    data.print()
    
    env.execute('stream predict job')

    4. Complete Code

    Model training and saving code: model.py

    import pickle
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    
    X = pd.Series([0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.3, 0.3, 0.3]).values.reshape(-1, 1)
    y = pd.Series([0, 0, 0, 1, 1, 1, 2, 2, 2])
    
    clf = DecisionTreeClassifier()
    clf.fit(X, y)
    
    with open('model.pickle', 'wb') as f:
        pickle.dump(clf, f)

    Streaming prediction code: stream_predict.py

    import pickle
    import pandas as pd
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.table import EnvironmentSettings, StreamTableEnvironment, DataTypes
    from pyflink.table.udf import udf
    from pyflink.common.typeinfo import Types
    
    env = StreamExecutionEnvironment.get_execution_environment()
    env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=env_settings)
    
    kafka_source_ddl = """ 
        create table kafka_source ( 
         X FLOAT
        ) with ( 
          'connector' = 'kafka', 
          'topic' = 'myTopic',
          'properties.bootstrap.servers' = 'localhost:9092',
          'properties.group.id' = 'myGroup',
          'scan.startup.mode' = 'earliest-offset',
          'format' = 'csv'
        )
    """
    t_env.execute_sql(kafka_source_ddl)
    
    with open('model.pickle', 'rb') as f:
        clf = pickle.load(f)
    
    @udf(input_types=DataTypes.FLOAT(), result_type=DataTypes.FLOAT())
    def predict(X):
        # reshape the single value into the (n_samples, n_features) shape sklearn expects
        X = pd.Series([X]).values.reshape(-1, 1)
        y_pred = clf.predict(X)
        # predict() returns an array; return the single prediction as a Python float
        return float(y_pred[0])
    
    t_env.create_temporary_function('predict', predict)
    
    result = t_env.from_path('kafka_source').select('X, predict(X) as y_pred')
    data = t_env.to_append_stream(result, Types.ROW([Types.FLOAT(), Types.FLOAT()]))
    data.print()
    
    env.execute('stream predict job')
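
    To try the example end to end: start Kafka, create the myTopic topic, run model.py once to produce model.pickle, then run stream_predict.py and feed CSV-formatted float values into myTopic (for example with the producer sketch from the overview); each incoming value and its prediction should then be printed to the console.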
  • Original article: https://www.cnblogs.com/bonelee/p/15871458.html