PyFlink + Sklearn: Machine Learning Prediction on Streaming Data
Source: https://zhuanlan.zhihu.com/p/372421714
1. Overview
The inspiration comes from Alink, Alibaba's unified batch/stream machine learning framework; see the Alink write-ups for background.
The overall approach is to process the stream with PyFlink, define a user-defined function (UDF) in PyFlink, and call Scikit-learn inside that UDF, so that each record is predicted as soon as it arrives.
The concrete steps are:
- Train a model with sklearn and save it to disk;
- Send streaming data through Kafka (a minimal producer sketch follows this list);
- Receive the Kafka records in Flink;
- Define a UDF that loads the saved model and predicts on each received record;
- Print the prediction results.
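The original article leaves the producer side implicit. As a minimal sketch, assuming the kafka-python client (any Kafka client would do) and the broker/topic settings used later in this article, a test feed could look like this:
# Hypothetical test producer, not part of the original article: writes one
# CSV-formatted float per message to the topic the Flink job reads from.
import time
from kafka import KafkaProducer  # pip install kafka-python
producer = KafkaProducer(bootstrap_servers='localhost:9092')
for x in [0.1, 0.15, 0.25, 0.3]:
    producer.send('myTopic', str(x).encode('utf-8'))
    time.sleep(1)  # pace the stream so the per-record predictions are visible
producer.flush()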
2. Versions
Python: 3.8.8
PyFlink: 1.13.0
Scikit-learn: 0.24.1
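For reference, PyFlink is published on PyPI as the apache-flink package, so a matching environment can be set up with pip install apache-flink==1.13.0 scikit-learn==0.24.1 (plus kafka-python if you use the producer sketch above).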
3. Code Walkthrough
Step 1: Train a model and save it
clf = DecisionTreeClassifier()
clf.fit(X, y)
with open('model.pickle', 'wb') as f:
    pickle.dump(clf, f)
Step 2: Create the Flink streaming environment
env = StreamExecutionEnvironment.get_execution_environment()
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = StreamTableEnvironment.create(env, environment_settings=env_settings)
Step 3: Create the Kafka source table
kafka_source_ddl = """
    create table kafka_source (
        X FLOAT
    ) with (
        'connector' = 'kafka',
        'topic' = 'myTopic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'myGroup',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'csv'
    )
"""
t_env.execute_sql(kafka_source_ddl)
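Note that the Kafka SQL connector is not bundled with PyFlink, so the connector jar has to be on the job's classpath before this DDL can run. A minimal sketch via pipeline.jars, where the path and jar version are placeholders to adapt to your install:
# Placeholder path/version: point this at the flink-sql-connector-kafka jar
# matching your Flink version.
t_env.get_config().get_configuration().set_string(
    'pipeline.jars',
    'file:///path/to/flink-sql-connector-kafka_2.11-1.13.0.jar')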
Step 4: Load the saved model and define a prediction UDF
with open('model.pickle', 'rb') as f:
    clf = pickle.load(f)
@udf(input_types=[DataTypes.FLOAT()], result_type=DataTypes.FLOAT())
def predict(X):
    X = pd.Series([X]).values.reshape(-1, 1)
    # clf.predict returns an array; unwrap the single element and cast to
    # float so the value matches the declared FLOAT result type
    return float(clf.predict(X)[0])
t_env.create_temporary_function('predict', predict)
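Calling clf.predict once per row is simple but forgoes sklearn's vectorization. Since PyFlink 1.12 a UDF can instead be declared with func_type='pandas', in which case it receives a whole pd.Series per batch; here is a sketch of that variant (predict_batch is a name introduced here, not from the original article):
@udf(input_types=[DataTypes.FLOAT()], result_type=DataTypes.FLOAT(), func_type='pandas')
def predict_batch(X):
    # X is a pd.Series covering the whole batch; predict once and
    # return a same-length Series of floats
    return pd.Series(clf.predict(X.values.reshape(-1, 1)), dtype='float64')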
Step 5: Read the Kafka data, predict, and print the results
result = t_env.from_path('kafka_source').select(col('X'), call('predict', col('X')).alias('y_pred'))
data = t_env.to_append_stream(result, Types.ROW([Types.FLOAT(), Types.FLOAT()]))
data.print()
env.execute('stream predict job')
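If the goal is only to inspect the results, converting back to a DataStream is optional: the Table can be executed and printed directly on the client, in which case the final env.execute call is not needed. A sketch:
# Blocks and prints rows on the client as they arrive from the job
t_env.from_path('kafka_source').select(col('X'), call('predict', col('X')).alias('y_pred')).execute().print()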
4. Complete Code
Model training code, model.py
import pickle
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
X = pd.Series([0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.3, 0.3, 0.3]).values.reshape(-1, 1)
y = pd.Series([0, 0, 0, 1, 1, 1, 2, 2, 2])
clf = DecisionTreeClassifier()
clf.fit(X, y)
with open('model.pickle', 'wb') as f:
    pickle.dump(clf, f)
Streaming prediction code, stream_predict.py
import pickle
import pandas as pd
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment, DataTypes
from pyflink.table.expressions import col, call
from pyflink.table.udf import udf
from pyflink.common.typeinfo import Types
env = StreamExecutionEnvironment.get_execution_environment()
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = StreamTableEnvironment.create(env, environment_settings=env_settings)
kafka_source_ddl = """
    create table kafka_source (
        X FLOAT
    ) with (
        'connector' = 'kafka',
        'topic' = 'myTopic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'myGroup',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'csv'
    )
"""
t_env.execute_sql(kafka_source_ddl)
with open('model.pickle', 'rb') as f:
    clf = pickle.load(f)
@udf(input_types=[DataTypes.FLOAT()], result_type=DataTypes.FLOAT())
def predict(X):
    X = pd.Series([X]).values.reshape(-1, 1)
    # clf.predict returns an array; unwrap the single element and cast to
    # float so the value matches the declared FLOAT result type
    return float(clf.predict(X)[0])
t_env.create_temporary_function('predict', predict)
result = t_env.from_path('kafka_source').select(col('X'), call('predict', col('X')).alias('y_pred'))
data = t_env.to_append_stream(result, Types.ROW([Types.FLOAT(), Types.FLOAT()]))
data.print()
env.execute('stream predict job')
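To run the example end to end: start Kafka, run model.py once to produce model.pickle, launch stream_predict.py, and feed floats into myTopic (for example with the producer sketch from the overview). Each value arriving on the stream is then printed together with its predicted class.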