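Different tools
Among the common machine-learning tools, R and Python are typically used for general data mining and statistical analysis, while Flink and Spark are used when the data volume is large.
Knowing how to use these tools makes it possible to explore data and implement solutions on it.
This article follows the Python data-mining and machine-learning workflow and compares it with the machine-learning packages of Spark and Flink, so that becoming familiar with one of them makes the others easy to pick up by analogy.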
Python
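General workflow: acquire data -> preprocess data -> train the model -> evaluate the model -> predict / classify
scikit-learn: built on NumPy, SciPy, and matplotlib
The pipeline mechanism wraps and manages all steps as one streamlined workflow (streaming workflows with pipelines)
It chains several algorithm models together, for example organizing feature extraction, normalization, and classification into a typical machine-learning workflow. This is an innovation in programming technique, not in algorithms.
Transformer, Estimator, Pipeline
Details
01.Transformer (StandardScaler, MinMaxScaler)
02.Estimator (LinearRegression, LogisticRegression, Lasso, Ridge)
Every machine-learning algorithm model is called an estimator
03.Pipeline: combines Transformers and an Estimator into one larger model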
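The three roles above can be sketched without any library. This is only an illustration of the idea, not scikit-learn's implementation: the class names MeanStdScaler, ThresholdClassifier, and SimplePipeline are hypothetical, and the data is made up.

```python
from statistics import mean, pstdev


class MeanStdScaler:
    """Transformer (hypothetical): learns mean/std in fit, rescales in transform."""

    def fit(self, xs, ys=None):
        self.mean_ = mean(xs)
        self.std_ = pstdev(xs) or 1.0  # fall back to 1.0 for constant input
        return self

    def transform(self, xs):
        return [(x - self.mean_) / self.std_ for x in xs]


class ThresholdClassifier:
    """Estimator (hypothetical): learns a cut-off halfway between the class means."""

    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        self.threshold_ = (mean(pos) + mean(neg)) / 2
        return self

    def predict(self, xs):
        return [1 if x > self.threshold_ else 0 for x in xs]


class SimplePipeline:
    """Chains transformers and a final estimator behind one fit/predict pair."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, xs, ys):
        # Each transformer is fitted, then its output feeds the next stage.
        for t in self.transformers:
            xs = t.fit(xs, ys).transform(xs)
        self.estimator.fit(xs, ys)
        return self

    def predict(self, xs):
        # At prediction time the already-fitted transformers are only applied.
        for t in self.transformers:
            xs = t.transform(xs)
        return self.estimator.predict(xs)


model = SimplePipeline([MeanStdScaler()], ThresholdClassifier())
model.fit([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
predictions = model.predict([1.5, 8.5])
```

The caller only sees fit and predict on the combined object, which is why the pipeline can be treated as one larger model.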
pipeline
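Use Pipeline to combine data preprocessing and a model into a new composite model
Call fit and predict directly to train and predict with all the algorithm models in the pipeline
Can be combined with grid search to select parameters
Example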
eg: from sklearn.pipeline import Pipeline
Process:
Data normalization: from sklearn import preprocessing
Feature selection: from sklearn.ensemble import ExtraTreesClassifier
Applying the algorithm: from sklearn.linear_model import LogisticRegression
Tuning the algorithm's parameters: from sklearn.model_selection import GridSearchCV
One-hot encoding
Dataset splitting
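Several of the steps above (splitting, normalization, the classifier, and parameter search) can be put together as in the sketch below. The iris dataset, the step names "scale" and "clf", and the candidate values for C are assumptions chosen for illustration; feature selection and one-hot encoding are left out to keep it short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data (an assumption; any feature matrix X and label vector y would do).
X, y = load_iris(return_X_y=True)

# Dataset splitting: hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Normalization followed by the classifier, wrapped as one model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid search addresses a step's parameter as "<step name>__<parameter>".
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
```

Because the scaler sits inside the pipeline, it is refitted on each cross-validation training fold, so the validation folds do not leak into the normalization statistics.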
Model:
# Fit the model
model.fit(X_train, y_train)
# Predict with the model
model.predict(X_test)
# Get the model's parameters
model.get_params()
Saving and loading the model
import joblib
# Save the model
joblib.dump(model, 'model.pickle')
# Load the model
model = joblib.load('model.pickle')
Spark
1.Basic concepts
org.apache.spark.ml
PipelineStage
A stage in a pipeline, either an Estimator or a Transformer.
Transformer
Transforms one dataset into another.
Estimator
Estimators fit models to data.
Model
A fitted model, i.e., a Transformer produced by an Estimator.
Pipeline
A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer.
PipelineModel
object PipelineModel extends MLReadable[PipelineModel]
Parameter
Used to set the parameters of a Transformer or an Estimator
VectorAssembler
CrossValidatorModel
Params for CrossValidator and CrossValidatorModel.
Spark provides model-selection tools in the org.apache.spark.ml.tuning package; they try alternative parameter settings and compare the resulting models' outputs
2.Spark Dataset
randomSplit
Randomly splits this Dataset with the provided weights.
randomSplitAsList
Returns a Java list that contains randomly split Dataset with the provided weights.
Input: weights: Array[Double]
weights: List[Double]
Returns: Array[Dataset] or List
Example:
Downsampling the positive or negative samples (when one class has too many samples)
double[] weights = {pos_rate,1.0-pos_rate};
Dataset<Row>[] arr = posSet.randomSplit(weights);
posSet = arr[0];
Balancing the positive and negative samples
// Merge the positive and negative sample data
Dataset<Row> dataUse = dataPos_sample.union(dataNeg_sample);
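The downsample-then-merge idea above can be illustrated in plain Python. This is a rough sketch of weighted random splitting, not Spark's implementation: the function random_split, the sample rows, and the value of pos_rate are all hypothetical. As with randomSplit, the split sizes only approximate the weights.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative normalized boundaries, e.g. [0.1, 1.0] for weights [0.1, 0.9].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform draw in [0, 1)
        for i, b in enumerate(bounds):
            if u < b:
                splits[i].append(row)
                break
    return splits


# Made-up data: ten times more positive rows than negative rows.
positives = [("pos", i) for i in range(1000)]
negatives = [("neg", i) for i in range(100)]

pos_rate = 0.1  # fraction of positives to keep (an assumption)
pos_sample, pos_rest = random_split(positives, [pos_rate, 1.0 - pos_rate], seed=42)

# List concatenation plays the role of Dataset.union here.
balanced = pos_sample + negatives
```

Only the first split is kept, which mirrors taking arr[0] in the Java snippet.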
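// Define each PipelineStage in the Pipeline, such as feature extraction, transformation, and model training.
With Transformers and Estimators that each handle a specific task,
we can arrange the PipelineStages in order according to the processing logic and create a Pipeline.
Each stage is either a Transformer or an Estimator.
The stages run in order, and the input DataFrame is transformed as it passes through each stage.
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{stage1, stage2, stage3, …});
The training dataset is then passed to the Pipeline instance's fit method, which runs the training data through the stages in sequence.
// Build a Pipeline from the array of stages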
Pipeline pipeline = new Pipeline().setStages(pipeArr);
PipelineModel model = pipeline.fit(train_data);
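Load the model: PipelineModel model2 = PipelineModel.load(path);
Approach: use CrossValidator to find the best model parameters, i.e. model selection through cross-validation
// setEstimatorParamMaps(...) and setEvaluator(...) must also be configured before calling fit
CrossValidator rf_cv = new CrossValidator().setEstimator(pipeline);
CrossValidatorModel rf_model = rf_cv.fit(train_data);
Load the model: CrossValidatorModel rf_model2 = CrossValidatorModel.load(path);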
eg: // Chain indexers and tree in a Pipeline.
Pipeline pipeline = new Pipeline()
.setStages(new PipelineStage[]{labelIndexer, featureIndexer, dt, labelConverter});
Flink
1.Flink ML
PipelineStage
Base class for a stage in a pipeline; it does not have any actual functionality.
Its subclasses must be either Estimator or Transformer.
Transformer
A transformer is a PipelineStage that transforms an input Table to a result Table.
Estimator
Estimators are PipelineStages responsible for training and generating machine learning models.
Model
A model is an ordinary Transformer except how it is created.
Pipeline
A pipeline is a linear workflow which chains Estimators and Transformers to execute an algorithm.
A pipeline can also be used as a PipelineStage in another pipeline.
Params WithParams ParamInfoFactory ParamInfo
2.Alink
com.alibaba.alink.pipeline
Pipeline
A pipeline is a linear workflow which chains EstimatorBases and TransformerBases to execute an algorithm.
public class Pipeline extends EstimatorBase<Pipeline, PipelineModel>
PipelineModel
public class PipelineModel extends ModelBase<PipelineModel> implements LocalPredictable {
PipelineStageBase
The base class for a stage in a pipeline, either an EstimatorBase or a TransformerBase.
EstimatorBase
public abstract class EstimatorBase<E extends EstimatorBase<E, M>, M extends ModelBase<M>> extends PipelineStageBase<E> implements Estimator<E, M>
TransformerBase
public abstract class TransformerBase<T extends TransformerBase<T>> extends PipelineStageBase<T> implements Transformer<T>
VectorAssembler
VectorAssembler is a transformer that combines a given list of columns
References
Source code