• Spark ML Pipeline Study Notes



     
    A Spark ML Pipeline
    is a high-level API built on DataFrames that makes it easy to construct and tune machine learning pipelines,
    chaining multiple machine learning algorithms to run in sequence for efficient data processing.
     
    A DataFrame is the ML dataset type from Spark SQL; it can hold a variety of data types such as text, feature vectors, labels, and predictions.
     
    Transformer: an algorithm that converts one DataFrame into another DataFrame, by implementing a transform() method.
    Estimator: an algorithm that is fit on a DataFrame to produce a Transformer, by implementing a fit() method.
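
    For instance (a minimal sketch with a throwaway DataFrame), a Tokenizer is a Transformer: its transform() appends a new column. The LogisticRegression examples later in these notes show the Estimator side, where fit() returns a Model that is itself a Transformer.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(0, "spark ml pipeline")], ["id", "text"])

    # A Transformer: transform() maps one DataFrame to another,
    # here appending a "words" column of tokens.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tokenizer.transform(df).show()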
     
    Pipeline: chains multiple Transformers and Estimators together into a single ML workflow.
    Parameter: Transformers and Estimators share a common API for declaring parameters.
     
    Transformers and Estimators are the stages of a Pipeline.
    A Pipeline is a workflow made of a sequence of stages run in the declared order.
     
    Transformer.transform() and Estimator.fit() are both stateless.
    Every instance of a Transformer or Estimator has a unique ID, which is useful when specifying parameters.
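
    A quick sketch of those IDs (the printed suffix is illustrative): each instance exposes its ID through the uid attribute.

    from pyspark.ml.classification import LogisticRegression

    lr1 = LogisticRegression()
    lr2 = LogisticRegression()
    print(lr1.uid)              # e.g. LogisticRegression_a1b2c3 (suffix varies)
    print(lr1.uid == lr2.uid)   # False: every instance gets its own unique ID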
     
    A typical Pipeline is linear: each stage consumes the output of the previous one.
    If the data flow instead forms a Directed Acyclic Graph (DAG),
    a non-linear Pipeline can be built, with the stages listed in topological order.
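
    A sketch of such a DAG-shaped Pipeline (the "text" and "age" input columns are hypothetical): two feature branches are computed independently and then merged by a VectorAssembler before the classifier.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="textFeatures")
    # Merge the hashed text features with a raw numeric "age" column.
    assembler = VectorAssembler(inputCols=["textFeatures", "age"], outputCol="features")
    lr = LogisticRegression(maxIter=10)

    # Stages must be listed in topological order of the data-flow DAG.
    pipeline = Pipeline(stages=[tokenizer, hashingTF, assembler, lr])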
     
    Runtime checking: because Pipelines can operate on DataFrames with varied column types,
    compile-time type checking is not possible.
    Pipelines and PipelineModels instead perform runtime checking,
    using the DataFrame schema, which describes the data types of the DataFrame's columns.
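
    The schema can be inspected directly. For the training DataFrame built in the first example below, printSchema() prints roughly the following (a sketch; exact formatting may vary by Spark version):

    training.printSchema()
    # root
    #  |-- label: double (nullable = true)
    #  |-- features: vector (nullable = true)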
     
    Unique Pipeline stages: a Pipeline's stages should be unique instances, each with its own unique ID.
     
    A Param is a named parameter with self-contained documentation.
    A ParamMap is a set of (parameter, value) pairs.
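
    The self-contained documentation of a single Param can be printed with explainParam(); in PySpark a ParamMap is simply a Python dict. A quick sketch:

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression()
    # Prints the param's name, documentation, and its default/current value.
    print(lr.explainParam("maxIter"))

    # A ParamMap: a dict of (Param, value) pairs.
    paramMap = {lr.maxIter: 20, lr.regParam: 0.1}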
     
    There are two main ways to pass parameters to an algorithm:
    1. Set parameters on an instance; e.g., if lr is an instance of LogisticRegression, calling lr.setMaxIter(10) makes lr.fit() use at most 10 iterations.
    2. Pass a ParamMap to fit() or transform(); any parameter in the map overrides the value previously set through setter methods.
     
    It is often worthwhile to save a Pipeline to disk for later use.
    As of Spark 1.6, model import/export functionality was added to the Pipeline API.
    Most Transformers and some ML Models support I/O.
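
    A minimal persistence sketch (the paths are hypothetical; pipeline and model refer to the objects built in the Pipeline example at the end of these notes):

    from pyspark.ml import Pipeline, PipelineModel

    # Save an unfitted Pipeline and a fitted PipelineModel to disk.
    pipeline.write().overwrite().save("/tmp/unfit-lr-pipeline")
    model.write().overwrite().save("/tmp/fitted-lr-model")

    # Load them back later.
    samePipeline = Pipeline.load("/tmp/unfit-lr-pipeline")
    sameModel = PipelineModel.load("/tmp/fitted-lr-model")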
     
    Below are some examples of basic operations on these components:
     
     
    from pyspark.sql import SparkSession
    # Import vectors and the logistic-regression estimator
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("PipelineBasics").getOrCreate()

    # Prepare training data from a list of (label, features) tuples.
    training = spark.createDataFrame([
        (1.0, Vectors.dense([0.0, 1.1, 0.1])),
        (0.0, Vectors.dense([2.0, 1.0, -1.0])),
        (0.0, Vectors.dense([2.0, 1.3, 1.0])),
        (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

    # Create a LogisticRegression instance. This instance is an Estimator.
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    # Print out the parameters, documentation, and any default values.
    print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

    # Learn a LogisticRegression model. This uses the parameters stored in lr.
    model1 = lr.fit(training)

    # Since model1 is a Model (i.e., a Transformer produced by an Estimator),
    # we can view the parameters it used during fit().
    # This prints the (name: value) parameter pairs, where the names are
    # unique IDs for this LogisticRegression instance.
    print("Model 1 was fit using parameters: ")
    print(model1.extractParamMap())

    # We may alternatively specify parameters using a Python dictionary as a paramMap.
    paramMap = {lr.maxIter: 20}
    paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
    paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.

    # You can combine paramMaps, which are Python dictionaries.
    paramMap2 = {lr.probabilityCol: "myProbability"}  # Change the output column name.
    paramMapCombined = paramMap.copy()
    paramMapCombined.update(paramMap2)

    # Now learn a new model using the paramMapCombined parameters.
    # paramMapCombined overrides all parameters set earlier via lr.set* methods.
    model2 = lr.fit(training, paramMapCombined)
    print("Model 2 was fit using parameters: ")
    print(model2.extractParamMap())

    # Prepare test data.
    test = spark.createDataFrame([
        (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
        (0.0, Vectors.dense([3.0, 2.0, -0.1])),
        (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

    # Make predictions on the test data using the Transformer.transform() method.
    # LogisticRegression.transform will only use the 'features' column.
    # Note that model2.transform() outputs a 'myProbability' column instead of the usual
    # 'probability' column since we renamed the lr.probabilityCol parameter previously.
    prediction = model2.transform(test)
    selected = prediction.select("features", "label", "myProbability", "prediction")
    for row in selected.collect():
        print(row)
     
     
    Below is a complete Pipeline example:
     
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer

    spark = SparkSession.builder.appName("PipelineExample").getOrCreate()

    # Prepare training documents from a list of (id, text, label) tuples.
    training = spark.createDataFrame([
        (0, "a b c d e spark", 1.0),
        (1, "b d", 0.0),
        (2, "spark f g h", 1.0),
        (3, "hadoop mapreduce", 0.0)], ["id", "text", "label"])

    # Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

    # Fit the pipeline to the training documents.
    model = pipeline.fit(training)

    # Prepare test documents, which are unlabeled (id, text) tuples.
    test = spark.createDataFrame([
        (4, "spark i j k"),
        (5, "l m n"),
        (6, "mapreduce spark"),
        (7, "apache hadoop")], ["id", "text"])

    # Make predictions on the test documents and print the columns of interest.
    prediction = model.transform(test)
    selected = prediction.select("id", "text", "prediction")
    for row in selected.collect():
        print(row)






    Put your heart into everything; strive for excellence.


  • Original post: https://www.cnblogs.com/kongchung/p/5776727.html