• Spark / PySpark: implementations of common algorithms


    Code for clustering, classification, and regression analysis with Spark MLlib (Python)

    http://www.cnblogs.com/adienhsuan/p/5654481.html

    Sparse vectors:

    Basic data structures of Spark MLlib (Spark-MLlib-Basics)

    http://blog.csdn.net/canglingye/article/details/41316193

    On regularization terms: http://www.itnose.net/detail/6266100.html

    Precision and recall: http://f.dataguru.cn/thread-707310-1-1.html
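
    As a quick reminder of what the linked article covers, here is a minimal pure-Python sketch of precision and recall for binary labels. The function name `precision_recall` and the sample data are made up for illustration and are not taken from the linked page or from Spark.

```python
def precision_recall(labels, predictions):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    # Guard against division by zero when nothing is predicted or labeled positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: 3 spam (1) and 3 normal (0) emails
labels = [1, 1, 1, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0]
precision, recall = precision_recall(labels, predictions)
print("precision=%.3f recall=%.3f" % (precision, recall))
```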

    Machine learning: http://www.cnblogs.com/Leo_wl/p/5544239.html

    [Important] Regularization and norms in detail: http://www.fuqingchuan.com/2015/03/500.html

    http://www.cnblogs.com/tovin/p/3816289.html

    Computing AUC: http://www.toutiao.com/i6259948874706715138/
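
    For orientation, AUC can be read as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python. It is a simple O(n²) illustration with made-up data and a hypothetical function name `auc`, not the method from the linked article.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores from some classifier
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc_value = auc(labels, scores)
print("AUC = %.2f" % auc_value)
```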

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.feature import HashingTF
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    # Reuse the running SparkContext (e.g. in the pyspark shell) or create one
    sc = SparkContext.getOrCreate()

    spam = sc.textFile("spam.txt")
    normal = sc.textFile("normal.txt")

    # Create a HashingTF instance that maps email text to vectors of 10000 features
    tf = HashingTF(numFeatures=10000)
    # Each email is split into words, and each word is mapped to one feature
    spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
    normalFeatures = normal.map(lambda email: tf.transform(email.split(" ")))

    # Create LabeledPoint datasets for the positive (spam) and negative (normal) examples
    positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
    negativeExamples = normalFeatures.map(lambda features: LabeledPoint(0, features))
    trainingData = positiveExamples.union(negativeExamples)
    trainingData.cache()  # logistic regression is iterative, so cache the training RDD

    # Run logistic regression using the SGD algorithm
    model = LogisticRegressionWithSGD.train(trainingData)

    # Test on one positive (spam) and one negative (normal) example
    posTest = tf.transform("O M G GET cheap stuff by sending money to...".split(" "))
    negTest = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
    print("Prediction for positive test example: %g" % model.predict(posTest))
    print("Prediction for negative test example: %g" % model.predict(negTest))
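
    The listing above relies on HashingTF to turn words into feature indices. The sketch below shows the idea behind this hashing trick in plain Python, without Spark. It is an illustration only: the helper name `hashing_tf` is hypothetical, and `zlib.crc32` is used as a stand-in hash, so the indices will not match the ones Spark's HashingTF produces.

```python
import zlib


def hashing_tf(tokens, num_features=10000):
    """Map tokens to a sparse term-frequency dict {feature_index: count}."""
    counts = {}
    for token in tokens:
        # Stand-in hash (assumption): Spark uses its own hash function internally
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


features = hashing_tf("get cheap stuff get money".split(" "))
# Only the indices that actually occur are stored, as in a sparse vector
for index, count in sorted(features.items()):
    print(index, count)
```

    Different words can collide on the same index. A larger `num_features` makes collisions rarer at the cost of a longer (still sparse) vector.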

    General approach to the algorithms:

  • Original article: https://www.cnblogs.com/zhangbojiangfeng/p/5994284.html