COMP9313 Week8 Classification and PySpark MLlib

https://drive.google.com/drive/folders/13_vsxSIEU9TDg1TCjYEwOidh0x3dU6es

https://www.cse.unsw.edu.au/~cs9313/20T2/slides/L7.pdf

Machine Learning:

　　1. Construct a model, then use it to predict on new data

　　2.

Evaluation Metrics:

　　Positive/Negative: Label ∈ {a,b,c,d}; choose a as the positive class, and all other labels are negative

　　False Positive: not a but classified as a

　　False Negative: a but classified as b or c or d

　　True Positive : a and classified as a

　　

　　Precision = tp / (tp + fp)

　　Recall = tp / (tp + fn)

　　F1 = 2 * precision * recall / (precision + recall)

　　Micro: treat each instance's true label as positive, pool tp/fp/fn over all class labels, then compute F1

　　Macro: mean of F1 of each class label
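The definitions above can be checked on a small example. This is a minimal sketch in plain Python, not taken from the lecture; the function names and the toy labels are made up for illustration, and returning 0 when a denominator is 0 is an assumed convention.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)}, treating each label in turn as positive."""
    counts = {}
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # c and classified as c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # not c but classified as c
        fn = sum(1 for t, p in pairs if t == c and p != c)  # c but classified as other
        counts[c] = (tp, fp, fn)
    return counts


def prf(tp, fp, fn):
    """Precision, recall and F1; 0.0 when a denominator is 0 (assumed convention)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def macro_f1(counts):
    """Mean of the per-class F1 scores."""
    return sum(prf(*c)[2] for c in counts.values()) / len(counts)


def micro_f1(counts):
    """F1 computed from tp/fp/fn pooled over all classes."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)[2]


# Toy data (hypothetical): six instances, four class labels.
y_true = ["a", "a", "b", "c", "d", "a"]
y_pred = ["a", "b", "b", "c", "a", "a"]
counts = per_class_counts(y_true, y_pred, ["a", "b", "c", "d"])
macro = macro_f1(counts)
micro = micro_f1(counts)
print(counts["a"], round(macro, 4), round(micro, 4))
```

With one label per instance, the pooled fp and fn totals are equal, so micro F1 coincides with accuracy; macro F1 weights every class equally and is pulled down by the class d that is never predicted.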

Classification:

　　1. Preprocessing and Feature Engineering

　　　　1) bag of words

　　　　2) remove high-frequency words

　　　　

　　2. Train classifier

　　3. Evaluate the classifier

　　　　1） split a 'development set' from the training set

　　　　2) k-fold cross-validation, then take avg(accuracy)
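The three steps above can be sketched end to end in plain Python. Everything here is illustrative: the 0.8 document-frequency threshold, the interleaved fold assignment, and the majority-class "classifier" are assumptions chosen to keep the example short, not the method from the lecture.

```python
from collections import Counter


def bag_of_words(doc):
    """Map a document to word counts, ignoring word order."""
    return Counter(doc.lower().split())


def drop_high_frequency(bags, max_doc_ratio=0.8):
    """Remove words appearing in more than max_doc_ratio of the documents."""
    doc_freq = Counter(w for bag in bags for w in bag)  # each word once per document
    too_common = {w for w, df in doc_freq.items() if df / len(bags) > max_doc_ratio}
    cleaned = [Counter({w: c for w, c in bag.items() if w not in too_common})
               for bag in bags]
    return cleaned, too_common


def k_fold_indices(n, k):
    """Yield (train_idx, dev_idx); fold i holds out items i, i+k, i+2k, ..."""
    for i in range(k):
        dev = list(range(i, n, k))
        held_out = set(dev)
        yield [j for j in range(n) if j not in held_out], dev


def cross_validate(X, y, k, train_fn, predict_fn):
    """Train on k-1 folds, test on the held-out fold, return the mean accuracy."""
    accuracies = []
    for train, dev in k_fold_indices(len(X), k):
        model = train_fn([X[j] for j in train], [y[j] for j in train])
        correct = sum(predict_fn(model, X[j]) == y[j] for j in dev)
        accuracies.append(correct / len(dev))
    return sum(accuracies) / k


# Placeholder classifier (assumption): always predict the most frequent training label.
def train_majority(X, y):
    return Counter(y).most_common(1)[0][0]


def predict_majority(model, x):
    return model


docs = ["the cat sat", "the dog ran", "the cat ran",
        "the dog sat", "the bird flew", "the cat flew"]
labels = ["pos", "pos", "pos", "pos", "neg", "pos"]

bags, too_common = drop_high_frequency([bag_of_words(d) for d in docs])
avg_accuracy = cross_validate(bags, labels, 2, train_majority, predict_majority)
print(too_common, avg_accuracy)
```

In practice the data would usually be shuffled before being split into folds, and `train_fn` / `predict_fn` would wrap a real classifier.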

　　　　　　

Text Classification:

　　1. Input: document or sentence

　　2. Output: class label C ∈ {c1, c2, … }

　　3. Classification methods:

　　　　•Naïve Bayes

　　　　•Logistic regression

　　　　•Support-vector machines

　　　　•…

　　4. Naïve Bayes

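　　　　1) bag of words -> the features become a d-dimensional vector, with label c

　　　　2) maximum a posteriori (MAP) probability

　　　　3) assume conditional independence; assume word position does not matter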

　　　　4）
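The Naïve Bayes points above can be sketched as a small multinomial classifier. This is a hedged illustration, not the lecture's code: add-one (Laplace) smoothing, ignoring words outside the training vocabulary, and the toy documents are all assumptions. Log probabilities are used so that the product of many small per-word probabilities does not underflow.

```python
import math
from collections import Counter, defaultdict


def vectorize(doc, vocab):
    """Bag of words as a d-dimensional count vector, d = len(vocab)."""
    bag = Counter(doc.split())
    return [bag[w] for w in vocab]


def train_nb(docs, labels):
    """Estimate log P(c) and log P(w | c) with add-one smoothing (assumption)."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_sizes = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, c in zip(docs, labels):
        word_counts[c].update(d.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_sizes.items()}
    log_lik = {}
    for c in class_sizes:
        total = sum(word_counts[c].values())
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return vocab, log_prior, log_lik


def predict_nb(model, doc):
    """MAP decision: argmax over c of log P(c) + sum_i x_i * log P(w_i | c)."""
    vocab, log_prior, log_lik = model
    x = vectorize(doc, vocab)  # words outside vocab are dropped (assumption)
    scores = {c: log_prior[c] + sum(n * log_lik[c][w] for n, w in zip(x, vocab))
              for c in log_prior}
    return max(scores, key=scores.get)


# Toy training set (hypothetical).
train_docs = ["good great fun", "great acting good",
              "boring bad plot", "bad awful boring"]
train_labels = ["pos", "pos", "neg", "neg"]
model = train_nb(train_docs, train_labels)
print(vectorize("good good plot", model[0]))
print(predict_nb(model, "good fun plot"), predict_nb(model, "awful boring"))
```

The conditional-independence assumption is what lets the score be a plain sum over words, and the position-independence assumption is what lets a document be reduced to its count vector.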

　　　　

　　

　　

　　

PySpark MLlib:

　　
Original post: https://www.cnblogs.com/ChevisZhang/p/13344371.html