Machine Learning Programming

1. Balancing data (imblearn)

Over-sampling: RandomOverSampler duplicates samples from the minority class; alternatively, SMOTE generates synthetic minority samples.

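Under-sampling: under-sample the majority class several times and combine the resulting estimators, or follow the boosting idea and stop putting correctly classified samples back into the majority pool.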
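
The duplication idea behind random over-sampling can be sketched with NumPy alone. This is a hand-rolled illustration, not imblearn's implementation; in imblearn the equivalent call is `RandomOverSampler().fit_resample(X, y)`. The function name `random_oversample` and the toy data are assumptions made for the example.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw extra row indices with replacement; the largest class draws none.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Toy imbalanced data: 8 samples of class 0, 2 samples of class 1.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_res, y_res = random_oversample(X, y)
print(np.bincount(y_res))
```

Every original row is kept, so only the minority class gains (duplicated) rows.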

2. GridSearchCV: parameter = {'C': np.linspace(10, 1, num=10)}
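
A minimal sketch of how such a grid might be used. The estimator (a linear `SVC`), the synthetic data, `cv=3` and the accuracy scoring are assumptions; the notes only give the grid for `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data stands in for the real training set.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Ten evenly spaced candidates for C: 10, 9, ..., 1.
param_grid = {"C": np.linspace(10, 1, num=10)}

search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```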

3. ROC curve

Plotting a ROC curve needs a continuous score for each sample, obtained here from decision_function()

from sklearn.metrics import roc_curve

y_pred_score = model.decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, y_pred_score)

decision_function expresses confidence as the distance of a sample from the separating hyperplane.
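
The ROC steps above, put together as a runnable sketch. The model (a linear `SVC`) and the synthetic data are assumptions; any classifier that exposes `decision_function` would fit the same pattern.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data stands in for the real dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SVC(kernel="linear").fit(X_train, y_train)

# Signed distance to the hyperplane, used as the ranking score.
y_pred_score = model.decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, y_pred_score)
print("AUC:", auc(fpr, tpr))
```

To draw the curve, `fpr` and `tpr` can be passed to a plotting call such as matplotlib's `plt.plot(fpr, tpr)`.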

1. Counter(): counts word frequencies
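
A short sketch of word counting with `collections.Counter`. The sample sentence and the whitespace tokenization are assumptions.

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"

# Lower-case, split on whitespace, then count each token.
word_counts = Counter(text.lower().split())
print(word_counts.most_common(1))
```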

2. feature_extraction.text.CountVectorizer: feature extraction that turns text into a token-count matrix (a 0/1 matrix with binary=True)
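
A small sketch showing the difference between the default count matrix and the 0/1 matrix. The three toy documents are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free money now", "meeting at noon", "free free lunch"]

# Default: each cell holds how many times the token occurs in the document.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# binary=True: each cell only records presence (1) or absence (0).
binary_vec = CountVectorizer(binary=True)
binary = binary_vec.fit_transform(docs)

free_col = count_vec.vocabulary_["free"]
print(counts.toarray()[2, free_col], binary.toarray()[2, free_col])
```

Both calls return sparse matrices; `toarray()` is used here only to inspect a tiny example.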

3. Naive Bayes: GaussianNB, MultinomialNB, BernoulliNB
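
The three variants differ in the feature distribution they assume: GaussianNB for continuous values, MultinomialNB for non-negative counts, BernoulliNB for 0/1 indicators. A sketch on synthetic data follows; the data generation is entirely an assumption, and scores are computed on the training data for brevity.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# One synthetic feature matrix per assumed feature type.
X_real = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(100, 3))  # continuous
X_count = rng.poisson(lam=1 + 3 * y[:, None], size=(100, 3))         # counts
X_binary = (X_count > 2).astype(int)                                  # 0/1 flags

scores = {
    "GaussianNB": GaussianNB().fit(X_real, y).score(X_real, y),
    "MultinomialNB": MultinomialNB().fit(X_count, y).score(X_count, y),
    "BernoulliNB": BernoulliNB().fit(X_binary, y).score(X_binary, y),
}
for name, acc in scores.items():
    print(f"{name}: training accuracy {acc:.2f}")
```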

1. One-hot encoding / LabelEncoder encoding
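
A sketch of the two encodings on one categorical column. The column name `color` and its values are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One-hot: one indicator column per category.
one_hot = pd.get_dummies(colors, prefix="color")

# Label encoding: one integer per category, assigned in sorted order.
encoder = LabelEncoder()
labels = encoder.fit_transform(colors)

print(list(one_hot.columns))
print(list(encoder.classes_), list(labels))
```

Label encoding imposes an arbitrary order on the categories, which is usually harmless for tree models but can mislead linear ones; one-hot encoding avoids that at the cost of more columns.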

2. Use a random forest to spot strong features: RandomForestClassifier()
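
One way to read off strong features is the fitted forest's `feature_importances_` attribute. A sketch on synthetic data; the dataset and the choice to show the top three features are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Sort feature indices from most to least important.
ranking = np.argsort(forest.feature_importances_)[::-1]
for idx in ranking[:3]:
    print(f"feature {idx}: {forest.feature_importances_[idx]:.3f}")
```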

3. Coarse tuning: with accuracy_score and fbeta_score as the scoring metrics, pick the best values for n_estimators, min_samples_leaf, max_depth and random_state

4. Fine tuning: continue with GridSearchCV over parameters = {'max_depth', 'n_estimators'}
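
The coarse-then-fine routine might look like the sketch below. The candidate values, `beta=0.5`, `cv=3`, the synthetic data, and the decision to scan only `n_estimators` in the coarse pass are all assumptions made to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
scorer = make_scorer(fbeta_score, beta=0.5)  # beta is an assumed value

# Coarse pass: scan one parameter over a wide range.
coarse = {}
for n in (10, 30, 60):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    coarse[n] = cross_val_score(model, X, y, cv=3, scoring=scorer).mean()
best_n = max(coarse, key=coarse.get)

# Fine pass: grid search in a narrow window around the coarse optimum.
param_grid = {
    "n_estimators": [max(best_n - 5, 5), best_n, best_n + 5],
    "max_depth": [3, 5, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=3, scoring=scorer)
search.fit(X, y)
print(best_n, search.best_params_)
```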

1. data[data.duplicated(keep=False)].sort_values(by=["user_id"])
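
A sketch of what this expression returns. The toy DataFrame and its `group` column are assumptions; `keep=False` marks every copy of a duplicated row rather than only the later ones.

```python
import pandas as pd

data = pd.DataFrame({
    "user_id": [3, 1, 3, 2, 1],
    "group": ["A", "B", "A", "A", "B"],
})

# All rows that appear more than once, ordered so that copies sit together.
dupes = data[data.duplicated(keep=False)].sort_values(by=["user_id"])
print(dupes)
```

To treat rows as duplicates whenever the user matches, regardless of other columns, `duplicated` also accepts `subset=["user_id"]`.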

2. Sample size check: 1) baseline rate 2) minimum lift to detect
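
One common way to turn a baseline rate and a minimum lift into a required sample size is the normal-approximation formula for comparing two proportions. The sketch below assumes the lift is relative (0.1 means 10% above baseline) and uses `alpha=0.05` and `power=0.8` as defaults; none of these values come from the notes, and the function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(baseline, min_lift, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided two-proportion z-test.

    baseline: conversion rate of the control group.
    min_lift: smallest relative lift worth detecting (assumed relative).
    """
    p1 = baseline
    p2 = baseline * (1 + min_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

print(sample_size_per_group(baseline=0.10, min_lift=0.10))
```

Smaller lifts require sharply larger samples, since the difference in rates enters the denominator squared.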

3. Hypothesis testing: Python's statsmodels module
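
The notes do not say which test is used. Assuming a comparison of two conversion rates, a pooled two-proportion z-test can be sketched with the standard library as below; statsmodels provides this test as `statsmodels.stats.proportion.proportions_ztest`. The function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided pooled z-test for the difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 100/1000 conversions in control, 130/1000 in treatment.
z, p_value = two_proportion_ztest(100, 1000, 130, 1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```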

Original article: https://www.cnblogs.com/jiaxinwei/p/13938419.html