Extracting bag-of-words features from text

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    analyzer='word',    # tokenize into words (not character n-grams)
    max_features=4000,  # keep only the 4000 most frequent words
)

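# Learn the bag-of-words vocabulary from x_train
vec.fit(x_train)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

# vec.transform(x_train) converts the training samples into a matrix of shape
# [n_samples, 4000] (fewer columns if the vocabulary has fewer than 4000 words)
classifier.fit(vec.transform(x_train), y_train)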
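As a minimal end-to-end sketch of the steps above, the following fits a CountVectorizer and a MultinomialNB on a tiny invented corpus and classifies one new sentence. The documents, labels, and the names `docs`, `labels`, `features`, and `prediction` are illustrative assumptions, not data from the original post, and the text is assumed to be already split into space-separated tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (space-separated tokens) and labels
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes attached",
]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer(analyzer="word", max_features=4000)
vec.fit(docs)                    # learn the vocabulary from the training text
features = vec.transform(docs)   # sparse count matrix, one row per document

classifier = MultinomialNB()
classifier.fit(features, labels)

# New text must pass through the same fitted vectorizer before prediction
prediction = classifier.predict(vec.transform(["buy cheap pills"]))[0]
print(features.shape, prediction)
```

Because the toy corpus has far fewer than 4000 distinct words, the matrix has one column per distinct word rather than 4000 columns.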

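# Add n-gram count features (2-grams and 3-grams; ngram_range=(1, 4) as written also includes 4-grams)
vec = CountVectorizer(
    analyzer='word',     # tokenize into words; n-grams are built from word tokens
    ngram_range=(1, 4),  # use n-grams of size 1 through 4
    max_features=20000,  # keep only the 20000 most frequent n-grams
)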
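To make the effect of ngram_range concrete, here is a small sketch that compares the vocabulary learned with unigrams only against the vocabulary learned with 1-grams through 3-grams. The one-sentence corpus and the variable names are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical single-document corpus
docs = ["the cat sat down"]

unigram_vec = CountVectorizer(analyzer="word", ngram_range=(1, 1))
ngram_vec = CountVectorizer(analyzer="word", ngram_range=(1, 3))

# vocabulary_ maps each term to its column index; sorting the dict yields the terms
unigram_terms = sorted(unigram_vec.fit(docs).vocabulary_)
ngram_terms = sorted(ngram_vec.fit(docs).vocabulary_)

print(unigram_terms)
print(ngram_terms)
```

The n-gram vocabulary grows quickly with the upper bound of ngram_range, which is presumably why max_features is raised from 4000 to 20000 in the snippet above.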

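Cross-validation is a more reliable way to evaluate the model, but each fold should also keep the class distribution roughly balanced, so here we use StratifiedKFold.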

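from sklearn.model_selection import StratifiedKFold  # sklearn.cross_validation was removed in scikit-learn 0.20

# x is the training data, y is the labels; with 5 folds, train_index : test_index = 4:1
stratifiedk_fold = StratifiedKFold(n_splits=n_folds, shuffle=shuffle)

# x and y must support integer-array indexing (e.g. NumPy arrays)
for train_index, test_index in stratifiedk_fold.split(x, y):
    X_train, X_test = x[train_index], x[test_index]
    y_train = y[train_index]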
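The loop above stops after splitting the data. One possible way to complete it is sketched below: each fold fits its own vectorizer and classifier on the training part and scores on the held-out part. The toy data, the fixed choice of 5 folds, `random_state=0`, and the names `skf`, `scores`, and `mean_accuracy` are assumptions made for this example, not part of the original post.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 10 documents, 5 per class
x = np.array(["buy cheap pills now"] * 5 + ["project meeting on monday"] * 5)
y = np.array([1] * 5 + [0] * 5)

# 5 folds gives a 4:1 train/test split; each test fold holds one sample per class
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_index, test_index in skf.split(x, y):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit the vectorizer on the training fold only, so the test fold stays unseen
    vec = CountVectorizer(analyzer="word", max_features=4000)
    classifier = MultinomialNB()
    classifier.fit(vec.fit_transform(x_train), y_train)

    # Accuracy on the held-out fold
    scores.append(classifier.score(vec.transform(x_test), y_test))

mean_accuracy = float(np.mean(scores))
print(scores, mean_accuracy)
```

Refitting the vectorizer inside the loop is a design choice: fitting it once on all of x would let vocabulary from the test fold influence the features used for training.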
Original article: https://www.cnblogs.com/yongfuxue/p/10118993.html