A First Look at Artificial Intelligence (2): Machine Learning (1): Feature Extraction with sklearn

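1. Feature extraction with sklearn

1.1 Installing sklearn

pip install scikit-learn -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com

If the installation finishes without errors, import the package to check that it works:

import sklearn

Note: scikit-learn depends on NumPy and SciPy, among other libraries; pip installs them automatically. pandas is not required, but it is often used alongside scikit-learn.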
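The feature-name methods of the vectorizers changed between scikit-learn releases (get_feature_names_out() was added in 1.0 and the older get_feature_names() was removed in 1.2), so it can help to confirm which release was installed. A minimal check; the version printed depends on your environment:

```python
import sklearn

# Print the installed scikit-learn release, e.g. to decide whether
# get_feature_names_out() (>= 1.0) or get_feature_names() (< 1.2) applies.
print(sklearn.__version__)
```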

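1.2 Feature extraction

Example:

# Feature extraction

# Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate CountVectorizer
vector = CountVectorizer()

# Call fit_transform to learn the vocabulary and convert the data
res = vector.fit_transform(["life is short,i like python","life is too long,i dislike python"])

# Print the results
# (get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names())
print(vector.get_feature_names_out().tolist())

print(res.toarray())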

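Running it prints the vocabulary learned from the two sentences, followed by a count matrix with one row per sentence and one column per word.

The example shows what feature extraction does: it converts data such as text into numeric feature values.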

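1.3 Dictionary feature extraction

Purpose: convert dictionary data into feature values.

Class: sklearn.feature_extraction.DictVectorizer

DictVectorizer syntax:

DictVectorizer(sparse=True,…)

DictVectorizer.fit_transform(X)
- X: a dict or an iterable of dicts
- Returns: a sparse matrix (a dense array when sparse=False)
DictVectorizer.inverse_transform(X)
- X: an array or a sparse matrix
- Returns: the data in its format before conversion
DictVectorizer.get_feature_names_out()
- Returns: the feature names (before scikit-learn 1.0 the method is get_feature_names())
DictVectorizer.transform(X)
- Converts X using the feature mapping learned during fitting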

from sklearn.feature_extraction import DictVectorizer

def dictvec():
    """
    Extract features from dictionary data
    :return: None
    """
    # Instantiate (named dv so that the built-in dict is not shadowed)
    dv = DictVectorizer()

    # Call fit_transform
    data = dv.fit_transform([{'city': '北京','temperature': 100}, {'city': '上海','temperature':60}, {'city': '深圳','temperature': 30}])

    # scikit-learn >= 1.0; older releases use get_feature_names()
    print(dv.get_feature_names_out().tolist())

    print(dv.inverse_transform(data))

    print(data)

    return None

if __name__ == "__main__":
    dictvec()

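With the default sparse=True, data is a sparse matrix, which prints as a list of (row, column) positions with their values.

Setting sparse=False returns a dense array instead, which is easier to read.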

from sklearn.feature_extraction import DictVectorizer

def dictvec():
    """
    Extract features from dictionary data
    :return: None
    """
    # Instantiate with dense output (named dv so that the built-in dict is not shadowed)
    dv = DictVectorizer(sparse=False)

    # Call fit_transform
    data = dv.fit_transform([{'city': '北京','temperature': 100}, {'city': '上海','temperature':60}, {'city': '深圳','temperature': 30}])

    # scikit-learn >= 1.0; older releases use get_feature_names()
    print(dv.get_feature_names_out().tolist())

    print(dv.inverse_transform(data))

    print(data)

    return None

if __name__ == "__main__":
    dictvec()

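Now data prints as a dense array: each city value gets its own 0/1 column (one-hot encoding), and temperature keeps its numeric value.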

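1.4 Text feature extraction

Purpose: convert text data into feature values.

Class: sklearn.feature_extraction.text.CountVectorizer

CountVectorizer syntax:

CountVectorizer(max_df=1.0,min_df=1,…)

Produces a word-count matrix
CountVectorizer.fit_transform(X,y)
- X: a text, or an iterable of text strings
- Returns: a sparse matrix
CountVectorizer.inverse_transform(X)
- X: an array or a sparse matrix
- Returns: the data in its format before conversion
CountVectorizer.get_feature_names_out()
- Returns: the list of words (before scikit-learn 1.0 the method is get_feature_names())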

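from sklearn.feature_extraction.text import CountVectorizer

def countvec():
    """
    Convert text into feature values
    :return: None
    """
    cv = CountVectorizer()

    # Chinese text: words are only separated where the input contains spaces or punctuation
    data = cv.fit_transform(["人生 苦短，我 喜欢 python", "人生漫长，不用 python"])

    # scikit-learn >= 1.0; older releases use get_feature_names()
    print(cv.get_feature_names_out().tolist())

    print(data.toarray())

    return None

if __name__ == "__main__":
    countvec()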

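The words come out right only where the input already contained spaces. Segmenting Chinese text by hand is not practical, so we use a word segmentation tool, jieba.

pip install jieba -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com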

from sklearn.feature_extraction.text import CountVectorizer
import jieba

def cutword():
    """
    Segment three Chinese sentences with jieba
    :return: the three sentences as space-separated strings
    """
    con1 = jieba.cut("今天很残酷，明天更残酷，后天很美好，但绝对大部分是死在明天晚上，所以每个人不要放弃今天。")

    con2 = jieba.cut("我们看到的从很远星系来的光是在几百万年之前发出的，这样当我们看到宇宙时，我们是在看它的过去。")

    con3 = jieba.cut("如果只用一种方式了解某样事物，你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。")

    # Convert the generators returned by jieba.cut into lists
    content1 = list(con1)
    content2 = list(con2)
    content3 = list(con3)

    # Join each list into a single space-separated string
    c1 = ' '.join(content1)
    c2 = ' '.join(content2)
    c3 = ' '.join(content3)

    return c1, c2, c3

def hanzivec():
    """
    Convert Chinese text into feature values
    :return: None
    """
    c1, c2, c3 = cutword()

    print(c1, c2, c3)

    cv = CountVectorizer()

    data = cv.fit_transform([c1, c2, c3])

    # scikit-learn >= 1.0; older releases use get_feature_names()
    print(cv.get_feature_names_out().tolist())

    print(data.toarray())

    return None

if __name__ == "__main__":
    hanzivec()

Output:

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\ACER\AppData\Local\Temp\jieba.cache
Loading model cost 0.839 seconds.
Prefix dict has been built successfully.
今天 很 残酷 ， 明天 更 残酷 ， 后天 很 美好 ， 但 绝对 大部分 是 死 在 明天 晚上 ， 所以 每个 人 不要 放弃 今天 。 我们 看到 的 从 很 远 星系 来 的 光是在 几百万年 之前 发出 的 ， 这样 当 我们 看到 宇宙 时 ， 我们 是 在 看 它 的 过去 。 如果 只用 一种 方式 了解 某样 事物 ， 你 就 不会 真正 了解 它 。 了解 事物 真正 含义 的 秘密 取决于 如何 将 其 与 我们 所 了解 的 事物 相 联系 。
['一种', '不会', '不要', '之前', '了解', '事物', '今天', '光是在', '几百万年', '发出', '取决于', '只用', '后天', '含义', '大部分', '如何', '如果', '宇宙', '我们', '所以', '放弃', '方式', '明天', '星系', '晚上', '某样', '残酷', '每个', '看到', '真正', '秘密', '绝对', '美好', '联系', '过去', '这样']
[[0 0 1 0 0 0 2 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 2 0 1 0 2 1 0 0 0 1 1 0 0 0]
 [0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 3 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 1]
 [1 1 0 0 4 3 0 0 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 2 1 0 0 1 0 0]]

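1.5 TF-IDF

The main idea of TF-IDF: if a word or phrase occurs frequently in one document but rarely in other documents, it is considered to discriminate well between categories and is suitable for classification.

Purpose of TF-IDF: to evaluate how important a word is to one document within a document collection or corpus.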
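To make the idea concrete, the sketch below recomputes TF-IDF weights in plain Python from the count matrix printed in section 1.4. It assumes the formula that TfidfVectorizer applies with its default settings (smooth_idf=True, norm='l2'): idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The function name tfidf is illustrative, not part of any library.

```python
import math

# Count matrix from section 1.4 (rows = documents, columns = words).
counts = [
    [0,0,1,0,0,0,2,0,0,0,0,0,1,0,1,0,0,0,0,1,1,0,2,0,1,0,2,1,0,0,0,1,1,0,0,0],
    [0,0,0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,1,3,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,1,1],
    [1,1,0,0,4,3,0,0,0,0,1,1,0,1,0,1,1,0,1,0,0,1,0,0,0,1,0,0,0,2,1,0,0,1,0,0],
]

def tfidf(counts):
    """Weight raw counts by smoothed IDF, then L2-normalize each row."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents each word occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result

weights = tfidf(counts)
# Column 18 is the word '我们', the only word shared by two documents.
print(round(weights[1][18], 8), round(weights[2][18], 8))
```

The word in column 18 occurs in two of the three documents, so its IDF is lower than that of words confined to a single document. The resulting values can be compared with the weight matrix printed later in this section.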

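Class: sklearn.feature_extraction.text.TfidfVectorizer

TfidfVectorizer syntax:

TfidfVectorizer(stop_words=None,…)

Produces a matrix of word weights
TfidfVectorizer.fit_transform(X,y)
- X: a text, or an iterable of text strings
- Returns: a sparse matrix
TfidfVectorizer.inverse_transform(X)
- X: an array or a sparse matrix
- Returns: the data in its format before conversion
TfidfVectorizer.get_feature_names_out()
- Returns: the list of words (before scikit-learn 1.0 the method is get_feature_names())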

from sklearn.feature_extraction.text import TfidfVectorizer
import jieba

def cutword():
    """
    Segment three Chinese sentences with jieba
    :return: the three sentences as space-separated strings
    """
    con1 = jieba.cut("今天很残酷，明天更残酷，后天很美好，但绝对大部分是死在明天晚上，所以每个人不要放弃今天。")

    con2 = jieba.cut("我们看到的从很远星系来的光是在几百万年之前发出的，这样当我们看到宇宙时，我们是在看它的过去。")

    con3 = jieba.cut("如果只用一种方式了解某样事物，你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。")

    # Convert the generators returned by jieba.cut into lists
    content1 = list(con1)
    content2 = list(con2)
    content3 = list(con3)

    # Join each list into a single space-separated string
    c1 = ' '.join(content1)
    c2 = ' '.join(content2)
    c3 = ' '.join(content3)

    return c1, c2, c3

def tfidfvec():
    """
    Convert Chinese text into TF-IDF feature values
    :return: None
    """
    c1, c2, c3 = cutword()

    print(c1, c2, c3)

    tf = TfidfVectorizer()

    data = tf.fit_transform([c1, c2, c3])

    # scikit-learn >= 1.0; older releases use get_feature_names()
    print(tf.get_feature_names_out().tolist())

    print(data.toarray())

    return None

if __name__ == "__main__":
    tfidfvec()

Output:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ACER\AppData\Local\Temp\jieba.cache
Loading model cost 0.633 seconds.
Prefix dict has been built successfully.
今天 很 残酷 ， 明天 更 残酷 ， 后天 很 美好 ， 但 绝对 大部分 是 死 在 明天 晚上 ， 所以 每个 人 不要 放弃 今天 。 我们 看到 的 从 很 远 星系 来 的 光是在 几百万年 之前 发出 的 ， 这样 当 我们 看到 宇宙 时 ， 我们 是 在 看 它 的 过去 。 如果 只用 一种 方式 了解 某样 事物 ， 你 就 不会 真正 了解 它 。 了解 事物 真正 含义 的 秘密 取决于 如何 将 其 与 我们 所 了解 的 事物 相 联系 。
['一种', '不会', '不要', '之前', '了解', '事物', '今天', '光是在', '几百万年', '发出', '取决于', '只用', '后天', '含义', '大部分', '如何', '如果', '宇宙', '我们', '所以', '放弃', '方式', '明天', '星系', '晚上', '某样', '残酷', '每个', '看到', '真正', '秘密', '绝对', '美好', '联系', '过去', '这样']
[[0.         0.         0.21821789 0.         0.         0.
  0.43643578 0.         0.         0.         0.         0.
  0.21821789 0.         0.21821789 0.         0.         0.
  0.         0.21821789 0.21821789 0.         0.43643578 0.
  0.21821789 0.         0.43643578 0.21821789 0.         0.
  0.         0.21821789 0.21821789 0.         0.         0.        ]
 [0.         0.         0.         0.2410822  0.         0.
  0.         0.2410822  0.2410822  0.2410822  0.         0.
  0.         0.         0.         0.         0.         0.2410822
  0.55004769 0.         0.         0.         0.         0.2410822
  0.         0.         0.         0.         0.48216441 0.
  0.         0.         0.         0.         0.2410822  0.2410822 ]
 [0.15698297 0.15698297 0.         0.         0.62793188 0.47094891
  0.         0.         0.         0.         0.15698297 0.15698297
  0.         0.15698297 0.         0.15698297 0.15698297 0.
  0.1193896  0.         0.         0.15698297 0.         0.
  0.         0.15698297 0.         0.         0.         0.31396594
  0.15698297 0.         0.         0.15698297 0.         0.        ]]

Comparing this matrix with the raw counts shows which words carry the most weight in each document.

Original article: https://www.cnblogs.com/liuhui0308/p/12680798.html