• Feature extraction in sklearn


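    The sklearn.feature_extraction module extracts features from raw data such as text and images, in the form of feature vectors that machine learning algorithms can process directly.

    1. Feature extraction: Loading Features from Dicts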

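    from sklearn.feature_extraction import DictVectorizer

    # Each sample is a dict with one categorical feature (city) and one numeric feature (temperature)
    measurements=[
        {'city':'Dubai','temperature':33.},
        {'city':'London','temperature':12.},
        {'city':'San Francisco','temperature':18.},
    ]

    vec=DictVectorizer()
    # String values are one-hot encoded; numeric values are passed through unchanged
    print(vec.fit_transform(measurements).toarray())
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
    print(vec.get_feature_names_out().tolist())

    #[[  1.   0.   0.  33.]
    # [  0.   1.   0.  12.]
    # [  0.   0.   1.  18.]]

    #['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']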
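As a small follow-up to the example above, the sketch below shows what happens when a fitted DictVectorizer meets a category it has not seen. The city 'Paris' is a hypothetical value introduced only for illustration, and sparse=False is used so the result prints as a dense array.

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix
vec = DictVectorizer(sparse=False)
vec.fit(measurements)

# 'Paris' (hypothetical) was not present during fit, so no 'city=Paris' column exists
new_sample = [{'city': 'Paris', 'temperature': 20.}]
encoded = vec.transform(new_sample)

# All three city columns stay at 0; only the temperature column is filled
print(encoded)
```

Features that were not seen during fit are silently dropped by transform, so an unknown city ends up as an all-zero one-hot block rather than raising an error.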

    2. Feature extraction: Feature hashing
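The original post gives no example for this section. As a minimal sketch only, the snippet below uses sklearn.feature_extraction.FeatureHasher on samples similar to the ones in section 1. The value n_features=8 and the hand-written 'city=...' keys are assumptions chosen for illustration, not something taken from the source.

```python
from sklearn.feature_extraction import FeatureHasher

# Categorical values are written as 'name=value' keys with weight 1 (an illustrative choice)
samples = [
    {'city=Dubai': 1, 'temperature': 33.},
    {'city=London': 1, 'temperature': 12.},
    {'city=San Francisco': 1, 'temperature': 18.},
]

# n_features fixes the output width up front; 8 is an arbitrary small value for this sketch
hasher = FeatureHasher(n_features=8, input_type='dict')

# FeatureHasher is stateless, so transform can be called without fitting first
hashed = hasher.transform(samples)

print(hashed.shape)
print(hashed.toarray())
```

Unlike DictVectorizer, the hasher keeps no vocabulary, so column indices cannot be mapped back to feature names, and with a small n_features different features may collide in the same column.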

    3. Feature extraction: Text Feature Extraction

    The bag-of-words representation

    # Bag-of-words model
    from sklearn.feature_extraction.text import CountVectorizer
    # Inspect the default parameters
    vectorizer=CountVectorizer(min_df=1)
    print(vectorizer)

    # Output from an older scikit-learn release, which prints the full parameter list
    # (recent releases print only the parameters that differ from the defaults)
    r"""
    CountVectorizer(analyzer='word', binary=False, decode_error='strict',
            dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
            lowercase=True, max_df=1.0, max_features=None, min_df=1,
            ngram_range=(1, 1), preprocessor=None, stop_words=None,
            strip_accents=None, token_pattern='(?u)\b\w\w+\b',
            tokenizer=None, vocabulary=None)

    """

    corpus=["this is the first document.",
            "this is the second second document.",
            "and the third one.",
            "Is this the first document?"]
    # fit_transform learns the vocabulary and returns a sparse document-term matrix
    x=vectorizer.fit_transform(corpus)
    # Each printed entry is (document index, feature index) followed by the count
    print(x)

    """
      (0, 1)    1
      (0, 2)    1
      (0, 6)    1
      (0, 3)    1
      (0, 8)    1
      (1, 5)    2
      (1, 1)    1
      (1, 6)    1
      (1, 3)    1
      (1, 8)    1
      (2, 4)    1
      (2, 7)    1
      (2, 0)    1
      (2, 6)    1
      (3, 1)    1
      (3, 2)    1
      (3, 6)    1
      (3, 3)    1
      (3, 8)    1
    """

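    By default, only tokens of at least 2 characters are recognized, which is why the single-letter word "a" is dropped below.

    analyze=vectorizer.build_analyzer()
    print(analyze("this is a document to analyze.")==
        (["this","is","document","to","analyze"])) #True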

    Each term found by the analyzer during fit is assigned a unique integer index, and that index corresponds to one column of the resulting feature matrix.

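    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns
    # a NumPy array, so convert it to a list before comparing
    print(vectorizer.get_feature_names_out().tolist()==(
        ["and","document","first","is","one","second","the","third","this"]
    ))
    #True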
    print(x.toarray())
    """
    [[0 1 1 1 0 0 1 0 1]
     [0 1 0 1 0 2 1 0 1]
     [1 0 0 0 1 0 1 1 0]
     [0 1 1 1 0 0 1 0 1]]
    """

    The mapping from a term to its column index is stored in the vocabulary_ attribute:

    print(vectorizer.vocabulary_.get('document'))
    #1
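As a short check of how a vocabulary index maps to a matrix column, the sketch below rebuilds the same corpus and reads the column for the word 'second'. The variable names col and second_counts are introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["this is the first document.",
          "this is the second second document.",
          "and the third one.",
          "Is this the first document?"]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)

# Look up the column assigned to the term 'second'
col = vectorizer.vocabulary_['second']

# Slice that column out of the sparse matrix: one count per document
second_counts = x[:, col].toarray().ravel()

print(col)            # 5
print(second_counts)  # [0 2 0 0]
```

Only the second document contains 'second', and it contains it twice, which matches the 2 in that row of the dense matrix shown earlier.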

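    Words that did not appear in the training corpus are ignored in later calls to transform, so a document made up only of unseen words is encoded as all zeros: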

    print(vectorizer.transform(["something completely new."]).toarray())
    """
    [[0 0 0 0 0 0 0 0 0]]
    """

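    In the corpus above, the first and last documents contain exactly the same words, only in a different order, so they are encoded as identical feature vectors. The bag-of-words representation therefore loses word-order information. To preserve some local ordering, we can extract 2-grams (pairs of adjacent words) in addition to 1-grams (single words).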

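    # ngram_range=(1,2) extracts both single words and pairs of adjacent words;
    # the token pattern needs the backslashes (\b\w+\b) to match whole words
    bigram_vectorizer=CountVectorizer(ngram_range=(1,2),token_pattern=r"\b\w+\b",min_df=1)
    analyze=bigram_vectorizer.build_analyzer()
    # The analyzer lowercases by default, so 'Bi' comes back as 'bi'
    print(analyze("Bi-grams are cool!")==(['bi','grams','are','cool','bi grams',
                                     'grams are','are cool']))

    #True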
    x_2=bigram_vectorizer.fit_transform(corpus).toarray()
    print(x_2)
    
    """
    [[0 0 1 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0]
     [0 0 1 0 0 1 1 0 0 2 1 1 1 0 1 0 0 0 1 1 0]
     [1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 1 0 0 0]
     [0 0 1 1 1 1 0 1 0 0 0 0 1 1 0 0 0 0 1 0 1]]
    """
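To see that the bigram features really separate the first and last documents, the following sketch looks up the column for the bigram 'is this', which only occurs in the interrogative form. The names feature_index, is_this_counts and rows_differ are introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["this is the first document.",
          "this is the second second document.",
          "and the third one.",
          "Is this the first document?"]

# Unigrams plus bigrams, keeping single-character tokens as well
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b")
x_2 = bigram_vectorizer.fit_transform(corpus).toarray()

# Column of the bigram 'is this', present only in the last document
feature_index = bigram_vectorizer.vocabulary_['is this']
is_this_counts = x_2[:, feature_index]
print(is_this_counts)  # [0 0 0 1]

# With bigrams included, the first and last rows are no longer identical
rows_differ = bool((x_2[0] != x_2[3]).any())
print(rows_differ)  # True
```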
  • Original article: https://www.cnblogs.com/cmybky/p/11772638.html