• Text sentiment classification with Bag of Words and a random forest


    1. Read the dataset

    Read the training set with pandas.

    import re
    import pandas as pd
    import numpy as np
    from bs4 import BeautifulSoup
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.ensemble import RandomForestClassifier
    
    # download frequent stopwords
    import nltk
    nltk.download('stopwords')  
    from nltk.corpus import stopwords
    
    # Read train data
    path_train_data = "movie_review/labeledTrainData.tsv"
    train = pd.read_csv(path_train_data, header=0, delimiter="\t", quoting=3)
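    
    # Quick sanity check of what was loaded. For the Kaggle "Bag of Words
    # Meets Bags of Popcorn" data this should show 25000 rows and the
    # columns id, sentiment, review (adjust expectations for other data).
    print(train.shape)
    print(train.columns.values)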
    
    

    Imported packages:

    • re: Python's built-in regular-expression library;
    • pandas: library for reading and manipulating data;
    • numpy: high-performance array library;
    • BeautifulSoup: HTML parser, used to extract content from HTML and XML documents;
    • sklearn: a widely used machine-learning library;
      • CountVectorizer: turns a collection of documents into a matrix of token counts, i.e. the bag-of-words representation;
      • RandomForestClassifier: random-forest classifier;
    • nltk: natural-language-processing library;
      • corpus: the corpora bundled with nltk;
      • stopwords: stopword lists (e.g. is, are, I, you, and ...); see the short demo after this list.
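
    For reference, here is a minimal sketch of what the stopword list contains (it assumes the 'stopwords' corpus downloaded above; the exact contents vary across NLTK versions):

    from nltk.corpus import stopwords

    english_stops = stopwords.words('english')
    print(len(english_stops))   # 179 in recent NLTK versions
    print(english_stops[:8])    # ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']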

    2. Preprocess the text

    Define two functions to clean the text, e.g. stripping HTML tags, punctuation, and meaningless words (and, is, are...).

    
    def review_to_words(raw_review):
        # 1. Remove HTML tags ("html.parser" is Python's built-in parser;
        #    passing it explicitly avoids BeautifulSoup's warning)
        review_text = BeautifulSoup(raw_review, "html.parser").get_text()
        # 2. Remove non-letters
        letters_only = re.sub("[^a-zA-Z]", " ", review_text)
        # 3. Convert to lower case, split into individual words
        words = letters_only.lower().split()
        # 4. Convert stopwords to a set, because searching a set is
        #    much faster than searching a list
        stops = set(stopwords.words('english'))
        # 5. Remove stopwords
        meaningful_words = [w for w in words if w not in stops]
        # 6. Join the words back into one string, separated by spaces
        return " ".join(meaningful_words)
    
    def get_clean_reviews(reviews):
        num_reviews = len(reviews)
        clean_reviews = []
        for i in range(num_reviews):
            if ((i+1) % 1000) == 0:  
                print("Review %d of %d" % (i+1, num_reviews))
            clean_reviews.append(review_to_words(reviews[i]))
        return clean_reviews
    
    clean_train_reviews = get_clean_reviews(train["review"])
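
    As a quick check, the cleaning function can be applied to a made-up review (the sample string below is illustrative, not taken from the dataset):

    sample = "<br /><br />This movie is GREAT!!! I'd watch it again."
    print(review_to_words(sample))
    # expected output: "movie great watch" (the exact result depends on
    # which version of NLTK's stopword list is installed)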
    

    3. Build the bag of words:

    # Initialize the CountVectorizer, limiting the vocabulary to the
    # 5000 most frequent words to keep the feature matrix manageable
    vectorizer = CountVectorizer(analyzer="word", max_features=5000)

    # fit_transform() does two things: first, it fits the model
    # and learns the vocabulary; second, it transforms our training data
    # into feature vectors. The input to fit_transform should be a list of strings.
    train_data_features = vectorizer.fit_transform(clean_train_reviews)
    
    # Numpy arrays are easy to work with, so convert the result to an array
    train_data_features = train_data_features.toarray()
    vocab = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
    

    vectorizer.fit_transform converts a list of strings into a bag-of-words feature matrix.
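
    To make the representation concrete, here is a minimal sketch on a made-up two-document corpus (toy_corpus and toy_vectorizer are illustrative names, not part of the walkthrough):

    from sklearn.feature_extraction.text import CountVectorizer

    toy_corpus = ["the cat sat", "the cat sat on the mat"]
    toy_vectorizer = CountVectorizer()
    toy_features = toy_vectorizer.fit_transform(toy_corpus).toarray()
    print(toy_vectorizer.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
    print(toy_features)
    # [[1 0 0 1 1]
    #  [1 1 1 1 2]]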

    4. Train the model

    Train a random forest with 100 trees:

    print("Training the random forest...")
    # Initialize a Random Forest classifier with 100 trees
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(train_data_features, train['sentiment'])
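
    For a rough estimate of how well this model generalises, one option (an addition, not part of the original walkthrough) is k-fold cross-validation on the training features:

    from sklearn.model_selection import cross_val_score

    # 3-fold cross-validation refits the forest three times, so this is
    # noticeably slow on the full feature matrix
    scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                             train_data_features, train["sentiment"], cv=3)
    print("CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))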
    

    5. Prepare the test set:

    Use the functions defined earlier to build the feature vectors for the test set.

    print("Testing the model...")
    path_test_data = "movie_review/testData.tsv"
    test = pd.read_csv(path_test_data, header=0, delimiter="\t", quoting=3)
    
    # processing test data
    clean_test_reviews = get_clean_reviews(test["review"])
    test_data_features = vectorizer.transform(clean_test_reviews)
    test_data_features = test_data_features.toarray()
    
    

    Note that when computing test_data_features we call transform rather than fit_transform, because the vectorizer was already fitted on the training data.
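
    A small illustration (the input string is made up; which words count depends on the vocabulary learned from the training data):

    unseen = vectorizer.transform(["great movie xyzneverseen"])
    # words outside the learned vocabulary are silently dropped, so only
    # the in-vocabulary tokens contribute to the counts
    print(unseen.sum())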

    6. Predict

    Write the predictions to a csv file so they are easy to inspect:

    path_result = "movie_review/Bag_of_Words_model.csv"
    result = forest.predict(test_data_features)
    output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
    output.to_csv(path_result, index=False, quoting=3)
    