• Twitter Text Classification with TextCNN


    My first NLP task: classifying tweets with TextCNN.

    Task description

    [Screenshot: the competition's task description]

    The task is to judge whether a tweet is actually talking about a real-world disaster, rather than merely using disaster-like words figuratively.

    Data preprocessing

    import torch
    import torchtext
    import torch.nn as nn
    import torch.utils.data as Data
    import torch.nn.functional as F
    torch.manual_seed(916)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    
    import math
    import random
    import time
    import seaborn as sns
    import matplotlib.pyplot as plt
    import warnings
    import numpy as np
    import pandas as pd
    import re
    
    import spacy
    import texthero as th
    from torchtext import vocab
    from d2l import torch as d2l
    
    train_path = '/home/wu/kaggle/pipeline/textcnn/train.csv'
    test_path = '/home/wu/kaggle/pipeline/textcnn/test.csv'
    embed_path = '/home/wu/kaggle/wordvec/crawl-300d-2M.vec'
    submission_path = '/home/wu/kaggle/pipeline/textcnn/submission.csv'
    model_path = '/home/wu/kaggle/pipeline/textcnn/default_model.pkl'
    
    w2v = torchtext.vocab.Vectors(embed_path, max_vectors=10000)  # pre-trained fastText vectors, capped at the 10,000 most frequent words
    

    A note on these word vectors: if a word is not in the vocabulary, the lookup returns an all-zero tensor.
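
    A quick check with a made-up token confirms the zero-vector behavior:

    # 'xyzzy123notaword' is a made-up token that should not be in the vocabulary,
    # so its vector comes back all zeros
    vec = w2v.get_vecs_by_tokens(['xyzzy123notaword'])
    print(vec.shape)        # torch.Size([1, 300])
    print(vec.abs().sum())  # tensor(0.)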

    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    
    all_data = pd.concat((train, test))
    all_data['text'] = th.preprocessing.clean(all_data['text'])        # default cleaning pipeline
    all_data['text'] = th.preprocessing.remove_urls(all_data['text'])  # strip any remaining URLs
    all_data['text'] = th.preprocessing.tokenize(all_data.text)        # each tweet -> list of tokens
    

    This step uses the texthero library, which is genuinely convenient: cleaning the data is a single call, and tokenization follows right after.
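
    The clean call above applies texthero's default pipeline. If I remember the defaults right, spelling it out explicitly would look something like this (an equivalent sketch, not an extra step):

    from texthero import preprocessing as hp
    default_pipeline = [
        hp.fillna, hp.lowercase, hp.remove_digits, hp.remove_punctuation,
        hp.remove_diacritics, hp.remove_stopwords, hp.remove_whitespace,
    ]
    # equivalent to th.preprocessing.clean(all_data['text'])
    cleaned = hp.clean(all_data['text'], pipeline=default_pipeline)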

    length = all_data.text.apply(len)  # token count per tweet
    length.plot.hist(bins=25)          # histogram of tweet lengths
    

    [Histogram: distribution of tweet token counts after cleaning]

    After cleaning, most tweets are under 20 tokens, so I pad (or truncate) every tweet to a fixed length of 20.

    def pad(s):
        # pad with '<pad>' up to 20 tokens, truncate anything longer
        if len(s) < 20:
            return s + ['<pad>'] * (20 - len(s))
        else:
            return s[:20]
    
    all_data.text = all_data.text.apply(pad)
    

    I have to say, pandas' apply is really handy!

    # look up pre-trained vectors: each padded 20-token tweet becomes a (20, 300)
    # tensor; tokens missing from w2v (including '<pad>') map to all-zero rows
    all_data['text'] = all_data.text.apply(w2v.get_vecs_by_tokens)
    
    # split back into train / test
    train_X = all_data.text.iloc[:train.shape[0]]
    test_X = all_data.text.iloc[train.shape[0]:]
    train_y = torch.tensor(np.array(all_data.target_relabeled.iloc[:train.shape[0]]), dtype=torch.int64)
    
    train_X = torch.cat(list(train_X.values), dim=0).reshape(-1, 20, 300)
    train_X.shape
    

    Each tweet's word vectors live as a separate tensor inside the DataFrame, with an extra layer of wrapping around them, so there is no cleaner option... concatenating flattens everything along the first dimension, and we have to reshape back to (N, 20, 300).
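
    A tiny sketch (with made-up tensors) of why the reshape recovers the batch layout:

    parts = [torch.zeros(20, 300) for _ in range(3)]  # three tweets' (20, 300) tensors
    flat = torch.cat(parts, dim=0)                    # cat flattens along dim 0: (60, 300)
    print(flat.reshape(-1, 20, 300).shape)            # torch.Size([3, 20, 300])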

    I got lazy here and just train on the entire training set, without holding out a validation set. So be it.

    train_dataset = Data.TensorDataset(train_X, train_y)
    train_iter = Data.DataLoader(train_dataset, batch_size=64, shuffle=True)
    # wrap the tensors into an iterator over shuffled mini-batches of 64
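
    A quick peek at one mini-batch confirms the shapes:

    Xb, yb = next(iter(train_iter))
    print(Xb.shape, yb.shape)  # torch.Size([64, 20, 300]) torch.Size([64])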
    

    Training

    Model definition

    class textcnn(nn.Module):
        def __init__(self, embed_dim, nfilters, filter_sizes, classes=2, dropout=0.1):
            super(textcnn, self).__init__()
            # one Conv1d per window size; the embedding dims act as input channels
            self.convs = nn.ModuleList([
                nn.Conv1d(in_channels=embed_dim, out_channels=nfilters, kernel_size=fs)
                for fs in filter_sizes
            ])
            self.fn = nn.Linear(len(filter_sizes) * nfilters, classes)
            self.dropout = nn.Dropout(dropout)
            self.mish = nn.Mish()
    
        def forward(self, X):
            # X: (batch_size, nsteps, embed_dim)
            X = X.permute(0, 2, 1)
            # X: (batch_size, embed_dim, nsteps)
            conved = [self.mish(conv(X)) for conv in self.convs]
            # each element: (batch_size, nfilters, nsteps - filter_size + 1)
            pooled = [F.max_pool1d(conv, conv.shape[-1]).squeeze(2) for conv in conved]
            # max-over-time pooling: each element becomes (batch_size, nfilters)
            x = self.dropout(torch.cat(pooled, dim=1))
            # concatenated: (batch_size, len(filter_sizes) * nfilters)
            return self.fn(x)
            
    

    The idea here: treat the embedding dimension as the feature channels, so in_channels is just the input embed_dim, and treat the time steps as the axis the kernel slides along. The sliding window then has a natural reading: each convolution looks at a window of consecutive time steps, i.e. an n-gram. And I have to say, CNNs really are fast for this; with that much parallelism, training flies.
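
    To make the channels-vs-time-steps view concrete, here is a minimal single-filter sketch (the input tensor is made up):

    # the 300 embedding dims are the input channels; a size-3 kernel slides
    # over the 20 time steps, covering one trigram per position
    conv = nn.Conv1d(in_channels=300, out_channels=100, kernel_size=3)
    x = torch.randn(1, 300, 20)  # (batch, embed_dim, nsteps)
    print(conv(x).shape)         # torch.Size([1, 100, 18]), i.e. 20 - 3 + 1 positions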

    model = textcnn(300, 100, [3,4,5], 2, 0.5)
    

    The embedding dimension comes straight from the imported word vectors (there is no trainable embedding layer), so it is 300. I pick 100 output channels per convolution, and [3, 4, 5] means there are three kinds of kernels, with window sizes 3, 4, and 5.
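
    A quick forward pass on a dummy batch (made-up values) checks that everything lines up:

    with torch.no_grad():
        logits = model(torch.randn(4, 20, 300))  # (batch, nsteps, embed_dim)
    print(logits.shape)  # torch.Size([4, 2]): one logit per class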

    Training loop

    epochs = 200
    model.cuda()
    loss = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    for epoch in range(epochs):
        model.train()
        metric = d2l.Accumulator(3)  # accumulates (num examples, summed loss, num correct)
        for X, y in train_iter:
            X = X.cuda()
            y = y.cuda()
            y_hat = model(X)
            l = loss(y_hat, y)  # y_hat shape: (batch_size, 2)
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            acc_sum = (y_hat.argmax(dim=-1) == y).sum()
            num = len(y)
            lsum = l.item() * num
            metric.add(num, lsum, acc_sum)
        print("epoch: %d acc: %.3f, loss: %.3f" % (epoch, metric[2] / metric[0], metric[1] / metric[0]))
    

    [Training log: per-epoch accuracy and loss]

    As the log shows, the training accuracy is pretty decent (keeping in mind there is no validation set, so this only measures fit to the training data).

    Predict and submit

    model = model.to('cpu')
    model.eval()  # disable dropout for inference
    
    def predict(X):
        # X is one tweet's (20, 300) tensor; add a batch dimension
        with torch.no_grad():
            t = model(X.unsqueeze(0))
        x, y = list(t[0])  # logits for class 0 and class 1
        return 0 if x > y else 1
    
    test_pred = test_X.apply(predict)
    sub = pd.read_csv('/home/wu/kaggle/pipeline/textcnn/sample_submission.csv')
    sub['target'] = test_pred.values
    sub.to_csv('./sub.csv', index=False)
    

    Slacking off

    [Kaggle leaderboard screenshot: rank ~900, score 0.728]

    A simple ~900th place, with a score of 0.728. Emmm, decent enough; without bringing in BERT, I think landing above 0.7 is actually fine. Time to slack off!
