• Li Hongyi Machine Learning HW4: Sentence Sentiment Classification


    Introduction

    The data for this assignment consists of tweets from Twitter. The labeled training data are tagged as positive or negative, and the goal is to classify the unlabeled test sentences.

    Labeled training data: the +++$+++ in the middle is just a separator; 200,000 entries in total.

    Unlabeled training data: 1,178,614 entries in total.

    Test data: 200,000 entries in total.
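
    Judging from how the data is parsed below, a labeled line looks roughly like this (the tweet text itself is made up for illustration): the label comes first, then the +++$+++ separator, then the space-separated words of the tweet.

    # Illustrative parse of one labeled line (the tweet text is invented).
    line = "1 +++$+++ are wtf ... awww thanks !"
    tokens = line.strip().split(' ')
    label, words = tokens[0], tokens[2:]   # label = '1', words = ['are', 'wtf', '...', 'awww', 'thanks', '!']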

    Data Processing

    Reading the data

    import torch
    import pandas as pd
    import torch.nn as nn
    
    
    def load_training_data(path='./data/training_label.txt'):
        """
        Read the training data.
        :param path: path to the data file
        :return: for labeled training data, return x, y;
                 for unlabeled training data, return x
        """
        if 'training_label' in path:
            with open(path, 'r', encoding='UTF-8') as f:
                lines = f.readlines()
                lines = [line.strip().split(' ') for line in lines]
                train_x = [line[2:] for line in lines]
                train_y = [line[0] for line in lines]
            return train_x, train_y
        else:
            with open(path, 'r', encoding='UTF-8') as f:
                lines = f.readlines()
                train_x = [line.strip().split(' ') for line in lines]
            return train_x
    
    
    def load_testing_data(path='./data/testing_data.txt'):
        """
        Read the testing data.
        :param path: path to the data file
        :return: x
        """
        with open(path, 'r', encoding='UTF-8') as f:
            lines = f.readlines()
            test_data = [''.join(line.strip('\n').split(',', 1)[1:]).strip() for line in lines[1:]]   # split each line at the first ',' and keep everything after it
            test_data = [sen.split(' ') for sen in test_data]
            return test_data
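
    A quick sanity check of what these loaders return (the lengths follow the dataset sizes given above):

    # train_x is a list of token lists; train_y is a list of label strings ('0'/'1').
    train_x, train_y = load_training_data('./data/training_label.txt')
    print(len(train_x), len(train_y))          # 200000 200000
    print(type(train_x[0]), type(train_y[0]))  # <class 'list'> <class 'str'> -- labels stay strings until converted later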

    Word vectors

    The word vectors are computed with the third-party library gensim.

    Meaning of the Word2vec parameters:

    size: dimensionality of the word vectors.
    alpha: the initial learning rate.
    window: the maximum distance between the current word and the predicted word within a sentence.
    min_count: used for filtering; words that occur fewer than min_count times are discarded. Default is 5.
    max_vocab_size: RAM limit while building the vocabulary. If the number of unique words exceeds this limit, the lowest-frequency ones are pruned. Roughly 1 GB of RAM is needed per ten million words. Set to None for no limit.
    sample: threshold for randomly downsampling high-frequency words. Default is 1e-3; useful values lie in (0, 1e-5).
    seed: seed for the random number generator, which affects word-vector initialization.
    workers: number of worker threads used for training.
    min_alpha: the learning rate decays linearly from alpha to min_alpha as training proceeds.
    sg: training algorithm. sg=0 uses CBOW; sg=1 uses skip-gram.
    hs: if 1, hierarchical softmax is used; if 0 (default), negative sampling is used.
    negative: if greater than 0, negative sampling is used and this value is the number of "noise words" (usually 5-20, default 5). If 0, negative sampling is not used.
    cbow_mean: if 0, use the sum of the context word vectors; if 1 (default), use their mean. Only applies when CBOW is used.
    hashfxn: hash function used to initialize the weights; defaults to Python's built-in hash function.
    iter: number of training iterations (epochs). Default is 5.
    trim_rule: vocabulary trimming rule specifying which words to keep and which to discard. By default, words with count < min_count are discarded. Can be set to None, in which case min_count is used.
    sorted_vocab: if 1 (default), sort the vocabulary by descending frequency before assigning word indexes.
    batch_words: number of words passed to worker threads per batch. Default is 10000.
    
    from gensim.models import Word2Vec
    from utils import load_training_data
    from utils import load_testing_data
    
    
    def train_word2vec(x):
        # gensim 3.x API; in gensim 4.x, size and iter were renamed to vector_size and epochs
        model = Word2Vec(x, size=250, window=5, min_count=5, workers=12, iter=10, sg=1)
        return model
    
    
    print('loading training data ...')
    train_x, train_y = load_training_data()
    train_x_no_label = load_training_data('./data/training_nolabel.txt')
    
    print('load testing data ...')
    test_x = load_testing_data()
    
    word2evc_model = train_word2vec(train_x + test_x)
    
    print('saving model ...')
    word2evc_model.save('w2v.model')
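
    After training, it is worth sanity-checking the embeddings by querying the saved model. A minimal sketch using the gensim 3.x API (the query word 'happy' is only an example and must actually be in the vocabulary):

    # Load the saved word2vec model and inspect it.
    w2v = Word2Vec.load('w2v.model')
    print(w2v.vector_size)                         # 250
    print(w2v.wv['happy'].shape)                   # (250,) -- one vector per word
    print(w2v.wv.most_similar('happy', topn=5))    # the 5 nearest words by cosine similarity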
    

      

    PreProcess

    Define a data-preprocessing class. It mainly pads/truncates sentences to a fixed length and builds the mappings between words, indexes, and word vectors.

    import torch
    from gensim.models import Word2Vec
    
    
    class PreProcess():
        def __init__(self, sentences, sen_len, w2v_path):
            self.w2v_path = w2v_path      # path to the saved word2vec model
            self.sentences = sentences    # sentences
            self.sen_len = sen_len        # fixed sentence length
            self.idx2word = []
            self.word2idx = {}
            self.embedding_matrix = []    # embedding matrix
    
        def get_w2v_model(self):
            # load the word2vec model trained earlier
            self.embedding = Word2Vec.load(self.w2v_path)
            self.embedding_dim = self.embedding.vector_size
    
        def add_embedding(self, word):
            # here word can only be "<PAD>" or "<UNK>"
            # use a randomly initialized vector as the embedding of "<PAD>" or "<UNK>"
            vector = torch.empty(1, self.embedding_dim)
            torch.nn.init.uniform_(vector)
            self.idx2word.append(word)
            self.word2idx[word] = len(self.word2idx)
            self.embedding_matrix = torch.cat([self.embedding_matrix, vector], 0)
    
        def make_embedding(self, load=True):
            # build the embedding matrix
            print("Get embedding ...")
            if load:
                print("loading word to vec model ...")
                self.get_w2v_model()            # load the pretrained Word2vec embeddings
            else:
                raise NotImplementedError
    
            for i, word in enumerate(self.embedding.wv.vocab):           # iterate over the vocabulary
                print('\rBuilding embedding matrix: {:.2f}%'.format(i/len(self.embedding.wv.vocab)*100), end='')
                self.idx2word.append(word)                               # idx2word is a list whose index corresponds to the word
                self.word2idx[word] = len(self.word2idx)
                # self.word2idx[word] = self.idx2word.index(word)        # equivalent, but slower
                self.embedding_matrix.append(self.embedding[word])       # append the word vector; a word's index equals the row of its vector in embedding_matrix

            print('')
            self.embedding_matrix = torch.tensor(self.embedding_matrix)  # convert to a tensor
            # add <PAD> and <UNK> to the embedding
            self.add_embedding("<PAD>")     # sentences are padded to the same length during training, so short sentences need <PAD>
            self.add_embedding("<UNK>")     # low-frequency words were dropped by word2vec, so words without a vector share a random <UNK> vector
            print("total words: {}".format(len(self.embedding_matrix)))
            return self.embedding_matrix
    
        def pad_sequence(self, sentence):
            # pad/truncate the sentence to the fixed length sen_len
            if len(sentence) > self.sen_len:
                sentence = sentence[:self.sen_len]         # truncate
            else:
                pad_len = self.sen_len - len(sentence)     # pad with <PAD>
                for _ in range(pad_len):
                    sentence.append(self.word2idx['<PAD>'])
            assert len(sentence) == self.sen_len
            return sentence
    
        def sentence_word2idx(self):
            # represent each word in the sentences by its vocabulary index
            sentence_list = []
            for i, sen in enumerate(self.sentences):
                sentence_idx = []
                for word in sen:
                    if word in self.word2idx.keys():
                        sentence_idx.append(self.word2idx[word])
                    else:
                        sentence_idx.append(self.word2idx["<UNK>"])     # words not in the vocabulary map to <UNK>
                sentence_idx = self.pad_sequence(sentence_idx)          # pad/truncate to sen_len
                sentence_list.append(sentence_idx)
            return torch.LongTensor(sentence_list)              # torch.Size([number of sentences, sen_len])
    
        def labels_to_tensor(self, y):
            # convert the labels to a tensor
            y = [int(label) for label in y]
            return torch.LongTensor(y)
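
    A minimal usage sketch of the class (the variable names here are illustrative; the same calls appear in the training script below):

    preprocess = PreProcess(train_x, sen_len=20, w2v_path='w2v.model')
    embedding = preprocess.make_embedding(load=True)       # torch.Size([vocab_size + 2, 250]); the +2 is <PAD> and <UNK>
    train_x_idx = preprocess.sentence_word2idx()            # torch.Size([num_sentences, 20]), word indexes
    train_y_tensor = preprocess.labels_to_tensor(train_y)   # torch.Size([num_sentences]), 0/1 labels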
    

      

    DataSet

    from torch.utils.data import Dataset
    
    
    class TwitterDataset(Dataset):
        def __init__(self, X, y):
            self.data = X
            self.label = y
    
        def __getitem__(self, idx):
            if self.label is None:
                return self.data[idx]
            return self.data[idx], self.label[idx]
    
        def __len__(self):
            return len(self.data)
    

      

    Model definition

    Define a simple single-layer LSTM whose embedding layer is initialized from the gensim-trained word vectors. Readers unfamiliar with the LSTM parameters can refer to the linked introduction.

    import torch
    from gensim.models import Word2Vec
    import torch.nn as nn
    
    
    class LSTM_Net(nn.Module):
        def __init__(self, embedding, embedding_dim, hidden_dim, num_layers, dropout=0.5, fix_embedding=True):
            super(LSTM_Net, self).__init__()
    
            self.embedding = torch.nn.Embedding(embedding.size(0), embedding.size(1))
            self.embedding.weight = torch.nn.Parameter(embedding)
    
            # whether to freeze the embedding; if not fixed, the embedding is updated during training
            self.embedding.weight.requires_grad = False if fix_embedding else True

            self.embedding_dim = embedding.size(1)   # dimensionality of the word vectors, i.e. the LSTM input_size
            self.hidden_dim = hidden_dim             # hidden-state dimensionality
            self.num_layers = num_layers             # number of LSTM layers
            self.dropout = dropout
            self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
    
            self.classifier = nn.Sequential(
                nn.Dropout(dropout),
                nn.Linear(hidden_dim, 1),
                nn.Sigmoid()
            )
    
        def forward(self, inputs):
            inputs = self.embedding(inputs)    # map word indexes to word vectors
            x, _ = self.lstm(inputs, None)
            # x has dimensions (batch, seq_len, hidden_size)
            # take the hidden state of the last time step (last word) and feed it to the classifier
            x = x[:, -1, :]
            x = self.classifier(x)
            return x
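
    Before wiring the model into the training loop, it can help to run a dummy batch through it to check the output shape. A minimal sketch using the imports above (the random embedding matrix, vocabulary size 1000, and batch of 8 are made up for illustration):

    dummy_embedding = torch.randn(1000, 250)       # pretend vocabulary of 1000 words, 250-dim vectors
    net = LSTM_Net(dummy_embedding, embedding_dim=250, hidden_dim=150, num_layers=1)
    dummy_batch = torch.randint(0, 1000, (8, 20))  # 8 sentences, each padded to sen_len = 20
    print(net(dummy_batch).shape)                  # torch.Size([8, 1]); values are probabilities in (0, 1)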
    

      

    Model training

    from sklearn.model_selection import train_test_split
    from utils import load_training_data, load_testing_data
    from _class import PreProcess, LSTM_Net, TwitterDataset
    from torch.utils.data import DataLoader
    import torch
    import torch.nn as nn
    
    
    def evaluation(outputs, labels):
        # outputs => predictions, probabilities (float)
        # labels => ground truth, labels (0 or 1)
        outputs[outputs >= 0.5] = 1    # >= 0.5 is positive
        outputs[outputs < 0.5] = 0     # < 0.5 is negative
        accuracy = torch.sum(torch.eq(outputs, labels)).item()
        return accuracy
    
    
    def training(batch_size, n_epoch, lr, train, valid, model, device):
        # print the total number of parameters and the number of trainable parameters
        total = sum(p.numel() for p in model.parameters())   # numel() returns the number of elements in a tensor
        trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
        print('\nstart training, parameter total:{}, trainable:{}\n'.format(total, trainable))

        loss = nn.BCELoss()    # binary cross-entropy loss
        t_batch = len(train)   # number of training batches
        v_batch = len(valid)   # number of validation batches
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # use Adam with a suitable learning rate lr
        total_loss, total_acc, best_acc = 0, 0, 0
        for epoch in range(n_epoch):
            total_loss, total_acc = 0, 0
    
            # training
            model.train()
            for i, (inputs, labels) in enumerate(train):
                inputs = inputs.to(device, dtype=torch.long)   # device is "cuda", so inputs becomes torch.cuda.LongTensor
                labels = labels.to(device, dtype=torch.float)  # labels becomes torch.cuda.FloatTensor, since loss() needs float

                optimizer.zero_grad()        # gradients from loss.backward() accumulate, so zero them for every batch
                outputs = model(inputs)      # forward pass
                outputs = outputs.squeeze()  # drop the trailing dimension so outputs can be fed to loss()
                batch_loss = loss(outputs, labels)  # training loss for this batch
                batch_loss.backward()        # compute the gradient of the loss
                optimizer.step()             # update the model parameters

                accuracy = evaluation(outputs, labels)  # training accuracy for this batch
                total_acc += (accuracy / batch_size)
                total_loss += batch_loss.item()
            print('Epoch | {}/{}'.format(epoch + 1, n_epoch))
            print('Train | Loss:{:.5f} Acc: {:.3f}'.format(total_loss / t_batch, total_acc / t_batch * 100))
    
            # validation
            model.eval()
            with torch.no_grad():
                total_loss, total_acc = 0, 0
    
                for i, (inputs, labels) in enumerate(valid):
                    inputs = inputs.to(device, dtype=torch.long)
                    labels = labels.to(device, dtype=torch.float)
    
                    outputs = model(inputs)
                    outputs = outputs.squeeze()
                    batch_loss = loss(outputs, labels)
                    accuracy = evaluation(outputs, labels)
                    total_acc += (accuracy / batch_size)
                    total_loss += batch_loss.item()
    
                print("Valid | Loss:{:.5f} Acc: {:.3f} ".format(total_loss / v_batch, total_acc / v_batch * 100))
                if total_acc > best_acc:
                    # if this validation result is the best so far, save the model for later testing
                    best_acc = total_acc
                    torch.save(model, "ckpt.model")
            print('-----------------------------------------------')
    
    
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    sen_len = 20
    fix_embedding = True
    batch_size = 128
    epoch = 10
    lr = 0.001
    w2v_path = 'w2v.model'
    
    print('loading data...')
    train_x, train_y = load_training_data()
    train_x_no_label = load_training_data('./data/training_nolabel.txt')
    
    pre_process = PreProcess(train_x, sen_len, w2v_path)
    embedding = pre_process.make_embedding(load=True)
    train_x = pre_process.sentence_word2idx()
    train_y = pre_process.labels_to_tensor(train_y)
    
    model = LSTM_Net(embedding, embedding_dim=250, hidden_dim=150, num_layers=1, dropout=0.5, fix_embedding=fix_embedding)
    model = model.to(device)
    # stratify means stratified splitting by label, so the label ratio after the split roughly matches the original ratio
    X_train, X_val, y_train, y_val = train_test_split(train_x, train_y, test_size=0.1, random_state=1, stratify=train_y)
    
    train_dataset = TwitterDataset(X_train, y_train)
    val_dataset = TwitterDataset(X_val, y_val)
    
    
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=0)
    
    # start training
    training(batch_size, epoch, lr, train_loader, val_loader, model, device)
    

     

    Prediction

    import torch
    from utils import load_testing_data
    from _class import PreProcess, TwitterDataset, LSTM_Net
    from torch.utils.data import DataLoader
    import pandas as pd
    
    
    def testing(batch_size, test_loader, model, device):
        model.eval()
        ret_output = []
        with torch.no_grad():
            for i, inputs in enumerate(test_loader):
                inputs = inputs.to(device, dtype=torch.long)
                outputs = model(inputs)
                outputs = outputs.squeeze()
                outputs[outputs >= 0.5] = 1
                outputs[outputs < 0.5] = 0
                ret_output += outputs.int().tolist()    # outputs is a float Tensor, so convert to int before collecting
        return ret_output
    
    
    sen_len = 20
    batch_size = 128
    w2v_path = 'w2v.model'
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    test = load_testing_data()
    pre_process = PreProcess(test, sen_len, w2v_path)
    embedding = pre_process.make_embedding(load=True)
    test = pre_process.sentence_word2idx()
    test_dataset = TwitterDataset(test, None)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=0)
    
    model = torch.load('ckpt.model')
    outputs = testing(batch_size, test_loader, model, device)
    
    tmp = pd.DataFrame({'id': [str(i) for i in range(len(test))], 'label': outputs})
    print("save csv ...")
    tmp.to_csv('predict.csv', index=False)
    print("Finish Predicting")
    

    Optimization

    Enlarging the corpus

    Previously only the labeled training data was used to build the word vectors. Now the unlabeled training data is added as well; simply include train_x_no_label when training word2vec:

    # word2evc_model = train_word2vec(train_x + test_x)
    word2evc_model = train_word2vec(train_x + train_x_no_label + test_x)
    
    print('saving model ...')
    # word2evc_model.save('w2v.model')
    word2evc_model.save('new_w2v.model')
    

    The commented-out lines are the original code. Also remember to change the parts of the code that load w2v.model to use new_w2v.model instead.

    As you can see, enlarging the corpus does help.

    Self-training

    Next we apply the self-training technique covered in Prof. Li's lectures: use the trained model to classify the unlabeled data, and set a confidence threshold. Samples whose predictions reach the threshold are given pseudo-labels and added to the training set; the rest keep being predicted by newer versions of the model.

    First, slightly modify the PreProcess class. Previously the data was fixed when the object was constructed, but now the unlabeled data also needs to be embedded, so the data is passed in when the class methods are called instead.

    class PreProcess():
        def __init__(self, sen_len, w2v_path):
            self.w2v_path = w2v_path      # path to the saved word2vec model
            self.sen_len = sen_len        # fixed sentence length
            self.idx2word = []
            self.word2idx = {}
            self.embedding_matrix = []    # embedding matrix
    
        def get_w2v_model(self):
            # load the word2vec model trained earlier
            self.embedding = Word2Vec.load(self.w2v_path)
            self.embedding_dim = self.embedding.vector_size
    
        def add_embedding(self, word):
            # here word can only be "<PAD>" or "<UNK>"
            # use a randomly initialized vector as the embedding of "<PAD>" or "<UNK>"
            vector = torch.empty(1, self.embedding_dim)
            torch.nn.init.uniform_(vector)
            self.idx2word.append(word)
            self.word2idx[word] = len(self.word2idx)
            self.embedding_matrix = torch.cat([self.embedding_matrix, vector], 0)
    
        def make_embedding(self, load=True):
            # build the embedding matrix
            print("Get embedding ...")
            if load:
                print("loading word to vec model ...")
                self.get_w2v_model()            # load the pretrained Word2vec embeddings
            else:
                raise NotImplementedError
    
            for i, word in enumerate(self.embedding.wv.vocab):           # iterate over the vocabulary
                print('\rBuilding embedding matrix: {:.2f}%'.format(i/len(self.embedding.wv.vocab)*100), end='')
                self.idx2word.append(word)                               # idx2word is a list whose index corresponds to the word
                self.word2idx[word] = len(self.word2idx)
                # self.word2idx[word] = self.idx2word.index(word)        # equivalent, but slower
                self.embedding_matrix.append(self.embedding[word])       # append the word vector; a word's index equals the row of its vector in embedding_matrix

            print('')
            self.embedding_matrix = torch.tensor(self.embedding_matrix)  # convert to a tensor
            # add <PAD> and <UNK> to the embedding
            self.add_embedding("<PAD>")     # sentences are padded to the same length during training, so short sentences need <PAD>
            self.add_embedding("<UNK>")     # low-frequency words were dropped by word2vec, so words without a vector share a random <UNK> vector
            print("total words: {}".format(len(self.embedding_matrix)))
            return self.embedding_matrix
    
        def pad_sequence(self, sentence):
            # pad/truncate the sentence to the fixed length sen_len
            if len(sentence) > self.sen_len:
                sentence = sentence[:self.sen_len]         # truncate
            else:
                pad_len = self.sen_len - len(sentence)     # pad with <PAD>
                for _ in range(pad_len):
                    sentence.append(self.word2idx['<PAD>'])
            assert len(sentence) == self.sen_len
            return sentence
    
        def sentence_word2idx(self, sentences):
            # represent each word in the sentences by its vocabulary index
            sentence_list = []
            for i, sen in enumerate(sentences):
                sentence_idx = []
                for word in sen:
                    if word in self.word2idx.keys():
                        sentence_idx.append(self.word2idx[word])
                    else:
                        sentence_idx.append(self.word2idx["<UNK>"])     # words not in the vocabulary map to <UNK>
                sentence_idx = self.pad_sequence(sentence_idx)          # pad/truncate to sen_len
                sentence_list.append(sentence_idx)
            return torch.LongTensor(sentence_list)              # torch.Size([number of sentences, sen_len])
    
        def labels_to_tensor(self, y):
            # convert the labels to a tensor
            y = [int(label) for label in y]
            return torch.LongTensor(y)
    

    Add an add_train() function. threshold is the confidence threshold: when a probability in outputs is at least threshold, we treat the pseudo-label as 1; conversely, when it is below 1 - threshold, we treat it as 0. The remaining samples are still ambiguous and need to be classified again by later models.

    def add_train(outputs, threshold=0.9):
        # idx marks the confident predictions (prob >= threshold or prob < 1 - threshold)
        idx = (outputs >= threshold) | (outputs < 1 - threshold)
        outputs[outputs >= threshold] = 1       # confident positives get pseudo-label 1
        outputs[outputs < 1 - threshold] = 0    # confident negatives get pseudo-label 0
        return outputs, idx
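
    To see how this behaves, here is a small check on a made-up batch of probabilities (the values are chosen only for illustration):

    import torch

    probs = torch.tensor([0.97, 0.55, 0.40, 0.02])
    labels, idx = add_train(probs.clone(), threshold=0.9)
    print(idx)          # tensor([ True, False, False,  True]) -- only the first and last are confident
    print(labels[idx])  # tensor([1., 0.]) -- their pseudo-labels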
    

    The training code needs more substantial changes. After each epoch (once self-training starts), the model predicts the unlabeled data; the confident samples are added to the training set, and the rest keep being predicted by the newer model. Because the training data may grow at every iteration, it has to be re-wrapped into a Dataset and DataLoader at the start of each epoch.

    The complete training code is therefore:

    from sklearn.model_selection import train_test_split
    from utils import load_training_data, load_testing_data
    from _class import PreProcess, LSTM_Net, TwitterDataset
    from torch.utils.data import DataLoader
    import torch
    import torch.nn as nn
    
    
    def evaluation(outputs, labels):
        # outputs => predictions, probabilities (float)
        # labels => ground truth, labels (0 or 1)
        outputs[outputs >= 0.5] = 1    # >= 0.5 is positive
        outputs[outputs < 0.5] = 0     # < 0.5 is negative
        accuracy = torch.sum(torch.eq(outputs, labels)).item()
        return accuracy
    
    
    def add_train(outputs, threshold=0.9):
        # idx marks the confident predictions (prob >= threshold or prob < 1 - threshold)
        idx = (outputs >= threshold) | (outputs < 1 - threshold)
        outputs[outputs >= threshold] = 1       # confident positives get pseudo-label 1
        outputs[outputs < 1 - threshold] = 0    # confident negatives get pseudo-label 0
        return outputs, idx
    
    
    def training(batch_size, n_epoch, lr, X_train, y_train, valid, train_x_no_label, model, device):
        # print the total number of parameters and the number of trainable parameters
        total = sum(p.numel() for p in model.parameters())   # numel() returns the number of elements in a tensor
        trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
        print('\nstart training, parameter total:{}, trainable:{}\n'.format(total, trainable))

        loss = nn.BCELoss()    # binary cross-entropy loss
        v_batch = len(valid)      # number of validation batches
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # use Adam with a suitable learning rate lr
        total_loss, total_acc, best_acc = 0, 0, 0
        for epoch in range(n_epoch):
            total_loss, total_acc = 0, 0
    
            # training
            train_dataset = TwitterDataset(X_train, y_train)
            train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
    
            t_batch = len(train_loader)
            print('epoch: % d | batch_num: % d' % (epoch+1, t_batch))
    
            model.train()
            for i, (inputs, labels) in enumerate(train_loader):
                inputs = inputs.to(device, dtype=torch.long)    # device is "cuda", so inputs becomes torch.cuda.LongTensor
                labels = labels.to(device, dtype=torch.float)   # labels becomes torch.cuda.FloatTensor, since loss() needs float

                optimizer.zero_grad()                # gradients from loss.backward() accumulate, so zero them for every batch
                outputs = model(inputs)              # torch.Size([batch_size, 1])
                outputs = outputs.squeeze()          # torch.Size([batch_size]), matching labels
                batch_loss = loss(outputs, labels)   # training loss for this batch
                batch_loss.backward()                # compute the gradient of the loss
                optimizer.step()                     # update the model parameters

                accuracy = evaluation(outputs, labels)  # training accuracy for this batch
                total_acc += (accuracy / batch_size)
                total_loss += batch_loss.item()
            print('Epoch | {}/{}'.format(epoch + 1, n_epoch))
            print('Train | Loss:{:.5f} Acc: {:.3f}'.format(total_loss / t_batch, total_acc / t_batch * 100))
    
            if epoch > 3:                            # train for a few epochs first, then start adding the unlabeled data
                model.eval()
                train_x_no_label_dataset = TwitterDataset(train_x_no_label, None)
                train_x_no_label_loader = DataLoader(train_x_no_label_dataset, batch_size=batch_size, shuffle=False,  num_workers=0)

                tmp = torch.LongTensor()             # collects the still-unlabeled samples (long, same dtype as inputs)

                with torch.no_grad():
                    for i, (inputs) in enumerate(train_x_no_label_loader):
                        inputs = inputs.to(device, dtype=torch.long)
                        outputs = model(inputs)
                        outputs = outputs.squeeze()       # torch.Size([batch_size])
                        labels, idx = add_train(outputs)      # filter by confidence

                        X_train = torch.cat((X_train.to(device), inputs[idx].to(device)), dim=0)   # add the confident samples to the training set
                        y_train = torch.cat((y_train.to(device), labels[idx].to(device)), dim=0)

                        tmp = torch.cat((tmp.to(device), inputs[~idx].to(device)), dim=0)          # the rest will be predicted again by the newer model

                train_x_no_label = tmp
    
            # validation
            model.eval()
            with torch.no_grad():
                total_loss, total_acc = 0, 0
    
                for i, (inputs, labels) in enumerate(valid):
                    inputs = inputs.to(device, dtype=torch.long)
                    labels = labels.to(device, dtype=torch.float)
    
                    outputs = model(inputs)
                    outputs = outputs.squeeze()
                    batch_loss = loss(outputs, labels)
                    accuracy = evaluation(outputs, labels)
                    total_acc += (accuracy / batch_size)
                    total_loss += batch_loss.item()
    
                print("Valid | Loss:{:.5f} Acc: {:.3f} ".format(total_loss / v_batch, total_acc / v_batch * 100))
                if total_acc > best_acc:
                    # if this validation result is the best so far, save the model for later testing
                    best_acc = total_acc
                    torch.save(model, "ckpt.model")
            print('-----------------------------------------------')
    
    
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    sen_len = 20
    fix_embedding = True
    batch_size = 128
    epoch = 10
    lr = 0.001
    # w2v_path = 'w2v.model'
    w2v_path = 'new_w2v.model'
    
    print('loading data...')
    train_x, train_y = load_training_data()
    train_x_no_label = load_training_data('./data/training_nolabel.txt')
    
    pre_process = PreProcess(sen_len, w2v_path)
    embedding = pre_process.make_embedding(load=True)
    train_x = pre_process.sentence_word2idx(train_x)
    train_y = pre_process.labels_to_tensor(train_y)
    
    train_x_no_label = pre_process.sentence_word2idx(train_x_no_label)   # newly added
    
    model = LSTM_Net(embedding, embedding_dim=250, hidden_dim=150, num_layers=1, dropout=0.5, fix_embedding=fix_embedding)
    model = model.to(device)
    # stratify means stratified splitting by label, so the label ratio after the split roughly matches the original ratio
    X_train, X_val, y_train, y_val = train_test_split(train_x, train_y, test_size=0.1, random_state=1, stratify=train_y)
    
    # train_dataset = TwitterDataset(X_train, y_train)
    val_dataset = TwitterDataset(X_val, y_val)
    
    
    # train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=0)
    
    # start training
    training(batch_size, epoch, lr, X_train, y_train, val_loader, train_x_no_label, model, device)
    

    You can see that the number of batches keeps increasing, which means the total amount of training data (batch_num * batch_size) keeps growing.

    Submitting again, the score improves slightly.

    BiLSTM + self-attention

    I have not really mastered attention yet, so I improvised something of my own; the results are only so-so...

    import math      # needed by math.sqrt in the attention method below


    class AttenBiLSTM(nn.Module):
        def __init__(self, embedding, embedding_dim, hidden_dim, num_layers, dropout=0.5, fix_embedding=True):
            super(AttenBiLSTM, self).__init__()
    
            self.embedding = torch.nn.Embedding(embedding.size(0), embedding.size(1))
            self.embedding.weight = torch.nn.Parameter(embedding)
    
            # whether to freeze the embedding; if not fixed, the embedding is updated during training
            self.embedding.weight.requires_grad = False if fix_embedding else True
            self.embedding_dim = embedding.size(1)   # dimensionality of the word vectors, i.e. the LSTM input_size
            self.hidden_dim = hidden_dim             # hidden-state dimensionality
            self.num_layers = num_layers             # number of LSTM layers
            self.dropout = dropout
            self.bi_lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True, bidirectional=True)
            self.classifier = nn.Sequential(
                nn.Dropout(dropout),
                nn.Linear(hidden_dim*2, hidden_dim),
                nn.Dropout(dropout),
                nn.Linear(hidden_dim, 64),
                nn.Dropout(dropout),
                nn.Linear(64, 32),
                nn.Dropout(dropout),
                nn.Linear(32, 16),
                nn.Dropout(dropout),
                nn.Linear(16, 1),
                nn.Sigmoid()
            )
            self.Q_layer = nn.Sequential(
                nn.Linear(hidden_dim*2, hidden_dim*2),
                nn.ReLU()
            )
            self.K_layer = nn.Sequential(
                nn.Linear(hidden_dim*2, hidden_dim*2),
                nn.ReLU()
            )
            self.V_layer = nn.Sequential(
                nn.Linear(hidden_dim*2, hidden_dim*2),
                nn.ReLU()
            )
    
        def attention(self, q, k, v):
            # q, k, v  (batch_size, seq_len, hidden_size * num_direction)
            d_k = q.size(-1)
            scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(d_k)
    
            attn = nn.functional.softmax(scores, dim=-1)
            context = torch.bmm(attn, v).sum(1)
            return context
    
        def forward(self, inputs):
            inputs = self.embedding(inputs)
    
            # outputs: torch.size([batch_size, seq_len, num_directions * hidden_size])
            # hidden:  torch.size([num_layers * num_directions, batch_size, hidden_size])
            # cn:      torch.size([num_layers * num_directions, batch_size, hidden_size])
    
            outputs, (hidden, cn) = self.bi_lstm(inputs, None)
            # if you use this functional form, be sure to pass training=self.training; it defaults to False, so dropout would otherwise never be active
            # query = nn.functional.dropout(outputs, p=0.5, training=self.training)
            # hidden = hidden.permute(1, 0, 2)
            q = self.Q_layer(outputs)
            k = self.K_layer(outputs)
            v = self.V_layer(outputs)
            atten_out = self.attention(q, k, v)
            return self.classifier(atten_out)
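
    For reference, the attention method above is the standard scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, followed by a sum over the sequence dimension so that each sentence ends up as a single vector. A minimal shape check (the random embedding, vocabulary size, and batch are made up for illustration):

    import torch

    dummy_embedding = torch.randn(1000, 250)
    net = AttenBiLSTM(dummy_embedding, embedding_dim=250, hidden_dim=150, num_layers=1)
    dummy_batch = torch.randint(0, 1000, (8, 20))   # 8 sentences of length 20
    print(net(dummy_batch).shape)                   # torch.Size([8, 1])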

