PyTorch: Fine-tuning a Pre-trained BERT Model for Chinese Text Classification


    My laptop can't handle this, so the code below was run on Google Colab.

    Corpus link: https://pan.baidu.com/s/1YxGGYmeByuAlRdAVov_ZLg
    Extraction code: tzao

    neg.txt and pos.txt each contain 5,000 hotel reviews, one review per line.

    Install the transformers library

    !pip install transformers

    Import packages and set hyperparameters

    import numpy as np
    import random
    import torch
    import matplotlib.pyplot as plt
    from torch.nn.utils import clip_grad_norm_
    from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
    from transformers import BertTokenizer, BertForSequenceClassification, AdamW
    from transformers import get_linear_schedule_with_warmup

    SEED = 123
    BATCH_SIZE = 16
    LEARNING_RATE = 2e-5
    WEIGHT_DECAY = 1e-2
    EPSILON = 1e-8

    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
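
    If a GPU is used, the CUDA generator can be seeded as well so runs stay reproducible on Colab's GPU; a small optional addition (not part of the original code):

    # optional: also seed CUDA so GPU runs are reproducible (assumption, not from the original post)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(SEED)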

    1. Data Preprocessing

    1.1 Reading the files

    def readfile(filename):
        with open(filename, encoding="utf-8") as f:
            content = f.readlines()
            return content

    pos_text, neg_text = readfile('hotel/pos.txt'), readfile('hotel/neg.txt')
    sentences = pos_text + neg_text

    # set the labels: 1 for positive reviews, 0 for negative reviews
    pos_targets = np.ones((len(pos_text)))
    neg_targets = np.zeros((len(neg_text)))
    targets = np.concatenate((pos_targets, neg_targets), axis=0).reshape(-1, 1)   # (10000, 1)
    total_targets = torch.tensor(targets)

    Tip: calling readfile raised UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 0.

    Solution: open the txt files in Notepad++, click Encoding in the toolbar, and convert them to UTF-8.
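
    Alternatively, the fallback can be handled in code; a minimal sketch (the helper name readfile_safe is made up here, and it assumes the raw files use a legacy Chinese encoding such as GBK):

    def readfile_safe(filename):
        # try UTF-8 first, then fall back to GBK, which many Chinese txt files use
        for enc in ("utf-8", "gbk"):
            try:
                with open(filename, encoding=enc) as f:
                    return f.readlines()
            except UnicodeDecodeError:
                continue
        raise ValueError("could not decode {} with utf-8 or gbk".format(filename))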

    1.2 Encoding with BertTokenizer: converting each sentence to token IDs

    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese', cache_dir="E:/transformer_file/")
    print(pos_text[2])
    print(tokenizer.tokenize(pos_text[2]))
    print(tokenizer.encode(pos_text[2]))
    print(tokenizer.convert_ids_to_tokens(tokenizer.encode(pos_text[2])))

    不错,下次还考虑入住。交通也方便,在餐厅吃的也不错。

    ['不', '错', ',', '下', '次', '还', '考', '虑', '入', '住', '。', '交', '通', '也', '方', '便', ',', '在', '餐', '厅', '吃', '的', '也', '不', '错', '。']

    [101, 679, 7231, 8024, 678, 3613, 6820, 5440, 5991, 1057, 857, 511, 769, 6858, 738, 3175, 912, 8024, 1762, 7623, 1324, 1391, 4638, 738, 679, 7231, 511, 102]

    ['[CLS]', '不', '错', ',', '下', '次', '还', '考', '虑', '入', '住', '。', '交', '通', '也', '方', '便', ',', '在', '餐', '厅', '吃', '的', '也', '不', '错', '。', '[SEP]']

    To make every sentence the same length, a little extra processing is needed:

    # convert each sentence to token IDs (truncate above 126 characters, pad below 126;
    # with the [CLS] and [SEP] tokens added, the total length is 128)
    def convert_text_to_token(tokenizer, sentence, limit_size=126):

        tokens = tokenizer.encode(sentence[:limit_size])  # truncate directly
        if len(tokens) < limit_size + 2:                  # pad (the PAD token id is 0)
            tokens.extend([0] * (limit_size + 2 - len(tokens)))
        return tokens

    input_ids = [convert_text_to_token(tokenizer, sen) for sen in sentences]

    input_tokens = torch.tensor(input_ids)
    print(input_tokens.shape)                    # torch.Size([10000, 128])

    1.3 Building attention_masks: within a text, PAD positions get 0 and every other position gets 1

    # build the masks
    def attention_masks(input_ids):
        atten_masks = []
        for seq in input_ids:
            seq_mask = [float(i > 0) for i in seq]
            atten_masks.append(seq_mask)
        return atten_masks

    atten_masks = attention_masks(input_ids)
    attention_tokens = torch.tensor(atten_masks)

    The input_ids and atten_masks built here serve the same purpose as the input_ids and attention_mask returned by the .encode_plus function mentioned in the previous post. input_type_ids (i.e. token_type_ids) is irrelevant for this task; it is used for tasks where each training example contains two sentences, such as question answering.
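
    For comparison, a minimal sketch of the encode_plus route (argument names vary across transformers versions; older releases use pad_to_max_length=True instead of padding='max_length'):

    enc = tokenizer.encode_plus(pos_text[2], max_length=128, padding='max_length', truncation=True)
    print(enc['input_ids'])        # same role as the manually built input_ids
    print(enc['attention_mask'])   # same role as the manually built atten_masks
    print(enc['token_type_ids'])   # all zeros here, since each example is a single sentence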

    1.4 Splitting into training and test sets

    The random_state and test_size arguments of the two split calls must match; otherwise train_inputs and train_masks would no longer correspond one-to-one.

    from sklearn.model_selection import train_test_split
    train_inputs, test_inputs, train_labels, test_labels = train_test_split(input_tokens, total_targets, random_state=666, test_size=0.2)
    train_masks, test_masks, _, _ = train_test_split(attention_tokens, input_tokens, random_state=666, test_size=0.2)
    print(train_inputs.shape, test_inputs.shape)      # torch.Size([8000, 128]) torch.Size([2000, 128])
    print(train_masks.shape)                          # torch.Size([8000, 128]), the same shape as train_inputs

    print(train_inputs[0])
    print(train_masks[0])

    tensor([ 101, 2769, 6370, 4638, 3221, 10189, 1039, 4638, 117, 852, 2769, 6230, 2533, 8821, 1039, 4638, 7599, 3419, 3291, 1962, 671, 763, 117, 3300, 671, 2476, 1377, 809, 1288, 1309, 4638, 3763, 1355, 119, 2456, 6379, 1920, 2157, 6370, 3249, 6858, 7313, 106, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

    tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
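
    An alternative that does not depend on repeating the same random_state is to split a single index array and slice inputs, masks and labels with it; a minimal sketch (variable names with the _v2 suffix are made up here):

    indices = np.arange(len(input_tokens))
    train_idx, test_idx = train_test_split(indices, random_state=666, test_size=0.2)
    train_idx, test_idx = torch.as_tensor(train_idx), torch.as_tensor(test_idx)
    # the same index split is applied to every tensor, so alignment is guaranteed
    train_inputs_v2, test_inputs_v2 = input_tokens[train_idx], input_tokens[test_idx]
    train_masks_v2, test_masks_v2 = attention_tokens[train_idx], attention_tokens[test_idx]
    train_labels_v2, test_labels_v2 = total_targets[train_idx], total_targets[test_idx]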

    1.5 Creating DataLoaders to fetch one batch of data at a time

    TensorDataset packs tensors together, much like Python's zip. It indexes along each tensor's first dimension, so every tensor passed in must have the same size in that dimension, and every argument must be a tensor.

    RandomSampler samples the dataset in random order.

    SequentialSampler samples the dataset in sequential order.
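
    A tiny illustration of the zip-like behavior (toy tensors, for illustration only):

    toy = TensorDataset(torch.tensor([[1, 2], [3, 4], [5, 6]]), torch.tensor([0, 1, 0]))
    print(len(toy))    # 3, indexed along the first dimension
    print(toy[0])      # (tensor([1, 2]), tensor(0)), one slice from each packed tensor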

    train_data = TensorDataset(train_inputs, train_masks, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

    test_data = TensorDataset(test_inputs, test_masks, test_labels)
    test_sampler = SequentialSampler(test_data)
    test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)

    Take a look at the contents of train_dataloader:

    for i, (train, mask, label) in enumerate(train_dataloader):
        print(train.shape, mask.shape, label.shape)               # torch.Size([16, 128]) torch.Size([16, 128]) torch.Size([16, 1])
        break
    print('len(train_dataloader)=', len(train_dataloader))        # 500

    2. Creating the Model and Optimizer

    Create the model

    model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels = 2)     # num_labels = 2: positive and negative reviews
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    Define the optimizer

    The eps argument is a term added to the denominator to improve numerical stability (default: 1e-8).

    optimizer = AdamW(model.parameters(), lr = LEARNING_RATE, eps = EPSILON)

    A more general form: no weight decay is applied to bias and LayerNorm.weight parameters.

    no_decay = ['bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
            {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': WEIGHT_DECAY},
            {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr = LEARNING_RATE, eps = EPSILON)
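
    As a quick sanity check (optional, not in the original), one can count how many parameter tensors land in each group:

    decay_names = [n for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)]
    no_decay_names = [n for n, p in model.named_parameters() if any(nd in n for nd in no_decay)]
    print(len(decay_names), len(no_decay_names))   # every bias and LayerNorm weight should fall in the second group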

    Learning rate warm-up: training starts from a small learning rate.

    epochs = 2
    # number of training steps: [number of batches] x [number of epochs]
    total_steps = len(train_dataloader) * epochs

    # set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = total_steps)
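
    With num_warmup_steps=0 there is no actual warm-up phase: the schedule is a straight linear decay from LEARNING_RATE down to 0 over total_steps. A rough sketch that reproduces the schedule by hand and plots it (computed analytically, so the real scheduler is left untouched):

    # linear decay with zero warm-up: lr(step) = LEARNING_RATE * (1 - step / total_steps)
    lrs = [LEARNING_RATE * (1 - step / total_steps) for step in range(total_steps)]
    plt.plot(lrs)
    plt.xlabel('step')
    plt.ylabel('learning rate')
    plt.show()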

    3. Training and Evaluating the Model

    3.1 Model accuracy

    def binary_acc(preds, labels):      # preds.shape=(16, 2)  labels.shape=torch.Size([16, 1])
        correct = torch.eq(torch.max(preds, dim=1)[1], labels.flatten()).float()      # both arguments to eq have shape torch.Size([16])
        acc = correct.sum().item() / len(correct)
        return acc
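
    A quick check with toy logits and labels (made-up values, for illustration only):

    toy_preds = torch.tensor([[0.2, 1.3], [2.0, -1.0]])   # predicted classes: 1 and 0
    toy_labels = torch.tensor([[1], [1]])
    print(binary_acc(toy_preds, toy_labels))               # 0.5 -- only the first prediction matches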

    3.2 Measuring running time

    import time
    import datetime
    def format_time(elapsed):
        elapsed_rounded = int(round((elapsed)))
        return str(datetime.timedelta(seconds=elapsed_rounded))   # return the time as hh:mm:ss

    3.3 Training the model

    • The inputs passed to the model must be tensors;
    • nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=2) clips gradients to keep them from exploding during training (see the small demo after this list);

    Its arguments are (model parameters, maximum gradient norm, norm type=2); the L2 norm is the default;

    Tip: this is only applied during training, not during testing;
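
    A tiny standalone demo of what the clipping does (toy parameter, made up for illustration):

    w = torch.nn.Parameter(torch.tensor([3.0, 4.0]))
    w.grad = torch.tensor([30.0, 40.0])            # gradient norm is 50
    clip_grad_norm_([w], max_norm=1.0)
    print(w.grad)                                  # rescaled to norm 1.0: tensor([0.6000, 0.8000])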

    def train(model, optimizer):
        t0 = time.time()
        avg_loss, avg_acc = [], []

        model.train()
        for step, batch in enumerate(train_dataloader):

            # every 40 batches, print the elapsed time
            if step % 40 == 0 and not step == 0:
                elapsed = format_time(time.time() - t0)
                print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

            b_input_ids, b_input_mask, b_labels = batch[0].long().to(device), batch[1].long().to(device), batch[2].long().to(device)

            output = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
            loss, logits = output[0], output[1]

            avg_loss.append(loss.item())

            acc = binary_acc(logits, b_labels)
            avg_acc.append(acc)

            optimizer.zero_grad()
            loss.backward()
            clip_grad_norm_(model.parameters(), 1.0)      # rescale gradients whose norm exceeds 1.0, to prevent exploding gradients
            optimizer.step()              # update the model parameters
            scheduler.step()              # update the learning rate

        avg_acc = np.array(avg_acc).mean()
        avg_loss = np.array(avg_loss).mean()
        return avg_loss, avg_acc

    Here output is a tuple: element 0 is the loss, and element 1 holds, for each example in the batch, the logits (raw scores) for the negative and positive classes:

    (tensor(0.0210, device='cuda:0', grad_fn=<NllLossBackward>), 
    tensor([[-2.9815,  2.6931],
            [-3.2380,  3.1935],
            [-3.0775,  3.0713],
            [ 3.0191, -2.3689],
            [ 3.1146, -2.7957],
            [ 3.7798, -2.7410],
            [-0.3273,  0.8227],
            [ 2.5012, -1.5535],
            [-3.0231,  3.0162],
            [ 3.4146, -2.5582],
            [ 3.3104, -2.2134],
            [ 3.3776, -2.5190],
            [-2.6513,  2.5108],
            [-3.3691,  2.9516],
            [ 3.2397, -2.0473],
            [-2.8622,  2.7395]], device='cuda:0', grad_fn=<AddmmBackward>))
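
    These values are raw logits rather than probabilities; if probabilities are wanted, a softmax can be applied inside train() right after the logits are obtained, for example:

    probs = torch.softmax(logits, dim=1)   # each row now sums to 1; column 1 is the positive-review probability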

    3.4 Evaluating the model

    No labels are passed to the model here.

    def evaluate(model):
        avg_acc = []
        model.eval()         # switch to evaluation mode

        with torch.no_grad():
            for batch in test_dataloader:
                b_input_ids, b_input_mask, b_labels = batch[0].long().to(device), batch[1].long().to(device), batch[2].long().to(device)

                output = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

                acc = binary_acc(output[0], b_labels)
                avg_acc.append(acc)
        avg_acc = np.array(avg_acc).mean()
        return avg_acc

    Here output is a tuple whose element 0 holds, for each example in the batch, the logits for the negative and positive classes:

    (tensor([[ 3.8217, -2.7516],
            [ 2.7585, -2.0853],
            [-2.9317,  2.9092],
            [-3.3724,  3.2597],
            [-2.8692,  2.6741],
            [-3.2784,  2.9276],
            [ 3.4946, -2.8895],
            [ 3.7855, -2.8623],
            [-2.2249,  2.4336],
            [-2.4257,  2.4606],
            [ 3.3996, -2.5760],
            [-3.1986,  3.0841],
            [ 3.6883, -2.9492],
            [ 3.2883, -2.3600],
            [ 2.6723, -2.0778],
            [-3.1868,  3.1106]], device='cuda:0'),)

    3.5 Running training and evaluation

    for epoch in range(epochs):

        train_loss, train_acc = train(model, optimizer)
        print('epoch={},训练准确率={},损失={}'.format(epoch, train_acc, train_loss))
        test_acc = evaluate(model)
        print("epoch={},测试准确率={}".format(epoch, test_acc))

    The output is as follows:

      Batch    40  of    500.    Elapsed: 0:00:14.
      Batch    80  of    500.    Elapsed: 0:00:28.
      Batch   120  of    500.    Elapsed: 0:00:42.
      Batch   160  of    500.    Elapsed: 0:00:57.
      Batch   200  of    500.    Elapsed: 0:01:12.
      Batch   240  of    500.    Elapsed: 0:01:26.
      Batch   280  of    500.    Elapsed: 0:01:41.
      Batch   320  of    500.    Elapsed: 0:01:56.
      Batch   360  of    500.    Elapsed: 0:02:11.
      Batch   400  of    500.    Elapsed: 0:02:26.
      Batch   440  of    500.    Elapsed: 0:02:42.
      Batch   480  of    500.    Elapsed: 0:02:57.
    epoch=0,训练准确率=0.9015,损失=0.2549531048182398
    epoch=0,测试准确率=0.9285
      Batch    40  of    500.    Elapsed: 0:00:16.
      Batch    80  of    500.    Elapsed: 0:00:31.
      Batch   120  of    500.    Elapsed: 0:00:47.
      Batch   160  of    500.    Elapsed: 0:01:03.
      Batch   200  of    500.    Elapsed: 0:01:18.
      Batch   240  of    500.    Elapsed: 0:01:34.
      Batch   280  of    500.    Elapsed: 0:01:50.
      Batch   320  of    500.    Elapsed: 0:02:06.
      Batch   360  of    500.    Elapsed: 0:02:22.
      Batch   400  of    500.    Elapsed: 0:02:37.
      Batch   440  of    500.    Elapsed: 0:02:53.
      Batch   480  of    500.    Elapsed: 0:03:09.
    epoch=1,训练准确率=0.9595,损失=0.14357946291333065
    epoch=1,测试准确率=0.939

    4. Prediction

    def predict(sen):

        input_id = convert_text_to_token(tokenizer, sen)
        input_token = torch.tensor(input_id).long().to(device)             # torch.Size([128])

        atten_mask = [float(i > 0) for i in input_id]
        attention_token = torch.tensor(atten_mask).long().to(device)       # torch.Size([128])

        output = model(input_token.view(1, -1), token_type_ids=None, attention_mask=attention_token.view(1, -1))     # reshape torch.Size([128]) -> torch.Size([1, 128]), otherwise the model raises an error
        print(output[0])

        return torch.max(output[0], dim=1)[1]

    label = predict('酒店位置难找,环境不太好,隔音差,下次不会再来的。')
    print('好评' if label==1 else '差评')

    label = predict('酒店还可以,接待人员很热情,卫生合格,空间也比较大,不足的地方就是没有窗户')
    print('好评' if label==1 else '差评')

    label = predict('"服务各方面没有不周到的地方, 各方面没有没想到的细节"')
    print('好评' if label==1 else '差评')

    tensor([[ 3.5719, -2.7315]], device='cuda:0', grad_fn=<AddmmBackward>)

    差评

    tensor([[-2.7998, 2.8675]], device='cuda:0', grad_fn=<AddmmBackward>)

    好评

    tensor([[-1.9614, 1.5925]], device='cuda:0', grad_fn=<AddmmBackward>)

    好评

    Performance is decent; even a slightly odd sentence like the third one is now classified correctly.
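
    One optional refinement (not in the original post): the grad_fn in the printed outputs shows that predict still builds a computation graph. Switching to evaluation mode and disabling gradients avoids the extra bookkeeping:

    model.eval()
    with torch.no_grad():
        label = predict('酒店位置难找,环境不太好,隔音差,下次不会再来的。')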

    Reference: https://blog.csdn.net/Code_Tookie/article/details/104944888?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param
