Part 7: Fine-tuning a pretrained model on different tasks


    How to fine-tune a model on common downstream tasks

    This post overlaps a bit with the earlier one on fine-tuning a model, but it works through more examples.

    This tutorial shows you how to fine-tune a model on common downstream tasks. You will use the datasets library to quickly load and preprocess the datasets so they are ready for training.

    This article walks through fine-tuning a model on three datasets:

    • Sequence classification on IMDb reviews (seq_imdb)
    • Token classification with WNUT NER (tok_ner)
    • Question answering with SQuAD (qa_squad)

    Sequence classification on the IMDb reviews dataset

    Sequence classification assigns a text sequence to one of a fixed set of classes. In this example, you will fine-tune a model on the IMDb dataset so it can tell whether a review is positive or negative.

    Load the IMDb dataset

    The datasets library makes loading a dataset very easy:

    from datasets import load_dataset
    
    imdb = load_dataset("imdb")
    

    This gives you a DatasetDict object, which you can index into to look at an example:

    imdb["train"][0]
    
    '''
    output:
    {
        "label": 1,
        "text": "Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as \"Teachers\". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is \"Teachers\". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!",
    }
    '''
    

    Preprocessing

    The next step is to tokenize the text into a format the model can understand. It is important to use the same tokenizer the model was trained with, so that the text is split in exactly the same way. Load the DistilBERT tokenizer with AutoTokenizer, since we will eventually train a classifier on top of the pretrained DistilBERT model.

    from transformers import AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    

    Once you have instantiated a tokenizer, create a function that tokenizes the text. You should also apply a truncation strategy so that longer sequences in the text never exceed the model's maximum input length.

    def preprocess_function(examples):
        return tokenizer(examples["text"], truncation=True)
    

    Use the map function from datasets to apply the preprocessing function to the whole dataset. Setting batched=True applies it to several elements at once:

    tokenized_imdb = imdb.map(preprocess_function, batched=True)
    

    Finally, pad your texts so they have a uniform length. Although you could pad in the tokenizer function by setting padding=True, it is more efficient to pad each batch only to the length of its longest sequence. This is known as dynamic padding, and DataCollatorWithPadding takes care of it:

    from transformers import DataCollatorWithPadding
    
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
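
    As a quick sanity check (a sketch that is not part of the original tutorial), you can collate a few of the tokenized examples from above and confirm that the batch is padded only to the longest sequence it contains, not to the model's maximum input length:

    # Sketch: dynamic padding in action. The keys below exist in tokenized_imdb;
    # "text" is left out because the collator only needs tensor-convertible fields.
    features = [
        {k: tokenized_imdb["train"][i][k] for k in ("input_ids", "attention_mask", "label")}
        for i in range(4)
    ]
    batch = data_collator(features)
    print(batch["input_ids"].shape)  # (4, longest sequence length among these 4 examples)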
    

    Fine-tuning with the Trainer API

    Now load the model with AutoModelForSequenceClassification, specifying the number of labels:

    from transformers import AutoModelForSequenceClassification
    
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
    

    From here, only three steps remain:

    • Define your training hyperparameters in TrainingArguments
    • Pass the training arguments to Trainer, along with the model, dataset, tokenizer, and data collator
    • Call Trainer.train() to fine-tune the model

    from transformers import TrainingArguments, Trainer
    
    training_args = TrainingArguments(
        output_dir="./results",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=5,
        weight_decay=0.01,
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_imdb["train"],
        eval_dataset=tokenized_imdb["test"],
        tokenizer=tokenizer,
        data_collator=data_collator,
    )
    
    trainer.train()
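
    After training, a minimal inference sketch (not part of the original tutorial) wraps the fine-tuned model in a text-classification pipeline. Because no id2label mapping was passed to the model, predictions come back as LABEL_0 (negative) and LABEL_1 (positive):

    from transformers import pipeline
    
    # Sketch: quick predictions with the freshly fine-tuned model and its tokenizer.
    classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
    print(classifier("This movie was an absolute delight from start to finish."))
    # e.g. [{'label': 'LABEL_1', 'score': ...}]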
    

    Token classification with WNUT

    Token classification assigns a label to each token in a sentence. One of the most common token classification tasks is named entity recognition (NER), which tries to find a label, such as person, location, or organization, for every token in a sentence. In this example, you will fine-tune a model on the WNUT 17 dataset to detect emerging entities.

    Load the WNUT 17 dataset

    Load the WNUT 17 dataset from the datasets library:

    from datasets import load_dataset
    
    wnut = load_dataset("wnut_17")
    

    A quick look at the dataset shows the tag assigned to each word in the sentence:

    wnut["train"][0]
    
    '''
    output:
    {'id': '0',
     'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
     'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
    }
    '''
    

    Take a look at the NER label names:

    label_list = wnut["train"].features[f"ner_tags"].feature.names
    label_list
    
    '''
    output:
    [
        "O",
        "B-corporation",
        "I-corporation",
        "B-creative-work",
        "I-creative-work",
        "B-group",
        "I-group",
        "B-location",
        "I-location",
        "B-person",
        "I-person",
        "B-product",
        "I-product",
    ]
    '''
    

    (The original article does not explain where that line comes from, so here is a quick look at the dataset features that produce it:)

    wnut["train"]
    
    '''
    output:
    Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3394
    })
    '''
    
    wnut['train'].features
    
    '''
    output:
    {'id': Value(dtype='string', id=None),
     'ner_tags': Sequence(feature=ClassLabel(num_classes=13, names=['O', 'B-corporation', 'I-corporation', 'B-creative-work', 'I-creative-work', 'B-group', 'I-group', 'B-location', 'I-location', 'B-person', 'I-person', 'B-product', 'I-product'], names_file=None, id=None), length=-1, id=None),
     'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
    '''
    
    wnut["train"].features['ner_tags']
    
    '''
    output:
    Sequence(feature=ClassLabel(num_classes=13, names=['O', 'B-corporation', 'I-corporation', 'B-creative-work', 'I-creative-work', 'B-group', 'I-group', 'B-location', 'I-location', 'B-person', 'I-person', 'B-product', 'I-product'], names_file=None, id=None), length=-1, id=None)
    '''
    

    The letter prefix in front of each NER label means the following:

    • B- marks the beginning of an entity
    • I- marks a token contained inside the same entity (for example, "State" is part of the "Empire State Building" entity)
    • O means the token does not belong to any entity
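
    As a small illustration (not in the original article), you can translate the integer tags of the example above back into these names with label_list:

    # Sketch: map the ner_tags of the first training example to readable labels.
    print([label_list[tag] for tag in wnut["train"][0]["ner_tags"]])
    # mostly 'O', with 'B-location', 'I-location', 'I-location' for "Empire State Building"
    # and another 'B-location' for "ESB"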

    Preprocessing

    Now you need to tokenize the text. Load the DistilBERT tokenizer with AutoTokenizer:

    from transformers import AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    

    Since the input has already been split into words, set is_split_into_words=True to tokenize the words into sub-words:

    example = wnut["train"][0]  # the first training example shown above
    tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
    tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
    tokens
    
    '''
    output:
    ['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
    '''
    

    The extra special tokens [CLS] and [SEP] and the sub-word splitting create a mismatch between the input and the labels. Realign the labels and tokens as follows:

    1. Map every token to its corresponding word with the word_ids method
    2. Assign the label -100 to the special tokens ([CLS] and [SEP]) so that the PyTorch loss function ignores them
    3. Only label the first token of a given word; assign -100 to the other sub-words from the same word

    Here is a function that aligns the tokens and labels:

    def tokenize_and_align_labels(examples):
        tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    
        labels = []
        for i, label in enumerate(examples[f"ner_tags"]):
            word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
            previous_word_idx = None
            label_ids = []
            for word_idx in word_ids:  # Set the special tokens to -100.
                if word_idx is None:
                    label_ids.append(-100)
                elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                    label_ids.append(label[word_idx])
                else:
                    label_ids.append(-100)
                previous_word_idx = word_idx
            labels.append(label_ids)
    
        tokenized_inputs["labels"] = labels
        return tokenized_inputs
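
    To see what this produces (a quick check, not in the original article), run it on the first training example and pair each token with its aligned label; the special tokens and the non-first sub-words all receive -100:

    # Sketch: inspect the aligned labels for a single example.
    aligned = tokenize_and_align_labels(wnut["train"][:1])
    tokens = tokenizer.convert_ids_to_tokens(aligned["input_ids"][0])
    print(list(zip(tokens, aligned["labels"][0])))
    # [('[CLS]', -100), ('@', 0), ('paul', -100), ('##walk', -100), ('it', 0), ...]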
    

    Now tokenize the whole dataset and align its labels with the map function from datasets:

    tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
    

    Finally, pad your texts and labels so they have a uniform length:

    from transformers import DataCollatorForTokenClassification
    
    data_collator = DataCollatorForTokenClassification(tokenizer)
    

    Fine-tuning with the Trainer API

    Now load the model with AutoModelForTokenClassification, specifying the number of labels:

    from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
    
    model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
    

    Gather your training hyperparameters in TrainingArguments:

    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
    )
    

    Pass the training arguments to Trainer, along with the model, dataset, tokenizer, and data collator:

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_wnut["train"],
        eval_dataset=tokenized_wnut["test"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    

    Fine-tune the model:

    trainer.train()
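
    A minimal inference sketch (not part of the original tutorial): because the model was created with num_labels only, it predicts class indices, and label_list gives the readable names in the same order:

    import torch
    
    # Sketch: tag a new sentence with the fine-tuned model.
    model.eval()
    text = "The Empire State Building is in New York"
    encoding = tokenizer(text, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
    with torch.no_grad():
        logits = model(**encoding.to(model.device)).logits
    predictions = logits.argmax(dim=-1)[0].tolist()
    print([(tok, label_list[p]) for tok, p in zip(tokens, predictions)])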
    

    Question answering with SQuAD

    There are several kinds of question answering. Extractive question answering focuses on identifying the answer to a given question within a passage of text. In this example, you will fine-tune a model on the SQuAD dataset.

    Load the SQuAD dataset

    Load the SQuAD dataset from the datasets library:

    from datasets import load_dataset
    
    squad = load_dataset("squad")
    

    Take a look at an example from the dataset:

    squad["train"][0]
    
    '''
    output:
    {'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
     'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
     'id': '5733be284776f41900661182',
     'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
     'title': 'University_of_Notre_Dame'
    }
    '''
    

    Preprocessing

    Load the DistilBERT tokenizer with AutoTokenizer:

    from transformers import AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    

    There are a few things to keep in mind when preprocessing text for question answering:

    1. Some examples in the dataset have a context that exceeds the model's maximum input length. Handle this by truncating only the context, with truncation="only_second".
    2. Next, map the start and end character positions of the answer back to the original context by setting return_offsets_mapping=True.
    3. With the offset mapping available, you can find the start and end tokens of the answer. Use the sequence_ids method to find which part of the offsets corresponds to the question and which part corresponds to the context.

    Here is a preprocessing function that puts all of this together:
    def preprocess_function(examples):
        questions = [q.strip() for q in examples["question"]]
        inputs = tokenizer(
            questions,
            examples["context"],
            max_length=384,
            truncation="only_second",
            return_offsets_mapping=True,
            padding="max_length",
        )
    
        offset_mapping = inputs.pop("offset_mapping")
        answers = examples["answers"]
        start_positions = []
        end_positions = []
    
        for i, offset in enumerate(offset_mapping):
            answer = answers[i]
            start_char = answer["answer_start"][0]
            end_char = answer["answer_start"][0] + len(answer["text"][0])
            sequence_ids = inputs.sequence_ids(i)
    
            # Find the start and end of the context
            idx = 0
            while sequence_ids[idx] != 1:
                idx += 1
            context_start = idx
            while sequence_ids[idx] == 1:
                idx += 1
            context_end = idx - 1
    
            # If the answer is not fully inside the context, label it (0, 0)
            if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
                start_positions.append(0)
                end_positions.append(0)
            else:
                # Otherwise it's the start and end token positions
                idx = context_start
                while idx <= context_end and offset[idx][0] <= start_char:
                    idx += 1
                start_positions.append(idx - 1)
    
                idx = context_end
                while idx >= context_start and offset[idx][1] >= end_char:
                    idx -= 1
                end_positions.append(idx + 1)
    
        inputs["start_positions"] = start_positions
        inputs["end_positions"] = end_positions
        return inputs
    

    Apply the preprocessing function to the whole dataset with the map function:

    tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
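
    A quick check (a sketch, not from the original post) shows that the processed split now carries only the model inputs plus the answer positions added by the preprocessing function:

    print(tokenized_squad["train"].column_names)
    # e.g. ['input_ids', 'attention_mask', 'start_positions', 'end_positions']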
    

    Batch the processed examples together. Because the examples were already padded to max_length above, the default collator, which simply stacks the features into tensors, is enough:

    from transformers import default_data_collator
    
    data_collator = default_data_collator
    

    Fine-tuning with the Trainer API

    Load the model with AutoModelForQuestionAnswering:

    from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
    
    model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
    

    Gather your training arguments in TrainingArguments:

    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
    )
    

    Pass the training arguments to Trainer, along with the model, dataset, tokenizer, and data collator:

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_squad["train"],
        eval_dataset=tokenized_squad["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    

    Fine-tune your model:

    trainer.train()
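
    Once training finishes, a minimal inference sketch (not from the original post) can use the question-answering pipeline, which handles tokenization and answer-span decoding for you:

    from transformers import pipeline
    
    # Sketch: ask the fine-tuned model a question about a context passage.
    qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
    result = qa(
        question="To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
        context=squad["train"][0]["context"],
    )
    print(result)  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'Saint Bernadette Soubirous'}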
    
    Original post: https://www.cnblogs.com/miraclepbc/p/15890427.html