Part 3: Preprocessing

Preprocessing

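Before you can feed your data to a model, it has to be processed into a form the model accepts. A model cannot work with raw text, images, or audio directly; these inputs must be converted to numbers and assembled into tensors. This tutorial covers only text data, which is processed with a tokenizer.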

NLP

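The main tool for processing text data is the tokenizer. A tokenizer first splits the text into tokens according to a set of rules. The tokens are then converted to numbers, which are used to build the tensors fed to the model. The tokenizer also adds any additional inputs the model requires.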
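
To make these steps concrete, here is a minimal sketch in plain Python. It is not how the Transformers tokenizers are implemented: the whitespace split, the tiny TOY_VOCAB dictionary, and the toy_encode function are all hypothetical and exist only for illustration. Real tokenizers typically use subword algorithms and a much larger vocabulary.

```python
# Toy illustration only: TOY_VOCAB and toy_encode are made-up names,
# not part of the Transformers library.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "do": 4, "not": 5, "meddle": 6}


def toy_encode(text):
    # Step 1: split the text into tokens (here: lowercase + whitespace split).
    tokens = text.lower().split()
    # Step 2: add the special tokens this hypothetical model expects.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: map every token to its id; unknown tokens fall back to [UNK].
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    return tokens, input_ids


tokens, input_ids = toy_encode("Do not meddle here")
print(tokens)     # ['[CLS]', 'do', 'not', 'meddle', 'here', '[SEP]']
print(input_ids)  # [2, 4, 5, 6, 1, 3]
```

The word "here" is missing from the toy vocabulary, so it maps to the [UNK] id. This is one reason the pretrained tokenizer and the pretrained model have to share the same vocabulary.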

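If you plan to use a pretrained model, it is important to use the pretrained tokenizer that goes with it. This ensures the text is split the same way as the pretraining corpus and uses the same vocabulary (the token-to-id mapping) as during pretraining.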

Tokenization

Load a pretrained tokenizer with AutoTokenizer.from_pretrained().

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Then pass your sentence to the tokenizer.

encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)

'''
output:
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
'''

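The tokenizer returns a dictionary with three important entries:

input_ids: the numeric id of each token in the sentence
attention_mask: whether each token should be attended to (1) or ignored (0)
token_type_ids: which sequence each token belongs to when there is more than one sequence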
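
The sketch below shows how the three entries line up for a pair of sequences. It works on ids that have already been computed and assumes a BERT-style layout ([CLS] A [SEP] B [SEP]). The helper build_pair_inputs is hypothetical, and the default ids 101 and 102 are assumed to be the [CLS] and [SEP] ids, matching the first and last ids in the output above.

```python
# Illustrative sketch: build_pair_inputs is a hypothetical helper, not a
# Transformers API. It assumes a BERT-style layout: [CLS] A [SEP] B [SEP].
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sequence A and its [SEP]; segment 1 covers
    # sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Nothing is padded yet, so every position should be attended to.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


pair = build_pair_inputs([7, 8, 9], [10, 11])
print(pair["input_ids"])       # [101, 7, 8, 9, 102, 10, 11, 102]
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1]
print(pair["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 1]
```

With a single sequence, as in the example above, token_type_ids is all zeros.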

You can decode the input_ids to recover the original input:

tokenizer.decode(encoded_input["input_ids"])

'''
output:
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
'''

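As you can see, the tokenizer added two special tokens: [CLS] and [SEP]. Not all models need special tokens, but when they do, the tokenizer adds them automatically.

If you want to process several sentences, pass them to the tokenizer as a list: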

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

'''
output:
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1]]}
'''

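Padding

When you process a batch of sentences, they are not always the same length. This is a problem because the tensors fed to the model must have a uniform shape. Padding is a strategy that makes the tensor rectangular by adding a special padding token to the shorter sentences.

Set the padding parameter to True to pad the shorter sequences in the batch to match the longest one.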

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)
print(encoded_input)

'''
output:
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
'''

Notice that the tokenizer padded the first and third sentences with 0 because they are shorter.
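
The effect of padding=True can be sketched on plain Python lists. The helper pad_batch below is hypothetical and assumes right-side padding with a pad id of 0, which matches the output above; other tokenizers may use a different pad id or pad on the left.

```python
# Illustrative sketch: pad_batch is a hypothetical helper that mimics the
# effect of padding=True on plain Python lists.
def pad_batch(batch_ids, pad_id=0):
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        # Append pad ids on the right so every row has the same length.
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 1252, 1184, 1164, 1248, 6462, 136, 102],
    [101, 1327, 1164, 5450, 23434, 136, 102],
]
padded = pad_batch(batch)
print(padded["input_ids"][1])       # [101, 1327, 1164, 5450, 23434, 136, 102, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 1, 0]
```

The attention mask is what lets the model tell real tokens apart from padding, which is why it is 0 wherever a pad id was added.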

Truncation

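On the other hand, a sequence is sometimes too long for the model to handle. In that case, you need to truncate it to a shorter length.

Set the truncation parameter to True to truncate a sequence to the maximum length the model accepts.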

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
print(encoded_input)

'''
output:
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
'''
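
All three sentences are far shorter than the model's limit, so the output above is identical to the padded output in the previous section. To see what truncation does to a longer input, here is a minimal sketch on a plain list of ids. The helper truncate_single is hypothetical; it assumes the sequence starts with [CLS] and ends with [SEP], that both must be kept, and that tokens are dropped from the right.

```python
# Illustrative sketch: truncate_single is a hypothetical helper, not a
# Transformers API. It assumes the sequence starts with [CLS] and ends
# with [SEP], and that both special tokens must survive truncation.
def truncate_single(input_ids, max_length):
    if len(input_ids) <= max_length:
        return input_ids  # already short enough, nothing to do
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    # Keep [CLS] plus the first (max_length - 2) content tokens, then [SEP].
    return input_ids[: max_length - 1] + [input_ids[-1]]


ids = [101, 1790, 112, 189, 1341, 1119, 3520, 1164,
       1248, 6462, 117, 21902, 1643, 119, 102]
print(truncate_single(ids, 8))          # [101, 1790, 112, 189, 1341, 1119, 3520, 102]
print(truncate_single(ids, 20) == ids)  # True
```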

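Build tensors

Finally, you want the tokenizer to return the actual tensors that are fed to the model. Set the return_tensors parameter to "pt" to get PyTorch tensors.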

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
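encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)

'''
output:
{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
'''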

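Everything you always wanted to know about padding and truncation

What has been covered so far is enough for most cases, but the API offers more strategies. The three parameters you need to know are padding, truncation, and max_length.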

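padding controls padding. It can be a boolean or a string:
- True or 'longest' pads each sentence to the length of the longest sentence in the batch (no padding is applied if you provide only a single sentence).
- 'max_length' pads each sentence to the length given by the max_length parameter, or to the maximum length the model accepts if max_length is not provided. Padding is still applied if you provide only a single sentence.
- False or 'do_not_pad' applies no padding. This is the default behavior.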
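
The padding rules above can be summarized as a small function that returns the target length. This is only a sketch: resolve_pad_length is a hypothetical name, and the model_max_length default of 512 is an assumption standing in for whatever maximum length the model accepts.

```python
# Illustrative sketch: resolve_pad_length is a hypothetical helper that
# mirrors the padding rules listed above; it is not a Transformers API.
def resolve_pad_length(lengths, padding, max_length=None, model_max_length=512):
    """Return the length every sequence is padded to, or None for no padding."""
    if padding is True or padding == "longest":
        return max(lengths)
    if padding == "max_length":
        # Fall back to the model's limit when max_length is not given.
        return max_length if max_length is not None else model_max_length
    if padding is False or padding == "do_not_pad":
        return None
    raise ValueError(f"unknown padding strategy: {padding!r}")


lengths = [8, 15, 7]
print(resolve_pad_length(lengths, True))                         # 15
print(resolve_pad_length(lengths, "max_length", max_length=32))  # 32
print(resolve_pad_length(lengths, "max_length"))                 # 512
print(resolve_pad_length(lengths, False))                        # None
```

With a single sentence, 'longest' resolves to that sentence's own length, which is why no padding is added in that case, while 'max_length' still pads.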
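truncation controls truncation. It can be a boolean or a string:
- True or 'longest_first' truncates to the maximum length given by the max_length parameter, or to the maximum length the model accepts if max_length is not set. Truncation proceeds token by token, removing one token at a time from the longest sequence in the pair until the total length fits.
- 'only_second' truncates to the maximum length given by the max_length parameter, or to the maximum length the model accepts if max_length is not set. If a pair of texts is provided, only the second text is truncated.
- 'only_first' truncates to the maximum length given by the max_length parameter, or to the maximum length the model accepts if max_length is not set. If a pair of texts is provided, only the first text is truncated.
- False or 'do_not_truncate' applies no truncation. This is the default behavior.
max_length controls the length used for padding or truncation. It can be an integer or None; when it is None, it defaults to the maximum length the model accepts.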
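
The difference between the three truncation strategies is easiest to see on a sequence pair. The sketch below works on plain token lists and leaves out special tokens to stay short. The helper truncate_pair is hypothetical, and its tie-breaking rule (the second sequence loses a token when both are equally long) is an arbitrary choice made for this example.

```python
# Illustrative sketch: truncate_pair is a hypothetical helper that mirrors
# the truncation strategies described above on plain token lists. Special
# tokens are ignored here to keep the example short.
def truncate_pair(first, second, max_length, strategy="longest_first"):
    first, second = list(first), list(second)
    while len(first) + len(second) > max_length:
        if strategy == "longest_first":
            # Drop one token from whichever sequence is currently longer;
            # on a tie the second sequence is shortened (arbitrary choice).
            target = first if len(first) > len(second) else second
        elif strategy == "only_first":
            target = first
        elif strategy == "only_second":
            target = second
        else:
            raise ValueError(f"unknown truncation strategy: {strategy!r}")
        if not target:
            raise ValueError("cannot reach max_length with this strategy")
        target.pop()  # remove the last token
    return first, second


a = ["a1", "a2", "a3", "a4", "a5"]
b = ["b1", "b2", "b3", "b4"]
print(truncate_pair(a, b, 6, "longest_first"))
# (['a1', 'a2', 'a3'], ['b1', 'b2', 'b3'])
print(truncate_pair(a, b, 6, "only_first"))
# (['a1', 'a2'], ['b1', 'b2', 'b3', 'b4'])
print(truncate_pair(a, b, 6, "only_second"))
# (['a1', 'a2', 'a3', 'a4', 'a5'], ['b1'])
```

'longest_first' spreads the loss across both texts, whereas 'only_first' and 'only_second' protect one text completely at the expense of the other.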

Original source: https://www.cnblogs.com/miraclepbc/p/15886206.html