In this section, we will cover the same ground as the TVMC tutorial, but show how it is done with the Python API. Upon completion of this section, we will have used the Python API for TVM to accomplish the following tasks:
- Compile a pretrained ResNet-50 v2 model for the TVM runtime
- Run a real image through the compiled model, and interpret the output and model performance
- Tune the model on a CPU using TVM
- Re-compile an optimized model using the tuning data collected by TVM
- Run the image through the optimized model, and compare the output and model performance
TVM is a deep learning compiler framework, with a number of different modules available for working with deep learning models and operators. In this tutorial we will work through how to load, compile, and optimize a model using the Python API.
We begin by importing a number of dependencies, including onnx for loading and converting the model, helper utilities for downloading test data, the Python Image Library for working with image data, numpy for pre- and post-processing of the image data, the TVM Relay framework, and the TVM graph executor.
import onnx
from tvm.contrib.download import download_testdata
from PIL import Image
import numpy as np
import tvm.relay as relay
import tvm
from tvm.contrib import graph_executor
Download and Load the ONNX Model
model_url = (
    "https://github.com/onnx/models/raw/main/"
    "vision/classification/resnet/model/"
    "resnet50-v2-7.onnx"
)
model_path = download_testdata(model_url, "resnet50-v2-7.onnx", module="onnx")
onnx_model = onnx.load(model_path)
# Seed numpy's RNG to get consistent results
np.random.seed(0)
Download, Preprocess, and Load the Test Image
As with TVMC, model input and output can be handled in numpy's .npz format. Here we download the image data, convert it to a numpy array, and feed it to the model as input.
img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
img_path = download_testdata(img_url, "imagenet_cat.png", module="data")
# Resize it to 224x224
resized_image = Image.open(img_path).resize((224, 224))
img_data = np.asarray(resized_image).astype("float32")
# Our input image is in HWC layout while ONNX expects CHW input, so convert the array
img_data = np.transpose(img_data, (2, 0, 1))
# Normalize according to the ImageNet input specification
imagenet_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
imagenet_stddev = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
norm_img_data = (img_data / 255 - imagenet_mean) / imagenet_stddev
# Add the batch dimension, as we are expecting 4-dimensional input: NCHW.
img_data = np.expand_dims(norm_img_data, axis=0)
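If you want to mirror the TVMC flow, which passes inputs around as .npz files, you could optionally save the preprocessed array at this point. A minimal sketch; the imagenet_cat.npz file name is just an illustration:
# Optionally save the preprocessed input in .npz format, mirroring the
# TVMC workflow. The file name here is illustrative.
np.savez("imagenet_cat.npz", data=img_data)
# It can be loaded back later with:
# img_data = np.load("imagenet_cat.npz")["data"]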
Compile the Model With Relay
The next step is to compile the ResNet model. We import the model into Relay using from_onnx. We then build the model into a TVM library with standard optimizations, and finally create a TVM graph executor module from that library.
target = "llvm"
The input name may vary across model types; a tool such as Netron can be used to inspect the model's input name.
# The input name may vary across model types. You can use a tool
# like Netron to check input names
input_name = "data"
shape_dict = {input_name: img_data.shape}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))
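At this point the compiled module lives only in memory. If you want to reuse it later without recompiling, it can be written to disk with export_library and loaded back with tvm.runtime.load_module. A minimal sketch; the .so file name is illustrative:
# Optionally save the compiled module to disk and reload it later.
lib.export_library("resnet50-v2-7-tvm.so")
loaded_lib = tvm.runtime.load_module("resnet50-v2-7-tvm.so")
module = graph_executor.GraphModule(loaded_lib["default"](dev))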
Execute on the TVM Runtime
Now that we have compiled the model, we can use the TVM runtime to make predictions with it. To run the model and make predictions, we need two things:
- The compiled model, which we just produced
- Valid input to the model on which to make predictions
dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()
Collect Basic Performance Data
We want to collect some basic performance data associated with this unoptimized model and compare it to the tuned model later. To help account for CPU noise, we run the computation in multiple batches with multiple repetitions, then gather some basic statistics on the mean, median, and standard deviation.
import timeit
timing_number = 10
timing_repeat = 10
unoptimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
unoptimized = {
    "mean": np.mean(unoptimized),
    "median": np.median(unoptimized),
    "std": np.std(unoptimized),
}
print(unoptimized)
Output:
{'mean': 48.89584059594199, 'median': 48.16241894150153, 'std': 2.2564635214327597}
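As an alternative to hand-rolling the timeit loop, recent TVM releases also expose a benchmark helper directly on GraphModule. A hedged sketch; check that your TVM version includes it, and note it reports times in seconds rather than milliseconds:
# Alternative timing via GraphModule.benchmark (recent TVM releases).
benchmark_result = module.benchmark(dev, number=timing_number, repeat=timing_repeat)
print(benchmark_result)  # prints mean, median, and std of the measured runs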
Postprocess the output
As mentioned previously, each model will have its own particular way of providing output tensors.
In our case, we need to run some post-processing to render the output of ResNet-50 v2 into a more human-readable form, using a lookup table provided for the model.
from scipy.special import softmax
# Download a list of labels
labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
labels_path = download_testdata(labels_url, "synset.txt", module="data")
with open(labels_path, "r") as f:
    labels = [l.rstrip() for l in f]
# Open the output and read the output tensor
scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
Output:
class='n02123045 tabby, tabby cat' with probability=0.621105
class='n02123159 tiger cat' with probability=0.356377
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262
For reference, this should produce output along these lines:
# class='n02123045 tabby, tabby cat' with probability=0.610553
# class='n02123159 tiger cat' with probability=0.367179
# class='n02124075 Egyptian cat' with probability=0.019365
# class='n02129604 tiger, Panthera tigris' with probability=0.001273
# class='n04040759 radiator' with probability=0.000261
Tune the model
In some cases, we might not get the expected performance when running inference with our compiled module. In cases like this, we can make use of the auto-tuner to find a better configuration for our model and gain a boost in performance.
Tuning in TVM refers to the process by which a model is optimized to run faster on a given target. This differs from training or fine-tuning in that it does not affect the accuracy of the model, only the runtime performance. As part of the tuning process, TVM will try running many different operator implementation variants to see which perform best. The results of these runs are stored in a tuning records file.
In its simplest form, tuning requires you to provide three things:
- the target specification of the device you intend to run this model on
- the path to an output file in which the tuning records will be stored
- the model to be tuned
import tvm.auto_scheduler as auto_scheduler
from tvm.autotvm.tuner import XGBTuner
from tvm import autotvm
Set up some basic parameters for the runner. The runner takes compiled code generated with a specific set of parameters and measures its performance.
- number specifies the number of different configurations we will test
- repeat specifies how many measurements we will take of each configuration
- min_repeat_ms specifies how long a configuration test needs to run for. If the number of repeats falls under this time, it will be increased. This option is necessary for accurate tuning on GPUs and is not required for CPU tuning; setting this value to 0 disables it.
- timeout places an upper limit on how long to run training code for each tested configuration
number = 10
repeat = 1
min_repeat_ms = 0 # since we're tuning on a CPU, can be set to 0
timeout = 10 # in seconds
# create a TVM runner
runner = autotvm.LocalRunner(
    number=number,
    repeat=repeat,
    timeout=timeout,
    min_repeat_ms=min_repeat_ms,
    enable_cpu_cache_flush=True,
)
Create a simple structure for holding tuning options. We use an XGBoost algorithm to guide the search. For a production job, you will want to set the number of trials to be larger than the value of 20 used here: 1500 is recommended for CPUs, 3000-4000 for GPUs. The required number of trials can depend on the particular model and processor, so it's worth spending some time evaluating performance across a range of values to find the best balance between tuning time and model optimization. Because running tuning is time-intensive, we set the number of trials to 20, but do not recommend a value this small.
The early_stopping parameter is the minimum number of trials to run before a condition that stops the search early can be applied.
The measure_option parameter indicates where the trial code will be built, and where it will be run. In this case, we use the LocalRunner we just created and a LocalBuilder.
The tuning_records option specifies a file to write the tuning data to.
tuning_option = {
    "tuner": "xgb",
    "trials": 20,
    "early_stopping": 100,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func="default"), runner=runner
    ),
    "tuning_records": "resnet-50-v2-autotuning.json",
}
# begin by extracting the tasks from the onnx model
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)
# Tune the extracted tasks sequentially.
for i, task in enumerate(tasks):
    prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
    tuner_obj = XGBTuner(task, loss_type="rank")
    tuner_obj.tune(
        n_trial=min(tuning_option["trials"], len(task.config_space)),
        early_stopping=tuning_option["early_stopping"],
        measure_option=tuning_option["measure_option"],
        callbacks=[
            autotvm.callback.progress_bar(tuning_option["trials"], prefix=prefix),
            autotvm.callback.log_to_file(tuning_option["tuning_records"]),
        ],
    )
Output:
[Task 1/25] Current/Best: 85.05/ 251.80 GFLOPS | Progress: (20/20) | 7.68 s Done.
[Task 2/25] Current/Best: 91.92/ 209.11 GFLOPS | Progress: (20/20) | 6.07 s Done.
[Task 3/25] Current/Best: 95.80/ 219.97 GFLOPS | Progress: (20/20) | 6.27 s Done.
[Task 4/25] Current/Best: 166.34/ 237.16 GFLOPS | Progress: (20/20) | 7.10 s Done.
[Task 5/25] Current/Best: 81.68/ 260.01 GFLOPS | Progress: (20/20) | 6.13 s Done.
[Task 6/25] Current/Best: 41.35/ 242.81 GFLOPS | Progress: (20/20) | 6.41 s Done.
[Task 7/25] Current/Best: 75.99/ 240.20 GFLOPS | Progress: (20/20) | 5.48 s Done.
[Task 8/25] Current/Best: 123.49/ 216.88 GFLOPS | Progress: (20/20) | 9.69 s Done.
[Task 9/25] Current/Best: 53.55/ 230.81 GFLOPS | Progress: (20/20) | 16.94 s Done.
[Task 10/25] Current/Best: 86.86/ 240.26 GFLOPS | Progress: (20/20) | 5.03 s Done.
[Task 11/25] Current/Best: 191.19/ 257.60 GFLOPS | Progress: (20/20) | 6.02 s Done.
[Task 12/25] Current/Best: 94.22/ 225.94 GFLOPS | Progress: (20/20) | 6.71 s Done.
[Task 13/25] Current/Best: 127.52/ 220.16 GFLOPS | Progress: (20/20) | 6.42 s Done.
[Task 14/25] Current/Best: 239.47/ 252.94 GFLOPS | Progress: (20/20) | 18.66 s Done.
[Task 15/25] Current/Best: 62.80/ 260.21 GFLOPS | Progress: (20/20) | 13.09 s Done.
[Task 16/25] Current/Best: 86.70/ 194.14 GFLOPS | Progress: (20/20) | 5.30 s Done.
[Task 17/25] Current/Best: 101.12/ 257.36 GFLOPS | Progress: (20/20) | 6.23 s Done.
[Task 18/25] Current/Best: 130.45/ 248.23 GFLOPS | Progress: (20/20) | 6.19 s Done.
[Task 19/25] Current/Best: 26.57/ 237.67 GFLOPS | Progress: (20/20) | 7.63 s Done.
[Task 20/25] Current/Best: 140.13/ 179.09 GFLOPS | Progress: (20/20) | 14.41 s Done.
[Task 21/25] Current/Best: 49.51/ 199.20 GFLOPS | Progress: (20/20) | 11.11 s Done.
[Task 22/25] Current/Best: 193.76/ 228.26 GFLOPS | Progress: (20/20) | 5.81 s Done.
[Task 23/25] Current/Best: 61.72/ 257.58 GFLOPS | Progress: (20/20) | 9.12 s Done.
[Task 25/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s Done.
[Task 25/25] Current/Best: 4.54/ 42.50 GFLOPS | Progress: (20/20) | 22.68 s
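The tuning log now contains one line per measured configuration. If you only want to keep the best record found for each task, autotvm ships a helper for pruning the log. A minimal sketch; the output file name is illustrative, and availability may vary by TVM version:
# Keep only the best configuration per tuned task (optional).
autotvm.record.pick_best("resnet-50-v2-autotuning.json", "resnet-50-v2-autotuning-best.json")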
Compiling an Optimized Model with Tuning Data
The output of the tuning process above was stored in resnet-50-v2-autotuning.json. The compiler will use these results to generate high-performance code for the model on your specified target.
Now that tuning data for the model has been collected, we can re-compile the model using optimized operators to speed up our computations.
with autotvm.apply_history_best(tuning_option["tuning_records"]):
    with tvm.transform.PassContext(opt_level=3, config={}):
        lib = relay.build(mod, target=target, params=params)
dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))
Output:
/home/workspace/tvm/tvm/python/tvm/driver/build_module.py:267: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
warnings.warn(
Verify that the optimized model runs and produces the same results:
dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()
scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
Output:
class='n02123045 tabby, tabby cat' with probability=0.621104
class='n02123159 tiger cat' with probability=0.356378
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262
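Beyond eyeballing the top-5 labels, you could also compare the raw scores numerically, assuming you stashed a copy of the unoptimized scores in a variable such as unoptimized_scores before recompiling (a hypothetical name, not part of the script above):
# Hypothetical numerical check against scores saved from the unoptimized run.
# unoptimized_scores must have been stored before the model was recompiled.
np.testing.assert_allclose(scores, unoptimized_scores, rtol=1e-4)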
Comparing the Tuned and Untuned Models
We want to collect some basic performance data associated with this optimized model to compare it to the unoptimized model. Depending on your underlying hardware, number of iterations, and other factors, you should see a performance improvement when comparing the optimized model to the unoptimized model.
import timeit
timing_number = 10
timing_repeat = 10
optimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
optimized = {"mean": np.mean(optimized), "median": np.median(optimized), "std": np.std(optimized)}
print("optimized: %s" % (optimized))
print("unoptimized: %s" % (unoptimized))
Output:
optimized: {'mean': 41.897965169046074, 'median': 41.06571790762246, 'std': 2.092901884526126}
unoptimized: {'mean': 48.89584059594199, 'median': 48.16241894150153, 'std': 2.2564635214327597}
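To summarize the comparison, you can also compute the speedup directly from the two dictionaries (a minimal sketch using the values printed above):
# Relative improvement of the tuned model over the untuned baseline.
speedup = unoptimized["mean"] / optimized["mean"]
print("speedup: %.2fx" % speedup)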