• Compiling and Optimizing a Model with the Python Interface (AutoTVM)


    In this section, we cover the same ground as the TVMC tutorial, but show how to do it with the Python API. After completing this section, we will have used the TVM Python API to accomplish the following tasks:

    • Compile a pre-trained ResNet-50 v2 model for the TVM runtime
    • Run a real image through the compiled model, and interpret the output and model performance
    • Tune the model on a CPU using TVM
    • Re-compile an optimized model using the tuning data collected by TVM
    • Run the image through the optimized model, and compare the output and model performance

    TVM is a deep learning compiler framework, with a number of different modules available for working with deep learning models and operators. In this tutorial we will work through how to load, compile, and optimize a model using the Python API.
    We begin by importing a number of dependencies: onnx for loading and converting the model, a helper utility for downloading test data, the Python Imaging Library for working with the image data, numpy for pre- and post-processing of the image data, the TVM Relay framework, and the TVM graph executor.

    import onnx
    from tvm.contrib.download import download_testdata
    from PIL import Image
    import numpy as np
    import tvm.relay as relay
    import tvm
    from tvm.contrib import graph_executor
    

    Download and Load the ONNX Model

    model_url = (
        "https://github.com/onnx/models/raw/main/"
        "vision/classification/resnet/model/"
        "resnet50-v2-7.onnx"
    )
    
    model_path = download_testdata(model_url, "resnet50-v2-7.onnx", module="onnx")
    onnx_model = onnx.load(model_path)
    
    # Seed numpy's RNG to get consistent results
    np.random.seed(0)
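    
    Optionally, you can validate the downloaded model before importing it. Below is a minimal check using onnx's built-in checker (this call is part of the onnx package, not TVM):
    
    # Raise an exception if the downloaded model file is malformed
    onnx.checker.check_model(onnx_model)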
    

    Download, Preprocess, and Load the Test Image

    As in the TVMC tutorial, the model's input and output can be handled in numpy's .npz format.

    Download the image data, convert it to a numpy array, and feed it to the model as input:

    img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
    img_path = download_testdata(img_url, "imagenet_cat.png", module="data")
    
    # Resize it to 224x224
    resized_image = Image.open(img_path).resize((224, 224))
    img_data = np.asarray(resized_image).astype("float32")
    
    # Our input image is in HWC layout while ONNX expects CHW input, so convert the array
    img_data = np.transpose(img_data, (2, 0, 1))
    
    # Normalize according to the ImageNet input specification
    imagenet_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
    imagenet_stddev = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
    norm_img_data = (img_data / 255 - imagenet_mean) / imagenet_stddev
    
    # Add the batch dimension, as we are expecting 4-dimensional input: NCHW.
    img_data = np.expand_dims(norm_img_data, axis=0)
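    
    As a quick sanity check, the preprocessed array should now be four-dimensional in NCHW layout (a one-line check; the expected shape follows from the 224x224 resize above):
    
    # Expect (1, 3, 224, 224): batch, channel, height, width
    print(img_data.shape)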
    

    Compile the Model With Relay

    The next step is to compile the ResNet model. We import the model into Relay using from_onnx, then build it, with standard optimizations, into a TVM library. Finally, we create a TVM graph runtime module from that library.

    target = "llvm"
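    
    Note that "llvm" is a generic CPU target. The upstream tutorial recommends identifying the CPU you are running on, along with optional features, and setting the target accordingly so that LLVM can emit vectorized code. For example (the -mcpu value is hardware-specific, so the line is shown commented out):
    
    # A more specific target for an AVX-512 capable CPU, for example:
    # target = "llvm -mcpu=skylake-avx512"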
    

    You can use a tool such as Netron to check the model's input name:

    # The input name may vary across model types. You can use a tool
    # like Netron to check input names
    input_name = "data"
    shape_dict = {input_name: img_data.shape}
    
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
    
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    
    dev = tvm.device(str(target), 0)
    module = graph_executor.GraphModule(lib["default"](dev))
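    
    If you want to reuse the compiled model in a later session without rebuilding, the factory module returned by relay.build can be exported to a shared library and loaded back (a short sketch; the filename is arbitrary):
    
    # Save the compiled module to disk ...
    lib.export_library("resnet50-v2-7-tvm.so")
    # ... and reload it later without recompiling:
    # loaded_lib = tvm.runtime.load_module("resnet50-v2-7-tvm.so")
    # module = graph_executor.GraphModule(loaded_lib["default"](dev))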
    

    Execute on the TVM Runtime

    Now that we have compiled the model, we can use the TVM runtime to make predictions with it. To run the model and make a prediction, we need two things:

    • the compiled model, which we just produced
    • valid input to the model on which to make the prediction

    dtype = "float32"
    module.set_input(input_name, img_data)
    module.run()
    output_shape = (1, 1000)
    tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()
    

    Collect Basic Performance Data

    We want to collect some basic performance data associated with this unoptimized model, so that we can compare it to the tuned model later. To help account for CPU noise, we run the computation in multiple batches of repetitions, then gather some basic statistics on the mean, median, and standard deviation.

    import timeit
    
    timing_number = 10
    timing_repeat = 10
    unoptimized = (
        np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
        * 1000
        / timing_number
    )
    unoptimized = {
        "mean": np.mean(unoptimized),
        "median": np.median(unoptimized),
        "std": np.std(unoptimized),
    }
    
    print(unoptimized)
    

    Output:

    {'mean': 48.89584059594199, 'median': 48.16241894150153, 'std': 2.2564635214327597}
    

    Postprocess the output

    As mentioned previously, each model has its own particular way of providing output tensors.
    In our case, we need to run some post-processing to render the output of ResNet-50 v2 into a more human-readable form, using the lookup table provided for the model.

    from scipy.special import softmax
    
    # Download a list of labels
    labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
    labels_path = download_testdata(labels_url, "synset.txt", module="data")
    
    with open(labels_path, "r") as f:
        labels = [l.rstrip() for l in f]
    
    # Open the output and read the output tensor
    scores = softmax(tvm_output)
    scores = np.squeeze(scores)
    ranks = np.argsort(scores)[::-1]
    for rank in ranks[0:5]:
        print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
    

    Output:

    class='n02123045 tabby, tabby cat' with probability=0.621105
    class='n02123159 tiger cat' with probability=0.356377
    class='n02124075 Egyptian cat' with probability=0.019712
    class='n02129604 tiger, Panthera tigris' with probability=0.001215
    class='n04040759 radiator' with probability=0.000262
    

    For reference, the upstream tutorial lists the expected output as:

    # class='n02123045 tabby, tabby cat' with probability=0.610553
    # class='n02123159 tiger cat' with probability=0.367179
    # class='n02124075 Egyptian cat' with probability=0.019365
    # class='n02129604 tiger, Panthera tigris' with probability=0.001273
    # class='n04040759 radiator' with probability=0.000261
    

    Tune the model

    In some cases, we might not get the expected performance when running inference with our compiled module. In cases like this, we can make use of the auto-tuner to find a better configuration for the model and get a boost in performance.
    Tuning in TVM refers to the process by which a model is optimized to run faster on a given target. This differs from training or fine-tuning in that it does not affect the accuracy of the model, but only the runtime performance. As part of the tuning process, TVM will try running many different operator implementation variants to see which perform best. The results of these runs are stored in a tuning records file.
    In its simplest form, tuning requires you to provide three things:

    • the target specification of the device you intend to run this model on
    • the path to an output file in which the tuning records will be stored
    • a path to the model to be tuned

    import tvm.auto_scheduler as auto_scheduler
    from tvm.autotvm.tuner import XGBTuner
    from tvm import autotvm
    

    Set up some basic parameters for the runner. The runner takes compiled code generated with a specific set of parameters and measures its performance:

    • number specifies the number of different configurations that we will be testing
    • repeat specifies how many measurements we will take of each configuration
    • min_repeat_ms specifies how long a configuration test needs to run for; if the number of repeats falls under this time, it will be increased. This option is necessary for accurate tuning on GPUs and is not required for CPU tuning. Setting this value to 0 disables it.
    • timeout places an upper limit on how long to run training code for each tested configuration

    number = 10
    repeat = 1
    min_repeat_ms = 0  # since we're tuning on a CPU, can be set to 0
    timeout = 10  # in seconds
    
    # create a TVM runner
    runner = autotvm.LocalRunner(
        number=number,
        repeat=repeat,
        timeout=timeout,
        min_repeat_ms=min_repeat_ms,
        enable_cpu_cache_flush=True,
    )
    

    Create a simple structure for holding tuning options. We use an XGBoost algorithm to guide the search. For a production job, you will want to set the number of trials to be larger than the 20 used here: 1500 is recommended for CPU and 3000-4000 for GPU. The required number of trials can depend on the particular model and processor, so it's worth spending some time evaluating performance across a range of values to find the best balance between tuning time and model optimization. Because running tuning is time intensive, we set the number of trials to 20 here, but we do not recommend a value this small.
    The early_stopping parameter is the minimum number of trials to run before a condition that stops the search early can be applied.
    The measure_option parameter indicates where the trial code will be built, and where it will be run; in this example we use the LocalRunner we just created and a LocalBuilder.
    The tuning_records option specifies a file to write the tuning data to.

    tuning_option = {
        "tuner": "xgb",
        "trials": 20,
        "early_stopping": 100,
        "measure_option": autotvm.measure_option(
            builder=autotvm.LocalBuilder(build_func="default"), runner=runner
        ),
        "tuning_records": "resnet-50-v2-autotuning.json",
    }
    
    # begin by extracting the tasks from the onnx model
    tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)
    
    # Tune the extracted tasks sequentially.
    for i, task in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
        tuner_obj = XGBTuner(task, loss_type="rank")
        tuner_obj.tune(
            n_trial=min(tuning_option["trials"], len(task.config_space)),
            early_stopping=tuning_option["early_stopping"],
            measure_option=tuning_option["measure_option"],
            callbacks=[
                autotvm.callback.progress_bar(tuning_option["trials"], prefix=prefix),
                autotvm.callback.log_to_file(tuning_option["tuning_records"]),
            ],
        )
    

    Output:

    [Task  1/25]  Current/Best:   85.05/ 251.80 GFLOPS | Progress: (20/20) | 7.68 s Done.
    [Task  2/25]  Current/Best:   91.92/ 209.11 GFLOPS | Progress: (20/20) | 6.07 s Done.
    [Task  3/25]  Current/Best:   95.80/ 219.97 GFLOPS | Progress: (20/20) | 6.27 s Done.
    [Task  4/25]  Current/Best:  166.34/ 237.16 GFLOPS | Progress: (20/20) | 7.10 s Done.
    [Task  5/25]  Current/Best:   81.68/ 260.01 GFLOPS | Progress: (20/20) | 6.13 s Done.
    [Task  6/25]  Current/Best:   41.35/ 242.81 GFLOPS | Progress: (20/20) | 6.41 s Done.
    [Task  7/25]  Current/Best:   75.99/ 240.20 GFLOPS | Progress: (20/20) | 5.48 s Done.
    [Task  8/25]  Current/Best:  123.49/ 216.88 GFLOPS | Progress: (20/20) | 9.69 s Done.
    [Task  9/25]  Current/Best:   53.55/ 230.81 GFLOPS | Progress: (20/20) | 16.94 s Done.
    [Task 10/25]  Current/Best:   86.86/ 240.26 GFLOPS | Progress: (20/20) | 5.03 s Done.
    [Task 11/25]  Current/Best:  191.19/ 257.60 GFLOPS | Progress: (20/20) | 6.02 s Done.
    [Task 12/25]  Current/Best:   94.22/ 225.94 GFLOPS | Progress: (20/20) | 6.71 s Done.
    [Task 13/25]  Current/Best:  127.52/ 220.16 GFLOPS | Progress: (20/20) | 6.42 s Done.
    [Task 14/25]  Current/Best:  239.47/ 252.94 GFLOPS | Progress: (20/20) | 18.66 s Done.
    [Task 15/25]  Current/Best:   62.80/ 260.21 GFLOPS | Progress: (20/20) | 13.09 s Done.
    [Task 16/25]  Current/Best:   86.70/ 194.14 GFLOPS | Progress: (20/20) | 5.30 s Done.
    [Task 17/25]  Current/Best:  101.12/ 257.36 GFLOPS | Progress: (20/20) | 6.23 s Done.
    [Task 18/25]  Current/Best:  130.45/ 248.23 GFLOPS | Progress: (20/20) | 6.19 s Done.
    [Task 19/25]  Current/Best:   26.57/ 237.67 GFLOPS | Progress: (20/20) | 7.63 s Done.
    [Task 20/25]  Current/Best:  140.13/ 179.09 GFLOPS | Progress: (20/20) | 14.41 s Done.
    [Task 21/25]  Current/Best:   49.51/ 199.20 GFLOPS | Progress: (20/20) | 11.11 s Done.
    [Task 22/25]  Current/Best:  193.76/ 228.26 GFLOPS | Progress: (20/20) | 5.81 s Done.
    [Task 23/25]  Current/Best:   61.72/ 257.58 GFLOPS | Progress: (20/20) | 9.12 s Done.
    [Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s Done.
    [Task 25/25]  Current/Best:    4.54/  42.50 GFLOPS | Progress: (20/20) | 22.68 s
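    
    The tuning records are written to the JSON file as one record per line. If you want to inspect what was collected before recompiling, they can be read back with autotvm's record utilities (a short sketch; it assumes the tuning run above produced resnet-50-v2-autotuning.json):
    
    # Peek at the first few records: task name and measured costs (in seconds)
    for i, (inp, res) in enumerate(autotvm.record.load_from_file("resnet-50-v2-autotuning.json")):
        if i >= 3:
            break
        print(inp.task.name, res.costs)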
    

    Compiling an Optimized Model with Tuning Data

    The output of the tuning process above was stored in resnet-50-v2-autotuning.json. The compiler will use those results to generate high-performance code for the model on the target you specified.
    Now that the tuning data for the model has been collected, we can re-compile the model using the optimized operators to speed up our computations.

    with autotvm.apply_history_best(tuning_option["tuning_records"]):
        with tvm.transform.PassContext(opt_level=3, config={}):
            lib = relay.build(mod, target=target, params=params)
    
    dev = tvm.device(str(target), 0)
    module = graph_executor.GraphModule(lib["default"](dev))
    

    Output:

    /home/workspace/tvm/tvm/python/tvm/driver/build_module.py:267: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
      warnings.warn(
    

    Verify that the optimized model runs and produces the same results:

    dtype = "float32"
    module.set_input(input_name, img_data)
    module.run()
    output_shape = (1, 1000)
    tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()
    
    scores = softmax(tvm_output)
    scores = np.squeeze(scores)
    ranks = np.argsort(scores)[::-1]
    for rank in ranks[0:5]:
        print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
    

    Output:

    class='n02123045 tabby, tabby cat' with probability=0.621104
    class='n02123159 tiger cat' with probability=0.356378
    class='n02124075 Egyptian cat' with probability=0.019712
    class='n02129604 tiger, Panthera tigris' with probability=0.001215
    class='n04040759 radiator' with probability=0.000262
    

    Comparing the Tuned and Untuned Models

    We want to collect some basic performance data associated with this optimized model, to compare it to the unoptimized model. Depending on your underlying hardware, number of iterations, and other factors, you should see a performance improvement when comparing the optimized model to the unoptimized model.

    import timeit
    
    timing_number = 10
    timing_repeat = 10
    optimized = (
        np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
        * 1000
        / timing_number
    )
    optimized = {"mean": np.mean(optimized), "median": np.median(optimized), "std": np.std(optimized)}
    
    
    print("optimized: %s" % (optimized))
    print("unoptimized: %s" % (unoptimized))
    

    Output:

    optimized: {'mean': 41.897965169046074, 'median': 41.06571790762246, 'std': 2.092901884526126}
    unoptimized: {'mean': 48.89584059594199, 'median': 48.16241894150153, 'std': 2.2564635214327597}
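    
    For the run above, the mean latency dropped from roughly 48.9 ms to 41.9 ms, about a 14% improvement. A small follow-up snippet (reusing the two statistics dictionaries defined above) makes the comparison explicit:
    
    # Relative improvement of the tuned model over the unoptimized baseline
    speedup = (unoptimized["mean"] - optimized["mean"]) / unoptimized["mean"] * 100
    print("mean latency improved by %.1f%%" % speedup)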
    