• How to Optimize Convolution on GPU


    This article demonstrates how to write a high-performance convolution implementation in TVM. We use square-sized input tensors and filters as an example, and assume the batch of the convolution input is large. A different layout is used to store the data in order to achieve better data locality. The buffer layout is HWCN, which stands for height, width, channel, batch.

    Preparation and Algorithm

    We use a fixed size: an input tensor with 256 channels and a spatial size of 14 x 14, with a batch size of 256. The convolution filter contains 512 filters of size 3 x 3, and the convolution uses a stride of 1 and a padding of 1. The following code defines the convolution algorithm in TVM.

    import numpy as np
    import tvm
    from tvm import te
     
    # The sizes of inputs and filters
    batch = 256
    in_channel = 256
    out_channel = 512
    in_size = 14
    kernel = 3
    pad = 1
    stride = 1
     
    # Algorithm
    A = te.placeholder((in_size, in_size, in_channel, batch), name="A")
    W = te.placeholder((kernel, kernel, in_channel, out_channel), name="W")
    out_size = (in_size - kernel + 2 * pad) // stride + 1
    # Pad input
    Apad = te.compute(
        (in_size + 2 * pad, in_size + 2 * pad, in_channel, batch),
        lambda yy, xx, cc, nn: tvm.tir.if_then_else(
            tvm.tir.all(yy >= pad, yy - pad < in_size, xx >= pad, xx - pad < in_size),
            A[yy - pad, xx - pad, cc, nn],
            tvm.tir.const(0.0, "float32"),
        ),
        name="Apad",
    )
    # Create reduction variables
    rc = te.reduce_axis((0, in_channel), name="rc")
    ry = te.reduce_axis((0, kernel), name="ry")
    rx = te.reduce_axis((0, kernel), name="rx")
    # Compute the convolution
    B = te.compute(
        (out_size, out_size, out_channel, batch),
        lambda yy, xx, ff, nn: te.sum(
            Apad[yy * stride + ry, xx * stride + rx, rc, nn] * W[ry, rx, rc, ff], axis=[ry, rx, rc]
        ),
        name="B",
    )
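
    With these parameters, out_size = (14 - 3 + 2 * 1) // 1 + 1 = 14, so the output B has shape (14, 14, 512, 256). A quick, optional sanity check (not part of the original tutorial, assuming the definitions above are in scope):

    # Optional sanity check of the shapes implied by the parameters above.
    assert out_size == 14
    print(A.shape)  # [14, 14, 256, 256]
    print(W.shape)  # [3, 3, 256, 512]
    print(B.shape)  # [14, 14, 512, 256]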

    Memory Hierarchy

    We first specify the memory hierarchy for the buffers. The figure below illustrates the GPU memory hierarchy. One important difference from the CPU memory hierarchy is that the GPU provides a cache buffer called shared memory, which is managed by the programmer. Maximizing data reuse in shared memory is critical to achieving high performance in GPU kernels.

    [Figure: GPU memory hierarchy]

    We load both Apad and W into buffers AA and WW, which are stored in shared memory. These buffers will later be shared by all threads within the same thread block to compute the convolution. Each thread then loads its own part from the shared buffers into its local registers, AL and WL. BL is a local cache of the output B, which is also stored in thread-local registers.

    # Designate the memory hierarchy
    s = te.create_schedule(B.op)
    s[Apad].compute_inline()  # compute Apad inline
    AA = s.cache_read(Apad, "shared", [B])
    WW = s.cache_read(W, "shared", [B])
    AL = s.cache_read(AA, "local", [B])
    WL = s.cache_read(WW, "local", [B])
    BL = s.cache_write(B, "local")

    Blocking

    The following code splits the workload into thread blocks and individual threads, following the blocking scheme used for matrix multiplication. As shown in the figure below, given a pixel coordinate (y, x), a thread block is responsible for computing a block_factor x block_factor (64 x 64) region over the output channels and batch. Due to the limited shared memory space, only step x block_factor (8 x 64) of data from Apad and W is loaded into the shared-memory buffers at a time.

    [Figure: blocking scheme, one thread block per 64 x 64 output-channel x batch region]

    # tile consts
    tile = 8
    num_thread = 8
    block_factor = tile * num_thread
    step = 8
    vthread = 2
     
    # Get the GPU thread indices
    block_x = te.thread_axis("blockIdx.x")
    block_y = te.thread_axis("blockIdx.y")
    block_z = te.thread_axis("blockIdx.z")
    thread_x = te.thread_axis((0, num_thread), "threadIdx.x")
    thread_y = te.thread_axis((0, num_thread), "threadIdx.y")
    thread_xz = te.thread_axis((0, vthread), "vthread", name="vx")
    thread_yz = te.thread_axis((0, vthread), "vthread", name="vy")
     
    # Split the workloads
    hi, wi, fi, ni = s[B].op.axis
    bz = s[B].fuse(hi, wi)
    by, fi = s[B].split(fi, factor=block_factor)
    bx, ni = s[B].split(ni, factor=block_factor)
     
    # Bind the iteration variables to GPU thread indices
    s[B].bind(bz, block_z)
    s[B].bind(by, block_y)
    s[B].bind(bx, block_x)
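
    As a quick check (not in the original tutorial), the CUDA grid dimensions implied by this split follow directly from the constants above:

    # Grid dimensions implied by the blocking (assumes the constants defined
    # earlier): bz fuses the two spatial axes, while by and bx tile the
    # output-channel and batch axes by block_factor.
    grid_z = out_size * out_size          # 14 * 14 = 196 thread blocks
    grid_y = out_channel // block_factor  # 512 // 64 = 8
    grid_x = batch // block_factor        # 256 // 64 = 4
    print(grid_z, grid_y, grid_x)         # 196 8 4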

    Virtual Thread Split

    We further split the workload from a thread block down to individual threads. To avoid memory bank conflicts, virtual threads are used to split the area into 4 parts, which are then tiled into 8 x 8 grids. As shown in the figure below, each thread computes 4 strided grids, where each grid is 4 x 4 in size.

    [Figure: virtual thread split, each thread computes 4 strided 4 x 4 grids]

    tyz, fi = s[B].split(fi, nparts=vthread)  # virtual thread split
    txz, ni = s[B].split(ni, nparts=vthread)  # virtual thread split
    ty, fi = s[B].split(fi, nparts=num_thread)
    tx, ni = s[B].split(ni, nparts=num_thread)
    s[B].reorder(bz, by, bx, tyz, txz, ty, tx, fi, ni)
     
    s[B].bind(tyz, thread_yz)
    s[B].bind(txz, thread_xz)
    s[B].bind(ty, thread_y)
    s[B].bind(tx, thread_x)
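
    The per-thread tile sizes quoted above can be checked against the split factors (again, not part of the original tutorial):

    # Per-thread work implied by the virtual-thread split: each thread covers
    # vthread * vthread strided grids of edge block_factor // (vthread * num_thread).
    grids_per_thread = vthread * vthread                # 2 * 2 = 4
    grid_edge = block_factor // (vthread * num_thread)  # 64 // 16 = 4
    print(grids_per_thread, grid_edge, grid_edge)       # 4 4 4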

    Cooperative Fetching

    Each time step requires transferring step x block_factor data from GPU global memory to shared memory. To reduce the memory transfer per thread, the following code lets threads in the same thread block cooperatively fetch the dependent data from global memory.

    # Schedule BL local write
    s[BL].compute_at(s[B], tx)
    yi, xi, fi, ni = s[BL].op.axis
    ry, rx, rc = s[BL].op.reduce_axis
    rco, rci = s[BL].split(rc, factor=step)
    s[BL].reorder(rco, ry, rx, rci, fi, ni)
     
    # Attach computation to iteration variables
    s[AA].compute_at(s[BL], rx)
    s[WW].compute_at(s[BL], rx)
    s[AL].compute_at(s[BL], rci)
    s[WL].compute_at(s[BL], rci)
     
    # Schedule for A's shared memory load
    yi, xi, ci, ni = s[AA].op.axis
    ty, ci = s[AA].split(ci, nparts=num_thread)
    tx, ni = s[AA].split(ni, nparts=num_thread)
    _, ni = s[AA].split(ni, factor=4)
    s[AA].reorder(ty, tx, yi, xi, ci, ni)
    s[AA].bind(ty, thread_y)
    s[AA].bind(tx, thread_x)
    s[AA].vectorize(ni)  # vectorize memory load
     
    # Schedule for W's shared memory load
    yi, xi, ci, fi = s[WW].op.axis
    ty, ci = s[WW].split(ci, nparts=num_thread)
    tx, fi = s[WW].split(fi, nparts=num_thread)
    _, fi = s[WW].split(fi, factor=4)
    s[WW].reorder(ty, tx, yi, xi, ci, fi)
    s[WW].bind(ty, thread_y)
    s[WW].bind(tx, thread_x)
    s[WW].vectorize(fi)  # vectorize memory load
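
    At this point the schedule is complete. As an optional aside (not part of the original tutorial), the scheduled IR can be printed to inspect the loop structure, thread bindings, and shared/local buffers before code generation:

    # Optional: lower the schedule to TIR and print it (assumes s, A, W, B above).
    print(tvm.lower(s, [A, W, B], simple_mode=True))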

    Generate CUDA Kernel

    Finally, we use TVM to generate and compile the CUDA kernel, and evaluate the latency of the convolution.

    func = tvm.build(s, [A, W, B], "cuda")
    ctx = tvm.gpu(0)
    a_np = np.random.uniform(size=(in_size, in_size, in_channel, batch)).astype(A.dtype)
    w_np = np.random.uniform(size=(kernel, kernel, in_channel, out_channel)).astype(W.dtype)
    a = tvm.nd.array(a_np, ctx)
    w = tvm.nd.array(w_np, ctx)
    b = tvm.nd.array(np.zeros((out_size, out_size, out_channel, batch), dtype=B.dtype), ctx)
    func(a, w, b)
    evaluator = func.time_evaluator(func.entry_name, ctx, number=1)
    print("Convolution: %f ms" % (evaluator(a, w, b).mean * 1e3))

    Out:

    Convolution: 53.197723 ms
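
    As an optional correctness check (not part of the original tutorial), the result can be compared against a direct NumPy computation in the same HWCN layout. This sketch assumes a_np, w_np, b and the constants defined above are in scope, relies on stride == 1, and uses the older NDArray .asnumpy() API consistent with tvm.gpu(0) above (newer TVM versions use .numpy()):

    # Reference convolution in NumPy (HWCN layout, stride == 1).
    a_pad = np.zeros((in_size + 2 * pad, in_size + 2 * pad, in_channel, batch), dtype=a_np.dtype)
    a_pad[pad : pad + in_size, pad : pad + in_size, :, :] = a_np
    b_ref = np.zeros((out_size, out_size, out_channel, batch), dtype=a_np.dtype)
    for kh in range(kernel):
        for kw in range(kernel):
            patch = a_pad[kh : kh + out_size, kw : kw + out_size, :, :]  # (y, x, c, n)
            # Contract over the input-channel axis; tensordot gives (y, x, n, f).
            b_ref += np.tensordot(patch, w_np[kh, kw], axes=([2], [0])).transpose(0, 1, 3, 2)
    np.testing.assert_allclose(b.asnumpy(), b_ref, rtol=1e-3, atol=1e-3)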

    https://tvm.apache.org/docs/tutorials/optimize/opt_conv_cuda.html#sphx-glr-tutorials-optimize-opt-conv-cuda-py

  • Original post: https://www.cnblogs.com/wujianming-110117/p/14171188.html