PyTorch's autograd module includes a profiler that lets you inspect the cost of the different operators in your model, on both the CPU and the GPU.
There are currently two modes: a CPU-only mode using profile, and an nvprof-based mode (which registers both CPU and GPU activity) using emit_nvtx.
torch.autograd.profiler.profile(enabled=True, use_cuda=False, record_shapes=False)
Context manager that manages autograd profiler state and holds a summary of results. Under the hood it just records events of functions being executed in C++ and exposes those events to Python. You can wrap any code into it, and it will only report the runtime of PyTorch functions.
Parameters:
enabled (bool, optional) – Setting this to False makes this context manager a no-op. Default: True.
use_cuda (bool, optional) – Enables timing of CUDA events using the cudaEvent API. Adds roughly 4us of overhead to each tensor operation. Default: False.
record_shapes (bool, optional) – If shape recording is set, information about input dimensions will be collected. This lets one see which dimensions were used under the hood and further group by them using prof.key_averages(group_by_input_shape=True); a short sketch of this grouping appears after the result tables below. Note that shape recording may skew your profiling data. For the bottom-most events (in the case of nested function calls) it is probably negligible, but for higher-level functions the total self cpu time may be artificially increased by the shape collection.
Example
x = torch.randn((1, 1), requires_grad=True)
with torch.autograd.profiler.profile() as prof:
    for _ in range(100):  # any normal python code, really!
        y = x ** 2
        y.backward()
# NOTE: some columns were removed for brevity
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
Result (without GPU):
------------------------------------------  ----------------  ---------------  ------------  ----------  -------------  ---------------  ------------
Name                                         Self CPU total %  Self CPU total   CPU total %   CPU total   CPU time avg   Number of Calls  Input Shapes
------------------------------------------  ----------------  ---------------  ------------  ----------  -------------  ---------------  ------------
pow                                          64.76%            3.096ms          64.76%        3.096ms     3.096ms        1                []
struct torch::autograd::GraphRoot            0.37%             17.700us         0.37%         17.700us    17.700us       1                []
PowBackward0                                 23.10%            1.104ms          23.10%        1.104ms     1.104ms        1                []
pow                                          1.37%             65.700us         1.37%         65.700us    65.700us       1                []
mul                                          10.11%            483.100us        10.11%        483.100us   483.100us      1                []
mul                                          0.13%             6.200us          0.13%         6.200us     6.200us        1                []
struct torch::autograd::AccumulateGrad       0.14%             6.500us          0.14%         6.500us     6.500us        1                []
detach                                       0.03%             1.500us          0.03%         1.500us     1.500us        1                []
------------------------------------------  ----------------  ---------------  ------------  ----------  -------------  ---------------  ------------
Self CPU time total: 4.780ms
Result (with GPU):
------------------------------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ------------
Name                                         Self CPU total %  Self CPU total   CPU total %   CPU total   CPU time avg   CUDA total %   CUDA total   CUDA time avg   Number of Calls  Input Shapes
------------------------------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ------------
pow                                          29.13%            3.246ms          29.13%        3.246ms     3.246ms        31.62%         2.866ms      2.866ms         1                []
struct torch::autograd::GraphRoot            0.09%             9.600us          0.09%         9.600us     9.600us        0.02%          2.048us      2.048us         1                []
PowBackward0                                 34.12%            3.803ms          34.12%        3.803ms     3.803ms        32.89%         2.982ms      2.982ms         1                []
pow                                          8.53%             950.500us        8.53%         950.500us   950.500us      2.63%          238.592us    238.592us       1                []
mul                                          16.06%            1.789ms          16.06%        1.789ms     1.789ms        19.44%         1.762ms      1.762ms         1                []
mul                                          8.94%             996.700us        8.94%         996.700us   996.700us      10.73%         972.864us    972.864us       1                []
struct torch::autograd::CopyBackwards        1.47%             163.900us        1.47%         163.900us   163.900us      1.31%          118.688us    118.688us       1                []
to                                           1.40%             155.900us        1.40%         155.900us   155.900us      1.27%          114.944us    114.944us       1                []
empty_strided                                0.09%             10.300us         0.09%         10.300us    10.300us       0.01%          1.023us      1.023us         1                []
struct torch::autograd::AccumulateGrad       0.13%             15.000us         0.13%         15.000us    15.000us       0.06%          5.281us      5.281us         1                []
detach                                       0.04%             4.700us          0.04%         4.700us     4.700us        0.02%          1.760us      1.760us         1                []
------------------------------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ------------
Self CPU time total: 11.144ms
CUDA time total: 9.066ms
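As a minimal sketch of the record_shapes option described above (the tensor size and iteration count here are arbitrary choices, not from the original example), the profiler can be run with shape recording enabled and the averages grouped by input shape:

import torch

x = torch.randn((8, 8), requires_grad=True)
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for _ in range(10):
        y = (x ** 2).sum()
        y.backward()
# group identical ops that ran on identical input shapes into one row
print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))

The resulting table then shows an Input Shapes entry such as [[8, 8]] per row instead of the empty [] seen above.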
torch.autograd.profiler.record_function(name)
Context manager/function decorator that adds a label to a block of Python code (or a function) while the autograd profiler is running. It is useful when tracing a code profile.
>>> x = torch.randn((1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     with torch.autograd.profiler.record_function("label-z"): # label the block
...         z = y ** 3
...     y.backward()
...
>>> # NOTE: some columns were removed for brevity
>>> print(prof.key_averages().table(sort_by="self_cpu_time_total"))
-----------------------------------  ----------------  -------------  ---------------
Name                                 Self CPU total %  CPU time avg   Number of Calls
-----------------------------------  ----------------  -------------  ---------------
pow                                  60.77%            47.470us       3
mul                                  21.73%            25.465us       2
PowBackward0                         12.03%            121.891us      1
torch::autograd::AccumulateGrad      2.70%             6.324us        1
label-z                              2.13%             12.421us       1
torch::autograd::GraphRoot           0.64%             1.503us        1
-----------------------------------  ----------------  -------------  ---------------
Self CPU time total: 234.344us
CUDA time total: 0.000us
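Since record_function is documented as a function decorator as well, it can also label everything that runs inside a function. A minimal sketch, assuming a PyTorch version with decorator support (the square_cube helper and its label are illustrative, not part of the API):

import torch

# hypothetical helper; the "square-cube" label is an arbitrary choice
@torch.autograd.profiler.record_function("square-cube")
def square_cube(t):
    return (t ** 2) ** 3

x = torch.randn((1, 1), requires_grad=True)
with torch.autograd.profiler.profile() as prof:
    z = square_cube(x)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

The "square-cube" label then appears as its own row in the table, just like "label-z" above.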
torch.autograd.profiler.emit_nvtx(enabled=True, record_shapes=False)
Context manager that makes every autograd operation emit an NVTX range.
It is useful when running the program under nvprof:
nvprof --profile-from-start off -o trace_name.prof -- <regular command here>
Unfortunately, there is no way to force nvprof to flush the data it collected to disk, so for CUDA profiling one has to use this context manager to annotate nvprof traces and wait for the process to exit before inspecting them. Then either NVIDIA Visual Profiler (nvvp) can be used to visualize the timeline, or torch.autograd.profiler.load_nvprof() can load the results for inspection, e.g. in a Python REPL.
>>> with torch.cuda.profiler.profile():
...     model(x)  # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)
torch.autograd.profiler.load_nvprof(path)
Opens an nvprof trace file and parses autograd annotations.
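A minimal sketch of inspecting such a trace in a Python REPL, assuming the file written by the nvprof command above and that the loaded events support the same table() formatting as the in-process profiler:

import torch

# "trace_name.prof" is the file produced by the nvprof invocation shown earlier
events = torch.autograd.profiler.load_nvprof("trace_name.prof")
print(events.table(sort_by="cuda_time_total"))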
Layer-by-layer profiling of PyTorch models
The torchprof library can be used for layer-by-layer profiling of PyTorch models:
pip install torchprof
import torch
import torchvision
import torchprof

model = torchvision.models.alexnet(pretrained=False).cuda()
x = torch.rand([1, 3, 224, 224]).cuda()

with torchprof.Profile(model, use_cuda=True) as prof:
    model(x)

print(prof.display(show_events=False))  # equivalent to `print(prof)` and `print(prof.display())`
Module         | Self CPU total | CPU total | CUDA total | Occurrences
---------------|----------------|-----------|------------|------------
AlexNet        |                |           |            |
├── features   |                |           |            |
│├── 0         | 1.671ms        | 6.589ms   | 6.701ms    | 1
│├── 1         | 62.430us       | 62.430us  | 63.264us   | 1
│├── 2         | 62.909us       | 109.948us | 112.640us  | 1
│├── 3         | 225.389us      | 858.376us | 1.814ms    | 1
│├── 4         | 18.999us       | 18.999us  | 19.456us   | 1
│├── 5         | 29.560us       | 52.720us  | 54.272us   | 1
│├── 6         | 136.959us      | 511.216us | 707.360us  | 1
│├── 7         | 18.480us       | 18.480us  | 18.624us   | 1
│├── 8         | 84.380us       | 300.700us | 590.688us  | 1
│├── 9         | 18.249us       | 18.249us  | 17.632us   | 1
│├── 10        | 81.289us       | 289.946us | 470.016us  | 1
│├── 11        | 17.850us       | 17.850us  | 18.432us   | 1
│└── 12        | 29.350us       | 52.260us  | 52.288us   | 1
├── avgpool    | 41.840us       | 70.840us  | 76.832us   | 1
└── classifier |                |           |            |
 ├── 0         | 66.400us       | 122.110us | 125.920us  | 1
 ├── 1         | 293.658us      | 293.658us | 664.704us  | 1
 ├── 2         | 17.600us       | 17.600us  | 18.432us   | 1
 ├── 3         | 27.920us       | 49.030us  | 51.168us   | 1
 ├── 4         | 40.590us       | 40.590us  | 208.672us  | 1
 ├── 5         | 17.570us       | 17.570us  | 18.432us   | 1
 └── 6         | 40.489us       | 40.489us  | 81.920us   | 1
To view the low-level operations that occur inside each layer, call prof.display(show_events=True):
Module                        | Self CPU total | CPU total | CUDA total | Occurrences
------------------------------|----------------|-----------|------------|------------
AlexNet                       |                |           |            |
├── features                  |                |           |            |
│├── 0                        |                |           |            |
││├── conv2d                  | 13.370us       | 1.671ms   | 1.698ms    | 1
││├── convolution             | 12.730us       | 1.658ms   | 1.685ms    | 1
││├── _convolution            | 30.660us       | 1.645ms   | 1.673ms    | 1
││├── contiguous              | 6.970us        | 6.970us   | 7.136us    | 1
││└── cudnn_convolution       | 1.608ms        | 1.608ms   | 1.638ms    | 1
│├── 1                        |                |           |            |
││└── relu_                   | 62.430us       | 62.430us  | 63.264us   | 1
│├── 2                        |                |           |            |
││├── max_pool2d              | 15.870us       | 62.909us  | 63.488us   | 1
││└── max_pool2d_with_indices | 47.039us       | 47.039us  | 49.152us   | 1
...
The raw PyTorch event lists can be obtained by calling raw() on the profile instance.
trace, event_lists_dict = prof.raw()
print(trace[2])
# Trace(path=('AlexNet', 'features', '0'), leaf=True, module=Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2)))

print(event_lists_dict[trace[2].path][0])
---------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ------------
Name                   Self CPU total %  Self CPU total   CPU total %   CPU total   CPU time avg   CUDA total %   CUDA total   CUDA time avg   Number of Calls  Input Shapes
---------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ------------
conv2d                 0.80%             13.370us         100.00%       1.671ms     1.671ms        25.34%         1.698ms      1.698ms         1                []
convolution            0.76%             12.730us         99.20%        1.658ms     1.658ms        25.15%         1.685ms      1.685ms         1                []
_convolution           1.83%             30.660us         98.44%        1.645ms     1.645ms        24.97%         1.673ms      1.673ms         1                []
contiguous             0.42%             6.970us          0.42%         6.970us     6.970us        0.11%          7.136us      7.136us         1                []
cudnn_convolution      96.19%            1.608ms          96.19%        1.608ms     1.608ms        24.44%         1.638ms      1.638ms         1                []
---------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ------------
Self CPU time total: 1.671ms
CUDA time total: 6.701ms
Individual layers can be selected for profiling with the optional paths kwarg; profiling of all other layers is then skipped.
model = torchvision.models.alexnet(pretrained=False)
x = torch.rand([1, 3, 224, 224])

# Layer does not have to be a leaf layer
paths = [("AlexNet", "features", "3"), ("AlexNet", "classifier")]

with torchprof.Profile(model, paths=paths) as prof:
    model(x)

print(prof)
Module         | Self CPU total | CPU total | CUDA total | Occurrences
---------------|----------------|-----------|------------|------------
AlexNet        |                |           |            |
├── features   |                |           |            |
│├── 0         |                |           |            |
│├── 1         |                |           |            |
│├── 2         |                |           |            |
│├── 3         | 3.189ms        | 12.717ms  | 0.000us    | 1
│├── 4         |                |           |            |
│├── 5         |                |           |            |
│├── 6         |                |           |            |
│├── 7         |                |           |            |
│├── 8         |                |           |            |
│├── 9         |                |           |            |
│├── 10        |                |           |            |
│├── 11        |                |           |            |
│└── 12        |                |           |            |
├── avgpool    |                |           |            |
└── classifier | 13.403ms       | 14.011ms  | 0.000us    | 1
 ├── 0         |                |           |            |
 ├── 1         |                |           |            |
 ├── 2         |                |           |            |
 ├── 3         |                |           |            |
 ├── 4         |                |           |            |
 ├── 5         |                |           |            |
 └── 6         |                |           |            |