• Building TensorRT with the C++ API: A Deep Dive and Reusable Template (Engine Construction / Serialization & Deserialization / Inference)


      It has been a while since my last post on building TensorRT with the C++ API. I don't want to do what many blogs do: list a few steps, show how to assemble the network and construct the engine, mention that a logger is required, and stop there.

    To avoid the unnecessary hassle of downloading a .pth here or an .onnx there, this post builds a simple network in torch, converts it to the wts format, and then loads those wts weights through the C++ API to establish a reusable template for constructing TensorRT networks.

    For a simpler introductory example, see the earlier article "使用tensorRT C++ API搭建MLP网络详解", which also covers setting up the Visual Studio environment.

    1. Building the network in Python and exporting the wts weights

    Since the network is trivial, I won't walk through it; the code below builds it directly. No training is needed either — we simply save the .pth and export the weights. The following Python code produces the wts file.

    from torch import nn
    import torch
    import struct
    from torchsummary import summary
    import numpy as np
    import cv2
    
    
    class TRY(nn.Module):
        def __init__(self):
            super(TRY, self).__init__()
            self.cov1=nn.Conv2d(3,64,3,1)
            self.r1=nn.ReLU(inplace=True)
            self.conv2=nn.Conv2d(64,2,3,1)
    
    
        def forward(self, x):
            x = self.cov1(x)
            x = self.r1(x)
            x = self.conv2(x)
            return x
    
    def transform(img):
        img = img.astype('float32')
        img -= 127.5
        img *= 0.0078125
        img = np.transpose(img, (2, 0, 1))
        img = np.expand_dims(img, axis=0)
    
        return img
    
    
    def infer(img=None):
    
        net = torch.load('./try.pth')
        net = net.cuda()
        net = net.eval()
        print('model: ', net)
        # print('state dict: ', net.state_dict().keys())
        if img is None:
            tmp = torch.ones(1, 3, 224, 224).cuda()
        else:
            if isinstance(img, str):
                img = cv2.imread(img)
            tmp = transform(img)
            tmp = torch.from_numpy(tmp).cuda()
    
        print('input: ', tmp)
        out = net(tmp)
        out_index = torch.argmax(out).cpu().numpy()
        print('output_index:', out_index)
    
        summary(net, (3, 224, 224))
        # return
        f = open("try.wts", 'w')
        f.write("{}\n".format(len(net.state_dict().keys())))
        for k, v in net.state_dict().items():
            print('key: ', k)
            print('value: ', v.shape)
            vr = v.reshape(-1).cpu().numpy()
            f.write("{} {}".format(k, len(vr)))
            for vv in vr:
                f.write(" ")
                f.write(struct.pack(">f", float(vv)).hex())
            f.write("\n")
    
    
    def main_createnet():
        print('cuda device count: ', torch.cuda.device_count())
        net = TRY()
        net = net.eval()
        net = net.cuda()
        print(net)
        tmp = torch.ones(2, 3, 224, 224).cuda()
    
        out = net(tmp)
        print('out:', out.shape)
        torch.save(net, "./try.pth")
    
    
    if __name__ == '__main__':
        # main_createnet()
        infer(img='../dog.png')
    Building the network in torch and exporting wts

    The code above produces the wts file; starting from this file, we will build the TensorRT network template.
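    For reference, the wts file written above is plain text: the first line is the number of tensors in the state_dict, and each subsequent line holds a tensor's name, its element count, and every element as the big-endian IEEE-754 hex string produced by struct.pack(">f", v).hex(). For the two-conv model above (ReLU carries no parameters), the file looks roughly like this, with the hex words elided:

    4
    cov1.weight 1728 <1728 space-separated hex words>
    cov1.bias 64 <64 hex words>
    conv2.weight 1152 <1152 hex words>
    conv2.bias 2 <2 hex words>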

    2. Building the TensorRT network template

    The TensorRT version used here is around 8.2.

    Before building the template, I recommend the excellent reference code on GitHub (linked from the original post); it is well worth a careful read.

    I will focus on the key functions (the original post includes a network diagram); detailed explanations are given as comments in the code.

     ① Build the logger. The minimal version below, taken from the official samples, is sufficient; don't overthink it.

    class Logger : public ILogger
    {
        void log(Severity severity, const char* msg) noexcept override
        {
            // suppress info-level messages
            if (severity <= Severity::kWARNING)
                std::cout << msg << std::endl;
        }
    } gLogger;

    ② createEngine internally loads the wts weights converted from the torch network, builds the network layer by layer, and creates the engine. That engine can already be used for inference, with no serialize/deserialize round trip needed to rebuild it. But constructing the engine before every inference run is very slow, so this tutorial does not recommend it; I strongly suggest the engine-serialization approach instead. Still, I will walk through inference without serialization first, because it deepens understanding. (See the figure in the original post for details.)

     To avoid confusion, here is the modified main function; all the helper functions it calls appear in the full listing further below.

    int main() {
    
        bool serialize = false;
        std::string engine_path = "./model.engine";
    
        if (serialize) {
            // Serialize the model and save it to disk
            IHostMemory* modelStream{ nullptr };
            modelStream = engine2serialize(engine_path); // build the engine and save it
            modelStream->destroy();
        }
        else {
            // Load the engine and predict on an image
            std::string path = "./2.jpg";
            cv::Mat img = cv::imread(path);
            if (img.empty()) {
                std::cout << "input images error!" << std::endl;
                return 0;
            }
            cv::Mat imgInput;
            cv::resize(img, imgInput, cv::Size(INPUT_W, INPUT_H), 0, 0, cv::INTER_LINEAR);
    
            vector<cv::Mat> InputImage;
    
            InputImage.push_back(imgInput);
            InputImage.push_back(imgInput);
    
            float input[BatchSize * 3 * INPUT_H * INPUT_W];
            float output[BatchSize * 2 * INPUT_H * INPUT_W];
    
            ProcessImage(InputImage,input);
            
            //ICudaEngine* engine = inite_engine(engine_path);
    
            IBuilder* builder = createInferBuilder(gLogger);
            IBuilderConfig* config = builder->createBuilderConfig();
            ICudaEngine* engine = createEngine(BatchSize, builder, config, DataType::kFLOAT,  "./trynet.wts");
    
            IExecutionContext* context = engine->createExecutionContext();
            assert(context != nullptr);
    
        doInference(*context, input, output, BatchSize);
    
    
            cout << "output_count:" << sizeof(output) / sizeof(output[0]) << endl;
            cout << "BatchSize:" << BatchSize << "INPUT_H:" << INPUT_H << "INPUT_W:" << INPUT_W << endl;
            
            //for (int i = 0; i < BatchSize * 2 * INPUT_H * INPUT_W; i++) {
            //    cout << i << " " << output[i] << endl;
            //}

            context->destroy();
            engine->destroy();
            config->destroy();
            builder->destroy();
        }
    
    
    }

    ③ Once the engine is built, you can run inference through IExecutionContext* context = engine->createExecutionContext();. CUDA inference requires moving the input and output between host and device memory, so some data shuffling is needed; the doInference template function below handles this, and it too comes from an existing framework.

    ④ APIToModel, createEngine, and doInference are the basic building blocks. APIToModel ties the pieces together and serializes the engine; createEngine builds the network (the most important step) and turns it into an engine; doInference performs inference, transferring the input and output between host and CUDA memory. Usually only the input/output sizes need slight changes; the rest can be copied as-is.

     ⑤ From my tests, an engine can be built with batch size 1 and still serve inference with batch size n; once the engine is built and serialized, the inference width, height, and batch size could all be changed in my experiments. The dt parameter selects the input data type (float32, half, etc.).
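    For example, to build at half precision (dt = DataType::kHALF), a minimal sketch — my addition, not part of the original template, using standard TensorRT 8.x API — would also set the FP16 flag on the builder config inside createEngine:

    // Sketch (assumption, not in the original code): enable FP16 engine builds.
    if (builder->platformHasFastFp16()) {
        config->setFlag(BuilderFlag::kFP16);
    }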

    The complete TensorRT C++ API template follows.

    #include "NvInferRuntimeCommon.h"
    #include <cassert>
    #include "NvInfer.h"    // TensorRT library
    #include "iostream"     // Standard input/output library
    #include <map>          // for weight maps
    #include <fstream>      // for file-handling
    #include <chrono>       // for timing the execution<br>
    #include <assert.h>
    #include "NvInfer.h"
    #include "cuda_runtime_api.h"
    #include<opencv2/core/core.hpp>
    #include<opencv2/highgui/highgui.hpp>
    #include <opencv2/opencv.hpp>
    #include<vector>
    
    
    using namespace nvinfer1;
    using namespace std;
    
    
    
    static const int INPUT_H = 32;
    static const int INPUT_W = 32;
    static const int INPUT_C = 3;
    
    
    const char* INPUT_NAME = "data";
    const char* OUTPUT_NAME = "pred";
    static const int BatchSize = 2;

    // Load weights from the .wts file
    map<string, Weights> loadWeights(const string file) {
        /*
         * Parse the .wts file and store weights in dict format.
         * @param file path to .wts file
         * @return weight_map: dictionary containing weights and their values
         */
    
        std::cout << " Loading weights..." << file << std::endl;
        std::map<string, Weights> weightMap;  //定义声明
    
        // Open Weight file
        ifstream input(file);
        assert(input.is_open() && "[ERROR]: Unable to load weight file...");
    
        int32_t count;
        input >> count;  // first value in the file: how many weight tensors follow
        assert(count > 0 && "Invalid weight map file.");
    
        // Loop through number of line, actually the number of weights & biases
        while (count--) {
            // TensorRT weights
            Weights wt{ DataType::kFLOAT, nullptr, 0 };
            uint32_t size;
            // Read name and type of weights
            std::string w_name;
            input >> w_name >> std::dec >> size;
            wt.type = DataType::kFLOAT;
    
        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));

        for (uint32_t x = 0, y = size; x < y; ++x) {
            // Parse each hex-encoded (base-16) value back into a uint32 bit pattern
            input >> std::hex >> val[x];
        }
        wt.values = val;
        wt.count = size;
        // Store the weights under their name (key)
        weightMap[w_name] = wt;
        }
        return weightMap;
    }
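    To make the weight encoding concrete: the Python exporter writes each float as struct.pack(">f", v).hex(), and loadWeights reads it back as a uint32 bit pattern that TensorRT later interprets as float data. A self-contained round-trip check (my addition; it assumes a little-endian host, which is where the bit-pattern trick lines up):

    #include <cstdint>
    #include <cstring>
    #include <iostream>
    #include <sstream>

    int main() {
        // "3f800000" is struct.pack(">f", 1.0).hex() from the Python exporter.
        std::stringstream ss("3f800000");
        uint32_t bits;
        ss >> std::hex >> bits;             // same parsing as loadWeights above
        float f;
        std::memcpy(&f, &bits, sizeof(f));  // reinterpret the bit pattern as float
        std::cout << f << std::endl;        // prints 1
        return 0;
    }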
    
    
    
    // Logger (required by the builder and runtime)
    class Logger : public ILogger
    {
        void log(Severity severity, const char* msg) noexcept override
        {
            // suppress info-level messages
            if (severity <= Severity::kWARNING)
                std::cout << msg << std::endl;
        }
    } gLogger;
    
    #define CHECK(status) \
        do\
        {\
            auto ret = (status);\
            if (ret != 0)\
            {\
                std::cerr << "Cuda failure: " << ret << std::endl;\
                abort();\
            }\
        } while (0)


    // Build the network and create the engine
    ICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, string wts_path = "./trynet.wts")
    {
    
        INetworkDefinition* network = builder->createNetworkV2(0U);

        // Create the input tensor of shape { 3, INPUT_H, INPUT_W } with name INPUT_NAME
        ITensor* data = network->addInput(INPUT_NAME, dt, Dims3{ 3, INPUT_H, INPUT_W });
        assert(data);
    
        std::map<std::string, Weights> weightMap = loadWeights(wts_path);
    
        IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 64, DimsHW{ 3, 3 }, weightMap["cov1.weight"], weightMap["cov1.bias"]);
        assert(conv1);
        conv1->setName("cov1");  // set the layer name
        conv1->setPaddingNd(DimsHW{ 1, 1 });

        // ReLU between the two convolutions, mirroring self.r1 in the Python model
        IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);
        assert(relu1);

        // Note: the Python attribute is self.conv2, so the state_dict keys are conv2.weight / conv2.bias
        IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), 2, DimsHW{ 3, 3 }, weightMap["conv2.weight"], weightMap["conv2.bias"]);
        assert(conv2);
        conv2->setPaddingNd(DimsHW{ 1, 1 });
        conv2->setName("conv2");  // set the layer name

        conv2->getOutput(0)->setName(OUTPUT_NAME);
        network->markOutput(*conv2->getOutput(0));

        // Build the engine; this part is reusable for any network
        builder->setMaxBatchSize(maxBatchSize);
        config->setMaxWorkspaceSize(1 << 20);
        ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    
        network->destroy();  // the network definition is no longer needed; free it
    
        // Release host memory
        for (auto& mem : weightMap)
        {
            free((void*)(mem.second.values));
        }
    
        return engine;
    }
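    Extending createEngine to deeper models follows the same pattern: feed the previous layer's output tensor into the next add*() call, looking up any weights by their state_dict key. As an illustrative fragment (my addition, not in the original template; it assumes the relu1 layer from the function above), a parameter-free 2x2 max-pooling layer would be:

    // Illustrative fragment (assumption): 2x2 max pooling after relu1.
    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });
    assert(pool1);
    pool1->setStrideNd(DimsHW{ 2, 2 });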
    
    // Build the model and serialize it
    void APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream, DataType dt, string wts_path = "./trynet.wts")
    {
        // Create builder
        IBuilder* builder = createInferBuilder(gLogger);
        IBuilderConfig* config = builder->createBuilderConfig();
    
        // Create the model to populate the network, then set the outputs and create the engine
        ICudaEngine* engine = createEngine(maxBatchSize, builder, config, dt, wts_path);
        assert(engine != nullptr);
    
        // Serialize the engine
        (*modelStream) = engine->serialize();
    
        // Close everything down
        engine->destroy();
        config->destroy();
        builder->destroy();
    }
    
    
    
    // Build the engine, serialize it, and save it to disk
    IHostMemory* engine2serialize(std::string engine_path = "model.engine", DataType dt = DataType::kFLOAT)
    {
       
        IHostMemory* modelStream{ nullptr };
        APIToModel(1, &modelStream, dt);  // built with maxBatchSize = 1
        assert(modelStream != nullptr);

        std::ofstream p(engine_path, std::ios::binary);
        if (!p)
        {
            std::cerr << "could not open plan output file" << std::endl;
            modelStream->destroy();
            return nullptr;
        }
        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());
        // The caller owns modelStream and must call destroy() on it
        return modelStream;
    
    }
    
    
    
    
    /*****************************************         Inference               ***********************************************************/


    // Pack the images into one flat batched buffer (NCHW layout), the one-dimensional input format TensorRT expects
    void ProcessImage(std::vector<cv::Mat> InputImage, float input_data[]) {
    
        int ImgCount = InputImage.size();
        assert(ImgCount == BatchSize);
        //float input_data[BatchSize * 3 * INPUT_H * INPUT_W];
        for (int b = 0; b < ImgCount; b++) {
            cv::Mat img = InputImage.at(b);
            int w = img.cols;
            int h = img.rows;
            int i = 0;
            // HWC/BGR (OpenCV) -> CHW/RGB, normalized to [0, 1]
            for (int row = 0; row < h; ++row) {
                uchar* uc_pixel = img.data + row * img.step;
                for (int col = 0; col < INPUT_W; ++col) {
                    input_data[b * 3 * INPUT_H * INPUT_W + i] = (float)uc_pixel[2] / 255.0;                         // R
                    input_data[b * 3 * INPUT_H * INPUT_W + i + INPUT_H * INPUT_W] = (float)uc_pixel[1] / 255.0;     // G
                    input_data[b * 3 * INPUT_H * INPUT_W + i + 2 * INPUT_H * INPUT_W] = (float)uc_pixel[0] / 255.0; // B
                    uc_pixel += 3;
                    ++i;
                }
            }
    
        }
    
    }
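    One caveat worth flagging: the Python transform above normalizes with (x - 127.5) * 0.0078125, while ProcessImage divides by 255. Inference preprocessing must match training preprocessing; if your model was trained with the Python transform, the corresponding change here (sketched for the R channel only, my addition) would be:

    // Sketch (assumption): match the Python transform (x - 127.5) * 0.0078125
    input_data[b * 3 * INPUT_H * INPUT_W + i] = ((float)uc_pixel[2] - 127.5f) * 0.0078125f;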
    
    
    // Read the engine file, deserialize it, and reconstruct the engine; this is effectively network initialization
    ICudaEngine* inite_engine(std::string engine_path) {
        char* trtModelStream{ nullptr };  // buffer that will hold the serialized engine bytes
        size_t size{ 0 };
        // read model from the engine file
        std::ifstream file(engine_path, std::ios::binary);
        if (file.good()) {
            file.seekg(0, file.end);
            size = file.tellg();
            file.seekg(0, file.beg);
            trtModelStream = new char[size];
            assert(trtModelStream);
            file.read(trtModelStream, size);
            file.close();
        }
        // Create a runtime (required for deserializing the model) with NVIDIA's logger
        IRuntime* runtime = createInferRuntime(gLogger);
        assert(runtime != nullptr);
        // Deserialize the engine from the char stream
        ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);
        assert(engine != nullptr);
        delete[] trtModelStream;  // the byte buffer is no longer needed

        /*
        One engine can own multiple execution contexts, allowing the same set of
        weights to serve several inference tasks, e.g. processing images in
        parallel CUDA streams with one engine and one context per stream. Each
        context is created on the same GPU as its engine.
        */
        runtime->destroy();
        return engine;
    }
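    As the comment above notes, one engine can serve several execution contexts. A hedged fragment (my addition, not in the original template) of what per-stream contexts can look like:

    // Illustrative fragment (assumption): two contexts sharing one engine,
    // e.g. one per CUDA stream.
    IExecutionContext* ctxA = engine->createExecutionContext();
    IExecutionContext* ctxB = engine->createExecutionContext();
    // ... run doInference-style enqueues on separate cudaStream_t objects ...
    ctxA->destroy();
    ctxB->destroy();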
    
    // Runs inference once the engine context is created; adjust the input/output sizes here to match your network
    void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
    {
        const ICudaEngine& engine = context.getEngine();
    
        // Pointers to input and output device buffers to pass to engine.
        // Engine requires exactly IEngine::getNbBindings() number of buffers.
        assert(engine.getNbBindings() == 2);
        void* buffers[2];
    
        // In order to bind the buffers, we need to know the names of the input and output tensors.
        // Note that indices are guaranteed to be less than IEngine::getNbBindings()
        const int inputIndex = engine.getBindingIndex(INPUT_NAME);
        const int outputIndex = engine.getBindingIndex(OUTPUT_NAME);
    
        // Create GPU buffers on device
        // Create GPU buffers on the device (CHECK validates the CUDA call; optional)
        CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));
        CHECK(cudaMalloc(&buffers[outputIndex], batchSize * 2 * INPUT_H * INPUT_W * sizeof(float)));
    
        // Create stream
        cudaStream_t stream;
        cudaStreamCreate(&stream);
    
        // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
        cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream);
        context.enqueue(batchSize, buffers, stream, nullptr);  // TensorRT execution is usually asynchronous, so enqueue the kernels on the CUDA stream
        cudaMemcpyAsync(output, buffers[outputIndex], batchSize * 2 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
    
        // Release stream and buffers
        cudaStreamDestroy(stream);
        cudaFree(buffers[inputIndex]);
        cudaFree(buffers[outputIndex]);
    }
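    The Python script finishes with torch.argmax(out), while the main below only prints buffer sizes. A minimal post-processing sketch (my addition; argmaxOutput is a hypothetical helper, not part of the original template) that mirrors the argmax step over the raw output buffer:

    // Hypothetical helper (not in the original template): index of the largest
    // value in the raw output buffer, mirroring torch.argmax(out) in Python.
    static int argmaxOutput(const float* output, int count) {
        int best = 0;
        for (int i = 1; i < count; ++i)
            if (output[i] > output[best]) best = i;
        return best;
    }

    In main, argmaxOutput(output, BatchSize * 2 * INPUT_H * INPUT_W) would give the flat index of the strongest activation.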
    
       
    
    int main() {
    
        bool serialize = false;
        std::string engine_path = "./model.engine";
    
        if (serialize) {
            // Serialize the model and save it to disk
            IHostMemory* modelStream{ nullptr };
            modelStream = engine2serialize(engine_path);  // build the engine and save it
            modelStream->destroy();
        }
        else {
            // Load the engine and predict on an image
            std::string path = "./2.jpg";
            cv::Mat img = cv::imread(path);
            if (img.empty()) {
                std::cout << "input images error!" << std::endl;
                return 0;
            }
            cv::Mat imgInput;
            cv::resize(img, imgInput, cv::Size(INPUT_W, INPUT_H), 0, 0, cv::INTER_LINEAR);
    
            vector<cv::Mat> InputImage;
    
            InputImage.push_back(imgInput);
            InputImage.push_back(imgInput);
    
            float input[BatchSize * 3 * INPUT_H * INPUT_W];
            float output[BatchSize * 2 * INPUT_H * INPUT_W];
    
            ProcessImage(InputImage,input);
            ICudaEngine* engine = inite_engine(engine_path);
    
            IExecutionContext* context = engine->createExecutionContext();
            assert(context != nullptr);
     
    
        doInference(*context, input, output, BatchSize);
    
    
            cout << "output_count:" << sizeof(output) / sizeof(output[0]) << endl;
            cout << "BatchSize:" << BatchSize << "INPUT_H:" << INPUT_H << "INPUT_W:" << INPUT_W << endl;
            
            //for (int i = 0; i < BatchSize * 2 * INPUT_H * INPUT_W; i++) {
            //    cout << i << " " << output[i] << endl;
            //}

            context->destroy();
            engine->destroy();
        }
    
    
    }
    Building the network with the TensorRT C++ API

    Results: the output screenshot is in the original post. Note the run order implied by main: run once with serialize = true to write model.engine, then run with serialize = false to load it and predict.

    That covers this deeper look at understanding and implementing TensorRT. If you spot any problems, corrections are welcome.

    Next time I will cover building and parsing the network from ONNX.

    Finally, there is nothing more to copy; just mirror the configuration below to set up Visual Studio on Windows 10:

    System PATH entries:
    E:\InstallPackage\TensorRT-8.4.0.6\lib
    E:\InstallPackage\opencv\build\x64\vc15\bin
    Note: the engine file itself needs no PATH configuration
    Include directories:
    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\include
    E:\InstallPackage\eigen-3.4.0
    E:\InstallPackage\opencv\build\include\opencv2
    E:\InstallPackage\opencv\build\include
    E:\InstallPackage\TensorRT-8.4.0.6\include
    
    Library directories:
    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\lib\x64
    E:\InstallPackage\opencv\build\x64\vc15\lib
    E:\InstallPackage\TensorRT-8.4.0.6\lib
    
    Linker --> Additional Dependencies:
    opencv_world455d.lib
    nvinfer.lib
    nvinfer_plugin.lib
    nvonnxparser.lib
    nvparsers.lib
    cuda.lib
    cudart.lib


  • Original post: https://www.cnblogs.com/tangjunjun/p/16476121.html