
YOLOv8 Instance Segmentation: TensorRT Deployment in Practice


Contents

0 Introduction

1 Generating the ONNX model

2 Converting the ONNX model to a TensorRT engine

3 TensorRT inference

3.1 yolov8n-seg segmentation results

3.2 yolov8s-seg segmentation results

3.3 yolov8m-seg segmentation results

3.4 yolov8l-seg segmentation results

3.5 yolov8x-seg segmentation results


0 Introduction

        ultralytics has released the yolov8 model on GitHub, which supports fast classification, object detection and instance segmentation; results with the official yolov8s-seg.pt are shown in the figure below:

        As before, this post walks through accelerated inference for the instance segmentation model, with a C++ TensorRT implementation. There are no heavy file dependencies, just three cpp program files and nothing extra, so this is about as simple as an inference version gets. The code is here: Yolov8-instance-seg-tensorrt. My environment is cuda10.2, cudnn8.2.4, Tensorrt8.0.1.6, Opencv4.5.4. The program has been tested with yolov8[n s m l x]-seg.pt, all of which work. The file list is as follows:


   
├── CMakeLists.txt
├── images
│   ├── bus.jpg
│   └── zidane.jpg
├── logging.h
├── main1_onnx2trt.cpp
├── main2_trt_infer.cpp
├── models
│   ├── yolov8s-seg.engine
│   ├── yolov8s-seg.onnx
│   ├── yolov8n-seg.engine
│   └── yolov8n-seg.onnx
├── output.jpg
├── README.md
└── utils.h

1 Generating the ONNX model

        yolov8 provides installation instructions and the corresponding usage. After downloading the model you need from the website, generate the required onnx model with the following commands:


   
pip install ultralytics
yolo task=segment mode=export model=yolov8[n s m l x]-seg.pt format=onnx opset=12

        Note that because my onnx version is not the latest, I use opset=12; the official model default is opset=17, which can be found in ultralytics/yolo/configs/default.yaml.

2 Converting the ONNX model to a TensorRT engine

        The official code can generate the engine directly, but I do not recommend it, because the generated engine is tied to the machine environment. Once you switch environments, for example move to a deployment server, an engine generated on the previous machine will no longer work unless the two environments are exactly the same, and you would also have to install all the required python libraries again. So we only export the onnx model and then do the conversion through the TensorRT API.

        First clone my repo, then it only takes a few steps:

        1. cd into the cloned repo, i.e. the Yolov8-instance-seg-tensorrt directory
        2. Copy yolov8[n s l m x]-seg.onnx into the models/ directory
        3. Run the commands below to build the conversion and inference executables --> onnx2trt and trt_infer


  
mkdir build
cd build
cmake ..
make
sudo ./onnx2trt ../models/yolov8s-seg.onnx ../models/yolov8s-seg.engine

        The code behind onnx2trt is shown below; as you can see, the main API we rely on is the ONNX parser.


  
#include <iostream>
#include <fstream>
#include "logging.h"
#include "NvOnnxParser.h"
#include "NvInfer.h"

using namespace nvinfer1;
using namespace nvonnxparser;

static Logger gLogger;

int main(int argc, char** argv) {
    // Default paths, used when the program is run without arguments
    const char* onnx_filename = "../../models/yolov8n-seg.onnx";
    const char* engine_filename = "../../models/yolov8n-seg.engine";
    if (argc > 2) {
        onnx_filename = argv[1];
        engine_filename = argv[2];
    }

    // 1. Parse the ONNX model
    IBuilder* builder = createInferBuilder(gLogger);
    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
    nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile(onnx_filename, static_cast<int>(Logger::Severity::kWARNING));
    for (int i = 0; i < parser->getNbErrors(); ++i) {
        std::cout << parser->getError(i)->desc() << std::endl;
    }
    std::cout << "successfully load the onnx model" << std::endl;

    // 2. Build the engine
    unsigned int maxBatchSize = 1;
    builder->setMaxBatchSize(maxBatchSize);
    IBuilderConfig* config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1 << 20);
    //config->setMaxWorkspaceSize(128 * (1 << 20)); // 128MB
    config->setFlag(BuilderFlag::kFP16);
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

    // 3. Serialize the engine and write it to disk
    IHostMemory* gieModelStream = engine->serialize();
    std::ofstream p(engine_filename, std::ios::binary);
    if (!p) {
        std::cerr << "could not open plan output file" << std::endl;
        return -1;
    }
    p.write(reinterpret_cast<const char*>(gieModelStream->data()), gieModelStream->size());
    gieModelStream->destroy();
    std::cout << "successfully generate the trt engine model" << std::endl;
    return 0;
}

        With the steps above we obtain the corresponding yolov8[n s m l x]-seg.engine model.
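        Before writing the full inference program, it can be handy to sanity-check the converted engine. Below is a minimal sketch of mine (not part of the repo; it assumes the same TensorRT 8.0-era API and the logging.h used above) that simply deserializes the engine and prints the binding names and shapes. For yolov8s-seg you would expect images 1x3x640x640, output0 1x116x8400 and output1 1x32x160x160.

#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>
#include "NvInfer.h"
#include "logging.h"

static Logger gLogger;

int main(int argc, char** argv) {
    const char* engine_path = (argc > 1) ? argv[1] : "../models/yolov8s-seg.engine";

    // Read the serialized engine from disk
    std::ifstream file(engine_path, std::ios::binary);
    if (!file.good()) { std::cerr << "cannot open " << engine_path << std::endl; return 1; }
    std::vector<char> data((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());

    // Deserialize and inspect the bindings
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(data.data(), data.size(), nullptr);
    if (!engine) { std::cerr << "deserialization failed" << std::endl; return 1; }

    for (int b = 0; b < engine->getNbBindings(); ++b) {
        nvinfer1::Dims d = engine->getBindingDimensions(b);
        std::cout << (engine->bindingIsInput(b) ? "input  " : "output ") << engine->getBindingName(b) << ": ";
        for (int i = 0; i < d.nbDims; ++i) std::cout << d.d[i] << (i + 1 < d.nbDims ? "x" : "\n");
    }

    engine->destroy();
    runtime->destroy();
    return 0;
}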

3 TensorRT inference

        The inference code differs from the earlier yolov5 instance segmentation; the main difference is shown in the figure below, with v5-seg on the left and v8-seg on the right.

        In v8, output0 contains 8400 candidates, each with 116 values, whereas v5 has 25200 candidates, each with 117 values. Briefly, the 116 values of a v8 candidate are 4+80+32: 4 for the box (cx cy w h), 80 for the per-class confidences, and 32 for the mask coefficients used by segmentation. The difference from v5 is that the objectness confidence is gone: v5 is 4+1+80+32, where the 1 is the objectness score.
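        In memory, output0 has shape [1, 116, 8400] stored row-major, so attribute a of candidate i sits at offset a * 8400 + i; this is why the post-processing below reads values with a stride of Num_box. A small self-contained illustration of mine (constant names here are just for this sketch, not from the repo):

// Memory layout of output0 (shape [1, 116, 8400], row-major):
// attribute a of candidate i is at prob[a * NUM_BOX + i].
const int NUM_BOX = 8400;      // number of candidates
const int NUM_CLASSES = 80;    // per-class confidences
const int SEG_CHANNELS = 32;   // mask coefficients

inline float attr(const float* prob, int a, int i) { return prob[a * NUM_BOX + i]; }

// Split one candidate column into box, class scores and mask coefficients.
void decode_candidate(const float* prob, int i,
                      float box[4], float cls[NUM_CLASSES], float coef[SEG_CHANNELS]) {
    for (int a = 0; a < 4; ++a)            box[a]  = attr(prob, a, i);                   // cx, cy, w, h
    for (int c = 0; c < NUM_CLASSES; ++c)  cls[c]  = attr(prob, 4 + c, i);               // class confidences
    for (int m = 0; m < SEG_CHANNELS; ++m) coef[m] = attr(prob, 4 + NUM_CLASSES + m, i); // mask coefficients
}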

 The inference code is as follows:


  
  1. #include "NvInfer.h"
  2. #include "cuda_runtime_api.h"
  3. #include "NvInferPlugin.h"
  4. #include "logging.h"
  5. #include <opencv2/opencv.hpp>
  6. #include "utils.h"
  7. #include <string>
  8. using namespace nvinfer1;
  9. using namespace cv;
  10. // stuff we know about the network and the input/output blobs
  11. static const int INPUT_H = 640;
  12. static const int INPUT_W = 640;
  13. static const int _segWidth = 160;
  14. static const int _segHeight = 160;
  15. static const int _segChannels = 32;
  16. static const int CLASSES = 80;
  17. static const int Num_box = 8400;
  18. static const int OUTPUT_SIZE = Num_box * (CLASSES+ 4 + _segChannels); //output0
  19. static const int OUTPUT_SIZE1 = _segChannels * _segWidth * _segHeight ; //output1
  20. static const float CONF_THRESHOLD = 0.1;
  21. static const float NMS_THRESHOLD = 0.5;
  22. static const float MASK_THRESHOLD = 0.5;
  23. const char* INPUT_BLOB_NAME = "images";
  24. const char* OUTPUT_BLOB_NAME = "output0"; //detect
  25. const char* OUTPUT_BLOB_NAME1 = "output1"; //mask
  26. struct OutputSeg {
  27. int id; //结果类别id
  28. float confidence; //结果置信度
  29. cv::Rect box; //矩形框
  30. cv::Mat boxMask; //矩形框内mask,节省内存空间和加快速度
  31. };
  32. void DrawPred(Mat& img,std:: vector<OutputSeg> result) {
  33. //生成随机颜色
  34. std::vector<Scalar> color;
  35. srand( time( 0));
  36. for ( int i = 0; i < CLASSES; i++) {
  37. int b = rand() % 256;
  38. int g = rand() % 256;
  39. int r = rand() % 256;
  40. color. push_back( Scalar(b, g, r));
  41. }
  42. Mat mask = img. clone();
  43. for ( int i = 0; i < result. size(); i++) {
  44. int left, top;
  45. left = result[i].box.x;
  46. top = result[i].box.y;
  47. int color_num = i;
  48. rectangle(img, result[i].box, color[result[i].id], 2, 8);
  49. mask(result[i].box). setTo(color[result[i].id], result[i].boxMask);
  50. char label[ 100];
  51. sprintf(label, "%d:%.2f", result[i].id, result[i].confidence);
  52. //std::string label = std::to_string(result[i].id) + ":" + std::to_string(result[i].confidence);
  53. int baseLine;
  54. Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
  55. top = max(top, labelSize.height);
  56. putText(img, label, Point(left, top), FONT_HERSHEY_SIMPLEX, 1, color[result[i].id], 2);
  57. }
  58. addWeighted(img, 0.5, mask, 0.8, 1, img); //将mask加在原图上面
  59. }
  60. static Logger gLogger;
  61. void doInference(IExecutionContext& context, float* input, float* output, float* output1, int batchSize)
  62. {
  63. const ICudaEngine& engine = context. getEngine();
  64. // Pointers to input and output device buffers to pass to engine.
  65. // Engine requires exactly IEngine::getNbBindings() number of buffers.
  66. assert(engine. getNbBindings() == 3);
  67. void* buffers[ 3];
  68. // In order to bind the buffers, we need to know the names of the input and output tensors.
  69. // Note that indices are guaranteed to be less than IEngine::getNbBindings()
  70. const int inputIndex = engine. getBindingIndex(INPUT_BLOB_NAME);
  71. const int outputIndex = engine. getBindingIndex(OUTPUT_BLOB_NAME);
  72. const int outputIndex1 = engine. getBindingIndex(OUTPUT_BLOB_NAME1);
  73. // Create GPU buffers on device
  74. CHECK( cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof( float))); //
  75. CHECK( cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof( float)));
  76. CHECK( cudaMalloc(&buffers[outputIndex1], batchSize * OUTPUT_SIZE1 * sizeof( float)));
  77. // cudaMalloc分配内存 cudaFree释放内存 cudaMemcpy或 cudaMemcpyAsync 在主机和设备之间传输数据
  78. // cudaMemcpy cudaMemcpyAsync 显式地阻塞传输 显式地非阻塞传输
  79. // Create stream
  80. cudaStream_t stream;
  81. CHECK( cudaStreamCreate(&stream));
  82. // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
  83. CHECK( cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof( float), cudaMemcpyHostToDevice, stream));
  84. context. enqueue(batchSize, buffers, stream, nullptr);
  85. CHECK( cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof( float), cudaMemcpyDeviceToHost, stream));
  86. CHECK( cudaMemcpyAsync(output1, buffers[outputIndex1], batchSize * OUTPUT_SIZE1 * sizeof( float), cudaMemcpyDeviceToHost, stream));
  87. cudaStreamSynchronize(stream);
  88. // Release stream and buffers
  89. cudaStreamDestroy(stream);
  90. CHECK( cudaFree(buffers[inputIndex]));
  91. CHECK( cudaFree(buffers[outputIndex]));
  92. CHECK( cudaFree(buffers[outputIndex1]));
  93. }
  94. int main(int argc, char** argv)
  95. {
  96. if (argc < 2) {
  97. argv[ 1] = "../models/yolov8n-seg.engine";
  98. argv[ 2] = "../images/bus.jpg";
  99. }
  100. // create a model using the API directly and serialize it to a stream
  101. char* trtModelStream{ nullptr }; //char* trtModelStream==nullptr; 开辟空指针后 要和new配合使用,比如89行 trtModelStream = new char[size]
  102. size_t size{ 0 }; //与int固定四个字节不同有所不同,size_t的取值range是目标平台下最大可能的数组尺寸,一些平台下size_t的范围小于int的正数范围,又或者大于unsigned int. 使用Int既有可能浪费,又有可能范围不够大。
  103. std::ifstream file(argv[1], std::ios::binary);
  104. if (file. good()) {
  105. std::cout << "load engine success" << std::endl;
  106. file. seekg( 0, file.end); //指向文件的最后地址
  107. size = file. tellg(); //把文件长度告诉给size
  108. //std::cout << "\nfile:" << argv[1] << " size is";
  109. //std::cout << size << "";
  110. file. seekg( 0, file.beg); //指回文件的开始地址
  111. trtModelStream = new char[size]; //开辟一个char 长度是文件的长度
  112. assert(trtModelStream); //
  113. file. read(trtModelStream, size); //将文件内容传给trtModelStream
  114. file. close(); //关闭
  115. }
  116. else {
  117. std::cout << "load engine failed" << std::endl;
  118. return 1;
  119. }
  120. Mat src = imread(argv[ 2], 1);
  121. if (src. empty()) { std::cout << "image load faild" << std::endl; return 1; }
  122. int img_width = src.cols;
  123. int img_height = src.rows;
  124. std::cout << "宽高:" << img_width << " " << img_height << std::endl;
  125. // Subtract mean from image
  126. static float data[ 3 * INPUT_H * INPUT_W];
  127. Mat pr_img0, pr_img;
  128. std::vector< int> padsize;
  129. pr_img = preprocess_img(src, INPUT_H, INPUT_W, padsize); // Resize
  130. int newh = padsize[ 0], neww = padsize[ 1], padh = padsize[ 2], padw = padsize[ 3];
  131. float ratio_h = ( float)src.rows / newh;
  132. float ratio_w = ( float)src.cols / neww;
  133. int i = 0; // [1,3,INPUT_H,INPUT_W]
  134. //std::cout << "pr_img.step" << pr_img.step << std::endl;
  135. for ( int row = 0; row < INPUT_H; ++row) {
  136. uchar* uc_pixel = pr_img.data + row * pr_img.step; //pr_img.step=widthx3 就是每一行有width个3通道的值
  137. for ( int col = 0; col < INPUT_W; ++col)
  138. {
  139. data[i] = ( float)uc_pixel[ 2] / 255.0;
  140. data[i + INPUT_H * INPUT_W] = ( float)uc_pixel[ 1] / 255.0;
  141. data[i + 2 * INPUT_H * INPUT_W] = ( float)uc_pixel[ 0] / 255.;
  142. uc_pixel += 3;
  143. ++i;
  144. }
  145. }
  146. IRuntime* runtime = createInferRuntime(gLogger);
  147. assert(runtime != nullptr);
  148. bool didInitPlugins = initLibNvInferPlugins( nullptr, "");
  149. ICudaEngine* engine = runtime-> deserializeCudaEngine(trtModelStream, size, nullptr);
  150. assert(engine != nullptr);
  151. IExecutionContext* context = engine-> createExecutionContext();
  152. assert(context != nullptr);
  153. delete[] trtModelStream;
  154. // Run inference
  155. static float prob[OUTPUT_SIZE];
  156. static float prob1[OUTPUT_SIZE1];
  157. //for (int i = 0; i < 10; i++) {//计算10次的推理速度
  158. // auto start = std::chrono::system_clock::now();
  159. // doInference(*context, data, prob, prob1, 1);
  160. // auto end = std::chrono::system_clock::now();
  161. // std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;
  162. // }
  163. auto start = std::chrono::system_clock:: now();
  164. doInference(*context, data, prob, prob1, 1);
  165. auto end = std::chrono::system_clock:: now();
  166. std::cout << "推理时间:" << std::chrono:: duration_cast<std::chrono::milliseconds>(end - start). count() << "ms" << std::endl;
  167. std::vector< int> classIds; //结果id数组
  168. std::vector< float> confidences; //结果每个id对应置信度数组
  169. std::vector<cv::Rect> boxes; //每个id矩形框
  170. std::vector<cv::Mat> picked_proposals; //后续计算mask
  171. // 处理box
  172. int net_length = CLASSES + 4 + _segChannels;
  173. cv::Mat out1 = cv:: Mat(net_length, Num_box, CV_32F, prob);
  174. start = std::chrono::system_clock:: now();
  175. for ( int i = 0; i < Num_box; i++) {
  176. //输出是1*net_length*Num_box;所以每个box的属性是每隔Num_box取一个值,共net_length个值
  177. cv::Mat scores = out1( Rect(i, 4, 1, CLASSES)). clone();
  178. Point classIdPoint;
  179. double max_class_socre;
  180. minMaxLoc(scores, 0, &max_class_socre, 0, &classIdPoint);
  181. max_class_socre = ( float)max_class_socre;
  182. if (max_class_socre >= CONF_THRESHOLD) {
  183. cv::Mat temp_proto = out1( Rect(i, 4 + CLASSES, 1, _segChannels)). clone();
  184. picked_proposals. push_back(temp_proto. t());
  185. float x = (out1. at< float>( 0, i) - padw) * ratio_w; //cx
  186. float y = (out1. at< float>( 1, i) - padh) * ratio_h; //cy
  187. float w = out1. at< float>( 2, i) * ratio_w; //w
  188. float h = out1. at< float>( 3, i) * ratio_h; //h
  189. int left = MAX((x - 0.5 * w), 0);
  190. int top = MAX((y - 0.5 * h), 0);
  191. int width = ( int)w;
  192. int height = ( int)h;
  193. if (width <= 0 || height <= 0) { continue; }
  194. classIds. push_back(classIdPoint.y);
  195. confidences. push_back(max_class_socre);
  196. boxes. push_back( Rect(left, top, width, height));
  197. }
  198. }
  199. //执行非最大抑制以消除具有较低置信度的冗余重叠框(NMS)
  200. std::vector< int> nms_result;
  201. cv::dnn:: NMSBoxes(boxes, confidences, CONF_THRESHOLD, NMS_THRESHOLD, nms_result);
  202. std::vector<cv::Mat> temp_mask_proposals;
  203. std::vector<OutputSeg> output;
  204. for ( int i = 0; i < nms_result. size(); ++i) {
  205. int idx = nms_result[i];
  206. OutputSeg result;
  207. result.id = classIds[idx];
  208. result.confidence = confidences[idx];
  209. result.box = boxes[idx];
  210. output. push_back(result);
  211. temp_mask_proposals. push_back(picked_proposals[idx]);
  212. }
  213. // 处理mask
  214. Mat maskProposals;
  215. for ( int i = 0; i < temp_mask_proposals. size(); ++i)
  216. maskProposals. push_back(temp_mask_proposals[i]);
  217. Mat protos = Mat(_segChannels, _segWidth * _segHeight, CV_32F, prob1);
  218. Mat matmulRes = (maskProposals * protos). t(); //n*32 32*25600 A*B是以数学运算中矩阵相乘的方式实现的,要求A的列数等于B的行数时
  219. Mat masks = matmulRes. reshape(output. size(), { _segWidth,_segHeight }); //n*160*160
  220. std::vector<Mat> maskChannels;
  221. cv:: split(masks, maskChannels);
  222. Rect roi(int((float)padw / INPUT_W * _segWidth), int((float)padh / INPUT_H * _segHeight), int(_segWidth - padw / 2), int(_segHeight - padh / 2));
  223. for ( int i = 0; i < output. size(); ++i) {
  224. Mat dest, mask;
  225. cv:: exp(-maskChannels[i], dest); //sigmoid
  226. dest = 1.0 / ( 1.0 + dest); //160*160
  227. dest = dest(roi);
  228. resize(dest, mask, cv:: Size(src.cols, src.rows), INTER_NEAREST);
  229. //crop----截取box中的mask作为该box对应的mask
  230. Rect temp_rect = output[i].box;
  231. mask = mask(temp_rect) > MASK_THRESHOLD;
  232. output[i].boxMask = mask;
  233. }
  234. end = std::chrono::system_clock:: now();
  235. std::cout << "后处理时间:" << std::chrono:: duration_cast<std::chrono::milliseconds>(end - start). count() << "ms" << std::endl;
  236. DrawPred(src, output);
  237. cv:: imshow( "output.jpg", src);
  238. char c = cv:: waitKey( 0);
  239. // Destroy the engine
  240. context-> destroy();
  241. engine-> destroy();
  242. runtime-> destroy();
  243. system( "pause");
  244. return 0;
  245. }
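        One note on dependencies: the code above calls preprocess_img from utils.h, which is not listed in this post. For reference, here is a minimal letterbox-style sketch of what such a function could look like (my own reconstruction, so the real utils.h in the repo may differ); it fills padsize with {newh, neww, padh, padw}, exactly as the code above unpacks it.

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

// Letterbox resize: scale the image to fit input_w x input_h while keeping the aspect
// ratio, then pad the borders with gray (114). padsize receives {newh, neww, padh, padw}.
cv::Mat preprocess_img(const cv::Mat& img, int input_h, int input_w, std::vector<int>& padsize) {
    float r = std::min((float)input_w / img.cols, (float)input_h / img.rows);
    int neww = (int)(img.cols * r);
    int newh = (int)(img.rows * r);
    int padw = (input_w - neww) / 2;
    int padh = (input_h - newh) / 2;

    cv::Mat resized;
    cv::resize(img, resized, cv::Size(neww, newh));
    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(114, 114, 114));
    resized.copyTo(out(cv::Rect(padw, padh, neww, newh)));

    padsize = { newh, neww, padh, padw };
    return out;
}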

        Finally, the results for each network size are shown in the figures below:

3.1 yolov8n-seg segmentation results

3.2 yolov8s-seg segmentation results

3.3 yolov8m-seg segmentation results

3.4 yolov8l-seg segmentation results

3.5 yolov8x-seg segmentation results


Reposted from: https://blog.csdn.net/qq_41043389/article/details/128682057