
PaddleSlim Model Quantization: A Source Code Walkthrough


Preface: PaddleSlim's Chinese documentation is very rich, and its usage tutorials are thorough. Source-code walkthroughs, however, are scarce. In this post we read through the source code together to learn the principles and mechanics of Paddle's post-training static quantization.

Contents

Overview of the Principle

Supported Quantization Types

Supported Backends

Quantization Operations

Preparation stage

Sampling stage

Saving the Scales

Inserting Quantize/Dequantize Nodes

Activation Calibration Walkthrough

Quantization Formula

Detailed Procedure


Overview of the Principle

The official manual covers this part very well and is worth reading: the Quantization chapter of the PaddleSlim documentation (量化 — PaddleSlim 文档).

Post-training static quantization is essentially a wrapper around Paddle's APIs; the basic usage looks like this:


  
    ptq = PostTrainingQuantization(
        executor=exe,
        sample_generator=sample_generator,
        model_dir=model_dir,
        model_filename=model_filename,
        params_filename=params_filename,
        batch_size=batch_size,
        batch_nums=batch_nums,
        algo=algo,
        quantizable_op_type=quantizable_op_type)
    ptq.quantize()                             # run calibration and compute scales
    ptq.save_quantized_model(save_model_path)  # export the quantized inference model

Supported Quantization Types

Paddle controls operator fusion through the optimize_model switch, but fusion is only supported on CPU, and the only pattern fused is conv2d/depthwise_conv2d + batch_norm. OpenPPL, by comparison, implements a much richer set of fusions:

OpenPPL PPQ量化(3):量化计算图的加载和预处理 源码剖析_沉迷单车的追风少年的博客-CSDN博客

Next to that, Paddle looks rather bare-bones.

As for the is_full_quantize parameter: don't be fooled by the name, it does not give truly full quantization. Paddle's "partial quantization" supports only six op types: "conv2d", "depthwise_conv2d", "conv2d_transpose", "mul", "matmul", and "matmul_v2". The rationale is that these are the main compute-intensive operators, so quantizing them yields the biggest payoff.

Even "full" quantization supports a limited set of op types. Besides the six listed above, it also covers the following:


  
    SUPPORT_ACT_QUANTIZATION_OP_DICT = {
        "mul": [["X", "Y"], ["Out"]],
        "matmul": [["X", "Y"], ["Out"]],
        "matmul_v2": [["X", "Y"], ["Out"]],
        "pool2d": [["X"], ["Out"]],
        "elementwise_add": [["X", "Y"], ["Out"]],
        "concat": [["X"], ["Out"]],
        "softmax": [["X"], ["Out"]],
        "argmax": [["X"], ["Out"]],
        "transpose": [["X"], ["Out"]],
        "equal": [["X", "Y"], ["Out"]],
        "gather": [["X"], ["Out"]],
        "greater_equal": [["X", "Y"], ["Out"]],
        "greater_than": [["X", "Y"], ["Out"]],
        "less_equal": [["X", "Y"], ["Out"]],
        "less_than": [["X", "Y"], ["Out"]],
        "mean": [["X"], ["Out"]],
        "not_equal": [["X", "Y"], ["Out"]],
        "reshape": [["X"], ["Out"]],
        "reshape2": [["X"], ["Out"]],
        "transpose2": [["X"], ["Out"]],
        "nearest_interp": [["X"], ["Out"]],
        "trilinear_interp": [["X"], ["Out"]],
        "slice": [["Input"], ["Out"]],
        "squeeze": [["X"], ["Out"]],
        "elementwise_sub": [["X", "Y"], ["Out"]],
        "relu": [["X"], ["Out"]],
        "relu6": [["X"], ["Out"]],
        "leaky_relu": [["X"], ["Out"]],
        "prelu": [["X", "Alpha"], ["Out"]],
        "tanh": [["X"], ["Out"]],
        "swish": [["X"], ["Out"]],
        "dropout": [["X"], ["Out"]],
        "batch_norm": [["X"], ["Y"]],
        "layer_norm": [["X"], ["Y"]],
        "sigmoid": [["X"], ["Out"]],
        "elementwise_mul": [["X", "Y"], ["Out"]],
        "elementwise_pow": [["X", "Y"], ["Out"]],
        "hard_swish": [["X"], ["Out"]],
        "hard_sigmoid": [["X"], ["Out"]],
        "gru": [["Input", "Weight"], ["Hidden"]],
        "lstm": [["Input", "Weight"], ["Hidden"]],
        "pad2d": [["X"], ["Out"]],
        "pad3d": [["X"], ["Out"]],
        "flatten": [["X"], ["Out"]],
        "flatten2": [["X"], ["Out"]],
        "unsqueeze2": [["X"], ["Out"]],
        "flatten_contiguous_range": [["X"], ["Out"]],
        "split": [["X"], ["Out"]],
        "squeeze2": [["X"], ["Out"]],
        "nearest_interp_v2": [["X"], ["Out"]],
        "bilinear_interp": [["X"], ["Out"]],
        "bilinear_interp_v2": [["X"], ["Out"]],
        "fill_constant_batch_size_like": [["Input"], ["Out"]],
        "arg_max": [["X"], ["Out"]],
        "abs": [["X"], ["Out"]],
        "assign": [["X"], ["Out"]],
        "cast": [["X"], ["Out"]],
        "clip": [["X"], ["Out"]],
        "box_coder": [["PriorBox"], ["OutputBox"]],
        "crop": [["X"], ["Out"]],
        "cumsum": [["X"], ["Out"]],
        "expand_v2": [["X"], ["Out"]],
        "fill_any_like": [["X"], ["Out"]],
        "fill_constant": [[], ["Out"]],
        "gelu": [["X"], ["Out"]],
        "instance_norm": [["X"], ["Y"]],
        "lookup_table": [["W", "Ids"], ["Out"]],
        "lookup_table_v2": [["W", "Ids"], ["Out"]],
        "norm": [["X"], ["Norm"]],
        "p_norm": [["X"], ["Out"]],
        "pow": [["X"], ["Out"]],
        "reduce_mean": [["X"], ["Out"]],
        "stack": [["X"], ["Y"]],
        "top_k_v2": [["X"], ["Out", "Indices"]],
        "logical_and": [["X", "Y"], ["Out"]],
        "logical_not": [["X"], ["Out"]],
        "meshgrid": [["X"], ["Out"]],
        "roi_align": [["X", "ROIs"], ["Out"]],
        "strided_slice": [["Input"], ["Out"]],
        "where": [["Condition", "X", "Y"], ["Out"]],
        "grid_sampler": [["X", "Grid"], ["Output"]],
        "tile": [["X"], ["Out"]],
        "group_norm": [["X"], ["Y", "Mean", "Variance"]],
        "reduce_sum": [["X"], ["Out"]],
        "square": [["X"], ["Out"]],
        "softplus": [["X"], ["Out"]],
        "shuffle_channel": [["X"], ["Out"]],
        "reduce_max": [["X"], ["Out"]],
        "scale": [["X"], ["Out"]],
    }
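
Each entry maps an op type to its quantizable input tensor slots and its output tensor slots. A toy lookup against the dict above (illustrative only, not PaddleSlim's internal code):

    # Which tensor slots of a softmax op would be quantized?
    in_slots, out_slots = SUPPORT_ACT_QUANTIZATION_OP_DICT["softmax"]
    print(in_slots)   # ['X']
    print(out_slots)  # ['Out']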

Supported Backends

The supported deployment backends are as follows:

support_deploy_backend = [None, "tensorrt", "mkldnn", "arm"]

The corresponding quantizer classes are BaseQuantizer, TensorRTQuantizer, MKLDNNQuantizer, and ARMCPUQuantizer. Compared with OpenPPL, the number of supported backends is quite small; the differences in optimization strategy between backends are a topic I'll leave for a future post.

The sections below use BaseQuantizer as the running example.

Quantization Operations

The key part is the iterative search for boundary (clipping) values. Each sampling pass can follow a different strategy; let's start with abs_max as the example.
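
To make the abs_max idea concrete, here is a toy illustration in plain NumPy (not PaddleSlim code) of deriving a symmetric int8 scale from the absolute maximum:

    import numpy as np

    x = np.array([-1.5, 0.3, 2.0], dtype=np.float32)
    abs_max = float(np.max(np.abs(x)))  # 2.0: the boundary value
    scale = abs_max / 127.0             # int8 symmetric quantization step
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    print(q)  # [-95  19 127]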

Preparation stage


  
    if self._algo in ["KL", "hist"]:
        batch_id = 0
        with tqdm(
            total=self._batch_nums,
            bar_format='Preparation stage, Run batch:|{bar}| {n_fmt}/{total_fmt}',
            ncols=80,
        ) as t:
            for data in self._data_loader():
                self._executor.run(
                    program=self._program,
                    feed=data,
                    fetch_list=self._fetch_list,
                    return_numpy=False,
                    scope=self._scope,
                )
                self._collect_activation_abs_min_max()
                batch_id += 1
                t.update()
                if self._batch_nums and batch_id >= self._batch_nums:
                    break
        self._init_sampling_act_histogram()

Note that for the KL and hist algorithms, abs_min and abs_max must first be collected for every activation via the _collect_activation_abs_min_max() method; _init_sampling_act_histogram() then uses these ranges to set up the per-activation histograms.
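
Conceptually, the collection step just maintains a running abs-range per activation tensor across batches. A hypothetical sketch (names and structure are illustrative, not the actual implementation):

    import numpy as np

    act_ranges = {}  # tensor name -> (abs_min, abs_max), refined batch by batch

    def collect_abs_min_max(name, tensor):
        lo = float(np.min(np.abs(tensor)))
        hi = float(np.max(np.abs(tensor)))
        if name in act_ranges:
            old_lo, old_hi = act_ranges[name]
            lo, hi = min(lo, old_lo), max(hi, old_hi)
        act_ranges[name] = (lo, hi)

The ranges gathered this way give each activation's histogram its bin boundaries.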

Sampling stage

Load the weight tensor:

var_tensor = utils.load_variable_data(self._scope, var_name)

The boundary maximum is then found in one of two ways, abs_max or channel_wise_abs_max:


  
    if self._weight_quantize_type == "abs_max":
        abs_max_value = float(np.max(np.abs(var_tensor)))
    elif self._weight_quantize_type == "channel_wise_abs_max":
        abs_max_value = []
        if (
            self._weight_op_pairs[var_name]
            in utils._channelwise_quant_axis1_ops
        ):
            for i in range(var_tensor.shape[1]):
                abs_max_value.append(
                    float(np.max(np.abs(var_tensor[:, i])))
                )
        else:
            for i in range(var_tensor.shape[0]):
                abs_max_value.append(
                    float(np.max(np.abs(var_tensor[i])))
                )
    self._quantized_threshold[var_name] = abs_max_value
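
The axis distinction exists because conv2d-style weights store output channels on axis 0, while the ops in utils._channelwise_quant_axis1_ops (ops such as mul/matmul and conv2d_transpose) keep them on axis 1. A toy NumPy illustration of the two traversals (shapes made up for the example):

    import numpy as np

    w_conv = np.random.randn(8, 4, 3, 3)  # conv2d weight: [out_c, in_c, kh, kw]
    w_mul = np.random.randn(64, 10)       # mul/matmul weight: [in_feat, out_feat]

    # axis-0 per-channel abs max (conv2d and friends)
    conv_thresholds = [float(np.max(np.abs(w_conv[i]))) for i in range(w_conv.shape[0])]
    # axis-1 per-channel abs max (ops in _channelwise_quant_axis1_ops)
    mul_thresholds = [float(np.max(np.abs(w_mul[:, i]))) for i in range(w_mul.shape[1])]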

  
Activations are sampled the same way: run the program batch by batch and call _sampling() after each batch:

    batch_id = 0
    with tqdm(
        total=self._batch_nums,
        bar_format='Sampling stage, Run batch:|{bar}| {n_fmt}/{total_fmt}',
        ncols=80,
    ) as t:
        for data in self._data_loader():
            self._executor.run(
                program=self._program,
                feed=data,
                fetch_list=self._fetch_list,
                return_numpy=False,
                scope=self._scope,
            )
            self._sampling()
            batch_id += 1
            t.update()
            if self._batch_nums and batch_id >= self._batch_nums:
                break

Saving the Scales

The end product of quantization is a scale for each node. The scale was previously encoded into each tensor_name, so at this point it only needs to be split back out:

real_tensor_name, opera, scalar = tensor_name.split('#')

max_scale has to be updated dynamically here; it is used later when rescaling according to opera.
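
Assuming the '#'-separated naming scheme above, a toy illustration of recovering the pieces and keeping a running maximum per real tensor (the names and values here are made up, not taken from PaddleSlim):

    # Hypothetical encoded names: "<real_tensor_name>#<opera>#<scalar>"
    max_scale = {}
    for tensor_name in ["conv1.tmp_0#quant#0.73", "conv1.tmp_0#quant#0.81"]:
        real_tensor_name, opera, scalar = tensor_name.split('#')
        max_scale[real_tensor_name] = max(
            max_scale.get(real_tensor_name, float('-inf')), float(scalar)
        )
    print(max_scale)  # {'conv1.tmp_0': 0.81}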

Inserting Quantize/Dequantize Nodes

In such a quantized graph, Q denotes a quantize op and D a dequantize op.

We need to insert the quantize and dequantize nodes into the computation graph:


  
    # use QuantizationTransformPass to insert fake_quant/fake_dequantize ops
    if not self._onnx_format:
        transform_pass = QuantizationTransformPass(
            scope=self._scope,
            place=self._place,
            weight_bits=self._weight_bits,
            activation_bits=self._activation_bits,
            activation_quantize_type=self._activation_quantize_type,
            weight_quantize_type=self._weight_quantize_type,
            quantizable_op_type=self.quant_config.weight_quant_operation_types,
        )
    else:
        transform_pass = QuantizationTransformPassV2(
            scope=self._scope,
            place=self._place,
            weight_bits=self._weight_bits,
            activation_bits=self._activation_bits,
            activation_quantize_type=self._activation_quantize_type,
            weight_quantize_type=self._weight_quantize_type,
            quantizable_op_type=self.quant_config.weight_quant_operation_types,
        )
    for sub_graph in graph.all_sub_graphs():
        # fake_quant/fake_dequantize ops must be inserted into a test graph,
        # so set each sub-graph's _for_test to True
        sub_graph._for_test = True
        transform_pass.apply(sub_graph)

    # use AddQuantDequantPass to insert fake_quant_dequant ops
    if not self._onnx_format:
        add_quant_dequant_pass = AddQuantDequantPass(
            scope=self._scope,
            place=self._place,
            quantizable_op_type=self.quant_config.activation_quant_operation_types,
        )
    else:
        add_quant_dequant_pass = AddQuantDequantPassV2(
            scope=self._scope,
            place=self._place,
            quantizable_op_type=self.quant_config.activation_quant_operation_types,
        )
    for sub_graph in graph.all_sub_graphs():
        sub_graph._for_test = True
        add_quant_dequant_pass.apply(sub_graph)
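
These "fake" ops simulate quantization in floating point: each quantize/dequantize pair snaps a tensor to the integer grid and immediately maps it back, so the graph stays float but carries the quantization error. A minimal sketch of that round trip (illustrative only):

    import numpy as np

    def fake_quant_dequant(x, scale, bits=8):
        # quantize to the signed integer grid, then dequantize back to float
        qmax = 2 ** (bits - 1) - 1
        q = np.clip(np.round(x / scale * qmax), -qmax, qmax)
        return q * scale / qmax

    x = np.array([0.12, -0.5, 0.499])
    print(fake_quant_dequant(x, scale=0.5))  # values snapped to the int8 grid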

Activation Calibration Walkthrough

Weights are constants and need no calibration data: their ranges can be computed exactly from the tensors themselves. Only the activations, whose ranges depend on the input, have to be calibrated.

Quantization Formula

Let r denote the floating-point value before quantization. The quantized integer q can be written as:

    q = clip(round(r / s + z), q_min, q_max)

Here round(·) and clip(·) denote the rounding and clipping operations, and q_min and q_max are the minimum and maximum of the quantized range. s is the quantization step (scale), and z is the offset (zero point). Quantization with z = 0 is called symmetric quantization; with z ≠ 0 it is asymmetric. Symmetric quantization lets inference kernels skip the z-related terms, reducing compute cost; asymmetric quantization fits the minimum and maximum to the actual data distribution, making fuller use of the quantized range and so giving higher accuracy.
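
A small NumPy rendering of the formula showing both modes (a sketch, not library code; the example values for s and z are made up):

    import numpy as np

    def quantize(r, s, z, q_min=-128, q_max=127):
        # q = clip(round(r / s + z), q_min, q_max)
        return np.clip(np.round(r / s + z), q_min, q_max).astype(np.int8)

    r = np.array([-1.0, 0.0, 0.5, 2.0])
    print(quantize(r, s=2.0 / 127, z=0))    # symmetric: z = 0
    print(quantize(r, s=3.0 / 255, z=-43))  # asymmetric: nonzero zero point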

Detailed Procedure

  • Use histogram statistics to obtain the distribution of the original FP32 data.

  • Pick several candidate clipping thresholds T from a given search space and quantize the activations with each, producing the corresponding quantized data.

  • Use histogram statistics to obtain the distribution of each quantized result.

  • For each candidate T, measure the difference between its quantized distribution and the original one, and take the T with the smallest difference to compute the quantization parameters; see the sketch after this list. Common metrics for the distribution difference include KL divergence (Kullback-Leibler divergence), symmetric KL divergence, and JS divergence (Jensen-Shannon divergence).
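
Below is a self-contained sketch of this TensorRT-style KL threshold search. It is simplified: the histogram down/up-sampling merely stands in for actually quantizing the data, and the details differ from PaddleSlim's implementation:

    import numpy as np

    def kl_divergence(p, q, eps=1e-10):
        # KL(p || q) over normalized histograms
        p = p / (p.sum() + eps)
        q = q / (q.sum() + eps)
        return float(np.sum(p * np.log((p + eps) / (q + eps))))

    def find_kl_threshold(activations, num_bins=2048, levels=128):
        hist, edges = np.histogram(np.abs(activations), bins=num_bins)
        hist = hist.astype(np.float64)
        best_t, best_kl = edges[-1], float("inf")
        for i in range(levels, num_bins + 1):
            ref = hist[:i].copy()
            ref[-1] += hist[i:].sum()  # fold the clipped-off tail into the last bin
            # crude stand-in for quantization: squeeze the i bins down to
            # `levels` buckets, then stretch back and compare distributions
            coarse = np.interp(np.linspace(0, 1, levels), np.linspace(0, 1, i), ref)
            cand = np.interp(np.linspace(0, 1, i), np.linspace(0, 1, levels), coarse)
            kl = kl_divergence(ref, cand)
            if kl < best_kl:
                best_kl, best_t = kl, edges[i]
        return best_t  # clipping threshold whose distribution matches best

    acts = np.random.laplace(scale=0.5, size=100_000)
    print(find_kl_threshold(acts))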


Reposted from: https://blog.csdn.net/qq_41895747/article/details/128938211