Preface: The Chinese-language material on PaddleSlim is abundant, and its how-to tutorials are excellent. Source-code walkthroughs, however, are scarce, so in this post we'll go through the source code together to learn the principles and mechanics of Paddle's post-training static quantization.
Overview of the principle
The official manual covers this part very well and is recommended reading: 量化 — PaddleSlim 文档


Post-training static quantization here is mostly a wrapper around Paddle's interface. Basic usage looks like this:
  
```python
ptq = PostTrainingQuantization(
    executor=exe,
    sample_generator=sample_generator,
    model_dir=model_dir,
    model_filename=model_filename,
    params_filename=params_filename,
    batch_size=batch_size,
    batch_nums=batch_nums,
    algo=algo,
    quantizable_op_type=quantizable_op_type)
ptq.quantize()
ptq.save_quantized_model(save_model_path)
```
Supported quantization types
Paddle uses optimize_model to control whether operator-fusion optimization is applied, but it only supports fusion on CPU, and only the conv2d/depthwise_conv2d + bn fusion. Compared with OpenPPL, which implements a great many operator fusions, Paddle looks rather bare-bones here…
Don't be misled by the name of the is_full_quantize parameter: it does not mean full quantization. Paddle's "partial quantization" supports only six op types: "conv2d", "depthwise_conv2d", "conv2d_transpose", "mul", "matmul", and "matmul_v2". The rationale is that these are the main compute-intensive operators, so quantizing them yields the biggest win.
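As a toy illustration of that "partial quantization" scope (my own sketch, not PaddleSlim code), selecting the quantizable ops from a model's op-type list is just a membership test against those six types:

```python
# The six op types Paddle's "partial quantization" targets (from the text above).
QUANTIZABLE_OP_TYPES = {
    "conv2d", "depthwise_conv2d", "conv2d_transpose",
    "mul", "matmul", "matmul_v2",
}

def select_quantizable_ops(op_types):
    """Keep only ops whose type is in the quantizable set."""
    return [t for t in op_types if t in QUANTIZABLE_OP_TYPES]

ops = ["conv2d", "relu", "matmul_v2", "softmax", "depthwise_conv2d"]
print(select_quantizable_ops(ops))  # ['conv2d', 'matmul_v2', 'depthwise_conv2d']
```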
Full quantization's op coverage is also limited. Besides the six types just listed, it supports the following:
  
```python
SUPPORT_ACT_QUANTIZATION_OP_DICT = {
    "mul": [["X", "Y"], ["Out"]],
    "matmul": [["X", "Y"], ["Out"]],
    "matmul_v2": [["X", "Y"], ["Out"]],
    "pool2d": [["X"], ["Out"]],
    "elementwise_add": [["X", "Y"], ["Out"]],
    "concat": [["X"], ["Out"]],
    "softmax": [["X"], ["Out"]],
    "argmax": [["X"], ["Out"]],
    "transpose": [["X"], ["Out"]],
    "equal": [["X", "Y"], ["Out"]],
    "gather": [["X"], ["Out"]],
    "greater_equal": [["X", "Y"], ["Out"]],
    "greater_than": [["X", "Y"], ["Out"]],
    "less_equal": [["X", "Y"], ["Out"]],
    "less_than": [["X", "Y"], ["Out"]],
    "mean": [["X"], ["Out"]],
    "not_equal": [["X", "Y"], ["Out"]],
    "reshape": [["X"], ["Out"]],
    "reshape2": [["X"], ["Out"]],
    "transpose2": [["X"], ["Out"]],
    "nearest_interp": [["X"], ["Out"]],
    "trilinear_interp": [["X"], ["Out"]],
    "slice": [["Input"], ["Out"]],
    "squeeze": [["X"], ["Out"]],
    "elementwise_sub": [["X", "Y"], ["Out"]],
    "relu": [["X"], ["Out"]],
    "relu6": [["X"], ["Out"]],
    "leaky_relu": [["X"], ["Out"]],
    "prelu": [["X", "Alpha"], ["Out"]],
    "tanh": [["X"], ["Out"]],
    "swish": [["X"], ["Out"]],
    "dropout": [["X"], ["Out"]],
    "batch_norm": [["X"], ["Y"]],
    "layer_norm": [["X"], ["Y"]],
    "sigmoid": [["X"], ["Out"]],
    "elementwise_mul": [["X", "Y"], ["Out"]],
    "elementwise_pow": [["X", "Y"], ["Out"]],
    "hard_swish": [["X"], ["Out"]],
    "hard_sigmoid": [["X"], ["Out"]],
    "gru": [["Input", "Weight"], ["Hidden"]],
    "lstm": [["Input", "Weight"], ["Hidden"]],
    "pad2d": [["X"], ["Out"]],
    "pad3d": [["X"], ["Out"]],
    "flatten": [["X"], ["Out"]],
    "flatten2": [["X"], ["Out"]],
    "unsqueeze2": [["X"], ["Out"]],
    "flatten_contiguous_range": [["X"], ["Out"]],
    "split": [["X"], ["Out"]],
    "squeeze2": [["X"], ["Out"]],
    "nearest_interp_v2": [["X"], ["Out"]],
    "bilinear_interp": [["X"], ["Out"]],
    "bilinear_interp_v2": [["X"], ["Out"]],
    "fill_constant_batch_size_like": [["Input"], ["Out"]],
    "arg_max": [["X"], ["Out"]],
    "abs": [["X"], ["Out"]],
    "assign": [["X"], ["Out"]],
    "cast": [["X"], ["Out"]],
    "clip": [["X"], ["Out"]],
    "box_coder": [["PriorBox"], ["OutputBox"]],
    "crop": [["X"], ["Out"]],
    "cumsum": [["X"], ["Out"]],
    "expand_v2": [["X"], ["Out"]],
    "fill_any_like": [["X"], ["Out"]],
    "fill_constant": [[], ["Out"]],
    "gelu": [["X"], ["Out"]],
    "instance_norm": [["X"], ["Y"]],
    "lookup_table": [["W", "Ids"], ["Out"]],
    "lookup_table_v2": [["W", "Ids"], ["Out"]],
    "norm": [["X"], ["Norm"]],
    "p_norm": [["X"], ["Out"]],
    "pow": [["X"], ["Out"]],
    "reduce_mean": [["X"], ["Out"]],
    "stack": [["X"], ["Y"]],
    "top_k_v2": [["X"], ["Out", "Indices"]],
    "logical_and": [["X", "Y"], ["Out"]],
    "logical_not": [["X"], ["Out"]],
    "meshgrid": [["X"], ["Out"]],
    "roi_align": [["X", "ROIs"], ["Out"]],
    "strided_slice": [["Input"], ["Out"]],
    "where": [["Condition", "X", "Y"], ["Out"]],
    "grid_sampler": [["X", "Grid"], ["Output"]],
    "tile": [["X"], ["Out"]],
    "group_norm": [["X"], ["Y", "Mean", "Variance"]],
    "reduce_sum": [["X"], ["Out"]],
    "square": [["X"], ["Out"]],
    "softplus": [["X"], ["Out"]],
    "shuffle_channel": [["X"], ["Out"]],
    "reduce_max": [["X"], ["Out"]],
    "scale": [["X"], ["Out"]],
}
```
Supported backends
The supported backends are:

```python
support_deploy_backend = [None, "tensorrt", "mkldnn", "arm"]
```

The corresponding quantizer classes are BaseQuantizer, TensorRTQuantizer, MKLDNNQuantizer, and ARMCPUQuantizer. Compared with OpenPPL, the number of supported backends is quite small; the differences in optimization strategy between backends are a topic I'll save for a future post.
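One plausible way the backend string could map to its quantizer class (stub classes here for illustration; the real classes live in PaddleSlim):

```python
# Stub classes standing in for PaddleSlim's real quantizer hierarchy.
class BaseQuantizer: ...
class TensorRTQuantizer(BaseQuantizer): ...
class MKLDNNQuantizer(BaseQuantizer): ...
class ARMCPUQuantizer(BaseQuantizer): ...

# None selects the generic BaseQuantizer (see support_deploy_backend above).
QUANTIZER_BY_BACKEND = {
    None: BaseQuantizer,
    "tensorrt": TensorRTQuantizer,
    "mkldnn": MKLDNNQuantizer,
    "arm": ARMCPUQuantizer,
}

def make_quantizer(deploy_backend):
    """Instantiate the quantizer for a deploy backend name."""
    if deploy_backend not in QUANTIZER_BY_BACKEND:
        raise ValueError(f"unsupported backend: {deploy_backend!r}")
    return QUANTIZER_BY_BACKEND[deploy_backend]()

print(type(make_quantizer("tensorrt")).__name__)  # TensorRTQuantizer
```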
The sections below use BaseQuantizer as the example.
Quantization procedure
The key part is the iterative search for boundary values. Each sampling pass can use a different strategy; let's start with abs_max() as the example.

Preparation stage
  
```python
if self._algo in ["KL", "hist"]:
    batch_id = 0
    with tqdm(
        total=self._batch_nums,
        bar_format='Preparation stage, Run batch:|{bar}| {n_fmt}/{total_fmt}',
        ncols=80,
    ) as t:
        for data in self._data_loader():
            self._executor.run(
                program=self._program,
                feed=data,
                fetch_list=self._fetch_list,
                return_numpy=False,
                scope=self._scope,
            )
            self._collect_activation_abs_min_max()
            batch_id += 1
            t.update()
            if self._batch_nums and batch_id >= self._batch_nums:
                break
    self._init_sampling_act_histogram()
```
Note that for KL or hist, abs_min and abs_max must be computed for all activation values, which is what the _collect_activation_abs_min_max() call does.
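What _collect_activation_abs_min_max() does can be approximated as keeping a running (abs_min, abs_max) per activation tensor across calibration batches. A simplified sketch, assuming activations arrive as plain Python lists rather than Paddle tensors:

```python
def update_abs_min_max(stats, name, values):
    """Fold one batch of activation values into the running
    (abs_min, abs_max) for the tensor called `name`."""
    abs_vals = [abs(v) for v in values]
    lo, hi = min(abs_vals), max(abs_vals)
    if name in stats:
        old_lo, old_hi = stats[name]
        lo, hi = min(old_lo, lo), max(old_hi, hi)
    stats[name] = (lo, hi)

stats = {}
update_abs_min_max(stats, "conv1.out", [-3.0, 0.5, 2.0])  # batch 1
update_abs_min_max(stats, "conv1.out", [-1.0, 4.0])       # batch 2
print(stats["conv1.out"])  # (0.5, 4.0)
```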
Sampling stage
Load the weight values:

```python
var_tensor = utils.load_variable_data(self._scope, var_name)
```

Then the boundary maximum is found in one of two ways, abs_max or channel_wise_abs_max:
  
```python
if self._weight_quantize_type == "abs_max":
    abs_max_value = float(np.max(np.abs(var_tensor)))
elif self._weight_quantize_type == "channel_wise_abs_max":
    abs_max_value = []
    if (
        self._weight_op_pairs[var_name]
        in utils._channelwise_quant_axis1_ops
    ):
        for i in range(var_tensor.shape[1]):
            abs_max_value.append(
                float(np.max(np.abs(var_tensor[:, i])))
            )
    else:
        for i in range(var_tensor.shape[0]):
            abs_max_value.append(
                float(np.max(np.abs(var_tensor[i])))
            )
self._quantized_threshold[var_name] = abs_max_value
```
 
  
The per-batch sampling loop mirrors the preparation loop, calling _sampling() after each forward pass:

```python
batch_id = 0
with tqdm(
    total=self._batch_nums,
    bar_format='Sampling stage, Run batch:|{bar}| {n_fmt}/{total_fmt}',
    ncols=80,
) as t:
    for data in self._data_loader():
        self._executor.run(
            program=self._program,
            feed=data,
            fetch_list=self._fetch_list,
            return_numpy=False,
            scope=self._scope,
        )
        self._sampling()
        batch_id += 1
        t.update()
        if self._batch_nums and batch_id >= self._batch_nums:
            break
```
Saving the scales
The result of quantization is a scale for each node. A scale was previously stored under each tensor_name; now it just needs to be split back out:

```python
real_tensor_name, opera, scalar = tensor_name.split('#')
```

Here max_scale has to be kept up to date dynamically; it is used later when rescaling in opera.
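A toy sketch of that bookkeeping, assuming scale entries are keyed as 'tensor#op#scalar' and the running max scale per real tensor name is what gets saved:

```python
def collect_max_scales(scale_dict):
    """Split each 'real_name#opera#scalar' key and keep the running
    maximum scale per real tensor name."""
    max_scales = {}
    for tensor_name, scale in scale_dict.items():
        real_tensor_name, opera, scalar = tensor_name.split('#')
        prev = max_scales.get(real_tensor_name)
        max_scales[real_tensor_name] = scale if prev is None else max(prev, scale)
    return max_scales

scales = {"conv1.out#slice#0": 0.12, "conv1.out#slice#1": 0.30}
print(collect_max_scales(scales))  # {'conv1.out': 0.3}
```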
Inserting quantize/dequantize nodes
In the figure, for example, D is the dequantize op and Q is the quantize op.

We need to insert the quantize and dequantize nodes into the computation graph:
  
```python
# use QuantizationTransformPass to insert fake_quant/fake_dequantize op
if not self._onnx_format:
    transform_pass = QuantizationTransformPass(
        scope=self._scope,
        place=self._place,
        weight_bits=self._weight_bits,
        activation_bits=self._activation_bits,
        activation_quantize_type=self._activation_quantize_type,
        weight_quantize_type=self._weight_quantize_type,
        quantizable_op_type=self.quant_config.weight_quant_operation_types,
    )
else:
    transform_pass = QuantizationTransformPassV2(
        scope=self._scope,
        place=self._place,
        weight_bits=self._weight_bits,
        activation_bits=self._activation_bits,
        activation_quantize_type=self._activation_quantize_type,
        weight_quantize_type=self._weight_quantize_type,
        quantizable_op_type=self.quant_config.weight_quant_operation_types,
    )

for sub_graph in graph.all_sub_graphs():
    # Insert fake_quant/fake_dequantize op must in test graph, so
    # set per graph's _for_test is True.
    sub_graph._for_test = True
    transform_pass.apply(sub_graph)

# use AddQuantDequantPass to insert fake_quant_dequant op
if not self._onnx_format:
    add_quant_dequant_pass = AddQuantDequantPass(
        scope=self._scope,
        place=self._place,
        quantizable_op_type=self.quant_config.activation_quant_operation_types,
    )
else:
    add_quant_dequant_pass = AddQuantDequantPassV2(
        scope=self._scope,
        place=self._place,
        quantizable_op_type=self.quant_config.activation_quant_operation_types,
    )

for sub_graph in graph.all_sub_graphs():
    sub_graph._for_test = True
    add_quant_dequant_pass.apply(sub_graph)
```
Walking through activation calibration
Weights are constants and need no calibration (their values are known exactly, so there is no statistical error to estimate); the only thing that needs calibrating is the activations.
Quantization formula
Let r be the floating-point value before quantization; the quantized integer q is

q = clip(round(r / s) + z, q_min, q_max)

where round(·) and clip(·) denote rounding and clipping, q_min and q_max are the minimum and maximum quantized values, s is the quantization step (scale), and z is the zero-point offset. Quantization with z = 0 is called symmetric quantization; with z ≠ 0, asymmetric quantization. Symmetric quantization lets the quantized operator skip the z-related computation during inference, reducing compute cost; asymmetric quantization can set the minimum and maximum from the actual data distribution, making fuller use of the quantized range and giving higher accuracy.
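A worked example of the formula for symmetric int8 quantization (z = 0, q_min = -127, q_max = 127), choosing s = abs_max / 127:

```python
def quantize(r, s, z=0, q_min=-127, q_max=127):
    """q = clip(round(r / s) + z, q_min, q_max)"""
    q = round(r / s) + z
    return max(q_min, min(q_max, q))

def dequantize(q, s, z=0):
    """Approximate reconstruction: r ~= (q - z) * s"""
    return (q - z) * s

s = 6.35 / 127                      # scale ~= 0.05 for abs_max = 6.35
print(quantize(2.5, s))             # 50
print(quantize(100.0, s))           # clipped to 127
print(round(dequantize(50, s), 6))  # recovers ~2.5
```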
The concrete procedure
- Use histogram statistics to obtain the distribution P of the original FP32 data.
- In a given search space, pick several candidate q_min and q_max values and quantize the activations with each, producing quantized data Q.
- Use histogram statistics to obtain the distribution of each Q.
- Compute the distribution difference between each Q and P, and take the q_min and q_max of the least-divergent one to compute the final quantization parameters. Common metrics for measuring distribution difference include KL divergence (Kullback-Leibler divergence), symmetric KL divergence, and JS divergence (Jensen-Shannon divergence).
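The steps above can be sketched very roughly as follows, using KL divergence between histograms of the original and the quantize-dequantized activations (a drastically simplified illustration, not PaddleSlim's actual calibration algorithm):

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(P||Q) between two normalized histograms of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def normalized_hist(values, lo, hi, num_bins):
    """Histogram over [lo, hi], normalized to sum to 1."""
    h = [0.0] * num_bins
    width = (hi - lo) / num_bins
    for x in values:
        b = min(max(int((x - lo) / width), 0), num_bins - 1)
        h[b] += 1.0
    total = sum(h)
    return [c / total for c in h]

def fake_quant(values, threshold, levels=127):
    """Symmetric quantize-dequantize, clipping to [-threshold, threshold]."""
    s = threshold / levels
    return [max(-levels, min(levels, round(x / s))) * s for x in values]

# Uniform fake activations in [-1, 1]; the full-range threshold should
# distort the distribution far less than an aggressively clipped one.
acts = [0.01 * i for i in range(-100, 101)]
p = normalized_hist(acts, -1.0, 1.0, 32)
for t in (0.25, 1.0):
    q = normalized_hist(fake_quant(acts, t), -1.0, 1.0, 32)
    print(t, round(kl_divergence(p, q), 3))
```

For real activation data with heavy-tailed outliers, a threshold smaller than abs_max often wins this comparison, which is exactly why the search is worth doing.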
Reposted from: https://blog.csdn.net/qq_41895747/article/details/128938211