Preface: The Chinese-language material on PaddleSlim is plentiful, and its usage tutorials are very thorough. Material that actually walks through and explains the source code, however, is scarce. In this post we go through Paddle's post-training static quantization, studying its principles and implementation alongside the source code.
A brief overview of the principle
The official manual covers this part very well and is recommended reading: the Quantization chapter of the PaddleSlim documentation.
Post-training static quantization mainly wraps Paddle's own APIs. The basic usage looks like this:
```python
ptq = PostTrainingQuantization(
    executor=exe,
    sample_generator=sample_generator,
    model_dir=model_dir,
    model_filename=model_filename,
    params_filename=params_filename,
    batch_size=batch_size,
    batch_nums=batch_nums,
    algo=algo,
    quantizable_op_type=quantizable_op_type)
ptq.quantize()
ptq.save_quantized_model(save_model_path)
```
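For context, sample_generator is expected to be a Python generator that yields one calibration sample at a time; batching happens internally according to batch_size. A minimal sketch, where the shape (3, 224, 224), the sample count, and the use of random data instead of real calibration images are purely illustrative:

```python
import numpy as np

def sample_generator():
    # Yield one calibration sample per iteration; PostTrainingQuantization
    # batches these internally according to batch_size. Random data stands
    # in for real calibration inputs here.
    rng = np.random.default_rng(0)
    for _ in range(32):
        yield rng.standard_normal((3, 224, 224)).astype("float32")

samples = list(sample_generator())
```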
Supported quantization types
Paddle uses optimize_model to control whether operator-fusion optimization is applied, but fusion is only supported on CPU, and only the conv2d/depthwise_conv2d + bn fusion is implemented. Compared with OpenPPL, which performs a large number of operator fusions, Paddle looks rather bare-bones here.
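The one fusion Paddle does support, conv + bn, folds the batch-norm affine transform into the convolution's weights and bias. A minimal numpy sketch of the folding arithmetic (the function name and shapes are illustrative, not Paddle's internals):

```python
import numpy as np

def fold_bn_into_conv(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
    into the conv parameters. weight shape: [out_c, in_c, kh, kw]."""
    scale = gamma / np.sqrt(var + eps)          # one factor per output channel
    w_fold = weight * scale[:, None, None, None]
    b_fold = (bias - mean) * scale + beta
    return w_fold, b_fold

# Check on a toy 1x1 conv: the folded conv must match conv followed by BN.
w = np.array([[[[2.0]]]]); b = np.array([1.0])
gamma, beta = np.array([0.5]), np.array([0.1])
mean, var = np.array([1.0]), np.array([4.0])
w_f, b_f = fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=0.0)
x = 3.0
conv_then_bn = gamma * ((w[0, 0, 0, 0] * x + b[0]) - mean) / np.sqrt(var) + beta
folded = w_f[0, 0, 0, 0] * x + b_f[0]
```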
Don't be misled by the name of the is_full_quantize parameter: this is not full quantization in the usual sense. Paddle's "partial quantization" only supports six op types: "conv2d", "depthwise_conv2d", "conv2d_transpose", "mul", "matmul" and "matmul_v2". The rationale is that these are the main compute-intensive operators, so quantizing them is the most effective.
"Full" quantization also supports only a limited set of op types. Besides the six just listed, the supported ops are:
```python
SUPPORT_ACT_QUANTIZATION_OP_DICT = {
    "mul": [["X", "Y"], ["Out"]],
    "matmul": [["X", "Y"], ["Out"]],
    "matmul_v2": [["X", "Y"], ["Out"]],
    "pool2d": [["X"], ["Out"]],
    "elementwise_add": [["X", "Y"], ["Out"]],
    "concat": [["X"], ["Out"]],
    "softmax": [["X"], ["Out"]],
    "argmax": [["X"], ["Out"]],
    "transpose": [["X"], ["Out"]],
    "equal": [["X", "Y"], ["Out"]],
    "gather": [["X"], ["Out"]],
    "greater_equal": [["X", "Y"], ["Out"]],
    "greater_than": [["X", "Y"], ["Out"]],
    "less_equal": [["X", "Y"], ["Out"]],
    "less_than": [["X", "Y"], ["Out"]],
    "mean": [["X"], ["Out"]],
    "not_equal": [["X", "Y"], ["Out"]],
    "reshape": [["X"], ["Out"]],
    "reshape2": [["X"], ["Out"]],
    "transpose2": [["X"], ["Out"]],
    "nearest_interp": [["X"], ["Out"]],
    "trilinear_interp": [["X"], ["Out"]],
    "slice": [["Input"], ["Out"]],
    "squeeze": [["X"], ["Out"]],
    "elementwise_sub": [["X", "Y"], ["Out"]],
    "relu": [["X"], ["Out"]],
    "relu6": [["X"], ["Out"]],
    "leaky_relu": [["X"], ["Out"]],
    "prelu": [["X", "Alpha"], ["Out"]],
    "tanh": [["X"], ["Out"]],
    "swish": [["X"], ["Out"]],
    "dropout": [["X"], ["Out"]],
    "batch_norm": [["X"], ["Y"]],
    "layer_norm": [["X"], ["Y"]],
    "sigmoid": [["X"], ["Out"]],
    "elementwise_mul": [["X", "Y"], ["Out"]],
    "elementwise_pow": [["X", "Y"], ["Out"]],
    "hard_swish": [["X"], ["Out"]],
    "hard_sigmoid": [["X"], ["Out"]],
    "gru": [["Input", "Weight"], ["Hidden"]],
    "lstm": [["Input", "Weight"], ["Hidden"]],
    "pad2d": [["X"], ["Out"]],
    "pad3d": [["X"], ["Out"]],
    "flatten": [["X"], ["Out"]],
    "flatten2": [["X"], ["Out"]],
    "unsqueeze2": [["X"], ["Out"]],
    "flatten_contiguous_range": [["X"], ["Out"]],
    "split": [["X"], ["Out"]],
    "squeeze2": [["X"], ["Out"]],
    "nearest_interp_v2": [["X"], ["Out"]],
    "bilinear_interp": [["X"], ["Out"]],
    "bilinear_interp_v2": [["X"], ["Out"]],
    "fill_constant_batch_size_like": [["Input"], ["Out"]],
    "arg_max": [["X"], ["Out"]],
    "abs": [["X"], ["Out"]],
    "assign": [["X"], ["Out"]],
    "cast": [["X"], ["Out"]],
    "clip": [["X"], ["Out"]],
    "box_coder": [["PriorBox"], ["OutputBox"]],
    "crop": [["X"], ["Out"]],
    "cumsum": [["X"], ["Out"]],
    "expand_v2": [["X"], ["Out"]],
    "fill_any_like": [["X"], ["Out"]],
    "fill_constant": [[], ["Out"]],
    "gelu": [["X"], ["Out"]],
    "instance_norm": [["X"], ["Y"]],
    "lookup_table": [["W", "Ids"], ["Out"]],
    "lookup_table_v2": [["W", "Ids"], ["Out"]],
    "norm": [["X"], ["Norm"]],
    "p_norm": [["X"], ["Out"]],
    "pow": [["X"], ["Out"]],
    "reduce_mean": [["X"], ["Out"]],
    "stack": [["X"], ["Y"]],
    "top_k_v2": [["X"], ["Out", "Indices"]],
    "logical_and": [["X", "Y"], ["Out"]],
    "logical_not": [["X"], ["Out"]],
    "meshgrid": [["X"], ["Out"]],
    "roi_align": [["X", "ROIs"], ["Out"]],
    "strided_slice": [["Input"], ["Out"]],
    "where": [["Condition", "X", "Y"], ["Out"]],
    "grid_sampler": [["X", "Grid"], ["Output"]],
    "tile": [["X"], ["Out"]],
    "group_norm": [["X"], ["Y", "Mean", "Variance"]],
    "reduce_sum": [["X"], ["Out"]],
    "square": [["X"], ["Out"]],
    "softplus": [["X"], ["Out"]],
    "shuffle_channel": [["X"], ["Out"]],
    "reduce_max": [["X"], ["Out"]],
    "scale": [["X"], ["Out"]],
}
```
Supported backends
The supported deploy backends are:
support_deploy_backend = [None, "tensorrt", "mkldnn", "arm"]
The corresponding quantizer classes are BaseQuantizer, TensorRTQuantizer, MKLDNNQuantizer and ARMCPUQuantizer. Compared with OpenPPL, the number of supported backends is quite small; the differences in optimization strategy between backends are a topic I'll leave for a later post.
Below we use BaseQuantizer as the running example.
Quantization procedure
The key step is the iterative search for boundary values. Each sampling pass can follow a different strategy; let's take abs_max() as the first example.
Preparation stage
```python
if self._algo in ["KL", "hist"]:
    batch_id = 0
    with tqdm(
        total=self._batch_nums,
        bar_format='Preparation stage, Run batch:|{bar}| {n_fmt}/{total_fmt}',
        ncols=80,
    ) as t:
        for data in self._data_loader():
            self._executor.run(
                program=self._program,
                feed=data,
                fetch_list=self._fetch_list,
                return_numpy=False,
                scope=self._scope,
            )
            self._collect_activation_abs_min_max()
            batch_id += 1
            t.update()
            if self._batch_nums and batch_id >= self._batch_nums:
                break
    self._init_sampling_act_histogram()
```
Note that when the algorithm is KL or hist, abs_min and abs_max must first be computed over all activations, which is what the _collect_activation_abs_min_max() call does.
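Stripped of the executor plumbing, that collection step amounts to keeping a running min/max of |activation| per tensor across batches, which later fixes the histogram range. A minimal sketch (class and tensor names are hypothetical, not PaddleSlim's internals):

```python
import numpy as np

class AbsMinMaxCollector:
    """Running abs-min/abs-max per activation, updated once per batch."""

    def __init__(self):
        self.stats = {}  # name -> [abs_min, abs_max]

    def collect(self, name, tensor):
        t = np.abs(tensor)
        lo, hi = float(t.min()), float(t.max())
        if name not in self.stats:
            self.stats[name] = [lo, hi]
        else:
            self.stats[name][0] = min(self.stats[name][0], lo)
            self.stats[name][1] = max(self.stats[name][1], hi)

c = AbsMinMaxCollector()
c.collect("relu_0.tmp_0", np.array([-2.0, 1.0]))   # batch 1
c.collect("relu_0.tmp_0", np.array([0.5, 3.0]))    # batch 2
```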
Sampling stage
First the weight values are loaded:
var_tensor = utils.load_variable_data(self._scope, var_name)
Then the maximum boundary value is searched in one of two modes, abs_max or channel_wise_abs_max:
```python
if self._weight_quantize_type == "abs_max":
    abs_max_value = float(np.max(np.abs(var_tensor)))
elif self._weight_quantize_type == "channel_wise_abs_max":
    abs_max_value = []
    if (
        self._weight_op_pairs[var_name]
        in utils._channelwise_quant_axis1_ops
    ):
        for i in range(var_tensor.shape[1]):
            abs_max_value.append(
                float(np.max(np.abs(var_tensor[:, i])))
            )
    else:
        for i in range(var_tensor.shape[0]):
            abs_max_value.append(
                float(np.max(np.abs(var_tensor[i])))
            )
self._quantized_threshold[var_name] = abs_max_value

batch_id = 0
with tqdm(
    total=self._batch_nums,
    bar_format='Sampling stage, Run batch:|{bar}| {n_fmt}/{total_fmt}',
    ncols=80,
) as t:
    for data in self._data_loader():
        self._executor.run(
            program=self._program,
            feed=data,
            fetch_list=self._fetch_list,
            return_numpy=False,
            scope=self._scope,
        )
        self._sampling()
        batch_id += 1
        t.update()
        if self._batch_nums and batch_id >= self._batch_nums:
            break
```
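Stripped of the scope/executor plumbing, the two weight-sampling modes above reduce to simple numpy reductions: one threshold for the whole tensor, or one per channel. In this sketch the quant_axis parameter is my own generalization; the code above hard-codes axis 1 for the op types in utils._channelwise_quant_axis1_ops and axis 0 otherwise:

```python
import numpy as np

def abs_max_threshold(var_tensor, channel_wise=False, quant_axis=0):
    """Per-tensor or per-channel abs-max threshold (numpy-only sketch)."""
    if not channel_wise:
        return float(np.max(np.abs(var_tensor)))
    # Move the channel axis to the front, then reduce over everything else.
    t = np.moveaxis(var_tensor, quant_axis, 0)
    return [float(np.max(np.abs(t[i]))) for i in range(t.shape[0])]

w = np.array([[1.0, -3.0],
              [0.5,  2.0]])
per_tensor = abs_max_threshold(w)                      # 3.0
per_channel = abs_max_threshold(w, channel_wise=True)  # [3.0, 2.0]
```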
Saving the scales
The output of quantization is a scale for each node. A scale was stored earlier under each tensor_name, so at this point it just has to be split back out:
real_tensor_name, opera, scalar = tensor_name.split('#')
max_scale has to be kept continuously updated here; it is used later when rescaling according to opera.
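A minimal sketch of that bookkeeping: splitting the '#'-joined names (the layout comes from the line above) and tracking the running maximum scale per real tensor. The sample names and values here are made up:

```python
# Hypothetical collected entries: "#"-joined tensor names -> sampled scale.
collected = {
    "conv1_out#scale#0": 0.5,
    "conv1_out#scale#1": 0.8,
}

max_scale = {}
for tensor_name, value in collected.items():
    real_tensor_name, opera, scalar = tensor_name.split('#')
    # Keep the running maximum scale per real tensor name.
    prev = max_scale.get(real_tensor_name, float("-inf"))
    max_scale[real_tensor_name] = max(prev, value)
```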
Inserting quantize/dequantize nodes
In the figure, for example, D denotes the dequantize op and Q the quantize op.
We need to insert the quantize and dequantize nodes into the computation graph:
```python
# use QuantizationTransformPass to insert fake_quant/fake_dequantize op
if not self._onnx_format:
    transform_pass = QuantizationTransformPass(
        scope=self._scope,
        place=self._place,
        weight_bits=self._weight_bits,
        activation_bits=self._activation_bits,
        activation_quantize_type=self._activation_quantize_type,
        weight_quantize_type=self._weight_quantize_type,
        quantizable_op_type=self.quant_config.weight_quant_operation_types,
    )
else:
    transform_pass = QuantizationTransformPassV2(
        scope=self._scope,
        place=self._place,
        weight_bits=self._weight_bits,
        activation_bits=self._activation_bits,
        activation_quantize_type=self._activation_quantize_type,
        weight_quantize_type=self._weight_quantize_type,
        quantizable_op_type=self.quant_config.weight_quant_operation_types,
    )

for sub_graph in graph.all_sub_graphs():
    # Insert fake_quant/fake_dequantize op must in test graph, so
    # set per graph's _for_test is True.
    sub_graph._for_test = True
    transform_pass.apply(sub_graph)

# use AddQuantDequantPass to insert fake_quant_dequant op
if not self._onnx_format:
    add_quant_dequant_pass = AddQuantDequantPass(
        scope=self._scope,
        place=self._place,
        quantizable_op_type=self.quant_config.activation_quant_operation_types,
    )
else:
    add_quant_dequant_pass = AddQuantDequantPassV2(
        scope=self._scope,
        place=self._place,
        quantizable_op_type=self.quant_config.activation_quant_operation_types,
    )

for sub_graph in graph.all_sub_graphs():
    sub_graph._for_test = True
    add_quant_dequant_pass.apply(sub_graph)
```
A walkthrough of activation calibration
Weights are constants and need no calibration (there is no sample-to-sample variation in them), so only the activations have to be calibrated.
Quantization formula
Let r denote the floating-point value before quantization; the quantized integer q can then be written as

q = clip(round(r / s) + z, q_min, q_max)

round(·) and clip(·) denote the rounding and clamping operations, and q_min and q_max are the minimum and maximum values of the quantized range. s is the quantization step (scale), and z is the offset (zero point). Quantization with z = 0 is called symmetric quantization; with z ≠ 0 it is called asymmetric quantization. Symmetric quantization lets the quantized operator skip the z-related terms during inference, lowering the computational cost; asymmetric quantization can choose the minimum and maximum according to the actual data distribution, making fuller use of the quantized value range and yielding higher accuracy.
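The formula is easy to check with a few lines of numpy; an int8 range and a symmetric scale for data in [-1, 1] are assumed here:

```python
import numpy as np

def quantize(r, s, z, q_min=-128, q_max=127):
    # q = clip(round(r / s) + z, q_min, q_max)
    return np.clip(np.round(r / s) + z, q_min, q_max).astype(np.int32)

def dequantize(q, s, z):
    return (q.astype(np.float32) - z) * s

r = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
s = 1.0 / 127.0          # symmetric int8 scale for data in [-1, 1]
q = quantize(r, s, z=0)  # symmetric quantization: z = 0
r_hat = dequantize(q, s, z=0)
# The round-trip error |r - r_hat| is bounded by one quantization step s.
```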
Concrete procedure
-
Use histogram statistics to obtain the distribution of the original FP32 data;
-
Pick several candidate q_min / q_max pairs from a given search space and quantize the activations with each, obtaining the quantized data Q;
-
Use histogram statistics to obtain the distribution of each Q;
-
Compute the difference between the distribution of each Q and that of the original data, and take the q_min / q_max pair with the smallest difference to compute the corresponding quantization parameters. Common metrics for measuring the difference between distributions include KL divergence (Kullback-Leibler divergence), symmetric KL divergence, and JS divergence (Jensen-Shannon divergence).
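The steps above can be sketched as a toy threshold search using KL divergence between the FP32 histogram and the histogram after simulated int8 quantization. The bin count, the candidate grid, and the random calibration data are arbitrary choices, not PaddleSlim's actual search space:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def search_abs_max_threshold(x, candidates, bins=64):
    """Pick the symmetric threshold whose quantized histogram is
    closest (in KL divergence) to the FP32 histogram."""
    best_t, best_kl = None, float("inf")
    for t in candidates:
        # Simulate int8 quantize/dequantize with threshold t.
        s = t / 127.0
        x_q = np.clip(np.round(x / s), -127, 127) * s
        hist_p, edges = np.histogram(x, bins=bins)
        hist_q, _ = np.histogram(x_q, bins=edges)
        kl = kl_divergence(hist_p.astype(np.float64),
                           hist_q.astype(np.float64))
        if kl < best_kl:
            best_t, best_kl = t, kl
    return best_t

rng = np.random.default_rng(0)
acts = rng.standard_normal(10000).astype(np.float32)
t = search_abs_max_threshold(acts, candidates=[1.0, 2.0, 4.0, 8.0])
```

A too-small threshold piles clipped mass into the edge bins and a too-large one wastes quantization levels; the KL score penalizes both.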
Reposted from: https://blog.csdn.net/qq_41895747/article/details/128938211