CAPTCHA Terminator: A Training and Deployment Toolkit Based on CNN+BLSTM+CTC (飞道's blog)


2021-02-10 07:05

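Google's image CAPTCHA is already all but useless against AI, so Google has announced that it is withdrawing from the CAPTCHA service. Why is that?
The article below may help explain.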

1 Defining a model

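This project is driven entirely by configuration parameters: without changing any code, you can train a model for almost any character-based image CAPTCHA. Let's start with the two configuration files: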

config.yaml # system configuration

# - requirement.txt  -  GPU: tensorflow-gpu, CPU: tensorflow
# - If you use the GPU version, you need to install some additional applications.
# TrainRegex and TestRegex: Default matching apple_20181010121212.jpg file.
# - The Default is .*?(?=_.*\.)
# TrainsPath and TestPath: The local absolute path of your training and testing set.
# TestSetNum: This is an optional parameter that is used when you want to extract some of the test set
# - from the training set when you are not preparing the test set separately.
System:
  DeviceUsage: 0.7
  TrainRegex: '.*?(?=_)'
  TestRegex: '.*?(?=_)'
  HomePath: D:\app.python\captcha_model
  StartNo: 0
  TrainCount: 1000
  TestCount: 312
  TestSetNum: 300

# CNNNetwork: [CNN5, ResNet]
# RecurrentNetwork: [BLSTM, LSTM, SRU, BSRU, GRU]
# - The recommended configuration is CNN5+BLSTM / ResNet+BLSTM
# HiddenNum: [64, 128, 256]
# - This parameter indicates the number of nodes used to remember and store past states.
NeuralNet:
  CNNNetwork: CNN5
  RecurrentNetwork: BLSTM
  HiddenNum: 64
  KeepProb: 0.98

# SavedSteps: A Session.run() execution is called a Step,
# - Used to save training progress, Default value is 100.
# ValidationSteps: Used to calculate accuracy, Default value is 100.
# TestNum: The number of samples for each test batch.
# - A test for every saved steps. 
# EndAcc: Finish the training when the accuracy reaches [EndAcc*100]%.
# EndEpochs: Finish the training when the epoch is greater than the defined epoch.
Trains:
  SavedSteps: 100
  ValidationSteps: 500
  EndAcc: 0.999
  EndCost: 1
  EndEpochs: 1
  BatchSize: 64
  TestBatchSize: 64
  LearningRate: 0.01
  DecayRate: 0.98
  DecaySteps: 10000
  PreprocessCollapseRepeated: False
  CTCMergeRepeated: True
  CTCBeamWidth: 5
  CTCTopPaths: 1
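To make the LearningRate, DecayRate and DecaySteps settings more concrete, here is a minimal sketch of how such values are commonly combined. It assumes a standard exponential decay schedule; the post does not say which schedule the project actually applies, and the function name `decayed_learning_rate` is made up for this illustration.

```python
def decayed_learning_rate(step, learning_rate=0.01, decay_rate=0.98, decay_steps=10000):
    """Exponential decay (an assumption, not confirmed by the project):
    the rate is multiplied by decay_rate once every decay_steps steps."""
    return learning_rate * decay_rate ** (step / decay_steps)


# Defaults are the values from the Trains section of config.yaml.
for step in (0, 10000, 50000):
    print(step, decayed_learning_rate(step))
```

Under this assumption a DecayRate of 0.98 over 10000 steps is a very gentle schedule, so the learning rate stays close to its initial value for most short training runs.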

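config.yaml looks like it has a great many parameters, but most of them can be left alone; the only thing you really have to change is the training set path.
Note: if your training set files are named differently from the beginner training set I provide, adjust the TrainRegex and TestRegex regular expressions to match.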
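As an illustration of what the label regex does, the sketch below applies the TrainRegex value from config.yaml to the example file name mentioned in the configuration comments. The helper `extract_label` is hypothetical and is not part of the project.

```python
import re

# Pattern copied from the TrainRegex value in config.yaml.
TRAIN_REGEX = r".*?(?=_)"


def extract_label(filename, pattern=TRAIN_REGEX):
    """Return the label encoded in a sample file name, or raise if nothing matches."""
    match = re.search(pattern, filename)
    if match is None:
        raise ValueError(f"file name does not match the label pattern: {filename!r}")
    return match.group()


print(extract_label("apple_20181010121212.jpg"))  # apple
```

The lazy `.*?` combined with the lookahead `(?=_)` stops at the first underscore, so everything before it is taken as the label. If your files use another separator, this is the part of the pattern to change.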
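TrainsPath and TestPath accept a list, so several paths can be given. This is useful if you need to train several kinds of samples into a single model, or want to train a general-purpose model.
To speed up training and make reading the training set more efficient, make_dataset.py is provided to pack the training set into tfrecords format. The packed training set is written to this project's dataset directory, and you only need to change the TrainsPath key as follows: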

TrainsPath: './dataset/xxx.tfrecords'

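TestPath may be left empty.
If TestPath is empty, a test set of TestSetNum samples is split off automatically.
If you rely on this automatic split, TestSetNum (the total size of the test set) must be greater than or equal to TestBatchSize (the number of test samples read per batch).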
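The constraint between TestSetNum and TestBatchSize can be sketched as follows. This is only an illustration under assumptions: the function `split_train_test`, the shuffling, and the generated file names are hypothetical, and the project's own splitting logic may differ.

```python
import random


def split_train_test(samples, test_set_num, test_batch_size, seed=0):
    """Hold out test_set_num samples as a test set (hypothetical helper)."""
    if test_set_num < test_batch_size:
        raise ValueError("TestSetNum must be >= TestBatchSize")
    if test_set_num > len(samples):
        raise ValueError("TestSetNum exceeds the number of available samples")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # shuffling is an assumption
    return shuffled[test_set_num:], shuffled[:test_set_num]


# Made-up file names; the sizes mirror TestSetNum and TestBatchSize in config.yaml.
samples = [f"label{i}_{i}.jpg" for i in range(1000)]
train_set, test_set = split_train_test(samples, test_set_num=300, test_batch_size=64)
print(len(train_set), len(test_set))  # 700 300
```

The check exists because a test batch cannot be filled from a test set that is smaller than the batch itself.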
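A few words on the neural network. The default combination is CNN5 (a 5-layer CNN) + BLSTM (Bidirectional LSTM) + CTC, which converged fastest in my own tests; however, when the training set is too small and the real images vary a lot and have many features, it overfits easily.
With some luck, DenseNet can train a high-accuracy model even from a very small sample set. Why luck? Because the random initial weights matter a great deal for how quickly training converges: on a good run, accuracy on the test set may already reach 40-60% within the first 500 steps; on a bad run it may still be 0 after 2000 steps. Convergence speed does involve an element of luck.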

NeuralNet:
  CNNNetwork: CNN5
  RecurrentNetwork: BLSTM
  HiddenNum: 64
  KeepProb: 0.99

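For the hidden layer size HiddenNum I have tried values from 8 to 64, all of which keep the model file very small. To use DenseNet instead of CNN5, simply change the CNNNetwork parameter in the configuration above to: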

NeuralNet:
  CNNNetwork: DenseNet
  ......
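For readers unfamiliar with the CTC part of CNN+BLSTM+CTC, the sketch below shows the idea behind collapsing a per-frame prediction into a label string, which is what an option such as CTCMergeRepeated controls. It is a simplified greedy decoder in plain Python, not the project's code: the configuration's CTCBeamWidth suggests a beam search decoder is used in practice, and the choice of the last index as the blank class is an assumption made for this example.

```python
from itertools import groupby


def ctc_greedy_decode(frame_labels, charset, merge_repeated=True):
    """Collapse a per-frame label sequence into text.

    Assumption: the blank class is the index right after the last character.
    """
    blank = len(charset)
    if merge_repeated:
        # Collapse runs of identical consecutive labels into one.
        frame_labels = [label for label, _ in groupby(frame_labels)]
    # Drop blanks; what remains is the decoded text.
    return "".join(charset[label] for label in frame_labels if label != blank)


frames = [0, 0, 3, 0, 3, 1, 1, 3, 2]  # 3 is the blank for the charset "abc"
print(ctc_greedy_decode(frames, "abc"))                        # aabc
print(ctc_greedy_decode(frames, "abc", merge_repeated=False))  # aaabbc
```

The blank between the two runs of `a` is what lets the decoder keep a genuinely doubled character while still merging the repeated frames of a single character.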

model.yaml # model configuration

# Sites: A bindable parameter used to select a model.
# - If this parameter is defined,
# - it can be identified by using the model_site parameter
# - to identify a model that is inconsistent with the actual size of the current model.
# ModelName: Corresponding to the model file in the model directory,
# - such as YourModelName.pb, fill in YourModelName here.
# ModelType: This parameter is also used to locate the model.
# - The difference from the sites is that if there is no corresponding site,
# - the size will be used to assign the model.
# - If a model of the corresponding size and corresponding to the ModelType is not found,
# - the model belonging to the category is preferentially selected.
# CharSet: Provides a default optional built-in solution:
# - [ALPHANUMERIC, ALPHANUMERIC_LOWER, ALPHANUMERIC_UPPER,
# -- NUMERIC, ALPHABET_LOWER, ALPHABET_UPPER, ALPHABET]
# - Or you can use your own customized character set like: ['a', '1', '2'].
# CharExclude: CharExclude should be a list, like: ['a', '1', '2']
# - which is convenient for users to freely combine character sets.
# - If you don't want to define the character set manually,
# - you can choose a built-in character set
# - and set the characters to be excluded by CharExclude parameter.
Model:
  Sites: []
  ModelName: cctv.com
  ModelType: 160x60
  CharSet: ALPHANUMERIC
  CharExclude: []
  CharReplace: {
   }
  ImageWidth: 160
  ImageHeight: 60
  ImageChannel: 1
  Version: 1.0

# Binaryzation: [-1: Off, >0 and < 255: On].
# Smoothing: [-1: Off, >0: On].
# Blur: [-1: Off, >0: On].
# Resize: [WIDTH, HEIGHT]
# - If the image size is too small, the training effect will be poor and you need to zoom in.
Pretreatment:
  Binaryzation: -1
  Smoothing: -1
  Blur: -1

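In the configuration above you only need to pay attention to
ModelName, CharSet, ImageWidth, ImageHeight
Giving the model a good name is the first step to success. The character set CharSet usually does not need to be changed: typical image CAPTCHAs consist of digits and English letters and are generally case-insensitive. Because the training sets collected from CAPTCHA-solving platforms are of uneven quality, with some labels in upper case and some in lower case, it is better to normalize everything to lower case; the default, ALPHANUMERIC_LOWER, automatically converts upper-case letters to lower case. The character set is very flexible to customize: besides the types listed in the configuration comments, you can also train on Chinese characters. A custom character set is written as a list, for example: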

CharSet: ['常', '世', '宁', '慢', '南', '制', '根', '难']

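You can define the set according to how often characters actually occur in the training set you collected, or simply grab a list of the 3,500 most common Chinese characters online and train on that. Note: a Chinese character set is usually much larger than a digits-and-letters one, so convergence is slow at first, and both longer training time and more samples are needed. Know your limits before you commit.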
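How CharSet and CharExclude might combine can be sketched as below. The built-in names come from the comments in model.yaml, but the exact contents of each built-in set and the helper `resolve_charset` are assumptions made for illustration.

```python
import string

# Assumed contents for the built-in names listed in the model.yaml comments.
BUILTIN_CHARSETS = {
    "NUMERIC": string.digits,
    "ALPHABET_LOWER": string.ascii_lowercase,
    "ALPHABET_UPPER": string.ascii_uppercase,
    "ALPHABET": string.ascii_letters,
    "ALPHANUMERIC_LOWER": string.digits + string.ascii_lowercase,
    "ALPHANUMERIC_UPPER": string.digits + string.ascii_uppercase,
    "ALPHANUMERIC": string.digits + string.ascii_letters,
}


def resolve_charset(char_set, char_exclude=()):
    """Expand a built-in name or a custom list, then drop the excluded characters."""
    chars = BUILTIN_CHARSETS[char_set] if isinstance(char_set, str) else char_set
    excluded = set(char_exclude)
    return [c for c in chars if c not in excluded]


print(len(resolve_charset("ALPHANUMERIC_LOWER")))         # 36
print(resolve_charset("NUMERIC", char_exclude=["0", "1"]))
print(resolve_charset(["常", "世", "宁"]))
```

Starting from a built-in set and excluding characters that never appear (for example easily confused ones such as `0` and `o`) keeps the output layer small, which matters more as the character set grows.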

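CAPTCHA images like the one in the figure above can easily be trained to a recognition rate above 95%.
ImageWidth and ImageHeight only need to match the size of your current images; these settings are mainly there to support the smart strategy used later in deployment.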
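One way to picture such a strategy, going only by the comments in model.yaml (Sites first, image size as a fallback), is the sketch below. It is a guess at the idea, not the deployment code: `pick_model`, the second model entry `demo` and the site `example.com` are all hypothetical.

```python
def pick_model(models, site=None, size=None):
    """Pick a model by bound site first, then by image size (hypothetical logic)."""
    if site is not None:
        for model in models:
            if site in model["Sites"]:
                return model
    if size is not None:
        for model in models:
            if (model["ImageWidth"], model["ImageHeight"]) == tuple(size):
                return model
    return None


models = [
    # Values taken from the model.yaml listing in this post.
    {"ModelName": "cctv.com", "Sites": [], "ImageWidth": 160, "ImageHeight": 60},
    # Hypothetical second model, added only to show the site lookup.
    {"ModelName": "demo", "Sites": ["example.com"], "ImageWidth": 120, "ImageHeight": 40},
]

print(pick_model(models, size=(160, 60))["ModelName"])                      # cctv.com
print(pick_model(models, site="example.com", size=(160, 60))["ModelName"])  # demo
```

In a scheme like this, a request that names a site can reach the right model even when its image size differs from the size the model was configured with, which matches the description of Sites in the comments.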
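The other parameters, such as those under Pretreatment, are for image preprocessing. Because I am aiming for a general-purpose model, the only preprocessing the model uses is grayscale conversion. The optional binarization, mean filtering and Gaussian blur are all left off; even without them the framework already achieves very good recognition results, and most of the models I use myself reach a recognition rate above 98%.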
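The grayscale step and the optional binarization switch can be illustrated in plain Python. This sketch assumes the common luminance weights 0.299, 0.587 and 0.114 and represents an image as nested lists of RGB tuples; the project itself presumably works on real image arrays, and both function names are made up here.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) pixels to gray values with common luminance weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_image]


def binarize(gray_image, threshold=-1):
    """Apply a fixed threshold; -1 means off, mirroring Binaryzation: -1."""
    if threshold <= 0:
        return gray_image
    return [[255 if pixel > threshold else 0 for pixel in row] for row in gray_image]


image = [[(255, 255, 255), (0, 0, 0)],
         [(255, 0, 0), (10, 10, 10)]]
gray = to_grayscale(image)
print(gray)                 # [[255, 0], [76, 10]]
print(binarize(gray))       # unchanged, binarization is off
print(binarize(gray, 127))  # [[255, 0], [0, 0]]
```

With ImageChannel set to 1 in model.yaml, a single gray channel like this is what the network would receive.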