小言_互联网's Blog

Recognizing Simple Image CAPTCHAs


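1. Installing and configuring Pytesser3
(1) Download pytesser. The package can be unpacked to any location.
Baidu Netdisk link: https://pan.baidu.com/s/1zYUodmQmrzs_J6yKZC4oaA
Extraction code: jeed
(2) Install Pytesser3 directly with pip: pip install Pytesser3
(3) Find the newly installed pytesser3 package inside your Python installation directory. On my machine it is under C:\Users\Administrator\AppData\Local\Programs\Python\Python37\Lib\site-packages
Locate the corresponding directory on your own system.
(4) In the __init__.py file of the pytesser3 package, set tesseract_exe_name to the path of your tesseract.exe (the one inside the pytesser folder downloaded from Baidu Netdisk).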
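
Step (3) can also be done from Python instead of browsing folders by hand. The following is a minimal sketch using the standard library's importlib; the helper name find_package_dir is hypothetical, and it assumes the package is importable under the name pytesser3, as in the usage section.

```python
import importlib.util
import os


def find_package_dir(package_name):
    """Return the directory of an installed package, or None if it is not found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


if __name__ == "__main__":
    # Assumption: the package installed by `pip install Pytesser3`
    # is imported as `pytesser3`.
    package_dir = find_package_dir("pytesser3")
    if package_dir is None:
        print("pytesser3 is not installed in this interpreter")
    else:
        # This is the file whose tesseract_exe_name needs editing.
        print(os.path.join(package_dir, "__init__.py"))
```

Run it with the same interpreter you used for pip install, otherwise it may report a different site-packages directory or none at all.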


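Contents of __init__.py. Change line 11 to your own path (the path to tesseract.exe inside the pytesser folder downloaded from Baidu Netdisk; note that backslashes in the path must be doubled).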

"""OCR in Python using the Tesseract engine from Google
http://code.google.com/p/pytesser/
by Michael J.T. O'Kelly
V 0.0.1, 3/10/07"""

from PIL import Image
import subprocess
from pytesser3 import util
from pytesser3 import errors
# Be sure to change tesseract_exe_name below to the path of your own Tesseract OCR executable.
tesseract_exe_name = '\\pytesser\\tesseract' # Name of executable to be called at command line
scratch_image_name = "temp.bmp" # This file must be .bmp or other Tesseract-compatible format
scratch_text_name_root = "temp" # Leave out the .txt extension
cleanup_scratch_flag = True  # Temporary files cleaned up after OCR operation

def call_tesseract(input_filename, output_filename):
	"""Calls external tesseract.exe on input file (restrictions on types),
	outputting output_filename+'txt'"""
	args = [tesseract_exe_name, input_filename, output_filename]
	proc = subprocess.Popen(args)
	retcode = proc.wait()
	if retcode!=0:
		errors.check_for_errors()

def image_to_string(im, cleanup = cleanup_scratch_flag):
	"""Converts im to file, applies tesseract, and fetches resulting text.
	If cleanup=True, delete scratch files after operation."""
	try:
		util.image_to_scratch(im, scratch_image_name)
		call_tesseract(scratch_image_name, scratch_text_name_root)
		text = util.retrieve_text(scratch_text_name_root)
	finally:
		if cleanup:
			util.perform_cleanup(scratch_image_name, scratch_text_name_root)
	return text

def image_file_to_string(filename, cleanup = cleanup_scratch_flag, graceful_errors=True):
	"""Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
	converts to compatible format and then applies tesseract.  Fetches resulting text.
	If cleanup=True, delete scratch files after operation."""
	try:
		try:
			call_tesseract(filename, scratch_text_name_root)
			text = util.retrieve_text(scratch_text_name_root)
		except errors.Tesser_General_Exception:
			if graceful_errors:
				im = Image.open(filename)
				text = image_to_string(im, cleanup)
			else:
				raise
	finally:
		if cleanup:
			util.perform_cleanup(scratch_image_name, scratch_text_name_root)
	return text
	

if __name__=='__main__':
	im = Image.open('phototest.tif')
	text = image_to_string(im)
	print(text)
	try:
		text = image_file_to_string('fnord.tif', graceful_errors=False)
	except Exception as e:
		print("fnord.tif is incompatible filetype.  Try graceful_errors=True")
	text = image_file_to_string('fnord.tif', graceful_errors=True)
	print("fnord.tif contents:", text)
	text = image_file_to_string('fonts_test.png', graceful_errors=True)
	print(text)
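
The doubled backslashes in tesseract_exe_name matter because a single backslash starts an escape sequence in an ordinary Python string literal. Below is a small sketch of the two equivalent spellings, plus a check that the executable really exists. The path C:\pytesser\tesseract.exe and the helper executable_exists are made up for illustration; substitute the location where you unpacked pytesser.

```python
import os

# Hypothetical location of the downloaded pytesser folder; use your own.
escaped_path = "C:\\pytesser\\tesseract.exe"  # backslashes doubled
raw_path = r"C:\pytesser\tesseract.exe"       # raw string, no doubling needed

# Both literals produce exactly the same string.
print(escaped_path == raw_path)  # True

# A single backslash can silently corrupt the path: "\t" is a tab character.
broken_path = "C:\tesseract.exe"
print("\t" in broken_path)  # True


def executable_exists(path):
    """True if `path`, or `path` with '.exe' appended, is an existing file."""
    return os.path.isfile(path) or os.path.isfile(path + ".exe")


print(executable_exists(escaped_path))
```

If the last line prints False on your machine, the call to tesseract in call_tesseract will most likely fail before any OCR happens.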
2. Usage
import pytesser3
from PIL import Image
import requests
from io import BytesIO

# CAPTCHA URL
url = "http://my.cnki.net/elibregister/CheckCode.aspx?id=1570249886669"

resp = requests.get(url).content

im = Image.open(BytesIO(resp))
# Original CAPTCHA image
im.show()
# Convert to grayscale
im = im.convert("L")
# Binarization threshold; its value strongly affects recognition accuracy
threshold = 150

# Lookup table: gray levels below the threshold map to 0, all others to 1
table = [0 if i < threshold else 1 for i in range(256)]

"""
Image.point(table, mode) remaps pixel values through a lookup table.
table is the mapping and should contain 256 entries per band; mode is the
output mode, where 'L' means grayscale and '1' means bilevel (black and white).
"""
im = im.point(table, '1')

im.show()
result = pytesser3.image_to_string(im)

print(result)
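
To see what the lookup table does without downloading an image, the same mapping can be applied by hand to a few grayscale values. This is only an illustration in plain Python: build_threshold_table and binarize are hypothetical helper names, and the sample values are made up. In the script above, Image.point applies the table to every pixel.

```python
def build_threshold_table(threshold):
    """Map each of the 256 gray levels to 0 (below threshold) or 1 (otherwise)."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(pixels, table):
    """Look up every grayscale value in the table, one pixel at a time."""
    return [table[value] for value in pixels]


table = build_threshold_table(150)
sample_row = [0, 80, 149, 150, 200, 255]  # made-up gray levels
print(binarize(sample_row, table))  # [0, 0, 0, 1, 1, 1]
```

Values below 150 become 0 and everything else becomes 1, so moving the threshold changes which faint noise pixels end up on the same side as the characters. Trying a few thresholds on your own CAPTCHA samples is usually the quickest way to pick one.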

Reposted from: https://blog.csdn.net/gklcsdn/article/details/102147294