Applying LSTM to CAPTCHA Recognition

jTessBoxEditorFX-2.3.0

Pre-trained data

#For CentOS 7 run the following as root to install Tesseract with English language traineddata:
yum -y install yum-utils
yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
yum update
yum install tesseract
yum install tesseract-langpack-eng
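#To install Tesseract on Windows 10:
1. Download and extract jTessBoxEditor
2. Add {extraction directory}\jTessBoxEditorFX\tesseract-ocr to Path
3. Download and extract the pre-trained data into the current directory
4. Create a new environment variable TESSDATA_PREFIX with the value {extraction directory}\tessdata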

Run the command tesseract --help-extra in a terminal; if it prints the information shown above, the installation succeeded.
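The same check can be scripted. The following is a minimal sketch, assuming Python 3.7 or later is available; the function name check_tesseract_install is made up for illustration. It only looks for the tesseract executable on the search path and reports whether TESSDATA_PREFIX points to an existing directory. It does not change anything.

```python
import os
import shutil
import subprocess


def check_tesseract_install():
    """Return a dict describing what could be found; nothing is modified."""
    exe = shutil.which("tesseract")  # None when tesseract is not on PATH
    prefix = os.environ.get("TESSDATA_PREFIX")
    report = {
        "executable": exe,
        "tessdata_prefix": prefix,
        "tessdata_exists": bool(prefix) and os.path.isdir(prefix),
        "version": None,
    }
    if exe:
        # Some builds print the version banner to stderr instead of stdout,
        # so fall back to stderr when stdout is empty.
        done = subprocess.run([exe, "--version"], capture_output=True, text=True)
        lines = (done.stdout or done.stderr).strip().splitlines()
        report["version"] = lines[0] if lines else None
    return report


if __name__ == "__main__":
    for key, value in check_tesseract_install().items():
        print(f"{key}: {value}")
```

If executable is None, the Path entry from step 2 is probably missing; if tessdata_exists is False, step 4 is the likely culprit.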

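Collect the CAPTCHA images needed for training yourself.

Follow the method in Xiao Pengwei's article 《Tesseract-OCR-04-使用 jTessBoxEditor提高文字识别准确率》 (on using jTessBoxEditor to improve recognition accuracy) to generate the fdu.ufont.exp0.tif file.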

#This command generates the fdu.ufont.exp0.box file
tesseract fdu.ufont.exp0.tif fdu.ufont.exp0 -l eng --psm 8 --oem 0 nobatch box.train -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzAT-

Continue with Xiao Pengwei's method to correct the .box file.
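Manual correction is easy to get wrong, so a quick sanity check of the corrected file can help. The sketch below assumes the standard box format, one symbol per line followed by left, bottom, right, top and page number, with the origin at the bottom-left corner of the image. The helper name check_box_lines is hypothetical, and the allowed character set is copied from the whitelist in the command above.

```python
ALLOWED = set("abcdefghijklmnopqrstuvwxyzAT-")  # whitelist from the command above


def check_box_lines(lines, allowed=ALLOWED):
    """Return (line_number, text, reason) for every suspicious box line."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\r\n")
        if not text.strip():
            continue  # ignore blank lines
        parts = text.split(" ")
        if len(parts) != 6:
            problems.append((number, text, "expected 6 fields"))
            continue
        symbol, *coords = parts
        if symbol not in allowed:
            problems.append((number, text, "symbol outside whitelist"))
            continue
        try:
            left, bottom, right, top, _page = (int(c) for c in coords)
        except ValueError:
            problems.append((number, text, "non-integer coordinate"))
            continue
        # Origin is bottom-left, so a valid box has left < right and bottom < top.
        if left >= right or bottom >= top:
            problems.append((number, text, "empty or inverted box"))
    return problems
```

To use it, open fdu.ufont.exp0.box with encoding="utf-8", pass the file object to check_box_lines, and print whatever it returns; an empty list means nothing obvious is wrong.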

#Save fdu.ufont.exp0.tif and the corrected fdu.ufont.exp0.box together in a new, separate folder, then run this .ps1 file from that folder to obtain fdu.traineddata
tesseract fdu.ufont.exp0.tif fdu.ufont.exp0 -l enb --psm 8 lstm.train
combine_tessdata -e "$env:TESSDATA_PREFIX\enb.traineddata" enb.lstm
$PSroot = Get-ChildItem
$PSroot = Split-Path $PSroot.Get(0).FullName
$fso=New-Object -ComObject Scripting.FileSystemObject
$fso.CreateTextFile('fdu.training_files.txt',2).Write("$PSroot\fdu.ufont.exp0.lstmf" )
if (-not (Test-Path -Path output)){mkdir output}
lstmtraining --model_output="$PSroot\output\output" --continue_from="$PSroot\enb.lstm" --train_listfile="$PSroot\fdu.training_files.txt" --traineddata="$env:TESSDATA_PREFIX\enb.traineddata" --debug_interval -1 --target_error_rate 0.001
lstmtraining --stop_training --continue_from="$PSroot\output\output_checkpoint" --traineddata="$env:TESSDATA_PREFIX\enb.traineddata" --model_output="$PSroot\fdu.traineddata"
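For readers who are not on PowerShell, the two lstmtraining calls can also be assembled in Python. This is only a sketch: it builds the argument lists with the same flags as the script above and does not run anything. The helper name build_lstm_commands and its parameters are my own, and it assumes the lstm.train and combine_tessdata steps have already produced the .lstmf, .lstm and training list files in the working folder.

```python
from pathlib import Path


def build_lstm_commands(work_dir, tessdata_dir, base="fdu", start_lang="enb"):
    """Build the fine-tuning and the finalizing lstmtraining commands."""
    work = Path(work_dir)
    start = Path(tessdata_dir) / f"{start_lang}.traineddata"
    out = work / "output" / "output"  # checkpoint prefix, as in the script
    train = [
        "lstmtraining",
        f"--model_output={out}",
        f"--continue_from={work / (start_lang + '.lstm')}",
        f"--train_listfile={work / (base + '.training_files.txt')}",
        f"--traineddata={start}",
        "--debug_interval", "-1",
        "--target_error_rate", "0.001",
    ]
    finish = [
        "lstmtraining",
        "--stop_training",
        f"--continue_from={out}_checkpoint",
        f"--traineddata={start}",
        f"--model_output={work / (base + '.traineddata')}",
    ]
    return train, finish
```

Each list can be handed to subprocess.run(cmd, check=True) in order, after creating the output folder with Path(work_dir, "output").mkdir(exist_ok=True).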

The final result is as shown above.

Move the resulting fdu.traineddata file into the tessdata folder; it can then be used by passing the parameter -l fdu.

# This program gives a quick, rough check of the training result
from PIL import Image
#from itertools import cycle
import os, random, re
import pytesseract
from IPython.display import display  # display() is provided by IPython/Jupyter

fl = re.compile(r'[a-zA-Z-]+')
def clearStr(s):
    # Keep only letters and hyphens from the OCR output
    return ''.join(fl.findall(s))

class Fileset(list):
    """List of file names in one folder; items are returned as full paths."""
    def __init__(self, name, ext='', _read=None, root=None):
        if isinstance(name, str):
            self.root = os.path.join(root or os.getcwd(), name)
            self.extend(f for f in os.listdir(self.root) if f.endswith(ext))
        self._read = _read
    def __getitem__(self, index):
        if isinstance(index, int):  # index is an integer index
            return os.path.join(self.root, super().__getitem__(index))
        else:  # index is a slice
            fileset = Fileset(None)
            fileset.root = self.root
            fileset._read = self._read
            fileset.extend(super().__getitem__(index))
            return fileset
    def getFileName(self, index):
        fname, ext = os.path.splitext(super().__getitem__(index))
        return fname
    def __iter__(self):
        if self._read:
            return (self._read(os.path.join(self.root, f)) for f in super().__iter__())
        else:
            return (os.path.join(self.root, f) for f in super().__iter__())
    def __call__(self):
        # Return one randomly chosen file, read with _read if it is set
        retn = random.choice(self)
        if self._read:
            return self._read(retn)
        else:
            return retn

# def fopen(path):
#     with open(path, 'rb') as f:
#         return f.read()
# #from tesOCR import tesOCR as OCR1
# sample = Fileset('Captcha', '.jpg', fopen)
sample = Fileset('Captcha', '.jpg', Image.open)

# Fine-tuned LSTM model
config1 = '--psm 8'
def OCR1(img):
    return pytesseract.image_to_string(img, lang='fdu', config=config1)

# Legacy engine with a character whitelist
config2 = "--psm 8 --oem 0 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzAT-"
def OCR2(img):
    return pytesseract.image_to_string(img, lang='eng', config=config2)

# Show every image on which the two engines disagree
for a in sample:
    b = a.convert("L")
    x = clearStr(OCR1(b))
    y = clearStr(OCR2(b))
    if x != y:
        display(a)
        print(f"LSTM is {x} ; Legacy is {y}")
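The script above only shows the images on which the two engines disagree, which does not say which engine is right. If ground-truth labels are available, a number is easier to compare. The sketch below assumes you have a list of correct labels in the same order as the predictions (for example taken from the file names, which is my assumption and not something the original data guarantees). The function names accuracy and compare_engines are hypothetical.

```python
def accuracy(predictions, truths):
    """Fraction of positions where prediction and truth are identical strings."""
    pairs = list(zip(predictions, truths))
    if not pairs:
        return 0.0
    return sum(p == t for p, t in pairs) / len(pairs)


def compare_engines(lstm_out, legacy_out, truths):
    """Summarize both engines against the labels, plus how often they agree."""
    return {
        "lstm": accuracy(lstm_out, truths),
        "legacy": accuracy(legacy_out, truths),
        "agreement": accuracy(lstm_out, legacy_out),
    }
```

In the loop above, the cleaned outputs x and y could be appended to two lists and passed to compare_engines together with the labels once the loop has finished.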

My results and the Python wrapper

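Notes:

1. The FX variant of jTessBoxEditor (the JavaFX-based build) is the one that supports Chinese.
2. Of the pre-trained data files, the 22.3 MB one is the Legacy model and the 14.6 MB one is the LSTM model; both are for the language eng.
3. The text after "tessedit_char_whitelist=" lists the characters that may appear in the CAPTCHAs.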
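Instead of typing the whitelist by hand, it can be derived from CAPTCHA answers you already know. This is a small sketch under the assumption that such a list of known answers exists; the function name build_whitelist_config is hypothetical, and the default --psm and --oem values simply repeat the ones used in this post.

```python
def build_whitelist_config(labels, psm=8, oem=0):
    """Build a Tesseract config string whose whitelist covers all label characters."""
    # Collect every distinct non-whitespace character that occurs in the labels
    chars = "".join(sorted({ch for label in labels for ch in label if not ch.isspace()}))
    return f"--psm {psm} --oem {oem} -c tessedit_char_whitelist={chars}"
```

The returned string can be passed as the config argument of pytesseract.image_to_string, in the same way as config2 in the test program.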

https://b.limour.top/309.html
Author: Limour
Posted on: July 11, 2020
Licensed under