A quick tour of 6 open-source OCR tools that support Chinese

aabond · published 2022-08-24 10:00:00

Preface

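OCR (optical character recognition) is the process in which an electronic device (such as a scanner or digital camera) examines characters printed on paper and then uses character recognition methods to translate their shapes into computer text; that is, the process of scanning text material and then analyzing the image file to obtain the text and layout information.

How to correct errors, or to use auxiliary information to improve recognition accuracy, is the most important topic in OCR.

The main indicators for judging an OCR system are: rejection rate, misrecognition rate, recognition speed, user-interface friendliness, product stability, ease of use, and feasibility.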
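To compare engines on the same image, the misrecognition rate can be approximated by a character error rate: the edit distance between the OCR output and a hand-typed reference, divided by the reference length. This is a common stand-in and an assumption here, not a definition taken from this article; the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Calling `character_error_rate(reference, ocr_output)` on each engine's output gives one number per engine; lower is better, and the value can exceed 1.0 when the output is much longer than the reference.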

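There are many open-source OCR programs, and many projects on GitHub support Chinese character recognition. The six open-source projects below are used as examples to try out OCR software.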

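1. Tesseract OCR

Tesseract OCR is an OCR engine that runs on a variety of operating systems. It is open-source software released under the Apache license. It was originally developed by Hewlett-Packard in the 1980s as proprietary software, was released as open source in 2005, and has been sponsored by Google since 2006. It is written in C++.

The initial version supported only English. Version 2.0 added six Western languages (French, Italian, German, Spanish, Brazilian Portuguese, and Dutch). Version 3.0 extended language support further, including ideographic scripts (Chinese and Japanese) and right-to-left scripts (such as Arabic and Hebrew), among others. By version 4.0, 116 languages were supported. Version 5.0, released in 2021, significantly improved performance. More details on each version are in the release notes: https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html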

1.1 Installation

Linux

sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-chi-sim

Windows

- tesseract-ocr-w32-setup-v5.2.0.20220712.exe (32 bit)
- tesseract-ocr-w64-setup-v5.2.0.20220712.exe (64 bit)

Mac
```
brew install tesseract
```
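After installing, it can be worth confirming that the Simplified Chinese data is actually visible to Tesseract. The sketch below calls `tesseract --list-langs` and parses the output. The helper names are illustrative, and the parsing assumes the usual layout of one header line followed by one language code per line.

```python
import shutil
import subprocess
from typing import List


def parse_list_langs(output: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line ("List of available languages ...").
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs() -> List[str]:
    """Return the installed language codes, or an empty list if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    done = subprocess.run([exe, "--list-langs"], capture_output=True, text=True, check=False)
    # Older builds may print the list to stderr, so both streams are inspected.
    return parse_list_langs(done.stdout + "\n" + done.stderr)


if __name__ == "__main__":
    print("chi_sim installed:", "chi_sim" in installed_langs())
```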

1.2 Running

The example below runs on Windows, from the command line:

tesseract.exe  C:\Users\root\Desktop\emoj.jpg C:\Users\root\Desktop\hello -l chi_sim

Recognition requires specifying a language and having the corresponding trained data. The project provides ready-made trained data: https://github.com/tesseract-ocr/tessdata
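The same command can be driven from Python with `subprocess`. This is a minimal sketch: the function names are illustrative, and it assumes that Tesseract writes its text output to `<output_base>.txt` in UTF-8 and that `--tessdata-dir` points at a folder containing the downloaded `.traineddata` files.

```python
import subprocess
from typing import List, Optional


def build_tesseract_cmd(image: str, output_base: str, lang: str = "chi_sim",
                        tessdata_dir: Optional[str] = None,
                        exe: str = "tesseract") -> List[str]:
    """Build the argument list for `tesseract <image> <output_base> -l <lang>`."""
    cmd = [exe, image, output_base, "-l", lang]
    if tessdata_dir:
        # Optional: use trained data from a custom directory.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_tesseract(image: str, output_base: str, **kwargs) -> str:
    """Run tesseract and return the recognized text read back from the output file."""
    subprocess.run(build_tesseract_cmd(image, output_base, **kwargs), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()
```

For example, `run_tesseract("emoj.jpg", "hello")` mirrors the command shown above, with relative paths instead of the desktop paths.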

A meme image (emoj.jpg) is used as the test input.

Recognition result:

和牛局 “”砚害 1或哟梧槽!

哦哟，草泥马的，这么6 好好好 !

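As the output shows, the officially provided trained data still produces large errors. The low accuracy for Chinese has also been raised on GitHub (issue "ocr quality on chi_sim"), but it has not been resolved so far, so more accurate results require training your own data.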

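2. PaddleOCR

PaddleOCR is an OCR toolkit from Paddle, Baidu's open-source deep learning platform. It was open-sourced in May 2020. Its hallmark is an ultra-lightweight Chinese OCR model, and it was later extended to recognize more than 80 languages. It is developed in Python.

Online demo: https://www.paddlepaddle.org.cn/hub/scene/ocr

Documentation: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.5/doc/doc_ch/quickstart.md#21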

2.1 Installation

pip install paddlepaddle
pip install paddleocr

2.2 Running

PaddleOCR can be run from the command line:

paddleocr --image_dir emoj.jpg --lang ch --use_gpu false

It can also be used from a Python script, which makes it possible to visualize the result:

from PIL import Image
from paddleocr import PaddleOCR, draw_ocr

# PaddleOCR supports Chinese, English, French, German, Korean and Japanese.
# Set the `lang` parameter to `ch`, `en`, `fr`, `german`, `korean` or `japan`
# to switch the language model.
ocr = PaddleOCR(lang='ch', use_gpu=False)  # run once: downloads the model and loads it into memory
img_path = 'emoj.jpg'
# With the 2.5 release documented above, `result` is a list of [box, (text, score)] lines.
# Later releases (2.6+) return one such list per image; use result[0] there.
result = ocr.ocr(img_path, cls=False)
for line in result:
    print(line)

# Draw the boxes and recognized text onto the image and save it.
image = Image.open(img_path).convert('RGB')
boxes = [line[0] for line in result]
txts = [line[1][0] for line in result]
scores = [line[1][1] for line in result]
im_show = draw_ocr(image, boxes, txts, scores, drop_score=0, font_path='C:\\Windows\\Fonts\\simfang.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')
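The script above indexes each line as `[box, (text, score)]`. The sketch below pulls out only the texts above a confidence threshold and tolerates both a flat list of lines and a list holding one list of lines per image. The nested shape is an assumption about newer PaddleOCR releases, and the function names are illustrative.

```python
def _is_line(item) -> bool:
    """True if item looks like [box, (text, score)]."""
    return (isinstance(item, (list, tuple)) and len(item) == 2
            and isinstance(item[1], (list, tuple)) and len(item[1]) == 2
            and isinstance(item[1][0], str))


def flatten_paddle_result(result):
    """Return a flat list of [box, (text, score)] lines from either result shape."""
    lines = []
    for item in result or []:
        if _is_line(item):
            lines.append(item)  # flat shape: the item is already a line
        elif item:
            # Nested shape: the item is the list of lines for one image.
            # None or an empty list means nothing was detected in that image.
            lines.extend(line for line in item if _is_line(line))
    return lines


def confident_texts(result, min_score: float = 0.5):
    """Keep only the recognized strings whose score is at least min_score."""
    return [text for _box, (text, score) in flatten_paddle_result(result)
            if score >= min_score]
```

With the script above, `confident_texts(result, 0.8)` would list only the strings the model is fairly sure about; the threshold value is arbitrary.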


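3. EasyOCR

EasyOCR is a Python library for extracting text from images. It is a general-purpose OCR that can read both natural-scene text and dense text in images. It currently supports more than 80 languages, and the number is growing.

Online demo and supported language codes: https://www.jaided.ai/easyocr/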

3.1 Installation

pip install easyocr

3.2 Running

Before running, download the text detection and language model files:

Text detection: https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip

Simplified Chinese: https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/zh_sim_g2.zip

Traditional Chinese: https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/chinese.zip

English: https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/english_g2.zip

Extract the downloaded files to C:\Users\<username>\.EasyOCR\model. The directory can be changed with --model_storage_directory.
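A small check can confirm that the extracted files are where EasyOCR will look before a reader is created. This is a sketch under assumptions: the `.pth` file names are guesses at what the archives above contain, and the helper names are illustrative. `model_storage_directory` and `download_enabled` are parameters of `easyocr.Reader`.

```python
import os
from typing import Iterable, List

# Assumed names of the extracted model files; adjust to what the archives actually contain.
EXPECTED_MODELS = ("craft_mlt_25k.pth", "zh_sim_g2.pth", "english_g2.pth")


def missing_models(model_dir: str, expected: Iterable[str] = EXPECTED_MODELS) -> List[str]:
    """Return the expected model files that are not present in model_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(model_dir, name))]


def load_reader(model_dir: str):
    """Create a CPU reader for Simplified Chinese and English from local files only."""
    import easyocr  # imported lazily so missing_models() works without EasyOCR installed

    missing = missing_models(model_dir)
    if missing:
        raise FileNotFoundError("missing model files: " + ", ".join(missing))
    return easyocr.Reader(["ch_sim", "en"], gpu=False,
                          model_storage_directory=model_dir,
                          download_enabled=False)
```

Passing `download_enabled=False` makes a missing file fail fast instead of triggering a download.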

easyocr -l ch_sim en -f emoj.jpg --detail=1 --gpu=False

import easyocr
from PIL import Image
from paddleocr import draw_ocr  # only used to draw the result

img_path = 'emoj.jpg'
reader = easyocr.Reader(['ch_sim','en'])  # run once to load the model into memory
result = reader.readtext(img_path)  # each item is (box, text, score)

# Draw the boxes and recognized text onto the image and save it.
image = Image.open(img_path).convert('RGB')
boxes = [line[0] for line in result]
txts = [line[1] for line in result]
scores = [line[2] for line in result]

im_show = draw_ocr(image, boxes, txts, scores, drop_score=0, font_path='C:\\Windows\\Fonts\\simfang.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result2.jpg')


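4. chineseocr

Chinese natural-scene text detection and recognition based on YOLOv3 and CRNN.

Features

- Text orientation detection at 0, 90, 180, and 270 degrees (supports dnn/tensorflow)
- Text detection with darknet, opencv dnn, or keras; training with darknet or keras
- Variable-length OCR training (English, mixed Chinese and English); crnn\dense ocr recognition and training; newly added code for converting pytorch models to keras (tools/pytorch_to_keras.py)
- Model conversion: darknet to keras, keras to darknet, pytorch to keras
- Structured data recognition for ID cards and train tickets
- Newly added CNN + ctc model that supports calling OCR through the DNN module; the average time for a single-line image is under 0.02 seconds

4.1 Installation and errors

Download the source code from GitHub. Running python app.py produced the following problems.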

File-not-found error

 File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (unable to open file: name = 'E:\tmp\chineseocr\models\text.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

The model files need to be downloaded; copy all files in the folder into the models directory.

Baidu Netdisk: https://pan.baidu.com/s/1gTW9gwJR6hlwTuyB6nCkzQ
other-links: http://gofile.me/4Nlqh/fNHlWzVWo

String attribute error

 original_keras_version = f.attrs['keras_version'].decode('utf8')
AttributeError: 'str' object has no attribute 'decode'

Solution:

pip uninstall h5py
pip install h5py==2.10.0
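The likely cause of this error is that newer h5py releases (3.x) return string attributes as `str`, while the older Keras loading code expects `bytes` and calls `.decode`. Downgrading, as above, restores the old behaviour. Purely as an illustration of the mismatch, a version-tolerant read could look like the sketch below; `as_text` is a hypothetical helper, not part of Keras, h5py, or chineseocr.

```python
def as_text(value, encoding: str = "utf8") -> str:
    """Return an HDF5 attribute as str, whether it was read as bytes or as str."""
    if isinstance(value, bytes):
        # Older h5py: string attributes arrive as bytes and must be decoded.
        return value.decode(encoding)
    # Newer h5py: the attribute is already a str.
    return str(value)
```

In the failing line, `as_text(f.attrs['keras_version'])` would work with either h5py behaviour.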

The web service then starts successfully, but an encoding error is reported:
```
'gbk' codec can't decode byte 0x80 in position 833: illegal multibyte sequence
```
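This problem has already been reported on GitHub, in the issue "win10+cuda10.0+py3.68，keras版本的ocr 访问 http://127.0.0.1:8088/ocr报错" (the Keras version of the OCR fails when http://127.0.0.1:8088/ocr is accessed).

Solution: https://juejin.cn/post/6844903700079575047

On Windows, create a sitecustomize.py file in the Lib/site-packages directory of the Python installation with the content below. The file can also be created elsewhere and imported manually, but when it is placed here the setting takes effect automatically every time Python starts.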
```
import _locale
_locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])
```
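An error of this kind usually appears when a UTF-8 file is opened without an explicit encoding on a system whose default encoding is GBK. The article does not say which file chineseocr was reading, so the sketch below only illustrates the general mechanism. Where the calling code can be edited, passing `encoding` explicitly avoids depending on the system default. The helper names are illustrative.

```python
import locale


def default_text_encoding() -> str:
    """Encoding that open() falls back to when no encoding argument is given."""
    return locale.getpreferredencoding(False)


def read_text(path: str, encoding: str = "utf-8") -> str:
    """Read a text file with an explicit encoding instead of the system default."""
    with open(path, encoding=encoding) as f:
        return f.read()


if __name__ == "__main__":
    # On a Chinese-locale Windows system this is typically a GBK code page such as cp936.
    print(default_text_encoding())
```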

4.2 Running

Visit http://localhost:8080/ocr and upload an image.


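5. chineseocr_lite

Ultra-lightweight Chinese OCR that supports vertical text recognition and ncnn, mnn, and tnn inference (dbnet (1.8M) + crnn (2.5M) + anglenet (378KB)); the total model size is only 4.7M.

The author also provides demos for several programming languages.

5.1 Installation

Download the source code and run pip install -r requirements.txt

5.2 Running

Run python backend/main.py. Once it is running, visit http://localhost:8089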


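6. CnOCR

CnOCR is a text recognition (Optical Character Recognition, OCR) toolkit for Python 3. It supports recognition of common characters in Simplified Chinese, Traditional Chinese (some models), English, and digits, as well as vertical text. It ships with more than 20 trained models suited to different scenarios, which can be used directly after installation. CnOCR also provides simple training commands for users who want to train their own models.

6.1 Installation

Running pip install cnocr directly failed with a missing protobuf compiler error.

Because protobuf was missing, installing onnx failed, so protobuf was first installed from source: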

git clone https://github.com/protocolbuffers/protobuf.git
cd protobuf
git checkout v3.13.0
git submodule update --init --recursive
cd cmake && mkdir build && cd build && cmake-gui ..

Change the CMake options:

CMAKE_INSTALL_PREFIX D:\softwareInstalled\protoc
protobuf_BUILD_SHARED_LIBS ON

In the build directory, run:

mingw32-make && mingw32-make install

Add D:\softwareInstalled\protoc\bin and D:\softwareInstalled\protoc\lib to the PATH environment variable so that CMake can find them:

set CMAKE_ARGS="-DProtobuf_LIBRARIES=D:\softwareInstalled\protoc\lib"
pip install onnx==1.10.0
pip install cnocr
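Before running the pip commands above, it may help to confirm that the freshly built `protoc` is really reachable through PATH. The sketch below uses only the standard library; the function names are illustrative.

```python
import shutil
import subprocess
from typing import Optional


def find_protoc(name: str = "protoc") -> Optional[str]:
    """Return the full path of the protobuf compiler if it is on PATH, else None."""
    return shutil.which(name)


def protoc_version(exe: str) -> str:
    """Return the version string printed by `protoc --version`."""
    done = subprocess.run([exe, "--version"], capture_output=True, text=True, check=True)
    return done.stdout.strip()


if __name__ == "__main__":
    exe = find_protoc()
    if exe is None:
        print("protoc not found on PATH; check the environment variable settings")
    else:
        print(exe, protoc_version(exe))
```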

6.2 Running

from PIL import Image
from cnocr import CnOcr
from paddleocr import draw_ocr  # only used to draw the result

img_path = 'emoj.jpg'
ocr = CnOcr()  # loads the default models
result = ocr.ocr(img_path)  # list of dicts with 'text', 'score' and 'position'

print(result)

# Draw the boxes and recognized text onto the image and save it.
image = Image.open(img_path).convert('RGB')
boxes = [line['position'] for line in result]
txts = [line['text'] for line in result]
scores = [line['score'] for line in result]

im_show = draw_ocr(image, boxes, txts, scores, drop_score=0, font_path='C:\\Windows\\Fonts\\simfang.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result6.jpg')
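The three Python libraries used in this article return differently shaped results: PaddleOCR lines are `[box, (text, score)]`, EasyOCR items are `(box, text, score)`, and CnOCR items are dicts with `position`, `text`, and `score`. To compare them side by side, they can be converted to one record type. The sketch below assumes exactly the shapes indexed in the scripts above; the function names are illustrative.

```python
from typing import Any, Dict, List


def _box(points) -> List[List[float]]:
    """Convert corner points (lists, tuples or a NumPy array) to plain float pairs."""
    return [[float(x), float(y)] for x, y in points]


def from_paddleocr(line) -> Dict[str, Any]:
    """PaddleOCR line: [box, (text, score)]."""
    box, (text, score) = line
    return {"box": _box(box), "text": text, "score": float(score)}


def from_easyocr(item) -> Dict[str, Any]:
    """EasyOCR item: (box, text, score)."""
    box, text, score = item
    return {"box": _box(box), "text": text, "score": float(score)}


def from_cnocr(item) -> Dict[str, Any]:
    """CnOCR item: dict with 'position', 'text' and 'score'."""
    return {"box": _box(item["position"]), "text": item["text"],
            "score": float(item["score"])}
```

For example, `[from_cnocr(item) for item in result]` turns the CnOCR output above into the common form.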


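7. Summary

- The tests above only used the most basic official models, and ran on the CPU. More accurate recognition can be achieved by tuning parameters or by training your own model.
- Of the projects above, only Tesseract OCR is written in C/C++; all the others use Python, which shows how important Python is in machine learning.
- Libraries that support multiple languages all require a language code, and the code for Chinese differs between them: Simplified Chinese is chi_sim in Tesseract, ch_sim in EasyOCR, and ch in PaddleOCR. Check the language code before use.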
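The language codes can be kept in one lookup table so that a script switching between engines does not mix them up. The codes are the ones passed in the commands and scripts earlier in this article; the table and function names are illustrative.

```python
# Simplified Chinese language code per engine, as used in the sections above.
LANG_CODES = {
    "tesseract": "chi_sim",  # tesseract ... -l chi_sim
    "easyocr": "ch_sim",     # easyocr -l ch_sim en ...
    "paddleocr": "ch",       # paddleocr ... --lang ch
}


def simplified_chinese_code(engine: str) -> str:
    """Look up the Simplified Chinese code for an engine name (case-insensitive)."""
    key = engine.strip().lower()
    if key not in LANG_CODES:
        raise ValueError("unknown OCR engine: " + engine)
    return LANG_CODES[key]
```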