How we're choosing an OCR library to build a document management solution

August

Developing a document management solution: how we’re choosing between Google Text Recognition API, Tesseract, and Anyline OCR libraries

We needed to improve document management within our company, notably to speed up the processing of enterprise paper records, so we decided to develop a software solution based on an OCR library.

OCR, or optical character recognition, is the mechanical or electronic conversion of images of typed text into machine-encoded text.

OCR is a method of digitizing printed text so that it can be electronically stored, edited, displayed, and used in machine processes such as cognitive computing, machine translation, and data mining. OCR is also used for entering information from paper documents, including financial records, business cards, and invoices.

Before starting app development, we investigated three of the most popular OCR libraries to identify the one that would best suit our goals:

Google Text Recognition API

Tesseract

Anyline

Google Text Recognition API

Google Text Recognition API detects text in images and video streams and recognizes the text it contains. Once text is detected, the recognizer determines the actual text in each block and segments it into lines and words. The Text API detects text in multiple languages (French, German, English, etc.) in real time.
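The block, line, and word hierarchy described above can be pictured with a small data model. The following Python sketch is only an illustration and does not call the Google API: the names `Block`, `Line`, and `segment` are hypothetical, and it assumes that a blank line separates blocks, whereas a real recognizer derives block boundaries from the image layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    text: str

    @property
    def words(self) -> List[str]:
        # Words are the whitespace-separated tokens of the line.
        return self.text.split()


@dataclass
class Block:
    lines: List[Line]


def segment(recognized_text: str) -> List[Block]:
    """Group already-recognized text into blocks, lines, and words.

    Assumption for this sketch: a blank line marks a block boundary.
    """
    blocks: List[Block] = []
    current: List[Line] = []
    for raw in recognized_text.splitlines():
        if raw.strip():
            current.append(Line(raw.strip()))
        elif current:
            # A blank line closes the block collected so far.
            blocks.append(Block(current))
            current = []
    if current:
        blocks.append(Block(current))
    return blocks


# Made-up sample text standing in for recognizer output.
sample = "Invoice No. 1024\nDate: 12 August\n\nTotal due: 350 EUR"
blocks = segment(sample)
```

Walking `blocks`, then `block.lines`, then `line.words` mirrors the way an application would typically drill down from a detected block to individual words.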

Overall, Google Text Recognition API handled our tasks well. It let us recognize text both in real time and in existing images of text documents. During our research, we identified the following pros and cons of the library.

Pros:

Ability to recognize text in real time;
Ability to recognize text in images;
Small library size;
High recognition speed.

Cons:

Large training data files (~30 MB).

Tesseract OCR library

Tesseract is an open-source OCR engine for different operating systems. It is free software released under the Apache License, Version 2.0, and it supports a variety of languages.

Google has sponsored Tesseract development since 2006, when it was considered one of the most accurate and effective open-source OCR libraries.

However, we weren't satisfied with the results of integrating Tesseract: the library is very large, and it doesn't recognize text in real time without additional image processing.

Pros:

It’s open source;
It's easy enough to train the engine to recognize the required fonts and improve recognition quality. In our case, the quality of the results increased rapidly after a quick configuration and training of the library.

Cons:

Insufficient recognition rate with the standard data files for fonts, symbols, and words, although training the recognition algorithm can improve it;
Additional processing of the captured image is required for real-time text recognition.
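The article does not say which image processing was applied before recognition. As a hedged illustration only, a common preprocessing step for OCR is converting the frame to grayscale and then binarizing it with a threshold. The sketch below uses plain Python lists instead of a real image library; the function names and the fixed threshold of 128 are assumptions made for this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0..255


def to_grayscale(image: List[List[Pixel]]) -> List[List[int]]:
    # Weighted sum of the color channels (ITU-R BT.601 luma weights).
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    # Pixels at or above the threshold become white (255), the rest black (0).
    # Assumption: a fixed global threshold; real pipelines often adapt it.
    return [[255 if value >= threshold else 0 for value in row] for row in gray]


# A tiny made-up 2x2 "image": black, white, light gray, dark gray.
frame = [
    [(0, 0, 0), (255, 255, 255)],
    [(200, 200, 200), (50, 50, 50)],
]
binary = binarize(to_grayscale(frame))
```

A binarized frame gives the recognizer high-contrast input, which is one reason such a step is often placed in front of an OCR engine.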

Anyline OCR

Anyline provides a multi-platform SDK that allows developers to easily integrate OCR features into apps. It is a paid library designed for commercial use.

We decided to investigate Anyline OCR because it offers plenty of options for setting recognition parameters, as well as models for solving specific tasks.

Pros:

It's fairly easy to configure recognition of the required fonts;
Enables real-time text recognition;
Can recognize barcodes and QR codes;
Provides ready-made modules for different tasks.

Cons:

Low recognition speed;
Initial font configuration is required to get satisfactory text recognition results.

Based on our research, we chose Google Text Recognition API because it combines high processing speed, simple installation, and good recognition results.

We developed a document management solution for our company that scans paper records, automatically digitizes them, and saves them in a single database. Recognition quality is also very high, at about 97%.
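The article does not state how the 97% figure was measured. One common way to quantify recognition quality, shown here purely as an assumption, is character-level accuracy: one minus the edit distance between the reference text and the recognized text, divided by the reference length. The function names below are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute or keep
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Share of the reference text that was recognized correctly (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


# Made-up example: the digit 0 was misread as the letter O.
score = character_accuracy("Invoice 1024", "Invoice 1O24")
```

Averaging such a score over a set of manually transcribed sample documents would yield a single percentage comparable across OCR libraries.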

After we integrated the system, our internal document management (including document processing, data exchange between departments, and document creation) became 15% faster.
