Channel: Web Science and Digital Libraries Research Group

2020-12-02: Comparing Four OCR Tools on US Patent Figure Label Recognition


The task is to extract labels from US patent figures. Patent figures differ from natural images: they are usually drawings of an object or diagrams such as circuits. A figure file may contain one or more figures, each with its own label. We need a software tool that can reliably identify these figure labels. All figures are in TIF format when downloaded from the USPTO patent repository.


In the following experiments, I use OCR tools to extract figure labels, using the whole figure file as input. The candidates I compare are Tesseract, Abbyy, the Amazon Textract API, and the Google Cloud Vision API. Below are sample figures with my comments.





Figure #1 is a standard type of figure, with one drawing and one label.




Figure #2 represents figures with multiple drawings and labels. We need to extract both labels. The dotted lines at the bottom of the outsole may be mistaken for words.



Figure #3 represents more abstract drawings that contain numbers and letters.



Figure #4 also represents more abstract drawings that contain numbers and letters.



Figure #5 represents figures with multiple drawings and labels. The label is close to the drawing.




Figure #6 represents figures with multiple drawings and labels. The label is inside the bounding boxes of drawings.





Figure #7 represents figures with multiple drawings and labels.


Tesseract and Abbyy behave similarly. Both give correct figure labels for images with only one drawing and one figure label (figure #1 is an example). However, they are not very robust: for a small portion of inputs in this format, they return empty output. For images with two or more figure labels (figure #2 is an example), the output contains only one figure label or nothing at all. They also fail on more abstract drawings that contain numbers or words (figure #3 and figure #4). Since images with two or more figure labels are the most important part of our task, this is not acceptable.
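To make "the whole figure file as the input" concrete, here is a minimal sketch of passing a TIF file to the Tesseract command-line tool from Python. It assumes the `tesseract` executable is installed and on the PATH. The function names and the file name in the usage note are hypothetical, and this is not necessarily how the experiments above were run.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path):
    """Build the CLI call. Using "stdout" as the output base makes
    tesseract print the recognized text instead of writing a file."""
    return ["tesseract", str(image_path), "stdout"]


def ocr_whole_figure(image_path):
    """Run Tesseract on the entire figure file and return the raw text.

    Assumption: tesseract is installed locally. No cropping or other
    preprocessing is applied; the whole page is given to the OCR engine.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(Path(image_path)),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `ocr_whole_figure("figure_file.tif")` (a hypothetical file name) would return whatever text Tesseract finds, which may be empty or contain fragments of the drawing, as described above.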

The next candidate is the Amazon Textract API, which works much better than the previous two. First, it correctly identifies the label in almost all images with only one figure label, failing in very few cases for no apparent reason. Second, it can recognize images with two or more figure labels (figure #5 and figure #6). However, it fails on some others (figure #7), outputting only noise or nothing at all; the reason is not clear. It also fails on more abstract drawings that contain numbers or words (figure #3 and figure #4).


I tested 60 figures to measure performance. For figures with only one label, the accuracy is 84%, 91%, and 95% for Tesseract, Abbyy, and Amazon, respectively. For figures with two or more figure labels, the accuracy is 21%, 35%, and 75%, respectively.
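The post does not spell out how accuracy was scored, so the following is only a sketch of one plausible scheme. It assumes a figure counts as correct when the set of extracted labels exactly matches the set of true labels, and it reports single-label and multi-label figures separately. The function names and the sample data in the usage note are hypothetical.

```python
def normalize(label):
    """Uppercase and drop periods and whitespace, so "Fig. 1" == "FIG.1"."""
    return "".join(label.upper().replace(".", " ").split())


def accuracy(predictions, truths):
    """Fraction of figures whose predicted label set equals the true set."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        return 0.0
    correct = sum(
        {normalize(p) for p in pred} == {normalize(t) for t in true}
        for pred, true in zip(predictions, truths)
    )
    return correct / len(truths)


def accuracy_by_group(predictions, truths):
    """Score figures with one true label and with several labels separately."""
    pairs = list(zip(predictions, truths))
    single = [(p, t) for p, t in pairs if len(t) == 1]
    multi = [(p, t) for p, t in pairs if len(t) >= 2]
    return {
        "single": accuracy([p for p, _ in single], [t for _, t in single]),
        "multi": accuracy([p for p, _ in multi], [t for _, t in multi]),
    }
```

For example, with the made-up inputs `truths = [["FIG. 1"], ["FIG. 2A", "FIG. 2B"]]` and `predictions = [["Fig. 1"], ["FIG. 2A"]]`, the first figure is scored as correct and the second as incorrect, because one of its two labels is missing.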

Tesseract, Abbyy, and the Amazon API have one thing in common: almost every output contains noise. This means they do not cleanly separate the label from the drawing. For example, figure #6 is recognized as follows:




It shows that some parts of the drawing are recognized as words. The output is as follows:



Even though it captures the labels clearly, it also returns a lot of meaningless and irrelevant text. If we can find a way to separate the drawings from the labels, we can obtain a better result. One way is to build our own algorithm on top of these tools to improve the results.


The Google API gives almost perfect results. All the figure labels are correctly identified, and the outputs contain the least noise of all the tools. Compared with Amazon OCR, the Google API produces noise only for abstract drawings (figure #3 and figure #4), which are challenging because they include both words and numbers. In such cases, even though the output may contain some irrelevant words, they are all separated by spaces, so it is easy to parse the labels out. For example, you may see output like this: “figure. 5A 310 320 382 380 14 343 330 341 - 340. 360.”, from which “figure. 5A” can be parsed out using regular expressions.
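The regular-expression step can be sketched as follows. The pattern is an assumption about what labels look like: the word "fig", "figs", "figure", or "figures" in any letter case, an optional period, then a number with an optional single-letter suffix. It is not taken from the original experiments, and `extract_labels` is a hypothetical helper name.

```python
import re

# Assumed label shape: "fig"/"figure" (optionally plural), optional period,
# then digits with at most one trailing letter, e.g. "FIG. 5A".
LABEL_PATTERN = re.compile(
    r"\bfig(?:ure)?s?\s*\.?\s*(\d+[A-Za-z]?)\b",
    re.IGNORECASE,
)


def extract_labels(ocr_text):
    """Return every figure label found in the OCR output, written as
    "FIG. <id>". Reference numerals such as "310" are ignored because
    they are not preceded by the word "fig" or "figure"."""
    return [f"FIG. {m.upper()}" for m in LABEL_PATTERN.findall(ocr_text)]


sample = "figure. 5A 310 320 382 380 14 343 330 341 - 340. 360."
print(extract_labels(sample))  # prints ['FIG. 5A']
```

Because the pattern requires the "fig" prefix, the space-separated reference numerals in the output do not need to be removed beforehand.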


Below is a comparison of the pros and cons of the tools:




In conclusion, Google gives the best performance among all these tools. If you have a sufficient budget, it is the tool to choose. However, it is much more complicated to set up than Abbyy and Amazon, so if you want something easier to use, Amazon would be a good choice. If you have a limited budget and need an open-source tool, Tesseract is a good choice, and you can build your own algorithm on top of it.



