Table 1 Soper et al.: Classification of common OCR error types, their descriptions, and representative examples. These categories highlight challenges in text recognition systems, such as over-segmentation of words into multiple parts, merging of separate words (under-segmentation), character misinterpretations, omissions, and the generation of spurious or nonsensical content.
Table 2 Soper et al.: Detailed examples of OCR error types in text recognition. The table illustrates five categories of errors (over-segmentation, under-segmentation, misrecognized characters, missing characters, and hallucination) along with the source text, the OCR prediction, and the intended target text. These examples emphasize the nuances of text recognition challenges and their impact on maintaining textual integrity in automated systems.
Exploring Approaches to Spelling Autocorrection and OCR Post-Correction
I began with foundational methods for spelling correction, starting with Peter Norvig's seminal article "How to Write a Spelling Corrector". It is an excellent introduction to spelling autocorrection: a short Python implementation combines a probabilistic language model (derived from word frequencies in a corpus) with an error model (based on edit distance) to predict the most likely correction for a misspelled word. With an accuracy of approximately 68–75% on the article's test sets, the approach balances simplicity and effectiveness. The article also outlines potential enhancements, such as refining the error model, incorporating context-sensitive corrections, and improving dictionary robustness. Its Further Reading section lists many resources, forming a solid foundation for deeper exploration of the field.
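To make the idea concrete, here is a minimal sketch in the spirit of Norvig's corrector, not a copy of his code. The tiny inline corpus and the function names (`probability`, `edits1`, `edits2`, `candidates`, `correction`) are my own choices for illustration; a real corrector would estimate word frequencies from a much larger corpus. The error model here is the simplest possible one: known words at edit distance 1 are preferred over those at distance 2, and ties are broken by word frequency.

```python
import re
from collections import Counter

# Toy corpus standing in for a large text collection (an assumption for illustration).
CORPUS = """
the quick brown fox jumps over the lazy dog
spelling correction is the task of fixing misspelled words
the corrector picks the most frequent known word near the input
"""

# Language model: unigram word counts from the corpus.
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
TOTAL = sum(WORDS.values())


def probability(word):
    """Relative frequency of `word` in the corpus (0 for unseen words)."""
    return WORDS[word] / TOTAL


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings two edits away from `word`."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(words):
    """The subset of `words` that appear in the corpus vocabulary."""
    return {w for w in words if w in WORDS}


def candidates(word):
    """Known words at the smallest edit distance (0, 1, then 2), else the word itself."""
    return known([word]) or known(edits1(word)) or known(edits2(word)) or {word}


def correction(word):
    """Most probable candidate under the unigram language model."""
    return max(candidates(word), key=probability)


if __name__ == "__main__":
    for typo in ["speling", "teh", "wrd"]:
        print(typo, "->", correction(typo))
```

With this toy corpus, "speling" resolves to "spelling" and "teh" to "the". A word with no known neighbor within two edits is returned unchanged, which is also how the sketch handles out-of-vocabulary input.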
Figure 1 Hládek et al.: The process of correction-candidate generation and error correction. The diagram illustrates the flow from an incorrect word to candidate proposals, ranking of candidates, error correction, and alignment with the intended word and context, emphasizing the interaction between error production and correction mechanisms.
Figure 2 ICDAR2019: Overview of the OCR post-correction challenge. The task involves identifying and correcting errors in OCR-processed text to align it with a gold standard, using historical documents as the dataset.
Figure 3 Whitelaw et al.: Overview of the spelling correction process and associated knowledge sources.
Figure 4 Rijhwani et al.: Examples of scanned documents in endangered languages accompanied by translations. (a) Ainu text with Japanese translation, (b) Griko text with Italian translation, (c) Yakkha text with translations in Nepali and English, and (d) handwritten Shangaji text with typed English glosses.
Table 3 Thomas et al.: Llama 2 13B Model Performance in Correcting Diverse OCR Error Types.
Table 4 Thomas et al.: Comparative Analysis of Model Performance Across OCR Error Types.
References:
- Peter Norvig’s spelling corrector: https://norvig.com/spell-correct.htm
- Hládek, D., Staš, J. and Pleva, M., 2020. Survey of automatic spelling correction. Electronics, 9(10), p.1670.
- Chiron, G., Doucet, A., Coustaty, M. and Moreux, J.P., 2017, November. ICDAR2017 competition on post-OCR text correction. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (Vol. 1, pp. 1423-1428). IEEE.
- Rigaud, C., Doucet, A., Coustaty, M. and Moreux, J.P., 2019, September. ICDAR 2019 competition on post-OCR text correction. In 2019 International Conference on Document Analysis and Recognition (ICDAR) (pp. 1588-1593). IEEE.
- Whitelaw, C., Hutchinson, B., Chung, G. and Ellis, G., 2009, August. Using the web for language independent spellchecking and autocorrection. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (pp. 890-899).
- Rijhwani, S., Anastasopoulos, A. and Neubig, G., 2020. OCR post correction for endangered language texts. arXiv preprint arXiv:2011.05402.
- Nguyen, T.T.H., Jatowt, A., Nguyen, N.V., Coustaty, M. and Doucet, A., 2020, August. Neural machine translation with BERT for post-OCR error detection and correction. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (pp. 333-336).
- Thomas, A., Gaizauskas, R. and Lu, H., 2024, May. Leveraging LLMs for Post-OCR Correction of Historical Newspapers. In Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024 (pp. 116-121).
- Soper, E., Fujimoto, S. and Yu, Y.Y., 2021, November. BART for post-correction of OCR newspaper text. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021) (pp. 284-290).