Figure 1: Traditional and Deep Learning Approaches to Table Recognition (Hashmi et al.)
Table recognition refers to the process of using optical character recognition (OCR) and machine learning (ML) models to identify the rows, columns, and individual text cells in tables in digital documents either born-digital or scan PDFs. The task of table recognition has been under investigation for more than two decades for automatically extracting textual information from a variety of tables [Kieninger et al., Wei et al.]. Automatic table recognition can be very challenging due to tables having different structures, data types, and misaligned data entries (Figure 2). For instance, some tables have text spanning multiple rows or columns. Also, some tables have clearly defined borders while some do not have any border (borderless) or are partially-bordered. These complexities make it difficult for template-based and ML-based approaches to extract tables from diverse PDFs. In addition, extracted tables and table data may sometimes not retain their original contextual and hierarchical structure, which requires users to manually correct table structure and content.
Figure 2: Example of a complex table structure
(source: https://doi.org/10.3390/app9235102)
Table extraction involves solving two problems: table detection (TD) and table structure recognition (TSR). Prior works involve solving the two problems independently using rule-based and classical machine learning methods such as conditional random fields (Wei et al.). Recent works use end-to-end deep learning-based solutions to solve the two problems together. Below, I will give a brief summary of a paper published by Devashish Prasad et al. on "CascadeTabNet: An Approach for end to end table detection and table structure recognition from image-based documents" in CVPR 2020 open access.
CascadeTabNet
Figure 3: CascadeTabNet Pipeline (Prasad et al.)Structure Recognition for Borderless Tables
Borderless tables are tables with partial or no ruling-based lines (Prasad et al.). The CascadeTabNet predict the segmentations of table cells for borderless tables. When a table is classified by the model as borderless, the cell positions are automatically labeled using the predicted row and column ids. The authors use the position of identified rows and columns in combination with contour-based text detection algorithm (Liu et al.) to estimate the missing table lines (row and column borders). Based on the estimated table lines, the authors again used the contour-based text detection algorithm to detect the cells of previously undetected cells.
Structure Recognition for Bordered Tables
Dataset Preparation
The authors created a new dataset for the task of TD by merging three datasets of ICDAR 19, Marmot, and GitHub. The general dataset contains a total of 1934 document images containing 2835 tables. To create a more robust model, the authors implemented image-augmentation techniques on the original training data. The augmentation techniques implemented are Dilation and Smudge transform. Before implementing the dilation and smudge operations, the original image was first converted to a binary image. In the dilation operation, the image was transformed to thicken the black regions, while in the smudge operation, the original image was transformed to spread the black pixel regions and make it look like a kind of smeary blurred black pixel region (see Figure 5).
Figure 5: An example of applying dilation and smudge transformation on a table image (Prasad et al.)In order to analyze the effectiveness of the image-augmentation process, the authors created four training datasets to train different baseline models. The first set contains the original images. The second set contains both the original and dilated images. The third set contains both the original and smudged images. The last set contains the original, dilated, and smudged images.
For the task of TSR, the authors manually annotated at random 342 out of 600 images of the ICDAR 19 (Track A Modern) train set. The data contains 114 bordered and 429 borderless tables.