Figure 1. An illustration of the differences between aleatoric and epistemic uncertainties (Yang et al., 2023).
Introduction
Table Structure Recognition (TSR) is a document analysis task that focuses on identifying the rows and columns in digital table images [4]. While current TSR methods can identify cell locations, they cannot predict the uncertainties in their results [1]. This limitation has hindered real-world applications of TSR, such as automatically extracting data from table images in the physical sciences.
In this blog post, we summarize our paper titled "Uncertainty Quantification (UQ) for Table Structure Recognition", presented at the 2024 IEEE International Conference on Information Reuse and Integration for Data Science. In this paper, we proposed a method called TTA-m (Test-Time Augmentation with multiple models) that aims to quantify uncertainties in TSR predictions, potentially enhancing how we extract and verify tabular data from digital documents.
Figure 2. A schematic illustration of the proposed UQ pipeline (TTA-m) (Figure 1 in our paper)
Dataset
We utilized the ICDAR 2019 dataset, which consists of both modern and historical tables:
Modern dataset: Includes tables from scientific papers, forms, and financial documents.
Historical dataset: Comprises hand-written accounting ledgers and train schedules.
Following the approach of Prasad et al. [2], we selected 543 table images from the ICDAR 2019 dataset. We randomly selected 443 table images for training our models and the remaining 100 table images for evaluation. This dataset provides diverse table structures and complexities, allowing for a robust evaluation of the proposed UQ method.
Method
The key components of our proposed UQ pipeline consist of the following:
Data Augmentation
Data augmentation has become a common practice for developing robust, transformation-invariant models. We implemented a set of M = 4 distinct data augmentation techniques in both the training and testing stages: the elimination of all lines (NLT), the addition of horizontal lines (HLT), the addition of vertical lines (VLT), and the addition of both horizontal and vertical lines (HLT + VLT). Figure 3 presents examples of these augmented table images.
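The line-based augmentations can be illustrated with a minimal NumPy sketch. This is a simplified illustration, not the paper's actual implementation: the function names and the toy image are hypothetical, and the real pipeline would operate on actual table scans.

```python
import numpy as np

def add_horizontal_lines(img: np.ndarray, rows, thickness=1, value=0):
    """HLT-style augmentation: draw dark horizontal rules at the given row positions."""
    out = img.copy()
    for r in rows:
        out[r:r + thickness, :] = value
    return out

def add_vertical_lines(img: np.ndarray, cols, thickness=1, value=0):
    """VLT-style augmentation: draw dark vertical rules at the given column positions."""
    out = img.copy()
    for c in cols:
        out[:, c:c + thickness] = value
    return out

# Toy grayscale "table image": a white background with no ruling lines
table = np.full((100, 200), 255, dtype=np.uint8)

hlt = add_horizontal_lines(table, rows=[25, 50, 75])       # HLT
hlt_vlt = add_vertical_lines(hlt, cols=[50, 100, 150])     # HLT + VLT
```

NLT (eliminating all lines) would go in the opposite direction, detecting existing rules and painting them with the background color.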
Figure 3. Augmentation examples of a table image (Figure 2 in our paper)

Model Training
We fine-tuned CascadeTabNet, a TSR model originally proposed by Prasad et al. [2], on both original and augmented table images.
Test-Time Augmentation (TTA)
Our baseline model is the vanilla TTA. TTA is an ensemble method that applies various data augmentations to the input during inference, generating multiple predictions that are then combined to produce a more robust final output. In this paper, we modified the vanilla TTA: during inference, each fine-tuned model was applied to test data augmented with the same technique used during its fine-tuning. We then ensembled the outputs of these multiple models.
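The pairing of each fine-tuned model with its matching test-time augmentation can be sketched as follows. This is a hypothetical illustration: the stand-in "models" and "augmentations" below are toy lambdas, not CascadeTabNet.

```python
def tta_m_predict(models, augmentations, image):
    """TTA-m inference sketch: the i-th fine-tuned model predicts on the input
    transformed by the i-th augmentation (the same one used in its fine-tuning).
    Returns one list of predicted cell boxes per model, ready for ensembling."""
    return [model(augment(image)) for model, augment in zip(models, augmentations)]

# Toy stand-ins (hypothetical): each "model" returns fixed boxes, and each
# "augmentation" here is just an identity transform over the raw image.
models = [lambda img: [(0, 0, 10, 10)], lambda img: [(1, 1, 11, 11)]]
augmentations = [lambda img: img, lambda img: img]
predictions = tta_m_predict(models, augmentations, image="raw-table-image")
```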
Confidence Estimation via Ensembles
The process of uncertainty estimation involves combining predictions from multiple models to generate a set of merged cells, each with an associated confidence score. Here’s a more detailed summary:
Initial Setup: We randomly selected a model from a set of M+1 models as the base model.
Bounding Box Collection: We gathered all the predicted bounding boxes from the base model.
Comparison and Merging: We took the predicted bounding boxes from the second model and compared them with the predicted bounding boxes by the base model using the Intersection over Union (IoU) metric. If the IoU between a pair of cells from the base and second model meets or exceeds a predefined threshold θ > 0, we merge the two cells into one. We removed the merged cell from the second model’s list.
Iterative Merging: We repeated the comparison and merging process for the remaining models (from the third to the M+1 models), always comparing with the base model’s cells.
Sequential Model Use: After processing with the initial base model, we sequentially selected the other models as new base models and repeated the merging process for any cells that had not yet been merged.
Confidence Score Calculation: For each merged cell combination created during this process, we counted how many distinct models contributed to that combination and calculated the confidence score by dividing this count by M+1.
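The steps above can be sketched in a few lines of Python. This is a minimal sketch of the merging logic, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the function names are hypothetical and details such as tie-breaking may differ from the paper's implementation.

```python
def iou(a, b):
    """Intersection over Union for two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def ensemble_confidence(model_boxes, theta=0.5):
    """Merge predicted cells across M+1 models; the confidence of each merged
    cell is (#contributing models) / (M+1)."""
    n_models = len(model_boxes)
    used = [[False] * len(boxes) for boxes in model_boxes]
    merged = []
    for base in range(n_models):                      # sequential base models
        for i, base_box in enumerate(model_boxes[base]):
            if used[base][i]:                         # already merged earlier
                continue
            used[base][i] = True
            count = 1
            for other in range(n_models):             # compare with the rest
                if other == base:
                    continue
                for j, box in enumerate(model_boxes[other]):
                    if not used[other][j] and iou(base_box, box) >= theta:
                        used[other][j] = True         # remove merged cell
                        count += 1
                        break
            merged.append((base_box, count / n_models))
    return merged

# Three toy models: two agree on one cell; the third also predicts an extra cell.
boxes = [[(0, 0, 10, 10)], [(0, 2, 10, 12)], [(0, 0, 10, 10), (20, 20, 30, 30)]]
merged = ensemble_confidence(boxes, theta=0.5)
```

In this toy example the shared cell is matched by all three models (confidence 1.0), while the extra cell is matched by only one model (confidence 1/3), which is exactly the behavior the confidence score is designed to capture.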
Baseline Methods
We compared our proposed UQ pipeline with the vanilla TTA and other variants such as TTA-t and TTA-tm as shown in Figure 3. TTA-t adds a small cell filter to the vanilla TTA to exclude small cells predicted by fine-tuned models. TTA-tm combines TTA-t and TTA-m, incorporating training data augmentation and the small cell filter.
We also compared our work with an active learning model proposed by Choi et al. [3]. Their model aims to reduce labeling costs by selecting only the most informative samples in a dataset. It uses a mixture density network that estimates a probabilistic distribution for each localization and classification head’s output to explicitly assess the aleatoric and epistemic uncertainty in a single forward pass of a single model. It uses a scoring function that aggregates these uncertainties for both heads to obtain every image’s informativeness score.
Evaluation
Because the test set lacks annotated ground-truth table cells, we proposed two novel evaluation techniques to assess the effectiveness of our UQ pipeline: a masking method and a cell complexity quantification method.
Masking Method
This involves modifying the difficulty of table image recognition by adjusting pixel intensity. Specifically, we doubled and tripled the pixel intensities (capped at 255) of table images in our training set. Then, we evaluated the confidence scores of the TSR model at each intensity level.
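The intensity scaling can be expressed as a one-line NumPy operation. This is a minimal sketch; the function name and the toy pixel row are hypothetical.

```python
import numpy as np

def mask_intensity(img: np.ndarray, factor: int) -> np.ndarray:
    """Scale pixel values by `factor`, capped at 255 (m2: factor=2, m3: factor=3).
    Larger factors wash dark cell boundaries toward white, making them fainter."""
    return np.minimum(img.astype(np.int32) * factor, 255).astype(np.uint8)

# Toy row of pixels: a mid-gray boundary, a light-gray region, and white background
pixels = np.array([[100, 200, 255]], dtype=np.uint8)
m2 = mask_intensity(pixels, 2)  # doubled, capped at 255
m3 = mask_intensity(pixels, 3)  # tripled, capped at 255
```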
Cell Complexity Quantification
We modeled the table images as undirected graphs, with cells as nodes and adjacencies as edges. We considered four types of adjacency: left, top, right, and bottom. We then manually annotated the relations between cells to construct a graph for each table in the test set.

Evaluation Results
Performance Comparison with Baseline Methods
We compared the TSR results of our proposed model, TTA-m, with the TTA variants and the active learning method. Based on the results in Table 1, TTA-m outperformed the baseline methods, and the combined TTA-tm approach showed further improvements.
Table 1. Comparing TSR results of models used in our study (Table 1 in our paper)
Masking Method Results
Cell pixel intensity significantly influenced the distribution of confidence scores in our TSR model. As masking scaled pixel values upward, cell boundaries were washed toward white and became fainter, making accurate detection more difficult. This should lead to higher levels of uncertainty from the TSR model, as shown in Figure 4.
Figure 4. Effects of masking on UQ in TSR. m1: no masking applied; m2: pixel values doubled; m3: pixel values tripled (Figure 7 in our paper)
Cell Complexity Quantification Results
Based on Table 2, the mean confidence level decreased as the degree of relationships between cells increased from 1 to 6, with an exception at degree 5. This suggests that the model's confidence is inversely related to the complexity of cell relationships in a table.
Table 2. Quantifying cell complexity based on the adjacency degree of table cells (Table 2 in our paper)
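The adjacency-degree computation behind Table 2 can be sketched as a small graph routine. This is a minimal illustration; the cell names and adjacency pairs below are hypothetical.

```python
from collections import defaultdict

def cell_degrees(adjacencies):
    """Build an undirected cell graph from (cell_a, cell_b) adjacency pairs
    (left/top/right/bottom relations) and return each cell's degree."""
    neighbors = defaultdict(set)
    for a, b in adjacencies:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {cell: len(ns) for cell, ns in neighbors.items()}

# Toy 2x2 table (cells A B / C D) with left-right and top-bottom adjacencies
edges = [("A", "B"), ("C", "D"), ("A", "C"), ("B", "D")]
degrees = cell_degrees(edges)
```

Mean confidence can then be reported per degree bucket, as in Table 2, by grouping each cell's confidence score under its degree.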
Conclusion
This paper investigated uncertainty quantification (UQ) in table structure recognition (TSR) by adapting the traditional test-time augmentation (TTA) technique and applying it to a customized CascadeTabNet [2] model. To assess the effectiveness of our UQ method, we employed masking and cell complexity quantification techniques. These techniques adjust cell pixel intensity and determine cell complexity based on relationships among cells in table images at different confidence levels. Our experiments demonstrated that the proposed UQ method offers more reliable uncertainty estimation compared to the standard TTA approach.
Unlike the vanilla TTA, which only considers data variation, our method extends data augmentation to the training phase, factoring in both data and model variations. While this increases the computational cost, it provides a more robust uncertainty estimation for TSR models. When applying our pipeline to datasets without ground truth labels, pre-fine-tuned models can be used, requiring only test-time augmentation.

Acknowledgment
I would like to acknowledge Leizhen Zhang for running key experiments crucial to our published paper, and to express my gratitude to Dr. Yi He for co-advising me, alongside my advisor Dr. Jian Wu, throughout this work.
References

[1] K. Ajayi, L. Zhang, Y. He, and J. Wu, "Uncertainty Quantification in Table Structure Recognition," 2024 IEEE International Conference on Information Reuse and Integration for Data Science, 2024.

[2] D. Prasad, A. Gadpal, K. Kapadni, M. Visave, and K. Sultanpure, "CascadeTabNet: An Approach for End to End Table Detection and Structure Recognition from Image-Based Documents," IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020.

[3] J. Choi, H. Yoo, S. Lee, S. Yoon, and H. Yang, "Uncertainty-Aware Attention Gate U-Net for Active Learning in Medical Image Segmentation," IEEE Access, vol. 9, pp. 36170–36181, 2021.

[4] P. Fischer, A. Smajic, G. Abrami, and A. Mehler, "Multi-Type-TD-TSR: Extracting Tables from Document Images Using a Multi-Stage Pipeline for Table Detection and Table Structure Recognition: From OCR to Structured Table Representations," German Conference on Artificial Intelligence (Künstliche Intelligenz), pp. 95–108, Springer, 2021.

[5] C.-I. Yang and Y.-P. Li, "Explainable Uncertainty Quantifications for Deep Learning-Based Molecular Property Prediction," Journal of Cheminformatics, 2023.