Channel: Web Science and Digital Libraries Research Group

2024-04-04: Paper Summary: DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding

          Figure 1. An example from DeepPatent2 extraction results for US Design Patent #0836880 (Figure 1 in DeepPatent2)


Introduction

Technical illustrations and drawings, unlike natural images, are designed for straightforward human comprehension, focusing on details such as strokes, lines, and shading rather than colors and gradients (Carney and Levin, 2002) [2]. While computer vision has extensively studied natural images for object recognition and scene understanding, technical drawings, which are prevalent in design patents, remain less explored. Because they lack the rich color and gradient information of natural images, these drawings are distinct from the typical inputs to computer vision models (DeepPatent) [6].


Recent developments in sketch-based datasets like QuickDraw [3], TU-Berlin [4], and ImageNet-Sketch [5] have provided valuable resources for sketch-based image retrieval. In this blog, we summarize our paper, "DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding", published in Scientific Data, a Springer Nature journal. The paper introduces a new dataset, DeepPatent2, which includes technical drawings with semantic information: object names, viewpoints, and segmented figures. This makes it a more comprehensive resource for multimodal technical drawing retrieval and analysis than existing datasets such as DeepPatent [6]. DeepPatent2 consists of over 2.7 million technical drawings annotated with 132,890 object names and 22,394 viewpoints, extracted from more than 300,000 US design patents granted from 2007 to 2020.


        Figure 2: Average number of figures and subfigures per patent each year from 2007 to 2020 (Figure 3 in DeepPatent2)


Methods: Building the DeepPatent2 Dataset

The dataset creation process consists of three main steps: data acquisition, text processing, and image processing. The source data are design patent documents obtained from the United States Patent and Trademark Office (USPTO), comprising XML files and TIFF image files. The text processing step involves extracting design categories and human-readable object names from figure captions. We used regular expressions to identify figure references within the XML files, allowing the extraction of complete sentences that mention figures. Each figure's caption is then used to derive object names and viewpoints, adding rich semantic information to the dataset while ensuring correct mapping of the captions to individual or compound figures in the TIFF files. The image processing step focuses on segmenting compound figures and identifying figure labels in patent documents, a necessary step since individual figures are not directly linked to their captions in these documents. This stage involves four key steps: detecting figure labels (both text and position), segmenting compound figures into individual ones, associating these labels with the respective individual figures, and finally aligning the metadata with each figure. The final dataset includes metadata, descriptions, and images in JSON files, totaling about 380 GB before compression.
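As a rough illustration of the caption-extraction step, the sketch below pulls figure labels and caption sentences out of a description paragraph with a regular expression. The pattern and the sample sentence are illustrative assumptions, not the exact expressions or XML handling used in our pipeline.

```python
import re

# Illustrative pattern for sentences of the form "FIG. 1 is a ... view of ...";
# not the exact expression used in the DeepPatent2 pipeline.
FIG_PATTERN = re.compile(r"FIG\.?\s*(\d+[A-Za-z]?)\s+is\s+(.+?)(?:;|\.$|\.\s)", re.IGNORECASE)

def extract_captions(description_text):
    """Return {figure_label: caption_sentence} for sentences that reference a figure."""
    captions = {}
    for match in FIG_PATTERN.finditer(description_text):
        label, caption = match.group(1), match.group(2).strip()
        captions[label] = caption
    return captions

sample = ("FIG. 1 is a front perspective view of a lamp showing my new design; "
          "FIG. 2 is a rear elevational view thereof.")
print(extract_captions(sample))
# {'1': 'a front perspective view of a lamp showing my new design',
#  '2': 'a rear elevational view thereof'}
```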

             Figure 3. The architecture of the pipeline to create DeepPatent2 dataset (Figure 2 in DeepPatent2)

Figure Label Detection

This step involves converting the TIFF images to PNG format for compatibility with OCR tools. We use AWS Rekognition's DetectText tool to recognize the text in figure labels because of its high precision. We assessed extraction quality by comparing several OCR engines on 100 randomly sampled design patent figures; AWS Rekognition achieved an F1 score of 96.8, indicating its effectiveness for accurate label recognition.
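The snippet below is a minimal sketch of this step, using Pillow for the TIFF-to-PNG conversion and the boto3 client for Rekognition's DetectText. It assumes AWS credentials are already configured; the FIG-prefix filter is a simplification of the label filtering done in the actual pipeline.

```python
import io
import boto3
from PIL import Image

def detect_figure_labels(tiff_path):
    """Convert a patent TIFF page to PNG and return likely figure labels with boxes."""
    # Rekognition accepts PNG/JPEG bytes, so convert the TIFF in memory first.
    png_buffer = io.BytesIO()
    Image.open(tiff_path).convert("RGB").save(png_buffer, format="PNG")

    client = boto3.client("rekognition")
    response = client.detect_text(Image={"Bytes": png_buffer.getvalue()})

    labels = []
    for det in response["TextDetections"]:
        # Keep line-level detections that look like figure labels, e.g. "FIG. 3"
        if det["Type"] == "LINE" and det["DetectedText"].upper().startswith("FIG"):
            box = det["Geometry"]["BoundingBox"]  # relative Left/Top/Width/Height
            labels.append((det["DetectedText"], box))
    return labels
```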


Compound Figure Segmentation

To segment the compound figures in the collected US design patents, which often contain multiple individual figures, we employ a transfer learning approach using MedT [7], a transformer-based neural model proposed by Valanarasu et al. (2021). The model, initially trained on medical images, was fine-tuned for patent figure segmentation, and its performance was benchmarked against other leading image segmentation models. Segmented figures were then matched with labels through a proximity-based method. Further details of this methodology are given by Hoque et al. (2022) [8].
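The MedT implementation by Valanarasu et al. ships with its own training code, so the snippet below is only a generic PyTorch transfer-learning sketch, not the actual setup used for DeepPatent2: `pretrained_model` stands in for a MedT checkpoint, and `train_dataset` is assumed to yield (image, binary mask) tensor pairs for compound patent figures.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def finetune_segmenter(pretrained_model, train_dataset, epochs=10, lr=1e-4):
    """Fine-tune a pretrained segmentation model on patent-figure masks (sketch)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = pretrained_model.to(device)
    loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()  # binary figure-vs-background masks

    model.train()
    for _ in range(epochs):
        for images, masks in loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
    return model
```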


Label Association

This step associates the OCR-generated labels with the segmented figures. We treat it as a bipartite matching problem and adopt a heuristic: each label is matched to the nearest segmented figure by the Euclidean distance between their geometric centers. This straightforward method achieved 97% accuracy on a test set of 200 figures. The primary sources of error were OCR and segmentation inaccuracies, as well as dense and irregular label arrangements. The output is a set of segmented figures, each with its own label.
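A simplified sketch of this heuristic appears below: each OCR'd label is assigned to the segmented figure whose bounding-box center is closest in Euclidean distance. The (x, y, width, height) box format and the greedy, rather than globally optimal, matching are assumptions for illustration.

```python
import math

def center(box):
    """Geometric center of a box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def associate_labels(label_boxes, figure_boxes):
    """Match each OCR'd label to the nearest segmented figure by center distance.

    `label_boxes` maps a label string (e.g. "FIG. 2") to its bounding box;
    `figure_boxes` maps a figure id to its bounding box.
    """
    assignments = {}
    for label, lbox in label_boxes.items():
        lcenter = center(lbox)
        nearest = min(
            figure_boxes,
            key=lambda fid: math.dist(lcenter, center(figure_boxes[fid])),
        )
        assignments[label] = nearest
    return assignments
```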

Metadata Alignment

In this phase, the figures labeled in the previous step are aligned with captions from the XML files by matching the figure numbers extracted from the XML with those assigned to individual figures. Each segmented figure in the final dataset includes metadata such as the label, caption, bounding box coordinates, and document-level details like patent ID and title. Because this step relies on exact integer matching of figure numbers, any errors in the dataset are typically due to inaccuracies from earlier stages of the pipeline.
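The sketch below illustrates the idea of this alignment: captions keyed by figure number are joined to segmented figures via the integer in each figure label and combined with patent-level metadata. The field names are illustrative and do not reproduce the exact DeepPatent2 JSON schema.

```python
import re

def figure_number(label):
    """Pull the integer figure number out of a label such as "FIG. 3" or "Fig 3A"."""
    match = re.search(r"(\d+)", label)
    return int(match.group(1)) if match else None

def align_metadata(patent_meta, captions, segmented_figures):
    """Attach caption text and patent-level metadata to each segmented figure.

    `captions` maps integer figure numbers (from the XML) to caption strings;
    `segmented_figures` is a list of dicts with "label" and "bbox" keys.
    """
    records = []
    for fig in segmented_figures:
        num = figure_number(fig["label"])
        records.append({
            "patent_id": patent_meta["patent_id"],
            "title": patent_meta["title"],
            "figure_label": fig["label"],
            "caption": captions.get(num),
            "bbox": fig["bbox"],
        })
    return records
```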


Data Quality Evaluation

To gauge the overall error rate of the DeepPatent2 dataset, we estimate the label association (LA) error from mismatch errors at an average of 7.5%, which corresponds to an LA precision of 92.5%. We similarly estimate the text preprocessing (TP) error at an average of 4.0%, corresponding to a TP precision of 96.0%. The overall error rate is then calculated as Error = 1 − P(LA) × P(TP), giving an estimated overall error rate of about 11.2%.
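Plugging the two precisions into the formula makes the estimate concrete:

```python
# Worked example of the overall error estimate: Error = 1 - P(LA) * P(TP)
p_la, p_tp = 0.925, 0.960
error = 1 - p_la * p_tp
print(round(error, 3))  # 0.112, i.e., about 11.2%
```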


To validate the estimated error rates, we randomly select 100 compound figures from each year (2007 to 2020) in DeepPatent2, for a total of 1,400 compound figures. We manually inspect the final data product to assess the correctness of our figure segmentation and the accuracy of label, object name, and viewpoint extraction. We derive the error rate by dividing the number of figures containing any errors (404) by the total number of figures (3,464), giving roughly 11.7%, which aligns with the estimate derived from the formula above.


Application of DeepPatent2

We demonstrate the usefulness of DeepPatent2 through a conceptual captioning task: generating a short textual description of an image, in this case a technical drawing. This study uses a ResNet-152 model, initially trained on ImageNet and then fine-tuned on our technical drawings dataset. We vary the training set size from 500 to 63,000 images and evaluate the model's performance on a separate set of 600 images using standard metrics such as METEOR, Translation Error Rate (TER), ROUGE, and NIST. Our results show improved performance with larger training sets, indicating that increasing the amount of automatically tagged training data improves the performance of image captioning models.
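As a rough sketch of how such a model can be set up, the code below wires an ImageNet-pretrained ResNet-152 from torchvision into a small captioning network. The GRU decoder, embedding sizes, and class name are illustrative placeholders, not the captioning architecture reported in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class DrawingCaptioner(nn.Module):
    """ResNet-152 image encoder feeding a placeholder GRU caption decoder."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop fc head
        self.project = nn.Linear(backbone.fc.in_features, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)      # (B, 2048) pooled features
        feats = self.project(feats).unsqueeze(1)     # (B, 1, embed_dim)
        tokens = self.embed(captions)                # (B, T, embed_dim)
        inputs = torch.cat([feats, tokens], dim=1)   # prepend image feature
        hidden, _ = self.decoder(inputs)
        return self.out(hidden)                      # (B, T+1, vocab_size) logits
```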

 

         Table 2. A comparison of image captioning models with different training sizes. (Table 7 in DeepPatent2)

Conclusion

This study introduces DeepPatent2, a dataset comprising over 2.7 million technical drawings from US design patents. We demonstrated its utility in advancing tasks such as conceptual captioning using state-of-the-art models like ResNet-152, and showed that significant performance improvements can be obtained in conceptual captioning models with increased training data. DeepPatent2, with its rich semantic information and diverse viewpoints, opens new research avenues in technical drawing understanding and beyond. It holds promise for building robust neural models and innovative generative design models, marking a significant contribution to machine learning and computer vision.

 

References

1. Ajayi, K., Wei, X., Gryder, M. et al. "DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding". Scientific Data 10, 772 (2023).
2. Russell N. Carney and Joel R. Levin. "Pictorial illustrations still improve students' learning from text". Educational Psychology Review, 14:5–26, 2002.
3. Google. The Quick, Draw! Dataset. 2020.
4. Mathias Eitz, James Hays, and Marc Alexa. "How do humans sketch objects?" ACM Transactions on Graphics, 31(4), 2012.
5. Haohan Wang, Songwei Ge, Zachary C. Lipton, and Eric P. Xing. "Learning robust global representations by penalizing local predictive power". In Advances in Neural Information Processing Systems, 2019.
6. Michal Kucer, Diane Oyen, Juan Castorena, and Jian Wu. "DeepPatent: Large scale patent drawing recognition and retrieval". In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2309–2318, 2022.
7. Valanarasu, J. M. J., Oza, P., Hacihaliloglu, I., and Patel, V. M. "Medical Transformer: Gated Axial-Attention for Medical Image Segmentation". In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, pages 36–46. Cham: Springer International Publishing. ISBN 978-3-030-87193-2.
8. Md Reshad Ul Hoque, Xin Wei, Muntabir Hasan Choudhury, Kehinde Ajayi, Martin Gryder, Jian Wu, and Diane Oyen. "Segmenting technical drawing figures in US patents". In Proceedings of the Workshop on Scientific Document Understanding co-located with the 36th AAAI Conference on Artificial Intelligence (SDU@AAAI 2022), Virtual Event, March 1, 2022, volume 3164 of CEUR Workshop Proceedings. CEUR-WS.org, 2022.


Kenny Ajayi (@KennyAJ)
