Introduction
Technical illustrations and drawings, unlike natural images, are designed for straightforward human comprehension, conveying information through strokes, lines, and shading rather than colors and gradients (Carney and Levin, 2002) [2]. While computer vision has extensively studied natural images for object recognition and contextual understanding, technical drawings, which are prevalent in design patents, remain far less explored. Because these drawings lack the rich color and gradient information found in natural images, they call for treatment distinct from typical natural-image pipelines (DeepPatent) [6].
Recent sketch-based datasets such as QuickDraw [3], TU-Berlin [4], and ImageNet-Sketch [5] have provided valuable resources for sketch-based image retrieval. In this blog, we summarize our paper, "DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding", published in Scientific Data, a Springer Nature journal. The paper introduces a new dataset, DeepPatent2, which includes technical drawings with semantic information, object names, viewpoints, and segmented figures, offering a more comprehensive resource for multimodal technical drawing retrieval and analysis than existing datasets such as DeepPatent [6]. DeepPatent2 consists of over 2.7 million technical drawings with 132,890 object names and 22,394 viewpoints, extracted from more than 300,000 US design patents granted from 2007 to 2020.
Methods: Building the DeepPatent2 Dataset
Figure Label Detection
This step converts the TIFF images to PNG format for compatibility with OCR tools. We use AWS Rekognition's DetectText tool to recognize the text in figure labels because of its high precision. To assess extraction quality, we compared several OCR engines on 100 randomly sampled design patent figures; AWS Rekognition achieved an F1 score of 96.8, indicating its effectiveness for accurate label recognition.
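To make this step concrete, the sketch below converts one page to PNG and calls Rekognition's DetectText through boto3. It assumes AWS credentials are already configured; the file path, confidence threshold, and word-level filtering are illustrative assumptions rather than the exact pipeline settings.

```python
# Minimal sketch: convert a patent TIFF page to PNG in memory and run
# AWS Rekognition DetectText on it. Paths and thresholds are hypothetical.
import io

import boto3
from PIL import Image

def detect_figure_text(tiff_path: str, min_confidence: float = 90.0):
    # Convert the TIFF page to PNG bytes so Rekognition can consume it.
    png_bytes = io.BytesIO()
    Image.open(tiff_path).convert("RGB").save(png_bytes, format="PNG")

    client = boto3.client("rekognition")
    response = client.detect_text(Image={"Bytes": png_bytes.getvalue()})

    # Keep high-confidence word-level detections; a later step would filter
    # these for figure-label patterns such as "FIG. 3".
    return [
        d["DetectedText"]
        for d in response["TextDetections"]
        if d["Type"] == "WORD" and d["Confidence"] >= min_confidence
    ]
```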
Compound Figure Segmentation
To segment the compound figures in the collected US design patents, which often contain multiple individual figures on a single page, we employ a transfer learning approach using MedT [7], a transformer-based neural model proposed by Valanarasu and Oza (2021). The model, initially trained on medical images, is fine-tuned for patent figure segmentation, and its performance is benchmarked against other leading image segmentation models. Further details of this methodology are elaborated in Hoque et al. (2022) [8].
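The sketch below illustrates only the transfer-learning pattern; a torchvision FCN-ResNet50 stands in for MedT purely for illustration, with a two-class head (background vs. figure region). It is not the authors' implementation.

```python
# Generic transfer-learning sketch for figure segmentation.
# FCN-ResNet50 is a stand-in for the MedT model actually used in the paper.
import torch.nn as nn
from torchvision.models.segmentation import fcn_resnet50, FCN_ResNet50_Weights

def build_finetune_model(num_classes: int = 2) -> nn.Module:
    model = fcn_resnet50(weights=FCN_ResNet50_Weights.DEFAULT)
    # Replace only the output layer so the pretrained backbone is reused
    # while the head matches the patent-figure segmentation task.
    model.classifier[4] = nn.Conv2d(512, num_classes, kernel_size=1)
    return model

def finetune_step(model, images, masks, optimizer):
    """One training step: images (B,3,H,W) floats, masks (B,H,W) class indices."""
    model.train()
    optimizer.zero_grad()
    logits = model(images)["out"]                     # (B, num_classes, H, W)
    loss = nn.functional.cross_entropy(logits, masks)
    loss.backward()
    optimizer.step()
    return loss.item()
```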
Label Association
Segmented figures are then matched with their detected labels through a proximity-based method, as sketched below.
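A hypothetical sketch of such proximity-based matching follows: each detected label is assigned to the segmented figure whose bounding box center is closest. The box format and helper names are assumptions, not the exact procedure of Hoque et al. (2022) [8].

```python
# Assign each detected "FIG." label to the nearest segmented figure.
# Boxes are (x_min, y_min, x_max, y_max); all names here are illustrative.
from math import dist

def box_center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def associate_labels(figure_boxes, label_boxes):
    """Map each label index to the index of the closest segmented figure."""
    assignment = {}
    for li, lbox in enumerate(label_boxes):
        lcenter = box_center(lbox)
        distances = [dist(lcenter, box_center(fbox)) for fbox in figure_boxes]
        assignment[li] = min(range(len(distances)), key=distances.__getitem__)
    return assignment
```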
Metadata Alignment
In this phase, the figures labeled in the previous step are aligned with their captions from the patent XML files by matching the figure numbers in the XML with those assigned to the individual figures. Each segmented figure in the final dataset includes metadata such as the label, caption, bounding box coordinates, and document-level details like the patent ID and title. Because this method relies on exact integer matching of figure numbers, any errors in the dataset typically stem from inaccuracies in earlier stages of the process.
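As a rough illustration, the sketch below parses figure captions from a patent XML file and joins them to segmented figures by integer figure number. The XML tag names and record fields are simplifying assumptions; the actual USPTO full-text schema is more involved.

```python
# Hypothetical metadata-alignment sketch: caption lookup by figure number.
import re
import xml.etree.ElementTree as ET

def load_captions(xml_path):
    """Return {figure_number: caption_text} parsed from a patent XML file."""
    captions = {}
    for drawings in ET.parse(xml_path).getroot().iter("description-of-drawings"):
        for p in drawings.iter("p"):
            text = "".join(p.itertext()).strip()
            match = re.search(r"FIG\.?\s*(\d+)", text, flags=re.IGNORECASE)
            if match:
                captions[int(match.group(1))] = text
    return captions

def align(segmented_figures, captions):
    """Attach a caption to each figure record (e.g. {"label": "FIG. 3", ...})."""
    for fig in segmented_figures:
        number = re.search(r"\d+", fig["label"])
        fig["caption"] = captions.get(int(number.group())) if number else None
    return segmented_figures
```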
Data Quality Evaluation
To gauge the overall error rate of the DeepPatent2 dataset, we approximate the label association (LA) error from mismatch errors, averaging 7.5%, which corresponds to an LA precision of 92.5%. We also approximate the error from text preprocessing (TP), averaging 4.0%, which corresponds to a TP precision of 96.0%. The overall error rate is then calculated as Error = 1 − P_LA × P_TP = 1 − 0.925 × 0.960 ≈ 0.112, giving an estimated overall error rate of about 11.2%.
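The estimate can be reproduced directly from the two reported precisions:

```python
# Reproduce the overall error-rate estimate from the reported precisions.
p_label_association = 0.925   # 1 - 7.5% mismatch error
p_text_preprocessing = 0.960  # 1 - 4.0% text-preprocessing error

overall_error = 1 - p_label_association * p_text_preprocessing
print(f"Estimated overall error rate: {overall_error:.1%}")  # -> 11.2%
```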
To validate the estimated error rates, we randomly select 100 compound figures from each year (2007 to 2020) in the DeepPatent2 dataset, for a total of 1,400 compound figures. We manually inspect the final data product to assess the correctness of our figure segmentation process and the accuracy of label, object name, and viewpoint extraction. Dividing the number of segmented figures containing any error (404) by the total number of segmented figures (3,464) gives an observed error rate of about 11.7%, which aligns with the rate estimated from the formula above.
Application of DeepPatent2
We demonstrate the usefulness of the DeepPatent2 dataset through a conceptual captioning task, which involves generating a short textual description of an image, in this case a technical drawing. We use a ResNet-152 model pretrained on ImageNet and fine-tune it on our technical drawing dataset. We vary the training set size from 500 to 63,000 images and evaluate the model on a separate set of 600 images using standard metrics such as METEOR, Translation Error Rate (TER), ROUGE, and NIST. Performance improves with larger training sets, indicating that increasing the amount of automatically tagged training data improves image captioning models.
Table 2. A comparison of image captioning models with different training sizes. (Table 7 in DeepPatent2)
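For readers who want a starting point, the sketch below adapts an ImageNet-pretrained ResNet-152 as a figure encoder for a captioning pipeline. The projection dimension, preprocessing, and the idea of pairing it with a separate caption decoder are assumptions for illustration, not the authors' exact setup.

```python
# Hypothetical encoder sketch: ResNet-152 trunk reused for technical drawings.
import torch
import torch.nn as nn
from torchvision import models, transforms

class FigureEncoder(nn.Module):
    """Encode a technical drawing into a fixed-size feature vector."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        backbone = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
        # Drop the ImageNet classification head; keep the convolutional trunk.
        self.trunk = nn.Sequential(*list(backbone.children())[:-1])
        self.project = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.trunk(images).flatten(1)   # (batch, 2048)
        return self.project(feats)              # (batch, embed_dim), fed to a decoder

# Patent drawings are near-binary line art, so we replicate them to 3 channels
# before the usual ImageNet normalization (an assumption, not the paper's setting).
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```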
Conclusion
This study introduces DeepPatent2, a dataset comprising over 2.7 million technical drawings from US design patents. We demonstrated its utility in advancing tasks such as conceptual captioning using state-of-the-art models like ResNet-152, and showed that substantial performance gains can be obtained in the conceptual captioning model with increased training data. With its rich semantic information and diverse viewpoints, DeepPatent2 opens new research avenues in technical drawing understanding and beyond. It holds promise for building robust neural models and innovative generative design models, marking a significant contribution to the fields of machine learning and computer vision.