
2024-07-31: Improving Learning for All: How AI-Driven App Converts Scanned Documents to Readable Text for Low Vision Students


Presenting our extended abstract at the 17th MSVSCC 2024

At the 17th Annual Modeling, Simulation, and Visualization Student Capstone Conference 2024 (MSVSCC), I presented our extended abstract, AI-Driven App for Accessibility in Education: Converting Scanned Documents to Readable Text for Students with Low Vision. With its intuitively designed user interface, this app is crafted to convert scanned documents into a readable format that seamlessly integrates with users’ existing accessibility tools. The app’s workflow is powered by Nougat (Neural Optical Understanding for Academic Documents), an advanced Optical Character Recognition (OCR) model that converts scientific documents into Markdown. The repository with the code can be found here: https://github.com/bllin001/storymodelers-ocr-documents-art-app.
Our motivation to create this OCR application stemmed from a request by a professor in the World Languages & Cultures Department at Old Dominion University (ODU). This professor teaches students with low vision who struggle to access class materials that are primarily visual: even with screen reader software available, the screen readers often fail to interpret text from certain types of scanned documents frequently used in class. We saw an opportunity to make a significant difference by creating a tool that could leverage Artificial Intelligence (AI) to provide students with better access to class materials, thus addressing the gap in educational resources for students with vision impairment.

We can reduce visually impaired students’ challenges in accessing class materials by using an AI-based OCR model that significantly minimizes errors and improves text accuracy compared with screen reader software. By leveraging AI technology, this study enhances accessibility in education, ensuring students with low vision can access scanned educational materials. Figure 1 summarizes what we wanted to do: 1) process the scanned documents, 2) use the OCR model, and 3) extract the text.

Figure 1. Workflow of the extraction process

But how can we make OCR models accessible to non-coders, and how can we integrate AI into a friendly environment? Streamlit, an open-source Python framework for data scientists and AI/ML engineers, made this possible: it let us deliver a dynamic data app with only a few lines of code, making the app accessible to non-coders. The app is hosted on Hugging Face, specifically on Hugging Face Spaces, which offers a simple way to host Machine Learning (ML) demo apps directly on your or your organization’s profile. In this blog, I will focus on how we built the application using Streamlit and on some of the limitations of Nougat for poetry-structured documents.

WHAT IS STREAMLIT?

Streamlit is an open-source Python library that makes it easy to create web applications for machine learning, data science, and other data-intensive tasks. It allows you to turn data scripts into shareable web apps in minutes without needing expertise in web development. To get started, install Streamlit with pip (pip install streamlit) and import it at the top of a Python script (import streamlit as st).
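As a minimal sketch (assuming a standard Python 3 environment), a complete Streamlit app can be a single script:

    # app.py -- a minimal Streamlit app
    import streamlit as st

    st.title("Hello, Streamlit!")
    st.write("Edit this script and the page updates on reload.")

Running streamlit run app.py starts a local web server and opens the app in the browser.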

To create a simple Streamlit app, such as a data visualization or an interactive widget, we use Streamlit’s core components like st.write(), st.title(), st.sidebar(), and st.button(). To add interactivity, we use widgets like sliders (st.slider()), text inputs (st.text_input()), and select boxes (st.selectbox()). Streamlit can create various types of visualizations out of the box, such as charts (e.g., st.line_chart() or st.bar_chart()), maps (st.map()), and tables (st.table()), and developers can also embed figures from Python libraries such as Matplotlib and Seaborn.
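For example, a few of these components can be combined into a small interactive page (the labels and data here are illustrative):

    import pandas as pd
    import streamlit as st

    st.title("Widget demo")
    name = st.text_input("Your name")                # free-text input
    n = st.slider("Number of points", 10, 100, 50)   # bounded numeric input
    if st.button("Greet"):                           # reruns the script on click
        st.write(f"Hello, {name}!")

    # Built-in chart drawn directly from a DataFrame
    st.line_chart(pd.DataFrame({"y": range(n)}))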

Once your application is finished, you have several options for deploying it, including Streamlit Community Cloud (formerly Streamlit Share), Heroku, or other hosting services such as Hugging Face. You can also use Streamlit’s more advanced features, such as caching (@st.cache), sharing data within a session (st.session_state), and custom components (st.components).
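A brief sketch of the caching and session-state features (using the @st.cache decorator as it worked at the time; newer Streamlit versions replace it with st.cache_data):

    import streamlit as st

    @st.cache  # memoize the result so reruns skip the expensive call
    def load_data():
        return list(range(1_000_000))

    data = load_data()

    # st.session_state persists values across reruns within one session
    if "clicks" not in st.session_state:
        st.session_state.clicks = 0
    if st.button("Click me"):
        st.session_state.clicks += 1
    st.write("Clicks this session:", st.session_state.clicks)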

THE PROCESS OF BUILDING THE AI-DRIVEN APP

Converting scanned documents and PDFs into editable text is crucial for accessibility and efficient data management. Our OCR app, built using Streamlit, provides a seamless solution for this need. The app includes key features such as:

  • User-Friendly Interface: The app allows users to upload PDF files easily.

  • Efficient Text Extraction: Utilizes the Nougat model for accurate OCR processing.

  • Convenient Downloads: Extracted text can be downloaded in Markdown format.

Figure 2 shows the methodology for the OCR application, detailing the interaction between the user, the frontend interface, and the backend processing engine.

Figure 2. Methodology for the AI-Driven App

The process begins with the user interacting with the application through a user-friendly frontend built with Streamlit. This frontend serves as the primary interface where users upload their PDF documents. The application is hosted on a server, specifically leveraging the capabilities of Hugging Face Spaces for efficient and reliable processing.

Once the user uploads a PDF document, text extraction follows. This is achieved through an interactive button labeled “Extract Text” in the Streamlit interface. When the user clicks this button, the application initiates the text extraction process.

Behind the scenes, the backend OCR engine employs the Nougat model, a sophisticated OCR model capable of accurately transcribing text from PDF documents. The model runs the extraction process, converting the scanned text into a readable and editable format. This process results in a Markdown (.md) file, which preserves the structure and content of the original document in a more accessible format.

After the text extraction is complete, the frontend interface displays the extracted text to the user. This immediate feedback allows the user to verify the accuracy and completeness of the transcription. Additionally, the application enables users to download the extracted text: a “Download File” button lets users save the Markdown file to their local storage for further use or editing.

This methodology details a workflow for converting PDF documents into readable text using OCR technology. The workflow is implemented with a frontend built using Streamlit, allowing for interactive user input, and a backend OCR engine that reached a text extraction accuracy of 91.8% (a BLEU score of 0.918) in our evaluation. Hosting the application on Hugging Face Spaces provides the necessary infrastructure for robust performance, making this solution practical and effective for users seeking to digitize and edit their documents.

Let’s delve into the code to understand how each component contributes to the app’s functionality.

Preparing the app page

1. Setting up the environment

We start by importing essential libraries: streamlit handles the web interface, subprocess lets the code run system commands (used here to invoke the OCR model), and os provides functions for interacting with the operating system, such as creating directories.
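A sketch of the imports this step describes:

    import os          # directories and file paths
    import subprocess  # run the Nougat CLI as a system command

    import streamlit as st  # the web interface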

2. Helper functions for state management

These functions manage the app’s state, ensuring smooth user interactions. clear_submit() resets the submit state, while click_button() tracks button clicks.
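The exact bodies live in the repository, but the helpers look roughly like this (a sketch, assuming the flags are kept in st.session_state):

    def clear_submit():
        # Reset the submit flag, e.g., when a new file is uploaded
        st.session_state["submit"] = False

    def click_button():
        # Remember that the user clicked the extract button
        st.session_state["clicked"] = True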

3. Configuring the Streamlit App

This line configures the app’s appearance, setting a title, icon, and layout preferences.
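The configuration is a single st.set_page_config() call at the top of the script; the title and icon below are illustrative:

    st.set_page_config(
        page_title="AI-Driven OCR App",  # shown in the browser tab
        page_icon="📄",
        layout="wide",
    )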

Processing the scanned PDF documents

1. Building the Sidebar

The sidebar is designed for file uploads and user actions. Users can upload PDF files, and an extraction button appears upon successful upload. Figure 3 shows the app’s appearance at this stage, and a code sketch of this step follows the figure.

Figure 3. The basic interface
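A sketch of the sidebar logic, reusing the session-state helpers defined earlier (widget labels are illustrative):

    with st.sidebar:
        st.header("Upload a scanned PDF")
        uploaded_file = st.file_uploader(
            "Choose a PDF file", type=["pdf"], on_change=clear_submit
        )
        if uploaded_file is not None:
            # The extraction button only appears after a successful upload
            st.button("Extract Text", on_click=click_button)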

2. Handling File Uploads

This section manages file storage. Uploaded files are saved in a dedicated ‘files’ directory, ensuring organized storage and retrieval. Figure 4 shows the file upload option.

Figure 4. Uploaded scanned document
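A minimal sketch of the storage step (the ‘files’ directory name comes from the description above):

    if uploaded_file is not None:
        os.makedirs("files", exist_ok=True)
        pdf_path = os.path.join("files", uploaded_file.name)
        with open(pdf_path, "wb") as f:
            f.write(uploaded_file.getbuffer())  # persist the upload to disk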

Extracting the text using Nougat

1. Model execution with caching

The load_model function runs the Nougat OCR model to extract text from the PDF. Caching this function optimizes performance by preventing redundant executions. When the extract button is clicked, the app runs the OCR model, reads the extracted text, and stores it in the session state. Figure 5 shows the app visualization once the file is uploaded and the content is processed.

Figure 5. Extract text from the scanned document
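A sketch of the cached extraction step, continuing from pdf_path above. We assume Nougat is invoked through its command-line interface (nougat <pdf> -o <dir>), which writes the transcription as a Markdown (.mmd) file; the exact flags and paths may differ from the repository code:

    @st.cache(show_spinner=False)
    def load_model(pdf_path, output_dir="files"):
        # Run the Nougat CLI; it writes the transcription as Markdown
        subprocess.run(["nougat", pdf_path, "-o", output_dir], check=True)
        base = os.path.splitext(os.path.basename(pdf_path))[0]
        with open(os.path.join(output_dir, base + ".mmd"), encoding="utf-8") as f:
            return f.read()

    if st.session_state.get("clicked"):
        st.session_state["text"] = load_model(pdf_path)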

Displaying and downloading the text

The extracted text is displayed on the main page, and users can download it via a button in the sidebar. Figure 6 shows the app visualization after extracting the content from the document.

Figure 6. Text extracted visualization from the scanned document
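A sketch of the display and download step, assuming the extracted text was stored in st.session_state as above:

    if "text" in st.session_state:
        st.markdown(st.session_state["text"])  # render on the main page
        st.sidebar.download_button(
            label="Download File",
            data=st.session_state["text"],
            file_name="extracted_text.md",
            mime="text/markdown",
        )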

APP UML DIAGRAM: USER INTERACTION WITH THE OCR APP

The Unified Modeling Language (UML) diagram (Figure 7) shows the user’s journey through the OCR app, highlighting each step and how the app interacts with the backend server to deliver the desired output. When users first interact with the OCR app, they are greeted with a straightforward interface that allows them to upload a PDF document. This is the starting point of the user’s journey. They select a PDF file from their local storage and upload it through the app’s file uploader component in the sidebar.

Once the PDF is uploaded, the user can choose to extract text from the document. This action is initiated by clicking the “Extract” button, which triggers a series of backend processes. The app takes the uploaded PDF and prepares it for processing by saving it to a specific location within its directory. At this point, the real magic happens: the Nougat Visual Transformer performs the OCR on the uploaded PDF, extracts text from the images, and converts it into a readable format.

The extracted text is then saved as a Markdown file (.md). The choice of Markdown is intentional, as it is simple and widely compatible with various text editing tools. This file is stored in a designated location within the app’s structure and is ready for further action by the user. With the complete text extraction, the app provides two primary functionalities to the user.

Firstly, the user can download the extracted text file. The app offers a download button that lets users save the Markdown file to their local storage. This feature ensures the user can easily access the processed document for editing, sharing, or archiving. Secondly, the app displays the extracted text directly within its interface. This immediate display allows the user to verify the content extracted from the PDF document, ensuring that the OCR process has delivered accurate and valuable results.

Figure 7. The UML Diagram for the app interface

IS NOUGAT GOOD OR BAD FOR POETRY-STRUCTURED DOCUMENTS?

We tested the app with 12 scanned documents and used the BLEU score to measure the similarity between the app’s output and the manually transcribed text from the original documents. Scores range from 0 to 1, with 1 being a perfect match. The model’s BLEU score was 0.918, indicating a very close, though not perfect, match between the app’s output and the original documents. Despite this strong overall performance, on poetry-structured documents the model produced good results in some cases and poor results in others.

Figure 8 illustrates a successful application of the Nougat Visual Transformer model, showcasing its ability to accurately extract text from a scanned document and convert it into a readable, editable format. On the left side, we see a scanned document with a section titled “7 Point of View.” The document contains various formatting elements such as headers, font styles, and sections that distinguish different text parts.

On the right side, the output generated by the Nougat model is displayed. This transcription exemplifies the model’s proficiency in recognizing and preserving the structural elements of the original document. The header “Chapter 7 Point of View” is correctly identified and formatted, ensuring that the extracted text retains the hierarchy and organization of the original content. Additionally, the model successfully distinguishes between different font styles and sizes, which helps maintain the readability and semantic structure of the text.

The transcription showcases how the model identifies subheadings, such as “7.1 Varieties in the Use of Narrative Viewpoint,” and different font styles that emphasize specific terms or phrases within the text. For instance, the use of italic and bold fonts in the original document is accurately reflected in the transcription, enhancing the clarity and meaning of the extracted content.

Figure 8. Example of a good result

However, Figure 9 presents an example where the Nougat Visual Transformer model struggled to transcribe a scanned document accurately. The scanned document on the left contains two sections: a passage about Pablo Neruda and a poem titled “The United Fruit Co.” by Pablo Neruda. The right side shows the transcription generated by the model, which demonstrates several significant errors.

Firstly, the model failed to process the initial page containing the Pablo Neruda passage, as indicated by the note “[MISSING_PAGE_EMPTY]” in the transcription. This omission suggests that the model skipped the page entirely or could not recognize the content, resulting in an incomplete transcription.

Secondly, the transcription reveals issues with capitalization. In the poem “The United Fruit Co.,” the model misinterprets capital letters, such as transcribing “When” as “W hen.” This inconsistency disrupts the readability and accuracy of the text, making it less coherent.

Additionally, the model struggled with the document’s multilingual aspect. The Spanish text in the Pablo Neruda passage was not transcribed, highlighting the model’s limitation in handling documents with multiple languages. This failure results in losing important contextual information, which is crucial for understanding the content thoroughly.

The formatting and structure of the original document were also not preserved in the transcription. The poem’s alignment and spacing are inconsistent, affecting the overall aesthetic and readability. For instance, the title “The United Fruit Co.” is not prominently formatted, and the text alignment is uneven, which detracts from the visual appeal and organization of the content.

Figure 9. Example of a bad result

CONCLUSION

Our Streamlit OCR app simplifies the process of converting PDFs into editable text, making it accessible and efficient. By leveraging the power of Streamlit and the Nougat OCR model, we have created an intuitive tool that seamlessly handles file uploads, text extraction, and downloads. This app demonstrates how modern web technologies can effectively address practical needs in data management and accessibility, highlighting the potential of using OCR models to improve access to educational materials for students with low vision and enhance the educational experiences of both students and teachers.

The model’s BLEU score of 0.918 indicates good performance in transcribing scanned documents. The Nougat Visual Transformer model excels in preserving original formatting elements, such as headers, font styles, and sections, making it an invaluable tool for high-fidelity text extraction and conversion tasks. This ensures the integrity and usability of the information, facilitating further editing, analysis, or digital archiving. However, it also faces challenges in accurately transcribing complex and multilingual documents. Issues such as page skipping, capitalization errors, and difficulties with foreign languages underscore the need for further improvements in OCR technology to enhance its reliability and effectiveness across diverse document types.

Several limitations affect the model’s performance. Processing altered documents, such as those with highlights, scratches, or underlines, presents challenges that can impede accurate text extraction. Additionally, the model struggles with low-resolution documents and small-size fonts, which further complicate the OCR process. Another significant limitation is the difficulty transcribing text from non-English languages, highlighting a need for more robust multilingual support. The model also faces challenges in accurately transcribing text from different styles, such as poetry, which requires precise formatting preservation. Despite these limitations, the model performs well in many scenarios, but improvements to handle large unstructured or low-resolution text are needed.

Future research should focus on comparing different OCR models, testing this tool on a larger scale, and further tailoring the interface to meet student and teacher needs. We are currently transcribing 64 documents to fine-tune the model. Although we have yet to test the app directly with students to explore its usability, students are already using the transcribed documents. Enhancing the app to process text with significant unstructured or low-resolution elements and exploring its direct impact on students’ learning experiences will be critical for future work.

In conclusion, as captured by the UML diagram, our app effectively showcases the seamless interaction between the user and the OCR tool. From uploading a PDF to extracting and displaying the text and finally to downloading the result, each step is designed to enhance the user experience. The app leverages the powerful Nougat Visual Transformer model to deliver efficient and accurate text extraction, making it an invaluable tool for converting scanned documents into editable text.

The full code is available in the GitHub repository linked above: https://github.com/bllin001/storymodelers-ocr-documents-art-app.

Posted by Brian Llinas (@bllin001)

