UAH Archives, Special Collections, and Digital Initiatives

From Manuscript to Machine: AI-Driven Transcriber

Elijah Shannon, Spring 2025
     Access to physical historical documents remains a substantial bottleneck in academic research, particularly in the humanities and archival sciences. Despite advances in digitization, a notable gap remains between retrieving content from physical documents and making that content searchable: digitization and transcription traditionally require significant time and human effort. To address these challenges, this paper proposes an automated AI pipeline integrating object detection, convolutional neural networks (CNNs) [4], and Generative Pretrained Transformers (GPTs) to transcribe historical texts efficiently, reducing the need for manual transcription.

Figure 1. A page from an original 1613 copy of the 1611 King James Version Bible.

      Utilizing a 1613 copy [1] of the original 1611 King James Version Holy Bible, the initial image dataset consisted of thirty-two photographs taken with an iPhone 15 Pro in the British Library Rare Books & Music Reading Room. From these images, eighty-nine distinct words were selected based on visual quality and alignment. Bounding boxes were annotated around each letter within these words, yielding a character dataset with fifty-five unique classifications.
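     For concreteness, the sketch below shows how one such character annotation might be stored and read back, assuming the bounding boxes were exported in the normalized YOLO text format (class index, center x, center y, width, height). The class index, coordinates, and image dimensions are illustrative only, not values from the actual dataset.

# Illustrative only: one line of a YOLO-format annotation file for a word image,
# assumed format: <class_index> <x_center> <y_center> <width> <height> (normalized 0-1).
label_line = "12 0.431 0.520 0.061 0.184"

def to_pixel_box(line: str, img_w: int, img_h: int):
    """Convert one normalized YOLO label line to a pixel-space bounding box."""
    cls, cx, cy, w, h = line.split()
    cx, cy = float(cx) * img_w, float(cy) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    # Return (class_index, x_min, y_min, x_max, y_max) in pixels.
    return int(cls), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

print(to_pixel_box(label_line, img_w=800, img_h=600))
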
      Object detection is used to identify letters within word images. The detector is the state-of-the-art YOLOv12 [2], run through the Ultralytics framework; it was selected for the recency of its release, its ease of use, and its detection stability relative to its predecessors. The YOLOv12n model was fine-tuned on this dataset, yielding bounding-box predictions for individual characters. Certain detection inaccuracies were noted, such as misclassification of lowercase letters with complex typography (e.g., interpreting a lowercase "m" as multiple "l" characters). While these limitations indicate room for improvement in dataset quality and model training, overall performance remained within acceptable bounds for pipeline functionality.
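     As a rough illustration, a fine-tuning run with the Ultralytics Python API might look like the following sketch; the dataset configuration file name (characters.yaml), the image file name, and the hyperparameter values are placeholders rather than the settings used in this project.

from ultralytics import YOLO

# Load pretrained YOLOv12-nano weights and fine-tune them on the character dataset.
model = YOLO("yolo12n.pt")
model.train(
    data="characters.yaml",  # hypothetical dataset config listing images and the 55 classes
    epochs=100,              # illustrative value
    imgsz=640,
)

# Detect character bounding boxes in a single word image.
results = model.predict("word_crop.jpg")
print(results[0].boxes.xyxy)  # one (x1, y1, x2, y2) row per detected letter
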
      Letters identified in the object detection stage were cropped and subsequently classified using a Convolutional Neural Network (CNN). The CNN's architecture is shown in Figure 4.

Figure 4. Convolutional Neural Network architecture for this classifier.

     The CNN takes a grayscale version of the lettering dataset as input and outputs one of the fifty-five possible classification labels. Its results were slightly less than ideal, owing to limited time for data preparation: the small dataset led to overfitting after approximately 300 epochs. However, given the printed nature of the document, this overfitting was less detrimental because the characters are highly consistent. Ultimately, the CNN performed its intended classification task within the pipeline.
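     Because the exact layer dimensions appear only in Figure 4, the sketch below is a generic small CNN of the same flavor: grayscale character crops in, logits over fifty-five classes out. The layer sizes and the 64x64 input resolution are assumptions made for illustration, not the architecture used in the study.

import torch
import torch.nn as nn

class CharacterCNN(nn.Module):
    def __init__(self, num_classes: int = 55):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # single-channel grayscale input
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 128),  # assumes 64x64 input crops
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# One grayscale 64x64 crop in, logits over the fifty-five character classes out.
logits = CharacterCNN()(torch.randn(1, 1, 64, 64))
print(logits.shape)  # torch.Size([1, 55])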

Figure 5. Loss plot for the Convolutional Neural Network.

     The overarching transcription pipeline starts with user-provided screenshots of words. Each word image is fed into the YOLOv12 detector to locate every letter. The detected letters are sorted by their x-coordinates to ensure correct ordering, then run through the CNN one letter at a time. The resulting characters are transcribed sequentially into digital text, completing the automated transcription process.
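     A sketch of that word-level step is shown below, assuming a fine-tuned Ultralytics detector, a trained classifier, and the list of fifty-five class labels are already in hand; the function and argument names are illustrative.

from PIL import Image
import torch

def transcribe_word(image_path, detector, classifier, class_names, crop_size=64):
    """Detect letters in a word image, order them left to right, and classify each crop."""
    image = Image.open(image_path).convert("L")           # grayscale, as in training
    boxes = detector(image_path)[0].boxes.xyxy.tolist()   # [[x1, y1, x2, y2], ...]
    boxes.sort(key=lambda b: b[0])                        # order letters by x-coordinate
    chars = []
    for x1, y1, x2, y2 in boxes:
        crop = image.crop((int(x1), int(y1), int(x2), int(y2))).resize((crop_size, crop_size))
        tensor = torch.tensor(list(crop.getdata()), dtype=torch.float32)
        tensor = tensor.view(1, 1, crop_size, crop_size) / 255.0
        with torch.no_grad():
            pred = classifier(tensor).argmax(dim=1).item()
        chars.append(class_names[pred])
    return "".join(chars)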

Figure 6. Pipeline layout.

Figure 7. The transcription-to-corrected-output pipeline.

     The overarching pipeline architecture can be split into three separate pieces: automated letter detection within an image of a word; automated character recognition of each detected letter; and, once those two stages are complete, a GPT-4o [3] based autocorrector that revises the result into a coherent sentence. One primary advantage of using a GPT is the ability to specify the context in which a document was written. For the test dataset, GPT-4o was given the following instruction: “Correct the following sentence that was interpreted from a 1613 Bible through a convolutional neural network. The sentence may contain spelling errors due to error in the convolution and object detection. Please correct it and output only the corrected sentence.” This context produces output that is cleaner and closer to the original sentence than a conventional autocorrect system would. Because the 1611 Bible is written in Early Modern English, the GPT should return a sentence in similar-era English rather than a modern interpretation of the sentence.
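     As an illustration, the correction step could be wired up with the OpenAI Python SDK roughly as follows; the function name is illustrative, and an OPENAI_API_KEY is assumed to be available in the environment.

from openai import OpenAI

client = OpenAI()

def correct_transcription(raw_sentence: str) -> str:
    """Send the raw pipeline output to GPT-4o with the document-context instruction."""
    prompt = (
        "Correct the following sentence that was interpreted from a 1613 Bible "
        "through a convolutional neural network. The sentence may contain spelling "
        "errors due to error in the convolution and object detection. Please "
        "correct it and output only the corrected sentence."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": raw_sentence},
        ],
    )
    return response.choices[0].message.content.strip()
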
     Using the Gestalt pattern matching algorithm to score predicted text against the original text, the average transcription correctness was found to be 77%; after running the results through GPT-4o correction, it rose to 90%. The test was conducted by compiling five test sentences of varying words and lengths, each housed in its own folder containing an image of every word in the sentence. The model was run on each sentence in turn to produce a transcription, which was then passed through GPT-4o and compared to the original sentence for an accuracy score.
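     Python's difflib.SequenceMatcher implements the Gestalt (Ratcliff/Obershelp) pattern matching algorithm, so a correctness score of this kind can be computed directly from the predicted and original strings; the example strings below are illustrative, not taken from the test set.

from difflib import SequenceMatcher

def similarity(predicted: str, original: str) -> float:
    """Gestalt pattern matching similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, predicted, original).ratio()

print(similarity("In the beginnlng God creaied", "In the beginning God created"))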

Figure 8. Results table.

     The accuracy rates in Figure 8 indicate that the object-detection to CNN to GPT-4o pipeline is a valid route for automated transcription. Given the time constraints of this project, it is reasonable to expect that accuracy can be further improved through additional fine-tuning or variation of the models.
     The transcription pipeline's 90% accuracy on the test set strongly suggests that, with further fine-tuning, the proposed methodology is a viable option for automated transcription of physical documents. Improved methods of digitization open many doors for archives, libraries, and museums to make their collections more accessible. A digital text document also enables future research opportunities such as advanced indexing, and the more quickly documents can be accessed, the more efficiently research in their respective fields can be performed.

Figure 9. Land survey of the Mobile River in Alabama [5].

    The pipeline opens many opportunities for future work. First, given a transcribed version of a document, one could apply automated translation, making many more documents accessible. Second, many documents exist to hold data, so a similar methodology could be used to automate extraction of that data. For example, many historical physical documents hold data in the form of tables; using this pipeline alongside other algorithmic processes, a table could be recreated in digital format for ease of recall and manipulation. The digitization process opens a realm of possibilities for researchers who rely on physical documents by making them more accessible.

Bibliography

1. The Holy Bible. The Holy Bible, Conteyning the Old Testament, and the New. King James Version, 1613.
2. Tian, Yunjie, Qixiang Ye, and David Doermann. “YOLOv12: Attention-Centric Real-Time Object Detectors.” arXiv preprint, February 18, 2025. https://arxiv.org/abs/2502.12524.
3. OpenAI. GPT-4o System Card. August 8, 2024. https://cdn.openai.com/gpt-4o-system-card.pdf.
4. LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. “Deep Learning.” Nature 521, no. 7553 (2015): 436–44. https://doi.org/10.1038/nature14539.
5. Reference to the Land Surveyed. Reference to the Land Surveyed on the Rivers Mobile and Alabama since the Establishment of the Civil Government in the Province of West Florida. 1770. The National Archives (UK).

Acknowledgments

I would like to acknowledge the UAH Honors College for funding this course and the SAGA Grant. I would like to thank Mr. Reagan Grimsley and Ms. Jennifer Staton for leading this course.