Optical Character Recognition And Google Gemini 2.5
Image of my Google AI Studio environment.
For one component of my dissertation, I am researching how Eastern Roman coins were represented in early-modern academic and numismatic manuscripts. I am interested in when the label ‘Byzantine’ began to be formally applied to Roman coins and how the periodization of these coins developed. This analysis, in turn, will support my ideas on ‘Wicked Byzantine Problems’ and the many problems that we who study the so-called Byzantine Empire face, and enforce, when we continue to use this dishonest label to identify Roman coins.
I am currently exploring 16th–19th-century manuscripts, many of which are digitized and available on the Internet Archive, with others sourced from various international institutions. Analyzing these manuscripts to identify word patterns and usage is laborious and tedious. However, I am experimenting with Optical Character Recognition (OCR) to convert these manuscripts into plain-text (.txt) format in order to perform Topic Modelling: a form of digital distant reading that identifies common patterns in phrasing and word selection. What words are associated with each other, and what are the potential implications of those associations?
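For the curious, here is a minimal sketch of what Topic Modelling can look like in code, using scikit-learn. This is not my actual pipeline; the folder name, topic count, and library choice are illustrative assumptions.

```python
from pathlib import Path

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Load the plain-text files produced by OCR (folder name is an assumption).
docs = [p.read_text(encoding="utf-8") for p in sorted(Path("ocr_output").glob("*.txt"))]

# Bag-of-words matrix; min_df=2 drops terms that appear in only one text,
# which filters out a fair amount of one-off OCR noise.
vectorizer = CountVectorizer(min_df=2)
dtm = vectorizer.fit_transform(docs)

# Fit a ten-topic LDA model (the topic count is an arbitrary starting point).
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(dtm)

# Show the ten highest-weighted words per topic to see what clusters together.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[::-1][:10]]
    print(f"Topic {i}: {', '.join(top)}")
```

Reading the top words per topic is where the interpretive work happens: which terms co-occur with ‘Byzantine’, and in which manuscripts.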
OCR is neither novel nor overly cumbersome when performed on modern typeset books. The problem arises when the books are digitized manuscripts from the 16th, 17th, or 18th centuries. Though some of these copies are typeset second editions, they are still not printed uniformly or in English. Many are in Latin, whose printing conventions vary considerably depending on the publisher and the information being conveyed. Some have typos, others abbreviations, or content missing due to damage or time. Many of these Latin texts use the long ‘s’, which looks like ſ or something similar and is easily misidentified as an f by OCR. Such nuances make it difficult for OCR to produce accurate and reliable output, not to mention the demands of digitized image quality (300 dpi is the minimum) and the PDF format, which is a headache.
Many OCR programs/code need JPG or PNG images to perform their operations effectively. However, converting PDFs into these formats is a pain, especially for 400-700-page manuscripts, of which I have about 20 thus far. See Working with batches of PDFs. So, all of this to say that recently, I have been trying to find a more effective way to process these manuscripts in order to perform Topic Modelling and write this article for my dissertation. Some solutions are presented on the Programming Historian website, while others can be excavated from numerous websites and YouTube videos.
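For anyone facing that same conversion step, here is a minimal sketch of batch-rendering PDF pages to PNG with the pdf2image library (which wraps Poppler). The file names, batch size, and library choice are my illustrative assumptions, not a prescription.

```python
from pathlib import Path

from pdf2image import convert_from_path, pdfinfo_from_path

pdf = Path("manuscript.pdf")  # hypothetical file name
out_dir = Path("pages")
out_dir.mkdir(exist_ok=True)

# Render in 50-page batches so a 400-700-page PDF doesn't exhaust memory.
total = pdfinfo_from_path(str(pdf))["Pages"]
for first in range(1, total + 1, 50):
    last = min(first + 49, total)
    pages = convert_from_path(str(pdf), dpi=300, first_page=first, last_page=last)
    for offset, page in enumerate(pages):
        page.save(out_dir / f"page_{first + offset:04d}.png", "PNG")
```

Even scripted, this still leaves you with hundreds of image files per manuscript, which is exactly the overhead I wanted to avoid.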
I decided to explore OCR (at my supervisor's suggestion) and began thinking about how to process PDFs directly instead of converting them into image files. Since I am not a coder, I also wanted to experiment with Google AI Studio (Gemini 2.5 Pro) to see how AI can help generate Python code to execute OCR on PDFs. I have published the process notes of my experience here and here. You can view the code itself here.
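The actual code is linked above; for readers who only want the general shape, here is a hedged sketch of one way to hand a PDF directly to Gemini through the File API of the google-generativeai SDK. The model string, file name, and prompt are assumptions, not what my code does.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# Upload the manuscript once; the File API copes with large PDFs far better
# than inlining hundreds of rendered page images into a single prompt.
pdf_file = genai.upload_file("manuscript.pdf")  # hypothetical file name

model = genai.GenerativeModel("gemini-2.5-pro")
response = model.generate_content(
    [pdf_file, "Please perform OCR on this manuscript and return plain text."]
)
print(response.text)
```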
I won’t get into the nitty-gritty of the code, which is what the process notes are for, but to summarize, the OCR was somewhat of a success. Code development was not as difficult as I thought it would be; I went through four iterations before it successfully executed the OCR. Gemini 2.5 Pro is slower than its sibling, Gemini 2.5 Flash, but more detail-focused. The feature I liked most about Google AI Studio is System Instructions: a field, separate from Gemini’s chat prompt, where you provide detailed instructions that set the tone of the AI and guide and focus it on particular expertise.
For example, in one of my other experimental chats, I prompted Gemini in the System Instructions: “You are an expert in Optical Character Recognition. The image I will feed you is from a 16th-century manuscript and is about Roman coins.” This prompt aimed to focus Gemini on the potential variations of Latin used in the 16th century. I also wanted the AI to be aware (can it be aware?) that a manuscript focusing on coins may not follow the syntax of a traditional manuscript from this period. In turn, I wanted the AI to tailor the OCR to a 16th-century manuscript whose typesetting is not comparable to modern typesetting, so that it would hopefully recognize characters that other OCR tools might misinterpret. From that point, I input the image into the chat prompt and asked the AI to “Please perform OCR on this image.” Its output followed, very successfully, I should add, in this particular example. It identified the long ‘s’ (ſ) noted above and provided an accurate output. However, in this example, the input was a JPG image that I had fed into the prompt, not a PDF.
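The same pattern can be reproduced outside AI Studio with the google-generativeai Python SDK; here is a hedged sketch, not my actual workflow, with the model string, file name, and SDK details as assumptions.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# The system instruction plays the same role as AI Studio's System Instructions.
model = genai.GenerativeModel(
    "gemini-2.5-pro",
    system_instruction=(
        "You are an expert in Optical Character Recognition. The image I will "
        "feed you is from a 16th-century manuscript and is about Roman coins."
    ),
)

page = Image.open("page_0001.jpg")  # hypothetical scan of one manuscript page
response = model.generate_content([page, "Please perform OCR on this image."])
print(response.text)
```

Keeping the domain framing in the system instruction, rather than the chat prompt, means every page fed into the chat inherits the same 16th-century Latin context.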
Using a 400+ page PDF file and the Python code developed by Gemini (with me prompting what I wanted from the code), my latest attempt produced acceptable OCR output, but with some inaccuracies. Unlike the previous example, it did not detect the long ‘s’ (ſ). The accuracy of the output, which was in JSON, has not been fully confirmed at this time, and I suspect there will be other inaccuracies as well. For now, though, the potential is promising, and with the help of Google AI Studio, mediocre coders like myself can navigate some of the complexities of handling digitized assets like these particular manuscripts. Many humanists may shit all over AI, and there are no doubt some ethical quandaries that need to be addressed and solutions proposed, but I think the positives outweigh the negatives so far. Well… until the T-1000 decides to show up, do its liquid metal thing, and destroy our society.