Recently, as a result of my current research project (an edition and translation of al-Maqrizi’s fifteenth-century chronicle al-Suluk for the Ayyubid period, 1171-1193), I have been pondering issues related to Optical Character Recognition (OCR). Part of my work involves investigating where al-Maqrizi took the material that he recycled (or plagiarized, if we’re being wholly accurate) for his chronicle. This, in itself, is a comparatively easy process: the computer can compare a PDF of the edition to vast chunks of text scraped from the web, and the large size of the text chunks means that it can find matches fairly easily. (Here, I must thank Sarah Savant of the Kitab project for her help with this.)
Yet, an important question that remains for me relates to the extent to which OCR can be relied upon to locate smaller items of text, such as single words. If I wanted to locate all the examples of an individual word within a lengthy Arabic text whose only form was a photograph of a comparatively old printed edition, how easy would that be? Used to find single words or short phrases, OCR operates rather like an index in a book. It can hugely enhance the ability of the researcher to investigate the text in question if it is reliable; if not, it can actually prove a hindrance and risk producing results that are not just inaccurate but also misleading.
Because it is printed in a cursive form, Arabic, along with similar scripts such as Persian, raises several OCR-related issues. One of these is that each letter takes a different shape depending on whether it is at the start, in the middle, or at the end of a word. That, however, has a fairly easy solution: train the computer to see each form as a different version of the same letter, much as humans do.
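The idea of treating each positional shape as a version of the same letter has a close analogue at the level of encoded text, and a minimal sketch may make it concrete. Unicode assigns separate "presentation form" code points to the isolated, final, initial, and medial shapes of Arabic letters, and compatibility normalization (NFKC) folds them back to the base letter. This is not how an OCR engine is trained; it only illustrates the many-shapes-to-one-letter mapping. The function name `fold_positional_forms` is my own invention for the example.

```python
import unicodedata


def fold_positional_forms(text: str) -> str:
    """Map Arabic presentation-form code points back to their base letters.

    Relies on Unicode NFKC normalization, which replaces compatibility
    characters (including positional glyph variants) with their
    canonical equivalents.
    """
    return unicodedata.normalize("NFKC", text)


# The four positional shapes of the letter beh, as encoded in the
# "Arabic Presentation Forms-B" block of Unicode.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

for position, glyph in BEH_FORMS.items():
    base = fold_positional_forms(glyph)
    # Each shape should fold to the single base letter U+0628.
    print(position, unicodedata.name(glyph), "->", unicodedata.name(base))
```

Note that NFKC also rewrites other compatibility characters (ligatures, for instance, are split into their component letters), so in practice one would check that this is acceptable for the text at hand.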
Another issue is related to Arabic’s root system. Let us imagine that one was researching a particular topic, such as the perennially popular issue of jihad. To investigate it properly one would have to search a corpus not only for all examples of that specific word (in Arabic: جهاد) but also for all related terms, i.e. all instances in which the radicals ج-ه-د appear, in that order. Thus, all forms of the base verb جهد would have to be included, as would all forms related to the noun and all adjectival forms.
In this particular case, there would appear to be a simple solution: make the software search for any words in which the base radicals appear in that order, no matter what other letters may appear within the same word. This would solve the problem with the term jihad, but a larger issue becomes apparent when searching for words whose radicals are letters that also appear regularly as prefixes or suffixes, such as mim, ta, kaf, and nun. In such circumstances, the software could pick out a large amount of irrelevant material.
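The search just described, and its weakness, can be sketched in a few lines. This is only an illustration under simple assumptions: the text is already clean, unvocalized digital Arabic, words are separated by whitespace, and the function names are hypothetical. It is not a morphological analyser, which is what a serious tool would need.

```python
def contains_radicals_in_order(word: str, radicals: str) -> bool:
    """Return True if every radical occurs in `word`, in the given order.

    Other letters may appear before, between, or after the radicals.
    """
    letters = iter(word)
    # `radical in letters` consumes the iterator up to the first match,
    # so each later radical is only looked for after the previous one.
    return all(radical in letters for radical in radicals)


def find_candidates(text: str, radicals: str) -> list[str]:
    """Return the whitespace-separated words of `text` that match."""
    return [w for w in text.split() if contains_radicals_in_order(w, radicals)]


JHD = "جهد"  # radicals j-h-d, in logical (reading) order

# jihad, mujahid, ijtihad: all genuinely derived from j-h-d.
print(find_candidates("جهاد مجاهد اجتهاد", JHD))

# The weakness: searching for the radicals m-k-n also matches
# "maktubin" (from the root k-t-b), because mim is a prefix there
# and nun belongs to the plural suffix.
MKN = "مكن"
print(contains_radicals_in_order("مكتوبين", MKN))
```

The second check returns a match even though the word has nothing to do with the root being searched for, which is exactly the kind of irrelevant material described above.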
Yet there is another, larger issue, and one that is not unique to Arabic script: the accuracy of the OCR software. Looking at statements from both publishers’ marketing departments and software developers, it can often seem as if OCR accuracy is essentially a binary issue: either the OCR software works or it does not. This may be the case for texts originally produced on computers, in Word documents or the like, but it is not the case for material that originates from before the digital age and has been digitized subsequently through photography. For this material, the OCR can be very hit-and-miss, depending on a variety of factors, including the clarity of the original script (a particularly acute problem when dealing with manuscripts), the quality of the photographs, and the consistency of Arabic letter shapes across multiple sources (particularly the case with letters that can be printed in various ways, such as mim and ha).
To put the issues involved into perspective, several years ago I was working at a publisher to develop an Arabic-language OCR tool in liaison with a company based in Egypt. This took a considerable amount of development and was extremely tricky, but the most revealing aspect was the discussions within the publisher at the time, particularly as to the acceptable accuracy level of the OCR software. At one point, it was suggested that 40% could be acceptable. In the end, it was deemed insufficient, and 70% was the eventual target accuracy. Whether or not the publisher ever reached that target is something I do not know because I left in the middle of the investigation, but, even if it did, 70% accuracy means that 30% of the time the OCR would be missing the keywords being searched for. If, for example, a book index missed the information 30% of the time I would consider it of little use for scholarly research, and yet OCR is used by many scholars without an assessment of its efficacy.
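To make the arithmetic concrete, here is a rough sketch. It assumes that the accuracy figure applies to whole words, which is how I have treated it above; the publisher’s discussions did not, as far as I recall, pin this down. The second function shows what would follow if the figure were instead per character and errors were independent of one another, which is a simplifying assumption of mine. The occurrence count and word length are made up for the example.

```python
def expected_missed(occurrences: int, word_accuracy: float) -> float:
    """Expected number of occurrences a keyword search fails to find,
    assuming each occurrence is read correctly with probability
    `word_accuracy`."""
    return occurrences * (1 - word_accuracy)


def word_accuracy_from_char_accuracy(char_accuracy: float, word_length: int) -> float:
    """Chance that every letter of a word is read correctly, assuming
    (hypothetically) independent per-character errors."""
    return char_accuracy ** word_length


# A word that really occurs 200 times, read at 70% word-level accuracy:
# roughly 60 occurrences would go unfound.
print(round(expected_missed(200, 0.70)))

# If 70% were a per-character figure, a four-letter word such as جهاد
# would be read correctly only about a quarter of the time.
print(round(word_accuracy_from_char_accuracy(0.70, 4), 2))
```

Either way, the researcher sees only the hits that were found and has no indication of how many were missed.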
What can be done? Firstly, I think there should be much more clarity from both developers and scholars as to the accuracy, or lack thereof, of the OCR being used within a software program or a specific project today. What to do about it in terms of software development, though, is a bigger question. Because of the multiple issues that are encountered across a corpus of texts, it is not something that will prove easy to solve, and I have not seen anything to make me believe improvements will be made in the near future.
Perhaps the solution to this lies in machine learning. At this point, though, I will end this post because machine learning is a software development issue beyond me. Pondering this particular problem, though, has left me rather concerned as to the scholarliness of any research based on mass OCR-captured data and where it can – and cannot – take us.