Monk project, University of Groningen, The Netherlands

Monk - a living, trainable system for continuous access to handwritten archives

L. Schomaker - April 2010 [ Draft ]

                          April 2012 [ v2 ]

                          September 2013 [Dutch->English]

 

1. Introduction

The automatic recognition of letters (Optical Character Recognition) is possible today. However, even for characters of high graphic quality, a hundred-percent accurate conversion of a document image (an 'image') to a coded format such as Microsoft Word or a plain text file (in ASCII, Unicode or UTF) is still not quite achievable. Under these circumstances it is understandable that the automatic recognition of handwriting, with all its variations within and between writers and historical periods, is not easy.

The automatic recognition of handwritten modern block letters, individual letters and numbers performs reasonably well. However, the recognition of connected-cursive writing in a free style is very difficult. Only in narrowly defined applications, such as address readers for envelopes, do some commercial systems achieve high performance. This is achieved by combining powerful methods (traditionally: Markov models) with detailed modeling of the layout and content of the addresses in this very specific and constrained application context, which can also make use of the redundancy between zip codes and the handwritten geographical names.

However, for open problems, such as presenting 'just any' historical book or letter on the Internet, not many systems are available as yet. In a series of research projects under the guidance of Prof. Lambert Schomaker, methods have been developed to improve the accessibility of document-image collections. This research resulted in a prototype system, "Monk", in the course of 2009. This system is now being developed further at the University of Groningen, in the EU/SNN-funded Target project.

2. Design Philosophy

The design philosophy of Monk is based on a modern insight: if machine and human cannot solve a problem on their own, then they should work together. Computers are good at searching large databases. People are very capable of analysing detailed information in a highly specific context. If humans train the computer, we get the best of both worlds. It should be noted that although humans are good at reading text, most of us still encounter quite a few problems when presented with an unknown text in an unknown writing style, language or historical period.

2.1. The one-to-one automatic transcription of handwriting to a text file is as yet not a useful and achievable goal. Pursuing it will lead to disappointment in both the designer ("on the last manuscript my algorithm did well") and the user ("worthless, in each line there is at least one error"). Therefore Monk promises no recognition. The goal is to open up a collection which until then was only available in paper form, difficult to access, and only to be found digitally as object metadata pointing to the physical object in a storage facility. Similar to the internet search engine "Google", Monk will present a 'hit list' with links to retrieved image samples. As with Google, there can be no assurance, for example, that absolutely all occurrences of a surname are found. Moreover, it cannot be prevented that unintended words pop up in such a 'hit list'. But prior to the existence of Monk, users had no access to the content of the individual pages of the archived manuscript. This is the fundamental value of Monk.
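The idea of a 'hit list' can be illustrated with a minimal sketch. It assumes that each word image has already been reduced to a numeric feature vector and that candidates are ranked by their Euclidean distance to the nearest user-confirmed example. The feature vectors, the image identifiers and the function name hit_list are invented for illustration; the text above does not state which features or classifiers Monk actually uses.

```python
import math

def hit_list(labelled, candidates, top_n=3):
    """Rank candidate word images by distance to the nearest labelled example.

    labelled:   feature vectors confirmed by users for one word class
    candidates: dict mapping a candidate image id to its feature vector
    Returns the top_n (image_id, distance) pairs, nearest first.
    """
    scored = []
    for image_id, vec in candidates.items():
        nearest = min(math.dist(vec, example) for example in labelled)
        scored.append((image_id, nearest))
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_n]

# Toy data: 2-D feature vectors and ids are hypothetical.
examples = [(1.0, 1.0), (1.2, 0.9)]
pool = {"p12_w3": (1.1, 1.0), "p40_w7": (5.0, 5.0), "p07_w1": (1.0, 1.5)}

ranking = hit_list(examples, pool, top_n=2)
for image_id, distance in ranking:
    print(image_id, round(distance, 3))
```

As in a search engine, such a ranking gives no guarantee: a true occurrence may fall below the cut-off, and an unintended word may appear near the top.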

2.2. A second element of the philosophy of Monk is that recognition and keyword retrieval will become better as human users give more examples of words to the machine. By segmenting a page into line strips and segmenting each line strip into potential word fragments, a document can be offered to a group of users and volunteers in a 'Wiki'-type system, with the goal of training the computer. For each word, it is necessary to collect a critical mass of samples. In machine learning, the general rule is: the more examples, the better. However, our approach will usually be able to start retrieving useful word instances from a training set of five examples. From twenty examples onwards, the recognition performance increases sharply. With hundreds of examples of a word (word class), modern pattern-recognition methods have a stable performance. In exceptional cases, however, a word class can be 'bootstrapped' from a single example. After a nightly training phase, in which the computer processes all new examples, there may be a resulting 'hit list' that can be confirmed quickly, such that there will be 40 new examples for the next training round. In the process, there must be a form of symbiosis between man and machine. Motivated users, such as family tree researchers, will provide valuable collections of interesting words. Words in which nobody takes an interest lead a shadowy existence and will be found only with difficulty. However, their accessibility increases significantly when one day a few new 'labels' are added to word image examples. In our research, we found that 'phase transitions' will take place. In a short phase the recognition rate will rise rapidly due to the fruitful collaboration between man and machine, after which a new plateau is reached. This process is akin to the 'Fahrkunst', an elevator method for mine shafts. Also in that machine, a goal is reached by alternating between two methods that make use of each other's results.
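The first step mentioned above, segmenting a page into line strips, can be sketched with a horizontal projection profile: rows that contain ink are grouped into strips, and empty rows separate them. This is a common textbook technique and only an assumption here; the text does not describe how Monk itself finds line strips. The function name line_strips and the toy page are hypothetical.

```python
def line_strips(image, min_ink=1):
    """Return (top, bottom) row ranges of text lines in a binary image.

    image: list of rows, each a list of 0 (background) / 1 (ink) pixels.
    A row belongs to a strip when it holds at least min_ink ink pixels.
    The bottom index is exclusive.
    """
    strips = []
    start = None
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a new strip begins
        elif not has_ink and start is not None:
            strips.append((start, y))      # the strip ended on the previous row
            start = None
    if start is not None:                  # strip runs to the bottom edge
        strips.append((start, len(image)))
    return strips

# Toy page: two text lines separated by empty rows.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
]
strips = line_strips(page)
print(strips)  # [(1, 3), (5, 6)]
```

The same profile idea, applied to the columns of one strip, would give candidate word fragments. Note that a plain row profile only separates lines that run horizontally across the page.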

2.3. A third element of Monk is that the system has been set up to be as generic as possible. Certainly, with detailed domain knowledge of layout and content, an improved recognition rate would be achieved. However, from a technical (computer science) perspective it is not feasible to anticipate every possible detailed problem in a manuscript image (scan) collection. For users and institutions who want to get more out of it, it makes sense to post-process the results ('output', indices) of Monk. However, this requires a specific system to be developed, on top of and in collaboration with the Monk system itself. Both a 'layout' model for the page layouts to be expected and a statistical model of the language used in the manuscript may help to improve the retrieval and recognition performance.

3. Developments

In the development of Monk, books from the Cabinet of the Queen were ingested first, followed by captain's logs (Scheepsjournalen) from the Dutch CLIWOC project. Although the Monk system is designed for handwriting, it also achieved very good results on a printed Latin book by Ubbo Emmius from 1614. In the summer of 2010, a medieval German text and other texts from the 15th century were processed (four to seven pages) with excellent results. Monk can also be used for other difficult documents, such as irregularly typed material. Since 2011, a series of diverse manuscript styles has been ingested by Monk, varying from the Schepenbank Leuven (1421) to the Dead Sea Scrolls and Cyrillic documents. Although Monk is set up for horizontal writing that runs from left to right, this is not a fundamental restriction, and other types such as Asian writing are possible. However, it is important that the lines are straight (not bent or wavy). Texts of variable orientation, or marginal text lines that 'walk around the corner' when an author ran out of space, cannot be processed properly. The image contrast should be good and the resolution is generally 300 dpi, yielding an ink-trace thickness of 7-9 pixels at minimum.
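Whether a scan meets the ink-trace thickness mentioned above can be checked roughly before ingestion. The sketch below estimates stroke thickness as the median length of horizontal ink runs in a binarised image. This estimator, the function names and the toy scan are assumptions for illustration, not a documented part of Monk.

```python
from statistics import median

def ink_run_lengths(image):
    """Collect the lengths of all horizontal runs of ink pixels.

    image: list of rows, each a list of 0 (background) / 1 (ink) pixels.
    """
    runs = []
    for row in image:
        length = 0
        for pixel in row:
            if pixel:
                length += 1
            elif length:
                runs.append(length)   # a run just ended
                length = 0
        if length:                    # run touching the right edge
            runs.append(length)
    return runs

def stroke_thickness(image):
    """Median horizontal ink-run length, or 0 if the image has no ink."""
    runs = ink_run_lengths(image)
    return median(runs) if runs else 0

# Toy scan: one vertical stroke, 8 pixels wide and 5 rows high.
scan = [[0] * 3 + [1] * 8 + [0] * 3 for _ in range(5)]
thickness = stroke_thickness(scan)
print(thickness, 7 <= thickness <= 9)
```

Horizontal runs overestimate the thickness of strokes that themselves run horizontally, so on real scans the median is only an indication.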

[to be continued]