
1 Introduction

When printing was invented in the mid-15th century, a transcription revolution of sorts took place all over Europe. Single handwritten texts were transformed into books in multiple copies. Although this invention was crucial for the growth of knowledge, the process of writing continued well into the 20th century very much as before, with the help of pen and ink.

A similar media revolution is taking place right now, as modern technology in the form of electronic texts is revolutionizing our reading habits and our media distribution possibilities. One of the most crucial steps for science in this modern media revolution is the ability to search within texts. Optical Character Recognition (OCR) technology [1,2,3] has opened up even old printed texts to modern science in an unprecedented way. In libraries, metadata is no longer the sole entry point to collections; electronic content can speak for itself, and this also changes library practices. However, the large mass of handwritten texts in our libraries and archives is still waiting to be transformed into searchable texts. The reason for this is a combination of technical and economic factors. For handwriting, modern technology does not yet achieve the good results of OCR technology, which nowadays can be applied to printed texts so successfully that it is a straightforward part of digitization processes worldwide.

Handwritten text recognition (HTR) [4,5,6,7,8] is an emerging field and can be quite successful in certain circumstances, especially when applied to an even and uniform handwriting, but rarely so for the non-homogeneous handwritten texts that fill our archives. In most cases, manual transcription is still the most common way to produce reliable electronic texts from handwritten sources, but modern technology is advancing and many projects try to tackle this problem. Manual transcription is typically expensive and prone to human error. The incentive to open up this material to computerized searches is high: the information in archives and library collections worldwide represents an enormously important source for history, and only a relatively small part of it is available as electronic texts.

Semi-automatic transcription of manuscripts typically requires hundreds of already transcribed pages, with thousands of examples of each word, in order to produce a useful transcription of the rest of the text. Due to the time-consuming machine learning procedures involved, this is computed as off-line batch jobs overnight [7]. However, this means that if only a dozen pages exist, the transcriber is forced to complete the transcription without the help of HTR techniques, unless a similar handwriting style exists. An alternative approach to fast, low-cost transcription is to use computer assisted interactive techniques.

This paper introduces a simple yet effective text extractor tool, \(\textit{TexT}\), for transcription of historical handwritten documents. \(\textit{TexT}\) is designed for quick document transcription with the help of user interaction, where the system finds multiple occurrences of the marked text on-the-fly using a word spotting system. Other advantages of \(\textit{TexT}\) include on-the-fly annotation of handwritten text with automatic generation of ground truth labels, and adjustment and correction of user-labeled bounding box annotations such that the word fits perfectly inside the rectangle. Furthermore, the transcribed words are cleaned using filtering methods for background noise removal.

This paper is organized as follows. Sections 2, 3 and 4 review the transcription and annotation methods and tools available in the literature, and discuss related work on handwritten text transcription. Section 5 explains the proposed text extractor tool \(\textit{TexT}\) in detail. Section 6 demonstrates the efficacy of the proposed method with implementation details on a well-known historical document dataset. Section 7 concludes the paper.

2 Transcription Methods and Tools

Transcriptions can be made using several different techniques. Reading and typing is typically done by one person interested in using the contents of the documents, as opposed to collective transcription, where many individuals make transcriptions using crowdsourcing techniques. HTR and dictation are other techniques that can be used to produce transcriptions. An example of the latter is the war diary of Sven Blom, a Swedish volunteer in the Foreign Legion during the Great War. The diary is kept in Uppsala University Library and was transcribed by dictation [9].

Due to the labour-intensive nature of transcription, crowdsourcing, a term originally coined by Jeff Howe in Wired Magazine in 2006 [10], has been a useful way of distributing transcription work to many people, and it therefore sits at the core of many successful transcription projects. The Transcribe Bentham project at University College London is often mentioned as an example [11]. Like so many others, Transcribe Bentham is built with components from the open-source software MediaWiki, also used for perhaps the biggest crowdsourcing project on the planet, Wikipedia. Transcribe Bentham started in 2010 and has to date completed approximately 43% of the whole collection [12]. They now collaborate with the READ project [13] and the application Transkribus [14], which can combine HTR with manual transcription.

There are numerous other transcription tools on the Internet. Zooniverse [15], based in Oxford, includes transcription as one of its many crowdsourcing tasks. The plugin Scripto [16] is one of the oldest, created in an environment close to the history discipline, the Roy Rosenzweig Center for History and New Media at George Mason University. It is also based on MediaWiki and can be used as a plugin for Omeka, WordPress and Drupal. Vele Handen [17] is a Dutch application which offers crowdsourced transcription as a tool for archives and libraries wishing to open up their collections. They have recently included progress bars where followers and participants can monitor progress.

This feature is very similar to the one used by the Smithsonian Institution for their “Digital Volunteers” [18]. In fact, the Smithsonian Institution can be regarded as one of the pioneers in assigning tasks to volunteers. Already in 1849, soon after the founding of the Smithsonian Institution, its first secretary, Joseph Henry, initiated a network of some 150 volunteers for weather observations all over the United States [19]. The “Smithsonian Digital Volunteers” is a very successful transcription application, and its Graphical User Interface (GUI) combines a clear topical structure with progress bars and a general layout that incorporates well-established proof-reading practices. The work of the first volunteer has to be approved by a second volunteer, and finally the result needs to be approved by the institution wishing to publish the results on the web. Together with other activities, such as promoting projects via social networks, they have managed to achieve good results, demonstrating the importance of an attractive GUI in crowdsourcing. The topical structure makes it easy for users to find tasks that appeal to them.

Uppsala University Library is Sweden’s oldest university library, and its manuscript collections consist of approximately four kilometers of handwritten material in letters, diaries, notebooks etc. The handwritten manuscript collections span some 2000 years, from before the Common Era to the 21st century. The medieval manuscripts are plentiful, and the 16th to 20th centuries are well represented with many important individual collections, such as the correspondence of the Swedish King Gustav III, containing letters from, for example, the French Queen Marie Antoinette, and the Waller collection of 38,000 manuscripts with letters from both Isaac Newton and Charles Darwin. The languages in the collection are also diverse (e.g. Swedish, Arabic, Persian etc.). However, the main languages for this project are Swedish, Latin, German, and French.

For some years it has been possible to publish digitized material in the Alvin platform [20], a repository for cultural heritage materials shared among the universities in Uppsala, Lund and Göteborg, as well as other Swedish libraries and museums. However, as is so often the case, very little of the handwritten material is transcribed. The collections can therefore be accessed only through metadata and cannot be analyzed by computational means, a problem which may only be tackled by long-term and multifaceted strategic planning for producing more handwritten document transcriptions.

As a start, Alvin [20] has been adapted to allow for publishing transcriptions alongside the original manuscripts. One example of this is a transcription made from a testimony of refugees arriving in Sweden in 1945 from the concentration camp in Ravensbrück, kept at Lund University Library [21]. In this case, the transcriptions in textual electronic format (such as PDF) are the result of manual transcription and are open to Google indexing, thus making the original manuscripts searchable on the Internet. However, this is only a single example; to open up more texts for use in digital humanities, a combination of HTR technology and manual crowdsourced transcription is probably as far as our present technologies admit. This work takes an initiative towards transcription and annotation of the huge volumes of historical handwritten documents in our university library using HTR methods such as word spotting [22].

3 Document Annotation Methods and Tools

Several document image ground truth annotation methods [23,24] and tools [25,26,27,28,29,30,31,32] have been suggested in the literature. Problems related to ground truth design, representation and creation are discussed in [33]. However, these methods are not suitable for annotating degraded historical datasets with complex layouts [34]. For example, Pink Panther [25], TrueViz [26], PerfectDoc [27] and PixLabeler [28] work well on simple documents only and perform poorly on historical handwritten document images [35].

GEDI [29] is a highly configurable document annotation tool that supports multiple functionalities such as merging, splitting and ordering. Aletheia [30] is an advanced tool for accurate and cost-effective ground truth generation for large collections of document images. WebGT [31] provides several semi-automatic tools for annotating degraded documents and has gained importance recently. The Text Encoder and Annotator (TEA) was proposed in [32] for manuscript annotation using semantic web technologies. However, these tools impose specific system requirements for configuration and installation. Most of these tools and methods are either not suitable for annotating historical handwritten datasets, or represent ground truth with imprecise and inaccurate bounding boxes [35].

Our previous work [34] takes such issues into account and proposes a simple method for annotating historical handwritten text on-the-fly. The present work employs this annotation method, improved with a word spotting algorithm. A detailed discussion of annotation tools and methods is beyond the scope of this paper; the reader is referred to [34] for a deeper understanding of ground truth annotation methods, and of on-the-fly handwritten text annotation in general.

4 Related Work on Handwritten Text Transcription

Manual transcription of historical handwritten documents requires highly skilled experts and is typically a time-consuming process. It is clearly not a feasible solution given the large amounts of data waiting to be transcribed. Fully automatic transcription using HTR techniques offers a cost-effective alternative, but often fails to deliver the required level of transcription accuracy [36]. Instead, semi-automatic or semi-supervised transcription methods have gained importance in the recent past [36,37,38,39,40].

The transcription method proposed in [40] uses a computer assisted, interactive HTR technique, CATTI (Computer Assisted Transcription of Text Images), for fast, accurate and low-cost transcription. For an input text line image to be transcribed, an iterative interactive process is initiated between the CATTI system and the end-user. The system thus generates successively improved transcriptions in response to simple corrective feedback from the user.

Image and language models learned from partially supervised data have been adapted in [38] to perform computer assisted handwritten text transcription using HMM-based text image modeling and n-gram language modeling. This method has recently been implemented in the GIDOC (Gimp-based Interactive transcription of old text Documents) [41] system prototype, where confidence measures estimated using word graphs help users find transcription errors.

An active learning-based handwritten text transcription method is proposed in [39]. It performs a sequential line-by-line transcription of the document, and a continuously re-trained system interacts with the end-user to efficiently transcribe each line.

The performance of the CATTI system [40] and of the methods proposed in [38] and [39] depends upon accurate detection of the text lines in each document page. However, line detection and extraction in historical handwritten document images is a challenging task, and advanced line detection techniques [42] are required.

Such methods are therefore not appropriate in practical scenarios, as a system should ideally accept a full document page as input and generate a full transcription of the words as output. An end-to-end system for handwritten text transcription is presented in [36,37] that also uses HMM-based text image modeling with interactive computer assisted transcription. The transcription method proposed in this work addresses these issues and introduces \(\textit{TexT}\) for quick transcription of handwritten text using a segmentation-free word spotting algorithm [22]. The following section explains the proposed method and its advantages in detail.

5 \(\textit{TexT}\) - Text Extractor Tool

This paper presents a framework for semi-automatic transcription of historical handwritten manuscripts and introduces a simple interactive text extractor tool, \(\textit{TexT}\), for transcribing words into textual electronic format. The method is based on the idea of transcribing each unique word only once for the whole document, including annotations such as gender, geographical locations, etc. This both speeds up the tedious work of transcription and makes it less exhausting. Furthermore, an interactive approach is proposed where the system finds other occurrences of the same word on-the-fly using a so-called word spotting system [22, 43]. The user simply identifies one occurrence, and while the word is being written by the user, the HTR engine finds other possible occurrences of the same word, which are shown to the user while the search continues on other pages in the background. Further, the user helps the HTR engine by marking words that are correctly identified and correcting misclassified words. As the user marks these words, writes their corresponding letter sequences, and adds annotations, the HTR engine processes them and identifies further occurrences more accurately, making a better distinction between the two classes of words. A sketch of this interaction loop is given below.
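The interaction described above can be summarized as the following minimal sketch in Python. The ui and engine objects are hypothetical interfaces invented purely for illustration and are not part of the published \(\textit{TexT}\) implementation: the UI supplies user events, while the engine wraps the word spotting system [22, 43].

def interactive_transcription(pages, ui, engine):
    """High-level sketch of the TexT interaction loop; ui and engine
    are assumed interfaces, not the actual implementation."""
    while not ui.done():
        box = ui.wait_for_marked_word()         # user draws a rubber-band rectangle
        text = ui.read_transcription()          # user types the letter sequence
        query = engine.make_query(pages, box)   # extract and clean the query word
        hits = engine.search(pages, query)      # spot other occurrences of the word
        ui.show(hits)                           # display candidate matches
        confirmed, rejected = ui.collect_feedback(hits)
        engine.learn(query, confirmed, rejected)  # refine future searches
        ui.record(text, box, confirmed)         # store transcriptions and labels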

The proposed method inherits features from our previous work [34] and efficiently performs on-the-fly annotation of handwritten text with automatic generation of ground truth labels, and dynamic adjustment and correction of user annotated bounding box labels so that the text is perfectly encapsulated inside the rectangle. Notably, the transcriptions are generated such that the transcribed word contains no added noise from the background or surroundings. This is made possible by the use of a two band-pass filtering approach for background noise removal [44], followed by connected components extraction from the word image; a sketch of this cleanup step is given below.
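As an illustration, such a cleanup step could be implemented along the following lines: a difference-of-Gaussians band-pass filter suppresses both slow background variation (stains, bleed-through) and fine high-frequency noise, and only sufficiently large connected components are kept. This is a minimal sketch using SciPy; the sigma, threshold and size values are illustrative guesses, and the actual filter design of [44] may differ.

import numpy as np
from scipy import ndimage

def clean_word_image(word, low_sigma=1.0, high_sigma=8.0,
                     threshold=0.1, min_size=20):
    """Clean a cropped grayscale word image (floats in [0, 1],
    dark ink on light background) with a band-pass filter and
    connected component filtering. Parameter values are guesses."""
    ink = 1.0 - word                       # make ink bright, background dark
    # Difference of Gaussians: keep mid frequencies between the two scales.
    band = (ndimage.gaussian_filter(ink, low_sigma)
            - ndimage.gaussian_filter(ink, high_sigma))
    mask = band > threshold                # binary ink mask
    labels, n = ndimage.label(mask)        # connected components
    if n == 0:
        return np.ones_like(word)          # nothing survived; blank page
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    big = np.nonzero(sizes >= min_size)[0] + 1   # drop small specks
    keep = np.isin(labels, big)
    cleaned = np.ones_like(word)           # white background
    cleaned[keep] = word[keep]             # copy original ink pixels back
    return cleaned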

The following features are important parts of the \(\textit{TexT}\) project planning:

  • A simple yet informative and user-friendly GUI that may attract users according to well-defined topics such as botany, history, theology, diaries, etc.

  • A GUI where the user can download the transcription results on-the-fly as they are published in the University Library’s digital repository.

  • Presence on social networks.

  • A ranking system combined with a merit report for the contributor’s use.

  • A proof-reading structure with a first and a second proof-reader and a safe yet quick ingestion mechanism for the repository.

  • A graphic illustration of progress for each topic.

  • Administration of the application, including active outreach to find interested audiences, close monitoring of the uploaded content, and general advertising of opportunities, news and activities, including events which might give contributors extra value, such as exhibitions and shows of the original material.

  • An HTR application, active only in the background, that learns from user input through machine learning and delivers progressively better results.

The combination of crowdsourcing and HTR is crucial, and it is believed to be one of the key factors for the \(\textit{TexT}\) project. Human interaction with AI (artificial intelligence) might be the best way to connect IT technologies with those interested in contributing to the cultural heritage [45].

Fig. 1. The user marks a word in the document (on the left), shown with a red bounding box. The system finds the best fitting rectangle (in green) to perfectly encapsulate the word. The background noise is removed, and the clean transcribed word is shown on the right. Figure best viewed in color. (Color figure online)

6 Experimental Framework and Implementation Details

This section describes the overall experimental framework of \(\textit{TexT}\) along with its implementation details. The proposed framework is tested on the Esposalles dataset [46], a subset of the Barcelona Historical Handwritten Marriages (BH2M) database [47]. BH2M consists of 244 books with information on 550,000 marriages registered between the 15th and 19th centuries. The Esposalles dataset consists of historical handwritten marriage records stored in the archives of Barcelona cathedral, written between 1617 and 1619 by a single writer in old Catalan. In total, there are 174 pages, corresponding to volume 69, out of which 50 pages are selected for the experiments. In the future, ancient manuscripts from Uppsala University Library will be used for further experimentation.

Fig. 2. The result of searching for one word marked by the user (for example, reberé), represented using blue bounding boxes. Figure best viewed in color. (Color figure online)

Fig. 3. The transcription can be performed in any order; in this case, 11 different words have been marked, and the other occurrences are found automatically. Figure best viewed in color. (Color figure online)

Fig. 4. The ongoing transcription results in words being identified in their corresponding places. In this case, the user has also annotated names and places using different colors. Figure best viewed in color. (Color figure online)

The text transcription method based on word spotting proceeds as follows. The user marks a query word in a document page with a so-called rubber-band rectangle. The user-marked red bounding box is highlighted in Fig. 1a for a sample word reberé. The system automatically finds the best fitting rectangle that perfectly encapsulates the word, shown in Fig. 1a with a green bounding box, and extracts the word. Furthermore, noise from the background and surroundings is efficiently removed using the two band-pass filtering approach in order to make the subsequent search more reliable (see Fig. 1b). A minimal sketch of the box fitting step is given below.
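The box fitting step can be illustrated as follows: the rough user rectangle is tightened to the smallest rectangle enclosing the ink it contains. The function fit_word_box is a hypothetical helper written for this illustration (the actual procedure follows [34]), and the crude mean-threshold binarization is an assumption.

import numpy as np

def fit_word_box(page, x0, y0, x1, y1, pad=2):
    """Tighten a rough user-drawn rectangle (the red box) to the
    smallest rectangle enclosing the ink inside it (the green box).
    page is a grayscale float array; (x0, y0)-(x1, y1) is the user's
    rubber-band selection in pixel coordinates."""
    crop = page[y0:y1, x0:x1]
    ink = crop < crop.mean()          # crude binarization: dark pixels are ink
    if not ink.any():
        return x0, y0, x1, y1         # no ink found; keep the user's box
    ys, xs = np.nonzero(ink)
    # Tight bounding box of the ink, with a small safety margin.
    return (max(x0 + int(xs.min()) - pad, 0),
            max(y0 + int(ys.min()) - pad, 0),
            x0 + int(xs.max()) + pad + 1,
            y0 + int(ys.max()) + pad + 1)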

The system starts searching for the word in the document page, and the result is shown in Fig. 2. Note that only a cropped part of a document page from the dataset is shown for demonstration. The search is performed while the user inserts the transcribed text together with the annotations. The user can now let the system learn by clicking on one or several word boxes, confirming that they are correctly found. If the system finds words that are misclassified, the user can inform the system by clicking a button to switch from correct to incorrect mode, and then selecting the words. While doing this, the system continues to perform the word search on other document pages and updates the search on the basis of the information it learns from the user (Fig. 4). A sketch of how such feedback could be folded back into the search is given below.
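One simple way to realize this feedback, assuming each candidate word region has been reduced to a feature vector (for example by HOG descriptors or a learned embedding), is to rank candidates by similarity to the query and update the query with Rocchio-style relevance feedback. This is an illustrative sketch only; the actual segmentation-free algorithm and learning scheme of [22] may differ.

import numpy as np

def rank_candidates(query_vec, candidate_vecs):
    """Rank candidate word regions by cosine similarity to the query;
    returns indices from best to worst match, together with scores."""
    q = query_vec / np.linalg.norm(query_vec)
    C = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = C @ q
    return np.argsort(-scores), scores

def update_query(query_vec, confirmed, rejected, alpha=0.75, beta=0.15):
    """Rocchio-style relevance feedback: pull the query towards
    user-confirmed matches and away from rejected ones. The weights
    alpha and beta are illustrative, not taken from [22]."""
    q = query_vec.astype(float)
    if len(confirmed) > 0:
        q = alpha * q + (1 - alpha) * np.mean(confirmed, axis=0)
    if len(rejected) > 0:
        q = q - beta * np.mean(rejected, axis=0)
    return q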

The user can select words in any order by marking them once. Figure 3 shows how 11 words have been chosen, and the system finds the rest. The corresponding ongoing transcription is shown in Fig. 4. In this case, the user has annotated some words as names (highlighted in red) and others as geographical places (highlighted in green). The example of a place here is the abbreviation for the word Barcelona.

7 Conclusion and Future Work

The transcription tool \(\textit{TexT}\) presented in this paper is based on an interactive word spotting system and lends itself to collaborative work, such as online crowdsourcing for large-scale document transcription. The proposed method can be further improved using a client-server or cloud-based solution to perform transcription without much latency. So far, algorithms for word spotting [22] have been developed, and a simple experimental framework has been proposed to support the transcription approach presented herein.

As future work, we intend to implement a transcription framework for the ancient manuscripts from Uppsala University Library that works as follows. Each user can freely mark words, annotate them, and identify words found by the search as correct or incorrect. The major part of the search will be performed on a dedicated computer that splits the work in parallel, making it possible to search even large documents in a few seconds. It can be noted that searching for one word in our MATLAB implementation takes about 2 s for the example shown in Fig. 2. The word spotting approach used in this work [22] efficiently supports parallel processing, such that the search in a single page can be distributed over several processes, making the search much faster; a sketch of such a parallel search is given below. Different learning methods are being evaluated to improve the transcription algorithm. Deep learning techniques can be used only when several hundreds of annotated examples are available for a document, but when starting the transcription of an entirely new document, no such examples are usually available.
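As a minimal illustration of how such a search could be parallelized in Python, the sketch below distributes whole pages over a pool of worker processes; spot_word is a placeholder standing in for the word spotting routine of [22], which can additionally parallelize the search within a single page.

from multiprocessing import Pool

def spot_word(page, query):
    """Placeholder for the segmentation-free word spotting routine of
    [22]; it should return the matching boxes and scores for one page."""
    return []

def search_document(pages, query, workers=8):
    """Distribute the per-page searches over a pool of worker
    processes, as envisioned for the dedicated search server."""
    with Pool(workers) as pool:
        return pool.starmap(spot_word, [(p, query) for p in pages])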