1 Introduction

The automatic analysis of handwritings remains a difficult research problem, although much work has been done in this field [22]. This results both from the enormous variations in the handwriting style of different writers and from the changing style over the centuries due to the evolution of the various Latin scripts [17]. Even the appearance of the text of a single writer varies to a large extent, since there are no restrictions regarding his way of writing apart from the attempt to provide a legible manuscript; this, however, causes inconsistent spacing between lines and among words [14], not to mention the skewness of lines, the slant of characters, and their differences in size as well as shape.

In consequence, any algorithmic approach that tries to recognise handwritten texts will be confronted with an entirely novel handwriting style of individual visual appearance, at least whenever the algorithm is fed with a manuscript of a writer whose handwriting has not been processed before. One approach consists in employing learning algorithms that adapt the recognition system to specific writing styles, but there might be a lack of data when only short documents are available, or a learning phase might be inappropriate given the workflow in a specific application. In this case, a linguistic processing level can compensate for difficulties arising at the visual level [14]. Such a linguistic model, however, might not be available during transcription. In fact, a model of that kind might itself be one of the goals to be achieved on the basis of a set of transcribed documents, all the more so as the individual orthography of a scribe might be unusual and itself the subject of interest, whereas a general linguistic model would tend to normalise unusual orthography.

Indeed, from the palaeographer’s point of view, providing the transcription of a manuscript cannot be the exclusive purpose. It is rather of interest to analyse, among other things, the appearance of a writing style and the within-writer variability with respect to single alphabetic characters, to compare similar characters in their appearance, or to examine the differences the writer made within a single character class, depending on the position of the letters in a word as well as in a document, the orderliness frequently degrading the longer the text. Examining characteristics like these makes a thorough analysis of writing hands possible.

In this paper, we introduce a methodology for the interactive transcription of handwritings. Alongside the transcription output, the shape of each single character is precisely extracted. This allows a detailed description of a manuscript at the level of alphabetic characters. Because of the difficulties of non-restrictive handwritings, the method introduced here relies on the interaction between efficient algorithms on the one hand and the flexible abilities of the user on the other hand. This enables what we refer to below as the anytime anywhere document analysis paradigm.

In the following section, three typical application scenarios are introduced that motivate the need for a new interaction paradigm. A number of systems supporting the transcription of handwritings as well as specific interaction paradigms are reviewed and terms employed in this context are discussed in Sect. 3. The notion of anytime anywhere document analysis methods is introduced afterwards. One of the most intricate problems in the context of handwriting analysis is dealt with in Sect. 5. It presents three different glyph separation methods, which are evaluated in the following section with respect to different writing hands. A general discussion and summary close this paper.

2 Typical application scenarios

2.1 Mass data processing by amateurs

For the expert, the amount of data sometimes makes it necessary to delegate work to the staff. However, when transcribing a Latin manuscript with its incomplete or non-existent word separation and dozens, if not hundreds, of ambiguous abbreviations, there are many potential mistakes an amateur can commit which later need to be corrected at low effort by a qualified palaeographer who is familiar with such handwritings. For example, there might be a wrong letter within the final transcription, a fault which can easily be revised.

The underlying causes of a given fault might have deeper roots: a defective letter could have been extracted much too imprecisely. As a consequence, the binarisation needs to be adjusted. When the figure-ground separation is changed, however, the results of subsequent processing steps, such as the separation of neighbouring glyphs or their transcription, must not be affected.

The solution is the local application of image pre-processing methods that do not affect any other part of the image or any glyph already extracted. In this scenario, image enhancement should be possible for a local area at any time, even if all subsequent image processing steps have already been applied, including the extraction and transcription of all other letters.

2.2 The preparation of critical editions

In most cases, the treatment of old mediaeval handwritings is neither simple nor clear. The opinions of experts may differ as to how to extract all signs correctly, for example, whether a region is part of an abbreviation sign or a punctuation mark or simply an inkblot.

Once a transcription has been provided by assistant transcribers, two or more editors should be able to take a copy of that very same transcription in order to adjust it according to their own ideas and research assumptions. This might imply changing the extraction of signs locally, depending on how they are interpreted. As a matter of fact, a meaningful interpretation sometimes becomes clear only once a partial transcription of the document is available, making it necessary to interleave image enhancement, region extraction, and transcription routines.

2.3 The joint work of editors

Another scenario involves two editors who are working together on the same documents in order to share their experience and skills. Mistakes made by one of them are identified by the other and vice versa. Several kinds of oversights are conceivable, for example, that letters have not been separated correctly.

Such segmentation steps need to be adjusted independently of all other letters which have already been correctly segmented within the same image. Additionally, a revision of single letters should be possible no matter how far the whole document has already been transcribed by the other editor or by oneself. Both editors need to work up and down a single document page, which needs to be exchanged between them in its current revision state.

2.4 The principles of temporality and locality

In all those scenarios, it is neither sufficient just to revise the output at the end of the whole processing chain (as in [31] or as shown in Fig. 1), nor is it sufficient to provide user interactions for the different processing steps individually (as provided by [3, 5, 7, 27]), i. e. before the next step is processed. The conventional document image processing chain is no longer applied to a document image as a whole but, if necessary, to each locality of interest separately. It must be possible to revise just parts of the processing chain whenever and wherever necessary. In a nutshell, a method is required that meets two conditions:

  • The principle of temporality: Revisions must be possible for any processing step regardless of how far the document has already been transcribed.

  • The principle of locality: Revisions need to be applicable locally without influencing any processing results of other locations within the same document image.

Above all, both conditions must apply at any time. That means, any revision at any location can in principle be repeated again and again.

Fig. 1 A conventional document processing workflow

3 Related work

A number of transcription systems exist; they are outlined in the following. These systems can be roughly classified with regard to the textual level they mainly take into account. Accordingly, line-based and word-based methods can be distinguished, as well as those which directly work on alphabetic characters. Examples of all three kinds of approaches are described in turn, followed by a number of systems that are reviewed from the standpoint of user interactions in order to compare them with the desired interaction methodology introduced in the previous section. Finally, the terminology employed is clarified.

3.1 Transcription systems

The computer-assisted transcription system described by [24] proposes transcriptions of text lines. The user corrects these transcriptions, and the system adapts itself to those corrections, in order to make improved suggestions. This process is repeated for a single text line until the user has validated the entire line. While the extracted features of a text line image are processed by a Hidden Markov Model, at the linguistic level, this system employs a word model and a character model. The approach by [16] is also text line-based. It covers the whole workflow from accessing documents to the recording of annotated transcriptions.

In [26] a computer-assisted transcription approach is described which works at the level of words. The system tries to recognise whole words by means of word graphs that encode the probabilities of candidate words, while words for whose transcription the system lacks confidence are prompted to the user. The feedback of the user improves the performance.

Both [24] and [26] employ a language model, as does [30], where a recognition system for modern scripts has been adapted to mediaeval scripts. In particular, the authors investigate the creation of suitable language models, which is often difficult due to the small quantity of verified ground truth transcriptions available.

As opposed to the mentioned systems, slant corrections and size normalisations are not desirable for our transcription system, since the original appearance of the handwriting is a valuable source for palaeographic study. Additionally, the way words separate into single characters is of no relevance within the systems mentioned thus far. They are restricted to exporting transcriptions, but lack any means for providing palaeographic methods at the level of single symbols and their visual appearance. Such palaeographic features are also needed elsewhere [12]. But up to now, the extraction of single character symbols during the transcription process has been confined to rectangular areas that enclose single characters.

The clustering of similar character symbols based on their visual appearance is another approach [10, 23]. In this case, a transcription amounts to the assignment of a character to each cluster. This method works most efficiently if all character symbols of one character class can be grouped into the same cluster; it reduces the transcription effort in inverse proportion to the number of character symbols contained in the clusters. Conversely, a poor grouping result, due to a large within-writer variability, results in many clusters. These have to be managed separately during the transcription process.

3.2 Interaction methods

In [4], a directed interaction model is proposed which prompts questions to the user whenever it detects any problems in the context of layout analysis. The initiative is, so to speak, taken by the system. Such an approach is possible when employing a model that defines potential problems the user has to resolve for the system.

Along the same lines, there are other interaction models based upon page descriptions. They pertain to the paradigm referred to as concept-driven grammatical document analysis methods [3], and they are confined to specific document layouts or even to specific contents. Another one is the spontaneous interaction model. Here, it is not the system but the user who has to take the initiative and who is allowed to change parts of the image content in order to let the system know how to improve automatic processing [3]. Furthermore, it is possible to annotate the employed document model itself in order to point out potential pitfalls to the system, so that it is prepared to receive corrections from the user during interaction steps.

An example of a system allowing user interactions at the final stage is the post-processing tool provided by [31]. The document structure as well as erroneous character recognition results can be corrected for digitised documents to be published. In contrast to this method, interactions are possible at different processing steps within the framework of [5]. This approach uses a scheduling module that makes use of a predefined scenario for which documents are to be analysed. In particular, it allows the integration of user interaction modules within that framework.

Another example of user interactions at different processing steps is the system provided by [7] for the purpose of ground truth data generation in the context of handwriting recognition. The user has to select the text areas by means of the GNU image manipulation program gimp [20], which is also used for binarisation, where the user is asked to find appropriate parameters. Text lines are then found by their system and can be corrected manually, as can the text alignment. Word boundaries for each text line are detected by means of a Hidden Markov Model. Those boundaries can be adjusted by the user if necessary.

Similar to [7], the approach of [27] builds upon the gimp toolbox. It also allows interactive corrections regarding the detection of text blocks, text lines, and the transcription. However, this approach is designed to work with collections of homogeneous documents, sharing a similar structure and writing style. The user can decide in which order to transcribe text lines, but the document processing order of block detection, text line detection, and transcription has to be followed.

For the sake of completeness, investigations should also be mentioned that do not primarily focus on user interactions in the context of transcriptions, but which generally relate to the management and archiving of documents. Thus, [15] describes a collaborative management and remote access platform for digital reproductions of books of the Renaissance. As fast access to digitised documents via the Internet is one of the main issues there, compression methods have been developed within that project. In order to make the contents of books available, manual labelling of character patterns is made possible, though this project is confined to printed texts and is not meant to provide complete and well-edited transcriptions, but to supply search engines with a certain minimum amount of content to support the search for those documents.

None of these approaches can deal with scenario 2.1, which requires a local change by means of an early processing step while the text is already transcribed. Regarding scenario 2.2, there might be ways to save and duplicate intermediate results for some of the aforementioned systems. Here again, binarisation results or those relating to the extraction of particular regions concern earlier processing stages which need to be revised. But this should be possible without changing anything but the affected abbreviation signs, punctuation marks, or inkblots, depending on the interpretation for a specific critical edition. Finally, scenario 2.3 does not only require the former flexibility concerning punctual corrections within earlier processing steps; it additionally demands the exchange of a document in any current editing state. Conventional approaches are designed to enable interactions at different processing stages, but usually in a particular order and for the entire document page, sometimes even confined to specific document types. Sooner or later, earlier processing methods become inaccessible. For an editor, however, the principles of temporality and locality in document processing are necessary.

3.3 Glyph versus abstract symbol

In order to distinguish the visual appearance of information contained in a document image and its abstract description, we refer to the linguistic notions of writing systems. In their context, a grapheme is the smallest semantically distinguishing unit [1]. The concrete written graphical symbol of a grapheme is referred to as a glyph or graph. Those glyphs which represent the same grapheme are called allographs.

In our case, glyphs usually correspond to single letters, but might also refer to ligatures, diacritic marks, or punctuation marks. In other words, the document image contains glyphs, while the transcription represents those glyphs by means of abstract symbols. It is the purpose of the presented work to extract all glyphs from a given document image and to assign to each glyph the correct abstract symbol.

The state of the art shows the difficulties of separating glyphs from each other automatically, which is also referred to as character segmentation [2]. This is why many methods rely on the recognition of words instead of single characters, together with word models of the underlying language [24, 26, 30]. Even more difficult than the segmentation of characters is to precisely extract the shapes of single glyphs from the image. This, however, is necessary because assigning to each glyph its abstract symbol is not the only purpose. Palaeographic researchers are rather interested in analysing the visual appearance of glyphs and want to compare glyphs in different contexts [29]. Consequently, methods are required which allow the careful separation and extraction of single glyphs. Due to the problems arising with any non-restrictive handwriting, such methods will hardly work without the user, who will have to resolve difficult cases.

4 Semi-automatic document analysis

In contrast to the conventional document image processing approach, which distinguishes a number of processing steps [18], for instance, those shown in Fig. 1, the presented approach allows everything to be adjusted, at any time and wherever necessary on the document page. While the user is inspecting the document, the system is asked to make suggestions about how to extract lines of text, for example. But as it is difficult to make perfect suggestions at the algorithmic level, the user can freely adjust the detected text lines. The design of the presented approach grants her full control over the entire analysis process, while she benefits from efficient image processing algorithms.

4.1 Palaeographic studies

The palaeographic study of a document image usually means being engaged in the inspection of that image for a long time. In doing so, one possible aim is to obtain all alphabetic characters and abbreviations employed by the writer. Moreover, palaeographical features of the document are to be determined in order to characterise the writer’s peculiar hand. While this can be supported algorithmically, a number of interactions with the document help the researcher become familiar with the handwriting. This includes searching for strings in the original document, showing specific glyphs in their document context, and picking out glyphs in order to compare them directly.

The continuous treatment of the document also implies that, little by little, the user detects glyphs hitherto not precisely extracted from the image or not completely lying in a text line, or other peculiarities which need adjustment. This is where the anytime anywhere document analysis paradigm gets involved, which allows the user to introduce any improvements concerning image enhancement, feature detection, and glyph recognition. Those adjustments, however, must not have an effect on any features already extracted: any side effects are to be avoided. This is demanded by the anytime anywhere paradigm. Besides, the user can directly choose interaction methods, e. g. in order to mark a text line or glyph herself/himself, instead of letting the system deal with particularly complex cases which would be difficult to adjust afterwards. Nevertheless, later processing steps can again be automatic: user interventions and automatic methods interact with one another.

4.2 Adjustments

The methodology consists in the interplay between user and machine. This concerns image enhancement, figure-ground segmentation, text line detection, glyph separation, and transcription. The suggestions made by the system are immediately visualised:

  • image enhancement: The image can be enhanced globally; alternatively, by means of a virtual rubber band, an area can be defined for which the contrast is changed or in which noise is removed;

  • figure-ground segmentation: By a toggle button, the user can switch between the original image and the black-and-white image, showing the quality of the extracted glyphs; a rubber band can be dragged around an area for which the figure-ground segmentation is to be adjusted independently of the remaining image;

  • text line detection: Each text line is framed by a rectangle; the user can grab that rectangle at each side in order to change its size towards all directions;

  • glyph separation: Adjacent glyphs are dyed with different colours; a glyph can be selected in order to align its shape precisely to the document image;

  • transcription: For each line, the transcription is shown at the same vertical level on a panel right of the original document page, as shown in Fig. 3; the transcription can be changed for the whole text line or for each single glyph independent of the rest of the text line.

The user can at any time and anywhere change these features locally and irrespective of the remaining parts of the document whenever facing a defect. Figure 2 illustrates this kind of flexible workflow which has abandoned the conventional document processing chain [31]. In fact, there is no fixed workflow anymore: The user loads a document image and can apply the methods locally just as desired. Alternatively, it is possible to load the current editing state and to continue working with it. Figure 3 shows the user interface with a document and its transcription.

Fig. 2 Instead of the usual workflow shown in Fig. 1, in the Diptychon system presented here, any method can follow any other method somewhere on the document page. Only the transcription requires either that a single glyph has already been extracted by means of local figure-ground segmentation or that a first text line has been extracted. Apart from that, each method, whether applied automatically or interactively, can follow any other method anywhere on a document image

Fig. 3 A document page from [21] displayed within the user interface of the Diptychon system. Left: the separated glyphs are displayed over the original document; the changing colours of the glyphs show how they were separated. Right: the transcription. The $-signs enclose the abbreviations

In the following, the algorithms are shortly outlined. It turns out that the advantage of our methodology consists in simple algorithms which just need to produce more or less accurate results, since the user will compensate for any imperfections. Additionally, the simplicity of the algorithms keeps them general, not confined to specific handwritings.

4.3 Figure-ground segmentation

Connected components which represent single glyphs or parts of disconnected glyphs are determined by applying Sauvola’s binarisation filter [25]. The user can determine the size of the filter’s area of influence as well as a threshold \(\varTheta _k\) for the deviation allowed from the average grey value within that area. The latter adapts the threshold to the local contrast in the neighbourhood.

Then, for each pixel \(I(x, y)\) in the image \(I\), an individual threshold \(\varTheta _{bin}\) is computed in order to decide whether \(I(x, y)\) pertains to the figure or the background:

$$\begin{aligned} \varTheta _{bin}(x, y) = \overline{x}(x, y) \left[ 1 + \varTheta _k \left( \frac{s(x, y)}{s_{max}} - 1\right) \right] \end{aligned}$$
(1)

The mean \(\overline{x}\) and standard deviation \(s\) are computed within a range of the image corresponding to the area of influence. \(s_{max}\) is the maximal value for the standard deviation, which is 128 in a grey scale image with 256 different possible values.
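To make the computation concrete, the following minimal sketch (in Python, assuming the page is available as a NumPy array with grey values in [0, 255]) evaluates Eq. (1) pixel by pixel. The window size and theta_k stand for the user-chosen area of influence and \(\varTheta _k\); the default values and the naive nested loops are illustrative assumptions, not the optimised implementation of the Diptychon system.

import numpy as np

def sauvola_binarise(img, window=15, theta_k=0.3, s_max=128.0):
    """Figure-ground segmentation following Eq. (1); returns True for figure pixels.
    Parameter defaults are illustrative, not those of the actual system."""
    height, width = img.shape
    pad = window // 2
    padded = np.pad(img.astype(float), pad, mode='reflect')
    figure = np.zeros((height, width), dtype=bool)
    for y in range(height):
        for x in range(width):
            patch = padded[y:y + window, x:x + window]       # area of influence
            mean, std = patch.mean(), patch.std()
            threshold = mean * (1.0 + theta_k * (std / s_max - 1.0))
            figure[y, x] = img[y, x] < threshold             # dark pixels belong to the figure
    return figure

A production implementation would replace the inner loops by integral images, but the thresholding rule is the same.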

4.4 Text line detection

Text lines can be detected by means of projection profiles along the horizontal of the image [19]:

$$\begin{aligned} P(y) = \sum \limits _{x = 1}^{x_{max}} I(x, y) \end{aligned}$$
(2)

Deskewing algorithms that aim at the complete alignment of a document [13], which has been inappropriately digitised, for example, should be applied beforehand in order to obtain a useful projection profile. They often fail, however, when facing a dropping handwriting. In order to manage such slanting text lines, an algorithm has been devised which adapts the idea underlying the approach of [28]. It determines parts of a text line which are given by connected components. The smallest enclosing rectangles are defined around those components. By means of a least median squares method, the enclosing rectangles are collected to define a coherent text line. Thereby, a number of constraints have to be satisfied, especially with regard to the difference in height of the rectangles. Deviations from these constraints lead to the definition of new text lines.
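As an illustration of Eq. (2), the following sketch computes the horizontal projection profile of a binarised page and groups rows into text line bands. The relative threshold and the grouping heuristic are assumptions made here for illustration only; the actual method additionally copes with slanting lines via enclosing rectangles and the least median squares fit described above.

import numpy as np

def text_line_bands(binary):
    """Detect candidate text lines from the horizontal projection profile P(y)."""
    profile = binary.sum(axis=1)              # P(y): number of figure pixels per row
    threshold = 0.05 * profile.max()          # assumed heuristic cut-off
    bands, start, inside = [], 0, False
    for y, value in enumerate(profile):
        if value > threshold and not inside:
            inside, start = True, y           # a text line band begins
        elif value <= threshold and inside:
            inside = False
            bands.append((start, y - 1))      # top and bottom row of the band
    if inside:
        bands.append((start, len(profile) - 1))
    return bands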

4.5 Single glyph extraction

The separation of glyphs proves to be particularly difficult, as different handwritings show differences in how characters connect. As a consequence, many methods rely on the interaction between character segmentation and classification [2].

Here, we follow a common heuristic that looks for thin sections within the handwriting, assuming that these refer to paths along which adjacent characters meet. Anywhere on the document page, single glyphs can be locally extracted from the image. For this purpose, the figure-ground segmentation described in Sect. 4.3 is applied locally. If a glyph is accidentally connected to adjacent regions, the user has the opportunity to crop the glyph from those regions.
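A minimal sketch of such a local extraction, assuming the Sauvola sketch from Sect. 4.3 above and SciPy’s connected component labelling; the rubber-band coordinates are supplied by the user interface, and only the cropped area is touched, in keeping with the principle of locality.

import numpy as np
from scipy import ndimage

def extract_glyph_candidates(img, x0, y0, x1, y1, window=15, theta_k=0.3):
    """Apply figure-ground segmentation only inside the user-drawn rubber band
    and return the connected components found there as glyph candidates."""
    crop = img[y0:y1, x0:x1]
    mask = sauvola_binarise(crop, window, theta_k)    # local binarisation only
    labels, count = ndimage.label(mask)               # sequential region labelling
    return [np.argwhere(labels == i) for i in range(1, count + 1)]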

5 Interactive glyph separation

The three categories of document analysis addressed in the previous section can be applied to any part of the document at any time. Among these methods, the extraction of single glyphs is the main challenge. It requires a procedure for the separation of glyphs which is easily manageable for the user but nonetheless very effective. The glyph extraction method described in Sect. 4.5 is much too laborious when applied to a document page with a few thousand glyphs to be separated. Therefore, a new interactive glyph separation algorithm dedicated to solving this problem is introduced in this section.

5.1 A text line-based method

The new method is text line-based. It processes simultaneously all the glyphs contained in a single line. Separating the whole bulk of glyphs in a row is more efficient than treating each single glyph apart from its context.

5.1.1 Known versus unknown transcriptions

The user has two alternatives. Either she lets the system know the transcription of that text line, or she simply requests the system to find a separation independent of any transcription. The former mode is applied by a user who is basically familiar with the given handwriting and who is mainly interested in the extraction of features for further palaeographic studies. The latter mode is helpful if the user herself/himself has difficulties in separating glyphs, as might be the case for complex and unknown handwritings. The first mode has the advantage that the number of characters \(n\) is passed to the separation algorithm. This knowledge helps the system to find an appropriate number of glyph candidates. If the transcription is not available, \(n\) is estimated by the length of the text line divided by a heuristic value for the width of a single glyph.

5.1.2 Recursive region separation

Initially, all connected components \(\mathcal {R}\) are determined by means of a sequential region labelling algorithm [11]. Those connected components whose size is below a certain threshold \(\varTheta _{noise}\) are conceived as noise and are removed from the set \(\mathcal {R}\). Then, the recursive separation algorithm \(f\) is applied which is a binary function of the set of all regions \(\mathcal {R}\) and the given number of desired glyphs \(n\) or their estimation if that number is not available:

$$\begin{aligned} f(\mathcal {R}, n) = {\left\{ \begin{array}{ll} \mathcal {R} &{} \mathrm {if}\ \ |\mathcal {R}| \ge n\\ f(\mathcal {R} \setminus r, n-n_r) \cup f(h(r), n_r) &{} \mathrm {else} \end{array}\right. }\\ \textit{with}\ r = \textit{max}(\mathcal {R}), h: r \rightarrow 2^r,\ n, n_r \in \mathbb {N}. \end{aligned}$$

The algorithm terminates if there are at least as many regions as glyphs desired: \(|\mathcal {R}| \ge n\). If there are not yet enough regions, the largest one \(r\) is split into smaller subregions by the auxiliary function \(h\), and the result is passed to \(f\) together with an estimate of the number of glyphs \(n_r\) fitting into \(r\): \(f(h(r), n_r)\); thereby, \(2^r\) denotes the set of possible partitions of \(r\). In order to get to the next largest region, \(f\) is also applied to \(\mathcal {R}\) without the largest region: \(f(\mathcal {R} \setminus r, n-n_r)\). Both recursion results are merged as the recursion unwinds.
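The recursion can be sketched as follows, assuming that regions are represented as sets of pixel coordinates, that split_region stands in for \(h\) (Sect. 5.1.3), and that the estimate of \(n_r\) from the region width follows the heuristic mentioned in Sect. 5.1.1. The guard against unsplittable regions is an added safeguard for the sketch, not part of the formal definition.

def separate(regions, n, split_region, glyph_width):
    """Recursive region separation f(R, n): split the largest region until
    at least n regions exist."""
    if not regions or len(regions) >= n:
        return regions                                  # termination: |R| >= n
    r = max(regions, key=len)                           # largest region by pixel count
    parts = split_region(r)                             # h(r), cf. Sect. 5.1.3
    if len(parts) <= 1:                                 # r cannot be split any further
        return regions
    width = max(x for x, _ in r) - min(x for x, _ in r) + 1
    n_r = max(1, round(width / glyph_width))            # glyphs estimated to fit into r
    rest = [region for region in regions if region is not r]
    return separate(rest, n - n_r, split_region, glyph_width) + \
           separate(parts, n_r, split_region, glyph_width)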

5.1.3 Splitting \(r\) into subregions

The actual separation algorithm for a single region \(r\) is defined by \(h\). The rows and columns are sampled for \(r\), and those paths that have a length below a certain threshold \(\varTheta _{\beta }\) are marked to be candidates \(\mathcal {C}_x\) for boundaries between glyphs within the rows of \(r\):

$$\begin{aligned} \mathcal {C}_x&= \bigcup \ (p_{x, y}, \ldots , p_{{x+k}, y}), \quad k \le \varTheta _{\beta }, \quad \forall p \in r\ \wedge \nonumber \\ \forall p_{{x+l}, y}&= 1,\,\,l = 0 \ldots k \wedge p_{{x-1}, y} = 0 \wedge p_{{x+k+1}, y} = 0\quad \end{aligned}$$
(3)

Background pixels are denoted by 0 and foreground pixels by 1. Thin paths along the columns of \(r\) are determined accordingly:

$$\begin{aligned} \mathcal {C}_y&= \bigcup \ (p_{x, y}, \ldots , p_{x, {y+k}}), k \le \varTheta _{\beta }, \forall p \in r\ \wedge \nonumber \\ \forall p_{x, {y+l}}&= 1, l = 0 \ldots k \wedge p_{x, {y-1}} = 0 \wedge p_{x, {y+k+1}} = 0 \end{aligned}$$
(4)

In general, this process results in regions instead of paths, because very often there are a number of thin paths with lengths below \(\varTheta _{\beta }\) side by side. Therefore, approximately the middle of such a region is taken for splitting \(r\) into subregions. The emerging boundaries run more or less transversely, depending on how adjacent paths differ with respect to their lengths and relative positions. But they might also be perfectly vertical or horizontal, depending on the orientation of those paths. Since \(\varTheta _{\beta }\) determines what is conceived as thin, this parameter can be adapted to different handwritings.

Whenever resulting subregions get too small, they are added back to one of the adjacent regions. For this purpose, the size of any region is controlled by another parameter \(\varTheta _{A}\), which avoids an over-segmentation with too many small regions. This is necessary as many available digital scans have a high resolution, so that even noisy regions contain 50 or more pixels. This parameter can be adapted to different documents depending on their quality.
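The detection of thin horizontal runs underlying Eq. (3) may be sketched as follows, assuming the region is given as a boolean mask. The analogous column scan of Eq. (4), the grouping of adjacent runs, the cut through their approximate middle, and the \(\varTheta _{A}\) size check are omitted for brevity.

def thin_row_runs(mask, theta_beta):
    """Candidate boundary runs C_x: maximal horizontal foreground runs of the
    region mask whose length does not exceed theta_beta (cf. Eq. (3))."""
    candidates = []
    height, width = mask.shape
    for y in range(height):
        x = 0
        while x < width:
            if mask[y, x]:
                start = x
                while x < width and mask[y, x]:
                    x += 1                                # extend the run of foreground pixels
                if x - start <= theta_beta:
                    candidates.append((y, start, x - 1))  # a thin transition candidate
            else:
                x += 1
    return candidates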

5.1.4 Optimisations

The implementation of this algorithm is iterative due to stack overflow problems which arise as soon as the recursion gets too deep. This happens for long text lines with many large regions and when text lines are interspersed with descenders and ascenders of the preceding and next text line, respectively. Moreover, a number of additional data structures are employed for optimisation purposes.
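A worklist formulation of the same recursion might look like the following sketch; it merely illustrates how recursion can be replaced by an explicit stack and does not show the additional optimisation data structures of the actual implementation.

def separate_iterative(regions, n, split_region, glyph_width):
    """Iterative variant of the recursive region separation, avoiding deep
    recursion on long text lines."""
    result, worklist = [], [(list(regions), n)]
    while worklist:
        regs, k = worklist.pop()
        if not regs or len(regs) >= k:
            result.extend(regs)
            continue
        r = max(regs, key=len)
        parts = split_region(r)
        if len(parts) <= 1:
            result.extend(regs)                 # nothing left to split
            continue
        width = max(x for x, _ in r) - min(x for x, _ in r) + 1
        n_r = max(1, round(width / glyph_width))
        worklist.append(([reg for reg in regs if reg is not r], k - n_r))
        worklist.append((parts, n_r))
    return result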

5.2 User interactions

While the previously outlined algorithm generates a first solution to the glyph separation problem, the interactive part of the method allows the user to let the system know which regions need further treatment.

There are two converse operations, namely joining regions and separating them. Both operations can easily be launched: regions are joined by moving the pressed mouse cursor successively over them, and a single region is separated by clicking into it.

In fact, there are three different methods for separating regions. Their combination enables the user to manage every conceivable situation quite efficiently. All methods are described in turn.

5.2.1 Recursive region separation locally applied

The application of the glyph separation algorithm to a long text line will only rarely produce a perfect separation. However, it is simple for the user to select all regions of the outcome that need further consideration. When a region is selected with the mouse cursor, the glyph separation method described in Sect. 5.1 is applied to that region. Whereas this method has been developed for managing a whole text line, a more or less complex ensemble of regions, or just a single one, can itself be conceived as a short text line. The glyph separation method can be initialised with other parameters concerning the expected number of regions and their sizes. Those parameters adjust the glyph separation algorithm locally and thus have a more effective impact on specific regions.

In particular, \(\varTheta _{\beta }\) should be chosen higher for local regions, since those regions obviously have not been split beforehand when the whole text line was treated. Hence, weaker constraints on where to break up a region are required. The higher \(\varTheta _{\beta }\), the more likely it is that a region is broken up. Moreover, a lower value for \(\varTheta _{A}\) allows the separation of a region to be accepted even if the resulting subregions are quite small. The higher \(\varTheta _{A}\), the more probable it is that the algorithm decides to assign small pieces to neighbouring regions again.

The algorithm systematically determines thin paths, it takes into account the sizes of regions, and it is recursively applied to the resulting regions. For the user, it would be quite cumbersome to analyse the regions in this way all by herself/himself. But the quality of the final outcome of the algorithm can in most cases be assessed quite easily, due to the systematic distribution of colours which visualise the discovered separations. Figure 4 shows an example, in which the user only had to click once into the first yellow region (a) and where it was necessary to join the first two regions as well as the following two pairs of regions (b) in order to get the first three letters properly segmented (c).

Fig. 4 Recursive glyph separation of the word cabilonensi

5.2.2 Line separation

There are a number of cases which are particularly challenging, for example, when glyphs are very close to each other so that there is no clear transition between adjacent glyphs, when ligatures are to be separated into their constituent parts, or when there are imperfections, such as blobs or holes in the paper, smearing the boundaries of the glyphs. Yet another common problem is nearby text lines. In this case, descenders are overlapping with ascenders of the next text line. The heuristics of the recursive region separation algorithm are generally not very successful in such cases.

The line separation method allows the user to draw a line along which a region is to be separated into two pieces. In this way, even oblique transitions between glyphs can be dealt with. The line might even cross several adjacent regions at the same time. As a consequence, they are simultaneously divided into twice as many pieces as there were regions before. In Fig. 5, the example of Fig. 4 is completed: the top image (d) shows in particular the white line which has been drawn by the user in order to separate the letters ‘e’ and ‘n’ (e); the resulting fractions are assigned to their glyphs (f).

Fig. 5 Corrections by the line separation method

A separation line can be drawn wherever desired in order to divide regions. The sole restriction is that such a line needs to start in the background and end somewhere else in the background. This ensures that the line does not end in the midst of a region; if it ended there, not all boundaries of the intended regions could be determined. If the user disregards this restriction, the triggered method simply does nothing.
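A simplified sketch of the underlying splitting operation, assuming a straight separation line between two user-chosen points; the actual method lets the user draw an arbitrary path from background to background and can cut several regions at once.

def split_by_line(region, p0, p1):
    """Assign each pixel of a region to one of two parts, depending on which
    side of the user-drawn line from p0 to p1 it lies on."""
    (x0, y0), (x1, y1) = p0, p1
    first, second = set(), set()
    for (x, y) in region:
        side = (x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)   # sign of the cross product
        (first if side >= 0 else second).add((x, y))
    return first, second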

The line separation method can always be applied, but requires some effort and care by the user. By contrast, the recursive region separation method is much easier to employ, but fails for some complex regions. Precisely in such cases, the line separation method resolves what the automatic algorithm is not able to handle.

5.2.3 Square separation

As a last resort, if none of the previous methods yields satisfying results, a region can be regularly tessellated into small squares. This enables the user to define any possible region by joining all squares which are assumed to belong to the same glyph. The user can specify the lateral length of the squares in order to define the glyph at an arbitrarily fine scale. Figure 6 shows the tessellation of the letter ‘p’ with a lateral length of 2 and 11 pixels on the left hand side and in the middle, respectively. The right hand side of Fig. 6 shows the usual approach, namely to divide a region into a few large subregions and to further divide a subregion into finer squares wherever it seems necessary to consider more details. As the mouse cursor just has to be moved over all regions which are to be connected, such tessellations can easily be transformed into coherent glyphs.

Fig. 6 The tessellation of the letter ‘p’ at two different granularity levels. From left to right, the squares have a lateral length of 2, 11, and, intermixed, 2 and 11 pixels

This is another example that shows the effective interplay of user and system. While a tessellation can be efficiently computed automatically for arbitrary regions, the user can easily combine those squares which are thought to pertain to the same glyph. Defining fine details by drawing corresponding paths through the regions would be much more intricate for the user; in that case, she would have to take care herself of the precise locations of all details. In contrast, with the help of the square separation method, she just has to assign each square to its corresponding glyph by hovering the mouse cursor over it.
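The tessellation itself reduces to binning foreground pixels into square cells, as in the following sketch; the subsequent joining of cells into glyphs is then a matter of the user hovering over the desired cells.

def tessellate(region, side):
    """Square separation: group the pixels of a region into square cells of
    the given lateral length; the user joins cells that belong to one glyph."""
    cells = {}
    for (x, y) in region:
        key = (x // side, y // side)             # index of the enclosing square
        cells.setdefault(key, set()).add((x, y))
    return cells                                 # maps cell index to its pixel set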

Note that the user interface provides a colour-coordinated display of adjacent regions, as shown in Figs. 4, 5 and 6. The display iterates periodically over the colours yellow, red, green, and blue. Among others, this helps in identifying tiny pieces between adjacent glyphs which have been assigned neither to any glyph nor to the background. The latter makes sense when those pieces represent noise.

5.3 Conclusion

The Diptychon method presented here has the advantage of working without the deployment of dictionaries or language models. It is therefore language-independent. Recursive region separation basically applies to quill-created writings, which often show thin parts in the transition zones between adjacent glyphs. In these cases, this algorithm will often yield satisfying region partitions. Otherwise, line separation and square separation are always applicable.

6 A case study in glyph separation

A user-independent evaluation consists in counting the number of interactions necessary in order to arrive at a proper separation of glyphs. As a prerequisite, it is necessary to detect the text lines as described in Sect. 4.4 and to adjust them as described in Sect. 4.2.

6.1 Material

The methodology has been applied to two different document pages written by presumably a single hand, that of the eleventh-century chronicler Hugh of Flavigny [21]. In terms of palaeography, this writing hand is a late Carolingian Minuscule used in books.

Two pages of that chronicle are compared. The first one, page eleven, is almost at the beginning of the book and shows a neat and regular writing style. The second one, page 144, is much more irregular. Among other things, it shows how text lines drop off towards their ends. Both pages have been deliberately chosen by a palaeographic researcher in order to provide two very different document pages. Samples of both document images are shown in Fig. 7.

Fig. 7 Cut-outs of documents A (bottom), p. 11 [21], and B (top), p. 144 [21], approximately at the same scale

A second case study analyses three different handwritings from the ninth, thirteenth, and eighteenth centuries, which are found in the IAM Historical Document Database [6, 8]. They exemplify handwritings in Latin, German, and English. This database is publicly accessible, and thus the results can serve as a reference against which others can compare their own.

6.2 Methods to be evaluated

In each document, at least the first thousand glyphs are to be properly separated. For this purpose, the recursive glyph separation algorithm described in Sect. 5.1 is applied to each text line. Corrections consist either in separating regions or in joining them. The number of necessary corrections is counted.

Both methods, recursive region separation and the joining of regions, are user-independent in that it makes no difference where the user clicks into regions or where she grasps them in order to separate or join them, respectively. By contrast, the line separation algorithm can result in different glyph separations; however, the user can withdraw any interaction and repeat it until the glyphs are properly separated. In this sense, even the line separation interaction can be employed for a user-independent evaluation.

The square separation interaction is omitted in this study. It is useful for dealing with arbitrarily complex handwritings. However, there are two reasons why the square separation method costs much more effort than the other separation interactions. First, the user needs to determine an appropriate granularity level for tessellating a region in such a way that it is detailed enough. Second, the user has to join all tesserae which belong to a proper glyph. This amounts to hovering the mouse consecutively over all those tiny tesserae which are part of the same glyph. The other separation interactions are either confined to a single mouse click or to hovering the mouse over adjacent regions, which are much larger than small tesserae and can therefore be grasped more easily.

6.3 The interaction ratio

The purpose is to determine the effort of correcting glyph separations of handwritten documents after the automatic glyph separation algorithm has processed a single text line. This effort is expressed by means of the ratio of the interactions required and the objects (glyphs) to be considered. It is referred to as the interaction ratio:

$$\begin{aligned} \rho = \frac{\textit{number of interactions}}{\textit{number of objects}} \in \mathbb {R}_0^+ \end{aligned}$$
(5)

The lowest value of zero states that no interactions are required; the higher \(\rho \), the more interactions are necessary. A value below one means that fewer interactions are needed than there are objects to be taken into account, while a value of 1 means that, on average, each glyph needs to be touched once, and \(\rho > 1\) means that it needs to be touched more often.
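For illustration with hypothetical numbers: a text line with 50 glyphs that requires 10 joining and 5 separation operations would yield \(\rho = 15/50 = 0.3\), i. e. on average roughly one correction for every three glyphs.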

6.4 Evaluation of all interaction methods

As expected, far fewer interactions are necessary in document A, which exhibits a more regular writing style than document B (cf. Fig. 7). For document A, the interaction ratio is \(\rho _\mathsf{A} = 0.65\), and for document B, it is \(\rho _\mathsf{B} = 0.76\). The glyphs do not separate as clearly in document B as in document A. Note that there are complex cases in which two or more interactions are necessary to correct a single glyph. The values \(0.65\) and \(0.76\) are just averages.

For document A, 8 % of the glyphs have been divided by recursive region separation, while almost twice as many glyphs, namely 15 %, had to be split into two parts by means of the line separation method. The latter allows the user to indicate more precisely where a region is to be broken up. It requires only a little more effort from the user than the other method: for recursive region separation, the respective region just needs to be selected somewhere, while the line separation method requires the user to draw a line which indicates the path along which the region is to be broken up. For document B, recursive region separation was applied to 13 % of the glyphs and line separation was required for 23 % of the glyphs.

Interestingly, more joining operations are necessary in the case of the more regular handwriting. This is due to the higher ratio of fragments automatically produced by the pre-processing algorithm: over-segmentation is higher for document A. Of the correction interactions in document A, 64 % are joining operations; in document B, the figure is 53 %. For both documents, they dominate the modes of interaction.

Although the text lines in document B are quite close to each other, there were only seven interactions necessary for separating descenders from ascenders of following text lines. For document A, there are only two such occurrences. They have all been resolved by line separation. Table 1 summarises the main results.

Table 1 1,032 and 1,025 glyphs have been analysed for documents A and B, respectively

In Table 1, the interaction ratios are given for the entire documents. In order to analyse how \(\rho \) varies within single documents, this value has also been computed for each text line separately. It is not only assumed that this value is lower for the text lines of document A but also that it is similar among text lines within the same document. The upper part of Table 2 shows the results. The small standard deviations \(\sigma _{\rho }\) in both cases support the assumption that the text lines within the documents are similar in the way glyphs connect. Interestingly, the standard deviation for the more regular document is higher than that for the other document. But this is in accordance with what has been mentioned above about the over-segmentation of document A.

Table 2 Range, mean, and standard deviations for documents A and B after having computed the interaction ratio for all interactions for each text line separately; in the two bottom rows confined to both separation methods

6.5 Evaluation of the separation methods

There are two reasons for re-evaluating the data by only looking at the separation methods. First, they cost more effort than the joining operations in that the user has to decide which separation method she wishes to apply, and because line separation requires some care in deciding where to separate a region, while neither recursive region separation nor the joining of regions leaves any decisions to the user. Second, the over-segmentation of document A requires a large number of joining operations and leads to the wrong impression that document A is more complex than document B, since the total interaction ratio makes no difference between separation and joining operations.

The analysis of the glyph separation problem, however, should mainly take into account the number of actual separation methods to be applied. Focusing on the separation methods, it is expected that document A shows a lower interaction effort than document B.

The lower part of Table 2 shows the results. Clearly, the interaction effort is lower for document A. Its mean is \(0.23\), while the mean for document B is \(0.36\). The difference is more pronounced than in the previous evaluation shown in the upper part of the table. Moreover, the standard deviation is now lower for document A, showing how the effort for the joining operations had distorted the comparison of the interaction ratio between the two documents. In both cases, the upper bound for the number of necessary interactions clearly dropped.

6.6 The IAM Historical Document Database

There are three different handwritings contained in this database. They have been analysed in the same way as documents A and B:

  • C: The manuscript images of the Codex Sangallensis 562, St. Gallen, Stiftsbibliothek, 9th century, Latin writing, page 3 [8].

  • D: The Abbey Library of Saint Gall, Cod. 857, 13th century, German writing, page 6, column a [6].

  • E: The Library of Congress, George Washington Papers, Series 2, Letterbook 1, 18th century, English writing, page 270 [8].

Figure 8 shows samples of those documents. The evaluation indicates that in document D, which is shown on the left hand side of that figure, the separation of glyphs is most difficult. On average, 1.44 operations are necessary in order to correct a glyph of document D, as denoted by the interaction ratio shown in Table 3. The effort for corrections regarding document C is close to the interaction ratios of documents A and B. Document E shows the most recent handwriting of the analysed data. Its interaction ratio lies somewhere in the middle of the interaction ratios of the older handwritings.

Table 3 1,123, 1,064, and 1,015 glyphs have been analysed for documents C, D, and E, respectively
Fig. 8 Excerpts of the documents of the IAM Historical Document Database; from left to right: documents D, C, and E. The changing colours of the glyphs show how they were separated. At the top, there are enlarged cut-outs for each handwriting (colour figure online)

The joining operations are dominant for all three documents, as for documents A and B, followed by line separation for document C. Only for documents D and E has the local region separation algorithm been applied more often than line separation.

6.7 Discussion

In the presented study, all handwritings have been analysed with the same parameter settings. This concerns the removal of noise in the beginning, the binarisation parameters, and the region separation parameters that define the acceptable thickness of glyph transitions and the minimum size of region fragments to be created when getting down to small pieces during recursion. In other words, the results show the performance of the algorithm when it is not optimised for a given handwriting. The trained user, however, can adapt the parameter settings to a specific handwriting, or even, during the processing of a single document, to different parts of that document. This again shows the extent of the anytime anywhere paradigm, which also allows the adjustment of parameters for single regions, depending on the quality of any location on a document image. For old documents, this becomes particularly important due to their degraded quality.

Figure 9 summarises the interaction ratios. It shows that the interaction ratio is quite similar for all documents, with the exception of document D, which costs the most effort. Each stacked bar, as a whole, coincides with the corresponding value of \(\rho \), as each of the three components relates to the average effort per glyph. But note that the numbers contained within the bars are accumulated. In this way, the separation effort alone can be read off that diagram, as motivated in Sect. 6.5. The lowest effort is \(0.23\) separations per glyph for document A, while the highest is \(0.78\) separations per glyph for document D.

Fig. 9 For all five documents, the interaction ratios are accumulated over the three modes of correction. For instance, the stacked bar for document A reads: there are \(0.08\) region separations, \(0.23\) region plus line separations, and \(0.65\) total operations per glyph on average, including joining operations. The heights of the three component bars show the contributions of the three interaction methods

All correction methods are confined to either clicking somewhere into a region (for region separation), hovering the mouse cursor over two or more regions (in order to join regions), or indicating the start and end point of a line (line separation). Therefore, the temporal effort of those interactions is rather low. For all documents, the time required to correct a whole text line has been measured. As the text lines of document D are quite short, two text lines with 25 glyphs each have been taken. In the other cases, the number of glyphs contained in the text lines ranges from 45 to 77. The time required for a single glyph has been determined as the ratio of time and number of glyphs in a text line. For all documents, the temporal effort lies between \(1.70\, \frac{\mathsf{sec}}{\mathsf{glyph}}\) and \(3.32\, \frac{\mathsf{sec}}{\mathsf{glyph}}\).

In order to compare the required time with the interaction effort \(\rho \), both quantities are related in Fig. 10. It turns out that the temporal effort per glyph is quite similar for all documents, even though Fig. 10 suggests that document D is much more difficult to handle (in accordance with Fig. 9). However, the difference to the temporal effort for document A is less than a second. As expected, the higher the interaction effort, the higher the temporal effort, though this is just a rough tendency with small deviations for documents A and B. A clear linear dependence can hardly be expected, given the small temporal differences.

Fig. 10 For all five documents, the interaction ratio is related to the average time spent on correcting glyph separations. The numbers next to the data points of the documents indicate the percentage of correct glyphs which have been found by the recursive region separation algorithm before any corrections were made

The only meaningful conclusion that can be drawn is that, on average, there is an effort of approximately 2 s per glyph. For a text line with 25 glyphs (such as in document D), this amounts to an overall effort of around 1 min for the whole text line, and for long text lines with 77 glyphs (as in document B), the effort lies around two and a half minutes. This assumes the expertise of a trained user.

7 General discussion

When sticking to the image processing pipeline as a one-way road, the user is forced to analyse documents in a strictly artificial way. By contrast, the interactive anytime anywhere document analysis paradigm put forward in this paper allows the user a natural way of dealing with documents, a paradigm probably applicable in other domains as well. A pre-processing step can be locally applied whenever necessary, as can the adjustment of text lines or the separation of glyphs in a word. These steps can be performed for a whole document page at a single stroke or for parts of it separately in an arbitrary order. The palaeographic researcher carefully inspects documents while trying to read the handwriting and while analysing its characteristics. In doing so, she can immediately apply any image processing function in order to neatly extract and separate all glyphs, which are needed for standardised and detailed document statistics.

This approach complements other interaction methodologies which either require a formal model of the documents to be processed or are confined to interactions within single processing steps. The advantage of the anytime anywhere analysis paradigm is that it does not make any assumptions about document structures, contents, and languages. It is applicable to any kind of document image and bears resemblance to commercial picture-editing software, except that it is optimised for and restricted to document image processing. A drawback is that it is less effective than interaction methodologies that deploy models of document structures or languages and, hence, provide more accurate suggestions to the user. Another distinction from other methodologies is that the user can directly apply interaction methods whenever she thinks that automatic methods would not provide useful suggestions, instead of letting the system deal with particularly complex cases which would be difficult to adjust afterwards. In such cases, the roles are reversed: the user edits difficult parts of the document, while the system continues to process that document afterwards.

In particular, a number of interactive methods have been introduced that enable the separation of glyphs for arbitrarily complex handwritings. The results are useful both for palaeographic research and for generating ground truth data sets, for example, for character-based classification evaluations. The interaction ratio allows the quantification of the effort that is necessary to manage different documents. As far as the assumption is valid that a handwriting style is distinguished by how well its glyphs can be separated, the interaction ratio also applies as a measure of similarity. But this needs to be investigated more carefully in future work.

It is doubtful whether letters can always be meaningfully separated into individual glyphs. Some handwritings are highly connected and lack clear boundaries between glyphs. Indeed, some schools of palaeography reject the validity of glyph-based comparisons and look instead at the broader context of whole words or at characteristics like the movements of the pen. But even if the boundaries between glyphs cannot always be determined, there is frequently at least a subset of glyphs with clear boundaries. Such subsets can be used for a glyph-based comparison by means of shape features [9]. Additionally, a glyph-based transcription enables the search for any character string, even if the boundaries between glyphs have only been determined inaccurately [29]. Eventually, having separated only a subset of all glyphs, the transcription of that glyph subset could significantly help the user in transcribing poorly legible handwritings, since the Diptychon system presents all recognised glyphs in the context of all illegible word fragments.

8 Summary

Algorithms for the separation of glyphs in handwritings have been presented, together with a new interaction paradigm that provides a flexible editing tool for historical documents. It includes image enhancement, figure-ground segmentation, text line detection, glyph separation, and transcription methods that can be applied in any way desired, in order to deal with whole documents or parts of them.

An emphasis is put on the glyph separation problem, for which several different methods are made available to the user. The only alternative to those methods is the employment of conventional picture-editing software, which requires the fully manual separation of glyphs. In this sense, the suggested methods provide a significant improvement. The underlying philosophy of the approach derives from the observation that fully automated systems are not always successful and sometimes not even desirable. Instead, a methodology is put forward that seeks to combine the advantages of automatic methods with the skills of human experts.