1 Introduction

Stretching things a little, one could imagine that someone who found a manuscript of the Exercices de style would wonder whether it was a collection of variants, the trace of a hesitation on Queneau’s part between various ways of telling his story.

D. Ferrer, Logique du brouillon, Seuil 2001, p. 133

Textual variation is a central object of study for textual criticism, philology and scholarly editing.

Variation occurs when there are competing readings of a portion of a work. It can take different shapes: it occurs within a single document (strikethroughs, additions, etc.) or between documents (witnesses of the same work). The nature of the variation also varies: the difference between readings might concern formal or substantive features of the text, where, generally and traditionally, the former relate to orthography (spelling, punctuation, etc.) and the latter to all other linguistic categories (morphology, syntax, lexis).

Finding patterns in the moving universe of textual variation is one of the scholar’s goals. A writer might consistently remove references to his private daily life when moving from a note in a diary to a draft of a chapter.Footnote 1 A copyist might rewrite an entire text according to changed orthographic conventions.Footnote 2 Patterns of this kind indicate the direction of changes, tracing precious paths for exploring the work and its mouvanceFootnote 3; they help make sense of a shapeless set of variants and shed light on textual dynamics. In stemmatics, patterns of substantive variants and, in particular, of errors are also used to infer relationships among the witnesses and to draw a stemma that accounts for the textual transmission.

This article introduces a model for annotating textual variants. Querying the resulting annotations allows us to find patterns in textual variation. Instead of looking at a variation site as a single entity, the model attempts to decompose it and to explore its constituent parts: the readings and their relationships. To do so, the model proposes a set of common general categories together with optional specific categories. These categories describe the features of the readings and those of the variation between them.

The model aims to be generic and applicable to a wide range of works. Nevertheless, the specific categories used for annotating the texts might vary greatly, depending on the texts themselves and on the scientific approach.Footnote 4 For example, a relevant category for studying the transmission of a medieval text might be the saut du même au même: it demonstrates a close relation among the witnesses, because it is an error that hardly ever occurs by chance at the same point in unrelated witnesses. When studying modern manuscripts, a relevant category might be that of instant rewriting,Footnote 5 as opposed to later rewriting. Often, the same phenomenon can be covered with different approaches: in the example above of the removal of references to private life in an author’s papers, an ad hoc category could be created to annotate every relevant passage; another approach would be to decompose the phenomenon into smaller ones and use multiple categories, such as the replacement of proper nouns with common ones,Footnote 6 the removal of dates, etc., all of which lead to the removal of private-life references.

Modelling, in this article, refers to the “heuristic process of constructing and manipulating models” (McCarty 2004),Footnote 7 and, in particular, to data models. A data model is a formalization of the understanding and interpretation of an object, which should be consistent, coherent and explicit; these characteristics make it possible to move from a conceptual model to a logical model, that is, a computable object to be implemented in one or more physical models (Flanders and Jannidis, 2015: 11; Flanders and Jannidis, 2016).Footnote 8 The conceptual model is introduced here using an entity-relationship diagram, while the logical view is presented in two schemas (relational tables and an OWL ontology). A number of case studies in which the conceptual model is implemented are also presented.

2 Conceptual model

The model covers textual variants, that is, competing readings, and does not take into account the rest of the text. This means that it does not allow the entire text of each witness or stage to be reconstructed; rather, it only represents what is traditionally gathered in the critical apparatus.Footnote 9

A reading is the atomic unit of the model. A reading is a string of characters in plain text, with no typographical, structural or semantic markup; it is composed of one or more letters, or of one or more words. The scholar is at liberty to choose the boundaries of each reading, following strategies that might differ from case to case, even within the same text. Because the model does not represent the rest of the text, a reading might include some non-variant words, in order to better contextualize the variant reading. This is what happens in a traditional critical apparatus, where the non-variant words are often abbreviated, while the variant words are spelled out in full, as in the following example:

Critical text: Il se vantoit de folie

Apparatus: Il se vantoit] A, qui se v.Footnote 10

The model describes two main aspects of the elements involved in the variation: the features of each single reading and those of the variation between them [Illustration 1]. This distinction is a fundamental characteristic of the model.

In blue, the space for the features of each reading; in red, the space for the features of the variation between them

2.1 Features of the reading

For each single reading, two general features must be set: the witness to which the reading belongs, and the location of the reading in the witnessFootnote 11; optionally, the location of the reading in the work might be added [Illustration 2].

General features of each reading

Each single reading can also be annotated using customized categories, which might vary greatly. A relevant feature recorded in such a category might be the writing tool associated with the reading, mostly in the case of modern manuscripts. Another category can be set up to record erroneous readings: for instance, a reading that causes a metrical violation because it is too short or too long, or one that erroneously repeats a word that had remained in the scribe’s memory. These ad hoc categories are added to the general ones [Illustration 3].

Example of general and specific features of each reading

2.2 Features of the variation

The features of the variation express what kind of difference exists between the competing readings. Two categories are used to record the general features of the variation: the category of change and, in the case of substitution, the linguistic aspect involved [Illustration 4].

General features of the variation

The categories of change are addition, deletion, substitution and transposition. These four classes, referred to as the quadripartita ratio (adiectio, detractio, immutatio, transmutatio), were defined as the categories of mutation by Stoic philosophers and have been used by classical and modern rhetoricians. They correspond to the operations used in computer science for calculating the difference between two strings, known as edit distance,Footnote 12 and have been used in textual criticism for classifying variants (Stussi 2011: 182). A substitution includes everything that is not solely an addition, a deletion or a transposition: it might contain them, but it is not limited to them.

The linguistic category defines which aspect of the language is involved in the variation: orthography, morphology, syntax, lexis.

An example of the use of such general categories is the following: ‘I still had one bad leg’ vs ‘I had still one bad leg’ (O'Reilly et al. 2016),Footnote 13 which can be annotated as a transposition (category of change). Another case might be: ‘Et lors parla mestre Helie di Tolose’ vs ‘Et lors parla maistre Helie di Tolose’ (Micha 1978-1983, IV), where ‘mestre’ vs ‘maistre’ is a substitution (category of change) concerning orthography (linguistic category).
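The two examples above can be reproduced with a rough word-level heuristic. The sketch below is an illustration only, not part of the model: the function names `category_of_change` and `is_subsequence` are hypothetical, the comparison works on whitespace-separated words, and it ignores nested variants and the linguistic category.

```python
from collections import Counter


def is_subsequence(short, long):
    """True if every token of `short` occurs in `long`, in the same order."""
    remaining = iter(long)
    return all(token in remaining for token in short)


def category_of_change(reading_a, reading_b):
    """Heuristic word-level classification of the change from A to B."""
    a, b = reading_a.split(), reading_b.split()
    if a == b:
        return "none"
    if Counter(a) == Counter(b):
        return "transposition"   # same words, different order
    if is_subsequence(a, b):
        return "addition"        # B keeps all of A and adds words
    if is_subsequence(b, a):
        return "deletion"        # B drops words from A
    return "substitution"        # everything that is not only one of the above


print(category_of_change("I still had one bad leg", "I had still one bad leg"))
print(category_of_change("Et lors parla mestre Helie di Tolose",
                         "Et lors parla maistre Helie di Tolose"))
```

The first call yields `transposition` and the second `substitution`, matching the annotations given above. Note that addition and deletion depend on the direction in which the two readings are compared, which is why they are often recorded together as addition/deletion when no direction is set.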

Specific categories can also be used to describe precise features of the variation. A relevant one might be the direction of the relation, that is from reading A to reading B, or the contrary. A specific category can be used, for instance, to record the type of intervention occurring: in the case of a substitution, reading A might be crossed out and reading B written above, below, after, etc. (Italia and Raboni 2010, 64).

These specific categories for describing the variation between the readings are to be added to the general ones [Illustration 5].

Example of general and specific features of the variation

The features of the readings coexist with the features of the variation [Illustration 6].

Example of features of the readings and of the variation
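The distinction between features of the reading and features of the variation can be sketched as two record types. This is a minimal sketch under stated assumptions: the class and field names (`Reading`, `Variation`, `aspects`, `specific`) are hypothetical, the location value is a placeholder, and the rule that linguistic aspects are recorded only for substitutions follows the description in section 2.2.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

CHANGES = {"addition", "deletion", "substitution", "transposition"}
ASPECTS = {"orthography", "morphology", "syntax", "lexis"}


@dataclass
class Reading:
    text: str                                # plain text, no markup
    witness: str                             # general feature
    location_in_witness: str                 # general feature
    location_in_work: Optional[str] = None   # optional general feature
    specific: Dict[str, str] = field(default_factory=dict)  # ad hoc categories


@dataclass
class Variation:
    reading_a: Reading
    reading_b: Reading
    change: str                              # category of change
    aspects: Tuple[str, ...] = ()            # linguistic aspect(s) involved
    specific: Dict[str, str] = field(default_factory=dict)  # ad hoc categories

    def __post_init__(self):
        if self.change not in CHANGES:
            raise ValueError(f"unknown category of change: {self.change}")
        if self.aspects and self.change != "substitution":
            raise ValueError("linguistic aspects are recorded for substitutions only")
        unknown = set(self.aspects) - ASPECTS
        if unknown:
            raise ValueError(f"unknown linguistic aspect(s): {sorted(unknown)}")


# Hypothetical usage; the location is a placeholder, not a real reference.
a = Reading("Et lors parla mestre Helie di Tolose", "A", "placeholder")
b = Reading("Et lors parla maistre Helie di Tolose", "B", "placeholder")
a_vs_b = Variation(a, b, "substitution", ("orthography",))
```

Keeping the ad hoc categories in a free-form dictionary on each record is one possible design choice; it lets a project add categories such as writing tool or intervention without changing the general fields.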

2.3 Variation site: Pairs of readings

When a variation site involves more than two readings, a number of phenomena take place at once, and describing them might require complex annotations. This is particularly relevant when no direction of change has been set in advance, that is, when the relations between the readings are not known. In most cases of medieval textual transmission, for instance, the scholar might at first want to compare all the readings without setting, more or less arbitrarily, a base text (Spadini 2017).

A simple example of variation site involving four readings is the followingFootnote 14:

  • BnF fr. 1466 (A): totes bontez pardue

  • BnF fr. 1430 (B): totes hennors pardues

  • BnF fr. 118 (C): toutez honneurs perdues et toutes ioyes

  • BnF fr. 751 (D): totes honors perdus et totes lois.

As said above, the boundaries of each reading can be decided freely. In this case, the texts might be divided in various ways: for example, aligning word by word, considering the entire sentence at once, or separating the sentence in two at the conjunction “et”. The latter scenario gives:

  • (1)

      • A: totes bontez pardue

      • B: totes hennors pardues

      • C: toutez honneurs perdues

      • D: totes honors perdus

  • (2)

      • A: /

      • B: /

      • C: et toutes ioyes

      • D: et totes lois

In (1), “bontez” (A) is different from “hennors” (and its orthographic variants, BCD).

In (2), A and B are null, while C and D have readings that are close at the paleographical level but far apart in meaning (“ioyes” vs “lois”).

Using the model (only the general features of the variation, that is category of change and linguistic aspect), they can be described as follows:

  • (1) A vs BCD: substitution (lexis, orthography); B vs C vs D: substitution (orthography).

  • (2) AB vs CD: addition/deletion; C vs D: substitution (lexis, orthography).

Given that the combinations of readings may change for each variation site (A vs BCD, B vs C vs D, AB vs CD, C vs D), the most consistent way to describe the variation is to examine the witnesses in pairs,Footnote 15 which produces:

  • (1) A vs B: substitution (lexis, orthography); A vs C: substitution (lexis, orthography); A vs D: substitution (lexis, orthography); B vs C: substitution (orthography); B vs D: substitution (orthography); C vs D: substitution (orthography).

  • (2) A vs C: addition/deletion; A vs D: addition/deletion; B vs C: addition/deletion; B vs D: addition/deletion; C vs D: substitution (lexis, orthography).

From this complete description, it is possible to obtain other, less redundant, ones, combining the readings as above.
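The pairwise description of (1) and its regrouping can be sketched as follows. The annotations are those given above and are supplied by hand; the helper `grouped_view` is a hypothetical name, and grouping by identical annotation is only one possible way of combining the readings.

```python
from collections import defaultdict
from itertools import combinations

witnesses = ["A", "B", "C", "D"]

LEX_ORTH = ("substitution", ("lexis", "orthography"))
ORTH = ("substitution", ("orthography",))

# Pairwise annotations for variation site (1), as listed in the text.
pairwise = {
    ("A", "B"): LEX_ORTH, ("A", "C"): LEX_ORTH, ("A", "D"): LEX_ORTH,
    ("B", "C"): ORTH, ("B", "D"): ORTH, ("C", "D"): ORTH,
}

# Every unordered pair of witnesses is covered exactly once.
assert set(pairwise) == set(combinations(witnesses, 2))


def grouped_view(witness, annotations):
    """Group the other witnesses by the annotation they share with `witness`."""
    groups = defaultdict(list)
    for (x, y), annotation in annotations.items():
        if witness in (x, y):
            groups[annotation].append(y if x == witness else x)
    return {annotation: "".join(sorted(others))
            for annotation, others in groups.items()}


print(grouped_view("A", pairwise))
```

For witness A, the three pairs collapse into a single group, BCD, which corresponds to the less redundant description "A vs BCD". With four witnesses there are six pairs; the number of pairs grows quadratically with the number of witnesses.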

In principle, the model could accept more than two readings for each variation, and use the same features of the variation to describe the differences between all of them. One of the main characteristics of the model, however, is that it breaks up the variation into its constituent parts, in order to achieve maximum expressiveness.Footnote 16

This description only covers the features of the variations between the readings. Each reading per se can also be annotated with specific categories; here an appropriate category would be ‘error’, since “pardue” (A) is erroneous because singular and “perdus” (D) is erroneous because masculine.

All the selected features of the variation site can be represented together [Illustration 7].

A variation site with multiple readings

2.4 Boundaries of the readings, nested variants and concatenation

Setting the correct reading boundaries is not the only way to manage the extent of the variation. A variation site might also be contained in another variation site. This is the case, in particular, for variations of smaller size (in terms of the number of characters involved) inside a variation, here called nested variants, and for recording the evolution of a reading in a variation site, here called concatenated variants. It is important to remember that a sub-reading inherits the features of the reading it is part of.

An example of the first type (a variation of smaller size inside a variation) is A “La luna o la Ricordanza” vs B “La Ricordanza” [Italia and Raboni 2010, 68–71]. A vs B might be described as an addition/deletion; inside it, there is an orthographic substitution, opposing “la” to “La” [Illustration 8]. In this case, the two sub-readings are parts of two different readings.

Example of nested variants

In the second case (recording the evolution of a reading), a sub-reading is involved in another variation site, tracking previous alternatives. An example from the same poem is at v. 8: A “il tuo viso apparia, perché dolente” → B “al mio sguardo apparia, perché dolente” → C “il tuo volto apparia; chè travagliosa”. A part of reading C is the result of the change from Ca “, che” to Cb “; chè”: Ca is thus a sub-reading of one reading only, that is C, and it is involved in a variation site with Cb [Illustration 9].

Example of concatenated variants
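The inheritance of features by sub-readings can be sketched with a parent reference. This is an illustration under assumptions: the class `Reading`, the `parent` field and the method `all_features` are hypothetical names, and the example uses the nested variant from Leopardi quoted above, with the labels A and B recorded under a `witness` feature.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Reading:
    text: str
    features: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Reading"] = None   # set only for sub-readings

    def all_features(self):
        """Own features merged over those inherited from the parent reading."""
        inherited = self.parent.all_features() if self.parent else {}
        return {**inherited, **self.features}


a = Reading("La luna o la Ricordanza", {"witness": "A"})
b = Reading("La Ricordanza", {"witness": "B"})

# Nested variant: the two sub-readings belong to two different readings.
a_sub = Reading("la", parent=a)
b_sub = Reading("La", parent=b)

outer_variation = (a, b, "addition/deletion")
nested_variation = (a_sub, b_sub, "substitution", "orthography")

print(a_sub.all_features())
```

A concatenated variant would use the same mechanism, with the sub-readings attached to a single parent reading instead of two.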

2.5 Model outline

The model outlined here allows:

  • to distinguish between the features of the reading and those of the variation between the readings;

  • to append more than one feature to each reading and variation;

  • not to set a base witness to orient the variation;

  • to annotate each pair of witnesses or a combination of them for each variation site;

  • to nest and concatenate variation sites.

3 Case studies

The model has been used in the web-application La Commedia di Boccaccio (Spadini and Tempestini 2018). Here, other case studies in the form of graphics are presented to test its applicability [Illustrations 10, 11, 12].

Case study 1 [Illustrations 10, 11, 12]

In the first three examples, specific categories are employed, in addition to the general categories, to annotate common types of morphological variation. The text in the examples is that of an Old French pastourelle, “Par un matinet l’autrier” (Rivière 1974, III, n° LXXVI)Footnote 17; distinguishing the types of morphological variation is relevant here because certain types recur often, e.g. the alternation between present and past tense, while others are rare. Note that the combination of witnesses changes for each variation site.

A more complex example [Illustration 13], where three alternative readings are involved, is taken from Giacomo Leopardi’s La ricordanza, mentioned above. Its manuscripts are conserved at the National Library in Naples,Footnote 18 and an edition of the poem is provided by Italia (2010:68-71).

Case study 2

In the methodological chapter of the same volume [ibid: 64], Italia introduces a list of types of interventions occurring in a draft. The list includes: corretto in (reading A is corrected into reading B), soprascritto (reading B is overwritten on reading A, which is crossed out in the line), sottoscritto (reading B is underwritten to reading A, which is crossed out in the line), inserito (reading B is inserted), prima (reading B is preceded by reading A, crossed out in the line), dopo (reading B is followed by reading A, crossed out in the line and then abandoned). In the model, it is possible to create a specific category of variation to record this information, here called intervention; in the example [Illustration 13], values for this category are ‘overwritten’ (as in soprascritto) and ‘corrected in’ (as in corretto in). Furthermore, the relation between the readings has a direction, expressed with an arrow replacing the line. The readings also have a specific category, indicating the writing tool in use for each of them. A comment is attached to the third reading.

4 Logical model

The model can be implemented in different data structures: an OWL ontology and a relational database schema will be presented in this section.Footnote 19

A comparable XML/TEI solution will not be pursued here. This is because overlapping annotations are constitutive of the model (e.g., the relation between A vs B and B vs C); an XML solution would therefore be possible, but would require some workarounds. Nevertheless, a TEI-compliant result can be achieved using the Feature Structures module or stand-off mechanisms.

4.1 Relational tables

A schema for a relational database, only covering the general features of the reading and of the variation, is presented below [Illustration 14]. Specific categories can be added by means of new tables, connected to the Variation table.

Relational DB schema representing the model
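Since the schema itself is given only as an illustration, the following is a hedged sketch of what such a relational implementation could look like, using Python's standard `sqlite3` module. Apart from the Variation table named in the text, all table and column names are assumptions, the location values are placeholders, and the `intervention` table stands for one possible specific category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reading (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,
    witness TEXT NOT NULL,
    location_in_witness TEXT NOT NULL,
    location_in_work TEXT
);
CREATE TABLE variation (
    id INTEGER PRIMARY KEY,
    reading_a INTEGER NOT NULL REFERENCES reading(id),
    reading_b INTEGER NOT NULL REFERENCES reading(id),
    category_of_change TEXT NOT NULL CHECK (category_of_change IN
        ('addition', 'deletion', 'substitution', 'transposition'))
);
-- One row per linguistic aspect, so a variation can involve several.
CREATE TABLE linguistic_aspect (
    variation_id INTEGER NOT NULL REFERENCES variation(id),
    aspect TEXT NOT NULL CHECK (aspect IN
        ('orthography', 'morphology', 'syntax', 'lexis'))
);
-- A specific category, added as a new table connected to variation.
CREATE TABLE intervention (
    variation_id INTEGER NOT NULL REFERENCES variation(id),
    value TEXT NOT NULL
);
""")

conn.executemany(
    "INSERT INTO reading (id, text, witness, location_in_witness) VALUES (?, ?, ?, ?)",
    [(1, "totes bontez pardue", "A", "placeholder"),
     (2, "totes hennors pardues", "B", "placeholder"),
     (3, "toutez honneurs perdues", "C", "placeholder")],
)
conn.executemany(
    "INSERT INTO variation (id, reading_a, reading_b, category_of_change) VALUES (?, ?, ?, ?)",
    [(1, 1, 2, "substitution"), (2, 1, 3, "substitution"), (3, 2, 3, "substitution")],
)
conn.executemany(
    "INSERT INTO linguistic_aspect (variation_id, aspect) VALUES (?, ?)",
    [(1, "lexis"), (1, "orthography"), (2, "lexis"), (2, "orthography"), (3, "orthography")],
)

# Which pairs of witnesses differ by a substitution that involves lexis?
rows = conn.execute("""
    SELECT ra.witness, rb.witness
    FROM variation v
    JOIN reading ra ON ra.id = v.reading_a
    JOIN reading rb ON rb.id = v.reading_b
    JOIN linguistic_aspect la ON la.variation_id = v.id
    WHERE v.category_of_change = 'substitution' AND la.aspect = 'lexis'
    ORDER BY ra.witness, rb.witness
""").fetchall()
print(rows)
```

With the sample data, taken from the variation site discussed in section 2.3, the query returns the pairs A/B and A/C, that is, the pairs in which “bontez” is opposed to “hennors” and its orthographic variants.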

4.2 OWL ontology

The model can be implemented in the following OWL 2 ontology, formulated in Turtle syntaxFootnote 20 and visualized belowFootnote 21 [Illustration 15]. Here too, only the general, and not the specific, features of the reading and of the variation are represented.

Visualization of the OWL 2 ontology representing the model

The choice of an OWL ontology is dictated by the fact that it is a standard data-model, part of the architectural formalisms of the Semantic Web.Footnote 22 Note, however, that using a labeled property graph, such as Neo4j, the Variation class would not be needed because the information it carries could be stored as properties of the edge between the Readings.
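The difference between the two graph formalisms can be illustrated with plain Python data. This is only a sketch: the identifiers and property names (`var1`, `hasReading`, `categoryOfChange`, `linguisticAspect`) are hypothetical and are not taken from the ontology, and no triple store or graph database is involved.

```python
# Reified form: the variation is a node of its own, as with a Variation class.
triples = [
    ("var1", "type", "Variation"),
    ("var1", "hasReading", "readingB"),
    ("var1", "hasReading", "readingC"),
    ("var1", "categoryOfChange", "substitution"),
    ("var1", "linguisticAspect", "orthography"),
]


def to_property_edge(triples, variation_id):
    """Fold a reified variation node into a single edge with properties."""
    endpoints = sorted(o for s, p, o in triples
                       if s == variation_id and p == "hasReading")
    properties = {p: o for s, p, o in triples
                  if s == variation_id and p not in ("type", "hasReading")}
    return {"between": tuple(endpoints), "properties": properties}


edge = to_property_edge(triples, "var1")
print(edge)
```

The resulting edge carries the same information as the five triples. The sketch keeps a single value per property, so a variation involving several linguistic aspects would need a list-valued property in the edge form.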

5 Conclusions

This article presents a model for annotating textual variants. Once the annotations are made and conveniently stored, they can be queried in order to find patterns and analyse the mouvance of the work. Possible queries depend on the categories of reading and variation in use. The distinction between features of the readings and features of the variations is fundamental to the organization of the categories. In addition to the general categories (addition, deletion, substitution, transposition; orthography, morphology, syntax, lexis), the annotations might cover, for example, verbal tenses, paleographical variations, errors of different types (coniunctivus, separativus), dialectal forms and synonyms, over selected sections of the work and selected witnesses or stages. Specific queries can be performed in order to isolate the phenomena covered by the annotations, either to study them or to remove the noise they introduce: all the changes of verbal tense in section A, all the deletions between witness/stage A and witness/stage B, all the instances of instant rewriting, etc. The model is flexible, inasmuch as it leaves the scholar free to choose the categories and set the boundaries of the readings; the length of the readings, in particular, might vary within the annotations of the same text.

Adopting the model is cumbersome work. On the other hand, it provides detailed and organized information, which is fundamental for certain projects of scholarly editing. Asking precise questions of a machine often requires this kind of thorough work: ultimately, we can only ask about what we have previously given it.Footnote 23 Annotating variations following the model could benefit from a dedicated GUI. In addition, some of the categories might be identified automatically.Footnote 24

The implementation in different data structures shows that the relational DB schema and the OWL ontology have the same expressiveness, namely in articulating relationships. XML, on the contrary, is less suitable for conveying the information gathered using the model, even if XML solutions can still be implemented. This conclusion should be evaluated taking into account that the model covers one textual phenomenon, that of variation; even if, in the model, this phenomenon is detached from the rest of the text, it should be possible to expand the model in order to include the contexts, or, better, the co-texts. Now, in digital scholarly editing the de facto standard data structure for text is XML. This is of course related to the adoption of the TEI Guidelines, but also, more generally, to the fact that digital scholarly editing often results in digital publishing, and the language of the web is XML, in the form of HTML. Comparing relational databases and graphs with XML, we note that it is less intuitive to retrieve a stream from the former (a fundamental quality for working with texts), while the latter lack tools for handling entire texts to be published digitally. In short, both are commonly used for data that are much more structured and fragmented than texts.

Ongoing experiments, however, show that there is an interest in the digital scholarly editing community in exploring solutions other than the tree formalism of XML. In particular, the graph structure is emerging as a conceptual model to be implemented in different ways.Footnote 25 The adoption of graphs raises a number of technical and theoretical challenges. Among the technical ones, there might be the need to integrate the information stored in graphs within the XML (or HTML) representation of the text: the discussion on the TEI List about the integration of RDF annotations in a TEI document shows that the question remains openFootnote 26; stand-off solutions can come into play here, for overcoming the limitations of XML and for bridging the gap with other data structures. Among the theoretical challenges, on the other hand, there is the possibility of calling into question the way texts are employed and consumed, which is not unrelated to the way they are visualized. This means, for instance, that scholarly editing can produce various outputs: diplomatic or critical texts, but also SVG objects and, more generally, graphics and dynamic visualizations resulting from analysis, which might represent some of the features of the texts better than typographical devices reproduced in HTML (Andrews and van Zundert 2016; Cummings et al. 2017). The terms visualization and analysis are a reminder that what is represented is data, and not only words or sentences. In this scenario, it is easier to take advantage of data structures such as graphs or relational tables.

The exercise in modelling presented in this article is intended as a minor contribution to the broad discussion briefly addressed above, but primarily as a way to explore how computational methods may contribute to the old issue of handling textual variation. Applying it to other case studies will test its usefulness and versatility.