Keywords

1 Introduction

Determining the time frame of a book or a manuscript and identifying an author are important and challenging problems. Time-related key-words and key-phrases with references can be used to date and identify authors. Key-phrases and key-words have great potential to provide great information in many domains, such as academic, legal and commercial. Thus, the automatic extraction and analysis of key-phrases and key-words is growing rapidly and gaining momentum. Web search engines, machine learning, etc. are based on key-phrases and key-words. As a result, features are extracted and learned automatically; thus, the analysis of key-phrases and key-words has enormous importance. Key-phrases and key-words are essential features not only of the needs of scientific papers or for industry and commerce but also of rabbinic responsa (answers written in response to Jewish legal questions authored by rabbinic scholars). Key-phrases and words included in rabbinic text are more complex to define and to extract than key-phrases and key-words in academic papers written in English because: (1) Natural Language Processing (NLP) in Hebrew, Aramaic and Yiddish has been studied relatively little; (2) there is a mixture of the complex morphology of Hebrew, Aramaic and Yiddish. For example, key-phrases and key-words can be presented with different types of prefixes (e.g., “and …”, “when …”, “and when …”, “in …”, “and when in …”); (3) key-phrases and words in Hebrew, Aramaic and Yiddish texts may be ambiguous. Responsa texts present various interesting problems: (1) the morphology in Hebrew is richer than in English. Hebrew has 70,000,000 valid forms, while English has only 1,000,000 [1]. Declensions in Hebrew can be up to 7000 for one stem, while in English, there are only a few declensions; (2) responsa documents have a high rate of abbreviations (nearly 20 %) [2].

This research estimates the date of undated documents of authors using (1) the year(s) mentioned in the text, (2) “late” (“of blessed memory”) key-phrases, (3) “rabbi” key-phrases, (4) “friend” key-phrases that are mentioned in the texts and (5) undated references of other dated authors that refer to the considered author or are mentioned by him. The assessments are with different degrees of certainty: “iron-clad”, heuristic and greedy. The rules are based on key-phrases with and without references.

This paper is organized as follows: Sect. 2 gives background concerning the extraction and analysis of key-phrases and citation. Section 3 presents the boosting extraction key-phrases algorithm. Section 4 presents various rules of some degrees of certainty: “iron-clad”, heuristic and greedy rules, which are used to assess writers’ birth and death years. Section 5 presents the model description. Section 6 familiarizes the dataset, experiments, results and analysis. Section 7 includes the summary, conclusions and future works.

2 Related Research

Following the explosion of electronic information, there has been a growing need for extracting key-phrases and words automatically. Many studies have been made in this area for different purposes and from different perspectives. Key-words in documents allow for quick search on multiple large databases [3]. Key-words can also help to improve the NLP performance, as well as Information Retrieval performance in issues such as text summarization [4], text categorization [5], topic change during conversational text [6] and opinion mining [7].

Although key-words are important in many computer programs, there is still much to be done in this area, and the state-of-the-art methods underperform compared to other NLP core tasks [8].

There are several difficulties in extracting key-phrases and key-words. One is the length of the documents. In scientific articles, although there can only be approximately 10 key-words or key-phrases and approximately another 30 candidates in the abstract section, the rest of the article may contain hundreds of candidate key-phrases or key-words [9]. Moreover, key-words can also appear at the end of an article. If key-phrase or key-word appears at the beginning and at the end of an article, it indicates the importance of that key-phrase or key-word [10].

When documents are structured, key-words extraction is easier. For example, in scientific papers, most of the key-words appear in the abstract, in the introduction and in the title [11]. In other cases, key-phrases can be automatically extracted from web page text and from its metadata [12] for the purpose of advertisement.

Automatic extraction and analysis of references from academic papers was first proposed by Garfield [13]. Berkowitz and Elkhadiri [14] extracted writers’ names and titles from articles. A knowledge-based system was used by Giuffrida et al. [15] to derive metadata, including writers’ names, from computer science journal articles. Hidden Markov Models were used by Seymore et al. [16] to extract writer names from a limited collection of computer science articles. The use of terms leads to progress in the extraction of information. Selecting text before and after references to extract good index terms to improve retrieval effectiveness was done by Ritchie et al. [17]. Bradshaw [18] used terms from a fixed window around references.

In contrast with scientific articles, the documents we are working on are from the Responsa ProjectFootnote 1; they are without any structural base, usually contain a mixture of at least two languages, and contain noise. Previous research on the Responsa Project dealt with text classification [19]. They checked whether classification could be done over the long axis of ethnic groups of authors with stylistic feature sets. HaCohen-Kerner and Mughaz [20] and Mughaz et al. [21] investigated in which era rabbis lived using undated Responsa, but they did not address the problem of how to extract time-related key-words or key-phrases. This article is a continuation research of this issue, i.e., determining when writers lived using key-phrases.

3 Semi-automatic Boosting Extraction of Key-Phrases

We want to automate the extraction of time-related key-phrases. We found that most of the sentences that contain time-related concepts (i.e., time-related words and phrases) to rabbinic literature (e.g., “late”, “friend”) are usually nearby rabbinic names/nicknames/acronyms/abbreviations/book-names. We developed a semi-automatic algorithm that boosts concepts extraction for extracting time-related concepts. The main idea is to extract sentences that contain names of rabbis so that the words and phrases that are nearby the rabbinic names are treated as the key-phrases (among others) that we look for. Now, we present a general description of our extraction algorithm.

3.1 The Algorithm

  • Notations:

  • TW – vector of Time-related Words.

  • RN – vector of Rabbinic Names.

  • n – number of iterations of the algorithm.

  • TRC – set of Time-Related Concepts starting with, e.g., year, life, colleague, era.

  • TW ← TRC //initiate TW vector with TRC set

  • For i = 1 to n, do:

    • Search for sentences that contain the last concepts added to TW.

    • Extract new rabbinic names from those sentences.

    • Add the new rabbinic names to RN.

    • Search for sentences that contain the last rabbinic names added to RN.

    • Extract time-related concepts from the new sentences:

      • Delete stop words.

      • Add the new time-related concepts to TW.

      • Add the new time-related words and phrases to TRC (with their frequency) and for the “old” time-related words and phrases only add their frequency.

        Sort TRC by the frequency of time-related words and phrases in decreasing order (normally, concepts have a large number of appearances).

    • Select from the table the most frequent time-related concepts.

3.2 Algorithm Results

After using the algorithm, we extract time-related key-words, key-phrases and acronyms (a partial list is shown in Table 1). We divide them into three Hebrew and Aramaic key-words and key-phrases sets:

Table 1. Hebrew and Aramaic cue words – a partial list

Rabbi – addressing another person as a rabbi/master, i.e., there is overlap between the lifetime of one author and the lifetime of another who is referred to by the first author as rabbi.

Friend – addressing another person as a friend, i.e., there is a large overlap between the lifetime of one author and the lifetime of another who is referred to by the first author as a friend.

Late – addressing a person who has already died.

Table 1 presents a partial list of Hebrew and Aramaic key-words and key-phrases and a few acronyms in Hebrew and their translation into English.

4 Rule-Based Constraints

This section presents the rules, based on key-phrases and references, formulated for the estimation of the birth and death years of an author X (the extracted results point to specific years) based on his texts and the texts of other writers (Yi) who mention X or one of his texts. We assume that the birth years and death years of all writers are known, excluding those that are under interrogation. Now, we will give some notions and constants that are used: X – The writer under consideration, Yi – Other writers, B – Birth year, D – Death year, MIN – Minimal age (at present, 30 years) of a rabbinic writer when he starts to write his response, MAX – Maximal age (at present, 100 years) of a rabbinic author, and RABBI_DIS – The gap age between rabbi and his student (at present, 20 years). The estimations of MIN, MAX, and RABBI_DIS constants are heuristic, although they are realistic on the basis of typical responsa authors’ lifestyles.

Different types of references exist: general references with and without key-phrases, such as “rabbi”, “friend” and “late”. There are two types of references: those referring to living authors and those referring to dead authors. In contrast to academic papers, responsa include many more references to dead authors than to living authors.

We will introduce rules based on key-phrases and references of different degrees of certainty: “iron-clad” (I), heuristic (H) and greedy (G). “Iron-clad” rules are always true, without any exception. Heuristic rules are almost always true. Exceptions can occur because the heuristic estimates for MIN, MAX and RABBI_DIS are incorrect. Greedy rules are rather reasonable rules for responsa authors. However, wrong estimates can sometimes be drawn while using these rules. Each rule will be numbered and its degree of certainty (i.e., I, H, G) will be presented in brackets.

4.1 “Iron-Clad” and Heuristic Rules with Key-Phrases

First, we present two general heuristic rules, which are based on regular references (i.e., without any key-phrase), based on authors that cite X.

General rule based on authors that were mentioned by X

$$ {\text{D}}\left( {\text{X}} \right) > = {\text{ MAX}}\left( {{\text{B}}\left( {\text{Yi}} \right)} \right) + {\text{MIN }}\left( { 1 { }\left( {\text{H}} \right)} \right) $$

X must have been alive when he referred to Yi, so we can use the earliest possible age of publishing of the latest born author Yi as a lower estimate for X’s death year.

General rule based on authors that referred to X

$$ {\text{B}}\left( {\text{X}} \right) < = {\text{ MIN}}\left( {{\text{D}}\left( {\text{Yi}} \right)} \right) - {\text{MIN }}\left( { 2 { }\left( {\text{H}} \right)} \right) $$

All Yi must have been alive when they referred to X, and X must have been old enough to publish. Hence, we can use the earliest death year amongst such authors Yi as an upper estimate of X’s earliest possible publication age (and thus his birth year).

General rules based on year mentioning Y that appeared in X’s documents

$$ {\text{D}}\left( {\text{X}} \right) > = {\text{ MAX}}\left( {\text{Y}} \right) \, \left( { 3 { }\left( {\text{I}} \right)} \right) $$

X must have been alive when he mentioned the year Y. We can use the most recent year mentioned by X to evaluate the death year of X as an estimation of X’s death year.

Posthumous key-phrase rules. Posthumous rules estimate the birth and death years of an author X based on references of authors who refer to X with the key-phrase “late” (“of blessed memory”) or on references of X that mention other authors with the key-phrase “late”. Figure 1 describes possible situations where various types of authors Yi (i = 1, 2, 3) refer to X with the key-phrase “late”. The lines depict writers’ life spans; the left edges represent the birth years and the right edges represent death years. In this case (as all Yi refer to X with the key-phrase “late”), we know that all Yi passed away after X, but we do not know when they were born in relation to X’s birth. Y1 was born before X’s birth; Y2 was born after X’s birth but before X’s death; and Y3 was born after X’s death.

Fig. 1.
figure 1

References mentioning X with the key-phrase “late”.

$$ {\text{D}}\left( {\text{X}} \right) < = {\text{ MIN}}\left( {{\text{D}}\left( {\text{Yi}} \right)} \right) \, \left( { 4 { }\left( {\text{I}} \right)} \right) $$

However, we know that X must have been dead when Yi referred to him with the key-phrase “late”; thus, we can use the earliest born Y’s death year as an upper estimate for X’s death year. Like all writers, dead writers of course have to comply with rule (2) as well.

Now, we look at the cases where the author X that we are studying refers to other authors Yi with the key-phrase “late”. Figure 2 describes possible situations where X refers to various types of authors Yi (i = 1, 2, 3) with the key-phrase “late”. All Yi passed away before X’s death (or X may still be alive). Y1 died before X’s birth; Y2 was born before X’s birth and died when X was still alive; Y3 was born after X’s birth and passed away when X was still alive.

Fig. 2.
figure 2

References by X who mentions others with the key-phrase “late”.

$$ {\text{D}}\left( {\text{X}} \right) > = {\text{MAX}}\left( {{\text{D}}\left( {\text{Yi}} \right)} \right) \, \left( { 5 { }\left( {\text{I}} \right)} \right) $$

X must have been alive after the death of all Yi who were referred by him with the key-phrase “late”. Therefore, we can use the death year of the latest-born Y as a lower estimate for X’s death year.

$$ {\text{B}}\left( {\text{X}} \right) > = {\text{MAX}}\left( {{\text{D}}\left( {\text{Yi}} \right)} \right) - {\text{MAX }}\left( { 6 { }\left( {\text{H}} \right)} \right) $$

X was probably born after the death year of the latest-dying person that X wrote about. Thus, we use the death year of the latest-born Y minus his max life-period as a lower estimate for X’s birth year.

Contemporary key-phrases rules. Contemporary key-phrases rules calculate the upper and lower bounds of the birth year of a writer X based only on the references of known writers who refer to X as their friend/rabbi. This means there must have been at least some period of time when both were alive together. Figure 3 shows possible situations where various types of writers Yi refer to X as their friend/rabbi. Y1 was born before X’s birth and died before X’s death; Y2 was born before X’s birth and died after X’s death; Y3 was born after X’s birth and passed away before X’s death; Y4 was born after X’s birth and passed away after X’s death. Like all writers, contemporary authors of course have to comply with rules 1 and 2 as well.

Fig. 3.
figure 3

References by authors who refer to X as their Friend/Rabbi.

$$ {\text{B}}\left( {\text{X}} \right) > = {\text{ MIN}}\left( {{\text{B}}\left( {\text{Yi}} \right)} \right) - \left( {{\text{MAX}} - {\text{MIN}}} \right) \, \left( { 7 { }\left( {\text{H}} \right)} \right) $$

All Yi must have been alive when X was alive, and all of them must have been old enough to publish. Thus, X could not have been born MAX-MIN years before the earliest birth year amongst all authors Yi.

$$ {\text{D}}\left( {\text{X}} \right) < = {\text{MAX}}\left( {{\text{D}}\left( {\text{Yi}} \right)} \right) + \left( {{\text{MAX}} - {\text{MIN}}} \right) \, \left( { 8 { }\left( {\text{H}} \right)} \right) $$

Again, all Yi must have been alive when X was alive, and all of them must have been old enough to publish. Hence, X could not have been alive MAX-MIN years after the latest death year amongst all writers Yi.

4.2 Greedy Rules

Greedy rules bounds are sensible but can sometimes lead to wrong estimates.

Greedy rule based on authors who are mentioned by X

$$ {\text{B}}\left( {\text{X}} \right) > = {\text{ MAX}}\left( {{\text{B}}\left( {\text{Yi}} \right)} \right) \, - {\text{MIN }}\left( { 9 { }\left( {\text{G}} \right)} \right) $$

Many of the references in our research domain relate to dead authors. Thus, most of the references within X’s texts relate to dead authors. Namely, many Yi were born before X’s birth and died before X’s death. Thus, a greedy assumption would be that X was born no earlier than the birth of the latest author mentioned by X; however, because there may be at least one case where Y was born after X was born, we subtract MIN.

Greedy rule based on references to year Y made by X

$$ {\text{B}}\left( {\text{X}} \right) > = {\text{ MAX}}\left( {\text{Y}} \right) - {\text{MIN }}\left( { 10 \, \left( {\text{G}} \right)} \right) $$

When X mentions years, he usually writes the current year in which he wrote the document or a few years ahead. Most of the time, the maximum year, Y, minus MIN is larger than X’s birth year.

Greedy rule based on authors who refer to X

$$ {\text{D}}\left( {\text{X}} \right) < = {\text{ MIN}}\left( {{\text{D}}\left( {\text{Yi}} \right)} \right) \, - {\text{MIN }}\left( { 1 1 { }\left( {\text{G}} \right)} \right) $$

As mentioned above, most of the references within Yi texts refer to X as being dead. Hence, most Yi died after X’s death. Therefore, a greedy assumption would be that X died no later than the death of the earliest author who referred to X minus MIN.

Rules refinements 9–11 are presented by rules 12–17. Rules 12–14 are due to X referring to Yi and rules 15–17 are due to Yi referring to X.

Greedy rule for defining the birth year based only on authors who were referred to by X with the key-phrase “late”

$$ {\text{B}}\left( {\text{X}} \right) > = {\text{ MAX}}\left( {{\text{D}}\left( {\text{Yi}} \right)} \right) \, - {\text{MIN }}\left( { 1 2 { }\left( {\text{G}} \right)} \right) $$

When taking into account only references that were written in X’s texts, most of the references are related to dead authors. That is, most Yi died before X’s birth. Moreover, an author does not write from his birth; rather, he usually begins near his death. Thus, a greedy assumption would be that X was born no earlier than the death of the latest author mentioned by X minus MIN.

Greedy rule for defining the birth year based only on authors who are mentioned by X with the key-phrase “friend”

$$ {\text{B}}\left( {\text{X}} \right) < = {\text{ MIN}}\left( {{\text{B}}\left( {\text{Yi}} \right)} \right) \, + {\text{ RABBI}}\_{\text{DIS }} \, \left( { 1 3 { }\left( {\text{G}} \right)} \right) $$

When taking into account only references that are mentioned by X, which are related to contemporary authors, a greedy rule could be that X was born no later than the birth of the earliest author mentioned by X with the key-phrase “friend”. Because many times the older author refers to the younger author as “friend”, we need to add RABBI_DIS.

Greedy rule for defining the birth year based only on authors who are mentioned by X with the key-phrase “rabbi”

$$ {\text{B}}\left( {\text{X}} \right) < = {\text{ MIN}}\left( {{\text{B}}\left( {\text{Yi}} \right)} \right) \, + {\text{ RABBI}}\_{\text{DIS }} \, \left( { 1 4 { }\left( {\text{G}} \right)} \right) $$

When taking into account only references written in X’s texts, which are related to contemporary authors, a greedy rule could be that X was born no later than the birth of the earliest author mentioned by X as a “rabbi”. Due to the age difference between a student and his rabbi being approximately 20 years, we need to add RABBI_DIS.

Greedy rule for defining the death year of X based only on authors who referred to X with the key-phrase “late”

$$ {\text{D}}\left( {\text{X}} \right) < = {\text{ MIN}}\left( {{\text{B}}\left( {\text{Yi}} \right)} \right) \, + {\text{ MIN }} \, \left( { 1 5 { }\left( {\text{G}} \right)} \right) $$

When taking into account only references written in Yi texts that refer to X with the key-phrase “late”, a greedy assumption could be that X died no later than the birth of the earliest author who referred to X with the key-phrase “late”; because an author does not writes from birth, we need to add MIN.

Greedy rule for defining the death year of X based only on authors who referred to X with the key-phrase “friend”

$$ {\text{D}}\left( {\text{X}} \right) > = {\text{ MAX}}\left( {{\text{D}}\left( {\text{Yi}} \right)} \right) \, - {\text{ RABBI}}\_{\text{DIS }} \, \left( { 1 6 { }\left( {\text{G}} \right)} \right) $$

When taking into account only references written in Yi texts that refer to X with the key-phrase “friend”, all Yi must have been alive when X was alive, and all of them must have been old enough to publish; also, many times, the older author refers to the younger author with the key-phrase “friend”, and the opposite never occurs. Therefore, a greedy assumption would be that X died no earlier than the death of the latest author who referred to X with the key-phrase “friend” minus RABBI_DIS.

Greedy rule for defining the death year of X based only on authors who referred to X with the key-phrase “rabbi”

$$ {\text{D}}\left( {\text{X}} \right) > = {\text{ MAX}}\left( {{\text{D}}\left( {\text{Yi}} \right)} \right) \, - {\text{ RABBI}}\_{\text{DIS }} \, \left( { 1 7 { }\left( {\text{G}} \right)} \right) $$

This follows the same principle as the rule for defining the birth year, but because this time the student mentions the rabbi, we need to reduce RABBI_DIS.

4.3 Birth and Death Year Tuning

Application of the Heuristic and Greedy rules can lead to abnormalities, such as an author’s death age being unreasonably old or young. Another possible anomaly is that the algorithm may result in a death year greater than the current year (i.e., 2015). Hence, we added some tuning rules: D – death year, B – birth year, age = D–B.

Current Year: if (D > 2015) {D = 2015}, i.e., if the current year is 2015, then the algorithm must not give a death year greater than 2015.

Age: if (age > 100), {z = age–100; D = D–z/2; B = B+z/2}, and if (age < 30), {z = 30 – age; D = D+z/2; B = B−z/2}. Our postulate is that a writer lived at least 30 years and no more than 100 years. Thus, if the age according to the algorithm is greater than 100, we take the difference between that age and 100, and then we divide that difference by 2 and normalize D and B to result in an age of 100.

5 The Model

The main steps of the model are presented below.

  1. 1.

    Cleaning the texts. Because the responsa may have undergone some editing, we must make sure to ignore the possible effects of differences in the texts resulting from variant editing practices. Therefore, we eliminate all orthographic variations.

  2. 2.

    Boosting extracting key-phrases and key-words.

  3. 3.

    Normalizing the references in the texts. For each author, we normalize all types of references that refer to him (e.g., various variants and spellings of his name, books, documents and their nicknames and abbreviations). For each author, we collect all references syntactic styles that refer to him and then replace them with a unique string.

  4. 4.

    Building indexes, e.g., authors, references to “late”/“friend”/“rabbi”, and calculating the frequencies of each item.

  5. 5.

    Performing various combinations of “iron-clad” and heuristic rules on the one hand and greedy rules on the other hand to estimate the birth and death years of each tested author.

  6. 6.

    Calculating averages for the best “iron-clad”, heuristic and greedy versions.

6 Examined Corpus, Experiments and Results

The documents of the examined corpus were downloaded from Bar-Ilan University’s Responsa Project. The examined corpus contains 15,495 responsa written by 24 scholars, averaging 643 files for each scholar. The total number of characters in the whole corpus is 127,683,860 chars, and the average number of chars for each file is 8,240 chars. These authors lived over a period of 229 years (1786–2015). These files contain references; each reference pattern can be expanded into many other specific references [22].

Reference identification was performed by comparing each word to a list of 339 known authors and many of their books. This list of 25,801 specific references refers to the names, nicknames and abbreviations of these authors and their writings. Basic references were collected and all other references were produced from them.

We split the data into two corpora: (1) 10,512 responsa authored by 12 rabbis, with an average of 876 files for each scholar and each file containing an average of 1800 words spread over 135 years (1880–2015); (2) 15,495 responsa authored by 24 rabbis, with an average of 643 files for each rabbi and each file containing an average of 1609 words spread over 229 years (1786–2015) (the set of 24 rabbis contains the group of 12 rabbis). For more detailed information on the data set, refer to Table 2 in the appendix at the end of this article.

Table 2. Full details about the data set

Because of the nature of the problem, it is difficult to appraise the results in the sense that although we can compare how close the system guess is to the actual birth or death years, we cannot assess how good the results are, i.e., there is no real notion of what a ‘good’ result is. For now, we use the notion Distance, which is defined as the estimated value minus the ground truth value.

The outcomes appear in the following histograms. Each histogram shows the results of one algorithm – Iron + Heuristic (I + H) or Greedy. Each algorithm was performed on two groups of authors: a group of 12 writers and a group of 24 writers. For both algorithm executions, there are outcomes containing estimated birth years and death years. The results shown in the histograms are the best birth/death date deviation results. In every histogram, there are eight columns; there are two quartets of columns in each histogram: the right quartet indicates the deviation from the death year, while the left quartet indicates deviations from the birth year. Each column represents a deviation without a key-phrase or with the year that was mentioned in the text, a deviation with the “late” key-phrase, with the “rabbi” key-phrase, and with the “friend” key-phrase. Moreover, we used two manipulations – Age and Current year. The column with a gray background contains the best results. Each histogram contains 8 columns (results); there are 16 histograms, so there are, in total, 128 results.

The Age manipulation is very helpful; we used it in 94.5 % of the experiments (i.e., 121/128 = 0.945) for all of the refinements, in both algorithms, with or without constants.

Examination of the effect of mentioning a year, listed in Figs. 4, 5, 6, and 7, compared with Figs. 8, 9, 10, and 11 regarding death year deviation, indicates that the contribution of referencing a year leads to an improvement of 2.8 years on average.

Fig. 4.
figure 4

12 authors I + H no constant

Fig. 5.
figure 5

24 authors I + H no constant

Fig. 6.
figure 6

12 authors Greedy no constant

Fig. 7.
figure 7

24 authors Greedy no constant

Fig. 8.
figure 8

12 authors I + H no constant

Fig. 9.
figure 9

24 authors I + H no constant

Fig. 10.
figure 10

12 authors Greedy no constant

Fig. 11.
figure 11

24 authors Greedy no constant

This phenomenon is more noticeable in Iron + Heuristic (average upswing of 4.2 years) than with Greedy (average deviation upswing of 1.4 years). The main reason for this is that a writer usually writes until close to his death. Additionally, when a year is mentioned in the text, it is often the year in which the writer wrote the document. Because an author writes, in many cases, until near his death year, the maximum year mentioned in his texts is close to the year of his death.

In contrast to the death year assessment, birth year assessment has a negative impact; the deviation increases by 10.4 years, on average. It is essential to note that we are now evaluating the impact of the year mentioned in the text. If the results without using the year mentioned are better than the results using the year mentioned, it means that we should not use it. For example: the result of the birth year using Greedy rules, without year mentioning and without any refinement, for the 12 authors has a deviation of 16.7 years. After using the year mentioned, the deviation is 51.5 years, decreasing the accuracy by 34.83 years. The result of the birth year using the Iron + Heuristic, without year mentioning and without key-phrases, for the 12 authors has a deviation of 26.5 years. After using a reference to years, the deviation is 50.7 years, decreasing the accuracy by 24.2 years, i.e., the deviation with the use of year mentioning is greater. An analysis of the formulas shows that the formula that determines the birth year in the Greedy (10(G)) uses the most recent year the writer writes in his texts. The most recent year that the rabbi mentions is usually near his death, as explained above; therefore, very poor birth results are obtained, with a decline of 12.5 years. The results of the Greedy are better than Iron + Heuristic (decline of 8.4 years), but the effect of year mentioning on the results of Iron + Heuristic is less harmful. Thus, to estimate the death year, we will use the Iron + Heuristic algorithm with the use of year mentioning without any key-phrases.

The use of the key-phrase “friend” for birth year assessment gives the best results compared with the other key-phrases – “late”, “rabbi” or none. This is because friends are of the same generation and more or less the same age; thus, they are born in roughly the same year. Thus, for a writer addressing another author as his friend, the assessment of his birth year will give good results. For the death year, however, this is not assured because there may be a much greater period between the deaths of friends (one may die at the age of 50, while his friend at the age 75). Hence, the “friend” key-phrase usually gives better birth year assessment than death year assessment (Figs. 12, 13, 14, and 15).

Fig. 12.
figure 12

12 authors I + H with constant

Fig. 13.
figure 13

24 authors I + H with constant

Fig. 14.
figure 14

12 authors Greedy with constant

Fig. 15.
figure 15

24 authors Greedy with constant

After we found that the best results for the birth year are always with the “friend” key-phrase (except for one case), we investigated at greater depth and found that this occurs specifically with the use of constants. Constants are important, resulting in an average improvement of 6.3 years in the case of Greedy (for the 12 and 24 authors). In general, a Posek is addressed in responsa after he has become important enough to be mentioned and regarded in the Halachic Responsa, which is usually at an advanced age.

We stated above that the use of Greedy rules with constants gives the greatest improvement. Even without the use of constants, Greedy produces the best results. The reason lies in the formulae; formula (13(G)) finds the lowest birth year from the group of authors that the arbiter mentioned. Unlike the Greedy, the Iron + Heuristic formula (7(H)) reduces the constant (at present, 20); therefore, the results of the Greedy are better. In conclusion, to best assess the birth year, we apply the Greedy algorithm, using constants and also the key-phrase “friend”.

The best results when evaluating birth year occurred when using the Greedy algorithm with constants and without mentioning years. The best results when evaluating death year occurred when using the Iron + Heuristic algorithm with constants and without mentioning years. When we compare these results with the results shown in Figs. 16, 17, 18, and 19 we find that in the case of Greedy, when we add more authors, there is an improvement only in one case, i.e., for the 12 authors using the “late” key-phrase; the remaining results show a decline in performance. The reason for this phenomenon may lie in the Greedy formula; when an author is more successful, in addition to being mentioned by others many times, he is mentioned at a younger age by authors that are older than him; therefore, the estimation is less accurate. For example: the estimation of the death year of the late Rabbi Ovadia Yosef has an error of 61 years (instead of 2014, the algorithm result is 1953), determining that he died at an age of 34; using the Iron + Heuristic algorithm, there was a decrease in two results and an improvement in 5 results. For Iron + Heuristic, there is an average improvement of 0.64 years and, in fact, the best death year result estimation. The quality of the Greedy algorithm birth year results estimation using year mentioning pretty severely impairs the results (explained above). A possible explanation for this is that the improvement that comes from using constants cannot overcome the deterioration that comes from year mentioning. In contrast, when assessing the death year, using year mentioning with Iron + Heuristic significantly improves the results, and using constants improves them a little more; therefore, a combination of constants + year mentioning brings better assessment of the death year. Therefore, when assessing birth year and death year, it is not enough to use references; we have to use key-words and key-phrases. To estimate death year, we will use the year(s) mentioned in the text and constants with the Iron + Heuristic algorithm; to estimate birth year, we will run the Greedy algorithm using constants and the “friend” key-phrase without year mentioning.

Fig. 16.
figure 16

12 authors I + H with constant and year

Fig. 17.
figure 17

24 authors I + H with constant and year

Fig. 18.
figure 18

12 authors Greedy with constant and year

Fig. 19.
figure 19

24 authors Greedy with constant and year

7 Summary, Conclusions and Future Work

We investigated the estimation of the birth and death years of authors using year mentioning, the “late” (“of blessed memory”) key-phrase, the “rabbi” key-phrase, the “friend” key-phrase and undated references that are mentioned in documents of other dated authors that refer to author being considered or those mentioned by him. This research was performed on responsa documents, where special writing rules are applied. The estimation was based on the author’s texts and texts of other authors who refer to the discussed author or are mentioned by him. To do so, we formulated various types of iron-clad, heuristic and greedy rules. The best birth year assessment was achieved by using the Greedy algorithm with constants and the “friend” key-phrase. The best death year assessment was achieved by using the Iron + Heuristic algorithm with year mentioning.

We plan to improve this research by (1) testing new combinations of iron-clad, heuristic and greedy rules, as well as a combination of key-phrases (e.g., “late” and “friend”); (2) improving existing rules and/or formulating new rules; (3) defining and applying heuristic rules that take into account various details included in the responsa, e.g., events, names of people, new concepts and collocations that can be dated; (4) conducting additional experiments using many more responsa written by more authors to improve the estimates; (5) checking why the iron-clad, heuristic and greedy rules tend to produce more positive differences; and (6) testing how much of an improvement we can obtain from a correction of the upper bound of D(x) and how much we will, at some point, use it for a corpus with long-dead authors.