1 Introduction

Determining the time frame of a book or manuscript and identifying its author are important and challenging problems. Time-related key-words and key-phrases, together with references, can be used to date documents and identify authors. Key-phrases and key-words can provide valuable information in many domains, such as academic, legal and commercial. Consequently, research on the automatic extraction and analysis of key-phrases and key-words is growing rapidly. Web search engines and machine learning systems, among others, rely on key-phrases and key-words as features that are extracted and learned automatically. Key-phrases and key-words are essential features not only of scientific papers and of industrial and commercial documents but also of rabbinic responsa (answers written by rabbinic scholars in response to Jewish legal questions).

Key-phrases, key-words and citations/references included in rabbinic responsa text are more complex to define and to extract than key-phrases, key-words and citations/references in academic papers because:

  1. There is an interaction with the complex morphology of Hebrew, Aramaic and Yiddish (e.g., a citation can appear with different prefixes attached to its name, such as “and …”, “when …”, “and when …”, “in …”, “and in …”, “and when in …”);

  2. Natural language processing (NLP) in Hebrew, Aramaic and Yiddish has been relatively little studied;

  3. In contrast to academic papers, responsa contain no reference list at all, not even at the end of a responsum;

  4. Many references in Hebrew-Aramaic-Yiddish documents are ambiguous. For instance: (a) a book titled (magen-avot) was composed by four different Jewish authors; and (b) the abbreviation (m”b) relates to two different Jewish authors and also has other meanings that are not authors’ names; and

  5. At least 30 different syntactic styles (see next paragraph) are used to present references. This number is higher than the number of citation patterns used in academic papers written in English (e.g., see [1]).

Each document written by a given author can be cited in at least 30 general syntactic styles. For example, the pardes book written by the famous Jewish author Rabbi Shlomo Yitzhaki, known by the abbreviation Rashi, can be referenced using the following patterns: (1) “Shlomo son of Rabbi Yitzhaki”, (2) “Shlomo son of Yitzhaki”, (3) “Shlomo Yitzhaki”, (4) “In the name of Rashi who wrote in the pardes book”, (5) “In the complete pardes book of Rashi”, (6) “Rashi of blessed memory in the pardes book”, (7) “In Rashi in the pardes”, (8) “Rashi the possessor of the pardes of blessed memory”, (9) “In the name of Rashi in the pardes”, (10) “The pardes responsa of Rashi”, (11) “Which wrote Rashi in the pardes”, (12) “Rashi in the pardes book”, (13) “The pardes book of Rashi”, (14) “The pardes b’ of Rashi”, (15) “In the name of the pardes book”, (16) “Rashi the possessor of the pardes”, (17) “In the name of the pardes b’”, (18) “Rashi in the pardes”, (19) “The great pardes book of Rashi”, (20) “The great pardes book”, (21) “The great pardes”, (22) “The pardes book”, (23) “The pardes responsa”, (24) “The possessor of the pardes”, (25) “In the name of the pardes”, (26) “The pardes b’”, (27) “In the name of Rashi”, (28) “There, chapter-number, sentence-number” (there refers to the book/paper mentioned in the latest reference), (29) “There, there, sentence-number” (the first there refers to the book/paper mentioned in the latest reference and the second there refers to the chapter-number mentioned in the latest reference), and (30) “There, there, there” (refers to the book/paper, chapter-number and sentence-number mentioned in the latest reference).

Furthermore, each citation pattern can be expanded to many other specific citations by replacing the name of the author and/or his book/responsa by each of their other names (e.g., different spellings, full names, short names, first names, surnames, and nicknames with/without title) and abbreviations of their names or their book titles.
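The combinatorial growth described here can be illustrated with a small sketch. It crosses three of the 30 patterns listed earlier with a few name variants. The template strings, the `expand_citations` helper and the English transliterations are illustrative assumptions, not the actual pattern list of the system, and Hebrew prefixes are not handled.

```python
from itertools import product

# Three of the citation patterns from the list above, written as templates.
TEMPLATES = [
    "{author} in the {book} book",   # pattern (12)
    "The {book} book of {author}",   # pattern (13)
    "In the name of {author}",       # pattern (27)
]

def expand_citations(templates, author_names, book_names):
    """Cross every template with every author-name and book-name variant."""
    return sorted({
        t.format(author=a, book=b)
        for t, a, b in product(templates, author_names, book_names)
    })

# Hypothetical name variants for one author and one book.
variants = expand_citations(TEMPLATES, ["Rashi", "Shlomo Yitzhaki"], ["pardes"])
print(len(variants))  # number of distinct surface forms
```

Adding further spellings, nicknames or abbreviations to either list multiplies the number of surface forms, which is why a few basic references can grow into a very large list of specific references.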

Hebrew-Aramaic-Yiddish documents in general and Hebrew responsa in particular present various interesting text mining problems: (1) the morphology of Hebrew is richer than that of English. Hebrew has 70,000,000 valid forms, while English has only 1,000,000 [2]. A single Hebrew stem can have up to 7000 declensions, while an English stem has only a few; (2) responsa documents have a high rate of abbreviations (nearly 20%), and more than one third of them (about 8%) are ambiguous [3].

This research estimates the date of undated documents of authors using (1) the year(s) mentioned in the text, (2) “late” (“of blessed memory”) key-phrases, (3) “rabbi” key-phrases, (4) “friend” key-phrases that are mentioned in the texts and (5) undated references of other dated authors that refer to the considered author or are mentioned by him. The assessments are with different degrees of certainty: “iron-clad”, heuristic and greedy. The rules are based on key-phrases with and without references.

This paper is organized as follows: Section 2 gives background concerning the extraction and analysis of key-phrases and citations. Section 3 presents the boosting algorithm for key-phrase extraction. Section 4 presents rules with different degrees of certainty (“iron-clad”, heuristic and greedy), which are used to assess writers’ birth and death years. Section 5 presents the model description. Section 6 introduces the dataset, experiments, results and analysis. Section 7 includes the summary, conclusions and future work.

2 Related Research

Following the explosion of electronic information, there has been a growing need for extracting key-phrases and key-words automatically. Many studies have been conducted in this area for different purposes and from different perspectives. Key-words in documents allow for quick search over multiple large databases [4]. Key-words can also help to improve NLP performance, as well as information retrieval performance, in tasks such as text summarization [5], text categorization [6], detecting topic change in conversational text [7], and opinion mining [8].

Although key-words are important in many computer applications, there is still much to be done in this area, and the state-of-the-art methods underperform compared to other NLP core tasks [9].

There are several difficulties in extracting key-phrases and key-words. One is the length of the documents. In scientific articles, although there may be only approximately 10 key-words or key-phrases and approximately another 30 candidates in the abstract section, the rest of the article may contain hundreds of candidate key-phrases or key-words [10]. Moreover, key-words can also appear at the end of an article. If a key-phrase or key-word appears both at the beginning and at the end of an article, this indicates its importance [11].

When documents are structured, key-word extraction is easier. For example, in scientific papers, most of the key-words appear in the abstract, in the introduction and in the title [12]. In other cases, key-phrases can be automatically extracted from web page text and from its metadata [13] for the purpose of advertisement.

One possible way to date a document and determine its writer is to use a visual image of the document. Analyzing an image of a document generally consists of locating the place where the document was written (different climate types mean different erosion, discoloration and degradation) and extracting features from letters, words and empty regions. Features can be extracted from the contours of the letter shapes in the text [14], from features that are not related to a specific text [15, 16], or from geometric patterns.

Bar-Yosef et al. [17] developed a multi-phase binarization method using concavity (also of cavities) and moments, among other features, for the identification, verification and classification of writers of historical Hebrew calligraphy texts. They performed an experiment on eroded letters of 34 writers, and the identification experiment yielded 100% correct classification.

The problem with those methods (e.g., signal or image processing) is that they can find the writer but not necessarily the author. Suppose that an author wrote a book or an article and that after 200 or 400 years the paper disintegrated. If the text was important, it was copied; thus, those methods may find the era of the copyist but not the era of the original author.

Automatic extraction and analysis of references from academic papers was first proposed by Garfield [18]. Berkowitz and Elkhadiri [19] extracted writers’ names and titles from articles. A knowledge-based system was used by Giuffrida et al. [20] to derive metadata, including writers’ names, from computer science journal articles. Hidden Markov Models were used by Seymore et al. [21] to extract writer names from a limited collection of computer science articles. The use of terms leads to progress in the extraction of information. Selecting text before and after references to extract good index terms to improve retrieval effectiveness was done by Ritchie et al. [22]. Bradshaw [23] used terms from a fixed window around references.

In contrast with scientific articles, the documents we work on come from the Responsa Project (see footnote 1); they have no structural base, usually contain a mixture of at least two languages, and contain noise (e.g., editorial additions). Previous research on the Responsa Project dealt with text classification [24]; it examined whether classification could be performed along the axis of the authors’ ethnic groups using stylistic feature sets. HaCohen-Kerner and Mughaz [25] investigated in which era rabbis lived using undated responsa, but they did not address the problem of how to extract time-related key-words or key-phrases. This article continues that line of research, i.e., determining when writers lived, using key-phrases.

3 Semi-automatic Boosting Mining of Key-Phrases

We want to mine the time-related key-phrases automatically. We found that most sentences in rabbinic literature that contain time-related concepts (i.e., time-related words and phrases such as “late” and “friend”) place them near rabbinic names, nicknames, acronyms, abbreviations or book names. We therefore developed a semi-automatic algorithm that boosts concept mining in order to mine the time-related concepts. The main idea is to extract sentences that contain names of rabbis, so that the words and phrases near the rabbinic names are treated as candidates for the key-phrases that we seek. We first present a general description of our mining algorithm and then an illustration of a run of the algorithm.

3.1 The Algorithm

Notations:

  • TP – temporal vector of Time-related Phrases.

  • RN – vector of Rabbinic Names.

  • n – number of iterations of the algorithm.

  • TRC – set of Time-Related Concepts starting with, e.g., year, life, friend and era.

  • TP ← TRC // initiate TP with the value of TRC

  • For i = 1 to n do:

    • Search for sentences that contain the last concepts that were added to TP

    • Extract new rabbinic names from those sentences

    • Add the new rabbinic names to RN

    • Search for sentences that contain the last rabbinic names that were added to RN

    • Mine time-related concepts from the new sentences:

      • Delete stop words.

      • Add the new time-related concepts to TP

      • Add the new time-related words and phrases to TRC (with their frequencies) and for the “old” time-related words and phrases only add their frequencies

Sort TRC by the frequency of the time-related words and phrases in decreasing order (normally, concepts have a larger number of appearances). Select the most frequent time-related concepts from TRC.

Below we present several examples of sentences that contain rabbinic names, rabbinic acronym names and rabbinic book names (in Hebrew with translation into English):
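The loop described by these steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization, single-word concepts, a pre-existing list of known rabbinic names for the name-extraction step, and a fixed context window around each name. The function `boost_mine`, its `window` parameter and the toy sentences are hypothetical.

```python
from collections import Counter

def boost_mine(sentences, seed_concepts, known_names, stop_words, n=2, window=1):
    """Alternate between time-related concepts (TP/TRC) and rabbinic names (RN)."""
    tokenized = [s.lower().split() for s in sentences]
    trc = Counter({c: 0 for c in seed_concepts})  # TRC with frequencies
    tp_new = set(seed_concepts)                   # concepts added in the last round
    rn = set()                                    # RN: names found so far
    for _ in range(n):
        # Sentences that contain the most recently added concepts.
        hits = [t for t in tokenized if tp_new & set(t)]
        # Assumption: names are recognized by lookup in a known list.
        rn_new = {w for t in hits for w in t if w in known_names} - rn
        rn |= rn_new
        tp_new = set()
        # Words within `window` tokens of a newly found name are candidates.
        for t in tokenized:
            for i, w in enumerate(t):
                if w not in rn_new:
                    continue
                for c in t[max(0, i - window): i + window + 1]:
                    if c in stop_words or c in known_names:
                        continue
                    if c not in trc:
                        tp_new.add(c)   # new concept: search for it next round
                    trc[c] += 1         # old and new concepts gain frequency
    return trc.most_common(), rn

# Toy English stand-ins for Hebrew sentences (hypothetical data).
sentences = ["my friend auerbach wrote",
             "auerbach late explained",
             "waldenberg late ruled"]
ranked, names = boost_mine(sentences, {"friend"},
                           {"auerbach", "waldenberg"},
                           {"my", "wrote", "explained", "ruled"})
print(ranked, names)
```

Starting from the single seed "friend", the first round finds one name and the new concept "late"; the second round uses "late" to find a second name. The final manual selection from the sorted list is what keeps the procedure semi-automatic.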

figure c

… and my teacher and Rabbi Hgrsh”z (acronym of the genius Rabbi Shlomo Zalman) Auerbach za”l (acronym of righteous memory) already explained it in his ma’adany eretz book …

figure d
  (2) Responsa Hechal Yitzchak Even HaEzer part 2 chapter 53

Answer to the previous question, from Rabbi Shlomo Zalman Auerbach shlit”a (acronym of may he live a good long life, Amen) head of Kol Torah yeshiva (talmudic college) in Jerusalem. In the matter of a courier of the Haifa court that appointed abroad, to be an emissary to conveyance get (divorce certificate) and died suddenly …

figure e
  (3) Responsa Har-Zvi yore dea’a chapter 19

In the book Tzitz-Eliezer written by my friend the genius Hgra”i (acronym of the genius Rabbi Eliezer Yehuda) Waldenberg shlit”a (acronym of may he live a good long life, Amen) ab”d (acronym of head of rabbinical court) of the city of Jerusalem …

Illustration of the run of the algorithm

figure f

3.2 Algorithm Results

After running the algorithm, we obtained time-related key-words, key-phrases and acronyms (a partial list is shown in Table 1).

Table 1. Hebrew and Aramaic cue words partial list

We divided the Hebrew and Aramaic key-words and key-phrases into three sets:

Late – addressing a person who has already died.

Friend – addressing another person as a friend, i.e., there is a large overlap between the lifetime of one author and the lifetime of another person who is referred to by the first author as a friend.

Rabbi – addressing another person as a rabbi/master, i.e., there is overlap between the lifetime of one author and the lifetime of another person who is referred to by the first author as rabbi.

Table 1 presents a partial list of Hebrew and Aramaic key-words and key-phrases and a few acronyms in Hebrew and their translation into English.

4 Rules-Based Constraints

This section presents the rules, based on key-phrases and references, formulated for estimating the birth and death years of an author X (the extracted results point to specific years) from his texts and the texts of other writers (Yi) who mention X or one of his texts. We assume that the birth and death years of all writers are known, except for those under investigation. We use the following notation and constants: X – the writer under consideration; Yi – other writers; B – birth year; D – death year; MIN – minimal age (at present, 30 years) of a rabbinic writer when he starts to write his responsa; MAX – maximal age (at present, 100 years) of a rabbinic author; and RABBI_DIS – the age gap between a rabbi and his student (at present, 20 years). The values of the MIN, MAX, and RABBI_DIS constants are heuristic, although they are realistic given typical responsa authors’ lifestyles.

Different types of references exist: general references with and without key-phrases, such as “rabbi”, “friend” and “late”. There are two types of references: those referring to living authors and those referring to dead authors. In contrast to academic papers, responsa include many more references to dead authors than to living authors.

We introduce rules based on key-phrases and references with different degrees of certainty: “iron-clad” (I), heuristic (H) and greedy (G). “Iron-clad” rules are always true, without any exception. Heuristic rules are almost always true; exceptions can occur when the heuristic values of MIN, MAX and RABBI_DIS do not hold for a particular author. Greedy rules are reasonable rules for responsa authors, but they can sometimes produce wrong estimates. Each rule is numbered, and its degree of certainty (i.e., I, H, G) is given in brackets.

4.1 “Iron-Clad” and Heuristic Rules with Key-Phrases

First, we present one general “iron-clad” rule and two general heuristic rules that are based on regular citations (i.e., without any key-phrase), followed by one “iron-clad” rule based on the years mentioned in X’s texts.

General rule based on authors that were mentioned by X

$$ \text{D(X)} \geq \text{MAX(B(Yi))} \quad (0\,(\text{I})) $$
$$ \text{D(X)} \geq \text{MAX(B(Yi))} + \text{MIN} \quad (1\,(\text{H})) $$

X must have been alive when he referred to Yi, so the birth year of the latest-born author Yi is a lower bound on X’s death year. The heuristic rule adds MIN, the minimum age at which Yi starts to write his responsa, which gives the earliest possible publishing year of the latest-born Yi.

General rule based on authors that referred to X

$$ \text{B(X)} \leq \text{MIN(D(Yi))} - \text{MIN} \quad (2\,(\text{H})) $$

All Yi must have been alive when they referred to X, and X must have been old enough to publish. Hence, the earliest death year amongst such authors Yi is an upper estimate of the year in which X first published, and subtracting MIN gives an upper estimate of his birth year.

General rules based on year mentioning Y that appeared in X’s documents

$$ \text{D(X)} \geq \text{MAX(Y)} \quad (3\,(\text{I})) $$

X must have been alive when he mentioned the year Y. Thus, the most recent year mentioned by X is a lower bound on X’s death year.
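Rules 0 to 3 reduce to a maximum or minimum over known years plus a constant, as the following sketch shows. The function name `general_rule_bounds`, the dictionary keys and the example years are our own illustrative assumptions; only the formulas and the value MIN = 30 come from the text.

```python
MIN = 30  # minimal writing age, as stated in the text

def general_rule_bounds(cited_births, citer_deaths, years_mentioned):
    """Apply rules 0-3; each input is a list of years (possibly empty)."""
    bounds = {}
    if cited_births:                  # B(Yi) of authors mentioned by X
        bounds["rule0_D_min"] = max(cited_births)            # 0 (I)
        bounds["rule1_D_min"] = max(cited_births) + MIN      # 1 (H)
    if citer_deaths:                  # D(Yi) of authors who refer to X
        bounds["rule2_B_max"] = min(citer_deaths) - MIN      # 2 (H)
    if years_mentioned:               # years Y written in X's texts
        bounds["rule3_D_min"] = max(years_mentioned)         # 3 (I)
    return bounds

# Hypothetical evidence for one author X.
bounds = general_rule_bounds([1820, 1898], [1990, 2005], [1935, 1948])
print(bounds)
```

Here the strongest lower bound on D(X) is the largest of the three `D_min` values, which in this made-up case comes from the mentioned year.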

Posthumous Key-Phrase Rules

Posthumous rules estimate the birth and death years of an author X based on references of authors who refer to X with the key-phrase “late” (“of blessed memory”) or on references of X that mention other authors with the key-phrase “late”. Figure 1 describes possible situations where various types of authors Yi (i = 1, 2, 3) refer to X with the key-phrase “late”. The lines depict writers’ life spans; the left edges represent the birth years and the right edges represent death years. In this case (as all Yi refer to X with the key-phrase “late”), we know that all Yi passed away after X, but we do not know when they were born in relation to X’s birth. Y1 was born before X’s birth; Y2 was born after X’s birth but before X’s death; and Y3 was born after X’s death.

$$ \text{D(X)} \leq \text{MIN(D(Yi))} \quad (4\,(\text{I})) $$
Fig. 1. References mentioning X with the key-phrase “late”.

However, we know that X must have been dead when Yi referred to him with the key-phrase “late”; thus, we can use the earliest death year among the Yi as an upper estimate for X’s death year. Like all writers, dead writers of course have to comply with rule (2) as well.

Now, we look at the cases where the author X that we are studying refers to other authors Yi with the key-phrase “late”. Figure 2 describes possible situations where X refers to various types of authors Yi (i = 1, 2, 3) with the key-phrase “late”. All Yi passed away before X’s death (or X may still be alive). Y1 died before X’s birth; Y2 was born before X’s birth and died when X was still alive; Y3 was born after X’s birth and passed away when X was still alive.

$$ \text{D(X)} \geq \text{MAX(D(Yi))} \quad (5\,(\text{I})) $$
Fig. 2. References by X who mentions others with the key-phrase “late”.

X must have been alive after the death of all Yi whom he referred to with the key-phrase “late”. Therefore, we can use the latest death year among these Yi as a lower estimate for X’s death year.

$$ \text{B(X)} \geq \text{MAX(D(Yi))} - \text{MAX} \quad (6\,(\text{H})) $$

X was alive after the latest death year among the Yi whom he mentioned with the key-phrase “late”, and he is assumed to have lived at most MAX years. Thus, we use that latest death year minus MAX as a lower estimate for X’s birth year.

Contemporary Key-Phrases Rules

Contemporary key-phrase rules calculate a lower bound on the birth year and an upper bound on the death year of a writer X based only on the references of known writers who refer to X as their friend/rabbi. This means there must have been at least some period of time when both were alive together. Figure 3 shows possible situations where various types of writers Yi refer to X as their friend/rabbi. Y1 was born before X’s birth and died before X’s death; Y2 was born before X’s birth and died after X’s death; Y3 was born after X’s birth and passed away before X’s death; Y4 was born after X’s birth and passed away after X’s death. Like all writers, contemporary authors of course have to comply with rules 1 and 2 as well.

$$ \text{B(X)} \geq \text{MIN(B(Yi))} - (\text{MAX} - \text{MIN}) \quad (7\,(\text{H})) $$
Fig. 3. References by authors who refer to X as their Friend/Rabbi.

All Yi must have been alive when X was alive, and all of them must have been old enough to publish. Thus, X could not have been born more than MAX − MIN years before the earliest birth year amongst all authors Yi.

$$ \text{D(X)} \leq \text{MAX(D(Yi))} + (\text{MAX} - \text{MIN}) \quad (8\,(\text{H})) $$

Again, all Yi must have been alive when X was alive, and all of them must have been old enough to publish. Hence, X could not have been alive more than MAX − MIN years after the latest death year amongst all writers Yi.
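The posthumous and contemporary rules (4 to 8) can be written in the same style. The sketch below is illustrative only: the function name, the result keys and the sample years are assumptions, while the formulas and the values MIN = 30 and MAX = 100 are taken from the text.

```python
MIN, MAX = 30, 100  # heuristic constants stated in the text

def key_phrase_bounds(late_citer_deaths, late_cited_deaths,
                      peer_births, peer_deaths):
    """Rules 4-8; every argument is a list of years and may be empty."""
    out = {}
    if late_citer_deaths:   # D(Yi) of authors who call X "late"
        out["rule4_D_max"] = min(late_citer_deaths)              # 4 (I)
    if late_cited_deaths:   # D(Yi) of authors whom X calls "late"
        out["rule5_D_min"] = max(late_cited_deaths)              # 5 (I)
        out["rule6_B_min"] = max(late_cited_deaths) - MAX        # 6 (H)
    if peer_births:         # B(Yi) of authors who call X friend/rabbi
        out["rule7_B_min"] = min(peer_births) - (MAX - MIN)      # 7 (H)
    if peer_deaths:         # D(Yi) of the same authors
        out["rule8_D_max"] = max(peer_deaths) + (MAX - MIN)      # 8 (H)
    return out

# Hypothetical evidence for one author X.
kp = key_phrase_bounds([1990, 2005], [1940, 1953], [1890, 1910], [1960, 1975])
print(kp)
```

In this made-up case the "late" rules give a much tighter window for D(X), between 1953 and 1990, than the contemporary rule 8 does, which mirrors the observation that MAX − MIN is a wide margin.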

4.2 Greedy Rules

The bounds produced by greedy rules are sensible but can sometimes lead to wrong estimates.

Greedy rule based on authors who are mentioned by X

$$ \text{B(X)} \geq \text{MAX(B(Yi))} - \text{MIN} \quad (9\,(\text{G})) $$

Many of the references in our research domain relate to dead authors. Thus, most of the references within X’s texts relate to dead authors. Namely, many Yi were born before X’s birth and died before X’s death. Thus, a greedy assumption would be that X was born no earlier than the birth of the latest author mentioned by X; however, because there may be at least one case where Y was born after X was born, we subtract MIN.

Greedy rule based on references to year Y made by X

$$ \text{B(X)} \geq \text{MAX(Y)} - \text{MIN} \quad (10\,(\text{G})) $$

When X mentions years, he usually writes the current year in which he wrote the document or a few years ahead. Most of the time, the maximum year, Y, minus MIN is larger than X’s birth year.

Greedy rule based on authors who refer to X

$$ \text{D(X)} \leq \text{MIN(D(Yi))} - \text{MIN} \quad (11\,(\text{G})) $$

As mentioned above, most of the references within Yi texts refer to X as being dead. Hence, most Yi died after X’s death. Therefore, a greedy assumption would be that X died no later than the death of the earliest author who referred to X minus MIN.

Rules 12–17 refine rules 9–11. Rules 12–14 apply when X refers to Yi, and rules 15–17 apply when Yi refer to X.

Greedy rule for defining the birth year based only on authors who were referred to by X with the key-phrase “late”

$$ \text{B(X)} \geq \text{MAX(D(Yi))} - \text{MIN} \quad (12\,(\text{G})) $$

When taking into account only references that were written in X’s texts, most of the references are related to dead authors. That is, most Yi died before X’s birth. Moreover, an author does not write from his birth; rather, he usually begins near his death. Thus, a greedy assumption would be that X was born no earlier than the death of the latest author mentioned by X minus MIN.

Greedy rule for defining the birth year based only on authors who are mentioned by X with the key-phrase “friend”

$$ \text{B(X)} \leq \text{MIN(B(Yi))} + \text{RABBI\_DIS} \quad (13\,(\text{G})) $$

When taking into account only references that are mentioned by X, which are related to contemporary authors, a greedy rule could be that X was born no later than the birth of the earliest author mentioned by X with the key-phrase “friend”. Because many times the older author refers to the younger author as “friend”, we need to add RABBI_DIS.

Greedy rule for defining the birth year based only on authors who are mentioned by X with the key-phrase “rabbi”

$$ \text{B(X)} \leq \text{MIN(B(Yi))} + \text{RABBI\_DIS} \quad (14\,(\text{G})) $$

When taking into account only references written in X’s texts, which are related to contemporary authors, a greedy rule could be that X was born no later than the birth of the earliest author mentioned by X as a “rabbi”. Due to the age difference between a student and his rabbi being approximately 20 years, we need to add RABBI_DIS.

Greedy rule for defining the death year of X based only on authors who referred to X with the key-phrase “late”

$$ \text{D(X)} \leq \text{MIN(B(Yi))} + \text{MIN} \quad (15\,(\text{G})) $$

When taking into account only references written in Yi texts that refer to X with the key-phrase “late”, a greedy assumption could be that X died no later than the birth of the earliest author who referred to X with the key-phrase “late”; because an author does not write from birth, we need to add MIN.

Greedy rule for defining the death year of X based only on authors who referred to X with the key-phrase “friend”

$$ \text{D(X)} \geq \text{MAX(D(Yi))} - \text{RABBI\_DIS} \quad (16\,(\text{G})) $$

When taking into account only references written in Yi texts that refer to X with the key-phrase “friend”, all Yi must have been alive when X was alive, and all of them must have been old enough to publish; also, many times, the older author refers to the younger author with the key-phrase “friend”, and the opposite never occurs. Therefore, a greedy assumption would be that X died no earlier than the death of the latest author who referred to X with the key-phrase “friend” minus RABBI_DIS.

Greedy rule for defining the death year of X based only on authors who referred to X with the key-phrase “rabbi”

$$ \text{D(X)} \geq \text{MAX(D(Yi))} - \text{RABBI\_DIS} \quad (17\,(\text{G})) $$

This follows the same principle as the rule for defining the birth year, but because this time the student mentions the rabbi, we need to subtract RABBI_DIS.
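Because every greedy rule has the same shape (an aggregate over a set of years plus an offset), rules 9 to 17 can be stored as data and applied by one loop. The table layout, the evidence key names and the sample years below are our own assumptions for illustration; the aggregates, offsets and constants follow the formulas in the text.

```python
MIN, RABBI_DIS = 30, 20  # constants stated in the text

# (rule, target, relation, aggregate, evidence key, offset)
GREEDY_RULES = [
    (9,  "B", ">=", max, "cited_births",        -MIN),
    (10, "B", ">=", max, "years_mentioned",     -MIN),
    (11, "D", "<=", min, "citer_deaths",        -MIN),
    (12, "B", ">=", max, "late_cited_deaths",   -MIN),
    (13, "B", "<=", min, "friend_cited_births", +RABBI_DIS),
    (14, "B", "<=", min, "rabbi_cited_births",  +RABBI_DIS),
    (15, "D", "<=", min, "late_citer_births",   +MIN),
    (16, "D", ">=", max, "friend_citer_deaths", -RABBI_DIS),
    (17, "D", ">=", max, "rabbi_citer_deaths",  -RABBI_DIS),
]

def apply_greedy(evidence):
    """Return {rule: (target, relation, year)} for rules with evidence."""
    results = {}
    for rule, target, relation, aggregate, key, offset in GREEDY_RULES:
        years = evidence.get(key)
        if years:  # skip rules for which no reference of that type exists
            results[rule] = (target, relation, aggregate(years) + offset)
    return results

# Hypothetical evidence for one author X.
greedy = apply_greedy({
    "cited_births": [1820, 1898],
    "late_cited_deaths": [1936, 1953],
    "rabbi_citer_deaths": [1990],
})
print(greedy)
```

Keeping the rules in a table makes it easy to switch individual rules on and off when trying different rule combinations.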

4.3 Birth and Death Year Tuning

Application of the Heuristic and Greedy rules can lead to abnormalities, such as an author’s death age being unreasonably old or young. Another possible anomaly is that the algorithm may result in a death year greater than the current year (i.e., 2015). Hence, we added some tuning rules: D – death year, B – birth year, age = D − B.

Current Year: if (D > 2015) {D = 2015}, i.e., if the current year is 2015, then the algorithm must not give a death year greater than 2015.

Age: if (age > 100), {z = age − 100; D = D − z/2; B = B + z/2}, and if (age < 30), {z = 30 − age; D = D + z/2; B = B − z/2}. Our postulate is that a writer lived at least 30 years and no more than 100 years. Thus, if the age according to the algorithm is greater than 100, we take the difference between that age and 100, and then we divide that difference by 2 and normalize D and B to result in an age of 100.
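The two tuning rules can be sketched as a single function. Two points are assumptions on our part because the text does not specify them: the Current Year cap is applied before the Age rule, and halving an odd difference is allowed to produce fractional years.

```python
CURRENT_YEAR, MIN_AGE, MAX_AGE = 2015, 30, 100  # values from the text

def tune(birth, death):
    """Apply the Current Year rule, then the Age rule, to one estimate."""
    death = min(death, CURRENT_YEAR)      # Current Year: D must not exceed 2015
    age = death - birth
    if age > MAX_AGE:                     # too old: shrink symmetrically to 100
        z = age - MAX_AGE
        death, birth = death - z / 2, birth + z / 2
    elif age < MIN_AGE:                   # too young: stretch symmetrically to 30
        z = MIN_AGE - age
        death, birth = death + z / 2, birth - z / 2
    return birth, death

too_old = tune(1850, 1970)     # age 120 is shrunk to 100
too_young = tune(1900, 1920)   # age 20 is stretched to 30
future = tune(1950, 2030)      # death year is capped at 2015
print(too_old, too_young, future)
```

Note that with this ordering, stretching a very short life that ends at 2015 can push the death year past 2015 again; applying the cap after the Age rule instead would be an equally plausible reading.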

4.4 Example of the Use of a Certain Heuristic Rule and the Key-Phrase “Late”

Below we present texts written by Rabbi Herzog Yitzchak (1889–1959):

figure g
  (1) Responsa Hechal Yitzchak Even HaEzer part 2 chapter 43

… and inspecting (book) bayit chadash chapter 134 that was ahead of me to the differences between before writing etc. (to this bayit chadash, turned my attention the genius Rabbi Yehuda Ades, One of the religious court judges of the court shlit”a) …

figure h
  (2) Responsa Hechal Yitzchak Even HaEzer part 2 chapter 47

… and here I am now looking at the responsa of Maharsham of blessed memory in part 2 chapter 140 in the matter of deaf mute who finished their school … that the divorce certificate of a deaf is as like divorce certificates, It is not clear to me at all. And in my opinion the objections of the genius Rabbi Ben-Zion Uziel of blessed memory are not opposite to him.

figure i
  (3) Responsa Hechal Yitzchak Even HaEzer part 2 chapter 77

The response of the genius the author of Chazon Ish book of blessed memory in the previous question …

Figure 4 shows the timeline of the authors that are relevant to the texts that appear above (Table 2).

Fig. 4. Herzog Yitzchak refers to four other rabbis

Table 2. Birth and death years of authors that relate to the example
Table 3. Full details about the citations in the corpora
Table 4. Current results vs. HaCohen-Kerner and Mughaz [25] results for the 12 authors corpus (the results are with years deviation)

We use a heuristic rule to improve the assessment of the death year of Rabbi Herzog Yitzchak:

Activate the iron rule (formula (0(I))): D(X) >= MAX(B(Yi)) → D(X) >= 1898 (Distance from the real death year (1959) is 61)

Activate the heuristic rule (formula (1(H))): D(X) >= MAX(B(Yi)) + MIN → D(X) >= 1898 + 30 = 1928 (Distance from the real death year (1959) is 31)

We use the key-phrase “late” to improve the assessment of the death year of Rabbi Herzog Yitzchak:

Activate the iron rule (formula (0(I))): D(X) >= MAX(B(Yi)) → D(X) >= 1898 (Distance from the real death year (1959) is 61)

Activate the iron key-phrase “late” rule (formula (5(I))): D(X) >= MAX(D(Yi)) → D(X) >= 1953 (Distance from the real death year (1959) is 6)

We can see that the heuristic rule (1(H)) improves the result. However, with the use of the key-phrase “late”, rule (5(I)), the result is much better.
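The arithmetic of this example can be replayed in a few lines. The two aggregate values, MAX(B(Yi)) = 1898 and MAX(D(Yi)) = 1953, and the true death year 1959 are taken from the text; the variable names are ours.

```python
MIN = 30                 # minimal writing age, from the text
TRUE_DEATH = 1959        # death year of Rabbi Herzog, from the text
max_birth_cited = 1898   # MAX(B(Yi)) over the authors he mentions
max_death_late = 1953    # MAX(D(Yi)) over the authors he calls "late"

# Lower bounds on D(X) produced by each rule.
lower_bounds = {
    "0(I)": max_birth_cited,
    "1(H)": max_birth_cited + MIN,
    "5(I)": max_death_late,
}

# Gap in years between each bound and the true death year.
gaps = {rule: TRUE_DEATH - bound for rule, bound in lower_bounds.items()}
for rule, gap in gaps.items():
    print(rule, lower_bounds[rule], gap)
```

The gap shrinks from 61 to 31 to 6 years as the evidence becomes more specific, matching the figures reported in the example.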

5 The Model

The main steps of the model are presented below:

  1. Cleaning the texts. Because the responsa may have undergone some editing, we must make sure to ignore the possible effects of differences in the texts resulting from variant editing practices. Therefore, we eliminate all orthographic variations.

  2. Boosting mining of key-phrases and key-words.

  3. Normalizing the references in the texts. For each author, we normalize all types of references that refer to him (e.g., various variants and spellings of his name, books, documents and their nicknames and abbreviations). For each author, we collect all syntactic reference styles that refer to him and then replace them with a unique string.

  4. Building indexes, e.g., authors, references to “late”/“friend”/“rabbi”, and calculating the frequencies of each item.

  5. Performing various combinations of “iron-clad” and heuristic rules on the one hand and greedy rules on the other hand to estimate the birth and death years of each tested author.

  6. Calculating averages for the best “iron-clad”, heuristic and greedy versions.
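Step 3, replacing every reference variant of an author with a unique string, can be sketched with a single compiled regular expression. This is a simplified illustration under stated assumptions: the variants are plain English transliterations, the identifier `AUTHOR_RASHI` and the helper `build_normalizer` are hypothetical, and Hebrew prefixes and ambiguous abbreviations are not handled.

```python
import re

def build_normalizer(variants_by_author):
    """Return a function that maps every known variant to its author id."""
    lookup = {v: author_id
              for author_id, variants in variants_by_author.items()
              for v in variants}
    # Longest variants first, so a long reference wins over a name inside it.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))
    return lambda text: pattern.sub(lambda m: lookup[m.group(0)], text)

# Hypothetical variant list for one author.
normalize = build_normalizer({
    "AUTHOR_RASHI": ["Rashi in the pardes", "Rashi", "Shlomo Yitzhaki"],
})
normalized = normalize("as Rashi in the pardes wrote, and Shlomo Yitzhaki agreed")
print(normalized)
```

After this step, counting references to an author in step 4 reduces to counting occurrences of one string.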

6 Examined Corpus, Experiments and Results

The documents of the examined corpus were downloaded from Bar-Ilan University’s Responsa Project. The examined corpus contains 15,495 responsa written by 24 scholars, averaging 643 files for each scholar. The total number of characters in the whole corpus is 127,683,860 chars, and the average number of chars for each file is 8,240 chars. These authors lived over a period of 229 years (1786–2015). These files contain references; each reference pattern can be expanded into many other specific references [26].

Reference identification was performed by comparing each word to a list of 339 known authors and many of their books. This list of 25,801 specific references refers to the names, nicknames and abbreviations of these authors and their writings. Basic references were collected and all other references were produced from them.

We split the data into two corpora: (1) 10,561 responsa authored by 12 rabbis, with an average of 876 files for each scholar and each file containing an average of 1800 words spread over 135 years (1880–2015); (2) 15,495 responsa authored by 24 rabbis, with an average of 643 files for each rabbi and each file containing an average of 1609 words spread over 229 years (1786–2015) (the set of 24 rabbis contains the group of 12 rabbis). For more detailed information on the data set, refer to Table 5 in the appendix at the end of this article.

Table 5. Full details about the data set

Because of the nature of the problem, it is difficult to appraise the results in the sense that although we can compare how close the system guess is to the actual birth or death years, we cannot assess how good the results are, i.e., there is no real notion of what a ‘good’ result is. For now, we use the notion Distance, which is defined as the estimated value minus the ground truth value.

The outcomes appear in the following histograms. Each histogram shows the results of one algorithm, Iron+Heuristic or Greedy, performed on one of two groups of authors: a group of 12 writers or a group of 24 writers. Each execution yields estimated birth years and death years, and the histograms show the best birth/death year deviation results. Each histogram has eight columns arranged in two quartets: the left quartet indicates deviations from the birth year, while the right quartet indicates deviations from the death year. Within a quartet, the columns represent a deviation without a key-phrase or with the year that was mentioned in the text, a deviation with the “late” key-phrase, with the “rabbi” key-phrase, and with the “friend” key-phrase. Moreover, we used two manipulations: Age and Current year. The column with a gray background contains the best results. With 16 histograms of 8 columns each, there are 128 results in total.

The Age manipulation is very helpful; we used it in 94.5% of the experiments (i.e., 121/128 = 0.945) for all of the refinements, in both algorithms, with or without constants.

Examination of the effect of mentioning a year, listed in Figs. 5, 6, 7 and 8, compared with Figs. 9, 10, 11 and 12 regarding death year deviation, indicates that the contribution of referencing a year leads to an improvement of 2.8 years on average.

Fig. 5. 12 authors I+H no constant

Fig. 6. 24 authors I+H no constant

Fig. 7. 12 authors Greedy no constant

Fig. 8. 24 authors Greedy no constant

Fig. 9. 12 authors I+H no constant

Fig. 10. 24 authors I+H no constant

Fig. 11. 12 authors Greedy no constant

Fig. 12. 24 authors Greedy no constant

This phenomenon is more noticeable with Iron+Heuristic (an average improvement of 4.2 years) than with Greedy (an average improvement of 1.4 years). The main reason is that a writer usually writes until close to his death, and a year mentioned in the text is often the year in which the writer wrote the document. Consequently, the maximum year mentioned in an author’s texts is close to the year of his death.

In contrast to the death year assessment, year mentioning has a negative impact on birth year assessment; the deviation increases by 10.4 years, on average. Note that we are evaluating here only the impact of the year mentioned in the text: if the results without the year mentioned are better than the results with it, we should not use it. For example, the birth year result using Greedy rules, without year mentioning and without any refinement, for the 12 authors has a deviation of 16.7 years. With year mentioning, the deviation is 51.5 years, decreasing the accuracy by 34.83 years. The birth year result using Iron+Heuristic, without year mentioning and without key-phrases, for the 12 authors has a deviation of 26.5 years. With year mentioning, the deviation is 50.7 years, decreasing the accuracy by 24.2 years. An analysis of the formulas shows that the formula that determines the birth year in the Greedy algorithm (10(G)) uses the most recent year the writer mentions in his texts. The most recent year that the rabbi mentions is usually near his death, as explained above; therefore, very poor birth results are obtained, with a decline of 12.5 years. The birth year results of Greedy are better than those of Iron+Heuristic, but year mentioning is less harmful to the results of Iron+Heuristic (a decline of 8.4 years). Thus, to estimate the death year, we will use the Iron+Heuristic algorithm with year mentioning and without any key-phrases.

The use of the key-phrase “friend” for birth year assessment gives the best results compared with the other key-phrases (“late”, “rabbi” or none). This is because friends are of the same generation and more or less the same age; thus, they are born in roughly the same year. Therefore, for a writer addressing another author as his friend, the assessment of his birth year will give good results. For the death year, however, this is not assured because there may be a much greater period between the deaths of friends (one may die at the age of 50, while his friend dies at the age of 75). Hence, the “friend” key-phrase usually gives a better birth year assessment than death year assessment.

After we found that the best results for the birth year are always with the “friend” key-phrase (except for one case), we investigated at greater depth and found that this occurs specifically with the use of constants. Constants are important, resulting in an average improvement of 6.3 years in the case of Greedy (for the 12 and 24 authors). In general, a Posek is addressed in responsa after he has become important enough to be mentioned and regarded in the Halachic Responsa, which is usually at an advanced age.

We stated above that the use of Greedy rules with constants gives the greatest improvement. Even without the use of constants, Greedy produces the best results. The reason lies in the formulae; formula (13(G)) finds the lowest birth year from the group of authors that the arbiter mentioned. Unlike the Greedy, the Iron+Heuristic formula (7(H)) reduces the constant (at present, 20); therefore, the results of the Greedy are better. In conclusion, to best assess the birth year, we apply the Greedy algorithm, using constants and also the key-phrase “friend”.

The best results when evaluating birth year occurred when using the Greedy algorithm with constants and without mentioning years. The best results when evaluating death year occurred when using the Iron+Heuristic algorithm with constants and without mentioning years. When we compare these results with the results shown in Figs. 17, 18, 19 and 20, we find that in the case of Greedy, when we add more authors, there is an improvement in only one case, i.e., for the 12 authors using the “late” key-phrase; the remaining results show a decline in performance. The reason for this phenomenon may lie in the Greedy formula: when an author is more successful, in addition to being mentioned by others many times, he is mentioned at a younger age by authors who are older than him; therefore, the estimation is less accurate. For example, the estimation of the death year of the late Rabbi Ovadia Yosef has an error of 61 years (instead of 2014, the algorithm result is 1953), determining that he died at an age of 34. Using the Iron+Heuristic algorithm, there was a decrease in two results and an improvement in five results. For Iron+Heuristic, there is an average improvement of 0.64 years and, in fact, the best death year estimation. For the Greedy algorithm, year mentioning severely impairs the birth year estimates (as explained above). A possible explanation is that the improvement that comes from using constants cannot overcome the deterioration that comes from year mentioning. In contrast, when assessing the death year, using year mentioning with Iron+Heuristic significantly improves the results, and using constants improves them a little more; therefore, the combination of constants and year mentioning gives a better assessment of the death year. Therefore, when assessing birth year and death year, it is not enough to use references; we have to use key-words and key-phrases.

To estimate death year, we will use the year(s) mentioned in the text and constants with the Iron+Heuristic algorithm; to estimate birth year, we will run the Greedy algorithm using constants and the “friend” key-phrase without year mentioning.

The Effect of the Relatively Larger Corpus

We compared the results achieved for the corpora containing responsa written by 12 and by 24 authors. We enlarged not only the number of authors but also the number of responsa: from 10,561 files for 12 authors to 15,495 files for 24 authors, and from 19,011,130 words for 12 authors to 24,930,082 words for 24 authors. The time frame of the 12 authors spans 135 years, while the time frame of the 24 authors spans 229 years. Because the span of years of the 24 authors is almost twice as large as that of the 12 authors, we must compare the results relative to the span of years. When we analyze the results in proportion to the span of years, we find that the larger data set gives better results (for 90.6% of the results, as shown in Figs. 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 and 20). For instance, in Fig. 17 (12 authors), the smallest deviation when predicting death year, i.e., the deviation of the best result, is 9.2 years, and 9.2/135 → 6.8%; in Fig. 18 (24 authors), the smallest deviation when predicting death year is 13.1 years, and 13.1/229 → 5.7% (an improvement of 16%). In Fig. 17 (12 authors), the smallest deviation when predicting birth year is 16.6 years, and 16.6/135 → 12.3%; in Fig. 18 (24 authors), the smallest deviation when predicting birth year is 22.6 years, and 22.6/229 → 9.8% (an improvement of 20%).
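The normalization used in this comparison can be sketched in a few lines of Python. The helper names are hypothetical; the deviations and spans are the values quoted above, and the printed values agree with the quoted percentages up to rounding.

```python
def relative_deviation(deviation_years, span_years):
    """Deviation expressed as a fraction of the corpus time span."""
    return deviation_years / span_years


def relative_improvement(before, after):
    """Fractional reduction when moving from `before` to `after`."""
    return (before - after) / before


# Smallest deviations quoted for Fig. 17 (12 authors, 135 years)
# and Fig. 18 (24 authors, 229 years).
death_12 = relative_deviation(9.2, 135)
death_24 = relative_deviation(13.1, 229)
birth_12 = relative_deviation(16.6, 135)
birth_24 = relative_deviation(22.6, 229)

for label, small, large in [("death", death_12, death_24),
                            ("birth", birth_12, birth_24)]:
    gain = relative_improvement(small, large)
    print(f"{label}: {small:.1%} -> {large:.1%}, improvement {gain:.0%}")
```

Dividing by the span makes the two corpora comparable even though the absolute deviations for the 24 authors are larger.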

Fig. 13. 12 authors I+H with constant

Fig. 14. 24 authors I+H with constant

Fig. 15. 12 authors Greedy with constant

Fig. 16. 24 authors Greedy with constant

Fig. 17. 12 authors I+H with constant and year

Fig. 18. 24 authors I+H with constant and year

Fig. 19. 12 authors Greedy with constant and year

Fig. 20. 24 authors Greedy with constant and year

The number of general citations, each composed of a name, acronym, abbreviation or book title of an author, is very large (see Table 3). In many cases, references to an author appear with the affixes Rabbi, friend or “late” (see Sect. 3.1). A citation with a time-related phrase is a citation of an author’s name that appears close to the time-related phrase. Consequently, the occurrences of general citations include the occurrences of references with time-related phrases. From Table 3 we can see that general citations and “late” citations have the largest numbers of occurrences, also per author, so we would expect them to be the most influential. However, as we wrote above, the experiments indicate that the best results are obtained using the time-related phrase “friend” and references to years.

At first glance, Table 3 suggests that the time-related phrase “Rabbi” is negligible relative to the other time-related phrases: its number of occurrences is the lowest, and per author it is less than two. But when we look deeper, we see that despite its minimal number of occurrences, the time-related phrase “Rabbi” sometimes gives better results than general citations or references with the time-related phrase “late” (the two with the largest numbers of occurrences). For example, in Fig. 15, the Greedy algorithm for birth year assessment using references with the time-related phrase “Rabbi” gives a deviation of 13.7 years from the true birth years; in the same figure, references with the “late” time-related phrase give a deviation of 22.3 years, which is worse by 62.8%. Also in Fig. 15, general citations give a deviation of 16.8 years, which is worse by 22.6% compared with the “Rabbi” time-related phrase. We can see a similar phenomenon in Figs. 6 and 9 using the Iron+Heuristic algorithm to estimate the birth years. A key-phrase with few occurrences can achieve a better result than a key-phrase with many occurrences because of the structure of the rules (Sects. 4.1 and 4.2), which use the minimum and maximum functions. By the nature of these functions, a few occurrences of citations can affect the results. As a result, references with the time-related phrase “Rabbi” (which appears only a few times) occasionally give better results than general citations or the “late” time-related phrase, which appear much more often.
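The sensitivity of minimum/maximum functions to a few occurrences can be illustrated with a small Python sketch. It does not reproduce the rules of Sects. 4.1 and 4.2; the years below are hypothetical, and the mean is included only as a contrast that the paper does not use.

```python
from statistics import mean

# Hypothetical birth years of cited authors (illustrative values only).
frequent_phrase_years = [1850] * 50 + [1860] * 40  # 90 citations
rare_phrase_years = [1820, 1825]                   # 2 citations

all_years = frequent_phrase_years + rare_phrase_years

# Shift in a minimum-based estimate caused by the two rare citations.
min_shift = min(frequent_phrase_years) - min(all_years)

# Shift in a mean-based estimate caused by the same two citations.
mean_shift = mean(frequent_phrase_years) - mean(all_years)

print(f"min-based shift:  {min_shift} years")
print(f"mean-based shift: {mean_shift:.2f} years")
```

In this toy setting, two citations out of 92 move a minimum-based estimate by decades while barely moving an average, which is the kind of leverage a rare key-phrase can have under minimum/maximum rules.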

Current Research Versus First Research

In this research, various novelties are presented compared with HaCohen-Kerner and Mughaz [25]:

  1. 1.

    There are two corpora of responsa composed by 12 authors and 24 authors, instead of one corpus (12 authors and far fewer files);

  2. 2.

    Years that are mentioned in the text documents are used (the text was not labeled with a date or year, but sometimes years can appear in the text, e.g., a quotation from a contract that contains the year of the agreement);

  3. 3.

    Heuristics were added to the Greedy algorithm by adding a few greedy constraints;

  4. 4.

    New Rabbi constraints were formulated;

  5. 5.

    Two new manipulations, “Current Year” and “Age” were added.

HaCohen-Kerner and Mughaz [25] examined a corpus that includes 3,488 responsa authored by 12 Jewish rabbinic scholars, while in this research the current corpus for 24 authors contains 15,495 responsa and the current corpus for 12 authors contains 10,561 responsa. The 3,488 responsa used in HaCohen-Kerner and Mughaz [25] are included in these 10,561 responsa.

Table 4 presents a comparison between the results of our current work and the best results of HaCohen-Kerner and Mughaz [25]. This table shows that three results (out of four results) for 12 authors in the current research are much better (in their quality) than the corresponding results reported in HaCohen-Kerner and Mughaz [25]. Only one result (the birth years using the Greedy algorithm) was slightly worse. The results of 24 authors were not presented in HaCohen-Kerner and Mughaz [25].

Using the Iron+Heuristic algorithm, we reduced the deviation in years, compared with HaCohen-Kerner and Mughaz [25], for death years by 60% (from 22.67 to 9.2 years) and for birth years by 42% (from 22 to 12.8 years). Using the Greedy algorithm, we reduced the deviation for death years by 32% (from 15.54 to 10.5 years); however, for the birth years we obtained a slightly worse result, by about 3% (from 13.04 to 13.4 years).
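The percentages above follow from a simple relative-change computation, sketched here in Python with a hypothetical helper name and the deviations quoted in the text. The computed values match the reported ones up to rounding (the first comes to about 59.4%, reported as 60%).

```python
def percent_reduction(previous, current):
    """Relative reduction in deviation, in percent (negative means worse)."""
    return (previous - current) / previous * 100


# (best previous result, current result) in years, as quoted in the text.
comparisons = {
    "Iron+Heuristic, death years": (22.67, 9.2),
    "Iron+Heuristic, birth years": (22.0, 12.8),
    "Greedy, death years": (15.54, 10.5),
    "Greedy, birth years": (13.04, 13.4),
}

for label, (previous, current) in comparisons.items():
    print(f"{label}: {percent_reduction(previous, current):.1f}%")
```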

7 Summary, Conclusions and Future Work

We investigated the estimation of the birth and death years of authors using year mentioning, the “late” (“of blessed memory”) key-phrase, the “rabbi” key-phrase, the “friend” key-phrase and undated references that appear in documents of other dated authors who refer to the author being considered or are mentioned by him. This research was performed on responsa documents, where special writing rules are applied. The estimation was based on the author’s texts and the texts of other authors who refer to the discussed author or are mentioned by him. To do so, we formulated various types of iron-clad, heuristic and greedy rules. The best birth year assessment was achieved by using the Greedy algorithm with constants and the “friend” key-phrase. The best death year assessment was achieved by using the Iron+Heuristic algorithm with year mentioning.

We plan to improve this research by (1) testing new combinations of iron-clad, heuristic and greedy rules, as well as a combination of key-phrases (e.g., “late” and “friend”); (2) improving existing rules and/or formulating new rules; (3) defining and applying heuristic rules that take into account various details included in the responsa, e.g., events, names of people, new concepts and collocations that can be dated; (4) conducting additional experiments using many more responsa written by more authors to improve the estimates; (5) checking why the iron-clad, heuristic and greedy rules tend to produce more positive differences; and (6) testing how much of an improvement we can obtain from a correction of the upper bound of D(x) and how much we will, at some point, use it for a corpus with long-dead authors.