Mining and Using Key-Words and Key-Phrases to Identify the Era of an Anonymous Text

Mughaz, Dror; HaCohen-Kerner, Yaakov; Gabbay, Dov

doi:10.1007/978-3-319-59268-8_6


Part of the book series: Lecture Notes in Computer Science ((TCCI,volume 10190))


Abstract

This study aims to determine the time-frame in which the author of a given document lived. The documents are rabbinic documents written in Hebrew-Aramaic. They are undated and contain no bibliographic section, which poses an interesting challenge. To address it, we define a set of key-phrases and formulate three types of rules, “iron-clad”, heuristic and greedy, to define the time-frame. These rules are based on key-phrases and key-words in the documents of the authors. Identifying the time-frame of an author can help us determine the generation in which specific documents were written, can help in the examination of documents, i.e., to conclude whether documents were edited, and can also help us identify an anonymous author. We tested these rules on two corpora containing responsa documents. The results are promising and are better for the larger corpus than for the smaller corpus.


1 Introduction

Determining the time frame of a book or a manuscript and identifying an author are important and challenging problems. Time-related key-words and key-phrases with references can be used to date documents and identify authors. Key-phrases and key-words can provide valuable information in many domains, such as academic, legal and commercial. Thus, the automatic extraction and analysis of key-phrases and key-words is growing rapidly and gaining momentum. Web search engines, machine learning, etc. are based on key-phrases and key-words. Because features are extracted and learned automatically, the analysis of key-phrases and key-words has enormous importance. Key-phrases and key-words are essential features not only of scientific papers and of industrial and commercial documents but also of rabbinic responsa (answers written in response to Jewish legal questions authored by rabbinic scholars).

Key-phrases, key-words and citations/references included in rabbinic responsa text are more complex to define and to extract than key-phrases, key-words and citations/references in academic papers because:

(1)
There is an interaction with the complex morphology of Hebrew, Aramaic and Yiddish (e.g. various citations can be presented with different types of prefixes included in the name of a citation, e.g., “and …”, “when …”, “and when …”, “in …”, “and in …”, “and when in …”);
(2)
Natural language processing (NLP) in Hebrew, Aramaic and Yiddish has been relatively little studied;
(3)
In contrast to academic papers, a responsum has no reference list at all, not even at its end;
(4)
Many references in Hebrew-Aramaic-Yiddish documents are ambiguous. For instance: (a) a book titled (magen-avot) was composed by four different Jewish authors; and (b) The abbreviation (m”b) relates to two different Jewish authors and has also other meanings, which are not authors’ names; and
(5)
At least 30 different syntactic styles (see next paragraph) are used to present references. This number is higher than the number of citation patterns used in academic papers written in English (e.g., see [1]).

Each specific document written by a specific author can be referred to in at least 30 general possible citation syntactic styles. For example, the pardes book written by the famous Jewish author Rabbi Shlomo Yitzhaki, known by the abbreviation Rashi, can be referenced using the following patterns: (1) “Shlomo son of Rabbi Yitzhaki”, (2) “Shlomo son of Yitzhaki”, (3) “Shlomo Yitzhaki”, (4) “In the name of Rashi who wrote in the pardes book”, (5) “In the complete pardes book of Rashi”, (6) “Rashi of blessed memory in the pardes book”, (7) “In Rashi in the pardes”, (8) “Rashi the possessor of the pardes of blessed memory”, (9) “In the name of Rashi in the pardes”, (10) “The pardes responsa of Rashi”, (11) “Which wrote Rashi in the pardes”, (12) “Rashi in the pardes book”, (13) “The pardes book of Rashi”, (14) “The pardes b’ of Rashi”, (15) “In the name of the pardes book”, (16) “Rashi the possessor of the pardes”, (17) “In the name of the pardes b’”, (18) “Rashi in the pardes”, (19) “The great pardes book of Rashi”, (20) “The great pardes book”, (21) “The great pardes”, (22) “The pardes book”, (23) “The pardes responsa”, (24) “The possessor of the pardes”, (25) “In the name of the pardes”, (26) “The pardes b’”, (27) “In the name of Rashi”, (28) “There, chapter-number, sentence-number” (there refers to the book/paper mentioned in the latest reference), (29) “There, there, sentence-number” (the first there refers to the book/paper mentioned in the latest reference and the second there refers to the chapter-number mentioned in the latest reference), and (30) “There, there, there” (refers to the book/paper, chapter-number and sentence-number mentioned in the latest reference).

Furthermore, each citation pattern can be expanded to many other specific citations by replacing the name of the author and/or his book/responsa by each of their other names (e.g., different spellings, full names, short names, first names, surnames, and nicknames with/without title) and abbreviations of their names or their book titles.
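The expansion described above can be illustrated with a minimal sketch. It assumes that citation patterns are stored as templates with `{author}` and `{book}` placeholders; the three templates, the name variants and the function name `expand_citations` are hypothetical and are not taken from the system described in this paper.

```python
from itertools import product

# Hypothetical templates modeled on three of the 30 patterns listed above.
TEMPLATES = [
    "{author} in the {book} book",
    "The {book} book of {author}",
    "In the name of {author}",
]


def expand_citations(templates, author_names, book_titles):
    """Fill every template with every author-name and book-title variant."""
    citations = set()
    for template, author, book in product(templates, author_names, book_titles):
        # str.format ignores unused keyword arguments, so templates that
        # mention only the author are expanded as well.
        citations.add(template.format(author=author, book=book))
    return sorted(citations)


variants = expand_citations(TEMPLATES, ["Rashi", "Shlomo Yitzhaki"], ["pardes"])
for citation in variants:
    print(citation)
```

With two name variants and one book title, three templates already yield six specific citations, which suggests how a small set of basic references can grow into a large list.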

Hebrew-Aramaic-Yiddish documents in general and Hebrew responsa in particular present various interesting text mining problems: (1) the morphology of Hebrew is richer than that of English. Hebrew has 70,000,000 valid forms, while English has only 1,000,000 [2]. A single Hebrew stem can have up to 7000 declensions, while an English stem has only a few; (2) responsa documents have a high rate of abbreviations (nearly 20%), and more than one third of them (about 8%) are ambiguous [3].

This research estimates the date of undated documents of authors using (1) the year(s) mentioned in the text, (2) “late” (“of blessed memory”) key-phrases, (3) “rabbi” key-phrases, (4) “friend” key-phrases that are mentioned in the texts and (5) undated references of other dated authors that refer to the considered author or are mentioned by him. The assessments are with different degrees of certainty: “iron-clad”, heuristic and greedy. The rules are based on key-phrases with and without references.

This paper is organized as follows: Sect. 2 gives background concerning the extraction and analysis of key-phrases and citations. Section 3 presents the boosting algorithm for extracting key-phrases. Section 4 presents rules of different degrees of certainty, “iron-clad”, heuristic and greedy, which are used to assess writers’ birth and death years. Section 5 presents the model description. Section 6 introduces the dataset, experiments, results and analysis. Section 7 includes the summary, conclusions and future work.

2 Related Research

Following the explosion of electronic information, there has been a growing need for extracting key-phrases and key-words automatically. Many studies have been made in this area for different purposes and from different perspectives. Key-words in documents allow for quick search on multiple large databases [4]. Key-words can also help to improve the NLP performance, as well as the information retrieval performance in issues such as text summarization [5], text categorization [6], topic change during conversational text [7], and opinion mining [8].

Although key-words are important in many computer applications, there is still much to be done in this area, and the state-of-the-art methods underperform compared to other NLP core tasks [9].

There are several difficulties in extracting key-phrases and key-words. One is the length of the documents. In scientific articles, although there may be only approximately 10 key-words or key-phrases and approximately another 30 candidates in the abstract section, the rest of the article may contain hundreds of candidate key-phrases or key-words [10]. Moreover, key-words can also appear at the end of an article. If a key-phrase or key-word appears both at the beginning and at the end of an article, this indicates the importance of that key-phrase or key-word [11].

When documents are structured, key-words extraction is easier. For example, in scientific papers, most of the key-words appear in the abstract, in the introduction and in the title [12]. In other cases, key-phrases can be automatically extracted from web page text and from its metadata [13] for the purpose of advertisement.

One possible way to date a document and to determine who wrote it is to use a visual image of the document. Analyzing an image of a document generally consists of locating the place where the document was written (different climate types mean different erosion, discoloration and degradation) and extracting features from letters, words and empty regions. Features can be extracted from the contours of the letter shapes in the text [14], by extracting features that are not related to a specific text [15, 16], or from geometric patterns.

Bar-Yosef et al. [17] developed a multi-phase binarization method using concavity (also of cavities) and moments, among other features, for identification, verification and classification of writers of historical Hebrew calligraphy texts. They performed an experiment on eroded letters of 34 writers, and the identification experiment yielded 100% correct classification.

The problem with those methods (e.g., signal or image processing) is that they can find the writer but not necessarily the author. For example, an author wrote a book or an article, and after 200 or 400 years the paper disintegrated. Because the text was important, it was copied; thus, those methods may find the era of the copier/writer but not the era of the original author.

Automatic extraction and analysis of references from academic papers was first proposed by Garfield [18]. Berkowitz and Elkhadiri [19] extracted writers’ names and titles from articles. A knowledge-based system was used by Giuffrida et al. [20] to derive metadata, including writers’ names, from computer science journal articles. Hidden Markov Models were used by Seymore et al. [21] to extract writer names from a limited collection of computer science articles. The use of terms leads to progress in the extraction of information. Selecting text before and after references to extract good index terms to improve retrieval effectiveness was done by Ritchie et al. [22]. Bradshaw [23] used terms from a fixed window around references.

In contrast with scientific articles, the documents we are working on are from the Responsa Project; they have no structural base, usually contain a mixture of at least two languages, and contain noise (e.g., editorial additions). Previous research on the Responsa Project dealt with text classification [24]. They checked whether classification could be done over the long axis of ethnic groups of authors with stylistic feature sets. HaCohen-Kerner and Mughaz [25] investigated in which era rabbis lived using undated responsa, but they did not address the problem of how to extract time-related key-words or key-phrases. This article continues that line of research, i.e., determining when writers lived using key-phrases.

3 Semi-automatic Boosting Mining of Key-Phrases

We want to mine the time-related key-phrases automatically. We found that in most sentences of rabbinic literature that contain time-related concepts (i.e., time-related words and phrases such as “late” and “friend”), these concepts appear near rabbinic names, nicknames, acronyms, abbreviations or book names. We therefore developed a semi-automatic algorithm that boosts concept mining in order to mine the time-related concepts. The main idea is to extract sentences that contain names of rabbis, so that the words and phrases near the rabbinic names are treated as candidates for the key-phrases we are looking for. We first give a general description of our mining algorithm and then an illustration of a run of the algorithm.

3.1 The Algorithm

Notations:

TP – temporal vector of Time-related Phrases.
RN – vector of Rabbinic Names.
n – number of iterations of the algorithm.
TRC – set of Time-Related Concepts starting with, e.g., year, life, friend and era.
TP ← TRC // initialize TP with the value of TRC
For i = 1 to n do:
- Search for sentences that contain the last concepts that were added to TP
- Extract new rabbinic names from those sentences
- Add the new rabbinic names to RN
- Search for sentences that contain the last rabbinic names that were added to RN
- Mine time-related concepts from the new sentences:
  - Delete stop words.
  - Add the new time-related concepts to TP
  - Add the new time-related words and phrases to TRC (with their frequencies) and for the “old” time-related words and phrases only add their frequencies

Sort TRC by the frequency of time-related words and phrases in decreasing order (normally, concepts have a larger number of appearances). Select from TRC the most frequent time-related concepts.
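The loop above can be sketched in Python under strong simplifying assumptions: sentences are already tokenized, rabbinic names are recognized by lookup in a given set `known_names`, and every remaining non-stop word of a matching sentence counts as a candidate concept. The manual selection that makes the algorithm semi-automatic is not modeled. The function name `boost_mine` and the toy data are hypothetical.

```python
from collections import Counter


def boost_mine(sentences, seed_concepts, known_names, stop_words, n_iter=2):
    """Alternate between concepts and rabbinic names for n_iter rounds.

    Returns the candidate concepts sorted by decreasing frequency and the
    set of rabbinic names (RN) collected on the way.
    """
    trc = Counter()               # candidate concepts with frequencies
    tp = set(seed_concepts)       # all concepts seen so far
    rn = set()                    # rabbinic names found so far
    last_concepts = set(seed_concepts)
    counted = set()               # sentence indices already counted
    for _ in range(n_iter):
        # Sentences that contain the concepts added in the previous round.
        hits = [s for s in sentences if last_concepts & set(s)]
        new_names = {t for s in hits for t in s if t in known_names} - rn
        rn |= new_names
        last_concepts = set()
        for idx, sentence in enumerate(sentences):
            if idx in counted or not (new_names & set(sentence)):
                continue
            counted.add(idx)
            for token in sentence:
                if token in stop_words or token in known_names:
                    continue
                trc[token] += 1   # old concepts only gain frequency
                if token not in tp:
                    tp.add(token)
                    last_concepts.add(token)
        if not last_concepts:     # nothing new to search for
            break
    return trc.most_common(), rn


# Toy, transliterated data (hypothetical).
sentences = [
    ["rabbi", "auerbach", "late", "explained"],
    ["my", "friend", "waldenberg", "wrote"],
    ["auerbach", "friend", "answered"],
]
ranked, names = boost_mine(
    sentences,
    seed_concepts={"late"},
    known_names={"auerbach", "waldenberg"},
    stop_words={"my", "wrote", "explained", "answered"},
)
```

Starting from the single seed “late”, the first round finds the name “auerbach” and, through it, the concepts “rabbi” and “friend”; the second round uses these to find “waldenberg”.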

Below we present several examples of sentences that contain rabbinic names, rabbinic acronym names and rabbinic book names (in Hebrew with translation to English).

… and my teacher and Rabbi Hgrsh”z (acronym of the genius Rabbi Shlomo Zalman) Auerbach za”l (acronym of righteous memory) already explained it in his ma’adany eretz book …

(2)
Responsa Hechal Yitzchak Even HaEzer part 2 chapter 53

Answer to the previous question, from Rabbi Shlomo Zalman Auerbach shlit”a (acronym of may he live a good long life, Amen) head of Kol Torah yeshiva (talmudic college) in Jerusalem. In the matter of a courier of the Haifa court that appointed abroad, to be an emissary to conveyance get (divorce certificate) and died suddenly …

(3)
Responsa Har-Zvi yore dea’a chapter 19

In the book Tzitz-Eliezer written by my friend the genius Hgra”i (acronym of the genius Rabbi Eliezer Yehuda) Waldenberg shlit”a (acronym of may he live a good long life, Amen) ab”d (acronym of head of rabbinical court) of the city of Jerusalem …

Illustration of the run of the algorithm

3.2 Algorithm Results

After using the algorithm, we mine time-related key-words, key-phrases and acronyms (a partial list is shown in Table 1).

Table 1. Hebrew and Aramaic cue words partial list


We divided the Hebrew and Aramaic key-words and key-phrases into three sets:

Late – addressing a person who has already died.

Friend – addressing another person as a friend, i.e., there is a large overlap between the lifetime of one author and the lifetime of another person who is referred to by the first author as a friend.

Rabbi – addressing another person as a rabbi/master, i.e., there is overlap between the lifetime of one author and the lifetime of another person who is referred to by the first author as rabbi.

Table 1 presents a partial list of Hebrew and Aramaic key-words and key-phrases and a few acronyms in Hebrew and their translation into English.

4 Rules-Based Constraints

This section presents the rules, based on key-phrases and references, formulated for the estimation of the birth and death years of an author X (the extracted results point to specific years) based on his texts and the texts of other writers (Yi) who mention X or one of his texts. We assume that the birth years and death years of all writers are known, excluding those that are under investigation. We use the following notations and constants: X – The writer under consideration, Yi – Other writers, B – Birth year, D – Death year, MIN – Minimal age (at present, 30 years) of a rabbinic writer when he starts to write his responsa, MAX – Maximal age (at present, 100 years) of a rabbinic author, and RABBI_DIS – The age gap between a rabbi and his student (at present, 20 years). The estimations of the MIN, MAX, and RABBI_DIS constants are heuristic, although they are realistic on the basis of typical responsa authors’ lifestyles.

Different types of references exist: general references with and without key-phrases, such as “rabbi”, “friend” and “late”. There are two types of references: those referring to living authors and those referring to dead authors. In contrast to academic papers, responsa include many more references to dead authors than to living authors.

We will introduce rules based on key-phrases and references of different degrees of certainty: “iron-clad” (I), heuristic (H) and greedy (G). “Iron-clad” rules are always true, without any exception. Heuristic rules are almost always true. Exceptions can occur because the heuristic estimates for MIN, MAX and RABBI_DIS are incorrect. Greedy rules are rather reasonable rules for responsa authors. However, wrong estimates can sometimes be drawn while using these rules. Each rule will be numbered and its degree of certainty (i.e., I, H, G) will be presented in brackets.

4.1 “Iron-Clad” and Heuristic Rules with Key-Phrases

First, we present one general iron rule and two general heuristic rules, which are based on regular citations (i.e., without any key-phrase): citations that X makes and citations made by authors that cite X.

General rule based on authors that were mentioned by X

$$ \text{D(X)} \geq \text{MAX(B(Yi))} \quad (0\,(\text{I})) $$

$$ \text{D(X)} \geq \text{MAX(B(Yi))} + \text{MIN} \quad (1\,(\text{H})) $$

X must have been alive when he referred to Yi, so the birth year of the latest-born author Yi is a lower estimate for X’s death year. The heuristic rule adds MIN, which is the minimum age at which Yi starts to write his responsa, so that the bound becomes the earliest possible publishing year of the latest-born Yi.

General rule based on authors that referred to X

$$ \text{B(X)} \leq \text{MIN(D(Yi))} - \text{MIN} \quad (2\,(\text{H})) $$

All Yi must have been alive when they referred to X, and X must have been old enough to publish. Hence, we can use the earliest death year amongst such authors Yi as an upper estimate of X’s earliest possible publication age (and thus his birth year).

General rule based on years Y mentioned in X’s documents

$$ \text{D(X)} \geq \text{MAX(Y)} \quad (3\,(\text{I})) $$

X must have been alive when he mentioned the year Y. We can therefore use the most recent year mentioned by X as a lower estimate of X’s death year.
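As a minimal sketch, the general rules (0) to (3) can be written as two small functions. The function names and the sample years are hypothetical; only the value MIN = 30 is taken from the definitions above.

```python
MIN_AGE = 30  # MIN in the text


def death_lower_bounds(births_of_cited, years_mentioned, min_age=MIN_AGE):
    """Lower bounds on D(X) from authors and years mentioned by X."""
    latest_birth = max(births_of_cited)
    return {
        "rule 0 (I)": latest_birth,             # D(X) >= MAX(B(Yi))
        "rule 1 (H)": latest_birth + min_age,   # D(X) >= MAX(B(Yi)) + MIN
        "rule 3 (I)": max(years_mentioned),     # D(X) >= MAX(Y)
    }


def birth_upper_bound(deaths_of_citers, min_age=MIN_AGE):
    """Rule 2 (H): B(X) <= MIN(D(Yi)) - MIN for authors Yi who cite X."""
    return min(deaths_of_citers) - min_age


# Hypothetical input years, for illustration only.
bounds = death_lower_bounds(births_of_cited=[1850, 1898],
                            years_mentioned=[1940, 1951])
upper_birth = birth_upper_bound(deaths_of_citers=[1970, 1990])
print(bounds, upper_birth)
```

Since all three death-year rules are lower bounds, the largest of them is the tightest one that this group of rules supports.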

Posthumous Key-Phrase Rules

Posthumous rules estimate the birth and death years of an author X based on references of authors who refer to X with the key-phrase “late” (“of blessed memory”) or on references of X that mention other authors with the key-phrase “late”. Figure 1 describes possible situations where various types of authors Yi (i = 1, 2, 3) refer to X with the key-phrase “late”. The lines depict writers’ life spans; the left edges represent the birth years and the right edges represent death years. In this case (as all Yi refer to X with the key-phrase “late”), we know that all Yi passed away after X, but we do not know when they were born in relation to X’s birth. Y1 was born before X’s birth; Y2 was born after X’s birth but before X’s death; and Y3 was born after X’s death.

$$ \text{D(X)} \leq \text{MIN(D(Yi))} \quad (4\,(\text{I})) $$

However, we know that X must have been dead when Yi referred to him with the key-phrase “late”; thus, we can use the earliest death year among the authors Yi as an upper estimate for X’s death year. Like all writers, dead writers of course have to comply with rule (2) as well.

Now, we look at the cases where the author X that we are studying refers to other authors Yi with the key-phrase “late”. Figure 2 describes possible situations where X refers to various types of authors Yi (i = 1, 2, 3) with the key-phrase “late”. All Yi passed away before X’s death (or X may still be alive). Y1 died before X’s birth; Y2 was born before X’s birth and died when X was still alive; Y3 was born after X’s birth and passed away when X was still alive.

$$ \text{D(X)} \geq \text{MAX(D(Yi))} \quad (5\,(\text{I})) $$

X must have been alive after the death of all Yi whom he referred to with the key-phrase “late”. Therefore, we can use the latest death year among the authors Yi as a lower estimate for X’s death year.

$$ \text{B(X)} \geq \text{MAX(D(Yi))} - \text{MAX} \quad (6\,(\text{H})) $$

X was still alive after the death of the latest-dying person whom he referred to as “late”, and X lived at most MAX years. Thus, we use the latest death year among the authors Yi minus the maximal life span MAX as a lower estimate for X’s birth year.
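The posthumous rules (4) to (6) reduce to a minimum and a maximum over death years. The sketch below assumes the two lists of death years have already been extracted; the function name and the sample years are hypothetical.

```python
MAX_AGE = 100  # MAX in the text


def posthumous_bounds(deaths_of_citers, deaths_of_cited, max_age=MAX_AGE):
    """Bounds from "late" references.

    deaths_of_citers: death years of authors Yi who call X "late".
    deaths_of_cited:  death years of authors Yi whom X calls "late".
    """
    bounds = {}
    if deaths_of_citers:
        # Rule 4 (I): D(X) <= MIN(D(Yi))
        bounds["D upper, rule 4"] = min(deaths_of_citers)
    if deaths_of_cited:
        latest_death = max(deaths_of_cited)
        # Rule 5 (I): D(X) >= MAX(D(Yi))
        bounds["D lower, rule 5"] = latest_death
        # Rule 6 (H): B(X) >= MAX(D(Yi)) - MAX
        bounds["B lower, rule 6"] = latest_death - max_age
    return bounds


# Hypothetical death years, for illustration only.
late_bounds = posthumous_bounds(deaths_of_citers=[1980, 1995],
                                deaths_of_cited=[1940, 1953])
print(late_bounds)
```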

Contemporary Key-Phrases Rules

Contemporary key-phrases rules bound the birth year of a writer X from below and his death year from above, based only on the references of known writers who refer to X as their friend/rabbi. This means there must have been at least some period of time when both were alive together. Figure 3 shows possible situations where various types of writers Yi refer to X as their friend/rabbi. Y1 was born before X’s birth and died before X’s death; Y2 was born before X’s birth and died after X’s death; Y3 was born after X’s birth and passed away before X’s death; Y4 was born after X’s birth and passed away after X’s death. Like all writers, contemporary authors of course have to comply with rules 1 and 2 as well.

$$ \text{B(X)} \geq \text{MIN(B(Yi))} - (\text{MAX} - \text{MIN}) \quad (7\,(\text{H})) $$

All Yi must have been alive when X was alive, and all of them must have been old enough to publish. Thus, X could not have been born more than MAX − MIN years before the earliest birth year amongst all authors Yi.

$$ \text{D(X)} \leq \text{MAX(D(Yi))} + (\text{MAX} - \text{MIN}) \quad (8\,(\text{H})) $$

Again, all Yi must have been alive when X was alive, and all of them must have been old enough to publish. Hence, X could not have been alive more than MAX − MIN years after the latest death year amongst all writers Yi.
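Rules (7) and (8) widen the span covered by the contemporaries of X by MAX − MIN years on each side. A minimal sketch, with a hypothetical function name and hypothetical sample years:

```python
MIN_AGE = 30   # MIN in the text
MAX_AGE = 100  # MAX in the text


def contemporary_bounds(births, deaths, min_age=MIN_AGE, max_age=MAX_AGE):
    """Rules 7 and 8 for authors Yi who call X "friend" or "rabbi".

    Returns (lower bound on B(X), upper bound on D(X)).
    """
    slack = max_age - min_age
    birth_lower = min(births) - slack   # rule 7 (H)
    death_upper = max(deaths) + slack   # rule 8 (H)
    return birth_lower, death_upper


# Hypothetical contemporaries born in 1880 and 1900, died in 1950 and 1970.
birth_lower, death_upper = contemporary_bounds([1880, 1900], [1950, 1970])
print(birth_lower, death_upper)
```

With MAX = 100 and MIN = 30 the slack is 70 years, so these two rules on their own leave a wide interval and are most useful in combination with the other rules.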

4.2 Greedy Rules

Greedy rules bounds are sensible but can sometimes lead to wrong estimates.

Greedy rule based on authors who are mentioned by X

$$ \text{B(X)} \geq \text{MAX(B(Yi))} - \text{MIN} \quad (9\,(\text{G})) $$

Many of the references in our research domain relate to dead authors. Thus, most of the references within X’s texts relate to dead authors. Namely, many Yi were born before X’s birth and died before X’s death. Thus, a greedy assumption would be that X was born no earlier than the birth of the latest author mentioned by X; however, because there may be at least one case where Y was born after X was born, we subtract MIN.

Greedy rule based on references to year Y made by X

$$ \text{B(X)} \geq \text{MAX(Y)} - \text{MIN} \quad (10\,(\text{G})) $$

When X mentions years, he usually writes the current year in which he wrote the document or a few years ahead. Most of the time, the maximum year, Y, minus MIN is larger than X’s birth year.

Greedy rule based on authors who refer to X

$$ \text{D(X)} \leq \text{MIN(D(Yi))} - \text{MIN} \quad (11\,(\text{G})) $$

As mentioned above, most of the references within Yi texts refer to X as being dead. Hence, most Yi died after X’s death. Therefore, a greedy assumption would be that X died no later than the death of the earliest author who referred to X minus MIN.

Rules 12–17 refine rules 9–11. Rules 12–14 apply when X refers to Yi, and rules 15–17 apply when Yi refer to X.

Greedy rule for defining the birth year based only on authors who were referred to by X with the key-phrase “late”

$$ \text{B(X)} \geq \text{MAX(D(Yi))} - \text{MIN} \quad (12\,(\text{G})) $$

When taking into account only references that were written in X’s texts, most of the references are related to dead authors. That is, most Yi died before X’s birth. Moreover, an author does not write from his birth; rather, he usually begins near his death. Thus, a greedy assumption would be that X was born no earlier than the death of the latest author mentioned by X minus MIN.

Greedy rule for defining the birth year based only on authors who are mentioned by X with the key-phrase “friend”

$$ \text{B(X)} \leq \text{MIN(B(Yi))} + \text{RABBI\_DIS} \quad (13\,(\text{G})) $$

When taking into account only references that are mentioned by X, which are related to contemporary authors, a greedy rule could be that X was born no later than the birth of the earliest author mentioned by X with the key-phrase “friend”. Because many times the older author refers to the younger author as “friend”, we need to add RABBI_DIS.

Greedy rule for defining the birth year based only on authors who are mentioned by X with the key-phrase “rabbi”

$$ \text{B(X)} \leq \text{MIN(B(Yi))} + \text{RABBI\_DIS} \quad (14\,(\text{G})) $$

When taking into account only references written in X’s texts, which are related to contemporary authors, a greedy rule could be that X was born no later than the birth of the earliest author mentioned by X as a “rabbi”. Due to the age difference between a student and his rabbi being approximately 20 years, we need to add RABBI_DIS.

Greedy rule for defining the death year of X based only on authors who referred to X with the key-phrase “late”

$$ \text{D(X)} \leq \text{MIN(B(Yi))} + \text{MIN} \quad (15\,(\text{G})) $$

When taking into account only references written in Yi texts that refer to X with the key-phrase “late”, a greedy assumption could be that X died no later than the birth of the earliest author who referred to X with the key-phrase “late”; because an author does not write from birth, we need to add MIN.

Greedy rule for defining the death year of X based only on authors who referred to X with the key-phrase “friend”

$$ \text{D(X)} \geq \text{MAX(D(Yi))} - \text{RABBI\_DIS} \quad (16\,(\text{G})) $$

When taking into account only references written in Yi texts that refer to X with the key-phrase “friend”, all Yi must have been alive when X was alive, and all of them must have been old enough to publish; also, many times, the older author refers to the younger author with the key-phrase “friend”, and the opposite never occurs. Therefore, a greedy assumption would be that X died no earlier than the death of the latest author who referred to X with the key-phrase “friend” minus RABBI_DIS.

Greedy rule for defining the death year of X based only on authors who referred to X with the key-phrase “rabbi”

$$ \text{D(X)} \geq \text{MAX(D(Yi))} - \text{RABBI\_DIS} \quad (17\,(\text{G})) $$

This follows the same principle as the rule for defining the birth year, but because this time the student mentions the rabbi, we need to subtract RABBI_DIS.
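The greedy rules (9) to (17) share one shape: take the maximum or minimum of a list of years and add or subtract a constant. The sketch below encodes them as a table. The dictionary layout of a reference (`birth`, `death`, `phrase`), the function names and the sample data are hypothetical; a rule is skipped when no reference supports it.

```python
MIN_AGE = 30    # MIN in the text
RABBI_DIS = 20  # RABBI_DIS in the text


def years_of(refs, field, phrase=None):
    """Birth or death years of referenced authors, optionally by key-phrase."""
    return [r[field] for r in refs if phrase is None or r["phrase"] == phrase]


def greedy_bounds(cited, citers, years_mentioned):
    """Evaluate greedy rules 9-17 for author X.

    cited:  authors Yi mentioned by X; citers: authors Yi who mention X.
    Returns {rule number: (bounded year "B" or "D", relation, value)}.
    """
    spec = [
        # (rule, bounded year, relation, values, aggregate, offset)
        (9, "B", ">=", years_of(cited, "birth"), max, -MIN_AGE),
        (10, "B", ">=", list(years_mentioned), max, -MIN_AGE),
        (11, "D", "<=", years_of(citers, "death"), min, -MIN_AGE),
        (12, "B", ">=", years_of(cited, "death", "late"), max, -MIN_AGE),
        (13, "B", "<=", years_of(cited, "birth", "friend"), min, RABBI_DIS),
        (14, "B", "<=", years_of(cited, "birth", "rabbi"), min, RABBI_DIS),
        (15, "D", "<=", years_of(citers, "birth", "late"), min, MIN_AGE),
        (16, "D", ">=", years_of(citers, "death", "friend"), max, -RABBI_DIS),
        (17, "D", ">=", years_of(citers, "death", "rabbi"), max, -RABBI_DIS),
    ]
    return {
        rule: (target, relation, aggregate(values) + offset)
        for rule, target, relation, values, aggregate, offset in spec
        if values
    }


# Hypothetical references, for illustration only.
cited = [
    {"birth": 1800, "death": 1870, "phrase": "late"},
    {"birth": 1860, "death": 1930, "phrase": "rabbi"},
    {"birth": 1885, "death": 1960, "phrase": "friend"},
]
citers = [
    {"birth": 1900, "death": 1980, "phrase": "friend"},
    {"birth": 1940, "death": 2010, "phrase": "late"},
]
rules = greedy_bounds(cited, citers, years_mentioned=[1925, 1948])
```

Keeping the rules in a table makes it easy to switch individual rules on and off when different combinations are compared.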

4.3 Birth and Death Year Tuning

Application of the Heuristic and Greedy rules can lead to abnormalities, such as an author’s death age being unreasonably old or young. Another possible anomaly is that the algorithm may result in a death year greater than the current year (i.e., 2015). Hence, we added some tuning rules: D – death year, B – birth year, age = D − B.

Current Year: if (D > 2015) {D = 2015}, i.e., if the current year is 2015, then the algorithm must not give a death year greater than 2015.

Age: if (age > 100), {z = age − 100; D = D − z/2; B = B + z/2}, and if (age < 30), {z = 30 − age; D = D + z/2; B = B − z/2}. Our postulate is that a writer lived at least 30 years and no more than 100 years. Thus, if the age according to the algorithm is greater than 100, we take the difference between that age and 100, and then we divide that difference by 2 and normalize D and B to result in an age of 100.
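The two tuning rules translate almost directly into code. In the sketch below the order of application (Current Year first, then Age) follows the order of presentation above and is otherwise an assumption; the function name `tune` is hypothetical.

```python
CURRENT_YEAR = 2015


def tune(birth, death, min_age=30, max_age=100, current_year=CURRENT_YEAR):
    """Apply the Current Year rule, then the Age rule, to an estimate."""
    # Current Year: the death year must not exceed the current year.
    if death > current_year:
        death = current_year
    age = death - birth
    if age > max_age:
        # Shrink the life span symmetrically to exactly max_age years.
        z = age - max_age
        death -= z / 2
        birth += z / 2
    elif age < min_age:
        # Stretch the life span symmetrically to exactly min_age years.
        z = min_age - age
        death += z / 2
        birth -= z / 2
    return birth, death


print(tune(1800, 1920))  # span of 120 years is shrunk to 100
print(tune(1950, 1970))  # span of 20 years is stretched to 30
print(tune(1900, 2030))  # death year is capped at 2015 first
```

Note that stretching a short life span can push the death year past the current year again; the sketch does not re-apply the Current Year rule in that case.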

4.4 Example of the Use of a Certain Heuristic Rule and the Key-Phrase “Late”

Below we present texts written by Rabbi Herzog Yitzchak (1889–1959):

(1)
Responsa Hechal Yitzchak Even HaEzer part 2 chapter 43

… and inspecting (book) bayit chadash chapter 134 that was ahead of me to the differences between before writing etc. (to this bayit chadash, turned my attention the genius Rabbi Yehuda Ades, One of the religious court judges of the court shlit”a) …

(2)
Responsa Hechal Yitzchak Even HaEzer part 2 chapter 47

… and here I am now looking at the responsa of Maharsham of blessed memory in part 2 chapter 140 in the matter of deaf mute who finished their school … that the divorce certificate of a deaf is as like divorce certificates, It is not clear to me at all. And in my opinion the objections of the genius Rabbi Ben-Zion Uziel of blessed memory are not opposite to him.

(3)
Responsa Hechal Yitzchak Even HaEzer part 2 chapter 77

The response of the genius the author of Chazon Ish book of blessed memory in the previous question …

Figure 4 shows the timeline of the authors that are relevant to the texts that appear above (Table 2).

Table 2. Birth and death years of authors that relate to the example


Table 3. Full details about the citations in the corpora


Table 4. Current results vs. HaCohen-Kerner and Mughaz [25] results for the 12 authors corpus (the results are with years deviation)


We use a heuristic rule to improve the assessment of the death year of Rabbi Herzog Yitzchak:

Activate the iron rule (formula (0(I))): D(X) >= MAX(B(Yi)) → D(X) >= 1898 (Distance from the real death year (1959) is 61)

Activate the heuristic rule (formula (1(H))): D(X) >= MAX(B(Yi)) + MIN → D(X) >= 1898 + 30 = 1928 (Distance from the real death year (1959) is 31)

We use the key-phrase “late” to improve the assessment of the death year of Rabbi Herzog Yitzchak:

Activate the iron rule (formula (0(I))): D(X) >= MAX(B(Yi)) → D(X) >= 1898 (Distance from the real death year (1959) is 61)

Activate the iron key-phrase “late” rule (formula (5(I))): D(X) >= MAX(D(Yi)) → D(X) >= 1953 (Distance from the real death year (1959) is 6)

We can see that the heuristic rule (1(H)) improves the result. However, with the use of the key-phrase “late”, rule (5(I)), the result is much better.

5 The Model

The main steps of the model are presented below

1.
Cleaning the texts. Because the responsa may have undergone some editing, we must make sure to ignore the possible effects of differences in the texts resulting from variant editing practices. Therefore, we eliminate all orthographic variations.
2.
Boosting mining key-phrases and key-words.
3.
Normalizing the references in the texts. For each author, we normalize all types of references that refer to him (e.g., various variants and spellings of his name, books, documents and their nicknames and abbreviations). For each author, we collect all references syntactic styles that refer to him and then replace them with a unique string.
4.
Building indexes, e.g., authors, references to “late”/“friend”/“rabbi”, and calculating the frequencies of each item.
5.
Performing various combinations of “iron-clad” and heuristic rules on the one hand and greedy rules on the other hand to estimate the birth and death years of each tested author.
6.
Calculating averages for the best “iron-clad”, heuristic and greedy versions.
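Step 3 above (normalizing the references) can be sketched as a dictionary-driven replacement. The sketch assumes the variants of each author are given as plain strings and replaces them with one unique token per author; it ignores Hebrew prefixes and word boundaries. The token format `<AUTHOR_RASHI>` and the function name `build_normalizer` are hypothetical.

```python
import re


def build_normalizer(variants_by_author):
    """Return a function that replaces every known variant by its author token.

    variants_by_author: dict mapping a unique author token to a list of
    surface forms (names, nicknames, abbreviations, book titles).
    """
    lookup = {}
    for token, variants in variants_by_author.items():
        for variant in variants:
            lookup[variant] = token
    # Longer variants first, so that "Rashi in the pardes" is preferred
    # over its prefix "Rashi" at the same position.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in alternatives))

    def normalize(text):
        return pattern.sub(lambda match: lookup[match.group(0)], text)

    return normalize


normalize = build_normalizer({
    "<AUTHOR_RASHI>": ["Rashi", "Shlomo Yitzhaki", "Rashi in the pardes"],
})
normalized = normalize("as Rashi in the pardes wrote, and Shlomo Yitzhaki agreed")
print(normalized)
```

After normalization, building the indexes of step 4 amounts to counting occurrences of each author token.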

6 Examined Corpus, Experiments and Results

The documents of the examined corpus were downloaded from Bar-Ilan University’s Responsa Project. The examined corpus contains 15,495 responsa written by 24 scholars, averaging 643 files for each scholar. The total number of characters in the whole corpus is 127,683,860 chars, and the average number of chars for each file is 8,240 chars. These authors lived over a period of 229 years (1786–2015). These files contain references; each reference pattern can be expanded into many other specific references [26].

Reference identification was performed by comparing each word to a list of 339 known authors and many of their books. This list of 25,801 specific references refers to the names, nicknames and abbreviations of these authors and their writings. Basic references were collected and all other references were produced from them.

We split the data into two corpora: (1) 10,561 responsa authored by 12 rabbis, with an average of 876 files for each scholar and each file containing an average of 1800 words spread over 135 years (1880–2015); (2) 15,495 responsa authored by 24 rabbis, with an average of 643 files for each rabbi and each file containing an average of 1609 words spread over 229 years (1786–2015) (the set of 24 rabbis contains the group of 12 rabbis). For more detailed information on the data set, refer to Table 5 in the appendix at the end of this article.

Table 5. Full details about the data set

Because of the nature of the problem, it is difficult to appraise the results in the sense that although we can compare how close the system guess is to the actual birth or death years, we cannot assess how good the results are, i.e., there is no real notion of what a ‘good’ result is. For now, we use the notion Distance, which is defined as the estimated value minus the ground truth value.
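As a small illustration, the Distance measure can be written as below. The signed form follows the definition in the text; averaging the absolute values over several estimates is an assumption of this sketch, since the text does not state how the per-author distances are aggregated. The two estimates are the lower bounds from the example in Sect. 4.4, used here only as sample inputs.

```python
def distance(estimated_year, true_year):
    """Distance as defined in the text: estimated value minus ground truth."""
    return estimated_year - true_year


def mean_absolute_deviation(estimates, truths):
    """Assumed aggregate: the average of the absolute distances."""
    pairs = list(zip(estimates, truths))
    return sum(abs(distance(e, t)) for e, t in pairs) / len(pairs)


# Sample inputs: the two death-year bounds from Sect. 4.4 against 1959.
estimates = [1898, 1953]
truths = [1959, 1959]

signed = [distance(e, t) for e, t in zip(estimates, truths)]
average = mean_absolute_deviation(estimates, truths)
print(signed, average)
```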

The outcomes appear in the following histograms. Each histogram shows the results of one algorithm, Iron+Heuristic or Greedy, and each algorithm was run on two groups of authors: a group of 12 writers and a group of 24 writers. Every run yields estimated birth years and estimated death years, and the histograms show the best birth/death deviation results. Each histogram has eight columns arranged in two quartets: the left quartet gives the deviations from the birth year, and the right quartet gives the deviations from the death year. Within a quartet, the four columns are the deviation without a key-phrase (or with the year that was mentioned in the text), with the “late” key-phrase, with the “rabbi” key-phrase, and with the “friend” key-phrase. Moreover, we used two manipulations: Age and Current year. The column with a gray background contains the best results. With 8 columns (results) per histogram and 16 histograms, there are 128 results in total.

The Age manipulation is very helpful; we used it in 94.5% of the experiments (i.e., 121/128 = 0.945) for all of the refinements, in both algorithms, with or without constants.

Examination of the effect of mentioning a year, listed in Figs. 5, 6, 7 and 8, compared with Figs. 9, 10, 11 and 12 regarding death year deviation, indicates that the contribution of referencing a year leads to an improvement of 2.8 years on average.

This phenomenon is more noticeable in Iron+Heuristic (average upswing of 4.2 years) than with Greedy (average deviation upswing of 1.4 years). The main reason for this is that a writer usually writes until close to his death. Additionally, when a year is mentioned in the text, it is often the year in which the writer wrote the document. Because an author writes, in many cases, until near his death year, the maximum year mentioned in his texts is close to the year of his death.

In contrast to its effect on death year assessment, year mentioning has a negative impact on birth year assessment: the deviation increases by 10.4 years on average. Note that we are evaluating here only the impact of the year mentioned in the text; if the results without the year mentioned are better than the results with it, we should not use it. For example, the birth year result using Greedy rules, without year mentioning and without any refinement, has a deviation of 16.7 years for the 12 authors. With year mentioning, the deviation is 51.5 years, decreasing the accuracy by 34.83 years. The birth year result using Iron+Heuristic, without year mentioning and without key-phrases, has a deviation of 26.5 years for the 12 authors. With year mentioning, the deviation is 50.7 years, decreasing the accuracy by 24.2 years. An analysis of the formulas shows that the formula that determines the birth year in Greedy (10(G)) uses the most recent year that the writer mentions in his texts. That year is usually near his death, as explained above; therefore, very poor birth year results are obtained, with a decline of 12.5 years. Overall, the Greedy results are better than those of Iron+Heuristic, but year mentioning harms the Iron+Heuristic results less (a decline of 8.4 years). Thus, to estimate the death year, we will use the Iron+Heuristic algorithm with year mentioning and without any key-phrases.

The use of the key-phrase “friend” for birth year assessment gives the best results compared with the other options: “late”, “rabbi” or no key-phrase. This is because friends are of the same generation and more or less the same age; thus, they are born in roughly the same year. Thus, for a writer addressing another author as his friend, the assessment of his birth year will give good results. For the death year, however, this is not assured because there may be a much greater period between the deaths of friends (one may die at the age of 50, while his friend dies at the age of 75). Hence, the “friend” key-phrase usually gives a better birth year assessment than death year assessment.

After we found that the best results for the birth year are always with the “friend” key-phrase (except for one case), we investigated at greater depth and found that this occurs specifically with the use of constants. Constants are important, resulting in an average improvement of 6.3 years in the case of Greedy (for the 12 and 24 authors). In general, a Posek is addressed in responsa after he has become important enough to be mentioned and regarded in the Halachic Responsa, which is usually at an advanced age.

We stated above that the use of Greedy rules with constants gives the greatest improvement. Even without the use of constants, Greedy produces the best results. The reason lies in the formulae; formula (13(G)) finds the lowest birth year from the group of authors that the arbiter mentioned. Unlike the Greedy, the Iron+Heuristic formula (7(H)) reduces the constant (at present, 20); therefore, the results of the Greedy are better. In conclusion, to best assess the birth year, we apply the Greedy algorithm, using constants and also the key-phrase “friend”.

The best results when evaluating birth year occurred when using the Greedy algorithm with constants and without mentioning years. The best results when evaluating death year occurred when using the Iron+Heuristic algorithm with constants and without mentioning years. When we compare these results with the results shown in Figs. 17, 18, 19 and 20, we find that in the case of Greedy, when we add more authors, there is an improvement in only one case, i.e., for the 12 authors using the “late” key-phrase; the remaining results show a decline in performance. The reason for this phenomenon may lie in the Greedy formula: when an author is more successful, in addition to being mentioned by others many times, he is mentioned at a younger age by authors who are older than him; therefore, the estimation is less accurate. For example, the estimation of the death year of the late Rabbi Ovadia Yosef has an error of 61 years (instead of 2014, the algorithm result is 1953), determining that he died at an age of 34. Using the Iron+Heuristic algorithm, there was a decrease in two results and an improvement in five results. For Iron+Heuristic, there is an average improvement of 0.64 years and, in fact, the best death year estimation.

For the Greedy algorithm, year mentioning severely impairs the quality of the birth year estimation (as explained above). A possible explanation is that the improvement that comes from using constants cannot overcome the deterioration that comes from year mentioning. In contrast, when assessing the death year, using year mentioning with Iron+Heuristic significantly improves the results, and using constants improves them a little more; therefore, the combination of constants and year mentioning gives a better assessment of the death year. Hence, when assessing birth year and death year, it is not enough to use references; we have to use key-words and key-phrases.

To estimate death year, we will use the year(s) mentioned in the text and constants with the Iron+Heuristic algorithm; to estimate birth year, we will run the Greedy algorithm using constants and the “friend” key-phrase without year mentioning.

The Effect of the Relatively Larger Corpus

We compared the results achieved for the corpora containing responsa written by 12 and by 24 authors. We enlarged not only the number of authors but also the number of responsa: from 10,561 files for 12 authors to 15,495 files for 24 authors, and from 19,011,130 words for 12 authors to 24,930,082 words for 24 authors. The time-frame of the 12 authors spans 135 years, while the time-frame of the 24 authors spans 229 years. Because the span of years of the 24 authors is almost twice as large as that of the 12 authors, we must compare the results relative to the span of years. When we analyze the results in proportion to the span of years, we find that the larger data set gives better results (for 90.6% of the results, as shown in Figs. 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 and 20). For instance, in Fig. 17 (12 authors), the smallest deviation predicting death year, i.e., the deviation of the best result, is 9.2, and 9.2/135 → 6.8%; in Fig. 18 (24 authors), the smallest deviation predicting death year is 13.1, and 13.1/229 → 5.7% (an improvement of 16%). In Fig. 17 (12 authors), the smallest deviation predicting birth year is 16.6, and 16.6/135 → 12.3%; in Fig. 18 (24 authors), the smallest deviation predicting birth year is 22.6, and 22.6/229 → 9.9% (an improvement of 20%).
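The proportional comparison in this paragraph can be reproduced with the following Python sketch. The deviations and spans are the values quoted in the text; the function names are introduced here for illustration only.

```python
def relative_deviation(deviation_years, span_years):
    """Deviation expressed as a fraction of the corpus time-frame."""
    return deviation_years / span_years


def relative_improvement(small_corpus_ratio, large_corpus_ratio):
    """Fractional reduction of the relative deviation in the larger corpus."""
    return 1 - large_corpus_ratio / small_corpus_ratio


SPAN_12, SPAN_24 = 135, 229  # years covered by each corpus

death_12 = relative_deviation(9.2, SPAN_12)   # Fig. 17, best death year result
death_24 = relative_deviation(13.1, SPAN_24)  # Fig. 18, best death year result
birth_12 = relative_deviation(16.6, SPAN_12)  # Fig. 17, best birth year result
birth_24 = relative_deviation(22.6, SPAN_24)  # Fig. 18, best birth year result

death_gain = relative_improvement(death_12, death_24)
birth_gain = relative_improvement(birth_12, birth_24)

print(f"death: {death_12:.1%} -> {death_24:.1%}, improvement {death_gain:.0%}")
print(f"birth: {birth_12:.1%} -> {birth_24:.1%}, improvement {birth_gain:.0%}")
```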

The number of general citations, i.e., citations composed of a name, acronym, abbreviation or book title of an author, is very large (see Table 3). In many cases, the references to an author appear with the affixes “rabbi”, “friend” or “late” (see Sect. 3.1). A citation with a time-related phrase is a citation of an author’s name that appears close to the time-related phrase. Consequently, the occurrences of general citations include the occurrences of references with time-related phrases. From Table 3 we can see that general citations and “late” citations have the largest numbers of occurrences, also per author, so we would expect them to be the most influential. However, as we wrote above, the experiments indicate that the best results are obtained using the time-related phrase “friend” and references to years.

At first glance, Table 3 suggests that the time-related phrase “Rabbi” is negligible relative to the other time-related phrases: its number of occurrences is the lowest, and per author it is less than two. A closer look shows, however, that despite its minimal number of occurrences, the time-related phrase “Rabbi” sometimes gives better results than general citations or references with the time-related phrase “late” (the two with the largest numbers of occurrences). For example, in Fig. 15, the Greedy algorithm for birth year assessment using references with the time-related phrase “Rabbi” gives a deviation of 13.7 years from the true birth years; in the same figure, references with the time-related phrase “late” give a deviation of 22.3 years, which is worse by 62.8%, and general citations give a deviation of 16.8 years, which is worse by 22.6%. We can see a similar phenomenon in Figs. 6 and 9, using the Iron+Heuristic algorithm to estimate the birth years. A key-phrase with few occurrences can achieve a better result than a key-phrase with many occurrences because of the structure of the rules (Sects. 4.1 and 4.2), which use the minimum and maximum functions. By the nature of these functions, a few citations can determine the result. As a result, references with the time-related phrase “Rabbi”, which appears only a few times, occasionally give better results than general citations or the time-related phrase “late”, which appear much more often.
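The sensitivity of the minimum function to a single citation can be seen in a toy Python sketch. It assumes a simplified estimate in the spirit of formula (13(G)) as described above, namely the lowest birth year among the cited authors. All years and both citation groups are hypothetical and were invented only to show the mechanism.

```python
def min_birth_estimate(cited_birth_years):
    """Simplified estimate: the lowest birth year among the cited authors."""
    return min(cited_birth_years)


true_birth = 1900  # hypothetical author

# Hypothetical birth years of cited authors in two groups.
general_citations = [1790, 1822, 1850, 1871, 1880, 1893]  # many occurrences
rabbi_citations = [1878, 1885]                            # only two occurrences

deviation_general = abs(min_birth_estimate(general_citations) - true_birth)
deviation_rabbi = abs(min_birth_estimate(rabbi_citations) - true_birth)
print(deviation_general, deviation_rabbi)
```

In this toy case the single oldest citation fixes the estimate of the large group, so the size of the group does not help, while the two citations of the small group happen to lie closer to the true year.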

Current Research Versus First Research

In this research, various novelties are presented compared with HaCohen-Kerner and Mughaz [25]:

1. There are two corpora of responsa, composed by 12 authors and by 24 authors, instead of one corpus (12 authors and far fewer files);
2. Years that are mentioned in the text documents are used (the text was not labeled with a date or year, but sometimes years can appear in the text, e.g., a quotation from a contract that contains the year of the agreement);
3. Heuristics were added to the Greedy algorithm by adding a few greedy constraints;
4. New “rabbi” constraints were formulated;
5. Two new manipulations, “Current Year” and “Age”, were added.

HaCohen-Kerner and Mughaz [25] examined a corpus of 3,488 responsa authored by 12 Jewish rabbinic scholars, while in this research the current corpus for 24 authors contains 15,495 responsa and the current corpus for 12 authors contains 10,561 responsa. The 3,488 responsa used in HaCohen-Kerner and Mughaz [25] are included in these 10,561 responsa.

Table 4 presents a comparison between the results of our current work and the best results of HaCohen-Kerner and Mughaz [25]. This table shows that three results (out of four results) for 12 authors in the current research are much better (in their quality) than the corresponding results reported in HaCohen-Kerner and Mughaz [25]. Only one result (the birth years using the Greedy algorithm) was slightly worse. The results of 24 authors were not presented in HaCohen-Kerner and Mughaz [25].

Using the Iron+Heuristic algorithm, we reduced the deviation in years, compared with HaCohen-Kerner and Mughaz [25], for death years by about 59% (from 22.67 to 9.2 years) and for birth years by 42% (from 22 to 12.8 years). Using the Greedy algorithm, we reduced the deviation for death years by 32% (from 15.54 to 10.5 years); for the birth years, however, we obtained a slightly worse result, by about 3% (from 13.04 to 13.4 years).
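The percentages in this comparison follow from the quoted deviations; a short Python check is given below. The dictionary layout and the function name are choices of this sketch, and the numbers are the ones stated in the paragraph.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent (negative = reduction)."""
    return (new - old) / old * 100


# (deviation in [25], deviation in the current work), in years, 12 authors.
results = {
    ("Iron+Heuristic", "death"): (22.67, 9.2),
    ("Iron+Heuristic", "birth"): (22.0, 12.8),
    ("Greedy", "death"): (15.54, 10.5),
    ("Greedy", "birth"): (13.04, 13.4),
}

changes = {key: percent_change(old, new) for key, (old, new) in results.items()}
for (algorithm, year_type), change in changes.items():
    print(f"{algorithm:15s} {year_type}: {change:+.1f}%")
```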

7 Summary, Conclusions and Future Work

We investigated the estimation of the birth and death years of authors using year mentioning, the “late” (“of blessed memory”) key-phrase, the “rabbi” key-phrase, the “friend” key-phrase and undated references that are mentioned in documents of other dated authors who refer to the author being considered or are mentioned by him. This research was performed on responsa documents, where special writing rules are applied. The estimation was based on the author’s texts and texts of other authors who refer to the discussed author or are mentioned by him. To do so, we formulated various types of iron-clad, heuristic and greedy rules. The best birth year assessment was achieved by using the Greedy algorithm with constants and the “friend” key-phrase. The best death year assessment was achieved by using the Iron+Heuristic algorithm with year mentioning.

We plan to improve this research by (1) testing new combinations of iron-clad, heuristic and greedy rules, as well as a combination of key-phrases (e.g., “late” and “friend”); (2) improving existing rules and/or formulating new rules; (3) defining and applying heuristic rules that take into account various details included in the responsa, e.g., events, names of people, new concepts and collocations that can be dated; (4) conducting additional experiments using many more responsa written by more authors to improve the estimates; (5) checking why the iron-clad, heuristic and greedy rules tend to produce more positive differences; and (6) testing how much of an improvement we can obtain from a correction of the upper bound of D(x) and how much we will, at some point, use it for a corpus with long-dead authors.

Notes

1. Contained in the Global Jewish Database (The Responsa Project at Bar-Ilan University). http://www.biu.ac.il/ICJI/Responsa.

References

1. Powley, B., Dale, R.: Evidence-based information extraction for high accuracy citation and author name identification. In: RIAO 2007 (2007)
2. Wintner, S.: Hebrew computational linguistics: past and future. Artif. Intell. Rev. 21(2), 113–138 (2004)
3. HaCohen-Kerner, Y., Kass, A., Peretz, A.: HAADS: A Hebrew Aramaic abbreviation disambiguation system. J. Am. Soc. Inf. Sci. Technol. JASIST 61(9), 1923–1932 (2010)
4. Gutwin, C., Paynter, G., Witten, I., Nevill-Manning, C., Frank, E.: Improving browsing in digital libraries with key-phrase indexes. Decis. Support Syst. 27(1), 81–104 (1999)
5. Zhang, Y., Zincir-Heywood, N., Milios, E.: World wide web site summarization. Web Intell. Agent Syst. 2(1), 39–53 (2004)
6. Hulth, A., Megyesi, B.B.: A study on automatically extracted key-words in text categorization. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL, pp. 537–544 (2006)
7. Kim, S.N., Baldwin, T.: Extracting key-words from multi-party live chats. In: Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation, pp. 199–208 (2012)
8. Berend, G.: Opinion expression mining by exploiting key-phrase extraction. In: IJCNLP, pp. 1162–1170 (2011)
9. Liu, Z., Huang, W., Zheng, Y., Sun, M.: Automatic key-phrase extraction via topic decomposition. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 366–376. ACL (2010)
10. Hasan, K.S., Ng, V.: Conundrums in unsupervised key-phrase extraction: making sense of the state-of-the-art. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 365–373. ACL (2010)
11. Medelyan, O., Frank, E., Witten, I.H.: Human-competitive tagging using automatic key-phrase extraction. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 3, pp. 1318–1327. ACL (2009)
12. Kim, S.N., Medelyan, O., Kan, M.Y., Baldwin, T.: Automatic key-phrase extraction from scientific articles. Lang. Resour. Eval. 47(3), 723–742 (2013)
13. Yih, W.T., Goodman, J., Carvalho, V.R.: Finding advertising key-words on web pages. In: Proceedings of the 15th International Conference on World Wide Web, pp. 213–222. ACM (2006)
14. Schomaker, L., Bulacu, M.: Automatic writer identification using connected-component contours and edge-based features of uppercase western script. IEEE Trans. Pattern Anal. Mach. Intell. 26(6), 787–798 (2004)
15. Said, H., Tan, T., Baker, K.: Personal identification based on handwriting. Pattern Recogn. 33(1), 149–160 (2000)
16. Bulacu, M., Schomaker, L.: Text-independent writer identification and verification using textural and allographic features. IEEE Trans. Pattern Anal. Mach. Intell. 29(4), 701–717 (2007)
17. Bar-Yosef, I., Beckman, I., Kedem, K., Dinstein, I.: Binarization, character extraction, and writer identification of historical Hebrew calligraphy documents. IJDAR 9(2–4), 89–99 (2007)
18. Garfield, E.: Can citation indexing be automated? In: Stevens, M. (ed.) Statistical Association Methods for Mechanical Documentation, Symposium Proceedings, vol. 269, pp. 189–192. National Bureau of Standards Miscellaneous Publication, Washington, D.C. (1965)
19. Berkowitz, E., Elkhadiri, M.R.: Creation of a Style Independent Intelligent Autonomous Citation Indexer to Support Academic Research, pp. 68–73 (2004)
20. Giuffrida, G., Shek, E.C., Yang, J.: Knowledge-based metadata extraction from postscript files. In: Proceedings of the 5th ACM conference on Digital libraries, pp. 77–84. ACM (2000)
21. Seymore, K., McCallum, A., Rosenfeld, R.: Learning hidden Markov model structure for information extraction. In: AAAI-1999 Workshop on Machine Learning for Information Extraction, pp. 37–42 (1999)
22. Ritchie, A., Robertson, S., Teufel, S.: Comparing citation contexts for information retrieval. In: The 17th ACM Conference on Information and Knowledge Management (CIKM), pp. 213–222 (2008)
23. Bradshaw, S.: Reference directed indexing: redeeming relevance for subject search in citation indexes. In: Koch, T., Sølvberg, I.T. (eds.) ECDL 2003. LNCS, vol. 2769, pp. 499–510. Springer, Heidelberg (2003). doi:10.1007/978-3-540-45175-4_45
24. HaCohen-Kerner, Y., Beck, H., Yehudai, E., Rosenstein, M., Mughaz, D.: Cuisine: classification using stylistic feature sets and/or name-based feature sets. J. Am. Soc. Inf. Sci. Technol. (JASIST) 61(8), 1644–1657 (2010)
25. HaCohen-Kerner, Y., Mughaz, D.: Estimating the birth and death years of authors of undated documents using undated citations. In: Loftsson, H., Rögnvaldsson, E., Helgadóttir, S. (eds.) NLP 2010. LNCS, vol. 6233, pp. 138–149. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14770-8_17
26. HaCohen-Kerner, Y., Schweitzer, N., Mughaz, D.: Automatically identifying citations in Hebrew-Aramaic documents. Cybern. Syst.: Int. J. 42(3), 180–197 (2011)

Author information

Authors and Affiliations

Department of Computer Science, Bar-Ilan University, 5290002, Ramat-Gan, Israel
Dror Mughaz & Dov Gabbay
Department of Computer Science, Lev Academic Center, 9116001, Jerusalem, Israel
Dror Mughaz & Yaakov HaCohen-Kerner
Department of Informatics, Kings College London, Strand, London, WC2R 2LS, UK
Dov Gabbay

Corresponding author

Correspondence to Dror Mughaz.

Editor information

Editors and Affiliations

Institute of Informatics, Wroclaw University of Technology, Wroclaw, Poland
Ngoc Thanh Nguyen
Swinburne University of Technology, Hawthorn, South Australia, Australia
Ryszard Kowalczyk
University of Lisbon, Lisbon, Portugal
Alexandre Miguel Pinto
Huawei German Research Center, Munich, Germany
Jorge Cardoso

Appendix

Data Set Information.

About this chapter

Cite this chapter

Mughaz, D., HaCohen-Kerner, Y., Gabbay, D. (2017). Mining and Using Key-Words and Key-Phrases to Identify the Era of an Anonymous Text. In: Nguyen, N., Kowalczyk, R., Pinto, A., Cardoso, J. (eds) Transactions on Computational Collective Intelligence XXVI. Lecture Notes in Computer Science(), vol 10190. Springer, Cham. https://doi.org/10.1007/978-3-319-59268-8_6

DOI: https://doi.org/10.1007/978-3-319-59268-8_6
Published: 15 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59267-1
Online ISBN: 978-3-319-59268-8
eBook Packages: Computer Science, Computer Science (R0)
