1 Introduction

The identification of names in documents is typically part of a larger task. For example, in our setting, we were involved in the development of two classifiers to detect privacy-related and sensitive unclassified information in a large digitized collection for the U.S. Department of Energy (DOE) [10]. More specifically, the first classifier looks for Personally Identifiable Information (PII), which refers to any information that identifies, or can be used to identify, contact, or locate, the person to whom it pertains. This includes information used in a way that is personally identifiable, including linking it with identifiable information from other sources, or from which other personally identifiable information can easily be derived, including, but not limited to, name, address, phone number, fax number, email address, financial profile, social security number, and credit card information.

Our second classifier deals with the detection and redaction of sensitive unclassified information. Sensitive Unclassified (SU) information is defined as any unclassified information that may cause adverse consequences for government facilities. The techniques used in the development of these two classifiers are explained in [19, 22]. There are many other applications similar to ours, such as MUC [5] or the 2003 Los Alamos National Lab (LANL) project on Advanced Knowledge Integration In Assessing Terrorist Threats [16], where the task was to identify individuals involved in terrorist activities. Almost all of these applications deal with the extraction of certain entities such as personal names, dates of birth, places of events, acronyms [20], times, and monetary amounts. In this paper, we concentrate solely on personal name extraction. Detecting names in general is difficult because:

  • Many names are composed of words that are uncommon in the culture in which the collection originated, e.g., Barack Obama.

  • Many names are composed of words that are usually used for something other than personal names, e.g., Will Cook.

  • Many non-personal-name phrases are composed of words that are common in personal-name phrases, e.g., Kyle Canyon, Clark County.

Moreover, detecting names in a digitized heterogeneous collection of a large federal bureaucracy is made more difficult by:

  • the broad diversity of contexts in which names are found, and

  • the broad diversity of text styles, such as tables, forms, reports, etc.

Furthermore, detecting names in documents replete with OCR noise is made more difficult by the fact that OCR noise can wipe out occurrences of the features that indicate names.

This paper is organized into eight sections, including this introduction. In Sect. 2 we give a brief introduction to techniques used to solve the name identification problem. Section 3 describes the set of features that are typically used to identify names. Section 4 is a gentle tutorial on FCA, along with a detailed example of how we use FCA to identify personal name concepts in the lattice. Section 5 is a description of our HMM and its implementation. In Sect. 6, we revisit FCA and explain how we can modify our context so that FCA mimics the decoding of states performed by the Viterbi algorithm. Section 7 provides some statistics on the performance of our HMM. In Sect. 8, we explore directions for future work.

2 Background

Relation extraction is the process of identifying certain entities and their associated relationships. For example, one may want to populate a database with extracted names and birthdays, or acronyms and their definitions [20]. In the 2003 Los Alamos National Lab (LANL) project on Advanced Knowledge Integration In Assessing Terrorist Threats, relationships were established between entities by extracting structured data from the text. There are many known techniques for extracting relational information from text. The most popular methods are a combination of HMMs, dictionary lookup, natural language pattern processing, handwritten rules, and ad hoc algorithms [4]. The extraction must first identify entities such as personal names and dates before establishing the relationship birth.

The most cited series of conferences associated with relation extraction is the Message Understanding Conferences (MUC) [5]. The entities of choice in these conferences were personal names (people), organizations, places, and dates. Many systems were developed based on the above-mentioned techniques. Most of the personal name recognizers for MUC intuitively rely on capitalization, lists of known names, titles, and other linguistic clues such as suffixes.

The best-known system based on HMM is BBN's Nymble tagger [1]. We will give a brief description of this system in Sect. 5. In addition to Nymble, there are other notable efforts in name recognition; some examples are [11–13].

3 Linguistic clues, attributes, and emission symbols

In general, name identification relies on characteristics of names in a specific language or domain. For example, in English, personal names are capitalized. Depending on the method, these characteristics are given different terminologies. In rule-based or natural language processing systems, we typically refer to these characteristics as linguistic clues. In statistical methods such as HMMs or classification systems, we refer to them as emission symbols, word classes, or types. In FCA, we generally consider objects and attributes, and therefore these characteristics will be referred to as attributes.

From a computational point of view, a characteristic of a name can be considered an attribute if we can test whether a word has that specific characteristic. In the case of capitalization, we can write a regular expression such as /[A-Z][a-z]+/. Almost all approaches to name identification use lists of first or last names. Again, we can consider membership in a list as an attribute of a word. Therefore, the word Johnson has the attribute name while the word "mathematical" does not. The following is a summary of the information we used to integrate statistics on names into our HMM model. The details of this approach were previously reported in [23].
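As a minimal illustration of such attribute tests (our sketch; the word lists here are stand-ins for the actual Census-derived lists described below):

```python
import re

# Capitalization attribute: an initial capital followed by lowercase letters.
CAP_RE = re.compile(r"^[A-Z][a-z]+$")

# Illustrative name lists; in practice these are compiled from Census data.
FIRST_NAMES = {"lisa", "greg"}
LAST_NAMES = {"johnson", "heck", "lemon"}

def attributes(word):
    """Return the set of attributes that hold for a word."""
    attrs = set()
    if CAP_RE.match(word):
        attrs.add("cap")
    if word.lower() in FIRST_NAMES:
        attrs.add("first")
    if word.lower() in LAST_NAMES:
        attrs.add("last")
    return attrs

print(attributes("Johnson"))        # {'cap', 'last'}
print(attributes("mathematical"))   # set()
```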

We use lists of names and their relative frequencies compiled from U.S. Census data [24]. We also use a list of non-name words along with their frequencies in a training set from our collection of documents from a federal agency. We used five word lists to compile a single master list to be used at run time.

The first list is the non-name list: words that, in our collection, are often capitalized but rarely refer to personal names. It consists mostly of names of governmental and commercial organizations and geographical names, e.g., "River". To build this list, we tracked the frequency of each string matching the regular expression /[A-Z][a-z]+/ and put the 7000 most common strings into an initial list, which we then manually purged of words that are more common as personal names. Furthermore, as we developed the name-finder, we added many other words that had been underrepresented in previous versions.

The next step was to form the non-person name list. We went through the training set and tallied every word occurrence that either did not match /[A-Z][a-z]+/ or did match an entry in the non-name list. For each word that made the list, we recorded its frequency: the number of occurrences divided by the total number of non-personal-name word occurrences in the text. The list was sorted in descending order of frequency, and based on this order, each entry was assigned a cumulative frequency.
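The frequency and cumulative-frequency bookkeeping can be sketched as follows; the function is our illustration, and the real input is the stream of non-personal-name word occurrences from the training set:

```python
from collections import Counter

def build_frequency_list(word_occurrences):
    """Return (word, frequency, cumulative_frequency) entries,
    sorted in descending order of frequency."""
    counts = Counter(word_occurrences)
    total = sum(counts.values())
    entries, cumulative = [], 0.0
    for word, count in counts.most_common():
        freq = count / total
        cumulative += freq
        entries.append((word, freq, cumulative))
    return entries

# Toy usage; each tuple carries the entry's frequency and running total.
for entry in build_frequency_list(["the", "of", "the", "river", "the"]):
    print(entry)
```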

The next two lists, first names and last names, were taken from the 1990 U.S. Census. Like the non-person name word list, these included the frequency and cumulative frequency for each entry.

Finally, the list of titles, such as 'Ms.' and 'Senator', was manually compiled and carries the same kind of information as the name lists and the non-person name word list.

As mentioned previously, we can view these lists as attributes of objects in FCA, as seen in Fig. 1.

4 Formal concept analysis

Formal concept analysis (FCA) is a mathematical theory based on the concept of a Galois lattice [2, 14]. Methodologies based on FCA facilitate the identification of concepts (groupings of objects) based on their attributes.

FCA has been used in many applications to extract both concepts and rules. Notable examples are the use of FCA in patient classification [3], the characterization of human cognitive processes [25], computational approaches to two-way learning [26], and granular concept learning methods [8]. The closely related rough sets and FCA have also been combined for rule acquisition in the recent work of [9]. It is also worth mentioning how FCA has been used to model human memory and some of its cognitive functionality [7].

Following Stumme [18], a formal context is defined as a triple \((G, M, I)\), where G is a set of objects, M is a set of attributes, and I is a binary relation over \(G \times M\).

For a set \(A \subseteq G\), the set \(A'\) is defined as \(\{ m \in M \mid \forall g \in A, (g,m) \in I\}\), dually, for a set \(B \subseteq M\), the set \(B'\) is defined as \(\{ g \in G \mid \forall m \in B, (g,m) \in I\}\).

A formal concept is a pair \((A, B)\) such that \(A' = B\) and \(B' = A\). A is called the extent, and B is called the intent of the concept.

There are many interesting examples of formal concepts in Stumme [18], and readers are encouraged to look at them. We will shortly give a detailed example dealing with name identification.

As usual, a partial ordering \(\le\) is defined on pairs of concepts as follows:

\((A_1, B_1) \le (A_2, B_2)\) iff \(A_1 \subseteq A_2\) or equivalently \(B_2 \subseteq B_1\)

The set of concepts with this partial ordering forms a complete lattice. Let \(C_1\) and \(C_2\) be the concepts \((A_1, B_1)\) and \((A_2, B_2)\), respectively; then the join of \(C_1\) and \(C_2\), written \(C_1 \sqcup C_2\), is defined as

$$C_1 \sqcup C_2 = ( (B_1 \cap B_2)', (B_1 \cap B_2) )$$
(1)
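As a small illustration (ours, not the paper's implementation), the derivation operators and this join can be written directly from the definitions above; the toy context below is illustrative:

```python
# Toy formal context: each object mapped to its attribute set (the relation I).
CONTEXT = {
    "Lisa": {"cap", "first"},
    "Greg": {"cap", "first"},
    "Heck": {"cap", "last"},
}

def obj_prime(A, ctx):
    """A' = attributes shared by every object in A."""
    attrs = set().union(*ctx.values())
    for g in A:
        attrs &= ctx[g]
    return attrs

def attr_prime(B, ctx):
    """B' = objects possessing every attribute in B."""
    return {g for g, a in ctx.items() if B <= a}

def join(c1, c2, ctx):
    """C1 ⊔ C2 = ((B1 ∩ B2)', B1 ∩ B2), as in Eq. (1)."""
    B = c1[1] & c2[1]
    return (attr_prime(B, ctx), B)

c_first = (attr_prime({"cap", "first"}, CONTEXT), {"cap", "first"})
c_last = (attr_prime({"cap", "last"}, CONTEXT), {"cap", "last"})
print(join(c_first, c_last, CONTEXT))   # extent {Lisa, Greg, Heck}, intent {cap}
```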

There are many ways to find all the formal concepts. The intersection method is a bottom-up approach and is widely used [17]. Here is an outline of the algorithm, followed by a sketch of it in code:

  1. Compute the bottom element of the lattice.

  2. Compute all the concepts from singletons, i.e., \((\{ x \}'', \{ x \}')\). This is the ConceptList.

  3. Repeatedly, for all pairs of concepts \(C_1\) and \(C_2\) in the ConceptList such that \(C_1 \not \le C_2\) and \(C_2 \not \le C_1\), compute \(C_1 \sqcup C_2\). Add it to the ConceptList if it is a new concept.
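The following Python code is a concrete, minimal sketch of this outline; as its input it uses the context of Fig. 1 from the example in Sect. 4.1 below, and all function and variable names are ours, for illustration only:

```python
from itertools import combinations

# The formal context of Fig. 1 (Sect. 4.1): each object and its attributes.
CONTEXT = {
    "Like":        {"cap", "non-personal"},
    "American":    {"cap", "non-name", "non-personal"},
    "Lisa":        {"cap", "first"},
    "Greg":        {"cap", "first"},
    "Heck":        {"cap", "last"},
    "Lemon":       {"cap", "last"},
    "spokesman":   {"title"},
    "congressman": {"title"},
    "Politico":    {"cap", "non-name"},
    "OTHERS":      {"non-personal"},
}

def attr_prime(B, ctx):
    """B' = objects possessing every attribute in B."""
    return frozenset(g for g, attrs in ctx.items() if B <= attrs)

def intersection_method(ctx):
    M = frozenset().union(*ctx.values())
    # Step 1: the bottom element (M', M).
    concepts = {(attr_prime(M, ctx), M)}
    # Step 2: concepts from singletons, ({x}'', {x}').
    for x in ctx:
        B = frozenset(ctx[x])                  # {x}'
        concepts.add((attr_prime(B, ctx), B))  # ({x}'', {x}')
    # Step 3: close the list under joins of incomparable pairs.
    changed = True
    while changed:
        changed = False
        for (A1, B1), (A2, B2) in combinations(list(concepts), 2):
            if A1 <= A2 or A2 <= A1:
                continue                       # comparable: join already known
            B = B1 & B2
            c = (attr_prime(B, ctx), B)
            if c not in concepts:
                concepts.add(c)
                changed = True
    return concepts

for extent, intent in sorted(intersection_method(CONTEXT),
                             key=lambda c: len(c[0])):
    print(sorted(extent), "|", sorted(intent))
```

Running this prints the bottom element, the concepts \(C_0\) through \(C_7\), and the top element derived in Sect. 4.1.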

Here is a detailed example:

4.1 Name identification example

Consider the following excerpt from Las Vegas Sun on June 23, 2012:

[Text excerpt from the Las Vegas Sun, June 23, 2012]

In Fig. 1, we give a characterization of all the objects in this text. We use the object OTHERS to represent all non-significant words of the text (e.g., many, after, we) to keep the table manageable.

Fig. 1: A characterization of the newspaper text

The intersection method for concept computation proceeds straightforwardly as follows:

For the first object, Like, we have { Like }' = {cap, non-personal} and {cap, non-personal}' = {Like, American}, hence the concept:

\(C_0\) = ({Like, American}, {cap, non-personal})

Here are the other concepts formed from singletons:

  • object American \(\Longrightarrow C_1\) = ({American}, {cap, non-name, non-personal})

  • object Lisa \(\Longrightarrow C_2\) = ({Lisa, Greg}, {cap, first})

  • object Heck \(\Longrightarrow C_3\) = ({Heck, Lemon}, {cap, last})

  • object spokesman \(\Longrightarrow C_4\) = ({spokesman, congressman}, {title})

  • object Politico \(\Longrightarrow C_5\) = ({American, Politico}, {cap, non-name})

  • object OTHERS \(\Longrightarrow C_6\) = ({American, Like, OTHERS}, {non-personal})

The iterative step of the intersection algorithm leads to two new concepts:

  • \(C_0 \sqcup C_2 \Longrightarrow C_7\) = ({Like, American, Lisa, Heck, Greg, Lemon, Politico}, {cap})

  • \(C_7 \sqcup C_4 \Longrightarrow top\) = ({Like, American, Lisa, Heck, Greg, Lemon, spokesman, congressman, Politico, OTHERS}, \(\emptyset\))

We observe that the concepts \(C_2\) and \(C_3\) represent first and last names, respectively. We also point out that the titles spokesman and congressman seem not to contribute to name identification, although FCA identifies the title concept \(C_4\). We will revisit this point in Sect. 6.

5 Hidden Markov model for name identification

In this section, we first give a general overview of the HMM along with a short description of BBN’s Nymble name tagger. We then give a detailed description of our HMM.

5.1 Hidden Markov model

A hidden Markov model is a finite-state automaton with probabilistic transitions and symbol emissions.

An HMM consists of

  • A set of states \(S = \{s_1, \ldots, s_n\}\)

  • An emission vocabulary \(V = \{w_1, \ldots, w_n\}\)

  • Each state is associated with a probability distribution over emission symbols where the probability that a state s emits symbol w is given by P(w|s).

  • Each state is further associated with a probability distribution over the set of possible outgoing transitions. The probability of moving from state \(s_i\) to \(s_j\) is given by \(P(s_j \mid s_i)\).

  • Finally, a subset of the states are considered start states, and to each of these is associated an “initial” probability that the state will be a start state.
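In code, these components can be collected into a small container. The following is a minimal sketch (the field names are ours, for illustration):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class HMM:
    """A container for the components listed above."""
    states: List[str]                    # s_1, ..., s_n
    vocab: List[str]                     # emission symbols w_1, ..., w_n
    start: Dict[str, float]              # initial probability of each start state
    trans: Dict[Tuple[str, str], float]  # trans[(s_i, s_j)] = P(s_j | s_i)
    emit: Dict[Tuple[str, str], float]   # emit[(s, w)] = P(w | s)
```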

Fig. 2: Name HMM

Consider the HMM given in Fig. 2. There are four states and a few emission symbols. The emission symbols are classes of symbols, such as titles, capitalized words, or members of lists of known names. This means that a word in the text is first scanned to determine whether it is a title, capitalized, a known name, or a generic word. Based on this determination, the word is mapped to the corresponding class (or emission symbol). The transition and emission probabilities of the HMM are obtained by training the HMM on tagged data. Typical HMM training uses Maximum Likelihood Estimation (MLE). The transition probabilities are estimated by MLE as:

$$\begin{aligned} P(s_j \mid s_i) = \frac{\text{Number of transitions from } s_i \text{ to } s_j}{\text{Total number of transitions out of } s_i} \end{aligned}$$
(2)

The emission probability table is estimated by maximum likelihood supplemented with smoothing. Smoothing is required because maximum likelihood estimation will sometimes assign zero probability to unseen emission-state combinations. Prior to smoothing, emission probabilities are estimated by:

$$\begin{aligned} P(w \mid s)_{ml} = \frac{\text{Number of times symbol } w \text{ is emitted at state } s}{\text{Total number of symbols emitted by state } s} \end{aligned}$$
(3)
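The following sketch (ours, not the paper's code) estimates both tables from tagged data given as sequences of (symbol, state) pairs; the add-one smoothing on the emissions is an illustrative stand-in, since the exact smoothing scheme is not specified here:

```python
from collections import Counter

def train_mle(tagged_sequences, states, vocab, alpha=1.0):
    """Estimate transition (Eq. 2) and smoothed emission (Eq. 3) tables.

    tagged_sequences: iterable of [(symbol, state), ...] lists.
    """
    trans, out_of = Counter(), Counter()
    emit, emitted_by = Counter(), Counter()
    for seq in tagged_sequences:
        for i, (symbol, state) in enumerate(seq):
            emit[state, symbol] += 1
            emitted_by[state] += 1
            if i + 1 < len(seq):
                trans[state, seq[i + 1][1]] += 1
                out_of[state] += 1
    # Transition probabilities: straight maximum likelihood.
    p_trans = {(si, sj): trans[si, sj] / out_of[si]
               for si in states if out_of[si] for sj in states}
    # Emission probabilities: MLE with add-alpha smoothing over the vocabulary,
    # so unseen state-symbol pairs receive a small nonzero probability.
    p_emit = {(s, w): (emit[s, w] + alpha) / (emitted_by[s] + alpha * len(vocab))
              for s in states for w in vocab}
    return p_trans, p_emit
```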

After training, the HMM can be used to identify names in new, unseen text. The new text is first converted to symbols by the taxonomizing process described above; names are then discovered by applying the well-known Viterbi algorithm to the resulting string of symbols. The Viterbi algorithm uses dynamic programming to determine the most likely state sequence for the text [15]. Of interest to us are those sub-sequences consisting of name states, since they represent the most likely name tokens.
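Here is a minimal log-space sketch of the standard Viterbi decoder, with a toy model whose states, symbols, and probabilities are invented purely for illustration:

```python
import math

def viterbi(symbols, states, start, trans, emit):
    """Return the most likely state sequence for a sequence of symbols.

    Probabilities missing from the dictionaries are treated as zero.
    """
    lp = lambda x: math.log(x) if x > 0 else float("-inf")
    # delta[s] = best log-probability of any path ending in state s.
    delta = {s: lp(start.get(s, 0)) + lp(emit.get((s, symbols[0]), 0))
             for s in states}
    back = []
    for w in symbols[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + lp(trans.get((r, s), 0)))
            delta[s] = (prev[best] + lp(trans.get((best, s), 0))
                        + lp(emit.get((s, w), 0)))
            ptr[s] = best
        back.append(ptr)
    path = [max(delta, key=delta.get)]   # best final state
    for ptr in reversed(back):           # follow the back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy model: a capitalized word right after a title is likely a last name.
states = ["non-person", "title", "last"]
start = {"non-person": 0.9, "title": 0.05, "last": 0.05}
trans = {("non-person", "non-person"): 0.8, ("non-person", "title"): 0.1,
         ("non-person", "last"): 0.1, ("title", "last"): 0.9,
         ("title", "non-person"): 0.1, ("last", "non-person"): 0.9,
         ("last", "last"): 0.1}
emit = {("non-person", "word"): 0.9, ("non-person", "cap"): 0.1,
        ("title", "TITLE"): 1.0, ("last", "cap"): 0.8, ("last", "word"): 0.2}
print(viterbi(["word", "TITLE", "cap", "word"], states, start, trans, emit))
# -> ['non-person', 'title', 'last', 'non-person']
```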

BBN's name tagger is very similar to the HMM shown in Fig. 2. It is estimated that 1.2 million tagged words were used to train this HMM [1].

Our adaptation of the hidden Markov model includes an extra factor to account for the punctuation and white space between tokens. This falls outside the definition of a hidden Markov model, even if we attach the preceding inter-word symbols to each token, because the probability of the transition emission is a function of multiple states. We use a slightly modified Viterbi algorithm to find the most probable path, and select tokens that are emitted by name states in that path.

Our new HMM is shown in Fig. 3. This HMM has five states corresponding to the five word types described in Sect. 3: non-name, first, last, title, and non-person.

The training of this HMM was based on 125,000 documents randomly selected from a DOE repository. This collection is not publicly available due to privacy concerns. Further details of our adaptation are reported in [23].

Fig. 3: Our name HMM

As mentioned above, the Viterbi algorithm is used to decode the text to identify name entities. To see this decoding in action, consider the following segment of text taken from CNN on August 27, 2012:

[Text excerpt from CNN, August 27, 2012]

The first word, What, is decoded as non-person. Similarly, the next eight words, we would want to do is define what, are also decoded as non-person. The tenth and eleventh words, President Obama, are identified as title and last, respectively. The table in Fig. 4 shows how the Viterbi algorithm decodes the entire text segment.

Fig. 4: Decoding of the string sequence by Viterbi

6 Formal concept analysis revisited

In this section, we revisit FCA with a new example, based on the same excerpt taken from CNN on August 27, 2012 (repeated here for clarity):

[Text excerpt from CNN, August 27, 2012]

In Fig. 5, we give a similar characterization of all the objects of this text.

Fig. 5: A characterization of the newspaper text

The intersection algorithm will produce the following list of concepts:

  • \(C_0 =\) ({What, American, Monday}, {cap, non-personal})

  • \(C_1 =\) ({President}, {cap, title})

  • \(C_2 =\) ({What, President, Obama, American, Russ, Schriefer, Romney, Monday}, {cap})

  • \(C_3 =\) ({American, Monday}, {cap, non-name, non-personal})

  • \(C_4 =\) ({Russ}, {cap, first})

  • \(C_5 =\) ({Schriefer, Romney}, {cap, last})

  • \(C_6 =\) ({American, What, Monday, OTHERS}, {non-personal})

  • top = ({What, President, Obama, American, Russ, Schriefer, Romney, Monday, OTHERS}, \(\emptyset\))

We observe that concepts \(C_4\) and \(C_5\) identify first and last names, respectively. We also observe that our FCA does not identify Obama as a name. There are two reasons for this. First, the name Obama is not in the list of frequently occurring last names compiled from the 1990 U.S. Census. Second, our FCA is not detailed enough to relate the title President to the word Obama.

As mentioned in Sect. 5, if we process the same text with our HMM, the Viterbi algorithm decodes Obama as a last name.

Since FCA formalizes concepts as units of thought [18], we incorporate a new attribute, title-proximity, into the list of attributes in our context. Figure 6 represents the new characterization of the text.

Fig. 6: A characterization of the newspaper text with title-proximity

Now, the intersection algorithm replaces the concept \(C_2\) with

\(C_2\) = ({Obama}, {cap, title-proximity})

We can now observe that \(C_4\), \(C_5\), and \(C_2\) represent concepts first name, last name, and name with title, respectively.
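A minimal sketch of how such a title-proximity attribute could be computed while building the context; the title list, tokenization, and helper name here are illustrative assumptions, not the paper's implementation:

```python
import re

TITLES = {"president", "senator", "congressman", "ms.", "mr."}
CAP_RE = re.compile(r"^[A-Z][a-z]+$")

def add_title_proximity(tokens, ctx):
    """Give a capitalized token immediately preceded by a title
    the attributes cap and title-proximity."""
    for prev, word in zip(tokens, tokens[1:]):
        if prev.lower() in TITLES and CAP_RE.match(word):
            ctx.setdefault(word, set()).update({"cap", "title-proximity"})
    return ctx

print(add_title_proximity(["President", "Obama", "said"], {}))
# Obama gains the attributes cap and title-proximity.
```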

This example shows how FCA mimics the thought process that goes into a rule-based information extraction system. The difference is that in a rule-based system the rules are hand-coded and difficult to maintain. The formal approach of FCA, combined with established computational techniques such as the intersection algorithm, avoids the difficulties associated with rules.

7 Evaluation results

The standard performance measures of precision, recall, and F1 were used to evaluate our approach. Let TP be the number of true positives, that is, the number of personal names on which both the experts and the HMM (or FCA) agreed. Let FN be the number of false negatives, i.e., the number of personal names which the experts identified but the HMM (or FCA) did not. We then define recall as

$$\begin{aligned} recall = \frac{TP}{TP + FN} \end{aligned}$$
(4)

Letting FP signify the number of false positives, i.e., those strings which the HMM (or FCA) marked as personal names but which experts did not, precision is defined as

$$\begin{aligned} precision = \frac{TP}{TP + FP} \end{aligned}$$
(5)

The harmonic mean of precision and recall is called the F1 measure, defined as:

$$\begin{aligned} F1 = \frac{2}{1 / precision + 1 / recall} \end{aligned}$$
(6)

A sample of 82 documents was selected randomly to test the performance of our HMM. These documents contained a total of 520 name instances. In addition, there were many instances of handwritten names; since these documents were OCR'ed, handwritten instances were ignored. There were 471 true positives, 49 false negatives, and 38 false positives. This yields a precision of 0.925, a recall of 0.906, and an F1 value of 0.915. The single greatest cause of false negatives was names consisting of only one word.
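For concreteness, these figures can be reproduced from the raw counts with a few lines of Python:

```python
def scores(tp, fn, fp):
    """Precision, recall, and F1 from raw counts (Eqs. 4-6)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / precision + 1 / recall)
    return precision, recall, f1

# Counts from the 82-document evaluation sample above.
print(scores(tp=471, fn=49, fp=38))
# -> (0.925..., 0.905..., 0.915...)
```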

As mentioned before, FCA helped us design the topology of our HMM. This design process distinguishes our HMM (Fig. 3) from BBN's name tagger (Fig. 2).

8 Conclusion and future work

This paper is a summary of our work on the role of FCA in the recognition of personal names. We also gave a summary of our approach to name extraction with an HMM. Our implementation uses a version of an HMM to identify names in a collection of DOE documents. Our intention in writing this paper is to motivate the use of FCA for the extraction of structured information from textual documents.

We anticipate extending this FCA work to other entities. For example, we want to identify entities such as place, organization, date, and money, as proposed in MUC.

A more elaborate FCA can be used to identify entities such as date of birth and home address [21]. This can, in turn, lead to establishing associations between individuals and their birth dates or addresses. We anticipate reporting further on the use of these types of FCA in our future work on privacy act classification.

Finally, one of the reviewers suggested a fuzzy FCA, in which attributes and objects have degrees of membership in concepts. We have observed that this idea can improve our work on the use of FCA for stemming [6]. We hope to report on this in the near future.