
1 Introduction

Real text mining systems have been developed for finding precise and specific information in large collections of texts from keyword-based or natural language questions. The development of Web search and question-answering systems that aim to retrieve precise answers corresponds to the combination of methods from computational linguistics, graph analysis, machine learning and, of course, information extraction and information retrieval. The most recent systems tend to combine statistical machine learning methods with symbolic and rule-based approaches. The challenge is to obtain robust, efficient systems that can adapt themselves to users’ needs and profiles and to dynamic collections composed of unstructured or weakly structured and interconnected documents.

Numerical methods based on a statistical study of corpora have proven their ability to adapt quickly to different themes and languages and are very popular in Web search systems. This is so despite some approximation in the results, which would be more accurate with symbolic rule-based approaches provided that the rules were sound, complete and unambiguous… while natural language is inherently ambiguous. On the other hand, Natural Language Processing (NLP) tasks such as named entity recognition (automatic extraction of the names of “things” in texts: person or company names, locations…) and part-of-speech tagging (automatic labeling of words according to their lexical classes: nouns, verbs, adjectives…) can be performed both by symbolic approaches that rely on hand-coded rules, grammars and gazetteers and by supervised machine learning approaches that need large quantities of manually annotated corpora. For named entity recognition, symbolic systems tend to produce better results on well-formed texts [55] whereas statistical approaches show the best results on low-quality texts such as automatic transcripts of speech recordings [5].

This chapter is composed of two main sections dedicated to text mining, information retrieval and information extraction. We begin with a general discussion of symbolic and numerical approaches to Natural Language Processing, then we present the most popular tasks that have been evaluated through annual conferences for twenty years, along with some resources one can employ to process texts. In the first section, we present some classical information retrieval models for document and Web search and the principles of semantic information retrieval, which can exploit specialized lexicons, thesauri or ontologies. The second section introduces natural language processing for information extraction. We outline the most effective approaches for question-answering systems and semantic tagging of texts (named entity recognition, pattern extraction…): rule- and lexicon-based approaches and machine learning approaches (Hidden Markov Models and Conditional Random Fields). Then, we present some approaches that aim to find information about entities and to populate knowledge bases. In this section, we describe the approaches we have proposed and experimented with over the last few years. The last section is dedicated to some industrial applications we work on, which respectively relate to digital libraries, marketing and dialog systems.

2 Symbolic and Numerical Approaches for Natural Language Processing

Symbolic and numerical approaches for natural language processing (NLP) have long been opposed, each linked to scientific communities distinguished by differing goals (limited but accurate prototypes versus rough but functional systems) and by more or less pragmatic attitudes. This opposition joins, in some way, the one that has always opposed some linguists and philosophers on the nature of language and its acquisition, that is, in simplified terms, the pre-existence of a (cognitive) system generating the rules of possible sentences in a language (or at least the degree of such pre-existence). This debate on the nature and role of grammar has its origins in the 17th century, between proponents of empiricism (the human being is born “empty” and is fully shaped by experience) and of rationalism (human beings cannot be reduced to experience). In the 1950s, the behaviorists, on the empiricist side, attempted to describe language acquisition as a form of chain reaction driven by positive or negative reinforcement. In contrast, Chomsky proposed the pre-existence of mechanisms in the human brain that are specific to language and that could distinguish humans from other species. This suggests that language is something truly organic [48], even at the very beginning of life, and that language learning does not rely only on associations between stimuli and responses.

A direct consequence of the pre-existence of a ‘minimalist program’ (generative grammar) in language acquisition has been to define both a universal grammar expressing linguistic universals and particular grammars for the specificities of given languages [22]. In this view, called cognitive linguistics, language acquisition was seen as a process of rule induction: the child, provided with the general structure of universal grammar that defines a certain class of grammars, needs to discover the special rules that generate the particular language to which (s)he is exposed [88]. Subsequently, particular grammars were reduced to specific values of the parameters of universal grammar, and language acquisition to the setting of these parameters [23]. There may then be no need to induce any rules [88]: induction can be replaced by a process of identifying and selecting among all a priori possible linguistic productions [66].

The learning process can be viewed from two points of view, namely statistical learning or “analytical” learning (both unconscious and possibly combined in the human mind). In the first case, the child has to observe which language productions lead to the goal (s)he has set and, on a rolling basis, to accumulate a kind of accounting of what succeeds and what fails, which allows him/her to perform a selection among the possibilities. Computational neural networks, involving different more or less explicit layers linking lexicon, concepts and sounds, might simulate this kind of learning. In that sense, any communicational intention (any goal) and any linguistic production might be seen as specific paths within the neural network. Success or failure would result in strengthening or weakening connections, and the issue of convergence during the learning step arises, as well as that of reducing the combinatorial explosion.

In the second case, learning is a progressive refinement of the values of the parameters of universal grammar allowing the production and the understanding of the statements it can generate. Note that in the first case (statistical learning), a rule-based grammar can still be induced once the system is stabilized. It would allow producing explicit structural patterns, which would be necessary to improve consistency over time and mutual understanding between two people in dialog.

Finally, the proposals and models arising from the work of N. Chomsky do not seem incompatible with a contextual behaviorist point of view [51]. One can postulate the pre-existence of a specific neural network (or of any other biological element) with an initial structure facilitating language acquisition, both in compliance with a universal grammar and with the need to interconnect brain language areas with all the cognitive areas: the network can still be seen as a whole, and one can begin to model a “human neuronal dynamics” in which reinforcement learning (the selection process) is constrained by the genetic code while remaining sensitive to experience.

In the following, we do not pretend to decide between one approach or another for modeling the way a human brain works, but we wish to present approaches that allow a computer to understand a text as well as possible for tasks related to information retrieval and extraction in very large collections. We believe that one of the main lessons of natural language processing in recent years is that it is necessary, at least in the present state of knowledge and of the computational capacity of computers, not to be confined to a fully statistical approach nor to a purely symbolic one. Combining the best of both worlds has to be considered for the tasks we want to solve by means of computers. In a few words, statistics and probability theory allow processing by analogy (pattern analysis and matching) on a large scale and manipulating uncertainty,Footnote 1 whereas symbolic approaches can facilitate inferences in domain-specific texts (textual entailment), preprocessing texts (e.g. for lexical normalization and disambiguation) and filtering out obvious abusive analogies, but also offer better human readability of the processing rules.

3 Information Retrieval Models

An information retrieval model is intended to provide a formalization of the process of information search. It must fulfill many roles: the most important is to provide a theoretical framework for modeling the relevance measure involving a query from a user on the one hand and a document or a passage of a document on the other hand. Indeed, a classical information retrieval system provides the user with a list of documents or a list of Web pages in response to a query (usually composed of a few keywords) and not with a precise and unique piece of information. Information retrieval is thus a generic term that designates document retrieval, passage retrieval and Web page retrieval, among others. In that sense, a document can broadly correspond to any text (article, press release, blog entry, book…) that can be more or less structured and more or less respectful of language and of good writing rules (spelling, syntax, discourse structure). In the case of multimodal information retrieval, retrieval can be achieved by relying on speech or handwriting recognition systems.

Depending on its level of complexity, an information retrieval model can take into account different types of descriptors and textual clues: lexical, syntactic, semantic and structural markers (or even prosodic ones in the case of audio document retrieval). Each clue can be obtained through surface analysis or deep linguistic analysis resolving dependencies and references, among others. Most information retrieval models characterize a document according to the collection from which it comes, taking into account global indices such as the average length of documents in the collection and the average frequency distribution of words. The combination of these indices involves a large number of parameters to be set and adapted to the corpus, or learned from user feedback.

Because of the network structure of Web pages (hyperlinks), Web search engines also take into account non-linguistic cues for estimating relevance scores: the number of pages that link to a given page, the probability of reaching this page by a “random walk”, etc. These hints are the basis of the famous PageRank algorithm popularized by the Google Web search engine.

For a given information need and set of documents, the whole question is to determine which model and which parameters, but also which external resources (dictionaries of inflected forms, thesauri, knowledge bases, ontologies, online encyclopedias…), will be most effective in supporting the search.

3.1 Original Boolean Model

The Boolean model is the simplest model for information retrieval. It was the most popular model before the advent of the vector and probabilistic models in the 1960s and 1970s. It is based on set theory and Boolean algebra and considers a query as a logical combination of terms or keywords. The Boolean model only assumes that query terms are present or absent in a document and that their importance is equal. The formulation of the query has a very high impact and its words must be chosen very carefully, ideally from a controlled vocabulary. Either the query is a conjunction of terms, allowing documents to be retrieved with high precision (the number of non-relevant documents retrieved is low) but with low recall (a large number of relevant documents using words different from those of the query are not retrieved), or the query is a disjunction and the precision tends to be low and the recall high.Footnote 2 In its classic form, the Boolean model does not order the retrieved documents because two documents having an equal number of words in common with the query have the same score and cannot be differentiated. Only a non-binary weighting of the query words would differentiate and then order these documents (see extensions of the Boolean model in [91, 14] for a fuzzy approach generalizing the Boolean model). The number of documents retrieved for any query on the Web is so large that it is mandatory to propose methods to order them. On the other hand, the indexing of large collections and the emergence of the Web have made the use of controlled vocabularies more complex, despite folksonomies and the cooperative normalization of hashtags in social networks [21, 109].
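As a minimal sketch (the toy documents and queries are invented for this example, not taken from any collection), conjunctive and disjunctive Boolean retrieval can be implemented with set operations over an inverted index, which makes the precision/recall trade-off described above easy to observe:

```python
# Minimal sketch of Boolean retrieval over a toy inverted index.
documents = {
    1: "information retrieval with boolean queries",
    2: "boolean algebra and set theory",
    3: "vector space retrieval models",
}

# Build an inverted index: term -> set of document ids containing it.
inverted_index = {}
for doc_id, text in documents.items():
    for term in set(text.split()):
        inverted_index.setdefault(term, set()).add(doc_id)

def boolean_and(terms):
    """Conjunctive query: documents containing all the terms (high precision)."""
    postings = [inverted_index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(terms):
    """Disjunctive query: documents containing any of the terms (high recall)."""
    return set.union(*(inverted_index.get(t, set()) for t in terms)) if terms else set()

print(boolean_and(["boolean", "retrieval"]))  # {1}
print(boolean_or(["boolean", "retrieval"]))   # {1, 2, 3}
```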

3.2 Vector-Space Model

The vector space model represents documents and queries by vectors of terms or keywords [92]. These keywords are the index terms that are extracted from the texts during indexing and that may be individual words, compound words or phrases. In a controlled vocabulary context, they represent a theme or a domain. In full-text, non-controlled approaches, every word is an indexing unit (possibly after pre-processing such as lemmatization). For each document, a weight is assigned to each index entry it contains. Vectors represent every query and every document in a single vector space and they can be compared with each other: comparing a query and documents allows documents to be ranked, and comparing a document to other ones allows similar documents to be suggested for reading and to be clustered. This comparison can be conducted by computing and ordering the values of a similarity (or distance) measure such as the cosine function or the Euclidean distance. In the basic vector-space model, words are supposed to occur independently of each other, while this is clearly not the case. Newer models propose to take this dependence into account, as we shall see later in this chapter.

The aim of a weighting scheme is to identify the words that are the most discriminative in a document, with respect to a user’s query and according to a collection or a sub-collection. In other words, a word is important (its weight must be high) if its presence in a document is a strong indicator of the relevance of this document. Sparck-Jones [101] proposed to express term weighting by combining the term frequency tf in a document and the inverse document frequency idf in a collection. The idea is that the more a word occurs in a document, the more important it is, and that this has to be weighted inversely proportionally to the number of documents that contain the word in the collection (a very frequent word in the collection is less informative and less discriminative than a rare one: frequent words tend to refer to less specific concepts). Even if the original tf.idf weighting is still used and may be a good starting point to build a search engine, more efficient weighting schemes have been proposed. For example, other weightings help to differentiate words according to their average use in a collection.

In the vector-space model, the cosine similarity is defined as:

$$ {\text{sim}}\left( {q, \,d} \right) = \frac{{\sum\limits_{i = 1}^{n} {q_{i} \times d_{i} } }}{{\sqrt {\sum\limits_{i = 1}^{n} {q_{i}^{2} } } \cdot \sqrt {\sum\limits_{i = 1}^{n} {d_{i}^{2} } } }} $$
(1.1)

with \( q_{i} \) the weight of term/word i in a query q containing n different terms and \( d_{i} \) the weight of i in a document d.
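A small illustration of this ranking process, with a toy collection, whitespace tokenization and the basic tf.idf weights described above (the collection and the query are invented for the example):

```python
import math
from collections import Counter

# Toy collection; in practice documents would be tokenized, lemmatized or stemmed.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "dogs and cats are pets",
}

N = len(docs)
tokenized = {d: text.split() for d, text in docs.items()}
df = Counter(t for toks in tokenized.values() for t in set(toks))  # document frequencies

def tfidf_vector(tokens):
    """tf.idf weights: raw term frequency times log(N / document frequency)."""
    tf = Counter(tokens)
    return {t: freq * math.log(N / df[t]) for t, freq in tf.items() if t in df}

def cosine(u, v):
    """Cosine similarity (Eq. 1.1) between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

q_vec = tfidf_vector("cat and dog".split())
ranking = sorted(((cosine(q_vec, tfidf_vector(toks)), d) for d, toks in tokenized.items()),
                 reverse=True)
print(ranking)  # documents ordered by decreasing similarity to the query
```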

According to the basic term-weighting scheme described above, three factors are involved: the frequency of a term in the document and in the query, the frequency of the term in the whole collection, and the weight of a document (normalization by taking into account the length of the document or the sum of the weights of the words it contains). A popular weighting scheme is Lnu.ltc [97], which incorporates two parameters that have to be optimized for the considered collection: pivot and slope. They prevent the cosine function from preferring shorter documents [99] and allow taking into account the fact that relevance and document length are not independent. In that way, the relevance score of a document d for a query q is:

$$ {\text{sim}}(q,d) = \sum\limits_{{i \in q \cap d}} {\frac{{\frac{{1 + \log ({\text{freq}}_{{i,d}} )}}{{1 + \log ({\text{avg}}_{{j \in d}} \,{\text{freq}}_{{j,d}} )}} \cdot \left( {\frac{{{\text{freq}}_{{i,q}} }}{{\max _{{j \in q}} {\text{freq}}_{{j,q}} }} \cdot \log \left( {\frac{N}{{n_{i} }}} \right)} \right)}}{{\left( {(1 - slope) \cdot pivot + slope \cdot uniqueWords_{d} } \right) \cdot \sqrt {\sum\limits_{{i \in q}} {\left( {\frac{{{\text{freq}}_{{i,q}} }}{{\max _{{j \in q}} {\text{freq}}_{{j,q}} }} \cdot \log \left( {\frac{N}{{n_{i} }}} \right)} \right)^{2} } } }}} \\ $$
(1.2)

with \( {\text{freq}}_{i,d} \) (resp. \( {\text{freq}}_{i,q} \)) the number of occurrences of i in the document d (resp. in the query q).

Some years later, Deerwester et al. [31] proposed Latent Semantic Indexing (LSI) in order to reduce the dimensionality of the vector space and the mismatch between query and document terms by means of singular-value decomposition (which is computationally costly). With an LSI indexing approach, documents that are conceptually similar to the query can be retrieved even if they do not share any word/term with it.

Much other work has been conducted to integrate the results of linguistic analysis (anaphora resolution [69, 44], disambiguation [103], lemmatization, semantic role labeling…). Unfortunately, their impact seems to be limited and only stemmingFootnote 3 is an effective and frequently used procedure [54], but only for certain languages.

3.3 Probabilistic Models

Some information retrieval models exploit user feedback on the retrieved documents to improve the search by using probability theory [90]. This allows representing the search process as a decision-making process: for the user, the cost associated with the retrieval (downloading and reading time) of a document must be as low as possible. In other words, the decision rule is equivalent to proposing a document only if the ratio of the probability that it is relevant (interesting for the user) to the probability that it is not is greater than a given threshold. One then seeks to model the set of relevant documents, that is to say, to estimate the probability that a document is relevant to a user, i.e. the probability that the given query words appear or do not appear in these documents. For a document d and a query q, and with ‘relevant’ the set of relevant documents, this probability is estimated as:

$$ P({\text{relevant}}|d) = \frac{{P({\text{relevant}}) \times P(d|{\text{relevant}})}}{P(d)} $$
(1.3)

By incorporating the probability of non-relevance and from the Bayes rule:

$$ S(d,q) = \frac{{P(d|{\text{relevant}}) \times P({\text{relevant}})}}{{P(d|\overline{\text{relevant}} ) \times P(\overline{\text{relevant}} )}} \approx \frac{{P(d|{\text{relevant}})}}{{P(d|\overline{\text{relevant}} )}} $$
(1.4)

By assuming that words are independent from each other (bag of words models), one can set, for i a term in q:

$$ P(d|{\text{relevant}}) = \prod\limits_{i = 1}^{n} {P(d_{i} |{\text{relevant}})} $$
(1.5)

With \( p_{i} = P(i \in d|{\text{relevant}}) \) and \( q_{i} = P(i \in d|\overline{\text{relevant}} ) \), one obtains:

$$ s(d,q) \approx \sum\limits_{i \in d \cap q} {\log \frac{{p_{i} (1 - q_{i} )}}{{q_{i} (1 - p_{i} )}}} $$
(1.6)

After integrating non-binary weights of terms in the query and in the documents, the relevance score is:

$$ s(d,q) \approx \sum\limits_{i \in d \cap q} {w_{i,d} \cdot w_{i,q} \cdot {\log} \frac{{p_{i} (1 - q_{i} )}}{{q_{i} (1 - p_{i} )}}} $$
(1.7)

Solving this problem requires either an unsupervised estimation or, on the contrary, an iterative process in which the user (or a pseudo-user, by assuming that the first retrieved documents are relevant) selects relevant documents in a limited ranked list (e.g. the first ten documents found by a vector-space model). The ranking function BM25 [89] is defined in that spirit and is still very popular:

$$ s(d,q) \approx \sum\limits_{i \in d \cap q} {w_{i} \times \frac{{(k_{1} + 1) \cdot {\text{freq}}_{i,d} }}{{K + {\text{freq}}_{i,d} }} \times \frac{{(k_{3} + 1) \cdot {\text{freq}}_{i,q} }}{{k_{3} + {\text{freq}}_{i,q} }}} $$
(1.8)
$$ w_{i} = \log \left( {\frac{{N - n_{i} + 0.5}}{{n_{i} + 0.5}}} \right),\quad K = k_{1} \left( {(1 - b) + b \cdot \frac{{l_{d} }}{{\bar{l}}}} \right) $$

k 1, k 3 and b are parameters, l d the length of d, \( \bar{l} \) the average document length, n i the number of documents containing i and N the total number of documents.
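A sketch of Eq. (1.8) on a toy collection; the values chosen for k1, k3 and b are common defaults and an assumption of this example, not values prescribed by [89]:

```python
import math
from collections import Counter

# Toy collection, tokenized by whitespace for the sake of illustration.
docs = {
    "d1": "information retrieval models and probabilistic ranking".split(),
    "d2": "the bm25 ranking function is a probabilistic model".split(),
    "d3": "language models for information retrieval".split(),
    "d4": "neural networks for image recognition".split(),
    "d5": "a survey of text mining and information extraction".split(),
}

N = len(docs)
avg_len = sum(len(toks) for toks in docs.values()) / N
df = Counter(t for toks in docs.values() for t in set(toks))  # n_i: document frequencies

def bm25(query_tokens, doc_tokens, k1=1.2, k3=8.0, b=0.75):
    """BM25 score of a document for a query (Eq. 1.8)."""
    freq_d = Counter(doc_tokens)
    freq_q = Counter(query_tokens)
    K = k1 * ((1 - b) + b * len(doc_tokens) / avg_len)   # length normalization
    score = 0.0
    for term in set(query_tokens) & set(doc_tokens):
        n_i = df[term]
        w = math.log((N - n_i + 0.5) / (n_i + 0.5))              # idf-like weight w_i
        tf_d = (k1 + 1) * freq_d[term] / (K + freq_d[term])      # document term saturation
        tf_q = (k3 + 1) * freq_q[term] / (k3 + freq_q[term])     # query term saturation
        score += w * tf_d * tf_q
    return score

query = "probabilistic retrieval models".split()
print(sorted(((bm25(query, toks), d) for d, toks in docs.items()), reverse=True))
```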

Many models, more or less personalized, can be estimated automatically from user behavior on the Web. These models take into account the number of Web pages the user accesses from the list of retrieved results, the time spent between the access of each page, the number and the nature of the queries entered after the initial query, and the explicit expression of relevance by means of an adapted user interface (e.g. star ratings). This has led researchers to propose learning ranking functions [42] from large sets of training data and features (query logs).

One of the models that emerged in the late 1990s is the language model introduced by Ponte and Croft [77]. This probabilistic model is based on common linguistic models that attempt to capture the regularities of a language (probable sequences of words, or bi-grams or tri-grams of words) by observing a training corpus. A language model then estimates the likelihood (probability) of a given sequence of words in a given language (or for a given topic in a given language). Its use in information retrieval consists in considering that every document is represented by its own language model and is the basis for generating it. The relevance of a document with respect to a query is seen as the probability that the query is generated by the same language model as the document. In such a model, dependency between terms can be taken into account (by using word n-gram models—a word n-gram is an ordered set of n contiguous words in a sentenceFootnote 4—and proximity operators, for example [67]) and often outperforms models that assume independence.
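The following sketch illustrates query-likelihood ranking with a unigram language model per document; the Jelinek-Mercer interpolation with the collection model and the value of its smoothing parameter are assumptions of this example, not the exact formulation of [77]:

```python
import math
from collections import Counter

docs = {
    "d1": "xinhua news agency reported the election results".split(),
    "d2": "the new york times covered the election".split(),
    "d3": "language models estimate probable word sequences".split(),
}

collection = [t for toks in docs.values() for t in toks]
coll_freq = Counter(collection)
coll_len = len(collection)

def query_log_likelihood(query_tokens, doc_tokens, lam=0.8):
    """log P(q | d) under a unigram document language model,
    interpolated with the collection model to avoid zero probabilities."""
    freq_d = Counter(doc_tokens)
    len_d = len(doc_tokens)
    score = 0.0
    for term in query_tokens:
        p_doc = freq_d[term] / len_d if len_d else 0.0
        p_coll = coll_freq[term] / coll_len
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:                 # term unseen in the whole collection
            return float("-inf")
        score += math.log(p)
    return score

query = "election news".split()
print(sorted(((query_log_likelihood(query, toks), d) for d, toks in docs.items()),
             reverse=True))
```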

The models presented so far are not able to take into account the semantic relationships between lexical items (synonyms and semantic proximity, hierarchical links) without pre-processing and post-processing, any more than the expression of negation in queries (the expression in natural language of what is not searched for) and in the documents. To partially answer these problems, Web search engines offer users the possibility of using Boolean operators. Another approach consists in expanding queries by employing external semantic resources [32] and disambiguation methods (based on statistical co-occurrences of words).

In the following, information retrieval models are an essential component of information extraction systems. They allow filtering the collections of documents according to an information need and restricting deeper and more costly linguistic analysis to a sub-collection.

4 Information Extraction

Information extraction is about automatically extracting structured information from unstructured sources. It has its origin in the natural language processing community [65, 96]. Historically, information extraction focused on the identification of named entities (limited to persons, companies or locations) in texts and then on finding relationships between these entities. Two evaluation campaigns strongly encouraged and deeply influenced research on these particular topics: the Message Understanding Conference (MUC) in 1995 [47] and the Automatic Content Extraction (ACE) program in 1999 [33]. Both asked participants to retrieve persons, locations and companies in texts, but also temporal information and quantities and some relationships among them. Information extraction was then performed exclusively on collections of news articles. Open information extraction [37, 39] is about modeling how relationships between an object and its properties are expressed in general, and thus being able to automatically extract entities of non-predefined classes and new kinds of relations.

Locating named entities in texts, a task called named entity recognition (NER) [72], seems to be an easy task. But it is not always as easy as one can first think. Let us consider the following sentence: “Jackson did a great show yesterday in Neverland”. One can guess that “Jackson” refers to a person because a show is likely to be performed by a person (let us note that Jackson could designate an animal, a software product or the name of a band rather than a single human person). Anyway, without more contextual information—what is Neverland? A location, but of what kind: a city or a concert hall?—we cannot guess what kind of person “Jackson” is: is he a politician, an actor or a singer? In this example, one might assume that Jackson refers to “Michael Jackson”, knowing that he owned a theme park called “Neverland”. This example shows that without contextual knowledge and by limiting the analysis to a single sentence, associating a semantic type with an entity may be very difficult. Such knowledge is of course difficult to gather and formalize so that it can be used by a named entity recognizer. Another clue for semantic labeling is linked to character properties (such as capitalized letters). Let us consider: “yesterday, I saw a Jaguar”. Are we talking about a car maker (“a Jaguar”) or an animal? (Either way, almost every Jaguar car carries a jaguar—not a real jaguar, even if it is a real thing—on its hood and in its logo.) Delimiting boundaries is another challenge for named entity recognition. Does “Jackson, Alabama” correspond to one entity or to two separate entities: one person and one city, two people, two cities? Actually, there is no single definitive answer. It completely depends on the context.

Relationship extraction is the second historical task in information extraction [16, 27, 40, 68]. The first implementations focused on a set of predefined relations between entities of predefined types. One example is “x is the CEO of y” where x is a person and y an organization. As for named entity recognition, choices about the expected form of relations have to be made and depend on the objectives.

With the democratization of the Internet, a wide variety of new information extraction problems arose. Indeed, the massive use of the Internet allows companies to gather or to observe the big data generated by Internet users. Brands are, for instance, interested in collecting the opinions their customers express on the Internet about products. Collecting and even indexing this information may be easy, but using it for analytic purposes and decision-making may be difficult. Let us consider the following problem: a company wants to monitor what their customers think about their new smartphone. They obviously cannot read each review or comment about it on the Web, and one needs to exploit information retrieval (filtering the Web) and information extraction (opinion mining). In order to accomplish this task, a system has to automatically detect what the main features of the phone are (e.g. size of the screen, battery life, OS…), to locate the segments of text dealing with each feature and then to extract the expressed opinion from these segments. Each of these three subtasks is an information extraction problem, more or less related to a list of pairs of precise information requests (e.g. “what is the size of the screen… what do Web users think about it”).

Over the years, the scope of information extraction has become wider. More types of entities have to be recognized, and more complex kinds of relations have to be found (e.g. not only unary relations). Nowadays, most of the available information extraction systems are domain-specific, because building and maintaining ordered sets of types and relations for larger applications is a hard problem and because of the limited availability of semantic resources.

Historical applications of information extraction have greatly benefited from existing work in Natural Language Processing. As one knows, information extraction systems can be separated into two coarse classes: rule-based approaches and statistical approaches. In this section we will see their pros and cons and discuss results obtained with manual, supervised and weakly supervised approaches.

4.1 Symbolic Rule-Based Approaches

Early work on information extraction relied exclusively on manually created sets of rules, as recalled in [26, 71]. A rule generally takes the form of a condition (or pattern) that has to be found in the text and of an action to perform if it is found. For instance, let us consider the following rule, which would label any sequence of capitalized words preceded by a title as the name of a person: “Mr. [Word starting with a capital letter]+ --> label the word sequence as a person name”. To find what kind of relation a person (PERS) has with a given organization (ORG), one could use the following rule: “PERS is REL of ORG -> rel(pers, org)”. The pattern used in this rule may be instantiated to find out the name of the CEO of an organization: “PERS is CEO of ORG -> ceo_of(pers, org)”.

Now, suppose that we want to label as a person name all sequences of words beginning with a capitalized letter and following title information. One may build a pattern that contains all the titles (or build one rule per title). A generic rule could be represented by the following pseudo-regular expression: “[Title][.]? [Word starting with a capital letter]+ --> label the words as a person name”. This single rule will identify “Terry Pratchett” as a person whatever his title, in both phrases “Mr. Terry Pratchett” and “Sir Terry Pratchett”. To improve robustness and generality, many rule-based systems rely on dictionaries, called gazetteers, which are lists of person names, organizations etc. Such resources may be hard to build and to maintain. At the same time, rules become more and more complex and maintaining them may be very costly. However, manually created symbolic rules have the advantage of being easy to interpret, much more so than any numerical model.
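A minimal sketch of such a pseudo-regular expression as an actual regular expression (the title list and the sentences are illustrative assumptions; real systems use much richer gazetteers):

```python
import re

# Illustrative, non-exhaustive list of titles.
TITLES = ["Mr", "Mrs", "Ms", "Dr", "Sir", "Prof"]

# "[Title][.]? [Word starting with a capital letter]+" -> person name
person_rule = re.compile(
    r"\b(?:" + "|".join(TITLES) + r")\.?\s+((?:[A-Z][a-z]+\s?)+)"
)

sentences = [
    "Mr. Terry Pratchett wrote many novels.",
    "Sir Terry Pratchett was knighted in 2009.",
]
for sentence in sentences:
    for match in person_rule.finditer(sentence):
        # The captured group is the sequence of capitalized words after the title.
        print("PERSON:", match.group(1).strip())   # -> "Terry Pratchett" in both cases
```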

One alternative to manually created rules consists in employing rule induction algorithms [24, 59]. Like every supervised machine learning algorithm, they require training data. For instance, for named entity recognition, these data will be sentences in which named entities are labeled with their type and their boundaries. These algorithms select seed instances from the training data to create new rules; an instance is selected as a seed when it is not yet covered by an existing rule. There are two classes of rule induction approaches: bottom-up algorithms and top-down algorithms. In bottom-up algorithms, one (or sometimes more than one) uncovered instance is selected as a seed. First, a new rule is created to exactly match this instance. Then, this rule is generalized more and more in order to increase its recall. However, precision is lowered in the process and a tradeoff has to be made between precision and recall. The example below shows the creation of one rule from a seed instance. On the contrary, top-down approaches start with a rule with high coverage but low precision and then specialize it to improve precision. Specialization is performed with the constraint that the rule keeps covering the seed instance. The process is stopped when a predefined threshold is reached.

Example

Suppose we have the following sentence: “Dr. Isaac Asimov was a writer.” The rule induction (generalization) might proceed as follows:

  1. “[Dr.] [Isaac] [Asimov] [was] [a] [writer] -> label “Isaac Asimov” as a person”

  2. “[Dr.] [Capitalized word] [Capitalized word] [was] [a] [writer] -> label the two capitalized words as a person”

  3. “[Dr.] [Capitalized word] [Capitalized word] [be] [a] [writer] -> label capitalized words as a person”

  4. “[Title] [Capitalized word] [Capitalized word] [be] [a] [writer] -> label capitalized words as a person”

  5. “[Title] [Capitalized word] [Capitalized word] [be] [a] [Profession] -> label capitalized words as a person”.

Going from 2 to 3 requires lemmatizing the words, going from 3 to 4 requires a predefined list of titles, and going from 4 to 5 requires a list of professions.

Rule-based systems are often composed of a large set of rules that can be in conflict. One of the central components of a rule-induction system deals with the resolution of these conflicts. Many strategies may be used, but most of them are based either on the a posteriori performance of each rule (recall, precision) or on priorities defined by a set of policies.

4.2 Statistical Approaches: HMM and CRF

The other point of view for information extraction is to treat the extraction problem as a statistical sequence-labeling task that basically involves a numerical learning procedure.

Let us take again the example “Dr. Isaac Asimov was a writer”. From the statistical viewpoint, this sentence is a sequence instance, which consists of six ordered words \( w_{i} :w_{1} = Dr.,w_{2} = Isaac,w_{3} = Asimov,w_{4} = was,w_{5} = a,{\text{ and}}\;w_{6} = writer \). Named entity recognition becomes a special case of sequence labeling, where the aim is to predict the most probable label for each word from a set of predefined class labels. For example, if we want to distinguish three types of named entities—person (PERS), organization (ORG), and location (LOC)—one can develop a system that takes into account those three classes plus an additional label for non-named-entity words (NA). The previous example corresponds to the following sequence of labels: NA, PERS, PERS, NA, NA, and NA.

Hidden Markov Models (HMM) [81, 82] are one of the traditional techniques for sequence labeling [57] and Conditional Random Fields (CRF) [58] overcome a restriction of HMMs by allowing more informative input features in the modeling [98, 75]. Apart from these two direct sequence labeling methods, many other classification and clustering algorithms have been applied to the labeling task [19, 25, 52], but direct labeling approaches are preferred because of their simplicity and the fact that they usually provide good results.

A statistical model is described by a set of mathematical equations including an objective probability function to be maximized by a learning algorithm. The objective function consists of random variables representing the input sequence words and their labels, and of model parameters describing the distributional information of the model. In the learning algorithm, the model parameters and the joint or conditional distributions of the variables are repeatedly updated. We briefly introduce the two major sequence labeling models, HMM and CRF.

The objective function of an HMM is the joint distribution of inputs and labels. It means that we want to find the set of parameters and distributions that maximizes the joint probability of co-occurring words and labels on the training data. An input sequence corresponding to a sentence is called a sequence of observations \( w = \left\{ {w_{1} ,w_{2} , \ldots , w_{T} } \right\} \) and a label sequence \( y = \{ y_{1} ,y_{2} , \ldots , y_{T} \} \) is called a sequence of states. In an HMM, we simplify the relations among random variables such that each state \( y_{t} \) depends only on its previous state \( y_{t - 1} \) and each observation \( w_{t} \) depends only on the corresponding state \( y_{t} \). We can also say that the observations in an HMM are generated by hidden states. The word ‘hidden’ does not mean that the label values are not given in the training set. Instead, it refers to the nature of the state sequence that we model: we assume that the states are invisible because they are not observed directly.

All the relations are represented by means of a directed graphical model (Fig. 1.1a). The joint distribution of an observation sequence and a state sequence can be written as:

$$ P(y,w) = \prod\nolimits_{t = 1}^{T} {{\text{P}}(y_{t} |y_{t - 1} ) \cdot {\text{P(w}}_{t} | {\text{y}}_{t} )} $$
(1.9)
Fig. 1.1 Hidden-Markov model (HMM) (a) and conditional random fields (CRF) (b)

By applying an appropriate algorithm, we can compute the most probable label assignment for a given observation sequence: y* = argmax y P(y|w).

Let us return to the example. To predict the labels of “Dr. Isaac Asimov was a writer”, we first compute the probability \( P(y_{1} |{\text{`Dr.'}}) \) for all candidate labels using the learned parameters and distributions. Then one chooses the most probable label for the observed word ‘Dr.’. By recursive computation, we find the most probable labels \( y_{1}^{*} ,\;y_{2}^{*} \ldots ,\;y_{6}^{*} \).
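The “appropriate algorithm” mentioned above is typically the Viterbi algorithm. The following sketch decodes the example sentence with hand-set, purely illustrative transition and emission probabilities (a real system would estimate them from annotated data):

```python
import math

states = ["NA", "PERS"]

# Hypothetical probabilities for illustration only; normally estimated from a labeled corpus.
start = {"NA": 0.8, "PERS": 0.2}
trans = {"NA": {"NA": 0.8, "PERS": 0.2}, "PERS": {"NA": 0.4, "PERS": 0.6}}
emit = {
    "NA":   {"Dr.": 0.3, "Isaac": 0.01, "Asimov": 0.01, "was": 0.3, "a": 0.3, "writer": 0.08},
    "PERS": {"Dr.": 0.05, "Isaac": 0.45, "Asimov": 0.45, "was": 0.02, "a": 0.01, "writer": 0.02},
}

def viterbi(words):
    """Most probable state sequence y* = argmax_y P(y, w) under the HMM of Eq. (1.9)."""
    # delta[s] = best log-probability of any state path ending in state s
    delta = {s: math.log(start[s]) + math.log(emit[s][words[0]]) for s in states}
    backpointers = []
    for w in words[1:]:
        new_delta, back = {}, {}
        for s in states:
            best_prev = max(states, key=lambda prev: delta[prev] + math.log(trans[prev][s]))
            new_delta[s] = delta[best_prev] + math.log(trans[best_prev][s]) + math.log(emit[s][w])
            back[s] = best_prev
        backpointers.append(back)
        delta = new_delta
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: delta[s])]
    for back in reversed(backpointers):
        path.insert(0, back[path[0]])
    return path

print(viterbi(["Dr.", "Isaac", "Asimov", "was", "a", "writer"]))
# -> ['NA', 'PERS', 'PERS', 'NA', 'NA', 'NA'] with these toy parameters
```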

A Conditional Random Field (CRF) can be thought of as a discriminative version of an HMM. An HMM is a generative model because sequence observations are generated from probability distributions. Since a generative model maximizes the joint distribution, we need to model all random variables statistically. CRFs overcome this restriction by directly modeling the conditional distribution P(y|w) instead of the joint distribution P(y, w) (see Fig. 1.1b). Therefore, in a CRF we do not need to make any assumption about the distribution of the observations; we just need to model P(y|w). This is a great advantage in modeling because, in many information extraction tasks, the input data carry rich information that may be useful for labeling. Suppose that we apply an HMM to an NER system where we can only use the sentence words as input. If we want to encode specific characteristics of each word, such as capitalization, we then need to consider the relations between these characteristics and the words w when modeling P(w). But in a CRF, modeling P(w) is unnecessary, so we can add any features describing the input words without complicating the model. This kind of model is called a discriminative model, as opposed to a generative model. The conditional distribution of a CRF is written as the following equation, where \( f_{k} \) is a feature function related to a specific combination of the \( w_{t} \), \( y_{t - 1} \), and \( y_{t} \) values.

$$ {\text{P}}({\text{y}}|{\text{w}}) = \frac{1}{{{\text{Z}}({\text{w}})}} \mathop \prod \limits_{{{\text{t}} = 1}}^{\text{T}} { \exp }\left\{ {\mathop \sum \limits_{{{\text{k}} = 1}}^{\text{K}} {{\uptheta}}_{\text{k}} {\text{f}}_{\text{k}} ({\text{y}}_{\text{t}} , {\text{y}}_{{{\text{t}} - 1}} , {\text{w}}_{\text{t}} )} \right\} $$
(1.10)
$$ {\text{Z}}\left( {\text{w}} \right) = \sum\limits_{\text{y}} { \mathop \prod \limits_{{{\text{t}} = 1}}^{\text{T}} { \exp }\left\{ {\mathop \sum \limits_{{{\text{k}} = 1}}^{\text{K}} {{\uptheta}}_{\text{k}} {\text{f}}_{\text{k}} \left( {{\text{y}}_{\text{t}} , {\text{y}}_{{{\text{t}} - 1}} , {\text{w}}_{\text{t}} } \right)} \right\}} $$
(1.11)

With this model, we can enrich the input features of the word ‘Isaac’ in the previous example. For instance, we can add a feature called ‘firstcap’, which indicates whether the first letter of the word is capitalized. Now, instead of considering only the word itself, we also take this feature into account to calculate the most probable label of ‘Isaac’. An important point is that candidate features are not limited to the current word: previous or next words (and their characteristics) are also acceptable as features. For example, we can add the information that the current word ‘Isaac’ follows an abbreviation, a capitalized word or anything else that could be useful. Unlike in rule-based approaches, this extra knowledge is not directly used to select a label; it is just an additional clue. The co-occurrence of features and words affects the parameter estimation. For example, a capitalized word is more likely to be labeled as a named entity because named entities in training data are in general capitalized.
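The sketch below shows the kind of feature dictionaries such a model consumes; the feature names are illustrative choices, and the reference to a particular CRF library is an assumption about tooling, not part of [58]:

```python
def token_features(words, t):
    """Features describing word w_t and its neighbours, in the spirit of Eq. (1.10)."""
    w = words[t]
    features = {
        "word": w.lower(),
        "firstcap": w[0].isupper(),             # the 'firstcap' feature discussed above
        "has_digit": any(c.isdigit() for c in w),
        "suffix3": w[-3:].lower(),
    }
    if t > 0:
        features["prev_word"] = words[t - 1].lower()
        features["prev_is_abbrev"] = words[t - 1].endswith(".")
    else:
        features["BOS"] = True                  # beginning of sentence
    if t < len(words) - 1:
        features["next_word"] = words[t + 1].lower()
    else:
        features["EOS"] = True                  # end of sentence
    return features

sentence = ["Dr.", "Isaac", "Asimov", "was", "a", "writer"]
X = [token_features(sentence, t) for t in range(len(sentence))]
y = ["NA", "PERS", "PERS", "NA", "NA", "NA"]
# Lists of such (X, y) pairs over many sentences could then be fed to a CRF trainer,
# e.g. sklearn_crfsuite.CRF (an assumed choice of library).
print(X[1])
```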

Core techniques in current named-entity recognition systems based on statistical approaches have converged on CRFs thanks to the evident advantages described above. These NER systems especially focus on finding effective features to enhance system performance. Another important factor, as for other statistical learning problems, is the training data. It is well known that we cannot construct a totally universal model for a learning task that works well for any new instance. It is unavoidable to restrict the target training data, and we usually evaluate a constructed system on a closed dataset that has the same distribution as the training set. Nevertheless, designing a system that is less dependent on changes in the training data is always an interesting subject, and that is why a well-designed system ponders an efficient way of selecting and updating training data.

5 From Question-Answering to Knowledge Base Population

Since 1992, international evaluation campaigns have been organized by NIST through the Text REtrieval Conferences (TREC) [108], in which specific retrieval tasks are defined. They allow evaluating several information retrieval approaches on the same large collections of documents and queries by means of standard criteria. For example, the 2012 Knowledge Base Acceleration track of TRECFootnote 5 aimed to develop systems helping human knowledge base curators to maintain databases. On the other hand, the purpose of the Entity TrackFootnote 6 was to perform entity-oriented searches on the Web [4] and the purpose of the Question-Answering track [105] was to answer natural language questions by mining large document collections. For each task, the ability to recognize named entities in text is a fundamental problem.

5.1 Named-Entity Recognition

For named-entity recognition, a wide variety of clues are used, such as the presence of words, part-of-speech, case etc. State-of-the-art approaches achieve very good results (more than 0.90 F-measure) on coarse-grained classes (person, organization and location) in well-formatted documents like news articles.Footnote 7 However, most of them are not adapted either to more complex and/or fine-grained types of entities or to more challenging sources of Web documents.

Indeed, for named entity recognition on the Web, many new challenges arise, both in terms of scale and scope: systems have to be fast in order to be able to deal with the huge quantity of texts and their dynamic aspects. Manually creating rules or building training data for fine-grained types of named entities and for every Web source is impractical (What are the types one needs? For what purpose? How could we annotate Web pages efficiently for training?).

Downey et al. [34] deal with locating complex named entities such as book or film titles, for which named entity recognition systems often fail. They consider named entities as a species of multiword units, which can be detected by accumulating n-gram statistics over the Web. When their system tries to extend named entity boundaries to adjacent n-grams, it does so according to Web statistics: do these n-grams co-occur frequently on the Web? With this approach, they achieve an F1 score 50 % higher than most supervised techniques trained on a limited corpus of Web pages.

Whitelaw et al. [110] present an approach that generates training data automatically. They propose a multi-class online classification training method that learns to recognize broad categories such as place and person, but also more fine-grained categories such as soccer players, birds, and universities. The resulting system obtains precision and recall performance comparable to that obtained for more limited entity types in much more structured domains, such as company recognition in newswire, even though Web documents often lack consistent capitalization and grammatical sentence construction.

Pasca [74] chooses to focus on the types of named entities that users of Web search engines may want to look for. The extraction is guided by a small set of seed named entities, without any need for handcrafted extraction patterns or domain-specific knowledge, allowing the acquisition of named entities pertaining to various classes of interest for Web users.

These methods focus on the precision of the extraction regardless of its recall (recall is difficult to estimate because of the lack of large pre-annotated data). However, Etzioni et al. [38] propose the KnowItAll system, which is able to extract thousands of named entities without any hand-labeled training data and obtains high precision and good recall too (recall is estimated by comparing extracted named entities to gazetteers). The KnowItAll system is based on learning domain-specific extraction rules and is able to automatically identify sub-classes in order to boost recall. Moreover, this system is able to take advantage of lists of class instances by learning a wrapper for each list.

For most NER approaches, types are pre-defined and users cannot search for entity types that were not included in the learning phase (with statistical approaches, adding a new type to the hierarchy of types means that the training data must be revisited). To answer this problem, we proposed in [12] an unsupervised way to determine to what extent an entity belongs to any arbitrary, very fine-grained conceptual class (or type). This fully automatic method is mainly inspired by the “distributional hypothesis”, which states that words that occur in the same contexts tend to have similar meanings. From this, we propose to measure to what extent an entity belongs to a type according to how similar the entity’s context is to the context of the given type. Our idea is that we can do this by comparing the word distribution in Web pages related to the entity to the word distribution in Web pages related to the type, in the sense that every entity (instance) characterizes its own type (concept). The related Web pages are retrieved by means of a Web search engine queried twice: first with the entity name as query and then with the type as query. Word distributions are then estimated from the top retrieved documents and compared to each other. The evaluation of this approach gives promising results and, as expected, shows that it works particularly well for highly specific types (e.g. scotch whisky distilleries) and less well for broad ones like “person”. It appears as a good and fast method when others fail (i.e. when the requested type was not learned beforehand).
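A schematic sketch of this comparison step; the retrieval of Web pages is abstracted behind a hypothetical fetch_top_documents function, and cosine similarity is used here as one possible way to compare the two word distributions:

```python
import math
from collections import Counter

def fetch_top_documents(query, k=10):
    """Hypothetical placeholder: return the text of the top-k Web pages for `query`.
    In [12], these documents are obtained by querying a Web search engine."""
    raise NotImplementedError

def word_distribution(texts):
    """Relative word frequencies over a set of documents."""
    counts = Counter(w.lower() for text in texts for w in text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(p, q):
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def type_membership_score(entity, entity_type):
    """How much the Web context of `entity` resembles the Web context of `entity_type`."""
    p_entity = word_distribution(fetch_top_documents(entity))
    p_type = word_distribution(fetch_top_documents(entity_type))
    return cosine(p_entity, p_type)

# e.g. type_membership_score("Lagavulin", "scotch whisky distillery")
```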

New challenges arise with social networks. Ritter et al. [87] show that most state-of-the-art named entity recognizers fail on tweets, which are short (140 characters at most) and employ a specific language. They address this issue by re-building the NLP pipeline (part-of-speech tagging, chunking, named-entity recognition). They leverage the redundancy in tweets to achieve this task, using LabeledLDA [84] to exploit FreebaseFootnote 8 structured data as a resource during learning. Their system doubles the F1 score compared to the well-known Stanford NER module.Footnote 9

5.2 Retrieving Precise Answers in Documents

Question-answering consists in providing users with precise answers to their natural language questions by extracting the answers from large document collections or from the Web [60, 62, 64, 85, 106]. This task implies a semantic analysis of questions and documents (Web pages, microblog messages, articles, books…) in order to determine who did what, where and when.

A classical architecture for question-answering systems involves several steps such as question and document analysis (Natural Language Processing: part-of-speech tagging, syntactic parsing, named-entity recognition, semantic role labeling [63, 73]…), document and passage retrieval (information retrieval approaches) and information extraction (logic- or rule-based, employing machine learning or not) from the selected passages. This corresponds to a sequential approach to question-answering that reduces the search field from the collection as a whole to a precise answer, through a limited set of documents, then a set of passages and a set of sentences extracted from these passages. This architecture has a double origin: the pre-existence of each module (evaluated and tuned through evaluation campaigns: TREC ad-hoc, MUC…) and the difficulty for deep analysis techniques to operate quickly on large corpora.

The analysis of natural language queries corresponds to the application of several more or less optional treatments: spelling correction, linguistic normalization or expansion, focus extraction [28, 41, 70, 80], identification of constraints (dates, places…), and guessing the type of the expected answer (factual—quantity, name of a place, name of an organization…—, definitional, yes/no…). For this last operation, most systems employ categorizers based on lexical patterns derived from simple heuristics on keywords, but Qanda employs a base of several thousand nominal phrases for robustness [17]. The patterns can be handwritten or built by means of machine learning (Naïve Bayes, Decision Trees…) and can be purely lexical or a combination of words, part-of-speech tags and semantic classes.
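A minimal sketch of such a keyword-based categorizer (the patterns and answer types are illustrative and far from an exhaustive taxonomy):

```python
import re

# Simple keyword heuristics for the expected answer type; rule order matters.
RULES = [
    (re.compile(r"^(who|whom)\b", re.I), "PERSON"),
    (re.compile(r"^where\b", re.I), "LOCATION"),
    (re.compile(r"^when\b|\bwhat year\b", re.I), "DATE"),
    (re.compile(r"^(how many|how much)\b", re.I), "QUANTITY"),
    (re.compile(r"^(what|which)\b", re.I), "THING"),
]

def expected_answer_type(question):
    for pattern, answer_type in RULES:
        if pattern.search(question):
            return answer_type
    return "OTHER"

print(expected_answer_type("Who wrote Foundation?"))           # PERSON
print(expected_answer_type("Where is the Eiffel Tower?"))      # LOCATION
print(expected_answer_type("How many moons does Mars have?"))  # QUANTITY
```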

Passage retrieval can be performed by computing a lexical density score for each sentence of the documents retrieved for a question by an IR system. Such a score [9] measures how far the words of the question are from each other in a document. It then allows pointing at the centers of the document areas where the words of the topic are most present, by taking into account the number of different lemmas |w| in the topic, the number of topic lemmas |w, d| occurring in the currently processed document d and a distance \( \upmu \left( {o_{w} } \right) \) that equals the average number of words between o w and the other topic lemmas in d (in case of multiple occurrences of a lemma, only the occurrence nearest to o w is considered).

Let s(o w , d) be the density score of o w in document d:

$$ s\left( {o_{w} ,d} \right) = \frac{{\log \left( {\mu (o_{w} ) + \left( {\left| w \right| - \left| {w,d} \right|} \right) \cdot p} \right)}}{{\left| w \right|}} $$
(1.12)

where p is an empirically fixed penalty that balances the preference between a few words in common with the topic that are close to each other and many topic words that are distant from each other.

Secondly, a score is computed for each sentence S in a document d. The score of a sentence is the maximum density score of the topic lemmas it contains:

$$ \text{s} \left( {S,d} \right) = \mathop {\hbox{max} }\limits_{{o_{w} \in S}} \text{s} \left( {o_{w} ,d} \right) $$
(1.13)
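A sketch of Eqs. (1.12) and (1.13); the data structures holding lemma positions and the value of the penalty p are assumptions of this example:

```python
import math

def density_score(positions_by_lemma, lemma, occurrence_pos, n_topic_lemmas, p=10):
    """Eq. (1.12): density score of one occurrence o_w (at `occurrence_pos`) of topic
    lemma `lemma` in a document d. `positions_by_lemma` maps each topic lemma found
    in d to the positions of its occurrences; `p` is the empirical penalty."""
    other_lemmas = [l for l in positions_by_lemma if l != lemma]
    if other_lemmas:
        # mu(o_w): average distance to the nearest occurrence of each other topic lemma in d
        mu = sum(min(abs(occurrence_pos - pos) for pos in positions_by_lemma[l])
                 for l in other_lemmas) / len(other_lemmas)
    else:
        mu = 0.0
    n_present = len(positions_by_lemma)   # |w, d|: topic lemmas present in d
    # the log argument is positive as soon as some topic lemmas are missing or mu > 0
    return math.log(mu + (n_topic_lemmas - n_present) * p) / n_topic_lemmas

def sentence_score(sentence_occurrences, positions_by_lemma, n_topic_lemmas, p=10):
    """Eq. (1.13): maximum density score over the topic lemma occurrences of a sentence S,
    given as a list of (lemma, position) pairs."""
    return max(density_score(positions_by_lemma, lemma, pos, n_topic_lemmas, p)
               for lemma, pos in sentence_occurrences)

# e.g. for a topic with 3 lemmas, two of which occur in document d:
positions = {"jaguar": [4, 40], "engine": [6]}
print(sentence_score([("jaguar", 4), ("engine", 6)], positions, n_topic_lemmas=3))
```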

Once passages are retrieved, they are mined to extract precise answers by matching question and answer syntactic trees, by exploiting semantic word labeling, surface analysis or logical inference. Redundancy in the text collection often helps in ranking candidate answers [62].

The first TREC evaluation campaignFootnote 10 in question-answering took place in 1999 [107]. The successive TREC-QA tracks evolved over the years to explore new issues. In TREC-8, the participants had to provide 50-byte or 250-byte document passages (from a corpus of 500,000 documents—2 GB) containing answers to factoid questions. For TREC-9, questions were derived from real user questions and a set of variants was proposed in order to study the impact of formulating questions in different ways. In TREC-2001, the required passage size was reduced to 50 bytes and list questions were introduced [105]. For this kind of question, the answers had to be mined in several documents and the questions specified the number of instances of items to be retrieved. Context questions, i.e. sets of questions related to each other, were also introduced in TREC-2001. In TREC-2002, systems were required to return exact answers rather than text passages. In TREC-2003, definition questions (e.g. “Who is Colin Powell?”) appeared. In TREC-2004, more difficult questions were introduced, including temporal constraints, more anaphora and references to previous questions. The collection was the AQUAINT Corpus of English News Text,Footnote 11 which consists of newswire text data in English drawn from three sources: the Xinhua News Service, the New York Times News Service, and the Associated Press Worldstream News Service (375 million words, about 3 GB of data).

The question-answering TREC track last ran in 2007. However, in recent years, TREC evaluations have led to some new tracks that involve precise information retrieval and deep text mining, among them the Entity Track, for performing entity-related search on Web data, and Recognizing Textual Entailment, for determining whether a text entails the meaning of another one. In the same period, for question-answering, NIST encouraged the use of documents extracted from blogs instead of newspaper articles, as well as sub-tasks dedicated to opinion mining and automatic summarization. This led to new evaluation conferences such as the Text Analysis Conference (TAC).Footnote 12 In 2008, the answers had to be either named entities or complex explanatory answers. The tests were performed on the Blog06 corpus, which is composed of 3.2 million texts extracted from more than 100,000 blogs.Footnote 13

The INEX 2009-10 QA@INEX track we organizedFootnote 14 aimed to estimate the performance of question-answering, passage retrieval and automatic summarization systems together on an encyclopedic resource such as Wikipedia [94]. The track considered two types of questions: factual questions, which require a single precise answer to be found in the corpus if it exists, and more complex questions whose answers require the aggregation of several passages (summarization of multiple documents). In order to consider more difficult texts than news articles, we have been organizing the Tweet Contextualization task of INEXFootnote 15 since 2011 [8]. Given a new tweet, participating systems must provide some context about its subject, in order to help the reader understand it. In this task, contextualizing tweets consists in answering questions of the form “what is this tweet about?”, which can be answered by several sentences or by an aggregation of texts from different Wikipedia documents. The summaries are evaluated according to their informativeness (the way they overlap with relevant passages) and to their readability (linguistic quality). Informativeness is measured as a variant of the absolute log-diff between term frequencies in the reference text and in the proposed summary. The maximal informativeness scores obtained by participants from 19 different groups are between 10 and 14 %. This task corresponds to a real challenge [95].

5.3 Opinion Mining and Sentiment Analysis in Question Answering

The objective of the opinion question-answering task of the Text Analysis Conference (TAC) is to accurately answer questions about an opinion expressed in a document. For example: “Who likes Trader Joe’s?” or “Why do people like Trader Joe’s?”, two questions about the chain of food stores “Trader Joe’s” calling for either named entities or explanatory answers. In the latter case, the answers must contain information nuggets, differentiating essential information from accurate but non-essential information. Automatic systems had to retrieve different aspects of opinion with respect to a particular polarity.

In 2008, 9 teams participated in the opinion question-answering task of TAC [29]. For the 90 sets of rigid-type questions, the best system achieved an F-score of 0.156. In comparison, the score of the manual reference was 0.559. The scores of the other 8 teams in 2008 ranged between 0.131 and 0.011. The best scores were obtained by the systems of THU Tsinghua University (China) [61] and IIIT Hyderabad (India) [104].

The first one, THU Quanta, was based on the question-answering system Quanta enriched with a vocabulary expressing feelings. The authors tried several lexical databases such as WordnetFootnote 16 but without much success, because they did not establish the polarity of a word in context (e.g. the word big can be positive or negative). Depending on the type of question, the answer extraction was based either on an analysis of the frequency of occurrences of the query words in the selected sentences and in the document titles, combined with the number of opinion words, or on a probabilistic information retrieval model associated with a measure of density of the query words in the retrieved passages (the traditional approach for question-answering—see above). The extraction of nuggets was achieved by combining pattern matching (for why questions, for example) and external knowledge (such as lists of actors and films extracted from the IMDB database).

The second best-performing system employed the Lucene information retrieval engineFootnote 17 and machine learning approaches. The retrieved documents were classified into two categories according to their polarity and matched against the question polarity. To determine polarity, the authors used lists of positive and negative words and hand-crafted classification rules. To compute the polarity of documents, they built two Bayesian classifiers: one to recognize opinion sentences (regardless of the polarity of the opinions) and a second to differentiate positive from negative opinions.
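As an illustration of such a two-classifier cascade (not the actual IIIT Hyderabad system), here is a minimal sketch with scikit-learn, assuming tiny hand-made training sets in place of annotated blog data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training sets (real systems used large annotated corpora).
subjectivity_data = [
    ("I love this store", "opinion"),
    ("The store opened in 1967", "fact"),
    ("Their coffee is terrible", "opinion"),
    ("The shop is located downtown", "fact"),
]
polarity_data = [
    ("I love this store", "positive"),
    ("Great prices and friendly staff", "positive"),
    ("Their coffee is terrible", "negative"),
    ("I will never shop there again", "negative"),
]

# First classifier: opinion sentence vs. factual sentence.
subj_clf = make_pipeline(CountVectorizer(), MultinomialNB())
subj_clf.fit(*zip(*subjectivity_data))

# Second classifier: positive vs. negative, applied only to opinion sentences.
pol_clf = make_pipeline(CountVectorizer(), MultinomialNB())
pol_clf.fit(*zip(*polarity_data))

def classify(sentence):
    if subj_clf.predict([sentence])[0] == "fact":
        return "no-opinion"
    return pol_clf.predict([sentence])[0]

print(classify("I really love their cheap wine"))
```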

In continuation of the work described above, NIST introduced in 2008 a task entitled Opinion Summarization in the TAC campaigns. The aim was to produce texts limited to 7,000 or 14,000 characters summarizing the multiple opinions encountered in blogs related to specific topics, themselves expressed as sets of natural language questions. For example, for Gregory Peck, the questions were:

What reasons did people give for liking Gregory Peck’s movies?

What reasons did people give for not liking Gregory Peck’s movies?

Here are some excerpts of text that could be extracted from blogs and that carry an opinion answering the questions:

  • I’ve always been a big Peck fan since I saw him in To Kill a Mockingbird. A Charmed Life is a fine biography of one of my favorite actors.

  • Gregory Peck can be seen playing his first character without absolutely any redeeming quality in this thriller (Boys from Brazil).

  • after half an hour of the slow-paced antics of Roman Holiday, I voted to stop watching it, so I oinked it from the DVD player, sealed up the disc in its return envelope

The methods used by the IIIT Hyderabad system are very similar to those described above for the opinion QA task. The main differences lie in the classifier that determines the polarity of sentences (Support Vector Machines rather than a Naïve Bayes classifier) and in a further exploitation of the SentiWordNetFootnote 18 lexical resource, which assigns to each WordNet synset three sentiment scores (positive, negative or neutral) [3]. The descriptions of the methods used by each system and the results are available on the TREC and TAC websites (see above).

Sentiment analysis in Twitter became an important task and was the focus of the International Workshop on Semantic Evaluation, which organized the SemEval-2013 exercises.Footnote 19 We proposed [50] to use many features and resources in order to improve a trained classifier of Twitter messages. These features extend the unigram model with concepts extracted from DBpedia,Footnote 20 verb groups and similar adjectives extracted from WordNet, and senti-features extracted from SentiWordNet. We also employed a homemade dictionary of emotions, abbreviations and slang words frequent in tweets in order to normalize them. Adding these features improved the F-measure by 2 % over an SVM with words only and by 4 % over a Naïve Bayes classifier.
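A minimal sketch of the general idea follows: unigram features are extended with counts derived from sentiment lexicons after slang normalization, before training an SVM. The word lists, slang dictionary and feature layout are toy stand-ins for the real resources (SentiWordNet, DBpedia, WordNet), not the system of [50].

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical, tiny stand-ins for the real lexical resources.
POSITIVE_WORDS = {"love", "great", "awesome"}
NEGATIVE_WORDS = {"hate", "awful", "terrible"}
SLANG = {"luv": "love", "gr8": "great"}

def normalize(tweet):
    return " ".join(SLANG.get(tok, tok) for tok in tweet.lower().split())

def lexicon_features(tweet):
    toks = tweet.split()
    return [sum(t in POSITIVE_WORDS for t in toks),
            sum(t in NEGATIVE_WORDS for t in toks)]

tweets = ["I luv this phone", "this update is awful", "gr8 camera", "I hate the battery"]
labels = ["positive", "negative", "positive", "negative"]

normalized = [normalize(t) for t in tweets]
vectorizer = CountVectorizer()
unigrams = vectorizer.fit_transform(normalized)
extra = np.array([lexicon_features(t) for t in normalized])
X = sp.hstack([unigrams, sp.csr_matrix(extra)])  # unigrams + lexicon counts

clf = LinearSVC().fit(X, labels)
test = normalize("luv the gr8 screen")
x = sp.hstack([vectorizer.transform([test]), sp.csr_matrix([lexicon_features(test)])])
print(clf.predict(x))
```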

5.4 Automatic Knowledge Base Population

Knowledge bases (KB) are considered in a variety of domains as massive resources of information. Two main issues arise. First, building such a base requires a lot of effort, since it has to be populated enough to be really useful; in addition, to make a KB usable, the veracity of its information must be guaranteed. So the first issue is "How can a reliable knowledge base be built automatically?". Second, maintaining a KB is also very challenging, as information may change over time. This raises the second problem: "How can a knowledge base be kept up to date?". Some evaluation campaigns aim at addressing these issues, and participating systems use well-known IR methods.

For example, the 2012 Knowledge Base Acceleration track of TREC (the Text REtrieval Conference organized by NIST) aims at developing systems that help human knowledge base curators maintain their databases. The Text Analysis Conference (TAC) was initiated in 2008, and in 2009 a track named Knowledge Base Population (KBP) was launched. Its aim was to extract information from unstructured texts in order to create nodes in a given ontology; it was subdivided into three tasks:

  • Entity Linking: the aim of this task is to find the KB node that is referred to by the entity in a query. The query is composed of a name-string (an entity) and a document id, which refers to a given document in a large text collection. Each name-string occurs in the associated document. The document can help disambiguate the name-string, because some entities may share a confusable name (e.g., George Washington could refer to the president, the university, or the jazzman). An entity may occur in multiple queries under different name-strings, which must be detected as variants of the entity (e.g., Elvis, The King).

  • Slot Filling: this task is closer to information extraction, since entities’ attributes (called slots) must be found in the collection. A set of generic attributes is defined for each type of entity (i.e., person, organization, geo-political entity…). Systems are not expected to correct or modify values from the reference KB, only to add information. A slot can be either single-valued or list-valued (i.e., accepting more than one value).

  • Cold Start Knowledge Base Population: this task starts with an initially unpopulated knowledge base schema that describes the facts and relations that will compose the knowledge base. Participants have to build software that creates an entire knowledge base, which is then evaluated.

Run for the first time in 2012, the Knowledge Base Acceleration (KBA) track was introduced as the reverse of the KBP Entity Linking (EL) process. EL aims to find the KB node that matches a name-document tuple; KBA starts from the KB node (called the topic) and aims to retrieve information and classify documents about that topic. When a document is classified with a high degree of relevancy, information has to be extracted from it with the aim of suggesting an edit of the KB node. The 2012 collection is a stream of documents, each carrying a timestamp, spanning October 2011 to May 2012. It is made up of three types of documents: news (from public news websites), socials (from blogs) and links (from the bit.ly database). In 2012, the KBA track only asked participants to classify documents into four classes that define their degree of relevancy:

  • Garbage: not relevant, e.g., spam;

  • Neutral: not relevant; no information could be deduced about the entity, or the document only pertains to the community of the target, so that nothing can be learned about the entity;

  • Relevant: relates indirectly; tangential with substantive implications;

  • Central: relates directly to the target, such that one would cite it in the entity’s Wikipedia article.

The document collection for the KBP 2013 Entity Linking tasks is composed of English, Spanish and Chinese documents, including approximately half a million discussion forum posts, 1 million web texts and 2 million newswire articles.

5.4.1 Architectures and Approaches

More than forty teams have participated in the KBP track over the last four years, and eleven teams took part in the first KBA session in 2012. With a general improvement of KBP results over the years, the best system reaches 85.78 % micro-averaged accuracy. Most KBP Entity Linking architectures include three main steps: query expansion, KB node candidate generation and KB node candidate ranking (see Fig. 1.2). In KBA, the approaches are quite different, since the entity is already given but the document collectionFootnote 21 is much larger (134 million news wires, 322 million blog feeds, 135 million texts shortened at bitly.com, 40 % in English, approximately 9 TB of ‘raw’ text). Teams must use the entity’s KB node information to retrieve centrally relevant documents.

Fig. 1.2 Knowledge base population system architecture for entity linking [53]

Many supervised, semi-supervised or unsupervised methods have been developed, each with its pros and cons. When reading the TAC KBP proceedings,Footnote 22 it is interesting to notice that the best teams (in terms of system performance) follow the same scheme. Most approaches are recall oriented at first; precision comes later, after going through different sets of filters that eventually produce a result. In order to obtain as many answers as possible, the name-string (considered as a query) is analyzed to obtain information about the entity such as variants, aliases, and the expanded form when the name-string is an acronym. All variants are often used as new queries completing the original one (query expansion). Then, the resulting queries are processed by an information retrieval system, whose output is a ranked list of KB node candidates.
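The three-step scheme can be sketched as follows; the toy knowledge base, alias table and scoring functions are hypothetical stand-ins for illustration, not a participant's actual system.

```python
from difflib import SequenceMatcher

# Hypothetical toy knowledge base: node id -> (canonical name, description).
KB = {
    "E1": ("George Washington", "First president of the United States"),
    "E2": ("George Washington University", "Private university in Washington, D.C."),
    "E3": ("George Washington (jazz musician)", "American jazz guitarist"),
}
ALIASES = {"GWU": "George Washington University"}  # illustrative alias table

def expand_query(name_string):
    """Step 1: query expansion (aliases, acronym expansion)."""
    queries = {name_string}
    if name_string in ALIASES:
        queries.add(ALIASES[name_string])
    return queries

def generate_candidates(queries):
    """Step 2: candidate generation by fuzzy match against KB node names."""
    candidates = set()
    for q in queries:
        for node_id, (name, _) in KB.items():
            if SequenceMatcher(None, q.lower(), name.lower()).ratio() > 0.5:
                candidates.add(node_id)
    return candidates

def rank_candidates(candidates, support_document):
    """Step 3: rank candidates by word overlap with the query's document."""
    doc_words = set(support_document.lower().split())
    def score(node_id):
        desc_words = set(KB[node_id][1].lower().split())
        return len(doc_words & desc_words)
    return sorted(candidates, key=score, reverse=True)

queries = expand_query("GWU")
candidates = generate_candidates(queries)
print(rank_candidates(candidates, "The university campus is located in Washington, D.C."))
```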

A KBA system starts with the name of an entity and a KB node, in our case a Wikipedia entry. Then, as the stream of documents goes by, the system must determine whether a document in the stream concerns the entity. Even if the two evaluation campaigns, KBA and KBP, start with different inputs, some common points arise, such as finding all possible aliases of an entity in order to build an entity profile. This profile is then useful to assess whether a document is relevant; the document must then be classified into one of the four classes.

5.4.2 Wikipedia as a Resource for Entity Disambiguation

One of the first things to do when dealing with an entity is to gather data about it. This helps to ensure that a document is really about that entity. It also helps to disambiguate the entity or to ensure there is no ambiguity with another one. A knowledge base such as Wikipedia is really convenient for this task, since both its structure and its content provide much information about entities. Every page on Wikipedia has a unique name that corresponds to the main object of the page. For homonyms, there are different alternatives for naming pages, such as:

  • Disambiguation pages show all alternative uses of a word; there is no other content. The page name is always suffixed with _(disambiguation) in the English version of Wikipedia.

  • Redirect pages send the user to another page whose content covers the current name. They are often used for acronyms, such as the UK page, which redirects to the United Kingdom page; this feature therefore also helps for handling acronyms.

  • Hatnotes are used when there is one dominant usage of a word (for instance Rice) but other senses exist. The main page is devoted to the dominant sense, and the hatnote contains a hyperlink that points to the disambiguation page or, if only one other sense is known, directly to the page of that other sense.

  • Bold text in the first paragraph may help for disambiguation, since it often contains aliases or full names.

On the KBP side, participants often build their own knowledge repository with all the information gathered for the given set of entities. Moreover, the document attached to the query can also be used. On the KBA side, the input already provides the Wikipedia page linked to the entity. However, KBA participants showed that a name may not be enough to ensure that a document is about a particular entity; they also use disambiguation pages, redirect pages, hatnotes and bold text. Some participants also use links from other pages pointing to the entity’s Wikipedia page in order to find even more aliases.

5.4.3 Automatic Query Expansion

Query expansion consists in building a set of queries that describe different facets of the original query. It is usually used in recall-oriented systems to gather as many documents as possible in order not to miss any relevant one. Xu and Croft [111] divided automatic query expansion mechanisms into two classes: global analysis of a whole document collection, and local analysis where only the documents retrieved with the original query are used to expand it. The local mechanism is also known as pseudo relevance feedback (PRF). The latter method is more effective than the global one, but it is not without risk, since a substantial amount of non-relevant documents retrieved with the original query may lead to semantic drift.
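A minimal sketch of pseudo relevance feedback over a toy collection, assuming a naive overlap-based retrieval function in place of a real IR engine:

```python
from collections import Counter

# Toy collection; a real system would use an IR engine (Lucene, Terrier, ...).
DOCS = [
    "george washington was the first president of the united states",
    "the george washington university is located in washington dc",
    "washington dc is the capital of the united states",
    "jazz guitarist george washington recorded in the 1930s",
]
STOPWORDS = {"the", "of", "in", "is", "was", "a", "an"}

def retrieve(query, k=2):
    """Naive retrieval: rank documents by query-term overlap."""
    q = set(query.lower().split())
    scored = sorted(DOCS, key=lambda d: len(q & set(d.split())), reverse=True)
    return scored[:k]

def pseudo_relevance_feedback(query, k=2, n_terms=3):
    """Expand the query with the most frequent terms of the top-k documents."""
    counts = Counter()
    for doc in retrieve(query, k):
        counts.update(t for t in doc.split()
                      if t not in STOPWORDS and t not in query.lower().split())
    expansion = [t for t, _ in counts.most_common(n_terms)]
    return query + " " + " ".join(expansion)

print(pseudo_relevance_feedback("george washington"))
```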

Many KBP EL participants used the information gathered during the entity disambiguation process to generate queries and obtain a collection of candidates related to the entity. Then, they proceed to candidate disambiguation to select the correct KB node. One of the best KBP EL systems [20] builds, from the original name-string, a collection of queries based on the following rules (a sketch is given after the list):

  • Whether the original name-string corresponds to a Wikipedia redirect page. If so, the title of the page pointed to by the redirection is used as a query,

  • Whether the title followed by _(disambiguation) exists in the Wikipedia collection. If so, every title from the disambiguation page is added to the set of queries,

  • The original name-string is considered as an acronym if it contains only capitalized letters. The full name is then searched for in the attached document and added to the set of queries when found.
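A sketch of these three rules, assuming hypothetical local snapshots of the Wikipedia redirect and disambiguation structures rather than the actual system described in [20]:

```python
import re

# Hypothetical local snapshots of Wikipedia structures used by such systems.
REDIRECTS = {"UK": "United Kingdom"}
DISAMBIGUATION = {"Washington (disambiguation)":
                  ["Washington (state)", "Washington, D.C.", "George Washington"]}

def expand_name_string(name_string, attached_document):
    queries = {name_string}
    # Rule 1: follow a redirect page if one exists for the name-string.
    if name_string in REDIRECTS:
        queries.add(REDIRECTS[name_string])
    # Rule 2: collect all titles listed on the "<name> (disambiguation)" page.
    queries.update(DISAMBIGUATION.get(name_string + " (disambiguation)", []))
    # Rule 3: if the name-string looks like an acronym, search the attached
    # document for a capitalized word sequence whose initials spell it out.
    if name_string.isupper() and len(name_string) > 1:
        pattern = r"\b" + r"\s+".join(c + r"\w+" for c in name_string) + r"\b"
        match = re.search(pattern, attached_document)
        if match:
            queries.add(match.group(0))
    return queries

doc = "The United Kingdom announced new trade measures on Monday."
print(expand_name_string("UK", doc))
```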

5.4.4 Results

In both the KBP EL and KBA evaluation campaigns, a confrontation between the query and the result is required to assess whether a document deals with the entity in the original query. In KBP EL, the input is a query (a name-string) and a document. The knowledge base is then queried with the expanded queries to generate a pool of candidates, which are eventually ranked. Different ranking approaches have been used, such as an IR-oriented approach where the attached document is used as a single query and the aim is to retrieve the most relevant candidates. Another interesting unsupervised approach measures the similarity between the attached document and the candidate. Some teams make this method weakly supervised by using annotated data for tuning.

Using query expansion based on Wikipedia hyperlinks, titles and disambiguation pages, together with a supervised classifier, allowed the Stanford-UBC team to significantly improve micro-averaged accuracy from 74.85 to 82.15 %, while other unsupervised systems obtained up to 59.91 %. For unsupervised methods, an efficient selection of features is a very important step. Ji et al. show that unsupervised methods obtain significant improvements when using semantic relation features or context inference; the BuptPris system, for instance, goes from 58.93 to 73.29 % micro-averaged accuracy when semantic relation features are added.

In KBA, the inputs are an entity name and a knowledge base node (a Wikipedia page). For KBA 2012, we proposed an original approach [13] that obtained the third best score by introducing new features and a new weakly supervised classification process. We subdivided the problem into a two-step classification process. For both classifiers, we estimated a set of features (a feature extraction sketch is given after the list):

  • Document-centric features: estimated on a single document, such as term frequency (TF), TF in the title, whether the document has a title, TF in the first 10 and 20 % of the document, word entropy…

  • Entity-related features: estimated with the help of the KB node, which is the only input provided by a third party in our solution. A cosine similarity (or any other relevance score) is computed between the KB node document and the candidate document.

  • Time features: their primary purpose is to characterize burst effects and to estimate their impact on document relevancy. They are used to evaluate the quantity of relevant documents about an entity within a day or within a week.
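The sketch below illustrates, under simplifying assumptions, how document-centric and entity-related features could be computed from bags of words; the feature names are hypothetical, and time features, which require document timestamps, are only mentioned in a comment.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def document_features(doc_tokens, title_tokens, entity_tokens, kb_node_tokens):
    """Illustrative document-centric and entity-related features.
    Time features (burst detection over the stream) would additionally
    require document timestamps and are omitted here."""
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    first_10 = doc_tokens[: max(1, n // 10)]
    probs = [c / n for c in tf.values()]
    return {
        # document-centric features
        "entity_tf": sum(tf[t] for t in entity_tokens),
        "entity_tf_title": sum(t in entity_tokens for t in title_tokens),
        "entity_tf_first_10pct": sum(t in entity_tokens for t in first_10),
        "has_title": bool(title_tokens),
        "word_entropy": -sum(p * math.log(p, 2) for p in probs),
        # entity-related feature
        "cosine_with_kb_node": cosine(doc_tokens, kb_node_tokens),
    }

doc = "the singer released a new album this week and fans loved the album".split()
kb_node = "famous pop singer known for several platinum album releases".split()
print(document_features(doc, "new album out this week".split(),
                        {"singer", "album"}, kb_node))
```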

KBA is a very recent track, but a lot can be expected from it. In 2012, 11 teams participated and, for the central relevancy classification, the best F-measure obtained was 0.359, with a median of 0.289 and an average of 0.220.

5.4.5 Automatic Extraction of Attributes Related to Entities

Poesio and Almuhareb [76] defined a named entity as a complex object characterized by its attributes (or features) and their values. While this assertion is widely accepted by the community and confirmed by many works, the definition of an attribute is still subject to discussion. Guarino [49] defined two types of attributes: relational attributes like qualities (color, position, role, spouse, etc.) and non-relational ones like parts (wheels, engine). Many other definitions have been proposed, but Poesio and Almuhareb [76] proposed the broadest one: in addition to Guarino’s definition, they added related objects (e.g. a nest for a bird), related activities (reading or writing for a book) and related agents (reader, writer). However, each existing work more or less presents its own definition of what an attribute is, depending mostly on the targeted task. For instance, for product attribute extraction, attributes are mainly qualities, but for populating lexical bases like WordNet, the extended definition of Poesio may be followed. Attributes and values have been shown to effectively describe and characterize entities [2].

Many works rely on rules to extract attributes. Berland and Charniak [10] introduced patterns for extracting meronyms (parts) like “NP’s NP” (phone’s battery) or “NP of NPs” (battery of phones). Poesio and Almuhareb [76] proposed additional patterns and refined those proposed by Berland (“NP’s NP {is | are | was | were}”). Sánchez [93] added patterns with verbs indicating inclusion like “NP {come | comes | came} with NP” (phone comes with battery). Zhao et al. [114] learned syntactic structures from examples and then compared them to syntactic trees in texts to find attributes.
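The following sketch shows pattern-based extraction in the spirit of these works, using plain regular expressions as a stand-in for matching over parsed noun phrases:

```python
import re

# Toy lexical patterns inspired by the meronym patterns cited above;
# real systems match over noun phrases produced by a parser.
PATTERNS = [
    (re.compile(r"\b(\w+)'s (\w+)\b"), lambda m: (m.group(1), m.group(2))),    # phone's battery
    (re.compile(r"\b(\w+) of (\w+?)s\b"), lambda m: (m.group(2), m.group(1))),  # battery of phones
    (re.compile(r"\b(\w+) comes? with (?:a |an )?(\w+)\b"),
     lambda m: (m.group(1), m.group(2))),                                        # phone comes with battery
]

def extract_attributes(text):
    """Return (entity, candidate attribute) pairs matched by the patterns."""
    pairs = set()
    for pattern, to_pair in PATTERNS:
        for m in pattern.finditer(text.lower()):
            pairs.add(to_pair(m))
    return pairs

text = "The phone's battery lasts two days and the phone comes with a charger."
print(extract_attributes(text))
```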

Web documents are widely used to perform attribute extraction. For instance, [115] detected regions of interest in web documents (the minimal set of HTML nodes containing attributes) and then labeled the elements of these regions as attributes. They achieved good performance by using a “vision-tree” based on the HTML tag tree enhanced with information like color or font type.

Query logs have also proved to be a very good source for attribute extraction. Pasca [74] started to exploit them with the intuition that if an attribute is relevant and central for an entity, people must have looked for it on the Web before. Attributes are extracted from query logs with the help of patterns and then ranked according to their frequencies. They showed that using query logs produces high-precision sets of attributes, better than the ones produced from Web pages with similar approaches, although recall is lower. Finally, they showed that using both Web pages and query logs leads to the best results.

Merging different labels that refer to the same real-world attribute (e.g. size of the screen and length of the screen) is one of the challenging problems in attribute extraction. Yao et al. [112] tackled it by looking at co-occurrences of the labels (with the assumption that if two labels appear in the same documents they probably do not refer to the same attribute) and at the values of the attributes (two labels with the same values for the same entity probably refer to the same attribute). Raju et al. [83] used group-average agglomerative clustering, based on common n-grams and Dice similarity, to group labels together; one representative label is then selected.
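A simplified sketch of label grouping by Dice similarity over word bigrams follows; the greedy single-pass grouping is an assumed simplification standing in for true group-average agglomerative clustering.

```python
def ngrams(label, n=2):
    toks = label.lower().split()
    return set(zip(toks, toks[1:])) if len(toks) >= n else {tuple(toks)}

def dice(a, b):
    """Dice coefficient between two n-gram sets."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def group_labels(labels, threshold=0.3):
    """Greedy grouping by average Dice similarity to an existing cluster
    (an illustrative simplification of agglomerative clustering)."""
    clusters = []
    for label in labels:
        grams = ngrams(label)
        for cluster in clusters:
            avg = sum(dice(grams, ngrams(other)) for other in cluster) / len(cluster)
            if avg >= threshold:
                cluster.append(label)
                break
        else:
            clusters.append([label])
    return clusters

labels = ["size of the screen", "length of the screen", "battery life", "screen size"]
print(group_labels(labels))
```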

Once attributes are extracted, they have to be ranked according to their importance for the corresponding entity. Effective approaches exploit the number of times the attribute has been extracted, the popularity, according to query logs, of the documents from which the extraction was made [100], and measures of association between the attribute and the entity (like PMI) computed from Web search engine hit counts [93].
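As an illustration, PMI can be estimated from search engine hit counts as follows; the counts and the total number of indexed pages are hypothetical values, not real engine statistics.

```python
import math

# Hypothetical hit counts returned by a Web search engine, for illustration;
# N is the assumed total number of indexed pages.
N = 1e10
hits = {"phone": 5e8, "battery": 3e8, '"phone" "battery"': 2e7}

def pmi(entity, attribute, joint_key):
    """Pointwise mutual information estimated from hit counts."""
    p_e = hits[entity] / N
    p_a = hits[attribute] / N
    p_ea = hits[joint_key] / N
    return math.log(p_ea / (p_e * p_a), 2)

print(pmi("phone", "battery", '"phone" "battery"'))
```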

The last step is value extraction. In most scenarios, value extraction is more or less straightforward: it may be done during record extraction, and using patterns has proved effective [93] (e.g. “NP’s NP is” -> phone’s battery life is 24 h). For more difficult cases, Davidov and Rappoport [30] proposed an approach to check whether extracted values are likely to be correct: they compare the extracted value to the maximum and minimum values found for similar entities and compare across entities; if the extracted value does not satisfy these constraints, it is discarded. Their approach relies on manually crafted patterns and conversion tables for units.

6 Industrial Applications

Of course, there are many industrial applications based on natural language processing and information retrieval. In this section, we present three applications we work on. The first is dedicated to digital libraries, the second to spoken language understanding for dialog systems and the last to marketing purposes.

6.1 Structuring Bibliographic References in Papers

In this section we present a real-world application of information extraction. BILBO is an automatic bibliographic reference parsing system that began as a research and development project supported by the Google Digital Humanities Research Awards.Footnote 23 The system deals with in-text bibliographic references published on CLEO’s OpenEditionFootnote 24 Web platforms for electronic articles, books, scholarly blogs and resources in the humanities and social sciences.

6.1.1 Reference Data

Most earlier studies on bibliographic reference parsing (annotation) are intended for the bibliography section at the end of scientific articles, which has a simple structure and a relatively regular format. Automated bibliographic reference annotation is a comparably recent issue among sequence labeling problems [75, 98]. It aims to separately recognize the different reference fields such as author, title, date, etc. It is involved in citation analysis, which started several decades ago [43] and is intended for finding patterns among references, such as bibliographic coupling, and for computing the impact of articles via citation frequency and patterns. A practical example is CiteSeer [45], a public search engine and digital library for scientific and academic papers that first realized automatic citation indexing. Recent works using citations concentrate on taking advantage of the extracted citation information, for example for scientific paper summarization [78], text retrieval [86] and document clustering [1]. The success of these applications primarily depends on well-recognized reference fields.

We are interested in the annotation of bibliographic references that are much less structured than the usual data. The OpenEdition online platform consists of more than 330 journals in the social sciences. The most used language is French, the original language of the platform, but the platform has been designed for electronic publishing at a global scale and the quantity of papers in English or Portuguese is growing fast. The proportion of articles written in a language other than French is 10 %, of which more than half are in English. The articles have a diverse range of formats for their bibliographic references, varying not only with the journal but also with the authors within a same journal. We distinguish three levels of references according to the difficulty of manual annotation, and we construct three training sets corresponding to these levels, since the same difficulties hold for automatic annotation (see Fig. 1.3):

Fig. 1.3 Three levels of bibliographic references and three corpora: references can be in the body of papers, in the footnotes and in a specific ‘Bibliography’ section

  • Level 1: references are at the end of the article under a ‘Bibliography’ heading. Manual annotation is comparatively simple,

  • Level 2: references are in the footnotes and are less formulaic than those of the level 1 corpus. The extraction of the bibliographic part is necessary,

  • Level 3: references are in the body of the articles. Identification and annotation are complex; even finding the beginning and end of a reference is difficult.

6.1.2 Presentation of BILBO, a Software for Information Extraction

BILBO is an automatic reference annotation tool based on Conditional Random Fields (CRFs). It provides both training and labeling services. For the former, manually annotated XML files are necessary as input data and BILBO automatically extracts labels for each reference field and uses them to learn a CRF model. Once BILBO finishes training a model, we can run it for labeling new data.

We try to encode as much information as possible during the manual annotation of the OpenEdition training data, using the TEI Guidelines.Footnote 25 TEI is a consortium that develops, defines and maintains a markup language for describing the structural and conceptual features of texts. However, rich information is not always useful for training a model: unnecessarily detailed output labels, which are reference fields in our case, can complicate the learning of a CRF model and increase the risk of over-fitting. Choosing appropriate output labels (see Table 1.1) is therefore important before applying a CRF model. Meanwhile, the superiority of CRFs over other sequence models comes from their capacity to encode very diverse properties of the input through feature functions. Feature selection is also an essential step (see Table 1.2). To avoid confusing input tokens with the features describing the characteristics of the input, we call the latter local features. Because of the heterogeneity of references across levels (see Fig. 1.3), we learned a different model for each level.

Table 1.1 Output fields labels for automatic labeling bibliographic references by CRFs
Table 1.2 Local features for automatic labeling bibliographic references by CRFs
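As an illustration of this labeling setup (not BILBO's actual feature set or label inventory), here is a minimal sketch using the sklearn-crfsuite library with toy local features and a single hand-annotated reference:

```python
# A minimal sketch of CRF-based reference field labeling; token features
# and labels are simplified stand-ins for BILBO's local features and its
# TEI-based label set.
import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_digit": tok.isdigit(),
        "has_period": "." in tok,
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# One manually annotated reference (toy example).
reference = ["Sperber", "D.", "La", "contagion", "des", "idées", "Paris", "1996"]
labels = ["surname", "forename", "title", "title", "title", "title", "place", "date"]

X_train = [[token_features(reference, i) for i in range(len(reference))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)

new_ref = ["Eco", "U.", "Lector", "in", "fabula", "Milano", "1979"]
print(crf.predict([[token_features(new_ref, i) for i in range(len(new_ref))]]))
```

In a realistic setting, the training set would contain thousands of annotated references and the local features would include punctuation classes, lexicon lookups and positional information, as listed in Table 1.2.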

We have tested more than 40 different combinations of tokenization methods, output labels and local features. 70 % of the manually annotated references are used as training data and the remaining 30 % for testing. We obtained about 90 % overall accuracy on the test data for the level 1 dataset. Compared to the scientific reference data used in the work of Peng and McCallum [75], our level 1 dataset is much more diverse in terms of reference styles. We obtained good annotation accuracy, especially on the surname, forename and title fields (92, 90 and 86 % precision respectively). These figures are somewhat lower than in the previous work of Peng (95 % overall accuracy) but, considering the difficulty of our corpus, the current result is very encouraging. State-of-the-art methods always train and evaluate their systems on well-structured data with rather simple and homogeneous reference style sheets (which is one reason why the parsing of new references that deviate from a regular format does not work well).

We tested labeling performance on the level 2 dataset as well and obtained 84 % overall accuracy using the same strategy and settings as in the previous evaluation; the precision on the three most important fields was around 82 %. We have compared BILBO to several other online systems and shown that BILBO outperforms them, especially on the level 2 dataset [56]. Using the same strategy and model as those applied to the OpenEdition data, BILBO obtained 96.5 % precision on the CORA dataset, which underlines the robustness of our approach as well as its capacity to adapt to multiple languages.

6.2 The LUNA Project for Spoken Language Understanding

The main goal of the LUNAFootnote 26 project (spoken Language UNderstanding in multilinguAl communication systems) is to develop and deploy components for robust telephone services in natural spoken language. Its purpose is to provide users with complete interactive services and interfaces and a full understanding system, not merely keyword spotting as in most automatic telephone services, for example as in [15]. The project contains many interesting components related to machine learning, such as spoken language understanding, speech-to-text and dialogue interpretation. Here, we only discuss the semantic composition component.

This component is an important process that makes sense emerge from the transcribed speech signal, which contains recognition errors, often because of recording noise, poor pronunciation… The interpretation (semantic analysis) of the transcribed speech is also difficult because it is derived from a spoken dialogue, subject to speech disfluencies, self-corrections… The produced text is often ungrammatical because the spoken discourse itself is ungrammatical. Indeed, applying classical automatic grammatical analysis methods does not produce good results and does not help to improve the outcome of speech transcription [6]. In particular, the hope of using deep syntactic analysis must be abandoned in favor of shallow analysis.

A primary objective is to provide a semantic representation of the meaning of the spoken messages. Ontologies are considered [36, 79] to conceptualize the knowledge related to the situation (context) of the recording. One can express the semantic components by means of first-order logic with predicates. In the work described here, we represent the semantic elements by frames (FrameNetFootnote 27 is a lexical database, both human- and machine-readable, based on more than 170,000 manually annotated sentences that exemplify how words are used in actual texts). Frames are hierarchical structures and fragments of knowledge that can be inserted into the transcription or employed to infer other fragments of knowledge.
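As an illustration only, the sketch below shows one possible way to represent such frame fragments and their composition for a hotel-booking utterance (see Fig. 1.4); the frame and element names are hypothetical, not those of FrameNet or the MEDIA annotation scheme.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Frame:
    """A knowledge fragment: a frame with elements and optional sub-frames."""
    name: str
    elements: Dict[str, str] = field(default_factory=dict)
    sub_frames: List["Frame"] = field(default_factory=list)

# Fragments detected in a transcription such as "...they have a bath,
# a phone and it's 90 euros near Paris".
bath = Frame("amenity", {"type": "bath"})
phone = Frame("amenity", {"type": "phone"})
price = Frame("price", {"amount": "90", "currency": "euro"})
location = Frame("location", {"relation": "near", "city": "Paris"})

# Semantic composition: the fragments are attached to a hotel-booking frame.
booking = Frame("hotel_booking", sub_frames=[bath, phone, price, location])

def pretty(frame, indent=0):
    print(" " * indent + frame.name + " " + str(frame.elements))
    for sub in frame.sub_frames:
        pretty(sub, indent + 2)

pretty(booking)
```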

A system was then proposed for speech understanding based on logical rules, with the support of an ontology, in order to create links between semantic components [36]. We conducted a study on the discovery of syntactic/semantic relationships and proposed a compositional semantics experiment to enrich the basic semantic components. A detection system based on lambda-expression hypotheses is used at the end of the process to find relationships across the discourse [35]. Semantic knowledge is then retrieved from the transcribed spoken text and transmitted to a dialogue process that remains to be analyzed [46]. The system can finally transform a human request into a computer-understandable request, with predicate logic derivations for instance.

The process for retrieving users’ intentions, wishes and requests is driven by defining an appropriate ontology [18]. It is important for machine learning (learning correct word associations and linking them to the ontology) to have a good training corpus. For this, training data has to be manually annotated. Figure 1.4 shows the result of the formalization of a spoken sentence after an automatic speech recognition process.

Fig. 1.4 The semantic representation of the sentence, included in the French MEDIA corpus [11], “Vous pouvez me confirmer tous les éléments, ils ont bien une baignoire un téléphone et c’est à quatre-vingt-dix euros à côté de Paris” (Can you confirm all these, they have a bath, a phone and it’s 90 euros near Paris)

To achieve the best semantic representation (frame detection and frame composition), it is important to make correct requests to the dialogue manager and to the other information retrieval systems involved. Some results on the correctness of this process can be found in [5] and are briefly presented here. About 3,000 dialogue turns have been processed in two ways: a reference dialogue built from a manual transcription, and an automatic transcription produced by a state-of-the-art speech recognition (ASR) system that obtained about a 27.4 % word error rate on these difficult data. The results presented in Table 1.3 concern the frame detection and frame composition F-measures when semantic analysis is applied to the reference corpus or to the automatically transcribed corpus. As expected, the quality of detection decreases on the latter, but given the 27.4 % word error rate, frame detection remains reasonably good, meaning that some missing information can be inferred from badly formed transcriptions.

Table 1.3 Quality of automatic frame detection and frame composition on manual and automatic transcription of speech

From a more general point of view, training data are valuable for any task involving machine learning and text mining. Here we have data that link speech signals and transcriptions to ‘knowledge chunks’ and that can be used for other semantic annotation tasks.

6.3 Marketing Product Characterization

The last industrial application we present is under development by the young company KWare. It can be seen as an information retrieval and extraction process that collects information on the Web characterizing any consumer product. It differs from well-known online shopping services in that retrieval and extraction have to be carried out in an unsupervised way, with the very short textual labels employed by distributors as the only input. As the example “bodch frige KDV3” shows, labels contain misspelled words (bodch instead of Bosch, frige instead of fridge).

In our case, each product label is associated with a product category (e.g. “smartphones” or “cooling”) and an unsupervised process extracts the main words describing a category from the top Web pages (retrieved by using a cleaned version of the labels of the corresponding category as queries). Due to the heterogeneous style of Web pages, locating and then extracting the words that best characterize the category is a very difficult task. Quantities such as mutual information or log-likelihood can help determine the most discriminant words. The Web page zones that contain the largest number (density) of discriminant words are likely to be technical parts corresponding to the names of the properties and the values one looks for, which will have to be extracted. For the first part of this work, i.e. automatic query formulation from labels, we proposed a two-step approach, first grouping similar words according to an adapted Levenshtein edit distance [113] (see the sketch below) and then selecting candidates by means of distributional statistics on the letter sequences of the words. A first evaluation shows that retrieval precision is better with the automatically generated queries than with the labels used as queries (for the example label above, the two misspelled words are corrected). In addition, the generated queries are more effective than the ones obtained with the effective but generic approach offered by the Google Suggest service.
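A minimal sketch of the word-grouping step, assuming a plain Levenshtein distance in place of the adapted edit distance of [113] and a simple greedy grouping:

```python
def levenshtein(a, b):
    """Classical dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def group_similar_words(words, max_dist=2):
    """Greedily group words whose edit distance to a group seed is small."""
    groups = []
    for word in words:
        for group in groups:
            if levenshtein(word, group[0]) <= max_dist:
                group.append(word)
                break
        else:
            groups.append([word])
    return groups

# Words observed across labels and retrieved pages (toy data).
words = ["bosch", "bodch", "fridge", "frige", "fridge", "kdv3"]
print(group_similar_words(words))
```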

7 Conclusion

We have presented some popular numerical models for information retrieval and extraction, several challenges organized in yearly conferences, and three different applications related to information extraction. As discussed before, one difficulty is having enough data and sufficient information to employ machine learning approaches. Despite recent advances and good practical results, improvements remain to be achieved. Some approaches have been proposed for adapting information retrieval models to the data (e.g. book retrieval vs. article retrieval) but have shown only small improvements, not always significant. For book retrieval, we know that some extra-linguistic features are specific to books: the organization of the text and the logical structure of chapters and sections, the physical structure and paging, the types (novels, short stories, scientific books…), their length and the temporal evolution of the themes covered, the presence of recurring characters and dialogue (possibly interviews), and how the books relate to each other are all aspects that are studied for critical analysis but are not taken into account by information retrieval systems. We believe all these aspects should be studied as part of current models of information processing.

The robustness of processing also has to be improved in order to handle documents with low language quality or with specific syntax and lexicon. This includes blog and micro-blog entries containing opinions expressed by users, audio transcriptions and OCR-processed texts (historical documents, for example). Improving the robustness of information retrieval systems by processing noisy data efficiently is likely to be better than trying to correct or normalize the texts.

On another level, usual information retrieval models rank documents according to how much information they convey with respect to the query a user has typed, taking into account, in the best case, the quantity of new information obtained. This is a purely informational definition of relevance, based on the hypothesis that the greater the amount of new information, the more the document is likely to interest the user. This is true to a certain extent, but it ignores the fact that needs differ depending on the level of expertise of the user; a novice will certainly be more interested in an overview document than in a comprehensive study whose vocabulary and structure are more complex. This is even truer for people with reading difficulties such as dyslexics. We will therefore have to define new measures taking this aspect into account, providing the opportunity to present the most relevant AND the simplest (most readable) documents first. Belkin [7] quotes Karen Sparck-Jones at the 1988 ACM SIGIR conference [102]:

As it is, it is impossible not to feel that continuing research on probabilistic weighting in the style in which it has been conducted, however good in itself in aims and conduct, is just bombinating in the void…

The current interest […] is in integrated, personalisable information management systems.

The issue of customization and of including the user in the search for information naturally refers to the broader foundations of language processing, at the intersection of linguistics and computer science, joined by psychology for the study of individual behavior, by neuroscience for the study of the brain and the physiological roots of language, but also by sociology and semiotics for a global analysis of needs, attitudes and meanings. Such a cross-disciplinary approach is a major challenge for the years to come if we want to go beyond, to quote K. Sparck-Jones, the mere hope (and it remains just a hope, without any certainty about the significance of the gains) of picking up a few points of precision in searching.

The latest advances in technology and medical imaging provide plausible models of our cognitive functions that can inspire us to simulate humans in areas such as language and thought. We think we have to pay attention simultaneously to work from the neurosciences and from linguistics, while being aware that the transfer from one discipline to another can only be achieved in the long term and has been debated for a long time. How to combine these different pieces of information is a major issue, not to illuminate human capacities (that is certainly not the purpose of computer science), but to develop software able to meet the many challenges of the Information Society.