1 Introduction

The Web amplified the textual revolution by making a great amount of on-line information available. Information and knowledge about almost any subject is encoded in text data available on-line, in articles or books. Text mining refers to the extraction of high-quality information from text data, where the quality of the collected information is related to the patterns and trends that can be elicited. Text mining usually involves structuring the input text (through parsing, and adding or removing linguistic features), deriving patterns within the structured data, and finally analyzing and interpreting the output.

When looking at text data in any medium, people may have different expectations regarding its content: (1) discovering aspects of a specific natural language, its usage, and its patterns; (2) mining knowledge about the observed world from the content of text data, getting its essence or extracting information about relevant aspects of the world; (3) mining knowledge about an observer, that is, using text data to infer properties of a person; and (4) performing predictive analytics, using text mining to infer real-world variables, possibly building on intermediate results of other predictions. So, multiple types of knowledge can be mined from text in general [2,3,4].

Text mining techniques are surveyed in several works [7,8,9,10,11,12,13]. In this paper we focus on statistical approaches to topic mining, used for extracting specific knowledge from documents [14]. The next sections describe techniques for topic mining and analysis, a way to analyze the content of text [17].

2 Mining of Topics

In the current context, a topic [18] is a main idea discussed, a theme or subject of discussion or conversation in text data. A topic can be mined at different levels of granularity (e.g. a sentence, a paragraph, an article, or several articles). In general, topic mining is the process of collecting knowledge about the world. This process involves: (1) finding out which topics are covered in the analyzed documents; and (2) measuring the extent of that coverage.

The setup of topic models [3] begins by taking text data as input. After the mining process, a set of topics is the output, with each topic characterized by a word distribution, as well as the proportions by which the topics are covered in each document.

In a formal way, the topic mining problem can be defined as having as input a collection of N text documents \(C=\{d_1,\dots ,d_N\}\) and the number k of topics, with C denoting the text collection and \(d_i\) each text document.

The output of the process is the k discovered topics, \(\theta _{1},\dots ,\theta _{k}\), and the degree of coverage of the topics by each document, denoted by \(\pi _{i1},\dots ,\pi _{ik}\), with each \(\pi _{ij}\) being the probability that document \(d_i\) covers topic \(\theta _{j}\). The topic coverage is constrained by \({\scriptstyle \forall \, i \in [1,N], \sum _{j = 1}^{k} \pi _{ij} = 1}\), i.e., the coverage of each document is entirely distributed over the k discovered topics.
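As a concrete illustration of this notation, the following minimal sketch (in Python, with made-up values and names used only here) represents the k discovered topics as word distributions and the coverage values \(\pi _{ij}\) as a document-by-topic matrix whose rows sum to one.

    # Toy output for k = 2 topics and N = 2 documents (illustrative values only).
    # Each topic theta_j is a word distribution; pi[i][j] is the coverage of
    # topic j by document d_i, and each row of pi sums to one.
    topics = [
        {"sports": 0.6, "game": 0.4},     # theta_1
        {"science": 0.7, "travel": 0.3},  # theta_2
    ]
    pi = [
        [0.8, 0.2],  # coverage of document d_1
        [0.3, 0.7],  # coverage of document d_2
    ]
    assert all(abs(sum(row) - 1.0) < 1e-9 for row in pi)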

The following sections present different ways of discovering the topic \(\theta _{j}\), beginning with a simplistic approach and introducing, step by step, more realistic assumptions.

2.1 A Topic as a Simple Term

The simplest approach is to take a topic to be a single term (e.g. a word). The first step in mining these simple topics is the elicitation of candidate terms by parsing the text data of the document collection. Next, a scoring function is used to evaluate how adequate each term is to actually be one of the k topics. Finally, the degree of topic coverage of each document in the collection is determined.

Since frequent terms are preferable as candidate topics, approaches from information retrieval are used, namely Term Frequency (TF) and Inverse Document Frequency (IDF) weighting. Care should be taken with semantically equivalent terms when building the scoring function: redundancy among terms is not desirable and must be tackled, for instance with the maximal marginal relevance ranking algorithm. Terms can then be selected by moving through the list of scored terms and taking the k top-ranked terms while avoiding terms that are semantically close to those already chosen, as sketched below.
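The sketch below illustrates one possible realization of this selection step, assuming TF-IDF scoring and a greedy, maximal-marginal-relevance-style loop; the term similarity function is left as a parameter, since any measure (e.g. one based on co-occurrence) could be plugged in. It is only a sketch of the idea, not a prescribed implementation.

    import math
    from collections import Counter

    def tf_idf_scores(docs):
        """Score candidate terms by term frequency times inverse document
        frequency over a collection of tokenized documents."""
        n = len(docs)
        df = Counter()  # number of documents in which each term appears
        tf = Counter()  # total occurrences of each term in the collection
        for d in docs:
            df.update(set(d))
            tf.update(d)
        return {w: tf[w] * math.log(n / df[w]) for w in tf}

    def select_topic_terms(scores, similarity, k, lam=0.7):
        """Greedily pick k terms, trading off each term's score against its
        maximum similarity to the terms already selected (in the spirit of
        maximal marginal relevance), to avoid semantically close terms."""
        selected, candidates = [], set(scores)
        while candidates and len(selected) < k:
            best = max(candidates, key=lambda w:
                       lam * scores[w]
                       - (1 - lam) * max((similarity(w, s) for s in selected),
                                         default=0.0))
            selected.append(best)
            candidates.remove(best)
        return selected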

The degree of topic coverage \(\pi _{ij}\) in each document can be computed by counting the occurrences of the topic terms in the document and normalizing them so that the coverage of each document sums to one, as formalized by \({\scriptstyle \pi _{ij} = \frac{\text {count}(\theta _{j}, d_i)}{\sum _{L = 1}^{k} \text {count}(\theta _{L}, d_i)} }\). The expression can also be seen as a distribution of topics for the document, characterizing how the document covers the different topics.
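A direct rendering of this formula, assuming topics are single terms and documents are already tokenized (names are illustrative), could look as follows.

    from collections import Counter

    def topic_coverage(doc_tokens, topic_terms):
        """pi_ij for one document: occurrences of each topic term,
        normalized so that the coverage values sum to one."""
        counts = Counter(doc_tokens)
        raw = [counts[t] for t in topic_terms]
        total = sum(raw)
        if total == 0:  # the document mentions no topic term at all
            return [0.0] * len(topic_terms)
        return [c / total for c in raw]

    # topic_coverage("the game was a great sports game".split(),
    #                ["game", "sports", "science"])  # -> [2/3, 1/3, 0.0]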

This approach, however, has some issues: (1) it does not consider synonymous words when the words belonging to a topic are counted; (2) the inherent ambiguity of certain words requires that their context be considered to derive the corresponding topic; and (3) it lacks expressive power in the case of complex topics.

To solve these problems, more words must be used to describe a topic. The incomplete vocabulary coverage of a topic represented as a single term can be addressed by assigning weights to the words associated with the topic. This makes it possible to distinguish subtle differences between topics and to capture fuzzy, semantically related words. To solve the problem of word ambiguity, the weight of an ambiguous word can be split among the topics it may belong to, so the topic can be disambiguated. The next approach addresses these issues by using a probabilistic topic model.

2.2 A Topic as a Word Distribution

The basic idea in this approach is to improve the representation of a topic with a word distribution. So, each topic becomes a word distribution whose probabilities, over all words (w) in the vocabulary (V), sum to one, which can be denoted as \({\scriptstyle \forall \, j \in [1,k], \sum _{w \in V} p(w|\theta _{j}) = 1}\). Another constraint imposed on the topic coverage is that all the \(\pi _{ij}\)'s should sum to one for the same document.

As input, the topic discovery problem has: (1) a collection of N text documents \(C = \{d_1, \dots , d_N\}\); (2) a vocabulary set \(V = \{w_1,\dots , w_M\}\); and (3) a specified number k of topics. The solution consists of: (1) the k topics, each one a word distribution \(\theta _{1},\dots ,\theta _{k}\); and (2) for each document \(d_i\), the coverage of the topics \(\pi _{i1},\dots ,\pi _{ik}\), with \(\pi _{ij}\) being the probability that \(d_i\) covers topic \(\theta _{j}\).

The general way of solving this text mining problem with statistical modeling is a generative model. The main idea is to design a probabilistic model of how the data are generated, which gives the probability of the data under the model and its parameters, \({\displaystyle \text {P}(\text {Data}|\text {Model},\varLambda )}\), with the parameters denoted by \({\displaystyle \varLambda = ( \{ \theta _{1} ,\dots ,\theta _{k} \}, }\) \({\displaystyle \{ \pi _{11},\dots ,\pi _{1k} \}, \dots , \{ \pi _{N1},\dots ,\pi _{Nk} \} )}\).

After the model is set up, it can be fitted to actual data, meaning that the most likely parameter values, \(\varLambda ^*\), can be estimated from the data set. So, \({\displaystyle \varLambda ^* = \text {argmax}_{\varLambda }\, \text {p}(\text {Data}|\text {Model},\varLambda ) }\) maximizes the probability of the observed data. These parameters are the outcome of the data mining algorithm, which can then be used to discover knowledge from text.

Statistical Language Models (SLM) cover probabilistic topic models as special cases, and can be regarded as probabilistic mechanisms for generating text. The simplest language model is the Unigram Language Model (ULM). It simply assumes that the text is generated word by word, independently, i.e., \({\displaystyle \text {p}(w_1,w_2,\dots , w_n) = \text {p}(w_1)\,\text {p}(w_2)\dots \text {p}(w_n)}\).

With this assumption, the probability of a sequence of words is the product of the probabilities of the individual words, and the model has as many parameters as there are words in the vocabulary. Assuming a vocabulary of M words, there are M probabilities, and they sum to one: \(\text {p}(w_1) +\dots + \text {p}(w_M) = 1\). So, the text is a sample drawn according to the word distribution. Since this model ignores word order it may not be suited for some kinds of problems, but it is quite sufficient for many tasks that involve topic analysis.
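As a small illustration of the unigram assumption, the sketch below evaluates the (log-)probability of a word sequence as the sum of the log-probabilities of its words, for an illustrative word distribution.

    import math

    def unigram_log_prob(words, p):
        """log p(w1, ..., wn) = sum_i log p(wi) under the unigram
        independence assumption; p maps vocabulary words to probabilities."""
        return sum(math.log(p[w]) for w in words)

    p = {"text": 0.5, "mining": 0.3, "data": 0.2}
    math.exp(unigram_log_prob(["text", "mining", "text"], p))  # 0.5*0.3*0.5 = 0.075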

For the estimation of the parameters of the Unigram Language Model, there are two different approaches: the Maximum Likelihood (ML) estimator and the Maximum a Posteriori (MAP) estimator. The ML estimator gives the observed data the maximum probability and is formalized as \({\displaystyle \hat{\theta } =\mathop {\text {arg max}}\limits _\theta \text {P}(X|\theta )}\), where arg max denotes the argument \(\theta \) that maximizes \(\text {P}(X|\theta )\).

When the sample is small, unseen words get zero probability, which may not be reasonable if the distribution is meant to characterize a topic. This problem is addressed through Bayesian estimation, since it uses both the data and prior knowledge about the parameters, such that \({\displaystyle \hat{\theta } =\mathop {\text {arg max}}\limits _\theta \text {P}(X|\theta )\text {P}(\theta )}\), where the prior is given by \(\text {P}(\theta )\), which means that preferences are imposed on certain \(\theta \)'s. So, by using Bayes' rule, the MAP estimator combines the likelihood function with the prior to give the posterior probability of the parameters.
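A common concrete instance of this idea is additive smoothing, which can be read as a MAP-style estimate under a symmetric Dirichlet prior (up to the exact pseudo-count convention); the sketch below assumes a tokenized document and a fixed vocabulary.

    from collections import Counter

    def smoothed_unigram(doc_tokens, vocab, pseudo=1.0):
        """Estimate a unigram distribution with additive (Dirichlet-prior-style)
        smoothing, so unseen vocabulary words keep a small non-zero probability."""
        counts = Counter(doc_tokens)
        denom = len(doc_tokens) + pseudo * len(vocab)
        return {w: (counts[w] + pseudo) / denom for w in vocab}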

The MAP estimator is more general than the ML estimator because, if the prior is non-informative, it reduces to the maximum likelihood estimate.

2.3 Mining a Single Topic from a Document

The Unigram Language Model can be used for mining a single topic from one document. The generative model discovers the word probabilities of the single topic using as data the document d, with the \(x_i\)'s as the sequence of words of the document. The document is formalized as \(d = x_1\,x_2\dots x_{|d|}\) with \(x_i \in \, V = \{w_1, \dots , w_M \}\). The model is a word distribution denoting the topic \(\theta \), with M parameters (as many as words in the vocabulary), and \(\theta _i\) as the probability of choosing word \(w_i\). The ULM is formally expressed as \(\theta : \{ \theta _{i} = p(w_i | \theta ) \}, i = 1, \dots , M\), with \(\sum _{i=1}^M \theta _{i} = 1\).

Because independence is assumed when generating each word, the probability of the document is the product of the probabilities \(\theta _i\) of its words. Since some words may occur more than once, the likelihood function is written as a product over all the unique words of the vocabulary, where the counter function \(\text {c}(w,d)\) denotes the count of word w in the document. So, the likelihood function, i.e., the probability of generating the whole document given the model, is \({\scriptstyle p(d|\theta ) = \prod _{i = 1}^{M} \theta _i^{\text {c}(w_i,d)}}\).

By maximizing the likelihood function \(p(d|\theta )\), subject to the constraint \({\scriptstyle \sum _{i=1}^{M} \theta _i=1}\), the \(\hat{\theta }_{i}\)'s for each word are found through the expression: \({\scriptstyle (\hat{\theta }_{1},\dots ,\hat{\theta }_{M}) = \mathop {\text {arg max}}\nolimits _{\theta _1, \dots , \theta _M} p(d|\theta ).}\)

The analytical solution of the optimization problem is the count of each word normalized by the total number of words in the document, given by \({\scriptstyle \hat{\theta }_{i} = \frac{\text {c}(w_i,d)}{|d|}}\).
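Putting the last steps together, a minimal sketch of this single-topic estimation (assuming a tokenized document; names are illustrative) is:

    from collections import Counter
    import math

    def ml_unigram(doc_tokens):
        """Maximum likelihood estimate for one document:
        theta_i = c(w_i, d) / |d|."""
        counts = Counter(doc_tokens)
        total = len(doc_tokens)
        return {w: c / total for w, c in counts.items()}

    def log_likelihood(doc_tokens, theta):
        """log p(d | theta) = sum over words of c(w, d) * log theta_w."""
        counts = Counter(doc_tokens)
        return sum(c * math.log(theta[w]) for w, c in counts.items())

    d = "text mining extracts knowledge from text".split()
    theta = ml_unigram(d)  # e.g. theta["text"] == 2/6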

3 Conclusions

Text mining, as an interdisciplinary field, benefits from contributions of several related disciplines. It allows the identification of patterns in large sets of data, uncovering previously unknown, useful knowledge for decision making.

This work presented an overview of techniques for discovering the main topics in text documents. The surveyed techniques included (1) a simplistic approach, with a topic as a single term, (2) a model where each topic is a word distribution with known probabilities over all the words in the vocabulary, and (3) a model for mining a single topic from a document using the Unigram Language Model.