1 Introduction

Large corpora of text data are the essential prerequisite for building statistical language models (LM) that constitute one of the core components of many applications related to natural language processing (e.g., automatic speech recognition (ASR), machine translation, OCR, etc.). A decade ago, such corpora were usually assembled "by hand", meaning that they were obtained by transcribing speech recordings from TV and/or radio stations (Psutka et al. 2001), amassed from existing electronic documents covering a given domain, or the language model was trained on a corpus that was originally built for different purposes (Kučera 2002).

It would seem that the problem of data resources has essentially disappeared with the growth of the Internet content, as the quantity of electronic texts available on-line nowadays exceeds every conceivable limit. However, several issues have to be addressed in order to obtain suitable data for language modeling.

First, it is clear to everybody who has ever visited more than a few Internet pages that the "linguistic" quality of the text varies widely and that text gathered from, for example, music fans' discussions would not be suitable for training a language model designed for the automatic speech recognition of political debates. We therefore have to be very careful when selecting the sources of the data. Since we already had specific target domains in mind—namely, the automatic transcription of parliament sessions and TV political debates (Trmal et al. 2010)—when designing the presented data mining engine, we decided to collect the data from news websites that were selected as "trustworthy" sources of periodically updated text data.

Extracting the data from news websites brings another important property to the resulting corpus—topicality. The essential property of a suitable language model training corpus is good coverage of the speech content that is going to be transcribed (or, in other words, a low out-of-vocabulary rate on the processed utterances). Words that are not present in the text corpus are usually also missing from the ASR lexicon and consequently cannot be correctly transcribed. The out-of-vocabulary (OOV) rate is therefore a key factor that influences the recognition accuracy. Fortunately, when the language model and the lexicon are built from a sufficiently large corpus, the set of OOV words encountered in TV news and political debates usually consists mostly of personal and geographical names related to current events; those are exactly the words that can be extracted from up-to-date Internet news data. This phenomenon is illustrated in Fig. 1.

Fig. 1 Occurrences of certain salient words on the timeline

The figure shows the number of occurrences of three selected words in particular months starting January 2010. "Concordia" is the second (and more distinctive) part of the name of the ship "Costa Concordia" that sank after an accident in January 2012. We dare to claim (although without closer inspection) that previous occurrences of the word "Concordia" were related mostly to other usages of this word (such as the Roman goddess of harmony). Similarly, the surname "Strauss-Kahn" began to appear much more frequently after the managing director of the IMF, Dominique Strauss-Kahn, was arrested and charged with sexual assault in May 2011. According to the graph, the case had quite steady coverage until September 2011, which coincides with the case dismissal in late August. And finally, even the relative "perennial star" of national news, the former Czech president Václav Havel, has discernible peaks of attention (Footnote 1)—the first one in March 2011 is almost certainly related to the release of his first (and last) movie, the smaller one in October coincides with his birthday, and the biggest one in December is connected with his death.

The second set of problems related to getting the data from Internet sources consists of the more or less technical issues concerning the actual download of the on-line content, the algorithms for stripping the HTML (or other) markup, methods for text tokenization and normalization and, last but not least, also the detection of possible duplicate documents. The approaches and algorithms dealing with those issues are presented in Sects. 3–6.

Once we have the cleaned data from trusted sources available, it is still not practical to use them for language modeling right away. First, the data are typically huge to the extent that the actual language model construction becomes complicated. Even more importantly, there is evidence that data quantity by itself might not be sufficient for good language model performance and that the right scope of the LM training texts matters more. When the topic of the LM target domain is really specific, it happens that an "in-domain" language model estimated on a moderate-sized corpus vastly outperforms a model built using data that are one or two orders of magnitude bigger but constitute just a general corpus (Psutka et al. 2003). Thus, when we download and store texts that are meant for future LM training, the information about the document topic is extremely valuable. Such meta-information is often present in the documents downloaded from news servers, which constitutes another advantage of using this data source. However, the style of marking the topic is usually not consistent across different servers (often not even within a single one) and frequently this information is missing completely. We have therefore decided to use a method for automatic identification of a document topic that is trained using the topic metadata from the most consistent data source (see Sect. 7). Finally, the statistics of the corpus that we have put together using the presented techniques are given in Sect. 8.

2 Related work

There is actually a plethora of articles describing tools and frameworks for automatic or semiautomatic creation of text corpora from the Web. However, we have found that most of them aspire to create corpora that are primarily meant for linguistic research in a wide sense and thus they aim for texts that cover a broad range of topics and consequently yield a representative sample of general language usage.

With such a goal in mind, it is only natural that the authors usually employ the "wide crawling" approach and use general search engines (such as Google) or specialized tools like BootCaT (Baroni and Bernardini 2004) to repeatedly submit queries created from the words contained in a "seed list" and collect the retrieved documents. Examples of this approach can be found in Sharoff (2006), Kilgarriff et al. (2010), Li et al. (2007) and, for Czech, Spoustová et al. (2010).

The goal of our work is somewhat different. As was already mentioned, our framework is aimed at creating corpora that would adequately and robustly represent only a couple of target domains; consequently, we plan to use them mainly for language model training. Some authors use the "crawling" approach described above for the same purposes (Bulyko et al. 2007), yet we have decided to collect the data from a limited number of "trusted" sources such as news websites. Our system is in this sense similar to the Corporator tool developed by Fairon (2006) but contains more sophisticated processing algorithms, especially for duplicate detection and topic identification.

Interesting work dealing with duplicate detection can be found in Jan Pomikálek's thesis (Pomikálek 2011). He builds upon the work of Broder et al. (1997), just as the algorithms used in our framework do. However, his thesis was finished only recently and we were not aware of his results when we were developing our framework.

3 System architecture

The core of the system is an SQL database on which the individual processing algorithms operate. The design of the architecture used for our data collection and processing engine had to be considered very carefully. The main requirements were the following:

  • Extensibility The possibility to modify the database schema according to evolving needs (we expect to use the core engine for the creation of language corpora that might serve various purposes).

  • Modularity We always need to be able to add and/or modify algorithms that would perform (sometimes very complex) operations on various databases with similar inner structure. At the same time, the algorithms should be allowed to invoke each other and they should also be easily configurable.

  • Scalability The scalability is actually required along two dimensions. First, with regard to the volume of the data—we expect to gather tens of millions of documents. The second dimension involves the possibility of parallelization that would speed up the processing of massive data.

  • Portability We would like the final system to run on a variety of operating systems and also in several operating modes (interactive, batch, debugging, testing, etc.)

After taking all the requirements into account, we have decided to employ the Voiar library developed by Švec (2010), originally as an efficient platform for spoken term detection in large audiovisual archives (Psutka et al. 2011). The library is implemented in Python and utilizes the SQLAlchemy framework for SQL database access. Extension algorithms can be used for tailoring the Voiar library to a specific task (such as the Web data mining and text processing that we need)—it is possible to add attributes to the objects, define new object types, modify the behavior of existing objects, etc. The algorithms also constitute the basic units for parallelization based on MPI (Message Passing Interface).
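To give a concrete flavor of this design, the following sketch shows what a pluggable processing algorithm operating on an SQLAlchemy-backed document table might look like. The class and column names (Document, Algorithm, StripMarkupAlgorithm, clean_text, etc.) are hypothetical illustrations rather than the actual Voiar API, and the markup stripping is reduced to a trivial regular expression.

import re
from sqlalchemy import Column, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Document(Base):
    __tablename__ = "documents"
    id = Column(Integer, primary_key=True)
    source = Column(String(16))     # e.g. "CNO", "IDS", ...
    raw_html = Column(Text)
    clean_text = Column(Text)

class Algorithm:
    """Basic unit of processing; subclasses may invoke each other."""
    def __init__(self, session: Session):
        self.session = session

    def run(self):
        raise NotImplementedError

class StripMarkupAlgorithm(Algorithm):
    def run(self):
        # process only documents that have not been cleaned yet
        for doc in self.session.query(Document).filter(Document.clean_text.is_(None)):
            # stand-in for the source-specific cleaning rules of Sect. 5.1
            doc.clean_text = re.sub(r"<[^>]+>", " ", doc.raw_html or "")
        self.session.commit()

if __name__ == "__main__":
    engine = create_engine("sqlite:///corpus.db")
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        StripMarkupAlgorithm(session).run()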

The overall system architecture is depicted in Fig. 2. The functionality of the individual blocks is thoroughly explained in the sections below.

Fig. 2 System architecture schema

Once the data are processed with all the algorithms and stored in the database, a wide selection of filters is available in order to export the data into a textual form suitable for language modeling (or another NLP task). Typically, we are able to select the data based on:

  • data source (and, in some cases, also a subsource)

  • publication date (or period)

  • assigned keywords (or the absence thereof)

and more. The individual filters can be combined to create quite sophisticated selections.
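As an illustration of how such filters can be combined, the following sketch builds one query by extending the Document mapping sketched in Sect. 3 with hypothetical pub_date and keywords columns; these attribute names are assumptions made for this example only.

from datetime import date
from sqlalchemy import select

def export_selection(session, sources, date_from, date_to, required_keyword=None):
    # combine source, publication-period and keyword filters into a single selection
    query = (
        select(Document)
        .where(Document.source.in_(sources))
        .where(Document.pub_date.between(date_from, date_to))
    )
    if required_keyword is not None:
        # assumes keywords are stored as a text column on the document
        query = query.where(Document.keywords.contains(required_keyword))
    for doc in session.execute(query).scalars():
        yield doc.clean_text

# e.g. political articles from two sources published during 2010:
# texts = export_selection(session, ["CNO", "PAL"], date(2010, 1, 1), date(2010, 12, 31), "politics")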

4 Data sources

The system database currently contains documents from the following sources:

  • CeskeNoviny.cz (CNO) This website is a news server of the Czech News Agency and as such it provides mostly “serious” home and world news and news from business, culture and sports. It also has the most reliable system for assigning keywords to individual articles and thus the data from this source constitute the training and evaluation corpora for the topic detection algorithms (see Sect. 7).

  • iDnes.cz (IDS) News portal connected with the nationwide newspaper “Mladá fronta DNES”. Besides the usual news mix similar to CeskeNoviny.cz it also publishes regional news and hobby-related articles about housing, cars, computers, etc.

  • Lidovky.cz (LID) Internet version of another Czech nationwide newspaper, "Lidové noviny". Similar structure to iDnes.cz, probably with fewer "tabloid-like" articles.

  • ParlamentniListy.cz (PAL) Internet news server dealing predominantly with political issues. Despite its name, it does not have any direct connection with the Czech parliament and is run entirely by a private firm. It publishes a lot of original material and adds diversity to the corpus, while at the same time keeping the focus on politics which is the target domain of many of our ASR applications.

  • Anopress (ANP) These data were actually not acquired online but were provided to us by Anopress IT a.s., a media monitoring company. The data contain articles published in printed newspapers (a significant portion of those articles is in fact from "Mladá fronta DNES", which causes some overlap with the IDS data) and transcripts of several television news and discussion broadcasts.

  • Otázky Václava Moravce (OVM) Transcripts of the discussion show that were singled out from the ANP data because this broadcast constitutes one of the pilot shows currently being live-captioned by our ASR system on Czech Television (the national public TV broadcaster) using the shadow-speaker technology (Pražák et al. 2011).

  • Closed captions from Czech Television (IVY) This collection contains the closed captions (hidden subtitles) prepared for Czech Television broadcasts aired in the period from January 2000 to February 2012.

  • Community-generated subtitles (SUB) Czech subtitles for movies and TV shows created by the online community.

Basic properties of data from all the sources are summarized in Sect. 8, Table 1.

Table 1 Data amounts by data sources

5 Data preprocessing

Updates of the selected online data sources are periodically checked and downloaded using the standard RSS format. Then the raw (usually HTML-tagged) documents are passed through a cascade of text processing algorithms.

5.1 Text cleaning

The text cleaning algorithm in our system is a rule-based procedure which processes the input web page (an article, usually in the HTML format) and extracts the text from the main body of the article. Each of the data sources is assigned a specific set of rules to extract the text and the metadata of the article. The metadata include the date when the article was published, the keywords of the article, the author, the title, the subtitle, etc. Embedded tables, images and text boxes are excluded from further processing. In addition, the text is checked for invalid characters and a character-based substitution is performed (Footnote 2). The reduction of the character set simplifies the design of the subsequent processing algorithms.
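A minimal sketch of one such source-specific rule set is shown below, assuming BeautifulSoup is used for parsing; the CSS selectors and the allowed character set are invented for illustration, since every real source needs its own rules for locating the article body and metadata.

from bs4 import BeautifulSoup

# allowed character set used for the character-based substitution (illustrative)
ALLOWED = set("aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž"
              "AÁBCČDĎEÉĚFGHIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ0123456789 .,;:!?-()\"'%")

def clean_article(html):
    soup = BeautifulSoup(html, "html.parser")
    # drop embedded tables, images, boxes and scripts before extracting the text
    for tag in soup.find_all(["table", "img", "aside", "script", "style"]):
        tag.decompose()
    body = soup.select_one("div.article-body")       # hypothetical selector
    title = soup.select_one("h1.article-title")      # hypothetical selector
    text = body.get_text(separator=" ", strip=True) if body else ""
    # character-set reduction: anything outside the allowed set becomes a space
    text = "".join(ch if ch in ALLOWED else " " for ch in text)
    return {"title": title.get_text(strip=True) if title else "", "text": text}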

The source-specific cleaning algorithms become impractical as the number of data sources grows. We have therefore started work on a general cleaning algorithm that would be independent of the data source, but it has not yet been sufficiently evaluated and is therefore not used to process the working data set. The principles of this general algorithm are outlined in Sect. 9.

5.2 Tokenization and text normalization

The task of the tokenization algorithm is to segment the input text into so-called "dictation units", i.e., words, numbers and other characters that are dictated to the ASR system separately. A typical example of such units that have to be found and separated are the punctuation marks, which act as standalone dictation tokens (that is, if a user of the ASR system wishes, for example, the mark "." to be written, he/she has to say "tečka", which means "full stop" in Czech). Unfortunately, the full stop character in particular obviously has multiple usages in written text. It can denote the end of a sentence (in which case the tokenization algorithm separates it from the adjacent word), it can be part of an abbreviation or it can constitute the decimal mark in a number. In the latter two cases, the full stop remains attached to its neighborhood and the abbreviation and the number are replaced with their full-length form and numeral, respectively, so that they can be correctly processed by a phonetic transcription module. The correct conversion from numbers to numerals (the corresponding word forms) is the main aim of the text normalization module and a non-trivial task for highly inflectional languages such as Czech, since the form of the numeral actually depends on the gender, number and case of the related words (e.g., the number "2" has the correct numeral "dva" in the phrase "2 muži" ("two men") and "dvě" in the phrase "2 ženy" ("two women")). An algorithm based on morphological tagging developed by Zelinka et al. (2005) is used in our system.
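The following toy fragment illustrates the full-stop handling just described: a period is split off as a standalone dictation token only when it is neither part of a known abbreviation nor a decimal mark. The abbreviation list is an illustrative stub and the number-to-numeral expansion (which in the real system relies on morphological tagging) is not shown.

import re

ABBREVIATIONS = {"např.", "tzv.", "p.", "č."}        # illustrative subset

def tokenize(text):
    tokens = []
    for raw in text.split():
        if raw.lower() in ABBREVIATIONS or re.fullmatch(r"\d+\.\d+", raw):
            tokens.append(raw)                       # keep the full stop attached
        elif raw[-1:] in {".", ",", "!", "?"}:
            tokens.append(raw[:-1])
            tokens.append(raw[-1])                   # punctuation as a dictation token
        else:
            tokens.append(raw)
    return tokens

print(tokenize("Např. 2.5 miliardy korun."))
# ['Např.', '2.5', 'miliardy', 'korun', '.']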

5.3 Vocabulary-based token substitution and true casing

Tokens of the normalized text are then processed with a vocabulary-based substitution algorithm—large vocabularies prepared by experts are used to homogenize sequences of tokens. The substitution rules are of three types:

  1. Rules for fixing common typos (they will, for example, replace the misspelled word "zda-li" with "zdali")

  2. Rules that replace sequences of tokens with a multiword (e.g., the company name "Czech Coal" is replaced with "Czech_Coal"). Tokens are grouped into multiwords mainly in cases when the meaning of the individual tokens treated separately is quite different from the meaning of the entire multiword. The usage of multiwords makes the language model more accurate and robust, leading to lower perplexity. The rules for creating multiwords correspond predominantly to names of renowned people, political parties and geographical names.

  3. Rules that unify the written form of common terms (e.g., the company name "EON" is unified with the correct form "E.ON").

A large number of terms have more than one rule because of inflection. In total, the human-prepared rule lists contain 17k rules.
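A compressed sketch of the substitution pass over tokenized text might look as follows; the few rules shown here only illustrate the three rule types, whereas the real hand-crafted lists contain the 17k entries mentioned above.

TYPO_RULES = {"zda-li": "zdali"}                     # type 1: typo fixes
UNIFY_RULES = {"EON": "E.ON"}                        # type 3: form unification
MULTIWORD_RULES = {("Czech", "Coal"): "Czech_Coal"}  # type 2: multiwords

def substitute(tokens):
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORD_RULES:                  # multiword rules take precedence
            out.append(MULTIWORD_RULES[pair])
            i += 2
            continue
        tok = tokens[i]
        out.append(UNIFY_RULES.get(tok, TYPO_RULES.get(tok, tok)))
        i += 1
    return out

print(substitute(["zda-li", "Czech", "Coal", "koupí", "EON"]))
# ['zdali', 'Czech_Coal', 'koupí', 'E.ON']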

Another operation, called true casing, also takes place during the substitution. We use the term true casing to denote the process of substituting the capitalized words at the beginning of sentences with the corresponding lower-case variants, except for proper names and other word forms commonly written with a capital first letter. An algorithm based on essential corpus statistics was used in the previous version of our data mining framework (Švec et al. 2011). However, it turned out that it incorrectly leaves too many capitalized words unmodified. The most illustrative example is the adjective "český" ("Czech") and its other declensions—according to Czech grammar, it must be written in lowercase except when it constitutes a part of a name (of a company, place, etc.). As one can imagine, this adjective is found in many company names in the Czech Republic—e.g., "České dráhy" (Czech Railways) or, for that matter, "Česká republika" (Czech Republic) itself. Thus the ratio between the number of occurrences of the capitalized "Český" at sentence beginnings and in the corpus as a whole (which is used as a decision criterion in Švec et al. 2011) drops below a threshold and the word is left unchanged.

We have therefore resorted to a simple rule-based method that essentially takes the capitalized word at a possible sentence start (after a punctuation mark) and searches the large vocabulary extracted from Ispell (Footnote 3) for both the capitalized and the lower-cased variant. If the lower-cased variant is found while the capitalized one is not, the word is decapitalized; otherwise it is left unchanged. That is, it will correctly lowercase the word "Český" mentioned above. Note also that any word that is not present in the Ispell lexicon at all is discarded from our working vocabulary.
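A simplified version of this rule is sketched below; the tiny lexicon stands in for the full Ispell-derived vocabulary and sentence starts are detected only by the preceding punctuation token.

LEXICON = {"český", "prezident", "navštívil", "Praha", "Česká", "republika"}
SENTENCE_DELIMS = {".", "!", "?"}

def true_case(tokens):
    out = []
    at_sentence_start = True
    for tok in tokens:
        if at_sentence_start and tok[:1].isupper():
            lower = tok[0].lower() + tok[1:]
            # decapitalize only if the lexicon contains the lower-cased form
            # but not the capitalized one
            if lower in LEXICON and tok not in LEXICON:
                tok = lower
        out.append(tok)
        at_sentence_start = tok in SENTENCE_DELIMS
    return out

print(true_case(["Český", "prezident", "navštívil", "Prahu", ".", "Česká", "republika"]))
# ['český', 'prezident', 'navštívil', 'Prahu', '.', 'Česká', 'republika']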

6 Duplicate detection

Because of the partial overlap of the data sources (see Sect. 4) and the widespread practice of republishing press agency material almost unchanged, the database can be expected to contain a substantial number of duplicate documents. There is also a second set of "partial duplicates" resulting from extensive citations of other documents or from merging several existing articles into a new one. The detection (and consequent removal) of duplicates is important because a language model created from text including duplicates may prefer the duplicated phrases and sentences instead of correctly modeling the language that is being used in the particular domain.

Our duplicate detection algorithm is based on the shingling method introduced by Broder et al. (1997) and allows us to detect both types of duplicates outlined above. The algorithm first converts each article into a shingle set representation, which is composed of a set of overlapping token bigrams (Footnote 4). Then a metric rating the similarity of two shingle sets A and B is evaluated. The simplest similarity metric can be defined as the ratio of the number of shingles present in both shingle sets to the number of shingles in the union of the two sets:

$$ S_1(A, B) = \frac{|A \cap B|}{|A \cup B|} $$
(1)

The main disadvantage of this metric (also known as the Jaccard index) is its poor performance in cases where the length of A is very different from the length of B even though \(A \subset B\). The solution is to introduce a containment metric:

$$ S_2(A, B) = \frac{|A \cap B|}{|A|} $$
(2)

Unfortunately, the \(S_2\) metric is asymmetric (\(S_2(A, B) \neq S_2(B, A)\)). Therefore we use the symmetrized maximum containment metric defined by Malkin and Venkatesan (2005):

$$ S_3(A,B) = \frac{|A \cap B|}{\min\{|A|,|B|\}}. $$
(3)

This metric allows us to compare shingle sets with substantially different numbers of elements. The value of \(S_3\) lies in the interval [0, 1], where the value 0 means absolutely different shingle sets and the value 1 corresponds to the cases where \(A \subseteq B\) or \(B \subseteq A\).

This definition of the duplicity metric allows us to define a duplicity relation. We say that an article (more precisely, a shingle set) A is a duplicate of an original article B if \(S_3(A, B) \geq t_s\) and \(S_3(A, B) = S_2(A, B)\) (or, in other words, |A| < |B|). Currently we are using \(t_s = 0.5\). In other words, the shingle set A is a duplicate of B if half or more of the shingles from A are present in the shingle set B and the number of shingles in A is lower than the number of shingles in B. For the very rare case |A| = |B| we define that the newer article (according to the date of publication) is the duplicate and the older one is the original.
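The decision rule can be captured in a few lines; the following sketch is only a compact restatement of Eq. 3 and the duplicity relation with \(t_s = 0.5\), not the production implementation (which compares articles within the two-week window described below).

def shingles(tokens):
    return set(zip(tokens, tokens[1:]))              # overlapping token bigrams

def max_containment(a, b):                           # Eq. 3
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def is_duplicate_of(a_tokens, b_tokens, a_newer_than_b=True, threshold=0.5):
    a, b = shingles(a_tokens), shingles(b_tokens)
    if max_containment(a, b) < threshold:
        return False
    if len(a) != len(b):
        return len(a) < len(b)                       # the smaller article is the duplicate
    return a_newer_than_b                            # |A| = |B|: tie broken by publication date

original = "vláda schválila státní rozpočet na příští rok".split()
excerpt = "vláda schválila státní rozpočet".split()
print(is_duplicate_of(excerpt, original))            # True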

We can assume that duplicates occur within a short time window, so we detect duplicates only in the set of articles published within a window of two weeks. In the current setup, the detection is performed every day; each run processes up to 10k articles and takes approximately 7 min. This suggests that the number of evaluations of Eq. 3 is kept at an acceptable level even though the articles are compared pairwise.

7 Topic identification

As mentioned before, the main purpose of our topic identification module is to filter the huge amount of data according to topic for future use as LM training data. We decided that more than one topic (keyword) should be assigned to each article in our database and that the topics should form some sort of hierarchical system—a topic tree.

7.1 Topic tree

When we started to design our topic identification module, we searched for some kind of existing topic hierarchy, but we did not find any hierarchical system suitable for our needs. Consequently, we have built our own topic hierarchy in the form of a topic tree, based on our expert findings about the topic and keyword distribution in the articles, predominantly in the CNO data and to a smaller extent in the IDS source.

At present the topic tree has 32 main topic categories such as health, culture or sports; each of these main categories has its subcategories, with the "smallest" topics represented as leaves of the tree. An example of a branch representing the topic category justice and courts can be seen in Fig. 3.

Fig. 3 Branch of the topic tree representing the topic justice and courts (translated from the original Czech version)

In the current system, we use the topic tree with about 450 topics and topic categories, which correspond to the keywords assigned to the articles on the mentioned news servers. The articles with these “originally” assigned topics are used as training data for our identification algorithms.

7.2 Topic identification algorithms

Two methods for automatic topic identification have been implemented so far: a classification based on the TF-IDF vector space model and a language-modeling-based classification. These methods were selected because of their good performance in our previous information retrieval experiments (Kanis and Skorkovská 2010), since we had no prior experience with the topic identification task.

7.2.1 Classification based on language modeling (LM)

The language modeling based approach chosen for the first experiments is similar to the Naive Bayes classifier (Manning et al. 2008), where the probability P(T|A) of an article A belonging to a class (topic in our case) T is computed as

$$ P(T|A)\propto P(T)\prod\limits_{t\in A}P(t|T) $$
(4)

where P(T) is the prior probability of a topic T and P(t|T) is a conditional probability of a term t given the topic T. This probability can be estimated by the maximum likelihood estimate simply as the relative frequency of the term t in the training articles belonging to the topic T:

$$ \hat{P}(t|T) = \frac{{tf}_{t,T}}{N_{T}} $$
(5)

where \(tf_{t,T}\) is the frequency of the term t in T and \(N_T\) is the total number of tokens in the articles of the topic T.

The goal of this language modeling based approach is to find the most likely topic(s) of an article A, i.e. the ones with the maximum a posteriori probability:

$$ T_{map} = \hbox{arg max}_{T} \hat{P}(T|A) = \hbox{arg max}_{T} \hat{P}(T)\prod\limits_{t\in A}\hat{P}(t|T) . $$
(6)

The prior probability of the topic \(\hat{P}(T)\) was implemented as the relative frequency of the articles belonging to the topic in the training set, but we found out that using a uniform prior \(\hat{P}(T)\) provides comparable identification results.
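A toy restatement of this classifier (Eqs. 4–6), computed in log space with the uniform prior, is sketched below. Add-one smoothing is used here purely to avoid zero probabilities for unseen terms; the description above specifies only the maximum likelihood estimate.

import math
from collections import Counter

def train_lm_classifier(labeled_articles):
    """labeled_articles: iterable of (tokens, topic) pairs."""
    counts, totals, vocab = {}, Counter(), set()
    for tokens, topic in labeled_articles:
        counts.setdefault(topic, Counter()).update(tokens)
        totals[topic] += len(tokens)
        vocab.update(tokens)
    return counts, totals, vocab

def rank_topics_lm(tokens, counts, totals, vocab, n_best=3):
    v = len(vocab)
    scores = {}
    for topic, tf in counts.items():
        # the uniform prior P(T) contributes only a constant and is omitted
        scores[topic] = sum(math.log((tf[t] + 1) / (totals[topic] + v)) for t in tokens)
    return sorted(scores, key=scores.get, reverse=True)[:n_best]

data = [("soud potrestal obžalovaného".split(), "courts"),
        ("vláda schválila rozpočet".split(), "politics")]
counts, totals, vocab = train_lm_classifier(data)
print(rank_topics_lm("soud zprostil obžalovaného".split(), counts, totals, vocab, n_best=1))
# ['courts']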

7.2.2 Vector space model (VSM) classification

The second tested algorithm is the classification based on the TF-IDF vector space model. For each term t in the topic T, the term frequency \(tf_{t,T}\) and the inverse document frequency \(idf_t\) are computed:

$$ idf_{t} = \hbox{log}\frac{N}{N_{t}} $$
(7)

where N is the total number of topics and \(N_t\) is the number of topics containing the term t. The similarity of an article A and a topic T is then computed as:

$$ sim(A,T) = \sum\limits_{t\in A}\hbox{tf}_{t,T}\cdot idf_{t} . $$
(8)

The topics with the highest similarity are then assigned to the tested article.
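The corresponding sketch for the TF-IDF classifier (Eqs. 7–8) follows the same pattern; note that, as described above, the idf statistic is computed over topics rather than over individual documents.

import math
from collections import Counter

def train_tfidf_classifier(labeled_articles):
    tf = {}                                          # topic -> term frequencies
    for tokens, topic in labeled_articles:
        tf.setdefault(topic, Counter()).update(tokens)
    n_topics = len(tf)
    topic_count = Counter()                          # in how many topics each term occurs
    for topic_tf in tf.values():
        topic_count.update(topic_tf.keys())
    idf = {t: math.log(n_topics / n) for t, n in topic_count.items()}   # Eq. 7
    return tf, idf

def rank_topics_tfidf(tokens, tf, idf, n_best=3):
    sims = {topic: sum(topic_tf[t] * idf.get(t, 0.0) for t in tokens)   # Eq. 8
            for topic, topic_tf in tf.items()}
    return sorted(sims, key=sims.get, reverse=True)[:n_best]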

8 Corpus analysis and evaluation

8.1 Corpus statistics

This section offers some basic statistics and a consistency analysis of the gathered corpus, computed after the elimination of duplicate articles.

The numbers of articles and of tokens (without punctuation marks) in particular years of publication are shown in Fig. 4. The bars denoted N/A represent all articles for which the publication date was not available. Our corpus currently contains more than 3 million articles with over 1 billion tokens in total.

Fig. 4 Data amounts by year of publication

The numbers of articles and of tokens divided according to the data source are shown in Table 1, along with the duplicate ratio for individual data sources (that is, the percentage of articles that were detected as duplicates and had to be removed from the "raw" data set). It is evident that the ANP source has by far the largest volume of data, which is due to the fact that it contains many "subsources". On the other hand, the OVM source contains only 4 million tokens, but since these data consist of transcripts of a live discussion show (while the rest of the data are predominantly newspaper articles), they are very valuable for modeling spontaneous speech.

For the comparison of different data sources and for the evaluation of the consistency of a particular data source we used the standard Spearman rank correlation coefficient computed over the ranks of the 500 most frequent words (Spoustová et al. 2010; Kilgarriff 2001). It can be seen in Table 2 that all data sources are very consistent (the value on the diagonal is always higher than 0.9). Another interesting observation is the high similarity between the ANP, IDS and LID sources. It is most probably caused by the similar newspaper style used in all those data.

Table 2 Consistency and similarity coefficients

The spontaneous speech contained in the OVM data is very different from each of the other sources (the coefficient is always lower than 0.67), which indirectly confirms its importance for building language models usable for automatic transcription of spontaneous speech. That is, the language style of spontaneous speech indeed differs substantially from the style used in newspaper articles. This finding is further corroborated by the similarity coefficients of the IVY and SUB data sources. They are "most similar" to each other (because they are both transcriptions of spoken content), yet IVY shows a much higher similarity to the other sources than SUB. The reason is that a substantial part of the IVY data are transcripts of TV news, which cover essentially the same topics as the newspaper articles contained in other data sources, whereas the SUB data consist almost entirely of movie and TV show subtitles.
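A simplified version of this comparison is sketched below: the 500 most frequent words of one source are looked up in the frequency-ranked word lists of both sources and the Spearman coefficient of the two rank sequences is computed (scipy is assumed to be available; the exact protocol of Kilgarriff 2001 differs in some details, e.g., in how the common word list is chosen).

from collections import Counter
from scipy.stats import spearmanr

def rank_map(tokens):
    # word -> rank position in the frequency-ordered word list
    return {w: r for r, (w, _) in enumerate(Counter(tokens).most_common(), start=1)}

def corpus_similarity(tokens_a, tokens_b, top_n=500):
    ranks_a, ranks_b = rank_map(tokens_a), rank_map(tokens_b)
    words = [w for w, _ in Counter(tokens_a).most_common(top_n) if w in ranks_b]
    rho, _ = spearmanr([ranks_a[w] for w in words], [ranks_b[w] for w in words])
    return rho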

Figure 5 depicts the dependence of the size of the vocabulary on the number of tokens collected in the corpus. The figure also shows the dependence of the size of a pruned vocabulary, which contains only words occurring more than 5, 10, 100 and 1,000 times, respectively, on the number of tokens. It is evident that the curve representing the vocabulary size does not seem to approach a saturation level, i.e., new words keep occurring regardless of the size of the collected data. The straight dashed line represents a linear regression of the curve between 900M and 1.26B tokens, and the extrapolation shows that by adding 1 million tokens of text to the corpus the vocabulary grows by about 1,850 words. This fact is in line with the phenomenon illustrated by Fig. 1—especially new proper nouns make their way into the news (and out again) with almost every event, affair or disaster.

Fig. 5 Dependency of the vocabulary size (pruned to different minimum numbers of occurrences) on the size of the corpus

8.2 Evaluation of topic detection algorithms

For the evaluation of the individual topic identification methods, a smaller collection of articles from the CNO source was put aside. The articles from ČeskéNoviny.cz include keywords that were assigned by their authors (3.5 keywords per article on average). Since the keyword assignment seemed to be rather consistent across authors and articles, we have decided to declare these metadata the "gold-standard" training and reference topics. This collection contains 158,000 articles; 140,000 of them were used as topic training data and the remaining 18,000 are available for evaluation testing.

Two types of evaluation were performed on the test collection. The first one takes the point of view of information retrieval (IR), where each newly downloaded article is considered a query and precision (P), recall (R) and the \(F_1\)-measure are computed for the answer topic set:

$$ P = \frac{T_{C}}{T_{A}},\quad R = \frac{T_{C}}{T_{R}}, \quad F_{1} = 2 \frac{P\cdot R}{P + R} $$
(9)

where \(T_A\) is the number of topics assigned to the article, \(T_C\) is the number of correctly assigned topics and \(T_R\) is the number of relevant reference topics. An average of these measures is then computed across a set of testing articles.

The second type of evaluation takes the point of view of a topic classifier, where P, R and \(F_1\) are computed for each topic separately. Two ways of computing the average measures can be applied in this case: microaveraging (topics count proportionally to the size of the topic article set):

$$ P_{micro} = \frac{\sum\nolimits_{T}{T_{C}}}{\sum\nolimits_{T}{T_{A}}}, \quad R_{micro} = \frac{\sum\nolimits_{T}{T_{C}}}{\sum\nolimits_{T}{T_{R}}} $$
(10)

and macroaveraging (all topics count the same):

$$ P_{macro} = \frac{\sum\nolimits_{T}{P_{T}}}{|T|}, \quad R_{macro} = \frac{\sum\nolimits_{T}{R_{T}}}{|T|} $$
(11)

In this case \(T_A\) refers to the number of articles assigned to a topic, \(T_C\) is the number of articles correctly assigned to the topic (i.e., the "true positives"), \(T_R\) is the true number of articles with the topic and |T| is the total number of topics. The macroaverage measures are more important in our case, because we want our classifier to perform well on infrequent topics, too.
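For completeness, the two averaging schemes (Eqs. 10–11) can be computed from per-topic counts as follows; the numbers in the final line are invented for illustration.

def micro_macro(per_topic):
    """per_topic: dict topic -> (assigned, correct, reference) article counts."""
    sum_a = sum(a for a, _, _ in per_topic.values())
    sum_c = sum(c for _, c, _ in per_topic.values())
    sum_r = sum(r for _, _, r in per_topic.values())
    p_micro = sum_c / sum_a if sum_a else 0.0
    r_micro = sum_c / sum_r if sum_r else 0.0
    p_macro = sum((c / a if a else 0.0) for a, c, _ in per_topic.values()) / len(per_topic)
    r_macro = sum((c / r if r else 0.0) for _, c, r in per_topic.values()) / len(per_topic)
    return p_micro, r_micro, p_macro, r_macro

print(micro_macro({"hockey": (90, 80, 100), "justice": (10, 2, 5)}))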

First, we wanted to find out the best number of topics to assign to each article. The relation between the number of topics and the P, R and \(F_1\) measures from the IR point of view is shown in Fig. 6; it can be seen that the best results are obtained for 3 assigned topics. The P, R and \(F_1\) measures obtained for the test set of 18,000 articles and 3 assigned topics are shown in Table 3.

Fig. 6 Dependency of P, R and \(F_1\) on the number of assigned topics

Table 3 Average P, R and \(F_1\) measures of topic identification results for the set of 18,000 articles

The language modeling approach seems to achieve better results than the vector space model, especially for topics with a small article set, as can be seen from the macroaverage R and \(F_1\) measures.

It may seem at first glance that the results are not particularly good, but it must be taken into consideration that we have a very large set of topics that are in many cases not well distinguished. Also, the articles in the test collection are taken as they were on the news server; the original reference topics were not revised in any way, so in many cases a topic we assign to the article is also "correct", yet it is not included in the reference set of topics. For example, an article about the achievements of the national hockey team has only hockey among its reference topics, but our topic identification module assigned the topics hockey and representation, which is correct as well.

Finally, Fig. 7 shows the number of articles that fall within particular first-tier branches of our topic tree (that is, articles that are tagged with any of the keywords belonging to the subtree with the given headword).

Fig. 7 Histogram of the first-tier topics

8.3 Language modeling and ASR experiments

The main motivation for the development of the automatic topic identification method introduced in Sect. 7 was that we wanted to be able to effectively retrieve large amounts of domain-specific data for language model training. In this section, we therefore present several experiments with language models estimated on text corpora that were filtered from the large database of newspaper articles using various selection criteria. Since the ultimate measure of language model quality is the performance of the system where the LM is employed (in this case the ASR decoder), we also describe the speech recognition system that we have used and report the relevant Word Error Rates (WER).

All language model perplexities (PPL) and WERs in the first set of experiments were evaluated on a test set consisting of speech obtained during the testing phase of the automatic closed-captioning system that employs the so-called "shadow-speaker" approach (Pražák et al. 2011). It means that the potentially noisy and/or overlapping broadcast speech is respoken by a trained speaker in controlled acoustic conditions in order to ensure higher recognition accuracy. The evaluation set contains recordings from just a single female speaker. The total length of the first test set audio is 98 min.

Since the speaker whose utterances are in the test set has in fact recorded over 25 h of data in total, we were able to tailor the acoustic models of the ASR system to this particular speaker (see Vaněk and Psutka 2010 and Zajíc et al. 2010 for details). This gives us a very high-quality acoustic model and consequently, we can safely assume that any ASR performance gain from an improved language model would be even more prominent in cases where the acoustic model is less effective. All the language models described in the following paragraphs are trigram LMs estimated using the SRI Language Modeling Toolkit (SRILM) by Stolcke (2002), employing the default Good-Turing discounting method. The resulting models always contain all the lexicon word bigrams that are found in the training data; trigrams must occur at least twice to be included in the model.

The first test set consists of samples of the dialogues that took place during the political talk show ("Otázky Václava Moravce"—cf. the OVM data) broadcast by Czech Television on July 18th, 2010. The first line of Table 4 thus shows results for the baseline model estimated from all articles published between 1998 (the year of the oldest articles contained in our database) and July 17th, 2010. Note that the model is very large, with a vocabulary exceeding one million words.

Table 4 Properties of language models trained using different data selections—OVM test set

The show from which our test set was taken discussed mainly the newly appointed Czech government, the state budget and also health care issues. The appropriate keywords from the first tier of the tree would then be politics and diplomacy (which we will denote as politics from now on), economy and health (Footnote 5). The following three lines of the table thus present the results for the language models that were estimated from filtered sets of articles from the baseline corpus—only the articles labeled with any keyword from the subtree with the headword politics; politics and economy; and politics, economy and health, respectively, were included in the selection.

With the topic-filtered models, we observed a moderate drop in the recognition performance (2–5 % relative) but a significant reduction of the language model size (49–65 %), a factor that might not be that important in a laboratory setting but plays a crucial role when designing a system that is supposed to work, for example, on a portable device with limited hardware resources.

Analyzing the results on the OVM test set, we have also come up with the hypothesis that such a discussion show for a general audience does not really contain enough of a domain-specific sublanguage to benefit from topic-specific models. In order to test this hypothesis, we have constructed a second test set containing three broadcasts of commented tennis matches from the 2012 Summer Olympics (total length 2:15 h). The results are summarized in Table 5. This time the general language model is even bigger than in the case of the OVM data, because the first of the matches took place on July 28, 2012, and we have included all the articles up to the day before this match. It can be seen that the ASR performance is in general much worse—this is due to the fact that we have used acoustic models trained for a completely different acoustic channel and also because of the noisy background present in the audio (this time there was no re-speaking—the speech was taken directly from the audio broadcast from the event venue). However, a performance gain was observed when restricting the data first to the sports articles (first-tier topic) and then specifically to articles labeled with tennis. Note that the tennis language model is almost a hundred times smaller than the general one.

Table 5 Properties of language models trained using different data selections—tennis test set

Since the adverse acoustic conditions and the mismatched acoustic channel in the tennis test set rendered the results unconvincing, we have prepared yet another test set. It consists of a news article reporting on the situation after the strike of hurricane Sandy. This article, containing 774 words, was read by 4 speakers and recorded using a high-quality headset. The language models were again prepared first using all the collected data up to the date of the article publication (line 1 of Table 6). Then we have estimated 3 topic-specific models using the broader first-tier topics (lines 2–4) and another 4 using more specific second-tier topics (lines 5–8). The relation between the individual topics within the topic tree is shown in Fig. 8.

Table 6 Properties of language models trained using different data selections—“Sandy” test set
Fig. 8 Section of the topic tree showing the first-tier and second-tier topics used for the "Sandy" test set

The results presented in Table 6 show that none of the topic-specific language models alone outperformed the model built from all the data. We suspect that the reason could be that the selected article in fact covers several topics spread across the topic tree. We have therefore decided to interpolate all the models listed in the table (including the one built from the entire dataset). The interpolation coefficients were determined in an unsupervised manner—first, the test set was recognized using the general model built from all the data, then the recognition output was used as a "development set" for tuning the interpolation parameters with the compute-best-mix tool from SRILM (Stolcke 2002) and finally the interpolated model was employed to re-recognize the test data. That way we have managed to reduce the WER by 6.6 % relative.
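The weight-tuning step can be illustrated by the following simplified stand-in for compute-best-mix: given the probability that each component model assigns to every word of the development text, a few EM iterations find the mixture weights that maximize the likelihood of that text under the interpolated model.

def tune_mixture_weights(per_word_probs, iterations=50):
    # per_word_probs: one row per development-set word, each row holding that
    # word's probability under every component language model
    n_models = len(per_word_probs[0])
    weights = [1.0 / n_models] * n_models
    for _ in range(iterations):
        posts = [0.0] * n_models
        for row in per_word_probs:
            mix = sum(w * p for w, p in zip(weights, row))
            for i, p in enumerate(row):
                posts[i] += weights[i] * p / mix            # E-step: model responsibilities
        weights = [c / len(per_word_probs) for c in posts]  # M-step: re-normalize
    return weights

# toy example: the first model fits the dev text better and receives the larger weight
print(tune_mixture_weights([[0.02, 0.001], [0.03, 0.002], [0.001, 0.001]]))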

9 Discussion and work in progress

We have presented a flexible and scalable framework for downloading, processing and storing large amounts of electronic text data that can then be used for various purposes related to natural language and speech processing. The main goal of our work was to develop a system that would allow us to automatically create different subcorpora, mainly using time- and topic-specific filters—as our results show, this goal was successfully met.

The framework is currently routinely used within our research team for building suitable text corpora that are used in speech recognition systems prepared for several domains. The layout and algorithms used in the framework are generally language independent; however, there are several modules that make use of the Czech-specific resources, such as the vocabularies for token substitution and true casing and also the topic tree. Those need to be replaced when porting the system to a different language environment.

However, there are still some functionalities that we have found worth implementing and that are currently the subject of evaluation experiments.

The first of them is a general algorithm for cleaning the content downloaded from an arbitrary Web data source. The current version of the tool employs a set of rule-based cleaning algorithms, each of them tailored to a specific data source (see Sect. 5.1). This approach becomes impractical with the growing number of sources. The devised general algorithm first removes from the downloaded item all the chunks that can be detected as not being a part of the article text on the basis of the HTML tags (e.g., hyperlinks, lists, tables, pictures, etc.). Then the resulting text is split into paragraphs that are subsequently classified as belonging or not belonging to the article text using a k-means classifier. The features of the classifier include, for example, the number of OOV words in the given paragraph, the number of hyperlinks, punctuation marks, etc.
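A rough sketch of this scheme is given below using scikit-learn's k-means with two clusters; the feature set is reduced to three of the features mentioned above and the "article body" cluster is simply taken to be the one with the lower average number of hyperlinks. This is only an illustration of the idea, not the algorithm under evaluation.

import numpy as np
from sklearn.cluster import KMeans

def classify_paragraphs(paragraphs, vocabulary):
    """paragraphs: list of (paragraph text, hyperlink count) pairs."""
    feats = []
    for text, n_links in paragraphs:
        tokens = text.split()
        oov = sum(1 for t in tokens if t.lower() not in vocabulary)
        punct = sum(text.count(c) for c in ".,!?;:")
        feats.append([oov / max(len(tokens), 1), n_links, punct / max(len(tokens), 1)])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.array(feats))
    # the cluster with fewer hyperlinks on average is taken as the article body
    link_means = [np.mean([f[1] for f, l in zip(feats, labels) if l == k]) for k in (0, 1)]
    body = int(np.argmin(link_means))
    return [text for (text, _), l in zip(paragraphs, labels) if l == body]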

The second investigated issue is related to detecting salient new words in the downloaded content. It is impossible to add all the words from the corpus to the lexicon of the ASR decoder due to the limitations stemming from the computational demands of the decoding process (Footnote 6), yet it is highly desirable to add those new words that are likely to be used frequently in the upcoming period (let us refer once again to Fig. 1). We have designed a semi-automatic procedure for selecting candidate words for the vocabulary extension. First, all the new words are filtered based on their capitalization (as capitalized words are more likely to be salient) and their frequency distribution across the timeline. The preselected list is then presented to a human annotator who makes the final decision about adding the words to the ASR lexicon.
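The preselection step can be imagined along the following lines: an out-of-vocabulary word is proposed to the annotator if it is almost always written capitalized (a likely proper name) and its occurrences are concentrated in the most recent months (a bursty, "Concordia"-like profile). All thresholds here are invented for illustration.

def preselect_candidates(monthly_occurrences, capitalized_ratio,
                         recent_months=2, min_recent=20, burst_factor=5.0):
    # monthly_occurrences: word -> list of monthly counts (oldest first)
    # capitalized_ratio:   word -> fraction of occurrences written with a capital letter
    candidates = []
    for word, counts in monthly_occurrences.items():
        recent = sum(counts[-recent_months:])
        earlier = sum(counts[:-recent_months]) or 1
        if (capitalized_ratio.get(word, 0.0) > 0.8
                and recent >= min_recent
                and recent / earlier >= burst_factor):
            candidates.append((word, recent))
    return [w for w, _ in sorted(candidates, key=lambda x: -x[1])]

print(preselect_candidates({"Concordia": [1, 0, 2, 150, 90]}, {"Concordia": 0.97}))
# ['Concordia']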

Finally, the framework is being adapted for handling audiovisual data, with a special focus on data that come with text subtitles capturing the content of the audio track. This extension will allow us to also store material that could later be used for training acoustic models or potentially also processed with algorithms dealing with the visual component.