1 Introduction

Hindi is an official language of India. According to https://www.Vistawide.com, Hindi is the fourth most natively spoken language in the world, after Mandarin, Spanish and English, and it is used by about 400 million people. The number goes beyond 400 million if dialects of Hindi and other languages that share its Devanagari script, such as Marathi and Sanskrit, are also considered. Most government resolutions, documents, historical records, etc. are available in English, which may not be understood by villagers in India. This raises the need to develop an efficient automatic translation system from English to Hindi. Machine translation is a technique for translating text in a source natural language into a target natural language. A language model is an essential component of a statistical machine translation (SMT) system. A language model estimates the likelihood of word sequences in a natural language, i.e. the tendency of words to follow other words. It is also used to predict which word is likely to come next, given the words that have already appeared in the sentence [1]. The language model stores the probability of each possible upcoming word for a given context of words in the training data; for a given sequence of words in the test data, it predicts the word with the highest probability as the next possible word. It can also assign a probability value to an entire sentence. If \({\text{W}}\, = \,w_{1} w_{2} w_{3} \ldots w_{n}\) is a sentence in a natural language, then the language model estimates the probability \({\text{p(W)}}\) of the whole sentence using the chain rule of probability and the Markov assumption. Today, language models are used by Google in almost all online services for sentence completion while a user is typing text. A language model restricts the search space and guides the selection of the next-word hypothesis to complete the sentence. Apart from words, language models can also be constructed over sequences of characters, signs or symbols.
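
As a standard illustration (not specific to this work) of the chain rule and Markov assumption mentioned above, a bigram language model factorizes the sentence probability as:

$$p(W) = \prod_{i = 1}^{n} p(w_{i} \mid w_{1} \ldots w_{i - 1}) \approx \prod_{i = 1}^{n} p(w_{i} \mid w_{i - 1})$$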

This research work shows that the selection of the next word in a valid Hindi sentence does not depend merely on the preceding words. Going beyond the n-gram technique, which uses only the surface forms of words, the experiments demonstrate that linguistic knowledge about the previous words, referred to as factors, also plays a crucial role in predicting upcoming words. Since Hindi is a morphologically very rich language, it provides many such factors. The significance of this research work lies in two objectives. The first objective is to find the effective factors of Hindi words by using the factored language modeling (FLM) technique. The experiments confirm that FLM significantly improves prediction capability and offers a substantial reduction in perplexity compared with standard n-gram language models. Generalized back-off in FLM explores many options and dynamically selects an efficient lower-order model when not enough contextual history is available for predicting the upcoming word. This gives rise to the second objective of the research: to find an effective back-off strategy in FLM. Thus, the significance of this research work is that it determines the effect of various combinations of linguistic features and generalized back-off strategies on the word prediction capability of a language model for Hindi. When translating from another source language into Hindi, the order of words changes because of the structural differences between the source language and Hindi. Hence, it becomes essential to find the sequential order in which the correctly translated words must be placed. Here the language model plays its role: it statistically determines the correct order of words and generates a fluent translated output sentence. The results of this study will help the developer of a machine translation system to use FLM for Hindi, instead of the baseline n-gram model, to obtain better translation results. The experimental results show how the factors of previous words affect the selection of the next word and which groups of factors are effective for Hindi and may improve translation quality. They will also encourage developers to consider other factors for better translation performance. Finally, the experiments on generalized back-off carried out in this study will help the developer of a machine translation system to select an effective back-off strategy for Hindi.

2 Language modeling techniques

(A) Standard n-gram language modelling

The leading method of language modelling is the n-gram model, where n indicates the number of words used; n also indicates the degree (order) of the language model. n-gram language modelling is a technique that captures the probability of words following each other. Ideally, the word appearing at the current position in a sentence may depend on any word appearing at previous positions. However, the Markov assumption simplifies the conditional dependency of the next word and suggests utilizing only a limited backward context of words. Thus, the standard n-gram language model uses only the previous n − 1 words, called the contextual history h, to predict the nth word. Mathematically, the probability of the next word \(w_{n}\) coming after the word sequence \(w_{1} w_{2} w_{3} \ldots w_{n - 2} w_{n - 1}\) in the sentence is given as:

$${\text{p(w}}_{\text{n}} |h )= {\text{p}}({\text{w}}_{\text{n}} |{\text{w}}_{1} {\text{w}}_{2} {\text{w}}_{3} \ldots {\text{w}}_{{{\text{n}} - 2}} {\text{w}}_{{{\text{n}} - 1}} )$$
(1)

For example, if the contextual history is अरुण ने खत, then the probability of the next word being लिखा can be represented by using conditional probability as:

$$p(\text{लिखा} \mid \text{अरुण ने खत})$$
(2)

One way to calculate this probability is by using relative frequency counts. First, the number of times अरुण ने खत appears in the corpus is obtained, and then the number of times it is followed by the word लिखा is obtained. Using the maximum likelihood estimation (MLE) technique, this is written as:

$$p(\text{लिखा} \mid \text{अरुण ने खत}) = \frac{C(\text{अरुण ने खत लिखा})}{C(\text{अरुण ने खत})}$$
(3)

Thus, given a sufficiently large corpus, the probability distribution over the sequences of n words present in the corpus can be obtained using the MLE technique. In generalised form, the MLE of the nth word is represented as:

$${\text{p(w}}_{\text{n}} |{\text{w}}_{1} , \ldots ,{\text{w}}_{{{\text{n}} - 2}} ,{\text{w}}_{{{\text{n}} - 1}} ) = \frac{{{\text{C}}({\text{w}}_{1} , \ldots ,{\text{w}}_{{{\text{n}} - 2}} ,{\text{w}}_{{{\text{n}} - 1}} ,{\text{w}}_{\text{n}} )}}{{{\text{C}}({\text{w}}_{1} , \ldots ,{\text{w}}_{{{\text{n}} - 2}} ,{\text{w}}_{{{\text{n}} - 1}} )}}$$
(4)

In Eq. 4, the n-gram probability is estimated as the ratio of the observed frequency count of a particular sequence of n words to the observed frequency count of its prefix of n − 1 words.
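
To make Eq. 4 concrete, the following minimal Python sketch (illustrative only; not part of the original experiments) estimates trigram MLE probabilities from raw counts:

```python
from collections import Counter

def mle_trigram_model(sentences):
    """Estimate p(w_n | w_{n-2}, w_{n-1}) by relative frequency, as in Eq. 4."""
    trigram_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            trigram_counts[history + (padded[i],)] += 1
            bigram_counts[history] += 1

    def prob(word, history):
        # history is the pair of preceding words, word is the candidate next word
        denom = bigram_counts[history]
        return trigram_counts[history + (word,)] / denom if denom else 0.0

    return prob

# Toy usage: probability of लिखा after the context (ने, खत)
p = mle_trigram_model([["अरुण", "ने", "खत", "लिखा"]])
print(p("लिखा", ("ने", "खत")))   # 1.0 on this one-sentence toy corpus
```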

(B) Factored language model (FLM)

In any natural language, a valid sentence is formed by placing the words in a certain order, following the grammatical rules of the language. The upcoming word in a sentence depends not only on the preceding words but also on their grammatical categories. The part-of-speech, number, gender, etc. of the preceding words play a vital role in the selection of the next word in the sentence. For instance, in Hindi, ‘लता गाना गाती है’ is more correct than ‘लता गाना गाता है’ from the grammatical point of view, as the gender of the noun ‘लता’ is feminine. This is because, unlike in English, the verb in Hindi changes according to the gender of the subject of the sentence.

The n-gram language model uses only the preceding words, whereas FLM also utilizes grammatical and linguistic knowledge about the preceding words to predict the next word. In FLM, each word is seen as a set of k factors. Hence, the tth word \({\text{w}}_{\text{t}}\) in the sentence is written as \({\text{w}}_{\text{t}} = \, \{ {\text{f}}_{{{\text{t}}1}} ,{\text{f}}_{{{\text{t}}2}} , \ldots ,{\text{f}}_{\text{tk}} \}\), where \(\{ {\text{f}}_{{{\text{t}}1}} ,{\text{f}}_{{{\text{t}}2}} , \ldots ,{\text{f}}_{\text{tk}} \}\) is the set of factors associated with the word. The word itself is considered one of the factors, and the other factors may be any grammatical information about the word, such as its stem, part-of-speech tag, morphological class or any other linguistic information. For highly inflected languages this is especially useful, as many factors can be obtained for each word. The factored representation is applicable to any language, since whatever semantic features are available in that language can be used as factors. Thus, FLM calculates the probability distribution of a sequence of words and its associated factors, as shown in Eq. 5.

$${\text{p}}({\text{f}}_{{1:{\text{t}}}}^{{1:{\text{k}}}} )$$
(5)

where \(f_{1:t}^{1:k}\) denotes the factors 1 to k of the words 1 to t. The probability of any factor of the tth word can be obtained by conditioning on all the factors in its contextual history. Using the n-gram formalism, this is written as:

$${\text{p}}({\text{f}}_{\text{t}}^{{1:{\text{k}}}} | {\text{f}}_{{{\text{t}} - {\text{n}} + 1 :{\text{t}} - 1}}^{{1:{\text{k}}}} )$$
(6)

The probability of any particular kth factor \({\text{f}}_{\text{t}}^{\text{k}}\) of the tth word can be obtained using all the remaining factors \({\text{f}}_{\text{t}}^{{1:{\text{k}} - 1}}\) of that word and all the factors \({\text{f}}_{{{\text{t}} - {\text{n}} + 1 :{\text{t}} - 1}}^{{1:{\text{k}}}}\) of all the words in the history, as shown in Eq. 7:

$$\mathop \prod \limits_{\text{k}} {\text{p(f}}_{\text{t}}^{\text{k}} | {\text{f}}_{\text{t}}^{{1:{\text{k}} - 1}} \, \ldots \, {\text{f}}_{{{\text{t}} - {\text{n}} + 1 :{\text{t}} - 1}}^{{1:{\text{k}}}} )$$
(7)

Equation 7 shows one possible ordering of the factors under the chain rule; many other orderings exist. Beyond this, any subset of factors, in any order, can be selected for estimating the probability of the next factor. The k factors of a word can be permuted in k! ways, and since the history contains n words with k factors each, i.e. n·k factors in total, \(2^{nk}\) subsets of these factors can be formed. In this way FLM opens up on the order of \(k!\,2^{nk}\) language modelling options in addition to the standard n-gram model [4]. The standard n-gram model can be treated as the simplest case of FLM, in which only one factor, the word itself, is considered. The same holds for the class-based model, where the class is the only factor and it depends only on the previous class. Figure 1 shows a generalised representation of FLM.
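
As a concrete illustration of the factored word representation \({\text{w}}_{\text{t}} = \{ {\text{f}}_{{{\text{t}}1}} , \ldots ,{\text{f}}_{\text{tk}} \}\), the following short Python sketch (illustrative only; the factor names anticipate the list used later in this paper, and the tag values are hypothetical) bundles a word with its factors:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FactoredWord:
    """A word viewed as a bundle of factors: w_t = {f_t1, ..., f_tk}."""
    word: str      # surface form (W)
    pos: str       # part-of-speech tag (P)
    gender: str    # gender (G)
    number: str    # grammatical number (N)
    stem: str      # stem / root form (S)

# One annotated token of a Hindi sentence (tag values are illustrative)
token = FactoredWord(word="गाती", pos="क्रिया", gender="स्त्रीलिंग",
                     number="एकवचन", stem="गा")
print(token.word, token.gender)
```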

Fig. 1
figure 1

Graphical representation of factored language model

A morphologically rich language like Hindi makes many linguistic and grammatical features available for use as factors in factored language modelling. The following factors of Hindi are used in this research work:

  1. Word (शब्द)

  2. Part-of-speech (POS) (शब्द भेद)

  3. Gender (लिंग)

  4. Number (वचन)

  5. Stem of the word (मूलशब्द)

Figure 2 shows the graphical representation of the FLM for a word in Hindi. Equation 8 shows the same FLM in conditional probability form:

Fig. 2
figure 2

FLM for a word

$${\text{p(W}}_{\text{t}} | {\text{W}}_{{{\text{t}} - 1}} , {\text{G}}_{{{\text{t}} - 1}} , {\text{N}}_{{{\text{t}} - 1}} , {\text{P}}_{{{\text{t}} - 1}} , {\text{W}}_{{{\text{t}} - 2}} )$$
(8)

where \({\text{W}}_{\text{t}}\) is the word factor, \({\text{G}}_{\text{t}}\) the gender tag factor, \({\text{N}}_{\text{t}}\) the number tag factor and \({\text{P}}_{\text{t}}\) the part-of-speech tag factor. Here, the probability of the next word factor is conditioned on the immediately preceding word’s number, gender and POS tag factors along with the previous two words. In the same way, the probability of the next factor can be obtained from any useful combination of factors present in the contextual history of the word. For instance, another combination is:

$${\text{p(W}}_{\text{t}} | {\text{W}}_{{{\text{t}} - 1}} , {\text{W}}_{{{\text{t}} - 2}} , {\text{G}}_{{{\text{t}} - 1}} ,{\text{G}}_{{{\text{t}} - 2}} , {\text{P}}_{{{\text{t}} - 1}} , {\text{P}}_{{{\text{t}} - 2}} )$$
(9)

In this model, the word factor \({\text{W}}_{\text{t}}\) is the child and the factors \({\text{W}}_{{{\text{t}} - 1}} , {\text{W}}_{{{\text{t}} - 2}} , {\text{G}}_{{{\text{t}} - 1}} , {\text{G}}_{{{\text{t}} - 2}} , {\text{P}}_{{{\text{t}} - 1}}\) and \({\text{P}}_{{{\text{t}} - 2}}\) are its parents. This FLM is the same as a word-only trigram model, except that the POS tags and gender tags of the previous two words are also used as parents to predict the next word factor.
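
A minimal sketch of how the conditional distribution of Eq. 8 could be estimated by counting, reusing the hypothetical FactoredWord type from the earlier sketch, is:

```python
from collections import Counter

def estimate_eq8_model(sentences):
    """Count-based estimate of p(W_t | W_{t-1}, G_{t-1}, N_{t-1}, P_{t-1}, W_{t-2})."""
    joint = Counter()
    context = Counter()
    for tokens in sentences:                     # tokens: list of FactoredWord
        for t in range(2, len(tokens)):
            prev1, prev2 = tokens[t - 1], tokens[t - 2]
            ctx = (prev1.word, prev1.gender, prev1.number, prev1.pos, prev2.word)
            joint[ctx + (tokens[t].word,)] += 1
            context[ctx] += 1

    def prob(word, ctx):
        return joint[ctx + (word,)] / context[ctx] if context[ctx] else 0.0

    return prob
```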

3 Back-off

Back-off is a method of obtaining the probability distribution of n-grams when the higher-order n-grams do not have sufficient probability mass. In other words, when the data is not sufficient to estimate a higher-order n-gram conditional probability, only part of the table is estimated directly and the rest is constructed from a lower-order n-gram model. For example, when the trigram count for \({\text{p( W}}_{\text{t}} | {\text{W}}_{{{\text{t}} - 1}} , {\text{W}}_{{{\text{t}} - 2}} )\) is not sufficient, the model moves down to the bigram \({\text{p(}} {\text{W}}_{\text{t}} | {\text{W}}_{{{\text{t}} - 1}} )\). This process is then applied recursively down to the unigram.

The back-off model for trigram \({\text{P}}_{\text{BO}} (W_{\text{t}} | {\text{W}}_{{{\text{t}} - 1}} , {\text{W}}_{{{\text{t}} - 2}} )\) is defined as given in Eq. 10:

$${\text{P}}_{\text{BO }} ({\text{W}}_{\text{t}} | {\text{W}}_{{{\text{t}} - 1}} , {\text{W}}_{{{\text{t}} - 2}} ) = \left\{ {\begin{array}{ll} {{\text{d}}_{{{\text{N}}\left( { {\text{w}}_{\text{t }} , {\text{w}}_{{{\text{t }} - 1}} , {\text{w}}_{{{\text{t}} - 2 }} } \right) }} {\text{P}}_{\text{ML }} ( {\text{W}}_{\text{t}} | {\text{W}}_{{{\text{t}} - 1}} , {\text{W}}_{{{\text{t}} - 2}} ) } &{{\text{if }} N({\text{W}}_{\text{t}} , {\text{W}}_{{{\text{t}} - 1}} , {\text{W}}_{{{\text{t}} - 2}} ) > {{\tau }}_{3 } } \\ {\alpha ({\text{W}}_{{{\text{t}} - 1}} , {\text{W}}_{{{\text{t}} - 2}} ) {\text{P}}_{\text{BO }} ({\text{W}}_{\text{t}} | {\text{W}}_{{{\text{t}} - 1}} )} &{\text{otherwise}} \end{array} } \right.$$
(10)

Equation 10 states that the trigram distribution is used if its frequency count is greater than the threshold value \({{\tau }}_{3 }\); otherwise the model backs off to the bigram model after applying the discount function \({\text{d}}_{{{\text{N}}({\text{w}}_{\text{t}} ,{\text{w}}_{{{\text{t }} - 1}} ,{\text{w}}_{{{\text{t}} - 2 \, }} )}}\), a number between 0 and 1, to the trigram distribution. Different smoothing techniques can be used for discounting, such as Good–Turing smoothing, Kneser–Ney smoothing and natural discounting [2]. The quantity \(\alpha ({\text{w}}_{{{\text{t}} - 1}} ,{\text{w}}_{{{\text{t}} - 2}} )\) ensures that the entire distribution still sums to unity after back-off is applied.
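
The recursion of Eq. 10 can be sketched as follows (illustrative Python; a single fixed discount stands in for a proper smoothing method, the normalization term \(\alpha\) is omitted for brevity, and `counts[k]` is assumed to map k-gram tuples to frequencies):

```python
def backoff_prob(word, history, counts, tau=1, discount=0.8):
    """Recursive back-off in the spirit of Eq. 10 (alpha renormalization omitted)."""
    if not history:                                    # unigram base case
        total = sum(counts[1].values())
        return counts[1].get((word,), 0) / total if total else 0.0
    order = len(history) + 1
    ngram = tuple(history) + (word,)
    hist_count = counts[order - 1].get(tuple(history), 0)
    if counts[order].get(ngram, 0) > tau and hist_count:
        return discount * counts[order][ngram] / hist_count       # d_N * p_ML
    return backoff_prob(word, history[1:], counts, tau, discount)  # drop most distant word
```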

Figure 3 shows back-off from the 4-gram down to the unigram distribution. In this back-off model, if the 4-gram sequences do not occur in sufficient numbers, the model backs off to 3-gram sequences; if the 3-gram sequences are also insufficient, it falls back to bigram counts, and so on down to the unigram distribution.

Fig. 3
figure 3

Back-off process in a 4-gram model

3.1 Generalized back-off for FLM

In the standard n-gram language model, the back-off procedure drops the word that is most distant in the contextual history of the next word. The next most distant word is then dropped, and the procedure continues until the single neighbouring preceding word is dropped and a unigram distribution is obtained. In the graphical representation shown in Fig. 3, first the most distant parent node \({\text{W}}_{{{\text{t }} - 3}}\) is dropped, then the next most distant node \({\text{W}}_{{{\text{t }} - 2}}\), and so on until the immediate parent node \({\text{W}}_{{{\text{t }} - 1}}\) is dropped. This graphical representation of the back-off procedure is called a back-off graph. It starts from the node where all three previous words are present, steps down to the node where only the single immediately preceding word remains, and ends at the unigram node. In FLM, because many factors are available and various combinations of factors can appear in the contextual history of a word, many options are available for dropping during the back-off procedure.

For example, if the FLM for a word in Hindi is:

$${\text{p(}} {\text{W |}} {\text{W}}_{1} , {\text{N}}_{1} , {\text{P}}_{1} )$$
(11)

then all the possible back-off paths for this 4-gram FLM are as shown in Fig. 4.

Fig. 4
figure 4

Back-off graph in FLM

Each path in the back-off graph is a distinct back-off model. Hence, which back-off path should be selected is a decision problem.

Figure 5 shows only some of the back-off paths in the back-off graph of Fig. 4. In this graph, the intermediate node Wt|N1P1 is obtained by dropping the factor W1 from the root node Wt|W1N1P1. Now two options are available for dropping, namely N1 and P1. The selection of the factor to be dropped is a decision problem, and the generalised back-off method solves this decision dilemma.
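
To make the size of this decision space concrete, the following short sketch (illustrative only) enumerates every back-off path of a node with the three parents W1, N1 and P1, i.e. every order in which the parents can be dropped one at a time:

```python
from itertools import permutations

def backoff_paths(parents):
    """Enumerate back-off paths: each path drops one remaining parent at a time."""
    paths = []
    for drop_order in permutations(parents):
        remaining = list(parents)
        path = [tuple(remaining)]
        for factor in drop_order:
            remaining.remove(factor)
            path.append(tuple(remaining))
        paths.append(path)
    return paths

for path in backoff_paths(["W1", "N1", "P1"]):
    print(" -> ".join("Wt|" + "".join(node) if node else "Wt" for node in path))
# 3! = 6 distinct paths from Wt|W1N1P1 down to the word unigram node Wt
```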

Fig. 5
figure 5

Back-off path in FLM

Generalized back-off for FLM is a generalization of the standard back-off used in language modelling [3]. In generalized back-off, while backing off from a higher-order to a lower-order language model, multiple paths are selected dynamically at run time, according to a preselected threshold for a particular factor, instead of a single fixed path. There are two approaches for selecting multiple back-off paths. In the first approach, only one path is selected at a time for a given set of factors, but the path may change when a different set of factors is required; that is, as the set of factors changes, the back-off path can also change. In the second approach, for a given set of factors, multiple back-off paths are used simultaneously at run time to obtain the probability.

Probabilities in generalized back-off (\({\text{p}}_{\text{GBO}}\)) for FLM are obtained using Eq. 12 [3].

$${\text{p}}_{{{\text{GBO}}}} \left( {{\text{f|f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} } \right) = \left\{ {\begin{array}{ll} {{\text{d}}_{{{\text{N}}({\text{f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} )}} {\text{p}}_{{{\text{ML}}}} ({\text{f|f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} )} &{{\text{if}}~N({\text{f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ) > {{\tau }}_{4} } \\ {\alpha ({\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ){\text{g}}({\text{f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} )}& {\text{otherwise}} \\ \end{array} } \right.$$
(12)

where \({\text{p}}_{\text{ML}}\) is the maximum likelihood estimate obtained from the factor counts, given as:

$${\text{p}}_{\text{ML }} ({\text{f| f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ) = \frac{{{\text{N(f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} )}}{{{\text{N}}({\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3 } )}}$$

That is, the count of the given set of factors is compared with the predefined threshold value \({{\tau }}\) specified by the user. If it is greater than the threshold, a hit occurs and the discounted maximum likelihood probability is retained; otherwise back-off is applied. \({\text{g(f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} )\) is a non-negative back-off distribution function, and \(\alpha{(\text{ f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} )\) ensures that the entire distribution sums to unity. The function \({\text{g(f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} )\) can be calculated in many possible ways, such as max normalized counts, min normalized counts, geometric mean, weighted mean, average mean and their variations. Each way of calculating \({\text{g(f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} )\) results in a different back-off strategy. In the remainder of this section, these back-off strategies are presented as given in the tutorial [4].
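
The overall structure of Eq. 12, with g supplied as a pluggable strategy, can be sketched as follows (illustrative Python; the discount and the normalization term \(\alpha\) are simplified, and `counts` is assumed to map factor tuples to frequencies):

```python
def p_gbo(f, parents, counts, g, tau=1, discount=0.8):
    """Generalized back-off in the spirit of Eq. 12: keep the discounted ML
    estimate when the count clears the threshold, otherwise use the
    strategy-specific back-off function g (alpha normalization omitted)."""
    full = (f,) + tuple(parents)
    n_full = counts.get(full, 0)
    n_ctx = counts.get(tuple(parents), 0)
    if n_full > tau and n_ctx:
        return discount * n_full / n_ctx          # d_N * p_ML(f | f1, f2, f3)
    return g(f, tuple(parents), counts)           # back-off distribution g(f, f1, f2, f3)
```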

(1) Fixed back-off path

In the fixed back-off path strategy, the path is fixed for the whole process. It is defined by the user in the specification at the start and remains the same for all sets of factors at all times. The user can define any one of the following fixed back-off paths for a 4-gram model.

$$\begin{aligned}& {\text{g(f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ) = {\text{P}}_{\text{BO }} ({\text{f }}| {\text{f}}_{1} ,{\text{f}}_{2} ) \hfill \\ &{\text{or g(f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ) = {\text{P}}_{\text{BO }} ({\text{f }}| {\text{f}}_{1} ,{\text{f}}_{3} ) \hfill \\ &{\text{or g(f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ) = {\text{P}}_{\text{BO }} ({\text{f }}| {\text{f}}_{2} ,{\text{f}}_{3} ) \hfill \\ \end{aligned}$$
(13)

(2) Max counts

As the name indicates, the back-off node is selected by comparing the counts of all the possible back-off models and choosing the model with the maximum count. This makes it possible to select a different back-off path for each instance. While selecting the back-off path, this strategy favours the statistical estimation properties of the back-off models over their statistical predictability. For a 4-gram model, it may be defined as in Eq. 14:

$${\text{g(f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} )= {\text{P}}_{\text{GBO }} ({\text{f}}| {\text{f}}_{{{\text{a}}_{1} }} ,{\text{f}}_{{{\text{a}}_{2} }} )$$
(14)

where,

$$({\text{a}}_{1} ,{\text{a}}_{2} ) = \mathop {\text{argmax}}\limits_{{({\text{b}}_{1} ,{\text{b}}_{2} ) \in \left\{ {\left( {1,2} \right),\left( {1,3} \right),\left( {2,3} \right)} \right\}}} {\text{N(f}},{\text{f}}_{{{\text{b}}_{1} }} ,{\text{f}}_{{{\text{b}}_{2} }} )$$

Here, the counts of the pairs \(({\text{f}}_{1} ,{\text{f}}_{2} )\), \(({\text{f}}_{1} ,{\text{f}}_{3} )\) and \(({\text{f}}_{2} ,{\text{f}}_{3} )\) are obtained, and the pair with the maximum count is selected as the back-off path.

(3) Max normalized counts

This strategy is similar to the max counts strategy; the only difference is that normalized counts are used instead of raw counts for selecting the back-off node. The back-off node is selected by comparing the maximum likelihood estimates of all the possible back-off models and choosing the model with the maximum estimate. While selecting the back-off path, this strategy favours statistical predictability over the statistical estimation properties of the back-off models. For a 4-gram model, it may be defined as in Eq. 15:

$${\text{g}}({\text{f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ) = {\text{P}}_{\text{GBO }} ({\text{f}}_{ } | {\text{f}}_{{{\text{a}}_{1} }} ,{\text{f}}_{{{\text{a}}_{2} }} )$$
(15)

where,

$$\begin{aligned} ({\text{a}}_{1} ,{\text{a}}_{2} ) = \mathop {\text{argmax}}\limits_{{({\text{b}}_{1} ,{\text{b}}_{2} ) \in \left\{ {\left( {1,2} \right),\left( {1,3} \right),\left( {2,3} \right)} \right\}}} \frac{{{\text{N(f}},{\text{f}}_{{{\text{b}}_{1} }} , {\text{f}}_{{{\text{b}}_{2} }} )}}{{{\text{N(f}}_{{{\text{b}}_{1} }} ,{\text{f}}_{{{\text{b}}_{2} }} )}} \hfill \\ = \mathop {\text{argmax}}\limits_{{({\text{b}}_{1} ,{\text{b}}_{2} ) \in \left\{ {\left( {1,2} \right),\left( {1,3} \right),\left( {2,3} \right)} \right\}}} {\text{P}}_{\text{ML}} ({\text{f }}| {\text{f}}_{{{\text{b}}_{1} }} ,{\text{f}}_{{{\text{b}}_{2} }} ) \hfill \\ \end{aligned}$$

(4) Max num-words normalized counts

This strategy is also similar to the max counts strategy; the only difference is that the number of possible factor values observed in the training set for the present set of parent values is used for normalization. The possible factor values for the present parent context \({\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3}\) in a 4-gram model are expressed as \(\{ {\text{f}}:{\text{ N}}({\text{f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} )\, > \,0\}\). The cardinality of this set is the number of possible upcoming factors for the present context:

$${\text{g}}({\text{f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ) = {\text{P}}_{\text{GBO }} ({\text{f}}_{ } | {\text{f}}_{{{\text{a}}_{1} }} ,{\text{f}}_{{{\text{a}}_{2} }} )$$
(16)

where,

$$({\text{a}}_{1} ,{\text{a}}_{2} ) = \mathop {\text{argmax}}\limits_{{({\text{b}}_{1} ,{\text{b}}_{2} ) \in \left\{ {\left( {1,2} \right),\left( {1,3} \right),\left( {2,3} \right)} \right\}}} \frac{{{\text{N(f}}, {\text{f}}_{{{\text{b}}_{1} }} ,{\text{f}}_{{{\text{b}}_{2} }} )}}{{\left| {\{ {\text{f }}: {\text{N(f}}, {\text{f}}_{{{\text{b}}_{1} }} ,{\text{f}}_{{{\text{b}}_{2} }} ) > 0\} } \right|}}$$

(5) Max back-off-graph node probability

In this strategy, the back-off node is selected by comparing the probabilities of all the possible back-off models and choosing the model with the maximum probability. This is similar to the fixed back-off strategy, but instead of the back-off node being selected in advance, it is selected dynamically. For a 4-gram model, it may be defined as in Eq. 17:

$${\text{g}}({\text{f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ) = {\text{P}}_{\text{GBO }} ({\text{f}}_{ } | {\text{f}}_{{{\text{a}}_{1} }} ,{\text{f}}_{{{\text{a}}_{2} }} )$$
(17)

where,

$$({\text{a}}_{1} ,{\text{a}}_{2} ) = \mathop {\text{argmax}}\limits_{{({\text{b}}_{1} ,{\text{b}}_{2} ) \in \left\{ {\left( {1,2} \right),\left( {1,3} \right),\left( {2,3} \right)} \right\}}} {\text{P}}_{\text{GBO }} ({\text{f}}_{ } | {\text{f}}_{{{\text{b}}_{1} }} ,{\text{f}}_{{{\text{b}}_{2} }} )$$

(6) Max product-of-cardinality normalized counts

In this strategy, the product of the cardinalities of the underlying random variables is used for normalization. The set of possible values of a random variable determines its cardinality. The cardinality of F is represented as:

$$|{\text{F}}| \triangleq \left| {\{ {\text{f}}:{\text{N}}({\text{f}}) > 0\} } \right|$$
(18)

For a 4-gram model, it may be defined as in Eq. 19:

$${\text{g}}({\text{f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ){\text{ = P}}_{\text{GBO }} ({\text{f}}_{ } | {\text{f}}_{{{\text{a}}_{1} }} ,{\text{f}}_{{{\text{a}}_{2} }} )$$
(19)

where,

$$({\text{a}}_{1 } ,{\text{a}}_{2} ) = \mathop {\text{argmax}}\limits_{{({\text{b}}_{1} ,{\text{b}}_{2} ) \in \{ (1,2),(1,3),(2,3)\} }} \frac{{{\text{N}}( {\text{f}},{\text{f}}_{{{\text{b}}_{1} }} ,{\text{f}}_{{{\text{b}}_{2} }} )}}{{|{\text{F}}| | {\text{F}}_{{{\text{b}}_{1} }} || {\text{F}}_{{{\text{b}}_{2} }} |}}$$

(7) Max sum-of-cardinality normalized counts

In this strategy, the sum of the cardinalities of the underlying random variables, rather than their product, is used for normalization. The cardinality of F is represented as:

$$|{\text{F}}| \triangleq \left| {\{ {\text{f}}:{\text{N}}({\text{f}}) > 0\} } \right|$$

For a 4-gram model, it may be defined as in Eq. 20.

$${\text{g}}({\text{f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ){\text{ = P}}_{\text{GBO }} ({\text{f}}_{ } | {\text{f}}_{{{\text{a}}_{1} }} ,{\text{f}}_{{{\text{a}}_{2} }} )$$
(20)

where,

$$({\text{a}}_{1 } ,{\text{a}}_{2} ) = \mathop {\text{argmax}}\limits_{{({\text{b}}_{1} ,{\text{b}}_{2} ) \in \{ (1,2),(1,3),(2,3)\} }} \frac{{{\text{N}}( {\text{f}},{\text{f}}_{{{\text{b}}_{1} }} ,{\text{f}}_{{{\text{b}}_{2} }} )}}{{|{\text{F}}| { + }| {\text{F}}_{{{\text{b}}_{1} }} | + | {\text{F}}_{{{\text{b}}_{2} }} |}}$$

(8) Max sum-of-log-cardinality normalized counts

This is similar to the max sum-of-cardinality normalized counts strategy. The difference is that, for normalization, the sum of the natural logarithms of the cardinalities is used instead of the sum of the cardinalities themselves. The cardinality of F is represented as:

$$|{\text{F}}| \triangleq \left| {\{ {\text{f}}:{\text{N}}({\text{f}}) > 0\} } \right|$$

For a 4-gram model, it may be defined as in Eq. 21.

$${\text{g}}({\text{f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ){\text{ = P}}_{\text{GBO }} ({\text{f}}_{ } | {\text{f}}_{{{\text{a}}_{1} }} ,{\text{f}}_{{{\text{a}}_{2} }} )$$
(21)

where,

$$({\text{a}}_{1 } ,{\text{a}}_{2} ) = \mathop {\text{argmax}}\limits_{{({\text{b}}_{1} ,{\text{b}}_{2} ) \in \{ (1,2),(1,3),(2,3)\} }} \frac{{{\text{N}}( {\text{f}}_{ } ,{\text{f}}_{{{\text{b}}_{1} }} ,{\text{f}}_{{{\text{b}}_{2} }} )}}{{\ln |{\text{F}}| { + }\ln | {\text{F}}_{{{\text{b}}_{1} }} | + \ln | {\text{F}}_{{{\text{b}}_{2} }} |}}$$

(9) Min counts

In contrast to the max strategies discussed earlier, the back-off node is selected by comparing the counts of all the possible back-off models and choosing the model with the minimum count. This also makes it possible to select a different back-off path for each instance. While selecting the back-off path, this strategy favours the statistical estimation properties of the back-off models over their statistical predictability. For a 4-gram model, it may be defined as in Eq. 22:

$${\text{g}}({\text{f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ){\text{ = P}}_{\text{GBO }} ({\text{f}}_{ } | {\text{f}}_{{{\text{a}}_{1} }} ,{\text{f}}_{{{\text{a}}_{2} }} )$$
(22)

where,

$$({\text{a}}_{1} ,{\text{a}}_{2} ) = \mathop {\text{argmin}}\limits_{{({\text{b}}_{1} ,{\text{b}}_{2} ) \in \left\{ {\left( {1,2} \right),\left( {1,3} \right),\left( {2,3} \right)} \right\}}} {\text{N(f}},{\text{f}}_{{{\text{b}}_{1} }} ,{\text{f}}_{{{\text{b}}_{2} }} )$$

Here, the counts of the pairs \(({\text{f}}_{1} ,{\text{f}}_{2} )\), \(({\text{f}}_{1} ,{\text{f}}_{3} )\) and \(({\text{f}}_{2} ,{\text{f}}_{3} )\) are obtained, and the pair with the minimum count is selected as the back-off path.

Similar to the max strategies, there are six more min strategies, as listed in Table 2 in Sect. 6 below. In these strategies, the counts of all the possible back-off models are computed using the same mathematical operations as for the max strategies, and the back-off node with the minimum count is selected.

In the following strategies, multiple back-off paths are used instead of a single path to obtain the final probability score. These strategies differ from the min and max strategies discussed above: instead of selecting the back-off candidate with the minimum or maximum count, a mathematical operation is applied over all the back-off candidate probabilities to obtain the back-off distribution.

(10) Sum

The sum of all the back-off candidate probabilities is used as the back-off distribution. For a 4-gram model, it may be defined as in Eq. 23:

$${\text{g(f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ) = \mathop \sum \limits_{{({\text{a}}_{1} ,{\text{a}}_{2} ) \in \left\{ {\left( {1,2} \right),\left( {1,3} \right),\left( {2,3} \right)} \right\}}} {\text{P}}_{\text{GBO }} ({\text{f}}_{ } | {\text{f}}_{{{\text{a}}_{1} }} ,{\text{f}}_{{{\text{a}}_{2} }} )$$
(23)

(11) Average (arithmetic mean)

The arithmetic mean of all the back-off candidate probabilities is used as the back-off distribution. For a 4-gram model, it may be defined as in Eq. 24:

$${\text{g(f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ) = \frac{1}{3} \mathop \sum \limits_{{({\text{a}}_{1} ,{\text{a}}_{2} ) \in \left\{ {\left( {1,2} \right),\left( {1,3} \right),\left( {2,3} \right)} \right\}}} {\text{P}}_{\text{GBO }} ({\text{f}}_{ } | {\text{f}}_{{{\text{a}}_{1} }} ,{\text{f}}_{{{\text{a}}_{2} }} )$$
(24)

(12) Product

The product of all the back-off candidate probabilities is used as the back-off distribution. For a 4-gram model, it may be defined as in Eq. 25:

$${\text{g(f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ) = \mathop \prod \limits_{{({\text{a}}_{1} ,{\text{a}}_{2} ) \in \left\{ {\left( {1,2} \right),\left( {1,3} \right),\left( {2,3} \right)} \right\}}} {\text{P}}_{\text{GBO }} ({\text{f}}_{ } | {\text{f}}_{{{\text{a}}_{1} }} ,{\text{f}}_{{{\text{a}}_{2} }} )$$
(25)

(13) Geometric mean

The geometric mean of all the back-off candidate probabilities is used as the back-off distribution. For a 4-gram model, it may be defined as in Eq. 26:

$${\text{g(f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ) = \left( { \mathop \prod \limits_{{({\text{a}}_{1} ,{\text{a}}_{2} ) \in \left\{ {\left( {1,2} \right),\left( {1,3} \right),\left( {2,3} \right)} \right\}}} {\text{P}}_{\text{GBO }} ({\text{f}}_{ } | {\text{f}}_{{{\text{a}}_{1} }} ,{\text{f}}_{{{\text{a}}_{2} }} )} \right)^{1/3}$$
(26)

(14) Weighted mean

In this strategy, the user assigns a numerical weight to each available back-off node, and the weighted mean of all the back-off candidate node probabilities is used as the back-off distribution. For a 4-gram model, it may be defined as in Eq. 27.

$${\text{g(f}},{\text{f}}_{1} ,{\text{f}}_{2} ,{\text{f}}_{3} ){\text{}} = \mathop \prod \limits_{{({\text{a}}_{1} ,{\text{a}}_{2} ) \in \left\{ {\left( {1,2} \right),\left( {1,3} \right),\left( {2,3} \right)} \right\}}} {\text{P}}_{{{\text{GBO}}}}^{{{{\gamma }}_{{{\text{a}}_{1} ,{\text{a}}_{2} }} }} {\text{}}({\text{f}}_{{\text{}}} |{\text{f}}_{{{\text{a}}_{1} }} ,{\text{f}}_{{{\text{a}}_{2} }} )$$
(27)

where \(\gamma_{{a_{1} ,a_{2} }}\) represents the weight assigned by the user to the back-off node defined by the factors \(a_{1}\) and \(a_{2}\).
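
A few of the combine strategies above can be written as alternative definitions of g for the 4-gram case (again an illustrative sketch under the same assumptions as the earlier p_gbo function; these take the lower-order back-off distributions \({\text{P}}_{\text{GBO}}({\text{f}}\,|\,{\text{f}}_{{{\text{a}}_{1}}},{\text{f}}_{{{\text{a}}_{2}}})\) as an extra argument `p_sub`):

```python
PAIRS = [(1, 2), (1, 3), (2, 3)]   # candidate parent pairs (a1, a2)

def g_max_counts(f, parents, counts, p_sub):
    """Max counts: back off to the pair whose joint count with f is largest."""
    a1, a2 = max(PAIRS,
                 key=lambda p: counts.get((f, parents[p[0] - 1], parents[p[1] - 1]), 0))
    return p_sub(f, (parents[a1 - 1], parents[a2 - 1]))

def g_average(f, parents, counts, p_sub):
    """Average (arithmetic mean) of all back-off candidate probabilities (cf. Eq. 24)."""
    return sum(p_sub(f, (parents[a1 - 1], parents[a2 - 1])) for a1, a2 in PAIRS) / len(PAIRS)

def g_geometric_mean(f, parents, counts, p_sub):
    """Geometric mean of all back-off candidate probabilities (cf. Eq. 26)."""
    prod = 1.0
    for a1, a2 in PAIRS:
        prod *= p_sub(f, (parents[a1 - 1], parents[a2 - 1]))
    return prod ** (1.0 / len(PAIRS))
```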

4 Related work

The initial reported uses of FLMs were to improve the perplexity of language models on the WSJ corpus and the Arabic CallHome corpus. The models were subsequently used in various NLP applications, such as the language modelling component of phrase-based machine translation (PBMT) and speech recognition. In paper [3], the authors demonstrated how the use of generalized parallel back-off (GPB) together with FLM improves results compared with n-gram models. Kirchhoff et al., in their tutorial, described in detail the implementation of FLM using the SRILM toolkit [4]. Whereas in a standard baseline language model a back-off path is selected manually and remains fixed, in their tutorial on FLM the authors proposed many ways of finding back-off paths, which they called generalized back-off paths, based on probabilities and statistics of the data. They used geometric mean, min/max counts and weighted mean for back-off. FLMs have often been used for Slavic and Arabic languages, and they have also been used widely for Russian [7, 8]. FLMs have likewise been applied effectively to other morphologically rich languages such as Amharic [9] and Turkish [10]. In paper [10], the authors showed that even with corpora of limited size available for training, FLM-based language models for Turkish are far better than traditional n-gram models; these results proved relevant for speech recognition in specific domains where data scarcity is a problem. In paper [11], the authors used FLM for Portuguese text generation. In paper [12], the authors used linguistic factors of words in FLM to generate a Romanian language model. FLMs have also been used for N-best list rescoring in paper [13], where the authors described the use of the POS tag of a word in FLM for rescoring sentences; they used a first pass based on morphemes and a second pass based on FLM models, which further improved the results. In paper [14], the authors claimed that morpheme-based models need higher-order models than word-based models to achieve comparable performance. In paper [15], the authors discussed morpheme-based FLM, a language modelling methodology that combines the FLM approach with morphological decomposition of words; they obtained approximately a 2.5% reduction in word error rate (WER) on a speech recognition task compared with a traditional full-word system. In paper [16], the authors proposed that morpheme-based language models are more effective for large vocabulary continuous speech recognition (LVCSR) than whole-word language models. In paper [17], the authors used POS tags as syntactic features; they showed that for a speech recognition task, FLMs outperformed traditional trigram language models in terms of mixed error rate and perplexity, and they extracted syntactic and semantic features from code-switching text data and integrated them into FLM. In paper [18], the authors described their investigation of factors for code-switching language modelling; they considered Brown word clusters, POS tags, individual words and open-class word clusters as factors in FLM and evaluated the perplexity of the model. The combination of these factors yielded the best perplexity; the FLM developed using these factors reduced the perplexity by 10.8% on the SEAME evaluation set and the error rate by 3.4%. In paper [19], the authors also applied FLM to a speech recognition task.

Recently, Ganji and Sinha used FLM for recognizing code-switched data containing a mix of English and Hindi (Hinglish) text [20]. To construct the FLM they added a novel factor, called the CS-factor, along with the POS tag factor. The CS-factor identifies the word at which the code switches from Hindi to English. They obtained a significant improvement in perplexity on Hinglish data. Gregor and Zdravko proposed a novel context-dependent back-off scheme and claimed that it performs better than the simple back-off method for speech recognition [21]. They constructed an FLM using factors obtained from morpho-syntactically tagged data.

A major application of FLM is in statistical machine translation. In [22], the author first utilized FLM for a machine translation application. Nair et al. proposed a hybrid MT system based on knowledge of singular/plural forms, case and gender obtained from the declensions of nouns and adjectives [23]; they also used a morphological analyzer and a POS tagger to annotate the words of a sentence. Ramanathan et al. used factored machine translation (FMT) for English to Hindi translation by including the suffixes and semantic relations of words in English sentences and aligning them with case markers and inflections of words in Hindi [24]. During translation, many forms of a Hindi word are possible for an English word, but some of these forms may be missing from the given training data. Hence, Sreelekha and Bhattacharyya improved the performance of English to Hindi translation by injecting missing morphological forms of words as factors into the target Hindi language corpus [25]; in this way they also reduced the out-of-vocabulary problem and improved the fluency of the output. Patel et al. used FMT for translation from English to several Indian languages including Hindi [26]. They first restructured the source English sentences according to the target language to tackle the structural divergence problem; they also separated the suffixes from words and used word stems as factors in FMT, thereby handling the morphological divergence problem. Jaya and Gupta also claimed an improvement in the quality of English to Hindi translation by augmenting the parallel corpus with morphological variations of Hindi nouns and adjectives [27]. Kumar et al. utilized part-of-speech tags, lemmas, dependency information and syntactic information as factors for English-Tamil translation and claimed an improvement in performance [28]. Dungarwal et al. [29] used case, number and Tree Adjoining Grammar information as factors to improve English–Hindi translation quality in factored machine translation. Sachdeva et al. used the POS tag factor in FMT for Hindi to English machine translation [30].

Researchers are now trying to integrate the factored approach into neural machine translation (NMT). García-Martínez et al. utilized morphological and several other word factors in NMT [31], aiming to manage NMT's large vocabulary problem. Wilken and Matusov extended the work of García-Martínez et al. by exploring the usefulness of factors beyond the prediction of word lemmas and their inflections [32]. Hokamp used word-level features in an ensemble of NMT systems and achieved comparable results in both quality estimation (QE) and automatic post-editing (APE) tasks [33].

It can be observed that many different kinds of linguistic factors of Hindi words have been used by researchers, and that various factors contribute differently to the task of predicting the upcoming word in a Hindi sentence. This motivates the search for the factors of a Hindi word that most strongly influence the upcoming word in a sentence. Many back-off models have been proposed in the literature [3, 21, 34, 35]. This research work compares the basic back-off model and the generalized back-off model in the context of the Hindi language.

5 Methodology

5.1 Selection of factors

To apply FLM to a natural language processing task, some effective and informative factors of the Hindi language are identified first. Then, out of the many possible models, the FLM that produces the lowest perplexity is selected. For training, the corpus is pre-processed by adding number, gender and POS tag factors to the text. The significance of these factors for Hindi, and the reason for selecting them, can be justified as follows:

(1) Gender factor

In Hindi, the next word depends on the gender of the words in its contextual history. For instance, in the Hindi sentence “सीता पढ़ने जाती है”, the subject of the sentence is the noun ‘सीता’. Here the gender of the subject is feminine, hence the verb ‘जाती’ is used. But if the subject is changed to a masculine noun such as ‘राम’, then the verb ‘जाता’ must be used instead of ‘जाती’. Thus, the gender of the previous words determines the upcoming word in the sentence.

(2) Number factor

In Hindi, the next word depends on the grammatical category ‘number’ of the words in its contextual history. For instance, in the Hindi sentence “राम ने बहुत आम खाये”, the adjective ‘बहुत’ is plural in nature, hence the verb ‘खाये’ is used. But if the adjective is replaced by the word ‘एक’, then the verb ‘खाया’ must be used to construct a correct Hindi sentence. Thus, the grammatical category ‘number’ of the previous words determines the upcoming word in a Hindi sentence.

(3) Part-of-speech tag factor

In Hindi, the next word depends on the part-of-speech tags of the words in its contextual history. The order in which words occur in a valid Hindi sentence is governed by certain rules concerning the part-of-speech tags of the words. Such rules make prediction of the next word easier, as they assign a high probability to words with a certain part-of-speech tag. Some such rules of Hindi are given below.

Rule 1:

An adjective is followed by a noun with high probability.

For instance: सुरेश एक सच्चा दोस्त है.

Here, adjective ‘सच्चा’ is followed by noun ‘दोस्त’.

Rule 2:

A reflexive pronoun is followed by a noun with high probability.

For instance: सुरेश अपने गांव गया है.

Here, reflexive pronoun ‘अपने’ is followed by noun ‘गांव’.

Rule 3:

A noun is followed by a verb or another noun with high probability.

For instance: (1) सुरेश सोने जा रहा है. (2) सुरेश युद्ध हार गया.

Here, noun ‘सोने’ is followed by verb ‘जा’ and noun ‘सुरेश’ is followed by noun ‘युद्ध’.

Rule 4:

A main verb is followed by an auxiliary verb with high probability.

For instance: सुरेश सोने जा रहा है.

Here, main verb ‘जा’ is followed by auxiliary verb ‘रहा’.

Thus, while predicting the next word, part-of-speech tags restrict the search space to a limited set of words with a certain part-of-speech tag, instead of all the words in the vocabulary of the language.

5.2 Pre-processing of data

When working with FLM, pre-processing is the first and most important step; it includes identifying the factors to be used and determining the possible values of each factor. In this research, factored language models are trained on Hindi text containing 21,835 factored words in total, and the trained models are tested on a test corpus of 2144 words. During pre-processing, every word in the corpus is annotated with all its factor values, so that each word is now treated as a set of factors. A grammatical factor and its value are separated by the ‘-’ character, and consecutive factors of a word are separated by the ‘:’ character. The special tags <s> and </s> mark the beginning and end of each sentence, respectively.

For example, if the Hindi sentence is:

“रेडियो और दूरदर्शन महत्वपूर्ण मौसम सूचनाओं और कृषकों से संबंधित सूचनाओं जैसी अन्य जानकारियों का प्रसार करते हैं।”

After inserting factors, it will become:

<s> W-रेडियो: G-पुल्लिंग: N-एकवचन: P-नाम: S-<NULL> W-और: G-<NULL>: N-<NULL>: P-समुच्चयबोधकअव्यय: S-<NULL> W-दूरदर्शन: G-पुल्लिंग: N-एकवचन: P-नाम: S-<NULL> W-महत्वपूर्ण:G-पुल्लिंग:N-एकवचन:P-विशेषण:S-<NULL> W-मौसम: G-पुल्लिंग: N-एकवचन: P-नाम:S-<NULL> ··· </s>

where W is the word itself, G the gender, N the number, P the part-of-speech and S the stem factor of the word.
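
A small helper like the following (illustrative Python, assuming the factor values are already available, e.g. from a tagger or the hypothetical FactoredWord type sketched earlier) can produce lines in this annotation format:

```python
def annotate_sentence(tokens):
    """Format factored tokens as '<s> W-..:G-..:N-..:P-..:S-.. ... </s>'."""
    parts = []
    for tok in tokens:                    # tok: FactoredWord
        parts.append("W-{}:G-{}:N-{}:P-{}:S-{}".format(
            tok.word, tok.gender or "<NULL>", tok.number or "<NULL>",
            tok.pos or "<NULL>", tok.stem or "<NULL>"))
    return "<s> " + " ".join(parts) + " </s>"
```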

5.3 Experimentation

Factored language models are constructed using the freely available SRILM language modelling toolkit [5, 6], which supports both generalized back-off and FLMs. Essentially, SRILM provides a specification syntax in which such graphical models can be expressed as code. The specification names the factor to be used as the child node and the other required factors as its parent nodes. It also provides a way to specify each node that may be dropped, which determines the back-off paths in the back-off graph. Different options can be specified at each node, such as the smoothing method and the threshold value required for the back-off decision. SRILM accepts a factored language data file and produces a count file containing the counts of each n-gram and a language model file containing the probabilities of the n-grams. The language model is generated according to the guidelines given in the language model specification file. SRILM provides various smoothing methods such as Kneser–Ney smoothing, Good–Turing smoothing, Witten–Bell smoothing and natural discounting [2].

5.3.1 Experimentation number 1

As mentioned earlier, apart from the word itself, four factors are considered for building FLMs: the POS tag, gender, number and word stem. To identify the factor that is most effective in predicting the upcoming word, many FLM models are constructed using various combinations of factors and discounting techniques. The results show that, among all the factors considered during the experimentation, the number factor associated with the previous word is more informative for next-word prediction than the other factors. A sample FLM specification program written during the experimentation is:

5.3.2 FLM specification program 1

figure a

where ## indicates a comment, and W is the child node with two parent factors: the previous word W(−1) and its part-of-speech P(−1). The minus sign indicates one position before the current word. The integer 3 at the end of line 2 indicates that three back-off nodes are used. Here, however, a simple fixed back-off is used: the word factor is dropped first in line 3 and the part-of-speech factor in line 4.

5.3.3 Experimentation number 2

Experiment 2 implements generalized back-off in FLM. Once the most informative factor, i.e. the number factor, is identified in experiment 1, it is combined with the second most informative factor, i.e. part-of-speech, and FLM models with generalized back-off are constructed. The sample FLM specification programs with generalized back-off written during the experimentation are explained below.

5.3.4 FLM specification program 2

figure b

The integer 5 at the end of line 2 indicates that five back-off nodes are used in the back-off graph. Line 4 specifies generalized back-off, where the two nodes N1 and P1 are dropped simultaneously from the parent node W1, N1, P1, leading to two separate paths. This specification program produces “knsum.count1.gz” as the frequency count file and “knsum.lm1.gz” as the language model. The combine sum strategy instructs the toolkit to sum the probabilities of these back-off paths to produce the final probability of the model. Kneser–Ney discounting is used here with the interpolate option. The gtmin parameter, set to 0.1, is the threshold value used to exclude n-grams with low counts.

5.3.5 FLM specification program 3

figure c

Line 5 specifies generalized back-off, where the two nodes N1 and P1 are dropped simultaneously from the parent node W1, N1, P1, leading to two separate paths. The combine max counts_no_norm strategy instructs the toolkit to select among these back-off paths using the max counts (unnormalized) strategy to produce the final probability of the model. Witten–Bell discounting is used here with the interpolate option. The threshold parameter gtmin, equal to 0.1, is kept the same in all generated models so that different combine strategies with various discounting techniques can be compared.

To check the effect of generalized back-off, standard n-gram models are also constructed. Many experiments are carried out by writing various scripts in the generalized back-off FLM specification format to test different back-off paths and smoothing techniques.

6 Results

A language model can be evaluated in two different ways: intrinsic evaluation and extrinsic evaluation. In extrinsic evaluation, the model is integrated into some NLP application and its performance is evaluated by measuring the effect of variations in the model on the application output. In intrinsic evaluation, the language model is not applied to any NLP application but is evaluated on its own. Intrinsic evaluation is a probabilistic approach in which performance is measured using a metric called perplexity. Perplexity is computed from the entropy of the probability distribution. If a word sequence is assigned a higher conditional probability, the language model has a lower perplexity; thus, perplexity is inversely related to probability. For a given test corpus, a language model that produces lower perplexity is treated as the better model. In this research work the language models are evaluated using the perplexity metric.
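
For reference, perplexity can be computed from the per-word probabilities assigned by a model, as in this illustrative sketch (`lm_prob` stands for any conditional probability function, such as the ones sketched earlier):

```python
import math

def perplexity(sentences, lm_prob, order=3):
    """Perplexity = exp of the average negative log-probability per word."""
    log_sum, n_words = 0.0, 0
    for words in sentences:
        padded = ["<s>"] * (order - 1) + list(words) + ["</s>"]
        for i in range(order - 1, len(padded)):
            history = tuple(padded[i - order + 1:i])
            p = lm_prob(padded[i], history)
            log_sum += -math.log(max(p, 1e-12))   # floor avoids log(0) for unseen events
            n_words += 1
    return math.exp(log_sum / n_words)
```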

The perplexity of the standard n-gram models of degree 1, 2 and 3 is given in the first three rows of Table 1. These models consider only the previous words and use no additional linguistic knowledge to predict the upcoming word. The results show that as the degree of the language model, i.e. the number of previous words considered, increases, the perplexity decreases. This means that more previous words are needed to predict the upcoming word correctly.

Table 1 Perplexity of language models without generalized back-off

The perplexities obtained on the test corpus for the FLMs without generalized back-off are given in rows 4 to 21 of Table 1. The results in Table 1 are compared graphically in Fig. 6, where the x axis shows the serial number of the language model in Table 1 and the y axis its perplexity. In this experiment, the factors are dropped without using generalized back-off. The FLM models produce better results than the standard baseline n-gram language models: the perplexity decreases from 3.763726 for the baseline trigram model to 2.570084 for the model p(W|W1N1) with Kneser–Ney discounting (without interpolation) when W1 is dropped first. Here, W1 is the immediately preceding word and N1 is that word's number factor.

Fig. 6
figure 6

Graph showing perplexity of language models without generalized back-off

The perplexity obtained for various FLMs with generalized back-off using different combine strategies is listed in Table 2. The results in Table 2 are compared graphically in Fig. 7, where the x axis shows the serial number of the strategy in Table 2 and the y axis its perplexity. It is observed that, for the available test corpus, the perplexity of the FLM with the product combine strategy is the lowest among the combine strategies. Of the three discounting techniques used, natural discounting with the product combine strategy produces the lowest perplexity among all the generated models, namely 1.881235, which is about 50% lower than that of the trigram baseline n-gram language model. In all developed FLMs, interpolation is used to estimate probabilities at the back-off nodes, and the same gtmin of 0.1 is used for training all FLMs with every discounting approach. Since FLM utilizes grammatical and linguistic knowledge of the preceding words to predict the next word, this additional information makes the model more robust.

Table 2 Perplexity of factored language models with generalized back-off using various combine strategies
Fig. 7
figure 7

Perplexity of FLM with generalized back-off using various combine strategies

7 Conclusion and future scope

Generalized back-off is a generalization of the back-off used in baseline n-gram language modelling. It provides the opportunity to consider all possible back-off paths simultaneously using various strategies and to select the most appropriate one. Compared with the n-gram model, FLM predicts the next word more accurately because additional knowledge of the grammatical and lexical features of the words in the language is available. It is observed that the natural discounting technique with the product combine strategy produces the lowest perplexity, 1.881235, which is about 50% lower than that of the trigram word-only language model. The difficulty of working with FLM is the large amount of text pre-processing required, as every word in the corpus must be annotated with all the factors. Moreover, for an FLM with many factors, the availability of many back-off paths increases the computational complexity of the model. In future work, more linguistic features of Hindi can be included in the FLM and an effective generalized back-off strategy can be identified. The number of factor combinations and the levels of the back-off graph may also be increased to check whether the perplexity can be improved further.