1 Introduction

Machine translation (MT) is a sub-field of computational linguistics and Natural Language Processing (NLP) that enables the automatic translation of sentences or documents from one language to another. It aims at reducing language barriers between people of different linguistic backgrounds. Language diversity has a tremendous impact on several aspects of human life, and its barriers can be mitigated with the effective use of MT, which minimises the need for human involvement. Although machine-generated output differs from human translation, it is understandable, and an effective system produces grammatically and semantically fluent output. Classical translation approaches draw on deep linguistic knowledge of the source language and cognitively translate to the target language word by word. MT approaches are categorised predominantly into rule-based approaches, i.e. those based on handcrafted linguistic rules [1] involving the transfer mechanism [2] and the interlingua mechanism [3], and corpus-based approaches that rely entirely on corpora, i.e. statistical phrase-based [4] and neural-based [5]. Some approaches, such as the example-based [6] and knowledge-based [7] approaches, are less used nowadays.

1.1 Need for machine translation systems

The tremendous industrial growth of the past decades has had a significant impact on the global MT market, which enables content to be made available in regional languages across the globe. Computational activities have become mainstream, and as the internet opens up a wider multilingual and global community, research and development in MT continue to grow at a rapid rate. MT systems are required in small, medium and large organisations; some deal with only a specific domain, while others provide services in multiple domains. Many fields require domain-specific services, such as government, software and technology, healthcare, legal, military and defence, e-commerce and finance. Online commercial MT systems provide instant translation, converting source-language text, image and audio data into the target language. These systems are generic, light-weight, cloud-based and highly accurate, and they offer text-to-text, speech-to-speech, text-to-speech, speech-to-text and image-to-text translation. However, commercial systems are not economical, which motivates the need for an affordable MT system providing translation services.

1.2 Objective of the research work

India is a country of enormous linguistic diversity, with more than 1.3 billion people, 29 official languages and more than 12 scripts; hence, there is a need to translate content from one language to another. One of the main requirements in Sanskrit computational linguistics is to translate the life-transforming stories (epics), Vedas and so forth to make them available to the public at large in other languages. As per the literature survey, machine translation systems (MTS) using Sanskrit as a source or target language are still at a developing stage, as very few systems have been built using rule-based or statistical approaches [8, 9]. These approaches are not adequate for extending a system to generic and large domains, as they require deep linguistic insight. Therefore, there is a need to develop a machine-engineered generic MTS for translating Sanskrit to Hindi. We have merged the rule-based linguistic approach with a recurrent neural network. Although the neural approach to MT is comparatively new, it has produced promising results in the translation domain; it also has some pitfalls, which we overcome by merging it with rules. Thus, a hybrid system combining the best of the neural-based and rule-based approaches is developed and presented in this paper.

1.3 Problem domain

According to the 2018 census figures for India, Sanskrit is the mother tongue of 24,821 people and Hindi of 52,83,47,193 people, i.e. about 43% of the Indian population [10]. The reasons for choosing Sanskrit for translation are the richness of its scientific literature, its extensive and comprehensive analysis, its structured approach and its traditional grammar [11]. Some of the numerous characteristics of this language are listed below.

  • It is considered the 'donor' of almost all Indian languages, most of which have been derived from Sanskrit either partially or fully.

  • The vast reserves of knowledge in this language could be converted into other languages [12]. The language comprises rich literature, Ayurveda and the Vedas, which can be reproduced in other languages to improve their accessibility.

  • It has a rich grammar codified by Panini nearly 2500 years ago in 3949 rules, which were extended later on [13].

  • It has a strong and unambiguous grammar [14]. Many researchers have attempted to write a grammar for the Sanskrit language using the Paninian framework and used it to develop translation systems [15].

  • Panini focused on decoding the information contained in the language string of a given input language through Karaka (syntax–semantic) relations rather than thematic roles [16]. He also highlighted the importance of case markers, postpositions and word order. The central element of the semantic model in the Paninian framework is that every verbal root (dhaatu) denotes an action consisting of an activity (vyaapara) and a result (phala). The result is the state reached after completion of the action; the activity consists of steps performed by the different participants, or Karakas, involved in the action.

  • The concept of the Karaka relation is a central theme of Paninian grammar. Karaka relations are syntactic–semantic relations: at the surface level they convey syntactic information, and at a deeper level they capture semantic information.

  • Sanskrit grammar has been termed the 'Father of Informatics', as it builds a relationship between the speech and utterance of the speaker and the meaning derived by the listener [17].

  • The primary objective of Sanskrit Paninian grammar is to form a theory of human natural language communication.

  • Sanskrit and Hindi belong to the same Indo-Aryan family [18]. They share structural and lexical similarity, as Hindi inherits from Sanskrit.

  • Sanskrit has a rich and structured grammar in the form of Panini's Ashtadhyayi, whereas no parallel grammar exists for Hindi. Therefore, it becomes difficult to map the divergence between these two languages.

  • The non-existence of a parallel grammar leads to exceptional cases which uncover linguistic generalisations, such as Vibhakti in Hindi. The cases where Vibhakti in Sanskrit and Hindi diverge [18] are optional, exceptional, differential, alternative, non-Karaka, verb and complex-predicate divergence.

Despite these features, choosing Sanskrit as a source language is difficult under both the rule-based and the neural-based approach. In the rule-based approach, parsing fails because of Sanskrit's synthetic nature, in which a single word can run up to 32 pages. In the NMT approach, training the translation system leads to a high occurrence of out-of-vocabulary words, which are morphologically rich and carry multiple meanings depending on context. The work presented in this paper outlines these challenges and overcomes them to provide a sound translation.

1.4 Challenges in neural machine translation

Neural machine translation (NMT) faces several challenges compared to other approaches. This new trend in MT has made neural networks yield quite effective translation systems; besides the promising results provided by NMT, a few challenges are listed below.

  • NMT systems favour fluency over adequacy, leading to low-quality translation out of domain.

  • NMT performance is directly proportional to the amount of training data; therefore, rich-resource language pairs exhibit better performance.

  • NMT systems face problems in translating low-frequency words of inflected categories such as verbs. It has been found that NMT systems using Byte-Pair Encoding perform better than SMT systems on low-frequency words.

  • NMT systems produce higher translation quality for shorter sentences than for longer ones.

  • The word alignment performed by the attention model in NMT does not produce desirable results.

  • In the decoding phase, beam search improves translation quality only for narrow beams; quality decreases for larger beams.

  • NMT systems require better analytical tools, as their internal workings are not interpretable.

These challenges are overcome in our proposed approach by merging NMT with linguistic features extracted by applying the rule-based approach.

1.5 Contributions

Neural machine translation (NMT) has been outperforming traditional statistical phrase-based MT and RBMT. Motivated by its performance, we implemented a neural model enriched with linguistically rich features from a rule-based system for the Indian language pair Sanskrit–Hindi.

  • Initially, a parallel corpus was gathered from different available sources, and the rest of the data was created manually for building a neural model using the encoder–decoder architecture with attention mechanism [5].

  • We trained and tested the NMT system with different models, activation functions, amounts of training data, epochs and sentence lengths to yield better accuracy for the proposed system.

  • We then merged the outputs of linguistic tools from the classical rule-based approach as feature embedding matrices in NMT, to test whether they guide the disambiguated translation of words (the same word may have different meanings in different contexts), reduce data sparseness and yield meaningful tokenization.

  • Performance testing was performed to compare rule-based, neural and hybrid systems on four measures, i.e. BLEU, F-measure, METEOR, Word-Error-Rate.

  • Various Sanskrit-to-Hindi MT sentences were evaluated against reference translations based on BLEU score, Precision, Recall, F-measure, WER, F-mean, Penalty and METEOR.

  • Human evaluation was also performed based on grammatical categories. We formed 15 grammatical-category cases and tested the system's translations against them. The existing systems for Sanskrit–Hindi translation were also compared.

1.6 Paper organization

The remainder of the paper is organised as follows: Sect. 2 presents the background of both NMT and MTS for processing the Sanskrit language. An elaborate description of our proposed system is given in Sect. 3, followed by the experimental design in Sect. 4. Section 5 discusses a detailed result analysis, and finally, Sect. 6 concludes the paper along with the future scope.

2 Background

Neural networks (NN) have been researched since the 1980s–1990s [19], but until then these methods were unexplored for machine translation. Later, a neural network (NN) and a finite state machine (FSM) were trained for English–Spanish, and experimental results showed the NN performing better than the FSM on the same amount of training data [20]. Another model, "recurrent hetero-associative memories", capable of learning a simple translation task, was proposed; it builds a state-space representation of the input string to obtain the output string [21]. These approaches were similar to current NN approaches, but the computational resources to process large amounts of data and to reduce the learning time or the size of the NN were lacking, and the computational complexity involved left these ideas deserted for decades. Meanwhile, data-intensive approaches, i.e. statistical machine translation (SMT), came into dominance and turned out to be useful for applications ranging from information extraction to professional translation. In SMT, the analysis is performed on data by maximising the probability (Pr) of the target sentence (t) given the source sentence (s), as in Eqs. (1) and (2)

$$\begin{aligned} Pr (t|s)& = \frac{Pr (t)Pr (s|t)}{Pr (s)} \end{aligned}$$
(1)
$$\begin{aligned} {\hat{t}}& = \mathop {argmax}\limits _{t} Pr (t) Pr (s|t) \end{aligned}$$
(2)

It builds a language model from monolingual text and a translation model from the parallel corpus. The decoding algorithm uses both the translation model and the language model to predict the target-language sentence. It follows two objectives, adequacy and fluency. Adequacy means the target translation must carry the same meaning as the source; it requires a translation model that assigns a score to each sentence. Fluency refers to fluent translation; it requires a language model that assigns a score to each sentence. These statistical models do not consider context, yet a single word may have different meanings in different contexts. The statistical, probabilistic language model was improved with context, perplexity and feature vectors by the neural probabilistic language model [22]. Later, early exploration of neural language models in SMT showed large improvements [23], but the computational drawback remained for many research groups, as training required GPUs. A few more experiments were carried out by integrating neural components into SMT, such as a statistical language model based on a neural network [24] and a neural network joint model combined into the decoder; the latter model showed empirically good results [25]. Other integrations of language models include a continuous-space language model on a GPU for SMT [26] and a factored recurrent neural network language model for TED lecture transcription [27]. Neural models have also been used for source-side pre-ordering for faster and better SMT [28] and for improved reordering in SMT systems [29, 30].

2.1 Neural machine translation

Neural machine translation (NMT) incorporates context and computes probabilities of similar words better. It is a machine-learning technique that takes a source sentence as input and predicts a target sentence as output, and it has been outperforming traditional MT methods such as rule-based and statistical approaches. Early efforts at pure NMT include convolutional neural networks that use continuous-space representations of words, phrases and sentences [31]. Deep long short-term memory (LSTM) networks were used to encode the input sequence into a fixed-dimension vector and decode it back. These models performed quite well on short sentences but were unable to produce good translations for long sentences. Later, neural models with an attention mechanism were proposed, selectively attending to parts of the source sentence during translation [32]. These models produced quite promising results, but encoding sentences into fixed-length vectors remained restrictive. Therefore, [5] proposed embedding sentences into variable-length vector sequences. This work yields promising results and is widely used to train NMT systems.

2.2 Machine translation systems for processing Sanskrit language

In this paper, a Sanskrit–Hindi hybrid machine translation system (SHH-MTS) is proposed, in which a hybrid mechanism (a combination of linguistic features extracted from the rule-based pipeline and a recurrent neural network) is applied to translate from Sanskrit to Hindi. Sanskrit is one of the oldest Indo-European languages [33]. In Uttarakhand, Sanskrit is an official language [34], and it is also known as the donor of other Indian languages [35]. Hinduism, Buddhism and Jainism used this language in their holy books, Vedas and other philosophical texts [36]. Current Sanskrit-based systems use rule-based approaches, which are not easily extendable and are time-consuming to build. The challenge in translating Sanskrit is its morphologically rich nature, which requires various modules, resources and tools and therefore a complex system. Sanskrit has a wealth of scientific literature with in-depth analysis for cognitive knowledge description and extends readily to other languages, and the phonetic basis of the language is useful for speech analysis as well [37]. Research in the field of Sanskrit MT is at an early stage. Systems developed for computation and translation of the Sanskrit language include the Sanskrit Translator (2009) [38], which uses an example-based technique and different modules to translate from English to Sanskrit. A system merging a rule-based approach with an artificial neural network for translating English to Sanskrit was developed in 2010 [39]. An English to Sanskrit MT and synthesizer system (2010) [40] used a dictionary-based approach and included a speech synthesiser in the translation process. An English to Sanskrit machine translator (2010) [41] used a rule-based approach with modules such as a lexical parser, semantic mapper, translator and composer. [42] built a complete English to Sanskrit speech translation system, which was later enhanced (2012) by [43] by merging in a rule-based approach. The E-Trans system (2012) [44] uses a rule-based approach from English to Sanskrit with a synchronous context-free grammar, and its results are quite promising for both small and large sentences. TranSish (2014) [45], a system for translating Sanskrit to English, uses a rule-based approach but only for the present tense; it was later extended to other tenses. These research works were useful for sharing extensive knowledge of Sanskrit with traditional translation approaches. More recent work on Sanskrit translation used a statistical approach, such as the English to Sanskrit machine translator with ubiquitous application (2012) [46], which uses features like phrase translation probability, inverse phrase translation probability, lexical weighting probability, phrase penalty, language model probability, a distance-based distortion model and word penalty to train the system. All of this research work is based on processing the Sanskrit language to or from English. There has also been work on Sanskrit to Hindi translation, including rule-based and statistical systems. The Sanskrit to Hindi MT work includes [47] in 2015 for translating children's stories (e-learning content, multimedia for kids) using a rule-based approach; its extension to the yoga and Ayurveda domains is still in progress and not yet available. Another system of translation and computational tools for Sanskrit, Samsaadhani (2009) [48,49,50], for translating Sanskrit to Hindi, also uses a rule-based linguistic approach. Recent attempts using a statistical approach for translating Sanskrit to Hindi have also been proposed; systems trained on the MT-Hub platform [51] and Moses [8] resulted in significant improvement in Sanskrit language computations.

In this paper, we propose the Sanskrit–Hindi hybrid machine translation system (SHH-MTS), which handles these limitations of the existing MTS developed for the Sanskrit language.

3 System description

This section describes data pre-processing, the addition of linguistic features, the encoder that embeds the source input sentence into vectors, the decoder that converts the trained vectors into the target sentence, and a web interface that makes the translation available as a service for users.

3.1 Data pre-processing

The corpus data are made suitable for the NMT model in a two-step process: cleaning the text and splitting the text. Cleaning involves dividing the document into sentences and then removing all non-printable characters and punctuation markers, normalising Unicode characters, changing uppercase letters to lowercase and removing any remaining tokens that are neither alphabetic nor numeric. We perform these operations on each phrase of each sentence pair of the loaded dataset. Secondly, the splitting operation is performed on the cleaned data. The dataset contains sentence pairs of different lengths, for which different computation graphs would be drawn. We therefore sort the sentence pairs in a batch by length and break similar-length sentences into mini-batches: the training corpus is recurrently shuffled, broken into maxi-batches and then split again into mini-batches. The batches are processed further by applying the gradient for parameter updates.
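
A minimal sketch of this cleaning and length-bucketing step is shown below; the regular expression, batch sizes and example pair are illustrative assumptions rather than the exact values of our pipeline.

```python
import re
import unicodedata
from random import shuffle

def clean_sentence(text):
    # Normalize Unicode, drop punctuation markers and non-printable characters,
    # lowercase Latin letters and keep only alphabetic or numeric tokens.
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(t.lower() for t in text.split() if t.isalnum())

def make_minibatches(pairs, batch_size=64, maxi_batch=1500):
    # Shuffle the corpus, cut it into maxi-batches, sort each maxi-batch by
    # source length and emit similar-length mini-batches for training.
    shuffle(pairs)
    for start in range(0, len(pairs), maxi_batch):
        maxi = sorted(pairs[start:start + maxi_batch],
                      key=lambda p: len(p[0].split()))
        for i in range(0, len(maxi), batch_size):
            yield maxi[i:i + batch_size]

pairs = [(clean_sentence("रामः वनं गच्छति ।"), clean_sentence("राम वन जाता है ।"))]
batches = list(make_minibatches(pairs))
```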

3.2 Rule-based machine translation system; extraction of linguistic features

The proposed work has a pipeline architecture in which each phase takes input from the previous phase, performs its computation and passes the output to the next step. The different tools are divided into modules developed under the Sanskrit Consortium Project funded by MIT, using Anusaaraka [52]. There are ten modules in the rule-based pipeline architecture for Sanskrit to Hindi translation [49]. Each module provides an individual output used as a linguistic feature by the neural encoder–decoder to train the system more efficiently. We used the Anusaaraka engine for translation [52].

  • Pre-processing of user input: It takes input from the user, cleanses and normalises the text, converts the input notation into WX notation and invokes the MT system, which performs the computation and shows the output result.

  • Tokenizer: The tokenizer receives a stream of characters and breaks it into individual units called tokens (words, punctuation, markers). It removes formatting information and adds a sentence tag. Here, the term morphology is used in the linguistic sense: the study of words, their internal structure and their meaning. The model has a stream of words; those words are tokenized first, and then morphology gives meaning to them [53].

  • Sandhi-Splitter: It is invoked when the input text contains Sanskrit sandhi words. It splits these words as well as compound words [50, 54, 55].

  • Morphological Analyzer: It splits words into their roots and grammatical suffixes. Each unit provides a meaning as well as a grammatical function. It also provides inflectional analysis, prunes the answers, uses local morph analysis to handle unrecognised words and produces a derivational analysis of derived roots [56,57,58].

  • Parsing: The parser acts as a compiler or interpreter that breaks data into smaller units for easier translation from one language to another. It takes a sequence of words or tokens as input and translates it into a parse tree: the source language is converted into a tree with labels for nouns, verbs and their associated attributes. Morph analysis according to context is performed along with karaka analysis. Following computational Paninian grammar, the parser identifies and names the relations between the verb and its participants [59,60,61,62,63].

  • Shallow Parsing: If the parser fails on any input, this module does minimal parsing of the sentence and passes the pruned morph analysis to the next layer [50, 59, 64].

  • Word Sense Disambiguation (WSD): This module performs word sense disambiguation of the input sentence's word roots, vibhakti and lakara. It identifies the correct sense of a Sanskrit word [16].

  • Part-of-Speech Tagger (POS): It adds a part-of-speech tag, such as adjective, verb or noun, to each word [65, 66].

  • Chunker: This phase performs minimal grouping of words in a sentence, such as noun phrases, verb phrases and adjective phrases. The rule base allocates an appropriate chunk tag to each group [67].

  • Hindi Lexical Transfer: The Sanskrit lexicon is transferred to Hindi by identifying root words using the dictionary. The output is formatted for the Hindi Generator, which generates the Hindi output corresponding to the Sanskrit input. This module also performs transliteration in case translation fails [68].

  • Hindi Generator: This phase involves a sentence-level generator which enforces agreement between noun, adjective and verb in the target language, adds the vibhakti marker 'ne' and drops 'ko' at the required positions. Final generation takes the root words, their associated grammatical features and corresponding suffixes and concatenates them, generating the words of the output sentence [16].

Hence, each Sanskrit word is translated to its corresponding Hindi word using linguistic rules and tools, and these data are passed to the subsequent phase. The data passed to the next phase are converted into comma-separated values (CSV) format suitable for training, model development and fitting the neural encoder–decoder architecture that predicts the translation of a Sanskrit word into a Hindi word. The outputs of these linguistic tools are embedded as features in the input encoding of the source sentence, as sketched below.
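
To illustrate the hand-off between the two stages, the sketch below writes hypothetical per-token outputs of the rule-based modules (root, POS, karaka, vibhakti, chunk) into a CSV file of the kind consumed by the encoder; the column names and values are placeholders, not the exact schema used.

```python
import csv

# Hypothetical per-token analysis produced by the rule-based pipeline.
analysed_tokens = [
    {"token": "रामः",   "root": "राम",  "pos": "NN", "karaka": "karta", "vibhakti": "1", "chunk": "NP"},
    {"token": "गच्छति", "root": "गम्",  "pos": "VM", "karaka": "kriya", "vibhakti": "-", "chunk": "VGF"},
]

with open("linguistic_features.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(analysed_tokens[0].keys()))
    writer.writeheader()
    writer.writerows(analysed_tokens)
```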

3.3 RNN encoder–decoder with attention mechanism embedding extracted features

The proposed work shown in Fig. 1 is novel and can be applied to any low-resource language with a small amount of parallel data. We have used the improved NMT with attention mechanism proposed by [32], along with GRU cells for computation [5]. The implementation consists of stacked bidirectional RNN layers for both encoder and decoder, with an attention mechanism. The NMT first encodes the source sentence \(W_{s}=w_{s1},\ldots ,w_{sn}\) into a variable-length sequence of context vectors \(S={h_{1},h_{2},h_{3}\ldots h_{n}}\). The decoder decodes the context vector \(S_{i}\) and generates the target sentence one word at a time, \(W_{h}=w_{h1},w_{h2},w_{h3}\ldots w_{hn}\), by maximizing the target word probability \(P ({h_{i}}|d_{s_{i}},h_{i-1},s_{i})\) given the previously generated word \(h_{i-1}\), the decoder hidden state \(d_{s_{i}}\) and the context vector \(s_{i}\). A detailed explanation of the encoder, attention mechanism and decoder is given in the following subsections, and the entire implementation is depicted as pseudocode in Algorithm 1.

Fig. 1 Deep neural network architecture

3.3.1 Encoder

Given a source sentence in the Sanskrit language \(W_{s}=w_{s_{1}},w_{s_{2}},w_{s_{3}}\ldots w_{s_{z}}, w_{s_i}\in {\mathbb {R}}^{K_s} \) and a target sentence in the Hindi language \(W_{h}=w_{h_1},w_{h_2},w_{h_3},\ldots w_{h_x}, w_{h_i}\in {\mathbb {R}}^{K_h}\) from the parallel corpus, \(K_{s}\) and \(K_{h}\) are the vocabulary sizes and z and x are the lengths of the input and output sentences. The model first tokenizes \(W_{s}\) to form the input representation, where the probability of the sequence of words \(w_{s_1},w_{s_2},\ldots ,w_{s_z}\) is denoted \(P (w_{s_1}\ldots w_{s_z})\). Each word is usually conditioned on a window of the previous n words rather than on all previous words, since the number of words preceding a given word varies with its location in the input document, as in Eq. (3).

$$\begin{aligned} P (w_{s_{1}},w_{s_{2}},\ldots ,w_{s_{z}})&=\prod _{i=1}^{z} P (w_{s_{i}}|w_{s_{1}},\ldots ,w_{s_{i-1}})\nonumber \\&\approx \prod _{i=1}^{z} P (w_{s_{i}}|w_{s_{i-n+1}},\ldots ,w_{s_{i-1}}) \end{aligned}$$
(3)

Since a neural network cannot be applied directly to text data, the text is converted into integer tokens, which are in turn converted into vectors by embedding layers. Setting the maximum vocabulary size, we use a tokenizer for the source and target languages. Once the dataset is converted into sequences of integer tokens, these are padded, truncated and saved as numpy arrays. The encoder uses this tokenizer output and computes embedded vectors \( (\mathbf {w_{s_1}},\mathbf {w_{s_2}},\mathbf {w_{s_3}}\ldots \mathbf {w_{s_z}})\) for the hidden-layer computation. These vectors have values between − 1 and 1, and words with similar semantic meaning are mapped to similar vectors.
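
In Keras terms, the tokenization and padding described above correspond roughly to the following sketch; the vocabulary size, maximum length and example sentences are assumptions.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def to_padded_tokens(texts, num_words=30000, maxlen=20):
    tok = Tokenizer(num_words=num_words, filters="")      # keep Devanagari characters intact
    tok.fit_on_texts(texts)
    seqs = tok.texts_to_sequences(texts)
    return tok, np.asarray(pad_sequences(seqs, maxlen=maxlen,
                                         padding="post", truncating="post"))

src_tok, encoder_input = to_padded_tokens(["रामः वनं गच्छति", "बालकः पठति"])
tgt_tok, decoder_target = to_padded_tokens(["राम वन जाता है", "बालक पढ़ता है"])
```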

The forward RNN \(\mathbf {f}\) reads the input sentence from beginning to end and computes the hidden states \( (\mathbf {h_1},\mathbf {h_2},\mathbf {h_3}\ldots \mathbf {h_{\tau _j}})\). The backward RNN computes the hidden states \( ({\varvec{h_1}},{\varvec{h_2}}\ldots ,{\varvec{h_{\tau _j}}})\) by reading the sentence in reverse order. These forward and backward hidden states are combined to form an annotation vector \(H_{j}=[\mathbf {h_j}^{T};{\varvec{h_j}}^{T}]\). The traditional encoder consists of an embedding lookup of each input word \(w_{s_j}\) and a mapping step through the hidden states \(\mathbf {h_{\tau }}\) and \({\varvec{h_{\tau }}}\), as in Eq. (4).

$$\begin{aligned} \mathbf {h_j}=f (\mathbf {h_{j-1}},{\bar{E}}w_{s_{j}}) \end{aligned}$$
(4)

The encoder computations are stacked deeply in the following manner, as in Eqs. (5) and (6). For the first layer,

$$\begin{aligned} h_{t,1}=f_{1} (h_{t-1,1},w_{s_{t}}) \end{aligned}$$
(5)

For \(i>1\)

$$\begin{aligned} h_{t,i}=f_{i} (h_{t-1,i},h_{t,i-1}) \end{aligned}$$
(6)

where \(h_{t-1,i}\) is the value at the previous time step and \(h_{t,i-1}\) is the value of the previous layer in the sequence. The context vector \(s_{i}\) contains a summary of the input sentence, computed by the backward and forward RNNs. We have used gated recurrent units for both the encoder and decoder functions [69]. The gated recurrent units (GRU) are designed to have more persistent memory, making it easier for the RNN to capture long-term dependencies. Mathematically, the GRU uses the previous state \(h_{t-1}\) and the input \(w_{s_t}\) to generate the next hidden state \(h_{t}\).

The update gate is given in Eq. (7), the reset gate in Eq. (8), the new memory in Eq. (9) and the hidden state, for all i words of a sentence, in Eq. (10).

$$\begin{aligned} \mathbf {up_{i}}& = \sigma (\mathbf {W_{up}}{\bar{E}}_{s_i}+ \mathbf {O_{up}} \mathbf {h_{i-1}}) \end{aligned}$$
(7)
$$\begin{aligned} \mathbf {res_{i}}& = \sigma (\mathbf {W_{res}}{\bar{E}} {s}_{i} + \mathbf {O_{res}} \mathbf {h_{i-1}}) \end{aligned}$$
(8)
$$\begin{aligned} \tilde{\mathbf {h}}_{i}& = tanh (\mathbf {W}{\bar{E}}s_{i}+ \mathbf {O}[\mathbf {res_{i}}\odot \mathbf {h_{i-1}}]) \end{aligned}$$
(9)
$$\begin{aligned} \mathbf {h}_{i}& = (1-\mathbf {up_{i}})\odot \mathbf {h_{i-1}} + \mathbf {up_{i}}\odot \tilde{\mathbf {h}}_{i} \end{aligned}$$
(10)

Here, d is the dimensionality of the word embedding, u is the number of hidden units and \({\bar{E}}\in {\mathbb {R}}^{d\times K_s}\) is the embedding matrix.

$$\begin{aligned}&\mathbf {W},\mathbf {W_{up}},\mathbf {W_{res}} \in {\mathbb {R}}^{u \times d}\\&\mathbf {O},\mathbf {O_{up}},\mathbf {O_{res}}\in {\mathbb {R}}^{u\times u} \end{aligned}$$

\(\sigma \) is the logistic sigmoid function. The backward states of the bidirectional recurrent neural network are computed similarly: the update gate in Eq. (11), the reset gate in Eq. (12), the new memory in Eq. (13) and the hidden state, for all i words of a sentence, in Eq. (14).

$$\begin{aligned} {\varvec{up_{i}}}& = \sigma ({\varvec{W_{up}}}{\bar{E}}_{s_{i}}+ {\varvec{O_{up}}} {\varvec{h_{i-1}}}) \end{aligned}$$
(11)
$$\begin{aligned} {\varvec{res_{i}}}& = \sigma ({\varvec{W_{res}}}{\bar{E}} {s}_{i} + {\varvec{O_{res}}} {\varvec{h_{i-1}}}) \end{aligned}$$
(12)
$$\begin{aligned} \tilde{{\varvec{h}}}_{i}& = tanh ({\varvec{W}}{\bar{E}}s_{i}+ {\varvec{O}}[{\varvec{res_{i}}}\odot {\varvec{h_{i-1}}}]) \end{aligned}$$
(13)
$$\begin{aligned} {\varvec{h}}_{i}& = (1-{\varvec{up_{i}}})\odot {\varvec{h_{i-1}}} + {\varvec{up_{i}}}\odot \tilde{{\varvec{h}}}_{i} \end{aligned}$$
(14)

The forward and backward states are concatenated as \(h_{i}=[\mathbf {h_{i}};{\varvec{h_{i}}}]\).
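
The GRU computations of Eqs. (7)–(14) reduce to a few lines of NumPy; the sketch below shows a forward pass with illustrative dimensions and, as in the equations, without bias terms. The backward pass simply applies the same step to the reversed sentence before the two states are concatenated.

```python
import numpy as np

d, u = 4, 5                                   # embedding and hidden sizes (illustrative)
rng = np.random.default_rng(0)
W, W_up, W_res = (rng.normal(0, 0.1, (u, d)) for _ in range(3))
O, O_up, O_res = (rng.normal(0, 0.1, (u, u)) for _ in range(3))
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x):
    up = sigmoid(W_up @ x + O_up @ h_prev)            # update gate, Eq. (7)
    res = sigmoid(W_res @ x + O_res @ h_prev)         # reset gate, Eq. (8)
    h_new = np.tanh(W @ x + O @ (res * h_prev))       # new memory, Eq. (9)
    return (1.0 - up) * h_prev + up * h_new           # hidden state, Eq. (10)

sentence = [rng.normal(size=d) for _ in range(6)]     # embedded source words
h = np.zeros(u)
forward_states = []
for x in sentence:
    h = gru_step(h, x)
    forward_states.append(h)
```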

3.3.2 Addition of linguistic features to encoder

Our framework integrates linguistic features [70] extracted from the rule-based pipeline architecture to train the recurrent neural network. Each feature has a distinct word embedding vector \(s_{zy}\). Combining all these word vectors forms a feature embedding matrix \(E\in {\mathbb {R}}^{d_y\times k_y}\), where \(d_{y}\) is the sum of the dimensions of all feature embeddings and \(k_y\) is the vocabulary size of the \(y\)th feature. The per-feature embeddings of each token are concatenated, so the total embedding size matches the sum of the individual lengths, and this concatenated vector replaces the plain word embedding of the input sentence. All other functionality and parameters of the model remain the same; only this change in the encoder is made, as in Eq. (15), where \(\Vert \) denotes concatenation over the F features, and it results in an exceptional improvement in the fluency of the output.

$$\begin{aligned} h_{l}=tanh\left ( \mathbf {W} \Big \Vert ^{F}_{y=1}{\bar{E}}_{y} s_{zy}+\mathbf {O}\mathbf {h}_{l-1}\right) \end{aligned}$$
(15)
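
A sketch of Eq. (15): each linguistic feature (surface word, POS tag, karaka label, …) has its own embedding table, and the per-feature vectors of a token are concatenated before being fed to the recurrent layer. The feature names, vocabulary sizes and dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# One embedding table per feature: (vocabulary size, embedding dimension).
feature_tables = {
    "word":   rng.normal(size=(30000, 120)),
    "pos":    rng.normal(size=(40, 10)),
    "karaka": rng.normal(size=(12, 10)),
}

def embed_token(feature_ids):
    # feature_ids maps a feature name to the integer id from the rule-based pipeline.
    parts = [feature_tables[name][idx] for name, idx in feature_ids.items()]
    return np.concatenate(parts)                        # total dimension 120 + 10 + 10

x = embed_token({"word": 17, "pos": 3, "karaka": 5})    # shape (140,), fed to the encoder GRU
```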

3.3.3 Attention mechanism

The attention layer bridges the gap between the encoder, which produces a sequence of word representations \(h_{j}= (\mathbf {h_{j}},{\varvec{h_{j}}})\), and the decoder, which expects a context vector \(S_{i}\) at each time step \(t_{i}\). It calculates the association between the input words \(W_{s}\) and the next output word \(w_{h}\) by scoring the impact of each word representation (\(\mathbf {h_j},{\varvec{h_j}}\)). The context vector is calculated as a weighted sum of the annotations \(h_{j}\). For this, we first calculate the alignment model \(a_{ij}\), as in Eqs. (16)–(18): the score of the output position around i against the input position around j. It takes the hidden state \(d_{i-1}\) and \(h_{j}\), the \(j\)th annotation of the input Sanskrit sentence.

$$\begin{aligned} a_{ij}& = J^{\tau }_{a} tanh (W_{a} d_{i-1}+ O_{a}h_{j}) \end{aligned}$$
(16)
$$\begin{aligned} \alpha _{ij}& = \frac{exp (a_{ij})}{\sum ^{Ts}_{y=1} exp (a_{iy})} \end{aligned}$$
(17)
$$\begin{aligned} S_{i}& = \sum ^{ts}_{j=1} \alpha _{ij} h_{j} \end{aligned}$$
(18)

Here, the alignment model a is parameterized as a feed-forward neural network and \(S_{i}\) is the resulting context vector.

\(W_{a}\in {\mathbb {R}}^{n'\times n},O_{a}\in {\mathbb {R}}^{n'\times 2n},J_{a}\in {\mathbb {R}}^{n'}\) are weight matrices. The computed scalar attention scores are normalized using the softmax activation function, so that the weights over all input words sum to 1.
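
Eqs. (16)–(18) can be written directly in NumPy; here d_prev stands for the previous decoder state, H stacks the encoder annotations and all shapes are illustrative.

```python
import numpy as np

n, n_enc, T = 5, 8, 6                          # decoder size, annotation size, source length
rng = np.random.default_rng(2)
W_a = rng.normal(size=(n, n))                  # projects the decoder state
O_a = rng.normal(size=(n, n_enc))              # projects an encoder annotation
J_a = rng.normal(size=n)                       # scoring vector

d_prev = rng.normal(size=n)                    # previous decoder hidden state d_{i-1}
H = rng.normal(size=(T, n_enc))                # encoder annotations h_1 ... h_T

scores = np.array([J_a @ np.tanh(W_a @ d_prev + O_a @ h_j) for h_j in H])   # Eq. (16)
alpha = np.exp(scores) / np.exp(scores).sum()                               # Eq. (17)
context = alpha @ H                                                         # Eq. (18)
```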

3.3.4 Decoder

The decoder at each time step t takes the previous hidden state \(d_{i-1}\), a representation of the input context \(s_{i}\) and the embedding of the previous output word \({E_{h_{i-1}}}\), and outputs a new word prediction \(w_{h_i}\) and a new decoder hidden state. The initial hidden state is computed as in Eq. (19).

$$\begin{aligned} d_{0}=f (W_{d}\mathbf {h_{1}}) \end{aligned}$$
(19)

where \(W_{d}\in {\mathbb {R}}^{d\times d}\). The hidden state \(d_{i}\) is computed from the encoder annotations as in Eq. (20), with the update gate in Eq. (21) and the reset gate in Eq. (22).

$$\begin{aligned}&\tilde{d}_{i}=tanh (WE_{h_{i-1}}+O[res_{i}\odot d_{i-1}]+Ss_{i}) \end{aligned}$$
(20)
$$\begin{aligned}&up_{i}=\sigma (W_{up}E_{h_{i-1}}+O_{up}d_{i-1}+ S_{up}s_{i}) \end{aligned}$$
(21)
$$\begin{aligned}&res_{i}=\sigma (W_{res}E_{h_{i-1}}+O_{res}d_{i-1} + S_{res} s_{i}) \end{aligned}$$
(22)

where E is the embedding matrix of the target-language words, u is the number of hidden units and d is the word-embedding dimension. \(W,W_{up},W_{res}\in {\mathbb {R}}^{u\times d}\) and \(O,O_{up},O_{res}\in {\mathbb {R}}^{d\times 2d}\) are weight matrices. The prediction vector \(p_{i}\) for an output word is based on the decoder hidden state \(d_{i-1}\), the input context \(s_{i}\) and the embedding of the previous output word \(h_{i-1}\), as in Eq. (23).

$$\begin{aligned} p_{i}=softmax (O_{ot}d_{i-1}+V_{ot}E_{h_{i-1}}+S_{ot}s_{i}) \end{aligned}$$
(23)

where \(V_{ot}\in {\mathbb {R}}^{2l\times d}, O_{ot}\in {\mathbb {R}}^{2l\times u}, S_{ot}\in {\mathbb {R}}^{2l\times 2u}\) are output word embedding matrices. We condition on \(E_{W_{h_{i-1}}}\) and use \(d_{i-1}\) rather than \(d_{i}\), which separates the encoder state progress from \(d_{i-1}\) to \(d_{i}\) from the prediction of the output word \(p_{i}\) in Eq. (24). The token for the output word \(w_{h_i}\) is the one with the highest value in the vector.

$$\begin{aligned} p_{i}=\left[ \max \left ({\bar{p}}_{i,2j-1},{\bar{p}}_{i,2j}\right) \right] ^{\tau }_{j=1,\ldots ,l} \end{aligned}$$
(24)

Training is performed accordingly: since the network is aware of the correct output \(w_{h_{i}}\), it is assigned a larger probability value, as in Eq. (25).

$$\begin{aligned} prob (h_{i}|d_{i-1},s_{i})\propto exp (h_{i}^{\tau }W_{o}p_{i}) \end{aligned}$$
(25)
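
A minimal sketch of the output step in Eqs. (23)–(25): the previous decoder state, the embedding of the previous Hindi word and the attention context are projected into the target vocabulary and normalised with softmax; the predicted token is the highest-probability entry. All dimensions are assumptions.

```python
import numpy as np

u, d, c, vocab = 5, 4, 8, 1000                 # hidden, embedding, context and vocabulary sizes
rng = np.random.default_rng(3)
O_ot = rng.normal(size=(vocab, u))
V_ot = rng.normal(size=(vocab, d))
S_ot = rng.normal(size=(vocab, c))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_prev = rng.normal(size=u)                    # previous decoder hidden state d_{i-1}
e_prev = rng.normal(size=d)                    # embedding of the previously emitted Hindi word
s_i = rng.normal(size=c)                       # attention context vector

p = softmax(O_ot @ d_prev + V_ot @ e_prev + S_ot @ s_i)   # Eq. (23)
next_token = int(np.argmax(p))                 # token of the output word w_{h_i}
```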

The softmax activation function converts the raw vector into a probability distribution whose values sum to 1. We have also used ReLU [71], which combines the inputs to yield the next hidden state and acts as a rectifier; passing this activation function to the model helps predict the target variable more efficiently. The model follows a deep output as suggested by [72] and is visualized in Table 1. A web interface, shown in Fig. 2, was designed to deliver the proposed SHH-MTS as a service.

Table 1 Visualizing model structure on Keras
Fig. 2 Web interface

4 Experimental design

This section contains details of the corpora, model size, parameter initialisation and training performed. Experiments were performed with different epochs, beam sizes and sentence lengths, which lead to the changes in updates, BLEU score, training probability and development probability shown in Figs. 4 and 5.

4.1 Corpora

The system architecture requires parallel as well as monolingual corpora. Firstly, the existing parallel corpora available for Sanskrit–Hindi were gathered, as depicted in Table 2 [73]. These data come from different domains such as news, healthcare, tourism, literature, Wikipedia, judicial and general text. A parallel corpus of the Bhagwad-Geeta, containing 700 slokas along with their Hindi conversion, was also created manually. Further, 50,000 parallel Sanskrit–Hindi sentences were made available on request from the Indian Languages Corpora Initiative (ILCI) project [51]. The entire parallel corpus comprised 162,760 sentence pairs, which were used for training the system. Trained on this alone, the system had low accuracy and its output was not understandable. To overcome this problem, we applied the synthetic-data technique suggested by [74] to 2.3 million monolingual sentences and Sanskrit prose sentences [75], as shown in Tables 2 and 3. Later, further corpora of Sanskrit books, manually developed under the project "Development of Sanskrit computational tools and Sanskrit–Hindi machine translation system (2008–2012)", funded by DeitY, Government of India, under the TDIL program, became available [76]. These data were incorporated to enhance the system, and the entire data set was pre-processed and divided into training, development and test sets as given in Table 4.

Table 2 Parallel and monolingual dataset from different domains
Table 3 Additional monolingual dataset
Table 4 Dataset division into training, development and testing

4.2 Model size

The neural model, along with the extracted linguistic features, was trained on the TensorFlow platform and consists of various parameters with their respective dimensionalities listed in Table 5. The model structure visualization using Keras is depicted in Table 1.

Table 5 Model size

4.3 Parameter initialization

The parameters used to train the model are listed in Table 5. The recurrent weight matrices are initialised as random orthogonal matrices. The bias components have been omitted to keep the equations simpler. The alignment vector (\(V_{a}\)) and the bias components were initialised to 0. The alignment matrices were initialised from a Gaussian distribution with \(mean=0\) and \(variance=0.001\). All other matrices were initialised with the same mean but with \(variance=0.01\).
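
The initialisation scheme above can be sketched as follows: recurrent matrices are random orthogonal, alignment matrices are Gaussian with mean 0 and variance 0.001, the remaining matrices use variance 0.01, and the alignment vector and biases start at 0. The matrix sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def orthogonal(n):
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

u, d, n = 256, 128, 256                               # illustrative dimensions
O_recurrent = orthogonal(u)                           # recurrent weights: orthogonal
W_align = rng.normal(0.0, np.sqrt(0.001), (n, n))     # alignment matrices: variance 0.001
W_other = rng.normal(0.0, np.sqrt(0.01), (u, d))      # all other matrices: variance 0.01
V_a = np.zeros(n)                                     # alignment vector and biases: 0
```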

4.4 Training

In the proposed work, the Keras sequential model is used to process the data. The model is trained on a highly configured GPU machine with 32 GB of RAM to achieve a throughput of approximately 2500 words per second. This speed is not possible on ordinary systems, where one epoch would take approximately two hours; hence we used highly configured GPUs, an NVIDIA GeForce GTX 1050 and a Quadro K6000. Each epoch is a pass over the training and test sets, as shown in Fig. 3, and parameter updates are performed for each minibatch. On increasing the number of epochs, accuracy over the training set keeps increasing, whereas accuracy over the test set fluctuates. The trained network fits the training data, so training accuracy increases; it is affected by the learning rate, regularization and other settings of the model. Because the network modifies its weights only according to the training data, the test data, which are never used for weight updates in the gradient descent algorithm, behave differently. Test accuracy should improve globally over the iterations if the network is learning, but it is not bound to improve at every iteration. The graph was generated by the developed model on the Keras platform and illustrates overfitting of the model to the training data. The training and development probabilities are the average conditional log-probabilities of the sentences in the respective sets.

Fig. 3 Epochs for training and test set

The vanilla stochastic gradient descent (SGD) algorithm has been used with a learning rate updated automatically by Adadelta [77] (parameters \(\rho =0.95\) and \(\epsilon =10^{-6}\)); the Adam optimizer is also employed for stochastic optimization [78]. Normalization is performed [79] for each mini-batch (distributed data set): because the distribution of each layer's inputs changes as the parameters of the previous layer change during training, training becomes difficult, so normalization is used to reduce this internal covariate shift. It allows a higher learning rate by making initialization less critical, and it even reduces the need for dropout by acting as a regulariser. We used mini-batches of 64 sentences, and the gradient was normalized whenever it exceeded the threshold value of 1. Each update took time proportional to its longest sentence; to minimize this time, we sorted and shuffled sentences, retrieving 1500 sentence pairs after every 20th update.
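
In Keras terms, the optimiser settings, gradient clipping at a norm of 1 and the mini-batch size of 64 correspond roughly to the configuration below; the model construction is omitted, and the call names follow the standard TensorFlow/Keras API rather than our exact training scripts.

```python
import tensorflow as tf

# Adadelta with rho = 0.95 and epsilon = 1e-6, gradient norm clipped at 1;
# Adam is a drop-in alternative for the same training loop.
optimizer = tf.keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95,
                                         epsilon=1e-6, clipnorm=1.0)

# model = build_encoder_decoder(...)          # hypothetical builder for the attention model
# model.compile(optimizer=optimizer,
#               loss="sparse_categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(train_inputs, train_targets, batch_size=64, epochs=30,
#           validation_data=(dev_inputs, dev_targets))
```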

Fig. 4 Sentence length affecting a updates, b epochs and c time

Fig. 5 a BLEU score varies with beam size, b development probability varies with sentence length, c training probability varies with sentence length

5 Result analysis and discussion

The performance of the proposed SHH-MTS was evaluated using automated metrics and human evaluators. The sentence length of the corpora affects the number of updates of the model. Figure 4a shows that, as sentence length in the training corpus increases, the number of weight updates during model training increases sharply up to a point (20-word sentences) and then decreases once sentence length exceeds 20 words. Figure 4b shows the effect of sentence length on the number of iterations over the training set, i.e. epochs; the number of epochs decreases after the same point (20-word sentences). Figure 4c shows the effect of sentence length on the time needed to build the model. As depicted, the time fluctuates: for sentence lengths of 10–20 words, training time remains the same, whereas for 20–30 words it increases rapidly. In conclusion, sentence lengths of up to 20 words limit the updates, epochs and time; if the sentence length in the corpus exceeds this limit, a drop is seen in the plotted graphs. Figure 5a shows how the BLEU score varies with beam size. Beam search was used during inference to find the most likely sequence of words for each translation. The beam problem in neural machine translation appears at relatively small beam sizes, especially when compared to traditional beam sizes in statistical machine translation systems. The figure shows that beam sizes 1–4 produce a steady change in BLEU, whereas sizes 5–10 enhance the BLEU score; a large increase in beam size drops the BLEU score. In our experiments we therefore limited the beam size and normalized by sentence length. Figure 5b shows the effect of sentence length on the development probability: it increases only up to a sentence length of 20 words and decreases thereafter, so either splitting longer sentences or normalizing sentences longer than 20 words is an appropriate strategy. Figure 5c models the effect of sentence length on the training probability; as depicted, the training probability decreases, so shorter sentences are modelled better. The automated metrics used, such as BLEU, Word Error Rate (WER), F-measure and METEOR, are covered in detail in the following sections.
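
The beam search used at inference can be sketched as below; step_probs stands in for one decoder step returning a probability distribution over the Hindi vocabulary, the beam width of 5 lies in the range where Fig. 5a peaks, and the length normalisation mirrors the strategy discussed above. The toy decoder step is only there to make the sketch runnable.

```python
import numpy as np

def beam_search(step_probs, start_id, end_id, beam_size=5, max_len=20):
    # Each hypothesis is a (token sequence, accumulated log-probability) pair.
    beams = [([start_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:                      # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            probs = step_probs(seq)                    # distribution over the vocabulary
            for tok in np.argsort(probs)[-beam_size:]:
                candidates.append((seq + [int(tok)], score + float(np.log(probs[tok]))))
        # Keep the beam_size best, normalised by length so short outputs are not favoured.
        beams = sorted(candidates, key=lambda c: c[1] / len(c[0]), reverse=True)[:beam_size]
    return beams[0][0]

rng = np.random.default_rng(5)
toy_step = lambda seq: rng.dirichlet(np.ones(50))      # stand-in for the real decoder
print(beam_search(toy_step, start_id=1, end_id=2))
```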

5.1 Automatic error analysis

  • BiLingual Evaluation Understudy (BLEU) is an important metric for comparing translated sentences against human-generated reference translations, as in Eq. (26). It provides accurate results for longer sentences but fails for shorter ones [80]. The BLEU score was evaluated at each iteration of performance enhancement, as shown in Table 6. Firstly, a simple sequential model was built, giving an accuracy of 10.23%. To improve the model, we applied a Keras model with a bidirectional layer, which significantly improved the accuracy to 29.12%. The result was a readable translation, but it still required significant improvement, so the model was enhanced with gated recurrent unit (GRU) cells, which compute better. Applying different activation functions in the neural model raised the accuracy further, to 56.78% for our proposed system. The accuracy then stagnated despite significant experimentation, but auto-tuning the model gave a further enhancement and a final accuracy of 61.02%.

    $$\begin{aligned} BLEU = min{\left ( 1,\frac{output\_length}{reference\_length}\right) }\left ( \prod ^4_{i=1}precision_{i}\right) ^{1/4} \end{aligned}$$
    (26)

    It computes precision with respect to the human-generated translation without taking any grammatical corrections/errors into account. We experimented with different models to enhance the performance of the MTS; the BLEU score computed for all three models, i.e. RBMT, neural and hybrid, is depicted in Fig. 6a. The model was built in three phases: first the rule-based model explained in Sect. 3.2, then the neural model of Sect. 3.3 and finally the hybrid model combining them, which performs better than the other models and demonstrates the efficiency of the novel technique applied in our proposed work. The BLEU score obtained for our proposed system varies with the beam size, as depicted in Fig. 5a. The development probability varies with the sentence length, as shown in Fig. 5b, and the training probability likewise depends on sentence length.

  • Word Error Rate (WER) It is a metric used to calculate the error rate by comparing MT output with the human translated output as in Eq. (27). The lower the WER, the better the model.

    $$\begin{aligned} WER=\frac{substitutions+insertions+deletions}{reference\_length} \end{aligned}$$
    (27)

    Here, substitution means the replacement of one word by another in a particular sentence, insertion means the addition of words and deletion means the dropping of words. The WER computed for all three models, i.e. RBMT, neural and hybrid, is depicted in Fig. 7a. As seen in the figure, the hybrid model has the minimum WER; BLEU and WER are inversely proportional to each other.

  • F-measure It is a metric for calculating the accuracy and precision of the model, as in Eqs. (28)–(30). It measures the quality or exactness of the output. Mathematically, the calculation of the F-measure requires the precision and recall values. Therefore,

    $$\begin{aligned} Precision& = \frac{Correct}{Output\_Length} \end{aligned}$$
    (28)
    $$\begin{aligned} Recall& = \frac{Correct}{reference\_length} \end{aligned}$$
    (29)

    F-measure

    $$\begin{aligned} F{\text {-}}measure=\frac{2\cdot Precision\cdot Recall}{Precision+Recall} \end{aligned}$$
    (30)

    The F-measure computed for all three models, i.e. RBMT, neural and hybrid, is depicted in Fig. 7b. It is observed that the hybrid model has a higher F-measure than the other models, again demonstrating the efficiency of the proposed technique.

  • METEOR It is used to find the correlation between the machine-translated output and the human-generated reference output, as in Eqs. (31)–(33). This score is also directly proportional to accuracy; a fragmentation penalty reduces the F-mean for scattered matches [81].

    $$\begin{aligned} F\;mean= \frac{10PR}{R+9P} \end{aligned}$$
    (31)

    Here, P is precision and R is recall; F-mean, precision and recall are based upon unigram matches. For longer sequences of matches, a fragmentation penalty is computed. Mathematically,

    $$\begin{aligned} Penalty & = 0.5\left ({\frac{chunks}{unigrams\_matched}}\right) ^3 \end{aligned}$$
    (32)
    $$\begin{aligned} Score & = F\; mean \cdot (1-penalty) \end{aligned}$$
    (33)

    The METEOR score computed for all three models, i.e. RBMT, neural and hybrid, is depicted in Fig. 6b. In the proposed hybrid model, the METEOR value is high compared to the other models due to the high correlation between the words of the output sentences. In conclusion, the hybrid model has higher BLEU, F-measure and METEOR, but lower WER, than the other models. A simplified sketch of these metric computations follows this list.
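
A simplified, token-level sketch of the metrics in Eqs. (27)–(33) is given below; it follows the formulas above rather than the reference implementations of BLEU and METEOR, and the example hypothesis/reference pair and chunk count are illustrative.

```python
import numpy as np

def word_error_rate(hyp, ref):
    # Word-level edit distance divided by the reference length, Eq. (27).
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,            # deletion
                          d[i, j - 1] + 1,            # insertion
                          d[i - 1, j - 1] + cost)     # substitution
    return d[len(ref), len(hyp)] / len(ref)

def precision_recall_f(hyp, ref):
    correct = sum(min(hyp.count(w), ref.count(w)) for w in set(hyp))
    precision = correct / len(hyp)                             # Eq. (28)
    recall = correct / len(ref)                                # Eq. (29)
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. (30)
    return precision, recall, f_measure

def meteor_like(hyp, ref, chunks):
    p, r, _ = precision_recall_f(hyp, ref)
    matched = sum(min(hyp.count(w), ref.count(w)) for w in set(hyp))
    f_mean = 10 * p * r / (r + 9 * p)                          # Eq. (31)
    penalty = 0.5 * (chunks / matched) ** 3                    # Eq. (32)
    return f_mean * (1 - penalty)                              # Eq. (33)

hyp = "राम वन को जाता है".split()
ref = "राम वन जाता है".split()
print(word_error_rate(hyp, ref), precision_recall_f(hyp, ref), meteor_like(hyp, ref, chunks=2))
```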

Table 6 BLEU score of different experiment performed
Fig. 6 Different models, i.e. RBMT, neural and hybrid, across a BLEU, b METEOR

Fig. 7 Different models, i.e. RBMT, neural and hybrid, across a Word Error Rate (WER), b F-measure

Fig. 8 Comparison of the baseline system, i.e. RBMT for Sanskrit–Hindi [49], with the proposed system on various evaluation measures

Performance analysis was also conducted on different categories of Sanskrit sentences using the automated metrics BLEU, Precision, Recall, F-measure, WER, F-mean, Penalty and METEOR, as shown in Table 7. These sentences were run on the developed system, which provides the Hindi output sentences used for performance evaluation along with their reference translations (Table 8).

Table 7 Metric analysis of Sanskrit to Hindi translation

5.2 Human error analysis

In this work, linguistic errors were identified by performing a case study. It includes 15 different grammatical cases, against which Sanskrit sentences were tested. The output generated for the different kinds of sentences is displayed in Tables 9 and 10. The results show that the highest error rate of the proposed hybrid MTS occurs for sentences in the verb category (4%), whereas all other categories have an error rate below 3%.

Table 8 A case study of linguistic analysis of Sanskrit to Hindi translation (cont.)
Table 9 A case study of linguistic analysis of Sanskrit to Hindi translation

5.3 Comparison of proposed system with existing work

The proposed work is compared with the existing work of [49, 51] and [8]. The comparison with [49] on automated metrics is depicted in Fig. 8.

Fig. 9 Human evaluation of the proposed system compared with SaHit

The comparative analysis exhibited in Fig. 10b shows that the error rate of the Sanskrit–Hindi statistical machine translation system (SaHit) [51] is higher than that of the proposed system in Fig. 10a. The system was also compared with the recent research work of [8] in Fig. 10c.

Fig. 10 Comparison of overall error rate and accuracy: a SHH-MTS, b [51], c [8]

The proposed hybrid MTS has 61% accuracy, whereas the accuracy of the existing MTS proposed by [51] is only 27% and of [8] is 57%. The comparison was also performed by human evaluators based on grammatical categories, i.e. Sandhi, Compound, Verb, Over-Generation, Less-Generation, Visarga/Anusvara, Karaka and others. The comparison results show that the proposed system provides more readable and grammatically correct output than the existing work in this domain. The proposed work has also been compared with the existing work based on the adequacy and fluency of the output generated by the MTS, as depicted in Table 10. The proposed work for Sanskrit language processing is novel in technique and in the accuracy achieved in terms of both automated and human analysis.

Table 10 Comparison of proposed system with the existing work [8] based on adequacy and fluency

6 Conclusion and future scope

In our work, three different translation models (rule-based, neural-based and hybrid) have been built and analysed for Sanskrit to Hindi. MT is a challenging task, and the models become complex and time-consuming. Earlier work [8, 9, 51] lacks extensibility, generalizability and adaptability, which the proposed system overcomes. The work developed and presented here is novel and can be applied to any low-resource language pair with minimal linguistic knowledge. It extracts features using linguistic rules and passes these features on to train the recurrent neural network. Performance evaluation on automatic and human measures shows the high performance of the hybrid system, which also performs well in terms of accuracy, speed and response time. The proposed hybrid model is faster and more efficient than the existing rule-based systems: when no rule matches, a rule-based model returns no output, whereas our proposed model always returns the best available translation. The complexity of existing models becomes very high for long sentences, making them practically infeasible at times, but our proposed model is efficient for such cases as well. In future, multiple source languages could be translated into a single target language, and a multilingual platform could be built for the same purpose (Figs. 9, 10).