1 Introduction

Machine translation (MT) is the process of converting text from one natural language to another with the help of computer systems. To date, no MT system that is both fully accurate and domain independent has been developed. MT is a sub-domain of natural language processing, which in turn is a sub-domain of artificial intelligence. This article presents an MT system for translating Sanskrit text into Universal Networking Language (UNL) expressions. The SANSUNL system was the first MT system to convert Sanskrit text into UNL expressions [1]. Sanskrit is one of the oldest languages in the world and is considered well suited to computational processing because of its systematic grammatical structure and comparatively low ambiguity. Sanskrit is one of the 22 scheduled Indian languages. It is written in the Devanagari script and follows the grammar systematized by Panini. Sanskrit has 16 vowels and 36 consonants. Words in Sanskrit represent properties of an object rather than the object itself. Words in Sanskrit are classified into three parts, as shown in Fig. 1.

Fig. 1

Sanskrit word architecture

UNL stands for Universal Networking Language, an intermediate representation of natural language that can be interpreted easily by computers. Researchers now use UNL as an interlingual representation of the source natural language (NL) while translating text into the target language (TL). The UNL system, shown in Fig. 2, consists of two parts: the right part of Fig. 2 shows how information is represented in UNL, and the left part shows the tools used in the UNL system. UNL represents natural-language information as semantic nets in which nodes represent concepts, known as universal words (UWs), and edges among the nodes represent UNL relations together with their attributes. The UW dictionary is the core database needed by the EnConverter system. EnConversion is the process of converting NL text into UNL expressions, and DeConversion is the process of generating NL text from UNL expressions. UNL is one of the youngest languages in the world, developed by the United Nations University for processing natural-language text [2]. Detailed information about UNL is available at http://www.undl.org/.

Fig. 2

UNL system

Long short-term memory (LSTM) is a special type of recurrent neural network (RNN) that overcomes the difficulty basic RNNs have in retaining information over long sequences [3, 4]. LSTMs have an edge over basic RNNs because of their capability to selectively remember patterns over long durations. An LSTM uses a cell structure with three gates to process text. The bidirectional LSTM (BiLSTM) architecture is shown in Fig. 3, in which two LSTM layers are used, one processing the input forward and the other backward [5].
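As a concrete illustration, the gate computations of a single LSTM cell, and the way a BiLSTM concatenates a forward and a backward pass per token, can be sketched in NumPy. This is a minimal sketch with unspecified (e.g. randomly initialized) parameters, not the trained networks used in this work:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the stacked parameters for the
    forget, input and output gates and the candidate cell state."""
    z = W @ x + U @ h_prev + b      # pre-activations, shape (4H,)
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])             # forget gate
    i = sigmoid(z[H:2*H])           # input gate
    o = sigmoid(z[2*H:3*H])         # output gate
    g = np.tanh(z[3*H:4*H])         # candidate cell state
    c = f * c_prev + i * g          # new cell state
    h = o * np.tanh(c)              # new hidden state
    return h, c

def bilstm(xs, params_fwd, params_bwd, H):
    """A bidirectional layer runs one LSTM left-to-right and another
    right-to-left, then concatenates the two hidden states per token."""
    def run(seq, p):
        h, c = np.zeros(H), np.zeros(H)
        out = []
        for x in seq:
            h, c = lstm_step(x, h, c, *p)
            out.append(h)
        return out
    fwd = run(xs, params_fwd)
    bwd = run(xs[::-1], params_bwd)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

A stacked LSTM, by contrast, would feed the hidden-state sequence of one such layer as the input sequence of a second layer.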

Fig. 3

BiLSTM architecture

The stacked LSTM (SLSTM) architecture is shown in Fig. 4; it consists of two LSTM layers, one above the other, that process the input text [6].

Fig. 4

SLSTM architecture

According to Ethnologue: Languages of the World, approximately 7102 languages and thousands of dialects are used by people for communication [7]. Human translation has never been an effective solution at this scale, because qualified translators are scarce, not accessible to everyone, and costly. According to the Census of India 2001, 22 scheduled and 100 non-scheduled languages, with approximately 1600 local dialects, are in use [8, 9]. For the development of a country like India, people have to exchange technology, science, and ideas and work together without any language barrier. MT techniques can remove such barriers effectively, so there is a great need for MT at both the global and the local level.

MT systems (MTS) have uses in almost every field of life, including tourism, the health domain, finance, defense, education, business, government work, web content, and app development. The proposed machine translation system could be used in the teaching and learning of Sanskrit in schools and for research, helping learners appreciate the features of the language (one of the least ambiguous languages, well-structured grammar, divine character, claimed to be best suited for computers as accepted by NASA, a treasury of ancient science and technology, meditation power, rich in named entities). MTS have several benefits over traditional human translation, including high translation speed, lower cost, greater capacity than humans to remember large data, the ability to translate into multiple languages at once in a multi-lingual environment, translation without fatigue, and availability any time, anywhere.

The proposed system is an extension of the previously developed Sanskrit-to-UNL MT system, SANSUNL. The earlier system was an initial attempt to translate Sanskrit text into UNL expressions and focused on the resolution of UNL relations. Some of the limitations of the previous version of SANSUNL are listed below:

  1. Only 35 of the total 56 UNL relations were successfully resolved by the SANSUNL system. The performance of this old version needs to be enhanced at several stages.

  2. The earlier system used only simple grammar rules for POS tagging and no pre-tagged dataset, which resulted in less efficient POS tagging of the input text.

  3. The system did not use any Sanskrit grammar or standard parsing algorithm for recognizing input sentences.

  4. The system was tested on only two datasets; for better evaluation, more datasets need to be used.

The proposed system overcomes the limitations of the previously developed SANSUNL system. The research contribution of this article is manifold; the major contributions are listed below:

  1. A Sanskrit stemmer.

  2. Neural network-based POS tagging.

  3. A Sanskrit grammar for processing Sanskrit text.

  4. Implementation of a CYK parser for the Sanskrit language.

  5. A novel algorithm for generating the parse tree from the CYK parsing table.

  6. Evaluation of the system on four standard datasets.

This article is organized into five sections. Section 1 introduces the Sanskrit language, the UNL system, LSTM, and the motivation and research contributions behind the new machine translation system. Section 2 reviews existing research. The proposed work is presented in Sect. 3. The implementation and results of the proposed system are given in Sect. 4. Section 5 concludes the article.

2 Literature review

A great deal of work has been done by researchers worldwide in the field of machine translation. The focus of this review is on machine translation systems developed using the UNL approach. Several efforts have been made to develop MTS for Indian languages as well as for other world languages, such as Hindi [10], Punjabi [11, 12], Sanskrit [1], Tamil [13, 14], Malayalam [15, 16], English [17,18,19,20], Chinese [21], Vietnamese [22], French [23], Brazilian Portuguese and English [24], and Russian [25].

After surveying machine translation systems based on the UNL approach, the authors reviewed various neural machine translation (NMT) systems and applications of neural networks in different phases of MT development. NMT is an extension of statistical machine translation: it builds a single network that can be tuned to maximize translation quality and performs end-to-end translation. In 2014, recurrent neural network models were proposed in [26, 27] and [28] that used an encoder–decoder approach: the encoder converts the input sentence into an intermediate vector, and the decoder converts the vector into a target-language sentence. In 2015, a new model was proposed [29] that enhances the basic encoder–decoder NMT for English–French translation. Microsoft has provided NMT-based translation support for 21 languages and recently added Hindi [30]. In 2016, Google applied the NMT approach in place of its existing statistical machine translation approach [31]. In 2017, Facebook proposed an NMT implementation using convolutional neural networks and claimed faster performance than systems such as those in [32] and [33]. Amazon has also launched a machine translation system using the NMT approach [34]. An English-to-Punjabi NMT system was proposed in [35] in 2018, and an English-to-Hindi MT system using the NMT approach was proposed in [36]. Deep neural network (DNN) [37] models have shown state-of-the-art performance on complex problems such as speech recognition and image processing, owing to their machine learning capability and parallel computation, as discussed in [38, 39] and [40]. LSTMs have shown significant improvement in processing NL text for tasks such as tagging, classification, and machine translation [41,42,43,44,45,46,47,48,49,50,51,52].

Neural networks process only numeric data, so text must be encoded into numeric form before it can be processed further. The encoding can be done at the character, word, or sentence level, and researchers have proposed several encoding schemes [53]. One-hot encoding, proposed in [54], is a basic scheme for representing words. Word2vec embedding, proposed in [55] and [56], has two models for embedding words: the continuous bag-of-words (CBOW) model and the skip-gram model. Both models use a fixed window size to predict a target word from context words or context words from a target word. GloVe embedding, proposed in [57], also uses global context, in contrast to word2vec, which uses only a local window context. FastText embedding, proposed in [58], uses CBOW for text categorization; it uses sub-word character n-gram information to identify semantic relations among the characters of a word. Embeddings from Language Models (ELMo), proposed in [59], uses two-way language models (forward and backward LSTMs) to embed text into numbers. OpenAI Generative Pre-Training (OAI-GPT), proposed in [60], is used to find the semantics of words in an application context domain; it uses a one-way language model with a transformer to extract semantic features from words. Bidirectional Encoder Representations from Transformers (BERT), proposed in [61], uses a bidirectional transformer to extract semantic knowledge from sentences.
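For instance, one-hot encoding, the simplest of these schemes, can be sketched as follows. This minimal illustration applies it at the character level to a hypothetical ITRANS token:

```python
def build_vocab(symbols):
    """Assign each distinct symbol a stable index."""
    return {sym: i for i, sym in enumerate(sorted(set(symbols)))}

def one_hot(symbol, vocab):
    """Return a vector with a single 1 at the symbol's index."""
    vec = [0] * len(vocab)
    vec[vocab[symbol]] = 1
    return vec

# Character-level example over one ITRANS token.
chars = list("rAmaH")
vocab = build_vocab(chars)
vectors = [one_hot(ch, vocab) for ch in chars]
```

In practice the vocabulary would be built over the whole corpus alphabet, not a single token, so that every input character has a fixed index.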

Character-based encoding has played a significant role in part-of-speech tagging [62, 63]. It has been used in several applications in the field of Natural Language Processing (NLP), including POS tagging [43, 64], morphological analysis [65], parsing [66], and language modeling [67] [68,69,70,71,72].

Fig. 5

Modified architecture of the SANSUNL system

3 Proposed system

Keeping in view the existing MTS, neural network techniques, and encoding schemes, the authors propose an extension of the previous MTS [1] with the addition of a stemmer, a neural network for POS tagging, a Sanskrit grammar, and a CYK parser. The architecture of the proposed system is divided into seven layers, each performing a different task. Figure 5 shows all the layers, with the corresponding operations performed in each: pre-processing and tokenization, POS tagging, parsing, node-list creation, case-marker identification, unknown-token handling, and UNL generation.

3.1 Pre-processing layer

The input to the system can be given either in Unicode format or in Indian Language Transliteration (ITRANS) format. ITRANS text is converted into Unicode, and vice versa, using the online Sanscript tool available at http://www.learnsanskrit.org/tools/sanscript. The input text is tokenized using a regular expression and Java's StringTokenizer class with space as the delimiter, and the individual tokens are stored in an array.
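The same whitespace tokenization can be sketched in a few lines (Python is used here in place of the Java implementation; the regular expression simply splits on runs of whitespace, mirroring StringTokenizer's default behaviour):

```python
import re

def tokenize(sentence):
    """Split a Unicode or ITRANS sentence on whitespace and return
    the individual tokens as a list (the 'array' of the text)."""
    return re.split(r"\s+", sentence.strip())

tokens = tokenize("rAmaH odanam chamasena kapilasya thAlikAyAH khAdati")
```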

3.2 POS tagging layer

Part-of-speech (POS) tagging plays an important role in developing an efficient MTS. It is the process of assigning grammatical roles, such as noun, verb, pronoun, and proper noun, to the words in a sentence. It is more challenging for Sanskrit because little Sanskrit text is available in digital form. Several POS tagsets have been proposed [73,74,75,76]. A comparison of these tagsets is presented in Table 1, based on well-defined criteria: whether the tagset is Sanskrit-specific or common to many languages, fine-grained or coarse-grained, flat or hierarchical, its multi-lingual support, and its basis for tagging.

Table 1 POS tagset comparison

Researchers can access these POS tagsets at the following internet sources:

https://www.sketchengine.eu/tagset-indian-languages/,

http://sanskrit.jnu.ac.in/corpora/JNU-Sanskrit-Tagset.htm,

http://www.ldcil.org/standardsTextPOS.aspx,

http://sanskrit.jnu.ac.in/corpora/MSRI-JNU-Sanskrit-Tagset.htm

and http://sanskrit.jnu.ac.in/cpost/post.jsp, respectively.

The IL-POSTS Sanskrit tagset [77] has been selected for the proposed translation system after analyzing the various POS tagsets in Table 1. The authors adopted two strategies for POS tagging: stemmer-based and neural network-based tagging.

3.2.1 Stemmer-based tagging

Stemming is the process of removing morphological and inflectional endings from words to obtain their base forms. In the first strategy, the proposed stemmer is used to stem the Sanskrit words, and then the corresponding rule from the rule base is applied to find the category of each word in the sentence. The proposed stemmer consists of 774 suffixes and 23 prefixes, classified into three categories: proper nouns (120 suffixes), nouns other than proper nouns (552 suffixes), and verbs (102 suffixes). To validate a word and obtain its category along with case, number, person, and gender information, a set of tagged Sanskrit words and a Sanskrit-to-English dictionary are also used in the stemming process. The output of the proposed POS tagger is then compared with the existing Sanskrit analyser (http://sanskrit.uohyd.ac.in/scl/#). If more than one tag is obtained, the correct tag is selected based on the tags of the preceding and following words.
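A highly simplified sketch of suffix-stripping tagging is shown below. The suffix table here is illustrative only (the actual stemmer uses 774 suffixes, 23 prefixes, and dictionary validation, none of which are reproduced here):

```python
# Illustrative ITRANS suffix rules; NOT the system's actual rule base.
SUFFIX_RULES = [
    ("aH",   {"pos": "noun", "case": "nominative",   "number": "singular"}),
    ("am",   {"pos": "noun", "case": "accusative",   "number": "singular"}),
    ("ena",  {"pos": "noun", "case": "instrumental", "number": "singular"}),
    ("asya", {"pos": "noun", "case": "genitive",     "number": "singular"}),
    ("ti",   {"pos": "verb", "person": "third",      "number": "singular"}),
]

def stem_and_tag(word):
    """Strip the longest matching suffix and return (stem, tags)."""
    for suffix, tags in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], dict(tags)
    return word, {"pos": "unknown"}
```

For example, `stem_and_tag("rAmaH")` would yield the stem `rAm` with a nominative-singular noun tag under these toy rules; the real system would additionally check the stem against its tagged word list and dictionary.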

Fig. 6

POS tagger using LSTM

3.2.2 Neural network-based tagging

Applying neural networks to POS tagging is still a challenging task. In this approach, two long short-term memory models are trained on a tagged Sanskrit dataset (https://gitlab.inria.fr/huet/Heritage_Resources) of approximately four lakh (400,000) word entries. The fields in this dataset consist of Sanskrit words in ITRANS format with grammatical categories (noun, verb, and pronoun) along with types and attributes. Sanskrit words with their grammatical category and attributes are extracted from this dataset using Python's XML parser and stored as Python record files. The architecture of the proposed POS tagger, comprising four modules, is shown in Fig. 6.

  (i) Module 1

    In this module, Sanskrit tokens in Unicode format are first converted into ITRANS format, because the dataset available for training and testing is in ITRANS format; tokens not already in ITRANS format must therefore be converted before proceeding.

  (ii) Module 2

    In this module, the tokens are accepted one by one and converted into vector form using the one-hot encoding scheme [78].

  (iii) Module 3

    In this module, the dataset is divided in an 80:20 ratio for training and testing. Ten models have been built for each of the BiLSTM and stacked LSTM configurations. Switching among the ten models is done as follows:

    Model 1—It predicts the word as noun, pronoun or verb.

    Model 2—If word is predicted as noun, then this model predicts the gender.

    Model 3—If word is predicted as noun, then this model predicts the case.

    Model 4—If word is predicted as noun, then this model predicts the number.

    Model 5—If word is predicted as pronoun, then this model predicts the gender.

    Model 6—If word is predicted as pronoun, then this model predicts the case.

    Model 7—If word is predicted as pronoun, then this model predicts the number.

    Model 8—If word is predicted as verb, then this model predicts the verb root.

    Model 9—If word is predicted as verb, then this model predicts the number.

    Model 10—If word is predicted as verb, then this model predicts the person.

    After performing the training and testing of the models, the encoded tokens are fed into the models.

  (iv) Module 4

    If the output of Module 3 is still ambiguous, rules are applied to resolve the ambiguity and produce the final tagged tokens as output.
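The switching logic among the ten models can be sketched as a simple cascade. In the sketch below, the `models` callables are stand-ins for the trained (Bi)LSTM classifiers, which are not reproduced here:

```python
# Sketch of the ten-model cascade; each entry in `models` stands in
# for a trained LSTM classifier mapping a token to a label.
def tag_token(token, models):
    tags = {"category": models["category"](token)}        # Model 1
    if tags["category"] == "noun":
        tags["gender"] = models["noun_gender"](token)     # Model 2
        tags["case"]   = models["noun_case"](token)       # Model 3
        tags["number"] = models["noun_number"](token)     # Model 4
    elif tags["category"] == "pronoun":
        tags["gender"] = models["pron_gender"](token)     # Model 5
        tags["case"]   = models["pron_case"](token)       # Model 6
        tags["number"] = models["pron_number"](token)     # Model 7
    else:  # verb
        tags["root"]   = models["verb_root"](token)       # Model 8
        tags["number"] = models["verb_number"](token)     # Model 9
        tags["person"] = models["verb_person"](token)     # Model 10
    return tags
```

Only the models relevant to the predicted category are invoked for a given token, which is what makes the ten networks act as one hierarchical tagger.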

3.3 Parsing

The tagged words obtained in the previous section are now parsed to obtain syntactic information about the sentence, such as subject, predicate, and object. Two approaches are used for parsing: shallow parsing and CYK parsing.

3.3.1 Shallow parsing

In this approach, a set of Sanskrit rules and word endings is used to perform the parsing. Sanskrit sandhi rules are applied in reverse to remove the word endings, and then Sanskrit case-marker rules are applied to find the roles of the words in the sentence (subject, object, and verb, along with their person, number, and gender information).

3.3.2 CYK parsing

In the second strategy, a context-free grammar (CFG) is designed for Sanskrit language processing. The existing CYK parsing algorithm [79] is used to generate the parse tree for the Sanskrit grammar.

Sanskrit grammar \(G=\{ N, \Sigma , P, S \}\), where

\(N = \{ S, NP(obj), predicate, NP(conj) \}\) // set of non-terminal symbols,

\(\Sigma =\{ NP(subj), VP, Conj , NP(Ind\_obj)\}\) // set of terminal symbols,

P is the set of production rules.

figure a

S is the start symbol.

Since the CYK parser accepts only grammars in Chomsky normal form (CNF), the CFG is converted into CNF as follows:

\(G_{1}=\{ N_{1},\Sigma _{1}, P_{1},S \}\), where

\(N_{1}=\{ S, NP(obj), predicate, NP(conj), V, X, A, B\}\) // set of non-terminal symbols,

\(\Sigma _{1}= \{ NP(subj), VP, Conj, NP(Ind\_obj)\}\) // set of terminal symbols,

\(P_{1}\) is the set of production rules.

figure b

CYK parsing table. The proposed Sanskrit grammar is implemented using a CYK parser [80]. CYK parsing fills a triangular table for an input string of length ‘m’ and a grammar with ‘p’ non-terminals. The worst-case time complexity of the CYK parser is O\((m^{3})\) and its space complexity is O\((m^{2})\) [81], which compares favourably with other parsing algorithms in the worst case. The CYK parsing process is as follows:

  (a) Initially, the tagged words are given as input.

  (b) Create a matrix of size [N, N], where N is the number of tokens in the sentence.

  (c) Fill the diagonal cells of the matrix with the mapped grammar variables and terminals, in the same order as the tokens appear in the sentence.

  (d) If the right side of a production rule can be partitioned into two parts, write the variable on its left side at position [i, j]; the first part is present at [i, x], where \(\hbox {i}<\hbox {x}<\hbox {j}\), and the second part at [y, j], where \(\hbox {i}<\hbox {y}<\hbox {j}\).

  (e) Whenever there is more than one possibility, this CYK implementation keeps the one discovered later (the last one overwrites all previous reduction decisions in case of overlap).

  (f) Finally, convert the CYK matrix into an actual tree, beginning from the start symbol of the grammar at [0, N] and tracing the children at each point.

If the input sentence is processed successfully by the proposed grammar, the parse table is used to generate the Sanskrit parse tree with the help of the proposed Algorithm 1; otherwise, control passes to the shallow parser of Sect. 3.3.1.
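The recognition part of the steps above can be sketched as follows. The toy grammar in the usage example is illustrative only, not the full Sanskrit grammar defined earlier:

```python
def cyk_parse(tokens, grammar, start="S"):
    """CYK recognition for a CNF grammar given as {lhs: set of rhs
    tuples}; unary tuples are terminal rules, binary tuples pair two
    non-terminals. Returns (accepted, table)."""
    n = len(tokens)
    table = [[set() for _ in range(n)] for _ in range(n)]
    # Diagonal: apply the terminal (unary) rules to each token.
    for i, tok in enumerate(tokens):
        for lhs, rules in grammar.items():
            if (tok,) in rules:
                table[i][i].add(lhs)
    # Upper triangle: apply binary rules over every split point.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for lhs, rules in grammar.items():
                    for rhs in rules:
                        if (len(rhs) == 2 and rhs[0] in table[i][k]
                                and rhs[1] in table[k + 1][j]):
                            table[i][j].add(lhs)
    return start in table[0][n - 1], table

# Toy CNF grammar over pre-tagged tokens (illustrative only).
toy = {
    "S": {("NPsubj", "X")},
    "X": {("NPobj", "PRED")},
    "NPsubj": {("np_subj",)},
    "NPobj": {("np_obj",)},
    "PRED": {("vp",)},
}
ok, table = cyk_parse(["np_subj", "np_obj", "vp"], toy)
```

Turning the filled table into an actual parse tree then requires back-pointers from each cell to the split that produced it, which is the role of the proposed Algorithm 1.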

figure c
Fig. 7

Structure of node list

3.4 Node-list creation and universal word matching

In this section, a node list is created from the parsed text generated in the previous section. Each node consists of a Sanskrit tagged word with its English equivalent, together with the syntactic and/or semantic attributes obtained so far (to be updated in the next section). Figure 7 shows the structure of the node list. Each node is selected from the list in turn and searched for in the Sanskrit–UW dictionary. The dictionary may contain multiple entries for one word, depicting its different aspects in a sentence. To resolve this ambiguity and select the correct word, the grammatical attributes obtained from the POS tagger and parser are used. After the correct word is selected from the dictionary, the attributes in the node list are updated. This process is repeated until all nodes in the list have been processed. If a node word does not exist in the dictionary, it is marked as UNK (unmatched word) and the user is asked to update the dictionary.
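Dictionary matching with attribute-based disambiguation can be sketched as below; the dictionary contents and entry fields are hypothetical stand-ins for the actual Sanskrit–UW dictionary:

```python
# Hypothetical Sanskrit-UW dictionary: each headword may carry several
# candidate entries, distinguished by grammatical attributes.
UW_DICT = {
    "rAmaH": [{"uw": "Ram(iof>person)", "pos": "noun"}],
    "odanam": [{"uw": "rice(icl>thing)", "pos": "noun"},
               {"uw": "rise(icl>do)",    "pos": "verb"}],
}

def match_universal_word(node):
    """Pick the dictionary entry whose attributes agree with the node's
    tagger/parser attributes; mark the node UNK when nothing matches."""
    for entry in UW_DICT.get(node["word"], []):
        if entry["pos"] == node.get("pos"):
            node["uw"] = entry["uw"]
            return node
    node["uw"] = "UNK"
    return node
```

The real system compares more attributes than the part of speech alone (case, number, gender, and so on), but the selection principle is the same.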

3.5 Case marker identification

Sanskrit is a morphologically rich language. Unlike in English, prepositions in Sanskrit are expressed through kaaraka (case) and are attached as suffixes to words. The kaaraka analyser is used to identify the different roles of words in the sentence, and this information is used to resolve the UNL relations among nodes. The resolved UNL relations are stored in a relation table with corresponding links to the nodes.

3.6 Unmatched word handling

Words for which no corresponding entry is found in the UW dictionary are termed unmatched words and are marked as 'UNK'. To handle such words, either the user is asked to update the dictionary, or the grammatical attributes obtained from the parser are used to resolve the UNL relations for those words.

3.7 UNL expression generation

After all the UNL relations among the different UWs have been resolved, a set of approximately 1500 rules is applied to generate the UNL expressions for the input Sanskrit text.

4 Implementation and results

The experimental setup of the proposed system is shown in Table 2, which lists the hardware, software, neural network architecture, and datasets used.

Table 2 Experimental setup for the proposed system

Step-by-step processing of Sanskrit text by the proposed system is demonstrated with the example below:

रामः ओदनम् चमसेन कपिलस्य थालिकायाः खादति

4.1 Pre-processing and tokenization

The ITRANS form of the above Devanagari text is given in Example 1.

Example 1

(Sanskrit) SS: रामः ओदनम् चमसेन कपिलस्य थालिकायाः खादति

ITRANS: rAmaH odanam chamasena kapilasya thAlikAyAH khAdati

IAST: rāmaḥ odanam camasena kapilasya thālikāyāḥ khādati.

The words of the sentence are tokenized using Java's regex and StringTokenizer classes.

4.2 POS tagging

Figure 8 shows the POS tagging of Sanskrit Tokens.

Fig. 8

LSTM-based tagging

One-hot encoding is used for the character-based embedding fed to the neural network [82]. Figure 9 depicts the word categories (noun, pronoun, and verb) and their attributes (gender, number, person, and type) used for tagging the tokens. Figure 10 shows the tagged output for the sentence tokens.

Fig. 9

Word categories and their attributes

Fig. 10

Tagged Tokens

4.3 Parsing

The tagged tokens are converted back into Devanagari form, processed by the proposed Sanskrit grammar, and a parse tree is generated to obtain the specific role of each word in the sentence. Figure 11 shows the Sanskrit parse tree generated by the grammar.

Fig. 11

Sanskrit parse tree

4.4 Node list creation

The node list created after the parsing phase is shown below:


रामः -> Ram (noun, male, Nominative, Singular, X(NP(Sub))).

ओदनम् -> Rice (noun, male, Accusative, Singular, NP(Obj)).

चमसेन -> Spoon (noun, male, Instrumental, Singular, NP(Ind_Obj)).

कपिलस्य -> Kapil's (noun, male, Genitive, Singular, NP(Ind_Obj)).

थालिकायाः -> Plate (noun, female, Ablative, Singular, NP(Ind_Obj)).

खादति -> eat (verb, Parasmaipadi, Singular, Third Person, predicate).

If any ambiguity persists, the Sanskrit grammar rules are used to disambiguate.

The attributes obtained in the parsing phase are used to identify each word in the UW dictionary, and the UNL attributes are added to the node list. The case marking is done with the help of the kaaraka analyzer.

[रामः] “Ram” (N, M, Nominative, 3S, ANIMT, FAUNA, X(NP(Sub))).

[ओदनम्] “Rice” (N, M, Accusative, 3S, ANIMT, FLORA, NP(Obj)).

[चमसेन] “Spoon” (N, M, Instrumental, 3S, INANI, NP(Ind_Obj)).

[कपिलस्य] “Kapil's” (N, M, Genitive, 3S, ANIMT, FAUNA, NP(Ind_Obj)).

[थालिकायाः] “Plate” (N, F, Ablative, 3S, INANI, KitUtensil, NP(Ind_Obj)).

[खादति] “eat” (V, 3S, Parasmaipadi, Present, @entry, VOA, predicate).

4.5 UNL expression generation

This is the final step, in which the UNL expressions are generated. The nodes are scanned from left to right using a window size of two. The UNL relations are resolved by the same process as in the previous system, but with an enhanced rule set. The final output is shown below:

agt(eat(icl>do).@entry.@present, Ram(iof>person)).

obj(eat(icl>do).@entry.@present, rice(icl>thing).@def).

ins(eat(icl>do).@entry.@present, spoon(icl>thing).@indef).

frm(eat(icl>do).@entry.@present, :01).

pos:01(Plate(icl>thing), Kapil(iof>person)).

In this particular example, only five phases are used, as no unmatched word is found and the case marking is done during the universal-word attribute extraction process.
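The relation-resolution step above can be sketched in simplified form. The case-to-relation table below is illustrative only (the real system applies roughly 1500 rules while scanning the node list with a two-node window), and pairing each case-marked noun directly with the @entry verb is a simplification of that scan:

```python
# Illustrative case-to-relation rules; NOT the system's full rule set.
CASE_RULES = {
    "nominative": "agt",    # agent of the action
    "accusative": "obj",    # object of the action
    "instrumental": "ins",  # instrument
}

def resolve_relations(nodes):
    """Pair each case-marked noun with the @entry verb and emit
    UNL relation strings."""
    verb = next(n for n in nodes if n["pos"] == "verb")
    relations = []
    for n in nodes:
        rel = CASE_RULES.get(n.get("case"))
        if rel:
            relations.append(f"{rel}({verb['uw']}, {n['uw']})")
    return relations
```

Applied to nodes for “Ram”, “rice”, and “eat”, this would emit the `agt` and `obj` relations of the example output above; genitive and ablative handling (the `pos` and `frm` relations with scope :01) needs the fuller rule set.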

Result summary

In this work, ten models have been developed for POS tagging. For training and testing, the whole tagged Sanskrit dataset (https://gitlab.inria.fr/huet/Heritage_Resources) is divided into two sections of 80% and 20%. The performance of both architectures is shown in Figs. 12 and 13. The analysis shows that the BiLSTM architecture outperforms the stacked LSTM.

Fig. 12

Test score of stacked LSTM versus BiLSTM

Fig. 13

Test accuracy of stacked LSTM versus BiLSTM

The proposed system is validated using three standard datasets, DS-1, DS-2, and DS-3. Sentences from these datasets are first translated into Sanskrit manually and then fed to the proposed system; the UNL expressions it generates are compared with the UNL expressions available in the datasets. Dataset DS-4 is also used to evaluate performance. The performance of the proposed system, in terms of BLEU score and fluency score on the four datasets, is shown in Table 3. Figure 14 shows that the proposed system can now resolve 46 UNL relations, compared with the 35 resolved by the previous system. The proposed system reports an efficiency of 95.375%, against 93.18% for the previous system.

Table 3 Evaluation of the updated system
Fig. 14

UNL relation resolution

5 Conclusion

A new Sanskrit-to-UNL EnConverter system has been proposed that enhances the capabilities of the previous SANSUNL system by adding modules for stemming, POS tagging, and parsing. The system reports an average BLEU score of 0.81 and an average fluency score of 3.705, with an overall efficiency of 95.375%. The proposed system can resolve 46 UNL relations, compared with 35 in the earlier version.

In future work, deep neural networks could be applied to generate all UNL relations automatically. For translating Sanskrit compound sentences, a parallel corpus of Sanskrit sentences and UNL expressions, trained with BiLSTM networks, could be used. The proposed parse-tree generation algorithm and POS tagging technique may also be applied to parsing and POS tagging for other languages.