1 Introduction

Natural languages play a vital role in shaping human social behavior, as they provide the mechanism for day-to-day communication among human beings (Fromkin et al. 2011). Natural Language Processing (NLP) comprises three basic components: processing, understanding and generation (Allen 1995). NLP is a sub-domain of Artificial Intelligence (AI), and Machine Translation (MT) is one of the applications of NLP (Rao 1998). MT is the process of translating sentences of one language, designated the Source Language (SL), into another language, designated the Target Language (TL), with the help of computers (Hutchins 1995; Hutchins and Somers 1992; Slocum 1985). The translation may be one-to-one, i.e. from one SL to one TL, known as bilingual translation; one-to-many, i.e. from one SL into many TLs; or many-to-many, i.e. from many SLs to many TLs, known as Multilingual Machine Translation (MMT). The translation may also be unidirectional or bidirectional. Several efforts have been made to review MT systems, with major contributions by Antony (2013), Desai and Dabhi (2021), Garje and Kharate (2013), and Naskar and Bandyopadhyay (2005). Research in the MT field has grown rapidly in the last few decades, so a systematic yet critical evaluation of the available MT techniques, methods and systems is needed. In this article, the authors survey traditional as well as state-of-the-art MT techniques and systems. An effort has been made to identify existing MT approaches, development tools, data repositories, environments, evaluation metrics and platforms.

1.1 Motivation

According to Ethnologue: Languages of the World, approximately 7102 languages and thousands of dialects are in use around the world (Lewis et al. 2015). Human translation has never been an effective way to overcome this language barrier, because human translators are scarce, manual translation is costly, and such services are not accessible to everyone. According to the Census of India 2001, 22 scheduled and 100 non-scheduled languages, with approximately 1600 local dialects, were in use in India (Dorr et al. 2004; Mallikarjun 2010). For the development of a country like India, people have to exchange technology, science and ideas and work together without any language barrier. MT techniques can address these problems effectively. Thus, there is a great need for MT at the global level as well as at the local level in India.

The contributions and novelty of this review article are as follows:

  • A comparison of MT techniques and evaluation methods based on well-defined criteria, used to analyze the existing MT platforms with their characteristics and applications.

  • An analysis of the availability of various language resources and a presentation of the word embedding techniques used in neural machine translation for Indian languages.

  • An exploration of new research areas in the field of machine translation for Indian languages.

1.2 Approaches of MTS

Figures 1 and 2 show different Machine Translation System (MTS) approaches (Dorr et al. 2004; Seasly 2003). Broadly, the approaches can be categorized into five groups: Direct Machine Translation (DMT), Rule-Based MT (RBMT), Corpus-Based MT (CBMT), Knowledge-Based MT (KBMT) and Hybrid-Based MT (HBMT). RBMT is further divided into Transfer-Based MT (TBMT) and Interlingua-Based MT (IBMT), whereas CBMT is divided into Statistical MT (SMT) and Example-Based MT (EBMT). Neural Machine Translation (NMT) is an extension of SMT, as depicted in Fig. 1. Figure 2 shows the level of complexity of the different approaches in the form of the Vauquois triangle; complexity increases from bottom to top.

Fig. 1 MT approaches

Fig. 2 Vauquois triangle

1.2.1 DMT

DMT sits at the bottom of the triangle and requires the least effort. There is no intermediate representation of the source and target languages; only word-to-word matching is performed for the translation. The system may have pre-processing and post-processing phases for morphological analysis of the input sentence and reordering of the target sentence, respectively. The system uses a bilingual dictionary to match SL words with TL words. Figure 3 depicts the DMT approach.

Fig. 3 Direct MT approach
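
The word-to-word matching described for DMT can be sketched in a few lines of Python. This is a minimal illustration only: the dictionary entries (romanized Hindi) and the function name `direct_translate` are hypothetical and are not taken from any of the surveyed systems.

```python
# Toy bilingual dictionary (hypothetical entries, romanized Hindi for readability).
BILINGUAL_DICT = {
    "ram": "raam",
    "eats": "khaata hai",
    "mango": "aam",
}


def direct_translate(sentence, dictionary):
    """Translate word by word; words missing from the dictionary pass through unchanged."""
    words = sentence.lower().split()
    return " ".join(dictionary.get(word, word) for word in words)


print(direct_translate("Ram eats mango", BILINGUAL_DICT))
# prints: raam khaata hai aam
```

The output keeps the English word order, which illustrates why a DMT system may need a post-processing phase to reorder the target sentence.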

1.2.2 TBMT

In this approach, after morphological analysis of the input sentence, syntactic and semantic analysis is performed using the SL dictionary to determine the grammatical structure and generate a parse tree. The system uses a set of transfer rules to transfer the SL parse tree into a TL parse tree with the help of a bilingual source-target dictionary. The TL text is then generated according to the grammar of the TL using syntactic and semantic generator modules and the TL dictionary. The working of the TBMT approach is depicted in Fig. 4.

Fig. 4 Transfer based MT approach
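
A transfer rule can be illustrated with a small Python sketch. It assumes the analysis phase has already labelled each word with a role (a flat stand-in for a parse tree) and applies one hypothetical rule that maps English SVO order to the canonical SOV order of Hindi. The rule table, lexicon and function name are illustrative assumptions, not the rules of any surveyed system.

```python
# Hypothetical transfer rule: source role order -> target role order.
TRANSFER_RULES = {("S", "V", "O"): ("S", "O", "V")}

# Toy bilingual lexicon (romanized Hindi, illustrative only).
LEXICON = {"ram": "raam", "eats": "khaata hai", "mango": "aam"}


def transfer(parsed, rules, lexicon):
    """Reorder (role, word) pairs with a transfer rule, then substitute target words."""
    roles = tuple(role for role, _ in parsed)
    target_order = rules.get(roles, roles)  # fall back to the source order
    word_for_role = dict(parsed)
    return " ".join(
        lexicon.get(word_for_role[role], word_for_role[role]) for role in target_order
    )


parsed_sentence = [("S", "ram"), ("V", "eats"), ("O", "mango")]
print(transfer(parsed_sentence, TRANSFER_RULES, LEXICON))
# prints: raam aam khaata hai
```

A real TBMT system would operate on full parse trees and many rules; the sketch only shows where structural transfer and lexical transfer fit.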

1.2.3 IBMT

In this approach, the SL text is analysed and converted into an intermediate, language-independent code, from which the TL text is generated. Because the intermediate representation is independent of both the SL and the TL, it can be used in multilingual machine translation. The language analyser depends on the SL, and the target language generator depends on the particular TL. The functioning of IBMT is shown in Fig. 5.

Fig. 5 Interlingua based MT approach
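
The benefit of a language-independent representation for multilingual translation can be seen with a little arithmetic. The sketch below assumes one analyser and one generator per language for the interlingua design, and one separate system per ordered language pair otherwise; the function names are hypothetical.

```python
def direct_pair_systems(n_languages):
    """Systems needed when every ordered language pair gets its own translator."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages):
    """Modules needed with one analyser and one generator per language."""
    return 2 * n_languages


for n in (2, 5, 22):
    print(n, direct_pair_systems(n), interlingua_modules(n))
# prints:
# 2 2 4
# 5 20 10
# 22 462 44
```

Under these assumptions the interlingua design pays off only once more than a few languages are involved, which is the multilingual setting the approach targets.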

1.2.4 SMT

In this approach, statistical or probabilistic techniques are applied to machine translation system development. The approach has two major components: the language model and the translation model. The language model gives the probability of occurrence of strings of words in a language, while the translation model gives the conditional probability that a word in the target language translates a word in the source language. Multiplying the probability of occurrence of a word in the SL by the conditional probability of the corresponding word in the TL gives the probability of that source-target word pair occurring in the corpus available for translation. This method requires a large amount of data and very complex statistical techniques to perform the translation. The efficiency of the system increases as more training data and parallel corpora become available for the language pair. Translation can be performed at the level of words, phrases, sentences or hierarchical phrases. The language model is generally an N-gram model, which predicts the next word of the text given the previous words. The working process of the SMT approach is presented in Fig. 6.

Fig. 6 Statistical MT approach
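
The N-gram idea can be made concrete with a bigram model estimated by simple counting. The sketch below is a minimal illustration under stated assumptions: the three-sentence corpus (romanized Hindi) is invented, no smoothing is applied, and the function names are hypothetical. Real SMT toolkits use far larger corpora and smoothed estimates.

```python
from collections import Counter


def train_bigram_model(corpus):
    """Return a function giving the maximum-likelihood estimate of P(word | previous word)."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        context_counts.update(tokens[:-1])  # only tokens that have a successor
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))

    def prob(prev, word):
        if context_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob


def sentence_probability(sentence, prob):
    """Multiply the bigram probabilities along the sentence."""
    tokens = ["<s>"] + sentence.split()
    p = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        p *= prob(prev, word)
    return p


corpus = ["raam aam khaata hai", "raam paani peeta hai", "sita aam khaati hai"]
prob = train_bigram_model(corpus)
print(sentence_probability("raam aam khaata hai", prob))  # about 0.167
print(sentence_probability("aam raam khaata hai", prob))  # 0.0, unseen word order
```

A language model of this kind lets the system prefer fluent target word orders over unlikely ones when choosing among candidate translations.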

1.2.5 EBMT

The basic translation principle of this approach is analogy. It does not require a huge corpus; it needs a bilingual corpus of stored examples and a matching algorithm to find the stored translation that best matches the source language sentence. Generally, EBMT does not require a detailed grammar rule base; it uses only the stored examples and the matching algorithm to find the closest match for the given input sentence. The architecture of the EBMT approach is shown in Fig. 7.

Fig. 7 Example based MT approach
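
The matching step of EBMT can be sketched with Python's standard `difflib` module. The stored examples below are invented, and word-level `SequenceMatcher` similarity is only one possible matching algorithm chosen for illustration; the surveyed systems may use different measures.

```python
from difflib import SequenceMatcher

# Hypothetical bilingual example base (English -> romanized Hindi).
EXAMPLES = {
    "ram eats a mango": "raam aam khaata hai",
    "ram drinks water": "raam paani peeta hai",
    "sita reads a book": "sita kitaab padhti hai",
}


def closest_example(source, examples):
    """Return (stored source, stored translation, similarity) for the best word-level match."""

    def similarity(candidate):
        return SequenceMatcher(None, source.split(), candidate.split()).ratio()

    best = max(examples, key=similarity)
    return best, examples[best], similarity(best)


match, translation, score = closest_example("ram eats a banana", EXAMPLES)
print(match, "->", translation, score)
# prints: ram eats a mango -> raam aam khaata hai 0.75
```

After retrieval, an EBMT system would still need to adapt the retrieved translation, here by replacing the word for "mango" with the word for "banana".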

1.2.6 KBMT

This approach extracts linguistic information from the SL and stores it in a knowledge base used for translation. Information extraction relies on bilingual dictionaries, language structure, stored translation information, domain-specific dictionaries, etc. Figure 8 depicts the architecture of the KBMT approach.

Fig. 8 Knowledge based MT approach

Each approach has its own advantages and disadvantages, so hybridization of two or more approaches might give better translation quality. Hence researchers are focusing on hybridizing approaches at different levels when developing MTS. The MTS approaches are compared on a set of well-defined criteria in Table 1. The RBMT approach gives better results than the other approaches, but it needs deep linguistic knowledge and more time to create translation rules.

Table 1 Comparison of MT approaches based on several criteria

The Corpus-Based Machine Translation (CBMT) approach performs better than DMT for long sentences, but it requires a large text corpus for both SL and TL, statistical tools and algorithms to handle it, and high computational power for MTS development. The DMT approach is better suited to translating single-clause sentences and requires less development time. Neural Machine Translation is an emerging technique and reports results similar to the present state-of-the-art MTS (Hassan et al. 2018; Wu et al. 2016).

Hybridization of CBMT and RBMT can be done based on confidence estimation and classification (Christopher and Rao 2010). However, such hybridization requires a large corpus of parallel sentences from which to extract translation rules covering all aspects of natural language. To overcome this problem, Recursive Chain-Learning (RCL), Genetic Algorithms or Neural Networks can be applied over the existing systems (Echizen-Ya et al. 2004). The RBMT approach is not effective for translating fixed patterns, because conventional syntactic analyzers cannot recognize such patterns (collocations, idioms and compound nouns). To address this, specific pattern recognition modules can be added to existing RBMT-based systems, which reduces the load on the POS tagger and parser and helps resolve word sense ambiguities (Jung et al. 1999). Other hybrid combinations are explained in Sects. 4.1 and 4.2.

The rest of the article is organized as follows. Section 1 gives the introduction to MT, the motivation, the contributions of this article and the approaches of MT. Section 2 describes the evolution of MT in general as well as for the English, Hindi and Sanskrit languages. Section 3 explains the survey methodology adopted for the current work. Section 4 describes the outcomes obtained from various MT systems. State-of-the-art MTS platforms, parsing and language modeling tools, and available corpora are discussed in Sect. 5. Section 6 highlights the role of neural networks in machine translation with some recent examples of MT systems based on the NMT approach, and Sect. 7 describes MT evaluation methods and platforms with their characteristics. Section 8 provides research avenues arising from this work and recommendations for new researchers. Finally, concluding notes are given in Sect. 9.

2 Evolution of MTS

2.1 Evolution of MTS in general

The history of machine translation started in the 17th century, when Descartes and Leibniz proposed the concept of mechanical dictionaries based on universal numerical codes. However, actual proposals for machine translation came in the 20th century. Figure 9 shows the general development of machine translation in five phases (Hutchins 1995; Hutchins and Somers 1992).

Fig. 9 MT evolution in general (Cho et al. 2014; Hutchins 1995; Hutchins and Somers 1992; Kalchbrenner and Blunsom 2013; Sutskever et al. 2014)

2.2 MTS development in Indian perspective

MTS development for Indian languages started in the 1990s. Figure 10 shows various MTS developed for the English, Hindi and Sanskrit languages based on different approaches.

Fig. 10 Evolution of MT in Indian perspective based on different approaches

The domain, efficiency, features and research group associated with each of these MTS are explained in Sect. 4. Initially, because online corpora for Indian languages were scarce compared to other languages, the DMT and RBMT approaches were used for developing MTS among Indian languages, although some CBMT-based MTS for English to Indian languages or Indian languages to English were also developed. Hybridization of different approaches for developing MTS started in 2003. From 2009 to 2014 the RBMT approach was used extensively for MTS development. From 2016 onwards, the use of CBMT has increased due to the application of the NMT approach in MTS. The hybrid approach was also used in parallel with RBMT and CBMT in a few MT systems during the same period. Among hybrid systems, Artificial Neural Network (ANN) and Quantum Neural Network (QNN) techniques outperform other combinations. Overall, the RBMT approach dominates the other approaches in the Indian MT development scenario.

3 Survey process

The survey approach in this article follows the guidelines given in Budgen and Brereton (2006), Kitchenham et al. (2009) and Moher et al. (2015). The stages of the survey process are planning, execution, analysis of results, documentation of results and highlighting of research gaps. Planning includes creating an effective research question framework, shown in Table 2, and identifying sources of articles, discussed in Sect. 3.1. Execution includes the search criteria shown in Table 3 and the inclusion and exclusion criteria for articles.

Table 2 Research question framework
Table 3 Search strategy

3.1 Information sources

A broad perspective is essential for broad coverage of literature as suggested by Kitchenham et al. (2009) and Budgen and Brereton (2006). So the following electronic sources were used for searching the relevant articles for the survey:

3.2 Searching criteria

All the articles retrieved from the electronic sources include the token “Machine Translation”, which makes searching for relevant articles time-consuming and challenging because such articles are vast in number. A search strategy is therefore needed to include as many related articles as possible with ease and in less time. One such approach is presented in Table 3; even so, some relevant papers might not have been included in this survey, possibly because these keywords were missing from their abstracts. Work on MT for Indian languages started in the 1990s, and the current survey includes articles from journals, conferences, workshops, seminars, technical reports and symposiums from 1990 to February 2021.

3.3 Inclusion/exclusion criteria

The process of including or excluding articles in the current survey is shown in Fig. 11. In Phase-1, articles were excluded based on their titles; the exclusion percentage at this stage was 28%. In Phase-2, 1057 articles were carried forward from the original database of 1500 articles, and after studying their abstracts, only 410 were selected for the next phase based on their relevance to the field of machine translation. In Phase-3, after reviewing the full text of the 410 articles, only 220 moved to the next phase and the rest were excluded. In Phase-4, exclusion was based on whether the MT involved the English, Hindi or Sanskrit languages, and finally 118 articles were included in the current survey.

Fig. 11 Inclusion/exclusion criteria

4 Results and discussion

This article examines the existing literature in the field of MT against the research questions in Table 2 and presents the answers to these questions as outcomes. Of the 118 articles, 45% were published in journals and 55% in conferences, workshops, summits, lecture series and technical reports. The following sub-sections give an outcome-based analysis of various MTS, examined by approach, domain and development year.

4.1 Machine translation system for Hindi and Sanskrit languages

Hindi and Sanskrit both belong to the Indo-Aryan language family, a subgroup of the Indo-European language family. Both languages have free word order, unlike English, which follows Subject–Verb–Object (SVO) word order. Hindi and Sanskrit both use the Devanagari script and share many common features.

Sanskrit is one of the oldest languages in the world and is treated as a holy language in India. In the past, it was the language of educated people and was used as a major language of communication, literature, education, administration and spiritual activities. The treasure of Sanskrit includes not only scientific, mathematical, philosophical, medical, poetic and religious works but also India’s spiritual and cultural heritage. Several languages, Indian as well as foreign, have emerged from Sanskrit. The number of Sanskrit users has decreased gradually with time. Recently the Indian government and some non-governmental agencies have started to promote Sanskrit so that more people can be associated with this beautiful, spiritual and powerful language. Several efforts have been made around the world to develop MTS for Sanskrit. Based on Panini grammar, several tools for Sanskrit analysis, parsing and generation have been developed by different research groups. The Special Center for Sanskrit Studies at Jawaharlal Nehru University, New Delhi (Prof. Girish Nath Jha), the University of Hyderabad (Dr. Amba Kulkarni), IIT Bombay (Prof. Pushpak Bhattacharya), IIT Kanpur (Prof. RMK Sinha and Pawan Goyal) and Banaras Hindu University have been the core places for the development of Sanskrit language processing tools.

Hindi is regarded as the fourth most spoken language in the world and is also morphologically rich (Lane 2016). Different research groups have been working to develop MTS for the Hindi and Sanskrit languages following various MTS approaches. Tables 4 and 5 provide an overview of such MT systems based on several criteria, including the approach used, year, language pair, features, domain and efficiency. The following sub-sections discuss these systems by development approach and suggest ways to improve their efficiency.

Table 4 Overview of Hindi MTS
Table 5 Overview of Sanskrit MTS

4.1.1 DMT based MTS

Three MTS based on the DMT approach have been included in this survey (Dubey 2019b; Dubey et al. 2013; Goyal and Lehal 2010). The main drawback of these MTS is that they cannot resolve word sense ambiguities, resolve context, or translate complex sentences, because the DMT approach follows a word-to-word replacement strategy. These issues can be resolved either by combining DMT with other approaches or by enriching the lexicon with more syntactic and semantic attributes.

4.1.2 CBMT based MTS

Four MTS based on the CBMT approach have been included for review (Jain et al. 2001; Sachdeva et al. 2014; Sinha 2004; Sinha and Thakur 2005). The problems of Named Entity Recognition (NER) and out-of-corpus translation in Jain et al. (2001) were resolved by Sinha (2004) by adding special modules, each handling a particular problem. This modular approach makes the system more scalable and flexible. The problem of polysemous verbs in Sinha and Thakur (2005) can be resolved by adding a special module as done in Sinha (2004), by using a finite-state automaton approach, or by enhancing the capability of the POS tagger. The issue with Sachdeva et al. (2014) is feature extraction from the dataset, which can be resolved with the help of deep neural networks (LSTM, RNN, CNN). Systems based on NMT have also been developed (mujadia-sharma-2020-nmt; kumar2019augmented; singh2020corpus; Laskar et al. 2020). Evaluations of two MTS have also been covered (Goyal and Lehal 2009; Dungarwal et al. 2014). Other evaluation metrics such as METEOR, NIST and ROUGE-L/W/S can be applied to validate these systems.

4.1.3 RBMT based MTS

Several MTS and MT tools based on the RBMT approach have been considered for review. The MTS using UNL as the interlingua had issues of scalability and a limited rule base, which can be addressed by the learning and feature extraction capabilities of neural networks even without deep knowledge of the SL and TL (Singh et al. 2007). The MTS based on GB theory was able to translate only simple sentences; its capability can be enhanced by applying the minimalist approach and generating the transfer rules using either SMT or NMT (Choudhary and Singh 2009). Hindi to Sanskrit and Sanskrit to Gujarati translation systems (Bhadwal et al. 2020; Raulji and Saini 2019) have been discussed. The efficiency of the Sampark MTS was enhanced with the help of the Memcached technique, an improvement that could also be pursued with LSTM network models (Christopher and Rao 2010). The Shakti Standard Format (SSF) used in the system can be applied to other MTS that follow a modular approach (Bharati and Kulkarni 2009). Two MTS for Sanskrit have also been included (Aparna 2005; Upadhyay et al. 2014). Several tools have been developed to process Sanskrit text (Bhadra et al. 2009; Kulkarni 2013; Kulkarni et al. 2010; Kumar et al. 2010). One issue regarding the morphological analysis of feminine nouns was reported by the authors to the developer in 2018 and was later rectified by the developer (Kulkarni 2013). The main issue with these tools is that they are still in the testing phase. Developing automatic testing tools for such systems can help find issues early and fix them quickly.

4.1.4 HBMT based MTS

Five MTS based on the HBMT approach have been included in the survey (Bawa et al. 2020a,b; Goyal and Lehal 2011; Narayan et al. 2014; Sitender and Bawa 2018). Different combinations of MT approaches, namely DMT with RBMT, QNN with RBMT, and RBMT with DMT, have been used for the development of these systems, respectively.

4.1.5 MTS outcomes

Figure 12 shows the outcomes derived from a thorough study of the Hindi and Sanskrit MTS discussed above.

Fig. 12 Outcomes of Sanskrit and Hindi MTS

4.2 Machine translation system for the English language to Indian languages

Several MTS based on different approaches have been proposed for English, which is the third most spoken language worldwide (Lane 2016). This section discusses such systems by development approach; a tabular overview is presented in Table 6.

Table 6 Machine Translation System Based on English Language

4.2.1 RBMT based MTS

Based on the RBMT approach, the various MTS have been categorized into four groups. The first group uses pseudo-interlingua code (Goyal and Sinha 2009; Jayan and Bhadran 2014; Sinha and Jain 2003; Sinha et al. 1995; Sinha 2005), and the second group uses UNL as the intermediate code (Dave et al. 2001; Desai et al. 2014; Sridhar et al. 2016; Udupa and Faruquie 2005). The third group translates the source syntax tree into the target syntax tree using a rule base (Aasha and Ganesh 2015; Bahadur et al. 2012; Darbari 1999; Pathak and Godse 2010). The fourth group uses Panini grammar rules, Sandhi rules, root word generation and pattern generation for translation (Ata et al. 2007; Balyan and Chatterjee 2015; Mishra and Mishra 2012; Reddy and Hanumanthappa 2013).

The issues with these systems are the small size and non-standard form of the analysis and generation rules, poor scalability, limited domain, and the time needed to write the rules. The language processing tools used for the Indian language side, such as stemmers, POS taggers and parsers, are not on par with state-of-the-art tools like the Porter stemmer, Malt parser and Stanford parser. The approach followed in the Porter stemmer to build its rule base should be adopted when building rule bases, which will speed up the process. Language-independent parsers, like the Malt parser or UNL parsers, should be developed for Indian languages, applying the NMT approach to remove the scalability and domain restriction issues.

4.2.2 CBMT and HBMT based MTS

Several MTS based on the CBMT approach have been proposed and are classified into three groups. The first group uses statistical models such as the IBM models, the Bag of Words model and the SRILM language model (OCH F 2007; Sharma 2011; Udupa and Faruquie 2005; Venkatapathy and Bangalore 2009). The second group uses hierarchical phrase-based and simple phrase-based SMT techniques to perform the translation (Ali et al. 2013; Jawaid et al. 2014; Khan et al. 2013). The third group uses the EBMT approach for translation (Badodekar 2003). One system has also used a machine learning technique for an English–Bengali question–answer system (Sheikh and Conlon 2013). The issues with these systems are the limited availability of parallel aligned sentence corpora and the complexity of the statistical techniques needed to build the language and translation models, which can be resolved with the help of the NMT approach or hybridization with other approaches. Machine learning techniques for prediction, such as CRF++, LSTM and RNN, can also be applied. Three MTS based on the HBMT approach have been included. Bharati et al. (2003) and NCST (2008) have used RBMT with SMT, while Narayan et al. (2014) have used RBMT with QNN for translation.

4.2.3 English MTS outcomes

Figure 13 shows the outcomes obtained from the discussion in the above sub-sections and Table 6.

Fig. 13 Outcomes of English to Indian languages MTS

4.3 Research questions vs outcome

Ten outcomes were obtained after discussing the MTS in Subsects. 4.1 and 4.2 and are tabulated in Table 7. The research questions are denoted by Q1, Q2, Q3, Q4, Q5, Q6; O1, O2, O3, O4, O5 are the outcomes for Hindi and Sanskrit MTS, while E1, E2, E3, E4, E5 are the outcomes for English MTS. A four-point scale is used for the mapping, where a value of ‘3’ indicates the maximum contribution and a value of ‘0’ the least contribution of an outcome to a research question, as shown in Table 7.

Table 7 Outcome and research questions

5 Machine translation platforms and tools

This section gives an overview of some statistical tools, parsers and corpora that are available online for developing new MTS and can be downloaded freely, as shown in Table 8. Table 9 shows some popular MTS platforms that could be used for developing new MTS. Various corpora available for Indian languages are also highlighted. The Enabling Minority Language Engineering (EMILLE) corpus contains three types of corpora: parallel, monolingual and annotated. Its parallel corpus contains two lakh (200,000) words for translation between English and Bengali, Gujarati, Hindi, Punjabi and Urdu, in both directions. The corpus also contains twenty annotated Hindi files.

Table 8 Online Resources
Table 9 Popular MTS Platform

The Gyan Nidhi corpus contains fifty thousand pages of parallel text for each of eleven Indian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Telugu, Tamil) and English.

The Open Source Parallel Corpus (OPUS) contains parallel corpora for Assamese, Bengali, Bhojpuri, English, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Marathi, Oriya, Punjabi, Sanskrit, Tamil, Telugu and Urdu.

ILCI (Indian Language Corpora Initiative) contains a corpus of 50,000 parallel aligned sentences in Bangla, English, Hindi, Gujarati, Konkani, Malayalam, Marathi, Oriya, Punjabi, Urdu, Tamil, Telugu in the domain of tourism and health.

6 Role of artificial neural network in machine translation

With the explosive growth of the internet and easy access to high computing power systems, Neural Machine Translation has emerged as a fast-growing approach for developing new MTS (Cho et al. 2014; Kalchbrenner and Blunsom 2013; Sutskever et al. 2014).

The basic components of an NMT system are the encoder and the decoder. NMT uses a single neural network architecture to generate a target sentence for the input sentence, instead of the multiple small components optimized in a pipeline that traditional phrase-based systems use to obtain a translation, as shown in Fig. 14. Initially, the problem with NMT systems was the fixed-size vector generated by the encoder for the input sentence, which was resolved by the attention mechanism of Bahdanau et al. (2014).

Fig. 14 NMT system architecture
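
The fix for the fixed-size encoder vector is to let the decoder compute a fresh, weighted summary of all encoder states at every output step, an idea known as attention. The sketch below is a simplification under stated assumptions: it uses dot-product scoring rather than the additive scoring network of Bahdanau et al. (2014), works on plain Python lists instead of a deep learning framework, and uses hypothetical function names.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Weight each encoder state by its dot-product similarity to the decoder state."""
    scores = [
        sum(d * h_i for d, h_i in zip(decoder_state, h)) for h in encoder_states
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)
    ]
    return weights, context


# Two toy encoder states; the decoder state is most similar to the first one.
weights, context = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights)  # the first weight is the larger one
print(context)
```

Because the context vector is recomputed for each target word, long input sentences no longer have to be squeezed into a single fixed-size vector.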

Different types of neural network architectures have been used for developing new MTS. Recurrent Neural Networks (RNN) are used most often for MTS development because they retain information about earlier inputs while processing a sequence, which lets them memorize features of natural language. An LSTM (Long Short-Term Memory) network, a type of RNN, with two or more hidden layers has been used to extract features from the input text and increase the efficiency of translation (Agrawal 2017).

Machine translation among eleven Indian languages using the NMT approach has been proposed and obtained better results than the traditional SMT approach (Agrawal 2017). Microsoft provides NMT-based translation support for 21 languages and recently added Hindi (Microsoft 2017). Wu et al. (2016) also use the NMT approach in place of the existing SMT approach and show better results than SMT. Facebook in 2017 proposed an implementation of NMT using Convolutional Neural Networks and claimed faster performance than existing systems; the work is presented by Gehring et al. (2016, 2017). Amazon has also launched a machine translation system using the NMT approach (Faes 2018). Important platforms for the development of NMT systems include Tensorflow, Torch, Theano, PyTorch, Matlab, DyNet-lamtram and EUREKA; a list is available at Zhang (2017).

7 MT evaluation methods

The MT evaluation methods are divided into two categories: traditional evaluation methods and automatic evaluation methods.

7.1 Traditional evaluation methods

This section will highlight some of the commonly used methods of MT evaluation (Van Slype 1979) following the traditional approach.

7.1.1 Fluency test

Fluency measures the degree to which the target text is well-formed according to the grammar rules of the TL. A segment that is grammatically well-formed, correctly spelled, adheres to the common use of terms, names and titles, and can be easily interpreted and accepted by a native speaker of the TL is known as a fluent segment (Singh et al. 2007; Goyal 2010). A 4-point scale was used in the evaluation of the Punjabi EnConverter and DeConverter system. The fluency score is assigned using Table 10.

Table 10 4 Point fluency score

7.1.2 Intelligibility evaluation

Intelligibility measures the ease with which the translated text can be understood by the user. In this method, a group of people reads the sentences in various versions (original, human translation with and without revision, MT with and without post-editing), such that each person in the group receives the sentences in only one version. The sentences are ranked on a 4-point scale, shown in Table 11 (Van Slype 1979). The rankings are collected from the readers and averaged to obtain the overall intelligibility rank of the translation. This approach has been applied to the evaluation of Hindi–Dogri MTS, Hindi to Punjabi MTS, Punjabi to Hindi MTS and the SYSTRAN English–French MT system. According to Carroll (1966), intelligibility is measured on a 9-point scale, as shown in Table 12.

Table 11 Sentence ranking by G Van Slype
Table 12 Sentence ranking by J. Carroll

This scale is used in the evaluation of automatic translation of ALPAC system.
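
The averaging step of the 4-point intelligibility evaluation can be sketched as follows. The sketch assumes the scale runs from 0 to 3, as stated for the fidelity scale in this survey, and the function name and sample ratings are hypothetical.

```python
from statistics import mean


def intelligibility_rank(ratings, scale_max=3):
    """Average reader ratings after checking that each lies on the 0..scale_max scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 0 <= rating <= scale_max for rating in ratings):
        raise ValueError(f"ratings must lie between 0 and {scale_max}")
    return mean(ratings)


# Ratings given by four readers to the same version of a translated text.
print(intelligibility_rank([3, 2, 3, 1]))
# prints: 2.25
```

The same helper could be reused for a 9-point scale by changing `scale_max`, provided the scale's lowest value is also adjusted to match the table in use.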

7.1.3 Fidelity/adequacy test

Fidelity measures the amount of information correctly translated from the SL into the TL, i.e. the correctness of the translation. The fidelity rating should be less than or equal to the intelligibility rating and is given on a 4-point scale, in which a rank of ‘3’ means completely faithful and a rank of ‘0’ means completely unfaithful. It has been applied to the evaluation of Hindi–Dogri MTS, the Punjabi DeConverter and English–French MT produced by the SYSTRAN system.

7.2 Automatic evaluation methods

Several automatic evaluation methods have also been proposed. Some of the popular methods are included for the survey and compared based on different metrics as shown in Table 13.

Table 13 Comparison of MT Evaluation Metrics

7.3 MT evaluation platforms

This section provides information about the platforms available to evaluate MT systems on various metrics. Three platforms, ORANGE, Asiya, and IQMT, are described in Table 14.

Table 14 MT evaluation platforms

8 Research avenues and recommendations

A great deal of work has been done in the last three decades on developing MTS for different language pairs (Indian languages) and various domains. The emergence of the NMT approach and the easy availability of high-performance computing resources and corpora for Indian languages have created several new opportunities for researchers in this field. Researchers are now more focused on applying machine learning algorithms to text processing than to other fields and, as a result, several new tools and platforms are available for text processing. Creating a rule base that covers all aspects of a language is a very difficult and time-consuming process, specifically for Hindi and Sanskrit, which are highly inflected and morphologically rich. For the SMT approach, the need for a large corpus is again a big hurdle for languages like Sanskrit. The following are some of the research avenues with which researchers can start their work:

  • Developing a POS tagger or stemmer for Hindi and Sanskrit using a hybrid of rule-based and machine learning techniques.

  • Developing an automatic Karaka analyzer (case marker) for Sanskrit and Hindi that exploits the similarity features among Indian languages, so that only a small effort is required to adapt the system to other Indian languages.

  • Developing a platform like Snowball (http://snowball.tartarus.org) for creating the rule base in an easy and fast manner.

  • Creating small modules that can enhance the performance or reduce the response time of existing MTS, such as a Named Entity Recognition (NER) tool or automatic pre- or post-processing tools using machine learning techniques.

  • Anaphora or cataphora resolution is still a challenging task for the Sanskrit language. So, special modules can be developed for such problems and easily merged with MTS that adopt a modular approach.

  • For MTS that use UNL as an interlingua, resolving UNL relations is a challenging area because thousands of rules are required to resolve all 56 UNL relations (Le Thuyen and Hung 2016). So, machine learning approaches can be applied over the UNL dictionary, together with the case marker module, to predict the possible relations.

  • Development of the Sanskrit Deconverter using UNL is still an open area of research.

  • Development of operating systems for computers using a less ambiguous language such as Sanskrit.

  • Developing tools to extract text from scanned images and build digital corpora for languages like Sanskrit and Punjabi.

Based on the discussions in Sects. 4.1 and 4.2 and the outcomes shown in Figs. 12 and 13 for various MTS, the following recommendations are derived for researchers working in the field of machine translation:

  • The choice of architecture (approach) for developing a new MTS depends on various parameters such as the language pair, the availability of linguistic resources for that pair, the application domain of the MTS, and linguistic knowledge.

  • The SMT approach performs better for long sentences, whereas DMT gives better results for short sentences.

  • Similarity features among Indian languages at the syntactic or semantic level, such as nouns, verbs, declension, prefixes, Karaka analysis for case identification, word formation, and word order, should be exploited as fully as possible when developing MTS among Indian languages.

  • The Interlingua approach needs less effort for developing multilingual MT systems such as Anglabharti, Anubharti, UNL-based MTS, and Sampark. So, an Interlingua representation such as pseudo-Interlingua, UNL expressions, or an intermediate representation of the Sanskrit language could be used efficiently for developing new MTS, with less effort required to add a translator for a new language.

  • Panini Grammar is one of the most unambiguous grammars ever developed for a natural language and is written in a well-structured manner for Indian languages. Panini principles can help in developing new MTS for Indian languages based on the RBMT or HBMT approach.

  • RBMT systems require deep linguistic knowledge of both the source and the target language, and developing them is time-consuming, although the quality of translation using RBMT is better than that of other approaches.

  • The use of statistical tools such as the Moses toolkit, GIZA++, IRSTLM, and SRILM makes the development process much faster than other approaches but requires a large parallel corpus in digital format, so it is applicable only to language pairs for which large digital corpora are available.

  • Google and Microsoft have adopted deep neural networks over the SMT approach and shown that the Neural Machine Translation approach performs much better than SMT and even requires less training data, but needs large computational power to train such systems.

  • For the Sanskrit language, various part-of-speech taggers are available, such as BIS POS, JPOS (JNU), CPOS, and IL POS (Indian Language). These, together with the Gerard Huet Parser, the Constraint-Based Parser, the Deterministic Parser of Amba Kulkarni, and the Indic NLP Library, could be used to develop Sanskrit-based MTS.

  • For the English language, the Stanford Parser is efficient enough to provide the required analysis of English text.

  • The availability of WordNets for English, Hindi, and Punjabi makes the translation task easier and less time-consuming. The shallow parser available on the TDIL website could be used for Indian languages.

  • The fastest way to develop an MTS is the DMT approach. The quality of translation is also good, but the approach is limited to a small domain and requires bilingual dictionaries and a small number of transfer rules, as in the Sampark MTS.

MTS for the Hindi and Sanskrit languages have been evaluated using the traditional methods of MT evaluation, which include the Fluency Test, Intelligibility Test, and Fidelity Test. Most of these tests depend on human evaluation, but they can easily be applied to systems built with the NMT approach as well. In the case of automatic evaluation methods, the BLEU and METEOR scores have become the common standards for MT evaluation. For English to Indian language MTS, BLEU, NIST, and METEOR have been used by the developers.
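To make the BLEU score mentioned above concrete, the following is a simplified sentence-level sketch: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It is not the evaluation code of any surveyed system. The function names, the whitespace tokenization, the absence of smoothing, and the example sentences are assumptions made for illustration; in practice BLEU is usually computed at corpus level with an established toolkit.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        total = sum(cand.values())
        # Clip each candidate count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one zero precision gives BLEU = 0
        log_precisions.append(math.log(clipped / total))
    c = len(candidate)
    # Reference length closest to the candidate length (shorter one on ties).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Hypothetical example with one reference translation.
reference = "the cat is on the mat".split()
print(sentence_bleu("the cat sat on the mat".split(), [reference], max_n=2))
```

Because no smoothing is applied, short sentences with no matching 4-gram score zero under the default `max_n=4`, which is one reason corpus-level aggregation or a smoothed variant is preferred for reporting results.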

9 Conclusion

This article presents an outcome-based systematic survey of machine translation for the English, Hindi, and Sanskrit languages. Out of 1500 research articles, 118 have been included in this survey based on the inclusion-exclusion criteria mentioned in Subsect. 3.3. The results of the survey are presented along different dimensions such as MT evolution, MT approaches, mapping of research questions to outcomes, an overview of MTS based on several criteria (approach, language pair, domain, efficiency, features), state-of-the-art MT toolkits, technological enhancements in MT approaches, and MT evaluation methods and platforms. The latest trends in MTS development are based on neural networks and provide human-like translation quality, as seen in Hassan et al. (2018). However, it is still not feasible to develop an efficient MTS for languages like Sanskrit using the SMT or NMT approach, due to the non-availability of corpora and the complexity of the language. State-of-the-art MTS platforms with MT development tools and corpora have also been discussed. State-of-the-art MT evaluation methods and platforms with specific features have been explored in this survey. Several research avenues have been highlighted for further research in machine translation. Recommendations have also been included to help researchers develop new MTS or enhance existing ones.