1 Introduction

Machine Translation (MT) has evolved over five decades. The earliest approach is Statistical Machine Translation (SMT), which uses predictive algorithms to teach a system to translate text: existing translated text is used to translate the input text into the required language. The major drawback of this approach is that it requires bilingual material for the model to learn from, which also hampers its ability to handle low-resource, obscure languages.

The evolution of the Neural Machine Translation (NMT) approach, in contrast, has addressed the major drawbacks of SMT and yields more accurate translations. Based on deep learning techniques, and aided by existing statistical models, NMT distributes the input data across its layers, enabling faster responses.

Based on the definitions of the two approaches, it is clear that NMT can handle more intricate computations than the conventional statistical model. Yet, despite the wealth of information about NMT, its extension to handling social media content has received meagre attention. Data posted on social networking sites has been the source of major cyber-attacks. Twitter statistics indicate a whopping 313 million tweets posted monthly. This huge volume makes it difficult to analyse information posted in different regional languages. A tweet posted in a regional language can neither be understood nor appreciated by account holders of the same social network belonging to other regions, leaving those users deprived of the information posted. Moreover, a tweet that is anti-national or otherwise threatens social security can be averted by blocking it at the source.

The present study examines Twitter data with the intention of transliterating and translating tweets, so that social media users can be made aware of the content. This enables end users to appreciate or criticize the information based on their own understanding. Transliteration and translation also address the issue of social security: information that could give rise to a security threat can be screened off at the nascent stage.

2 Literature survey

Enormous research has been carried out in the area of translation and transliteration over the past half-decade. Some of the efforts include a Statistical Machine Translation (SMT) methodology for translation via transliteration from Hindi to Urdu (Durrani et al. 2010), where the Bi-Lingual Evaluation Understudy (BLEU) scores of two probability models, a conditional probability model and a joint probability model, were found to be 19.35 and 19.00 respectively. These scores were compared with the BLEU score obtained for a DNN approach to sequence-to-sequence problems using multi-layered Long Short-Term Memory (LSTM) (Sutskever et al. 2014), wherein it was shown that the LSTM methodology not only outperforms SMT-based systems, but also that a standard Recurrent Neural Network (RNN) can be trained more easily and with greater accuracy when the source sentences are reversed. In a sequence-to-sequence LSTM framework, the text can be read one byte at a time, producing span annotations over the inputs (Gillick et al. 2016; Beck and Sales 2001). The identified merits of producing span annotations include easy training of multi-lingual models without additional parameters and a smaller output vocabulary. Most importantly, the resulting models turn out to be more compact than conventional word-based systems, which removes the need for tokenizers for text segmentation. RNN-LSTM has also been used for language modeling and extended to identifying images and generating suitable captions (Al-muzaini et al. 2018); compared with a CNN model, the RNN model gave more promising results. Neural Machine Translation (NMT) from Vietnamese to English using sequence-to-sequence RNN and sequence-to-sequence CNN (ConvS2S) has also been performed, during which the BLEU scores were observed to be fairly good for low-resource data or language pairs (Phan-Vu et al. 2019). Among the commendable efforts related to Text Summarization (TS), the latest work on Abstractive Text Summarization (ATS) using an LSTM methodology based on a Convolutional Neural Network (CNN) (Song et al. 2019) is worth noting: an LSTM-CNN based ATSDL framework is demonstrated and shown to be a state-of-the-art model for semantic and syntactic data structures. Finally, the work most pertinent to the present study is Bangla sentence generation using an LSTM-RNN with a sequence-to-sequence model (Islam et al. 2019), which demonstrates the use of LSTM for predicting the next word in a sentence.

The quality of translation, on the other hand, is measured in terms of the Bi-Lingual Evaluation Understudy (BLEU) score, which indicates the closeness and accuracy of the translation. Aspects of machine translation such as fluency, adequacy and fidelity are evaluated by humans (Hovy 1999; White and O'Connell 1994). Bilingual human evaluation requires two experts, each an expert in one of the languages and able to understand the other. This renders human evaluation costlier than MT, which is instead facilitated by a corpus available for evaluation. Hence, machine translation evaluation comprises two ingredients: a numerical translation-closeness metric and a high-quality corpus of human reference translations (Papineni et al. 2002).

3 Methodology

The majority of the present study is aimed at utilizing the strengths of RNN models, for the obvious reason that they can retain long-term dependencies. The RNN is complemented by the LSTM model, which acts as a mechanism to ensure that information is propagated properly across multiple time steps. Though numerous efforts have been put forth to understand the usage of RNN-based language models (Hochreiter and Schmidhuber 1997; Gers et al. 2000; Mikolov et al. 2010; Mikolov and Zweig 2012; Chelba et al. 2013; Zaremba et al. 2014; Williams et al. 2015; Ji et al. 2015a, b; Wang and Cho 2015), the present study aims at exploring the RNN model for transliteration of Twitter data. Figure 1 depicts the architecture of the RNN model considered for the current study, where \({x}_{0}, {x}_{1}, {x}_{2}\dots {x}_{n}\) indicate the inputs to the neural network chunk, whose outputs \({y}_{0}, {y}_{1}, {y}_{2}\dots {y}_{n}\) are obtained at times \({t}_{0},{t}_{1},{t}_{2}\dots {t}_{n}\) respectively. The input data is compared with the previous data, which makes long-term dependencies visible and allows the current methodology to stand out against conventional methodologies.

Fig. 1 Architecture of RNN model

The RNN methodology computes the outputs indicated in the architecture by iterating the following equations:

$${h}_{n}=\sigma \left({W}^{hx}{x}_{n}+{W}^{hh}{h}_{n-1}\right)$$
$${y}_{n}={W}^{yh}{h}_{n}$$

where \({x}_{n}\) and \({y}_{n}\) are the input and output vectors at time step \(n\), \(\sigma\) is a pointwise nonlinearity, and \({h}_{n}\) is the hidden-state sequence used for mapping between the input sequence \({x}_{0}\dots {x}_{N}\) and the output sequence \({y}_{0}\dots {y}_{N}\).
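A minimal NumPy sketch of this recurrence is given below. The weight shapes, the choice of a sigmoid for \(\sigma\), and the toy dimensions are illustrative assumptions, not details taken from the study:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_hx, W_hh, W_yh, h0):
    """Iterates h_n = sigma(W_hx x_n + W_hh h_{n-1}) and y_n = W_yh h_n."""
    h, ys = h0, []
    for x in xs:                           # one input vector per time step
        h = sigmoid(W_hx @ x + W_hh @ h)   # new hidden state carries past context
        ys.append(W_yh @ h)                # output at this time step
    return ys, h

# toy dimensions: 4-dim inputs, 8-dim hidden state, 4-dim outputs
rng = np.random.default_rng(0)
W_hx, W_hh, W_yh = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(4, 8))
xs = [rng.normal(size=4) for _ in range(5)]
ys, h_last = rnn_forward(xs, W_hx, W_hh, W_yh, np.zeros(8))
```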

The major issue with the conventional RNN methodology is that, although in principle it can handle long-term dependencies, in practice it fails to learn them in the expected manner. Therefore, a special kind of RNN called Long Short-Term Memory (LSTM) is used for the present study; it is explicitly designed to remember and learn information over long durations. Though numerous LSTM architectures are available, the present study uses the forget-gate type of architecture, the forget gate forming an integral part of the LSTM unit. An LSTM therefore comprises a cell, an input gate, an output gate and a forget gate (Fig. 2). The forget gate layer is solely responsible for deciding how much of the previous cell state \(\left({C}_{n-1}\right)\) enters the current cell calculation, by assigning a value between \(0\) and \(1\). The value is computed from the input vector \(\left({x}_{n}\right)\), the output vector of the previous cell \(\left({y}_{n-1}\right)\) and the previous cell state \(\left({C}_{n-1}\right)\), and indicates whether the value of \({C}_{n-1}\) is to be allowed through to the input gate layer or not: a value of \(0\) discards \({C}_{n-1}\), whereas a value of \(1\) retains \({C}_{n-1}\) in full for the current cell calculation. The generation of these values is facilitated by a sigmoid neural layer and a pointwise multiplication operator in the forget gate layer (Fig. 2). The input gate layer performs two activities: first, it decides which parts of the incoming vector data to update with the help of a sigmoid activation function; second, the output of the sigmoid activation function is multiplied pointwise with the output of a hyperbolic-tangent activation function, generating a new set of candidate values that are combined with the gated previous cell state \(({C}_{n-1})\). Finally, in the output gate layer, the input vector \(\left({x}_{n}\right)\) is passed through a sigmoid activation function \(\left(\sigma \right)\), and its output is multiplied pointwise with the updated cell state \(\left({C}_{n}\right)\) passed through a hyperbolic-tangent activation function \((\tanh)\). The sigmoid function thereby decides the portion of the cell state that is put out as \(({y}_{n})\).

Fig. 2 Architecture of LSTM cell
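The gate arithmetic just described can be captured in a short NumPy sketch. This follows the standard forget-gate LSTM equations; the parameter names, shapes and the absence of peephole connections to \({C}_{n-1}\) are simplifying assumptions rather than details taken from Fig. 2:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, y_prev, c_prev, W, U, b):
    """One step of a forget-gate LSTM; W, U, b hold the per-gate parameters."""
    f = sigmoid(W["f"] @ x + U["f"] @ y_prev + b["f"])       # forget gate: 0 discards C_{n-1}, 1 keeps it
    i = sigmoid(W["i"] @ x + U["i"] @ y_prev + b["i"])       # input gate: how much new info to admit
    c_cand = np.tanh(W["c"] @ x + U["c"] @ y_prev + b["c"])  # candidate cell values
    c = f * c_prev + i * c_cand                              # updated cell state C_n
    o = sigmoid(W["o"] @ x + U["o"] @ y_prev + b["o"])       # output gate
    y = o * np.tanh(c)                                       # emitted output y_n
    return y, c

# toy usage with random parameters (4-dim input, 6-dim cell)
rng = np.random.default_rng(1)
W = {g: rng.normal(size=(6, 4)) for g in "fico"}
U = {g: rng.normal(size=(6, 6)) for g in "fico"}
b = {g: np.zeros(6) for g in "fico"}
y, c = lstm_cell(rng.normal(size=4), np.zeros(6), np.zeros(6), W, U, b)
```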

The LSTM mainly aims at estimating the conditional probability \(p\left({y}_{0},{y}_{1},{y}_{2},\dots,{y}_{{N}^{\prime}}|{x}_{0},{x}_{1},{x}_{2},\dots,{x}_{N}\right)\) of the output sequence \({y}_{0},{y}_{1},{y}_{2},\dots,{y}_{{N}^{\prime}}\) given the input sequence \({x}_{0},{x}_{1},{x}_{2},\dots,{x}_{N}\), which is given as:

$$p\left( y_{0},y_{1},y_{2},\ldots,y_{N^{\prime}} \,|\, x_{0},x_{1},x_{2},\ldots,x_{N} \right) = \prod_{n=0}^{N^{\prime}} p\left( y_{n} \,|\, v,\, y_{0},y_{1},\ldots,y_{n-1} \right)$$
(1)
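where \(v\) is the fixed-dimensional representation of the input sequence produced by the encoder LSTM (Sutskever et al. 2014). Equation (1) factorises the output probability one token at a time; the following minimal sketch shows how such a product is evaluated in log space (the per-step softmax distributions `step_probs` are a stand-in for the decoder's actual outputs):

```python
import math

def sequence_log_prob(step_probs, target_ids):
    """log p(y_0..y_N' | x) = sum_n log p(y_n | v, y_0..y_{n-1});
    step_probs[n] is the decoder's softmax distribution at step n."""
    return sum(math.log(step_probs[n][t]) for n, t in enumerate(target_ids))

# toy example: a 3-step decoder over a 4-symbol vocabulary
probs = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
print(sequence_log_prob(probs, [0, 1, 3]))  # log(0.7 * 0.6 * 0.7)
```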

This conditional probability is computed on processed data, the procedure for which is elaborated here. Firstly, the data is collected from a reliable and authentic source: it is retrieved through a Twitter developer account and dumped into a database. In the present study, MongoDB is used for this purpose, a cross-platform, document-oriented NoSQL database program. The data is stored in JSON format, with integrity constraints imposed by the database schema. The data is bulky and handling it poses a challenge, but it is easy to parse the entire data over a field by writing simple queries in MongoDB. Secondly, the data is processed by cleaning, tokenizing and saving only the necessary fields of the raw data into a new collection in the database, without overwriting the raw Twitter data collection. The data is segregated by language and stored into respective collections, as sketched below.
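A hedged sketch of this storage-and-segregation step using pymongo follows; the connection string, database and collection names (`twitter_db`, `raw_tweets`, `tweets_<lang>`) are assumptions for illustration, not the study's actual configuration:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local MongoDB instance
db = client["twitter_db"]                          # hypothetical database name

def segregate_by_language(keep_fields=("text", "lang")):
    """Copy only the needed fields of each raw tweet into a per-language collection."""
    for tweet in db["raw_tweets"].find():          # raw dump from the Twitter developer API
        doc = {k: tweet[k] for k in keep_fields if k in tweet}
        if "lang" in doc:
            db[f"tweets_{doc['lang']}"].insert_one(doc)  # e.g. tweets_en, tweets_hi

segregate_by_language()
```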

On the other hand, vocabularies for the participating languages, say Hindi and English (Table 1), are developed by defining the set of all possible characters, which are encoded as vectors. The vectors are input sequentially to the LSTM model, where the sequence length forms a critical parameter. Since the model works on character-level data, access to dictionaries is eliminated. The LSTM is bidirectional: one layer reads the sequence from left to right while another hidden layer reads it from right to left, and their outputs are concatenated. Concatenating the input vector to the output vector performs similarly to the residual connections introduced in deep residual networks. The batch size was initially set to 30, which resulted in slower performance, so it was later reduced to 10; this was found to be optimal and gave faster performance. The learning rate was kept constant at 0.001. A minimal sketch of this setup follows Table 1.

Table 1 Sample English–Hindi pair for translation and transliteration
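The Keras sketch below illustrates the character-level vocabulary encoding and a bidirectional LSTM layer. Only the batch size of 10 and the learning rate of 0.001 come from the text; the vocabulary, sequence length and layer widths are invented for illustration, and the actual system is a full sequence-to-sequence model rather than this single-layer encoder:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# hypothetical character vocabulary; the real ones are built per language (Table 1)
src_chars = sorted(set("abcdefghijklmnopqrstuvwxyz .'"))
src_index = {c: i + 1 for i, c in enumerate(src_chars)}  # 0 reserved for padding

def encode(text, max_len=40):
    """Map a sentence to a fixed-length sequence of character indices."""
    ids = [src_index.get(c, 0) for c in text.lower()[:max_len]]
    return np.array(ids + [0] * (max_len - len(ids)))

model = keras.Sequential([
    layers.Embedding(len(src_chars) + 1, 64, mask_zero=True),
    layers.Bidirectional(layers.LSTM(128)),  # left-to-right and right-to-left reads, concatenated
    layers.Dense(len(src_chars) + 1, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy")
# model.fit(X, y, batch_size=10, ...)  # batch size 10 as found optimal in the text
```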

The results obtained using the above model are assessed using the BLEU approach, where the more reference \({\text{n-grams}}\) the researcher uses, the better the accuracy. To obtain the BLEU score, the source language is tested with a test suite suitably designed in the source language, and the sentences are compared against reference translations. This gives the deviation of the candidate translation from the reference.

The BLEU score is based on a modified precision score, which is mathematically defined as (Papineni et al. 2002):

$$p_{n} = \frac{\sum_{C \in \left\{ \text{Candidates} \right\}} \sum_{\text{n-gram} \in C} \text{Count}_{\text{clip}}\left( \text{n-gram} \right)}{\sum_{C^{\prime} \in \left\{ \text{Candidates} \right\}} \sum_{\text{n-gram}^{\prime} \in C^{\prime}} \text{Count}\left( \text{n-gram}^{\prime} \right)}$$
(2)

where \(C\) ranges over all the sentences being translated (the candidates), and \({\text{Count}}_{\text{clip}}\) counts the matching n-grams, clipped at their maximum count in any reference translation. The ratio hence indicates how accurately the machine translates compared to human translations: the larger the value of \({p}_{n}\), the closer the translation is to the human translations, implying better accuracy. To penalize overly short translations, a brevity penalty is introduced, which is mathematically given as:

$$B_{p} = \begin{cases} 1 & \text{if}\; c_{l} > r_{l} \\ e^{1 - \frac{r_{l}}{c_{l}}} & \text{if}\; c_{l} \le r_{l} \end{cases}$$

where \({c}_{l}\) is the total (cumulative) length of the candidate translation and \({r}_{l}\) is the length of the reference translation.

The BLEU score is therefore calculated by the expression:

$$\log BLEU=\min\left(1-\frac{{r}_{l}}{{c}_{l}},\,0\right)+\sum_{n=1}^{N}{w}_{n}\log\left({p}_{n}\right)$$

where \({p}_{n}\) is the modified precision score (Eq. 2) and \({w}_{n}\) is its associated weight; the weighted sum of logarithms corresponds to taking a geometric mean of the modified precisions.
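The following sketch strings Eq. (2), the brevity penalty and the final log-BLEU together. Uniform weights \({w}_{n}=1/N\) and the toy sentence pair are assumptions; a library such as NLTK's `corpus_bleu` implements the same measure:

```python
import math
from collections import Counter

def modified_precision(candidate, references, n):
    """Eq. (2): clipped n-gram counts over candidate n-gram counts."""
    ngrams = lambda s: Counter(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    cand = ngrams(candidate)
    max_ref = Counter()
    for ref in references:                       # clip at the max count in any reference
        for g, c in ngrams(ref).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

def bleu(candidate, references, N=4):
    c_l = len(candidate)
    r_l = min((len(r) for r in references), key=lambda l: (abs(l - c_l), l))
    log_bp = min(1 - r_l / c_l, 0)               # brevity penalty in log space
    log_p = sum(math.log(modified_precision(candidate, references, n) or 1e-12)
                for n in range(1, N + 1)) / N    # uniform weights w_n = 1/N
    return math.exp(log_bp + log_p)

hyp = "the cat is on the mat".split()
refs = ["there is a cat on the mat".split()]
print(round(bleu(hyp, refs, N=2), 3))            # BLEU-2 for the toy pair, ~0.489
```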

The BLEU score, however, is very poor for transliterated sentences, because terms that cease to exist in the reference are simply transliterated and therefore cannot match the human reference translation. For this reason, the transliterated data has a lower BLEU score, with an average value of 0.315 (Fig. 4, Table 2).

Table 2 Details of the model used for present study

4 Results

During this study, one million tweets were stored in MongoDB. Storage in JSON format facilitates easy and quick writing of data into, and retrieval from, the database. The required fields of sample raw Twitter data in JSON format are demonstrated below:

  • "text": "RT @varusarath: Happy birthday to the man of steel. Our #ThalaAjith Sir…May u haveee a lonnnngggggggg lonnngggggggg life filled with loa…",

  • "lang": "en".

  • "text":"सामना के मुताबिक अगर कट्टरवादी मुसलमान इस्लामि आतंकवाद कहलाते है।तो कट्टरवादी हिन्दूभी, हिन्दू आतंकवाद होने चाहिए।\n@NewsSamna",

  • "lang": "hi".

The cleaned data is obtained as follows (a sketch of this cleaning step follows the samples):

  • "text": " Happy birthday to the man of steel. Our ThalaAjith Sir May u have a long long life filled with loa".

  • "lang": "en".

  • "text":"सामना के मुताबिक अगर कट्टरवादी मुसलमान इस्लामि आतंकवाद कहलाते है।तो कट्टरवादी हिन्दूभी, हिन्दू आतंकवाद होने चाहिए ",

  • "lang": "hi".

Table 2 lists the specifications considered for the present study, where a stacked LSTM is used. Based on the execution, the perplexity curve is plotted against the time step. It can be observed that the curve decays exponentially and becomes asymptotic as it approaches unity (Fig. 3). This indicates that the model is trained sufficiently and can therefore be used for translation and transliteration. The trained model demonstrates a sequence-to-sequence learning model. Since the average training sentence length was around 18 to 19 and the average test-set sentence length was around 22 to 23, the BLEU scores for both translation and transliteration were found to be in good agreement with the available literature (Ananthakrishnan et al. 2006; Sen et al. 2016) (Table 3). For SMT-based translation with a hierarchical phrase approach, the BLEU score was found to be 13.56, whereas with the LSTM approach the BLEU score is 13.365, a deviation of around 1.4% from 13.56 (Fig. 4). On the other hand, the difference in transliteration precision score between the value in the literature (Ananthakrishnan et al. 2006) and the value obtained in the present study was found to be 0.045, which is quite close to the values reported in the literature (Table 3, Fig. 5).

Fig. 3 Variation of perplexity with time-step
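Perplexity as plotted in Fig. 3 is the exponential of the average per-symbol cross-entropy, so it approaches unity as the loss approaches zero. A minimal sketch (the loss values are illustrative):

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity = exp(average negative log-likelihood per predicted symbol)."""
    return math.exp(mean_cross_entropy)

# as training proceeds the loss falls and perplexity approaches 1 asymptotically
for loss in (2.0, 1.0, 0.3, 0.05, 0.01):
    print(round(perplexity(loss), 3))
```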

Table 3 Comparison of BLEU scores between SMT and LSTM Models
Fig. 4 Comparison of BLEU scores between LSTM translation and baseline model

Fig. 5 BLEU comparison between translated and transliterated data

Sample translated and transliterated data is given as follows:

4.1 Input data

"text":" मुझे भूख लग रह| है”.

"lang": "hi".

4.2 Translated data

"text": "to me hunger feeling is",

"lang": "hi".

4.3 Transliterated data

"text": "Mujhe Bhook lag raha hi",

"lang": "hi".

5 Conclusion and future work

The present work demonstrates how an RNN-LSTM model can be utilized to address social security by identifying improper content posted on social media. It is observed that RNN-LSTM is more accurate than conventional statistical machine translation (SMT) models (Table 3). The BLEU score is one of the reliable parameters for assessing the quality of machine translation; however, it fails to capture the accuracy of transliterated data, because some of the terms to be translated may not exist in the reference data.

The Twitter data stored in the database is parsed and cleaned before being subjected to translation and transliteration. During the course of the work, it was observed that data processing plays a very important role, since Twitter data comprises slang words for which no equivalent word is available in the dictionary; neglecting those words compromises the accuracy of transliteration. The present work can be extended to other social and professional media sites such as Facebook, Instagram, LinkedIn, etc.

The work can also be extended to perform content searches for improper video, audio and image content posted on social media. Video and image data can be obtained from social media developer accounts and used to train the LSTM model to analyse such content.