1 Introduction

Spam takes several forms, such as email and SMS/MMS messages sent in bulk to a mobile device from short codes, various remote numbers, and so on. According to the Text Retrieval Conference (TREC), the term “spam” denotes unsolicited, unwanted email that was sent indiscriminately (Cormack, 2008). These unwanted and unnecessary spam messages are referred to as pervasive spam (Spam, 2015). Industry responses include a GSMA pilot spam reporting program (GSMA Launches SMS Spam Reporting Service, 2011) and the development of Open Mobile Alliance (OMA) standards for mobile spam reporting (Efficient Support Vector Machines for Spam Detection: A Survey, 2015). Email spam travels over the Internet, whereas SMS spam travels over the mobile network (Kim et al., 2013; Torabi et al. 2015). The immense volume of spam flowing through computer networks consumes the storage space of email servers, communication bandwidth, CPU power, and user time. To handle the threat posed by email spam, leading email providers such as Gmail, Yahoo Mail, and Outlook have adopted various machine learning (ML) techniques, such as neural networks, in their spam filters. The fact that email is an extremely cheap way of reaching a great many potential customers serves as a strong motivation for amateur advertisers and direct marketers (Cranor & Lamacchia, 1998). One potential approach to improving spam classification is a spam filter named LingerIG, implemented in 2003 within an email classification system named Linger (Chae et al., 2017).

1.1 Market Trends Resulting in an Increase of SMS Attacks

SMS has served as a money-making machine for mobile operators over the years. According to a survey, countless SMS messages are sent every day. Communication through SMS has its own benefits; for example, all GSM mobile companies support it. Nowadays, ringtones, animations, business cards, logos, and WAP configuration settings can easily be sent by SMS, which has led to an increase in SMS attacks in which malware is sent along with these settings messages. The SMS market, or mobile messaging market, is extremely profitable for mobile operators and is growing rapidly (What YOU can do to control cell phone spam, 2012).

As shown in Fig. 1, people in developed countries such as the USA, UK, and Japan prefer communicating through SMS over other modes of contact. Electronically sent messages are at higher risk and can easily be detected by spammers. Spammers may use these messages to tamper with users' personal data or may harm users by exploiting their premium-rate services (Spam News, 2015; Sao & Prashanthi, 2015). Compared with email spam, SMS spam has grown exponentially, by more than 500% yearly (Benevenuto et al., 2010; GSMA, 2011; Guzella & Caminhas, 2009).

Fig. 1 SMS and IM messages sent in UK each year in billions (Delany et al., 2004)

As per Table 1, a Cloudmark report states that in 2019 the volume of SMS spam varied from region to region (Text Message (SMS) Spam Reporting, 2012). Asia has the highest rate of SMS spam, up to 30%, while North America reports a figure of up to 1%. The volume of spam emails containing malware and other malicious code between the fourth quarter of 2019 and the first quarter of 2020 is depicted in Fig. 2 (Dada et al., 2019; Fonseca et al., 2016).

Table 1 Graph of mobile behavior shown in developed countries from September 2019 to March 2020 (age group 16+)

Fig. 2 Volume of spam emails, 4th quarter 2019 to 1st quarter 2020

2 Related Works

SMS spam has recently become a serious security threat in many countries, where it infringes on individual rights and even harms public safety. Pervasive SMS spam filtering can be carried out using various approaches and methodologies on different programming frameworks. This section analyzes previous work related to detecting and filtering spam messages in pervasive environments (Jaswal & Professor, 2013; Satish kumar, 2013; Malarvizhi & Saraswathi, 2013). It also covers the motivation and findings of previous research papers.

2.1 Background

Reviewing previous research, we found that email spam filtering has received the most attention (Kim et al., 2013; GSMA, 2011; GSMA, 2011a; Sharma et al., 2012; Education & Science, 2013; Johnson et al. 2014). The existing approaches can be described as follows. An operator's server can intercept messages sent at any frequency from phone numbers. This approach has two defects, related to pseudo-base stations and to the possible delay of mass SMS. Alternatively, users can define black lists, white lists, or secure keywords on their mobile phones, at the cost of some negative impact on themselves.

Kou et al. (2020) discuss content-based SMS filtering, another highly effective method that is also cost-effective. MTM is then compared with SVM, and the results show that MTM is the more powerful tool for protecting against SMS spam. The authors also examine feature selection, which is an important component of machine learning and an important step in text classification. You et al. (2020) describe an unsupervised method for intelligently detecting online review spam. Their experiments on TripAdvisor demonstrate the high effectiveness and intelligence of the proposed model, which could significantly benefit online e-commerce. Gopi et al. (2020) propose an improved RBF kernel for SVM that achieves 98.8% accuracy, compared with the existing SVM-RBF classifier and other models. Barushka and Hajek (2020) use cost-sensitive ensemble learning methods with regularized deep neural networks as base learners. Their approach outperforms other well-known algorithms used in social network spam filtering, such as random forest, Naïve Bayes, and support vector machines. Gaurav et al. (2020) propose a novel spam mail detection method based on the document labeling concept, which classifies new messages as ham or spam. Their experimental results show that random forest (RF) has higher accuracy than the other methods. The experimental results of Cekik and Uysal (2020) show that PRFS offers better or competitive performance relative to other feature selection methods in terms of Macro-F1.

Abayomi Alli et al. (2019) report a study whose findings show that most existing SMS spam filtering solutions are still between the “Proposed” status and the “Proposed and Evaluated” status. They also develop a taxonomy of existing state-of-the-art methods and conclude that 8.23% of Android users actually use the existing SMS anti-spam applications. Their study further concludes that researchers need to exploit all security methods and algorithms to secure SMS, thereby improving classification on other short-message platforms. Bahassine et al. (2020) show that their combination significantly improves the performance of an Arabic text classification model. The best F-measure obtained for this model is 90.50%, when the number of features is 900. Asghar et al. (2020) show that combining spam-related features with a rule-based weighting scheme can improve the performance of even a baseline spam detection method. This improvement can benefit opinion spam detection systems, given the growing interest of individuals and companies in separating fake (spam) from genuine (non-spam) reviews of products. Jain et al. (2020) present a comparative analysis of various algorithms on which the features are implemented, together with the contribution of different features to spam detection. After implementation, and according to the set of features selected, an artificial neural network using back propagation works most effectively. Bhat et al. (2020) propose two deep neural network variants (2NN DeepLDA and 3NN DeepLDA) of the existing topic modeling technique Latent Dirichlet Allocation (LDA), with the specific aim of handling large corpora with less computational effort. The two proposed models are used to mimic the statistical process of LDA, and the Reuters-21578 dataset is used in the experiments. Results computed from LDA and from the proposed models are compared using a Support Vector Machine (SVM) classifier. The proposed models demonstrate noteworthy accuracy as well as computational efficiency in comparison with conventional LDA.

3 Message Topic Model (MTM)

The MTM follows latent semantic analysis enriched by probability. Compared with other filtering technologies, MTM can reduce sparsity problems to a greater extent. The multinomial distributions, such as the document-topic and topic-word distributions, are governed by the parameters α and β, which are the hyperparameters of the priors on θ and φ; θ and φ are obtained through Gibbs sampling (Saxena and Payal 2011).
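For orientation only, the roles of α, β, θ, and φ can be written out under the assumption that the MTM uses the standard LDA-style Dirichlet priors; the source does not spell out the generative process, so the indices d (message), k (topic), and n (token position), and the variables z and w, are introduced here purely for illustration:

$$\begin{aligned} \theta_d &\sim \text{Dirichlet}(\alpha) \\ \phi_k &\sim \text{Dirichlet}(\beta) \\ z_{d,n} &\sim \text{Multinomial}(\theta_d) \\ w_{d,n} &\sim \text{Multinomial}(\phi_{z_{d,n}}) \end{aligned}$$

Under this assumption, θ_d is the topic distribution of message d, φ_k is the word distribution of topic k, z_{d,n} is the topic assigned to the n-th token, and w_{d,n} is the observed token.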

4 Problem Statement

Today, network security is more complex than it used to be. A huge increase has been observed in the number of unsolicited commercial advertisements sent to users' mobile phones via text messaging. SMS spam is a notable problem for mobile phone customers, and the recent increase in the spam rate has caused great concern in the community. To deal with this spam problem, many techniques exist that use different kinds of spam filters. Essentially, all of these filters classify messages into the categories of spam and non-spam (ham). Most classifiers decide the fate of an incoming message based on certain words in its data part and sort it accordingly. Two sets, known as test data and training data, serve as the knowledge base that the spam classifier uses to classify messages. The spam problem has been addressed as a simple two-class document classification problem in which the main aim is to filter out or separate spam from non-spam (ham). Because document classification tasks involve large amounts of uninformative data, selecting the most discriminating features to improve accuracy is one of the main objectives, and this thesis work focuses on that task. The primary aim of this work is to study various classification techniques and to compare their performance in the domain of spam message detection. To find out which one is more effective under which set of features, various pre-classified messages are used to train the techniques. The secondary aim of the thesis is to work toward implementing a technique with which spam can easily be detected without data overriding and which also improves on earlier performance by finding the best pair of feature reduction technique and classification algorithm. A large number of such pairs already exist; nevertheless, this work can be considered a step toward that goal.

5 Methodology

To achieve these aims and objectives and to overcome the problems discussed in Sect. 4, various algorithms are needed for clustering, classification, tokenization, and more. This section discusses some of the important algorithms used in this work. The algorithm also aims at accurate prediction, which is a key challenge facing meteorologists all over the world (https://www.developershome.com/sms/smsIntro.asp; Jain & Mallick, 2016,2017; Failed, 2017). The security algorithm provides a high level of security with a smaller key size than other cryptographic techniques (Jain, 2018).

6 K-means

There are several algorithms for data clustering. Clustering, which is simply the partitioning of data into groups, is performed to simplify the data. Although clustering simplifies the dataset, it loses some detail. Clustering is not suitable for infinite streams. Given an input parameter K and a set of n objects, the K-means algorithm partitions the objects into K clusters over l iterations. The time complexity of the K-means algorithm is therefore O(nKl).
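The partitioning of n objects into K clusters over l iterations can be sketched as follows. This is a minimal illustration of the standard Lloyd iteration, not the authors' implementation; the function name `kmeans`, the evenly spaced initial centroids, and the toy 2-D points are assumptions made for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations):
    """Partition `points` into k clusters using a fixed number of Lloyd iterations.

    Initial centroids are taken at evenly spaced positions in the input list
    (an assumption made to keep the example deterministic).
    """
    n = len(points)
    centroids = [points[i * n // k] for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: n * k distance computations per iteration,
        # which is where the O(n * K * l) total cost comes from.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Toy data: two well-separated groups of 2-D points (hypothetical values).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2, iterations=10)
print(labels)
```

Running a fixed number of iterations, rather than testing for convergence, keeps the cost directly proportional to n, K, and l.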

6.1 Term Frequency-Inverse Document Frequency

Term frequency-inverse document frequency is also known as TF-IDF. Text mining techniques such as TF-IDF are used to create tokens or to categorize documents. Term frequency describes how often a particular word occurs in an individual document (Bønes, et al. 2007; www.securelist.com). Because the same word can occur many times across multiple documents, TF-IDF uses inverse document frequency to balance the occurrence count of a particular word.

$$\begin{aligned} \text{TF}(t) &= \frac{\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}} \\ \text{IDF}(t) &= \log_e \left( \frac{\text{Total number of documents}}{\text{Number of documents with term } t \text{ in it}} \right) \end{aligned}$$
$$\text{Value} = \text{TF} \times \text{IDF}$$
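As a worked illustration of the TF and IDF definitions, the following sketch computes the TF-IDF value of a term over a tiny tokenized corpus. The three example messages and the function names are hypothetical, and returning 0 for a term that appears in no document is an assumption added to avoid a division by zero.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency with a natural logarithm."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: unseen terms carry no weight
    return math.log(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF value of `term` for one document of the corpus."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical three-message corpus, already split into tokens.
corpus = [
    "win a free prize now".split(),
    "are we meeting for lunch".split(),
    "free entry win cash now".split(),
]

# "free" occurs in 2 of 3 messages, "lunch" in only 1, so "lunch" scores higher.
print(tf_idf("free", corpus[0], corpus))
print(tf_idf("lunch", corpus[1], corpus))
```

A term that appears in every document gets an IDF of log(1) = 0, which is how the weighting balances out very common words.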

6.2 GNBC (Gaussian Naive Bayes Classifier)

In our study, we present a classifier named the Gaussian Naive Bayes Classifier (GNBC), which combines the Naive Bayes algorithm with a Gaussian distance function. GNBC draws on the probability theory of semantic analysis, and research indicates that it is a more suitable algorithm for SMS spam filtering. The basic differences between GNBC and MTM are as follows. First, the tokens are sufficient to determine the appropriate class (spam or ham), so spam messages are identified accurately. Second, the tokens already fixed for the spam filter do not extend over the sparse matrix, so the data do not override one another and spam and ham messages can easily be identified.
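The source does not give the exact formulation of GNBC, so the following is only a minimal sketch of textbook Gaussian Naive Bayes, in which each feature is modeled per class by a Gaussian with its own mean and variance. The function names `fit_gnb` and `predict_gnb`, the two numeric features per message, the variance floor, and the toy values are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_gnb(X, y, var_floor=1e-9):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor avoids division by zero for constant features (assumption).
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gnb(model, row):
    """Return the class with the highest log posterior for one feature vector."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)


# Hypothetical features per message: [weight of a spam-like token, length ratio].
X = [[0.30, 0.9], [0.25, 0.8], [0.02, 0.2], [0.00, 0.3]]
y = ["spam", "spam", "ham", "ham"]
model = fit_gnb(X, y)
print(predict_gnb(model, [0.28, 0.85]))
print(predict_gnb(model, [0.01, 0.25]))
```

Working with log probabilities avoids numerical underflow when many features are multiplied together.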

In Table 2, the Gaussian distribution shown is normalized so that the sum over all values of x gives a probability of 1. The Gaussian assigns a probability of 0.683 to values within one standard deviation of the mean. The Gaussian distribution is also called the “normal distribution” and is often described as a “bell-shaped curve.” The mean value is a = yz, where

$$\begin{aligned} y & = {\text{number of events}} \\ z & = {\text{probability of any integer value of }}x \\ \end{aligned}$$
Table 2 Gaussian distribution function
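The two properties quoted for the Gaussian, a total probability of 1 and a probability of 0.683 within one standard deviation of the mean, can be checked numerically. The sketch below assumes the usual Gaussian density with mean and standard deviation as parameters; the function names and the integration grid are choices made for this example.

```python
import math


def gaussian_pdf(x, mean=0.0, std=1.0):
    """Density of the Gaussian (normal) distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))


def prob_within(k):
    """Probability that a Gaussian value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Riemann sum of the density over [-8, 8] as a numerical check of normalization.
step = 0.001
total = sum(gaussian_pdf(-8 + i * step) * step for i in range(16001))

print(round(total, 6))
print(round(prob_within(1), 3))  # 0.683
```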

7 Experimental Results

The GNBC classifier is applied to a large corpus, the SMS Spam Collection dataset created by T.A. Almeida et al. Applying the classifier to this dataset allows it to be judged against MTM. The statistics show that, on this dataset, GNBC achieves higher accuracy than MTM and takes less time, without overriding the data. The dataset specified by T.A. Almeida et al. can be downloaded from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. It is available online for research purposes and is widely used. The SMS Spam Collection v.1 (hereafter the corpus) is a set of tagged SMS messages collected for SMS spam research. It contains one set of SMS messages.

7.1 Evaluation Metrics

The metrics measure the percentage of spam detected by the system and the number of misclassifications it makes. Some of the evaluation metrics are:

  • True Positive (TP): The number of positive instances (spam messages) that are correctly classified as positive.

  • False Positive (FP): The number of negative instances (ham messages) that are incorrectly classified as positive.

  • False Negative (FN): The number of positive instances (spam messages) that are incorrectly classified as negative.

  • True Negative (TN): The number of negative instances (ham messages) that are correctly classified as negative.

  • Accuracy (ACC): The proportion of correctly classified instances, namely True Positives and True Negatives, over the total number of classifications.

    $$\frac{{{\text{True Positive}} + {\text{True Negative}}}}{{{\text{True Positive}} + {\text{True Negative}} + {\text{False Positive}} + {\text{False Negative}}}}*100$$
    (1)
  • Precision (P): The fraction of retrieved messages (those classified as positive) that are relevant to the end user.

    $$\frac{{\text{True Positive}}}{{{\text{True Positive}} + {\text{False Positive}}}}$$
    (2)
  • Recall (R): The fraction of the messages relevant to the user (the actual positives) that are successfully retrieved (Figs. 3, 4, 5 and 6).

    $$\frac{{\text{True Positive}}}{{{\text{True Positive}} + {\text{False Negative}}}}$$
    (3)
Fig. 3 Snapshot showing error on command window

Fig. 4 Data without filtering of ham, spam and noisy data

Fig. 5 Data after filtering, in which blue color represents noisy data and red color represents spam data

Fig. 6 Comparison between MTM and GNBC, i.e., accuracy
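The evaluation metrics of this section can be computed from predicted and true labels as in the sketch below. It follows the verbal definitions (accuracy as correctly classified messages over all messages, precision as TP over TP plus FP, recall as TP over TP plus FN), treats spam as the positive class, and uses hypothetical labels; the function names are illustrative and the numbers are not results from the paper.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count TP, FP, FN, and TN, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Percentage of correctly classified messages."""
    return 100 * (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Fraction of messages flagged as spam that really are spam."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of actual spam messages that were flagged."""
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical labels for eight messages.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)            # 2 1 1 4
print(accuracy(tp, fp, fn, tn))  # 75.0
print(precision(tp, fp), recall(tp, fn))
```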

7.2 Experiment 2

The classifier was run again on a separate dataset. Figure 7 shows the unfiltered data, which consist of ham, spam, and noisy data. Figure 8 shows the data with spam in red, ham in green, and noisy data in blue. Figure 9 shows the accuracy of the MTM and GNBC models; GNBC gives more accurate results than MTM. Figure 10 shows the error reported in the command window.

Fig. 7 Snapshot showing error on command window

Fig. 8 Data without filtering of ham, spam and noisy data

Fig. 9 Data after filtering, in which blue color represents noisy data, green color represents ham data, and red color represents spam data

Fig. 10 Comparison between MTM and GNBC, i.e., accuracy

The error obtained is 0.0320.

7.3 Experiment 3

The classifier was run again on a separate dataset.

The error obtained is 0.1340 (Fig. 11).

Fig. 11 Showing error on command window

8 Results and Discussion

In the experiments on the dataset, which classify messages as spam or ham using the GNBC and MTM classifiers, we observe that GNBC is more accurate than MTM. We also observed that data filtered by MTM are overriding, whereas the same data filtered by GNBC are separated more clearly. In the experiments above, red represents spam data, green represents ham data, and blue represents noisy data. We also calculated the error, which varies with the dataset and the spam messages.

9 Conclusion

Automatic SMS spam filtering in pervasive environments remains a real challenge. The major difficulty in detecting SMS spam stems from the small number of characters in a short text message and the common use of idioms and acronyms. The immediate conclusion from the results is that GNBC performs best in terms of accuracy and overriding of the data. It requires fewer input features to achieve the same results as other classifiers. The main aim of our research is to refine spam tokens more precisely than existing techniques, which increases the accuracy of the system. Furthermore, we aim to use Gaussian-based NBC classifiers to increase the efficiency of the spam detection system.

10 Future Scope

Future work can extend the feature set, which can be exploited by numerous methods. Adding more significant features, such as thresholds for the measurement and for the evaluation of knowledge, could also contribute to improved results. In the future, this technique could be used to build an application for smartphones (iPhone, Android, Windows) that protects them from spam messages. The approach can also be compared with DND (Do Not Disturb), which blocks numerous annoying communications; in future work we may also attempt to block all messages from undesirable numbers as well as spam sources.