1 Introduction

Demonetization is the act of withdrawing a current form of money from circulation, or of replacing older currency notes or coins with new ones. In accordance with government regulations, such measures are carried out in coordination with the RBI to curb tax evasion and to combat inflation, corruption and crime.

Predictive analytics [1] is the branch of advanced analytics that enables users to predict future events from current statistics. Patterns found in historical and transactional data can be used to identify future risks and opportunities. Predictive analytics models capture relationships among many factors to assess the risk associated with a particular set of conditions and to assign a score. By successfully applying predictive analytics, businesses can effectively interpret big data for their benefit. Predictive analytics makes use of text analytics together with data mining to develop predictive intelligence based on relationships in structured and unstructured data [2]. It allows organizations to become proactive and forward looking, anticipating outcomes and behaviors based on data rather than on a hunch or assumptions [3]. The contributions of the proposed scheme are summarized as follows:

The proposed PAD-SVM scheme involves three stages: a preprocessing stage, a descriptive analysis stage, and a predictive analysis stage.

  • The pre-processing stage involves cleaning the obtained data, performing missing value treatment and splitting the necessary data from the tweets.

  • The descriptive analysis stage involves finding the most influential people regarding this subject and performing analytical functionalities.

  • Sentiment analysis is also performed to find the sentiment values of the users and the compound polarity of each tweet. Once the polarity scores are calculated, the tweets are categorized as “POSITIVE”, “NEGATIVE” and “NEUTRAL”.

  • Finally, the predictive analysis is performed, which creates the time frame for the tweets from the given data, defines the DEPENDENT and INDEPENDENT variables, and transforms the data as necessary for processing.

Through extensive experimental results, we evaluate the performance of the proposed PAD-SVM scheme. The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 describes our proposed PAD-SVM scheme. Section 4 presents our experimental results and a relevant performance analysis in terms of execution time and classification error. Finally, Sect. 5 presents our conclusions and discusses future directions.

2 Related works

Wilson et al. [2] proposed an integrated predictive analytics and social media framework that performs analysis and prediction. The framework uses machine learning algorithms to perform predictive analytics tasks such as feature selection, parameter optimization and result validation.

Jeffery et al. [4] proposed predictive analytics using data mining, in which data mining techniques are applied to large data sets for prediction. Jimmy and Kolcz [5] proposed a polarity classification approach to classify data sets and enhance accuracy. Jiang et al. [6] proposed a large-scale Twitter mining technique for drug-related adverse events, describing an approach that finds drug users and potential adverse events by analyzing the content of Twitter messages with NLP. Bingwei et al. [7] proposed scalable sentiment classification for big data analysis using a Naive Bayes classifier (NBC). Noting that machine learning technologies are widely used in sentiment classification, they evaluated the scalability of NBC on large-scale datasets. Cuesta et al. [8] proposed a framework for extracting and analyzing massive amounts of data from Twitter’s public streams. The framework includes a language-agnostic sentiment analysis module, which provides a set of tools to perform sentiment analysis of the collected tweets. Michal and Romanowski [9] proposed sentiment analysis of Twitter data within a distributed big data environment for stock prediction and discussed the possibility of predicting the stock market based on the classification of data from the Twitter microblogging platform.

Mohit et al. [10] proposed multi-class tweet categorization using the MapReduce paradigm and the Apache Hadoop framework. Tao et al. [11] proposed a sentiment analysis technology and the polarity computation of sentiment words. The consumer seeks a balance between the price and the attributes that concern him most. When there are only dozens of reviews, ordinary browsers can handle them.

Yingyi et al. [12] proposed a new way of performing sentiment analysis in product evaluation. When purchasing a product for the first time, one usually needs to choose among several products with similar characteristics, and the best way to choose the most suitable product is to rely on the opinions of others. The system described collects opinions about hotels from the web, evaluates them, aggregates these evaluations and offers cumulative, easy-to-understand information. The generated information is intended for prospective customers, but also for hotel managers, providing them with additional guidance for future business development.

Maite et al. [13] proposed sentiment analysis in Twitter using machine learning techniques. Sentiment analysis deals with identifying and classifying opinions or sentiments expressed in source text. Social media generates a vast amount of sentiment-rich data in the form of tweets, status updates, blog posts, etc. Sentiment analysis of this user-generated data is very useful for knowing the opinion of the crowd. The knowledge-based approach and the machine learning approach are the two strategies used for analyzing sentiment in text.

Tosi et al. [3] described a novel use of big data from the cellular network of the Vodafone Italy telecommunications operator to compute mobility patterns for future smart cities. These mobility patterns can describe different mobility scenarios of the city, starting from how people move around the city's points of interest in real time. Katkar et al. [14] proposed real-time sentiment analysis of Twitter data using Hadoop, whose cluster architecture can process huge amounts of data quickly enough to analyze sentiment in real time.

Jose et al. [15] mined Twitter big data to predict the winner of the 2013 election held in Pakistan and analyzed the impact of tweets on that prediction. They identified relevant Twitter users, pre-processed their tweets, and constructed predictive models for three representative political parties that were tweeted about significantly. The predictions for the last 4 days before the election showed party1 emerging as the winner, whereas the election was actually won by party2. The RapidMiner tool was used to experiment with three different standard predictive models.

Wook et al. [16] proposed big data and predictive analytics methods for the modeling and analysis of semiconductor manufacturing processes, which generate huge amounts of data whose value can be harnessed using predictive analytics. Kaliappan et al. [17,18,19] discussed clustering networks based on dynamic genetic algorithms. Visual sentiment analysis [20] and sentiment analytics [21] play a major role in predicting the future in hybrid systems and Twitter data streaming systems, respectively. A security mechanism [22] is needed to analyze Twitter data. Various enhanced security mechanisms [23] have been proposed to analyze security threats and attacks in domains such as ad hoc networks and wireless cognitive networks [24]. These security mechanisms may be suitable for analyzing security threats in Twitter data. Sudhakar Ilango et al. [25] proposed an Artificial Bee Colony approach that imitates bee behavior to select optimal clusters for big data. Data distribution, networking and security are discussed by Suresh and co-authors [26,27,28].

3 Proposed work

The proposed PAD-SVM scheme studies data obtained from Twitter streams, namely the tweets sent by various users, and analyzes the currently available data to identify its patterns and predict its likely future trend. The system consists of three modules for finding and operating on social media data sets [15]. The main scope of the project is to analyze the tweets and fetch the Twitter IDs of those users whose statuses have been re-tweeted the most by the user whose tweets are being analyzed [6,14]. First, the system collects tweets from the Twitter social network. Then, it uses Hadoop as a standard platform to address the challenges of big data through the MapReduce framework, in which the complete data is mapped to frequent datasets and reduced to smaller, more manageable data for ease of handling. Finally, it analyzes the collected tweets and fetches the Twitter IDs of the most re-tweeted users [5].

3.1 Pre-processing

In the preprocessing stage, the dataset is loaded into the Hadoop file system (HDFS). To access the files stored in Hadoop, an interface is needed to connect HDFS with the Python application. Pydoop is a Python interface to Hadoop that allows the user to write applications in pure Python [16]. Once the files are accessed in Hadoop through Pydoop, the data can be loaded quickly and efficiently. The pandas package is used to read the dataset into an n-dimensional structure [11]. Table 1 shows sample data from the dataset.

Table 1 Sample twitter data

After the dataset is loaded, missing-value treatment is performed by checking whether the dataset contains any missing values. Missing values are filled with zero to avoid errors in processing. Next, each tweet is split into the re-tweeted user and the tweet text. The user is separated at the colon (:), after checking whether the tweet contains “RT @” (which denotes that the tweet is a re-tweet) [1]. If a tweet is a re-tweet, the user name is entered into the dataset; if not, “other” is entered. The tweet text is extracted using the regex ‘(?<=:)(.*)’, taking its first group. Table 2 shows the processed data with the tweets and users separated from one another.

Table 2 Processed data with the tweets and users separated from one another
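The pre-processing steps can be illustrated with a minimal sketch. It assumes the tweets sit in a pandas column named `text` (a hypothetical name), uses made-up sample rows, and stores the zero placeholder as the string "0" so that the column stays textual; the helper `split_tweet` is introduced here for illustration only and is not taken from the original implementation.

```python
import re

import pandas as pd

# Hypothetical sample rows; the column name "text" is an assumption.
raw = pd.DataFrame({
    "text": [
        "RT @alice: Demonetization will curb black money",
        "Long queues at the ATM again today",
        None,
    ]
})

# Missing-value treatment: count the gaps, then fill them with a zero placeholder.
missing_count = int(raw["text"].isna().sum())
raw["text"] = raw["text"].fillna("0")

RT_USER = re.compile(r"RT @(\w+)")       # user name of a re-tweeted status
TWEET_BODY = re.compile(r"(?<=:)(.*)")   # everything after the first colon


def split_tweet(text):
    """Return (re-tweeted user or "other", tweet text)."""
    retweet = RT_USER.search(text)
    if retweet is None:
        return "other", text
    body = TWEET_BODY.search(text)
    return retweet.group(1), (body.group(1).strip() if body else text)


processed = pd.DataFrame(
    [split_tweet(t) for t in raw["text"]], columns=["user", "tweet"]
)
print(processed)
```

Tweets that are not re-tweets keep their full text in this sketch, since they carry no "RT @user:" prefix to strip.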

3.2 Descriptive analysis

The output of the pre-processing stage is fed as input to the descriptive analysis phase. To find the most influential people regarding demonetization, the users are ordered by the number of times their tweets have been re-tweeted [8]. Table 3 shows the top 14 people with the highest re-tweet counts. In addition, an individual's influence on the majority of people can be found by ordering the users by their re-tweet percentage. Table 3 also shows the re-tweet percentage of these top 14 people.

Table 3 User list based on Retweet counts and percentage
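A minimal sketch of this ranking is given below. The `user` column and the sample values are hypothetical, and the re-tweet percentage is assumed to be each user's share of all tweets in the dataset, which the text does not state explicitly.

```python
import pandas as pd

# Hypothetical output of the pre-processing stage; "other" marks original tweets.
processed = pd.DataFrame(
    {"user": ["alice", "bob", "alice", "other", "alice", "bob"]}
)

# Count how often each user was re-tweeted, most re-tweeted first.
retweets = processed[processed["user"] != "other"]
counts = retweets["user"].value_counts()

# Assumption: percentage is taken relative to all tweets in the dataset.
ranking = pd.DataFrame({
    "retweet_count": counts,
    "retweet_percentage": (counts / len(processed) * 100).round(2),
})

top_users = ranking.head(14)  # the paper reports the top 14 users
print(top_users)
```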

The processed tweets are then passed through a series of cleaning functions. First, unnecessary data in the processed tweets is removed. Second, special characters are cleansed. Then, the resulting words are lemmatized to group their various inflected forms [21], and all the processed words are joined back into a single tweet. This process is repeated for all the tweets.
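The cleaning steps can be sketched as follows. This is an illustration under stated assumptions: links and user mentions are treated as the "unnecessary data", digits are dropped together with special characters, and a tiny hand-written lookup table (`LEMMAS`, hypothetical) stands in for a real lemmatizer, which would normally come from an NLP library.

```python
import re

# Toy lemma table standing in for a real lemmatizer (assumption, illustration only).
LEMMAS = {"queues": "queue", "notes": "note", "banned": "ban", "standing": "stand"}

URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
NON_LETTERS = re.compile(r"[^a-z\s]")


def clean_tweet(tweet):
    """Strip links and mentions, remove special characters, lemmatize, re-join."""
    text = URL_OR_MENTION.sub(" ", tweet.lower())      # drop links and mentions
    text = NON_LETTERS.sub(" ", text)                  # drop special characters and digits
    words = [LEMMAS.get(w, w) for w in text.split()]   # map each word to its lemma
    return " ".join(words)


cleaned = clean_tweet("Standing in queues!! #Demonetization @user https://t.co/abc")
print(cleaned)
```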

Sentiment analysis is performed to find the polarity scores of each tweet using the Sentiment Intensity Analyzer [9]. The compound, neutral, negative and positive polarity scores are calculated. The compound polarity is used to classify each tweet by sentiment: if the compound polarity is greater than zero, the tweet is labeled “POSITIVE”; if it is less than zero, the tweet is labeled “NEGATIVE” [8]; and if it is equal to zero, the tweet is labeled “NEUTRAL”. These labels are recorded as the sentiment type. Table 4 shows the processed data along with the sentiment type and polarity scores for each tweet.

Table 4 Descriptive data
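The thresholding rule on the compound polarity can be written as a short function. The sketch below assumes the compound scores have already been computed by the analyzer; the three sample scores are made up, and the function name `sentiment_type` is hypothetical.

```python
def sentiment_type(compound):
    """Map a compound polarity score to a sentiment type."""
    if compound > 0:
        return "POSITIVE"
    if compound < 0:
        return "NEGATIVE"
    return "NEUTRAL"


# Hypothetical compound scores for three tweets.
scores = [0.6249, -0.4404, 0.0]
labels = [sentiment_type(s) for s in scores]
print(labels)
```

Only a score of exactly zero is treated as neutral here, which follows the rule stated in the text; a small dead band around zero would be a different design choice.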

Table 5 shows the count and percentage of each sentiment type (POSITIVE, NEGATIVE and NEUTRAL), while Table 6 shows the sentiment type by time.

Table 5 Sentiment type in count and percentage
Table 6 Sentiment type by time

3.3 Predictive analysis

The output of the descriptive analysis is taken as input for this stage. First, the DEPENDENT and INDEPENDENT variables are declared [10]. The DEPENDENT variable is the target output on which the analysis is performed. The INDEPENDENT variables are the columns used to support the analysis of the DEPENDENT variable. In addition, the minute, hour and date are separated from the time frame of each tweet [20]. Cryptographic techniques will also be used with the data set to obtain the future data [26].
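A minimal sketch of the time-frame split and the variable declaration is shown below. The column names `created` and `sentiment_type` and the two sample rows are hypothetical; the choice of date, hour and minute as INDEPENDENT variables is an assumption based on the description above.

```python
import pandas as pd

# Hypothetical rows; column names and timestamps are assumptions.
tweets = pd.DataFrame({
    "created": ["2016-11-23 18:40:30", "2016-11-23 09:05:10"],
    "sentiment_type": ["POSITIVE", "NEGATIVE"],
})

# Separate the date, hour and minute from each tweet's timestamp.
stamp = pd.to_datetime(tweets["created"])
tweets["date"] = stamp.dt.date
tweets["hour"] = stamp.dt.hour
tweets["minute"] = stamp.dt.minute

target = tweets["sentiment_type"]               # DEPENDENT variable
features = tweets[["date", "hour", "minute"]]   # INDEPENDENT variables
print(features)
```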

The object columns in the dataset are transformed into integer types using LabelEncoder, a utility that normalizes labels by transforming non-numerical values into numerical values. Next, SVC is set as the model. The objective of using SVC is to fit the data and return the “best fit” hyperplane that categorizes the data given the DEPENDENT and INDEPENDENT variables. After the hyperplane is obtained, the INDEPENDENT data is fed to the classifier to obtain the predicted output. The accuracy of the model is calculated using the accuracy score [4]. The predicted set obtained is then used to override the present data. At this point the outcome is of integer type. To produce the output in terms of classification, the numerical values are assigned to their respective types: 0 to ‘POSITIVE’, 1 to ‘NEUTRAL’ and 2 to ‘NEGATIVE’ [12]. Table 6 shows the predicted sentiment type (NEGATIVE, NEUTRAL and POSITIVE) for each hour.
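The encoding, fitting and prediction steps can be sketched with scikit-learn as follows. The toy data (hour and minute as INDEPENDENT variables) and the linear kernel are assumptions made for illustration; the original feature set and SVC settings are not given in the text.

```python
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Hypothetical toy data: [hour, minute] per tweet and its sentiment type.
X = [[8, 5], [9, 15], [9, 40], [10, 0],
     [12, 10], [12, 30], [13, 5], [13, 45],
     [18, 20], [19, 0], [19, 35], [20, 50]]
sentiment = ["NEGATIVE"] * 4 + ["POSITIVE"] * 4 + ["NEUTRAL"] * 4

# Transform the non-numerical labels into integers.
encoder = LabelEncoder()
y = encoder.fit_transform(sentiment)

# Fit the classifier; the linear kernel is an assumption of this sketch.
model = SVC(kernel="linear")
model.fit(X, y)

# Feed the INDEPENDENT data back to the classifier and score the result.
predicted = model.predict(X)
accuracy = accuracy_score(y, predicted)

# Map the integer predictions back to their sentiment types.
predicted_types = encoder.inverse_transform(predicted)
print(accuracy, list(predicted_types))
```

The integer-to-label mapping in this sketch is read back from the encoder through `inverse_transform` instead of being hard-coded, so it may differ from the numbering given above. Because the model is scored on the same data it was fitted on, the accuracy here only reflects goodness of fit.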

Table 7 shows the predicted sentiment type (POSITIVE, NEGATIVE and NEUTRAL) and the percentage of each predicted sentiment type.

Table 7 Predicted sentiment type in count

4 Results and discussions

The proposed PAD-SVM scheme is implemented with Pydoop in a multi-node cluster environment. The data analytics process is performed on demonetization data. The dataset was downloaded from the UCI Machine Learning Repository.

4.1 Descriptive analysis

Figure 1 shows the total count of each sentiment type as a bar chart. The graph contains the values obtained from the descriptive analysis. It shows that the largest group of people, 3068 out of 8000, supports the demonetization of the 500 and 1000 rupee notes. Although a large number of people support the scheme, 2364 out of 8000 hold opposing views. The remaining 2568 out of 8000 people have conflicting views about the issue, being neither negative nor positive but neutral.

Fig. 1 Sentiment type versus people

Figure 2 shows the percentage of each sentiment type as a pie chart. The graph also contains the values obtained from the descriptive analysis. While the bar chart represents the data in terms of counts, the percentages give an overall view of the analysis. The graph shows that 38.35% of people support the demonetization of the 500 and 1000 rupee notes. Although this is the largest group, 29.55% of people hold opposing views. The remaining 32.1% of people have conflicting views about the issue, being neither negative nor positive but neutral.

Fig. 2 Sentiment type versus percentage
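The percentages follow directly from the counts reported for the descriptive analysis. The short sketch below reproduces that arithmetic; the variable names are chosen for illustration.

```python
# Counts per sentiment type reported for the descriptive analysis.
counts = {"POSITIVE": 3068, "NEGATIVE": 2364, "NEUTRAL": 2568}

total = sum(counts.values())
percentages = {label: round(100 * n / total, 2) for label, n in counts.items()}

print(total)
print(percentages)
```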

Figure 3 provides a timeline of each sentiment type per hour. The graph plots the POSITIVE, NEGATIVE and NEUTRAL sentiment values against the time frame. Since our dataset is confined to a single day, time is expressed in hours. The overall highest peak in the graph is POSITIVE, reached at noon at a rate of 300 tweets. The highest peak for NEGATIVE is reached in the morning, while the highest peak for NEUTRAL is in the evening.

Fig. 3 Sentiment analysis
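An hourly breakdown of this kind can be obtained by grouping the tweets by hour and sentiment type. The sketch below uses a handful of made-up rows and hypothetical column names; it illustrates the grouping only and does not reproduce the reported figures.

```python
import pandas as pd

# Hypothetical tweets with the hour already extracted from the timestamp.
tweets = pd.DataFrame({
    "hour": [9, 9, 12, 12, 12, 18, 18],
    "sentiment_type": ["NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE",
                       "POSITIVE", "NEUTRAL", "NEUTRAL"],
})

# Rows are hours, columns are sentiment types, cells are tweet counts.
hourly = (
    tweets.groupby(["hour", "sentiment_type"])
    .size()
    .unstack(fill_value=0)
)

# Hour at which each sentiment type reaches its highest count.
peak_hour = hourly.idxmax()
print(hourly)
print(peak_hour)
```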

4.2 Predictive analysis

Figure 4 shows the total count of predicted values for each sentiment type. The graph contains the values obtained from the predictive analysis. It shows that the number of people in support of the demonetization of the 500 and 1000 rupee notes could decline from 3068 to 2248. There is also a rise in opposing views, from 2364 to 3207. These people could have previously supported the demonetization scheme [13]. There is little change in the NEUTRAL values, with a decline from 2568 to 2545 people.

Fig. 4 Predicted sentiment type versus people

Figure 5 shows the percentage of predicted values for each sentiment type. The graph contains the values obtained from the previous analysis. While Fig. 4 represents the data in terms of counts, the percentages give an overall view of the analysis. The graph shows that POSITIVE views regarding the demonetization scheme drop from 38.35 to 28.11%. Adverse views towards the scheme grow from 29.55 to 40.08%, which is higher than the POSITIVE share calculated during the descriptive analysis. There is very little change in NEUTRAL, which falls from 32.1 to 31.81%.

Fig. 5 Predicted sentiment type versus percentage
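The shift between the descriptive and the predicted counts can be checked with a few lines of arithmetic. The sketch below uses the counts reported in the text; the dictionary names are chosen for illustration.

```python
# Counts reported for the descriptive and the predictive analysis.
described = {"POSITIVE": 3068, "NEGATIVE": 2364, "NEUTRAL": 2568}
predicted = {"POSITIVE": 2248, "NEGATIVE": 3207, "NEUTRAL": 2545}

total = sum(described.values())

# Change in tweet count per sentiment type (predicted minus described).
shift = {label: predicted[label] - described[label] for label in described}

# The same change expressed in percentage points of all tweets.
shift_points = {label: 100 * delta / total for label, delta in shift.items()}

print(shift)
print(shift_points)
```

The shifts sum to zero because both analyses cover the same 8000 tweets, so every tweet that leaves one sentiment type appears in another.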

Figure 6 provides a timeline of predicted values for each sentiment type per hour, in counts. The graph plots the predicted POSITIVE, NEGATIVE and NEUTRAL sentiment values against the time frame. Since our dataset is confined to a single day, the predicted values are also limited to this period, which is expressed in hours. The overall highest peak in the graph is NEGATIVE, reached around noon at a rate higher than that of the POSITIVE values in the descriptive analysis [3]. The highest peak for POSITIVE is reached in the morning, while the highest peak for NEUTRAL is in the evening.

Fig. 6 Predicted sentiment analysis

5 Conclusion

The demonetization of the 500 and 1000 rupee notes, which accounted for 86% of the country’s circulating cash, led to hectic and chaotic events that changed the views of many people in the country. The analysis of demonetization using Twitter data provides a descriptive analysis of people’s current views, their sentiment values towards the issue, how they feel about it and how their views might change in the near future. The preprocessing stage involves cleaning the obtained data, performing missing value treatment and splitting the necessary data from the tweets.

The descriptive analysis involves finding the most influential people regarding this subject, measuring how much they influence others, and performing analytical functionalities. These functionalities include stripping the already processed tweets, cleansing them of special characters, lemmatizing them and using the result to find the compound polarity of each tweet. Once the polarity scores are calculated, the tweets are categorized as “POSITIVE”, “NEGATIVE” and “NEUTRAL”. The graphical representation of the values shows that 38.35% of people support the idea of demonetization, 32.1% feel conflicted about it and 29.55% oppose it. The predictive analysis creates the time frame for the tweets from the given data, defines the DEPENDENT and INDEPENDENT variables, and transforms the data for processing. The transformed data is used to fit the SVC model with the DEPENDENT and INDEPENDENT variables. The fitted SVM model is then used to predict from the INDEPENDENT variables. The accuracy score of the predicted model is 96.66%. The predicted model is used to override the existing data to obtain the predicted data. The graphical representation of the predicted values shows that 40.08% of people oppose the idea of demonetization, 31.81% feel conflicted about it and 28.11% support it. From the obtained information, it can be seen that during the initial stage the largest share of people supports demonetization. As time progresses, however, positive views fall sharply and negative views increase, which shows that many people who first supported the idea are changing their views.