1 Introduction

Today, the data generated through social networking sites is increasing day by day. Twitter, one of the most popular social networking platforms, deals with both structured and unstructured data. However, Twitter data, such as followers, tweets, likes, and expressions, is mostly unstructured, which makes it difficult to process. Industries and companies of all kinds use this type of data for future development and advertising.

2 Big Data

Big data comprises both structured and unstructured data, including video, audio, and data generated through emails. Social sites generate vast amounts of data through daily activity. The historical data maintained by industries and the business sector, which helps them thrive in a competitive world, can also be termed big data. Big data is difficult to analyze and process; however, Hadoop substantially reduces the complexity of analyzing and processing it. Big data is a heterogeneous collection of complex data sets produced by varied sources such as television and mobile devices.

There are three characteristics of big data:

  1. Volume
  2. Velocity
  3. Variety

These characteristics are described below:

  1. Volume: Volume is defined mainly in terms of the amount of data, which is growing exponentially. Processing and storing data at high volume is always complex.
  2. Velocity: Social sites continuously generate complex data in unstructured and semi-structured form. The growing collection of data from mobile devices, televisions, and Internet technologies is the main driver of high velocity.
  3. Variety: Big data exists in both structured and unstructured formats. Structured data always has a fixed format that does not change, such as tabular data and ERP records.

3 Related Work

Sentiment analysis is a very popular technology today, and a vast amount of work has been done in this field. Most work in this area relates to storage of the data.

  (i) Semantic analysis is of great importance; it deals with document-level and word-level data and depends largely on NLP techniques.
  (ii) The second kind of work deals with point-wise data and information and is mathematical in nature.

4 Hadoop

Hadoop is an open-source framework that is freely available to every user. It is written in Java and is a project of the Apache Software Foundation. It supports reliable and scalable distributed computing and was designed to solve the problems of processing and analyzing big data (Fig. 1).

Fig. 1 Data replication [1]

Hadoop has two main components:

  1. Execution engine (MapReduce)
  2. Hadoop Distributed File System (HDFS) (Fig. 2)

Fig. 2 MapReduce [2]

5 Hadoop Distributed File System (HDFS)

HDFS is designed mainly to handle large amounts of data, which it stores as blocks. It follows a client-server architecture and comprises one name node and many data nodes. The name node holds the file system metadata, tracks which data nodes hold each block, and supports the file system operations. If the name node fails, Hadoop cannot serve the file system or recover the state of the data nodes on its own. Data is replicated across data nodes to achieve fault tolerance. In HDFS, a large file is stored as a sequence of blocks.
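The block and replication idea described above can be illustrated with a small simulation. This is a minimal sketch, not HDFS code: the function names, the data node names, and the round-robin placement are hypothetical and far simpler than the placement policy HDFS actually uses. The 128 MB block size and the replication factor of 3 are taken as common defaults in recent Hadoop versions.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # bytes; common HDFS default, assumed for this sketch
REPLICATION = 3                 # common HDFS default replication factor


def place_blocks(file_size, data_nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a toy name-node table: block index -> data nodes holding a replica.

    Placement is plain round-robin, which is an illustrative simplification.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds the number of data nodes")
    num_blocks = math.ceil(file_size / block_size)
    table = {}
    for block in range(num_blocks):
        # Each replica of a block goes to a different data node.
        table[block] = [data_nodes[(block + r) % len(data_nodes)] for r in range(replication)]
    return table


def readable_blocks(table, failed_node):
    """Blocks that still have at least one replica after one data node fails."""
    return [b for b, nodes in table.items() if any(n != failed_node for n in nodes)]


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]            # hypothetical data node names
    table = place_blocks(300 * 1024 * 1024, nodes)  # a 300 MB file needs 3 blocks
    for block, holders in table.items():
        print(block, holders)
    print(readable_blocks(table, "dn2"))
```

Because every block has replicas on several data nodes, the loss of a single data node leaves all blocks readable in this toy model, which is the fault-tolerance property the replication is meant to provide.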

Using Our Approach

In this paper, we focus mainly on the speed of the analysis and on its accuracy. Several problems in processing the data can be addressed with part-of-speech tagging using OpenNLP. This tagging step is used for the following purposes.

  (i) First, stop words such as “a” and “an” are removed, since they are not useful for real-time sentiment analysis.
  (ii) Second, unstructured text is converted to structured text. Twitter messages and comments are mostly unstructured. For example, in comments on “bajrange bhai jaan,” “favorite” is written “favorable,” “God” is written “good,” “aswm” is written for “awesome,” and “bd” is written for “bad.”
  (iii) Third, emoticons are handled. They are among the most expressive means of conveying ideas and opinions, and their symbolic representation is converted to words at this stage.
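The three steps above can be sketched as a small normalization function. This is an illustrative sketch only: the stop-word set and the emoticon map are tiny hypothetical samples, the slang entries are the two unambiguous examples from the text, and whitespace tokenization stands in for the OpenNLP tagging used in the paper.

```python
STOP_WORDS = {"a", "an", "the"}            # small sample list (assumption)
SLANG = {"aswm": "awesome", "bd": "bad"}   # examples taken from the text
EMOTICONS = {":)": "happy", ":(": "sad"}   # hypothetical emoticon-to-word map


def normalize(tweet):
    """Lowercase, map emoticons and slang to words, and drop stop words."""
    tokens = []
    for raw in tweet.split():
        # Emoticons are matched before punctuation is stripped.
        if raw in EMOTICONS:
            tokens.append(EMOTICONS[raw])
            continue
        word = raw.lower().strip(".,!?\"'")
        if not word or word in STOP_WORDS:
            continue
        tokens.append(SLANG.get(word, word))
    return tokens


if __name__ == "__main__":
    print(normalize("An aswm movie :) not bd at all!"))
```

Keeping the three maps as plain data makes it easy to extend them without touching the function.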

  1. Data and real-time data features

Real-time data is central to this paper. It is obtained from the streaming APIs provided by Twitter, using keywords such as the movie title “bajrange bhai jaan.” The object on which sentiment analysis is to be performed is submitted to the Twitter APIs, which return only the tweets related to that object.

Twitter data is mostly unstructured. A tweet consists of at most 140 characters, along with likes. Each message also carries a user name and a timestamp. The timestamp is useful for future development of our project and also helps in analysis across different geographical regions (Fig. 3).

Fig. 3 Data processing [1]
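The keyword-based selection described above can be illustrated without calling Twitter at all. The sketch below filters an in-memory list of tweet records. The record fields (`user`, `timestamp`, `text`) and the function `filter_by_keyword` are hypothetical and do not correspond to the Twitter API, and the sample records are made up for illustration.

```python
def filter_by_keyword(tweets, keyword):
    """Yield only the tweets whose text mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    for tweet in tweets:
        if needle in tweet["text"].lower():
            yield tweet


if __name__ == "__main__":
    sample = [  # made-up records standing in for a real stream
        {"user": "u1", "timestamp": "2015-07-17T10:00:00", "text": "Bajrange bhai jaan was aswm"},
        {"user": "u2", "timestamp": "2015-07-17T10:01:00", "text": "Traffic is bd today"},
    ]
    for t in filter_by_keyword(sample, "bajrange bhai jaan"):
        print(t["user"], t["timestamp"], t["text"])
```

Writing the filter as a generator lets it consume a stream one record at a time rather than holding all tweets in memory.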

  2. Data defined part of speech

The file contains the obtained tweets.

  3. Data in root form

Reducing words to their root form is a widely used technique that increases overall efficiency and lowers the access time of the system. The words in each tweet are changed to their root form, which removes unwanted words and avoids the extra storage of sentiment values for derived words.
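As an illustration of reduction to root form, the sketch below strips a few common English suffixes. It is a deliberately naive, hypothetical stemmer written for this example. It is not the method used in the paper, and a production system would normally rely on an established algorithm such as the Porter stemmer.

```python
SUFFIXES = ("ing", "ed", "ly", "es", "s")  # sample list, checked in this order


def naive_stem(word, min_root=3):
    """Strip the first matching suffix if at least min_root characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
            return word[: -len(suffix)]
    return word


def stem_tokens(tokens):
    """Map every token to its naive root, so derived forms share one entry."""
    return [naive_stem(t) for t in tokens]


if __name__ == "__main__":
    print(stem_tokens(["watched", "watches", "watching", "watch"]))
```

Once derived forms collapse to one root, only the root needs a stored sentiment value, which is the storage saving described above.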

  4. Sentiment data directory

Next, the sentiment data directory is built from standard resources for sentiment data, such as WordNet, and it covers the polarity of each word, e.g., “good” or “bad.” The sentiment data is stored in standard directories that are local to the program and held only in primary memory, so that time is saved when searching for a word.
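Keeping the sentiment data in primary memory amounts to loading it once into a hash map and looking words up there. A minimal sketch follows. The tab-separated `word<TAB>score` format, the scores, and the names `load_lexicon` and `score_tokens` are assumptions made for illustration and do not reflect the actual SentiWordNet file layout.

```python
import io


def load_lexicon(lines):
    """Parse 'word<TAB>score' lines into a dict; skip blank lines and '#' comments."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        word, score = line.split("\t")
        lexicon[word] = float(score)
    return lexicon


def score_tokens(tokens, lexicon):
    """Sum the scores of known words; unknown words contribute 0."""
    return sum(lexicon.get(t, 0.0) for t in tokens)


if __name__ == "__main__":
    # Made-up scores; a real lexicon would be read from disk once at start-up.
    source = io.StringIO("# word\tscore\ngood\t1.0\nawesome\t1.0\nbad\t-1.0\n")
    lexicon = load_lexicon(source)
    print(score_tokens(["awesome", "movie", "good"], lexicon))
```

After the one-time load, each lookup is an average constant-time dictionary access with no disk read.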

  5. MapReduce algorithm

The MapReduce algorithm is applied to the tweet data. The sentiment values are obtained from the standard algorithm that we use.
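The map and reduce phases can be imitated in a single process to show the data flow. The sketch below is neither Hadoop code nor the paper's algorithm: the word scores, the classification threshold at zero, and the function names are hypothetical. The mapper emits one `(label, 1)` pair per tweet, the shuffle step groups the pairs by label, and the reducer sums each group.

```python
from collections import defaultdict

LEXICON = {"good": 1, "awesome": 1, "favorite": 1, "bad": -1}  # made-up scores


def mapper(tweet):
    """Map phase: classify one tweet and emit a (label, 1) pair."""
    score = sum(LEXICON.get(w, 0) for w in tweet.lower().split())
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    yield label, 1


def shuffle(pairs):
    """Shuffle phase: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: sum the counts for one label."""
    return key, sum(values)


def run(tweets):
    """Chain map, shuffle, and reduce over a list of tweets."""
    pairs = (pair for tweet in tweets for pair in mapper(tweet))
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run(["good movie", "bad songs", "awesome and good", "saw it today"]))
```

In a real Hadoop job the mapper and reducer would run on different nodes and the framework would perform the shuffle. The single-process version only shows how the keys and values move between the phases.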

6 Final Data Accuracy

This section reports the overall accuracy on Twitter data, obtained using SentiWordNet, OpenNLP, and WordNet. The data consists of comments on “bajrange bhai jaan” containing positive, negative, and neutral words. The tweets about the movie, with words such as “good” and “favorite” as well as negative words, are compared against a data set available on the web at http://www.cs.tau.ac.il/~kfirbar/mlproject/twitter.data. The checked data is as follows:

Sentiment   Count   Correct   %       Tolerance
Positive    739     520       72.22   −0.01
Negative    637     399       61.67   +0.05
Neutral     86      43        73.42   ±0.003

The total accuracy of this project is 72.22%, taken as the mean accuracy.
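For reference, per-class and overall accuracy are commonly computed as the number of correct predictions divided by the number of items. The sketch below applies that definition to the Count and Correct columns of the table. The definition is an assumption, since the text does not state how the percentage column was derived, so the recomputed values need not coincide with the reported ones.

```python
COUNTS = {  # (count, correct) per class, copied from the table above
    "Positive": (739, 520),
    "Negative": (637, 399),
    "Neutral": (86, 43),
}


def accuracy(correct, count):
    """Accuracy as a percentage: correct / count * 100."""
    return 100.0 * correct / count


def per_class(counts):
    """Accuracy of each class taken separately."""
    return {label: accuracy(correct, count) for label, (count, correct) in counts.items()}


def macro_average(counts):
    """Unweighted mean of the per-class accuracies."""
    values = per_class(counts).values()
    return sum(values) / len(values)


def micro_average(counts):
    """Total correct over total count, so larger classes weigh more."""
    total = sum(count for count, _ in counts.values())
    correct = sum(correct for _, correct in counts.values())
    return accuracy(correct, total)


if __name__ == "__main__":
    for label, value in per_class(COUNTS).items():
        print(f"{label}: {value:.2f}%")
    print(f"macro: {macro_average(COUNTS):.2f}%  micro: {micro_average(COUNTS):.2f}%")
```

The macro average treats the three classes equally, while the micro average is dominated by the larger positive and negative classes. Which of the two a reported "mean" refers to should be stated alongside the figure.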

7 Total Time Efficiency

Efficiency is a necessary aspect of our project and a main reason it works well. Using Hadoop reduces the time spent reading from hard disk, which lowers the overall processing time.

8 Conclusion

Sentiment analysis is widely used today, and this research applies it to the analysis of Twitter data. The project can also be extended to other social media platforms and to movie reviews, such as blogs, comments, and likes per day. Hashtags and emoticons are very useful and necessary for social media data, and in our project they were used in analyzing the comments and tweets collected per day. The total accuracy found is 72.22%.