1 Introduction

In the artificial intelligence (AI) community, stock price prediction has long been one of the biggest challenges [51]. From the earliest days, professional traders developed many analytical methods to address this problem, among which fundamental analysis and quantitative analysis were the most prominent [56]. However, none of these early financial methods produced noticeably reliable estimates of stock market prices. Predicting stock prices has been regarded as beyond the capability both of professional traders, who are generally driven by greed and fear and therefore unable to make rational decisions about buying and selling in the stock market [40], and of traditional AI, which focuses on imitating human behaviour [56]. Recent years have seen exponential growth in the ability of artificial neural networks to approximate any complex continuous function. These developments enable an artificial intelligence system to discover more complicated relationships between the features and the target class and, with advances in algorithmic architectures that parallelize over huge amounts of data, to handle data at scale [23].

This research project aims to capture events happening in real life, ranging from the merger of a company or changes in the hierarchy of an organization and its subsidiary companies, to changes in the growth of the nation where that organization operates and shifts in the relationships among different countries. The study tries to capture these events in the form of sentiment scores and in-depth emotion scores in order to measure their impact on the rise and fall of the stock price of the concerned organization.

Social media platforms provide a space for every individual around the world to voice their concerns and thoughts about different aspects of an organization, which can reflect their disposition towards a newly launched product, a service released by the organization, or any event associated with it. These opinions and streams of thought are captured for the organization of interest and processed to analyze the sentiments and emotions of the general public, which are then passed to the predictive models to relate such events to the rise and fall of the organization's stock price.

Business news is another platform that captures the sentiments and emotions of the people who bring money into the stock market. Business NEWS helps in capturing critical events better than social media platforms, as it explicitly targets insight into an event. Because experts in the financial domain frame the interpretation of each event, analysis performed over business news holds vital importance. All engineered features derived from the business NEWS are likewise passed to the predictive models, which learn the correlations among those features to predict stock price movement.

Until recent years, financial indicators and their derived features, together with established financial models, have driven investment in the stock market. They have been powerful enough to capture broad trends and movements of the market. With the exponential growth of other platforms, these models can now be empowered with more advanced features, and hybrid AI models can be introduced to capture feature relationships that are still unknown to financial experts.

This study aims to assess hybrid predictive models and their capability to advance financial modelling by incorporating powerful features directly related to how people around the world, financial experts and non-experts alike, respond to events. While searching for the best predictive model, the study also visualizes the continuous stream of tweets and the periodic feed of business news to visually infer how stock prices attach to the affective responses of people.

2 Background

This section provides a brief overview of stock prices and of how artificial intelligence can help in inferring them. As Big Data streams are captured from different sources and analyzed, this section also introduces the notion of Visual Analytics (VA) tools.

2.1 Stock Market

A stock market is a real or virtual place that provides "trading" facilities for corporations' stocks and derivatives, allowing investors to trade securities and stocks of a corporation or mutual organization. The stock exchange is a regulatory body that governs the issue and redemption of securities. It also facilitates investments, income, capital events, and dividends [49]. The stock market is also known as the secondary market, as it involves trading between two parties, where a party can be an organization, corporation, broker, or investor [51]. Stock prices are highly volatile, but the underlying notion of price remains the same: if a stock is in high demand, its price will rise, whereas if market sentiment turns against the company, its price will dip. All companies whose stocks can be purchased on a stock exchange are known as "listed companies."

2.2 Introduction to Artificial Intelligence

Artificial Intelligence, or AI as we call it, is still a field of ongoing research and experimentation. AI is a field in which human intelligence is replicated in machines, elevating them from simple mechanical devices to intelligent and self-sufficient machines. AI means different things to different people. Some think that AI should closely replicate human behaviour and thought processes, while others think it should be free from the notion of emotion and should rationally figure out the best course of action. AI is an umbrella that takes in various fields and their respective perspectives and techniques, be it philosophy, mathematics, or computer science. Many think that the notion of AI itself is a modern idea, but the vision has been there for the last 50 or so years. It was Alan Turing who brought AI into fashion after the introduction of Turing machines (1937) [52], a model of an ideal self-sufficient, intelligent computer, on the basis of which he developed the theory of automata. After this, the first artificial network, the MP neuron, was developed by Walter Pitts and Warren McCulloch in 1943 [35]. Ever since then, researchers across the globe have been trying to imitate the processes of the human brain. A simple machine qualifies as an AI machine if it can perform all the work that a human can with the help of the brain; a machine that can impersonate human behaviour qualifies to be called artificially intelligent.

  1. Machine Learning

    Machine Learning is an Artificial Intelligence approach that enables systems to learn automatically and refine themselves from experience without explicit code for every feature [7]. The heart of Machine Learning lies in the development of programs that retrieve data and utilise it to learn and improve. The learning procedure involves observing the data for the patterns present and making future decisions based on the patterns and examples provided to the system. The goal is to have the system learn and adjust on its own without human interference.

    Machine Learning is commonly divided into the four categories described underneath:

    (a) Supervised Machine Learning Algorithms: These apply past learning to new data in order to predict future events, and require pre-labelled data.

    (b) Unsupervised Machine Learning Algorithms: These are used in scenarios where the data are neither labelled nor classified. The data are explored to draw inferences that reveal hidden structure in the unlabelled data.

    (c) Semi-Supervised Machine Learning Algorithms: These utilise both labelled and unlabelled data for learning. They are used in cases where labelling the data requires substantial resources.

    (d) Reinforcement Machine Learning Algorithms: These produce actions to interact with an environment, and involve trial-and-error search and delayed rewards.

  2. Deep Learning

    Deep learning is a subset of the broad classification of machine learning in which algorithms are inspired by the structure and functioning of the human brain. Deep learning allows computational models composed of multiple computational layers to learn multiple levels of abstraction within the data [27]. These methods have worked astonishingly well in multiple domains, improving on the existing state of the art.

  3. Transfer Learning

    In today's era, even with the abundant flow of data, there are domains where far less work has been done. Limited research and low reachability of such topics have left significantly less data in those domains. Even when a new problem is similar to an existing one that has already been tackled with the help of AI, it is rare that the new problem follows the same distribution as the one already solved. In such cases, if knowledge transfer is done correctly, it can lead to significant performance improvements in the model, taking away the much more painful task of acquiring and labelling more data. In the recent decade, transfer learning has emerged as a new learning framework to address the problem of labelled-data scarcity [43]. There are two main types of transfer learning techniques:

    (a) Networks as feature extractors: In this approach, features are extracted from some interim processing layer of a computational deep learning model, and the values coming out of the network at that stage are used as feature vectors. Further down the processing pipeline, these feature vectors are used with different models for specific tasks in a domain different from the one the deep learning model was initially trained on [21].

    (b) Fine-tuning pre-trained networks: In this approach, a pre-trained network is used as a starting point; its pre-existing weights are then fine-tuned so that they generalise well over the new task [6]. A minimal sketch of both techniques follows the list.
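A minimal sketch of the two techniques, illustrated for brevity with an off-the-shelf Keras image backbone rather than the study's own models (the study's transfer learning uses BERT vectors, Sect. 3.2); the random arrays stand in for a real downstream dataset:

```python
# Sketch of (a) feature extraction and (b) fine-tuning with a pre-trained
# network; data arrays below are placeholders, not real inputs.
import numpy as np
import tensorflow as tf

images = np.random.rand(8, 224, 224, 3).astype("float32")
labels = np.random.randint(0, 2, size=8)

base = tf.keras.applications.ResNet50(include_top=False, pooling="avg",
                                      weights="imagenet")

# (a) Network as feature extractor: freeze the weights and use the pooled
# activations of the interim layer as fixed feature vectors for another model.
base.trainable = False
features = base.predict(images)        # shape (8, 2048)

# (b) Fine-tuning: add a new task head and keep training the pre-trained
# weights at a small learning rate so they generalise to the new task.
base.trainable = True
model = tf.keras.Sequential([base, tf.keras.layers.Dense(2, activation="softmax")])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy")
model.fit(images, labels, epochs=1)
```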

2.3 Artificial Intelligence in Stock Market Prediction

Recent years have witnessed numerous research efforts that used artificial intelligence to predict stock prices. In AI-enabled stock market prediction, it is noticeable that not only combinations of indicators but also entirely new features came into existence and were incorporated into traditional and recent AI models. In early 2008, a genetic algorithm (GA) in combination with a support vector machine (SVM) was introduced to simulate the market [13]. The GA was able to simulate the indicator variables, but no feature selection was introduced, which made the model computationally very expensive. The subsequent year, 2009, to cut down the computational cost attached to the GA, another three-stage approach was proposed that brought the computational need down multi-fold [28]. The three-stage approach performed an initial technical analysis over the indicator variables based on historical data, then selected features among them, and finally applied an SVM. This machine learning and statistical model outperformed the earlier model. In 2010, with intense research into the market, new indicators were introduced, extending the handful of factors considered in earlier days to many more independent variables proven to impact stock prices. Because of these new predictor variables and the availability of data, new ways of building models were also exploited. One such model was an artificial neural network over the expanded set of indicator variables [24]. This model outperformed the existing machine learning (ML) models and financial mathematical models, giving rise to a new wave of ANN-based models for stock price prediction.

2.4 AI Stock Market Prediction with Financial Indicators

In the stock market, financial indicators such as open, high, low, and close (OHLC) hold great importance. These trading indicators can show how an organisation's stock is perceived in the real world, and OHLC is considered complete for describing the behaviour of an organisation's stock prices. Over an extended period, OHLC can provide useful insights not only about trading strength but also about price gaps. For example, plotting the low and high points for each day provides insight into the day-level volatility of the stock on the stock exchange [17].

Significant data processing capabilities have not only extended our horizon for finding the influence of financial factors over long periods, which was not possible earlier, but have also advanced our decision-making capability. Progress in harnessing Big Data and shaping it to a particular requirement has opened a whole new world for building more data-driven models that can incorporate a larger number of features.

In 2015, a deep learning based, event-driven stock price model showed significant improvement over all previous ML-based models [16]. The DL-based model achieved a remarkable 6% improvement over the earlier model on S&P 500 historical stock data. Extensive data and new model implementations also brought stock predictions closer to the actual future price of the stocks. A more recent research paper, published in 2017, provided essential baselines for different ANN architectures deployed to predict stock market prices [11]. It showed that ANNs, compared to existing models, were able to identify more hidden context in the data, and also improved covariance estimation when subjected to covariance-based market analysis.

2.5 AI Stock Market Prediction with Textual Data

The widespread adoption of technology not only brought the world together but also enabled individuals to share their thoughts, ideas, and experiences on worldwide forums. These thoughts and experiences began to build sentiment among people of the same interest, leading to favouritism towards, or boycott of, a product or organisation, with impact at a global level.

Quite recently, in 2009, new research showed the dependency of stock price prediction on financial news [47]. This paper introduced natural language processing (NLP) to qualitative financial data, and qualitative data came to be regarded as one of the major contributing factors in stock price prediction. Financial news was processed using NLP to create a bag of words including only noun phrases and named entities for the financial domain. Model predictions made on real-time financial news were very close to the subsequent stock price of the impacted organisation. Beyond data from general forums, a broad inclination towards Twitter as a social platform narrowed the search for a place to collect and compute people's sentiments about an organisation. The hashtag functionality and constrained tweet length made Twitter favourable among researchers motivated to extract sentiment from Twitter data. In 2013, Twitter tweets and stock time series data formed the baseline of another study [50]. Topic-based analysis of the Twitter data, combined with the historical stock price variation of a particular organisation, demonstrated the worth of tweets and the power they pass on to quantitative financial data. With advances in NLP, another paper, published in 2015, was able to relate the impact of specific topics to a specific organisation [41]. This method explained how data collected from different forums could be processed and cleaned down to the data that really matters in predicting the stocks of that particular company. The paper introduced the notion of defined topics for specific industries and of sentiment change for a particular topic, and it also brought down the pre-processing cost needed for the data to be of any use, compared to earlier research. The data used to build the model combined existing topic modelling approaches with newly proposed methods and historical financial data. The research not only showed an improvement of 2.07% over existing approaches for 18 stocks over a year, but also effectively captured the contribution of sentiment analysis to stock price prediction in the real world.

  1. Sentiment Analysis Tasks

    The exponential growth in individuals' access to the internet, and hence to social media, has led to a flood of thoughts and ideas shared every second across the world. The ease of curating such thoughts and individual experiences about an organisation has given rise to sentiment analysis. As Zhang, Lei explained, sentiment analysis or opinion mining is the computational study of people's opinions, sentiments, emotions, appraisals, and attitudes towards entities such as products, services, organisations, individuals, issues, events, topics, and their attributes [2, 32]. Over the last decade, numerous studies have tried to capture the opinions of individuals, in their roles as customers, citizens, and above all humans, to find their influence on organisations, countries, and global topics. Such work has found the rationale of individuals not only within a particular geographical area but at the global level, and has tried to measure its impact on different higher-level organisations. Sentiment analysis is broadly categorised and studied at three levels: document level, sentence level, and aspect level [55].

    (a) Document-level sentiment analysis tries to classify the sentiment as neutral, positive, or negative based on the overall sentiment captured in a document. Document-level analysis assumes that each document talks about a single context. A tweet can also be considered a document representing the opinion of an individual about some product or organisation.

    (b) Sentence-level sentiment analysis captures the sentiment of a document at the level of individual sentences. Before sentiment analysis, each sentence is usually checked for subjectivity; subjectivity classification helps filter out objective statements, which are none other than facts [31]. Only sentences with high subjectivity are passed on to the sentiment analysis task. Sentence-level sentiments are likewise captured in three classes: neutral, positive, and negative.

    (c) Aspect-level sentiment analysis focuses on summarising the overall sentiment of people's opinions about a particular entity, also known as a target. Aspect-level analysis expresses a sentiment for each aspect of an entity. For example, if an organisation is the entity, then the salary package can be one aspect and employee perks a different aspect, and so on. When aspect-level sentiments are captured, one can easily find out whether the salary or the employee perks are viewed favourably in the organisation.

  2. Emotion Analysis Tasks: Emotion analysis is closely related to sentiment analysis, with a finer analysis of the inferred polarity. For example, a negative sentiment can be caused by sadness or anger, while a positive sentiment can be caused by happiness or anticipation. Following the practice in sentiment analysis, many deep learning models have been applied to detect emotions [55]. Zhou proposed an Emotional Chatting Machine (ECM) based on GRUs that can generate responses that are grammatically relevant and emotionally consistent [55]. Their system models the emotion factor using emotion category embedding, internal emotion memory, and external memory. A bilingual attention network model was proposed by Wang [54] for code-switched emotion prediction. Abdul-Mageed and Ungar built a large, automatically curated dataset for emotion detection using distant supervision and then used GRNNs to model fine-grained emotion [1]. They extended the classification to the model by Plutchik [44], who proposed 8 primary emotion dimensions, as shown in Fig. 1.

Fig. 1. Plutchik's wheel of emotion

2.6 AI Stock Market Prediction with Twitter Data Analysis

The rising number of blogs and social media platforms in the last decade provided a means for people to put forward their opinions about entities, be they organisations or individuals. Mining this massive amount of opinionated data provides a means to quickly capture the sentiments of targeted individuals about any product, organisation, or government institution. Among social media, Twitter gained favour with the worldwide community and became one of the dominant platforms for conveying opinions. Its limited tweet length and worldwide acceptance also grabbed the attention of the AI community, and many research papers were published in the last decade establishing the relation of tweet sentiment to stock market prediction. In 2010, A. Pak proposed a Twitter corpus for sentiment analysis in which tweets were manually tagged to specific emotions, such that happy emoticons signified positive sentiment and sad emoticons signified negative sentiment [42]. In 2013, Twitter tweets combined with time series data provided significant results in stock market prediction [50].

2.7 AI Stock Market Prediction with News Data Analysis

While plenty of articles and research papers on data mining and time series for stock market price prediction were published in the last decade, only a handful of papers covered text mining for stock market prediction. Some of the earliest papers to use business news for financial forecasting [10, 26] did a remarkable job, but given the absence of news and public opinion about particular organisations and the highly volatile nature of stock prices, there was still scope to enhance the models. Most sentiment classification involves training the system on documents labelled by experts or generated by the system. In 2004, Mittermayer proposed the NewsCATS engine, which classified news into three classes, namely Good, Bad, and No-movers, and predicted the movement of stocks based on the category of the news [38]. The AZFinText system proposed in 2010 is a regression system that likewise tried to predict stock market movement based on news [48].

2.8 Big Data Visualisation of Streaming Data

Big Data poses a computing challenge because of its rapid velocity, immense volume, and wide variety [25]. Ever-increasing, human-centered systems are creating enormous amounts of data. Because such a high volume and wide variety of information can easily be generated and collected at very high speed, a necessity has arisen for Big Data visualization and visual analytics in diverse real-world applications.

In the last few years, various tools and techniques have been developed to visualize patterns in textual data, of which the most popular ones try to find co-occurrences of entities [9, 18]. There has also been a multi-fold increase in software for visualizing such data [4]. Tableau is one of the most commonly used software packages for this analysis; its ability to hook into different ingestion systems makes it a favorable choice among developers and managers for quickly crunching data across multiple axes [14]. Alongside it, different systems were built not only to visualize Big Data but to provide an end-to-end solution to the Big Data visualization problem. The ELK stack is one such stack commonly used in industry [20].

2.9 Contribution

The main contribution of this work is to analyze and develop an architecture that provides visualization aid for the prediction of stock market prices. In prior research, groups mostly focused on either qualitative or quantitative data and applied more modest algorithms to the task without any practical validation. Using state-of-the-art machine learning models makes this research novel in the financial domain, but at the same time the deployment of a deep learning (DL) model sacrifices explainability. The visualization aid was therefore added to the framework to make it more explainable and to mitigate this drawback of deep learning algorithms.

Our approach extends the existing framework described in Sect. 2.8. In addition to bringing together the components that worked best in the abstract design, we created a pipeline that can ingest from multiple platforms in parallel without any concern over whether the data is qualitative or quantitative. The framework is integrated with state-of-the-art machine learning models of the finance domain, and at the same time the ingested data is represented in real time on a Tableau dashboard to help comprehend the model's predictions.

Our finding is that model prediction need not come at the expense of explainability. In building the framework, we developed component extensions that integrate seamlessly and provide the necessary visualization aid for the financial model's predictions. Although the data ingestion requires specific formulation and filtering, the process itself is straightforward and easily accessible.

3 Big Data Pre-processing and Visualization of Tweets, Business NEWS and Financial Indicators

This section defines the end-to-end pipeline utilized in this study for the analysis of the qualitative as well as quantitative data flowing in from the different platforms. Although data is scraped according to the scraping policies and APIs of each platform, once we have a data stream, the ingestion, processing tunnel, and storage space remain the same across the project.

Fig. 2. Proposed architecture

Figure 2 describes the proposed architecture used in the project for creating the visualization pipeline and for the data modeling. In an initial study, candidate platforms were evaluated based on scraping policies and API availability, and three different platforms were found best for the study. The Twitter API has been used to collect Twitter data from the social media perspective, the Financial Times feed was consumed to gather business NEWS, and the Quandl API was consumed to collect the financial indicators of Microsoft stocks. On top of these APIs and feeds, a Kafka streaming tunnel with NGINX was created for continuous monitoring and streaming of data from the different platforms. S3 is used as a data lake for the data ingested from the Kafka pipeline. Over S3, a Logstash component was built whose primary aim is to provide a server-side processing pipeline, with output served to Elasticsearch, where the data takes a definite structure and can be queried in a much more meaningful way. This processed, ready-to-use data is then consumed by the machine learning models to predict the stock price; the same data is consumed by Tableau for intermediary visualization and analysis, giving a notion of how the model should behave, and the model is tweaked if discrepancies between the visuals and the machine learning model's inference are found. A sketch of the Kafka-to-S3 hop follows.
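A minimal sketch of the ingestion hop from Kafka to the S3 data lake, using kafka-python and boto3; the topic names, bootstrap server, and bucket are illustrative assumptions, not the study's actual configuration:

```python
# Consume the per-platform Kafka topics and land raw events in S3,
# where Logstash later picks them up and forwards them to Elasticsearch.
import json
from datetime import datetime

import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "tweets", "business_news", "financial_indicators",   # assumed topic names
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
s3 = boto3.client("s3")

for message in consumer:
    # Partition objects by topic and timestamp for downstream processing.
    key = f"{message.topic}/{datetime.utcnow().isoformat()}.json"
    s3.put_object(Bucket="stock-data-lake",                # assumed bucket name
                  Key=key,
                  Body=json.dumps(message.value).encode("utf-8"))
```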

The dataset curated from the different platforms for the Microsoft organization spans the 4th of April 2015 to the 28th of March 2019. Owing to the small number of data points, no development set has been taken out of the dataset, and the training and testing split is based on dates: training data points run from the 4th of April 2015 to the 1st of January 2019, and testing data points from the 2nd of January 2019 to the 28th of March 2019. An overview of the dataset is provided in Table 1.

Table 1. Dataset overview - train and test split

In Table 1, 'f' refers to the number of features in the dataset and 'f(VIF)' represents the number of features remaining after removing correlated features from the data frame. 'f(VIF)' is null for Microsoft finance and NEWS BERT because the feature arrangement holds a semantic representation of each textual document, and that relation would break if correlated-feature removal were applied to the dataset. A sketch of the VIF-based filtering is given below.
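A minimal sketch of correlated-feature removal via the variance inflation factor (VIF), assuming a pandas DataFrame of numeric features; the threshold of 10 is a common rule of thumb, not a value taken from the study:

```python
# Iteratively drop the feature with the highest VIF until all remaining
# VIFs fall below the threshold, yielding the reduced 'f(VIF)' feature set.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() < threshold:
            break
        X = X.drop(columns=[vifs.idxmax()])   # remove most collinear feature
    return X
```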

3.1 Data Pre-processing

Data pre-processing is only required for the raw textual data curated from the Twitter platform and the Financial Times website.

Pre-processing of Scraped Textual Data

The textual documents are processed using the ekphrasis tool [5], in which a series of operations is performed. A brief visual description of this tool is given in Fig. 3, and its components are explained underneath:

Fig. 3. Pre-processing pipeline for textual data

  1. Noisy Entity Removal: Twitter is a social networking platform at the global level, which makes tokenisation of Twitter data a most complicated task. It is essential to keep words intact together with the emotions attached to them, and the creative writing used to form new emoticons and hashtags must also be considered. Textual data curated from the business NEWS platform is much more formal and hence requires less cleaning effort. The goal here is to remove stop words, punctuation, URLs, and censored words, while leaving complex emoticons in place.

  2. Text Normalization: This step involves tokenising the processed data coming from the stage above. Tokenised words are then lemmatised so that each word can be viewed as its root word. A minimal sketch of both steps follows.
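A minimal NLTK-based approximation of the two pre-processing steps above (the study itself uses the ekphrasis tool); the regular expressions and stop-word list are illustrative assumptions:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    # Noisy entity removal: strip URLs and censored words.
    text = re.sub(r"https?://\S+", " ", text)       # URLs
    text = re.sub(r"\b\w*\*+\w*\b", " ", text)      # censored words like "f***"
    # Text normalization: tokenise, drop stop words/punctuation, lemmatise.
    tokens = nltk.word_tokenize(text.lower())
    return [LEMMATIZER.lemmatize(t) for t in tokens
            if t.isalpha() and t not in STOP]

print(preprocess("Check https://example.com - MSFT looks great, loving it!"))
```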

3.2 Feature Extraction Techniques

Feature Extraction from Textual Data

  1. Sentiment Analysis. For capturing sentiment, different libraries and ontologies have been used; a combined usage sketch follows this item:

    (a) TextBlob

      TextBlob is a library supporting Python 2 and 3 for processing textual data. It provides a simple application interface which helps in efficiently handling everyday natural language tasks such as part-of-speech (POS) tagging, extraction of entities based on POS tags, sentiment analysis, and more [33]. Under the hood, TextBlob utilises the NLTK and pattern libraries, which are widely used and accepted in the natural language processing (NLP) community. In recent years, TextBlob gained wide acceptance in the AI community, as can readily be seen from the number of research papers using it as a tool for sentiment analysis [3, 34, 53].

    (b) Pysentiment

      Pysentiment is a library for dictionary-based sentiment analysis. The two dictionaries used by this library are the Harvard IV-4 dictionary from Harvard University and the Loughran and McDonald Financial Sentiment Dictionary.

    • i. The Harvard Institute provides the HIV4 dictionary, which supplies 185 features for each of its 11,789 words. The features represent different aspects of a word, ranging over sentiment, affiliation, psychology, emotions, and more.

    • ii. In 2012, the Loughran and McDonald Financial Sentiment Dictionary (LM), consisting of 84,330 financial words with their sentiments, was published [36]. After its release into the public domain, this dictionary assisted much research in capturing sentiment from financial articles, business news, and more.
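A minimal usage sketch of the two sentiment libraries named above; the example sentence is illustrative, and pysentiment2 is assumed to be the pip-installable distribution of pysentiment exposing the HIV4/LM interface:

```python
from textblob import TextBlob
import pysentiment2 as ps

text = "Microsoft stock rallies after strong quarterly earnings."

# TextBlob: polarity in [-1, 1] and subjectivity in [0, 1].
blob = TextBlob(text)
print(blob.sentiment.polarity, blob.sentiment.subjectivity)

# Pysentiment: dictionary-based scores from Harvard IV-4 and Loughran-McDonald.
for dictionary in (ps.HIV4(), ps.LM()):
    tokens = dictionary.tokenize(text)
    print(dictionary.get_score(tokens))  # Positive, Negative, Polarity, Subjectivity
```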

  2. Emotion Analysis: “Words are associated with emotions,” as quoted in the NRC Emotion Lexicon paper [39]. Different deep learning models are available for capturing emotions from tweets and business news [8], but to keep the architecture lightweight, the tokenized, processed documents are mapped against the NRC emotion lexicon, and eight emotions are captured in the process. As more than ten thousand tweets are posted on the Twitter platform per day, a normalized emotion scoring system is used to compute the emotions (a sketch follows Eq. 1).

    Score Formation for Emotions:

    $$\begin{aligned} score_{e,d} = \frac{\sum _{t=1}^{n_{d}} \frac{count_{e}(t)}{length(t)}}{n_{d}} \end{aligned}$$
    (1)

    In Eq. 1, ‘t’ indexes the tweets of a day ‘d’, ‘n_d’ is the total number of tweets posted on that day, ‘count_e(t)’ is the number of words in tweet ‘t’ associated with emotion ‘e’, and ‘length(t)’ is the number of tokens in the tweet.
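A minimal sketch of the per-day emotion score of Eq. 1, assuming a word-to-emotions mapping loaded from the NRC lexicon; the tiny LEXICON and the example tweets are illustrative placeholders:

```python
from collections import defaultdict

LEXICON = {"crash": {"fear", "sadness"}, "win": {"joy", "trust"},
           "love": {"joy"}, "fraud": {"anger", "disgust"}}
EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

def day_emotion_scores(tweets: list[list[str]]) -> dict[str, float]:
    """tweets: the tokenized tweets of one day. Returns Eq. 1 per emotion."""
    totals = defaultdict(float)
    for tokens in tweets:
        for emotion in EMOTIONS:
            count = sum(1 for t in tokens if emotion in LEXICON.get(t, ()))
            totals[emotion] += count / max(len(tokens), 1)   # per-tweet ratio
    n = max(len(tweets), 1)
    return {e: totals[e] / n for e in EMOTIONS}              # per-day average

print(day_emotion_scores([["msft", "win", "love"], ["market", "crash"]]))
```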

  3. Bidirectional Encoder Representations from Transformers (BERT): State-of-the-art representation of textual documents. BERT provides pre-trained vector representations of words, which can be used further with various AI models. The BERT architecture is a framework that provides representations by jointly conditioning on both the left and the right context in all processing layers [15]. BERT vectors are used in the experiment as a shallow transfer learning step to enhance the capabilities of the current predictive models. BERT is used as a service to convert the processed text, both for Twitter and business NEWS, into its corresponding vectors. As there are multiple BERT models, the current experiment utilises BERT-Base-Uncased, which represents a document in 768 dimensions; a minimal client sketch follows.
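A minimal sketch of the BERT-as-a-service encoding step, using the bert-serving-client package; it assumes a bert-serving-server instance loaded with the BERT-Base-Uncased model is already running locally:

```python
from bert_serving.client import BertClient

bc = BertClient()  # connects to the server's default ports on localhost
docs = ["msft announces new cloud contract",
        "quarterly earnings beat expectations"]
vectors = bc.encode(docs)   # ndarray of shape (2, 768), one vector per document
print(vectors.shape)
```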

Fig. 4. Textual: feature engineered data set preparation

Feature Extraction from Financial Indicators. The 'Quandl' platform provides financial indicators such as OPEN, CLOSE, Adj CLOSE, VOLUME, and DATE for a specified duration. Existing research in the area of stock market prediction helps the system derive a significant number of features from the information provided by the Quandl platform. Since the label for the dataset, as explained in Sect. 3.3, is generated from the open price of the stock on the current and the successive day, all derived features are built upon the OPEN financial indicator. From the OPEN indicator, the corresponding Fourier transformation is derived following wavelet research [30]; the moving average is computed as a feature with lags of 2, 7, and 21 days [22]; and Moving Average Convergence Divergence (MACD) [12, 46], upper and lower bounds [29], the exponential moving average with lags of 12 and 21 days [37], and momentum and log momentum [19] complete the set. A pandas sketch of these derivations is given below.
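A minimal pandas sketch of the derived features named above, assuming a date-sorted DataFrame with an 'Open' column; the window lengths follow the text, while other parameters (the 26-day MACD slow span and the 2-sigma band width, both conventional choices) are assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Open": np.random.uniform(90, 110, 300)})  # placeholder prices

# Moving averages with lags of 2, 7 and 21 days.
for w in (2, 7, 21):
    df[f"ma_{w}"] = df["Open"].rolling(w).mean()

# Exponential moving averages with lags of 12 and 21 days, and MACD
# (fast EMA minus the conventional 26-day slow EMA).
ema12 = df["Open"].ewm(span=12).mean()
df["ema_12"] = ema12
df["ema_21"] = df["Open"].ewm(span=21).mean()
df["macd"] = ema12 - df["Open"].ewm(span=26).mean()

# Upper and lower bounds around the 21-day moving average.
std21 = df["Open"].rolling(21).std()
df["upper_band"] = df["ma_21"] + 2 * std21
df["lower_band"] = df["ma_21"] - 2 * std21

# Momentum and log momentum.
df["momentum"] = df["Open"] - df["Open"].shift(1)
df["log_momentum"] = np.log(df["Open"] / df["Open"].shift(1))
```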

3.3 Formulation of the Feature Engineered Data Set and Label

The feature extraction techniques of Sect. 3.2 are used to build the overall feature-engineered dataset for the current experiment. Qualitative textual data is converted into quantitative data with the help of the feature extraction techniques and the custom score mechanism explained in Sect. 3.2. In parallel, derivative financial indicators have been developed based on prior research in the field of finance.

Formulation of Target Labels. The system tries to predict the rise or fall of the stock for each day under test for the current organisation. Hence, the opening stock price is taken as the measure from which to compute the label for a particular day. The formula for computing the target label is provided underneath:

$$\begin{aligned} {TargetLabel}_{t} := {\left\{ \begin{array}{ll} 1 &{} \text {if } {OpenIndex}_{t} \le {OpenIndex}_{t+1} \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

In this equation, t denotes the day under evaluation, for which the label is to be assigned, and t + 1 the next day. According to the equation, if the market is going up, 1 is assigned as the label, whereas 0 is assigned for a fall in the stock price. A one-line pandas sketch follows.
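A minimal pandas sketch of Eq. 2, assuming a date-sorted DataFrame `df` with an 'Open' column; the final day has no successor and is dropped:

```python
# Label is 1 when the next day's open is at least today's open, else 0.
df["label"] = (df["Open"] <= df["Open"].shift(-1)).astype(int)
df = df.iloc[:-1]  # the last row has no t+1 open price to compare against
```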

Formulation of the Textual Feature Engineered Data Set. The feature extraction techniques provide a way to extract features and give them quantitative meaning. All the features from the analysers are combined to formulate the overall dataset. In addition to the features coming from the analysers, one more Twitter-specific feature provided by the Twitter API is taken into consideration. The architecture for the textual feature-engineered dataset is visualised in Fig. 4.

Formulation of the Financial Indicators Based Feature Engineered Data Set and Label. As described in the previous section, all the derivatives of the OPEN indicator form the new engineered features. The formulation of the financial indicator dataset is visualised in Fig. 5.

Fig. 5. Financial indicator: feature engineered data set preparation

3.4 Twitter Data Accumulation and Visualization

Once the definite dataset described in Table 1 is formulated, emotion analysis over the tweets is done with the help of the NRC [2] lexicons, and each individual emotion score is further amplified with the custom score of Eq. 1. Two positive emotions, namely ‘trust’ and ‘joy’, are evaluated against the OPEN index of the stock market and visualised in Fig. 6. Two negative emotions, namely ‘anger’ and ‘sadness’, are evaluated against the OPEN index of the stock market and visualised in Fig. 7.

Fig. 6. Positive emotions in Tweets

Fig. 7. Negative emotions in Tweets

Sentiment analysis of tweets is done with the help of the TextBlob (1a) and pysentiment (1b) libraries. Within pysentiment, two dictionaries are used, the Harvard Institute dictionary and the Loughran and McDonald Financial Sentiment Dictionary, to capture the sentiments flowing in the tweets concerning the organisation. All the sentiments are averaged out and visualised in Fig. 8.

Fig. 8. Sentiments vs opening price for Microsoft stocks based on Tweets

3.5 Business NEWS Data Accumulation and Visualization

Emotion analysis over the Business NEWS is done with the help of the NRC [2] lexicons, and each individual emotion score is further amplified with the custom score of Eq. 1. Two positive emotions, namely ‘trust’ and ‘joy’, are evaluated against the OPEN index of the stock market and visualised in Fig. 9. Two negative emotions, namely ‘anger’ and ‘sadness’, are evaluated against the OPEN index of the stock market and visualised in Fig. 10.

Sentiment analysis of Business NEWS is done with the help of the TextBlob (1a) and pysentiment (1b) libraries. Within pysentiment, two dictionaries are used, the Harvard Institute dictionary and the Loughran and McDonald Financial Sentiment Dictionary, to capture the sentiments flowing in the news articles concerning the organisation. All the sentiments are averaged out and visualised in Fig. 11.

4 Architecture for Stock Market Prediction

The experiment evaluates data gathered from the social platform, business NEWS, and financial indicators with state-of-the-art models. Three strategies have been carried forward to build hybrid architectures that improve on the performance of earlier existing systems.

4.1 Hybrid Architecture Based on Best Model Selection Strategy

The first strategy is to build an architecture that can incorporate quantitative as well as qualitative data. The predictions of the best-performing model for each of the platforms are taken together and given to a voting classifier. The voting classifier then uses the soft voting technique, weighting the probability outputs of the different models housed inside it, as visualized in Fig. 12 and sketched below.
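A minimal scikit-learn sketch of the soft-voting step; the three base estimators stand in for the per-platform best models (Sect. 6.1 names a dense neural network, Naive Bayes, and Random Forest), and, as a simplification, all estimators here share one placeholder feature matrix, whereas the study feeds each platform's own features to its model:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Placeholder data standing in for the real per-platform features and labels.
X_train = np.random.rand(100, 8); y_train = np.random.randint(0, 2, 100)
X_test = np.random.rand(10, 8)

voter = VotingClassifier(
    estimators=[("twitter", LogisticRegression(max_iter=1000)),
                ("news", GaussianNB()),
                ("finance", RandomForestClassifier())],
    voting="soft",        # average the predicted class probabilities
    weights=[1, 1, 1],    # per-model weights; tunable, assumed equal here
)
voter.fit(X_train, y_train)
rise_or_fall = voter.predict(X_test)   # 1 = rise, 0 = fall
```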

Fig. 9. Positive emotions in business NEWS

Fig. 10. Negative emotions in business NEWS

4.2 Hybrid Architecture Based on Shallow Transfer Learning Model

The second strategy evaluates the effectiveness of a state-of-the-art shallow-network transfer learning technique in the form of BERT vectorization. A tweet or a NEWS abstract forms an independent document of variable length. Each document goes through the BERT vectorization service, where it is converted into a fixed-length vector; the vectors of tweets and news are independent of each other. Once all the fixed-length vectors for a whole day are available, the average fixed-length vector is formed for that day, and the Twitter and news averages are merged to create one data point per day, as sketched below. A high-level context diagram is provided in Fig. 13.
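A minimal numpy sketch of the per-day averaging and merging of BERT vectors described above; the two random arrays stand in for that day's 768-dimensional tweet and news vectors:

```python
import numpy as np

tweet_vectors = np.random.rand(5000, 768)   # one row per tweet of the day
news_vectors = np.random.rand(12, 768)      # one row per news abstract of the day

# Average each platform's vectors for the day, then merge into one data point.
day_point = np.concatenate([tweet_vectors.mean(axis=0),
                            news_vectors.mean(axis=0)])
print(day_point.shape)                      # (1536,) -> one data point per day
```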

Fig. 11. Sentiments vs opening price for Microsoft stocks based on business NEWS

4.3 Hybrid Architecture Based on Engineered Feature Dataset

The third strategy is the formulation of a combined dataset, prepared by joining all the features extracted from Twitter, business news, and the Quandl-collected financial data. Once the data is formulated, it is subjected to the machine learning models for training and inference. A high-level context diagram is provided in Fig. 14.

Fig. 12. Hybrid architecture based on best model selection strategy

5 Evaluation Metrics

As the current experiment is a supervised problem, the metrics used to compare the results of the different machine learning and deep learning models are accuracy, precision, recall, and F1-Score.

  1. Accuracy:

    Accuracy is the ratio of correct predictions over all predictions made across the classes of the classification problem. Mathematically, it can be viewed as the ratio of true positives and true negatives to all the data points in the dataset, averaged over the classes. The mathematical formula for accuracy is given underneath:

    $$\begin{aligned} \frac{\sum _{i=1}^{l} \frac{t p_{i}+t n_{i}}{t p_{i}+f n_{i}+f p_{i}+t n_{i}}}{l} \end{aligned}$$
    (3)

    In Eq. 3, ‘tp’ represents the true positives of the model, ‘tn’ the true negatives, ‘fn’ the false negatives, ‘fp’ the false positives, and ‘l’ the number of classes.

  2. Precision:

    Precision defines the exactness of the system. It is the ratio of true positives identified by the model to all instances marked as positive by the model. The mathematical formula for precision is given underneath:

    $$\begin{aligned} \frac{\sum _{i=1}^{l} \frac{t p_{i}}{t p_{i}+f p_{i}}}{l} \end{aligned}$$
    (4)

    In Eq. 4, ‘tp’ represents the true positives of the model, ‘fp’ the false positives, and ‘l’ the number of classes.

  3. Recall:

    Recall helps in evaluating the completeness of the model. It is the ratio of correctly predicted positives to all ground-truth positives. The mathematical formula for recall is given underneath:

    $$\begin{aligned} \frac{\sum _{i=1}^{l} \frac{t p_{i}}{t p_{i}+f n_{i}}}{l} \end{aligned}$$
    (5)

    In Eq. 5, ‘tp’ represents the true positives of the model, ‘fn’ the false negatives, and ‘l’ the number of classes.

  4. F1-Score:

    The F1-Score is computed as the harmonic mean of precision and recall, i.e. Eq. 6 with β = 1, where the subscript M denotes the macro-averaged values. Its mathematical formula is given underneath, followed by a scikit-learn sketch of all four metrics:

    $$\begin{aligned} \frac{\left( \beta ^{2}+1\right) \, {Precision}_{M} \cdot {Recall}_{M}}{\beta ^{2}\, {Precision}_{M} + {Recall}_{M}} \end{aligned}$$
(6)
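A minimal scikit-learn sketch of the four metrics above, with macro-averaging matching the per-class averages of Eqs. 3–6; the two label arrays are illustrative:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth rise/fall labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1-score :", f1_score(y_true, y_pred, average="macro"))
```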
Fig. 13. Hybrid architecture based on shallow transfer learning model

6 Results

The three hybrid architectures of Sect. 4 are evaluated below on accuracy, precision, recall, and F1-Score as defined in Sect. 5.

6.1 Hybrid Architecture Based on Best Model Selection Strategy

The best model from each platform has been selected: from the Twitter models the dense deep neural network, from the business NEWS models the Naive Bayes model, and from the financial indicator models the Random Forest. The individual outputs coming from each of these best models are given to the Voting Classifier to make the prediction. The results obtained from this architecture are described in Table 2.

Table 2. Evaluation metric for hybrid architecture based on best model selection strategy

6.2 Hybrid Architecture Based on Shallow Transfer Learning Model

The dataset evaluated in this section results from merging the BERT vectors of Twitter documents and Business NEWS articles on a daily basis. The evaluation results for the machine learning models are provided in Table 3 and those for the deep learning models in Table 4.

Fig. 14. Hybrid architecture based on engineered feature dataset

Table 3. Evaluation metric for hybrid architecture based on shallow transfer learning ML model

6.3 Hybrid Architecture Based on Engineered Feature Dataset

The accumulated feature-engineered datasets from the multiple platforms are taken and evaluated with the machine learning and deep learning models. The evaluation results of the machine learning models on the framed dataset are provided in Table 5, and the deep learning based evaluation in Table 6.

Table 4. Evaluation metric for hybrid architecture based on shallow transfer learning DL model
Table 5. Evaluation metric for hybrid architecture based on engineered features dataset ML model
Fig. 15. AI2VIS4BigData reference model

Table 6. Evaluation metric for hybrid architecture based on engineered features dataset DL model

7 Validation of the AI2VIS4BigData Reference Model

This section maps the proposed architecture of the study onto the AI2VIS4BigData reference model [45], as shown in Fig. 15. This mapping serves not only to validate the proposed system but also to provide a useful gateway for extending this research and for possible future collaboration. For the AI2VIS4BigData processing step 'Data Management & Curation,' our data ingestion pipeline, as proposed in Sect. 3, can be used directly. For the 'Interaction & Perception' processing step of the reference model, Tableau can facilitate the meaningful visualization needed to explain the inferences made by the AI model.

8 Conclusion and Future Work

Amongst all the hybrid architectures, the Random Forest model was able to outperform all the other machine learning and deep learning models by a significant margin. An accuracy of 72.41% and a weighted average precision of 72.00% show the balanced disposition of the model towards the two classes, the rise and the fall of the stock price on the subsequent day.

The present research provides a feasibility study of social media platforms and Business NEWS for stock market prediction. The findings, in terms of affective analysis visualization and model building, showed a significant correlation between the social media platform, the Business NEWS, and stock price movement. The study also reported the results obtained from state-of-the-art methodologies on the research problem. As a remark, even though the stock market is highly volatile, with the amount of data flowing through different social media platforms and trustworthy Business NEWS, it will in the coming future be very possible to capture stock price movement efficiently from multiple such platforms.

In future, the directions mentioned underneath can be explored to build a better visualization platform that provides explainability for black-box machine learning models:

  • More complex emotions can be captured with an appropriate mathematical formulation, which can improve the efficiency of the system.

  • More hybrid model strategies can be evolved and evaluated, as deep neural networks underperformed in most cases of the current experimental setup.

  • More complex features can be developed from the financial indicators, as they showed prominent results as individual models.

  • Parallel research on multiple different platforms demands scalability; scalable modules can be developed to capture events in real time.