1 Introduction

In today’s society, much human interaction takes place online through blogs, emails and chat boards, to name a few. Blogging websites like Twitter have gained mass popularity and serve as a medium for communicating through a few sentences, embodying the low social presence and high self-disclosure classification of Social Media as defined by Kaplan and Haenlein (2010). The nature of microblogs, being more to the point on a topic and less verbose (140-character limit for Twitter posts), makes them prime candidates for extracting sentiment for use in predictive analytics (Bermingham and Smeaton 2010; Ghiassi et al. 2013; Martínez-Cámara et al. 2014; Aisopos et al. 2016; Saif et al. 2016).

Gruhl et al. (2005) showed that blogs and other online social media websites are precursors to ‘real-world’ behavior and that the volume of posts related to various products on Amazon’s website is highly correlated with actual purchase decisions. Pang and Lee (2004) provided further support for social media data as a viable source to use in predictive analytics, which is validated by the fact that people are more inclined to share their opinions with mere strangers on social media websites. Extracting features from social media messages has proven to be a robust method for predicting a variety of different labels. Hennig-Thurau et al. (2015) and Asur and Huberman (2010) leveraged Twitter messages related to a specific movie before its release date and showed a positive correlation between message volume and movie ticket sales. Wu and Brynjolfsson (2014) created an index of Google search queries related to housing prices and sales, which was shown to be a forward-looking indicator of housing market trends. Choi and Varian’s (2012) research showed that Google query search volume is a strong predictor of future economic activity in various industries. Also, Google Trends data was leveraged to forecast weekly volatility by Hamid and Heiden (2015). These studies further validated the internet as a source for robust predictive data and behavior patterns.

Several studies related to capital markets suggested that the volume of stock chatter messages was a predictor of volatility and next day returns (Wysocki 1998; Tumarkin and Whitelaw 2001; Antweiler and Frank 2004; Da et al. 2011; Zhang et al. 2013, 2014; Shen et al. 2016). Bollen et al. (2011) extracted the mood state and sentiment of many users on a stock blogging site and presented highly predictive directional moves in the Dow Jones Industrial Average, two days out, with 87.6% accuracy. Also, Houlihan and Creamer (2015) leveraged volume and sentiment as features from StockTwits messages and showed how they help explain continuation and reversal effects. Sentiment will be one of the main features used in this research.

Another way to capture the market sentiment is through the options market. Anthony (1988) has shown that increased trading in call options leads to next day gains in various underlying stocks that experienced a spike in call volume the day prior. This research would warrant using call option volume as a feature for a model to predict a label, such as future directional moves. Chen and Lu (2017) found that stocks with large decreases in option implied volatility experienced abnormal gains. Cao et al. (2003) find that option volume imbalances, specifically, short-term out of the money call option volumes, are predictors of pending takeovers. This finding points to a somewhat inefficient market, one where only informed traders have access to insider information before an announcement. However, this inefficiency can be leveraged as an indicator for a model that attempts to predict a label such as the next day directional move. Billingsley and Chance (1988) showed one such indicator, the put-call ratio, to yield abnormal gains when used in a trading strategy. The put-call ratio, PCR, is simply the total daily put volume divided by the daily call volume for a particular equity. Intuitively, a ratio below 1.0 would be a bullish indicator, whereas a ratio greater than 1.0 would be a bearish indicator. However, Billingsley and Chance (1988) show that a ratio of 0.7 is a better threshold. Additionally, not only is the PCR suggestive of being a short-term indicator for near-term directional moves of stocks or indexes, but it also seems to be more of a contrarian indicator than a conformist indicator. In fact, several other indicators are contrarian in nature, including short-term interest and VIX. Hu (2014) shows that imbalances between option volume and underlying volume predict future stock returns. Pan and Poteshman (2006) also show that volume for specific traders contained information about future prices. This latter study had access to a unique data set that showed new buyer volume broken out by various traders. Unique put-call ratios were derived for each particular trader. The data (1990–2001) was analyzed using a univariate regression, where the independent variables are the corresponding put-call ratios and the dependent variable is the next day risk-adjusted return. The results showed that stocks with low put-call ratios derived from a particular trader (full-service) outperformed stocks with high put-call ratios by +40 basis points on the next day and 1% over the following week. The premise here is that informed, full-service investors trading options on the underlying stock instead of index options have firm-specific information rather than market-wide news. Also, stocks that went through periods of higher breadth (advancing issues relative to declining issues) rewarded investors with abnormal returns of 2.92% in 6 months and 4.95% in a 12-month period as shown by Chen et al. (2002). Also, Houlihan and Creamer (2014) formulated trader specific call-put ratios based on option contract volume and determined that specific traders have superior information over other traders, as strategies based on their call-put ratios showed higher Sharpe ratios.
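As a small illustration of the put-call ratio described above, the following Python sketch computes the PCR for one equity and reads it against a threshold. The function names and the example volumes are hypothetical, and the reading shown is the naive (non-contrarian) one; it is not the trading rule of any of the cited studies.

```python
def put_call_ratio(put_volume: float, call_volume: float) -> float:
    """Total daily put volume divided by daily call volume for one equity."""
    if call_volume <= 0:
        raise ValueError("call volume must be positive")
    return put_volume / call_volume


def pcr_signal(pcr: float, threshold: float = 1.0) -> str:
    """Naive reading of the ratio: below the threshold is bullish, above is bearish."""
    if pcr < threshold:
        return "bullish"
    if pcr > threshold:
        return "bearish"
    return "neutral"


# Hypothetical daily volumes for a single stock.
ratio = put_call_ratio(put_volume=4800, call_volume=6000)
print(ratio, pcr_signal(ratio), pcr_signal(ratio, threshold=0.7))
```

With the intuitive threshold of 1.0 this ratio reads as bullish, while the 0.7 threshold reported by Billingsley and Chance (1988) flips the same observation to bearish. A contrarian strategy would trade against whichever reading is chosen.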

The contribution of this research is to show that sentiment extracted from social media messages and call-put ratios derived from market data contain information that helps forecast asset returns. In addition, we leveraged a unique dictionary which captures measurable mood states of authors. Sentiment is a crowd-sourced measure from the general investing community, and behavior takes the form of overreaction and especially underreaction effects exhibited by investors. Additionally, the call-put ratios represent traders whose sentiment and behavior can be captured through option volume data. Leveraging all features together yielded the highest monthly cumulative returns and annualized Sharpe ratios, suggesting that the additional information generated by combining both sentiment and behavior from social media and market data improved forecasts of asset return direction. Lastly, we validate several risk factors that help explain asset price returns.

Table 1 Financial and sentiment risk factors

2 Data

All raw data, price data, and micro-blogging messages were drawn from the period between July 2009 and September 2012. Additionally, time series were formed for all the various features (Table 1) and labels to create a matrix for all stocks used in the analysis. All features are derived on a stock by stock basis for each day.

Social media

  • Roughly 4.1 million messages, posted between July 13, 2009, and October 31, 2012, were provided by StockTwits, a social media platform for the financial community consisting of 230,000 active members who discuss and exchange trading ideas. StockTwits also enables its users to prefix tickers with a $ (cashtags), for example $TWTR, when discussing specific assets in messages, allowing for a simple regex match. This research uses only the following StockTwits fields:

    • body—the message text.

    • created_at—datetime stamp of when messages were posted. Note: only messages with a timestamp between 09:30 am EST and 4:00 pm EST were used in this analysis.

    • symbols—list of tickers mentioned in message (cashtags).
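The cashtag match and the trading-hours filter mentioned above can be sketched in a few lines of Python. This is only an illustration: the regular expression (one to five letters after the $), the helper names, and the assumption that created_at is already expressed in Eastern time are ours and are not taken from the StockTwits data specification.

```python
import re
from datetime import datetime, time

# Assumed cashtag shape: a dollar sign followed by one to five letters.
CASHTAG = re.compile(r"\$([A-Za-z]{1,5})\b")

MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)


def extract_cashtags(body: str) -> list:
    """Return the distinct tickers mentioned in a message, in order of appearance."""
    tickers = (match.upper() for match in CASHTAG.findall(body))
    return list(dict.fromkeys(tickers))


def in_market_hours(created_at: datetime) -> bool:
    """True if the (assumed Eastern time) timestamp falls between 09:30 and 16:00."""
    return MARKET_OPEN <= created_at.time() <= MARKET_CLOSE


message = "$TWTR looks strong, $aapl less so. Paid $5 for coffee."
print(extract_cashtags(message))
print(in_market_hours(datetime(2012, 3, 12, 10, 15)))
```

Dollar amounts such as $5 are skipped because the pattern requires letters directly after the dollar sign.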

Market data

  • Asset price data is from the University of Chicago’s Center for Research in Security Prices (CRSP) database. We assume an entry point at the market open price, and exit price at market close price, both per CRSP.

  • Also used is a unique dataset provided by International Securities Exchange Holdings, which consists of firm-wide daily option volume data broken out by various traders:

    • Customer—Option trade volume for traders acting on behalf of discount and full-service customers. This trader type dominates option volume.

    • Broker Dealer—Option trade volume for traders acting on behalf of institutional clients.

    • Proprietary—Option trade volume for proprietary traders acting on behalf of their firm.

3 Fama–MacBeth Regression Analysis

Before delving into the methodology, we first need to determine if the proposed features help explain the variability of asset price returns. Validating their explanatory power can be performed through the Fama–MacBeth regression estimation framework (Fama and MacBeth 1973). This method involves two regression steps. The first step consists of regressing (Formula 1) the proposed risk factors as the independent variables against each of the asset return series to compute each respective asset’s beta values.

$$\begin{aligned} R_i =\beta _{0,i} +\beta _{1,i} F_{1,i} +\cdots +\beta _{m,i} F_{m,i} +\varepsilon _i \end{aligned}$$
(1)

where \(R_i\)—excess returns for asset i, \(F_{m,i}\)—risk factor m for asset i, \(\beta _{m,i}\)—regression coefficient of asset i for factor m, \(\varepsilon _i\)—residual of asset i.

Step two estimates the risk factor exposure of asset returns by running, for each period, a cross-sectional regression (Formula 2) of the asset returns on the risk loading estimates \(\hat{\beta }\) calculated for each asset in step one.

$$\begin{aligned} R_t =\lambda _{0,t} +\hat{\beta }_{1,i} \lambda _{1,t} +\cdots +\hat{\beta } _{m,i} \lambda _{m,t} +\eta _t \end{aligned}$$
(2)

where \(R_t\)—excess returns for all assets at time t, \(\hat{\beta }_{m,i}\)—risk loading estimates m from step 1 for asset i, \(\lambda _{m,t}\)—slope m at time t, \(\eta _t\)—idiosyncratic risk

The risk premium (exposure) for each factor is the average of the slopes (\(\lambda _{m,t}\), Formula 3).

$$\begin{aligned} \hat{\lambda }_m =\frac{1}{T}\mathop \sum \limits _{t=1}^T \lambda _{m,t} \end{aligned}$$
(3)

where \(\lambda _{m,t}\)—period t slope for factor m, \(\hat{\lambda }_m\)—risk exposure for factor m.
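The two-step procedure in Formulas 1 to 3 can be illustrated with a minimal Python sketch. To keep it dependency-free, it assumes a single risk factor (m = 1), so each regression reduces to simple one-regressor least squares; the function names and the toy data are hypothetical, and this is not the estimation code used in this research.

```python
from statistics import mean


def ols(x, y):
    """One-regressor least squares; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fama_macbeth(returns, factors):
    """returns[i][t] and factors[i][t] hold asset i's excess return and factor value at time t."""
    # Step 1 (Formula 1): time-series regression per asset gives beta_hat_i.
    betas = [ols(f_i, r_i)[1] for r_i, f_i in zip(returns, factors)]
    # Step 2 (Formula 2): cross-sectional regression per period gives lambda_t.
    n_periods = len(returns[0])
    lambdas = [ols(betas, [r_i[t] for r_i in returns])[1] for t in range(n_periods)]
    # Formula 3: the risk premium is the time average of the slopes.
    return betas, lambdas, mean(lambdas)


# Toy data: three assets driven by one common factor with known loadings.
factor = [0.01, -0.02, 0.03, 0.00]
true_betas = [0.5, 1.0, 1.5]
returns = [[0.001 + b * f for f in factor] for b in true_betas]
betas, lambdas, risk_premium = fama_macbeth(returns, [factor] * len(true_betas))
print(betas, lambdas, risk_premium)
```

In this noise-free toy example the first step recovers the known loadings and each period's slope equals that period's factor value, so the estimated premium is the mean of the factor series. With several factors, the same structure applies with a multiple regression in place of `ols`.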

We run Fama–MacBeth regressions (Table 2) for well-known risk factors used in asset pricing models (APM), specifically, the CAPM, Fama and French (1993) three-factor and Carhart (1997) four-factor models, to establish a baseline and understand the exposure the stocks have to these well-known risk factors. Next, we include the first sentiment factor, which will act as the baseline: rating and volume derived from the Loughran and McDonald (2011) dictionary, chosen because of its popularity in the finance literature. We gradually build on this model by including the features from Table 1 in separate Fama–MacBeth regressions. Since we have nine features, not including the baseline, instead of running simulations for every possible subset \((2^{9} = 512)\) of features, we add each one individually (Table 2) to the baseline model and run Fama–MacBeth regressions to determine their viability as risk factors.

Table 2 Risk premium

The small-minus-big risk factor exhibited the smallest coefficient values (impact), suggesting the vast majority of stocks were not small cap stocks, but rather larger cap stocks. The evaluation of the financial risk factors (Table 2) using the Fama–MacBeth framework shows momentum (UMD) having the highest impact. These results indicate that the majority of stocks are exposed to short-term momentum effects that could be quickly shared by tweets. Also, sentiment derived from the Loughran and McDonald dictionary has a very low impact while the Liu dictionary and the Pleasantness and Activation parameters from the dictionary of affect in language (DAL) are the most important risk factors after UMD. The difference between the Loughran and McDonald and the Liu dictionaries can be explained because the first is optimized using large bodies of texts from financial reports while the second is optimized for succinct social media blogs as those used in this research. The largest impact values were observed with traders, suggesting option market behavior may drive underlying prices.

The risk premiums of the Liu dictionary and the DAL components have a negative relationship with return. This may indicate the overreaction of investors and the quick price reversal that follows any corporate news. Pearson correlation tests were run between the underlying volume and message volume (Fig. 1).

Fig. 1 Underlying and message volume Pearson correlation. This figure shows a histogram of the Pearson correlation coefficients between underlying and message volume, the percent (% sig) that exhibited statistical significance, the mean and standard deviation (SD) of the correlation values

Over 70% of the stocks exhibited statistically significant correlations between underlying volume and message volume.
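For reference, the Pearson correlation between two volume series can be computed as in the sketch below. The t-statistic shown is one common way to assess significance and is an assumption on our part, since the text does not state which test was used; the function names and the sample series are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_t_stat(r, n):
    """t-statistic for H0: no correlation, with n - 2 degrees of freedom (requires |r| < 1)."""
    return r * sqrt((n - 2) / (1 - r * r))


# Hypothetical daily series for one stock: shares traded and message counts.
underlying_volume = [1, 2, 3, 4, 5]
message_volume = [2, 1, 4, 3, 5]
r = pearson(underlying_volume, message_volume)
print(r, correlation_t_stat(r, len(underlying_volume)))
```

The t-statistic would then be compared with a critical value of the Student t distribution at the chosen significance level.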

4 Methodology

With viable risk factors established, the focus now shifts to their predictive capability. We take a machine learning approach based on a majority-vote ensemble of five well-known classifiers, which we train, validate and test to predict the direction of each asset’s price move, up or down, and in turn determine what position to take, long or short, respectively. Ensemble methods have been shown to outperform stand-alone classifiers (Dietterich 2000; Zhou et al. 2002; Maglogiannis 2007; Galar et al. 2012; Kanakaraj and Guddeti 2015). The machine learning classifiers chosen are listed below:

  • LogitBoost: ensemble method of classification based on boosting that assigns more weight to the misclassified observations and minimizes the logistic loss (Friedman et al. 2000).

  • Naïve Bayes: Bayesian parameter estimation method based on some known prior distribution (Russell et al. 2009).

  • AdaBoost: adaptive boosting machine learning meta-algorithm used for improving performance and classifier accuracy by adding more weight to previously misclassified instances (Freund and Schapire 1997).

  • Logistic Regression: logit based regression for categorical labels which has been shown to be an accurate classifier for binary labels (Cox 1972).

  • Bagging: classifier that generates an aggregated predictor through multiple adaptations of a predictor; this has been shown to increase classifier accuracy by minimizing variance (Breiman 1996).
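The majority vote over the five classifiers listed above can be sketched as follows. This is a minimal illustration with hypothetical predictions; it assumes each classifier emits a label of 1 (up) or −1 (down) and does not reproduce how the individual models were trained.

```python
def majority_vote(votes):
    """Combine binary labels (1 = up, -1 = down) from several classifiers."""
    votes = list(votes)
    if any(v not in (1, -1) for v in votes):
        raise ValueError("labels must be 1 or -1")
    total = sum(votes)
    if total == 0:
        raise ValueError("tie: use an odd number of classifiers")
    return 1 if total > 0 else -1


# Hypothetical next-day predictions for one stock.
predictions = {
    "LogitBoost": 1,
    "Naive Bayes": -1,
    "AdaBoost": 1,
    "Logistic Regression": 1,
    "Bagging": -1,
}
print(majority_vote(predictions.values()))
```

With an odd number of classifiers and a binary label, a tie cannot occur, which is one practical reason to combine five models rather than four.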

A multi-stage simulation process (Fig. 2) will be followed. Using 10-fold cross-validation, models will be trained using the above algorithms for each stock with the first 80% of observations and tested with the remaining 20% of observations (holdout). Splitting the dataset in this manner will prevent data snooping and adheres to the 80/20 Pareto principle. All labels are the return directional moves, up (1) or down (−1), for the next trading day. The train data will go through a calibration stage where it will be split 80% and 20% for train and test, respectively. The date ranges for train and test were, respectively, July 13, 2009, to March 10, 2012 (942 trading days) and March 11, 2012, to October 31, 2012 (235 trading days). The calibration stage will only be performed for the baseline case using the current and lagged (prior day) returns. The test stage of the calibration is further divided into trading simulation bins between predicted probabilities of 50% and 80%, in steps of 5%, based on the forecasted directional moves and their respective predicted probabilities. Only assets with predicted probabilities greater than each respective bin cutoff are tradable securities, or qualified assets. Based on these forecasts, our algorithm takes a long or short position, depending on the directional forecast of the label, positive or negative, respectively, on each qualified asset.
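The chronological split and the probability bins described above can be sketched as follows. The helper names and the forecast dictionary are hypothetical; the sketch only illustrates the filtering logic, not the model fitting or the cross-validation.

```python
def chronological_split(observations, train_fraction=0.8):
    """Split a time-ordered sequence into an initial train part and a holdout part."""
    cut = int(len(observations) * train_fraction)
    return observations[:cut], observations[cut:]


# Cutoffs from 50% to 80% in steps of 5%.
PROBABILITY_BINS = [(50 + 5 * k) / 100 for k in range(7)]


def qualified_assets(forecasts, cutoff):
    """Keep assets whose predicted probability exceeds the cutoff.

    forecasts maps ticker -> (direction, probability), with direction 1 or -1.
    """
    return {ticker: d for ticker, (d, p) in forecasts.items() if p > cutoff}


train, test = chronological_split(list(range(10)))
print(train, test)

# Hypothetical forecasts for one trading day.
day_forecasts = {"AAA": (1, 0.70), "BBB": (-1, 0.60), "CCC": (-1, 0.66)}
for cutoff in PROBABILITY_BINS:
    print(cutoff, qualified_assets(day_forecasts, cutoff))
```

Because the split is by position rather than by random sampling, no observation in the holdout precedes an observation used for training.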

Fig. 2 Trading strategy. This figure shows the trading strategy schema. We retrain and test the model for every feature set added and take long positions on labels predicted to be a positive one, 1, and short positions on labels predicted as zero, 0

We simulate a daily trading strategy with our test dataset, taking a long or short position in the assets that have a positive or negative trend forecast, respectively. At the end of each day, we liquidate every position and calculate the daily return after transaction costs. Transaction costs are charged for opening and closing every position in qualified stocks, at the New York Stock Exchange rate of 0.0023 US dollars per share. The purpose of the calibration stage is to determine which predicted probability cutoff achieves the highest performance. Once the best performing predicted probability is identified, we move forward with this value for the full simulation stage (Fig. 3).
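A minimal sketch of the daily return calculation, net of transaction costs, is given below. It assumes one share per position, entry at the open and exit at the close, and the per-share fee charged once on entry and once on exit; the function name and prices are hypothetical, and the exact cost accounting used in the simulations may differ.

```python
COST_PER_SHARE = 0.0023  # US dollars per share, as stated in the text


def net_daily_return(open_price, close_price, direction, cost_per_share=COST_PER_SHARE):
    """Open-to-close return of a long (1) or short (-1) position after costs.

    Assumption: the fee is paid twice (open and close) and expressed as a
    fraction of the entry price.
    """
    gross = direction * (close_price - open_price) / open_price
    cost = 2 * cost_per_share / open_price
    return gross - cost


# Hypothetical prices: the stock rises from 100 to 101 during the day.
print(net_daily_return(100.0, 101.0, direction=1))
print(net_daily_return(100.0, 101.0, direction=-1))
```

Since the fee is quoted per share, its weight relative to the return is larger for low-priced stocks.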

Fig. 3 Simulation flow. This figure shows the overall simulation flow. The full data set is first run through a calibration stage, left side, to determine the best performing algorithm and predicted probability bin. Once determined, the predicted probability bin is used throughout full simulation stage, right side

We evaluate our models using the Sharpe ratio, formula (4), average daily return and Matthews Correlation Coefficient (MCC), formula (5). The Sharpe ratio, also known as the reward-to-variability ratio, adjusts the performance of an asset or portfolio for risk (volatility).

$$\begin{aligned} S=\frac{E\left[ {R-R_f } \right] }{\sqrt{VAR\left[ R \right] }} \end{aligned}$$
(4)

where R—return of asset or portfolio, \(\hbox {R}_{\mathrm{f}}\)—risk-free rate through holding period.
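Formula 4 translates directly into code. In the sketch below the sample standard deviation is used for the denominator, and the annualization by the square root of 252 trading days is a common convention that we assume here; neither choice is stated in the text, and the return series is made up.

```python
from math import sqrt
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the standard deviation of returns (Formula 4)."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(returns)


def annualized_sharpe(daily_returns, daily_risk_free=0.0, periods_per_year=252):
    """Scale a daily Sharpe ratio to a yearly horizon (assumed sqrt-of-time rule)."""
    return sharpe_ratio(daily_returns, daily_risk_free) * sqrt(periods_per_year)


# Hypothetical daily strategy returns.
daily = [0.01, 0.02, 0.03]
print(sharpe_ratio(daily), annualized_sharpe(daily))
```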

MCC helps determine if the model is a robust predictor of the return direction (Matthews 1975). MCC is not only ideal for a binary label; it also overcomes the bias inherent in an unbalanced label count. Considering that markets tend to go up in the long run, return directional moves in the positive direction will outweigh moves in the negative direction. As a result, there will be a class label imbalance: more upticks (55%) than downticks (45%).

$$\begin{aligned} \mathrm{MCC}=\frac{\mathrm{TP} \times \mathrm{TN}-\mathrm{FP} \times \mathrm{FN}}{\sqrt{\left( \mathrm{TP}+\mathrm{FP} \right) \left( \mathrm{TP}+\mathrm{FN} \right) \left( \mathrm{TN}+\mathrm{FP} \right) \left( \mathrm{TN}+\mathrm{FN}\right) }} \end{aligned}$$
(5)

where TP—true positive, forecasted positive and actual positive, TN—true negative, forecasted negative and actual negative, FP—false positive, forecasted positive and actual negative, FN—false negative, forecasted negative and actual positive.
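Formula 5 can be computed from the four confusion counts as in the following sketch. Returning 0 when the denominator vanishes is a common convention that we adopt here as an assumption; the function names and label series are hypothetical.

```python
from math import sqrt


def confusion_counts(actual, predicted, positive=1):
    """Count (TP, TN, FP, FN) for a binary label."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient (Formula 5); 0.0 if the denominator is zero."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical next-day directions: 1 = up, -1 = down.
actual = [1, 1, 1, -1, -1, 1]
predicted = [1, 1, -1, -1, 1, 1]
print(mcc(*confusion_counts(actual, predicted)))

# A model that always predicts "up" on a 55/45 imbalanced label.
print(mcc(tp=55, tn=0, fp=45, fn=0))
```

The second call illustrates the point about class imbalance: always predicting an uptick would be right 55% of the time, yet under this convention its MCC is 0 because it carries no directional information.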

We incrementally add features to the data set to determine the effect of certain features on model performance. The steps that run through the training and testing procedure for both the calibration and the main simulations are outlined below:

  1. Baseline features: Use current and lagged return as features to forecast the direction of the next period return (label). This step will only be run for the calibration stage where we determine the ideal predicted probability bin and algorithm to use for the remaining steps.

  2. Baseline features and social media derived sentiment baseline feature: Using the same baseline features, from 1, above, we include the baseline sentiment and volume feature derived from the Loughran and McDonald word dictionary; one simulation.

  3. Baseline features and first social media derived risk factor sentiment feature: Using the same baseline features, from 1, above, we include the sentiment and volume feature derived from the Liu word dictionary; one simulation.

  4. Baseline features and market data derived sentiment: Using the same baseline features, from 1, above, we include the aggregated ISE ratio and the individual trader ratios (customer, broker-dealer, proprietary and professional traders) according to Formula 6; five simulations.

    $$\begin{aligned} \hbox {ISE}=\frac{\hbox {LONG CALLS}_{ TC} \left( {\hbox {Opening Position}} \right) }{\hbox {LONG PUTS}_{ TP} \left( {\hbox {Opening Position}} \right) } \end{aligned}$$
    (6)

    where TC = trader specific call volume and TP = trader specific put volume. The ISE call-put ratios are leading indicators of bullish or bearish market direction if the ratios are greater or less than 1, respectively.

  5. Baseline features and second social media derived risk factor sentiment feature: Using the same baseline features, from 1, above, we include the sentiment and volume features derived from the Dictionary of Affect in Language (DAL); three simulations. Agarwal et al. (2009) showed that DAL accurately captured binary (positive or negative) sentiment from tweets, and Nguyen et al. (2015) and Xie et al. (2013) successfully used semantic frames to predict future stock prices. Also, we take an approach similar to recent studies (Cambria and White 2014; Cambria et al. 2013; Poria et al. 2014) that leveraged dictionaries which expand meanings of words into multiple dimensions. We use a unique dictionary that contains multiple dimensions and extend these studies further by aggregating it together with market data sentiment and additional sentiment measures (step 6). The DAL parameters are known as Pleasantness, Activation, and Imagery. These parameters, the additional three features, capture human emotion similar to the Google-Profile of Mood States (six total emotional states) that was successfully used by Bollen et al. (2011), Abu Bakar et al. (2014), Siganos et al. (2014), Kim and Kim (2014), and Danbolt et al. (2015) to predict future directional moves in stocks. We score all messages using the DAL parameter scores by tokenizing each message and taking the average of each parameter for every message. Using this dictionary, we assume that when authors write negative text they use more negative words than positive words, and vice versa.

  6. Baseline features and all market data derived and social media derived statistics; one simulation.
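The message scoring described in step 5 (tokenize each message, then average each DAL parameter over its words) can be sketched as follows. The lexicon below is a hypothetical toy stand-in with made-up scores, not actual DAL values, and the tokenizer and the handling of words missing from the dictionary are our assumptions.

```python
import re
from statistics import mean

# Hypothetical toy lexicon: word -> (pleasantness, activation, imagery).
TOY_DAL = {
    "gain": (3.0, 2.0, 1.0),
    "loss": (1.0, 2.0, 3.0),
    "strong": (2.5, 2.5, 2.0),
}

TOKEN = re.compile(r"[a-z']+")


def score_message(body, lexicon=TOY_DAL):
    """Average each parameter over the message's words found in the lexicon.

    Returns (pleasantness, activation, imagery), or None if no word matches.
    """
    tokens = TOKEN.findall(body.lower())
    hits = [lexicon[token] for token in tokens if token in lexicon]
    if not hits:
        return None
    return tuple(mean(dimension) for dimension in zip(*hits))


print(score_message("$TWTR gain today, but a loss tomorrow"))
print(score_message("no scored words here"))
```

Daily features per stock could then be formed by aggregating the message scores for each cashtag, although the aggregation rule is not specified here.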

5 Results

To determine the ideal predicted probability bin for the validation stage, we run the machine learning ensemble method using the features from step 2. The 65% predicted probability bin yields a substantial number of trades: 884, an annualized Sharpe ratio of 0.2043, and an average monthly return of 0.1719 basis points (Table 3). We then move forward with 65% as the predicted probability bin to trade. All returns and Sharpe ratios show a significant difference at the 99% level.

Table 3 Predicted probability cutoffs

The implementation of our forecast and trading strategy shows that the Loughran and McDonald dictionary outperforms the model based only on prior return (Table 4). However, the Liu dictionary and the components of the DAL (pleasantness, imagery, and activation) outperform the baseline Loughran and McDonald dictionary, suggesting that these are superior for our social media data set. The pleasantness sentiment parameter yields the largest Sharpe ratio (1.0139), return (2.29%), and MCC (−0.33). Out of the specific traders, the broker–dealer ratio exhibited the largest Sharpe ratio (0.4491), return (1.48%), and MCC (−0.32), suggesting the broker–dealer has superior information. This latter result is not surprising as broker–dealers have substantial resources at their disposal that also act on behalf of very sophisticated traders.

As in the case of the risk premiums of the Liu dictionary and DAL, all the sentiment indicators show negative MCCs. A forecasting model can capture this pattern and use it to anticipate the return direction.

Table 4 Return, volatility, and Sharpe ratio of trading strategies

A trading strategy based on the pleasantness category shows the largest positive and statistically significant alpha after adjusting for excess market return (MKT-RF), size (SMB), valuation (HML) and momentum (UMD) effects (Table 5). The pleasantness category more closely reflects the sentiment associated with every word. This characteristic explains its selection as a risk factor in our predictive models. Next, we combined all risk factors as features, where the largest Sharpe ratio (1.5003), return (09%), and MCC (−0.32) were observed.

Table 5 Risk-adjusted trading strategy return

6 Discussion

The baseline simulation, step 2, yielded the worst results and the Liu dictionary, step 3, beat out the baseline dictionary. The Loughran and McDonald dictionary was optimized using large bodies of texts from financial statements while the Liu dictionary was optimized for succinct social media blogs, so this result is not surprising. Performance results further improved with the trader ratios, step 4, especially with the broker–dealer trader. This suggests that the behavior patterns of various trader types are a proxy of future returns of assets. Furthermore, it is typically the savvy investor type who trades derivative products, options, and who has access to both superior information and the means to trade, not only from a monetary perspective but also technological. This was most apparent in the simulation runs using the broker–dealer ratio which achieved the most robust performance out of all other ratios. Recall broker–dealer traders operate on behalf of institutional clients. Out of all trader types, institutional clients have access to both superior research and technology. When institutional clients channel through broker–dealers for trade execution, not only do broker–dealers gain access to information inherent in these trades, but also have access to their internal information and technology, which are not available to other traders.

Leveraging the customer ratio yielded the lowest performance out of all trader ratios. Again, this trader type comprises both discount and full-service customers. The discount customer is most likely considered a noise trader (De Long et al. 1990a, b) and the full-service customer could be considered a hybrid between noise and positive feedback traders. The discount trader will not have access to superior information and usually makes up the majority of the herd. Full-service customers would have access to superior information, but discount customers far outnumber them, which washes out any performance advantages that could have been observed if the option data was broken down by full service and discount. Proprietary trader performances yielded better results than the customer but slightly lower than the broker–dealer. Per Pan and Poteshman (2006), these traders possess little information about future stock prices and leverage the options market for hedging purposes. Overall, these results are promising, considering returns were adjusted for both transaction costs and market effects while residual alpha was still present. Future research will use the same framework by aggregating sentiment together for stocks in the same industry and sector.

7 Conclusion

This research shows the importance of both sentiment types, extracted from social media messages and from market data derived signals, to forecast asset returns. Both features contain a sentiment and behavioral aspect. Sentiment is an aggregated opinion of the general investing community, and the call-put ratios capture the sentiment of various trader types beyond what would be found on social media platforms. Social media provides information about the masses’ opinions and moods and a profile of the more conformist traders. The market data derived signal consists of customer, broker–dealer and proprietary traders who are not, besides customers, on social media outlets broadcasting their opinions to the world about stocks, since there are strict SEC rules preventing them from doing so. However, we can capture their behavior through the option volume data. The results suggest that the broker–dealer trader may possess superior information with respect to all other traders, as we saw the highest performance out of all simulations when this trader ratio was used with lagged return, current return and sentiment. This research suggests that the additional information generated by combining both feature types, sentiment from the masses and specific trader type behavior, from two forms, text and market data, improves the asset return prediction. This research shows the importance of sentiment extracted from social media messages and market data to both explain and forecast asset price returns. We demonstrate that sentiment extracted from social media and market data are valid additional risk factors in relation to the Fama–French and Carhart models. Furthermore, these results suggest that sentiment can be harnessed in a predictive analytics framework to realize positive residual alpha after adjusting for market effects.