
1 Literature Review

In the last two decades, a significant amount of work has been done on time series clustering [7]. Financial time series clustering has also gained considerable attention in the last decade [14]. It is an important area of research with wide applications in noise reduction, forecasting and enhanced index tracking [1]. A good forecast of future prices is desired by financial companies, especially algorithmic trading firms, while index tracking funds are low-cost funds that closely track the returns of one particular index. In a typical financial time series clustering procedure, each time series is treated as an individual object, inter-object dissimilarities are calculated, and clustering is then performed using one of the many available clustering algorithms. Rosario [2] used a well-known dissimilarity measure based on Pearson's correlation coefficient and clustered the data set using single linkage hierarchical clustering. Saeed [4] used a symbolic representation for dimensionality reduction of financial time series data, with the longest common subsequence as the similarity measure. Guan [5] proposed a similarity measure for time series clustering that relies only on the signs (positive or negative) of the logarithmic returns of the stock prices and does not take the size of the movement into account. Marti et al. [6] addressed the question of what an appropriate length of a time series should be for the clustering procedure. John et al. [8] proposed a shape-based dissimilarity for time series clustering based on cross-correlation coefficients. This work comes closest to our approach, but there remains a significant difference between the two: while calculating their dissimilarity measure, they do not break the time series into smaller segments. In the present approach, the time series are broken into smaller parts and a different procedure is then followed, because the lead-lag relationship between two time series may change over time. Further, we propose a second dissimilarity measure that gives similar or better results compared to the first dissimilarity measure proposed in the present paper.

2 Preliminaries

In the present work, hierarchical clustering is used to form clusters from the inter-object dissimilarity matrix computed using a dissimilarity measure. The linkage method is an important aspect of any hierarchical clustering algorithm. We choose 'single' linkage and 'ward' linkage (implemented as ward.D2 in the R 'stats' package) for our analysis, because 'single' linkage has been the preferred choice of researchers in financial time series clustering papers, e.g., [2, 3], and Ward linkage [10] was used for financial time series clustering by Guan [5]. Details about the pre-existing dissimilarity measures, i.e., the correlation based dissimilarity measure (COR) and the temporal correlation based dissimilarity measure (CORT), are given in the Appendix. Additionally, taking inspiration from [16], the lead/lag time between two time series \( X_{T} \) and \( Y_{T} \) is defined as the integer 'k' that maximizes the cross-correlation between the two series.
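As an illustrative sketch of this lead/lag definition (in Python rather than the R used in the paper; the function name `lead_lag` and its interface are our own choices, not from the paper), one can scan all shifts k in [-m, m] and keep the one with the highest Pearson correlation:

```python
import numpy as np

def lead_lag(x, y, m):
    """Return the shift k in [-m, m] maximizing the Pearson correlation
    between x[t] and y[t+k], i.e., the lead/lag time inspired by [16].
    A positive k means y leads x by k steps under this alignment."""
    best_k, best_r = 0, -np.inf
    T = len(x)
    for k in range(-m, m + 1):
        # Align the overlapping parts of x and the shifted y.
        if k >= 0:
            xs, ys = x[:T - k], y[k:]
        else:
            xs, ys = x[-k:], y[:T + k]
        r = np.corrcoef(xs, ys)[0, 1]
        if r > best_r:
            best_k, best_r = k, r
    return best_k, best_r
```

For two copies of the same series offset by three steps, the function recovers a shift of 3 with correlation close to 1.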

3 Proposed Dissimilarity Measures

3.1 Cross-Correlation Type Dissimilarity Measure (CCT)

Given two time series \( X_{T} \) and \( Y_{T} \) (each of length T), the interest here is in finding a dissimilarity measure between them. Let 'm' be the maximum lead or lag taken into consideration for the calculation of dissimilarity between the two time series. We consider the segment of \( X_{T} \) starting at 'm+1' and ending at 'T-m-1', and divide it into 'n' equal parts, each of length 'p'. Here, we conveniently choose 'm', 'n' and 'p' such that \( 2m+np+1=T \), which ensures that all the data points of the time series are utilized in the calculation of the dissimilarity measure; it would still suffice if '2m+np+1' were slightly less than T. Two further restrictions, \( m\le p \) and \( p\ge 15 \), are imposed to avoid unwanted cross-correlations. With this background, we define our CCT similarity measure as follows:

$$\begin{aligned} CCT=\frac{1}{n}\times \sum _{l=1}^{n} max\{ CCT_{k}(m+1+p\times (l-1)) \quad | -m \le k \le m \} , \end{aligned}$$
(1)

where \( CCT_{k}(i) \) is defined as follows

$$\begin{aligned} CCT_{k}(i)=\frac{\sum _{j=i}^{i+p-1}(x_{j+1}-x_j)(y_{j+k+1}-y_{j+k})}{\sqrt{\sum _{j=i}^{i+p-1}(x_{j+1}-x_j)^2}\sqrt{\sum _{j=i}^{i+p-1}(y_{j+k+1}-y_{j+k})^2}} \end{aligned}$$
(2)

The value of \( CCT_{k} \) is the same as the correlation between the returns of a segment of time series \( Y_{T} \) at a lead k with respect to a segment of time series \( X_{T} \). The motivation behind this similarity measure is that similar financial time series have more sub-segments that are highly correlated with each other at some lead or lag; stock prices of similar assets on different exchanges exhibit such a pattern, and this phenomenon is captured well by the CCT similarity measure. Notice that each value of \( CCT_{k}(i) \) lies in the interval [−1,1]; hence, the value of the CCT similarity measure also lies in [−1,1]. This similarity measure is then converted into a dissimilarity measure using the following function:

$$\begin{aligned} \phi (u) = \frac{2}{1 + e^{4\times u}} \quad \end{aligned}$$
(3)

We carry out our analysis with the function in Eq. (3) rather than the function \( \sqrt{2(1-x)} \), which is commonly used for converting a similarity measure into a dissimilarity measure. Our data experiments, the details of which are not reported in this paper, indicate that the latter function gives results similar to those of the function in Eq. (3). The dissimilarity measure obtained after this conversion can be used for clustering financial time series data.
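A minimal Python sketch of Eqs. (1)-(3) follows, using the 0-based indexing of the language (the paper's formulas are 1-based, and its own computations were done in R); the function names are illustrative:

```python
import numpy as np

def cct_k(x, y, i, k, p):
    """Eq. (2): correlation between first differences of x over the
    segment starting at 0-based index i, and of y shifted by k."""
    dx = x[i + 1:i + p + 1] - x[i:i + p]
    dy = y[i + k + 1:i + k + p + 1] - y[i + k:i + k + p]
    denom = np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2))
    return np.sum(dx * dy) / denom

def cct(x, y, m, n, p):
    """Eq. (1): average over the n segments of the maximum CCT_k
    over all leads/lags k in [-m, m]."""
    total = 0.0
    for l in range(n):
        i = m + l * p  # 0-based start of segment l (m+1+p(l-1) in the paper)
        total += max(cct_k(x, y, i, k, p) for k in range(-m, m + 1))
    return total / n

def phi(u):
    """Eq. (3): convert a similarity in [-1, 1] to a dissimilarity."""
    return 2.0 / (1.0 + np.exp(4.0 * u))
```

For a series compared with itself, each segment maximum is attained at k = 0 with correlation 1, so the CCT similarity is 1 and its dissimilarity \( \phi(1) \) is close to 0.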

Table 1. Hypothetical data set for two time series each of length 37.

As an example of the CCT similarity measure, consider the two time series given in Table 1. Since the length of each time series is 37, one suitable choice for 'p', 'n' and 'm' is 15, 2 and 3, respectively (so that \( 2m+np+1=37 \)). For the first segment of Series 1 (i.e., the 4th to 18th data points), the maximum \( CCT_k(4) \) value is 0.57, attained at k = 3. For the second segment (i.e., the 19th to 33rd data points), the maximum \( CCT_k(19) \) value is 0.27, attained at k = 0. Thus the value of the CCT measure comes out to be 0.42.
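The parameter choice in this example satisfies the constraints of Sect. 3.1; a small hypothetical helper (our own illustration, not part of the paper) can check them for any candidate (T, m, n, p):

```python
def valid_params(T, m, n, p):
    """Check the segmentation constraints of Sect. 3.1: 2m + np + 1
    must equal T (or be slightly below it), with m <= p and p >= 15."""
    return (2 * m + n * p + 1 <= T) and (m <= p) and (p >= 15)
```

For the Table 1 example, `valid_params(37, 3, 2, 15)` holds, since 2(3) + 2(15) + 1 = 37.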

3.2 Cross-Correlation Type-II Dissimilarity Measure (CCT-II)

In another version of this dissimilarity measure, we ignore those intervals whose 'maximum \( CCT_{k}(i) \)' value is less than a given threshold (denoted by 'Thr'); that is, when the maximum \( CCT_{k}(i) \) value is below the threshold, it is replaced by '0'. The rest of the computation remains the same. The expression for this similarity measure is as follows:

$$\begin{aligned} \text {CCT-II} = \frac{1}{n}\times \sum _{l=1}^{n} ( M(m+1+p\times (l-1))\times I_{[Thr,1]}(M(m+1+p\times (l-1)))\, ) \end{aligned}$$
(4)
$$\begin{aligned} M(i) = max\{ CCT_{k}(i) \quad | -m \le k \le m \} \end{aligned}$$
(5)

where \( CCT_{k}(i) \) is the same as defined in Eq. (2), and \( I_{[Thr,1]}(x) \) denotes the indicator function of the set [Thr, 1].

This similarity measure is then converted into a dissimilarity measure using the function given in Eq. (3).

As an example, consider the time series data given in Table 1. We now compute the CCT-II similarity measure with the same set of values as used for computing the CCT similarity measure, i.e., p = 15, n = 2 and m = 3, with the threshold set to 0.50. Again, the maximum \(CCT_k()\) values are 0.57 and 0.27; since 0.27 lies below the threshold, it is replaced by 0. Thus, the value of the CCT-II similarity measure comes out to be 0.29.
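A self-contained Python sketch of CCT-II (Eqs. (4)-(5)), again under 0-based indexing with illustrative names (the paper's computations were done in R):

```python
import numpy as np

def cct2(x, y, m, n, p, thr):
    """CCT-II (Eq. 4): like CCT, but segment maxima below the
    threshold 'thr' contribute 0 instead of their value."""
    def cct_k(i, k):
        # Inner helper mirroring Eq. (2) on first differences.
        dx = x[i + 1:i + p + 1] - x[i:i + p]
        dy = y[i + k + 1:i + k + p + 1] - y[i + k:i + k + p]
        return np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) *
                                  np.sqrt(np.sum(dy ** 2)))
    total = 0.0
    for l in range(n):
        i = m + l * p                                    # segment start
        M = max(cct_k(i, k) for k in range(-m, m + 1))   # Eq. (5)
        total += M if M >= thr else 0.0                  # indicator in Eq. (4)
    return total / n
```

Comparing a linearly trending series with itself yields segment maxima of 1 (above any sensible threshold), whereas comparing it with its negation yields maxima of -1, all of which are zeroed out.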

The motivation behind this dissimilarity measure is that stock prices are often weakly correlated with each other at some lead or lag. This may be due to a general trend followed by all stocks, or it may be a spurious relation, and it leads to errors while clustering the data set using the CCT dissimilarity measure. This noise can be removed by discarding cross-correlation values that are less than a given threshold.

In order to apply the CCT-II measure of dissimilarity between two time series, we need to fix the threshold value beforehand. The threshold obviously lies in the interval (0, 1). It cannot be close to 0, as there would then remain very little difference between the values of the CCT and CCT-II dissimilarity measures. It cannot be close to 1 either, as most of the information regarding the behavior of the two time series would then be ignored and hence not considered in the evaluation of the final expression of the CCT-II dissimilarity measure. Thus, the threshold should be kept close to 0.5; for this work we have chosen the range [0.35, 0.65]. If a test data set is available, it can be used to find the optimal threshold value; otherwise, any value in the range [0.35, 0.65] may be chosen.
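When a labeled test set is available, the threshold can be tuned by a simple grid search over [0.35, 0.65]. The sketch below assumes a user-supplied `score_fn` that clusters the data at a given threshold and returns the cluster evaluation measure (higher is better); its exact form depends on the data set and is not specified in the paper:

```python
import numpy as np

def pick_threshold(score_fn, grid=None):
    """Grid-search sketch for the CCT-II threshold over the range
    suggested in the text.  'score_fn' is an assumed callback that
    returns a clustering quality score for a given threshold."""
    if grid is None:
        grid = np.arange(0.35, 0.651, 0.05)  # 0.35, 0.40, ..., 0.65
    scores = [score_fn(t) for t in grid]
    return float(grid[int(np.argmax(scores))])
```

For instance, with a toy scorer peaking at 0.5, the search returns the grid point closest to 0.5.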

The time complexity of the CCT and CCT-II dissimilarity measures is O(pmn). Calculating a \( CCT_{k}(i) \) value requires O(p) computations, as we are calculating a cross-correlation value at lead 'k' between two time series segments, each of length 'p'. Since this expression must be evaluated for all k such that \( -m \le k \le m \), it takes O(pm) time to evaluate '\( max\{ CCT_{k}(i) \quad | -m \le k \le m \} \)'. This process is repeated 'n' times; thus, the CCT dissimilarity measure has O(pmn) time complexity, and it can be argued similarly that the CCT-II dissimilarity measure does too. This is better than the time complexity of the Dynamic Time Warping dissimilarity measure, which takes \( O(T^{2}) \) time [15], where T is the length of the time series. Since 'np+2m+1' is equal to, or slightly less than, T, this is equivalent to \( O(n^{2}p^{2}) \), which is clearly greater than O(pmn) as \( m\le p \). The lower time complexity of the proposed measures enables their computation to be carried out faster.

4 Experiments and Analysis

Experiments are conducted on 3 data sets, one by one. Each data set consists of 'End of Day' (EOD) stock prices of some companies; Fig. 1 depicts the format of the data sets used for the experiments in the present paper. While clustering, each time series associated with a company is considered as an individual object. Inter-object dissimilarities are then calculated with each of the 4 measures, i.e., COR, CORT, CCT and CCT-II. A dissimilarity matrix is thus created, which is used for further analysis. Single linkage and Ward linkage hierarchical clustering are then employed to create the corresponding dendrograms (see Fig. 2). A dendrogram can then be cut at any level to form the desired number of clusters.
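The paper performs this step with R's hclust; as a language-neutral illustration, single linkage agglomerative clustering from a precomputed dissimilarity matrix can be sketched in pure Python as follows:

```python
def single_linkage_clusters(D, k):
    """Minimal sketch of single linkage agglomerative clustering:
    starting from singleton clusters, repeatedly merge the two
    clusters with the smallest minimum inter-object dissimilarity
    in D until only k clusters remain."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best, best_d = (0, 1), float("inf")
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between clusters is the
                # minimum pairwise dissimilarity across them.
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]
```

On a toy 4-object dissimilarity matrix with two tight pairs, cutting at k = 2 recovers the two pairs.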

Fig. 1. EOD stock prices of the companies.

A cluster evaluation measure is then used to compare the clustering results obtained through the different dissimilarity measures. The present paper uses the cluster evaluation measure defined in [9, 16], which lies in the range [0, 1]; a higher value corresponds to better clustering results.
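The measure of [9, 16] is not reproduced here. As a stand-in with the same [0, 1], higher-is-better properties, cluster purity illustrates how a clustering can be scored against the true labels (this is a common substitute, not the paper's measure):

```python
from collections import Counter

def purity(pred_labels, true_labels):
    """Purity in [0, 1]: the fraction of objects whose predicted
    cluster's majority true label matches their own true label.
    NOTE: a stand-in for the measure of [9, 16], not that measure."""
    clusters = {}
    for c, t in zip(pred_labels, true_labels):
        clusters.setdefault(c, []).append(t)
    correct = sum(Counter(ts).most_common(1)[0][1]
                  for ts in clusters.values())
    return correct / len(pred_labels)
```

A perfect clustering scores 1.0, while lumping two true groups into one cluster scores lower.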

Here, we discuss the results for each data set in a separate sub-section. R version 3.1.1 was used for the preparation of all the figures presented in this paper. Since the number of data points in every time series in all the experiments is 2014, the parameter values of 'n', 'p' and 'm' are taken to be 18, 100 and 100, respectively, in all the experiments mentioned below. In these experiments, the threshold value for the CCT-II measure is taken to be higher for the Indian companies (the first data set) than for the companies traded in the USA (the second and third data sets), because prices of the Indian companies tend to show more noise, or spurious cross-correlations, than prices of the American companies. Thus, the threshold is taken to be 0.65 for the first data set and 0.35 for the second and third data sets.

Fig. 2. Dendrograms corresponding to the different dissimilarity measures. Ward linkage hierarchical clustering has been used for forming the dendrograms. Each object in the clusters is represented by a number whose unit's place indicates its true cluster value.

Table 2. The names of the companies whose stock prices are part of the Indian data set. These companies can be divided into 3 broad categories.
Table 3. Cluster evaluation measure for the different dissimilarity measures, corresponding to the Indian data set. 'Number of Clusters' represents the number of clusters formed using the dendrogram; the higher the measure, the closer the representation is to the true clusters. 'COR', 'CORT', 'CCT' and 'CCT-II' stand for the correlation based dissimilarity measure, the temporal correlation based dissimilarity measure, the first proposed measure and the second proposed measure, respectively.

4.1 Indian Data Set

The EOD stock prices of the companies listed in Table 2 form the first data set of our experiments. The time span of the prices is from 5th August 2008 till 2nd May 2016. This data set can be originally clustered into 3 clusters, as indicated in Table 2. The dendrograms obtained on this data set using Ward linkage hierarchical clustering are shown in Fig. 2, and Table 3 gives the cluster evaluation measure corresponding to different representations (numbers of clusters formed through the dendrogram) and different dissimilarity measures. Amongst all the dendrograms shown in Fig. 2, the best result is seen in Fig. 2(d), which shows the dendrogram formed using the CCT-II dissimilarity measure. The cluster evaluation measure verifies that the clusters obtained through the proposed measures are better than those obtained through the pre-existing measures; it is consistently higher for the proposed measures, irrespective of how many clusters we form out of the dendrograms.

Table 4. Cluster evaluation measure corresponding to different dissimilarity measures and numbers of clusters. Ward linkage hierarchical clustering has been used for clustering the data sets. This table corresponds to the S&P500 data set, which can be originally divided into 10 clusters.
Table 5. Cluster evaluation measure corresponding to different dissimilarity measures and numbers of clusters. Single linkage hierarchical clustering has been used for clustering the data sets. This table corresponds to the S&P500 data set, which can be originally divided into 10 clusters.
Table 6. Cluster evaluation measure for the different dissimilarity measures. 'Number of Clusters' represents the number of clusters formed using the dendrogram; a higher value corresponds to better clustering results. This table corresponds to the DJIA data set, which can be originally divided into 8 clusters.

The clustering results obtained on the first data set using Single linkage hierarchical clustering also verify the superiority of the CCT and CCT-II measures: in this case, too, the cluster evaluation measure values for CCT and CCT-II are higher than those for the COR and CORT dissimilarity measures.

4.2 S&P500 Data Set

The EOD stock prices of the companies listed in the S&P500 index form the second data set. These companies are traded in the USA. The time span of the prices for this data set is from 12th September 2008 till 23rd August 2016. Companies whose prices were not available for the complete time span under consideration were removed from the data set. This data set can be originally clustered into 10 clusters, where each cluster represents the sector of the company.

The cluster evaluation measure values for the experiment associated with this data set are given in Tables 4 and 5. The threshold value is taken to be 0.35 for the CCT-II measure in this experiment. In the case of Ward linkage (as seen in Table 4), the CCT and CCT-II measures clearly give better results than the COR and CORT dissimilarity measures. In the case of Single linkage (as seen in Table 5), all measures give similar results.

4.3 DJIA Data Set

The EOD stock prices of the companies listed in the Dow Jones Industrial Average index form the third data set of our experiments. The time span of the prices for this data set is from 12th September 2008 till 23rd August 2016. This data set can be originally clustered into 8 clusters. Here, the threshold value is taken to be 0.35 for the CCT-II measure.

As can be seen in Table 6, the proposed measures show better or similar results compared to the pre-existing measures when hierarchical clustering is done using Ward linkage. When clustering is done using Single linkage hierarchical clustering on this data set, the proposed measures give similar or slightly inferior results compared to the COR and CORT measures; even in this case, though, the maximum value of the cluster evaluation measure (amongst all possible numbers of clusters) is attained by the CCT measure.

We now discuss some of the directions in which future work related to this paper could be carried out. The determination of optimal values of the threshold, n, p and m could be taken up in the future. Also, the optimal function for converting the similarity measure into a dissimilarity measure remains to be determined.

5 Conclusion

Financial time series clustering is an important area of research and finds wide applications in noise reduction, forecasting and index tracking. In this paper, two new dissimilarity measures have been proposed for financial time series clustering. These dissimilarity measures are used to cluster the time series belonging to 3 data sets: one consisting of the EOD stock prices of 28 Indian companies, a second consisting of the EOD stock prices of 468 companies listed in the S&P500 Index, and a third consisting of the EOD stock prices of companies listed in the Dow Jones Industrial Average Index. Overall, the data sets cover 526 companies, which is a fairly large number. Clustering is carried out, and it is shown that our proposed dissimilarity measures outperform the existing dissimilarity measures.