1 Introduction

An exchange rate reflects the relative value of two currencies and is one of the most important financial and macroeconomic indicators in an economy. The fluctuation of exchange rates affects international trade, capital flows, and asset portfolio management. Many financial time series forecasting models (Adhikari and Agrawal 2013; Wu and Chang 2012; Zhiqiang et al. 2012) have been developed, and they play a critical role in the world economy because of their ability to forecast economic benefits and influence economic development. These models have attracted increased attention from academic researchers and business people for their theoretical possibilities and practical applications (Hadavandi et al. 2010; Lu et al. 2009). The ostensible purpose of breaking down financial market boundaries was to enhance the efficiency of capital funding (for example, the Bretton Woods system of monetary management was officially ended in 1973). As a result, internationally traded currencies have become crucial economic indices for international trade, financial markets, the alignment of economic policy by governments, and corporate financial decision-making.

However, financial time series forecasting is a challenging task because of the inherent nonlinearity and nonstationarity of such series. In the last few decades, these characteristics have attracted increased attention from many academic researchers. The forecasting approaches used in the literature can be classified into two types of models: statistical and artificial intelligence (Wang et al. 2012; Zhu and Wei 2013). Linear statistical models such as exponential smoothing (Lemke and Gabrys 2010) and the autoregressive integrated moving average (ARIMA) (Box and Jenkins 1970) have found immense application in forecasting financial data. A subclass of the ARIMA model, namely the Naïve random walk (RW) (Sun 2005; Tyree and Long 1995), has become the benchmark statistical technique in this domain. In a simple RW model, each forecast is assumed to be the sum of the most recent observation and a random error term. After the pioneering work of Meese and Rogoff (1983), the RW model has been used extensively by many researchers for foreign exchange rate forecasting. Currently, the simple RW is the dominant linear model in the financial time series literature and, especially, in exchange rate forecasting (Zhang 2003).

Despite the simplicity and notable forecasting accuracy of RW models, their main drawback is their inherently linear form. Such statistical models cannot effectively capture nonlinear patterns hidden in financial time series because they are developed under the assumption that the series being forecasted are linear and stationary (Huang et al. 2010). To overcome this limitation, several nonlinear models have been proposed. Among them, the artificial neural network (ANN) has attracted considerable interest from researchers because of its excellent nonlinear modeling capability (Zhang and Wu 2009; Chen et al. 2012a; Jaeger and Haas 2004; Deng et al. 2015; Vasilakis et al. 2013). Many studies have concluded that the ANN model outperforms conventional statistical models. However, the ANN suffers from local minimum traps and difficulty in determining the hidden layer size and learning rate (Kazem et al. 2013). A learning algorithm for the single hidden layer feed-forward neural network (SLFN), known as the extreme learning machine (ELM), has been proposed to overcome these disadvantages (Huang et al. 2006a; Chen and Ou 2011). In the learning process of ELM, the input weights and hidden biases are randomly selected, and the output weights are analytically determined by using the Moore-Penrose generalized inverse. ELM learns much faster and achieves higher generalization performance than traditional gradient-based learning algorithms. In addition, ELM avoids the problems of stopping criteria, learning rates, learning epochs, and local minima (Huang et al. 2006; Chen and Ou 2011; Xia et al. 2012; Lu and Shao 2012). In recent years, ELM has attracted considerable attention and become an important method in nonlinear modeling (Chen and Ou 2011; Xia et al. 2012; Lu and Shao 2012).

When intelligent prediction models are built directly on the original values, obtaining satisfactory forecasts is difficult because of the high-frequency, nonstationary, and chaotic properties of financial data. Hence, to further improve prediction performance, recent research on modeling time series with complex nonlinearity, dynamic variation, and high irregularity first applies information extraction techniques to extract features hidden in the data and then uses these extracted characteristics to construct a forecasting model (Lu et al. 2009; Chen et al. 2012b; Liu and Wang 2011; Lu 2010). In other words, by means of suitable feature extraction or signal processing methods, useful or interesting information that cannot be observed directly in the original data can be revealed in the extracted features. Therefore, an effective forecasting model possessing more precise prediction capabilities must be developed.

Empirical mode decomposition (EMD), based on the Hilbert-Huang transform (HHT), is suitable for decomposing nonlinear and nonstationary time series because it adaptively represents the local characteristics of the given signal (Huang et al. 1998, 2003). Through the use of EMD, any complicated signal can be decomposed into a finite and often small number of intrinsic mode functions (IMFs). IMFs possess simple frequency components and strong correlations, and thus are easy to forecast accurately (Jaeger and Haas 2004). EMD has been widely used in many fields, including the analysis of atmospheric time series (Xuan and Yang 2008), river water turbidity forecasting (Wang and Qi 2009), crude oil price prediction (Yang et al. 2010), short-term wind power prediction, and others (Jaeger and Haas 2004; Lu and Shao 2012; Chen et al. 2012a; Bao et al. 2012; Ye and Liu 2011).

Another critical reason that financial time series are notoriously difficult to predict is their chaotic nature. Chaos is often identified in physics and the other natural sciences, and empirical evidence of chaotic behavior in financial time series has also been found (Barkoulas and Travlos 1998; Gilmore 2001; McKenzie 2001). Chaos theory indicates that an adequate method can reveal the underlying information in complicated systems believed to be unpredictable (Takens 1981). For chaotic time series, prediction techniques based on phase space reconstruction (PSR) can be employed to extract the information and characteristics hidden in the dynamic system underlying the series. PSR transforms a one-dimensional signal into a structure embedded in a sufficiently high-dimensional space; in this new space, the resulting structure is topologically equivalent to the original phase space. This property has led some researchers to apply chaos theory to time series forecasting.

In this study, we propose a hybrid exchange rate forecasting model that integrates EMD, PSR, and ELM (EMD \(+\) PSR \(+\) ELM). First, the original exchange rate time series is decomposed into a finite number of independent IMFs with different frequencies. Second, based on PSR, separate ELM models are used to model and forecast each reconstructed sub-series. Finally, these forecasting results are combined to produce the ultimate forecast. Experimental results on four sets of real exchange rate data demonstrate that the proposed hybrid method outperforms the Naïve RW, the single ELM, and other hybrid models in terms of mean absolute error (MAE), root mean-square error (RMSE), and mean absolute percentage error (MAPE).

2 Literature Review of Major Methods

2.1 EMD

The EMD method, based on the HHT, rests on the simple assumption that any signal consists of different but simple intrinsic mode oscillations. The essence of the method is to identify the intrinsic oscillatory modes (IMFs) (Huang et al. 1998) by their characteristic time scales in the signal and then decompose the signal accordingly. A characteristic time scale is defined by the time lapse between successive extrema.

To extract the IMFs from a given data set, the sifting process is implemented as follows. First, we identify all local extrema and connect all local maxima by a cubic spline line, which acts as the upper envelope. We then repeat the procedure for the local minima to produce the lower envelope. The upper and lower envelopes should cover all the data between them. Their mean is designated \(m_{1}(t)\), and the difference between the data \(x(t)\) and \(m_{1}(t)\) is \(h_{1}(t)\), given by the following:

$$\begin{aligned} x(t)-m_1 (t)=h_1 (t). \end{aligned}$$
(1)

Ideally, \(h_{1}(t)\) should be an IMF. Because the construction of \(h_{1}(t)\) described previously does not guarantee that the result satisfies the definition of an IMF, we demand the following conditions: (i) \(h_{1}(t)\) should be free of riding waves, that is, the first component should not display under- or over-shoots that ride on the data and produce local extrema without zero crossings; (ii) the upper and lower envelopes should be symmetric with respect to zero; and (iii) the numbers of zero crossings and extrema should be equal.

The sifting process must be repeated as many times as required to reduce the extracted signal to an IMF. In the subsequent sifting process steps, \(h_{1}(t)\) is treated as the data:

$$\begin{aligned} h_1 (t)-m_{11} (t)=h_{11} (t), \end{aligned}$$
(2)

where \(m_{11}(t)\) is the mean of the upper and lower envelopes of \(h_{1}(t)\). This process can be repeated up to k times, and \(h_{1k}(t)\) is then defined as:

$$\begin{aligned} h_{1(k-1)} (t)-m_{1k} (t)=h_{1k} (t). \end{aligned}$$
(3)

After each processing step, we must confirm that the number of zero crossings equals the number of extrema. The resulting time series is the first IMF and is designated \(c_{1}(t)=h_{1k}(t)\). The first IMF component contains the highest oscillation frequencies found in the original data \(x(t)\).

The first IMF is then subtracted from the original data, and the difference, called the residue \(r_{1}(t)\), is given by the following:

$$\begin{aligned} x(t)-c_1 (t)=r_1 (t). \end{aligned}$$
(4)

The residue \(r_{1}(t)\) is treated as if it were the original data, and the sifting process is reapplied to it. This process of locating additional intrinsic modes \(c_{j}(t)\) continues until the last mode is found. The final residue is either a constant or a monotonic function; in the latter case, it represents the general trend of the data. The original signal can then be expressed as:

$$\begin{aligned} x(t)=\sum _{j=1}^{n} {c_{j}(t)+r_n (t)} . \end{aligned}$$
(5)

Thus, the data is decomposed into n-empirical IMF modes plus a residue, \(r_{n}(t)\), which can be either the mean trend or a constant.
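To make the sifting procedure concrete, the following is a minimal Python sketch of EMD, assuming cubic-spline envelopes as described above. The fixed iteration caps and the simple extrema-count stopping rule are illustrative simplifications, not the exact criteria of the HHT MATLAB program used later in this study (Sect. 4.4).

```python
# Minimal EMD sketch: cubic-spline envelopes, Eqs. (1)-(5).
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_once(h, t):
    """One sifting step, Eq. (2): subtract the mean of the two envelopes."""
    maxima = argrelextrema(h, np.greater)[0]
    minima = argrelextrema(h, np.less)[0]
    if len(maxima) < 2 or len(minima) < 2:
        return None                                # too few extrema: residue reached
    upper = CubicSpline(t[maxima], h[maxima])(t)   # upper envelope
    lower = CubicSpline(t[minima], h[minima])(t)   # lower envelope
    return h - (upper + lower) / 2.0               # h(t) - m_1k(t)

def emd(x, max_imfs=10, max_sift=50):
    """Decompose x into IMFs c_j(t) plus a residue r_n(t), Eq. (5)."""
    t = np.arange(len(x), dtype=float)
    residue = np.asarray(x, dtype=float).copy()
    imfs = []
    for _ in range(max_imfs):
        h = residue.copy()
        for _ in range(max_sift):                  # repeated sifting, Eqs. (2)-(3)
            h_new = sift_once(h, t)
            if h_new is None:                      # monotonic/constant residue: stop
                return imfs, residue
            h = h_new
        imfs.append(h)                             # c_j(t) = h_jk(t)
        residue = residue - h                      # Eq. (4): r_j(t) = x(t) - c_j(t)
    return imfs, residue
```

By construction, `sum(imfs) + residue` recovers the original signal, mirroring Eq. (5).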

2.2 Phase Space Reconstruction

The analysis of time series generated by nonlinear dynamic systems can be accomplished in accordance with Takens' embedding theory (Takens 1981). Given a univariate time series \(\{x_{i}\}_{i=1}^{N}\) generated from a d-dimensional chaotic attractor, where N is the length of the time series, a phase space \(R^{d}\) of the attractor can be reconstructed by using delay coordinates defined as:

$$\begin{aligned} X_{i}=(x_{i}, x_{i-\tau },\ldots , x_{i-(m-1)\tau }), \end{aligned}$$
(6)

where m is the embedding dimension of the reconstructed phase space and \(\tau \) is the time delay constant. Choosing the correct embedding dimension is crucial for predicting \(x_{t+1}\). Takens (1981) showed that a sufficient condition on the embedding dimension is \(m\ge 2d+1\). However, an embedding dimension that is too large requires additional observations and complex computation. Moreover, if the embedding dimension is too large, noise and other unwanted inputs are embedded together with the real source information, which may corrupt the underlying system dynamics. Therefore, in accordance with Sauer et al. (1991), if the dimension of the original attractor is d, then an embedding dimension of \(m=2d+1\) is adequate for reconstructing the attractor.

An efficient method of locating the minimal sufficient embedding dimension is the false nearest neighbors (FNN) procedure proposed by Kennel et al. (1992). Two nearby points in the reconstructed phase space are called false neighbors if they are considerably far apart in the original phase space. Such a phenomenon occurs when the selected embedding dimension is lower than the minimal sufficient value, so that the reconstructed attractor does not preserve the topological properties of the real phase space; in this case, points are projected into the false neighborhood of other points. The idea behind the FNN procedure is as follows. Suppose \(X_{i}\) has a nearest neighbor \(X_{j}\) in an m-dimensional space. Calculate the Euclidean distance \(||X_{i}-X_{j}||\) and compute the following:

$$\begin{aligned} R_i =\frac{\left\| {X_{i+1}-X_{j+1}}\right\| }{\left\| {X_i -X_j}\right\| }. \end{aligned}$$
(7)

If \(R_{i}\) exceeds a given threshold \(R_{tol}\) (say, 10 or 15), the point \(X_{j}\) is considered a false nearest neighbor in dimension m. The embedding dimension m is sufficiently high if the fraction of points that have false nearest neighbors is zero or considerably small.

Estimation of the time delay \(\tau \) is another major concern. If \(\tau \) is too small, redundancy occurs; if \(\tau \) is too large, it probably leads to a complex phenomenon called irrelevance. In this study, we use the first minimum of the mutual information (MI) function (Huang et al. 2006a) to determine \(\tau \) as follows:

$$\begin{aligned} MI(\tau )=\sum _{n=1}^{N-\tau } {P(x_n ,x_{n+\tau })\log _2 \left( {\frac{P(x_n ,x_{n+\tau })}{P(x_n )P(x_{n+\tau })}} \right) } , \end{aligned}$$
(8)

where \(P(x_{n})\) is the probability density of \(x_{n}\) and \(P(x_{n}, x_{n+\tau })\) is the joint probability density of \(x_{n}\) and \(x_{n+\tau }\).
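As an illustration of how \(\tau \) and m can be obtained in practice, the following Python sketch estimates the MI function of Eq. (8) with a simple histogram density estimate and applies the FNN test in its common added-coordinate form (equivalent in spirit to the ratio in Eq. (7)). The bin count and the threshold value are illustrative assumptions; our experiments used Hao Cheng's Fractal MATLAB toolbox (Sect. 4.5).

```python
# Delay and embedding-dimension selection: MI of Eq. (8) and an FNN test.
import numpy as np

def mutual_information(x, tau, bins=16):
    """Histogram estimate of MI between x_n and x_{n+tau}, Eq. (8)."""
    pxy, _, _ = np.histogram2d(x[:-tau], x[tau:], bins=bins)
    pxy = pxy / pxy.sum()                          # joint density P(x_n, x_{n+tau})
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)      # marginal densities
    nz = pxy > 0                                   # skip empty cells (0 log 0 = 0)
    return np.sum(pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz]))

def first_minimum_tau(x, max_tau=50):
    """The first local minimum of MI(tau) gives the optimal delay."""
    mi = [mutual_information(x, tau) for tau in range(1, max_tau + 1)]
    for i in range(1, len(mi) - 1):
        if mi[i - 1] > mi[i] < mi[i + 1]:
            return i + 1                           # taus start at 1
    return int(np.argmin(mi)) + 1                  # fallback: global minimum

def fnn_fraction(x, m, tau, r_tol=15.0):
    """Fraction of false nearest neighbours at dimension m (cf. Eq. (7))."""
    M = len(x) - m * tau                           # points whose (m+1)-th coordinate exists
    X = np.column_stack([x[j * tau: j * tau + M] for j in range(m)])
    extra = x[m * tau: m * tau + M]                # coordinate added in dimension m+1
    false = 0
    for i in range(M):                             # O(M^2) search: fine for a sketch
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                              # exclude the point itself
        j = int(np.argmin(d))                      # nearest neighbour in dimension m
        if d[j] > 0 and abs(extra[i] - extra[j]) / d[j] > r_tol:
            false += 1                             # neighbourhood breaks in m+1 dims
    return false / M
```

The minimal sufficient m is then the smallest value for which `fnn_fraction` is zero or negligibly small.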

2.3 ELM

ELM is an improved learning algorithm for the SLFN architecture. ELM differs from traditional neural network methodology in that the parameters of the feed-forward network (input weights and hidden layer biases) are not required to be tuned. The ability of SLFNs with randomly chosen input weights, hidden layer biases, and a nonzero activation function to approximate any continuous function on any input set has been demonstrated (Rao and Mitra 1971). An SLFN with randomly chosen input weights and hidden layer biases can be considered a linear system, and the output weights that link the hidden layer to the output layer can then be analytically determined through a simple generalized inverse operation on the hidden layer output matrix. This simple approach makes ELM extremely efficient and many times faster than traditional feed-forward learning algorithms.

The structure of ELM consists of an SLFN in which the input weight matrix W is randomly chosen and the output weight matrix \(\beta \) is analytically determined. Suppose we are given a data set with N arbitrary distinct samples \((x_{i}, t_{i})\), where \(x_{i}=[x_{i1}, x_{i2}, \ldots , x_{in}]^{T} \in R^{n}\) and \(t_{i}=[t_{i1}, t_{i2}, \ldots , t_{im}]^{T} \in R^{m}\). The mathematical model of a standard SLFN with \(\tilde{N}\) hidden nodes and activation function g(x) for the given data can be formulated as follows (Huang et al. 2006):

$$\begin{aligned} \sum _{i=1}^{\tilde{N}} {\beta _i g_i (x_j )=} \sum _{i=1}^{\tilde{N}} {\beta _i g(w_i \cdot x_j +b_i )=y_j ,\quad j=1,\ldots ,N} , \end{aligned}$$
(9)

where \(w_{i}=[w_{i1}, w_{i2}, \ldots , w_{in}]^{T}\) denotes the weight vector that connects the input nodes to the \(i\hbox {th}\) hidden node, \(\beta _{i}=[\beta _{i1}, \beta _{i2}, \ldots , \beta _{im}]^{T}\) denotes the weight vector that connects the \(i\hbox {th}\) hidden node to the output nodes, and \(b_{i}\) is the threshold of the \(i\hbox {th}\) hidden node. The inner product of \(w_{i}\) and \(x_{j}\) is denoted by the operation \(w_{i}\cdot x_{j}\) in (9). Suppose that standard SLFNs with \({\tilde{N}}\) hidden nodes employing activation function g(x) can approximate these N samples with zero error. In such a situation, we obtain the following equation:

$$\begin{aligned} \sum _{j=1}^N {\left\| {y_j -t_j } \right\| =0} , \end{aligned}$$
(10)

where y denotes the actual output value of the SLFN. This indicates the existence of \(\beta _{i}\), \(w_{i}\), and \(b_{i}\) such that:

$$\begin{aligned} \sum _{i=1}^{\tilde{N}}{\beta _{i} g(w_{i} \cdot x_{j} +b_{i})=t_{j},\quad j=1,\ldots ,N} . \end{aligned}$$
(11)

A succinct expression of the previous N equations can be written as:

$$\begin{aligned} H\beta =T, \end{aligned}$$
(12)

where H is the hidden layer output matrix.

$$\begin{aligned} H= & {} \left[ {{\begin{array}{c} h(x_1) \\ \vdots \\ h(x_N) \\ \end{array}}} \right] =\left[ {{\begin{array}{ccc} h_{1} (x_1)&{}\quad \cdots &{}\quad h_{\tilde{N}} (x_1) \\ \vdots &{}\quad \ddots &{}\quad \vdots \\ h_{1} (x_N)&{}\quad \cdots &{}\quad h_{\tilde{N}} (x_N) \\ \end{array}}} \right] , \end{aligned}$$
(13)
$$\begin{aligned} \beta= & {} \left[ {{\begin{array}{c} \beta _{1}^{T} \\ \vdots \\ \beta _{\tilde{N}}^{T}\\ \end{array}}} \right] , \end{aligned}$$
(14)
$$\begin{aligned} T= & {} \left[ {{\begin{array}{c} T_{1}^{T} \\ \vdots \\ T_{N}^{T} \\ \end{array}}} \right] . \end{aligned}$$
(15)

As previously discussed, the input weights and hidden biases are randomly generated and do not require any tuning, unlike in the traditional SLFN methodology. Evaluating the output weights that link the hidden layer to the output layer is equivalent to determining the least-square solution of the given linear system. The minimum norm least-square (LS) solution to the linear system defined in (12) is:

$$\begin{aligned} \hat{\beta }=H^{+}T. \end{aligned}$$
(16)

\(H^{+}\) in the previous equation is the Moore-Penrose (MP) generalized inverse of the matrix H (Babovic et al. 2000). The minimum norm LS solution is unique and has the smallest norm among all LS solutions. The MP-inverse-based ELM achieves quality generalization performance with a radically increased learning speed. A general algorithm for ELM can be stated as follows. Given a training set, an activation function g(x), and a hidden neuron number L:

  1. Step 1:

    Assign random input weight \(w_i \) and bias \(b_{i}, i=1, . . ., L\).

  2. Step 2:

    Calculate the hidden layer output matrix H.

  3. Step 3:

    Calculate the output weight \(\beta :\beta =H^{+}T\).
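The three steps above translate almost directly into code. The following is a minimal Python sketch of an ELM regressor; the sigmoid activation, the uniform \([-1, 1]\) weight initialization, and the default hidden-layer size are illustrative assumptions.

```python
# Minimal ELM sketch: Steps 1-3 above, Eqs. (12)-(16).
import numpy as np

class ELM:
    def __init__(self, n_hidden=50, seed=0):
        self.L = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        """Hidden layer outputs h(x) with sigmoid activation g."""
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, T):
        n = X.shape[1]
        self.W = self.rng.uniform(-1, 1, (n, self.L))  # Step 1: random input weights w_i
        self.b = self.rng.uniform(-1, 1, self.L)       #         and random biases b_i
        H = self._hidden(X)                            # Step 2: hidden output matrix H, Eq. (13)
        self.beta = np.linalg.pinv(H) @ T              # Step 3: beta = H^+ T, Eq. (16)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta             # y = H beta, Eq. (12)
```

Because `np.linalg.pinv` computes the Moore-Penrose inverse directly, no iterative training, learning rate, or stopping criterion is involved.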

3 Proposed Model

The proposed hybrid approach for exchange rate forecasting (EMD-PSR-ELM) combines EMD, PSR, and ELM, and consists of four main stages. These four stages are described as follows.

Stage 1 EMD Decomposition

The original time series \(x(t), t = 1, 2,{\ldots },N\) is decomposed into n IMF components, \(c_{\mathrm{j}}(t), j = 1, 2, {\ldots }, n\), and one residual component \(r_{\mathrm{n}}(t)\) by using EMD.

Stage 2 Phase Space Reconstruction

First, the MI function in (8) is calculated for each \(c_{\mathrm{j}}(t)\) and \(r_{\mathrm{n}}(t)\) time series. Second, the first delay time in which the MI function minimum value occurs is considered the optimum time delay \(\tau \). Third, the FNN method is employed to find the minimum sufficient embedding dimension m. Fourth, according to the optimum time delay \(\tau \) and embedding dimension m, the time series phase space is reconstructed to reveal its unseen dynamics.

Fig. 1 Flow chart of the proposed EMD-PSR-ELM model

Therefore, the input and output samples can be represented by the matrices X and Y, respectively, in the following forms (where x can denote \(c_{j}\) or \(r_{n}\)):

$$\begin{aligned} X=\left[ {{\begin{array}{cccc} x(1)&{}\quad x(1+\tau )&{}\quad \cdots &{}\quad x(1+(m-1)\tau ) \\ x(2)&{}\quad x(2+\tau )&{}\quad \cdots &{}\quad x(2+(m-1)\tau ) \\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots \\ x(M)&{}\quad x(M+\tau )&{}\quad \cdots &{}\quad x(M+(m-1)\tau ) \\ \end{array}}} \right] , \quad Y=\left[ {{\begin{array}{c} x(1+(m-1)\tau +lag) \\ x(2+(m-1)\tau +lag) \\ \vdots \\ x(M+(m-1)\tau +lag) \\ \end{array}}}\right] . \end{aligned}$$
(17)

Forecasting techniques for chaotic time series typically fix the selected time lag at 1 (Liong and Sivapragasam 2002; Makridakis 1993). Therefore, in this study, we also fix the time lag at 1.
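As a sketch, the construction in (17) with the time lag fixed at 1 can be expressed as follows; the helper name `embed_series` is ours.

```python
# Build the input matrix X and output vector Y of Eq. (17) for one sub-series.
import numpy as np

def embed_series(x, m, tau, lag=1):
    """Rows of X are (x(i), x(i+tau), ..., x(i+(m-1)tau)); Y leads X by `lag`."""
    x = np.asarray(x, dtype=float)
    M = len(x) - (m - 1) * tau - lag               # number of usable samples
    X = np.column_stack([x[j * tau: j * tau + M] for j in range(m)])
    Y = x[(m - 1) * tau + lag: (m - 1) * tau + lag + M]
    return X, Y
```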

Stage 3 ELM Modeling

The reconstructed time series datasets are divided into training and testing datasets. The training datasets are used to build ELM models.

Stage 4 Result Composition

An ELM regression forecasting model is set up for each IMF and for the residue. The final prediction result is obtained by compositing the individual prediction values. Let \(F_{j}\) denote the ELM predictor function of the jth component; the final forecasting result is \(\sum \nolimits _{j=1}^{n} {F_j (c_j (t))} +F_{n+1} (r_n (t))\).

The flow chart of the proposed EMD-PSR-ELM model is shown in Fig. 1.
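Combining the sketches from Sects. 2.1-2.3 with `embed_series` above, an end-to-end one-step-ahead forecast for a single exchange rate series could be organized as follows. The FNN cutoff of 0.01, the cap of 10 on m, and the hidden-layer size are illustrative assumptions, not the settings used in our experiments.

```python
# End-to-end sketch of the EMD-PSR-ELM pipeline (Stages 1-4), reusing the
# emd, first_minimum_tau, fnn_fraction, embed_series, and ELM sketches above.
import numpy as np

def emd_psr_elm_forecast(x, n_hidden=50):
    imfs, residue = emd(np.asarray(x, dtype=float))             # Stage 1: EMD
    total = 0.0
    for series in imfs + [residue]:                             # one model per component
        tau = first_minimum_tau(series)                         # Stage 2: delay via MI
        m = next((m for m in range(1, 11)                       # smallest m with few
                  if fnn_fraction(series, m, tau) < 0.01), 10)  # false neighbours
        X, Y = embed_series(series, m, tau)                     # Eq. (17), lag = 1
        model = ELM(n_hidden).fit(X, Y)                         # Stage 3: ELM modeling
        last = series[len(series) - 1 - (m - 1) * tau:: tau]    # latest embedding vector
        total += model.predict(last[None, :]).item()            # Stage 4: composition
    return total
```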

4 Experimental Results and Analysis

4.1 Data Sets

Daily exchange rate values for USD/TWD, EUD/TWD, GBP/TWD, and AUD/TWD were extracted from the data stream provided by OANDA (http://www.oanda.com). The entire data set covers the period from January 1, 2007 to December 31, 2013, yielding a total of 2557 observations. The data set was divided into training and testing sets. The daily data from January 1, 2007 to April 19, 2013, a total of 2301 observations, were used as the training set. The remaining daily data, from April 20, 2013 to December 31, 2013, a total of 256 observations, were used as the testing set. In the next section, we explain how we implement our EMD-PSR-ELM model.

4.2 Benchmark Prediction Models

As mentioned in Sect. 1, this study adopts the Naïve RW, ELM, EMD-ELM, and PSR-ELM as the benchmarks for the experiment.

  1. (1)

Naïve RW: the Naïve RW simply takes the current value as the forecast of the next value. Thus, no fitting process is required.

  2. (2)

ELM: the original time series x(t) is directly used to build ELM models and to forecast the final results. The function can be expressed as \(\hat{x} (t+1)=F(x(t))\), where F refers to the ELM predictor function.

  3. (3)

EMD-ELM: first, the original time series is decomposed by EMD into several IMF time series and one residual time series. These decomposed datasets are then used to build the ELM models described previously, yielding the EMD-ELM model.

  4. (4)

PSR-ELM: we use the PSR method to reconstruct the original time series space, from which we obtain the optimum embedding dimension m and delay time \(\tau \). The reconstructed datasets are then used to build ELM models (see the sketches after this list). The function can be expressed as follows:

$$\begin{aligned} \hat{x} (t+1)=F(x(t),x(t-\tau ),\ldots ,x(t-(m-1)\tau )). \end{aligned}$$
    (18)
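For concreteness, hedged sketches of the first and fourth benchmarks follow; the single ELM and EMD-ELM variants differ only in the series fed to the model. These reuse the `embed_series` and `ELM` sketches given earlier.

```python
# Benchmark sketches: the Naïve RW and the PSR-ELM predictor of Eq. (18).
import numpy as np

def naive_rw_forecast(x):
    return x[-1]                                   # next value = current value

def psr_elm_forecast(x, m, tau, n_hidden=50):
    X, Y = embed_series(x, m, tau)                 # reconstructed inputs, Eq. (18)
    model = ELM(n_hidden).fit(X, Y)
    last = np.asarray(x, dtype=float)[len(x) - 1 - (m - 1) * tau:: tau]
    return model.predict(last[None, :]).item()     # one-step-ahead forecast
```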

4.3 Evaluation Criteria

To evaluate the forecasting performance of the proposed model, we adopt the MAE, RMSE, and MAPE. These measures are defined as follows:

$$\begin{aligned} \hbox {MAE}= & {} N^{-1}\sum _{t=1}^{N} {\left| Y_{(t)}-{\hat{Y}}_{(t)}\right| }, \end{aligned}$$
(19)
$$\begin{aligned} \hbox {RMSE}= & {} \left( N^{-1}\sum _{t=1}^{N} (Y_{(t)}-{\hat{Y}}_{(t)})^{2}\right) ^{1/2}, \end{aligned}$$
(20)
$$\begin{aligned} \hbox {MAPE}= & {} N^{-1}\sum _{t=1}^{N}{\left| (Y_{(t)} -{\hat{Y}}_{(t)})/Y_{(t)}\right| }, \end{aligned}$$
(21)

where \(Y_{(t)}\) and \({\hat{Y}}_{(t)}\) are the actual and predicted values, respectively, at time t, and N is the sample size. Note that MAE, RMSE, and MAPE measure the deviation between actual and predicted values; forecasting performance therefore improves as these measures decrease. If the results are not consistent among these criteria, we choose MAPE as the benchmark, as suggested by Makridakis (1993), because MAPE is relatively more stable than the other criteria.
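The three criteria are straightforward to compute; a minimal NumPy sketch follows, with the actual value \(Y_{(t)}\) in the MAPE denominator as in (21).

```python
# Evaluation criteria of Eqs. (19)-(21); y and y_hat are NumPy arrays of
# actual and predicted values.
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))              # Eq. (19)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))      # Eq. (20)

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y))        # Eq. (21)
```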

4.4 Implementation of EMD

Following the steps described in Sect. 3, we conducted the prediction experiments. First, using the EMD technique, the four exchange rate series (USD/TWD, EUD/TWD, GBP/TWD, and AUD/TWD) were each decomposed into 10 IMFs (IMF1-IMF10) and one residual, as shown in Fig. 2. The extracted IMF components are plotted in the order in which they were extracted, that is, from the highest frequency (shortest period) to the lowest. The last component is the residue of the sifting, which generally represents the trend of the time series. In this study, the EMD components were obtained by using the HHT MATLAB program (http://rcada.ncu.edu.tw/research1_clip_program.htm).

Fig. 2 IMFs and residual. a USD/TWD. b EUD/TWD. c GBP/TWD. d AUD/TWD

4.5 Implementation of PSR

In the PSR stage, MI was used to select the optimal delay time \(\tau \), based on the first minimum value of the MI function. After the optimal \(\tau \) was selected, FNN was used to extract the minimum embedding dimension. Table 1 shows the optimal m and \(\tau \) for each IMF and the residual. These optimal embedding dimensions and delay times were used to construct the input matrix X. The data were fed to ELM forecast models set up for each IMF and the residual. The final prediction results of the EMD-PSR-ELM model were obtained by compositing (i.e., combining the separate prediction values into one value). We used Hao Cheng's Fractal MATLAB toolbox to implement the MI and FNN functions.

Table 1 Optimal m and \(\tau \) for each IMF and residual

4.6 Forecasting Results and Analysis

To compare the performance of the different models, we first applied the benchmarks, namely the Naïve RW, ELM, PSR \(+\) ELM, and EMD \(+\) ELM, to forecast the four exchange rates. The performance comparison of the seven models (Naïve RW, ARIMA, the back propagation neural network (BPNN), ELM, PSR \(+\) ELM, EMD \(+\) ELM, and EMD \(+\) PSR \(+\) ELM) according to the three evaluation criteria (MAPE, MAE, and RMSE) is reported in Table 2. Relative errors, defined as "the ratio of error to the actual value," of these models are shown in Fig. 3.

The empirical analysis confirms that EMD \(+\) PSR \(+\) ELM performs best among the compared models on all four exchange rates. The empirical results demonstrate the usefulness of the two-stage data preprocessing (stage 1 EMD, stage 2 PSR) of our proposed ELM model. Several phenomena in Fig. 3 point to possible sources of this superiority: at the high-frequency points, the relative errors of the hybrid model are much smaller than those of the other models. This observation suggests that the EMD method can reduce the noise contained in the time series and thus enhance accuracy.

The average error of the pure ELM was the worst in terms of MAPE, MAE, and RMSE for the four exchange rates; it was even worse than the Naïve RW in nearly all measures. This indicates that a single ELM is unsuitable for exchange rate time series forecasting. However, combining the EMD data processing method with the single ELM model, that is, the EMD \(+\) ELM model, improves its performance.

Table 2 reveals that the accuracy of the PSR \(+\) ELM model for the four exchange rates is no better than that of the single ELM model; it is even the worst on the AUD/TWD exchange rate dataset. The optimal embedding dimensions m and delay times \(\tau \) obtained for the four exchange rates by using the PSR method are \((m=1, \tau =7)\), \((m=1, \tau =7)\), \((m=1, \tau =7)\), and \((m=6, \tau =7)\), as presented in Table 3. When \(m=1\), the constructed input matrix X is the same as that obtained without the PSR method. Therefore, the performance for the three exchange rates USD/TWD, EUD/TWD, and GBP/TWD is the same as that of the single ELM presented in Table 2. Hence, we have insufficient evidence to conclude that PSR does not enhance accuracy in exchange rate forecasting.

Furthermore, to demonstrate the effectiveness of the EMD \(+\) PSR \(+\) ELM model, we also compared our model with ARIMA and with BPNN, one of the most popular neural network models. The experimental results reveal that our model outperforms these two models with respect to the MAPE, MAE, and RMSE criteria on all four data sets, as also shown in Table 2. This demonstrates the strong robustness of our proposed hybrid model. The optimal parameters of ARIMA, BPNN, and PSR \(+\) ELM for the four data sets are shown in Table 3.

Table 2 Model performance for different exchange rate datasets
Fig. 3 Corresponding relative errors of the different models. a USD/TWD. b EUD/TWD. c GBP/TWD. d AUD/TWD

5 Conclusions

Designing an appropriate model to forecast financial data is a major challenge for time series analysts and researchers, mainly because the irregular movements and numerous turning points of these series are extremely difficult to understand and predict. In this study, a new hybrid model that intelligently combines EMD, PSR, and ELM (EMD \(+\) PSR \(+\) ELM) is proposed to forecast exchange rates. From the experimental results of this study, we can draw the following conclusions:

  1. (1)

EMD can fully capture the local fluctuations of the data and can be used as a preprocessor to decompose the complicated raw data into a finite set of IMFs and a residue, which improves exchange rate prediction accuracy.

  2. (2)

The network topology of the ELM model has a major influence on prediction performance. Identifying the chaotic characteristics of exchange rate time series and determining the embedding dimension of the reconstructed phase space through the FNN function is a more objective procedure, and the determined embedding dimension can then serve as the number of nodes in the input layer of the SLFN.

  3. (3)

Empirical results from four real-world exchange rate time series clearly suggest that our hybrid method substantially improves overall forecasting accuracy and outperforms both a statistical model (Naïve RW) and an artificial intelligence model (ELM). Therefore, the proposed method is highly suitable for predicting nonlinear, nonstationary, and highly complex data and is an efficient method for exchange rate prediction.

Table 3 Optimal parameters for ARIMA, BPNN, and PSR \(+\) ELM

Future research should consider the properties of the data when combining time series techniques with AI methods. Direction prediction criteria are crucial to the trading strategies of investors. In our model, we select only the one-dimensional time series of exchange rates as input variables. Future research might attempt to enhance the performance of prediction models by including other informative input variables, such as macroeconomic variables, and by using diverse data to assess feasibility. One possibility is to identify important input variables by adopting strong or emerging mathematical methods, such as MARS or CMARS, to build a more complete integrated model. In addition, the relationships between different markets and their trading information might be examined.