1 Introduction

Estimation of sea-ice thickness in the Arctic is a crucial task. Improved estimates of sea-ice thickness should lead to safer navigation in ice-infested waters, improved estimates of ocean-atmosphere heat transfer, and improved weather forecasting. It has been reported that there is an increasing need to develop accurate and comprehensive forecasting tools that can estimate crucial sea-ice parameters, such as sea-ice thickness, in different regions (Eicken 2013). It is therefore necessary to conduct research on novel tools that can estimate sea-ice thickness from observational data, as well as on improving the accuracy of the existing prognostic forecast models. The present contribution addresses the first of these two objectives and proposes an intelligent method to retrieve sea-ice thickness from geophysical data. The method differs from those commonly used in the sea-ice community (Iwamoto et al. 2013; Nihashi et al. 2009; Kaleschke et al. 2012) in that it is a black-box approach, meaning that it is not based on any physical principles. While the method presented here uses brightness temperatures from passive microwave radiometers in addition to data from forecast models, it could in principle be modified to accommodate other types of data.

Data from passive microwave sensors are widely used for sea-ice monitoring because these sensors can return surface information during both dark and cloudy periods. In addition to providing indispensable ice-concentration estimates to the scientific community, these data can yield ice-thickness information by exploiting the correlation between the horizontally and vertically polarized emissions measured by passive sensors and the thickness of thin ice (Nihashi et al. 2009; Iwamoto et al. 2013). This correlation has been found to be stronger for the lower-frequency channels of a passive microwave sensor than for the higher-frequency channels, even when the data from the higher-frequency channels have been screened for atmospheric contamination (Scott et al. 2014). This is because the emitting layer lies farther below the surface at lower frequencies (Weeks 2010).

Research on how to select the most influential observational data, or on how to use this information for developing estimation tools, is often based on the underlying physical/environmental characteristics of the remote sensing measurement and the region of study. However, several important issues arise when developing physical models. If the objective is to develop high-fidelity physical models, and if it is optimistically assumed that most of the important elements are taken into account in these models, the resulting model may possess a complicated structure. Significant effort may be required to identify the unknown structural parameters of the model (Pratama et al. 2014b), and the model may also be computationally very expensive.

The difficulties associated with developing accurate and efficient physical models have led to alternative and complementary research on establishing new surrogate models. Indeed, as the quantity of available data increases, it becomes possible to develop accurate black boxes that are agnostic to the physical situation. Such a trend is, in fact, in line with a global interest in developing automated machines that reduce the amount of human intervention in data analysis (Pratama et al. 2014a).

Over the past decade, enormous progress has been made to foster the applicability of intelligent techniques as black boxes capable of creating an accurate nonlinear map between a set of input and output data pairs (Meireles et al. 2003). In this context, most of the proposed learning systems are designed for a wide range of applications requiring incremental and recurrent learning (Pratama et al. 2015a) or evolving classification and regression (Pratama et al. 2015b, c). However, reports of sea-ice thickness estimation using intelligent identifiers are rare in the literature. A survey of the archived literature yields only a limited number of research papers with a meaningful contribution toward advocating the use of intelligent identifiers for modeling sea-ice thickness. Haverkamp et al. (1995) proposed an intelligent technique for estimating sea-ice thickness from a synthetic aperture radar (SAR) database. The simulation results verified that the developed intelligent system can accurately identify variations in ice thickness due to regional and seasonal changes. Soh et al. (2004) developed a hybrid intelligent system for satellite sea-ice image analysis. The developed intelligent framework was able to perform both feature selection and rule-based classification using the data coming from sensors, and the simulation results endorse the efficiency of the proposed paradigm. Lin and Yang (2012) proposed a hybrid algorithm based on a chaotic immune genetic algorithm and a back-propagation neural network (BP-NN) to predict the sea-ice thickness in the Bohai Sea and the northern region of the Yellow Sea. The results of the conducted simulations demonstrated both the accuracy and the robustness of the proposed estimator. Belchansky et al. (2008) developed an in-situ learned and empirically derived neural network model to estimate the fluctuation of Arctic sea-ice thickness. One important characteristic of the developed neural network was that it could predict the sea-ice thickness under different conditions with very high accuracy.

In pursuit of addressing the main open concerns about the potential of intelligent methods for estimating geophysical variables, the authors propose a novel technique based on the concept of modular neurocomputing (Rojas 1996). To the authors' best knowledge, this is the first time such an intelligent system has been used for the estimation of a geophysical variable such as sea-ice thickness. Furthermore, this research intends to demonstrate how a relatively sophisticated intelligent tool can be used to (1) process spatio-temporal data coming from two satellite sensors, the moderate resolution imaging spectroradiometer (MODIS) and the advanced microwave scanning radiometer for the Earth observing system (AMSR-E), (2) develop a modular intelligent tool that can separate the information coming from different regions, and (3) estimate the sea-ice thickness using a set of independent identification modules allocated to each of those independent regions.

The rest of the paper is organized as follows: Section 2 is devoted to a brief review of modular identification systems; it also discusses why a modular system is potentially a good approach for estimating sea-ice thickness over a given geographic region, chosen here as the Labrador coast, along the east coast of Canada. The description of the studied region, along with the traits of the data collected from the MODIS and AMSR-E sensors, is presented in Sect. 3. The details of the developed model are given in Sect. 4. Section 5 is devoted to the description of the experimental setup, and the results are discussed in Sect. 6. Finally, conclusions are given in Sect. 7.

2 Modular neural networks: a concise review

In the field of data mining, one often encounters complicated databases comprising several independent sub-databases, each having its own characteristics and features. To process such complicated information, the human brain uses an automatic procedure that first discriminates the sub-processes and then processes the data corresponding to each sub-process. Inspired by this approach, a novel computing field has emerged within the realm of intelligent computing, known as modular data processing (Fodor 1983). Modular data processing can be used for developing both modular regression and modular classification tools. The main idea behind such a scheme is to use similarity and discrimination criteria to first divide the database into a set of sub-databases with similar characteristics, and thereafter allocate an independent regression/classification method to each of those databases (Melin 2012). Modular computational frameworks can have any type of identifier, e.g. neural networks (NNs), neuro-fuzzy systems, and fuzzy inference systems, in their architecture (Melin 2012). Each of these identifiers has its own pros and cons and thus gives specific features to the resulting modular system. Owing to the straightforward and beneficial computational characteristics of neurons in neural networks, most existing modular architectures use such identifiers in their structure. A modular NN has a relatively complex architecture comprising a series of independent NNs working together in a systematic fashion to predict the same output (Rojas 1996).

The key point lies in the fact that each of those individual NNs tries to fulfill a specific task by focusing on a separate portion of the input space of the database. Thus, the systematic integration of the predictions of those independent NNs can result in a modular NN capable of performing a unique task. Such an approach has an obvious advantage from the data mining viewpoint, especially when dealing with complicated information systems with multiple data streams. In fact, a complex, large data-mining task is reduced to a number of smaller, manageable data-mining tasks, and a higher level of manipulation can be achieved to obtain a much more robust and accurate result. The other advantage of modular NNs lies in their intrinsically parallel architecture, which enables them to perform the entire task in a parallel fashion and thereby increases the speed and efficacy of computation (Farooq 2000).

An enormous amount of applied and theoretical research has been carried out to demonstrate the applicability of such systems to different tasks. Providing a detailed chronological review of the progress of modular intelligent data processing is clearly beyond the scope of the current investigation; interested readers are referred to special issues and seminal books on the advances in modular data processing (Rojas 1996; Farooq 2000). In general, the published research indicates that modular NNs not only can increase the robustness and accuracy of the data-mining task, but can also do so very efficiently. The same observation has been reported by scientists working within various realms, such as medicine, systems sciences, and manufacturing (Ding et al. 2014; Javadi et al. 2013). Such promising reports have instigated the authors to evaluate the potential of modular intelligent computing for estimating sea-ice thickness. The main reasons behind this motivation are as follows:

(a) The quantity of data collected from satellites is quite large, and these data are often combined with data from other sources (e.g., forecast model output) to produce several streams of spatio-temporal information. It is desirable to develop a modular data-mining tool capable of reducing the complexity of the resulting database, while achieving accurate and robust estimates of geophysical variables.

(b) Modular neural networks have clearly demonstrated their high potential in various fields of engineering. However, there are fewer applications of such systems to geoscience tasks, in particular the estimation of sea-ice thickness. Therefore, the current investigation tries to explore the computational potential of modular identifiers for the estimation of sea-ice thickness.

(c) The current investigation also contributes to the field of modular computing by proposing a novel sequential system that uses a differential evolutionary algorithm for clustering and a ridge randomized neural network for estimation. To demonstrate the computational advantages of the proposed modular architecture, the authors compare it against some well-known rival modular neural networks on the same estimation task.

The detailed procedures required for the implementation of the proposed modular estimator will be given in the next sections.

3 Description of collected database

In this section, the details of the studied region, located along the east coast of Canada, as well as the characteristics of the data from the two considered remote sensors, i.e. AMSR-E and MODIS, are discussed.

3.1 Region of study

As mentioned, this study emphasizes estimating the sea-ice thickness along the east coast of Canada, including sea ice along the Labrador coast and the northern coast of Newfoundland, as indicated in Fig. 1.

Fig. 1 Region of study: the Labrador Coast and Newfoundland

The database used in this study covers the period from February 2, 2007, to February 20, 2007. For this portion of the year, the ice cover along the Labrador coast is bounded to the west and south by land and to the east by the Labrador Current. The ice starts to appear along the Labrador coast in December and gradually becomes thicker through January and February. The ice cover contains a marginal ice zone composed of small ice floes near the open water, with the ice becoming thicker toward the land boundaries. In addition, coastal polynyas may occasionally form between the consolidated ice region and the landfast ice.

3.2 AMSR-E data

To capture radiation within the passive microwave range of the electromagnetic spectrum, the AMSR-E sensor uses six frequencies: 6.9, 10.7, 18.7, 23.8, 36.5, and 89 GHz. The footprint of each of these frequencies is approximately elliptical, with sizes ranging from 74 km \(\times \) 43 km down to 6 km \(\times \) 4 km. Swath data are used in this study to mitigate the effects of uncertainties that can arise when data are averaged or resampled. Furthermore, due to the land contamination associated with the sensor footprints, the information coming from pixels within half of the sensor footprint of the land boundaries has been discarded. Brightness temperatures are used from both a low-frequency channel (6.9 GHz) and a high-frequency channel (36.5 GHz). The latitude, longitude, and brightness temperature values are entries \(x_{1}\) to \(x_{3}\) in the database (see Table 1).

Table 1 Characteristics of the collected features used for designing M-RRNN

3.3 MODIS data

The MODIS sensor measures radiation in the VIS/IR range of the electromagnetic spectrum. To calculate the sea-ice thickness, the surface temperature derived from the MODIS infrared channels (Hall et al. 2004) is used in a heat balance equation (Yu and Lindsay 2003). In this study, the heat balance equation uses the atmospheric variables from the Global Environmental Multiscale (GEM) model and the MOD29 ice surface temperature product prepared by the National Snow and Ice Data Center (Hall et al. 2007). The MODIS data are swath data at a 1-km resolution in which each pixel has been screened for cloud contamination. The observed data include surface temperatures from 243 to 271 K. To reduce the uncertainty of the collected database, nighttime images are used, as these are not affected by uncertainties associated with the surface albedo and shortwave radiation (Wang et al. 2010). Figure 2 depicts a sample ice temperature image obtained by the MODIS sensor. The ice thickness from MODIS is entry \(x_{10}\) in the database (see Table 1).

Fig. 2 Ice temperature from MODIS, January 24, 2007

3.4 Data from the forecasting system

Variables from an atmospheric weather forecasting model (the GEM model) and a coupled ice-ocean model are used in addition to the AMSR-E and MODIS data. The variables from these models are those that impact brightness temperature, such as surface temperature, wind speed, water vapour, and cloud liquid water. The dataset has been described in a previous study (Scott et al. 2012). The variables from the forecasting system used in the present study are listed as \(x_{4}\) to \(x_{9}\) in Table 1.

4 Methodology

In this section, the details of the architecture of the modular ridge randomized neural network (M-RRNN) are presented. The considered M-RRNN comprises two parts: a distributor with a modified differential evolutionary algorithm at its heart, and a modular estimation phase with a ridge randomized neural network (RRNN) at each of the considered modules. In the first sub-section, the algorithmic description of the modified differential evolutionary algorithm is presented, and in the second sub-section, the steps required for the implementation of the RRNN and for creating a modular architecture are proposed.

4.1 Implementation of distributor: differential evolutionary algorithm

For the implementation of the clustering methodology, a modified version of the differential evolutionary algorithm (DEA), called scale factor local search differential evolution (SFLSDE) (Neri and Tirronen 2009), is adopted. This method has proven to have a very robust performance and is not sensitive to the dimensionality of the problem. In previous theoretical and numerical investigations, it has been observed that the local search operators at the heart of SFLSDE can result in very fast convergence towards the optimum solution regardless of the characteristics of the landscape of the objective function (Neri and Tirronen 2009; Mozaffari et al. 2014). These local search operators, golden section search (GSS) and hill-climbing (Neri and Tirronen 2009), are used to update the scale factor parameter and crossover rate of the standard DEA. A salient asset of SFLSDE lies in its capability to allocate an independent scale factor to each of the potential solutions in the solution space. This enables SFLSDE to devise an independent searching strategy for each agent, which in turn increases the diversity of the algorithm. Another salient asset of SFLSDE, compared to other variants of DEA, is its capability to update the scale factor value in a self-organizing fashion. To the best knowledge of the authors, SFLSDE has not been applied to clustering so far. However, it has the following computational advantages, which clearly indicate that it may have high potential for the clustering task:

(a) It has been proven that SFLSDE performs efficiently independent of the dimensionality of the optimization problem, owing to the insensitivity of the local searches to the scale of the solution landscape (Neri and Tirronen 2009).

(b) The algorithmic structure of SFLSDE not only results in a high diversity of the search over the solution space, but also neatly balances the exploration and exploitation capabilities to ensure convergence to the optimum regions within the objective landscape (Neri and Tirronen 2009).

(c) One of the main features of SFLSDE is its capability to fix a major flaw of a large number of modified DEAs, namely stagnation: a large portion of the proposed DEAs may become trapped in a local minimum during the optimization procedure. The diversified refreshing of the characteristics of the genotypes participating in the optimization helps SFLSDE escape from local minima, which in turn impedes stagnation (Neri and Tirronen 2009).

(d) SFLSDE is a memetic algorithm based on Lamarckian learning, integrating DEA with two local search operators (Neri and Tirronen 2009). However, unlike most existing memetic methods, the local searches at the heart of SFLSDE do not increase the computational complexity of the resulting architecture, as both the hill-climbing and golden section search mechanisms have quite simple structures.

The abovementioned remarks have motivated us to adopt SFLSDE to develop a clustering methodology, which will be used as the distributor in the considered modular ridge randomized neural network. By inspecting the literature on heuristic clustering, the authors realized that adding a general operator known as acceleration can effectively boost the performance of clustering metaheuristics (Chuang et al. 2011). Thus, in this study, the authors embed the acceleration strategy into the algorithmic structure of SFLSDE to ensure an acceptable clustering behavior. Let us call the resulting clustering method SFLSDE-clust. The following steps are taken to implement the method:

Step 1: Set the controlling parameters of SFLSDE-clust according to those given in Table 2.

Table 2 Controlling parameters of SFLSDE-clust for the current simulations

Step 2: Uniformly spread the positions of S chromosomes [denoted by vectors \(\mathbf {s}(1),\mathbf {s}(2),\dots ,\mathbf {s}(S)\)] through the solution space. Each solution in this space represents the centroids of a set of clusters (different centroid points in a 2D space are shown in the first panel of Fig. 3). Assume that K clusters are going to be used for partitioning the collected information, and that the data possess d features. Then each chromosome has \(K \times d\) bits to be optimized. In a previous work by the authors, it was observed that the optimum number of chromosomes for clustering is \(10 \times K \times d\) (Chuang et al. 2011).

Step 3: Prior to starting the clustering procedure, proceed with the acceleration process: select one third of the chromosomes at random and use standard K-means clustering (Chuang et al. 2011) to update their positions. Each chromosome clusters the information in the database so as to minimize the Euclidean distance

$$\begin{aligned} \Vert \mathbf {x}_{p}-\mathbf {c}_{j}\Vert =\sqrt{\sum _{i=1}^{d}\left( x_{p,i}-c_{j,i}\right) ^{2}}, \end{aligned}$$

where \(\mathbf {c}_{j}\) is the centroid of the jth cluster \(\mathcal {C}_j\), \(c_{j,i}\) indicates the ith coordinate of the centroid \(\mathbf {c}_{j}\), \(x_{p,i}\) is the ith coordinate of the pth data in the database, and d is the dimensionality of the data points \(\mathbf {x}_{p}\). The centroid of the jth cluster \(\mathcal {C}_j\) is determined using the set of data points in cluster j according to

$$\begin{aligned} \mathbf {c}_{j}=\frac{1}{n_{j}}\sum _{ \mathbf {x}_{p} \in \mathcal {C}_{j}}\mathbf {x}_{p}, \end{aligned}$$

where \(n_{j}\) represents the number of data points in cluster \(\mathcal {C}_j\).
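To make this acceleration step concrete, a minimal Python sketch is given below (the array names, the NumPy implementation, and the iteration count are illustrative choices, not specifications from the paper):

```python
import numpy as np

def kmeans_accelerate(X, centroids, n_iter=10):
    """Refine one chromosome's centroids with standard K-means.

    X         : (n, d) array of data points.
    centroids : (K, d) array decoded from the chromosome's K*d bits.
    """
    for _ in range(n_iter):
        # Euclidean distance of every point to every centroid, shape (n, K).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)            # nearest-centroid assignment
        for j in range(centroids.shape[0]):
            members = X[labels == j]
            if len(members) > 0:                 # leave empty clusters unchanged
                centroids[j] = members.mean(axis=0)   # c_j = mean of its members
    return centroids
```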

Step 4: After the termination of the acceleration phase, calculate the objective function of all chromosomes using the sum of intra-cluster distances, given as

$$\begin{aligned} J=J(\mathbf {c}_{_1},\dots ,\mathbf {c}_{_K})=\sum _{j=1}^{K} \sum _{\mathbf {x}_{p} \in \mathcal {C}_{j}} \Vert \mathbf {x}_{p}-\mathbf {c}_{j} \Vert . \end{aligned}$$
(1)

The above objective function is minimized to find the centroids \(\mathbf {c}_{j}\) corresponding to the optimal clustering; its value is referred to as the fitness.

Step 5: For each chromosome select 5 random numbers, \(u_{1}\), \(u_{2}\), \(u_{3}\), \(u_{4}\), and \(u_{5}\), from a uniform distribution (Neri and Tirronen 2009).

Step 6: For each chromosome \(\mathbf {s}(i)\), randomly select three individuals, i.e. \(\mathbf {s}_{1}(i), \mathbf {s}_{2}(i)\) and \(\mathbf {s}_{3}(i)\), from the existing population pool.

Step 7: For chromosome \(\mathbf {s}(i)\), if the selected solution has the best fitness value and \(u_{5}\) is less than a given threshold \(\tau _{3}\), proceed with the golden section search (GSS) strategy to update the scale factor of the agent, F(i) (the steps required for GSS are given in Appendix 1). Archive the resulting offspring as \(\mathbf {u}(i)\).

Step 8: For chromosome \(\mathbf {s}(i)\), if the selected solution has the best fitness value and \(u_{5}\) is greater than the threshold \(\tau _{3}\) but less than the threshold \(\tau _{4}\), proceed with the hill-climbing search to update the scale factor of the agent, F(i) (the steps required for hill-climbing are given in Appendix 2). Archive the resulting offspring as \(\mathbf {u}(i)\).

Step 9: For chromosome \(\mathbf {s}(i)\), if \(u_{5}\) is greater than the threshold \(\tau _{4}\), proceed with the conventional evolutionary steps of the standard DEA (these steps are given in Appendix 3). Archive the resulting offspring as \(\mathbf {u}(i)\).

Step 10: Calculate the fitness of the offspring. If the resulting solution \(\mathbf {u}(i)\) has a better (lower) fitness than the parent \(\mathbf {s}(i)\), replace the old solution with the new one.

Step 11: Check the stopping criteria. If the stopping criteria are satisfied, terminate the procedure; otherwise, return to Step 5.
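To make the above steps concrete, a simplified Python sketch of the whole loop is given below. It reuses the `kmeans_accelerate` routine sketched in Step 3, replaces the GSS and hill-climbing scale-factor updates of Steps 7 and 8 with a simple random restart of \(F(i)\) (a placeholder for the Appendix routines), and uses illustrative parameter values rather than those of Table 2:

```python
import numpy as np

def fitness(X, centroids):
    """Sum of intra-cluster distances, Eq. (1); lower is better."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1).sum()

def sflsde_clust(X, K, max_evals=10_000, tau4=0.2, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    S = 10 * K * d                              # Step 2: population size
    lo, hi = X.min(axis=0), X.max(axis=0)
    pop = rng.uniform(lo, hi, size=(S, K, d))   # chromosomes = centroid sets
    for i in rng.choice(S, S // 3, replace=False):
        pop[i] = kmeans_accelerate(X, pop[i])   # Step 3: acceleration
    F = rng.uniform(0.1, 1.0, size=S)           # independent per-agent scale factors
    fit = np.array([fitness(X, c) for c in pop])
    evals = S
    while evals < max_evals:                    # Steps 5-11
        for i in range(S):
            if rng.random() < tau4:             # stand-in for GSS/hill-climbing
                F[i] = rng.uniform(0.1, 1.0)
            r1, r2, r3 = rng.choice([k for k in range(S) if k != i], 3,
                                    replace=False)
            mutant = pop[r1] + F[i] * (pop[r2] - pop[r3])   # DE/rand/1 mutation
            cross = rng.random((K, d)) < 0.9                # binomial crossover
            trial = np.where(cross, mutant, pop[i])
            f_trial = fitness(X, trial)
            evals += 1
            if f_trial < fit[i]:                # Step 10: keep the better solution
                pop[i], fit[i] = trial, f_trial
    best = int(fit.argmin())
    return pop[best], fit[best]
```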

A schematic illustration of the procedure taken by SFLSDE-clust to partition the information into a number of clusters is shown in Fig. 3. At this point, K partitions have been created, which can be used to develop a modular estimator with K modules.

Fig. 3 Schematic illustration of the proposed clustering procedure in 2D space: (1) initialization of centroids in the information space, (2) selection and updating of one third of the chromosomes, (3) application of the SFLSDE operators to update all of the chromosomes

4.2 Implementation of the modular architecture

As pointed out previously, a modular architecture comprises K modules, each with an estimator at its heart. In this sub-section, we first concentrate on explaining the mathematical structure of each estimator, i.e. the RRNN. Thereafter, we explain how those independent modules are put together to form the modular framework, i.e. the M-RRNN.

In the proposed modular frame, random-based neural networks are used for function approximation. Through theoretical studies, it has been proven that a multi-layer neural network with a single hidden layer of bounded and nonconstant activation functions can serve as a universal approximator (Hornik 1991). In parallel, in the same year, Park and Sandberg (1991) demonstrated that radial-basis networks, with the same width for all radial-basis neurons or different widths for different RBF neurons, can also serve as universal approximators. Such findings have prompted researchers in the neural computation community to search for more flexible learning schemes that reduce the complexity of the optimization process required for training the network's computational units (neurons). In line with such activities, over the past decades, comprehensive investigations have clearly demonstrated the potential of random-based learning systems for designing feed-forward neural networks (Schmidt et al. 1992), radial basis neural networks (Broomhead and Lowe 1988; Lowe 1989), and functional link nets (Pao et al. 1994). Based on the promising reports on the computational power of feed-forward randomized neural networks (RNNs) (Schmidt et al. 1992), which are in good agreement with the authors' own experiments, this network is used at the heart of the proposed modular network to form the approximator.

It should be mentioned that, nowadays, a large number of research groups are trying to improve the performance of RNNs, and modified versions of random neural networks are being proposed. In this context, one can find different architectures of RNNs in the literature known by specific names. Among the existing variants of RNNs, methods such as extreme learning machines (ELMs) (Huang et al. 2006), random vector functional link nets (RVFLNs) (Zhang and Suganthan 2015), random kitchen sinks (RKSs) (Rahimi and Recht 2007), fastfood (Le et al. 2013), convex networks (Huang et al. 2013), the no-prop algorithm (Widrow et al. 2013), liquid state machines (LSMs) (Yamazaki and Tanaka 2007), echo state networks (ESNs) (Rodan and Tiňo 2011), and reservoir computing machines (Lukoševičius and Jaeger 2009) have found their reputation in the computational intelligence community.

Proposed by Wu and Moody (1996), the ridge neural network is a modified version of the standard RNN that tames the numerical difficulties associated with the analytical solving methodology used in analytically trained, least-squares-based neural networks. To be more precise, the modification applies the concept of Tikhonov regularization instead of the simple least squares solution to tune the synaptic weights of the RNN (Burger and Neubauer 2003). Assume that the collected database has n training samples \(\mathfrak {D}=\lbrace (\mathbf {x}_{1}, y_{1}),(\mathbf {x}_{2}, y_{2}),\dots ,(\mathbf {x}_{n}, y_{n})\rbrace \), in which \(\mathbf {x}_i=(x_{i,1},\ldots ,x_{i,d})^\mathrm{T}\) represents the d-dimensional input vector and \(y_i\) the response value of the ith observation. Assume that the neural network has N hidden nodes in its architecture. Then the mathematical formulation below is used to create a map between the input vectors, \(\mathbf {x}_i\), and the target values, \(f(\mathbf {x}_i)\)

$$\begin{aligned} \sum _{j=1}^{N}w_j g\left( \varvec{\alpha }_j^\mathrm{T}\mathbf {x}_{i}+b_{j}\right) =f(\mathbf {x}_i), \quad i=1,\ldots ,n, \end{aligned}$$
(2)

where \(\varvec{\alpha }_j=(\alpha _{j,1},\dots ,\alpha _{j,d})^\mathrm{T}\) are the synaptic weight vectors connecting the input nodes to the jth hidden node, \(w_j\) indicates the weight connecting the jth hidden node to the output nodes, and g is a continuous activation function, which is the sigmoid function in this paper (Pao et al. 1994), i.e.

$$\begin{aligned} g(x)=\frac{\text {e}^x}{1+\text {e}^x}\quad \text {for } \ x\in \mathbb {R} . \end{aligned}$$

Let

$$\begin{aligned} \mathbf {H}= & {} \begin{pmatrix} g(\varvec{\alpha }_1^\mathrm{T}\mathbf {x}_1+b_1) &{} \dots &{} g(\varvec{\alpha }_N^\mathrm{T}\mathbf {x}_1+b_N)\\ \vdots &{} \vdots &{} \vdots \\ g(\varvec{\alpha }_1^\mathrm{T}\mathbf {x}_n+b_1) &{} \dots &{} g(\varvec{\alpha }_N^\mathrm{T}\mathbf {x}_n+b_N) \end{pmatrix},\quad \mathbf {y}=\begin{pmatrix} y_1\\ \vdots \\ y_n \end{pmatrix}, \\ \mathbf {w}= & {} \begin{pmatrix} w_1\\ \vdots \\ w_N \end{pmatrix}. \end{aligned}$$

To estimate the function f defined in Eq. (2), the RNN algorithm discussed in Schmidt et al. (1992) allows the user to choose the \(\varvec{\alpha }_j\)'s and \(b_j\)'s arbitrarily at random and apply the least squares method to estimate the hidden-to-output weight vector \(\mathbf {w}\), that is

$$\begin{aligned} \min \limits _{\mathbf {w}}\Vert \mathbf {y}-\mathbf {H}\mathbf {w}\Vert _{_2}^2, \end{aligned}$$

where \(\Vert \mathbf {a}\Vert _{_2}\) represents the Euclidean norm of an arbitrary vector \(\mathbf {a}\); see also Wu and Moody (1996). It is known that if the matrix \(\mathbf {H}^\mathrm{T}\mathbf {H}\) is invertible, the least squares solution is

$$\begin{aligned} \widehat{\mathbf {w}}=\left( \mathbf {H}^\mathrm{T}\mathbf {H}\right) ^{-1}\mathbf {H}^\mathrm{T}\mathbf {y} . \end{aligned}$$

In practice, though, the matrix \(\mathbf {H}^\mathrm{T}\mathbf {H}\) is often ill-conditioned (its smallest eigenvalue is close to zero) and, therefore, the solution \(\widehat{\mathbf {w}}\) is not stable. To resolve this problem, Schmidt et al. (1992) suggested using the Moore–Penrose generalized inverse. A better and more numerically stable solution is obtained from a penalized least squares problem known as Tikhonov regularization or ridge regression. Therefore, we solve the following optimization problem (Hastie et al. 2009):

$$\begin{aligned} \min \limits _{\mathbf {w}}\left\{ \Vert \mathbf {y}-\mathbf {H}\mathbf {w}\Vert _{_2}^2+\lambda _2\Vert \mathbf {w}\Vert _{_2}^2 \right\} . \end{aligned}$$

Setting the gradient with respect to \(\mathbf {w}\) to zero yields the ridge regression estimate

$$\begin{aligned} \widehat{\mathbf {w}}=(\mathbf {H}^\mathrm{T}\mathbf {H}+\lambda _2\mathbf {I})^{-1}\mathbf {H}^\mathrm{T}\mathbf {y} , \end{aligned}$$
(3)

where \(\lambda _2\ge 0\) is the ridge or Tikhonov regularization parameter. In the present study, \(\lambda _{2}\) was obtained by means of the Bayesian information criterion (BIC) (Hastie et al. 2009).
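As an illustration of Eqs. (2) and (3), a minimal NumPy sketch of a single RRNN module is given below. The standard-normal draw of the random weights and the fixed \(\lambda _2\) are assumptions made for the example; in the present study \(\lambda _2\) is selected by BIC:

```python
import numpy as np

class RRNN:
    """Ridge randomized neural network: random hidden layer, ridge output layer."""

    def __init__(self, n_hidden=10, lam=1e-2, seed=0):
        self.N, self.lam = n_hidden, lam
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # g(alpha^T x + b) with the logistic sigmoid g(x) = e^x / (1 + e^x).
        return 1.0 / (1.0 + np.exp(-(X @ self.alpha + self.b)))

    def fit(self, X, y):
        d = X.shape[1]
        self.alpha = self.rng.normal(size=(d, self.N))  # random, never trained
        self.b = self.rng.normal(size=self.N)
        H = self._hidden(X)
        # Ridge estimate, Eq. (3): w = (H^T H + lam * I)^(-1) H^T y.
        self.w = np.linalg.solve(H.T @ H + self.lam * np.eye(self.N), H.T @ y)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.w
```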

At this point, the steps required for the implementation of an estimation module, i.e. the RRNN, are complete. It remains to discuss how those independent modules are concatenated to form the final M-RRNN.

As mentioned earlier, the number of independent modules at the heart of the M-RRNN is directly related to the number of clusters formed by the SFLSDE-clust distributor. Let us assume that the set of observed input vectors \(\mathcal {C}=\lbrace \mathbf {x}_1,\mathbf {x}_2,\dots ,\mathbf {x}_n\rbrace \) is clustered into K separate partitions \(\mathcal {C}_1,\dots ,\mathcal {C}_K\). For \(j=1,\dots ,K\) and \(\ell =1,\dots ,d\), let

$$\begin{aligned} a_{j,\ell }&=\min \lbrace x_{j,\ell } \mid \mathbf {x}_j=(x_{j,1},\dots ,x_{j,\ell },\dots ,x_{j,d})\in \mathcal {C}_j\rbrace ,\\ b_{j,\ell }&=\max \lbrace x_{j,\ell } \mid \mathbf {x}_j=(x_{j,1},\dots ,x_{j,\ell },\dots ,x_{j,d})\in \mathcal {C}_j\rbrace . \end{aligned}$$

Then, for each cluster \(j=1,\dots ,K\)

$$\begin{aligned} \mathcal {C}_j&\subset [a_{j,1},b_{j,1}]\times [a_{j,2},b_{j,2}]\times \dots \times [a_{j,d},b_{j,d}],\\ \mathcal {C}&=\bigcup _{j=1}^{K} \mathcal {C}_{j}. \end{aligned}$$

Then, all of the training samples in cluster j are used to train the jth module of the M-RRNN. By assigning an independent module to each of the above partitions, the following can be assured: (1) the whole of \(\mathcal {C}\) is covered by the resulting M-RRNN, and at the same time, (2) a precise estimator is developed for each of the separate clusters, as the information in each cluster shares relatively similar characteristics. Besides, by decomposing the complexity of the dataset into a number of separate clusters with independent characteristics, it can be ensured that simpler estimators with smaller hidden layers suffice for each cluster, and that the accuracy of estimation is increased.
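A minimal sketch of this modular assembly, reusing the `RRNN` class above, is given below. The nearest-centroid routing of unseen samples is an assumption (the paper defines the modules via the cluster membership of the training data), and for simplicity routing here uses the same feature vector as estimation, whereas the paper clusters on a three-feature subset:

```python
import numpy as np

class MRRNN:
    """Modular RRNN: one RRNN per cluster, samples routed by nearest centroid."""

    def __init__(self, centroids, n_hidden=10, lam=1e-2):
        self.centroids = np.asarray(centroids)   # (K, d) from the distributor
        self.modules = [RRNN(n_hidden, lam, seed=j)
                        for j in range(len(self.centroids))]

    def _route(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        return d.argmin(axis=1)

    def fit(self, X, y):
        labels = self._route(X)
        for j, mod in enumerate(self.modules):   # each module trained independently
            mask = labels == j
            if mask.any():                       # assumes no cluster is empty
                mod.fit(X[mask], y[mask])
        return self

    def predict(self, X):
        labels = self._route(X)
        y_hat = np.empty(len(X))
        for j, mod in enumerate(self.modules):
            mask = labels == j
            if mask.any():
                y_hat[mask] = mod.predict(X[mask])
        return y_hat
```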

4.3 Architecture of M-RRNN

By integrating the distributor with the modular estimator, the architecture of M-RRNN is formed. The architecture of M-RRNN is presented in Fig. 4.

Fig. 4 Architecture of the designed modular ridge randomized neural network (M-RRNN)

5 Experimental setup

In this section, the steps required for setting the parameters of the simulations are outlined. As discussed, the experiments are performed in two different stages: since the proposed modular scheme has two different sections, i.e. the distributor and the estimator, different sets of rival techniques are considered to test the efficacy of each.

5.1 Distributor performance

To check the performance of the distributor, which is based on a differential evolutionary algorithm, several well-known nature-inspired clustering methods are taken into account: the genetic algorithm (GA) (Sheikh et al. 2008), artificial bee colony (ABC) (Karaboga and Ozturk 2011), chaotic particle swarm optimization (CPSO) (Chuang et al. 2011), and the firefly algorithm (FA) (Senthilnath et al. 2011). It is worth mentioning that all of the considered clustering techniques are equipped with an acceleration phase, which can improve the performance of any given stochastic/metaheuristic clustering algorithm (Chuang et al. 2011). GA is tuned with a population of 40 chromosomes, a crossover probability (\(P_\mathrm{c}\)) of 0.8, a mutation probability (\(P_\mathrm{m}\)) of 0.02, and one elite chromosome (\(e=1\)). Besides, the arithmetic graphical search (AGS), tournament selection, and simulated binary crossover (SBX) operators are adopted from the literature to form the algorithmic structure of GA. For the implementation of ABC, 20 onlooker bees, 20 employed bees, and a limit of 10 are used; the limit of 10 implies that a bee that fails to update its position after 10 trials is sent to the scout bee search phase to update its solution vector. For CPSO, a swarm of 40 particles, an inertia weight of 0.9, and cognitive and social coefficients of 1.4 are used. For FA, 40 fireflies, a maximum attraction (\(\beta _\mathrm{max}\)) of 1, and an absorption rate (\(\varUpsilon \)) of 1 are selected. All of the above parameters are set based on the recommendations given in the cited papers as well as the authors' own assessments. Besides, all of the rival algorithms start the optimization from the same population seeding (distribution) to avoid biased results. All of the considered clustering techniques perform the optimization for 10,000 function evaluations, and each algorithm is executed for ten independent runs. Well-known statistical measures, i.e. standard deviation (std.), accuracy (mean), best (min), and worst (max), are also taken into account, defined as follows:

$$\begin{aligned}&\text {Mean fitness}=\frac{1}{10}\sum _{i=1}^{10}J_{i}\\&\text {Best fitness} =\min \lbrace J_{i} \mid i=1,\dots ,10\rbrace \\&\text {Worst fitness} =\max \lbrace J_{i} \mid i=1,\dots ,10\rbrace \\&\text {Robustness}=\sqrt{\frac{1}{9}\sum _{i=1}^{10}\left( J_{i}-\text {Mean fitness}\right) ^2} \end{aligned}$$
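In code, the four measures over the ten independent runs reduce to the following (the fitness values in the usage line are illustrative, not results from the paper):

```python
import numpy as np

def run_statistics(J):
    """Summary measures over the final fitness values J, one per run."""
    J = np.asarray(J, dtype=float)
    return {
        "mean fitness":  J.mean(),
        "best fitness":  J.min(),
        "worst fitness": J.max(),
        "robustness":    J.std(ddof=1),   # sample standard deviation
    }

# Illustrative values for ten runs.
print(run_statistics([4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0, 4.4, 3.9]))
```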

It is also worth pointing out that the clustering process is conducted using four different numbers of clusters (\(K = 5, 6, 7\) and 8) to find the optimal number of clusters.

It is expected that either the ice surface temperature or the ice thickness can be used together with the spatial information (latitude and longitude values) to optimally partition the database. Accordingly, two different sets of clustering experiments are carried out to evaluate which choice leads to the best results. The two sets of data used for the two clustering scenarios are

$$\begin{aligned} \text {Experiment 1 }:&\text { features } x_{1}, x_{2}, x_{7}\\ \text {Experiment 2 }:&\text { features } x_{1}, x_{2}, x_{10}, \end{aligned}$$

which correspond to latitude/longitude/ice temperature and latitude/longitude/ice thickness, respectively.

Table 3 Statistical characteristics of the collected low- and high-frequency databases

5.2 Performance analysis

To evaluate both the accuracy and robustness of the estimation part of the proposed M-RRNN, several well-known variants of neural networks are substituted for the RRNN modules of the modular architecture: a back-propagation neural network (BP-NN) with steepest descent optimization (Mozaffari and Fathi 2012), an optimally pruned RNN (OPRNN), a randomized neural network (RNN) (Schmidt et al. 1992), and a multi-layer feed-forward neural network with interior point optimization (MLFF-IPO) (Raja and Samar 2014). All of the identification modules have 10 neurons in their hidden layer. For the steepest descent optimization method, a learning rate of 0.1 is used, and the iterative optimizers perform the learning for 100 iterations. It should be pointed out that OPRNN comprises a standard RNN in which the most influential neurons are retained in the architecture through multiresponse sparse regression neuron pruning. Also, the pruning procedure takes advantage of a leave-one-out validation criterion to ensure the optimal selection of active neurons (Hastie et al. 2009).

The database used for the analysis has eight features, listed in Table 1. As mentioned, the considered databases contain ice thickness from the MODIS sensor, brightness temperatures from the AMSR-E sensor, and geophysical variables from forecast models. The database covers the period from 2 February to 20 February and has 14,639 and 17,162 temporal data pairs for the low (6.9 GHz) and high (36.5 GHz) frequency AMSR-E channels, respectively. In the rest of this paper, these two datasets are referred to as the low- and high-frequency databases. The collected databases have been normalized to the unit range to increase the efficiency of the computation. Table 3 lists the statistical characteristics of the gathered features for both low and high frequencies.
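The normalization to the unit range can be performed per feature with a standard min-max transform; the exact scheme is not specified here, so the sketch below is an assumption:

```python
import numpy as np

def minmax_normalize(X):
    """Scale each feature (column) of X to the range [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constant columns
    return (X - x_min) / span
```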

5.3 Verification of method complexity

It is necessary to verify that the complexity of the final modular network is not greater than that of non-modular systems. In this regard, after finding a modular network with K modules, single (non-modular) models of each of the above identifiers with \(K \times 10\) neurons are considered for the estimation of the sea-ice thickness. This is mainly because the complexity of the resulting networks, in terms of O notation, depends strictly on the number of hidden nodes (10 per module in this study).

Finally, it should be mentioned that all of the simulations are carried out in Matlab under the Microsoft Windows 7 operating system, on a PC with an Intel Core i7 CPU and 4 GB of RAM.

6 Results and discussion

In this section, the results from several experiments are reported to evaluate the performance of the proposed intelligent approach. In the first stage of the experiments, the potential of the proposed method is tested for efficient partitioning of the dataset. Thereafter, the proposed estimation tool together with the other rival estimators is used to develop a map between the input features and the sea-ice thickness. Some mathematical formulations are also presented to calculate the complexity of the resulting modular framework and also to evaluate the estimation error of the modular architecture based on the errors of each of the independent modules.

Figures 5 and 6 show the mean real-time evolution of the rival clustering methods when partitioning the low- and high-frequency information using different numbers of clusters. The two clustering scenarios differ in that one uses the ice surface temperature as the third feature, while the other uses the ice thickness.

Fig. 5 Mean real-time evolution of the considered clustering techniques for the low-frequency database

Fig. 6 Mean real-time evolution of the considered clustering techniques for the high-frequency database

Table 4 Comparison of the fitness values for the rival clustering techniques for low-frequency database
Table 5 Comparison of the fitness values for the rival clustering techniques for high-frequency database

Figure 5 indicates that, for both of the clustering experiments, increasing the number of clusters decreases the convergence speed of the metaheuristic algorithms. This is logical, as increasing the number of clusters increases the dimensionality of the clustering problem and makes the optimization landscape more intricate. Looking more closely at the results, it can be seen that, for both experiments of the low-frequency scenario, SFLSDE-clust and CPSO have the fastest convergence speed. The simulation results also indicate that, for most of the clustering scenarios, ABC cannot compete with the other rival techniques in terms of the accuracy of the final solution or the convergence speed. Inspecting the real-time evolution results of the high-frequency scenario, presented in Fig. 6, similar behavior is observed. The most notable observation is the obvious superiority of SFLSDE-clust for \(K = 5\) and 7 (first experiment) and \(K = 7\) (second experiment). Taking into account that the high-frequency database has more data pairs than the low-frequency database, it can be inferred that as the complexity of the information to be partitioned increases, the superiority of SFLSDE, at least in terms of convergence speed, becomes much more obvious.

Tables 4 and 5 list the details of the statistical results for both experiments on the low-frequency and high-frequency databases. The presented results indicate that considering \(x_{7}\) (ice temperature) as the third clustering feature slightly improves the performance of the clustering methods for partitioning both databases. However, as mentioned, the best clustering should satisfy some other conditions. The std. values indicate that the computational robustness of the rival methods is relatively similar; however, in most of the clustering scenarios, the robustness of ABC is lower than that of the other methods, as its std. value is larger. Checking the results of the K-means clustering approach, it can be easily inferred that, for all of the clustering scenarios, the heuristic clustering methods show superior results. Also, the results of the conducted simulations indicate that the performance of GSA and CPSO is relatively the same for most of the clustering scenarios; however, neither can beat the SFLSDE-clust algorithm. All in all, the convergence and performance evaluation results indicate that SFLSDE-clust is a very powerful information partitioning scheme and can be used as the distributor of the M-RRNN.

In addition to the above experiments, the authors would like to test the computational complexity of the considered rival clustering techniques. This will uncover whether the better performance of SFLSDE-clust is obtained at the cost of a higher computational burden. Tables 6 and 7 list the computational times of the rival clustering methods for both experiments on the low-frequency and high-frequency datasets, respectively. The obtained results demonstrate that the algorithmic complexity of SFLSDE is close to that of CPSO and GSA. The results also indicate that the computational time of ABC is higher than that of the other rival techniques, which implies a more complex algorithmic structure. These results show that SFLSDE-clust not only has a very good performance, but also carries out the clustering task at an acceptable computational cost.

Table 6 Computational time for the rival clustering approaches for low-frequency database (s)
Table 7 Computational time for the rival clustering approaches for high-frequency database (s)

It is also important to determine the effect of using the \(x_{7}\) and \(x_{10}\) features on the partitioning of the considered region. To this end, the clusters formed for the low-frequency database using each of the mentioned features are shown in Fig. 7. The presented results clearly indicate that the final partitions are quite different depending on whether \(x_{7}\) or \(x_{10}\) is used as the third feature. This graphical illustration demonstrates the importance of carrying out the two experiments to extract the best clustering scenario.

Fig. 7 Comparison of clustering results for the two experiments for the low-frequency database

As mentioned earlier, some additional performance evaluation metrics are required to determine which clustering scenario better partitions the data into a set of clusters. Table 8 lists the results of clustering for the two experiments on the low-frequency dataset. Checking the reported values, it can be seen that \(K = 7\) for the first experiment leads to the best results in terms of the considered performance evaluation indices. Table 9 lists the results of information clustering for the high-frequency dataset. It can be seen that, for this case, the best performance is achieved for \(K = 7\), with \(x_{10}\) as the third clustering feature. Interestingly, for both datasets, the best clustering results are achieved when the number of clusters equals 7. The details of the centroids formed for the considered clusters of both experiments are given in Tables 10 and 11 for the low-frequency and high-frequency datasets, respectively. The details of the centroid positions for \(K = 7\) are shown in bold font.

Table 8 Comparison of the performance measures for each partitioning scheme for low-frequency data
Table 9 Comparison of the performance measures for each partitioning scheme for high-frequency data
Table 10 Details of the cluster centers obtained by SFLSDE for low-frequency data (un-normalized)
Table 11 Details of the cluster centers obtained by SFLSDE for high-frequency data (un-normalized)

Tables 12 and 13 list the number of data handled by each of the clusters for both experiments on the low-frequency and high-frequency databases, respectively. The results disclose the sharing of information among clusters and also indicate the number of data pairs used by each module.

Table 12 Number of data handled by each module for the two experiments (low-frequency database)
Table 13 Number of data handled by each module for the two experiments (high-frequency database)

Given the distribution of the data pairs among the modules, it is possible to calculate the distribution of the computational complexity of the resulting modular structure as follows:

$$\begin{aligned} \text {Complexity} = \frac{n_j}{n}N_{j} \quad \text {for } \ j=1, \dots , K , \end{aligned}$$
(4)

where \(N_{j}\) indicates the number of hidden nodes of the jth module, n is the total number of data pairs, and \(n_{j}\) represents the number of data handled by the jth module. As the number of hidden nodes at each module is fixed and equal to 10, the computational complexity of the modular frames corresponding to the selected distribution of data is proportional to the following fractions:

$$\begin{aligned} 1\text {st scenario}:\,&\frac{2940}{16670}, \frac{3290}{16670},\frac{2030}{16670},\frac{4390}{16670},\frac{540}{16670},\\&\quad \frac{1950}{16670},\frac{1530}{16670}\\ 2\text {nd scenario}:\,&\frac{1830}{16670}, \frac{990}{16670},\frac{2470}{16670},\frac{4170}{16670},\frac{3630}{16670},\\&\quad \frac{1070}{16670},\frac{54}{16670} . \end{aligned}$$

The above values indicate that most of the designed modules have a relatively equal computational complexity. The 1st, 2nd, and 4th modules of the first modular architecture, and the 4th and 5th modules of the second architecture, carry somewhat more complexity than the other modules. All in all, the results indicate that a good balance is maintained within the modular architectures, which indicates that the proposed distributor does an acceptable job. Tables 14 and 15 list the MSE error of each of the modules for both the first and second experiments on the low-frequency and high-frequency databases. To calculate the total estimation error of each modular frame from the MSE errors of its modules, the following formulation is used:

$$\begin{aligned} \text {MSE}_\text {total}=\frac{1}{n}\sum _{j=1}^{K}n_j\text {MSE}_{j}. \end{aligned}$$
(5)

The total MSE of each of the modular frames is also included in Tables 14 and 15. Based on the reported values, it can be easily seen that the estimation error of the modular frames with 7 modules is less than that of the other counterparts. This is consistent with the clustering results, which suggest the use of 7 modules to form the modular framework.
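Both Eq. (4) and Eq. (5) are data-weighted sums over the modules and can be computed directly; the usage line below reuses the module sizes of the first scenario listed above:

```python
import numpy as np

def module_complexity_shares(n_per_module, n_hidden=10):
    """Eq. (4): per-module computational share, (n_j / n) * N_j."""
    n = np.asarray(n_per_module, dtype=float)
    return n / n.sum() * n_hidden

def total_mse(n_per_module, mse_per_module):
    """Eq. (5): data-weighted total MSE of a modular frame."""
    n = np.asarray(n_per_module, dtype=float)
    return (n * np.asarray(mse_per_module, dtype=float)).sum() / n.sum()

# Module sizes of the first scenario; the shares sum to N_j = 10.
print(module_complexity_shares([2940, 3290, 2030, 4390, 540, 1950, 1530]))
```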

Table 14 MSE error of each of the modules for low-frequency data
Table 15 MSE error of each of the modules for high-frequency data

Figures 8 and 9 depict the correlations between the estimates of the M-RRNN with 7 modules, for the experiments on the low-frequency and high-frequency data, and the values observed by the MODIS sensor. It can be easily seen that, for both the training and testing phases, the ice thickness estimates and the observed values correlate very well. In particular, the correlation is higher when the low-frequency database is used, in agreement with earlier studies (Scott et al. 2014; Kaleschke et al. 2012) that indicate the potential of low-frequency microwave data for ice thickness estimation.

Fig. 8 Correlation of the estimated ice thickness using the M-RRNN vs. ice thickness from MODIS for the training and testing phases (low-frequency database)

Fig. 9 Correlation of the estimated ice thickness using the M-RRNN vs. ice thickness from MODIS for the training and testing phases (high-frequency database)

After selecting the most promising architecture for the M-RRNN, it is necessary to continue the experiments to determine whether the use of the RRNN at the heart of the modular frames affords the best results, and also to assess the computational advantage of the modular frames compared to standard estimators with the same number of hidden nodes (the same structural complexity). Tables 16 and 17 compare the accuracy and robustness of modular frames with different types of estimators at their heart. The estimation results indicate that the performance of the modular frames with RRNN and OPRNN is better than that of the other rival techniques, with the RRNN doing a slightly better job. Moreover, the training procedure of OPRNN is slightly more complex than that of RRNN, as it includes both Lasso and ridge regression techniques. It can therefore be inferred that the use of the RRNN at the heart of the M-RRNN affords the best results. Moreover, the robustness results indicate that all of the considered frames have an acceptable robustness, which implies that using a modular architecture can improve the robustness of the estimation.

The final experiment evaluates the efficacy of the proposed modular architecture in comparison with a number of rival architectures of the same structural complexity. Tables 18 and 19 list the estimation results. It can be observed that the accuracy of the modular frame is higher than that of the other standard rival estimation approaches, and that the modular estimator performs significantly better. Moreover, by comparing the estimation errors of the standard estimators with those of their modular counterparts presented in the previous experiment, it can be seen that adopting a modular frame for each of the considered estimators drastically improves its performance. However, the robustness of the standard estimators is relatively the same as that of their modular counterparts, and the robustness improvement of the modular estimators is not significant. All in all, the simulation results indicate that the modular frames can significantly improve the accuracy of the estimation, whereas the robustness of the standard and modular structures is relatively similar.

7 Conclusions and future work

In this investigation, the authors proposed a modular ridge randomized neural network (M-RRNN) with a nature-inspired distributor to estimate sea-ice thickness along the Labrador coast. The proposed modular estimator used unsupervised learning to partition the spatio-temporal information consisting of latitude and longitude values and either sea-ice thickness or sea-ice temperature. Thereafter, each of the K partitions was fed to one of K independent RRNNs to create a nonlinear map between the input features, consisting of the AMSR-E brightness temperatures and data from the forecasting system, and the sea-ice thickness from MODIS. Based on a comprehensive comparative analysis, the following conclusions were derived:

Table 16 MSE of the modular frameworks with different estimation modules for low-frequency database
Table 17 MSE of the modular frameworks with different estimation modules for high-frequency database
Table 18 Comparing the accuracy and robustness of the proposed modular estimator with standard counterparts with the same structural complexity for low-frequency database
Table 19 Comparing the accuracy and robustness of the proposed modular estimator with standard counterparts with the same structural complexity for the high-frequency database
(a) It was observed that the modified scale-factor local search differential evolutionary algorithm (SFLSDE) can serve as an efficient distributor to partition the captured data into a set of subgroups. Furthermore, the comparative studies clearly demonstrated that SFLSDE can surpass CPSO, FA, GA, and ABC on the same problem.

(b) The results of the analysis revealed that the use of the RRNN as the module at the heart of the proposed modular architecture outperforms modular identifiers using BP-NN, OPRNN, RNN, and MLFF-IPO modules for estimation. Furthermore, it was shown that the RRNN has an analytical training strategy that always avoids singularity thanks to the ridge penalty.

(c) By comparing the resulting M-RRNN with standard identifiers of the same computational complexity, it was found that the modular estimator has a significantly higher accuracy, with comparable robustness.

(d) The successful performance of the M-RRNN inspires further studies of intelligent approaches, complementary to the current physical and semi-empirical estimation tools. The findings of the current research contribute to the application of advanced intelligent computing methods to geophysical problems.

In future work, the authors would like to expand this study by adding a degree of uncertainty to the model using fuzzy number theory, to investigate its impact on suppressing the undesired effects of random white noise associated with the sensory data.