1 Introduction

With the development of the Internet, XML (Extensible Markup Language) has become a de-facto standard for representing the rapidly growing volume of Web data in numerous applications. Because heterogeneous and distributed data sources are utilized in many applications, XML data integration becomes increasingly imperative (Thomo and Venkatesh 2008; Guha et al. 2006). XML document classification plays a critical role in XML data integration. It is therefore natural that there is a growing demand for research on XML document classification (Brzezinski and Piernik 2015; Zhao et al. 2011; Thasleena and Varghese 2015).

XML document classification is similar to plain-text classification to a certain extent. The difference between them is that plain-text classification only focuses on the semantic level, while XML document classification needs to consider both the structure and the semantics of the XML document. Various approaches have been designed to perform XML document classification (Zhao et al. 2011; Dalamagas et al. 2006; Tekli and Chbeir 2012; Maguitman et al. 2005; Tekli et al. 2015). The majority of these approaches exploit the direct comparison of the structure and semantics of a given pair of XML documents, such as tree edit distance (Dalamagas et al. 2006; Tekli and Chbeir 2012; Maguitman et al. 2005; Tekli et al. 2015). Some of the approaches also exploit machine learning strategies (Zhao et al. 2011; Zhang et al. 2012) to solve this problem. Zhao et al. reported that XML documents have to be transformed into a specific representation model, which is then taken as the input of an Extreme Learning Machine (ELM) (Zhao et al. 2011). Ribeiro and Härder introduced an approach of entity identification in XML documents based on approximate joins (Ribeiro and Härder 2006). Zhang et al. presented an approach for computing the similarity between XML documents, which makes reference to the adjacency matrix of graph theory (Zhang et al. 2012). The authors of (Agreste et al. 2014) introduced XML matching approaches and a template, called XML Matcher Template, which describes the main components of an XML Matcher.

The XML documents classified in previous work are usually deterministic. In practice, however, fuzzy information often emerges in applications. Some data are inherently vague rather than definite since their values are subjective in real-world applications. Fuzzy set theory, proposed by Zadeh, has been widely used in numerous applications (Negoita et al. 1978). Fuzzy data are modeled in XML documents in (Nierrman and Jagadish 2002; Gaurav and Alhajj 2006; Oliboni and Pozzani 2008), and approaches for representing and processing fuzzy information based on XML data are proposed in (Turowski and Weng 2002; Abiteboul et al. 2006).

With the increasing amount of fuzzy XML data on the Web, it is necessary to classify fuzzy XML documents before integrating multiple similar fuzzy XML documents into a single one. Unfortunately, to the best of our knowledge, there are few reports discussing fuzzy XML document classification. Although uncertain XML document classification has been investigated (Zhao et al. 2016), the proposed approaches need a large amount of time to enumerate all possible instances. Based on the above analysis, aiming at developing an effective approach to classify fuzzy XML documents, we devise a new fuzzy XML document tree model to represent the semantics and structure of fuzzy XML documents, and we then propose an integrated machine learning approach (KPCA-KELM) to classify fuzzy XML documents effectively.

In this paper, we concentrate on the classification of heterogeneous fuzzy XML documents which are collected from different data sources. The main contributions of the proposed work are as follows.

  • We take a first step by constructing a new fuzzy XML document tree model (FXDTM), which makes it easier to describe fuzzy data and capture the feature information in fuzzy XML documents. Based on the Structured Link Vector Model (SLVM) (Yang and Chen 2002), a Modified Structured Vector Space Model (MS-VSM) is employed in (Zhao et al. 2017) to represent fuzzy XML documents; it not only represents the structural and semantic information of fuzzy XML documents but also makes the feature vectors carry fuzzy information. Considering that the accuracy and validity of feature extraction can be improved by using the FXDTM model to represent the key features of a fuzzy XML document, in this paper the FXDTM model is used as an intermediary instead of converting fuzzy XML documents directly into the MS-VSM model as in (Zhao et al. 2017). The raw data coming from this MS-VSM model are taken as the input of the classifiers.

  • We propose an integrated machine learning approach (KPCA-KELM) to classify fuzzy XML documents. First, to reduce the computational complexity and exploit hidden information, we perform feature extraction on the input raw data using the Kernel Principal Component Analysis (KPCA) (Schölkopf et al. 1998; Iosifidis et al. 2015) approach. Different feature extraction methods can lead to different classification results. The ELM is employed as the feature extractor in (Zhao et al. 2017). Considering that KPCA is well suited to extracting interesting nonlinear features from high-dimensional data, in this paper we use KPCA to capture the nonlinear features of fuzzy XML documents and extract the decisive ones. Second, because the Extreme Learning Machine (ELM) (Huang et al. 2006; Huang and Chen 2007; Huang et al. 2016; Tang et al. 2016) offers good classification accuracy and fast learning speed, we propose an effective algorithm based on the Kernel Extreme Learning Machine (KELM) to classify fuzzy XML documents. In this way, we can classify the fuzzy XML documents collected from diverse data sources.

The remainder of this paper is organized as follows. Section 2 introduces the related works, including the fundamental concepts and theories of KPCA and KELM, after a presentation of fuzzy XML documents. Section 3 describes the proposed Fuzzy XML Document Tree Model (FXDTM) and the Modified Structured Vector Space Model (MS-VSM) of fuzzy XML documents. The KPCA-KELM framework based on KPCA and KELM is proposed in Section 4. Experimental evaluations are given in Section 5. Finally, the conclusion is drawn in Section 6.

2 Related Works

To facilitate the understanding of the proposed approach, this section briefly reviews the fundamental concepts and theories of fuzzy XML documents, KPCA, and KELM. Though ELM unifies regression and classification tasks, we focus only on classification in the following parts.

2.1 Fuzzy XML Document

A fuzzy XML document is a set composed of elements and attributes, linked together via the containment relation. To represent fuzzy information in fuzzy XML documents, a representation model based on “membership degree and possibility distributions” is developed in (Ma and Yan 2007; Yan et al. 2009). In this model, an element may be associated with a membership degree, which indicates the possibility of it being a child element of its parent. The attribute values of elements may be presented as possibility distributions in this representation model. Note that some attributes can take multiple (conjunctive) values, whereas others can only take a unique (disjunctive) value. That is, we have two kinds of fuzziness in a fuzzy XML document: one is the fuzziness in elements, associated with membership degrees; the other is the fuzziness in the attribute values of elements, represented with possibility distributions. The Type attribute of the element Dist is used to indicate the type of possibility distribution, taking the value disjunctive or conjunctive. In addition, each Dist element has a Val element as its child, and the Poss attribute of the element Val indicates the membership degree of a given element.

Here we do not present the detailed definitions of the fuzzy XML document representation model; we only present an example fragment of a fuzzy XML document in Fig. 1. One can refer to (Ma and Yan 2007; Yan et al. 2009; Li and Ma 2017) for more details.

Fig. 1 A fragment of fuzzy XML document

In the fuzzy XML document of Fig. 1, Poss is adopted together with a fuzzy construct denoted by Val to specify the possibility of a given element in the fuzzy XML document. In line 2 of Fig. 1, <Val Poss = 0.9> states that the membership degree of the given element department being Information Science and Technology is equal to 0.9. Another fuzzy construct, called Dist, can be used to express the possibility distribution of attribute values. Since there are two types of possibility distribution, Type is adopted to indicate the type of possibility distribution, being disjunctive or conjunctive. Lines 18-22 are the disjunctive Dist construct for the age of the student with SID 20130425. Lines 25-29 are the conjunctive Dist construct for the email of the student with SID 20130425.
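Since Fig. 1 itself is not reproduced in the text, the following hypothetical fragment, written in the spirit of Fig. 1, illustrates the two fuzzy constructs; the element names follow Fig. 2, while the concrete ages, e-mail addresses, and most possibility values are invented for illustration and may differ from the actual figure.

<department>
  <Val Poss="0.9">
    <Dname>Information Science and Technology</Dname>
    <student>
      <Val Poss="1.0">
        <SID>20130425</SID>
        <age>
          <Dist Type="disjunctive">   <!-- the age takes one of these values -->
            <Val Poss="0.7">23</Val>
            <Val Poss="0.5">24</Val>
          </Dist>
        </age>
        <email>
          <Dist Type="conjunctive">   <!-- both e-mail addresses may hold -->
            <Val Poss="0.8">s20130425@univ.edu</Val>
            <Val Poss="0.6">s20130425@mail.com</Val>
          </Dist>
        </email>
      </Val>
    </student>
  </Val>
</department>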

2.2 Kernel Principal Component Analysis (KPCA)

Kernel Principal Component Analysis (KPCA) (Schölkopf et al. 1998; Iosifidis et al. 2015), which is derived from Principal Component Analysis (PCA), has been widely used for nonlinear feature extraction. For readability, we introduce the essential knowledge of KPCA here; more details about KPCA can be found in (Schölkopf et al. 1998; Iosifidis et al. 2015).

As a nonlinear method, KPCA is essentially PCA performed in the feature space associated with a kernel function. KPCA maps all samples from the input data space into a nonlinear feature space, where the PCA features are then extracted. For the data space R^d, consider a feature space F induced by a (possibly nonlinear) map Φ: R^d → F, x → Φ(x). The map Φ is defined implicitly by the kernel function.

Let us consider a set of N training samples xi ∈ R^d, i = 1, …, N, in a d-dimensional space. If the training samples have been mapped into a feature space F by a nonlinear function Φ, we may perform PCA in F. We assume that the mapped samples Φ(x1), Φ(x2), …, Φ(xN) are centered, that is, \( {\sum}_{i=1}^N\varPhi\ \left({x}_i\right)=0 \). The correlation matrix of the mapped samples in feature space F can be computed by

$$ C=\frac{1}{N}\sum \limits_{\mathrm{i}=1}^N\Phi \left({\mathrm{x}}_{\mathrm{i}}\right)\Phi {\left({\mathrm{x}}_{\mathrm{i}}\right)}^T $$
(1)

The extracted features in feature space must be from the set of the eigenvectors of C (Schölkopf et al. 1998). With these extracted features, we can reconstruct the samples with the minimum mean-square error. KPCA seeks to find eigenvalues λ and the associated eigenvectors V satisfying

$$ CV=\lambda V $$
(2)

The eigenvalue equation can be written as

$$ \left(\Phi {\left({x}_{\mu}\right)}^T\cdot CV\right)=\lambda \left(\Phi {\left({x}_{\mu}\right)}^T\cdot V\right)\kern0.5em for\ all\kern0.5em \mu =1,\dots, N $$
(3)

And there exist coefficients αj, j = 1, …, N, such that

$$ V=\sum \limits_{j=1}^N{\alpha}_j\Phi \left({x}_j\right) $$
(4)

Combining Eqs. (1), (3) and (4), we have

$$ \frac{1}{N}\left(\sum \limits_{\mathrm{i}=1}^N\sum \limits_{j=1}^N\left(\Phi {\left({x}_{\mu}\right)}^T\cdot \Phi \left({x}_{\mathrm{i}}\right)\Phi {\left({x}_{\mathrm{i}}\right)}^T\cdot {\alpha}_j\Phi \left({x}_j\right)\right)\right)=\lambda \sum \limits_{j=1}^N\left(\Phi {\left({x}_{\mu}\right)}^T\cdot {\alpha}_j\Phi \left({x}_j\right)\right) $$
(5)

Now we define an N × N kernel matrix K by Kij = (Φ(xi)T · Φ(xj)). Furthermore, the following equation can be derived: K2α = λNKα, where α = [α1, …, αN]T. In fact, the coefficient vector α is an eigenvector corresponding to a non-negative eigenvalue of the kernel matrix K. Finding solutions of this equation is equivalent to solving the eigenvalue problem Kα = Nλα for nonzero eigenvalues. This yields eigenvectors α1, …, αN corresponding to eigenvalues λ1 ≥ … ≥ λN.

The dimensionality of the input data is reduced by retaining only the first p most representative eigenvectors associated with the first p largest eigenvalues. That is, the first p nonlinear principal components are used to describe the input data. Note that in KPCA, the number of principal components (PC) p may exceed the input dimensionality d. The maximum number of PCs in KPCA is N.

The corresponding eigenvectors α1, …, αp of the eigenvalues λ1, …, λp can be normalized by requiring the corresponding vectors V1, …, Vp in F to be normalized, that is

$$ {\displaystyle \begin{array}{c}{\left({\mathbf{V}}^k\right)}^T\cdotp {\mathbf{V}}^k=\sum \limits_{j,l=1}^n{\alpha}_j^k{\alpha}_l^k\left({\Phi}^T\left({\mathrm{x}}_j\right)\cdotp \Phi \left({\mathrm{x}}_l\right)\right)\\ {}=\sum \limits_{j,l=1}^n{\alpha}_j^k{\alpha}_l^k\cdotp k\left({\mathrm{x}}_j,{\mathrm{x}}_l\right)\\ {}={\left({\boldsymbol{\upalpha}}^k\right)}^T\mathbf{K}{\boldsymbol{\upalpha}}^k\\ {}={\lambda}^k{\left({\boldsymbol{\upalpha}}^k\right)}^T{\boldsymbol{\upalpha}}^k\\ {}={\lambda}^k{\left\Vert {\boldsymbol{\upalpha}}^k\right\Vert}^2\\ {}=1\end{array}} $$
(6)

when \( {\left\Vert {\boldsymbol{\upalpha}}^k\right\Vert}^2=\frac{1}{\lambda^{\mathrm{k}}} \), for all k = 1, 2, …, p, V k is normalized.

Principal components of a test sample y are obtained by projecting Φ(y) onto the eigenvectors Vk (k = 1, …, p) in F:

$$ {\Phi}^T(y)\left[{\mathbf{V}}_1,\dots, {\mathbf{V}}_p\right]=\left[k\left(y,{\mathrm{x}}_1\right),\dots, k\left(y,{\mathrm{x}}_n\right)\right]\left[{\boldsymbol{\upalpha}}^1,\dots, {\boldsymbol{\upalpha}}^p\right]\in {R}^{1\times p} $$
(7)

Here \( {\boldsymbol{\upalpha}}^k={\left[{\alpha}_1^k,{\alpha}_2^k,\dots, {\alpha}_N^k\right]}^T \).

The projection of a test sample set {y1, y2, …, yM} onto Vk is given by

$$ {\left[\Phi \left({y}_1\right),\dots, \Phi \left({y}_M\right)\right]}^T\left[{\mathrm{V}}_1,\dots, {\mathrm{V}}_p\right]={K}_{test}\left[{\boldsymbol{\upalpha}}^1,\dots, {\boldsymbol{\upalpha}}^p\right]\in {R}^{M\times p} $$
(8)

Here Ktest = [ΦT(yk)Φ(xj)]M × N = [k(yk, xj)]M × N

The algorithm of KPCA is presented as Algorithm 1.

figure a
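To make Algorithm 1 concrete, the following Python sketch (a minimal illustration using NumPy, with a Gaussian kernel and parameter names of our own choosing) performs KPCA as derived above: it builds and centers the kernel matrix, solves the eigenvalue problem Kα = Nλα, normalizes the α^k as in Eq. (6), and projects samples as in Eqs. (7)-(8).

import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def kpca_fit(X, p, gamma=1.0):
    # Kernel matrix of the N training samples, centered in feature space
    N = X.shape[0]
    K = gaussian_kernel(X, X, gamma)
    one_n = np.ones((N, N)) / N
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigen-decomposition of the centered kernel matrix (K alpha = N*lambda*alpha)
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:p]            # keep the p largest eigenvalues
    # Scale alpha^k so that the corresponding V^k has unit norm (Eq. (6))
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return {"X": X, "gamma": gamma, "K": K, "one_n": one_n, "alphas": alphas}

def kpca_project(model, Y):
    # Principal components of (test) samples (Eq. (8)), with test-kernel centering
    K_test = gaussian_kernel(Y, model["X"], model["gamma"])
    one_m = np.ones((Y.shape[0], model["X"].shape[0])) / model["X"].shape[0]
    K_test_c = (K_test - one_m @ model["K"] - K_test @ model["one_n"]
                + one_m @ model["K"] @ model["one_n"])
    return K_test_c @ model["alphas"]              # M x p matrix of projections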

2.3 Kernel Extreme Learning Machine (KELM)

The Extreme Learning Machine (ELM) model was originally proposed in (Huang et al. 2006; Huang and Chen 2007) and further developed in (Huang et al. 2014; Huang 2014); it implements a single-hidden-layer feedforward neural network (SLFN). The biggest advantage of ELM is that it provides extremely fast learning speed and good generalization performance compared with traditional neural networks. The essence of ELM is that the network's hidden layer (mapping) can be randomly assigned and the output weights can be calculated by matrix operations without iterative tuning.

Consider N arbitrary samples (xi, ti) ∈ R^n × R^m. Then ELM is modeled as

$$ f\left({x}_i\right)=\sum \limits_{j=1}^L{\beta}_j{g}_j\left({x}_i\right)=h\left({x}_i\right)\beta, \kern0.5em i=1,\dots, N $$
(9)

Here L is the number of hidden-layer nodes, g(·) is the activation function, and βj is the weight vector between the jth hidden node and the output nodes. \( \beta ={\left[{\beta}_1^T,\dots, {\beta}_L^T\right]}_{L\times m}^T \) is the matrix of output weights between the hidden layer of L nodes and the m output nodes, and h(x) = [g1(x), …, gL(x)] is the output (row) vector of the hidden layer with respect to the input x. h(x) actually maps the data from the n-dimensional input space to the L-dimensional hidden-layer feature space (ELM feature space).

According to the ELM theory (Huang et al. 2012) and the Karush-Kuhn-Tucker (KKT) theorem (Fletcher 1981), ELM aims to minimize both the training errors and the norm of the output weights. This objective function can be expressed as follows:

$$ \underset{\beta, \xi }{\min}\frac{1}{2}{\left\Vert \beta \right\Vert}^2+\frac{C}{2}\sum \limits_{i=1}^N{\left\Vert {\xi}_i\right\Vert}^2\kern0.5em s.t.\kern0.5em h\left({x}_i\right)\beta ={t}_i^T-{\xi}_i^T $$
(10)

Here ξ is the training error matrix on the training data, ξi = [ξi1, ξi2, …, ξim]T is the ith column of ξ with respect to the training sample xi, N and m are the numbers of training samples and classes, respectively, and C is a regularization parameter that trades off the norm of the output weights against the training errors.

The optimization problem in Eq. (10) can be efficiently solved. According to (Huang et al. 2012), the optimal β which minimizes Eq. (10) can be analytically obtained as

$$ \upbeta ={\mathrm{H}}^{\mathrm{T}}{\left(\frac{\mathrm{I}}{\mathrm{C}}+\mathrm{H}{\mathrm{H}}^{\mathrm{T}}\right)}^{-1}\mathrm{T} $$
(11)

Here I is an identity matrix.

$$ {\displaystyle \begin{array}{l}T={\left[{t}_1^T\kern0.5em \dots \kern0.5em {t}_N^T\right]}_{N\times m}^T\\ {}H=\left[\begin{array}{c}h\left({x}_1\right)\\ {}\vdots \\ {}h\left({x}_N\right)\end{array}\right]={\left[\begin{array}{ccc}{g}_1\left({x}_1\right)& \dots & {g}_L\left({x}_1\right)\\ {}\vdots & \dots & \vdots \\ {}{g}_1\left({x}_N\right)& \dots & {g}_L\left({x}_N\right)\end{array}\right]}_{N\times L}\end{array}} $$

In ELM, h(xi) denotes the output of the hidden layer with regard to the input sample xi. H is called the feature mapping matrix because it represents the feature space mapping of the given N training samples. The feature mapping h(xi) maps the data xi from the input space to the hidden-layer feature space, and the feature mapping matrix H is independent of the training target values ti.

After obtaining the optimal output weights β, the decision score corresponding to a new data point xt can be predicted quickly and accurately by

$$ f\left({x}_t\right)=h\left({x}_t\right)\beta $$
(12)
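As a rough illustration of Eqs. (9)-(12), the sketch below trains an original ELM with a sigmoid hidden layer and scores new samples; the random-weight ranges and parameter names are our assumptions, and T is a one-hot target matrix for classification.

import numpy as np

def elm_train(X, T, L=1000, C=1.0, seed=0):
    # Random hidden-layer parameters (a, b), sigmoid activation g(a, b, x)
    rng = np.random.default_rng(seed)
    a = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))
    b = rng.uniform(-1.0, 1.0, size=L)
    H = 1.0 / (1.0 + np.exp(-(X @ a + b)))          # N x L hidden-layer output matrix
    N = X.shape[0]
    beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)   # Eq. (11)
    return a, b, beta

def elm_predict(X, a, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ a + b)))
    return H @ beta                                  # Eq. (12): decision scores per class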

However, in this specific case, the feature mapping h(x) may not be known to users; instead, its corresponding kernel K(u, v) (e.g., K(u, v) = exp(−γ‖u − v‖2)) is given to users. The dimensionality L of the feature space (the number of hidden nodes) need not be given either. Such an ELM is called Kernel ELM (KELM). If the feature mapping h(x) is unknown to the users, one can apply Mercer's conditions to ELM. A kernel matrix for ELM is defined as follows (Huang et al. 2012):

$$ {\Omega}_{ELM}=H{H}^{T}:{\Omega}_{ELM_{i,j}}=h\left({x}_i\right)h\left({x}_j\right)=K\left({x}_i,{x}_j\right) $$
(13)

The kernel matrix ΩELM is related neither to the number of output nodes m nor to the training target values ti. The kernel matrix ΩELM = HHT is only related to the input data xi and the number of training samples N, and its size is N × N.

Then, the output function of KELM classifier can be written compactly as

$$ \mathrm{f}\left(\mathrm{x}\right)=\mathrm{h}\left(\mathrm{x}\right){\mathrm{H}}^{\mathrm{T}}{\left(\frac{\mathrm{I}}{\mathrm{C}}+\mathrm{H}{\mathrm{H}}^{\mathrm{T}}\right)}^{-1}\mathrm{T}=\left[\begin{array}{c}K\left(\mathrm{x},{x}_1\right)\\ {}\vdots \\ {}K\left(\mathrm{x},{x}_N\right)\end{array}\right]{\left(\frac{\mathrm{I}}{\mathrm{C}}+{\Omega}_{ELM}\right)}^{-1}\mathrm{T} $$
(14)

The algorithm of KELM is presented as Algorithm 2.

figure b
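A minimal sketch of Algorithm 2, assuming the Gaussian kernel and a one-hot target matrix T: training solves the linear system of Eq. (14) once, and prediction only needs kernel evaluations between the test sample and the training samples.

import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def kelm_train(X, T, C=32.0, gamma=1.0):
    # Omega_ELM with entries K(x_i, x_j) (Eq. (13))
    omega = gaussian_kernel(X, X, gamma)
    N = X.shape[0]
    A = np.linalg.solve(np.eye(N) / C + omega, T)    # (I/C + Omega)^{-1} T
    return X, A, gamma

def kelm_predict(Y, X_train, A, gamma):
    # f(y) = [K(y, x_1), ..., K(y, x_N)] (I/C + Omega)^{-1} T   (Eq. (14))
    return gaussian_kernel(Y, X_train, gamma) @ A
    # the predicted class is the output node with the largest score,
    # e.g. np.argmax(kelm_predict(...), axis=1)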

3 Representation Model of Fuzzy XML Document

3.1 Fuzzy XML Document Tree Model

Obviously, finding all element/attribute features of the fuzzy XML documents is a core task. To capture the features of fuzzy XML documents, we first need to establish a suitable model to represent them. XML documents are presented as ordered labeled trees in most existing approaches (Dalamagas et al. 2006; Tekli and Chbeir 2012; Maguitman et al. 2005; Tekli et al. 2015). In such an ordered labeled tree, the nodes represent elements/attributes and are labeled with the corresponding element/attribute label names, and the nodes are ordered following their order of appearance in the document.

A fuzzy XML document, which represents hierarchically structured information, can also be presented as a rooted ordered labeled tree. However, a fuzzy XML document clearly differs from a crisp XML document because it contains several special fuzzy constructs, namely the attributes Poss and Type, in addition to the elements Val and Dist. These fuzzy constructs carry the fuzzy information of the associated tree nodes. To reduce the model complexity, the redundant elements/attributes are deleted and the pertinent information (including fuzzy information) of the elements/attributes is encapsulated in the nodes of the fuzzy XML document tree. We adopt the following considerations in order to reach this goal.

We note that Type always appears as the first child attribute of Dist and Poss always appears as the first child attribute of Val. They are considered redundant data and increase the computational complexity. Therefore, the Type and Poss attributes are no longer retained as separate nodes when a fuzzy XML document is mapped into a fuzzy XML document tree. However, the Val values are retained because they need to be considered when calculating the TF-IDF values (cf. Section 3.2); thus, we copy the Val values into their sibling (element/attribute) nodes during the transformation. Similarly, we can disregard the Type and Dist elements provided that this does not affect the tree structure or the depth of the other nodes.

Based on the discussion above, we present a new Fuzzy XML Document Tree Model (FXDTM for short).

Definition 1 (Fuzzy XML document tree)

Formally, we model a fuzzy XML document as a rooted ordered labeled tree FXDTM = {N, E, L, T, FT, P}. Here.

  • N is the set of nodes in tree FXDTM.

  • E is the set of edges, which reflect the hierarchical structure of the tree FXDTM.

  • L is the set of labels of the elements and attributes corresponding to the nodes in N.

  • T is the set of data types, including the basic element data types and attribute data types.

  • FT is the set of fuzzy data types of nodes N.

  • P is the set of Poss values of nodes N.

We need to retain the related information of elements/attributes (e.g., the label and the fuzzy value) when a fuzzy XML document is transformed into the proposed fuzzy XML document tree model (FXDTM). Hence, following our tree representation model, a fuzzy XML document tree node is modeled as follows:

Definition 2 (Fuzzy XML document tree node)

A node n ∈ N of FXDTM is represented by a quintuple n = {NodeLabel, NodeDepth, NodeFuzzy, NodeFuzzyType, NodePossValue}. Here.

  • NodeLabel ∈ L is the label name of the node.

  • NodeDepth is the nesting depth of the node in the fuzzy XML document. The depth of the root node is defined to be 1. If the parent of node n is at depth d, then the depth of node n is d + 1.

  • NodeFuzzy is used to indicate if the node is a fuzzy or crisp one. If the value of NodeFuzzy is 1, the node is a fuzzy one. If the value of NodeFuzzy is 0, the node is a crisp one. In particular, the value of NodeFuzzy of the node Dist or Val is 0.

  • NodeFuzzyType ∈ FT denotes the type of possibility distribution (disjunctive or conjunctive). For a crisp node, its value is Null.

  • NodePossValue is the membership degree of the node or the possibility of an element's attribute value. For a crisp node, the value of NodePossValue is equal to 1.

To sum up, FXDTM is a rooted ordered tree in which the nodes represent the fuzzy XML elements/attributes, ordered following their order of appearance in the fuzzy XML document. The proposed FXDTM representation model considers the most common characteristics (e.g., labels and fuzzy values) of fuzzy XML documents.
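For illustration, a node of Definition 2 could be represented by a simple record such as the following Python sketch; the field names mirror the quintuple, and the concrete types are our assumptions rather than part of the definition.

from dataclasses import dataclass
from typing import Optional

@dataclass
class FXDTMNode:
    node_label: str                  # NodeLabel: element/attribute label name
    node_depth: int                  # NodeDepth: the root node has depth 1
    node_fuzzy: int                  # NodeFuzzy: 1 for a fuzzy node, 0 for a crisp node
    node_fuzzy_type: Optional[str]   # NodeFuzzyType: "disjunctive", "conjunctive", or None (crisp)
    node_poss_value: float = 1.0     # NodePossValue: membership/possibility degree, 1 for crisp nodes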

After the fuzzy XML document mentioned in Section 2.1 is converted into the FXDTM tree model in this way, we can extract the feature information of the nodes and apply the KELM-based approach to classify the fuzzy XML documents represented by the proposed FXDTM tree model. Figure 2 shows an FXDTM tree instance derived from the fragment of a fuzzy XML document shown in Fig. 1. Without loss of generality, when we use “node” we mean “element node” or “attribute node” in the following sections.

Fig. 2 A FXDTM tree corresponding to the fuzzy XML document in Fig. 1

3.2 Representation Structured Vector Space Model of Fuzzy XML Document Tree

Although we use the new fuzzy XML document tree model FXDTM to better represent the feature information of a fuzzy XML document, a fuzzy XML document still needs to be represented in the form of feature vectors in order to apply machine-learning-based classification (Tang et al. 2016). Therefore, to classify fuzzy XML documents, it is necessary to transform them from the FXDTM model into a vector model. That is, instead of deriving the vector model directly from the fuzzy XML document, we go through the intermediate transformation into the FXDTM tree. The advantage of this method is that the FXDTM tree reflects the hierarchical structure and fuzzy information of the feature terms more accurately than a plain-text representation.

The Vector Space Model (VSM) (Salton and McGill 1983) is often used to represent plain-text documents, taking the TF-IDF values of distinct feature terms as feature vectors. However, VSM cannot directly represent the structural information of a semi-structured (XML) document. The Structured Link Vector Model (SLVM), proposed in (Yang and Chen 2002) based on VSM, represents semi-structured documents and captures both semantic and structural information. Given a set of XML documents D and an XML document d ∈ D, the structured vector is defined as

$$ {d}_{sv}=<{d}_1,\dots, {d}_n> $$
(15)

where di is the feature vector of the ith feature term of the XML document, calculated as

$$ {d}_i=\sum \limits_{j=1}^m\left( TF\left({w}_i, Doc.{e}_j\right)\cdot {\varepsilon}_j\right)\cdot IDF\left({w}_i\right) $$
(16)

where wi is the ith term, εj is a unit vector corresponding to the element node ej, and m is the number of nodes corresponding to term wi.

In SLVM, each dsv is a feature matrix in R^{n×m}, and di is an m-dimensional feature vector aggregating the XML element nodes that correspond to the same feature term.

A Modified Structured Vector Space Model (MS-VSM) based on SLVM is proposed in (Zhao et al. 2017) to represent fuzzy XML documents; it not only represents the semantic and structural information of fuzzy XML documents but also makes the feature vectors carry fuzzy information.

Given a set of fuzzy XML documents FD and a fuzzy XML document x ∈ FD, the Modified Structured Vector Space Model (MS-VSM) representation of the fuzzy XML document is described as

$$ {x}_{MS- VSM}=<{x}_1,\dots, {x}_n> $$
(17)

where xi is the feature vector of the ith feature term of the fuzzy XML document, calculated as

$$ {x}_i=\sum \limits_{j=1}^m\left( TF\left({L}_i, FXDoc.{n}_j\right)\cdot {\varepsilon}_j\right)\cdot IDF\left({L}_i\right) $$
(18)

where m is the number of nodes corresponding to the feature term Li (label) in the fuzzy XML document FXDoc, and FXDoc.nj is the jth node corresponding to Li in FXDoc. The weight εj, which in SLVM (Yang and Chen 2002) is a unit vector of FXDoc.nj defined by users in a similarity matrix according to their concrete application, is now the product of the reciprocal of the node depth (nj.NodeDepth) and the node fuzzy value (nj.NodePossValue):

$$ {\varepsilon}_j=\frac{1}{n_j. NodeDepth}\cdotp {n}_j. NodePossValue $$
(19)

Example 1 shows how the feature values corresponding to feature terms are calculated.

Example 1

To illustrate the calculation process, we compute the feature values for the two feature terms “teacher” and “title” in Fig. 2:

$$ {\displaystyle \begin{array}{l}{x}_{teacher}=\frac{1}{18}\times \frac{0.9}{4}\times \mathit{\log}\frac{100}{1+80}\\ {}{x}_{title}=\left(\frac{1}{18}\times \frac{0.8}{7}+\frac{1}{18}\times \frac{0.6}{7}\right)\times \mathit{\log}\frac{100}{1+50}\end{array}} $$

The FXDTM tree in Fig. 2 has 18 nodes. Here, suppose that the total number of documents in the corpus is 100, the number of documents in which the feature term “teacher” appears is 80, and the number of documents in which the feature term “title” appears is 50. The feature vector of a fuzzy XML document is then assembled from the feature values of its feature terms.
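The calculation of Example 1 can be reproduced by the small sketch below; treating the term frequency as 1/(number of tree nodes) per occurrence and using the base-10 logarithm are our reading of the example.

import math

def ms_vsm_feature(occurrences, n_tree_nodes, n_corpus_docs, n_docs_with_term):
    # occurrences: list of (depth, poss_value) pairs of the nodes labeled with the term
    tf_part = sum((1.0 / n_tree_nodes) * (poss / depth) for depth, poss in occurrences)
    idf = math.log10(n_corpus_docs / (1 + n_docs_with_term))
    return tf_part * idf

# Example 1: the FXDTM tree of Fig. 2 has 18 nodes
x_teacher = ms_vsm_feature([(4, 0.9)], 18, 100, 80)
x_title   = ms_vsm_feature([(7, 0.8), (7, 0.6)], 18, 100, 50)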

4 Classification of Fuzzy XML Document with KPCA-KELM

An ELM-based double-hidden-layer framework is proposed in (Zhao et al. 2017); it performs feature extraction with the Extreme Learning Machine and classification with the Kernel Extreme Learning Machine. Considering that KPCA is very well suited to extracting interesting nonlinear features from high-dimensional data, we choose KPCA as the feature extractor in this paper. In this section, we first introduce the preprocessing of fuzzy XML documents, which produces the raw data of the feature representation of fuzzy XML documents. We then propose a new KPCA-KELM framework for fuzzy XML document classification. The overall architecture of the proposed KPCA-KELM includes a training stage and a testing stage.

4.1 Preprocessing

In (Zhao et al. 2017), the fuzzy XML document is directly represented as a Modified Structured Vector Space Model (MS-VSM). In this paper, we use the method described in Section 3 to represent the fuzzy XML document samples in order to represent the information more accurately.

To obtain the raw data of the characteristics of the fuzzy XML documents, we need to transform each fuzzy XML document into an FXDTM tree and calculate the feature value of each element. For each sample, whether for training or for testing, the necessary pre-processing steps, such as size normalization, are applied to facilitate the feature extraction process. Assume that there is a fuzzy XML document set FD = {xi} (i = 1, …, N). After preprocessing, the raw data of the training sample set (D) and the test sample set (T) are obtained. The next step is to extract the KPCA features from the raw data and realize the final classification of the test samples. This preprocessing is realized as Algorithm 3.

figure c
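A rough sketch of this preprocessing step is given below. It assumes each document has already been parsed into a list of (label, depth, poss_value) node tuples taken from its FXDTM tree, and builds the raw MS-VSM data matrix via Eqs. (18)-(19); the tuple representation and the per-occurrence term-frequency treatment are our assumptions.

import numpy as np

def build_raw_data(docs, vocabulary, doc_freq, n_corpus_docs):
    # docs: list of documents, each a list of (label, depth, poss_value) node tuples
    # vocabulary: ordered list of feature terms; doc_freq[term]: document frequency of term
    X = np.zeros((len(docs), len(vocabulary)))
    index = {term: i for i, term in enumerate(vocabulary)}
    for d, nodes in enumerate(docs):
        n_nodes = len(nodes)
        for label, depth, poss in nodes:
            if label in index:
                X[d, index[label]] += (1.0 / n_nodes) * (poss / depth)   # TF weighted by eps_j (Eq. (19))
        for term, i in index.items():
            X[d, i] *= np.log10(n_corpus_docs / (1 + doc_freq[term]))    # IDF factor
    return X

# The size normalization mentioned for Algorithm 3 (e.g. scaling rows or columns)
# can then be applied before splitting the rows into the training set D and the test set T.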

4.2 KPCA-KELM Classification Architecture

The starting point of the proposed KPCA-KELM is as follows. First, one can extract more information by deriving suitable nonlinear features from a given raw dataset. KPCA is well suited to extracting interesting nonlinear features from data; in particular, it has the advantage of capturing nonlinear features in a high-dimensional space (Schölkopf et al. 1998; Iosifidis et al. 2015). The KPCA step maps the raw data obtained in the pre-processing stage to a nonlinear feature space. In addition, dimensionality reduction before performing classification tasks can improve classification accuracy and reduce training time. At the same time, a feature selection technique is employed for pre-processing before the classification models are constructed: we try to find out which features are most relevant to fuzzy XML documents, feature selection is applied to identify the informative features, and then several feature subsets with the top-p ranked features are fed to the classifier for classification and performance evaluation. Second, in order to solve the classification problem, we need a suitable machine learning classification algorithm. Compared with SVM, the KELM can achieve comparable or better performance with a much easier implementation and faster training in many classification tasks (Huang et al. 2014; Huang 2014; Huang et al. 2012). That is, the KELM requires much less learning time and has good generalization accuracy. These characteristics motivate us to use the KELM to solve the fuzzy XML document classification problem. In a word, we try to use the KELM classifier to classify fuzzy XML documents by combining the KPCA feature extraction method with suitable feature selection.

The proposed KPCA-KELM architecture is structurally divided into two separate steps: 1) feature extraction and 2) classification. In the first step, a KPCA-based extractor is developed to extract sparse features of the input raw data, which helps to exploit hidden information among the training samples. The KPCA first maps the raw input data into a high-dimensional feature space via a nonlinear kernel function (e.g., Gaussian or polynomial kernels) and then performs linear PCA on the mapped data. After a series of matrix operations, the p largest eigenvalues and the corresponding eigenvectors are found. Through such feature selection and dimensionality reduction, the most representative relevant features can be identified, and the extracted features can be used as the basis for classification. It should be noted that the actual physical meaning of the extracted features is not very explicit: the eigenvectors in the feature space do not correspond to the feature terms in the input raw data space. For example, the FXDTM tree in Fig. 2 has 14 feature terms, namely “college”, “Cname”, “Val”, “department”, “Dname”, “teacher”, “student”, “Tid”, “Dist”, “SID”, “age”, “email”, “Tname”, and “title”. Suppose we want to select the first 5 eigenvectors in the feature space: the extracted eigenvectors only carry the largest amount of information common to the 14 feature terms, and the first p principal components carry the largest amount of common information relative to the input raw data. In the second step, KELM-based classification is performed for the final decision making, using the projections produced by feature extraction as the inputs of the classifier. Although the SVM has gained much popularity due to its mature theory and satisfactory classification performance, the KELM still has its own advantages. The KELM method alleviates a disadvantage of the ELM method, namely that the random assignment of input weights and hidden-layer biases can cause unstable outputs; at the same time, the hidden-layer feature mapping does not need to be known. In this paper, the Gaussian kernel function is applied. The two main parameters of KELM with a Gaussian kernel are the penalty parameter C and the kernel parameter γ, which play an important role in model construction. Through these two steps, the KPCA-KELM model provides improved performance for the classification of fuzzy XML documents. The proposed framework of KPCA-KELM is shown in Fig. 3.

Fig. 3 The model architecture of KPCA-KELM

In the first (KPCA) step, the nonlinear projection produces almost orthogonal eigenvectors Vk, which are linearly independent, in the training stage; in the testing stage, the principal components of a test sample y are extracted by projecting Φ(y) onto the eigenvectors Vk in F as in Eq. (7). In the second (KELM) step, Ωtrain is calculated by Eq. (13) in the training stage, and in the testing stage the final classification of the sample is realized by Eq. (14). The algorithm of KPCA-KELM is presented as Algorithm 4.

Overall, to capture the principal component features, a representative raw dataset is constructed with Algorithm 3 from the fuzzy XML document set. And then, the KPCA features are extracted from these representative raw datasets. Finally, in the new feature space, the KELM classifier is used to realize classification.

figure d
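Putting the two steps together, the following sketch outlines Algorithm 4 end to end; it reuses the KPCA and KELM sketches from Section 2, and the parameter values shown are placeholders rather than the tuned values reported in Section 5.

import numpy as np

def gaussian_kernel(A, B, gamma):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def kpca_kelm(X_train, T_train, X_test, p=50, gamma_kpca=1.0, gamma_kelm=1.0, C=32.0):
    # Step 1: KPCA feature extraction (Algorithm 1)
    N = X_train.shape[0]
    K = gaussian_kernel(X_train, X_train, gamma_kpca)
    one_n = np.ones((N, N)) / N
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:p]
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    Z_train = Kc @ alphas                                      # projections of training samples
    K_test = gaussian_kernel(X_test, X_train, gamma_kpca)
    one_m = np.ones((X_test.shape[0], N)) / N
    K_test_c = K_test - one_m @ K - K_test @ one_n + one_m @ K @ one_n
    Z_test = K_test_c @ alphas                                 # Eq. (8)
    # Step 2: KELM classification on the extracted features (Algorithm 2)
    omega = gaussian_kernel(Z_train, Z_train, gamma_kelm)      # Eq. (13)
    A = np.linalg.solve(np.eye(N) / C + omega, T_train)
    scores = gaussian_kernel(Z_test, Z_train, gamma_kelm) @ A  # Eq. (14)
    return np.argmax(scores, axis=1)                           # predicted class indices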

5 Experimental Evaluation

In this section, we present experiments conducted to verify the effectiveness and efficiency of the proposed KPCA-KELM framework. We compare it with the main relevant state-of-the-art classifiers in the literature (original ELM, SVM, etc.) in terms of classification accuracy and training time on various datasets.

5.1 Experimental Settings

Although a variety of classification methods are presented in the literature (Huang et al. 2006; Suykens and Vandewalle 1999; Blatman and Sudret 2011; Paliwal and Kumar 2009; Kamgar-Parsi and Kanal 2010; Gupta et al. 2019; Palshikar et al. 2018), we choose popular techniques to compare the computational performance of the different classifiers.

The current versions of ELM and Kernel ELM have been employed for the performance analysis. For the SVM classifier, the classic algorithm (Suykens and Vandewalle 1999) has been re-implemented to verify its performance under parameter settings consistent with those of the other classifiers.

In all the experiments, we apply the original ELM (Eq. (12)) and the Kernel ELM (Eq. (14)) algorithms for different ELM mapping space.

In the case of original ELM space mapping, we have employed the Sigmoid activation function:

$$ g\left(a,b,x\right)=\frac{1}{1+\mathit{\exp}\left(-\left(a\cdot x+b\right)\right)} $$
(20)

Here all the hidden node parameters a and b are randomly generated. The number of hidden nodes L is set as {2^6, 2^7, …, 2^15}. The regularization parameter C is set as {2^−10, 2^−9, …, 2^9}.

In the case of Kernel ELM space determination, we employed the Gaussian kernel function:

$$ K\left(u,v\right)=\mathit{\exp}\left(-\gamma {\left\Vert u-v\right\Vert}^2\right) $$
(21)

where the value γ of the Gaussian kernel is set in the range {2^−5, 2^−4, …, 2^4}. The regularization parameter C is set as {2^−10, 2^−9, …, 2^9}.

The same Gaussian kernel function is adopted in KPCA-KELM and SVM. In order to obtain an unbiased estimate of the generalization accuracy, all experiments are repeated over 20 runs and the results are averaged for the classification comparison; the mean and standard deviation are reported.
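The parameter search and repeated evaluation described above can be organized roughly as follows; run_once is a hypothetical callable that trains and tests the model once on a random 80/20 split and returns the testing accuracy for the given (C, γ).

import numpy as np
from itertools import product

C_grid = [2.0 ** k for k in range(-10, 10)]     # {2^-10, ..., 2^9}
gamma_grid = [2.0 ** k for k in range(-5, 5)]   # {2^-5, ..., 2^4}

def grid_search(run_once, n_runs=20):
    best = (None, None, -1.0, 0.0)
    for C, gamma in product(C_grid, gamma_grid):
        accs = [run_once(C, gamma) for _ in range(n_runs)]
        mean, std = float(np.mean(accs)), float(np.std(accs))
        if mean > best[2]:
            best = (C, gamma, mean, std)
    return best   # (C, gamma, mean accuracy, standard deviation)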

All the experiments were conducted on a computer with an Intel Core i7 processor and 8 GB RAM. All the simulations were carried out in the MATLAB R2014a and JDK 1.6 environments, running on the Windows 7 operating system, with synthetic and real datasets.

5.2 Datasets

To test our method’s effectiveness, we used the following synthetic and real datasets for the experimental evaluation.

We experimented with synthetic datasets from five different domains. Each domain has a corresponding XML DTD. The characteristics of these XML DTDs, which represent different application domains, are shown in Table 1. To obtain fuzzy XML documents, we first used our adaptation of an XML document generator to generate the corresponding XML documents based on these XML DTDs. We then manually added fuzzy constructs to these XML documents during a top-down traversal of each document. Specifically, at each visited node N, we randomly generate a fuzzy construct (a Dist node D with Type as its first child) to be the child of N during the top-down traversal. The original children of N become the children of the newly generated Dist node D, wrapped in fuzzy constructs (Val nodes with Poss as their first child) that are assigned random probabilities to specify the membership degrees of the given elements.
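A sketch of this fuzzification step, using Python's standard ElementTree API, is shown below; the insertion probability and the range of the random Poss values are our assumptions, since the paper only states that the constructs and probabilities are generated randomly, and Type/Poss are written here as attributes.

import random
import xml.etree.ElementTree as ET

def fuzzify(elem, prob=0.3):
    # With probability `prob`, wrap the children of the visited element in a Dist
    # construct whose Val children carry random Poss membership degrees.
    children = list(elem)
    if children and random.random() < prob:
        dist = ET.Element("Dist", {"Type": random.choice(["disjunctive", "conjunctive"])})
        for child in children:
            elem.remove(child)
            val = ET.SubElement(dist, "Val", {"Poss": "%.2f" % random.uniform(0.1, 1.0)})
            val.append(child)
        elem.append(dist)
    for child in children:           # continue the top-down traversal
        fuzzify(child, prob)
    return elem

# Usage: tree = ET.parse("doc.xml"); fuzzify(tree.getroot()); tree.write("fuzzy_doc.xml")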

Table 1 Characteristics of the synthetic datasets

For the real datasets, we considered ABC News, IBM Developer Works, and Wikipedia as three original real datasets. Again, we need a random data generation method that transforms the XML documents in the original real datasets into fuzzy XML documents. That is, we artificially add fuzzy nodes to the XML documents using the same approach as for the synthetic datasets.

After obtaining the fuzzy XML documents of the synthetic and real datasets, we choose parts of them as subsets. For example, IBM Developer Works is composed of around 1200 XML documents classified into 6 classes, out of which we choose 3 classes and 600 random XML documents as a subset. In this way, the synthetic data are randomly split into 4 subsets (T1, T2, T3, T4) and the real data into 6 subsets (T5, T6, T7, T8, T9, T10), respectively.

Subsequently, we transform the fuzzy XML documents in the synthetic and real datasets (T1-T10), ordered by increasing sample size, into the MS-VSM representation using Algorithm 3. Consequently, we obtain 10 datasets represented with MS-VSM as the input raw data for the KPCA-KELM framework and the other classifiers. The key characteristics of the datasets are summarized in Table 2.

Table 2 Key Characteristics of the datasets

For each dataset, 80% of the samples are used for training while the remaining 20% are used for testing. We normalized all input variables to the range (−1, 1) for all classifiers.

5.3 Performance Comparison and Discussion

In this part, we compare the performance of the above-mentioned methods for classification over ten raw datasets, including both classification accuracy and training time cost, to demonstrate both the effectiveness and the computational efficiency of the proposed KPCA-KELM framework.

First, we demonstrate the selection of parameters for the original ELM and the KPCA-KELM framework on the synthetic datasets (T1-T4). Compared with traditional learning algorithms, fewer parameters need to be chosen during their training.

The performance of original ELM is mainly influenced by the cost parameter C for the regularized least mean square calculation and the number of hidden neurons L. Here, these two key factors will be examined in detail.

Figure 4 shows the 3-D testing accuracy curves of the original ELM in terms of (C, L). The influence of C and L on the performance of the original ELM is different. As shown in Fig. 4, the accuracy curves become smoother and steadier as L increases for different values of C. On the other hand, the trend of the accuracy curves does not change much as C increases provided that L is set large enough (e.g., L > 2^10 in our simulations). That is, the selection of C is less important, while L should be large enough.

Fig. 4 Testing accuracy curve of ELM in terms of (C, L)

Unlike the original ELM, the performance of KPCA-KELM is mainly influenced by the cost parameter C and the kernel parameter γ of the Gaussian kernel function. In order to achieve good generalization performance, the cost parameter C and the kernel parameter γ of KPCA-KELM need to be chosen appropriately. Thus, the best combination (C, γ) of KPCA-KELM with the Gaussian kernel needs to be found.

Figure 5 shows the 3-D testing accuracy curves of KPCA-KELM in terms of (C, γ). We can clearly see that the testing accuracy of KPCA-KELM is heavily influenced by the parameter γ: the testing accuracy reaches its maximum when the value of γ is close to 1. Compared with γ, the performance of KPCA-KELM is not sensitive to the parameter C when C is large enough. That is, the classification accuracy remains stable as C increases if γ is set to a fixed value (e.g., γ = 1).

Fig. 5 Testing accuracy curve of KPCA-KELM in terms of (C, γ)

It should be mentioned that the KPCA-KELM achieves a testing accuracy similar to that of the original ELM, and its performance tends to be quite stable over a wide range of C (C > 2^6). This may be because the KPCA-KELM framework is able to effectively capture the nonlinear relationships existing in the fuzzy XML document datasets with the aid of the Gaussian kernel. Meanwhile, the performance of KPCA-KELM is more sensitive to the parameter C than that of the original ELM: as C increases, the accuracy of KPCA-KELM fluctuates more than that of ELM.

To investigate whether feature extraction can further improve the performance of KPCA-KELM for fuzzy XML document classification, we further conduct experiments in the reduced feature space obtained using the KPCA algorithm.

Another parameter that affects the performance of the KPCA-KELM framework is the number of principal components p extracted in the KPCA step (Algorithm 1). We discuss the impact of p on the performance of the KPCA-KELM framework in detail.

Figure 6 shows the testing accuracy curves under different numbers of extracted features p. KPCA-KELM obtains different classification results for different values of p, and the classification accuracy tends to increase as p grows. It can also be seen that the testing accuracy curves remain stable on the different real datasets (T5-T9) once p is large enough.

Fig. 6 Testing accuracy curve of KPCA-KELM in terms of p

Figure 7 shows the training time curves in terms of the number of extracted features p. We can see that the training time increases as p grows. Moreover, for a fixed p, the training time increases with the number of samples.

Fig. 7 Training time curve of KPCA-KELM in terms of p

From the above analysis, we find that the KPCA-KELM framework improves its performance for fuzzy XML document classification with the aid of KPCA-based feature extraction.

Furthermore, in order to evaluate the effectiveness of the proposed KPCA-KELM approach, the SVM was also implemented for comparison. Here we only considered the Gaussian kernel for SVM classification. The optimal parameter pair (C, γ) was employed to construct the SVM predictive model.

In summary, user-specified optimal parameters used in our simulations are adopted as follows: (C, L) for original ELM is (50, 1000), (C, γ) for KPCA-KELM with Gaussian kernel is (25, 1), and the same optimal parameters are adopted in the SVM.

Next, we used the more complicated real datasets to compare the classification performance of KPCA-KELM with the other relevant state-of-the-art classifiers in the literature (original ELM, SVM, kernel ELM).

For KPCA-KELM, feature extraction is performed using KPCA, and the resulting features are randomly projected before ELM-based classification; for ELM and kernel ELM, the randomly mapped raw data is input for classification. For fair comparison, in the feature classification stage, the same Gaussian kernel function is adopted in the KPCA-KELM, kernel ELM and SVM. In addition, the sigmoid activation function is adopted in original ELM.

To analyze the numerical results, two performance metrics have been graphed: testing accuracy and training time. The average performance comparison of the different methods is gathered in Table 3 and plotted in Figs. 8 and 9.

Table 3 Performance of classifiers on the different datasets
Fig. 8 Testing accuracy of different classifiers

Fig. 9 Training time of different classifiers

One of the most important evaluation criteria is the testing accuracy. As shown in Table 3 and Fig. 8, KPCA-KELM and KELM have similar testing accuracy on all datasets because the same kernel function and parameters are adopted in both algorithms. Moreover, the testing accuracy comparison shows that the three ELM-based algorithms perform similarly to the SVM-based method. Importantly, we observe that the KPCA-KELM algorithm does not obtain a large advantage over the other algorithms in classification accuracy. Furthermore, Fig. 8 shows that the standard deviations of the testing accuracy on the same dataset are generally comparable, owing to the similar accuracies of the four classifiers. Finally, several small datasets, such as T1 and T3, show relatively low testing accuracies. This indicates that the performance of the classifiers depends on the nonlinear nature of the datasets themselves rather than on the number of samples.

Another important evaluation criterion of a learning algorithm is the training time, the comparison of which is presented in Fig. 9. Figure 9 shows the training time curves (on a logarithmic scale) of the four classifiers KPCA-KELM, ELM, KELM, and SVM on the different datasets. For almost all the datasets, KPCA-KELM is the fastest. The ELM, which is the second slowest on the smaller datasets, is similar to or slightly faster than KELM on the bigger datasets. It seems that, for the large datasets, the computational cost of processing the whole training set (N samples) in KELM is higher than the cost of calculating the matrix H in ELM. Moreover, the experimental results show that SVM is significantly more time-consuming than the ELM-based algorithms for all datasets, and the gap between KPCA-KELM/ELM and SVM increases with the size of the dataset. Since the major part of the training time is spent on matrix calculations and the feature extraction in KPCA-KELM removes redundant features, the effect of feature extraction is evident and the training time of KPCA-KELM is reduced to a great extent.

In summary, the above results validate that the KPCA-KELM learning framework achieves a significant improvement in the performance of fuzzy XML document classification compared with the other existing methods: its classification accuracy is comparable to, or slightly higher than, that of the others, while its training time is much shorter. Moreover, the computational cost is acceptable on a common personal computer with the implementation environment mentioned at the beginning of this section.

The major difference between KPCA-KELM and the other methods is that, before feature classification, KPCA-KELM uses the KPCA algorithm to obtain a sparse representation of the input raw data. The compact features help to remove redundancy from the original inputs and thus improve the overall learning performance.

Finally, we discuss some limitations of KPCA-KELM. The tunable parameters of the KPCA-KELM framework are the kernel spread γ, the regularization parameter C, and the number of extracted features p, which must be tuned for each particular dataset. Therefore, the classification performance of KPCA-KELM depends not only on the size (number of patterns, inputs, and classes) of the dataset, but also on the result of feature extraction. Nevertheless, we can clearly see that feature extraction further decreases the training time of KPCA-KELM.

6 Conclusion

In order to deal with the classification of fuzzy XML documents effectively, in this paper we first propose a novel tree representation model to capture the node information of fuzzy XML documents; a vector space model representing the semantics and structure of fuzzy XML documents is then employed, based on the proposed fuzzy XML document tree model, to extract the features of fuzzy XML documents. We propose the KPCA-KELM framework to classify fuzzy XML documents based on this vector space model. The core component of the proposed framework is the KELM classifier, whose performance is improved by the features extracted with KPCA. Finally, the experimental results show that our algorithms can efficiently perform classification on fuzzy XML documents. Note that our proposed KPCA-KELM approach achieves a significant reduction in training time while maintaining the same level of accuracy as the state-of-the-art baseline models. In the future, because the computational cost of KPCA increases dramatically with the sample size, it is natural to accelerate the proposed KPCA-KELM algorithm by using cloud computing technology on large datasets. We also plan to develop a prototype system based on the proposed framework and then test it with other real application systems.