1 Introduction

With the development of information technologies such as computation, networking, and databases, e-business [16, 28, 43] has rapidly expanded in scale, scope, and methods over the last two decades. Owing to the openness, globalization, and virtualization of the network economy, the e-business market has become larger and more complicated than ever. The major e-business pressures are labeled the 3Cs: competition, customers, and change [34]. An effective way to alleviate these pressures is Business Intelligence (BI) [15, 33, 42, 53], which requires enterprises to use data mining (DM) [19–21, 23, 35] tools to analyze business data. DM, one of the most promising technologies since the 1990s, is a non-traditional, data-driven method that can discover novel, useful, and hidden knowledge from massive data sets. It has been considered very useful for data analysis in business, industry, and engineering [4].

Large amounts of e-business data are increasingly generated at distributed locations. In many cases, it is not feasible to transfer all of the data to a central location for DM because of security issues, limited network bandwidth, or the internal policies of some organizations [7]. Distributed data mining (DDM) extends DM techniques to distributed data environments. DDM can also be used to speed up the DM process even when the data are not physically distributed, but its primary purpose is to discover and combine useful knowledge from databases that are physically distributed across multiple sites [45]. Giannella et al. [14] describe two main advantages of DDM: (1) lower network traffic: each data source site processes its own data and sends only the result (a local model) back to the main host, instead of transferring a large amount of data across the network, which can take considerable time [46]; because the local model is much smaller than the local data, sending only the model substantially reduces network traffic and requires far less bandwidth; and (2) better security: sharing only the model, rather than the entire data set, can mean better security for some organizations since it addresses the issue of data privacy.
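To make the first advantage concrete, the following minimal Java sketch (illustrative only; class and method names such as LocalSite and buildLocalModel are our own and are not taken from [14] or [46]) shows each site reducing its raw records to a small frequency model and a coordinator merging only those models:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: each site reduces its raw records to a compact local model
// (here, item frequency counts) and ships only that model, not the raw data,
// to the coordinating host.
public class LocalSite {

    // Build a local model: item -> frequency within this site's data.
    static Map<String, Integer> buildLocalModel(List<String> localRecords) {
        Map<String, Integer> model = new HashMap<>();
        for (String item : localRecords) {
            model.merge(item, 1, Integer::sum);
        }
        return model; // typically much smaller than localRecords
    }

    // The coordinator merges the local models instead of the raw data.
    static Map<String, Integer> mergeModels(List<Map<String, Integer>> localModels) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> model : localModels) {
            model.forEach((item, count) -> global.merge(item, count, Integer::sum));
        }
        return global;
    }

    public static void main(String[] args) {
        Map<String, Integer> siteA = buildLocalModel(List.of("book", "book", "cd"));
        Map<String, Integer> siteB = buildLocalModel(List.of("cd", "dvd"));
        System.out.println(mergeModels(List.of(siteA, siteB))); // {book=2, cd=2, dvd=1}
    }
}
```

In a real deployment the model would be serialized and transmitted over the network, but the size argument is the same: the map of counts is far smaller than the raw records it summarizes.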

The rest of the paper is organized as follows: Section 2 describes the related concepts of BI and DDM, together with applicable scenarios and issues of DDM for e-business. Section 3 classifies modern DDM systems into three classes, provides representative examples, and summarizes the main issues with these systems. To help address these issues, Sect. 4 proposes a novel DDM model that divides the system into layers (based on data source relevance) to support hierarchically parallel mining. Section 5 evaluates the feasibility of the proposed model by verifying a prototype system in a web mining experiment. Section 6 presents the conclusion and suggestions for future research.

2 Related concepts of BI and DDM

2.1 Concept of BI

BI [15, 33, 42, 53] is the concept of applying a set of technologies to convert data into meaningful information. BI tools include information retrieval, DM, statistical analysis, and data visualization. Using these tools, large amounts of data originating in different formats and from different sources can be consolidated and converted into key business knowledge. Figure 1 presents a general view of how data are transformed into BI knowledge. The process involves both business experts and technical experts, and it converts large-scale data into meaningful outcomes that provide decision-making support to end users [57].

Fig. 1
figure 1

BI processing [57]

2.2 DDM and its applicable scenario

DM deals with the problem of analyzing a large amount of data in a scalable manner [19]. DDM is a branch of DM that offers a framework to mine distributed data with careful attention to the distributed data and computing resources. A distributed scenario (where DDM is applicable) may have the following features [6]:

  1. The system consists of multiple independent sites of data and computation which communicate only through message passing.

  2. Communication between the sites is expensive.

  3. Sites have resource constraints.

  4. Sites have privacy concerns.

2.3 Issues in DDM for e-business

In e-business, most of the daily produced data are distributed across sites (e.g., websites, departments, and companies of the same business chain). Many of these sites are connected through networks, which provides the environment for DDM. Studies [10, 64] have been conducted to address the following issues in order to improve the performance of DDM systems.

  • Heterogeneous versus homogeneous DM. In centralized DM, most of the work deals with homogeneous data, i.e., data maintained by the same DBMS and under the same management model. If the data are heterogeneous, the local data management models usually have to be integrated and converted into a global model before DM is conducted; otherwise, contradictions among attributes may occur.

  • Data variety in a dynamic environment. In traditional DM, the data are regarded as static, and the mining work is executed in an environment that already holds enough data. In an e-business environment, business data are time-varying in nature, and it is a challenge to correctly transfer the time-series-related results.

  • Communication cost. In centralized DM, the I/O and CPU time costs are considered when designing the algorithm. In a distributed data environment, the communication cost, which depends on the network bandwidth and the amount of transferred information, needs to be considered.

  • Knowledge integration [26, 49]. For DDM, the final purpose is to produce a global result by analyzing the local sites and integrating the local results. The local data sets can be analyzed with existing centralized DM methods, but a traditional simple integration method may not work for combining the local results. For instance, patterns that are locally interesting may lose their value at the global level, so it is necessary to collect all of the local interests and verify their global interest degree in order to produce the final result (see the sketch below).
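As a minimal illustration of the last point (all names are hypothetical, and the actual integration strategy depends on the DM task), the sketch below sums the local support counts reported by each site and keeps only the patterns whose combined support still clears a global threshold:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: a pattern that is frequent at one site is only kept
// globally if its combined support across all sites clears the global threshold.
public class GlobalInterestCheck {

    // Each entry: pattern -> local support count reported by one site.
    static Map<String, Long> integrate(List<Map<String, Long>> localCounts,
                                       long totalTransactions,
                                       double minGlobalSupport) {
        Map<String, Long> combined = new HashMap<>();
        for (Map<String, Long> local : localCounts) {
            local.forEach((pattern, count) -> combined.merge(pattern, count, Long::sum));
        }
        // Drop patterns whose global support falls below the threshold.
        combined.entrySet().removeIf(
            e -> (double) e.getValue() / totalTransactions < minGlobalSupport);
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Long> site1 = Map.of("{milk,bread}", 40L, "{ski,boots}", 30L);
        Map<String, Long> site2 = Map.of("{milk,bread}", 55L, "{ski,boots}", 2L);
        // 1,000 transactions overall, 5% global minimum support.
        System.out.println(integrate(List.of(site1, site2), 1000, 0.05));
        // {milk,bread} survives (95/1000); {ski,boots} was locally interesting
        // at site1 but fails globally (32/1000).
    }
}
```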

3 Modern DDM Systems

Modern DDM systems can be classified as follows:

3.1 DDM systems based on parallel DM agents

PArallel Data Mining Agents (PADMA) is a multi-agent [2, 44, 61] based architecture for DM. The system makes use of intelligent DM agents, which are responsible for accessing, analyzing, and discovering the hidden patterns within the data warehouse. The agents work in conjunction with each other and share the same repository of metadata [29]. The main purpose of PADMA is to realize coordinated parallel mining through multi-agent technologies, which enhances efficiency.

Albashiri et al. [1] proposed an Extendible Multi-Agent Data mining System (EMADS), whose vision is that a community of DM agents (contributed by many individuals) can interact with one another under decentralized control to address DM requests. EMADS is considered both an end-user application and a research tool. In EMADS, an anarchic collection of persistent, autonomous (but cooperating) DM agents operates across the Internet. Individual agents have different functionalities; the system currently comprises data agents, user agents, task agents, mining agents, and a number of "house-keeping" agents. Users of EMADS may be data providers, DM algorithm contributors, or miners of data. The current functionality of EMADS is limited to classification and meta association rule mining. Figure 2 offers a high-level view of EMADS.

Fig. 2
figure 2

High level view of EMADS conceptual framework [1]

Danish [8] proposed a DM architecture called "CAKE" (Classifying, Associating and Knowledge DiscovEry) that uses centralized metadata containing all the rules of classification and association, along with the data structure details. A web interface provides users with the means to view the results. Future work on CAKE will improve its ability to deal with heterogeneous data sources and complex mining needs. Figure 3 shows the architecture of CAKE. Rule-definer agents define the metadata of the data warehouse on the basis of the rules specified by the users. These rules are then used by the "Intelligent Data Mining Agents" for DM and by the "Knowledge Discovery Agents" to derive knowledge from the defined patterns. "Intelligent Data Mining Agents" are a group of agents that can be set up to work on a specified set of data with defined rules at any location.

Fig. 3
figure 3

CAKE (Architecture) [8]

Chen et al. [5] proposed a DDM system to effectively solve the problems caused by network bandwidth limits, data privacy, and system incompatibility when mining distributed data with a traditional centralized DM model. Figure 4 illustrates the system architecture. Taking into account the complexity of the data processing and the feasibility and reliability of system realization, the Data Mining Agent (DMA), which plays the core role in the system, executes the mining task, while the agent coordinator negotiates and synchronizes among the modules.

Fig. 4
figure 4

DDM system architecture proposed in [5]

3.2 DDM systems based on meta-learning

Meta-learning is a technique that deals with the problem of computing a "global" classifier from large and inherently distributed databases. It applies learning programs to a collection of independent, distributed databases in parallel to compute a number of independent classifiers (concepts or models). The resulting "base classifiers" are then collected and combined by another learning process; meta-learning thus seeks to compute a "meta-classifier" that integrates the separately learned classifiers in some principled fashion to boost overall predictive accuracy [45].
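The following simplified Java sketch illustrates only the combination idea. It uses a fixed weighted vote as the meta-level combiner, whereas a true meta-classifier as described in [45] would itself be learned from the base classifiers' outputs; all names are hypothetical.

```java
import java.util.List;
import java.util.function.Function;

// Simplified sketch of the meta-learning idea: base classifiers are learned
// independently on distributed databases; a meta-level combiner then merges
// their predictions. Here the combiner is a plain weighted vote.
public class MetaClassifierSketch {

    record WeightedClassifier(Function<double[], Integer> model, double weight) {}

    static int metaPredict(List<WeightedClassifier> baseClassifiers, double[] x) {
        double score = 0.0;
        for (WeightedClassifier c : baseClassifiers) {
            // Base classifiers output 0 or 1; map to -1/+1 and accumulate.
            score += c.weight() * (c.model().apply(x) == 1 ? 1.0 : -1.0);
        }
        return score >= 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        // Two toy base classifiers "learned" at different sites.
        WeightedClassifier siteA = new WeightedClassifier(x -> x[0] > 0.5 ? 1 : 0, 0.7);
        WeightedClassifier siteB = new WeightedClassifier(x -> x[1] > 0.3 ? 1 : 0, 0.3);
        System.out.println(metaPredict(List.of(siteA, siteB), new double[]{0.8, 0.1})); // 1
    }
}
```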

The main purposes of designing this kind of system are to improve the quality of the selection and composition of DM algorithms and to select a reasonable DM model according to the relevance of different data sources. Tozicka et al. [52] proposed a framework for agent-based distributed machine learning and DM based on (1) the exchange of meta-level descriptions of individual learning processes among agents and (2) online reasoning about learning success and learning progress by learning agents. Figure 5 illustrates a generic model of a learning step. To improve the utility of the framework, the communication cost should be considered first; secondly, experiments with agents using completely different learning algorithms (e.g., symbolic and numerical) should be executed.

Fig. 5
figure 5

A generic model of a learning step [52]

Dam et al. [7] proposed an evolutionary-based online learning system called XCS in conjunction with the knowledge probing technique. XCS is a genetic-based machine learning algorithm that applies a reinforcement learning scheme. Luo et al. [37] considered an execution engine as the kernel of the system to provide mining strategies and services, and proposed an extensible architecture for this engine based on a mature multi-agent environment which connects different computing hosts to support intensive computing and complex process control via distribution (see Fig. 6). Reuse of existing mining algorithms is achieved by encapsulating them into agents. Luo et al. also define a DM workflow as the input of the engine and detail the coordination process of the various agents that process it.

Fig. 6
figure 6

System architecture of execution engine [37]

Yang et al. [56] proposed a Service Oriented Architecture for Knowledge Discovery (SOA4KD) (see Fig. 7), which selects and executes knowledge discovery algorithms using meta-learning and semantic web services. The user requirement is divided into a content part and a quality part. An extended knowledge discovery task ontology is proposed that can acquire user requirements through a natural language interface along with a domain ontology. A Knowledge Discovery Service (KDS) quality ontology is also proposed that considers the unique characteristics of KDS as well as the characteristics of general services. In this ontology, meta-learning is used to select the most appropriate KDS according to the user requirements. However, to guarantee the reliability and integrity of user needs expressed in natural language, the needs are constrained to a given set.

Fig. 7
figure 7

Architecture of SOA4KD [56]

3.3 DDM systems based on Grid

DM in a grid [13, 30] computing environment represents a specific incarnation of DDM motivated by resource sharing via local and wide area networks [48]. Increased performance, scalability, access, and resource exploitation are the key drivers behind such endeavors. However, several factors hamper large-scale DM applications on a grid. To begin with, Grid computing [50, 58, 62] itself is relatively new, and the relevant standards and technologies are still evolving. A plethora of DM technologies and a staggering number of largely varying DM application scenarios further complicate matters. Finally, DM clients range from highly domain-oriented end users to technology-aware specialists. For highly domain-oriented end users, user transparency and ease of use are paramount, whereas technology-aware specialists need to control certain detailed aspects of DM and grid technology [48].

Today, new DDM projects aim to mine data in geographically distributed environments based on grid standards and platforms, in order to hide the complexity of heterogeneous data and low-level details. Their architectures are therefore becoming more sophisticated, so that they can both articulate with grid platforms and supply a user-friendly interface for transparently executing DM tasks. When running computationally intensive processes such as DM operations in a dynamic grid environment, it is advantageous to have an accurate representation of the available resources and their current status. A grid-enabled environment has the potential to solve this problem by providing core processing capabilities with secure, reliable, and scalable high-bandwidth access to various distributed data sources and formats across various administrative domains [59].

Based on the principles of SOA [17, 24, 31, 32, 54, 60], standardization, and open source, Stankovski et al. [48] proposed a DDM system based on a Data Mining Grid (DMG). Figure 8 depicts the DMG system architecture in four layers. Generally, components in higher layers make use of components organized in lower layers. The layer at the bottom represents software and hardware resources; the Globus Toolkit 4 layer depicts some of the system's core grid middleware components; the high-level services layer shows components providing central DMG services; and the client components layer depicts the client-side components of DMG applications.

Fig. 8
figure 8

The DMG system architecture [48]

Cesario et al. [3] proposed a general DM architectural model (see Fig. 9) that can be exploited for different DM algorithms deployed as Grid services for the analysis of dispersed data sources. In the future, they intend to deploy more DM algorithms (for clustering, frequent item sets, and association rules). Since the proposed architecture provides neither dataset-splitting functionality nor a data transfer utility, they plan to build a framework that offers this functionality to users.

Fig. 9
figure 9

Typical architecture of a DDM algorithm [3]

Luo et al. [36] systematically analyzed the issues of agent Grid and implemented an Agent Grid Intelligent Platform (AGrIP) which provides an infrastructure for agent-based DDM in a Grid environment. They proposed a four-layer model for AGrIP platform from an implementation point of view, as illustrated in Fig. 10:

Fig. 10
figure 10

Architecture of AGrIP [36]

  • Common resources: various resources distributed in Grid environment, such as workstations, personal computers, computer clusters, storage equipment, databases, data sets, or others, which run on Unix, NT, and other operating systems.

  • Agent environment: the kernel of Grid computing, which is responsible for resource location and allocation, authentication, unified information access, communication, task assignment, and the agent library.

  • Developing toolkit: the development environment, containing agent creation, information retrieval, and distributed DM, to let users effectively use Grid resources.

  • Application service: certain agents organized automatically for specific application purposes, such as e-science, e-business, decision support, and bio-information.

3.4 Issues of the modern DDM systems

Based on the above analysis of modern DDM systems, three main issues are summarized as follows:

  • Most DDM systems have closed architectures; this lack of openness and platform independence makes it difficult to dynamically manage the DM algorithms. In e-business, however, decision-making support cases (such as customer segmentation, personalized service, and cross-selling) are complex and need to be solved by dynamically combining multiple DM algorithms.

  • Data source relevance has not been given enough consideration, and a single DDM approach cannot guarantee the quality of the final global result.

  • There is a lack of effective methods to integrate the results of local DM.

4 A novel DDM model for e-business

Based on our years of practical experience in the data mining area, and in order to improve the efficiency of e-business DDM systems and address the common issues summarized in Sect. 3.4, we propose a Data source Relevance based Hierarchical Parallel Distributed data mining Model (DRHPDM) (see Fig. 11) that adopts web service [11, 25, 39] and multi-agent [41] technologies.

Fig. 11
figure 11

Hierarchical parallel DDM model in E-Business

4.1 Features of DRHPDM

The main features of DRHPDM are listed as follows:

  • To improve the openness, cross-platform ability, and intelligence of the DDM system, web services encapsulating the DM algorithms and multi-agent technology are adopted. Web service is a distributed computing model featuring platform independence, unified data representation, and support for component reuse [11, 25, 39]; it can effectively realize the publishing, discovering, and calling of function bodies [47]. Multi-agent technology has the features of agents (e.g., autonomy, initiative, and adaptability) and can realize cooperation among agents, effectively supporting the execution of DDM from the local to the global level [41] (a minimal sketch of this layering follows the list).

  • The data sources which have strong relevance to one another will construct the Local Centralized Data Mining Layer (LCDML), which can improve the quality of the final global DDM results.

  • When mining a local data source, other local resources can help realize parallel mining. Together, the data source and the other resources of the local site construct the Local Parallel Data Mining Layer (LPDML).

  • The local mining results will be transferred to the Global Processing Unit (GPU) that is responsible for integrating the local results in order to produce the final global results for DDM.
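The sketch below is a hypothetical Java illustration of this layering, assuming a service-style interface for DM algorithms and a thread pool standing in for parallel local sites; names such as MiningService and the simple integration logic are ours and are not part of DRHPDM's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the DRHPDM layering: each DM algorithm sits behind a
// service-style interface, local sites mine in parallel (the LPDML idea), and a
// global unit integrates the resulting local models.
public class DrhpdmSketch {

    interface MiningService {                        // would be published as a web service
        Map<String, Double> mine(List<double[]> localData);
    }

    static Map<String, Double> integrate(List<Map<String, Double>> localModels) {
        Map<String, Double> global = new HashMap<>(); // GPU-side integration placeholder
        for (Map<String, Double> m : localModels) {
            m.forEach((k, v) -> global.merge(k, v, Double::sum));
        }
        return global;
    }

    public static void main(String[] args) throws Exception {
        MiningService dummyAlgorithm = data -> Map.of("recordsSeen", (double) data.size());
        List<List<double[]>> sites = List.of(
                List.of(new double[]{1, 2}),
                List.of(new double[]{3, 4}, new double[]{5, 6}));

        // Local Parallel Data Mining Layer: each site mines its own data concurrently.
        ExecutorService pool = Executors.newFixedThreadPool(sites.size());
        List<Future<Map<String, Double>>> futures = new ArrayList<>();
        for (List<double[]> data : sites) {
            futures.add(pool.submit(() -> dummyAlgorithm.mine(data)));
        }
        List<Map<String, Double>> localModels = new ArrayList<>();
        for (Future<Map<String, Double>> f : futures) {
            localModels.add(f.get());                // only models travel back, not raw data
        }
        pool.shutdown();

        System.out.println(integrate(localModels));  // {recordsSeen=3.0}
    }
}
```

In the actual model, the service interface would be a published web service and the integration step would be carried out by the agents described in Sect. 4.4.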

4.2 Workflow of DRHPDM

For an e-business enterprise whose distributed sites (holding different data sources) are connected by the Internet, the following workflow is activated when a user (decision maker, market analyzer, data analyzer, etc.) wants to extract knowledge from a large amount of data (see Figs. 12, 13; a sketch of the source-grouping step follows the list):

Fig. 12
figure 12

Workflow of DRHPDM

Fig. 13
figure 13

Dataflow of DRHPDM from step 1 to step 5

  • Step 1: The user logs into the system and submits the DM requirement to the GPU.

  • Step 2: On the GPU, the DM requirement is analyzed and divided into a series of DM tasks.

  • Step 3: The tasks are then broadcast to all the distributed sites holding data sources that belong to the user's corporation.

  • Step 4: Since a single site can deal with different businesses, it may hold different data sources corresponding to those businesses. When a site holding task-related local data sources receives a DM task, it registers its information (including the IP address, the data file, and the data type) with the GPU.

  • Step 5: On the GPU, according to the registered information, the data sources are divided into sets corresponding to different DM tasks. The relevance among the data sources of the same set will be appraised by a special program (see Sect. 4.3). After the appraisal, the data sources that have high relevance to one another are grouped together to produce centralized DM layers (or LCDMLs—see Sect. 4.1); the others will be discretely mined to produce the parallel DM layers (or LPDMLs—see Sect. 4.1). Finally, the GPU notifies the related data sources about the grouping results.

  • Step 6: The data sources, which are notified to be discretely mined, will implement locally parallel DM in their sites. The data sources, which are notified to be centrally mined, will transfer necessary data to the Central Data Mining Unit (CDMU). The CDMU can be one of the sites holding the data sources, or another site with enough computing and storage resources can be used.

  • Step 7: The DM results (models) are transferred to the Local Managing Agent (LMA) (see Sect. 4.4) for local integration. The integrated models are then transferred to the GPU for global integration.

  • Step 8: To improve the readability, the final global model is transformed and submitted to the user.
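As an illustration of Steps 4–5, the following hedged Java sketch groups registered data sources by pairwise relevance: sources above a threshold form a centralized group (a future LCDML), while the rest remain singletons mined in parallel (LPDMLs). The relevance function here is a toy stand-in for the ontology-based measure of Sect. 4.3, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hedged sketch of steps 4-5: registered data sources are grouped by pairwise
// relevance. Sources whose relevance exceeds a threshold share a group (LCDML);
// singletons are mined in parallel on their own sites (LPDML).
public class SourceGrouping {

    record DataSource(String ip, String dataFile) {}

    static List<List<DataSource>> group(List<DataSource> registered,
                                        BiFunction<DataSource, DataSource, Double> relevance,
                                        double threshold) {
        List<List<DataSource>> groups = new ArrayList<>();
        for (DataSource src : registered) {
            boolean placed = false;
            for (List<DataSource> g : groups) {
                // Join an existing group if relevance to its first member is high enough.
                if (relevance.apply(g.get(0), src) >= threshold) {
                    g.add(src);
                    placed = true;
                    break;
                }
            }
            if (!placed) groups.add(new ArrayList<>(List.of(src)));
        }
        return groups; // groups of size > 1 become LCDMLs, singletons become LPDMLs
    }

    public static void main(String[] args) {
        List<DataSource> registered = List.of(
                new DataSource("192.168.0.11", "orders_a.log"),
                new DataSource("192.168.0.12", "orders_b.log"),
                new DataSource("192.168.0.13", "clickstream.log"));
        // Toy relevance: sources are related if their file names share a prefix.
        BiFunction<DataSource, DataSource, Double> rel =
                (a, b) -> a.dataFile().split("_")[0].equals(b.dataFile().split("_")[0]) ? 1.0 : 0.0;
        System.out.println(group(registered, rel, 0.8));
    }
}
```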

4.3 Measuring the data source relevance based on ontology and semantics

Ontology [18, 55] serves as a metadata schema, providing a controlled vocabulary of concepts, each with explicitly defined and machine-processable semantics. By defining shared and common domain theories, ontology helps people and machines communicate concisely by supporting the exchange of semantics rather than just syntax [38]. Semantic models based on ontology can accurately and completely depict concepts and the relevance among them [40], and interaction among data sources at the ontology semantic model layer offers completeness, accuracy, and efficiency. As a result, using ontology and semantics technology, we can build a data source relevance measuring model that measures the relevance of the DM task-related databases in the corresponding data sources. Databases with strong relevance should then be centrally mined in order to guarantee the quality of the final DM result.

As shown in Fig. 14, the DM task-related database is first reverse-engineered to produce its E-R (Entity Relation) model. To improve the quality of the E-R model, instance data are also used for learning. Subsequently, the E-R model is translated into the data source ontology and expressed in the Web Ontology Language (OWL).

Fig. 14
figure 14

Measuring data source relevance

The data source semantics model is defined as follows:

$$ DSSM = \left\{ G(V,E), \Gamma, \Lambda, N, T, A^{o} \right\} $$
(1)

where \( G \) denotes a graph based on a Unified Modeling Language (UML) class diagram; the node set \( V \) corresponds to the entity set, and the edge set \( E \) denotes the relations among entity concepts; \( \Gamma = \{ c_{1}, c_{2}, \ldots, c_{n} \} \) denotes the set of entity concepts; \( \Lambda = \{ r_{1}, r_{2}, \ldots, r_{n} \} \) denotes the set of entity relations; the mapping between \( V \) and \( \Gamma \) is given by the function set \( N \); the mapping between \( E \) and \( \Lambda \) is given by the function set \( T \); and \( A^{o} \) is the set of restrictive axioms on \( \Lambda \).

Subsequently, the relevance of the entity concepts (vocabulary) belonging to the data source semantic model is measured and the semantic relevance is computed. Depending on the situation, one of the following two measuring methods can be chosen (a simplified sketch follows the list):

  • Measuring semantic relevance based on a concept lattice: the formal context (G ∪ {g}, M, J) is built according to domain knowledge related to the concrete e-business. The concept lattice is then constructed from the formal context, and the semantic relevance between objects (terms) is computed according to the hierarchy of the concept lattice.

  • Measuring semantic relevance based on HowNet: first, a mechanism for computing the relevance between sememes (atomic or indivisible units of meaning) is defined. The relevance between terms is then computed from the sememe-level results. Finally, the relevance between data sources is derived from the relevance between the terms.
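As an illustration only, the sketch below approximates the lattice-based idea by measuring relevance as the inverse of the distance between two terms in a small concept graph. A real implementation would build the lattice from the formal context and exploit its hierarchy, and a HowNet-based measure would replace the toy relevance function; all names here are hypothetical.

```java
import java.util.*;

// Illustrative sketch: semantic relevance between two terms approximated from
// their distance in a concept hierarchy (a simple adjacency map standing in for
// the concept lattice). relevance = 1 / (1 + shortest path length).
public class LatticeRelevance {

    static int distance(Map<String, List<String>> neighbours, String a, String b) {
        if (a.equals(b)) return 0;
        Map<String, Integer> dist = new HashMap<>(Map.of(a, 0));
        Deque<String> queue = new ArrayDeque<>(List.of(a));
        while (!queue.isEmpty()) {
            String cur = queue.poll();
            for (String next : neighbours.getOrDefault(cur, List.of())) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(cur) + 1);
                    if (next.equals(b)) return dist.get(next);
                    queue.add(next);
                }
            }
        }
        return Integer.MAX_VALUE; // not connected
    }

    static double relevance(Map<String, List<String>> lattice, String t1, String t2) {
        int d = distance(lattice, t1, t2);
        return d == Integer.MAX_VALUE ? 0.0 : 1.0 / (1 + d);
    }

    public static void main(String[] args) {
        // Tiny undirected concept graph: order - payment - invoice, customer - order.
        Map<String, List<String>> lattice = Map.of(
                "order", List.of("payment", "customer"),
                "payment", List.of("order", "invoice"),
                "invoice", List.of("payment"),
                "customer", List.of("order"));
        System.out.printf("rel(order, invoice) = %.2f%n", relevance(lattice, "order", "invoice")); // 0.33
    }
}
```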

4.4 Knowledge integration model

In the knowledge integration model, agents share the mined knowledge with each other. As shown in Fig. 15, the Local Managing Agent (LMA) is responsible for collecting, recording, and publishing the knowledge mined by the Data Mining Agents (DMAs). When a DMA at a local site needs knowledge, it first verifies whether other sites can provide it; if so, the knowledge is transferred from those sites to the local site, and otherwise the DMA mines the knowledge locally. If an LMA has more knowledge than it can hold, the DMAs are divided into blocks, with strongly related DMAs placed in the same block. The knowledge mined in one block can be interchanged with other blocks through the Global Integrating Agent (GIA), which resides in the GPU and is responsible for producing the global knowledge.
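A minimal sketch of the LMA/DMA interaction follows, assuming a simple in-memory registry of published knowledge (all names are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hedged sketch of the LMA/DMA interaction: before mining, a DMA asks the local
// managing agent whether the needed knowledge has already been published by
// another site; only on a miss does it mine locally and publish the result.
public class LocalManagingAgent {

    private final Map<String, List<String>> published = new ConcurrentHashMap<>();

    // Called by a DMA: topic identifies the knowledge, minerIfMissing does the local mining.
    List<String> request(String topic, Supplier<List<String>> minerIfMissing) {
        return published.computeIfAbsent(topic, t -> {
            System.out.println("No published knowledge for '" + t + "' - mining locally");
            return minerIfMissing.get();          // mine locally, then publish
        });
    }

    public static void main(String[] args) {
        LocalManagingAgent lma = new LocalManagingAgent();
        Supplier<List<String>> miner = () -> List.of("{milk,bread} -> {butter}");
        System.out.println(lma.request("basket-rules", miner)); // mines and publishes
        System.out.println(lma.request("basket-rules", miner)); // served from published knowledge
    }
}
```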

Fig. 15
figure 15

Knowledge-integrating model

4.5 Web service composition

A complicated e-business mining task often needs multiple DM algorithms to cooperate. In DRHPDM, the algorithms are encapsulated into web services; it is necessary to study the web service composition method to realize flexible cooperation among DM algorithms.

Figure 16 depicts the whole procedure of web service composition (a simplified sketch follows the steps):

Fig. 16
figure 16

Web service composition model

  • Step 1: A service applicant (e.g., a DMA) submits its service requirement, including its purpose and constraints (e.g., Quality of Service (QoS)), to the composition system;

  • Step 2: A compositing agent analyzes the requirement and passes the core content to the Planning and Designing Subsystem (PDS). The PDS then produces several functional flows and performs reasoning and verification for these flows. The PDS accesses the web service library and chooses the most suitable services; finally, it derives the execution flow from the functional flows.

  • Step 3: According to the execution flow and QoS, the Appraisal and Optimizing Subsystem (AOS) produces the optimized suitable execution flow.

  • Step 4: The Execution and Monitoring Subsystem (EMS) receives the execution flow from the AOS and registers it so that it can be used by users.

  • Step 5: The composition results (including web service and the related semantics) are provided to the service applicant.
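The following hedged Java sketch compresses the PDS/AOS/EMS pipeline into a single selection step: for each required capability it picks the service with the best QoS score and returns the resulting execution flow. It illustrates the composition idea only and is not a model of the actual subsystems; all names are hypothetical.

```java
import java.util.Comparator;
import java.util.List;

// Simplified sketch of the composition pipeline: the planning step filters the
// service library by required capability, the optimizing step ranks candidates
// by a QoS score, and the winning services form the execution flow.
public class CompositionSketch {

    record DmService(String name, String capability, double qosScore) {}

    static List<DmService> planAndOptimize(List<DmService> library,
                                           List<String> requiredCapabilities) {
        return requiredCapabilities.stream()
                .map(cap -> library.stream()
                        .filter(s -> s.capability().equals(cap))
                        .max(Comparator.comparingDouble(DmService::qosScore)) // QoS-based choice
                        .orElseThrow(() -> new IllegalStateException("no service for " + cap)))
                .toList();                                                    // execution flow
    }

    public static void main(String[] args) {
        List<DmService> library = List.of(
                new DmService("CleanerA", "preprocess", 0.7),
                new DmService("CleanerB", "preprocess", 0.9),
                new DmService("KMeansSvc", "cluster", 0.8));
        System.out.println(planAndOptimize(library, List.of("preprocess", "cluster")));
        // [DmService[name=CleanerB, ...], DmService[name=KMeansSvc, ...]]
    }
}
```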

5 Experimental evaluation

Nowadays, Web Usage Mining (WUM) [9, 12] has become a powerful way of realizing BI. Accurate Web usage information can help attract new customers, retain current customers, improve cross-marketing or sales, increase the effectiveness of promotional campaigns, track departing customers, and serve as the most effective logical structure for a user's Web space [22, 27]. In this section, we verify the feasibility of DRHPDM with a WUM case. We performed the experiments under the scenario that an e-business manager wants to improve the website topology using WUM, with the condition that the log files are distributed across different local websites.

To establish the experimental environment, we adopted real server log files that record visitors' access behavior (such as time, page visits, and IP addresses). The log files were provided by a Chinese e-business website located in Hebei Province, whose infrastructure consists of a LAN with 10 heterogeneous nodes. To simulate the real world, the server log files were divided into six parts that were randomly distributed to six nodes, and the DM algorithms were encapsulated into web services deployed on different nodes. We also built a prototype system of DRHPDM using block-based design methods to guarantee its scalability; the system was developed in Java to guarantee its cross-platform ability.

The workflow of the prototype system was designed according to the description in Sect. 4.2. The main experimental steps are listed as follows:

  • Step 1: After a series of preprocessing operations, including data cleaning, user identification, session identification, and path identification, we identified 320 sessions from 5,236 users, and the total number of URLs decreased from 2,120 to 210;

  • Step 2: As Fig. 17a shows, the LAN nodes connected to the local machine were displayed;

    Fig. 17
    figure 17

    Prototype system of DRHPDM

  • Step 3: As Fig. 17b shows, the listed nodes are those that have task-related data (log files) and have registered themselves with the local machine, which works as the GPU and runs the system;

  • Step 4: As Fig. 17c shows, the nodes holding the task-related data were divided into three groups, according to the relevance among them;

  • Step 5: When the node selected in the “DM Web Service” listbox was changed, the DM algorithms provided by that node in web service form were presented in the “DM Algorithm” listbox. As Fig. 17c shows, we selected the “K-Means” algorithm provided by the LAN node with IP “192.168.0.10” to mine patterns from the log files. The parameters to be supplied were listed in the “Parameters Setting” area;

  • Step 6: As Fig. 17d shows, after the “Run” button was clicked, the URLs were clustered into five main parts by the distributed K-Means algorithm in about 20 s (a sketch of the centroid-merging idea behind distributed K-Means is given below).
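For illustration, the sketch below shows the kind of centroid merging a distributed K-Means can perform at the coordinating site: each node reports only per-cluster sums and counts, and the global centroids are recomputed from these statistics. This is a simplified single-update sketch under our own assumptions, not the prototype's actual implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hedged sketch of one global update in a distributed K-Means: each site reports,
// per cluster, the sum of its assigned points and their count; the coordinator
// merges these statistics into global centroids without ever seeing raw sessions.
public class DistributedKMeansMerge {

    record LocalStats(double[][] sums, long[] counts) {} // indexed by cluster

    static double[][] mergeCentroids(List<LocalStats> perSite, int k, int dim) {
        double[][] sums = new double[k][dim];
        long[] counts = new long[k];
        for (LocalStats s : perSite) {
            for (int c = 0; c < k; c++) {
                counts[c] += s.counts()[c];
                for (int d = 0; d < dim; d++) sums[c][d] += s.sums()[c][d];
            }
        }
        double[][] centroids = new double[k][dim];
        for (int c = 0; c < k; c++)
            for (int d = 0; d < dim; d++)
                centroids[c][d] = counts[c] == 0 ? 0 : sums[c][d] / counts[c];
        return centroids;
    }

    public static void main(String[] args) {
        LocalStats siteA = new LocalStats(new double[][]{{4, 6}, {1, 1}}, new long[]{2, 1});
        LocalStats siteB = new LocalStats(new double[][]{{2, 2}, {3, 5}}, new long[]{1, 2});
        System.out.println(Arrays.deepToString(mergeCentroids(List.of(siteA, siteB), 2, 2)));
        // [[2.0, 2.666...], [1.333..., 2.0]]
    }
}
```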

According to the clusters shown in Fig. 17d, the web page URLs numbered from 82 to 112 are clustered together. When checking their relations using the website topology, we found that they all belong to the “online payment” module. In other words, the clustering results are consistent with the real business module of the website.

On the other hand, the URL numbered 37 was clustered together with the URLs numbered 140 to 156, although according to the website topology, page 37 has no direct hyperlinks to pages 140 to 156 (which belong to the “personal information management” module). To analyze this phenomenon in depth, we checked the content of page 37 and found that it provides the entrance to the “self-assistant website construction” function. In other words, the cluster reveals a regular pattern: most of the website's visitors tend to access the “personal information management” module and the “self-assistant website construction” module in the same session. There are over 20 modules (such as user registration, personal information management, self-assistant website construction, games, job advertising, and bargains) on this e-business website. As a result, we can add hyperlinks from page 37 to the homepage of the “personal information management” module to help users find information more quickly, thereby enhancing the user experience and improving information-seeking efficiency. In conclusion, WUM using our prototype system does provide the capability to help improve the website topology.

6 Conclusion and future research

Nowadays, with the support of information technology, e-business has been growing rapidly, and how to make use of e-business data to support decision making has become a main focus of the BI area. In this paper, we reported an in-depth investigation of the issues of DDM in the e-business data environment and of modern DDM systems. Based on the literature review and our analysis, we proposed the Data source Relevance based Hierarchical Parallel Distributed data mining Model (DRHPDM). To give enough consideration to data relevance and to improve the final mining result, the data sources are divided into two kinds of layers: a centralized mining layer and a parallel (distributed) mining layer. By adopting web service and multi-agent technologies, DRHPDM shows the capability to realize flexible, cross-platform mining.

In the future, we will conduct more research on the realization of local centralized data mining and on the threshold of relevance among data sources. We will also involve a variety of users in systematically evaluating the prototype system, in order to further improve the usability, efficiency, and effectiveness of both the prototype and the DRHPDM model. The proposed DRHPDM model is transformative and can be applied to other fields such as e-learning and e-government. Finally, it can be further extended and integrated with recent advances in distributed text mining [51, 63] to help discover more hidden knowledge, patterns, and insights.