
1 The Basic Content of Network Data Mining

1.1 Data Mining and Network Data Mining

Data mining is the process of extracting useful information and knowledge that people do not know in advance from large volumes of incomplete, noisy, fuzzy, or random data. Network data mining is the application of data mining technology to network information processing: it discovers relationships and rules in data drawn from Web sites. Each site on the Web is a data source, and each site is organized differently, so together they constitute a large heterogeneous database environment. Network data mining therefore not only draws on the general data mining techniques developed for standard databases, but also exploits the characteristics of network data with specially adapted methods.

1.2 The Distinction Between Network Data Mining and Network Information Retrieval

A network information retrieval system consists of a Robot, an index database, and a query engine. The information-gathering Robot traverses the WWW to discover as much new information as possible. Full-text indexing technology stores the collected information in the index database; the query engine receives and analyzes the user's query, traverses the index database using a relatively simple matching strategy, and finally returns the resulting set of addresses to the user. Such a system can only handle simple keyword targets; it cannot handle complex, fuzzy targets given by the user in the form of samples. Network data mining technology builds on the outstanding achievements of network information retrieval, such as the Robot and full-text search, and integrates inductive learning, machine learning, and statistical analysis methods with techniques from artificial intelligence, pattern recognition, and neural networks. The biggest difference between a network data mining system and network information retrieval is that it can search for information according to a purpose defined by the user, locating information in the network or in a repository according to the features of the target.

1.3 Types of Network Data Mining

  1. Web content mining

     Web content mining is the process of discovering useful information, data, and documents on the network; its objects include the Web pages searched by search engines. Network information resources are of many types. From the perspective of information sources, they include not only WWW information resources but also Gopher, FTP, and UseNet resources, private data hidden behind WWW forms, and data held in database management and information systems that cannot be indexed. From the perspective of resource form, they include text, images, audio, video, and other forms of data.

  2. Network structure mining

     Network structure mining mines the potential link structure patterns of the Web. The idea stems from citation analysis: by analyzing the links of a Web page, the number of links, and the linked objects, a Web link structure model is created. This model can be used to classify Web pages, and information about the similarity and degree of association between different pages can be obtained from it. Network structure mining helps users find authoritative sites on a topic of interest, and can also find links related to that topic.

  3. Network usage mining

     Network usage mining is mainly used to understand the meaning of network behavior data. Whereas the objects of Web content mining and network structure mining are the original online data, network usage mining deals with the second-hand data extracted during the interaction between users and the network, including Web server access records, proxy server logs, browser logs, user profiles, registration information, user sessions or transactions, user queries, browsing styles, and so on. A sketch of parsing such access records is given after this list.
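
As a concrete illustration of the second-hand data involved in usage mining, the short Python sketch below parses Web server access-log lines and counts successful page requests per client. The log format, field names, and sample lines are assumptions for illustration only; a real deployment would work with whatever its server actually records.

```python
import re
from collections import Counter

# Assumed example: "common log format" lines; real logs may differ.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

def pages_per_host(lines):
    """Count successful page requests per client host - raw material for usage mining."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["status"].startswith("2"):
            counts[rec["host"]] += 1
    return counts

if __name__ == "__main__":
    sample = [
        '192.168.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
        '192.168.0.2 - - [10/Oct/2023:13:55:40 +0000] "GET /about.html HTTP/1.1" 404 512',
    ]
    print(pages_per_host(sample))  # Counter({'192.168.0.1': 1})
```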

1.4 The General Steps of Network Data Mining

  1. Establish target samples. The target texts selected by the user serve as the basis for extracting the user's feature information. This captures the individual requirements of the user, so that data mining is carried out according to the user's needs.

  2. Establish a statistical dictionary. The main dictionary and thesaurus used for feature extraction and word-frequency statistics are built. Then, according to the word-frequency distribution of the target samples, the feature vectors of the mining target are extracted from the statistical dictionary and the corresponding weights are computed. Because the feature-item weights and the matching threshold are closely tied to the weights and matching threshold of the feedback sample's feature items, the feature vectors are adjusted according to these sample data. With these target sample feature vectors, traditional search engine technology can be used to collect information.

  3. Finally, the feature vector of the acquired information is matched against the feature vector of the target sample, and the information that meets the threshold condition is presented to the user; a sketch of this matching step is given after this list.
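
The matching step described above can be sketched very simply: represent the target sample and each retrieved document as word-frequency feature vectors over a statistical dictionary, and keep only documents whose cosine similarity to the target meets a threshold. This is a minimal illustration of the idea; the dictionary, the weighting scheme, and the threshold value below are invented for the example, and a real system would choose them differently.

```python
import math
from collections import Counter

def feature_vector(text, dictionary):
    """Word-frequency feature vector restricted to terms in the statistical dictionary."""
    freq = Counter(w for w in text.lower().split() if w in dictionary)
    return {term: freq[term] for term in dictionary}

def cosine(u, v):
    """Cosine similarity between two term-weight vectors over the same dictionary."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def match(target_text, documents, dictionary, threshold=0.3):
    """Return the documents whose feature vectors meet the similarity threshold."""
    target = feature_vector(target_text, dictionary)
    results = []
    for doc in documents:
        score = cosine(target, feature_vector(doc, dictionary))
        if score >= threshold:
            results.append((score, doc))
    return sorted(results, reverse=True)

# Hypothetical example data
dictionary = {"xml", "mining", "network", "data"}
docs = ["XML makes network data mining easier", "Cooking recipes and travel notes"]
print(match("network data mining with XML", docs, dictionary))
```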

2 Network Data Mining Technology

The primary technical problems that Web data mining must solve are the semi-structured data source model and the querying and integration of semi-structured data. There must be a model that describes the data on the Web, so finding a semi-structured data model is the key to solving the problem. In addition, a semi-structured model extraction technique is needed, that is, a technique for automatically extracting the model from existing data. The extensible markup language (XML) can be regarded as a semi-structured data model; the attributes described in an XML document can easily be put in correspondence with a relational database, enabling precise queries and model extraction. Using XML, Web designers can not only create text and graphics, but also build multi-level document type definitions, interdependent systems, data trees, metadata, hyperlinks, structures, and style sheets.
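
To make the idea of a semi-structured, self-describing model concrete, the sketch below parses a small hand-written XML fragment with Python's standard xml.etree.ElementTree module; the element and attribute names are invented for illustration only.

```python
import xml.etree.ElementTree as ET

# A hypothetical, self-describing record of one collected Web page.
DOC = """
<page url="http://example.com/a" lang="en">
  <title>Network data mining</title>
  <keywords>
    <keyword>XML</keyword>
    <keyword>data mining</keyword>
  </keywords>
</page>
"""

root = ET.fromstring(DOC)
# The element names and attributes carry the structure, so no external description is needed.
print(root.tag, root.attrib["url"])            # page http://example.com/a
print(root.findtext("title"))                  # Network data mining
print([k.text for k in root.iter("keyword")])  # ['XML', 'data mining']
```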

2.1 Using XML to Uniquely Identify Web Data

Without XML, every database search program must understand how each database is built, because the formats in which databases describe their data are almost always different. Because of the problems of integrating data from different sources, it is currently all but impossible to search across a variety of incompatible databases. With XML, the structures of data from different sources can easily be combined. The three-tier architecture of XML is shown in Fig. 1.

Fig. 1 The three-tier structure of XML

At the middle-tier server, software agents can integrate data from back-end databases and other applications. The data can then be sent to clients or other servers for further aggregation, processing, and distribution. XML-based data is self-describing: it can be exchanged and processed without an accompanying internal description. Using XML, users can easily perform local computation and processing; the data in XML format is sent to the client, and the client can parse it with application software and edit and process it.
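
A middle-tier agent of the kind described above might, for example, pull records from two back-end sources with incompatible field names, normalize them into one common XML structure, and hand the result to the client. The sketch below shows this idea under the assumption of two invented source formats; the tag and field names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Two hypothetical back-end sources with incompatible field names.
SOURCE_A = [{"title": "XML and mining", "link": "http://a.example/1"}]
SOURCE_B = [{"headline": "Semi-structured data", "url": "http://b.example/2"}]

def to_common_xml(records_a, records_b):
    """Middle-tier integration: merge both sources into one common XML document."""
    root = ET.Element("results")
    for r in records_a:
        item = ET.SubElement(root, "item", source="A", url=r["link"])
        ET.SubElement(item, "title").text = r["title"]
    for r in records_b:
        item = ET.SubElement(root, "item", source="B", url=r["url"])
        ET.SubElement(item, "title").text = r["headline"]
    return ET.tostring(root, encoding="unicode")

# The client receives one self-describing XML document, regardless of the back ends.
print(to_common_xml(SOURCE_A, SOURCE_B))
```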

2.2 Using XML to Distribute a Large Part of the Computing Load to the Client

Clients can choose different applications to process the data according to their own needs, while the server only needs to issue the same XML file. The initiative in processing the data passes to the client; the server simply packages the most accurate data it can into an XML file. Because XML is self-describing, the client understands the logical structure and meaning of the data as soon as it receives it, so that wide-ranging, general distributed computing becomes possible. This is also more conducive to meeting the individual needs of users that network information data mining emphasizes.

3 Network Data Mining Model Design

3.1 Network Data Mining Model Design

Network data have the following characteristics: they are heterogeneous, differ from ordinary database data, and are semi-structured. Each site on the Internet is a data source, and each data source has its own style; that is, the information and organization of each site are different, so the Internet is a huge heterogeneous database, unlike a traditional relational database. Data on the Internet are very complex and no uniform model describes them; they have a certain degree of structure, but are self-describing, multi-level, and intricately interrelated, i.e., they are not fully structured data. Given these characteristics of network data, bringing data mining technology to the Internet requires a certain amount of pre-processing work. The network data mining model is shown in Fig. 2.

Fig. 2 Network data mining model

Figure 2 is in fact an improved model of a traditional search engine: data mining technology is loaded into the search engine, enabling network data mining. The XML information preprocessing module is the key link in network data mining; if a conventional Spider or Robot were used to collect the heterogeneous network data and reorganize it according to a unified data structure, the workload would make this difficult to achieve. The emergence of XML has brought hope for solving this problem in network data mining. XML is a simple, open, efficient, and scalable network markup language that conforms to international standards; its scalability and flexibility allow it to describe data in different types of application software, and thus to describe the data records of the collected pages. An XML index of network data is a semi-structured data model, but its document descriptions correspond one-to-one with attributes in a relational database, so precise queries and model extraction can be implemented, completing the integration of heterogeneous network data.
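
The one-to-one correspondence between XML element descriptions and relational attributes can be illustrated with a small sketch that loads parsed XML page records into an in-memory SQLite table and runs a precise SQL query over them; the table name, column names, and sample records below are invented for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical collected page records, already described in a common XML form.
PAGES = """
<pages>
  <page url="http://a.example/1" lang="en"><title>XML mining</title></page>
  <page url="http://b.example/2" lang="de"><title>Datenbanken</title></page>
</pages>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (url TEXT, lang TEXT, title TEXT)")

# Each XML attribute or child element maps directly onto a relational column.
for p in ET.fromstring(PAGES).iter("page"):
    conn.execute(
        "INSERT INTO page VALUES (?, ?, ?)",
        (p.attrib["url"], p.attrib["lang"], p.findtext("title")),
    )

# A precise query over the integrated, originally heterogeneous data.
for row in conn.execute("SELECT url, title FROM page WHERE lang = 'en'"):
    print(row)  # ('http://a.example/1', 'XML mining')
```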

4 Network Data Mining Algorithms Support Intelligent Retrieval

4.1 Network Data Mining Supports Intelligent Retrieval

Whether for personalized information retrieval, content-based retrieval, or even knowledge retrieval, the key to an intelligent Web information retrieval system is knowing what the user needs and providing the user with high-quality, content-relevant knowledge. The support that network data mining gives to intelligent retrieval lies in a deep analysis of both the information the user needs and the network source information, which provides the knowledge necessary for intelligent retrieval.

4.2 User Knowledge Mining

Although each specific user has individual information needs, for the user base as a whole the information needs appear random, which makes it very difficult to analyze the needs of general users. Data mining approaches this from the overall situation, using rich, dynamic online queries and analysis to understand users' information needs. Through online questions and survey forms, the system can obtain information such as the user name, the IP address from which the user accesses the system, and the user's occupation, age, hobbies, and so on. Certain mining rules (such as association rules and online analytical processing) are then applied to fuse and analyze these data, and the result is an information demand model for each user. With comprehensive mining of user-need information, users with similar information needs can be linked together, implementing a retrieval scheme that serves "a minority" at a time. User knowledge mining tools have already entered the practical stage; for example, IBM's DB2 UDB 7.1 is an ideal tool for mining user knowledge.
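
A minimal sketch of the association-rule idea mentioned above: count how often pairs of interests co-occur across user profiles and report the pairs whose support and confidence exceed chosen thresholds. The profile data and thresholds are invented for illustration and are far simpler than what a commercial mining tool would use.

```python
from itertools import combinations
from collections import Counter

# Hypothetical user profiles: each user lists declared interests.
profiles = [
    {"sports", "travel", "music"},
    {"sports", "travel"},
    {"music", "reading"},
    {"sports", "travel", "reading"},
]

def association_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Very small 2-itemset association-rule miner over user interest sets."""
    n = len(transactions)
    item_count = Counter(i for t in transactions for i in t)
    pair_count = Counter(
        frozenset(p) for t in transactions for p in combinations(sorted(t), 2)
    )
    rules = []
    for pair, cnt in pair_count.items():
        support = cnt / n
        if support < min_support:
            continue
        a, b = tuple(pair)
        for lhs, rhs in ((a, b), (b, a)):
            confidence = cnt / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

for lhs, rhs, sup, conf in association_rules(profiles):
    print(f"{lhs} -> {rhs}  support={sup:.2f} confidence={conf:.2f}")
```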

4.3 Network Knowledge Mining

Network knowledge mining seeks the distribution laws of information in vast amounts of highly uncertain data, mines the hidden information, and forms models, thereby discovering regular knowledge. The regularity of network information distribution is the internal correlation of network information. Mining this correlation is mainly reflected in two aspects. The first is Web content mining: applying classification, clustering, and other mining methods to network source information in the form of text, images, audio, video, and metadata in order to find useful information, and organizing that information so that it can be retrieved in suitable ways. The second is network structure mining: by analyzing the links of a Web page, their number, and the linked objects, a Web link structure model is established; this model can be used to classify Web pages and thus to obtain information about the similarity and degree of association between different pages. The link structure model is conducive to intelligent navigation.
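
As an illustration of content mining by clustering, the sketch below groups a few short texts by the cosine similarity of their word-frequency vectors, using a tiny greedy, threshold-based clustering pass. The documents and the threshold are invented for illustration; a real system would use proper feature selection and a full clustering algorithm such as k-means.

```python
import math
from collections import Counter

def vec(text):
    """Word-frequency vector of a text."""
    return Counter(text.lower().split())

def cos(u, v):
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def cluster(texts, threshold=0.25):
    """Greedy single-pass clustering: attach each text to the first similar-enough cluster."""
    clusters = []  # each cluster: (centroid vector, list of member texts)
    for t in texts:
        v = vec(t)
        for centroid, members in clusters:
            if cos(v, centroid) >= threshold:
                members.append(t)
                centroid.update(v)  # crude centroid update by summing counts
                break
        else:
            clusters.append((Counter(v), [t]))
    return [members for _, members in clusters]

docs = [
    "xml network data mining",
    "mining web data with xml",
    "football match results",
    "football league scores",
]
print(cluster(docs))  # two clusters: the XML/mining texts and the football texts
```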

5 Conclusion

Applying data mining techniques to the development of network resources can accelerate the development of intelligent retrieval. The results of data mining are the basis of intelligent retrieval, and the results of intelligent retrieval can in turn provide guidance and clues for data mining. At present, products led by network data mining technology have already been put to use; for example, Net Perceptions developed Net Perception, which can mine user information and thus lay a foundation for realizing personalized information services. If machine learning, pattern recognition, and other artificial intelligence techniques are combined in developing the endless network data resources, network data mining technology will become still more complete in practical applications.

With the rapid development of the Internet and information technology, network information resources have become the bottleneck restricting the further development of network information services. Network data mining is a new branch of data mining technology; it involves network technology, data mining technology, text processing, and artificial intelligence technology. A powerful network data mining system provides users with an information-gathering tool, and the new generation of network data description language, XML, will provide great convenience for the implementation of network data mining.