
9.1 Background

Nowadays, much of the research in fields such as mechatronic systems and social studies is devoted to extracting rules from complicated phenomena through principles, observations, measured data, and logical derivation. These rules are normally summarized as concise, quantitative expressions or “models,” which provide mechanisms for improving the performance of the system represented by the model. In the social studies, social network platforms (e.g., Twitter, Facebook, Sina_micro-blog, Tecent_micro-blog) have attracted millions of users, and academic and industrial researchers study how to model and mine the knowledge behind this massive amount of information; as a result, there has been tremendous interest in social networks. Owing to its rapid development and wide usage, the microblog has attracted the attention of users, enterprises, governments, and researchers, so applied methods and techniques for modeling and control in this field are very important.

As the foundation of micro-blog data mining, data collection is the key phase, and crawling or collecting the relevant micro-blog data effectively and efficiently is important. However, the microblog differs from traditional web applications in many respects. For example, there are many concurrent online users, the interaction and display modes are different, a login operation is required, and AJAX technology is widely used. Traditional web crawlers can only retrieve the corresponding web pages; they cannot obtain the underlying structure, the social relationships, or the users’ backgrounds and fans. In other words, micro-blog data differ from traditional web applications in their login operation, display mode, privacy policy, data processing, and so on, so a traditional web crawler is not suitable for micro-blog data crawling or collection.

This section presents the modeling of a micro-blog data crawler based on simulating browsers’ behaviors. Using this method, we have collected several million micro-blog records in a short time.

9.2 Motivation

Although there has been some research on AJAX-based web pages, those techniques are not suitable for the micro-blog application. To encourage developers to build applications on micro-blog services, some providers offer special APIs, which give developers the possibility of constructing a uniform and universal architecture that uses the APIs to download and save these data automatically. However, a purely API-based method has limitations on access rights, call quotas, special policies, and so on, and some tasks cannot be accomplished by using the official APIs alone.

In this chapter, we present some strategies based on simulating browsers’ behaviors to obtain the data from micro-blog platforms. The main idea is to simulate browsers’ behaviors by using a browser core (e.g., Firefox) to fetch the corresponding data. This solves the problem of parsing JavaScript code and allows special operations such as login. In order to crawl the data effectively, we adopt the following strategies: (1) focused crawling on some special crowds; (2) meta-topic searching and crawling, that is, crawling special contents by using the microblog’s searching function; and (3) parallel crawling, in which, building on big-data processing with Redis and MongoDB, we use multiprocessing technology to download and save the data simultaneously. The proposed crawler is composed of four modules, i.e., the simulating module, the data crawling module, the data parsing module, and the data persistence module. The experimental results and their analysis show the feasibility of the approach. Further work is also discussed at the end.

9.3 Related Work

Online social networking technologies enable individuals to simultaneously share information with any number of peers. Since the launch of Twitter in 2007, the microblog has become highly popular, and many researchers want to investigate micro-blog information propagation patterns [1] or analyze the structure of the micro-blog network to identify influential users [2]. Reference [3] discusses some of the ways in which earlier works used text content to analyze online networks, as well as background on language coordination and the exchange-theoretic notions of power from status and dependence. Reference [4] studies several long-standing questions in media communications research, in the context of the micro-blog service Twitter, regarding the production, flow, and consumption of information. A framework that enriches the semantics of Twitter messages (i.e., tweets) and identifies topics and entities (e.g., persons, events, products) mentioned in tweets is presented in reference [5]. Reference [6] conducts a study on recommending URLs posted in Twitter messages and compares strategies for selecting and ranking URLs by exploiting the social network of a user as well as the general popularity of the URLs in Twitter. The authors of reference [7] investigate the attributes and relative influence of 1.6 million Twitter users by tracking 74 million diffusion events that took place on the Twitter follower graph over a 2-month interval, and they conclude that word-of-mouth diffusion can only be harnessed reliably by targeting large numbers of potential influencers, thereby capturing average effects. Reference [8] examines the role of social networks in online information diffusion with a large-scale field experiment, and the authors further examine the relative roles of strong and weak ties in information propagation. Although stronger ties are individually more influential, it is the more abundant weak ties that are responsible for the propagation of novel information, and the authors suggest that weak ties may play a more dominant role in the dissemination of information online than currently believed. In reference [9], the authors address the problem of discovering topically meaningful communities from a social network; they propose a probabilistic scheme that incorporates topics, social relationships, and the nature of posts for more effective community discovery, and they demonstrate the effectiveness of the model and show that it performs better than existing community discovery models. Reference [10] examines the application of an event-driven sampling approach to the LiveJournal social network; the approach makes use of the “always on” atom feed provided by LiveJournal, which contains all public blog posts in near real time, to inform the sampling of user friendship networks, and this has the effect of targeting sampling toward the publicly active users of the network. In addition to proposing models and algorithms for learning the model parameters and for testing the learned models to make predictions, reference [11] develops techniques for predicting the time by which a user may be expected to perform an action.

As for data crawling, in order to overcome the inherent bottlenecks of traditional crawling, reference [12] proposes the design of a parallel migrating web crawler. Reference [13] proposes dynamic data crawling methods, which include sensitive checking of website changes and dynamic retrieving of pages from target websites, and the authors implement an application and compare the performance of conventional static approaches with the proposed dynamic ones. In reference [14], the authors present a novel URL ordering system that relies on a cooperative approach between the crawlers and the web servers based on file system and web log information; the proposed algorithm is based on file timestamps and web log internal and external counts. Reference [15] presents a micro-blog service crawler named MBCrawler, which is built on the APIs provided by micro-blog services; its architecture is modular and scalable, so it can fit the specific features of micro-blog services. Reference [16] presents a dynamic cooperation model for message exchanging among different crawlers, and both the experimental results and the application validate the feasibility of the algorithm.

As for modeling methods, reference [17] presents commonly used statistical modeling methods, such as stepwise regression, radial basis function partial least squares, partial robust M-regression, ridge regression, and principal component regression, that can be applied in the multicollinearity domain. The Viterbi algorithm, a widely used maximum likelihood estimation method, can be used in natural language processing, and reference [18] presents an effective search space reduction for human pose estimation with the Viterbi algorithm.

Although the proposed algorithm is related to the above work, there are many differences. The proposed crawler collects Sina_Micro-blog (http://weibo.com/) and Tecent_Micro-blog (http://t.qq.com/) data by simulating browser behaviors. Regarding browsers, reference [19] describes how to calculate various object-oriented metrics for three versions of Mozilla Firefox and shows that a neural network approach can predict high- and medium-severity errors more accurately than low-severity errors.

The experimental results and the analysis show the feasibility of the proposed approach.

9.4 System Architecture

Social networks are often huge, and therefore crawling micro-blog data can be both challenging and interesting. Given the microblog’s big-data properties, it is impossible to crawl all the micro-blog data. Instead, it is feasible to crawl certain kinds of data (e.g., account information, contents or topics, attentions or fans). In this section, we propose the system architecture for parallel crawling. Figure 9.1 shows the architecture, and Fig. 9.2 shows the parsed data and its persistence.

Fig. 9.1 System architecture

Fig. 9.2 Parsed data and its persistence

In practice, we can use an RDBMS to store the parsed data, with Redis as the cache server, so that users’ retrieval requests can be handled through the web server. In detail, for the multi-thread-based parallel crawling, we use a thread pool and a queue manager to schedule the tasks. There are several different queues, including the toVisitUrlsQueue, isVisitingUrlsQueue, visitedUrlsQueue, circleUrlsQueue, keywordsQueue, etc.; see Fig. 9.3.

Fig. 9.3 Multi-thread-based parallel crawling

9.5 Case Studies and Implementation of the Simulated Browser-Based Crawling

Instead of merely using the official APIs, we propose simulated browser-based crawling, as a purely API-based method has limitations on access rights, call quotas, and so on, and some tasks cannot be accomplished by using the official APIs alone. We present strategies based on simulating browsers’ behaviors to obtain the micro-blog data, and the proposed crawler is composed of four modules, i.e., the simulating module, the data crawling module, the data parsing module, and the data persistence module.

9.5.1 Simulating the Login Operation and Obtaining the Cookies Data

Commercial websites often use technologies (e.g., HTTP compression, SSL encryption, and chunked encoding) to provide reasonable levels of security and system performance. For effective crawling of micro-blog data, a simulated login operation is necessary; otherwise (for example, with only official API-based crawling), only a small amount of data can be crawled. Here, the simulated login operation means that the crawler uses legitimate accounts and their corresponding passwords to log into the corresponding micro-blog platforms, and the key phase is parsing the encrypted data. We use HttpWatch [20], which integrates with Internet Explorer or Mozilla Firefox to provide HTTP monitoring without separately configured proxies or network sniffers. While the user simply interacts with a website, HttpWatch displays a log of requests and responses alongside the web page itself, and it can even show interactions between the browser and its cache. As a result, each HTTP transaction can be examined or parsed to see the values of headers, cookies, query strings, and other HTTP-related data. HttpWatch works well with the above technologies to provide a complete view of HTTP activity. By using HttpWatch, we can obtain 21 or more different parameters during the simulated login phase. In practice, however, there are usually different situations during the login phase, and Fig. 9.4 shows two situations when using the same account.

Fig. 9.4 Different situations with the same account

Fig. 9.5 Requested preliminary parameters (a) and the return values after the request period (b)

Fig. 9.6 Returned results

From the parsed results, we can conclude that the requested URLs usually contain static and dynamic parameters (e.g., the parameters p and verifycode in Fig. 9.4, where the verifycode parameter is usually used for password encryption). As microblog login passwords are usually encrypted at multiple levels and layers, the request usually contains some other preliminary parameters; the parameter displayed in Fig. 9.5b (i.e., “\x00\x00\x00\x00\x25\x24\x6a\x52”) is the encrypted form of the parameter in Fig. 9.4a. At this point the decryption phase is finished, and the returned or parsed content is shown in Fig. 9.6.

To judge whether the corresponding user is legitimate, the cookies data must be analyzed. On the other hand, whether the user actually logs in or not, if the user can obtain legitimate cookies when requesting data from the server, he or she can obtain the same data as if he or she had really logged into the web server. In detail, obtaining the cookies data takes three steps. First, the verifycode and uin parameters are obtained; see Algorithm 1 below. Second, by using a JavaScript analysis engine and invoking the encryption function, we obtain the parsed parameters; see Fig. 9.7. Last, the relevant parameters are merged to obtain the corresponding cookie data, which is the result of the simulated login phase; see Algorithm 2 below, and the parsed cookies data are shown in Fig. 9.8.

Algorithm 1 (figure a): Obtaining the verifycode and uin parameters
Algorithm 2 (figure b): Merging the relevant parameters to obtain the cookies data
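
As an illustration of the three steps above, the following Python sketch uses the requests library to reproduce the flow of Algorithms 1 and 2. The endpoint URLs, the parameter names other than verifycode and uin, and the encrypt_password() helper are hypothetical placeholders for the platform-specific details parsed with HttpWatch.

import requests

PRELOGIN_URL = "https://example-microblog.com/prelogin"   # hypothetical endpoint
LOGIN_URL = "https://example-microblog.com/login"         # hypothetical endpoint

def encrypt_password(password, verifycode, uin):
    # placeholder for the JavaScript-engine step: evaluate the platform's own
    # encryption function (multi-level, as described above) on these inputs
    raise NotImplementedError

def simulated_login(account, password):
    session = requests.Session()                 # keeps cookies across requests
    # Step 1 (Algorithm 1): request the preliminary parameters
    pre = session.get(PRELOGIN_URL, params={"account": account}).json()
    verifycode, uin = pre["verifycode"], pre["uin"]
    # Step 2: compute the encrypted password parameter via the JS engine
    p = encrypt_password(password, verifycode, uin)
    # Step 3 (Algorithm 2): send the merged parameters; the server then sets the
    # session cookies that later crawling requests reuse
    session.post(LOGIN_URL, data={"u": account, "p": p,
                                  "verifycode": verifycode, "uin": uin})
    return session.cookies.get_dict()            # the cookies data of Fig. 9.8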

9.5.2 Data Parsing and Persistence

After the corresponding data are collected, they need to be parsed and stored. Regarding content, there are some differences between data on the blogger’s main page and on other common pages. Usually, the main page returns data in a traditional way, while other common pages use AJAX [21] and JSON [22] technology to return data to the client in order to enhance performance and optimize the user experience. By using JavaScript, these returned data can be parsed and filled into the corresponding page elements. Figure 9.9 shows the crawled private data, and Algorithm 3 shows the main steps of this processing step.

Fig. 9.7 The parsed parameters

Fig. 9.8 The cookies data of the simulated login phase

Fig. 9.9 The crawled content

Algorithm 3 (figure c): Parsing and storing the crawled data
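
The following Python sketch illustrates the first part of this processing step, namely deciding whether a response is a traditional HTML page or a JSON-wrapped AJAX fragment. The envelope field names (“code”, “html”) are assumptions for illustration only, not the platform’s actual field names.

import json

def extract_fragment(raw_response):
    """Return the HTML fragment carried by an AJAX response, or the raw page."""
    try:
        envelope = json.loads(raw_response)          # common pages: JSON-wrapped
    except ValueError:
        return raw_response                          # main page: traditional HTML
    if envelope.get("code") == 0:                    # hypothetical success flag
        return envelope.get("html", "")
    return ""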

The data parsing module needs to process the crawled data and parse the micro-blog content. The crawled data can be classified into two corpora, i.e., plain text and cipher text. As the plain text is regular and uniform, we use regular expressions to extract the real contents (e.g., the micro-blog contents, URL, id, published time, IP address, review or comment count, forwarding count, etc.). The processing flow is shown in Fig. 9.10.

Fig. 9.10 The processing flow
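
As a minimal example of the regular-expression extraction described above, the following Python sketch pulls the id, content, and published time out of a crawled page. The markup pattern is hypothetical; the real patterns must follow the platform’s current page template, as noted in Sect. 9.6.5.

import re

# hypothetical markup: <div class="post" id="..."><p>...</p><span class="time">...</span>
POST_PATTERN = re.compile(
    r'<div class="post" id="(?P<id>\d+)">\s*'
    r'<p>(?P<content>.*?)</p>\s*'
    r'<span class="time">(?P<published>[^<]+)</span>',
    re.S,
)

def parse_posts(html):
    """Extract (id, content, published time) records from a crawled page."""
    return [m.groupdict() for m in POST_PATTERN.finditer(html)]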

9.6 Experimental Results and Analysis

9.6.1 About the Testing Data Set and the Experimental Environment

For this micro-blog big data, persistence is an important issue. We use MongoDB and Redis to store and cache the data in the real application, while MySQL is only used for the experimental platform. Figure 9.11 shows the MySQL-based parsed Tecent_Micro-blog data.

Fig. 9.11 The parsed results: a account information, b other parsed content
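
A minimal sketch of this persistence arrangement is given below, assuming local MongoDB and Redis instances and a hypothetical database/collection name: MongoDB holds the parsed records, while Redis caches them for retrieval through the web server.

import json
import pymongo
import redis

mongo = pymongo.MongoClient("mongodb://localhost:27017/")
posts = mongo["microblog"]["posts"]                    # hypothetical database/collection
cache = redis.Redis(host="localhost", port=6379, db=0)

def persist_record(record):
    """Store one parsed micro-blog record and cache it by its id."""
    posts.update_one({"_id": record["id"]}, {"$set": record}, upsert=True)
    cache.setex("post:%s" % record["id"], 3600, json.dumps(record))  # cache for 1 hour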

In order to evaluate the algorithm’s performance, we classify the test data sets into three classes according to the bloggers’ authority. We use these three classes because the micro-blog official platform usually presents them, and they have different data sizes; the first class has less data while the other two have more. In detail, the first class consists of ordinary microbloggers (i.e., with minor authority), whose propagation scope is limited to old friends or classmates; the second class consists of microbloggers with medium authority (e.g., network magazine microblogs), whose contents can be followed or spread to all kinds of users; and the last class consists of famous persons’ microblogs. We test the performance on these different data environments.

9.6.2 Ordinary Microblogger’s Performance Evaluation

In this section, we use an individual user’s microblog as an example. Figure 9.12a shows the original microblog, and Fig. 9.12b shows the crawled and parsed data. It is clear that there is no difference between them, and all contents have been obtained correctly.

Fig. 9.12 The ordinary microblogger’s data and the parsed results: a original data, b parsed results

9.6.3 Medium Authorities’ Microblogger Performance Evaluation

Here, we use some microbloggers with medium authority as the experimental platform. Figure 9.13a shows the micro-blog interface, while Fig. 9.13b shows the crawled and parsed data.

Fig. 9.13 Medium-authority microblogger and the corresponding parsed results: a original data, b parsed results

9.6.4 Famous Persons’ Microblog

For famous persons’ microblogs, we use the famous actor Tom Cruise’s microblog as an example. Figure 9.14a shows his micro-blog interface, while Fig. 9.14b shows the crawled and parsed data.

Fig. 9.14 Famous person’s micro-blog and the parsed results: a original data, b parsed results

9.6.5 Performance Evaluation

Micro-blog services sometimes provide APIs. Through these services, well-structured data can be obtained easily, which offers the possibility of constructing a uniform and universal software architecture that uses the provided APIs to download data automatically. However, there are usually some limits and obstacles. In order to evaluate the performance, we present a comparison between API-based crawling and the proposed simulated browser behavior approach; see Table 9.1.

Table 9.1 The performance comparison of the two approaches

From the above comparison, it is clear that for the proposed method, the parsed data’s degree of integrity and its accuracy and scope are higher. However, as shown before, if the template or the main framework of the microblog is changed, the accuracy of the parsed data becomes lower than usual. Fortunately, such changes occur rarely. If we track or analyze the parsed data periodically, it is easy to detect the changes and then revise the corresponding parsing rules.

9.7 Conclusion

It is hard for a traditional web page crawler to crawl micro-blog data, and most microblogs’ official platforms do not offer suitable tools or RPC interfaces to collect the data effectively and efficiently. This chapter presents algorithms and strategies for crawling and parsing micro-blog data effectively based on simulating browsers’ behaviors. This requires analyzing the simulated browsing behavior in order to obtain the requesting URLs, and simulating and analyzing the sending of URL requests according to the order of the data sequence. The crawler focuses on some special crowds and crawls special contents by using the microblog’s searching function. Parallel crawling and multiprocessing technology are also used to download the data simultaneously. The experimental results and the analysis show the feasibility of the approach. Directions for future work are also discussed.