
1 Introduction

In most machine learning applications, the observed and unobserved instances are assumed to be drawn independently from the same distribution. Classification problems are solved using instances’ features (content) and labels; connections, dependencies, or relations between instances are not taken into consideration. On the other hand, learning problems with network information, where for each node both its features and its relations with other nodes are available, are becoming more common. Examples include social [39], semantic [36], financial [3], communication [10] and gene regulatory [2] networks. Classification of nodes or links in the network, discovery of links or nodes which are not yet observed, and identification of essential nodes or links are some of the research areas in social networks.

In a social network, the class membership of one node may have an influence on the class membership of a related node. Networked data contain not only the features of each node, but also the features and sometimes the labels of its neighbors, and sometimes the features and links of the nodes whose labels need to be estimated. Classification in social networks has a number of challenges compared to classification of plain data:

  • Design of classifiers which can make the best use of both content and link information available in the social network data is a challenge.

  • One other issue that must be considered is the separation of data into training and test sets, because traditional random sampling methods may not give reliable results [21].

  • Even nodes connected through a chain of links may affect each other’s class. Classification algorithms may need to take this dependency into account.

In this chapter, we try to address these challenges. Section 4 describes random and snowball sampling which can be used to partition networked data into training and test sets so that different learning methods can be tested against each other. In Sect. 5 we show how neighbor labels may be used through different aggregation mechanisms. In Sect. 6 we discuss the classification methods for networked data and especially the collective classification algorithms, which allow test nodes to be labeled based on the actual labels of training nodes and the estimated labels of test nodes.

In addition, Sect. 2 introduces the notation used throughout the chapter. Details on graph properties, which are important for understanding social network data and for choosing learning algorithms, are given in Sect. 3. Experimental results on three different datasets are reported in Sect. 7. The chapter is concluded in Sect. 8.

2 Notation

We assume that there is a networked dataset represented by a graph G = (V, E) with nodes (vertices) V and undirected links (edges) E ⊆ {{ u, v} | u, v ∈ V }.

Each node u ∈ V has a C-dimensional label vector \(\mathbf{r}(u) \in \{0,1\}^{C}\) which uses the 1-of-K representation and shows the class of the node. Some of the vertices are in the training set V train , whose labels are known, while the rest are in the test set V test , whose labels will be predicted. Note that \(V_{\mathit{train}} \cap V_{\mathit{test}} = \emptyset\) and \(V_{\mathit{train}} \cup V_{\mathit{test}} = V\). Note also that if the test set is unknown then \(V_{\mathit{test}} = \emptyset\).

Each node u ∈ V (whether it is in the training or test set) also has a d dimensional node content feature vector x(u) ∈ { 0, 1}d.

In the pattern recognition scenario that we are interested in, given the feature vectors and labels of the training nodes, x(u) and r(u), u ∈ V train , we need to train a mapping (classifier) \(g: \{0,1\}^{d} \rightarrow \{0,1\}^{C}\). Since this classifier uses only the input features, we will call it the Content Only (CO) classifier, g CO (x(u)).

When not only the training node features, but also their links are given, the link information can also be used for classification. Usually the link information of the neighbors of a specific node is taken into account. The neighborhood function N(u) returns the set of nodes which are neighbors of node u according to the edges E:

$$\displaystyle{ N(u) =\{ v:\{ u,v\} \in E\}. }$$
(1)

Most classifiers need fixed dimensional inputs. So, we need to aggregate [29] labels of neighbors of a node into a fixed dimensional vector. Let r N (u) denote the aggregated neighbor labels for a node u. We will define r N (u) based on the aggregation method (Sect. 5) used.

Based on the labels of the neighbors only, a classifier, which we call the Link Only (LO) classifier g LO (r N (u)), can be trained on the training data. When a test node needs to be classified, if it has neighbors in the test set, the inputs to the classifier need to be determined iteratively, based on the current label assignment of the neighbors, using a collective classification algorithm such as the Iterative Classification Algorithm (ICA) [21, 34].

When both node features and links are known, a classifier that uses both the content features of the node and the labels of the neighbors has been used in [34]. We will call this classifier, which has d + C features, the Content and Link classifier:

$$\displaystyle{ g_{\mathit{CL}}([\mathbf{x}(u)\ \mathbf{r}_{N}(u)]). }$$
(2)

Evaluation of a classifier can be based on its accuracy on the test set:

$$\displaystyle{ \mathit{acc}(g,V _{\mathit{test}}) = \frac{1} {\vert V _{\mathit{test}}\vert }\sum\limits_{v\in V _{\mathit{test}}}[g(\mathbf{x}(v)) = \mathbf{r}(v)]. }$$
(3)

Here [P] is the Iverson bracket: it returns 1 if condition P is true (i.e. the vectors g(x(v)) and r(v) are equal) and 0 if P is false (i.e. they differ in at least one position). The test labels, and sometimes the identities of the test inputs, are not known during training. Since the actual labels are not known, the test accuracy cannot be evaluated directly. Instead, validation sets extracted from the labeled training set through sampling (Sect. 4) are used to estimate the test accuracy of a classifier.
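As a small illustration, the accuracy measure can be computed directly from predicted and actual 1-of-K label vectors; the following Python sketch uses made-up toy labels rather than any of the chapter's datasets:

```python
# Toy sketch of test-set accuracy: the fraction of test nodes whose
# predicted 1-of-K label vector exactly equals the true one.
def accuracy(predicted, actual):
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Three classes (C = 3), 1-of-K label vectors for four test nodes.
y_true = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, 0)]
y_pred = [(1, 0, 0), (0, 1, 0), (1, 0, 0), (1, 0, 0)]

print(accuracy(y_pred, y_true))  # 3 of 4 correct -> 0.75
```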

3 Graph Properties

Social network data consist of a graph with nodes, node features and labels, and links between the nodes. Graph properties, such as homophily and the degree distribution, help us understand the data at hand. Graph properties can also be used as guidelines for deciding whether a sampling scheme (Sect. 4) used for the evaluation of algorithms is a good one, or which type of aggregation (Sect. 5) of neighbor labels should be used for classification. In this section, we define some of the important graph properties. Please see, for example, [27] or [11] for more details on graph properties.

3.1 Homophily

A classification algorithm in a social network aims to use both content and link information. Whether the link information will be useful or not depends on whether linked objects have similar features and/or labels. This similarity has been quantified using a number of different criteria.

Neville and Jensen defined quantitative measures of two common characteristics of relational datasets: concentrated linkage and relational autocorrelation. Concentrated linkage occurs when many entities are linked to a common entity, such as citation links to key papers. Relational autocorrelation occurs when the features of entities that share a common neighbor are similar [17]. As pointed out by Neville and Jensen, most models (e.g., PRMs, RMNs) do not automatically identify which links are the most relevant to the classification task. In their method, they identified the links most relevant to the classification task by using concentrated linkage and relational autocorrelation, and explored how to use these relational characteristics to improve feature selection in relational data [42].

Yang et al. identified five hypertext link regularities that might (or might not) hold in a particular hypertext corpus [40, 42]. We list three of them here.

  • Encyclopedia regularity: The class of a document is the same as the class of the majority of the linked documents.

  • Co-referencing regularity: Documents with the same class tend to link to documents not of that class, but which are topically similar to each other.

  • Partial co-referencing regularity: Documents with the same class tend to link to documents that are topically similar to each other, but also link to a wide variety of other documents without semantic reason.

The presence (or absence) of these regularities may significantly influence the optimal design of a link-based classifier. Most link analysis methods and link-based classification models are built upon the “encyclopedia” or “co-referencing” regularity. As a result, these models do not automatically identify which links are most relevant to the task [42].

The assortativity index defined in [27] is also a measure of how similar the labels of connected nodes are. It is defined using the numbers of links which connect nodes of the same and of different classes, and it is proportional to the number of links that connect nodes of the same class.

Homophily [21, 31], or label autocorrelation, can be defined as the tendency of entities to be related to other similar entities, i.e. linked entities tend to belong to the same class. High homophily usually implies that using link information helps with classification, while for low-homophily datasets a content only classifier could do a better job.

We define homophily of a node u as the proportion of neighbor nodes of u which have the same label as u:

$$\displaystyle{ H(u) = \frac{1} {\vert N(u)\vert }\sum\limits_{v\in N(u)}[\mathbf{r}(u) = \mathbf{r}(v)] }$$
(4)

In this equation, [p] is the Iverson bracket, which is equal to 1 if p is true and 0 otherwise. The graph homophily is then:

$$\displaystyle{ H(G) = \frac{1} {\vert V \vert }\sum\limits_{u\in V }H(u) }$$
(5)

Note that although homophily is defined as label “sameness”, labels could be related in other ways; for a binary classification problem, for example, they could just be the opposites of each other (i.e. heterophily). The classifiers that we use can take advantage of homophily, heterophily, or any other correlation between the label of a node and the labels of its neighbors.
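The homophily definitions above can be sketched in a few lines of Python; the graph and labels below are toy examples chosen so that every node has one same-labeled and one differently-labeled neighbor:

```python
# Sketch of per-node and graph homophily on a toy undirected graph.
# The edges and labels are made-up examples, not real network data.
from collections import defaultdict

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = {0: "A", 1: "A", 2: "B", 3: "B"}

neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

def node_homophily(u):
    # Proportion of u's neighbors sharing u's label.
    nbrs = neighbors[u]
    return sum(labels[v] == labels[u] for v in nbrs) / len(nbrs)

graph_homophily = sum(node_homophily(u) for u in labels) / len(labels)
print(graph_homophily)  # each node agrees with 1 of its 2 neighbors -> 0.5
```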

3.2 Degree Distribution

The degree, k(u), of a node u is the total number of its connections to the other nodes in the network. Although the degree of a node seems to be a local quantity, the degree distribution of the network often helps to determine important global characteristics of the network. A scale-free network is a type of network whose degree distribution follows a power law [11]. For example, it was shown in [16] that whether an epidemic spreads in a scale-free network or dies out can be computed based on the scale-free exponent.

3.3 Clustering Coefficient

The clustering coefficient is a measure of the degree to which nodes tend to cluster together in a graph. It can give information, for example, on how close two nodes in the graph are, and therefore how much their predicted labels would affect each other during label prediction. Different types of clustering coefficients can be defined as follows.

3.3.1 Global Clustering Coefficient

The global clustering coefficient, which is an indication of the clustering in the whole network, is based on triplets of nodes. A triplet is called an open triplet when three nodes are connected with two links, and a closed triplet when all three nodes are tied together. Three closed triplets, one centered on each of the nodes, form a triangle. The global clustering coefficient is defined as the ratio of the number of closed triplets to the total number of triplets (open and closed) [11]:

$$\displaystyle{ CC_{G}(G) = \frac{\mathit{Number\ of \ closed\ triplets}} {\mathit{Number\ of \ connected\ triplets}} }$$
(6)
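A minimal sketch of this computation in Python, on a toy graph consisting of a triangle with one pendant edge:

```python
# Sketch of the global clustering coefficient via triplet counting.
# The toy graph is a triangle 0-1-2 plus the pendant edge 2-3.
from itertools import combinations
from collections import defaultdict

edges = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2), (2, 3)]}
neighbors = defaultdict(set)
for e in edges:
    u, v = tuple(e)
    neighbors[u].add(v)
    neighbors[v].add(u)

closed = open_ = 0
for u in neighbors:
    # Each pair of neighbors of u forms a triplet centered at u.
    for v1, v2 in combinations(neighbors[u], 2):
        if frozenset((v1, v2)) in edges:
            closed += 1
        else:
            open_ += 1

print(closed / (closed + open_))  # 3 closed of 5 triplets -> 0.6
```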

3.3.2 Local Clustering Coefficient

The local clustering coefficient of a node u is defined in [11] as the ratio of the actual number of edges between the neighbors of u to the number of all possible edges between them:

$$\displaystyle{ CC_{L}(u) = \frac{\sum_{v_{1},v_{2}\in N(u),v_{1}\neq v_{2}}[\{v_{1},v_{2}\} \in E]} {\vert N(u)\vert (\vert N(u)\vert - 1)/2} }$$
(7)

Here [] is the Iverson bracket. The number of all possible undirected links between neighbors N(u) of node u is \(\vert N(u)\vert {\ast} (\vert N(u)\vert - 1)/2\).

3.3.3 Average Clustering Coefficient

The network average clustering coefficient is defined by Watts and Strogatz as the average of the local clustering coefficients of all the nodes in the graph [38]:

$$\displaystyle{ CC_{L}(G) = \frac{1} {\vert V \vert }\sum\limits_{u\in V }CC_{L}(u) }$$
(8)

If the average clustering coefficient is zero, then the graph contains no triangles (for example, it may be a tree). If the average clustering coefficient is higher than that of a random graph with the same degree distribution, then the network may show the small-world phenomenon, i.e. any two random nodes can be connected using a much smaller number of links than O( | V | ). Note that nodes having fewer than two neighbours, which trivially have a clustering coefficient of zero, are excluded from the computation of the average.
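The local and average clustering coefficients can be sketched as follows; the toy graph (a triangle with a two-edge tail) and the exclusion of nodes with fewer than two neighbors follow the definitions above:

```python
# Sketch of local clustering coefficients and their average.
# Toy graph: triangle 0-1-2 with a tail 2-3-4 (made-up example).
from itertools import combinations
from collections import defaultdict

edges = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]}
neighbors = defaultdict(set)
for e in edges:
    u, v = tuple(e)
    neighbors[u].add(v)
    neighbors[v].add(u)

def local_cc(u):
    k = len(neighbors[u])
    if k < 2:
        return None  # undefined; excluded from the average
    links = sum(1 for v1, v2 in combinations(neighbors[u], 2)
                if frozenset((v1, v2)) in edges)
    return links / (k * (k - 1) / 2)

ccs = [c for c in (local_cc(u) for u in neighbors) if c is not None]
print(sum(ccs) / len(ccs))  # (1 + 1 + 1/3 + 0) / 4, about 0.583
```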

3.4 Rich Club Coefficients

Let V  > k  ⊆ V denote the nodes having degree higher than a given value k and E  > k denote the edges among the nodes in V  > k . The rich-club coefficient is defined as:

$$\displaystyle{ \phi _{k}(G) = \frac{\vert E_{>k}\vert } {\vert V _{>k}\vert (\vert V _{>k}\vert - 1)/2} }$$
(9)

In Eq. (9), \(\vert V _{>k}\vert (\vert V _{>k}\vert - 1)/2\) is the maximum possible number of undirected links among the nodes in V  > k . Thus, ϕ k (G) measures the fraction of the edges that actually exist between those nodes out of the maximum number of edges they may have. The rich-club coefficient reveals topological correlations in a complex network and helps to understand its underlying architecture [7].
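A sketch of the rich-club computation on a toy graph; the graph is a made-up example, not one of the chapter's datasets:

```python
# Sketch of the rich-club coefficient phi_k on a toy graph
# (triangle 0-1-2 plus pendant edge 2-3).
from collections import defaultdict

edges = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2), (2, 3)]}
degree = defaultdict(int)
for e in edges:
    u, v = tuple(e)
    degree[u] += 1
    degree[v] += 1

def rich_club(k):
    rich = {u for u, d in degree.items() if d > k}
    e_rich = sum(1 for e in edges if e <= rich)  # both endpoints rich
    possible = len(rich) * (len(rich) - 1) / 2
    return e_rich / possible

# Nodes with degree > 1 are {0, 1, 2}; all 3 possible edges exist.
print(rich_club(1))  # -> 1.0
```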

3.5 Degree-Degree Correlation

Degree-degree correlation is the correlation between the numbers of neighbors (minus 1) of neighboring nodes. Both homophily and degree-degree correlation can be considered as special cases of assortativity [26], the correlation between a certain property (label, degree, etc.) of two neighboring nodes. A related quantity is the average degree of the neighbours of the nodes having degree k. Kahng et al. [19] found that, among scale-free networks, authorship and actor networks are assortative (i.e. nodes with large degree connect to nodes with large degree) while protein-protein interaction and world wide web networks are disassortative (i.e. large degree nodes tend to connect to small degree nodes).

3.6 Graph Radius and Diameter

These two notions are related to each other. We use the definitions from [12].

Let d G (u, v) be the distance (i.e. the minimum number of edges that connect u and v) between nodes u and v. The eccentricity ε(u) of a node u is defined as the greatest distance between u and any other node in the graph:

$$\displaystyle{ \epsilon (u) =\max _{v\in V }d_{G}(u,v) }$$
(10)

The diameter of a graph is defined as the length of the longest of the shortest paths between any two nodes or equivalently as the maximum eccentricity of any vertex:

$$\displaystyle{ \mathit{diam}(G) =\max _{u,v\in V }d_{G}(u,v) }$$
(11)

The radius of the graph is defined as the minimum eccentricity among all the nodes in the graph:

$$\displaystyle{ \mathit{rad}(G) =\min _{u\in V }\epsilon (u) =\min _{u\in V }\max _{v\in V }d_{G}(u,v) }$$
(12)
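Eccentricity, diameter, and radius can be computed with breadth-first search; the following sketch uses a toy path graph 0-1-2-3 (diameter 3, radius 2):

```python
# Sketch of eccentricity, diameter and radius via breadth-first search
# on a toy path graph 0-1-2-3.
from collections import deque

neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def eccentricity(u):
    # BFS from u gives shortest-path distances to all reachable nodes.
    dist = {u: 0}
    queue = deque([u])
    while queue:
        w = queue.popleft()
        for v in neighbors[w]:
            if v not in dist:
                dist[v] = dist[w] + 1
                queue.append(v)
    return max(dist.values())

ecc = {u: eccentricity(u) for u in neighbors}
print(max(ecc.values()), min(ecc.values()))  # diameter 3, radius 2
```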

3.7 Average Path Length

The Average Path Length of a graph is defined as the average of the shortest path lengths over all pairs of nodes:

$$\displaystyle{ \mathit{APL}(G) = \frac{1} {\vert V \vert (\vert V \vert - 1)}\sum\limits_{u,v\in V,u\neq v}d_{G}(u,v) }$$
(13)

In [13], the average path length is computed based on whether or not the graph is scale free and on the scale-free exponent.

3.8 Graph Density

The density of a graph is defined as the ratio of the number of edges to the number of possible edges:

$$\displaystyle{ D(G) = \frac{\vert E\vert } {\vert V \vert (\vert V \vert - 1)/2} }$$
(14)

3.9 Graph Properties of the Datasets

We used three datasets that have been used in network classification research [21, 23, 34]. We give their graph properties in Table 1.

3.9.1 Cora Dataset

The Cora [22] data set consists of information on 2,708 machine learning papers. Every paper in Cora cites or is cited by at least one other paper in the data set. There are 1,433 unique words that occur at least 10 times in these papers. There are also seven classes assigned to the papers according to their topics. For each paper, it is known whether or not it contains a specific word, which class it belongs to, which papers it cites, and by which papers it is cited. Citation connections and paper features (class and included words) are contained in two separate files. The total number of connections between the papers is 5,278, which amounts to 3.898 links per paper.

3.9.2 Citeseer Dataset

The CiteSeer [14, 33] data set consists of information on 3,312 scientific papers. There are 3,703 unique words that occur at least 10 times in these papers. There are six classes assigned to the papers according to their topics. Just as in the Cora data set, the word, class, and citation information are given in two separate files. The total number of connections between the papers is 4,536, which amounts to 2.74 links per paper.

3.9.3 WebKB Dataset

The WebKB [9] data set consists of sets of web pages from four computer science departments, with each page manually labeled with one of five categories: course, project, staff, student, or faculty.

The link structure of WebKB is different from that of Cora and Citeseer, since co-citation links are informative for the WebKB data set. The reason can be explained by the observation that a student is more likely to have a hyperlink to her adviser or a group/project page than to one of her peers [21].

Table 1 Summary information about the datasets (Cora, Citeseer, WebKB)

4 Sampling

The test data in social networks may consist of a set of nodes that are connected to each other, in which case the training-validation partitioning process needs to take this dependency into account. It is also possible that the test nodes are randomly distributed among the training nodes. Two different sampling mechanisms, snowball sampling and random sampling, are used to handle these two situations.

4.1 Random Sampling

When random sampling is used, the nodes in the training, validation, and test sets are selected randomly. It is important to preserve the class distribution of the dataset as much as possible during the selection of nodes. One method to achieve this is to randomly partition the nodes of every class in proportion to the required training, validation, and test set sizes, and then combine the per-class parts to produce the training, validation, and test sets. While random sampling is a simple method, even if care is taken to preserve the class ratios, the sampled graph is likely to have very different topological properties than the original graph [1]; therefore, classification algorithms trained and tested on the sampled graph may not perform similarly on the original dataset.
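A minimal sketch of class-preserving (stratified) random sampling, assuming a simple node-to-label mapping; the labels are toy data and the 50/50 split fraction is arbitrary:

```python
# Sketch of class-preserving random sampling: nodes of each class are
# shuffled and split in the requested proportions (toy labels).
import random

def stratified_split(labels, train_frac=0.5, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for node, c in labels.items():
        by_class.setdefault(c, []).append(node)
    train, test = [], []
    for nodes in by_class.values():
        rng.shuffle(nodes)
        cut = int(round(train_frac * len(nodes)))
        train += nodes[:cut]
        test += nodes[cut:]
    return train, test

labels = {i: ("A" if i < 6 else "B") for i in range(10)}
train, test = stratified_split(labels)
print(len(train), len(test))  # 5 and 5, class ratios preserved
```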

4.2 Snowball Sampling

When the usual k-fold cross-validation training and test sets are obtained from networked data, especially when the number of links per node is low, k-fold random sampling may generate almost disconnected graphs [34], making learning through the links impossible. To overcome this issue, snowball sampling is used. In snowball sampling, first, a number of starting nodes are selected. Then new nodes are selected among the nodes which are reachable from the already selected nodes. Thus, the set of selected nodes grows like a snowball, and as a result the selected nodes are connected. It is important to preserve the class ratios in the selected set of nodes; therefore, at every point during sampling, if a class is underrepresented, it is given a higher selection probability. This procedure continues until there are enough nodes in the selected subset.
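The growth step of snowball sampling can be sketched as a bounded breadth-first expansion; class-ratio balancing is omitted for brevity, and the graph is a toy example:

```python
# Sketch of snowball sampling: starting from seed nodes, repeatedly add
# unvisited neighbors until the desired sample size is reached.
from collections import deque

neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2, 5], 5: [4]}

def snowball(seeds, target_size):
    sample = set(seeds)
    frontier = deque(seeds)
    while frontier and len(sample) < target_size:
        u = frontier.popleft()
        for v in neighbors[u]:
            if v not in sample and len(sample) < target_size:
                sample.add(v)
                frontier.append(v)
    return sample

print(sorted(snowball([0], 4)))  # a connected sample: [0, 1, 2, 3]
```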

In Fig. 1, visualizations of the train/test partitions of the Cora and Citeseer datasets are given for both random sampling and snowball sampling. Red nodes are in the test partition while green ones are in the train partition. The nodes’ actual labels are also displayed in the circles representing the nodes. In order to be able to visualize these two sets, a subsample of 4% of the actual data is used.

When random sampling is used, there is no order or dependency between the selection of the training, validation, and test sets. However, when snowball sampling is used, whether the snowball is taken to be the test set or the training set may give different results. Taking the snowball to be the test set [34], generating k disjoint snowballs and using them for training-validation set formation [23], and using temporal sampling with past data for training and snowball samples with some portion of the labels provided for validation [24] are all different uses of snowball sampling for classification of networked data. While some authors mention that snowball sampling causes bias towards highly connected nodes and may be more suitable for inference about links than about nodes in a social network [35], others note that since a snowball is not guaranteed to reach individuals with high connectivity and cannot reach disconnected individuals, the snowball’s starting nodes should be chosen carefully [15].

Fig. 1 Random sampling vs snowball sampling

When random sampling is used to generate k-fold cross-validation training and validation (and test) sets, there is no overlap between different test sets. However, when snowball sampling is used to generate the k test sets, the created test snowballs may overlap. Since linked instances in a snowball are correlated, the errors made on them may also be correlated. Statistical tests such as the paired t-test, which is used for model selection, may not give reliable results when the test sets are not independent [25]. The Forest Fire Sampling (FFS) [20] method may be considered as an alternative to snowball sampling; as in snowball sampling, a breadth-first search technique is used, but links are followed (“burned”) according to a probability distribution.

4.3 Sampling on Streaming Graphs

When the network has many nodes, or the nodes are observed as a stream (as in the case of Twitter, for example), it is not possible to consider all of the graph when sampling. For streaming or large graphs, the sampling algorithm needs to be space and time efficient. In [1], a sampling algorithm for streaming graphs called Partially-Induced Edge Sampling (PIES) is introduced. PIES keeps a constant number of nodes in the sample, dropping old nodes and adding newly observed ones as the network continues to be observed. The same idea can also be used when the graph is too large to fit in memory: as new nodes are explored, they can be treated as a stream. Ahmed et al. [1] consider node-, edge-, and topology-based sampling algorithms for three different types of networks: static-small, static-large, and streaming. Snowball sampling is a topology-based sampling method. The objective in [1] is to ensure that the sampled graph is a representative subgraph which matches the topological properties of the original graph.
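PIES itself also maintains the edges induced by the sampled nodes; the following simplified sketch shows only the constant-size node sample idea, using standard reservoir sampling over a node stream:

```python
# Simplified sketch of keeping a constant-size node sample from a
# stream via reservoir sampling. PIES additionally keeps the edges
# induced by the sampled nodes, which is omitted here.
import random

def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    sample = []
    for i, node in enumerate(stream):
        if i < k:
            sample.append(node)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = node  # replace an old node with the new one
    return sample

print(len(reservoir_sample(range(1000), 10)))  # always exactly 10
```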

5 Aggregation

Each node in a graph may have a different degree and therefore a different number of neighbors. On the other hand, most classifiers need the input dimensionality to be the same for each instance. Therefore, in order to take advantage of neighbor link or feature information, a mechanism is needed to map this information to a fixed number of dimensions, regardless of the identity of the node. Aggregation methods (also called propositionalization or flattening methods) are often used for this purpose. In this section we present common methods for aggregating neighbor labels so that they can be used for classifier training.

The main objective of aggregation in relational modeling is to provide features which improve the generalization performance of the model [29]. However, aggregation usually causes loss of information, so one needs to be careful not to lose predictive information. Perlich and Provost proposed general guidelines for designing aggregation operators, suggesting that aggregation should be performed with the class labels in mind, that aggregated features should cause instances of the same class to be similar to each other, and that different aggregation operators should be experimented with. Below, we present the performances of different aggregation methods on different datasets. In [29], the authors considered both simple aggregators and new, more complex aggregators in the context of the relational learning system ACORA (Automated Construction of Relational Attributes). ACORA computes class-conditional distributions of linked object identifiers, and for an instance that needs to be classified, it creates new features by computing distances from these distributions to the values linked to the instance [29]. Lu and Getoor considered various aggregation methods: existence (binary), mode, and value counts. The count method performed best in their study [28].

In the following sections, the notation introduced in Sect. 2 is used. Note that, although the neighborhood function N(u) is usually defined to include the immediate neighbors of a node, it could be extended to include neighbors which are at most a given number of links away.

5.1 Neighbor Label Aggregation Methods

  • Count Method: The count aggregation method [30] counts, for each class, the number of neighbors belonging to that class:

    $$\displaystyle{ \mathbf{r}_{N}^{\mathit{count}}(u)\,=\,\sum\limits_{ v\in N(u)}\mathbf{r}(v). }$$
    (15)

    The count method does not consider any uncertainty in the labels or links, nor does it consider the edge weights [30].

  • Mode Method: This aggregation method considers the mode of the neighbor labels:

    $$\displaystyle{ \mathbf{r}_{N}^{\mathit{mode}}(u) = \mathit{mode}_{ v\in N(u)}\mathbf{r}(v). }$$
    (16)
  • Binary Existence Method: This aggregation method only considers whether a certain label exists among the neighbors or not; it does not take into account the number of occurrences, as the count or mode aggregations do. For the jth class, the binary existence aggregation of the neighbor labels is computed as:

    $$\displaystyle{ \mathbf{r}_{N}^{\mathit{exist}}(u,j) = [\mathbf{r}_{ N}^{\mathit{count}}(u,j) > 0] }$$
    (17)
  • Weighted Average Method: The weighted average aggregation method [30] sums, for each class, the weights of the links to the neighbors belonging to that class, and then normalizes by the sum of the weights of all edges to the neighbors. Similar to the count method, it does not consider uncertainty.

    $$\displaystyle{ \mathbf{r}_{N}^{wavg}(u) = \frac{1} {Z}\sum\limits_{v\in N(u)}w(u,v)\mathbf{r}(v). }$$
    (18)

    Here \(w(u,v) \in \mathbb{R}\) is the weight of the link between nodes u and v, and Z is a normalization constant:

    $$\displaystyle{ Z\,=\,\sum\limits_{v\in N(u)}w(u,v) }$$
    (19)
  • Probabilistic Weighted Average Method: This aggregation method is the probabilistic version of the weighted average method. It is based on the weighted arithmetic mean of the class membership probabilities of the neighbors of a node. It was introduced by Macskassy and Provost and used in their probabilistic Relational Neighbor (pRN) classifier [30].

    $$\displaystyle{ \mathbf{r}_{N}^{pwavg}(u,c) = \frac{1} {Z}\sum\limits_{v\in N(u)}w(u,v)P(c\vert v) }$$
    (20)

    where Z is defined as in Eq. (19) and c denotes a certain class.
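The count, binary existence, and mode aggregators can be sketched in a few lines; the neighbor label vectors below are toy 1-of-K examples over C = 3 classes (the weighted variants only add per-edge weights and a normalization):

```python
# Sketch of the count, binary-existence and mode aggregators for a
# node's neighbor labels, using toy 1-of-K vectors over C = 3 classes.
from collections import Counter

neighbor_labels = [(1, 0, 0), (1, 0, 0), (0, 1, 0)]

def count_agg(labels):
    # Element-wise sum of neighbor label vectors: per-class counts.
    return tuple(sum(col) for col in zip(*labels))

def exist_agg(labels):
    # 1 where at least one neighbor has that class, else 0.
    return tuple(int(c > 0) for c in count_agg(labels))

def mode_agg(labels):
    # Most frequent neighbor label vector.
    return Counter(labels).most_common(1)[0][0]

print(count_agg(neighbor_labels))  # (2, 1, 0)
print(exist_agg(neighbor_labels))  # (1, 1, 0)
print(mode_agg(neighbor_labels))   # (1, 0, 0)
```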

In Tables 2 and 3, we show the average test accuracies, over ten folds, of the iterative classification algorithm (ICA), a method for transductive classification of networked data (see the next section). We used logistic regression as the base classifier in these experiments. As can be seen from both tables, the count method is preferable to the other methods, both for its simplicity and for the accuracies obtained.

Table 2 Accuracy comparison of aggregation methods (ICA, SS) (Cora, Citeseer, WebKB)
Table 3 Accuracy comparison of aggregation methods (ICA,RS) (Cora, Citeseer, WebKB)

6 Classification in Social Networks

We consider two types of classification in a social network. Content only classification can be used whether or not the test nodes are known in advance. On the other hand, when the test nodes or some unlabeled nodes are known, semi-supervised classification algorithms can be used.

6.1 Supervised, Content Only Classification

This approach uses a (learned) model based only on the local features of the nodes whose class labels will be estimated. Such local models can also be used to generate priors that give the initial state for the relational learning and collective inference components, and as one source of evidence during collective inference. These models are typically produced by traditional machine learning methods [21].

6.2 Semi-supervised and Transductive Classification in Social Networks

Since social network data usually come in huge sizes, in addition to the labeled instances there is usually a huge number of unlabeled instances. In such cases, it may be possible to use the information, other than the labels, that exists in the unlabeled data, which leads to the use of semi-supervised learning algorithms. When the test nodes whose classes need to be predicted are known in advance, we have a transductive learning scenario.

In contrast to the non-relational (local) model, the relational model uses the relations in the network as well as the values of attributes of related entities, possibly through long chains of relations. In relational models, a relational classifier determines the class label or estimates the class conditional probabilities. The relational classifier might combine local features and the labels of neighbors using, for example, a naive Bayes model or logistic regression [21].

In semi-supervised learning [43], the unlabeled instances can be used to monitor the variance of the produced classifiers, to maximize the margin and hence minimize the complexity [18], or to place classifier boundaries in the low density regions between clusters in the data [6]. There are also co-training [5] type algorithms, which need different classifiers obtained through the use of different types of classifiers, different feature subspaces [41], or different sets of instances. When the classifiers produced are diverse and accurate enough, co-training may improve the final test accuracy [37]. On the other hand, Cozman and Cohen [8] have shown that unlabeled data can degrade classification performance when there are discrepancies between the modeling assumptions used to build the classifier and the actual model that generates the data. Therefore, both for general semi-supervised and for transductive learning, the use of unlabeled data is not guaranteed to improve performance.

6.3 Collective Classification

Collective classification methods, which are sometimes also called collective inference methods, are iterative procedures, which classify related instances simultaneously [30, 42]. In collective classification, the content and link information for both training and test data are available. First, based on the available training content, link and label information, models are trained. Then, those models are used to label the test data simultaneously and iteratively where each test sample is labeled based on its neighbors.

Collective classification exploits relational autocorrelation. Relational autocorrelation is a very important property of relational data and is used as a measure of how an attribute for an instance is correlated with the same variable from a related instance [30].
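For labels, a simple way to quantify this autocorrelation is edge homophily: the fraction of edges whose endpoints share a class. The following is a minimal sketch on a hypothetical toy graph; the function name and data are our own, not from the original chapter.

```python
# Sketch: estimating homophily (a simple form of relational
# autocorrelation) as the fraction of edges whose two endpoint
# nodes carry the same class label.

def edge_homophily(edges, labels):
    """Fraction of edges connecting nodes with the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

labels = {0: "AI", 1: "AI", 2: "DB", 3: "AI"}
edges = [(0, 1), (1, 2), (1, 3), (0, 3)]

print(edge_homophily(edges, labels))  # 3 of 4 edges match -> 0.75
```

A high value (close to 1) indicates that linked nodes tend to share labels, which is the situation in which link-based classifiers are expected to help.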

However, sometimes the advantage of exploiting the relationships can become a disadvantage: an incorrect prediction about a particular node may propagate through the network and lead to incorrect predictions about other nodes. Bilgic and Getoor proposed an acquisition method which learns the cases where a given collective classification algorithm makes mistakes, and suggests label acquisitions to correct those mistakes [4].

Iterative classification algorithm (ICA), Gibbs sampling (GS), mean field relaxation labeling (MF) and loopy belief propagation (LBP) are popular approximate inference algorithms used for collective classification [34]. In this book chapter, we explain the ICA algorithm and use it in our experiments. ICA is a popular and simple approximate collective inference algorithm [21, 34]. Despite its simplicity, ICA was shown to perform as well as other algorithms such as Gibbs sampling [33]. Please see [21, 34] for details on the other collective classification algorithms.

6.3.1 Iterative Classification Algorithm (ICA)

To determine the label of a node, the Iterative Classification Algorithm (ICA) assumes that the attributes and labels of all of that node's neighbors are already known. It then predicts the most likely label with a local classifier that uses the node's content and its neighbors' labels. However, most nodes have neighbors which are not in the training data and hence are unlabeled, so the label assigned to one test instance may affect the labels assigned to related test instances. ICA therefore repeats the labeling process iteratively until all of the label assignments stabilize. Neighbor label information is summarized using an aggregation operator (Sect. 5.1).
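As a concrete illustration of one such operator, the count aggregation summarizes neighbor labels as a fixed-length vector of per-class counts, so every node yields the same number of relational features regardless of its degree. The function name and toy labels below are a hypothetical sketch, not code from the chapter:

```python
def count_aggregate(neighbor_labels, classes):
    """Per-class counts of the known (or currently estimated) neighbor labels."""
    counts = {c: 0 for c in classes}
    for lab in neighbor_labels:
        if lab is not None:        # neighbors with no label yet are skipped
            counts[lab] += 1
    return [counts[c] for c in classes]

print(count_aggregate(["AI", "DB", "AI", None], ["AI", "DB", "IR"]))
# -> [2, 1, 0]
```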

Pseudocode for the ICA algorithm (based on [34]) is given in Algorithm 1. In the pseudocode, \(\tilde{\mathbf{r}}(u)\) stands for the temporary label assignment of instance u in the test set. \(g_{\mathit{CL}}([\mathbf{x}(u)\;\mathbf{r}_{N}(u)])\) is the base classifier, which is first trained on training nodes and their neighbors from the training set. The base classifier uses the estimated labels of the neighbors if they are test nodes. O is a random ordering of the test nodes.

Algorithm 1: \(\tilde{\mathbf{r}}(V_{\mathit{test}}) = \mathrm{ICA}(G, V_{\mathit{train}}, V_{\mathit{test}}, g_{\mathit{CL}}())\)

As shown above, the ICA algorithm starts with a bootstrapping step that assigns initial temporary labels to all nodes using only the content features of the nodes. Then, it starts iterating and updating labels according to both the relational and content features [32].
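The bootstrap-then-iterate loop described above can be sketched as follows. This is a minimal illustration of the idea rather than the exact pseudocode of Algorithm 1: the count aggregation, the toy rule-based classifier standing in for \(g_{\mathit{CL}}\), and all variable names are our own assumptions.

```python
import random

def count_agg(u, labels, neighbors, classes):
    """Count aggregation of the current labels of u's neighbors."""
    counts = [0] * len(classes)
    for nb in neighbors[u]:
        if labels.get(nb) is not None:
            counts[classes.index(labels[nb])] += 1
    return counts

def ica(x, neighbors, train_labels, test_nodes, classes, g_cl, max_iters=10):
    labels = dict(train_labels)
    # Bootstrap: temporary labels from content only (zeroed relational part).
    for u in test_nodes:
        labels[u] = g_cl(x[u], [0] * len(classes))
    for _ in range(max_iters):
        changed = False
        # Visit test nodes in a random order, as in Algorithm 1.
        for u in random.sample(test_nodes, len(test_nodes)):
            new = g_cl(x[u], count_agg(u, labels, neighbors, classes))
            changed |= new != labels[u]
            labels[u] = new
        if not changed:            # label assignments have stabilized
            break
    return {u: labels[u] for u in test_nodes}

# Toy demonstration: a hypothetical g_CL that follows the neighbor-label
# majority when any neighbor labels exist, else a single content feature.
classes = ["A", "B"]
def g_cl(content, rel):
    if sum(rel) > 0:
        return classes[rel.index(max(rel))]
    return "A" if content[0] else "B"

x = {2: [1], 3: [0]}
neighbors = {2: [0, 3], 3: [1, 2]}
print(ica(x, neighbors, {0: "A", 1: "A"}, [2, 3], classes, g_cl))
# -> {2: 'A', 3: 'A'}
```

In practice \(g_{\mathit{CL}}\) would be a trained classifier (e.g. one of the Weka classifiers used in the experiments below) rather than a hand-written rule.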

7 Experiments

In this section, we report classification results on three networked datasets. We evaluate content only classification (CO), ICA using only a link based classifier (LO), and ICA using a content and link based classifier (ICA). We show that whether a method works on a dataset is highly dependent on the properties of the dataset, especially its homophily. In the experiments, we report results using both snowball and random sampling methods to select the test set. We show that the sampling method may also affect the results.

7.1 Experimental Setup

We use the Cora, Citeseer and WebKb datasets, whose details are given in Table 1. Both the Cora and Citeseer datasets consist of information on scientific papers. As features, the words that occur at least ten times are used. For each paper, whether or not it contains a specific word, which class it belongs to, which papers it cites and which papers it is cited by are known. The WebKB dataset contains web pages instead of papers as content. The hyperlinks between web pages are used to produce the links between nodes.
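Such binary word-occurrence features can be sketched roughly as follows. We assume here that "occurs at least ten times" is measured as document frequency; the threshold, function name and toy corpus below are illustrative only.

```python
from collections import Counter

def binary_features(docs, min_df=10):
    """Binary word-occurrence vectors over words appearing in >= min_df documents."""
    df = Counter(w for doc in docs for w in set(doc.split()))
    vocab = sorted(w for w, c in df.items() if c >= min_df)
    return vocab, [[1 if w in set(doc.split()) else 0 for w in vocab]
                   for doc in docs]

vocab, X = binary_features(
    ["graph neural net", "graph mining", "text mining"], min_df=2)
print(vocab, X)  # -> ['graph', 'mining'] [[1, 0], [1, 1], [0, 1]]
```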

We experiment with different classification methods, namely logistic regression (LR), Support Vector Machine (SVM), Naive Bayes (NB), Bayes Net (BN), and k-Nearest Neighbor (kNN, k = 3), for the \(g_{\mathit{CO}}\), \(g_{\mathit{LO}}\) and \(g_{\mathit{CL}}\) classifiers. For all of the methods, Weka implementations with default parameters (unless otherwise noted) have been used.

7.2 Performance of Different Classifiers

First of all, we conducted experiments on datasets using Logistic Regression (LR), Support Vector Machine (SVM), Naive Bayes (NB), Bayes Net (BN), k-Nearest Neighbor (kNN) classifiers. We evaluated the accuracies obtained when content only (CO), link only (LO) classifiers and ICA are used with a specific classification method for each dataset.

Tables 4, 5, and 6 show the accuracies obtained when different classifiers are used on the Cora, WebKb and Citeseer datasets when snowball sampling is used.

Table 4 Accuracies on Cora dataset using different classifiers (snowball sampling)
Table 5 Accuracies on WebKb dataset using different classifiers (snowball sampling)
Table 6 Accuracies on Citeseer dataset using different classifiers (snowball sampling)

Tables 7, 8, and 9 show the accuracies obtained when different classifiers are used on the Citeseer, Cora and WebKb datasets when random sampling is used.

Table 7 Accuracies on Citeseer dataset using different classifiers (random sampling)
Table 8 Accuracies on Cora dataset evaluated using different classifiers (random sampling)
Table 9 Test accuracy results of WebKb dataset evaluated using different classifiers (random sampling)

For CO classification, while LR, SVM, BN and NB give similar accuracies, kNN usually performs worse. On the other hand, for LO classification all classifiers perform similarly.

For the CO classification, SVM, NB and BN classifiers usually performed better than the others. We think that this is due to the high input dimensionality of the content features. On the other hand, for LO classification LR outperformed the other methods. Since the homophily of the Cora dataset is higher than that of CiteSeer, the LO accuracies are also higher. For the Cora dataset, instead of using thousands of features in the CO classifier, simply using the aggregated neighbor classes in the LO classifier results in better accuracy. This is expected since both datasets have high homophily, and LO (and ICA) benefit from homophily. For both the Cora and Citeseer datasets, the BN method gives the best results for ICA, and ICA performs better than the CO methods. However, again due to the high homophily of the link graph, ICA performs only slightly better than LO.

8 Conclusion

In this chapter, we have given details on how content, link and label information on social network data can be used for classification. We defined important properties of social networked data, which may be used to characterize a networked dataset. We defined a list of aggregation operators, which are used to aggregate the labels of neighbors, so that the classifiers trained have the same number of inputs for each node. We evaluated the accuracies of different classifiers, using only the content, only the link or both the content and the link information.

We have shown that graph properties, especially homophily, play an important role in whether network information helps classification accuracy or not. The Cora dataset, which has high homophily, benefits a lot from the use of network information. It is also worth noting that when homophily is high, one can use only the link information and may not need the content information at all.

We experimented with a number of label aggregation methods and found out that the count method, which is one of the simplest aggregation methods, is as good as the other methods.

Networked data does not obey the i.i.d. assumptions that are usually assumed to hold for content only datasets. The test instances may either be distributed randomly in the network, or they may be concentrated in certain parts of it. We used random and snowball sampling algorithms to separate a portion of all the available data as test data. We have shown that, with random sampling, the test accuracies of classifiers that use network information are consistently better. This is due to the fact that with a random partition the nodes in the test partition naturally have more neighbors from the train and validation partitions, which have the actual labels, as opposed to the labels estimated by the classifiers.

Depending on whether content only, link only or content and link information is used, and depending on the dataset and the type of content information, we have shown that different types of classifiers, such as SVM, Naive Bayes and logistic regression, may perform differently. All the datasets we used in our experiments contained text as content information; therefore, classifiers which were able to deal with high dimensional features, such as SVM, Naive Bayes and Bayes Net, performed better for content only or content and link classification. On the other hand, when only link information was used, the logistic regression classifier performed better. We suggest experimenting with different types of classifiers, and with content only, link only and content and link classification, for any new networked dataset.