1 Introduction

We study the problem of graph coloring for complex networks such as social and information networks. Our focus is on designing (1) accurate coloring methods that are (2) fast for large-scale networks of massive size. These requirements lead us to introduce a unified coloring framework that can serve as a basis for investigating and comparing the proposed methods.

Graph coloring is an important fundamental problem in combinatorial optimization with numerous applications including timetabling and scheduling (Budiono and Wong 2012), frequency assignment (Sivarajan et al. 1989; Banerjee and Mukherjee 1996), register allocation (Chaitin 1982), and more recently to study networks of human subjects (Kearns et al. 2006; Chaudhuri et al. 2008), among many others (Colbourn and Dinitz 2010; Moscibroda and Wattenhofer 2008; Ni et al. 2011; Capar et al. 2012; Schneider and Wattenhofer 2011; Grohe et al. 2013). The graph coloring problem consists of assigning colors to vertices such that no two adjacent vertices are assigned identical colors, while minimizing the number of colors. However, in general, the coloring problem is known to be computationally intractable (NP-hard), even to approximate it within \(n^{1-\epsilon }\) (Garey and Johnson 1979). Nevertheless, coloring lies at the heart of many applications where the goal is to partition a set of entities into classes where two related entities are not in the same class while also minimizing the number of classes used.

Despite its practical importance in a variety of domains (e.g., engineering, scientific computing), coloring algorithms for complex networks such as social, biological and information networks have received considerably less attention. Majority of work focuses on graphs that are relatively small, synthetic, or from other domains. However, these real-world networks (e.g., social networks) are usually sparse with complex structural patterns (Newman and Park 2003; Boccaletti et al. 2006; Barabasi and Oltvai 2004; Davidson et al. 2013; Kleinberg 2000; Adamic et al. 2001), while also massive in size and growing at a tremendous rate over time. For instance, the web graph has well over 1 trillion pages, whereas social networks such as Facebook have hundreds of millions of users. Unfortunately, coloring algorithms suitable for these large sparse real-world networks have been largely ignored, even despite the significance of coloring and its potential for use in a wide variety of applications. Furthermore, due to the aforementioned reasons, there has yet to be a systematic investigation of coloring and its potential applications.

In terms of social networks, coloring has been used for finding roles (see Everett and Borgatti 1991), but that work is limited to extremely small instances and does not scale to the requirements of modern social and information networks present in the age of big data. Others have used coloring to study small controlled groups of human subjects and their behavior (Kearns et al. 2006; Chaudhuri et al. 2008). Nevertheless, coloring methods for large sparse networks have not been proposed, nor has coloring been used for applications in these large networks.

The age of big network data has given rise to numerous opportunities and potential applications for graph coloring including descriptive and predictive modeling tasks. A few of the possibilities are discussed below. For instance, the number of colors, distribution of the size of independent sets, and other properties derived from coloring are useful in tasks such as relational classification (as features) (Sen et al. 2008; De Raedt and Kersting 2008), graph similarity (Berlingerio et al. 2013), anomaly detection (Akoglu et al 2010; Aggarwal et al. 2011), network analysis (Chaoji et al. 2008; Sun et al. 2008; Kang et al. 2011; Wang and Davidson 2010), or for evaluating graph generators, among many other tasks (Sharara et al. 2012). Additionally, vertex or edge induced neighborhoods may also be colored to study various questions; similar to the work of Ugander et al. (2013a) which used neighborhood motifs instead. Independent sets are also seemingly useful in many applications. One such application is network sampling, where vertices/edges may be selected from a large independent set to ensure good network expansion (and of course independence), and may be useful for estimating properties efficiently in the age of big data (Al Hasan and Zaki 2009; Ahmed et al. 2014). Indeed, such a sampling strategy would also be particularly useful for machine learning problems such as relational active learning (Sharma and Bilgic 2013), see the work of Bilgic et al. (2010). It is also easy to find applications in other problem domains, e.g., network A/B testing (Ugander et al. 2013b) which requires running randomized experiments on two independently sampled universes, A and B, to test the effectiveness of new products and marketing campaigns.

Although some recent work has used coloring in small social networks (Enemark et al. 2011; Mossel and Schoenebeck 2010), there has not been any systematic evaluation or comparison of coloring methods for large complex networks of various types. Further, this recent work also used only small networks. Moreover, the majority of previous work used a single coloring method and therefore lacked any evaluation or comparison to other coloring methods. Due to this, the properties and behavior of coloring algorithms for social and information networks are not well understood and are left largely unexplored. This work attempts to fill this gap by developing a variety of techniques that exploit the structure of these large networks while also being fast and scalable for partitioning the vertices into independent sets.

More specifically, we address the theoretically and practically important problem of graph coloring with a focus on coloring large complex networks such as social, biological and technological networks. For this purpose, we develop a flexible framework that serves as a foundation for coloring real-world graphs. The framework is designed to be fast, scalable, and accurate across a wide variety of networks (i.e., social, biological). To satisfy these requirements, we relax the constraint of using the minimum number of colors, and instead focus on balancing the competing tradeoffs of accuracy and performance. This relaxation provides us a framework that scales linearly with the graph size, while also accurate as demonstrated in Sect. 6. Using this framework, we propose three classes of coloring methods designed specifically for the scale and the underlying structure of these complex networks. These include social-based methods, multi-property methods, and egonet-based coloring methods (See Table 1). We also adapt previous coloring methods/heuristics that have been widely used on small and/or dense graphs from other domains (Gebremedhin et al. 2013; Leighton 1979; Matula and Beck 1983; Coleman and Moré 1983; Welsh and Powell 1967; McCormick 1983) and unify them under the greedy coloring framework. This provides us with a basis for comparing our proposed techniques with those traditionally used. We also develop static and dynamic ordering techniques for coloring based on triangle counts, triangle-cores (Zhang and Parthasarathy 2012; Rossi 2014), and a variety of egonet properties, and demonstrate the effectiveness of these methods using a large collection of networks from a variety of domains including social, biological, and technological networks.

The dynamic triangle ordering techniques proposed here are likely to be of use in other applications and/or problems such as for improving community detection (Blondel et al. 2008; Fortunato 2010), distance queries (Jiang et al. 2014), the maximum clique problem (Prosser 2012; Carraghan and Pardalos 1990), and numerous other problems that rely on an appropriate vertex/edge ordering.

We also formulated the problem of coloring neighborhood subgraphs and proposed a parallel algorithm that leverages our previous methods. One key finding is that neighborhoods that are colored using a relatively few number of colors are not well connected, with low clustering and a small number of triangles. While neighborhood colorings that use a relatively large number of colors have large clustering coefficients and usually contain large cliques. Nevertheless, we also find linear speedups and many other interesting results (See Sect. 7 for further details).

In addition to the technical contributions, the other aim of this work is a large-scale investigation of coloring methods for these types of networks. In particular, we compare the three classes of our proposed coloring methods to a wide variety of previous methods that are considered state-of-the-art for relatively small and/or dense graphs from other domains. Using our unified framework as a basis, we systematically evaluate our proposed coloring methods (with past methods) on over 100 networks from a variety of types including social, biological, and information networks.Footnote 1

The types of graphs differ in their size, semantics, structure, and the underlying process governing their formation. Overall, we find a significant improve over the previously proposed methods in nearly all cases. Moreover, the solutions obtained are nearly optimal and sometimes provably optimal for certain classes of graphs (e.g., collaboration networks). Additionally, the large-scale investigation on 100+ networks revealed a number of useful and insightful observations. One main finding of this work is that despite the pessimistic theoretical results previously mentioned, large sparse networks found in the real-world can be colored fast and accurately using the proposed methods.

The remainder of this article is organized as follows: Preliminaries are given in Sect. 2. Section 3 introduces the framework along with the proposed methods while Sect. 4 proposes the more accurate recolor variant. In Sect. 5, we derive the lower and upper bounds used throughout the remainder of the article. Section 6 demonstrates the effectiveness of the proposed methods on over a hundred networks. Next, Sect. 7 formulates the neighborhood coloring problem and proposes a parallel algorithm for coloring neighborhood subgraphs. We also provide numerous results indicating the scalability and utility of our approach. Finally, Sect. 8 concludes.

2 Background

Networks are ubiquitous and can be used to represent data in various domains, from social, biological, and information domains. Facebook is a good live example of a real-world network, where vertices represent people, and edges represent relationships/communications among them. In this section, we start by defining the fundamental graph properties used in the problem of coloring networks.

Assume \(G=(V,E)\) is an undirected graph used to represent some network, such that \(V\) is the set of vertices, and \(E\) is the set of edges. We use the term \(index(v)\) to refer to the index of a vertex \(v\). This index represents the unique identifier of a vertex \(v\) as it appears in the graph \(G\). One simple example of an index could be the unique userid assigned to each user by online social network providers (e.g, Facebook). Similarly, we use \(d(v)\) to represent the vertex degree, such that \(d(v)\) is the number of adjacent vertices (i.e, neighbors) to \(v\) in the graph. The concept of a vertex degree could simply describe the number of friends of a Facebook user.

Another property that proved to be useful particularly in social networks, is transitivity. A transitive edge would mean that if \(u\) is connected to \(v\) and \(v\) is connected to \(w\), then \(u\) is connected to \(w\). In this case \(uvw\) represents a triangle in \(G\). We use the term \(tr(v)\) to refer to the number of triangles incident to a vertex \(v\). In common parlance, for a user \(x\) in a social network, the number of pairs of friends of \(x\) that are also friends themselves would represent the number of triangles. The concept of transitivity can be also generalized to subgraphs with more than three vertices. In this case, every vertex in the subgraph is connected by an edge to every other. These types of subgraphs is typically called cliques. Note that cliques are maximal subgraphs, means that no other vertex in the network can be a member of the clique while preserving the same property that every vertex in the clique is connected to every other. In social networks, the occurrence of cliques indicates highly connected subgroups of users, such as co-workers.

Cliques are one example of the more generic concept of network groups. In networks, vertices can be divided into various types of groups or communities that help to explain the underlying network structure. In this section, we introduce two fundamental concepts of network groups related to the problem of coloring networks (\(k\)-core, and \(k\) triangle-core).

A \(k\)-core is a maximal subgraph of \(G\), such that every vertex in the subgraph is connected to at least \(k\) others in the subgraph (Matula and Beck 1983). The concept of \(k\)-core was first introduced in (Szekeres and Wilf 1968). \(k\)-cores are useful for various applications in network analysis, such as finding communities and cliques (Rossi et al. 2014). A simple algorithm to find the \(k\)-core of the graph \(G\) is to start with the whole graph, and remove any vertices that have degree less than \(k\). Clearly, the removed vertices cannot be members of a \(k\)-core (i.e, a core with order \(k\)) under any conditions. Note that by removing these vertices, naturally, the connected vertices to the removed ones will reduce their degrees as well. Therefore, the procedure continues until there are no vertices in the graph with degree less than \(k\). The output of this procedure is the \(k\)-core (or \(k\)-cores) of \(G\).

This procedure can also be repeatedly used to compute the core decomposition of the graph—this means computing the core number of each vertex \(v\). The core number of a vertex (denoted by \(K(v)\)) is defined as the highest order \(k\) of a maximum \(k\)-core that \(v\) can possibly belong to. While simple to implement, this procedure has a worst case runtime of \(O(|E|\cdot |V|\cdot \log |V|)\). However, the runtime can be efficiently reduced to \(O(|V|+|E|)\) by another implementation–which we use in this paper (see more details in Batagelj and Zaversnik 2003).

The concept of \(k\) triangle-core has recently emerged in network analysis research, it was first proposed in (Cohen 2009), and improved in (Zhang and Parthasarathy 2012; Rossi 2014). A \(k\) triangle-core is an edge-induced subgraph of \(G\) such that each edge participates in at least \(k-2\) triangles and \(k \ge 2\). A subgraph \(H_k = (V|E(F))\) induced by the edge-set \(F\) is a maximal triangle core of order \(k\) if \(\forall (u,v) \in F : tr_H(u,v) \ge k-2\), and \(H_k\) is the maximum subgraph with this property. Most importantly, we define the triangle core number denoted \(T(u,v)\) of an edge \(e=(u,v) \in E\) to be the highest order \(k\) of a maximum triangle \(k\)-core that \(e\) can possibly belong to. See Fig. 1 for further intuition. Computing the triangle core numbers of each edge \(e\) in the graph \(G\) is called the triangle core decomposition of \(G\). In Sect. 3.2, we provide an efficient algorithm for computing the triangle core decomposition with runtime \(O(|E|^{3/2})\).

3 Greedy coloring framework

In this section, we present a scalable fast framework for coloring large complex networks and introduce the variations designed for the structure of these large complex networks found in the real-world.

3.1 Problem definition

Let \(G=(V,E)\) be an undirected graph. A clique is a set of vertices any two of which are adjacent. The maximum size of a clique in \(G\) is denoted \(\omega (G)\). An independent set \(C\) is a set of vertices any two of which are non-adjacent, thus, \(\forall (v,u) \in C\) iff \((v,u) \not \in E\). The graph coloring problem consists of assigning a color to each vertex in a graph \(G\) such that no adjacent vertices share the same color, minimizing the number of colors used. More formally,

Definition 3.1

(Graph Coloring Problem) Given a graph \(G\), find a mapping \(\phi : V \rightarrow \{1,\ldots ,k\}\) where \(\phi (v_i) \not = \phi (v_j)\) for each edge \((v_i,v_j) \in E\). such that \(k\) (the number of colors) is minimum.

This problem may also be viewed as a partitioning of vertices \(V\) into independent sets \(C_1, C_2,\ldots ,C_k\) where \(\{1,2,\ldots ,k\}\) are called colors and the sets \(C_1,\ldots ,C_k\) are referred to as color classes. Thus, the graph coloring problem is to find the minimum number \(k\) of independent sets (or color classes/partitions) required to color the graph \(G\). Nevertheless, graph coloring is NP-hard to solve optimally (on general graphs), and for all \(\epsilon > 0\), it is even NP-hard to approximate to within \(n^{1-\epsilon }\) where \(n\) is the number of vertices (Garey and Johnson 1979).

In this work, we relax the strict requirement of partitioning the vertices into the minimum number of independent sets to allow for colorings that are close to the optimal. This relaxation gives rise to fast linear-time coloring algorithms that perform well in practice (See Sect. 6). Motivated by this, we describe general conditions for greedy coloring that can serve as a unifying framework in the study of these algorithms. More formally, we define the greedy coloring framework as follows:

Fig. 1
figure 1

Triangle cores 4, 3, and 2. A \(k\) triangle-core is an edge-induced subgraph of \(G\) such that each edge participates in \(k - 2\) triangles. Hence, each clique of size \(k\) is contained within a \(k\) triangle-core of \(G\). Similarly, the \(k\) triangle-core is contained within the \((k - 1)\)-core (i.e., the \(k-1\) core from the k-core decomposition)

Definition 3.2

(Framework) Given a graph \(G=(V,E)\) and a vertex property \(f(\cdot )\), the greedy coloring framework selects the next (uncolored) vertex \(v\) to be colored such that

$$\begin{aligned} v = \mathop {\hbox {argmax}}\limits _{v_i} f(v_i) \end{aligned}$$

The selected vertex \(v\) is then assigned to the smallest permissible color. This process is repeated until all vertices are colored.

The main intuition of the greedy coloring framework is to color the vertices that are more constrained in their choice of color as early as possible, giving more freedom to the coloring algorithm to use fewer colors, and thus result in a tighter upper bound on the exact number of colors. As an aside, selecting the vertex that minimizes \(f(v)\) usually results in a coloring that uses significantly more colors than the latter. Notice that a fundamental property of the above greedy coloring framework is that it is both fast and efficient, thus, providing us with a natural basis for investigating the coloring of large real-world networks, which is precisely the scope of this work.

The above definition of the framework uses a selection criterion as the basis for coloring. Instead, we replace the selection criterion with the more general notion of a vertex ordering. More specifically, given a graph \(G=(V,E)\) and a vertex ordering

$$\begin{aligned} \pi = \{v_1,v_2,\ldots ,v_i,\ldots ,v_n\} \end{aligned}$$

of \(V\), let \({\upchi }(G,\pi )\) denote the number of colors used by a greedy coloring method that uses the vertex ordering \(\pi \) of \(G\). Hence, the greedy coloring framework selects the next vertex to color based on the vertex ordering. This formalization allows for a more precise characterization of the framework that depends on three components:

  1. 1.

    A graph property \(f(G)\) for selecting the vertices to color

  2. 2.

    The direction in which vertices are selected (e.g., smallest to largest). For instance, \(\pi = \{v_1,\ldots ,v_n\}\) is from max to min if \(f(v_1) \ge \cdots \ge f(v_n)\), or min to max if \(f(v_1) \le \cdots \le f(v_n)\).

  3. 3.

    A tie-breaking strategy for the case when the graph property assigns the same value to two vertices. Suppose \(f(v) = f(u)\), then \(v\) is before \(u\) in the ordering \(\pi \) if \(f^{\star }(v) > f^{\star }(u)\) where \(f^{\star }(\cdot )\) is another graph property used to break-ties.

Notice that two vertex orderings \(\pi _1\) and \(\pi _2\) from the graph property \(f(G)\) may significantly differ in the number of colors used in a greedy coloring (i.e., \({\upchi }(G,\pi _1) \not = {\upchi }(G,\pi _2) + \epsilon \)). This is due to the direction of the ordering (smallest to largest) and tie-breaking strategy selected. Consequently, a specific graph property \(f(\cdot )\) defines a class of orderings where the order direction (from max to min) and tie-breaking strategy (\(f^{\star }(\cdot )\)) represent a specific member of that class of orderings. Note that in general \(f(G)\) can be thought simply as a function for obtaining an ordering \(\pi \).

In addition, we also define a few relationships between the graph parameters introduced thus far. Clearly, \({\upchi }(G,\pi )\) from a greedy coloring method is an upper bound on the exact number of colors required, denoted by \(\chi (G)\), i.e., the minimum number of colors required for coloring \(G\). Further, let \(\omega (G)\) be the size of the maximum clique in \(G\), which is also a lower bound on the minimum number of colors required to color \(G\). This gives the following relationship:

$$\begin{aligned} \omega (G) \le \chi (G) \le {\upchi }(G,\pi ) \le \Delta (G)+1 \end{aligned}$$

where \(\Delta (G)\) is the maximum degree of \(G\).

An example of the framework is shown in Fig. 2. This illustration uses a proposed triangle selection criterion, which is shown later in Sect. 6 to be extremely effective for large social and information networks.

Table 1 Methods used as selection criterion

3.2 Ordering techniques

In this section, we first review the previous methods used for coloring relatively small and/or dense graphs from other domains (see Gebremedhin et al. 2013), which are unified under our coloring framework. Many are considered state-of-the-art greedy coloring techniques and shown to perform reasonably well for those types of graphs. Despite the past success of these methods, they are not as well suited for large sparse complex networks (e.g., social, information, and technological networks) as demonstrated in this work. As a result, we propose three classes of methods for greedy coloring based on well-known fundamental properties of these large complex networks. In particular, we propose social-based methods, multi-property, and methods based on egonet properties, which are shown later in Sect. 6 to be more effective than the state-of-the-art techniques used in coloring graphs from other domains. A summary and categorization of these methods are provided in Table 1.

figure a

Index-based methods The simplest arbitrary ordering techniques under the sequential greedy coloring framework are natural ordering (natural) and random ordering (rand). The natural ordering (natural) method selects the vertices to be colored in their natural order as they appear in the input graph \(G\), i.e., \(v_1,v_2,\ldots ,v_n\). We also define the random ordering (rand) as the method that selects the vertices to be colored randomly. Therefore, the (rand) method selects a vertex by drawing an uncolored vertex uniformly at random without replacement from \(V\).

Degree methods The four simplest, yet most popular ordering methods under the sequential greedy coloring framework (Sect. 3) are all based on vertex degree. Specifically, we use the degree ordering deg, the incidence degree ordering (ido), the dynamic-largest-first (dlf), and the k-core ordering (kcore) [a.k.a smallest-last ordering (slo)]. First, the degree ordering (deg) (Welsh and Powell , 1967) orders vertices from largest to smallest by their static degree as it appears in \(G\). Second, the incidence-degree ordering (ido) (Coleman and Moré 1983) dynamically orders vertices from largest to smallest by their back degree, such that the back degree of \(v\) is the number of its colored neighbors. In this case, the incidence-degree method initially starts with all vertices with back degree equal to zero, and initially selects an arbitrary vertex \(v\) to color. Then, all the neighbors of \(v\) will increase their back degree by one, and the next vertex with largest back degree will be selected for coloring. This process continues until all vertices are colored. Third, in contrast to the incidence-degree method (ido), the dynamic-largest-first (dlf) (Gebremedhin et al. 2013) dynamically orders the vertices by their forward degree from largest to smallest, where the forward degree of \(v\) is the number of its uncolored neighbors. Thus, the dynamic-largest-first method initially starts with all vertices with forward degree equal to their original degree in \(G\), and selects the first vertex \(v\) to color, such that \(v\) has the maximum degree in \(G\) [i.e, \(d(v)=\Delta (G)\)]. Consequently, all the neighbors of \(v\) will decrease their forward degree by one, and the vertex with the largest forward degree will be selected next to be colored.

Finally, the \(k\)-core ordering (kcore) [also known as the smallest-last ordering (slo) (Matula and Beck 1983)] orders the vertices from lowest to highest by their \(k\)-core number (refer to Sect. 2 for definition). The \(k\)-core ordering method (a.k.a smallest-last ordering) was proposed in (Matula and Beck 1983), based on the concept of \(k\)-core decomposition, to find a vertex ordering of a finite graph \(G\) that optimizes the coloring number of the ordering in linear time, by repeatedly removing the vertex of smallest degree. The \(k\)-core ordering dynamically orders the vertices by their forward degree from smallest to largest, where the forward degree of \(v\) is the number of its uncolored neighbors. The method initially starts with all vertices with forward degree equal to their original degree in \(G\), and selects the first vertex \(v\) to color, such that \(v\) has the smallest degree in \(G\) (i.e, \({d}(v)=\delta (G)\)). Thus, all the neighbors of \(v\) will decrease their forward degree by one, and the vertex with the next smallest forward degree will be selected for coloring. The output of this method is the vertex ordering for the coloring number, which is equivalent to ordering vertices by their \(k\)-core number as defined in (Szekeres and Wilf 1968).

Table 2 Dynamic ordering methods

These methods (including kcore) were found to be superior to others, especially for forests and a few types of planar graphs (Gebremedhin et al. 2013). We also use these as baselines for evaluating our proposed methods (see Sect. 6).

Distance-2 degree methods We note that the degree-based methods were defined on the 1-hop away neighbors of each vertex \(v \in V\). These methods can also be extended for the unique 2-hop away neighbors of each vertex \(v \in V\) (McCormick 1983), we call these methods distance-2 degree ordering (dist-two-deg), distance-2 incidence degree ordering (dist-two-id), distance-2 dynamic largest first ordering (dist-two-dlf), and distance-2 \(k\)-core ordering (dist-two-kcore) respectively.

Fig. 2
figure 2

Greedy coloring framework. In this graph, we use the number of triangles incident to a vertex \(v \in V\) as the selection criterion. On the left, vertices are ordered from largest to smallest using the number of triangles, which results in a greedy coloring that utilizes only three colors. For this graph, this coloring is also optimal and thus \(\chi (G,\pi ) = {\upchi }(G)\). However, when vertices are ordered from smallest to largest (on the right) results in a coloring that uses four colors. As an aside, \(\Delta (G)+1\) is the maximum number of colors that can be used from any greedy coloring method from the framework and thus \({\upchi }(G,\pi ) \le \Delta (G)+1\). In this graph, \(\Delta (G)+1 = 4\) and thus, the ordering used on the right is also the worst possible coloring that can be obtained. Notice that in this example, we used the vertex index as the tie breaking strategy, i.e., \(v_i\) is ordered before \(v_j\) if \(i < j\). We also note that if the proposed repair coloring scheme (Sect. 4) were used in the minimum triangle selection criterion, then only three colors would be needed. Other selection criterion (e.g., degree) may lead to a different vertex ranking and as a consequence the greedy coloring framework may result in an entirely different coloring. For instance, ranking the vertices by max degree gives \(\{v_2, v_3, v_4, v_1, v_5\}\) which differs from the ranking given by max triangle counts. Further, if two nodes have equal degrees, then we break ties using triangle counts (known as a tie-breaking strategy)

Social-based methods While the degree-based methods were shown to perform well in the past, in this paper, we compare them to other social-based orderings such as triangle ordering (tri), and triangle-core ordering (tcore).

First, the triangle ordering (tri) method orders vertices from largest to smallest by the number of triangles they participate in, i.e. \(f(v)=tr(v)\) where \(tr(v)\) can be computed fast and in parallel using Algorithm 2. Other triangle-based quantities such as clustering coefficient may also be used and computed fast and efficiently using Algorithm 2. Thus, the triangle ordering initially selects the vertex \(v\) with the largest number of triangles centered around it. This process continues until all vertices are colored. The intuition behind triangles in social networks is that vertices tend to cluster, and therefore, triangles were extensively used to measure the number of vertices adjacent to \(v\) that are also linked together (as explained in Sect. 2). We conjecture that ordering vertices from largest to smallest by their triangle number would give a chance to those vertices that are more constrained in their choices of color to be colored first than those that have more freedom (as we explained earlier).

figure b

Second, the triangle-core ordering (tcore) method orders vertices from largest to smallest by their triangle core number (as explained in Sect. 2). Using the triangle core numbers, we obtain an ordering and use it to determine the next vertex \(v\) (or edge) to color, using the criteria: \(f(v) = \max _{w \in N(v)} {T}(v,w)\), where \(N(v)\) is the set of neighbors of vertex \(v\), and \({T}(v,w)\) is the triangle core number of the edge \((v,w) \in E\). Notice that triangle core ordering is comparable to \(k\)-core ordering, however, instead of removing a vertex and its edges at each iteration, we remove an edge and its triangles. This gives rise to a variety of ordering methods based on the fundamental notion of removing edges and their triangles. We call these dynamic triangle ordering methods and provide a summary of the main ones in Table 2 as well as a comparison with a few of the dynamic degree-based methods. Let us note that any edge-based quantity may be used for ordering vertices (and vice-versa). For instance, tcore-max defined in Table 1 computes for every vertex \(v\) in the graph, the maximum triangle core number among the (1-hop)-away-neighbors of \(v\).

The proposed triangle ordering template is shown in Alg 3 and the key operations are also summarized in Table 2. The backward (or forward) triangle counts are initialized in line 2. For slt, ParallelEdgeTriangles shown in Alg 4 is used to initialize the triangle counts. Next, line 3 adds \((v,u)\) to the bucket consisting of the edges with \(T(v,u)\) triangles which is denoted \(\texttt{bin}[T(v,u)]\). Hence, the edges are ordered in \(O(|E|)\) time using a bucket sort. Note that if it is used then this step can be skipped since each edge \((v,u)\) is initialized as \(T(v,u) = 0\).

The triangle ordering begins in line 4 by ensuring \(|E| > 0\) where \(E\) initially consists of all edges in \(G\). At each iteration, a single edge \((v,u)\) is removed from \(E\). Line 5 finds the edge \((v,u)\) with the smallest \(T(v,u)\) or largest \(T(v,u)\), see Table 2 for the variants. The neighbors of \(u\) that remain in \(E\) are marked in line 7 with the unique edge identifier \(e_i\) of \((v,u)\) (to avoid resetting the array). In line 8, we iterate over the triangles that \((v,u)\) participates, i.e., the pairs of edges \((v,w)\) and \((u,w)\) that form a triangle with \((v,u)\). Since the neighbors of \(u\) are marked in \(X\), then a triangle is verified by checking if each neighbor \(w\) of \(v\) has been marked in \(X\), if so then \(u,v,w\) must form a triangle. Line 9 sets \(\texttt{bin}[T(v,w)] \leftarrow \texttt{bin}[T(v,w)] \setminus (v,w)\) and \(\texttt{bin}[T(u,w)] \leftarrow \texttt{bin}[T(u,w)] \setminus (u,w)\), removing \((v,w)\) and \((u,w)\) from their previous bins. Next, the triangle counts of \((v,w)\) and \((u,w)\) are updated in line 10 using an update rule from Table 2. Afterwards, line 11 adds the edges to the appropriate bin, i.e., \(\texttt{bin}[T(v,w)] \leftarrow \texttt{bin}[T(v,w)] \cup (v,w)\) and \(\texttt{bin}[T(u,w)] \leftarrow \texttt{bin}[T(u,w)] \cup (u,w)\). This is repeated for each pair of edges \((v,w)\) and \((u,w)\) that form a triangle with \((v,u)\). Finally, line 12 implicitly removes the edge \((v,u)\) from \(E\).

figure c
figure d

Egonet-based methods An egonet is the induced subgraph centered around a vertex \(v\) and consists of \(v\) and all its neighbors \(N(v)\). Assume we are given an arbitrary graph property \(f(\cdot )\) (e.g., triangle-cores, number of triangles) computed over the set of neighbors of \(v\), i.e., \(N(v)\), we define an egonet ordering criterion for a vertex \(v\) as \(\sum _{w \in N(v)} f(w)\). In addition, besides using the \(\mathrm{sum}\) operator over the egonet, one may use other relational aggregators such as \(\min \), \(\mathrm{max}\), \(\mathrm{var}\), \(\mathrm{avg}\), among many others.

Multi-property methods We also propose ordering techniques that utilize multiple graph properties. For instance, the vertex to be colored next may be selected based on the product of the vertex degree and \(k\)-core number, i.e., \(f(v) = K(v) \cdot d(v)\).

3.3 Algorithm and implementation

This section describes the algorithms and implementation. The graph is stored using \(O(|E| + |V|)\) space in a structure similar to compressed sparse column (CSC) format used for sparse matrices (Tewarson 1973). If the graph is small and/or dense enough, then it is also stored as an adjacency matrix for constant time edge lookups. Besides the graph, the algorithm uses two additional data structures. In particular, let \({\mathsf {color}}\) be an array of length \(n\) that stores the color assigned to each vertex, i.e., \({\mathsf {color}}(v)\) returns the color class assigned to \(v\). Additionally, we also have another array to mark the colors that have been assigned to the neighbors of a given vertex and thus we denote it as \({\mathsf {used}}\) to refer to the colors “used” by the neighbors.

The algorithmic framework for greedy coloring is shown in Alg 1. For the purpose of generalization, we assume the vertex ordering \(\pi \) is given as input and computed using a technique from Sect. 3.2.

The algorithm starts by initializing each entry of \({\mathsf {color}}\) with \(0\). We also initialize each of the entries in \({\mathsf {used}}\) to be an integer \(x \not \in V\) (i.e., an integer that does not match a vertex id). The greedy algorithm starts by selecting the next vertex \(v_i\) in the ordering \(\pi \) to color. For each vertex \(v_i\) in order, we first iterate over the neighbors of \(v_i\) denoted \(w \in N(v)\), and set \({\mathsf {used}}({\mathsf {color}}(w)) = v_i\) as shown in line 4. This essentially marks the colors that have been used by the neighbors. Afterwards, we sequentially search for the minimal \(k\) such that \({\mathsf {used}}(k) \not = v_i\) (in line 7). Line 5 assigns this color to \(v_i\), hence \({\mathsf {color}}(v_i) = k\). Upon termination, \({\mathsf {color}}\) is a valid coloring and the number of colors is \({\upchi }(G,\pi ) = \mathop {\hbox {argmax}}\nolimits _{v \in P} {\mathsf {color}}(v)\). We denote \({\upchi }(G,\pi )\) as the number of colors from a greedy coloring algorithm that uses the ordering \(\pi \) of \(V\), which is easily computed in \(O(1)\) time by maintaining the max color assigned to any vertex.

Note that in line 4, the color of \(w\) (a positive integer) is given as an index into the \({\mathsf {used}}\) array and marked with the label of vertex \(v_i\). This trick allows us to avoid re-initializing the \({\mathsf {used}}\) array after each iteration over a vertex \(v_i \in \pi \)—the outer for loop. Hence, if \(w\) has not yet been assigned a color, i.e., \({\mathsf {color}}(w) = 0\), then \(used(0)\) is assigned the label of \(v_i\), and since \(0\) is an invalid color, it is effectively ignored. In addition, each entry in \({\mathsf {used}}(k)\), \(1 \le k \le \Delta +1\) must initially be assigned an integer \(x \not \in \pi \).

3.4 Complexity

The storage cost is only linear in the size of the graph, since CSC takes \(O(|E| + |V|)\) space, the vertex-indexed array \({\mathsf {color}}\) costs \(O(|V|)\), and \({\mathsf {used}}\) costs \(O(\Delta + 1)\) space. For the ordering methods, degree and random take \(O(|V|)\) time, whereas the other “dynamic degree-based” techniques such as kcore have a runtime of \(O(|E|)\) time. The other ordering techniques that utilize triangles and triangle-cores take \(O(|E|^{3/2})\) time in the worst-case, but are shown to be much faster in practice. Importantly, we also parallelize the triangle-based ordering methods by computing triangles independently for each vertex or edge. We also note the distance two ordering methods are just as hard as the triangle ordering methods, yet perform much worse as shown in Sect. 6. Finally, the greedy coloring framework has a runtime of \(O(|V| + |E|)\) and \(O(|E|)\) for connected graphs.

Fig. 3
figure 3

Repair coloring. Suppose \(v\) is the vertex to be recolored since it is assigned to a new color class \(C_k\), then we find a color class \(C_i\) where \(v\) is adjacent to a single vertex \(w\) (i.e., \(N(v) \cap C_i = \{w\}\)). Now, we find a color class \(C_j\) s.t. \(j > i\) and \(w\) is not adjacent to any vertex in \(C_j\), i.e. \(|N(w) \cap C_j| = \emptyset \). If such a color class exists, then \(w\) is removed from \(C_i\) and assigned to \(C_j\). As a result of this reassignment, \(v\) can now be assigned to the \(C_i\) color class, therefore reducing the number of colors by 1

4 Recolor variant

This section proposes another coloring variant that attempts to recolor vertices to reduce the number of colors. The variant is effective while also fast for large real-world networks.

4.1 Algorithm

The recoloring variant is shown in Algorithm 5. This variant proceeds in a similar manner as the basic coloring algorithm from Sect. 3.3. The difference is that if a vertex is assigned an entirely new color \(k\) (i.e., number of colors used in the coloring increases), then an attempt is made to reduce the number of colors. Using this as a basis for recolor ensures that the algorithm is fast, taking advantage of only the most promising situations that arise.

Suppose the next vertex \(v\) in the ordering is assigned a new color \(k\) and thus \(C_k = \{v\}\), then we attempt to reduce the number of colors by reassigning an adjacent vertex \(u\) that was assigned a previous color \(i\) such that \(i < k\). Hence, if \(|C_i \cap N(v)| = 1\), then \(C_i\) contains a single adjacent vertex of \(v\) (i.e., a single conflict), and thus, we attempt to recolor \(u\) by assigning it to the minimum color \(j\) such that \(i < j < k\) and \(C_j \cap N(u) = \emptyset \). This arises due to the nature of the sequential greedy coloring and is formalized as follows: Given vertices \(v\) and \(u\) assigned to the \(i\)th and the \(j\)th colors, respectively, where \(v\) is colored first and \(i < j\), then since \(v\) is assigned the minimum possible color, then we know the colors less than \(i\) are invalid, however, \(v\) could potentially be assigned the colors \(i+1,\ldots ,k\), since these colors arose after \(v\) was assigned a color.

The key intuition of the recolor variant is illustrated in Fig. 3. In the start of the example, notice that \(v\) is assigned to a new color class \(C_k\) (i.e., contains only \(v\)). Therefore, the recolor method is called, which attempts to find \(v\) another color class denoted \(C_i\) where \(C_i < C_k\). For this, we search for a color class \(C_i\) that contains a single adjacent vertex denoted \(w\) (known as a conflict). Intuitively, we may assign \(v\) to \(C_i\) if we can find \(w\) another “valid” color class denoted \(C_j\). Notice that \(i < j < k\) such that the color class \(C_i\) appeared before \(C_j\) and so forth. In other words, \(v\) can be assigned to \(C_i\) if there exists a valid color class \(C_j\) for which \(w\) can be assigned. If such a \(C_j\) exists, then the number of colors is decreased by one.

figure e

5 Bounds

Lower and upper bounds on the minimum number of colors are useful for a number of reasons (see Sect. 6.4). In this section, we first provide a fast parallel method for computing a lower bound that is especially tight for large sparse networks. Next, we summarize the upper bounds used in this work, which are also shown to be strong, and in many cases matching that of the lower bound, and thus allowing us to verify the coloring from one of our methods.

Fig. 4
figure 4

Clique invariant and fast heuristic clique finder. Recall \(C_v\) is the clique being constructed, whereas \(P\) is the set of potential vertices that could be added to \(C_v\) to form a clique of \(|C_v|+1\). Further, after a vertex \(u\) from \(P\) is added to \(C_v\), we must then remove \(u\) from \(P\) and compute the intersection \(P \cap N_{R}(u)\). The result of this intersection depends intrinsically on how well \(u\) is connected to the vertices in \(P\). In the ideal case, the heuristic is guaranteed to find the largest possible clique as long as the vertices in \(P\) that form the largest clique among each other are added to \(C_v\). For instance, the largest clique in the above example is \(|C_v|+3=8\) formed by adding the three vertices forming a 3-clique (triangle) in \(P\) to \(C_v\), whereas if \(u \in P\) with 0 degree is added to \(C_v\), then \(|C_v|+1=6\), since \(P_{t} \cap N_{R}(u) = \emptyset \)

5.1 Lower bounds

Let \(\tilde{\omega }(G)\) be the size of a large clique from a heuristic clique finder and thus a lower-bound on the size of the maximum clique \(\omega (G)\). As previously mentioned, \(\tilde{\omega }(G) \le \omega (G) \le \chi (G)\). Since the maximum clique problem is known to be NP-hard, we use a fast parallel heuristic clique finder tuned specifically for large sparse complex networks. Our approach is shown in Algorithm 6 and found to be efficient while also useful for obtaining a large clique that is often of maximum or near-optimal size [i.e., \(\tilde{\omega }(G)\) is close to \(\omega (G)\)] for many types of large real-world networks.

Given a graph \(G=(V,E)\), the heuristic obtains a vertex ordering \(\pi = \{v_1,\ldots ,v_n\}\) and searches each vertex \(v_i\) in the ordering \(\pi \) for a large clique in \(N(v_i)\). For convenience, let \(N_{R}(v)\) be the reduced neighborhood of \(v\) defined formally as,

$$\begin{aligned} N_R(v) = G( \{ v \} \cup \{ u : (u,v) \in E, B(u) \ge |C_{\max }|, u \not \in X) \} \end{aligned}$$

where \(|C_{\max }|\) is the largest clique found thus far, \(B(u)\) is a vertex upper boundFootnote 2, and \({{X}}\) is a vertex-index array of pruned vertices [i.e., \(O(1)\) time check]. Thus, let \(P \leftarrow N_{R}(v)\) be the set of potential vertices and initially we set \(C_v \leftarrow \emptyset \). At each step in the heuristic, a vertex \(u \in P\) is selected according to a greedy selection criterion \(f(\cdot )\) such that \(u \leftarrow \max _{w \in P} f(w)\) where \(f(\cdot )\) is a graph property. The selected vertex \(u\) is added to \(C_v \leftarrow C_v \cup \{u\}\) and \(P_{t+1} \leftarrow P_{t} \cap N_{R}(u)\) where \(t\) denotes the iteration (or depth of the search tree). The local clique search terminates if \(|P_{t}| + |C_v| \le C_{max}\), since this indicates that a clique of a larger size cannot be found from searching further. See Fig. 4 for a simple example. Notice that \(C_v\) is the clique being built and grows by a single vertex each iteration, whereas \(P_{t+1}\) are the potential vertices remaining after adding \(u\) to \(C_v\). Hence, \(|P_{t+1}| < |P_{t}| < |P_{t-1}|\) is monotonically decreasing with respect to \(t\). It is clear from Fig. 4 that \(|P_{t+1}|\) and thus the size of the clique \(|C_v|\) strongly depends on \(u\) selected by the greedy selection criterion. In Fig. 4, suppose the vertex without edges to other vertices in \(P\) is selected and added to \(C_v\), then \(P_{t+1} \leftarrow \emptyset \) and the search terminates. The proposed heuristic clique finder is equivalent to searching down a single branch in a greedy fashion.

Let us also point out that Algorithm 6 is extremely flexible. For instance, the vertices in \(G\) (globally) and \(P\) (locally) are ordered by their k-core numbers (see line 3 and 7), but any ordering from Table 1 may be used. In addition, while Algorithm 6 is presented using vertex k-core numbers for pruning (line 5), one may also leverage stronger bounds such as the triangle-core numbers (See Rossi 2014). We used k-core numbers for ordering and pruning since these are relatively tight bounds while also efficient to compute for large networks. Later in Sect. 6, we demonstrate the tightness of these bounds on large sparse real-world networks (See Tables 3, 4).

figure f

Complexity The runtime of the heuristic is \(O(|E| \cdot {K}(G))\) since it takes \(\sum _{v \in V} (v) = 2|E| = O(|E|)\) to form the initial set of neighbors for each vertex. The HeuristicClique is essentially a greedy depth-first search where the depth is at most \({K}(G)\). As an aside, if \({T}(G)\) is used instead, then the heuristic is computed in \(O(|E| \cdot {T}(G))\). Observe that at each step, the greedy selection criterion \(u \leftarrow \max _{v \in P} f(v)\) is evaluated in \(O(1)\) time by pre-ordering the vertices prior to searching. The runtime of the ordering is \(O(|P|)\) using bucket sort. A global bound on the depth of the search tree for any vertex neighborhood is clearly \({K}(G)\) and for a specific vertex \(v\) is no larger than \({K}(v)\). In practice, the heuristic is fast and usually terminates after only a few iterations due to the removal of vertices from \(P\) via the strong upper bounds.

Parallel algorithm The vertex neighborhoods are searched in \(\mathsf {parallel}\) for a large clique. Each worker (i.e., processing unit, core) is assigned dynamically a block \(\beta \) of vertices to search. The workers maintain a vertex neighborhood subgraph for the vertex currently being searched. In addition, the workers share a vertex-indexed array \(X\) of pruned vertices and the largest clique \(C_{\max }\) found among all the workers. If a worker finds a clique \(C_v\) larger than \(C_{\max }\), i.e., \(|C_v| > |C_{\max }|\) (max so far among all workers), then a lock is obtained, and \(C_{\max } \leftarrow C_v\) and the updated \(C_{\max }\) is immediately sent to all workers. As an aside, this immediate sharing of \(C_{\max }\) typically leads to a significant speedup, since the updated \(C_{\max }\) allows for the workers to further prune their search space including entire vertices.

5.2 Upper bounds

A simple, but not very useful upper bound on the Chromatic number \(\chi (G)\) is given by the maximum degree: \(\chi (G) \le \Delta (G)+1\). A stronger upper bound is given by the maximum k-core number of \(G\) denoted by \({K}(G)\). This gives the following relationship:

$$\begin{aligned} \chi (G) \le {K}(G)+1 \le \Delta (G)+1 \end{aligned}$$

In this work, we observe that this upper bound is significantly stronger than the maximum degree on nearly all large sparse networks.

Since \({\upchi }(G,\pi )\) depends on an ordering \(\pi \) then no relationship exists between \({\upchi }(G,\) slo) from slo and \({\upchi }(G,\pi )\) where \(\pi \) gave rise to \({{\theta _{\pi }}}(G)\). Nevertheless, suppose the vertices are colored using slo resulting in \({\upchi }(G,\pi )\) colors, then using \({K}(G)\) gives the following relationship:

$$\begin{aligned} \tilde{\omega }(G) \le \omega (G) \le \chi (G) \le {\upchi }(G,\pi ) \le {K}(G)+1 \le \Delta (G)+1 \end{aligned}$$

where \(\omega (G)\) is the maximum clique in \(G\) and \(\tilde{\omega }(G)\) is a large clique in \(G\) from the fast heuristic clique finder in Sect. 5.1. In other words, if a greedy coloring method uses \(\pi \) from slo then the resulting coloring of \(G\) must use at most \({K}(G)+1\) colors. Furthermore, \({K}(G)+1\) is also known as the coloring number denoted \({\mathsf {col}}(G)\) (Erdős and Hajnal 1966)Footnote 3.

The above relationship can be further strengthened using the notion of the maximum triangle core number of G denoted \({T}(G)\). This gives rise to the following relationship:

$$\begin{aligned} \omega (G) \le \chi (G) \le {T}(G) \le {K}(G)+1 \le \Delta (G)+1 \end{aligned}$$
Table 3 Network statistics and coloring bounds (color table online)
Table 4 Network statistics and coloring bounds (continued from Table 3) (color table online)

6 Results and analysis

This section evaluates the proposed methods using a large collection of graphs. In particular, we designed experiments to answer the following questions:

  • Section 6.1 Accuracy. Are the proposed greedy coloring methods effective and accurate for social and information networks?

  • Section 6.2 Scalability. Do the methods scale for coloring large graphs?

  • Section 6.3 Impact of recoloring. Is the recolor method effective in reducing the number of colors used?

  • Section 6.4 Utility of bounds. Are the lower and upper bounds useful and informative?

For these experiments we used over 100+ networks of different types (i.e., social vs. biological), sizes, structural properties, and sparsity. Our main focus was on a variety of large sparse networks including social, biological, information, and technological networksFootnote 4. Self-loops and any weights were discarded. For comparison, we also used a variety of dense graphs including the DIMACsFootnote 5 graph collection and the BHOSLIBFootnote 6 graph collection (benchmarks with hidden optimum solutions) which were generated from joining cliques together.

In this work, ties are broken as follows: Given two vertices \(v_i\) and \(v_j\) where \(f(v_i) = f(v_j)\), then \(v_i\) is ordered before \(v_j\) if \(i > j\). While the importance of tie-breaking was discussed in Sect. 3, many results in the literature are difficult to reproduce as key details such as the tie-breaking strategy are left undefined.

Table 5 Accuracy of coloring methods (color table online)

6.1 Accuracy

As an error measure, we compute the frequency (i.e., number of graphs) for which each coloring method performed best overall, i.e., used the minimum number of colors. If two methods used the minimum colors relative to the other methods, then the score of both are increased by one. The graphs for which all methods achieved the best are ignored. The proposed methods are evaluated below for use on (1) sparse/dense graphs and also (2) for each type of large sparse network (i.e., social or information networks).

Best methods for sparse and dense graphs The methods are compared in Table 5 (columns \(2\) and \(3\)) independently on the basis of sparsity. Notice the methods in the first column of Table 5 are ranked and shaded according to their accuracy on sparse graphs (following an ascending order). A few of our general findings from Table 5 are discussed below.

  • Selecting the nodes uniformly at random (RAND) generally performs the worst for both sparse and dense graphs. This highlights the importance of selecting vertices that are more constrained in the number of possible colors first, which can’t be achieved by random selection.

  • Nearly all the proposed methods (with the exception of deg-vol) gave fewer colors and found to be significantly better than the traditional degree-based methods.

  • As expected, the traditional degree-based methods are more suitable for dense graphs than sparse graphs. Nevertheless, the triangle and triangle-core methods performed the best on the majority of dense graphs.

  • In both sparse and dense graphs, we find that tcore-max/vol, and tri-vol gave the fewest colors overall.

  • Interestingly, the natural order performed best on 31 of the dense graphs. Further examination revealed that the majority of these cases are the BHOSLIB graphs. These graphs are synthetically generated by forming \(n\) distinct cliques and randomly connect pairs of cliques together. We found that the vertices in these cliques are ordered consecutively and thus give rise to this unexpected behavior found when using the natural order.

For additional insights, we provide the coloring bounds and various statistics for the DIMACs and BHOSLIB graph collections are provided in Tables 13 and 15. The coloring numbers from the various algorithms for the DIMACs and BHOSLIB graph collections are also shown in Tables 16 and 17, respectively. We find that in all cases, the proposed methods improve over the previous methods. In some graphs, the proposed methods offer drastically better solutions with much fewer number of colors, for instance, see MANN-a81 which is currently an unsolved instance.

Table 6 Colors used by the proposed methods (color table online)

Best methods: from social to information networks The sparse graphs are examined further by their respective types (i.e., social networks). For each network of a specific type, we apply the coloring methods in Table 1 and measure their accuracy just as before. This allows us to determine the coloring methods that are most accurate for each type of network. The results are shown in Table 5 (columns 4–10). The greedy coloring methods are ranked and colored according to their overall rank shown previously in the first two columns of Table 5.

  • In nearly all types of networks, the proposed methods are more accurate than the traditional degree-based methods (i.e., use fewer colors).

  • For social and Facebook networks, the triangle and triangle-core methods performed the best (i.e., accuracy), using fewer number of colors.

In the majority of cases, we found that the proposed methods are significantly better than the traditional degree-based methods (i.e., ido, deg) at \(p < 0.01\) level. More specifically, greedy coloring methods that use triangle properties or triangle-core based methods significantly improve over the other methods, resulting in a better coloring with fewer number of colors. In addition, the colors used by the proposed methods for each network are compared in Table 6.

Table 7 Comparing the coloring algorithms by runtime

6.2 Scalability

Now, we evaluate the scalability of the proposed methods. In particular, do the methods scale as the size of the graph increases (i.e., number of vertices and edges)? To answer this question, we use the proposed greedy coloring methods to color a variety of networks including both large sparse social and information networks as well as a variety of dense graphs. Figure 5 plots the size of the graph versus the runtime in seconds (both are logged). Overall, we find the proposed greedy coloring methods scale linearly with the size of the graph. Moreover this holds for both large sparse and dense networks. Nevertheless, coloring dense graphs is found to be slightly faster with less variance in the runtime, as compared to social networks which exhibit slightly more variance in the runtime of graphs that are approximately equal size.

We also compare the wall clock time (i.e., runtime in seconds) between a representative set of methods on a variety of networks. Results are provided in Table 7. For brevity, we removed the graphs for which all methods took less than 0.1 seconds to color. Not surprisingly, the simple degree-based methods (distance-1 and 2) are the fastest to compute.

Fig. 5
figure 5

Scalability of the proposed coloring methods. The x axis represents the log of the size of the graph whereas the y axis is the log runtime (in seconds). For both large sparse and dense networks, find that the proposed methods scale linearly as the size of the graph increases and thus practical for a variety of applications

These results indicate that in practice, the proposed methods are fast, scaling linearly as the size of the graph increases. Hence, these methods are well-suited for use in a variety of applications including network analysis, relational machine learning, sampling, among many others. See Sect. 7.3 for details on the scalability of the neighborhood coloring methods.

Table 8 Recolor statistics

6.3 Effectiveness of recolor

This section investigates the effectiveness of the recolor method. In particular, how often does it reduce the number of colors? For this, we investigate and compare greedy coloring variants that utilize recolor to the methods that do not. Given a graph \(G\) and a vertex ordering \(\pi \) from one of the proposed selection strategies in Sect. 3.2, we color the graph using the basic coloring framework (Algorithm 1) and then we color the graph again using the recolor method. From these two colorings, we measure the difference in the number of colors (after recoloring and before recoloring) and number of times the recolor method improved over the basic method. The results are shown in Table 8. Note that the statistics are computed over all graphs and greedy coloring methods, including the methods that do not perform well (i.e., degree-based methods). Note that the maximum improvement (i.e, max diff. in Table 8) and average improvement (i.e, mean diff. in Table 8) are measured as the maximum/average difference between the number of colors used before and after recoloring.

In sparse graphs, the recolor method results in fewer colors \(40.9\,\%\) of the time whereas the improvement for dense graphs is \(84.4\,\%\). We find that the improvement for dense graphs is much larger since the number of colors initially (before recoloring) used on average is usually far from the optimal number. Note that for sparse graphs, this includes the graphs where the greedy coloring methods was able to find the optimal number of colors (and thus, it is impossible for recolor to improve over the basic coloring). Additionally, the sparse graphs use fewer colors than the dense graphs and also the number of colors used from the greedy coloring methods tends to be closer to the optimal. These results indicate that recolor is both fast and effective for reducing the number of colors used by any of the proposed methods.

In addition, we also provide results for both recolor and basic variants on a variety of large sparse real-world networks, see Tables 9 and 10. These can be used to infer additional insights. In Tables 18 and 19, we also compare the recolor variant to the faster but less accurate basic coloring variant of each coloring method for the DIMACs and BHOSLIB graph collections.

Table 9 Recolor variant is compared to the faster but less accurate basic variant for each of the proposed methods. We also include four previous methods for comparison
Table 10 Recolor variant is compared to the faster but less accurate basic variant for each of the proposed methods (continued from Table 9)

6.4 Bounds and provably optimal coloring

This section describes two ways to leverage the bounds. Results are then provided in Tables 3 and 4 for a representative set of graphs from the collection.

First, the lower bound can be used to verify that the coloring from a greedy method is optimal. Let \(\tilde{\omega }(G)\) be a lower bound of \(\chi (G)\) (i.e., optimal number of colors), then we have the following simple relationship:

$$\begin{aligned} \tilde{\omega }(G) \le \omega (G) \le \chi (G) \le {\upchi }(G,\pi ) \le \Delta (G)+1 \end{aligned}$$

where \({\upchi }(G,\pi )\) is the number of colors from a greedy coloring that uses \(\pi \) and \(\Delta (G)+1\) is the maximum degree of \(G\). Consequently, if \(\tilde{\omega }(G) = {\upchi }(G,\pi )\), then as a result of the above, we know \({\upchi }(G,\pi )\) must be optimal.

Second, we may also use the bounds to characterize the accuracy of a greedy coloring method or prove that a solution is not optimal. For instance, suppose \({\tilde{\omega }(G)} \le K(G) \le {\upchi }(G,\pi )\), then we know \({\upchi }(G,\pi )\) is not optimal.

We find the optimal number of colors is directly obtained and verified via both lower and upper bounds for nearly all collaboration networks and web graphs as shown in Table 3. In 6 of the 13 collaboration networks, we found that \(\tilde{\omega }(G) = {\upchi }_{\min }(G,\pi ) = {\upchi }_{\max }(G,\pi ) = K(G)+1 = T(G)\) where \({\upchi }_{\min }(G,\pi )\) and \({\upchi }_{\max }(G,\pi )\) are the min and max number of colors used by any of the coloring methods. This implies that the ordering is insignificant for these networks as all the methods resulted in a coloring that is provably optimal. Notably, from the 7 other networks, 5 of them differ only in \({\upchi }_{\max }(G,\pi )\). In addition, many other interesting observations and insights may be drawn from Tables 4 and 3.

To summarize we find that:

  1. 1.

    For some types of information networks, the proposed greedy coloring methods produce an optimal coloring.

  2. 2.

    The upper and lower bounds are effective for proving the optimality or suboptimality of a solution from a greedy coloring heuristic.

  3. 3.

    For the majority of graphs that are significantly skewed and power-lawed, the optimal number of colors is directly obtained and verified via both lower and upper bounds.

7 Finding colorful neighborhoods

Given a large graph or a collection of neighborhood subgraphs, how can we define a domain-independent basis that succinctly characterizes the common structural properties of the neighborhood subgraphs? For this task, we define the problem of coloring local neighborhood subgraphs and propose a fast parallel flexible approach for solving it. Formally, a neighborhood subgraph can be defined as the induced subgraph centered around a vertex \(v\) and induced by all neighbors of \(v\). Our parallel neighborhood coloring framework makes heavy use of the proposed coloring methods from Table 1 as well as the basic coloring variant in Alg 1 and the more accurate recolor variant shown in Alg 5. In particular, we propose parallel methods for coloring neighborhoods that are (1) fast and scalable for large networks, (2) space-efficient, (3) flexible for a variety of applications, (4) and accurate, finding in many cases nearly optimal or provably optimal solutions.

One of the main observations we make is that neighborhoods that are colored using a relatively few number of colors are not well connected, with low clustering and a small number of triangles. To understand this fundamental finding and the key intuition, we provide a series of simple neighborhood colorings shown in Fig. 6. We also observe that neighborhoods that are colored using a relatively large number of colors have large clustering coefficients and usually contain large cliques relative in size to the other neighborhood colorings. Therefore, the set of neighborhood colorings is an important fundamental graph property, giving a number of key insights into the structural properties of the network at large and its local neighborhoods. In a similar manner as we have demonstrated above, one can also use neighborhood coloring to draw a number of other interesting insights and ultimately use it for characterizing the structure and behavior of many types of large networks. Besides these key benefits, we demonstrate that neighborhood coloring is fast and scalable to compute for large networks, and more specifically, it is linear in the number of edges. This is clearly much faster than computing the frequency of vertex/edge triangles (Rossi 2014) or counting the frequency of other subgraph patterns and motifs (Pržulj 2007; Shervashidze et al. 2009; Rahman et al. 2012). We also show that it is straightforward to parallelize for both shared-memory (CPU and GPU) and distributed architectures.

Fig. 6
figure 6

Neighborhood coloring extremes: from stars to cliques. For ac, the vertex \(v\) in the center is the vertex in which the neighborhood was induced, thus the other vertices are in the set \(N(v)\). In a the vertex neighborhood is a simple star–no connections between the neighbors of \(v\), and thus can be colored using only 2 colors. The neighborhood subgraph in b is essentially a star with a few neighbors of \(v\) with edges among each other, thus, forming triangles. Similarly, in c we find more neighbors forming connections among each other giving rise numerous triangles and two cliques of size 4. Finally, the neighborhood subgraph in d represents a single large clique. Node \(v\) was removed for clarity. These neighborhood subgraphs go from the least constrained neighborhood representing a star a to the most constrained neighborhood representing a clique d. The neighborhood subgraphs shown in b, c are better representatives of neighborhood subgraphs found in large real-world networks (e.g., Facebook or other social networks). Note that in reality, the vertex \(v\) in which the neighborhood subgraph corresponds may be removed from the coloring, since \(v\) must be connected to every other vertex. Thus, the neighborhood subgraphs above are (\(k-1\))-colorable when \(v\) is removed

Local neighborhood coloring consists of assigning a color to every vertex in a vertex neighborhood such that no two vertices linked by an edge share the same color while minimizing the number of colors used. The most colorful neighborhood is the one that requires the maximum number of colors. In this section, we propose parallel methods for coloring neighborhoods that are (1) fast and scalable for large networks, (2) space-efficient, (3) flexible for a variety of applications, (4) and accurate, finding in many cases nearly optimal or provably optimal solutions.

The neighborhood colorings may be useful for finding better communities, especially in local community detection methods (Malliaros et al. 2012). Besides community methods, neighborhood coloring may also be used in prediction tasks such as detecting anomalous patterns in graphs, see (Akoglu et al 2010) for one such egonet-based method. Other prediction tasks such as relational classification may also benefit from neighborhood coloring. For instance, one may construct a set of node features such as from these neighborhood colorings to improve the accuracy of classification (e.g., number of colors, largest independent set).

The results of our neighborhood coloring have direct and immediate implications on exact algorithms for the maximum clique problem (Bomze et al. 1999; Prosser 2012). In fact, the most successful approaches have used coloring as a bound, but vary in the ordering and method used (Konc and Janezic 2007; San Segundo et al. 2011; Tomita and Kameda 2007; Tomita et al. 2011, 2010; Rossi et al. 2012). For instance, suppose vertex neighborhoods are searched in parallel, similar to our heuristic in Alg 6, then the neighborhood coloring results may be used for bounding the search space in branch-n-bound algorithms, for pruning entire neighborhoods directly, and for ordering vertices via the number of colors from the vertex neighborhood coloring, among many other vertex level features that could be derived from such a set of neighborhood colorings. These may enhance recent parallel algorithms such as pmc (Rossi et al. 2014) that utilizes degeneracy ordering giving a worst-case runtime of \(O(2^{d/4})\) on sparse graphs with bounded degeneracy. In addition, super-linear speedups may become more frequent using the ordering from neighborhood coloring/pruning, i.e., these were observed using pmc (Rossi et al. 2014) and later confirmed again using a parallel version of mcs (Tomita et al. 2010; McCreesh and Prosser 2013). Nevertheless, the set of vertex neighborhood colorings may also be used for pruning in other ego-centric search methods. They also provide a basis for a variety of ordering methods which may have applications, e.g., graph compression (Boldi and Vigna 2004).

7.1 Problem formulation

Our focus is on coloring vertex neighborhoods. Let \(N(v) = \{v\} \cup \{u : (u,v) \in E\}\) be the closed neighborhood of a vertex \(v\) and we define \(H_v\) as the neighborhood subgraph induced from \(N(v)\), consisting of \(v\), the neighbors of \(v\), and any edges between them. Suppose \(H_v\) is a neighborhood subgraph of \(G\) and \(G\) is \(k\)-colorable, then \(H_v\) must also be \(k\)-colorable. Consequently, if \(H_v\) is a subgraph of \(G\), then \(\chi (H) \le \chi (G)\).

The local chromatic number of \(G\) is the maximum number of colors appearing in the closed neighborhood (subgraph) of a vertex minimized over all proper colorings. More formally,

$$\begin{aligned} {\upchi _{\ell }}(G) = \min _{c} \; \; \max _{v \in V} \; |\{ c(u) : u \in N(v) \}| \end{aligned}$$

where the minimum is taken over all proper colorings \(c\) and \({\upchi _{\ell }}(G)\) is the number of colors appearing in the most colorful closed neighborhood of a vertex. Clearly, \({\upchi _{\ell }}(G) \le \chi (G)\) and we find for large real-world graphs (i.e., social and information networks (Mislove et al. 2007; Ahmed et al. 2013)) these two numbers are usually close. Despite this result, we note that for general graphs \({\upchi _{\ell }}(G)\) may be small while \(\chi (G)\) can be arbitrarily large (Erdös et al. 1986; Godsil et al. 2001).

We relax the strict requirement above from considering all proper colorings to considering only a single proper coloring for each neighborhood. In particular, this article proposes a framework of local greedy coloring methods designed for dense and large sparse graphs found in real-world (e.g., social networks). Given a neighborhood subgraph of \(v\) denoted \(H_v\) and a graph property \(f(\cdot )\), let \(f(H_v) = \mathbf {x}\) where \(\mathbf {x}\in \mathbb {R}^{n}\) is a vector of vertex weights and \(x_i\) is the value of vertex \(u_i \in N(v)\). Using the weight vector \(\mathbf {x}\) as a basis for ordering the vertices in the closed neighborhood, we denote this ordering as \(\pi _v = \{u_1,u_2,\ldots \}\). Further, let \({\upchi }({H},\pi _v)\) be the number of colors used by a local greedy coloring algorithm that uses the ordering \(\pi _v\) to color \({H}\). Consequently, an approximation of the local chromatic number of \(G\) is defined as:

$$\begin{aligned} {\upchi _{\ell }}(G,\varPi ) = \max _{v \in V} \; {\upchi }(N[v],\pi _v) \end{aligned}$$

where \({\upchi _{\ell }}(G, \varPi )\) is the maximum number of colors used by a local greedy coloring method that uses the set of neighborhood vertex orderings \(\varPi = \{\pi _{v_1}, \pi _{v_2},\ldots,\pi _{v_n}\}\). Intuitively, the above gives rise to the following relationship:

$$\begin{aligned} \omega (G) \le \chi (G) \le {\upchi _{\ell }}(G,\varPi ) \end{aligned}$$

Also, if we consider a vertex neighborhood subgraph \({H}_v\), then:

$$\begin{aligned} \omega ({H}_v) \le \chi ({H}_v) \le {\upchi }({H}_v,\pi _v) \le \Delta ({H}_v)+1 \end{aligned}$$

where \(\omega ({H}_v)\) is the size of the maximum clique, \(\chi ({H}_v)\) is the optimal number of colors required to color \({H}_v\) (minimized over all proper colorings of \(N(v)\)), and \({\upchi }({H}_v,\pi _v)\) is the number of colors from a greedy coloring of \(N(v)\) using \(\pi _v \in \varPi \).

Fig. 7
figure 7

Scalability. The speedup of our methods on different types of graphs are shown in a, whereas the speedup of different coloring variants for soc-flickr are shown in b. It is clear that all proposed variants are scalable for large graphs, while vertex-centric coloring (vc) using recolor scales slightly better than the others for the large sparse flickr social network. Processing units are cores (one thread per core)

7.2 Neighborhood coloring

The parallel framework is shown in Alg 7. Here, \(B(\cdot )\) is assumed to be normalized with respect to cliques, hence, \(B(v) = K(v)+1\). This allows us to generalize the algorithm over any arbitrary upper bound.

figure g

Upper and lower bounds are computed in line 2 and 3, respectively, and used for pruning in line 4. The vertices remaining in \(G\) are ordered in line 5, and then each vertex neighborhood in that order are colored (line 6). For each vertex in order, we first try to avoid coloring \(v_i\) by checking if the local vertex upper bound \(B(v_i)\) is smaller than \({\texttt{max}}\). If not, then lines 8–10 form the “reduced” set \(P\) of neighboring vertices. In line 12, we obtain the local vertex ordering \(\pi _{v_i}\) by ordering the vertices in \(P\) using an arbitrary property \(f(P)\) computed in line 11. Next, the subgraph \(H_v\) induced by the ordered vertex set \(P\) are colored using a coloring variant and color assignment/search strategy (line 13). Finally, line 14 updates the maximum number of neighborhood colors required, if necessary.

Note that the three pruning steps are shown in lines 4, 7, and line 10, respectively. If the goal is to compute \({\upchi _{\ell }}(G,\pi )\), then the pruning steps can significantly reduce the search space leading to faster and more accurate colorings. For the problem of computing the complete set of neighborhood colorings, then we can simply avoid using the pruning steps. In other words, the pruning steps and their utility are application dependent, and thus may be turned on/off accordingly. We also note that these pruning steps are also useful for finding the max clique, computing a graph property for which the upper and lower bounds apply, and for finding dense subgraphs, among many other tasks.

In addition to the coloring variants from Sects. 3 and 4, we also investigate two types of search procedures for coloring (i.e., color-centric and vertex-centric) that differ in their implementation, but may result in significantly different runtimes depending on the structural properties of the input graph. In particular, the search procedure in the basic and recolor variants may be performed by searching color-classes (i.e., the independent sets) or by searching the vertex neighborhoods (i.e., adjacent vertices) and thus, we term these search procedures as color-centric and vertex-centric, respectively.

Fig. 8
figure 8

Properties of the neighborhood colorings. Using the parallel neighborhood coloring algorithm, we color each vertex-induced neighborhood and record the number of colors used for that neighborhood as well as the maximum independent set size (i.e., largest such coloring class given by the neighborhood coloring of that vertex). We use the complementary cumulative distribution function (CCDF) to study the coloring properties of a few large sparse real-world networks. The max independent set size is with respect to the coloring (largest such coloring class)

7.2.1 Parallelization

The neighborhood coloring problem is parallelized by considering each neighborhood subgraph as independent and coloring each of these subgraphs in parallel. We use dynamic scheduling and assign each processing unit a single neighborhood at a time. This helps ensure the vertex neighborhoods are colored in approximately the correct order. Our approach requires a single lock to ensure that the largest number of colors used thus far is consistent and avoid potential race conditions when updating it (see line 14). Importantly, as soon as a processing unit updates \({\texttt{max}}\), we immediately broadcast it to all other processing units. We observed that this can significantly improve performance as the tighter lower bound may be used for additional pruning or result in terminating a search early. As an aside, if the pruning rules are used, then two subsequent runs may result in slightly different \({\upchi _{\ell }}(G,\pi )\). This is due to possible variations in the global vertex ordering which determines the underlying order in which the neighborhoods are colored.

The parallel framework has many other advantages. For instance, each processing unit only requires a neighborhood subgraph and therefore the framework is space efficient for streaming or graphs too large to reside in memory and thus a good candidate for GPU parallelization as well.

Table 11 Upper and lower bounds of the chromatic number for the graphs

7.3 Experiments

We now analyze the effectiveness of our approach on a variety of real-world networks. The network statistics including lower and upper bounds are provided in Table 11.

A number of observations are made from the experiments. First and foremost, the scalability of our parallel framework is demonstrated in Fig. 7 where we observe that significant speedups are possible across a range of different types of graphs and coloring variants. Importantly, Fig. 7a demonstrates the scalability of our methods on a diverse set of graphs, from large sparse graphs (e.g., social and biological networks) to dense networks found in scientific computing. Besides density, these graphs are known to contain very different structural properties. We used the large Orkut social network that is sparse and power-lawed, the sparse Facebook Texas network, a slightly more dense biological network of a human gene, and a very dense unsolved instance from the clique/coloring DIMAC’s challenge. The last two graphs were found to be more difficult to obtain nearly optimal local colorings. Nevertheless, these two graphs, but especially the human gene, scale slightly stronger than the more sparse networks. These graphs were colored using the basic coloring method with color-centric search and no pruning. Further, vertices were ordered globally by kcore-vol (\(f(v) = \sum _{w \in N(v)} K(w)\)) and ordered locally using kcore-deg-vol and both orderings are from largest to smallest for simplicity.

Finally, we also investigated the scalability of a few different coloring variants using the large sparse flickr social network, see Fig. 7b for details. In particular, all the proposed variants are shown to scale well for large graphs, while vertex-centric coloring (vc) using recolor scales slightly better than the others. Similar results were also observed using other types of graphs and methods.

Table 12 Comparing the space of neighborhood coloring methods. We evaluate a representative set of methods from the framework. The local coloring number denoted \({\upchi _{\ell }}(G,\pi )\) is given for each of the variations. We present results for a representative sample of methods from the framework. In all methods, the local ordering is from largest to smallest, whereas the global ordering is from smallest to largest. For the global ordering we used kcore-vol for simplicity, while varying the local ordering method (color figure online)
Fig. 9
figure 9

Characterizing and comparing the various types of networks using statistics from neighborhood coloring. The number of colors used to color each of the neighborhoods are shown along with the size of the largest independent set in the coloring of the neighborhoods

Now, we investigate a representative sample of coloring methods from the large space defined by the framework. For this experiment, we use the three pruning steps and order the vertices globally using kcore-vol and are searched from smallest to largest. The vertices in each neighborhood are ordered from maximum to minimum and thus the vertices more constrained in their choice of color are assigned colors early allowing more flexibility in the color assignment whereas vertices that are not as constrained take lower precedence in their color assignment since these vertices are usually easily assigned to a color. In both global and local ordering, ties are broken using vertex ids such that if \(f(v) = f(u)\) and \(v > u\), then \(v\) is ordered before \(u\).

The results from a single graph (soc-flickr) are shown in Table 12, others were removed for brevity. The first row represents the family of methods that use the basic variant with pruning, whereas the second row uses no pruning. Likewise, the third and fourth rows use recolor with pruning and without it, respectively. There are several interesting observations. First, the coloring number from the recolor variant is at least as accurate and usually better than the basic variant. This result is independent of whether pruning is used or not and it shows how much improvement can be achieved by using the recolor variant. Second, pruning is generally effective in obtaining a better coloring number. Note that using both pruning and the recolor variant clearly improves on the basic coloring method (without pruning or recolor). For example, using the Tri-vol method, we get a \(\approx \! 9\,\%\) improvement in the number of colors, when we apply both pruning and recolor variant. Finally, we observe that the Tri-vol and Deg-Kcore-vol methods perform the best among all other methods (minimum number of colors). These results are consistent with the pervious discussion in Sect. 6.

Fig. 10
figure 10

Exploring the relationship between two statistics from the neighborhood coloring. The number of colors used in each local coloring is compared with the size of the largest independent set from that coloring

In this section, we use neighborhood coloring to characterize the various types of networks as well as gain insight into the structural properties of the networks. We view the neighborhood coloring as a process for discovering meaningful features that capture some underlying properties of the graph that arise from the notion of coloring. From this, we first derive two vertex features. Specifically, for each vertex \(v\), the first feature represents the number of colors used in the neighborhood coloring of the vertex \(v\), and the second feature represents the size of the maximum independent set resulting from the neighborhood coloring of the vertex \(v\). Figure 8 shows the complementary cumulative distribution (CCDF) of these features across all the nodes in the graph. We observe that those graphs that are denser and more clustered (such as soc-flickr) typically use many colors for neighborhood coloring of the vertices. For example, the soc-flickr dataset uses \(100\) colors to color the largest vertex neighborhood in the graph. On the other hand, graphs that are more sparse and less clustered (such as soc-youtube) typically use fewer colors for neighborhood coloring of the vertices. Further, we observe that the maximum independent set size is inversely proportional to the maximum number of neighborhood colors. Clearly, this observation is due to the rate of dependence among the graph vertices. For example, the soc-flickr dataset has a small independent set size \(\approx \! 400\) vertices. On the other hand, a graph that is as large and as sparse as soc-orkut typically has a large independent set size \(\approx \! 5{,}000\) vertices. These observations show how significant the two features (number of colors and maximum independent set size) for capturing the underlying structural properties of various types of graphs. Note that in Fig. 8, we show only some of the datasets as examples, and we omit the others for brevity.

As an aside, egonet-based clique methods were proposed for sparse graphs (Rossi et al. 2012) and sampling methods and estimators based on egonets were developed in the same spirit (Gjoka et al. 2013; Ahmed et al. 2013). One may also straightforwardly use egonets to obtain an accurate estimate of the distribution of local coloring numbers.

In Fig. 9, we focus the attention on the other denser graphs, bio-human-gene2 and keller6. Figure 9 shows the histograms of the number of colors, maximum independent set size, and the correlation between them, for both bio-human-gene2 and keller6 graphs. We observe that the histograms of the number of colors and maximum independent set size are highly skewed. For example, bio-human-gene2 graph shows that \(600\) vertices uses more than 1,400 colors for their neighborhood. Moreover, the size of the maximum independent set size has a small range (5–25). The keller6 graph, however, is one of the clique DIMAC’s challenge graphs. We observe that the histogram of the keller6 graph consists of two groups, one group with small number of colors (\(<\!200\)), and the other group with higher number (\( \approx \! 600\)) of colors. This observation is clearly shown in the histogram of the maximum independent set size. Similarly, we show the correlation plots between the number of colors and the maximum independent set size for several datasets in Fig. 10. The observations are similar to what we discussed before.

8 Conclusion

Despite the obvious practical importance of graph coloring, existing works have not systematically investigated or designed methods for large complex networks. In this work, we defined a unified framework that can serve as a fundamental basis for studying coloring on large networks. Using this framework, we proposed three classes of fast and accurate methods including social-based, multi-property based, and egonet-based methods. We demonstrated the effectiveness of the proposed methods on over 100+ networks and among 7 different types of networks (e.g., social, technological networks). In the majority of cases, we found these methods to be more accurate than other widely used heuristics that have been used for coloring in other domains. Importantly, we find that the solutions obtained from our methods are nearly optimal and sometimes provably optimal for certain types of networks. Furthermore, the coloring methods were shown to be effective for the task of finding graph outliers as well as predicting the type of graph (e.g., social vs. biological network). We also investigated the problem of coloring neighborhood subgraphs and proposed a parallel algorithm that leverages the proposed unified framework and methods. One key finding is that neighborhoods that are colored using a relatively few number of colors are not well connected, with low clustering and a small number of triangles. While neighborhood colorings that use a relatively large number of colors have large clustering coefficients and usually contain large cliques. In future work, we plan to explore the neighborhood coloring further as it has proven to provide a number of key insights into the structural properties of the network and neighborhoods at large, while also fast to compute for large networks. Overall, this work demonstrated the practical significance, accuracy, and scalability of our methods for coloring and analyzing large complex networks.