Keywords

1 Introduction

Our purpose is to investigate and analyze attributed graphs. In this article we discuss how recent extensions of Formal Concept Analysis apply to this problem. We consider undirected graphs G(O, E) where E is the edge set, and O the vertex set. The vertices are labeled by a description in an attribute pattern language L with a lattice structure, typically L = 2I where I is a set of binary attributes. Note that we may consider such an attributed graph both as a graph whose vertices are labeled with subsets of I and as set of objects each described by such a subset of I and that may be related together by edges.

The former view leads to consider the methodology used to investigate graphs, in particular social and complex networks. Most of the work in this area consider unlabeled networks and is concerned by what may be said about the topological structure of the network. A large set of measures have been proposed to analyze these networks, and two main ways have been proposed to extract interesting subgraphs. The first way consider the network as made of a core, i.e., a dense subgraph whose vertices are highly connected, together with its periphery, made of vertices highly connected to the core, but poorly connected between them [6]. The second way considers the network as made of a number of dense subnetworks, called communities whose vertices are highly connected within the community and poorly connected to vertices of other communities [8]. Finally, the two views may be combined, for instance, by considering the network as made of communities each having some core/periphery structure [16].

Regarding the notion of core, there have been various ways to define it, starting from the k-core of a network which is the greatest subnetwork whose vertices all have degree at least k in the subnetwork [17]. By changing the topological property, but keeping this idea of a greatest subnetwork whose vertices share the property within the network, we obtain various core definitions [3]. A core may also be defined as the greatest subnetwork made of a subset of a family of small, connected subnetworks. The simplest example is the k-clique core that is only made of k-cliques. When k = 3, the core is made of triangles which are known to be an important substructure in social networks analysis [29].

Concerning the idea of communities, it has been extensively investigated mostly as an optimization problem: how to optimally partition the network in subnetworks maximizing some measure. A second view of communities derives from some strong structural property that has to be satisfied within a community. We will further call them structural communities, or simply communities, as we only consider these kind of communities in the remaining part of this article. The main example is the k-community approach that divides a k-clique core (as defined above) in connected subnetworks each satisfying a stronger property (see below) [13].

From the first point of view, adding attributes to the vertices means that each attribute pattern induces a subgraph whose vertices satisfy the pattern. Each such subgraph could then be investigated, extracting its core and communities. The question is then how to summarize and select relevant information from such a set of results.

The second view considers the attributed graph first as a table representing a set of objects described by attributes, and then considers that edges may relate objects. This leads to the use of standard methodology of data analysis by adapting it to dealing with topological information. The whole purpose of this article is to discuss how Formal Concept Analysis, which was originally concerned with data tables, may be extended in order to take into account the topological information. The main idea we propose here is to consider as parameters the notion of cores and structural communities relevant to the data to analyze and adapt accordingly the FCA methodology.

Regarding the reduction of a graph to its core, this may be obtained by defining an interior operator p on the vertex powerset 2O. This approach, based on a previous work on abstraction in Formal Concept Analysis [24], produces abstract closed patterns structured in an abstract concept lattice together with a basis of abstract implications written □ q → □ w. All concept extents are then images of the interior operator. When considering attributed networks, and given some core definition, such an interior operator reduces a vertex subset e to the vertices forming the core of the subgraph G(e) induced by e [23]. It is then called a graph abstraction operator. Fig. 1 represents an attributed graph G, the subgraph induced by pattern a (in plain lines), and the 3-clique core of this subgraph, i.e., its abstract subgraph (in blue lines). Note that all vertices of the core share attribute b. This means that ab is an abstract closed pattern and that the abstract implication □ a → □ ab holds, i.e., any triangle in G whose vertices share a also share b.

Fig. 1
figure 1

The pattern a subgraph is displayed with plain lines, the corresponding 3-clique abstract subgraph is displayed in blue lines. The associated abstract closed pattern is ab

Recent works in attributed graph mining are interested in searching for local patterns made of a constraint on a subset of attributes together with a density constraint on a vertex subset, and this using various notions of maximality [11, 18]. In a companion article [21], we have defined local closed patterns corresponding to maximal attribute patterns each associated with one dense subgraph, allowing to extract local implications, particular to specific dense groups of objects. For that purpose Formal Concept Analysis (FCA) had to be extended in order to take into account this notion of locality.

The simplest example is obtained by considering a subgraph made of various connected components, and associating to each connected component a local closed pattern, i.e., the most specific pattern shared by the vertices of this connected component. More generally, local closed patterns may be associated with the connected components of abstract subgraphs. The family of such connected components forms a partial order called a cc-confluence while the corresponding local concepts have a weaker structure called a pre-confluence. As an example, in Fig. 2 we display a pattern subgraph extracted from a DBLP co-authoring network labeled by journal and conference names, together with its abstract subgraph (in bold and red vertices and lines) when considering a 4-core abstraction. The abstract subgraph has two connected components, i.e., two structural communities of scientists. Again we may associate to this structure a set of implications, called local implications. In the previous example we found a local implication □i q → □i w stating that in the connected component containing vertex i of the degree ≥ 4 abstract subgraph of pattern q, all vertices also share pattern w.

Fig. 2
figure 2

The DMKD,IDArev pattern subgraph in the DBLP co-authoring experiment. The red vertices and edges represent the subgraph induced by the degree ≥ 4 abstract extension

Connected components of abstract subgraphs as represented in cc-confluences do not always completely capture the idea of communities as considered in social network analysis. As discussed in [21], we may, however, enlarge the local closed patterns approach by deriving a new graph G T from G whose vertex set T is a set of vertex subsets of G. Figure 3 displays a graph whose vertices represent pupils of a school in the West of Scotland, whose edges represent friendship relations, and whose vertex attributes concern substance use and sporting activity.Footnote 1 As a running example we consider the subgraph induced by the empty pattern, i.e., the whole graph. By applying a 3-clique graph abstraction restricted to connected components containing at least 4 vertices, we obtain a subgraph made of the bold and colored edges and vertices. This abstract subgraph is made of two connected components, therefore leading to two local concepts. Footnote 2 However, the largest connected component is clearly made of distinct dense parts, i.e., communities, we would like to consider when defining local closed patterns. Fortunately, when considering k-communities [13] we can solve this problem by applying the cc-confluence approach to a new graph derived from the original graph. More precisely, a k-community is a vertex subset in a graph G that corresponds to a connected component in a derived graph G T. The vertices of G T are k-cliques in G and an edge relates two vertices whenever the corresponding k-cliques share k − 1 vertices in G. Each colored subgraph in Fig. 3 defines such a 3-community.

Fig. 3
figure 3

The original Friendship network of a group of West Scotland pupils. The pupils and edges forming 3-communities of size at least 4 are displayed in various colors

The last task is to define interestingness measures to rank abstract or local patterns and implications. Regarding patterns, we will search for patterns whose abstract (resp. local) subgraph is a large part of the whole pattern subgraph, i.e., which preserves the core (resp. communities) definitions. The corresponding measure is called specificity. Regarding the abstract and local implications, we search for implications which are informative, i.e., which did not hold as standard implications and therefore bring some new information about our data. The corresponding measure is called Informativity. A preliminary discussion of both measures was presented in [25]. We propose here definitions of specificity and informativity both at the abstract and local level and experiment them on two real attributed networks.

Section 2 describes the attributed graphs used in our experiments. Section 3 presents abstract concept lattices, abstract implications, and graph abstractions together with associated interestingness measures. Section 4 defines local concept pre-confluences, related local implications, cc-confluences, and interestingness measures. In Sect. 5 we show how we extract the set of 3-communities associated with pattern subgraphs by using derived cc-confluences, and we display the local concept ordering of the attributed network of teenage friendship displayed in Fig. 3. In Sect. 6 we briefly discuss the implementation used in our experiments.

2 Datasets

In this section we will consider experiments with two datasets. In both cases the data are described as a graph G = (O, E). Vertices of this graph are have labels from 2I, where I is a set of items, i.e., binary attributes. Since objects are not always described by binary attributes, the binarization preprocessing is described when necessary.

2.1 Teenage Friends and Lifestyle Study

The dataset is denoted by s50-1 and is a standard attributed graph dataset.Footnote 3 It represents 148 friendship relations between 50 pupils of a school in the West of Scotland, and labels concern the substances used (tobacco, cannabis, and alcohol) and sporting activity. The values of the corresponding variables are ordered. The binarization process consists in defining variables representing the value intervals. T stands for Tobacco consumption and has values 1 (no smoking), 2 (occasional), and 3 (regular). C stands for cannabis consumption and has values 1 (never tries) to 4, D stands for alcohol consumption and has values 1 (does not drink) to 5, and S stands for sporting activity and has two values 1 (occasional) and (2) regular. Binary attributes represent intervals, for instance, C234 means that the value of C is at least 2 and therefore represents the interval [2, 4]. In Table 1 we present the binary attributes we have defined. Attribute subsets represent intersection of intervals. For example pattern {D123, D2345} requires that the value of D lies within the interval [1, 3] ∩ [2, 5] = [2, 3]. Note that, for the sake of simplicity we do not distinguish the two highest values of attributes T, C, and D.

Table 1 The binary attributes used to label the vertices in the Teenage Friendship network

2.2 A DBLP Dataset

This is the DBLP dataset as described in [4]. There are 45,131 vertices, 228,188 edges, and 555 connected components. Vertices are authors that have published at least one paper in one among 29 journals or conferences of the Database and Datamining communities during the 1/1990–2/2011 period. Footnote 4 An edge links two authors whenever they are coauthors of at least one article. The conferences are clustered in three clusters: DB (databases), DM (data mining), and AI (artificial intelligence) according to a conference ranking site categorization.Footnote 5 The binary attributes are the journal and conference names together with the three clusters. An attribute has value 1 if the author has published in the corresponding journal or conference or cluster.

3 Abstract Closed Patterns in Attributed Networks

3.1 Closed Patterns

In this section we introduce the necessary definitions and terminology we use in the article. Note that the terminology is somewhat non-standard in FCA. Indeed, as we need to interleave interior operators with extensional and intensional operators, the standard X″ notation to represent closed elements is not so convenient, so we rather denote, respectively, by ext and int the extensional and intensional operators. Furthermore, in this introductory paragraph we relate FCA to closed pattern mining.

A standard pattern mining procedure consists in considering the set of occurrences of patterns, belonging to some pattern language L with a lattice structure, within an object set O (see, for instance, [5]). Footnote 6 This language is partially ordered following a general-to-specific ordering, and each object o is described as a particular pattern d(o). A pattern q occurs in object o whenever d(o) is more specific (i.e., larger) than q. The set of occurrences ext(q) of a pattern q is called its extension. An intension function int(e) returns the most specific pattern associated with the extension e. This means that we relate a pattern q to the most specific pattern with same extension by applying the closure operator int ∘ ext to q. int ∘ ext(q) is then called a closed pattern. The pattern language L typically is 2I where I is a set of binary attributes (aka items). With no loss of generality we will further use the powerset 2I as a pattern language while what follows also applies to wider languages as pattern structures[10]. When L = 2I the closure operator on patterns then simply intersects the object descriptions of the extension of the entry pattern. This means that when considering patterns with same extension as equivalent, closed patterns are the representatives of the equivalences classes. Such a class has therefore a maximum but also minimal elements, called minimal generators. When the patterns belong to 2I, the min–max basis of implications[14] is defined as follows:

m = {gfgf is a closed pattern, g is a generator, fg, ext(g) = ext(f)}

This basis represents all the implications tt′ that hold on O, i.e., such that ext(t) ⊆ ext(t′). This precisely means that all these implications may be derived from the min–max basis. Obviously all non-trivial implications, i.e., implications such that t′ ⊈ t, may be inferred from an implication lr of m where lt and rlt′.

Finally, note that the enumeration of closed patterns is in general restricted to frequent patterns, i.e., patterns whose extension is larger than some threshold. In FCA, such a constraint leads to iceberg lattices [27].

3.2 Abstract Closed Patterns

We summarize here how abstraction is applied in FCA by constraining the extensional space. We first recall the definitions of closure operators and interior operators, the latter being further used to restrict the pattern extensions to be abstract extensions. In what follows all ordered sets are finite, and in particular any topped meet-semilattice (resp. pointed join-semilattice) is a lattice.

Definition 1

Let U be an ordered set and f: UU a self map such that for any x, yU, f is monotone, i.e., xy implies f(x) ≤ f(y) and idempotent, i.e., f(f(x)) = f(x), then:

  • If f is extensive, i.e., f(x) ≥ x, f is called a closure operator

  • If f is intensive, i.e., f(x) ≤ x, f is called a dual closure operator, an interior operator, or also a projection.

In the first case, an element such that x = f(x) is called a closed element.

Ranges of interior operators on lattices are called abstractions and are characterized by the following Proposition:

Proposition 1 (see [24])

A subset A of X = 2O is the range p[X] of some interior operator p on X, if and only if for any elements x, y in A, their join xy also belongs to A and A contains the empty set. The interior operator is related to its range as follows:

p(x) = sup{aA∣ax} a.

Let then p be the interior operator associated with some abstraction A, p(x) be the greatest element of A contained in x. Closed pattern analysis has been recently extended to abstract closed pattern analysis by noticing that applying an interior operator on the extensional space 2O we obtain again a closure operator on the pattern language 2I [15, 24]:

Proposition 2

Let X = 2O and L = 2I, p be an interior operator on 2O, and A = p[X] be the associated abstraction, we have that (int, pext) is a Galois connection on (A, L), i.e.,:

f = int ∘ p ∘ ext is a closure operator on L,

The abstract extension of pattern q is defined as p ∘ ext(q). A new equivalence relation is then defined such that qA w whenever p ∘ ext(q) = p ∘ ext(w), each equivalence class of which corresponds to some abstract extension in A. There is then a unique abstract support closed pattern, i.e., a most specific pattern among all patterns sharing the same abstract extension, which is obtained as f(q) = int ∘ p ∘ ext(q). f(q) is then called an abstract closed pattern. This leads to the definition of abstract concepts organized in a concept lattice:

Corollary 1 ([24])

The set of (abstract extension, abstract closed pattern) pairs (e = ext(c), c = int(e)), ordered following A, is a lattice called an abstract concept lattice.

Note that, as p is monotone, whenever ext(q) ⊆ ext(w), i.e., qw is valid, we also have extA(p) = p ∘ ext(q) ⊆ extA(w) = p ∘ ext(w). The latter inclusion states the validity of an abstract implication we will rewrite as □A q → □A w.

This way we obtain abstract min–max basis with the same definition as earlier in this section except that extA replaces ext and therefore abstract implications relate minimal elements (i.e., A-generators) to maximal element (the abstract closed pattern, or A-closed pattern) of the same abstract equivalence class. We have then the following definition:

Definition 2

The abstract min–max basis m A of valid abstract implications is defined as

m A = { □A g → □A fgf is an A-closed pattern, g is a A-generator, fg, extA(g) = extA(f)}. 

In the same way as in the standard min–max basis case, all implications □A t → □A t′ that hold on A, i.e., such that extA(t) ⊆ extA(t′), may be inferred from the abstract min–max basis.

3.3 Graph Abstractions

These ideas have been applied to attributed graphs by defining graph abstractions [23]. The set of objects O is then the set of vertices of a graph G = (O, E), and each vertex o is labeled by an attribute pattern d(o) ∈ 2I.

A graph abstraction is an abstraction of 2Odefined through a characteristic property P such that P(x, e) expresses some minimal connectivity requirement of the vertex x within the subgraph G e induced by some vertex subset e.

Proposition 3

Let P be such that

  • P(x, e) implies xe and

  • ee′ and P(x, e) implies P(x, e′),

and let q be a mapping defined by q(e) = {xe | P(x, e)}, then the mapping p defined by p(e) = fixedpoint(q, e) is an interior operator on 2O .

Consider a subgraph G e, where p(e) represents the greatest vertex subset of e inducing a subgraph whose vertices all satisfy the associated characteristic property. This subgraph G p(e) will be further called the abstract subgraph of G e. We give hereunder examples of graph abstractions, defined through their characteristic property and exemplified in Fig. 4.

  1. 1.

    degree ≥ k. The degree ≥ k-abstract subgraph of a graph is its k-core [17].

  2. 2.

    k-club ≥ s: x has to belong to at least one k-club of size at least s in G e. This is a relaxation of the notion of clique[1]: a k-club is a subset c of vertices such that there is a path of length ≤ k between any pair of vertices in G c. A triangle, a 3-clique, is a 1-club of size 3 (Fig. 4a). Figure 4b represents a 2-club of size 6 and therefore a 2-club ≥ 6 abstract group.

  3. 3.

    nearStar(k, d): x has to have degree at least k or there must be a path of length at most d between x and some y with degree at least k. For instance, the simplest nearStar(8, 1) abstract group is a central node connected with eight nodes. Such an abstraction is useful when we want the abstraction to preserve hubs [2] (i.e., high degree vertices) together with their (low degree) neighbors (see Fig. 4c).

  4. 4.

    ccs: x has to belong to a connected component of size at least s in G e (see Fig. 4d).

Fig. 4
figure 4

Graph abstractions corresponding to various vertex characteristic properties. In each graph plain circles and plain lines form the abstract subgraph, crosses and dotted lines represent the vertices and edges out of the abstract subgraph. (a) x has to belong to a 3-clique, (b) x has to belong to a 2-club of size at least 6, (c) x has to be connected to a vertex y such that the degree of y is at least 6, i.e., to a nearstar(6,1), (d) x has to belong to a connected component whose size is at least 3

Finally, it is interesting to note that we can combine two (or more) abstractions A 1 and A 2 in two ways, defining a new composite abstraction either stronger or weaker than both A 1 and A 2. For instance, we may want to consider an abstract subgraph where vertices both have a degree larger than some k and belong to a connected component exceeding a minimal size s. On the contrary, we may want an abstract subgraph such that at least one of the two characteristic properties is satisfied by all the vertices. This would be the case, for instance, if we want to keep both vertices that have a degree larger than, say 10, and vertices in a star, i.e., connected to a hub which degree is at least 50. The following proposition states that we can freely combine abstractions in both directions.

Proposition 4

Let P 1 and P 2 two characteristic properties of abstractions defined on the same object set O, and let P 1P 2 and P 1P 2 be defined as follows:

  • P 1P 2(x, e) = P 1(x, e) ∧ P 2(x, e)

  • P 1P 2(x, e) = P 1(x, e) ∨ P 2(x, e)

Both P 1P 2 and P 1P 2 are characteristic properties of abstractions.

3.4 Interestingness Measures on Abstract Patterns and Implications

3.4.1 Specificity of Abstract Patterns

We are now interested in measuring knowledge brought by abstract closed patterns and abstract implications [25]. For that purpose we first generalize hereunder the structural correlation measure introduced by A. Silva and co-authors [19], originally introduced to relate a subgraph to its content in terms of quasi-cliques and rename it as specificity.

Definition 3

Let q be a pattern, A an abstraction of some powerset of objects O, the specificity of q with respect to A is defined as:

$$\displaystyle{ S_{A}(q) = \frac{\mid \mathrm{ext}_{A}(q)\mid } {\mid \mathrm{ext}(q)\mid } }$$

Consider, for instance, a 3-clique abstraction. Whenever S A(q) is close to 1, the pattern q subgraph is mainly made of triangles. To the contrary, whenever S A(q) is close to 0, the pattern q subgraph almost displays no triangles, which means quite isolated vertices. We relate this way a pattern q to the measure of how selecting vertices satisfying this pattern preserves the topological property associated with the abstraction.

Example 1

Figure 1 displays a graph each vertex of which is described by an itemset. We observe then that:

  • ext(a) = e = {1, 2, 3, 4, 5, 7} induces the subgraph G(e) (blue+black).

  • extA(a) = {1, 2, 3} as 4, 5, 7 do not belong to any 3-clique in G(e).

  • int ∘ extA(a) = ababab = ab is an abstract closed pattern.

  • S A(a) = 1∕2, S A(ab) = 3∕4.

Note that among the patterns of some equivalence class of ≡A the abstract closed pattern c has maximal specificity:

For any t, if ext A(t) = ext A(c) then S A(c) ≥ S A(t)

3.4.2 Informativity of Abstract Implications

Apart from measuring through specificity what is specific to the pattern in its abstract view, we are also interested when considering abstract implications in how informative they are. For that purpose we consider abstract implications whose left and right patterns are equivalent in the abstract space A, i.e., have same abstract extension, as in the min–max abstract implication basis defined above. Whenever these patterns are also equivalent in the original space 2O intuitively the implication is uninformative. Assume, for instance, that aabc is valid, then validity of the abstract implication with same left and right members does not bring any new information. On the contrary, assume that □A a → □A abc is valid while aabc has only confidence 0. 5, i.e., ext(abc) = 0. 5 ∗ ext(a), then clearly the abstract implication brings some information.

Definition 4

Let q be a pattern, A an abstraction of 2O, the informativity of the valid implication r: □A q → □A w is defined as:

$$\displaystyle{ I_{A}(r) = 1 -\frac{\mid \mathrm{ext}(qw)\mid } {\mid \mathrm{ext}(q)\mid } }$$

Informativity has a range between 0 and 1 and estimates the probability of not having w whenever we have q in graph G. This quantity has value 0 whenever qw holds and has limit 1 whenever ∣ext(qw)∣ approaches 0, i.e., restricting the extension of patterns to elements of A concentrates the extension of q to the very few sharing also w.

Intuitively, the informativity of an abstract implication measures what we discovered when we observed that q and qw share the same abstract support.

Example 2

Following Example 1 illustrated in Fig. 1 consider the abstract implication r: □ a → □ ab. This abstract implication has the following semantics: “a 3-clique of G whose vertices share pattern a also share pattern b,” and its informativity is therefore I A(r) = 1 − 1∕2 = 0. 5. 

Note that implications of the abstract min–max basis which relate minimal elements g of an abstract equivalence class to the corresponding abstract closed pattern c have maximal informativity:

Let gA tA t′ ≡A c then I A( □A g → □A c) ≥ I A( □A t → □A t′).

3.5 Experiments

Some experiments on the two datasets described in Sect. 2 have been performed and discussed in [23]. We discuss hereunder new experiments in particular regarding the interestingness measures.

We firs consider the Teenage Friendship network s50. Among the 50 pupils 38 belong to triangles (i.e., 3-cliques). As there are no isolated triangles, the abstract subgraph when considering only connected components with size at least 4 is reduced to 32 pupils. The corresponding abstraction is therefore 3-clique and cc ≥ 4.

Table 2 displays the top 15 patterns according to the corresponding specificity. We observe that specificity is clearly non-monotonic with respect to abstract extension size and that among top patterns we find both small (and therefore general) and large patterns. We further discuss the pattern with highest specificity D45-C34, which corresponds to pupils with high alcohol and cannabis consumption, in Sect. 5 where we search for communities.

Table 2 Top-15 abstract closed patterns in the Teenage Friendship network ranked according to their 3-clique and cc ≥ 4 specificity

Table 3 displays the top 15 abstract implications according to the 3-clique and cc ≥ 4 abstract informativity. Again, informativity is clearly nonmonotonic with respect to extension size. The first and third implications have the same abstract pattern as their rightmost member: it concerns the same abstract subgraph, whose pupils have in common the behavior D45-C12-S2, but is obtained either by reducing the pattern D345-S2 subgraph or the pattern D45-S2 subgraph. Obviously the former subgraph corresponds to a higher informativity as it includes the latter subgraph. As a matter of fact, the third implication is redundant with the first one and could be removed with no information loss. This leads to reduce the abstract min–max basis by eliminating all such redundant rules. We will apply such an idea when defining a basis for local implications in Sect. 4.2. The second implication in this ranking concerns pattern D45-C34, mentioned as the highest specificity pattern. The implication states that the 3-clique and cc ≥ 4 subgraph of pupils that have pattern C234, which corresponds to a medium-to-high cannabis consumption behavior, selects pupils with the D45-C34 pattern, i.e., those who have both high alcohol and cannabis consumption behavior. When applying no abstraction, the implication only holds on 7 among the 14 pupils that have pattern C234, which results in informativity 1 − 7∕14 = 0. 5.

Table 3 Top-15 abstract implications in the Teenage Friendship network ranked according to their 3-clique and cc ≥ 4 informativity

We discuss now some new details on experiments performed on the DBLP dataset. The experiment consisted in applying a degree ≥ k abstraction with increasing k-values and we focused in abstract patterns obtained with k = 16, which corresponds to a very strong abstraction: in an abstract extension each author is required to have 16 co-authors within the abstract extension. We obtained few abstract closed patterns and in particular the abstract closed pattern VLDBJ, ICDE, SIGMOD, VLDB and the related abstract implication □ VLDBJ → □ ICDE, SIGMOD, VLDB. Both the abstract closed pattern and its abstract minimal generator VLDBJ have an abstract extension of 38 among the 1276 VLDBJ authors in the dataset. The implication states that a dense group of co-authors that have published in the Very Large Database Journal also have published in several database conferences. In  Fig. 5 we present the corresponding subgraph. Such a very dense co-authoring subgraph within the VLDBJ subgraph is somewhat unexpected. Its abstract specificity 38∕1276 ≈ 0. 085 is low, but still higher than the 0 value we could expect from such a high abstraction level. The abstract implication has a high informativity about ≈ 0. 65 coming from the fact that among the 1276 authors who published in VLDBJ journal only 441 did publish in all the conferences ICDE, SIGMOD, and VLDB.

Fig. 5
figure 5

The subgraph obtained when applying the degree ≥ 16 abstraction to the VLDBJ subgraph in the DBLP co-authoring experiment

We made then some investigations in the DBLP repository, focussing of the 38 authors of the abstract extension, and found an article whose reference is given in  Fig. 6 and whose abstract begins as follows:

A group of senior database researchers gathers every few years to assess the state of database research …

Fig. 6
figure 6

An example of reference with many authors that leads to a high degree subnetwork

In some sense the explanation of the pattern we discovered is straightforward. However, the whole purpose of pattern mining is to find unexpected patterns, hidden within large datasets, and interpret them in order to acquire some new knowledge. It is exactly what happens here: we were not aware of these regular meetings of senior database researchers, and we learned something new, though, of course, this knowledge is clearly widely known within the database community.

When considering a weaker abstraction, namely here a degree ≥ 4 abstraction, we obtain more abstract closed patterns sometimes made of several connected components. Figure 2 in Sect. 1 represents the DMKD,IDArev pattern subgraph together with the subgraph induced by the abstract extension of the pattern. This abstract subgraph is made of two connected components, the one in the right part of the figure is made of ten vertices and we are then interested in knowing whether there is some more specific pattern than the abstract closed pattern DMKD,IDArev, which would be shared by this connected component. Answering such questions means mining at a local level the attributed graph, and this is the subject of the next section.

4 Local Closed Patterns in Attributed Networks

Given some attribute pattern, we are now interested in extracting local support closed patterns, i.e., maximal attribute patterns each associated with one dense subgraph, so allowing to extract local implications particular to specific dense groups of objects. Recently FCA has been extended to local closed patterns: they are obtained by applying a set of local closure operators [22]. In the graph case, this means that from the extension of some (closed) pattern c, various dense extensions, called local extensions are extracted each associated with a local closed pattern, i.e., the most specific pattern l common to the elements of the local extension. Again we obtain a set of local implications corresponding to inclusion of local extensions, but now such an implication is only valid in the vicinity of some dense group of vertices.

In [21] we introduced locality in the closure framework with the main motivation of investigating local patterns in attributed graphs. For that purpose we have first to define pre-confluences and confluences which are structures weaker than lattices that have been investigated in FCA [21, 23]. Confluences, in particular, are close to but different from confluent families as defined in [5]. We further denote by E x the up sets {yE | yx} of an ordered set E, by E x its down sets {yE | yx}, and by min(E) the set of its minimal elements.

First note that ordered sets we consider are all finite. We define a pre-confluence as a finite ordered structure that generalizes the (finite) lattice structure:

Definition 5

A finite ordered set F is a pre-confluence if and only if for any m ∈ min(F), F m = {xFxm} is a lattice.

A consequence of this definition is that a (finite) lattice is a pre-confluence with a minimum. The structure has a partial join operator:

Proposition 5

For any m ∈ min(F) and any x, yF m their least upper bound is the least element of F xF y we further denote by xF y.

This means that a pre-confluence is a union of lattices in which joins coincide. A particular case is which of a pre-confluence included in a host lattice and which is join preserving:

Definition 6

Let T be a lattice and FT be a pre-confluence with as join ∨F = ∨T, F is called a confluence of T.

An abstraction of T, as defined above is a confluence of T with ⊥T as minimum. We have then the following property when considering 2O as the host lattice:

Proposition 6

Let X = 2O be a lattice, FX is a confluence of X if and only if for any x, yF m with m ∈ min(F), we have that xy belongs to F.

A confluence is then associated with a set of interior operators:

Proposition 7

Let F be a confluence of a lattice X, and m ∈ min(F),

  • p m: X mX m, such that \(p_{m}(x) = \vee _{q\in F^{m}\cap X_{x}}q\) , is an interior operator and p m[X m] = F m .

We are concerned here with extensional confluences, i.e., confluences of X = 2O [21] that generalize extensional abstractions as graph abstractions. In this case, let x be an element of X greater than or equal to some minimal element m of F, then p m(x) returns the greatest subset of x in F that includes m. We now define graph confluences, which are the original motivation for defining confluences:

Definition 7

Let G = (O, E) be a graph, and F be the family of vertex subsets inducing connected subgraphs of G. F is a confluence of 2O called the graph confluence of G.

Proof

By definition, any singleton {s} induces a connected subgraph of G. Furthermore, the union of two connected vertex subsets that each includes a given vertex singleton {s} also is a connected vertex subset. Following Proposition 6, F is then a confluence of 2O.

The elements of F are simply called the connected vertex subsets of O. By abuse of notation we write p s and F s rather than p {s} and F {s}. The interior operator p s projects then any vertex subset e containing vertex s on the connected component of the subgraph G e induced by e that contains s. The up set F s is then the set of connected vertex subsets containing s, and the union of all these F s represents the whole set of connected subgraphs of G.

Example 3

Let G = (O, E) be a graph (displayed at the bottom of Fig. 7) whose vertex set is O = {1, 2, 3, 4}. Let F ⊆ 2O be the set of connected vertex subsets of G. F is a confluence whose set of minimal elements is min(F) = {{ 1}, {2}, {3}, {4}}. The subset F 1+3 = F 1F 3 representing connected vertex subsets containing vertices 1 or 3 is also a confluence. Figure 7 displays the diagram of F 1+3. □

Fig. 7
figure 7

A square graph (in the bottom of the figure) and the Hasse diagram of the confluence F 1+3 of connected vertex subsets that contain vertices 1 or 3

The extension e of a pattern q may then be projected through interior operators on various smaller local extensions {e i} corresponding to the connected components of the pattern subgraph. These interior operators are associated with local closure operators [21]:

Proposition 8

Let F be a confluence of X = 2O , m a minimal element of F, and L int(m) be the down set of the pattern lattice L whose elements q are such that q ≥ int(m), then

f m = int ∘ p m ∘ ext is a closure operator on L int(m)

In a graph confluence, let e = ext(q), then p s(e) is the connected component of the pattern subgraph G e to which the vertex s belongs. Obviously, p s(e) = p v(e) for any vertex v in the same connected component. Therefore f s(q) is a local closed pattern w.r.t. any vertex in this connected component, i.e., the most specific pattern shared by the vertices in the connected component.

Now a general result is that the set of local extensions is a pre-confluence:

Theorem 1

The mapping h: FF: h(e) = p m ∘ ext ∘ int(e) for me

is a closure operator on F and E = h[F] is a pre-confluence.

h(e) is therefore the local extension of int(e) that contains me and h[F] is a pre-confluence isomorphic to the set P of local concept pairs defined as follows:

Definition 8

The set of local concept pairs P = {(e, l)∣e = p m ∘ ext(l), l = int(e), me} is called a local concept pre-confluence.

To summarize we have defined local concepts as (local extension, local closed patterns) pairs and we have shown that they are organized in a structure with possibly several minimal elements, therefore generalizing the concept lattice definition. In the graph confluence exemplified above the local extensions simply are the connected components of the pattern subgraphs. We will now extend graph confluences by intersecting graph confluences with abstractions.

4.1 Cc-Confluences

We remark now that we can freely intersect confluences:

Proposition 9

Let F 1 and F 2 be confluences of X, then F 1F 2 is a confluence.

Since abstractions of X are confluences of X with the bottom element of X as their unique minimal element, the above proposition means we can freely intersect abstractions with confluences to build smaller confluences. Many confluences can then be derived from a graph confluence by intersecting it with some abstractions. We call this family of confluences the cc-confluences.

Definition 9

Let F be the graph confluence of some graph G and A be a graph abstraction of G, then the confluence FA is called the cc-confluence of G associated with A.

For instance, considering A as the k-clique abstraction, we obtain the cc-confluence of connected subgraphs of G made of k-cliques. Note that cc-confluences have an important property: rather than considering the minimal elements m of F when defining local closure operators we can consider vertices as in the graph confluence F. This is because, given any abstract subgraph and any m included in its vertex set, all vertices v of m belong to the same connected component and therefore to the same local extension. This is computationally important as this means that when considering local extensions we only need to consider each vertex in the extension and associate it to the connected component to which it belongs.

4.2 Local Implications

Inclusion of local extensions defines validity of local implications □m A q → □m A w, where m is a minimal element of F A in the extension of q. Note that, as the local extension of pattern q is obtained by applying an interior operator, which is monotone, to the support set of q, we have that, whenever □A q → □A w is valid and m ⊆ extA(w), we also have that □m A q → □m A w is valid, i.e., we may infer the latter local implication from the former abstract implication.

Example 4

Consider the graph displayed in Fig. 8. The 3-clique cc-confluence has as minimal elements {123, 567, 678} and rewrites as F A = {123, 567, 678, 5678}. The extension of pattern b is equal to its abstract extension 123, 678, and the abstract closed pattern is also b. However, the corresponding abstract subgraph displays two connected components 123 and 678. The vertices of the latter share bc which is consequently its local closed pattern. This leads to a local implication:

  • 678 A b → □678 A bc

Fig. 8
figure 8

The pattern b 3-clique abstract subgraph displays two connected components. The blue one, on the left, is also the pattern b abstract subgraph of motif a leading to the local implication □123 A a → □123 A ab

In a cc-confluence the local implication may be indexed with respect to any vertex of the corresponding connected component: a triangle in the same connected component as 6, when considering the pattern b abstract subgraph, also has c, which rewrites as □6 A b → □6 A bc. We will simply say that “a triangle containing 6 and which has b also has c.”

We search now for a basis \(B_{F_{A}}\) of valid local implications from which we may infer any local implications. We will consider a basis B m for a given minimal element m of F and obtain the whole basis \(B_{F_{A}} = \cup B_{m}\) by joining these bases. Consider a given abstract closed pattern c whose abstract extension has a connected component that contains m, and let l = f m(c) be the corresponding local closed pattern, with respect to the cc-confluence F A. This means that the implication □m c → □m l holds. We select then a basis B m of informative (lc) and irredundant (there is no other implication □m c′ → □m l with c′ less specific than c in the implication set) ones. From B m we may infer all local implications associated with m by applying standard axioms in the same way as in the case of the min–max basis in the standard closed or abstract framework. The basis \(B_{F_{A}} = \cup B_{m}\) represents the local knowledge deriving from the reduction of the extensional space from abstraction A to cc-confluence F A:

Definition 10

The Local Min–max Basis \(B_{F_{A}}\) associated with the cc-confluence F A is defined as:

\(B_{F_{A}} =\{ \square _{m}^{A}c \rightarrow \square _{m}^{A}l\mid \mbox{ where }c\mbox{ A-closed },l\mbox{ locally closed },c\not =l,\mathrm{ext}_{m}^{A}(c) =\mathrm{ ext}_{m}^{A}(l),\mbox{ and for all }c' \subset c\mbox{ we have }\mathrm{ext}_{m}^{A}(c')\not =\mathrm{ext}_{m}^{A}(c)\}\)

4.3 Interestingness Measures on Local Patterns and Implications

As in the abstract case, we may measure novelty brought locally [25]. We first extend the specificity measure to local patterns. If the ratio of a local extension to the abstract extension is high this means that the corresponding connected component is the largest part of the abstract subgraph.

Definition 11

Let q be a pattern, F be a cc-confluence and mF be such that m ⊆ extA(q), the specificity of q near m is defined by

$$\displaystyle{ S_{F}(q,m) = \frac{\mid \mathrm{ext}_{m}^{A}(q)\mid } {\mid \mathrm{ext}_{A}(q)\mid } }$$

We then define the informativity of a local implication by observing that in a valid local implication the left and right parts have same local extensions, while their abstract extensions are different. Therefore, as in the abstract knowledge case, we define the local informativity as the probability, at the abstract level, not to have the right part when the left one is true:

Definition 12

The local informativity of valid local implication r: □m A q → □m A qw is defined by

$$\displaystyle{ I_{F}(r) = 1 -\frac{\mid \mathrm{ext}_{A}(qw)\mid } {\mid \mathrm{ext}_{A}(q)\mid } }$$

When local informativity is close to 1, it means that the probability to observe w when we have q at the abstract level is low: to be able to deduce w from q is specific of the graph region where m lies.

Example 5

We carry on Example 4 displayed in Fig. 8. The local specificity near 678 of pattern bc is S F(bc, 678) = 3 ÷ 3 = 1: it does not appear elsewhere in the bc abstract subgraph. Informativity of local implication □678 A b → □678 A bc is 1 − (2 ÷ 6) = 0, 5: the abstract support of b is 6, but only three vertices share the local closed pattern bc. Regarding the local closed pattern ab, it also has local specificity 1 as ab is only found in the connected component 123, and it leads to a new local implication □123 A b → □123 A ab whose informativity is 0, 5. What we see here is that we obtain new knowledge regarding pattern b which depends on the region of the graph we consider.

Now, when considering the Teenage Friends attributed graph displayed in Fig. 3, clearly the friendship relations are organized in 3-cliques, therefore any stronger abstraction will be poorly informative. However, as mentioned in Sect. 1, when considering the 3-clique abstract graph associated with the empty pattern the unique connected component could be separated in several (overlapping) communities (displayed in Fig. 3 in various colors). We discuss and exemplify in the next section how to apply the local closure strategy to discover such sub-communities in an attributed graph.

5 Local Concepts from a Derived Graph

In what follows, we will consider a family T ⊆ 2O of vertex subsets, and consider T as the vertex set of a graph G T = (T, E T) derived from G. The simple graph confluence F of 2T is then the new extensional space and we will search for the corresponding local closed patterns. The local extensions are afterwards transformed into extensions in 2O. Let u: 2T → 2O be such that \(u(e_{T}) = \cup _{t\in e_{T}}t\). u(e T) is called the flattening of e T. We consider then the two maps extT and intT defined as follows:

  • extT: L → 2T with extT(q) = {t | t ⊆ ext(q)}

  • intT: 2TL with intT(e T) = int ∘ u(e T)

extT(q) represents the extension of q in T when considering that q occurs in t whenever q occurs in all elements of t (seen as a subset of O). Conversely intT(e T) represents the greatest pattern in L whose extension in T includes e T, i.e., whose extension in O contains, as subsets, the elements of e T. Now, consider as T the family of k-cliques of G and that (t 1, t 2) ∈ E T whenever t 1 and t 2 share k − 1 vertices in G. A k-community in G [13] is a vertex subset that results from the flattening (in the sense defined above) of some connected component of G T. The local closed patterns w.r.t. F are then most specific patterns occurring in k-communities of pattern subgraphs of G. This way we obtain local concepts and associated local implications, whose local extensions are these k-communities. Note that, in the derived case, the local concepts do not form a pre-confluence: technically we obtain a pre-confluence of 2T, but two different local extensions in 2T may result in the same flattening, corresponding to one 3-community. As a consequence, the local concept order is no more a pre-confluence.

5.1 Experiments on a Derived Graph

Coming back to our Teenage Friendship attributed graph, we have applied this strategy and built the derived graph G T, where T is the set of 3-cliques of the original attributed graph. In Figs. 9 and 10 we display the ordered set of 3-communities with size at least 4. Footnote 7The minimal 3-communities are the lowest ones on both figures. Each element of the pre-confluence represents a (3-community, local closed pattern) pair but may be associated with several non-redundant local implications. This happens for one 3-community displayed on the right at the bottom of Fig. 9 and associated with two local implications each represented in a square. Each square displays in red the 3-community, and in red+green+blue the abstract extension of the abstract closed pattern forming the left part of the implication. In Fig. 10 we have a unique maximal 3-community on the top, and a hierarchy of sub-communities.

Fig. 9
figure 9

The ordered set of size ≥ 4 3-communities of the Teenage Friendship network (part-I). The 3-communities are displayed in red and bold lines from the larger ones on the top to the smaller one on the bottom.The abstract subgraphs are displayed in plain lines. On the right at the bottom we have a 3-community displayed twice as it is built from two different abstract closed patterns

Fig. 10
figure 10

The ordered set of size ≥ 4 3-communities of the Teenage Friendship network (part-II)

We now investigate interestingness of local patterns and local implications. Note that the definitions given in Sect. 4.3 have to be adapted since we have to replace local extensions as defined in Sect. 4 by 3-communities, i.e., flattening of local extensions in the derived graph G T. Table 4 displays the local closed patterns ranked according to their specificities. Each local closed pattern is indexed by the first triangle, in the lexicographic ordering, leading to the corresponding 3-community. Consider, for instance, the first two local patterns l 1 = D45-C12-S2 and l 2 = D45-C34 which have both specificity 1. This means that the set of pupils triangle having D45-C12-S2 (respectively, D45-C34) forms a (unique) 3-community we further refer to as Community 1 (respectively, Community 2). Communities 1 and 2 have in common high alcohol consumption behavior (D45), but differ in that the members of Community 1 do not smoke cannabis (C12) and have a regular sporting activity (S2), while the members of Community 2 have regular cannabis consumption (C34).

Table 4 Top 15 local closed patterns ranked according to their specificity in the Teenage Friendship network

Consider now the implications of the local min–max basis, ranked according to their informativities, displayed in Table 5. We have here some implications with high informativity. As an example, the fifth (I 5) and ninth (I 9) local implications concern Community 1 while the second (I 2) concerns Community 2.

Table 5 Top 15 local implications of the min–max basis ranked according to their informativity int the Teenage Friendship network

As an illustration we consider Community 1 and its two related local implications I 5 and I 9. As associated with the same community their rightmost member is the same local closed pattern D45-C12-S2. However, as they are extracted from different abstract subgraphs corresponding, respectively, to abstract closed patterns S2 and D45, they have different informativities. I 5 has informativity 0. 75, while I 9 has informativity ≈ 0. 56. In Fig. 11 we display Community 1 together with the necessary information to compute the local informativity of I 5.

Fig. 11
figure 11

Community 1 (represented as bold dots joined by bold lines) composed of five pupils with high alcohol consumption, no cannabis consumption C12 and regular sporting activity (D45-C12-S2). The community has size 4 and is one of the communities of the S2 abstract subgraph which is represented with gray plain lines joining 16 vertices. The Informativity of the local implication I 5 associated with community 1, which has S2 as its leftmost member and D45-C12-S2 as its rightmost member, is therefore 1 − 4∕16 = 0. 75. Note also the three black dots representing vertices which have the S2 pattern but do not belong to its abstract subgraph

The high informativity value of I 5 means that what the pupils in this community have in common, i.e., high alcohol consumption, no cannabis consumption and regular sporting activity (D45-C12-S2) is unfrequent among pupils in triangles with regular sporting activity outside of the community.

6 Implementation

In our experiments we first used the CORON software [28] to compute frequent closed patterns, according to some frequency threshold, then apply a set of PYTHON functions as a post-processing to compute abstract and local patterns and implications. Footnote 8 More recently we have implemented an efficient algorithm using a divide and conquer strategy similar to that proposed in [5] and implemented in [12]. This allows in particular to directly apply the frequency constraints at the abstract and local levels. A first version, named ParaminerLC, is experimented in [26] and was designed to handle the 3-communities local knowledge extraction problem. The selection of the implications belonging to the local min–max local basis is performed as a post-processing. Our current implementation in progress is a versatile program enumerating abstract and local frequent closed patterns and local implications with various definition of abstraction and locality.

7 Conclusion

In this article we have addressed problems in which the extensional space, made of the vertex subsets of an attributed network, is constrained according to connectivity properties. We have first considered abstract vertex subsets in which a constraint has to be satisfied by each vertex in the subgraph they induce, as, for instance, a minimum degree constraint. The extensional space is in this case a particular lattice called an abstraction. We have then shown, benefiting from previous work in FCA, how abstract support closed patterns, i.e., maximal patterns among those sharing the same abstract extension, could be obtained using a closure operator. This has resulted in defining a wide class of abstract concept lattices, whose elements are (abstract extension, abstract closed pattern) pairs, each corresponding to a particular abstraction. This way we obtain a global information on how the graph topology is related to the pattern extensions. We have then considered a way to extract local knowledge from an attributed network. For that purpose, using a recent extension of FCA to local extensional spaces, called confluences, we have related each pattern to various local extensions, corresponding to connected components in subgraphs induced by abstract vertex subsets. We obtain this way a set of local concepts, organized in a generalization of the lattice structure called a pre-confluence. Furthermore we have defined both abstract implications and local implications representing knowledge which is valid at the abstract and local levels, i.e., regarding the latter, in the vicinity of particular vertices. For both abstract and local patterns and implications we have proposed proper interestingness measures, namely specificity which measures to what extent the original extensions of patterns are preserved in abstract and local concepts, and informativity, which measures novelty brought by abstract and local implications. Finally we have applied these ideas to enumerate 3-communities in a network. These 3-communities are in fact sub-communities as each is a 3-community in some subnetwork induced by an attribute pattern.

Overall, what we propose here is a new way, brought by recent developments in Formal Concept Analysis, to explore social and complex networks as attributed graphs. As an application, we are currently involved in the ADALAB project which aims at helping the robot scientist EVE [30] to design experiments.Footnote 9 In this context, we use our methodology to explore a co-regulation network labeled with information regarding gene expression. Future works concerns, on the extensional side, applying these ideas to attributed directed graphs or multiplex networks. We also consider to use abstract and local extensional constraints while extending the pattern language to a wider class of pattern languages. First, as in [7, 9, 15] by building a meet-semilattice adapted to the mining problem and using interior operators to reduce it to a tractable language. This has been in particular successfully applied to graph mining [10]. Then, as in [5, 20] by considering confluent languages allowing to treat connectivity within the pattern language.