1 Introduction

With the advent of modern information and communication technologies (ICTS), the amount of digital data has increased tremendously in recent years. Every day enormous amounts of information on and about customers, users, and subscribers are stored in databases. Especially, social media technologies play an important role in this ever-increasing data deluge and have become an important data source for researchers as well. Social platforms enable researchers to access vast amounts of data. The process of extraction actionable knowledge is known as “Knowledge Discovery” or “Knowledge Representation” and has been widely studied through the topology of complex and social networks established by researchers. Various network processes are affected by the network topology. For example, the spread of information is affected by the topology of communication networks, while the performance and productivity of research is affected by the topology of collaboration networks. The above-mentioned network processes have been mostly studied through the problem of identifying central individuals and communities/clusters in a network.

1.1 Motivation

The typical research addresses issues of centrality (Which vertices are best connected to others or have most influence?) and connectivity (How individuals are connected to one another through the network?). However, individuals (or vertices) alone may not represent the context (relationships, interactions, etc.) entirely. For example, it is nearly an impossible task for a single individual to organize a mass protest such as Taksim Square Gezi Park, or flash mobs. Similarly, a key set of individuals may better coordinate relief efforts during natural disasters, such as the Haiti Earthquake. The question “How can we detect influential sets of individuals instead of a set of influential individuals in social networks?” is then worth investigating, because in Gezi Park protests, for example, over 2.5 million distinct users participated in the social interaction during these events. Although a few actors and journalists tried to gain the initial attraction, it was the main body of social media users who have played an important role in making a local protest in Taksim Gezi Park become nationwide.Footnote 1 All of these phenomena would require the identification of key sets of actors rather than a set of key actors.

In this research, we focus on identifying such key sets of individuals and define them as focal structures. In order to accomplish this, a new approach is developed to detect focal structures in a social network. This approach can be applied to various areas, such as recommendation systems, the semantic web, bioinformatics, genetic networks, social networks, marketing, advertising, information diffusion studies, search engine indexing and so on. This approach can help study many real-world problems such as “How can businesses determine their buyers’ behaviors and discover worthy customers to select for promotion?”, “What are the dominant (most interactive) communication structures in an organization?” and “What are the specific interaction patterns in a social club or interest groups?”. The concept of focal structures can also help understand the dynamics of diffusion processes. For example, it can help to figure out the questions “How does information spread in a network throughout the focused and most significant structures in the context of a social movement?” and “What is the resilience and robustness of a network with the loss of those structures?”. For a contagious disease, for instance, it can help to figure out the focused and significant groups of individuals who diffuse that contagious disease with any states (e.g., susceptible, infected, carrier, immunized, dead, etc.).

We evaluate focal structures by their interactivity within the network. The interactivity of the focal structures is measured by observing how much the individual members of the focal structure engage the other individuals of the entire network. A more formal and mathematical definition is provided in Eq.  9 in Sect. 5.2.3.

To the best of our knowledge, this work is the first effort in identifying influential sets of individuals instead of a set of influential individuals. As an initial step of the study, we compare the engagement capacities of focal structures with that of average individuals. It is important to note here that focal structures are better coordinators than average individuals and cannot completely substitute influential individuals. From empirical observations, it was found that influential individuals are leading actors in disseminating information, however focal structures are better in rallying support od mobilizing the crowds. As next steps, we are designing experiments to compare and contrast focal structures with influential individuals to observe their respective utilities vi-a-vis complex and dynamic social processes, such as campaigns, social movements, etc. We believe that this work would open up new directions to researchers who can work further on developing new methods in social network analysis.

1.2 Illustrative examples

Fig. 1
figure 1

An illustrative example of identifying influential individuals in the Zachary network. The PageRank centrality method is used to detect influential individuals

Fig. 2
figure 2

An illustrative example of identifying communities in the Zachary network. The Louvain community identification method is used to detect communities

Fig. 3
figure 3

An illustrative example of identifying focal structures in the Zachary network

The well-known method to identify key individuals in knowledge discovery is the authoritative approach. It identifies most centralized individuals in a network. However, a node on its own may not represent the interaction. Therefore, there is a need to identify authoritative structures besides authoritative individuals. Not only to represent, but also to understand the context, we need to study interactions among a set of individuals and figure out what they tell us. We need to identify those structures in terms of specific interaction patterns or communication structures.

Example 1 From the topological point of view, to distinguish focal structures from influential individuals and communities, we utilize the well-known Zachary network, which is a social network of friendships between 34 members of a karate club at a US university in the 1970s (Zachary 1977). Figure  1 shows the influential individuals (Vertex 1, 33 and 34) identified by the PageRank (Brin and Page 1998) method. Figure 2 includes four communities detected by the Louvain (Blondel et al. 2008) method. In Fig. 3, eight focal structures are identified. For example, a focal structure includes the vertices 33 and 34, which are one of the most influential individuals in the network. The vertices 15, 16, 19, 21, 23 are involved in the same focal structure. There are other focal structures which do not contain any influential individual(s). Please note that these focal structures do not offer additional information, however, their relevance are demonstrated in the next illustrative example (Fig. 4). The contextual differences between focal structures and influential individuals are explained in greater depth in Sect. 5.

Example 2 Figure 5 shows an illustrative example to demonstrate the relevance of focal structures in a blogger network. The blogger network is retrieved during the 2008 Egyptian Revolution (Burton et al. 2009). It consists of 85,013 vertices and 129,079 edges. Hyperlinks between websites form the edges in the blog network. There are five focal structure samples identified with two (f2), three (f4), four (f6), six (f3), eight (f5) and ten (f1) vertices. These are f2, f4, f6, f3, f5 and f1, respectively. Upon investigation, we found that the bloggers that constitute focal structures have highly relevant postings. Further, their blogs are much more engaging than others. We aim to identify such structures which are focal in a social network.

Fig. 4
figure 4

Focal structures vs. communities and influential individuals

Fig. 5
figure 5

The extraction of focal structures from a blogger network

1.3 Outline

The rest of the paper is organized as follows: in Sect. 2, we present the related work on the authoritative approach and community identification approaches. Section 3 includes the definition of focal structures, discusses the problem in detail and defines the problem statement. The details of the developed approach, Focal Structures Analysis (FSA), is given in Sect. 4. It consists of two developed algorithms: global interconnectedness-based FSA and local interconnectedness-based FSA. In Sect. 5, we perform experiments using synthetic and real-world dataset. In a simulation environment, the performance of the developed algorithm is tested. Using real-world data, we demonstrate that the focal structures are more interactive than average individuals in the evolution of a mass protest. Section 5.3 includes a web tool,Footnote 2 which is designed and implemented for researchers and practitioners to demonstrate the usefulness of the FSA approach. Lastly, the paper offers a conclusion and future work in Sect. 6.

2 Related work

The following subsections provide a brief explanation of the identification of influential individuals and communities/clusters. Each subsection ends with a comparative analysis to distinguish the FSA approach from the reviewed approaches.

2.1 Influential individuals

Researchers have studied social networks from the perspective of information diffusion and identify key players who maximize the spread (Leskovec et al. 2007). Gruhl et al. (2004) study information diffusion of various topics in a social network, drawing on the theory of infectious diseases. In contrast to common belief, where best spreaders corresponds to the most highly connected or the most central people, Maksim et al. identify influential spreaders who are those located within the core of the network (Kitsak et al. 2010). The network core includes the region where most influential spreaders are placed. The emergence of super-stable nodes whose ranking is exceptionally stable to perturbations is predicted by Goshal and Barabasi (2011). Here authors show that for random networks the ranking provided by PageRank is sensitive to perturbations (an agitation in the network topology), making it unreliable for incomplete or noisy systems. Masahiro et al. address the combinatorial optimization problem of finding the most influential nodes on a large-scale social network (Kimura et al. 2007). Agarwal et al. (2012) have studied the problem of influential bloggers in a community by developing a stochastic model leveraging blogging gestures, such as recognition, novelty, activity, generation and quality of writing.

Page et al. (1999) study the authoritative approach in the context of webpage ranking. They introduce an algorithm, called PageRank, which is based on eigenvector centrality (Borgatti 1995; Borgatti 2005). It assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of “measuring” its relative importance within the set. Their underlying assumption is that more important websites are likely to receive more links from other websites. Kleinberg (1999) considers the authority and hub as indegree and outdegree, respectively, of nodes in a network. His approach is known as Hyperlink-Induced Topic Search (HITS) in the literature and supports the idea that a good hub links many good authorities while a good authority is linked by many good hubs. The degree information of each node is described by degree centrality (Freeman 1979; Nieminen 1974). More important nodes are more active and therefore should have more connections according to the idea of degree centrality.

The studies based on centrality measures consider a single individual in a network at a time. However, a single individual on its own may not represent the influence of a network by itself, because there is no other node to form a set of interaction patterns with each other. Unlike the authority-based approaches dealing with the extraction of influential individuals, the FSA approach identifies influential sets of individuals rather than a set of influential individuals.

2.2 Community identification

Identifying communities and clusters has been widely studied in the relevant literature since early social network analysis. Typically, a community is considered to be a set of entities where each entity is closer to the other entities within the community than to entities outside it.

Communities play a vital role in understanding the creation, representation, and transfer of knowledge among people, and are an essential building block of all social networks. However, the relationship of one individual to one another in a community is not easily formalized, or necessarily consistent, and thus how, exactly, one extracts communities from a social network.

There are three dominant approaches for community extraction: network-centric, content-centric and hybrid approaches (Agarwal and Liu 2009). Network-centric approaches leverage network structural properties to identify communities within a social network (Fortunato 2010). Since the fundamental assumption is that the members of a community tend to interact more often with each other as compared to the members of other communities, the network is partitioned into clusters with lower between cluster edges and higher within cluster edges (Hagen and Kahng 1992; Shi and Malik 2000; Von Luxburg 2007). Variations of the above network partitioning approaches address the dynamic nature of social networks (Ning et al. 2007; Chi et al. 2007a, 2007b). Content-centric approaches rely on the assumption that the members of a community tend to talk about similar topics. Communities thus are extracted based on the similarity of content available in the form of blog post texts or profile information of the individuals (Li et al. 2007; Brooks and Montanez 2006), which is inspired by the fundamentals of text mining and webpage clustering. Agarwal et al. (2010) proposed a novel clustering algorithm to meet the challenges of the highly dynamic environment. In this work, bloggers’ collective wisdom is extracted via the labels used by the bloggers to annotate their blogs and represented by a label relation graph that helps in discovering many interesting, complex, and latent relations between annotations. Hybrid approaches leverage both content and network information to extract communities. The central tenet behind such an approach is: a set of blogs that are highly linked and tend to share similar content, reflect tighter communities (Java et al. 2008). However, the sparse link structure and inherent differences between web pages and blogs (such as interactive and dynamic environment, highly likely topic and user drift, low barrier to publication leading to extremely noisy data) demand novel approaches.

Recall that the developed approach (FSA) is different than any community identification approach. Instead of extracting communities, this research aims to extract focal structures which are smaller and more pertinent groups in a network.

2.3 Other studies

The structural design of complex networks has also been studied through network motifs by some researchers. Network motifs (patterns of connectivity that occur significantly more frequently than expected) were introduced by Milo et al. (2002). Network motifs aim to find structures which occur frequently in a network (Kashtan and Alon 2005; Kashtan et al. 2004).

Another research has been conducted through graphlets which are small, connected, non-isomorphic, induced subgraphs of a large network. Janssen et al. use the frequency counts of graphlets and indicate that graphlets characterize the structure of a big network (Janssen et al. 2012). Some studies on graph similarity have utilized graphlet counts as a way of comparing networks (Bordino et al. 2008; Kondor et al. 2009; Shervashidze et al. 2009).

Motifs and graphlets are comparable at the level of focal structures as they are subgraphs too but they are very functionally very different. More specifically, motifs are frequent patterns or frequently occurring subgraphs. They repeat themselves as subgraphs in a network. Graphlets are induced subgraphs in a large network. Focal structures need not occur frequently nor do they have to be induced subgraphs. In other words, focal structures show the influential sets of nodes whereas motifs and graphlets show the frequently occurring sets of nodes in a network.

3 Problem statement

A Focal structure is a key set of individuals who may be responsible for organizing events, protests, or leading citizen engagement efforts. A focal structure includes a set of vertices (at least two) and edge(s). These individuals need not to be strongly connected and may not be the most influential on their own but by acting together they form a compelling power.

Let’s assume we have a graph G with vertices V and edges E. We need to identify the most authoritative and smallest set of nodes forming the focal structures F which are embedded in G. A focal structure cannot subsume another and yet they both could be unique.

A set of focal structures F needs to be identified in graph G with vertices V and edges E. Each focal structure is unique and a focal structure cannot be a subset of any other focal structure. A focal structure F includes all edges E among vertices V. Formally, a focal structure can be defined as follows:

Given a social network G = (V, E), where V is the set of vertices and E is the set of edges, a focal structure can formally be defined as follows:

Focal structures in G are defined by \(F = \{G'\}\), where \(G' = (V',E')\) and \(V' \subseteq V\) and \(E' \subseteq E\). For all i and j, \(i \ne j\), \(G_{i} \in F\) and \(G_{j} \in F\), such that no two focal structures can subsume each other, or \(G_{i} \subsetneq G_{j}\) and \(G_{j} \subsetneq G_{i}\).

In social networks, focal structures may fall under micro and meso-level analysis. Micro-level analysis starts with an individual but also supports a dyadic, triadic and a subset level structure which corresponds to a focal structure. The meso-level analysis pertains to networks within organizations, randomly distributed networks, and scale-free networks. A focal structure can include members with multiple roles acting in different groups or organizations.

The approaches to extract focal structures are explained in the next section.

4 The developed approach: FSA

We present the “Focal Structures Analysis” (FSA) approach to identify focal structures from a network of interactions among individuals. We develop two algorithms for this approach; global interconnected-based FSA, and local interconnected-based FSA. In next sub-sections, these algorithms are explained in detail.

4.1 Global interconnectedness-based FSA

One way to obtain a set of individuals in a social network is partitioning the graph into sub-structures. Therefore, we use network partitioning approach to identify sub-structures. As a global measure, the well-known modularity measure is used in a recursive mechanism to determine whether the sub-structures are focal.

Girvan and Newman (2002) proposed modularity algorithm which partitions the graph by removing inter-community links according to the “edge betweenness” (Girvan and Newman 2002; Newman 2004). They introduced a modularity measure Q, which basically shows the quality of the partitioning community.

We develop the modularity-based algorithm consisting of two parts; obtaining candidate focal structures and stitching those obtained focal structures.

4.1.1 Obtaining candidate focal structures (top-down division)

In this method, first, we leverage the concept of modularity recursively to extract candidate focal structures from a complex network. The recursive mechanism allows us to partition a network into smaller subgraphs. The process of obtaining candidate focal structures is known as focal patterns presented in our previous research (Şen et al. 2012).

The Louvain method of modularity (Blondel et al. 2008) is used to obtain candidate focal structures. Currently, this method is preferred, because it is a very efficient partitioning method for large graphs (Pujol et al. 2009; Haynes and Perisic 2009). Modularity is a measure of network structure that is designed to identify cohesive groups (or structures) in a network. Mathematically, modularity (Q) is defined as,

$$\begin{aligned} Q = \frac{1}{2m}\sum _{ij}\Bigg [A_{ij} - \frac{k_{i}k_{j}}{2m}\Bigg ]\delta (c_{i},c_{j}), \end{aligned}$$
(1)

where \(A_{ij}\) represents the weight of the edge between i and j, \(k_{i} = sum_{j}A_{ij}\) is the sum of the weights of the edges attached to vertex i, \(c_{i}\) is the community to which vertex i is assigned, the \(\delta\)-function \(\delta\)(u, v) is 1 if u = V and 0 otherwise and m is the number of edges.

By using modularity, we generate a hierarchically structured subgraphs, where all the leaf nodes are considered as candidate focal structures. The approach for identifying candidate focal structures can be summarized as follows:

  1. 1.

    Apply the Louvain method to the current (whole) graph and get the partitions along with the modularity value.

  2. 2.

    If the modularity value is zero (\(Q=0\)) return the graph.

  3. 3.

    Otherwise get the subgraphs, and start iterating through all sub-graphs.

  4. 4.

    Consider each subgraph as a whole graph, and repeat the same steps (1, 2 and 3) recursively for the subgraph.

The developed approach is for unweighted and undirected graph, but it could be easily modified to handle the weighted and directed graphs. This modification is published in our earlier work (Şen et al. 2012).

4.1.2 Stitching candidate focal structures (bottom-up agglomeration)

Since we use a modularity-based approach shown above, it is quite natural to extract cliques or near-clique structures. In other words, it is not unusual that our approach, mentioned above, misses the structures that are not cliquish, i.e. having lower connection density. For instance in Fig. 6, candidate focal structures are extracted usually with five and four vertices. It is obvious that we miss the structure with nine vertices, which is a challenge for other structures with lower densities as well. To overcome this challenge, we develop a follow up procedure that identifies additional focal structures by stitching the candidate focal structures.

Fig. 6
figure 6

The GetCandidateFocalStructures algorithm misses the structure of 9 vertices and 31 edges by identifying two cliques with four and five vertices

We stitch highly interconnected candidate focal structures based on their similarity values. Similarity between two focal structures is measured using Jaccard’s Coefficient. To stitch two candidate focal structures together, we propose to measure the degree of interconnectivity by the similarity of two subgraphs (candidate focal structures) i and j defined below:

$$\begin{aligned} {\mathrm{Jaccard~similarity}}(i,j) = \frac{\mathrm{interconnectivity} (i,j)}{\mathrm{unionsize}(i,j)} \end{aligned}$$
(2)

where interconnectivity(i, j) is the number of connections between subgraphs i and j, and unionsize(i, j) is the total size of subgraphs i and j. The Similarity (i, j) between two subgraphs i and j is the ratio of the number of the connections between i and j to the total size of the smaller subgraphs. The pair of candidate focal structures that have the highest similarity are stitched in each iteration, and the stitching process iterates until the highest similarity of all sibling pairs is less than a given threshold.

4.2 Local interconnectedness-based FSA

While modularity helps in measuring the global interconnectedness of the network, clustering coefficient helps in assessing the local interconnectedness. We develop a local interconnectedness-based algorithm, which leverages the clustering coefficient measure (Eq. 5) in a graph. Basically, there are two different variants of clustering coefficient.

4.2.1 Variant (1)

This variant is based on “triads” of vertices. It is first defined by Holland and Leinhardt (1971) as the global clustering coefficient which is based on triplets of nodes.The clustering coefficient is defined as

$$\begin{aligned} C_i = \frac{\mathrm{number \; of \; triangles \; connected \; to \; node} \; i}{\mathrm{number \; of \; possible \; triang. \; centered \; around \; node} \; i} \end{aligned}$$
(3)

where a triple centered around node i is a set of two edges connected to node i. A triangle consists of three nodes in which two of them are connected to node i. If node i is connected to all possible other two neighbor nodes forming triangles, then \(C_i\) is equal to one. Otherwise it is either equal to zero or smaller than one. The average clustering coefficient for the whole network is defined as

$$\begin{aligned} C = \frac{1}{n}\sum _{i=1}^n C_i \end{aligned}$$
(4)

where n is the number of vertices in the network. By definition 0 \(\le\) \(C_i\) \(\le\) 1 and 0 \(\le\) C \(\le\) 1.

4.2.2 Variant (2)

This variant is defined by Watts and Strogatz (1998) as the local clustering coefficient. It measures how close the neighbors of a node are to being a clique (complete graph). Let c(u) be the proportion of neighbors of u that are connected. In other words, c(u) is the probability that two friends of u are friends with each other in a social network.

If a vertex \(v_i\) has \(k_i\) neighbors, \(\frac{k_i(k_i-1)}{2}\) edges could exist among the two vertices within the neighborhood. The clustering coefficient (variant 2) is defined as the mean of all local clustering coefficients in the network. It is defined as

$$\begin{aligned} C_n = \frac{2e_n}{k_n(k_n-1)} \end{aligned}$$
(5)

where \(k_n\) is the number of neighbors of n and \(e_n\) is the number of connected pairs between all neighbors of n.

Like the variant 1, the network clustering coefficient is the average of the clustering coefficients for all vertices in the network.

At each step, Algorithm  1 uses a clustering coefficient to decide which vertices from the neighborhood are to be included in a focal structure.

First, we compute clustering coefficient values of each vertex (\(c_{i}\)) and mean (\(c_{m}\)) of the computed values for the entire graph. Then, we utilize \(c_{i}\) and \(c_{m}\) for pair-wise comparisons. For each pair-wise comparison, if the clustering coefficient of both vertices is less than or greater than the mean, then those vertices are included in the same structure. This process is based on the connection strength among neighbors of an individual. It generates sets of individuals whose clustering coefficient values are close to each other. In other words, the methodology gives focal structures in which the neighbors of the individuals are connected strongly and similarly with each other.

The focal structure id (index) is obtained before any pair-wise comparison. It is used to add the neighbor vertex \(v_j\) to that focal structure which includes the vertex \(v_i\). If a vertex has only one neighbor, it is not considered to be included in a structure even though its clustering coefficient is 1.

figure a

5 Experimental results

We perform experiments using synthetic and real-world dataset. The performance of the developed algorithm is tested in a simulation environment (Sect.  5.1). Using real-world data, we demonstrate that the focal structures are more interactive than average individuals in the evolution of a mass protest (Sect.  5.2). We also develop a web tool to make focal structures analysis accessible to researchers and practitioners (Sect.  5.3).

5.1 Focal structures in synthetic dataset

Using a synthetic dataset, we present a simulation environment (Fig.  7) in which we embed artificial focal structures in randomly generated small-world networks.

5.1.1 Simulation environment

In this simulation environment, we aim to measure the accuracy of the identified focal structures by using information retrieval evaluation metrics. First, a random graph is generated using the model of Erds and Renyi (1959). Second, we generate a small-world network by embedding the artificial focal structures to the random graph. The goal of embedding those structures is to generate a small-world network, a property exhibited by social networks (Zaidi 2013). Vertices of the simulated graph are randomly selected and replaced by artificial focal structures. Using the t test method, we keep embedding the focal structures until a significantly high clustering coefficient value is reached. In this study, a t test’s statistical significance shows whether there is significant difference between the random network and the new network embedded with focal structures. We compare the random network with the next one embedded with a new focal structure. The embedded focal structures increase overall clustering coefficient value of the network which is one of the basic properties of small world networks (Watts and Strogatz 1998). Finally, we apply our algorithm to identify those artificial focal structures embedded in small-world networks.

Fig. 7
figure 7

Simulation environment for generating focal structures from a small-world network

For both modularity-based FSA and simulation environment, we use Gephi (http://gephi.org/), an open-source software for visualizing and analyzing large networks. We implement our algorithm as a plug-in to Gephi. To form small-world networks from random graphs, we design artificial focal structures of various densities ranging from four to nine vertices as shown in Fig.  8. We use them for the performance test of our algorithm.

Fig. 8
figure 8

Artificial focal structures, n number of vertices, d density

Fig. 9
figure 9

Embedding artificial focal structures of five vertices and nine edges

Fig. 10
figure 10

Embedding artificial focal structures of five, six and eight vertices with 0.8 density value

5.1.2 Homogenous focal structures

In this experiment, we embed multiple homogenous focal structures to the graph and use FSA to identify. Homogenous focal structures consist of same number of nodes and vertices. Performance of our approach is assessed using the below-mentioned evaluation metrics (Eqs.  6,  7,  8). We generate 100 random networks and artificial focal structures, varying from four to nine vertices (Fig.  8). The artificial focal structures are embedded to the random networks to form 100 small-world graphs consisting of 50 vertices and 50 edges. The number of embedded focal structures vary from one to 20. In other words, every single focal structure in Fig.  8 is embedded from 1 to 20 times to each random network. For example, for the focal structure with five vertices and nine edges, we first embed a focal structure only once in the first iteration, twice in the second one, three times in the third, and so on until 20 times in the 20th iteration. This process is shown in Fig.  9. Then, we have 2000 (\(20 \times 100\)) small-world graphs embedded with focal structures of five vertices and nine edges. Finally, the proposed FSA algorithm is applied to identify the embedded focal structures. We apply the same process for identifying focal structures with densities 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0.

5.1.3 Heterogenous focal structures

In this second experiment, we embed multiple heterogenous focal structures, i.e. focal structures with different vertices and different edges. We again generate 100 random graphs of 50 vertices and 50 edges. There are three focal structures with five, six and eight vertices respectively. They are embedded from 1 to 9 times to each random network to form small-world graphs (Fig. 10). Finally, the proposed FSA algorithm is applied to identify those focal structures with different vertices blindly.

Figure  10 shows only the experiment with the 0.8 density value. We also apply the same process for 0.5, 0.6, 0.7, 0.9 and 1.0 density values.

5.1.4 Evaluation metrics

To evaluate performance of our algorithm, we use precision and recall, which is widely used in machine leaning for evaluating the accuracy of pattern recognition algorithms. Since we know which structures are embedded in a small-world graph, we can use these metrics to calculate the accuracy of the focal structures by using the identified focal structures. The definitions of precision and recall are

$$\begin{aligned} \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \end{aligned}$$
(6)
$$\begin{aligned} \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \end{aligned}$$
(7)

where TP (true positive) is the number of correctly detected focal structures, FP (false positive) is number of focal structures identified incorrectly, and FN (false negative) is the embedded focal structures which couldn’t be identified at all. According to these evaluation metrics, a higher value of precision indicates that a focal structure identified is more likely to present an embedded focal structure, and a higher value of recall shows that more focal structures that are embedded are identified. Also, F-measure (Eq.  8), the harmonic mean of precision and recall, is used to evaluate the overall performance. F-measure is defined as,

$$\begin{aligned} {\text{ F-measure }} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \end{aligned}$$
(8)

The experiments and performance results for both homogenous and heterogenous focal structures are presented next.

5.1.5 Results

We perform the experiments by embedding both homogenous and heterogeneous structures in synthetic networks and apply the FSA approach to identify them. For homogenous structures, we obtain performance results for various densities of focal structures. As shown in Figs. 11, 12, 13, 14, 15 and 16, the performance of the accuracy of identification increases as more homogenous focal structures are embedded. For heterogeneous focal structures, the purpose is to see how many focal structures the algorithm can identify if topologically different artificial focal structures are embedded in a small-world graph. A similar performance behavior is observed as shown in Fig. 17. The performance again increases as more heterogeneous focal structures are embedded.

5.1.6 Analysis and discussion

For homogenous structures, we obtain performance results for various densities of focal structures. It is clear that the algorithm tends to identify focal structures with high densities. That characteristic leads to extracting cliques or near-clique structures. The main reason of such a cliquishness is that it identifies structures with high modularity values. For example, for the focal structures of five, six, seven and eight vertices, it is not possible to identify focal structures which are lower than 0.8 density value. For the focal structures of nine vertices, it cannot even detect focal structures between the densities 0.8 and 0.9. It misses the structures that are not cliquish, or in other words have low density. For the small-world graphs embedded with six vertices focal structures, the algorithm can find focal structures with 0.8, 0.9 and 1.0 densities and 40, 43 and 47 % performance values respectively. Out of 20 embedded focal structures, it catches cliques made of seven and nine vertices with a performance of 40 %. Interestingly, the focal structure of four vertices can be identified in every density level beginning from 0.5 (three edges) to 1.0 (six edges) density values. Adding a stitching method gives a significant improvement in the performance of the algorithm after the standard approach. We are now able to identify structures with 0.5, 0.6, and 0.7 densities besides detecting structures with 0.8, 0.9 and 1.0 densities. The accuracy of the approach increases in this process. As more focal structures are embedded, we obtain higher performance values. This stitching mechanism is the second part of our algorithm and is essential to extracting those structures with lower densities.

Fig. 11
figure 11

F-Measures of homogenous four-vertices artificial focal structures with 0.5, 0.66, 0.8, and 1.0 densities. They are embedded to 100 small-world graphs using “without stitching” and “with stitching” methods. The stitching method of focal structures is based on “Jaccard’s Coefficient”

Fig. 12
figure 12

F-Measures of homogenous five-vertices artificial focal structures with 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0 densities. They are embedded to 100 small-world graphs using “without stitching” and “with stitching” methods. The stitching method of focal structures is based on “Jaccard’s Coefficient”

Fig. 13
figure 13

F-Measures of homogenous six-vertices artificial focal structures with 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0 densities. They are embedded to 100 small-world graphs using “without stitching” and “with stitching” methods. The stitching method of focal structures is based on “Jaccard’s Coefficient”

Fig. 14
figure 14

F-Measures of homogenous seven-vertices artificial focal structures with 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0 densities. They are embedded to 100 small-world graphs using “without stitching” and “with stitching” methods. The stitching method of focal structures is based on “Jaccard’s Coefficient”

Fig. 15
figure 15

F-Measures of homogenous eight-vertices artificial focal structures with 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0 densities. They are embedded to 100 small-world graphs using “without stitching” and “with stitching” methods. The stitching method of focal structures is based on “Jaccard’s Coefficient”

Fig. 16
figure 16

F-Measures of homogenous nine-vertices artificial focal structures with 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0 densities. They are embedded to 100 small-world graphs using “without stitching” and “with stitching” methods. The stitching method of focal structures is based on “Jaccard’s Coefficient”

A similar performance behavior is observed with heterogenous focal structures. The performance again increases as more heterogenous focal structures are embedded. Here, our purpose is to see how many focal structures the algorithm can identify if topologically different artificial focal structures are embedded in a small-world graph. From Fig. 17, it can be clearly seen that the algorithm is able to identify heterogenous focal structures with a higher performance. Out of 27 (\(3 \times 9\)) focal structures, the FSA algorithm can identify 50 % of the embedded focal structures. However, the algorithm is unable to identify the focal structures with the 0.5, 0.6 and 0.7 density values. Using the stitching method, we can detect the focal structures with the lower density.

Fig. 17
figure 17

F-Measures of heterogenous focal structures embedded to 100 small-world graphs using “without stitching” and “with stitching” methods. There are five, six and eight vertices artificial focal structures with 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0 densities. The stitching method of focal structures is based on “Jaccard’s Coefficient”

5.2 Focal structures in real-world dataset

To find out answers of the main research question regarding what the contextual differences are between focal structures and influential individuals, and focal structures and communities, we formulate the following two hypotheses:

\({H_{1}}\) Focal structures are more interactive than average individuals in the evolution of a mass protest.

\({H_{2}}\) Focal structures are more interactive than communities in the evolution of a mass protest.

Saudi Arabian women’s Oct26Driving campaign on Twitter, called “The 26th October campaign”, is one of the important mass protests (http://www.oct26driving.com). To test \(H_{1}\) and \(H_{2}\), we use a Twitter dataset based on the Saudi Arabian women’s collective action of the “Oct26Driving” Campaign.

5.2.1 Oct26Driving campaign dataset

The Oct26Driving dataset consists of around 70,000 tweets originated from 116 different countries. These tweets were collected based on the usage of dominant hashtags dedicated to the Oct26Driving campaign, viz., ‘#oct26driving’. Other types of available information, such as language, name, retweeted user name (RT), and other included hashtags, are also included. Since the tweets are updated with frequencies varying between hundreds to thousands of tweets per day, a crawler (viz., ScraperWiki, https://scraperwiki.com/) was configured with the above mentioned dominant hashtag running constantly to systematically collect, parse, and index the data.

Since the experiments are performed using the Twitter dataset, the following sub-section gives more detailed information about the Twitter platform.

5.2.2 Twitter platform

Twitter (see http://www.twitter.com) is a microblogging service that enables users to send short messages. It provides an environment where people can follow other users to receive updates (microblogs).

Tweet A tweet is a short message of 140 characters or less from a user. People tweet personal messages, random thoughts, post links, or anything else that fits in the character requirements.

Hashtag The hashtag, denoted by a word with preceding ‘#’ symbol (e.g., #HaitiEarthquake), is a platform convention for user-defined topics to categorize messages. It also provides an environment for group conversations on a specific topic.

Retweet A Retweet is a re-posting of someone else’s Tweet, similar to e-mail forwarding. To give credit to the original person, users usually put “RT” plus the originator’s username at the beginning of the tweet. Here’s an example: The Twitter user @oct26driving tweets: “Come on people, support me for this campaign!”, while you retweet by posting “RT @oct26driving Come on people, support me for this campaign!”.

Mention Without using the ‘Reply’ platform feature, “mention” acknowledges a user with the symbolic ‘@’ sign. For example, “@oct26driving I think it’s time to say (yes ) to give women right to drive \(\ldots\)”.

Reply This feature is used to communicate with the tweet author by clicking on Twitter’s ‘Reply’ button in response to a tweet. For example, user oct26driving tweets “I’m very proud. I believe that we accomplish the purpose of our campaign.”, while user @zaki_safar uses the built-in Reply button to indicate “@oct26driving I am very hopeful for this campaign”. The Reply syntax automatically inserts the originator’s user name.

5.2.3 Evaluation measures

From the sociological analysis point of view, Weber defines an “interaction” as a simple movement in the environment and meaningfully oriented to that of others (Weber 1978). For Weber, communal relations are based upon either affect or tradition; associative relations rest on the “rationally motivated adjustment of interests” and “agreement by mutual consent” (Turner 1988). Parsons views an interaction as a unit of acts which is formed to “systems of interaction” (Parsons 1937). According to him, an action becomes “interaction” when variously oriented actors must reconcile their respective orientations; and when such reconciliations become more “institutionalized”, then “social systems” can be said to exist.

Social interactions are studied in terms of ‘Human Computer Interaction’ (HCI) as well. According to Barlow et al. (1989), the HCI model consists of four main components such as human, computer, task environment, and machine environment. In the communication among people and computers, one must understand something about both and about the tasks which people perform with computers. A general model of human-computer interface emphasizes the flow of information and control at the human computer interface (Rada and Michailidis 1995).

Social media plays an important role for HCI because it allows people to share messages instantly with each other. Especially companies and organizations use it for analyzing their customers’ behaviors for certain time periods. For example, marketers use Twitter metrics to analyze the success of their brands with respect to the interaction of their customers. One of those metrics is the interaction rate which provides an understanding of how much customers interacted with them or with their brands (Engagement rate 2014).

In social media, the interaction rate tells how much a user’s fans communicate with him/her. In other words, the interaction rate shows how well a user’s fans communicate with him/her for a certain time-period. For Twitter, the interactions are retweets, replies, and mentions. The interaction rate (Eq.  9) is formulated as the total number of interactions per follower, where the total number of interactions denotes the total number of retweets, replies and mentions sent by the followers of an individual (Engagement rate 2014).

$$\begin{aligned} \mathrm{Interaction}\ \mathrm{rate} = \frac{(\mathrm{retweets} + \mathrm{replies} + \mathrm{mentions})}{(\mathrm{followers})} \times 100 \end{aligned}$$
(9)

5.2.4 Focal structures vs. average individuals

This experiment has been conducted to verify the following first hypothesis.

\({H_{1}}\) Focal structures are more interactive than average individuals in the evolution of a mass protest.

Results

To identify the focal structures, we apply the Clustering Coefficient-based FSA method to the Oct26Driving campaign Twitter dataset. The time period of this dataset ranges from 9th October 2013 through 30th October 2013. Table  1 shows the network size and the total number of identified focal structures for each day. For example, out of the network with 1832 individuals, the algorithm identifies 51 focal structures with a total 149 individuals on 14th October. It is clear that since the network grows as new members join the Twitter campaign, the total number of individuals in focal structures grow as well.

Table 1 Focal structures identified by FSA approach

For the Oct26Driving Twitter campaign beginning from 9th October 2013 to 30th October 2013, we apply the following steps:

  1. 1.

    Using the Eq. 9, the interaction rate of each individual is calculated.

  2. 2.

    The interaction rate of each focal structure is calculated by obtaining the total interaction rates of each individual in the focal structure.

  3. 3.

    All the focal structures are combined in one set and the total interaction rate of all those focal structures is obtained.

  4. 4.

    Using random selection, the average individuals are selected according to the total number of individuals of all those focal structures and the total interaction rates of the selected influential individuals are obtained.

  5. 5.

    This random selection process is repeated with multiple trials (100 times) and the overall total interaction rate is obtained.

  6. 6.

    To get the final interaction rate of the influential individuals, the average interaction rate out of the number of those trials is calculated (overall total interaction rate is divided by 100).

  7. 7.

    For both the focal structures and influential individuals, the interaction rates are normalized by the size of their sets. The size is equal to the number of individuals in all focal structures for that day.

  8. 8.

    The processes above are repeated for each day.

Applying the steps above, we obtain the interaction rates of both focal structures and individuals on average from October 9th to October 30th (see Fig.  18). The interaction rates between focal structures and individuals on average on a specific day (October 25th) are demonstrated in Fig.  19. The following sub-section analyzes and discusses those results.

Fig. 18
figure 18

The interaction rates from October 9th to October 30th, 2013

Fig. 19
figure 19

The interaction rates of October 25th, 2013

Analysis and discussion Using the Eq.  9, the interaction rates are calculated for each day. A fair comparison is performed between focal structures and individuals, because from the network, we randomly select individuals on average according to the total number of the individuals in the focal structures. For example, if there are 440 individuals of all identified focal structures on October 21st, then we randomly select 440 individuals multiple times (100 times) from the network of that day.

Focal structures clearly outperform the average individuals between October 9th and October 30th time period (Fig.  18) in 2013. The time-series of the average individuals include the error bars showing the confidence intervals with \(\pm\) three sigma (standard deviation). This clearly shows that the interaction rate of focal structures are significantly higher than the average interaction rate of random sets of individuals.

Focal structures increase almost proportionally from October 18th until October 23rd. This shows that the number of retweets, mentions, and replies increase proportionally with respect to the followers of the individuals in the focal structures. After October 23rd, the interaction rate of the focal structures decreases exponentially which clearly shows that the individuals in the focal structures are less interacting with the people they follow than before. However, despite the decrease of the interactions, focal structures still outperform the individuals on average until the end day of the campaign as demonstrated in Fig.  19. On the other hand, for the individuals on average, the interaction rate starts increasing remarkably after October 15th, whereas it decreases in some certain time-periods. Individuals on average outperform the focal structures only on October 30th. This shows that the followers of the individuals in focal structures are less interacting than the followers of the average individuals at that day. Those results clearly show that the focal structures are more interacting than average individuals during the time-period of the campaign.

Fig. 20
figure 20

The interaction rates in a logarithmic scale from October 9th to October 30th, 2013

5.2.5 Focal structures vs. communities

Another experiment has been conducted to verify the following second hypothesis.

\({H_{2}}\) Focal structures are more interactive than communities in the evolution of a mass protest.

Results

We reuse the Oct26Driving Twitter campaign dataset beginning from 9th October 2013 to 30th October 2013. The focal structures identified by FSA approach (Table  1 are utilized. The Louvain community identification method is used to extract communities out of the Oct26Driving Twitter campaign dataset.

For this experiment, we apply the following steps:

  1. 1.

    Using the Eq. 5.4, the interaction rate of each individual is calculated.

  2. 2.

    The total interaction rate of each focal structure is calculated by obtaining the total interaction rates of each individual in the focal structure.

  3. 3.

    To obtain the interaction rate of focal structures, the total interaction rate of each focal structure is normalized by its size. The size is equal to the number of individuals in a focal structure.

  4. 4.

    The average interaction score of focal structures is calculated.

  5. 5.

    The standard error interaction score of focal structures is calculated.

  6. 6.

    The total interaction rate of each community is calculated by obtaining the total interaction rates of each individual in the community.

  7. 7.

    To obtain the interaction rate of communities, the total interaction rate of each community is normalized by its size. The size is equal to the number of individuals in a community.

  8. 8.

    The average interaction score of communities is calculated.

  9. 9.

    The standard error interaction score of communities is calculated.

  10. 10.

    The processes above are repeated for each day.

Analysis and discussion

Focal structures clearly outperform the communities between October 9th and October 30th time period in 2013. The Fig. 20 is in a logarithmic scale and the time-series of the average individuals include the error bars showing the confidence intervals with \(\pm\) standard errors. The interaction rate of focal structures increases between October 18th and 23rd. After October 23rd, focal structures are less interacting with the people they follow than before. On the other hand, for communities, the interaction rate starts increasing remarkably after October 15th and 28th. This shows that the number of retweets, mentions, and replies increase proportionally with respect to the followers of the individuals in communities. However, despite increasing rates beginning from the initial days until the end day of the campaign, the interaction rates of communities can never reach the interaction rates of focal structures. These results clearly shows that the interaction rate of focal structures are significantly higher than the interaction rate of communities. In other words, it verifies the second hypothesis (\(H_{2}\)) that focal structures are more interactive than communities in the evolution of a mass protest.

5.3 Focal structures in online system

We design and implement an online systemFootnote 3 to demonstrate the usage of the FSA approach. The current version of the website aims to give an understanding to the reader what a focal structure is about. More information about this site can be obtained from here (Sen et al. 2015).

Since browsers have limited memory, visualization becomes more difficult for the web tool as the network size grows. For more than 2000 nodes, we get a hair-ball effect which does not mean anything for the user. Analyzing is another limitation for bigger networks. Applying FSA for such big networks is currently challenging on the online system. We have analyzed networks as big as 13789 nodes and 18871 edges, and experienced delays up to 15–20 s (please go to3 and select the ICSR network).

We design another online tool, http://www.merjek.com (guest account—username: merjek, password: merjek123). Unlike http://www.focalstructures.net, all other graph analytics methods such as PageRank and HITS centrality algorithms, graph similarity metrics, and community identification methods will be up and running in Merjek. We plan to make itFootnote 4 commercial by supporting big data networks as well.

Fig. 21
figure 21

Ukrainian crisis blog network

Fig. 22
figure 22

Identifying a focal structure in a ukrainian crisis blog network

5.3.1 Dataset collection

Figure  21 shows a blog network of the 2014 Ukrainian crisis. TweetTrackerFootnote 5, an online tool developed by Arizona State University, is utilized to find the sites that have been tweeted and retweeted the most indicating crowd polarization and propaganda campaigns. Those sites are then processed and classified as blogs or non-blogs. Blogs are then further subdivided into categories of Ukraine focused blogs, news blogs, and non-Ukraine focused blogs. NewProSoft’s Web Content Extractor, or WCE, is used to crawl through each of the identified sites. 18,000 records are stored in the database from about 25 different blog sources. This data is then queried to find relationships that is then used to construct a network of the blogs. Thus, the blog network was built based on the references to other site within a blog post’s content HTML.

5.3.2 The 2014 Ukraine crisis

The 2014 Ukraine crisis began in November 2013, when President Viktor Yanukovych rejected a deal for greater integration with the European Union. Since then, three big things have happened: Yanukovych was run out of the country in February, Russia invaded and annexed Crimea in March, and pro-Russia separatist rebels in eastern Ukraine have brought the relationship between Russia and the West to its lowest point since the Cold War.Footnote 6 A 35-year-old British journalist and a blogger, Graham W. Phillips, covered the 2014 Ukraine crisis and became a growing star on Kremlin-owned media. He set out to investigate in the way that has made him a cult micro-celebrity in east Ukraine’s crisis: by interviewing angry people on the street for 90 seconds at a time.Footnote 7

5.3.3 Applying FSA

After applying the FSA method, interestingly, even though there are other central and well-known news resources such as “rt.com”, “WashingtonPost”, and “TheGuardian”, we identify a focal structure which includes Graham W. Phillips who was actively involved in the crisis as a blogger. Figure  22 shows that focal structure with three blue-colored individuals.

Since browsers have limited memory, visualization becomes more difficult for the web tool as the network size grows. For more than 2000 nodes, we get a ‘hair-ball’ effect which does not mean anything for the user. Analyzing is another limitation for bigger networks. Applying FSA for such big networks is currently challenging on the online system.

6 Conclusion and discussion

The findings of \(H_{1}\) agree that there might be key sets of individuals which are more interactive than average individuals in the evolution of a mass protest. However, it does not imply those key sets are more influential than sets of most influential individuals. The studies based on those centrality measures mostly take into account a single individual in a network. However, a single individual on its own may not represent the influence of a community or network by itself alone, because there is no other node to form a set of interaction patterns with each other. Unlike the authority-based approaches dealing with the extraction of most influential single individuals, the FSA approach identifies most influential sets of individuals. Recall that findings of \(H_{1}\) do not imply that those sets of individuals are more interactive than sets of most influential individuals identified by a centrality measure. However, these findings show that sets of interactions are formed among individuals that a single influential can never include such interactions on its own. The findings support the definition of the focal structures, i.e., the identified individuals might not be as influential as the most influential individuals, but by acting together they form a compelling power and are more interactive than average individuals in the evolution of a mass protest.

The question “what makes focal structures different than communities in a social network such as a mass protest?” is answered successfully through the experiments mentioned above. Results of \(H_{2}\) reveal that using the FSA approach, we identify focal structures which are more interactive than communities in the evolution of a mass protest. It is very important to note that unlike the traditional community/cluster detection approaches such as the Louvain (Blondel et al. 2008) or CNM (Girvan and Newman 2002) methods which partition a network into clusters/communities, the FSA approach is the identification of smaller and more pertinent groups in a network.

7 Future research directions

From the collective action point of view, focal structures can be applied not only to study social movements, but also to examine flash mobs in cybersecurity. A flash mob is a dance beginning with a few dancers that appear to spread among the crowd, creating a spontaneous public spectacle of unified musical expression, and then suddenly dispersing as if nothing happened. For cyber-attacks, a group of people get together mostly via social media and coordinate an attack. It is very important to identify influential groups of attackers before, during or after the attack. The focal structures approach can help with identifying such groups which are significant parts of flash mobs. It can be concluded that the FSA approach in cybersecurity would open up a new way to detect those most influential sets of cyber-attackers instead of sets of most influential single cyber-attackers and such a study would address cyber-terror more effectively.

As far as the methodological contributions, this research can help in analyzing and understanding real-world phenomena and advance foundational sociological concepts. Moreover, it can also be applicable to many other areas, such as recommendation systems, semantic web, marketing, advertising, information diffusion studies, and search engine indexing, among others. Also, the research areas pertaining to network structural concepts; such as social capital, strength of weak ties, trust, diversity in network, bandwidth, and similarity of nodes/structure/homophily, etc., can benefit from the research. For example, social capital (Coleman 1989; Portes 2000) has been studied through the structure of relations among actors. It comes about through changes in the relations among persons that facilitate action. A similar point that attracts our attention is that like social capital, the FSA approach also facilitates actions by influencing a mass protest, but not through single actors as social capital does, rather through sets of actors. Therefore, this research can give a new perspective to the researchers of social capital by studying the asset of influential sets of individuals in a social network.

Another research question based on a different centrality approach, such as “how engaging the sets of individuals are in a social network?” can be included as future work as well. Such a study would allow researchers to observe a different characteristic behavior of sets of individuals and enable them to see from a different angle in social network analysis (SNA), i.e. they can study the most engaging sets of individuals instead of sets of most engaging individuals in a social network. Social media marketers, for instance, can also benefit from such a study by determining the most engaging sets of social media users and promote their products according to those key sets instead of single individuals.

The study of the FSA methodology is network centric rather than content centric. In other words, it deals with the topology of the network and does not include any study regarding identification of structures using content information such as topic, tags etc. As future work, the FSA approach can be improved by focusing on the content as well. Furthermore, the hybrid approach, the combination of both the network structure and content, can also be included in a further study for identifying focal structures. By accomplishing such a study in future, since not only the topology but also the content (or both) is taken into account, more accurate and relevant structures can be obtained in a social network.

As another future work, we plan to improve the web tool by adding new features, such as statistics calculations, exporting the output, multi-network and file type support. We also plan to make the source codes publicly available so that researchers can obtain the whole package of the online system and utilize it on their local environment.

To the best of our knowledge, this work is the first effort in identifying in influential sets of individuals instead of a set of influential individuals. We believe that this work would open up new directions for researchers to develop new methods in social network analysis.