1 Introduction

Team formation [9] in a social network aims to find a set of experts who not only cover a set of required labels but also have low communication cost with one another (i.e., are well-connected in the underlying network). Team formation applies to many real applications, such as searching for a group of employees to execute a project in a company, or composing an activity group for a cocktail party with particular themes. However, team formation techniques [1, 7, 12] are not applicable to organizing Influential Events in event-based social services (e.g., Meetup, Plancast, and Facebook Events). Here, influential event organization means finding a set of persons who are interested in the themes of an event, have good social interactions (i.e., low communication cost), and can attract a large number of people to participate in the event. Organizing influential events is common in practice: real-world scenarios include organizing technical conferences, raising funds for earthquake victims, and initiating anti-nuclear campaigns. In such scenarios, organizers attempt to maximize the number of participants, since more participants mean a more successful event. One may think that Social Influence Maximization [8], which aims at finding a set of seed users such that the number of influenced users is maximized, is a solution. However, influence maximization techniques [3, 4, 19] are not applicable to influential event organization because they consider neither the set of required labels nor the communication among the selected seed nodes.

This paper proposes a novel problem, Influential Team Formation (ITF), in a social network. Given a set of required labels and the size k of the team, the ITF problem is to find a set S of nodes such that (a) the query label set is covered by the discovered k-node set S, and (b) the influence-cost ratio of nodes in S is maximized. We propose the Influence-Cost Ratio (ICR) to quantify the influence spread of the selected k nodes per communication cost. ICR of a node set S is defined as \(ICR(S) = \frac {\sigma (S)}{c(S)}\), where influence spread σ(S) is the expected number of nodes activated by S while the communication cost c(S) is the sum of all-pair shortest path lengths between nodes in S. A team can derive a higher ICR if the team members can lead to higher influence spread and are well-connected. The ITF problem is challenging since maximizing influence spread contradicts minimizing communication cost. Influence maximization tends to select well-separated nodes because their activated nodes can have less overlapping. But team formation prefers well-connected nodes since they can produce lower communication cost.

It is worthwhile to note that a team is a task-oriented group whose members not only possess some skill labels to deal with the task, but also collaborate well with each other. Therefore, the team formation problem takes a set of required skill labels as the input, and expects that the discovered team members are equipped with some of the required skill labels and have good communication among them. Since we aim at forming influential “teams”, the selected team members (i.e., seeds) need to cover a required set of skill labels and be well-connected to have good communication. In addition, “influential” teams also require the team members to be influential, i.e., team members should lead to high influence spread in the social network. Consequently, the proposed ITF problem is a combination of team formation and influence maximization.

We create an example, shown in Figure 1, to exhibit the differences among team formation (TF), influence maximization (IM), and the proposed ITF. This example assumes the set of required labels is {a, b, c, d, e} and k = 3. TF may select the set \(S_{TF} = \{v_{1}, v_{2}, v_{3}\}\) since these nodes cover more required labels and are well-connected, giving \(ICR_{TF} = \frac {7}{3}\). IM will select the set \(S_{IM} = \{v_{1}, v_{4}, v_{6}\}\) because these nodes lead to the highest influence spread, giving \(ICR_{IM} = \frac {11}{5}\). ITF will find the set \(S_{ITF} = \{v_{1}, v_{5}, v_{6}\}\), which leads to the highest \(ICR_{ITF} = \frac {10}{3}\). The reason is twofold: the union of the activated sets of \(v_{1}\), \(v_{5}\), and \(v_{6}\) is a large activated set of 10 nodes (i.e., \(\{v_{1}, v_{2}, v_{3}, v_{4}, v_{5}, v_{6}, v_{7}, v_{9}, v_{14}, v_{15}\}\)), and \(v_{1}\), \(v_{5}\), and \(v_{6}\) are inter-connected in a triangle structure in the network, so each of the three pairs is at distance 1 and the communication cost is 3.

Figure 1 A toy example of a social network (left), and a table (right) that describes the set of required labels possessed by each node and the set of nodes activated by each node. Note that only a subset of nodes is shown in the table. Nodes other than \(v_{1}\) to \(v_{6}\) do not contain any required label

To this end, we formulate the ITF problem under the Independent Cascade (IC) model, and prove its NP-hardness. We propose a greedy algorithm with a quality guarantee. Since the plain greedy solution is effective but very inefficient, we further develop two greedy methods, ICR Greedy (ICR-Greedy) and Mixed Influence-Cost Greedy (M-Greedy), and one heuristic method, Similar Influence Search (SimIS). ICR-Greedy iteratively selects the node with the highest marginal gain in ICR score. M-Greedy combines the NewGreedy IM method [3] with the original TF algorithm [9] in an interweaving manner. SimIS integrates Group-PageRank [15] with a best-first search in the social network. To validate the proposed methods, we conduct simulation-based and prediction-based experiments. The simulation-based experiments are conducted on two real social network datasets, Facebook and Google+, and the results show that both M-Greedy and SimIS generate the highest ICR scores with satisfying time efficiency. The prediction-based experiments are conducted using the real event participation data of the event-based social service Meetup. The goal is to validate whether ITF with the proposed solutions can truly identify the organizers of influential events based on the required labels of the given event and the social network. The results exhibit satisfying accuracy.

The contributions of this work consist of four parts, as described in the following.

  • We formulate the novel Influential Team Formation (ITF) problem to find a set of users as team organizers for initiating influential events on a social network. By integrating Team Formation with Influence Maximization, we devise the Influence-Cost Ratio (ICR) to estimate the influence spread per communication cost among team members, and aim at maximizing ICR while covering the required labels in the ITF problem.

  • We analyze the hardness of the ITF problem. And we develop three algorithms, ICR-Greedy, M-Greedy, and SimIS, to solve the ITF problem in either the greedy or heuristic manner.

  • We conduct a comprehensive evaluation of our proposals against baselines using real Google+ and Facebook datasets. Experimental results validate our idea and exhibit the promising effectiveness (in terms of ICR scores) and time efficiency of the proposed methods, especially SimIS.

  • In addition to the simulation-based evaluation, we further use real event participation data in Meetup to validate whether the proposed methods can accurately predict the organizers of influential events based on the required labels. Experimental results show satisfying accuracy, especially for SimIS.

2 Related work

The relevant studies consist of three parts: team formation, influence maximization, and social event organization. Team formation was first proposed to find a set of experts such that not only is a set of required skills covered, but also the communication cost among team members is minimized [9]. There is a series of follow-up extensions considering various real scenarios: jointly finding a team leader and forming the team [7], simultaneously tackling multiple sets of required skills [1], specifying the number of experts for each skill [12], allowing geographical and team-size constraints [20], imposing swarm-based optimization [2], and recommending other individuals to replace some of the existing team members [14]. Community Search [22] alternatively finds a densely-connected subgraph based on a set of given nodes, instead of required skills.

The influence maximization problem is to find a set of seed nodes that maximizes the influence spread (i.e., the expected number of activated nodes) in a social network under either the Independent Cascade or the Linear Threshold model [8]. There are two mainstream solutions that attempt to strike a balance between efficiency and effectiveness. The first is to develop seed-selection heuristics, such as Degree Discount [3], Maximum Influence Arborescence [4], Group-PageRank [15], and IM-Rank [6]. The second is to speed up the Monte-Carlo simulation-based greedy algorithm [8], such as Cost-Effective Lazy Forward [10], NewGreedy [3], StaticGreedy [5], Pruned Monte-Carlo [19], and supervised Monte Carlo estimation [16]. In contrast to influence maximization, Zhang et al. [27] alternatively aim at finding a critical block of nodes to control misinformation diffusion in a social network.

Social event organization aims at composing a group of persons that satisfies various kinds of event requirements. The Socio-spatial Group Query [25] finds a group of persons who are not only geographically close to each other but also acquainted with each other, and its extended work [21] maximizes the likelihood of friend making among group members. A follow-up work [11] recommends, for an event host, a group of users who satisfy the required labels and are acquainted with each other. SEO [13] further composes multiple event groups simultaneously. Marketing effect maximization [26] aims to find a set of nodes that are geographically close to the event location and attract more of those users who satisfy the event themes. The bottleneck-aware social event arrangement (BSEA) [23] further considers social influence to recommend events to users.

3 Problem formulation

We first describe some preliminary notations. First, a social network is represented by a graph G = (V, E), where V is the node set and E is the edge set. Each node u is associated with a set of labels \(L_{u}\). Second, let L(S) be the set of required labels covered by a node set S, i.e., \(L(S) = L \cap ({\bigcup }_{u \in S} L_{u})\), where L is the set of required labels. We define the label coverage π of a node set S as \(\pi (S) = \frac {|L(S)|}{|L|}\). Third, we adopt the Independent Cascade (IC) model to propagate influence. In the IC model, in time step t each active node u has a single chance to activate each of its inactive neighbors v with a pre-determined probability \(p_{u,v}\). If u succeeds, v becomes active in step t + 1. Whether or not u succeeds, it makes no further attempt to activate v. The process ends when no more nodes can be activated. In this paper, \(p_{u,v}\) is uniformly selected from the set {0.1, 0.2, ... , 0.9} based on the TRIVALENCY model [4]. Fourth, the influence spread of a node set S, denoted by σ(S), is the expected number of activated nodes given S as the seed set. Fifth, we define the communication cost of a node set S, denoted by c(S), as the sum of all-pair shortest path lengths [7], i.e., \(c(S) = {\sum }_{u,v \in S} len(u,v)\), where len(u, v) is the length of the shortest path between u and v. With these notations, we define the influence-cost ratio of a node set S as \(ICR(S) = \frac {\sigma (S)}{c(S)}\).
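The Independent Cascade process and the Monte-Carlo estimate of σ(S) described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `simulate_ic` and `estimate_spread`, the dictionary-based graph representation, and the default number of runs are assumptions made for the example.

```python
import random
from collections import deque


def simulate_ic(graph, prob, seeds, rng):
    """Run one Independent Cascade diffusion and return the activated set.

    graph: dict mapping each node to an iterable of its neighbours
    prob:  dict mapping a directed pair (u, v) to p_{u,v}
    (Both representations are assumptions of this sketch.)
    """
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v in graph.get(u, ()):
            # u is dequeued exactly once, so it tries each inactive
            # neighbour exactly once, as the IC model requires.
            if v not in active and rng.random() < prob[(u, v)]:
                active.add(v)
                frontier.append(v)
    return active


def estimate_spread(graph, prob, seeds, runs=1000, seed=0):
    """Monte-Carlo estimate of sigma(S): mean activated-set size over `runs`."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, prob, seeds, rng)) for _ in range(runs))
    return total / runs
```

Processing nodes from a queue rather than in explicit time steps yields the same distribution over final activated sets, because every edge is still tried at most once; the seeds themselves are counted as activated.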

Influential team formation (ITF)

Given a set of required labels L, a social network G = (V, E), and the size k of the desired team, the ITF problem is to find a k-node set S ⊆ V such that the label coverage π(S) = 1 and the influence-cost ratio ICR(S) is maximized.
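To make the two ingredients of the problem statement concrete, the sketch below computes the label coverage π(S), the communication cost c(S), and the resulting ICR for a candidate team. It assumes an unweighted graph (hop-count distances) and sums over unordered pairs, which is consistent with the triangle team in Figure 1 having cost 3; all function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def label_coverage(team, node_labels, required):
    """pi(S): fraction of the required labels covered by the team."""
    covered = set()
    for u in team:
        covered |= node_labels.get(u, set()) & required
    return len(covered) / len(required)


def shortest_path_length(graph, source, target):
    """Hop distance by breadth-first search; inf if target is unreachable."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == target:
                    return dist[v]
                queue.append(v)
    return float("inf")


def communication_cost(graph, team):
    """c(S): sum of shortest-path lengths over unordered pairs (assumption)."""
    return sum(shortest_path_length(graph, u, v)
               for u, v in combinations(team, 2))


def icr(spread, cost):
    """ICR(S) = sigma(S) / c(S); undefined for a zero cost (single-node team)."""
    if cost == 0:
        raise ValueError("communication cost must be positive")
    return spread / cost
```

A team is feasible for ITF when `label_coverage(...) == 1.0`; the spread passed to `icr` would come from an estimator of σ(S), which is outside this sketch.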

Theorem 1

For the IC influence propagation model, the influential team formation problem is NP-hard.

Proof Sketch

The ITF problem can be divided into two parts: team formation (TF) and influence maximization (IM). The TF part aims at finding a k-node set S that covers L and minimizes the communication cost c(S). This part has been proved to be NP-hard [7] by a reduction from 3-Satisfiability (3-SAT). The IM part is to find a k-node set that maximizes the influence spread σ(S) under the IC model, which has also been proved to be NP-hard [8] by a reduction from the Set Cover problem. Since maximizing ICR(S) is composed of maximizing σ(S) and minimizing c(S) simultaneously, it is natural that the ITF problem is also NP-hard. □

4 Proposed methods

In this section, we propose three algorithms to solve the ITF problem. First, we describe an ICR-Greedy algorithm, which extends the “Lazy-Forward” influence maximization greedy method to deal with the set of required labels and ICR. Second, we develop a Mixed Influence-Cost Greedy algorithm, which integrates the team formation procedure with maximizing influence spread. Third, we present a heuristic method, Similar Influence Search, whose idea is that influential near-by nodes have similar distributions of influence spread over all the nodes in the network; it finds team members by selecting nodes with similar influence distributions.


4.1 Lazy-forward ICR-greedy method

Since both σ(⋅) and c(⋅) are monotone and σ(∅) = 0, by letting c(∅) = 1 we have ICR(∅) = 0. Based on set function theory [17], a simple greedy algorithm can generate a result within a factor of 1 − 1/e (≈ 63%) of the optimal solution. Since the ITF problem is NP-hard, we estimate ICR(⋅) by running Monte-Carlo simulation a sufficient number of times R (e.g., R = 20,000), as originally adopted for influence maximization [8]. However, such a Monte-Carlo greedy method has time complexity O(knRm), where n = |V| and m = |E|. Therefore, we resort to the Lazy-Forward strategy [10], which exploits submodularity to speed up the computation by reducing the number of times σ(⋅) is estimated. The detailed algorithm, Lazy-Forward ICR-Greedy, is provided in Algorithm 1. \(V_{L}\) is used to retrieve the set of nodes possessing at least one required label (line 2). We use \(\delta _{v}\) to record the marginal ICR (line 11), and store the \(\delta _{v}\) values in a priority queue to efficiently find the node with the highest marginal ICR. Note that the proposed methods, i.e., ICR-Greedy, M-Greedy, and SimIS, do not directly ensure the label coverage π(S) = 1. Nevertheless, we assume that the number of required labels is usually not large (e.g., 5). Therefore, with a proper team size k, it is easy to make π(S) = 1.
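The lazy-forward idea with a priority queue can be sketched generically as below. This is not a reproduction of Algorithm 1: it is a standard lazy greedy loop over an arbitrary set function `score`, which in the setting above would be a Monte-Carlo estimate of ICR restricted to the candidates in \(V_{L}\). The function name, the stamp-based staleness check, and the requirement that node identifiers be mutually comparable (for heap tie-breaking) are assumptions of this sketch.

```python
import heapq


def lazy_greedy(candidates, k, score):
    """Select up to k nodes by lazily re-evaluated marginal gains.

    score: callable taking a frozenset of nodes and returning a float.
    Skipping re-evaluation is only safe when marginal gains never grow
    as the selected set grows (diminishing returns).
    """
    selected = []
    current = score(frozenset())
    # Heap entries: (-marginal gain, node, |selected| when gain was computed).
    heap = [(-(score(frozenset([v])) - current), v, 0) for v in candidates]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, v, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # The gain is fresh with respect to the current team: accept v.
            selected.append(v)
            current = score(frozenset(selected))
        else:
            # Stale gain: recompute against the current team and re-insert.
            gain = score(frozenset(selected) | {v}) - current
            heapq.heappush(heap, (-gain, v, len(selected)))
    return selected
```

The saving comes from the fact that a node whose stale gain is already below the best fresh gain never has its score re-estimated in that round, which matters when each call to `score` costs thousands of simulations.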


4.2 Mixed influence-cost greedy method

While the previous ICR-Greedy directly maximizes ICR(⋅), we devise another greedy algorithm that separately maximizes σ(⋅) and minimizes c(⋅) in an interweaving manner. The rationale is that directly maximizing ICR(⋅) may destroy the connectivity of the team members. That is, the shortest paths between nodes in S could pass through irrelevant nodes that contain no required labels. Since the communication cost c(S) is the denominator of ICR(S), including such irrelevant nodes will increase c(S) and lead ICR(S) to an unsatisfactory local maximum. To alleviate such damage to ICR(⋅), we propose to balance the trade-off between σ(⋅) and c(⋅) by mixing the maximization of σ(⋅) with the minimization of c(⋅). The algorithm, Mixed Influence-Cost Greedy (M-Greedy), is shown in Algorithm 2. We take advantage of the Lazy-Forward strategy on the marginal gain of σ(⋅) and c(⋅) to minimize the cost function c(⋅). The first If-Statement (line 7), |S|%2 == 0, is used to interweave maximizing σ(⋅) (line 8) with minimizing c(⋅) (line 10).
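The interweaving controlled by the |S|%2 == 0 test can be illustrated with the simplified sketch below. It omits the Lazy-Forward speed-up and is not Algorithm 2 itself; `spread` and `cost` are caller-supplied set functions, and the toy data at the end is hypothetical and exists only to exercise the alternation.

```python
from itertools import combinations


def mixed_greedy(candidates, k, spread, cost):
    """Alternate between maximizing spread gain and minimizing cost gain."""
    selected = []
    remaining = set(candidates)
    while remaining and len(selected) < k:
        pool = sorted(remaining)  # fixed order for deterministic tie-breaks
        if len(selected) % 2 == 0:
            # Even team size: add the node with the largest gain in spread.
            base = spread(selected)
            best = max(pool, key=lambda v: spread(selected + [v]) - base)
        else:
            # Odd team size: add the node with the smallest increase in cost.
            base = cost(selected)
            best = min(pool, key=lambda v: cost(selected + [v]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: nodes reached by each candidate, pairwise distances.
reach = {1: {1, 2, 3, 4, 5}, 2: {2, 6}, 3: {3, 7, 8, 9}, 4: {4}}
pair_dist = {(1, 2): 1, (1, 3): 3, (1, 4): 2, (2, 3): 1, (2, 4): 2, (3, 4): 1}
dist = {frozenset(p): d for p, d in pair_dist.items()}


def toy_spread(team):
    return len(set().union(*(reach[v] for v in team)))


def toy_cost(team):
    return sum(dist[frozenset(p)] for p in combinations(team, 2))


team = mixed_greedy(reach, 3, toy_spread, toy_cost)
```

In the toy run the first pick is the widest-reaching node, the second is its closest neighbour, and the third is again chosen for reach, which shows how the cost steps keep the team compact while the spread steps keep it influential.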

4.3 Similar influence search heuristic

Although the previous two greedy methods can be effective for finding a team S with high ICR(S), they are quite inefficient for real-world practical usage. The reason is the cost of computing the influence spread σ(⋅), even when the Lazy-Forward strategy is applied. Therefore, we aim at developing an effective and efficient heuristic method, Similar Influence Search (SimIS). The central idea of SimIS is two-fold. The first is exploiting the heuristic linear influence modeling [24] to efficiently estimate the influence spread: under the Independent Cascade model [8], the influence reaching a node v is approximated as a linear combination of the influence from v’s neighbors. Such an idea enables us to linearly compute the influence spread of an arbitrary set of nodes in a closed form. The second is that a node set S with lower communication cost means its members tend to be close to each other in the social network, and thus they are supposed to generate similar influence toward the other nodes in V ∖ S. If the formation of the team starts from the most influential node, and iteratively selects the next team member by finding the node whose influence spread (to all of the remaining nodes) is as close as possible to the influence spread generated by the currently selected team members, we might be able to approximately lower the communication cost while maintaining the influence spread. To implement these two ideas, we first elaborate the linear influence modeling, then give the detailed algorithm of SimIS.

Based on the linear influence model [24], the influence of the seed set S on a node v ∉ S is approximately given by:

$$ \sigma(S \rightarrow v) = \rho \sum\limits_{u \in N(v)}p_{v,u}\sigma(S \rightarrow u), $$
(1)

where N(v) is the set of neighbors of node v, and ρ ∈ (0, 1) is the damping factor of influence propagation (note that for v ∈ S, σ(S → v) = 1). This representation reflects that the influence from S to v is a linear combination of the influence from S to each of v’s neighbors, and allows efficient linear computation of influence propagation in an iterative manner. Equipped with σ(S → v), we can further derive the influence from S to a node set T (S ∩ T = ∅, S ⊆ V and T ⊆ V), denoted by σ(S → T), by summing up the influence from S to each node v ∈ T: \(\sigma (S \rightarrow T) = {\sum }_{v \in T}\sigma (S \rightarrow v)\). Since the influence maximization part of our ITF problem targets all nodes in the network, here we use the entire node set V as T. The influence spread of S, i.e., σ(S) = σ(S → V), can be derived with time complexity O(|E|) [24].
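The iterative computation suggested by (1) can be sketched as a fixed-point iteration. This is an illustrative sketch rather than the paper's ComputeInfluence routine: the function name, the default damping factor, the fixed iteration count, and the dictionary-based inputs are all assumptions, and the edge weight is indexed as \(p_{v,u}\) exactly as printed in (1).

```python
def compute_influence(graph, prob, seeds, rho=0.85, iterations=50):
    """Iterate sigma(S -> v) = rho * sum_{u in N(v)} p_{v,u} * sigma(S -> u).

    graph: dict mapping every node to a list of its neighbours
    prob:  dict mapping (v, u) to the weight p_{v,u} used in Eq. (1)
    seeds: set of seed nodes, whose influence value is clamped to 1
    """
    sigma = {v: (1.0 if v in seeds else 0.0) for v in graph}
    for _ in range(iterations):
        updated = {}
        for v in graph:
            if v in seeds:
                updated[v] = 1.0  # boundary condition for members of S
            else:
                updated[v] = rho * sum(prob[(v, u)] * sigma[u]
                                       for u in graph[v])
        sigma = updated
    return sigma
```

Summing the returned values, `sum(compute_influence(...).values())`, gives the estimate of σ(S → V). Each sweep touches every edge once, so a fixed number of sweeps costs time linear in |E|.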


Finding a k-node set that maximizes ICR(S) can be considered as maximizing σ(S) and minimizing c(S) simultaneously when iteratively adding nodes to S. We propose a Similar Influence Search (SimIS) heuristic to approximately fulfill this idea based on the linear influence modeling σ(S → T). SimIS begins by finding the first team member \(s^{(1)}\) as the node with the highest influence spread, i.e., \(s^{(1)} = \mathrm {arg\,max}_{v \in V}\sigma (v \rightarrow V)\), and adds \(s^{(1)}\) into S. Here we use Group-PageRank [15] GPR(v → V) to efficiently estimate a node v’s influence spread σ(v → V). GPR(v → V) has been shown to be an upper bound of σ(v → V) [15]. To find each of the next k − 1 nodes, i.e., \(s^{(i)}\) (1 < i ≤ k), we propose to select the node v ∈ V ∖ S whose influence on the entire node set V is as similar as possible to the influence of the currently selected S on V. Specifically, let \(\boldsymbol {\sigma }(S \rightarrow V)\) be the vector that represents the influence of S on all nodes V in the network, i.e., \(\boldsymbol {\sigma }(S \rightarrow V) = [\sigma (S \rightarrow v_{1}), \sigma (S \rightarrow v_{2}), ..., \sigma (S \rightarrow v_{n})]'\), where n = |V|. We aim at finding the next team member \(s^{(i)}\) (1 < i ≤ k) by selecting the node v that minimizes the difference between \(\boldsymbol {\sigma }(S \rightarrow V)\) and \(\boldsymbol {\sigma }(\{v\} \rightarrow V)\). The selection of the next k − 1 nodes can be expressed by:

$$ s^{(i)} = \underset{v \in V \setminus S}{arg\,min} \| \boldsymbol{\sigma}(\{ v \} \rightarrow V) - \boldsymbol{\sigma}(S \rightarrow V) \|_{F}^{2}, $$
(2)

where i = 2, 3, ..., k, and \(\left \| \cdot \right \|_{F}^{2}\) denotes the squared Frobenius norm (for vectors, the squared Euclidean norm). Since nodes that are close to one another in the network have higher potential to have similar vectors of influence on V, (2) can approximately find the team member with lower communication cost. In addition, since the formation of the team starts from the node with the highest influence spread, (2) can approximately select the next nodes with the least loss of influence spread, i.e., maintaining the influence spread of S as high as possible. The detailed algorithm of SimIS is given in Algorithm 3, along with Algorithm 4, ComputeInfluence(S), which computes the influence of a node set S on all nodes V in the network.

In short, the basic idea that nodes close to each other have similar influences enables us to devise the SimIS algorithm to maximize the influence spread and minimize the communication cost at the same time. The key is that the 1st node is selected purely to maximize the influence spread. Then the 2nd node is selected based on its influence similarity with the first one. If the influence of the 2nd node is similar to that of the 1st one, it tends to maximize the influence by mimicking the influence of the 1st node. To mimic the 1st node’s influence, the 2nd node should be as close as possible to the 1st in the network so that their influences are close. The same policy is applied to the subsequent nodes to be selected.
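The selection rule in (2) can be sketched as follows. This is a simplified illustration and not Algorithm 3: the influence vectors are supplied by a caller-provided function instead of the linear influence model, the first member is chosen by a caller-provided spread estimate instead of Group-PageRank, and the toy vectors (with element-wise maximum standing in for the influence of a set) are purely hypothetical.

```python
def simis(candidates, k, influence_vector, single_spread):
    """Pick the most influential node, then nodes with the closest vectors.

    influence_vector: callable mapping a list of nodes to a list of floats,
                      one entry per network node in a fixed order
    single_spread:    callable mapping one node to its estimated spread
    """
    pool = sorted(candidates)
    team = [max(pool, key=single_spread)]
    while len(team) < k and len(team) < len(pool):
        target = influence_vector(team)

        def squared_distance(v):
            vec = influence_vector([v])
            return sum((a - b) ** 2 for a, b in zip(vec, target))

        # Eq. (2): the unselected node whose vector is closest to the team's.
        team.append(min((v for v in pool if v not in team),
                        key=squared_distance))
    return team


# Hypothetical toy vectors over three network nodes.
vectors = {"a": [1.0, 0.6, 0.0], "b": [0.9, 0.5, 0.0], "c": [0.0, 0.0, 1.0]}


def toy_vector(nodes):
    # Stand-in for the influence of a set: element-wise maximum (assumption).
    return [max(vectors[v][i] for v in nodes) for i in range(3)]


team = simis(vectors, 2, toy_vector, lambda v: sum(vectors[v]))
```

In the toy run, node "a" is chosen first for its spread, and "b" follows because its influence vector nearly coincides with that of "a", whereas "c" influences a different part of the network.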

5 Experiments via simulation

The evaluation consists of two parts, simulation-based and prediction-based, which are presented in this and the next section respectively. In the simulation-based evaluation, we aim to answer the following questions. First, how effective (in terms of ICR) are the proposed methods for solving the ITF problem, compared with team formation and influence maximization techniques as baselines? Second, to allow organizing influential teams in real-world usage, any solution is supposed to report the team members quickly. What is the run time of the proposed methods? Third, an effective team requires the members to communicate with each other in an efficient and direct manner, instead of requiring the involvement of other inter-mediator persons. For example, in Figure 1, node \(v_{5}\) is the inter-mediator in the team \(S_{IM} = \{v_{1}, v_{4}, v_{6}\}\) since \(v_{5}\) is not a team member but lies on the shortest paths between team members. Can the proposed methods lead to less involvement of inter-mediators in the communication between team members? Note that in this simulation-based experiment, since the goal is to demonstrate the performance in terms of ICR and time efficiency, no real event participants are involved. In the next section, we present the prediction-based experiment using real event participation data.

5.1 Data and evaluation settings

We collect two friendship sub-networks, from Facebook and Google Plus respectively, for the experiments. The data statistics, including the numbers of nodes and edges in each social network and the total numbers of labels, are shown in Table 1. We treat attribute values as “labels.” For example, we have labels male and female from the Sex attribute, UIUC and CMU from the Education attribute, and 1983 and 2000 from the Birth Year attribute. Any accessible attribute values from users’ profiles are treated as labels. Figure 2a and b exhibit the numbers of users who own a certain label. We find that a few labels are very popular while most labels are unpopular (i.e., fewer than 10 users possess such labels) in both datasets. Figure 2c and d show the cumulative distributions of the number of labels possessed by a user. It can be observed that most users have fewer than 20 labels, while fewer than 10% of users have more than 20 labels. Such distributions may result from the fact that most users of online social services do not want to provide complete information in their profiles.

Table 1 Statistics of the used datasets

Figure 2 Data distributions. (a) and (b) are the distributions of user numbers for labels. (c) and (d) are the cumulative distributions of the number of labels for a user

We compare the three proposed methods, ICR-Greedy, M-Greedy, and the Similar Influence Search (SimIS) heuristic, with two baselines. The first is the Enhanced Team Formation (ETF) algorithm [9], while the second is the modified Linear Influence Maximization (LIM) algorithm [15]. ETF is the state-of-the-art method that finds team members by selecting k nodes that cover the required labels and have low communication cost toward each other. But ETF does not consider the influence spread of the team. LIM is the state-of-the-art method that selects a set of seed nodes maximizing influence spread while maintaining time efficiency. We modify LIM by imposing a requirement: only k influential nodes that collectively cover the required labels can be selected. Note that under such a modification, the nodes with the largest marginal gain of influence spread are not always selected, since some of them may not cover any required label. Nevertheless, LIM does not minimize the communication cost between the selected seeds.

In the experiments, we mainly show how ICR scores and time efficiency (in seconds) change when varying the team size k. We randomly generate 20 sets of required labels for each dataset, each containing 5 labels. Note that the number of required labels has also been varied, and the results demonstrate similar trends. We use the IC model to estimate the influence spread, in which edge probabilities are uniformly selected from the set {0.1, 0.2, ... , 0.9} based on the TRIVALENCY model [4]. To look further into the effect of the different methods, we also show the results of influence spread and communication cost. In addition, since a good team does not need much involvement of inter-mediator users who help facilitate the communication between team members, we also report the cardinality of a team, which is defined as the number of nodes that are traversed via the shortest paths between team members. The cardinality of a good team is expected to be as close as possible to the team size. The final values, i.e., ICR, run time, cardinality, influence spread, and communication cost, are computed by averaging the results over the 20 label sets.
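The cardinality measure defined above can be sketched as follows. The text does not say how ties among equally short paths are handled, so this sketch assumes one breadth-first shortest path per pair of team members; the function names and the toy graph are hypothetical, and the toy graph is not the network of Figure 1.

```python
from collections import deque
from itertools import combinations


def shortest_path(graph, source, target):
    """Return one shortest path (list of nodes) found by BFS, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in graph.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def team_cardinality(graph, team):
    """Number of distinct nodes on the shortest paths between team members."""
    touched = set(team)
    for u, v in combinations(sorted(team), 2):
        path = shortest_path(graph, u, v)
        if path is not None:
            touched.update(path)
    return len(touched)


# Hypothetical toy graph: node 4 reaches nodes 1 and 6 only through node 5.
toy = {1: [5, 6], 4: [5], 5: [1, 4, 6], 6: [1, 5]}
```

Here `team_cardinality(toy, [1, 4, 6])` counts node 5 as an inter-mediator and exceeds the team size by one, while a directly connected team such as `[1, 5, 6]` has cardinality equal to its size.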

5.2 Experimental results

The main results of ICR and run time are shown in Figures 3 and 4 respectively. M-Greedy and SimIS generate higher ICR scores with satisfying time efficiency, compared with the baselines LIM and ETF. Though ICR-Greedy is close to M-Greedy and SimIS, it is very inefficient. In more detail, M-Greedy has higher ICR scores for smaller teams, while SimIS can be slightly better than M-Greedy for larger teams. Given these results, we suggest that SimIS be used for organizing larger events that may need more team members. The result that M-Greedy outperforms ICR-Greedy shows that maximizing ICR directly and greedily can lead to unsatisfying results, while dealing with maximizing σ and minimizing c separately and greedily can strike a better balance between effectiveness and efficiency.

Figure 3 The performance of ICR by varying team sizes

Figure 4 Run time of different algorithms

To understand why M-Greedy and SimIS lead to higher ICR scores while LIM and ETF perform worse, we show the influence spread σ and communication cost c in Figures 5 and 6 respectively. SimIS generates higher σ than LIM on both datasets. We think the reason is that LIM is modified to cover required labels, so its σ is lowered to some extent. Nevertheless, LIM’s σ is quite close to those of M-Greedy and SimIS. On the other hand, we also find that SimIS’s σ is higher than M-Greedy’s. This may result from the fact that half of the M-Greedy algorithm is devoted to minimizing c, so M-Greedy leads to lower c in Figure 6. Nevertheless, SimIS also has a low c, close to that of M-Greedy. It is natural to see that ETF generates results with the lowest c. In short, ETF and LIM each sacrifice one of σ and c for the other, while M-Greedy and SimIS strike a satisfying balance, and ICR-Greedy is an intuitive solution but generates worse results.

Figure 5 The influence spread σ for different methods

Figure 6 The communication cost c for different methods

The last results, for cardinality, are shown in Figure 7. Both M-Greedy and SimIS need only a few inter-mediator users to facilitate communication, i.e., on average only 6 additional users for the Facebook data and 12 additional users for Google+. Such results are reasonable considering that some effort is spent on reaching near-by (but not directly-connected) influential users. Sometimes the cardinality values of LIM and ETF are low as well, since they directly maximize σ and minimize c respectively. ICR-Greedy leads to the worst cardinality, which shows that directly maximizing ICR includes too many inter-mediators.

Figure 7 The cardinality for different methods

Note that since our goal is to select a set of nodes that maximizes ICR, ICR-Greedy is a natural method to develop, since it iteratively chooses the next node that maximizes the marginal gain of ICR in a greedy manner. ICR-Greedy is based on the idea of the greedy algorithm for solving the Influence Maximization problem. However, directly maximizing ICR will destroy the connectivity of the team members in the social network. That is, the communication among team members needs irrelevant nodes as inter-mediators. Since maximizing ICR consists of maximizing the influence spread and minimizing the communication cost among team members, an alternative greedy solution (i.e., M-Greedy) is to separately maximize the influence and minimize the communication cost in an interweaving manner. Nevertheless, the experimental results show that the greedy algorithms are quite inefficient in running time, especially ICR-Greedy, as shown in Figure 4. ICR-Greedy also performs worse in ICR scores, as shown in Figure 3. Therefore, we further develop a heuristic algorithm, SimIS, to obtain team members that possess high ICR scores and can be derived with very high time efficiency. The results in Figures 3 and 4 validate the effectiveness and the efficiency of SimIS.

6 Evaluation with real event participation data

We further use real-world event-participation data to evaluate the performance of the proposed methods via prediction, in addition to examining the Influence-Cost Ratio in the previous section. The main question we aim to answer is whether or not the proposed methods can accurately predict the event organizers and participants, especially for influential events. In other words, we expect that the users who organize an event will be discovered based on the tags/labels of the event. Moreover, we also investigate whether the event participants can be identified by letting the members of the formed influential team spread the event invitations.

6.1 Meetup data

We use the real event participation data of the social service Meetup for the evaluation. The Meetup data was collected from July 2013 to October 2013. Each user in Meetup can belong to multiple “online social groups”. Users in Meetup are allowed to organize “offline social events” by specifying a set of tags that describe the topic of an event, and the event’s geographical location (with latitude and longitude). Then the event organizers can launch events in online groups to invite and attract users in the same group or other groups to participate. In addition, users can respond to offline social events via the RSVP function (i.e., “yes”, “no”, or “maybe”).

Since the Meetup service does not allow users to specify their social connections, we compile the social network based on the online social groups that users join. Each user is considered as a node in the social network. We assume that two users have higher potential to have a social connection if they co-join more online social groups and their joined groups are smaller (since larger groups may lower the possibility of acquaintance between two users). Let u and v be two users in the social network, and let \(grp_{i}\) denote an online group with \(|grp_{i}|\) members. We take an approach similar to [18] to measure the connection strength cs(u, v), formulated as follows:

$$ cs(u,v) = \sum\limits_{\forall grp_{i}, u \in grp_{i} \wedge v \in grp_{i}} \frac{1}{|grp_{i}|}. $$
(3)

We use a threshold parameter ρ to filter out node pairs with low connection strength values. We set ρ = 0.06 to construct the social network and use this network to produce the experimental results. We also conducted the experiments with other ρ values, and the results exhibit similar trends; here we report the results for ρ = 0.06.
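The network construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the experiments: the `groups` dictionary, the function names, and the choice to keep a pair when its strength is greater than or equal to ρ are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def connection_strengths(groups):
    """Return {(u, v): cs(u, v)} as in (3): every shared group adds 1/|grp_i|."""
    cs = defaultdict(float)
    for members in groups.values():
        members = sorted(set(members))
        if len(members) < 2:
            continue
        weight = 1.0 / len(members)
        # Each unordered pair of co-members gains the same group weight.
        for u, v in combinations(members, 2):
            cs[(u, v)] += weight
    return dict(cs)


def build_network(groups, rho=0.06):
    """Keep only pairs whose strength reaches rho (">=" is an assumption)."""
    return {pair: w for pair, w in connection_strengths(groups).items() if w >= rho}


# Hypothetical toy data: two small groups and one 20-member group.
groups = {
    "g1": ["alice", "bob", "carol"],
    "g2": ["alice", "bob"],
    "g3": ["carol"] + [f"user{i}" for i in range(19)],
}
edges = build_network(groups, rho=0.06)
for pair, weight in sorted(edges.items()):
    print(pair, round(weight, 3))
```

In this toy input, pairs that meet only in the 20-member group have strength 1/20 = 0.05 and are dropped at ρ = 0.06, while alice and bob, who share two small groups, accumulate 1/3 + 1/2.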

We extract two data subsets from the Meetup data, which correspond to two cities, Chicago (CHI) and San Francisco (SF). We take the largest connected component of each subset as the final social network and retrieve the corresponding offline events; the resulting data statistics are provided in Table 2. Since we concentrate on influential events, we consider only events with more than 100 participants.

Table 2 Statistics of the extracted dataset of Chicago (CHI) and San Francisco (SF)

6.2 Evaluation settings

Our goal is to understand whether the proposed methods can truly find the users (i.e., event organizers) who host influential events, and discover the users who participate in those events. The evaluation therefore consists of two parts. The first is Organizer Finding, which employs the proposed methods to find the influential team and treats the team members as the predicted organizers, which are then compared against the ground-truth event organizers. The second is Participant Forecasting, which uses the predicted event organizers to spread the event invitations and investigates how many ground-truth event participants appear among the set of activated users. However, the Meetup data provides only the participants of an event and does not identify its organizers. To obtain the ground truth, we therefore treat the earliest 10% of the participants of an event as the ground-truth organizers. The remaining 90% of the participants are considered the ground-truth participants.
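The ground-truth split can be illustrated with a short Python sketch. The text does not say how "earliest" is measured or how the 10% cut is rounded, so the RSVP timestamps, the `(user, timestamp)` input format, and the rounding rule below are assumptions of this example.

```python
def split_ground_truth(rsvps, organizer_fraction=0.10):
    """Split participants into (organizers, participants) by RSVP order.

    rsvps is a list of (user, timestamp) pairs; the earliest fraction of
    users is treated as the ground-truth organizers (assumed rounding rule).
    """
    ordered = [user for user, _ in sorted(rsvps, key=lambda r: r[1])]
    n_org = max(1, round(len(ordered) * organizer_fraction))
    return set(ordered[:n_org]), set(ordered[n_org:])


# Hypothetical event with 20 participants and increasing RSVP timestamps.
rsvps = [(f"user{i}", 1000 + i) for i in range(20)]
organizers, participants = split_ground_truth(rsvps)
print(sorted(organizers), len(participants))
```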

Each part has its own evaluation metrics. For Organizer Finding, by varying the team size k = 10, 20, ... , 50, we compute the Organizer Hit Rate (OHR) for the predicted organizer set \(S_{k}\) of size k. Let \(\bar {S}\) be the set of ground-truth organizers of the corresponding event. OHR is defined as:

$$ OHR(S_{k}) = \frac{hit(S_{k}, \bar{S})}{k}, $$
(4)

where \(hit(S_{k}, \bar {S}) = |S_{k} \cap \bar {S}|\). OHR gets higher if more ground-truth organizers are detected. In addition to OHR, we also examine the social closeness between the predicted organizers and the ground-truth organizers. A method that predicts organizers very close to the ground truth can be considered relatively effective. Therefore, we report the closeness scores between the predicted organizers \(S_{k}\) and the ground-truth organizers \(\bar {S}\). Specifically, for each predicted organizer \(s \in S_{k}\), we measure its distance to the nearest ground-truth organizer in the compiled social network, and define the Organizing Closeness (OClose) of \(S_{k}\) as the average of these distances:

$$ OClose(S_{k}) = \frac{\sum\limits_{s \in S_{k}}dist(s,\bar{S})}{|S_{k}|}, $$
(5)

where \(dist(s,\bar {S})\) is the length of the shortest path from the predicted organizer s to its nearest ground-truth organizer in \(\bar {S}\) in the social network, in which edge weights are the connection strength values defined in (3). OClose gets lower if a method predicts organizers that are socially close to the ground truth.
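Both organizer-finding metrics can be sketched as follows. This is an illustrative implementation under stated assumptions, not the evaluation code of the paper: the graph is assumed to be undirected and connected, `adj` is a hypothetical adjacency dictionary, and a multi-source Dijkstra search is one convenient way to obtain the distance from every node to its nearest ground-truth organizer.

```python
import heapq


def organizer_hit_rate(predicted, truth):
    """OHR as in (4): fraction of the k predicted organizers that are in the ground truth."""
    predicted = set(predicted)
    return len(predicted & set(truth)) / len(predicted)


def distances_to_set(adj, sources):
    """Multi-source Dijkstra: weighted distance from each node to its nearest source."""
    dist = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def organizing_closeness(adj, predicted, truth):
    """OClose as in (5): mean distance from predicted organizers to the ground-truth set."""
    dist = distances_to_set(adj, truth)
    predicted = set(predicted)
    # Assumes every predicted organizer is reachable (connected network).
    return sum(dist[s] for s in predicted) / len(predicted)


# Hypothetical path graph a - b - c - d with illustrative edge weights.
adj = {
    "a": {"b": 0.5},
    "b": {"a": 0.5, "c": 0.2},
    "c": {"b": 0.2, "d": 0.4},
    "d": {"c": 0.4},
}
truth = {"a"}
predicted = {"a", "c", "d"}
print(organizer_hit_rate(predicted, truth))
print(organizing_closeness(adj, predicted, truth))
```

Searching outward from the ground-truth set once avoids running a separate shortest-path search for each predicted organizer.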

For Participant Forecasting, we first need to obtain the set of forecasted participants based on a derived team \(S_{k}\). We use the Independent Cascade (IC) model [8] to determine which users are considered the forecasted participants. Specifically, taking the predicted event organizers as the seeds, we run the IC model to simulate the propagation of event invitations from the seeds through the social network. Users who are successfully activated are regarded as those who accept the invitation and become participants. Given the set \(S_{k}\) of predicted organizers of an event, the IC model starting from \(S_{k}\) is performed up to 1,000 times, and nodes that are successfully activated up to 50 times are considered the forecasted participants of the event, denoted by \(A(S_{k})\). We measure the performance by computing the Precision and Recall of the forecasted participants. Let \(\bar {A}\) be the set of ground-truth participants of the corresponding event. Precision and Recall are defined as:

$$ Precision(S_{k})= \frac{hit(A(S_{k}),\bar{A})}{|A(S_{k})|}, $$
(6)
$$ Recall(S_{k})= \frac{hit(A(S_{k}),\bar{A})}{|\bar{A}|}, $$
(7)

where \(hit(A(S_{k}),\bar {A}) = |A(S_{k}) \cap \bar {A}|\), \(|A(S_{k})|\) is the number of activated users (i.e., forecasted participants), and \(|\bar {A}|\) is the number of ground-truth participants. Note that in the execution of the IC model, we normalize the connection strength values \(cs(u,v)\) between nodes to lie within [0, 1], and use the normalized values as the propagation probabilities on edges. Furthermore, while Precision estimates the overall accuracy of the forecasted participants, event organizers sometimes care more about which of their friends or followers will participate in the events. That is, in reality, event organizers tend to invite their friends or followers, who are their immediate neighbors in the social network, to attend the events. Hence, to better facilitate event organization, a method should be able to predict the participants among the immediate neighbors of the organizers. To measure the accuracy of forecasted participants among immediate neighbors, we slightly modify Precision to obtain the Neighboring Precision (nbr.Precision):

$$ nbr.Precision(S_{k})= \frac{hit(A_{1}(S_{k}),\bar{A})}{|A_{1}(S_{k})|}, $$
(8)

where \(A_{1}(S_{k}) \subseteq A(S_{k})\) is the subset of forecasted participants who are the immediate neighbors of organizers \(S_{k}\) in the social network. Higher \(nbr.Precision(S_{k})\) indicates better performance in forecasting immediate participants.
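The participant-forecasting pipeline and the measures in (6) to (8) can be sketched as below. This is a hedged illustration, not the authors' implementation. Several choices are assumptions of the example: `prob` is a hypothetical adjacency dictionary that already holds normalized propagation probabilities, the activation threshold is read as "activated in at least `min_hits` runs", and the seeds themselves are excluded from the forecasted participants.

```python
import random


def simulate_ic(prob, seeds, rng):
    """One Independent Cascade run; each newly active node gets one chance per neighbor."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in prob.get(u, {}).items():
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def forecast_participants(prob, seeds, runs=1000, min_hits=50, seed=0):
    """Nodes activated in at least min_hits of the runs (assumed reading of the threshold)."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(runs):
        for v in simulate_ic(prob, seeds, rng) - set(seeds):
            counts[v] = counts.get(v, 0) + 1
    return {v for v, c in counts.items() if c >= min_hits}


def precision_recall(forecast, truth):
    """Precision (6) and Recall (7) of the forecasted participants."""
    hits = len(forecast & truth)
    precision = hits / len(forecast) if forecast else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


def nbr_precision(prob, seeds, forecast, truth):
    """nbr.Precision (8): precision restricted to immediate neighbors of the seeds."""
    neighbors = set()
    for s in seeds:
        neighbors.update(prob.get(s, {}))
    a1 = forecast & neighbors
    return len(a1 & truth) / len(a1) if a1 else 0.0


# Hypothetical network with probabilities 1.0 and 0.0 so the outcome is deterministic.
prob = {
    "s": {"a": 1.0, "b": 0.0},
    "a": {"s": 1.0, "c": 1.0},
    "b": {"s": 0.0},
    "c": {"a": 1.0},
}
seeds = {"s"}
truth_participants = {"a", "b"}
forecast = forecast_participants(prob, seeds, runs=100, min_hits=5)
print(sorted(forecast))
print(precision_recall(forecast, truth_participants))
print(nbr_precision(prob, seeds, forecast, truth_participants))
```

With real propagation probabilities the forecast depends on the random seed, so results should be averaged or reported with a fixed seed.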

We compare the performance of the methods SimIS, M-Greedy, ETF, and LIM, which can be treated as unsupervised learning approaches for predicting event organizers and participants. The tags of the given event are used as the required labels to execute the methods. Since SimIS derives the most promising results in the previous experiment, we would like to know whether it also performs well in the evaluation using real event participation data. Note that ICR-Greedy is not used for comparison here, because its running time, as indicated in Figure 4, is impractical for making predictions.

6.3 Results and discussion

The performance for organizer finding is shown in Figures 8 and 9. The most effective methods are SimIS and M-Greedy: their Organizer Hit Rate scores are very close to each other and significantly outperform those of ETF and LIM, and their OClose values are much lower than those of ETF and LIM. These results lead to four implications. First, given the satisfying hit rate scores (i.e., when team size k ≤ 30, OHR is at least 0.7 for San Francisco and at least 0.8 for Chicago), the proposed Influential Team Formation is validated to effectively capture real-world event organization based on a set of specified labels that describe the topics of the event. Second, the proposed methods SimIS and M-Greedy are shown to possess sufficient predictability of real event organizers. Hence, the teams formed by SimIS and M-Greedy can be responsible for organizing influential events in practice. Third, given the worse performance of ETF and LIM, the groups of users discovered using either Influence Maximization [15] or Team Formation [9] cannot be considered teams of organizers for influential events. Fourth, for those organizers that cannot be found by these four methods, Figure 9 shows that SimIS tends to find users who are very close to the ground-truth organizers in the social network. This result indicates that even if the real organizers cannot be found (e.g., around 30% cannot be found according to Figure 8), the organizers recommended by SimIS can still be good-quality ones, since they have strong relationships with the real organizers, and thus the event invitations have higher potential to be spread through the real organizers.

Figure 8 Organizer hit rate scores for organizer finding

Figure 9 Organizing closeness scores for organizer finding

We present the results for participant forecasting in Figures 10, 11, and 12, which correspond to Precision, nbr.Precision, and Recall, respectively. SimIS generally outperforms the other three methods. This suggests that SimIS not only finds influential event organizers effectively, but is also relatively effective at forecasting event participants from the found organizers. Nevertheless, both Precision and Recall scores are quite low: below 0.1 for Precision and below 0.05 for Recall. Such low scores make the comparison insignificant and indicate very unsatisfying predictability. We think this poor performance may result from the mechanism used to spread the influence/invitations from the event organizers. Since developing a precise simulation model for the propagation of invitations is not the goal of this paper, we use the Independent Cascade model. The low precision and recall scores imply that the IC model is unsuitable for modeling the spread of event invitations. Therefore, we can only conclude that the proposed SimIS is relatively useful under the IC model. We leave the development of an effective invitation propagation model, and its evaluation for participant forecasting using SimIS, to future work. On the other hand, immediate participants are forecasted relatively better (i.e., the scores are higher than 0.12), as reported by nbr.Precision in Figure 11. Such results make sense because organizers usually invite their friends or followers to attend events, and people tend to accept invitations to events hosted by their friends. Thus the proposed SimIS would be more applicable for predicting which friends or followers will participate in the events hosted by the found organizers.

Figure 10 Precision scores for participant forecasting

Figure 11 Neighboring (nbr) precision scores for participant forecasting

Figure 12 Recall scores for participant forecasting

7 Conclusion and discussion

This paper proposes a novel Influential Team Formation (ITF) problem for organizing influential events on social networks. The objective of ITF is to find a team of users that satisfies the required labels and maximizes the influence spread per communication cost among team members. Two methods, M-Greedy and the SimIS heuristic, are developed and validated to be promising for solving the ITF problem in terms of effectiveness and efficiency. M-Greedy is suitable for organizing smaller events, while SimIS is good for larger events. In addition, by considering SimIS and M-Greedy as two unsupervised learning approaches, we find that they lead to promising predictability in finding organizers for influential events using real event participation data from Meetup. Our analysis also demonstrates unsatisfying performance in forecasting participants, which encourages us to devise an effective invitation propagation mechanism based on SimIS in the future.

Although the proposed techniques can effectively find the team members in both simulation-based and data-based experiments, our work still has some limitations. First, the influential team formation problem assumes that the label set of each individual is known a priori (e.g., obtained from the user profiles). However, users in online social media tend to provide incomplete information, which might degrade the quality of the discovered team members. Second, the influence probabilities associated with the edges of the social network are pre-determined. Unrealistic influence probabilities could affect the real performance of the formed teams. How to automatically learn the influence probabilities from the action logs of information diffusion (e.g., retweets and replies) is an important open issue. Third, we suppose that all identified team members will join the event organization. This is not the case in practice; that is, many of them may be unwilling or unavailable. A practical team formation method should model the willingness of each individual in the process of event organization. This can be a promising future direction.

By formulating and solving the proposed ITF problem with the effective algorithm SimIS, three potential extensions can be built. First, SimIS not only finds the team members, but also reports the scores of both influence-cost ratio and influence spread. Hence, one can estimate the success of the formed team by the expected number of event participants, or control the event scale. Second, the SimIS algorithm is easy to extend based on the user-input team size. Therefore, we can find an alternative if any individual is unavailable or unwilling to join the team. Third, given that real-world campaigns are opinion-aware and competitive, our ongoing work is to form influential teams with the consideration of multi-party competitors, in which each party supports a particular opinion.