1 Introduction

For census agencies, migration is difficult to track even in developed countries, let alone in developing ones. In emerging economies, authorities rely on inaccurate, outdated, and de-contextualized census data even for the local population [15].

Fig. 1. Evolution of the follower graph: edges connect the geographical locations of the users in our dataset. The picture shows the cumulative set of graph edges after the (a) 1st, (b) 2nd, (c) 3rd and (d) 7th month after the platform launch. The brightest point in (a) corresponds to the city of Sao Paulo.

Migrants who have left their home country in search of better opportunities also rely on electronic communication to maintain their bonds with their home communities [3]. Publishing and ‘consuming’ content such as news and photos on online platforms is “a parcel of everyday life in transnational families” [2]. Previous studies have found that indicators characterizing offline communities (e.g., economic deprivation) can be extracted from online data (e.g., use of emotion words on Twitter) [22]. Therefore, we propose to use online data in Brazil to track the number of migrants in a city by considering the interactions between users who live in the city and those who live outside it.

Our main contribution is to propose a set of metrics extracted from online data to estimate migration levels. These metrics reflect the intuition that the higher the number of migrants in a city, the more online interactions there are between users in the city and those outside it. We compute these metrics for 35 cities in a Yahoo Meme dataset that includes more than 13M posts and 22M reposts exchanged between users in more than 1K cities around the world. We find that the proposed metrics correlate with the number of migrants reported by the Brazilian census authority. By then combining these metrics in a linear regression model, we show that the model fits the data reasonably well (\(Adj.\) \(R^{2}\) = 0.61).

2 Dataset

Yahoo Meme was a microblogging platform similar to Twitter, except that users could post content of any length or type (text, pictures, audio, video); text and pictures were the most frequently posted content. In addition to posting, users could also follow other users, repost others’ content, and comment on it. In this study, we use a random sample of interactions on Yahoo Meme from its birth in 2009 until the day it was discontinued in 2012 (Table 1). Despite its moderate popularity in the USA, Yahoo Meme was popular in Brazil, as witnessed by the fact that the top 45 cities in terms of number of interactions are all located there. Reposting was the main activity in the service (22M sample records) compared to commenting (4M). We extract the users who posted the content in our sample and georeference them based on their IP addresses using a Yahoo service. We remove the users for whom we did not obtain results at city level (e.g., users employing proxy servers to connect to the Internet), obtaining 80K users. For this set of users and their respective posts, we extract all the repost cascades and the follower relationships. Month after month, users across different Brazilian cities tended to intensify their follower connections until reaching a certain stability at month 7 after the platform launch (Fig. 1).

Table 1. Yahoo Meme dataset statistics

To ensure geographic representativeness, we verify that the number of users in the top Brazilian cities in our dataset is significantly correlated with the number of Internet users (Fig. 2). Any city falling outside the calculated confidence area (an outlier) is excluded from the study. This leaves us with 35 cities, and we will see that such a number yields statistically significant results. That is because we are left with 1.4M repost cascades whose original content was produced in the 35 cities and was consumed across the world.

3 Attention Metrics

It has been shown that migrants maintain their strong ties in their home countries mainly using digital means [2]. We thus expect that studying online interactions in Yahoo Meme across geographic areas would result in good estimators of migration flows. More specifically, we connect two places every time that a user located in city \(j\) interacts with a user \(u_i\) located in city \(i\), either by reposting \(u_i\)’s content or by following him/her. The volume of such connections is then correlated with migration rates for 35 cities in Brazil. We consider migrants from Brazil itself and from the rest of the world.

Previous studies have shown that interactions on social media cannot be quantified with simple metrics such as popularity or number of followers but they are best characterized with metrics that also reflect the extent to which content is re-shared or liked [1, 6, 23, 30, 32]. That is because social media users make specific decisions about the content they want to consume or who they wish to follow. Such decisions are taken based on offline social ties [31], homophily, and physical distance [25].

Fig. 2. Number of users in our sample versus number of Internet users. Both quantities are log-transformed. Regression line and 95 % confidence intervals are shown.

Fig. 3. Attention graph whose nodes are cities and whose weighted edges reflect the intensity of reposting between cities’ users. Size and color of the nodes are proportional to the node degree. The network was plotted using the GeoLayout plugin of the Gephi software package [14].

We thus resort to attention metrics, which capture the attention that a city’s users are able to attract from the rest of the world and from other Brazilian cities:

Cross Border Attention. Our first set of attention metrics for city \(i\) is defined as the number of reposts that the city has attracted from the rest of the world (\(ROW^{repost}_{i}\)) or from other Brazilian cities (\(BR^{repost}_{i}\)), normalized with respect to the total number \(n_i\) of users in that city:

$$ ROW^{repost}_{i} = \frac{out_i}{n_{i}} , BR^{repost}_{i} = \frac{out'_i}{n_{i}} $$

where \(out_i\) is the number of times a post originating in city \(i\) has been reposted outside Brazil (the rest of the world); \(out'_{i}\), instead, counts the reposts received from outside the city but inside Brazil.

We repeat the same definition, now considering the number of cross-border followers attracted by users in city \(i\):

$$ ROW^{followers}_{i} = \frac{outf_i}{n_{i}} , BR^{followers}_{i} = \frac{outf'_i}{n_{i}} $$

where \(outf_i\) is the number of times a user in city \(i\) has been followed by a user outside Brazil (the rest of the world); \(outf'_{i}\), instead, counts the follower links from outside the city but inside Brazil. As a result, we obtain the first four metrics.
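The four cross-border metrics can be sketched as a single counting routine. This is a minimal illustration, not the authors' code: the function name `cross_border_attention`, the event tuple layout, and the `"BR"` country code are all assumptions introduced here. The same routine applies to reposts and to follower links, since both are normalized by the number of users \(n_i\) in the city.

```python
from collections import defaultdict

def cross_border_attention(events, users_per_city, home_country="BR"):
    """Return {city: (ROW_i, BR_i)} for one interaction type.

    events: iterable of (origin_city, actor_city, actor_country), where
        origin_city is the city i of the original poster (or followed user)
        and the actor is the reposter (or follower). Hypothetical layout.
    users_per_city: {city: n_i}, the number of users located in each city.
    """
    row_counts = defaultdict(int)  # interactions from outside the country
    br_counts = defaultdict(int)   # interactions from other cities, same country
    for origin_city, actor_city, actor_country in events:
        if actor_country != home_country:
            row_counts[origin_city] += 1
        elif actor_city != origin_city:
            br_counts[origin_city] += 1
        # interactions within the same city are not cross-border: skipped
    return {
        city: (row_counts[city] / n, br_counts[city] / n)
        for city, n in users_per_city.items()
        if n > 0
    }

# Toy data (invented for illustration only).
events = [
    ("Sao Paulo", "Lisbon", "PT"),
    ("Sao Paulo", "Rio de Janeiro", "BR"),
    ("Sao Paulo", "Sao Paulo", "BR"),
    ("Rio de Janeiro", "Sao Paulo", "BR"),
]
metrics = cross_border_attention(events, {"Sao Paulo": 2, "Rio de Janeiro": 4})
```

Calling the function once on repost events and once on follower links would give the two pairs of metrics defined above.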

Authority. The previous metrics consider all cities equally. However, certain cities might be more central to migration flows than others. To capture this concept of centrality, we build an attention graph using reposts. This is a weighted directed graph where nodes are cities, and directed weighted edges \((i,j,w)\) represent the volume \(w\) of reposts between city \(j\), where the reposter lives, and city \(i\), where the original poster lives. Self-edges are allowed, as many reposts occur between users living in the same city. The resulting attention graph has 1,310 nodes and 25K weighted edges (Fig. 3). Then, we measure the ‘authority’ index of each city using the HITS algorithm [14]. In the HITS algorithm, the authority centrality of a vertex is defined to be proportional to the aggregated hub centrality of the vertices that point to it. For a city \(i\), the two indexes are defined as follows:

$$ Authority_{i} = \alpha \cdot \sum \limits _{j \in C} A_{ij}Hub_{j},$$
$$ Hub_{i} = \beta \cdot \sum \limits _{j \in C} A_{ji} Authority_{j}, $$

where \(\alpha \) and \(\beta \) are constants, C is the set of cities in our dataset and \(A\) is the attention graph’s corresponding city adjacency matrix.

The Authority index calculated by the HITS algorithm is more informative for the vertex centrality in directed networks than simpler measures such as the number of incident edges or indegree centrality [12] and, thus, it better captures the importance of a node in the network.
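As an illustration of how authority scores can be obtained from the attention graph, the following is a minimal power-iteration sketch of the standard HITS updates, in which an authority score sums the hub scores of the cities that repost it and a hub score sums the authority scores of the cities it reposts. The function name `hits`, the edge dictionary layout, and the fixed iteration count are assumptions made for this sketch; the L2 normalization at each step plays the role of the constants \(\alpha\) and \(\beta\).

```python
import math

def hits(edges, iterations=50):
    """Power iteration for HITS on a weighted directed attention graph.

    edges: {(poster_city, reposter_city): w}, i.e. entry A[i][j] = w is the
        volume of reposts made in city j of content posted in city i.
    Returns (authority, hub) dictionaries keyed by city.
    """
    cities = sorted({city for pair in edges for city in pair})
    authority = dict.fromkeys(cities, 1.0)
    hub = dict.fromkeys(cities, 1.0)

    def normalize(scores):
        # Scale to unit L2 norm; leave an all-zero vector untouched.
        norm = math.sqrt(sum(v * v for v in scores.values()))
        return {c: (v / norm if norm > 0 else 0.0) for c, v in scores.items()}

    for _ in range(iterations):
        # Authority update: sum of hub scores of the reposting cities.
        new_authority = dict.fromkeys(cities, 0.0)
        for (poster, reposter), w in edges.items():
            new_authority[poster] += w * hub[reposter]
        authority = normalize(new_authority)

        # Hub update: sum of authority scores of the reposted cities.
        new_hub = dict.fromkeys(cities, 0.0)
        for (poster, reposter), w in edges.items():
            new_hub[reposter] += w * authority[poster]
        hub = normalize(new_hub)
    return authority, hub

# Toy graph (invented): A is reposted by B and C, B is reposted by C.
authority, hub = hits({("A", "B"): 3.0, ("A", "C"): 1.0, ("B", "C"): 1.0})
```

In this toy graph, city A attracts the most reposts and therefore receives the highest authority score, while city C, which is never reposted, receives none.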

We calculate the correlation between each pair of the five metrics: \(ROW^{repost}_{i}\), \(BR^{repost}_{i}\), \(ROW^{followers}_{i}\), \(BR^{followers}_{i}\), \(Authority_{i}\) (Fig. 4) and observe that they are all correlated with each other. That is why, when we run our predictions, we account for interaction effects.

4 Correlations Between Attention and Migration

From the 2010 data provided by the Brazilian census bureau (Footnote 1), we compute two migration rates for each of the 35 cities: \(m_{ROW}\) is the number of people coming from other countries and \(m_{BR}\) is the number coming from other Brazilian cities. Both values are normalized by city population. We then correlate these two migration rates with our five attention metrics. To account for skewness, the metrics are log-transformed. The results obtained are statistically significant, with all \(p\)-values below 0.05.

Fig. 4. Correlations among the five attention metrics. We observe that the ROW attention metrics are correlated among each other more than they are with the Authority metric. Values are log-transformed.

  • Reposts and Follower metrics. We find positive correlations between migration rates and attention received from the rest of the world: \(r=0.28\) for attention computed on reposts, and \(r=0.33\) for attention computed on number of followers. Stronger correlations are found for attention received from other Brazilian cities: \(r=0.33\) for attention computed on reposts, and \(r=0.46\) for attention computed on number of followers.

  • Authority metric. Since the authority measure can only be computed on the aggregate (Brazil plus rest-of-the-world) dataset, we correlate the authority measure with the total number of migrants (\(m_{ROW}\) + \(m_{BR}\)). In so doing, we obtain, again, a positive correlation: \(r=0.32\).
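The correlations reported above can be reproduced in principle with a Pearson coefficient computed on log-transformed values. The sketch below assumes strictly positive inputs (the logarithm is undefined otherwise) and uses hypothetical helper names `pearson_r` and `log_pearson`; the source does not state the logarithm base, which does not matter here because Pearson's \(r\) is unchanged by a positive rescaling of either variable.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def log_pearson(attention, migration_rate):
    """Correlate two strictly positive series after a natural-log transform."""
    return pearson_r(
        [math.log(v) for v in attention],
        [math.log(v) for v in migration_rate],
    )

# Toy values (invented): one perfectly increasing, one perfectly decreasing pair.
r_up = log_pearson([1.0, 10.0, 100.0], [2.0, 20.0, 200.0])
r_down = log_pearson([1.0, 10.0, 100.0], [100.0, 10.0, 1.0])
```

For the authority metric, the second argument would be the combined rate \(m_{ROW} + m_{BR}\) rather than either rate alone.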

5 Predicting Migration from Attention

We model the number of migrants as a linear combination of the five attention metrics. This is what we call Model 1:

$$\begin{aligned} \begin{array}{r} log(MigrantsNumber_i) = \alpha + \beta _{1} \cdot log(ROW^{repost}_{i}) + \\ \beta _{2} \cdot log(ROW^{followers}_{i}) + \beta _{3} \cdot log(BR^{repost}_{i}) + \\ \beta _{4} \cdot log(BR^{followers}_{i}) + \beta _{5} \cdot log(Authority_{i}) + \\ \epsilon _{i} \end{array} \end{aligned}$$
(1)

We also build a model to account for the pairwise interaction effects between indicators:

$$\begin{aligned} \begin{array}{r} log(MigrantsNumber_i) = \alpha + \beta _{1} \cdot log(ROW^{repost}_{i}) + \\ \beta _{2} \cdot log(ROW^{followers}_{i}) + \beta _{3} \cdot log(BR^{repost}_{i}) + \\ \beta _{4} \cdot log(BR^{followers}_{i}) + \beta _{5} \cdot log(Authority_{i}) + \\ \gamma _{m} \cdot Interactions_{im} + \epsilon _{i} \end{array} \end{aligned}$$
(2)

where \(Interactions_{im}\) accounts for the pairwise interactions among the five attention metrics. This is Model 2 (Table 2).
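To make the predictors of the interaction model concrete, the sketch below builds one city's feature vector: the five log-transformed metrics plus one term per pair of metrics. The source does not spell out how each interaction term is computed; the product of the two log-transformed metrics used here is a common convention and is an assumption, as are the metric names and the function name `model2_features`.

```python
import math
from itertools import combinations

# Hypothetical keys for the five attention metrics of a city.
METRICS = ["row_repost", "row_followers", "br_repost", "br_followers", "authority"]

def model2_features(city_metrics):
    """Return log-transformed main effects plus pairwise interaction terms.

    city_metrics: {metric_name: strictly positive value} for one city.
    Assumption: an interaction term is the product of two log-metrics.
    """
    logs = {name: math.log(city_metrics[name]) for name in METRICS}
    features = dict(logs)
    for first, second in combinations(METRICS, 2):
        features[f"{first}:{second}"] = logs[first] * logs[second]
    return features

# Toy city (invented values).
example = model2_features({
    "row_repost": 2.0,
    "row_followers": 3.0,
    "br_repost": 5.0,
    "br_followers": 7.0,
    "authority": 0.5,
})
```

Five metrics give ten distinct pairs, so under this assumption the model has 15 predictors in addition to the intercept, which is worth keeping in mind given that only 35 cities are available for fitting.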

To account for Internet penetration rates and population, we build a model that adds these two census variables:

$$\begin{aligned} \begin{array}{r} log(MigrantsNumber_i) = \alpha + \beta _{1} \cdot log(ROW^{repost}_{i}) + \\ \beta _{2} \cdot log(ROW^{followers}_{i}) + \beta _{3} \cdot log(BR^{repost}_{i}) + \\ \beta _{4} \cdot log(BR^{followers}_{i}) + \beta _{5} \cdot log(Authority_{i}) + \\ \mu _{i} Internet_i + \rho _{i} Population_i + \\ \gamma _{m} Interactions_{im} + \epsilon _{i} \end{array} \end{aligned}$$
(3)

where \(Internet_i\) is the city’s Internet penetration rate, \(Population_i\) is the city’s population, and \(\epsilon _i\) is the error term. This is Model 3. We control for Internet penetration because it is associated with online activity, and for city size because larger cities tend to be economically prosperous and enjoy “increasing returns to scale”: a city becomes more attractive as it grows [12].

Fig. 5. Predicted values versus actual values calculated by Model 2 (\(Adj.\) \(R^2\)=0.61) that includes the five attention metrics and their pairwise interactions. The model’s prediction error is low: its Mean Absolute Error is 0.21.

Table 2. \(Adj.\) \(R^2\) for different models predicting city \(i\)’s number of migrants. Model 1’s predictors are the five attention metrics \(Attention_{im}\), Model 2 adds their interaction effects, Model 3 controls for the city’s Internet penetration rates and population. All \(p\)-values are \(<0.001\).

By computing the beta coefficients of Model 2, the one with the best performance (without census data), we find that cross-border attention in terms of followers accounts for 22 % of the model’s explanatory power, while cross-border attention in terms of reposts explains 18 %. \(Authority\) attention, instead, only explains 7 % of the variance. As for Model 2’s accuracy, the model achieves a Mean Absolute Error (MAE) of 0.21 on a logarithmic scale, where the minimum value is 2.6 and the maximum is 5.23, meaning that, on average, the model predicts the log of the number of migrants within 1.16 % of its true value. Figure 5 plots the values predicted by Model 2 against the actual ones. Rio de Janeiro, one of the most international Brazilian cities, is one outlier for which the number of migrants is higher than the predicted value.
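The two goodness-of-fit figures quoted for the models, the adjusted \(R^2\) and the MAE, follow from their textbook definitions. The sketch below is a generic illustration with hypothetical function names, applied to log-scale actual and predicted values; it is not the authors' evaluation code.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def adjusted_r2(actual, predicted, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Requires more observations than predictors plus one (n - p - 1 > 0).
    """
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

# Toy log-scale values (invented for illustration only).
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.0, 3.0, 4.0, 4.5]
mae = mean_absolute_error(actual, predicted)
adj = adjusted_r2(actual, predicted, n_predictors=1)
```

The adjustment penalizes additional predictors, which is why it is the appropriate figure for comparing models with different numbers of terms fitted on the same 35 cities.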

6 Related Work

Real-life Processes and Social Media. Email exchanges have been used to track migration flows among developed and developing countries [26]. Also, Quercia et al. have shown a correlation between the sentiment expressed in tweets originated by residents of London neighborhoods and the neighborhoods’ well-being [22].

In the last few years, several initiatives have emerged for measuring the socio-economic conditions of city residents in developing countries using online data. For example, the United Nations and the World Bank have recently launched a program called “Data4Good”. This promotes the use of (currently untapped) digital data for, say, improving poverty measurement (“How can we measure poverty more often and more accurately?”) or dealing with corruption in international investment projects (“Can we detect fraud by looking at aid data?”). Recently, Orange released an anonymized dataset of mobile phone calls in Côte d’Ivoire, and launched a challenge in which researchers had to predict economic indicators from the activity metrics extracted from the call records [17]. Our research complements this line of work by proposing a set of metrics that can be applied to data extracted from any data source that reflects social exchanges, including social media data.

Migration. Davis et al. [8] conducted a study of human mobility using data published by the World Bank. They built a network of countries based on migration flows, and found that the most well connected countries remain stable over time and that migration is directed towards low and mid degree countries.

7 Conclusion

We have shown that online metrics are effective at predicting the number of migrants in a city. These metrics are particularly useful in developing countries, where economic changes happen at a fast pace. As part of future work, we will study socio-economic indicators other than migration rates, starting with GDP and social capital.