1 Introduction

Despite the widespread use of Twitter globally (328 million monthly active users as of the first quarter of 2017), little research has investigated the differences amongst users of various languages; there is a tendency to assume that the behaviours of English users generalise to users of other languages [1]. Language has featured as a facet of research on the geographies of Twitter networks [2, 3], especially whether offline geography still matters in online social networks [4]. Linguistically inspired studies have been performed on hashtags [5], as well as on the volume and proportion of tweets in English and Arabic, as part of an analysis of the Arab Spring [6]. Nevertheless, language is clearly a vital component of affiliation and discourse on the web [7, 8], with the creation and curation of emerging multilingual networks and communities that represent well-established creative and cultural norms, including for minority languages such as Welsh [9], as well as investigations into the economics of linguistic diversity [10].

In the social network analysis domain, centrality measures such as degrees, betweenness, clustering coefficient, modularity and cliques have been used in various projects to measure influence or detect the emergence of new communities [11, 12]. These measures provide the ability to assess network graphs that are constructed from collected data (for example, tweets). Selection of these centrality measures is dependent on the goal of the analysis; for example, the degree of a node helps to identify nodes with high number of connections within the network [13,14,15]. In a representation of a real-world network, this metric may help to identify highly connected persons, such as political leaders, sports stars or celebrities, who are potential “information spreaders” [16,17,18].

Clustering users into communities has been an important factor in social network analysis, with a particular focus on clustering users based on their locations. However, for the sake of anonymity, many users tend not to disclose information about their identity, such as their location [19]; for Twitter, it has also been reported in the literature that geotagged tweets are generally low in number [20,21,22]. At the same time, the exponential growth in social media over the past decade has been accompanied by the rise of location as a central organising theme [23] of how users engage with online information services and, more importantly, with each other [24,25,26]. The work here examines the correlation between the multilingualism of users and their associated activity.

The remainder of this paper is organised as follows: in Sects. 2 and 3, we introduce the methodology and key language themes; Sect. 4 presents the 2016 Eurovision Song Contest case study, along with an analysis of the key data and results; Sect. 5 concludes the paper with a wider discussion and a summary of the potential application of our approach.

2 Methodology

The primary purpose of this study is to identify and define an extensible analytical approach for examining language use, communities, and diversity on Twitter. The approach is based on network graphs and their properties, such as indegree, outdegree, and edge weights. Graphs are generated from the language settings in users’ profiles and the languages of their statuses. First, we construct a user graph to analyse interactions and multilingualism at the level of individual users. Then, from the user graph, we produce a language communities graph that groups users based on common languages.

3 Language Entities

To generate the required graphs, we need three essential entities from each status: user ID, user profile language, and status language. Those values can be extracted from status['user']['id'], status['user']['lang'], and status['lang'], respectively. It is important to note that the focus of this work is on the analytical approach, not necessarily the accuracy of language detection; therefore, we assume that the language of each tweet is correctly identified. For profiles, users are expected to pick a language for their settings. Nevertheless, their language entity may show as the initial placeholder text “Select Language...” or a translated version of it, which may itself provide information about the user’s native language community.
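As an illustration of the extraction step, the following is a minimal sketch that assumes each status is available as a parsed JSON object (a Python dict) with the three fields named above. The helper name `extract_entities` and the sample status are hypothetical and not part of the original study.

```python
from typing import Any, Dict, Optional, Tuple


def extract_entities(
    status: Dict[str, Any],
) -> Tuple[Optional[int], Optional[str], Optional[str]]:
    """Return (user ID, profile language, status language) for one status.

    Missing fields are returned as None instead of raising an error.
    """
    user = status.get("user") or {}
    return user.get("id"), user.get("lang"), status.get("lang")


# Hypothetical status, trimmed to the three fields of interest.
example_status = {"lang": "ar", "user": {"id": 3, "lang": "en"}}
print(extract_entities(example_status))  # (3, 'en', 'ar')
```

Returning `None` for absent fields, rather than raising, is a design choice of this sketch; it lets incomplete statuses be filtered out in a later step.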

3.1 Network Graphs

For this study, we need to generate two different graphs: one is based on individual users and their posting activity, while the other combines users into language communities. In the context of this study, all graphs must be directed to provide correct measurements, as demonstrated in Fig. 1.

Fig. 1. Examples of simple models of language graphs

User Graph. This graph represents the core structure for our analysis. As shown in Fig. 1(a), nodes in this graph are of two types: users and posting languages. Each posted tweet results in two nodes. One represents the user, with the profile language setting added to the node as the attribute ‘{profile_lang:xx}’; the other represents the language of the tweet. Edges link users with the posting languages they used, and the edge weight (thickness) measures the number of tweets posted by the user (the starting node) in the target language (the ending node). In the example above, the profile language setting for user ‘03’ is ‘en’, and they posted three tweets in ‘en’ and three in ‘ar’ (Arabic). This graph will be referred to as the user graph.
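The construction of the user graph can be sketched as follows. This is an illustrative sketch only: it stores the graph as a plain dictionary of weighted edges rather than using a graph library, and the function name `build_user_graph` and the record layout are assumptions. The sample records reproduce the activity described for user ‘03’.

```python
from collections import Counter


def build_user_graph(records):
    """Build a user graph from (user_id, profile_lang, status_lang) tuples.

    Returns (profile_lang_by_user, edge_weights), where edge_weights maps a
    directed edge (user_id, status_lang) to the number of tweets on it.
    """
    profile_lang_by_user = {}
    edge_weights = Counter()
    for user_id, profile_lang, status_lang in records:
        # Node attribute: the user's profile language setting.
        profile_lang_by_user[user_id] = profile_lang
        # One more tweet on the edge user -> posting language.
        edge_weights[(user_id, status_lang)] += 1
    return profile_lang_by_user, edge_weights


# User '03' with an 'en' profile: three tweets in 'en', three in 'ar'.
records = [("03", "en", "en")] * 3 + [("03", "en", "ar")] * 3
profile_lang_by_user, edge_weights = build_user_graph(records)
print(profile_lang_by_user["03"])   # en
print(edge_weights[("03", "en")])   # 3
print(edge_weights[("03", "ar")])   # 3
```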

Communities Graph. This second graph is derived from the user graph and has one type of node, representing a language community, as shown in Fig. 1(b). For each user node, we generate one node from the ‘{profile_lang:xx}’ attribute and another node from each posting language to which the user is connected. This combines all users of the same profile language into one node, with an edge connecting it to each posting language and the edge weight measuring their activity. Theoretically, each tweet results in two language nodes: one for the user profile and the other for the language of the tweet. In our example above, users with ‘fr’ (French) profiles have generated six tweets in ‘en’. In the case of the ‘ar’ node, we can see that users with ‘ar’ as their profile language have posted four tweets, all in ‘ar’; in graph terminology, this is referred to as a ‘self-loop’. We will refer to this graph as the communities graph.
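Deriving the communities graph from the user graph amounts to re-keying each edge by the user's profile language and summing the weights. The sketch below assumes the user graph is held as a dictionary of weighted edges; the function name and the toy data are hypothetical. The toy data is chosen to agree with the totals quoted in the text (six ‘fr’ to ‘en’ tweets, four ‘ar’ to ‘ar’ tweets, and user ‘03’ as described), but the split of tweets between the two ‘fr’ users is an assumption.

```python
from collections import Counter


def build_communities_graph(profile_lang_by_user, user_edges):
    """Collapse user nodes into profile-language community nodes.

    user_edges maps (user_id, posting_lang) to a tweet count; the result maps
    (profile_lang, posting_lang) to the summed tweet count.
    """
    community_edges = Counter()
    for (user_id, posting_lang), weight in user_edges.items():
        community_edges[(profile_lang_by_user[user_id], posting_lang)] += weight
    return community_edges


# Toy user graph (per-user split for the 'fr' users is assumed).
profile_lang_by_user = {"01": "fr", "02": "fr", "03": "en", "04": "ar"}
user_edges = {
    ("01", "en"): 2,
    ("02", "en"): 4,
    ("03", "en"): 3,
    ("03", "ar"): 3,
    ("04", "ar"): 4,
}

community_edges = build_communities_graph(profile_lang_by_user, user_edges)
print(community_edges[("fr", "en")])  # 6
print(community_edges[("ar", "ar")])  # 4  (a self-loop)
```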

Throughout the paper, we refer to language communities in two ways: a profile community is a language as it appears in user profile settings, whereas a posting community is a language as it appears in tweets.

3.2 Measures

In this section, we discuss how graph measures can be used to make deductions about users, their associated language communities, posting language activity, and how different language communities are linked to each other. These measures and their interpretations in the context of this study are as follows:

  • Indegree: number of incoming edges;

  • Outdegree: number of outgoing edges;

  • Edge Weight: number of tweets on edge;

  • Weighted indegree: total weights of incoming edges;

  • Weighted outdegree: total weights of outgoing edges.
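The five measures listed above can be computed in a single pass over a weighted edge list. The following is a minimal sketch, assuming a graph stored as a dictionary from (source, target) to tweet count; the function name and the toy edges are illustrative, with the per-user split for users ‘01’ and ‘02’ being an assumption.

```python
from collections import defaultdict


def degree_measures(edges):
    """Compute degree measures for every node of a directed, weighted graph.

    edges maps (source, target) to an edge weight (number of tweets).
    A self-loop counts once towards both the indegree and the outdegree.
    """
    measures = defaultdict(lambda: {
        "indegree": 0,
        "outdegree": 0,
        "weighted_indegree": 0,
        "weighted_outdegree": 0,
    })
    for (source, target), weight in edges.items():
        measures[source]["outdegree"] += 1
        measures[source]["weighted_outdegree"] += weight
        measures[target]["indegree"] += 1
        measures[target]["weighted_indegree"] += weight
    return dict(measures)


# Toy user graph: user IDs are sources, posting languages are targets.
user_edges = {
    ("01", "en"): 2,
    ("02", "en"): 4,
    ("03", "en"): 3,
    ("03", "ar"): 3,
    ("04", "ar"): 4,
}
measures = degree_measures(user_edges)
print(measures["03"]["outdegree"], measures["03"]["weighted_outdegree"])  # 2 6
print(measures["en"]["indegree"], measures["en"]["weighted_indegree"])    # 3 9
```

The sketch relies on user IDs and language codes never sharing the same string; with real data, tagging each node with its type would avoid accidental collisions.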

User Graph Properties. User nodes have indegree = 0, and posting language nodes have outdegree = 0; these two properties are used to distinguish between the node types. Both the outdegree of user nodes and the indegree of posting languages must be greater than 0. The edge weight indicates the number of tweets associated with both end nodes. Referring to the example in Fig. 1(a), we can see that user ‘03’ has an indegree of 0 (identifying a user node), an outdegree of 2 (the number of languages they used), and a weighted outdegree of 6 (the total number of tweets posted). In the same figure, the ‘en’ posting language has an outdegree of 0 (identifying a language node), an indegree of 3 (the number of users who posted in this language), and a weighted indegree of 9 (the total number of tweets posted). Table 1 presents the main properties of this graph.

 

Table 1. Node properties in user graph

Table 2. Node properties in communities graph

Communities Graph Properties. As discussed in Sect. 3.1, this graph is extracted from the user graph and contains one type of node: language community nodes. Nodes in this graph represent languages as profile language settings, posting languages, or both. However, as the graph is directed, we can identify whether a community node is for profiles or posts by measuring its indegree and outdegree. A positive indegree implies a posting language, and a positive outdegree indicates a profile language setting. Figure 1(b) shows three language communities: two nodes appear as both posting and profile nodes, while one node exists as a profile-only node. The node ‘ar’, for example, has an outdegree of 1 and an indegree of 2. In other words, at least one user has their profile language set to ‘ar’, and at least two users have posted in ‘ar’. In terms of edge weights, seven tweets were posted in ‘ar’, originating from two different profile language communities. The ‘fr’ node has only an outdegree, which means this language community exists as a profile-only node, as no user posted in ‘fr’. These measures are summarised in Table 2.
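The role of each community node follows directly from which side of an edge it appears on. The sketch below is illustrative and assumes every stored edge has a positive weight; the function name, the role labels, and the toy edges (chosen to match the totals described for Fig. 1(b)) are not taken from the original study.

```python
def classify_communities(community_edges):
    """Label each language as 'profile-only', 'posting-only' or 'both'.

    community_edges maps (profile_lang, posting_lang) to a positive tweet
    count. A language with outdegree > 0 is a profile community; one with
    indegree > 0 is a posting community.
    """
    profile_langs = {source for source, _ in community_edges}
    posting_langs = {target for _, target in community_edges}
    roles = {}
    for lang in profile_langs | posting_langs:
        if lang in profile_langs and lang in posting_langs:
            roles[lang] = "both"
        elif lang in profile_langs:
            roles[lang] = "profile-only"
        else:
            roles[lang] = "posting-only"
    return roles


community_edges = {
    ("fr", "en"): 6,
    ("en", "en"): 3,
    ("en", "ar"): 3,
    ("ar", "ar"): 4,
}
roles = classify_communities(community_edges)
print(roles["fr"])  # profile-only
print(roles["ar"])  # both
```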

4 Case Study and Discussion

In our case study, we analyse a dataset collected from the #Eurovision hashtag during the 2016 Eurovision Song Contest, using the techniques presented in Sects. 2 and 3. Using the user graph and communities graph, we analyse multilingualism, activity, and user behaviour when posting in different languages.

4.1 Case Study: 2016 Eurovision Song Contest

The 2016 Eurovision Song Contest took place in May in Stockholm, Sweden, with the motto “Come Together!”. There were 42 countries taking part, with two semi-finals on 10 and 12 May and 26 countries competing in the final on 14 May. This year’s contest was perceived by many commentators to be tense and politically motivated, especially with Ukraine eventually winning the final [27]. Varying analyses see the contest as being influenced by political conflicts, friendships or cultural bias [28,29,30,31], with a range of news articles explicitly discussing the possibly biased results [32]. Twitter activity was very high throughout the event on the primary #Eurovision hashtag, with close to 8 million statuses produced by nearly 1.25 million users.

The study focuses on original statuses (tweets) as the basic entity, as we wish to measure posting behaviour, not reactions. Preliminary analysis shows that original statuses account for 48% of the total activity; of these, the 4% of tweets with an ‘unidentified’ language were eliminated. As for profiles, all users had chosen a language preference, and no profile was found with the default language settings.
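A filtering step of this kind might look like the sketch below. It rests on several assumptions that the text does not confirm: that statuses are dicts in the Twitter REST API v1.1 layout, that a retweet carries a `retweeted_status` field, that a reply has a non-null `in_reply_to_status_id`, and that an unidentified language is reported as `und`. The function names are hypothetical.

```python
def is_original(status):
    """True if the status is neither a retweet nor a reply (assumed layout)."""
    is_retweet = "retweeted_status" in status
    is_reply = status.get("in_reply_to_status_id") is not None
    return not is_retweet and not is_reply


def usable_statuses(statuses, unidentified="und"):
    """Keep original statuses whose language was identified."""
    return [
        status for status in statuses
        if is_original(status) and status.get("lang") not in (None, unidentified)
    ]


# Hypothetical statuses, trimmed to the fields used by the filter.
statuses = [
    {"id": 1, "lang": "en"},                               # kept
    {"id": 2, "lang": "en", "retweeted_status": {}},       # retweet
    {"id": 3, "lang": "es", "in_reply_to_status_id": 1},   # reply
    {"id": 4, "lang": "und"},                              # unidentified
]
print([status["id"] for status in usable_statuses(statuses)])  # [1]
```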

4.2 Multilingualism

The outdegree in the user graph shows the number of languages a user used; observing the outdegree of user nodes in the user graph revealed 20 distinct outdegree groups, ranging from 1 to 25. Figure 2 shows these groups, together with their numbers of users and their activity. Although 85% of users are monolingual, their activity accounts for only 47% of all tweets. Additionally, while the average activity is five posts per user, monolingual users were the least active, with an average of two tweets per user. We found that 18% of tweets were posted in a language different from the user’s profile language, with a strong correlation between multilingualism and the likelihood of posting in a language other than the profile language.
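Grouping users by outdegree, as summarised in Fig. 2, can be sketched as follows. This is an illustrative sketch over a toy user graph, not the study's dataset; the function name and the per-user tweet counts are assumptions.

```python
from collections import Counter, defaultdict


def multilingualism_groups(user_edges):
    """Group users by the number of languages they posted in (outdegree).

    user_edges maps (user_id, posting_lang) to a tweet count. The result maps
    each outdegree value to the number of users and tweets in that group.
    """
    languages_per_user = Counter()
    tweets_per_user = Counter()
    for (user_id, _lang), weight in user_edges.items():
        languages_per_user[user_id] += 1      # outdegree
        tweets_per_user[user_id] += weight    # weighted outdegree
    groups = defaultdict(lambda: {"users": 0, "tweets": 0})
    for user_id, n_languages in languages_per_user.items():
        groups[n_languages]["users"] += 1
        groups[n_languages]["tweets"] += tweets_per_user[user_id]
    return dict(groups)


user_edges = {
    ("01", "en"): 2,
    ("02", "en"): 4,
    ("03", "en"): 3,
    ("03", "ar"): 3,
    ("04", "ar"): 4,
}
groups = multilingualism_groups(user_edges)
print(groups[1])  # {'users': 3, 'tweets': 10}
print(groups[2])  # {'users': 1, 'tweets': 6}
```

In this toy graph the monolingual group (outdegree 1) holds three of the four users but averages fewer tweets per user than the single bilingual user, which mirrors the direction of the pattern reported above.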

Fig. 2. Multilingual communities on #Eurovision and their associated activities.

We used the user graph to generate two communities graphs; the first will be used to explore language communities amongst monolingual users, while the other includes language communities for multilingual users only.

4.3 Monolingual Communities

This graph includes 63 language communities: 15 languages exist as profile-only and have not been used in any post, while 12 were used in posting but never appear as a profile language. Moreover, about 13% of monolingual users posted in a language other than their profile language; these posts form 10% of tweets in monolingual communities. Hence, the strongest relationships exist as self-loops, as discussed in Sect. 3.1.

To explore the relationships between language communities, we remove all self-loop edges from the graph. The resultant graph shows that monolingual users with ‘en’ as their profile language posted in 47 other languages, accounting for 43% of tweeting activity, and that 48 other profile communities posted in ‘en’, also accounting for 43%. We also found that the strongest relationship (edge weight), at 9% of activity, occurs when ‘en’ profiles post in ‘es’ (Spanish). A further interesting case involves the ‘el’ (Greek) and ‘ru’ (Russian) languages. Although more than twice as many profile communities posted in ‘ru’ as in ‘el’, their activity was significantly lower.
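Removing self-loops and locating the strongest remaining relationship can be sketched as below. The toy communities graph and the function name are hypothetical, and the share is computed relative to cross-language activity only, which is one possible reading of the percentages reported above.

```python
def without_self_loops(community_edges):
    """Drop edges whose profile language equals the posting language."""
    return {
        (profile_lang, posting_lang): weight
        for (profile_lang, posting_lang), weight in community_edges.items()
        if profile_lang != posting_lang
    }


community_edges = {
    ("fr", "en"): 6,
    ("en", "en"): 3,
    ("en", "ar"): 3,
    ("ar", "ar"): 4,
}
cross_edges = without_self_loops(community_edges)

# Strongest relationship: the cross-language edge with the largest weight.
strongest = max(cross_edges, key=cross_edges.get)
share = cross_edges[strongest] / sum(cross_edges.values())
print(strongest)        # ('fr', 'en')
print(round(share, 2))  # 0.67
```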

Fig. 3. Language communities graphs for #Eurovision.

4.4 Multilingual Communities

Although multilingual users form 15% of all users in the dataset, they generated 53% of tweeting activity. There are 48 language communities in this graph, with 13 languages as profile-only and 10 as posting languages. With self-loop edges excluded, activity in other languages amounted to 24% of multilingual users’ tweets. We also found that the strongest relationship exists between the ‘es’ profile community and the ‘en’ posting language, which is the opposite of the monolingual case.

4.5 Visualisation

In Fig. 3, we present two communities graphs. Node size represents the weighted indegree of a community (how much a language was used in tweeting), and node darkness reflects its weighted outdegree (participation by users of that language community). Edges link profile communities to posting communities, and their thickness indicates the number of tweets posted.

Whilst Fig. 3(a) shows all language communities together, Fig. 3(b) presents a filtered graph, which depicts relationships amongst language communities that scored high in weighted indegree and outdegree. To produce it, we also eliminated users with activity lower than the overall average (five tweets per user) and generated the communities graph from the remainder.
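The activity filter applied before regenerating the communities graph can be sketched as follows. This is an illustrative sketch: the function name, the `min_tweets` parameter, and the toy data are assumptions, and users exactly at the average are kept, since only users below it are described as eliminated.

```python
from collections import Counter


def filter_active_users(user_edges, min_tweets=None):
    """Keep edges of users whose total activity is at least min_tweets.

    user_edges maps (user_id, posting_lang) to a tweet count. If min_tweets
    is None, the overall average number of tweets per user is used.
    """
    totals = Counter()
    for (user_id, _lang), weight in user_edges.items():
        totals[user_id] += weight
    if not totals:
        return {}
    if min_tweets is None:
        min_tweets = sum(totals.values()) / len(totals)
    return {
        (user_id, lang): weight
        for (user_id, lang), weight in user_edges.items()
        if totals[user_id] >= min_tweets
    }


user_edges = {
    ("01", "en"): 2,
    ("02", "en"): 4,
    ("03", "en"): 3,
    ("03", "ar"): 3,
    ("04", "ar"): 4,
}
# Average activity here is 16 tweets / 4 users = 4, so user '01' is dropped.
active_edges = filter_active_users(user_edges)
print(sorted({user_id for user_id, _ in active_edges}))  # ['02', '03', '04']
```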

5 Conclusions

This paper has presented an extensible approach for identifying interactions within language communities, using a high-profile real-world case study, the 2016 Eurovision Song Contest, and its associated engagement and interactions on Twitter. This approach utilises network graph properties to explore the behaviour of monolingual and multilingual users. Surprisingly, even though monolingual users formed the largest proportion of users, they were less active than multilingual users. The results also confirmed that a higher degree of user multilingualism implies greater distance from the profile language. In a profile community, a large number of participants does not necessarily imply high language diversity, as a single post in another language is enough to take the community to a higher level of multilingualism. Therefore, filtering out users with low activity would improve measurement accuracy. In a few cases, we observed users posting in a large number of languages, up to 25. Such extreme cases may be interesting to investigate for possible spammer or false account detection, while more moderate cases (e.g. 2–5 languages) may be of interest for sociolinguistics.

The graph measures of users may be useful in confirming their association with a language community, without the need to crawl their entire Twitter timeline. Although language settings for user profiles may indicate interface preference only, we found persistent activity in the profile language across users, especially monolingual users.

A possible scenario for governments, politicians or campaigners would be to use this method to measure to what extent other languages are used within a profile community. It may also show how users associate themselves with one community in their profile while using other languages. Monitoring unusual activity for secondary languages may help to uncover important messages or opinions that could not be openly expressed, for a variety of reasons, to the rest of the profile community. This framework may also be extended to measure reactions via retweeting and replying using a variety of natural language processing and sentiment analysis techniques [33,34,35], to provide a different perspective for influence analysis.