Keywords

1 Introduction

One of the most popular social networking media, Twitter, is now emerging as a platform to share what users feel, experience and comment both on personal and nation’s current affairs with 280 characters per tweet. Almost all tweets contain some personal value when tweeted or re-tweeted by the users. Though sharing personal information in social media gives pleasure, the extent to which the message travels have no bounds or could not be limited as the user believes. Also, the users feel free to publish their opinion and thoughts online than sharing it in-person because this networking platform provides a wide range of audience which is both undefined and unlimited.

Gathering private information in this digital era has become much easier because of the advancements in modern communication technologies. Tommasel and Godoy (2016) stated that the access to social media posts are also widely open to every user irrespective of their age, location, expertise and professional community boundaries which leads to leaks in Personally Identifiable Information (PII) either knowingly or unknowingly. The publicly available personal information is at high risk of potentially being misused by cyber criminals eventually resulting in diverse effects like identity theft, child trafficking, business loss and professional black marks. A lot of issues related to privacy are arising in social media and plenty of privacy protection technologies are designed for users which tend to be independent and application specific. Such privacy protection mechanisms usually aim at protecting the information and control information disclosure to unintended audiences.

In-spite of all, the protection mechanisms available for both online and offline, there is a drastic increase in cyber-crimes. Therefore, some kind of prevention strategies should be adopted in the user’s side to withstand and overcome such cyber-crimes. One such strategy is proposed in this research work by highlighting the importance of text that users post on social media. Users should be ethically aware of what content is presented to whom, who all are accessible to what all contents and what contents should be excluded in the post. The authors define the sensitivity of a user tweet based on many existing researches and user experiences.

Sensitivity

In context with this research work, sensitivity in a social media post, say Twitter, should be broadly categorized into three domains namely personal, professional and health-oriented information shared publicly in a social media platform. The sensitivity in a user message can be referred to “a private identity or incriminating information disclosed by either the data owner itself or others without the consent or knowledge of the data owner”. Here, the data owner can signify to a person or organization or community who can be identified or traced through the sensitive content posted by the social media user.

Hence the task of sensitive text classification is integrated for identifying the tweet sets, text categories and topics, taking text analytics to the next level as described by Hu et al. (2012). The objective of the research work is to formulate better tweet pre-processing rules, impose exact words for internet jargons and abbreviations in tweets, identify the sensitive keywords for personal, professional and health related tweets by modeling a framework called Sensitive Topic Keyword Extractor.

The major contributions of this research work are listed as follows:

  1. 1.

    To build an Automatic Acronym-Abbreviation Replacer for twitter community

  2. 2.

    To determine what is sensitive in personal, professional and health tweet domains

  3. 3.

    To generate unique topic models using LDA for personal, professional and health tweet domains

  4. 4.

    To investigate sensitive keywords that co-occur with the cyber-keywords in personal, professional and health tweet domains

2 Related Work

2.1 Twitter and Privacy Concerns

Lu et al. (2015) defined two privacy attributes namely confidentiality and universality with a set of 53 keywords. The privacy and security levels of Internet of Things (IoT) and Cyber Physical Systems (CPS) are classified by the Privacy Information Security Classification (PISC Model). Various challenges that are related to detecting privacy related information from unstructured texts are discussed by Sleeper et al. (2013) using natural language processing, ontology and machine learning approaches. It also deals with the problem of identifying the user’s perception, domain specificity, context dependence, privacy sensitivity classification, data linkages in the social media messages. Wang et al. (2011) conducted a survey for capturing regrets that were caused by misunderstanding, misuse, misjudging of online posts and discovered the mechanisms preferred by the users to overcome the regrets such as apology, decline or delay the request, etc.

2.2 Twitter Topic Identification

The research work carried out by Aphinyanaphongs et al. (2014) presented a feasibility study of identifying and inferring the usage patterns and most frequent places where users posts alcohol related tweets using various classification techniques such as support vector machines, Naive Bayes, random forest and Bayesian logistic regression. Since natural language processing tools cannot be directly applied to social media texts, it is necessary to modify the extracted data so that it resembles standard texts. Baldwin et al. (2013) used data from various sources like Twitter, YouTube and Facebook comments, forums, blogs, Wikipedia and British National Corpus (BNC) to build corpus and performed lexical analysis, grammatical checks, language modeling and corpus similarity. A novel, unsupervised approach for detecting topic on twitter by using Formal Concept Analysis (FCA) is presented by Cigarrán et al. (2016). To handle the adverse effects of feature sparsity encountered by Tesfay et al. (2016) and curse of dimensionality in short text message classification, Lei et al. (2012) introduced the concept of boosting called LDABoost designed with the Naïve Bayes algorithm as a weak classifier. To improve the topic learning in social media messages, Mehrotra et al. (2013) proposed a novel approach of pooling and aggregating tweets in the tweet pre-processing step carried out for LDA. Whereas to learn the topic and behavior interests of the users, Qiu et al. (2013) introduced Behavior LDA (B-LDA) model using Twitter messages.

3 Methodology

The proposed Sensitive Topic Keyword Extractor analyses large volume of user generated tweets collected using the Twitter Streaming API for 68 cyber-keywords grouped under three categories namely personal, professional and health domains as derived from the findings of Lu et al. (2015) and Mao et al. (2011). The raw tweets are cleaned and prepared with pre-processing rules resulting with tweets suitable for text classification and analysis. The data annotation process for identifying sensitive tweets was performed using the most popular and reliable crowd sourcing platform called Amazon Mechanical Turk (AMT).

The LDA, being the most powerful probabilistic model for text classification is opted for identifying the topic keywords for each tweet category. The objective of the Sensitive Topic Keyword Extractor is to identify the next level of sensitive keywords that would cause potential vulnerability when used with the benchmarked cyber-keywords. The sensitive keywords that occur with the cyber-keywords are extracted using the Variational Expectation Maximization (VEM), VEM-Fixed and Gibbs model by filtering the non-cyber keywords present in keyword extraction process for various topics. Identifying such sensitive keywords will help users in realizing the potential threat caused by posting information related to those keywords. A detailed architecture of the proposed system, Sensitive Topic Keyword Extractor is represented in Fig. 1.

Fig. 1.
figure 1

Architecture of the proposed system – Sensitive Topic Keyword Extractor

3.1 Tweet Cleaning and Pre-processing

The tweets are short text messages posted by users without any formal template or sentence structure. The tweets tend to have short forms, internet jargons, abbreviations, symbols, numbers, date and time, email, URLs and other non-dictionary words as mentioned by Sproat et al. (2001). Therefore, it is necessary to wisely handle such inappropriate content and perform data transformation operations as listed in Table 1. During the process of cleanup, it is necessary to replace acronyms with corresponding abbreviations. Hence a dedicated list of acronyms and abbreviations is prepared with respect to twitter slang and internet jargons which will be used to handle the automatic replacement of short forms in the tweet.

Table 1. A list of pre-processing rules used by Sensitive Topic Keyword Detector.

3.2 Topic Keyword Detector Using LDA Model

The pre-processed tweets are taken to construct the term-document matrix which maps the words to its importance in the tweet and the overall tweets set under each category say personal, professional and health. There are basically three features considered for text classification namely Feature Frequency (FF), Term Frequency – Inverse Document Frequency (TF-IDF), Feature Presence (FP). The LDA model, an unsupervised probabilistic model generally combines topic, context and word through probability by focusing only words. The topics and keywords are extracted based on two parameters namely per document distribution and per topic word distribution. The LDA model generally has three steps in detecting topics. Firstly, the model identifies the term distribution β for each topic specified. Secondly, the proportion of each topic is computed for each tweet. Thirdly, for every word wi in tweet t, a multinomial function is selected for each topic associated with that word; a word wi is selected from that multinomial function based on the probability distribution for each topic. The domain-wise sensitive keywords are extracted by setting a threshold value Øtc for each topic in a tweet category, the keywords are selected based on the conditional probability values. The results showed a set of well associated domain-wise sensitive keywords for all three models taken into consideration namely VEM, VEM_Fixed and Gibbs model used for topic learning and detection. The VEM is similar to Expectation Maximization technique with the difference of iteratively performing the expectation step until the likelihood becomes computationally intractable.

4 Results and Discussions

4.1 Dataset

The benchmarked cyber-keywords pre-defined by Lu et al. (2015) comprised of 23 keywords for personal topics, 22 keywords for professional topic and 7 for health topics. Since there are only 7 cyber-keywords related to health, an additional set of 16 health-related cyber-keywords were added by filtering out the most frequently used health keywords in internet as mentioned by Geetha et al. (2019). Finally, a list of 68 cyber-keywords are defined and used for tweet collection. The authors avoided manual inclusion of keywords in the cyber-keyword list due to the factor that it might lead in compromising the quality and relevance of data collected for sensitive data analysis. A large volume of real time tweets were collected that counted to 800900 tweets. This collection contributed about 36% of personal tweets, 32% of professional tweets and 32% of health-related tweets. As depicted in Fig. 2, on building the corpus data for analysis, the authors chose to consider only the original tweets by eliminating the retweets from the data repository. The tweet corpus built comprised of 35%, 33% and 32% of the total tweets collected for personal, professional and health related tweets.

Fig. 2.
figure 2

The statistical representation of tweets collected for three tweet categories.

Dataset Annotation

The original tweets are further considered for annotation process as a Human Intelligence Task (HIT) that is performed by the AMT domain expert workers. The authors presented a detailed documentation composed of the following three itineraries to the AMT domain expert workers.

  1. (i)

    What is sensitivity in a social media text?

  2. (ii)

    A ruleset for identifying the presence of sensitive private data or PII.

  3. (iii)

    A set of gold data with sensitive and insensitive tweets in three tweet domains.

The set of gold data that were submitted to the AMT workers was the driving factor to arrive at a clear understanding and converge the rules and assumptions on sensitivity with the given tweets to identify the presence of PII. A sample gold data submitted to AMT domain expert workers is presented in Table 2.

Table 2. A sample of gold data submitted to the AMT domian expert workers.

On submission of the dataset, the authors received annotation results from 670 AMT works for 54176 tweets. The authors selected 153 out of 670 AMT domain experts by considering their knowledge in social media, turnaround time and consistency in annotating 50 tweets per session. The annotation process was completely assisted by the authors in case of any queries from the AMT domain expert workers. The annotation results submitted by the AMT domain expert workers were evaluated by applying the approach of multiple grading and gold standard evaluation. The tweets after the evaluation process is preprocessed with a set of standard data cleaning steps from Table 1 and stored in the tweet corpus as described in the Table 3.

Table 3. Transformation of a sample tweet in data cleaning and pre-processing stage.

4.2 Sensitive Topic Keyword Extractor Using LDA Model

The pre-processed tweets are taken for corpus building which in-turn is given for generating the bag-of-words by forming term document matrix. For text categorization, feature selection is based on the selection of appropriate words by using dictionaries and NLP standards. Thus, this paper uses optimal text replacement strategies for building better feature space by using the proposed Automatic Acronym - Abbreviation Replacer (AAAR) specially designed for twitter. The AAAR was built with 1200 internet jargons and twitter slangs that are most popularly used by users and frequently encountered in our dataset. The AAAR is subject to change and update in accordance with the evolving practices and trends of online social networks.

The generated matrix is applied to three different models of LDA namely VEM, VEM-Fixed and Gibbs model to find topic keywords provided the k number of keywords to be found under n topics. The three category of tweet sets are taken individually and applied for LDA with k = 10; n = 5 and k = 20 and n = 10. Therefore, it generates 5 sets of 10 topic keywords for 5 topics and 10 sets of 20 topic keywords for 10 topics for all three LDA models. The sensitive keywords are identified by filtering out the cyber-keywords from the generated topic keywords.

Table 4 projects the overview of the identified sensitive keywords under each tweet category with cyber-keywords. The \( \upalpha \) value, parameter for finding the per topic distribution in tweet category denoted the confidence of the topic assignment. It can be inferred that, higher the value of \( \upalpha \), more even the distribution of topics over the tweets. The lower the value of α, higher the percentage of tweets been assigned to one particular topic.

Table 4. A list of bench-marked cyber-keywords and identified sensitive keywords by the proposed strategy for personal, professional and health related tweets.

Therefore, Fig. 3 shows the topic distribution values achieved for all three tweet categories with three LDA models. It can be inferred from the topic modeling results that as the number of topic increases, the α value reduces which denotes the uneven distribution of topic keywords. The VEM model performs better than other two models with better variations in topic distribution for all three tweet categories as depicted in Fig. 4. The topic distribution for personal tweets with 20 keywords under 10 topics was 19.28085 whereas for 10 keywords with 5 topics was 24.58494. This clearly shows that as the number of topics increases, the keywords are more widely spread. The same behavior can be observed in all three tweet categories. This effect is due to the high sparsity that exists in social media messages which should be handled by adapting feature selection mechanisms.

Fig. 3.
figure 3

a. The topic distribution parameter - α value achieved for three tweet category for VEM, VEM_Fixed and Gibbs Model to extract 20 keywords under 10 topics and 10 keywords under 5 topics. b. The proportion of Potentially Sensitive Tweets (PST) in the tweet corpus for three categories of tweet.

Fig. 4.
figure 4

The topic-wise keyword distribution (10 topics – 20 keywords) and F1 score achieved through various sampling models of LDA for three tweet categories.

The optimal model for sensitive keyword extraction was identified to be LDA with VEM model using TF-IDF for feature extraction with higher F1 score as shown in Fig. 4. The topic-wise keyword distribution achieved for each tweet category and a threshold Øtc = 0.3 determined by evaluating the model with varying iterations nIter = {10, 20, 30, 40, 50} derived a set of 93 sensitive keywords.

The authors combined all cyber-keywords and identified sensitive keywords into bigrams from which the most frequently discussed sensitive topic would be identified as shown in Fig. 5. It can be inferred from the bi-gram analysis that in personal tweet category, the cyber-keyword ‘phone’ and sensitive keyword ‘love’ were frequently detected to be occurring together. Similarly, the professional tweet category has ‘account’ and ‘money’ and health tweet category has ‘weight’ and ‘loss’ as the most frequently occurring cyber-keyword and sensitive keyword respectively. These co-occurring keywords are subject to vary when the analysis is performed with future user tweets in the same tweet domains.

Fig. 5.
figure 5

The Topic-wise keyword distribution obtained for benchmarked cyber-keywords and sensitive keywords of three tweet categories modeled as bi-grams.

The existing researches that employ LDA in identifying topics has many scopes for sensitivity topic identification. Table 5 presents the performance analysis of most relevant LDA models by Gunawan et al. (2018), Zou and Song (2016) against the proposed method using the F1 measure thereby highlighting the suitability and uniqueness of the proposed approach for sensitive data identification in Tweets. Therefore, from the observed research results, the authors concluded that users should be cautious when using these cyber and sensitive keywords together because they are believed to have high potential in revealing personally identifiable information.

Table 5. Comparison of results for the sensitive data for various LDA models.

5 Conclusion

This research work discovers sensitive keywords in addition to the benchmarked cyber-keywords that are more frequently used along the cyber-keywords in user tweets. An enhanced data cleaning mechanism is implemented in replacing the abbreviations and internet jargons by building an Automated Acronym - Abbreviation Replacer module. The major contribution of this research work is building a framework called Sensitive Topic Keyword Extractor which the identified of 93 sensitive keywords that are potentially sensitive which when used with the benchmarked 68 cyber-keywords. Therefore, the sensitive keywords are also considered as critical or sensitive keywords and will be used in tweet collection and tweet repository building for the future works which will result in having a set of 161 query keywords for tweet extraction.