1 Introduction

Social media platforms such as Twitter and Instagram are valuable sources of not only images and text but also the geographical locations of users, all of which are updated constantly. This data can be extracted from social networking sites and used to identify trends within areas of interest. By setting a radius around a location, users can be plotted on a map along with the location and time of their posts. Such data can further be used to characterize the user base of an area: their likes and dislikes, the places they visit, and their interests. The extracted data can then be used to target advertisements, open stores or theme cafes that match local interests, and much more.

Unlike other social platforms, Twitter is opinion based, which establishes a level of openness and honesty and makes it an excellent platform for obtaining personal views and harnessing intelligence. Search tracking has advanced to the point where Google Trends can generate search trends, but it can only track keywords down to the level of a particular state or a country as a whole.

Looking up trends not only confirms what people are interested in but also reveals how engaged they are. There are reports on the top five coffee-enthusiast states in India, but it is hard to tell whether people in your neighborhood love to drink coffee.

This paper aims to identify tweets posted from a pinpoint location and within a radius of a few kilometers around it. Tools like Recon-ng offer tracking of Instagram, YouTube, and Twitter posts with their geolocation coordinates and time. Once a location is chosen, one can keep track of post timings, interests, the number of active users, and their views.

2 Related Work

Social media analysis of Facebook posts, tweets, and Instagram likes reveals the interests of people. Analyzing different aspects of each platform yields different results, such as customer retention, opinions, or the type of demographic. For example, [3] focuses on the popularity of KFC and McDonald’s with over 7000 tweets each, using a lexicon-based model to classify tweets as positive, negative, or neutral [5].

According to [2], Instagram users can be divided into segments by their preferences, which helps in building a content marketing strategy that fits an organization's needs. Focusing on a certain group of people ensures customer engagement while maximizing reach. Taking a step forward, [10] emphasizes using Latent Dirichlet Allocation (LDA) to find location-specific trends on Twitter with the help of an Android application, reaching an accuracy of 84% in the best-case scenario.

Twitter opinions and views offer a look inside the mind [8] of the individual user and, with the help of Natural Language Processing (NLP), exhibit differences in behavior, showing variation in general sentiment across industry sectors [12]. The location parameter [7] of tweets has also been leveraged to predict general election results [6] for a country and to monitor events like natural disasters [9] with heat maps of affected zones [11]. Most related trends are easily accessible through Google Trends, which tracks search habits across the globe and provides a great opportunity to check what a range of people are interested in. While most algorithms focus only on uni-gram or bi-gram models, thus considering a single word or two words from the entire sentence, [1] implements the n-gram model with features to support emoticons, synonyms, and acronyms, making this hybrid approach a reliable way to predict trends through geographically spread tweets.

On the ever-growing platform Instagram, posts are made up of pictures and captions that are only partially meaningful and are bombarded with misleading hashtags. To address this, [4] tags images using a modified version of the hyperlink-induced topic search (HITS) algorithm combined with crowd-tagging to produce a content-based image retrieval system, reaching a maximum achievable recall of 0.931.

3 Proposed Methodology

This paper focuses on pinpointing a geographical location, obtaining tweets for that area, analyzing them to generate interesting tweet topics, and identifying a neighborhood for the business of interest, thereby improving reach and brand recognition among individuals. Figure 1 shows the flow of the process.

Fig. 1. System flow diagram

3.1 Kali Linux

Kali Linux is a Linux-based operating system that offers a wide variety of tools for reconnaissance, penetration testing, and other offensive security measures. It bundles prebuilt packages that are easily accessible to its users, providing an out-of-the-box experience.

3.2 Location Retrieval

Our focus lies on a high-demand area that could help generate optimal revenue. Using Google Maps, the latitude and longitude of a geographical location can be identified and entered into Recon-ng.

3.3 Recon-ng

Reconnaissance is the gathering of information regarding a particular topic. Recon-ng is a web reconnaissance tool that can track location-based tweets, Instagram posts, and YouTube uploads, interact with databases, and manage API keys.

  1. Load Twitter Module: Taking a modular approach makes it easier to break down the problem at hand. A location-based pushpin module is just what is required to track the tweets of a neighborhood.

  2. Set Radius: The geolocation coordinates added earlier define our target, but we do not want to focus on just one point. Thus, a radius is set around the location to widen our search for tweets.

  3. Capture Tweets: Once the parameters are set, simply run the module to collect all the tweets around the location.

  4. Load Pushpin Report Module: This module creates an organized report of the captured tweets for a visual representation of the tweets with their locations.

  5. Set Location and Radius: The geolocation coordinates and radius play a significant role in the visual look and feel of the map.

  6. Generate Report: The reporting module generates a media file with a description of all the tweets, a map file that contains the pinpoint location of each tweet, and a database file with details of all tweets.
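Recon-ng performs the radius search itself, but the geometry behind "a radius around a location" can be illustrated with a short sketch. The snippet below is a minimal example, assuming the haversine great-circle formula and a spherical Earth; the function names and the sample coordinates are illustrative and are not taken from the paper's data or from Recon-ng.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(center, points, radius_km):
    """Keep only the (lat, lon) points that lie within radius_km of center."""
    clat, clon = center
    return [p for p in points
            if haversine_km(clat, clon, p[0], p[1]) <= radius_km]


# Approximate, illustrative coordinates (assumed, not from the paper).
center = (19.10, 72.83)
candidates = [(19.11, 72.83), (19.10, 72.84), (18.93, 72.83)]
nearby = within_radius(center, candidates, radius_km=3)
```

With a 3 km radius, the first two candidate points are kept and the third, roughly 19 km to the south, is dropped.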

3.4 Dataset

When working with real-time tweets, it is important to understand that the number of tweets in an area fluctuates. Seven days' worth of tweet data is retrieved from each location, which may amount to roughly 250 tweets on average for happening places. Since the dataset is real-time, this number varies a lot, which is why a higher count of around 500 tweets is also observed in some areas of Los Angeles, California.

3.5 Data Loading

The reports generated in the previous step provide a dataset of real-time tweets from a particular location for a whole week. The pushpin reporting module generates an SQLite file, from which the pushpin table must be selected.
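The loading step can be sketched with Python's built-in sqlite3 module. This is a minimal sketch, not the authors' code: the table name follows the text above and may differ in a given Recon-ng workspace, the helper name load_pushpins is hypothetical, and the in-memory database with one made-up row only stands in for the real report file.

```python
import sqlite3


def load_pushpins(conn, table="pushpin"):
    """Return every row of the given table as a list of dicts keyed by column."""
    # Table names cannot be bound as SQL parameters, so validate before use.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


# Stand-in database with a subset of the columns and one invented row.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE pushpin (source TEXT, screen_name TEXT, message TEXT, "
    "latitude REAL, longitude REAL, time TEXT)"
)
demo.execute(
    "INSERT INTO pushpin VALUES (?, ?, ?, ?, ?, ?)",
    ("Twitter", "user_a", "Coffee by the beach https://t.co/x",
     19.10, 72.83, "2020-04-12 18:30:00"),
)
rows = load_pushpins(demo)
messages = [r["message"] for r in rows]
```

For a real report, the connection would instead be opened on the generated SQLite file with sqlite3.connect and the file's path.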

3.6 Pre-processing

Our primary focus is to highlight popular topics among a list of tweets, so it is better to drop the other columns, such as source, screen_name, profile_name, profile_url, media_url, thumb_url, latitude, longitude, time, and module, while retaining only the tweet content column, which is named message. Tweet content also contains punctuation and links to other pages that can dramatically alter our results; for instance, the most common word would be “https” because of its presence in all the tweets. It is better to remove them from the message column for reliable results.
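A minimal sketch of this cleaning step is given below, using only the standard library. The URL pattern and the lowercasing are assumptions made for illustration; the text above only specifies that links and punctuation are removed.

```python
import re
import string

# Matches http(s) links and bare www links up to the next whitespace.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(message):
    """Strip links and punctuation, lowercase, and collapse whitespace."""
    text = URL_RE.sub(" ", message)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.lower().split())


cleaned = clean_tweet("Loving the new cafe!!! https://t.co/abc123 #coffee")
# cleaned == "loving the new cafe coffee"
```

Links are removed before punctuation so that a URL is dropped as a whole instead of leaving fragments such as "https" behind. Note that deleting punctuation also joins hyphenated words, so "ice-cream" becomes "icecream".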

3.7 Process Tweets

The aim is to find trending topics that are rising in popularity. Say a food chain plans to open a new branch in a neighborhood; the intention is to focus on areas that offer increased demand. A concentration of tweets that mention keywords like food, recipe, cooking, ice-cream, healthy, sweet, and smoothie shows that the area is favorable for the food chain. Manual analysis of tweets is time-consuming and quite tedious when there are hundreds of tweets within a 3 km radius. Since our focus is on word frequency across a set of documents, Latent Dirichlet Allocation (LDA) is used, as it provides reasonably accurate mixtures of topics within a given document. The implementation consists of:

  1. Exploratory Analysis: To make sure our pre-processing works as expected, a word cloud is generated to provide a visual representation of the most common words.

  2. Preparing data for LDA Analysis: Tweets are converted into a bag of words to count word occurrences in the dataset, and the top 10 most common words are plotted; these should also appear in the word cloud.

  3. LDA model training: Using prebuilt libraries for Latent Dirichlet Allocation (LDA), the number of topics and the number of words per topic can be tweaked to achieve optimal results.

  4. Analyzing LDA model results: To help us understand individual topics and the relationships between them, a visualization package like pyLDAvis can be used.
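The paper relies on prebuilt LDA libraries and does not name them, so the sketch below is not the authors' implementation. It is a toy, standard-library illustration of what steps 2 and 3 compute: a bag-of-words count for the most common words, followed by a small collapsed Gibbs sampler for LDA. The function names, hyperparameters (alpha, beta, iteration count), and sample documents are all assumptions for illustration; in practice a library implementation would be the better choice.

```python
import random
from collections import Counter


def top_words(docs, n=10):
    """Bag-of-words count over all documents; returns the n most frequent words."""
    counts = Counter(word for doc in docs for word in doc.split())
    return counts.most_common(n)


def lda_gibbs(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns one word Counter per topic."""
    rng = random.Random(seed)
    tokens = [doc.split() for doc in docs]
    vocab_size = len({w for doc in tokens for w in doc})
    doc_topic = [[0] * n_topics for _ in tokens]        # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                        # tokens per topic
    assign = []
    # Give every token a random initial topic.
    for d, doc in enumerate(tokens):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of topic t is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word


# Invented, already-cleaned sample tweets.
docs = [
    "coffee cafe latte morning",
    "latte coffee beans cafe",
    "carpool rideshare seats offered",
    "rideshare carpool offered morning",
]
common = top_words(docs, n=10)
topics = lda_gibbs(docs, n_topics=2)
for k, counts in enumerate(topics):
    print(k, [w for w, c in counts.most_common(3) if c > 0])
```

The printed lists are the highest-count words per topic. On a corpus this small the split between topics depends on the seed, so the sketch should be read for its mechanics and not for its output.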

3.8 Charts and Graph Representation

This step visualizes the relevant keywords that repeat over multiple instances. Beyond tweet content, it also visualizes the times at which people usually post tweets and whether the timings vary on some days.
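The timing analysis can be sketched as a simple aggregation over the time column. This is a minimal sketch under one stated assumption: timestamps are stored as text in the form "YYYY-MM-DD HH:MM:SS". The function names and sample timestamps are illustrative, and the format string should be adjusted if the database uses a different layout.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed layout of the time column


def tweets_per_hour(timestamps, fmt=TIME_FORMAT):
    """Return a 24-element list with the number of tweets posted in each hour."""
    hours = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    return [hours.get(h, 0) for h in range(24)]


def tweets_per_weekday(timestamps, fmt=TIME_FORMAT):
    """Return a 7-element list of tweet counts, index 0 = Monday ... 6 = Sunday."""
    days = Counter(datetime.strptime(ts, fmt).weekday() for ts in timestamps)
    return [days.get(d, 0) for d in range(7)]


# Invented sample timestamps.
sample = ["2020-04-12 18:30:00", "2020-04-12 18:45:10", "2020-04-13 09:05:00"]
per_hour = tweets_per_hour(sample)
per_weekday = tweets_per_weekday(sample)
```

The two lists can be passed directly to any plotting tool as bar heights, giving one chart for the usual posting hours and one for day-to-day variation.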

4 Result

Reports generated while collecting tweets help in analysis and visualization. Tweets from nearby locations pinpointed geographically on a map, as in Fig. 2, are evidence of active users in the area. The records of each tweet for the pushpin location are stored in a database file, shown in Fig. 3.

Fig. 2. Pushpin for Juhu, Mumbai

Fig. 3. Reporting database for pushpin

Four locations were selected randomly to find the trending topics of each area, with different search radii to make sure the search is optimized. To check that LDA gives favorable results, real-time tweets were fetched again for Juhu within a span of two months to find the change in trends. Initial results for Juhu, Mumbai, dated April 2020 are represented in Fig. 4 by a word cloud of the most common words. Further, the top ten most common words are plotted in Fig. 5, and after LDA model training the results consist of five topics, shown in Fig. 6, with ten words associated with each topic. For better understanding, the results are visualized in Fig. 7. Words like divine, guidance, mindseed, and dabbooratnani are among the top 30 most important terms, where the \(\lambda \) parameter can be used to adjust the relevance of different terms. The relationships between the topics can be understood from the Intertopic Distance Plot, which reveals how different topics relate to each other.
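For context on the \(\lambda \) parameter: pyLDAvis is a port of the LDAvis method, which ranks the terms of a topic by a relevance score. The notation below is introduced here for illustration and does not appear elsewhere in this paper: \(w\) is a term, \(t\) a topic, \(p(w \mid t)\) the probability of the term within the topic, and \(p(w)\) its overall probability in the corpus.

\[
r(w, t \mid \lambda) = \lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}, \qquad 0 \le \lambda \le 1
\]

Setting \(\lambda = 1\) ranks terms purely by their probability within the topic, while \(\lambda = 0\) ranks them by how much more frequent they are in the topic than in the corpus as a whole, which favors terms that are rare but distinctive.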

Fig. 4. Word cloud for Juhu, Mumbai dated April, 2020

Fig. 5. 10 most common words for Juhu, Mumbai dated April, 2020

Fig. 6. Topics found via LDA for Juhu, Mumbai dated April, 2020

Fig. 7. LDA analysis for Juhu, Mumbai dated April, 2020

Fig. 8. Word cloud for Juhu, Mumbai dated June, 2020

Fig. 9. 10 most common words for Juhu, Mumbai dated June, 2020

Fig. 10. Topics found via LDA for Juhu, Mumbai dated June, 2020

Fig. 11. LDA analysis for Juhu, Mumbai dated June, 2020

Fig. 12. LDA analysis for Churchgate, Mumbai

Fig. 13. LDA analysis for Powai, Mumbai

Fig. 14. LDA analysis for Los Angeles, California

Turning to the data for the Juhu, Mumbai area dated mid-June 2020, quite a dramatic change in the word cloud is noticeable in Fig. 8, with some new trends among the top ten most common words in Fig. 9. LDA identifies a new set of topics with different sets of words: dabbooratnani, happy, sandesh, and a few other words fall in one topic, while words such as carpool, offered, and rideshare come under topic #2 in Fig. 10. When visualized using the pyLDAvis library in Fig. 11, words like carpool, offered, rideshare, and seats stand among the top 30 most important terms, supporting our analysis that trends in the same area changed dramatically within a matter of months. Similarly, tweets are analyzed for Powai and Churchgate in Mumbai and for Los Angeles in California. For Churchgate in Fig. 12, the most salient terms vary, with hope, reading, acts, and gospel emphasizing the different likes of people. In Powai, Fig. 13, there is an enormous trend for real estate. The results are quite revealing, to the extent that some areas are more inclined towards workout or crossfit, as seen in Fig. 14, while other areas offer quite a lot of flats or properties.

5 Conclusion and Future Work

This research revolves around finding tweets fine-grained to within a few kilometers of a pinpoint location, providing a great opportunity for organizations and other businesses to identify their hot-selling areas. Analyzing a social platform that is filled with the views and opinions of people provides a glimpse of their habits. Tracking the times at which people tweet and approximately how many tweets per day come from an area amounts to activity tracking for a location. The core focus is to identify the keywords or topics that people repeatedly talk about, which makes it possible to visualize where the target audience lies on a map and strategize accordingly. This offers a way to choose a favorable area and finally decide the exact location for a business to earn maximum profit with increased reach.