Introduction

Since the outbreak of COVID-19, many scientists and large companies have started working on challenges related to how to better fight such an epidemic, and also how to fight the spread of false information (Zhao et al., 2020; Mavragani, 2020; Cuan-Baltazar et al., 2020). Data related to the pandemic are extremely valuable for conducting experiments (Chen et al., 2020; Gozes et al., 2020) and for prediction purposes (Lin & Hou, 2020). Particularly in social data mining, recent work has made available a dataset of hundreds of millions of multilingual COVID-19 tweets with location information, with the aim of leveraging collective contributions to extract relevant knowledge on the topic (Qazi et al., 2020). Tracking social media data about the COVID-19 pandemic would help map disease evolution geographically and predict communities under potential risk. It can also help limit the spread of false and misleading information between users, referred to as infodemics (Mavragani, 2020; Cuan-Baltazar et al., 2020). Recently, Facebook, Google, LinkedIn, Microsoft, Twitter, and YouTube have committed to removing coronavirus-related infodemics and misleading news due to the severe damage they cause to human health and society (Cuan-Baltazar et al., 2020).

Social media has recently emerged as a valuable resource for understanding the behavior of users and communities. Analyzing crowdsourced data can provide deep knowledge about surroundings, including current social events or unusual happenings. Topics of interest and daily events are heavily discussed over social networks as users share their feedback, multimedia content, and check-ins of their visited places. With the advancement of social and mobile sensing technologies, there is a real opportunity to enrich current detection systems with knowledge extraction tools that support tracking and predicting epidemic diseases, such as the COVID-19 outbreak. Moreover, social media data is becoming a rich source of information that can be utilized in detecting and predicting communities under potential risk.

Fig. 1 Overview of COVID-19 data analysis

The research focus of this study revolves around the generation and analysis of COVID-19-related data, with particular emphasis on the role of data mining techniques. Our study is driven by the large volume of social data that people share over social networks about topics of interest, which can serve as a trigger for investigation. For instance, 54 million tweets about COVID-19, collected from February 1 until May 1, 2020, were recently made publicly available for exploration (Sharma et al., 2020). However, the spread of false information and various harmful content over social media threatens the whole online social ecosystem. The ongoing COVID-19 pandemic is no exception, laying the ground for an army of malicious users to spread low-credibility and unverified news.

The COVID-19 global pandemic has also resulted in massive global disruption in the healthcare sector, the economy, education, the environment, and social life, to name a few (Leung et al., 2022). Besides the battle on the medical front line, governments, industries, and the research community have extensively explored the deployment of information and communication technologies to track and contain the global outbreak. Mobile Contact Tracing Applications (MCTA) were developed as part of these efforts. They take advantage of the vibrant ecosystem of mobile sensing (e.g., location, proximity) to identify and track pedestrians who may be contagious or under potential infection threat. However, contact tracing apps have raised many important privacy concerns, as such technologies put in place a massive global surveillance infrastructure that may survive even after the partial containment of the disease brought by the worldwide distribution of the vaccine.

Figure 1 illustrates the steps of COVID-19 data analysis concerning social media, medical imaging, time series data, and contact tracing. Various data sources are available to extract insights and generate knowledge layers from raw data, such as geo-tagged tweets, time series of daily numerical data, X-ray images, topics and reviews from online platforms, and data from contact tracing apps. After pre-processing, such data can be utilized for various types of analysis, including prediction, classification, correlation, and clustering. Finally, modeling and visualization tools summarize the data for decision-making by the concerned authorities.

The main contributions of this survey are: 1) introduce a taxonomy of related work on COVID-19 data mining techniques and analytics from each respective domain; 2) discuss the main data sources and datasets that have been recently produced to facilitate mining tasks, with a focus on social data; 3) present data modeling foundations, and then overview techniques in social data mining, medical imaging, contact tracing, and time-series data; 4) summarize analytical perspectives and implications in each research track, and the impact of COVID-19 on the socio-economic behavior; and finally discuss the overall challenges and great opportunities in these domains.

Related Work

This paper is motivated by the growing number of studies conducted on the COVID-19 pandemic. The most relevant work appears in Chiroma et al. (2020) and Shinde et al. (2020). These papers offer a comprehensive discussion on the analysis of COVID-19 data extracted from public official websites or medical images. However, the main focus of this paper is COVID-19 social media data analysis. Our paper surveys the existing literature on COVID-19 social analytics from various aspects and discusses its challenges and opportunities. Other related work can be categorized as contact tracing, prediction, and economic impact.

(i) Bibliometric analysis: Chiroma et al. (2020) provided a survey on early assessment using bibliometric analysis based on a machine learning approach to limit the spread of COVID-19. The authors collected the dataset from academic databases and applied bibliometric techniques for analysis. Moreover, a new perspective is proposed to overcome some of the challenges highlighted. Their results indicate that machine learning-based COVID-19 diagnostic tools would require considerable attention. Lazarus et al. (2021) surveyed the potential acceptance of a COVID-19 vaccine. The main goal was to collect data from various nations in order to determine the possible acceptance rate and the factors impacting COVID-19 vaccine acceptability. The study concluded that 48% of the people surveyed would accept the vaccine if recommended by their employers or the government.

(ii) Medical image: A survey on deep learning and medical image processing for COVID-19 was presented by Bhattacharya et al. (2021). The authors summarize the recent research work related to deep learning and its application to healthcare. Moreover, three use cases concerning China, Korea, and Canada have been studied to confirm the uses of deep learning applications for COVID-19 medical image processing. Also, the study highlighted some of the challenges related to deep learning implementation for COVID-19.

The authors in Ulhaq et al. (2020) surveyed the computer vision techniques proposed for COVID-19 control. They discussed recent methods for computed tomography (CT) scans, X-ray imagery, and prevention and control. The authors also identified some future research directions concerning the COVID-19 pandemic. Besides, Ahmed et al. (2020) provided a comprehensive survey of COVID-19 contact tracing apps. The authors also present an overview of many proposed tracing apps and discuss users’ concerns regarding their usage.

(iii) Other related work: Mahalle et al. (2020) provide a survey on forecasting models for COVID-19. The authors classify the forecasting techniques into two types: (i) stochastic theory mathematical models and (ii) machine learning techniques. Moreover, the authors highlighted some of the challenges and recommendations of the forecasting techniques. Alamoodi et al. (2020) examined papers from the last ten years regarding the prevalence of various forms of infectious diseases, such as viruses, epidemics, pandemics, or outbreaks, to understand the use of sentiment analysis and to collect the most significant literature findings. They systematically searched major databases for papers on similar topics published from January 1, 2010, to June 30, 2020. They organized the papers into a taxonomy that classifies the current literature’s viewpoints into four categories: lexicon-based models, machine learning-based models, hybrid-based models, and individuals. They divided the publications they found into three categories: disease mitigation, data analysis, and issues researchers face with data, social media platforms, and community. They discovered some interesting patterns in the literature and categorized the articles accordingly.

The previous surveys focus mainly on COVID-19-related bibliometric, forecasting, and medical image data analytics, or on data extracted from official platforms. In contrast, our study aims to offer a comprehensive survey on COVID-19 from social data mining perspectives. More specifically, the study covers the literature on the social impact of COVID-19, including data sources, social media analytics, contact tracing, prediction, and its impact on the economy. Lastly, we highlight some challenges and opportunities to address as future research directions. The taxonomy of COVID-19 data analysis is given in Fig. 2.

Fig. 2 Taxonomy of COVID-19 data analysis

Taxonomy and Search Methodology

The taxonomy devised for comprehensively analyzing existing studies related to COVID-19 data analysis from a social perspective encompasses five key categories: data sources, type of datasets, type of analytics, techniques, and type of use cases. Each category provides a framework for understanding the various aspects of COVID-19 data analysis undertaken in the past two years. Data sources range from official government sources to social media platforms, while datasets include epidemiological, demographic, geographic, healthcare system, mobility, and socio-economic data. Analytical methods span descriptive, predictive, and prescriptive analytics, as well as network analysis, text mining, and machine learning techniques. Specific methodologies such as sentiment analysis, contact tracing, social media analysis, and prediction are discussed, alongside diverse use cases such as public health interventions, resource allocation, risk assessment, economic impact analysis, and mental health assessments. This taxonomy facilitates a structured approach to evaluating the multifaceted nature of COVID-19 data analysis and informs future research directions aimed at addressing societal challenges posed by the pandemic.

Data Collection

Scopus and Google Scholar are widely used databases that provide comprehensive coverage of scientific data and literature (Boyle & Sherman, 2006). Hence, the references covered in this study were obtained from the Scopus and Google Scholar databases. We gathered bibliographical information, citations, abstracts and keywords, and funding details relevant to COVID-19 research from 2020 to April 2022. To retrieve specific literature, we used keywords combined with the Boolean operators AND, OR, and NOT, such as ( TITLE-ABS-KEY ( "Social Media" ) AND TITLE-ABS-KEY ( analytics ) OR TITLE-ABS-KEY ( analysis ) OR TITLE-ABS-KEY ( data AND mining ) AND TITLE-ABS-KEY ( covid-19 ) OR TITLE-ABS-KEY ( coronavirus ) ) AND PUBYEAR > 2019. Moreover, we omitted the term "sars-cov-2" from the search because it returned 3,500 documents; some of the documents were beyond the scope of this study. Journal articles, conference papers, reviews, book chapters, notes, editorials, and short surveys were retrieved from the databases. The final dataset presented in this study was filtered from 3,500 bibliographic entries published between January 2020 and April 2022.

Data Sources

The primary data sources about COVID-19 were social media, Internet Search Engines, health data providers, online monitoring platforms, and other data providers, such as government websites and international organizations. Social media platforms such as Twitter and Facebook provide valuable data on public sentiment, behaviors, and the spread of information. In the subsections, we discuss the different types of data sources and their COVID-19-related contributions. Table 1 classifies the papers based on the data sources.

Table 1 Paper classification based on their data sources

Social Media

Social media is the main data source that can reflect users’ behavior and opinions about topics of interest. The work in Koh and Liew (2020) studied the effect of COVID-19 and social distancing on loneliness and other mental health issues using Twitter data. In Mutlu et al. (2020), Twitter data is used to extract the users’ opinions on using two medications (i.e., hydroxychloroquine and chloroquine) to cure COVID-19. Devi and Nayyar (2021) investigated anomalous social movement during COVID-19 through sentiment analytics of geo-tagged tweets. The work in Al-Rawi and Shukla (2020) focused on the impact of the activities of bots, which are programmed accounts that tweet or retweet mentioning the hashtag #COVID-19. Alsudias and Rayson (2020) analyzed 1M Arabic tweets to detect rumors and predict the source of the tweets. The authors of Ordun et al. (2020) evaluated the effect of using certain features, keywords, and some unique topics, as well as how fast COVID-19-related information is being tweeted or retweeted. In Chakraborty et al. (2020), the authors argue that official COVID-19 platforms such as WHO were not successful in providing precise and informative information to guide the public and thus reduce the spread of the disease. The work in Zheng et al. (2020) used topic modeling to classify tweets into discussion topics about COVID-19. Samuel et al. (2020) identified public sentiment, e.g., fear sentiment over time, developed due to the coronavirus. Abd-Alrazaq et al. (2020), in turn, classified the tweets into different COVID-19 pandemic topics. Gencoglu et al. (2020) proposed a machine learning-based classification to categorize tweets based on their language-agnostic representations. Section 5 presents a detailed analysis of these techniques and classifies them based on their detection purpose, language covered, and datasets used.

Internet - Search Engines

Li et al. (2020) and Lai et al. (2020) studied the propagation and spread of COVID-19 in China using data collected from the Google and Baidu search engines. In Li et al. (2020), the authors presented an analysis using Google Trends, Baidu Index, and Weibo Index to conclude that reported daily cases are highly correlated with social media posts as well as Internet searches.
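
The kind of correlation analysis described above can be illustrated with a minimal sketch. It assumes two equal-length daily series, a search-interest index and reported cases, and computes the Pearson correlation at several lags. The function names and the toy numbers are hypothetical and are not taken from the cited studies.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(search_index, daily_cases, max_lag):
    """Correlate search_index[t] with daily_cases[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = search_index[: len(search_index) - lag]
        y = daily_cases[lag:]
        result[lag] = pearson(x, y)
    return result


# Hypothetical toy data: cases follow search interest with a two-day delay.
search_index = [1, 3, 7, 12, 20, 26, 30, 28, 22, 15]
daily_cases = [0, 0, 2, 6, 14, 24, 40, 52, 60, 56]

correlations = lagged_correlation(search_index, daily_cases, max_lag=4)
best_lag = max(correlations, key=correlations.get)
print(best_lag, round(correlations[best_lag], 3))
```

Scanning several lags is a design choice of this sketch: search activity may lead reported cases by a few days, so the lag with the strongest correlation can be as informative as the coefficient itself.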

Health Data Providers

The work in Liu et al. (2020) discussed the impact of COVID-19 control measures on the spread of the disease. Movement restrictions are another studied parameter related to the spread of COVID-19. The authors used a dataset from the US Centers for Disease Control (CDC). Hossain et al. (2020), in turn, used data collected from the Chinese Center for Disease Control and Prevention to study the impact of quarantine on the spread of COVID-19. The authors in Wu et al. (2021) studied China’s responses to the COVID-19 emergency from a policy-making perspective since the beginning of the outbreak by adopting a joint unit for epidemic prevention and control mechanisms. Moreover, Wang et al. (2020) used data from the Centers for Disease Control, China, to predict the spread rate of COVID-19. A study to identify at-risk individuals using data collected from the International Classification of Diseases was conducted by Amram et al. (2020). On the other hand, Chao et al. (2021) discussed the use of imaging and non-imaging data of COVID-19 patients, collected from hospitals in Iran and Europe, to predict patient need for ICU admission. Alzahrani et al. (2020) presented a method to predict daily cases in KSA by analyzing the official data from the Ministry of Health.

Online Monitoring Platforms

Researchers have used official statistics to estimate and predict diseases with negligible biases and small computational requirements. The work in Russo et al. (2020) analyzed COVID-19 databases accessed from WHO. Moreover, Bhattacharjee (2020) studied the impact of local environmental factors, such as humidity and temperature, on the spread of COVID-19. Also, based on WHO data, Nadim et al. (2021) and Rocha Filho et al. (2020) discussed the effect of quarantine on the spread of the coronavirus. Using publicly available data from the Hubei province in China, the work in Anastassopoulou et al. (2020) predicted the number of new cases in China. On the other hand, Traini et al. (2020) and Giordano et al. (2020) collected data from the Italian National Data to study the coronavirus spread and death rates. Bayham and Fenichel (2020) used data from a GitHub page to show how social distancing affected the number of death cases. Infection and recovery rates were studied using data downloaded from GitHub and uploaded by Johns Hopkins University (Beare & Toda, 2020; Siwiak et al., 2020).

Other Types of Data

News media (TV/video, newspapers, and radio) were explored by some works to extract knowledge related to COVID-19. Additionally, the authors in Rovetta and Bhagavathula (2020) used mobile data of people in Italy and online searches to study the impact of some discussion topics on the spread of the coronavirus. In Lai et al. (2020), the authors used the mobile data of domestic and international travel to study the impact of travel on the increase and spread of coronavirus cases. The work in Zhu et al. (2020), in turn, used mobile data to predict the number of death cases by modeling the decay rate of the spatial mobile data. To estimate risks upon lifting lockdown, the authors of Kiamari et al. (2020) developed a Hawkes process-based technique that uses cell-phone-based mobility data to compute spatiotemporal risk scores.
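
To make the notion of a Hawkes process concrete, the following minimal sketch evaluates the conditional intensity of a textbook univariate Hawkes process with an exponential kernel, in which each past event temporarily raises the rate of future events. This is a generic illustration, not the model of Kiamari et al. (2020). The parameter names mu (baseline rate), alpha (excitation), and beta (decay), as well as the event times, are assumptions chosen for the example.

```python
from math import exp


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Intensity lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i))."""
    return mu + sum(
        alpha * exp(-beta * (t - t_i)) for t_i in event_times if t_i < t
    )


# Hypothetical event times (in days), e.g. reported infections in one region.
events = [0.0, 0.5, 0.7]

# The intensity is elevated shortly after a burst of events and then decays
# back toward the baseline rate mu.
shortly_after = hawkes_intensity(1.0, events, mu=0.5, alpha=0.8, beta=1.0)
much_later = hawkes_intensity(10.0, events, mu=0.5, alpha=0.8, beta=1.0)
print(shortly_after > much_later)
```

In a risk-scoring setting, such an intensity could be evaluated per region and time window, but how mobility data enters the model is specific to the cited work and is not reproduced here.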

Techniques

We discuss in this section the main techniques developed recently to extract insights from diverse types of data related to COVID-19. Our main focus will be on covering the social data mining techniques. Still, we will also highlight the recent contributions in different related fields, such as medical imaging techniques for early screening and classification of COVID-19 cases and contact tracing mobile applications that were proposed in many countries by officials to track and help prevent the spread of the disease. This section first presents some foundations on mathematical modeling and methods for representing data related to COVID-19, then discusses the social data mining techniques in different languages. In addition, we present the techniques used in medical imaging and contact tracing and finally show assessments of these techniques based on well-established evaluation metrics whenever available.

Table 2 Social data mining during the COVID-19 pandemic

Mathematical Data Modeling

Researchers use mathematics and computer tools to model the patterns of the COVID-19 pandemic. These models aim to understand the patterns, predict future outbreaks, and track the evolution of COVID-19. In Tang and Wang (2020), the authors modeled the decrease of the daily growth as an exponential decay function. The work in Kucharski et al. (2020) used mathematical models to assess the human-to-human transmission of COVID-19 in different areas in China. In Oehmke et al. (2021), the authors used a dynamic panel data model estimated using the generalized method of moments approach to provide surveillance metrics for COVID-19. These metrics provide estimates for speed, acceleration, weekly shifts, etc., to support decision-making to alleviate risks. COVID-19 cases and mortality data, along with a deterministic SEIR compartmental framework, are used by COVID et al. (2020) to model trajectories of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections. Using the model, the authors assessed the effect of social distancing and mask use levels on the virus’s spread.
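
As an illustration of a deterministic SEIR compartmental framework, the sketch below integrates the standard SEIR equations with a simple forward Euler scheme. It is a generic textbook formulation rather than the model of any cited study. The parameter values are illustrative only, and contact_scale is a hypothetical knob that scales the transmission rate to mimic interventions such as social distancing or mask use.

```python
def simulate_seir(population, beta, sigma, gamma, initial_exposed, days,
                  dt=0.1, contact_scale=1.0):
    """Integrate a deterministic SEIR model with forward Euler steps.

    beta  : transmission rate per day
    sigma : rate of leaving the exposed compartment (1 / incubation period)
    gamma : recovery rate (1 / infectious period)
    Returns one (S, E, I, R) tuple per day, including day 0.
    """
    s = float(population - initial_exposed)
    e = float(initial_exposed)
    i = 0.0
    r = 0.0
    history = [(s, e, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = contact_scale * beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Illustrative parameters only; they are not fitted to any dataset.
baseline = simulate_seir(1_000_000, beta=0.5, sigma=0.2, gamma=0.1,
                         initial_exposed=10, days=300)
distanced = simulate_seir(1_000_000, beta=0.5, sigma=0.2, gamma=0.1,
                          initial_exposed=10, days=300, contact_scale=0.5)

peak_baseline = max(day[2] for day in baseline)
peak_distanced = max(day[2] for day in distanced)
print(peak_distanced < peak_baseline)
```

Comparing the two runs shows the qualitative effect such models are used to assess: reducing contacts lowers the peak number of simultaneously infectious individuals.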

Oliveira et al. (2021) used mathematical modeling to study the dynamics of COVID-19 in Bahia, a state in northeastern Brazil, considering the influences of asymptomatic/non-detected cases, hospitalizations, and mortality. The model explored hospitalization needs in a low-resource state during the COVID-19 pandemic. Mathematical modeling of COVID-19 data can lead to radical shifts in government decision-making (Ferguson et al., 2020). For example, governments implementing ‘herd immunity’ strategies had to change their strategies after mathematical models predicted enormous death rates before reaching this objective.

Data-driven compartmental (susceptible-infected-recovered) modeling provides insights into the spread of COVID-19. Recent models that experimented with an increased number of compartments studied the impact of social distancing and quarantine on the spread of the virus and on other statistics such as the number of daily cases and deaths (Leung et al., 2020; Giordano et al., 2020). Investigations into the need to hospitalize COVID-19 patients in different scenarios were presented using mathematical modeling of public-related data by Moghadas et al. (2020) and Castro et al. (2020).
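
For reference, the basic susceptible-infected-recovered model underlying these extensions can be written in its standard textbook form as follows. Here S, I, and R denote the numbers of susceptible, infected, and recovered individuals, N is the total population, and the symbols β (transmission rate) and γ (recovery rate) are the usual notation rather than parameters taken from the cited works.

```latex
\begin{align}
\frac{dS}{dt} &= -\beta \frac{S I}{N}, \\
\frac{dI}{dt} &= \beta \frac{S I}{N} - \gamma I, \\
\frac{dR}{dt} &= \gamma I,
\qquad N = S + I + R, \quad R_0 = \frac{\beta}{\gamma}.
\end{align}
```

In this formulation, interventions such as social distancing or quarantine are commonly represented as a reduction of β, which lowers the basic reproduction number R_0. Models with more compartments add further states (for example exposed, hospitalized, or quarantined) with their own transition rates.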

Table 3 Social data mining during the COVID-19 pandemic (Cont.)

Mining COVID-19-Related Insights from Social Media

Since the beginning of the outbreak, many research studies have investigated how to make use of the huge amount of social media streams to extract insights and to better understand the evolution of the disease, the spread of misleading information, and users’ behavior and beliefs concerning related topics, such as lockdowns and vaccines, towards overcoming the COVID-19 pandemic (Yao et al., 2021). The following subsections present the different methods used in social data mining, which mainly cover analytics and sentiments on the most frequent topics, the evolution and tracking of the disease over time and space, and the different visualization mechanisms adopted. Tables 2 and 3 illustrate the main contributions in social data mining for COVID-19.

Large-Scale Datasets

Several developments have been presented on crawling large-scale social datasets to discover and track the evolution of the pandemic and to deeply model user behavior concerning the most trending topics. The design and analysis of a large-scale COVID-19 tweets dataset was introduced in Lamsal (2020). The Twitter dataset has more than 310 million COVID-19-specific English language tweets and their sentiment scores (Lamsal, 2020). They also presented the GeoCOV19Tweets Dataset (Lamsal, 2020), the dataset’s geo-tagged version. They analyzed the tweets in both datasets based on trending unigrams and bigrams with scores. Different algorithms for filtering geo-tagged tweets, hydrating tweet data using Twarc, and extracting region-based tweets were presented. They released these datasets publicly.
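
As a rough illustration of filtering geo-tagged tweets, the sketch below assumes that tweets have already been hydrated into one JSON object per line and that each object follows the Twitter API v1.1 tweet format, in which the coordinates and place fields are null when no location is attached. This is a simplified stand-in and not the filtering algorithm of Lamsal (2020); the sample records are hypothetical.

```python
import json


def iter_geotagged(lines):
    """Yield tweets (as dicts) that carry exact coordinates or a tagged place."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        if tweet.get("coordinates") or tweet.get("place"):
            yield tweet


# Hypothetical hydrated records, one JSON object per line.
sample_lines = [
    '{"id": 1, "text": "stay safe", "coordinates": null, "place": null}',
    '{"id": 2, "text": "testing site open", '
    '"coordinates": {"type": "Point", "coordinates": [-74.0, 40.7]}, "place": null}',
    '{"id": 3, "text": "lockdown day 5", "coordinates": null, '
    '"place": {"full_name": "Doha, Qatar"}}',
]

geotagged_ids = [tweet["id"] for tweet in iter_geotagged(sample_lines)]
print(geotagged_ids)
```

Processing the file as a stream of lines keeps memory use constant, which matters for datasets of hundreds of millions of tweets.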

Dimitrov et al. (2020) developed a knowledge base of semantically annotated tweets on the COVID-19 pandemic, called TweetsCOV19, which contains more than 8 million tweets posted between October 2019 and April 2020. TweetsCOV19 is a subset of TweetsKB, a public RDF corpus of anonymized data for a large collection of annotated tweets. GeoCoV19 is a dataset of hundreds of millions of multilingual geo-tagged tweets on COVID-19-related topics (Qazi et al., 2020). Data crawling was performed over the period from February 1 to May 1, 2020, yielding more than 524 million multilingual tweets (62 different languages) from around 43 million Twitter users. The geo-location information is essential for many tasks, including disease tracking and surveillance. However, Twitter data has, by default, a very small percentage of geo-tagged tweets (generally, between 2% and 5%). Therefore, a gazetteer-based approach was employed, which takes advantage of tweet content and user location to detect toponyms and derive their geo-location at different spatial scales based on the Nominatim API from OpenStreetMap. The GeoCoV19 dataset supports the development of AI-based analytics to predict disease outbreaks and trends and to learn about knowledge gaps and the impact of the global pandemic on the socio-economic life of users, among others.
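
The idea of a gazetteer-based approach can be sketched as a longest-match lookup of place names in text. The tiny in-memory gazetteer below, its approximate coordinates, and the rule of preferring toponyms in the tweet content over the user profile location are all assumptions made for illustration. The GeoCoV19 pipeline itself relies on the Nominatim API, which is not called here.

```python
import re

# Hypothetical miniature gazetteer: lowercase place name -> (latitude, longitude).
GAZETTEER = {
    "paris": (48.8566, 2.3522),
    "new york": (40.7128, -74.0060),
    "doha": (25.2854, 51.5310),
}


def find_toponyms(text, gazetteer=GAZETTEER, max_words=3):
    """Return (name, coordinates) pairs for gazetteer entries found in text.

    At each token position the longest matching phrase of up to max_words
    words is preferred.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in gazetteer:
                matches.append((candidate, gazetteer[candidate]))
                i += n
                break
        else:
            i += 1  # no gazetteer entry starts at this token
    return matches


def geolocate(tweet_text, user_location):
    """Use toponyms from the tweet content, falling back to the profile location."""
    return find_toponyms(tweet_text) or find_toponyms(user_location)


print(find_toponyms("Lockdown in New York and Paris!"))
print(geolocate("Stay safe everyone", "Doha, Qatar"))
```

A real gazetteer would also need to handle ambiguous names, multiple languages, and different spatial scales (country, state, city), which this sketch omits.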

Alqurashi et al. (2020) collected a large Arabic Twitter dataset on COVID-19. They collected tweets in Arabic from January 1, 2020, to April 30, 2020, using specific keywords and hashtags, and provided preliminary statistics on the dataset. The dataset could help researchers and policymakers study different societal issues related to the pandemic, such as behavioral change, information sharing, misinformation analysis, and spreading rumors. ArCOV-19 (Hamzah et al., 2020) is an Arabic COVID-19 Twitter dataset of 2.7M tweets spanning one year, covering the period from January 27, 2020 to March 31, 2020. It includes around 748k popular tweets alongside their propagation over the social network. The authors showed that ArCOV-19 captures the discussions associated with reported cases since the beginning of the outbreak in the Arab world. Aggregating and analyzing large datasets of diverse data helps in tracking the spread of the virus, identifying hotspots, assessing the effectiveness of interventions, and forecasting future trends. Moreover, large-scale datasets enable the development of predictive models for disease transmission, severity, and outcomes, as well as the evaluation of vaccine efficacy and safety. However, challenges such as data privacy, quality assurance, and interoperability need to be addressed to maximize the utility of large-scale COVID-19 datasets while safeguarding individual rights and ensuring data integrity (Bentotahewa et al., 2021).

Topic Detection and Monitoring

Most of the recent research on social data mining covered some exploratory analysis of COVID-19 streams using topic modeling and detection methods. Most works collected data from the Twitter social network during the outbreak, starting from January or February 2020. For instance, the authors in Ordun et al. (2020) investigated research questions to discover high-level trends and events that can be inferred from COVID-19 tweets. Using UMAP analysis, they inferred local clusters of topics representing personal protective equipment (PPE), healthcare workers, and government concerns. Using document embedding techniques like UMAP allowed for a better understanding of distinct topics extracted with the LDA method. Topic detection and monitoring are being used by some organizations to understand public sentiment and track discussions related to the COVID-19 pandemic. Such a process requires collecting data from various sources such as social media platforms. The main goal is to extract insights to enable decision-making and proactive engagement with healthcare entities. Real-time monitoring and alerting mechanisms ensure timely response to evolving discussions, complemented by regular reporting and actionable recommendations (World Health Organization et al., 2021). Wahid et al. (2023) developed the COVICT system, demonstrating its potential for early detection, monitoring, and contact tracing. Leveraging real-time symptom data and semi-automated contact tracing can significantly aid in controlling the spread and identifying high-risk areas for targeted interventions. The potential for smart lockdowns and informed policy-making through this IoT architecture shows promise in the ongoing battle against the pandemic.

Fig. 3 Term map of COVID-19 social media analysis

Research on topic modeling, extraction, and sentiment analysis is increasingly widening its focus, as most of the datasets collected from the internet came from popular social media such as Twitter and Facebook. Figure 3 shows the term map built from the most frequent terms in related papers. Various terms can be seen, such as content analysis, text mining, topic modeling, and depression. The minimum number of keyword occurrences used in this study is 5. For each keyword, the total strength of the co-occurrence links with other keywords is calculated, and the keywords with the highest total link strength are selected. The initial term map, covering 2020, 2021, and the period until March 2022, consists of 18 terms in 5 clusters.

An analysis of retweet speed shows that the median retweeting time was approximately 50 minutes faster than repostings from Chinese social media about H7N9 in March 2013. The size of the corpus is 5,506,223 tweets, about 77% of 23,820,322 tweets. Using tweets posted from 1 January 2020 until 30 April 2020, Agarwal et al. (2020) developed a framework to classify important tweets relating to the COVID-19 pandemic and investigated topic modeling to identify the issues and topics most discussed in their data collection. To deal with developments during the pandemic, the authors studied the temporal shifts in the topics and discovered that eight topics were enough to classify the themes. These topics show a pattern that can be tracked over time: the dominant themes differ over time and correlate with the COVID-19 cases.

On the other hand, Kabir and Madria (2021) developed EMOCOV, which uses a collected Twitter dataset to visualize extracted topics and to represent human emotions during the global pandemic. Their dashboard presents various data analytics in the USA over a specified period of time to show changes in topic trends, human emotions, and the subjectivity of user feedback. Abd-Alrazaq et al. (2020) presented an infoveillance study on collected data of 2.8 million tweets from 160,829 unique users between February 2, 2020 and March 15, 2020. The tweets were analyzed using word frequencies of single words (unigrams) and double words (bigrams). Latent Dirichlet allocation was employed for topic modeling to identify the main topics. Sentiment analysis was performed for each topic, and the interaction rate per topic was calculated from the mean number of retweets, likes, and followers for that topic.
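
A word-frequency analysis with unigrams and bigrams can be sketched in a few lines. The tokenizer below is a deliberately simple assumption (lowercasing and keeping letters, digits, hashtags, mentions, and apostrophes); the cited studies may use different preprocessing, such as stop-word removal. The sample tweets are hypothetical.

```python
import re
from collections import Counter


def ngram_counts(tweets, n):
    """Count word n-grams (n=1 for unigrams, n=2 for bigrams) across tweets."""
    counts = Counter()
    for text in tweets:
        tokens = re.findall(r"[a-z0-9#@']+", text.lower())
        # zip over n shifted copies of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Hypothetical sample tweets.
sample_tweets = ["Stay home stay safe", "stay home"]

unigrams = ngram_counts(sample_tweets, 1)
bigrams = ngram_counts(sample_tweets, 2)
print(unigrams.most_common(2))
print(bigrams.most_common(1))
```

N-grams are counted within each tweet only, so no bigram spans the boundary between two different tweets.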

CoronaTracker is an online platform that provides the latest news developments, as well as statistics and analysis on COVID-19 (Hamzah et al., 2020). The platform visualizes real-time data queries, and the queried data is then used for Susceptible-Exposed-Infectious-Recovered (SEIR) predictive modeling. The model predicts COVID-19 cases, deaths, and recoveries. It also helps to interpret patterns of public sentiment on disseminating related health information and to assess the political and economic influence of the spread of the virus. Among other applications, the authors in Guntuku et al. (2020) also studied the impact of COVID-19-related news on mental health and symptom mentions of users from Twitter data. However, no current work proposes a hybrid social and physical sensing approach to address some of these challenges. Gozes et al. (2020) developed an analysis tool to classify and quantify computed tomography (CT) images of potential COVID-19 cases using deep learning. Several datasets from disease-infected areas in China were used for the training. Retrospective experiments were conducted to assess system performance in identifying thoracic CT features of potential COVID-19 cases. Zheng et al. (2020) used topic modeling to reveal insights from Twitter users’ feedback about the disease. They focused on the temporal analysis of related topics throughout the pandemic. Hou et al. (2021) analyzed Weibo texts (from Dec. 2019 to May 2021) to infer the public attention and users’ sentiments on 41 popular topics related to COVID-19. Similarly, Zhang et al. (2021) aimed at identifying Twitter groups based on their concerns, sentiments, emotions, and disparities. Text mining from social media to infer policies for healthy and safe airports was also investigated in Park et al. (2021) to enrich the user experience in urban infrastructures.

Sentiment analysis on topics related to COVID-19 has also been studied in other research works. Chakraborty et al. (2020) analyzed a dataset containing 226,668 tweets collected from December 2019 to May 2020 and showed that positive and neutral tweets formed the largest share of what netizens posted. They demonstrated that although people tweeted mostly positively regarding COVID-19, netizens were engrossed in re-tweeting the negative tweets, and that no useful words could be found in word clouds or in computations using word frequency in tweets. They validated their proposed model using deep learning classifiers and Bag-of-Words and Doc2Vec models, with an admissible accuracy of up to 81%. They also proposed a fuzzy rule base built on a Gaussian membership function to identify sentiments from tweets correctly. Sentiment insights on coronavirus-specific tweets were also studied in Samuel et al. (2020). The authors traced the progress of fear sentiment over time as the pandemic approached peak levels in the USA, using exploratory and descriptive textual analytics and visualization tools. Their approach discovers early-stage insights using two essential textual classification methods and assesses their ability to classify coronavirus-related tweets. They observed a high accuracy for classifying short tweets using the Naïve Bayes method. In contrast, the logistic regression classification method yielded a reasonable accuracy on short tweets, and both methods performed relatively weaker for longer tweets. Nemes and Kiss (2021) also investigated users’ emotional polarity from Twitter on COVID-19-related topics using recurrent neural networks and sentiment analysis. Rapid emotional changes and fluctuations were observed across different classes of emotions, with a good overall classification performance.

Analyzing Fake News and Misinformation

The analysis and discovery of fake news and misleading information on social media has gained great interest due to the huge unverified streams of incoming content spread over social networks. This issue is more crucial when discussing health-related topics, especially during the COVID-19 pandemic, where the amount of misinformation shared is colossal (Ayoub et al., 2021). Researchers have recently examined the activities of automated social media accounts, or bots, and the spread of false news on the pandemic (Nakov & Da San Martino, 2021). Al-Rawi and Shukla (2020) investigated the activities of social bots by adopting an integrated approach comprising data acquisition, classification/prediction, text mining, and network analysis. They collected tweets and retweets referencing standard terms, e.g., #COVID19, over a period of more than two months from February until April 2020. The total sample comprised 50,811,299 tweets from 11,706,754 unique users. The final sample was extracted from more than 185,000 messages posted by 127 bots. They showed the main classes and subclasses of bots’ memes. They found that most bots are driven by financial incentives and try to increase awareness of COVID-19 risks by citing official media and health sources. In contrast, other kinds of bots actively support the survivalist movement by emphasizing the need to prepare for the pandemic and learn survival skills.

Along the same lines, Apuke and Omar (2021) proposed a model of the predictors of fake-news distribution amongst social media users, with Nigeria as a case study. The authors describe the results obtained from a Nigerian sample regarding the dissemination of fake news related to COVID-19. Data was analyzed with Partial Least Squares metrics to find the impact of different parameters on disseminating fake news. An explainable NLP model to detect misinformation from social media was proposed in Ayoub et al. (2021) by using a variant of BERT embedding, DistilBERT, and SHAP (Shapley Additive exPlanations) for better explainability. A dataset of 984 claims about COVID-19 was collected and verified with fact-checking sources, and the model was tested on the COVID-19 dataset. The results show high accuracy in detecting misinformation while figuring out the source of fake news. Misinformation in COVID-19-related tweets has also been investigated in Sharma et al. (2020). Streaming data from Twitter was collected from March 1, 2020, to June 2020, with 8.1M tweets from 182 countries. The authors identified unreliable and misleading content based on fact-checking sources and studied the narratives endorsed in misleading tweets and their distribution of engagement. Misinformation is identified by evaluating the retweet trees of a given post. A statistical dataset of source tweets with labels on misinformation cascades was used, together with the classifier developed in Sharma et al. (2019), which relies on a character-level embedding to determine suspicious cascades. The dashboard presented analysis and a daily updated list of identified misinformation claims during the pandemic, along with examples of the spreading patterns of potentially misleading tweets.

Various use cases have emerged in analyzing fake news and misinformation during COVID-19. One such instance involves investigating COVID-19 misinformation on social media platforms like Twitter and Facebook, as well as on news websites and online forums. The overarching objective is to aid stakeholders in promoting accurate information and mitigating the harmful effects of false claims during the pandemic. Iwendi et al. (2022) developed an approach to combating COVID-19-related misinformation by employing Information Fusion to gather real news data from trusted sources and fake news data from social media. Using deep learning models, 39 features were created from multimedia texts to detect fake news, resulting in a substantial improvement in accuracy. The precision, recall, and F1-Measure metrics demonstrate the effectiveness of the models in discerning between real and fake news, outperforming standard machine learning algorithms. This approach holds promise in addressing the challenges posed by misinformation during the pandemic.

Special Considerations on Arabic NLP for COVID-19

Over the last few years, there have been several attempts to process Arabic content in a variety of applications. For example, Arabic sentiment analysis using a lexicon-based system for Modern Standard Arabic (MSA) applied to “news” was proposed in Abdul-Mageed and Diab (2011). Similar work was reported on modern Arabic in Mourad and Darwish (2013), using random walks on graphs together with Naïve Bayes and SVM classifiers. An example of using a dataset of Arabic social media content and POS tagging for multi-genre, multi-dialect sentiment analysis can be found in Abdul-Mageed and Diab (2014). Datasets and deep learning models for Arabic text classification were also proposed in Elnagar et al. (2020). Other attempts were made to process tweets and Arabic microblogs. A lexicon-based sentiment analyzer for both MSA and Egyptian dialectal Arabic tweets has been developed using an SVM classifier (Heikal et al., 2018). Arabic sentiment analysis of Twitter data related to COVID-19 was presented in Alanazi et al. (2020), with the aim of extracting and ranking the common symptoms discussed among patients on social media. The results were reported from 463 Twitter users who reported testing positive, with 66% reporting symptoms. Among the symptomatic patients, the top three reported symptoms were fever, headache, and anosmia. Event detection from social media was discussed in Ibrahim et al. (2015) using a language-independent Naïve Bayes classification model. The focus was only on a specific type of ‘disruptive’ events rather than on a generic event detection platform.

Based on the World Health Organization (WHO) definition, an infodemic depicts the use and spread of false or misleading information over any kind of physical or digital media (Footnote 1). An Arabic infodemics study was presented in Shaar et al. (2021). The authors designed a pilot annotation for English and Arabic, organized into seven questions about the input tweet streams. They annotated 504 English and 218 Arabic tweets with a seven-class labeling schema, focusing on the most retweeted ones. They used pre-trained models for word embeddings, namely (i) AraBERT, (ii) FastText, and (iii) BERT, together with an SVM classifier. They argued for the need for a holistic approach to counter the global infodemic related to COVID-19. They stated that the problem in the context of the COVID-19 infodemic is not limited to malicious content and conspiracy theories but also extends to the endorsement of fake cures, panic, racism, xenophobia, and mistrust in authorities.

Another approach to fake news detection in Arabic is presented in Alsudias and Rayson (2020). The objective was to identify main topics, detect rumors, and predict tweet sources by using k-means clustering and ML classifiers with manual labels on false information. They collected a dataset of tweets related to COVID-19 from December 2019 to April 2020, which contained 1,048,575 unique tweets. They provided a labeled sample of 2000 tweets annotated for false, correct, and unrelated news. Around 60% of the rumors found on Twitter were reported by health professionals and academics, which shows the risk and urgent demand to alert against such fake news.

Analyzing COVID-19 related content in Arabic presents unique challenges in Natural Language Processing (NLP) due to dialectal variations, code-switching, lack of standardization, limited language resources, sentiment analysis and cross-lingual information retrieval (Bahja et al., 2020). Dialectal variations across Arabic-speaking regions necessitate adaptable NLP models capable of handling diverse linguistic forms, while code-switching between Arabic and other languages requires proficiency in recognizing mixed text. Moreover, the lack of standardized COVID-19 terminology in Arabic complicates information extraction, highlighting the need for specialized resources.

Table 4 Medical image classification during the COVID-19 pandemic
Table 5 Contact Tracing (CT) and Time Series (TS) data mining during the COVID-19 pandemic

COVID-19 Mining Techniques for Other Types of Data

Although our focus in this survey is to discuss techniques for mining COVID-19 insights from social data, for completeness, we present in this section COVID-19 mining techniques related to other types of data, from medical imaging, to time-series data published by health organizations and online platforms, and finally to contact tracing. Table 4 illustrates some of the main ongoing research in medical imaging for COVID-19 early screening, while Table 5 presents contributions in contact tracing and time series data analysis.

Medical Imaging

Early COVID-19 screening through X-ray image classification has been studied in many recent works (Chowdhury et al., 2020; Jain et al., 2020; Zebin & Rezvy, 2020; Turkoglu, 2020). Deep learning classification based on X-ray or CT medical imaging and trained on labeled image datasets is the most dominant approach in this research field. We, therefore, provide an overview of existing work on deep learning approaches for medical image classification, focusing on COVID-19 detection methods. Ouchicha et al. (2020) proposed a methodology for early screening of COVID-19 cases based on chest X-ray images using CNN-based three-class classification: (i) normal, (ii) viral pneumonia, and (iii) COVID-19. They trained their model with 219 COVID-19, 1341 normal, and 1345 viral pneumonia chest X-ray images and evaluated the performance based on accuracy, precision, recall, and F1 score. A promising accuracy of 96.69% was reported on the three-class classification. However, most of the approaches in this domain suffer from a small training data size, which may impact the scalability of such models for real-life diagnosis. Other approaches proposed to augment the training data using techniques such as Generative Adversarial Networks (GANs) and the Keras image data generator (Tabik et al., 2020; Umer et al., 2021; Zebin & Rezvy, 2020).
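To make the four classification metrics mentioned above concrete, the following minimal Python sketch computes accuracy and per-class precision, recall, and F1 score for a three-class problem. The function name, the class labels, and the toy predictions are hypothetical and are not taken from any of the surveyed papers.

```python
def classification_metrics(y_true, y_pred, labels):
    """Return overall accuracy and per-class precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, per_class


# Toy example with hypothetical labels for six chest X-ray images.
labels = ["normal", "viral", "covid"]
y_true = ["normal", "normal", "viral", "viral", "covid", "covid"]
y_pred = ["normal", "viral", "viral", "viral", "covid", "normal"]
accuracy, per_class = classification_metrics(y_true, y_pred, labels)
```

With imbalanced classes, such as the 219 COVID-19 images against more than 1300 images in each of the other classes reported above, per-class recall is usually more informative than overall accuracy.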

Tabik et al. (2020) proposed building a database for COVID-19 triage systems using a class-inherent transformations (CiT) network inspired by GANs. Umer et al. (2021) used CNN classification to quantify COVID-19 cases in terms of severity levels: normal, mild, moderate, and severe. They trained their model on 426 positive and 426 negative chest X-ray images and a generated dataset of 10,000 images using the ImageDataGenerator class from Keras. Radiomic features and ML algorithms were combined in another approach for the early detection of COVID-19 and its distinction from other types of viral/bacterial chest infections (Tamal et al., 2021). Similarly, Zebin and Rezvy (2020) used transfer learning for classifying COVID-19 chest X-ray images and CycleGAN for image augmentation. They aimed at distinguishing inflammation in the lungs due to COVID-19 and pneumonia from normal cases based on 673 labeled X-ray and CT images. Another COVID-19 X-ray image classification approach extracted features from CNN layers, selected them with the Relief feature selection algorithm, and classified them with an SVM (Turkoglu, 2020). Perumal et al. (2020) presented a COVID-19 chest X-ray (CXR) classification method based on transfer learning and Haralick feature extraction. They claim that texture feature extraction can be very helpful for early screening. Their model was trained on 81,176 observations with disease labels. Janarthanan et al. (2021) presented a study on how artificial intelligence and medical imaging can be utilized to diagnose COVID-19 patients. The authors extracted data from various research reports, articles, and WHO guidelines to identify the disease’s diagnosis, treatment strategies, and outcomes.

Alelyani et al. (2021) provide an evaluation study on the impact of the COVID-19 pandemic on medical imaging. The idea was to study how imaging volumes and imaging types in radiology are affected by COVID-19 in various locations. The authors utilized images between 2019 and 2020 from different hospitals, covering cases from outpatient, inpatient, and emergency departments. The data was compared using t-tests. The results show a decline of 76% in outpatient departments and 25% in emergency departments. Moreover, there was a decrease in nuclear medicine, ultrasound, MRI, and mammography by 100%, 76%, 74%, and 66%, respectively. Born et al. (2021) offer a systematic review on the use of AI in imaging for COVID-19. The authors covered 463 papers published on AI for imaging-related studies. Their findings showed a significant disparity between the clinical and AI communities in their focus on both imaging modalities. Furthermore, most of the research was found to be lacking concerning potential use in clinical practice. In addition, Aytaç et al. (2022) suggest that applying an adaptive momentum rate for image classification would reduce classification error and increase accuracy.

Quak et al. (2021) studied the relationship between gender disparity in medical imaging research and the COVID-19 pandemic. The goal was to investigate the impact of the pandemic on female physicians’ research in medical imaging, as reflected in scientific publications. To this end, the researchers gathered information from 50 medical imaging papers published between March and May 2020. The results show a gender imbalance in the first and last authorship of articles submitted to the top 50 medical imaging journals. Rehouma et al. (2021) provide a comprehensive review on the use of machine learning models in COVID-19 detection. A total of 62 papers based on deep learning algorithms were selected for analysis. The authors illustrated that convolutional neural networks have been widely used for image segmentation and classification to detect patients with COVID-19.

Contact Tracing

Contact tracing is another very important field of research that has witnessed wide adoption and government support through the development of mobile contact tracing applications that help in tracking confirmed cases and in reducing the spread of the disease (Ahmed et al., 2020). Many applications have been published to reduce the fast spread of COVID-19. Nonetheless, this approach has failed to some extent in achieving its purpose for many reasons (Dar et al., 2020). A survey on existing applications can be found in Ahmed et al. (2020). Our objective in this section is to highlight new approaches and discuss techniques that try to cope with the issues encountered in such mobile tracing applications.

Contact tracing of COVID-19 cases in Korea was studied in Park et al. (2020). The authors proposed indexing confirmed cases, high-risk and non-high-risk groups, and tracking contacts by linking to large databases (59,073 contacts and 5,706 COVID-19 indexed patients). They aimed to highlight the role of household transmission amid the reopening of schools and the loosening of social distancing. Bradshaw et al. (2021) introduced hybrid bidirectional contact tracing with digital exposure notification, based on stochastic branching-process modeling. They investigated the effect of hybrid manual and digital tracing in identifying infectors and their infectees, as well as the benefits of bidirectional tracing. Another approach for contact tracing using indoor trajectories of moving users was introduced in Alarabi et al. (2021), which considers social distancing and the exposure period to find potential infectees.
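The idea of combining a social-distancing threshold with an exposure period can be sketched as follows. This is only a simplified illustration under our own assumptions, not the algorithm of any surveyed paper: trajectories are assumed to be sampled at regular, shared time steps, positions are planar coordinates in meters, and the function name and default thresholds are hypothetical.

```python
from math import hypot


def potential_infectees(trajectories, infected, max_distance=2.0, min_exposure=3):
    """Return users who stayed within `max_distance` of the infected user
    for at least `min_exposure` consecutive time steps.

    `trajectories` maps a user id to a dict {time_step: (x, y)}.
    """
    source = trajectories[infected]
    exposed = []
    for user, path in trajectories.items():
        if user == infected:
            continue
        longest = current = 0
        for t in sorted(source):
            close = (
                t in path
                and hypot(source[t][0] - path[t][0],
                          source[t][1] - path[t][1]) <= max_distance
            )
            # Extend the current run of close contact, or reset it.
            current = current + 1 if close else 0
            longest = max(longest, current)
        if longest >= min_exposure:
            exposed.append(user)
    return exposed


steps = range(5)
trajectories = {
    "a": {t: (0.0, 0.0) for t in steps},                            # infected user
    "b": {t: (1.0, 0.0) for t in steps},                            # continuously close
    "c": {t: (1.0, 0.0) if t % 2 == 0 else (10.0, 0.0) for t in steps},  # brief contacts only
    "d": {t: (5.0, 5.0) for t in steps},                            # never close
}
result = potential_infectees(trajectories, "a")
```

A production system would additionally need spatial indexing to avoid comparing every pair of users, and would have to account for walls and floors in indoor settings.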

Tran and Nguyen (2021) investigated a risk-risk tradeoff model, based on privacy calculus theory and the risk-risk tradeoff notion, to better understand the risk minimization decisions of COVID-19 contact-tracing app users. According to their findings, users engage in a health risk-privacy risk tradeoff while considering and opting to use the applications. As a result, their study contributes to the field of privacy calculus theory research and argues for a balanced management solution to this tradeoff challenge. Chan and Saqib (2021) conducted three experiments in France, Australia, and the United States to see if key COVID-19 issues, which should raise worries about personal and public health, also raise privacy concerns, thereby lowering the use of contact tracing applications. Using an experimental design in which individuals were randomly assigned to either a disease concern or a control condition, they discovered that salient COVID-19 concerns reduce intentions to use contact tracing applications. The mediation findings show that higher privacy values explain the lower willingness. Jamieson et al. (2021) evaluated attitudes on downloading and utilizing contact tracing apps and how they linked to respondents’ everyday lives, work patterns, and overall sentiments about the epidemic, using a survey of 153 working individuals and 15 follow-up interviews. They discovered that the incentives for downloading the app differed from those for continued use. They looked at how people navigated ambiguous behavior norms during the epidemic and considered personal risks while determining whether to use contact tracing apps.

The main challenge discussed in mobile contact tracing is the privacy concern raised by revealing users’ detailed movements and contacts in real-time (Mokbel et al., 2020). Privacy-preserving contact tracing through technological facilities was recently proposed in Mokbel et al. (2020). It suggests a paradigm shift from personal tracking through GPS or BLE-based techniques to large infrastructures and facilities, thus achieving better accessibility for elderly people and less exposure of users’ private data. Privacy concerns on mobile contact tracing have also been investigated in Cho et al. (2020). The authors discussed different privacy-aware methods with a use case on Singapore’s contact tracing app, using partial anonymization via polling, random tokenization, and private messaging systems.

Time Series Data Mining Techniques

DeepTrack is a real-time dashboard for spatio-temporal monitoring of COVID-19 data (Luo et al., 2020). Different types of interactive visual analytics were used, such as choropleth maps, linked common, ad-hoc, and recommended visualizations. Other systems apply ETL-based data integration and generate analysis related to high-risk area discovery, tracking infection path, and similar trend search in real-time (Leung et al., 2022).

The analysis of time series COVID-19 health data across six geographic regions was presented in Hernandez-Matamoros et al. (2020). The authors introduced a relationship model between countries in the same geographical region to predict the spread of the virus. They evaluated their algorithm using the Auto-Regressive Integrated Moving Average (ARIMA) model for 145 countries distributed over six regions, with parameters that include population per 1 million people, the number of cases, and polynomial functions. Their results show the potential to create other models to predict the pandemic behavior using other variables, such as humidity, climate, and culture. They collected data from the European Centre for Disease Prevention and Control (ECDC), the WHO, Johns Hopkins, the United Nations, the World Bank, the Global Burden of Disease, and the Blavatnik School of Government. In a different approach, the social impact of the COVID-19 pandemic on the employment promotion policies for graduate students in China was studied in Chen et al. (2021).

A statistical model is proposed in Dash et al. (2021) to forecast characteristics of the COVID-19 outbreak, such as future peak dates and change points in the growth of the pandemic, by analyzing time series data of new cases. Also, by analyzing time series of suicide data from several countries, Pirkis et al. (2021) found that the number of suicides remained mostly unchanged or declined in the early months of the pandemic compared to the expected numbers. ARIMA modeling to forecast the expected daily number of COVID-19 cases in Saudi Arabia was also explored in Alzahrani et al. (2020). The model is tested on 7668 new cases per day and over 127,129 cumulative daily cases in four weeks. The forecasting results showed the trend in Saudi Arabia compared to the prediction of new cases from the official website of the Saudi Ministry of Health. The prediction of daily discovered and death cases was evaluated using RMSE, MAPE, and RMSRE values, together with the highest R² values.

Stability analysis of the COVID-19 spread in Indonesia was studied in Annas et al. (2020) by simulating the SEIR mathematical model on COVID-19 data. The authors constructed the SEIR model by considering vaccination and isolation factors as model parameters and used the generation matrix method for data analysis. A comparative study of five deep learning methods to forecast the number of new cases and recovered cases from six countries was presented in Zeroual et al. (2020). Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Variational AutoEncoder (VAE) algorithms were developed to demonstrate the promising potential of deep learning models in forecasting COVID-19 cases. VAE achieved better forecasting performance of new and recovered cases than all other models. The datasets used were made publicly available by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. The five DL forecasting methods were assessed using MAE, RMSE, MAPE, EV, and RMSLE metrics for each country.
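For readers unfamiliar with compartmental models, the following minimal Python sketch simulates the basic SEIR dynamics with a forward Euler scheme. It deliberately omits the vaccination and isolation factors considered in the study above, and the function name, the parameter values, and the initial population are hypothetical choices made for illustration only.

```python
def simulate_seir(s0, e0, i0, r0, beta, sigma, gamma, days, dt=0.1):
    """Simulate the basic SEIR model and return one (S, E, I, R) tuple per day.

    beta:  transmission rate, sigma: incubation rate (E -> I),
    gamma: recovery rate (I -> R). Integration uses forward Euler steps.
    """
    n = s0 + e0 + i0 + r0
    s, e, i, r = float(s0), float(e0), float(i0), float(r0)
    history = [(s, e, i, r)]
    steps_per_day = int(round(1 / dt))
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / n * dt   # S -> E
            new_infectious = sigma * e * dt       # E -> I
            new_recovered = gamma * i * dt        # I -> R
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Hypothetical population of 1000 with 10 initially infectious individuals.
history = simulate_seir(990, 0, 10, 0, beta=0.5, sigma=0.2, gamma=0.1, days=30)
```

In forecasting applications, the rates would be fitted to reported case data rather than fixed by hand, and a more accurate ODE solver would typically replace the Euler scheme.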

Deep learning forecast models are also proposed in Shahid et al. (2020) for time series prediction of confirmed cases, deaths, and recoveries in ten major countries affected by COVID-19. The developed models include ARIMA, support vector regression (SVR), LSTM, and Bi-LSTM. The Bi-LSTM model outperforms the other models in terms of the endorsed indices and can be exploited for pandemic prediction for better planning and management. Predictive modeling to improve the current time-series forecasting algorithms was studied in Zivkovic et al. (2021), where a hybrid approach (Hybridized CESBAS-ANFIS) combining machine learning and nature-inspired algorithms was claimed to perform better than other approaches. Katris (2021) proposed a technique to produce a time series based on statistical data to track the spread of the COVID-19 pandemic. The technique used Multivariate Adaptive Regression Splines and Feed-Forward Artificial Neural Networks to analyze and predict the spread of the disease. Forecasting the COVID-19 spread was also studied in Desai (2021), where a multivariate CNN was trained on test positivity data combined with news sentiments derived from IBM Watson Discovery News. The model shows a spread prediction accuracy that is higher than that of the baseline Bayesian-based SEIRD model. Kuo et al. (2021) proposed a hybrid prediction approach based on county-level demographic, environmental, and mobility data. Multiple machine learning techniques and a hybrid framework were implemented to discover high infections on weekends, when mobility increases, and the effect of long and short lock-downs.

Eight machine learning algorithms were employed: (1) elastic net (EN) model, (2) principal components regression (PCR) model, (3) partial least squares regression (PLSR) model, (4) k-nearest neighbors regression (KNN) model, (5) regression tree (RT) model, (6) random forest (RF) model, (7) gradient boosted tree models (GBM), and (8) a 2-layer artificial neural network (ANN) model. The approach was evaluated on data extracted from official health platforms for new cases and on population mobility data. The results demonstrate better daily case prediction for the RF, GBM, and ANN models, while the EN and GBM models predicted cumulative cases well.

Table 6 Evaluation metrics per research field

Evaluation Metrics

This section emphasizes the need for evaluation metrics and benchmarks in each research field discussed above to achieve a deeper understanding and assessment of the proposed solutions. These metrics are domain-dependent. A summary of evaluation metrics per research field is given in Table 6. Note that most of the available studies on social data mining lack rigorous performance evaluation since they focus on producing an analysis of social data concerning topic modeling and the spread of misinformation. On the other hand, all approaches in medical image classification adopt the accuracy, precision, recall, and F1 scores to assess the classification performance. Contact tracing performance is assessed by considering user uptake and adherence, the ability to quarantine infectious people as accurately as possible, quick and secure notification, and the ability to evaluate effectiveness transparently (Braithwaite et al., 2020; Colizza et al., 2021). However, most approaches in mobile contact tracing do not consider such advanced metrics. Time series analysis is a well-established research domain. Thus, its evaluation relies on metrics with established mathematical foundations, such as MAE, RMSE, MAPE, EV, and RMSLE, to measure the forecast accuracy and bias (Vandeput, 2021).
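For reference, the forecast metrics listed above can be computed as in the following minimal Python sketch, which uses their standard textbook definitions. The function name and the toy series are hypothetical; EV is taken here to be the explained variance score, and MAPE assumes that no actual value is zero.

```python
from math import log1p, sqrt
from statistics import mean, pvariance


def forecast_metrics(actual, predicted):
    """Compute MAE, RMSE, MAPE (in percent), RMSLE, and explained variance."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = mean(abs(e) for e in errors)
    rmse = sqrt(mean(e * e for e in errors))
    # MAPE is undefined when an actual value equals zero.
    mape = 100 * mean(abs(e / a) for e, a in zip(errors, actual))
    # RMSLE compares log(1 + value), so it requires non-negative series.
    rmsle = sqrt(mean((log1p(p) - log1p(a)) ** 2
                      for a, p in zip(actual, predicted)))
    ev = 1 - pvariance(errors) / pvariance(actual)
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "RMSLE": rmsle, "EV": ev}


# Toy daily case counts and a hypothetical forecast.
actual = [100, 200, 300]
predicted = [110, 190, 330]
metrics = forecast_metrics(actual, predicted)
```

MAE and RMSE are expressed in the unit of the series (e.g., cases per day), whereas MAPE and RMSLE are scale-independent, which makes them more suitable when comparing forecasts across countries of different sizes.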

Analytical Perspectives

COVID-19 has recently been the subject of a large body of analytical research. For example, Chernozhukov et al. (2021) assess the complex effects of the different US-state policies on reported COVID-19 cases and deaths, and on social-distancing activity as measured by Google Mobility Reports. In this study, the voluntary reaction of individuals to news on transmission threats was considered in a causal structural model context. Their findings suggest that policies and risk communication are key determinants of COVID-19 cases and death rates. They also indicate that shifts in policy account for a significant proportion of the changes found in social distancing behavior. In Istanbul, a panel data analysis was presented by Shakibaei et al. (2021).

A framework developed in Agarwal et al. (2020) tries to explain, as part of the dialogue, the various incidents that may occur during a pandemic based on social media exchanges. Various strategies against COVID-19 have been implemented, including identifying outbreaks, monitoring viral propagation, diagnosis and treatment, vaccine discovery, and drug research. Other research challenges, including data safety, inconsistency in pattern, control, and transparency of diseases, and the difference between the symptoms of COVID-19 and non-COVID-19, were studied in Bhattacharya et al. (2021). Despite the good results, successful DL processing of COVID-19 medical images still takes considerable time and effort and requires close collaboration among various parties in government, industry, and academia. Grasselli et al. (2020) proposed a framework for forecasting the utilization and availability of ICU beds up to two weeks ahead during the COVID-19 pandemic. The framework uses an ensemble approach that combines autoregressive models, artificial neural networks, and a compartment model. The system was tested on a dataset from Chile, achieving a mean error of 4% for the first week and 9% for the second week. The results showed that the ensemble approach performed better than individual models in handling different scenarios.

The transmission model of the SARS-CoV-2 virus was used to develop a new dynamical model based on flow networks (López & Čukić, 2021). The proposed model was developed using the ‘SEQIJR’ model, which can detect SARS-CoV-1 network flow. The network analysis enables the transport flow to be defined as a linear programming problem in which some functions need to be optimized within the limitations of the system. The scalability and adaptability of the system to various sub-populations is a further advantage. However, the lack of adaptability to a specific region and the inclusion of elderly people nodes into the network are major deficiencies of the model.

Social Media Analysis

Social media has been a great source of data for analysis. As the news of COVID-19 kept spreading globally, some countries had yet to believe in the existence of the deadly disease. The phenomenon of fake news spreading has been studied in Apuke and Omar (2021) to identify how six different variables affect the spread of fake news. The findings suggest that generosity was the most important factor in predicting COVID-19 fake news distribution. However, the research failed to assess the impact of cultural context, age, and gender on fake news sharing, even though the study was applied to Nigeria, a country with multiple cultural and ethnic groups. Also, Zhong et al. (2021) carried out research among residents of Wuhan, where the COVID-19 outbreak originated. The authors examined how Wuhan residents processed health information on social media and how their use of social media could reveal a risk to mental health at the peak of the COVID-19 outbreak in Wuhan. Their results can help explain the potential connections between the use of social media and the mental distress experienced by individuals in a public health crisis. Furthermore, the study provides insights into the mechanism of health training and public reaction to pandemics for a deeper understanding. However, the study does not address how to design potential interventions and health policies that alleviate the impact on mental health during or after the COVID-19 crisis.

In addition, Zeemering (2021) carried out an exploratory investigation based on functional fragmentation in city hall. The data was collected from city agency Twitter accounts and key informant interviews to validate the significance of fragmentation for core organization, as well as public outreach. Drawing on the experience of the pandemic in Atlanta, San Francisco, and Washington DC, the study offers practical lessons for city governments and highlights the theoretical value of focusing on public relations methods through government. However, the research was undertaken during the early stages of the US response to COVID-19. Similarly, Shahi et al. (2021) conducted an exploratory study of COVID-19 misinformation on Twitter, using alternative and complementary approaches to analyze the Twitter accounts behind COVID-19 misinformation, the dissemination of this misinformation on Twitter, and the content of false claims about COVID-19 circulating on Twitter. They relied on the decisions of experienced fact-checking organizations that track every claim manually.

From another perspective, Durowaye et al. (2022) investigated perinatal health promotion content on Facebook and other websites during the pandemic. The authors concluded that although diverse topics related to healthy pregnancy during COVID-19 were covered on social media, many gaps remained in communicating the severity or risks of infection during pregnancy and in fighting misinformation. Teng et al. (2022) analyzed 43K YouTube comments to infer the reasons behind vaccine hesitancy among users. In particular, users raised concerns about safety and potential side effects, in addition to a lack of trust in the decisions of authorities and in pharmaceutical companies. The authors suggested that anti-vaccination activists on social media have spread a large amount of misinformation, which amplified vaccine hesitancy.

Contact Tracing from an Analytical Perspective

To understand the effect of social distancing measures on the lives of Brazilian MSM and transgender/non-binary people, Torres et al. (2020) conducted a web-based survey. The authors collected data on participants' personal lives, access to pre-exposure prophylaxis (PrEP) and antiretroviral therapy (ART), and sexual behavior. PrEP is taken by HIV-negative people to prevent infection, whereas ART is the treatment for people living with HIV. The survey data also help analyze the factors linked to the failure to sustain social distancing. Similarly, Castex et al. (2020) examined the influence of COVID-19 social distancing policies using cross-country variation in socio-economic, regional, environmental, and health system dimensions. They report that the efficacy of policies prescribing school and workplace closures decreases with population density, country surface area, the employment rate, and the proportion of elderly in the population, and increases with per capita GDP and health spending. According to the authors, these results are reinforced by cross-country human mobility data. Such a comparison is possible because policies are similar across countries, while country characteristics vary substantially. Privacy is still the biggest concern when tracing people's locations or interactions, as discussed in Liu et al. (2023). The authors suggest that an efficient privacy-preserving tracing solution can be developed by combining some intrinsic properties of blockchain, such as anonymity, decentralization, and traceability.

Emerging Technologies for COVID-19 Data Analytics

Vafea et al. (2020) reviewed the emerging technologies used for diagnosing and treating COVID-19. Their analysis outlines the new technologies used in COVID-19 research, diagnosis, and treatment. Key fields of focus include artificial intelligence, Big Data, and the Internet of Things, the relevance of mathematical prediction models, the use of community screening technology and nanotechnology, the use of telemedicine to manage new demands, and the potential of robotics and other technologies. Table 7 summarizes the emerging technologies for COVID-19 applications.

Table 7 Summary of the emerging technologies for COVID-19 applications

Impact on Social Behavior

Efforts to understand how the pandemic changed social behavior provide valuable insight for fighting COVID-19. Shakibaei et al. (2021) presented the social behavior of travelers in Istanbul during the COVID-19 pandemic. The study examines the impacts of the pandemic on travel behavior through descriptive research based on three-wave survey evidence. The results can help the Turkish government act on the conduct of individuals traveling in Istanbul and distinguish between trip purposes, such as commuting (home to work), social/recreational/leisure (SRL), and shopping. In another important study, Connor et al. (2022) investigated the impact of COVID-19 on education in England and the impact of online learning on parents, teachers, and students aged 11 to 15. Among 329 parents/carers and 117 teachers, one-third of teachers and around half of parents reported below-average well-being due to issues such as access to resources and confidence in online teaching. Parents also revealed concerns about their children's mental health and lack of access to electronic devices.

de Figueiredo et al. (2021) studied the effect of the COVID-19 pandemic on the mental health of children and adolescents, considering molecular, environmental, and social factors. These factors were taken into account because the sudden separation from the classroom, social life, and outdoor sports has significantly affected children and young people. Some have endured increasing domestic abuse. The paper highlights the need for supervision and treatment for these groups and seeks to alert public health and government agencies.

Impact on Businesses and Economy

The consequences of the COVID-19 pandemic have also had a profound impact on business and the economy. For instance, Silva et al. (2020) evaluated how econometric models, machine learning models, and ensemble methods can be used to predict new COVID-19 cases. The econometric models were ARIMA and SARIMA, and the machine learning models were AdaBoost and GBR. Ensembles of these models were also evaluated. The study tested the models on datasets from Brazil, South Korea, China, and Italy, using features such as the total number of cases, deaths, new cases, new deaths in the day, and recovered patients. The results showed that no single model gave the best predictions on all datasets. However, the ensemble of machine learning and econometric models showed great potential. This is because machine learning models perform poorly when little data is available, a weakness that ensemble methods can compensate for. Capasso et al. (2022) studied the relationship between employment conditions and protective measures among low-income US workers during the pandemic. Their findings suggest that essential workers struggled with variable income or income loss and unpaid sick leave, and that others suffered from food insecurity.
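The ensemble idea evaluated by Silva et al. (2020) can be illustrated with a minimal sketch. It assumes the simplest possible combination rule, an unweighted per-step average of two forecasts, which is not necessarily the rule used in that study. The function names and all numbers below are hypothetical and were invented only to show the mechanics.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute gap between observed and forecast values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def average_ensemble(*forecasts):
    """Combine equally long forecasts by taking the mean at each time step."""
    return [mean(step) for step in zip(*forecasts)]


# Toy daily case counts, invented for illustration (not from any cited dataset).
observed = [100, 120, 150, 170]
econometric_forecast = [110, 130, 140, 160]  # stand-in for an ARIMA-style output
ml_forecast = [90, 112, 158, 182]            # stand-in for a boosted-tree output

combined = average_ensemble(econometric_forecast, ml_forecast)

for name, forecast in [
    ("econometric", econometric_forecast),
    ("machine learning", ml_forecast),
    ("ensemble", combined),
]:
    print(name, mean_absolute_error(observed, forecast))
```

In this toy case the two models err in opposite directions, so averaging cancels most of the error. That cancellation is a property of the invented numbers; with real forecasts the gain depends on how correlated the individual model errors are.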

Sentiment Analysis

Lee (2020) considered the initial impact of COVID-19 sentiment on the US stock market using Big Data. This research examined the association between COVID-19 sentiment and 11 selected United States (US) stock market sector indices between 21st January 2020 and 20th May 2020, using the Daily News Sentiment Index (DNSI) and Google Trends data on coronavirus-related searches. While sentiment analysis based on Twitter data has been studied intensively for forecasting stock market movement, almost no such work has used DNSI or Google Trends. Moreover, the analysis explores whether changes in DNSI predict US industry returns differently by estimating a time-series regression model with excess industry returns as the dependent variable. Pham et al. (2022) studied the impact of the COVID-19 pandemic on the financial markets by analyzing the tweets of former US President Trump in order to infer industry-level reactions based on his tone during the pandemic. They discussed the relationship and statistical correlation with 49 industries by analyzing the sentiment of 2574 tweets from Trump's Twitter account.
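A regression with excess industry returns as the dependent variable, as described for Lee (2020), can be sketched in a few lines. The exact specification of that study is not given here, so the sketch assumes the simplest case: ordinary least squares with a single regressor (the change in a sentiment index) and a constant risk-free rate. The helper `ols_fit`, the variable names, and all numbers are hypothetical.

```python
def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy values invented for illustration (not real market or sentiment data).
sentiment_change = [-2.0, -1.0, 0.0, 1.0, 2.0]  # change in a sentiment index
sector_return = [-2.4, -0.9, 0.6, 2.1, 3.6]     # sector index return, in percent
risk_free_rate = 0.1                            # assumed constant, in percent

# Excess return = sector return minus the risk-free rate.
excess_return = [r - risk_free_rate for r in sector_return]

intercept, slope = ols_fit(sentiment_change, excess_return)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

A positive slope would indicate that the sector's excess return tends to rise when sentiment improves. Fitting the same model separately for each sector is one way to compare how strongly different industries react, under the assumptions above.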

Similarly, by monitoring COVID-19-related Twitter updates, Naseem et al. (2021) studied Twitter sentiment using benchmark sentiment analysis approaches. Their findings show that the population favored a lockdown in February, but that this view had shifted by mid-March. While the reason for the shift in sentiment is unclear, misinformation spreads across social media, and a proactive and agile public health presence is therefore necessary to fight the spread of false news. The authors have also published a large-scale COVID-19 dataset for sentiment analysis, which is freely accessible. In Italy, De Santis et al. (2020) developed an information management system to detect and monitor relevant topics in Italian tweets during the COVID-19 event. To this end, they experimented with a methodological paradigm based on a biological metaphor, which can monitor new words and evolving concepts over time, starting from a real-world dataset of tweets gathered during the lockdown. The technique was the basis for an ongoing Twitter monitoring scheme expressly designed to retrieve buzzwords and topics in the Italian language. Besides, the proposed system can discover emerging terms related to socio-political events in an unsupervised way, even for words that are used often and continuously, such as the names of leading prime ministers. It is also general enough to identify and track topics arising from socially important events in feeds of Twitter messages written in any language.

From a different perspective, Mushtaq et al. (2022) presented a sentiment analysis of users' tweets concerning COVID-19 vaccines, such as Pfizer, Moderna, and Sinopharm. They reported users' sentiments on vaccines in general and then on each vaccine together with its geographical distribution. The study also tracked peak discussion times for specific vaccines and their spatial distribution. Overall, sentiments on related topics changed over space and time, and the overview given can help policymakers adjust their policies to enhance the acceptance of their vaccination programs.

Prediction

Several research studies have investigated the possibility of predicting occurrences and trends of the COVID-19 pandemic. Elsheikh et al. (2021) proposed a deep learning model based on long short-term memory (LSTM) for predicting the number of total confirmed cases, recovered cases, and deaths due to COVID-19 in Saudi Arabia. The proposed model was also tested on other countries for verification purposes, including Brazil, India, South Africa, Spain, and the USA. The system used the optimal hidden value and learning rate to achieve better results, which were 100 and 0.005, respectively. The system could predict results up to 1 week ahead and performed better than the baseline systems it was tested against, including NARANN and ARIMA. The system was evaluated with several metrics, including root mean square error, coefficient of determination, mean absolute error, efficiency coefficient, overall index, coefficient of variation, and coefficient of residual mass. On the coefficient of determination, which reflects the agreement between predicted and actual results (with a score between 0 and 1), the system achieved 0.976 for total cases and 0.944 for total deaths.
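Three of the evaluation metrics listed above have standard textbook definitions and can be computed directly. The sketch below uses those standard formulas, which may differ in detail from the exact variants used by Elsheikh et al. (2021). The function names and the toy values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared residual."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute residuals."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values invented for illustration (not results from any cited study).
actual_cases = [10, 20, 30, 40]
predicted_cases = [12, 18, 33, 37]

print("RMSE:", rmse(actual_cases, predicted_cases))
print("MAE:", mae(actual_cases, predicted_cases))
print("R^2:", r_squared(actual_cases, predicted_cases))
```

Note that in this formulation the coefficient of determination can fall below 0 when a model fits worse than simply predicting the mean, so the 0 to 1 range applies to models that do at least as well as that baseline.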

Adly et al. (2020) introduced an automated tool for computed tomography (CT) image analysis to deal with the outbreak of COVID-19 using a deep learning approach. The idea is to detect, track, and quantify COVID-19 so that patients infected with COVID-19 can be distinguished from those who are not. The study used a variety of worldwide databases, including data from disease-infected areas in China. Chieregato et al. (2022) proposed a severity prediction model that separates ICU from non-ICU patients based on CT images, using a 3D CNN for feature extraction and a CatBoost classifier. The authors suggest that integrating heterogeneous features and improving model interpretability would greatly enhance prediction in such complex tasks. From a different perspective, Kim et al. (2022) studied the development of quarantine-related programs and the level of adherence to distancing rules in different communities. The aim was to identify the influential factors and norms and to predict how individuals' compliance with quarantine rules may affect the next wave of COVID-19.

Other Perspectives

Healthcare Infrastructure and Resource Allocation: Data mining techniques can be used to optimize the allocation of healthcare resources such as hospital beds, ventilators, and medical personnel based on COVID-19 case data and patient demographics (Arunmozhi et al., 2022).

Vaccine Distribution and Effectiveness: Data mining can aid in the distribution and monitoring of COVID-19 vaccines, including tracking vaccination rates, identifying vaccination hotspots, and assessing vaccine effectiveness through real-world data analysis (Sun et al., 2021).

Epidemiological Modeling and Forecasting: Data mining techniques are used in epidemiological modeling to predict the spread of COVID-19, estimate infection rates, and evaluate the effectiveness of intervention strategies such as lockdowns and social distancing measures (Namasudra et al., 2023).

Public Health Interventions and Policy Evaluation: Data mining can inform public health interventions and policy decisions, including evaluating the impact of various interventions on disease transmission, healthcare outcomes, and socio-economic indicators.

Community Engagement and Behavioral Interventions: Data mining can be leveraged to promote community engagement, encourage adherence to public health guidelines, and design targeted behavioral interventions to mitigate the spread of COVID-19.

Ethical and Privacy Considerations: COVID-19 data mining raises ethical and privacy concerns, including issues related to data security, informed consent, and the responsible use of sensitive personal information in research and decision-making processes (Anshari et al., 2023).

Long-Term Socio-Economic Impacts and Recovery Strategies: The long-term socio-economic impacts of the COVID-19 pandemic call for data-driven strategies for economic recovery, workforce reintegration, and rebuilding resilient communities in the post-pandemic era.

Opportunities and Challenges

The spread of COVID-19 has created opportunities and challenges for analyzing available datasets, such as medical images and tweets, to fight the pandemic. In this section, we highlight some of the opportunities and challenges in research related to the COVID-19 pandemic from the social media, medical imaging, and contact tracing perspectives.

Opportunities

The increased use of artificial intelligence-based techniques will enable social media data to be analyzed in real time. Such analysis provides an opportunity to track changing public sentiments concerning the COVID-19 pandemic and to communicate proactively with the public (Hussain et al., 2021). Moreover, there is a pressing demand to identify rumors, hoaxes, and misinformation about the COVID-19 outbreak on social media, which cause panic among the public. In addition, with vaccines rolled out, we need to fully understand public sentiments and address the concerns of vaccine skeptics (Hussain et al., 2021). Cortés-Martínez et al. (2022) presented a recent study on data mining algorithms that can be combined with epidemiological prediction models. The authors consider that such an integration would help develop more accurate prognosis tools for better management and tracking of viral diseases.

Safdari et al. (2021) reviewed the most popular data mining techniques for fighting pandemics, such as NLP for revealing disease characteristics. Abdalla et al. (2023) suggest that knowledge discovery methods can help infer unknown disease dimensions during the pandemic. A similar study reveals that 90% of techniques apply highly accurate supervised learning for classification or prediction tasks in the epidemiology discipline (Ghosh and Das, 2022).

Furthermore, social engagement among individual users and communities on social media applications is an essential research topic, as it may help in the development of more efficient epidemic models that account for social behavior, as well as more successful and targeted crisis communication tactics (Cinelli et al., 2020). During the COVID-19 epidemic, a significant incidence of mental health disorders was found to be positively correlated with frequent social media exposure (Gao et al., 2020). Social media can also provide opportunities for patients, clinicians, and scientists to disseminate and receive information. Contact tracing apps have raised many concerns about their purpose, privacy breaches, how they operate, authority sponsorship, and the willingness to use such a technology (Abuhammad et al., 2020).

Murphy et al. (2020) identified some opportunities to enhance cognitive behavior therapy during COVID-19. They found that a potential way to address fears of infection and the effects of social isolation is to deliver enhanced cognitive behavior therapy, an evidence-based treatment.

Challenges

The COVID-19 pandemic has generated enormous data on the spread of the virus, its impact on society, and the response of governments and healthcare systems. Data mining techniques have played a critical role in analyzing this data to gain insights and inform decision-making. While different types of vaccines and booster shots are available nowadays, the spread of the virus has not stopped (Yih et al., 2023). It seems more time is needed to reach herd immunity worldwide, and it is unclear how long newly generated COVID variants can resist or bypass developed vaccine protection (Windsor et al., 2022). Table 8 summarizes some of the existing challenges and opportunities of COVID-19 data analysis. Additionally, there are prospects for leveraging data mining solutions to overcome these challenges effectively.

Table 8 Challenges and opportunities per research field

Here are some lessons learned from a data mining perspective:

  • Continuous Virus Spread: Despite the availability of vaccines and booster shots, the spread of the virus persists. Achieving global herd immunity remains a challenge, compounded by the emergence of new COVID variants that may evade vaccine protection.

  • Real-time Data Utilization: While real-time data is crucial for decision-making, there are challenges in processing and analyzing large volumes of data in real-time. This requires robust data mining infrastructure and algorithms capable of handling streaming data efficiently.

  • Data Sharing and Collaboration: While data sharing and collaboration are essential, there are barriers to sharing data across borders and organizations, including privacy concerns and regulatory restrictions. Overcoming these barriers requires international cooperation and the development of standardized data sharing protocols.

  • Predictive Analytics Accuracy: While predictive analytics has been instrumental in forecasting the spread of the virus and predicting healthcare resource demand, there are challenges in developing accurate and reliable predictive models. This necessitates the refinement of modeling techniques and the incorporation of diverse data sources for improved model performance.

  • Data Quality Assurance: Ensuring data quality is critical for the reliability of analytical insights. Challenges such as data incompleteness, inconsistency, and bias can affect the accuracy of data mining models. Addressing these challenges requires robust data quality assurance processes and the implementation of data cleansing and normalization techniques.

  • Ethical Considerations: The use of personal data in data mining raises ethical considerations related to privacy, fairness, and transparency. There is a need for ethical guidelines and regulatory frameworks to govern the ethical use of data mining techniques in the context of the pandemic.
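The data quality point in the list above mentions cleansing and normalization. As a minimal sketch of what such a step might look like, the code below drops incomplete, implausible, and duplicate records and then rescales the remaining counts with min-max normalization. The record layout, the field names, the validity rules, and the sample values are all hypothetical choices made for illustration.

```python
def clean_records(records, required=("region", "date", "new_cases")):
    """Drop incomplete, implausible, and duplicate records (keeps first seen)."""
    seen = set()
    cleaned = []
    for rec in records:
        # Incompleteness: skip records with a missing required field.
        if any(rec.get(field) is None for field in required):
            continue
        # Inconsistency: a daily case count cannot be negative.
        if rec["new_cases"] < 0:
            continue
        # Duplicates: keep only one record per (region, date) pair.
        key = (rec["region"], rec["date"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Toy records invented for illustration (not real surveillance data).
raw = [
    {"region": "A", "date": "2021-01-01", "new_cases": 10},
    {"region": "A", "date": "2021-01-01", "new_cases": 10},    # duplicate
    {"region": "B", "date": "2021-01-01", "new_cases": None},  # incomplete
    {"region": "B", "date": "2021-01-02", "new_cases": -5},    # implausible
    {"region": "B", "date": "2021-01-03", "new_cases": 30},
    {"region": "C", "date": "2021-01-01", "new_cases": 20},
]

cleaned = clean_records(raw)
normalized = min_max_normalize([rec["new_cases"] for rec in cleaned])
print(len(cleaned), normalized)
```

Dropping records is the bluntest possible policy. In practice, imputing missing values or flagging suspicious ones for review may preserve more information, and the right choice depends on the downstream model.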

Although many studies have been conducted in the last three years on the COVID-19 pandemic, research on social media, contact tracing, and the impact on economics remains in its early stages. Moreover, new challenges emerge from different aspects (Ajaz et al., 2022). Several challenges have been addressed. However, there still is room for improvement. The authors in Ajaz et al. (2022) suggest that COVID-19 can be controlled using IoT technology and machine learning techniques. A multi-layered architecture of IoT solution has been proposed, where unmanned aerial vehicle (UAV) applications can be used for privacy-preserving contact tracing.

In addition, Alamoodi et al. (2020) extensively reviewed sentiment analysis studies published over 10 years (1st January 2010 to 30th June 2020) in the context of infectious diseases, outbreaks, epidemics, and pandemics. The inspiration behind this research was the wide spread of COVID-19. However, COVID-19 remains ambiguous as an infectious disease, and its literature and cases proliferate massively, so it is difficult to keep track of up-to-date information. Besides, accurate details can only be gathered after the pandemic ends. Further research should concentrate on the role of social media and on sentiment analysis during related events.

Because of its human toll and economic implications, COVID-19 has inflicted unparalleled disruption on the global economy. It presented taxpayers and politicians with the greater challenge of minimizing the impact of this pandemic (Padhan & Prabheesh, 2021). That study highlighted the economic impact of COVID-19 and the policy alternatives for minimizing it. It concludes that monetary, macro-prudential, and fiscal policies each contribute to mitigating the effects, and that in the post-pandemic cycle the combined trio may be more successful. Cooperation between these three policies is therefore essential for reducing the consequences of COVID-19. Other challenges require a great deal of attention. These challenges include the deactivation of mobile devices, electronic health policy, privacy, ethical issues, socio-economic inequalities, and legal risks. In addition, there is a lack of supporting ICT infrastructure, WIFI, and GPS services, as well as abuse of contact tracing apps (Mbunge, 2020). Recently, the effective use of artificial intelligence solutions in the medical area has been hindered by black-box models, because medical professionals do not fully understand the logic behind a particular machine prediction. One possible direction is multi-class disease segmentation combined with in-depth analysis of the characteristics of each class and their association with severity. AI can assist the community in various ways, including early warnings and alerts, diagnosis and prognosis, tracking and prediction, treatments and cures, data dashboards, and social control, by prioritizing individuals for testing and thus increasing the rate at which positive individuals can be identified (Ilyas et al., 2020).

Conclusion

In conclusion, although it is very hard to find any positive impact of the COVID-19 pandemic on most of the sectors that touch our lives, from sociological and health perspectives to the economic crash, and at personal and community levels, one can appreciate the huge effort made by the scientific community in an attempt to alleviate such a disastrous impact. This survey covered the main technical contributions from data mining perspectives, focusing on social data, contact tracing, medical imaging, and health-related time-series data. We presented the challenges, techniques, and open problems, together with opportunities that can be tackled soon. For instance, social data mining needs deeper correlation and semantic analysis with other data types, such as health and contact tracing data. Contact tracing, on the other hand, could not be widely adopted because of major privacy concerns and the limited effectiveness of current solutions. Finally, research on medical imaging has provided great support for the automatic early screening of infected cases, but deeper pattern recognition and tracking of the disease in order to predict the best treatment ahead of time can immensely enrich the current solutions. Overall, the COVID-19 pandemic has highlighted the importance of data mining techniques in analyzing large volumes of data in real time, integrating data from multiple sources, developing predictive models, ensuring data quality, and addressing ethical considerations. By bringing these perspectives and recommendations together, this survey can support further advancements in the related fields.