
1 Introduction

In text-based analysis, corpus data are characterized by the lexical, syntactic, semantic and positional relationships that words hold with one another. The corpus vector is a key notion in feature extraction, and the ontologies built over it provide vital information regarding the relations between words [1].

This work offers a new pathway towards knowledge extraction from a corpus perspective. Its conclusions give insight into the most important topics discussed during the health and economic crisis, based on the most frequent words used on social media platforms. The hot topics deduced from the analysis of comments indicate how well people are coping with their situations amidst the pandemic.

2 Related Work

2.1 Recent Research Conducted on COVID19

In the literature, several procedures have been proposed to analyze ontologies that capture lexical relationships within a corpus and to define how feature vectors relate to individual words through word association [2,3,4]. Over the years, Natural Language Processing research has been oriented largely towards English because, according to survey data, a large share of active users on various platforms use English to express their views on global platforms [8, 20].

The novel COVID-19 pandemic has created a mesh of digital data, and researchers and health care centres are trying to understand the hot topics discussed by the public on various social media platforms. The most recent studies analyzing web-based content after the COVID-19 outbreak are listed in Table 1. It has been observed that, when working with NLP, most statements posted online express personal opinions, so research activities are largely performed on subjective datasets [19].

Table 1. Work done on Covid19 using NLP

Limaye et al. examined concerns regarding the misrepresentation of situations, information and even statistics on social media, especially during a pandemic such as COVID-19, and presented them in a commentary article [12]. WeChat, the largest social media platform in China, was analyzed by Lu and Zhang to identify trends referencing COVID-19 [13].

Research observing the COVID-19 situation and its effect on people has surfaced many influential topics that gained mass attention, as shown in Fig. 1. Online surveys conducted on social platforms such as Facebook, YouTube, Twitter, Instagram, blogging sites and official web discussion forums have analyzed people's opinions on these topics around the world, including India, to understand how lenient or harsh the situation is for them [13,14,15].

Fig. 1. Influential topics w.r.t. COVID-19 that gained mass attention [3, 4, 9,10,11, 14, 15]

3 Text-Based Analysis

Fig. 2. Steps for TBA

Fig. 3. Syntax parsing

NLP and text analysis techniques are generally used to recognize and extract subjective information from a piece of text or corpus; this practice is known as Text-Based Analysis (TBA). It refers to the use of computational linguistics and ontology-based analysis to analytically categorize, extract, quantify and study varying circumstances along with subjective information [5]. It is an evolving field that takes on the analysis and measurement of human language and turns it into concrete evidence of the real, factual or cynical meaning behind the words [15, 21]. The TBA pipeline can be realized in seven basic steps, as shown in Fig. 2 (minimal code sketches illustrating several of these steps follow the list):

  a) Identification of Language: Different languages have different idiosyncrasies, so it is critical to know which language and which grammatical features we are dealing with. This step predicts the natural language of the text by observing the features of its grammar, and it underpins the other text analytics functions. Approaches such as the short-word-based approach, the frequent-word-based approach and the n-gram-based approach are used to identify the language of the corpus elements [6].

  b) Tokenization: Once the language of the text is known, it can be broken up into tokens, the basic units of meaning that are operated on; this process of breaking the corpus down into smaller units is known as tokenization. Tokens can be words, phonemes, punctuation marks, hyperlinks or any other smallest component of the grammar. For example, an English sentence made up of five words may contain five tokens. Tokenization depends on the characteristics of a language, and each language has different requirements. In English, for instance, white space and punctuation mark the breaks, so text can be tokenized with little effort. Most alphabet-based languages follow comparatively simple approaches to breaking up the corpus, so rule-based tokenization is prominently used for alphabetic languages [6, 7].

  c) Sentence Segmentation: Also known as sentence tokenization, this is the process of separating a sequence of written language into its component sentences. Once the tokens are identified, the places where sentences end can easily be located. To run more complex text-based analytical functions such as syntax parsing, the boundaries where a sentence's grammar ends must be known. In simpler terms, it breaks a paragraph into separate sentences [1, 19].

    Example: Consider the following sample COVID-19 comment:

    If you do not recommend it for 18 and under, how is it remotely safe for above 18-year-olds? Our makeup is not that different??? Thank god you came out and said not safe for 18 and under at least.

    Sentence segmentation produces the following result:

    a) “If you do not recommend it for 18 and under, how is it remotely safe for above 18-year-olds?”

    b) “Our makeup is not that different???”

    c) “Thank god you came out and said not safe for 18 and under at least.”

  d) PoS Tagging: the process of tagging every token collected from the corpus with its respective part of speech. PoS tagging helps in finding out how a word is used in a sentence, for instance as a noun, pronoun, adjective, verb, adverb, preposition, conjunction or interjection. It is the fundamental step right before chunking and sets the path for Word Sense Disambiguation (WSD) by properly identifying the part of speech of every token generated from the text-based corpus [8].

    Example:

    “Staying healthy and “social distancing” are mutually exclusive.”

    Output:

    [('staying', 'VBG'), ('healthy', 'JJ'), ('and', 'CC'), ('social', 'JJ'), ('distancing', 'NN'), ('are', 'VBP'), ('mutually', 'RB'), ('exclusive', 'JJ')]

  e) Chunking: operates on the PoS-tagged tokens to extract phrases. Chunking can be defined as the process of mining phrases from unstructured text. It is not concerned with the internal structure of the constituents, nor with their role in the main sentence; rather, it works on top of PoS tagging by identifying constituents in the form of groups of words such as noun phrases, verb phrases, prepositional phrases, etc. [8, 15].

    Example:

    The covid patient is lying in the ICU.

    Chunking Output:

    [The covid patient]_np [is lying]_vp [in the ICU]_pp

    (np stands for “noun phrase,” vp stands for “verb phrase,” and pp stands for “prepositional phrase.”)

  f) Syntax Parsing: determines the structure of a sentence. In simpler terms, syntax parsing can be called sentence diagramming and acts as a preliminary step for processing many natural language features. It is considered one of the most computationally intensive steps in analyzing text-based content, as it provides insight into grammar and syntax. The syntax tree for the example above is shown in Fig. 3.

  g) Sentence Chaining: The concluding step in organizing raw, amorphous text for further analysis at more complex levels is called sentence chaining. It links individual but related sentences by the strength of their association with the title of the content. The lexical chain helps combine sentences even when they appear far apart in a document. It lets a machine detect the predominant topics and measure the overall context of the document, and it also shows where linkages give ontological meaning to a comment, thus providing morphological relations among words.
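To make steps (a) to (e) concrete, the following minimal Python sketch runs language identification, tokenization, sentence segmentation, PoS tagging and noun-phrase chunking on the sample comment above. The choice of libraries (langdetect and NLTK) and the chunk grammar are illustrative assumptions, not the exact tools used in this work.

```python
# A minimal sketch of steps (a)-(e); langdetect and NLTK are assumed for
# illustration and are not necessarily the tools used by the authors.
from langdetect import detect                      # pip install langdetect
import nltk
from nltk import word_tokenize, sent_tokenize, pos_tag, RegexpParser

nltk.download("punkt")                             # tokenizer model
nltk.download("averaged_perceptron_tagger")        # PoS tagger model

comment = ("If you do not recommend it for 18 and under, how is it remotely "
           "safe for above 18-year-olds? Our makeup is not that different??? "
           "Thank god you came out and said not safe for 18 and under at least.")

print(detect(comment))                             # (a) language identification, e.g. 'en'

chunker = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")  # (e) a simple noun-phrase grammar

for sentence in sent_tokenize(comment):            # (c) sentence segmentation
    tokens = word_tokenize(sentence)               # (b) tokenization
    tagged = pos_tag(tokens)                       # (d) PoS tagging
    print(chunker.parse(tagged))                   # (e) chunking into noun phrases
```

For step (f), a toy constituency parse of the chunking example ("The covid patient is lying in the ICU") can be obtained with a hand-written grammar and NLTK's chart parser; the grammar below is only a rough approximation of the tree shown in Fig. 3.

```python
# A toy constituency grammar for the chunking example; hand-written for
# illustration, not the grammar behind Fig. 3.
import nltk

grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det Adj N | Det N
    VP  -> V VP | V PP
    PP  -> P NP
    Det -> 'The' | 'the'
    Adj -> 'covid'
    N   -> 'patient' | 'ICU'
    V   -> 'is' | 'lying'
    P   -> 'in'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("The covid patient is lying in the ICU".split()):
    tree.pretty_print()                            # prints the syntax tree
```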

4 Natural Language Processing

Natural Language Processing (NLP) is a sub-domain of linguistics, computer science, knowledge engineering and artificial intelligence concerned with the fundamental interactions between computers and human languages. Predominantly, it concentrates on building systems to process and analyze massive amounts of natural language data [18].

NLP makes use of tokenization, sentence breaking, part-of-speech tagging and chunking of tokens with their PoS tags. In machine learning (ML) jargon, this series of steps is called data pre-processing. The idea is to break natural language text down into smaller, more manageable chunks, which ML algorithms can then analyze to find relations, dependencies and context among the various chunks. NLP utilizes these fundamental functions, while taking ontologies and knowledge representation into consideration, to achieve its two components: Natural Language Understanding and Natural Language Generation.

4.1 Natural Language Understanding (NLU)

NLU aids the machine in understanding and analyzing human language with the help of metadata extracted from the content, such as entities, keywords, relations, and semantic and syntactic roles [20]. It involves mapping the given input into a useful representation, analyzing different aspects of the language, interpreting natural language, deriving meaning, identifying context and deducing insights. Word Sense Disambiguation (WSD), a function implemented via NLU, ensures that the machine can distinguish between the different senses of a word listed in a glossary [21].
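As an illustration of WSD, the sketch below applies NLTK's implementation of the classic Lesk algorithm, which selects the dictionary sense whose gloss overlaps most with the surrounding context; this is a generic example and not the exact WSD procedure used later in this work.

```python
# A generic WSD illustration using NLTK's Lesk implementation: it picks the
# WordNet sense of a word whose gloss best overlaps with the context.
import nltk
from nltk import word_tokenize
from nltk.wsd import lesk

nltk.download("punkt")
nltk.download("wordnet")

context = word_tokenize("Hospitals ran out of beds during the second wave.")
sense = lesk(context, "bed", pos="n")       # disambiguate "bed" as a noun
print(sense, "-", sense.definition() if sense else "no sense found")
```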

4.2 Natural Language Generation (NLG)

NLG helps convert machine-formatted data into a representation that a human can read. It is achieved in three common steps: planning of the textual content, planning of the sentences, and finally realization of the text, in which the output is rendered as natural language [6].

It is important to note that NLU disambiguates the input sentence to produce a representation the machine understands, whereas NLG makes decisions about arranging that representation into words known to humans [6, 21].

5 Data and Methodology

5.1 Data Collection

The data in this research were collected manually and directly, in real time, from three social media sources: Facebook, Twitter and YouTube (official press conferences). Data collection started in mid-July 2020 and extraction continued until mid-May 2021. An unstructured dataset of 60,365 text-based discussions from different posts and topics of concern was converted into structured data comprising comments, tweets and replies related to the COVID-19 pandemic around the world.

5.2 Pre-processing of Data

The data were preprocessed using basic steps, namely stop-word removal, punctuation removal, emoji removal, hyperlink removal, removal of numerals, elimination of extra white space and conversion of upper case to lower case, so that feature extraction could be achieved by selecting the most frequently used words. Stemming and lemmatization were deliberately omitted in order to analyze the real word associations.
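A minimal Python sketch of these cleaning steps is given below; the regular expressions and the NLTK stop-word list are assumptions for illustration and may differ from the exact preprocessing code used in this study.

```python
# Cleaning sketch: stop words, punctuation, emoji, hyperlinks, numerals,
# extra white space and lower-casing; stemming/lemmatization deliberately
# omitted, as in the study.
import re
import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
STOP_WORDS = set(stopwords.words("english"))

def clean(text: str) -> str:
    text = text.lower()                                   # lower-casing
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # hyperlink removal
    text = re.sub(r"[^\x00-\x7F]+", " ", text)            # emoji / non-ASCII removal
    text = re.sub(r"\d+", " ", text)                      # numerical removal
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    tokens = [t for t in text.split() if t not in STOP_WORDS]         # stop-word removal
    return " ".join(tokens)                               # collapses extra white space

print(clean("Stay SAFE!!! 2nd wave updates at https://example.org"))
```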

5.3 Working of Proposed Analysis: Methodology

The stepwise NLP process for the novel TBA is as follows:

Step 1. After preprocessing, tokens were generated for the words, along with their respective PoS tags and hypernyms, in order to capture as much of the discussed content as possible.

Step 2. As shown in Table 2, a glossary is created and a gloss is attached to each word for a better understanding of context and to allow easier search and lookup in further processing, thus establishing the basic parameters for WSD and for understanding the real meaning and context of the comment.

Table 2. Sample Glossary for COVID19 Comments

Step 3. The tokens are then organized over the distribution of words in the corpus on the basis of the “is-a” relationship, which yields the concepts in the domain as well as the relationships that hold between those concepts, exposing the ontology behind the words used in a given context. A minimal WordNet-based sketch of Steps 1 to 3 is given below.
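The sketch uses WordNet to attach a gloss and the direct “is-a” (hypernym) parents to a token, roughly in the spirit of Steps 1 to 3; WordNet and the sample tokens are assumed here only for illustration, not as the glossary of Table 2.

```python
# Attach a gloss and direct "is-a" (hypernym) parents to a token via WordNet.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet")

def gloss_and_hypernyms(word):
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return {"word": word, "gloss": None, "is_a": []}
    sense = synsets[0]                                    # most frequent noun sense
    return {
        "word": word,
        "gloss": sense.definition(),                      # gloss attached to the word
        "is_a": [h.name() for h in sense.hypernyms()],    # direct "is-a" parents
    }

for token in ["vaccine", "oxygen", "mask"]:
    print(gloss_and_hypernyms(token))
```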

Step 4. A text bag is created after extracting the word definitions corresponding to the weighted vector; this leads to a separate bag of features, from which ten hot topics are generated on the basis of word usage and the observed ontology.

Step 5. These hot topics are labelled S1 to S10 depending on the features classified with respect to the weighted vector over the lexicon, given by

$$\overline{V_{w}} = \sum\nolimits_{i = 1}^{10} S_{i} w_{i} \qquad (1)$$

where Si is the feature vector from the text bag and wi is the average weight of the frequently used word.
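A small numerical sketch of Eq. (1) is shown below; the topic count of ten follows the paper, while the vector sizes and values are placeholders.

```python
# Eq. (1) as code: the lexicon vector is the sum over the ten hot topics of
# each topic's feature vector S_i scaled by its average word weight w_i.
import numpy as np

num_topics, vocab_size = 10, 6                 # ten hot topics; vocabulary size is a placeholder
S = np.random.rand(num_topics, vocab_size)     # S_i: feature vector of topic i
w = np.random.rand(num_topics)                 # w_i: average weight of its frequent words

V_w = (S * w[:, None]).sum(axis=0)             # V_w = sum_{i=1}^{10} S_i * w_i
print(V_w)
```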

Step 6. The maximum weighted overlap between the context bag of words and each Si bag of words is evaluated in order to chunk, from the entire corpus, the sentences belonging to the respective Si feature (see the sketch below).
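The following sketch illustrates the idea of Step 6 with a toy weighted-overlap score; the topic bags, weights and context words are hypothetical.

```python
# Score each topic S_i by the weighted overlap between its bag of words and
# the context bag of a comment, then assign the comment to the best topic.
def weighted_overlap(context_bag, topic_bag):
    # topic_bag maps word -> weight; only words shared with the context count
    return sum(weight for word, weight in topic_bag.items() if word in context_bag)

topic_bags = {
    "S9_financial_crisis": {"money": 0.9, "salary": 0.7, "loss": 0.6},
    "S10_vaccination": {"vaccine": 0.9, "dose": 0.8, "safe": 0.5},
}
context = {"vaccine", "safe", "under", "18"}

best_topic = max(topic_bags, key=lambda s: weighted_overlap(context, topic_bags[s]))
print(best_topic)                              # expected: S10_vaccination for this context
```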

Step 7. Corpus analysis is performed to achieve Word Sense Disambiguation through proper knowledge representation.

Step 8. Finally, results based on the novel TBA approach using the NLP steps yield interesting insights regarding COVID-19.

The entire process by which the novel Text-Based Analysis was carried out via NLP steps, which helped further classify the corpus into ten hot-topic categories and conclude their insights, is shown in Fig. 4.

Fig. 4. Proposed analysis steps: methodology

6 Results and Discussions

The corpus analysis carried out in Sect. 5.3 using the novel TBA steps demonstrates a sound approach to representing the knowledge available on social media platforms and to resolving Word Sense Disambiguation. Since many comments, tweets and replies were irrelevant to the topic, a text bag was created to eliminate those unnecessary comments and to retain the texts that showed relevance to COVID-19 and were used frequently. The text bag created after the ontology-based distribution, shown in Fig. 5(a), helped collect information about word usage, which in turn yielded the most frequently used words shown in Fig. 5(b).

Fig. 5. (a) Screenshot for COVID 19 Text Bag. (b) Screenshot for Word Frequencies

For each word in the corpus and its corresponding frequency, the word definition and the PoS definition corresponding to the weighted vector were extracted as per the English lexicon. A separate text bag of features was then created for each hot topic Si. The word clouds for this separate feature set of the text bag are shown in Fig. 6.
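The word frequencies and word clouds of Fig. 5(b) and Fig. 6 could be produced along the following lines; the wordcloud package and the sample comments are assumptions for illustration.

```python
# Sketch of the frequency count and word-cloud generation.
from collections import Counter
from wordcloud import WordCloud                # pip install wordcloud

cleaned_comments = ["vaccine dose safe", "oxygen cylinder shortage", "vaccine slot booking"]
frequencies = Counter(word for comment in cleaned_comments for word in comment.split())
print(frequencies.most_common(5))              # most frequently used words

cloud = WordCloud(width=800, height=400).generate_from_frequencies(frequencies)
cloud.to_file("covid_wordcloud.png")           # saves the word-cloud image
```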

Fig. 6. Visualization of Word Cloud for Si

After that, the selected Si text bag is explored according to the word definitions corresponding to the weighted vector Vw. A sample of COVID-19 comments with their meaning representations is presented in Table 3, where the tokens are replaced with the correct form of the word.

Metadata about the knowledge base is provided and used for classifying and organizing the content into sub-categories. This is achieved using three basic entities: users, tags and resources, with a focus on sentence chaining, which categorized the entire corpus into ten main hot topics:

  1. Following Social distancing

  2. Distribution of Masks, Hand Sanitizer and food entities

  3. Medical Conditions and Aid

  4. Political Agendas during the crisis

  5. Bed Availability

  6. Oxygen Cylinders availability

  7. Online Education Policies

  8. Work from Home Routine

  9. Financial Crisis

  10. COVID Vaccination

Table 3. The Sample meaning representation of comments from the corpus with their classification

In a similar way, the entire corpus was classified into the ten hot topics mentioned above, with the observation that people are concerned about certain important factors affecting life and science during COVID-19 which must be taken into consideration by the authorities and researchers in the relevant fields.

The analysis showed that the major problems faced during this crisis, as shown in Fig. 7, were queries regarding vaccination, handling finances during critical times, and medical conditions, not only of COVID patients but also of patients suffering from other critical illnesses. These were the most discussed topics across the entire corpus from the three platforms.

Fig. 7. Percentage analysis of hot topics discussed over the entire corpus

For a morphological understanding of the corpus and to observe the relation between topics through the weighted vectors, the five most used words from the Si, i.e. #deaths, #recovery, #Money, #work and #symptoms, were compared across all the topics in which they were frequently used. The analysis of the features extracted from Si with respect to these weighted vectors is shown in Fig. 8.

Fig. 8. Hot Topics w.r.t. Weighted Vectors of Si

The data relating to COVID-19 is mostly about suffering, losses, the crises faced by the public, guidelines imposed by governing authorities, and the impact on the educational and working sectors. The weighted vector scores observed in Fig. 8 show that for some families it became very hard even to manage two meals a day, and they were completely dependent on the distribution of food supplies. It was extremely difficult to follow social-distancing guidelines with family members staying under one roof, even when one of them was diagnosed with the disease. Correspondingly, the meaning representations in Table 3 and the weighted vector scores for work from home and online education policies show that people running start-up businesses and scientific research suffered great losses, as businesses, resources and long-running research projects faced serious consequences due to a lack of regular monitoring and inconsistent interaction.

7 Conclusion and Future Work

The novel TBA approach has provided crucial analysis steps for better understanding the basic concepts of Natural Language Processing encountered during the initial level of data analysis. The implementation of knowledge representation shown in Table 3 reveals that the taxonomy of the constituent tree can be parsed together with the ontologies to gain a better understanding of text analytics while performing text classification.

The corpus was constructed and categorized on the basis of the most discussed topics, derived from the frequently used words in comments, tweets and replies on Facebook, Twitter and YouTube respectively, with the help of R and Python packages. The collected data were converted into a corpus, and the ten hot topics that need attention were discussed.

The proposed work could employ a more elaborate form of morphological analysis of the corpus. Future research could investigate the lexical relations of varying features and their effect on sub-contexts and domains at deeper levels. Sentiments based on the extracted knowledge could also be obtained, to capture even the cynical insights into the situation.