1 Introduction

Knowledge Graphs (KGs) have been widely adopted in industry and academia as a factual reflection of human knowledge for solving a range of domain-dependent, real-life problems (Ji et al. 2020). The proliferation of Social Big Data has prompted the need for sophisticated approaches that help machines better understand the context of multimodal content. In particular, heterogeneity in data sources and formats, discrepancies in vocabulary, and the lack of comprehensive, integrated knowledge repositories are key challenges for analysts. By representing domain knowledge as a set of entities and relations, KGs facilitate a unified, standard representation for data fusion. This has led to knowledge propagation embodied in graph datasets of divergent and interrelated domains (Wang et al. 2018a, b) and has extended to large-scale applications such as question answering (Zhang et al. 2018), recommender systems (Palumbo et al. 2017), KG completion (Lin et al. 2015), entity disambiguation (Huang et al. 2015a, b), and text classification (Marin et al. 2014). As a result, analysts today can conduct in-depth analyses of external business data such as customer blog postings (Gruhl et al. 2004), Internet chain-letter data (Liben-Nowell and Kleinberg 2008), social tagging (Anagnostopoulos et al. 2008), the Facebook news feed (Sun et al. 2009), and many other semantic Artificial Intelligence applications.

Despite the widespread use of domain-independent (open-world) KGs (e.g. the Google KG), domain-dependent KGs offer substantial benefits for tackling domain-specific problems and for extracting the intended added value from domain corpora (Kejriwal et al. 2019). Domain knowledge is commonly captured in a KG, which is then used to enrich the semantics of data with a specific conceptual representation of entities. Reusing a domain ontology and interlinking the embodied classes, entities, and concepts with relevant entities from other KG repositories facilitates the interoperability of information. Hence, KGs serve as backbones for intelligent systems: they support extracting the semantics of textual data collected from different vocabularies and semantic repositories, and enrich the semantic description of resources through an annotation component.

Another important consideration is the factuality and credibility of the knowledge embodied in a KG. The rapid growth in KG sizes has raised questions about the quality of the embodied knowledge (i.e. entities and relations), and whether these facts truly represent the intended real-world entities interlinked via their relationships. This has posed several research challenges in the field. As a proof of concept, this study targets the socio-political domain, since social media has become an important arena for politicians to promote campaigns, express and defend their views, and open direct dialogues with their supporters (Shapiro and Hemphill 2017). Further, the amount of political discourse in social content is increasing; over 55% of OSN users report feeling worn out by political posts and discussions (Anderson and Quinn 2020). Yet this propagation of political social content can be hijacked and misused by spammers to spread misinformation and false news. Hence, data collected from OSNs should be scrutinised so that KGs are augmented only with trustworthy facts that benefit real-life applications. Despite significant efforts to address quality, endeavours in this direction remain inadequate, and further measures should be proposed and adopted to maintain the quality of KGs.

This paper presents a novel credibility-based domain-specific KG Embedding (KGE) framework. The framework comprises the following key modules: (1) Domain knowledge inference: captures real-life entities obtained from social data in a formal and integrated representation depicted by a domain ontology (Politics). This module makes use of several cross-domain knowledge-based repositories, including Google KG™, IBM Watson NLU™, and WordNet™, to enrich the semantics of the textual content, thereby facilitating the interoperability of information. (2) Social credibility: measures the credibility of the collected social media users and their content, thereby eliminating spam and low-trustworthy content from further analysis. This module incorporates several fine-grained key attributes to establish a feature-based ranking model and expresses that model through the users' domain-based credibility values. (3) KG construction and embedding: the aim of this module is twofold: (a) to construct the KG based on the underlying abstract structure of the Politics ontology, leveraging various mapping, annotation, enrichment, and interlinking methods; and (b) to embed the constructed KG in a low-dimensional vector space using several embedding techniques.

The resultant embeddings of two separate KGs (original and curated) are used for several tasks, including link prediction, clustering, and visualisation. An evaluation protocol and metrics are used to measure the performance of the incorporated embedding models and to demonstrate the effectiveness of the framework and its modules.

In this paper, we have made the following key contributions:

  • A domain knowledge graph is constructed based on an extended politics domain ontology using dissimilar light-weight ontologies and semantic repositories.

  • An embedded social credibility module is incorporated and customised to enhance the quality of the collected datasets.

  • Various state-of-the-art embedding models are implemented and their performance is evaluated using key evaluation metrics.

  • The utility of the constructed KG Embeddings is demonstrated and substantiated on link prediction, clustering, and visualisation tasks.

This paper is organised as follows: Sect. 2 provides background on works related to the context of this paper. Section 3 discusses the overall methodology of the proposed framework and its modules. The experiments carried out in this study are explained in Sect. 4, along with the evaluation mechanism and the implemented tasks. Finally, the conclusions and some possible research directions are reported in Sect. 5.

2 Background and related works

Domain-specific KGs Domain-specific (domain-dependent) KGs are constructed from domain corpora to establish a relevant and semantically interrelated ground for tackling a specific domain problem. Accordingly, constructing a domain-specific KG can be viewed as the process of enriching an underlying domain ontology (Kejriwal 2019). More comprehensively, a domain-specific KG can be defined as an "explicit conceptualisation to a specific subject-matter domain represented in terms of semantically interrelated entities and relations" (Abu-Salih 2021). There have been continuous attempts to construct KGs that capture particular domains of knowledge. For example, in the politics domain, Nguyen and Jung (2019) created a KG that captured and clustered social events decomposed from social media using Independent Component Analysis (ICA), followed by the SocioScope Knowledge Graph (SKG) model to automatically construct event-driven KGs from Twitter data. Laufer and Schwabe (2017) presented POLARE, an ontology for conceptualising a political system; this ontology was then used to build a KG for a better understanding of the relations between agents in the Brazilian political system. Capturing the politics domain has also been addressed in Chen et al. (2017) and Huang et al. (2017). The construction of domain-specific KGs has also been extended to other domains, such as Healthcare (Cui et al. 2020; Sheng et al. 2020), Education (Chen et al. 2018; Zheng et al. 2017), ICT (Kiesling et al. 2019; Deng et al. 2019), Sciences and Engineering (Gong et al. 2021; Liu et al. 2021), Finance (Tong et al. 2016; Liu et al. 2019), and Travel and Tourism (Feng 2020; Liang et al. 2020; Wu et al. 2020).

Knowledge acquisition (completion, entity and relation extraction) Ongoing efforts to construct large-scale KGs have increased notably, producing massive KGs that embody billions of facts describing different contexts (Rossi et al. 2020). These KGs, however, suffer from incompleteness, which negatively affects their utility in real-life applications (Akrami et al. 2018). For example, Freebase, a large-scale KG commonly used in research communities, is far from complete: the "place of birth" is missing for over 70% of "Person" entities, and more than 90% of person entities have no recorded ethnicity (West et al. 2014; Dong et al. 2014). The same applies to Wikipedia and many other knowledge bases. The research community has confronted this issue with technical solutions commonly known as KG Augmentation/Completion approaches. KG Completion (a.k.a. Link Prediction) aims to enrich the KG with new facts depicted by likely new entities and/or new relations. Link prediction has many applications, from predicting new friendships in social networks and improving recommender systems to various other use cases. In this context, a new cohort of models has recently gained considerable attention. These models embed the constituents of a KG (entities and relationships) into a low-dimensional, semantically continuous space (Wang et al. 2017a, b). The generated embeddings can then be leveraged to produce a set of candidate facts that fulfil a completion task (Meilicke et al. 2018).

KGs are commonly constructed from (semi-)structured (e.g., Wikipedia) or unstructured (e.g., web data) datasets. However, harvesting meaningful information from heterogeneous data sources is not a trivial task. It encompasses extracting facts (entities linked via relationships), which requires a correlated array of Information Extraction (IE) techniques, Natural Language Processing (NLP), and other statistical approaches (Paulheim 2017). Examples of techniques used for entity recognition and relation extraction include Conditional Random Fields (CRF) (Lin and Wu 2009), machine learning models (e.g. SVM), neural network models such as Bidirectional Long Short-Term Memory (BiLSTM) (Huang et al. 2015a, b), Hidden Markov Models (HMM) (Morwal et al. 2012), and off-the-shelf NLP tools (e.g. spaCy, Stanford CoreNLP, AllenNLP, IBM Watson NLU, etc.).
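For illustration, the minimal sketch below shows how an off-the-shelf NLP tool can surface candidate entity mentions from a tweet; it assumes spaCy with the en_core_web_sm model installed, and the example tweet is hypothetical.

```python
# Minimal sketch: off-the-shelf named-entity recognition with spaCy.
# Assumes `pip install spacy` and `python -m spacy download en_core_web_sm`;
# the example tweet is illustrative only.
import spacy

nlp = spacy.load("en_core_web_sm")

tweet = "Anthony Albanese addressed the Australian Labor Party conference in Sydney."
doc = nlp(tweet)

# Each entity span carries its surface text and a coarse type label
# (PERSON, ORG, GPE, ...), which downstream modules can map to ontology concepts.
for ent in doc.ents:
    print(ent.text, ent.label_)
```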

A KG can be further extended, and its embedded knowledge thereby augmented, to include missing real-world facts by leveraging contextualised knowledge repositories (Beheshti et al. 2020, 2018). The process of knowledge acquisition can be categorised into two key dimensions, namely KG completion, and entity and relation extraction. KG completion aims to expand the current knowledge by accumulating more facts onto the current state of the KG, while the latter dimension aims to infer new knowledge by predicting new relations and entities (Ji et al. 2020). Link and entity inference in the context of KGs is the process of amplifying the KG with new facts depicted by new entities and/or new relations.

Several approaches have been introduced to tackle this issue (Han et al. 2018; Purohit et al. 2019; Lin et al. 2015; Balažević et al. 2019; Balazevic et al. 2019; Kazemi and Poole 2018). These attempts have also been extended to address interrelated domains (Qiuyu and Fuhua 2020). Han et al. (2018) proposed a joint representation learning framework that addresses the complexity of structured semantic information through a mutual attention mechanism, which highlights important features by conjoining the textual content and the KG models. Augmenting knowledge in disaster situations has also been addressed in the literature. For example, Purohit et al. (2019) proposed DisasterKG, a disaster KG that offers a platform providing resources to answer critical inquiries; the authors showed how the interoperability of information from dissimilar data resources can efficiently improve decision making in such cases. Completion of KGs using web pages was attempted by Kruit et al. (2019), who suggested a new approach for interpreting HTML tables, where rows and columns indicate entities and attributes respectively. Using a Probabilistic Graphical Model (PGM), the authors were able to infer new facts for KGs with dissimilar topologies. Shi and Weninger (2018) proposed ConMask, an open-world KG completion system that incorporates fully convolutional neural networks and semantic averaging to tackle KG incompleteness. The proposed system has demonstrated the ability to predict relations involving unseen entities.

NLP applications using KGE Incorporating graph technology together with the abundance of dissimilar graph datasets has assisted in building sophisticated graph analytics tools. Despite the effectiveness of conventional graph analysis approaches such as GraphX (Gonzalez et al. 2014), Gephi (Bastian et al. 2009), and GraphLab (Low et al. 2012), graph embedding has notably improved the efficiency of graph analytics by converting the graph into a low-dimensional semantic space, where information is represented as vectors, leading to computational efficiency. Several efforts have incorporated KG embeddings to address numerous NLP challenges. For example, Yao et al. (2017) proposed a topic distillation approach embodying Latent Dirichlet Allocation (LDA) to improve document representation in the semantic space. Li et al. (2018) combined a neural network architecture with a constructed knowledge base to build a Text Concept Vector (TCV) that infers high-level representations of concepts from textual content. KGs are also utilised in conjunction with deep learning models to distil knowledge for several applications, such as sentiment tasks (Song 2019), bilingual dictionary induction (Nakashole and Flauger 2017), fake news detection (Pan et al. 2018), recommender systems (Wang et al. 2019), and other miscellaneous applications (Long et al. 2020; Yang et al. 2016).

Classification and clustering using KGE Classification in the context of KGs is the task of determining whether the entities/nodes, relations/edges, or whole triples contained within the testing dataset are correct. This task can be perceived as a binary classification that assigns a class label to each entity, relation, or triple. Under this broad classification umbrella, several studies have reported efficient and reliable applications incorporating graph embedding (Kipf and Welling 2016; Wang et al. 2017a, b), using dissimilar embedding techniques such as TransR (Lin et al. 2015), HolE (Nickel et al. 2016), and ANALOGY (Liu et al. 2017). Clustering, on the other hand, is an unsupervised learning approach that aims to assemble similar entities into groups; it can also be used to examine the efficiency of the approach used for KG embedding. Incorporating KGE boosts traditional clustering algorithms by transforming the embedded components of the graph into vectors (Cai et al. 2018). Other, less conventional approaches have also been presented in the literature. For example, Tian et al. (2014) showed how deep neural networks can improve KG clustering by mapping the similarity matrix of the input graph to the output graph embedding using a layer-wise pre-training scheme.

3 Methodology

3.1 Overall framework architecture

Figure 1 shows the proposed KGE framework. As depicted in the figure, the system comprises five core components, namely: Domain Knowledge Acquisition & Pre-processing; Domain Knowledge Inference; Knowledge Credibility Module; Knowledge Graph Creation and Embedding; and Knowledge Reasoning. The system collects its datasets from three main knowledge resources, namely: Twitter, Wikipedia, and miscellaneous news articles.

Fig. 1 System architecture

The collected datasets are pre-processed to ensure data cleansing and integration; domain knowledge captured in domain ontologies is then identified and used to semantically enrich the textual content. This is achieved through the domain knowledge inference module (the "semantic kitchen"), which incorporates several knowledge-based repositories including Google KG™, IBM Watson NLU™, the Politics domain ontology, and WordNet™. The next phase in the framework ensures the credibility of the incorporated knowledge. Knowledge credibility is commonly neglected when constructing KGs, especially when the knowledge is obtained from social media, where spammers and other low-trustworthy users find a fertile medium to publish and spread their content, taking advantage of the open environment and relatively few restrictions of these platforms. The following module constructs the domain KG and performs the KG embedding. This facilitates the knowledge reasoning carried out in the last module, represented by the incorporated neural machine learning models for relational learning. Details of the system framework and the embodied modules are discussed in the next sections.

3.2 Domain knowledge acquisition and pre-processing

Since the emergence of OSNs, the propagation of social data has revolutionised research avenues for developing state-of-the-art techniques for social data analytics. OSNs are a fertile medium for researchers in diverse disciplines, leveraging their vast volume of content. As a proof of concept, this study focuses on analysing political content collected and distilled from the Twitter platform. The politics domain is selected amongst other domains for the following reasons: (1) Twitter has been intensively used as an important venue by politicians to express and defend their policies, to conduct electoral propaganda, and to communicate with their supporters (Shapiro and Hemphill 2017). (2) Twitter has raised considerable controversy about its usage as a platform to attack political opponents (Van Kessel and Castelein 2016). (3) Twitter is characterised by a growing social base that includes broad political social groups, leveraged by ease of use, free access, and less governmental control (Halberstam and Knight 2016). (4) The amount of political discourse amongst the overall social content is overwhelming; over 55% of OSN users believe that they are worn out by political posts and discussions (Anderson and Quinn 2020). The social dataset used in this study has been collected using Twitter's "user_timeline" API method, which allows access to and retrieval of public users' content and metadata.
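For illustration, a minimal collection sketch is shown below, assuming the tweepy client (v3.x) with placeholder API credentials; the handle and tweet count are illustrative rather than the exact settings used in this study.

```python
# Minimal sketch of timeline collection via Twitter's user_timeline endpoint,
# assuming the tweepy client (v3.x) and placeholder API credentials.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Retrieve up to 200 of a public user's most recent tweets with full text
# and metadata (likes, retweets, entities, ...); the handle is illustrative.
tweets = api.user_timeline(screen_name="JoanneRyanLalor",
                           count=200, tweet_mode="extended")
for status in tweets:
    print(status.created_at, status.full_text[:80])
```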

3.2.1 Dataset acquisition

This study aims to augment the constructed domain-based knowledge graph with facts attained from heterogeneous data sources. These facts will not only be obtained from politics-related sources, but also gathered from users who do not explicitly indicate an interest in the designated domain. Further, users who are potentially detected as spammers are also included to prove the applicability of our approach to filtering out those users, thereby enhancing the quality of the imported facts, as discussed later.

Users who explicitly indicate an interest in the politics domain are collected from various resources as follows: (1) we gathered all information provided for members listed in the official website of the Parliament of Australia, including Senators and members of the Australian House of Representatives. (2) A selected set of users is assembled from three distinguished Australian Twitter lists that are relevant to the political domain. (3) Mixed sources: users whose political interest is not explicitly identified were tentatively selected from various Australian Twitter lists established to discuss sports, Information Technology, and other non-politics domains. (4) Finally, we included a subset of users indicated in the Twitter graph dataset collected by Akcora et al. (2014). This graph was used in experiments carried out by Akcora et al. to discover spammers and other illegitimate accounts. One contribution of this paper is to provide a platform where trustworthy social content can be imported to augment the domain KG, thereby eliminating untrustworthy content. Hence, the reason for selecting the graph of Akcora et al. (2014) is twofold: (1) to prove the efficiency and applicability of the proposed approach, which can be used to eliminate spammers and their content and entrench the domain KG with trustworthy facts; and (2) to also embed the content of domain influencers from a dataset of users whose domains of knowledge are not explicitly known.

3.2.2 Dataset pre-processing

One of the significant aspects of properly addressing and curating Big Data is ensuring its veracity. The veracity of data refers to the certainty, faultlessness, and trustworthiness of data (Demchenko et al. 2013). Although the reliability, availability, and security of a data source are significant (Demchenko et al. 2013), these factors do not guarantee data correctness and consistency, especially in the context of social media where data can be infected with spam and other junk content. Hence, appropriate data cleansing, integration, and credibility techniques should be incorporated to ensure the certainty and veracity of the data. The collected users and their contents are cleansed and integrated to enhance quality as follows:

Datasets cleansing Cleansing data is a crucial step to improve the quality of the data used in further analysis. Detecting and removing erroneous, corrupted, meaningless, redundant, and irrelevant data are key cleansing techniques, which are carefully carried out in this experiment to guarantee that only curated data are passed to the next phase.

Data quality enhancement The list of Twitter handles (a.k.a. screen names, such as @username) indicated in the users' metadata is collected, and each handle is replaced with the actual user's corresponding name. To achieve this task, Twitter provides a RESTful API service called "lookup" that reveals the twitterer behind a certain handle by returning fully hydrated information about the user. Twitter handles are commonly neglected in Twitter mining applications. However, handles are used to mention, for example, twitterers representing important entities related to a certain domain: a user demonstrates an interest in the political domain if the user commonly posts politics-related content and mentions twitterers related to the politics domain, such as politicians or political parties. Hence, it is essential to identify the actual user information behind those handles. This assists in the process of domain modelling and inference.
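A minimal sketch of this handle-resolution step is given below, again assuming tweepy (v3.x) with placeholder credentials; the handles are illustrative.

```python
# Minimal sketch of resolving Twitter handles mentioned in tweets to full
# user profiles via the users/lookup endpoint (tweepy v3.x `lookup_users`);
# the handles below are illustrative.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

handles = ["JoanneRyanLalor", "AlboMP"]
for user in api.lookup_users(screen_names=handles):
    # Replace the raw handle with the account's display name for later
    # domain modelling and entity inference.
    print("@" + user.screen_name, "->", user.name)
```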

3.3 Domain knowledge modelling and inference

Domain knowledge modelling and inference is the key phase in the proposed framework. Knowledge modelling is the core activity in knowledge graph creation. It involves capturing the real-life entities obtained from the social data in a formal representation depicted by the domain ontology. Tom Gruber generated expansive interest across the computer science community by defining ontology as "an explicit specification of a conceptualisation" (Gruber 1993). While conceptualisation aims to formulate the knowledge about real-world entities, the specification attempts to represent those captured entities in a concrete form (Stevens 2001). Therefore, an ontology captures the domain knowledge through defined concrete concepts (representing sets of entities), constraints, and the relationships between concepts, thereby providing a common understanding of the domain as well as a formal representation in machine-understandable semantics. The purpose of an ontology is to represent, share, and reuse existing domain knowledge. This module aims to detect and infer the user's domain of knowledge from the pre-processed datasets. As a proof of concept, we experiment within the Politics domain. We use the Politics ontology, WordNet, and ontology interoperability and integration to infer political knowledge.

Politics ontology The BBC offers an array of domain ontologies designed to conceptualise a predefined set of domains such as sports, music, and education, to cite a few ('BBC Ontologies' 2015). These domain ontologies consolidate the established BBC Linked Open Data platform. The politics domain is amongst the ontologies constructed by the BBC and comprises the conceptual knowledge captured in the Politics ontology along with its embodied knowledge base. The BBC defines the Politics ontology as "an ontology which describes a model for politics, specifically in terms of local government and elections" (BBC 2014). Figure 2 displays the BBC Politics ontology, which was initially designed to capture politics in the context of UK government elections.

Fig. 2 BBC Politics ontology

However, the concepts and relationships embedded in the designated ontology are inadequate to properly model the nominated domain, particularly as this study addresses the politics domain in the Australian context. Hence, we extend the BBC Politics ontology to provide a better depiction of the political domain. In this study, the extension of the political ontology is conducted manually by one of the authors, who is an expert in the Australian politics domain. In particular, the BBC Politics ontology is scrutinised and extended with further concepts/classes and relations to provide a better comprehension of this domain. In future, we aim to explore new venues for ontology augmentation using (semi-)automatic techniques. Figure 3 shows the extended version of the BBC Politics ontology used in this research.

Fig. 3 BBC Politics ontology extension

Designing a high-quality ontology is important as a cornerstone for providing a meaningful, contextualised, valid, and error-free knowledge base. The developed ontology for any domain should also be adequate to answer queries over its semantic concepts, relationships, and instances. The new extended version of the Politics ontology is therefore verified to ensure logical consistency. This has been carried out using a reasoning process: the extended ontology is reasoned over using various well-known reasoners such as TrOWL, RacerPro, Pellet, Pellet (Incremental), HermiT, and FaCT++. Besides the standard inference services provided by ontology reasoners, such as classification and realisation, reasoners are generally used to scrutinise all concepts, properties, instances, and embedded hierarchies. They also check whether concepts are satisfiable and their descriptions are free of contradiction. The new extended Politics ontology has been reasoned over and verified, and no contradictory facts were indicated.
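As an illustration of such a consistency check (not the exact tooling used in this study), the sketch below runs the HermiT reasoner bundled with Owlready2 over an ontology file; the file path is a placeholder for the extended Politics ontology.

```python
# Minimal sketch of a consistency check over the extended Politics ontology
# using the HermiT reasoner bundled with Owlready2; the ontology path is a
# placeholder for the actual extended BBC Politics ontology file.
from owlready2 import get_ontology, sync_reasoner, default_world

onto = get_ontology("file:///path/to/politics_extended.owl").load()

# Classify the ontology; an OwlReadyInconsistentOntologyError is raised if
# the ontology is contradictory, otherwise unsatisfiable classes (if any)
# are reported as equivalent to Nothing.
with onto:
    sync_reasoner()

unsatisfiable = list(default_world.inconsistent_classes())
print("Unsatisfiable classes:", unsatisfiable or "none")
```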

WordNet database WordNet is a lexical database that includes a collection of terms/words (synsets/synonyms) that are interrelated and have similar semantic meanings. WordNet is commonly used to augment a term with further semantically related concepts that enrich its meaning. For example, various synsets with the same contextual meaning can be extracted for the term 'pol', such as "politician, politico, and political leader". WordNet is used in this study to expand the knowledge base with synonyms of the concepts inserted in the extended Politics ontology.
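A minimal sketch of this synonym expansion via NLTK's WordNet interface is shown below; it assumes the wordnet corpus has been downloaded and uses an illustrative ontology term.

```python
# Minimal sketch of synonym expansion with WordNet via NLTK, assuming the
# wordnet corpus is available (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

term = "politician"
synonyms = set()
for synset in wn.synsets(term, pos=wn.NOUN):
    for lemma in synset.lemmas():
        synonyms.add(lemma.name().replace("_", " "))

# Expanded vocabulary attached to the corresponding ontology concept,
# e.g. {'politician', 'politico', 'pol', 'political leader'}.
print(synonyms)
```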

Ontology interoperability Ontology interoperability aims to align and consolidate the developed ontology with relevant entities captured from other predefined domain and generic ontologies. Ontology interoperability is attained in this study by apprehending equivalence links (URIs) that indicate the same entity/resource. This linkage is depicted, for example, by the owl:sameAs relation for resources in Linked Data, which entails that the URIs of both subject and object indicate the same resource. For the interlinking process, we incorporate the Google KG, a knowledge base mainly developed to enhance Google's search engine by providing relevant, semantically enhanced, and context-specific results. The Google KG Search API is used to infer entities and categorised classes/types. In particular, the Google KG Search API provides a platform to collect entities that belong to a wide variety of independent domains. However, our approach is domain-driven: it uses a domain ontology to find relevant entities captured from the textual content of users' tweets. Therefore, amongst all entities obtained from analysing tweets with the Google KG Search API, we keep only those that map onto concepts of our Politics ontology; irrelevant entities not related to the designated domain are neglected.
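For illustration, the sketch below queries the Google Knowledge Graph Search API for a candidate mention and keeps only results whose types could be mapped to Politics ontology concepts; the API key, the queried mention, and the type filter are placeholders.

```python
# Minimal sketch of querying the Google Knowledge Graph Search API for a
# candidate entity mention; the API key is a placeholder and only entities
# whose types map to concepts in the extended Politics ontology are kept.
import requests

API_KEY = "YOUR_GOOGLE_API_KEY"              # placeholder
POLITICS_TYPES = {"Person", "Organization"}  # illustrative mapping targets

params = {"query": "Anthony Albanese", "key": API_KEY, "limit": 3, "languages": "en"}
resp = requests.get("https://kgsearch.googleapis.com/v1/entities:search",
                    params=params).json()

for element in resp.get("itemListElement", []):
    result = element["result"]
    if POLITICS_TYPES.intersection(result.get("@type", [])):
        print(result["name"], result.get("@type"), element.get("resultScore"))
```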

We also utilise the Natural Language Understanding service of IBM Watson™ as a one-stop shop, leveraging access to a wide variety of linked data resources through easy-to-access APIs. These resources include, but are not limited to, vocabularies such as the Upper Mapping and Binding Exchange Layer (UMBEL), Freebase (a community-curated database of well-known people, places, and things), and the high-quality YAGO knowledge base.

IBM Watson is also used for domain-based classification. In particular, IBM Watson analyses a given text or URL and categorises its content according to a set of categories (taxonomies) with corresponding scores. Scores range from 0 to 1 and convey the degree to which an assigned category/taxonomy/domain applies to the processed text or webpage. IBM Watson provides an inclusive list of categories organised into predefined hierarchies, where the high-level category indicates the general topic and the deeper-level categories provide a fine-grained analysis. For instance, "law, govt and politics" is a high-level category in which "presidential elections" is one of its deeper-level categories. IBM Watson is further used to identify the overall positive or negative sentiment of the provided content. The taxonomy inference module is used in this research for the domain discovery process, while sentiment analysis is used to discover the sentiments of replies to tweets. The purpose of domain classification and sentiment analysis is discussed in the following section.
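The following minimal sketch shows how such category and sentiment scores can be obtained through the IBM Watson NLU Python SDK; the API key, service URL, and example text are placeholders, and the exact feature set used in this study may differ.

```python
# Minimal sketch of domain classification and sentiment scoring with IBM
# Watson NLU; the API key, service URL, and example text are placeholders.
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import (
    Features, CategoriesOptions, SentimentOptions)
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

nlu = NaturalLanguageUnderstandingV1(
    version="2020-08-01",
    authenticator=IAMAuthenticator("YOUR_WATSON_API_KEY"))
nlu.set_service_url("YOUR_WATSON_SERVICE_URL")

text = "The presidential elections dominated the debate in parliament today."
result = nlu.analyze(
    text=text,
    features=Features(categories=CategoriesOptions(limit=3),
                      sentiment=SentimentOptions())).get_result()

# Hierarchical category labels (e.g. "/law, govt and politics/elections")
# with confidence scores in [0, 1], plus the document-level sentiment.
for category in result["categories"]:
    print(category["label"], round(category["score"], 3))
print(result["sentiment"]["document"]["label"])
```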

3.4 Social credibility module

As mentioned previously, this study aims to make use of the domain-specific Politics ontology and available KGs to analyse the social content of OSN users, thereby augmenting the domain KG with facts inferred from users with a legitimate and credible interest in the politics domain. However, the OSN medium allows legitimate and genuine users as well as spammers and other low-trustworthy users to publish and spread their content, leveraging the open environment and relatively few restrictions (Abu-Salih 2018; Abu-Salih, Bremie, et al. 2019; Abu-Salih et al. 2020; Abu-Salih et al. 2018; Abu-Salih et al. 2019a, b; Chan et al. 2018; Meneghello et al. 2020; Wongthongtham and Salih 2018). Hence, it is vital to measure users' credibility in numerous domains, thus indicating domain-based influential users and filtering out spammers and low-trustworthy users.

This paper incorporates CredSaT (Abu-Salih et al. 2019a, b), a comprehensive credibility mechanism intended to measure users' credibility based on their domains of knowledge. CredSaT provides an effective solution for discovering spammers and influential domain-based users from a list of users whose domain(s) of knowledge is tacit, incorporating the temporal factor. The outcome of the credibility module is a ranked list of users with a corresponding credibility value for each specific domain. The temporal factor is assimilated in CredSaT by dividing the dataset of a user's data and metadata into several chunks, where each chunk represents a specific period. A set of credibility measurements is used to evaluate the user's trustworthiness in each particular chunk, thus providing overall credibility values. The mechanism used to calculate a user's value in each step considers other users' values, thereby providing a normalisation approach for building the relative credibility ranking in each domain. Hence, each particular key value obtained from the user's data and metadata is measured against other users' values; in other words, each key attribute is normalised in each domain by dividing the value of the user's attribute by the maximum value achieved by any user in that domain. CredSaT has shown the effectiveness of its embodied framework by benchmarking it against other state-of-the-art baseline models.
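To make the normalisation step concrete, the sketch below illustrates it on toy data (it is not the full CredSaT model): each raw feature value of a user in a domain is divided by the maximum value attained by any user in that domain. The feature names and numbers are hypothetical.

```python
# Illustrative sketch (not the full CredSaT model) of the per-domain
# normalisation step: each user's raw feature value in a domain is divided
# by the maximum value any user attains in that domain. Feature names and
# numbers are hypothetical.
raw = {
    "politics": {"userA": {"retweets": 120, "replies": 40},
                 "userB": {"retweets": 30,  "replies": 10}},
    "sports":   {"userA": {"retweets": 5,   "replies": 2},
                 "userB": {"retweets": 50,  "replies": 25}},
}

normalised = {}
for domain, users in raw.items():
    # Maximum value per feature across all users in this domain.
    max_per_feature = {f: max(vals[f] for vals in users.values())
                       for f in next(iter(users.values()))}
    normalised[domain] = {
        user: {f: v / max_per_feature[f] if max_per_feature[f] else 0.0
               for f, v in feats.items()}
        for user, feats in users.items()}

print(normalised["politics"]["userB"])   # {'retweets': 0.25, 'replies': 0.25}
```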

As mentioned previously, our study uses the Twitter graph dataset crawled by Akcora et al. (2014), which comprises spammers and other anomalous users. Hence, the main purpose of the knowledge credibility module is to filter out spammers and other low-trustworthy users, as their social content affects the quality of the incorporated domain-based knowledge. For example, spammers who hijack tweets about politics-related content, events, and stories should be eliminated from further analysis, despite the fact that the number of political entities extracted from the content of those users may be relatively high. Table 1 shows the set of features incorporated into the CredSaT framework. The reader may refer to (Abu-Salih et al. 2019a, b) for a more detailed explanation of the methodology used for measuring users' credibility.

Table 1 Selected features of CredSaT Framework

As an example of the domain-based credibility analysis, Fig. 4 and Table 2 illustrate the key attributes used in conducting the credibility analysis on the social data and metadata collected for a well-known politician, "Joanne Ryan (@JoanneRyanLalor)", as well as a social spammer, "Ham—Hamjuku (@hamjuku)". Figure 4 illustrates the obtained values for certain domain-dependent attributes explained in Table 1. These values are computed for each of the 23 domains inferred from the domain discovery approach carried out utilising the IBM Watson API.

Fig. 4 Domain-dependent social data analysis of two twitterers: a Joanne Ryan (@JoanneRyanLalor), a legitimate politician who is a member of the Australian House of Representatives; b Ham—Hamjuku (@hamjuku), a social spammer

Table 2 Domain-independent social data analysis for a legitimate twitterer and a spammer twitterer

The values depicted in Fig. 4a demonstrate the domain-dependent analysis of Joanne's tweets, which show a clear interest in the political domain of knowledge. This is expected considering that she is a member of the Australian House of Representatives and has been active in this domain for several years. Figure 4a also shows that Joanne's tweets have received considerable attention from her followers, as perceived from the high number of domain-based likes, retweets, and replies. On the other hand, Fig. 4b shows the domain-based credibility analysis of a social spammer who demonstrated an interest in all domains. This commonly conveys suspicious behaviour for the following reasons: (1) no one person is an expert in all domains (Gentner and Stevens 1983); (2) a user who posts in all domains does not convey to other users which domain(s) s/he is interested in; a user signals interest in a particular domain by posting a wide range of content in that domain; (3) there is the possibility that this user is a spammer, given the known behaviour of spammers posting tweets about multiple topics (Wang 2010). Tweets posted across all domains do not reflect a legitimate user's behaviour, as in the case of @hamjuku.

Further, Table 2 shows the domain-independent analysis of the @JoanneRyanLalor and @hamjuku Twitter profiles. The figures exemplified in this table are plausible: the number of users following @JoanneRyanLalor is four times the total number of her friends (i.e., the accounts she follows). Also, the tweet and URL similarities computed for her 6495 tweets are around 20%, which is quite reasonable. By contrast, the similarity analysis computed for both the tweets and URLs of @hamjuku raises questions about the quality of the posted content; publishing the same content repeatedly is obviously spammer behaviour (Sedhai and Sun 2015). More than 50% of the tweets posted by @hamjuku consist mainly of repeated content, and this applies to the textual content as well as the embedded URLs. The TFF ratio calculated for @hamjuku appears rational and legitimate, even though an increasing number of friends that a user \(u\) follows, compared with a static number of followers, commonly indicates suspicious behaviour, and such a user is likely to be a spammer (Twitter 2009; Wang 2010). However, as can be inferred from the analysis conducted on @hamjuku, the friends-to-followers ratio cannot be considered a sole spam-detection criterion and does not necessarily exhibit a credible profile; further scrutiny should be carried out to examine the overarching behaviour of a spammer, thereby providing a reliable detection mechanism.

3.5 Knowledge graph creation

At this stage, the knowledge representing the politics domain and the incorporated credible users, along with their data and metadata, is captured in the domain ontology. In addition, knowledge is depicted in a less expressive relational model that stores knowledge obtained from the analysis conducted on users' social metadata and inferred from their collected textual content. The relational model also embodies the users' domain-based credibility, indicating the trustworthiness of the users in each domain of knowledge. The knowledge graph creation module aims to transform the collected heterogeneous data formats into a unified standard form.

The Resource Description Framework (RDF) is a widely used underlying model to represent knowledge in terms of triples (subject, predicate, object), where the subject of the triple indicates the resource that needs to be described, the predicate indicates a property of the subject, and the object refers to the property value describing the subject. A typical knowledge graph is represented as a directed graph where nodes indicate the entities (resources) of the class model and edges depict the relations (properties) between those entities.

The datasets collected in this study come in different formats: tabular, JSON, and CSV. One of the crucial steps in conducting Big Data analytics is to provide a consolidated platform to handle the heterogeneity of datasets collected from diverse data islands. Hence, we incorporate the RDF Mapping Language (RML) (Dimou et al. 2014) to express data in dissimilar formats in a unified RDF form, thereby mitigating the variety dimension of Big Data (Vidal et al. 2019). RML defines a generic approach for mapping different data structures, where the input can be any data source and the output is an RDF graph. The mapping process in RML consists of one or more triples maps. Each triples map embodies a logical source (the input source), a subject map (describing the mechanism for generating the subject for each logical resource), and a predicate-object map (specifying the predicate and the object map and how the triple's predicate is generated). RML mapping rules are used in this study to transform the annotated components into RDF triples that enrich the knowledge graph of the semantic repository. Figure 5 demonstrates an example of mapping a JSON data source to RDF triples for an Australian politician (Joanne Ryan) using RML.

Fig. 5 Example of mapping a JSON data source to RDF using RML
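As a lightweight, illustrative complement to the RML mapping (not the actual mapping rules used in this study), the following rdflib sketch constructs a few triples of the kind of RDF molecule produced for Joanne Ryan; the namespace, property names, literal values, and the external owl:sameAs target are hypothetical.

```python
# Illustrative sketch of the kind of RDF output produced by the RML mapping:
# a few triples of an RDF molecule for the politician Joanne Ryan, built here
# directly with rdflib. The namespace, resource URIs, property names, and
# values are hypothetical; the real pipeline generates them via RML rules.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, OWL, RDF, XSD

POL = Namespace("http://example.org/politics#")   # hypothetical namespace

g = Graph()
g.bind("pol", POL)
g.bind("foaf", FOAF)

ryan = URIRef("http://example.org/politics/resource/Joanne_Ryan")
g.add((ryan, RDF.type, POL.Politician))
g.add((ryan, FOAF.name, Literal("Joanne Ryan")))
g.add((ryan, POL.memberOf, POL.Australian_Labor_Party))
g.add((ryan, POL.followersCount, Literal(26000, datatype=XSD.integer)))
# Interlink with an equivalent resource in an external knowledge base.
g.add((ryan, OWL.sameAs,
       URIRef("http://dbpedia.org/resource/Joanne_Ryan_(politician)")))

print(g.serialize(format="turtle"))   # rdflib >= 6 returns a str
```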

For annotation and enrichment, the domain knowledge graph is fed with annotated politics entities extracted from the textual content of the tweets. The annotation is then enriched with descriptions of the concepts, referring to the domain ontologies and using controlled vocabularies, e.g., Dublin Core (DC), Simple Knowledge Organization System (SKOS), and Semantically-Interlinked Online Communities (SIOC). This allows each entity in the textual data to be specified with its semantic concept. The particular concepts can be further expanded into other related concepts and other entities instantiated by the concepts. The consolidation of this semantic information provides a detailed view of the entities captured in the domain ontologies.

For the interlinking process, entities are interlinked with similar entities defined in other datasets to provide an extended view of the entities represented by the concepts. Our focus is on equivalence links specifying URIs (Uniform Resource Identifiers) that refer to the same resource or entity. The Web Ontology Language (OWL) provides support for equivalence links between ontology components and data. The resources and entities are linked through the owl:sameAs relation, which implies that the subject URI and object URI refer to the same resource; hence, the data can be explored in further detail. In the interlinking process, different vocabularies, i.e. the Upper Mapping and Binding Exchange Layer (UMBEL), Freebase (a community-curated database of well-known people, places, and things), YAGO (a high-quality knowledge base), Friend-of-a-Friend (FOAF), Dublin Core (DC), Simple Knowledge Organization System (SKOS), Semantically-Interlinked Online Communities (SIOC), and the Google KG, are used to link and enrich the semantic description of the annotated resources.

The domain KG is also enriched with knowledge inferred from the users' social presence on the Twitter platform. This primarily encompasses the associated metadata of the users and their content, such as #followers, #friends, #likes/favourites, #retweets/shares, etc. It also includes the resultant values obtained from the domain-based credibility analysis of users, such as the number of domains a user is interested in, the user's credibility value in each domain of knowledge, the number of political entities indicated in the user's tweets, and the number of positive, negative, and neutral replies to the user's tweets.

Figure 6 illustrates an example of an RDF graph of knowledge inferred from the multi-source heterogeneous data collected for the politician Joanne Ryan. This RDF graph can be referred to as an RDF molecule, as it represents a set of RDF triples sharing the same subject. The RDF molecule has been constructed as a result of the transformation process conducted on the data by means of the defined RML mapping rules. The RML mapping rules are further used to ensure the format of the designated unique identifiers (URIs) for the mapped resources, which are used as the subject of all the RDF triples.

Fig. 6 Example of an RDF molecule describing "Joanne Ryan" obtained from the KG creation process

3.6 Knowledge graph embedding models

Knowledge Graph Embedding (KGE) is the process of transforming the constituents of a KG (entities and relationships) into a low-dimensional, semantically continuous space (Wang et al. 2017a, b). Even though graph problems can be solved on the conventional graph representation (i.e. the adjacency matrix), mapping the entire graph or its nodes to a vector space has attracted the scientific community due to its scalability in simplifying several complex real-life graph problems such as KG completion, entity resolution, and link-based clustering, to cite a few (Kipf and Welling 2016; Wang, Cui, et al. 2017; Nickel et al. 2015). A KG embedding is learned by training a neural architecture over the graph and commonly comprises three main components: (1) an encoding of entities as distributed points in the vector space and of relations as vectors or other forms; (2) a scoring function, or model-specific function, used to evaluate the plausibility of triples; and (3) an optimisation procedure, which aims to learn the optimal embedding for the designated KG so that the scoring function assigns high scores to positive statements.

The literature on KG embedding commonly categorises embedding techniques into two main classes: translational distance models and semantic matching models (Wang et al. 2017a, b). Translational distance models evaluate the plausibility of a fact as a distance between two entities. Semantic matching models measure the plausibility of facts by matching the latent semantics of entities and relations in their low-dimensional representations. Amongst the numerous KG embedding models proposed in the literature, the following are the most popular models incorporated in this study.

Translating embedding (TransE) (Bordes et al. 2013) learns the representations of both entities and relations as vectors in the same low-dimensional semantic space. Hence, for a golden triple \(\left( {h, r, t} \right)\), TransE treats the relation \(r\) as a translation in the embedding space so that \(h + r \approx t\) when \(\left( {h, r, t} \right)\) holds (\(t\) should be the closest point to \(h + r\)); otherwise, \(h + r\) should be far in distance from \(t\).

The DistMult model (Yang et al. 2014) is an extension and simplification of RESCAL (Nickel et al. 2011) based on a bilinear formulation. In this model, each relation is encoded as a diagonal matrix (i.e. a single vector), and the trilinear dot product is used as the scoring function.

Complex embeddings (ComplEx) (Trouillon et al. 2016) extends the DistMult model by introducing complex-valued embeddings, where the scoring function is based on the trilinear Hermitian dot product in \(\mathbb{C}\). Entity and relation embeddings are no longer positioned in the real space but in a complex space.

Holographic embeddings (HolE) (Nickel et al. 2016) is a compositional vector space model that learns representations of entities and relations by combining the expressive strength of RESCAL with the simplicity of DistMult.

Convolutional 2D KG embeddings (ConvE) (Dettmers et al. 2018) is a neural link prediction model that uses deep, multi-layer, convolutional and fully connected layers of nonlinear features to capture the interactions between input entities and relations.

Convolution-based model (ConvKB) (Nguyen et al. 2017) incorporates convolutional neural networks over the concatenation of entity and relation embeddings, which increases the ability to learn latent features.
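To make the scoring functions behind some of these models concrete, the sketch below evaluates the TransE, DistMult, and ComplEx scores on a single triple using random placeholder embeddings; the embedding dimension and values are illustrative only.

```python
# Minimal numpy sketch of the scoring functions behind three of the models
# above, evaluated on a single triple (h, r, t); embeddings are random
# placeholders rather than trained vectors.
import numpy as np

k = 8                                   # embedding dimension (illustrative)
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, k))       # real-valued entity/relation vectors

# TransE: plausible triples satisfy h + r ≈ t, scored by (negative) distance.
transe_score = -np.linalg.norm(h + r - t, ord=1)

# DistMult: trilinear dot product with a diagonal relation matrix.
distmult_score = np.sum(h * r * t)

# ComplEx: real part of the trilinear Hermitian product in complex space.
hc, rc, tc = rng.normal(size=(3, k)) + 1j * rng.normal(size=(3, k))
complex_score = np.real(np.sum(hc * rc * np.conj(tc)))

print(transe_score, distmult_score, complex_score)
```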

4 Experimental results

4.1 Dataset selection

As indicated previously, this study aims to construct a domain-based (politics) KG and to carry out embeddings on the constructed KG to assist in further analysis. We make use of the Twitter platform to consolidate the domain-based KG with facts inferred from social content propagated on this virtual platform. As indicated in Sect. 3.2.1 (Dataset acquisition), the dataset is collected from dissimilar resources based on three different categories of users: (A) members of the Parliament of Australia (Senators and MPs); (B) users interested in the politics domain; and (C) users whose domain of interest is not explicitly conveyed; this set may also contain spammers, anomalous users, and other untrustworthy users. Domain analysis using IBM Watson was conducted on each category of users to infer their domains of interest. Figure 7 illustrates the total number of users and their tweets distributed over 23 domains of knowledge for each designated category.

Fig. 7 The distribution of the total number of users, posted tweets, and URLs in each designated domain for three categories: a Politicians (Senators and Members of Parliament), b Politics-interested, and c Unknown politics interest

As depicted in Fig. 7, category (A) shows a clear interest in the political domain, which is reasonable (users in this category are mainly politicians, and their social content is expected to discuss topics related to politics). Category (B) is a mixture of users selected because they explicitly show a common interest in politics. The domain analysis of category (B) supports this and shows that those users are interested in politics as well as other domains such as technology, art and entertainment, and travel. Category (C) demonstrates only a slight interest in the politics domain, a stronger interest in other areas, and an overall balanced interest across topics.

The domain-based social credibility module scrutinises the collected and pre-processed dataset before further analysis, so that the KG is augmented with facts obtained from users who are legitimate and convey a credible interest in the politics domain, while the content of users who obtain low trustworthiness values as per the mechanism discussed in Sect. 3.4 is eliminated.

The social credibility module was applied to the dataset, generating a new dataset embodying legitimate, politics-interested users with their associated content. Table 3 shows figures on the collected datasets for each category before and after applying the social credibility module.

Table 3 Datasets before and after conducting the social credibility analysis

As demonstrated in Table 3, the content belonging to Category (A) was kept as is, with no cleansing, because this dataset comprises selected users (Senators and members of the Australian House of Representatives). These users are politicians, and their social content constitutes the main source for KG creation. The percentage values of low-trustworthy users, tweets, entities, political_entities, and facts for the remaining categories are justifiable. Category (B) contains users who might show a certain interest in the political domain but not necessarily a genuine one; ten percent (10%) of users in this category obtained low-trustworthiness values. Category (C) contains a subset of users indicated in the Twitter graph dataset collected by Akcora et al. (2014). As discussed in the Dataset Acquisition section, this dataset has been used in the literature to discover spammers and other illegitimate accounts. Therefore, it is anticipated that the percentage of low-trustworthy users in this category is higher than in the former ones; in fact, fifty-five percent (55%) of users in this category were detected as low-trustworthy (15% are spammers).

4.2 Domain KG embedding model evaluation

4.2.1 Experiment settings

This study incorporates AmpliGraph™ (Costabello et al.) version 1.3.1, with TensorFlow 1.14 and Python 3.7 on the backend, for conducting KG embeddings on the constructed domain KG. All the experiments, including training and evaluation of each embedding model, were carried out on the Australian Pawsey high-performance supercomputing facilities. The domain KG is initially divided into training, test, and validation subsets. Several KG embedding models are implemented and their hyperparameters are tuned using a random search strategy. Random search has proven efficient and has outperformed the grid search routine, as it provides a solid baseline and also shows robustness as the number of parameters increases (Bergstra and Bengio 2012; Li et al. 2016, 2017). A brief description of some internal settings used in the incorporated embedding models is provided in Table 4.

Table 4 Hyperparameters used for the KG Embedding models
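As an illustration of this setup, the following minimal sketch fits one AmpliGraph 1.3.x model on the domain KG; the triples file name, split sizes, and hyperparameter values are placeholders rather than the tuned configuration reported in Table 5.

```python
# Minimal sketch of fitting one embedding model with AmpliGraph 1.3.x over
# the domain KG; file name, split sizes, and hyperparameters are placeholders.
from ampligraph.datasets import load_from_csv
from ampligraph.evaluation import train_test_split_no_unseen
from ampligraph.latent_features import ComplEx

# Triples stored as tab-separated <subject, predicate, object> strings.
X = load_from_csv(".", "politics_kg_triples.tsv", sep="\t")
X_train_valid, X_test = train_test_split_no_unseen(X, test_size=1000)
X_train, X_valid = train_test_split_no_unseen(X_train_valid, test_size=500)

model = ComplEx(k=150, eta=10, epochs=200, batches_count=100,
                loss="multiclass_nll", optimizer="adam",
                optimizer_params={"lr": 1e-4},
                regularizer="LP", regularizer_params={"p": 3, "lambda": 1e-5},
                seed=0, verbose=True)
model.fit(X_train)
```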

Table 5 shows the set of hyperparameters tested using the random search strategy; the underlined values are the optimal ones obtained from this search.

Table 5 The Embedding model and the incorporated hyperparameters used in the random search

4.2.2 Evaluation protocol

This study incorporates the evaluation protocol proposed by Bordes et al. (2013). The protocol comprises three key steps: (1) synthetically generating negative triples; (2) removing those corrupted triples that turn out to be positive (the filtered setting); and (3) ranking each test fact (triple) against the triples returned from the preceding step. Negative triples are initially positive triples (correct facts) that have been manipulated (corrupted) by randomly replacing the head, tail, or relation, thus creating new (false) triples. The negative sampling mechanism used in this paper is based on corrupting the head (subject) and the predicate of each test triple; we then compute the average of the evaluation metrics attained by each method.
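For illustration, the sketch below runs this filtered-ranking protocol with AmpliGraph's evaluation utilities, corrupting the subject and object sides (the corruption modes exposed by evaluate_performance); it assumes the model and data splits from the training sketch in Sect. 4.2.1, so the variable names are placeholders.

```python
# Sketch of the Bordes et al. (2013) filtered-ranking protocol as exposed by
# AmpliGraph: each test triple is corrupted, known positives are filtered
# out, and the rank of the true triple yields MR, MRR, and Hits@N. Assumes
# `model`, `X_train`, `X_valid`, and `X_test` from the training sketch above.
import numpy as np
from ampligraph.evaluation import (evaluate_performance, mr_score,
                                   mrr_score, hits_at_n_score)

filter_triples = np.concatenate([X_train, X_valid, X_test])
ranks = evaluate_performance(X_test, model=model,
                             filter_triples=filter_triples,
                             corrupt_side="s+o", verbose=True)

print("MR:", mr_score(ranks))
print("MRR:", mrr_score(ranks))
for n in (1, 3, 10):
    print(f"Hits@{n}:", hits_at_n_score(ranks, n=n))
```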

To evaluate the effectiveness of the social credibility module, we carry out the aforementioned protocol on two different KGs: KG1, generated by accumulating the original dataset shown in Table 3 before applying the credibility module, and KG2, generated from the curated dataset to which the credibility module has been applied. The next subsection provides a further discussion of the conducted experiments.

4.2.3 Embedding evaluation results

The experiments were carried out with the six well-known embedding models listed in Table 4, along with the tuned hyperparameters. With the ranks obtained from the subject and predicate corruption of each dataset described in the previous subsection, the metrics are computed for each embedding model on each generated KG.

Table 6 illustrates the metric values obtained by each model on each KG. Two key findings can be inferred from the figures in Table 6. First, all embedding models perform better on KG2 than on KG1. This underscores the significance of incorporating a credibility module for data purification, particularly for data collected from mixed-quality resources such as social media. Second, despite the convergence of the performance results, the ConvKB embedding model outperforms the other models on most metrics. For example, examining \(hits@1, hits@3,\) and \(hits@10\) with ConvKB, we were able to hit a correct subject or predicate 55%, 66.9%, and 81% of the time respectively using KG2. This interpretation applies to all other metric values obtained by each embedding model. The good performance of ConvKB can be attributed to its underlying structure: it incorporates a CNN to capture the global relationships and transitional features of the entities and relations embodied in the KG. The HolE model has also shown promising results; this is understandable, as HolE integrates the efficiency and simplicity of more than one model (Wang et al. 2017a, b). Moreover, HolE can capture rich interactions in such relational data by applying circular correlation to vectors to create compositional representations.

Table 6 Comparison of evaluation metrics of the embedding models using two KGs, where KG1 is constructed from the original dataset before applying the credibility module, and KG2 is constructed from the curated dataset after applying the credibility module

The utility of embedding models is commonly measured by their applicability to more concrete downstream tasks. The following sections discuss the utility of the developed approach in link prediction, clustering, and visualisation tasks.

4.3 Experiments on downstream tasks

4.3.1 Task (1): link prediction

The KG embeddings implemented in this study are used to carry out a link prediction task. We generated a set of facts in the politics domain, containing true political facts that were not used to train the model (unseen facts) as well as some synthetically created false political facts. The goal is to test the ability of the model to detect which of the presented true candidate facts are likely to be true and, similarly, which false candidate facts are unlikely to be true. To evaluate the performance of this task, accuracy, precision, recall, and F-measure metrics are used.
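For illustration, the sketch below scores a handful of candidate statements with a trained AmpliGraph model via its predict method; the candidate triples, their implied labels, and the decision threshold are hypothetical, and the entities must already exist in the training KG.

```python
# Sketch of scoring unseen candidate facts with a trained AmpliGraph model;
# the candidate triples are illustrative and the decision threshold is a
# placeholder that would be tuned on the validation split.
import numpy as np

candidates = np.array([
    ["Anthony_Albanese", "memberOf", "Australian_Labor_Party"],   # true fact
    ["Anthony_Albanese", "memberOf", "Liberal_Party"],            # false fact
])

scores = model.predict(candidates)      # higher score = more plausible
threshold = 0.0                         # placeholder decision boundary
for triple, score in zip(candidates, scores):
    label = "predicted true" if score > threshold else "predicted false"
    print(triple, round(float(score), 3), label)
```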

Precision is the proportion of actual true politics facts that are accurately predicted among all (accurate and inaccurate) predictions of true politics facts. Recall is the proportion of actual true politics facts that are accurately predicted among all actual true politics facts. Therefore, a high precision value indicates that the prediction module succeeds on the result-relevancy measure and is able to deduce more relevant politics facts among those retrieved. A high recall value indicates that the prediction module succeeds in retrieving the truly relevant results. For example, if a prediction module attains a precision value of 1, then all predicted facts are correct predictions and depict factual politics facts that can be used to augment the knowledge graph; however, this does not necessarily reflect the module's efficacy in retrieving all true politics facts. On the other hand, if the prediction module attains a recall value of 1, it is able to retrieve all true positive facts, yet this does not convey the number of other false retrieved predictions. This is why it is common good practice to also report the F-measure, which combines precision and recall into a single weighted (harmonic) average.
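These metrics reduce to the standard confusion-matrix definitions; writing TP, TN, FP, and FN for the true/false positives and negatives over the candidate politics facts:

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F\text{-}\mathrm{measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.
\]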

The ground truth dataset of the link prediction experiment contains 1,000 labelled statements comprising both true and false political facts, obtained equally from the two generated KGs (original and curated). Table 7 shows the performance comparison of the link prediction task for the six incorporated embedding models and the two KGs. As depicted in the table, the ConvKB embedding model outperforms the other embedding models in this task for both KGs. For example, it obtained 74.4%, 83.2%, 77.2%, and 77.3% in the accuracy, precision, recall, and F-score metrics respectively using KG2. HolE and ConvE have also shown promising performance; for example, ConvE was able to correctly predict almost half of the true facts as true positive statements, and its precision of 72% demonstrates the ability of this embedding model to produce strong results in this task.

Table 7 Performance comparison of the link prediction task for six KG embedding models on two KGs

On the other hand, Table 7 shows that the results of the TransE, DistMult, and ComplEx embedding models are convergent in almost all computed metrics (i.e. accuracy, precision, recall, and F1-score) for both KGs. Although TransE performs well on datasets that embody one-to-one relationships, it is inadequate for handling unbalanced relations (i.e. one-to-many/many-to-one) (Rossi et al. 2020). For example, embedding two knowledge facts such as (Anthony Albanese, hasLocation, NSW) and (John Alexander, hasLocation, NSW) will push the “Anthony Albanese” entity vector close to the “John Alexander” entity vector. However, this does not convey the factual similarity between these two politicians; “Anthony Albanese” is a member of the Australian Labor Party while “John Alexander” is affiliated with the Liberal Party, and this discrepancy also applies to their electorates and other facets. Furthermore, the DistMult embedding model is unable to handle asymmetric and antisymmetric relations. This is evident from the entry-wise product depicted in Eq. (2), which treats all relations as symmetric and therefore yields misleading results when asymmetric and antisymmetric relations are present (Wang, Ruffinelli, et al. 2018). HolE, on the other hand, addresses this issue by using a circular correlation operator, which enables it to capture asymmetric and antisymmetric relations (Sharma and Talukdar 2018).
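A small numerical sketch illustrates this point using the standard DistMult and HolE scoring functions (the embedding vectors are random and purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 100))   # hypothetical 100-dimensional embeddings

# DistMult scores a triple with an entry-wise product, which is symmetric
# in the head and tail entities.
def distmult(h, r, t):
    return np.sum(h * r * t)

print(np.isclose(distmult(h, r, t), distmult(t, r, h)))   # True: symmetry is built in

# HolE scores a triple via the circular correlation of head and tail
# (computed here with the FFT), which is not symmetric.
def circular_correlation(a, b):
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

def hole(h, r, t):
    return np.dot(r, circular_correlation(h, t))

print(np.isclose(hole(h, r, t), hole(t, r, h)))           # generally False: asymmetry can be captured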

Further, Table 7 demonstrates the importance of incorporating the social credibility module. This can be seen in the relatively poor performance of all embedding models on the link prediction task using the low-quality KG (i.e. KG1) and the better performance of the same embedding models on the high-quality, cleansed KG (i.e. KG2).

Table 8 shows an example of the link prediction task. The table presents a set of selected statements obtained from the ground truth of KG2, each with a label indicating whether the statement is true or false, along with the classification label produced by the ConvKB embedding model. It can be seen that the embedding model has largely been able to understand Australian politics and provide good predictions in this domain. For example, the model is able to indicate that Karen Andrews is actually a member of the Australian Labor Party even though this information was not imported into the KG. It is also able to discover the political interests of users whose domain of interest is not explicitly depicted. For example, the collected tweets of @JohnKeily1 demonstrate that this user is interested in politics and does not support the Australian Labor Party (ALP). The model captures some truth about this user and detected that the user is highly interested in politics, yet it fails to capture that @JohnKeily1 is not a supporter of the ALP. This can be understood considering that @JohnKeily1 has posted several negative tweets about the ALP, and his position in the vector space therefore turns out to be close to those supporting this party. This explanation also applies to other instances that the model was unable to classify correctly. Hence, in future work, the KG will be further scrutinised and enhanced to embody, for example, the sentiments of the social contents, political polarisation, etc.

Table 8 A selected set of labelled candidates (true and false facts) from KG2

4.3.2 Task (2): KGE clustering and visualisation

Cluster analysis is another evaluation strategy that can be performed on a constructed KG. Clustering is carried out on the embedding space of both entities and relations and is an effective strategy to measure the subjective quality of the KG embedding. The clustering projects the original embedding with the predetermined space size into a 2D space, after which a subjective assessment of the embeddings is carried out. Several clustering algorithms have been implemented and evaluated, such as AffinityPropagation, AgglomerativeClustering, Birch, DBSCAN, FeatureAgglomeration, and KMeans. Several projections have been generated from these clustering modules, yet the KMeans algorithm proved effective owing to the factual projections it generates. This experiment follows the standard embedding space size (i.e. k = 100) used by AmpliGraph™.
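A minimal sketch of this procedure, assuming the entity embeddings have already been obtained from a trained model (the embedding matrix and the number of clusters are illustrative):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
entity_embeddings = rng.normal(size=(500, 100))   # placeholder for the k = 100 embeddings

# Project the embeddings into 2D, then group the projected entities.
points_2d = PCA(n_components=2).fit_transform(entity_embeddings)
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(points_2d)
print(labels[:10])   # cluster assignment of the first ten entities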

Figure 8 comprises two plots, namely the clustering analysis for KG1 (original) and the clustering analysis for KG2 (curated). Each analysis provides four clusters that show the semantic representation of the entities (users) obtained from both datasets. The clustering analysis of KG1, which is constructed from the mixed-quality (original) dataset, shows a social spammer (spammer5; the Twitter screen name is concealed) that hijacks an orange-coloured cluster. The cohort of this cluster consists mainly of users who belong to, or have an interest in, the Australian Labor Party. The clustering analysis of KG1 also demonstrates a certain inadequacy in correctly assembling legitimate users who share the same political affiliation. For example, Melissa Price MP, a politician who belongs to the Liberal Party of Australia, is assigned to the category of politicians who support or belong to the National Party of Australia.

Fig. 8 Clustering analysis of the constructed KGs

On the other hand, the clustering analysis of KG2 in Fig. 8 shows a set of selected politicians and legitimate users interested in politics. The depicted clustering is largely sound; it can be seen from the figure that almost all members categorised into the same group share the same political affiliation. For example, Kevin Andrews MP, Lucy Wicks MP, and Rowan Ramsey MP, to name a few, are all members of the Liberal Party of Australia and have been grouped into the same cluster. Likewise, Matt Keogh MP, Fiona Phillips, and Brian Mitchell are assembled with others in the same cohort (i.e. ALP). Further, the figure shows that the incorporated clustering approach is also able to infer the political affiliation of non-politicians; for example, the twitterer (@wheels002) appears in the same vector space as members of the Australian Parliament who belong to the Australian Greens. This is evident since @wheels002 has conveyed her interest in this political party in several tweets. The clustering analysis of these two KGs verifies the significance of incorporating the social credibility module, not only to eliminate low-trustworthy social users and content but also to provide a better analysis in the downstream tasks.

The TensorBoard toolkit is used to visualise the implemented KG embeddings of the two KGs in a 3D view and to project the resultant embeddings into a low-dimensional space using Principal Component Analysis (PCA). PCA is a statistical technique that reduces the dimensionality of a complex problem and exposes patterns in the dataset by building a linear, multivariate model from it (Rencher 2005). The PCA dashboard of TensorBoard is used to provide a 3D view of the KG embeddings. Figures 9a–c and 10a–c show visuals obtained from TensorBoard on the original KG and the curated KG, respectively. These graphs illustrate the value of visualisation in providing a subjective assessment of the implemented approach; they also demonstrate the significance of incorporating the social credibility module on the KG embeddings. For example, Fig. 9a lists all concepts related to the “Politics” entity in the high-dimensional space. The entities nearest to Politics in the original space include ‘spammer12’. This noisy data was detected by the social credibility module and thus eliminated from the dataset used to construct the curated KG (i.e. KG2). This is demonstrated in Fig. 10a, in which the set of entities close to Politics are sound and convey the semantic relationships with the designated domain.
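A rough sketch of this kind of inspection (the entities and vectors are purely illustrative; TensorBoard performs the projection interactively, whereas here PCA and a simple cosine-similarity lookup are used):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(7)
entities = ["Politics", "Australian Labor Party", "Liberal Party of Australia", "spammer12"]
embeddings = rng.normal(size=(len(entities), 100))   # hypothetical entity embeddings

# 3D projection analogous to the PCA view offered by the TensorBoard projector.
coords_3d = PCA(n_components=3).fit_transform(embeddings)
print(coords_3d.shape)   # (4, 3): one 3D point per entity

# Which entities lie closest to "Politics" in the original embedding space?
sims = cosine_similarity(embeddings[[0]], embeddings)[0]
print([entities[i] for i in np.argsort(-sims) if i != 0])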

Fig. 9 KG Embedding visualisation using TensorBoard on KG1 (original)

Fig. 10 KG Embedding visualisation using TensorBoard on KG2 (curated)

Figure 9b, c also illustrate the effect of applying KG embedding to a KG that contains corrupted content (KG1). This is evident as spammer1 and spammer3 are positioned within the semantic space and thus appear related to legitimate politicians. On the other hand, Fig. 10b displays the cohort embedded with “Amanda Rishworth MP”, a member of the ALP, using the curated KG. It is noticeable that the system is able to place members who factually share common semantic features near each other in the embedded space. This is also supported by the members appearing with “Alan Tudge MP”, a member of the LP, as depicted in Fig. 10c.

Using the visualisation technique as a subjective assessment of the implemented embedding approach verifies, once again, the applicability and utility of our approach for constructing a high-quality and trustworthy domain-based KG.

5 Conclusion and future work

The tremendous amount of information on the Web, presented in dissimilar formats and covering various topics, poses a challenge to obtaining the hoped-for added value from such massive data islands. This offers researchers a vital opportunity to consolidate efforts toward a better understanding and analysis of such multimodal contents. In this context, Knowledge Graphs (KGs) have established a new venue by enabling machines to understand meanings, thereby shrinking the semantic gap between machines and humans. Further, domain-based KGs have extended these efforts by propagating knowledge in dissimilar domains that can be incorporated to resolve a variety of real-life problems. Yet, the credibility of knowledge is commonly neglected in the construction of KGs, especially when the knowledge is harvested from social media, where spammers and other low-trustworthy users find a fertile medium to publish and spread their content, taking advantage of the open environment and fewer restrictions of these platforms.

This paper proposes a credibility-based domain-specific KG embedding framework. The framework integrates several modules to manage and extract useful and trustworthy knowledge from the continuous propagation of mixed-quality social content. In particular, it comprises the following modules: (1) Domain knowledge inference: the core activity that prepares data for KG creation. It aims to detect and infer the user’s domain of knowledge from pre-processed datasets. As a proof of concept, the experiments are carried out on the politics domain. The module makes use of various cross-domain knowledge-based repositories, including Google KG™, IBM Watson NLU™, and Wordnet™, to enrich the semantics of the textual contents, thereby facilitating the interoperability and integration needed to infer political knowledge. (2) Social credibility module: a comprehensive credibility mechanism to measure users’ credibility in the politics domain, incorporating a metric of key attributes. (3) KG construction: aims to construct a politics KG leveraging a politics ontology which captures knowledge representing the politics domain and the incorporated credible users along with their data and metadata. (4) KG embedding: this module incorporates state-of-the-art embedding models to embed the constructed KG in a semantically interrelated, low-dimensional vector space. Two KGs are used to demonstrate the utility of the constructed KG embeddings, namely KG1, which is constructed from a poor, low-quality dataset, and KG2, the curated KG, which is constructed from a cleansed version of the former dataset obtained by applying the social credibility module. The embedding utility of these KGs is demonstrated and substantiated on link prediction, clustering, and visualisation tasks.

This paper reports on work in progress from an ongoing project whose purpose is to develop an integrated platform of techniques for domain discovery, credibility evaluation, and KG construction and embedding. Therefore, the following extensions will be considered in future work: (1) More embedding techniques will be implemented and evaluated. (2) CredSaT is the sole social credibility module used in this study; thus, an array of other domain-based social credibility modules will be studied and implemented to consolidate the implemented approach. (3) The politics KG was constructed in this study as a proof of concept; in future work, we will investigate other domains leveraging domain ontologies, semantic technologies, and the Linked Open Data cloud. (4) The current modules and the newly proposed enhancements will be automated, and the entire architecture will be developed and released as an open-source project to facilitate replication and knowledge sharing.