1 Introduction

The five Vs of big data (velocity, volume, value, variety, and veracity) define social networks (Abkenar et al. 2021). With the development of technology, the big data generated by social media networks and the related analytics have attracted the attention of researchers. Data are now produced in unprecedented proportions in a variety of environments, increasing every 18 months as a consequence of various forms of databases, such as those obtained from different social media networks (Rossi and Hirama 2022). Big data is heterogeneous and complex in nature; therefore, straightforward approaches are ineffective for processing and storing it. A primary issue is the diversity of data types, along with data inconsistency, redundancy, and incompatibility. Consequently, there is a perceived need to devise a reliable system that can retrieve, organize, and process information efficiently and safely.

The development of cloud computing has changed how software and computers interact. Rather than having to install applications locally, cloud computing enables apps to be activated and accessed in the cloud (Alqarni 2021). It is therefore not surprising that user interest in cloud storage has surged: online storage services from Microsoft, Google, Amazon, and others sparked a gold rush following their arrival in 2012 (Ahuja et al. 2012). In a similar vein, the research community is constantly working to improve the effectiveness and security of cloud computing.

Data integration is the process of compiling data from multiple sources into a single, cohesive dataset. Researchers have been developing data integration tools (Kancharala 2021; Nie et al. 2021; VandanaKolisetty and Rajput 2021; Jung and Chung 2021). Bettio et al. (2021) developed MOMIS to integrate clinical data and visualize a patient's natural, molecular, and clinical history. GNN-DDI was designed to integrate drug information from several systems by building an attributed heterogeneous network (Al_Rabeah and Lakizadeh 2022). However, because the volume of data continues to rise dramatically in a relatively short period of time, source data is often difficult to integrate, constraining the effectiveness of data integration tools (Kalayci et al. 2021). Hilali et al. (2022) modified the ETL (Extraction, Transformation, and Load) process to handle the semantic heterogeneity of big data but lacked compatibility with NoSQL databases. Our framework helps users integrate data from distributed data sources so that the data can be better used and managed, and it does so at low cost, which makes it valuable for businesses. Traditional data integration frameworks, in contrast, do not define clear integration standards and incur higher costs to achieve effective data gathering, cleaning, and integration.

The objective of this research is the real-time consolidation of data from several social media networks into a distributed data source. Despite the wide spectrum of data types and domains, the ultimate goal is to consistently offer and enable user access to data while taking into account commercial and application requirements. To accomplish this, we propose a generic framework for big data integration in a cloud environment that provides swift, reliable, and safe access to social network data. The data from social media is collected using multiple APIs (Application Programming Interfaces). The proposed framework offers an interface for end-users to access the filtered data in a readable format. The main contributions of our study are listed below:

  • Design and develop an integration framework to collect big data from distributed sources with different formats.

  • Provide a unified format of the output data to be processed in cloud computing platforms.

  • Propose a comprehensive framework and provide an interface for filtering data gathered from social media sites with the aid of big data integration.

  • A replication package of our developed framework for extension purposes (https://github.com/AlShomar/AlShomar-Big-Data-Integration-Framework).

The rest of the paper is organized as follows: Sect. 2 discusses the related work. Section 3 presents the proposed framework and discusses all the layers and algorithms used. Section 4 illustrates the performance evaluation of the designed algorithms. Finally, Sect. 5 concludes the paper and highlights directions for future work.

2 Related work

This research aims to address the lack of a standardized data integration framework for combining a centrally managed database with different standalone social media data sources. Integrating data from heterogeneous sources into one database has been a challenge for many researchers (Fillinger et al. 2019). Big data integration frameworks face four main issues: big data integration, transformation, storage, and retrieval. Table 1 summarizes the relevant studies discussed in this section.

2.1 Big data integration

Several big data integration tools, such as ROHDIP (Shehab et al. 2016), BINARY (Eftekhari et al. 2016), MOMIS (Bettio et al. 2021), and GNN-DDI (Al_Rabeah and Lakizadeh 2022), have been developed in recent years. Graph-based (Kancharala 2021) and hybrid hierarchy architecture-based (Nie et al. 2021) data integration techniques have also been developed. Similarly, Probabilistic Semantic Association (PSA) was introduced to integrate big data by generating feature patterns for the data sources (VandanaKolisetty and Rajput 2021). A cluster-based data integration model was proposed by Jung and Chung (2021). Akinyemi et al. (2020) presented a framework that processes and integrates plant data, which can help mitigate decommissioning costs and reuse decommissioned items. On the other hand, Fletcher et al. (2019) employed weighted joint likelihoods in their data integration model as a means to weight data sources according to various criteria (e.g., sample size).

The notion of big data integration in the cloud blends data manipulation technologies and cloud computing in a new generation of data analytics platforms (Kune et al. 2016; Manekar and Pradeepini 2017). Users today require new big data integration cloud services, such as data collection from many sources via cloud-deployed APIs.

2.2 Big data transformation

Integrating high-quality data into the cloud is not sufficient; data transformation is also required to filter, combine, and modify or reformat data types (Dey and Pandit 2020). Li et al. (2021) used the bilinear data transformation method to map angular wind data to time series. Kim et al. (2021) proposed a data transformation architecture based on machine learning techniques. Similarly, a framework based on the R programming language to transform SQL data to NoSQL format was proposed by Hasan et al. (2021). Vendor lock-in is one of the major challenges in big data integration that necessitates big data transformation. Ahmed et al. (2021) deployed the k-nearest neighbor (KNN) imputation method and Kaplan-Meier weights to deal with the transformation of sensor data. Arslan et al. (2019) developed web-based software to distribute datasets by applying mathematical data transformation based on the Pearson test statistic.

2.3 Big data storage and retrieval

Big data storage necessitates better storage functionalities. Saenko and Kotenko (2022) focused on providing efficient and resilient big data storage based on the Hadoop Distributed File System (HDFS). Honar Pajooh et al. (2021) considered the Hyperledger Fabric (HLF) platform with decentralized storage for IoT data; the authors enhanced data integrity by storing metadata in off-chain big data systems. Shi et al. (2020) devised a Hadoop-based system using cloud storage to provide an efficient decision system. Data security is a significant issue that researchers are trying to solve regarding cloud storage. Viswanath and Krishna (2021) designed an encryption technique primarily responsible for securing big data stored in a multi-cloud environment.

When retrieving large data, it is difficult to fully meet the expectations of end-users because traditional retrieval methods are typically time-consuming and take little account of the multi-source, diverse attributes of big data. Arer et al. (2022) used IPFS (InterPlanetary File System) for big data storage and Elasticsearch for efficient data retrieval. Another study (Ye et al. 2022) used a parallel top-N algorithm to summarize and swiftly retrieve the matching semantic features of video data in big data.

Table 1 Summary of related work

3 Proposed approach

We propose a big data integration framework on the cloud as shown in Fig. 1. The proposed framework consists of the data source layer, application layer, resource layer, and visualization layer.

Fig. 1
figure 1

Architecture of the proposed framework

3.1 Data source layer

The data source layer provides big data from distributed data sources and connects directly to the application layer. This layer contains three social networks, namely, Twitter, YouTube, and Facebook. The data provided depends on the intent of the application layer.

3.1.1 Twitter social media

Twitter has been widely used among Internet users since 2006 (Al-Qurishi et al. 2018), resulting in a body of related literature that seeks to better understand microblogging usage and communities. Twitter users can classify their posts into four categories: daily chatter, conversations, sharing information or links, and reporting news. The roles of Twitter users can be classified into three classes:

  • Broadcasters, those who have a huge number of followers;

  • Acquaintances, those who have approximately the same number of followers and following; and

  • Miscreants and evangelists, those who follow a huge number of users but have only a few followers.

The use of Twitter goes beyond personal use. Businesses consider it a channel to increase awareness about their products, create business opportunities, maintain customer loyalty, host marketing campaigns, improve their reputation, predict trends, and recruit new talent (Al-Qurishi et al. 2018).

3.1.2 YouTube social media

YouTube has been a video-sharing website since 2005; it started as a media tool and became a marketing communication tool. It is a rich platform with multiple mechanisms, such as trending, subscribers, and lists of related videos, which can affect how a video is published and thereby impact its popularity. YouTube users across the globe can upload video content for free and generate billions of views every day. These millions of users can significantly affect the reputation of an organization or a person. The use of YouTube goes beyond personal use; it can carry ongoing information about new services or products. Moreover, many factors have facilitated the growth of its use, including the ease of uploading videos, access to commercial content, education, and broadcasting networks.

3.1.3 Facebook social media

Facebook is a social networking service launched in 2004 that allows users to create profiles, send messages, and keep in touch with friends. The Facebook website represents a huge potential market for social media efforts. Facebook users can be categorized by their use: sharing status, social connection, sharing identities, and browsing the social network. The use of Facebook fulfills two needs: belonging and self-presentation. The belonging need allows users to learn about others and communicate with them, which is a significant motivator. The self-presentation need includes creating user profiles and posting images and wall content. Facebook can be used for social searching, finding out information about offline users, and for social browsing, which is used to develop new connections for offline interaction.

3.2 Application layer

This layer provides a link between the resource layer and the data source layer as shown in Fig. 2. It comprises a set of RESTful APIs and a data retrieval algorithm. The function of RESTful APIs is to collect big data from social media data sources and transfer the data to the data retrieval algorithm.

Fig. 2
figure 2

A block diagram of the proposed framework

3.2.1 RESTful APIs

REST (Representational State Transfer) is an architectural style used to deploy large-scale distributed systems based on the client-server model and to exchange data between applications or systems. In REST, everything is a resource, and these resources can be accessed through an application programming interface using the HTTP protocol. REST APIs are vital for pulling data from distributed data sources based on end-user requests. In this paper, we use REST APIs to connect the data source layer and the application layer. These APIs are provided by the social media channels and are used to collect and retrieve data from them; they can be used to read, update, create, and delete data, as shown in Fig. 3.

Fig. 3
figure 3

API connectivity among the four layers of the framework
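
For illustration, the short Java sketch below shows how such REST calls map the read, create, update, and delete operations onto HTTP methods over a resource URL. The endpoint is a hypothetical placeholder, and real social media APIs additionally require authentication headers or tokens.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch of how REST maps CRUD operations onto HTTP methods.
// The resource URL passed in is a hypothetical placeholder, not an actual social media endpoint.
public class RestVerbs {

    static int call(String resource, String method) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(resource).openConnection();
        conn.setRequestMethod(method);               // GET = read, POST = create, PUT = update, DELETE = delete
        conn.setRequestProperty("Accept", "application/json");
        return conn.getResponseCode();               // HTTP status code returned by the service
    }
}
```

For example, call("https://api.example.com/v1/posts/123", "GET") would read a post resource, while the same URL with "DELETE" would remove it, provided the caller is authorized.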

3.2.2 Data retrieval algorithm

The data retrieval algorithm takes the keyword-driven topic as input by passing two parameters: the keyword and the access token. More details of this algorithm are presented below in Algorithm 1.

figure a

Definition 1

(Data retrieval) Topic denotes a keyword-driven topic. \(K_{list}\) denotes a keyword ID list. API denotes an application program interface. \(A_{cc}\) denotes the access token of the social network. Q stores the URL query. q denotes an element in Q. F denotes the file in JSON format. Algorithm 1 acquires the data object from a given API using the function GetDataService (API), which is an interface to collect big data from a given social media service. Hence, it initializes two parameters: the keyword-driven \(K_{list}\) and the access token \(A_{cc}\). Then, it formats the query URL using the function FORMAT (queryURL) before adding the query to Q. Steps 6–10 check whether the URL is valid. If the URL is not valid, the algorithm returns an empty file. Otherwise, it builds a JSON file with the retrieved big data, denoted F. Finally, the algorithm returns the JSON file F, which includes the collected big data.
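
For illustration, the following Java sketch mirrors the retrieval flow described in Definition 1 under simplifying assumptions: the query format, the validity check, and the JSON file handling are placeholders rather than the exact implementation of Algorithm 1.

```java
import java.io.FileWriter;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Illustrative sketch of Algorithm 1: build a query URL per keyword, validate it,
// and write the retrieved data into a JSON file F. Names are simplified placeholders.
public class DataRetrieval {

    public static String retrieve(String api, List<String> keywordList, String accessToken) throws Exception {
        List<String> queries = new ArrayList<String>();                           // Q: the URL queries
        for (String keyword : keywordList) {
            queries.add(api + "?q=" + keyword + "&access_token=" + accessToken);  // FORMAT(queryURL)
        }

        StringBuilder data = new StringBuilder("[");
        for (String q : queries) {
            HttpURLConnection conn = (HttpURLConnection) new URL(q).openConnection();
            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return "";                                                        // invalid query: empty result
            }
            try (InputStream in = conn.getInputStream();
                 Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                if (data.length() > 1) {
                    data.append(",");
                }
                data.append(scanner.hasNext() ? scanner.next() : "{}");
            }
        }
        data.append("]");

        String fileName = "F.json";                                               // the JSON file F
        try (FileWriter writer = new FileWriter(fileName)) {
            writer.write(data.toString());
        }
        return fileName;
    }
}
```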

3.3 Resource layer

This layer comprises the read and clean data algorithm, data integration algorithm, distributed storage, ETL adapter, and cloud computing environment, which support the implementation of remote access, servers, virtual machines, and hardware and software resources. The resource layer satisfies the requirement of distributed big data storage and management of all host servers.

3.3.1 Read and clean data algorithm

The read and clean data algorithm takes a file in JSON format as input and outputs lists of social media data attributes. For more details, please see Algorithm 2 below.

figure b

Definition 2

(Read and Clean Data) F denotes a JSON file. \(L_{att}\) denotes the list of attributes that come from social media sites. a is an element in the JSON file F. \(A_{ut}\) represents the attributes of YouTube, \(A_{tw}\) denotes the attributes of Twitter, and \(A_{fb}\) contains the attributes of Facebook. Att denotes the attributes. pa denotes the attributes after extraction and parsing. Algorithm 2 reads the JSON file F. Then, it cleans the data by extracting the attributes, denoted Extract Att(a), from the JSON file F, which comes from different social media sources. The next step parses the big data attributes using Parse Att(a) according to the social media network. After the attributes are extracted and parsed, the algorithm stores the attributes \(A_{ut}\), \(A_{tw}\), and \(A_{fb}\) in \(L_{att}\) using the function \(L_{att} \leftarrow\) Add(pa). Finally, the algorithm returns the lists of attributes, including \(A_{ut}\), \(A_{tw}\), and \(A_{fb}\), in \(L_{att}\).
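
A minimal sketch of this read-and-clean step is shown below, assuming the org.json library is available for parsing; the attribute names used are illustrative examples rather than the framework's exact schema.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.json.JSONArray;
import org.json.JSONObject;

// Illustrative sketch of Algorithm 2 (read and clean data) using org.json as an
// assumed dependency. Attribute names are examples, not the framework's exact schema.
public class ReadAndCleanData {

    public static List<String> readAndClean(String jsonFile) throws Exception {
        List<String> attributeList = new ArrayList<String>();                  // L_att
        String content = new String(Files.readAllBytes(Paths.get(jsonFile)), "UTF-8");
        JSONArray items = new JSONArray(content);                              // the JSON file F

        for (int i = 0; i < items.length(); i++) {
            JSONObject element = items.getJSONObject(i);                       // an element a in F
            // Extract and parse the attributes of interest; missing fields default to "".
            String parsed = element.optString("source") + "|"                  // Twitter, YouTube or Facebook
                          + element.optString("id") + "|"
                          + element.optString("text") + "|"
                          + element.optString("publishedTime");
            attributeList.add(parsed);                                         // Add(pa)
        }
        return attributeList;                                                  // list of attributes
    }
}
```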

3.3.2 Data integration algorithm

The big data integration algorithm takes as input the lists of data attributes extracted from Twitter, YouTube, and Facebook. Each element in the social media data is selected based on these attributes and is stored in the list of integrated data sources. The output is thus the integration of multiple data types and various formats from the integrated data sources. More details of this algorithm are presented in Algorithm 3.

figure c

Definition 3

(Big Data Integration) \(L_{att}\) denotes the list of attributes for YouTube, Twitter, and Facebook. \(L_{AI}\) denotes the list of integrated data sources. \(A_{ut}\) represents the YouTube attributes, \(A_{tw}\) contains the Twitter attributes, and \(A_{fb}\) denotes the attributes of Facebook. \(E_{ut}\) denotes an element in YouTube, \(E_{tw}\) denotes an element in Twitter, and \(E_{fb}\) denotes an element in Facebook. \(S_{ut}\) denotes the selected element. Algorithm 3 is divided into three parts based on the social networks. In the first part (steps 1–4), a YouTube element \(E_{ut}\) is selected based on the attributes using Select(\(E_{ut}\)) and is stored in the selected element \(S_{ut}\); \(L_{AI}\) then receives the selected \(S_{ut}\) through the function \(L_{AI} \leftarrow\) Add(\(S_{ut}\)). In the second part (steps 5–8), a Twitter element \(E_{tw}\) is selected based on the attributes using \(S_{ut} \leftarrow\) Select(\(E_{tw}\)) and stored in the selected element \(S_{ut}\); \(L_{AI}\) then receives the selected \(S_{ut}\) through Add(\(S_{ut}\)). Finally, in the third part (steps 9–12), a Facebook element \(E_{fb}\) is selected based on the attributes using Select(\(E_{fb}\)) and stored in the selected element \(S_{ut}\); \(L_{AI}\) then receives the selected \(S_{ut}\) through \(L_{AI} \leftarrow\) Add(\(S_{ut}\)). Hence, \(L_{AI}\) returns the list of integrated datasets.
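
The sketch below illustrates this integration step: each source-specific element is reduced to the attributes shared by the three networks and appended to a single integrated list. The map-based record layout is an assumption made for illustration; the attribute keys follow the names later listed in Sect. 4.2.3.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of Algorithm 3 (big data integration); the map layout is an assumption.
public class DataIntegration {

    // Select(E): keep only the common attributes of one element and tag its source.
    static Map<String, String> select(Map<String, String> element, String source) {
        Map<String, String> selected = new HashMap<String, String>();
        selected.put("Source", source);
        for (String key : new String[]{"Channel", "CommentId", "OwnerId",
                                       "ParentId", "PublishedTime", "Type"}) {
            String value = element.get(key);
            selected.put(key, value == null ? "" : value);
        }
        return selected;
    }

    public static List<Map<String, String>> integrate(List<Map<String, String>> youtube,
                                                      List<Map<String, String>> twitter,
                                                      List<Map<String, String>> facebook) {
        List<Map<String, String>> integrated = new ArrayList<Map<String, String>>();    // L_AI
        for (Map<String, String> e : youtube)  integrated.add(select(e, "YouTube"));    // steps 1-4
        for (Map<String, String> e : twitter)  integrated.add(select(e, "Twitter"));    // steps 5-8
        for (Map<String, String> e : facebook) integrated.add(select(e, "Facebook"));   // steps 9-12
        return integrated;                                                              // integrated list
    }
}
```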

3.3.3 ETL adapter

ETL stands for Extract-Transform-Load and refers to the process of pulling data out of one source and loading it into another. The ETL adapter supports the transfer and reduction of the integrated big data into the distributed storage. The extract step retrieves the data from the source system in a way that does not affect its performance or response time. The transform step converts the data from the source to the target format using common dimensions so that it can be joined later. The load step loads the data correctly with minimal resources; it allows us to disable any constraints before loading and re-enable them after the load is completed.
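
A compact sketch of the three stages is given below; the constraint hooks and the bulk-load call are placeholders standing in for the adapter's interaction with the distributed storage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Compact sketch of the ETL adapter's three stages; storage calls are placeholders.
public class EtlAdapter {

    // Extract: pull the integrated records without altering the source system.
    static List<Map<String, String>> extract(List<Map<String, String>> integrated) {
        return new ArrayList<Map<String, String>>(integrated);
    }

    // Transform: normalize fields into a common dimension so records can be joined later.
    static List<Map<String, String>> transform(List<Map<String, String>> records) {
        for (Map<String, String> record : records) {
            String type = record.get("Type");
            record.put("Type", type == null ? "" : type.toLowerCase(Locale.ROOT));
        }
        return records;
    }

    // Load: write the transformed records to the target store; constraints are disabled
    // before the bulk load and re-enabled after it completes.
    static void load(List<Map<String, String>> records) {
        disableConstraints();
        for (Map<String, String> record : records) {
            // placeholder for the actual bulk-load call into the distributed storage
        }
        enableConstraints();
    }

    static void disableConstraints() { /* placeholder */ }
    static void enableConstraints()  { /* placeholder */ }
}
```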

3.4 Visualization layer

After the integrated data is stored in a reliable data source, we need to search, view, and interact with these datasets using an analytics engine and RESTful search. The data can be accessed through RESTful APIs and is stored using the JavaScript Object Notation (JSON) schema. Moreover, we provide capabilities to visualize the data in various maps, tables, and charts on top of large volumes of data. This visualization is driven by real-time search engine queries. In addition, users can create and share dynamic dashboards that reflect any changes in search engine queries.
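
As an illustration of RESTful search over the stored JSON documents, the sketch below posts a standard Elasticsearch match query to a _search endpoint; the host, index name, and field name are assumptions made for illustration.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

// Sketch of querying the integrated data through Elasticsearch's RESTful search API.
// The host, index name ("integrated-data"), and field name ("text") are assumptions.
public class VisualizationSearch {

    public static String search(String keyword) throws Exception {
        String body = "{ \"query\": { \"match\": { \"text\": \"" + keyword + "\" } } }";

        URL url = new URL("http://localhost:9200/integrated-data/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }

        // Return the raw JSON hits; a dashboard then renders them as tables or charts.
        try (Scanner scanner = new Scanner(conn.getInputStream(), "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }
}
```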

4 Performance evaluation

We designed three algorithms to integrate big data from different sources with different formats. The experiments aim to evaluate the performance of these algorithms. We believe that the performance of the designed algorithms needs to be optimized and tested in a single server instance before we consider scaling up and scaling out. These experiments compare the performance of the algorithms during the execution from different data sources such as Facebook, YouTube, and Twitter.

4.1 Experimental setup

The designed algorithms were implemented in Java 7, Apache Maven 3.5.0, and Redis 4.0.1. We used Elasticsearch 5.6.0 as storage and Apache Kafka as the stream processing platform. All of these tools were deployed on the same virtual cloud machine. This machine runs on the following operating system: Ubuntu 64-bit 17.4 with Intel Core i5 2.40 GHz CPU, 10 GB of RAM, and 250 GB of hard disk storage.

4.2 Experimental dataset

Our experiments employ three datasets from popular social media sites (Twitter, YouTube, and Facebook). These datasets were collected randomly by the data retrieval algorithm through three queries issued against the social media APIs. The first query used twitter4j, a 100% pure Java library for the Twitter API, to retrieve the maximum number of tweets and replies from Twitter; we collected 122,432 data size/ms. The second query used an access token for the YouTube channel, sending Web API requests with the access token included either in the HTTP Authorization header or as a POST parameter; we collected 91,146 data size/ms from YouTube, consisting of comments and replies. The third query used Redis on Facebook, since its keys can contain hashes and sorted sets, and retrieved around 56,260 data size/ms of comments and replies. Finally, we integrated these datasets using the data integration algorithm. Table 2 summarizes the datasets used to evaluate the three algorithms on data from distributed sources.

Table 2 Datasets used to evaluate the proposed algorithms
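
For illustration, the sketch below shows how a keyword search can be issued with the twitter4j library mentioned above; the credentials and the keyword are placeholders, and the exact query used in our experiments may differ.

```java
import java.util.List;

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

// Illustrative twitter4j keyword search; credentials and keyword are placeholders.
public class TwitterQuery {

    public static void main(String[] args) throws Exception {
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("CONSUMER_KEY")
                .setOAuthConsumerSecret("CONSUMER_SECRET")
                .setOAuthAccessToken("ACCESS_TOKEN")
                .setOAuthAccessTokenSecret("ACCESS_TOKEN_SECRET");
        Twitter twitter = new TwitterFactory(cb.build()).getInstance();

        Query query = new Query("example keyword");      // placeholder keyword
        query.setCount(100);                             // tweets per page
        QueryResult result = twitter.search(query);

        List<Status> tweets = result.getTweets();
        for (Status tweet : tweets) {
            System.out.println(tweet.getUser().getScreenName() + ": " + tweet.getText());
        }
    }
}
```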

4.2.1 Data retrieval algorithm (Exp. 1)

This experiment benchmarks the execution time of the data retrieval algorithm for Twitter, YouTube, and Facebook. We employ the algorithm to collect the datasets from the social media APIs through three subsidiary queries. The first query, Q1, retrieves the maximum number of tweets and replies from Twitter and checks if the file is empty; otherwise, a JSON file F is built. The second query, Q2, retrieves the maximum number of video comments and comment replies from YouTube and checks whether the file is empty; if not, a JSON file F is built. The third query, Q3, retrieves the maximum number of Facebook post comments and comment replies and checks if the file is empty; if not, a JSON file F is built. The same function executes Q1, Q2, and Q3 and builds the JSON file F. Note that execution is slower for Twitter, taking 10 to 20 seconds, because of the get_constent() function, unlike YouTube and Facebook.

4.2.2 Read and clean data algorithm (Exp. 2)

This experiment aims to benchmark the execution time of the read and clean data algorithm to read the JSON file F and clean the data. The algorithm has three functions: extracting, parsing, and adding the attributes of Twitter, YouTube, and Facebook. The Twitter attributes \(A_{tw}\) will be extracted and parsed from JSON file F for both tweets and replies. The YouTube attributes \(A_{ut}\) will be extracted and parsed from JSON file F for video comments and reply comments. For the Facebook attributes, \(A_{fb}\) will be extracted and parsed for post comments and replies to post comments. All of \(A_{ut}\), \(A_{tw}\), and \(A_{fb}\) are added to the lists of attributes \(L_{att}\). The same function executes \(A_{ut}\), \(A_{tw}\), and \(A_{fb}\) and produces a list of social media attributes.

4.2.3 Data integration algorithm (Exp. 3)

The aim of this experiment is to benchmark the execution time of the data integration algorithm. This algorithm has two major functions: selecting social media attributes and adding these attributes to the list of integrated big data \(L_{AI}\). Each element in social media such as Twitter \(E_{tw}\), YouTube \(E_{ut}\), and Facebook \(E_{fb}\) is being selected based on specified attributes such as Channel(), CommentId(), OwnerId(), ParentId(), PublishedTime(), and Type() for the main post as well as the replies. The selected elements will be added to the lists of integrated big data \(L_{AI}\). Hence, the algorithm returns the integrated big data lists that come from different sources. The same function executes Twitter \(E_{tw}\), YouTube \(E_{ut}\), and Facebook \(E_{fb}\).

4.3 Experimental results

In general, these results suggest that the algorithms can effectively retrieve, read, clean, and integrate data from distributed data sources. In this section, we present the results of the three experiments on data retrieval, read and clean data, and data integration for social media sites. The execution rates of both the data retrieval and the read and clean data algorithms were acceptable, and the data integration algorithm performed at a good rate.

4.3.1 Data retrieval algorithm

Exp. 1 shows that the execution rate of the data retrieval algorithm of Q1 performed on Twitter is 286.5 (data size/ms), Q2 performed on YouTube is 27.9 (data size/ms), and Q3 performed on Facebook is 30.7 (data size/ms). Hence, the execution rate of the data retrieval algorithm for YouTube is lower than that of Twitter and Facebook. The results of this experiment are shown in Fig. 4.

Fig. 4
figure 4

Result of the data retrieval algorithm

4.3.2 Read and clean data algorithm

Exp. 2 shows that the execution rate of the read and clean data algorithm of tweets is 504.1 (data size/ms) and replies of tweets are 6.48 (data size/ms). For the comments on the YouTube video, it is 38.5 (data size/ms) while it is 3.89 (data size/ms) for the replies to video comments. Finally, for Facebook post comments, the execution rate is 13.9 (data size/ms) and 4.6 (data size/ms) for Facebook comment replies. Hence, the execution rate of the read and clean data algorithm for Facebook is lower than that of Twitter and YouTube. The results of this experiment are shown in Fig. 5.

Fig. 5
figure 5

Result of the data read and clean algorithm

4.3.3 Data integration algorithm

Exp. 3 shows that the execution rate of the data integration algorithm of tweets integration is 461.9 (data size/ms) and 9.0 (data size/ms) for the replies. On the other hand, the rate is 37.8 (data size/ms) for YouTube video comments integration and 18.9 (data size/ms) for comment replies. Finally, the execution rate is 11.7 (data size/ms) for Facebook post comments and 6.3 (data size/ms) for replies to post comments. Hence, the execution rate of the data integration algorithm for Facebook is lower than that of Twitter and YouTube. The results of this experiment are shown in Fig. 6.

Fig. 6
figure 6

Result of the data integration algorithm

4.4 Comparison with manual data integration

The proposed solution is compared to manual data integration based on the values of each tuple, which are essential for extracting the correct information. Table 3 summarizes the comparison of manual data integration with the proposed solution. The first issue is the integration type that must contain all the required data. Manual integration is arduous because each line in the spreadsheet has to be populated with data by hand. The proposed solution simplifies this process by gathering and integrating the data into distributed data sources using mapping techniques, eliminating errors that may otherwise occur. Another advantage of the proposed solution is that it handles data at a scale far beyond what manual integration can manage, which is useful when huge amounts of data are processed. Furthermore, the execution rate of the proposed solution is good, whereas the execution rate of manual integration is questionable because some columns have to be ordered manually and may be forgotten or hidden if the user does not pay close attention. Manual data integration also requires manually resolving the meaning of terms to make sure they refer to the same thing, whereas the proposed solution uses exact matching to account for the meaning of and relationships between terms. Another issue with manual integration is that, if the user does not pay attention to the rows of data, empty cells may exist and data may be overlooked (Table 3).

Table 3 Comparison of manual data integration and proposed solution

4.5 Evaluating the big data integration framework

After completing the design of the framework, we invited participants to fill out a questionnaire to determine their cognitive behavior and evaluate the big data integration framework. To conduct the evaluation properly, we utilized the ARCS model introduced by Keller (1983). ARCS stands for (A) attention, (R) relevance, (C) confidence, and (S) satisfaction. The model is generally used to evaluate how users interact with a system by measuring user performance in an interactional framework. Responses are scored on a 9-point Likert scale, where 9 is the highest score and 1 the lowest (Paas et al. 1994) (refer to Table 4 for the questionnaire based on the ARCS model). In our framework context, the attention items capture the responses of framework users about performing various queries and functions. The relevance items help users determine whether the results obtained from the social media data resemble real-life situations. The confidence items capture whether framework users faced any complexity or difficulty with the framework. Finally, the satisfaction items measure the users’ experience in integrating big data and how well the results meet their expectations. To evaluate the quality of these questions in terms of attention, relevance, confidence, and satisfaction, we applied several statistical procedures to each item: Cronbach’s alpha, means, variance, and Pearson correlation coefficients.

Table 4 Questionnaire based on ARCS model

4.5.1 Cronbach’s coefficient alpha

Cronbach’s coefficient is used to measure scale reliability in terms of the internal consistency between observed and true scores (Petri et al. 2017). To measure internal consistency, the scale must first be shown to be unidimensional. Cronbach’s coefficient is computed with the following formula:

$$\begin{aligned} \alpha = \frac{N\cdot \bar{c}}{\bar{\nu }+(N-1)\cdot \bar{c}} \end{aligned}$$
(1)

Here, N denotes the number of items, \(\bar{c}\) represents the average covariance among the items, and \(\bar{\nu }\) is the average variance. Based on this formula, the raw value of Cronbach’s coefficient must reach at least the “acceptable” level given in Table 5 for the scale to be considered internally consistent.

Table 5 Cronbach’s alpha measurement

A single question (item) was presented to each participant to check whether the 20 questions covering attention, relevance, confidence, and satisfaction achieved internal consistency. We used the SAS statistical software suite (NoAuthor 2020) to compute internal consistency using Cronbach’s coefficient \(\alpha\) for all questions (items). After importing the values into SAS, a reliability coefficient test was calculated for each item to measure internal consistency. The result returns two coefficients, raw and standardized: the raw coefficient depends on the item correlations, so the more consistent the test, the more strongly the items are interrelated; the standardized coefficient reflects the item covariance, which measures how the distributions of two variables vary together. The higher the correlation coefficient, the higher the covariance. Applied to the dataset of 20 variables imported from the ARCS questionnaire in Table 4, the alpha coefficient for the 20 items is 0.839, suggesting that the items have relatively high internal consistency. Note that in most social science research situations, an internal consistency coefficient of 0.70 or higher is considered acceptable (see Tables 5, 6 and 7 below).

Table 6 Results of Cronbach’s coefficient \(\alpha\): Alpha values
Table 7 Results of Cronbach’s coefficient \(\alpha\): simple statistics
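
As a worked illustration of Eq. (1), the following Java sketch computes the raw Cronbach’s coefficient from a small matrix of item scores; the sample responses are hypothetical and are not taken from our questionnaire data.

```java
// Worked illustration of Eq. (1): raw Cronbach's alpha from a small hypothetical
// matrix of item scores (rows = participants, columns = items).
public class CronbachAlpha {

    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    // Sample covariance between two items across participants (variance when x == y).
    static double covariance(double[] x, double[] y) {
        double mx = mean(x), my = mean(y), sum = 0;
        for (int i = 0; i < x.length; i++) sum += (x[i] - mx) * (y[i] - my);
        return sum / (x.length - 1);
    }

    // alpha = N * cBar / (vBar + (N - 1) * cBar), with N items, cBar the average
    // inter-item covariance, and vBar the average item variance.
    static double alpha(double[][] scores) {
        int n = scores[0].length;
        double covSum = 0, varSum = 0;
        int covCount = 0;
        for (int i = 0; i < n; i++) {
            double[] itemI = column(scores, i);
            varSum += covariance(itemI, itemI);
            for (int j = i + 1; j < n; j++) {
                covSum += covariance(itemI, column(scores, j));
                covCount++;
            }
        }
        double cBar = covSum / covCount;
        double vBar = varSum / n;
        return (n * cBar) / (vBar + (n - 1) * cBar);
    }

    static double[] column(double[][] m, int c) {
        double[] col = new double[m.length];
        for (int r = 0; r < m.length; r++) col[r] = m[r][c];
        return col;
    }

    public static void main(String[] args) {
        double[][] scores = {                            // hypothetical 9-point Likert responses
                {7, 8, 6}, {9, 9, 8}, {6, 7, 5}, {8, 8, 7}
        };
        System.out.printf("Cronbach's alpha = %.3f%n", alpha(scores));
    }
}
```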

4.5.2 Variance

After the above demonstration, we calculated the variance of all items to measure how far the responses are spread out in the dataset, using the formula below:

$$\begin{aligned} \sigma ^2 = \frac{\sum {(X-\mu )^2}}{N} \end{aligned}$$
(2)

The variance \(\sigma ^2\) is the sum of the squared distances of each item score X from the mean \(\mu\), divided by the number of items N. After applying this formula to the ARCS model through the SAS application, we obtain the outputs given in Table 8.

Table 8 Results of the variance
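
As a small worked example of Eq. (2), the variance of the hypothetical scores 7, 8, and 9 (with mean \(\mu = 8\)) is:

$$\begin{aligned} \sigma ^2 = \frac{(7-8)^2 + (8-8)^2 + (9-8)^2}{3} = \frac{2}{3} \approx 0.67 \end{aligned}$$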

4.5.3 Pearson correlation coefficient

The correlation coefficient is used to measure the relationship among the variables, and its value is always between \(+1\) and \(-1\) (Puth et al. 2014). One of the most commonly used formulas for correlation is the Pearson correlation coefficient, which measures the linear relationship between datasets and shows how they are related to each other (refer to the formula below):

$$\begin{aligned} r = \frac{n(\sum {xy})-(\sum {x})(\sum {y})}{\sqrt{[n\sum {x^2}-(\sum {x})^2][n\sum {y^2}-(\sum {y})^2]}} \end{aligned}$$
(3)

where n, \(\sum {xy}\), \(\sum {x}\), \(\sum {y}\), \(\sum {x^2}\), and \(\sum {y^2}\) represent the number of paired scores, the sum of the products of paired x and y scores, the sum of x scores, the sum of y scores, the sum of squared x scores, and the sum of squared y scores, respectively. Here, we apply the Pearson correlation coefficient to all ARCS items and examine how the standards are interrelated across the 20 variables, where \(-1\) indicates a perfectly negative linear relationship, 0 indicates no correlation, and \(+1\) indicates a perfectly positive linear relationship. See Table 9.

Table 9 Pearson correlation coefficient results
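
The sketch below illustrates Eq. (3) by computing Pearson’s r for two hypothetical paired score vectors; it is not the SAS procedure used in our evaluation.

```java
// Illustration of Eq. (3): Pearson's r for two paired score vectors.
// The sample vectors are hypothetical, not taken from the questionnaire data.
public class PearsonCorrelation {

    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double numerator = n * sumXY - sumX * sumY;
        double denominator = Math.sqrt((n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY));
        return numerator / denominator;
    }

    public static void main(String[] args) {
        double[] attention    = {7, 8, 6, 9, 8};    // hypothetical item scores
        double[] satisfaction = {6, 8, 5, 9, 7};
        System.out.printf("r = %.3f%n", pearson(attention, satisfaction));
    }
}
```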

We can obtain the scatter plot matrix among all items, i.e., attention, relevance, confidence, and satisfaction, as shown in Fig. 7. The scatter plots illustrate two distinct properties of a correlation: its direction and its strength. Regarding direction, a negative correlation corresponds to a decreasing relationship, while a positive correlation corresponds to an increasing relationship. The strength of a correlation can be assessed as follows: values of 0.1 < |r| < 0.3 indicate a weak correlation, values of 0.3 < |r| < 0.5 indicate a moderate correlation, and values of 0.5 < |r| indicate a strong correlation. As shown in the scatter plots, attention and satisfaction have positive and strong correlations, confidence has a positive and moderate correlation, and relevance has a positive and weak correlation.

Fig. 7
figure 7

The interrelation of the ARCS model

5 Conclusion and future work

With the massive growth of data, along with the availability of data manipulation and credible distributed data storage, many organizations no longer rely on conventional data processing to manipulate their datasets. They are moving toward big data models to combine datasets from multiple data sources that use various formats. Hence, access to distributed big data sources, cloud computing technologies, and scalable big data storage is essential for integrating big data into the cloud.

Current big data paradigms have several limitations: they do not perform big data integration, they lack reliable big data storage, and they do not provide a comprehensive view for querying and integrating multiple data types and various formats of integrated data sources through cloud computing; instead, they rely only on the data types themselves and RDBMSs. In this research, we presented a framework for social media big data integration on the cloud, which has the following properties:

  • The framework has three algorithms to retrieve data from social media channels and facilitate reading big data through the data source layer and application layer.

  • Big data are integrated from multiple data types and formats of distributed data sources on the resource layer using the data integration algorithm.

  • The framework provides a web interface to formulate responses based on the user service demand.

The framework finds correlations between the data stored in distributed data sources, supports various formats of data types, provides access to huge datasets with the help of cloud computing technologies, and enables queries and ubiquitous data access. The framework has four layers:

  1. The data source layer, which is responsible for providing the dataset from distributed data sources.

  2. The application layer, with a data retrieval algorithm used to retrieve big data from distributed data sources and to build files in JSON format.

  3. The resource layer, which contains data manipulation, data storage, and cloud virtualization infrastructure. The data manipulation component has two algorithms: read and clean data, and data integration. The read and clean data algorithm is responsible for reading the JSON files and outputting the lists of social media attributes. The data integration algorithm is responsible for integrating the data by taking the lists of social media attributes and outputting the lists of integrated data sources.

  4. Finally, the visualization layer, which includes data summarization and a dashboard.

We proposed algorithms to integrate big data from distributed data sources. The experiment results showed that the execution rate of the data retrieval algorithm for YouTube was lower than that of Facebook and Twitter. For the read and clean data algorithm, the execution rates for YouTube and Twitter were higher than that of Facebook. Finally, the data integration results showed that the execution rate for Facebook was lower than that for Twitter and YouTube.

A proof-of-concept implementation of the big data integration framework used Kibana, and it had a web interface for suitable remote access. Hence, the user can perform many queries from multiple data sources as well as visualize the data integration in an appropriate format. The framework was deployed on the VMware cloud platform running on an Ubuntu operating system. We used Elasticsearch as the big data distributed and reliable storage. We developed a transformation adapter to support the transformation of the bulk data on the back-end resources. Apache Kafka was used as a back-end resource to integrate the big data in our prototype framework. The functionality of the big data integration framework was evaluated with the help of social networking data and three algorithms based on the execution time and data sizes.

In future work, researchers can use the contributions of this study as a basis for further big data integration research. Moreover, the framework for social media big data integration on the cloud can be extended to operate with commercial distributed data sources. More than three social networking sites can be integrated to utilize unstructured data and perform analyses. The framework can be expanded using Apache NiFi, specifically for Twitter, to automate the flow of big data between systems. The proposed framework can also be extended with a real-time data processing platform such as Apache Storm or Spark.

Furthermore, another option is to use Cassandra or MongoDB as distributed storage. Apache Flink is an open-source stream processing framework for distributed data streaming applications (Shu et al. 2013); Flink can be used to establish connectivity to file systems and data storage.