
1 Introduction

By offering code hosting services and team-based software development workflows, GitHub has gained the attention of software engineers worldwide. It is a prominent development platform, a social coding site, and a web-based Git repository hosting service that enables developers to engage in open-source project documentation, design, coding, and testing in a socio-cultural context. An important feature of the platform is that users can follow other users or organizations and receive alerts about their public activity. This feature has strengthened collaboration by encouraging users to explore more repositories and by promoting conversation on GitHub.

Moreover, GitHub can be viewed as a social media platform designed specifically for developers: they can duplicate (fork) relevant projects and then customize those projects to match their requirements. Users can also ask for assistance from, or start a conversation with, the people on their following lists when they run into technical difficulties while developing a repository, patching bugs, restructuring code, or enhancing features.

Although developers can search for such projects and collaborators manually, doing so takes considerable time and can be discouraging. To overcome this, we analyzed a method that identifies and recommends followers working in domains similar to the user's by combining different centrality measures from network analysis, link prediction between users, and follower recommendation based on the content-based filtering technique.

The paper has five sections. Section 2 describes the related works and highlights the key characteristics we used to implement the different measures of social network analysis, along with some techniques to identify prominent followers. Section 3 gives a detailed explanation of the methodology, and the experimental findings used to evaluate the effectiveness of the analysis are documented in Sect. 4. Finally, the conclusion is given in Sect. 5.

2 Related Works

Initially, we needed to understand how GitHub works and what features it offers. We therefore referred to a number of research studies conducted on GitHub with various objectives. Joy et al. (2018), Sarwar (2021), and Koskela et al. (2018) focused on understanding and analyzing the various parameters of open-source software on GitHub. They identified the number of contributions, programming languages, project size, and project age as the key parameters, and they investigated their significance by developing a conceptual model and experimenting with GitHub data, confirming that these parameters have a strong impact on project performance.

Radha et al. (2017) implemented different centrality measures on the USA road network to attain congestion-free shipment. They analyzed the road network using performance indicators and identified ways to reduce traffic issues that could aid traffic management.

Thangavelu and Jyotishi (2018), Sharma and Mahajan (2017), and Gajanayake et al. (2020) presented detailed analyses of GitHub and its projects. They primarily focused on identifying the firm-level characteristics behind the performance of open-source projects at different stages. For their experiments, they developed a model with four stages: issues, forks, pull requests, and releases. Through this, they were able to examine the success factors of open-source projects at different levels. These works gave us an understanding of how GitHub works and of the role of its parameters.

Identification of a key person in a network using the well-known centrality measures, such as degree, betweenness, closeness, Katz, and eigenvector centrality, was proposed by Landher et al. (2010) and Yadav et al. (2019). By implementing these measures, they were able to conclude that Katz centrality is better for the network they chose, based on the distance between nodes, the number of shortest paths, and the resulting ranking.

Farooq et al. (2018) implemented the above measures, as well as the clustering coefficient and PageRank, on a covert network and visualized the metrics that helped them identify the most influential node. These two works served as a foundation for our research because they provided insight into how to apply those measures to our network.

Liu et al. (2019) proposed a method for predicting links between paper citations based on keywords, timestamps, and author information, and performed a series of experiments on a real-world dataset. By assigning a value to each pair of papers as accurately as possible, they were able to correlate two papers and create a link in the citation network; the dataset validated the feasibility of their proposal. This work gave us the idea of applying link prediction to our dataset.

Furthermore, to gain a better understanding of other methods for determining a social network's most significant node, we referred to the work of Ramya and Bagavathi Sivakumar (2021), who demonstrated that DC-FNN classification outperforms fixed clustering and NLP-based sentiment analysis in identifying the influential node in a Twitter dataset based on categories such as pricing, service, and timeliness. They tested these approaches on the likelihood function using iterative logistic regression analysis and the temporal influential model (TIM), and the results were more accurate than other methods in terms of precision, recall, and F-measure. This research gave us an understanding of the role a topic plays in a node's influence level.

3 Methodologies of Analysis

3.1 Data Description

Because we used real-time data from GitHub users, we considered the essential GitHub values: user, followers, following, gists, and numrepos. These values were adequate for our analysis; a more detailed explanation is provided in the following sections.

3.2 Overview of the Work Flow Model

Figure 1 depicts the flow of methods that produced the required analysis. There are six divisions in total; the first two illustrate GitHub access and follower spread analysis, and their outputs are passed to the remaining four divisions for detailed analysis. Of these four, the first represents the calculation of centrality measures, and the second represents prediction and follower analysis, which leads to the third, the recommendation of the top 10 followers to the user. Although these two divisions are related, we separated them for clarity. Finally, the fourth represents the prediction of future links between followers. As needed for our research, the output from each division is used in the other divisions (Fig. 1).

Fig. 1
Overview of the workflow model: GitHub access, follower spread analysis for users, centrality measures, prediction and follower analysis, recommendation, and future link prediction

3.3 Centrality Measures

Our method consists of different stages, where the results obtained from each stage feed the later stages to obtain an efficient output. Initially, we explored the spread of followers for a particular user or developer by establishing a connection to the GitHub API through PyGitHub using a personal access token; the username of a particular user (here, the developer 'Anshul Mehta') was then given to obtain the ego-centric graph of the first level of followers shown in Fig. 2. For a better understanding, we also explored the followers of those followers and plotted the directed graph.

Fig. 2
Followers graph of Anshul Mehta (developer): ego-centric graph of the first level of followers
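The data collection step can be sketched as follows. This is a minimal illustration assuming the PyGitHub and NetworkX libraries; the token string and the login "Anshul-Mehta1" are hypothetical placeholders for the actual credentials and username used in the paper.

```python
# Minimal sketch: connect to the GitHub API and build the two-level
# follower graph. Rate limits apply for large networks.
from github import Github
import networkx as nx

gh = Github("YOUR_PERSONAL_ACCESS_TOKEN")   # hypothetical token placeholder
root = gh.get_user("Anshul-Mehta1")         # hypothetical developer login

G = nx.DiGraph()
for follower in root.get_followers():
    G.add_edge(follower.login, root.login)      # first-level followers
    for ff in follower.get_followers():         # followers of followers
        G.add_edge(ff.login, follower.login)
```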

As our first objective, we implemented centrality measures to identify the prominent follower in the network. Centrality is used in large networks to estimate an individual's properties; it also measures the degree of interpersonal interaction and communication within a social network. More connections between network members result in greater centrality, and community development in any online social network is mainly driven by individual nodes with high centrality.

3.3.1 Degree Centrality Measure

Degree centrality (DC) is a measure of exposure that evaluates a node's connectedness in the network using the number of direct contacts to that node.

$$ D_{v} = \sum\limits_{i = 1}^{n} {a_{iv} } $$
(1)

In Eq. (1), the entries \(a_{iv}\) of the adjacency matrix are used to determine the degree centrality of node v.

3.3.2 Betweenness Centrality Measure

Betweenness centrality (BC) measures the extent to which a user or node acts as a bridge between other nodes in a large community network, taking the shortest paths into account.

$$ C_{B} \left( v \right) = \sum\limits_{s \ne v \ne t \in V} {\frac{{\sigma_{st} \left( v \right)}}{{\sigma_{st} }}} $$
(2)

In Eq. (2), the terms \(\sigma_{st}\) and \(\sigma_{st} \left( v \right)\) stand for the total number of shortest paths from node s to node t and the number of those paths that pass through v, respectively.

3.3.3 Closeness Centrality Measure

Closeness centrality (CC) measures how quickly information can travel from one node to every other node in a network; it is based on the lengths of the shortest paths from a node to all other nodes in the graph.

$$ C\left( x \right) = \frac{1}{{\sum\nolimits_{y} {d\left( {y,x} \right)} }} $$
(3)

In Eq. (3), \(d\left( {y,x} \right)\) is the shortest-path distance between nodes x and y.

3.3.4 Eigenvector Centrality Measure

Eigenvector centrality (EC) is used to assess a node's influence over other nodes in the network. A node's influence is stronger if it is connected to significant nodes, and weaker if it is connected to less important nodes. The significance of a node's neighbours therefore determines its relevance.

$$ {\text{EC}}_{v} = v_{x} = \frac{1}{{\tau_{\max } \left( A \right)}}\sum\limits_{j = 1}^{n} {a_{jx} v_{j} } $$
(4)

In Eq. (4), \(v = \left( {v_{1} ,v_{2} , \ldots } \right)\) is the eigenvector for the highest eigenvalue \(\tau_{\max } \left( A \right)\) of the adjacency matrix A.
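All four measures are available in NetworkX, so a sketch of this stage over the follower graph G built earlier could look as follows; using the undirected projection for the eigenvector computation is our assumption, made to help the power iteration converge.

```python
# Sketch of computing Eqs. (1)-(4) with NetworkX on the follower graph G.
import networkx as nx

dc = nx.degree_centrality(G)        # Eq. (1), normalized by n - 1
bc = nx.betweenness_centrality(G)   # Eq. (2)
cc = nx.closeness_centrality(G)     # Eq. (3)
ec = nx.eigenvector_centrality(     # Eq. (4), on the undirected projection
    G.to_undirected(), max_iter=1000)
```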

3.4 Similarity Measures

In addition, the clustering coefficient was calculated to identify how closely the followers are connected in the network. The obtained values for each follower were stored in a data frame together with basic GitHub values, such as the number of repositories, gists, followers, and following, to get an idea of the data spread and to analyze the influential parameters.
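A hedged sketch of assembling that data frame, reusing the graph G and the centrality dictionaries from above: the column names mirror Fig. 3, and the per-user profile calls are our assumption about how the profile counts were fetched.

```python
# Build the per-follower data frame (columns mirror Fig. 3).
import pandas as pd
import networkx as nx

clust = nx.clustering(G.to_undirected())   # clustering coefficient per node
rows = []
for login in G.nodes():
    u = gh.get_user(login)                 # refetch profile counts via the API
    rows.append({
        "user": login,
        "followers": u.followers, "following": u.following,
        "gists": u.public_gists, "numrepos": u.public_repos,
        "degree_centrality": dc[login], "closeness": cc[login],
        "eigenvector": ec[login], "clustering_coeff": clust[login],
    })
df = pd.DataFrame(rows)
```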

As the next part of the work, we implemented link prediction on the obtained network using cosine distance. Link prediction finds a collection of missing or future links between users by forecasting and estimating the likelihood of each non-existing link in the network.

$$ \cos \left( {x,y} \right) = \frac{{x \cdot y}}{{\left\| x \right\|*\left\| y \right\|}} $$
(5)

In Eq. (5), \(x \cdot y\) is the dot product of the vectors x and y, \(\left\| x \right\|\) and \(\left\| y \right\|\) are their lengths, and \(\left\| x \right\|*\left\| y \right\|\) is the product of those lengths.

The cosine of the angle formed by two vectors is used to calculate how similar two users or objects are when they are treated as vectors in m-dimensional space. In filtering, it was originally used to compare two documents, and it is now increasingly used to compare two individuals or other entities rather than just documents. Since the vectors here lie in positive space, the similarity value falls between 0 and 1.
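A worked example of Eq. (5), assuming the four profile counts (followers, following, gists, numrepos) as the feature vector; the numbers below are made up purely for illustration.

```python
# Cosine similarity between two hypothetical follower feature vectors.
import numpy as np

def cosine_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

u = np.array([120, 35, 4, 28])   # followers, following, gists, numrepos
v = np.array([95, 40, 2, 31])
print(cosine_sim(u, v))          # values near 1 indicate a likely link
```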

Then, to get a clear idea of the similarities among the followers, we employed follower similarity, which identifies similarities between followers based on the ratings given by them, and we also computed feature similarity, in which the common features among the followers were evaluated. The data frame containing these values is shown in Fig. 3. In addition, pairwise cosine similarity was computed to get a clear picture of the similarities between followers. Using these data, similar followers were found based on the aggregate similarity score (over the number of gists, repositories, followers, and following) with the developer.

Fig. 3
Data frame created from the calculated values (columns for user, followers, following, gists, numrepos, degree centrality, in-degree centrality, closeness, eigenvector, and clustering coefficient)
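Figure 7 later shows scikit-learn's cosine_similarity being used, so the pairwise step can plausibly be reproduced as below; the choice of feature columns is our assumption.

```python
# Pairwise similarity over the data frame assembled above.
from sklearn.metrics.pairwise import cosine_similarity

features = df[["followers", "following", "gists", "numrepos"]].to_numpy()
user_sim = cosine_similarity(features)       # follower-vs-follower matrix
feature_sim = cosine_similarity(features.T)  # feature-vs-feature matrix
```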

Having identified similar followers, we focused on finding potential followers to recommend to the user based on the content-based filtering technique, with the keywords of the repositories treated as content. We chose content-based recommendation because it recommends based on user preferences, which helps in recommending followers to the developer.

To implement the recommendation, we first extracted the keywords of the repositories stage-wise from the name, description, and language. Natural language processing (NLP) was applied here: the strings were converted to lowercase, and the keywords were obtained by stemming. This was done with two approaches, bag of words and TF-IDF.

$$ {\text{TF-IDF}}\left( {t,d} \right) = {\text{TF}}\left( {t,d} \right) \times {\text{IDF}}\left( t \right) $$
(6)

In Eq. (6), TF(t, d) is the term frequency of term ‘t’ that appears in a document ‘d’ and IDF(t) is the inverse document frequency.

Bag of words is an NLP model that counts the frequency of each word of the corpus in a document, whereas TF-IDF quantifies the relevance of a string representation in a document relative to a collection of documents. Once the words were obtained, cosine distance was again computed on both the bag-of-words and TF-IDF representations to find the similarity. These values were stored and sorted with their corresponding indexes. We then developed a recommender that fetches the index from the similarity array and sorts the distances in descending order to get the top 10 followers for both bag of words and TF-IDF.
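A compact sketch of the recommender under these assumptions: the repository keywords per user are concatenated into a hypothetical repo_keywords column, stemming is omitted for brevity, and scikit-learn's vectorizers stand in for the paper's own implementation.

```python
# Rank the ten closest followers by keyword similarity (BoW and TF-IDF).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = df["repo_keywords"].fillna("").str.lower()  # hypothetical column

bow = CountVectorizer(stop_words="english").fit_transform(docs)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

def top10(matrix, idx):
    sims = cosine_similarity(matrix[idx], matrix).ravel()
    order = sims.argsort()[::-1]               # most similar first
    return [df["user"].iloc[i] for i in order if i != idx][:10]

idx = df.index[df["user"] == "Anshul-Mehta1"][0]  # hypothetical login
print("BoW:", top10(bow, idx))
print("TF-IDF:", top10(tfidf, idx))
```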

Having found the followers for the user, we identified the list of followers who can be recommended to the user. In addition, this was done for all first-level followers of the developer to check whether our method worked properly. From this, we obtained the end result of recommending followers working in similar domains to the user.

Finally, future links between the user's followers were predicted using the Jaccard coefficient and the Adamic Adar coefficient, and the results were visualized to get an idea of the links. We also identified the network's common triads to gain a better understanding of the links and the influence of followers.
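NetworkX exposes both predictors directly; a sketch on the undirected projection of the follower graph, under the same assumptions as above, would be:

```python
# Score all non-edges with the Jaccard and Adamic Adar predictors.
import networkx as nx

U = G.to_undirected()
jaccard = sorted(nx.jaccard_coefficient(U), key=lambda t: t[2], reverse=True)
adamic = sorted(nx.adamic_adar_index(U), key=lambda t: t[2], reverse=True)
print(jaccard[:10])   # ten most likely future links as (u, v, score)
print(adamic[:10])
```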

4 Experimental Results

This section carries out a thorough experimental and quantitative study of the analysis undertaken, and the results are shown with explanations. Initially, the graph was obtained and the centrality measures were computed on the follower network, as shown in Fig. 4; Fig. 5 depicts the common-neighbours centrality measure.

Fig. 4
Values obtained from the centrality measures (columns for user, followers, following, gists, numrepos, degree centrality, in-degree centrality, closeness, eigenvector, and clustering coefficient)

Fig. 5
Common neighbours centrality measure (pairs of followers with their common-neighbour scores)

Further, link prediction between the followers was performed using cosine distance, and the values for the corresponding followers were stored in the data frame shown in Fig. 6.

Fig. 6
Values obtained from the cosine distance measure (normalization code and the resulting data frame)

The cosine similarity was calculated between the followers and between the features of the data frame, which gave us an idea of the relations between the followers, as shown in Fig. 7.

Fig. 7
Implementation to find user and feature similarity (using cosine_similarity from sklearn.metrics.pairwise)

The above values were then added to the data frames shown in Figs. 8 and 9.

Fig. 8
Values obtained from the user similarity (pairwise similarity matrix indexed by user)

Fig. 9
Values obtained from the feature similarity (followers, following, gists, numrepos, degree centrality, in-degree centrality, closeness, eigenvector, and clustering coefficient)

Based on these values, the similar features of a user were calculated by considering the ratings given, and they were mapped to the gists value as a similarity score. Thus, when a gists number is given, the corresponding features are displayed, as in Fig. 10.

Fig. 10
Similar feature of gists 26

Based on the scores of similar features, the aggregate score was calculated, which helped us find followers similar to the user, as in Fig. 11.

Fig. 11
Found similar followers for user Anshul Mehta (sorted by aggregate similarity score over followers, gists, repos, and following)

As the next step of recommending followers, we used the content-based filtering approach, in which the repository keywords (name, description, and language) were extracted, converted to lowercase, stemmed, and added to the data frame, as in Fig. 12.

Fig. 12
Extracting keywords from repositories

Using TF-IDF and bag of words, the word frequencies were found, and cosine similarity was then used to identify the similarity between the TF-IDF vectors, which was stored in a sorted list, as in Fig. 13.

Fig. 13
Frequency of keywords extracted from the TF-IDF (sample terms include dart, javascript, stake, starknet, and starter)

From this figure, we can infer some recognizable keywords such as JavaScript, stake, dart, etc. We may thus conclude that the extraction was successful, and we have now identified the followers who can be suggested to the user, as in Figs. 14 and 15.

Fig. 14
Keywords obtained were stored to the corresponding user

Fig. 15
Finding the top 10 users from both the methods (recommendation function using BoW and TF-IDF similarity scores)

The list of followers closest to the user is printed from the data frame, where we can see that some followers appear in both methods. Figure 16 shows the top 10 followers from both methods.

Fig. 16
List of top 10 followers from both methods (bag of words and TF-IDF) recommended for Anshul Mehta

Finally, the objective of our research is to recommend the top 10 followers to the user based on followers working in similar technologies or domains. The resulting recommendation is shown in Fig. 17.

Fig. 17
List of top 10 followers recommended to the developer (Jay Shah, Arnab Dey, Chirayu Vithalani, Kavan Desai, Parth Maniyar, Kausal, Nancy Radadia, Dincer Dogan, Varshil Shah, and Yashraj Kakkad)

Additionally, in order to identify future links between the followers, we computed the Jaccard coefficient and Adamic Adar coefficient measures and visualized the predictions for the top 100 and top 10 followers. Figure 18 depicts the scores of the Adamic Adar and Jaccard coefficients.

Fig. 18
Scores of Adamic Adar and Jaccard coefficient (columns to, from, and score, for the top 100 and top 10 followers)

Based on these scores, graphs were plotted in which the lines between followers show the predicted future links for the top 100 and top 10 followers. Figures 19 and 20 depict the visualizations of the Adamic Adar and Jaccard coefficients for the top 10 followers.

Fig. 19
Adamic Adar coefficient for top 10 followers

Fig. 20
Jaccard coefficient for top 10 followers

From these graphs, we can infer that there are high chances of future links between the followers of the user. As the outliers are the same in these graphs, all of them give nearly the same link predictions. Figure 21 depicts the scores between the followers.

Fig. 21
Scores between the followers (columns to, from, and score for five followers)

The outer clusters remain the same, and the inner clusters are concentrated around a few users, forming the bigger circle. On the basis of the scores obtained from the Jaccard coefficient and the Adamic Adar coefficient, the top 10 most likely links between the user and his followers were evaluated.

The common triads in the network were found to get an idea of the links and influence of followers. Triads are sets of three actors and the smallest community possible in a network. Through this, we identified that the user and the closest follower are present in all triads, as in Fig. 22.

Fig. 22
Triad of developer and followers
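For small ego networks, the triads can be enumerated directly. A brute-force sketch follows, reusing the undirected graph U from the link-prediction step and the same hypothetical login as earlier; the paper's own triad procedure is not shown, so this is only an illustration of the idea.

```python
# Enumerate triangles and keep those containing the developer's node.
from itertools import combinations

triads = [tri for tri in combinations(U.nodes(), 3)
          if all(U.has_edge(a, b) for a, b in combinations(tri, 2))]
dev_triads = [t for t in triads if "Anshul-Mehta1" in t]  # hypothetical login
print(len(dev_triads), "triads contain the developer")
```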

5 Conclusion

Every day more people join GitHub, and identifying users working in a similar domain is critical for developers so that they can collaborate and work together. In this paper, we presented a recommendation system designed specifically for a user and their followers. We achieved our goals by employing centrality measures, link prediction with standard algorithms, and content-based recommendation based on the developer's real-time data. The results of the experiments and a thorough quantitative analysis indicate that the metrics used are effective in suggesting notable followers to the network user.