1 Introduction

GitHub Copilot, developed by GitHub in collaboration with OpenAI and released in 2021 [1], is an AI-powered code generation tool that has garnered significant attention in the software engineering community. It aims to assist developers by automatically suggesting code snippets and completing lines of code based on natural language descriptions and context [2]. Leveraging machine learning techniques and a vast codebase, Copilot has the potential to enhance developer productivity and accelerate software development processes. Previous studies have explored the use of AI in code generation [15], the impact of automated coding tools on developer workflows [3], and the ethical considerations associated with AI-assisted coding [4]. However, there is a lack of comprehensive examination of GitHub Copilot's effectiveness, reliability, and ethical implications. The absence of a systematic review of GitHub Copilot's recent research trends hinders a thorough understanding of its current state of development and potential impact on software development practices. A comprehensive review of recent research trends on GitHub Copilot is therefore essential to assess its effectiveness, identify limitations, and uncover potential future directions.

Therefore, the objective of this article is to analyze the recent trends of research on GitHub Copilot to assess its current state of development, identify gaps in existing knowledge, and provide insights into potential future research directions. This study used the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) methodology to systematically search relevant databases, apply pre-determined inclusion and exclusion criteria, extract and synthesize data, and critically appraise the quality and relevance of included studies. These rigorous procedures ensure a comprehensive and reliable analysis of the literature on GitHub Copilot.

2 Methodology

This section describes the approach employed to retrieve articles relevant to GitHub Copilot. The systematic review followed the PRISMA method and drew on three prominent databases: Scopus, ACM Digital Library, and ScienceDirect. The review process, encompassing identification, screening, and eligibility, as well as the data abstraction and analysis methods, is also presented.

2.1 PRISMA

PRISMA stands for Preferred Reporting Items for Systematic Reviews and Meta-Analyses. The PRISMA statement was first published in 2009 and has been updated several times; the most recent version, PRISMA 2020, was published in 2021. It is a widely recognized reporting guideline for systematic reviews and meta-analyses, originally developed for healthcare research. PRISMA provides a checklist of items that researchers should consider when preparing a systematic review or meta-analysis, such as defining the research question, specifying the inclusion and exclusion criteria, and assessing the risk of bias in included studies. Its use is recommended for reporting systematic reviews and meta-analyses to enhance transparency, completeness, and reproducibility [5].

2.2 Resources

The review relied on three primary journal databases: Scopus, ACM Digital Library, and ScienceDirect. Scopus, recognized as one of the largest and most comprehensive bibliographic databases, encompasses disciplines such as science, technology, medicine, the social sciences, and the humanities, and indexes and abstracts articles from scholarly journals, conference proceedings, and books [6]. The ACM Digital Library is a vast repository of scholarly resources, encompassing articles, books, conference proceedings, and other computer science and information technology publications. Recognized as a valuable resource, it is a comprehensive platform for researchers, educators, and practitioners who aim to remain abreast of the latest advancements in their respective domains [7]. ScienceDirect is a robust database that offers access to an extensive collection of scientific, technical, and medical research articles from various journals and books. It is an invaluable resource for researchers, scholars, and students who strive to keep up with the latest developments in their respective fields [8].

2.3 Eligibility and Exclusion Criteria

Several eligibility and exclusion criteria were determined. The first criterion, Literature type, specifies that only research articles published in journals and proceedings are eligible for inclusion. These articles are typically peer-reviewed and report original research findings, making them more reliable and valuable for research purposes. The exclusions under this criterion are systematic reviews, book series, books, chapters in books, and conference proceeding books. The second criterion, Language, specifies that only articles published in English are considered. This criterion ensures that the selected articles can be easily understood and reduces the potential for language-related bias in the selection process. The final criterion is Timeline. Since GitHub Copilot is a relatively new tool, no lower bound was placed on the publication date: all articles published up to and including 2023 were eligible. This ensures that all articles related to GitHub Copilot were captured (see Table 1).

Table 1. The inclusion and exclusion criteria.
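As an illustration only, the following minimal Python sketch shows how the three criteria could be applied programmatically to a list of bibliographic records; the field names and records are hypothetical, and the actual screening in this review was performed manually.

```python
# Hypothetical sketch: applying the inclusion/exclusion criteria of Table 1
# to a list of bibliographic records. Field names are illustrative only.

RECORD_TYPES_INCLUDED = {"journal article", "proceedings article"}

def is_eligible(record: dict) -> bool:
    """Return True if a record satisfies the literature-type, language,
    and timeline criteria described in Sect. 2.3."""
    if record.get("type", "").lower() not in RECORD_TYPES_INCLUDED:
        return False          # excludes reviews, books, chapters, etc.
    if record.get("language", "").lower() != "english":
        return False
    return record.get("year", 0) <= 2023   # no lower bound on publication year

records = [
    {"title": "Study A", "type": "Journal Article", "language": "English", "year": 2022},
    {"title": "Study B", "type": "Book Chapter", "language": "English", "year": 2021},
]
print([r["title"] for r in records if is_eligible(r)])   # -> ['Study A']
```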

2.4 Systematic Review Process

The systematic review was conducted in April 2023 and comprised four distinct stages. The first stage involved the identification of relevant keywords for the search process; since GitHub Copilot is a distinctive name, the single keyword "GitHub Copilot" was used. After checking for duplicates, 12 duplicated articles were eliminated. The second stage was the screening process, which excluded two articles due to their status as conference proceedings, leaving 26 eligible articles for review. In the third stage, the eligibility criteria were applied to the full text of the articles, excluding nine articles that focused on topics other than GitHub Copilot. Finally, the last stage of the review yielded 15 articles suitable for qualitative analysis, as presented in Fig. 1.

Fig. 1. The flow diagram of the study. (Adapted from [16]).

2.5 Data Abstraction and Analysis

The remaining articles were assessed and analyzed. Efforts were concentrated on studies that met the predetermined inclusion criteria, which ensured that the analysis focused on relevant sources of evidence. To extract relevant data, the team initially read through the article abstracts, followed by a more in-depth analysis of the full articles to identify significant trends related to GitHub Copilot. Qualitative analysis was performed using content analysis techniques, which allowed the team to identify relevant themes and patterns in the data. This rigorous and systematic approach to data analysis ensured that the study’s findings were grounded in the available evidence and provided a robust synthesis of the research on GitHub Copilot.

3 Results

Table 2 shows recent trends in GitHub Copilot research. Four main areas were studied by researchers in 2022 and 2023: developer productivity, code quality, code security, and education.

Table 2. Recent Trends in GitHub Copilot Research.

3.1 Developer Productivity

The study [9] aimed to assess the performance of GitHub Copilot, an automatic program synthesis tool, and compare it with genetic programming approaches on common program synthesis benchmark problems. The results revealed that both approaches performed similarly on the benchmark problems. However, genetic programming approaches were not yet mature enough to support practical software development due to their reliance on expensive hand-labelled training cases, long execution times, and bloated, hard-to-understand generated code. The researchers suggested that future work on program synthesis with genetic programming should prioritize improving execution time, readability, and usability to overcome these challenges and enable more practical applications of the approach.

The study [10] aimed to assess the impact of GitHub Copilot on user productivity and identify measurable user data that reflects their perceptions. The researchers analyzed user behaviors and feedback to determine the factors influencing developers’ productivity when using Copilot. The results showed that the rate of acceptance of suggestions was the primary driver of productivity, indicating that user satisfaction and acceptance of suggested code snippets were more critical to productivity than the longevity of the code snippets. The study underscores the importance of user-centered design in developing AI-powered programming tools that can enhance developer productivity.
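A minimal sketch of how such an acceptance-rate metric could be computed from suggestion events is shown below; the event schema and numbers are hypothetical and are not the telemetry format used in [10].

```python
# Hypothetical sketch of the suggestion acceptance-rate metric that [10]
# identifies as the primary driver of perceived productivity.

from dataclasses import dataclass

@dataclass
class SuggestionEvent:
    shown: int      # number of completions shown to the developer in a session
    accepted: int   # number of completions the developer accepted

def acceptance_rate(events: list[SuggestionEvent]) -> float:
    """Overall fraction of shown suggestions that were accepted."""
    shown = sum(e.shown for e in events)
    accepted = sum(e.accepted for e in events)
    return accepted / shown if shown else 0.0

sessions = [SuggestionEvent(shown=40, accepted=12), SuggestionEvent(shown=25, accepted=9)]
print(f"acceptance rate: {acceptance_rate(sessions):.1%}")   # ~32.3%
```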

The study [11] aims to investigate the effectiveness of GitHub Copilot in pair programming contexts compared to human pair programming. The researchers conducted an experiment with 21 participants, who were randomly assigned to one of three conditions: pair programming with Copilot, human pair programming as the driver, and human pair programming as the navigator. The results suggest that while Copilot can increase productivity in terms of lines of code added, the quality of the generated code is inferior, as indicated by a higher number of lines of code deleted in subsequent trials. The study highlights the importance of being cautious when relying solely on AI-based tools in pair programming contexts.

The study [12] investigates the usability and perceived usefulness of GitHub Copilot, a large language model-based code generation tool, in the programming workflow. The study employs a within-subjects user study with 24 participants to examine how programmers use the tool. The results show that Copilot did not improve task completion time or success rate; still, most participants preferred to use it in their daily programming tasks, since it provided a useful starting point and saved them the effort of searching online. However, participants faced challenges in understanding, editing, and debugging code snippets generated by Copilot, which reduced their task-solving effectiveness. The study highlights the need to improve the design of Copilot based on the observed difficulties and participants' feedback.

3.2 Code Quality

The study [13] aimed to assess the quality of code generated by GitHub Copilot and to examine the impact of input parameters on its performance. The researchers utilized an experimental setup to evaluate the generated code’s validity, correctness, and efficiency. The results showed that GitHub Copilot generated valid code with a success rate of 91.5%, and 28.7% of the problems were correctly generated, while 51.2% were partially correct and 20.1% were incorrect. The study indicates that GitHub Copilot holds significant promise as a programming tool, but further assessments and improvements are necessary to optimize its performance in generating entirely accurate code that meets all requirements.
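For illustration, a simple way to reproduce such a three-way classification is to bucket each problem by the fraction of its unit tests the generated solution passes; the sketch below uses hypothetical test counts and is not the exact procedure of [13].

```python
# Hypothetical sketch: classify generated solutions as correct, partially
# correct, or incorrect from the fraction of unit tests they pass, mirroring
# the categories reported in [13].

def classify(passed: int, total: int) -> str:
    if total == 0:
        return "invalid"                 # no tests could be run against the code
    ratio = passed / total
    if ratio == 1.0:
        return "correct"
    if ratio > 0.0:
        return "partially correct"
    return "incorrect"

# Toy results: (tests passed, tests total) per problem.
results = {"problem_1": (10, 10), "problem_2": (4, 10), "problem_3": (0, 10)}
for name, (passed, total) in results.items():
    print(name, "->", classify(passed, total))
```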

The study [14] aims to evaluate the correctness and understandability of the code generated by Copilot, which assists programmers by generating code based on natural language descriptions of desired functionality. The study utilizes 33 LeetCode questions in four programming languages to create queries for Copilot. It assesses the corresponding 132 Copilot solutions for correctness using LeetCode’s provided tests and for understandability using SonarQube’s complexity metrics. The findings indicate that Copilot’s Java suggestions have the highest correctness score (57%) while JavaScript has the lowest (27%). Overall, Copilot’s recommendations have low complexity, with no significant differences between programming languages. The study also identifies some potential shortcomings of Copilot, such as generating code that could be further simplified and relying on undefined helper methods. The study concludes by highlighting the need for further research to address these issues and explore the potential of Copilot in supporting software development.

The study [15] examines the impact of code generated by machine learning models on code readability and visual attention. Specifically, it focuses on GitHub Copilot, comparing the generated code with code written entirely by human programmers. The study conducted a human experiment with 21 participants, using static code analysis and eye tracking to evaluate the code's readability and visual inspection. The findings suggest that model-generated code is comparable in complexity and readability to code written by human pair programmers. However, the eye-tracking data indicates that programmers tend to pay less visual attention to model-generated code. Consequently, reading code remains essential, and programmers should be mindful of automation bias and avoid complacency when working with model-generated code.

3.3 Code Security

The study [16] introduces SecurityEval, an innovative dataset designed to evaluate the security of machine learning-based code generation models. The dataset comprises 130 diverse samples covering 75 different vulnerability types, each mapped to the widely recognized Common Weakness Enumeration (CWE). By leveraging SecurityEval, the authors assess the security of two prominent code generation models, InCoder and GitHub Copilot, and show that both models are susceptible to generating vulnerable code in certain cases. With its robustness and comprehensiveness, the SecurityEval dataset can serve as a valuable benchmark for scrutinizing the security of other code generation models in future research.
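Conceptually, an evaluation against such a dataset iterates over prompt/CWE pairs, generates code for each prompt, and checks whether the targeted weakness appears in the output. The sketch below is a simplified assumption of that workflow; the sample schema and the generate and is_vulnerable callables are placeholders, not SecurityEval's actual format or any real model API.

```python
# Simplified, assumed workflow for a SecurityEval-style evaluation [16].

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    prompt: str   # partial code or description handed to the model
    cwe_id: str   # targeted weakness, e.g. "CWE-89" (SQL injection)

def vulnerable_rate(samples: List[Sample],
                    generate: Callable[[str], str],
                    is_vulnerable: Callable[[str, str], bool]) -> float:
    """Fraction of samples whose generated code is flagged for the targeted CWE."""
    flagged = sum(is_vulnerable(generate(s.prompt), s.cwe_id) for s in samples)
    return flagged / len(samples) if samples else 0.0

# Toy usage with stub callables; a real evaluation plugs in a model and an analyzer.
samples = [Sample("def run_query(user_input):", "CWE-89")]
print(vulnerable_rate(samples, lambda p: p + "\n    pass", lambda code, cwe: False))  # 0.0
```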

The study [17] delves into the ethical and security concerns surrounding GitHub Copilot and comparable products that harness deep learning models to learn from open-source code. To mitigate the potential exploitation of such code, the authors introduce a prototype named CoProtector, which employs data poisoning techniques. The primary aim of CoProtector is to substantially decrease the performance of deep learning models similar to Copilot while embedding covert watermark backdoors that can reveal misuse of protected code. The authors conducted extensive large-scale experiments to validate CoProtector's effectiveness in achieving these objectives. The results demonstrate that CoProtector can effectively safeguard open-source code from misuse and potential breaches.

The study [18] examines the presence of code smells and security vulnerabilities in datasets employed to train code generation techniques and whether these issues are reflected in the output of such techniques. The study utilizes Pylint and Bandit to evaluate three training sets and to analyze the output generated by an open-source transformer-based model and GitHub Copilot. The results reveal that code smells and security vulnerabilities exist within the training sets and that these issues propagate into the output of the code generation techniques. This underscores the need for further refinement to ensure the generated code is free of such issues. The findings also emphasize the significance of careful selection and scrutiny of the training data to minimize the risk of code smells and security vulnerabilities in the generated code.
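As a rough sketch of this kind of static screening, generated Python files can be passed to Bandit and Pylint and the reported findings counted. This assumes both tools are installed, and the exact flags, output fields, and directory name are illustrative and may differ from the setup used in [18].

```python
# Minimal sketch: count Bandit security findings and Pylint messages for a
# directory of generated Python code. Assumes bandit and pylint are on PATH.

import json
import subprocess

def bandit_findings(path: str) -> int:
    """Number of security findings Bandit reports for the given directory."""
    out = subprocess.run(["bandit", "-r", path, "-f", "json"],
                         capture_output=True, text=True)
    report = json.loads(out.stdout or "{}")
    return len(report.get("results", []))

def pylint_findings(path: str) -> int:
    """Number of messages (including code-smell warnings) Pylint reports."""
    out = subprocess.run(["pylint", path, "--output-format=json"],
                         capture_output=True, text=True)
    return len(json.loads(out.stdout or "[]"))

if __name__ == "__main__":
    print("bandit findings:", bandit_findings("generated_code/"))
    print("pylint findings:", pylint_findings("generated_code/"))
```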

The study [19] examines the security of GitHub Copilot, the pioneering AI pair programmer that automatically produces computer code. The concern arises from Copilot's exposure to a vast amount of unverified code, raising the possibility of generating insecure code. The study thoroughly analyzes 1,689 programs generated by Copilot in scenarios relevant to high-risk cybersecurity weaknesses. The results reveal that approximately 40% of the generated programs are vulnerable, indicating a significant security risk. Moreover, the study demonstrates that Copilot's performance varies considerably based on the diversity of weaknesses, prompts, and domains. These findings emphasize the need for caution when employing AI-based code generation tools and underscore the importance of vetting generated code for security vulnerabilities to prevent potential security breaches.

3.4 Education

The study [20] examines the impact of GitHub Copilot on the learning process in introductory computer science and data science courses. The authors evaluate the correctness, style, skill-level appropriateness, grade scores, and potential plagiarism of programming assignments generated by Copilot. The results demonstrate that Copilot produces original code that can effectively solve introductory assignments, with human-graded scores ranging from 68% to 95%. Based on these findings, the authors recommend that educators adjust their courses to integrate new AI-based programming workflows.

The research [21] investigates the effectiveness of OpenAI Codex, the underlying model for GitHub Copilot, in solving more advanced CS2 exam questions in comparison to the performance of students. The results indicate that Codex outperformed most students on these questions, generating accurate and comprehensive code. The study suggests that generative AI models like Codex have the potential to support students in completing programming assignments and exams and promote equitable access to high-quality programming education. The study emphasizes the significance of considering the ramifications of these tools for the future of undergraduate computing education.

The study [22] investigates the performance of GitHub Copilot on a diverse dataset of 166 programming problems, analyzing the types of problems that challenge Copilot and the natural language interactions between students and Copilot when resolving errors. The results demonstrate that Copilot solves approximately 50% of the problems on its initial attempt and can effectively solve 60% of the remaining problems using only natural language modifications to the problem description. The study suggests that the prompt engineering used to interact with Copilot when it initially fails can be a valuable learning activity that promotes the development of computational thinking skills and transforms the nature of code-writing skill acquisition.
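The interaction pattern described in [22] can be sketched as a simple refinement loop: if the code generated from the original problem description fails its tests, reworded descriptions are tried in turn. The generate and run_tests callables below are placeholders for a real code generation model and test harness, not the study's actual tooling.

```python
# Illustrative sketch of iterative natural-language prompt refinement, as in [22].

from typing import Callable, List, Tuple

def solve_with_refinement(description: str,
                          rewordings: List[str],
                          generate: Callable[[str], str],
                          run_tests: Callable[[str], bool]) -> Tuple[str, int]:
    """Try the original description first, then each reworded version in turn.

    Returns the first solution that passes the tests and the number of
    attempts used; returns an empty string if every attempt fails.
    """
    attempts = [description] + rewordings
    for i, prompt in enumerate(attempts, start=1):
        code = generate(prompt)
        if run_tests(code):
            return code, i
    return "", len(attempts)
```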

The study [23] examines the use of OpenAI’s Codex machine learning model in programming education, particularly its implementation as the GitHub Copilot plugin, and the implications it raises for educators. The study evaluates the model’s performance and limitations in supporting programming instruction through qualitative analysis of the code suggestions generated by Copilot and student feedback. The goal is to provide insight into Copilot’s potential for programming education and to highlight the need for instructors to adapt their teaching practices accordingly. The findings indicate that while Copilot can generate correct and understandable code, it cannot replace the process of learning programming. Therefore, educators must incorporate Copilot into their pedagogical strategies judiciously.

4 Discussion

GitHub Copilot is an AI-powered programming tool that has gained attention for its ability to generate code automatically. Several studies have explored Copilot's performance, productivity, and usability in different contexts. One study compared Copilot with genetic programming approaches on common program synthesis benchmark problems, concluding that both approaches performed similarly but that genetic programming approaches were not yet mature enough for practical software development. Another study investigated the impact of Copilot on user productivity and found that the rate of acceptance of suggestions was the primary driver of productivity. A third study compared Copilot with human pair programming and found that while Copilot increased productivity, the quality of the generated code was inferior. A fourth study revealed that Copilot provided a useful starting point for programmers, but that participants struggled to understand, edit, and debug the generated code snippets.

Several studies have recently assessed the code quality generated by GitHub Copilot, which assists programmers by generating code based on natural language descriptions of desired functionality. The studies have used various methods to evaluate the correctness and understandability of the generated code and have generally found that Copilot holds significant promise as a programming tool, generating valid code with high success rates. However, the studies also identify potential shortcomings, such as generating code that could be further simplified and relying on undefined helper methods. Further assessments and improvements are necessary to optimize Copilot’s performance in generating entirely accurate code that meets all requirements.

Using machine learning-based code generation models, such as GitHub Copilot, raises ethical and security concerns. Several recent studies highlight the potential for such models to generate vulnerable code and the need for careful selection and scrutiny of training data to minimize risks. To address these concerns, researchers have introduced SecurityEval, a dataset for evaluating the security of code generation models, and CoProtector, a prototype aimed at safeguarding open-source code from misuse and breaches. While Copilot’s performance varies considerably based on the diversity of weaknesses, prompts, and domains, the studies emphasize the importance of vetting generated code for security vulnerabilities to prevent potential breaches.

The studies explore using OpenAI’s Codex machine learning model in programming education through its implementation as the GitHub Copilot plugin. They investigate Copilot’s impact on the learning process, its ability to generate original code, and its performance on diverse programming problems. The studies show that Copilot has the potential to support students in completing programming assignments and exams and can promote equitable access to high-quality programming education. However, the studies also suggest that Copilot cannot replace the process of learning programming, and educators must adapt their teaching practices to integrate these AI-based programming workflows effectively.

5 Future Direction or Recommendation

Based on the above studies, the future direction of GitHub Copilot and similar AI-powered programming tools should focus on improving the accuracy and simplicity of generated code while addressing ethical and security concerns. This could involve further refinement of the training data and algorithms the tool uses to minimize the risk of generating vulnerable code. Additionally, developers could work on enhancing the tool’s ability to understand and edit generated code snippets and improving its debugging capabilities.

In terms of programming education, the studies suggest that AI-powered programming tools like Copilot have the potential to support students in completing programming assignments and exams and promote equitable access to high-quality programming education. However, it is also essential for educators to adapt their teaching practices to integrate these tools effectively, emphasizing the importance of learning programming concepts and not relying solely on generated code. The future direction of programming education should thus explore ways to integrate these AI-based programming workflows into the classroom while ensuring that they complement and enhance traditional programming education rather than replace it.

The future direction of AI-powered programming tools should prioritize accuracy, simplicity, and security while promoting equitable access to high-quality programming education. This requires balancing the benefits and risks associated with these tools and continued research and development to optimize their performance and address any potential ethical and security concerns.

6 Conclusion

In conclusion, recent research on GitHub Copilot focuses on four main areas: developer productivity, code quality, code security, and education. The research has shown that Copilot holds significant promise, generating valid code with high success rates and increasing user productivity. However, potential shortcomings need to be addressed, such as generating code that could be further simplified and vetting generated code for security vulnerabilities. The studies also suggest that Copilot can support students in completing programming assignments and exams, but it cannot entirely replace the process of learning programming. As AI-based programming workflows become more prevalent, educators must adapt their teaching practices to integrate them effectively into programming education. Future research should address the ethical and security concerns raised by machine learning-based code generation models and optimize their performance to generate entirely accurate code that meets all requirements.