Abstract
The value of structured scholarly knowledge for research and society at large is well understood, but producing scholarly knowledge (i.e., knowledge traditionally published in articles) in structured form remains a challenge. We propose an approach for automatically extracting scholarly knowledge from published software packages by static analysis of their metadata and contents (scripts and data) and for populating a scholarly knowledge graph with the extracted knowledge. Our approach mines scientific software packages linked to article publications by extracting metadata and analyzing the Abstract Syntax Tree (AST) of the source code to obtain information about the data used and produced as well as the operations performed on that data. The resulting knowledge graph includes articles, software package metadata, and the computational techniques applied to input data utilized as materials in research work. The knowledge graph also includes the results reported as scholarly knowledge in articles. Our code is available on GitHub: https://github.com/mharis111/parse-software-scripts.
Keywords
- Analyzing Software Packages
- Code Analysis
- Abstract Syntax Tree
- Open Research Knowledge Graph
- Scholarly Communication
- Machine Actionability
1 Introduction
Scholarly artefacts (articles, datasets, software, etc.) are proliferating rapidly in diverse data formats on numerous repositories [3]. The inadequate machine support in data processing motivates the need to extract essential scholarly knowledge published via these artefacts and represent extracted knowledge in structured form. This enables building databases that power advanced services for scholarly knowledge discovery and reuse.
We propose an approach for populating a scholarly knowledge graph (specifically, the Open Research Knowledge Graph) with structured scholarly knowledge automatically extracted from software packages. The main purpose of the knowledge graph is to capture information about the materials and methods used in scholarly work described in research articles. Of particular interest is information about the operations performed on data, which we propose to extract by static code analysis using Abstract Syntax Tree (AST) representations of program code, as well as recomputing the scientific results mentioned in linked articles. We thus address the following research questions:
1. How can we reliably distinguish scholarly knowledge from other information?
2. How can we reliably determine and describe the (computational) activities relevant to some research work as well as the data input and output of those activities?
Our contribution is an approach—and its implementation in a production research infrastructure—for automated, structured scholarly knowledge extraction from software packages.
2 Related Work
Several approaches have been suggested to retrieve meta(data) from software repositories. Mao et al. [10] proposed the Software Metadata Extraction Framework (SOMEF), which utilizes natural language processing techniques to extract metadata from software packages. The framework extracts the repository name, software description, citations, reference URLs, etc. from README files and represents the metadata in structured form. SOMEF was later extended to extract additional metadata and auxiliary files (e.g., notebooks, Dockerfiles) from software packages [6]; the extended version also supports creating a knowledge graph of the parsed metadata. Abdelaziz et al. [1] proposed CodeBreaker, a knowledge graph containing information about more than a million Python scripts published on GitHub. The knowledge graph was integrated into an IDE to recommend code functions while writing software.
A number of machine learning-based approaches for searching [4] and summarizing [2, 5] software scripts have been proposed. The Pydriller [12] and GitPython frameworks mine information from GitHub repositories, including source code, commits, etc. Similarly, ModelMine [11] extracts and analyzes models from software repositories; the tool is useful in extracting models from several repositories at once, thus supporting software development. Vagavolu et al. [13] presented an approach that generates multiple representations of source code (e.g., Code2vec [7] and semantic graphs with the Abstract Syntax Tree (AST)) to capture all the relevant information needed for software engineering tasks.
3 Methodology
We now describe our proposed methodology for extracting scholarly knowledge from software packages and generating a knowledge graph from the extracted meta(data). Figure 1 provides an overview of the key components. We present the implementation for each of the steps of the methodology using a running example involving the article by Mancini et al. [8] and related published software package [9].
3.1 Mining Software Packages
We harvest software packages from the Zenodo and figshare repositories via their REST APIs. The metadata of each software package is analyzed to retrieve its DOI and other associated information, specifically linked scholarly articles. Moreover, we use the Software Metadata Extraction Framework (SOMEF) to extract additional metadata from software packages, such as the software description, programming languages, and related references. Since not all software packages link to scholarly articles explicitly in their metadata, we also use SOMEF to parse the README files of software packages as an additional means of extracting the DOI of linked scholarly articles.
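The harvesting step can be sketched against Zenodo's public REST API. The helper names below are illustrative, and the reliance on the isSupplementTo relation to find linked articles is an assumption of this sketch, not the exact implementation:

```python
import json
import urllib.parse
import urllib.request

ZENODO_API = "https://zenodo.org/api/records"

def fetch_software_records(query="", size=10):
    """Query Zenodo's REST API for records of type 'software'."""
    params = urllib.parse.urlencode({"q": query, "type": "software", "size": size})
    with urllib.request.urlopen(f"{ZENODO_API}?{params}") as resp:
        return json.load(resp)["hits"]["hits"]

def extract_linked_article_dois(record):
    """Collect DOIs of scholarly articles linked in a record's metadata.
    Zenodo lists links to related works under 'related_identifiers';
    archived software often points to its article via 'isSupplementTo'."""
    metadata = record.get("metadata", {})
    return [
        rel["identifier"]
        for rel in metadata.get("related_identifiers", [])
        if rel.get("relation") == "isSupplementTo" and rel.get("scheme") == "doi"
    ]
```

Records lacking such explicit links would fall through to the README-parsing step described above.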
3.2 Static Code Analysis
We utilize Abstract Syntax Tree (AST) representations for static analysis of the Python scripts and Jupyter notebooks included in software packages. Our Python-based module sequentially reads the scripts contained in software packages and generates an AST for each. The implemented methods and variables are represented as nodes in the tree, which facilitates analyzing the flow of the code. Figure 2 shows a Python script included in the software package [9]. The script illustrates an example in which the fmri_behavioural_new.csv data is loaded and two statistical hypothesis tests (pearsonr and t-test) are conducted on this data. Figure 3 shows the AST of the Python script (Fig. 2), created using a suitable Python library; for simplicity, we show the AST of lines 1 and 11 only. We investigate the flow of variables that contain the input data, i.e., we examine which operations used a particular variable as a parameter. With this analysis, we retrieve the series of operations performed on a particular variable. For our example, we conclude that the dropna, head, groupby, pearsonr and ttest_ind operations are executed on the fmri_behavioural_new.csv data.
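The variable-flow analysis described above can be approximated with Python's built-in ast module. The following sketch (function names are ours, and it assumes a flat script like the running example) tracks variables assigned from data-loading calls and collects the operations later applied to them:

```python
import ast

# Functions whose return value we treat as input data (cf. Sect. 4).
LOADERS = {"read_csv", "loadtxt", "genfromtxt", "read_json", "open"}

def _func_name(call):
    """Name of the function or method being called, if recoverable."""
    f = call.func
    if isinstance(f, ast.Attribute):
        return f.attr
    return f.id if isinstance(f, ast.Name) else None

def _names_in(node):
    """All variable names appearing anywhere inside a node."""
    return {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}

def operations_on_data(source):
    """Return (tracked, ops): variables assigned from data-loading calls,
    and the operations applied to them, in source order."""
    tree = ast.parse(source)
    tracked, ops = set(), []
    for stmt in tree.body:
        # Record calls that take a tracked variable as receiver or argument;
        # sorting by end position keeps chained calls in evaluation order.
        calls = sorted(
            (n for n in ast.walk(stmt) if isinstance(n, ast.Call)),
            key=lambda c: (c.end_lineno, c.end_col_offset),
        )
        for call in calls:
            name = _func_name(call)
            if name is None or name in LOADERS:
                continue
            if _names_in(call) & tracked:
                ops.append(name)
        # Propagate tracking through assignments:
        # x = read_csv(...) starts tracking; y = x.op(...) extends it.
        if isinstance(stmt, ast.Assign) and isinstance(stmt.value, ast.Call):
            src = stmt.value
            if _func_name(src) in LOADERS or _names_in(src) & tracked:
                tracked.update(t.id for t in stmt.targets if isinstance(t, ast.Name))
    return tracked, ops
```

Run on a script shaped like Fig. 2, this recovers the operation sequence dropna, head, groupby, pearsonr, ttest_ind for the loaded CSV data.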
3.3 Identifying Scholarly Knowledge
The information extracted via AST analysis of source code is not necessarily scholarly knowledge. In this work, scholarly knowledge is information expressed in scholarly articles. Hence, in this step we constrain the information obtained in the previous step to information mentioned in the articles linked to the analyzed software packages. We use the Unpaywall REST APIFootnote 1 to retrieve the article in PDF format. To identify scholarly knowledge, we compute the semantic similarity between terms extracted from code and the article text by employing a pre-trained BERT-based model, accepting as scholarly knowledge only those terms that are semantically similar to the article text. First, we extract the text from the PDF and remove stop words. Second, we arrange the sentences into bigrams and trigrams, because computing similarity over whole sentences would make the search inefficient. Next, we generate embeddings of the keywords extracted from the article text and the software packages, and use cosine similarity to find words with similar meaning. For our example (Fig. 2), the extracted terms are searched in the article, and we find that fmri, pearsonr and t test are mentioned in the linked article (Fig. 4(a, b)). Given the match, we assume that the extracted information is scholarly knowledge.
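The deterministic parts of this matching step can be sketched as follows. The embed argument stands in for the pre-trained BERT-based encoder; here it is stubbed, and the function and threshold names are our assumptions for the sketch:

```python
import math

def ngrams(tokens, n):
    """All n-grams of a token sequence, joined into phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def match_scholarly_terms(code_terms, article_phrases, embed, threshold=0.9):
    """Keep code-derived terms whose best match among article bigrams and
    trigrams exceeds the similarity threshold. In the pipeline, `embed`
    would map a phrase to its BERT embedding vector."""
    matched = []
    for term in code_terms:
        best = max(cosine_similarity(embed(term), embed(p)) for p in article_phrases)
        if best >= threshold:
            matched.append(term)
    return matched
```

With a real encoder, a code term such as ttest_ind would match an article phrase like "t test" even though the strings differ, which exact string search would miss.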
3.4 Recomputing the Research Results
For identified scholarly knowledge, we assume that the outputs of operations on data are information included in articles, and not necessarily stored as a separate dataset. In our running example, the output is a p-value and, as we can see in Fig. 2, the output is merely printed to the console and presumably copied and pasted into the article text manually. To enable rich descriptions of scholarly knowledge, we recompute the procedure outputs by executing the scripts that produce scholarly knowledge. For this, we developed a separate program that checks whether the code under consideration can be executed automatically. If it cannot, we introduce a human-in-the-loop step to execute the code. As different software packages may have been developed with varying versions of libraries, we create a new virtual environment per package, so that scripts from the same package execute without conflicts between libraries. We also check whether the set of libraries required to execute a script is installed. If not, we identify the required libraries using the AST representation and automatically install them in the virtual environment before executing the code. After successfully recomputing the code output, we assume that the computed outputs are correct and mentioned in the linked article. For our running example, we observed that the t-test returns a p-value (0.00088388), which is indeed mentioned in the paper (Fig. 4(c)).
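The dependency-detection step can likewise be sketched with the ast module. The install_missing helper is illustrative; in practice PyPI package names do not always match import names (e.g., the module cv2 installs as opencv-python), so a mapping step may be needed:

```python
import ast
import subprocess
import sys

def required_modules(source):
    """Top-level third-party or stdlib modules a script imports,
    recovered from its AST (relative imports are skipped)."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules

def install_missing(modules):
    """Install any module not importable in the current (virtual)
    environment, so the script can be re-executed there."""
    for mod in sorted(modules):
        try:
            __import__(mod)
        except ImportError:
            subprocess.check_call([sys.executable, "-m", "pip", "install", mod])
```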
3.5 Knowledge Graph Construction
Given the extracted scholarly knowledge, we now construct the knowledge graph or, in our case, populate the Open Research Knowledge Graph (ORKG). First, the meta(data) is converted into triples that are ingested into the ORKG using its REST API. This conversion is guided by ORKG templatesFootnote 2, which specify the structure of information types, their properties, and value ranges. Hence, templates standardize ORKG content and ensure comparable semantics. To link the extracted scholarly knowledge with templates, we search for templates by operation name. The matching template is then utilized to produce ORKG-compliant data that can be ingested. For our running example (Fig. 2), we look for an ORKG template by searching “t-test” in the ORKG interface. We obtain a reference to the ORKG Student t-test templateFootnote 3, which specifies three properties, namely: has specified input, has specified output and has dependent variable. This template guides us in producing ORKG-compliant data for the scholarly knowledge extracted in the previous steps. Figure 5 shows the description of the paper in our running example in ORKGFootnote 4. The metadata of the corresponding software packageFootnote 5 can also be viewed in ORKG.
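Assembling template-compliant data can be sketched as below. The payload shape is a simplified illustration built from the three template properties named above, not the exact schema expected by the ORKG REST API, and the dependent-variable value is a placeholder:

```python
def ttest_contribution(input_dataset, p_value, dependent_variable):
    """Assemble a contribution description following the ORKG Student
    t-test template, which specifies the properties 'has specified input',
    'has specified output' and 'has dependent variable'. The dict layout
    here is a sketch of the data sent to the ORKG REST API."""
    return {
        "template": "Student t-test",
        "values": {
            "has specified input": input_dataset,
            "has specified output": {"label": "p-value", "value": p_value},
            "has dependent variable": dependent_variable,
        },
    }
```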
4 Validation
Since we employ ASTs to extract scholarly knowledge from software packages, it is essential to demonstrate that the proposed approach reliably produces the desired results. For this purpose, we compare the AST-based results with ground truth data extracted manually from software packages. To prepare the ground truth data, we developed a Python script that iteratively reads software packages and scans each script to look up all functions that load datasets (i.e., read_csv, loadtxt, genfromtxt, read_json, open). All scripts containing these functions were manually annotated for input data, operations performed on the data, and output data, if any. To identify scholarly knowledge, the extracted information was searched in the full text of the linked scholarly articles. Then, the manually extracted results were compared with the AST-based results. Performance is assessed using the Index of Agreement (IA) [14], a standardized measure of the agreement between results obtained using two different approaches. Its value varies between 0 and 1, with higher values indicating smaller error between observed and predicted results. In total, we analyze 40 software packages and obtain an overall IA of 0.74. This result suggests acceptable reliability of the proposed approach.
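Willmott's Index of Agreement [14] can be computed directly from its definition; d = 1 yields perfect agreement, d = 0 none:

```python
def index_of_agreement(observed, predicted):
    """Willmott's Index of Agreement:
    d = 1 - sum((P_i - O_i)^2) / sum((|P_i - O_mean| + |O_i - O_mean|)^2)
    where O are the (here: manually extracted) observations and P the
    (here: AST-extracted) predictions."""
    o_mean = sum(observed) / len(observed)
    num = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    den = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2
              for o, p in zip(observed, predicted))
    return 1 - num / den
```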
5 Results and Discussion
At the time of writing, more than 115,000 software packages are collectively available on Zenodo and figshare. To expedite execution, we consider packages up to 2 GB in size, i.e., 91,258 software packages. We analyze package metadata and the respective README files using SOMEF and find a total of 10,584 linked research articles. Our analysis focuses on Python-based software packages, of which there are 31,992. Among these, 5,239 Python-based implementations are linked to 5,545 articles; some software packages are linked to multiple articles. Table 1 summarizes the statistics.
To delve further into the structural and semantic aspects of these packages, we applied the AST-based approach and discovered that 8,618 software scripts (in 1,405 software packages) contain information about input datasets, the methods executed on these datasets and, if applicable, output datasets. A total of 2,049 papers are linked to these packages. As the Open Research Knowledge Graph (ORKG) is designed purely for the publication of scholarly knowledge, we search the extracted knowledge in the full text of the linked articles (as explained in Sect. 3.3) to identify scholarly knowledge. This step requires access to the full text of linked articles, and we found that, of the 2,049 articles, 665 are closed access. Consequently, knowledge derived from packages linked to closed-access articles cannot be constrained as scholarly knowledge. Hence, we analyze the remaining 740 packages to obtain the scholarly knowledge. We describe the metadata of 91,258 software packages and 10,941 papers in ORKG. Of these articles, 174 contain rich contributions, i.e., information about datasets and the operations performed on them. The proposed AST-based approach is thus an alternative route to automatically producing structured scholarly knowledge.
We suggest the extraction of scholarly knowledge from software packages; by relying on code structure rather than article text alone, our approach addresses limitations of NLP-based approaches.
6 Conclusions
We have presented a novel approach to structured scholarly knowledge production by extraction from published software packages. Based on the encouraging results, we suggest that the approach is an important contribution towards the automated and scalable production of rich structured scholarly knowledge that is accessible via a scholarly knowledge graph and efficiently reusable in services supporting data science. This richness is reflected in the fact that the resulting scholarly knowledge graph holds links between articles, data and software at the granularity of individual computational activities, as well as comprehensive descriptions of the computational methods and materials used and produced in the research work presented in articles.
Notes
- 4. https://orkg.org/paper/R601243, where readers can view the data also as a graph.
References
Abdelaziz, I., Srinivas, K., Dolby, J., McCusker, J.P.: A demonstration of CodeBreaker: a machine interpretable knowledge graph for code. In: SEMWEB (2020)
Ahmad, W., Chakraborty, S., Ray, B., Chang, K.W.: A transformer-based approach for source code summarization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4998–5007. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.449
Hendler, J.: Data integration for heterogeneous datasets. Big Data 2, 205–215 (2014). https://doi.org/10.1089/big.2014.0068
Husain, H., Wu, H.H., Gazit, T., Allamanis, M., Brockschmidt, M.: Codesearchnet challenge: evaluating the state of semantic code search (2020). https://www.microsoft.com/en-us/research/publication/codesearchnet-challenge-evaluating-the-state-of-semantic-code-search/
Iyer, S., Konstas, I., Cheung, A., Zettlemoyer, L.: Summarizing source code using a neural attention model. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2073–2083. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/P16-1195
Kelley, A., Garijo, D.: A framework for creating knowledge graphs of scientific software metadata. Quant. Sci. Stud. 2(4), 1423–1446 (2021). https://doi.org/10.1162/qss_a_00167
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196. PMLR (2014)
Mancini, F., Zhang, S., Seymour, B.: Learning the statistics of pain: computational and neural mechanisms. BioRxiv 2021–10 (2021)
Mancini, F., Zhang, S., Seymour, B.: Computational and neural mechanisms of statistical pain learning (2022). https://doi.org/10.5281/zenodo.6997897
Mao, A., Garijo, D., Fakhraei, S.: SoMEF: a framework for capturing scientific software metadata from its documentation. In: 2019 IEEE International Conference on Big Data (Big Data), pp. 3032–3037 (2019). https://doi.org/10.1109/BigData47090.2019.9006447
Reza, S.M., Badreddin, O., Rahad, K.: ModelMine: a tool to facilitate mining models from open source repositories. In: Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3417990.3422006
Spadini, D., Aniche, M., Bacchelli, A.: Pydriller: Python framework for mining software repositories. ESEC/FSE 2018, New York, NY, USA, pp. 908–911. Association for Computing Machinery (2018). https://doi.org/10.1145/3236024.3264598
Vagavolu, D., Swarna, K.C., Chimalakonda, S.: A mocktail of source code representations. In: 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1296–1300 (2021). https://doi.org/10.1109/ASE51524.2021.9678551
Willmott, C.J.: On the validation of models. Phys. Geography 2(2), 184–194 (1981). https://doi.org/10.1080/02723646.1981.10642213
Acknowledgment
This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and TIB–Leibniz Information Centre for Science and Technology.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Haris, M., Auer, S., Stocker, M. (2023). Scholarly Knowledge Graph Construction from Published Software Packages. In: Goh, D.H., Chen, SJ., Tuarob, S. (eds) Leveraging Generative Intelligence in Digital Libraries: Towards Human-Machine Collaboration. ICADL 2023. Lecture Notes in Computer Science, vol 14458. Springer, Singapore. https://doi.org/10.1007/978-981-99-8088-8_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8087-1
Online ISBN: 978-981-99-8088-8