1 Introduction

With the increasing availability of online education resources, it is important to organize learning resources in a reasonable order. This has led to a boom in research on concept maps. A concept map is a graphical tool for capturing relations between concepts, such as dependencies, associations, co-occurrences and correlations. The educational concept map is a means of visualizing subject knowledge. It focuses on the pedagogic relationships between concepts, especially prerequisite dependencies. For instance, we need basic knowledge of “probability density” in order to learn “mathematical expectation”. Therefore, “probability density” is a prerequisite concept of “mathematical expectation”. Educational concept maps play a critical role in many educational applications such as curriculum planning, teaching diagnosis, knowledge tracing and intelligent tutoring [1,2,3,4]. However, existing work on educational concept map analysis lacks public datasets to support related research. To work around this, researchers generate private datasets by using concept and link extraction methods [5,6,7,8]. However, this data collection is time-consuming, and privately synthesized data may bias the analysis results. Therefore, it is important to provide a public, well-organized dataset for generating educational concept maps.

In this work, we aim to build such a dataset, named AuCM (Australian Courses Map data). In the first step, we collect data on undergraduate courses from universities in Australia. We browse all accessible undergraduate programs in the area of IT and CS from Sydney University, the University of Queensland, Monash University and eleven other universities. Overall, we collect 1292 courses from 14 universities with comprehensive information including course ID, title, category and course description. Besides this basic information, the course catalogs also provide course prerequisite information. For example, “SIT221 - Data Structures and Algorithms” is a prerequisite course for “SIT320 - Advanced Algorithms”. In our organized dataset, all the information is well-formatted and easily accessible.

In summary, the main contributions of this paper are as follows:

  1. We collect a dataset, called Australian Courses Map data (AuCM), that includes course information of bachelor programs in the IT & CS area and the corresponding prerequisite dependencies from 14 universities in Australia.

  2. We analyze the properties of AuCM in terms of the number of courses, curriculum design and prerequisites among different universities through comparative analysis and visualization.

The structure of this article is as follows. We review related work on concept map dataset collection in Sect. 2. In Sect. 3, we introduce the construction of the AuCM dataset. In Sect. 4, we analyze the statistical properties of AuCM. Finally, we explore the concept semantics of our dataset in Sect. 5 and summarize our work in Sect. 6.

2 Related Work

Many concept map-related datasets have been collected in order to generate concept maps in educational environments. Based on the data source, these datasets can be roughly divided into three groups: university websites, online education websites, and textbooks.

Dataset from University Websites. Yang et al. [9] collected a dataset with 3509 courses and the available prerequisite requirements from four universities (MIT, CMU, Caltech and Princeton) in the U.S. They infer course prerequisites by constructing concept-level directed graphs. Liang et al. [10] provided another related dataset, which includes 654 courses with prerequisite relations from 11 U.S. universities. They manually annotated 1008 pairs of concepts with prerequisite relations.

Dataset from Online Education Websites. Besides the course data collected from university websites, some efforts have also been devoted to building datasets from online education websites, i.e., MOOCs. The dataset described in [11] is based on video playlists from a leading MOOC platform. It contains a total of 1346 course videos from 20 courses in different domains. In addition, 573 course concepts and 3504 pairs of prerequisite relations among them were annotated manually. Similarly, Roy et al. [12] built a dataset based on the subtitles of videos from a well-known MOOC platform in India. They used 382 videos from 38 different playlists in computer science departments. In total, their dataset includes 345 concepts and 1455 prerequisite edges between video pairs.

Dataset from Textbooks and Others. Wang et al. [13] were the first to build a dataset for concept map construction from six textbooks covering computer networking, physics, databases and other subjects. They extracted key concepts by mining Wikipedia concepts and computing the cosine similarity between their content and sub-chapter topics. They also manually labeled the prerequisite relationships between key concepts. Following Wang et al., Huang et al. [14] used multi-source data, including mathematics textbooks and student question logs, to build a dataset. A method similar to that of [13] is employed to extract key concepts and relations. Lu et al. [15] created a dataset by extracting concepts from 18 textbooks in three domains: Calculus, Data Structure, and Physics. They annotated 318 domain concepts and 1522 pairs of prerequisite relations among them.

Compared with these datasets, ours covers courses from Australian universities as updated in 2022. To the best of our knowledge, no similar dataset exists in other works. In addition, our dataset includes more attributes, such as compatible courses and category information, which help recover more semantic relations between courses.

Prerequisite Relationship Extraction. Based on Educational Data Mining, various downstream applications have been explored to leverage educational resources [16,17,18,19]. Our work is closely related to prerequisite relationship extraction among concepts. Based on existing raw data from websites, several efforts have been devoted to extracting prerequisite relations to construct educational concept maps. Yang et al. [9] created a concept graph by mapping courses to concepts and learning concept-level dependencies in order to accomplish interlingua-style transfer learning. Liang et al. [10] recovered an accurate and universally shared concept graph based on the observed course dependencies. Unlike the studies above, which used course-related data, Huang et al. [5] applied data mining algorithms to student test records and the relationships between questions and concepts in order to generate concept maps. Pan et al. [11] proposed a method, MOOC-RF, to recover concept prerequisites from Coursera data. They built a set of features and trained a classifier to recognize prerequisite relationships between concepts in video transcripts. Huang et al. [14] proposed a framework to extract prerequisite and collaboration relationships from multi-source education data, such as textbooks, student question logs and Wikipedia.

3 The AuCM Dataset

We collect all accessible bachelor-degree courses in IT and CS from 14 Australian universities. For comparative analysis, these universities are selected from five states and the Australian Capital Territory (ACT), and half of them belong to the “Group of Eight (G8)”. The construction of the AuCM dataset consists of two steps: data scraping and data processing. During the data scraping phase, we collect the web pages of 1292 undergraduate courses. During the data processing phase, we extract the relevant information and organize the raw data.

3.1 Data Scraping

For data scraping, we create multiple crawlers under Python 3.8 to scrape course data from the various universities. First, we analyze the information structure of the target website to make sure the key information is covered. The requests library is used to send HTTP requests to the server during the crawling process. In general, our crawler downloads the page of each undergraduate program, which lists the course requirements, and extracts the URL of each course using the Beautiful Soup library. Then, the crawler uses these URLs to request the courses’ web pages and saves the HTML file of each course. During the crawling process, crawlers may trigger anti-crawler mechanisms on some institutions’ websites. When the crawler accesses course pages too frequently, the website may refuse the next request. To avoid this, we wait for a random period of 2–8 s between visits so that these mechanisms are not triggered frequently.
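The crawling loop described above can be sketched as follows. This is a minimal illustration, not the authors' crawler: it uses the standard-library `html.parser` as a stand-in for Beautiful Soup, takes the download function as a parameter (`fetch`) instead of calling the requests library, and the `marker` URL fragment, function names and example domain are all hypothetical.

```python
import random
import time
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_course_urls(program_html, base_url, marker="/course/"):
    """Return absolute, de-duplicated course URLs found on a program page.

    `marker` is a hypothetical URL fragment that identifies course pages.
    """
    parser = LinkCollector()
    parser.feed(program_html)
    urls = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)
        if marker in url and url not in urls:
            urls.append(url)
    return urls


def crawl(urls, fetch, min_wait=2.0, max_wait=8.0, sleep=time.sleep):
    """Fetch each URL, pausing a random 2-8 s between consecutive requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Random delay so requests do not arrive at a fixed, rapid rate.
            sleep(random.uniform(min_wait, max_wait))
        pages[url] = fetch(url)
    return pages
```

Passing `fetch` and `sleep` as arguments keeps the sketch independent of any particular HTTP client and makes the waiting behaviour easy to check without real network access.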

3.2 Data Processing

After scraping the raw data, we extract the useful data during data processing. We first analyze the page structure of the downloaded HTML files and use Beautiful Soup to extract the critical information from them. Since each university has its own HTML structure, we program a separate crawler for each to get the useful information. Generally, the attributes of a course include course ID, title, category, course description and relations between courses. During data processing, we may encounter a “404 Not Found” error. This is mainly because some courses are no longer offered by the university. In this case, we remove the course from our URL list and retrieve the other available course information in the IT & CS bachelor’s program.
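The handling of discontinued courses can be illustrated with a short sketch. It assumes a hypothetical `fetch` callable that returns a page's HTML or raises `urllib.error.HTTPError`, which is how the standard library reports a 404; the original pipeline uses the requests library, whose error handling differs.

```python
from urllib.error import HTTPError


def fetch_available(urls, fetch):
    """Split URLs into fetched pages and URLs that returned 404 Not Found.

    `fetch` is a hypothetical callable that returns HTML or raises HTTPError.
    """
    pages, missing = {}, []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except HTTPError as err:
            if err.code == 404:
                # Course page no longer exists; drop it from the URL list.
                missing.append(url)
            else:
                # Other HTTP errors are not treated as discontinued courses.
                raise
    return pages, missing
```

Only 404 responses are treated as removed courses here; any other error is re-raised so that it is not silently mistaken for a discontinued course.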

We create an Excel file for each IT & CS undergraduate program of each university and write the information of one course into each row. The data can then be readily converted into a specific data format by using existing Python packages, e.g., the Pandas package. The AuCM dataset generated after data processing includes 6 attributes, which together describe the features of a course. The specific attributes are listed and explained in Table 1.
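As a rough picture of the row-per-course layout, the sketch below models a course record and serializes it to CSV with the standard library. The six field names are assumptions pieced together from attributes mentioned in the text (ID, title, category, description, prerequisites, compatible courses); the authoritative schema is the one in Table 1, and CSV is used here only as a dependency-free stand-in for the Excel files. The category value `"core"` is likewise illustrative.

```python
import csv
import io
from dataclasses import asdict, dataclass, field


@dataclass
class Course:
    # Field names are assumptions; the paper's Table 1 defines the real schema.
    course_id: str
    title: str
    category: str
    description: str
    prerequisites: list = field(default_factory=list)
    compatible: list = field(default_factory=list)


def to_csv(courses):
    """Serialize courses to CSV text, joining list-valued fields with ';'."""
    buf = io.StringIO()
    names = ["course_id", "title", "category", "description",
             "prerequisites", "compatible"]
    writer = csv.DictWriter(buf, fieldnames=names, lineterminator="\n")
    writer.writeheader()
    for course in courses:
        row = asdict(course)
        row["prerequisites"] = ";".join(course.prerequisites)
        row["compatible"] = ";".join(course.compatible)
        writer.writerow(row)
    return buf.getvalue()
```

A file written this way can be loaded back with any tabular tool, including the Pandas package mentioned above.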

4 Statistical Analysis of AuCM

In this section, we conduct a statistical analysis of the number of courses, the curriculum design of IT & CS bachelor degree programs, the core curriculum, and courses with prerequisites.

Table 1. Overview of course features
Table 2. IT & CS courses of universities in NSW/ACT

4.1 Analysis of the Number of Courses

Our dataset AuCM covers all the courses required to obtain a bachelor’s degree in IT (BIT) and/or CS (BCS). Students are generally required to complete core courses, specialized courses and elective courses in order to earn a bachelor’s degree in IT/CS. We obtain all available course information on core units and on courses of optional majors or minors from the websites of the BIT & BCS programs. The core unit, major or minor information corresponds to the category feature in Table 1. For comparative analysis, we select 7 universities from the “Group of Eight (G8)” and 7 non-G8 universities from 5 states and the ACT. For example, Sydney University and the Australian National University, which belong to the G8 and are located in New South Wales (NSW)/ACT, are included, as are the University of Technology Sydney and Australian Catholic University (non-G8) in NSW. The statistics of IT & CS courses from these NSW/ACT universities are listed in Table 2, which covers the total number of courses and the number of courses with prerequisite requirements. The comparison in Table 2 shows that G8 universities generally offer considerably more courses than non-G8 universities in BIT & BCS programs. For example, USYD and ANU each offer more than 100 courses, while UTS offers 71 courses and ACU only provides an example study plan with 25 courses in total. This pattern is also evident in three other states (Queensland, Western Australia and South Australia, abbreviated as QLD, WA and SA), as shown in Table 3.

Table 3. IT & CS courses of universities in QLD, WA, SA

The G8 universities of Queensland (UQ), Western Australia (UWA) and Adelaide (ADELAIDE) clearly have more courses than the non-G8 universities in their respective states. In fact, UWA offers more than twice as many courses as ECU, with 131 and 50 courses respectively.

The state of Victoria (VIC) is an exception to this pattern. Deakin University, which does not belong to the G8, offers a total of 126 courses, the most in VIC and more than the G8 universities Monash University and the University of Melbourne. The IT & CS course statistics of universities in VIC can be seen in Table 4.

Table 4. IT & CS courses of universities in VIC

Figure 1 provides a comprehensive view of IT & CS courses from the 14 universities. It again shows that G8 universities (green bars) generally have more courses in total than non-G8 universities (blue bars). The average number of courses is 115 for the G8 universities, compared with 69 for the non-G8 universities. This may be attributed to the fact that most non-G8 universities either do not offer a bachelor of CS or include CS as a major within the bachelor of IT. Among all universities, UQ has the largest total number of IT & CS courses, with 160 courses and 8 majors to select from, while ACU has the smallest, with 25 courses, as also listed in Table 2.

Fig. 1. #Courses vs. universities in 5 states/ACT of Australia.

4.2 Analysis of Curriculum Design

The total number of courses indicates the scope of courses in the IT/CS field. We also find that 12 of the 14 universities offer an explicit plan for IT/CS bachelor’s degrees. The other 2 are the University of Melbourne and the University of Western Australia, which do not offer a bachelor’s degree in IT. For example, the University of Melbourne places IT & CS courses in a bachelor of design, so our dataset only covers the major courses in this field. Most universities that offer a bachelor’s degree in IT & CS provide a rich curriculum plan. For instance, UQ offers many curriculum options. In general, students at UQ have to complete core courses, courses for plan options, and elective courses. Core units are compulsory and lay a foundation for the study area. In addition, students can choose a plan from many options, such as a single major, a single minor, a major plus a minor, two majors, or even no major. Elective courses are divided into program electives and general electives, which serve area breadth and general purposes respectively. Our dataset does not cover elective courses, as the majority of them are not in the IT & CS domain. For example, general electives at UQ can be selected from any other undergraduate course list.

Fig. 2. Distribution of course categories for BIT in UQ. There are 3 majors and one minor, accounting for 55.4% of the courses.

We can get an intuitive impression of the distribution of course categories in UQ from Figs. 2 and 3. Notably, the majority of courses belong to majors and minors in both pie charts: 36 courses (55.4%) in the BIT program and 50 courses (52.6%) in the BCS program. Similarly, most courses in the curricula of other universities are major-related. For example, Monash University has 7 majors for students to choose from in the BIT program, and Deakin University has a total of 6 majors and 13 minors in the BIT and BCS programs.

Fig. 3. Distribution of course categories for BCS in UQ. 5 majors account for 52.6% of the total number of courses.

4.3 Analysis of Core Curriculum

We also compare and analyze the core curriculum of BIT/BCS programs at different universities. Of the 14 universities in our dataset, 12 offer bachelor’s degrees in IT and 5 offer bachelor’s degrees in CS. 7 of the 12 universities offering the Bachelor of IT list their core units. The other universities make these important foundation courses compulsory instead of referring to them as “core units”. UTS, VU, UQ and ADELAIDE each offer 8 core units. Figure 4 shows the core courses of these schools. Notably, programming-related courses appear in the core curriculum of each university. Similar courses can also be found in the core units of the other 3 universities, as shown in Fig. 5. In addition, ANU sets the “Structured Programming” course as a compulsory course, ACU lists the “Programming Concepts” course as an “IT specified unit”, and the first-year course requirements at UNISA and ECU include programming-related courses. Courses on databases and information management are also common among the core units.

In contrast, courses on “Data Structures and Algorithms” and “Computer System” are more common in the core units of the CS field. The core courses for the BCS at USYD, UQ, Deakin, Adelaide, and Monash Universities are displayed in Fig. 6, where algorithm-related courses are shown in orange and computer systems courses in blue. As can be seen from the figure, the CS undergraduate core courses of all 5 schools include courses related to “Data Structures and Algorithms” and “Computer System”. In addition to these common courses, many universities also offer programming courses as part of their core curriculum.

Fig. 4. Core units of the BIT program at UTS, VU, UQ and ADELAIDE universities. Programming-related courses are shown in blue, and database courses are indicated in green. (Color figure online)

Fig. 5. Core units of the BIT program at USYD, QUT and DEAKIN universities. Programming-related courses are shown in blue. (Color figure online)

Fig. 6. Core courses for the BCS program at USYD, UQ, DEAKIN, ADELAIDE, and MONASH Universities. Algorithm-related courses are shown in orange, and computer systems courses are indicated in blue. (Color figure online)

4.4 Analysis of Prerequisites

Another observation on AuCM is that a majority of courses have at least one prerequisite course. Prerequisite requirements represent the sequential order between courses. In total, across the 14 universities, there are 811 unique IT & CS undergraduate courses, 511 courses with prerequisite requirements, and 1475 pairs of courses with prerequisite relations. In addition, the average number of prerequisite links per course is 1.82. Figure 7 shows the number of courses with prerequisites against the number of courses at each university. The red line is fitted by linear regression and indicates that the proportion of courses with prerequisites is around 75%. The largest proportion belongs to ANU, where 94.8% of courses have prerequisite requirements, while the smallest belongs to Monash University, where 37.0% of courses have prerequisite dependencies.
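The summary figures above can be derived from a list of prerequisite pairs, as the following sketch shows. The function name, the dictionary keys and the toy course IDs `X100` and `X200` are hypothetical; only the SIT221/SIT320 pair and the totals 1475 and 811 come from the text. The reported average of 1.82 is consistent with dividing the number of prerequisite pairs by the number of unique courses.

```python
def prerequisite_stats(course_ids, prereq_pairs):
    """Summarize prerequisite coverage.

    `prereq_pairs` holds (prerequisite_id, course_id) tuples.
    """
    courses = set(course_ids)
    pairs = set(prereq_pairs)
    # A course "has prerequisites" if it appears as the dependent course.
    with_prereq = {course for _, course in pairs}
    return {
        "courses": len(courses),
        "with_prerequisites": len(with_prereq),
        "pairs": len(pairs),
        "links_per_course": len(pairs) / len(courses),
        "share_with_prerequisites": len(with_prereq) / len(courses),
    }


# Toy data: X100 and X200 are made-up IDs, not AuCM records.
toy = prerequisite_stats(
    ["SIT221", "SIT320", "X100", "X200"],
    [("SIT221", "SIT320"), ("X100", "SIT221"), ("X200", "SIT320")],
)

# Totals reported in the text: 1475 pairs over 811 unique courses.
reported_links_per_course = round(1475 / 811, 2)
```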

Fig. 7. #Courses with prerequisites vs. #courses.

5 Concept Semantics in AuCM

In this section, we analyze the latent semantic features of the concepts covered by undergraduate courses in the AuCM dataset and explore concept map learning based on the dataset.

5.1 Semantic Feature Extraction and Analysis

To recover the concept-level semantic features, we first collect concepts from Wikipedia and process the course descriptions with traditional NLP techniques, such as stop word removal, sentence segmentation, and lemmatization. Then we match occurrences of the Wikipedia concepts in the course descriptions. Using a pre-trained tokenizer from the BERT model, we tokenize the course descriptions and the retrieved concepts separately in order to capture their semantic characteristics. Furthermore, we use t-SNE to project these high-dimensional features onto a two-dimensional map for visual analysis.

By drawing concept word clouds and scatter plots of concepts and course descriptions based on the projected 2D features, we can see that the focus areas of concepts and courses in BIT/BCS programs vary from school to school. Figure 8 compares the word clouds of course concepts between the “Group of Eight (G8)” universities and the non-G8 universities.
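The concept-matching step can be sketched in simplified form as follows. This is an illustration under stated assumptions rather than the pipeline used in the paper: the stop-word list is a tiny hypothetical one, lemmatization is omitted (so a singular concept name will not match a plural mention), and the BERT tokenization and t-SNE projection are not reproduced.

```python
import re

# A tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "this", "for", "such", "as"}


def normalize(text):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def match_concepts(description, concepts):
    """Return the concepts whose tokens appear contiguously in the description."""
    doc = normalize(description)
    found = []
    for concept in concepts:
        target = normalize(concept)
        n = len(target)
        # Slide a window of the concept's length over the description tokens.
        if n and any(doc[i:i + n] == target for i in range(len(doc) - n + 1)):
            found.append(concept)
    return found
```

For a description such as "This unit covers data structures, such as trees, and the design of operating systems.", the sketch matches "Data structures" and "Operating systems" but not "Neural network".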

Fig. 8. Word clouds of concepts extracted from the course descriptions of G8 (left) and non-G8 (right) universities.

Fig. 9. Scatter plots of projected features extracted from course descriptions/concepts in the courses from G8 and non-G8 universities.

As the word clouds show, common concepts such as data structures and operating systems appear with high frequency in the course descriptions of all universities. The difference is that advanced topics like “artificial intelligence” and “neural network” are frequently included in the course offerings of G8 universities, while basic topics such as “operating systems” and “software development” are more frequently covered in the courses offered by non-G8 universities. This may indicate that G8 universities pay more attention to developing students’ capabilities in depth, whereas non-G8 universities emphasize application and fundamentals. Another comparison of curriculum design between G8 and non-G8 institutions is shown in Fig. 9. The scatter plot on the left gives the distribution of words in the course descriptions, where the blue dots are words from the course descriptions of G8 universities and the red dots are words from those of non-G8 universities. The blue dots are clearly more scattered and span a larger area, indicating that the G8 course descriptions employ more semantically rich words and cover a broader range of topics. The scatter plot on the right, which displays the distribution of concepts, shows the same pattern. Compared with the relatively dispersed blue points, the red points tend to cluster in a few conceptual areas, suggesting that non-G8 courses may focus on particular concepts rather than breadth.

5.2 Concept Map Learning

Beyond semantic comparison among Australian universities, our dataset is primarily intended for creating concept maps. Concept map construction aims to extract structured information from unstructured text and represent it as a graph, where concepts constitute the vertices and prerequisite dependencies make up the links. An example of a concept map is shown in Fig. 10, where the vertices are the extracted concepts and the links, represented by blue dotted lines, indicate the relationships between concepts. The links are based on a simple assumption: a prerequisite relationship between two courses means that the concepts contained in these courses are linked. The vertex “data type” at the center of the concept graph has many connections with other vertices, indicating that “data type” is a prerequisite for many concepts. Our AuCM dataset can facilitate research on concept map learning by contributing course information from Australian universities, since the existing benchmark datasets come primarily from the United States. For example, Liang et al. [10] and Yang et al. [9] both constructed concept maps based on datasets collected from U.S. universities.
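The simple linking assumption stated above can be sketched as follows: every concept of a prerequisite course is linked to every concept of the dependent course. The function names and the concept assignments in the usage note are hypothetical, and the sketch makes no claim about how Fig. 10 was actually produced.

```python
from collections import defaultdict


def build_concept_map(course_concepts, prereq_pairs):
    """Link each concept of a prerequisite course to each concept of the dependent course.

    `course_concepts` maps a course ID to its list of concepts;
    `prereq_pairs` holds (prerequisite_id, course_id) tuples.
    """
    edges = set()
    for prereq, course in prereq_pairs:
        for src in course_concepts.get(prereq, ()):
            for dst in course_concepts.get(course, ()):
                if src != dst:  # a shared concept is not its own prerequisite
                    edges.add((src, dst))
    return edges


def out_degree(edges):
    """Count how many concepts each concept is a prerequisite for."""
    degree = defaultdict(int)
    for src, _ in edges:
        degree[src] += 1
    return dict(degree)
```

With the illustrative assignment of "data type" and "algorithm" to SIT221 and of "algorithm" and "graph theory" to SIT320, the pair (SIT221, SIT320) yields three directed links, and "data type" has the highest out-degree, which is the kind of hub vertex described in the text.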

Fig. 10. An example of a concept map.

The AuCM dataset also allows state-of-the-art concept map construction methods to be validated in an Australian data setting and supports comparative studies of concept maps built on AuCM and on other datasets. Other downstream applications such as course recommendation and knowledge tracing can also be supported by our dataset.

6 Conclusion

In this work, we build a dataset of course map data from Australian universities, named AuCM. It contains 1292 undergraduate courses from bachelor’s degree programs in IT or CS at 14 Australian universities. The usability of our dataset is demonstrated by statistical and semantic studies. Our statistical analysis shows that G8 universities generally offer more courses than non-G8 universities. Through semantic analysis of concepts and course descriptions, we find differences in course design between G8 and non-G8 universities in terms of semantic richness and focus areas. In future work, we plan to construct concept maps on AuCM to validate existing concept map generation methods. Another direction is to perform a more comprehensive model comparison and evaluation by considering other possible model variations for learning a concept graph.