
1 Introduction

We are living in a time of exponential growth in the production of unstructured data such as web pages, social media posts, and documents. Such a mass of information cannot feasibly be analyzed by humans, so there is a strong drive to develop automatic methods capable of retrieving knowledge from unstructured data.

For over a decade, text clustering has been an active field of research in the machine learning community. Most approaches are based on term occurrence frequency. The performance of such surface-based methods can decrease when the texts are too complex, i.e., ambiguous. One alternative is to use semantics-based approaches that process textual documents according to their meaning. Furthermore, research in text categorization has focused on flat texts, whereas many documents now come in a semi-structured format.

In this paper we present the application of diverse clustering methods to discover knowledge in a repository of semi-structured documents (describing KRK competences). The proposed method extensively exploits information extraction techniques (regular expressions) and information retrieval techniques (flat clustering, hierarchical agglomerative clustering, and pattern mining with grouping of the patterns into a lattice). Our approach is evaluated on a unique dataset describing the KRK education competences of 2800 study fields in the Polish Higher Education System.

2 Preface

Since the late 1990s, the Bologna Process has offered an incentive for many European countries to reform their educational systems and to make them more comparable and explicit. Nowadays, 46 European countries are involved in the process, and one of its outcomes is the European Qualifications Framework (EQF), a system of explicit and comparable qualifications.

The EQF aims to establish a common reference framework as a translation device between different qualification systems and their levels. This framework comprises general, higher and vocational education and training, and should lead to better transparency, comparability and portability of citizens' qualifications (e.g. diplomas, certificates, etc.). The Qualifications Framework strengthens competitiveness by enhancing mobility between the European educational systems and the labour market. Individuals and employers will be able to use the EQF to better understand and compare the qualification levels of different countries.

Within the framework of the Bologna Process, the European ministers of education have agreed that each country should develop its own framework for degrees and qualifications (a qualifications framework). The Polish framework is called Krajowe Ramy Kwalifikacji (the KRK competence framework). KRK competences are specialized descriptions of the abilities/qualifications that students gain after graduating from a given study field. In the Polish Higher Education System each study field should have KRK competences (described in a semi-structured file) assigned to it. This means that study fields are linked to files (pdf/doc) in which the KRK competences are given in a table or in an enumeration. The Polish KRK competence corpus consists of 3550 files and refers to about 2800 study fields.

3 Related Work

This paper touches on two topics: information extraction (semi-structured extraction) and information retrieval (diverse clustering methods).

Information extraction (IE) is the task of automatically extracting structured information from unstructured or semi-structured machine-readable documents. The most popular tasks in IE are named entity recognition, co-reference and relationship identification, table extraction, and terminology extraction. There are various methods for information extraction, such as regular expressions, decision trees, Bayesian methods and Conditional Random Fields. The paper [1] presents various IE methods for extracting scientific profiles from web pages.

Document clustering is a form of unsupervised learning that groups a set of documents into subsets called clusters. The algorithm's goal is to create clusters that are internally coherent but clearly different from each other. In other words, documents within a cluster should be as similar as possible, while documents in one cluster should be as dissimilar as possible from documents in other clusters [11]. The goal of text clustering in information retrieval is to discover groups of semantically related documents. At the root of document clustering lies van Rijsbergen's [14] cluster hypothesis: closely associated documents tend to be relevant to the same requests, whereas documents concerning different meanings of the input query are expected to belong to different clusters.

The key input to a clustering algorithm is the distance measure. Two major classes of distance measures are Euclidean and non-Euclidean. A Euclidean space has some number of real-valued dimensions, and a Euclidean distance is based on the locations of points in such a space. A non-Euclidean distance is based on properties of points, but not on their location in a space. Euclidean distances include the \(L_{1}\) norm (e.g. Manhattan distance) and the \(L_{2}\) norm (the most common notion of distance). Non-Euclidean distances include the Jaccard distance, the cosine distance and the edit distance (the number of insertions and deletions needed to change one string into another).

Clustering methods are usually divided into two groups: flat clustering and hierarchical clustering. The flat approach creates a flat set of clusters without any explicit structure relating the clusters to each other. Hierarchical clustering creates a hierarchy, a structure that is more informative than an unstructured set of clusters. These features of hierarchical clustering come at the cost of lower efficiency. In our applications we used both of these methods.

Approaches to text clustering can be also classified as data-centric or description-centric [3].

The data-centric approach focuses more on the problem of data clustering than on presenting the results to the user. Scatter/Gather [4] is an example: it divides the dataset into a small number of clusters and, after the selection of a group, performs clustering again, proceeding iteratively using the Buckshot-fractionation algorithm. Other data-centric methods use hierarchical agglomerative clustering [10], replacing single terms with lexical affinities (2-grams of words) as features, or exploit link information [17].

Description-centric approaches are more focused on the description that is produced for each cluster of documents. This problem is also called descriptive clustering: the discovery of diverse groups of semantically related documents associated with meaningful, comprehensible and compact text labels. Accurate and concise cluster descriptions (labels) let the user search through the collection's content faster and are essential for various browsing interfaces. The task of creating descriptive, sensible cluster labels is difficult: typical text clustering algorithms rely on samples of keywords for describing the discovered clusters. Among the most popular and successful approaches are phrase-based ones, which form clusters based on recurring phrases instead of numerical frequencies of isolated terms. The STC algorithm employs frequently recurring phrases both as a document similarity feature and as the final cluster description [16]. KeySRC improved the STC approach by adding part-of-speech pruning and dynamic selection of the cut-off level of the clustering dendrogram [2]. The Description-Comes-First (DCF) approach was introduced in [13] as an algorithm called Lingo. Description-Comes-First is a special case of the description-centric approach: it first attempts to find good, conceptually varied cluster labels and then assigns documents to the labels to form groups. The Lingo algorithm combines common phrase discovery and latent semantic indexing techniques to separate search results into meaningful groups. Phrase-based methods usually provide good results, but they are reported to have problems when one topic dominates. Navigli and Crisafulli, and Di Marco and Navigli [5, 6, 12] present a novel approach to snippet clustering based on the automatic discovery of word senses from raw text. The proposed method clusters snippets based on their semantic similarity to the induced query senses.

Inspired by the above description-centric algorithms, we introduced a novel method for clustering web search results based on frequent termset mining [8]. First, we acquire the senses of a query by means of a word sense induction method that identifies meanings as trees of closed frequent termsets. Then we cluster the search results based on their lexical and semantic intersection with the induced senses. We do not use any external corpora; sense induction is performed only on the search results, and the search results are distributed among the matching senses. Finally, we also use some diversification techniques in order to rerank the clusters and their content. This method [8] has been used for clustering KRK competences and the corresponding study fields.

4 Approach

In order to perform the computations we have built dedicated tools for extracting KRK competence symbols, clustering study fields according to their KRK representations, and identifying KRK patterns and structuring them into sub-trees.

4.1 Extraction of KRK Competences

Extraction of KRK competences concerns processing the files that describe each study field with a list of KRK competences. A sample file containing KRK competence symbols is presented in Fig. 1. The files are mainly pdf/doc files containing tables or lists of competence symbols.

Fig. 1 Sample file describing study field ‘edukacja artystyczna’ with KRK competences placed in a table

The information extraction is performed as follows:

  1. File processing using Apache Tika;

  2. KRK competence symbols extraction using regular expressions.

File processing is done in order to fetch the content of a file in text format. The Apache Tika toolkit detects and extracts metadata and text (the file's content) from over a thousand different file types (such as PPT, XLS, DOC(X) and PDF). All of those files are parsed through a single interface, making Tika crucial for search engine indexing and content extraction. In our experiments we used the AutoDetectParser class in order to work with diverse types of files (xls, pdf, doc).

KRK competences are extracted from the retrieved file content using well-defined regular expressions. Analyzing the KRK symbols, we discovered some regularities in the notation, which made it possible to define one general regular expression:

$$\begin{aligned} ((\backslash p\{Upper\})((\backslash w\{1,3\})\_(\backslash w\{3\}))|(A\_(\backslash w\{3\}))) \end{aligned}$$

Summarizing, for each document we build a vector of KRK competence symbols. For the study field Geography at Poznan University the vector of retrieved competences contains 62 symbols and looks like \(<S2A\_U08, S2A\_U09,\ldots \) \(S2A\_W07,\ldots ,S2A\_K05,\ldots , P2A\_K01>\). This vector representation of competences is used to cluster the study fields.
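
The extraction step can be sketched in Python; the sample text is an illustrative assumption, while the pattern is a direct translation of the general regular expression above (Python's re module has no \(\backslash p\{Upper\}\), so [A-Z] stands in for it).

```python
import re

# Translation of the general expression above: an upper-case letter
# followed by 1-3 word characters, an underscore and 3 word characters,
# or the shorter A_xxx form. [A-Z] replaces \p{Upper}, which Python's
# re module does not support.
KRK_PATTERN = re.compile(r"[A-Z]\w{1,3}_\w{3}|A_\w{3}")

def extract_krk_symbols(text):
    """Return the vector (ordered, deduplicated list) of KRK symbols."""
    seen, symbols = set(), []
    for symbol in KRK_PATTERN.findall(text):
        if symbol not in seen:
            seen.add(symbol)
            symbols.append(symbol)
    return symbols

# Hypothetical fragment of a file's extracted content.
sample = "Efekty: S2A_U08, S2A_U09 oraz S2A_W07; ponadto P2A_K01 i S2A_U08."
print(extract_krk_symbols(sample))  # → ['S2A_U08', 'S2A_U09', 'S2A_W07', 'P2A_K01']
```

In the real pipeline the input text would be the content returned by Tika rather than a literal string.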

4.2 Flat and Hierarchical Clustering

Flat clustering is efficient and conceptually simple, but it requires an a priori defined number of clusters as input and is usually nondeterministic. Hierarchical clustering creates a structure containing the history of grouping, which shows the state of the clustering at each iteration. Such a representation is more informative than an unstructured set of clusters. In some applications (such as finding similar study fields in the KRK vector space) we want a partition into disjoint clusters, just as in flat clustering. In those cases the hierarchy needs to be cut at some point. There are two ways to do this: prespecify the number of clusters, or prespecify a level of similarity (the minimum similarity between clusters). Analyzing the hierarchy of clusters makes it much simpler to decide on an acceptable cut-off level (in our case the minimal similarity that allows two clusters to be merged).

Hierarchical clustering is either top-down or bottom-up. Top-down clustering proceeds by splitting the initial cluster (containing all instances) recursively until singletons (individual instances) are reached. The bottom-up approach treats each document as a singleton cluster in the initial step and then merges clusters until all instances are grouped into a single cluster. This approach is called hierarchical agglomerative clustering, or HAC for short. HAC is used more frequently in IR than top-down clustering, so we also used the agglomerative approach in our applications. The HAC algorithm can be used with different linkage criteria: single-linkage, complete-linkage, or average-linkage. Our initial experiments showed that single-linkage, which joins clusters based on the similarity of their most similar members, provides better results than complete-linkage. Complete-linkage clustering suffers from a particular problem with KRK competence vectors: it pays too much attention to outliers, i.e., study fields that do not fit well into the global structure of a cluster.
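
A minimal sketch of threshold-cut, bottom-up single-linkage clustering over set-valued KRK vectors, assuming the overlap-over-max similarity defined in Sect. 5.2; the field names are hypothetical and the quadratic implementation is for illustration only.

```python
def set_sim(a, b):
    """Overlap-over-max similarity of two KRK symbol sets
    (the krkSim measure defined in Sect. 5.2)."""
    return len(a & b) / max(len(a), len(b))

def hac_single_linkage(items, sim, threshold):
    """Bottom-up single-linkage HAC: start from singleton clusters and
    repeatedly merge the pair with the highest single-link similarity
    until it drops below the cut-off threshold."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage: similarity of the most similar members.
                s = max(sim(items[i], items[j])
                        for i in clusters[x] for j in clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break  # cut-off reached: stop merging
        x, y = pair
        clusters[x].extend(clusters[y])
        del clusters[y]
    return clusters

# Hypothetical study fields with KRK competence sets.
fields = {
    "field_a": {"S1A_W02", "S1A_W03", "S1A_U01"},
    "field_b": {"S1A_W02", "S1A_W03", "S1A_U02"},
    "field_c": {"P2A_K01", "P2A_K02", "P2A_W01"},
}
names = list(fields)
groups = hac_single_linkage([fields[n] for n in names], set_sim, threshold=0.6)
print([[names[i] for i in c] for c in groups])  # → [['field_a', 'field_b'], ['field_c']]
```

Swapping the inner max for min would give complete-linkage, which, as noted above, is overly sensitive to outlying study fields.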

We performed two clusterings in our experiments: (1) flat deterministic clustering for mirror discovery, and (2) HAC clustering for the discovery of similar study fields.

4.3 SnS-Based Clustering

We have also applied a dedicated WSI-based web search result clustering method called SnSRC [8]. SnSRC consists of the following four steps: (1) preprocessing of results (transforming them into bags of words), (2) word sense induction, (3) clustering of results, and (4) cluster sorting.

In the first step, for a given query, the relevant documents (snippets) are retrieved and then iteratively processed into a bag-of-words representation. In our case the bag-of-words representation is given initially as the vector of KRK competence symbols (each KRK competence symbol can be treated as a normalized word). Steps 2 and 3 are performed as in the original snippet-clustering version, and the last step (4) is omitted. A brief description of steps 2 and 3 is given in the next paragraphs.

Word sense induction (step 2) is performed with the use of SnS [7, 9], a word sense induction algorithm based on closed frequent sets and a multi-level sense representation. SnS is a knowledge-poor approach, which means it needs neither a structured knowledge base about senses nor algorithms with embedded deep language knowledge. Senses induced by SnS are more readable (more intuitive), mainly because SnS discovers a hierarchy of senses showing the important relationships between them. In other words, the proposed method creates a structure of senses, in which coarse-grained senses contain related sub-senses (fine-grained senses), rather than a flat list of concepts.

In our case the customized SnS algorithm consists of three phases, which we present below.

In Phase I, KRK patterns are discovered from the KRK competence vectors describing study fields. The patterns are closed frequent sets in the KRK competence space. The KRK competence vectors are treated as transactions (itemsets are replaced by sets of KRK symbols) and the mining of closed frequent sets is performed with the use of the CHARM algorithm [15].
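
The paper uses CHARM for this phase; as a sketch of what Phase I computes, the following brute-force enumeration finds the closed frequent sets over illustrative toy transactions (it is exponential and only suitable for tiny inputs, and min_support is an absolute count rather than the 10 % relative support used later).

```python
from itertools import combinations

def closed_frequent_sets(transactions, min_support):
    """Brute-force enumeration of closed frequent itemsets.
    A frequent set is closed if no proper superset has the same support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = frozenset(cand)
            support = sum(1 for t in transactions if s <= t)
            if support >= min_support:
                frequent[s] = support
    # Keep only sets with no proper superset of equal support.
    return {s: sup for s, sup in frequent.items()
            if not any(s < t and frequent[t] == sup for t in frequent)}

# Toy KRK competence vectors treated as transactions.
tx = [frozenset({"S1A_W02", "S1A_W03", "S1A_U01"}),
      frozenset({"S1A_W02", "S1A_W03"}),
      frozenset({"S1A_W02", "S1A_U01"})]
for s, sup in closed_frequent_sets(tx, min_support=2).items():
    print(sorted(s), sup)
```

Here {S1A_W03} is frequent but not closed, because its superset {S1A_W02, S1A_W03} has the same support; CHARM reaches the same closed sets without enumerating the full powerset.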

Phase II is devoted to forming KRK patterns into sense frames, building a hierarchical structure of senses. In some exceptional cases a few sense frames may refer to one sense; this may result from corpus limitations (lack of representativeness and high synonymity of descriptive terms).

In Phase III, sense frames are clustered. The clusters of sense frames are called senses.

The clustering step (3) is performed in two phases: first, simultaneously with sense induction, and then, after sense discovery, for those results that remain ungrouped. The first phase is based on frequent set mining: each discovered closed frequent set has a support value and a list of the results in which it appears. Senses are clusters of sense frames, and each sense frame has a main pattern, so the results (study fields) containing the main pattern are grouped into the corresponding result cluster. Summarizing, for each sense a corresponding cluster of study fields is constructed. Let us note that after this phase is completed there may remain ungrouped study fields. In the second phase, the ungrouped instances are tested iteratively against each of the induced senses and assigned to the closest sense.
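
The two phases above can be sketched as follows, assuming each sense is represented only by its main pattern, that a field matching several senses takes the first match, and that "closest" means largest symbol overlap (all simplifications; the sense labels and data are hypothetical).

```python
def cluster_by_senses(fields, senses):
    """fields: {study field: set of KRK symbols};
    senses: {label: main pattern, i.e. a set of KRK symbols}."""
    clusters = {label: [] for label in senses}
    leftovers = []
    # Phase 1: a field containing a sense's main pattern joins that cluster
    # (simplification: if several main patterns match, the first wins).
    for name, symbols in fields.items():
        matched = [label for label, pattern in senses.items() if pattern <= symbols]
        if matched:
            clusters[matched[0]].append(name)
        else:
            leftovers.append(name)
    # Phase 2: remaining fields go to the sense with the largest overlap.
    for name in leftovers:
        best = max(senses, key=lambda label: len(senses[label] & fields[name]))
        clusters[best].append(name)
    return clusters

senses = {"social": {"S1A_W02", "S1A_W03"}, "methods": {"S1A_W06", "S1A_U01"}}
fields = {
    "sociology": {"S1A_W02", "S1A_W03", "S1A_U01"},
    "statistics": {"S1A_W06", "S1A_U01", "S1A_U02"},
    "misc": {"S1A_W03", "S1A_U07"},
}
clusters = cluster_by_senses(fields, senses)
print(clusters)  # → {'social': ['sociology', 'misc'], 'methods': ['statistics']}
```

Here "misc" contains no main pattern in full, so it is assigned in the second phase to the sense it overlaps most.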

4.4 Study Fields Search Engine

We have built a search engine for study fields whose goal is to find study fields similar to a given one using KRK competence vectors. In the proposed approach the closest study fields in the KRK vector space are presented in a flat list, where each item is also tagged with the number of common KRK competences. Figure 2 presents the top 10 results for the study field ‘Coaching medyczny’ (Table 1).

Fig. 2 KRK-based search engine used to find similar study fields

Table 1 Sample clusters of study fields having the same KRK competences (mirrors)

5 Experiments

5.1 Experimental Setup

Test sets. We conducted our experiments on a corpus created using data from the POLon system (the central system of Polish Higher Education). Some details about this dedicated corpus:

  1. 3550 files assigned to study fields of all universities;

  2. on average 1.3 files per study field;

  3. 2700 study fields have KRK file(s) assigned, but only 2200 of them have files that can be processed automatically, i.e., from which we are able to extract KRK symbols.

5.2 Results

We performed computations using all the approaches described above, starting with flat and hierarchical clustering in order to find clusters of equal or similar study fields in the KRK vector space.

Mirror discovery consists in finding clusters of study fields having the same KRK competences (100 % similarity between the sets of KRK competence symbols). This process is performed in order to find obvious anomalies in the data. We discovered 150 clusters of mirrored competences (the average cluster size is 11); some of them are presented below (anomalies in bold). Each study field is tagged with its POLon id (a database id that disambiguates the study field because, e.g., political science can be taught both at Poznan University and at Warsaw University).
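
Mirror discovery reduces to grouping study fields by identical KRK competence sets, e.g. by hashing each set; the POLon-style ids and vectors below are illustrative.

```python
from collections import defaultdict

def find_mirrors(fields):
    """Group study fields whose KRK competence sets are identical
    (100 % similarity); only groups of size > 1 are mirrors."""
    groups = defaultdict(list)
    for field_id, symbols in fields.items():
        groups[frozenset(symbols)].append(field_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]

# Illustrative POLon-style ids with KRK competence sets.
fields = {
    101: {"S1A_W02", "S1A_U01"},
    102: {"S1A_U01", "S1A_W02"},  # same competences as 101 -> mirror
    103: {"P2A_K01"},
}
print(find_mirrors(fields))  # → [[101, 102]]
```

Because the sets are hashed, this runs in linear time over the 2800 study fields, and every reported group is an anomaly candidate for manual review.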

Table 2 Sample clusters of study fields clustered by HAC (cut-off \(=\) 0.7)
Fig. 3 KRK-based sample tree clusters of patterns

We performed a second experiment aimed at clustering study fields using the HAC algorithm. The HAC clustering exploits single-linkage with cut-off \(=\) 0.7 and a dedicated distance measure between KRK competence vectors. The dedicated KRK similarity measure (krkSim) is a non-Euclidean measure somewhat similar to the Jaccard measure. Given a dictionary D of all KRK competence symbols and two KRK vectors (sets of KRK competence symbols) \(W_i \subseteq D\) and \(W_j \subseteq D\), the measure is expressed by the formula:

$$\begin{aligned} krkSim(W_i, W_j) = \frac{c(W_{i} \cap W_{j})}{max(c(W_{i}),c(W_{j}))} \end{aligned}$$

Here, \(c(X_{i})\) is the cardinality of vector \(X_{i}\) (a vector of KRK competence symbols). Two KRK vectors are semantically close if krkSim is higher than a user-defined threshold (e.g. 70 %). We should emphasize that the proposed measure is always lower than or equal to the cosine measure built on boolean vectors in which each KRK symbol is a dimension: the numerators are equal, but the denominator \(max(c(W_{i}),c(W_{j}))\) is always greater than or equal to the cosine denominator. Experiments with the cosine measure showed that some study fields were clustered too easily, so we decided to make the merging condition stricter. The performed HAC clustering yields 60 clusters (with a diverse size distribution, from 2 to 1000 instances per cluster). Table 2 shows sample clusters retrieved by HAC.
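
The krkSim measure and its relation to the cosine measure can be checked directly: since \(max(c(W_i), c(W_j)) \ge \sqrt{c(W_i)\,c(W_j)}\), krkSim never exceeds the cosine similarity of the corresponding boolean vectors. The sample vectors below are illustrative.

```python
from math import sqrt

def krk_sim(wi, wj):
    """krkSim(Wi, Wj) = |Wi ∩ Wj| / max(|Wi|, |Wj|)."""
    return len(wi & wj) / max(len(wi), len(wj))

def cosine_sim(wi, wj):
    """Cosine similarity of the boolean indicator vectors of two sets:
    the numerator matches krkSim, the denominator is sqrt(|Wi| * |Wj|)."""
    return len(wi & wj) / sqrt(len(wi) * len(wj))

w1 = {"S2A_U08", "S2A_U09", "S2A_W07", "S2A_K05"}
w2 = {"S2A_U08", "S2A_U09"}
print(krk_sim(w1, w2))     # 2 / max(4, 2) = 0.5
print(cosine_sim(w1, w2))  # 2 / sqrt(4 * 2) ≈ 0.707
```

This illustrates why krkSim is the stricter condition: the same pair of vectors passes a 70 % cosine threshold but fails the 70 % krkSim threshold.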

Apart from flat and HAC clustering, we also applied the SnSRC algorithm, which can be treated as a clustering of patterns discovered in the KRK vectors. The SnSRC representation enables us to analyze regularities in the KRK competences and their structure (core competences and their most popular extensions). Given a minimum support of 10 %, we discovered 80 000 patterns (closed frequent sets), which formed almost 4000 sense frames (trees clustering patterns). A sense frame is a multi-hierarchical structure organizing patterns. The root is the main contextual pattern, which serves as a representative label for the sense frame. The main pattern has sub-trees, which are sub-pattern trees. Sub-patterns are supersets of the main pattern, but they can also be in subset relations among themselves, which leads to multi-level sub-pattern trees. Sense frames enable a compact, concise representation of the discovered patterns. Figure 3 shows some patterns clustered into the form of sub-trees. A legend for some of the KRK competence symbols used in Fig. 3 is given below.

  1. \(S1A\_W02\)—a graduate has knowledge about social structures (including institutions)

  2. \(S1A\_W03\)—a graduate has knowledge about relations between social structures

  3. \(S1A\_W05\)—a graduate has knowledge about human beings

  4. \(S1A\_W06\)—a graduate has knowledge about tools and methods used in social analysis

  5. \(S1A\_W07\)—a graduate has knowledge about norms and rules in society

  6. \(S1A\_U01\)—a graduate can interpret social processes

  7. \(S1A\_U02\)—a graduate can use the gained knowledge

  8. \(S1A\_U03\)—a graduate can find out reasons

  9. \(S1A\_U07\)—a graduate can analyse proposed solutions

  10. \(S1A\_U08\)—a graduate can understand social phenomena

6 Conclusions

We have presented several approaches to clustering study fields described by a corpus of semi-structured files (of types pdf and doc(x)). First, we performed information extraction in order to retrieve the KRK competence symbols from the files. Next, we applied various types of clustering methods (flat, hierarchical agglomerative, and pattern-based) whose goal was to identify equal or similar study fields according to their KRK competence vectors. Flat clustering was applied to find mirror clusters (clusters of study fields with equal KRK vectors), a process performed in order to find obvious anomalies in the data. HAC clustering enabled us to identify study fields that teach similar qualifications (the cut-off \(=\) 0.7 was set relatively high to retain consistency within the clusters). The last clustering method was SnSRC, which finds patterns and organizes them into tree structures. This approach makes it possible to discover regularities in the KRK competence vectors and to present them in a tree structure (with a main component and its sub-components). In this step we exploit a novel knowledge-poor WSI algorithm, SnS, based on text mining techniques, namely closed frequent sets. Using significant KRK competence patterns, SnS builds hierarchical structures called sense frames. Finally, study fields are mapped to the sense frames and clustered accordingly. Additionally, we have built a search engine that finds study fields similar to a given one according to KRK competence similarity. All of these deliverables (the search engine and the clustering reports) were provided to the Ministry of Science and Higher Education and were extensively used to detect improper KRK competences and to discover knowledge about study fields (those teaching similar qualifications) and clusters of KRK competence patterns (regularities within the KRK competence vectors).