
1 Introduction

We are living in a time of exponential growth in the production of unstructured data such as web pages, social media posts, and documents. Such a mass of information cannot feasibly be analyzed by humans, so there is a strong drive to develop automatic methods capable of retrieving knowledge from unstructured data.

For over a decade, text clustering has been an active field of research in the machine learning community. Most approaches are based on term occurrence frequency. The performance of such surface-based methods can decrease when the texts are too complex, i.e., ambiguous. One alternative is to use semantics-based approaches that process textual documents according to their meaning. Furthermore, research in text categorization has focused on flat texts, whereas many documents now come in a semi-structured format.

In this paper we present the application of diverse clustering methods to discover knowledge in a repository of semi-structured documents (describing KRK competences). The proposed method extensively exploits information extraction techniques (regular expressions) and information retrieval techniques (flat clustering, hierarchical agglomerative clustering, and pattern mining with grouping of the patterns into a lattice). Our approach is evaluated on a unique dataset describing the KRK education competences of 2800 study fields in the Polish Higher Education System.

2 Preface

Since the late 1990s, the Bologna Process has offered an incentive for many European countries to reform their educational systems and to make them more comparable and explicit. Nowadays, 46 European countries are involved in the process, and one of its outcomes is the European Qualifications Framework (EQF), a system of explicit and comparable qualifications.

The EQF aims to establish a common reference framework as a translation device between different qualification systems and their levels. This framework comprises general, higher and vocational education and training, and should lead to better transparency, comparability and portability of citizens' qualifications (e.g. diplomas, certificates, etc.). The Qualifications Framework strengthens competitiveness by enhancing mobility between the European educational systems and the labour market. Individuals and employers will be able to use the EQF to better understand and compare the qualification levels of different countries.

Within the framework of the Bologna Process, the European ministers of education have agreed that each country should develop its own framework for degrees and qualifications (a qualifications framework). The Polish framework is called Krajowe Ramy Kwalifikacji (the KRK competence framework). KRK competences are specialized descriptions of the abilities/qualifications that students gain after graduating from a given study field. In the Polish Higher Education System each study field should have KRK competences (described in a semi-structured file) assigned to it. This means that study fields are linked to files (pdf/doc) in which the KRK competences are given in a table or in an enumeration. The Polish KRK competence corpus consists of 3550 files and refers to about 2800 study fields.

3 Related Work

This paper touches on two topics: information extraction (semi-structured extraction) and information retrieval (diverse clustering methods).

Information extraction (IE) is the task of automatically extracting structured information from unstructured or semi-structured machine-readable documents. The most popular tasks in IE are named entity recognition, co-reference and relationship identification, table extraction, and terminology extraction. There are various methods for information extraction, such as regular expressions, decision trees, Bayesian methods and Conditional Random Fields. The paper [1] presents various IE methods for extracting scientific profiles from web pages.

Document clustering is a form of unsupervised learning that groups a set of documents into subsets called clusters. The algorithm's goal is to create clusters that are internally coherent but clearly different from each other. In other words, documents within a cluster should be as similar as possible, while documents in one cluster should be as dissimilar as possible from documents in other clusters [11]. The goal of text clustering in information retrieval is to discover groups of semantically related documents. At the root of document clustering lies van Rijsbergen's [14] cluster hypothesis: closely associated documents tend to be relevant to the same requests, whereas documents concerning different meanings of the input query are expected to belong to different clusters.

The key input to a clustering algorithm is the distance measure. Two major classes of distance measures are Euclidean and non-Euclidean. A Euclidean space has some number of real-valued dimensions, and a Euclidean distance is based on the locations of points in such a space. A non-Euclidean distance is based on properties of points, but not on their location in a space. Euclidean distances include the \(L_{1}\) norm (e.g. Manhattan distance) and the \(L_{2}\) norm (the most common notion of distance). Non-Euclidean distances include the Jaccard distance, the cosine distance and the edit distance (the number of insertions and deletions needed to change one string into another).

Clustering methods are usually divided into two groups: flat clustering and hierarchical clustering. The flat approach creates a flat set of clusters without any explicit structure relating the clusters to each other. Hierarchical clustering creates a hierarchy, a structure that is more informative than an unstructured set of clusters. These features of hierarchical clustering come at the cost of lower efficiency. In our applications we used both of these methods.

Approaches to text clustering can be also classified as data-centric or description-centric [3].

The data-centric approach focuses more on the problem of data clustering than on presenting the results to the user. Scatter/Gather [4] is an example: it divides the dataset into a small number of clusters and, after the selection of a group, performs clustering again, proceeding iteratively using the Buckshot-fractionation algorithm. Other data-centric methods use hierarchical agglomerative clustering [10], replacing single terms with lexical affinities (2-grams of words) as features, or exploit link information [17].

Description-centric approaches are more focused on the description that is produced for each cluster of documents. This problem is also called descriptive clustering: the discovery of diverse groups of semantically related documents associated with meaningful, comprehensible and compact text labels. Accurate and concise cluster descriptions (labels) let the user search through the collection's content faster and are essential for various browsing interfaces. The task of creating descriptive, sensible cluster labels is difficult: typical text clustering algorithms rely on samples of keywords for describing the discovered clusters. Among the most popular and successful approaches are phrase-based ones, which form clusters based on recurring phrases instead of numerical frequencies of isolated terms. The STC algorithm employs frequently recurring phrases both as a document similarity feature and as the final cluster description [16]. KeySRC improved the STC approach by adding part-of-speech pruning and dynamic selection of the cut-off level of the clustering dendrogram [2]. The Description-Comes-First (DCF) approach was introduced in [13] as an algorithm called Lingo. Description-Comes-First is a special case of the description-centric approach: it first attempts to find good, conceptually varied cluster labels and then assigns documents to the labels to form groups. The Lingo algorithm combines common phrase discovery and latent semantic indexing techniques to separate search results into meaningful groups. Phrase-based methods usually provide good results, but they are reported to have problems when one topic dominates. Navigli and Crisafulli, and Di Marco and Navigli [5, 6, 12] present a novel approach to snippet clustering based on the automatic discovery of word senses from raw text. The proposed method clusters snippets based on their semantic similarity to the induced query senses.

Inspired by the above description-centric algorithms, we introduced a novel method for clustering web search results based on frequent termset mining [8]. First, we acquire the senses of a query by means of a word sense induction method that identifies meanings as trees of closed frequent termsets. Then we cluster the search results based on their lexical and semantic intersection with the induced senses. We do not use any external corpora; sense induction is performed only on the search results, and the search results are distributed among the matching senses. Finally, we also use some diversification techniques in order to rerank the clusters and their content. This method [8] has been used for clustering KRK competences and the corresponding study fields.

4 Approach

In order to perform the computations we have built dedicated tools for extracting KRK competence symbols, clustering study fields according to their KRK representations, and identifying KRK patterns and structuring them into sub-trees.

4.1 Extraction of KRK Competences

Extraction of KRK competences concerns processing the files that describe each study field with a list of KRK competences. A sample file containing KRK competence symbols is presented in Fig. 1. The files are mainly pdf/doc files containing tables or lists of competence symbols.

Fig. 1 Sample file describing study field ‘edukacja artystyczna’ with KRK competences placed in a table

The information extraction is performed as follows:

  1. File processing using Apache Tika;

  2. KRK competence symbols extraction using regular expressions.

File processing is done in order to fetch the content of a file in text format. The Apache Tika toolkit detects and extracts metadata and text (the file's content) from over a thousand different file types (such as PPT, XLS, DOC(X) and PDF). All of those files are parsed through a single interface, making Tika crucial for search engine indexing and content extraction. In our experiments we used the AutoDetectParser class in order to work with diverse types of files (xls, pdf, doc).

KRK competences are extracted from the retrieved file content using well-defined regular expressions. Analyzing the KRK symbols, we discovered some regularities in the notation, which made it possible to define one general regular expression:

$$\begin{aligned} ((\backslash p\{Upper\})((\backslash w\{1,3\})\_(\backslash w\{3\}))|(A\_(\backslash w\{3\}))) \end{aligned}$$

Summarizing, for each document we build a vector of KRK competence symbols. For the study field Geography at Poznan University the vector of retrieved competences contains 62 symbols and looks like \(<S2A\_U08, S2A\_U09,\ldots \) \(S2A\_W07,\ldots ,S2A\_K05,\ldots , P2A\_K01>\). This vector representation of competences is used to cluster the study fields.
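
The extraction step can be sketched in Python; the sample text is an illustrative assumption, while the pattern is a direct translation of the general regular expression above (Python's re module has no \(\backslash p\{Upper\}\), so [A-Z] stands in for it).

```python
import re

# Translation of the general expression above: an upper-case letter
# followed by 1-3 word characters, an underscore and 3 word characters,
# or the shorter A_xxx form. [A-Z] replaces \p{Upper}, which Python's
# re module does not support.
KRK_PATTERN = re.compile(r"[A-Z]\w{1,3}_\w{3}|A_\w{3}")

def extract_krk_symbols(text):
    """Return the vector (ordered, deduplicated list) of KRK symbols."""
    seen, symbols = set(), []
    for symbol in KRK_PATTERN.findall(text):
        if symbol not in seen:
            seen.add(symbol)
            symbols.append(symbol)
    return symbols

# Hypothetical fragment of a file's extracted content.
sample = "Efekty: S2A_U08, S2A_U09 oraz S2A_W07; ponadto P2A_K01 i S2A_U08."
print(extract_krk_symbols(sample))  # → ['S2A_U08', 'S2A_U09', 'S2A_W07', 'P2A_K01']
```

In the real pipeline the input text would be the content returned by Tika rather than a literal string.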

4.2 Flat and Hierarchical Clustering

Flat clustering is efficient and conceptually simple, but it requires an a priori defined number of clusters as input and is usually nondeterministic. Hierarchical clustering creates a structure containing the history of grouping, which shows the state of the clustering at each iteration. Such a representation is more informative than an unstructured set of clusters. In some applications (such as finding similar study fields in the KRK vector space) we want a partition into disjoint clusters, just as in flat clustering. In those cases the hierarchy needs to be cut at some point. There are two ways to do this: prespecify the number of clusters, or prespecify a level of similarity (the minimum similarity between clusters). Analyzing the hierarchy of clusters makes it much simpler to decide on an acceptable cut-off level (in our case the minimal similarity that allows two clusters to be merged).

Hierarchical clustering is either top-down or bottom-up. Top-down clustering proceeds by splitting the initial cluster (containing all instances) recursively until singletons (individual instances) are reached. The bottom-up approach treats each document as a singleton cluster in the initial step and then merges clusters until all instances are grouped into a single cluster. This approach is called hierarchical agglomerative clustering, or HAC for short. HAC is used more frequently in IR than top-down clustering, so we also used the agglomerative approach in our applications. The HAC algorithm can be used with different linkage criteria: single-linkage, complete-linkage, or average-linkage. Our initial experiments showed that single-linkage, which joins clusters based on the similarity of their most similar members, provides better results than complete-linkage. Complete-linkage clustering suffers from a particular problem with KRK competence vectors: it pays too much attention to outliers, i.e., study fields that do not fit well into the global structure of a cluster.
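
A minimal sketch of threshold-cut, bottom-up single-linkage clustering over set-valued KRK vectors, assuming the overlap-over-max similarity defined in Sect. 5.2; the field names are hypothetical and the quadratic implementation is for illustration only.

```python
def set_sim(a, b):
    """Overlap-over-max similarity of two KRK symbol sets
    (the krkSim measure defined in Sect. 5.2)."""
    return len(a & b) / max(len(a), len(b))

def hac_single_linkage(items, sim, threshold):
    """Bottom-up single-linkage HAC: start from singleton clusters and
    repeatedly merge the pair with the highest single-link similarity
    until it drops below the cut-off threshold."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage: similarity of the most similar members.
                s = max(sim(items[i], items[j])
                        for i in clusters[x] for j in clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break  # cut-off reached: stop merging
        x, y = pair
        clusters[x].extend(clusters[y])
        del clusters[y]
    return clusters

# Hypothetical study fields with KRK competence sets.
fields = {
    "field_a": {"S1A_W02", "S1A_W03", "S1A_U01"},
    "field_b": {"S1A_W02", "S1A_W03", "S1A_U02"},
    "field_c": {"P2A_K01", "P2A_K02", "P2A_W01"},
}
names = list(fields)
groups = hac_single_linkage([fields[n] for n in names], set_sim, threshold=0.6)
print([[names[i] for i in c] for c in groups])  # → [['field_a', 'field_b'], ['field_c']]
```

Swapping the inner max for min would give complete-linkage, which, as noted above, is overly sensitive to outlying study fields.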

We performed two clusterings in our experiments: (1) flat deterministic clustering for mirror discovery, and (2) HAC clustering for the discovery of similar study fields.

4.3 SnS-Based Clustering

We have also applied a dedicated WSI-based web search result clustering method called SnSRC [8]. SnSRC consists of the following four steps: (1) preprocessing of results (transforming them into bags of words), (2) word sense induction, (3) clustering of results, and (4) cluster sorting.

In the first step, for a given query, the relevant documents (snippets) are retrieved and then iteratively processed into a bag-of-words representation. In our case the bag-of-words representation is given initially as the vector of KRK competence symbols (each KRK competence symbol can be treated as a normalized word). Steps 2 and 3 are performed as in the original snippet-clustering version, and the last step (4) is omitted. A brief description of steps 2 and 3 is given in the next paragraphs.

Word sense induction (step 2) is performed with the use of SnS [7, 9], a word sense induction algorithm based on closed frequent sets and a multi-level sense representation. SnS is a knowledge-poor approach, which means it needs neither a structured knowledge base about senses nor algorithms with embedded deep language knowledge. Senses induced by SnS are more readable (more intuitive), mainly because SnS discovers a hierarchy of senses showing the important relationships between them. In other words, the proposed method creates a structure of senses, in which coarse-grained senses contain related sub-senses (fine-grained senses), rather than a flat list of concepts.

In our case the customized SnS algorithm consists of three phases, which we present below.

In Phase I, KRK patterns are discovered from the KRK competence vectors describing study fields. The patterns are closed frequent sets in the KRK competence space. The KRK competence vectors are treated as transactions (itemsets are replaced by sets of KRK symbols) and the mining of closed frequent sets is performed with the use of the CHARM algorithm [15].
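
The paper uses CHARM for this phase; as a sketch of what Phase I computes, the following brute-force enumeration finds the closed frequent sets over illustrative toy transactions (it is exponential and only suitable for tiny inputs, and min_support is an absolute count rather than the 10 % relative support used later).

```python
from itertools import combinations

def closed_frequent_sets(transactions, min_support):
    """Brute-force enumeration of closed frequent itemsets.
    A frequent set is closed if no proper superset has the same support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = frozenset(cand)
            support = sum(1 for t in transactions if s <= t)
            if support >= min_support:
                frequent[s] = support
    # Keep only sets with no proper superset of equal support.
    return {s: sup for s, sup in frequent.items()
            if not any(s < t and frequent[t] == sup for t in frequent)}

# Toy KRK competence vectors treated as transactions.
tx = [frozenset({"S1A_W02", "S1A_W03", "S1A_U01"}),
      frozenset({"S1A_W02", "S1A_W03"}),
      frozenset({"S1A_W02", "S1A_U01"})]
for s, sup in closed_frequent_sets(tx, min_support=2).items():
    print(sorted(s), sup)
```

Here {S1A_W03} is frequent but not closed, because its superset {S1A_W02, S1A_W03} has the same support; CHARM reaches the same closed sets without enumerating the full powerset.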

Phase II is devoted to forming KRK patterns into sense frames, building a hierarchical structure of senses. In some exceptional cases a few sense frames may refer to one sense; this may result from corpus limitations (lack of representativeness and high synonymity of descriptive terms).

In Phase III, sense frames are clustered. The clusters of sense frames are called senses.

The clustering step (3) is performed in two phases: first, simultaneously with sense induction, and then, after sense discovery, for those results that remain ungrouped. The first phase is based on frequent set mining: each discovered closed frequent set has a support value and a list of the results in which it appears. Senses are clusters of sense frames, and each sense frame has a main pattern, so the results (study fields) containing the main pattern are grouped into the corresponding result cluster. Summarizing, for each sense a corresponding cluster of study fields is constructed. Let us note that after this phase is completed there may remain ungrouped study fields. In the second phase, the ungrouped instances are tested iteratively against each of the induced senses and assigned to the closest sense.
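
The two phases above can be sketched as follows, assuming each sense is represented only by its main pattern, that a field matching several senses takes the first match, and that "closest" means largest symbol overlap (all simplifications; the sense labels and data are hypothetical).

```python
def cluster_by_senses(fields, senses):
    """fields: {study field: set of KRK symbols};
    senses: {label: main pattern, i.e. a set of KRK symbols}."""
    clusters = {label: [] for label in senses}
    leftovers = []
    # Phase 1: a field containing a sense's main pattern joins that cluster
    # (simplification: if several main patterns match, the first wins).
    for name, symbols in fields.items():
        matched = [label for label, pattern in senses.items() if pattern <= symbols]
        if matched:
            clusters[matched[0]].append(name)
        else:
            leftovers.append(name)
    # Phase 2: remaining fields go to the sense with the largest overlap.
    for name in leftovers:
        best = max(senses, key=lambda label: len(senses[label] & fields[name]))
        clusters[best].append(name)
    return clusters

senses = {"social": {"S1A_W02", "S1A_W03"}, "methods": {"S1A_W06", "S1A_U01"}}
fields = {
    "sociology": {"S1A_W02", "S1A_W03", "S1A_U01"},
    "statistics": {"S1A_W06", "S1A_U01", "S1A_U02"},
    "misc": {"S1A_W03", "S1A_U07"},
}
clusters = cluster_by_senses(fields, senses)
print(clusters)  # → {'social': ['sociology', 'misc'], 'methods': ['statistics']}
```

Here "misc" contains no main pattern in full, so it is assigned in the second phase to the sense it overlaps most.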

4.4 Study Fields Search Engine

We have built a search engine for study fields whose goal is to find study fields similar to a given one using KRK competence vectors. In the proposed approach the closest study fields in the KRK vector space are presented in a flat list, where each item is also tagged with the number of common KRK competences. Figure 2 presents the top 10 results for the study field ‘Coaching medyczny’ (Table 1).

Fig. 2 KRK-based search engine used to find similar study fields

Table 1 Sample clusters of study fields having the same KRK competences (mirrors)

5 Experiments

5.1 Experimental Setup

Test sets. We conducted our experiments on a corpus created using data from the POLon system (the central system of Polish Higher Education). Some details about this dedicated corpus:

  1. 3550 files assigned to study fields of all universities;

  2. on average 1.3 files per study field;

  3. 2700 study fields have KRK file(s) assigned, but only 2200 of them have files that can be processed automatically, i.e., from which we are able to extract KRK symbols.

5.2 Results

We performed computations using all the approaches described above, starting with flat and hierarchical clustering in order to find clusters of equal or similar study fields in the KRK vector space.

Mirror discovery consists in finding clusters of study fields having the same KRK competences (100 % similarity between the sets of KRK competence symbols). This process is performed in order to find obvious anomalies in the data. We discovered 150 clusters of mirrored competences (the average cluster size is 11); some of them are presented below (anomalies in bold). Each study field is tagged with its POLon id (a database id that disambiguates the study field because, e.g., political science can be taught both at Poznan University and at Warsaw University).
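
Mirror discovery reduces to grouping study fields by identical KRK competence sets, e.g. by hashing each set; the POLon-style ids and vectors below are illustrative.

```python
from collections import defaultdict

def find_mirrors(fields):
    """Group study fields whose KRK competence sets are identical
    (100 % similarity); only groups of size > 1 are mirrors."""
    groups = defaultdict(list)
    for field_id, symbols in fields.items():
        groups[frozenset(symbols)].append(field_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]

# Illustrative POLon-style ids with KRK competence sets.
fields = {
    101: {"S1A_W02", "S1A_U01"},
    102: {"S1A_U01", "S1A_W02"},  # same competences as 101 -> mirror
    103: {"P2A_K01"},
}
print(find_mirrors(fields))  # → [[101, 102]]
```

Because the sets are hashed, this runs in linear time over the 2800 study fields, and every reported group is an anomaly candidate for manual review.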

Table 2 Sample clusters of study fields clustered by HAC (cut-off \(=\) 0.7)
Fig. 3 KRK-based sample tree clusters of patterns

We performed a second experiment aimed at clustering study fields using the HAC algorithm. The HAC clustering exploits single-linkage with cut-off \(=\) 0.7 and a dedicated distance measure between KRK competence vectors. The dedicated KRK similarity measure (krkSim) is a non-Euclidean measure somewhat similar to the Jaccard measure. Given a dictionary D of all KRK competence symbols and two KRK vectors (sets of KRK competence symbols) \(W_i \subseteq D\) and \(W_j \subseteq D\), the measure is expressed by the formula:

$$\begin{aligned} krkSim(W_i, W_j) = \frac{c(W_{i} \cap W_{j})}{max(c(W_{i}),c(W_{j}))} \end{aligned}$$

Here, \(c(X_{i})\) is the cardinality of vector \(X_{i}\) (a vector of KRK competence symbols). Two KRK vectors are semantically close if krkSim is higher than a user-defined threshold (e.g. 70 %). We should emphasize that the proposed measure is always lower than or equal to the cosine measure built on boolean vectors in which each KRK symbol is a dimension: the numerators are equal, but the denominator \(max(c(W_{i}),c(W_{j}))\) is always greater than or equal to the cosine denominator. Experiments with the cosine measure showed that some study fields were clustered too easily, so we decided to make the merging condition stricter. The performed HAC clustering yields 60 clusters (with a diverse size distribution, from 2 to 1000 instances per cluster). Table 2 shows sample clusters retrieved by HAC.
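
The krkSim measure and its relation to the cosine measure can be checked directly: since \(max(c(W_i), c(W_j)) \ge \sqrt{c(W_i)\,c(W_j)}\), krkSim never exceeds the cosine similarity of the corresponding boolean vectors. The sample vectors below are illustrative.

```python
from math import sqrt

def krk_sim(wi, wj):
    """krkSim(Wi, Wj) = |Wi ∩ Wj| / max(|Wi|, |Wj|)."""
    return len(wi & wj) / max(len(wi), len(wj))

def cosine_sim(wi, wj):
    """Cosine similarity of the boolean indicator vectors of two sets:
    the numerator matches krkSim, the denominator is sqrt(|Wi| * |Wj|)."""
    return len(wi & wj) / sqrt(len(wi) * len(wj))

w1 = {"S2A_U08", "S2A_U09", "S2A_W07", "S2A_K05"}
w2 = {"S2A_U08", "S2A_U09"}
print(krk_sim(w1, w2))     # 2 / max(4, 2) = 0.5
print(cosine_sim(w1, w2))  # 2 / sqrt(4 * 2) ≈ 0.707
```

This illustrates why krkSim is the stricter condition: the same pair of vectors passes a 70 % cosine threshold but fails the 70 % krkSim threshold.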

Apart from flat and HAC clustering, we also applied the SnSRC algorithm, which can be treated as a clustering of patterns discovered in the KRK vectors. The SnSRC representation enables us to analyze regularities in the KRK competences and their structure (core competences and their most popular extensions). Given a minimum support of 10 %, we discovered 80 000 patterns (closed frequent sets), which formed almost 4000 sense frames (trees clustering patterns). A sense frame is a multi-hierarchical structure organizing patterns. The root is the main contextual pattern, which serves as a representative label for the sense frame. The main pattern has sub-trees, which are sub-pattern trees. Sub-patterns are supersets of the main pattern, but they can also be in subset relations among themselves, which leads to multi-level sub-pattern trees. Sense frames enable a compact, concise representation of the discovered patterns. Figure 3 shows some patterns clustered into the form of sub-trees. A legend for some of the KRK competence symbols used in Fig. 3 is given below.

  1. \(S1A\_W02\)—a graduate has knowledge about social structures (including institutions)

  2. \(S1A\_W03\)—a graduate has knowledge about relations between social structures

  3. \(S1A\_W05\)—a graduate has knowledge about human beings

  4. \(S1A\_W06\)—a graduate has knowledge about tools and methods used in social analysis

  5. \(S1A\_W07\)—a graduate has knowledge about norms and rules in society

  6. \(S1A\_U01\)—a graduate can interpret social processes

  7. \(S1A\_U02\)—a graduate can use the gained knowledge

  8. \(S1A\_U03\)—a graduate can find out reasons

  9. \(S1A\_U07\)—a graduate can analyse proposed solutions

  10. \(S1A\_U08\)—a graduate can understand social phenomena

6 Conclusions

We have presented several approaches to clustering study fields described by a corpus of semi-structured files (of types pdf and doc(x)). First, we performed information extraction in order to retrieve the KRK competence symbols from the files. Next, we applied various types of clustering methods (flat, hierarchical agglomerative, and pattern-based) whose goal was to identify equal or similar study fields according to their KRK competence vectors. Flat clustering was applied to find mirror clusters (clusters of study fields with equal KRK vectors), a process performed in order to find obvious anomalies in the data. HAC clustering enabled us to identify study fields that teach similar qualifications (the cut-off \(=\) 0.7 was set relatively high to retain consistency within the clusters). The last clustering method was SnSRC, which finds patterns and organizes them into tree structures. This approach makes it possible to discover regularities in the KRK competence vectors and to present them in a tree structure (with a main component and its sub-components). In this step we exploit a novel knowledge-poor WSI algorithm, SnS, based on text mining techniques, namely closed frequent sets. Using significant KRK competence patterns, SnS builds hierarchical structures called sense frames. Finally, study fields are mapped to the sense frames and clustered accordingly. Additionally, we have built a search engine that finds study fields similar to a given one according to KRK competence similarity. All of these deliverables (the search engine and the clustering reports) were provided to the Ministry of Science and Higher Education and were extensively used to detect improper KRK competences and to discover knowledge about study fields (those teaching similar qualifications) and clusters of KRK competence patterns (regularities within the KRK competence vectors).