1 Introduction

In recent years, online recruitment has expanded significantly [1, 2]. This expansion has led to continuous growth in the number of job portals and hiring agencies on the Internet [3, 4], and to a steady increase in the number of job seekers searching for new career opportunities [5, 6]. Accordingly, online job portals now receive thousands of resumes (in diverse styles and formats) from job seekers who have different fields of expertise and specialize in different domains [7]. Several approaches have been proposed to support the automatic matching between candidate resumes and their corresponding job offers [8, 9, 10]. Examples of these approaches are automatic keyword-based resume matching techniques [8], machine learning-based approaches [9], and semantics-based techniques [10]. The main goal of these approaches is to achieve high precision, i.e., to find the best candidates for a given job post, while ignoring the cost (run-time complexity) of the matching process. Other systems have attempted to reduce this cost by segmenting the content of both resumes and job posts and matching the important segments of each, instead of matching the whole content of resumes and job posts. For instance, the authors of [11, 12] propose using machine learning algorithms (Support Vector Machines (SVM) [11] and Hidden Markov Models (HMM) [12]) to automatically extract structured information from job posts and resumes by annotating their segments with the appropriate features and topics, while the authors of [8] use Natural Language Processing (NLP) techniques to implement the segmentation and information extraction module. Although these approaches have proved to be more efficient in carrying out the matching task [8], every newly obtained resume still needs to be matched against all job offers in the corpus.
To overcome this issue, other researchers propose utilizing machine learning-based techniques to classify job posts and resumes into occupational categories prior to conducting the matching task [13]. However, as pointed out in [2], these techniques suffer from low precision and produce high error rates. To address the issues associated with the techniques highlighted above, we present a hybrid approach that employs conceptual-based classification of resumes and job postings and automatically ranks the candidate resumes that fall under each occupational category against their corresponding job postings. We summarize the contributions of our work as follows:

  • Automatic Occupational Category based Classification of Resumes and Job Postings.

  • Employing a Section-based Segmentation heuristic by exploiting an integrated occupational categories knowledge base.

The remainder of this paper is organized as follows. In Sect. 2, we review related work. Section 3 describes the overall architecture of the proposed system. Experimental validation of the effectiveness and efficiency of the proposed system is presented in Sect. 4. In Sect. 5, we present our conclusions and outline future work.

2 Related Work

Many techniques have been proposed to precisely match candidate resumes with their corresponding job offers [14, 15]. However, little attention has been paid to the problems associated with the automatic classification of resumes and job posts [16]. For instance, when an employer seeks a “Web Developer”, a position that falls under the “Web Development” occupational category, conventional systems search the entire space of resumes to find the applicants that best match the offered position. In this context, every resume in the collection is matched against the offered job post, instead of matching only those that fall under the corresponding occupational category (“Web Development” in this scenario). To address this issue, the authors of [12] have proposed a resume Information Extraction (IE) approach based on a Cascaded Hybrid Model. This system employs HMM and SVM classification algorithms to annotate segments of resumes with the appropriate category, taking advantage of the contextual structure of resumes, in which related information units usually occur in the same textual segments. Accordingly, resumes pass through two layers. In the first layer, an HMM is applied to segment the entire resume into consecutive blocks, and each block is annotated with a category such as Personal Information, Education, or Research Experience. In the second layer, an SVM is applied to extract detailed information from the blocks that have been labeled as Education and Personal Information. However, a large fraction of the results produced by this system suffer from low precision, since the information extraction process passes through two loosely coupled stages. Another system, E-gen [17], has been built to automate the recruitment process by segmenting and classifying job posts. First, job posts are transformed into vector space representations.
Then, an SVM classification algorithm is used to annotate segments of job posts with the appropriate topics and features. A correction algorithm is further applied, because some segments are incorrectly classified during the classification process [17]. The main drawback of this system is the time needed to pre-process and post-process job posts in order to minimize the error and maximize the matching probability. The JobDiSC system [18], on the other hand, attempts to classify job openings automatically by employing a standard classification scheme called the Dictionary of Occupational Titles (DOT). The system comprises three main modules: (1) a Parser/Analyzer, which creates an unclassified job opening for each job listing captured from electronic forms prepared by employers; (2) a Learning System, which automatically generates classification rules from a set of pre-classified job openings; and (3) a Classifier, which assigns one or more classes to each job post depending on its confidence level for each potential class. The main drawback of this system is that DOT’s usefulness has waned, since it does not cover the occupational information that is most relevant to the modern workplace [19].

3 Overview of the Architecture of the Proposed System

In this section, we present an overview of the architecture of the proposed system and discuss its main constituents.

As shown in Fig. 1, when a user submits a CV, the system directs it to the Section-based Segmentation module, which extracts personal, education, and experience information and employment history, in addition to a list of candidate matching concepts. After that, the Filtration module refines the concept lists by removing concepts that have low tf-idf [8] weights and concepts that do not contribute to the semantics of each segment. Next, the Classification module takes a set of skills from both resumes and job posts as input in order to classify them into their corresponding occupational categories. At this step, we exploit an integrated knowledge base that combines two main semantic resources: DICE and O*NET. More details on these resources are provided in Sect. 3.1. Then, the Category-based Matching module takes the lists of concepts from both resumes and job posts and constructs semantic networks by deriving the semantic relatedness between the concepts using semantic resources. Finally, the matching algorithm takes the semantic networks as input and produces measures of the semantic closeness between them as output. The following sections detail the steps carried out in each module of the system.

Fig. 1. Overall architecture of the proposed system
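The filtration step described above can be illustrated with a minimal sketch. It assumes that each resume segment is treated as one "document" for the purpose of computing tf-idf, and that concepts whose weight falls below a cut-off are dropped. The function name `tfidf_filter`, the `min_weight` parameter, and the sample concepts are hypothetical; the exact weighting scheme and threshold used by the system are not specified here.

```python
import math
from collections import Counter


def tfidf_filter(segments, min_weight=0.2):
    """Keep only concepts whose tf-idf weight reaches min_weight.

    segments: dict mapping a segment name to its list of concepts.
    Each segment is treated as one document (an assumption of this sketch).
    """
    n_segments = len(segments)

    # Document frequency: in how many segments does each concept occur?
    doc_freq = Counter()
    for concepts in segments.values():
        doc_freq.update(set(concepts))

    filtered = {}
    for name, concepts in segments.items():
        term_freq = Counter(concepts)
        total = len(concepts)
        kept = []
        for concept, count in term_freq.items():
            idf = math.log(n_segments / doc_freq[concept])
            weight = (count / total) * idf
            if weight >= min_weight:
                kept.append(concept)
        filtered[name] = kept
    return filtered


if __name__ == "__main__":
    sample = {
        "education": ["university", "computer science", "team"],
        "experience": ["java", "java", "team"],
        "skills": ["java", "android", "team"],
    }
    # "team" occurs in every segment, so its idf (and weight) is zero.
    print(tfidf_filter(sample))
```

A concept that appears in every segment carries no discriminative information under this scheme, which is one simple way to realize the idea of removing concepts that do not contribute to the semantics of a segment.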

3.1 Section-Based Segmentation and Conceptual Classification Modules

During this phase, segments such as Education and Experience, together with employment information such as Company name, Applicant role in the company, Date of designation, Date of resignation, and Loyalty, are extracted automatically. The system can then match segments of resumes to the relevant segments of job posts instead of matching whole resumes against whole job posts. To this end, unstructured resumes are converted into segments (a semi-structured document) using Natural Language Processing (NLP) techniques and rule-based regular expressions. As detailed in [7], the NLP steps are: document splitting, n-gram tokenization, stop word removal, Part-of-Speech Tagging (POST), and Named Entity Recognition (NER). Table 1 shows an example that illustrates the process of segmenting a sample resume.
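As an illustration of the rule-based part of this step, the following sketch splits a plain-text resume into sections using regular expressions that recognize section headers. The header patterns in `SECTION_HEADERS`, the function name `segment_resume`, and the sample resume are assumptions made for this example only; the actual rules of the system, and its NLP steps (tokenization, POS tagging, NER), are not reproduced here.

```python
import re

# Hypothetical header patterns; a real rule set would be much larger.
SECTION_HEADERS = {
    "education": re.compile(
        r"^\s*(education|academic background|qualifications)\s*:?\s*$", re.I),
    "experience": re.compile(
        r"^\s*(experience|work experience|employment history)\s*:?\s*$", re.I),
    "skills": re.compile(
        r"^\s*(skills|technical skills)\s*:?\s*$", re.I),
}


def segment_resume(text):
    """Split resume text into named segments based on header lines.

    Lines before the first recognized header go to the "personal" segment.
    """
    segments = {"personal": []}
    current = "personal"
    for line in text.splitlines():
        for name, pattern in SECTION_HEADERS.items():
            if pattern.match(line):
                # A header line switches the current segment.
                current = name
                segments.setdefault(current, [])
                break
        else:
            # Not a header: attach non-empty lines to the current segment.
            if line.strip():
                segments[current].append(line.strip())
    return segments


if __name__ == "__main__":
    resume = (
        "Jane Doe\n"
        "jane@example.com\n"
        "Education\n"
        "BSc Computer Science, 2012\n"
        "Work Experience:\n"
        "Acme Corp, Android Developer, 2013-2016\n"
        "Skills\n"
        "Java, Android\n"
    )
    for name, lines in segment_resume(resume).items():
        print(name, lines)
```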

Table 1. Results of the Section-based Segmentation Module

In order to classify both resumes and job posts, we utilize an integrated knowledge base that combines the Dice skills center (henceforth DICE) and a standardized hierarchy of occupational categories known as the Occupational Information Network (henceforth O*NET). We use DICE to classify skills that belong to the Information and Communication Technologies (ICT) and economy fields, because we found empirically that O*NET is not scalable enough for our classification needs. Furthermore, some skill acronyms are not classified correctly in O*NET. In contrast to DICE, however, O*NET is better able to classify skills that are related to the Medical and Artistic fields. Table 2 shows a comparative analysis of the DICE and O*NET classifications.

Table 2. Comparative Analysis between DICE and O*NET Classification

As shown in Table 2, some skill acronyms are not recognized by O*NET and are therefore not classified correctly. For instance, JPA, which refers to the “Java Persistence API”, is classified under the “Accountants” category by O*NET. Conversely, terms such as “Radiography” and “Medical analysis” are not classified in DICE, but are classified correctly under the “Radiologic Technicians” and “Medical and clinical Laboratory” categories in O*NET.
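One plausible way to combine the two resources, given the observations above, is to consult DICE first and fall back to O*NET only when DICE has no entry for a skill. The sketch below shows this lookup order on toy data. The dictionaries `DICE` and `ONET` and the function `lookup_categories` are hypothetical stand-ins; in particular, the DICE category listed for “jpa” is a placeholder chosen for illustration, and the precedence rule itself is an assumption rather than a documented property of the system.

```python
# Toy stand-ins for the two resources (keys are lower-cased skill names).
DICE = {
    "android": ["Software Development/Mobile Development"],
    # Placeholder category, for illustration only.
    "jpa": ["Software Development/Application Development"],
}

ONET = {
    "jpa": ["Accountants"],  # the misclassification discussed in the text
    "radiography": ["Radiologic Technicians"],
    "medical analysis": ["Medical and clinical Laboratory"],
}


def lookup_categories(skill, dice=DICE, onet=ONET):
    """Return candidate categories for a skill, preferring DICE over O*NET."""
    key = skill.strip().lower()
    if key in dice:
        return dice[key]
    # Fall back to O*NET; unknown skills yield an empty list.
    return onet.get(key, [])


if __name__ == "__main__":
    for skill in ["JPA", "Radiography", "COBOL"]:
        print(skill, "->", lookup_categories(skill))
```

With this ordering, an ICT acronym such as JPA is resolved by DICE before O*NET can assign it to an unrelated category, while medical terms that DICE does not cover are still classified by O*NET.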

3.1.1 Skill-Based Resume Classification Module

In this module, each skill in the skills set is submitted sequentially to the skills knowledge base in order to obtain a list of candidate job categories. As shown in Fig. 2, the skill “android” is submitted first. For this skill, the knowledge base returns one occupational category, namely “Software Development/Mobile Development”. Next, the rest of the skills in the skills set are submitted to the knowledge base. As one skill may return zero, one, or more than one occupational category, the result is a list of weighted occupational categories sorted by descending weight. Accordingly, for the skills set obtained from CV1, the occupational category “Web Development” gets the highest weight, followed by “Software Development/Application Development” and then “Software Development/Mobile Development”.

Fig. 2. List of occupational categories generated for the sample CV

To produce weights for the occupational categories, we use the following algorithm.

In this algorithm, the skills are submitted to the skills knowledge base one at a time, and an occupational category is returned for each skill (Line 5). If the same occupational category is returned for more than one skill, the algorithm increases the weight of that category; otherwise, it sets its weight to 1 (Lines 8, 11 and 12). Finally, the algorithm returns the list of weighted occupational categories (Line 15). Table 3 shows each occupational category together with its corresponding skills.
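The weighting procedure described above can be sketched as follows. This is a minimal illustration, not a reproduction of the original listing, so the line numbers cited in the text do not apply to it. The toy knowledge base `TOY_KB`, the helper `toy_lookup`, and the function `weigh_categories` are hypothetical; only the mapping of “android” to “Software Development/Mobile Development” is taken from the example in the text, and the remaining entries are invented so that the resulting order matches the one reported for CV1.

```python
from collections import Counter

# Hypothetical knowledge base: skill -> list of occupational categories.
TOY_KB = {
    "android": ["Software Development/Mobile Development"],
    "html": ["Web Development"],
    "css": ["Web Development"],
    "javascript": ["Web Development"],
    "java": ["Software Development/Application Development"],
    "c#": ["Software Development/Application Development"],
}


def toy_lookup(skill):
    """Return the categories for a skill (empty list if unknown)."""
    return TOY_KB.get(skill.strip().lower(), [])


def weigh_categories(skills, lookup):
    """Count how many skills map to each category, highest weight first."""
    weights = Counter()
    for skill in skills:
        # A skill may return zero, one, or several categories.
        for category in lookup(skill):
            # First occurrence sets the weight to 1; later ones increase it.
            weights[category] += 1
    return weights.most_common()


if __name__ == "__main__":
    cv_skills = ["android", "html", "css", "javascript", "java", "c#", "pottery"]
    for category, weight in weigh_categories(cv_skills, toy_lookup):
        print(weight, category)
```

Passing the lookup function as a parameter keeps the counting logic independent of the knowledge base behind it.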

Table 3. Occupational categories returned for the CV used in Example 1

3.1.2 Job Post Classification Module

In the Job Post Classification module, we use both the job title and the required skills from the structured job post for classification purposes. First, the job post is pre-processed and filtered by removing noisy information such as city names and state and country acronyms that appear in the job title or job details. After that, we use the skills knowledge base to classify job posts in the same manner as we do for resumes. In doing so, we assign weights (Job Title = 70% and Required Skills = 30%), since we believe that the job title is more significant than the required skills and leads to better matching results. More examples of the results of this module are presented in Sect. 4.2.
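The 70%/30% weighting can be illustrated with a short sketch based on the “Unity Developer” example reported in Sect. 4.2. The sketch assumes that the 30% skills share is divided evenly among the classified skills, which is consistent with that example but is not stated explicitly. The function `classify_job_post`, the toy knowledge base `TOY_KB`, and the helper `toy_lookup` are hypothetical names introduced for illustration.

```python
from collections import Counter

TITLE_WEIGHT = 0.70   # weight of the job title
SKILLS_WEIGHT = 0.30  # weight shared by all required skills

IM = "Software Development/Interactive Multimedia"
MD = "Software Development/Mobile Development"

# Toy knowledge base built from the "Unity Developer" example in Sect. 4.2.
TOY_KB = {
    "unity developer": [IM],
    "3d": [IM],
    "unity": [IM],
    "objective-c": [MD],
    "xcode": [MD],
}


def toy_lookup(term):
    """Return the categories for a title or skill (empty list if unknown)."""
    return TOY_KB.get(term.strip().lower(), [])


def classify_job_post(title, skills, lookup):
    """Return (category, weight) pairs for a job post, highest weight first."""
    weights = Counter()

    # The title contributes 70%, split evenly if it maps to several categories.
    title_categories = lookup(title)
    for category in title_categories:
        weights[category] += TITLE_WEIGHT / len(title_categories)

    # Skills share 30% evenly (assumption); unclassified skills are ignored.
    classified = [cats for cats in (lookup(s) for s in skills) if cats]
    for cats in classified:
        share = SKILLS_WEIGHT / len(classified)
        for category in cats:
            weights[category] += share / len(cats)

    return weights.most_common()


if __name__ == "__main__":
    post_skills = ["3D", "Unity", "Objective-C", "Xcode"]
    for category, weight in classify_job_post("Unity Developer", post_skills, toy_lookup):
        print(f"{weight:.0%} {category}")
```

Under these assumptions, the title contributes 70% to “Software Development/Interactive Multimedia”, two of the four skills add another 15%, and the remaining two skills contribute 15% to “Software Development/Mobile Development”.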

3.2 Matching Resumes and Their Corresponding Job Postings

Following the approach proposed by the authors of [7], we use the same semantic resources (the WordNet ontology [20] and the YAGO2 ontology [21]) and statistical concept-relatedness measures to derive the semantic aspects of resumes and job posts. In addition, we consider further weighting parameters, such as a loyalty parameter (the candidate’s degree of devotion to the companies that he or she has worked for or is currently working for), in order to increase the effectiveness of the matching process. We also use a dynamic threshold value to handle the loyalty parameter fairly, as shown in the following scoring formula:

$$ S = \frac{\left| \{ \mathit{Sr} \} \right|}{\left| \{ \mathit{RSj} \} \right|} \times 50\% + \frac{\left| \{ \mathit{Er} \} \right|}{\left| \{ \mathit{REj} \} \right|} \times 20\% + \frac{\left| \{ \mathit{Xr} \} \right|}{\left| \{ \mathit{RXj} \} \right|} \times 20\% + \frac{\left| \sum \mathit{Yw} \right|}{\left| \sum \mathit{Cw} \right|} \times 10\% \tag{1} $$

Where:

  • S: is the relevance score result.

  • Sr: is the set of applicant’s skills.

  • RSj: the required skills in the job post.

  • Er: is the set of concepts that describe applicant educational information.

  • REj: is the set of concepts from the required educational information in job post.

  • Xr: set of concepts that describe applicant experience information.

  • RXj: concepts that represent the required experience information in the job post.

  • Yw: the total number of employment years.

  • Cw: number of companies that the applicant worked in.

As shown in the formula, we have set the following weighting values:

Skills weight = 50%, Educational level weight = 20%, Job experience weight = 20% and Loyalty level weight = 10%. The results of using the scoring formula are detailed in the next section (Sect. 4.3).
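To make the arithmetic of Eq. (1) concrete, the following sketch computes a relevance score under several simplifying assumptions that go beyond what the formula states: each ratio counts only the applicant concepts that also appear among the required ones (exact string matching instead of semantic relatedness); an empty requirement counts as fully satisfied; and the loyalty term, i.e., total employment years divided by the number of companies, is normalized by a hypothetical cap `loyalty_cap_years` that stands in for the dynamic threshold, whose definition is not given here. The function names and sample values are illustrative.

```python
def coverage(applicant_concepts, required_concepts):
    """Fraction of required concepts that the applicant covers (0.0 to 1.0)."""
    required = set(required_concepts)
    if not required:
        return 1.0  # assumption: nothing required means fully satisfied
    return len(set(applicant_concepts) & required) / len(required)


def relevance_score(skills, required_skills,
                    education, required_education,
                    experience, required_experience,
                    years_per_company, loyalty_cap_years=5.0):
    """Weighted score: 50% skills, 20% education, 20% experience, 10% loyalty.

    years_per_company: list with the years spent at each company.
    loyalty_cap_years: hypothetical normalization cap (not from the paper).
    """
    total_years = sum(years_per_company)       # corresponds to Yw
    n_companies = len(years_per_company)       # corresponds to Cw
    avg_tenure = total_years / n_companies if n_companies else 0.0
    loyalty = min(avg_tenure / loyalty_cap_years, 1.0)

    return (0.50 * coverage(skills, required_skills)
            + 0.20 * coverage(education, required_education)
            + 0.20 * coverage(experience, required_experience)
            + 0.10 * loyalty)


if __name__ == "__main__":
    score = relevance_score(
        skills={"java", "android", "sql"},
        required_skills={"java", "android", "kotlin", "git"},
        education={"bsc computer science"},
        required_education={"bsc computer science"},
        experience={"android development"},
        required_experience={"android development", "team lead"},
        years_per_company=[3, 2],
    )
    print(f"{score:.2f}")
```

In this example the applicant covers 2 of 4 required skills (0.25), the single education requirement (0.20), and 1 of 2 experience requirements (0.10), and has an average tenure of 2.5 years against a cap of 5 (0.05), which sums to 0.60.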

4 Experimental Results

This section describes the experiments that we have conducted to evaluate the efficiency and the effectiveness of the proposed system. In order to evaluate the accuracy of the proposed system, we collected a data set of 2000 resumes downloaded from:

and used 10,000 different job posts obtained from:

The collected resumes are unstructured documents in different formats, such as .pdf and .doc, whereas we treat job posts as structured documents with the following sections: job title, description, required skills, required years and field of experience, required education qualification, and additional desired requirements. The experiments with our system prototype show that the classification process for the training data of resumes and job posts took 6 h on average on a PC with a dual-core CPU (2.1 GHz) and 4 GB of RAM, running Windows 8.1.

4.1 Execution Time for Matching Resumes with Corresponding Job Post with/Without Classification

In this section, we compare the results produced by our system with those of the MatchingSem system [7], which is a semantics-based automatic recruitment system. As shown in Fig. 3, our system, Job Resume Classifier (JRC), achieved lower execution times than MatchingSem. This is because, unlike MatchingSem, we only match job posts with resumes that fall under the same occupational category, instead of searching the entire space of resumes. For instance, finding the best candidate for the “Front-End Developer” job post took 6 h of execution time using MatchingSem, while it took only 1 h in JRC, since only resumes that fall under the “Web Development” category were considered in the matching process.

Fig. 3. Cost (run-time complexity in hours) of the matching process
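The reason for the reduced cost can be shown with a small sketch: resumes are indexed by occupational category once, and a job post is then compared only with the resumes stored under its own category. The function names `build_category_index` and `candidates_for` are hypothetical, as is the resume “CV_X”; the category assignments of CV1 and CV3 follow the example discussed in Sect. 4.3.

```python
from collections import defaultdict


def build_category_index(resume_categories):
    """Map each occupational category to the resumes classified under it."""
    index = defaultdict(list)
    for resume_id, categories in resume_categories.items():
        for category in categories:
            index[category].append(resume_id)
    return index


def candidates_for(job_category, index):
    """Return only the resumes that share the job post's category."""
    return index.get(job_category, [])


if __name__ == "__main__":
    resumes = {
        "CV1": ["Mobile Development", "Web Development"],
        "CV3": ["Web Development", "Mobile Development"],
        "CV_X": ["Industry-specific/Microsoft Office"],  # hypothetical resume
    }
    index = build_category_index(resumes)
    # Only CV1 and CV3 are passed on to the (expensive) semantic matching.
    print(candidates_for("Mobile Development", index))
```

The expensive semantic matching then runs on the size of one category rather than on the whole collection, which is the effect measured in Fig. 3.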

4.2 Experiments of Job Post Classification

In this section, we present the results of job post classification. As mentioned in Sect. 3.1.2, we use the job title and the required skills in the classification process. Table 4 lists 7 job posts that we use to illustrate the classification process.

Table 4. Job post classification results

As shown in Table 4, the “Front End Web Developer” job post falls under the “Software Development/Web Development” occupational category with a weight of 100%. This is because, when we submit the job title to our skills knowledge base, it returns the “Software Development/Web Development” category with a weight of 70%, and when we then submit the required skills, all of them fall under the same category with a weight of 30%. The “Unity Developer” job post, however, falls under the “Software Development/Interactive Multimedia” category with a weight of 85%: the job title contributes 70% under this category, and the “3D and unity” skills fall under the same category with a weight of 15%, whereas the “Objective-C, Xcode” skills fall under the “Software Development/Mobile Development” category with a weight of 15%. Similarly, the “Data Entry Assistant” job post falls under three categories: “Industry-specific/General skills” with a weight of 79%, “Industry-specific/Microsoft Office” with a weight of 12%, and “Software Development/Web Development” with a weight of 9%.

4.3 Precision Results of Matching Resumes with Corresponding Job Post

In this section, we evaluate the effectiveness of our system using the precision indicator. For each job post, we compare the manually assigned scores with the corresponding scores produced automatically by the system. Table 5 shows the precision results of matching resumes with their corresponding job posts.

Table 5. Precision results of matching resumes with corresponding job post by the system

As shown in Table 5, we match job posts with the resumes that fall under the same occupational categories. For instance, the “Android Developer” job post is matched only with resumes that fall under the “Mobile Development” category. As such, CV1 and CV3 are matched with both the “Android Developer” and the “Web Developer” job posts, because these CVs fall under both the “Mobile Development” and the “Web Development” categories. However, the matching score differs from one job post to another. For instance, CV3 achieved a very low matching score when matched with the “Android Developer” job post (0.05 manual score, 0.09 automatic score), whereas CV1 achieved a better score for the same job post (0.8 manual score, 0.8 automatic score). On the other hand, CV3 achieved better results than CV1 when it was matched with the “Web Developer” job post (0.70 manual score, 0.75 automatic score), because CV3 falls under “Web Development” with a weight of 80% and under “Mobile Development” with a weight of 35%.

In order to validate our proposal and evaluate the quality of the results produced by our system, we have compared it with one of the previously proposed systems, MatchingSem [7]. Table 6 shows the results produced by our system compared to those of MatchingSem.

Table 6. Comparison with MatchingSem System

As shown in Table 6, for the job title “Android Developer” and the three CVs, the quality of the results (namely, the precision) produced by our system is higher than that of the MatchingSem system. The reason is that, unlike MatchingSem, we integrate a section-based segmentation module to extract features such as educational background, years of experience, and employment information from applicants’ resumes. When we incorporate these features, the matching scores produced by our system are better than those obtained when using only a list of candidate concepts, as proposed in MatchingSem.

5 Conclusions and Future Work

In this work, we have introduced a hybrid approach that employs conceptual-based classification of resumes and job postings and automatically matches candidate resumes to the corresponding job postings that fall under the same occupational category. The proposed system first uses NLP techniques and regular expressions to segment the resumes and extract a set of skills that are used in the classification process. Next, the system exploits an integrated skills knowledge base to carry out the classification task. The experiments conducted using this knowledge base demonstrate that the proposed classification module helps to achieve higher precision in less execution time than conventional approaches. In future work, we plan to use the information extracted from applicants’ resumes to dynamically generate user profiles, which can then be used to recommend jobs to job seekers.