1 Introduction

“Semantic publishing refers to publishing information on the Web as documents accompanied by semantic markup” using RDFa or Microformats, or to publishing information as data objects using Semantic Web technologies such as RDF and OWL. One of the areas where semantic publishing is actively used is scholarly publishing, where it helps improve scientific communication “by enabling linking to semantically related articles, provides access to data within the article in actionable form, or facilitates integration of data between papers” [9].

This paper does not aim to survey the state of the art of semantic publishing of scientific research; instead, we refer the reader to existing works [1, 5, 6, 9] and to the papers presented at the series of Workshops on Semantic Publishing for a more in-depth overview.

This paper contributes to the semantic publishing of scientific research by converting a well-known website for publishing workshop proceedings into a Linked Data dataset. The work is carried out in the framework of the Semantic Publishing Challenge 2015 and is based on a previous effort [3], which we extend by improving the precision and recall of the information extraction and by refining the ontology model.

Source Data. The source of the data is CEUR-WS.org, which has published workshop proceedings since 1995 and is very popular in the Computer Science community. At the time of writing, it contains information about 1346 proceedings (around 130 are added each year), over 19 000 papers, and more than 33 000 people.

Challenges. As described in previous work [3], extracting the needed information from CEUR-WS’s web pages faces several challenges, among them:

  • the web pages do not have uniform structured markup, so it is not feasible to rely on a single template for mapping the data to RDF,

  • 41.5 % of the proceedings’ web pages do not contain any markup such as RDFa or Microformats, and even the pages that have markup do not always follow its structure and semantics,

  • a considerable share of the proceedings is published jointly by several workshops, e.g. http://ceur-ws.org/Vol-1244/ includes papers of the ED2014 and GViP2014 workshops.

Table 1. Namespaces and prefixes used in the paper
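
  Prefix    Namespace
  bibo      http://purl.org/ontology/bibo/
  owl       http://www.w3.org/2002/07/owl#
  rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
  rdfs      http://www.w3.org/2000/01/rdf-schema#
  swc       http://data.semanticweb.org/ns/swc/ontology#
  swrc      http://swrc.ontoware.org/ontology#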

Structure of the Paper. The rest of the paper is organized as follows. Section 2 presents our approach. Section 3 explains the ontology model and the mappings to some well-known ontologies. Section 4 gives an overall view of the dataset, lists example SPARQL queries, and describes how the dataset is published and how users can access the data. The last section concludes the work. The prefixes used throughout the paper are defined in Table 1.

2 System Description

In this work we apply the knowledge engineering approach to the design of Information Extraction systems, in which the extraction rules are constructed by hand using knowledge of the application domain [2].

Although this approach is laborious, the results of the previous challenge showed that its performance is much higher than that of the alternatives [4]. The system submitted last year reached an overall average precision/recall of 0.707/0.636, while the next best result was 0.478/0.447.

Fig. 1. Workflow of the conversion of CEUR-WS.org to Linked Data

The system developed to convert CEUR-WS.org to a Linked Data dataset implements the workflow outlined in Fig. 1. The workflow consists of three major steps:

  • crawling the web pages and serializing the extracted information to RDF,

  • processing the resulting RDF dump to merge resources of persons with similar names, e.g. Dusan Kolář and Dusan Kolar refer to the same person and should therefore be represented by a single resource,

  • applying the mapping ontology to link the data to well-known ontologies.

The source code is open source and available at https://github.com/ailabitmo/ceur-ws-lod under the MIT License.

Crawling. In the system, the extraction rules are expressed as XPath expressions which together constitute a template of an HTML block. The system has a separate template for each distinct HTML block found on the web site’s pages. The templates are run by a crawler implemented using the Grab framework, which provides a Python API for creating crawlers.

There are two abstract templates which are not used by the crawler directly but serve as the basis for the other templates: Parser and ListParser. The difference between them is that ListParser is used for repeatable structures, such as the table of contents of a proceedings volume or the list of proceedings on the index page.

The crawler groups the templates by page type, such as index, proceedings, and publication; there are 11 templates in total. Table 2 presents all these templates together with the regular expressions used to categorize the pages.

Table 2. Templates grouped by the web site’s pages
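
As an illustration, the categorization of pages might be implemented as follows. This is a minimal Python sketch; the regular expressions are assumptions based on CEUR-WS.org’s URL scheme, not the exact expressions from Table 2.

    import re

    # Illustrative page categorization; the patterns are assumptions
    # based on CEUR-WS.org URL conventions (Vol-NNN volume pages).
    PAGE_TEMPLATES = {
        'index':       re.compile(r'^http://ceur-ws\.org/?$'),
        'proceedings': re.compile(r'^http://ceur-ws\.org/Vol-\d+/?$'),
        'publication': re.compile(r'^http://ceur-ws\.org/Vol-\d+/.+\.pdf$'),
    }

    def categorize(url):
        """Return the page type whose pattern matches the URL, if any."""
        for page_type, pattern in PAGE_TEMPLATES.items():
            if pattern.match(url):
                return page_type
        return None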

Each such template is a Python class which extends the Parser or ListParser class and has one or more methods whose names start with the parse_template_ prefix. The crawler executes these methods one by one until one of them matches the HTML block; the matching method then extracts the information and passes it on for serialization.
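
A minimal sketch of such a template is shown below. The real system uses the Grab framework’s selectors; here lxml is used to keep the example self-contained, and all names except Parser and the parse_template_ prefix are illustrative assumptions.

    # Illustrative sketch only; the actual templates live in the
    # ceur-ws-lod repository and are built on the Grab framework.
    from lxml import html


    class Parser(object):
        """Base template: tries parse_template_* methods until one matches."""

        def parse(self, page_source):
            tree = html.fromstring(page_source)
            for name in sorted(dir(self)):
                if name.startswith('parse_template_'):
                    result = getattr(self, name)(tree)
                    if result is not None:  # this method matched the block
                        return result
            return None


    class ProceedingsParser(Parser):
        """Hypothetical template for a proceedings page."""

        def parse_template_voltitle(self, tree):
            # XPath expressions constitute the template of one HTML
            # block; the concrete expression here is illustrative.
            titles = tree.xpath('//span[@class="CEURVOLTITLE"]/text()')
            if not titles:
                return None  # no match, the crawler tries the next method
            return {'volume_title': titles[0].strip()}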

Name Disambiguation. At the post-processing step, the system disambiguates people’s names by fuzzy matching of token-sorted name strings, using the fuzzywuzzy library. For each pair of names in the dataset we performed the following operations:

  1. String normalization: convert to an ASCII representation and make lowercase.

  2. Split the name string into tokens using whitespace as the separator and sort the tokens.

  3. Perform fuzzy string matching between the token-sorted strings.

Entities with similar names were interlinked with the owl:sameAs property and exported as a separate file.
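
A minimal sketch of the matching step follows. The fuzzywuzzy library and the three operations come from the description above, while the similarity threshold of 90 is an assumption, as the text does not state the value used.

    # Sketch of the name-disambiguation step; the threshold is assumed.
    import unicodedata
    from itertools import combinations

    from fuzzywuzzy import fuzz


    def normalize(name):
        # 1. String normalization: ASCII representation, lowercase.
        ascii_name = (unicodedata.normalize('NFKD', name)
                      .encode('ascii', 'ignore').decode('ascii').lower())
        # 2. Tokenize on whitespace and sort the tokens.
        return ' '.join(sorted(ascii_name.split()))


    def same_person(a, b, threshold=90):
        # 3. Fuzzy string matching between the token-sorted strings.
        return fuzz.ratio(normalize(a), normalize(b)) >= threshold


    # O(n^2) pairwise comparison over all names in the dataset.
    names = [u'Dusan Kol\u00e1\u0159', u'Dusan Kolar', u'Someone Else']
    pairs = [(a, b) for a, b in combinations(names, 2) if same_person(a, b)]
    # pairs now contains the single match ('Dusan Kolář', 'Dusan Kolar')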

We do not have tools to estimate the correctness of the person interlinking, so we only performed a manual validation of the output file. The results are generally good, with two caveats. First, the algorithm has \(O(n^2)\) complexity, and comparing all names took more than 12 hours. Second, due to the nature of fuzzy string matching, the algorithm recognized a group of 32 persons with Asian names as a single person. This is caused by common names and surnames, such as “Li”, and the short length of the name-surname combination: the string matching algorithm often returns a high similarity measure in such cases.

Mapping to Well-Known Ontologies. The last step is to map the ontology used by the system to several well-known ontologies. For this purpose, a processor based on the Jena Inference API was implemented, which supports several RDFS and OWL constructs, such as rdfs:subClassOf, rdfs:subPropertyOf, rdf:type, owl:equivalentClass and owl:sameAs.
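
The Jena Inference API is a Java library; as a hedged substitute, the same post-processing idea can be sketched in Python with rdflib and owlrl (the file names are illustrative assumptions, not the system’s actual artifacts).

    # Substitute sketch: materializing entailments with rdflib + owlrl
    # instead of the Jena Inference API used by the actual system.
    import owlrl
    from rdflib import Graph

    g = Graph()
    g.parse('ceur-ws-dump.ttl', format='turtle')  # crawled data (assumed name)
    g.parse('mappings.ttl', format='turtle')      # mapping ontology (assumed name)

    # Expand the graph under OWL-RL semantics, which covers the
    # constructs listed above: rdfs:subClassOf, rdfs:subPropertyOf,
    # rdf:type, owl:equivalentClass and owl:sameAs.
    owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)
    g.serialize('ceur-ws-dump-mapped.ttl', format='turtle')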

3 Ontology Model

We considered three ontologies as the basis for the semantic representation of the crawled data:

  • Semantic Web Conference Ontology (SWC) is an ontology for describing academic conferences,

  • Semantic Web for Research Communities (SWRC) is an ontology for modeling entities of research communities, such as persons, organisations, and publications, and their relationships,

  • Bibliographic Ontology (BIBO) is an ontology providing the main concepts and properties for describing citations and bibliographic references (e.g. quotes, books, articles).

Unfortunately, none of these ontologies alone is sufficient to fully represent the structure of the crawled information. For example, BIBO cannot express that an event is part of a bigger event, and with SWRC we cannot explicitly state how many pages a publication has, since swrc:pages describes a page range, e.g. 255–259; SWC reuses SWRC and thus shares the same limitations, while not introducing entities relevant to our work. This is by no means a full list of the gaps in these ontologies, but a detailed ontology comparison is out of the scope of this paper, so we refer the reader to existing works [7, 8]. Based on this admittedly subjective evaluation, we decided to use SWRC as much as possible and to add terms from other ontologies only where SWRC lacks the needed semantics. The structure of the resulting ontology is shown in Fig. 2.

Fig. 2. Semantic representation of the crawled data

We used the SWC ontology only once, to mark a paper as invited by making it an individual of the class swc:InvitedPaper. SWC is the weakest of the three ontologies in terms of semantic richness: it has far fewer properties than the others, some classes have names like “Event-1” and “Role-1”, many classes are deprecated, and, last but not least, the official site of the ontology is not accessible, so we were forced to download it from a third-party site that does not contain the imports SWC depends on. All these factors suggest that development of this ontology was halted before it reached a consistent, usable state, which is why we tried to avoid using it in our work.

The concept of a “series of events” is not described in any of these ontologies, so we chose to link workshops of the same series with the rdfs:seeAlso property. To keep things consistent, we decided to use the same approach to link series of proceedings rather than introduce an additional bibo:Series class.
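
For illustration, two editions of a workshop series would be linked as follows in Turtle; the resource URIs are placeholders, and the bidirectionality of the links is an assumption.

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/> .  # placeholder namespace

    # Two editions of one workshop series, linked as described above.
    ex:SemPub2014 rdfs:seeAlso ex:SemPub2015 .
    ex:SemPub2015 rdfs:seeAlso ex:SemPub2014 .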

3.1 Mapping to Well-Known Ontologies

To compensate for semantic inconsistencies in the resulting dataset introduced by the use of properties and classes from different ontologies, we created mappings between the ontologies with the owl:equivalentProperty and owl:equivalentClass properties. We interlinked only the BIBO and SWRC ontologies, as SWC already has some dependencies on SWRC.

The mappings follow the pattern sketched below.
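
An illustrative excerpt in Turtle: these class and property pairs exist in both BIBO and SWRC, but they are not necessarily the exact axioms used by the system.

    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix swrc: <http://swrc.ontoware.org/ontology#> .
    @prefix bibo: <http://purl.org/ontology/bibo/> .

    # Illustrative axioms only, not the paper's full list.
    swrc:Proceedings owl:equivalentClass    bibo:Proceedings .
    swrc:editor      owl:equivalentProperty bibo:editor .
    swrc:pages       owl:equivalentProperty bibo:pages .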

4 Overview of the Dataset

Publishing. The data is published using a Linked Data Fragments [10] server and is available at http://data.isst.ifmo.ru. Users can query the data with SPARQL through a Linked Data Fragments client. The data is also available as an HDT dump in the GitHub repository.

Statistics. The dataset includes 402 648 triples and 55 893 subjects. The distribution of resource types is depicted in Fig. 3.

Fig. 3. Distribution of resource types in the dataset (# of triples – 402 648)

In absolute numbers, the dataset includes information about 1 344 proceedings, 1 360 workshops, 18 875 regular and 203 invited papers, 252 conferences, and 33 859 persons, of whom 2 657 are editors.

4.1 Example Queries

This section presents several SPARQL queries that provide some interesting insights into the dataset.

Query 1. Top-10 persons who were editors of the highest number of workshop series:

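The original listing is not available to us; the following is a hedged sketch which simplifies the task by counting edited proceedings volumes rather than traversing the rdfs:seeAlso series links, and which assumes the dataset uses swrc:Proceedings and swrc:editor.

    PREFIX swrc: <http://swrc.ontoware.org/ontology#>

    SELECT ?editor (COUNT(DISTINCT ?proc) AS ?volumes)
    WHERE {
      ?proc a swrc:Proceedings ;
            swrc:editor ?editor .
    }
    GROUP BY ?editor
    ORDER BY DESC(?volumes)
    LIMIT 10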

Query 2. Top-10 workshops with the highest number of authors:

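A hedged sketch, assuming papers are linked to authors via swrc:author, to their proceedings via dcterms:isPartOf, and proceedings to workshops via bibo:presentedAt; the actual predicates in the dataset may differ.

    PREFIX swrc:    <http://swrc.ontoware.org/ontology#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX bibo:    <http://purl.org/ontology/bibo/>

    SELECT ?workshop (COUNT(DISTINCT ?author) AS ?authors)
    WHERE {
      ?paper swrc:author ?author ;
             dcterms:isPartOf ?proc .
      ?proc bibo:presentedAt ?workshop .
    }
    GROUP BY ?workshop
    ORDER BY DESC(?authors)
    LIMIT 10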

Query 3. Latest workshops of the top-10 workshop series with the longest history:

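A hedged sketch, assuming each workshop points to all earlier editions of its series via rdfs:seeAlso, so that the latest edition of the longest series has the most outgoing links; the actual modeling may differ.

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX bibo: <http://purl.org/ontology/bibo/>

    SELECT ?workshop (COUNT(?earlier) AS ?editions)
    WHERE {
      ?workshop a bibo:Workshop ;
                rdfs:seeAlso ?earlier .
    }
    GROUP BY ?workshop
    ORDER BY DESC(?editions)
    LIMIT 10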

5 Conclusion

In this paper we described a system that converts CEUR-WS.org, a well-known website for publishing proceedings of academic events, into a Linked Data dataset. We also described the semantic representations (ontologies) used to create the dataset. The system is based on the knowledge engineering approach to the design of Information Extraction systems.

To give an overview of the resulting dataset, we provided statistics such as the number of papers and proceedings, and presented example SPARQL queries that draw some interesting insights from the extracted information.

The presented system was developed in the framework of the Semantic Publishing Challenge 2015 and is based on previous work [3], which we extended with richer semantic representations and improved in terms of precision and recall.