CEUR-WS-LOD: Conversion of CEUR-WS Workshops to Linked Data

Kolchin, Maxim; Cherny, Eugene; Kozlov, Fedor; Shipilo, Alexander; Kovriguina, Liubov

doi:10.1007/978-3-319-25518-7_12

Maxim Kolchin¹⁴,
Eugene Cherny^14,15,
Fedor Kozlov¹⁴,
Alexander Shipilo¹⁶ &
…
Liubov Kovriguina¹⁴

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 548))

Included in the following conference series:

Semantic Web Evaluation Challenges

756 Accesses
9 Citations

Abstract

CEUR-WS.org is a well-known place for publishing proceedings of workshops and very popular among Computer Science community. Because of that it’s an interesting source for different kinds of analytics, e.g. measurement of workshop series popularity or person’s contribution to the field by organizing workshops and etc. For realizing an insightful and effective analytics one needs to combine information from different places that can supplement each other. And this brings a lot of challenges which can be mitigated by using Semantic Web technologies.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Extraction and Semantic Annotation of Workshop Proceedings in HTML Using RML

SAD Generator: Eating Our Own Dog Food to Generate KGs and Websites for Academic Events

Challenges for Semantically Driven Collaborative Spaces

Keywords

1 Introduction

“Semantic publishing refers to publishing information on the Web as documents accompanied by semantic markup”^{Footnote 1} using RDFa or Microformats, or by publishing information as data objects using Semantic Web technologies such as RDF and OWL. One of the areas where semantic publishing is actively used is scholarly publishing, where it helps bring improvements to scientific communication “by enabling linking to semantically related articles, provides access to data within the article in actionable form, or facilitates integration of data between papers” [9].

We don’t aim to survey the state-of-art of semantic publishing of scientific research in this paper, but we suggest to look at the existing works [1, 5, 6, 9] and papers presented at the series of Workshops on Semantic Publishing^{Footnote 2} for more in-depth overview.

This paper presents a contribution to semantic publishing of scientific research by conversion of a well-known web-site for publishing proceedings of workshops to Linked Data dataset. The work is carried out in framework of Semantic Publishing Challenge 2015^{Footnote 3}, is based on the previous effort [3] extended by improving precision/recall of the information extraction and the ontology model.

Source Data. The source of data is CEUR-WS.org that publishes proceedings of workshops starting from 1995^th year and is very popular among Computer Science community. At the time of writing, it contains information about 1346 proceedings and around 130 ones are added each year, over 19 000 papers and more than 33 000 people.

Challenges. As was described in the previous work [3], extraction of the needed information from the CEUR-WS’s web pages faces several challenges, some of them:

the web pages don’t have uniform structured markup, therefore it’s not feasible to relay on a single template for mapping data to RDF,
41.5 % of proceedings’ web pages don’t contain any markup, such as RDFa or Microformats. But even pages having the markup don’t always follow its structure and semantics,
a big part of the proceedings are jointly published by several workshops, e.g. http://ceur-ws.org/Vol-1244/ includes papers of ED2014 and GViP2014 workshops.

Table 1. Namespaces and prefixes used in the paper

Full size table

Structure of the Paper. The structure of the paper is as follows. Section 2 presents our approach. Section 3 explains the ontology model and mappings to some well-known ontologies. Section 4 gives an overall view of the dataset and lists SPARQL query examples. Also Sect. 4 describes how the dataset is published and how users can access the data. The last section concludes the work and results. The prefixes used throughout the paper are defined in Table 1.

2 System Description

In this work we apply knowledge engineering approach to the design of Information Extraction systems which requires expression of rules for the system are constructed by hand using knowledge of the application domain [2].

Although this approach is laborious, the results of the previous challenge shown that it’s performance much higher than the others [4]. The system submitted last year reached overall average precision/recall equal to 0.707/0.636 correspondingly while the next best result was 0.478/0.447.

The system developed to convert CEUR-WS.org to Linked Data dataset implements the workflow outlined in Fig. 1. The workflow consist of three major steps:

crawling the web pages and serializing the extracted information to RDF,
processing the resulted RDF dump to merge resources of persons with similar names, e.g. Dusan Kolář and Dusan Kolar is actually the same person, therefore he should be represented by a single resource,
applying the mapping ontology to link the data to well-known ontologies.

The source code is open sourced and available at https://github.com/ailabitmo/ceur-ws-lod under the MIT License.

Crawling. In the system the rules are expressed using XPath expressions which constitute a template of an HTML block. The system has a separate template for each different HTML block presented on the web site’s pages. These templates are run by the crawler implemented using Grab framework^{Footnote 4} that provides Python API for creating crawlers.

There are two abstract templates which aren’t used by the crawler directly, but are used by the other templates as basis: Parser and ListParser. The difference between them is that ListParser is used for repeatable structures such as Table of Content of proceedings or list of proceedings on the index page.

The crawler groups the templates by the web site pages, such as index, proceedings, publication. There are 11 templates. In Table 2 all these templates with corresponding RegExp expressions that is used to categorize the pages are presented.

Table 2. Templates grouped by the web site’s pages

Full size table

Each such template is a Python class which extends Parser or ListParser classes and has one or more methods having parse_template_ string as prefix in its name. The crawler executes these methods one by one while one of them matches the HTML block. After that the method extracts the information and passes it for the serialization.

Name Disambiguation. At the post-processing step the system does the disambiguation of the peoples’ names by fuzzy-matching sorted of tokenized name-string. The fuzzywuzzy ^{Footnote 5} library was used for this task. For each pair of names in the dataset we have performed the following operations:

1
String normalization: convert to ASCII representation, make lowercase.
2
Split name string into tokens using whitespace separator and sort tokens in string.
3
Perform fuzzy string matching between token-sorted strings.

Entities that have similar names were interlinked with owl:sameAs property and exported as separate file^{Footnote 6}.

We do not have tools to estimate correctness of the persons’ interlinking, thus we only performed manual validation of the output file^{Footnote 7}. The results in general are good, except two moments. First, the algorithm has the \(O(n^2)\) complexity and it took more than 12 hours to perform comparison of all names. Second, due to the nature of fuzzy string matching, the algorithm recognized a group of 32 persons with Asian names as one. This is due to the common names and surnames, such as “Li”, and short lengths of the name-surname combination—the string matching algorithm often returns high similarity measure in such occasions.

Mapping to Well-Know Ontologies. The last step is to map the ontology used by the system to several well-known ontologies. To do it a parser based on Jena Inference API^{Footnote 8} was implemented which supports several RDFS and OWL constructs such as rdfs:subClassOf, rdfs:subPropertyOf, rdf:type, owl:equivalentClass, owl:sameAs and etc.

3 Ontology Model

We considered three ontologies for use as the basis of semantic representation of the crawled data:

Semantic Web Conference Ontology (SWC) is an ontology for describing academic conferences,
Semantic Web for Research Communities (SWRC) is an ontology for modeling entities of research communities such as persons, organisations, publications and their relationship,
Bibliographic ontology (BIBO) is an ontology providing main concepts and properties for describing citations and bibliographic references (i.e. quotes, books, articles, etc.).

Unfortunately, each of those ontologies alone are not sufficient to fully represent the structure of crawled information. For example, BIBO doesn’t have an “event is part of bigger event” semantics and with SWRC we can’t explicitly describe how many pages are in a publication, as swrc:pages could be used for describing page region, e.g. 255–259; SWC reuses SWRC, thus they share the same limitations, and SWC does not introduce entities relevant for our work. Of course, this is not full list of all incompletenesses of those ontologies, but we think that detailed ontology comparison is out of scope of this paper, therefore we refer the reader to existing works [7, 8]. Thus, based on subjective evaluation we decided to use SWRC as much as possible and add terms from other ontologies only if SWRC does not contain needed semantics. The structure of resulting ontology is represented on the Fig. 2.

We used SWC ontology only once to mark a paper as invited one, making it an individual of class swc:InvitedPaper, because SWC is the poorest of those three ontologies in terms of semantic richness: the number of properties is much lesser than in others, some classes have names like “Event-1” and “Role-1”, a lot of them are deprecated, and, last but not least, official site of ontology is not accessible, so we were forced to download the ontology from the third-party site^{Footnote 9}, which doesn’t contain imports SWC depends on. All those factors suggest that development of this ontology was halted before reaching consistent usable state—this is why we tried to avoid using it in our work.

A concept “series of events” is not described in any of these ontologies, thus we choose to link workshops of the same series with the rdfs:seeAlso property. To keep things consistent we decided to use this approach to link “series of proceedings” and not to make additional bibo:Series class.

3.1 Mapping to Well-Know Ontologies

To compensate semantic inconsistencies in the resulting data set introduced by usage of properties and classes from different ontologies, we created the mappings between ontologies with owl:eqivalentProperty and owl:eqivalentClass properties. We interlinked only BIBO and SWRC ontologies, as SWC already has some dependencies on SWRC.

The full list of the mappings:

4 Overview of Dataset

Publishing. The data is published using a Linked Data Fragments [10] server and available at http://data.isst.ifmo.ru. The users can use a Linked Data Fragments client for querying the data using SPARQL language. Or the data is also available as an HDT^{Footnote 10} dump in the GitHub repository^{Footnote 11}.

Statistics. The dataset includes 402 648 triples and 55 893 subjects. The distribution of resource types are depicted on Fig. 3.

In absolute numbers the dataset includes information about 1 344 proceedings, 1 360 workshops, 18 875 regular and 203 invited papers, 252 conferences, 33 859 persons with 2 657 editors.

4.1 Example Queries

In this section several SPARQL queries are presented which provide some interesting insights.

Query 1. Top-10 persons how was an editor of the highest number of workshop series:

Query 2. Top-10 workshops with the highest number of authors:

Query 3. Latest workshops of top-10 workshop series with the longest history.

5 Conclusion

In this paper we described a system that converts a well-known web-site for publishing proceedings of academic events, called CEUR-WS.org, to Linked Data dataset. Also we described semantic representations (ontologies) that are used to create the dataset. The system is based on knowledge engineering approach to design Information Extraction systems.

To overview the resulted dataset we introduced some statistical information, such as amount of papers and proceedings. Also we presented example SPARQL queries which provide some interesting insights from the extracted information.

The presented system is developed in the framework of Semantic Publishing Challenge \({2015^5}\) and based on the previous work [3] which was extended with richer semantic representations and was improved in terms of precision and recall.

Notes

1.
Cf. http://en.wikipedia.org/wiki/Semantic_publishing.
2.
Cf. http://ceur-ws.org/Vol-1155/.
3.
Cf. http://github.com/ceurws/lod/wiki/SemPub2015.
4.
Cf. http://grablib.org/.
5.
Cf. https://github.com/seatgeek/fuzzywuzzy.
6.
Cf. https://github.com/ailabitmo/ceur-ws-lod/releases/download/ceur-ws-crawler-v1.0.0/task-1-persons-sameas.ttl.
7.
Cf. https://github.com/ailabitmo/ceur-ws-lod/blob/master/ceur-ws-crawler/postprocessing/merged_persons.json.
8.
Cf. https://jena.apache.org/documentation/inference/index.html.
9.
Cf. http://lov.okfn.org/dataset/lov/vocabs/swc.
10.
Cf. http://www.rdfhdt.org/.
11.
Cf. http://github.com/ailabitmo/ceur-ws-lod.

References

Auer, S., Lange, C., Ermilov, T.: Towards facilitating scientific publishing and knowledge exchange through linked data. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds.) TPDL 2013. CCIS, vol. 416, pp. 10–15. Springer, Heidelberg (2014). http://dx.doi.org/10.1007/978-3-319-08425-1_2
Google Scholar
Eikvil, L.: Information extraction from world wide web - a survey. Technical report, July 1999. http://user.phil-fak.uni-duesseldorf.de/rumpf/SS2003/Informationsextraktion/Pub/Eik99.pdf
Kolchin, M., Kozlov, F.: A template-based information extraction from web sites with unstable markup. In: Presutti, V., et al. (eds.) SemWebEval 2014. CCIS, vol. 475, pp. 89–94. Springer, Heidelberg (2014). http://dx.doi.org/10.1007/978-3-319-12024-9_11
Google Scholar
Lange, C., Di Iorio, A.: Semantic publishing challenge – assessing the quality of scientific output. In: Presutti, V., et al. (eds.) SemWebEval 2014. CCIS, vol. 475, pp. 61–76. Springer, Heidelberg (2014). http://dx.doi.org/10.1007/978-3-319-12024-9_8
Google Scholar
Nevzorova, O., Zhiltsov, N., Zaikin, D., Zhibrik, O., Kirillovich, A., Nevzorov, V., Birialtsev, E.: Bringing math to LOD: a semantic publishing platform prototype for scientific collections in mathematics. In: Alani, H., et al. (eds.) ISWC 2013, Part I. LNCS, vol. 8218, pp. 379–394. Springer, Heidelberg (2013). http://dx.doi.org/10.1007/978-3-642-41335-3_24
Chapter Google Scholar
Peroni, S.: Semantic Publishing: issues, solutions and new trends in scholarly publishing within the Semantic Web era. Ph.D. thesis, Universit di Bologna (2012). http://dx.doi.org/10.6092/unibo/amsdottorato/4766
Peroni, S., Shotton, D.: FaBiO and CiTO: ontologies for describing bibliographic resources and citations. Web Semant. Sci. Serv. Agents World Wide Web 17, 33–43 (2012). http://www.sciencedirect.com/science/article/pii/S1570826812000790
Article Google Scholar
Ruiz Iniesta, A., Corcho, O.: A review of ontologies for describing scholarly and scientific documents. http://ceur-ws.org/Vol-1155#paper-07
Shotton, D.: Semantic publishing: the coming revolution in scientific journal publishing. Learn. Publ. 22(2), 85–94 (2009). http://www.ingentaconnect.com/content/alpsp/lp/2009/00000022/00000002/art00002
Article Google Scholar
Verborgh, R., et al.: Querying datasets on the web with high availability. In: Mika, P., et al. (eds.) ISWC 2014, Part I. LNCS, vol. 8796, pp. 180–196. Springer, Heidelberg (2014). http://dx.doi.org/10.1007/978-3-319-11964-9_12
Google Scholar

Download references

Acknowledgments

This work has been partially financially supported by the Government of Russian Federation, Grant #074-U01.

Author information

Authors and Affiliations

ITMO University, Saint-petersburg, Russia
Maxim Kolchin, Eugene Cherny, Fedor Kozlov & Liubov Kovriguina
Åbo Akademi University, Turku, Finland
Eugene Cherny
Saint-Petersburg State University, Saint-petersburg, Russia
Alexander Shipilo

Authors

Maxim Kolchin
View author publications
You can also search for this author in PubMed Google Scholar
Eugene Cherny
View author publications
You can also search for this author in PubMed Google Scholar
Fedor Kozlov
View author publications
You can also search for this author in PubMed Google Scholar
Alexander Shipilo
View author publications
You can also search for this author in PubMed Google Scholar
Liubov Kovriguina
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Liubov Kovriguina .

Editor information

Editors and Affiliations

Inria, Sophia Antipolis, France
Fabien Gandon
INRIA Sophia-Antipolis Méditerranée, Sophia Antipolis, France
Elena Cabrio
Université Paris-Sorbonne, Paris, France
Milan Stankovic
École des Mines de Saint-Étienne, Saint-Étienne, France
Antoine Zimmermann

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Kolchin, M., Cherny, E., Kozlov, F., Shipilo, A., Kovriguina, L. (2015). CEUR-WS-LOD: Conversion of CEUR-WS Workshops to Linked Data. In: Gandon, F., Cabrio, E., Stankovic, M., Zimmermann, A. (eds) Semantic Web Evaluation Challenges. SemWebEval 2015. Communications in Computer and Information Science, vol 548. Springer, Cham. https://doi.org/10.1007/978-3-319-25518-7_12

Download citation

DOI: https://doi.org/10.1007/978-3-319-25518-7_12
Published: 07 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25517-0
Online ISBN: 978-3-319-25518-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics