1 Background

Heterogeneous information networks (HINs), such as DBLP [5], YAGO [8], and DBpedia [1], have recently received a lot of attention. These data sources, containing a vast number of inter-related facts, facilitate the discovery of interesting knowledge [4, 6, 7]. Figure 1(a) illustrates an HIN, which describes the relationship among entities of different types (e.g., author, paper, venue and topic). For example, Jiawei Han (\(a_2\)) has written a VLDB paper (\(p_{2,2}\)), which mentions the topic “efficient” (\(t_3\)).

Fig. 1.
figure 1

HIN, meta paths, and meta structures.

Given two HIN objects a and b, the evaluation of their relevance is of fundamental importance. This quantifies the degree of closeness between a and b. In Fig. 1(a), Jian Pei (\(a_1\)) and Jiawei Han (\(a_2\)) have a high relevance score, since they have both published papers with keyword “mining” in the same venue (KDD). Relevance finds its applications in information retrieval, recommendation, and clustering [9, 10]: a researcher can retrieve papers that have high relevance in terms of topics and venues in DBLP; in YAGO, relevance facilitates the extraction of actors who are close to a given director. As another example, in entity resolution applications, duplicated HIN object pairs having high relevance scores (e.g., two different objects in an HIN referring to the same real-world person) can be identified and removed from the HIN.

Relevance Computation. In this tutorial, we will explore different ways of computing the relevance between two graph objects, for instance, neighborhood-based measures, such as common neighbors and Jaccard’s coefficient; graph-theoretic measures based on random walks, such as Personalized PageRank and SimRank. These measures do not consider object and edge type information in an HIN. We will discuss the concept of meta paths [4, 9]. A meta path is a sequence of object types with edge types between them. Figure 1(b) illustrates a meta path \(\mathcal {P}_1\), which states that two authors (\(A_1\) and \(A_2\)) are related by their publications in the same venue (V). Another meta path \(\mathcal {P}_2\) says that two authors have written papers containing the same topic (T). We will discuss several meta-path-based relevance measures, including PathCount, PathSim, and Path Constrained Random Walk (PCRW) [4, 9]. These measures have been shown to be better than those that do not consider object and edge type information.

We will further discuss meta structures, recently proposed in [3], to depict the relationship of two graph objects. This is essentially a directed acyclic graph of object and edge types. Figure 1(b) illustrates a meta structure \(\mathcal {S}\), which depicts that two authors are relevant if they have published papers in the same venue, and have also mentioned the same topic. A meta path (e.g., \(\mathcal {P}_1\) or \(\mathcal {P}_2\)) is a special case of a meta structure. However, a meta path fails to capture such complex relationship that can be conveniently expressed by a meta structure (e.g., \(\mathcal {S}\)). We will discuss how meta structures can be used to formulate three relevance definitions, as well as their efficient calculation.

Meta Path Discovery. There are often a huge number of meta paths between a pair of HIN objects. This can be very difficult, even for a domain expert, to identify the right meta paths. We will discuss a meta path discovery algorithm, recently proposed by [6], where users provide example instances of source and target objects through a Query-by-Example paradigm, to derive meta paths automatically. We will demonstrate a HIN search engine prototype based on this algorithm.

Query Recommendation. We will study the use of meta paths in query recommendation, where queries are suggested to web search users based on their previous query histories. As studied in [2], it is possible to use a knowledge base (a HIN) and its related meta-paths to perform effective query recommendation. The approach is especially useful to long-tail queries that rarely appear in query logs.

2 Proposed Schedule

The following is our proposed schedule of the 90-min tutorial.

  • Introduction (15 min). We will discuss the basic model of HIN, and discuss applications based on it, such as search, relevance computation, query recommendation, and data integration (10 min). We will also introduce meta-paths, a fundamental HIN analysis tool, and give an overview of the tutorial (5 min).

  • Main contents (60 min). Next, we will introduce meta path, and how it facilitates the computation of various relevance measures (10 min). We then explain the process of discovering meta paths (15 min). We discuss a novel query recommendation framework based on meta paths (15 min). We will also present the meta structures, which is the latest development of meta paths (15 min). We will demonstrate a HIN search engine prototype based on meta paths (5 min).

  • Conclusions (15 min). We will conclude the tutorial and discuss future directions (5 min). The rest of the time will be dedicated to Q&A (10 min).

3 Intended Audience

The tutorial is designed for researchers interested in latest development in the field of HINs, especially regarding meta-paths for novel applications. The HIN search demonstration will be give insight to software practitioners for developing recommendation facilities for HINs.

4 Biography of Presenters

Reynold Cheng is an Associate Professor of the Department of Computer Science in the University of Hong Kong. He obtained his PhD from Department of Computer Science of Purdue University in 2005. He was granted an Outstanding Young Researcher Award 2011–12 by HKU. He was the recipient of the 2010 Research Output Prize in the Department of Computer Science of HKU. He also received the U21 Fellowship in 2011. He received the Performance Reward in years 2006 and 2007 awarded by the Hong Kong Polytechnic University. He is a member of the IEEE, the ACM, and ACM SIGMOD. He is an editorial board member of TKDE, DAPD and IS, and was a guest editor for TKDE, DAPD, and Geoinformatica. He is an area chair of ICDE 2017, senior PC member of BigData 2017 and DASFAA 2015, PC co-chair of APWeb 2015, area chair for CIKM 2014, and workshop co-chair of ICDE 2014. He received an Outstanding Service Award in CIKM 2009. He has served as PC members and reviewer for top conferences and journals.

Zhipeng Huang is a 2nd year Ph.D. in the CS department of HKU, supervised by Prof. Nikos Mamoulis and Dr. Reynold Cheng. He received his bachelor degree from EECS department of PKU in 2015. His research interests cover data mining, data management and data cleaning.

Yudian Zheng is a 4th year Ph.D. in the CS department of HKU, supervised by Dr. Reynold Cheng. Yudian’s research interests cover crowdsourcing, data management and data cleaning. He has published full research papers in well-established database and data mining conferences/journals, including SIGMOD, VLDB, KDD, WWW, ICDE, and TKDE. He has also taken internships in Microsoft Research and Google Research.

Jing Yan is a 1st year MPhil student supervised by Dr. Reynold Cheng in the CS department of HKU. His research interests include data management and data mining, with emphasis on knowledge graphs and data cleaning.

Ka Yu Wong is currently a MSc student of the CS department of HKU.

Eddie Ng is currently a MSc student of the CS department of HKU.