1 Introduction

The Billion Triple Challenge (BTC) began at ISWC in 2008 [44], where a dataset of approximately one billion RDF triples, crawled from millions of documents on the Web, was published. As a demonstration of contemporary Semantic Web technologies, contestants were asked to submit descriptions of systems capable of handling and extracting value from this dataset, be it in terms of data management techniques, analyses, visualisations, or end-user applications. The challenge was motivated by the need for research on consuming RDF data in a Web setting, where the dataset provided not only a large-scale, diverse collection of RDF graphs, but also a snapshot of how real-world RDF data were published.

A BTC dataset would be published each year from 2008–2012 for the purposes of organising the eponymous challenge at ISWC [5,6,7, 30, 44], with another BTC dataset published in 2014 [3]. These datasets would come to be used in a wide variety of contexts unrelated to challenge submissions, not only for evaluating the performance, scalability and robustness of a variety of systems, but also for analysing Semantic Web adoption in the wild; our survey of how previous BTC datasets have been used (described in more detail in Sect. 2) reveals:

  • Evaluation: the BTC datasets have been used for evaluating works on a variety of topics relating to querying [25, 26, 28, 35, 46, 57, 62, 64, 65], graph analytics [11, 14, 15, 33, 60], search [8, 17, 40, 47], linking and matching [10, 32, 49], reasoning [42, 52, 58], compression [21, 59], provenance [1, 61], schemas [9, 39], visualisation [22, 66], high-performance computing [24], information extraction [41], ranking [45], services [53], amongst others.

  • Analysis: the BTC datasets have further been used for works that aim to analyse the adoption of Semantic Web standards on the Web, including analyses of ontologies and vocabularies [23, 48, 54], links [20, 27], temporal information [51], publishing practices [50], amongst others.

We also found that BTC datasets have been used not only for the eponymous challenges [3, 5,6,7, 29, 30, 44], but also for other contests including the TREC Entity Track [2], and the SemSearch Challenge [55].

In summary, the BTC datasets have become a key resource used not only within the Semantic Web community, but also by other communities [11, 14, 15, 60]. Noting that the last BTC dataset was published in 2014 (five years ago at the time of writing), we thus argue that it is high time for the release of another BTC dataset (even if not associated with a challenge of the same name).

In this paper, we thus announce the Billion Triple Challenge 2019 dataset. We first provide a survey of how BTC datasets have been used in research works down through the years as both evaluation and analysis datasets. We then describe other similar collections of RDF data crawled from the Web. We provide details on the crawl used to acquire the BTC-2019 dataset, including parameters, seed list, duration, etc.; we also provide statistics collected during the crawl in terms of response codes, triples crawled per hour, etc. Next we provide detailed statistics of the content of the dataset, analysing various distributions relating to triples, documents, domains, predicates, classes, etc., including a high-level comparison with the BTC-2012 and BTC-2014 predecessors; these results further provide insights as to the current state of adoption of the Semantic Web standards on the Web. We then discuss how the data are published and how they can be accessed. We conclude with a summary and outlook for the future.

2 BTC Dataset Adoption

As previously discussed, we found two main types of usage of BTC datasets: for evaluating systems, and for analysing the adoption of Semantic Web technologies in the wild. In order to have a clearer picture of precisely how the BTC datasets have been used in the past for research purposes, we performed a number of searches on Google Scholar for the keywords "btc dataset" and "billion triple challenge" (the latter as a phrase search). Given the large number of results returned, for each search we surveyed the first 50 results, looking for papers that used a BTC dataset for either evaluation or analysis, filtering papers that are later or earlier versions of papers already found; while this method is incomplete, the sample gathered more than enough papers to give an idea of the past impact of these datasets. We note that Google Scholar uses the number of citations as a ranking measure, such that by considering the first 50 results, we consider the papers with the most impact, but may also bias the sample towards older papers.

In Table 1, we list the research papers found that use a BTC dataset for evaluation purposes; we list a key for the paper, the abbreviation of the venue where it was published, the year it was published, the system, the topic, the year of the BTC dataset used, and the scale of data reported; regarding the latter metric, we consider the figure as reported by the paper itself, where in some cases, samples of a BTC dataset were used, or the BTC dataset was augmented with other sources (the latter cases are marked with ‘*’). Considering that this is just a sample of papers, we see that BTC datasets have become widely used for evaluation purposes in a diverse range of research topics, in order of popularity: querying (9), graph analytics (5), search (4), linking and matching (3), reasoning (3), compression (2), provenance (2), schemas (2), visualisation (2), high-performance computing (1), information extraction (1), ranking (1), and services (1). While most works consider a Semantic Web setting (dealing with a standard like RDF, RDFS, OWL, SPARQL, etc.), we note that many of the works in the area of graph analytics have no direct connection to the Semantic Web, and rather use the link structure of the dataset to test the performance of network analyses and/or graph algorithms [11, 14, 15, 60]. Furthermore, looking at the venues, we can see that the datasets have been used in works published not only in core Semantic Web venues, but also in venues focused on Databases, Information Retrieval, Artificial Intelligence, and so forth. We also remark that some (though not all) works prefer to select a more recent BTC dataset (e.g., from the same year or the previous year).

In Table 2, we instead look at papers that have performed analyses of Semantic Web adoption on the Web based on a BTC dataset. In terms of the types of analysis conducted, most relate to analysis of ontologies/vocabularies (3) or links (2), with temporal meta-data (1) and publishing practices relating to SPARQL endpoints (1) also having been analysed. Though fewer in number, these papers play an important role in terms of Semantic Web research and practice.

Most of the papers discussed were not associated with a challenge (perhaps due to how we conducted our survey). For more information on the challenges using the BTC dataset, we refer to the corresponding descriptions for the TREC [2], SemSearch [55], and Billion Triple Challenges [3, 5,6,7, 29, 30, 44].

We reiterate that this is only a sample of the works that have used these datasets, where a deeper search would likely reveal further research relying on the BTC datasets. Likewise, we have only considered published works, and not other applications that may have benefited from or otherwise leveraged these datasets. Still, our survey reveals the considerable impact that BTC datasets have had on research in the Semantic Web community, and indeed in other communities. Though the BTC-2019 dataset has only recently been published, we believe that this analysis indicates the potential impact that the newest edition of the BTC dataset should have.

Table 1. Use of BTC datasets as evaluation datasets
Table 2. Use of BTC datasets as analysis datasets

3 Related Work

The BTC datasets are not the only RDF corpora to have been collected from the Web. In this section we cover some of the other initiatives found in the literature for acquiring such corpora.

Predating the release of the first BTC dataset in 2008 were the corpora collected by a variety of search engines operating over Semantic Web data, including Swoogle [19], SWSE [31], Watson [16], Falcons [13], and Sindice [56]. These works described methods for crawling large volumes of RDF data from the Web. Also predating the first BTC dataset, Ding and Finin [18] collected one of the first large corpora of RDF data from the Web, containing 279,461,895 triples from 1,448,504 documents. They proceeded to analyse a number of aspects of the resulting dataset, including the domains on which RDF documents were found, the age and size of documents, how resources are described, as well as an initial analysis of quality issues relating to rdfs:domain. Though these works serve as an important precedent to the BTC datasets, to the best of our knowledge, the corpora were not published and/or were not reused.

On the other hand, since the first BTC dataset, a number of collections of RDF Web data have been published. The Sindice 2011 dataset [12] contains 11 billion statements from 231 million documents, collecting not only RDF but also Microformats, and was used in 2011 for the TREC Entity Track; unfortunately the dataset is no longer available from its original location. The Dynamic Linked Data Observatory (DyLDO) [37] has been collecting RDF data from the Web each week since 2013; compared with the BTC datasets (which are yearly, at best), the DyLDO datasets are much smaller, crawling in the order of 16–100 million quads per week, with an emphasis on tracking changes over time. LOD Laundromat [4] is an initiative to collect, clean, archive and republish Linked Datasets, offering a range of services from descriptive metadata to SPARQL endpoints and visualisations; unlike the BTC datasets, the focus is on collecting and republishing datasets in bulk rather than crawling documents from the Web. Meusel et al. [43] have published the WebDataCommons, extracting RDFa, Microdata and Microformats from the massive Common Crawl dataset; the result is a collection of 17,241,313,916 RDF triples, which, to the best of our knowledge, is the largest collection of crawled RDF data to have been published to date; however, the nature of the WebDataCommons dataset is different from a typical BTC instance since it collects a lot of relatively shallow metadata from HTML pages, where the most common properties instantiated by the data are, for example, Open Graph metadata such as ogp:type, ogp:title, ogp:url, ogp:site_name, ogp:image, etc.; hence while WebDataCommons is an important resource, it is somewhat orthogonal to the BTC series of datasets.

4 Crawl

We follow a similar procedure for crawling the BTC-2019 dataset as in previous years. Our crawl uses the most recent version of LDspider [36] (version 1.3), which offers a variety of features for configuring crawls of native RDF content, including support for various RDF syntaxes, various traversal strategies, various ways to scope the crawl, and most importantly, components to ensure a “polite” crawl that respects the robots.txt exclusion protocol and implements a minimal delay between requests to the same server to avoid DoS-like patterns.

The crawl was executed on a single virtual machine running Ubuntu 18.04 on an Intel Xeon Silver 4110 CPU @ 2.10 GHz, with 30 GB of RAM. The machine was hosted at the University of Chile. Following previous configurations for BTC datasets, LDspider is configured to crawl RDF/XML, Turtle and N-Triples following a breadth-first strategy; the crawler does not yet support JSON-LD, while enabling RDFa currently tends to gather a lot of shallow, disconnected metadata from webpages, which we interpret as counter to the goals of BTC datasets. IRIs ending in .html, .xhtml, .json, .jpg, .pdf are not visited, under the assumption that they are unlikely to yield content in one of the desired formats. To enable higher levels of scale, the crawler is configured to use the hard disk to manage the frontier list (the list of unvisited URLs). Based on initial experiments with the available hardware, 64 threads were chosen for the crawl (adding more threads did not increase performance); implementing a delay between subsequent requests to the same (pay-level) domain is then important to avoid DoS-style polling, where we impose a one-second delay. The crawler respects the robots.txt exclusion protocol and will not crawl domains or documents that are blacklisted by the respective file. All HTTP(S) IRIs from an RDF document without a blacklisted extension – irrespective of the subject/predicate/object position – are considered candidates for crawling. In each round, IRIs are prioritised in terms of the number of links found, meaning that unvisited IRIs mentioned in more visited documents will be crawled first. We store the data collected as an N-Quads file, where we use the graph term to indicate the location of the document in which the triple is found; a separate file indicating the redirects encountered, as well as various logs, are also maintained. A diverse list of 442 URLs taken from DyLDO [37] was given as input to the crawl.
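To make the politeness mechanism concrete, the following is a minimal sketch in Python of per-host rate limiting combined with robots.txt checking; it is not LDspider’s actual (Java) implementation, and the helper names are our own. Note also that LDspider keys its delay on pay-level domains, whereas this sketch, for simplicity, keys on full hostnames.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

CRAWL_DELAY = 1.0  # minimum delay (seconds) between requests to one host
last_access = {}   # host -> time of the previous request
robots_cache = {}  # host -> parsed robots.txt

def allowed_by_robots(url, agent="ldspider"):
    # Fetch and cache robots.txt per host; disallowed URLs are skipped.
    host = urlparse(url).netloc
    if host not in robots_cache:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("http://%s/robots.txt" % host)
        try:
            rp.read()
        except OSError:
            pass  # treat an unreachable robots.txt as permissive (a policy choice)
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(agent, url)

def polite_fetch(url, fetch):
    # Enforce the minimum delay between successive requests to the same
    # host, avoiding DoS-like request patterns; fetch is any callable
    # that dereferences a URL.
    host = urlparse(url).netloc
    if not allowed_by_robots(url):
        return None
    wait = CRAWL_DELAY - (time.monotonic() - last_access.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)
    last_access[host] = time.monotonic()
    return fetch(url)
```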

We ran the crawl with this configuration continuously for one month, from 2018/12/12 until 2019/01/11, during which we collected 2,162,129,316 quads. Since we apply streaming parsers to be able to handle large RDF documents, in cases where a document contains duplicated triples, the initial output will contain duplicate quads; once these duplicates were removed, we were left with 2,155,856,033 unique quads in the dataset, from a total of 2,641,253 documents on 394 pay-level domains (PLDs).
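As an aside, deduplication at this scale is itself non-trivial. A minimal sketch of line-level deduplication of GZipped N-Quads follows; the in-memory set is for illustration only, where for billions of quads one would instead use an external merge sort over sorted chunks:

```python
import gzip

def unique_quads(path_in, path_out):
    # Streaming line-level deduplication of a GZipped N-Quads file.
    # Assumes duplicate quads serialise to byte-identical lines (true
    # for duplicates emitted while parsing a single document).
    seen = set()
    with gzip.open(path_in, "rt", encoding="utf-8") as f_in, \
         gzip.open(path_out, "wt", encoding="utf-8") as f_out:
        for line in f_in:
            if line not in seen:
                seen.add(line)
                f_out.write(line)
    return len(seen)
```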

Fig. 1. Sankey diagram showing response codes for the crawled URIs; n×3xx indicates the n-th redirection.

In Fig. 1, we show the crawling behaviour at the HTTP level. As HTTP status codes do not cover issues at the networking level, we added a class (6xx) for networking issues, which allows us to present the findings on the HTTP and networking levels in a uniform manner. We assigned the exceptions that we encountered during crawling to the status-code classes according to whether we consider them a server problem (e.g., SSLException) or a networking issue (e.g., ConnectionTimeoutException), as in [38]. The seed URIs in the figure comprise all URIs that we ever tried to dereference during the crawl: 4,133,750 in total. We see that about two thirds of the dereferenced URIs responded with an HTTP status code of the Redirection class (3xx), which is about three times as many as the URIs that directly provided a successful response (2xx). A total of 6% of requests immediately failed due to server or network issues (5xx/6xx). In total, 82% of seed URIs eventually yielded a successful response, i.e., about 3.3 million seed URIs, which is considerably more than the number of documents in the final crawl (2.6 million); reasons for this difference include the fact that many seed URIs redirect to the same document, that multiple hash URIs from the same documents appear in the seed list, etc.
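The grouping of crawl outcomes into status-code classes, including the synthetic 6xx class, can be sketched as follows; the assignment of specific exception types here is illustrative of the policy just described rather than an exhaustive list:

```python
def status_class(outcome):
    # Map an HTTP response code (int) or a raised exception to a
    # status-code class; 6xx is a synthetic class for networking issues.
    if isinstance(outcome, int):
        return "%dxx" % (outcome // 100)  # e.g., 303 -> "3xx"
    server_side = {"SSLException"}  # failures we attribute to the server
    name = type(outcome).__name__
    if name in server_side:
        return "5xx"
    return "6xx"  # e.g., ConnectionTimeoutException: a networking issue
```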

In Fig. 2, we show the number of (non-distinct) quads crawled as the days progress, where we see that half of the data are crawled after about 1.6 days; the rate at which quads are crawled decays markedly over time. This decay in performance occurs because at the start of the crawl there are more domains to crawl from, where smaller domains are exhausted early in the crawl; this leaves fewer active domains at the end of the crawl. Figure 3 then shows the number of PLDs contributing quads to the crawl as the days progress (accessed), where all but one domain are found after 1.5 days. Figure 3 also shows the number of active PLDs: the PLDs that will contribute quads to the crawl in the future, where, for example, we see based on the data for day 15 that the last 15 days of the crawl will successfully retrieve RDF from 16 PLDs. By the end of the crawl, there are only 6 PLDs active from which the crawler can continue to retrieve RDF data. These results explain the trend in Fig. 2 of the crawl slowing as it progresses: the crawl enters a phase of incrementally crawling a few larger domains, where the crawl delay becomes the limiting factor for performance. For example, at the end of the crawl, with 6 domains active, a delay limit of 1 s means that at most 6 documents can be crawled per second. Similar crawls of RDF documents on the Web have encountered this same phenomenon of “PLD starvation” [34].
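The throughput ceiling imposed by politeness can be stated directly: with a per-domain delay of d seconds and p active PLDs, at most p/d documents can be retrieved per second. A one-line check against the end-of-crawl example above:

```python
def max_docs_per_second(active_plds, delay_s=1.0):
    # Politeness bound: at most one request per active domain per delay.
    return active_plds / delay_s

assert max_docs_per_second(6, 1.0) == 6.0  # 6 active PLDs, 1 s delay
```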

Fig. 2. Quads crawled after each day

Fig. 3. PLDs included after each day

In summary, we crawled for 30 days, collecting a total of 2,155,856,033 unique quads from 2,641,253 RDF documents on 394 pay-level domains. Per Fig. 2, running the crawl for longer would have had a limited effect on the volume of data collected.

Table 3. PLDs by docs.
Table 4. PLDs by triples
Table 5. PLDs by quads

5 Dataset Statistics

The dataset comprises 2,641,253 RDF documents collected from 394 pay-level domains, containing a total of 2,155,856,033 unique quads. Surprisingly, the number of unique triples in the dataset is much lower: 256,059,356. This means that, on average, each triple is repeated in approximately 8.4 different documents; we will discuss this issue again later. In terms of schema, the data contain 38,156 predicates and instances of 120,037 unique classes; these terms are defined in a total of 1,746 vocabularies (counting unique namespaces).
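The relation between unique quads and unique triples can be made concrete: a quad is projected to a triple by dropping its graph (document) term, after which distinct triples are counted. The parsing below is deliberately naive (it exploits the fact that the trailing graph term is an IRI and thus contains no spaces); real N-Quads data should be handled with a proper parser:

```python
def count_unique_triples(nquad_lines):
    # Project each N-Quads line to its triple by stripping the final
    # " ." and the trailing graph term, then deduplicate.
    triples = set()
    for line in nquad_lines:
        body = line.rstrip().removesuffix(" .")
        triple, _graph = body.rsplit(" ", 1)
        triples.add(triple)
    return len(triples)

# The repetition factor reported above is then simply:
#   n_unique_quads / count_unique_triples(...)  ~  8.4
```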

Next we look at the sources of data for the crawl. RDF content was successfully crawled from a total of 394 different PLDs. In Table 3, we show the top 25 PLDs with respect to the number of documents crawled and the overall percentage of documents sourced from that site; the largest provider of documents is dbpedia.org (6.14%), followed by loc.gov (5.68%), etc. We remark that amongst these top PLDs, the distribution is relatively equal. This is because documents are crawled from each domain at a maximum rate of 1/s, meaning that typically a document will be polled from each active domain at the same interval. To counter the phenomenon of PLD starvation, we stop the polling of active domains when the number of active domains falls below a certain threshold and move to the next hop (the documents in the queues of the domains are ranked by in-links as a measure of importance). The result is that large domains are often among the last active domains, where polling is stopped before the domain is crawled exhaustively, and where such domains thus end up contributing almost the same number of documents. However, looking at Table 4, which displays the top 25 PLDs in terms of unique triples, we start to see some skew, where 52.15% of all unique triples come from Wikidata (despite it accounting for only 5.35% of documents). Even more noticeably, if we look at Table 5, which displays the top 25 PLDs by number of quads, we see that Wikidata accounts for 93.06% of all quads; in fact, if we divide the number of quads for Wikidata by the number of documents, we find that it contains, on average, approximately 14,208 triples per document! By way of comparison, DBpedia contains 226 triples per document. Hence, given that the crawl, by its nature, balances the number of documents polled from each domain, and that Wikidata’s RDF documents are orders of magnitude larger than those of other domains, we see why the skew in quads occurs. Further cross-referencing quads with unique triples, we see a lot of redundancy in how Wikidata exports RDF, repeating each triple in (on average) 15 documents; by way of comparison, DBpedia repeats each unique triple in (on average) 1.11 documents. This skew is a result of how Wikidata chooses to export its data; while representative of how real-world data are published, consumers of the BTC-2019 dataset should keep this skew in mind when using the data, particularly if conducting analyses of adoption: for example, analysing the most popularly used predicates by counting the number of quads using each predicate would be disproportionately affected by Wikidata.
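Statistics such as these can be gathered by grouping quads on the pay-level domain of their graph term. A rough sketch follows; note that the two-label heuristic for extracting PLDs is a simplification (a correct implementation needs the Public Suffix List, e.g., via the third-party tldextract package):

```python
from collections import Counter
from urllib.parse import urlparse

def pld(iri):
    # Crude pay-level-domain heuristic: the last two labels of the host.
    host = urlparse(iri.strip("<>")).netloc
    return ".".join(host.split(".")[-2:])

def per_pld_stats(nquad_lines):
    # Count quads and distinct documents per PLD via the graph term.
    quads, docs = Counter(), {}
    for line in nquad_lines:
        graph = line.rstrip().removesuffix(" .").rsplit(" ", 1)[1]
        domain = pld(graph)
        quads[domain] += 1
        docs.setdefault(domain, set()).add(graph)
    return {d: (quads[d], len(docs[d])) for d in quads}
```

Dividing quads by documents per PLD yields the averages quoted above (e.g., roughly 14,208 for wikidata.org versus 226 for dbpedia.org).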

Turning towards the use of vocabularies in the data, Table 6 presents the most popular vocabularies (extracted from predicate and class terms) in terms of the number of PLDs on which they are used (and the percentage of PLDs). Unsurprisingly, core Semantic Web standards head the list, followed by Friend of a Friend (FOAF), the Dublin Core (DC) vocabularies, etc.; almost all of these vocabularies have been established for over a decade, with the exception of the Linked Data Platform (LDP) vocabulary, which appears in 21st place. On the other hand, Table 7 presents the number of PLDs per predicate, while Table 8 presents the number of PLDs per class, where again there are few surprises at the top of the list, with most terms corresponding to the most popular namespaces. We conclude that BTC-2019 is a highly diverse dataset, featuring over 150,000 vocabulary terms from 1,746 vocabularies.
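For such tables, vocabulary namespaces are obtained from predicate and class IRIs; a common convention, sketched below, is to split the IRI at the last ‘#’, or failing that, the last ‘/’:

```python
def namespace(term):
    # Extract a namespace from a predicate/class IRI term.
    iri = term.strip("<>")
    if "#" in iri:
        return iri.rsplit("#", 1)[0] + "#"
    return iri.rsplit("/", 1)[0] + "/"

# e.g., namespace("<http://xmlns.com/foaf/0.1/knows>")
#   -> "http://xmlns.com/foaf/0.1/"  (the FOAF namespace)
```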

Table 6. PLDs per voc.
Table 7. PLDs per pred.
Table 8. PLDs per class
Table 9. Comparison of BTC 2012, 2014, 2019: High-level Statistics
Table 10. Comparison of BTC 2012, 2014, 2019: Top PLDs per Documents

6 Comparison with BTC-2012 and BTC-2014

We now provide a statistical comparison between BTC-2019 and its two most recent predecessors: BTC-2014 and BTC-2012. We downloaded these latter two datasets from their corresponding webpages and ran the same statistical code as used for the BTC-2019 dataset. Noting that BTC-2014 and BTC-2012 included HTTP header metadata as part of their RDF dumps, for the purposes of comparability we pre-filtered such triples from these crawls, as they were not part of the native RDF documents (and thus were not included in the BTC-2019 files).

We begin in Table 9 with a comparison of high-level statistics between the three datasets, where we see that in terms of quads, BTC-2019 is larger than BTC-2012 but smaller than BTC-2014; as previously discussed, BTC-2014 extracted a lot of shallow HTML-based metadata from small RDFa documents, which we decided to exclude from BTC-2019: as can be seen by cross-referencing the quad and document statistics, BTC-2019 had on average 816 quads per document, while BTC-2012 had on average 147 quads per document and BTC-2014 had on average 91 quads per document. Of note is the relatively vast quantity of predicates, classes and vocabularies appearing in the BTC-2014 dataset; upon further analysis, most of these turned out to be noise relating to a bug in the exporter of a single site – gorodskoyportal.ru – which linked to nested namespaces of the form:

http://gorodskoyportal.ru/moskva/rss/channel/.../channel/*

where “...” indicates repetitions of the channel sub-path.

We see that BTC-2019 also comes from fewer domains than BTC-2012, and far fewer than BTC-2014; this is largely attributable not only to our decision not to include data embedded in HTML pages, but also to a variety of domains that have ceased publishing RDF data. Regarding the largest contributors of data in terms of PLDs, Table 10 provides a comparison of the domains contributing the most documents to each of the three versions of the BTC datasets, where we see some domains in common across all versions (e.g., dbpedia.org, loc.gov), some domains appearing in older versions but not in BTC-2019 that have since gone offline (freebase.com, kasabi.com, opera.com, etc.), as well as some new domains appearing only in the more recent BTC-2019 version (e.g., wikidata.org).

7 Publication

We publish the files on the Zenodo service, which provides hosting in CERN’s data centre and assigns DOIs to published resources. The DOI of the BTC-2019 dataset is http://doi.org/10.5281/zenodo.2634588. The data are published in N-Triples and N-Quads formats using GZip compression. Due to the size of the dataset, rather than publish the data as one large file, we publish the following:

  • Unique triples (1 file: 3.1 GB): this file stores only the unique triples of the BTC-2019 dataset.

  • Quads (114 files: 26.1 GB total): given the large volume of quads, we split the data up, creating a separate file for the quads collected from each of the top 100 PLDs, and an additional file containing the quads for the remaining 294 PLDs. Given the size of Wikidata, we split its file into 14 segments, each containing at most 150 million quads and occupying around 1.8 GB of space.

Hence we offer consumers a number of options for how they wish to use the BTC-2019 dataset. Consumers who are mostly interested in the graph structure (e.g., for testing graph analytics or queries on a single graph) may choose to download the unique-triples file. On the other hand, consumers can select smaller files from the PLDs of interest, potentially remixing BTC-2019 into various samples; another possibility, for example, would be to take one file from each PLD (including Wikidata), thus reducing the skew in quads previously discussed. Aside from the data themselves, we also publish a VoID file describing metadata about the crawl, and offer documentation on how to download all of the files at once, potential parsers that can be used, etc.
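As a brief sketch of consumption, a file can be streamed and decompressed on the fly without first downloading it in full; the file name below is hypothetical, where the actual file listing can be found on the Zenodo record:

```python
import gzip
import urllib.request

# Hypothetical file name; consult the Zenodo record for actual names.
URL = "https://zenodo.org/record/2634588/files/btc2019-triples.nt.gz"

def stream_lines(url):
    # Stream the GZipped file, yielding one N-Triples line at a time.
    with urllib.request.urlopen(url) as resp:
        with gzip.open(resp, "rt", encoding="utf-8") as lines:
            for line in lines:
                yield line.rstrip()

for i, triple in enumerate(stream_lines(URL)):
    print(triple)
    if i >= 4:  # show just the first five lines
        break
```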

8 Conclusion

In this paper, we have provided a survey indicating how the BTC datasets have been used down through the years, providing a strong motivation for continuing the tradition of publishing these datasets. Observing that the last BTC crawl was conducted five years ago, in 2014, we have thus crawled and published the newest edition in the BTC series: BTC-2019. We have provided various details on the crawl used to acquire the dataset and various statistics regarding the resulting dataset, as well as a discussion of how the data are published in a sustainable way.

In terms of the statistics, we noted two problematic aspects: a relatively low number of PLDs contributing to the crawl, leading to the available PLDs being exhausted relatively quickly, and a large skew in the number of quads sourced from Wikidata. These observations reflect how the data are published on the Web, rather than being particular artifacts of the crawl. Still, the resulting dataset is highly diverse, reflects current publishing practice, and can be used for evaluating methods on real-world data; furthermore, with appropriately designed metrics that take into account the skew towards Wikidata, the BTC-2019 dataset offers valuable insights into how data are being published on the Web today.