1 Introduction

Almost ubiquitous access to the Internet drives the proliferation of smart devices and Internet services in business and everyday life. Apps, i.e. small software services, are created by the thousands every day. With continuously available Internet access, data and services, these apps as well as more complex Internet services need tools that generate metadata and manage relations between data. These tools are called semantic technologies; they incorporate methods such as automatic design of metadata, automatic reasoning, mapping of knowledge bases, complex event processing, semantic search, and tracking and repair of changes within knowledge bases.

Modern systems for intelligent knowledge management also tend to take the actual context into account, especially when interpreting complex situations or when being tailored towards personalized usage. Such systems allow high-level, low-level, internal and external data to be correlated. The following example illustrates the desired alignment of different contexts in a personalized mobile application:

Imagine stopping at an unfamiliar airport in order to catch a connecting flight at a different terminal. The navigation system on your smartphone helps you get to the new terminal on time while guiding you past a newsstand that stocks your favorite journal and a rack with your perfume brand at the duty-free shop, and showing you the way to tidy restrooms close to your gate.

In this scenario, several contexts have to be taken into account. First, there is the personal element or profile, which can be enriched by learning algorithms that automatically detect habits (e.g. walking the same route at specific times or buying specific products). Further contexts are time- and place-dependent: your location is detected using micro navigation and the map of the airport. Your goal can be derived from your ticket and a flight information system. A route is then planned that takes into account your preferences combined with experiences collected by crowdsourcing, such as ratings of the cleanliness of the restrooms.

The types of context are manifold and can be categorized as follows (Strube et al. 1996): Besides linguistic contexts (which are not further considered), situational contexts are of great importance in order to interpret situations by using semantic tools; they are contexts referring to topics (here, domain knowledge is relevant), interactions, institutions, systems (abstract, virtual or real), and cultures. Social semantics and crowdsourcing are considered as being part of a cultural context.

Context can be classified further or differently (Mehra 2012): a distinction can be made between small context, i.e. events or time and location, and large context, as available in Linked Open Data. Other categories, orthogonal to this distinction, are time and location, contracts and commitments, web of things, web of people, web of knowledge, and emotions and outcome. These contexts can also contain smaller entities such as particular human situations. These categories can readily be aligned with the definitions of contexts in Strube et al. (1996).

The challenge now is to represent these different types of context formally in order to make the knowledge usable. For domain knowledge and institutional knowledge this work has already begun; for time and place it has begun only sparsely, as can be seen e.g. in Yndurain et al. (2012), Henson et al. (2012) and Boyaci et al. (2012). Challenges with respect to semantic search, proactiveness and social semantics and their intended usage of contexts are described in the following sections. Future intelligent systems aim to take into account the personal context of a user and to link large areas of knowledge in the web of knowledge for a more adequate search, as described in Sect. 2. Intelligent proactive event processing aims to correlate patterns from various contexts in order to interpret situations and to predict possible outcomes (Sect. 3). Additionally, future social semantics have the goal of combining personal and interaction contexts with the web of knowledge as well as with place and time (Sect. 4).

2 Semantic Data Management

Modern search engines are efficient in retrieving documents (web pages in the case of web search) from large collections based on matching the text of the user’s query with the textual content of the document or the words used to label the links to the document (anchor text). While they have been effective in addressing common needs, most categories of queries in the “long tail” of web query logs (88 % of the unique queries appear only once in the year (Baeza-Yates et al. 2007)) are still unsolved. Particularly difficult are ambiguous queries, which arise naturally when the user is not able to name precisely the item that is sought. Also, complex queries, which go beyond a reference to a single named entity and involve several entities and relationships between them, pose challenges to current search solutions. While these problematic types of queries are not new, the renewed interest in trying to address them can be explained by the increasing availability of semantic resources in terms of document annotations (also called metadata), large datasets and schemas. Many semantic search systems (Finin et al. 2011) have emerged specifically to exploit these resources. Thesauri and particularly schemas capture semantics that can be used to understand the query and data and to resolve ambiguities. Instead of returning textual data (web pages, documents), precise facts capturing entity-related information and their relationships can be directly retrieved from semantic annotations and data to provide direct answers to complex queries. Here we discuss the broad concept of semantic search and present several types of semantic search systems.

Overview

While all semantic search approaches involve some kind of explicit semantics, the retrieval contexts and the specific semantic models used to deal with the meaning behind the query intent and data vary. In particular, we identified the following aspects, based on which we will characterize and distinguish existing solutions:

  • The type of targeted information needs,

  • The representation of information resources (data),

  • The representation of the information need (query) or the underlying method for querying the data (querying paradigms),

  • The semantic models used to understand and to represent the meaning behind query and data,

  • The framework for matching queries against data, which also involves interpreting the data and query intent as well as ranking the results.

Figure 1 illustrates the specific differences in queries, data, semantics, results (corresponding to information needs) and solutions for the subproblems that distinguish existing semantic search approaches. We can see that central to semantic search is the use of semantics (depicted as gray boxes in Fig. 1). In particular, semantic resources represented by lexical models (thesauri, taxonomies, dictionary of words), knowledge models (ontologies, concept networks, schemas), as well as semantic data (RDF) and metadata (RDF data describing the content of documents, called RDFa) are used for the subproblems of understanding raw data (textual data, image objects) and queries. The resulting semantics of data and queries are then employed for solving the subproblems of matching the queries against the data and ranking the results. We will now discuss the main types of semantic search approaches.

Fig. 1 Overview of semantic search approaches

Concept-Based Document Retrieval

This is the classic type of semantic search system, which was already studied in the early years of information retrieval (IR) research. Researchers in this community have recognized the need for a richer representation of the information needs and data that goes beyond the bag-of-words model. Thus, lightweight lexical models have been employed to interpret the semantics of query and documents as concepts, and used within the standard bag-of-words IR paradigm (e.g. vector space model and language model).

However, there is no clear evidence that conceptual IR outperforms standard IR techniques. Even when concepts are selected by hand, this helps only when the query is an incomplete description of the need (Voorhees 1994).
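The idea of interpreting terms as concepts inside a standard vector space model can be sketched as follows. This is a minimal illustration under stated assumptions, not a reconstruction of any cited system: the thesaurus `THESAURUS`, the concept labels and the helper names `to_concepts`, `cosine` and `rank` are all hypothetical, and raw term counts stand in for a proper term weighting scheme.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus: surface term -> concept identifier (assumption).
THESAURUS = {
    "car": "VEHICLE",
    "automobile": "VEHICLE",
    "truck": "VEHICLE",
    "physician": "DOCTOR",
    "doctor": "DOCTOR",
}


def to_concepts(text):
    """Map each token to its concept; unknown tokens are kept as plain words."""
    return Counter(THESAURUS.get(tok, tok) for tok in text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, document) pairs sorted by descending similarity."""
    q = to_concepts(query)
    scored = [(cosine(q, to_concepts(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

With this sketch, the query "car repair" matches a document containing "automobile repair" because both terms map to the same assumed concept, although the two share only one literal word.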

Annotation-Based Document Retrieval

These approaches exploit advances in information extraction (IE) technologies to obtain a richer representation of queries and documents, namely as entities and relations represented as annotations (query and document annotations, respectively). This kind of query interpretation has been applied both to keyword queries and to natural language (NL) questions.

It has been shown that high precision is possible for queries for which corresponding annotations exist (Chu-Carroll et al. 2006). However, high recall is difficult because it requires the information extraction program to recognize arbitrary types of entities and relations. Along this line, a study has shown that the quality of information extraction significantly impacts semantic search performance (Chu-Carroll and Prager 2007).

Entity Search

In contrast to document retrieval, this approach searches for entities representing real-world objects (anything but documents). It includes search over (1) documents (e.g. for processing entity-related facts and listing queries of the TREC Question Answering track (Chu-Carroll et al. 2006), or expert search queries of the TREC Enterprise track (Balog et al. 2009)); (2) Wikipedia data, which, as discussed, can be seen as a kind of semantic data (e.g. for answering entity-related queries of the INEX Entity Ranking track (Kaptein et al. 2010; Pehcevski et al. 2008)); or (3) pure semantic data crawled from the Web (e.g. for answering entity queries of the SemSearch Entity track (Blanco et al. 2011)).

Some type (1) systems can also be categorized as annotation-based document retrieval systems because they detect entities and relations in data and queries (Chu-Carroll et al. 2006; Cheng et al. 2007; Nie et al. 2007). Thus, as with those systems, precision can be very high when high-quality extraction outputs are available.

The use of semantics in type (2) systems is limited to viewing Wikipedia pages as entities (Kaptein et al. 2010), exploiting categories associated with them (Balog et al. 2011) as well as links between them (Pehcevski et al. 2008). Experiments have shown that category information, used as a source of semantics, helps to interpret the entity query and to enrich the Wikipedia documents and, as a result, to outperform standard text-based retrieval (Kaptein et al. 2010; Balog et al. 2011).

While many type (3) systems (Cheng and Qu 2009; Tummarello et al. 2010; Tran et al. 2009a) have been built to search for entities over semantic data from the Web, solutions for ranking and the evaluation of their results have attracted interest only recently. The SemSearch challenge attracted a large number of participants, who submitted results produced by different ranking schemes based on traditional IR concepts. To date, the best-performing method (Blanco et al. 2011) uses an adaptation of BM25F.
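To make the mention of BM25F more concrete, the following is a minimal sketch of a commonly used simplified form of field-weighted BM25 scoring. It is not the specific adaptation used by Blanco et al. (2011), whose details are not given here. The field names, the weights, the single length-normalization parameter `b` shared by all fields, and the function name `bm25f_score` are assumptions made for illustration.

```python
import math


def bm25f_score(query_terms, doc_fields, field_weights, avg_field_len,
                doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one entity description against a keyword query.

    doc_fields:     field name -> list of tokens (e.g. an RDF label or comment)
    field_weights:  field name -> boost (assumed values, tuned in practice)
    avg_field_len:  field name -> average token count of that field
    doc_freq:       term -> number of documents containing the term
    """
    score = 0.0
    for term in query_terms:
        # Combine the per-field term frequencies into one pseudo-frequency.
        pseudo_tf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            length_norm = 1.0 - b + b * len(tokens) / avg_field_len[field]
            pseudo_tf += field_weights[field] * tf / length_norm
        if pseudo_tf == 0.0:
            continue
        n = doc_freq.get(term, 0)
        idf = math.log(1.0 + (num_docs - n + 0.5) / (n + 0.5))
        # Saturate the combined frequency once, after field weighting.
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score
```

The design point this sketch illustrates is that field weights are applied before saturation, so a match in a highly weighted field such as an entity label contributes more than the same match in a long free-text field.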

Relational Keyword Search

This category comprises all approaches that process keyword queries over semantic data (or structured data capturing entities and relations) (Tran et al. 2009b; Ladwig and Tran 2011). While the results here include entities, the focus is on finding possibly complex subgraphs encompassing several entities and the relations between them (i.e. on supporting relational search). Regarding result quality, a benchmark study (Coffman and Weaver 2010) has shown that the proposed ranking strategies (e.g. Tran et al. 2009b; Liu et al. 2006), which are based on the adoption of proximity, TF-IDF and PageRank, do not perform well when considering a broad range of queries. More recently, it has been shown that high-quality results can be achieved through the use of edge-specific language models that are constructed for results and queries (Bicer et al. 2011). A learning-to-rank strategy that combines TF-IDF-based ranking, proximity and other factors also yields high performance (Coffman and Weaver 2011).

Relational NL Search

As opposed to NL search over documents, which is mainly studied by IR researchers, this topic attracts a separate community that deals with interfaces over databases and knowledge bases. The underlying problem is about searching over semantic data. We refer to it as relational NL search because relations in the data have to be processed to answer complex queries, possibly asking for relations between several entities.

Traditionally, this type of query interface is used for expert systems targeting one particular domain. The questions of portability, and the problem of building domain-independent search systems, have attracted attention in the past few years (Wang et al. 2007; Cimiano et al. 2008; Li et al. 2007). While this problem of open-domain NL search, e.g. NL search over large amounts of heterogeneous web data, seems to be largely unsolved, relatively good results have been reported for single-domain tasks. However, while there is a TREC-based evaluation for NL search over documents (TREC QA track), nothing close to a standardized evaluation exists in this community.

3 Proactiveness

Due to an enormous increase in the number of information sources relevant for making decisions and ever shorter time for acting, the focus of efficient information management has moved from finding relevant information to anticipating potentially relevant situations, which enables proactive delivery of information (Engel and Etzion 2011). A common example is predicting traffic delays and working on their mitigation before they happen. Searching for an existing traffic jam on the planned route is an easily manageable task, but anticipating that heavy rain at around 5 p.m. on a part of the route close to a big city might cause a traffic jam in about 30 min is much more challenging. Even more demanding is anticipation based on unusual trends in social media streams, such as an increase in the number of traffic-related tweets in an area, which might indicate emerging problems on the road. Realizing such a scenario poses two main challenges:

  1. Real-time detection of interesting situations in very large and dynamically changing data streams, and

  2. Anticipation of events through the proactive delivery of information to relevant consumers or contexts.

More concretely, the problems stem from the fact that the relevance of a piece of information can be determined only through real-time combination with other information, which implies the need for data-driven processing. In such a processing scenario, a relevant situation emerges from the data and is discovered immediately after it has been completed (in contrast to query-driven processing, where it is detected at query time) (Silberstein et al. 2007). Another issue is that a user cannot be expected to know and subscribe to all possible interesting situations, but rather has to be notified about them once they seem to be relevant for her/him. Otherwise, the user will suffer from an enormous information overload, which is much more difficult to overcome than in traditional search because of a paradigm shift: “the user is not searching for relevant information, but information is searching for the relevant user”. This leads to the need for a very flexible and dynamic mechanism for selecting the situations to be detected.

We argue that semantic processing is of crucial importance for resolving these challenges for proactive computing due to its ability to interpret data in a broader (semantic) context and support the detection on a higher abstraction level. Figure 2 illustrates the process of proactive processing of real-time data sources and summarizes the tasks required for realizing an anticipatory behavior that will be described in the rest of this section.

Fig. 2 Challenges for realizing proactiveness

Queries for Searching in the Future Streams

In contrast to a database/web search, where queries are formulated on an already existing dataset, queries for real-time processing are created first and then applied to a continuous stream of upcoming events. In web search, web data is usually crawled and then indexed to enable efficient search on the data corpus. The efficiency of search can be greatly improved by applying powerful but time-consuming optimizations to storage and indexing. From a user perspective, web search is done to fulfil an information need immediately. For that reason, there is less time to improve queries and most effort is put into index optimization (see Sect. 2). The problems appearing in stream search are completely different. Here, queries are formulated first and then matched against streaming data. Since manipulations on incoming data are time-consuming, the largest potential for improvement lies in semantic processing of queries at design time in order to get better stream search results. Basically, domain knowledge is used to extend the original queries. However, in order to cope with the information overload, these extensions must be weighed by their importance for the original query. For that reason, relevancy-based semantic requests at this level are extended with terms that occur frequently in these concepts according to the domain knowledge. This can be done by calculating the weight of relationships between entities (that have been gathered during runtime) and considering only entities whose weights exceed a certain threshold in order to refine patterns (Fang and Lei 2011). With respect to the dynamic flow of information in real-time social media streams, where messages might become important within a very short period of time (e.g., as often seen in disaster reporting), this requires a highly scalable computation of the current importance of a message.
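The threshold-based extension of queries described above can be sketched as follows. This is a minimal illustration and not the method of Fang and Lei (2011): the weight table `relation_weights`, the default threshold of 0.5 and the function name `expand_query` are hypothetical, and how the relationship weights are computed at runtime is left open.

```python
def expand_query(query_terms, relation_weights, threshold=0.5):
    """Extend a stream query at design time with strongly related entities.

    relation_weights maps a term to {related entity: weight}; only entities
    whose weight exceeds the threshold are added, which keeps the extension
    focused on what matters for the original query.
    """
    expanded = set(query_terms)
    for term in query_terms:
        for related, weight in relation_weights.get(term, {}).items():
            if weight > threshold:
                expanded.add(related)
    return expanded
```

For example, with assumed weights `{"flood": {"rain": 0.8, "river": 0.6, "umbrella": 0.2}}`, the query term "flood" is extended with "rain" and "river" but not with "umbrella". Raising the threshold trades recall for a smaller number of stream matches.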

Predictions: Searching in the Future Using Past Data

In order to understand what may become a trend in the future, an efficient introspection of the past is needed. There are several application domains in which the possibility to act based on predictions (i.e. ahead of time) is becoming ever more important, such as Smart Grid and Smart Cities (Etzion 2010; Stojanovic et al. 2011a). This phase concentrates on the mining of historical knowledge gathered from previously matched patterns. Based on the degree to which current patterns are partially fulfilled and the probability of complete fulfilment observed in the history, unusual messages can be detected. From our point of view, unusualness is based on two features, unexpectedness and importance (Sen et al. 2010).

The traditional way of querying RDF data is a blocking get operation. However, real-time applications need an asynchronous query mode in order to be more responsive to the arrival of RDF data. Publish/subscribe (pub/sub) is a messaging pattern in which publishers and subscribers communicate in a loosely coupled fashion. Subscribers can express their interests in certain kinds of data by registering a subscription (continuous query) and are notified asynchronously of any information (called an event) generated by the publishers that matches those interests. Notifications are made possible by a matching algorithm that relates publications to subscriptions. One of the challenges posed by this huge amount of data is to combine storage (associated with the traditional, synchronous query mode) with asynchronous delivery of RDF data (pub/sub protocol), while deployment in a cloud environment should ensure the elasticity of the solution.
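The combination of a stored, synchronously queryable set of triples with subscription-based delivery can be sketched in a few lines. This is a toy, in-process illustration under stated assumptions: the class name `TripleBroker`, the use of `None` as a wildcard in triple patterns and the example identifiers are hypothetical, callbacks are invoked directly inside `publish` rather than delivered asynchronously over a network, and elasticity in a cloud deployment is not addressed.

```python
class TripleBroker:
    """Toy store plus pub/sub matcher for (subject, predicate, object) triples."""

    def __init__(self):
        self.store = []          # supports the traditional, blocking query mode
        self.subscriptions = []  # list of (pattern, callback) continuous queries

    @staticmethod
    def _matches(pattern, triple):
        # A pattern position set to None acts as a wildcard.
        return all(p is None or p == t for p, t in zip(pattern, triple))

    def query(self, pattern):
        """Synchronous lookup over everything published so far."""
        return [t for t in self.store if self._matches(pattern, t)]

    def subscribe(self, pattern, callback):
        """Register a continuous query; callback fires for future matches."""
        self.subscriptions.append((pattern, callback))

    def publish(self, triple):
        """Store the triple and notify every subscriber whose pattern matches."""
        self.store.append(triple)
        for pattern, callback in self.subscriptions:
            if self._matches(pattern, triple):
                callback(triple)
```

A subscriber registered with the pattern `(None, "hasDelay", None)` is notified about every delay event as it is published, while the same data remains available to `query` afterwards.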

Dynamic Management of Future Queries

Another challenge related to information overload is to manage the deployment of queries in a way that ensures that only relevant requests are active at each point in time (Stojanovic et al. 2011b). This leads to dynamic subscriptions, based on the real-time contextualization of queries, which invoke queries only once the right context has been reached. Similarly, the queries remain active only as long as the right context exists. Moreover, the validity of the queries should be verified in the long run, and in the case of deviation from the expected behavior, corresponding changes should be proposed. In a similar way, based on the data flow, new relevant queries can be generated on the fly and introduced into the system, reflecting the highly dynamic nature of real-time streams. The challenge for the semantics lies in the real-time mining of interesting queries and in detecting the situations in which they should be introduced into the system. Finally, managing proactiveness is a matter of enabling real-time action in order to realize potential business opportunities or mitigate possible conflicts that are anticipated dynamically from the data flow.
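Context-dependent activation of queries can be illustrated with a small sketch. It assumes that the current context and each event are plain dictionaries and that every query carries two predicates, one deciding whether its context holds and one deciding whether an event matches. The names `ContextualQuery`, `active_queries` and `process_event` are hypothetical, and neither the long-run validation of queries nor the mining of new ones is covered.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ContextualQuery:
    name: str
    is_context_active: Callable[[Dict], bool]  # should the query be deployed now?
    matches: Callable[[Dict], bool]            # does an incoming event satisfy it?


def active_queries(queries: List[ContextualQuery], context: Dict) -> List[ContextualQuery]:
    """Keep only the queries whose activation context currently holds."""
    return [q for q in queries if q.is_context_active(context)]


def process_event(queries: List[ContextualQuery], context: Dict, event: Dict) -> List[str]:
    """Evaluate an event against the currently active queries only."""
    return [q.name for q in active_queries(queries, context) if q.matches(event)]
```

In the traffic scenario, a query for slow traffic near a city could be activated only while the context reports heavy rain, so that the same event triggers a notification in one context and is ignored in another.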

4 Scaling Social Semantics

In a nutshell, semantic technologies strive to store and process meaning separately from the program itself and from other data. By doing so, they enable the creation of systems that:

  • Can cope better with changes that affect the meaning of the data – because the program itself is not (so much) affected by such a change,

  • Can better work with heterogeneous data – again because the core program does not directly encode the meaning of the data and is therefore not affected by this heterogeneity,

  • Can use powerful algorithms such as semantic search that make use of the meaning of the data – because the meaning of the data is available for processing and not hidden in a program’s code.

These properties are only important for certain application areas – those areas where data changes frequently, where heterogeneous data abounds and where complex processing of data is needed.

For example, semantic technologies may not be a good match for a system that manages the well-understood and slowly changing inventory of a medium-sized company that produces and sells different kinds of paper – the flexibility offered by semantics is just not needed, and all kinds of more complex processing can be done just as well directly in the software. If the company starts to add a new kind of paper with really different properties, it will have to change so many things anyway that an adaptation of the inventory software is comparatively small and no hindrance to the speed with which the company can act and adapt. Contrast this with a system that manages information about actors in the health care sector in Europe and that is maintained by geographically distributed researchers in this area. Here change is permanent, the managed data is heterogeneous (there is not just one class of actor; actors can be all kinds of different things) and powerful semantic tools are needed to make this complex data accessible.

These are settings that are also high in social complexity: settings where multiple stakeholders constantly introduce new concepts, where a single schema cannot be enforced (and would in any case be outdated by the time it is enforced), and where complex processing is needed to enable people with different levels of expertise to work with the data.

Going back to the example of the paper company, one can see that this setting is low in social complexity, with one company as the single stakeholder and a single, slowly changing schema representing the inventory. In contrast, the hypothetical system managing the data about Europe’s health care actors is used in a setting high in social complexity – researchers from many locations contribute to it, nobody knows enough about all health care actors to even devise a complete schema (which would be outdated in mere days anyway) and there might be conflicting views about what the actors are and what their role is. Also, for this system to be useful to the general public, it probably needs to mediate between the language of the health care experts and that of the broader public.

Hence: for successful semantic systems to emerge in places where semantic technologies can offer the most benefit it is important that semantic technology research focuses not just on scaling to large datasets, but also on scaling to application areas high in social complexity.

Building semantic systems that scale to high social complexity needs research into:

  • Systems that are able to deal with multiple points of view, that can handle even conflicting knowledge and use it to produce results reflecting both the most likely answer as well as the diversity of opinion (e.g. Flöck et al. 2011).

  • Systems that support a user group in coming to an agreement and contributing in a structured way, for example by using crowdsourcing tools and methodologies to involve a large group of users in the creation of knowledge (e.g. Di Iorio et al. 2010).

  • Systems that seriously consider the relationships between users, for example by treating privacy and data ownership as first-class citizens (e.g. Duma et al. 2007).

  • Systems that support diverse groups of users with different world views in first understanding the available data, then formulating queries and finally understanding the result. With large, heterogeneous and fast-changing datasets and heterogeneous users, it can no longer be expected that users know which kinds of data are available or have the knowledge to understand this data. To address this, a new generation of systems must support exploration of these datasets and make use of available semantics to enable users at different levels of expertise to understand the data (e.g. Schenk et al. 2009).

One example of such work is the ongoing Render project (Hasan et al. 2011; Thalhammer et al. 2011). In this project, methods and techniques are researched that make it possible to leverage the diversity of the web as a crucial source of innovation and creativity. More concretely, new search and aggregation techniques are developed that show not just the majority opinion or the opinion closest to the user but the whole range of sentiments. Such techniques are particularly important on the Internet, where pervasive recommender systems increasingly mean that users see only the results confirming their views, thereby leading to users living in a kind of “Filter Bubble” isolated from conflicting views – something that may lead them to become ever more extreme in their views (Pariser 2011).
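As a rough illustration of aggregation that exposes the whole range of sentiments rather than only the majority view, the following sketch groups items by stance and returns the share and the top examples for every stance. It is not taken from the Render project: the input format of `(text, stance, relevance)` tuples, the stance labels and the function name `diverse_summary` are assumptions, and assigning stances to texts is presumed to happen elsewhere.

```python
from collections import defaultdict


def diverse_summary(items, per_stance=1):
    """Summarize opinions so that every stance stays visible.

    items: iterable of (text, stance_label, relevance_score) tuples.
    Returns stance -> {"share": fraction of all items, "examples": top texts}.
    """
    by_stance = defaultdict(list)
    for text, stance, score in items:
        by_stance[stance].append((score, text))
    total = sum(len(entries) for entries in by_stance.values())
    summary = {}
    for stance, entries in by_stance.items():
        entries.sort(reverse=True)  # most relevant first within each stance
        summary[stance] = {
            "share": len(entries) / total,
            "examples": [text for _, text in entries[:per_stance]],
        }
    return summary
```

Because examples are picked per stance, a minority opinion is represented even when a ranking by relevance alone would fill the result list with majority items.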

Such research – as well as that into the other challenges mentioned above – is a prerequisite for employing semantic technologies in the places where they are most needed and can bring the most benefit: application areas that are high in social complexity.

5 Conclusion

Research in semantic technologies strives to use context-awareness in all its facets within future applications. New dimensions are gained as soon as complex situations can be adequately interpreted or inferred, respecting large and small contexts on the one hand and dynamic, volatile contexts with changing references on the other. A further challenge is the well-balanced use of semantic technologies such as semantic search, proactive complex event processing or social semantics wherever they are applicable and suitable for achieving better results or performance.

Personalization and individualization serve as a motor to promote research in modelling and extracting all forms of contexts. Future applications will show the potential still to be uncovered by context-aware semantic technologies.