1 Introduction

Confucius, an ancient Chinese social philosopher, said "reviewing what you have learned and learning anew, you are fit to be a teacher". This suggests an approach that discovers new things through the study of the past. In this spirit, we conduct a retrospective survey on Large-Scale Multimedia Retrieval (LSMR), which has been receiving much research attention for more than twenty years. LSMR is the technique of analysing a large amount of multimedia data to efficiently find interesting and relevant items. In other words, LSMR can be regarded as a classification problem that discriminates between data relevant and irrelevant to a query. As described in much of the literature [27, 112, 116, 120], the most challenging issue is the semantic gap, which is the lack of coincidence between automatically extractable features (e.g., colour, edge and motion) and human-perceivable semantic meanings.

First of all, referring to Fig. 1, let us define the meanings that LSMR needs to identify. Since events are widely accepted access units to multimedia data, we decompose semantic meanings based on basic aspects of event descriptions [100, 143]. As shown in Fig. 1a, we organise meanings using three components: concept, event and context. In terms of [100, 143], concepts form the participation (or informational) aspect of objects in an event; that is, an event is derived by relating multiple concepts. Contexts are the collection of part-of, causal and correlation aspects among events.

Fig. 1 An illustration of decomposing meanings based on concepts, events and contexts

More formally, we define concepts as textual descriptions of meanings that can be perceived from images, shots or videos, such as objects like Person and Car, actions like Walking and Airplane_Flying, and scenes like Outdoor and Nighttime. In other words, concepts are the most primitive meanings for multimedia data, and are used in many state-of-the-art retrieval systems [79, 113]. Below, concept names are written in italics to distinguish them from other terms. An event is a higher-level meaning derived from the interaction of objects in a specific situation [49, 110]. In our case, it is defined by the combination of concepts. For example, in Fig. 1b, Shot 1 shows Cheese, Meat, Sausage and Grill, from which the event "burning the first three things" is derived. Shot 2 displays Hand, Food_Turner, Bread, Cheese and so on, where the event "putting Cheese etc. on Bread" is formed based on the movements of these concepts. Furthermore, as depicted by the bold-line arrow in Fig. 1a, contexts are used to recursively define higher-level events based on part-of, causal and correlation relations among lower-level ones (Footnote 1). In Fig. 1b, based on the part-of relation, the events in Shots 1 and 2 are combined into the higher-level event "cooking a Hamburger". This event and the one in Shot 3 ("eating a Hamburger") are further abstracted into "eating the cooked Hamburger". Also, the correlation relation is used to connect two 'weakly related' events, such as those which occur in separate locations but at the same time [100]. We consider the above organisation of meanings based on concepts, events and contexts as the final goal of LSMR.

To make the following discussions simple and clear, we adopt two policies. First, we use the term example to denote a single unit of multimedia data, such as an image, shot, video or audio clip. When the distinction among these data formats is not important, we use examples as their abstract name. Second, by drawing an analogy with Content-Based Image Retrieval (CBIR) in [23], we define LSMR as any technology that, in principle, helps to organise large-scale multimedia data. Hence, LSMR in this paper includes technologies such as object detection/recognition, image/video/audio classification, browsing, summarisation and so on.

Our motivation for this survey paper stems from the fact that current LSMR owes much to the availability of large-scale data and the enhancement of machine performance, while the underlying framework remains the classical machine learning approach. Roughly speaking, a classifier is built by analysing training examples, each annotated with the presence or absence of a certain meaning. The former training examples, positive examples, serve as representatives of examples relevant to the meaning, while the latter, negative examples, represent irrelevant examples. Here, examples have significantly different visual appearances (features) depending on various changing factors (e.g., camera techniques and shooting environments). Thus, by analysing a larger number of training examples with a high-performance machine, the classifier can accurately distinguish test examples relevant to the meaning from the others, irrespective of changing factors.

However, the success of the above machine learning approach does not mean that machines fed with a large number of training examples learn a classifier the way humans do. This approach is only useful for certain types of concepts (see Section 3.2). Furthermore, an event is derived from the combination of concepts, and a context is in turn composed of multiple events. Hence, examples relevant to an event or context exhibit much more diverse visual appearances than examples relevant to a concept. Thus, detecting the former requires many more training examples than detecting the latter. But, since events and contexts are very specific meanings, collecting many training examples is impractical. Thus, knowledge about human interpretation of semantic meanings is necessary to derive events from automatically detectable concepts, and to further deduce contexts from events. In other words, this knowledge is used to effectively cover the huge diversity of visual appearances associated with examples relevant to an event or context. Therefore, this paper argues for the importance of human-machine cooperation approaches which enrich LSMR with knowledge about human interpretation.

2 Overview of our retrospective survey

Our survey approach is significantly different from those of existing papers in the fields of multimedia retrieval and understanding. Most survey papers adopt a progressive approach that derives future research directions from the progress of component technologies. Recent papers [15, 23, 49, 61, 67, 113] mainly reviewed the following four component technologies: (1) feature extraction, representation and transformation methods, (2) retrieval methods based on knowledge bases, machine learning techniques, feature similarities and data mining methods, (3) user interaction methods such as query specification, browsing (visualisation) and feedback, and (4) benchmark datasets for performance evaluation. These papers then suggest future problems that should be further explored or should receive more attention, such as improvement of component technologies, design of application-oriented (human-centric) interfaces, scalability through both high-performance computing and algorithm sophistication, synergy between different modalities like text, image, video and audio, and utilisation of user-generated Web data like tagged images and videos.

Compared to the existing papers, we take a retrospective approach. By tracing the progress of LSMR, we detect missing links in recent approaches that were addressed by classical ones. Roughly speaking, because the data size is too large to be manageable by humans, researchers tend to leave LSMR entirely to machines, where no knowledge about human interpretation is used. Thus, we argue for the importance of human-machine cooperation which explicitly incorporates knowledge into machine-based approaches. In other words, researchers have already established large-scale labelled examples from which we now need to extract knowledge and utilise it, in order to achieve retrieval for high-level meanings. This means a 'return' to classical approaches, but future LSMR also needs to consider much larger and more structured knowledge.

As depicted by the time axis in Fig. 2, we review existing LSMR methods in chronological order. We classify them into three categories: machine-based, human-based and human-machine cooperation. Machine-based LSMR does not explicitly utilise knowledge about human interpretation. This category began with syntactic methods using pre-defined templates of features based on specific camera and editing techniques in a particular video domain. In Fig. 2, we locate their starting point before 1995, taking one of the earliest methods, by Zhang et al. in 1993 [161], as a reference. In Fig. 2, the italic term (i.e., Concept, Event or Context) under a method name represents the level of meaning that can be detected. We assume that, according to the compositional relation in Fig. 1a, methods which can detect events (or contexts) also have the capability to expose concepts (or concepts and events). Syntactic methods support retrieval of events over shot sequences, such as goal events in football videos and conversation events in movies. However, syntactic methods lack generality because pre-defined templates are sensitive to changes in shooting environments and shot concatenations. Thus, since around 2000, the research focus has shifted to machine learning methods which statistically build a classifier using training examples. As discussed before, these methods are only effective for concepts. As the effectiveness of using a large number of training examples together with high-dimensional features [22, 78] became well known, after 2005 researchers started to develop large-scale methods which improve the scalability of machine learning ones.

Fig. 2 Overview of our retrospective review of existing LSMR methods

Human-based LSMR is supported by a human, but the machine and the human operate independently of each other. The earliest manual methods perform retrieval based on manual annotation. In Fig. 2, their starting point is set before that of syntactic ones, because multimedia retrieval was originally explored as a database problem (e.g., [145]) where examples are manually indexed so as to flexibly respond to queries at all levels of meaning. However, considering the labour required for manual annotation, Web-based methods have received attention since the early 2000s, where users on the Web collaboratively annotate large-scale data [64]. Note that annotation in such methods has to be easy so that many users can participate in it. In consequence, Web-based methods only support simple concept-level retrieval. Meanwhile, human-based LSMR includes interactive methods in which a human provides additional training examples based on the current retrieval result, as represented by relevance feedback developed in the late 1990s [97]. Considering practical use, feedback should be efficient. Since it takes time to check shots or videos, interactive methods are suitable for concepts whose presence or absence can be quickly judged from single images. It should be noted that interactive methods do not affect retrieval algorithms, but just provide additional data to tune parameters. In contrast, human-machine cooperation addresses the collaboration of humans and machines at the algorithm level.

As can be seen from Fig. 2, although early methods supported retrieval for high-level meanings (events and contexts), recent ones can only perform concept-level retrieval because they favour generality and scalability for large-scale data, and usability for users. Concepts are not very useful for practical applications because they are too primitive (or general) to identify the examples that users want to retrieve. Therefore, future LSMR should address the extraction of events and contexts while improving concept detection performance. Human-machine cooperation is promising for achieving these goals and deserves more attention. As shown in Fig. 2, we describe three types of methods. First, cognition-based methods utilise knowledge about the human visual system. These aim to improve concept detection by implementing the mechanism by which human brains process visual information. Second, ontology-based methods make use of knowledge about human inference for high-level meanings, in order to derive events and contexts from concepts. In Fig. 2, we locate the starting points of the above methods at 2005, when the importance of a standardised set of concepts became well known [79]. Last, adaptive learning methods utilise knowledge about human learning, and adaptively control components of various retrieval algorithms. Thus, their role is the auxiliary improvement of detecting all levels of meanings. We set the starting point of adaptive learning to 2009 because, to the best of our knowledge, one of the earliest methods was developed in [12]. We hope that the discussion about human-machine cooperation reminds researchers of the importance of knowledge for interpreting semantic meanings.

3 Machine-based LSMR

In this section, we first describe classical syntactic methods and their disadvantages. Then, we review machine learning methods which overcome these disadvantages. Finally, we present recent works that improve the scalability of machine learning approaches.

3.1 Syntactic approaches

The easiest approach is to utilise prior knowledge about the structure (syntax) of videos. This is only possible when retrieval is limited to a specific genre of videos. For example, in news videos, an event (Footnote 2) that represents one news topic starts with a shot where an anchor person appears, and ends with another shot where the anchor person appears again [160, 161]. In addition, an interview event is characterised by a sequence of shots, where shots showing an interviewer alternate with shots showing an interviewee [87]. To detect these events, researchers first construct a graph where each node represents a group of visually similar shots, and each edge represents the transition between two groups of shots [87, 160]. Events are then detected by extracting cycles which are connected to the node of shots showing the anchor person, as sketched below.
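To make the procedure concrete, the following is a minimal sketch of the cycle-extraction idea, not the exact algorithms of [87, 160]: shots are assumed to be pre-grouped into clusters of visually similar shots, transitions between consecutive shots become directed edges, and candidate news-topic events are cycles passing through the anchor-person node. The cluster labels and anchor node are hypothetical inputs.

```python
# Sketch of cycle-based news-event detection over a shot-transition graph.
import networkx as nx

def detect_anchor_cycles(shot_clusters, anchor_node):
    """shot_clusters: cluster id of each shot, in temporal order."""
    graph = nx.DiGraph()
    # Every observed transition between consecutive shots adds a directed edge
    # between the corresponding cluster nodes.
    for prev, curr in zip(shot_clusters, shot_clusters[1:]):
        graph.add_edge(prev, curr)
    # Candidate events: cycles that pass through the anchor-person node.
    return [cycle for cycle in nx.simple_cycles(graph) if anchor_node in cycle]

# Example: cluster 0 is assumed to be the anchor-person cluster.
print(detect_anchor_cycles([0, 3, 3, 5, 0, 2, 4, 0], anchor_node=0))
```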

In a sports video, an event corresponding to one move in a game starts with a specific shot [163]. For example, in a baseball video, an event starts with a shot taken from behind the pitcher, and when the batter hits the ball, a camera follows its flight in the next shot. In an American football video, each play starts with the formation where players line up on two sides of the ball. Also, goal events in ball game videos are characterised by a score change on the score caption, followed by the audience's cheering and applause [166]. Based on the above heuristics, each event is modelled using a pattern which represents a sequence of characteristic features [166], or a Hidden Markov Model (HMM) [4]. Then, the event in an unknown video is detected by finding sequences of shots which match the pattern or HMM.

Movie directors and editors use film grammar, which consists of practical rules to concentrate the viewer's attention on the story of a video [76]. For example, thrilling events are presented with a fast transition of shots with very short durations in order to emphasise the thrilling mood. On the other hand, romantic events are created by concatenating shots with very long durations, where a person's emotions and actions are thoroughly presented. In addition, a conversation between two persons is displayed by alternating shots showing the two persons one after another. Based on film grammar, conversation, suspense and action events are extracted using sequential patterns [153] or Finite State Machines (FSMs) [159]. Such a pattern or FSM represents a characteristic sequence of features, such as shot duration, motion, audio, and repetition of visually similar shots. Also, in [1], the tempo at which a viewer perceives meanings (e.g., haste and calm) is computationally defined based on shot durations and camera movements. Dramatic events are then detected by extracting gradual and sharp changes of the tempo. Furthermore, violent events are detected based on multiple heuristically defined features, such as motion, shot duration, flame colour, blood colour and sudden increases in audio energy [77].

In a surveillance video recorded by a fixed camera, the background frame can be defined as the one where no object appears. Based on this, the movement of an object can be easily captured by computing the difference between a video frame and the background frame. Thus, an event where an object actively moves is detected as a sequence of video frames which have large differences to the background frame [85]. Also, it is assumed that most events captured by a surveillance camera are normal and that anomalous ones are very rare. Hence, by grouping video frames into clusters of similar frames, clusters including a large number of frames represent normal events, while clusters including a small number of frames represent anomalous ones [165].
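A minimal sketch of the frame-differencing idea is given below, assuming greyscale frames stored as NumPy arrays; the thresholds are hypothetical and would need tuning per camera. It illustrates the principle rather than the method of [85].

```python
# Detect 'active' frames whose difference to the background frame is large.
import numpy as np

def active_frames(frames, background, diff_threshold=20.0, min_active_ratio=0.05):
    """Return indices of frames likely to contain a moving object."""
    active = []
    bg = background.astype(np.float32)
    for idx, frame in enumerate(frames):
        diff = np.abs(frame.astype(np.float32) - bg)
        # A frame is active if a sufficient fraction of its pixels changed.
        if np.mean(diff > diff_threshold) > min_active_ratio:
            active.append(idx)
    return active
```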

The above syntactic methods can only process a limited number of a-priori known queries. However, users issue a variety of queries which cannot be assumed in advance. To overcome this, some research effort has been made on video data mining, where videos are analysed using data mining techniques that extract previously unknown, interesting patterns in the underlying data [103, 105]. As a result, patterns for retrieving a variety of events can be extracted. Specifically, we developed a method which extracts sequential patterns for associating adjacent shots related to a certain event [105]. Such sequential patterns are extracted by connecting statistically correlated features in adjacent shots. In addition, we also devised a method which extracts patterns for characterising 'topics' [103]. A topic is an event showing an interesting action like fighting, chasing or kissing. It is assumed that topics are presented not by normal editing patterns but by abnormal ones, because the latter have much more impact on viewers than the former. In [103], a probabilistic time series segmentation is developed to extract abnormal patterns, each of which shows a certain person appearing in continuous shots with abnormally long or short durations.

However, even with video data mining methods, it is practically impossible to prepare all the patterns needed to respond to a variety of queries. In addition, pre-specified patterns or retrieval models lack generality because it is difficult to anticipate the diversity of camera and editing techniques that can be used to present an event. Thus, the research focus shifted to a more general and flexible approach, where a user represents a query by providing training examples, based on which a retrieval model is constructed on the fly. The next section describes this kind of machine learning method.

3.2 Machine learning approaches

Machine learning is applied to LSMR as Query By Example (QBE), where a user first provides some training examples as a query, and then a classifier is constructed using them [44, 93]. The classifier distinguishes test examples relevant to the query from the others. Classical QBE methods search for test examples that are the most similar to the given positive examples. The following two research topics have been assiduously explored. The first is the development of good similarity measures between positive and test examples. Many similarity measures were developed, such as a histogram-based measure [45], a psychology-based measure [66], a measure based on weighted graph matching [91], and a measure based on the longest common subsequence (LCS) [53]. The other topic is the speed-up of the similarity calculation. For example, Kashino et al. developed a method that avoids unnecessary similarity calculations by estimating the upper bound of similarity [52], and Yuan et al. devised a two-phase hierarchical method that first computes a coarse similarity on sub-sampled video frames, and then verifies the similarity using fine audio features [155].
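As an illustration of the first topic, the sketch below ranks database examples by histogram intersection, one classical histogram-based similarity; it is a generic example rather than the specific measure of [45].

```python
# Rank database examples by histogram intersection with the query histogram.
import numpy as np

def histogram_intersection(query_hist, test_hist):
    """Both inputs are L1-normalised histograms; the score lies in [0, 1]."""
    return float(np.sum(np.minimum(query_hist, test_hist)))

def rank_by_similarity(query_hist, database_hists):
    """Return database indices sorted from most to least similar."""
    scores = [histogram_intersection(query_hist, h) for h in database_hists]
    return np.argsort(scores)[::-1]
```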

One big disadvantage of classical QBE methods is that they use global features which represent overall characteristics of an example, such as a colour histogram indicating overall colours, and a texture vector expressing overall responses to different filters. Such global features cannot capture the detailed content of the example. To address this, in the late 1990s, Schmid and Mohr proposed to represent an example as a collection of local descriptors, each of which represents the characteristic of a local region [101]. To avoid confusion, we define a descriptor as the representation of a local region [162], and a feature as the representation of an example based on a set of local descriptors. Figure 3 illustrates feature extraction using local descriptors. First, as depicted by yellow circles in Fig. 3a, local descriptors are sampled from small regions in an example. Then, the example is represented by a feature which describes the distribution of sampled local descriptors. The feature (i.e., distribution) in Fig. 3b indicates that the example contains many descriptors similar to the one marked with (1), and few descriptors similar to (2). Such a feature reflects the detailed content of the example. In particular, by sampling a large number of local descriptors, the feature becomes robust to shape deformations and occlusions. For example, even if the car in Fig. 3a is partially masked by other objects, local descriptors that characterise its visible parts like a wheel, window or headlight are included in the feature. Below, we review existing QBE methods in terms of local descriptors, features and classifiers.

Fig. 3 An illustration of feature extraction based on local descriptors

Many local descriptors have been proposed to capture different characteristics of a local region. The most popular one is the Scale-Invariant Feature Transform (SIFT) descriptor, which represents the shape in a local region reasonably irrespective of changes in illumination, rotation, scaling and viewpoint [68]. Sande et al. developed SIFT descriptors that are defined in different colour spaces and have unique invariance properties for lighting conditions [131]. Furthermore, local descriptors can be defined around trajectories, each of which is obtained by tracking a point in a video [140]. The resulting local descriptors represent the displacement of a tracked point, the derivative of that displacement, and edges around the trajectory. Also, Speeded-Up Robust Features (SURF) descriptors are similar to SIFT descriptors, but can be computed efficiently based on the integral image structure, which quickly yields the sum of pixel values in any image region [9].
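For reference, extracting SIFT descriptors is straightforward with current libraries; the sketch below uses the cv2.SIFT_create API available in recent OpenCV releases, and the image path is a hypothetical placeholder.

```python
# Extract SIFT keypoints and 128-dimensional descriptors from one image.
import cv2

image = cv2.imread("example_shot.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
# keypoints: detected local regions; descriptors: one 128-d vector per region.
keypoints, descriptors = sift.detectAndCompute(image, None)
print(len(keypoints), descriptors.shape)
```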

The main research theme for feature extraction is to accurately represent the distribution of local descriptors sampled from an example. The simplest approach, called Bag of Visual Words (BoVW), represents the distribution as a collection of characteristic local descriptors, namely visual words [22]. A set of local descriptors is first grouped into clusters, where each cluster centre is a visual word. Then, each local descriptor extracted from an example is assigned to the most similar visual word. As a result, the example is represented as a histogram of the frequency of each visual word. Many extensions of BoVW have been proposed, such as soft assignment, which extracts a smoothed histogram by assigning each local descriptor to multiple visual words based on kernel density estimation [131]; sparse coding, which represents the distribution over a large number of basis functions used to sparsely approximate local descriptors [150, 154]; the Gaussian Mixture Model (GMM) supervector, which estimates the distribution of local descriptors using a GMM [43]; Fisher vector encoding, which considers the first- and second-order differences between the distribution of local descriptors and a reference distribution [92]; and the Vector of Locally Aggregated Descriptors (VLAD), which concatenates vectors each representing the accumulated difference between a visual word and its assigned local descriptors [5, 46].
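The basic BoVW pipeline can be summarised in a few lines. The sketch below uses k-means cluster centres as visual words with a hypothetical vocabulary size, and omits the soft-assignment and encoding variants listed above.

```python
# Bag of Visual Words: quantise local descriptors and build a histogram per example.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, n_words=1000):
    """all_descriptors: (N, d) array of local descriptors pooled over training examples."""
    return KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(all_descriptors)

def bovw_feature(descriptors, vocabulary):
    """Assign each descriptor of one example to its nearest visual word and
    return the L1-normalised frequency histogram."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)
```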

Since a feature which precisely represents the distribution of local descriptors is necessarily high-dimensional, a classifier effective for high-dimensional data is used in QBE. Typically, a Support Vector Machine (SVM) is used [22, 48, 106, 131, 162] because its 'margin maximisation' principle can extract a well-generalised classification boundary between positive and negative examples in the high-dimensional feature space [133]. It should be noted that only positive examples are provided in QBE. Regarding this, Natsev et al. proposed to use randomly sampled examples as negatives, assuming that only a small number of examples in the database are relevant to a query [80]. That is, almost all of the randomly selected examples are irrelevant and serve as negatives. This approach works reasonably well and has been used in many existing works [82, 106, 115]. In addition, to cover a diversity of examples relevant to a query, bagging and the random subspace method are used to combine multiple SVMs built on subsets of randomly selected training examples and subsets of randomly selected feature dimensions, respectively [80, 124, 149]. Such SVMs characterise different portions of relevant examples. In this context, we proposed a method using rough set theory, which is a set-theoretic classification approach for extracting rough descriptions of a class from imprecise (or noisy) data [106]. Specifically, our method extracts classification rules, each of which represents an SVM combination that correctly identifies a different subset of positive examples. By accumulating relevant examples with such classification rules, a variety of relevant examples can be accurately covered.
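The pseudo-negative strategy of [80] can be sketched as follows; the feature arrays, the number of sampled negatives and the RBF-kernel SVM are illustrative assumptions rather than the exact setting of that work.

```python
# QBE with randomly sampled pseudo-negatives: train an SVM and rank the database.
import numpy as np
from sklearn.svm import SVC

def qbe_ranking(positive_feats, database_feats, n_negatives=200, seed=0):
    rng = np.random.default_rng(seed)
    neg_idx = rng.choice(len(database_feats), size=n_negatives, replace=False)
    X = np.vstack([positive_feats, database_feats[neg_idx]])
    y = np.concatenate([np.ones(len(positive_feats)), np.zeros(n_negatives)])
    clf = SVC(kernel="rbf", probability=True).fit(X, y)
    # Rank all database examples by estimated relevance to the query.
    return np.argsort(clf.predict_proba(database_feats)[:, 1])[::-1]
```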

Compared to classical syntactic methods, the above machine learning methods have much greater generality because examples relevant to a query can be retrieved with significantly higher accuracy, regardless of video genres, camera techniques and shooting environments. However, machine learning methods are useful only for concepts corresponding to 'basic categories' of meanings, such as Person, Car and Building. Even though visual appearances within each basic category vary significantly, they are clearly different from those of other basic categories [25, 59]. In contrast, for concepts corresponding to 'subordinate categories', like Rider, Driver and Factory_Worker of Person, or Bus, Van and Truck of Car, visual appearances can be distinguished only by small localised regions, or by considering relations to surrounding concepts. In addition, since an event (or context) involves multiple concepts, examples relevant to it exhibit a much larger feature variance than examples relevant to a single concept. Therefore, knowledge about human interpretation is necessary to appropriately detect concepts for subordinate categories, events and contexts.

3.3 Using large-scale data

This section presents methods for scaling up machine learning methods to large-scale data. We mainly focus on concept detection, where a large number of training examples are available. In recent years, several worldwide competitions, such as TRECVID [110] and PASCAL VOC [89], have compared concept detection methods developed at different research institutes using large-scale benchmark data. These competitions have been promoting the improvement of concept detection methods.

One of the most important issues in concept detection is that different camera techniques and shooting environments cause visually diverse examples in which a certain concept appears. To cover such diversity, a large number of training examples are required. In general, the detection performance is proportional to the logarithm of the number of positive examples, although each concept has its own detection complexity [78]. This means that ten times more positive examples improve the performance by 10 %. In an extreme case, 80 million training images yield accurate detection performance [128]. Furthermore, using two billion images, specific concepts such as celebrities, consumer electronics and landmarks can be detected accurately [139]. Given the importance of the number of training examples, researchers have developed online systems where many users on the Web collaboratively annotate a large number of examples as positive or negative [6, 134].

Another important issue is the sampling of local descriptors. Algorithms for extracting local descriptors generally consist of two modules, a region detector and a region descriptor [162]. The former detects regions useful for characterising concepts, and the latter represents each of the detected regions as a vector. A concept appears in significantly different regions, and in videos it does not necessarily appear in all video frames. Considering such 'unclear' concept appearances, it is effective to exhaustively sample local descriptors in both the spatial and temporal dimensions. Indeed, the performance improves as the number of sampled local descriptors increases [83]. Moreover, Snoek et al. compared two methods: one extracts features from only one video frame in each shot (one shot contains more than 60 frames), and the other extracts features every 15 frames [114]. They found that the latter outperforms the former by 7.5 to 38.8 %.
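Exhaustive spatial sampling is often implemented as a dense grid of keypoints at which descriptors are computed; the sketch below is one such grid sampler with hypothetical step and patch sizes, not the exact configuration used in [83, 114].

```python
# Dense sampling: place keypoints on a regular grid instead of detected regions.
import cv2

def dense_keypoints(image, step=8, size=16):
    h, w = image.shape[:2]
    return [cv2.KeyPoint(float(x), float(y), size)
            for y in range(step, h - step, step)
            for x in range(step, w - step, step)]

# Descriptors can then be computed at these fixed locations, e.g.:
# _, descriptors = cv2.SIFT_create().compute(grey_image, dense_keypoints(grey_image))
```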

Although a large number of training examples and exhaustively sampled local descriptors are immensely important for accurate concept detection, processing them requires high computational costs. Many methods for reducing these costs have been developed. They can be classified into two types, 'high-performance computing' and 'algorithm sophistication'. The first type parallelises the classifier training/testing process and the feature extraction using special hardware, such as a cluster of multiple PCs [2, 149], multicore CPUs [21] and General-Purpose computing on Graphics Processing Units (GPGPU) [18, 132]. The second type includes a fast SVM training method that iteratively solves sub-problems consisting of the most problematic training examples [28], a fast SVM training method that iteratively solves simple one-variable sub-problems [41], a fast SVM training and test method that efficiently computes similarities (kernel values) by sorting the dimension values of each example [72], a fast SVM test method that hashes SVM parameters and the feature of each test example [65], and a fast feature extraction method that organises the distribution of local descriptors into a tree structure [43].

We also developed a fast SVM training/test method and a fast feature extraction method based on matrix operations [104]. The former re-formulates the similarity computation, which enables batch computation of similarities among many examples. The latter re-formulates the probability density computation, so that probability densities of many local descriptors can be computed in a batch. With these, SVM training/test and feature extraction become about 10–37 and 5–7 times faster than the normal implementation, respectively. By processing a large number of training examples and exhaustively sampled local descriptors using the above methods, we achieved the highest performance in the TRECVID 2012 Semantic Indexing (light) task, one of the most prominent worldwide competitions on video analysis and retrieval [104].
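The gist of such batch re-formulations is that kernel values for many example pairs are produced by a single matrix expression instead of nested loops. The sketch below shows this for an RBF kernel; it illustrates the general idea only and is not the formulation of [104].

```python
# Batch computation of RBF kernel values between two sets of feature vectors.
import numpy as np

def rbf_kernel_matrix(A, B, gamma=0.1):
    """A: (n, d), B: (m, d). Returns the (n, m) matrix of kernel values."""
    # Squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b, computed in one shot.
    sq_dists = (np.sum(A ** 2, axis=1, keepdims=True)
                + np.sum(B ** 2, axis=1)
                - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))
```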

Finally, as described above, much research effort has been invested in scaling up machine learning methods to large-scale data, yet the underlying framework remains the same. In other words, researchers prioritise the generality and scalability of a method, so that the same method can be used to search large-scale data for a variety of queries. Of course, improving generality and scalability is very important. But we think that this intense focus on it is one reason why the mechanism of recent LSMR methods has become completely different from the way humans interpret semantic meanings.

4 Human-based LSMR

This section first presents classical human-based LSMR methods based on manual annotation. Recent methods are then described where manual annotation is conducted collaboratively by users on the Web. Finally, we review interactive methods that enable users to interactively refine retrieval results.

4.1 Manual annotation

In classical manual annotation methods, videos are annotated with text descriptions. The following three issues are mainly addressed [122]:

1. Identification of meaningful segments: Videos are continuous media in which sequences of media quanta (i.e., video frames and audio samples) convey semantic meanings when continuously played over time [35]. Hence, any segment of a video can become a meaningful unit.

2. Annotation that should be provided: A video contains too many meanings, ranging from low-level ones like colour and shape to high-level ones like event and context. Thus, it is difficult to annotate the video with all the meanings contained in it.

3. Discrepancy between annotation and user expectation: This concerns the difference between segments that are annotated and segments that users expect to retrieve. For example, one intuitive answer to the query "two persons A and B are talking to each other" is a shot annotated with both A's and B's presences. However, a sequence of shots can be another answer, where shots annotated only with A's presence and shots annotated only with B's presence alternate. Thus, dynamic organisation of annotated shots (segments) is required to correctly respond to queries.

In accordance with these, we present several manual annotation methods.

Weiss proposed the algebraic video data model, where text descriptions are organised in a nested hierarchical way by considering their temporal relationships, such as overlapping and inclusion [142]. This makes it easy to attach different meanings to the same segment, and to construct compound meanings from annotated ones using algebraic operations like union, intersection and concatenation. Oomoto and Tanaka developed the Object-oriented Video Information Database (OVID), where a segment and its text descriptions are regarded as a video object and attribute values, respectively [86]. Such attribute values of a video object are inherited by another object based on their temporal inclusion relationship. This way, text descriptions are shared among video objects, so that the manual annotation effort is significantly reduced.

Uehara et al. proposed an approach which represents the story of a video using a binary tree, called a story graph [129]. In this graph, each node represents the relation (e.g., sequential, physically-causal or psychologically-causal) between two successive segments, and edges are labelled with semantic constraints. This enables users to retrieve arbitrary-length scenes specified in natural language, and to retrieve causes or consequences of queries based on causal relationships. Zettsu et al. developed a time-stamped authoring graph where each node represents a text description at a certain point of time in a video, and two nodes are connected if they have a strong semantic correlation based on co-occurrences of words in text descriptions [157]. Given a query, a segment is retrieved as the minimal subgraph consisting of nodes that contain the query words. This way, meaningful segments are dynamically determined depending on the issued query.

Pattanasri et al. developed a method using a knowledge base (ontology) about contexts [90]. This knowledge base represents relationships among verbs, such as "kill" implies "die". Thereby, video segments that are related in terms of causes and effects of a person's actions can be linked together and retrieved as a whole. François et al. developed an extensible and hierarchical framework for representing events in videos [32]. Here, complex events are constructed from simpler ones by operations, such as sequencing, iteration and alternation, which are defined in a knowledge base. In this way, various complex events can be defined using only relatively few primitive events.

Since the above approaches incur expensive manual annotation costs, the research focus has shifted to the machine learning methods in Section 3.2 and the ones described in the next section. It should be noted, however, that very flexible context-level retrieval is possible based on such laborious manual annotation.

4.2 Web-based annotation

Many Web-based annotation systems have been developed to distribute the manual annotation of large-scale multimedia data to many users on the Web. To achieve this, the usability and quality of annotation must be addressed. Usability concerns whether users can easily annotate examples; if it is insufficient, many users cannot be expected to participate in annotation. Regarding quality, meaningless annotation may be provided by malicious users or through operation mistakes.

The IBM research group developed a system for annotating a large number of shots with the presence or absence of concepts [64, 134]. To improve usability, users are allowed to customise their annotation styles, such as the number, size and layout of shots displayed per page, the use of mouse and/or keyboard, and annotating one or more concepts at a time. In addition, the system informs a user of how difficult the annotation of each concept is, based on the disagreement with past annotations by different users, so that the annotation quality improves. In [6], the system in [134] is extended using active learning, in order to preferentially annotate shots that are promising for improving the classifier of a concept (see Section 4.3 for more detail about active learning).

Russell et al. developed LabelMe, a Web-based system for annotating object (concept) regions in images [98]. Given an image, the user labels an object region by drawing a polygon with the mouse, and then types the object name. To improve usability and maintain consistent annotation, the researchers considered several extensions, such as using a lexical knowledge base (WordNet) for expanding and disambiguating freely typed object names, and using object relations for suggesting candidate objects whose regions frequently overlap a user-specified region.

However, the above systems do not consider the motivation of users. In other words, regular users on the Web are unlikely to volunteer to annotate when no benefit or reason is given. In consequence, only researchers participate in annotation, which makes it difficult to collect large-scale annotation. Von Ahn and Dabbish proposed the Games With A Purpose (GWAP) approach, where users play a game and, as a side effect, a computationally difficult task is solved [135, 136]. More concretely, users play a fun game without knowing that they are conducting image annotation. Owing to the motivation that users want to have fun, as of July 2008, 200,000 users had contributed to assigning more than 50 million labels to images on the Web [136].

The first game based on the GWAP approach is the ESP game, where randomly paired users are first given the same image, and then each user guesses a label that the other user is likely to provide [135, 136]. If the labels provided by both users agree, they get a certain number of points, and the next image is given. In this way, users are encouraged to get more points and play the ESP game many times. Since the users know nothing about each other and cannot communicate, the easiest way for them to earn points is to provide labels relevant to the given images. Thus, annotation data obtained by the ESP game are likely to be meaningful. The quality of annotation is further improved using taboo words that users are not allowed to type. Several variants of the ESP game have been developed, such as games for object region annotation [117, 137], video annotation [168], music annotation [8] and geographically-referenced photo annotation for landmark objects [10].

Another approach that motivates users is crowdsourcing, which outsources tasks usually performed by designated humans (employees) to users on the Web [94]. In the field of multimedia annotation, one of the most famous crowdsourcing systems is Amazon's Mechanical Turk, where anyone can post small tasks and specify the prices paid for completing them [54].

Although large-scale annotation data can be obtained by the above Web-based systems, flexible retrieval like that presented in the previous section is difficult. This is because annotation has to be simple in order to maintain usability. Also, one drawback of the GWAP approach is that users tend to maximise their scores, so the collected labels only represent general properties of examples (e.g., colour and shape), and do not represent specifics or details [37]. Furthermore, applying Mechanical Turk to large-scale data incurs a huge monetary cost. To the best of our knowledge, no Web-based system that fully supports annotation of high-level meanings has been developed so far.

4.3 Interactive approaches

This section focuses on interactive approaches where users iteratively refine the retrieval result based on the current one. These are needed because of user individuality, which means that even for the same query, different users may be interested in different data [164]. For example, for the query "horse", one user may look for shots showing an "adult horse", while another may look for shots showing a "child horse". In addition, it is often difficult for a user to precisely express his/her intent, because of a poor lexical vocabulary or the lack of proper positive examples. This is called the intention gap, the discrepancy between a user's search intent and the query he/she specifies [159]. Thus, interactive refinement of retrieval results is necessary to overcome user individuality and the intention gap.

One of the most popular interactive approaches is Relevance Feedback (RF), which asks a user to provide feedback regarding the relevance or irrelevance of the currently retrieved examples [164]. RF can also be considered as active learning, which selects the most informative examples for improving the performance of a classifier and asks the user to annotate them [138]. Using such RF or active learning, a classifier is refined so as to efficiently find the desired examples. Rui et al. developed one of the earliest RF methods, which dynamically updates both feature weights and feature-dimension weights based on the user-provided relevance score for each retrieved image [97]. Also, the most typical RF method uses an SVM as a classifier, where the examples closest to its decision boundary are labelled by the user [127, 138]. For such examples the prediction of the SVM is the most uncertain, so labelling them is useful for refining the SVM. In other words, examples far from the decision boundary are regarded as reasonably classified, so labelling them is redundant.
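A minimal sketch of this SVM-based feedback loop is given below: examples with the smallest absolute decision values are shown to the user, and their labels are added before retraining. The helper ask_user stands in for the interactive relevance judgement and is hypothetical.

```python
# One round of SVM-based relevance feedback via uncertainty sampling.
import numpy as np
from sklearn.svm import SVC

def most_uncertain(clf, unlabelled_feats, n_show=10):
    """Indices of the examples closest to the SVM decision boundary."""
    margins = np.abs(clf.decision_function(unlabelled_feats))
    return np.argsort(margins)[:n_show]

def feedback_round(X_train, y_train, unlabelled_feats, ask_user):
    clf = SVC(kernel="rbf").fit(X_train, y_train)
    query_idx = most_uncertain(clf, unlabelled_feats)
    new_labels = [ask_user(i) for i in query_idx]  # interactive relevance judgements
    X_new = np.vstack([X_train, unlabelled_feats[query_idx]])
    y_new = np.concatenate([y_train, new_labels])
    return SVC(kernel="rbf").fit(X_new, y_new)
```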

In [146], Wu and Zhang proposed an RF method using a random forest, which predicts the relevance of an example to a query by combining multiple tree classifiers built on different subsets of randomly sampled training examples. By refining the random forest based on additional training examples obtained by RF, the multiple tree classifiers can cover the multimodal distribution of relevant examples in the feature space. This approach is extended to adaptive pattern discovery [147], which addresses RF by interactively discovering meaningful patterns of relevant examples. To facilitate pattern discovery, the authors present a dynamic feature extraction method which aims to alleviate the curse of dimensionality by extracting a feature subspace using balanced information gain. The scientific achievements described above are integrated within the so-called PatternQuest framework [148], which learns the patterns of interest (i.e., the distribution patterns of positive examples) using classification methods and RF.

Conventional RF requires a crisp binary decision to be made on the relevance of the retrieved images. However, user interpretation varies with respect to different information needs and perceptual subjectivity. In addition, users tend to learn from the retrieval results to further refine their information request. It is therefore inadequate to describe a user's fuzzy perception of image similarity with crisp logic. In view of this, Yap et al. [151] proposed a 'fuzzy' RF approach which enables the user to make a fuzzy judgement for relevance ranking, whereas a radial basis function (RBF) network with a local modelling structure is used for similarity learning. Another fuzzy RF method was described in [7], where a composite short-term and long-term learning approach is used to learn the semantics of an image. The short-term learning technique applies Fuzzy Support Vector Machine (FSVM) learning to user-labelled and additionally chosen image blocks to learn a more accurate boundary for separating relevant and irrelevant blocks at each feedback iteration. The long-term learning technique applies a novel semantic clustering to adaptively learn and update the semantic concepts at each query session.

Apart from RF, interactive approaches include browser-based methods which provide user interfaces to facilitate finding examples relevant to a query [111]. Here, examples are organised based on some criteria. For example, Wilkins et al. developed a broadcast-based browsing interface [144]. This provides a user with a list of videos, each of which contains highly-ranked shots (i.e., shots regarded as relevant by a classifier). With the video as the unit of result presentation, the user can explore the context: even if a highly-ranked shot is irrelevant, shots temporally close to it may be relevant. Snoek et al. devised a thread-based browsing interface to explore videos from multiple perspectives [115]. A thread presents a sequence of shots that are linked based on a certain type of similarity. The interface in [115] utilises four threads regarding the temporal similarity between shots, the visual similarity between shots, the visual similarity of shots to positive examples, and the history of explored shots. Recently, a live competition has been established where browser-based methods developed by different research groups are evaluated in the same experimental setting in view of the audience [102].

Another variant of interactive approaches is visual query suggestion, which assists a user in precisely formulating a query by simultaneously suggesting keywords and images based on the initially provided keyword [158]. Compared to suggesting only keywords, adding images is useful for manifesting a user's vague intent. In addition, the retrieval performance is improved by reranking images retrieved by keywords based on their visual similarities to the suggested images. In [158], the researchers developed a query suggestion mining method which first selects keywords that are statistically related and informative to the initially provided keyword. Then, the images suggested with each selected keyword are obtained by clustering images associated with both the initial and the selected keyword, and choosing representative ones. The above visual query suggestion is extended to object retrieval where, given a query region of a certain object, regions showing the same object are retrieved from a large collection of images [38]. Considering the difficulty of specifying query regions effective for retrieval, the method suggests a query region of an object by guaranteeing that this object is also present in other images. To this end, repeated patterns of geometrically consistent local descriptors are first mined. Since such a pattern only represents a part of an object, bipartite clustering is performed to extract regions of the same object as a cluster of patterns that co-occur in the same images.

To sum up, although the performance is somewhat improved by tuning features or classifiers based on RF, they remain substantially the same and are insufficient for representing high-level meanings. Browser-based methods simply leave the most difficult task (i.e., interpretation of meanings) to humans. In addition, visual query suggestion only refines queries and does not improve retrieval algorithms. Furthermore, considering the usability of interactive methods, user interaction should be done quickly. Thus, interactive methods are only useful for concepts which can be easily recognised from single images or shots.

5 LSMR based on human-machine cooperation

In this section, we review existing human-machine cooperation methods which incorporate knowledge about human interpretation into LSMR. As shown in Fig. 2, we sequentially present cognition-based, ontology-based and adaptive learning methods, and discuss future research topics for each type of method.

5.1 Cognition-based approaches

Cognitive science is an interdisciplinary study of mind and intelligence which aims to theoretically explain how the human mind (thinking) works [84, 121, 126]. In particular, owing to psychological and neurological experiments, the human visual system has been intensively investigated, from the 'sensation' process which transduces the light (stimulus) received by the eye into neural signals, to the 'perception' process which translates neural signals into meanings. Since these processes are applicable to concept detection, that is, deriving concepts from raw multimedia data, we define cognition-based methods as those that improve concept detection using knowledge about the human visual system. Below, as shown in Fig. 4, we review cognition-based methods by classifying them into two categories. The first includes methods using knowledge about concept meanings, like hierarchical/spatio-temporal relations and attributes, and the second includes methods which aim to implement the human visual system.

Fig. 4 Categorisation of cognition-based LSMR methods

5.1.1 Methods using knowledge about concept meanings

Hierarchical relation

A practical LSMR system is required to detect a variety of concepts. For example, over twenty-one thousand concepts are defined in ImageNet, a huge concept vocabulary for images [24]. However, the conventional one-versus-all approach constructs a classifier for every concept and tests all classifiers on each example, which is not suitable for handling a large number of concepts. In contrast, when a human sees a new concept, he/she does not learn all of its appearance details, but just remembers its category and discriminative details [73]. This implies that humans form a hierarchy of concepts. Such a hierarchy is useful not only for detecting concepts quickly, but also for achieving semantically consistent concept detection.

Using a large lexical ontology (WordNet) [30], Marszalek and Schmid developed a method which builds a classifier for each concept by defining training examples so as to satisfy the hierarchical relation [73]. Here, positive examples are defined as the union of examples annotated with the concept's presence and those annotated with presences of its child concepts, whereas negative examples are the union of positive examples of its sibling concepts. Then, a top-down procedure is applied to a test example, where the presence of a concept is examined only if its parent concept is detected. While the one-versus-all approach has computational complexity O(n), where n is the number of concepts, the above method reduces it to approximately O(n^0.64) without loss of accuracy. Furthermore, Gao and Koller proposed a method which simultaneously constructs a hierarchy and classifiers by analysing the distributions of positive examples for multiple concepts [34]. For each node in the hierarchy, a binary classifier is built by classifying concepts into two classes based on the distributions of their positive examples. In particular, to maintain the generality of the classifier, a relaxed hierarchy structure is adopted where some concepts are flexibly ignored if their positive examples are difficult to classify. This method offers similar or slightly better performance with a 2–5 times speed-up compared to the one-versus-all approach. Also, the top-down procedure propagates any error that occurs at a concept for which the classifier is unreliable. To overcome this, Zhu et al. proposed an error recovery method which adjusts the classifier's output for each concept by considering the classifiers' outputs for its child and grandchild concepts [167]. Logistic regression is used to obtain the optimal weights with which the classifiers' outputs for the current, child and grandchild concepts are linearly combined to refine the detection of the current concept. The error recovery method improves the performance of the top-down procedure by 14–27 %, and reduces the computational cost by 67.1–89.4 % compared to the one-versus-all approach.
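The top-down procedure itself is simple to express; the sketch below assumes a hypothetical hierarchy dictionary and one binary classifier per concept, and illustrates why only detected branches incur further classifier evaluations.

```python
# Top-down concept detection over a concept hierarchy.
def top_down_detect(example_feat, hierarchy, classifiers, node="root", detected=None):
    """hierarchy: dict mapping a concept to its child concepts;
    classifiers: dict mapping a concept to a binary classifier with predict()."""
    if detected is None:
        detected = []
    for child in hierarchy.get(node, []):
        if classifiers[child].predict([example_feat])[0] == 1:
            detected.append(child)
            # Recurse only into detected concepts; descendants of rejected
            # concepts are never evaluated, which yields the sub-linear cost.
            top_down_detect(example_feat, hierarchy, classifiers, child, detected)
    return detected
```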

Spatio-temporal relation

Concepts do not appear in isolation. In other words, the presence of one concept can be a useful clue for detecting other concepts. In particular, concepts corresponding to roles in specific situations (e.g., Athlete) should be deduced based on relations to other concepts (e.g., Playground and Running). Below, we describe existing methods that consider spatial relations among concepts within examples, as well as their temporal relations over shot sequences. Note that the following discussion excludes methods for treating relations among features (or descriptors), like [105, 156], because such relations are not directly related to meanings, due to the brittleness of features to changes in camera techniques and shooting environments.

Jiang et al. proposed a method which refines independent detection results of concepts by considering their correlations (co-occurrences) [47]. Using training examples, they first construct a graph where each node is a concept and the weight of an edge is defined as the Pearson product-moment correlation between two concepts. Then, graph diffusion is performed to smooth the detection results in each example, so that results for strongly correlated concepts become similar. This approach refines independent concept detection results by 11.8–15.6 %. Yi et al. developed another refinement method based on both spatial and temporal relations [152]. Regarding the former, they build a Conditional Random Field (CRF) which probabilistically estimates a refined detection result of a concept by considering the independent detection results of this and correlated concepts. To treat temporal relations, another CRF is built to predict a refined detection result of a concept based on the independent detection results of this and correlated concepts in surrounding shots. Experimental results showed that spatial and temporal relations improve the independent detection results by 20.6–36.4 % and 26.5–47.2 % respectively, and their combination offers an improvement of 29.9–53.1 %.
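To illustrate correlation-based refinement, the sketch below estimates a co-occurrence matrix from training annotations and smooths the per-example scores with a simple diffusion step; the mixing weight and iteration count are hypothetical, and this is only in the spirit of [47], not its exact formulation.

```python
# Smooth independent concept scores using concept co-occurrence statistics.
import numpy as np

def cooccurrence_weights(annotations):
    """annotations: (n_examples, n_concepts) binary ground-truth matrix."""
    corr = np.corrcoef(annotations.T)              # Pearson correlation between concepts
    corr = np.clip(np.nan_to_num(corr), 0.0, 1.0)  # keep positive correlations only
    np.fill_diagonal(corr, 0.0)
    row_sums = corr.sum(axis=1, keepdims=True)
    return corr / np.maximum(row_sums, 1e-12)      # row-normalised diffusion weights

def refine_scores(scores, weights, alpha=0.7, n_iters=10):
    """scores: (n_concepts,) independent detection scores for one example."""
    refined = scores.copy()
    for _ in range(n_iters):
        refined = alpha * scores + (1.0 - alpha) * weights @ refined
    return refined
```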

Furthermore, Chen et al. developed NEIL (Never Ending Image Learner), which continuously extracts visual knowledge (positive images and concept relations) from Internet-scale data [20]. NEIL is based on semi-supervised learning. First, for each concept, seed images are collected through Google Image Search to build the initial classifier. Second, concept relations are extracted by computing co-occurrences based on the classifiers' outputs. Third, NEIL selects additional positive images, each of which produces large outputs both for the classifier of a concept and for the classifiers of its related concepts. Then, NEIL updates the classifiers with the additional positive images, and continuously repeats the second and third processes. As a result of running NEIL for 2.5 months, it discovered 400K positive examples and 1,700 concept relations for 2,237 concepts. It was also shown that using the extracted relations improves the concept detection performance by 4.0–8.8 %.

Attributes

A human can categorise even unseen concepts based on their characteristic appearances. For example, even if a human does not know the exact names of airplanes, he/she can discriminate between airplanes with and without propellers. In addition, without knowing the name Zebra, it can be distinguished from other horses based on the presence or absence of stripes. Such descriptions of concept appearances, like parts (e.g., "propeller"), shapes (e.g., "round"), textures (e.g., "stripe") and non-verbal properties (e.g., "properties that dogs have but cats do not"), are called attributes [29, 58]. These are semantically meaningful descriptions, and their automatic detection is relatively easy compared to concept detection. By representing each example using the responses of attribute classifiers, concepts can be effectively detected with a small number of training examples, or with only plain text descriptions [29]. Also, as long as attribute classifiers are robust, the performance of attribute-based concept detection can be maintained irrespective of domain changes. Moreover, since attributes capture detailed appearances (e.g., the propeller of an airplane) which are under-represented by features based on many local descriptors, they are useful for detecting concepts corresponding to subordinate categories like Jet_Airplane and Propeller_Airplane of Airplane [19].

Farhadi et al. proposed a method which constructs an attribute classifier by extracting characteristic features that contrast examples with and without the attribute within the same category [29]. They showed that attribute-based concept detection can achieve similar performance using only 20 % of the training examples required by feature-based detection. Lampert et al. verified the effectiveness of attributes for the ‘zero-shot learning’ scenario, where no training examples are given and a concept is detected using responses of attribute classifiers together with prior knowledge about which attributes are relevant to the concept [58]. Focusing on parts among attributes, one of the most popular methods is the Deformable Part-based Model (DPM), which detects the region of a concept (object) by considering positions and deformations of parts, where each part is defined as a filter capturing its shape [31]. Compared to a model without parts, this method achieves 62.5 % and 33.3 % better performance for localising Person and Car, respectively. Recently, Chai et al. demonstrated that concepts for subordinate categories can be effectively identified using features extracted from parts obtained by DPM (together with features from a region obtained by image segmentation) [19]. Finally, since it is laborious to manually define a large vocabulary of attributes, Juneja et al. developed a method which automatically extracts distinctive attributes by first grouping regions of a certain concept into clusters, and then examining, via entropy, whether regions similar to those in each cluster appear across different concepts [50]. Also, the method in [71] extracts attributes using crowdsourcing, where users annotate corresponding regions in images showing the same concept, and differences between images displaying different concepts.
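In the spirit of the zero-shot setting of [58], the sketch below scores unseen concepts from attribute-classifier outputs and a prior concept-attribute table. The attribute list, the table and the probabilities are toy assumptions; the actual direct attribute prediction model in [58] is more elaborate (e.g., it normalises by attribute priors).

```python
import numpy as np

# Toy concept-attribute table (prior knowledge; no training images of these concepts).
# Rows: unseen concepts; columns: attributes.
ATTRIBUTES = ["has_stripes", "has_propeller", "is_four_legged"]
CONCEPT_ATTRIBUTES = {
    "Zebra":              np.array([1, 0, 1]),
    "Propeller_Airplane": np.array([0, 1, 0]),
}

def zero_shot_scores(attr_probs, concept_attributes):
    """attr_probs: per-attribute presence probabilities predicted by attribute
    classifiers for one example. Score each unseen concept by how well the
    predicted attributes match its prior attribute signature (product of the
    matching probabilities)."""
    scores = {}
    for concept, signature in concept_attributes.items():
        p = np.where(signature == 1, attr_probs, 1.0 - attr_probs)
        scores[concept] = float(np.prod(p))
    return scores

if __name__ == "__main__":
    # Pretend attribute classifiers returned these probabilities for a test image.
    attr_probs = np.array([0.9, 0.1, 0.8])  # stripes likely, propeller unlikely, four legs likely
    print(zero_shot_scores(attr_probs, CONCEPT_ATTRIBUTES))
```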

5.1.2 Implementing the human visual system

Deep learning

The human brain processes visual information hierarchically: neurons in the early visual areas extract simple features, which are transmitted to neurons in higher-level areas to form more complex features or concepts [56]. Inspired by this, deep learning has been developed to learn feature hierarchies in which higher-level features are formed by composing lower-level ones [11, 13]. The aim is to construct multiple levels of feature representation where higher layers characterise more abstract features. Deep learning mainly offers the following three advantages (see [13] for more detail): The first is expressive power: the combination of (distributed) features at each layer can define an exponential number of higher-level features, and this number grows further as information passes through successive layers. The second is invariance: more abstract features are generally invariant to subtle changes in visual appearance. The last is explanatory power: the learnt feature hierarchy can capture valuable patterns or structures underlying raw images or videos. Finally, a classifier for detecting a concept is created either by using the learnt hierarchy as the initialisation of a multi-layer neural network, or by building a supervised classifier on feature vectors constructed from the hierarchy.

In [55], deep learning is implemented as a multi-layer convolutional neural network which iteratively conducts convolution or pooling of the outputs of neurons in the previous layer. Convolution works as feature extraction using filters, each represented by the weights of a neuron. On the other hand, pooling summarises the outputs of neighbouring neurons to extract more abstract features. The multi-layer convolutional neural network is optimised by stochastic gradient descent, which updates each weight of a neuron by backpropagating the derivative of the training error with respect to this weight. In ILSVRC 2012, a worldwide competition on large-scale image classification [42], the above method achieved an error rate of 15.3 % and significantly outperformed the others (the second-best error rate was 26.1 %). Also, Le et al. developed a nine-layer stacked sparse autoencoder to train concept detectors from unlabelled images [60]. Each layer consists of three sub-layers, filtering, pooling and normalisation, which respectively offer feature extraction from small regions of the previous layer, the invariance of features (neighbouring neurons’ outputs) to local deformations of visual appearance, and the range adjustment of features. The stacked sparse autoencoder is optimised layer-by-layer so that sparse features constructed at a layer can be accurately converted back into those at the previous layer. By training such a stacked autoencoder on 10 million unlabelled images with 16,000 cores, it was shown that the highest-level neurons characterise concepts like Face, Cat_Face and Human_Body. Moreover, compared to state-of-the-art methods, the multi-layer classifier using the stacked autoencoder as the initialisation yields 15 % and 70 % performance improvements for the 10,000- and 22,000-concept detection tasks, respectively.
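To make the two core operations concrete, here is a minimal NumPy illustration of convolution (filtering) and max-pooling on a single-channel image. It is only a didactic sketch, not the architecture of [55] or [60]; the filter, the non-linearity and the pooling size are arbitrary choices.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2D cross-correlation: each output value is the filter response
    of one neuron on a local patch of the previous layer."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Summarise neighbouring responses to gain invariance to small shifts."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((8, 8))
    edge_filter = np.array([[1., 0., -1.],
                            [1., 0., -1.],
                            [1., 0., -1.]])      # a simple vertical-edge filter
    feature_map = np.maximum(convolve2d(image, edge_filter), 0.0)  # ReLU-like non-linearity
    print(max_pool(feature_map).shape)           # (3, 3)
```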

Visual attention

Selective attention is the brain’s mechanism that determines which part of the sensory data is currently of most interest [33]. This enables humans to make real-time decisions by closely analysing selected parts of the large amount of data captured by the eyes and ears. Visual attention implements selective attention on images and videos to detect salient regions that are likely to attract users [33]. By filtering out irrelevant regions, visual attention both improves concept detection performance and reduces computational cost. In addition, visual attention can bridge the discrepancy between concept detection and retrieval: the former aims to detect concepts irrespective of various changing factors, while the latter needs to retrieve examples where concepts related to a query appear in regions drawing the user’s attention. For example, for the query “a car is moving”, the user is not interested in an example where a Car moves in a small background region. Thus, visual attention facilitates analysing the subjective properties of each example and achieving retrieval that is meaningful to humans.

Most visual attention methods analyse spatial distributions of biologically inspired features (e.g., brightness, contrast and curvature) in an example, and produce a saliency map which shows the degree of salience of each pixel [33]. Typically, pixels which are irregular compared to surrounding ones are regarded as salient. However, such bottom-up approaches based only on features do not work well, so researchers are exploring top-down approaches using prior knowledge. One popular form of knowledge is ‘contextual cueing’, which means that a human can easily find a target object when the visual context (i.e., spatial layout of objects) is similar to one seen in the past [33, 63]. Contextual cueing is implemented as supervised classification: a classifier which detects salient regions in a test example is built by referring to training examples where salient regions are annotated manually or via eye-fixation records. In [63], multi-task learning is used, where salient regions in examples with a particular visual context are detected by sharing classifiers among examples with correlated visual contexts. Compared to building a classifier for every visual context, each of the above classifiers is built using more training examples with different visual contexts, which yields better generalisation. Also, based on saccades, which are transitions of eye fixations (i.e., salient regions), Sugano et al. proposed an image segmentation method which performs joint clustering of fixation locations and seed regions [118]. This is formulated as energy minimisation on a graph, where each edge represents the cost of merging a pair of a fixation location and a seed region. Meaningful regions are extracted as those characterised by densely distributed fixations and uniform visual features. Experimental results showed that, compared to a standard image segmentation method, jointly using fixations improves the performance by 17.5 to 50 %.
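As a concrete (and deliberately crude) illustration of the bottom-up idea, the sketch below marks a pixel as salient when its intensity deviates from the mean of its surrounding neighbourhood. It is not a specific method from [33]; the window size, the normalisation and the test image are toy assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def saliency_map(gray_image, surround_size=9):
    """Crude bottom-up saliency: a pixel is salient if its intensity differs
    from the mean intensity of its surrounding neighbourhood."""
    surround_mean = uniform_filter(gray_image.astype(float), size=surround_size)
    sal = np.abs(gray_image - surround_mean)
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)  # normalise to [0, 1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((64, 64)) * 0.1
    img[30:33, 30:33] = 1.0           # a small bright 'object' on a dim background
    sal = saliency_map(img)
    print(sal[31, 31] > sal[5, 5])    # the object region should be more salient: True
```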

Depth estimation

A 2D example (image or video frame) does not hold depth information about the real 3D world. As a result, even though a Person stands in front of a Table, a 2D example shows their regions as overlapping. Similarly, even when a Person throws a Ball far away, their regions in a 2D example may be close to each other. Meanwhile, humans can easily recognise depth information in a 2D example. This has inspired researchers to develop methods that estimate depth information directly from 2D examples [51, 99]. To the best of our knowledge, there is no existing work which uses estimated depth information for concept detection. Nonetheless, its effectiveness is implied by concept detection using a depth sensor (typically Microsoft Kinect) [39]. For example, in [96], compared to only using 2D information, the accuracy of localising various concepts in indoor scenes is improved from 66.2 % to 71.4 % by additionally using depth information (please refer to [39] for other works).

In depth estimation, a classifier for predicting depth values in an example is built using training examples whose depth values are known from a depth sensor. Saxena et al. developed a method that first divides an example into ‘superpixels’, which are homogeneous small regions with similar properties [99]. They assume that all pixels within a superpixel share the same depth value, and represent the superpixel with features useful for depth estimation. For example, a grass field viewed at a short distance has fine textures, while such textures are blurred when it is viewed at a large distance. In addition, parallel lines show larger variations in edge orientations when viewed from a more distant position. Using such texture and edge features, a Markov Random Field (MRF) is built to probabilistically estimate the depth value of each superpixel by considering its features and the relative depth values of nearby superpixels. Experimental results showed that depth values are reasonably estimated for 64.9 % of Web images. Karsch et al. developed a method which transfers depth values from training examples to a test example by assuming that examples with similar meanings have similar spatial distributions of depth values [51]. Given a test example, the method first selects visually similar training examples. Then, depth values in each training example are transferred by extracting the correspondence of local regions between the training and test examples. In addition, the method can be applied to a video by smoothing depth values in each frame based on optical flows, and by constraining moving objects to have depth values similar to those of the ground they contact.
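The following is a grossly simplified sketch of the transfer idea: retrieve the training images closest to the test image under a toy global descriptor and average their depth maps. The real method in [51] additionally warps depth via dense local correspondences and optimises a spatio-temporal objective, none of which is reproduced here; the descriptor, the data and the shared-resolution assumption are all illustrative.

```python
import numpy as np

def global_feature(image):
    """Toy global descriptor: a coarse 4x4 grid of mean intensities."""
    h, w = image.shape
    cropped = image[:h - h % 4, :w - w % 4]
    return cropped.reshape(4, h // 4, 4, w // 4).mean(axis=(1, 3)).ravel()

def transfer_depth(test_image, train_images, train_depths, k=3):
    """Average the depth maps of the k training images whose global features
    are closest to the test image (all images assumed to share one resolution)."""
    f_test = global_feature(test_image)
    feats = np.stack([global_feature(img) for img in train_images])
    dists = np.linalg.norm(feats - f_test, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.mean([train_depths[i] for i in nearest], axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_images = [rng.random((32, 32)) for _ in range(10)]
    train_depths = [rng.random((32, 32)) * 5.0 for _ in range(10)]   # toy depths in metres
    test_image = rng.random((32, 32))
    print(transfer_depth(test_image, train_images, train_depths).shape)  # (32, 32)
```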

5.1.3 Discussion

Although Sections 5.1.1 and 5.1.2 present many existing cognition-based methods, we argue that they use only very limited knowledge about the human visual system; much more knowledge needs to be adopted. One critical problem is that these methods use the same approach for every concept. Regarding this, machine learning methods are useful for concepts in basic categories, while attribute-based methods are effective for concepts in subordinate categories. Thus, one possible approach is to define the category of each concept, and then select a machine learning-based or attribute-based method depending on that category. In addition, some concepts significantly affect the visual appearances of others, such as Foggy, Nighttime and Dazzling. Hence, modelling such relations seems valuable for selecting a classifier or for modifying (transferring) an already trained classifier.

Also, deep learning seems a promising approach for implementing the human brain mechanism. However, current methods construct a feature hierarchy by simply stacking layers of the same type of neurons, which have only a few variations of response functions. This hierarchy is much simpler than the human brain, where visual information encoded by the occipital part is divided into two interacting pathways, ‘ventral’ and ‘dorsal’, which are responsible for object categorisation and space/action analysis, respectively [56]. In addition, neurons have a variety of functionalities, so they cannot all be optimised by the same parameter optimisation approach [56]. Thus, deep learning needs to be extended by adopting a more complex hierarchy consisting of diverse types of neurons, based on a sophisticated optimisation method. Furthermore, visual attention and depth estimation can be regarded as estimating information that cannot be directly observed from examples. In this context, estimating other kinds of information may be possible. For example, it is known that terahertz sensors produce waves which can pass through some objects (e.g., paper and plastic) to capture inside elements, and multispectral sensors can record invisible colour channels to characterise the materials of objects. Thus, using these sensors’ outputs as labels of training examples, it may be possible to build a classifier which can detect concepts (inside elements or materials) that humans infer from examples. Finally, although the methods in Sections 5.1.1 and 5.1.2 have been developed independently, it seems beneficial to establish a framework for integrating these different categories of methods. We believe that their synergy will offer a significant improvement in concept detection performance.

5.2 Ontology-based approaches

An ontology is a machine-readable representation of knowledge that explicitly specifies concepts, properties of concepts and relations among concepts in a given domain [40]. We define ontology-based approaches as those that utilise ontologies to extract high-level meanings (i.e., events and contexts) based on detection results for multiple concepts. Figure 5 illustrates an overview where videos showing the event “birthday party” are identified. Note that although Fig. 5 uses the video as the unit, it is straightforward to apply the following discussion to images or shots. In an ontology-based approach, every example is represented using concept detection scores, each of which is a value between 0 and 1 indicating a concept’s presence; a larger score indicates a higher likelihood that the concept is present. In Fig. 5, as depicted by the white-filled arrows, every example (video) is represented as a sequence of vectors, each holding the concept detection scores of one shot. Then, as indicated by the black-filled arrows, a classifier is built on this representation using positive and negative examples, and is used to discriminate between test examples that are relevant and irrelevant to the high-level meaning.

Fig. 5

An overview of an ontology-based approach

Since a high-level meaning is derived from the interaction among multiple concepts, the set of relevant examples has a huge variance in the space of low-level features. In contrast, owing to recent progress, several concepts can now be accurately detected. Thus, while each dimension of a low-level feature just represents physical values, detection scores for a concept (i.e., values in one dimension) represent the appearance of a human-perceivable meaning. Hence, in the space of concept detection scores, the variance of relevant examples becomes smaller and can be modelled more easily. In other words, concept detection scores work as ‘intermediate’ features for a classifier, bridging raw example representations and high-level meanings. Several publications report the effectiveness of ontology-based approaches. For example, Tešić et al. showed that, when using the same classifier (SVM), concept detection scores achieve 50–180 % higher event retrieval performance than colour and texture features [125]. In addition, Merler et al. reported that concept detection scores yield the best performance compared to high-dimensional features based on local descriptors (e.g., SIFT, HOG and HOF) [75]. In particular, the example representation using detection scores for 280 concepts requires 15 times less storage than the high-dimensional features, and data size is crucial for the feasibility of LSMR. Furthermore, Mazloom et al. demonstrated that concept detection scores lead to a 3.1–39.4 % performance improvement compared to features based on SIFT descriptors [74].
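To make the pipeline of Fig. 5 concrete, here is a minimal end-to-end sketch: each shot is a vector of concept detection scores, the shot vectors are average-pooled into a video-level vector (one common simplification of the sequence representation), and an SVM discriminates relevant from irrelevant videos. The detectors, the 50-concept vocabulary and the 'birthday party' data are synthetic stand-ins, not the systems cited above.

```python
import numpy as np
from sklearn.svm import SVC

N_CONCEPTS = 50   # e.g., Person, Cake, Candle, ...

def video_representation(shot_scores):
    """shot_scores: (n_shots, N_CONCEPTS) concept detection scores per shot.
    Average-pool the shot-level vectors into one video-level vector."""
    return shot_scores.mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def make_video(relevant):
        """Toy video: relevant videos give high scores to a fixed concept subset."""
        n_shots = rng.integers(5, 15)
        scores = rng.random((n_shots, N_CONCEPTS)) * 0.3
        if relevant:
            scores[:, :5] += 0.6          # e.g., Cake, Candle, Person, ...
        return np.clip(scores, 0, 1)

    X = np.stack([video_representation(make_video(r)) for r in [1] * 30 + [0] * 30])
    y = np.array([1] * 30 + [0] * 30)
    clf = SVC(probability=True).fit(X, y)

    test_video = make_video(relevant=True)
    relevance = clf.predict_proba(video_representation(test_video)[None, :])[0, 1]
    print(f"relevance to 'birthday party': {relevance:.2f}")
```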

This section reviews existing ontology-based methods in terms of their classifiers. As shown in the middle of Fig. 5, we categorise existing methods into extensional and intensional ones. The former includes methods that analyse training examples to extract concept relations useful for characterising high-level meanings; that is, a high-level meaning is defined by providing its instances (i.e., training examples). On the other hand, intensional methods utilise knowledge about concept relations to characterise high-level meanings; that is, a high-level meaning is formed from its aspects (i.e., concept relations) known in advance. It should be noted that, in addition to classifiers, ontology-based approaches involve two other important issues. The first is the construction of a concept vocabulary, for which several large-scale vocabularies have recently become available, such as LSCOM (Large-Scale Concept Ontology for Multimedia) [79], ImageNet [24] and VSO (Visual Sentiment Ontology) [17]. The second is concept detection, which is performed and improved by the cognition-based methods in the previous section.

5.2.1 Extensional methods

In what follows, we classify extensional methods into two categories: one focuses on high-level meanings within images/shots, and the other targets those over shot sequences.

Within images/shots

In general, methods in this category represent each example as a vector of concept detection scores, and build a classifier which discriminates between examples that are relevant and irrelevant to a high-level meaning. In other words, this classifier fuses detection scores for different concepts into a single relevance score, which indicates the relevance of an example to the meaning. Existing methods are roughly classified into four categories: linear combination, discriminative, similarity-based and probabilistic. Linear combination computes the relevance score of an example by weighting the detection scores for multiple concepts. One popular weighting method uses concept detection scores in positive examples: if the average detection score for a concept in positive examples is large, this concept is regarded as related to the query and associated with a large weight [81, 141]. Another popular method is text-based weighting, where a concept is associated with a large weight if its name is lexically similar to a term in the text description of the query [81, 141]; the lexical similarity between a concept name and a term can be measured using a lexical ontology like WordNet. Discriminative methods construct a discriminative classifier (typically an SVM) using positive examples, and the relevance score of an example is obtained as the classifier’s output [81, 82]. Similarity-based methods compute the relevance score of an example as the similarity between positive examples and the example in terms of concept detection scores; the method in [62] uses the cosine similarity and a modified entropy as similarity measures. Probabilistic methods estimate a probabilistic distribution of concepts using detection scores in positive examples, and use it to compute the relevance score of an example. In [95], the relevance score of an image is computed as the similarity between the multinomial distribution of concepts estimated from positive examples and the one estimated from the image.
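Two of these fusion strategies are simple enough to sketch directly: linear combination with weights taken from the average scores of positive examples, and a similarity-based score using cosine similarity to the mean positive vector. The data and the weighting details are toy assumptions, not the exact schemes of [81, 141] or [62].

```python
import numpy as np

def linear_combination_score(example_scores, positive_scores):
    """Weight each concept by its average detection score over positive
    examples, then take the weighted sum of the example's scores."""
    weights = positive_scores.mean(axis=0)
    weights = weights / (weights.sum() + 1e-12)
    return float(example_scores @ weights)

def cosine_similarity_score(example_scores, positive_scores):
    """Relevance as cosine similarity to the mean positive score vector."""
    prototype = positive_scores.mean(axis=0)
    num = example_scores @ prototype
    den = np.linalg.norm(example_scores) * np.linalg.norm(prototype) + 1e-12
    return float(num / den)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Positives score highly on the first three concepts of a 10-concept vocabulary.
    bias = np.array([0.5] * 3 + [0.0] * 7)
    positives = np.clip(rng.random((20, 10)) + bias, 0, 1)
    relevant_example = np.clip(rng.random(10) + bias, 0, 1)
    irrelevant_example = rng.random(10) * 0.5
    for name, ex in [("relevant", relevant_example), ("irrelevant", irrelevant_example)]:
        print(name,
              round(linear_combination_score(ex, positives), 3),
              round(cosine_similarity_score(ex, positives), 3))
```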

Over shot sequences

When detecting a high-level meaning over shot sequences, one big problem is the difficulty of annotating the relevance of each shot. The reasons are two-fold: First, it is labour-intensive to annotate the shots contained in each video. Second, due to the temporal continuity of meanings in a video, any segment can become a meaningful unit (see Section 4.1). Specifically, humans tend to relate each shot in a video to surrounding ones. Let us consider the video in Fig. 6, where the event “birthday party” is shown. There is no doubt that Shots 2 and 3 show the birthday party, based on which Shots 1 and 4 are related to it as chatting before the party and playing after it, respectively. This kind of shot relation makes it ambiguous to determine the boundary of a high-level meaning. In Fig. 6, one may think that the birthday party is shown only in Shots 2 and 3, while someone else may think that it is shown in the whole video, regarding Shots 1 and 4 as parts of the party. Thus, objective annotation is only possible at the video level, in terms of whether each video contains a high-level meaning or not. A classifier needs to be built under this weakly supervised setting, where even a training video annotated as relevant to the meaning includes many irrelevant shots.

Fig. 6

An example video where the event “birthday party” is shown

To overcome weakly supervised settings,Footnote 3 we developed an event detection method using a Hidden Conditional Random Field (HCRF) [108]. It is a probabilistic discriminative classifier with a set of hidden states, which are used as an intermediate layer to discriminate between shots that are relevant and irrelevant to an event. Specifically, each shot in a video is assigned to a hidden state by considering its concept detection scores and transitions among hidden states. These hidden states and transitions are optimised so as to maximise the discrimination between positive and negative videos. We showed that the optimised hidden states and transitions successfully capture concepts and their temporal relations that are specific to the event. Sun and Nevatia proposed a method which extracts temporal concept transitions in an event using Fisher kernel encoding [119]. Using all training videos, they first build an HMM which works as a prior distribution representing concept transitions in the general case. Then, the vector representation of a video is created by computing the difference between the actual transitions of concept detection scores in the video and the transitions predicted by the HMM. Thereby, vectors of positive videos for an event represent characteristic concept transitions, while trivial transitions observed in many negative videos are suppressed. Finally, to the best of our knowledge, Lu and Grauman developed the first metric which can quantify the context between two events by finding concepts that appear in the first event and strongly influence the second one [69]. Such influences are measured by performing a random walk on a bipartite graph consisting of event and concept nodes. A concept is regarded as influential if ignoring it leads to a dramatic decrease in the probability of transition between the two event nodes. In [69], this metric was used to create summaries consisting of events associated with semantically consistent contexts.

5.2.2 Intensional methods

Intensional methods exploit knowledge about concept relations to improve the detection performance of high-level meanings. To the best of our knowledge, existing methods can only deal with high-level meanings within examples. We will later discuss how to extend them to high-level meanings over shot sequences.

Deng et al. devised a method which computes the similarity between two examples based on a concept hierarchy [25]. A component similarity is computed as a weighted product between the detection score of an example for one concept and the score of the counterpart example for another concept. Here, the weight is defined based on the lowest common ancestor of the two concepts. For example, the component similarity for Donkey and Horse has a higher weight than the one for Donkey and Keyboard, because the common ancestor of the former pair (Equine) is more specific than that of the latter pair (Object). The similarity between two examples is computed as the sum of the component similarities. It was reported that using the concept hierarchy yields a significant performance improvement [25].
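The sketch below illustrates the flavour of such a hierarchy-aware similarity on a toy three-concept vocabulary: component similarities are score products weighted by the depth of the lowest common ancestor, so pairs sharing a specific ancestor (Equine) contribute more than pairs that only meet at the root (Object). The hierarchy, the depth-based weight and the scores are illustrative assumptions; the actual weighting in [25] differs in detail.

```python
import numpy as np

# Toy concept hierarchy: child -> parent.
PARENT = {
    "Donkey": "Equine", "Horse": "Equine",
    "Equine": "Animal", "Animal": "Object",
    "Keyboard": "Device", "Device": "Object",
    "Object": None,
}
CONCEPTS = ["Donkey", "Horse", "Keyboard"]

def ancestors(c):
    path = []
    while c is not None:
        path.append(c)
        c = PARENT[c]
    return path

def lca_depth(c1, c2):
    """Depth (distance from the root 'Object') of the lowest common ancestor."""
    a1 = ancestors(c1)
    a2 = set(ancestors(c2))
    lca = next(a for a in a1 if a in a2)
    return len(ancestors(lca)) - 1            # the root has depth 0

def hierarchy_similarity(scores_x, scores_y):
    """Sum of pairwise component similarities: score products weighted by how
    specific the LCA of the two concepts is (deeper LCA -> larger weight)."""
    sim = 0.0
    for i, ci in enumerate(CONCEPTS):
        for j, cj in enumerate(CONCEPTS):
            sim += lca_depth(ci, cj) * scores_x[i] * scores_y[j]
    return sim

if __name__ == "__main__":
    donkey_like = np.array([0.9, 0.7, 0.1])   # scores for Donkey, Horse, Keyboard
    horse_like  = np.array([0.6, 0.9, 0.1])
    office_like = np.array([0.1, 0.1, 0.9])
    print(hierarchy_similarity(donkey_like, horse_like))   # high: LCA is Equine
    print(hierarchy_similarity(donkey_like, office_like))  # low: LCA is Object
```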

Chen et al. proposed an interesting approach to organise independently detected concepts in an example using an ontology [20]. This ontology represents the hierarchy of concepts and their interactive relations. Based on this, the researchers devised an energy minimisation method which not only specialises detected concepts into more concrete ones, but also extracts their relations. The energy function consists of three terms: the first uses concept relations to estimate ontologically consistent relations, the second favours deep specialisation of concepts based on the concept hierarchy, and the last uses pre-trained classifiers to examine the visual appearance of each specialised concept or estimated relation. By minimising this energy function, for instance, independently detected Person and Ball are specialised into Basketball_Player and Basketball, which are linked by the relation Throw. It was reported that, compared to only using visual features, using the ontology reduces the error rates of concept detection and relation estimation by 2.6–14.2 %.

Also, Guadarrama et al. developed a method which uses concept hierarchies obtained by text mining to extract events characterised by Subject/Verb/Object (SVO) relations [36]. First, SVO triplets are extracted from natural language descriptions of YouTube videos. Then, for each of S, V and O, concepts (i.e., subjects, verbs or objects) are clustered into a hierarchy based on their correlations. For each node (a set of concepts) in this hierarchy, a classifier is built whose SVM output is weighted based on both the specificity of the node and its WUP similarity to the other nodes. Given an example, such weighted outputs are used to select the best node for each of S, V and O. Finally, S, V or O is described by the WordNet concept with the highest sum of WUP similarities to the concepts in the best node. Owing to the specificity and WUP similarity, the method can flexibly select not only specific concepts, but also concepts that are less specific but visually plausible.
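For readers unfamiliar with WUP similarity, the sketch below reproduces only the final selection step: among a toy set of candidate concepts for one slot, it picks the one whose first WordNet noun sense maximises the total WUP similarity to the other candidates. It uses NLTK's WordNet interface (the WordNet corpus must be downloaded once) and ignores the specificity weighting and SVM outputs of [36].

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)   # WordNet corpus is needed once

def best_description(candidates):
    """Pick the candidate concept whose first WordNet noun sense has the
    highest total WUP similarity to the other candidates in the node."""
    synsets = {c: wn.synsets(c, pos=wn.NOUN)[0] for c in candidates}
    def total_wup(c):
        return sum(synsets[c].wup_similarity(synsets[o]) or 0.0
                   for o in candidates if o != c)
    return max(candidates, key=total_wup)

if __name__ == "__main__":
    # Toy 'best node' for the Subject slot.
    print(best_description(["dog", "cat", "horse", "car"]))
```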

5.2.3 Discussion

Compared to the cognition-based methods in Section 5.1, far fewer ontology-based methods have been developed so far. Since concepts are just primitive meanings, the advancement of ontology-based methods is the most important topic for realising practical LSMR systems. Below, we point out three issues that deserve much research attention in the future.

  1. Uncertainties in concept detection: Traditional ontology formalisms do not account for uncertainties, since an ontology itself is not uncertain. In contrast, even with the most effective methods, it is still difficult to accurately detect every kind of concept. In particular, real-world examples are ‘unconstrained’ [49] in the sense that they can be taken with arbitrary camera techniques and in arbitrary shooting environments. Thus, concepts cannot be expected to be detected with 100 % accuracy, and relying on such uncertain concept detection significantly damages the detection performance of high-level meanings. For this, we developed a method which handles uncertain concept detection based on Dempster-Shafer Theory (DST) [107, 109]. DST is a generalisation of Bayesian theory [26], where a probability is assigned not to a concept but to subsets of concepts. This enables us to consider the probability that one of a set of concepts is present in an example. By accumulating such probabilities over the sets including a certain concept, we can define the plausibility, which represents the upper-bound probability of the concept’s presence in the example (a minimal sketch of this computation is given after this list). We incorporate such plausibilities into a probabilistic classifier, where plausibilities for a concept’s presence are estimated as density ratios between positive and negative examples in terms of detection scores. That is, plausibilities are detection scores ‘refined’ by considering uncertainties in detecting the concept. Compared to directly using concept detection scores, using plausibilities yielded a 19.1 % performance improvement in detecting events within examples. We expect that such approaches for managing uncertain concept detection will be further explored to detect high-level meanings over shot sequences.

  2. Temporal continuity of a concept’s presence: Video editing is one main reason why intensional methods have not been developed for high-level meanings over shot sequences. The temporal order of shots can easily be distorted by inserting shots that display different meanings from those of the surrounding shots, so it is difficult to reason about a high-level meaning by specifying the temporal positions of concepts. One possible solution is to model the temporal continuity of a concept’s presence. Let us consider a sequence of three shots, where the first shot shows a Person bringing a Birthday_Cake, the second shows Persons talking to each other, and the third shows Persons eating the Birthday_Cake. Here, even if the Birthday_Cake is not shown in the second shot, humans can assume its existence. Hence, it is reasonable to consider that the Birthday_Cake is not absent in the second shot, but just ‘invisible’. To capture such temporal continuity of a concept’s presence, it seems effective to analyse appearance patterns of a concept in relation to the development of the story in a video. For instance, when a concept plays an important role, it is present in many shots; otherwise it is not. Based on this, our method in [103] divides a video into shot sequences characterised by probabilistically distinct patterns of the concept’s presence. Within such a shot sequence, the concept is assumed to be continuously present with the same degree of contribution to the story. This allows us to modify the detection score of a shot by considering the scores of surrounding shots. Then, reasoning about high-level meanings is performed on the modified concept detection scores.

  3. Knowledge extraction: To reason about various high-level meanings, a large repository of concept relations is required. One lesson from the success of attributes, concepts, and curriculum learning in the next section is to gradually increase the level of meanings. Thus, we first need to address the extraction of concept relations which characterise various events. This requires solving three fundamental issues: The first is to define a standardised vocabulary of events. As described in [36], while vocabularies of concepts corresponding to nouns are already large-scale, like LSCOM and ImageNet, existing works only focus on a handful of events. We expect that, just as concepts in LSCOM were defined through the collaboration of multimedia researchers, library scientists and end users [79], a similar effort is needed to construct an event vocabulary. The second issue is how to examine concept relations in events. For temporal relations, it is crucial to model the temporal continuity of a concept’s presence, described above. Similarly, the depth estimation in Section 5.1.2 is vital for obtaining semantically meaningful spatial relations among concepts. Assuming that these spatio-temporal concept relations can be successfully estimated, the last issue is to extract characteristic concept relations for events. Regarding this, in the next section we will describe an efficient feedback approach to extract a large number of semantically meaningful concept relations with a small amount of user intervention. Finally, this feedback approach can also be used to extract characteristic event relations for certain contexts.

5.3 Adaptive learning

Human learning can be considered as the repetition of the following process: Given a new problem, a human first monitors his/her performance, recognises a deficiency, and uses knowledge that he/she already owns to overcome the deficiency. By repeating this, the human accumulates knowledge for solving diverse problems. Metacognition is a discipline that explores how a human addresses a problem [3]. Assuming a cognitive system which simulates a functionality of the human mind, metacognition aims to monitor, model and control the behaviour of that system to effectively solve a problem. We position adaptive learning as metacognition for LSMR. Specifically, an LSMR method consists of various processes such as feature extraction, classifier construction, parameter tuning, training example collection, and so on. We define adaptive learning as methods which enhance or optimise one or more of the above-mentioned processes based on knowledge about human learning. Below, we present two types of adaptive learning, explanatory feedback and curriculum learning.

  1. Explanatory feedback: The traditional RF (or active learning) relies on very restrictive communication between a classifier and a user, where the latter only informs the former of whether an example is relevant or irrelevant to a certain meaning (see Section 4.3). In the real world, a teacher has much richer communication with a learner; in particular, if the learner makes a mistake, the teacher tells him/her the reason for it.

    Based on this idea, Parkash and Parikh extended RF to Explanatory Feedback (EF): if an example that a classifier selects as relevant to a meaning is judged irrelevant by a user, he/she can explain the reason for this mis-classification [88]. For example, if an example showing Forest is mis-classified as Street by the classifier, the user can explain “this example is too natural to be a Street”. EF is based on attributes, with which an example is represented using responses of attribute classifiers (see Section 5.1.1). Since attributes are semantically meaningful descriptions, they can be used as a language between the classifier and the user to realise this richer communication. In particular, the irrelevance (negative) label assigned to the mis-classified example can be propagated to examples which exhibit the attribute mentioned in the feedback. In the above case, examples whose responses for the attribute “natural” are higher than that of the mis-classified example are also regarded as irrelevant to Street. In this way, multiple negative examples are obtained with only one feedback, so that the classifier performance can be effectively improved. It was demonstrated that EF yields similar classifier performance with only one-fifth of the iterations required by RF [88]. This method has been further improved by adopting the propagation of weighted irrelevance labels, the on-the-fly update of attribute classifiers after every feedback, and example selection based on the estimated entropy reduction resulting from the label propagation starting at each example [16].

  2. Curriculum learning: Human learning is highly organised as a curriculum (education system), where children start by learning easier concepts and then build up to more complex ones. Based on this, Bengio et al. proposed curriculum learning, which builds a classifier by presenting training examples in a meaningful order, starting with easy examples and gradually introducing more difficult ones [12]. It should be noted that curriculum learning is not useful for a classifier with a convex optimisation function, like an SVM, because the global optimum can be found regardless of the order of training examples; rather, it aims to find a good local minimum when building a classifier with a non-convex optimisation function. In [12], a two-step curriculum was used for classification of shapes (rectangles, ellipses and triangles). Here, a multi-layer neural network is pre-trained using training examples with less variability in shape (squares, circles and equilateral triangles), and then trained using examples with diverse shape deformations. It was shown that, compared to not using pre-training, the two-step curriculum leads to about a 23 % performance improvement.

    Considering that curriculum learning in [12] relies on pre-defined ‘easiness’ values of examples (e.g., squares are easier to classify than general rectangles), Kumar et al. developed an iterative method where each iteration not only enlarges the set of training examples by adding more difficult ones, but also updates the parameters of a classifier [57] (a minimal sketch of this iterative scheme is given below). Assuming that the labels of easy examples are easily predicted by the classifier, the researchers introduced binary variables, each representing whether a training example is used to compute the optimisation function. Since the optimisation function favours correct classification of training examples, easier examples start to be used from earlier iterations. In [57], for object localisation using latent structural SVMs, the error rate was improved from 16.92 % to 15.38 % by adopting this curriculum learning. In a similar fashion, Ma et al. developed a video event detection method where a continuous-valued label, called a ‘fine-grained label’, is assigned to each negative example based on how easily it can be distinguished from positive examples [70]. Such fine-grained labels are initialised based on the detection scores of negative examples for concepts that are specific to positive ones. Then, the method jointly optimises the fine-grained labels and two classifiers (one using concept detection scores and the other using features); that is, the fine-grained labels are optimised so as to improve the performance of the classifiers. The researchers reported that using fine-grained labels significantly enhances event detection [70].
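The sketch below captures the spirit of the iterative scheme of [57] under heavy simplification: a logistic regression model (standing in for the latent structural SVM) is alternately retrained on the examples whose current loss falls below a threshold, and the threshold is raised each round so that harder examples are gradually admitted. The loss, the threshold schedule and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_paced_training(X, y, n_rounds=5, init_threshold=0.3, growth=1.6):
    """Each round: keep examples whose current log-loss is below a threshold
    (the binary 'use this example' variables), retrain on them, then raise
    the threshold so harder examples are admitted in later rounds."""
    clf = LogisticRegression(max_iter=1000).fit(X, y)   # warm start on everything
    threshold = init_threshold
    for _ in range(n_rounds):
        p = np.clip(clf.predict_proba(X)[:, 1], 1e-9, 1 - 1e-9)
        losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # per-example log-loss
        easy = losses < threshold
        if easy.sum() >= 2 and len(np.unique(y[easy])) == 2:  # need both classes
            clf = LogisticRegression(max_iter=1000).fit(X[easy], y[easy])
        threshold *= growth                                   # admit harder examples
    return clf

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1.5, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)
    clf = self_paced_training(X, y)
    print(f"training accuracy: {clf.score(X, y):.2f}")
```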

5.3.1 Discussion

We consider EF a promising approach for extracting relations between high-level and low-level meanings, where these meanings are used as a language for complex communication between humans and machines. Although the current EF only supports attribute-concept relations, it can be used to recursively extract concept-event and event-context relations (see the composition of meanings in Fig. 1). For the former, EF allows a user to explain the concepts which caused an incorrect event to be detected for an example. In order for this EF to appropriately link concepts to the event, it is necessary to estimate the meaningful spatio-temporal concept relations described in Section 5.2.3. Similarly, EF can be applied to event-context relations, where a falsely identified context is fixed by explaining the causative events. Also, the easiness values of training examples in curriculum learning suggest one interesting research topic: modelling metalevel features [3]. These do not characterise meanings, but capture aspects of features, classifiers and parameters in the LSMR pipeline. Apart from easiness values, for example, it has been suggested that the performance of a decision tree can be estimated based on its number of nodes, depth, shape and so on [14]. By devising such metalevel features, we can decide or control a strategy to effectively utilise the available features, classifiers and parameters for accurate retrieval.

6 Conclusion

In this paper, we reviewed existing LSMR methods, including the ones that we developed. By tracing the history of machine-based and human-based methods, we argued that, because generality and scalability for large-scale data have been prioritised, current methods lack the knowledge about human interpretation that was used in classical syntactic or manual methods. Then, we presented existing human-machine cooperation methods which incorporate such knowledge into LSMR. In particular, we classified them into three types: cognition-based methods using knowledge about the human visual system, ontology-based methods using knowledge about human inference, and adaptive learning methods using knowledge about human learning. Our retrospective survey indicates one remarkable difference between classical and human-machine cooperation methods: while the former just list rules or templates as knowledge, the latter use knowledge to refine computational models so as to maintain generality and scalability. Since only limited knowledge is used in the current methods described in Section 5, we expect that much more knowledge will be adopted and integrated into LSMR.

To reach the aforementioned goal, our final suggestion is to consider the mutual relations among the three types of human-machine cooperation methods. Figure 7 illustrates these relations. First, cognition-based methods are used to detect concepts, from which events and contexts are derived by ontology-based methods. In the opposite direction, the concept relations used to characterise events and contexts in ontology-based methods are useful for validating and refining the concept detection results of cognition-based methods. In addition, the detection of meanings (concepts, events and contexts) in these methods is enhanced by adaptive learning methods. In particular, EF provides a means to effectively refine cognition-based and ontology-based methods by explaining the reasons for false meaning detection. Meanwhile, the information (i.e., knowledge, features (concepts) and classifiers) of these methods is material for extracting metalevel features in adaptive learning methods, so that the latter can effectively control meaning detection in the former. Therefore, it is beneficial to develop a framework for unifying methods in these mutually related categories.

Fig. 7

An illustration of the relation among three types of human-machine cooperation methods (cognition-based, ontology-based and adaptive learning methods)