1 Introduction

Automatic video analysis is one of the basic tools that enable the development of applications such as multimedia information retrieval or novel TV services. Understanding a video is a multifaceted objective that ranges from simple tasks, like the recovery of the video structure, to more complex ones, such as the detection of specific events. In this paper, we focus on the latter, i.e., on the ability to detect specific extracts of a video that have a particular, and usually important, meaning for the user. Event detection is particularly important when videos have a weak structure, since events are then the only anchors in the stream that allow non-linear browsing.

Events have no general definition and are usually specific to a particular context and application. Most of the time, the application context calls for a high-level semantic definition of an event (a goal in soccer, a dunk in basketball), but the definition may be much fuzzier: an interesting moment, or a moment similar to a set of examples. This suggests two main approaches to building an event detection system. First, a formal definition of the event can be given as a model which is then used for detection. Such approaches are usually based on rules that can be either defined by a human expert (see, e.g., [24, 32, 35, 38]) or inferred from examples as in Perlovsky [29]. Alternatively, machine learning techniques can be used to train a system from examples with the goal of deciding whether a video extract contains the event or not. These approaches heavily rely on classification techniques, be they probabilistic (see, e.g., [17, 25, 37]) or not (e.g., [1, 15, 33]). The goal of these classifiers is to establish a relation between what can be extracted from the videos, i.e., low-level multimodal features, and the event. Their performance is thus closely linked to their ability to find the best combination of features and to derive a decision rule that discriminates the positive examples with an adequate level of generalization.

We focus here on probabilistic approaches to event detection, which are, for the most part, based on Bayesian theory. One of the interests of statistical models is that they make it easy to take into account the correlations between features and the temporal aspect of videos. The variety of statistical models used in multimedia analysis includes naive Bayes classifiers, decision trees, Bayesian networks (often used in a naive Bayes fashion) and hidden Markov models (HMMs). In particular, HMMs and their variants have been extensively used for event detection in videos, with a wide range of applications from commercial detection [25] and video genre classification [12] to structure analysis [20, 31]. Hidden Markov models are well adapted to dense segmentation, where every shot corresponds to some well-defined event, but they are not particularly suited for sparse event detection. Moreover, the model assumes that all observations are part of a single observation vector. This assumption is not well adapted to videos, in which several modalities are combined. Multistream extensions of the HMM framework, initially proposed for audiovisual speech recognition, have been applied to multimedia analysis [17, 20]. However, recent work on segmental multistream HMMs for tennis video structuring [7] demonstrated the necessity of modeling the dependencies between features. Unfortunately, knowing which dependencies to model, and how, is not an easy task and requires a lot of human expertise, when it is possible at all. This paper therefore focuses on algorithms that learn the statistical dependencies between variables in a large set, along with the corresponding model.

In this regard, Bayesian networks (BNs) define a general framework for probabilistic modeling which encompasses all of the above mentioned models, including HMMs and segment models, via so-called dynamic Bayesian networks [26]. A Bayesian network, dynamic or not, is a directed acyclic graph (DAG) where nodes represent random variables and links represent the causal relations between variables, thus allowing a wide variety of topologies and offering great flexibility. Moreover, BNs have been shown to be well adapted to multimodal fusion in the framework of video analysis [18, 23], with applications in event detection, for example in soccer videos [36] or in Formula 1 car races [30].

Certainly one of the most appealing features of Bayesian network theory is the ability to learn the structure of the graph that links together all the variables considered. In other words, one can learn the structure of the DAG in addition to the parameters of the model, something never satisfactorily achieved with HMMs. The main interest of structure learning is to avoid resorting to human expertise and heavy trial-and-error experimental protocols to define the best statistical model, while avoiding unnecessary assumptions on the data, such as the state conditional independence assumption of HMMs.

Several algorithms for structure learning in BNs, such as the K2 algorithm [6], have been proposed in the literature. In particular, score-based approaches seek to maximize an objective function that reflects a trade-off between the best fit of the training data and the generalization capabilities of the model. However, such algorithms are seldom used in the multimedia area, where naive networks predominate [27]. Note, however, the work of Choudhury et al. [4], Friedman et al. [9] and Baghdadi et al. [2], which successfully investigates structure learning on rather simple tasks.

In this paper, we investigate the use of structure learning algorithms for a rather complex multimedia task which consists in detecting action shots in soccer videos from multimodal input. To the best of our knowledge, this constitutes the first attempt in the multimedia area to use BN structure learning algorithms on a large-scale complex task. We illustrate that classical score-oriented structure learning algorithms such as K2, whose usefulness has been demonstrated on simple tasks, fail to provide a good network structure for classification tasks where many correlated observed variables are necessary to make a decision. We then compare several structure learning objective functions which aim at finding the structure that yields the best classification results, extending existing solutions in the literature. All structure learning algorithms are evaluated and compared on this realistic task, using a comprehensive data set of 7 games.

The paper is organized as follows. Section 2 defines Bayesian networks more formally and presents classical structure learning strategies. In Section 3, we exhibit a case where the K2 algorithm performs poorly and discuss the reasons for this. We then propose new objective functions oriented towards the classification goal in Section 4 and provide an experimental comparative study in Section 5.

2 Bayesian networks and structure learning

A Bayesian network can be seen as a graphical representation of a probability distribution over a set of random variables, where the graph depicts the causal relations between the variables. As with any classification model, using Bayesian networks raises two issues: inference and training. Inference aims at making a decision (in our case, whether or not the event considered is present) based on the available evidence, or observations, given a network. Training includes two usually distinct phases, namely the design of the model and the estimation of its parameters from examples given the structure. We first formally define Bayesian networks and briefly discuss the inference issue, before discussing model design issues and presenting the general principles of structure learning in Bayesian networks.

2.1 Definition and inference

Formally, a Bayesian network is a statistical model of a set of random variables, representing relations between variables such as conditional independence or causality. A network can be represented as a graphical model, i.e., a directed acyclic graph (DAG) \({\cal G}\) where each node \(X_i\) is a random variable, arcs representing a relation of conditional dependence between the two variables at stake. In other words, an arc \(X_i \rightarrow X_j\) indicates that \(X_j\) depends on \(X_i\). Assuming random variables take values in a discrete observation space, a probability distribution table is associated with each node \(X_i\) of the DAG to describe the probability of the random variable taking a value conditionally on the values of the set \({\cal P}_i\) of parents of node \(X_i\), i.e., \({\cal P}_i = \{X_j \mbox{ s.t. } X_j \rightarrow X_i\}\). Hence, the network encodes the relations within the set of random variables considered, \(\{X_i\}\), and can be used to factor the total probability of the collection according to

$$ P[X_0, X_1, \ldots, X_n] = \prod\limits_{i=0}^{n} P[X_i | {\cal P}_{i}] \enspace , $$
(1)

where \({\cal P}_{i}\) denotes the set of variables corresponding to the parents of node i, i.e., the set of random variables upon which \(X_i\) depends.

A simple example of a Bayesian network is illustrated in Fig. 1, where the three variables \(X_1\), \(X_2\), \(X_3\) are all independent conditionally on the knowledge of \(X_0\). The total probability can therefore be decomposed as

$$ P[X_0,X_1,X_2,X_3] = P[X_0]P[X_1|X_0]P[X_2|X_0]P[X_3|X_0] \enspace . $$
Fig. 1 Example of a simple Bayesian network with four variables

As can be seen, a Bayesian network over a given collection of random variables is fully defined by its structure, i.e., the topology of the directed acyclic graph, and by the conditional probability tables (CPTs) at each node. In practice, Bayesian networks are mostly used to make decisions based on the inference of the value of unobserved variables given the observed ones. For example, in the framework of multimedia content classification that we are studying, one is interested in inferring the class value given observations. This can be done using so-called naive structures such as the one depicted in Fig. 1, where \(X_0\) represents the unknown class to be inferred—here, a binary class stating whether an event is present or not—given observations \(X_1\), \(X_2\), \(X_3\). Inference algorithms, such as those of Kim and Pearl [21] and Jensen et al. [19], are the key to solving marginalization or posterior problems so as to find an optimal configuration for the unobserved variables.
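As a concrete illustration, the following minimal Python sketch (with made-up CPT values, purely for illustration) computes the factored joint probability of (1) for the network of Fig. 1 and infers the posterior of the class node by marginalization:

    # Hypothetical CPTs for the network of Fig. 1 (illustrative values only)
    p_x0 = {0: 0.8, 1: 0.2}                            # prior on the class node X0
    p_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # P[X1 | X0]
    p_x2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}  # P[X2 | X0]
    p_x3 = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}  # P[X3 | X0]

    def joint(x0, x1, x2, x3):
        """Factored total probability of (1) for the DAG of Fig. 1."""
        return p_x0[x0] * p_x1[x0][x1] * p_x2[x0][x2] * p_x3[x0][x3]

    def posterior_x0(x1, x2, x3):
        """Infer the class node X0 given the observations X1, X2, X3."""
        num = {c: joint(c, x1, x2, x3) for c in (0, 1)}
        z = sum(num.values())
        return {c: v / z for c, v in num.items()}

    print(posterior_x0(1, 1, 1))  # -> posterior strongly favoring X0 = 1 here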

2.2 Graphical model design

Apart from the inference issue, model design is a crucial step in implementing a Bayesian network classifier. Model design can be seen as a two-step process where the first step consists in defining the topology of the model while the second one relates to the estimation of the conditional probability tables from training data.

Maximum likelihood approaches have been designed for parameter estimation in a variety of networks, exploiting the factorization of the total probability in the network [11, 16, 26]. For simple networks such as the ones considered in this study, with discrete variables that are all observed in the training data, maximum likelihood estimation boils down to estimating the conditional probabilities with empirical frequencies.
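For illustration, maximum likelihood estimation in this fully observed discrete setting reduces to counting, as in the following sketch (variable names are hypothetical):

    from collections import Counter

    def estimate_cpt(samples, child, parents):
        """Empirical conditional frequencies P[child | parents] from fully
        observed discrete samples (each sample is a dict of variable values)."""
        joint = Counter((tuple(s[p] for p in parents), s[child]) for s in samples)
        marg = Counter(tuple(s[p] for p in parents) for s in samples)
        return {(pa, v): c / marg[pa] for (pa, v), c in joint.items()}

    # toy usage
    data = [{"Xc": 1, "X1": 1}, {"Xc": 1, "X1": 0}, {"Xc": 0, "X1": 0}]
    print(estimate_cpt(data, "X1", ["Xc"]))
    # -> {((1,), 1): 0.5, ((1,), 0): 0.5, ((0,), 0): 1.0}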

On the contrary, only a few algorithms have been proposed for the estimation of an optimal graph topology given training data. Moreover, these algorithms are seldom used in practice for real-life classification problems such as the one targeted here. Approaches to structure learning can be grouped into two main families. The first one consists in using statistical dependency tests to search for causalities between variables, as in the IC and SGS algorithms [28, 34]. The second family groups methods targeting the optimization of a score that evaluates the quality of a structure. Several scores have been proposed in the literature, along with efficient, but suboptimal, strategies to review a large set of candidate structures to choose from. For example, restricting the possible set of structures to trees, one can search for the best tree structure using the maximum weight spanning tree (MWST) algorithm [22], assuming the availability of a causality score between any two variables [5, 16]. For structures more general than trees, most algorithms, including the popular K2 algorithm [6] and its variants, impose an ordering on the nodes such that the set of possible parents for a given node is limited to the nodes with a higher rank, thus drastically reducing the search space. Finally, greedy search heuristics have also been proposed in the literature for efficient exploration of the space of all possible structures [3].

Most score-oriented structure learning algorithms rely on a score which seeks a trade-off between accurately modeling the training data and obtaining a low-complexity network. In particular, a popular score function, which derives from a simplification of the K2 score, is the Bayesian information criterion (BIC), for which the objective function to optimize is given for a graph \(\mathcal{G}\) over N variables by

$$\begin{array}{rll} Q_{\mbox{BIC}}(\mathcal{G}) & = & \ln P_{\mathcal{G}}[\mathbf{X}] - \frac{\lambda}{2}\;C(\mathcal{G})\;\ln(K) \nonumber \\ & = & \sum\limits_{i=0}^N \left( \ln P[X_i|{\cal P}_{i}] - \frac{\lambda}{2} \; C_i({\cal G})\;\ln(K) \right) \end{array}$$
(2)

where K is the number of training examples, X is the set of variables, \(C(\mathcal{G})\) is the total number of free parameters in the network and \(C_i(\mathcal{G})\) is the number of free parameters at node i. As explicitly shown in the above equations, the objective function decomposes as a sum over all nodes of the network, thus limiting the amount of computation needed to update the score when the structure is changed.
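For concreteness, the following minimal sketch (assuming binary variables, fully observed training samples represented as dicts, and an illustrative cap on the number of parents) implements the per-node BIC term of (2) and the greedy K2 search driven by it:

    import math
    from collections import Counter

    def bic_node_score(samples, node, parents, lam=1.0):
        """Local BIC term of (2) for one node: the conditional log-likelihood
        ln P[X_i | P_i] minus the penalty (lam/2) C_i ln K, for binary nodes."""
        K = len(samples)
        joint = Counter((tuple(s[p] for p in parents), s[node]) for s in samples)
        marg = Counter(tuple(s[p] for p in parents) for s in samples)
        loglik = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
        n_params = 2 ** len(parents)  # one free parameter per parent configuration
        return loglik - 0.5 * lam * n_params * math.log(K)

    def k2(samples, order, max_parents=3):
        """Greedy K2 search: nodes are scanned in a fixed order and each node
        may only take higher-ranked (earlier) nodes as parents; parents are
        added one at a time as long as the local score improves."""
        structure = {node: [] for node in order}
        for i, node in enumerate(order):
            best = bic_node_score(samples, node, [])
            improved = True
            while improved and len(structure[node]) < max_parents:
                improved, best_cand = False, None
                for cand in order[:i]:
                    if cand in structure[node]:
                        continue
                    s = bic_node_score(samples, node, structure[node] + [cand])
                    if s > best:
                        best, best_cand, improved = s, cand, True
                if best_cand is not None:
                    structure[node].append(best_cand)
        return structure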

However, this approach suffers from severe drawbacks in a complex classification task, in particular because the objective function is oriented towards description of the data rather than towards optimal prediction. Indeed, in classification networks, the classification node, denoted \(X_c\) in the sequel, plays a particular role and should be treated differently. Few criteria have been proposed for the purpose of classification. The tree augmented network (TAN) consists in augmenting a naive network with a tree structure using a MWST algorithm [8, 10]. More recently, Grossman and Domingos [14] proposed to use the conditional likelihood (i.e., conditioned on the classification node) rather than the likelihood in the objective function.

In the next section, we study the use of the Bayesian information criterion as an objective function to learn the structure for the task of event detection in soccer videos and show the limitations of likelihood-based objective functions in this case. A classification-oriented objective function is proposed in Section 4 and compared to the TAN algorithm and to a K2 augmented network.

3 Limitations of K2 structure learning for soccer video indexing

In preliminary work, we demonstrated the benefit of using (2) as the objective function for structure learning in the task of multimodal advertisement detection in videos [2]. Elaborating on these results, the same structure learning paradigm is here applied to a more complex task, where more variables are to be considered, namely the detection of actions in soccer videos based on low-level audio and visual features.

We first describe the task and experimental protocol that is used throughout the paper before presenting results which demonstrate that K2 fails at such a complex classification task.

3.1 The action detection task in soccer videos

We consider the task of detecting actions in soccer videos, where an action is defined as a period of time in the match when a player is about to shoot to score. Such an action usually takes place near the goal mouth and comes with the cheering of the crowd and excitement in the commentator's voice, thus requiring multimodal input. Replays of the action also usually follow. Action detection in soccer video is a complex task which requires that multiple features be considered simultaneously, and it is thus far more challenging for structure learning algorithms than previous case studies. In particular, in comparison with advertisement detection, the number of features required to accurately identify actions is greater, and no straightforward features such as monochrome frames are available.

In this work, detection is performed at the shot level, all videos being automatically segmented into shots. From each shot, the following set of 8 binary audio and visual features is automatically extracted, a value of one indicating the presence of the feature:

i. crowd excitement: this feature is usually strongly correlated with a noticeable event;

ii. transition shot: transition effects, classically detected by the shot segmentation algorithm, are usually added to increase the attractiveness of an event;

iii. wide shot: a shot is classified as wide based on the detection of green as the dominant color and of terraces in the background;

iv. lull scene: a lull scene, as opposed to a peak scene, corresponds to a game sequence where nothing special is happening and during which directors usually alternate between wide shots and other shots to maintain the dynamics of the video. The detection of such scenes is mostly rule based, relying on the classification of shots as wide or not;

v. presence of face: shots containing mostly a face, as indicated by a face detection algorithm, are likely to be close-ups, which are strongly related to actions;

vi. green shot: shots where the green color is sufficiently present are marked as such, indicating whether the field is visible;

vii. replay logo: this feature indicates the presence of a replay logo, which marks the start or end of a replay sequence;

viii. goal mouth: this indicates whether the goal mouth is visible, thus acting as an indicator of the action's importance.

However, actions are mostly characterized by the temporal evolution of the features, and classification can hardly be performed on the basis of a single shot. Hence, the primary features of a shot are augmented with the features of the neighboring shots, taking two shots of context on each side. This amounts to a total of 40 contextual features (8 features over 5 shots) which are to be modeled with a Bayesian network classifier. With respect to our previous study, it is important to note that this constitutes a much larger set of features. Moreover, the features are highly correlated, in particular due to the use of context shots. Finally, it should also be noted that not all features are always directly relevant to the classification task. A minimal sketch of this contextual feature construction follows.
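The sketch below assumes each shot is described by a list of 8 binary values and that the sequence is zero-padded at its boundaries (boundary handling is not specified in the text):

    def contextual_features(shot_features, i, context=2):
        """Features of shot i concatenated with those of the `context` shots
        on each side: 8 * (2 * context + 1) = 40 binary values. Zero padding
        at the video boundaries is an assumption of this sketch."""
        n = len(shot_features)
        vec = []
        for j in range(i - context, i + context + 1):
            vec.extend(shot_features[j] if 0 <= j < n else [0] * 8)
        return vec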

Experiments are carried out on a data set of 7 games broadcast during the 2006 World Cup, amounting to about 14 h of video. Automatic shot segmentation yielded 9,632 shots in total, among which 192 were labeled by a human expert as action shots (about 2 % of the total number of shots). Table 1 provides details on a per-video basis. Due to the limited amount of data available, a cross-validation protocol with 7 folds was adopted, retaining one match as test material in each fold. The training set of each fold is used both for structure learning and for maximum likelihood parameter estimation, the two being performed jointly regardless of the structure learning criterion used. Results are reported in terms of recall and precision on the action shots, where a shot is deemed an action shot if the posterior probability \(P(X_c | X_1, X_2, \ldots, X_n)\) of the classification node is above a threshold. The threshold is varied so as to achieve different trade-offs between recall and precision.

Table 1 Number of shots, number of action shots and total duration of the video per game
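The recall/precision evaluation reduces to sweeping a threshold over the class posteriors; a minimal sketch (inputs are hypothetical):

    def recall_precision(posteriors, labels, threshold):
        """Recall and precision of the action class when a shot is declared
        an action if P(Xc = 1 | X1, ..., Xn) exceeds `threshold`."""
        tp = sum(1 for p, y in zip(posteriors, labels) if p > threshold and y == 1)
        fp = sum(1 for p, y in zip(posteriors, labels) if p > threshold and y == 0)
        n_pos = sum(labels)
        recall = tp / n_pos if n_pos else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        return recall, precision

    # sweeping the threshold traces the recall vs. precision trade-off curve
    curve = [recall_precision([0.9, 0.4, 0.7], [1, 0, 1], t) for t in (0.3, 0.5, 0.8)]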

3.2 K2 structure learning for soccer videos

Following exactly the same methodology as in [2], where the K2 algorithm demonstrated its effectiveness for advertisement detection, the K2 algorithm with the score function given in (2) was first used for event detection in soccer videos. The structure was initialized with the MWST algorithm of Chow and Liu [5], taking the classification node \(X_c\) as the root of the tree. Node ordering in the K2 algorithm was derived from the tree structure obtained in this initialization step.

Results are reported in Fig. 2 and compared with a naive structure. Contrary to previous results on advertisement detection, the K2 structure fails at capturing the complex relations between variables, resulting in poorer performance than the naive structure. This result is counter-intuitive, as one would expect the model to benefit from taking into account the correlations that might exist between variables. An example of a structure learned from the data is given in Fig. 3, where variables were assigned arbitrary numbers; it illustrates two important points regarding the behavior of the K2 algorithm. On the one hand, structure learning succeeds to some extent in capturing the relations between variables, resulting in a network structure rather different from the naive one. This structure nevertheless remains difficult to interpret, even with some expert knowledge of soccer. Most of all, it also appears that only a few feature (observed) nodes are directly connected to the event classification node (red links), contrary to the naive network where, by definition, all features are connected to \(X_c\).

Fig. 2 Recall vs. precision trade-off curves comparing the K2 (red) and naive (blue) structures

Fig. 3 Structure obtained with K2 structure learning, where red links denote direct connections between \(X_c\) (node 1 in the picture) and the observed variables

This last observation is the key to understanding the poor results obtained with structure learning when, contrary to the advertisement detection use case, a large number of variables is at stake. Indeed, for a graph structure \({\cal G}\), the likelihood term in the score function \(Q_{\mbox{ BIC}}\) can be rewritten as

$$ \ln P_{\cal G}[\mathbf{X}] = \ln(P_{\cal G}[X_{1},...,X_{n}]) + \ln(P_{\cal G}[X_{c}|X_{1},...,X_{n}]) \enspace . $$
(3)

From this formulation, it can be seen that as the number n of observed variables increases, the first term on the right-hand side, being a sum of n negative contributions, grows rapidly in magnitude, while the second term remains a single bounded log-probability that does not scale with n. Structure learning with a large number of variables is therefore dominated by the maximization of the term \(\ln(P_{\cal G}[X_{1},...,X_{n}])\), regardless of the class node \(X_c\). The result is a structure that represents the relations between the observed variables, regardless of their impact on the classification task considered.

These preliminary results show that, for classification tasks with a large number of observed variables, structure learning algorithms searching for a trade-off between the best fit of the data and the complexity of the resulting model are not suited. One way to circumvent this issue is feature selection, which limits the number of observed variables and has proven experimentally valid. However, feature selection might result in a loss of information. An alternative solution consists in using objective functions for structure learning that account for the specificity of classification tasks, paying special attention to the particular role of \(X_c\).

4 Classification-oriented structure learning algorithms

Two main strategies can be envisioned to learn the structure of a Bayesian network for classification. The first one consists in forcing relations between the observed variables and the classification node, restricting structure learning to the relations between observed variables. The second one consists in explicitly accounting for classification in the objective function. The first option benefits from an easy implementation but is suboptimal, while the second one is optimal but difficult to implement, because most classification-oriented objective functions are not decomposable.

For each of the two strategies, an efficient algorithm for Bayesian network structure learning in the framework of classification is proposed. Firstly, pursuing the philosophy of the tree augmented network structure learning algorithm of Friedman et al. [8], we impose constraints on the optimal structure, forcing all observed variables to be directly related to the classification node. This strategy yields a K2 augmented network structure which is still learned based on the likelihood-complexity trade-off. Secondly, a discriminative objective function [14] is used in place of the likelihood-based K2 one. Unfortunately, this new objective function is not decomposable over the set of nodes in the network, and we resort to genetic algorithms for greedy optimization.

4.1 K2 augmented structure

As naive networks have proven successful for classification tasks on many occasions, Friedman et al. [8] proposed to augment the naive structure by adding arcs between observed variables, using a MWST algorithm to generate a tree structure over the observed nodes. A score based on the mutual information between each pair of nodes, conditionally on \(X_c\), is used as input to the MWST computation. This algorithm therefore results in a structure where each observed node has at most two parents: the classification node and possibly another feature node.
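As an illustration, the pairwise score fed to the MWST step, the empirical mutual information between two features conditioned on \(X_c\), can be computed from counts as follows (a sketch for discrete samples; variable names are hypothetical):

    import math
    from collections import Counter

    def conditional_mutual_information(samples, a, b, c="Xc"):
        """Empirical I(A; B | C) over discrete samples: the pairwise score
        used as edge weight in the MWST step of TAN learning."""
        K = len(samples)
        n_abc = Counter((s[a], s[b], s[c]) for s in samples)
        n_ac = Counter((s[a], s[c]) for s in samples)
        n_bc = Counter((s[b], s[c]) for s in samples)
        n_c = Counter(s[c] for s in samples)
        return sum((n / K) * math.log(n * n_c[vc] / (n_ac[(va, vc)] * n_bc[(vb, vc)]))
                   for (va, vb, vc), n in n_abc.items())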

Restricting the search to tree structures clearly limits the complexity of the structure learning process. However, the relative simplicity of the resulting structure also has some drawbacks. First, the tree structure does not allow connections between more than two feature nodes, a situation likely to arise when a large set of variables is used. This is particularly true in our case, where the contextual features taken from neighboring shots are likely to be highly correlated with one another. Additionally, accounting for features not related to any other feature (i.e., features whose only connection should be to \(X_c\)) is impossible.

As a workaround, we propose to augment the naive structure using K2 structure learning with the BIC criterion, thus extending the tree augmented network philosophy. Using K2 structure learning to augment the naive structure clearly enlarges the set of possible structures, enabling more complex structures that do not suffer from the limitations stated. K2 augmented structure learning relies on a modified version of the Bayesian information criterion which accounts for the compulsory link between each feature and the event node. Formally, the objective function is defined as

$$ Q^{\mbox{(c)}}_{\mbox{ BIC}}(\mathcal{G}) = \sum\limits_{i=0}^{N} \left( \ln P(X_{i}|{\cal P}_{i},X_{c}) - \frac{\lambda}{2} C(X_i,{\cal G})\;\ln(K) \right) \enspace , $$
(4)

and remains decomposable, thus making it possible to use the same efficient exploration strategy based on node ordering as for the initial K2 algorithm described in Section 3.
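Under the same simplifying assumptions as the K2 sketch in Section 2.2 (binary variables, fully observed data), the augmented criterion of (4) amounts to forcing \(X_c\) into every feature node's parent set before scoring; a sketch reusing bic_node_score from that earlier snippet:

    def augmented_bic(samples, structure, class_node="Xc", lam=3.0):
        """Score of (4): decomposable BIC with the class node forced into the
        parent set of every feature node (assumed not already listed there).
        lam=3 matches the experimental setting reported in Section 5."""
        forced = {n: ps if n == class_node else ps + [class_node]
                  for n, ps in structure.items()}
        return sum(bic_node_score(samples, n, ps, lam) for n, ps in forced.items())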

4.2 Discriminative objective function

Even if the classification goal is explicitly considered in (4), the rationale still consists in finding the structure that best fits the training data, subject to the simplicity of the structure. Given a large number of observed variables, this structure might still be dominated by the search for a good explanation of the relations between features, rather than by the search for the structure that best classifies the data. We therefore propose a new structure learning criterion whose goal is to directly maximize the class conditional probability \(P[X_c|X_1,...,X_n]\) rather than the joint probability \(P[X_c,X_1,...,X_n]\). The use of the class conditional probability was introduced in Greiner et al. [13] for parameter estimation and studied in Grossman and Domingos [14] for structure learning with the BNC algorithm. We detail here a variant of the BNC algorithm using a genetic algorithm to explore the space of possible network structures.

As in Grossman and Domingos [14], the objective function for structure selection is defined as the conditional log-likelihood of the K training examples,

$$ Q_{\mbox{CLL}}(\mathcal{G}) = \sum\limits_{k=1}^{K} \ln P_{\cal G}\big[X_{c}^{(k)}|X_{1}^{(k)},...,X_{n}^{(k)}\big] \enspace , $$
(5)

where the superscript (k) denotes the values observed in the k-th training example. Unfortunately, this discriminative score is not decomposable and cannot be written as a sum of local scores calculated separately for each node. We therefore resort to a genetic algorithm in order to explore the set of possible structures. Genetic algorithms are iterative algorithms that require an initial structure as a starting point. From this initial structure, a set of candidate structures is generated by adding, reversing or deleting one single arc. The discriminative score is calculated for each of the structures resulting from these mutations, and the one that maximizes the score, given maximum likelihood estimates of the parameters, is chosen as the starting point for the next iteration. The algorithm stops if none of the generated structures increases the score. The choice of the initial structure is crucial and is discussed along with the experimental results in the next section.
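The following compact sketch illustrates this exploration loop under simplifying assumptions: binary class, fully observed discrete features, maximum likelihood CPTs re-estimated for each candidate, and no acyclicity check (a full implementation would test for cycles and could maintain a population of structures):

    import math
    from collections import Counter
    from copy import deepcopy

    def mle_cpts(samples, structure):
        """Maximum likelihood CPTs (empirical frequencies) for every node."""
        cpts = {}
        for node, parents in structure.items():
            joint = Counter((tuple(s[q] for q in parents), s[node]) for s in samples)
            marg = Counter(tuple(s[q] for q in parents) for s in samples)
            cpts[node] = {k: c / marg[k[0]] for k, c in joint.items()}
        return cpts

    def cll(samples, structure, class_node="Xc", eps=1e-9):
        """Conditional log-likelihood of (5): sum over training examples of
        ln P[Xc | X1, ..., Xn], computed from the factored joint."""
        cpts = mle_cpts(samples, structure)
        def joint(sample):
            prob = 1.0
            for node, parents in structure.items():
                pa = tuple(sample[q] for q in parents)
                prob *= cpts[node].get((pa, sample[node]), eps)
            return prob
        score = 0.0
        for s in samples:
            den = sum(joint({**s, class_node: c}) for c in (0, 1))
            score += math.log(max(joint(s) / den, eps))
        return score

    def mutations(structure):
        """All structures one arc addition, deletion or reversal away
        (acyclicity checking omitted in this sketch)."""
        for child in structure:
            for parent in structure:
                if parent == child:
                    continue
                m = deepcopy(structure)
                if parent in m[child]:
                    m[child].remove(parent)        # delete the arc
                    yield m
                    r = deepcopy(m)
                    r[parent].append(child)        # reverse the arc
                    yield r
                else:
                    m[child].append(parent)        # add an arc
                    yield m

    def learn_structure(samples, structure, class_node="Xc"):
        """Keep the best-scoring single-arc mutation; stop when none improves."""
        best = cll(samples, structure, class_node)
        while True:
            scored = [(cll(samples, m, class_node), m) for m in mutations(structure)]
            top, cand = max(scored, key=lambda t: t[0])
            if top <= best:
                return structure
            best, structure = top, cand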

5 Experimental results

Recall vs. precision trade-off curves are plotted in Fig. 4 for the tree and K2 augmented networks as well as for the discriminative objective function. Performance of the naive network is also reported as a baseline. It should be noted that, for the K2 augmented network, λ was experimentally set to 3 in (4).

Fig. 4 Recall vs. precision trade-off curves for i) a naive Bayesian network (dark blue), ii) a tree augmented network (green), iii) a network resulting from the K2 augmented technique (red) and iv) a discriminatively trained network (light blue)

Results show that the two augmented approaches clearly improve over the naive baseline network. Indeed, by forcing the classification node to be connected to all the feature nodes, these techniques build a structure which benefits from the whole feature information, as with the naive Bayesian network, while also taking into account correlations between the features themselves. Moreover, the K2 augmented method provides better results than the TAN approach for the classification task, because the resulting network is less constrained in the K2 augmented case. This allows more flexibility to take into account the correlations between several features when they exist or, on the contrary, to avoid non-relevant connections between features when they are not needed, which is crucial in a classification task as complex as the one we are targeting. Finally, the classification-oriented approach based on a discriminative objective function outperforms all the other techniques. The maximization of this new score, dedicated to the classification task, is one explanation for these good results; the other is the absence of restrictions on the form of the final structure. The resulting network therefore better describes the correlations between the classification node and the features. An analysis of the structure resulting from this classification-oriented scheme is proposed below.

An example of a structure obtained with the classification-oriented objective function is shown in Fig. 5, where node 1 corresponds to the classification node. An analysis of this structure highlights the small number of nodes used for the classification process: 13 nodes are directly connected to the classification node (red links) and 10 additional nodes are indirectly connected to it. A total of 17 nodes was therefore rejected from the final structure as not relevant for the classification task at hand. Connections with these nodes did not increase the discriminative score, so the algorithm performs implicit feature selection. It is interesting to note that, by construction, implicit feature selection cannot occur in the augmented techniques. This reduction of the structure size results in a more reliable parameter learning step, and hence in increased classification performance.

Fig. 5 Example of a structure resulting from the use of the discriminative objective function

As mentioned previously, the choice of the starting point for the genetic exploration of the set of possible structures is usually of utmost importance when maximizing the conditional probability. Three initial structures were tested, namely the naive, TAN and K2 augmented networks. We observed no impact on performance after structure learning. However, training time is significantly affected by the initialization point, as reported in Table 2. The much reduced training time when starting from one of the augmented structures highlights the quality of the latter, making them good starting points for the genetic algorithm.

Table 2 Impact of the initial structure on the learning time of the discriminative approach

6 Conclusion

Taking action detection in soccer videos as a use case for multimodal event detection, we have shown how structure learning in Bayesian networks, associated with an adequate objective function, can efficiently detect complex multimodal events in videos. We have demonstrated that, while using an information criterion for structure learning in BNs suffers major drawbacks for complex classification tasks with a large number of correlated observed variables, classification-oriented objective functions can efficiently deal with this issue. In particular, we proposed a new K2 augmented network structure and a genetic implementation of the conditional likelihood objective function, both of which turned out to outperform state-of-the-art structure learning methods. Experimental results however call for a few remarks and suggestions for further work.

Firstly, we observed that the ability to select relevant variables, i.e., to decide that some variable has no direct or indirect relation with the classification node, is crucial. While the conditional likelihood criterion embeds feature selection, this is not the case for the BIC and augmented approaches, which might benefit from explicit feature selection. However, we believe that embedding feature selection within structure learning is better suited than performing feature selection as a preliminary step to structure learning.

Secondly, we considered that the observed variables contribute to the classification directly, if at all. In other words, we only have two types of variables: the observed ones and the classification node. For complex classification tasks, it appears interesting to consider hidden variables which act as intermediate concepts between the observations and the decision. We are convinced that training such networks will help improve the inferred structure, and thus the classification, by summarizing complex information from the features into a few concepts. However, extensions of the existing structure learning algorithms are required to handle hidden variables.

Finally, the temporal dimension of videos was limited in this work to the use of contextual features from the neighboring shots, classification being performed on a per-shot basis. As with shot-based classification, learning the temporal structure of the video with a goal-oriented objective function is likely to improve classification performance and requires further investigation. Directly learning the structure of a dynamic BN is intractable, except in some rare cases. Combining BNs and segmental HMMs appears to be a plausible alternative, where the structure and parameters of the BNs are trained to predict the posterior probabilities required for Viterbi decoding in segmental HMMs.