
1 Introduction and Motivation

Natural language parsing usually suffers from the problem of ambiguity. Many ambiguities cannot be resolved using linguistic information alone. Trained models learn to disambiguate the syntactic structure according to the prior probability distribution found in the training data: the most frequent interpretation is also taken as the most plausible one, irrespective of any other factors that might contribute counter-evidence. Such behavior is highly undesirable in dynamic contexts, where the actual choice should also consider the current state of affairs in the world. In such a situation, including visual information in the decision process might help to find a better-fitting interpretation.

One of the most noticeable sources of ambiguity in natural language is prepositional phrase (PP) attachment. As shown in Fig. 1 for “I saw a girl with a telescope”, the decision between high and low attachment cannot be made on purely syntactic grounds, and even lexical preferences do not provide reliable clues. Under high attachment, the PP is attached to the verb and expresses its relation to the verb (I use the telescope to see the girl). Under low attachment, the PP is attached to the closest lexical item and marks the coexistence of the PP with that item (I saw a girl who has a telescope with her). If visual input is available in addition to the linguistic one, integrating both into the learning model might help in such a situation, provided that the knowledge contributed by the visual (context) information is useful for disambiguating the dependencies.
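To make the two readings concrete, the following minimal sketch (our own illustration, not part of any parser) lists the head index of each word under both analyses of “I saw a girl with a telescope”; only the head of “with” differs.

```python
# Illustrative only: head indices (0 = artificial root) for the two readings
# of "I saw a girl with a telescope".
words = ["I", "saw", "a", "girl", "with", "a", "telescope"]

# High attachment: the PP "with a telescope" modifies the verb "saw"
# (the telescope is the instrument of seeing).
high_heads = [2, 0, 4, 2, 2, 7, 5]   # 1-based head index per word

# Low attachment: the PP modifies the noun "girl"
# (the girl carries the telescope).
low_heads = [2, 0, 4, 2, 4, 7, 5]

for w, h_hi, h_lo in zip(words, high_heads, low_heads):
    marker = "  <-- only difference" if h_hi != h_lo else ""
    print(f"{w:10s} high-head={h_hi} low-head={h_lo}{marker}")
```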

Fig. 1. (A) High attachment. (B) Low attachment

In this paper, we present a multimodal dependency parser that does not rely on the linguistic input alone but also on a non-linguistic modality. Although we treat the context (visual) information as the non-linguistic modality and inject it into the parser in the form of relations between the elements in the context, we do not yet address relation extraction from images themselves. Instead we focus on a range of other challenges of this research, such as introducing the context knowledge into the learning model of a graph-based parser and adapting the scoring function so that decisions are based on both the linguistic and the non-linguistic modality.

We use thematic roles in the form of triples as a description language stating the situation given in the non-linguistic context. The thematic roles contain information that helps in ambiguity resolution. Both linguistic and non-linguistic information are fed into the graph-based dependency parser to improve its quality.
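As an illustration of what such a description could look like (the concrete notation below is an assumption for this sketch, not a fixed format), the two readings of the telescope sentence correspond to two different sets of role triples, and a single triple is enough to signal the intended attachment:

```python
# Illustrative sketch (notation is ours): a non-linguistic context is stated
# as thematic-role triples (role, head, modifier).  For "I saw a girl with a
# telescope", the two possible scenes differ in a single triple.
scene_high = [          # the speaker uses the telescope
    ("agent", "saw", "I"),
    ("theme", "saw", "girl"),
    ("instrument", "saw", "telescope"),
]

scene_low = [           # the girl carries the telescope
    ("agent", "saw", "I"),
    ("theme", "saw", "girl"),
    ("owner", "girl", "telescope"),
]

def attachment_hint(scene):
    """Return which attachment the scene supports for the 'telescope' PP."""
    for role, head, mod in scene:
        if mod == "telescope":
            return "high (verb) attachment" if head == "saw" else "low (noun) attachment"
    return "no evidence"

print(attachment_hint(scene_high))  # high (verb) attachment
print(attachment_hint(scene_low))   # low (noun) attachment
```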

The paper starts with a review of dependency parsing methods (transition- and graph-based) and previous work on the integration of context information into a rule-based parser. In Sect. 3, we present our context-integrating model. The high-level architecture of the solution is presented in Sect. 4. Then a set of experiments is discussed in Sect. 5 before we state the conclusion and make proposals for future work.

2 Previous Work

2.1 Dependency Parsing

Dependency parsing extracts a syntactic dependency tree that describes binary relationships between the words of a sentence. The nodes of the tree correspond to the word forms in the sentence while the edges represent the dependency links between them in a child-parent relationship. These links are interpreted in terms of the functions that a lexical item fulfills with respect to its governor; these functions are described by labels attached to the edges. A valid dependency tree has to be an acyclic, connected graph in which every node has a single head (Nivre 2004).

Among the machine learning approaches to dependency parsing, there are two main methods: transition-based and graph-based parsing. The transition-based (shift-reduce) method constructs the tree incrementally by attaching an incoming word immediately or delaying its attachment until a better attachment point becomes available. The decision is based on an oracle that consults the history of prior attachment decisions. MaltParser is an example of this approach (Nivre et al. 2004).

Graph-based parsers start by creating a graph in which each node represents a word of the sentence (Zhang et al. 2014a). All nodes are connected to each other, and a feature vector is assigned to each edge. The cost of each edge is learned from the training data. The parser then finds the minimum spanning tree of the graph with the optimal score (Bohnet 2010). Different algorithms are available for extracting this spanning tree: Chu-Liu/Edmonds (Chu and Liu 1965) and hill-climbing (Zhang et al. 2014b). The RBG parser (Lei et al. 2015) represents the state of the art in graph-based parsing. It uses high-order features, different spanning tree decoding algorithms (Lei et al. 2014), a passive-aggressive online learning algorithm (MIRA), and parameter averaging (Crammer et al. 2006), and it outperforms other dependency parsers in quality.
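The following minimal sketch illustrates the arc-factored idea behind graph-based parsing (a toy illustration, not the RBG parser’s actual code): each candidate head-modifier arc is scored from its features, and the best-scoring spanning tree is selected; the brute-force search below stands in for Chu-Liu/Edmonds or hill-climbing decoding.

```python
from itertools import product

def is_tree(heads):
    """heads[i] = head of word i+1 (0 = the artificial root); valid iff
    every word eventually reaches the root without revisiting a node."""
    n = len(heads)
    for i in range(1, n + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:
                return False            # cycle
            seen.add(node)
            node = heads[node - 1]
    return True

def arc_score(weights, features, head, mod):
    """First-order (arc-factored) score: sum of the weights of the arc's features."""
    return sum(weights.get(f, 0.0) for f in features(head, mod))

def best_tree(n_words, weights, features):
    """Exhaustively score every head assignment and keep the best valid tree.
    Real parsers use Chu-Liu/Edmonds or hill-climbing instead of brute force."""
    best_score, best_heads = float("-inf"), None
    for heads in product(range(n_words + 1), repeat=n_words):
        if any(h == i + 1 for i, h in enumerate(heads)) or not is_tree(list(heads)):
            continue
        score = sum(arc_score(weights, features, h, i + 1) for i, h in enumerate(heads))
        if score > best_score:
            best_score, best_heads = score, list(heads)
    return best_heads, best_score

# Toy usage with hypothetical indicator features on (head index, modifier index):
feats = lambda h, m: [f"{h}->{m}"]
weights = {"0->2": 1.0, "2->1": 1.0}      # prefer root->word2 and word2->word1
print(best_tree(2, weights, feats))       # ([2, 0], 2.0)
```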

2.2 Context Representation

The idea of utilizing context information from the visual environment in dependency parsing was introduced by McCrae (2009). He injected the visual information into a constraint-based parser (Weighted Constraint Dependency Grammar, WCDG) and carried out his research on a German-language dataset. He used the Web Ontology Language (OWL) to encode high-level descriptions of the visual input. Although OWL has two main components, the T-box and the A-box, he considered only the A-box to describe the relations between entities in the visual context. In this A-box representation, four thematic roles (Agent, Theme, Instrument, and Owner) are used to express the conceptual relationships in the context.

In this paper we implement our ideas by means of the RBG parser to develop a proof-of-concept model and to demonstrate that the desired fusion of multimodal information can be achieved in a learning (graph-based) parsing model. The visual information is presented in the form of thematic roles. Our experiments compare the results of the newly implemented context-integrating parser, which combines visual and linguistic information for English sentences, against the original RBG parser as a benchmark.

3 Context-Integrating Dependency Parser

The RBG parser considers only the linguistic input during model learning. It builds high-order scoring functions and uses them in minimum spanning tree extraction. To make it sensitive to visual information, we modify the RBG parser to accept additional context information as features for the learning model. Our new version of RBG keeps the linguistic features on the edges between all pairs of words and adds newly introduced visual features between the entities (words) that stand in a relationship in the context (visual) input. As presented in Fig. 2, a visual relation has three parts: the relation type (agent, theme, etc.), the head of the relation, which is the verb (except in the “owner” relation), and the modifier of the relation.
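A minimal sketch of how such an adapted graph could carry both kinds of features (the data structure and feature templates below are our own simplifications, not the actual RBG implementation): every word pair keeps its linguistic features, and pairs that also occur in a context relation additionally receive visual features.

```python
from collections import defaultdict

def build_edge_features(words, pos_tags, context_relations):
    """Hypothetical sketch: attach linguistic features to every word pair and
    visual features only to pairs that are related in the context input.
    context_relations: list of (role, head_word, modifier_word) triples."""
    edges = defaultdict(lambda: {"linguistic": [], "visual": []})
    related = {(h, m): role for role, h, m in context_relations}

    for i, (hw, hp) in enumerate(zip(words, pos_tags)):
        for j, (mw, mp) in enumerate(zip(words, pos_tags)):
            if i == j:
                continue
            # ordinary linguistic features (toy versions of RBG-style templates)
            edges[(i, j)]["linguistic"] += [f"HP={hp}|MP={mp}", f"HW={hw}|MW={mw}"]
            # newly introduced visual features, added only if the pair is
            # related in the context description
            if (hw, mw) in related:
                edges[(i, j)]["visual"].append(f"REL={related[(hw, mw)]}|HP={hp}|MP={mp}")
    return edges

# Toy usage for a "the doctor feeds the journalist" style input:
words = ["doctor", "feeds", "journalist"]
tags = ["NN", "VBZ", "NN"]
context = [("agent", "feeds", "doctor"), ("theme", "feeds", "journalist")]
edges = build_edge_features(words, tags, context)
print(edges[(1, 0)])  # feeds -> doctor carries both linguistic and visual features
```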

Fig. 2. The context information of an image (image taken from Knoeferle 2005)

Figures 2 and 3 show two pictures and their (visual) context information. The linguistic input corresponding to these figures is:

Fig. 3. The context features of an image (image taken from Knoeferle 2005)

  • Fig. 2: “The doctor with a coat feeds at this moment the journalist with a microphone”

  • Fig. 3: “The journalist with a microphone feeds at this moment the doctor with a spoon”

These sentences are confusing for a person who hears them without the visual information. Using the visual information, however, one can tell that the “microphone” in the sentence of Fig. 2 is an object carried by the journalist, whereas the “spoon” in the sentence of Fig. 3 is the tool for feeding and not an object carried by the doctor.

In Fig. 4, we present the adapted graph representation of a sentence in the context-integrating RBG parser. It contains \( t \) words and \( \{fv_{a = 1 \ldots m}\} \) linguistic feature vectors (the original ones from the RBG parser). As presented in Eq. 1, each of these vectors has \( n \) features encoding the linguistic properties of the pair of words \( (i,j) \). In addition, there are \( p \) newly introduced context feature vectors. These vectors consist of \( q \) visual features for the word pairs that are related in the context input.

Fig. 4. Graph representation of the context-integrating dependency parsing

As shown in Eq. 3, we build the learning model using multimodal inputs. In the testing phase, the model receives the linguistic information of the sentence in addition to the context information and finds the optimal dependency tree (Eq. 4); a small scoring sketch follows the equations. In Eqs. 3 and 4, \( x_{i} \) is the input sentence, \( c_{i} \) is the context input for sentence \( i \), and \( \tilde{y} \) is the extracted dependency tree. For each sentence there is a set of possible trees \( T(x_{i}) \) and a gold-standard tree \( \hat{y}_{i} \). The feature vector of the context input is denoted \( \ddot{f}(c_{i}, y) \) with parameters \( \omega \); \( f(x_{i}, y) \) denotes the linguistic features with parameters \( \theta \).

$$ fv_{a = 1 \ldots m} = \begin{pmatrix} f_{i,j,1} \\ \vdots \\ f_{i,j,n} \end{pmatrix} $$
(1)
$$ cfv_{a = 1 \ldots p} = \begin{pmatrix} cf_{i,j,1} \\ \vdots \\ cf_{i,j,q} \end{pmatrix} $$
(2)
$$ \tilde{y} = \mathop{\mathrm{arg\,max}}\limits_{y \in T(x_{i})} \left\{ \theta \cdot f\left( x_{i}, y \right) + \omega \cdot \ddot{f}\left( c_{i}, y \right) + \left\| y - \hat{y}_{i} \right\| \right\} \quad \text{(Train)} $$
(3)
$$ \tilde{y} = \mathop{\mathrm{arg\,max}}\limits_{y \in T(x)} \left\{ \theta \cdot f\left( x, y \right) + \omega \cdot \ddot{f}\left( c, y \right) \right\} \quad \text{(Test)} $$
(4)
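The following sketch illustrates the two scoring modes of Eqs. 3 and 4 (a simplified illustration with sparse feature dictionaries and an explicitly enumerated candidate set, whereas the real decoder searches over T(x)): during training the score of a candidate tree is augmented by its Hamming distance to the gold tree, while at test time only the weighted linguistic and context features are used.

```python
def dot(weights, feats):
    """Sparse dot product over indicator features."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def hamming(tree, gold):
    """||y - y_hat||: number of words whose head differs from the gold tree."""
    return sum(1 for a, b in zip(tree, gold) if a != b)

def score(tree, ling_feats, ctx_feats, theta, omega, gold=None):
    """Eq. 3 when gold is given (loss-augmented scoring used during training),
    Eq. 4 otherwise (plain scoring at test time)."""
    s = dot(theta, ling_feats(tree)) + dot(omega, ctx_feats(tree))
    if gold is not None:
        s += hamming(tree, gold)      # cost-augmented decoding for the online update
    return s

def decode(candidates, ling_feats, ctx_feats, theta, omega, gold=None):
    """argmax over a (here: explicitly enumerated) set of candidate trees."""
    return max(candidates,
               key=lambda y: score(y, ling_feats, ctx_feats, theta, omega, gold))
```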

For example, the context feature (HPp_HP_MAGP_MAGPn) describes four different aspects of a relation, as explained in the following.

We now explain how this feature is encoded. At the beginning of the parsing process, the parser builds a dictionary of the available POS tags and assigns an ID to each of them. We use this mapping to encode the visual relation. As shown in Fig. 2, “doctor” is the “agent” of the “feeds” action. Here, “feeds” is the head of the visual relation, “agent” is the relation type, and “doctor” is the modifier. “HPp” refers to the POS of the word preceding “feeds”, “HP” is the POS of “feeds”, “MAGP” is the POS of “doctor”, and “MAGPn” is the POS of the word following “doctor”. Table 1 shows the encoding of this feature and how it is composed of the IDs of the different POS tags in the dictionary. This encoding is used as a feature ID in the parser’s learning process, and the feature is added to the visual feature vector between the two words in the adapted graph shown above.

Table 1. Coding of the example feature
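A sketch of this encoding step is given below (the dictionary contents and the exact packing of the IDs are assumptions for illustration; Table 1 holds the actual values): every POS tag is mapped to an integer ID when the parser starts, and the four IDs of the template are combined into a single feature ID.

```python
def build_pos_dictionary(tagged_sentences):
    """Assign an integer ID to every POS tag seen in the data (1-based)."""
    pos_ids = {}
    for sent in tagged_sentences:
        for _, tag in sent:
            pos_ids.setdefault(tag, len(pos_ids) + 1)
    return pos_ids

def encode_hpp_hp_magp_magpn(sent, head_idx, mod_idx, pos_ids):
    """Encode the HPp_HP_MAGP_MAGPn template for an 'agent' relation:
    POS of the word before the head, POS of the head, POS of the modifier,
    and POS of the word after the modifier."""
    def tag(i):
        return sent[i][1] if 0 <= i < len(sent) else "<PAD>"
    parts = [tag(head_idx - 1), tag(head_idx), tag(mod_idx), tag(mod_idx + 1)]
    ids = [pos_ids.get(t, 0) for t in parts]
    feature_id = 0
    for pid in ids:                            # pack the four IDs into one number
        feature_id = feature_id * 1000 + pid   # assumes fewer than 1000 distinct tags
    return feature_id, parts

# Toy usage for "The doctor ... feeds ... the journalist" (tags simplified):
sent = [("The", "DT"), ("doctor", "NN"), ("feeds", "VBZ"),
        ("the", "DT"), ("journalist", "NN")]
pos_ids = build_pos_dictionary([sent])
print(encode_hpp_hp_magp_magpn(sent, head_idx=2, mod_idx=1, pos_ids=pos_ids))
# -> (2003002003, ['NN', 'VBZ', 'NN', 'VBZ'])
```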

4 Solution Architecture

In this section, we present the high-level architecture of the context-integrating RBG parser and how we introduce the visual modality in it. As shown in Fig. 5, there are three types of components:

Fig. 5. The architecture of the context-integrating dependency parser

  • Components of the RBG parser that are kept without modification (Old).

  • Components of the RBG parser that have been changed to be compatible with multi-modal parsing (Changed).

  • Newly introduced components (New).

The Online Learner is an existing component of the RBG parser that uses the Passive-Aggressive algorithm. We modify it to take the additional features into account; accordingly, the feature list and the corresponding weights in the RBG parser are also modified. The RBG component “Decoder Algorithm”, on the other hand, is left unchanged. It is responsible for minimum spanning tree decoding and implements several alternative algorithms.
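The following is a minimal sketch of what the modified online step could look like (a simplified Passive-Aggressive update; the real learner additionally uses parameter averaging and higher-order features): both the linguistic weights θ and the newly introduced context weights ω are updated from the difference between the gold and the predicted feature vectors.

```python
def pa_update(theta, omega, ling_gold, ctx_gold, ling_pred, ctx_pred, loss, C=1.0):
    """One Passive-Aggressive step over both feature blocks.
    *_gold / *_pred are sparse feature-count dicts of the gold and the
    (loss-augmented) predicted tree; loss is the size of the margin violation."""
    # feature difference (gold minus predicted) for each block
    d_ling = {f: ling_gold.get(f, 0) - ling_pred.get(f, 0)
              for f in set(ling_gold) | set(ling_pred)}
    d_ctx = {f: ctx_gold.get(f, 0) - ctx_pred.get(f, 0)
             for f in set(ctx_gold) | set(ctx_pred)}
    sq_norm = sum(v * v for v in d_ling.values()) + sum(v * v for v in d_ctx.values())
    if sq_norm == 0 or loss <= 0:
        return                                # passive: no violation, no change
    tau = min(C, loss / sq_norm)              # aggressive: smallest correcting step
    for f, v in d_ling.items():
        theta[f] = theta.get(f, 0.0) + tau * v
    for f, v in d_ctx.items():
        omega[f] = omega.get(f, 0.0) + tau * v
```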

5 Experiments

5.1 Dataset Preparation

In our experiments, we developed two small corpora:

  1. An extended version of Baumgärtner’s dataset (Baumgärtner 2013). The original dataset comprises 24 images and 96 sentences describing these images. All sentences follow the same structure: subject, verb, and object with adverbial modifiers. We translated this dataset from German to English and extended it to 500 sentences that are equally distributed over the following groups:

    • The original dataset. Ex: “The Princess washes obviously the pirate.”

    • A group with a subject, an object, and descriptions of both. Ex: “The Princess with long hair washes obviously the pirate with a woody leg.”

    • A group with a described subject, an object, and a description of the action’s instrument. Ex: “The Princess with long hair washes obviously the pirate with a brush.”

    • A group with a described subject, a described object, and a description of the action. Ex: “The Princess with long hair washes obviously the pirate with a woody leg with a brush.”

    • Sentences with subject and object in passive form. Ex: “The pirate is washed by the Princess.”

  2. Part of the ILLIONS image corpus (Young et al. 2014) with 35 images and three corresponding sentences for each of them. This dataset was created through crowdsourcing to describe the content of the pictures. Therefore, the structure of the sentences varies, in contrast to the first dataset.

We developed a context description for each sentence in both datasets. Baumgärtner’s dataset already contained initial context descriptions to build upon, but for the ILLIONS dataset we had to start from scratch. Four thematic roles have been used: agent, theme, instrument, and owner. To prepare the training data, we used the online demo of the “Noah’s ARK” Turbo parser (Thomson et al. 2014). The output (CoNLL format) was verified manually against the “Stanford dependency manual” (de Marneffe and Manning 2015).

5.2 Experiments Results

We present a set of experiments carried out to verify the effectiveness of the context integration and its impact on dependency parsing quality. We use three metrics (a small computation sketch follows the list):

  • Unlabeled Attachment Score (UAS): the percentage of correct lexical-parent attachments in the test data.

  • Labeled Attachment Score (LAS): the percentage of correct and correctly labeled lexical-parent attachments in the test data.

  • Complete Attachment Score (CAS): the percentage of sentences that have been completely and correctly analyzed.
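The three scores can be computed as in the following sketch (our own helper, assuming gold and predicted analyses are given as per-sentence lists of (head, label) pairs; for CAS we count a sentence as correct only if all heads and labels match):

```python
def evaluate(gold_sents, pred_sents):
    """gold_sents / pred_sents: lists of sentences; each sentence is a list
    of (head, label) pairs, one per word."""
    words = correct_heads = correct_labeled = complete = 0
    for gold, pred in zip(gold_sents, pred_sents):
        sent_ok = True
        for (gh, gl), (ph, pl) in zip(gold, pred):
            words += 1
            if gh == ph:
                correct_heads += 1                 # counts towards UAS
                if gl == pl:
                    correct_labeled += 1           # counts towards LAS
                else:
                    sent_ok = False
            else:
                sent_ok = False
        complete += sent_ok                        # whole sentence correct (CAS)
    return {"UAS": correct_heads / words,
            "LAS": correct_labeled / words,
            "CAS": complete / len(gold_sents)}
```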

Experiment 1.

Here we use the first dataset mentioned above. The training set contains 440 sentences and the test set 60 sentences. We implement two different degrees of influence for the context features:

  • “Strong Context” condition: The context features are treated with extra confidence during learning (3 times the weight of a normal linguistic feature); see the sketch after this list.

  • “Normal Context” condition: The features added from the context have the same influence on the final decision as the linguistic features.
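One way to realize the two conditions is sketched below (a hypothetical illustration; the exact mechanism inside the parser may differ): the contribution of every context feature is scaled by a constant factor, 3 for the “Strong Context” condition and 1 for the “Normal Context” condition.

```python
CONTEXT_BOOST = 3.0   # "Strong Context"; use 1.0 for the "Normal Context" condition

def combined_score(theta, omega, ling_feats, ctx_feats, boost=CONTEXT_BOOST):
    """Score one candidate tree from its sparse linguistic and context features,
    letting each context feature count `boost` times as much as a linguistic one."""
    ling = sum(theta.get(f, 0.0) * v for f, v in ling_feats.items())
    ctx = sum(omega.get(f, 0.0) * v for f, v in ctx_feats.items())
    return ling + boost * ctx
```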

As presented in Fig. 6, integrating the context information into the learning process slightly improves the UAS and LAS scores. This (relatively small) improvement of the attachments across all words in the test data has, however, a large impact on the CAS score, which illustrates the benefit of using multimodal information as input for the graph-based dependency parser: it helps to disambiguate more attachments correctly and improves the overall CAS score by 18%.

Fig. 6. Context information impact (Experiment 1)

Experiment 2.

In this experiment, we used the second dataset mentioned in the dataset section. Due to the small size of this dataset and the online learning process of RBG using the Passive-Aggressive (PA) algorithm, we use the whole dataset both as training and as testing data. In the PA algorithm, the feature weights are updated according to the currently processed input sentence, neglecting the influence of this update on previously handled cases. Therefore, training and testing on the same dataset still allows us to check the effect of the context information. The context-integrating parser improves the overall CAS score by 8% (Fig. 7).

Fig. 7. Context information impact (Experiment 2)

Experiment 3.

We trained the parser on the first dataset while testing on the second one. As shown in Fig. 8, there is no noticeable improvement in this case. Due to the completely different sentence structures and lexicons of the training and test data, the context information did not noticeably help towards better disambiguation here.

Fig. 8. Context information impact (Experiment 3)

6 Conclusion

In this paper, we present a context-integrating dependency parser by providing a new version of a graph-based dependency parser (namely RBG) that accepts an additional kind of feature. These features represent the visual context information of a sentence in the form of thematic roles. The experiments show an improvement in parsing quality of between 8% and 18% across the different experiments. We have shown the effectiveness of the idea on small datasets.

7 Future Work

While this paper is a first step on our context-integrating parsing roadmap, we have identified some limitations of this work that should be tackled in future work. Currently, the system cannot deal with a possible mismatch between the lexicons of the linguistic input and the context descriptions. So far we require them to be identical, which is not a realistic assumption for richer linguistic stimuli.

In this research, we improved parsing quality using the cognitive influence of visual context information. In the future, we will work on enriching the context representations based on the linguistic input, and we also need to apply the approach to larger datasets. Additionally, we will study the behavior of the system in situations where the context relationships contradict the linguistic content.