Keywords

1 Introduction

Imagine a household robot that helps you in the kitchen. You might want the robot to pass you the salt and instruct it as follows: “Could you pass me the salt? It is to the left of the stove”. Here, the salt is the located object (LO), because it should be located relative to the reference object (RO, the stove). To find the salt, the robot should interpret this sentence the way you intended it. In the interaction with artificial systems, humans often instruct artificial systems to interact with objects in their environment. To this end, artificial systems must interpret spatial language, i.e., language that describes the locations of the objects of interest. To make the interaction as natural as possible, artificial systems should understand spatial language the way humans do. The implementation of psychologically validated computational models of spatial language into artificial systems might thus prove useful. With these kind of models, artificial systems could begin to interpret and generate human-like spatial language.

[30] were the first to outline a computational framework of the processes that are assumed to take place when humans understand spatial language. Their framework consists of “four different kinds of processes: spatial indexing, reference frame adjustment, spatial template alignment, and computing goodness of fit” [30, p. 500].

Spatial indexing is required to bind the perceptual representations of the RO and the LO to their corresponding conceptual representations. According to [30, p. 499], “the viewer’s attention should move from the reference object to the located object”. Reference frame adjustment consists of imposing a reference frame on the RO and setting its parameters (origin, orientation, direction, scale). “The reference frame is a three-dimensional coordinate system [...]” [30, p. 499]. Spatial template alignment is the process of imposing a spatial template on the RO that is aligned with the reference frame. A spatial template consists of regions of acceptability of a spatial relation. Every spatial relation is assumed to have its own spatial template. Finally, computing goodness of fit is the evaluation of the location of the LO in the aligned spatial template.

Trying to identify possible nonlinguistic mechanisms that underlie the rating of spatial prepositions, [35] developed a cognitive model: the Attentional Vector Sum (AVS) model.Footnote 1 This model – based on the assumption that goodness-of-fit ratings for spatial prepositions against depicted objects reflect language processing – accounts for a range of empirical findings in spatial language processing. A central mechanism in the AVS model concerns the role of attention for the understanding of spatial relations.

Direction of the Attentional Shift. Previous research has shown that visual attention is needed to process spatial relations ([28,29,30]; see [7] for a review). The AVS model has formalized the role of visual attention. Although [35] do not explicitly talk about attentional shifts, the AVS model can be interpreted as assuming a shift of attention from the RO to the LO. [35] motivate the implementation of attention based on studies conducted by [28] and [29, p. 115]: “The linguistic distinction between located and reference objects specifies a direction for attention to move – from the reference object to the located object.” (See also [30, p. 499]: “the viewer’s attention should move from the reference object to the located object”.) But are humans actually shifting their attention in this direction while they are understanding a spatial preposition?

Evidence for shifts of covert attention comes from studies in the field of cognitive neuroscience by Franconeri and colleagues [15, 38]. Using EEG, [15] showed that humans shift their covert attention when they process spatial relations. In their first experiment, they presented four objects of which two had the same shape but different colors. Two objects were placed to the right and two objects were placed to the left of a fixation cross such that two different shapes appeared on each side of the cross. Participants had to fixate the fixation cross and judge whether, say, the orange circle was left or right of the cyan circle. After the stimulus display was shown, participants chose one spatial relation out of two possible arrangements on a response screen (cyan circle left of orange circle or orange circle left of cyan circle). During the experiment, event-related potentials were recorded. All experiments reported in [15] revealed that participants shifted their attention from one object to the other object, although they had been instructed to attend to both objects simultaneously. However, the role of the direction of these shifts remained unclear in [15].

In another experiment, [38] presented questions like “Is red left of green?” to participants. Subsequently, either a red or a green object appeared on the screen, followed shortly afterwards (0–233 ms) by a green or a red object respectively. By manipulating the presentation order of the objects, a shift of attention was cued. Participants were faster to verify the question if the presentation order was the same as the order in the question. [38] interpreted this as evidence that the perceptual representation of a spatial relation follows its linguistic representation.

Evidence that a shift of attention from the RO to the LO as suggested in the AVS model is not necessary for understanding spatial language has been recently reported by [3], who conducted a visual world study. Here, participants inspected a display and listened to spoken utterances while their eye movements were recorded. Note that [3] investigated overt attention ([15, 38] studied covert attention). [3] presented sentences with two German spatial prepositions (über [above] and unter [below]) across four different tasks. The RO and the LO of the sentence as well as a competitor object (not mentioned in the sentence) were presented on a computer screen. In their first experiment, participants verified the spatial sentence as quickly as possible, even before the sentence ended. In their second experiment, participants also verified the sentence, but they had to wait until the sentence was over. The third experiment consisted of a passive listening task, i.e., no response was required from the participants. Finally, in the fourth experiment, a gaze-contingent trigger was used: the competitor object and either the LO or the RO were removed from the display after participants had inspected the LO at least once.

The results from this study revealed that participants shifted their overt attention from the RO to the LO, as predicted by the AVS model. However, the task modulated the presence of these shifts. These shifts were only frequent in the post-sentence verification experiment (experiment 2), but infrequent in the other experiments. Crucially, if participants did not shift their attention from the RO to the LO, they performed equally well (as accuracy was not affected) – i.e., they were able to understand the sentence without shifting their attention overtly from the RO to the LO.

By contrast, participants frequently shifted gaze overtly from the LO towards the RO (in line with the incremental interpretation of the spoken sentence). This suggested that people may be able to apprehend a spatial relation with an overt attentional shift from the LO to the RO (and not from the RO to the LO as suggested by the AVS model).

Thus, the direction of the attentional shift as implemented in the AVS model conflicts with recent empirical findings. We propose a modified version of the AVS model: the reversed AVS (rAVS) model, for which the attentional shift has been reversed. Instead of a shift from the RO to the LO, we implemented a shift from the LO to the RO. We designed the rAVS model otherwise to be as similar as possible to the AVS model. By doing so, we can isolate the influence of the reversed shift on the performance of the two models.

2 The Models

In this section, we first describe the AVS model, since the proposed rAVS model is based on the structure of the AVS model and modifies some parts of it. Next, we introduce the rAVS model.

2.1 The AVS Model

[35] proposed a cognitive model of spatial term comprehension: the Attentional Vector Sum (AVS) model. The AVS model takes the 2D-location and the 2D-shape of a RO, the 2D-location of a LO, and a spatial preposition as input and computes an acceptability rating (i.e., how well the preposition describes the location of the LO relative to the RO). In the following, we are presenting how the AVS model processes the spatial relation between the RO and the LO and how it computes the acceptability rating. The AVS model consists of two components: The angular component and the height component. Figures 1a–c depict the angular component which we describe first. Figure 1d visualizes the height component that we describe thereafter.

Angular Component. First, the AVS model defines the focus F of a distribution of visual attention as the point on top of the RO “that is vertically aligned with the trajector [LO] or closest to being so aligned”Footnote 2 [35, p. 277]. Next, the model defines the distribution of attention on every point i of the RO as follows (see Fig. 1a for visualization):

$$\begin{aligned} a_i = \exp \left( \frac{-d_i}{\lambda \cdot \sigma }\right) \end{aligned}$$
(1)

Here, \(d_i\) is the euclidean distance between point i of the RO and the attentional focus F, \(\sigma \) is the euclidean distance between the attentional focus F and the LO, and \(\lambda \) is a free parameter. The resulting distribution of attention is highest at the focal point F and declines exponentially with greater distance from F (see Fig. 1a). Furthermore, the distance \(\sigma \) of the LO to the RO as well as the free parameter \(\lambda \) affect the width of the attentional distribution: A close LO results in a more focused attentional distribution (a large decline of attention from point F) whereas a distant LO results in a more broad attentional distribution (a small decline of attention from point F).

In the next step, vectors \(v_i\) are rooted at every point i of the RO. All vectors are pointing to the LO and are weighted with the amount of attention \(a_i\) that was previously defined (see Fig. 1b). All these vectors are summed up to obtain a final vector:

(2)
Fig. 1.
figure 1

Schematized steps of the AVS model developed by [35]. (Color figure online)

The deviation \(\delta \) of this final vector to canonical upright (in the case of above) is measured (see Fig. 1c) and used to obtain a rating with the help of the linear function g(\(\delta \)) that maps high deviations to low ratings and low deviations to high ratings:

$$\begin{aligned} \mathrm {g}\left( \delta \right) = slope \cdot \delta + intercept \end{aligned}$$
(3)

Both, slope and intercept, are free parameters and \(\delta \) is the angle between the sum of the vectors and canonical upright (in the case of above):

(4)

Height Component. g(\(\delta \)) is the last step of the angular component. This value is then multiplied with the height component. The height component modulates the final outcome with respect to the elevation of the LO relative to the top of the RO: A height component of 0 results in a low rating, whereas a height component of 1 does not change the output of the angular component. The height component is defined as follows:

$$\begin{aligned} \mathrm {height}(y_{LO}) = \frac{\mathrm {sig}(y_{LO} - hightop, highgain) + \mathrm {sig}(y_{LO} - lowtop, 1)}{2} \end{aligned}$$
(5)

Here, highgain is a free parameter, hightop (or lowtop) is the y-coordinate of the highest (or lowest) point on top of the RO, and the sig(\(\cdot , \cdot \)) function is defined as:

$$\begin{aligned} \text {sig}(x, gain) = \frac{1}{1 +\, \text {exp}\, (gain \cdot (-x))} \end{aligned}$$
(6)

The AVS model has four free parameters in total: \(\lambda , slope, intercept, highgain\). Taken together, the final acceptability rating is computed by the AVS model with the following formula:

$$\begin{aligned} \mathrm {above}(LO, RO) = \mathrm {g}\left( \delta \right) \cdot \mathrm {height}(y_{LO}) \end{aligned}$$
(7)

2.2 The rAVS Model

Although [35] do not explicitly mention shifts of attention, the AVS model can be interpreted as assuming a shift of attention from the RO to the LO: This shift is implemented by the location of the attentional focus and in particular by the direction of the vectors (see Figs. 1a–c). As discussed before, this direction of the attentional shift conflicts with recent empirical findings [3, 15, 38]. This is why our modified version of the AVS model, the reversed AVS (rAVS) model, implements a shift from the LO to the RO.

To this end, the rAVS model reverses the direction of the vectors in the vector sum in the following way: Instead of pointing from every point in the RO to the LO, the vectors are pointing from every point in the LO to the RO. Since the LO is simplified as a single point in the AVS model, the vector sum in the rAVS model consists of only one vector. The end point of this vector, however, must be defined, since the RO has a mass.

In the rAVS model, the vector end point D lies on the line between the center-of-mass C of the RO and the proximal point F (see Fig. 2a). Here, F is the same point as the attentional focus in the AVS model. Depending on the relative distance of the LO, the vector end point D is closer to C (for distant LOs) or closer to F (for close LOs). Thus, the center-of-mass orientation is more important for distant LOs, whereas the proximal orientation becomes important for close LOs, which corresponds to the rating pattern found by [35, experiment 7]. The width of the attentional distribution in the AVS model has a similar effect.

Fig. 2.
figure 2

Schematized steps of the rAVS model.

In the rAVS model, the distance of a LO is considered in relative terms, i.e., the width and height of the RO change the relative distance of a LO, even if the absolute distance remains the same (see Fig. 2b). The relative distance is computed as follows:

$$\begin{aligned} d_{rel.}(LO, RO) = \frac{|LO,P|_x}{RO_{width}} + \frac{|LO,P|_y}{RO_{height}} \end{aligned}$$
(8)

Here, P is the proximal point in the intuitive sense: The point on the RO that has the smallest absolute distance to the LO. F is guaranteed to lie on top of the RO, whereas P can also be at the left, right, or bottom of the RO. If P is on top of the RO, P equals F.

Furthermore, the computation of the vector end point D is guided with an additional free parameter \(\alpha \) (with \(\alpha \ge 0\)). The new parameter \(\alpha \) and the relative distance interact within the following linear function to obtain the new vector destination D:

(9)

The direction of the vector is finally compared to canonical downwards instead of canonical upright (in the case of above, see Fig. 2c) – similar to the angular component of the AVS model:

(10)

As in the AVS model, this angular deviation is then used as input for the linear function \(g(\delta )\) (see Eq. 3) to obtain a value for the angular component. Note that a comparison to downwards is modeled, although the preposition is above. [38, p. 7] also mention this “counterintuitive, but certainly not computationally difficult” flip of the reference direction in their account.

In the rAVS model, the attentional focus lies on the LO. In fact, however, the location of the attentional focus as well as the attentional distribution do not matter for the rAVS model, because its weighted vector sum consists of only one single vector (due to the simplified LO). Since the length of the vector sum is not considered in the computation of the angle (neither in the AVSFootnote 3 nor in the rAVS model), the amount of attention at the vector root is not of any importance for the final rating (as long as it is greater than zero, see Fig. 2d).Footnote 4

The height component of the AVS model is not changed in the rAVS model. So, it still takes the y-value of the LO as input and computes the height according to the grazing line of the RO (see Eq. 5). As in the AVS model the final rating is obtained by multiplying the height component with the angular component (see Eq. 7).

3 Model Comparison

In the previous section, we have presented the AVS model by [35] and proposed the rAVS model, since the AVS model conflicts with recent empirical findings regarding the direction of the attentional shift [3, 15, 38]. But how does the rAVS model perform in comparison to the AVS model?

[35] conducted seven acceptability rating experiments and showed that the AVS model was able to account for all empirical data from these experiments. These data consist of acceptability ratings for \(n = 337\) locations of the LO above 10 different types of ROs. We evaluated the rAVS model on the same data setFootnote 5 to assess its performance using three different model simulation methods: Goodness-Of-Fit (GOF, Sect. 3.1), Simple Hold-Out (SHO, Sect. 3.2), and Model Flexibility Analysis (MFA, Sect. 3.3). We introduce each of these simulation methods before we present its results.

Both models and all simulation methods were implemented in C++ with the help of the Computational Geometry Algorithms Library [11]. The C++ source code is available under an open source license from [21]. For all simulations, we constrained the range of the model parameters in the following way:

$$\begin{aligned} \frac{-1}{45}&\le slope \le 0\end{aligned}$$
(11)
$$\begin{aligned} 0.7&\le intercept \le 1.3 \end{aligned}$$
(12)
$$\begin{aligned} 0&\le highgain \le 10 \end{aligned}$$
(13)
$$\begin{aligned} 0&< \lambda \le 5 \end{aligned}$$
(14)
$$\begin{aligned} 0&< \alpha \le 5 \end{aligned}$$
(15)

3.1 Goodness-Of-Fit (GOF)

Method. The Goodness-Of-Fit (GOF) measures how well a model fits given data. We fitted both models to the n rating data points from [35] by minimizing the normalized Root Mean Square Error (nRMSE):

$$\begin{aligned} nRMSE = \frac{\sqrt{\frac{1}{n} \sum _i^n (data_i - modelOutput_i)^2}}{rating_{max} - rating_{min}} \end{aligned}$$
(16)

To this end, we used a method known as simulated annealing, a variant of the Metropolis algorithm [31]. This method estimates the free parameters of the model in order to minimize the nRMSE and has the advantage to not get stuck in local minima. The nRMSE gives us a Goodness-Of-Fit (GOF) value. In contrast to the non-normalized RMSE, the normalized RMSE can be compared throughout rating experiments with different rating scales, because it always has a range from 0 to 1: An nRMSE of 0.0 means best performance (the model is able to exactly reproduce the empirical data), an nRMSE of 1.0 means worst performance (model output and data are maximally different).

Results. Figure 3 shows the GOF results for fitting both models to all data from [35]. The model parameters for the plotted GOFs can be found in Table 1. First of all, both models are able to account for the data very closely as is evident from the overall low nRMSE (<0.08 for both models). The rAVS model has a slightly worse GOF value than the AVS model but the difference to the GOF value of the AVS model is very low (difference < 0.005) which renders this difference inconclusive. Also, the GOF values change slightly with each new estimation due to the random nature of the parameter estimation method. The most important conclusion one can draw from the GOF values is whether the models are able to fit the data at all and this is the case for the models and the data under consideration. Assessing the relative performance of more than one model solely with their GOF, however, should be done very carefully.

[37] provide a thorough discussion of the theoretical problems of using GOF as the only measure of model performance. Related to the problems discussed by [37], [34] focus on the specific problem of overfitting: A more flexible model might obtain a better GOF just because it is more flexible. Model flexibility Footnote 6 is the ability of a model to produce arbitrary output: The more different the possible output of a model, the more flexible the model. A more flexible model might fit the noise in the given data better than a less flexible model. Although this results in a better GOF, it does not add anything to the explanatory power of the model in question. We tried to overcome the problem of overfitting by applying the Simple Hold-Out (SHO) method as outlined in the next section.

3.2 Simple Hold-Out (SHO)

Method. To control for the problem of overfitting, we applied a cross-validation method that takes model flexibility into account: the Simple Hold-Out (SHO) method described in [40]. [40] showed that this method performs very well in comparison to other model comparison methods. In the SHO method, the data set is split into a training and a test set. Model parameters are estimated on the training set and used to compute an nRMSE on the test set. This nRMSE is also called prediction error, because it is the error the model makes for predicting “unseen” data (the test set).Footnote 7 This procedure is repeated several times with different, random splits of the data. The median of the prediction errors is the final outcome of the SHO method.

The results presented here were computed with 101 iterations of the SHO method. In each iteration 70% of the data was used as training data and 30 % was used as test data. Moreover, we computed 95% confidence intervals of the SHO median by using 100,000 bootstrap samples with the help of the boot package for R [6].

Results. The SHO results are plotted next to the GOF results in Fig. 3. These results reveal that the slightly better GOF of the AVS model might be the result of a light overfitting, because both models obtain similar SHO values with overlapping confidence intervals for the SHO values. That is, both models perform equally well and cannot be distinguished on these data using the SHO method.

Fig. 3.
figure 3

GOF and SHO results for the AVS and the rAVS model for fitting all data from [35]. Error bars show 95% confidence intervals computed with 100,000 bootstrap samples.

Table 1. Values of the model parameters to achieve the GOFs shown in Fig. 3. The \(\lambda \) parameter of the rAVS model does not change the output of the rAVS model, see Fig. 2d.

3.3 Model Flexibility Analysis (MFA)

Method. The Model. Flexibility Analysis (MFA) proposed by [41] tries to account for another problem of using GOFs as a criterion for model evaluation stated by [37]: A model that achieves a good fit to empirical data might also fit other (possibly non-empirical) data very well. Put differently, the fact that a model fits data well does not exclude the possibility that the model might predict outcomes that humans would never generate. If the model also fits non-empirical dataFootnote 8, its good fit to empirical data becomes less impressive.

Note that this problem is different from the problem for which we applied the SHO method. The SHO method accounts for overfitting, i.e., if a model fits too much noise, it will obtain a good GOF, but a worse SHO result (because predicting unseen data does not benefit from a closer fit to noisy data). Although the SHO method operates with predicting unseen data, it cannot be used for claims about all possible model predictions. This is because the SHO method uses (random) subsets of the same data set. All these subsets are in the same region of the space of possible data, namely, the region of empirically observed data. That is, a model can obtain good SHO values although it has a high flexibility (i.e., although it can generate a great range of different, possibly non-empirical data).

The MFA, however, was explicitly designed to quantify the size of the space of all possible outcomes a model can generate – regardless of whether these outcomes were empirically observed or not. This information can then be used as a measure to “know how impressed to be that theory and observation are consistent” [37, p. 359]. A good fit of a model is more impressive if the model generates (almost) only empirically observed data. A good fit is less impressive if the model also generates a great range of non-empirical data.

Given input to a model (in our case ROs and LOs), the MFA computes all possible model outputs (in our case acceptability ratings) by enumerating the whole space of the free parameters of the model. The outcome of the MFA is the proportion \(\phi \) of these outputs to the size of the data space, where the data space contains all hypothetical possible data:

$$\begin{aligned} \phi = \frac{\text {number of different model outputs}}{\text {number of all possible data points}} \end{aligned}$$
(17)

If \(\phi \) is low, the model is only able to compute (and thus fit) a small proportion of theoretically possible data. The lower \(\phi \), the more strongly the model constrains its possible outcomes, i.e., the less flexible is the model. If \(\phi \) is high, the model can compute a great range of possible data. With a high \(\phi \) the model only weakly constrains its output, i.e., the model is highly flexible. As an example consider one RO with two LOs and a rating scale from 0 to 9. The space of theoretically possible data is then two-dimensional (two ratings) and each dimension ranges from 0 to 9. Thus, there are \(10 \cdot 10 = 100\) theoretically possible data points, if we only consider integer ratings. If a model is able to compute 20 of these rating patterns, it will get the proportion \(\phi = \frac{20}{100} = 0.2\), a model that can compute 80 rating patterns results in \(\phi = 0.8\).

We computed the MFA proportions on the stimuli that [35] used. The dimension of the data space and each model output is 337, because [35] used 337 locations of the LO in total. We split the range of each of the four free parameters of both models into 50 intervals and computed the model output for each of these \(50^4\) parameter sets (with a rating scale from 0 to 1). This gave us all outputs the models can generate with these parameter sets. Then, we divided the number of different model outputs by the number of all possible data points (see Eq. 17). To determine whether two model outputs are equal, the MFA uses a grid over the data space. If two model outputs are in the same cell of the grid, they are considered equal. [41] suggest to split each dimension of the data space into \(\root n \of {j^k}\) cells, where n is the number of dimensions of the data space (in our case 337), k is the number of free parameters (in our case 4) and j is the number of intervals for each parameter (in our case 50). Accordingly, for our simulations each dimension of the data space should be split into 1.047528 cells. Since we used a rating scale from 0 to 1 for the computation of the MFA, this means that every rating between 0 and 0.954628 is mapped to the first cell and every rating between 0.954628 and 1 is mapped to the second cell. That is, almost all ratings are considered to be equal. This might not be be meaningful in our context, thus, we also considered a different splitting of the data space: Since [35] used a rating scale from 0 to 9 resulting in 10 possible, distinct ratings for each LO, we also computed the MFA with 10 cells for each dimension of the data space without changing the number of the model predictions.Footnote 9

Results. The results of the MFA are shown in Table 2. The worst possible \(\phi \) value is 1.0 (a model that generates all possible data), the lowest possible \(\phi \) value is 0.0 (a model that generates no data). Since all \(\phi \) values in Table 2 are relatively low, both models have a relatively low flexibility. However, regardless of the number of cells in the potential data space, the AVS model obtains higher \(\phi \) values than the rAVS model. Thus, the AVS model is more flexible than the rAVS model which makes the good performance of the AVS model less impressive than the good performance of the rAVS model.

Table 2. Results of the Model Flexibility Analysis (MFA). The lower the \(\phi \) value, the less flexible the model.

3.4 Discussion

With the GOF and SHO simulations, we showed that both models are able to account equally well for the data from [35] and that this good performance is not the result of overfitting. Considering only these results, both attentional shifts are equally well supported. However, the AVS model also showed a greater flexibility in the MFA: the AVS model computes a greater range of possible outputs than the rAVS model. The lower flexibility of the rAVS model thus makes the good performance of the rAVS model more impressive than the good performance of the AVS model. This is because the rAVS model achieves the same performance while it also more strongly constrains the space of the model output. The rAVS model thus can be more easily falsified than the AVS model.

We asked where the greater flexibility of the AVS model originates. Computationally, the biggest difference of the two models is the computation of the final vector: The AVS model uses a weighted vector sum whereas the rAVS model only uses a linear function.Footnote 10 Thus, the vector sum seems to be the main source of the greater flexibility of the AVS model. Our results suggest that the vector sum is more flexible than is needed for the empirical data. This does not necessarily mean that the shift from the RO to the LO as assumed in the AVS model is less supported than the reversed shift. Rather, the way the shift is implemented in the AVS model might be more complex than needed.Footnote 11 However, [35] motivated the implementation of the vector sum as a possible non-linguistic mechanism underlying the linguistic process. More specifically, they used the vector sum, because it seems to be a widely used representation of direction in the brain (see discussion below). Accordingly, the vector sum is a crucial part of the model. Our results show that the AVS model has an overall low flexibility but they also show that the weighted vector sum is more flexible than the linear function used by the rAVS model. However, we did not present a model that performs better, just a model that performs equally well – with a lower flexibility. Moreover, the rAVS model does not offer a competing explanation of how the vector is computed (mainly because its motivation was different). At the neuronal level, this computation could still be done with a vector sum but the rAVS model does not provide details about the neuronal level.

Although our model simulations do not result in the support of one of the two shifts in question, they raise the question to which degree the attentional shift from the RO to the LO as proposed by [29] and [30] is the only shift that is implicated in the processing of spatial relations. The results from [3] suggest that humans perform both shifts, but that an overt shift from the LO to the RO alone (as in the rAVS model) can be enough to apprehend the spatial relation between the objects. The shift back (from the RO to the LO) could be a way to double-check the goodness-of-fit of the spatial preposition. Our results support this by showing that the rAVS model – that assumes only the shift from the LO to the RO – can account for the data from [35].

4 Conclusion

We proposed a new cognitive model for spatial language understanding: the rAVS model. This model is based on the AVS model by [35] but integrates recent psycholinguistic and neuroscientific findings [3, 15, 38] that conflict with the assumption of the direction of the attentional shift in the AVS model. In the AVS model, attention shifts from the RO to the LO; in the rAVS model, attention shifts from the LO to the RO. We assessed both models using the data from [35] and found that both models perform equally well, while the rAVS model is less flexible than the AVS model. Accordingly, our model simulations favor the rAVS model. Since the lower flexibility of the rAVS model originates from the lack of using a vector sum, however, the advantage of the rAVS model does not result in the favor of any of the two directionalities of the attentional shift. However, we showed that both directionalities can account for the empirical data.

Theoretical Contribution. [35] developed the AVS model with the goal to identify possible nonlinguistic mechanisms that underlie spatial term rating. To this end, they implemented two independent observations in the AVS model: First, the importance of attention to understand spatial relations and second, the neuronal representation of a motor movement as a vector sum. So, the main goal of the AVS model was not to examine the direction of the shift of attention but rather to describe linguistic processes with nonlinguistic mechanisms.

Although the focus of the AVS model was not on the direction of the attentional shift, the model implies a shift from the RO to the LO. [35] motivated the use of a vector sum because it seems to be a widely used representation of direction in the brain. [17] found that the direction of an arm movement of a rhesus monkey can be predicted by a vector sum of orientation tuned neurons. [27] provide evidence for a similar representation for saccadic eye movements. Eye movements (overt attention) are motor movements that are closely connected to covert visual attention: “Many studies have investigated the interaction of overt and covert attention, and the order in which they are deployed. The consensus is that covert attention precedes eye movements [...].” [10, p. 1487] Although the authors of the AVS model do not explicitly speak about which movement the vector sum in their model represents nor clearly specify the kind of attention in the model, it seems reasonable to interpret the direction of the vector sum in the AVS model as the direction of a shift of attention that goes from the RO to the LO.

Our aim was to implement the most recent findings of attentional mechanisms into the AVS model. To this end, we designed the rAVS model as similar as possible to the AVS model. So, the rAVS model follows the same basic concepts while it integrates the most recent findings. We do not claim that the nonlinguistic mechanisms proposed in the AVS model do not happen – rather, we propose an alternative way of how they might take place. That is, on the neuronal level, the orientation of the vector in the rAVS model that points from the LO to the RO could still be computed by a weighted population of neurons (similar to the attentional vector sum) but the rAVS model does not provide such details. Its focus lies on a more abstract level that concerns the direction of the attentional shift and not the detailed computation of this shift. Keeping the same basic concepts as the AVS model, the rAVS model accounts for the same data equally well – and also for the recent empirical findings regarding the direction of the attentional shift.

4.1 Future Work

Modeling Both Shifts. The success of both the rAVS model and the AVS model support the existence of both directionalities of the attentional shift. It might well be that people shift their attention in both directions during the processing of spatial relations – depending on the task and the linguistic input. Accordingly, a model that implements both attentional shifts might fit more data than the AVS or the rAVS model alone.Footnote 12

It might be interesting to investigate this possibility by creating a model that allows both shifts of attention. Such a model should be applicable to more types of experimental data than the AVS model and the rAVS model (which both can only account for acceptability rating data). In particular, the model with both shifts should also specify when in time what type of attentional shift occurs and how long the computation takes. This model could then be fitted to a greater range of data, like real-time eye movement data from visual world studies (e.g., [3]) or reaction time data (e.g., [38]). Modeling different tasks would give more insight into the role of the attentional shift.

Modeling the LO. The main reason for the lower flexibility as well as the lower computational complexity of the rAVS model is the simplification of the LO as a single point. There is evidence, however, that geometric features of the LO also affect acceptability ratings [1, 2, 4]. A comprehensive model of spatial language thus should also model the LO in more detail. Accordingly, we are planning to extend the representation of the LO in the rAVS model by giving a mass to it. This would give us the opportunity to see first how the rAVS model deals with the situation where the computation of a vector sum is necessary to determine the angular deviation. Second, an extended LO might affect the role of the attentional distribution in the rAVS model.

We are also interested in modifying the use of the height component in the rAVS model. At the moment, the rAVS model applies the same computation as the AVS model for the height component: the y-coordinate of the LO is compared relative to the top of the RO (see Fig. 1d). In the rAVS model, the attentional focus is located on the LO. So, it would be more consistent if the location of the LO were taken as the baseline for the comparison with the location of the RO. Thus, we want to reverse the computation of the height component such that the grazing line lies on the bottom of the LO.

Model Distinction. To tease apart the two models and evaluate the accuracy of their predictions, we are currently analyzing the models with two more model simulation methods: the landscaping analysis proposed by [32, 42] and an algorithm called Parameter Space Partitioning (PSP) proposed by [20, 33].

Landscaping provides an overview how data and models behave to each other and how informative a specific data set is in distinguishing two models. The main idea of landscaping is the following: Given model input (i.e., ROs and LOs in our case), each model is used to generate sets of artificial data (i.e., ratings in our case) and then both models fit these data. Landscaping provides a measure of what is called model mimicry by [42]: The ability of a model to account for data generated by another model. Each model should fit the self-generated data quite well – without added noise this fit should be almost perfect. If, however, one model is also able to closely fit the data generated by another model, this model mimics the other model, i.e., this model is able to behave like the other model. We are currently applying the landscaping method on the stimuli from [35]. On a different set of stimuli, a landscaping analysis confirmed the greater flexibility of the AVS model (i.e., the AVS model mimics the rAVS model but not vice versa, see [24, 25]).

The PSP algorithm is a Markov chain Monte Carlo (MCMC) based method and searches in the parameter space of the models for regions of patterns that are qualitatively different. First results confirm the high flexibility of the AVS model (i.e., the AVS model is able to generate many patterns that are qualitatively different by using different sets of parameters). The rAVS model, however, generates fewer patterns with a qualitative difference (see [22] for details). To test the predictions revealed by the PSP analysis we conducted an empirical rating study with the same stimuli. More on the results of this study can be found in [24, 25].

Functionality. The AVS model does not account for any effects of the functionality of objects on spatial language comprehension, although there is evidence that – beside purely geometric effects – functional interactions between objects also affect the use of spatial prepositions [8, 9, 12,13,14, 18].

For instance, [9] conducted an object placement task, where participants had to place a toothpaste tube above a toothbrush. They showed that the toothpaste tube was not placed above the center-of-mass of the toothbrush, but rather above the bristles of the toothbrush – that is, at the location where both objects can functionally interact. Objects with a smaller amount of functional interaction (here, a tube of oil paint) were placed more above the center-of-mass of the toothbrush instead of above the bristles.

Despite this evidence, the AVS model (and thus also our rAVS model) only considers geometric representations of the RO and the LO. For the AVS model, however, a range of extensions that integrate functionality were already proposed [8, 26]. Since the rAVS model is designed to be as similar as possible to the AVS model, these functional extensions might also be applicable for the rAVS model.

Implementing the Models in Artificial Systems. In order to implement these models into artificial systems, additional steps are necessary. The models were designed to model spatial language understanding. The models thus produce an acceptability rating given a RO, a LO, and a preposition. As part of an artificial system that interprets spatial language, the models can be used straightforwardly: Given a spatial utterance and a visual scene, the models can be used to compute acceptability ratings for all points around the RO (i.e., a spatial template). The artificial system then starts the search for the LO at the point with the highest rating. To generate spatial language with the help of these models, one could imagine the following steps: Compute the acceptability ratings of different spatial prepositions (e.g., above, below, to the left of, in front of, ...) and subsequently pick the one with the highest rating.

Simulating the models showed that the computation of the attentional vector sum is computationally more expensive than the linear function that is used by the rAVS model. Thus, the rAVS model provides a shortcut for the computation of the final vector. In particular, this is interesting for the implementation of the model in real-time robotic systems that often have constrained computational resources.

In conclusion, we proposed a modified version of the AVS model: the rAVS model. The rAVS model accounts for the same empirical data as the AVS model while integrating additional recent findings regarding the direction of the attentional shift that conflict with the assumptions of the AVS model.