
1 Introduction

Home service robots must be able to understand and carry out tasks such as tidying up or fetching objects through communication with people. People use various commands, such as “please fetch the cup” and “please put it back”. For the former task, the robot needs to estimate where the object indicated by the word “cup” is located. When the cup may be in any of several places, a robot that understands vocabulary expressing locations can communicate with people efficiently to estimate where it should go to pick up the cup. For the task “please put it back”, the robot must estimate the place where the presented object should be placed and the vocabulary expressing that place. Therefore, we consider that robots should learn the relationships between objects and places in order to carry out such tasks.

In addition, understanding human social interactions and developing a robot that can smoothly communicate with human users in the long term require an understanding of the dynamics of symbol systems, such as multimodal categorization [1]. Multimodal categorization forms categories from the sensorimotor information acquired by a robot, including visual, haptic, and auditory information. By forming categories from multimodal information, a robot can classify observations within each modality and estimate the information of one modality from that of another.

Fig. 1. Overview of learning the relationships between places and objects by the proposed method

Regarding related work on place categorization, Taniguchi et al. proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA) based on unsupervised word segmentation and a nonparametric Bayesian generative model that integrates self-localization and clustering of both words and places [2]. Hagiwara et al. proposed a method that enables robots to autonomously form hierarchical place concepts using hierarchical multimodal latent Dirichlet allocation (hMLDA) [3], based on position and visual information [4]. Ishibushi et al. proposed a method that statistically integrates position information obtained by Monte Carlo localization (MCL) [5] with visual information obtained by a convolutional neural network (CNN) [6, 7]; they demonstrated that their method converges the positions and orientations of particles and reduces global positional errors. Espinace et al. proposed a generative probabilistic hierarchical model in which object category classifiers associate low-level visual features with objects, and contextual relations associate objects with scenes [8]. In that study, common objects such as doors and furniture are used as distinguishing features of indoor scenes and as a key intermediate representation for recognizing them. Rusu et al. proposed a method for acquiring semantic 3D object maps of indoor household environments, in particular kitchens, from sensed 3D point cloud data; these maps contain the parts of the environment that have fixed positions and utilitarian functions [9]. However, these methods cannot perform tasks such as “please fetch the cup” and “please put it back”, because the relationships between objects such as “cup” and places such as “kitchen” are not learned.

An overview of the learning of relationships between places and objects using the proposed method is shown in Fig. 1. In this study, we propose a model that learns the relationships between objects and places from multimodal information consisting of self-localization, object information, and word information. Word information is vocabulary expressing a place. Object information is a feature vector in a Bag-of-Objects (BoO) representation, constructed from the object labels obtained by an object detection method.
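As an illustration, the BoO feature for a single observation can be sketched as a count vector over the detector's label set. The snippet below is a minimal sketch; the label list and the detection output format are illustrative assumptions, not the exact implementation used in this work.

```python
import numpy as np

# Illustrative label set; the actual system uses the 80 MS-COCO
# categories handled by the object detector (see Sect. 3.1).
LABELS = ["cup", "book", "chair", "tv", "refrigerator"]
LABEL_INDEX = {name: i for i, name in enumerate(LABELS)}

def bag_of_objects(detected_labels):
    """Count detected object labels into a Bag-of-Objects vector o_t."""
    o_t = np.zeros(len(LABELS))
    for name in detected_labels:
        if name in LABEL_INDEX:
            o_t[LABEL_INDEX[name]] += 1
    return o_t

# Example: two cups and one book detected in the current image.
print(bag_of_objects(["cup", "cup", "book"]))  # -> [2. 1. 0. 0. 0.]
```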

In our experiments, we quantitatively evaluate the results of estimating objects from words expressing their places and of estimating words expressing places from images.

Fig. 2. Graphical model of the proposed method

Table 1. Definitions of variables in the graphical model
Fig. 3. Object information representing the BoO, constructed from the object labels detected by the object detection method

2 Learning of Multimodal Spatial Concepts Based on Co-Occurrences of Objects

In this study, we propose a method that learns the relationships between objects and places using self-position, object, and word information. We define the relationship between an object and a place as the probability of the object existing in that place. A graphical model of the proposed method is shown in Fig. 2, the definitions of the variables in the graphical model are given in Table 1, and the generative model of the proposed method is given in Eqs. (1)–(10).

$$\begin{aligned} \pi&\sim \mathrm{GEM}\left( \gamma \right) \end{aligned}$$
(1)
$$\begin{aligned} C_{t}&\sim p\left( C_{t} | {x_{t}}, \mu , \varSigma , \pi \right) \nonumber \\&\quad \propto \frac{\mathcal {N} \left( \varvec{x_{t}} | \mu _{C_{t}}, \varSigma _{C_{t}}\right) \mathrm{Mult} \left( C_{t} | \pi \right) }{\sum _{c'} \mathcal {N} \left( \varvec{x_{t}} | \mu _{c'}, \varSigma _{c'} \right) \mathrm{Mult} \left( c' | \pi \right) }\end{aligned}$$
(2)
$$\begin{aligned} \varSigma&\sim \mathcal {IW} \left( \varSigma | \psi _{0}, \nu _{0} \right) \end{aligned}$$
(3)
$$\begin{aligned} \mu&\sim \mathcal {N} \left( \mu | \mu _{0}, \left( \varSigma /\kappa _{0} \right) \right) \end{aligned}$$
(4)
$$\begin{aligned} \varphi&\sim \mathrm{Dir}\left( \alpha \right) \end{aligned}$$
(5)
$$\begin{aligned} \eta&\sim \mathrm{Dir}\left( \beta \right) \end{aligned}$$
(6)
$$\begin{aligned} o_{t}&\sim \mathrm{Mult}\left( o_{t} | \varphi _{C_{t}} \right) \end{aligned}$$
(7)
$$\begin{aligned} w_{t}&\sim \mathrm{Mult}\left( w_{t} | \eta _{C_{t}} \right) \end{aligned}$$
(8)
$$\begin{aligned} \varvec{x_{t}}&\sim p\left( \varvec{x_{t}} | \varvec{x_{t-1}}, u_{t} \right) \end{aligned}$$
(9)
$$\begin{aligned} z_{t}&\sim p\left( z_{t} | \varvec{x_{t}} \right) \end{aligned}$$
(10)

Here, \(\mathrm {GEM}\left( \cdot \right) \) is the prior distribution constructed by a stick-breaking process (SBP) [10], \(\mathcal {IW}\left( \cdot \right) \) is the inverse-Wishart distribution, \(\mathcal {N} \left( \cdot \right) \) is a multivariate normal distribution, \(\mathrm{Mult}\left( \cdot \right) \) is a multinomial distribution, and \(\mathrm{Dir}\left( \cdot \right) \) is a Dirichlet distribution. Robots estimate their self-position by MCL, using a map created by simultaneous localization and mapping (SLAM) [5]. Moreover, in order to detect objects in images, we use You Only Look Once (YOLO) [11], an object detection method. The method of acquiring object information is illustrated in Fig. 3. The robot acquires object information \(o_{t}\), representing the BoO, from the object labels detected by YOLO in the image acquired at time t. Furthermore, using the bounding boxes obtained by YOLO and the corresponding depth information, the object information is weighted according to Eq. (11). Because learning is performed based on the position of the robot, this weighting is applied to reduce the influence of distant objects.

$$\begin{aligned} weight\left( d \right) = \exp \biggl \{-\frac{\zeta d}{D - d} \biggr \} \end{aligned}$$
(11)

Here, d is the depth observed by the robot for each object, D is the convergence point at which the weight becomes zero, and \(\zeta \) is the damping factor. According to Eq. (11), the weight decays more strongly as the distance increases, approaching zero as d approaches D.
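A minimal sketch of this weighting function, assuming d and D are measured in the same units (the values \(D=4\) and \(\zeta = 0.7\) used in Sect. 3.1 appear only as example defaults):

```python
import math

def weight(d, D=4.0, zeta=0.7):
    """Depth-dependent weight of Eq. (11): 1 at d = 0, decaying toward 0 as d -> D."""
    if d >= D:
        return 0.0  # objects at or beyond the convergence point are ignored
    return math.exp(-zeta * d / (D - d))

# A detection's contribution to the BoO vector is scaled by this weight,
# e.g. o_t[label_index] += weight(object_depth).
```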

Self-position information is defined as \(\varvec{x_{t}}=\left( x_{t}, y_{t}, \sin \theta _{t}, \cos \theta _{t} \right) \), where \(\left( x_{t}, y_{t} \right) \) is the position of the robot in two-dimensional coordinates and \(\theta _t\) is the orientation of the robot, measured so that the x axis corresponds to \(0^\circ \) and the y axis to \(90^\circ \). Furthermore, \(u_{t}\) and \(z_{t}\) represent the control information of the robot and the observation information from the distance sensor, respectively. The object information is defined as \(o_{t}=\left( o_{t}^{1}, o_{t}^{2}, \cdots , o_{t}^{I} \right) \), where I is the number of object categories that can be detected by the object detection method. A human gives the name of the place corresponding to the self-position \(\varvec{x_{t}}\) as the word information \(w_{t}\). The number of spatial concepts is determined stochastically by the SBP.
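To make the generative story concrete, the sketch below draws a concept index and its observations roughly following Eqs. (1), (7), and (8). As a simplification, the position feature is drawn directly from the Gaussian of the chosen concept instead of from the motion model of Eq. (9), the stick-breaking construction is truncated, and all names and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(gamma, truncation=20):
    """Truncated stick-breaking construction of pi ~ GEM(gamma) (Eq. (1))."""
    betas = rng.beta(1.0, gamma, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

def sample_observation(pi, mu, Sigma, phi, eta, n_objects=5):
    """Draw (C_t, x_t, o_t, w_t) for one time step.

    mu[k], Sigma[k], phi[k], eta[k] are the parameters of concept k;
    phi[k] and eta[k] must be probability vectors.
    """
    pi = pi / pi.sum()                                   # renormalize truncated weights
    C_t = rng.choice(len(pi), p=pi)                      # concept index
    x_t = rng.multivariate_normal(mu[C_t], Sigma[C_t])   # simplified position feature
    o_t = rng.multinomial(n_objects, phi[C_t])           # BoO counts (Eq. (7))
    w_t = rng.choice(len(eta[C_t]), p=eta[C_t])          # place word (Eq. (8))
    return C_t, x_t, o_t, w_t
```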

For learning spatial concepts in the proposed method, each parameter is estimated by Gibbs sampling. The procedure for sampling each parameter is shown in Eqs. (12)–(15), where \(\mathcal {NIW}\left( \cdot \right) \) is the normal-inverse-Wishart distribution; \(\psi _{n_{l}}, \nu _{n_{l}}, \mu _{n_{l}}, \kappa _{n_{l}}\) are the hyperparameters after updating; and \(x_{l}, o_{l}, w_{l}\) are the sets of self-position, object, and word information data with \(C_{t}=l\), respectively. Furthermore, \(C_{t}\), \(\mu \), \(\varSigma \), \(\varphi \), and \(\eta \) are the parameters estimated by Gibbs sampling.

$$\begin{aligned} C_{t}&\sim p\left( C_{t}=l | \varvec{x_{t}}, \mu , \varSigma , \pi , \varphi , \eta \right) \nonumber \\&\quad \propto \mathcal {N}\left( \varvec{x_{t}}|\mu _{C_{t}}, \varSigma _{C_{t}} \right) \mathrm{Mult}\left( o_{t} | \varphi _{C_{t}} \right) \nonumber \\&\qquad \times \mathrm{Mult}\left( w_{t} | \eta _{C_{t}} \right) \mathrm{Mult}\left( C_{t} | \pi \right) \end{aligned}$$
(12)
$$\begin{aligned} \mu _{l}, \varSigma _{l}&\sim \mathcal {N}\left( x_{l}|\mu _{l}, \varSigma _{l} \right) \mathcal {NIW}\left( \mu _{l}, \varSigma _{l} | \psi _{0}, \nu _{0}, \mu _{0}, \kappa _{0} \right) \nonumber \\&\quad \propto \mathcal {NIW}\left( \mu _{l}, \varSigma _{l} | \psi _{n_{l}}, \nu _{n_{l}}, \mu _{n_{l}}, \kappa _{n_{l}} \right) \end{aligned}$$
(13)
$$\begin{aligned} \varphi _{l}&\sim \mathrm{Mult}\left( o_{l} | \varphi _{l} \right) \mathrm{Dir}\left( \varphi _{l} | \alpha \right) \end{aligned}$$
(14)
$$\begin{aligned} \eta _{l}&\sim \mathrm{Mult}\left( w_{l} | \eta _{l} \right) \mathrm{Dir}\left( \eta _{l} | \beta \right) \end{aligned}$$
(15)
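As a rough illustration of one Gibbs iteration, the sketch below resamples the concept assignments of Eq. (12) and performs the Dirichlet-posterior draws of Eqs. (14) and (15). The normal-inverse-Wishart update of Eq. (13) and the update of the mixture weights are omitted for brevity, a fixed truncation of L concepts is assumed, and all names and array shapes are assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

def sample_assignments(x, o, w, pi, mu, Sigma, phi, eta):
    """Resample each concept index C_t following Eq. (12).

    x: (T, 4) position features, o: (T, I) BoO counts, w: (T,) word ids,
    with w[t] < 0 marking time steps for which no word was taught.
    """
    T, L = x.shape[0], len(pi)
    C = np.empty(T, dtype=int)
    for t in range(T):
        log_p = np.log(pi + 1e-300)
        for l in range(L):
            log_p[l] += multivariate_normal.logpdf(x[t], mu[l], Sigma[l])
            log_p[l] += o[t] @ np.log(phi[l] + 1e-300)     # Mult(o_t | phi_l), up to a constant
            if w[t] >= 0:
                log_p[l] += np.log(eta[l][w[t]] + 1e-300)  # Mult(w_t | eta_l)
        p = np.exp(log_p - log_p.max())
        C[t] = rng.choice(L, p=p / p.sum())
    return C

def sample_dirichlet_param(counts, prior):
    """Posterior draw for phi_l or eta_l (Eqs. (14)-(15)): Dirichlet(prior + counts)."""
    return rng.dirichlet(prior + counts)
```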

3 Experiment

An experiment is performed in which the proposed model is used to estimate objects from vocabulary expressing places and to estimate the vocabulary expressing places from images; a quantitative evaluation makes it possible to judge the learned relations between objects and places. In addition, we show the usefulness of the proposed model by having the robot actually carry out the task of clearing up an object using the proposed model.

3.1 Experimental Condition

We conduct experiments using TOYOTA's Human Support Robot (HSR) [12]. The experimental environment is a home environment in a house owned by our laboratory; its layout is illustrated in Fig. 4. It is assumed that the map is generated in advance by SLAM using a laser range sensor, and that the robot possesses this map. Self-position estimation is performed using the amcl (adaptive MCL) package of the Robot Operating System (ROS) [13]. The dictionary of word information contains the following: “The front of dining table”, “The front of TV”, “The front of trash box”, “The front of microwave rack”, “The front of sink”, “The front of bookshelf”, “The front of refrigerator”, “The front of living table”, and “The front of sofa”. Word information is given for 10% of the self-position data. Because we used the Darknet-19 model [11] pre-trained on the MS-COCO dataset for YOLO, the object information has 80 dimensions. The parameters for the weight calculation are \(D=4\) and \(\zeta = 0.7\). The other parameters for this experiment are \(\alpha =0.1\), \(\beta =0.1\), \(\gamma =10\), \(\mu _{0}=(-0.05,-0.74,-0.01,-0.27)\), \(\kappa _{0}=1.0\), \(\nu _{0}=15\), and \(\psi _{0}=\mathrm{diag}(0.05,0.05,0.05,0.05)\). The number of Gibbs sampling iterations is 100. In order to verify the validity of the learned relationships between objects and places, three objects are selected as correct labels from the 80 detectable objects; the correct labels are created by a person who knows the experimental environment.
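For convenience, the experimental settings listed above can be gathered into a single configuration; the sketch below merely restates the stated values, and the variable names are our own.

```python
# Experimental settings of Sect. 3.1 (names are illustrative).
CONFIG = {
    "alpha": 0.1,                          # Dirichlet prior for object distributions
    "beta": 0.1,                           # Dirichlet prior for word distributions
    "gamma": 10,                           # SBP concentration parameter
    "mu_0": [-0.05, -0.74, -0.01, -0.27],  # prior mean of the position feature
    "kappa_0": 1.0,
    "nu_0": 15,
    "psi_0_diag": [0.05, 0.05, 0.05, 0.05],
    "weight_D": 4,                         # convergence point of Eq. (11)
    "weight_zeta": 0.7,                    # damping factor of Eq. (11)
    "gibbs_iterations": 100,
    "word_ratio": 0.1,                     # fraction of data given word information
    "n_object_categories": 80,             # MS-COCO categories detected by YOLO
}
```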

Fig. 4. Layout of the experimental environment

3.2 Experimental Procedure

While estimating its self-position using MCL, the robot moves in the environment under joystick operation and acquires self-position information and an image at each position. In this experiment, word information is given by typing, in order to eliminate speech recognition errors; the data points to which word information is assigned are selected randomly. We use YOLO to detect objects in the images and acquire object information representing the BoO. Furthermore, using the bounding boxes acquired by YOLO and the obtained depth information, the object information is weighted according to Eq. (11). Subsequently, the robot learns the relationships between objects and places by the proposed method using the self-position, object, and word information. We confirm the spatial regions of each learned place by drawing the normal distributions on the map. We evaluate the proposed model quantitatively by estimating objects from vocabulary expressing places, and by estimating vocabulary expressing places from presented images. Equation (16) is used to estimate the word W expressing a place from the occurrence probability of the feature vector O obtained from a presented image. We compare this with a model that extends Ishibushi's method [7] to handle word information, using image features taken from the final layer and the middle (fc6) layer of the CNN. Equation (17) is used to estimate the objects O at a place from the word W expressing that place. We compare the results with multimodal HDP-LDA, which extends HDP-LDA, i.e., LDA whose topic distribution is given a hierarchical Dirichlet process (HDP) prior [14], to handle multiple modalities; here, the HDP-LDA was learned using object information and word information. Finally, we demonstrate the usefulness of the proposed model by applying it to the task of the robot actually clearing up objects.

$$\begin{aligned} W&= \mathop {\text {arg max}}\limits _{w_{t}} p\left( w_{t} | o_{t}, \eta , \varphi , \pi \right) \nonumber \\&= \mathop {\text {arg max}}\limits _{w_{t}} \sum _{C_{t}} p\left( w_{t} | \eta _{C_{t}} \right) p\left( o_{t}=O | \varphi _{C_{t}} \right) p\left( C_{t} | \pi \right) \end{aligned}$$
(16)
$$\begin{aligned} O&\sim p\left( o_{t} | w_{t}, \varphi , \eta , \pi \right) \nonumber \\&\propto \sum _{C_{t}} p\left( o_{t} | \varphi _{C_{t}} \right) p\left( w_{t}=W | \eta _{C_{t}} \right) p\left( C_{t} | \pi \right) \end{aligned}$$
(17)
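A minimal sketch of these two estimation rules under the learned parameters follows; the array names and shapes (pi over concepts, phi over concepts × objects, eta over concepts × words) are assumptions, and the sums over \(C_{t}\) run over the truncated set of concepts.

```python
import numpy as np

def multinomial_loglik(O, phi):
    """Log p(o_t = O | phi_l) up to a constant, for every concept l."""
    return np.log(phi + 1e-300) @ O

def estimate_word(O, pi, phi, eta):
    """Eq. (16): most probable place word W for an observed BoO vector O."""
    log_mix = np.log(pi + 1e-300) + multinomial_loglik(O, phi)
    mix = np.exp(log_mix - log_mix.max())   # unnormalized weight of each concept
    scores = mix @ eta                      # sum_l p(w | eta_l) * mix_l
    return int(np.argmax(scores))

def estimate_objects(W, pi, phi, eta):
    """Eq. (17): distribution over objects given a place word W."""
    weights = pi * eta[:, W]                # p(w_t = W | eta_l) p(C_t | pi)
    dist = weights @ phi                    # sum_l p(o | phi_l) * weight_l
    return dist / dist.sum()
```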
Fig. 5. An example of the position distributions formed for each place and the classified images when the object information is represented by the BoO (a) and by its depth-weighted version (b) (Color figure online)

3.3 Experimental Result

An example of the position distributions formed for each place and the classified images is presented in Fig. 5. The left side shows the position distributions learned when the object information represents the plain BoO, and the right side shows those learned when the object information is weighted according to Eq. (11) using the obtained depth information. In Fig. 5, the 10 spatial regions estimated by learning are shown and can be identified by color. The position and direction of each arrow show the center and direction of the spatial region, and the translucent circle represents the covariance matrix of the spatial region. Although whether the direction should be included in the estimated spatial region is open to discussion, in this experiment we used data including directions to learn the spatial concepts, so the arrows are shown as a result of learning. The color of a circle only identifies the spatial region and does not indicate any relationship. Each image is an example of an image assigned to a spatial region, and each histogram represents the occurrence probability of the words expressing the place, as obtained by learning. In the following, each spatial region is indicated by its index. As can be seen from Fig. 5, indexes 0 and 33 are distinguished from each other because the visible objects face in different directions. Moreover, when the object information is weighted, the spatial regions become smaller, because the learned spatial concept then concentrates on the place where an object exists rather than on all the places from which the object can be seen.

Table 2 presents the results of estimating words expressing places from images, together with the accuracy over more than 50 test data points. Table 2 shows that, despite the fact that only a few objects can be detected, the accuracy of the proposed method does not differ much from that of the final layer of the CNN. Table 3 shows the results of estimating objects from words expressing places. From Table 3, it can be seen that the accuracy of the proposed method is better than that of HDP-LDA. A bookshelf is present just inside the kitchen, so the probability that a book exists there is high. In addition, we confirmed that the robot could actually carry out the task of clearing up objects using the proposed model. A movie of the robot performing the clear-up task, the source code of the proposed method, and the dataset are publicly available.

Table 2. An example of the results of estimating words expressing places
Table 3. Examples of object estimation results

4 Conclusion and Future Work

In this study, we have proposed a method that learns the relationships between objects and places using self-position, object, and word information. Experimental results showed that the proposed method can estimate objects from words expressing their places and estimate words expressing places from images. In the experiment estimating words expressing places from images, our method achieved performance equivalent to that of Ishibushi's method, but we consider our method more useful in that it also learns the relationships between objects and places. Furthermore, in the experiment estimating objects from the words expressing their places, a distant chair could be located when using the plain BoO, whereas with weighting it was estimated to be on the wrong side of a table. From this result, we consider that the learned spatial concept supports approaching the place where an object exists by using the places from which it can be seen. A quantitative evaluation comparing the results of estimating objects from words expressing their places against the correct labels demonstrated the validity of the relationships between objects and places learned by the proposed model.

In this study, we conducted experiments on the relationships between objects and places. In the future, we will conduct experiments on the estimation of positions and movement, and further verify the effectiveness of the proposed method. In addition, it is necessary to fine-tune YOLO so that it can detect the objects found in a home. In future work, we are also considering relative spatial concept learning [15], such as for the phrase “the front of”.