Abstract
We introduce a fully autonomous active vision system that explores its environment and learns visual representations of objects in the scene. The system design is motivated by the fact that infants learn internal representations of the world without much human assistance. Inspired by this, we build a curiosity-driven system that is drawn towards locations in the scene that provide the highest potential for learning. In particular, the attention devoted to a stimulus in the scene is related to the improvement of its internal model. This lets the system learn dynamic changes of object appearance in a cumulative fashion. We also introduce a self-correction mechanism that rectifies situations where several distinct models have been learned for the same object or a single model has been learned for adjacent objects. We demonstrate through experiments that curiosity-driven learning leads to a higher learning speed and improved accuracy.
Keywords
- Active vision
- Unsupervised learning
- Autonomous vision system
- Vision for robotics
- Humanoid robot
- iCub
- Object recognition
- Visual attention
- Stereo vision
- Intrinsic motivation
1 Introduction
One of the hallmarks of biological organisms is their ability to learn about their environment in a completely autonomous fashion. Future generations of robots assisting humans in their homes should similarly be able to autonomously acquire models of their working environment and any objects in it. While computer vision has made much progress in developing object recognition systems that can deal with many object classes, these systems need to be trained with supervised learning techniques, where a large number of hand-labeled training examples is required. Only recently, researchers have started addressing how a robot can learn to recognize objects in a largely autonomous fashion, e.g., [1], how learning can be made fully online [2, 3] and how the need for a human teacher can be minimized [4]. To this end, current attention systems of robots [5] have to be extended such that they support an efficient autonomous learning process.
The central inspiration of our approach is the concept of intrinsic motivation [6–8]. Children learn and build internal representations of the world without much external assistance. Instead, they are intrinsically motivated to explore and play and thereby acquire knowledge and competence. In short, they are curious. It has been proposed that infants’ interest in a stimulus may be related to their current learning progress, i.e., the improvement of an internal model of the stimulus [9]. We adopt the same idea to build a “curious” vision system whose attention is drawn towards those locations and objects in the scene that provide the highest potential for learning. Specifically, our system pays attention to salient image regions likely to contain objects, it continues looking at objects and updating their models as long as it can learn something new about them, it avoids looking at objects whose models are already accurate, and it avoids searching for objects in locations that have been visited recently. We show that our system learns more efficiently than alternative versions whose attention is not coupled to their learning progress.
2 Object Learning
Our system is implemented on the iCub robot head [10]. Its basic mode of operation is as follows. An attention mechanism generates eye movements to different locations. Any object present at the current location is segmented and tracked while learning proceeds. If the object is unfamiliar then a new object model is created. If the object is already familiar, then its model is updated if necessary. Learning proceeds for as long as the model can be improved. Then a new focus of attention is selected. Figure 1 shows the system architecture, which is explained in detail in the following sections.
We describe objects as spatial arrangements of local image features, an approach that is robust to occlusions, local deformations, variation in illumination conditions, and background clutter, e.g., [11]. To this end, image features are extracted at interest points detected with the Harris corner detector [12]. We use Gabor wavelet features, which have the shape of plane waves restricted by a Gaussian envelope function. At each interest point we extract a 40-dimensional feature vector, which we refer to as a Gabor-jet, resulting from filtering the image with Gabor wavelets of 5 scales and 8 orientations, e.g., [13]. The choice of these features is motivated by the fact that their shape is similar to that of the receptive fields of simple cells found in the primary visual cortex of mammals [14].
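To make the feature extraction step concrete, the following is a minimal sketch of Gabor-jet extraction at Harris corners using OpenCV. The paper does not specify the wavelengths, envelope widths, or corner-detector settings, so the parameter values below are illustrative assumptions; only the structure (5 scales × 8 orientations, one 40-dimensional jet per interest point, unit normalization so that the normalized inner product can serve as similarity) follows the description above.

```python
import cv2
import numpy as np

def extract_gabor_jets(gray, n_scales=5, n_orients=8, max_corners=200):
    """Return Harris interest points and one 40-dim Gabor-jet per point."""
    # Harris interest points (quality/distance settings are illustrative)
    pts = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=5,
                                  useHarrisDetector=True, k=0.04)
    pts = pts.reshape(-1, 2).astype(int)

    # Filter bank: 5 scales x 8 orientations = 40 responses per pixel
    responses = []
    for s in range(n_scales):
        lambd = 4.0 * (2.0 ** (0.5 * s))        # wavelength per scale (assumed)
        sigma = 0.56 * lambd                    # Gaussian envelope width (assumed)
        for o in range(n_orients):
            theta = np.pi * o / n_orients
            kern = cv2.getGaborKernel((31, 31), sigma, theta, lambd,
                                      gamma=1.0, psi=0, ktype=cv2.CV_32F)
            responses.append(cv2.filter2D(gray.astype(np.float32),
                                          cv2.CV_32F, kern))
    stack = np.stack(responses, axis=-1)        # H x W x 40

    # One jet per interest point, unit-normalized so that the similarity S
    # between two jets is simply their inner product
    jets = np.array([stack[y, x] for x, y in pts], dtype=np.float32)
    jets /= (np.linalg.norm(jets, axis=1, keepdims=True) + 1e-8)
    return pts, jets
```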
2.1 Stereo Segmentation and Tracking of the Object
To segment a potential object at the center of gaze from the background, we make use of stereo information. We find correspondences between interest points detected in the left and right images by exhaustively comparing the Gabor-jets extracted at these points (see Fig. 2a,b). Each interest point in the left image is associated with the best matching interest point in the right image if the similarity \(S\) between the two jets (we use the normalized inner product) is above a preset threshold (0.95 in our current implementation). We then cluster the matched interest points from the left image (which is used for learning) into different groups according to their image location and disparity (Fig. 2c). We use a greedy clustering scheme that starts with a single interest point and adds new ones if their x-position, y-position, and disparity are all within 5 pixels of any existing cluster member. Figure 2d shows how the object at the center of gaze is properly segmented from other objects which are at a similar depth but a different spatial location, or at a close-by spatial location but a different depth.
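The following sketch illustrates the matching and greedy clustering steps with the thresholds stated above (similarity 0.95, cluster radius 5 pixels). Selecting the foreground as the cluster whose mean position is closest to the center of gaze is our assumption of how the fixated cluster is picked; the helper names are hypothetical.

```python
import numpy as np

def match_stereo(jets_l, pts_l, jets_r, pts_r, sim_thresh=0.95):
    """Associate each left interest point with its best-matching right point
    and keep the pair only if the jet similarity exceeds the threshold."""
    sim = jets_l @ jets_r.T                      # normalized inner products
    best = sim.argmax(axis=1)
    matched = []
    for i, j in enumerate(best):
        if sim[i, j] >= sim_thresh:
            disparity = float(pts_l[i][0] - pts_r[j][0])
            matched.append((float(pts_l[i][0]), float(pts_l[i][1]), disparity))
    return matched                               # (x, y, disparity) per left point

def greedy_clusters(points, radius=5.0):
    """Group points whose x, y and disparity are all within `radius`
    of any existing member of a cluster."""
    clusters = []
    for p in points:
        for c in clusters:
            if any(all(abs(p[k] - q[k]) <= radius for k in range(3)) for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

def foreground_at_gaze(clusters, gaze_xy):
    """Assumed selection rule: the cluster whose mean image position is
    closest to the center of gaze is taken as the fixated object."""
    means = [np.mean([(x, y) for x, y, _ in c], axis=0) for c in clusters]
    d = [np.hypot(m[0] - gaze_xy[0], m[1] - gaze_xy[1]) for m in means]
    return clusters[int(np.argmin(d))]
```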
After segmentation the cameras are moved to bring the object to the center of view and keep it there — in case the object is moving — by a tracking scheme. To this end, the mean location of foreground features is calculated, then this location is tracked with both eyes using a model-free tracking scheme called Democratic Integration (DI) [15]. DI is a multi-cue tracking system that provides a fast and robust way of tracking unknown objects in a changing environment. Once the object is at the center of gaze, model learning starts.
2.2 Learning Object Models
Once an object has been segmented and fixated, its novelty or familiarity is determined by the recognition system described in Sect. 2.4. If the object is already familiar, the recognition module provides the unique identity of the object, i.e., an object index that was assigned when the object was first encountered. Otherwise a new object index is assigned.
Object learning involves the generation of a model that has a set of associations between the Gabor wavelet features and the object index [16]. An association is made between a feature and an object index if they occur together during learning and it is labeled with the distance vector between the location of the feature and the center of the object, i.e., the point on the object on which gaze is centered (see Fig. 3b).
2.3 Feature Dictionary
Object learning is carried out in an on-line fashion. There are no separate training and testing/recognition phases. As the system starts learning, the models for all the objects are learnt incrementally using a shared feature dictionary that accumulates information about objects and their associated feature vectors. We use a single-pass clustering scheme that updates the feature dictionary for every input feature vector. Let \(\mathcal {C}\) be the set of clusters and \(n\) be the number of clusters in the feature dictionary. Once the system starts learning it adds features from the objects in the scene. Each input feature vector \(\mathcal {J}\) has an associated object index \(k\) and the distance vector \((x,y)\) to the object center measured in pixels.
In the beginning, when the dictionary is empty, a cluster is created and represented by the input vector. Subsequently, as the number of clusters grows, the algorithm either assigns a feature to an existing cluster (without altering its representation) if the similarity value \(S\) is higher than a threshold \(\theta \) (equal to 0.95) (see \(\diamond \) in Fig. 3a), or makes it a new cluster otherwise (\(\star \) in Fig. 3a). During each update, the object index and distance vector are associated with the matched or newly created cluster. When a feature matches an existing cluster, duplicate associations of this cluster to the current object are avoided: if the object index is the same and the feature locations are within a Euclidean distance of 5.0 pixels, the new association is discarded. The algorithm can be summarized as follows:
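Below is a minimal sketch of this single-pass dictionary update, following the thresholds given above (similarity 0.95, duplicate radius 5.0 pixels). The class and method names are hypothetical; the stored offset is taken to point from the object center to the feature, a convention that is only an assumption and is reused consistently in the recognition sketch of Sect. 2.4.

```python
import numpy as np

class FeatureDictionary:
    """Shared single-pass clustering dictionary (sketch of Sect. 2.3)."""

    def __init__(self, sim_thresh=0.95, dup_radius=5.0):
        self.centers = []        # one representative (unit-norm) jet per cluster
        self.assocs = []         # per cluster: list of (object_id, (dx, dy))
        self.sim_thresh = sim_thresh
        self.dup_radius = dup_radius

    def add(self, jet, object_id, offset):
        """Insert one feature together with its object index and the offset
        (dx, dy) of the feature relative to the object center."""
        if self.centers:
            sims = np.asarray(self.centers) @ jet
            best = int(sims.argmax())
            if sims[best] >= self.sim_thresh:
                # Matches an existing cluster: skip near-duplicate associations
                # of the same object at (nearly) the same relative location.
                for oid, (dx, dy) in self.assocs[best]:
                    if oid == object_id and np.hypot(dx - offset[0],
                                                     dy - offset[1]) <= self.dup_radius:
                        return best
                self.assocs[best].append((object_id, tuple(offset)))
                return best
        # No sufficiently similar cluster: the input jet founds a new cluster.
        self.centers.append(np.asarray(jet, dtype=np.float32))
        self.assocs.append([(object_id, tuple(offset))])
        return len(self.centers) - 1
```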
2.4 Recognition
In our work recognition is an integral part of the learning process. When the robot looks at an object, the features on the segmented portion are sent to the recognition module and compared with the features in the dictionary. We use a generalized Hough transform [17] with a two-dimensional parameter space for recognition. Each feature votes in the space of all object identities and possible centroid locations based on its consistency with the learned feature associations. Features with a similarity value higher than \(0.95\) cast one vote each for the object identities that they match in the feature dictionary. Votes, which carry information about the object's identity as well as its location, are then aggregated in discretized bins in Hough space. We use bins of size \(5\times 5\) pixels in our work. If the number of votes in a bin favoring a particular object index is greater than a predefined threshold (10 in this implementation), we declare the object as being present at the corresponding location. If different bins vote for the same object at different locations in the scene, due to possible false feature matches, the location with the maximum number of votes is marked as the expected location. In the end, the recognition module returns a set of locations corresponding to those objects in the model whose voting support was sufficient.
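A minimal sketch of this voting scheme, reusing the hypothetical FeatureDictionary from Sect. 2.3 and the thresholds given above (similarity 0.95, 5x5 pixel bins, at least 10 votes). The assumption that a stored offset points from the object center to the feature is carried over from the dictionary sketch.

```python
from collections import defaultdict
import numpy as np

def recognize(jets, pts, dictionary, bin_size=5, sim_thresh=0.95, vote_thresh=10):
    """Generalized-Hough style voting over (object id, centroid bin)."""
    votes = defaultdict(int)
    centers = np.asarray(dictionary.centers)
    for jet, (x, y) in zip(jets, pts):
        sims = centers @ jet
        best = int(sims.argmax())
        if sims[best] < sim_thresh:
            continue
        # Every association of the matched cluster casts one vote for its
        # object id at the centroid predicted by the stored offset.
        for object_id, (dx, dy) in dictionary.assocs[best]:
            cx, cy = x - dx, y - dy
            votes[(object_id, int(cx // bin_size), int(cy // bin_size))] += 1

    # Per object, keep only the best-supported bin above the vote threshold.
    best_bins = {}
    for (oid, bx, by), n in votes.items():
        if n >= vote_thresh and n > best_bins.get(oid, (0, None))[0]:
            best_bins[oid] = (n, (bx * bin_size, by * bin_size))
    return {oid: loc for oid, (n, loc) in best_bins.items()}
```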
3 Attention Mechanism
Our attention mechanism controls what the robot will look at, for how long it will keep looking at it, and where it should avoid looking. We embody curiosity in the attention mechanism by introducing the following ways of guiding attention to where learning progress is likely.
3.1 Bottom-Up Saliency at Interest Points
We have adapted a bottom-up saliency model developed by Itti et al. [18]. In this model the conspicuity of each image location in terms of its color, intensity, orientation, motion, etc. is encoded in a so-called saliency map. We make use of stereo information to select the most salient point in the scene. Images from both eyes are processed to obtain left and right saliency maps. Since objects are represented as features extracted at interest points, our attention mechanism only considers points in the saliency map that are associated with a pair of interest points matched between the left and right image (all other points are neglected). In this way we restrict attention to locations of potential objects that the system could learn about. The saliency values for the matched interest points are computed using a two-dimensional Gaussian centered on them, with \(\sigma \) = 1.5 and a cutoff value of 0.05. This has the effect of emphasizing clusters of high salience rather than isolated pixels of high salience.
When there are no other variations in the visual characteristics of the scene, it is very likely that the attention mechanism continues to select the same location as the most salient point. To avoid this, we temporarily inhibit the saliency map around the current winner location by subtracting a Gaussian kernel centered on it. This allows the system to shift attention to the next most salient location. To avoid constant switching between the two most salient locations, we also use a top-down inhibition of already learned objects, described below (Sect. 3.3).
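The following sketch shows one plausible reading of these two steps: each stereo-matched interest point receives a Gaussian-weighted sum of the saliency around it (sigma 1.5, cutoff 0.05 as above), and the map is then inhibited around the winner by subtracting a Gaussian. The inhibition width and the weighting-by-neighbourhood interpretation are assumptions not fixed by the text.

```python
import numpy as np

def point_saliency(saliency_map, matched_pts, sigma=1.5, cutoff=0.05, radius=3):
    """Gaussian-weighted saliency at each stereo-matched interest point;
    summing over a small neighbourhood favours clusters of salient pixels."""
    h, w = saliency_map.shape
    values = []
    for x, y in matched_pts:
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        yy, xx = np.mgrid[y0:y1, x0:x1]
        g = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
        g[g < cutoff] = 0.0
        values.append(float((g * saliency_map[y0:y1, x0:x1]).sum()))
    return np.array(values)

def inhibit_winner(saliency_map, winner_xy, sigma=15.0):
    """Subtract a Gaussian around the current winner (width is assumed)."""
    h, w = saliency_map.shape
    yy, xx = np.mgrid[0:h, 0:w]
    g = np.exp(-((xx - winner_xy[0]) ** 2 + (yy - winner_xy[1]) ** 2) / (2 * sigma ** 2))
    amp = saliency_map[winner_xy[1], winner_xy[0]]
    return np.clip(saliency_map - amp * g, 0.0, None)
```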
3.2 Attention Based on Learning Progress
It has been argued that infants’ interest in a stimulus is related to their learning progress, i.e., the improvement of an internal model of the stimulus [9]. We mimic this idea in the following way. When the robot looks at an object, it detects whether the object is familiar or not. If the object is new, a new object model is created by making new associations in the shared feature dictionary. If the object is known, the model is updated by acquiring new features from the object. The attention remains focused on the object until the learning progress becomes too small. As a side effect, the robot continues learning about an object when a human intervenes by rotating or moving it, exposing different views with unknown features (Fig. 4).
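A minimal sketch of this stopping rule, using the hypothetical FeatureDictionary from Sect. 2.3: learning progress is measured here as the number of new dictionary clusters created per frame, and fixation ends once too few new features have been found for several consecutive frames. The exact progress measure and both thresholds are assumptions.

```python
def fixate_and_learn(frames, dictionary, object_id, min_new=1, patience=5):
    """Keep fixating the segmented object while its model still improves.

    `frames` yields (jets, pts, center) tuples for the segmented foreground
    of each incoming frame; the thresholds are illustrative.
    """
    stale = 0
    for jets, pts, center in frames:
        before = len(dictionary.centers)
        for jet, (x, y) in zip(jets, pts):
            # offset of the feature relative to the fixated object's center
            dictionary.add(jet, object_id, (x - center[0], y - center[1]))
        new_clusters = len(dictionary.centers) - before
        stale = 0 if new_clusters >= min_new else stale + 1
        if stale >= patience:
            break        # learning progress too small: release attention
```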
3.3 Top-Down Rejection of Familiar Objects
The third mechanism to focus attention on locations where learning progress is likely makes use of the system’s increasing ability to recognize familiar objects. A purely saliency-based attention mechanism may select the same object again and again during exploration, even if the scope for further learning progress has become very small. Therefore, once no more new features are found on certain objects, our system inhibits their locations in the saliency map wherever they are recognized (Fig. 5a). To this end, the models of these objects are used to detect them in every frame using the recognition module. Interest points in the saliency map that lie in the vicinity of these detections are excluded from the competition for the winner location.
3.4 Top-Down Rejection of Recently Visited Locations
We have incorporated an inhibition-of-return mechanism that prevents the robot from looking back at locations it has recently visited. To this end, the absolute 3D coordinates of the visited locations are saved in memory and mapped onto pixel coordinates in the images from the cameras in their current positions to determine the image locations to inhibit. In our experiments, a list of the 5 most recently visited locations is maintained and close-by interest points are inhibited for the next gaze shift (Fig. 5b). To encourage exploration of regions beyond the current field of view, we have also added a mechanism that occasionally turns the head in a new direction. To this end, the space of possible head directions is parcellated into 4 quadrants. Whenever the robot has visited ten locations in one quadrant, it shifts to the opposite quadrant.
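A sketch of this bookkeeping is shown below. The projection from 3D locations to the current image and the inhibition radius are assumptions supplied by the caller; only the memory size (5 locations) and the quadrant-switch count (10 visits) come from the description above.

```python
from collections import deque

class InhibitionOfReturn:
    """Track recently visited 3D locations and suppress nearby interest points."""

    def __init__(self, project_to_image, radius_px=20, memory=5):
        self.project = project_to_image      # maps a 3D point to (x, y) in the current view
        self.radius = radius_px              # inhibition radius in pixels (assumed)
        self.recent = deque(maxlen=memory)   # 5 most recently visited locations
        self.visits_in_quadrant = 0

    def visit(self, location_3d):
        self.recent.append(location_3d)
        self.visits_in_quadrant += 1

    def allowed(self, interest_points):
        """Return only interest points that are not close to a recent visit."""
        inhibited = [self.project(p) for p in self.recent]
        return [(x, y) for (x, y) in interest_points
                if all((x - ix) ** 2 + (y - iy) ** 2 > self.radius ** 2
                       for ix, iy in inhibited)]

    def should_switch_quadrant(self, threshold=10):
        """After ten visits in one quadrant, signal a shift to the opposite one."""
        if self.visits_in_quadrant >= threshold:
            self.visits_in_quadrant = 0
            return True
        return False
```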
4 Self-correction Mechanism
We introduce a self-correction mechanism in the system that discovers whether there are any inaccuracies in the object representations in the dictionary and tries to rectify them. Ignoring the problems caused by variations in illumination conditions and object deformations, inaccuracies can arise primarily for two reasons: (1) the representation of an object is incomplete, or (2) the representation of an object has incorporated portions of other objects in the scene. This can lead to the following problems, respectively:

1. When the object is seen at different instants of time during the learning process in different poses, there will be duplications of the object in the dictionary, with different models for different poses.

2. When two objects in the scene are overlapping or in contact with each other during learning, a single object model will be assigned to both objects.
We address these issues using the techniques described in the next sections.
4.1 Merging Technique
An object may change its pose while it is in focus or while the robot is attending elsewhere. When an object changes its pose while it is in focus, it is easy to incorporate the changes into the object model using the approach described in Sect. 3.2. However, this is not possible when the object changes its pose while the robot’s focus is on other objects in the scene. In this case the system learns duplicate identities, which will continue to exist in the dictionary even after previously learned poses are revealed again at a later time. To understand this, consider the example shown in Fig. 6. The figure shows a scenario in which the robot had seen one side of the object and assigned the identity number (ID) 0 (Fig. 6a). Later it only saw the other side of the same object and assigned ID 2 (Fig. 6b). When the robot sees the object yet again, it identifies it as object 2 (Fig. 6c); the user then slowly rotates it and reveals the other side, which is also added to the representation of object 2 in the dictionary. We now have two identities for the same object, only one of which will be updated, depending on the initial appearance match. We resolve this by detecting such an event using the recognition module (see Sect. 2.4), which is always running in the background. While the object is being updated, the recognition module identifies that there is another ID for the object currently being updated (Fig. 6d). This indicates that there are duplicate IDs for this object in the model database. We therefore merge the IDs as well as their corresponding feature associations in the dictionary into one. This technique also helps to merge duplicate identities caused by variations in illumination conditions and object deformations.
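A sketch of the merge step, again using the hypothetical FeatureDictionary structure from Sect. 2.3: all associations carrying the duplicate identity are relabelled with the identity that is kept, and exact duplicates created by the relabelling are dropped.

```python
def merge_ids(dictionary, keep_id, duplicate_id):
    """Merge two identities discovered to describe the same object."""
    for assoc_list in dictionary.assocs:
        merged = []
        for oid, offset in assoc_list:
            if oid == duplicate_id:
                oid = keep_id
            # drop exact duplicates that can appear after relabelling
            if (oid, offset) not in merged:
                merged.append((oid, offset))
        assoc_list[:] = merged
```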
4.2 Splitting Technique
When two objects that are in contact with each other are seen together in the scene, the system learns a single representation for them, since it has no way of knowing that the appearance belongs to two distinct objects in the real world (see example in Fig. 7). Later, when one of these two objects appears in the scene on its own, the system will recognize it with the same object ID. This is akin to situations in which only a part of the object or one particular pose of the object is visible while the system is still able to recognize the object (see Sect. 2.4). Hence, in the case of two objects, the error goes unnoticed unless it is explicitly discovered. We again use the recognition module to identify such an event and rectify the identities in the dictionary. Figure 7a shows feature locations on the combined object and their corresponding distance vectors with respect to the object centroid. When these two objects are separated and kept apart in the scene, the features vote for object centroids that are concentrated at two different locations in the scene (Fig. 7b). This gives rise to two recognitions (see Sect. 2.4), indicating that two objects were encapsulated under a single identity. We then split the corresponding features in the dictionary into two groups based on the votes and associate them with two different object IDs.
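A sketch of this split, reusing the hypothetical FeatureDictionary: given the two detections and, for each, the dictionary clusters whose features voted for it, the associations of the old identity are relabelled with two fresh identities. How the vote groups are collected during recognition is left to the caller and is an assumption of this sketch.

```python
def split_object(dictionary, old_id, vote_groups, new_ids):
    """Split one identity that turned out to cover two objects (Sect. 4.2).

    `vote_groups` contains, for each of the two detections, the indices of
    the dictionary clusters whose features voted for that detection.
    """
    for cluster_indices, new_id in zip(vote_groups, new_ids):
        for ci in cluster_indices:
            assoc_list = dictionary.assocs[ci]
            for i, (oid, offset) in enumerate(assoc_list):
                if oid == old_id:
                    assoc_list[i] = (new_id, offset)
```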
5 Experiments and Results
The system described above incorporates several mechanisms to make it intrinsically motivated to seek out new information or, simply put, to make it curious. To evaluate the benefits of this curiosity, we test the performance of the system by incorporating one or more of the attention mechanisms in a staged manner. We will label the full system including all mechanisms as the IM (intrinsic motivation) system. Note that these tests are performed without the two self-correction mechanisms of merging and splitting of object models.
5.1 Experimental Setup
The model is implemented on an iCub robot head [10] (Fig. 8). It has two pan-tilt-vergence eyes mounted in a head supported by a yaw-pitch-twist neck, giving 6 degrees of freedom (3 for the neck and 3 for the cameras). Images are acquired from the iCub cameras at 27 fps at a resolution of \(320\times 240\) pixels. Experiments are performed by placing the iCub in a cluttered environment containing various objects at different depths and with partial occlusions. The background comprises walls, doors and book shelves. Figure 8 shows the objects, which have different sizes and shapes.
5.2 Evaluation Method
To evaluate the system, we let the robot autonomously explore its environment for 5 min and then test its performance using previously recorded and manually segmented ground truth images. During ground truthing we manually control the robot to look at each object present in the scene. The robot extracts features on the manually segmented objects until it no longer finds any new features. This period was observed to be less than 10 frames on average for static objects, but longer for rotating/moving objects (see below). Once all the features have been collected on all the objects, they are tested against the model generated by the system at the end of the learning process. To evaluate the performance of the system we consider the following measures: the number of objects learnt, the number of visits to an object (to test the exploration efficiency), the accuracy of the object models (in terms of repeated object identities, missed/wrong detections, and recognition rate), and the time taken for learning the objects. Since the object identities depend on the order in which objects are learnt, we programmed the system to store representative images of each object together with the self-assigned object ID (Fig. 11). These images are displayed during testing and allow a visual verification of the correctness of the recognition.
5.3 Two Experimental Scenarios
In the following we describe two testing scenarios using static and dynamically changing scenes.
In the first scenario, objects are static and iCub has to actively explore the scene and learn about the objects. We set a time span of 5 min during which iCub learns as many objects as possible. We place 12 objects in the scene allowing partial occlusions. Object locations are varied from one experiment to another.
In the second scenario we tested the ability of the system to update the model of an object with new features (Fig. 4). We used only 3 objects, which were rotated by a human to dynamically change their appearance while the iCub learned about them. The learned object models are evaluated with separate test images showing the objects in four different poses.
5.4 Results
In this section we illustrate the performance of our system in a staged manner. Bottom-up saliency is employed in all experimental scenarios. We then demonstrate further improvements in the attention and learning mechanisms obtained by adding top-down information and learning-progress cues on top of this.
We first illustrate the effect of top-down information on the system’s performance in the static object scenario. Figure 9 compares the system’s performance with and without top-down information. We report average values over 10 experiments carried out with different objects, locations, and lighting conditions. Error bars represent maximum and minimum values. Figure 9a shows the number of objects learnt by the system in 5 min that were validated by ground truth. Figure 9b shows the number of revisits of objects during exploration. In the absence of top-down information the system visits some objects repeatedly, although little new information is available there. Similarly, Fig. 9c shows the maximum number of revisits across all objects. Figure 9d shows the number of objects whose models were incorrectly duplicated, i.e., the system did not recognize the object when visiting it at a later time and created a second object model for the same object. Figure 10 shows the comparison in terms of the time taken by the system to learn the first \(n\) objects. Across all measures, the system using top-down information is superior to the one without. One can expect even better performance on a robot with a wider field of view and higher resolution, covering more objects in the scene.
Our system looks at an object for as long as it finds something new to learn about. To evaluate the benefit of this feature we compare the full system (IM) to a version that only looks at an object for a fixed duration (equal to 3 s which was observed to be sufficient for learning an arbitrary object) before shifting gaze (No IM). Table 1 compares the recognition accuracies of both versions in the static object scenario. Recognition accuracy is defined as the percentage of features of the object model matched with ground truth. We observe that the recognition accuracy is higher for the IM case, even though the objects are static. This is somewhat surprising since for static objects a single frame should be sufficient to learn an accurate model. We suspect that the advantage of the IM system in this setting is due to subtle variations in lighting and camera noise that slightly alter object appearance from frame to frame.
The advantage of the full IM system becomes much clearer in the rotating object scenario. For this experiment we used the three objects marked by black rectangles in Fig. 8. The objects are rotated by a human operator as the robot learns about them (see Fig. 4). We observe that the full IM system avoids duplicate representations for the same object. Figure 12 shows feature-to-object associations after learning: the features corresponding to an object model are collected and their distance vectors are drawn from the center of the object. Figure 12a shows that in the IM case the features are densely populated, covering most parts of the object. As our object models are pose invariant, the picture depicts the aggregation of feature vectors from all poses captured in the model. Figure 12b shows that in the other case there are duplicate models for the same object in the feature dictionary, because the system fails to realize that an object seen some time later in a different pose is the same object and therefore learns a new object model with a new identity. The features are also not dense enough to identify the objects with high reliability. This is evident from Table 2, which lists the number of associated features in the feature dictionary for every object and the corresponding models. As shown in Table 3, the full IM system also has superior recognition accuracy, again defined as the percentage of features of the object model matched with ground truth. Four different poses of every object are shown to the system to see how well it can recognize them. We observe that the recognition accuracy is substantially higher for the IM case.
The system performance recorded as a video can be viewed at: http://fias.uni-frankfurt.de/neuro/triesch/videos/icub/learning/
6 Conclusions
We have presented a “curious” robot vision system that autonomously learns about objects in its environment without human intervention. Our experiments comparing this curious system to several alternatives demonstrate the higher learning speed and accuracy achieved by focusing attention on locations where the learning progress is expected to be high. Our system integrates a sizeable number of visual competences including attention, stereoscopic vision, segmentation, tracking, model learning, and recognition. While each component leaves room for further improvement, the overall system represents a useful step towards building autonomous robots that cumulatively learn better models of their environment driven by nothing but their own curiosity.
References
Kim, H., Murphy-Chutorian, E., Triesch, J.: Semi-autonomous learning of objects. In: Conference on Computer Vision and Pattern Recognition Workshop, CVPRW ’06, p. 145 (2006)
Wersing, H., Kirstein, S., Götting, M., Brandl, H., Dunn, M., Mikhailova, I., Goerick, C., Steil, J., Ritter, H., Körner, E.: Online learning of objects in a biologically motivated visual architecture. Int. J. Neural Syst. 17(4), 219–230 (2007)
Figueira, D., Lopes, M., Ventura, R., Ruesch, J.: From pixels to objects: enabling a spatial model for humanoid social robots. In: IEEE International Conference on Robotics and Automation, ICRA 2009, pp. 3049–3054 (2009)
Gatsoulis, Y., Burbridge, C., McGinnity, T.: Online unsupervised cumulative learning for life-long robot operation. In: 2011 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 2486–2490 (2011)
Begum, M., Karray, F.: Visual attention for robotic cognition: a survey. IEEE Trans. Auton. Ment. Dev. 3(1), 92–105 (2011)
Baranes, A., Oudeyer, P.-Y.: R-IAC: robust intrinsically motivated exploration and active learning. IEEE Trans. Auton. Ment. Dev. 1(3), 155–169 (2009)
Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Trans. Auton. Ment. Dev. 2(3), 230–247 (2010)
Baldassarre, G.: What are intrinsic motivations? a biological perspective. In: 2011 IEEE International Conference on Development and Learning (ICDL), vol. 2, pp. 1–8 (2011)
Wang, Q., Chandrashekhariah, P., Spina, G.: Familiarity-to-novelty shift driven by learning: a conceptual and computational model. In: 2011 IEEE International Conference on Development and Learning (ICDL), vol. 2, pp. 1–6 (2011)
Metta, G., Sandini, G., Vernon, D., Natale, L., Nori, F.: The iCub humanoid robot: an open platform for research in embodied cognition. In: Proceedings of the 8th Workshop on Performance Metrics for Intelligent Systems, PerMIS ’08, pp. 50–56. ACM, New York (2008)
Agarwal, S., Roth, D.: Learning a sparse representation for object detection. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 113–127. Springer, Heidelberg (2002)
Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of Fourth Alvey Vision Conference, pp. 147–151 (1988)
Wiskott, L., Fellous, J.-M., Krüger, N., von der Malsburg, C.: Face recognition by elastic bunch graph matching. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 775–779 (1997)
Jones, J., Palmer, L.: An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol. 58(6), 1233–1258 (1987)
Triesch, J., von der Malsburg, C.: Democratic integration: self-organized integration of adaptive cues. Neural Comput. 13, 2049–2074 (2001)
Murphy-Chutorian, E., Triesch, J.: Shared features for scalable appearance-based object recognition. In: Seventh IEEE Workshops on Application of Computer Vision, WACV/MOTIONS ’05 Volume 1, vol. 1, pp. 16–21 (2005)
Ballard, D.H.: Generalizing the Hough transform to detect arbitrary shapes. In: Fischler, M.A., Firschein, O. (eds.) Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, pp. 714–725. Morgan Kaufmann Publishers Inc., San Francisco (1987)
Itti, L., Koch, C.: Computational modelling of visual attention. Nat. Rev. Neurosci. 2(3), 194–203 (2001)
Acknowledgements
This work was supported by the BMBF Project “Bernstein Fokus: Neurotechnologie Frankfurt, FKZ 01GQ0840” and by the “IM-CLeVeR - Intrinsically Motivated Cumulative Learning Versatile Robots” project, FP7-ICT-IP-231722. We thank Richard Veale, Indiana University for providing the code on saliency.