1 Introduction

Location-based services provided by social networks such as Facebook and Twitter have remarkably increased the amount of multimedia content tagged with geo-sensor data, including latitude and longitude. In addition, the popularity of sensor-equipped mobile devices makes capturing, uploading, and sharing outdoor user-generated videos (UGVs) highly convenient. This motivates us to investigate effective techniques for managing these Internet-scale UGVs.

To handle the vast number of UGVs on social networks, we focus on object-of-interest (OOI) recognition, i.e., building an OOI recognition system that leverages both visual features and sensor-social data. The major benefit is that not only can such a system localize OOIs, but it is also highly efficient and accurate thanks to the sensor-social data. A recognition system of this kind would be of tremendous value for a large body of multimedia applications. For example, Zheng et al. [1] proposed a web-scale landmark recognition engine by leveraging vast amounts of multimedia data. However, most GPS-tagged recognition systems depend on a large collection of images to obtain accurate visual clusters. Such techniques are unsuitable for OOI recognition in UGVs for two reasons: (1) OOI recognition in UGVs should be a lightweight application, since UGVs are usually captured by mobile devices, and off-loading many recognition tasks onto cloud servers increases latency and response time; (2) typical approaches that acquire a “complete” image dataset to handle object recognition incur extra computation in practice.

The explosive growth of UGVs poses a significant challenge: how to efficiently organize large video repositories and make them searchable. Common approaches adopt content-based media analysis to extract visual features for similarity matching. However, due to the overwhelming amount of video material, it is impractical to perform feature matching frame by frame. In this work, we understand video content at the object level in a lightweight way and propose to recognize the OOIs that users intentionally capture in UGVs. Closest to our work, Hao et al. [2] focused on point-of-interest detection in sensor-rich videos, which is achieved by automatically and comprehensively analyzing a large number of such videos; this makes their method unsuitable for OOI recognition in a single UGV.

Fig. 1. The proposed OOI recognition pipeline using sensor-social data

An overview of the proposed method is presented in Fig. 1. We analyze UGVs uploaded to social networks at the object level by utilizing sensor-social data. Given a collection of UGVs, OOI acquisition and classified object set recommendation are conducted simultaneously. The former task is formulated as salient object extraction, where saliency indicates the informative or interesting regions within a scene. To obtain the most representative frames, a saliency-guided selection algorithm is proposed to filter out frames with similar saliency distributions. For the latter task, candidate categories are recommended by leveraging sensor-social data. Metadata including timestamps, GPS coordinates, GPS accuracy, and visible distances are employed as sensor data. Afterward, salient objects are extracted from the social images of each recommended category. A spatial-pyramid architecture [3] is adopted to describe both the social objects and the OOIs in UGVs, owing to its robustness in scene modeling, and a distance measure (e.g., the Euclidean distance) is employed to compare the classified and labeled object sets. Finally, OOIs with their annotated names are labeled in the UGVs frame by frame. Experiments on object-level video summarization and content-based video retrieval demonstrate the usefulness of our method.

2 Related Work

OOI detection is a widely used technique in a variety of domains, e.g., video analysis and retrieval. Object/saliency detection and region-of-interest accumulation are typical approaches to localizing OOIs. Most existing work on object detection relies on sliding-window approaches [4, 5], which can be computationally intractable since windows of various scales must be evaluated at many positions across the image. To accelerate computation, Harzallah et al. [8] and Vedaldi et al. [9] designed cascade-based methods that discard windows at each stage, with richer features adopted progressively. Cinbis et al. [10] developed an object detection system based on the Fisher vector representation and achieved state-of-the-art performance on image and video categorization. Kim et al. [11] proposed an OOI detection algorithm based on the assumption that OOIs are usually located near the image centroid. Zhang et al. [12] introduced a novel approach to extract primary object segments in videos from multiple object proposals. Although the above methods perform well on object detection, they are not lightweight algorithms and therefore cannot effectively handle OOI recognition on mobile devices.

Many recent OOI detection algorithms are based on visual saliency prediction [13, 14]. It is generally accepted that OOIs arise from human perception and that visual saliency reflects this cognitive mechanism. Therefore, saliency prediction performance significantly influences how well these methods detect OOIs. Most existing saliency models rely entirely on low-level visual features [15, 16], although high-level semantic cues [17, 18] should also be integrated into the saliency calculation [19]. Both biological and psychological studies [20] have shown that optimally fusing low-level and high-level visual features (including the location cue) can greatly enhance saliency detection. We adopt the Markov-chain-based saliency detection proposed by Jiang et al. [19]. One advantage of [19] is that it integrates both the appearance divergence and the spatial distribution of foreground/background objects, and it performs better on our multi-source location-aware dataset than its competitors.

Many approaches have been proposed to predict where humans attend when viewing a scene. The majority of existing methods recognize OOIs based on the similarity of appearance features. Recently, the wide availability of sensor-rich videos has allowed users to understand video semantics in a straightforward way [29]. Among the different types of sensory metadata, we focus on the geo-attributes of sensor data throughout this paper. Associating GPS coordinates with digital photographs has become an active research domain over the last decade [30]. Toyama et al. [31] introduced a metadata-based image search algorithm and compiled a database that indexes photos by location and timestamp. Föckler et al. [32] developed a museum guidance system utilizing camera-equipped mobile phones. Zheng et al. [1] constructed an efficient and effective landmark recognition engine that organizes, models, and recognizes landmarks at world scale. Gammeter et al. [33] introduced a fully functional augmented reality system that can track both stationary and mobile objects. By utilizing geo-sensor data, a number of object recognition tasks have thus been implemented based on GPS coordinates.

3 Sensor-Social-Based OOI Recognition

Given an outdoor UGV, we detect its OOIs and annotate them by utilizing a variety of multimedia features. The key to recognizing OOIs in outdoor UGVs is to optimally fuse video content, sensor data, and social factors. Thereafter, video sequences with annotated OOIs can be generated.

3.1 OOI Acquisition from UGVs

Saliency-Based Frame Selection. Semantics between sequential video frames are highly correlated. Existing summarization algorithms typically detect key frames to alleviate the computational burden, and such techniques are widely used in video editing and compression. Two factors are emphasized in our method: computational efficiency and representative OOI sequences. Conventional key-frame selection algorithms may not preserve diverse OOI sequences, so we propose a novel saliency-based frame selection to pick representative frames at the OOI level. First, the saliency map of each UGV frame is calculated using Jiang et al.’s algorithm [19], which jointly describes the appearance divergence and spatial distribution of foreground/background objects; by adopting the Markov chain theory [21], the saliency detection is performed rapidly. Let \(Sal_{c,s}\) denote the saliency map computed from color and spatial distributions, e index the transient graph nodes, and \(y_w\) be the normalized weighted absorbed-time vector. The saliency map is then obtained as:

$$\begin{aligned} Sal_{c,s}(e)=y_w(e),~~e=1,2,\cdots ,t. \end{aligned}$$
(1)

Afterward, the OOI region, denoted \(R_{bw}(\cdot )\), is obtained by binarizing the saliency map with an adaptive threshold \(\tau _1\). The criterion for saliency-based frame selection is:

$$\begin{aligned} decision(i)=\left\{ \begin{array}{ll} 1 \qquad \text {if}\, ||Th(i)-Th(i+1)||>\tau _2\\ 0 \qquad \text {otherwise} \\ \end{array}\right. , \end{aligned}$$
(2)

where \(\tau _2\) is a threshold on the divergence of saliency between neighboring frames; Th(i) is the salient area in frame i, R(i) the salient object region, and \(R(i,i + 1)\) the intersection of the salient areas of frames i and \(i + 1\).
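
The following Python sketch illustrates this selection rule under simplifying assumptions: a generic gradient-based saliency map stands in for the absorbing-Markov-chain model of [19], the Otsu threshold plays the role of \(\tau _1\), and the divergence in Eq. (2) is taken as the difference in binarized salient area between neighboring frames. Function names are illustrative, not the authors' implementation.

```python
# Sketch of the saliency-guided frame selection (Eq. 2), assuming a generic
# saliency predictor in place of the absorbing-Markov-chain model of [19].
import numpy as np
from skimage.filters import threshold_otsu, sobel
from skimage.color import rgb2gray

def saliency_map(frame):
    """Stand-in saliency: gradient magnitude of the grayscale frame.
    The paper uses the absorbing-Markov-chain saliency of Jiang et al. [19]."""
    sal = sobel(rgb2gray(frame))
    return (sal - sal.min()) / (sal.ptp() + 1e-8)

def salient_area(frame):
    """Th(i): fraction of pixels whose saliency exceeds the adaptive
    (Otsu) threshold tau_1, i.e. the binarized OOI region R_bw."""
    sal = saliency_map(frame)
    tau_1 = threshold_otsu(sal)
    return float((sal > tau_1).mean())

def select_frames(frames, tau_2=0.2):
    """Keep frame i+1 when its salient-area divergence from frame i exceeds tau_2."""
    areas = [salient_area(f) for f in frames]
    keep = [0]  # always keep the first frame
    for i in range(len(frames) - 1):
        if abs(areas[i] - areas[i + 1]) > tau_2:
            keep.append(i + 1)
    return keep
```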

Salient-Object-Assisted Tracking-Learning-Detection. To recognize OOIs, a set of OOI candidates must first be extracted from the UGVs. To balance efficiency and accuracy, we employ the tracking-learning-detection (TLD) framework proposed by Kalal et al. [22], which efficiently decomposes a long-term tracking task into tracking, learning, and detection. Due to the complicated spatial context of a scene, it is difficult to detect all objects in UGVs accurately. To address this, we propose a salient-object-assisted tracking-learning-detection that combines object and saliency detection when processing each UGV frame: if object detection fails, saliency detection is conducted and assists the similarity measure between patches. Based on this assisting scheme, OOI acquisition is performed for each frame. The new object model can then be formulated as:

$$\begin{aligned} M=\{p_1^+,p_2^+,\cdots ,p_x^+,\cdots ,p_m^+,p_1^-,p_2^-,\cdots ,p_x^-,\cdots ,p_m^-\}, \end{aligned}$$
(3)

where \(p^+\) and \(p^-\) denote foreground and background patches, respectively, and \(p_x^+\) and \(p_x^-\) are the saliency patches of the object and background, respectively. Example OOIs extracted from the UGVs are presented in Fig. 2. As can be seen, the proposed method not only detects the OOIs accurately, but also tracks them across UGV frames; the tracking is performed by localizing a bounding box centered on each detected OOI.
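
As a rough illustration (not the authors' implementation), the loop below shows how saliency can assist when the detector of the TLD framework fails on a frame: the detector, saliency model, and patch extractor are passed in as callables with hypothetical names, and the positive/negative patches are accumulated into the object model M of Eq. (3).

```python
# Minimal sketch of the salient-object-assisted TLD loop; the tracker/detector,
# saliency model, and patch extractor are abstracted as callables.
import numpy as np

def largest_salient_box(sal_map, tau=0.5):
    """Fallback localization: bounding box of the binarized salient region."""
    ys, xs = np.where(sal_map > tau)
    if len(xs) == 0:
        return None
    return (xs.min(), ys.min(), xs.max(), ys.max())

def track_with_saliency(frames, detect, saliency, extract_patches):
    """detect(frame) -> box or None; saliency(frame) -> 2-D map in [0, 1];
    extract_patches(frame, box) -> (positive_patches, negative_patches)."""
    model = {"pos": [], "neg": []}  # object model M of Eq. (3)
    boxes = []
    for frame in frames:
        box = detect(frame)
        if box is None:                                  # detection failed:
            box = largest_salient_box(saliency(frame))   # saliency assists
        boxes.append(box)
        if box is not None:
            pos, neg = extract_patches(frame, box)
            model["pos"].extend(pos)                     # p^+ patches
            model["neg"].extend(neg)                     # p^- patches
    return boxes, model
```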

Fig. 2. Tracking recognition for OOIs in UGVs. The long box contains frames randomly selected from UGVs with a tracking box around the OOIs; the right column displays the extracted objects marked with their annotated names.

3.2 Classified Object Set Recommendation

Assisted by human interaction, social data has become a rich medium conveying informative cues, e.g., tagged images, video clips, and user comments. It is worth emphasizing that social data also contains considerable noise; thus, exploiting it effectively is a challenging task.

Sensor data is recorded by the sensory modules embedded in mobile devices. In this work, we model the sensor data of a UGV as a frame-related feature vector, specified as:

$$\begin{aligned} S=&\{(t_i, lat_i, long_i, accur_i, visD_i)|t_i\in T, (lat_i;long_i)\in G,\nonumber \\&accur_i\in A, visD_i\in V\}, \end{aligned}$$
(4)

where T contains the capture time of each frame; G is the set of GPS coordinates describing the changes in capture location; A is the set of GPS location errors; and V is the set of visible distances calculated as in Arslan Ay et al. [28].
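
A minimal way to encode this per-frame sensor record, assuming illustrative field names rather than the authors' actual schema, is:

```python
# One possible encoding of the per-frame sensor record S in Eq. (4);
# field names are illustrative, not the authors' schema.
from dataclasses import dataclass
from typing import List

@dataclass
class SensorSample:
    timestamp: float      # t_i, capture time of the frame (seconds)
    lat: float            # lat_i, latitude in degrees
    lon: float            # long_i, longitude in degrees
    accuracy: float       # accur_i, reported GPS error (meters)
    visible_dist: float   # visD_i, visible distance per Arslan Ay et al. [28]

def sensor_track(samples: List[SensorSample]):
    """Group the per-frame samples into the sets T, G, A, V of Eq. (4)."""
    T = [s.timestamp for s in samples]
    G = [(s.lat, s.lon) for s in samples]
    A = [s.accuracy for s in samples]
    V = [s.visible_dist for s in samples]
    return T, G, A, V
```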

We construct an image set of candidate OOIs collected using the category keywords. In particular, image retrieval is conducted with the different category names, and a collection of images is downloaded from social networks and classified. To enable comparison at the object level, we calculate saliency maps for these social images and adaptively extract the salient objects as OOIs. Saliency-based object classification minimizes the influence of noise caused by the various backgrounds in social images. The classified OOIs from social images can be described as:

$$\begin{aligned} O_L=\{R_{bw}^1,R_{bw}^2,\cdots ,R_{bw}^n|n\in \mathcal {N}_L\}, \end{aligned}$$
(5)

where \(O_L\) is the OOI set labeled with category L, and \(\mathcal {N}_L\) is the corresponding index set of candidate objects in that category.
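
The construction of the labeled object sets \(O_L\) can be sketched as follows; the social image fetcher and the saliency model are abstracted as callables, and a bounding-box crop of the binarized salient region serves as the extracted object. All names here are hypothetical.

```python
# Sketch of building the labeled object sets O_L of Eq. (5); images are
# NumPy RGB arrays, fetch_images and saliency are caller-supplied callables.
def build_object_sets(categories, fetch_images, saliency, tau=0.5):
    """categories: category names recommended from sensor-social data;
    fetch_images(name) -> list of RGB arrays; saliency(img) -> map in [0, 1]."""
    object_sets = {}
    for label in categories:
        objects = []
        for img in fetch_images(label):
            mask = saliency(img) > tau          # binarized salient region R_bw
            if mask.any():
                ys, xs = mask.nonzero()
                objects.append(img[ys.min():ys.max() + 1,
                                   xs.min():xs.max() + 1])
        object_sets[label] = objects            # O_L
    return object_sets
```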

3.3 OOI Description and Recognition

We adopt a spatial-pyramid-based feature [3] to represent an image, since it combines the advantages of standard feature extraction methods. The spatial pyramid is a simple and efficient extension of the orderless bag-of-features image representation and exhibits significantly improved performance on challenging scene categorization tasks. More specifically, local visual descriptors are quantized against a dictionary of size D. The spatial pyramid feature for the c-th class and n-th object is then calculated as:

$$\begin{aligned} F_n^c=\{[f_1^1,f_2^1,\cdots ,f_t^1],[f_1^2,f_2^2,\cdots ,f_s^2],\cdots ,[f_1^p,f_2^p,\cdots ,f_q^p]\}, \end{aligned}$$
(6)

where p is the pyramid level, and t, s, and q denote the feature dimensionality at each pyramid level. Examples of this spatial pyramid representation are shown on the left of Fig. 3. To maximally eliminate the negative effects of complicated scenic backgrounds, we introduce a salient-object-based image filtering scheme, as elaborated in Fig. 4: we perform k-means clustering on two subsets of each classified object image set to constitute two class feature samples. Since people tend to capture images with similar salient objects for a given category, we discriminate positive and negative samples using the intra-class variance. Additionally, we extract features for the OOIs in UGVs with the same spatial pyramid architecture, so that the feature descriptions are consistent; a few examples are shown on the right of Fig. 3.
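
For concreteness, a minimal three-level spatial-pyramid descriptor over a map of quantized visual-word indices might look as follows; the dictionary size D, the local descriptor, and the quantizer are assumed, and the level-dependent weighting of [3] is not reproduced.

```python
# Minimal sketch of a 3-level spatial-pyramid descriptor (Eq. 6) over a map of
# quantized codeword indices; D and the quantizer are assumed to be given.
import numpy as np

def spatial_pyramid(codeword_map, D, levels=3):
    """codeword_map: 2-D int array of visual-word indices in [0, D);
    returns the concatenated per-cell histograms [level 0 | level 1 | ...]."""
    H, W = codeword_map.shape
    feats = []
    for lvl in range(levels):
        cells = 2 ** lvl                      # cells per side at this level
        for r in range(cells):
            for c in range(cells):
                block = codeword_map[r * H // cells:(r + 1) * H // cells,
                                     c * W // cells:(c + 1) * W // cells]
                hist = np.bincount(block.ravel(), minlength=D).astype(float)
                feats.append(hist / max(hist.sum(), 1.0))  # L1-normalize cell
    return np.concatenate(feats)

# e.g. a 3-level pyramid over a D=200 dictionary yields (1+4+16)*200 dimensions
```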

Fig. 3. Left: social objects and their three-level spatial-pyramid features; right: OOIs of UGVs and their three-level spatial-pyramid features

In the last step, we recognize the objects in a UGV by comparing them with the objects extracted from the social image sets using a similarity measure. The similarity is calculated between the mean features of the OOIs extracted from the UGV and the mean features of the salient objects extracted from the social images of each category.
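
A compact sketch of this matching step, assuming the spatial-pyramid features of Eq. (6) have already been computed and using the Euclidean distance as the default measure, is:

```python
# Sketch of the final matching step: the mean spatial-pyramid feature of the
# UGV's OOIs is compared against the mean feature of each labeled object set.
import numpy as np

def recognize(ooi_features, labeled_features, distance=None):
    """ooi_features: list of pyramid vectors from one UGV;
    labeled_features: {label: list of pyramid vectors from social images}."""
    if distance is None:
        distance = lambda a, b: np.linalg.norm(a - b)   # Euclidean by default
    query = np.mean(ooi_features, axis=0)
    scores = {label: distance(query, np.mean(feats, axis=0))
              for label, feats in labeled_features.items()}
    # closest category wins; for similarity measures such as histogram
    # intersection, use max(...) instead of min(...)
    return min(scores, key=scores.get)
```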

Fig. 4. Salient-object-based social image filtering

4 Experimental Results and Analysis

4.1 Dataset and Experimental Setup

The UGVs in our experiments are sensor-annotated videos captured with Android and iOS devices in Nanjing and Singapore. For the Nanjing dataset, five volunteers captured 676 UGVs using a Samsung Galaxy Note 3 and an iPhone 6 at two resolutions, \(3840\times 2160\) and \(1920\times 1080\). The Singapore dataset contains 835 UGVs at \(720\times 480\) with complicated sceneries, e.g., the Merlion, Marina Bay, the Esplanade, and the Singapore Flyer.

Our approach is implemented on a desktop PC with an Intel i7-4770K CPU and 16 GB of main memory. Java is used to parse the JSON data collected from the social servers, and Matlab is used to implement the entire framework for its convenience in image/video processing. The location-based social network component is built on Foursquare. The threshold \(\tau _1\) for salient region detection is calculated adaptively by Otsu's method, the frame selection threshold \(\tau _2\) is set to 0.2, and the spatial-pyramid level p is set to 3.

4.2 Experimental Results and Analysis

The experiments are designed to evaluate: (1) whether the proposed frame selection method can preserve the OOIs of a UGV while accelerating computation, (2) users' satisfaction with the proposed tracking detection for OOIs in UGVs, and (3) the recognition accuracy.

Efficiency of Frame Selection. Figure 5 presents some results of the saliency-based frame selection. To better evaluate the proposed frame selection, we design a PSNR-loss histogram to measure the quality of the selected frames. PSNR is widely used to evaluate the reconstruction quality of lossy compression between images. In our experiment, we construct a PSNR-loss histogram \(H=\{P_{12},\cdots ,P_{ij}\}_L\) that records the PSNR difference between the i-th and j-th frames in both the original and the selected sequences, where L is the number of frames in the original UGV. P denotes the PSNR and is defined as:

$$\begin{aligned} P=10\log _{10} \left( \frac{(2^n-1)^2}{M_{SE}}\right) , \end{aligned}$$
(7)

where \(M_{SE}=\frac{1}{MN}\sum \nolimits _{x=1}^M\sum \nolimits _{y=1}^N (f(x,y)-g(x,y))^2\); n is the number of bits per sample; f(x,y) and g(x,y) are the grayscale values of neighboring frames; and \(M\times N\) is the size of each frame.
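
A straightforward PSNR routine for 8-bit grayscale frames, together with the neighboring-frame PSNR values used to fill the histogram H, could be sketched as:

```python
# Sketch of the PSNR of Eq. (7) between two grayscale frames, plus the
# neighboring-frame PSNR values used to populate the PSNR-loss histogram H.
import numpy as np

def psnr(f, g, bits=8):
    mse = np.mean((f.astype(float) - g.astype(float)) ** 2)
    if mse == 0:
        return float("inf")                 # identical frames
    return 10 * np.log10(((2 ** bits - 1) ** 2) / mse)

def psnr_loss_values(frames):
    """PSNR between consecutive frames; lower PSNR = larger content change."""
    return [psnr(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
```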

Fig. 5. Example frames of the saliency-based selection

The PSNR value of each selected frame falls into the bin determined by its frame number in the original video, so the information loss can be compared frame by frame. Figure 6 presents the PSNR-loss histograms of one UGV, reflecting the information loss of the input UGV and of the sequence selected with \(\tau _1 = 0.1\) and \(\tau _2 = 0.2\). The red rectangle indicates that our method excludes frames with very low information loss, which guarantees that the diverse changes of OOIs are well preserved in the selected UGV frames.

Fig. 6. Left: PSNR-loss histogram of the original UGV and the saliency-guided selected UGV frames; right: user satisfaction with respect to the tracking detection

User Satisfaction. To evaluate the effectiveness of the proposed system, we invite five volunteers (two females and three males), who are photographers of the GeoVid project, to participate in our user study. On the multi-source location-aware dataset, they rate the OOI tracking results generated by our system: each volunteer rates the UGVs captured by himself/herself plus a randomly assigned fifth of the Singapore dataset. The participants are asked to choose among three impressions of the generated UGVs: “Interesting”, “Borderline”, and “Boring”, which reflect their opinions after viewing the UGVs with the OOI tracking box. The five volunteers also label each video to indicate whether its OOIs are recognized successfully. The accumulated feedback from the five volunteers is shown on the right of Fig. 6. We also explored why some UGVs were rated boring and observed that incorrect tracking occurred in several frames; the borderline ratings are primarily due to bounding boxes that cannot fully contain the OOIs.

Recognition Accuracy. Our multi-source location-aware dataset covers two cities, Nanjing and Singapore. We first calculate the recognition accuracy separately for each city and then average them to obtain the final recognition accuracy of the system. All experimental UGVs were captured by volunteers over a long period and no ground truth was available; therefore, each UGV is labeled during the user study according to whether it is correctly recognized (“1” for a correct recognition and “0” for a mistaken one). To determine which distance measure achieves the best performance, we compute six recognition accuracies, based on the Euclidean distance, the standardized Euclidean distance, the cosine distance, histogram intersection, the Chebychev distance, and the Hausdorff distance, respectively; the final recognition is obtained from the chosen measure between feature vectors. The accuracies for the two cities are presented in Fig. 7. Histogram intersection achieves the best accuracy, 92.86 % on the Nanjing dataset and 91.02 % on the Singapore dataset, so the average recognition accuracy of our system on the multi-source dataset is 91.94 %.
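
For reference, the six measures can be computed as sketched below; SciPy provides most of them directly, while histogram intersection (a similarity, larger is better) is written out explicitly. This is an illustration, not the authors' code, and the random vectors merely stand in for L1-normalized pyramid features.

```python
# The six similarity/distance measures compared in the experiments, sketched
# with SciPy where available; histogram intersection is written out directly.
import numpy as np
from scipy.spatial.distance import (euclidean, seuclidean, cosine,
                                    chebyshev, directed_hausdorff)

def histogram_intersection(a, b):
    """Similarity in [0, 1] for L1-normalized histograms (larger is better)."""
    return np.minimum(a, b).sum()

# toy L1-normalized feature vectors standing in for pyramid features
a, b = np.random.dirichlet(np.ones(50)), np.random.dirichlet(np.ones(50))
print(euclidean(a, b), cosine(a, b), chebyshev(a, b))
print(seuclidean(a, b, np.var(np.vstack([a, b]), axis=0) + 1e-12))
print(directed_hausdorff(a.reshape(-1, 1), b.reshape(-1, 1))[0])
print(histogram_intersection(a, b))
```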

Fig. 7. OOI recognition accuracies for the UGVs captured in Nanjing and Singapore

5 Conclusions

OOI recognition in UGVs is an important application in multimedia [24–27] and artificial intelligence [6, 7, 23, 34]. This paper proposes an automatic system that achieves OOI recognition in UGVs by leveraging sensor-social data. The key contributions of this paper can be summarized as follows. First, we propose a lightweight framework for recognizing OOIs in outdoor UGVs by combining geo-sensor data with location-aware social networks. Second, we introduce a novel saliency-guided frame selection algorithm, which performs OOI recognition effectively while reducing the computational burden. Third, we compile a multi-source location-aware dataset covering two cities, Nanjing and Singapore, with three resolutions and two frame rates. Fourth, our system achieves an OOI recognition accuracy of 91.94 %, which demonstrates its usefulness in both mobile and desktop applications.