
1 Introduction

For security purposes, there are many types of videos to examine, captured with different devices, from different perspectives and in different environments. This type of data increasingly comes from mobile devices, such as the body worn cameras used by UK police officers. Body worn cameras are changing the information consumed by security agencies. Besides providing important information on the health of the officers, the main use of these cameras is to record videos that are useful in dangerous situations for subject and action recognition. To test algorithms suited to this kind of data, we created Gotcha-I. Our proposal contains videos in different modes. The subjects in the videos are both cooperative and non-cooperative, in order to simulate a user's attempts to avoid the camera. They move along a path or freely and have been captured under different lighting conditions. One of the innovations introduced by Gotcha-I is the possibility of working on videos captured by a moving camera. During the acquisition process, the camera speed is adjusted to the subject's pace.

1.1 Related Work

As new needs arise in the world of video surveillance and security, new datasets are created to test recognition methods. In this section we provide an overview of datasets comparable with the one proposed.

COMPACT [2] is a biometric dataset focused on less-cooperative face recognition. Images are in high resolution and acquired in a fully automated manner. This allows real-world degradations such as expressions, occlusion and blur to be present.

Differently from COMPACT, UBEAR [3] is focused on ear images. Subjects in UBEAR are in movement and under varying lighting conditions. The subjects can move their heads freely and the acquired images are in gray scale. In the outdoor environment we can find datasets such as QUIS-CAMPI [4]. In QUIS-CAMPI, subjects are on the move and at a distance of about 50 m. For enrollment, the same subjects were also acquired in an indoor scenario, and a 3D model of each subject is available as well. There are full-body images and, from them, a PTZ camera extracted face images.

In the previous examples images were captured by a camera, but for surveillance purposes there are also datasets obtained by drones. In DRONEFACE [5], the authors focused on face tracking. This task becomes difficult to approach due to the distance between the drone and the subjects. For this reason, they built DRONEFACE, composed of facial images taken from various combinations of distances and heights, for evaluating how a face recognition technique performs in recognizing designated faces from the air.

Another dataset focused on faces at a distance is SALSA [7]. Differently from the previous ones, in SALSA there is a fixed camera network that records subjects in two different modalities, both in an indoor environment. The first modality simulates a poster presentation, in which there is a presenter and an audience. The second modality simulates a cocktail party in which subjects are free to move and interact.

All previous datasets are focused on one, or at most two, biometric traits. Recently, however, multibiometric datasets have also been created.

As an example, MUBIDIUS-I [6] is a multibiometric and multienvironment dataset, acquired by drones and cameras. It contains many biometric traits, such as ear, face, iris and full body. Most of the modalities are at close distance with fully cooperative subjects, but there are also videos with less-cooperative subjects at a distance in an outdoor environment.

Completely different from the previous ones are the datasets captured for pose estimation purposes. In order to obtain an accurate ground truth, the most used datasets in this field are ones like BIWI [8]. BIWI is a face dataset captured with a Kinect, which provides faces and 3D models of the subjects, acquired in an indoor environment and in a cooperative mode. There are only 20 subjects in the dataset and only face information.

The proposed dataset Gotcha-I brings together all the features of the previous ones. It provides a 3D model of each subject, despite not having any image captured by a depth camera. This makes it possible to perform pose estimation on real faces with a very precise ground truth, as in BIWI. As with MUBIDIUS-I, our dataset can be used for multiple biometric traits thanks to the different distances and modalities. As in SALSA, our subjects can move freely in different modalities, and as in DRONEFACE we are able to track a subject across different videos. We also have outdoor environments, as in QUIS-CAMPI, and 180\(^\circ \) videos of the subjects that allow us to perform ear recognition as in UBEAR. Finally, our non-cooperative modality goes beyond the less-cooperative setting of COMPACT, because our subjects deliberately try to avoid being filmed.

In Table 1, the overall specifications of each presented dataset are listed and compared with Gotcha-I.

In Table 2, the annotations provided by each compared dataset are listed. Very few datasets have 3D models of the subjects and landmark annotations, by which we mean coordinates of keypoints on the faces or bodies of the subjects.

Table 1. Reference datasets with overall specifications. C./N.C. means Cooperative/Non Cooperative Mode
Table 2. Reference datasets with type of annotations

1.2 Security Purpose Applications

Videos captured by mobile cameras in different environments are representative of surveillance data. In the last ten years, more and more countries have provided their police officers with body worn cameras [9]. Differently from fixed cameras, mobile cameras have the ability to move around and cover large areas. However, data such as images and videos from mobile cameras are quite different from those of fixed cameras due to the different point of view during recording [10]. Our dataset is proposed as a starting point to test algorithms operating on this topic, and below we introduce some algorithms that could benefit from it.

Identity Recognition. A hot topic in surveillance is identity recognition. One of the most used biometric traits for this purpose is the face. In this sense, many algorithms have been proposed in recent years that use faces to recognize a subject, both using neural networks [11, 12] and mixed methods [13]. On this type of data, it may be useful to first apply a head pose estimation method to select the most frontal frame. This can be achieved with various algorithms in the literature that work in real time, both with and without neural networks [14,15,16]. The proposed dataset also allows a subject to be detected from other biometric traits, such as ear or iris, as the algorithms in [17, 18] do. This is possible thanks to the varying distance of the subject from the camera during recording. Since this happens in the same video sequence, we are also able to fuse biometric traits in order to perform multibiometric recognition, using different frames for different biometric traits [19,20,21].
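As an illustration of the frame-selection step mentioned above, the following Python sketch picks the most frontal frame of a video using a generic head pose estimator. The callable `estimate_pose` is a placeholder for any of the real-time estimators cited above, not a component provided with the dataset.

```python
import cv2  # OpenCV, assumed available for video decoding

def most_frontal_frame(video_path, estimate_pose):
    """Return the frame whose head pose is closest to frontal.

    `estimate_pose` is a placeholder callable mapping a BGR frame to
    (pitch, yaw, roll) in degrees, e.g. any real-time estimator from
    the literature cited above.
    """
    best_frame, best_score = None, float("inf")
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        pitch, yaw, roll = estimate_pose(frame)
        score = abs(pitch) + abs(yaw)  # roll matters less for frontality
        if score < best_score:
            best_frame, best_score = frame, score
    cap.release()
    return best_frame
```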

Traits Classification. Not only the identity of the subject can be useful for security purposes; often we are interested in physical traits such as gender, age or facial characteristics, in order to classify many subjects in a short time. Gender recognition is often performed using the face [22, 23]. However, it is also possible to use videos in which subjects move freely, even far from the camera, using gait as a biometric trait [24, 25] or data collected by mobile devices [26]. It is also interesting to see how results differ depending on the cooperativeness or non-cooperativeness of the subjects, which is labeled in this dataset, as in the work in [27]. Using the face or the movement of the subjects we can also extract information about their age, as in [29, 30]. This task is often addressed together with the previous one, gender recognition, to extract the most intuitive and general information about the subjects [31]. Once some characteristics are captured, we may be able to follow the subjects along different paths and cameras, as our dataset allows. This is called tracking [32, 33] and it is a very hot topic in security thanks to the ability to find the same people in different environments, in real time, without the need to know the exact identity of the subject [34]. For this purpose, if we focus our attention on the face, there are various characteristics that can be used to discriminate subjects, as in [28]. At the same time, various classification and clustering algorithms have been developed for this purpose [35,36,37].

In conclusion, with regard to the identification and classification of subjects for security purposes, our dataset allows users to train, test and compare very different state-of-the-art algorithms.

2 Gotcha-I

Our proposed dataset stems from the growing need to extract biometrics from video surveillance data and from the need to understand who the user is, where he is and what he is doing. The Gotcha-I dataset allows the extraction of different biometrics: the face, the nose, the mouth, the eyes, the ears and the periocular area. Given the nature of the videos, it is also possible to extract behavioral biometrics from gait. The Gotcha-I dataset simulates body worn camera acquisition, in which a moving subject is recorded by a moving camera. It is available for download at [1].

2.1 Content of the Dataset

To simulate real-world conditions, accessories (clothes, hats or glasses) were not controlled and were left participant-dependent. Each participant followed two recording procedures: (I) a cooperative mode, where the subject walks and collaborates with the camera by watching it during the walk, see Fig. 1 (top-left), and (II) a non cooperative mode, where the same subject walks trying to avoid the camera, see Fig. 1 (top-right). The dataset contains a total of 493 videos with an average duration of 4 min, including 62 subjects, 15 women and 47 men, with an average age between 18 and 20 years.

In order to be able to create robust systems, several possible scenarios were considered for the previously described procedures. The dataset is composed of 11 different video modes in different environmental and behavioral contexts.

The contents of the dataset are listed below:

  • (EC1) indoor with artificial light - cooperative mode;

  • (EC2) indoor with artificial light - non cooperative mode;

  • (EC3) indoor without any lights but the camera flash - cooperative mode;

  • (EC4) indoor without any lights but the camera flash - non cooperative mode;

  • (EC5) outdoor with sunlight - cooperative mode;

  • (EC6) outdoor with sunlight - non cooperative mode;

  • (EC7) 180\(^\circ \) head video;

  • (EC8) stairs outdoor - cooperative mode;

  • (EC9) stairs outdoor - non cooperative mode;

  • (EC10) path outdoor - cooperative mode;

  • (EC11) path outdoor - non cooperative mode;

  • (EC12) derived files attached, detailed in Sect. 2.2.

All the videos have been acquired with the camera of the Samsung S9+ mobile phone; the modes (EC8-EC9-EC10-EC11) have also been acquired with an iPhone 10 and a Samsung Galaxy A5.

Fig. 1. Some Gotcha-I dataset samples: outdoor sunlight in cooperative mode (top left), indoor with artificial light in non cooperative mode (top right) and indoor with the camera flash in cooperative mode (bottom).

Illumination Differences. Some real-world problems can occur with the selected illumination settings. Videos in (EC5-EC6-EC8-EC9-EC10-EC11) were captured outdoors with natural sunlight, Fig. 1 (top-left). Videos in (EC1-EC2-EC7) were acquired in a room with a white background with the artificial lights on, see Fig. 1 (top-right). Videos in (EC3-EC4) were acquired in the same room with the lights off and the camera flash on. In these videos the use of the camera flash can generate blurred frames in some sequences, Fig. 1 (bottom), increasing the dataset complexity.
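Since the flash-only sequences (EC3-EC4) may contain blurred frames, a simple per-frame sharpness check can be used to discard them before recognition. The sketch below uses the common variance-of-Laplacian measure with OpenCV; the threshold value is an assumption that would need to be tuned on these sequences.

```python
import cv2

def is_blurred(frame_bgr, threshold=100.0):
    """Flag a frame as blurred when the variance of its Laplacian is low.

    `threshold` is an assumed value to be tuned on the EC3-EC4 sequences;
    sharper frames produce a higher variance.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold
```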

Cooperative and Non-cooperative Mode. In cooperative video sequences, the subjects look at the camera during the acquisition and follow the camera lens during the motion. In non-cooperative video sequences, the subjects try to avoid the camera during the motion, as can be appreciated in Fig. 1 (top-right). This modality is clearly the most challenging.

Figure 2-a shows the distance from the neck to the nose in a cooperative video and Fig. 2-b shows this distance in a non-cooperative video of the same subject. We can observe that the cooperative mode exhibits a linear behaviour, while the non cooperative mode behaves irregularly. The differences in the regularity of the subjects' pose in cooperative and non-cooperative videos have led us to carry out experiments with different methods to further analyze these differences.

Fig. 2. Head-pose variation sequence for each mode.

Path and Stairs Outdoor. In these two modes the videos were acquired simultaneously from different points of view by three different cameras: Samsung S9+, iPhone 10 and Samsung Galaxy A5. Furthermore, not all subjects are present. These videos have been created specifically to test re-identification and action recognition algorithms. Our aim is for these sequences to simulate a video surveillance camera acquisition, so that once the face (or the gait) is acquired it is possible to re-identify the subject and trace them along the whole path. Furthermore, the action of "going up the stairs" allows action recognition algorithms to be applied in order to predict whether a subject is going up the stairs or walking. Examples of frames extracted from these videos are shown in Fig. 3.

\(\varvec{180^\circ }\) Head Video. The facial video sequences were acquired under the most favorable lighting conditions: indoor with the lights on. There are 62 sequences, one for each subject, captured at less than a meter from the face by rotating the camera 180\(^\circ \) around the head: from the left ear to the right ear. The subjects were asked to sit on a chair placed in a room with a white panel behind them, and the operator then made the video by turning around the subject. This mode has the purpose of acquiring facial details and, consequently, can be used to analyze facial traits that require a high image resolution, such as iris, ear, profile, nose, mouth, and periocular area (Fig. 3).

Fig. 3. Different outdoor sequences.

2.2 Additional Metadata

Additional related information, such as 3D data extracted from the videos, is included in our dataset. From each 3D model it was possible to extract the pitch, yaw and roll rotation of the face for head pose estimation.

3D Model and Head Pose Estimation Data. From the videos in "(EC7) 180\(^\circ \) head video" we reconstructed the 3D model of the head in .obj format, available within the derived files attached. From the 3D model of the head, suitably processed with the Blender software, head images were extracted at all pitch, yaw and roll poses in 5\(^\circ \) steps within the following ranges of values:

  • Pitch (−30\(^\circ \); +30\(^\circ \));

  • Yaw (−40\(^\circ \); +40\(^\circ \));

  • Roll (−20\(^\circ \); +20\(^\circ \)).

For the 62 subjects, 137,826 images were therefore extracted. Figure 4 shows a subset of 25 images of the head pose estimation of subject 62.
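For reference, a pose grid like the one described above could be rendered from the .obj head models with Blender's Python API. The following is only a minimal sketch of such a procedure, assuming the head mesh has already been imported as the active object and that pitch, yaw and roll map to the X, Y and Z Euler axes; the actual rendering pipeline used for the dataset may differ.

```python
import math
import bpy  # Blender's Python API; to be run inside Blender

head = bpy.context.active_object   # assumes the .obj head mesh is the active object
scene = bpy.context.scene

# 5-degree steps over the pitch/yaw/roll ranges listed above
for pitch in range(-30, 31, 5):
    for yaw in range(-40, 41, 5):
        for roll in range(-20, 21, 5):
            # assumed axis convention: pitch -> X, yaw -> Y, roll -> Z
            head.rotation_euler = (math.radians(pitch),
                                   math.radians(yaw),
                                   math.radians(roll))
            scene.render.filepath = f"//poses/p{pitch}_y{yaw}_r{roll}.png"
            bpy.ops.render.render(write_still=True)
```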

Fig. 4. Above, some examples of images extracted from the 3D model of subject 62. At the bottom, the pitch, yaw and roll degrees of the head pose corresponding to each position in the table.

Landmark Extraction Data. For each video frame, except for the videos in "(EC7) 180\(^\circ \) head video", the 2D body pose landmarks and the 68 face landmarks were extracted using the OpenPose software [38]. These data are useful for gait analysis and for performing action recognition.
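As an example of how these annotations can be used, the sketch below computes the per-frame neck-to-nose distance analyzed in Fig. 2 from OpenPose JSON keypoint files, assuming the default BODY_25 layout (keypoint 0 = nose, keypoint 1 = neck) and one JSON file per frame; the file organization is an assumption, not the dataset's actual layout.

```python
import json
import math
from pathlib import Path

def neck_nose_distances(keypoint_dir):
    """Neck-to-nose distance per frame from OpenPose BODY_25 JSON files.

    Assumes one JSON file per frame (sorted by name) and uses the first
    detected person in each frame; frames without detections yield None.
    """
    distances = []
    for path in sorted(Path(keypoint_dir).glob("*.json")):
        people = json.loads(path.read_text()).get("people", [])
        if not people:
            distances.append(None)
            continue
        kp = people[0]["pose_keypoints_2d"]  # flat list [x0, y0, c0, x1, y1, c1, ...]
        nose_x, nose_y = kp[0], kp[1]        # BODY_25 index 0 = nose
        neck_x, neck_y = kp[3], kp[4]        # BODY_25 index 1 = neck
        distances.append(math.hypot(nose_x - neck_x, nose_y - neck_y))
    return distances
```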

3 Conclusion

Gotcha-I is a multiview dataset built to meet the needs of surveillance data from body worn cameras. With a total of 62 subjects in 11 different modalities, Gotcha-I is particularly suitable for tasks such as people tracking and recognition. The high definition and the full bodies visible in the videos allow different types of biometric traits to be exploited, both physical and behavioral. Compared to the other datasets in the literature, Gotcha-I presents a remarkable difference between cooperative and non cooperative modalities, allowing the analysis of how different state-of-the-art algorithms respond to these data. Additional contents, such as the 3D model of each subject, face and body coordinates and annotated head pose images, make our dataset very versatile in terms of possible testable applications.