1 Introduction

In the past few years, the problem of automatically recognizing human activities in videos has emerged as an important field and attracted many researchers in the vision community. The problem is challenging because, in general, the videos may have been shot in an unconstrained environment where the camera is moving, the background is cluttered, or the camera viewpoint varies. These factors alone make the recognition of human activities difficult, even before considering occlusions or the variations with which different subjects perform the same activity. Nevertheless, much progress has been made toward the automatic understanding of human activities. On one hand, many approaches (e.g., feature representations and models) have been proposed, which address the problem to some degree. On the other hand, many benchmark datasets consisting of activity video sequences have been collected and published. Different from other chapters, which focus on activity representation and modeling, this chapter surveys the publicly available benchmarks and summarizes the state-of-the-art performances reported so far. Ideally, a good benchmark should approximate realistic situations as closely as possible by incorporating video sequences with unrestricted camera motion, different scene contexts, different degrees of background clutter, and different camera perspectives. It should also contain video sequences of multiple subjects performing different activities in order to evaluate the robustness of activity recognition algorithms to the intra-class variations of human activities. In what follows, we also analyze each dataset by these criteria. When summarizing the performances, we only report the best number achieved. The train/test splits used in these works follow either a leave-one-out or a leave-one-actor-out procedure: in the former, testing is done on one sequence while training on the rest; in the latter, testing is done on the sequences performed by one actor while training on those of the remaining actors. The performance is reported as the average across the testing results.
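To make the two protocols concrete, the following Python sketch (using scikit-learn) runs a leave-one-actor-out evaluation on randomly generated placeholder features, labels, and actor ids; it only illustrates the protocol, not the pipeline used in any of the surveyed works.

```python
# A minimal sketch of leave-one-actor-out evaluation; the features, labels,
# and actor ids below are random placeholders standing in for real descriptors.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_clips, n_features, n_actors, n_classes = 90, 64, 9, 10
X = rng.normal(size=(n_clips, n_features))            # one descriptor per clip
y = rng.integers(0, n_classes, size=n_clips)          # activity label per clip
actors = np.repeat(np.arange(n_actors), n_clips // n_actors)  # actor id per clip

accuracies = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=actors):
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

# The reported number is the accuracy averaged over the held-out actors.
print("leave-one-actor-out accuracy: %.3f" % np.mean(accuracies))
```

Plain leave-one-out follows the same pattern with one held-out sequence per fold instead of one held-out actor.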

2 Single View Activity Benchmarks with Cleaner Background

2.1 The KTH and the Weizmann Dataset

The KTH dataset [40] and the Weizmann dataset [10] are two widely used standard datasets, which consist of videos of different human activities performed by different subjects. The KTH dataset is published by Schuldt et al. [40] in order to benchmark their proposed motion features. It contains six types of human activities (walking, jogging, running, boxing, hand waving, and hand clapping), performed by 25 actors in four different scenarios, resulting in 600 sequences, each with a spatial resolution of 160×120 pixels and a frame rate of 25 frames per second. The other standard benchmark, the Weizmann dataset [10], contains 10 types of activities (walking, running, jumping, galloping sideways, bending, one-hand waving, two-hand waving, jumping in place, jumping jack, and skipping), each performed by nine actors, resulting in 90 video sequences, each with a spatial resolution of 180×144 pixels and a frame rate of 50 frames per second. The background is static and clean with no camera motion, and each sequence is about three seconds long.

Since these datasets were originally published to validate proposed space–time features, they are easier than others: the background is clean and static, the camera perspective is mostly frontal, and the camera is mostly still, although the KTH dataset contains a certain degree of camera zooming. They have therefore been criticized for not being a realistic sampling of actions in the real world. Nevertheless, many researchers use them to validate newly proposed algorithms, and most state-of-the-art activity recognition algorithms have already achieved higher than 90% accuracy on these two datasets. Below we summarize the published results on both datasets in Tables 20.1 and 20.2. For these two datasets, leave-one-actor-out evaluation is typically used; hence, the training/testing split is 24:1 for the KTH dataset and 8:1 for the Weizmann dataset.

Table 20.1 Performances on the KTH dataset in average accuracy
Table 20.2 Performances on the Weizmann dataset in average accuracy

2.2 The University of Rochester Activity of Daily Living Dataset

Messing et al. [31] publish an activity of daily living dataset. The dataset is created in order to approximate daily activities people might perform. The full list of activities is: answering a phone, dialing a phone, looking up a phone number in a telephone directory, writing a phone number on a whiteboard, drinking a glass of water, eating snack chips, peeling a banana, eating a banana, chopping a banana, and eating food with silverware, all of which are ordinary activities people often perform. These activities are performed three times by five different people of different shapes, sizes, genders, and ethnicities, giving large appearance variations even for the same activity. The resolution is 1280×720 at 30 frames per second. Video sequences last between 10 and 60 seconds, ending when the activity is completed. Table 20.3 compares the performances on the University of Rochester activity of daily living dataset using different features. The evaluation follows the leave-one-actor-out procedure.

Table 20.3 Performances on the UR ADL dataset in average accuracy

The evaluation consists of training on all repetitions of activities by four of the five subjects and testing on all repetitions of the fifth subject's activities. This leave-one-actor-out testing is averaged over the performance with each left-out subject.

2.3 Other Datasets

Other than the aforementioned datasets, Tran et al. [44] compose a UIUC activity dataset consisting of 532 high-resolution (1024×768) sequences of 14 activities performed by 8 different actors with extensive repetition. Each sequence lasts for 10∼15 seconds. They achieve an accuracy of 99.06% using their proposed metric learning method.

Another closely related source of datasets is the PETS (Performance Evaluation of Tracking and Surveillance) workshop [15], which releases high-resolution surveillance footage every year. Portions of the released datasets are used as benchmarks for human activity recognition algorithms. For example, Ribeiro et al. [36] report a 94% accuracy on the PETS04-CAVIAR dataset [13], which includes single-person activities such as people fighting, walking, or being immobile.

3 Single View Activity Benchmarks with Cluttered Background

3.1 The CMU Soccer Dataset and Crowded Videos Dataset

Different from the datasets introduced in the previous section, where the video sequences contain little or no background clutter, both the CMU soccer dataset and the CMU crowded videos dataset are made to introduce cluttered backgrounds. In [7], Efros et al. record several minutes of a World Cup football game. The dataset consists of walking and running activities in different directions, giving a total of seven activities and around 5000 frames. Although the video sequences are recorded from TV programs, providing a resolution of 640×480, the dataset is challenging in that each human figure is only about 30 pixels tall on average; hence, fine-scale human pose estimation is not possible, making motion the only usable cue. Moreover, other moving humans in the background can occlude the target subject. Table 20.4 summarizes the performances reported using the leave-one-out procedure on this dataset. It suggests that putting a hierarchy or a generative model on top of the raw motion features can improve the performance by a margin of about 10%.

Table 20.4 Performances on the soccer dataset in accuracy

Ke et al. [19] collect video sequences of activities in crowded scenes to evaluate their proposed volumetric features, which are space–time templates for particular activities. These videos are recorded using a hand-held camera. Each activity is performed by three to six subjects, resulting in 110 activities of interest. The videos are downscaled to a resolution of 160×120. There is high variability both in how the subjects perform the activities and in the background clutter, as well as significant spatial and temporal scale differences across the activities. Table 20.5 compares the performances of the state-of-the-art approaches. The performance gain of the latter two approaches comes from the incorporation of temporal features, for example, the time-series representation in [3]. Since these approaches are template-based, the evaluation consists of training on sequences performed by one actor while testing on the rest, in order to test how well the templates generalize.

Table 20.5 Performances on the CMU crowded videos dataset in Area under ROC Curve (AUROC)
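The area under the ROC curve used in Table 20.5 can be computed as in the short Python sketch below; the window labels and template-matching scores are hypothetical placeholders, not values from the actual dataset.

```python
# Minimal sketch of the AUROC metric; labels and scores are made-up placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: 1 where a candidate window truly contains the target activity, 0 otherwise.
# y_score: template-matching score assigned to each candidate window.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.4, 0.35, 0.8, 0.5, 0.65, 0.2, 0.3])

print("AUROC: %.3f" % roc_auc_score(y_true, y_score))
```

AUROC summarizes detection quality over all score thresholds, which is why it is a natural metric for template-based detectors that output a continuous matching score.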

3.2 The University of Maryland Gesture Dataset

Lin et al. [24] publish a UM gesture dataset consisting of 14 different gesture classes, which are a subset of military signals. The gestures include "turn left", "turn right", "attention left", "attention right", "flap", "stop left", "stop right", "stop both", "attention both", "start", "go back", "close distance", "speed up" and "come near". The dataset is collected using a color camera with 640×480 resolution. Each activity is performed by three people, three times each, giving 126 video sequences for training, which are captured using a fixed camera with the person viewed against a simple, static background.

There are 168 video sequences for testing, which are captured from a moving camera in the presence of background clutter and other moving objects. Lin et al. [24] use their proposed prototype tree to achieve an accuracy of 91.07%. Brendel et al. [3] achieve 96.3% using time-series modeling, while Tran et al. [44] achieve a 100% accuracy. Note that since this dataset focuses on military signals, it might not be a suitable benchmark for generic activity recognition.

4 Multi-view Benchmarks

The aforementioned benchmarks only provide video sequences from a single camera perspective. In real life, it might be desirable to have a multi-camera configuration, for example, in surveillance applications. In what follows, we introduce two datasets consisting of activities from different perspectives.

4.1 The University of Central Florida Sports Dataset

Rodriguez et al. [37] publish a dataset consisting of a set of actions collected from various sports that are typically featured on broadcast television channels such as the BBC and ESPN. It contains over 200 video sequences at a resolution of 720×480 and covers nine sport activities: diving, golf swinging, kicking, lifting, horseback riding, running, skating, swinging a baseball bat, and pole vaulting. These activities are featured in a wide range of scenes and viewpoints. Table 20.6 summarizes the published results using the leave-one-out procedure on this dataset. Note that the space–time MACH filter [37] is a template matching approach. Its relatively low accuracy suggests that model-based approaches capture intra-class variability better when the camera viewpoint varies.

Table 20.6 Performances on the UCF sports dataset in average accuracy

Following the sports dataset, Yeffet et al. [57] publish a dataset of UFC videos taken from TV programs. UFC is a fighting sport similar to boxing; the viewpoints and the fighters' appearance therefore vary considerably, and there is persistent camera motion. In addition, the two fighters act at the same time and can occlude each other. The dataset contains over 20 minutes of broadcast video, and two target activities are defined: the throw/take-down action and the knee-kick action, both of which occur rarely in UFC matches, making the dataset quite different from other sports datasets. One merit of this dataset is that the target activities are relevant to surveillance applications, as they occur rarely and are similar to one person hitting another.

4.2 The INRIA Multi-view Dataset

To the best of our knowledge, the multi-view dataset published by Weinland et al. [53] is the only known large-scale multi-view dataset that provides synchronized video sequences from multiple cameras for each activity. They use multiple cameras to record 13 activities such as "walk", "sit down", "check watch", etc. Each activity is performed by multiple actors. The camera array provides five synchronized views at a resolution of 390×291 with a frame rate of 23 frames per second. Each sequence lasts a few seconds. Weinland et al. [52] demonstrate that by fusing views from multiple cameras, the accuracy can be greatly improved. Table 20.7 summarizes the performances reported so far. Note that Weinland et al. [52] use the information from all views, while others, [28] and [44], use only one of the views. The evaluation follows the leave-one-actor-out procedure.

Table 20.7 Performances on the multi-view dataset in average accuracy
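To make the idea of multi-view fusion more concrete, the following Python sketch averages per-view classifier scores before picking a class. This is only a simple score-level fusion under assumed inputs (five views, 13 classes, random scores), not necessarily the fusion scheme used by Weinland et al. [52].

```python
# A hedged sketch of score-level fusion across synchronized camera views;
# the per-view scores here are random placeholders.
import numpy as np

def fuse_views(view_scores):
    """view_scores: list of (n_classes,) score arrays, one per camera view.
    Returns the predicted class after averaging the per-view scores."""
    fused = np.mean(np.stack(view_scores, axis=0), axis=0)
    return int(np.argmax(fused))

rng = np.random.default_rng(0)
scores = [rng.random(13) for _ in range(5)]  # 5 hypothetical views, 13 classes
print("fused prediction:", fuse_views(scores))
```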

5 Benchmarks with Real World Footages

The datasets discussed thus far, except for the UCF Sports Dataset, consist of video sequences in which human actors perform different activities; these datasets are therefore made in a more controlled environment. In this section, we discuss datasets consisting of video sequences extracted from real world sources such as movies or the Internet. Since there is no limitation on how these video sequences are made, these datasets are more difficult: the videos can contain occlusions and background clutter, and can be shot with different camera perspectives and camera motion.

5.1 The University of Central Florida Youtube Dataset

Liu et al. [26] collect video sequences from YouTube and compile a dataset consisting of 11 activities, with a total of 1168 sequences. These activities include basketball shooting (b_shooting), volleyball spiking (v_spiking), trampoline jumping (t_jumping), soccer juggling (s_juggling), horseback riding (h_riding), cycling, diving, swinging, golf swinging (g_swinging), tennis swinging (t_swinging), and walking (with a dog). Due to the diverse nature of the video sources, these sequences contain significant camera motion, background clutter, occlusions, and variations in subject appearance, illumination, and viewpoint. All the sequences are low-resolution videos (240×320) with a frame rate of 15 frames per second, and each activity is about 3∼5 seconds long. Table 20.8 summarizes the published results using the leave-one-out procedure on the YouTube dataset.

Table 20.8 Performances reported on the YouTube dataset in recognition accuracy

5.2 The Hollywood Dataset

In order to provide a realistic benchmark in an unconstrained environment, Laptev et al. [22] initiate an effort by creating a dataset of video sequences extracted from two episodes of the movie "Coffee and Cigarettes", providing a pool of examples for atomic actions such as "drinking" and "smoking", where each atomic event ranges from 30 to 200 frames in length, with a mean of 70 frames. They show on a ∼36000-frame test set that combining a frame-based classifier with a space–time classifier improves the precision of action detection by a 30%∼40% margin at the same recall. Similarly, Rodriguez et al. [37] publish a kissing/slapping dataset consisting of ∼200 sequences from several movies and achieve ∼66% accuracy using a template-based approach.

Laptev et al. [23] later create the Hollywood-1 dataset by extracting eight different actions (answer phone, get out of car, hug person, sit up, sit down, kiss, handshake, and stand up) from various movies. The dataset consists of ∼400 video sequences. Each sequence is about 50∼200 frames long with a resolution of 240×500 and a frame rate of 24 frames per second. Using a combination of multi-scale flow and shape features, they achieve a 30%∼50% average precision for each action class. Marszałek et al. [29] subsequently create the Hollywood-2 dataset by augmenting Hollywood-1 to include twelve activities with a total of 600K frames. The scene information is also annotated. They achieve an average precision of 35.5% by incorporating context, i.e. the scene information. Both the Hollywood-1 and Hollywood-2 datasets come with a clean training set and a test set of roughly equal size (about 200 sequences).

Overall, the Hollywood datasets pose a great challenge to activity recognition: the camera views differ from sequence to sequence, the background is cluttered, multiple subjects are present, occlusions occur very often, and the intra-class variability is large, all of which make recognition hard. Tables 20.9 and 20.10 summarize the reported performances on the Hollywood-1 and Hollywood-2 datasets. As we can see from the tables, there is still huge room for improvement. Gilbert et al. [9] achieve the current state of the art by mining the spatio-temporal relationships between space–time interest points.

Table 20.9 Performances on Hollywood-1 datasets in average precision
Table 20.10 Performances on Hollywood-2 datasets in average precision
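The average precision numbers in Tables 20.9 and 20.10 are computed per action class and then averaged. The following Python sketch, using scikit-learn with hypothetical labels and scores for two invented classes, illustrates the computation; it is not the evaluation script released with the datasets.

```python
# Sketch of per-class average precision (AP) and mean AP; all inputs are placeholders.
import numpy as np
from sklearn.metrics import average_precision_score

# For each action class: binary ground truth over test clips and classifier scores.
per_class = {
    "AnswerPhone": (np.array([1, 0, 0, 1, 0]), np.array([0.7, 0.2, 0.4, 0.9, 0.1])),
    "Kiss":        (np.array([0, 1, 0, 0, 1]), np.array([0.3, 0.8, 0.2, 0.5, 0.6])),
}

aps = {name: average_precision_score(y, s) for name, (y, s) in per_class.items()}
for name, ap in aps.items():
    print("%s AP: %.3f" % (name, ap))
print("mean AP: %.3f" % np.mean(list(aps.values())))
```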

5.3 The Olympic Dataset

Recently, Niebles et al. [33] publish the Olympic Sports Dataset. The dataset contains 50 videos of each of the following 16 activities: high jump, long jump, triple jump, pole vault, discus throw, hammer throw, javelin throw, shot put, basketball layup, bowling, tennis serve, platform (diving), springboard (diving), snatch (weightlifting), clean and jerk (weightlifting), and vault (gymnastics). These sequences, obtained from YouTube, contain severe occlusions, camera movements, and compression artifacts. In contrast to other sports datasets such as the UCF Sports Dataset [37], which contains periodic or simple activities such as walking, running, golf swinging, and ball kicking, the activities in the Olympic Sports Dataset are longer and more complex. Niebles et al. [33] achieve an accuracy of 72% by modeling the temporal structure of these activities.

6 Benchmarks with Multiple Activities

The benchmarks introduced so far focus more on "activity recognition", i.e. the video sequences are typically pre-segmented and contain only one activity each. For activity detection algorithms, i.e. algorithms that find all possible activities in a video sequence, it is desirable to have benchmarks whose video sequences contain multiple activities; this is especially beneficial for surveillance applications. Uemura et al. [46] publish a Multi-KTH dataset consisting of the same activities as the KTH dataset. The video sequences have a resolution of 640×480, but unlike in the KTH dataset, a single sequence can contain multiple activities performed simultaneously and the camera is constantly moving. By tracking space–time interest points, Uemura et al. [46] achieve an average precision of 65.4%, while Gilbert et al. [9] achieve 75.2% by data mining the space–time features. Table 20.11 summarizes the performances for each activity in terms of average precision.

Table 20.11 Performances reported on the multi-KTH dataset in average precision

In a similar setting to [19], Yuan et al. [58] publish an MSR-1 dataset containing 16 video sequences with a total of 63 actions: 14 hand clapping, 24 hand waving, and 25 boxing actions, performed by 10 subjects. Each sequence contains multiple types of actions, and some sequences contain actions performed by different people. There are both indoor and outdoor scenes, and all of the video sequences are captured with cluttered and moving backgrounds. Each video has a low resolution of 320×240 and a frame rate of 15 frames per second, and the lengths range from 32 to 76 seconds. An extended MSR-2 dataset consisting of 54 video sequences is also available [4]. Yuan et al. [58] report a 57% recall and an 87.5% precision.

Other than the aforementioned datasets, TRECVID [42] is an annual event detection challenge aimed at realistic activity retrieval problems. The dataset is updated each year and consists of videos from multiple surveillance cameras deployed at London Gatwick airport. For example, the goal of the 2009 challenge was to detect several target events, including "ElevatorNoEntry", "OpposingFlow" (moving in the opposite direction), "PersonRuns", "Pointing", "CellToEar", "ObjectPut", "TakePicture", "Embrace", "PeopleMeet", and "PeopleSplitUp". The dataset is challenging in that, unlike the sequences in the previous datasets where activities are repetitive, most of the target events in TRECVID are rare and subtle. For example, detecting the activity "CellToEar" or "PersonRuns" in unconstrained video sequences is extremely difficult. Also, the sequences always have cluttered backgrounds, which can include moving people, resulting in complicated occlusion scenarios. The intra-class variations of each activity are also huge, since each person performs the same activity differently. The evaluation is done using the Detection Error Tradeoff (DET) curve, a trade-off curve between the miss rate and the false alarm rate. The state-of-the-art approach still has a 90% miss rate when the false alarm rate is kept at 20 per hour, and the miss rate only drops to ∼80% when the false alarm rate is allowed to reach 100 per hour, an indication of the difficulty of the dataset.
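The DET-based evaluation can be made concrete with a small sketch. The Python snippet below, with hypothetical detection scores, ground-truth matches, and video duration, computes one operating point (miss rate versus false alarms per hour) per threshold; sweeping the threshold traces out a DET-style curve. It is only an illustration, not the official TRECVID scoring tool.

```python
# Hedged sketch of a DET-style operating point; all inputs are hypothetical.
import numpy as np

def det_point(scores, is_true_event, hours_of_video, threshold):
    """scores: detector confidence per candidate detection;
    is_true_event: 1 if the candidate matches an annotated event, else 0."""
    fired = scores >= threshold
    hits = np.sum(fired & (is_true_event == 1))
    false_alarms = np.sum(fired & (is_true_event == 0))
    n_events = np.sum(is_true_event == 1)
    miss_rate = 1.0 - hits / float(n_events)
    fa_per_hour = false_alarms / float(hours_of_video)
    return miss_rate, fa_per_hour

scores = np.array([0.9, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
is_true = np.array([1,   0,   1,   0,    0,   1,   0])
for t in (0.5, 0.3):
    mr, fa = det_point(scores, is_true, hours_of_video=2.0, threshold=t)
    print("threshold %.2f: miss rate %.2f, false alarms/hour %.1f" % (t, mr, fa))
```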

7 Other Benchmarks

Other than recognizing single-subject kinematic activities, researchers have recently tried to extend activity recognition to a broader context. For example, Prabhakar et al. [34] use temporal causality to detect activities that involve interactions among people. They evaluate their approach on a toy dataset consisting of sequences of ball playing activities ("roll-ball", "throw-ball", and "kick-ball") and on a child play dataset [48] consisting of social games, such as pattycake between an adult and a child, achieving 60%∼70% accuracy. They also report results on the "HandShake" class from the Hollywood dataset [29] for a more realistic evaluation. Another dataset that involves human interactions is the PETS07-BEHAVE dataset [14], consisting of video sequences of 640×480 resolution. The activities include walking together, splitting, approaching, fighting, chasing, and so on.

Another category of activities that attracts many research works involves object manipulation. The recognition of object-manipulation activities finds applications, for example, in Programming by Demonstration in robotics and in flow optimization for factory workers; experimental protocols for laboratory technicians and recipes for home cooks are other example tasks. Moreover, in object recognition, more and more context information is brought in to help recognize objects, and the way an object is manipulated or held significantly constrains the category of the object. Conversely, the object class also affects how the object can be grasped or manipulated and which activities can be performed on it.

Gupta et al. [11] collect a sports image dataset consisting of five activities: "Cricket bowling", "Croquet shot", "Tennis forehand", "Tennis serve", and "Volleyball smash", each with 50 images. They report a 78.9% accuracy, while more recently Yao et al. [55] achieve a recognition rate of 83.3% by jointly modeling the activity, the body pose, and the manipulated object.

Similarly, Yao et al. [54] publish an instrument-playing dataset covering seven different musical instruments: bassoon, erhu, flute, French horn, guitar, saxophone, and violin. Each class includes ∼150 images of people playing the corresponding instrument. They achieve an accuracy of 65.7% using their proposed Grouplet features, an extension of local interest point features that takes neighboring relationships into account.

Kjellstrom et al. [20] collect the OAC (Object–Action-Complex) dataset. The dataset consists of 50 instances of each of three different action–object combinations: "look through binoculars", "drink from cup", and "pour from pitcher". The activities are performed by 10 subjects, 5 times each. The classes are selected so that two of the activities, "look through" and "drink from", are similar, while two of the objects, "cup" and "pitcher", are similar as well. They report a best error rate of 6% by jointly inferring the activity and the manipulated object using a CRF.

Another closely related work is the HumanEva datasets [41]. These datasets contain video sequences of six simple activities performed by four to six subjects wearing motion sensors. In addition to the videos, the datasets also provide the corresponding measurements from the motion capture system in order to evaluate human pose estimation and articulated tracking algorithms.

Tables 20.12 and 20.13 summarize different properties, such as resolution, activities, and degree of background clutter, of the major benchmarking datasets. As we can see from the tables, the numbers reported on the standard activity recognition datasets such as the KTH dataset [40] are saturated, mostly above 90%. On the other hand, there is still huge room for improvement on the realistic and multi-activity datasets, such as the Hollywood datasets [23, 29], the MSR dataset [58], and TRECVID [42]. This suggests that more sophisticated methods are needed to address the problems of cluttered backgrounds and of representing activities at finer scales.

Table 20.12 Summary of all the datasets. “r” indicates that the dataset was made out of realistic videos. “v” indicates the dataset consists of video sequences with various perspectives. The performance is reported in average accuracy unless otherwise specified. The columns are dataset names, number of activities, number of actors, resolution of the videos (res.), and camera views
Table 20.13 Summary of all the datasets. “r” indicates that the dataset was made out of realistic videos. “v” indicates the dataset consists of video sequences with various perspectives. The performance is reported in average accuracy unless otherwise specified. The columns are dataset names, degree of background clutter (bg clutter), camera motion (c_motion), and the state-of-the-art performances

8 Conclusions

In this chapter, we have covered the state-of-the-art benchmarking datasets for human activity recognition algorithms, ranging from the standard KTH dataset [40] to the realistic Hollywood [23, 29] and TRECVID [42] datasets. To conclude, datasets such as the KTH dataset [40] and the Weizmann dataset [10], on which state-of-the-art approaches have already achieved above 90% accuracy, provide benchmarks in a more controlled environment, while the YouTube dataset [26], the Hollywood datasets [23, 29], and the TRECVID dataset [42] better approximate realistic situations and pose great challenges to human activity recognition algorithms. Datasets with videos containing multiple activities, such as the MSR dataset [58], provide suitable benchmarks for activity detection techniques, which are still few in number, as most human activity recognition techniques assume pre-segmented video sequences. The properties of these major benchmarking datasets are summarized in Tables 20.12 and 20.13. We hope that by summarizing the state-of-the-art numbers, researchers will be able to use them as baselines and report improved results on top of them.

A dataset that is presently lacking is one that contains human actions together with information on the action context as well as on the objects involved in the actions. This need was also outlined in Chap. 18, where the reader may find a more detailed discussion.

8.1 Further Readings

We refer interested readers to Turaga et al. [45] for general topics in human activity recognition. For empirical methods and evaluation methodologies in Computer Vision, Henrik et al. [6] and Venkata et al. [47] both cover the design of experiments and benchmarks for various topics in the field. Interested readers can also see [38] and [59] for information about providing ground-truth labeling.