1 Introduction

In the past few years, the problem of automatically recognizing human activities in videos has emerged as an important field and attracted many researchers in the vision community. The problem is challenging because, in general, the videos may have been shot in an unconstrained environment where the camera is moving, the background is cluttered, or the camera viewpoint varies. These factors alone make the recognition of human activities difficult, even before considering occlusions or the variations with which different subjects perform the same activity. Nevertheless, much progress has been made toward the automatic understanding of human activities. On one hand, many approaches (e.g., feature representations and models) have been proposed, which address the problem to some degree. On the other hand, many benchmark datasets consisting of activity video sequences have been collected and published. Different from other chapters, which focus on activity representation and modeling, this chapter surveys the publicly available benchmarks and summarizes the state-of-the-art performances reported so far. Ideally, a good benchmark should approximate realistic situations as closely as possible by incorporating video sequences with unrestricted camera motion, different scene contexts, different degrees of background clutter, and different camera perspectives. It should also contain video sequences of multiple subjects performing different activities in order to evaluate the robustness of activity recognition algorithms to the intra-class variations of human activities. In what follows, we also analyze each dataset by these criteria. When summarizing the performances, we only report the best number achieved. The train/test splits used in these works follow either a leave-one-out or a leave-one-actor-out procedure: in the former, testing is done on one sequence while training on the rest; in the latter, testing is done on the sequences performed by one actor while training on those of the remaining actors. The performance is reported as the average across the testing results.
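To make the two protocols concrete, the following Python sketch (using scikit-learn) runs a leave-one-actor-out evaluation on randomly generated placeholder features, labels, and actor ids; it only illustrates the protocol, not the pipeline used in any of the surveyed works.

```python
# A minimal sketch of leave-one-actor-out evaluation; the features, labels,
# and actor ids below are random placeholders standing in for real descriptors.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_clips, n_features, n_actors, n_classes = 90, 64, 9, 10
X = rng.normal(size=(n_clips, n_features))            # one descriptor per clip
y = rng.integers(0, n_classes, size=n_clips)          # activity label per clip
actors = np.repeat(np.arange(n_actors), n_clips // n_actors)  # actor id per clip

accuracies = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=actors):
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

# The reported number is the accuracy averaged over the held-out actors.
print("leave-one-actor-out accuracy: %.3f" % np.mean(accuracies))
```

Plain leave-one-out follows the same pattern with one held-out sequence per fold instead of one held-out actor.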

2 Single View Activity Benchmarks with Cleaner Background

2.1 The KTH and the Weizmann Dataset

The KTH dataset [40] and the Weizmann dataset [10] are two widely used standard datasets, which consist of videos of different human activities performed by different subjects. The KTH dataset is published by Schuldt et al. [40] in order to benchmark their proposed motion features. It contains six types of human activities (walking, jogging, running, boxing, hand waving, and hand clapping), performed by 25 actors in four different scenarios, resulting in 600 sequences, each with a spatial resolution of 160×120 pixels and a frame rate of 25 frames per second. The other standard benchmark, the Weizmann dataset [10], contains 10 types of activities (walking, running, jumping, galloping sideways, bending, one-hand waving, two-hand waving, jumping in place, jumping jack, and skipping), each performed by nine actors, resulting in 90 video sequences, each with a spatial resolution of 180×144 pixels and a frame rate of 50 frames per second. The background is static and clean with no camera motion, and each sequence is about three seconds long.

Since these datasets were originally published to validate proposed space–time features, they are easier than others: the background is clean and static, the camera perspective is mostly frontal, and the camera is mostly still, although the KTH dataset contains a certain degree of camera zooming. They have therefore been criticized for not being a realistic sampling of actions in the real world. Nevertheless, many researchers use them to validate newly proposed algorithms, and most state-of-the-art activity recognition algorithms have already achieved higher than 90% accuracy on these two datasets. Below we summarize the published results on both datasets in Tables 20.1 and 20.2. For these two datasets, leave-one-actor-out evaluation is typically used; hence, the training/testing split is 24:1 for the KTH dataset and 8:1 for the Weizmann dataset.

Table 20.1 Performances on the KTH dataset in average accuracy
Table 20.2 Performances on the Weizmann dataset in average accuracy

2.2 The University of Rochester Activity of Daily Living Dataset

Messing et al. [31] publish an activity of daily living dataset. The dataset is created in order to approximate daily activities people might perform. The full list of activities is: answering a phone, dialing a phone, looking up a phone number in a telephone directory, writing a phone number on a whiteboard, drinking a glass of water, eating snack chips, peeling a banana, eating a banana, chopping a banana, and eating food with silverware, all of which are ordinary activities people often perform. These activities are performed three times by five different people of different shapes, sizes, genders, and ethnicities, giving large appearance variations even for the same activity. The resolution is 1280×720 at 30 frames per second. Video sequences last between 10 and 60 seconds, ending when the activity is completed. Table 20.3 compares the performances on the University of Rochester activity of daily living dataset using different features. The evaluation follows the leave-one-actor-out procedure.

Table 20.3 Performances on the UR ADL dataset in average accuracy

The evaluation consists of training on all repetitions of activities by four of the five subjects and testing on all repetitions of the fifth subject's activities. This leave-one-actor-out testing is averaged over the performance with each left-out subject.

2.3 Other Datasets

Other than the aforementioned datasets, Tran et al. [44] compose a UIUC activity dataset consisting of 532 high-resolution (1024×768) sequences of 14 activities performed by 8 different actors with extensive repetition. Each sequence lasts for 10∼15 seconds. They achieve an accuracy of 99.06% using their proposed metric learning method.

Another closely related source of datasets is the PETS (Performance Evaluation of Tracking and Surveillance) workshop [15], which releases high-resolution surveillance footage every year. Portions of the released datasets are used as benchmarks for human activity recognition algorithms. For example, Ribeiro et al. [36] report a 94% accuracy on the PETS04-CAVIAR dataset [13], which includes single-person activities such as people fighting, walking, or being immobile.

3 Single View Activity Benchmarks with Cluttered Background

3.1 The CMU Soccer Dataset and Crowded Videos Dataset

Different from the datasets introduced in the previous section, where the video sequences contain little or no background clutter, both the CMU soccer dataset and the CMU crowded videos dataset are made to introduce cluttered backgrounds. In [7], Efros et al. record several minutes of a World Cup football game. The dataset consists of walking and running activities in different directions, giving a total of seven activities and around 5000 frames. Although the video sequences are recorded from TV programs, providing a resolution of 640×480, the dataset is challenging in that each human figure is only about 30 pixels tall on average; hence, fine-scale human pose estimation is not possible, making motion the only usable cue. Moreover, other moving humans in the background can occlude the target subject. Table 20.4 summarizes the performances reported using the leave-one-out procedure on this dataset. It suggests that putting a hierarchy or a generative model on top of the raw motion features can improve the performance by a margin of about 10%.

Table 20.4 Performances on the soccer dataset in accuracy

Ke et al. [19] collect video sequences of activities in crowded scenes to evaluate their proposed volumetric features, which are space–time templates for particular activities. These videos are recorded using a hand-held camera. Each activity is performed by three to six subjects, resulting in 110 activities of interest. The videos are downscaled to a resolution of 160×120. There is high variability both in how the subjects perform the activities and in the background clutter, as well as significant spatial and temporal scale differences across the activities. Table 20.5 compares the performances of the state-of-the-art approaches. The performance gain of the latter two approaches comes from the incorporation of temporal features, for example, the time-series representation in [3]. Since these approaches are template-based, the evaluation consists of training on sequences performed by one actor while testing on the rest, in order to test how well the templates generalize.

Table 20.5 Performances on the CMU crowded videos dataset in Area under ROC Curve (AUROC)
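The area under the ROC curve used in Table 20.5 can be computed as in the short Python sketch below; the window labels and template-matching scores are hypothetical placeholders, not values from the actual dataset.

```python
# Minimal sketch of the AUROC metric; labels and scores are made-up placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: 1 where a candidate window truly contains the target activity, 0 otherwise.
# y_score: template-matching score assigned to each candidate window.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.4, 0.35, 0.8, 0.5, 0.65, 0.2, 0.3])

print("AUROC: %.3f" % roc_auc_score(y_true, y_score))
```

AUROC summarizes detection quality over all score thresholds, which is why it is a natural metric for template-based detectors that output a continuous matching score.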

3.2 The University of Maryland Gesture Dataset

Lin et al. [24] publish a UM gesture dataset consisting of 14 different gesture classes, which are a subset of military signals. The gestures include "turn left", "turn right", "attention left", "attention right", "flap", "stop left", "stop right", "stop both", "attention both", "start", "go back", "close distance", "speed up" and "come near". The dataset is collected using a color camera with 640×480 resolution. Each activity is performed by three people, three times each, giving 126 video sequences for training, which are captured using a fixed camera with the person viewed against a simple, static background.

There are 168 video sequences for testing, which are captured from a moving camera in the presence of background clutter and other moving objects. Lin et al. [24] use their proposed prototype tree to achieve an accuracy of 91.07%. Brendel et al. [3] achieve 96.3% using time-series modeling, while Tran et al. [44] achieve a 100% accuracy. Note that since this dataset focuses on military signals, it might not be a suitable benchmark for generic activity recognition.

4 Multi-view Benchmarks

The aforementioned benchmarks only provide video sequences from a single camera perspective. In real life, it might be desirable to have a multi-camera configuration, for example, in surveillance applications. In what follows, we introduce two datasets consisting of activities from different perspectives.

4.1 The University of Central Florida Sports Dataset

Rodriguez et al. [37] publish a dataset consisting of a set of actions collected from various sports that are typically featured on broadcast television channels such as the BBC and ESPN. It contains over 200 video sequences at a resolution of 720×480 and covers nine sport activities: diving, golf swinging, kicking, lifting, horseback riding, running, skating, swinging a baseball bat, and pole vaulting. These activities are featured in a wide range of scenes and viewpoints. Table 20.6 summarizes the published results using the leave-one-out procedure on this dataset. Note that the space–time MACH filter [37] is a template matching approach. Its relatively low accuracy suggests that model-based approaches capture intra-class variability better when the camera viewpoint varies.

Table 20.6 Performances on the UCF sports dataset in average accuracy

Following the sports dataset, Yeffet et al. [57] publish a dataset of UFC videos taken from TV programs. UFC is a fighting sport similar to boxing; the viewpoints and the fighters' appearance therefore vary considerably, and there is persistent camera motion. In addition, the two fighters act at the same time and can occlude each other. The dataset contains over 20 minutes of broadcast video, and two target activities are defined: the throw/take-down action and the knee-kick action, both of which occur rarely in UFC matches, making the dataset quite different from other sports datasets. One merit of this dataset is that the target activities are relevant to surveillance applications, as they occur rarely and are similar to one person hitting another.

4.2 The INRIA Multi-view Dataset

To the best of our knowledge, the multi-view dataset published by Weinland et al. [53] is the only known large-scale multi-view dataset that provides synchronized video sequences from multiple cameras for each activity. They use multiple cameras to record 13 activities such as "walk", "sit down", "check watch", etc. Each activity is performed by multiple actors. The camera array provides five synchronized views at a resolution of 390×291 with a frame rate of 23 frames per second. Each sequence lasts a few seconds. Weinland et al. [52] demonstrate that by fusing views from multiple cameras, the accuracy can be greatly improved. Table 20.7 summarizes the performances reported so far. Note that Weinland et al. [52] use the information from all views, while others, [28] and [44], use only one of the views. The evaluation follows the leave-one-actor-out procedure.

Table 20.7 Performances on the multi-view dataset in average accuracy
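To make the idea of multi-view fusion more concrete, the following Python sketch averages per-view classifier scores before picking a class. This is only a simple score-level fusion under assumed inputs (five views, 13 classes, random scores), not necessarily the fusion scheme used by Weinland et al. [52].

```python
# A hedged sketch of score-level fusion across synchronized camera views;
# the per-view scores here are random placeholders.
import numpy as np

def fuse_views(view_scores):
    """view_scores: list of (n_classes,) score arrays, one per camera view.
    Returns the predicted class after averaging the per-view scores."""
    fused = np.mean(np.stack(view_scores, axis=0), axis=0)
    return int(np.argmax(fused))

rng = np.random.default_rng(0)
scores = [rng.random(13) for _ in range(5)]  # 5 hypothetical views, 13 classes
print("fused prediction:", fuse_views(scores))
```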

5 Benchmarks with Real World Footages

The datasets discussed thus far, except for the UCF Sports Dataset, consist of video sequences in which human actors perform different activities; these datasets are therefore made in a more controlled environment. In this section, we discuss datasets consisting of video sequences extracted from real world sources such as movies or the Internet. Since there is no limitation on how these video sequences are made, these datasets are more difficult: the videos can contain occlusions and background clutter, and can be shot with different camera perspectives and camera motion.

5.1 The University of Central Florida Youtube Dataset

Liu et al. [26] collect video sequences from YouTube and compile a dataset consisting of 11 activities, with a total of 1168 sequences. These activities include basketball shooting (b_shooting), volleyball spiking (v_spiking), trampoline jumping (t_jumping), soccer juggling (s_juggling), horseback riding (h_riding), cycling, diving, swinging, golf swinging (g_swinging), tennis swinging (t_swinging), and walking (with a dog). Due to the diverse nature of the video sources, these sequences contain significant camera motion, background clutter, occlusions, and variations in subject appearance, illumination, and viewpoint. All the sequences are low-resolution videos (240×320) with a frame rate of 15 frames per second, and each activity is about 3∼5 seconds long. Table 20.8 summarizes the published results using the leave-one-out procedure on the YouTube dataset.

Table 20.8 Performances reported on the YouTube dataset in recognition accuracy

5.2 The Hollywood Dataset

In order to provide a realistic benchmark in an unconstrained environment, Laptev et al. [22] initiate an effort by creating a dataset of video sequences extracted from two episodes of the movie "Coffee and Cigarettes", providing a pool of examples for atomic actions such as "drinking" and "smoking", where each atomic event ranges from 30 to 200 frames in length, with a mean of 70 frames. They show on a ∼36000-frame test set that combining a frame-based classifier with a space–time classifier improves the precision of action detection by a 30%∼40% margin at the same recall. Similarly, Rodriguez et al. [37] publish a kissing/slapping dataset consisting of ∼200 sequences from several movies and achieve ∼66% accuracy using a template-based approach.

Laptev et al. [23] later create the Hollywood-1 dataset by extracting eight different actions (answer phone, get out of car, hug person, sit up, sit down, kiss, handshake, and stand up) from various movies. The dataset consists of ∼400 video sequences. Each sequence is about 50∼200 frames long with a resolution of 240×500 and a frame rate of 24 frames per second. Using a combination of multi-scale flow and shape features, they achieve a 30%∼50% average precision for each action class. Marszałek et al. [29] subsequently create the Hollywood-2 dataset by augmenting Hollywood-1 to include twelve activities with a total of 600K frames. The scene information is also annotated. They achieve an average precision of 35.5% by incorporating context, i.e. the scene information. Both the Hollywood-1 and Hollywood-2 datasets come with a clean training set and a test set of roughly equal size (about 200 sequences).

Overall, the Hollywood datasets pose a great challenge to activity recognition: the camera views differ from sequence to sequence, the background is cluttered, multiple subjects are present, occlusions occur very often, and the intra-class variability is large, all of which make recognition hard. Tables 20.9 and 20.10 summarize the reported performances on the Hollywood-1 and Hollywood-2 datasets. As we can see from the tables, there is still huge room for improvement. Gilbert et al. [9] achieve the current state of the art by mining the spatio-temporal relationships between space–time interest points.

Table 20.9 Performances on Hollywood-1 datasets in average precision
Table 20.10 Performances on Hollywood-2 datasets in average precision
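The average precision numbers in Tables 20.9 and 20.10 are computed per action class and then averaged. The following Python sketch, using scikit-learn with hypothetical labels and scores for two invented classes, illustrates the computation; it is not the evaluation script released with the datasets.

```python
# Sketch of per-class average precision (AP) and mean AP; all inputs are placeholders.
import numpy as np
from sklearn.metrics import average_precision_score

# For each action class: binary ground truth over test clips and classifier scores.
per_class = {
    "AnswerPhone": (np.array([1, 0, 0, 1, 0]), np.array([0.7, 0.2, 0.4, 0.9, 0.1])),
    "Kiss":        (np.array([0, 1, 0, 0, 1]), np.array([0.3, 0.8, 0.2, 0.5, 0.6])),
}

aps = {name: average_precision_score(y, s) for name, (y, s) in per_class.items()}
for name, ap in aps.items():
    print("%s AP: %.3f" % (name, ap))
print("mean AP: %.3f" % np.mean(list(aps.values())))
```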

5.3 The Olympic Dataset

Recently, Niebles et al. [33] publish the Olympic Sports Dataset. The dataset contains 50 videos of each of the following 16 activities: high jump, long jump, triple jump, pole vault, discus throw, hammer throw, javelin throw, shot put, basketball layup, bowling, tennis serve, platform (diving), springboard (diving), snatch (weightlifting), clean and jerk (weightlifting), and vault (gymnastics). These sequences, obtained from YouTube, contain severe occlusions, camera movements, and compression artifacts. In contrast to other sports datasets such as the UCF Sports Dataset [37], which contains periodic or simple activities such as walking, running, golf swinging, and ball kicking, the activities in the Olympic Sports Dataset are longer and more complex. Niebles et al. [33] achieve an accuracy of 72% by modeling the temporal structure of these activities.

6 Benchmarks with Multiple Activities

The benchmarks introduced so far focus more on "activity recognition", i.e. the video sequences are typically pre-segmented and contain only one activity each. For activity detection algorithms, i.e. algorithms that find all possible activities in a video sequence, it is desirable to have benchmarks whose video sequences contain multiple activities; this is especially beneficial for surveillance applications. Uemura et al. [46] publish a Multi-KTH dataset consisting of the same activities as the KTH dataset. The video sequences have a resolution of 640×480, but unlike in the KTH dataset, a single sequence can contain multiple activities performed simultaneously and the camera is constantly moving. By tracking space–time interest points, Uemura et al. [46] achieve an average precision of 65.4%, while Gilbert et al. [9] achieve 75.2% by data mining the space–time features. Table 20.11 summarizes the performances for each activity in terms of average precision.

Table 20.11 Performances reported on the multi-KTH dataset in average precision

In a similar setting to [19], Yuan et al. [58] publish an MSR-1 dataset containing 16 video sequences with a total of 63 actions: 14 hand clapping, 24 hand waving, and 25 boxing actions, performed by 10 subjects. Each sequence contains multiple types of actions, and some sequences contain actions performed by different people. There are both indoor and outdoor scenes, and all of the video sequences are captured with cluttered and moving backgrounds. Each video has a low resolution of 320×240 and a frame rate of 15 frames per second, and the lengths range from 32 to 76 seconds. An extended MSR-2 dataset consisting of 54 video sequences is also available [4]. Yuan et al. [58] report a 57% recall and an 87.5% precision.

Other than the aforementioned datasets, TRECVID [42] is an annual event detection challenge aimed at realistic activity retrieval problems. The dataset is updated each year and consists of videos from multiple surveillance cameras deployed at London Gatwick airport. For example, the goal of the 2009 challenge was to detect several target events, including "ElevatorNoEntry", "OpposingFlow" (moving in the opposite direction), "PersonRuns", "Pointing", "CellToEar", "ObjectPut", "TakePicture", "Embrace", "PeopleMeet", and "PeopleSplitUp". The dataset is challenging in that, unlike the sequences in the previous datasets where activities are repetitive, most of the target events in TRECVID are rare and subtle. For example, detecting the activity "CellToEar" or "PersonRuns" in unconstrained video sequences is extremely difficult. Also, the sequences always have cluttered backgrounds, which can include moving people, resulting in complicated occlusion scenarios. The intra-class variations of each activity are also huge, since each person performs the same activity differently. The evaluation is done using the Detection Error Tradeoff (DET) curve, a trade-off curve between the miss rate and the false alarm rate. The state-of-the-art approach still has a 90% miss rate when the false alarm rate is kept at 20 per hour, and the miss rate only drops to ∼80% when the false alarm rate is allowed to reach 100 per hour, an indication of the difficulty of the dataset.
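The DET-based evaluation can be made concrete with a small sketch. The Python snippet below, with hypothetical detection scores, ground-truth matches, and video duration, computes one operating point (miss rate versus false alarms per hour) per threshold; sweeping the threshold traces out a DET-style curve. It is only an illustration, not the official TRECVID scoring tool.

```python
# Hedged sketch of a DET-style operating point; all inputs are hypothetical.
import numpy as np

def det_point(scores, is_true_event, hours_of_video, threshold):
    """scores: detector confidence per candidate detection;
    is_true_event: 1 if the candidate matches an annotated event, else 0."""
    fired = scores >= threshold
    hits = np.sum(fired & (is_true_event == 1))
    false_alarms = np.sum(fired & (is_true_event == 0))
    n_events = np.sum(is_true_event == 1)
    miss_rate = 1.0 - hits / float(n_events)
    fa_per_hour = false_alarms / float(hours_of_video)
    return miss_rate, fa_per_hour

scores = np.array([0.9, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
is_true = np.array([1,   0,   1,   0,    0,   1,   0])
for t in (0.5, 0.3):
    mr, fa = det_point(scores, is_true, hours_of_video=2.0, threshold=t)
    print("threshold %.2f: miss rate %.2f, false alarms/hour %.1f" % (t, mr, fa))
```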

7 Other Benchmarks

Other than recognizing single-subject kinematic activities, researchers have recently tried to extend activity recognition to a broader context. For example, Prabhakar et al. [34] use temporal causality to detect activities that involve interactions among people. They evaluate their approach on a toy dataset consisting of sequences of ball playing activities ("roll-ball", "throw-ball", and "kick-ball") and on a child play dataset [48] consisting of social games, such as pattycake between an adult and a child, achieving 60%∼70% accuracy. They also report results on the "HandShake" class from the Hollywood dataset [29] for a more realistic evaluation. Another dataset that involves human interactions is the PETS07-BEHAVE dataset [14], consisting of video sequences of 640×480 resolution. The activities include walking together, splitting, approaching, fighting, chasing, and so on.

Another category of activities that attracts many research works involves object manipulation. The recognition of object-manipulation activities finds applications, for example, in Programming by Demonstration in robotics and in flow optimization for factory workers; experimental protocols for laboratory technicians and recipes for home cooks are other example tasks. Moreover, in object recognition, more and more context information is brought in to help recognize objects, and the way an object is manipulated or held significantly constrains the category of the object. Conversely, the object class also affects how the object can be grasped or manipulated and which activities can be performed on it.

Gupta et al. [11] collect a sports image dataset consisting of five activities: "Cricket bowling", "Croquet shot", "Tennis forehand", "Tennis serve", and "Volleyball smash", each with 50 images. They report a 78.9% accuracy, while more recently Yao et al. [55] achieve a recognition rate of 83.3% by jointly modeling the activity, the body pose, and the manipulated object.

Similarly, Yao et al. [54] publish an instrument-playing dataset covering seven different musical instruments: bassoon, erhu, flute, French horn, guitar, saxophone, and violin. Each class includes ∼150 images of people playing the corresponding instrument. They achieve an accuracy of 65.7% using their proposed Grouplet features, an extension of local interest point features that takes neighboring relationships into account.

Kjellstrom et al. [20] collect the OAC (Object–Action-Complex) dataset. The dataset consists of 50 instances of each of three different action–object combinations: "look through binoculars", "drink from cup", and "pour from pitcher". The activities are performed by 10 subjects, 5 times each. The classes are selected so that two of the activities, "look through" and "drink from", are similar, while two of the objects, "cup" and "pitcher", are similar as well. They report a best error rate of 6% by jointly inferring the activity and the manipulated object using a CRF.

Another closely related work is the HumanEva datasets [41]. These datasets contain video sequences of six simple activities performed by four to six subjects wearing motion sensors. In addition to the videos, the datasets also provide the corresponding measurements from the motion capture system in order to evaluate human pose estimation and articulated tracking algorithms.

Tables 20.12 and 20.13 summarize different properties, such as resolution, activities, and degree of background clutter, of the major benchmarking datasets. As we can see from the tables, the numbers reported on the standard activity recognition datasets such as the KTH dataset [40] are saturated, mostly above 90%. On the other hand, there is still huge room for improvement on the realistic and multi-activity datasets, such as the Hollywood datasets [23, 29], the MSR dataset [58], and TRECVID [42]. This suggests that more sophisticated methods are needed to address the problems of cluttered backgrounds and of representing activities at finer scales.

Table 20.12 Summary of all the datasets. “r” indicates that the dataset was made out of realistic videos. “v” indicates the dataset consists of video sequences with various perspectives. The performance is reported in average accuracy unless otherwise specified. The columns are dataset names, number of activities, number of actors, resolution of the videos (res.), and camera views
Table 20.13 Summary of all the datasets. “r” indicates that the dataset was made out of realistic videos. “v” indicates the dataset consists of video sequences with various perspectives. The performance is reported in average accuracy unless otherwise specified. The columns are dataset names, degree of background clutter (bg clutter), camera motion (c_motion), and the state-of-the-art performances

8 Conclusions

In this chapter, we have covered the state-of-the-art benchmarking datasets for human activity recognition algorithms, ranging from the standard KTH dataset [40] to the realistic Hollywood [23, 29] and TRECVID [42] datasets. To conclude, datasets such as the KTH dataset [40] and the Weizmann dataset [10], on which state-of-the-art approaches have already achieved above 90% accuracy, provide benchmarks in a more controlled environment, while the YouTube dataset [26], the Hollywood datasets [23, 29], and the TRECVID dataset [42] better approximate realistic situations and pose great challenges to human activity recognition algorithms. Datasets with videos containing multiple activities, such as the MSR dataset [58], provide suitable benchmarks for activity detection techniques, which are still few in number, as most human activity recognition techniques assume pre-segmented video sequences. The properties of these major benchmarking datasets are summarized in Tables 20.12 and 20.13. We hope that by summarizing the state-of-the-art numbers, researchers will be able to use them as baselines and report improved results on top of them.

A dataset that is presently lacking is one that contains human actions together with information on the action context as well as on the objects involved in the actions. This need was also outlined in Chap. 18, where the reader may find a more detailed discussion.

8.1 Further Readings

We refer interested readers to Turaga et al. [45] for general topics in human activity recognition. For empirical methods and evaluation methodologies in Computer Vision, Henrik et al. [6] and Venkata et al. [47] both cover the design of experiments and benchmarks for various topics in the field. Interested readers can also see [38] and [59] for information about providing ground-truth labeling.