1 Introduction

With the growth of multimedia information, many researchers have been attracted to video indexing and video retrieval. A video indexing and retrieval system performs better when proper video segmentation tools and algorithms are applied to it. Video-on-demand, digital video libraries, distance education, and geographical information systems are some applications of video segmentation.

For effective indexing and retrieval, video data must be divided into basic elements called shots. A shot [1] represents a sequence of frames captured from the moment the camera starts rolling until it stops. The main problem in segmenting a video sequence into shots is distinguishing between scene breaks and normal changes, which may be due to the motion of large objects or of the camera. When special effects are involved, two shots are merged using a gradual transition. The types of gradual transitions used most often are dissolve, fade-in, and fade-out. A fade is a gradual transition between a scene and a constant image (fade-out) or between a constant image and a scene (fade-in). A dissolve is a gradual transition from one scene to another, in which the first scene fades out and the second scene fades in.

In [2], video segmentation is performed based on the types of editing operations an editor can perform when combining two shots: the identity class (cuts), the spatial class (page translates, page turns, shattering edits), the chromatic class (fade-in, fade-out, and dissolve), and the spatio-chromatic class (image morphing and wipes).

Sections 2, 3, and 4 of this paper contain a short survey of video segmentation based on text features, audio features, and image and motion features, respectively.

2 Text Feature Extraction

Hauptman and Smith [3] used scenes, the audio signal, and word relevance in the video to segment video for better indexing and retrieval, implemented for the Informedia Digital Video Library Project at Carnegie Mellon University. Two main techniques are used in their natural language analysis.

  1. Natural structural text markers such as punctuation are used to identify segments of video.

  2. Term frequency/inverse document frequency (TF/IDF) is used to identify critical keywords and their relative importance for the video document (see the sketch following this list).
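
To make the TF/IDF weighting concrete, the following minimal Python sketch computes keyword weights over a set of transcript segments. The tokenized segments and the exact normalization are illustrative assumptions, not details taken from [3].

```python
import math
from collections import Counter

def tfidf(segments):
    """Compute TF/IDF weights for each word in each transcript segment.

    segments: list of token lists, one per video segment (hypothetical input).
    Returns a list of {word: weight} dicts, one per segment.
    """
    n = len(segments)
    # Document frequency: in how many segments does each word occur?
    df = Counter(word for seg in segments for word in set(seg))
    weights = []
    for seg in segments:
        tf = Counter(seg)
        weights.append({w: (tf[w] / len(seg)) * math.log(n / df[w]) for w in tf})
    return weights

# Example: three tiny (hypothetical) transcript segments
segs = [["video", "library", "search"],
        ["video", "segmentation", "video"],
        ["speech", "recognition", "library"]]
print(tfidf(segs)[1]["video"])  # frequent in segment 1, present in 2 of 3 segments
```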

3 Audio Feature Extraction

Hauptman and Smith [3] used the Sphinx-II speech recognizer to recognize spoken words; it uses semi-continuous hidden Markov models (HMMs) to model context-dependent phones (triphones), including between-word context. To detect breaks between utterances, they used a modification of signal-to-noise ratio (SNR) techniques that computes signal power.

Boreczky and Wilcox [4] described a technique for segmenting video using an audio distance based on the acoustic difference between the intervals just before and after each frame. The audio distance is computed over sliding two-second intervals.
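
A minimal sketch of this sliding-window idea follows, summarizing each two-second interval by a crude log power spectrum; [4] uses cepstral features and a likelihood-ratio distance, so this only illustrates the mechanics.

```python
import numpy as np

def audio_distance(signal, sr, t, win=2.0):
    """Distance between the audio just before and just after time t (seconds).

    Each interval is summarized by its log power spectrum, and the distance
    is the Euclidean norm between the two summaries. This is a simplification:
    [4] uses cepstral features and a likelihood-ratio distance, but the
    sliding-window mechanics are the same.
    """
    n = int(win * sr)
    i = int(t * sr)
    before, after = signal[max(0, i - n):i], signal[i:i + n]

    def summarize(x):
        return np.log(np.abs(np.fft.rfft(x, n=n)) + 1e-8)

    return float(np.linalg.norm(summarize(before) - summarize(after)))

# Example: a synthetic signal that changes character at t = 3 s
sr = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(3 * sr) / sr)
noise = np.random.randn(3 * sr)
print(audio_distance(np.concatenate([tone, noise]), sr, t=3.0))
```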

Huang et al. [5] compute a feature vector for each clip and then calculate an audio dissimilarity index for each clip to detect audio breaks for segmentation.

4 Image and Motion Feature Extraction

Fukunaga and Hostetler [6] proposed general nonparametric density and mean shift estimators for application in pattern recognition procedures; their efficacy on low-level vision tasks such as tracking has since been considered.

Black [7] presented a method for segmenting image sequences using intensity and motion information, adopting a Markov random field approach.

Arman et al. [8] proposed an algorithm for detecting scene changes using the DCT coefficients of MPEG- or JPEG-encoded video sequences before decoding.

Alattar [9] developed a dissolve detector and a dissolve-handling algorithm for DVI PLV 2.0; the algorithm has been incorporated into PLV 2.0 and is currently used with Intel Corporation’s DCF.

Zhang et al. [10] used pairwise comparison, likelihood ratio, and histogram comparison algorithms as video partitioning metrics. The twin-comparison technique is also used to detect dissolves.
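
A minimal sketch of the twin-comparison idea follows; the difference values, threshold choices, and bookkeeping are simplified assumptions rather than the exact procedure of [10].

```python
import numpy as np

def twin_comparison(diffs, t_low, t_high):
    """Detect cuts and candidate gradual transitions with twin thresholds.

    diffs: per-frame-pair difference values (e.g., histogram differences).
    Returns (cuts, gradual) as a list of frame indices and a list of
    (start, end) pairs. A sketch of the twin-comparison idea in [10].
    """
    cuts, gradual = [], []
    start, acc = None, 0.0
    for i, d in enumerate(diffs):
        if d >= t_high and start is None:
            cuts.append(i)                 # abrupt change: one large jump
        elif d >= t_low:
            if start is None:
                start, acc = i, 0.0        # potential gradual transition begins
            acc += d
        else:
            if start is not None and acc >= t_high:
                gradual.append((start, i)) # accumulated change is large enough
            start, acc = None, 0.0
    return cuts, gradual

# Example: a cut at frame 5 and a dissolve over frames 10-15
diffs = [0.1] * 5 + [3.0] + [0.1] * 4 + [0.6] * 6 + [0.1] * 5
print(twin_comparison(diffs, t_low=0.5, t_high=2.0))
```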

Hampapur et al. [11] adopted a feature-based classification approach to the problem of edit detection. A chromatic and spatial edit is achieved by manipulating the color or intensity space and pixel space of the shots being edited, respectively. The segmentation is achieved by using a modified two-class discrete classifier.

Tonomura et al. [12] proposed a method for shot analysis based on video X-ray images: a sliced image, based on video intensity data, is extracted from spatiotemporal images for edge detection. They used an interframe differencing operation to detect cuts.

Meng et al. [13] proposed an algorithm for detecting abrupt scene changes and special editing effects such as dissolves in a compressed MPEG/MPEG-2 bit-stream with minimal decoding. Scene changes are detected from DCT DC coefficients and motion vectors, so only partial decoding of the compressed bit-stream is required: the DCT DC coefficients of the I- and P-frames and the motion vectors of the P- and B-frames.
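
The DC-coefficient idea can be illustrated with the following sketch, which emulates a "DC image" from decoded luminance, since each 8 × 8 block's DC term is proportional to the block mean; a compressed-domain method like [13] reads the coefficients from the bit-stream instead, so this only approximates the signal being compared.

```python
import numpy as np

def dc_image(frame, block=8):
    """Approximate DC image: each block's DC term is proportional to its mean.

    Emulated here from decoded luminance for illustration; a real
    compressed-domain implementation would read the DC coefficients
    directly from the MPEG bit-stream.
    """
    h = frame.shape[0] // block * block
    w = frame.shape[1] // block * block
    f = frame[:h, :w].reshape(h // block, block, w // block, block)
    return f.mean(axis=(1, 3))

def dc_difference(f1, f2):
    """Mean absolute difference between the DC images of two frames."""
    return float(np.abs(dc_image(f1) - dc_image(f2)).mean())

# Example with two random 64x64 "frames"
a, b = np.random.rand(64, 64), np.random.rand(64, 64)
print(dc_difference(a, b))
```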

Hampapur et al. [2] presented a video edit model based on a study of video production processes; the proposed techniques treat video segmentation as edit boundary detection. The model captures the essential aspects of video editing, and video feature extractors for measuring image sequence properties are designed from it. The work takes a top-down approach to video segmentation, resulting in effect models that allow the design of simple and effective techniques for segmenting digital video.

Hauptman and Smith [3] used histogram difference analysis for image analysis. The TF/IDF technique is used to identify critical keywords and their relative importance for the video document.

Wang [14] described a technique for unsupervised video segmentation based on watersheds and temporal tracking. The motion estimation algorithm for arbitrarily shaped objects is based on hierarchical block-matching and least-square approximation. It is computationally efficient and generally provides motion parameters of reasonable accuracy. The algorithm for motion projection allows the proposed technique to track fast-moving objects. The algorithm for marker extraction and the modified watershed algorithm can detect the appearance and disappearance of objects and can segment newly appearing objects.

Boreczky and Wilcox [4] described a technique for segmenting video using HMM. Features for segmentation include an image-based distance between adjacent video frames and an estimate of motion between the two frames. The histogram feature measures the distance between adjacent video frames based on the distribution of luminance levels. It is simple, easy to compute, and works well for most types of video. Motion vectors are computed using an exhaustive-search block-matching algorithm in a 24 × 24 window for nine evenly distributed 40 × 40 pixel blocks.
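
The luminance histogram feature can be sketched as follows; the bin count and the L1 distance with normalized histograms are illustrative assumptions.

```python
import numpy as np

def luminance_histogram_distance(f1, f2, bins=64):
    """L1 distance between luminance histograms of adjacent frames.

    A sketch of the histogram feature described in [4]; the bin count
    and normalization are illustrative assumptions.
    """
    h1, _ = np.histogram(f1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 256))
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    return float(np.abs(h1 - h2).sum())

# Example with two random luminance frames
f1 = np.random.randint(0, 256, (120, 160))
f2 = np.random.randint(0, 256, (120, 160))
print(luminance_histogram_distance(f1, f2))
```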

Huang et al. [5] presented a scheme for video segmentation that integrates image, motion, and audio information to differentiate between shot breaks and scene breaks. The color histogram of each video frame is calculated to detect color breaks, and the difference in motion histograms between two successive frames is calculated to detect motion breaks.

Salembier and Marques [15] discussed region-based representations of image and video that are useful for multimedia services such as those supported by the MPEG-4 and MPEG-7 standards. Classical tools related to the generation of region-based representations are discussed. They mention that object tracking enables the creation of video objects (in the sense of MPEG-4) and opens the door to content-based functionalities.

Meier and Ngan [16] presented a video object plane (VOP) segmentation algorithm that separates foreground objects from the background based on motion information. The model points consist of edge pixels detected by the Canny operator, which are later used for VOP extraction. MPEG-4 video is used.

Altunbasak [17] introduced a framework for selecting statistically optimal thresholds in temporal video segmentation algorithms. Statistics for various temporal video events, such as shot boundary, zoom, and pan events, are collected; optimal thresholds are then estimated so as to minimize a statistical cost function defined in terms of the compiled statistics.

Truong et al. [18] improved cut detection based on color histogram differences by utilizing an adaptive threshold computed from a local window on the luminance histogram difference curve. Based on mathematical models of ideal fades and dissolves, they proposed different cues (e.g., monochrome frames) for discovering these effects and analytically derived constraints on the characteristics of the frame luminance mean and variance curves to eliminate false positives caused by camera and object motion during gradual transitions.
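
The adaptive-threshold step can be sketched as follows; the window size and the multiplier alpha are illustrative assumptions, not values from [18].

```python
import numpy as np

def adaptive_cut_threshold(diffs, i, window=7, alpha=3.0):
    """Adaptive threshold for the histogram difference at frame pair i.

    In the spirit of [18], the threshold is derived from a local window of
    the luminance histogram difference curve around i; the window size and
    multiplier alpha are illustrative assumptions.
    """
    lo, hi = max(0, i - window), min(len(diffs), i + window + 1)
    neighbors = np.delete(np.asarray(diffs[lo:hi], dtype=float), i - lo)
    return alpha * neighbors.mean()

# Example: frame pair 3 stands far above its local neighborhood
diffs = [0.1, 0.12, 0.09, 2.5, 0.11, 0.1, 0.08]
i = 3
print(diffs[i] > adaptive_cut_threshold(diffs, i))  # True: declared a cut
```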

Jadon et al. [19] proposed a scheme using the Rayleigh distribution for fuzzification of frame-to-frame properties. They used histogram intersection to detect abrupt changes, a combination of pixel difference and histogram intersection to detect gradual changes, and a combination of pixel difference, histogram intersection, and edge-pixel count to detect fades. Three-dimensional Euclidean distance is used for computing pixel differences, and the Sobel edge detection method is used for the edge-pixel count.
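
A minimal sketch of Rayleigh-based fuzzification follows, using the Rayleigh cumulative distribution as a membership function for "shot change"; the exact mapping used in [19] may differ.

```python
import math

def rayleigh_membership(d, sigma):
    """Fuzzy membership of a frame-to-frame difference d in "shot change".

    Uses the Rayleigh cumulative distribution 1 - exp(-d^2 / (2 sigma^2))
    as the membership function; a sketch of the fuzzification idea in [19],
    whose exact mapping may differ.
    """
    return 1.0 - math.exp(-(d * d) / (2.0 * sigma * sigma))

# Small differences get low membership, large differences approach 1
for d in (0.1, 0.5, 2.0):
    print(d, round(rayleigh_membership(d, sigma=0.5), 3))
```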

Huang and Liao [1] used DC image difference and histogram-based difference to measure the differences between frames. A simple gradient operator (i.e., Sobel masks) is used to compute the gradient image for edge detection.

Lo and Wang [20] proposed a video segmentation method using a histogram-based fuzzy c-means (HBFCM) clustering algorithm. The HBFCM clustering algorithm is a hybrid of shot change detection algorithms (pixel-based, histogram-based, and block-based) and clustering algorithms (K-means and fuzzy c-means clustering).

Xu et al. [21] used DC images extracted by parsing the MPEG-1 video stream without full decoding; grass pixels are then detected by a grass detector, and view labels are obtained by quantizing the grass ratios into three levels according to threshold values set adaptively in the initialization phase.

Khan and Shah [22] proposed a maximum a posteriori (MAP) framework that uses multiple cues, such as spatial location, color, and motion, for segmentation. Weights are assigned to the color and motion terms and adjusted at every pixel based on a confidence measure of each feature. The appropriate modeling of the probability density function (PDF) of each feature of a region is also discussed; correct modeling of the spatial PDF imposes temporal consistency among segments in consecutive frames.

Patras et al. [23] used a watershed segmentation algorithm. Filtering with morphological operators with a small (3 × 3) structuring element provides nonlinear smoothing of the current frame.

DeMenthon [24] adopted a hierarchical clustering method which operates by repeatedly applying mean shift analysis. Each pixel of a 3D space–time video stack is mapped to a 7D feature point, and the clustering of these feature points provides color segmentation and motion segmentation.

Lu et al. [25] proposed a novel scheme for automatic video segmentation that fully exploits both temporal information from the background and spatial information from the foreground. In this technique, the background regions of one scene are first warped into a large sprite image, which includes all visible parts of the background throughout the sequence. Temporal segmentation produces an independent foreground component (IFC) image with the rough locations of the foreground objects, and spatial segmentation produces homogeneous regions with precise edges preserved. The final segmentation is completed from the temporal and spatial segmentations by using a boundary extraction strategy.

Sifakis et al. [26] applied an empirical probability distribution to the test frame to determine the frame differences, which later predict the camera motion. A multi-label fast-marching level set algorithm, an extension of the well-known fast-marching algorithm, is applied to find the interframe difference. A region-growing algorithm is used to localize the boundary of the moving object.

Chien et al. [27] proposed a new fast watershed algorithm, named P-Watershed, for image sequence segmentation. By utilizing the temporal coherence of the video signal, this algorithm updates watersheds instead of searching for watersheds in every frame, avoiding much redundant computation. An intra–interwatershed scheme (IP-Watershed) is also proposed to further improve the results.

Porter et al. [28] proposed a motion-based algorithm to identify shot cuts that deals inherently with object and camera motion. It uses block-matching motion compensation to generate an interframe difference metric, computed by calculating the normalized correlation between blocks and locating the maximum correlation coefficient.
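
A sketch of this block-matching step follows, with an exhaustive search maximizing the normalized correlation coefficient; the block size and search radius here are illustrative assumptions rather than the settings of [28].

```python
import numpy as np

def normalized_correlation(a, b):
    """Normalized correlation coefficient between two equally sized blocks."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def best_match(block, frame, y, x, radius=8):
    """Exhaustively search frame around (y, x) for the best-correlated block.

    A sketch of the block-matching step described in [28]; a low maximum
    correlation across many blocks suggests a shot cut.
    """
    h, w = block.shape
    best, best_disp = -2.0, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy and 0 <= xx and yy + h <= frame.shape[0] \
                    and xx + w <= frame.shape[1]:
                c = normalized_correlation(block, frame[yy:yy + h, xx:xx + w])
                if c > best:
                    best, best_disp = c, (dy, dx)
    return best, best_disp

# Example: simulate camera motion by shifting the previous frame
prev = np.random.rand(64, 64)
nxt = np.roll(prev, (2, 3), axis=(0, 1))
print(best_match(prev[16:32, 16:32], nxt, 16, 16))  # ~1.0 at displacement (2, 3)
```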

Guimaraes et al. [29] transformed the video segmentation problem into a 2D image pattern detection problem. They used discrete line analysis, max-tree analysis, and topological and morphological tools to detect fade transitions, flashes, and cuts, respectively.

Qi et al. [30] used supervised classification methods for video shot segmentation, with a whole-frame color histogram difference and an 8 × 8 block-wise histogram difference between frames, both in the YUV color space, as features. Combining global and block-wise histogram differences makes the system insensitive to object motion but very sensitive to cut boundaries; classification is performed by a k-nearest neighbor classifier.
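
The feature construction and classification can be sketched as follows, using a single luminance channel instead of YUV and a hand-rolled k-nearest-neighbor rule for brevity; the training data below are hypothetical.

```python
import numpy as np

def frame_features(f1, f2, bins=16, grid=8):
    """Global plus 8x8 block-wise histogram differences between two frames.

    A sketch of the feature pair in [30], computed here on a single
    luminance channel rather than YUV for brevity.
    """
    def hist(x):
        h, _ = np.histogram(x, bins=bins, range=(0, 256))
        return h / max(h.sum(), 1)

    global_d = np.abs(hist(f1) - hist(f2)).sum()
    H, W = f1.shape
    bh, bw = H // grid, W // grid
    block_d = [np.abs(hist(f1[i*bh:(i+1)*bh, j*bw:(j+1)*bw]) -
                      hist(f2[i*bh:(i+1)*bh, j*bw:(j+1)*bw])).sum()
               for i in range(grid) for j in range(grid)]
    return np.concatenate([[global_d], block_d])

def knn_predict(x, X_train, y_train, k=3):
    """Label x as cut (1) or non-cut (0) by majority vote of k nearest neighbors."""
    d = np.linalg.norm(X_train - x, axis=1)
    return int(np.bincount(y_train[np.argsort(d)[:k]]).argmax())

# Hypothetical labeled training features (1 + 8*8 = 65 dimensions each)
X_train = np.random.rand(20, 65)
y_train = np.random.randint(0, 2, 20)
f1 = np.random.randint(0, 256, (64, 64))
f2 = np.random.randint(0, 256, (64, 64))
print(knn_predict(frame_features(f1, f2), X_train, y_train))
```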

Wang et al. [31] proposed an anisotropic kernel mean shift. The application of mean shift to an image or video consists of two stages. The first stage is to define a kernel of influence for each pixel; this kernel defines a measure of intuitive distance between pixels, where distance encompasses both spatial (and, for video, temporal) and color distances. The second, iterative stage assigns to each pixel a mean shift point, M(xi), initialized to coincide with the pixel. These mean shift points are then iteratively moved upward along the gradient of the density function defined by the sum of all the kernels until they reach a stationary point (a mode, or hilltop on the virtual terrain defined by the kernels). For the color domain, they use an Epanechnikov kernel, whose profile is \( k(z) = 1 - z \) for \( 0 \le z \le 1 \) and 0 otherwise.
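
A minimal isotropic sketch of the iterative stage follows. With the Epanechnikov profile, the mean shift step reduces to the mean of the points inside the bandwidth; [31] uses anisotropic, per-point kernels, so this only illustrates the mode-seeking mechanics.

```python
import numpy as np

def mean_shift_point(x, points, bandwidth, iters=50):
    """Move x uphill on the kernel density defined by `points`.

    With an isotropic Epanechnikov kernel, each mean shift step is simply
    the mean of the points falling inside the bandwidth. This is a sketch
    of the iterative stage described above; [31] uses anisotropic,
    per-point kernels.
    """
    for _ in range(iters):
        d2 = np.sum((points - x) ** 2, axis=1) / bandwidth ** 2
        inside = points[d2 < 1.0]
        if len(inside) == 0:
            break
        new_x = inside.mean(axis=0)
        if np.allclose(new_x, x):
            break                  # reached a mode (stationary point)
        x = new_x
    return x

# Two color clusters; a point near the first converges to that cluster's mode
pts = np.vstack([np.random.randn(100, 3) * 0.1,
                 np.random.randn(100, 3) * 0.1 + 2.0])
print(mean_shift_point(np.zeros(3), pts, bandwidth=0.5))
```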

Fang et al. [32] proposed a hybrid scheme for temporal segmentation of videos, in which a fuzzy logic approach combines a number of features to perform shot-cut detection. These features include color histogram intersection, motion compensation, and texture change for abrupt shot boundary detection, and edge variance for gradual shot-cut detection.

Boccignone et al. [33] defined a novel approach to partitioning a video into shots based on a foveated representation of the video. The proposed scheme aims at detecting both abrupt and gradual transitions between shots using a single technique, rather than a set of dedicated methods.

Apostoloff and Fitzgibbon [34] presented an automatic video segmentation algorithm that uses spatiotemporal features to regularize the segmentation. By detecting spatiotemporal T-junctions that indicate occlusion edges, they learn an occlusion edge model that is used within a color-contrast-sensitive MRF to segment individual frames of a video sequence. T-junctions are learned and classified using a support vector machine, and a Gaussian mixture model is fitted to the (foreground, background) pixel pairs sampled from the detected T-junctions. Graph cut is then used to segment each frame of the video, showing that sparse occlusion edge information can automatically initialize the video segmentation problem.

Joyce and Liu [35] presented two algorithms for the detection of gradual transitions in video sequences. The first is a dissolve detection algorithm utilizing certain properties of a dissolve's trajectory in image space; it is implemented both as a simple threshold-based detector and as a parametric detector that models the error properties of the extracted statistics. The second is an algorithm to detect a wide variety of wipes based on image histogram characteristics during such transitions. Both algorithms operate in the compressed domain, requiring only partial decoding of the compressed video stream.

Wang et al. [36] proposed a method that accurately detects scene change boundaries through the macroblock information of B-frames and P-frames. The algorithm extracts the macroblock information directly from the encoded MPEG bit-stream without full decompression. The macroblock coding of a B-frame has four types: intra, forward, backward, and bidirectional. Motion compensation is used.

Tan et al. [37] presented a metric based on blocked color histograms (BCH) for the interframe difference. They use an SVM classifier for pattern recognition.

Pritch et al. [38] proposed video synopsis, a temporally compact representation of video that enables video browsing and retrieval. The video synopsis uses energy minimization, representing the input video sequence in a 3D space–time volume. The cost function corresponds to a 3D Markov random field (MRF) in which each node corresponds to a pixel in the 3D volume of the output movie and can be assigned any time value corresponding to an input frame.

Tziakos et al. [39] proposed an approach based on a graph built from frame similarities and enriched with temporal information from the sequence, processed by Laplacian eigenmaps. This makes it possible to visualize the manifold of motion in the scene and to detect unusual events in a low-dimensional space. The proposed framework enables the separation of actions without the need for an object detector or object tracker.

Yin et al. [40] presented an algorithm for the efficient segmentation of foreground and background in monocular video sequences. Depth-based labeling is used in the monocular system to learn to imitate stereo motion features, called motons. The base weak classifier is the widely used decision stump, and a framework called the tree-cube taxonomy is used to construct strong classifiers by combining weak classifiers in different ways.

Hong et al. [41] proposed a dual-threshold method to divide the video into segments and extract key frames from each segment. SIFT features are extracted from the key frames of the segments, and an SVD-based method is then proposed to match two video frames using SIFT point set descriptors.

5 Conclusion

This paper presents a short survey on video segmentation. Video segmentation is applied in video indexing and retrieval systems. Researchers use MPEG-compressed videos for video segmentation because compressed-domain features, such as the macroblock coding types of B-frames (intra, forward, backward, and bidirectional), can be used directly without fully decompressing the MPEG bit-stream [36]. Motion compensation and motion vectors are used effectively in MPEG bit-streams.

Pattern recognition, SVM classification, and foveated shot detection are rarely used in video segmentation; they could be exploited for effective object-based or content-based video segmentation. Foveation mechanisms [33] have hardly been taken into account for video segmentation, although there are some recent applications to video compression.