The papers in this special issue cover a wide range of topics in multimedia content analysis and signal processing, grouped into three areas: multimedia content analysis, image and audio processing, and video coding and compression.

A) Multimedia Content Analysis

Enriching the abstract tags of social images is important for keyword-based social image search and retrieval. The paper entitled "Automatic Abstract Tag Detection for Social Image Tag Refinement and Enrichment" by Xia et al. constructs a concept ontology with three-level semantics to detect candidate abstract tags. Based on the concept ontology, new tags can also be added to enrich the tags of social images. The proposed method was compared with existing approaches to demonstrate its effectiveness.

Multimedia event detection is an important research topic due to its applications in video indexing and retrieval. In the paper entitled "Multimedia Event Detection Using Segment-based Approach for Motion Feature," Phan et al. propose a new approach for multimedia event detection that partitions each video into segments for feature extraction and classification. Experiments on the TRECVID Multimedia Event Detection datasets demonstrate the promise of their approach.

Saliency detection can be used in many real-world applications such as object detection and recognition. In the paper entitled "Top-Down Saliency Detection via Contextual Pooling," Zhu et al. propose a new top-down approach for saliency detection that additionally exploits spatial context information. The results demonstrate that their method achieves state-of-the-art performance in saliency detection.

Hand gesture recognition is an important component of human-computer interaction. In the paper entitled "Real-time Hand Gesture Recognition from Depth Images Using Convex Shape Decomposition Method," Qin et al. propose a new hand gesture recognition system based on depth images. New hand detection and segmentation methods are introduced prior to gesture recognition. The experiments demonstrate that their system can accurately recognize hand gestures in real time.

Colorizing gray-scale facial images is an interesting research topic in multimedia. In the paper entitled "Colorization for Gray Scale Facial Image by Locality-constrained Linear Coding," Liang et al. develop a new colorization system for gray-scale facial images using Locality-constrained Linear Coding (LLC) and a Markov Random Field (MRF). The experiments demonstrate the effectiveness of the proposed system.

B) Image and Audio Processing

Noise estimation is a fundamental issue for image denoising and many other image processing applications. In the paper entitled "Robust Noise Estimation Based on Noise Injection," Tang et al. describe a new noise level estimation algorithm that investigates the distribution of local variances in natural images. They use a wavelet-decomposition-based preliminary estimation stage to alleviate the influence of an image's textural and structural information, followed by a noise-injection-based estimation stage to account for the impact of image content on the variance distribution. Experiments and comparative analysis demonstrate that the proposed algorithm infers noise levels accurately and performs robustly over a wide range of visual content compared with existing methods.

Audio is indispensable in multimedia applications. In "Audio Quality Requirements and Comparison of Multimodal vs. Unimodal Perception of Impairments for Long Duration Content," Borowiak et al. investigate the effect of the time dimension on quality ratings and user responses, as well as the effect of audio artifacts introduced by different compression rates over extended periods of time. The study gives useful insights into user quality expectations, user reaction time to quality degradation, user sensitivity to quality changes when users are able to influence the quality themselves, and cross-modal effects between the visual and auditory modalities.

In the next paper, "An Adaptive Non Reference Anchor Array Framework for Audio Retrieval in Teleconferencing Environment," Nathwani et al. discuss an adaptive method for audio retrieval in live teleconferencing with multiple participants. A non-reference anchor array (NRA) is used to capture the interfering speech, in addition to the primary array that captures the speech source of interest (SOI). The method is claimed to be computationally efficient, since it does not require computing the acoustic impulse response (AIR) of the teleconferencing room, and the NRA is able to remove correlated noise in the direction of the SOI. The proposed method was evaluated through experiments with clean speech acquired from distant microphone arrays, as well as on existing databases.

C) Video Coding and Compression

Video compression continues to be an important research area because the visual signal still consumes a dominant share of system resources (such as computation, bandwidth, and storage) in multimedia services. This special issue includes three papers on 2D, multi-view, and 3D video compression, respectively. The paper "SSIM-based Error Resilient Video Coding over Packet-Switched Networks" by Zhang et al. proposes an error resilient video coding scheme based on structural similarity (SSIM), rather than the traditional MSE, in order to improve the perceived visual quality of compressed videos over packet-switched networks. Specifically, an SSIM-based distortion model is first developed to estimate the perceptual distortion due to quantization, error concealment, and error propagation; an adaptive mode selection strategy is then presented to enhance the robustness of the resulting algorithm.
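As background for readers less familiar with the metric (this standard definition is general knowledge and is not quoted from the paper itself), the SSIM index between two image patches x and y is commonly defined as

\[
\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},
\]

where \(\mu_x, \mu_y\) are the local means, \(\sigma_x^2, \sigma_y^2\) the local variances, \(\sigma_{xy}\) the covariance, and \(C_1, C_2\) small stabilizing constants; values close to 1 indicate high structural similarity, which correlates better with perceived quality than MSE does.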

In the next paper, "Adaptive Learning Based View Synthesis Prediction for Multi-View Video Coding," Hu et al. devise an adaptive learning based view synthesis prediction algorithm that enhances the prediction of the virtual view picture for free-view TV by integrating least square prediction with backward warping. It utilizes both adjacent-view and temporal decoding information to adaptively learn the prediction coefficients. The proposed method has been demonstrated to achieve bitrate savings of up to 11 %–18 % when compared with relevant existing approaches.

Because different parts of a depth map have different impacts on the synthesized image quality of 3-D video represented in the texture-plus-depth format, Xiao et al. propose a macroblock-level bit allocation method in the paper entitled "Macroblock Level Bits Allocation for Depth Maps in 3-D Video Coding." In this work, different macroblocks of a depth map are encoded with different quantization parameters and coding modes. With this fine bit-allocation granularity, the proposed approach outperforms other bit allocation approaches, maintaining the synthesized view quality at only a small cost in pre-encoding delay.