1 Introduction

Video data is widely used in numerous applications. Applications such as video summarization [84], object tracking, traffic status estimation [46], multi-view object recognition [68], human activity recognition [77], categorization [78, 82], saliency detection [83], image segmentation [79], and photo retargeting [66, 80], which rely heavily on video data for information retrieval [59] or advanced query, have been boosted by the deployment of various video capturing devices. In addition, advances in network technology provide higher bandwidth, which enables access to videos over the Internet from anywhere.

High-quality video is essential for reliable and stable video analysis services, as well as for the experience of end users. Limited by transmission resources, video data must be compressed to reduce its size before transmission. A lossy video compression algorithm may introduce artificial content, while packet losses or random noise during transmission [60] can further damage the video quality. Distorted video content can severely impair video processing and analysis, and it is unpleasant for end users. Consequently, the assessment [63] of video quality is crucial for evaluating captured videos.

The problem of video quality assessment (VQA) has attracted considerable research interest. Numerous metrics and assessment methods have been proposed to evaluate video quality effectively [65] by utilizing various features, including visual characteristics, perceptual features and psychological features. Existing VQA metrics can be classified in different ways. One classification is based on whether an intact reference video is required, which divides them into Full-Reference (FR), No-Reference (NR) and Reduced-Reference (RR) metrics [52]. Since FR metrics depend heavily on the availability of reference videos, RR and NR metrics are more appealing in practice.

VQA methods can also be classified from other perspectives. According to the principles in [9], VQA methods are either visual-characteristic based or perceptual-characteristic based. Visual-characteristic-based methods can be further categorized into methods using natural statistical characteristics and methods using visual features, while perceptual methods can be divided into frequency-domain and pixel-domain methods. In addition, quality assessment methods can be categorized as either subjective or objective [23]. Subjective VQA aims to obtain the average vote score from the perspective of users; it is the most reliable approach, but its results need to be annotated by trained experts. Objective VQA methods predict video quality automatically with selected metrics. Besides the aforementioned classifications, VQA methods can also be divided into two-dimensional (2D) oriented and three-dimensional (3D) oriented methods.

This paper focuses on the challenges of VQA and recent work on VQA metrics, VQA methods, and video quality improvement. The remainder of this paper is structured as follows. In Section 2, we review applications of video query in different areas. In Section 3, VQA metrics are briefly introduced. VQA methods are elucidated in Section 4 according to the features they use. Section 5 provides a brief review of video quality improvement methods, and a conclusion is drawn in Section 6.

2 Applications of video query

There is growing interest in query optimization in multimedia [62] databases, both in academia and industry. Executing data analysis applications efficiently is important and challenging as dataset sizes and formats continue to grow [3]. Different multimedia objects require different types of techniques. Query optimization involves query decomposition, which takes a query and produces a simplified, restructured query expressed in relational algebra after modifications for views, security enforcement and semantic integrity control [44]. Three optimization techniques are widely used in multimedia databases: Content Based Query, Semantic Based Query and Metadata. In the following, we first discuss these three optimization techniques and then summarize the advantages and disadvantages of each.

2.1 Content Based Query (CBQ)

This is the most widely used type of optimization technique and the main motivation of recent research on multimedia databases [43]. Chan et al. use CBQ to formulate queries that access the semantic information contained in video data [8]. Kwok et al. use content-based image query (CBIQ) to extract parts of an image and transform them into low-level features of the particular data objects [28]. Zhou et al. focus on low-level features such as color, motion, and texture to index videos based on CBQ [85]. Sadat et al. create a prototype multimedia database with a query interface that focuses on the dependencies between context and content information for effective query [44].

Similarity search is one of the important operations in CBQ, and it focuses on the inner part of multimedia data. It finds sequences or subsequences whose patterns are similar to a given query sequence. There are two different tasks in content-based video similarity search [85]. One is video clip query, which retrieves similar clips from a large collection of videos that have been segmented into shots of similar length at content boundaries. The other is video subsequence identification [64], which searches for any part of a long database video that shares similar content with a query. Similarity queries can be classified into two categories: whole matching, where the sequences to be compared have the same length, and subsequence matching, where a shorter query sequence is compared against the portion of a longer sequence that best matches it [1]. Kriegel et al. focus on the effectiveness of similarity search in multimedia databases using multiple representations for video and try to integrate multiple video representations into query processing [27].
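As a concrete illustration of the two matching modes discussed above, the following Python sketch compares pre-extracted per-frame feature vectors with a plain Euclidean distance. It is only a minimal example under the assumption that frame features have already been computed; it is not an implementation of any of the cited systems.

```python
# Illustrative sketch (not taken from the surveyed papers): whole matching
# compares two equal-length feature sequences, while subsequence matching
# slides a shorter query over a longer database sequence. Frame features are
# assumed to be pre-extracted vectors (e.g. colour histograms).
import numpy as np

def whole_match_distance(query, candidate):
    """Euclidean distance between two equal-length feature sequences."""
    query, candidate = np.asarray(query, float), np.asarray(candidate, float)
    assert query.shape == candidate.shape
    return float(np.linalg.norm(query - candidate))

def best_subsequence_match(query, database):
    """Slide the query over the database sequence; return (best_offset, distance)."""
    query, database = np.asarray(query, float), np.asarray(database, float)
    m, n = len(query), len(database)
    best_offset, best_dist = None, np.inf
    for offset in range(n - m + 1):
        d = np.linalg.norm(query - database[offset:offset + m])
        if d < best_dist:
            best_offset, best_dist = offset, float(d)
    return best_offset, best_dist
```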

The advantages of CBQ can be summarized as follows: it provides ease of query formulation and interpretation, and the technique is suitable and flexible for formulating queries. Its disadvantages are that CBQ introduces data dependency between multimedia data, that it depends on the users, and that performance suffers when queries are executed on large datasets.

2.2 Semantic Based Query (SBQ)

Semantic query uses knowledge about the domain of relations, the nature of the data, and constraints related to database elements [48]. Semantic-based search is defined as a technique that compares the original multimedia data to a prototypical category such as 'vehicle', 'clothes' and others [73]. The semantics of a video are modeled by elements extracted from different modalities of the video, such as visual information, auditory information, and text in the video frames [10]. There are four main issues involved in semantic query [48]. First, the query and the schema should be used to dynamically select the relevant semantics for optimization without any additional search of the semantic rule base. Second, a suitable mechanism should be available to combine the selected semantics with the query. Third, a cost analyzer is needed to evaluate the cost of equivalent queries and rank them accordingly. Lastly, a heuristic guide is needed to present the whole process in a meaningful way so that it can be easily understood by the users.

In order to handle the rich, temporal and spatial requirements of multimedia data, a visualized semantic model is used to increase the information content so that the users' cognitive load is reduced [8]. Query-centric and data-centric methods can be incorporated to optimize the acquisition process. Semantic query optimization is based on semantic equivalence rather than syntactic equivalence between different queries [47].

The advantages of SBQ are that a large amount of computation time can be saved compared with index structures because of its simpler, overlap-free characteristics, and that the technique can easily be extended to all available transforms. However, semantic content has several disadvantages: it is difficult to extract automatically [17], it cannot support queries on generalized concepts, and the retrieval is not precise enough.

2.3 Metadata

Metadata is data or semantic information that describes the content, quality, condition and other characteristics of the data [57]. The metadata technique is a searching technique in which structured information describing these characteristics assists users in identifying digital content [37]. It shifts content search from plain string matching to a conceptual level where users search for the semantic content of the data [49]. Three well-known metadata schemes are MPEG-7, Dublin Core, and IEEE LOM.

MPEG-7 differs from other metadata standards in that it provides two types of schema: low-level descriptions and high-level ones [42]. Low-level descriptions cover the color, texture, and shape of the multimedia data, while high-level ones cover structural and semantic descriptions. MPEG-7 mainly focuses on video and images, and the technology enables content-based image retrieval (CBIR). MPEG-7 allows the metadata to be used across different platforms and applications [36].

The second scheme, Dublin Core, was conceived for author-generated descriptions of Web resources [56]. Its aim is to define a standard that encourages quality resource description and interoperability between tools for resource discovery [21].

The third scheme is the IEEE LOM standard, which conforms to, integrates and refers to open standards and existing work in related areas [49]. The Learning Object Metadata (LOM) standard specifies a conceptual data schema that defines the structure of a metadata instance [47]. IEEE LOM divides the description of learning resources into nine element categories, each of which is relatively independent and characterizes the resource from a separate aspect [17]. In addition, its specification and vocabularies were determined through discussion from the standpoints of both users and resource developers. IEEE LOM uses a hierarchical metadata description, which is useful and easy to implement at many levels of elements [70].

The advantages of metadata are as follows: rich metadata is very effective in helping users navigate and find desired content items, and high-quality metadata is important for reliable and effective Web applications. The disadvantages are that the more data become available, the harder it is to identify and extract metadata, and that meaningful metadata may be absent because users pay little attention to providing the information.

In summary, among these three techniques, CBQ is the most suitable for video data, where the content of the video can first be segmented and then compared using similarity search. This approach normally combines more than one technique to extract the content of multimedia data. However, it introduces dependency between multimedia data, and the process involves extracting multimedia segments in order to retrieve the query output. SBQ focuses more on image data. It uses algorithms to analyze the structure of an image, covering the shape, color, and texture of the data through low- and high-level feature extraction. Since this technique involves extracting the inner part of the data, its queries cannot support generalized concepts and the retrieved output is not precise enough. Metadata is suitable for both kinds of data. However, users have to be involved in defining the metadata, and high-quality metadata is essential for an effective application; moreover, the more data are available, the harder it is to identify and find suitable metadata (Table 1).

Table 1 Comparisons among three optimization techniques

3 Metrics for video quality assessment

FR metrics evaluate a test video by comparing it with its corresponding reference video. The mean squared error (MSE), peak signal-to-noise ratio (PSNR), and extended metrics based on MSE or PSNR are commonly used in FR VQA methods. These metrics are simple, but their correlation with subjective judgments is poor. By integrating the assessment of spatial and temporal distortion, the Motion-based Video Integrity Evaluation (MOVIE), spatial MOVIE (SMOVIE) and temporal MOVIE (TMOVIE) indices were also proposed as FR metrics [45]. The main drawback of FR metrics lies in their heavy dependence on the reference video, which must be aligned with the test video. Features such as saliency [14], MSE and video content [6], and distortion [20] can be used in FR metric design.
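The classical FR criteria mentioned above can be stated compactly. The following sketch computes frame-wise MSE and PSNR and averages PSNR over a video, assuming aligned 8-bit frames of identical size; it only illustrates the baseline metrics, not the extended ones.

```python
# Frame-wise MSE and PSNR between a reference video and a distorted test video.
# Frames are assumed to be aligned 8-bit numpy arrays of identical shape.
import numpy as np

def mse(reference, test):
    reference = reference.astype(np.float64)
    test = test.astype(np.float64)
    return float(np.mean((reference - test) ** 2))

def psnr(reference, test, peak=255.0):
    err = mse(reference, test)
    return float("inf") if err == 0 else 10.0 * np.log10(peak ** 2 / err)

def video_psnr(ref_frames, test_frames):
    """Average PSNR over corresponding (aligned) frames."""
    return float(np.mean([psnr(r, t) for r, t in zip(ref_frames, test_frames)]))
```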

NR metrics generally work by estimating blocking, distortion, blur, noise, quantization errors, watermarks, bitstream properties, etc., and they can analyze a test video without any reference video. The WMBER metric, which detects macro-block errors weighted by a saliency map and considers the characteristics of the human visual system (HVS), was proposed in [7]. Bitstream features can also be analyzed to derive metrics for video quality assessment: in [25], a prediction model of visual quality was developed from extracted bitstream-based features using partial least squares regression. An NR color-based metric for video quality is presented in [35]; it employs a flow tensor and a perceptual mask, which integrates a spatio-temporal contrast sensitivity function and luminance sensitivity, to define the metric. In addition, other NR metrics can be obtained by estimating features such as blur [38] or block artefacts [53].
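To give a feel for reference-free analysis, the following toy sharpness/blur indicator uses the variance of the Laplacian response of each frame. It is only an illustration of the general idea and is not the specific blur estimator of [38].

```python
# Toy no-reference blur indicator: a lower variance of the Laplacian response
# suggests a blurrier frame. Illustrative only; not the estimator of [38].
import numpy as np
from scipy import ndimage

def blur_score(gray_frame):
    lap = ndimage.laplace(gray_frame.astype(np.float64))
    return float(lap.var())

def video_blur_score(gray_frames):
    return float(np.mean([blur_score(f) for f in gray_frames]))
```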

RR VQA metrics assess video quality with only partial reference information available. Discriminative local harmonic strength and motion measurement were used in the metric proposed in [15], in which the harmonic gain/loss is evaluated by a harmonic analysis of the source video frames. A structural similarity (SSIM)-based RR metric utilizing coding tools based on distributed source coding theory was proposed in [50]. To evaluate the quality of 3D videos, compressed depth maps and color features are incorporated into the RR metric in [18].

A comparison of the three types of metrics is shown in Table 2.

Table 2 Comparisons among three metrics

To help researchers quickly grasp the general state of the art in this research area, we list some available data sources and detailed applications in Table 3.

Table 3 Data sources

4 Methods of video quality assessment

In this paper, we categorize methods of video quality assessment according to the type of features utilized for the evaluation of video quality. The features include visual features, data-based features and network transmission-related features. Visual features cover most characteristics utilized in quality assessment, and they are related to motion, content, distortion, spatial information, temporal information, visual attention, etc. Transmission-related features are relevant to the packets and bitstreams in the transmission of video data through the network.

4.1 Visual feature-based methods

(1) Motion feature-based methods

In [19], motion information was employed to filter unnecessary information in the spatial frequency domain, and a spatio-velocity contrast sensitivity function (SV-CSF) was introduced for objective video quality assessment. The SV-CSF describes the relationship among contrast sensitivities, spatial frequencies and velocities of perceived stimuli. However, the SV-CSF cannot be applied directly in the spatial frequency domain during filtering; the video frames are first decomposed in the spatial frequency domain and then weighted by the contrast sensitivities of the SV-CSF model. Another motion-estimation-based method is suggested in [22], in which video quality is assessed by estimating motion quality along motion trajectories and a Motion-based Video Integrity Evaluation (MOVIE) index is introduced based on motion estimation. The evaluation results demonstrate that the quality score derived from the MOVIE index is close to human subjective judgments.

Moreover, salient motion is included in the assessment scheme suggested in [11]. In this work, features were proposed to describe the salient motion intensity and the coding artifact intensity in the salient motion region. The experimental results showed that the salient-motion features can enhance video quality assessment in the presence of blocking artifacts and blurring in the salient region, as well as temporal changes of regional intensities. The temporal motion smoothness of video sequences is measured in the assessment method proposed in [76] through temporal variations of local phase structures in the complex wavelet transform domain. The method is robust to a wide range of video degradations, including distortion, noise contamination, blurring, jittering and frame dropping, and it has a low reduced-reference data rate and low computational cost.

(2) Content-based methods

In addition to encoder settings and network quality of service, the type of video content is another factor that affects video quality. A content-based method of video quality assessment was proposed in [24], in which a novel metric, the Simplified Perceptual Quality Region (SPQR), was used as an indicator of video quality degradation. SPQR determines the locations of the speakers' faces in the video and the discrepancies of face location in the corresponding frames, and the evaluation shows that the proposed method is a lightweight implementation. To assess the quality of high-definition video streaming with packet losses, a quality-of-experience model presented in [30] utilizes the SSIM metric, a temporal pooling method and content-based features to evaluate video quality, and it performs well.

In the work presented in [4], the video content type is taken into consideration in the design of a reference-free video quality prediction model, in which motion vectors are utilized to extract temporal information, while spatial information is obtained from quantization parameters and the number of bits per frame. The derived metric is then used to represent the content type of video sequences. Finally, the quality prediction model is built by combining the content-type metric with the encoding quantization parameters and the network packet loss rate, and the experiments report an accuracy of 92%.
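A schematic version of such a reference-free prediction model is sketched below: a regressor maps a content-type descriptor together with the encoding quantization parameter and the packet loss rate to a quality score. The feature layout and the tiny training set are placeholders for illustration, not the actual model or data of [4].

```python
# Hypothetical example: predict a subjective quality score from a content-type
# metric, the quantization parameter (QP) and the packet loss rate.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [content_type_metric, quantization_parameter, packet_loss_rate]
X_train = np.array([[0.2, 26, 0.00],
                    [0.7, 32, 0.01],
                    [0.5, 38, 0.03],
                    [0.9, 40, 0.05]])
y_train = np.array([4.5, 3.8, 3.1, 2.2])  # placeholder subjective scores (e.g. MOS)

model = LinearRegression().fit(X_train, y_train)
predicted_mos = model.predict(np.array([[0.6, 34, 0.02]]))
print(predicted_mos)
```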

Considering that contents in different regions play different roles in the perceptual quality of an image, a three-component image model was proposed in [32] to evaluate video quality. In this work, gradient properties are used to classify local image regions, and variable weights are then applied to structural similarity index scores according to the region type. A frame-based video quality assessment algorithm [33] is thereby derived. Experimental results on the Video Quality Experts Group (VQEG) FR-TV Phase 1 test dataset show that the proposed algorithm outperforms existing video quality assessment methods.

An embedded watermark is utilized to estimate video quality in the method proposed in [40]. A pseudo-random binary watermark is fused with the original video frames, and the similarity between the original and the extracted watermark is evaluated to assess the quality of a video segment. The watermark can be embedded at fine or coarse wavelet scales [67] to sense small or substantial distortions of the video frame.
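A much-simplified sketch of this idea follows: a pseudo-random pattern is added to one wavelet sub-band of a frame, and a spread-spectrum-style correlation with the known pattern is used as a quality indicator at the receiver. The sub-band choice, wavelet and detection rule are illustrative assumptions and differ from the actual scheme in [40].

```python
# Illustrative wavelet-domain watermarking for quality estimation.
# Assumes grayscale frames with even dimensions.
import numpy as np
import pywt

def embed_watermark(frame, key=0, strength=2.0):
    """Add a pseudo-random +/-1 pattern to the horizontal detail sub-band."""
    rng = np.random.default_rng(key)
    cA, (cH, cV, cD) = pywt.dwt2(frame.astype(np.float64), "haar")
    mark = rng.choice([-1.0, 1.0], size=cH.shape)
    marked = pywt.idwt2((cA, (cH + strength * mark, cV, cD)), "haar")
    return marked, mark

def watermark_response(received_frame, mark):
    """Correlation of the received sub-band with the known pattern;
    heavier distortion of the frame lowers the response."""
    _, (cH, _, _) = pywt.dwt2(received_frame.astype(np.float64), "haar")
    return float(np.corrcoef(cH.ravel(), mark.ravel())[0, 1])
```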

(3) Methods based on distortion estimation

Distortion features of a video can be used to estimate its quality. To estimate the perceived visual distortion in an image or video frame, structural distortion is employed in [55]. The method is evaluated on the Video Quality Experts Group Phase I FR-TV test dataset, and the results demonstrate that it is computationally efficient.

Statistical distortion features were used in the no-reference video quality assessment method presented in [72]. In this work, each frame is transformed into the wavelet domain and oriented band-pass responses are generated by the decomposition [69]. Statistical distortion features are extracted from the resulting sub-band coefficients and used to construct a feature vector describing the overall distortion of the frame. The spatial quality in the wavelet domain is obtained by classifying the feature vectors across frames and mapping the result to a score. The temporal quality is evaluated with a motion-compensated method utilizing blocks and motion vectors, and the overall quality is obtained by a pooling strategy.
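The sub-band statistics idea can be illustrated with a short sketch that collects simple moments of each oriented detail sub-band into a per-frame descriptor; the exact features and the classifier of [72] are not reproduced here.

```python
# Per-frame distortion descriptor from wavelet detail sub-bands (illustrative).
import numpy as np
import pywt

def subband_statistics(gray_frame, wavelet="db2", levels=3):
    """Mean absolute value and standard deviation of each detail sub-band."""
    coeffs = pywt.wavedec2(gray_frame.astype(np.float64), wavelet, level=levels)
    features = []
    for cH, cV, cD in coeffs[1:]:          # skip the approximation band
        for band in (cH, cV, cD):
            features.append(np.mean(np.abs(band)))
            features.append(np.std(band))
    return np.array(features)
```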

(4) Methods using spatial and temporal features

Spatial and temporal factors are usually combined to improve the results of video quality assessment. Spatial and temporal artefacts in videos are discussed in [41], and the results show that these artefacts are correlated; moreover, spatial quality contributes more to the overall video quality than temporal quality. Based on these observations, an objective quality evaluation model was proposed by combining spatial quality with temporal quality. In the computational model proposed in [39], spatial and temporal factors are also combined by exploiting a worst-case pooling method and the variation of spatial quality along the temporal axis, and the interaction between the two factors is determined by a machine learning algorithm. With the popularity of video-sharing services, web videos vary in many aspects, including content, capturing devices and resolution. To evaluate the quality of web videos, spatiotemporal factors integrating features of the video editing style are utilized to predict quality in [58], and the task of quality evaluation can then be cast as a two-class classification problem.
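The difference between average pooling and worst-case pooling of per-frame spatial scores can be illustrated as follows; the concrete weighting and learning step of [39] are not reproduced, and the worst-case fraction below is an arbitrary choice for this sketch.

```python
# Temporal pooling of per-frame quality scores: mean pooling versus a
# worst-case (low-percentile) pooling that emphasizes the poorest moments.
import numpy as np

def pool_scores(frame_scores, worst_fraction=0.1):
    scores = np.asarray(frame_scores, dtype=np.float64)
    mean_quality = float(scores.mean())
    k = max(1, int(round(worst_fraction * len(scores))))
    worst_case_quality = float(np.sort(scores)[:k].mean())  # average of the worst frames
    return mean_quality, worst_case_quality
```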

(5) Methods utilizing visual attention

The quality assessment model introduced in [75] is built upon attention theory, in which video quality is perceived through both local and global assessment. In this model, an attention map, derived by fusing several visual features that influence visual attention, is used to optimize a local quality model that evaluates degradations of the attended stimuli. A global quality model is formed by fusing four designed quality features, and a content-adaptive linear fusion method then combines the local and global measures to assess the video quality.

(6) Methods exploiting block artefacts

Artefacts produced by block-based codecs such as H.264/AVC were utilized in [31] to construct a no-reference metric for video quality assessment. Compared with a full-reference method based on the Structural Similarity Index Metric (SSIM), the suggested metric produces better results and functions well in real-time scenarios. In [71], cepstrum analysis is utilized to improve the estimation of blocking artefacts. A no-reference blocking artifact metric was proposed in [53], in which a weighted evaluation of artefacts is performed in flat and edge regions; the results show that it is highly consistent with visual perception.
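The intuition behind such block-artefact metrics can be conveyed with a toy blockiness indicator: the average luminance discontinuity across 8x8 block boundaries is compared with the discontinuity inside blocks. This is only an illustration and not the metric of [31] or [53].

```python
# Toy no-reference blockiness indicator for a grayscale frame.
import numpy as np

def blockiness(gray_frame, block=8):
    f = gray_frame.astype(np.float64)
    col_diff = np.abs(np.diff(f, axis=1))            # horizontal neighbour differences
    boundary = col_diff[:, block - 1::block].mean()  # differences across block borders
    mask = np.ones(col_diff.shape[1], dtype=bool)
    mask[block - 1::block] = False
    interior = col_diff[:, mask].mean()              # differences inside blocks
    return float(boundary / (interior + 1e-12))      # > 1 suggests visible blocking
```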

4.2 Bitstream-based or packet-based methods

Packet loss is another issue when video streams are transferred over bandwidth-limited network channels. In video delivery services over IP networks, the characteristics of packet losses or of the video bitstream should be taken into consideration when evaluating video quality. In the quality assessment approach proposed in [74], spatial and temporal pooling are conducted over packet losses to evaluate video quality, and the experimental results show that the packet-loss-based method is more sensitive to the most annoying spatial regions and temporal segments. A saliency-based quality assessment method is proposed in [13] to evaluate videos with packet losses, in which the visual saliency of each pixel is used as the weight of the error at that pixel.

Instead of reconstructing video information, coding parameters were utilized to evaluate video quality in [29]. In this method, coding parameters, including boundary strengths, quantization parameters and average bitrates, are extracted from the H.264/AVC bitstream. The accuracy of the proposed method is higher than that of previous methods, and its computational complexity is lower, which makes it a competitive candidate for real-time scenarios.

4.3 Methods utilizing data features

In video quality assessment, the peak signal-to-noise ratio (PSNR) and mean squared error (MSE) are two commonly used criteria. However, they cannot fully reflect human subjective assessment. In order to account for the effects of visual sensitivity on video quality, structural similarity and the sensitivity of visual statistical information were taken into account in designing the assessment method in [34], which introduces the phase spectrum of the quaternion Fourier transform and exploits saliency-weighted features in quality assessment.
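The idea of weighting errors by visual importance can be sketched very simply: per-pixel errors are weighted by a saliency map so that distortions in attended regions count more. The saliency model itself (e.g. the quaternion Fourier phase spectrum used in [34]) is treated here as a given input, and the formula below is an illustration rather than the metric of [34].

```python
# Saliency-weighted mean squared error (illustrative).
import numpy as np

def saliency_weighted_mse(reference, test, saliency):
    ref = reference.astype(np.float64)
    tst = test.astype(np.float64)
    w = saliency.astype(np.float64)
    w = w / (w.sum() + 1e-12)            # normalise weights to sum to 1
    return float(np.sum(w * (ref - tst) ** 2))
```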

4.4 3D-oriented assessment methods

With the popularity of 3D videos, quality assessment methods for 3D videos have attracted many researchers, since assessment methods for 2D videos cannot be applied directly to 3D videos. A no-reference stereoscopic quality perception model for 3D video was suggested in [16]. It is built on four extracted factors: temporal variance, intra-frame disparity variation, inter-frame disparity variation, and the disparity distribution of frame boundary areas. The model does not require a depth map, and its parameters can be estimated by linear regression.

3D Singular Value Decomposition (3D-SVD), a singular value decomposition in 3D space, is utilized in the models presented in [51] and [81] for quality assessment of 3D videos. In the model in [51], the distorted video is projected onto the singular vectors of the original video, and the quality of the video can then be evaluated by calculating weighted differences between the reflection coefficients. In the method in [81], the image is separated into different planes based on depth values, and a global error is then computed as the distance between the distorted image and the original image. In addition, depth information and motion cues are explored in the 3D VQA method presented in [26], where a weighting map for PSNR and SSIM is generated by combining depth and motion cues.
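A much-simplified, frame-wise illustration of the SVD-based comparison follows: the distorted frame is projected onto the singular vectors of the reference frame and the resulting coefficients are compared. The actual models in [51] and [81] operate on 3D spatio-temporal volumes and use different weightings, so this is only a sketch.

```python
# Frame-wise SVD projection distance (illustrative; not the 3D-SVD of [51]/[81]).
import numpy as np

def svd_projection_distance(reference_frame, distorted_frame, k=20):
    ref = reference_frame.astype(np.float64)
    dist = distorted_frame.astype(np.float64)
    U, s_ref, Vt = np.linalg.svd(ref, full_matrices=False)
    s_dist = np.diag(U.T @ dist @ Vt.T)   # coefficients of the distorted frame in the reference basis
    k = min(k, len(s_ref))
    return float(np.mean(np.abs(s_ref[:k] - s_dist[:k])))
```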

5 Video quality improvement

After assessing the video quality by either a subjective or an objective approach, various enhancement methods can be utilized to improve the quality of experience perceived by end users. First, video quantization can be analyzed to find an optimal value of the quantizer_scale factor for improving video quality; this method can save bandwidth in an IPTV network and serves as an automatic means of improving the video quality received by the end user [5]. When transmitting video frames over networks with bursty losses, video quality can be improved by classifying video frames at the edge node and differentiating the service on the network. Optical Burst Switched (OBS) and Optical Packet Switched (OPS) networks are examples of such technologies, and video quality is evaluated by means of a no-reference video quality metric, the Frame Starvation Ratio [12].

Moreover, Advanced Video Coding (AVC), also known as H.264/MPEG-4, is the latest compression standard available for video compression. The Fidelity Range Extensions (FRExt) are recent additions to AVC that include a number of enhanced capabilities relative to the base standard. Human vision models can be incorporated into standard codecs for improving video quality, for example by introducing a contrast sensitivity function in the transform coding stage. AVC with the FRExt features has been implemented, and the quality of the reconstructed video signals is evaluated using both subjective and objective measures such as MOS, PSNR, MSSIM, VIF and VSNR [2]. In the work presented in [54], a Multi-Layer Streaming Simplified DCT Domain Transcoder (MLS-SDDT) is proposed for video quality improvement, consisting of an FGS-compatible Simplified DCT Domain Transcoder (FGS-SDDT) architecture for MPEG-1/2/4 single-layer transcoding and an R-D optimized multi-layer streaming model. By applying the MLS-SDDT to FGS-to-MPEG-1/2/4 single-layer transcoding, experiments show a 1.4-7.0 dB PSNR improvement for MPEG-1 and a 1.9-8.6 dB improvement for MPEG-2 compared with the SDDT architecture.

6 Conclusion

Considering the popularity of video query in various applications [61] and the great importance of VQA, we have conducted a comprehensive review of VQA metrics, VQA methods and video quality improvement in this work. We introduced three optimization techniques widely used in multimedia databases: Content Based Query, Semantic Based Query and Metadata. Based on visual or physical features such as color, depth information, structural similarity, bitstream, distortion, video content, and spatial and temporal information, we gave a detailed description of FR, NR and RR metrics. We then classified and introduced VQA methods and some experimental results, covering visual feature-based methods, bitstream-based or packet-based methods, and others. Finally, a brief introduction to video quality improvement methods was given.

Overall, this paper gives a brief review of metrics and methods of video quality assessment and introduces some basic concepts, methods and applications in this field. With the development of video acquisition and applications, video quality assessment will play an increasingly important role in the future. However, many challenges remain, and designing new metrics or improving existing ones to achieve high performance will require long-term effort. In future work, we plan to mine the quantitative relationship between video quality metrics and methods, which can help guide development trends in the video quality assessment area. We will also follow research on video quality improvement, which will be in great demand in the future.