
1 Introduction

More and more techniques and software tools, such as Adobe After Effects and Corel VideoStudio, give people convenient ways to edit and alter videos. Among these techniques, matte painting combines several video materials into one composite and is widely used for movie special effects. However, by taking advantage of matte painting, people can also make fake videos for malicious purposes. Since all of the component video materials are real, it is not easy to extract obvious visual clues from the fake video (as in Fig. 1). To tackle this kind of problem, we propose a new digital forensic method to detect whether a video is authentic or has been faked by matte painting.

Fig. 1. An authentic (a) and a fake (b) video are shown. In (b), the background region is replaced by another video clip, and visual clues are hardly visible.

A great deal of work has been done on different kinds of digital video forensics. Milani et al. surveyed video forensic technologies for different kinds of forgeries [1]. Wang and Farid successfully addressed the problem of interlaced and de-interlaced video [2]. Stamm et al. used a fingerprint model to detect frame deletion/addition operations [3]. Hsu et al. used temporal noise correlation to detect video forgery, but their model is sensitive to quantization noise [4]. Chen and Fridrich used characteristics of sensor noise to detect tampering [5]. However, since many effects and recompression steps are applied to videos during editing, these methods can hardly detect forgery implemented by chroma-key composition. Lighting [6], shadows [7] and reflections [8] are also used for forensics, but these content-based methods do not perform well under poor illumination conditions. Copy-move detection techniques, such as [9], may not work properly because the composited contents are not from the same source video. Thus, geometric methods are more suitable for the matte painting forensic task. Yao used perspective constraints to detect forgery [10]; single-view metrology is the theoretical basis of that method, and ideal perspective effects together with a priori knowledge of objects are used to detect forgery in images or videos. On the other hand, multi-view metrology based methods mainly focus on detecting forgery by means of planar constraints [11–13]. However, these methods are applicable only when the fake contents are coplanar. To tackle the more general matte painting problem in video, we propose a geometric technique based on extrinsic camera parameters. The method applies multi-view metrology to estimate extrinsic camera parameters, and we then investigate the differences between the extrinsic parameters estimated from different regions of the video frames. We find that regions can be characterized by their extrinsic parameters, and that the differences between these parameters can reveal matte painting forgery. Experiments show that our method is robust and efficient, even under non-coplanar conditions.

2 Extrinsic Camera Parameter Estimation

Extrinsic camera parameters are usually introduced to model the position and orientation of a camera. Currently, Structure from Motion (SfM) is one of the most popular methods for estimating extrinsic camera parameters from multi-view images [14, 15]. For simplicity, the camera is modeled as a pinhole camera. Let \(p=[x,y]^T\) denote a 2D point in the image coordinate system, and \(P=[X,Y,Z]^T\) a 3D point in the world coordinate system. With a slight abuse of notation, \(p=[x,y,1]^T\) and \(P=[X,Y,Z,1]^T\) denote their augmented (homogeneous) vectors.

In the pinhole camera model, a 3D real-world point P and its projection p on the image plane satisfy:

$$\begin{aligned} sp = \mathbf {K} \left[ \begin{array}{cc} \mathbf {R}&\mathbf {t} \end{array} \right] P \end{aligned}$$
(1)

where s is a scale factor; \(\mathbf {K}\) is the intrinsic camera parameter matrix, which carries information such as the focal length, skew and principal point of the camera; and the extrinsic camera parameters \(\mathbf {t}\) and \(\mathbf {R}\) represent the translation and rotation from the world coordinate system to the camera coordinate system. \(\mathbf {t}\) is a \(3 \times 1\) vector and \(\mathbf {R}\) is a \(3 \times 3\) matrix:

$$\begin{aligned} \mathbf {t} = [T_x,T_y,T_z]^T \end{aligned}$$
(2)

and

$$\begin{aligned} \mathbf {R} = \left[ \begin{array}{ccc} r_{11}&{}r_{12}&{}r_{13} \\ r_{21}&{}r_{22}&{}r_{23} \\ r_{31}&{}r_{32}&{}r_{33} \end{array} \right] \end{aligned}$$
(3)

where

$$\begin{aligned} \begin{aligned}&r_{11}=\cos \beta \cos \gamma ,r_{12}=\sin \alpha \sin \beta \cos \gamma -\cos \alpha \sin \gamma \\&r_{13}=\cos \alpha \sin \beta \cos \gamma +\sin \alpha \sin \gamma \\&r_{21}=\cos \beta \sin \gamma ,r_{22}=\sin \alpha \sin \beta \sin \gamma +\cos \alpha \cos \gamma \\&r_{23}=\cos \alpha \sin \beta \sin \gamma -\sin \alpha \cos \gamma \\&r_{31}=-\sin \beta ,r_{32}=\sin \alpha \cos \beta \\&r_{33}=\cos \alpha \cos \beta \\ \end{aligned} \end{aligned}$$

\(\alpha \), \(\beta \) and \(\gamma \) are Euler angles representing the three elementary rotations around the x-, y- and z-axes respectively. In later sections of this paper, we use the rotation angle vector \(\mathbf {r}\) instead of the rotation matrix:

$$\begin{aligned} \mathbf {r} = [\alpha ,\beta ,\gamma ]^T. \end{aligned}$$
(4)
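As an illustration only, the following minimal Python/NumPy sketch (not part of the original paper) converts between the Euler angles and the rotation angle vector \(\mathbf {r}\) of (4); it assumes the rotation order \(\mathbf {R}=\mathbf {R}_z(\gamma )\,\mathbf {R}_y(\beta )\,\mathbf {R}_x(\alpha )\), which matches the matrix entries given above.

```python
import numpy as np

def euler_to_rotation_matrix(alpha, beta, gamma):
    """Build R from the Euler angles (in radians), i.e. R = Rz(gamma) @ Ry(beta) @ Rx(alpha)."""
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(alpha), -np.sin(alpha)],
                   [0, np.sin(alpha),  np.cos(alpha)]])
    Ry = np.array([[ np.cos(beta), 0, np.sin(beta)],
                   [0, 1, 0],
                   [-np.sin(beta), 0, np.cos(beta)]])
    Rz = np.array([[np.cos(gamma), -np.sin(gamma), 0],
                   [np.sin(gamma),  np.cos(gamma), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

def rotation_matrix_to_euler(R):
    """Recover the rotation angle vector r = [alpha, beta, gamma]^T of Eq. (4)
    (valid away from the gimbal-lock case cos(beta) = 0)."""
    beta = -np.arcsin(R[2, 0])              # r31 = -sin(beta)
    alpha = np.arctan2(R[2, 1], R[2, 2])    # r32 / r33 = tan(alpha)
    gamma = np.arctan2(R[1, 0], R[0, 0])    # r21 / r11 = tan(gamma)
    return np.array([alpha, beta, gamma])
```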

When a camera moves in a scene and takes photos of the same object from different views, it is easy to find corresponding points of that object across the photos. Let \(p_1\) denote a point in image \(\mathbf {I}_1\) and \(p_2\) its corresponding point in image \(\mathbf {I}_2\), where \(\mathbf {I}_1\) and \(\mathbf {I}_2\) are images of the same object taken from different views. \(p_1\) and \(p_2\) satisfy the fundamental matrix constraint:

$$\begin{aligned} {p_2}^T \mathbf {F} p_1=0 \end{aligned}$$
(5)

where \(\mathbf {F}\) is the fundamental matrix which relates corresponding points in the stereo image pair:

$$\begin{aligned} \mathbf {F} = \mathbf {K_2}^{-T} \hat{\mathbf {T}} \mathbf {R} \mathbf {K_1}^{-1}, \hat{\mathbf {T}}= \left[ \begin{array}{ccc} 0&{}-T_z&{}T_y \\ T_z&{}0&{}-T_x \\ -T_y&{}T_x&{}0 \end{array} \right] . \end{aligned}$$

Remarkably, if points in \(\mathbf {I}_1\) and \(\mathbf {I}_2\) are coplanar in the real-world scene, the fundamental matrix constraint degenerates to the planar constraint which is used in [11–13].

By matching enough corresponding points (at least 8 valid points for each pair) across the multi-view images, we can solve the constraint problem and obtain the fundamental matrix \(\mathbf {F}\) [16]. Given \(\mathbf {F}\), we can further recover \(\mathbf {R}\) and \(\mathbf {t}\) as well as \(\mathbf {K}\). In this way, both the intrinsic and extrinsic camera parameters can be estimated. In [17], intrinsic parameters are applied to detect some kinds of video forgery, but matte painting forgery is not among them. Extra information, such as the extrinsic parameters, is needed for this kind of forensics, and this paper focuses on exploring the utility of the extrinsic parameters.
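For concreteness, a minimal OpenCV sketch of this step is given below. It is not the exact pipeline of the paper and, unlike the self-calibration mentioned above, it assumes the intrinsic matrix \(\mathbf {K}\) is already known; pts1 and pts2 are assumed to be arrays of matched image points.

```python
import cv2
import numpy as np

def estimate_extrinsics(pts1, pts2, K):
    """Recover R and t from matched points pts1, pts2 (Nx2 arrays, N >= 8),
    assuming both views share the known intrinsic matrix K."""
    # Fundamental matrix from the epipolar constraint (5), with RANSAC for robustness.
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    # Essential matrix E = K^T F K (the same K is used for both views here).
    E = K.T @ F @ K
    # Decompose E into R and t; the cheirality check selects the physically valid pose.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    return F, R, t
```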

3 Proposed Method

Our method is based on the extrinsic camera parameters. Theoretically, in the frames of an authentic video, all corresponding points should satisfy the same fundamental matrix constraint (5). Therefore, the same extrinsic camera parameters should be estimated from corresponding points anywhere in an authentic video. If different extrinsic camera parameters are extracted from different image regions of the same video, the video has been tampered with. In this way, forgery in the video can be detected successfully.

The proposed method proceeds as follows. Firstly, we divide each video frame into several regions with masks. Secondly, we estimate the extrinsic parameters from these regions respectively and calculate the differences between the parameters. Thirdly, if the differences between a certain region and all other regions exceed the threshold, that region is considered fake; otherwise, it is considered authentic. Figure 2 shows the diagram of our proposed method.

Fig. 2. Diagram of our proposed method

3.1 Estimating Extrinsic Camera Parameters

Many software tools are available for estimating extrinsic camera parameters. Before estimation, we employ the SIFT algorithm to extract feature points [18] and the RANSAC algorithm to match them robustly [19]. We then use the software VisualSFM [20, 21] for the estimation, with bundle adjustment to refine the result [22].
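A minimal OpenCV sketch of this feature step (an illustration, not the exact configuration used with VisualSFM) might look as follows; here RANSAC is applied through a fundamental-matrix fit to discard mismatched points.

```python
import cv2
import numpy as np

def match_features(img1_gray, img2_gray):
    """SIFT keypoints, ratio-test matching, and RANSAC outlier rejection."""
    sift = cv2.SIFT_create()            # requires OpenCV >= 4.4 (SIFT in the main module)
    kp1, des1 = sift.detectAndCompute(img1_gray, None)
    kp2, des2 = sift.detectAndCompute(img2_gray, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    raw = matcher.knnMatch(des1, des2, k=2)
    # Lowe's ratio test to keep only distinctive matches.
    good = [p[0] for p in raw if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # RANSAC rejects correspondences that violate the epipolar constraint.
    _, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
    mask = mask.ravel().astype(bool)
    return pts1[mask], pts2[mask]
```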

3.2 Detecting Forgeries with Extrinsic Camera Parameters

Even when applying the parameter estimation to a video without any tampering, it is difficult to get exactly the same result every time. Many factors, such as mismatched corresponding points and lens distortion, lead to fluctuations in the results.

Assuming the elements of the translation vector (2) and the rotation angle vector (4) are independent and identically distributed (i.i.d.) respectively, the translational and rotational differences between the estimated and ground truth values should follow zero-mean Gaussian distributions, i.e.,

$$\begin{aligned} \mathbf {t}_{est} - \mathbf {t}_{truth} \sim N(\mathbf {0},{\sigma _t}^2 \mathbf {I}) \end{aligned}$$
(6)
$$\begin{aligned} \mathbf {r}_{est} - \mathbf {r}_{truth} \sim N(\mathbf {0},{\sigma _r}^2 \mathbf {I}) \end{aligned}$$
(7)

where \(\mathbf {t}_{est}\) and \(\mathbf {r}_{est}\) denote the estimated translation vector and rotation angle vector respectively, \(\mathbf {t}_{truth}\) and \(\mathbf {r}_{truth}\) are the ground truth vectors, and \(\mathbf {I}\) is the identity matrix.

If we divide a video frame into N regions and estimate the extrinsic parameter vectors for each region, we will get the translation vectors \(\mathbf {t}_i\) from the ith region and \(\mathbf {t}_j\) from the jth region. We define the translational difference between \(\mathbf {t}_i\) and \(\mathbf {t}_j\) as follows:

$$\begin{aligned} DT_{ij}= \frac{||\mathbf {t}_i - \mathbf {t}_j||^2}{{\sigma _t}^2} \end{aligned}$$
(8)

where \(i,j=1,2,...,N\) and \(i \ne j\), and \(||\cdot ||\) is the L2-norm of a vector. Since the squared L2-norm is the sum of squares of the vector's elements, which are mutually independent Gaussian random variables, the translational difference \(DT_{ij}\) should follow the chi-squared distribution with 3 degrees of freedom (the vector contains 3 elements):

$$\begin{aligned} DT_{ij} \sim {\chi }^2 (3). \end{aligned}$$
(9)

We define the rotational difference \(DR_{ij}\) in the same way:

$$\begin{aligned} DR_{ij}= \frac{||\mathbf {r}_i - \mathbf {r}_j||^2}{{\sigma _r}^2} \end{aligned}$$
(10)
$$\begin{aligned} DR_{ij} \sim {\chi }^2 (3). \end{aligned}$$
(11)

Usually the standard deviations are related to the magnitudes of the ground truth parameters:

$$\begin{aligned} \sigma _t = k_t ||\mathbf {t}_{truth}|| \end{aligned}$$
(12)
$$\begin{aligned} \sigma _r = k_r ||\mathbf {r}_{truth}|| \end{aligned}$$
(13)

where \(k_t\) is the total translation factor and \(k_r\) is the total rotation factor.

With respect to the ith region, if the means of \(DT_{ij}\) and \(DR_{ij}\) over all frames exceed the threshold, we claim that the ith region has been tampered with. In this paper, the threshold is set to 7.82, the critical value of the chi-squared distribution with 3 degrees of freedom at a p-value of 0.05; that is, the probability that the observed value exceeds 7.82 is less than 0.05.
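As an illustrative sketch (not the original implementation), the pairwise differences of (8) and (10) and this decision rule can be computed as follows, where sigma_t and sigma_r come from (12) and (13):

```python
import numpy as np
from scipy.stats import chi2

THRESHOLD = chi2.ppf(0.95, df=3)   # ~7.81, matching the 7.82 used in the paper

def pairwise_differences(vecs, sigma):
    """Normalized pairwise differences (Eqs. (8)/(10)) between the
    extrinsic parameter 3-vectors of N regions."""
    N = len(vecs)
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i != j:
                D[i, j] = np.sum((vecs[i] - vecs[j]) ** 2) / sigma ** 2
    return D

def is_region_tampered(DT_mean, DR_mean):
    """A region is flagged when its mean translational and rotational
    differences to the other regions both exceed the threshold."""
    return DT_mean > THRESHOLD and DR_mean > THRESHOLD
```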

4 Experiments

Fig. 3. Test videos from YouTube. The first column shows the first frames. Columns 2 to 4 show the three divided regions. (a) and (b) are tampered videos; (c) is the authentic version of (a). Column 2 of (a) and (b) shows the tampered regions.

4.1 Forensic Model Training

To estimate the factors \(k_t\) and \(k_r\) in (12) and (13), we collect more than 50 authentic video clips, either taken by ourselves or downloaded from video-sharing websites. More than 20 frames are then extracted from each video. Next, we use Adobe Photoshop's mask tool to generate three new pictures from each original frame. Each new picture contains the information of only one part of the original frame, while the rest of the picture is set to black (RGB values of 0). Our strategy for segmenting frames is to separate the suspicious part from the others as much as possible, while making sure that each part contains enough feature points for VisualSFM to work reliably. In this way, we obtain three sub-region frame sequences for each video, resembling Fig. 3.
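As a hypothetical sketch of this masking step (the paper draws the masks manually in Photoshop; the file name and the horizontal-band masks below are only placeholders), each sub-region frame keeps one region and blacks out the rest:

```python
import cv2
import numpy as np

def split_into_regions(frame, region_masks):
    """Keep only the masked pixels of the frame for each binary mask and set
    everything else to black (RGB = 0)."""
    sub_frames = []
    for mask in region_masks:          # mask: HxW array of 0/1
        sub = frame.copy()
        sub[mask == 0] = 0             # black out pixels outside the region
        sub_frames.append(sub)
    return sub_frames

frame = cv2.imread("frame_0001.png")   # placeholder file name
h, w = frame.shape[:2]
masks = []
for k in range(3):                     # three horizontal bands as placeholder masks
    m = np.zeros((h, w), dtype=np.uint8)
    m[k * h // 3:(k + 1) * h // 3, :] = 1
    masks.append(m)
sub_frames = split_into_regions(frame, masks)
```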

To extract the extrinsic camera parameters, the sub-region frame sequences are fed to VisualSFM separately. Afterward, we obtain the extrinsic parameters, i.e., the position and orientation of the camera, for each sub-region frame, as in Fig. 4.

Fig. 4. Results provided by VisualSFM. The sequence of rectangles stands for the cameras taking different frames, while the points stand for the corresponding 3D feature points reconstructed from the sub-region frames.

However, when we estimate the extrinsic camera parameters, the ground truth is generally unknown. Thus, when evaluating our method, we take the mean value \(\bar{\mathbf {t}}\) of the translation vectors extracted from the different regions of all frames as the ground truth translation vector \(\mathbf {t}_{truth}\) in (12), and \(\bar{\mathbf {r}}\) is obtained in a similar way.

After investigating the distributions of the translational and rotational differences (as in Fig. 5), we find that 95 % of the difference values are less than 7.82 when \(k_t=1\) and \(k_r=0.5\). Thus, we take \(k_t=1\) and \(k_r=0.5\) in our later experiments.
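For illustration, the following sketch (an assumed workflow, not the paper's code) checks a candidate factor k against the training differences by verifying that roughly 95 % of the normalized values stay below 7.82:

```python
import numpy as np

def fraction_below_threshold(sq_diffs, truth_norm, k, threshold=7.82):
    """sq_diffs: array of ||v_i - v_j||^2 values from authentic training videos;
    truth_norm: ||t_truth|| or ||r_truth||, approximated by the norm of the mean vector.
    Returns the fraction of normalized differences below the threshold."""
    sigma = k * truth_norm              # Eqs. (12) / (13)
    return np.mean(sq_diffs / sigma ** 2 < threshold)

# A factor k is accepted when this fraction is close to 0.95; the paper
# reports k_t = 1 and k_r = 0.5 on its training set.
```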

Fig. 5. Distributions of the translational and rotational differences. The red dashed line shows the \(\chi ^2\) distribution with 3 degrees of freedom. About 95 % of the difference values are less than 7.82.

4.2 Test of Fake Videos

We then evaluate our method on video clips obtained from video-sharing websites such as YouTube. We divide each video frame into three regions, so that \(N=3\) in (8) and (10), following the same steps as in model training. Figure 3 shows an example: (a) and (b) show tampered videos while (c) shows the authentic version of (a). The first column shows the first frames extracted from the videos. Columns 2 to 4 show the three regions, divided from top to bottom of the frame. The top regions of both (a) and (b) (shown in Column 2) are the tampered regions. Here the three regions are simply denoted Region 1, 2 and 3, and the region index pair for calculating the translational and rotational differences is denoted (i, j). The results are shown in Table 1.

Table 1. Differences of extrinsic camera parameters for detecting forgery on videos from YouTube

In video (a), the whole of Region 1 is the suspicious part (the building). The mean values of the translational and rotational differences are both greater than the threshold, and thus Region 1 is predicted to be the tampered region. Region 2 contains a small suspicious part (the building) and has a somewhat larger translational difference; since most of Region 2 is genuine, our method still predicts it to be authentic. Region 3 is entirely authentic and has small translational and rotational differences.

In video (b), our method points out the tampered region as well. Video (c) is the authentic version of video (a); its translational and rotational differences are both much smaller than the threshold, and no region is predicted to be fake.

Experiments on other test videos show similar results. Our proposed method can detect fake regions by taking advantage of the extrinsic camera parameters in videos.

5 Conclusion

We have proposed a geometric method to detect forgery in video by means of extrinsic camera parameters. For an authentic video, no matter which frame region we use for camera parameter estimation, the estimated extrinsic parameters should not deviate much from each other. Instead of trying to eliminate the estimation noise, we model the differences of the extrinsic parameters in authentic videos so that forgeries can be distinguished in a general way. Using several real videos, we investigate the differences between the extrinsic camera parameters extracted from different regions of the frame. We find that the translational and rotational differences follow the chi-squared distribution with 3 degrees of freedom, and we choose an appropriate forensic threshold from this distribution. Experiments on videos from video-sharing websites show the efficacy of our method.