Introduction

Brain encoding models provide an effective means to understand how brain activity varies with external stimuli and how well brain activity can be predicted from quantitatively measured stimuli. This topic has received increasing interest, and a number of papers have been published in recent years. In particular, several surveys (Haynes and Rees 2006; Naselaris et al. 2011; Kay and Gallant 2009; Hasson et al. 2010; Sugase-Miyamoto et al. 2011; Chen et al. 2014) have provided a broad overview of encoding approaches, including image analysis methodologies, functional magnetic resonance imaging (fMRI) analysis algorithms, machine learning algorithms, and region of interest (ROI) selection methods. An encoding model mainly consists of four components: structural substrates for brain response modeling, brain response modeling, external stimuli modeling, and the mapping from stimuli to brain response (Naselaris et al. 2011; Chen et al. 2014).

Although previous studies have yielded remarkable results, in our opinion three problems in current encoding studies need to be revisited. The first is that the quantified external stimuli used in previous work are limited. In most fMRI studies [e.g., (Shirer et al. 2012; Haxby et al. 2001; Sterzer et al. 2008; Peelen et al. 2009; Mitchell et al. 2008; Nishimoto et al. 2011)], visual features were used to represent external image/video stimuli, including image grid intensity, color (Naselaris et al. 2011; Nishimoto et al. 2011; Miyawaki et al. 2008), semantic category labels [e.g., (Mitchell et al. 2008)], and participant-rated scores about the external stimuli, such as faces and human bodies in a naturalistic video stream (Bartels and Zeki 2004). However, these representations are generally qualitative and subjective, and thus substantially limit the power of encoding models. To alleviate this problem, researchers have tried to model external stimuli via computational image/video descriptors. For example, Kay et al. (2008) and Naselaris et al. (2009) adopted wavelet Gabor filters to model the texture of the input image, and Bartels et al. (2008) used a motion energy model to describe visual stimuli during free viewing of movie segments. The computer vision community has developed a large number of visual feature descriptors to represent images/videos from different perspectives, for example, color, shape, and motion. These features are objective and can be derived automatically by computer vision algorithms. However, whether such computer vision based features are feasible for fMRI encoding models has not been fully examined. Furthermore, in the computer vision field, visual features are typically evaluated by conducting recognition or classification experiments on image/video benchmarks with human-labeled ground truth. This evaluation mechanism takes an engineering view without fully considering human brain cognition. It is therefore of great interest to explore the feasibility of applying brain encoding models to evaluate and compare various visual features.

The second problem lies in the structural substrates for functional brain response modeling. The structural substrates provide the basis for extracting meaningful information from fMRI data. In existing encoding models, voxel-based and ROI-based methods (Thirion et al. 2007; Polyn et al. 2005; Naselaris et al. 2011; Dumoulin and Wandell 2008; Mitchell et al. 2008) have been widely adopted. Voxels and ROIs were determined manually based on neuroscience domain knowledge or automatically based on activation detection using task-fMRI. Although voxel- and ROI-based methods are easy to implement and effective in many existing works, their reproducibility, generalizability and reliability have been limited by the lack of a common and individualized representation of human brain architecture, as pointed out in (Liu 2011; Chen et al. 2014). Specifically, voxel-based methods make it difficult to assess the consistency of encoding models across subjects because of the intrinsic variability of brain structure and function and the resulting lack of precise voxel-wise correspondence between subjects (Liu 2011). Recently, we developed and validated a novel data-driven strategy, DICCCOL (dense individualized and common connectivity-based cortical landmarks) (Zhu et al. 2012, 2013), to discover consistent and corresponding structural landmarks across individual brains. In total, 358 consistent and corresponding functional landmarks were identified, each of which was optimized to possess maximal group-wise consistency of DTI-derived fiber shape patterns (Zhu et al. 2012). Moreover, this set of 358 structural brain landmarks can be accurately and reliably predicted in a new subject based only on DTI data (Zhang et al. 2012). The DICCCOL system thus provides an appropriate representation of the human brain network and enables exploration of the consistency of encoding and decoding brain network responses across subjects.

The third problem lies in the component of brain response modeling. In the previous encoding literature (Naselaris et al. 2011; Kay et al. 2008; Miyawaki et al. 2008), fMRI blood oxygen-level dependent (BOLD) intensities have been widely used to measure the brain's functional response. However, many reports (Logothetis et al. 2001; Chen et al. 2014; Heeger and Ress 2002) have pointed out that fMRI BOLD signals are often sensitive to physiological motion effects and non-neuronal noise, which may reduce the reliability of encoding models. Another group of methods adopted brain activation patterns estimated by the general linear model (GLM) (e.g., Haxby et al. 2001; Naselaris et al. 2011; Sterzer et al. 2008; Walther et al. 2009; Mitchell et al. 2008) to construct encoding models. Recently, results reported in the literature (Richiardi et al. 2011; Shirer et al. 2012) suggest that functional connectivity is a new, alternative methodology for quantitatively measuring functional brain response. Essentially, brain function results from large-scale functional connectivity (Haynes and Rees 2006; Lynall et al. 2010; Friston 2009; Hagmann et al. 2010). The brain's comprehension of visual stimuli can be precisely represented by the functional connectivities and interactions among relevant brain networks (Friston 2009; Hagmann et al. 2010; Lynall et al. 2010). Notably, a few recent studies (Hu et al. 2012; Han et al. 2013) have demonstrated that functional connectivity is an effective tool to model brain responses during free viewing of videos.

The three problems described above motivate us to develop a novel fMRI encoding model to predict the brain's responses to free viewing of videos. The architecture of the proposed encoding model is illustrated in Fig. 1. To represent the visual stimuli, we adopt a number of representative features from computer vision research, including the RGB histogram, color moments, Histogram of Oriented Optical Flow (HOOF) (Nayak et al. 2011), and RGB-SIFT (Van De Sande et al. 2010). To model brain activity in response to video stimuli consistently across subjects, we use the DICCCOL system (Zhu et al. 2013) to localize large-scale cortical ROIs and measure the functional connectivities among them. Afterwards, the encoding model bridging feature space and response space is trained via least-squares support vector regression (LSSVR) (Suykens and Vandewalle 1999). Experimental results demonstrate that brain network responses during free viewing of videos can be robustly and accurately predicted from these visual features and across different subjects.

Fig. 1

The framework of the proposed encoding model. A number of representative features from computer vision research (Van De Sande et al. 2010) are adopted to model the input video stimuli, and the DICCCOL system (Zhu et al. 2013) is used to localize large-scale cortical ROIs, based on which the brain responses are quantified as the functional interactions among the ROIs. Afterwards, the encoding model bridging feature space and response space is trained via least-squares support vector regression (LSSVR) (Suykens and Vandewalle 1999)

The rest of this paper is organized as follows. Section "Materials and methods" describes the materials and methods adopted in this paper, including the brain response representation procedure, the visual feature extraction pipeline, and the specifics of the proposed encoding model. The experiments and results are reported in the "Experimental results" section. Finally, discussion and conclusions are given in the "Conclusion" section.

Materials and methods

Data acquisition and pre-processing

Subjects and stimuli

Three subjects participated in the study, which was approved by the University of Georgia IRB. All participants were young male students aged between 20 and 30 years, in good health with no history of psychiatric or neurological disease, and with normal or corrected-to-normal vision.

Natural stimulus fMRI (N-fMRI), e.g., during video watching in this paper, provides an uncontrolled environment for studying the functional mechanisms of the human brain. We randomly selected 51 shots, including 12 commercials, 19 weather reports and 20 sports clips, from the TRECVID 2005 data set (Smeaton et al. 2006). Each video clip lasts approximately 60 s.

MRI data acquisition

During the fMRI scans, the clips were presented to the subjects via MRI-compatible goggles. The E-Prime software (Schneider et al. 2002) was used to strictly synchronize movie viewing and fMRI acquisition. Each participating subject underwent multimodal DTI and fMRI scans in three separate scan sessions. The acquired DTI data of each participant were used to localize their DICCCOL ROIs.

Functional images were acquired on a GE 3T Signa MRI system with an 8-channel head coil at the University of Georgia Bioimaging Research Center. The scan parameters were as follows: 30 axial slices, matrix size 64 × 64, 4 mm slice thickness, 220 mm FOV, TR = 1.5 s, TE = 25 ms, ASSET = 2. Diffusion tensor imaging data were also acquired for DICCCOL landmark localization, with an isotropic spatial resolution of 2 mm × 2 mm × 2 mm and the following parameters: TR = 15.5 s, TE = min-full, b-value = 1,000 for 30 DWIs and 3 B0 volumes.

Data preprocessing

fMRI data were preprocessed using FSL (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/). The fMRI preprocessing included skull removal, motion correction, spatial smoothing, temporal prewhitening, slice timing correction, and global drift removal. To predict ROIs for each subject with the DICCCOL system, the DTI preprocessing included skull removal, motion correction and eddy current correction. Fiber tracking was performed via MEDINRIA (http://www-sop.inria.fr/asclepios/software/MedINRIA/).

Brain response modeling

Localizing reproducible and accurate cortical ROIs that are consistent and correspondent across individuals is a critical problem for brain network studies. Recently, we developed and validated a novel data-driven discovery approach that identified 358 consistent and corresponding DICCCOL ROIs in over 240 brains (Zhu et al. 2013). The underlying neuroscience foundation of the approach is that each cytoarchitectonic brain area possesses a unique set of extrinsic inputs and outputs, termed the "connectional fingerprint" in (Passingham et al. 2002), which principally determines the functions of that area. A variety of recent studies (Laird et al. 2009; Passingham et al. 2002; Zhu et al. 2013; Zhang et al. 2012) have confirmed and replicated this close relationship between structural connection patterns and brain function. In addition, this set of 358 structural brain landmarks can be accurately and reliably predicted in an individual subject based only on DTI data (Zhang et al. 2012), demonstrating remarkable reproducibility and predictability. Therefore, in this paper, we employ the DICCCOL system to localize dense cortical ROIs for each subject.

We first use the brain ROI prediction approach in (Zhang et al. 2012) to localize the 358 DICCCOLs in the scanned subjects with DTI data. Then, after linearly transforming the ROIs to the fMRI image space, natural stimulus fMRI signals are extracted for each of the 358 DICCCOLs. Afterwards, we apply principal component analysis (PCA) to the multiple fMRI time series within each ROI (Zhu et al. 2012), and the component corresponding to the largest eigenvalue is taken as the representative fMRI signal for that ROI. With the 358 ROIs for each subject, the functional connectivity between any pair of ROIs is measured as the Pearson correlation coefficient between their N-fMRI signals, resulting in a 358 × 358 matrix for each video sample. Since functional connectivity is symmetric and the correlation of an ROI with itself is not informative, we obtain a 63,903-dimensional (358 × 357/2) functional response vector for each video sample.
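To make this pipeline concrete, the following is a minimal sketch (not the authors' code) of how a per-video functional response vector could be derived from ROI fMRI signals; the function and variable names are our own, and the PCA step is implemented via an SVD of the within-ROI time series.

```python
import numpy as np

def roi_representative_signal(roi_voxel_ts: np.ndarray) -> np.ndarray:
    """roi_voxel_ts: (n_voxels, n_timepoints) fMRI time series inside one ROI.
    Returns the first principal component time course as the ROI signal."""
    centered = roi_voxel_ts - roi_voxel_ts.mean(axis=1, keepdims=True)
    # SVD of the voxel-by-time matrix; the top right-singular vector is the
    # time course of the dominant (largest-eigenvalue) component.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def functional_response_vector(roi_signals: np.ndarray) -> np.ndarray:
    """roi_signals: (358, n_timepoints) representative signals of the DICCCOL ROIs.
    Returns the 63,903-dimensional vector of pairwise Pearson correlations
    (upper triangle of the 358 x 358 connectivity matrix, diagonal excluded)."""
    conn = np.corrcoef(roi_signals)          # 358 x 358 Pearson correlation matrix
    iu = np.triu_indices_from(conn, k=1)     # indices strictly above the diagonal
    return conn[iu]                          # 358 * 357 / 2 = 63,903 values
```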

Video stimuli representation

A large number of feature descriptors have been developed and used in the computer vision community. A recent work (Van De Sande et al. 2010) reviewed a number of color descriptors commonly used in the computer vision field and quantitatively evaluated them based on the accuracy of object and scene recognition tasks on image/video benchmarks. The work of (Nayak et al. 2011) discussed a number of state-of-the-art features used in activity recognition that are well suited to representing videos' motion patterns. Motivated by (Van De Sande et al. 2010) and (Nayak et al. 2011), in this paper we selected four representative visual descriptors to characterize video clips: RGB histogram, color moments, RGB-SIFT, and HOOF. The former two descriptors measure the video color distribution, while RGB-SIFT characterizes the local shape and spatial information and HOOF describes the global motion information of the video.

RGB histogram

A 48-dimensional color histogram was extracted in RGB color space to describe the global color distribution in the video. The RGB histogram is the concatenation of three 1-D histograms calculated on the R, G, and B channels of the RGB color space.
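A minimal sketch of this descriptor is shown below. The choice of 16 bins per channel is our assumption, inferred from the stated 48 dimensions over three channels; it is not reported explicitly in the text.

```python
import numpy as np

def rgb_histogram(frame: np.ndarray, bins_per_channel: int = 16) -> np.ndarray:
    """frame: (H, W, 3) uint8 RGB key frame. Returns a normalized 48-d descriptor."""
    hists = []
    for c in range(3):
        # One 1-D histogram per color channel over the full 8-bit range.
        h, _ = np.histogram(frame[:, :, c], bins=bins_per_channel, range=(0, 256))
        hists.append(h)
    descriptor = np.concatenate(hists).astype(float)
    return descriptor / (descriptor.sum() + 1e-12)   # normalize to unit sum
```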

Color moments

Although the color moments (Amir et al. 2003) of an image in RGB space are simple to calculate, they are very effective for image/video analysis. In this paper, an image is first partitioned into 2 × 3 sub-blocks, and then the color moments of each block in each channel are calculated and concatenated. Similar to (Amir et al. 2003), we use three moments, the mean, standard deviation and skewness, to represent an image's color distribution. Thus, we obtain a 54-dimensional color moments descriptor for each key frame.
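A minimal sketch of the 54-dimensional color moments descriptor follows: the frame is split into a 2 × 3 grid, and the mean, standard deviation, and skewness of each RGB channel are computed per block (6 blocks × 3 channels × 3 moments = 54). The exact skewness normalization is our assumption.

```python
import numpy as np

def color_moments(frame: np.ndarray, grid=(2, 3)) -> np.ndarray:
    """frame: (H, W, 3) RGB key frame. Returns the 54-d color moments descriptor."""
    h, w, _ = frame.shape
    rows, cols = grid
    feats = []
    for i in range(rows):
        for j in range(cols):
            block = frame[i * h // rows:(i + 1) * h // rows,
                          j * w // cols:(j + 1) * w // cols, :].astype(float)
            for c in range(3):
                vals = block[:, :, c].ravel()
                mean = vals.mean()
                std = vals.std()
                # Third central moment normalized by std^3 (skewness).
                skew = ((vals - mean) ** 3).mean() / (std ** 3 + 1e-12)
                feats.extend([mean, std, skew])
    return np.asarray(feats)   # length 2 * 3 * 3 * 3 = 54
```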

RGB-SIFT

The SIFT descriptor proposed in (Lowe 2004) is one of the state-of-the-art techniques for characterizing the local shape of a region, based on edge orientation histograms derived from gradient information. RGB-SIFT (Van De Sande et al. 2010) computes SIFT descriptors in the RGB color space.
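A simplified sketch using OpenCV is given below: standard SIFT descriptors are computed independently on the R, G, and B channels at shared keypoints and concatenated per keypoint. Sharing keypoints across channels and the omission of any subsequent aggregation into a fixed-length video descriptor (e.g., a codebook step) are assumptions of this sketch, not details from the paper.

```python
import cv2
import numpy as np

def rgb_sift(frame: np.ndarray) -> np.ndarray:
    """frame: (H, W, 3) uint8 RGB key frame. Returns (n_keypoints, 384) descriptors."""
    sift = cv2.SIFT_create()
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    keypoints = sift.detect(gray, None)          # keypoints shared across channels
    per_channel = []
    for c in range(3):
        channel = np.ascontiguousarray(frame[:, :, c])
        _, desc = sift.compute(channel, keypoints)
        per_channel.append(desc)
    return np.hstack(per_channel)                # 3 x 128 = 384 dims per keypoint
```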

HOOF

The Histogram of Oriented Optical Flow (Nayak et al. 2011) is a popular and effective scale-invariant global feature in the computer vision community that represents the motion in an entire frame using optical flow (Baker et al. 2011). HOOF was extracted as follows. First, optical flow was computed at every key frame. Then, each optical flow vector was binned according to its primary angle and weighted by its magnitude. The number of bins was set to 60.
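The sketch below illustrates a HOOF-style descriptor. The use of dense Farneback optical flow between two consecutive frames is our assumption; the paper does not specify which flow algorithm from (Baker et al. 2011) was used.

```python
import cv2
import numpy as np

def hoof(prev_gray: np.ndarray, next_gray: np.ndarray, n_bins: int = 60) -> np.ndarray:
    """prev_gray, next_gray: (H, W) uint8 grayscale frames. Returns a 60-d histogram."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    angle = np.arctan2(dy, dx)                      # primary angle in [-pi, pi)
    hist, _ = np.histogram(angle, bins=n_bins, range=(-np.pi, np.pi),
                           weights=magnitude)       # magnitude-weighted binning
    return hist / (hist.sum() + 1e-12)              # normalize for scale invariance
```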

The TRECVID 2005 dataset provides multiple key frames for each video sample. First, for each key frame provided by the TRECVID 2005 dataset, the four feature descriptors described above (RGB histogram, color moments, RGB-SIFT and HOOF) were calculated. Then each video sample was represented by the average of the feature vectors of all its key frames. Figure 2 shows the visual patterns of each descriptor for a sample video clip; each representation corresponds to one visual descriptor.

Fig. 2

The visual patterns of the RGB histogram, color moments, HOOF, and RGB-SIFT for an exemplar video clip. While the RGB histogram and color moments describe the video color distribution, HOOF characterizes the global motion information in the video and RGB-SIFT characterizes the local shape and spatial information. (Color figure online)

LSSVR-based stimuli–brain response mapping

In current studies, the mapping between external stimuli and brain response is mainly accomplished by machine learning methods. In early brain mapping research, GLM models (Friston et al. 1995) were widely used to relate the brain's hemodynamic responses to external stimuli due to their simplicity and effectiveness (Haxby et al. 2001; Naselaris et al. 2011; Sterzer et al. 2008; Walther et al. 2009; Mitchell et al. 2008). In addition, researchers have explored several machine learning methods such as Gaussian Naive Bayes (GNB), SVM-based methods, and K-nearest neighbor (KNN) to model the relationship between stimuli and brain response (Naselaris et al. 2009; Mitchell et al. 2004; Walther et al. 2009). Among these methods, SVM-based approaches show clear advantages, especially when there are a large number of features, as the regularization in SVM-based methods helps weaken the effect of noisy features that are highly correlated with each other (Pereira et al. 2009).

In our study, the least-squares support vector regression algorithm (LSSVR) (Suykens and Vandewalle 1999) is adopted to solve the mapping $f(X \rightarrow y)$ such that $f(X)$ deviates from the actually obtained targets by at most $\varepsilon$ for all the training data, while being as flat as possible (Suykens and Vandewalle 1999). The flatness of $f(X)$ ensures good generalizability when predicting the brain's responses from the corresponding visual features of a new video sample. The encoding model is trained for each dimension of the functional response vector independently. Denote $\mathbf{y} = (e_{ij}^{1}, e_{ij}^{2}, \ldots, e_{ij}^{n})^{T}$ as the set of brain responses, where $e_{ij}^{k}$ is the functional connectivity between the $i$-th and $j$-th ROI in the $k$-th video sample. Denote $\mathbf{X} = (X_{1}, X_{2}, \ldots, X_{n})^{T}$ as the visual feature set, where $X_{k} = (x_{k1}, x_{k2}, \ldots, x_{kp})^{T}$, $p$ is the dimensionality of the visual feature, and $n$ is the total number of training video samples. As suggested in (Naselaris et al. 2011; Pereira et al. 2009), a linear kernel was used.
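For illustration, the following is a minimal sketch of linear-kernel LSSVR that solves the standard LSSVR linear system of Suykens and Vandewalle (1999) directly; the regularization parameter gamma is a placeholder, not a value taken from the paper. One such regressor would be trained independently for each of the 63,903 functional connections.

```python
import numpy as np

class LinearLSSVR:
    def __init__(self, gamma: float = 1.0):
        self.gamma = gamma

    def fit(self, X: np.ndarray, y: np.ndarray):
        """X: (n_samples, p) visual features; y: (n_samples,) one functional connectivity."""
        n = X.shape[0]
        K = X @ X.T                                  # linear kernel matrix
        ones = np.ones(n)
        # LSSVR KKT system: [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]
        A = np.zeros((n + 1, n + 1))
        A[0, 1:] = ones
        A[1:, 0] = ones
        A[1:, 1:] = K + np.eye(n) / self.gamma
        rhs = np.concatenate(([0.0], y))
        sol = np.linalg.solve(A, rhs)
        self.b, self.alpha, self.X_train = sol[0], sol[1:], X
        return self

    def predict(self, X_new: np.ndarray) -> np.ndarray:
        # f(x) = sum_k alpha_k <x_k, x> + b
        return (X_new @ self.X_train.T) @ self.alpha + self.b
```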

Afterwards, leave-one-out cross-validation was adopted to evaluate the performance of the trained encoding models. Each encoding model's training and testing were performed for each subject and each visual feature set independently, so that twelve sets of encoding models were trained for the three subjects using the four visual feature sets. Given a trained encoding model, the prediction error associated with the functional connectivity between every pair of ROIs is calculated over all video samples:

$$error_{ij} = \frac{1}{n}\sum_{k = 1}^{n} \left| \frac{\hat{e}_{ij}^{k} - e_{ij}^{k}}{e_{ij}^{k}} \right|$$
(1)

where $\hat{e}_{ij}^{k}$ is the predicted $e_{ij}^{k}$ for the $k$-th video sample.
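The sketch below shows how Eq. (1) could be evaluated under leave-one-out cross-validation for a single connection (i, j); `LinearLSSVR` refers to the sketch above, and the loop over all 63,903 connections is omitted for brevity.

```python
import numpy as np

def loo_relative_error(X: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """X: (n, p) per-video visual features; y: (n,) functional connectivity e_ij
    across the n video samples. Returns error_ij as defined in Eq. (1)."""
    n = X.shape[0]
    errors = []
    for k in range(n):
        train = np.arange(n) != k                      # hold out the k-th video sample
        model = LinearLSSVR(gamma).fit(X[train], y[train])
        e_hat = model.predict(X[k:k + 1])[0]
        errors.append(abs((e_hat - y[k]) / y[k]))      # |(e_hat - e) / e| for sample k
    return float(np.mean(errors))
```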

Experimental results

Evaluation of visual features in feature space

We first performed a video classification test to evaluate the distinctiveness of the adopted visual features. As the classifier, we adopted the K-nearest neighbor (K-NN) classifier due to its simplicity and efficiency. The classification test was performed on the 51 video clips using each visual feature, and the average precision under leave-one-out cross-validation was calculated. Figure 3 shows the results for the different visual features, which reflect their distinctiveness. From Fig. 3, we can see that RGB-SIFT performs the best, followed by HOOF, color moments and the RGB histogram, respectively.
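A minimal sketch of this distinctiveness test is shown below: leave-one-out K-NN classification of the 51 clips into their three categories. The value of K, the distance metric, and the use of mean accuracy as the summary score are assumptions of this sketch rather than reported choices.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def feature_classification_score(features: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """features: (51, p) per-clip descriptors; labels: (51,) category ids
    (commercial / weather report / sports). Returns mean leave-one-out accuracy."""
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, features, labels, cv=LeaveOneOut())
    return float(scores.mean())
```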

Fig. 3

Evaluation of visual descriptors in feature space. The bar-plot indicates the average precision of video classification in leave-one-out cross-validation using these visual features

Encoding accuracy

In terms of the encoding error defined in Eq. (1), we assessed the proportion of relatively accurate predictions using different visual features and across different subjects. The cumulative histogram curves of encoding error are shown in Fig. 4. The x-axis in Fig. 4 is a predefined threshold for the error defined in Eq. (1), and the y-axis is the proportion (relative to all 63,903 functional interactions) of functional connections with encoding error less than that threshold. Note that in Fig. 4, the error threshold is shown only up to 100 % for better visualization. The areas under the curves in Fig. 4 are summarized in Table 1.

Fig. 4

Encoding accuracy results using DICCCOL: cumulative distribution of the relatively accurately predicted connections in different feature spaces and subjects against a predefined threshold for encoding error (Eq. (1)). The x-axis is a predefined threshold for the error defined in Eq. (1), and the y-axis is the proportion (relative to all 63,903 functional interactions) of functional connections with encoding error less than that threshold

Table 1 Area under the cumulative distribution curves in Fig. 4

Based on the results shown in Fig. 4 and Table 1, a few important points can be observed. (1) A number of functional connections can be predicted with relatively high accuracy by the proposed encoding models. For example, in the first subject, 255, 87, 124 and 74 functional connections can be predicted with error less than 20 % by RGB-SIFT, HOOF, color moments and RGB histogram, respectively; the corresponding numbers are 328, 155, 86 and 23 in the second subject and 295, 174, 135 and 75 in the third subject. (2) The number of accurately predicted functional connections is highest with the RGB-SIFT feature in all three subjects, followed by HOOF, color moments and the RGB histogram in turn. This result may be explained by two factors. First, it is widely accepted in the computer vision community that RGB-SIFT, which characterizes meaningful shape and spatial information of visual stimuli, is more complex than the other features; comprehending such information may therefore involve more brain regions and functional interactions. Second, it has been reported in (Van De Sande et al. 2010), and validated in the "Evaluation of visual features in feature space" section, that RGB-SIFT performs better than the other features in recognizing objects and scenes. The distinctiveness of a feature may be an inherent factor determining its encoding accuracy. (3) Inter-subject variation can be observed, which may be caused by the subjects' different capabilities in recognizing objects and scenes. We provide more details about the inter-subject consistency of the encoding model in the next subsection.

Encoding consistency across subjects

The number of correctly encoded functional connections for each visual feature differs across subjects. Here, a functional connection is regarded as "correctly encoded" if its prediction error is below a predefined threshold. We then use the overlap ratio of correctly encoded functional connections across subjects to assess the inter-subject consistency of the encoding model for a visual feature. In this paper, we assume that two functional connections from different subjects are equivalent if both belong to the same type of sub-network interaction. For example, two functional connections may involve different DICCCOL ROIs in two subjects; however, if both are functional interactions between the visual and attention systems of the human brain, they are treated as equivalent. Figure 5a shows the number of overlapped functional connections across the three subjects against the error threshold for different feature spaces. Figure 5b shows the overlap ratio, calculated as the number of overlapped functional connections divided by the total number of correctly encoded functional connections for a specific subject. Again, the encoding error threshold is shown up to 100 % for better visualization.
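The sketch below shows one possible operationalization of this overlap metric, in which connections are mapped to (sub-network, sub-network) interaction types before comparison; the ROI-to-sub-network labelling is assumed to be available, and counting at the level of interaction types is our simplification of the equivalence rule described above.

```python
import numpy as np

def overlap_ratios(errors_by_subject, roi_to_network, threshold: float = 0.3):
    """errors_by_subject: list of (358, 358) encoding-error matrices, one per subject.
    roi_to_network: length-358 array of sub-network labels for the DICCCOL ROIs.
    Returns one overlap ratio per subject at the given error threshold."""
    iu = np.triu_indices(358, k=1)
    interaction_sets = []
    for err in errors_by_subject:
        correct = err[iu] < threshold                       # "correctly encoded" connections
        types = {tuple(sorted((roi_to_network[i], roi_to_network[j])))
                 for i, j, ok in zip(iu[0], iu[1], correct) if ok}
        interaction_sets.append(types)
    shared = set.intersection(*interaction_sets)            # interaction types common to all subjects
    return [len(shared) / max(len(s), 1) for s in interaction_sets]
```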

Fig. 5

Encoding consistency across different subjects. a The number of overlapped functional connections against a predefined threshold for error (Eq. 1) across the three subjects for different feature spaces. b The overlap ratio for different subjects and feature spaces

From Fig. 5 we can see that the encoding model based on RGB-SIFT shows the best inter-subject consistency, followed by those based on HOOF, color moments and the RGB histogram, especially when the encoding error is small (e.g., less than 30 %). For example, when the encoding error is less than 30 %, the average overlap ratios in the four feature spaces are 0.3708, 0.3088, 0.1881 and 0.1020, respectively. Unlike the encoding accuracy metric in the "Encoding accuracy" subsection, the inter-subject consistency of the encoding models is related only to the capability of the corresponding feature set in characterizing the content of the input video stimuli. In this context, we may conclude that RGB-SIFT outperforms HOOF, color moments and the RGB histogram in describing video content from the perspective of functional brain response prediction. Likewise, the work in (Van De Sande et al. 2010) and the "Evaluation of visual features in feature space" section also demonstrated that RGB-SIFT performs better than the other features in recognizing objects and scenes from images and videos.

Conclusion

In this paper, we proposed an fMRI encoding model to predict brain network responses to free viewing of videos. Brain responses were quantified as the functional interactions in large-scale brain networks identified by the recently developed and validated DICCCOL brain landmark localization system. The encoding model, which maps the feature space to the brain response space, was trained with LSSVR. Our experimental results demonstrate that brain network responses to video stimuli can be robustly and accurately predicted across both different feature spaces and different subjects.

Our major contributions are summarized as follows. (1) We adopted a number of representative visual features from the computer vision community to represent each video sample. As mentioned before, the computational representation of stimuli in feature space has been quite limited; taking advantage of visual features from the computer vision community should greatly benefit encoding and decoding studies. (2) We are the first to employ the DICCCOL system to explore the consistency of encoding and decoding models across different subjects. The remarkable reproducibility and predictability of DICCCOLs in individual subjects give DICCCOL great advantages for inter-subject generalization. (3) Our study revealed the feasibility of using fMRI-based brain encoding techniques to evaluate visual features. Beyond neuroimaging, our results of testing visual features in the encoding model are consistent with those in the computer vision community, which implies inherent correlations between the discriminativeness of a feature and its encoding capability.

In the future, we will improve the proposed work in the following aspects. First, both the number of participants and the number of natural-stimulus video clips are relatively small. We will collect a larger-scale dataset that includes more participants and more video clips as external stimuli, and repeat the studies proposed in this paper. Second, a number of structured visual features from the computer vision community will be applied. Meanwhile, we will derive and test more brain response features reflecting the brain's comprehension of video stimuli. Finally, other brain mapping techniques, such as sparsity-constrained regression models, will be investigated and compared with the LSSVR algorithm used in this study. We believe that the combination of functional brain imaging and computer vision research will offer great benefits to both fields.