Video and image saliency prediction model

Studies of the human visual system (HVS) show that when a person views a scene (a video or image), he or she pays most visual attention to a small region (the fovea) around the point of eye fixation, which is perceived at high resolution. The remaining regions, namely the peripheral regions, are captured with little attention at low resolution. In this way, humans avoid processing a tremendous amount of visual data. Saliency prediction is the task of predicting such regions of interest; its output, the saliency map, indicates which regions are more likely to be fixated. Recently, saliency prediction has been widely applied in object detection, object recognition, image retargeting, image quality assessment, and image/video compression. Taking video compression as an example, a report from Cisco stated that internet video traffic would reach 80 EB per month. Indeed, one of the most common uses of the saliency map is to provide weights that guide spatial video compression.

Our research covers saliency prediction for both images and videos. The main difference between them is that people view an image longer than a single frame of a video, so the saliency map of an image is sparser. For images, saliency depends only on spatial features. For videos, however, the change and consistency across the frame sequence have a strong influence on human visual attention, so temporal features usually play a more important role in video saliency prediction algorithms.

Saliency prediction for face images: One line of our image saliency research focuses on face images. Although existing work takes one or more faces into account for saliency prediction, it does not explore the distribution of eye fixations within faces. As shown in Fig 1, the simple assumption of an isotropic Gaussian model (GM) for the saliency distribution in a face is limited in modeling the visual attention attracted by faces. As can be seen in this figure, for images with small faces, a non-isotropic GM models the saliency distribution inside the face more accurately. For images with large faces, a single GM is not effective, as fixations tend to cluster around facial features (e.g., the eyes). Accordingly, the saliency distribution, in the form of a Gaussian mixture model (GMM), needs to be learnt from eye fixations on face images. Fig 1 (d) shows that saliency with the learnt GMM distribution is more consistent with the ground-truth visual attention. Specifically, one non-isotropic Gaussian component should be used for images with small faces, whereas more than one component can be applied for images with large faces. We thereby propose a learning-based saliency detection method, which learns various GMMs and the corresponding weights across different face sizes.
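The core idea above can be sketched as follows: fit a full-covariance (non-isotropic) GMM to fixation coordinates inside the face region, then evaluate the learnt density as a saliency map. This is a minimal illustration with synthetic fixation points standing in for real eye-tracking data; the component count, coordinates, and grid size are all assumptions, not the paper's actual settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical fixation coordinates (x, y), normalized to the face
# bounding box, clustering around the eyes and mouth of a large face.
fixations = np.vstack([
    rng.normal([0.35, 0.35], 0.05, size=(60, 2)),   # left eye
    rng.normal([0.65, 0.35], 0.05, size=(60, 2)),   # right eye
    rng.normal([0.50, 0.75], 0.08, size=(30, 2)),   # mouth
])

# Fit a full-covariance GMM (non-isotropic components), as suggested
# for large faces; one component would suffice for a small face.
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      random_state=0).fit(fixations)

# Evaluate the learnt density on a grid to obtain the face-channel
# saliency map, normalized to [0, 1].
xs, ys = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
grid = np.column_stack([xs.ravel(), ys.ravel()])
saliency = np.exp(gmm.score_samples(grid)).reshape(64, 64)
saliency /= saliency.max()
print(saliency.shape)
```

In the actual method, both the GMM parameters and the channel weights are learnt per face-size category rather than fixed as here.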




Fig 1. Examples of saliency prediction vs. fixations in face regions, selected from Zhao et al. The red dots represent fixations recorded by the eye tracker. Note that only the saliency and fixations belonging to face regions are displayed.


To analyze visual attention on face images, we established a large eye-tracking database, in which 510 images with faces of different sizes were free-viewed by 24 subjects. From this database we made the following observations.

Observation 1: Visual attention on faces is significantly greater than that on the background.

Observation 2: Visual attention on a face increases with the size of the face in the image.

Observation 3: Visual attention on facial features increases with the size of the face in the image.




Fig 2. The framework of our work


For face images, we predict saliency by integrating the top-down channels of the face and facial features, in which the GMMs of the top-down saliency distribution and the corresponding weight of each top-down channel are learnt from the training fixations. Combined with conventional bottom-up features (i.e., color, intensity, and orientation), our saliency detection method accurately predicts human visual attention on face images. This is because our method learns the GMM distribution of attention on faces, rather than simply assuming an isotropic Gaussian distribution of saliency over face regions as other methods do.
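The channel integration described above amounts to a weighted linear combination of feature maps. A minimal sketch, with random maps standing in for the real channels and hand-picked weights standing in for the learnt ones:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 48, 64
# Hypothetical channel maps in [0, 1]: bottom-up (color, intensity,
# orientation) and top-down (face, facial-feature) saliency channels.
channels = {name: rng.random((H, W)) for name in
            ["color", "intensity", "orientation", "face", "features"]}

# Hypothetical per-channel weights; in the method these are learnt
# from training fixations for each face-size category.
weights = {"color": 0.1, "intensity": 0.1, "orientation": 0.1,
           "face": 0.5, "features": 0.2}

# Weighted linear combination, then normalization to [0, 1].
saliency = sum(w * channels[c] for c, w in weights.items())
saliency /= saliency.max()
print(saliency.shape)
```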

Compared with 8 state-of-the-art methods, our method locates the salient regions well, producing maps much closer to the maps of human fixations.

Saliency prediction for natural images: We also propose a saliency prediction algorithm (OSDL) for natural images based on a novel low-level feature (SR-LTA). For natural images without many semantic objects, we found that salient regions (and likewise non-salient regions) share similar texture patterns. As shown in Fig 3, the salient regions (obtained from eye-tracking data) of different images in the first row are quite similar and can be represented by a few basic texture atoms. The non-salient regions in the second row have the same property.




Fig 3. The texture similarity of salient and non-salient regions


Based on the above observation, we designed a dictionary-learning algorithm that learns two sets of basic atoms, from salient regions (positive samples) and non-salient regions (negative samples), respectively. Each input image patch from the test set is then reconstructed via sparse representation over the two learned dictionaries, and the SR-LTA feature is computed as the difference between the two reconstruction errors. Specifically, a center-bias term (modeling a human visual mechanism) is added to the optimization function of dictionary learning, and an OSDL algorithm is proposed to solve this problem. The final saliency map is obtained by linearly combining the SR-LTA feature with two other low-level features (contrast and luminance). A sparsity correction is also proposed to make the weight distribution of each feature map and of the final saliency map more reasonable. The overall framework is shown in Fig 4.
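The two-dictionary idea can be sketched with off-the-shelf dictionary learning and orthogonal matching pursuit. This is only an illustration of the reconstruction-error difference: the patches are synthetic, the dictionary sizes and sparsity level are assumptions, and the center-bias term and OSDL solver of the actual method are omitted.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import orthogonal_mp

rng = np.random.default_rng(0)
# Hypothetical flattened 8x8 texture patches from salient and
# non-salient regions; real patches would come from eye-tracking data.
pos_patches = rng.normal(0.0, 1.0, size=(200, 64))
neg_patches = rng.normal(0.0, 0.3, size=(200, 64))

def learn_dict(patches, n_atoms=32):
    """Learn a dictionary of basic texture atoms from sample patches."""
    model = MiniBatchDictionaryLearning(n_components=n_atoms,
                                        alpha=1.0, random_state=0)
    return model.fit(patches).components_     # shape (n_atoms, 64)

D_pos, D_neg = learn_dict(pos_patches), learn_dict(neg_patches)

def recon_error(D, patch, k=5):
    """Sparse-code `patch` over dictionary D with k nonzero atoms and
    return the l2 reconstruction error."""
    coef = orthogonal_mp(D.T, patch, n_nonzero_coefs=k)
    return np.linalg.norm(patch - D.T @ coef)

# SR-LTA-style feature: a patch that the salient dictionary
# reconstructs better than the non-salient one gets a positive score.
test_patch = rng.normal(0.0, 1.0, size=64)
feature = recon_error(D_neg, test_patch) - recon_error(D_pos, test_patch)
print(float(feature))
```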




Fig 4. The overall framework of this work


The experimental results show that our method outperforms 9 state-of-the-art bottom-up saliency detection methods in terms of ROC, AUC, CC, NSS, and chi-square distance.
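For reference, two of these metrics (NSS and CC) can be sketched in a few lines; the definitions below follow the standard saliency-evaluation literature, and the maps are synthetic stand-ins for real predictions and fixation data.

```python
import numpy as np

def nss(saliency, fixation_map):
    """Normalized Scanpath Saliency: mean of the z-scored saliency map
    at fixated locations (higher is better)."""
    s = (saliency - saliency.mean()) / saliency.std()
    return s[fixation_map > 0].mean()

def cc(saliency, fix_density):
    """Linear correlation coefficient between the predicted map and a
    ground-truth fixation density map."""
    return np.corrcoef(saliency.ravel(), fix_density.ravel())[0, 1]

rng = np.random.default_rng(0)
density = rng.random((32, 32))                   # ground-truth density
pred = density + rng.normal(0, 0.1, (32, 32))    # a good prediction
fix = (density > 0.9).astype(int)                # binary fixation map
print(round(nss(pred, fix), 2), round(cc(pred, density), 2))
```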

Video saliency prediction in the compressed domain: For video saliency prediction, we detect saliency in the compressed domain. In other words, human fixations can be predicted from the bit stream of a video (decoded only at a very shallow level) instead of the fully decoded, uncompressed video. Since almost all videos are stored and transmitted as bit streams, a compressed-domain saliency prediction method saves much computational time and storage. As one of the most popular state-of-the-art standards, High Efficiency Video Coding (HEVC) is widely used in video compression. Three basic HEVC features can be extracted from an HEVC-encoded bit stream: the CTU structure in Fig 5 (a), bit allocation in Fig 5 (b), and motion vectors in Fig 5 (c). A high correlation between the HEVC features and human fixations can be seen in Fig 5. Furthermore, the statistical results in Fig 6 show that human fixations tend to fall into regions with large-valued HEVC features.




Fig 5. The correlation between human fixations and HEVC features




Fig 6. The proportions of human fixations falling into different groups of pixels, in which the values of the corresponding HEVC features are sorted in descending order.


Based on the three aforementioned HEVC features, 3 spatial and 3 temporal features are proposed to evaluate the spatial and temporal differences of the CTU structure, bit allocation, and motion vectors, respectively. In addition, we develop a simple voting algorithm to detect the camera motion of the video; the camera motion is compensated to make the temporal and motion-related features more reasonable. The framework of the feature extractor is shown in Fig 7.
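The voting idea for camera-motion detection can be sketched as a 2-D histogram vote over the motion-vector field: the modal cell is taken as the global (camera) motion and subtracted from every vector. The bin sizes, MV range, and synthetic MV field below are all assumptions for illustration, not the method's actual parameters.

```python
import numpy as np

def camera_motion_vote(motion_vectors, bins=16, mv_range=32):
    """Estimate global (camera) motion as the most-voted motion vector
    over all coding units: a minimal voting sketch."""
    mv = np.asarray(motion_vectors).reshape(-1, 2)
    # Quantize MVs into a 2-D histogram and take the modal cell.
    hist, xe, ye = np.histogram2d(mv[:, 0], mv[:, 1], bins=bins,
                                  range=[[-mv_range, mv_range]] * 2)
    i, j = np.unravel_index(np.argmax(hist), hist.shape)
    # The cell center gives the estimated camera motion.
    return np.array([(xe[i] + xe[i + 1]) / 2, (ye[j] + ye[j + 1]) / 2])

rng = np.random.default_rng(0)
# Hypothetical MV field: most blocks follow the camera pan (10, 2),
# while a few belong to an independently moving object.
background = rng.normal([10.0, 2.0], 0.5, size=(90, 2))
objects = rng.normal([-10.0, 6.0], 0.5, size=(10, 2))
mvs = np.vstack([background, objects])

cam = camera_motion_vote(mvs)
compensated = mvs - cam          # motion relative to the camera
print(cam)
```

After compensation, the object blocks retain large residual motion while the background falls near zero, which is what makes the temporal features meaningful.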




Fig 7. The framework of the HEVC feature extractor


After extracting the 9 features, a support vector machine (SVM) with a linear kernel is trained on the training set to integrate the feature maps into the final saliency map. Fig 8 shows the overall framework of this video saliency prediction method. The effectiveness of each individual feature and of the whole model is validated by the experimental results. In terms of evaluation metrics such as AUC, CC, NSS, KL divergence, and EER, our method outperforms 7 other state-of-the-art saliency prediction methods in both the compressed and uncompressed domains.
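A linear SVM trained on per-pixel feature vectors effectively learns a weighted combination of the 9 feature maps; its signed decision value can then serve as the saliency score. A hedged sketch with synthetic labels standing in for real fixation data (the feature values, map size, and labeling rule are assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Hypothetical training data: 9 HEVC feature values per pixel, labeled
# 1 if the pixel was fixated and 0 otherwise.
n = 500
X = rng.normal(size=(n, 9))
w_true = rng.normal(size=9)
y = (X @ w_true + rng.normal(0, 0.5, n) > 0).astype(int)

svm = LinearSVC(C=1.0).fit(X, y)

# At test time, the decision value at each pixel is the saliency
# score, min-max normalized into a saliency map.
feat_maps = rng.normal(size=(36, 64, 9))          # H x W x 9 features
scores = svm.decision_function(feat_maps.reshape(-1, 9))
saliency = scores.reshape(36, 64)
saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min())
print(saliency.shape)
```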




Fig 8. The overall framework of this work


Mai Xu, associate professor, School of Electronics and Information Engineering, Beihang University. E-mail:




[1] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In ICCV, pages 2106–2113, 2009.

[2] Y. Liu, H. Hu, and M. Xu*. Subjective rate-distortion optimization in HEVC with perceptual model of multiple faces. In VCIP, 2015.

[3] Y. Ren, M. Xu*, R. Pan, and Z. Wang. Learning Gaussian mixture model for saliency detection on face images. In ICME, 2015.

[4] M. Xu*, Y. Ren, and Z. Wang. Learning to predict saliency on face images. In ICCV, 2015.

[5] L. Jiang, M. Xu*, Z. Ye, and Z. Wang. Image saliency detection with sparse representation of learnt texture atoms. In ICCV Workshops, 2015.

[6] M. Xu, L. Jiang, Z. Ye, and Z. Wang*. Bottom-up saliency detection with sparse representation of learnt texture atoms. Pattern Recognition, 2016.

[7] M. Xu*, L. Jiang, X. Sun, Z. Ye, and Z. Wang. Learning to detect video saliency with HEVC features. IEEE Transactions on Image Processing, 2016.