From spatial to spatiotemporal: salient/primary video object segmentation

Segmenting salient objects in images and videos is a classic problem in computer vision. In recent years, image-based salient object detection (SOD) has achieved impressive success, since powerful models can be trained directly on large image datasets using random forests, stacked autoencoders, and deep neural networks. In contrast, segmenting the most salient object sequence in a video (i.e., the primary video object) remains challenging. Due to camera or object motion, the same primary object may co-occur with, or be occluded by, various distractors in different frames (see Fig 1), making it difficult for it to consistently pop out throughout the whole video. The two main problems that prevent fast growth in this area can be summarized as follows:

1) The definition of a salient object in videos is still not very clear, and the definition used for image salient objects cannot be adopted directly;

2) The amount of video data with per-frame, pixel-level annotations is much smaller than that of images, which may prevent direct end-to-end training of spatiotemporal models, especially deep models.




Fig 1. Primary objects may co-occur with or be occluded by various distractors.


To address these issues, we propose VOS, a large-scale benchmark dataset with 200 realistic videos for video-based salient object detection [1]. To avoid ambiguous annotations, we collected two types of user data: 1) eye-tracking data from 23 subjects viewing all 200 videos, and 2) masks of all objects and regions in 7,650 uniformly sampled keyframes, annotated by another 4 subjects. Based on these user data, salient objects in the keyframes of a video are unambiguously annotated as the objects that consistently receive the highest density of fixations throughout the video. To the best of our knowledge, this is currently the largest dataset for video-based salient object detection. With such a large-scale dataset, it becomes feasible to directly learn a supervised or unsupervised model for video-based salient object detection. The dataset is now publicly available for download from our website.
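The annotation rule above can be sketched in code. The snippet below is a hypothetical simplification, not the actual VOS annotation pipeline: it assumes per-keyframe object-id masks and fixation-density maps as NumPy arrays, and selects the object whose regions accumulate the highest mean fixation density across all keyframes.

```python
import numpy as np

def primary_object_id(masks, fixation_maps):
    """Pick the object that consistently receives the highest fixation
    density across keyframes (illustrative sketch, not the VOS tool).

    masks: list of 2-D integer arrays, one per keyframe; each pixel
           holds an object id (0 = background).
    fixation_maps: list of 2-D float arrays of the same shapes,
           holding fixation density from eye-tracking data.
    """
    scores = {}  # object id -> accumulated mean fixation density
    for mask, fix in zip(masks, fixation_maps):
        for obj in np.unique(mask):
            if obj == 0:          # skip background
                continue
            region = mask == obj
            scores[obj] = scores.get(obj, 0.0) + fix[region].mean()
    # the primary object maximizes accumulated fixation density
    return max(scores, key=scores.get)
```

Aggregating over all keyframes (rather than per frame) is what makes the selection consistent: a distractor that is briefly fixated in one frame cannot outrank an object fixated throughout the video.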

Based on this dataset, we propose a novel approach that efficiently predicts and propagates spatial foregroundness and backgroundness within neighborhood reversible flow for primary video object segmentation [2]. The framework, shown in Fig 2, consists of two main modules. In the spatial module, Complementary Convolutional Neural Networks (CCNN) are trained end-to-end on massive images with manually annotated salient objects, so as to simultaneously handle two complementary tasks, i.e., foreground and background estimation, with separate branches. Using CCNN, we obtain initialized foregroundness and backgroundness maps on each individual frame. However, such maps are not always perfectly complementary: certain areas are sometimes mistakenly predicted by both the foreground and background branches (see the black areas in the fusion maps in Fig 2 (d)). To efficiently and accurately propagate such imperfect spatial predictions between far-apart frames, we construct neighborhood reversible flow, which depicts the most reliable temporal correspondences between superpixels in different frames. With this flow, the initialized spatial foregroundness and backgroundness are efficiently propagated along the temporal axis. In this manner, primary objects pop out efficiently and distractors are further suppressed. Extensive experiments on three video datasets show that the proposed approach runs efficiently and achieves impressive performance compared with 18 state-of-the-art models.
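The core idea of neighborhood reversibility is that two superpixels in different frames are linked only when each lies among the other's k nearest neighbors in feature space, which filters out unreliable one-directional matches. The sketch below is a minimal illustration under assumed inputs (per-frame superpixel feature matrices and per-superpixel foregroundness scores); the function names and the simple blending rule are our own, not the exact formulation in [2].

```python
import numpy as np

def neighborhood_reversible_links(feat_u, feat_v, k=3):
    """Link superpixels of two frames only when each is among the
    other's k nearest neighbors in feature space (a simplified
    sketch of neighborhood reversibility).

    feat_u: (n_u, d) features of superpixels in frame u.
    feat_v: (n_v, d) features of superpixels in frame v.
    """
    d = np.linalg.norm(feat_u[:, None, :] - feat_v[None, :, :], axis=2)
    nn_uv = np.argsort(d, axis=1)[:, :k]    # for each u, its k-NN in v
    nn_vu = np.argsort(d, axis=0)[:k, :].T  # for each v, its k-NN in u
    links = []
    for i in range(len(feat_u)):
        for j in nn_uv[i]:
            if i in nn_vu[j]:               # reversible in both directions
                links.append((i, int(j)))
    return links

def propagate(scores_u, scores_v, links, alpha=0.5):
    """Blend frame v's own foregroundness with scores propagated
    from frame u along the reversible links (alpha is a made-up
    mixing weight for illustration)."""
    out = scores_v.copy()
    for i, j in links:
        out[j] = alpha * scores_v[j] + (1 - alpha) * scores_u[i]
    return out
```

Because only mutually nearest superpixels are linked, a distractor that resembles the primary object in one direction of matching, but not the other, receives no propagated foregroundness, which is how the flow suppresses spurious correspondences between far-apart frames.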




Fig 2. Neighborhood reversible approach for primary video object segmentation.



Jia Li, Associate Professor, School of Computer Science and Engineering, Beihang University. E-mail:



[1] Jia Li, Changqun Xia and Xiaowu Chen. A Benchmark Dataset and Saliency-Guided Stacked Autoencoders for Video-Based Salient Object Detection. IEEE Transactions on Image Processing, 27(1), pp. 349-364, Jan. 2018.

[2] Jia Li, Anlin Zheng, Xiaowu Chen and Bin Zhou. Primary Video Object Segmentation via Complementary CNNs and Neighborhood Reversible Flow. International Conference on Computer Vision (ICCV), 2017.