
Self-Supervised Tracking via Video Colorization



Tracking objects in video is a fundamental problem in computer vision, essential to applications such as activity recognition, object interaction, or video stylization. However, teaching a machine to visually track objects is challenging partly because it requires large, labeled tracking datasets for training, which are impractical to annotate at scale.

In “Tracking Emerges by Colorizing Videos”, we introduce a convolutional network that colorizes grayscale videos, but is constrained to copy colors from a single reference frame. In doing so, the network learns to visually track objects automatically without supervision. Importantly, although the model was never trained explicitly for tracking, it can follow multiple objects, track through occlusions, and remain robust over deformations without requiring any labeled training data.
Example tracking predictions on the publicly available academic dataset DAVIS 2017. After learning to colorize videos, a mechanism for tracking emerges automatically without supervision. We specify regions of interest (indicated by different colors) in the first frame, and our model propagates them forward without any additional learning or supervision.

Learning to Recolorize Video
Our hypothesis is that the temporal coherency of color provides excellent large-scale training data for teaching machines to track regions in video. Clearly, there are exceptions when color is not temporally coherent (such as lights turning on suddenly), but in general color is stable over time. Furthermore, most videos contain color, providing a scalable self-supervised learning signal. We therefore decolor videos and train a model to add the color back. Although multiple objects in a scene may share the same color, recovering the original colors forces the model to identify which region each color came from, which teaches it to track specific objects or regions.

In order to train our system, we use videos from the Kinetics dataset, which is a large public collection of videos depicting everyday activities. We convert all video frames except the first frame to grayscale, and train a convolutional network to predict the original colors in the subsequent frames. We expect the model to learn to follow regions in order to accurately recover the original colors. Our main observation is that the need to follow objects in order to colorize them causes a model for object tracking to be learned automatically.
We illustrate the video recolorization task using video from the DAVIS 2017 dataset. The model receives as input one color frame and a grayscale video, and predicts the colors for the rest of the video. The model learns to copy colors from the reference frame, which enables a mechanism for tracking to be learned without human supervision.
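To make the training setup concrete, here is a minimal sketch of how the inputs and targets for one example could be assembled, assuming the frames are available as uint8 RGB numpy arrays. It is an illustration rather than the released pipeline, and it uses raw RGB targets for brevity where the paper predicts quantized color classes.

import numpy as np

def to_grayscale(frame):
    # Standard luma weights; returns shape (H, W, 1).
    return (frame @ np.array([0.299, 0.587, 0.114]))[..., None].astype(np.float32)

def make_training_example(frames, num_targets=3):
    # The reference frame keeps its colors; the target frames are decolored.
    reference = frames[0]
    targets = frames[1:1 + num_targets]
    inputs = {
        "reference_color": reference.astype(np.float32) / 255.0,
        "reference_gray": to_grayscale(reference) / 255.0,
        "target_gray": np.stack([to_grayscale(f) / 255.0 for f in targets]),
    }
    # The ground truth the network must recover: the targets' original colors.
    labels = np.stack([f.astype(np.float32) / 255.0 for f in targets])
    return inputs, labels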
Learning to copy colors from the single reference frame requires the model to learn to internally point to the right region in order to copy the right colors. This forces the model to learn an explicit mechanism that we can use for tracking. To see how the video colorization model works, we show some predicted colorizations from videos in the Kinetics dataset below.

Examples of predicted colors, copied from a colorized reference frame and applied to input videos, using the publicly available Kinetics dataset.
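The pointing mechanism described above can be sketched in a few lines. Assuming the network produces a D-dimensional embedding per pixel, colors are copied from the reference frame by soft attention over embedding similarities; the function below illustrates the idea and is not the released model.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def copy_from_reference(tgt_emb, ref_emb, ref_colors, temperature=1.0):
    # tgt_emb: (M, D) embeddings of target-frame pixels; ref_emb: (N, D)
    # embeddings of reference pixels; ref_colors: (N, C) one-hot colors.
    logits = tgt_emb @ ref_emb.T / temperature    # similarity of every target pixel
    attention = softmax(logits, axis=1)           # to every reference pixel
    # Each target pixel's prediction is a similarity-weighted copy of the
    # reference colors; this pointing is what we later reuse for tracking.
    return attention @ ref_colors                 # (M, C) predicted colors

Training end-to-end with a colorization loss on this output is what pushes embeddings of pixels belonging to the same region to become similar.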

Although the network is trained without ground-truth identities, our model learns to track any visual region specified in the first frame of a video. We can track outlined objects or a single point in the video. The only change we make is that, instead of propagating colors throughout the video, we now propagate labels representing the regions of interest.
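In terms of the sketch above, this change is a one-liner: the reference colors are swapped for one-hot region labels (a hypothetical masks array of shape (N, K) below), and the attention is left untouched.

# masks: (N, K) one-hot region-of-interest labels for the reference pixels
# (K regions plus background); tgt_emb and ref_emb are as in the sketch above.
propagated = copy_from_reference(tgt_emb, ref_emb, masks)   # (M, K) soft labels
predicted_region = propagated.argmax(axis=1)                # hard label per target pixel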

Analyzing the Tracker
Since the model is trained on large amounts of unlabeled video, we want to gain insight into what the model learns. The videos below show a standard trick for visualizing the embeddings learned by our model: we project them down to three dimensions using Principal Component Analysis (PCA) and plot them as an RGB movie. The results show that nearest neighbors in the learned embedding space tend to correspond to object identity, even over deformations and viewpoint changes.
Top Row: We show videos from the DAVIS 2017 dataset. Bottom Row: We visualize the internal embeddings from the colorization model. Similar embeddings will have a similar color in this visualization. This suggests the learned embedding is grouping pixels by object identity.
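As a rough sketch of this visualization, assuming the per-pixel embeddings of one frame are available as an (H, W, D) array, the projection can be done with scikit-learn:

import numpy as np
from sklearn.decomposition import PCA

def embeddings_to_rgb(emb):
    # Project per-pixel embeddings to 3 dimensions and display them as RGB.
    h, w, d = emb.shape
    comps = PCA(n_components=3).fit_transform(emb.reshape(-1, d))   # (H*W, 3)
    comps -= comps.min(axis=0)          # rescale each component to [0, 1]
    comps /= comps.max(axis=0) + 1e-8   # so it can serve as a color channel
    return comps.reshape(h, w, 3)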

Tracking Pose
We found that the model can also track human poses given keypoints in an initial frame. We show results on the publicly available academic dataset JHMDB, where we track a human joint skeleton.
Examples of using the model to track movements of the human skeleton. In this case the input was a human pose for the first frame and subsequent movement is automatically tracked. The model can track human poses even though it was never explicitly trained for this task.
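Keypoint tracking reuses the same propagation, plus one read-out step: each keypoint gets its own label channel on the reference frame, and after propagation the tracked location is the highest-scoring target pixel for that channel. A sketch, building on the copy_from_reference function above (row-major pixel order is assumed):

import numpy as np

def track_keypoints(tgt_emb, ref_emb, keypoint_maps, width):
    # keypoint_maps: (N, K) one-hot maps, one channel per reference keypoint.
    scores = copy_from_reference(tgt_emb, ref_emb, keypoint_maps)   # (M, K)
    flat = scores.argmax(axis=0)                             # best target pixel per keypoint
    return np.stack([flat // width, flat % width], axis=1)   # (K, 2) row, col coordinates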

While we do not yet outperform heavily supervised models, the colorization model learns to track video segments and human pose well enough to outperform the latest methods based on optical flow. Breaking down performance by motion type suggests that our model is a more robust tracker than optical flow for many natural complexities, such as dynamic backgrounds, fast motion, and occlusions. Please see the paper for details.

Future Work
Our results show that video colorization provides a signal that can be used for learning to track objects in videos without supervision. Moreover, we found that the failures from our system are correlated with failures to colorize the video, which suggests that further improving the video colorization model can advance progress in self-supervised tracking.

Acknowledgements
This project was only possible thanks to several collaborations at Google. The core team includes Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama and Kevin Murphy. We also thank David Ross, Bryan Seybold, Chen Sun and Rahul Sukthankar.

Source: Google AI Blog


An updated YouTube-8M, a video understanding challenge, and a CVPR workshop. Oh my!



Last September, we released the YouTube-8M dataset, which spans millions of videos labeled with thousands of classes, in order to spur innovation and advancement in large-scale video understanding. More recently, other teams at Google have released datasets such as Open Images and YouTube-BoundingBoxes that, along with YouTube-8M, can be used to accelerate image and video understanding. To further these goals, today we are releasing an update to the YouTube-8M dataset, and in collaboration with Google Cloud Machine Learning and kaggle.com, we are also organizing a video understanding competition and an affiliated CVPR’17 Workshop.

An Updated YouTube-8M
The new and improved YouTube-8M includes cleaner and more verbose labels (twice as many labels per video, on average), a cleaned-up set of videos, and, for the first time, pre-computed audio features based on a state-of-the-art audio modeling architecture, in addition to the previously released visual features. The audio and visual features are synchronized in time, at 1-second temporal granularity, which makes YouTube-8M a large-scale multi-modal dataset, and opens up opportunities for exciting new research on joint audio-visual (temporal) modeling. Key statistics on the new version are illustrated below (more details here).
A tree-map visualization of the updated YouTube-8M dataset, organized into 24 high-level verticals, including the top-200 most frequent entities, plus the top-5 entities for each vertical.
Sample videos from the top-18 high-level verticals in the YouTube-8M dataset.
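For readers who want to load the released features, a minimal parsing sketch is shown below. The feature names ('id', 'labels', 'rgb', 'audio') follow the released starter code but should be treated as assumptions here; the dataset documentation is the authoritative reference for the schema.

import tensorflow as tf

def parse_frame_level(serialized_example):
    # Frame-level records store one byte string per second of video for each
    # modality, alongside the video id and its label set.
    context, sequences = tf.io.parse_single_sequence_example(
        serialized_example,
        context_features={
            "id": tf.io.FixedLenFeature([], tf.string),
            "labels": tf.io.VarLenFeature(tf.int64),
        },
        sequence_features={
            "rgb": tf.io.FixedLenSequenceFeature([], tf.string),
            "audio": tf.io.FixedLenSequenceFeature([], tf.string),
        },
    )
    return context, sequences

# path_to_shard is a placeholder for a downloaded TFRecord shard.
dataset = tf.data.TFRecordDataset([path_to_shard]).map(parse_frame_level)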
The Google Cloud & YouTube-8M Video Understanding Challenge
We are also excited to announce the Google Cloud & YouTube-8M Video Understanding Challenge, in partnership with Google Cloud and kaggle.com. The challenge invites participants to build audio-visual content classification models using YouTube-8M as training data, and to then label ~700K unseen test videos. It will be hosted as a Kaggle competition, sponsored by Google Cloud, and will feature a $100,000 prize pool for the top performers (details here). In order to enable wider participation in the competition, Google Cloud is also offering credits so participants can optionally do model training and exploration using Google Cloud Machine Learning. Open-source TensorFlow code, implementing a few baseline classification models for YouTube-8M, along with training and evaluation scripts, is available on GitHub. For details on getting started with local or cloud-based training, please see our README and the getting started guide on Kaggle.
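To give a flavor of what a starting point might look like (this is a sketch, not the released baseline code), a simple video-level model is a multi-label logistic regression over the provided features; the feature dimensions and class count below are placeholders.

import tensorflow as tf

NUM_CLASSES = 4716          # placeholder vocabulary size
FEATURE_DIM = 1024 + 128    # placeholder: mean-pooled visual + audio features

model = tf.keras.Sequential([
    tf.keras.Input(shape=(FEATURE_DIM,)),
    tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid"),  # one sigmoid per label
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(curve="PR", name="pr_auc")])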

The CVPR 2017 Workshop on YouTube-8M Large-Scale Video Understanding
We will announce the results of the challenge and host invited talks by distinguished researchers at the 1st YouTube-8M Workshop, to be held July 26, 2017, at the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) in Honolulu, Hawaii. The workshop will also feature presentations by top-performing challenge participants and a selected set of paper submissions. We invite researchers to submit papers describing novel research, experiments, or applications based on the YouTube-8M dataset, including papers summarizing their participation in the above challenge.

We designed this dataset with scale and diversity in mind, and hope lessons learned here will generalize to many video domains (YouTube-8M captures over 20 diverse video domains). We believe the challenge can also accelerate research by enabling researchers without access to big data or compute clusters to explore and innovate at an unprecedented scale. Please join us in advancing video understanding!

Acknowledgements
This post reflects the work of many others within Machine Perception at Google Research, including Sami Abu-El-Haija, Anja Hauth, Nisarg Kothari, Joonseok Lee, Hanhan Li, Sobhan Naderi Parizi, Rahul Sukthankar, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan, Jiang Wang, as well as Philippe Poutonnet and Mike Styer from Google Cloud, and our partners at Kaggle. We are grateful for the support and advice from many others at Google Research, Google Cloud, and YouTube, and especially thank Aren Jansen, Jort Gemmeke, Dan Ellis, and the Google Research Sound Understanding team for providing the audio features in the updated dataset.

Announcing YouTube-8M: A Large and Diverse Labeled Video Dataset for Video Understanding Research



Many recent breakthroughs in machine learning and machine perception have come from the availability of large labeled datasets, such as ImageNet, which has millions of images labeled with thousands of classes. Their availability has significantly accelerated research in image understanding, for example on detecting and classifying objects in static images.

Video analysis provides even more information for detecting and recognizing objects, and understanding human actions and interactions with the world. Improving video understanding can lead to better video search and discovery, similarly to how image understanding helped re-imagine the photos experience. However, one of the key bottlenecks for further advancements in this area has been the lack of real-world video datasets with the same scale and diversity as image datasets.

Today, we are excited to announce the release of YouTube-8M, a dataset of 8 million YouTube video URLs (representing over 500,000 hours of video), along with video-level labels from a diverse set of 4800 Knowledge Graph entities. This represents a significant increase in scale and diversity compared to existing video datasets. For example, Sports-1M, the largest existing labeled video dataset we are aware of, has around 1 million YouTube videos and 500 sports-specific classes; YouTube-8M represents nearly an order of magnitude increase in both the number of videos and the number of classes.
In order to construct a labeled video dataset of this scale, we needed to address two key challenges: (1) video is much more time-consuming to annotate manually than images, and (2) video is very computationally expensive to process and store. To overcome (1), we turned to YouTube and its video annotation system, which identifies relevant Knowledge Graph topics for all public YouTube videos. While these annotations are machine-generated, they incorporate powerful user engagement signals from millions of users as well as video metadata and content analysis. As a result, the quality of these annotations is sufficiently high to be useful for video understanding research and benchmarking purposes.

To ensure the stability and quality of the labeled video dataset, we used only public videos with more than 1000 views, and we constructed a diverse vocabulary of entities that are visually observable and sufficiently frequent. The vocabulary construction was a combination of frequency analysis, automated filtering, verification by human raters that the entities are visually observable, and grouping into 24 top-level verticals (more details in our technical report). The figures below depict the dataset browser and the distribution of videos along the top-level verticals, and illustrate the dataset’s scale and diversity.
A dataset explorer allows browsing and searching the full vocabulary of Knowledge Graph entities, grouped in 24 top-level verticals, along with corresponding videos. This screenshot depicts a subset of dataset videos annotated with the entity “Guitar”.
The distribution of videos in the top-level verticals illustrates the scope and diversity of the dataset and reflects the natural distribution of popular YouTube videos.
To address (2), we had to overcome the storage and computational resource bottlenecks that researchers face when working with videos. Pursuing video understanding at YouTube-8M’s scale would normally require a petabyte of video storage and dozens of CPU-years worth of processing. To make the dataset useful to researchers and students with limited computational resources, we pre-processed the videos and extracted frame-level features using a state-of-the-art deep learning model: the publicly available Inception-V3 image annotation model trained on ImageNet. These features are extracted at a temporal resolution of 1 frame per second, from 1.9 billion video frames, and are further compressed to fit on a single commodity hard disk (less than 1.5 TB). This makes it possible to download this dataset and train a baseline TensorFlow model at full scale on a single GPU in less than a day!
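The sketch below illustrates the kind of per-frame feature extraction described here, using the publicly available Keras Inception-V3 weights; it is not the production pipeline, and the additional compression step is omitted.

import numpy as np
import tensorflow as tf

# Pretrained Inception-V3 backbone with global average pooling, no classifier head.
backbone = tf.keras.applications.InceptionV3(
    include_top=False, pooling="avg", weights="imagenet")

def extract_features(frames_1fps):
    # frames_1fps: uint8 array of shape (num_seconds, 299, 299, 3),
    # one frame sampled per second of video.
    x = tf.keras.applications.inception_v3.preprocess_input(
        frames_1fps.astype(np.float32))
    return backbone.predict(x)     # (num_seconds, 2048) pooled activations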

We believe this dataset can significantly accelerate research on video understanding as it enables researchers and students without access to big data or big machines to do their research at an unprecedented scale. We hope this dataset will spur exciting new research on video modeling architectures and representation learning, especially approaches that deal effectively with noisy or incomplete labels, transfer learning, and domain adaptation. In fact, we show that pre-training models on this dataset and then fine-tuning them on other external datasets leads to state-of-the-art performance on those datasets (e.g., ActivityNet, Sports-1M). You can read all about our experiments using this dataset, along with more details on how we constructed it, in our technical report.