CVPR 2016 & Research at Google



This week, Las Vegas hosts the 2016 Conference on Computer Vision and Pattern Recognition (CVPR 2016), the premier annual computer vision event comprising the main conference and several co-located workshops and short courses. As a leader in computer vision research, Google has a strong presence at CVPR 2016, with many Googlers presenting papers and giving invited talks at the conference, tutorials, and workshops.

We congratulate Google Research Scientist Ce Liu and Google Faculty Advisor Abhinav Gupta, who were selected as this year’s recipients of the PAMI Young Researcher Award for outstanding research contributions within computer vision. We also congratulate Googler Henrik Stewenius for receiving the Longuet-Higgins Prize, a retrospective award that recognizes up to two CVPR papers from ten years ago that have made a significant impact on computer vision research, for his 2006 CVPR paper “Scalable Recognition with a Vocabulary Tree”, co-authored with David Nister.

If you are attending CVPR this year, please stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for hundreds of millions of people. The Google booth will also showcase several recent efforts, including the technology behind Motion Stills and a live demo of neural network-based image compression. Learn more about our research being presented at CVPR 2016 in the list below (Googlers highlighted in blue).

Oral Presentations
Generation and Comprehension of Unambiguous Object Descriptions
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, Kevin Murphy

Detecting Events and Key Actors in Multi-Person Videos
Vignesh Ramanathan, Jonathan Huang, Sami Abu-El-Haija, Alexander Gorban, Kevin Murphy, Li Fei-Fei

Spotlight Session: 3D Reconstruction
DeepStereo: Learning to Predict New Views From the World’s Imagery
John Flynn, Ivan Neulander, James Philbin, Noah Snavely

Posters
Discovering the Physical Parts of an Articulated Object Class From Multiple Videos
Luca Del Pero, Susanna Ricco, Rahul Sukthankar, Vittorio Ferrari

Blockout: Dynamic Model Selection for Hierarchical Deep Networks
Calvin Murdock, Zhen Li, Howard Zhou, Tom Duerig

Rethinking the Inception Architecture for Computer Vision
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, Zbigniew Wojna

Improving the Robustness of Deep Neural Networks via Stability Training
Stephan Zheng, Yang Song, Thomas Leung, Ian Goodfellow

Semantic Image Segmentation With Task-Specific Edge Detection Using CNNs and a Discriminatively Trained Domain Transform
Liang-Chieh Chen, Jonathan T. Barron, George Papandreou, Kevin Murphy, Alan L. Yuille

Tutorial
Optimization Algorithms for Subset Selection and Summarization in Large Data Sets
Ehsan Elhamifar, Jeff Bilmes, Alex Kulesza, Michael Gygli

Workshops
Perceptual Organization in Computer Vision: The Role of Feedback in Recognition and Reorganization
Organizers: Katerina Fragkiadaki, Phillip Isola, Joao Carreira
Invited talks: Viren Jain, Jitendra Malik

VQA Challenge Workshop
Invited talks: Jitendra Malik, Kevin Murphy

Women in Computer Vision
Invited talk: Caroline Pantofaru

Computational Models for Learning Systems and Educational Assessment
Invited talk: Jonathan Huang

Large-Scale Scene Understanding (LSUN) Challenge
Invited talk: Jitendra Malik

Large Scale Visual Recognition and Retrieval: BigVision 2016
General Chairs: Jason Corso, Fei-Fei Li, Samy Bengio

ChaLearn Looking at People
Invited talk: Florian Schroff

Medical Computer Vision
Invited talk: Ramin Zabih

Motion Stills – Create beautiful GIFs from Live Photos



Today we are releasing Motion Stills, an iOS app from Google Research that acts as a virtual camera operator for your Apple Live Photos. We use our video stabilization technology to freeze the background into a still photo or create sweeping cinematic pans. The resulting looping GIFs and movies come alive, and can easily be shared via messaging or on social media.
With Motion Stills, we provide an immersive stream experience that makes your clips fun to watch and share. You can also tell stories of your adventures by combining multiple clips into a movie montage. All of this works right on your phone, no Internet connection needed.
A Live Photo before and after stabilization with Motion Stills
How does it work?
We pioneered this technology by stabilizing hundreds of millions of videos and creating GIF animations from photo bursts. Our algorithm uses linear programming to compute a virtual camera path that is optimized to recast videos and bursts as if they were filmed using stabilization equipment, yielding a still background or creating cinematic pans to remove shakiness.
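To make the idea concrete, here is a minimal sketch of that optimization for a single dimension of camera motion, posed as a linear program that minimizes the total variation of the smoothed path while keeping it within a crop margin of the original trajectory. The function name, the one-dimensional simplification, and the use of SciPy's solver (rather than GLOP, which the app actually uses) are illustrative assumptions, not the production algorithm.

```python
# A minimal sketch of L1-optimal camera path smoothing posed as a linear
# program, in the spirit of the stabilization described above. Assumptions:
# one dimension of motion, SciPy's LP solver in place of GLOP, and a
# normalized-coordinate crop margin.
import numpy as np
from scipy.optimize import linprog

def smooth_camera_path(shaky, margin=0.05):
    """Fit a smooth 1-D virtual camera path to a shaky trajectory.

    Minimizes sum_t |p[t+1] - p[t]| subject to keeping the smoothed path
    within `margin` of the original, so the stabilizing crop window never
    leaves the recorded frame.
    """
    T = len(shaky)
    # Variables: T path positions, then T-1 slacks bounding |p[t+1] - p[t]|.
    cost = np.concatenate([np.zeros(T), np.ones(T - 1)])

    rows = []
    for t in range(T - 1):
        row = np.zeros(2 * T - 1)
        row[t], row[t + 1], row[T + t] = -1.0, 1.0, -1.0  # p[t+1] - p[t] <= s[t]
        rows.append(row.copy())
        row[t], row[t + 1] = 1.0, -1.0                    # p[t] - p[t+1] <= s[t]
        rows.append(row)

    bounds = [(x - margin, x + margin) for x in shaky] + [(0, None)] * (T - 1)
    res = linprog(cost, A_ub=np.array(rows), b_ub=np.zeros(2 * (T - 1)),
                  bounds=bounds, method="highs")
    return res.x[:T]  # the virtual camera path
```

Minimizing absolute first differences favors piecewise-static (locked-off) segments; penalizing second and third differences the same way is what produces linear and smoothly accelerating cinematic pans.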

Our challenge was to take technology designed to run distributed across a data center and shrink it down to run even faster on your mobile phone. We achieved a 40x speedup by using techniques such as temporal subsampling, decoupling of motion parameters, and using Google Research’s custom linear solver, GLOP. We obtain further speedup and conserve storage by computing low-resolution warp textures to perform real-time GPU rendering, just like in a video game.
Making it loop
Short videos are perfect for creating loops, so we added loop optimization to bring out the best in your captures. Our approach identifies optimal start and end points, and also discards blurry frames. As an added benefit, this fixes “pocket shots” (footage of the phone being put back into the pocket).
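As a rough illustration of those two ideas, the toy sketch below (assuming grayscale frames as NumPy arrays, with invented thresholds) first discards low-sharpness frames, then scores candidate start/end pairs by how closely their endpoint frames match, so playback wraps around with a minimal visible seam.

```python
# A toy version of the loop optimization, assuming grayscale frames as NumPy
# arrays; the sharpness threshold and minimum loop length are invented.
import numpy as np

def sharpness(frame):
    gy, gx = np.gradient(frame.astype(float))
    return np.mean(gx ** 2 + gy ** 2)  # crude focus measure: gradient energy

def best_loop(frames, min_len=8, blur_thresh=10.0):
    """Drop blurry frames, then pick the (start, end) pair whose endpoint
    frames match most closely so playback wraps with a minimal seam."""
    keep = [f.astype(float) for f in frames if sharpness(f) > blur_thresh]
    best, best_cost = (0, len(keep) - 1), float("inf")
    for s in range(len(keep) - min_len):
        for e in range(s + min_len, len(keep)):
            cost = np.mean((keep[s] - keep[e]) ** 2)  # seam visibility proxy
            if cost < best_cost:
                best, best_cost = (s, e), cost
    return best  # indices into the blur-filtered frame list
```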

To keep the background steady while looping, Motion Stills has to separate the background from the rest of the scene. This is a difficult task when foreground elements occlude significant portions of the video, as in the example below. Our novel method classifies motion vectors into foreground (red) and background (green) in a temporally consistent manner. We use a cascade of motion models, moving our motion estimation from simple to more complex models and biasing our results along the way.
Left: Original video with virtual camera path (red rectangle) and motion classification, foreground (red) vs. background (green). Right: Motion Stills result
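As a hedged sketch of the core classification idea: robustly fit a simple global motion model to a frame's motion vectors and treat inliers as background, outliers as foreground. The translation-only model and RANSAC loop below are a simplified stand-in for the cascade of motion models described above.

```python
# A simplified stand-in for the motion classification: RANSAC-fit a single
# global translation to the frame's motion vectors and call inliers
# "background". The real system cascades richer models with temporal
# consistency; thresholds here are invented.
import numpy as np

def classify_motion(vectors, inlier_thresh=1.5, iters=200, rng=None):
    """vectors: (N, 2) array of per-feature motion vectors.
    Returns a boolean mask: True = background, False = foreground."""
    rng = rng or np.random.default_rng(0)
    best_mask = np.zeros(len(vectors), dtype=bool)
    for _ in range(iters):
        candidate = vectors[rng.integers(len(vectors))]  # hypothesized camera motion
        residual = np.linalg.norm(vectors - candidate, axis=1)
        mask = residual < inlier_thresh
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return best_mask
```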
Try it out
We’re excited to see what you can create with this app. From fun family moments to exciting adventures with friends, try it out and let us know what you think. Motion Stills is an on-device experience with no sign-in: even if you’re on top of a glacier without signal, you can see your results immediately. You can show us your favorite clips by using #motionstill on social media.

This app is a way for us to experiment and iterate quickly on the technology needed for short video creation. Based on the feedback we receive, we hope to integrate this feature into existing products like Google Photos.

Motion Stills is available on the App Store.

"Aw, so cute!": Allo helps you respond to shared photos



Today, Google announced Allo — our new mobile messaging app. From day one of the Allo development effort, we set out to build a truly special product that is powered by Google’s strengths in machine intelligence to make messaging easier, more efficient, and more expressive. Photo Reply is a unique feature of Allo that does just that! We use machine learning to understand what a shared photo depicts and to suggest rich natural language replies that the user can tap to send. This makes it easier for users to sustain meaningful conversations while using small mobile keyboards.

Here is an example of the responses that Allo suggests when a friend shares a photo of his child.
Photo Reply — Under the Hood

During the winter, our product managers, Patrick McGregor and Ryan Cassidy, challenged us to develop new approaches to simplify media sharing in messaging while simultaneously delighting users with Google insights. With my colleagues Vivek Ramavajjala, Sergey Nazarov, and Sujith Ravi, we set out to build Photo Reply.

We utilize Google's image recognition technology, developed by our Machine Perception team, to associate images with semantic entities — people, animals, cars, etc. We then apply a machine-learned model that maps those recognized entities to actual natural language responses. Our system produces replies for thousands of entity types, drawn from a taxonomy that is a subset of Google's Knowledge Graph, and these entities may be recognized at different levels of granularity. For example, when you receive a photo of a dog, the system may detect that the dog is actually a labrador and suggest "Love that lab!". Or given a photo of a pasta dish, it may detect the type of pasta ("Yum linguine!") and even the cuisine ("I love Italian food!").
Examples of response suggestions reflecting fine-grained object classes
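A toy illustration of the fine-to-coarse lookup this implies is below; the entity names, confidence scores, and reply strings are hypothetical, and the real mapping is learned rather than hand-written.

```python
# A hypothetical fine-to-coarse reply lookup. Entity names, confidences, and
# reply strings are invented; the real associations are learned, not listed.
RESPONSES = {
    "labrador": "Love that lab!",
    "dog": "Cute dog!",
    "linguine": "Yum linguine!",
    "italian food": "I love Italian food!",
}

def suggest_replies(recognized, max_replies=3):
    """recognized: list of (entity, confidence) pairs from image recognition.
    Prefers the most confident entities first, falling back to coarser ones."""
    replies = []
    for entity, _ in sorted(recognized, key=lambda pair: -pair[1]):
        reply = RESPONSES.get(entity)
        if reply and reply not in replies:
            replies.append(reply)
        if len(replies) == max_replies:
            break
    return replies

print(suggest_replies([("labrador", 0.9), ("dog", 0.97), ("animal", 0.99)]))
```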
One aspect of the system that we find very useful is that it can suggest responses not just for physical objects but also for abstract concepts. It can produce suggestions for events (birthday parties, weddings, etc.), nature (sunrises, mountains, etc.), recreational activities (hiking, camping, etc.), and many more categories. Also, the system can generate responses that reflect the emotions that might be associated with an image, such as “happiness”. Here are some examples of responses for abstract concepts:
Response suggestions reflecting abstract concepts
Learning entity-response associations

At runtime, Photo Reply recognizes entities in the shared photo and triggers responses for those entities. The model that maps entities to natural language responses is learned offline using Expander, a large-scale graph-based semi-supervised learning platform at Google. We built a massive graph where nodes correspond to photos, semantic entities, and textual responses. Edges in the graph indicate when an entity was recognized for a photo, when a specific response was given for a photo, and visual similarities between photos. Some of the nodes are "labeled", and we learn associations for the unlabeled nodes by propagating label information across the graph.

To illustrate this, consider the graph below. There are two labels: the red label corresponds to the response "yummy" and the blue label corresponds to "delicious". The nodes for "spaghetti" and "linguine" are unlabeled, but because they are close to the red and blue nodes, the algorithm can learn that they should be associated with the "yummy" and "delicious" responses. Notice that in this way we are associating the entity "linguine" with the response "yummy" even though none of the linguine photos in the graph are directly connected to that answer. Expander can perform this kind of learning at very large scale, for graphs containing billions of nodes and hundreds of billions of edges.
Graph of entities, photos, and responses
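Here is a minimal label-propagation sketch over exactly this toy graph; the node names, unit edge weights, and fixed iteration count are illustrative assumptions, while Expander itself runs a far more sophisticated optimization at billions-of-nodes scale.

```python
# Label propagation over the toy graph above. Node names, unit edge weights,
# and the fixed iteration count are illustrative; Expander performs this kind
# of learning over graphs with billions of nodes.
import numpy as np

nodes = ["spaghetti", "linguine", "photo1", "photo2", "yummy", "delicious"]
idx = {n: i for i, n in enumerate(nodes)}

edges = [("spaghetti", "photo1"), ("linguine", "photo2"),
         ("photo1", "yummy"), ("photo1", "delicious"),
         ("photo1", "photo2")]          # visual-similarity edge
W = np.zeros((len(nodes), len(nodes)))
for a, b in edges:
    W[idx[a], idx[b]] = W[idx[b], idx[a]] = 1.0

# Seed labels: column 0 = "yummy", column 1 = "delicious".
seeds = np.zeros((len(nodes), 2))
seeds[idx["yummy"], 0] = seeds[idx["delicious"], 1] = 1.0
is_seed = seeds.sum(axis=1) > 0

labels = seeds.copy()
deg = W.sum(axis=1, keepdims=True).clip(min=1.0)
for _ in range(50):                     # average neighbors, clamp the seeds
    labels = W @ labels / deg
    labels[is_seed] = seeds[is_seed]

for n in ("spaghetti", "linguine"):     # both pick up "yummy"/"delicious"
    print(n, labels[idx[n]].round(3))
```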
Photo Reply is an exciting example of multimodal learning, where computer vision and natural language processing come together in order to create a compelling user experience. Allo will be available on Android and iOS later this summer. Be sure to check out what Allo sees in your beautiful photos!

How to Classify Images with TensorFlow



Prior to joining Google, I spent a lot of time trying to get computers to recognize objects in images. At Jetpac my colleagues and I built mustache detectors to recognize bars full of hipsters, blue sky detectors to find pubs with beer gardens, and dog detectors to spot canine-friendly cafes. At first, we used the traditional computer vision approaches that I'd used my whole career, writing a big ball of custom logic to laboriously recognize one object at a time. For example, to spot sky I'd first run a color detection filter over the whole image looking for shades of blue, and then look at the upper third. If it was mostly blue, and the lower portion of the image wasn't, then I'd classify that as probably a photo of the outdoors.
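For a flavor of what that hand-tailored logic looked like, here is a sketch of such a blue-sky heuristic; every threshold below is an invented guess, which is precisely the weakness of this style of detector.

```python
# A sketch of that kind of hand-tuned heuristic; all thresholds are invented.
import numpy as np

def looks_like_outdoor_sky(rgb):
    """rgb: (H, W, 3) uint8 image. True if the top third is mostly
    sky-blue while the bottom third is not."""
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    blue = (b > 120) & (b > r + 20) & (b > g + 10)  # crude "shade of blue" test
    h = rgb.shape[0]
    return blue[: h // 3].mean() > 0.6 and blue[2 * h // 3 :].mean() < 0.3
```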

I'd been an engineer working on vision problems since the late '90s, and the sad truth was that unless you had a research team and plenty of time behind you, this sort of hand-tailored hack was the only way to get usable results. As you can imagine, the results were far from perfect, and each detector I wrote was a custom job that didn't help me with the next thing I needed to recognize. This probably seems laughable to anybody who didn't work in computer vision in the recent past! It's such a primitive way of solving the problem that it sounds like it should have been superseded long ago.

That's why I was so excited when I started to play around with deep learning. As I tried them out, it became clear that the latest approaches using convolutional neural networks produced far better results than my hand-tuned code on similar problems. Not only that, the process of training a detector for a new class of object was much easier: I didn't have to think about what features to detect; I'd just supply the network with new training examples and it would take it from there.

Those experiences converted me into a deep learning enthusiast, and so when Jetpac was acquired and I had the chance to join Google and work with many of the stars of the field, I couldn't resist. What impressed me more than anything was the team's willingness to share their knowledge with the rest of the world.

I'm especially happy that we've just managed to release TensorFlow, our internal machine learning framework, because it gives me a chance to show practical, usable examples of why I'm so convinced deep learning is an essential tool for anybody working with images, speech, or text in ML.

Given my background, my favorite first example is using a deep network to spot objects in an image. One of the early showcases for the new approach to neural networks was an annual competition to recognize 1,000 different classes of objects from the ImageNet data set, and TensorFlow includes a pre-trained network for that task. If you look inside the examples folder in the source code, you'll see “label_image”, which is a small C++ application for using that network.

The README has the instructions for building TensorFlow on your machine, downloading the binary files defining the network, and compiling the sample code. Once it's all built, just run it with no arguments, and you should see a list of results showing "Military Uniform" at the top. This is running on the default image of Admiral Grace Hopper, and correctly spots her attire.
Image via Wikipedia
After that, try pointing it at your own images using the “--image” command line flag, and you should see a set of labels for each. If you want to know more about what's going on under the hood, the C++ section of the TensorFlow Inception tutorial goes into a lot more detail.
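For readers who prefer Python, here is a rough analogue of that classification step, sketched with today's tf.keras API rather than the C++ tool and downloaded graph files described above; "hopper.jpg" is a placeholder path.

```python
# A rough Python analogue of label_image, using the modern tf.keras API
# rather than the C++ tool the post describes; "hopper.jpg" is a placeholder.
import numpy as np
import tensorflow as tf

model = tf.keras.applications.InceptionV3(weights="imagenet")

img = tf.keras.utils.load_img("hopper.jpg", target_size=(299, 299))
x = tf.keras.applications.inception_v3.preprocess_input(
    np.expand_dims(tf.keras.utils.img_to_array(img), axis=0))

preds = model.predict(x)
for _, label, score in tf.keras.applications.inception_v3.decode_predictions(
        preds, top=5)[0]:
    print(f"{label}: {score:.3f}")  # e.g. "military_uniform" near the top
```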

The only things it will spot are those in the original 1,000 ImageNet classes, and it will always try to find something, which can lead to some funny results. There are no people categories, so on portraits you'll often see objects that are associated with people, like seat belts or oxygen masks, or, in Lincoln’s case, a bow tie!
Image via U.S. History Images
If the image is poorly lit, then “nematode” is usually the top pick, since most training photos of those are taken in very dim surroundings. It's also not perfect in its identification, with an error rate of 5.6% for getting the right label in the top five results. However, that’s not all that bad considering that Stanford’s Andrej Karpathy found that even a person trained at the task could only achieve a slightly better 5.1% error doing the same classification manually. We can do even better if we combine the outputs of four trained models into an "ensemble", bringing the error rate down to just 3.5%.
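Ensembling itself is simple to sketch: average the class-probability outputs of independently trained models before ranking labels. The `models` list of prediction callables below is an assumed interface.

```python
# A minimal sketch of ensembling: average class probabilities from several
# independently trained models before ranking. `models`, a list of callables
# returning probability arrays, is an assumed interface.
import numpy as np

def ensemble_top_k(models, x, k=5):
    probs = np.mean([m(x) for m in models], axis=0)  # averaged class probabilities
    return np.argsort(probs)[::-1][:k]               # indices of the top-k labels
```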

It's unlikely that the set of labels it produces is exactly what you need for your application, so the next step would be to train your own network. That is a much bigger task than running a pre-trained one like this, but one of the things I like about TensorFlow is that it spans the whole lifecycle of a machine learning model, from experimentation, to training, and into production, as this example shows. To get started training, I'd recommend looking at this simple tutorial on recognizing hand-drawn digits from the MNIST data set.
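As a taste of what that tutorial covers, here is a compact MNIST training sketch, written with today's tf.keras API rather than the original graph-based code the tutorial used:

```python
# A compact MNIST training sketch in today's tf.keras style, standing in for
# the graph-based tutorial code the post links to.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5)
print(model.evaluate(x_test, y_test))  # [test loss, test accuracy]
```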

I hope that sharing this framework will help developers build amazing user experiences we’d never even think of. We’ve been having a massive amount of fun with TensorFlow, and I can’t wait to see what interesting image tools you build using it!