Tag Archives: CVPR

Announcing the 6th Fine-Grained Visual Categorization Workshop



In recent years, fine-grained visual recognition competitions (FGVCs), such as the iNaturalist species classification challenge and the iMaterialist product attribute recognition challenge, have spurred progress in the development of image classification models focused on detection of fine-grained visual details in both natural and man-made objects. Whereas traditional image classification competitions focus on distinguishing generic categories (e.g., car vs. butterfly), the FGVCs go beyond entry-level categories to focus on subtle differences in object parts and attributes. For example, rather than pursuing methods that can distinguish categories, such as “bird”, we are interested in identifying subcategories such as “indigo bunting” or “lazuli bunting.”

Previous challenges attracted a large number of talented participants who developed innovative new models for image recognition, with more than 500 teams competing at FGVC5 at CVPR 2018. FGVC challenges have also inspired new methods such as domain-specific transfer learning and estimating test-time priors, which have helped fine-grained recognition tasks reach state-of-the-art performance on several benchmarking datasets.
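
To make the test-time priors idea concrete, here is a minimal sketch (our illustration under simple assumptions, not the method’s reference implementation): softmax outputs from a model trained on an imbalanced class distribution are reweighted by the ratio of estimated test-time class priors to the training priors, then renormalized.

```python
import numpy as np

def adjust_with_test_priors(probs, train_priors, test_priors):
    """Reweight classifier outputs by the ratio of test to train class priors.

    probs: (num_examples, num_classes) softmax outputs of the trained model.
    train_priors, test_priors: (num_classes,) class frequency estimates.
    """
    adjusted = probs * (test_priors / train_priors)
    # Renormalize so each row is again a probability distribution.
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Toy example: a model trained where class 0 dominates, evaluated on a
# balanced test set. The minority class gets boosted to ~0.79.
probs = np.array([[0.7, 0.3]])
print(adjust_with_test_priors(probs,
                              train_priors=np.array([0.9, 0.1]),
                              test_priors=np.array([0.5, 0.5])))
```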

In order to further spur progress in FGVC research, we are proud to sponsor and co-organize the 6th annual workshop on Fine-Grained Visual Categorization (FGVC6), to be held on June 17th in Long Beach, CA at CVPR 2019. This workshop brings together experts in computer vision with specialists focusing on biodiversity, botany, fashion, and the arts, to address the challenges of applying fine-grained visual categorization to real-life settings.

This Year’s Challenges
This year there will be a wide variety of competition topics, each highlighting unique challenges of fine-grained visual categorization, including an updated iNaturalist challenge, fashion & products, wildlife camera traps, food, butterflies & moths, fashion design, and cassava leaf disease. We are also delighted to introduce two new partnerships with world class institutions—The Metropolitan Museum of Art for the iMet Collection challenge and the New York Botanical Garden for the Herbarium challenge.
The FGVC workshop at CVPR focuses on subordinate categories, including (from left to right, top to bottom) animal species from wildlife camera traps, retail products, fashion attributes, cassava leaf disease, Melastomataceae species from herbarium sheets, animal species from citizen science photos, butterfly and moth species, cuisine of dishes, and fine-grained attributes for museum art objects.
In the iMet Collection challenge, participants compete to train models on artistic attributes including object presence, culture, content, theme, and geographic origin. The Metropolitan Museum of Art provided a large training dataset for this task based on subject matter experts’ descriptions of their museum collections. This dataset highlights the challenge of inferring fine-grained attributes that are only indirectly grounded in the visual content of an image (e.g., period, culture, medium).
A diverse sample of images included in the iMet Collection challenge dataset. Images were taken from the Metropolitan Museum of Art’s public domain dataset.
The iMet Collection challenge is also noteworthy for its status as the first image-based Kernels-only competition, a recently introduced option on Kaggle that levels the playing field for data scientists who might not otherwise have access to adequate computational resources. Kernel competitions provide all participants with the same hardware allowances, giving rise to a more balanced competition. Moreover, the winning models tend to be simpler than their counterparts in other competitions, since the participants must work within the compute constraints imposed by the Kernels platform. At the time of writing, the iMet Collection challenge has over 250 participating teams.

In the Herbarium challenge, researchers are invited to tackle the problem of classifying species from the flowering plant family Melastomataceae. This challenge is distinct from the iNaturalist competition in that the included images exclusively depict dried specimens preserved on herbarium sheets. Herbarium sheets are essential to plant science, as they not only preserve the key details of the plants for identification and DNA analysis, but also provide a rare perspective into plant ecology in a historical context. The New York Botanical Garden’s Steere Herbarium, the world’s second largest herbarium, contributed a dataset of over 46,000 specimens for this year’s challenge.
In the Herbarium challenge, participants will identify species from the flowering plant family Melastomataceae. The New York Botanical Garden (NYBG) provided a dataset of over 46,000 herbarium specimens including over 680 species. Images used with permission of the NYBG.
Every one of this year’s challenges requires deep engagement with subject matter experts, in addition to institutional coordination. By teeing up image recognition challenges in a standard format, the FGVC workshop paves the way for technology transfer from the top of the Kaggle leaderboards into the hands of everyday users via mobile apps such as Seek by iNaturalist and Merlin Bird ID. We anticipate the techniques developed by our competition participants will not only push the frontier of fine-grained recognition, but also be beneficial for applying machine vision to advance scientific exploration and curatorial studies.

Invitation to Participate
We invite teams to participate in these competitions to help advance the state-of-the-art in fine-grained image recognition. Deadlines for entry into the competitions range from May 26 to June 3, depending on the challenge. The results of these competitions will be presented at the FGVC6 workshop at CVPR 2019, and will provide broad exposure to the top performing teams. We are excited to encourage the community's development of more accurate and broadly impactful algorithms in the field of fine-grained visual categorization!

Acknowledgements
We’d like to thank our colleagues and friends on the FGVC6 organizing committee for working together to advance this important area. At Google we would like to thank Hartwig Adam, Chenyang Zhang, Yulong Liu, Kiat Chuan Tan, Mikhail Sirotenko, Denis Brulé, Cédric Deltheil, Timnit Gebru, Ernest Mwebaze, Weijun Wang, Grace Chu, Jack Sim, Andrew Howard, R.V. Guha, Srikanth Belwadi, Tanya Birch, Katherine Chou, Maggie Demkin, Elizabeth Park, and Will Cukierski.

Source: Google AI Blog


Announcing the Second Workshop and Challenge on Learned Image Compression



Last year, we announced the Workshop and Challenge on Learned Image Compression (CLIC), an event that aimed to advance the field of image compression with and without neural networks. Held during the 2018 Computer Vision and Pattern Recognition conference (CVPR 2018), CLIC was quite a success, with 23 accepted workshop papers, 95 authors and 41 entries into the competition. The event spawned many new algorithms for image compression, domain-specific applications such as medical image compression, and augmentations to existing methods, with the winner, Tucodec, achieving a 13% better mean opinion score (MOS) than Better Portable Graphics (BPG) compression.
An example image from the 2018 test set, comparing the original image to BPG, JPEG and the results from nine competing teams. All the methods are better than JPEG in color reproduction and many of them are comparable to BPG in their ability to create legible text on the sign.
This year, we are again happy to co-sponsor the second Workshop and Challenge on Learned Image Compression at CVPR 2019 in Long Beach, California. The half-day workshop will feature talks from invited guests Anne Aaron (Netflix), Aaron Van Den Oord (DeepMind) and Jyrki Alakuijala (Google), along with presentations from five top performing teams in the 2019 competition, which is currently open for submissions.

This year's competition features two tracks for participants to compete in. The first track remains the same as last year, in what we're calling the "low-rate compression" track. The goal of low-rate compression is to compress an image dataset to 0.15 bits per pixel while maintaining the highest quality, as measured by PSNR, MS-SSIM and a human-evaluated rating task.

The second track incorporates feedback from last year's workshop, in which participants expressed interest in the inverse challenge of determining the amount an image could be compressed and still look good. In this "transparent compression" challenge, we set a relatively high quality threshold for the test dataset (in both PSNR and MS-SSIM) with the goal of compressing the dataset to the smallest file sizes.
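
For intuition, the two track targets reduce to simple per-image quantities: a compressed size budget expressed in bits per pixel for the low-rate track, and distortion metrics such as PSNR for the transparent track. Below is a minimal sketch (the helper names are ours; the official evaluation details are specified at compression.cc):

```python
import numpy as np

def bits_per_pixel(compressed_size_bytes, height, width):
    """Average bits spent per pixel; the low-rate track caps this at 0.15."""
    return compressed_size_bytes * 8.0 / (height * width)

def psnr(original, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two uint8 image arrays."""
    diff = original.astype(np.float64) - reconstruction.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_value**2 / mse)

# Example: a 512x768 image must fit in about 7,373 bytes to hit 0.15 bpp.
print(bits_per_pixel(7373, 512, 768))  # ~0.15
```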

If you're doing research in the field of learned image compression, we encourage you to participate in CLIC during CVPR 2019. For more details on the competition and dates, please refer to compression.cc.

Acknowledgements
This workshop is being jointly hosted by researchers at Google, Twitter and ETH Zürich. We'd like to thank: George Toderici (Google), Michele Covell (Google), Johannes Ballé (Google), Nick Johnston (Google), Eirikur Agustsson (Google), Wenzhe Shi (Twitter), Lucas Theis (Twitter), Radu Timofte (ETH Zürich), Fabian Mentzer (ETH Zürich) for their contributions.

Source: Google AI Blog


Google at CVPR 2018

Posted by Christian Howard, Editor-in-Chief, Google AI Communications

This week, Salt Lake City hosts the 2018 Conference on Computer Vision and Pattern Recognition (CVPR 2018), the premier annual computer vision event comprising the main conference and several co-located workshops and tutorials. As a leader in computer vision research and a Diamond Sponsor, Google will have a strong presence at CVPR 2018 — over 200 Googlers will be in attendance to present papers and invited talks at the conference, and to organize and participate in multiple workshops.

If you are attending CVPR this year, please stop by our booth and chat with our researchers who are actively pursuing the next generation of intelligent systems that utilize the latest machine learning techniques applied to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including the technology behind portrait mode on the Pixel 2 and Pixel 2 XL smartphones, the Open Images V4 dataset and much more.

You can learn more about our research being presented at CVPR 2018 in the list below (Googlers highlighted in blue).

Organization
Finance Chair: Ramin Zabih

Area Chairs include: Sameer Agarwal, Aseem Agarwala, Jon Barron, Abhinav Shrivastava, Carl Vondrick, Ming-Hsuan Yang

Orals/Spotlights
Unsupervised Discovery of Object Landmarks as Structural Representations
Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee

DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor
Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, Yebin Liu

Neural Kinematic Networks for Unsupervised Motion Retargetting
Ruben Villegas, Jimei Yang, Duygu Ceylan, Honglak Lee

Burst Denoising with Kernel Prediction Networks
Ben Mildenhall, Jiawen Chen, Jonathan Barron, Robert Carroll, Dillon Sharlet, Ren Ng

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Benoit Jacob, Skirmantas Kligys, Bo Chen, Matthew Tang, Menglong Zhu, Andrew Howard, Dmitry Kalenichenko, Hartwig Adam

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu, Chen Sun, David Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

Focal Visual-Text Attention for Visual Question Answering
Junwei Liang, Lu Jiang, Liangliang Cao, Li-Jia Li, Alexander G. Hauptmann

Inferring Light Fields from Shadows
Manel Baradad, Vickie Ye, Adam Yedida, Fredo Durand, William Freeman, Gregory Wornell, Antonio Torralba

Modifying Non-Local Variations Across Multiple Views
Tal Tlusty, Tomer Michaeli, Tali Dekel, Lihi Zelnik-Manor

Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen, Li-jia Li, Fei-Fei Li, Abhinav Gupta

Unsupervised Training for 3D Morphable Model Regression
Kyle Genova, Forrester Cole, Aaron Maschinot, Daniel Vlasic, Aaron Sarna, William Freeman

Learning Transferable Architectures for Scalable Image Recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc Le

The iNaturalist Species Classification and Detection Dataset
Grant van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, Serge Belongie

Learning Intrinsic Image Decomposition from Watching the World
Zhengqi Li, Noah Snavely

Learning Intelligent Dialogs for Bounding Box Annotation
Ksenia Konyushkova, Jasper Uijlings, Christoph Lampert, Vittorio Ferrari

Posters
Revisiting Knowledge Transfer for Training Object Class Detectors
Jasper Uijlings, Stefan Popov, Vittorio Ferrari

Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David Ross, Jia Deng, Rahul Sukthankar

Hierarchical Novelty Detection for Visual Object Recognition
Kibok Lee, Kimin Lee, Kyle Min, Yuting Zhang, Jinwoo Shin, Honglak Lee

COCO-Stuff: Thing and Stuff Classes in Context
Holger Caesar, Jasper Uijlings, Vittorio Ferrari

Appearance-and-Relation Networks for Video Classification
Limin Wang, Wei Li, Wen Li, Luc Van Gool

MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks
Ariel Gordon, Elad Eban, Bo Chen, Ofir Nachum, Tien-Ju Yang, Edward Choi

Deformable Shape Completion with Graph Convolutional Autoencoders
Or Litany, Alex Bronstein, Michael Bronstein, Ameesh Makadia

MegaDepth: Learning Single-View Depth Prediction from Internet Photos
Zhengqi Li, Noah Snavely

Unsupervised Discovery of Object Landmarks as Structural Representations
Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee

Burst Denoising with Kernel Prediction Networks
Ben Mildenhall, Jiawen Chen, Jonathan Barron, Robert Carroll, Dillon Sharlet, Ren Ng

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Benoit Jacob, Skirmantas Kligys, Bo Chen, Matthew Tang, Menglong Zhu, Andrew Howard, Dmitry Kalenichenko, Hartwig Adam

Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling
Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Tianfan Xue, Joshua Tenenbaum, William Freeman

Sparse, Smart Contours to Represent and Edit Images
Tali Dekel, Dilip Krishnan, Chuang Gan, Ce Liu, William Freeman

MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features
Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, Hartwig Adam

Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning
Yin Cui, Yang Song, Chen Sun, Andrew Howard, Serge Belongie

Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks
Nick Johnston, Damien Vincent, David Minnen, Michele Covell, Saurabh Singh, Sung Jin Hwang, George Toderici, Troy Chinen, Joel Shor

MobileNetV2: Inverted Residuals and Linear Bottlenecks
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen

ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans 
Angela Dai, Daniel Ritchie, Martin Bokeloh, Scott Reed, Juergen Sturm, Matthias Nießner

Sim2Real View Invariant Visual Servoing by Recurrent Control
Fereshteh Sadeghi, Alexander Toshev, Eric Jang, Sergey Levine

Alternating-Stereo VINS: Observability Analysis and Performance Evaluation
Mrinal Kanti Paul, Stergios Roumeliotis

Soccer on Your Tabletop
Konstantinos Rematas, Ira Kemelmacher, Brian Curless, Steve Seitz

Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints
Reza Mahjourian, Martin Wicke, Anelia Angelova

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu, Chen Sun, David Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

Inferring Light Fields from Shadows
Manel Baradad, Vickie Ye, Adam Yedida, Fredo Durand, William Freeman, Gregory Wornell, Antonio Torralba

Modifying Non-Local Variations Across Multiple Views
Tal Tlusty, Tomer Michaeli, Tali Dekel, Lihi Zelnik-Manor

Aperture Supervision for Monocular Depth Estimation
Pratul Srinivasan, Rahul Garg, Neal Wadhwa, Ren Ng, Jonathan Barron

Instance Embedding Transfer to Unsupervised Video Object Segmentation
Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, Qin Huang, C.-C. Jay Kuo

Frame-Recurrent Video Super-Resolution
Mehdi S. M. Sajjadi, Raviteja Vemulapalli, Matthew Brown

Weakly Supervised Action Localization by Sparse Temporal Pooling Network
Phuc Nguyen, Ting Liu, Gautam Prasad, Bohyung Han

Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen, Li-jia Li, Fei-Fei Li, Abhinav Gupta

Learning and Using the Arrow of Time
Donglai Wei, Andrew Zisserman, William Freeman, Joseph Lim

HydraNets: Specialized Dynamic Architectures for Efficient Inference
Ravi Teja Mullapudi, Noam Shazeer, William Mark, Kayvon Fatahalian

Thoracic Disease Identification and Localization with Limited Supervision
Zhe Li, Chong Wang, Mei Han, Yuan Xue, Wei Wei, Li-jia Li, Fei-Fei Li

Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis
Seunghoon Hong, Dingdong Yang, Jongwook Choi, Honglak Lee

Deep Semantic Face Deblurring
Ziyi Shen, Wei-Sheng Lai, Tingfa Xu, Jan Kautz, Ming-Hsuan Yang

Unsupervised Training for 3D Morphable Model Regression
Kyle Genova, Forrester Cole, Aaron Maschinot, Daniel Vlasic, Aaron Sarna, William Freeman

Learning Transferable Architectures for Scalable Image Recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc Le

Learning Intrinsic Image Decomposition from Watching the World
Zhengqi Li, Noah Snavely

PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection
Nian Liu, Junwei Han, Ming-Hsuan Yang

Tutorials
Computer Vision for Robotics and Driving
Anelia Angelova, Sanja Fidler

Unsupervised Visual Learning
Pierre Sermanet, Anelia Angelova

UltraFast 3D Sensing, Reconstruction and Understanding of People, Objects and Environments
Sean Fanello, Julien Valentin, Jonathan Taylor, Christoph Rhemann, Adarsh Kowdle, Jürgen Sturm, Christine Kaeser-Chen, Pavel Pidlypenskyi, Rohit Pandey, Andrea Tagliasacchi, Sameh Khamis, David Kim, Mingsong Dou, Kaiwen Guo, Danhang Tang, Shahram Izadi

Generative Adversarial Networks
Jun-Yan Zhu, Taesung Park, Mihaela Rosca, Phillip Isola, Ian Goodfellow

Source: Google AI Blog


Announcing Open Images V4 and the ECCV 2018 Open Images Challenge



In 2016, we introduced Open Images, a collaborative release of ~9 million images annotated with labels spanning thousands of object categories. Since its initial release, we've been hard at work updating and refining the dataset, in order to provide a useful resource for the computer vision community to develop new models.

Today, we are happy to announce Open Images V4, containing 15.4M bounding-boxes for 600 categories on 1.9M images, making it the largest existing dataset with object location annotations. The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects (8 per image on average; visualizer).
Annotated images from the Open Images dataset. Left: Mark Paul Gosselaar plays the guitar by Rhys A. Right: Civilization by Paul Downey. Both images used under CC BY 2.0 license.
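
The box annotations themselves are distributed as CSV files with one row per box and normalized corner coordinates. As a rough sketch of working with them (column and file names follow the published V4 format, but consult the dataset documentation), one can reproduce the objects-per-image statistic quoted above:

```python
import pandas as pd

# One row per bounding box; coordinates are normalized to [0, 1].
boxes = pd.read_csv("train-annotations-bbox.csv")

# Objects per image: the release averages ~8 boxes per image.
per_image = boxes.groupby("ImageID").size()
print("mean boxes per image:", per_image.mean())

# Class labels are MID identifiers; map them to readable names.
labels = pd.read_csv("class-descriptions-boxable.csv", header=None,
                     names=["LabelName", "DisplayName"])
guitar_mid = labels.loc[labels.DisplayName == "Guitar", "LabelName"].iloc[0]
print("guitar boxes:", (boxes.LabelName == guitar_mid).sum())
```
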
In conjunction with this release, we are also introducing the Open Images Challenge, a new object detection challenge to be held at the 2018 European Conference on Computer Vision (ECCV 2018). The Open Images Challenge follows in the tradition of PASCAL VOC, ImageNet and COCO, but at an unprecedented scale.

This challenge is unique in several ways:
  • 12.2M bounding-box annotations for 500 categories on 1.7M training images.
  • A broader range of categories than previous detection challenges, including new objects such as “fedora” and “snowman”.
  • In addition to the object detection main track, the challenge includes a Visual Relationship Detection track, on detecting pairs of objects in particular relations, e.g. “woman playing guitar”.
The training set is available now. A test set of 100k images will be released on July 1st 2018 by Kaggle. The deadline for submission of results is September 1st 2018. We hope that the very large training set will stimulate research into more sophisticated detection models that will exceed current state-of-the-art performance, and that the 500 categories will enable a more precise assessment of where different detectors perform best. Furthermore, having a large set of images with many annotated objects enables the exploration of Visual Relationship Detection, which is a hot emerging topic with a growing sub-community.

In addition to the above, Open Images V4 also contains 30.1M human-verified image-level labels for 19,794 categories, which are not part of the Challenge. The dataset includes 5.5M image-level labels generated by tens of thousands of users from all over the world at crowdsource.google.com.

Source: Google AI Blog


Introducing the CVPR 2018 On-Device Visual Intelligence Challenge



Over the past year, there have been exciting innovations in the design of deep networks for vision applications on mobile devices, such as the MobileNet model family and integer quantization. Many of these innovations have been driven by performance metrics that focus on meaningful user experiences in real-world mobile applications, requiring inference to be both low-latency and accurate. While the accuracy of a deep network model can be conveniently estimated with well established benchmarks in the computer vision community, latency is surprisingly difficult to measure and no uniform metric has been established. This lack of measurement platforms and uniform metrics has hampered the development of performant mobile applications.

Today, we are happy to announce the On-device Visual Intelligence Challenge (OVIC), part of the Low-Power Image Recognition Challenge Workshop at the 2018 Computer Vision and Pattern Recognition conference (CVPR2018). A collaboration with Purdue University, the University of North Carolina and IEEE, OVIC is a public competition for real-time image classification that uses state-of-the-art Google technology to significantly lower the barrier to entry for mobile development. OVIC provides two key features to catalyze innovation: a unified latency metric and an evaluation platform.

A Unified Metric
OVIC focuses on the establishment of a unified metric aligned directly with accurate and performant operation on mobile devices. The metric is defined as the number of correct classifications within a specified per-image average time limit of 33ms. This latency limit allows every frame in a live 30 frames-per-second video to be processed, thus providing a seamless user experience1. Prior to OVIC, it was tricky to enforce such a limit due to the difficulty in accurately and uniformly measuring latency as would be experienced in real-world applications on real-world devices. Without a repeatable mobile development platform, researchers have relied primarily on approximate metrics for latency that are convenient to compute, such as the number of multiply-accumulate operations (MACs). The intuition is that multiply-accumulate constitutes the most time-consuming operation in a deep neural network, so their count should be indicative of the overall latency. However, these metrics are often poor predictors of on-device latency due to many aspects of the models that can impact the average latency of each MAC in typical implementations.
Even though the number of multiply-accumulate operations (# MACs) is the most commonly used metric to approximate on-device latency, it is a poor predictor of latency. Using data from various quantized and floating point MobileNet V1 and V2 based models, this graph plots on-device latency on a common reference device versus the number of MACs. It is clear that models with similar latency can have very different MACs, and vice versa.
The graph above shows that while the number of MACs is correlated with the inference latency, there is significant variation in the mapping. Thus number of MACs is a poor proxy for latency, and since latency directly affects users’ experiences, we believe it is paramount to optimize latency directly rather than focusing on limiting the number of MACs as a proxy.
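
To see why MAC counts are so convenient on paper, note that for a standard 2D convolution the count is just a product of the output resolution, kernel size and channel counts, as in the minimal sketch below; what the graph shows is that this paper arithmetic says little about how long each MAC actually takes on a device.

```python
def conv2d_macs(out_h, out_w, k_h, k_w, c_in, c_out):
    """Multiply-accumulate count of a standard 2D convolution layer."""
    return out_h * out_w * k_h * k_w * c_in * c_out

# A 3x3 convolution producing a 112x112x64 output from 32 input channels:
print(conv2d_macs(112, 112, 3, 3, 32, 64))  # 231,211,008 MACs

# A depthwise-separable block (as in MobileNet) splits this into a per-channel
# depthwise filter plus a 1x1 pointwise mix, cutting the count ~8x, yet the
# measured latency ratio on a real phone can differ substantially.
depthwise = conv2d_macs(112, 112, 3, 3, 1, 32)
pointwise = conv2d_macs(112, 112, 1, 1, 32, 64)
print(depthwise + pointwise)  # 29,302,784 MACs
```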

An Evaluation Platform
As mentioned above, a primary issue with latency is that it has previously been challenging to measure reliably and repeatably, due to variations in implementation, running environment and hardware architectures. Recent successes in mobile development overcome these challenges with the help of a convenient mobile development platform, including optimized kernels for mobile CPUs, light-weight portable model formats, increasingly capable mobile devices, and more. However, these various platforms have traditionally required resources and development capabilities that are only available to larger universities and industry.

With that in mind, we are releasing OVIC’s evaluation platform, which includes a number of components designed to make replicable, comparable mobile development and evaluation accessible to the broader research community:
  • TOCO compiler for optimizing TensorFlow models for efficient inference
  • TensorFlow Lite inference engine for mobile deployment
  • A benchmarking SDK that can be run locally on any Android phone
  • Sample models to showcase successful mobile architectures that run inference in floating-point and quantized modes
  • Google’s benchmarking tool for reliable latency measurements on specific Pixel phones (available to registered contestants)
Using the tools available in OVIC, a participant can conveniently incorporate measurement of on-device latency into their design loop without having to worry about optimizing kernels, purchasing latency/power measurement devices, or designing the framework to drive them. The only requirement for entry is experience with training computer vision models in TensorFlow, an introduction to which can be found in this tutorial.
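
As a taste of the workflow, the sketch below converts a trained model for on-device inference using the TensorFlow Lite converter (the TOCO compiler underlies this flow; the path is a placeholder and the exact API surface has evolved across TensorFlow releases):

```python
import tensorflow as tf

# Convert a trained SavedModel into the portable .tflite format that the
# benchmarking SDK runs on-device. "/tmp/my_classifier" is a placeholder.
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_classifier")
tflite_model = converter.convert()

with open("/tmp/my_classifier.tflite", "wb") as f:
    f.write(tflite_model)

# The resulting model is what gets timed against the 33 ms per-image budget.
```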

With OVIC, we encourage the entire research community to push the classification performance of low-latency, high-accuracy models to new frontiers, as shown in the following graphic.
Sampling of current MobileNet mobile models illustrating the tradeoff between increased accuracy and reduced latency.
We cordially invite you to participate here before the deadline on June 15th, and help us discover new mobile vision architectures that will propel development into the future.

Acknowledgements
We would like to acknowledge our core contributors Achille Brighton, Alec Go, Andrew Howard, Hartwig Adam, Mark Sandler and Xiao Zhang. We would also like to acknowledge our external collaborators Alex Berg and Yung-Hsiang Lu. We give special thanks to Andre Hentz, Andrew Selle, Benoit Jacob, Brad Krueger, Dmitry Kalenichenko, Megan Cummins, Pete Warden, Rajat Monga, Shiyu Hu and Yicheng Fan.


1 Alternatively the same metric could encourage even lower power operation by only processing a subset of the images in the input stream.



Source: Google AI Blog


Introducing the iNaturalist 2018 Challenge



Thanks to recent advances in deep learning, the visual recognition abilities of machines have improved dramatically, permitting the practical application of computer vision to tasks ranging from pedestrian detection for self-driving cars to expression recognition in virtual reality. One area that remains challenging for computers, however, is fine-grained and instance-level recognition. Earlier this month, we posted an instance-level landmark recognition challenge for identifying individual landmarks. Here we focus on fine-grained visual recognition: distinguishing, for example, species of animals and plants, car and motorcycle models, or architectural styles. For computers, discriminating fine-grained categories is challenging because many categories have relatively few training examples (i.e., the long tail problem), the examples that do exist often lack authoritative training labels, and there is variability in illumination, viewing angle and object occlusion.

To help confront these hurdles, we are excited to announce the 2018 iNaturalist Challenge (iNat-2018), a species classification competition offered in partnership with iNaturalist and Visipedia (short for Visual Encyclopedia), a project for which Caltech and Cornell Tech received a Google Focused Research Award. This is a flagship challenge for the 5th International Workshop on Fine Grained Visual Categorization (FGVC5) at CVPR 2018. Building upon the first iNaturalist challenge, iNat-2017, iNat-2018 spans over 8000 categories of plants, animals, and fungi, with a total of more than 450,000 training images. We invite participants to enter the competition on Kaggle, with final submissions due in early June. Training data, annotations, and links to pretrained models can be found on our GitHub repo.

Since its founding in 2008, iNaturalist has emerged as a world leader for citizen scientists to share observations of species and connect with nature. It hosts research-grade photos and annotations submitted by a thriving, engaged community of users. Consider the following photo from iNaturalist:
The map on the right shows where the photo was taken. Image credit: Serge Belongie.
You may notice that the photo on the left contains a turtle. But did you also know this is a Trachemys scripta, common name “Pond Slider?” If you knew the latter, you possess knowledge of fine-grained or subordinate categories.

In contrast to other image classification datasets such as ImageNet, the dataset in the iNaturalist challenge exhibits a long-tailed distribution, with many species having relatively few images. It is important to enable machine learning models to handle categories in the long tail, as the natural world is heavily imbalanced – some species are more abundant and easier to photograph than others. The iNaturalist challenge will encourage progress because the training distribution of iNat-2018 has an even longer tail than iNat-2017.
Distribution of training images per species for iNat-2017 and iNat-2018, plotted on a log-log scale, illustrating the long-tail behavior typical of fine-grained classification problems. Image Credit: Grant Van Horn and Oisin Mac Aodha.
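
A plot like this is easy to reproduce for any labeled dataset; here is a small sketch (the toy `labels` list stands in for one class id per training image, e.g. parsed from the challenge annotation files):

```python
import collections
import matplotlib.pyplot as plt

# Stand-in labels: one class id per training image.
labels = ["sp_a"] * 1000 + ["sp_b"] * 120 + ["sp_c"] * 15 + ["sp_d"] * 3

counts = sorted(collections.Counter(labels).values(), reverse=True)
plt.loglog(range(1, len(counts) + 1), counts)
plt.xlabel("Species rank (sorted by frequency)")
plt.ylabel("Number of training images")
plt.title("Long-tailed class distribution")
plt.show()
```
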
Along with iNat-2018, FGVC5 will also host the iMaterialist 2018 challenge (including a furniture categorization challenge and a fashion attributes challenge for product images) and a set of “FGVCx” challenges representing smaller scale – but still significant – challenges, featuring content such as food and modern art.

FGVC5 will be showcased on the main stage at CVPR 2018, thereby ensuring broad exposure for the top performing teams. This project will advance the state-of-the-art in automatic image classification for real-world, fine-grained categories with heavy class imbalances and large numbers of classes. We cordially invite you to participate in these competitions and help move the field forward!

Acknowledgements
We’d like to thank our colleagues and friends at iNaturalist, Visipedia, and FGVC5 for working together to advance this important area. At Google we would like to thank Hartwig Adam, Weijun Wang, Nathan Frey, Andrew Howard, Alessandro Fin, Yuning Chai, Xiao Zhang, Jack Sim, Yuan Li, Grant Van Horn, Yin Cui, Chen Sun, Yanan Qian, Grace Vesom, Tanya Birch, Wendy Kan, and Maggie Demkin.

Google-Landmarks: A New Dataset and Challenge for Landmark Recognition



Image classification technology has shown remarkable improvement over the past few years, exemplified in part by the ImageNet classification challenge, where error rates continue to drop substantially every year. In order to continue advancing the state of the art in computer vision, many researchers are now putting more focus on fine-grained and instance-level recognition problems – instead of recognizing general entities such as buildings, mountains and (of course) cats, many are designing machine learning algorithms capable of identifying the Eiffel Tower, Mount Fuji or Persian cats. However, a significant obstacle for research in this area has been the lack of large annotated datasets.

Today, we are excited to advance instance-level recognition by releasing Google-Landmarks, the largest worldwide dataset for recognition of human-made and natural landmarks. Google-Landmarks is being released as part of the Landmark Recognition and Landmark Retrieval Kaggle challenges, which will be the focus of the CVPR’18 Landmarks workshop. The dataset contains more than 2 million images depicting 30 thousand unique landmarks from across the world (their geographic distribution is presented below), a number of classes that is ~30x larger than what is available in commonly used datasets. Additionally, to spur research in this field, we are open-sourcing Deep Local Features (DELF), an attentive local feature descriptor that we believe is especially suited for this kind of task.

Geographic distribution of landmarks in our dataset.
Landmark recognition presents some noteworthy differences from other problems. For example, even within a large annotated dataset, there might not be much training data available for some of the less popular landmarks. Additionally, since landmarks are generally rigid objects which do not move, the intra-class variation is very small (in other words, a landmark’s appearance does not change that much across different images of it). As a result, variations only arise due to image capture conditions, such as occlusions, different viewpoints, weather and illumination, making this distinct from other image recognition datasets where images of a particular class (such as a dog) can vary much more. These characteristics are also shared with other instance-level recognition problems, such as artwork recognition — so we hope the new dataset can benefit research for other image recognition problems as well.

The two Kaggle challenges provide access to annotated data to help researchers address these problems. The recognition track challenge is to build models that recognize the correct landmark in a dataset of challenging test images, while the retrieval track challenges participants to retrieve images containing the same landmark.

A few examples of images from the Google-Landmarks dataset, including landmarks such as Big Ben, Sacre Coeur Basilica, the rock sculpture of Decebalus and the Megyeri Bridge, among others.
If you plan to be at CVPR this year, we hope you’ll attend the CVPR’18 Landmarks workshop. However, everyone is able to participate in the challenge, and access to the new dataset is available via the Kaggle website. We hope this resource is valuable to your research and we can’t wait to see the ideas you will come up with for recognizing landmarks!

Acknowledgments
Jack Sim, Will Cukierski, Maggie Demkin, Hartwig Adam, Bohyung Han, Shih-Fu Chang, Ondrej Chum, Torsten Sattler, Giorgos Tolias, Xu Zhang, Fernando Brucher, Marco Andreetto, Gursheesh Kour.

Google at CVPR 2017



From July 21-26, Honolulu, Hawaii hosts the 2017 Conference on Computer Vision and Pattern Recognition (CVPR 2017), the premier annual computer vision event comprising the main conference and several co-located workshops and tutorials. As a leader in computer vision research and a Platinum Sponsor, Google will have a strong presence at CVPR 2017 — over 250 Googlers will be in attendance to present papers and invited talks at the conference, and to organize and participate in multiple workshops.

If you are attending CVPR this year, please stop by our booth and chat with our researchers who are actively pursuing the next generation of intelligent systems that utilize the latest machine learning techniques applied to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including the technology behind Headset Removal for Virtual and Mixed Reality, Image Compression with Neural Networks, Jump, TensorFlow Object Detection API and much more.

You can learn more about our research being presented at CVPR 2017 in the list below (Googlers highlighted in blue).

Organizing Committee
Corporate Relations Chair - Mei Han
Area Chairs include - Alexander Toshev, Ce Liu, Vittorio Ferrari, David Lowe

Papers
Training object class detectors with click supervision
Dim Papadopoulos, Jasper Uijlings, Frank Keller, Vittorio Ferrari

Unsupervised Pixel-Level Domain Adaptation With Generative Adversarial Networks
Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, Dilip Krishnan

BranchOut: Regularization for Online Ensemble Tracking With Convolutional Neural Networks
Bohyung Han, Jack Sim, Hartwig Adam

Enhancing Video Summarization via Vision-Language Embedding
Bryan A. Plummer, Matthew Brown, Svetlana Lazebnik

Learning by Association — A Versatile Semi-Supervised Training Method for Neural Networks
Philip Haeusser, Alexander Mordvintsev, Daniel Cremers

Context-Aware Captions From Context-Agnostic Supervision
Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, Gal Chechik

Spatially Adaptive Computation Time for Residual Networks
Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, Ruslan Salakhutdinov

Xception: Deep Learning With Depthwise Separable Convolutions
François Chollet

Deep Metric Learning via Facility Location
Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, Kevin Murphy

Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors
Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy

Synthesizing Normalized Faces From Facial Identity Features
Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, William T. Freeman

Towards Accurate Multi-Person Pose Estimation in the Wild
George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, Kevin Murphy

GuessWhat?! Visual Object Discovery Through Multi-Modal Dialogue
Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, Aaron Courville

Learning discriminative and transformation covariant local feature detectors
Xu Zhang, Felix X. Yu, Svebor Karaman, Shih-Fu Chang

Full Resolution Image Compression With Recurrent Neural Networks
George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, Michele Covell

Learning From Noisy Large-Scale Datasets With Minimal Supervision
Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, Serge Belongie

Unsupervised Learning of Depth and Ego-Motion From Video
Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe

Cognitive Mapping and Planning for Visual Navigation
Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, Jitendra Malik

Fast Fourier Color Constancy
Jonathan T. Barron, Yun-Ta Tsai

On the Effectiveness of Visible Watermarks
Tali Dekel, Michael Rubinstein, Ce Liu, William T. Freeman

YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video
Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, Vincent Vanhoucke

Workshops
Deep Learning for Robotic Vision
Organizers include: Anelia Angelova, Kevin Murphy
Program Committee includes: George Papandreou, Nathan Silberman, Pierre Sermanet

The Fourth Workshop on Fine-Grained Visual Categorization
Organizers include: Yang Song
Advisory Panel includes: Hartwig Adam
Program Committee includes: Anelia Angelova, Yuning Chai, Nathan Frey, Jonathan Krause, Catherine Wah, Weijun Wang

Language and Vision Workshop
Organizers include: R. Sukthankar

The First Workshop on Negative Results in Computer Vision
Organizers include: R. Sukthankar, W. Freeman, J. Malik

Visual Understanding by Learning from Web Data
General Chairs include: Jesse Berent, Abhinav Gupta, Rahul Sukthankar
Program Chairs include: Wei Li

YouTube-8M Large-Scale Video Understanding Challenge
General Chairs: Paul Natsev, Rahul Sukthankar
Program Chairs: Joonseok Lee, George Toderici
Challenge Organizers: Sami Abu-El-Haija, Anja Hauth, Nisarg Kothari, Hanhan Li, Sobhan Naderi Parizi, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan, Jian Wang