Tag Archives: Machine Perception

Google at CVPR 2018

Posted by Christian Howard, Editor-in-Chief, Google AI Communications

This week, Salt Lake City hosts the 2018 Conference on Computer Vision and Pattern Recognition (CVPR 2018), the premier annual computer vision event comprising the main conference and several co-located workshops and tutorials. As a leader in computer vision research and a Diamond Sponsor, Google will have a strong presence at CVPR 2018 — over 200 Googlers will be in attendance to present papers and invited talks at the conference, and to organize and participate in multiple workshops.

If you are attending CVPR this year, please stop by our booth and chat with our researchers who are actively pursuing the next generation of intelligent systems that utilize the latest machine learning techniques applied to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including the technology behind portrait mode on the Pixel 2 and Pixel 2 XL smartphones, the Open Images V4 dataset and much more.

You can learn more about our research being presented at CVPR 2018 in the list below.

Organization
Finance Chair: Ramin Zabih

Area Chairs include: Sameer Agarwal, Aseem Agarwala, Jon Barron, Abhinav Shrivastava, Carl Vondrick, Ming-Hsuan Yang

Orals/Spotlights
Unsupervised Discovery of Object Landmarks as Structural Representations
Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee

DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor
Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, Yebin Liu

Neural Kinematic Networks for Unsupervised Motion Retargetting
Ruben Villegas, Jimei Yang, Duygu Ceylan, Honglak Lee

Burst Denoising with Kernel Prediction Networks
Ben Mildenhall, Jiawen Chen, Jonathan Barron, Robert Carroll, Dillon Sharlet, Ren Ng

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Benoit Jacob, Skirmantas Kligys, Bo Chen, Matthew Tang, Menglong Zhu, Andrew Howard, Dmitry Kalenichenko, Hartwig Adam

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu, Chen Sun, David Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

Focal Visual-Text Attention for Visual Question Answering
Junwei Liang, Lu Jiang, Liangliang Cao, Li-Jia Li, Alexander G. Hauptmann

Inferring Light Fields from Shadows
Manel Baradad, Vickie Ye, Adam Yedidia, Fredo Durand, William Freeman, Gregory Wornell, Antonio Torralba

Modifying Non-Local Variations Across Multiple Views
Tal Tlusty, Tomer Michaeli, Tali Dekel, Lihi Zelnik-Manor

Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen, Li-Jia Li, Fei-Fei Li, Abhinav Gupta

Unsupervised Training for 3D Morphable Model Regression
Kyle Genova, Forrester Cole, Aaron Maschinot, Daniel Vlasic, Aaron Sarna, William Freeman

Learning Transferable Architectures for Scalable Image Recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc Le

The iNaturalist Species Classification and Detection Dataset
Grant van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, Serge Belongie

Learning Intrinsic Image Decomposition from Watching the World
Zhengqi Li, Noah Snavely

Learning Intelligent Dialogs for Bounding Box Annotation
Ksenia Konyushkova, Jasper Uijlings, Christoph Lampert, Vittorio Ferrari

Posters
Revisiting Knowledge Transfer for Training Object Class Detectors
Jasper Uijlings, Stefan Popov, Vittorio Ferrari

Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David Ross, Jia Deng, Rahul Sukthankar

Hierarchical Novelty Detection for Visual Object Recognition
Kibok Lee, Kimin Lee, Kyle Min, Yuting Zhang, Jinwoo Shin, Honglak Lee

COCO-Stuff: Thing and Stuff Classes in Context
Holger Caesar, Jasper Uijlings, Vittorio Ferrari

Appearance-and-Relation Networks for Video Classification
Limin Wang, Wei Li, Wen Li, Luc Van Gool

MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks
Ariel Gordon, Elad Eban, Bo Chen, Ofir Nachum, Tien-Ju Yang, Edward Choi

Deformable Shape Completion with Graph Convolutional Autoencoders
Or Litany, Alex Bronstein, Michael Bronstein, Ameesh Makadia

MegaDepth: Learning Single-View Depth Prediction from Internet Photos
Zhengqi Li, Noah Snavely

Unsupervised Discovery of Object Landmarks as Structural Representations
Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee

Burst Denoising with Kernel Prediction Networks
Ben Mildenhall, Jiawen Chen, Jonathan Barron, Robert Carroll, Dillon Sharlet, Ren Ng

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Benoit Jacob, Skirmantas Kligys, Bo Chen, Matthew Tang, Menglong Zhu, Andrew Howard, Dmitry Kalenichenko, Hartwig Adam

Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling
Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Tianfan Xue, Joshua Tenenbaum, William Freeman

Sparse, Smart Contours to Represent and Edit Images
Tali Dekel, Dilip Krishnan, Chuang Gan, Ce Liu, William Freeman

MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features
Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, Hartwig Adam

Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning
Yin Cui, Yang Song, Chen Sun, Andrew Howard, Serge Belongie

Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks
Nick Johnston, Damien Vincent, David Minnen, Michele Covell, Saurabh Singh, Sung Jin Hwang, George Toderici, Troy Chinen, Joel Shor

MobileNetV2: Inverted Residuals and Linear Bottlenecks
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen

ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans 
Angela Dai, Daniel Ritchie, Martin Bokeloh, Scott Reed, Juergen Sturm, Matthias Nießner

Sim2Real View Invariant Visual Servoing by Recurrent Control
Fereshteh Sadeghi, Alexander Toshev, Eric Jang, Sergey Levine

Alternating-Stereo VINS: Observability Analysis and Performance Evaluation
Mrinal Kanti Paul, Stergios Roumeliotis

Soccer on Your Tabletop
Konstantinos Rematas, Ira Kemelmacher, Brian Curless, Steve Seitz

Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints
Reza Mahjourian, Martin Wicke, Anelia Angelova

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu, Chen Sun, David Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

Inferring Light Fields from Shadows
Manel Baradad, Vickie Ye, Adam Yedidia, Fredo Durand, William Freeman, Gregory Wornell, Antonio Torralba

Modifying Non-Local Variations Across Multiple Views
Tal Tlusty, Tomer Michaeli, Tali Dekel, Lihi Zelnik-Manor

Aperture Supervision for Monocular Depth Estimation
Pratul Srinivasan, Rahul Garg, Neal Wadhwa, Ren Ng, Jonathan Barron

Instance Embedding Transfer to Unsupervised Video Object Segmentation
Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, Qin Huang, C.-C. Jay Kuo

Frame-Recurrent Video Super-Resolution
Mehdi S. M. Sajjadi, Raviteja Vemulapalli, Matthew Brown

Weakly Supervised Action Localization by Sparse Temporal Pooling Network
Phuc Nguyen, Ting Liu, Gautam Prasad, Bohyung Han

Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen, Li-Jia Li, Fei-Fei Li, Abhinav Gupta

Learning and Using the Arrow of Time
Donglai Wei, Andrew Zisserman, William Freeman, Joseph Lim

HydraNets: Specialized Dynamic Architectures for Efficient Inference
Ravi Teja Mullapudi, Noam Shazeer, William Mark, Kayvon Fatahalian

Thoracic Disease Identification and Localization with Limited Supervision
Zhe Li, Chong Wang, Mei Han, Yuan Xue, Wei Wei, Li-Jia Li, Fei-Fei Li

Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis
Seunghoon Hong, Dingdong Yang, Jongwook Choi, Honglak Lee

Deep Semantic Face Deblurring
Ziyi Shen, Wei-Sheng Lai, Tingfa Xu, Jan Kautz, Ming-Hsuan Yang

Unsupervised Training for 3D Morphable Model Regression
Kyle Genova, Forrester Cole, Aaron Maschinot, Daniel Vlasic, Aaron Sarna, William Freeman

Learning Transferable Architectures for Scalable Image Recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc Le

Learning Intrinsic Image Decomposition from Watching the World
Zhengqi Li, Noah Snavely

PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection
Nian Liu, Junwei Han, Ming-Hsuan Yang

Tutorials
Computer Vision for Robotics and Driving
Anelia Angelova, Sanja Fidler

Unsupervised Visual Learning
Pierre Sermanet, Anelia Angelova

UltraFast 3D Sensing, Reconstruction and Understanding of People, Objects and Environments
Sean Fanello, Julien Valentin, Jonathan Taylor, Christoph Rhemann, Adarsh Kowdle, Jürgen Sturm, Christine Kaeser-Chen, Pavel Pidlypenskyi, Rohit Pandey, Andrea Tagliasacchi, Sameh Khamis, David Kim, Mingsong Dou, Kaiwen Guo, Danhang Tang, Shahram Izadi

Generative Adversarial Networks
Jun-Yan Zhu, Taesung Park, Mihaela Rosca, Phillip Isola, Ian Goodfellow

Source: Google AI Blog


Announcing an updated YouTube-8M, and the 2nd YouTube-8M Large-Scale Video Understanding Challenge and Workshop



Last year, we organized the first YouTube-8M Large-Scale Video Understanding Challenge with Kaggle, in which 742 teams consisting of 946 individuals from 60 countries used the YouTube-8M dataset (2017 edition) to develop classification algorithms which accurately assign video-level labels. The purpose of the competition was to accelerate improvements in large-scale video understanding, representation learning, noisy data modeling, transfer learning and domain adaptation approaches that can help improve the machine-learning models that classify video. In addition to the competition, we hosted an affiliated workshop at CVPR’17, inviting competition top performers and researchers to share their ideas on how to advance the state of the art in video understanding.

As a continuation of these efforts to accelerate video understanding, we are excited to announce another update to the YouTube-8M dataset, a new Kaggle video understanding challenge and an affiliated 2nd Workshop on YouTube-8M Large-Scale Video Understanding, to be held at the 2018 European Conference on Computer Vision (ECCV'18).
An Updated YouTube-8M Dataset (2018 Edition)
Our YouTube-8M (2018 edition) features a major improvement in the quality of annotations, obtained using a machine learning system that combines audio-visual content with title, description and other metadata to provide more accurate ground truth annotations. The updated version contains 6.1 million URLs, labeled with a vocabulary of 3,862 visual entities; each video is annotated with one or more labels, with an average of 3 labels per video. We have also updated the starter code, with new instructions for downloading the dataset and training TensorFlow video annotation models on it.
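If you want to explore the released records yourself, here is a minimal sketch of parsing the video-level TFRecords with TensorFlow. The feature names and dimensions ("id", "labels", a 1024-d "mean_rgb" and a 128-d "mean_audio") follow the public starter code but are assumptions here; check them against the files you actually download.

```python
import tensorflow as tf

# Assumed video-level feature schema (verify against the YouTube-8M starter code).
FEATURES = {
    "id": tf.io.FixedLenFeature([], tf.string),                # video identifier
    "labels": tf.io.VarLenFeature(tf.int64),                   # entity ids from the 3,862-entity vocabulary
    "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),     # mean visual embedding
    "mean_audio": tf.io.FixedLenFeature([128], tf.float32),    # mean audio embedding
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, FEATURES)
    labels = tf.sparse.to_dense(example["labels"])
    features = tf.concat([example["mean_rgb"], example["mean_audio"]], axis=-1)
    return features, labels

# Hypothetical file pattern; point this at the downloaded video-level TFRecords.
dataset = tf.data.TFRecordDataset(tf.io.gfile.glob("train*.tfrecord")).map(parse_example)
```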

The 2nd YouTube-8M Video Understanding Challenge
The 2nd YouTube-8M Video Understanding Challenge invites participants to build audio-visual content classification models using YouTube-8M as training data, and then to label an unknown subset of test videos. Unlike last year, we strictly impose a hard limit on model size, encouraging participants to advance a single model within a tight budget rather than assembling as many models as possible. Each of the top 5 teams will be awarded $5,000 to support their travel to Munich to attend ECCV’18. For details, please visit the Kaggle competition page.

The 2nd Workshop on YouTube-8M Large-Scale Video Understanding
To be held at ECCV’18, the workshop will consist of invited talks by distinguished researchers, as well as presentations by top-performing challenge participants, in order to facilitate the exchange of ideas. We encourage those who wish to attend to submit papers describing their research, experiments, or applications based on the YouTube-8M dataset, including papers summarizing their participation in the challenge above. Please refer to the workshop page for more details.

It is our hope that this update to the dataset, along with the new challenge and workshop, will continue to advance the research in large-scale video understanding. We hope you will join us again!

Acknowledgements
This post reflects the work of many machine perception researchers including Sami Abu-El-Haija, Ke Chen, Nisarg Kothari, Joonseok Lee, Hanhan Li, Paul Natsev, Sobhan Naderi Parizi, Rahul Sukthankar, George Toderici, Balakrishnan Varadarajan, as well as Sohier Dane, Julia Elliott, Wendy Kan and Walter Reade from Kaggle. We are also grateful for the support and advice from our partners at YouTube.

Source: Google AI Blog


Introducing the CVPR 2018 On-Device Visual Intelligence Challenge



Over the past year, there have been exciting innovations in the design of deep networks for vision applications on mobile devices, such as the MobileNet model family and integer quantization. Many of these innovations have been driven by performance metrics that focus on meaningful user experiences in real-world mobile applications, requiring inference to be both low-latency and accurate. While the accuracy of a deep network model can be conveniently estimated with well established benchmarks in the computer vision community, latency is surprisingly difficult to measure and no uniform metric has been established. This lack of measurement platforms and uniform metrics has hampered the development of performant mobile applications.

Today, we are happy to announce the On-device Visual Intelligence Challenge (OVIC), part of the Low-Power Image Recognition Challenge Workshop at the 2018 Computer Vision and Pattern Recognition conference (CVPR2018). A collaboration with Purdue University, the University of North Carolina and IEEE, OVIC is a public competition for real-time image classification that uses state-of-the-art Google technology to significantly lower the barrier to entry for mobile development. OVIC provides two key features to catalyze innovation: a unified latency metric and an evaluation platform.

A Unified Metric
OVIC focuses on the establishment of a unified metric aligned directly with accurate and performant operation on mobile devices. The metric is defined as the number of correct classifications within a specified per-image average time limit of 33ms. This latency limit allows every frame in a live 30 frames-per-second video to be processed, thus providing a seamless user experience1. Prior to OVIC, it was difficult to enforce such a limit, because latency could not be measured accurately and uniformly as it would be experienced in real-world applications on real-world devices. Without a repeatable mobile development platform, researchers have relied primarily on approximate metrics for latency that are convenient to compute, such as the number of multiply-accumulate operations (MACs). The intuition is that multiply-accumulates constitute the most time-consuming operations in a deep neural network, so their count should be indicative of the overall latency. However, these metrics are often poor predictors of on-device latency, because many aspects of a model can impact the average latency of each MAC in typical implementations.
Even though the number of multiply-accumulate operations (# MACs) is the most commonly used metric to approximate on-device latency, it is a poor predictor of latency. Using data from various quantized and floating point MobileNet V1 and V2 based models, this graph plots on-device latency on a common reference device versus the number of MACs. It is clear that models with similar latency can have very different MACs, and vice versa.
The graph above shows that while the number of MACs is correlated with inference latency, there is significant variation in the mapping. The number of MACs is thus a poor proxy for latency, and since latency directly affects users’ experiences, we believe it is paramount to optimize latency directly rather than to limit the number of MACs as a proxy.
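As an illustration of the metric (not the official scorer, whose exact rules are defined on the competition pages), a submission could be scored roughly as follows: count correct classifications, but only if the average per-image latency stays within the 33 ms budget.

```python
def ovic_style_score(predictions, ground_truth, latencies_ms, budget_ms=33.0):
    """Count correct classifications, awarding a score only if the
    average per-image latency stays within the time budget."""
    assert len(predictions) == len(ground_truth) == len(latencies_ms)
    avg_latency = sum(latencies_ms) / len(latencies_ms)
    if avg_latency > budget_ms:      # over budget: the run does not qualify
        return 0
    return sum(int(p == y) for p, y in zip(predictions, ground_truth))
```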

An Evaluation Platform
As mentioned above, a primary issue with latency is that it has previously been challenging to measure reliably and repeatably, due to variations in implementation, running environment and hardware architectures. Recent successes in mobile development overcome these challenges with the help of a convenient mobile development platform, including optimized kernels for mobile CPUs, light-weight portable model formats, increasingly capable mobile devices, and more. However, these various platforms have traditionally required resources and development capabilities that are only available to larger universities and industry.

With that in mind, we are releasing OVIC’s evaluation platform, which includes a number of components designed to make replicable, comparable mobile development and evaluation accessible to the broader research community:
  • TOCO compiler for optimizing TensorFlow models for efficient inference
  • TensorFlow Lite inference engine for mobile deployment
  • A benchmarking SDK that can be run locally on any Android phone
  • Sample models to showcase successful mobile architectures that run inference in floating-point and quantized modes
  • Google’s benchmarking tool for reliable latency measurements on specific Pixel phones (available to registered contestants).
Using the tools available in OVIC, a participant can conveniently incorporate measurement of on-device latency into their design loop without having to worry about optimizing kernels, purchasing latency/power measurement devices, or designing the framework to drive them. The only requirement for entry is experience with training computer vision models in TensorFlow, which is covered in this tutorial.
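As a rough, host-side illustration of the kind of measurement loop these components enable, the sketch below times single-image inference with the standard tf.lite.Interpreter API. The model path is a placeholder, and host-side timings are only a stand-in for the numbers produced by the on-device benchmarking tool.

```python
import time
import numpy as np
import tensorflow as tf

# Placeholder path to a converted (e.g. TOCO-optimized) TensorFlow Lite model.
interpreter = tf.lite.Interpreter(model_path="mobilenet_quantized.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def classify_timed(image):
    """Run one inference and return (predicted class, latency in ms)."""
    interpreter.set_tensor(inp["index"], image)
    start = time.perf_counter()
    interpreter.invoke()
    latency_ms = (time.perf_counter() - start) * 1000.0
    return int(np.argmax(interpreter.get_tensor(out["index"]))), latency_ms

# Dummy input matching the model's expected shape and dtype.
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])
print(classify_timed(dummy))
```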

With OVIC, we encourage the entire research community to improve the classification performance of low-latency high-accuracy models towards new frontiers, as shown in the following graphic.
Sampling of current MobileNet mobile models illustrating the tradeoff between increased accuracy and reduced latency.
We cordially invite you to participate here before the deadline on June 15th, and help us discover new mobile vision architectures that will propel development into the future.

Acknowledgements
We would like to acknowledge our core contributors Achille Brighton, Alec Go, Andrew Howard, Hartwig Adam, Mark Sandler and Xiao Zhang. We would also like to acknowledge our external collaborators Alex Berg and Yung-Hsiang Lu. We give special thanks to Andre Hentz, Andrew Selle, Benoit Jacob, Brad Krueger, Dmitry Kalenichenko, Megan Cummins, Pete Warden, Rajat Monga, Shiyu Hu and Yicheng Fan.


1 Alternatively the same metric could encourage even lower power operation by only processing a subset of the images in the input stream.



Source: Google AI Blog


Looking to Listen: Audio-Visual Speech Separation



People are remarkably good at focusing their attention on a particular person in a noisy environment, mentally “muting” all other voices and sounds. Known as the cocktail party effect, this capability comes naturally to us humans. However, automatic speech separation — separating an audio signal into its individual speech sources — while a well-studied problem, remains a significant challenge for computers.

In “Looking to Listen at the Cocktail Party”, we present a deep learning audio-visual model for isolating a single speech signal from a mixture of sounds such as other voices and background noise. In this work, we are able to computationally produce videos in which speech of specific people is enhanced while all other sounds are suppressed. Our method works on ordinary videos with a single audio track, and all that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context. We believe this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where there are multiple people speaking.
A unique aspect of our technique is that it combines both the auditory and visual signals of an input video to separate the speech. Intuitively, movements of a person’s mouth, for example, should correlate with the sounds produced as that person is speaking, which in turn can help identify which parts of the audio correspond to that person. The visual signal not only improves the speech separation quality significantly in cases of mixed speech (compared to speech separation using audio alone, as we demonstrate in our paper), but, importantly, it also associates the separated, clean speech tracks with the visible speakers in the video.
The input to our method is a video with one or more people speaking, where the speech of interest is interfered with by other speakers and/or background noise. The output is a decomposition of the input audio track into clean speech tracks, one for each person detected in the video.
An Audio-Visual Speech Separation Model
To generate training examples, we started by gathering a large collection of 100,000 high-quality videos of lectures and talks from YouTube. From these videos, we extracted segments with clean speech (e.g. no mixed music, audience sounds or other speakers) and with a single speaker visible in the video frames. This resulted in roughly 2,000 hours of video clips, each of a single person visible to the camera and talking with no background interference. We then used this clean data to generate “synthetic cocktail parties”: mixtures of face videos and their corresponding speech from separate video sources, along with non-speech background noise we obtained from AudioSet.
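As an illustration of how such a synthetic mixture can be assembled (the exact mixing procedure and gain levels used for the paper may differ), here is a minimal sketch that mixes a clean speech waveform with an interfering signal at a chosen signal-to-noise ratio:

```python
import numpy as np

def mix_at_snr(speech, interference, snr_db):
    """Mix clean speech with an interfering signal (another voice or
    AudioSet noise) at a target signal-to-noise ratio, in dB."""
    # Tile/trim the interference so the two waveforms have equal length.
    reps = int(np.ceil(len(speech) / len(interference)))
    interference = np.tile(interference, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    interference_power = np.mean(interference ** 2) + 1e-12
    gain = np.sqrt(speech_power / (interference_power * 10 ** (snr_db / 10.0)))
    return speech + gain * interference
```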

Using this data, we were able to train a multi-stream convolutional neural network-based model to split the synthetic cocktail mixture into separate audio streams for each speaker in the video. The inputs to the network are visual features extracted from the face thumbnails of detected speakers in each frame, and a spectrogram representation of the video’s soundtrack. During training, the network learns separate encodings for the visual and auditory signals, then fuses them together to form a joint audio-visual representation. With that joint representation, the network learns to output a time-frequency mask for each speaker. The output masks are multiplied by the noisy input spectrogram and converted back to a time-domain waveform to obtain an isolated, clean speech signal for each speaker. For full details, see our paper.
Our multi-stream, neural network-based model architecture.
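The sketch below illustrates only that final masking step (not the network itself): each speaker's real-valued time-frequency mask is applied to the complex spectrogram of the mixture, and the result is inverted back to a waveform. The STFT parameters here are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_speaker_masks(mixture, masks, fs=16000, nperseg=400):
    """Apply per-speaker time-frequency masks (as predicted by the network)
    to a mixed waveform and invert each masked spectrogram back to audio."""
    _, _, mix_spec = stft(mixture, fs=fs, nperseg=nperseg)   # complex spectrogram of the mixture
    separated = []
    for mask in masks:                                       # one real-valued mask per speaker,
        masked_spec = mask * mix_spec                        # same shape as mix_spec
        _, waveform = istft(masked_spec, fs=fs, nperseg=nperseg)
        separated.append(waveform)
    return separated
```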
Here are some more speech separation and enhancement results by our method, playing first the input video with mixed or noisy speech, then our results. Sound from speakers other than the selected ones can be entirely suppressed or suppressed to the desired level.
To highlight the utilization of visual information by our model, we took two different parts from the same video of Google’s CEO, Sundar Pichai, and placed them side by side. It is very difficult to perform speech separation in this scenario using only characteristic speech frequencies contained in the audio; however, our audio-visual model manages to properly separate the speech even in this challenging case.
Application to Speech Recognition
Our method can also potentially be used as a pre-processing step for speech recognition and automatic video captioning. Handling overlapping speakers is a known challenge for automatic captioning systems, and separating the audio into its different sources could help in presenting more accurate and easier-to-read captions.
You can similarly see and compare the captions before and after speech separation in all the other videos in this post and on our website, by turning on closed captions in the YouTube player when playing the videos (“cc” button at the lower right corner of the player).

On our project web page you can find more results, as well as comparisons with state-of-the-art audio-only speech separation and with other recent audio-visual speech separation work. Indeed, with recent advances in deep learning, there is a clear growing interest in the academic community in audio-visual analysis. For example, independently and concurrently to our work, this work from UC Berkeley explored a self-supervised approach for separating speech of on/off-screen speakers, and this work from MIT addressed the problem of separating the sound of multiple on-screen objects (e.g., musical instruments), while locating the image regions from which the sound originates.

We envision a wide range of applications for this technology. We are currently exploring opportunities for incorporating it into various Google products. Stay tuned!

Acknowledgements
The research described in this post was done by Ariel Ephrat (as an intern), Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, Bill Freeman and Michael Rubinstein. We would like to thank Yossi Matias and Google Research Israel for their support for the project, and John Hershey for his valuable feedback. We also thank Arkady Ziefman for his help with animations and figures, and Rachel Soh for helping us procure permissions for video content in our results.

Source: Google AI Blog


MobileNetV2: The Next Generation of On-Device Computer Vision Networks



Last year we introduced MobileNetV1, a family of general purpose computer vision neural networks designed with mobile devices in mind to support classification, detection and more. The ability to run deep networks on personal mobile devices improves user experience, offering anytime, anywhere access, with additional benefits for security, privacy, and energy consumption. As new applications emerge allowing users to interact with the real world in real time, so does the need for ever more efficient neural networks.

Today, we are pleased to announce the availability of MobileNetV2 to power the next generation of mobile vision applications. MobileNetV2 is a significant improvement over MobileNetV1 and pushes the state of the art for mobile visual recognition, including classification, object detection and semantic segmentation. MobileNetV2 is released as part of the TensorFlow-Slim Image Classification Library, or you can start exploring MobileNetV2 right away in Colaboratory. Alternatively, you can download the notebook and explore it locally using Jupyter. MobileNetV2 is also available as modules on TF-Hub, and pretrained checkpoints can be found on GitHub.

MobileNetV2 builds upon the ideas from MobileNetV1 [1], using depthwise separable convolutions as efficient building blocks. However, V2 introduces two new features to the architecture: 1) linear bottlenecks between the layers, and 2) shortcut connections between the bottlenecks1. The basic structure is shown below.
Overview of MobileNetV2 Architecture. Blue blocks represent composite convolutional building blocks as shown above.
The intuition is that the bottlenecks encode the model’s intermediate inputs and outputs while the inner layer encapsulates the model’s ability to transform from lower-level concepts such as pixels to higher level descriptors such as image categories. Finally, as with traditional residual connections, shortcuts enable faster training and better accuracy. You can learn more about the technical details in our paper, “MobileNet V2: Inverted Residuals and Linear Bottlenecks”.
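To make the block structure concrete, here is a minimal Keras sketch of a single inverted residual block with a linear bottleneck, assuming an expansion factor of 6 and ReLU6 activations; refer to the paper and the released code for the exact configuration. Stacking blocks of this form, with stride-2 blocks for downsampling, yields the full backbone.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, out_channels, stride=1, expansion=6):
    """Sketch of a MobileNetV2-style block: expand -> depthwise -> linear project,
    with a shortcut connecting the (narrow) bottlenecks when shapes allow."""
    in_channels = x.shape[-1]

    h = layers.Conv2D(expansion * in_channels, 1, padding="same", use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)                       # expansion (1x1, ReLU6)

    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)                       # depthwise (3x3, ReLU6)

    h = layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)                      # linear bottleneck: no activation

    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])                            # shortcut between bottlenecks
    return h
```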

How does it compare to the first generation of MobileNets?
Overall, the MobileNetV2 models are faster for the same accuracy across the entire latency spectrum. In particular, the new models use 2x fewer operations, need 30% fewer parameters and are about 30-40% faster on a Google Pixel phone than MobileNetV1 models, all while achieving higher accuracy.
MobileNetV2 improves speed (reduces latency) and increases ImageNet Top-1 accuracy
MobileNetV2 is a very effective feature extractor for object detection and segmentation. For example, for detection, when paired with the newly introduced SSDLite [2], the new model is about 35% faster than MobileNetV1 at the same accuracy. We have open sourced the model under the TensorFlow Object Detection API [4].

Model                     Params     Multiply-Adds     mAP       Mobile CPU
MobileNetV1 + SSDLite     5.1M       1.3B              22.2%     270ms
MobileNetV2 + SSDLite     4.3M       0.8B              22.1%     200ms

To enable on-device semantic segmentation, we employ MobileNetV2 as a feature extractor in a reduced form of DeepLabv3 [3], which was announced recently. On the PASCAL VOC 2012 semantic segmentation benchmark, our resulting model attains performance similar to using MobileNetV1 as the feature extractor, but requires 5.3 times fewer parameters and 5.2 times fewer operations in terms of Multiply-Adds.

Model                       Params     Multiply-Adds     mIOU
MobileNetV1 + DeepLabV3     11.15M     14.25B            75.29%
MobileNetV2 + DeepLabV3     2.11M      2.75B             75.32%

As we have seen, MobileNetV2 provides a very efficient mobile-oriented model that can be used as a base for many visual recognition tasks. We hope that by sharing it with the broader academic and open-source community, we can help to advance research and application development.

Acknowledgements:
We would like to acknowledge our core contributors Menglong Zhu, Andrey Zhmoginov and Liang-Chieh Chen. We also give special thanks to Bo Chen, Dmitry Kalenichenko, Skirmantas Kligys, Matthew Tang, Weijun Wang, Benoit Jacob, George Papandreou, Zhichao Lu, Vivek Rathod, Jonathan Huang, Yukun Zhu, and Hartwig Adam.

References
  1. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H, arXiv:1704.04861, 2017.
  2. MobileNetV2: Inverted Residuals and Linear Bottlenecks, Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC. arXiv preprint. arXiv:1801.04381, 2018.
  3. Rethinking Atrous Convolution for Semantic Image Segmentation, Chen LC, Papandreou G, Schroff F, Adam H. arXiv:1706.05587, 2017.
  4. Speed/accuracy trade-offs for modern convolutional object detectors, Huang J, Rathod V, Sun C, Zhu M, Korattikara A, Fathi A, Fischer I, Wojna Z, Song Y, Guadarrama S, Murphy K, CVPR 2017.
  5. Deep Residual Learning for Image Recognition, He K, Zhang X, Ren S, Sun J, arXiv:1512.03385, 2015.


1 The shortcut (also known as skip) connections, popularized by ResNets [5], are commonly used to connect the non-bottleneck layers. MobileNetV2 inverts this notion and connects the bottlenecks directly.

Source: Google AI Blog


Expressive Speech Synthesis with Tacotron

At Google, we’re excited about the recent rapid progress of neural network-based text-to-speech (TTS) research. In particular, end-to-end architectures, such as the Tacotron systems we announced last year, can both simplify voice building pipelines and produce natural-sounding speech. This will help us build better human-computer interfaces, like conversational assistants, audiobook narration, news readers, or voice design software. To deliver a truly human-like voice, however, a TTS system must learn to model prosody, the collection of expressive factors of speech, such as intonation, stress, and rhythm. Most current end-to-end systems, including Tacotron, don’t explicitly model prosody, meaning they can’t control exactly how the generated speech should sound. This may lead to monotonous-sounding speech, even when models are trained on very expressive datasets like audiobooks, which often contain character voices with significant variation. Today, we are excited to share two new papers that address these problems.

Our first paper, “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron”, introduces the concept of a prosody embedding. We augment the Tacotron architecture with an additional prosody encoder that computes a low-dimensional embedding from a clip of human speech (the reference audio).

We augment Tacotron with a prosody encoder. The lower half of the image is the original Tacotron sequence-to-sequence model. For technical details, please refer to the paper.

This embedding captures characteristics of the audio that are independent of phonetic information and idiosyncratic speaker traits — these are attributes like stress, intonation, and timing. At inference time, we can use this embedding to perform prosody transfer, generating speech in the voice of a completely different speaker, but exhibiting the prosody of the reference.
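One simple way to picture how such an embedding can condition synthesis (a sketch under our own assumptions, not necessarily the exact mechanism used in the paper) is to broadcast the fixed-length prosody embedding across time and combine it with the text encoder outputs before the decoder attends to them:

```python
import numpy as np

def condition_on_prosody(text_encoder_outputs, prosody_embedding):
    """Broadcast a fixed-length prosody embedding over the text sequence and
    concatenate it with the encoder outputs (one plausible conditioning scheme).

    text_encoder_outputs: (T, d_text) array of encoder states.
    prosody_embedding:    (d_prosody,) embedding from the reference audio.
    """
    T = text_encoder_outputs.shape[0]
    tiled = np.tile(prosody_embedding[None, :], (T, 1))     # (T, d_prosody)
    return np.concatenate([text_encoder_outputs, tiled], axis=-1)
```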

Text: *Is* that Utah travel agency?
Reference prosody (Australian)
Synthesized without prosody embedding (American)
Synthesized with prosody embedding (American)

The embedding can also transfer fine time-aligned prosody from one phrase to a slightly different phrase, though this technique works best when the reference and target phrases are similar in length and structure.

Reference Text: For the first time in her life she had been danced tired.
Synthesized Text: For the last time in his life he had been handily embarrassed.
Reference prosody (American)
Synthesized without prosody embedding (American)
Synthesized with prosody embedding (American)

Excitingly, we observe prosody transfer even when the reference audio comes from a speaker whose voice is not in Tacotron’s training data.

Text: I’ve Swallowed a Pollywog.
Reference prosody (Unseen American Speaker)
Synthesized without prosody embedding (British)
Synthesized with prosody embedding (British)

This is a promising result, as it paves the way for voice interaction designers to use their own voice to customize speech synthesis. You can listen to the full set of audio demos for “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron” on this web page.

Despite their ability to transfer prosody with high fidelity, the embeddings from the paper above don’t completely disentangle prosody from the content of a reference audio clip. (This explains why they transfer prosody best to phrases of similar structure and length.) Furthermore, they require a clip of reference audio at inference time. A natural question then arises: can we develop a model of expressive speech that alleviates these problems?

In our second paper, “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis”, we do just that. Building upon the architecture in our first paper, we propose a new unsupervised method for modeling latent “factors” of speech. The key to this model is that, rather than learning fine time-aligned prosodic elements, it learns higher-level speaking style patterns that can be transferred across arbitrarily different phrases.

The model works by adding an extra attention mechanism to Tacotron, forcing it to represent the prosody embedding of any speech clip as the linear combination of a fixed set of basis embeddings. We call these embeddings Global Style Tokens (GSTs), and find that they learn text-independent variations in a speaker’s style (soft, high-pitch, intense, etc.), without the need for explicit style labels.

Model architecture of Global Style Tokens. The prosody embedding is decomposed into “style tokens” to enable unsupervised style control and transfer. For technical details, please refer to the paper.
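A minimal sketch of the core computation, using single-head dot-product attention for readability (the paper uses a multi-head attention module): the embedding of the reference audio acts as a query over a small bank of learned token embeddings, and the style embedding is their weighted, i.e. linear, combination.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_style_embedding(prosody_query, style_tokens, token_keys):
    """Attend over a fixed bank of learned style tokens and return their
    linear combination plus the combination weights.

    prosody_query: (d_q,) query derived from the reference audio.
    style_tokens:  (n_tokens, d_style) learned basis embeddings (GSTs).
    token_keys:    (n_tokens, d_q) keys for dot-product attention.
    """
    scores = token_keys @ prosody_query          # similarity to each token
    weights = softmax(scores)                    # combination weights
    # At inference, these weights can instead be set by hand to pick a style.
    return weights @ style_tokens, weights
```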

At inference time, we can select or modify the combination weights for the tokens, allowing us to force Tacotron to use a specific speaking style without needing a reference audio clip. Using GSTs, for example, we can make different sentences of varying lengths sound more “lively”, “angry”, “lamenting”, etc:

Text: United Airlines five six three from Los Angeles to New Orleans has Landed.
Style 1
Style 2
Style 3
Style 4
Style 5

The text-independent nature of GSTs makes them ideal for style transfer, which takes a reference audio clip spoken in a specific style and transfers its style to any target phrase we choose. To achieve this, we first run inference to predict the GST combination weights for an utterance whose style we want to imitate. We can then feed those combination weights to the model to synthesize completely different phrases — even those with very different lengths and structure — in the same style.

Finally, our paper shows that Global Style Tokens can model more than just speaking style. When trained on noisy YouTube audio from unlabeled speakers, a GST-enabled Tacotron learns to represent noise sources and distinct speakers as separate tokens. This means that by selecting the GSTs we use in inference, we can synthesize speech free of background noise, or speech in the voice of a specific unlabeled speaker from the dataset. This exciting result provides a path towards highly scalable but robust speech synthesis. You can listen to the full set of demos for “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis” on this web page.

We are excited about the potential applications and opportunities that these two bodies of research enable. In the meantime, there are new important research problems to be addressed. We’d like to extend the techniques of the first paper to support prosody transfer in the natural pitch range of the target speaker. We’d also like to develop techniques to select appropriate prosody or speaking style automatically from context, using, for example, the integration of natural language understanding with TTS. Finally, while our first paper proposes an initial set of objective and subjective metrics for prosody transfer, we’d like to develop these further to help establish generally-accepted methods for prosodic evaluation.

Acknowledgements
These projects were done jointly between multiple Google teams. Contributors include RJ Skerry-Ryan, Yuxuan Wang, Daisy Stanton, Eric Battenberg, Ying Xiao, Joel Shor, Rif A. Saurous, Yu Zhang, Ron J. Weiss, Rob Clark, Fei Ren and Ye Jia.
