Tag Archives: #AutoML

Introducing the Next Generation of On-Device Vision Models: MobileNetV3 and MobileNetEdgeTPU

Posted by Andrew Howard, Software Engineer and Suyog Gupta, Silicon Engineer, Google Research

On-device machine learning (ML) is an essential component in enabling privacy-preserving, always-available and responsive intelligence. This need to bring on-device machine learning to compute and power-limited devices has spurred the development of algorithmically-efficient neural network models and hardware capable of performing billions of math operations per second, while consuming only a few milliwatts of power. The recently launched Google Pixel 4 exemplifies this trend, and ships with the Pixel Neural Core that contains an instantiation of the Edge TPU architecture, Google’s machine learning accelerator for edge computing devices, and powers Pixel 4 experiences such as face unlock, a faster Google Assistant and unique camera features. Similarly, algorithms, such as MobileNets, have been critical for the success of on-device ML by providing compact and efficient neural network models for mobile vision applications.

Today we are pleased to announce the release of source code and checkpoints for MobileNetV3 and the Pixel 4 Edge TPU-optimized counterpart MobileNetEdgeTPU model. These models are the culmination of the latest advances in hardware-aware AutoML techniques as well as several advances in architecture design. On mobile CPUs, MobileNetV3 is twice as fast as MobileNetV2 with equivalent accuracy, and advances the state-of-the-art for mobile computer vision networks. On the Pixel 4 Edge TPU hardware accelerator, the MobileNetEdgeTPU model pushes the boundary further by improving model accuracy while simultaneously reducing the runtime and power consumption.

Building MobileNetV3
In contrast with the hand-designed previous version of MobileNet, MobileNetV3 relies on AutoML to find the best possible architecture in a search space friendly to mobile computer vision tasks. To most effectively exploit the search space we deploy two techniques in sequence — MnasNet and NetAdapt. First, we search for a coarse architecture using MnasNet, which uses reinforcement learning to select the optimal configuration from a discrete set of choices. Then we fine-tune the architecture using NetAdapt, a complementary technique that trims under-utilized activation channels in small decrements. To provide the best possible performance under different conditions we have produced both large and small models.

Comparison of accuracy vs. latency for mobile models on the ImageNet classification task using the Google Pixel 4 CPU.

MobileNetV3 Search Space
The MobileNetV3 search space builds on multiple recent advances in architecture design that we adapt for the mobile environment. First, we introduce a new activation function called hard-swish (h-swish) which is based on the Swish nonlinearity function. The critical drawback of the Swish function is that it is very inefficient to compute on mobile hardware. So, instead we use an approximation that can be efficiently expressed as a product of two piecewise linear functions.

Next we introduce the mobile-friendly squeeze-and-excitation block, which replaces the classical sigmoid function with a piecewise linear approximation.

Combining h-swish plus mobile-friendly squeeze-and-excitation with a modified version of the inverted bottleneck structure introduced in MobileNetV2 yielded a new building block for MobileNetV3.

MobileNetV3 extends the MobileNetV2 inverted bottleneck structure by adding h-swish and mobile friendly squeeze-and-excitation as searchable options.

These parameters defined the search space used in constructing MobileNetV3:

Size of expansion layer
Degree of squeeze-excite compression
Choice of activation function: h-swish or ReLU
Number of layers for each resolution block

We also introduced a new efficient last stage at the end of the network that further reduced latency by 15%.

MobileNetV3 Object Detection and Semantic Segmentation
In addition to classification models, we also introduced MobileNetV3 object detection models, which reduced detection latency by 25% relative to MobileNetV2 at the same accuracy for the COCO dataset.

In order to optimize MobileNetV3 for efficient semantic segmentation, we introduced a low latency segmentation decoder called Lite Reduced Atrous Spatial Pyramid Pooling (LR-SPP). This new decoder contains three branches, one for low resolution semantic features, one for higher resolution details, and one for light-weight attention. The combination of LR-SPP and MobileNetV3 reduces the latency by over 35% on the high resolution Cityscapes Dataset.

MobileNet for Edge TPUs
The Edge TPU in Pixel 4 is similar in architecture to the Edge TPU in the Coral line of products, but customized to meet the requirements of key camera features in Pixel 4. The accelerator-aware AutoML approach substantially reduces the manual process involved in designing and optimizing neural networks for hardware accelerators. Crafting the neural architecture search space is an important part of this approach and centers around the inclusion of neural network operations that are known to improve hardware utilization. While operations such as squeeze-and-excite and swish non-linearity have been shown to be essential in building compact and fast CPU models, these operations tend to perform suboptimally on Edge TPU and hence are excluded from the search space. The minimalistic variants of MobileNetV3 also forgo the use of these operations (i.e., squeeze-and-excite, swish, and 5x5 convolutions) to allow easier portability to a variety of other hardware accelerators such as DSPs and GPUs.

The neural network architecture search, incentivized to jointly optimize the model accuracy and Edge TPU latency, produces the MobileNetEdgeTPU model that achieves lower latency for a fixed accuracy (or higher accuracy for a fixed latency) than existing mobile models such as MobileNetV2 and minimalistic MobileNetV3. Compared with the EfficientNet-EdgeTPU model (optimized for the Edge TPU in Coral), these models are designed to run at a much lower latency on Pixel 4, albeit at the cost of some loss in accuracy.

Although reducing the model’s power consumption was not a part of the search objective, the lower latency of the MobileNetEdgeTPU models also helps reduce the average Edge TPU power use. The MobileNetEdgeTPU model consumes less than 50% the power of the minimalistic MobileNetV3 model at comparable accuracy.

Left: Comparison of the accuracy on the ImageNet classification task between MobileNetEdgeTPU and other image classification networks designed for mobile when running on Pixel4 Edge TPU. MobileNetEdgeTPU achieves higher accuracy and lower latency compared with other models. Right: Average Edge TPU power in Watts for different classification models running at 30 frames per second (fps).

Objection Detection Using MobileNetEdgeTPU
The MobileNetEdgeTPU classification model also serves as an effective feature extractor for object detection tasks. Compared with MobileNetV2 based detection models, MobileNetEdgeTPU models offer a significant improvement in model quality (measured as the mean average precision; mAP) on the COCO14 minival dataset at comparable runtimes on the Edge TPU. The MobileNetEdgeTPU detection model has a latency of 6.6ms and achieves mAP score of 24.3, while MobileNetV2-based detection models achieve an mAP of 22 and takes 6.8ms per inference.

The Need for Hardware-Aware Models
While the results shown above highlight the power, performance, and quality benefits of MobileNetEdgeTPU models, it is important to note that the improvements arise due to the fact that these models have been customized to run on the Edge TPU accelerator.
MobileNetEdgeTPU when running on a mobile CPU delivers inferior performance compared with the models that have been tuned specifically for mobile CPUs (MobileNetV3). MobileNetEdgeTPU models perform a much greater number of operations, and so, it is not surprising that they run slower on mobile CPUs, which exhibit a more linear relationship between a model’s compute requirements and the runtime.

MobileNetV3 is still the best performing network when using mobile CPU as the deployment target.

For Researchers and Developers
The MobileNetV3 and MobileNetEdgeTPU code, as well as both floating point and quantized checkpoints for ImageNet classification, are available at the MobileNet github page. Open source implementation for MobileNetV3 and MobileNetEdgeTPU object detection is available in the Tensorflow Object Detection API. Open source implementation for MobileNetV3 semantic segmentation is available in TensorFlow through DeepLab.

Acknowledgements:
This work is made possible through a collaboration spanning several teams across Google. We’d like to acknowledge contributions from Berkin Akin, Okan Arikan, Gabriel Bender, Bo Chen, Liang-Chieh Chen, Grace Chu, Eddy Hsu, John Joseph, Pieter-jan Kindermans, Quoc Le, Owen Lin, Hanxiao Liu, Yun Long, Ravi Narayanaswami, Ruoming Pang, Mark Sandler, Mingxing Tan, Vijay Vasudevan, Weijun Wang, Dong Hyuk Woo, Dmitry Kalenichenko, Yunyang Xiong, Yukun Zhu and support from Hartwig Adam, Blaise Agüera y Arcas, Chidu Krishnan and Steve Molloy.

Source: Google AI Blog

Coral moves out of beta

Posted by Vikram Tank (Product Manager), Coral Team

microchips on coral colored background

Last March, we launched Coral beta from Google Research. Coral helps engineers and researchers bring new models out of the data center and onto devices, running TensorFlow models efficiently at the edge. Coral is also at the core of new applications of local AI in industries ranging from agriculture to healthcare to manufacturing. We've received a lot of feedback over the past six months and used it to improve our platform. Today we’re thrilled to graduate Coral out of beta, into a wider, global release.

Coral is already delivering impact across industries, and several of our partners are including Coral in products that require fast ML inferencing at the edge.

In healthcare, Care.ai is using Coral to build a device that enables hospitals and care centers to respond quickly to falls, prevent bed sores, improve patient care, and reduce costs. Virgo SVS is also using Coral as the basis of a polyp detection system that helps doctors improve the accuracy of endoscopies.

In a very different use case, Olea Edge employs Coral to help municipal water utilities accurately measure the amount of water used by their commercial customers. Their Meter Health Analytics solution uses local AI to reduce waste and predict equipment failure in industrial water meters.

Nexcom is using Coral to build gateways with local AI and provide a platform for next-gen, AI-enabled IoT applications. By moving AI processing to the gateway, existing sensor networks can stay in service without the need to add AI processing to each node.

From prototype to production

Microchips on white background

Coral’s Dev Board is designed as an integrated prototyping solution for new product development. Under the heatsink is the detachable Coral SoM, which combines Google’s Edge TPU with the NXP IMX8M SoC, Wi-Fi and Bluetooth connectivity, memory, and storage. We’re happy to announce that you can now purchase the Coral SoM standalone. We’ve also created a baseboard developer guide to help integrate it into your own production design.

Our Coral USB Accelerator allows users with existing system designs to add local AI inferencing via USB 2/3. For production workloads, we now offer three new Accelerators that feature the Edge TPU and connect via PCIe interfaces: Mini PCIe, M.2 A+E key, and M.2 B+M key. You can easily integrate these Accelerators into new products or upgrade existing devices that have an available PCIe slot.

The new Coral products are available globally and for sale at Mouser; for large volume sales, contact our sales team. By the end of 2019, we'll continue to expand our distribution of the Coral Dev Board and SoM into new markets including: Taiwan, Australia, New Zealand, India, Thailand, Singapore, Oman, Ghana and the Philippines.

Better resources

We’ve also revamped the Coral site with better organization for our docs and tools, a set of success stories, and industry focused pages. All of it can be found at a new, easier to remember URL Coral.ai.

To help you get the most out of the hardware, we’re also publishing a new set of examples. The included models and code can provide solutions to the most common on-device ML problems, such as image classification, object detection, pose estimation, and keyword spotting.

For those looking for a more in-depth application—and a way to solve the eternal problem of squirrels plundering your bird feeder—the Smart Bird Feeder project shows you how to perform classification with a custom dataset on the Coral Dev board.

Finally, we’ll soon release a new version of the Mendel OS that updates the system to Debian Buster, and we're hard at work on more improvements to the Edge TPU compiler and runtime that will improve the model development workflow.

The official launch of Coral is, of course, just the beginning, and we’ll continue to evolve the platform. Please keep sending us feedback at [email protected].

Source: Google Developers Blog

Coral summer updates: Post-training quant support, TF Lite delegate, and new models!

Posted by Vikram Tank (Product Manager), Coral Team

Coral’s had a busy summer working with customers, expanding distribution, and building new features — and of course taking some time for R&R. We’re excited to share updates, early work, and new models for our platform for local AI with you.

The compiler has been updated to version 2.0, adding support for models built using post-training quantization—only when using full integer quantization (previously, we required quantization-aware training)—and fixing a few bugs. As the Tensorflow team mentions in their Medium post “post-training integer quantization enables users to take an already-trained floating-point model and fully quantize it to only use 8-bit signed integers (i.e. `int8`).” In addition to reducing the model size, models that are quantized with this method can now be accelerated by the Edge TPU found in Coral products.

We've also updated the Edge TPU Python library to version 2.11.1 to include new APIs for transfer learning on Coral products. The new on-device back propagation API allows you to perform transfer learning on the last layer of an image classification model. The last layer of a model is removed before compilation and implemented on-device to run on the CPU. It allows for near-real time transfer learning and doesn’t require you to recompile the model. Our previously released imprinting API, has been updated to allow you to quickly retrain existing classes or add new ones while leaving other classes alone. You can now even keep the classes from the pre-trained base model. Learn more about both options for on-device transfer learning.

Until now, accelerating your model with the Edge TPU required that you write code using either our Edge TPU Python API or in C++. But now you can accelerate your model on the Edge TPU when using the TensorFlow Lite interpreter API, because we've released a TensorFlow Lite delegate for the Edge TPU. The TensorFlow Lite Delegate API is an experimental feature in TensorFlow Lite that allows for the TensorFlow Lite interpreter to delegate part or all of graph execution to another executor—in this case, the other executor is the Edge TPU. Learn more about the TensorFlow Lite delegate for Edge TPU.

Coral has also been working with Edge TPU and AutoML teams to release EfficientNet-EdgeTPU: a family of image classification models customized to run efficiently on the Edge TPU. The models are based upon the EfficientNet architecture to achieve the image classification accuracy of a server-side model in a compact size that's optimized for low latency on the Edge TPU. You can read more about the models’ development and performance on the Google AI Blog, and download trained and compiled versions on the Coral Models page.

And, as summer comes to an end we also want to share that Arrow offers a student teacher discount for those looking to experiment with the boards in class or the lab this year.

We're excited to keep evolving the Coral platform, please keep sending us feedback at [email protected].

Source: Google Developers Blog

EfficientNet-EdgeTPU: Creating Accelerator-Optimized Neural Networks with AutoML

Posted by Suyog Gupta, Machine Learning Accelerator Architect and Mingxing Tan, Software Engineer, Google Research

For several decades, computer processors have doubled their performance every couple of years by reducing the size of the transistors inside each chip, as described by Moore’s Law. As reducing transistor size becomes more and more difficult, there is a renewed focus in the industry on developing domain-specific architectures — such as hardware accelerators — to continue advancing computational power. This is especially true for machine learning, where efforts are aimed at building specialized architectures for neural network (NN) acceleration. Ironically, while there has been a steady proliferation of these architectures in data centers and on edge computing platforms, the NNs that run on them are rarely customized to take advantage of the underlying hardware.

Today, we are happy to announce the release of EfficientNet-EdgeTPU, a family of image classification models derived from EfficientNets, but customized to run optimally on Google’s Edge TPU, a power-efficient hardware accelerator available to developers through the Coral Dev Board and a USB Accelerator. Through such model customizations, the Edge TPU is able to provide real-time image classification performance while simultaneously achieving accuracies typically seen only when running much larger, compute-heavy models in data centers.

Using AutoML to customize EfficientNets for Edge TPU
EfficientNets have been shown to achieve state-of-the-art accuracy in image classification tasks while significantly reducing the model size and computational complexity. To build EfficientNets designed to leverage the Edge TPU’s accelerator architecture, we invoked the AutoML MNAS framework and augmented the original EfficientNet’s neural network architecture search space with building blocks that execute efficiently on the Edge TPU (discussed below). We also built and integrated a “latency predictor” module that provides an estimate of the model latency when executing on the Edge TPU, by running the models on a cycle-accurate architectural simulator. The AutoML MNAS controller implements a reinforcement learning algorithm to search this space while attempting to maximize the reward, which is a joint function of the predicted latency and model accuracy. From past experience, we know that Edge TPU’s power efficiency and performance tend to be maximized when the model fits within its on-chip memory. Hence we also modified the reward function to generate a higher reward for models that satisfy this constraint.

Overall AutoML flow for designing customized EfficientNet-EdgeTPU models.

Search Space Design
When performing the architecture search described above, one must consider that EfficientNets rely primarily on depthwise-separable convolutions, a type of neural network block that factorizes a regular convolution to reduce the number of parameters as well as the amount of computations. However, for certain configurations, a regular convolution utilizes the Edge TPU architecture more efficiently and executes faster, despite the much larger amount of compute. While it is possible, albeit tedious, to manually craft a network that uses an optimal combination of the different building blocks, augmenting the AutoML search space with these accelerator-optimal blocks is a more scalable approach.

A regular 3x3 convolution (right) has more compute (multiply-and-accumulate (mac) operations) than an depthwise-separable convolution (left), but for certain input/output shapes, executes faster on Edge TPU due to ~3x more effective hardware utilization.

In addition, removing certain operations from the search space that require modifications to the Edge TPU compiler to fully support, such swish non-linearity and squeeze-and-excitation block, naturally leads to models that are readily ported to the Edge TPU hardware. These operations tend to improve model quality slightly, so by eliminating them from the search space, we have effectively instructed AutoML to discover alternate network architectures that may compensate for any potential loss in quality.

Model Performance
The neural architecture search (NAS) described above produced a baseline model, EfficientNet-EdgeTPU-S, which is subsequently scaled up using EfficientNet's compound scaling method to produce the -M and -L models. The compound scaling approach selects an optimal combination of input image resolution scaling, network width, and depth scaling to construct larger, more accurate models. The -M, and -L models achieve higher accuracy at the cost of increased latency as shown in the figure below.

EfficientNet-EdgeTPU-S/M/L models achieve better latency and accuracy than existing EfficientNets (B1), ResNet, and Inception by specializing the network architecture for Edge TPU hardware. In particular, our EfficientNet-EdgeTPU-S achieves higher accuracy, yet runs 10x faster than ResNet-50.

Interestingly, the NAS-generated model employs the regular convolution quite extensively in the initial part of the network where the depthwise-separable convolution tends to be less effective than the regular convolution when executed on the accelerator. This clearly highlights the fact that trade-offs usually made while optimizing models for general purpose CPUs (reducing the total number of operations, for example) are not necessarily optimal for hardware accelerators. Also, these models achieve high accuracy even without the use of esoteric operations. Comparing with the other image classification models such as Inception-resnet-v2 and Resnet50, EfficientNet-EdgeTPU models are not only more accurate, but also run faster on Edge TPUs.

This work represents a first experiment in building accelerator-optimized models using AutoML. The AutoML-based model customization can be extended to not only a wide range of hardware accelerators, but also to several different applications that rely on neural networks.

From Cloud TPU training to Edge TPU deployment
We have released the training code and pretrained models for EfficientNet-EdgeTPU on our github repository. We employ tensorflow’s post-training quantization tool to convert a floating-point trained model to an Edge TPU-compatible integer-quantized model. For these models, the post-training quantization works remarkably well and produces only a very slight loss in accuracy (~0.5%). The script for exporting the quantized model from a training checkpoint can be found here. For an update on the Coral platform, see this post on the Google Developer’s Blog, and for full reference materials and detailed instructions, please refer to the Coral website.

Acknowledgements
Special thanks to Quoc Le, Hongkun Yu, Yunlu Li, Ruoming Pang, and Vijay Vasudevan from the Google Brain team; Bo Wu, Vikram Tank, and Ajay Nair from the Google Coral team; Han Vanholder, Ravi Narayanaswami, John Joseph, Dong Hyuk Woo, Raksit Ashok, Jason Jong Kyu Park, Jack Liu, Mohammadali Ghodrat, Cao Gao, Berkin Akin, Liang-Yun Wang, Chirag Gandhi, and Dongdong Li from the Google Edge TPU team.