Tag Archives: machine learning

Symbol tuning improves in-context learning in language models

A key feature of human intelligence is that humans can learn to perform new tasks by reasoning using only a few examples. Scaling up language models has unlocked a range of new applications and paradigms in machine learning, including the ability to perform challenging reasoning tasks via in-context learning. Language models, however, are still sensitive to the way that prompts are given, indicating that they are not reasoning in a robust manner. For instance, language models often require heavy prompt engineering or phrasing tasks as instructions, and they exhibit unexpected behaviors such as performance on tasks being unaffected even when shown incorrect labels.

In “Symbol tuning improves in-context learning in language models”, we propose a simple fine-tuning procedure that we call symbol tuning, which can improve in-context learning by emphasizing input–label mappings. We experiment with symbol tuning across Flan-PaLM models and observe benefits across various settings.

  • Symbol tuning boosts performance on unseen in-context learning tasks and is much more robust to underspecified prompts, such as those without instructions or without natural language labels.
  • Symbol-tuned models are much stronger at algorithmic reasoning tasks.
  • Finally, symbol-tuned models show large improvements in following flipped labels presented in-context, meaning that they are more capable of using in-context information to override prior knowledge.
An overview of symbol tuning, where models are fine-tuned on tasks where natural language labels are replaced with arbitrary symbols. Symbol tuning relies on the intuition that when instructions and relevant labels are not available, models must use in-context examples to learn the task.

Motivation

Instruction tuning is a common fine-tuning method that has been shown to improve performance and allow models to better follow in-context examples. One shortcoming, however, is that models are not forced to learn to use the examples because the task is redundantly defined in the evaluation example via instructions and natural language labels. For example, on the left in the figure above, although the examples can help the model understand the task (sentiment analysis), they are not strictly necessary since the model could ignore the examples and just read the instruction that indicates what the task is.

In symbol tuning, the model is fine-tuned on examples where the instructions are removed and natural language labels are replaced with semantically-unrelated labels (e.g., “Foo,” “Bar,” etc.). In this setup, the task is unclear without looking at the in-context examples. For example, on the right in the figure above, multiple in-context examples would be needed to figure out the task. Because symbol tuning teaches the model to reason over the in-context examples, symbol-tuned models should have better performance on tasks that require reasoning between in-context examples and their labels.

Datasets and task types used for symbol tuning.

Symbol-tuning procedure

We selected 22 publicly available natural language processing (NLP) datasets for our symbol-tuning procedure. These tasks have been widely used in the past, and we only chose classification-type tasks since our method requires discrete labels. We then remap each natural language label to a random label from a set of ~30K arbitrary labels selected from one of three categories: integers, character combinations, and words.
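
As a rough illustration of this remapping step, the sketch below substitutes arbitrary symbols for a task's natural language labels. The symbol pool, dataset format, and helper name are illustrative assumptions, not the paper's exact implementation.

import random

# Hypothetical symbol pool; the actual procedure samples from roughly 30K symbols
# drawn from integers, character combinations, and words.
ARBITRARY_SYMBOLS = ["foo", "bar", "17", "qxv", "apple", "93", "zlp", "river"]

def symbol_tune(examples, labels):
    """Remap each natural language label to a randomly chosen arbitrary symbol
    and rewrite the (input, label) pairs accordingly. Task instructions are
    simply omitted from the resulting prompt."""
    label_map = dict(zip(labels, random.sample(ARBITRARY_SYMBOLS, len(labels))))
    return [(text, label_map[label]) for text, label in examples]

# Example: a sentiment task whose labels might become "bar"/"17"
# instead of "positive"/"negative".
remapped = symbol_tune(
    [("This movie was great!", "positive"), ("Terrible plot.", "negative")],
    labels=["positive", "negative"],
)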

For our experiments, we symbol tune Flan-PaLM, the instruction-tuned variants of PaLM. We use three different sizes of Flan-PaLM models: Flan-PaLM-8B, Flan-PaLM-62B, and Flan-PaLM-540B. We also tested Flan-cont-PaLM-62B (Flan-PaLM-62B at 1.3T tokens instead of 780B tokens), which we abbreviate as 62B-c.

We use a set of ∼300K arbitrary symbols from three categories (integers, character combinations, and words). ∼30K symbols are used during tuning and the rest are held out for evaluation.

Experimental setup

We want to evaluate a model’s ability to perform unseen tasks, so we cannot evaluate on tasks used in symbol tuning (22 datasets) or used during instruction tuning (1.8K tasks). Hence, we choose 11 NLP datasets that were not used during fine-tuning.


In-context learning

In the symbol-tuning procedure, models must learn to reason with in-context examples in order to successfully perform tasks because prompts are modified to ensure that tasks cannot simply be learned from relevant labels or instructions. Symbol-tuned models should perform better in settings where tasks are unclear and require reasoning between in-context examples and their labels. To explore these settings, we define four in-context learning settings that vary the amount of reasoning required between inputs and labels in order to learn the task (based on the availability of instructions and relevant labels).
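
To make the four settings concrete, here is a minimal sketch of how the evaluation prompts could be assembled by toggling the instruction and the label relevance. The instruction wording, labels, and prompt format are illustrative assumptions, not the exact prompts used in the paper.

def build_prompt(examples, eval_input, instruction=None, relevant_labels=True):
    """Assemble an in-context prompt that optionally drops the task instruction
    and/or replaces relevant natural language labels with arbitrary symbols."""
    label_map = ({"positive": "positive", "negative": "negative"}
                 if relevant_labels else {"positive": "foo", "negative": "bar"})
    lines = [instruction] if instruction else []
    for text, label in examples:
        lines.append(f"Input: {text}\nLabel: {label_map[label]}")
    lines.append(f"Input: {eval_input}\nLabel:")
    return "\n\n".join(lines)

examples = [("A delightful film.", "positive"), ("A waste of time.", "negative")]

# The four settings vary the availability of instructions and relevant labels.
settings = {
    "instruction + relevant labels": dict(instruction="Classify the sentiment.", relevant_labels=True),
    "relevant labels only": dict(instruction=None, relevant_labels=True),
    "instruction only": dict(instruction="Classify the sentiment.", relevant_labels=False),
    "no instruction, no relevant labels": dict(instruction=None, relevant_labels=False),
}
prompts = {name: build_prompt(examples, "I loved it.", **kwargs)
           for name, kwargs in settings.items()}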

Depending on the availability of instructions and relevant natural language labels, models may need to do varying amounts of reasoning with in-context examples. When these features are not available, models must reason with the given in-context examples to successfully perform the task.

Symbol tuning improves performance across all settings for models 62B and larger, with small improvements in settings with relevant natural language labels (+0.8% to +4.2%) and substantial improvements in settings without relevant natural language labels (+5.5% to +15.5%). Strikingly, when relevant labels are unavailable, symbol-tuned Flan-PaLM-8B outperforms Flan-PaLM-62B, and symbol-tuned Flan-PaLM-62B outperforms Flan-PaLM-540B. This performance difference suggests that symbol tuning can allow much smaller models to perform as well as large models on these tasks (effectively saving ∼10X inference compute).

Large-enough symbol-tuned models are better at in-context learning than baselines, especially in settings where relevant labels are not available. Performance is shown as average model accuracy (%) across eleven tasks.

Algorithmic reasoning

We also experiment on algorithmic reasoning tasks from BIG-Bench. There are two main groups of tasks: 1) list functions, in which the model must identify a transformation function (e.g., remove the last element in a list) between input and output lists containing non-negative integers; and 2) simple Turing concepts, in which the model must reason with binary strings to learn the concept that maps an input to an output (e.g., swapping 0s and 1s in a string).
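
For intuition, a list functions episode can be framed as a few-shot prompt of input/output lists from which the model must infer the underlying transformation. The formatting below is an illustrative assumption rather than the exact BIG-Bench prompt.

# A hypothetical "remove the last element" list functions episode.
demonstrations = [
    ([0, 4, 7, 2], [0, 4, 7]),
    ([9, 9, 1], [9, 9]),
    ([3, 5], [3]),
]
query = [6, 2, 8, 8]

prompt = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demonstrations)
prompt += f"\nInput: {query}\nOutput:"  # the model should continue with [6, 2, 8]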

On the list function and simple Turing concept tasks, symbol tuning results in an average performance improvement of 18.2% and 15.3%, respectively. Additionally, Flan-cont-PaLM-62B with symbol tuning outperforms Flan-PaLM-540B on the list function tasks on average, which is equivalent to a ∼10x reduction in inference compute. These improvements suggest that symbol tuning strengthens the model’s ability to learn in-context for unseen task types, as symbol tuning did not include any algorithmic data.

Symbol-tuned models achieve higher performance on list function tasks and simple Turing concept tasks. (A–E): categories of list function tasks. (F): simple Turing concepts task.

Flipped labels

In the flipped-label experiment, labels of in-context and evaluation examples are flipped, meaning that prior knowledge and input-label mappings disagree (e.g., sentences containing positive sentiment labeled as “negative sentiment”), thereby allowing us to study whether models can override prior knowledge. Previous work has shown that while pre-trained models (without instruction tuning) can, to some extent, follow flipped labels presented in-context, instruction tuning degraded this ability.
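
A minimal sketch of how such a flipped-label evaluation could be set up is shown below; the labels and examples are illustrative assumptions. Because both the in-context demonstrations and the evaluation target use the flipped mapping, a model can only score well by overriding its prior knowledge.

FLIP = {"positive": "negative", "negative": "positive"}

def flip_labels(examples):
    """Swap sentiment labels so that the in-context evidence contradicts
    the model's prior knowledge."""
    return [(text, FLIP[label]) for text, label in examples]

in_context = flip_labels([
    ("An absolute masterpiece.", "positive"),  # presented as "negative"
    ("I want my money back.", "negative"),     # presented as "positive"
])

# The evaluation example is scored against the flipped label as well, so
# answering with prior knowledge ("positive") now counts as an error.
eval_example = ("One of the best films this year.", "negative")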

We see a similar trend across all model sizes: symbol-tuned models are much more capable of following flipped labels than instruction-tuned models. We found that after symbol tuning, Flan-PaLM-8B sees an average improvement across all datasets of 26.5%, Flan-PaLM-62B sees an improvement of 33.7%, and Flan-PaLM-540B sees an improvement of 34.0%. Additionally, symbol-tuned models achieve average performance similar to or better than that of pre-training-only models.

Symbol-tuned models are much better at following flipped labels presented in-context than instruction-tuned models are.

Conclusion

We presented symbol tuning, a new method of tuning models on tasks where natural language labels are remapped to arbitrary symbols. Symbol tuning is based on the intuition that when models cannot use instructions or relevant labels to determine a presented task, they must instead do so by learning from the in-context examples. We tuned four language models using our symbol-tuning procedure, utilizing a tuning mixture of 22 datasets and approximately 30K arbitrary symbols as labels.

We first showed that symbol tuning improves performance on unseen in-context learning tasks, especially when prompts do not contain instructions or relevant labels. We also found that symbol-tuned models were much better at algorithmic reasoning tasks, despite the lack of numerical or algorithmic data in the symbol-tuning procedure. Finally, in an in-context learning setting where inputs have flipped labels, symbol tuning (for some datasets) restores the ability to follow flipped labels that was lost during instruction tuning.


Future work

Through symbol tuning, we aim to increase the degree to which models can examine and learn from input–label mappings during in-context learning. We hope that our results encourage further work towards improving language models’ ability to reason over symbols presented in-context.


Acknowledgements

The authors of this post are now part of Google DeepMind. This work was conducted by Jerry Wei, Le Hou, Andrew Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, and Quoc V. Le. We would like to thank our colleagues at Google Research and Google DeepMind for their advice and helpful discussions.

Source: Google AI Blog


Google Dev Library Letters: 21st Edition

Posted by Swathi Dharshna Subbaraj, Google Dev Library

In this newsletter, we highlight the best projects developed with Google technologies that have been contributed to the Google Dev Library platform. We hope this will spark some inspiration for your next project!

Highlights of the Month

In the past two months, we asked contributors to look back, revisit, and update their older Dev Library contributions as a best practice. Most contributors took the time to revise their content and incorporate recent releases. This campaign encourages developers to update their repositories with the latest Google technologies, which is advantageous to users and the broader developer community.

Here are some of the standout up-to-date projects:

  • Sheets Compose Dialogs by Maximilian Keppeler

See how this Android library offers dialogs and views for various use cases, built with Jetpack Compose for Compose projects. All dialogs and views are quick and easy to implement.

 

  • Round Corner Progress Bar by Somkiat Khitwongwattana

Progress Bar Animation
Use this extensive “Rounded Corner progress bar” library for your own Android projects. 

During the campaign, we noticed that some new projects were submitted. Here are some of the new projects from our contributors:

  • Android TV sample projects by Ademir Queiroga

Android TV Project
Explore Android TV sample projects covering the main topics in Android TV development; the project follows Google's best practices and adds a few experience-based insights.
 

  • Storage provisioning with Cloud SQL using Workload Identity by Fermin Blanco

Learn how to create a production ready GKE cluster in a matter of seconds. 

Android


Using Android’s new Credential Manager API by Priya Sindkar
Dive into this blog on how Android's new Credential Manager API provides a seamless way for your app’s users to log in with one-click solutions.  

KStore by Isuru Rajapakse
Learn how this tiny Kotlin Multiplatform library assists in saving and restoring objects to and from disk using kotlinx.coroutines, kotlinx.serialisation, and okio.

DevBricksX by Nan YE
Discover how DevBricksX, a remake and extended version of DevBricks, covers various aspects of daily development, from low-level database tasks to user interface design, eliminating the need for repetitive work.

Dose app by Waseef Akhtar
Learn how Dose, a reminder app for people to take their medications on time, was built using Kotlin and Jetpack Compose with MVVM + clean architecture.  

Compose_adaptive_scaffold by Thomas Künneth
Explore how to write Jetpack Compose apps that support large screens and foldables.  

Cloud


Troubleshooting reachability with a Network Intelligence Center connectivity test by Gaurav Madan
Learn how network troubleshooting becomes crucial when time is of the essence, and how to troubleshoot efficiently.

From data chaos to data insights with Google Cloud and GitLab CI: A cutting-edge solution by Gursimar Singh
Take a look at a streamlined, effective approach to acquire important insights from data and learn how to deal with the turmoil of manual data deployment and analysis easily.  

Machine Learning


Client-side in-decent content checking
Discover a JavaScript library to help you quickly identify unseemly images; all in the client's browser.  

YoloV7 in Tensorflow.js by Hugo Zanini
Learn object detection using YOLOv7 in TensorFlow.js, and how it's trained on the MS COCO dataset to recognize up to 80 different classes.

Flutter


Exploring Inherited Widget: The powerful state management solution by Muhammad Salman
Take a deep dive into the backstory of state management in Flutter and explore one of the most important concepts in Flutter state management, the Inherited Widget.  

Control your Flutter app on the fly with Firebase Remote Config by Mangirdas Kazlauskas
Flutter Forward agenda app
Learn the overview of Firebase Remote Config and how to use it to enable real-time features in your Flutter application.  

The ultimate Flutter Navigator 2.0 series using AutoRoute by Cavin Macwan
Explore the differences between Navigator 1.0 and 2.0 and why you need Navigator 2.0. You’ll also learn how you can implement Navigator 2.0 using the Auto Route package in Flutter.  

Angular


Papanasi (UI library) by Quique Fdez Guerra
Learn to use this frontend UI library across frameworks.  

How to manage complex forms in Angular by Roland Tubongye Wabubindja
See how to save and modify data from a form containing several FormArray.  

Community Updates


🚀 Announcing Google Maps Platform added to Dev Library

Google Maps Platform in Dev Library

Google Maps Platform has now been officially added to the Dev Library! With these resources, developers can create applications that enable them to visualize geospatial data and build projects ranging from hyperlocal logistics to location-driven app development, and have access to even more resources to take their projects to the next level.

Dev Library contributors will be better able to write and create innovative and useful applications that utilize Google’s mapping, places, and routing data and features.

Visit the Google Maps Platform product page in Dev Library



Browse Dev Library | Google Developers Online on Discord | Newsletter Archives

An open-source gymnasium for machine learning assisted computer architecture design

Computer Architecture research has a long history of developing simulators and tools to evaluate and shape the design of computer systems. For example, the SimpleScalar simulator was introduced in the late 1990s and allowed researchers to explore various microarchitectural ideas. Computer architecture simulators and tools, such as gem5, DRAMSys, and many more have played a significant role in advancing computer architecture research. Since then, these shared resources and infrastructure have benefited industry and academia and have enabled researchers to systematically build on each other's work, leading to significant advances in the field.

Nonetheless, computer architecture research is evolving, with industry and academia turning towards machine learning (ML) optimization to meet stringent domain-specific requirements, such as ML for computer architecture, ML for TinyML acceleration, DNN accelerator datapath optimization, memory controllers, power consumption, security, and privacy. Although prior work has demonstrated the benefits of ML in design optimization, the lack of strong, reproducible baselines hinders fair and objective comparison across different methods and poses several challenges to their deployment. To ensure steady progress, it is imperative to understand and tackle these challenges collectively.

To alleviate these challenges, in “ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design”, accepted at ISCA 2023, we introduced ArchGym, which includes a variety of computer architecture simulators and ML algorithms. Enabled by ArchGym, our results indicate that with a sufficiently large number of samples, any of a diverse collection of ML algorithms are capable of finding the optimal set of architecture design parameters for each target problem; no one solution is necessarily better than another. These results further indicate that selecting the optimal hyperparameters for a given ML algorithm is essential for finding the optimal architecture design, but choosing them is non-trivial. We release the code and dataset across multiple computer architecture simulations and ML algorithms.


Challenges in ML-assisted architecture research

ML-assisted architecture research poses several challenges, including:

  1. For a specific ML-assisted computer architecture problem (e.g., finding an optimal solution for a DRAM controller), there is no systematic way to identify optimal ML algorithms or hyperparameters (e.g., learning rate, warm-up steps, etc.). A wide range of ML and heuristic methods, from random walk to reinforcement learning (RL), can be employed for design space exploration (DSE). While these methods have shown noticeable performance improvement over their choice of baselines, it is not evident whether the improvements are because of the choice of optimization algorithms or hyperparameters.

    Thus, to ensure reproducibility and facilitate widespread adoption of ML-aided architecture DSE, it is necessary to outline a systematic benchmarking methodology.

  2. While computer architecture simulators have been the backbone of architectural innovations, there is an emerging need to address the trade-offs between accuracy, speed, and cost in architecture exploration. The accuracy and speed of performance estimation vary widely from one simulator to another, depending on the underlying modeling details (e.g., cycle-accurate vs. ML-based proxy models). While analytical or ML-based proxy models are nimble by virtue of discarding low-level details, they generally suffer from high prediction error. Also, due to commercial licensing, there can be strict limits on the number of runs collected from a simulator. Overall, these constraints exhibit distinct performance vs. sample efficiency trade-offs, affecting the choice of optimization algorithm for architecture exploration.

    It is challenging to delineate how to systematically compare the effectiveness of various ML algorithms under these constraints.

  3. Finally, the landscape of ML algorithms is rapidly evolving and some ML algorithms need data to be useful. Additionally, rendering the outcome of DSE into meaningful artifacts such as datasets is critical for drawing insights about the design space.

    In this rapidly evolving ecosystem, it is consequential to amortize the overhead of search algorithms for architecture exploration. It is neither apparent nor systematically studied how to leverage exploration data while remaining agnostic to the underlying search algorithm.

ArchGym design

ArchGym addresses these challenges by providing a unified framework for evaluating different ML-based search algorithms fairly. It comprises two main components: 1) the ArchGym environment and 2) the ArchGym agent. The environment is an encapsulation of the architecture cost model — which includes latency, throughput, area, energy, etc., to determine the computational cost of running the workload, given a set of architectural parameters — paired with the target workload(s). The agent is an encapsulation of the ML algorithm used for the search and consists of hyperparameters and a guiding policy. The hyperparameters are intrinsic to the algorithm for which the model is to be optimized and can significantly influence performance. The policy, on the other hand, determines how the agent selects a parameter iteratively to optimize the target objective.

Notably, ArchGym also includes a standardized interface that connects these two components, while also saving the exploration data as the ArchGym Dataset. At its core, the interface entails three main signals: hardware state, hardware parameters, and metrics. These signals are the bare minimum to establish a meaningful communication channel between the environment and the agent. Using these signals, the agent observes the state of the hardware and suggests a set of hardware parameters to iteratively optimize a (user-defined) reward. The reward is a function of hardware performance metrics, such as performance, energy consumption, etc. 
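
In spirit, this interface resembles a standard gym loop in which the agent repeatedly proposes parameters and receives metrics and a reward. The sketch below is a toy illustration under assumed names: the classes, the parameter space, and the reward definition are hypothetical placeholders rather than ArchGym's actual API.

import random

class ToyDRAMEnv:
    """Toy stand-in for an ArchGym environment wrapping an architecture cost model."""
    PARAM_SPACE = {"page_policy": ["open", "closed"],
                   "request_buffer_size": [8, 16, 32, 64]}

    def reset(self):
        return {"params": None, "metrics": None}  # hardware state

    def step(self, params):
        # A real environment would invoke the simulator (e.g., DRAMSys) here.
        metrics = {"latency": random.uniform(1.0, 10.0),
                   "energy": random.uniform(0.1, 1.0)}
        reward = 1.0 / (metrics["latency"] * metrics["energy"])  # user-defined reward
        return {"params": params, "metrics": metrics}, reward

class RandomWalkAgent:
    """Toy agent whose policy proposes random hardware parameters each step."""
    def suggest(self, state):
        return {k: random.choice(v) for k, v in ToyDRAMEnv.PARAM_SPACE.items()}

env, agent = ToyDRAMEnv(), RandomWalkAgent()
state, best = env.reset(), (None, float("-inf"))
for _ in range(100):
    params = agent.suggest(state)
    state, reward = env.step(params)
    if reward > best[1]:
        best = (params, reward)  # track the best design found so far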

ArchGym comprises two main components: the ArchGym environment and the ArchGym agent. The ArchGym environment encapsulates the cost model and the agent is an abstraction of a policy and hyperparameters. With a standardized interface that connects these two components, ArchGym provides a unified framework for evaluating different ML-based search algorithms fairly while also saving the exploration data as the ArchGym Dataset.

ML algorithms could be equally favorable to meet user-defined target specifications

Using ArchGym, we empirically demonstrate that across different optimization objectives and DSE problems, at least one set of hyperparameters exists that results in the same hardware performance as other ML algorithms. A poorly selected (random selection) hyperparameter for the ML algorithm or its baseline can lead to a misleading conclusion that a particular family of ML algorithms is better than another. We show that with sufficient hyperparameter tuning, different search algorithms, even random walk (RW), are able to identify the best possible reward. However, note that finding the right set of hyperparameters may require exhaustive search or even luck to make it competitive.

With a sufficient number of samples, there exists at least one set of hyperparameters that results in the same performance across a range of search algorithms. Here the dashed line represents the maximum normalized reward. Cloud-1, cloud-2, stream, and random indicate four different memory traces for DRAMSys (DRAM subsystem design space exploration framework).

Dataset construction and high-fidelity proxy model training

Creating a unified interface using ArchGym also enables the creation of datasets that can be used to design better data-driven ML-based proxy architecture cost models to improve the speed of architecture simulation. To evaluate the benefits of datasets in building an ML model to approximate architecture cost, we leverage ArchGym’s ability to log the data from each DRAMSys run to create four dataset variants, each with a different number of data points. For each variant, we create two categories: (a) Diverse Dataset, which represents the data collected from different agents (ACO, GA, RW, and BO), and (b) ACO only, which shows the data collected exclusively from the ACO agent, both of which are released along with ArchGym. We train a proxy model on each dataset using random forest regression with the objective of predicting the latency of designs for a DRAM simulator (a minimal sketch of this step follows the results below). Our results show that:

  1. As we increase the dataset size, the average normalized root mean squared error (RMSE) slightly decreases.
  2. However, as we introduce diversity in the dataset (e.g., collecting data from different agents), we observe 9× to 42× lower RMSE across different dataset sizes.
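
As a rough illustration of the proxy-model step referenced above, the sketch below fits a random forest regressor to logged (parameters, latency) pairs and reports a normalized RMSE. The synthetic data, feature count, and latency formula are assumptions for illustration only, not the released ArchGym dataset.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical logged exploration data: rows of architecture parameters and the
# latency reported by the cost model for each configuration.
rng = np.random.default_rng(0)
X = rng.integers(1, 64, size=(5_000, 6)).astype(float)             # 6 assumed DRAM parameters
y = 0.7 * X[:, 0] + X[:, 3] ** 1.2 + rng.normal(0, 2, size=5_000)  # assumed latency model

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
proxy = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, proxy.predict(X_test)))
normalized_rmse = rmse / (y_test.max() - y_test.min())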

Diverse dataset collection across different agents using ArchGym interface.
The impact of a diverse dataset and dataset size on the normalized RMSE.

The need for a community-driven ecosystem for ML-assisted architecture research

While ArchGym is an initial effort towards creating an open-source ecosystem that (1) connects a broad range of search algorithms to computer architecture simulators in a unified and easy-to-extend manner, (2) facilitates research in ML-assisted computer architecture, and (3) forms the scaffold to develop reproducible baselines, there are many open challenges that need community-wide support. Below we outline some of the open challenges in ML-assisted architecture design. Addressing these challenges requires a well-coordinated effort and a community-driven ecosystem.

Key challenges in ML-assisted architecture design.

We call this ecosystem Architecture 2.0. We outline the key challenges and a vision for building an inclusive ecosystem of interdisciplinary researchers to tackle the long-standing open problems in applying ML for computer architecture research. If you are interested in helping shape this ecosystem, please fill out the interest survey.


Conclusion

ArchGym is an open-source gymnasium for ML-assisted architecture DSE that provides a standardized interface that can be readily extended to suit different use cases. Additionally, ArchGym enables fair and reproducible comparison between different ML algorithms and helps to establish stronger baselines for computer architecture research problems.

We invite the computer architecture community as well as the ML community to actively participate in the development of ArchGym. We believe that the creation of a gymnasium-type environment for computer architecture research would be a significant step forward in the field and provide a platform for researchers to use ML to accelerate research and lead to new and innovative designs.


Acknowledgements

This blogpost is based on joint work with several co-authors at Google and Harvard University. We would like to acknowledge and highlight Srivatsan Krishnan (Harvard) who contributed several ideas to this project in collaboration with Shvetank Prakash (Harvard), Jason Jabbour (Harvard), Ikechukwu Uchendu (Harvard), Susobhan Ghosh (Harvard), Behzad Boroujerdian (Harvard), Daniel Richins (Harvard), Devashree Tripathy (Harvard), and Thierry Thambe (Harvard).  In addition, we would also like to thank James Laudon, Douglas Eck, Cliff Young, and Aleksandra Faust for their support, feedback, and motivation for this work. We would also like to thank John Guilyard for the animated figure used in this post. Amir Yazdanbakhsh is now a Research Scientist at Google DeepMind and Vijay Janapa Reddi is an Associate Professor at Harvard.




Source: Google AI Blog


MediaPipe: Enhancing Virtual Humans to be more realistic

A guest post by the XR Development team at KDDI & Alpha-U

Please note that the information, uses, and applications expressed in the below post are solely those of our guest author, KDDI.

AI generated rendering of virtual human ‘Metako’
KDDI is integrating text-to-speech & Cloud Rendering to virtual human ‘Metako’

VTubers, or virtual YouTubers, are online entertainers who use a virtual avatar generated using computer graphics. This digital trend originated in Japan in the mid-2010s, and has become an international online phenomenon. A majority of VTubers are English and Japanese-speaking YouTubers or live streamers who use avatar designs.

KDDI, a telecommunications operator in Japan with over 40 million customers, wanted to experiment with various technologies built on its 5G network but found that getting accurate movements and human-like facial expressions in real-time was challenging.


Creating virtual humans in real-time

Announced at Google I/O 2023 in May, the MediaPipe Face Landmarker solution detects facial landmarks and outputs blendshape scores to render a 3D face model that matches the user. With the MediaPipe Face Landmarker solution, KDDI and the Google Partner Innovation team successfully brought realism to their avatars.


Technical Implementation

Using MediaPipe's powerful and efficient Python package, KDDI developers were able to detect the performer’s facial features and extract 52 blendshapes in real time.

import time

import mediapipe as mp
from mediapipe.tasks import python as mp_python

MP_TASK_FILE = "face_landmarker_with_blendshapes.task"

class FaceMeshDetector:

    def __init__(self):
        # Load the Face Landmarker task bundle and configure it for live-stream
        # input with blendshape output and an asynchronous result callback.
        with open(MP_TASK_FILE, mode="rb") as f:
            f_buffer = f.read()
        base_options = mp_python.BaseOptions(model_asset_buffer=f_buffer)
        options = mp_python.vision.FaceLandmarkerOptions(
            base_options=base_options,
            output_face_blendshapes=True,
            output_facial_transformation_matrixes=True,
            running_mode=mp.tasks.vision.RunningMode.LIVE_STREAM,
            num_faces=1,
            result_callback=self.mp_callback)
        self.model = mp_python.vision.FaceLandmarker.create_from_options(
            options)

        self.landmarks = None
        self.blendshapes = None
        self.latest_time_ms = 0

    def mp_callback(self, mp_result, output_image, timestamp_ms: int):
        # Store the latest landmarks and blendshape scores for the first face.
        if len(mp_result.face_landmarks) >= 1 and len(
                mp_result.face_blendshapes) >= 1:
            self.landmarks = mp_result.face_landmarks[0]
            self.blendshapes = [b.score for b in mp_result.face_blendshapes[0]]

    def update(self, frame):
        # Submit a new frame for asynchronous detection, timestamped in ms.
        t_ms = int(time.time() * 1000)
        if t_ms <= self.latest_time_ms:
            return

        frame_mp = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame)
        self.model.detect_async(frame_mp, t_ms)
        self.latest_time_ms = t_ms

    def get_results(self):
        return self.landmarks, self.blendshapes

The Firebase Realtime Database stores a collection of 52 blendshape float values. Each row corresponds to a specific blendshape, listed in order.

_neutral, browDownLeft, browDownRight, browInnerUp, browOuterUpLeft, ...

These blendshape values are continuously updated in real-time as the camera is open and the FaceMesh model is running. With each frame, the database reflects the latest blendshape values, capturing the dynamic changes in facial expressions as detected by the FaceMesh model.

Screenshot of realtime Database

After extracting the blendshapes data, the next step involves transmitting it to the Firebase Realtime Database. Leveraging this advanced database system ensures a seamless flow of real-time data to the clients, eliminating concerns about server scalability and enabling KDDI to focus on delivering a streamlined user experience.

import concurrent.futures
import time

import cv2
import firebase_admin
import mediapipe as mp
import numpy as np
from firebase_admin import credentials, db

# Thread pool for non-blocking writes to the Firebase Realtime Database.
pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

cred = credentials.Certificate('your-certificate.json')
firebase_admin.initialize_app(
    cred, {
        'databaseURL': 'https://your-project.firebasedatabase.app/'
    })
ref = db.reference('projects/1234/blendshapes')

def main():
    facemesh_detector = FaceMeshDetector()
    cap = cv2.VideoCapture(0)

    while True:
        ret, frame = cap.read()

        # Run face landmark detection on the latest camera frame.
        facemesh_detector.update(frame)
        landmarks, blendshapes = facemesh_detector.get_results()
        if (landmarks is None) or (blendshapes is None):
            continue

        # Push the 52 blendshape scores to the database without blocking the loop.
        blendshapes_dict = {k: v for k, v in enumerate(blendshapes)}
        exe = pool.submit(ref.set, blendshapes_dict)

        cv2.imshow('frame', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()
    exit()

 

To continue the progress, developers seamlessly transmit the blendshapes data from the Firebase Realtime Database to Google Cloud's Immersive Stream for XR instances in real time. Google Cloud’s Immersive Stream for XR is a managed service that runs Unreal Engine projects in the cloud, then renders and streams immersive photorealistic 3D and Augmented Reality (AR) experiences to smartphones and browsers in real time.

This integration enables KDDI to drive character face animation and achieve real-time streaming of facial animation with minimal latency, ensuring an immersive user experience.

Illustrative example of how KDDI transmits data from the Firebase Realtime Database to Google Cloud Immersive Stream for XR in real time to render and stream photorealistic 3D and AR experiences like character face animation with minimal latency

On the Unreal Engine side, run by Immersive Stream for XR, we use the Firebase C++ SDK to seamlessly receive data from Firebase. By establishing a database listener, we can instantly retrieve blendshape values as soon as updates occur in the Firebase Realtime Database table. This integration allows for real-time access to the latest blendshape data, enabling dynamic and responsive facial animation in Unreal Engine projects.

Screenshot of Modify Curve node in use in Unreal Engine

After retrieving blendshape values from the Firebase SDK, we can drive the face animation in Unreal Engine by using the "Modify Curve" node in the animation blueprint. Each blendshape value is assigned to the character individually on every frame, allowing for precise and real-time control over the character's facial expressions.

Flowchart demonstrating how BlendshapesReceiver handles the database connection, authentication, and continuous data reception

An effective approach for implementing a realtime database listener in Unreal Engine is to utilize the GameInstance Subsystem, which serves as an alternative singleton pattern. This allows for the creation of a dedicated BlendshapesReceiver instance responsible for handling the database connection, authentication, and continuous data reception in the background.

By leveraging the GameInstance Subsystem, the BlendshapesReceiver instance can be instantiated and maintained throughout the lifespan of the game session. This ensures a persistent database connection while the animation blueprint reads and drives the face animation using the received blendshape data.

Using just a local PC running MediaPipe, KDDI succeeded in capturing the real performer’s facial expression and movement, and created high-quality 3D re-target animation in real time.

Flow chart showing how a real performer's facial expression and movement being captured and run through MediaPipe on a Local PC, and the high quality 3D re-target animation being rendered in real time by KDDI
      

KDDI is collaborating with developers of Metaverse anime fashion like Adastria Co., Ltd.


Getting started

To learn more, watch Google I/O 2023 sessions: Easy on-device ML with MediaPipe, Supercharge your web app with machine learning and MediaPipe, What's new in machine learning, and check out the official documentation over on developers.google.com/mediapipe.


What’s next?

This MediaPipe integration is one example of how KDDI is eliminating the boundary between the real and virtual worlds, allowing users to enjoy everyday experiences such as attending live music performances, enjoying art, having conversations with friends, and shopping―anytime, anywhere. 

KDDI’s αU provides services for the Web3 era, including the metaverse, live streaming, and virtual shopping, shaping an ecosystem where anyone can become a creator, supporting the new generation of users who effortlessly move between the real and virtual worlds.

Developers Share How They Built Their Careers: From Machine Learning to Cloud

Posted by Lyanne Alfaro, DevRel Program Manager, Google Developer Studio

Google Developer Student Club Alums Reflect On Their Journey To Google Developer Experts

Developer Journey is a monthly series highlighting diverse and global developers sharing relatable challenges, opportunities, and wins in their journey. Every month, we will spotlight developers around the world, the Google tools they leverage, and the kind of products they are building.

This month, we spoke with several Google Developer Experts to learn more about their path from being Google Developer Student Clubs leads to connoisseurs of their craft.


Suvaditya Mukherjee

Headshot of Suvaditya Mukherjee smiling
Mumbai, Maharashtra, India
Google Developer Expert, Machine Learning
Google Summer of Code Org Admin + ML Research Engineer Intern at Ivy
Research Intern at IIIT-Hyderabad

What are some key skills and knowledge you gained as a Google Developer Student Clubs Lead that helped you excel in your role as a Google Developer Expert?

Every day I spent as a lead was a learning experience, but what stood out to me was the holistic learning opportunities that the program brought. For example, as someone specializing in AI, I never found a need to learn Web Development until I had to help audit and create complex web apps for hosting competitions. Additionally, I learned how to absorb newer technical skills as quickly as possible, which proved to be incredibly valuable over time. I also learned the importance of soft skills, which helped me communicate better with my community. As an expert, it’s important to steward your community, and the leadership skills imparted by the program helped me build a deeper understanding of communication, logistics, and team-building.

What has been the impact of being part of the Google Developer Student Clubs community on your personal and professional growth?

As a Google Developer Student Clubs (GDSC) Lead, I benefited from participating in networking opportunities with like-minded folks and potential mentors who helped immensely in my journey. They helped shape my technical skills, and improve my soft skills. I also had the opportunity to speak in front of large crowds, develop content, manage teams, and closely understand what makes a community tick. As a GDE, it becomes important to have a pulse on the community's needs and requirements. The GDSC Program taught me how to measure these metrics at a grassroots level. I have had the privilege of working with the most skilled, dedicated, professional – and most importantly – humble folks as part of the GDSC Community. The program allowed me the privilege of communicating and building friendships with awesome people over time.

What Google tools have you used to build?

I have used quite a few Google tools in different projects and endeavors, including but not limited to Firebase, Flutter, and Android for hackathons. I have also made use of the Google Cloud Platform to develop and host scalable backend infrastructures during projects and internships in different places. But my most used tool is TensorFlow.

Which tool has been your favorite? Why?

As an ML Practitioner, TensorFlow and Keras have been a boon to simplify days of work into potentially hours or even minutes. The power it delivers to end-users in the most open and democratic way possible while constantly innovating for newer advances is something I have always appreciated. One of the biggest reasons I love Keras has to be the awesome community around it that welcomes everyone with open arms.

Tell us about something you've built in the past using Google tools.

I have hacked around a few projects over time. The most notable among them was an application I personally call TranscribeMate. Imagine you’re in an ongoing lecture and the professor is going quicker than usual, hindering your ability to take notes. TranscribeMate (built with Flutter, Firebase, and MLKit) allows you to use OCR technology to transcribe notes from simple photos of the classroom blackboard, add annotations as a note-taking application, and save them for later use. This was an application I developed for a college course, but I ended up tweaking it a bit more and making use of it on my personal device for more general tasks as well.

What will you create with Google Bard?

I have been using Bard for a while now; it has a permanent home on my browser. Bard helps me with random questions I have, and Python-related problems. Bard has helped me find solutions in seconds, compared to hours of work when done through traditional search methods. I have been using Bard's help on several projects I have been working on within my research, in projects at Ivy, and the Keras Team. Stay tuned for what comes next!

What advice would you give someone starting in their developer journey?

Seek new experiences to learn. No one can learn by working within a narrow niche. Having a working knowledge of different technologies at once allows you to have a diverse and multi-faceted approach to problem-solving. Optimizations in your systems become far more apparent, and you slowly end up learning how to write better code and design scalable systems with ease. Lastly, find a community. Find like-minded folks, talk to them, share notes on what you're building, and if you find yourself too shy to do so, then try anyway. Start by just showing up for one event near you. Then make it two. Then ask a question. The power of collaborative learning is immeasurable.


Veronica Putri Anggraini

Headshot of Veronica Putri Anggraini, smiling
Jakarta, Indonesia
Google Developer Expert, Android
GDSC Semarang State Polytechnic Lead Alumni (2017)
Google Developer Group
Women Techmakers Ambassador
Software Engineer Android, @ eWIDEPLUS

What are some key skills and knowledge you gained as a Google Developer Student Clubs Lead that helped you excel in your role as a Google Developer Expert?

Through GDSC, I learned a lot about Android technology, practiced building Android projects, and ran workshops for our members every week. This process improved my technical, writing, problem-solving, and public speaking skills at the same time. I started presenting as a student with a small group workshop of 5-10 people and grew to speaking in front of 1,000 people. This was also one of the necessary criteria to become a GDE.

Can you share some insights on the impact of being part of the Google Developer Student Clubs community on your personal and professional growth?

Exploring different resources while I was a student helped me develop sample app portfolios. I feel like I actually started my professional career as a curriculum developer and trainer for mobile development. I got an offer when I was a speaker at a tech event that discussed Android technology through the GDSC program. In fact, the CEO immediately offered the position after the event ended.

What Google tools have you used to build?

I have done a lot of exploration with Jetpack Compose. I currently work closely with CameraX, the AndroidX libraries, Google Analytics, and the Maps API.

Which tool has been your favorite? Why?

CameraX is one of my favorites, because it automatically manages camera resources and avoids unnecessary background work, so I got better performance.

Tell us about something you've built in the past using Google tools.

At my current company, we build a digital bank app product natively. It lets users complete a verified onboarding process with Liveness, use QRPay, receive personalized promo campaigns, and access other financial services that we build using Google tools.

What advice would you give someone starting in their developer journey?

Gain experience in dealing with issues in the stack that serve as a focus. Be consistent in learning, and don't give up easily when stuck. In other words, be the person that says: "Challenge Accepted".

You should know that learning together is more fun than learning alone, so join the community and learn everything you need and extend your network.


Anubhav Singh

Headshot of Anubhav Singh, smiling
Prayagraj, India
Google Developer Expert, Firebase
GDSC NSEC Kolkata Lead Alumni (2019-20)
GDG Cloud Kolkata Organizer & TFUG Kolkata Co-Organizer
Co-founder, Dynopii

What are some key skills and knowledge you gained as a Google Developer Student Clubs Lead that helped you excel in your role as a Google Developer Expert?

A major part of being a Google Developer Student Clubs Lead was to enable growth for those around me by learning together. I would often find myself guiding club members on various fronts – sometimes by taking knowledge-sharing sessions on technical topics, sometimes by diving deep into their projects’ code to help them overcome challenges they were facing and sometimes creating videos or written content for them to be able to follow along later.

Through partaking in these activities, I learned public speaking skills, mentoring, and how to be helpful to others experiencing roadblocks. These skills have proved important in my role as a Google Developer Expert.

What has been the impact of being part of the Google Developer Student Clubs community on your personal and professional growth?

Being a GDSC Lead helped me further steer teams with the same passion I have for building communities. As a GDSC Lead, you get to connect with a lot of amazing people. The community itself is highly diverse and vibrant. When I was organizing a workshop for the club during my time as a GDSC Lead, I was fortunate to meet two individuals who later became the co-founders of my startup. In that same club, three of our members became Google Developer Experts in the fields of their interest. Thus, being a GDSC Lead has had a very positive impact on both my professional and personal growth.

What Google tools have you used to build?

I’ve been working in the software development field for almost 12 years now and have used several Google tools over the years, including some that no longer exist. Some of the currently available tools that I most often work with are:

  1. Google Cloud Platform: Cloud Run, Cloud Functions, Cloud Firestore, Cloud Workflows, GKE, GCE, App Engine, Vertex AI and other AI based products, etc.
  2. Google Postmaster Tools, Search Console Tools, Analytics, Pagespeed Insights
  3. TensorFlow, Keras
  4. Google Maps API
  5. Firebase
  6. reCaptcha

Which tool has been your favorite? Why?

Firebase, hands down. As someone who loves building solutions that are useful to people, Firebase has been my go-to tool for prototyping solutions and MVPs rapidly. I’ve used it to build some simple tools which have been used by thousands of people over the years - all hosted for free and delivered with blazing speed! Even today, during my sessions as a GDE, I always use Firebase to build the UI part of the demo applications I present during the talk.

Tell us about something you've built in the past using Google tools.

I built Fireshort - a URL shortener solution running purely on Firebase. This project is completely open source and has been used by several companies as a base for their in-house URL shortening needs. I’ve been working on the next version of this project at Linkborg.

I’ve also built several real-time updating monitoring products using Firebase and Pub/Sub, mostly for enterprise clients.

As a proof of concept, I also built KolPay, a completely event-driven clone of EasyCard – an RFID-based payment wallet built with Firebase, Pub/Sub, Cloud Firestore, and Cloud Functions, along with hardware components like a Raspberry Pi and an RFID reader and card.

What will you create with Google Bard?

Building with Google Bard is an exciting prospect. It will be fun to no longer have to write the repetitive parts of code which I need whenever I am setting up a new project or a module within an existing project. Since I spend a lot of my day coding, I will be very happy to automate parts of it and having an AI do that would be amazing!

What advice would you give someone starting in their developer journey?

Starting a developer journey can be a daunting prospect - everyone’s talking about AI and everyone wants to build the next viral thing. If you are new to this field, step back, relax and start building a solution to any problem that has irked you for a long time. While you’re at it - read a lot of tech blogs about solving that problem, become a part of developer communities, either virtual or in person, and meet people who will share their insights about building similar products.


Kartik Derasari

Headshot of Kartik Derasari, smiling
Ahmedabad, Gujarat, India
Google Developer Expert, Google Cloud
GDSC Silver Oak University Lead Alumni (2020-2021)
Google Developers Group Cloud Organizer
Full-Stack Engineer at Persistent

What are some key skills and knowledge you gained as a Google Developer Student Clubs Lead that helped you excel in your role as a Google Developer Expert?

As a GDSC Lead, I’ve had the opportunity to collaborate with Googlers, Google Developer Experts, and Google Developer Groups Community Leads on various projects, which helped me explore different technologies and choose what’s best for me. Knowledge sharing and public speaking are what I learned from the Google Developer Experts. Since then, I have been on a journey as a Technical Speaker, sharing my learnings on Machine Learning & TensorFlow, Web, Firebase, and Google Cloud. I also had the opportunity to share my learnings at conferences like DevFest, Google Cloud Community Days, and GDSC WOW. These are some of the experiences that helped shape me as a Google Developer Expert and excel in my journey.

Can you share some insights on the impact of being part of the Google Developer Student Clubs community on your personal and professional growth?

Being a GDSC Lead had a positive impact on my personal and professional journey. I came in touch with the tech community and learned about the Google Developer Groups & Google Developer Experts programs. I started volunteering for the GDG Cloud Ahmedabad chapter during my GDSC tenure, and later I became one of the Community Organizers. I also started collaborating with Google Developer Experts on Web, Firebase, and Machine Learning projects and made some open-source contributions.

Everyone from the community was so welcoming and helpful. I’d highly recommend everyone join these developer programs by Google and get the most out of them. I also received mentorship from GDG Community Leads and Google Developer Experts for my professional career. They helped me connect with the right set of people and guided me to kick-start my professional career with MediaAgility, which is part of the Google Cloud Partner ecosystem. Since then, I have been working on Web & Google Cloud in both my professional and personal capacity.

I was motivated by the Google Cloud ecosystem in India and cleared six Google Cloud Certifications, which had a huge impact on my personal and professional growth.

What Google tools have you used to build?

I started using Firebase as a Web Engineer. It has been very helpful when it comes to adding Authentication, storing application data in Firestore, and hosting web-app front-end static files over a CDN using Firebase Hosting. While building a set of web apps, I started exploring Machine Learning and used TensorFlow for building ML models for different use cases. Since then, I started using Google Cloud ML APIs and Cloud Functions for adding more functionalities to my web apps.

While working on these projects, I came across the Google Cloud Partner ecosystem and joined MediaAgility (now part of Persistent Systems) as a Full-Stack Engineer. Since then, I have been working on Google Cloud with Google Cloud PSO and enterprise customers.

Which tool has been your favorite? Why?

Cloud Run is something that I really like as an Application Developer. Since it’s a serverless compute platform, I can spend more time on building my application rather than worrying about my infrastructure. Firebase Authentication, Cloud Firestore, and Cloud Storage are also tools that I really love. They help me create full-stack apps and ship faster to production.

Tell us about something you've built in the past using Google tools. What will you create with Google Bard?

Since we’re in the wave of Generative AI right now, I have been working on building a number of apps using Google Cloud Run, BigQuery, Cloud Storage, Generative AI Studio, Model Garden on Vertex AI, and PaLM models. Recently, I built a chat application interface which provides insights from a structured enterprise data warehouse and unstructured files, along with enterprise-grade data governance and security.

What advice would you give someone starting in their developer journey?

Be a consistent learner and a persistent explorer. It’s great to cultivate a learning habit, which will help you all the way in your personal and professional journey. This will not only help you explore new things, but it will also help you master something that you really love to do. As a beginner, it would be good to start with something that you find interesting, and then you can add a flavor of other things. For example, if you find building web apps interesting, try it. When you think you’re good at it, you can add a flavor of Machine Learning to it. That’s how you explore new things and experiment with what you know.

On-device diffusion plugins for conditioned text-to-image generation

In recent years, diffusion models have shown great success in text-to-image generation, achieving high image quality and improved inference performance, and expanding our creative inspiration. Nevertheless, it is still challenging to efficiently control the generation, especially with conditions that are difficult to describe with text.

Today, we announce MediaPipe diffusion plugins, which enable controllable text-to-image generation to be run on-device. Expanding upon our prior work on GPU inference for on-device large generative models, we introduce new low-cost solutions for controllable text-to-image generation that can be plugged into existing diffusion models and their Low-Rank Adaptation (LoRA) variants.

Text-to-image generation with control plugins running on-device.

Background

With diffusion models, image generation is modeled as an iterative denoising process. Starting from a noise image, at each step the diffusion model gradually denoises the image to reveal an image of the target concept. Research shows that leveraging language understanding via text prompts can greatly improve image generation. For text-to-image generation, the text embedding is connected to the model via cross-attention layers. Yet, some information is difficult to describe by text prompts, e.g., the position and pose of an object. To address this problem, researchers add additional models into the diffusion model to inject control information from a condition image.

Common approaches for controlled text-to-image generation include Plug-and-Play, ControlNet, and T2I Adapter. Plug-and-Play applies a widely used denoising diffusion implicit model (DDIM) inversion approach that reverses the generation process starting from an input image to derive an initial noise input, and then employs a copy of the diffusion model (860M parameters for Stable Diffusion 1.5) to encode the condition from the input image. Plug-and-Play extracts spatial features with self-attention from the copied diffusion model and injects them into the text-to-image diffusion. ControlNet creates a trainable copy of the encoder of a diffusion model, which connects via a convolution layer with zero-initialized parameters to encode conditioning information that is conveyed to the decoder layers. As a result, however, its size is large: half that of the diffusion model (430M parameters for Stable Diffusion 1.5). T2I Adapter is a smaller network (77M parameters) that achieves similar effects in controllable generation. T2I Adapter only takes the condition image as input, and its output is shared across all diffusion iterations. Yet, the adapter model is not designed for portable devices.


The MediaPipe diffusion plugins

To make conditioned generation efficient, customizable, and scalable, we design the MediaPipe diffusion plugin as a separate network that is:

  • Pluggable: It can be easily connected to a pre-trained base model.
  • Trained from scratch: It does not use pre-trained weights from the base model.
  • Portable: It runs outside the base model on mobile devices, with negligible cost compared to the base model inference.
Method              Parameter Size    Pluggable    From Scratch    Portable
Plug-and-Play       860M*             ✔️           –               –
ControlNet          430M*             ✔️           –               –
T2I Adapter         77M               ✔️           ✔️              –
MediaPipe Plugin    6M                ✔️           ✔️              ✔️

Comparison of Plug-and-Play, ControlNet, T2I Adapter, and the MediaPipe diffusion plugin.
* The number varies depending on the particulars of the diffusion model.

The MediaPipe diffusion plugin is a portable on-device model for text-to-image generation. It extracts multiscale features from a conditioning image, which are added to the encoder of a diffusion model at corresponding levels. When connected to a text-to-image diffusion model, the plugin model can provide an extra conditioning signal to the image generation. We design the plugin network to be a lightweight model with only 6M parameters. It uses depth-wise convolutions and inverted bottlenecks from MobileNetV2 for fast inference on mobile devices.

Overview of the MediaPipe diffusion model plugin. The plugin is a separate network, whose output can be plugged into a pre-trained text-to-image generation model. Features extracted by the plugin are applied to the associated downsampling layer of the diffusion model (blue).
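To make the architecture concrete, the following is a minimal sketch, not the released MediaPipe code, of what a plugin-style feature extractor could look like in TensorFlow/Keras. The layer widths, helper names, and the assumption that each feature level matches a downsampling level of the diffusion encoder are illustrative only.

import tensorflow as tf

def inverted_bottleneck(x, out_channels, stride, expansion=4):
    # Expand -> depth-wise 3x3 conv -> linear projection, as in MobileNetV2.
    x = tf.keras.layers.Conv2D(out_channels * expansion, 1, activation=tf.nn.relu6)(x)
    x = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same",
                                        activation=tf.nn.relu6)(x)
    return tf.keras.layers.Conv2D(out_channels, 1)(x)

def build_control_plugin(image_size=512, channels=(32, 64, 128, 256)):
    # Map a conditioning image to one feature map per diffusion-encoder level.
    cond = tf.keras.Input((image_size, image_size, 3))
    x, features = cond, []
    for ch in channels:
        x = inverted_bottleneck(x, ch, stride=2)   # halve resolution at each level
        features.append(x)                         # added to the matching encoder level
    return tf.keras.Model(cond, features)

plugin = build_control_plugin()   # small standalone network, trained from scratch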

Unlike ControlNet, we inject the same control features in all diffusion iterations. That is, we only run the plugin once for one image generation, which saves computation. We illustrate some intermediate results of a diffusion process below. The control is effective at every diffusion step and enables controlled generation even at early steps. More iterations improve the alignment of the image with the text prompt and generate more detail.

Illustration of the generation process using the MediaPipe diffusion plugin.
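The sketch below, in hedged pseudocode rather than the actual MediaPipe implementation, illustrates this scheduling difference: the plugin is invoked once per image and its features are reused at every denoising step. The names plugin, denoiser, and scheduler_step are hypothetical stand-ins for the real components.

import numpy as np

def generate(prompt_embedding, condition_image, plugin, denoiser, scheduler_step,
             num_steps=20, latent_shape=(64, 64, 4), seed=0):
    control_features = plugin(condition_image)   # one plugin call per generated image
    latent = np.random.default_rng(seed).standard_normal(latent_shape)  # start from noise
    for t in reversed(range(num_steps)):
        # The same control features are injected at every iteration; ControlNet,
        # by contrast, re-runs its control network at each step t.
        noise_estimate = denoiser(latent, t, prompt_embedding, control_features)
        latent = scheduler_step(latent, noise_estimate, t)
    return latent   # decoded into the final image by the VAE decoder elsewhere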

Examples

In this work, we developed plugins for a diffusion-based text-to-image generation model with MediaPipe Face Landmark, MediaPipe Holistic Landmark, depth maps, and Canny edge. For each task, we select about 100K images from a web-scale image-text dataset, and compute control signals using corresponding MediaPipe solutions. We use refined captions from PaLI for training the plugins.


Face Landmark

The MediaPipe Face Landmarker task computes 478 landmarks (with attention) of a human face. We use the drawing utilities in MediaPipe to render a face, including the face contour, mouth, eyes, eyebrows, and irises, with different colors. The following table shows randomly generated samples conditioned on the face mesh and prompts. As a comparison, both ControlNet and the plugin can control text-to-image generation with the given conditions.

Face-landmark plugin for text-to-image generation, compared with ControlNet.

Holistic Landmark

The MediaPipe Holistic Landmarker task includes landmarks of body pose, hands, and face mesh. Below, we generate various stylized images by conditioning on the holistic features.

Holistic-landmark plugin for text-to-image generation.

Depth

Depth-plugin for text-to-image generation.

Canny Edge

Canny-edge plugin for text-to-image generation.

Evaluation

We conduct a quantitative study of the face landmark plugin to demonstrate the model's performance. The evaluation dataset contains 5K human images. We compare the generation quality as measured by the widely used metrics, Fréchet Inception Distance (FID) and CLIP scores. The base model is a pre-trained text-to-image diffusion model. We use Stable Diffusion v1.5 here.

As shown in the following table, both ControlNet and the MediaPipe diffusion plugin produce much better sample quality than the base model, in terms of FID and CLIP scores. Unlike ControlNet, which needs to run at every diffusion step, the MediaPipe plugin only runs once for each image generated. We measured the performance of the three models on a server machine (with Nvidia V100 GPU) and a mobile phone (Galaxy S23). On the server, we run all three models with 50 diffusion steps, and on mobile, we run 20 diffusion steps using the MediaPipe image generation app. Compared with ControlNet, the MediaPipe plugin shows a clear advantage in inference efficiency while preserving the sample quality.

Model                      FID↓      CLIP↑     Inference Time (s)
                                               Nvidia V100       Galaxy S23
Base                       10.32     0.26      5.0               11.5
Base + ControlNet          6.51      0.31      7.4 (+48%)        18.2 (+58.3%)
Base + MediaPipe Plugin    6.50      0.30      5.0 (+0.2%)       11.8 (+2.6%)

Quantitative comparison on FID, CLIP, and inference time.

We test the performance of the plugin on a wide range of mobile devices from mid-tier to high-end. We list the results on some representative devices in the following table, covering both Android and iOS.

Device        Android                                           iOS
              Pixel 4    Pixel 6    Pixel 7    Galaxy S23       iPhone 12 Pro    iPhone 13 Pro
Time (ms)     128        68         50         48               73               63

Inference time (ms) of the plugin on different mobile devices.

Conclusion

In this work, we present the MediaPipe diffusion plugin, a portable solution for conditioned text-to-image generation. It injects features extracted from a condition image into a diffusion model, and consequently controls the image generation. Portable plugins can be connected to pre-trained diffusion models running on servers or devices. By running text-to-image generation and plugins fully on-device, we enable more flexible applications of generative AI.


Acknowledgments

We’d like to thank all team members who contributed to this work: Raman Sarokin and Juhyun Lee for the GPU inference solution; Khanh LeViet, Chuo-Ling Chang, Andrei Kulik, and Matthias Grundmann for leadership. Special thanks to Jiuqiang Tang, Joe Zou, and Lu Wang, who made this technology and all the demos run on-device.

Source: Google AI Blog


Champion Innovator Elyes Manai, based in Quebec City, Quebec, Canada

Posted by Max Saltonstall, Developer Relations Engineer

In this ongoing interview series we sit down with Google Cloud Champion Innovators across the world to learn more about their journeys, their technology focus, and what excites them. Today we're talking to Elyes Manai. Elyes is a Machine Learning Consultant, Educator and Mentor. He helps companies tap into the power of data science to reduce costs and increase revenue as well as build relationships with relevant target audiences through educational content and community building.


What is a Champion Innovator?

Champion Innovators are a global network of more than 500 non-Google professionals who are technical experts in Google Cloud products and services. Each Champion specializes in one of nine technical categories: cloud AI/ML, data analytics, hybrid multi-cloud, modern architecture, security and networking, serverless app development, storage, Workspace, and databases.


What tech area has you most fascinated right now, and why?

Machine Learning: There are so many new insights we can gain from applying ML and AI to problems right now. Especially in security. I'm currently pursuing my PhD in AI applied to Cybersecurity, and am eager to teach the next generation about computer science, AI and security.

I fell into ML by accident, after trying to pursue something else in university. I had hoped to study architecture, but did not do nearly well enough in high school (in Tunisia, where I'm from). I ended up at my last choice of universities, in an IT program. And then I tried to transfer to an architecture school, but my paperwork got messed up so it didn't work out.

There I was, in a field I had not chosen, and yet I liked it. It felt pretty easy to do, I got good grades, and I realized I could make a career out of it. I liked solving problems with code, and progressed to doing web development and managing a team. From there I started thinking about what I wanted to do next.

I really love teaching, so I began looking into how to become a professor. That led me to the CS50 computer science class at Harvard, where I saw many signs pointing to a big AI trend, and so I decided to pursue a master's in computer science.


How do you like to learn new services, tools, and applications?

I dive right in; learn by doing. I frequently bounce around between subjects. I keep a list of ideas that come to me, and then when I'm ready for something new, I just scan through the list and pick one. This helps me stay fresh and excited.

Whenever I'm learning new skills I remind myself to go with the flow. I start small, learn just enough to start using the technology or tool. I'll ask myself:

  • What key concepts or pillars do I need to understand this more deeply?
  • How do I branch out from there?
  • Who should I talk to?
  • What can I make?

Since I'm in the middle of a doctoral program right now, I always challenge myself to make that idea somehow connect to my research, so I can bring it back to a common theme that's pervasive through all my work.


What are some exciting projects you have in flight right now?

Explainable AI, especially applied to less frequently used spoken languages in the world. We have a wealth of research on English language AI models, but what about applying BERT (and other technologies) on some lesser used languages, to expand the benefit to a wider population?

I'm also very excited about how we (as researchers) can optimize AI models to be more secure, be more private in terms of protecting our data, and be more useful to a wide variety of use cases.


What engages you outside of the technology world?

I love biking, and whenever it's warm enough in Québec I will go bike outside.

I like to read, especially science fiction. I recently started reading autobiographies to learn more about the world from different perspectives. I'm currently reading the autobiographies of Scott Kelly and Sohaila Abdulali.

I also keep a big list of ideas outside of tech for me to pursue: people to meet, foods to try, places to go. I'm always working on new experiences and adventures from that list, to broaden my world and learn more about what's all around me.


What brought you into the Innovators program?

I've been a Google Developer Expert (GDE) for two years and then got an invitation to join the Innovators program, after I attended a GDE event. It's helped me gain some respect and credibility, as I have a little bit of Google's reputation behind my voice now when I share my perspective or opinion. Also they have helped me get some great swag!


What's one thing our readers should do next?

Very few things stand the test of time, as our industry is shifting so quickly. I think CS50 on YouTube still has relevance for folks new to computer science.

I also want to encourage people to create social connections, and go meet the people behind the systems you are using. There are humans out there who can help you find the next project or position, and getting to know them is so important.


Each Champion Innovator is not affiliated with Google nor do they offer services on behalf of Google.

Preference learning with automated feedback for cache eviction

Caching is a ubiquitous idea in computer science that significantly improves the performance of storage and retrieval systems by storing a subset of popular items closer to the client based on request patterns. An important algorithmic piece of cache management is the decision policy used for dynamically updating the set of items being stored, which has been extensively optimized over several decades, resulting in several efficient and robust heuristics. While applying machine learning to cache policies has shown promising results in recent years (e.g., LRB, LHD, storage applications), it remains a challenge to outperform robust heuristics in a way that can generalize reliably beyond benchmarks to production settings, while maintaining competitive compute and memory overheads.

In “HALP: Heuristic Aided Learned Preference Eviction Policy for YouTube Content Delivery Network”, presented at NSDI 2023, we introduce a scalable state-of-the-art cache eviction framework that is based on learned rewards and uses preference learning with automated feedback. The Heuristic Aided Learned Preference (HALP) framework is a meta-algorithm that uses randomization to merge a lightweight heuristic baseline eviction rule with a learned reward model. The reward model is a lightweight neural network that is continuously trained with ongoing automated feedback on preference comparisons designed to mimic the offline oracle. We discuss how HALP has improved infrastructure efficiency and user video playback latency for YouTube’s content delivery network.


Learned preferences for cache eviction decisions

The HALP framework computes cache eviction decisions based on two components: (1) a neural reward model trained with automated feedback via preference learning, and (2) a meta-algorithm that combines a learned reward model with a fast heuristic. As the cache observes incoming requests, HALP continuously trains a small neural network that predicts a scalar reward for each item by formulating this as a preference learning method via pairwise preference feedback. This aspect of HALP is similar to reinforcement learning from human feedback (RLHF) systems, but with two important distinctions:

  • Feedback is automated and leverages well-known results about the structure of offline optimal cache eviction policies.
  • The model is learned continuously using a transient buffer of training examples constructed from the automated feedback process.

The eviction decisions rely on a filtering mechanism with two steps. First, a small subset of candidates is selected using a heuristic that is efficient, but suboptimal in terms of performance. Then, a re-ranking step makes sparing use of a neural network scoring function to choose among the baseline candidates and “boost” the quality of the final decision.

As a production-ready cache policy implementation, HALP not only makes eviction decisions, but also subsumes the end-to-end process of sampling pairwise preference queries used to efficiently construct relevant feedback and update the model to power eviction decisions.


A neural reward model

HALP uses a lightweight two-layer multilayer perceptron (MLP) as its reward model to selectively score individual items in the cache. The features are constructed and managed as a metadata-only “ghost cache” (similar to classical policies like ARC). After any given lookup request, in addition to regular cache operations, HALP conducts the bookkeeping (e.g., tracking and updating feature metadata in a capacity-constrained key-value store) needed to update its dynamic internal representation. This includes: (1) externally tagged features provided by the user as input along with a cache lookup request, and (2) internally constructed dynamic features (e.g., time since last access, average time between accesses) derived from the lookup times observed on each item.
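As a rough illustration of this bookkeeping, the sketch below maintains per-item dynamic features in a capacity-bounded, metadata-only store. The class and field names are hypothetical; the production feature set and data structures are not published here.

import time
from collections import OrderedDict
from dataclasses import dataclass, field

@dataclass
class GhostEntry:
    external_features: dict                     # tags supplied with the lookup request
    last_access: float = field(default_factory=time.monotonic)
    access_count: int = 1
    avg_gap: float = 0.0                        # running mean time between accesses

class GhostCache:
    # Capacity-constrained key-value store holding feature metadata only.
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def on_lookup(self, key, external_features):
        now = time.monotonic()
        entry = self.entries.pop(key, None)
        if entry is None:
            entry = GhostEntry(dict(external_features))
        else:
            gap = now - entry.last_access
            entry.avg_gap += (gap - entry.avg_gap) / entry.access_count
            entry.last_access = now
            entry.access_count += 1
        self.entries[key] = entry                # move to most recently used position
        if len(self.entries) > self.capacity:    # drop the stalest metadata entry
            self.entries.popitem(last=False)
        return entry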

HALP learns its reward model fully online, starting from a random weight initialization. This might seem like a bad idea, especially if eviction decisions were made exclusively by the reward model. However, the eviction decisions rely on both the learned reward model and a suboptimal but simple and robust heuristic like LRU. This allows for optimal performance when the reward model has fully generalized, while remaining robust to a temporarily uninformative reward model that has yet to generalize or is catching up to a changing environment.

Another advantage of online training is specialization. Each cache server runs in a potentially different environment (e.g., geographic location), which influences local network conditions and what content is locally popular, among other things. Online training automatically captures this information while reducing the burden of generalization, as opposed to a single offline training solution.


Scoring samples from a randomized priority queue

It can be impractical to optimize for the quality of eviction decisions with an exclusively learned objective for two reasons.

  1. Compute efficiency constraints: Inference with a learned network can be significantly more expensive than the computations performed in practical cache policies operating at scale. This limits not only the expressivity of the network and features, but also how often these are invoked during each eviction decision.
  2. Robustness for generalizing out-of-distribution: HALP is deployed in a setup that involves continual learning, where a quickly changing workload can generate request patterns that are temporarily out-of-distribution with respect to previously seen data.

To address these issues, HALP first applies an inexpensive heuristic scoring rule that corresponds to an eviction priority to identify a small candidate sample. This process is based on efficient random sampling that approximates exact priority queues. The priority function for generating candidate samples is intended to be quick to compute using existing manually tuned algorithms, e.g., LRU. However, it is configurable to approximate other cache replacement heuristics by editing a simple cost function. Unlike prior work, where randomization was used to trade off approximation quality for efficiency, HALP also relies on the inherent randomization of the candidates sampled across time steps to provide the necessary exploratory diversity for both training and inference.

The final evicted item is chosen from among the supplied candidates, equivalent to the best-of-n reranked sample, corresponding to maximizing the predicted preference score according to the neural reward model. The same pool of candidates used for eviction decisions is also used to construct the pairwise preference queries for automated feedback, which helps minimize the training and inference skew between samples.

An overview of the two-stage process invoked for each eviction decision.
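A minimal sketch of this two-stage decision is shown below, assuming a cheap LRU-style heuristic_priority function and a reward_model callable standing in for the small MLP; both names are illustrative rather than the production code.

import random

def choose_eviction(cache_keys, features, heuristic_priority, reward_model,
                    sample_size=16, candidate_count=4, rng=random.Random(0)):
    # Stage 1: random sampling approximates an exact priority queue; keep the
    # few entries that the heuristic (e.g., LRU) would evict first.
    sample = rng.sample(list(cache_keys), min(sample_size, len(cache_keys)))
    candidates = sorted(sample, key=heuristic_priority)[:candidate_count]
    # Stage 2: best-of-n re-ranking; evict the candidate whose predicted
    # preference score for eviction is highest according to the reward model.
    victim = max(candidates, key=lambda k: reward_model(features[k]))
    # The same candidate pool can be reused to form pairwise preference queries,
    # which keeps training and inference distributions aligned.
    return victim, candidates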


Online preference learning with automated feedback

The reward model is learned using online feedback based on automatically assigned preference labels. Wherever feasible, these labels indicate the ranked preference ordering of the queried items by the time until their next re-access, measured from a given snapshot in time. This is similar to the oracle optimal policy, which, at any given time, evicts the item with the farthest future access among all the items in the cache.

Generation of the automated feedback for learning the reward model.

To make this feedback process informative, HALP constructs pairwise preference queries that are most likely to be relevant for eviction decisions. In sync with the usual cache operations, HALP issues a small number of pairwise preference queries while making each eviction decision, and appends them to a set of pending comparisons. The labels for these pending comparisons can only be resolved at a random future time. To operate online, HALP also performs some additional bookkeeping after each lookup request to process any pending comparisons that can be labeled incrementally after the current request. HALP indexes the pending comparison buffer with each element involved in the comparison, and recycles the memory consumed by stale comparisons (those whose elements may never be re-accessed) to ensure that the memory overhead associated with feedback generation remains bounded over time.
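A simplified sketch of this automated labeling is given below, under the assumption (consistent with the oracle described above) that within a queried pair, the item whose re-access arrives later is the preferred eviction candidate. Class and method names are hypothetical.

from collections import defaultdict

class PreferenceFeedback:
    def __init__(self, max_pending=10_000):
        self.pending = {}                 # (item_i, item_j, issue_time) -> items in the pair
        self.index = defaultdict(set)     # item -> pending comparisons that involve it
        self.labeled = []                 # resolved (preferred_eviction, other) pairs
        self.max_pending = max_pending

    def issue_query(self, item_i, item_j, issue_time):
        if len(self.pending) >= self.max_pending:     # keep memory overhead bounded
            self._drop(next(iter(self.pending)))      # recycle the stalest comparison
        key = (item_i, item_j, issue_time)
        self.pending[key] = (item_i, item_j)
        self.index[item_i].add(key)
        self.index[item_j].add(key)

    def on_lookup(self, item):
        # The item re-accessed first loses: the other item's re-access (if any)
        # is farther in the future, so it is the preferred eviction candidate.
        for key in list(self.index.get(item, ())):
            item_i, item_j, _ = key
            other = item_j if item == item_i else item_i
            self.labeled.append((other, item))
            self._drop(key)

    def _drop(self, key):
        for member in self.pending.pop(key, ()):
            self.index[member].discard(key)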

Overview of all main components in HALP.


Results: Impact on the YouTube CDN

Through empirical analysis, we show that HALP compares favorably to state-of-the-art cache policies on public benchmark traces in terms of cache miss rates. However, while public benchmarks are a useful tool, they are rarely sufficient to capture all the usage patterns across the world over time, not to mention the diverse hardware configurations that we have already deployed.

Until recently, YouTube servers used an optimized LRU-variant for memory cache eviction. HALP increases YouTube’s memory egress/ingress — the ratio of the total bandwidth egress served by the CDN to that consumed for retrieval (ingress) due to cache misses — by roughly 12% and memory hit rate by 6%. This reduces latency for users, since memory reads are faster than disk reads, and also improves egressing capacity for disk-bounded machines by shielding the disks from traffic.

The figure below shows a visually compelling reduction in the byte miss ratio in the days following HALP’s final rollout on the YouTube CDN, which is now serving significantly more content from within the cache with lower latency to the end user, and without having to resort to more expensive retrieval that increases the operating costs.

Aggregate worldwide YouTube byte miss ratio before and after rollout (vertical dashed line).

An aggregated performance improvement could still hide important regressions. In addition to measuring overall impact, we also conduct a machine-level analysis in the paper to understand the impact on different racks, and find it to be overwhelmingly positive.


Conclusion

We introduced a scalable state-of-the-art cache eviction framework that is based on learned rewards and uses preference learning with automated feedback. Because of its design choices, HALP can be deployed in a manner similar to any other cache policy, without the operational overhead of separately managing labeled examples, a training procedure, and model versions as additional offline pipelines, as is common in most machine learning systems. Therefore, it incurs only a small extra overhead compared to other classical algorithms, but has the added benefit of being able to take advantage of additional features to make its eviction decisions and continuously adapt to changing access patterns.

This is the first large-scale deployment of a learned cache policy to a widely used and heavily trafficked CDN, and has significantly improved the CDN infrastructure efficiency while also delivering a better quality of experience to users.


Acknowledgements

Ramki Gummadi is now part of Google DeepMind. We would like to thank John Guilyard for help with the illustrations and Richard Schooler for feedback on this post.

Source: Google AI Blog


Speed is all you need: On-device acceleration of large diffusion models via GPU-aware optimizations

The proliferation of large diffusion models for image generation has led to a significant increase in model size and inference workloads. On-device ML inference in mobile environments requires meticulous performance optimization and consideration of trade-offs due to resource constraints. Running inference of large diffusion models (LDMs) on-device, driven by the need for cost efficiency and user privacy, presents even greater challenges due to the substantial memory requirements and computational demands of these models.

We address this challenge in our work titled “Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations” (to be presented at the CVPR 2023 workshop for Efficient Deep Learning for Computer Vision), focusing on the optimized execution of a foundational LDM on a mobile GPU. In this blog post, we summarize the core techniques we employed to execute large diffusion models like Stable Diffusion at full resolution (512x512 pixels) and 20 iterations on modern smartphones, achieving inference of the original model, without distillation, in under 12 seconds. As discussed in our previous blog post, GPU-accelerated ML inference is often limited by memory performance, and the execution of LDMs is no exception. Therefore, the central theme of our optimization is efficient memory input/output (I/O), even if it means choosing memory-efficient algorithms over those that prioritize arithmetic logic unit efficiency. Ultimately, our primary objective is to reduce the overall latency of the ML inference.

A sample output of an LDM on Mobile GPU with the prompt text: “a photo realistic and high resolution image of a cute puppy with surrounding flowers”.

Enhanced attention module for memory efficiency

An ML inference engine typically provides a variety of optimized ML operations. Despite this, achieving optimal performance can still be challenging as there is a certain amount of overhead for executing individual neural net operators on a GPU. To mitigate this overhead, ML inference engines incorporate extensive operator fusion rules that consolidate multiple operators into a single operator, thereby reducing the number of iterations across tensor elements while maximizing compute per iteration. For instance, TensorFlow Lite utilizes operator fusion to combine computationally expensive operations, like convolutions, with subsequent activation functions, like rectified linear units, into one.

A clear opportunity for optimization is the heavily used attention block adopted in the denoiser model in the LDM. The attention blocks allow the model to focus on specific parts of the input by assigning higher weights to important regions. There are multiple ways one can optimize the attention modules, and we selectively employ one of the two optimizations explained below depending on which optimization performs better.

The first optimization, which we call partially fused softmax, removes the need for extensive memory writes and reads between the softmax and the matrix multiplication in the attention module. Let the attention block be just a simple matrix multiplication of the form Y = softmax(X) * W where X and W are 2D matrices of shape a×b and b×c, respectively (shown below in the top half).

For numerical stability, T = softmax(X) is typically calculated in three passes:

  1. Determine the maximum value in each row of matrix X
  2. Sum up the exponentials of the differences between each item and the row maximum (from pass 1)
  3. Divide the exponential of each item minus the row maximum by the sum from pass 2

Carrying out these passes naïvely would result in a huge memory write for the temporary intermediate tensor T holding the output of the entire softmax function. We bypass this large memory write by storing only the results of passes 1 and 2, labeled m and s, respectively, which are small vectors with a elements each, compared to T, which has a·b elements. With this technique, we are able to reduce the memory consumption of this intermediate tensor, often tens or even hundreds of megabytes, by multiple orders of magnitude (shown below in the bottom half).

Attention modules. Top: A naïve attention block, composed of a SOFTMAX (with all three passes) and a MATMUL, requires a large memory write for the big intermediate tensor T. Bottom: Our memory-efficient attention block with partially fused softmax in MATMUL only needs to store two small intermediate tensors for m and s.
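The numpy sketch below checks the algebra behind this rewrite: the per-row maximum m and sum s are enough to recover softmax(X) * W, so the a×b tensor T never needs to be written out in full. On the GPU, exp(X - m) is recomputed tile by tile inside the fused MATMUL kernel; numpy is used here only to verify the math.

import numpy as np

def naive_attention(X, W):
    T = np.exp(X - X.max(axis=1, keepdims=True))   # large a×b intermediate tensor
    T /= T.sum(axis=1, keepdims=True)              # full softmax materialized
    return T @ W

def partially_fused_attention(X, W):
    m = X.max(axis=1, keepdims=True)               # pass 1: a values
    s = np.exp(X - m).sum(axis=1, keepdims=True)   # pass 2: a values
    return (np.exp(X - m) @ W) / s                 # pass 3 folded into the MATMUL

rng = np.random.default_rng(0)
X, W = rng.standard_normal((4, 8)), rng.standard_normal((8, 3))
assert np.allclose(naive_attention(X, W), partially_fused_attention(X, W))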

The other optimization involves employing FlashAttention, which is an I/O-aware, exact attention algorithm. This algorithm reduces the number of GPU high-bandwidth memory accesses, making it a good fit for our memory bandwidth–limited use case. However, we found this technique to only work for SRAM with certain sizes and to require a large number of registers. Therefore, we only leverage this technique for attention matrices with a certain size on a select set of GPUs.


Winograd fast convolution for 3×3 convolution layers

The backbone of common LDMs relies heavily on 3×3 convolution layers (convolutions with filter size 3×3), which comprise over 90% of the layers in the decoder. Despite increased memory consumption and numerical errors, we found Winograd fast convolution to be effective at speeding up these convolutions. Distinct from the 3×3 filter size used in the convolutions, the tile size refers to the size of a sub-region of the input tensor that is processed at a time. Increasing the tile size enhances the efficiency of the convolution in terms of arithmetic logic unit (ALU) usage, but this improvement comes at the expense of increased memory consumption. Our tests indicate that a tile size of 4×4 achieves the optimal trade-off between computational efficiency and memory utilization.


Tile size    FLOPS savings    Memory usage
                              Intermediate tensors    Weights
2×2          2.25×            4.00×                   1.77×
4×4          4.00×            2.25×                   4.00×
6×6          5.06×            1.80×                   7.12×
8×8          5.76×            1.56×                   11.1×

Impact of Winograd with varying tile sizes for 3×3 convolutions.
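As a sanity check on the FLOPS-savings column, the following short calculation assumes the standard Winograd F(m×m, 3×3) formulation: an m×m output tile needs (m + 2)² multiplies after the transforms, versus 9·m² for direct 3×3 convolution.

def winograd_flops_savings(m, r=3):
    direct = (r * r) * (m * m)      # multiplies per output tile, direct convolution
    winograd = (m + r - 1) ** 2     # multiplies per output tile, Winograd transform domain
    return direct / winograd

for m in (2, 4, 6, 8):
    print(f"{m}x{m} tile: {winograd_flops_savings(m):.2f}x fewer multiplies")
# Prints 2.25x, 4.00x, 5.06x, and 5.76x, matching the table above.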

Specialized operator fusion for memory efficiency

We discovered that performantly inferring LDMs on a mobile GPU requires significantly larger fusion windows for commonly employed layers and units in LDMs than current off-the-shelf on-device GPU-accelerated ML inference engines provide. Consequently, we developed specialized implementations that could execute a larger range of neural operators than typical fusion rules would permit. Specifically, we focused on two specializations: the Gaussian Error Linear Unit (GELU) and the group normalization layer.

An approximation of GELU with the hyperbolic tangent function requires writing to and reading from seven auxiliary intermediate tensors (shown as light orange rounded rectangles in the figure below), reading from the input tensor x three times, and writing to the output tensor y once, across eight GPU programs that each implement one of the labeled operations (light blue rectangles). A custom GELU implementation that performs the eight operations in a single shader (shown below in the bottom) can bypass all the memory I/O for the intermediate tensors.

GELU implementations. Top: A naïve implementation with built-in operations would require 8 memory writes and 10 reads. Bottom: Our custom GELU only requires 1 memory read (for x) and 1 write (for y).
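For reference, the tanh approximation of GELU can be evaluated as a single element-wise expression, which is what makes a one-shader implementation possible: one read of x and one write of y, with no intermediate tensors. The numpy version below is only a sketch of the math, not the GPU shader.

import numpy as np

def gelu_tanh(x):
    # GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

y = gelu_tanh(np.linspace(-3.0, 3.0, 7))   # single pass over x, single write of y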

Results

After applying all of these optimizations, we conducted tests of Stable Diffusion 1.5 (image resolution 512x512, 20 iterations) on high-end mobile devices. Running Stable Diffusion with our GPU-accelerated ML inference model uses 2,093MB for the weights and 84MB for the intermediate tensors. On the latest high-end smartphones, Stable Diffusion can be run in under 12 seconds.

Stable Diffusion runs on modern smartphones in under 12 seconds. Note that running the decoder after each iteration for displaying the intermediate output in this animated GIF results in a ~2× slowdown.

Conclusion

Performing on-device ML inference of large models has proven to be a substantial challenge, encompassing limitations in model file size, extensive runtime memory requirements, and protracted inference latency. By recognizing memory bandwidth usage as the primary bottleneck, we directed our efforts towards optimizing memory bandwidth utilization and striking a delicate balance between ALU efficiency and memory efficiency. As a result, we achieved state-of-the-art inference latency for large diffusion models. You can learn more about this work in the paper.


Acknowledgments

We'd like to thank Yu-Hui Chen, Jiuqiang Tang, Frank Barchard, Yang Zhao, Joe Zou, Khanh LeViet, Chuo-Ling Chang, Andrei Kulik, Lu Wang, and Matthias Grundmann.

Source: Google AI Blog