Tag Archives: machine learning

Retrieval-augmented visual-language pre-training

Large-scale models, such as T5, GPT-3, PaLM, Flamingo and PaLI, have demonstrated the ability to store substantial amounts of knowledge when scaled to tens of billions of parameters and trained on large text and image datasets. These models achieve state-of-the-art results on downstream tasks, such as image captioning, visual question answering and open vocabulary recognition. Despite such achievements, these models require a massive volume of data for training and end up with a tremendous number of parameters (billions in many cases), resulting in significant computational requirements. Moreover, the data used to train these models can become outdated, requiring re-training every time the world's knowledge is updated. For example, a model trained just two years ago might yield outdated information about the current president of the United States.

In the fields of natural language processing (RETRO, REALM) and computer vision (KAT), researchers have attempted to address these challenges using retrieval-augmented models. Typically, these models use a backbone that is able to process a single modality at a time, e.g., only text or only images, to encode and retrieve information from a knowledge corpus. However, these retrieval-augmented models are unable to leverage all available modalities in a query and knowledge corpora, and may not find the information that is most helpful for generating the model’s output.

To address these issues, in “REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory”, to appear at CVPR 2023, we introduce a visual-language model that learns to utilize a multi-source multi-modal “memory” to answer knowledge-intensive queries. REVEAL employs neural representation learning to encode and convert diverse knowledge sources into a memory structure consisting of key-value pairs. The keys serve as indices for the memory items, while the corresponding values store pertinent information about those items. During training, REVEAL learns the key embeddings, value tokens, and the ability to retrieve information from this memory to address knowledge-intensive queries. This approach allows the model parameters to focus on reasoning about the query, rather than being dedicated to memorization.

We augment a visual-language model with the ability to retrieve multiple knowledge entries from a diverse set of knowledge sources, which helps generation.


Memory construction from multimodal knowledge corpora

Our approach is similar to REALM in that we precompute key and value embeddings of knowledge items from different sources and index them in a unified knowledge memory, where each knowledge item is encoded into a key-value pair. Each key is a d-dimensional embedding vector, while each value is a sequence of token embeddings representing the knowledge item in more detail. In contrast to previous work, REVEAL leverages a diverse set of multimodal knowledge corpora, including the WikiData knowledge graph, Wikipedia passages and images, web image-text pairs and visual question answering data. Each knowledge item could be text, an image, a combination of both (e.g., pages in Wikipedia) or a relationship or attribute from a knowledge graph (e.g., Barack Obama is 6’ 2” tall). During training, we continuously re-compute the memory key and value embeddings as the model parameters get updated. We update the memory asynchronously at every thousand training steps.


Scaling memory using compression

A naïve solution for encoding a memory value is to keep the whole sequence of tokens for each knowledge item. Then, the model could fuse the input query and the top-k retrieved memory values by concatenating all their tokens together and feeding them into a transformer encoder-decoder pipeline. This approach has two issues: (1) storing hundreds of millions of knowledge items in memory is impractical if each memory value consists of hundreds of tokens and (2) the transformer encoder has a quadratic complexity with respect to the total number of tokens times k for self-attention. Therefore, we propose to use the Perceiver architecture to encode and compress knowledge items. The Perceiver model uses a transformer decoder to compress the full token sequence into an arbitrary length. This lets us retrieve top-k memory entries for k as large as a hundred.

The following figure illustrates the procedure of constructing the memory key-value pairs. Each knowledge item is processed through a multi-modal visual-language encoder, resulting in a sequence of image and text tokens. The key head then transforms these tokens into a compact embedding vector. The value head (perceiver) condenses these tokens into fewer ones, retaining the pertinent information about the knowledge item within them.

We encode the knowledge entries from different corpora into unified key and value embedding pairs, where the keys are used to index the memory and values contain information about the entries.


Large-scale pre-training on image-text pairs

To train the REVEAL model, we begin with the large-scale corpus, collected from the public Web with three billion image alt-text caption pairs, introduced in LiT. Since the dataset is noisy, we add a filter to remove data points with captions shorter than 50 characters, which yields roughly 1.3 billion image caption pairs. We then take these pairs, combined with the text generation objective used in SimVLM, to train REVEAL. Given an image-text example, we randomly sample a prefix containing the first few tokens of the text. We feed the text prefix and image to the model as input with the objective of generating the rest of the text as output. The training goal is to condition the prefix and autoregressively generate the remaining text sequence.

To train all components of the REVEAL model end-to-end, we need to warm start the model to a good state (setting initial values to model parameters). Otherwise, if we were to start with random weights (cold-start), the retriever would often return irrelevant memory items that would never generate useful training signals. To avoid this cold-start problem, we construct an initial retrieval dataset with pseudo–ground-truth knowledge to give the pre-training a reasonable head start.

We create a modified version of the WIT dataset for this purpose. Each image-caption pair in WIT also comes with a corresponding Wikipedia passage (words surrounding the text). We put together the surrounding passage with the query image and use it as the pseudo ground-truth knowledge that corresponds to the input query. The passage provides rich information about the image and caption, which is useful for initializing the model.

To prevent the model from relying on low-level image features for retrieval, we apply random data augmentation to the input query image. Given this modified dataset that contains pseudo-retrieval ground-truth, we train the query and memory key embeddings to warm start the model.


REVEAL workflow

The overall workflow of REVEAL consists of four primary steps. First, REVEAL encodes a multimodal input into a sequence of token embeddings along with a condensed query embedding. Then, the model translates each multi-source knowledge entry into unified pairs of key and value embeddings, with the key being utilized for memory indexing and the value encompassing the entire information about the entry. Next, REVEAL retrieves the top-k most related knowledge pieces from multiple knowledge sources, returns the pre-processed value embeddings stored in memory, and re-encodes the values. Finally, REVEAL fuses the top-k knowledge pieces through an attentive knowledge fusion layer by injecting the retrieval score (dot product between query and key embeddings) as a prior during attention calculation. This structure is instrumental in enabling the memory, encoder, retriever and the generator to be concurrently trained in an end-to-end fashion.

Overall workflow of REVEAL.


Results

We evaluate REVEAL on knowledge-based visual question answering tasks using OK-VQA and A-OKVQA datasets. We fine-tune our pre-trained model on the VQA tasks using the same generative objective where the model takes in an image-question pair as input and generates the text answer as output. We demonstrate that REVEAL achieves better results on the A-OKVQA dataset than earlier attempts that incorporate a fixed knowledge or the works that utilize large language models (e.g., GPT-3) as an implicit source of knowledge.

Visual question answering results on A-OKVQA. REVEAL achieves higher accuracy in comparison to previous works including ViLBERT, LXMERT, ClipCap, KRISP and GPV-2.

We also evaluate REVEAL on the image captioning benchmarks using MSCOCO and NoCaps dataset. We directly fine-tune REVEAL on the MSCOCO training split via the cross-entropy generative objective. We measure our performance on the MSCOCO test split and NoCaps evaluation set using the CIDEr metric, which is based on the idea that good captions should be similar to reference captions in terms of word choice, grammar, meaning, and content. Our results on MSCOCO caption and NoCaps datasets are shown below.

Image Captioning results on MSCOCO and NoCaps using the CIDEr metric. REVEAL achieves a higher score in comparison to Flamingo, VinVL, SimVLM and CoCa.

Below we show a couple of qualitative examples of how REVEAL retrieves relevant documents to answer visual questions.

REVEAL can use knowledge from different sources to correctly answer the question.


Conclusion

We present an end-to-end retrieval-augmented visual language (REVEAL) model, which contains a knowledge retriever that learns to utilize a diverse set of knowledge sources with different modalities. We train REVEAL on a massive image-text corpus with four diverse knowledge corpora, and achieve state-of-the-art results on knowledge-intensive visual question answering and image caption tasks. In the future we would like to explore the ability of this model for attribution, and apply it to a broader class of multimodal tasks.


Acknowledgements

This research was conducted by Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross and Alireza Fathi.

Source: Google AI Blog


Large sequence models for software development activities

Software isn’t created in one dramatic step. It improves bit by bit, one little step at a time — editing, running unit tests, fixing build errors, addressing code reviews, editing some more, appeasing linters, and fixing more errors — until finally it becomes good enough to merge into a code repository. Software engineering isn’t an isolated process, but a dialogue among human developers, code reviewers, bug reporters, software architects and tools, such as compilers, unit tests, linters and static analyzers.

Today we describe DIDACT (​​Dynamic Integrated Developer ACTivity), which is a methodology for training large machine learning (ML) models for software development. The novelty of DIDACT is that it uses the process of software development as the source of training data for the model, rather than just the polished end state of that process, the finished code. By exposing the model to the contexts that developers see as they work, paired with the actions they take in response, the model learns about the dynamics of software development and is more aligned with how developers spend their time. We leverage instrumentation of Google's software development to scale up the quantity and diversity of developer-activity data beyond previous works. Results are extremely promising along two dimensions: usefulness to professional software developers, and as a potential basis for imbuing ML models with general software development skills.

DIDACT is a multi-task model trained on development activities that include editing, debugging, repair, and code review.

We built and deployed internally three DIDACT tools, Comment Resolution (which we recently announced), Build Repair, and Tip Prediction, each integrated at different stages of the development workflow. All three of these tools received enthusiastic feedback from thousands of internal developers. We see this as the ultimate test of usefulness: do professional developers, who are often experts on the code base and who have carefully honed workflows, leverage the tools to improve their productivity?

Perhaps most excitingly, we demonstrate how DIDACT is a first step towards a general-purpose developer-assistance agent. We show that the trained model can be used in a variety of surprising ways, via prompting with prefixes of developer activities, and by chaining together multiple predictions to roll out longer activity trajectories. We believe DIDACT paves a promising path towards developing agents that can generally assist across the software development process.


A treasure trove of data about the software engineering process

Google’s software engineering toolchains store every operation related to code as a log of interactions among tools and developers, and have done so for decades. In principle, one could use this record to replay in detail the key episodes in the “software engineering video” of how Google’s codebase came to be, step-by-step — one code edit, compilation, comment, variable rename, etc., at a time.

Google code lives in a monorepo, a single repository of code for all tools and systems. A software developer typically experiments with code changes in a local copy-on-write workspace managed by a system called Clients in the Cloud (CitC). When the developer is ready to package a set of code changes together for a specific purpose (e.g., fixing a bug), they create a changelist (CL) in Critique, Google’s code-review system. As with other types of code-review systems, the developer engages in a dialog with a peer reviewer about functionality and style. The developer edits their CL to address reviewer comments as the dialog progresses. Eventually, the reviewer declares “LGTM!” (“looks good to me”), and the CL is merged into the code repository.

Of course, in addition to a dialog with the code reviewer, the developer also maintains a “dialog” of sorts with a plethora of other software engineering tools, such as the compiler, the testing framework, linters, static analyzers, fuzzers, etc.

An illustration of the intricate web of activities involved in developing software: small actions by the developer, interactions with a code reviewer, and invocations of tools such as compilers.

A multi-task model for software engineering

DIDACT utilizes interactions among engineers and tools to power ML models that assist Google developers, by suggesting or enhancing actions developers take — in context — while pursuing their software-engineering tasks. To do that, we have defined a number of tasks about individual developer activities: repairing a broken build, predicting a code-review comment, addressing a code-review comment, renaming a variable, editing a file, etc. We use a common formalism for each activity: it takes some State (a code file), some Intent (annotations specific to the activity, such as code-review comments or compiler errors), and produces an Action (the operation taken to address the task). This Action is like a mini programming language, and can be extended for newly added activities. It covers things like editing, adding comments, renaming variables, marking up code with errors, etc. We call this language DevScript.

The DIDACT model is prompted with a task, code snippets, and annotations related to that task, and produces development actions, e.g., edits or comments.

This state-intent-action formalism enables us to capture many different tasks in a general way. What’s more, DevScript is a concise way to express complex actions, without the need to output the whole state (the original code) as it would be after the action takes place; this makes the model more efficient and more interpretable. For example, a rename might touch a file in dozens of places, but a model can predict a single rename action.


An ML peer programmer

DIDACT does a good job on individual assistive tasks. For example, below we show DIDACT doing code clean-up after functionality is mostly done. It looks at the code along with some final comments by the code reviewer (marked with “human” in the animation), and predicts edits to address those comments (rendered as a diff).

Given an initial snippet of code and the comments that a code reviewer attached to that snippet, the Pre-Submit Cleanup task of DIDACT produces edits (insertions and deletions of text) that address those comments.

The multimodal nature of DIDACT also gives rise to some surprising capabilities, reminiscent of behaviors emerging with scale. One such capability is history augmentation, which can be enabled via prompting. Knowing what the developer did recently enables the model to make a better guess about what the developer should do next.

An illustration of history-augmented code completion in action.

A powerful such task exemplifying this capability is history-augmented code completion. In the figure below, the developer adds a new function parameter (1), and moves the cursor into the documentation (2). Conditioned on the history of developer edits and the cursor position, the model completes the line (3) by correctly predicting the docstring entry for the new parameter.

An illustration of edit prediction, over multiple chained iterations.

In an even more powerful history-augmented task, edit prediction, the model can choose where to edit next in a fashion that is historically consistent. If the developer deletes a function parameter (1), the model can use history to correctly predict an update to the docstring (2) that removes the deleted parameter (without the human developer manually placing the cursor there) and to update a statement in the function (3) in a syntactically (and — arguably — semantically) correct way. With history, the model can unambiguously decide how to continue the “editing video” correctly. Without history, the model wouldn’t know whether the missing function parameter is intentional (because the developer is in the process of a longer edit to remove it) or accidental (in which case the model should re-add it to fix the problem).

The model can go even further. For example, we started with a blank file and asked the model to successively predict what edits would come next until it had written a full code file. The astonishing part is that the model developed code in a step-by-step way that would seem natural to a developer: It started by first creating a fully working skeleton with imports, flags, and a basic main function. It then incrementally added new functionality, like reading from a file and writing results, and added functionality to filter out some lines based on a user-provided regular expression, which required changes across the file, like adding new flags.


Conclusion

DIDACT turns Google's software development process into training demonstrations for ML developer assistants, and uses those demonstrations to train models that construct code in a step-by-step fashion, interactively with tools and code reviewers. These innovations are already powering tools enjoyed by Google developers every day. The DIDACT approach complements the great strides taken by large language models at Google and elsewhere, towards technologies that ease toil, improve productivity, and enhance the quality of work of software engineers.


Acknowledgements

This work is the result of a multi-year collaboration among Google Research, Google Core Systems and Experiences, and DeepMind. We would like to acknowledge our colleagues Jacob Austin, Pascal Lamblin, Pierre-Antoine Manzagol, and Daniel Zheng, who join us as the key drivers of this project. This work could not have happened without the significant and sustained contributions of our partners at Alphabet (Peter Choy, Henryk Michalewski, Subhodeep Moitra, Malgorzata Salawa, Vaibhav Tulsyan, and Manushree Vijayvergiya), as well as the many people who collected data, identified tasks, built products, strategized, evangelized, and helped us execute on the many facets of this agenda (Ankur Agarwal, Paige Bailey, Marc Brockschmidt, Rodrigo Damazio Bovendorp, Satish Chandra, Savinee Dancs, Matt Frazier, Alexander Frömmgen, Nimesh Ghelani, Chris Gorgolewski, Chenjie Gu, Vincent Hellendoorn, Franjo Ivančić, Marko Ivanković, Emily Johnston, Luka Kalinovcic, Lera Kharatyan, Jessica Ko, Markus Kusano, Kathy Nix, Sara Qu, Marc Rasi, Marcus Revaj, Ballie Sandhu, Michael Sloan, Tom Small, Gabriela Surita, Maxim Tabachnyk, David Tattersall, Sara Toth, Kevin Villela, Sara Wiltberger, and Donald Duo Zhao) and our extremely supportive leadership (Martín Abadi, Joelle Barral, Jeff Dean, Madhura Dudhgaonkar, Douglas Eck, Zoubin Ghahramani, Hugo Larochelle, Chandu Thekkath, and Niranjan Tulpule). Thank you!

Source: Google AI Blog


Barkour: Benchmarking animal-level agility with quadruped robots

Creating robots that exhibit robust and dynamic locomotion capabilities, similar to animals or humans, has been a long-standing goal in the robotics community. In addition to completing tasks quickly and efficiently, agility allows legged robots to move through complex environments that are otherwise difficult to traverse. Researchers at Google have been pursuing agility for multiple years and across various form factors. Yet, while researchers have enabled robots to hike or jump over some obstacles, there is still no generally accepted benchmark that comprehensively measures robot agility or mobility. In contrast, benchmarks are driving forces behind the development of machine learning, such as ImageNet for computer vision, and OpenAI Gym for reinforcement learning (RL).

In “Barkour: Benchmarking Animal-level Agility with Quadruped Robots”, we introduce the Barkour agility benchmark for quadruped robots, along with a Transformer-based generalist locomotion policy. Inspired by dog agility competitions, a legged robot must sequentially display a variety of skills, including moving in different directions, traversing uneven terrains, and jumping over obstacles within a limited timeframe to successfully complete the benchmark. By providing a diverse and challenging obstacle course, the Barkour benchmark encourages researchers to develop locomotion controllers that move fast in a controllable and versatile way. Furthermore, by tying the performance metric to real dog performance, we provide an intuitive metric to understand the robot performance with respect to their animal counterparts.



We invited a handful of dooglers to try the obstacle course to ensure that our agility objectives were realistic and challenging. Small dogs complete the obstacle course in approximately 10s, whereas our robot’s typical performance hovers around 20s.


Barkour benchmark

The Barkour scoring system uses a per obstacle and an overall course target time based on the target speed of small dogs in the novice agility competitions (about 1.7m/s). Barkour scores range from 0 to 1, with 1 corresponding to the robot successfully traversing all the obstacles along the course within the allotted time of approximately 10 seconds, the average time needed for a similar-sized dog to traverse the course. The robot receives penalties for skipping, failing obstacles, or moving too slowly.

Our standard course consists of four unique obstacles in a 5m x 5m area. This is a denser and smaller setup than a typical dog competition to allow for easy deployment in a robotics lab. Beginning at the start table, the robot needs to weave through a set of poles, climb an A-frame, clear a 0.5m broad jump and then step onto the end table. We chose this subset of obstacles because they test a diverse set of skills while keeping the setup within a small footprint. As is the case for real dog agility competitions, the Barkour benchmark can be easily adapted to a larger course area and may incorporate a variable number of obstacles and course configurations.

Overview of the Barkour benchmark’s obstacle course setup, which consists of weave poles, an A-frame, a broad jump, and pause tables. The intuitive scoring mechanism, inspired by dog agility competitions, balances speed, agility and performance and can be easily modified to incorporate other types of obstacles or course configurations.


Learning agile locomotion skills

The Barkour benchmark features a diverse set of obstacles and a delayed reward system, which pose a significant challenge when training a single policy that can complete the entire obstacle course. So in order to set a strong performance baseline and demonstrate the effectiveness of the benchmark for robotic agility research, we adopt a student-teacher framework combined with a zero-shot sim-to-real approach. First, we train individual specialist locomotion skills (teacher) for different obstacles using on-policy RL methods. In particular, we leverage recent advances in large-scale parallel simulation to equip the robot with individual skills, including walking, slope climbing, and jumping policies.

Next, we train a single policy (student) that performs all the skills and transitions in between by using a student-teacher framework, based on the specialist skills we previously trained. We use simulation rollouts to create datasets of state-action pairs for each one of the specialist skills. This dataset is then distilled into a single Transformer-based generalist locomotion policy, which can handle various terrains and adjust the robot's gait based on the perceived environment and the robot’s state.

During deployment, we pair the locomotion transformer policy that is capable of performing multiple skills with a navigation controller that provides velocity commands based on the robot's position. Our trained policy controls the robot based on the robot's surroundings represented as an elevation map, velocity commands, and on-board sensory information provided by the robot.



Deployment pipeline for the locomotion transformer architecture. At deployment time, a high-level navigation controller guides the real robot through the obstacle course by sending commands to the locomotion transformer policy.

Robustness and repeatability are difficult to achieve when we aim for peak performance and maximum speed. Sometimes, the robot might fail when overcoming an obstacle in an agile way. To handle failures we train a recovery policy that quickly gets the robot back on its feet, allowing it to continue the episode.


Evaluation

We evaluate the Transformer-based generalist locomotion policy using custom-built quadruped robots and show that by optimizing for the proposed benchmark, we obtain agile, robust, and versatile skills for our robot in the real world. We further provide analysis for various design choices in our system and their impact on the system performance.

Model of the custom-built robots used for evaluation.

We deploy both the specialist and generalist policies to hardware (zero-shot sim-to-real). The robot’s target trajectory is provided by a set of waypoints along the various obstacles. In the case of the specialist policies, we switch between specialist policies by using a hand-tuned policy switching mechanism that selects the most suitable policy given the robot’s position.



Typical performance of our agile locomotion policies on the Barkour benchmark. Our custom-built quadruped robot robustly navigates the terrain’s obstacles by leveraging various skills learned using RL in simulation.

We find that very often our policies can handle unexpected events or even hardware degradation resulting in good average performance, but failures are still possible. As illustrated in the image below, in case of failures, our recovery policy quickly gets the robot back on its feet, allowing it to continue the episode. By combining the recovery policy with a simple walk-back-to-start policy, we are able to run repeated experiments with minimal human intervention to measure the robustness.



Qualitative example of robustness and recovery behaviors. The robot trips and rolls over after heading down the A-frame. This triggers the recovery policy, which enables the robot to get back up and continue the course.

We find that across a large number of evaluations, the single generalist locomotion transformer policy and the specialist policies with the policy switching mechanism achieve similar performance. The locomotion transformer policy has a slightly lower average Barkour score, but exhibits smoother transitions between behaviors and gaits.



Measuring robustness of the different policies across a large number of runs on the Barkour benchmark.

Histogram of the agility scores for the locomotion transformer policy. The highest scores shown in blue (0.75 - 0.9) represent the runs where the robot successfully completes all obstacles.


Conclusion

We believe that developing a benchmark for legged robotics is an important first step in quantifying progress toward animal-level agility. To establish a strong baseline, we investigated a zero-shot sim-to-real approach, taking advantage of large-scale parallel simulation and recent advancements in training Transformer-based architectures. Our findings demonstrate that Barkour is a challenging benchmark that can be easily customized, and that our learning-based method for solving the benchmark provides a quadruped robot with a single low-level policy that can perform a variety of agile low-level skills.


Acknowledgments

The authors of this post are now part of Google DeepMind. We would like to thank our co-authors at Google DeepMind and our collaborators at Google Research: Wenhao Yu, J. Chase Kew, Tingnan Zhang, Daniel Freeman, Kuang-Hei Lee, Lisa Lee, Stefano Saliceti, Vincent Zhuang, Nathan Batchelor, Steven Bohez, Federico Casarini, Jose Enrique Chen, Omar Cortes, Erwin Coumans, Adil Dostmohamed, Gabriel Dulac-Arnold, Alejandro Escontrela, Erik Frey, Roland Hafner, Deepali Jain, Yuheng Kuang, Edward Lee, Linda Luu, Ofir Nachum, Ken Oslund, Jason Powell, Diego Reyes, Francesco Romano, Feresteh Sadeghi, Ron Sloat, Baruch Tabanpour, Daniel Zheng, Michael Neunert, Raia Hadsell, Nicolas Heess, Francesco Nori, Jeff Seto, Carolina Parada, Vikas Sindhwani, Vincent Vanhoucke, and Jie Tan. We would also like to thank Marissa Giustina, Ben Jyenis, Gus Kouretas, Nubby Lee, James Lubin, Sherry Moore, Thinh Nguyen, Krista Reymann, Satoshi Kataoka, Trish Blazina, and the members of the robotics team at Google DeepMind for their contributions to the project.

Source: Google AI Blog


Barkour: Benchmarking animal-level agility with quadruped robots

Creating robots that exhibit robust and dynamic locomotion capabilities, similar to animals or humans, has been a long-standing goal in the robotics community. In addition to completing tasks quickly and efficiently, agility allows legged robots to move through complex environments that are otherwise difficult to traverse. Researchers at Google have been pursuing agility for multiple years and across various form factors. Yet, while researchers have enabled robots to hike or jump over some obstacles, there is still no generally accepted benchmark that comprehensively measures robot agility or mobility. In contrast, benchmarks are driving forces behind the development of machine learning, such as ImageNet for computer vision, and OpenAI Gym for reinforcement learning (RL).

In “Barkour: Benchmarking Animal-level Agility with Quadruped Robots”, we introduce the Barkour agility benchmark for quadruped robots, along with a Transformer-based generalist locomotion policy. Inspired by dog agility competitions, a legged robot must sequentially display a variety of skills, including moving in different directions, traversing uneven terrains, and jumping over obstacles within a limited timeframe to successfully complete the benchmark. By providing a diverse and challenging obstacle course, the Barkour benchmark encourages researchers to develop locomotion controllers that move fast in a controllable and versatile way. Furthermore, by tying the performance metric to real dog performance, we provide an intuitive metric to understand the robot performance with respect to their animal counterparts.



We invited a handful of dooglers to try the obstacle course to ensure that our agility objectives were realistic and challenging. Small dogs complete the obstacle course in approximately 10s, whereas our robot’s typical performance hovers around 20s.


Barkour benchmark

The Barkour scoring system uses a per obstacle and an overall course target time based on the target speed of small dogs in the novice agility competitions (about 1.7m/s). Barkour scores range from 0 to 1, with 1 corresponding to the robot successfully traversing all the obstacles along the course within the allotted time of approximately 10 seconds, the average time needed for a similar-sized dog to traverse the course. The robot receives penalties for skipping, failing obstacles, or moving too slowly.

Our standard course consists of four unique obstacles in a 5m x 5m area. This is a denser and smaller setup than a typical dog competition to allow for easy deployment in a robotics lab. Beginning at the start table, the robot needs to weave through a set of poles, climb an A-frame, clear a 0.5m broad jump and then step onto the end table. We chose this subset of obstacles because they test a diverse set of skills while keeping the setup within a small footprint. As is the case for real dog agility competitions, the Barkour benchmark can be easily adapted to a larger course area and may incorporate a variable number of obstacles and course configurations.

Overview of the Barkour benchmark’s obstacle course setup, which consists of weave poles, an A-frame, a broad jump, and pause tables. The intuitive scoring mechanism, inspired by dog agility competitions, balances speed, agility and performance and can be easily modified to incorporate other types of obstacles or course configurations.


Learning agile locomotion skills

The Barkour benchmark features a diverse set of obstacles and a delayed reward system, which pose a significant challenge when training a single policy that can complete the entire obstacle course. So in order to set a strong performance baseline and demonstrate the effectiveness of the benchmark for robotic agility research, we adopt a student-teacher framework combined with a zero-shot sim-to-real approach. First, we train individual specialist locomotion skills (teacher) for different obstacles using on-policy RL methods. In particular, we leverage recent advances in large-scale parallel simulation to equip the robot with individual skills, including walking, slope climbing, and jumping policies.

Next, we train a single policy (student) that performs all the skills and transitions in between by using a student-teacher framework, based on the specialist skills we previously trained. We use simulation rollouts to create datasets of state-action pairs for each one of the specialist skills. This dataset is then distilled into a single Transformer-based generalist locomotion policy, which can handle various terrains and adjust the robot's gait based on the perceived environment and the robot’s state.

During deployment, we pair the locomotion transformer policy that is capable of performing multiple skills with a navigation controller that provides velocity commands based on the robot's position. Our trained policy controls the robot based on the robot's surroundings represented as an elevation map, velocity commands, and on-board sensory information provided by the robot.



Deployment pipeline for the locomotion transformer architecture. At deployment time, a high-level navigation controller guides the real robot through the obstacle course by sending commands to the locomotion transformer policy.

Robustness and repeatability are difficult to achieve when we aim for peak performance and maximum speed. Sometimes, the robot might fail when overcoming an obstacle in an agile way. To handle failures we train a recovery policy that quickly gets the robot back on its feet, allowing it to continue the episode.


Evaluation

We evaluate the Transformer-based generalist locomotion policy using custom-built quadruped robots and show that by optimizing for the proposed benchmark, we obtain agile, robust, and versatile skills for our robot in the real world. We further provide analysis for various design choices in our system and their impact on the system performance.

Model of the custom-built robots used for evaluation.

We deploy both the specialist and generalist policies to hardware (zero-shot sim-to-real). The robot’s target trajectory is provided by a set of waypoints along the various obstacles. In the case of the specialist policies, we switch between specialist policies by using a hand-tuned policy switching mechanism that selects the most suitable policy given the robot’s position.



Typical performance of our agile locomotion policies on the Barkour benchmark. Our custom-built quadruped robot robustly navigates the terrain’s obstacles by leveraging various skills learned using RL in simulation.

We find that very often our policies can handle unexpected events or even hardware degradation resulting in good average performance, but failures are still possible. As illustrated in the image below, in case of failures, our recovery policy quickly gets the robot back on its feet, allowing it to continue the episode. By combining the recovery policy with a simple walk-back-to-start policy, we are able to run repeated experiments with minimal human intervention to measure the robustness.



Qualitative example of robustness and recovery behaviors. The robot trips and rolls over after heading down the A-frame. This triggers the recovery policy, which enables the robot to get back up and continue the course.

We find that across a large number of evaluations, the single generalist locomotion transformer policy and the specialist policies with the policy switching mechanism achieve similar performance. The locomotion transformer policy has a slightly lower average Barkour score, but exhibits smoother transitions between behaviors and gaits.



Measuring robustness of the different policies across a large number of runs on the Barkour benchmark.

Histogram of the agility scores for the locomotion transformer policy. The highest scores shown in blue (0.75 - 0.9) represent the runs where the robot successfully completes all obstacles.


Conclusion

We believe that developing a benchmark for legged robotics is an important first step in quantifying progress toward animal-level agility. To establish a strong baseline, we investigated a zero-shot sim-to-real approach, taking advantage of large-scale parallel simulation and recent advancements in training Transformer-based architectures. Our findings demonstrate that Barkour is a challenging benchmark that can be easily customized, and that our learning-based method for solving the benchmark provides a quadruped robot with a single low-level policy that can perform a variety of agile low-level skills.


Acknowledgments

The authors of this post are now part of Google DeepMind. We would like to thank our co-authors at Google DeepMind and our collaborators at Google Research: Wenhao Yu, J. Chase Kew, Tingnan Zhang, Daniel Freeman, Kuang-Hei Lee, Lisa Lee, Stefano Saliceti, Vincent Zhuang, Nathan Batchelor, Steven Bohez, Federico Casarini, Jose Enrique Chen, Omar Cortes, Erwin Coumans, Adil Dostmohamed, Gabriel Dulac-Arnold, Alejandro Escontrela, Erik Frey, Roland Hafner, Deepali Jain, Yuheng Kuang, Edward Lee, Linda Luu, Ofir Nachum, Ken Oslund, Jason Powell, Diego Reyes, Francesco Romano, Feresteh Sadeghi, Ron Sloat, Baruch Tabanpour, Daniel Zheng, Michael Neunert, Raia Hadsell, Nicolas Heess, Francesco Nori, Jeff Seto, Carolina Parada, Vikas Sindhwani, Vincent Vanhoucke, and Jie Tan. We would also like to thank Marissa Giustina, Ben Jyenis, Gus Kouretas, Nubby Lee, James Lubin, Sherry Moore, Thinh Nguyen, Krista Reymann, Satoshi Kataoka, Trish Blazina, and the members of the robotics team at Google DeepMind for their contributions to the project.

Source: Google AI Blog


Google Research at I/O 2023

Wednesday, May 10th was an exciting day for the Google Research community as we watched the results of months and years of our foundational and applied work get announced on the Google I/O stage. With the quick pace of announcements on stage, it can be difficult to convey the substantial effort and unique innovations that underlie the technologies we presented. So today, we’re excited to reveal more about the research efforts behind some of the many exciting announcements at this year's I/O.


PaLM 2

Our next-generation large language model (LLM), PaLM 2, is built on advances in compute-optimal scaling, scaled instruction-fine tuning and improved dataset mixture. By fine-tuning and instruction-tuning the model for different purposes, we have been able to integrate state-of-the-art capabilities into over 25 Google products and features, where it is already helping to inform, assist and delight users. For example:

  • Bard is an early experiment that lets you collaborate with generative AI and helps to boost productivity, accelerate ideas and fuel curiosity. It builds on advances in deep learning efficiency and leverages reinforcement learning from human feedback to provide more relevant responses and increase the model’s ability to follow instructions. Bard is now available in 180 countries, where users can interact with it in English, Japanese and Korean, and thanks to the multilingual capabilities afforded by PaLM 2, support for 40 languages is coming soon.
  • With Search Generative Experience we’re taking more of the work out of searching, so you’ll be able to understand a topic faster, uncover new viewpoints and insights, and get things done more easily. As part of this experiment, you’ll see an AI-powered snapshot of key information to consider, with links to dig deeper.
  • MakerSuite is an easy-to-use prototyping environment for the PaLM API, powered by PaLM 2. In fact, internal user engagement with early prototypes of MakerSuite accelerated the development of our PaLM 2 model itself. MakerSuite grew out of research focused on prompting tools, or tools explicitly designed for customizing and controlling LLMs. This line of research includes PromptMaker (precursor to MakerSuite), and AI Chains and PromptChainer (one of the first research efforts demonstrating the utility of LLM chaining).
  • Project Tailwind also made use of early research prototypes of MakerSuite to develop features to help writers and researchers explore ideas and improve their prose; its AI-first notebook prototype used PaLM 2 to allow users to ask questions of the model grounded in documents they define.
  • Med-PaLM 2 is our state-of-the-art medical LLM, built on PaLM 2. Med-PaLM 2 achieved 86.5% performance on U.S. Medical Licensing Exam–style questions, illustrating its exciting potential for health. We’re now exploring multimodal capabilities to synthesize inputs like X-rays.
  • Codey is a version of PaLM 2 fine-tuned on source code to function as a developer assistant. It supports a broad range of Code AI features, including code completions, code explanation, bug fixing, source code migration, error explanations, and more. Codey is available through our trusted tester program via IDEs (Colab, Android Studio, Duet AI for Cloud, Firebase) and via a 3P-facing API.

Perhaps even more exciting for developers, we have opened up the PaLM APIs & MakerSuite to provide the community opportunities to innovate using this groundbreaking technology.

PaLM 2 has advanced coding capabilities that enable it to find code errors and make suggestions in a number of different languages.

Imagen

Our Imagen family of image generation and editing models builds on advances in large Transformer-based language models and diffusion models. This family of models is being incorporated into multiple Google products, including:

  • Image generation in Google Slides and Android’s Generative AI wallpaper are powered by our text-to-image generation features.
  • Google Cloud’s Vertex AI enables image generation, image editing, image upscaling and fine-tuning to help enterprise customers meet their business needs.
  • I/O Flip, a digital take on a classic card game, features Google developer mascots on cards that were entirely AI generated. This game showcased a fine-tuning technique called DreamBooth for adapting pre-trained image generation models. Using just a handful of images as inputs for fine-tuning, it allows users to generate personalized images in minutes. With DreamBooth, users can synthesize a subject in diverse scenes, poses, views, and lighting conditions that don’t appear in the reference images.
    I/O Flip presents custom card decks designed using DreamBooth.

Phenaki

Phenaki, Google’s Transformer-based text-to-video generation model was featured in the I/O pre-show. Phenaki is a model that can synthesize realistic videos from textual prompt sequences by leveraging two main components: an encoder-decoder model that compresses videos to discrete embeddings and a transformer model that translates text embeddings to video tokens.


ARCore and the Scene Semantic API

Among the new features of ARCore announced by the AR team at I/O, the Scene Semantic API can recognize pixel-wise semantics in an outdoor scene. This helps users create custom AR experiences based on the features in the surrounding area. This API is empowered by the outdoor semantic segmentation model, leveraging our recent works around the DeepLab architecture and an egocentric outdoor scene understanding dataset. The latest ARCore release also includes an improved monocular depth model that provides higher accuracy in outdoor scenes.

Scene Semantics API uses DeepLab-based semantic segmentation model to provide accurate pixel-wise labels in a scene outdoors.

Chirp

Chirp is Google's family of state-of-the-art Universal Speech Models trained on 12 million hours of speech to enable automatic speech recognition (ASR) for 100+ languages. The models can perform ASR on under-resourced languages, such as Amharic, Cebuano, and Assamese, in addition to widely spoken languages like English and Mandarin. Chirp is able to cover such a wide variety of languages by leveraging self-supervised learning on unlabeled multilingual dataset with fine-tuning on a smaller set of labeled data. Chirp is now available in the Google Cloud Speech-to-Text API, allowing users to perform inference on the model through a simple interface. You can get started with Chirp here.


MusicLM

At I/O, we launched MusicLM, a text-to-music model that generates 20 seconds of music from a text prompt. You can try it yourself on AI Test Kitchen, or see it featured during the I/O preshow, where electronic musician and composer Dan Deacon used MusicLM in his performance.

MusicLM, which consists of models powered by AudioLM and MuLAN, can make music (from text, humming, images or video) and musical accompaniments to singing. AudioLM generates high quality audio with long-term consistency. It maps audio to a sequence of discrete tokens and casts audio generation as a language modeling task. To synthesize longer outputs efficiently, it used a novel approach we’ve developed called SoundStorm.


Universal Translator dubbing

Our dubbing efforts leverage dozens of ML technologies to translate the full expressive range of video content, making videos accessible to audiences across the world. These technologies have been used to dub videos across a variety of products and content types, including educational content, advertising campaigns, and creator content, with more to come. We use deep learning technology to achieve voice preservation and lip matching and enable high-quality video translation. We’ve built this product to include human review for quality, safety checks to help prevent misuse, and we make it accessible only to authorized partners.


AI for global societal good

We are applying our AI technologies to solve some of the biggest global challenges, like mitigating climate change, adapting to a warming planet and improving human health and wellbeing. For example:

  • Traffic engineers use our Green Light recommendations to reduce stop-and-go traffic at intersections and improve the flow of traffic in cities from Bangalore to Rio de Janeiro and Hamburg. Green Light models each intersection, analyzing traffic patterns to develop recommendations that make traffic lights more efficient — for example, by better synchronizing timing between adjacent lights, or adjusting the “green time” for a given street and direction.
  • We’ve also expanded global coverage on the Flood Hub to 80 countries, as part of our efforts to predict riverine floods and alert people who are about to be impacted before disaster strikes. Our flood forecasting efforts rely on hydrological models informed by satellite observations, weather forecasts and in-situ measurements.

Technologies for inclusive and fair ML applications

With our continued investment in AI technologies, we are emphasizing responsible AI development with the goal of making our models and tools useful and impactful while also ensuring fairness, safety and alignment with our AI Principles. Some of these efforts were highlighted at I/O, including:

  • The release of the Monk Skin Tone Examples (MST-E) Dataset to help practitioners gain a deeper understanding of the MST scale and train human annotators for more consistent, inclusive, and meaningful skin tone annotations. You can read more about this and other developments on our website. This is an advancement on the open source release of the Monk Skin Tone (MST) Scale we launched last year to enable developers to build products that are more inclusive and that better represent their diverse users.
  • A new Kaggle competition (open until August 10th) in which the ML community is tasked with creating a model that can quickly and accurately identify American Sign Language (ASL) fingerspelling — where each letter of a word is spelled out in ASL rapidly using a single hand, rather than using the specific signs for entire words — and translate it into written text. Learn more about the fingerspelling Kaggle competition, which features a song from Sean Forbes, a deaf musician and rapper. We also showcased at I/O the winning algorithm from the prior year’s competition powers PopSign, an ASL learning app for parents of deaf or hard of hearing children created by Georgia Tech and Rochester Institute of Technology (RIT).

Building the future of AI together

It’s inspiring to be part of a community of so many talented individuals who are leading the way in developing state-of-the-art technologies, responsible AI approaches and exciting user experiences. We are in the midst of a period of incredible and transformative change for AI. Stay tuned for more updates about the ways in which the Google Research community is boldly exploring the frontiers of these technologies and using them responsibly to benefit people’s lives around the world. We hope you're as excited as we are about the future of AI technologies and we invite you to engage with our teams through the references, sites and tools that we’ve highlighted here.


Source: Google AI Blog


Larger language models do in-context learning differently

There have recently been tremendous advances in language models, partly because they can perform tasks with strong performance via in-context learning (ICL), a process whereby models are prompted with a few examples of input-label pairs before performing the task on an unseen evaluation example. In general, models’ success at in-context learning is enabled by:

  • Their use of semantic prior knowledge from pre-training to predict labels while following the format of in-context examples (e.g., seeing examples of movie reviews with “positive sentiment” and “negative sentiment” as labels and performing sentiment analysis using prior knowledge).
  • Learning the input-label mappings in context from the presented examples (e.g., finding a pattern that positive reviews should be mapped to one label, and negative reviews should be mapped to a different label).

In “Larger language models do in-context learning differently”, we aim to learn about how these two factors (semantic priors and input-label mappings) interact with each other in ICL settings, especially with respect to the scale of the language model that’s used. We investigate two settings to study these two factors — ICL with flipped labels (flipped-label ICL) and ICL with semantically-unrelated labels (SUL-ICL). In flipped-label ICL, labels of in-context examples are flipped so that semantic priors and input-label mappings disagree with each other. In SUL-ICL, labels of in-context examples are replaced with words that are semantically unrelated to the task presented in-context. We found that overriding prior knowledge is an emergent ability of model scale, as is the ability to learn in-context with semantically-unrelated labels. We also found that instruction tuning strengthens the use of prior knowledge more than it increases the capacity to learn input-label mappings.

An overview of flipped-label ICL and semantically-unrelated label ICL (SUL-ICL), compared with regular ICL, for a sentiment analysis task. Flipped-label ICL uses flipped labels, forcing the model to override semantic priors in order to follow the in-context examples. SUL-ICL uses labels that are not semantically related to the task, which means that models must learn input-label mappings in order to perform the task because they can no longer rely on the semantics of natural language labels.

Experiment design

For a diverse dataset mixture, we experiment on seven natural language processing (NLP) tasks that have been widely used: sentiment analysis, subjective/objective classification, question classification, duplicated-question recognition, entailment recognition, financial sentiment analysis, and hate speech detection. We test five language model families, PaLM, Flan-PaLM, GPT-3, InstructGPT, and Codex.


Flipped labels

In this experiment, labels of in-context examples are flipped, meaning that prior knowledge and input-label mappings disagree (e.g., sentences containing positive sentiment labeled as “negative sentiment”), thereby allowing us to study whether models can override their priors. In this setting, models that are able to override prior knowledge and learn input-label mappings in-context should experience a decrease in performance (since ground-truth evaluation labels are not flipped).

The ability to override semantic priors when presented with flipped in-context example labels emerges with model scale. Smaller models cannot flip predictions to follow flipped labels (performance only decreases slightly), while larger models can do so (performance decreases to well below 50%).

We found that when no labels are flipped, larger models have better performance than smaller models (as expected). But when we flip more and more labels, the performance of small models stays relatively flat, but large models experience large performance drops to well-below random guessing (e.g., 90% → 22.5% for code-davinci-002).

These results indicate that large models can override prior knowledge from pre-training when contradicting input-label mappings are presented in-context. Small models can’t do this, making this ability an emergent phenomena of model scale.


Semantically-unrelated labels

In this experiment, we replace labels with semantically-irrelevant ones (e.g., for sentiment analysis, we use “foo/bar” instead of “negative/positive”), which means that the model can only perform ICL by learning from input-label mappings. If a model mostly relies on prior knowledge for ICL, then its performance should decrease after this change since it will no longer be able to use semantic meanings of labels to make predictions. A model that can learn input–label mappings in-context, on the other hand, would be able to learn these semantically-unrelated mappings and should not experience a major drop in performance.

Small models rely more on semantic priors than large models do, as indicated by the greater decrease in performance for small models than for large models when using semantically-unrelated labels (i.e., targets) instead of natural language labels. For each plot, models are shown in order of increasing model size (e.g., for GPT-3 models, a is smaller than b, which is smaller than c).

Indeed, we see that using semantically-unrelated labels results in a greater performance drop for small models. This suggests that smaller models primarily rely on their semantic priors for ICL rather than learning from the presented input-label mappings. Large models, on the other hand, have the ability to learn input-label mappings in-context when the semantic nature of labels is removed.

We also find that including more in-context examples (i.e., exemplars) results in a greater performance improvement for large models than it does for small models, indicating that large models are better at learning from in-context examples than small models are.

In the SUL-ICL setup, larger models benefit more from additional examples than smaller models do.

Instruction tuning

Instruction tuning is a popular technique for improving model performance, which involves tuning models on various NLP tasks that are phrased as instructions (e.g., “Question: What is the sentiment of the following sentence, ‘This movie is great.’ Answer: Positive”). Since the process uses natural language labels, however, an open question is whether it improves the ability to learn input-label mappings or whether it strengthens the ability to recognize and apply semantic prior knowledge. Both of these would lead to an improvement in performance on standard ICL tasks, so it’s unclear which of these occur.

We study this question by running the same two setups as before, only this time we focus on comparing standard language models (specifically, PaLM) with their instruction-tuned variants (Flan-PaLM).

First, we find that Flan-PaLM is better than PaLM when we use semantically-unrelated labels. This effect is very prominent in small models, as Flan-PaLM-8B outperforms PaLM-8B by 9.6% and almost catches up to PaLM-62B. This trend suggests that instruction tuning strengthens the ability to learn input-label mappings, which isn’t particularly surprising.

Instruction-tuned language models are better at learning input–label mappings than pre-training–only language models are.

More interestingly, we saw that Flan-PaLM is actually worse than PaLM at following flipped labels, meaning that the instruction tuned models were unable to override their prior knowledge (Flan-PaLM models don’t reach below random guessing with 100% flipped labels, but PaLM models without instruction tuning can reach 31% accuracy in the same setting). These results indicate that instruction tuning must increase the extent to which models rely on semantic priors when they’re available.

Instruction-tuned models are worse than pre-training–only models at learning to override semantic priors when presented with flipped labels in-context.

Combined with the previous result, we conclude that although instruction tuning improves the ability to learn input-label mappings, it strengthens the usage of semantic prior knowledge more.


Conclusion

We examined the extent to which language models learn in-context by utilizing prior knowledge learned during pre-training versus input-label mappings presented in-context.

We first showed that large language models can learn to override prior knowledge when presented with enough flipped labels, and that this ability emerges with model scale. We then found that successfully doing ICL using semantically-unrelated labels is another emergent ability of model scale. Finally, we analyzed instruction-tuned language models and saw that instruction tuning improves the capacity to learn input-label mappings but also strengthens the use of semantic prior knowledge even more.


Future work

These results underscore how the ICL behavior of language models can change depending on their scale, and that larger language models have an emergent ability to map inputs to many types of labels, a form of reasoning in which input-label mappings can potentially be learned for arbitrary symbols. Future research could help provide insights on why these phenomena occur with respect to model scale.


Acknowledgements

This work was conducted by Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma. We would like to thank Sewon Min and our fellow collaborators at Google Research for their advice and helpful discussions.

Source: Google AI Blog


Introducing MediaPipe Solutions for On-Device Machine Learning

Posted by Paul Ruiz, Developer Relations Engineer & Kris Tonthat, Technical Writer

MediaPipe Solutions is available in preview today

This week at Google I/O 2023, we introduced MediaPipe Solutions, a new collection of on-device machine learning tools to simplify the developer process. This is made up of MediaPipe Studio, MediaPipe Tasks, and MediaPipe Model Maker. These tools provide no-code to low-code solutions to common on-device machine learning tasks, such as audio classification, segmentation, and text embedding, for mobile, web, desktop, and IoT developers.

image showing a 4 x 2 grid of solutions via MediaPipe Tools

New solutions

In December 2022, we launched the MediaPipe preview with five tasks: gesture recognition, hand landmarker, image classification, object detection, and text classification. Today we’re happy to announce that we have launched an additional nine tasks for Google I/O, with many more to come. Some of these new tasks include:

  • Face Landmarker, which detects facial landmarks and blendshapes to determine human facial expressions, such as smiling, raised eyebrows, and blinking. Additionally, this task is useful for applying effects to a face in three dimensions that matches the user’s actions.
moving image showing a human with a racoon face filter tracking a range of accurate movements and facial expressions
  • Image Segmenter, which lets you divide images into regions based on predefined categories. You can use this functionality to identify humans or multiple objects, then apply visual effects like background blurring.
moving image of two panels showing a person on the left and how the image of that person is segmented into rergions on the right
  • Interactive Segmenter, which takes the region of interest in an image, estimates the boundaries of an object at that location, and returns the segmentation for the object as image data.
moving image of a dog  moving around as the interactive segmenter identifies boundaries and segments

Coming soon

  • Image Generator, which enables developers to apply a diffusion model within their apps to create visual content.
moving image showing the rendering of an image of a puppy among an array of white and pink wildflowers in MediaPipe from a prompt that reads, 'a photo realistic and high resolution image of a cute puppy with surrounding flowers'
  • Face Stylizer, which lets you take an existing style reference and apply it to a user’s face.
image of a 4 x 3 grid showing varying iterations of a known female and male face acrosss four different art styles

MediaPipe Studio

Our first MediaPipe tool lets you view and test MediaPipe-compatible models on the web, rather than having to create your own custom testing applications. You can even use MediaPipe Studio in preview right now to try out the new tasks mentioned here, and all the extras, by visiting the MediaPipe Studio page.

In addition, we have plans to expand MediaPipe Studio to provide a no-code model training solution so you can create brand new models without a lot of overhead.

moving image showing Gesture Recognition in MediaPipe Studio

MediaPipe Tasks

MediaPipe Tasks simplifies on-device ML deployment for web, mobile, IoT, and desktop developers with low-code libraries. You can easily integrate on-device machine learning solutions, like the examples above, into your applications in a few lines of code without having to learn all the implementation details behind those solutions. These currently include tools for three categories: vision, audio, and text.

To give you a better idea of how to use MediaPipe Tasks, let’s take a look at an Android app that performs gesture recognition.

moving image showing Gesture Recognition across a series of hand gestures in MediaPipe Studio including closed fist, victory, thumb up, thumb down, open palm and i love you.

The following code will create a GestureRecognizer object using a built-in machine learning model, then that object can be used repeatedly to return a list of recognition results based on an input image:

// STEP 1: Create a gesture recognizer val baseOptions = BaseOptions.builder() .setModelAssetPath("gesture_recognizer.task") .build() val gestureRecognizerOptions = GestureRecognizerOptions.builder() .setBaseOptions(baseOptions) .build() val gestureRecognizer = GestureRecognizer.createFromOptions( context, gestureRecognizerOptions) // STEP 2: Prepare the image val mpImage = BitmapImageBuilder(bitmap).build() // STEP 3: Run inference val result = gestureRecognizer.recognize(mpImage)

As you can see, with just a few lines of code you can implement seemingly complex features in your applications. Combined with other Android features, like CameraX, you can provide delightful experiences for your users.

Along with simplicity, one of the other major advantages to using MediaPipe Tasks is that your code will look similar across multiple platforms, regardless of the task you’re using. This will help you develop even faster as you can reuse the same logic for each application.


MediaPipe Model Maker

While being able to recognize and use gestures in your apps is great, what if you have a situation where you need to recognize custom gestures outside of the ones provided by the built-in model? That’s where MediaPipe Model Maker comes in. With Model Maker, you can retrain the built-in model on a dataset with only a few hundred examples of new hand gestures, and quickly create a brand new model specific to your needs. For example, with just a few lines of code you can customize a model to play Rock, Paper, Scissors.

image showing 5 examples of the 'paper' hand gesture in the top row and 5 exaples of the 'rock' hand gesture on the bottom row

from mediapipe_model_maker import gesture_recognizer # STEP 1: Load the dataset. data = gesture_recognizer.Dataset.from_folder(dirname='images') train_data, validation_data = data.split(0.8) # STEP 2: Train the custom model. model = gesture_recognizer.GestureRecognizer.create( train_data=train_data, validation_data=validation_data, hparams=gesture_recognizer.HParams(export_dir=export_dir) ) # STEP 3: Evaluate using unseen data. metric = model.evaluate(test_data) # STEP 4: Export as a model asset bundle. model.export_model(model_name='rock_paper_scissor.task')

After retraining your model, you can use it in your apps with MediaPipe Tasks for an even more versatile experience.

moving image showing Gesture Recognition in MediaPipe Studio recognizing rock, paper, and scissiors hand gestures

Getting started

To learn more, watch our I/O 2023 sessions: Easy on-device ML with MediaPipe, Supercharge your web app with machine learning and MediaPipe, and What's new in machine learning, and check out the official documentation over on developers.google.com/mediapipe.


What’s next?

We will continue to improve and provide new features for MediaPipe Solutions, including new MediaPipe Tasks and no-code training through MediaPipe Studio. You can also keep up to date by joining the MediaPipe Solutions announcement group, where we send out announcements as new features are available.

We look forward to all the exciting things you make, so be sure to share them with @googledevs and your developer communities!

New Resources to Build with Google AI

Posted by Jaimie Hwang, ML Product Marketing and Danu Mbanga, ML Product Management

Today's development environment is only getting more complex and as machine learning becomes increasingly integrated with mobile, web, and cloud platforms, developers are looking for clear pathways to cut through this growing complexity. To help developers of all levels, we've unified our machine learning products, tools, and guidance on Google AI, so you can spend less time searching, and more time building AI solutions.

Whether you are looking for a pre-trained dataset, the latest in Generative AI, or tracking the latest announcements from Google I/O, you’ll be able to find it on ai.google/build/machinelearning. It’s a single destination for building AI solutions, no matter where you are in your machine learning workflow, or where your models are deployed.

cropped screenshot of Google AI landing page

Toolkits: end-to-end guidance

We're taking it one step further with new toolkits that provide you end-to-end guidance to build the latest AI solutions. These toolkits combine many of our products, many of which are open source, alongside a walkthrough so you can learn best practices and implement code. Check out how to build a text classifier using Keras or how you can take a large language model and shrink it to run on Android using Keras and TensorFlow Lite. And we are just getting started. These are our first two toolkits but have many more to come soon – so stay tuned!

moving image of finding a toolkit on Google AI to build an LLM on Android and Keras and TensorFlow Lite
Toolkit to build a LLM on Android with Keras and TensorFlow Lite

Whether you're just starting out with machine learning or you're an experienced developer looking for the latest tools and resources, Google AI has the resources you need to build AI solutions. Visit ai.google/build/machinelearning today to learn more.

Build smarter Android apps with on-device Machine Learning

Posted by Thomas Ezan, Developer Relations

In the past year, the Android team made significant improvements to on-device machine learning to help developers create smarter apps with more features to process images, sound, and text. In the Google I/O talk Build smarter Android apps with on-device Machine Learning, David Miro-Llopis PM on ML Kit and Thomas Ezan Android Developer Relation Engineer review new Android APIs and solutions and showcase applications using on-device ML.

Running ML processes on-device enables low-latency, increases data-privacy, enables offline support and potentially reduces cloud bill. Applications such as Lens AR Translate or the document scanning feature available in Files in India, benefit from the advantages of on-device ML.

To deploy ML features on Android, developers have two options:

  • ML Kit: which offers production-ready ML solutions to common user flows, via easy-to-use APIs.
  • Android’s custom ML stack: which is built on top of Tensorflow Lite, and provides control over the inference process and the user experience.

ML Kit released new APIs and improved existing features

Over the last year, the ML Kit team worked on both improving existing APIs and launching new ones: face mesh and document scanner. ML Kit is launching a new document scanner API in Q3 2023, that will provide a consistent scanning experience across apps in Android. Developers will be able to use it only with a few lines of code, without needing camera permission and with low apk size impact (given that it will be distributed via Google Play Services. In a similar fashion, Google code scanner is now generally available and provides a consistent scanning experience across apps, without needing camera permission, via Google Play Services.

Image a series of three photos of two girls smiling to show how face mesh improves facial recognition

Additionally, ML Kit improved the performance of the following APIs: barcode detection (by 17%), text recognition, digital ink recognition, pose detection, translation, and smart reply. ML Kit also integrated some APIs to Google Play Services so you don’t have to bundle the models to your application. Many developers are using ML Kit to easily integrate machine learning into their apps; for example, WPS uses ML Kit to translate text in 43 languages and save $65M a year.


Acceleration Service in Android’s custom ML stack is now in public beta

To support custom machine learning, the Android ML team is actively developing Android’s custom ML stack. Last year, TensorFlow Lite and GPU delegates were added to the Google Play Services which lets developers use TensorFlow Lite without bundling it to their app and provides automatic updates. With improved inference performance, hardware acceleration can in turn also significantly improve the user experience of your ML-enabled Android app. This year, the team is also announcing Acceleration Service, a new API enabling developers to pick the optimal hardware acceleration configuration at runtime. It is now in public beta and developers can learn more and get started here.

To learn more, watch the video:

Machine Learning Communities: Q1 ‘23 highlights and achievements

Posted by Nari Yoon, Bitnoori Keum, Hee Jung, DevRel Community Manager / Soonson Kwon, DevRel Program Manager

Let’s explore highlights and accomplishments of vast Google Machine Learning communities over the first quarter of 2023. We are enthusiastic and grateful about all the activities by the global network of ML communities. Here are the highlights!



ML Campaigns



ML Community Sprint

ML Community Sprint is a campaign, a collaborative attempt bridging ML GDEs with Googlers to produce relevant content for the broader ML community. Throughout Feb and Mar, MediaPipe/TF Recommendation Sprint was carried out and 5 projects were completed.


ML Olympiad 2023

I'm hosting a competiton ML Olympiad 2023 #MLOlympiad

ML Olympiad is an associated Kaggle Community Competitions hosted by ML GDE, TFUG, 3rd-party ML communities, supported by Google Developers. The second, ML Olympiad 2023 has wrapped up successfully with 17 competitions and 300+ participants addressing important issues of our time - diversity, environments, etc. Competition highlights include Breast Cancer Diagnosis, Water Quality Prediction, Detect ChatGpt answers, Ensure healthy lives, etc. Thank you all for participating in ML Olympiad 2023!

Also, “ML Paper Reading Clubs” (GalsenAI and TFUG Dhaka), “ML Math Clubs” (TFUG Hajipur and TFUG Dhaka) and “ML Study Jams” (TFUG Bauchi) were hosted by ML communities around the world.


Community Highlights



Keras


Screen shot of Fine-tuning Stable Diffusion using Keras

Various ways of serving Stable Diffusion by ML GDE Chansung Park (Korea) and ML GDE Sayak Paul (India) shares how to deploy Stable Diffusion with TF Serving, Hugging Face Endpoint, and FastAPI. Their other project Fine-tuning Stable Diffusion using Keras provides how to fine-tune the image encoder of Stable Diffusion on a custom dataset consisting of image-caption pairs.

Serving TensorFlow models with TFServing by ML GDE Dimitre Oliveira (Brazil) is a tutorial explaining how to create a simple MobileNet using the Keras API and how to serve it with TF Serving.

Fine-tuning the multilingual T5 model from Huggingface with Keras by ML GDE Radostin Cholakov (Bulgaria) shows a minimalistic approach for training text generation architectures from Hugging Face with TensorFlow and Keras as the backend.


Image showing a range of low-lit pictures enhanced incljuding inference time and ther metrics

Lighting up Images in the Deep Learning Era by ML GDE Soumik Rakshit (India), ML GDE Saurav Maheshkar (UK), ML GDE Aritra Roy Gosthipaty (India), and Samarendra Dash explores deep learning techniques for low-light image enhancement. The article also talks about a library, Restorers, providing TensorFlow and Keras implementations of SoTA image and video restoration models for tasks such as low-light enhancement, denoising, deblurring, super-resolution, etc.

How to Use Cosine Decay Learning Rate Scheduler in Keras? by ML GDE Ayush Thakur (India) introduces how to correctly use the cosine-decay learning rate scheduler using Keras API.


Screen shot of Implementation of DreamBooth using KerasCV and TensorFlow

Implementation of DreamBooth using KerasCV and TensorFlow (Keras.io tutorial) by ML GDE Sayak Paul (India) and ML GDE Chansung Park (Korea) demonstrates DreamBooth technique to fine-tune Stable Diffusion in KerasCV and TensorFlow. Training code, inference notebooks, a Keras.io tutorial, and more are in the repository. Sayak also shared his story, [ML Story] DreamBoothing Your Way into Greatness on the GDE blog.

Focal Modulation: A replacement for Self-Attention by ML GDE Aritra Roy Gosthipaty (India) shares a Keras implementation of the paper. Usha Rengaraju (India) shared Keras Implementation of NeurIPS 2021 paper, Augmented Shortcuts for Vision Transformers.

Images classification with TensorFlow & Keras (video) by TFUG Abidjan explained how to define an ML model that can classify images according to the category using a CNN.

Hands-on Workshop on KerasNLP by GDG NYC, GDG Hoboken, and Stevens Institute of Technology shared how to use pre-trained Transformers (including BERT) to classify text, fine-tune it on custom data, and build a Transformer from scratch.


On-device ML

Stable diffusion example in an android application — Part 1 & Part 2 by ML GDE George Soloupis (Greece) demonstrates how to deploy a Stable Diffusion pipeline inside an Android app.

AI for Art and Design by ML GDE Margaret Maynard-Reid (United States) delivered a brief overview of how AI can be used to assist and inspire artists & designers in their creative space. She also shared a few use cases of on-device ML for creating artistic Android apps.


ML Engineering (MLOps)


Overall system architecture of End-to-End Pipeline for Segmentation with TFX, Google Cloud, and Hugging Face

End-to-End Pipeline for Segmentation with TFX, Google Cloud, and Hugging Face by ML GDE Sayak Paul (India) and ML GDE Chansung Park (Korea) discussed the crucial details of building an end-to-end ML pipeline for Semantic Segmentation tasks with TFX and various Google Cloud services such as Dataflow, Vertex Pipelines, Vertex Training, and Vertex Endpoint. The pipeline uses a custom TFX component that is integrated with Hugging Face Hub - HFPusher.

Extend your TFX pipeline with TFX-Addons by ML GDE Hannes Hapke (United States) explains how you can use the TFX-Addons components or examples.



Textual Inversion Pipeline architecture

Textual Inversion Pipeline for Stable Diffusion by ML GDE Chansung Park (Korea) demonstrates how to manage multiple models and their prototype applications of fine-tuned Stable Diffusion on new concepts by Textual Inversion.

Running a Stable Diffusion Cluster on GCP with tensorflow-serving (Part 1 | Part 2) by ML GDE Thushan Ganegedara (Australia) explains how to set up a GKE cluster, how to use Terraform to set up and manage infrastructure on GCP, and how to deploy a model on GKE using TF Serving.


Photo of Googler Joinal Ahmed giving a talk at TFUG Bangalore

Scalability of ML Applications by TFUG Bangalore focused on the challenges and solutions related to building and deploying ML applications at scale. Googler Joinal Ahmed gave a talk entitled Scaling Large Language Model training and deployments.

Discovering and Building Applications with Stable Diffusion by TFUG São Paulo was for people who are interested in Stable Diffusion. They shared how Stable Diffusion works and showed a complete version created using Google Colab and Vertex AI in production.


Responsible AI


Thumbnail image for Between the Brackets Fairness & Ethics in AI: Perspectives from Journalism, Medicine and Translation

In Fairness & Ethics In AI: From Journalism, Medicine and Translation, ML GDE Samuel Marks (United States) discussed responsible AI.

In The new age of AI: A Convo with Google Brain, ML GDE Vikram Tiwari (United States) discussed responsible AI, open-source vs. closed-source, and the future of LLMs.

Responsible IA Toolkit (video) by ML GDE Lesly Zerna (Bolivia) and Google DSC UNI was a meetup to discuss ethical and sustainable approaches to AI development. Lesly shared about the “ethic” side of building AI products as well as learning about “Responsible AI from Google”, PAIR guidebook, and other experiences to build AI.

Women in AI/ML at Google NYC by GDG NYC discussed hot topics, including LLMs and generative AI. Googler Priya Chakraborty gave a talk entitled Privacy Protections for ML Models.


ML Research

Efficient Task-Oriented Dialogue Systems with Response Selection as an Auxiliary Task by ML GDE Radostin Cholakov (Bulgaria) showcases how, in a task-oriented setting, the T5-small language model can perform on par with existing systems relying on T5-base or even bigger models.

Learning JAX in 2023: Part 1 / Part 2 / Livestream video by ML GDE Aritra Roy Gosthipaty (India) and ML GDE Ritwik Raha (India) covered the power tools of JAX, namely grad, jit, vmap, pmap, and also discussed the nitty-gritty of randomness in JAX.


Screen grab from JAX Streams: Parallelism with Flax | Ep4 with David Cardozo and Cristian Garcia

In Deep Learning Mentoring MILA Quebec, ML GDE David Cardozo (Canada) did mentoring for M.Sc and Ph.D. students who have interests in JAX and MLOps. JAX Streams: Parallelism with Flax | EP4 by David and ML GDE Cristian Garcia (Columbia) explored Flax’s new APIs to support parallelism.

March Machine Learning Meetup hosted by TFUG Kolkata. Two sessions were delivered: 1) You don't know TensorFlow by ML GDE Sayak Paul (India) presented some under-appreciated and under-used features of TensorFlow. 2) A Guide to ML Workflows with JAX by ML GDE Aritra Roy Gosthipaty (India), ML GDE Soumik Rakshit (India), and ML GDE Ritwik Raha (India) delivered on how one could think of using JAX functional transformations for their ML workflows.

A paper review of PaLM-E: An Embodied Multimodal Language Model by ML GDE Grigory Sapunov (UK) explained the details of the model. He also shared his slide deck about NLP in 2022.

An annotated paper of On the importance of noise scheduling in Diffusion Models by ML GDE Aakash Nain (India) outlined the effects of noise schedule on the performance of diffusion models and strategies to get a better schedule for optimal performance.


TensorFlow

Three projects were awarded as TF Community Spotlight winners: 1) Semantic Segmentation model within ML pipeline by ML GDE Chansung Park (Korea), ML GDE Sayak Paul (India), and ML GDE Merve Noyan (France), 2) GatedTabTransformer in TensorFlow + TPU / in Flax by Usha Rengaraju, and 3) Real-time Object Detection in the browser with YOLOv7 and TF.JS by ML GDE Hugo Zanini (Brazil).

Building ranking models powered by multi-task learning with Merlin and TensorFlow by ML GDE Gabriel Moreira (Brazil) describes how to build TensorFlow models with Merlin for recommender systems using multi-task learning.


Transform your Web Apps with Machine Learning: Unleashing the Power of Open-Source Python Libraries like TensorFlow Hub & Gradio Bhjavesh Bhatt @_bhaveshbhatt

Building ML Powered Web Applications using TensorFlow Hub & Gradio (slide) by ML GDE Bhavesh Bhatt (India) demonstrated how to use TF Hub & Gradio to create a fully functional ML-powered web application. The presentation was held as part of an event called AI Evolution with TensorFlow, covering the fundamentals of ML & TF, hosted by TFUG Nashik.

create-tf-app (repository) by ML GDE Radostin Cholakov (Bulgaria) shows how to set up and maintain an ML project in Tensorflow with a single script.


Cloud

Creating scalable ML solutions to support big techs evolution (slide) by ML GDE Mikaeri Ohana (Brazil) shared how Google can help big techs to generate impact through ML with scalable solutions.

Search of Brazilian Laws using Dialogflow CX and Matching Engine by ML GDE Rubens Zimbres (Brazil) shows how to build a chatbot with Dialogflow CX and query a database of Brazilian laws by calling an endpoint in Cloud Run.


4x4 grid of sample results from Vintedois Diffusion model

Stable Diffusion Finetuning by ML GDE Pedro Gengo (Brazil) and ML GDE Piero Esposito (Brazil) is a fine-tuned Stable Diffusion 1.5 with more aesthetic images. They used Vertex AI with multiple GPUs to fine-tune it. It reached Hugging Face top 3 and more than 150K people downloaded and tested it.