Tag Archives: User Experience

Performance Class helps Google Maps deliver premium experiences

Posted by Nevin Mital - Developer Relations Engineer, Android Media

The Android ecosystem features a diverse range of devices, and it can be difficult to build experiences that take advantage of new or premium hardware features while still working well for users on all devices. With Android 12, we introduced the Media Performance Class (MPC) standard to help developers better understand a device’s capabilities and identify high-performing devices. For a refresher on what MPC is, please see our last blog post, Using performance class to optimize your user experience, or check out the Performance Class documentation.

Earlier this year, we published the first stable release of the Jetpack Core Performance library as the recommended solution for more reliably obtaining a device’s MPC level. In particular, this library introduces the PlayServicesDevicePerformance class, an API that queries Google Play Services to get the most up-to-date MPC level for the current device and build. I’ll get into the technical details further down, but let’s start by taking a look at how Google Maps was able to tailor a feature launch to best fit each device with MPC.

Performance Class unblocks premium experience launch for Google Maps

Google Maps recently took advantage of the expanded device coverage enabled by the Play Services module to unblock a feature launch. Google Maps wanted to update its UI by increasing the transparency of some layers, which meant rendering more of the map. The team had to pause the rollout after seeing latency increases on many devices, especially towards the low end. To resolve this, the Maps team started by slicing an existing key metric, “seconds to UI item visibility”, by MPC level. This revealed that while all devices saw a small increase in this latency, devices without an MPC level had the largest increase.

A bar graph displays A/B test results for Seconds to UI item visibility, comparing control results with those using increased transparency across different Media Performance Class Levels.  A green horizontal line and text indicate the updated experience shipped to devices qualifying for MPC. A vertical green dotted line separates results for devices without a specific MPC level, which kept the previous UI.

With these results in hand, Google Maps started their rollout again, but this time only launching the feature on devices that report an MPC level. As devices continue to get updated and meet the bar for MPC, the updated Google Maps UI will be available to them as well.

The new Play Services module

MPC level requirements are defined in the Android Compatibility Definition Document (CDD), then devices and Android builds are validated against these requirements by the Android Compatibility Test Suite (CTS). The Play Services module of the Jetpack Core Performance library leverages these test results to continually update a device’s reported MPC level without any additional effort on your end. This also means that you’ll immediately have access to the MPC level for new device launches without needing to acquire and test each device yourself, since it already passed CTS. If the MPC level is not available from Google Play Services, the library will fall back to the MPC level declared by the OEM as a build constant.

A flowchart depicts the process of determining Performance Class levels for Android devices, involving manufacturers, CTS tests, a Grader, the Play Services module, and the CDD.

As of writing, more than 190M in-market devices covering over 500 models across 40+ brands report an MPC level. This coverage will continue to grow as older devices running Android 11 and up are updated to newer builds.

Using the Core Performance library

To use Jetpack Core Performance, start by adding a dependency for the relevant modules in your Gradle configuration, and create an instance of DevicePerformance. Initializing a DevicePerformance should only happen once in your app, as early as possible - for example, in the onCreate() lifecycle event of your Application. In this example, we’ll use the Google Play services implementation of DevicePerformance.

// Implementation of Jetpack Core library.
implementation("androidx.core:core-ktx:1.12.0")
// Enable APIs to query for device-reported performance class.
implementation("androidx.core:core-performance:1.0.0")
// Enable APIs to query Google Play Services for performance class.
implementation("androidx.core:core-performance-play-services:1.0.0")

import androidx.core.performance.play.services.PlayServicesDevicePerformance

class MyApplication : Application() {
  lateinit var devicePerformance: DevicePerformance

  override fun onCreate() {
    // Use a class derived from the DevicePerformance interface
    devicePerformance = PlayServicesDevicePerformance(applicationContext)
  }
}
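
If you can’t take on the Google Play Services dependency, the core module also ships a static implementation that only reads the MPC level declared by the OEM in the build. Here is a minimal sketch, assuming the DefaultDevicePerformance class from androidx.core:core-performance:

import androidx.core.performance.DefaultDevicePerformance

// Static alternative: reads only the OEM-declared MPC level from the device build,
// without the Google Play Services updates described above.
val devicePerformance: DevicePerformance = DefaultDevicePerformance()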

Then, later in your app when you want to retrieve the device’s MPC level, read the mediaPerformanceClass property (getMediaPerformanceClass() from Java):

class MyActivity : Activity() {
  private lateinit var devicePerformance: DevicePerformance
  override fun onCreate(savedInstanceState: Bundle?) {
    super.onCreate(savedInstanceState)
    // Note: Good app architecture is to use a dependency framework. See
    // https://developer.android.com/training/dependency-injection for more
    // information.
    devicePerformance = (application as MyApplication).devicePerformance
  }

  override fun onResume() {
    super.onResume()
    when {
      devicePerformance.mediaPerformanceClass >= Build.VERSION_CODES.UPSIDE_DOWN_CAKE -> {
        // MPC level 34 and later.
        // Provide the most premium experience for the highest performing devices.
      }
      devicePerformance.mediaPerformanceClass == Build.VERSION_CODES.TIRAMISU -> {
        // MPC level 33.
        // Provide a high quality experience.
      }
      else -> {
        // MPC level 31, 30, or undefined.
        // Remove extras to keep experience functional.
      }
    }
  }
}

Strategies for using Performance Class

MPC is intended to identify high-end devices, so you can expect to see MPC levels for the top devices from each year, which are the devices you’ll likely want to support for the longest time. For example, the Pixel 9 Pro launched with Android 14 and reports an MPC level of 34, the latest level definition at the time it launched.

You should use MPC as a complement to any existing device clustering solutions you already use, such as querying a device’s static specs or manually blocklisting problematic devices. MPC is a particularly helpful tool for new device launches: new devices report an MPC level from day one, so you can gauge their capabilities right from the start without needing to acquire the hardware or test each device yourself.

A great first step to get involved is to include MPC levels in your telemetry. Segmenting key metrics and error reports by MPC level can help you identify patterns and gives you a better sense of the devices your user base is on. From there, you might consider using MPC as a dimension in your experimentation pipeline, for example by setting up A/B testing groups based on MPC level, or by starting a feature rollout with the highest MPC level and working your way down. As discussed previously, this is the approach that Google Maps took.
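
For example, if your telemetry uses Firebase Analytics (any analytics SDK with user-level dimensions works the same way), a minimal sketch of reporting the MPC level as a user property might look like this; the property name is an illustrative choice, not a standard one:

import android.content.Context
import androidx.core.performance.DevicePerformance
import com.google.firebase.analytics.FirebaseAnalytics

// Sketch: record the MPC level once per session so key metrics can be segmented by it.
fun reportMpcLevel(context: Context, devicePerformance: DevicePerformance) {
    FirebaseAnalytics.getInstance(context)
        .setUserProperty("media_performance_class", devicePerformance.mediaPerformanceClass.toString())
}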

You could further use MPC to tune a user-facing feature, for example by adjusting the number of concurrent video playbacks your app attempts based on the MPC level’s concurrent codec guarantees. However, make sure to still query a device’s runtime capabilities when using this approach, as they may differ depending on the environment and state the device is in.
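
As a rough illustration of that approach, the sketch below picks an app-defined concurrency target from the MPC level and then clamps it against the decoder limits the device reports at runtime. The per-tier counts are illustrative app choices, not CDD guarantees:

import android.media.MediaCodecList
import android.media.MediaFormat
import android.os.Build

fun chooseConcurrentPlaybacks(
    mpcLevel: Int,
    mimeType: String = MediaFormat.MIMETYPE_VIDEO_AVC
): Int {
    // App-defined tiers keyed off the MPC level (illustrative values).
    val target = when {
        mpcLevel >= Build.VERSION_CODES.UPSIDE_DOWN_CAKE -> 4
        mpcLevel >= Build.VERSION_CODES.TIRAMISU -> 3
        else -> 1
    }
    // Clamp against what the device's decoders report at runtime
    // (getMaxSupportedInstances is available from API 23).
    val maxInstances = MediaCodecList(MediaCodecList.REGULAR_CODECS).codecInfos
        .filter { !it.isEncoder && it.supportedTypes.any { t -> t.equals(mimeType, ignoreCase = true) } }
        .mapNotNull { runCatching { it.getCapabilitiesForType(mimeType).maxSupportedInstances }.getOrNull() }
        .maxOrNull() ?: 1
    return minOf(target, maxInstances)
}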

Get in touch!

If MPC sounds like it could be useful for your app, please give it a try! You can get started by taking a look at our sample code or documentation. We welcome you to share any questions or feedback you have in this short form.


This blog post is a part of Camera and Media Spotlight Week. We're providing resources – blog posts, videos, sample code, and more – all designed to help you uplevel the media experiences in your app.

To learn more about what Spotlight Week has to offer and how it can benefit you, be sure to read our overview blog post.

Reddit improved app startup speed by over 50% using Baseline Profiles and R8

Posted by Ben Weiss – Developer Relations Engineer, and Lauren Darcey – Senior Engineering Manager, Reddit

Reddit is one of the world’s largest internet forums, bringing together countless communities looking for entertainment, answers to everyday questions, and so much more.

Recently, the team optimized its Android app to reduce startup time and improve rendering performance using Baseline Profiles. But the team didn’t stop there: Reddit app developers also enabled Android’s R8 compiler in full mode to maximize bytecode optimization and used Jetpack Compose to rewrite legacy UI, improving both the user and developer experience.

Maximizing optimization using Baseline Profiles and R8 full mode

The Reddit Android app has undergone countless performance upgrades over the years. Reddit developers have long since cleared the list of quick and easy tasks for optimization, but the team still wants to improve the app, bringing its performance to the next level and ensuring it runs well on every Android device.

“Reddit is looking for any strategic improvement to its app performance so we can make the app experience better for new and existing users,” said Rob McWhinnie, a staff engineer at Reddit. “Baseline Profiles fit this use case well since they are based on critical user journeys.”

Reddit’s platform engineering team used screen-specific performance metrics and observability to help its feature teams improve key metrics like time to interactive and scroll performance. Baseline Profiles were a natural fit to help improve these metrics and the user experience behind them, so the team integrated them to make tracking and optimizing easier, using insights from geodata and device classes.

The team has built Baseline Profiles for five critical user journeys so far: scrolling the home feed, logging in, launching the full-screen video player, navigating between subreddits and scrolling their feeds, and using the chat feature.

Simplifying Baseline Profile management in their continuous integration processes enabled Reddit to remove the need for manual maintenance and streamline optimization. Now, Baseline Profiles are automatically regenerated for each release.
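
For context, a Baseline Profile generator is just an instrumented test built on the Jetpack Macrobenchmark library. The following is a minimal sketch of what one of these journey generators might look like; the package name and scroll gesture are placeholders, not Reddit’s actual implementation:

import androidx.benchmark.macro.junit4.BaselineProfileRule
import androidx.test.ext.junit.runners.AndroidJUnit4
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class HomeFeedBaselineProfile {
  @get:Rule
  val rule = BaselineProfileRule()

  @Test
  fun generate() = rule.collect(packageName = "com.example.app") {
    // Exercise the critical user journey: cold start, then scroll the feed.
    pressHome()
    startActivityAndWait()
    // Simple swipe as a stand-in; a real journey would drive the feed list via UiAutomator selectors.
    device.swipe(
      device.displayWidth / 2, device.displayHeight * 3 / 4,
      device.displayWidth / 2, device.displayHeight / 4,
      /* steps = */ 20
    )
    device.waitForIdle()
  }
}

Wiring a test like this into CI lets the profile be regenerated and packaged with every release build.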

Enabling Android’s R8 optimization compiler in full mode was another area Reddit engineers worked on. The team had already used R8 in compatibility mode, but some of Reddit’s legacy code would’ve made implementing R8’s more aggressive features difficult. The team worked through the app’s existing technical debt first, making it easier to integrate R8's full mode capabilities and maximize Android app optimization.
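
For reference, R8 full mode is controlled by the android.enableR8.fullMode Gradle property (it became the default in Android Gradle Plugin 8.0) and runs whenever minification is enabled. A typical release configuration might look roughly like this; the rules file names are the usual defaults, not Reddit’s specific setup:

// app/build.gradle.kts
android {
    buildTypes {
        release {
            isMinifyEnabled = true   // runs R8 on release builds
            isShrinkResources = true // also strips unused resources
            proguardFiles(
                getDefaultProguardFile("proguard-android-optimize.txt"),
                "proguard-rules.pro" // keep rules for legacy or reflective code paths
            )
        }
    }
}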

Quote card with image of Catherine Chi, Senior Engineer at Reddit, that reads: 'It’s now trivial to work with a team to instrument Baseline Profiles for their critical user journeys. We turn them around in a couple of hours and see results in production a week later.'

Improvements with Baseline Profiles and R8 full mode

Reddit's Baseline Profiles and R8 full mode optimization led to multiple performance improvements across the app, with early benchmarks of the first Baseline Profile for feeds showing a 51% median startup time improvement. While responses from Redditors initially confirmed large startup improvements, Baseline Profile optimizations for less frequent journeys, like logging in, saw fewer user reports.

Baseline Profiles for the home feed delivered a 36% reduction in the 95th percentile of frozen frames. Baseline Profiles for the community feed also delivered strong screen load and scroll performance improvements: at the 90th percentile, screen time to interactive improved by 12% and time to first draw decreased by 22%. Reddit’s scrolling performance also saw a 12% reduction in P90 slow frames.

The upgrade to R8 full mode led to an increase in Google Play average ratings. The proportion of global positive ratings (fours and fives) increased by four percent, with a notable decrease in negative reports. R8 full mode also reduced total application-not-responding errors by almost 30%.

Overall, the app saw cold start improvements of 20%, scroll performance improvements of 15%, and widespread enhancements on lower-end devices and in emerging markets. Google Play vitals saw improvements in slow cold starts, a 10% reduction in excessive frozen frames, and a 30% reduction in excessive slow frames. Nearly 75% of the screens refactored using Jetpack Compose experienced performance gains.

Quote card with image of Lauren Darcey, Senior Engineering Manager at Reddit, that reads: 'When you find a feature that users love and engage with, taking the time to refine and optimize it can be the difference between a good and a great experience for your users.'

Further optimizations using Jetpack Compose

Reddit adopted Jetpack Compose years ago and has since rebuilt much of its UI with the toolkit, benefitting both the app and its design system. According to the Reddit team, Google’s ongoing support for Compose’s stability and performance made it a strong fit as Reddit scaled its app, allowing for more efficient feature development and better performance.

One major example is Reddit’s feed rewrite using Compose, which resulted in more maintainable code and an improved developer experience. Compose enabled teams to focus on future work instead of being bogged down by legacy code, allowing them to fix bugs quickly and improve overall app stability.

“The R8 and Compose upgrades were important to deploy in relative isolation and stabilize,” said Drew Heavner, a staff engineer at Reddit. “We feel like we got great outcomes from this work for all teams adopting our modern tech stack and Compose.”

After upgrading to the September 2024 release of Compose, the latest iteration, Reddit saw significant performance gains across the board. Cold start times improved by 13%, excessive slow frames decreased by 25%, and frozen frames dropped by 10%. Low- and mid-tier devices saw even greater improvements where app start times improved by up to 40%, especially in markets with lower-performing devices.

Screens using Reddit’s modern Compose-powered design stack showed substantial improvements in both slow and frozen frame rates. For example, the home feed saw a 23% reduction in frozen frames, and scrolling performance visibly improved according to internal reviews. These updates were well received among users and reflected a 17% increase in the app’s Google Play average rating.

Quote card with image of the Android Bot peeking in from the right side that reads: 'Compose continues to deliver great new features for a more responsive user experience. It also provides stability and performance improvements we get to take advantage of.' — Eric Kuck, Principal Engineer at Reddit

Up-leveling UX through optimization

Adding value to an app isn’t just about introducing new features—it's about refining and optimizing the ones users already love. Investing in performance improvements made Reddit’s key features faster and more reliable, enhancing the overall user experience. These optimizations not only improved app startup and runtime performance but also simplified development workflows, increasing both developer satisfaction and app stability.

The focus on high-traffic features, such as feeds, has demonstrated the power of performance tuning, with substantial gains in user engagement and satisfaction. As the app has become more efficient, both users and developers have benefitted from a cleaner codebase and faster performance.

Looking ahead, Reddit plans to extend the usage of Baseline Profiles to other critical user journeys, including Reddit’s post and comment experiences, ensuring even more users benefit from these ongoing performance improvements.

Reddit’s platform engineers also want to continue collaborating with feature teams to integrate performance improvements across the app. These efforts will ensure that as the app evolves, it remains a smooth, fast, and engaging experience for all Redditors.

“Adding new features isn’t the only way to add value to an experience for users,” said Lauren Darcey, a senior engineering manager at Reddit. “When you find a feature that users love and engage with, taking the time to refine and optimize it can be the difference between a good and a great experience for your users.”

Get started

Improve your app performance using Baseline Profiles, R8 full mode, and Jetpack Compose.

Introducing Restore Credentials: Effortless account restoration for Android apps

Posted by Neelansh Sahai - Developer Relations Engineer

Did you know that, on average, 40% of the people in the US reset or replace their smartphones every year? This frequent device turnover presents a challenge – and an opportunity – for maintaining strong user relationships. When users get a new phone, the friction of re-entering login credentials can lead to frustration, app abandonment, and churn.

To address this issue, we're introducing Restore Credentials, a new feature of Android’s Credential Manager API. With Restore Credentials, apps can seamlessly onboard users to their accounts on a new device after they restore their apps and data from their previous device. This makes the transition to a new device effortless and fosters loyalty and long-term relationships.

On top of all this, there's no developer effort required to transfer a restore key from one device to the other, as this process is tied to the Android system’s backup and restore mechanism. However, if you want to log your users in silently as soon as the restore is completed, you might want to implement a BackupAgent and add your logic in the onRestore callback. The experience is delightful: users will continue being signed in as they were on their previous device, and they will be able to get notifications to easily access their content without even needing to open the app on the new device.

An illustration of the process of restoring app data and keys to a new device, highlighting automated steps and user interactions. The top row shows a user signing into an app and a restore key being saved locally, while the bottom row shows the restore process on a new device.

Some of the benefits of the Restore Credentials feature include:

    • Seamless user experience: Users can easily transition to a new Android device.
    • Immediate engagement: Engage users with notifications or other prompts as soon as they start using their new device.
    • Silent login with backup agent integration: If you're using a backup agent, users can be automatically logged back in after data restoration is complete.
    • Restore key checks without backup agent integration: If a backup agent isn't being used, the app can check for a restore key upon first launch and then log the user in automatically.

How does Restore Credentials work?

The Restore Credentials feature enables seamless user account restoration on a new device. This process occurs automatically in the background during device setup when a user restores apps and data from a previous device. By restoring app credentials, the feature allows the app to sign the user back in without requiring any additional interaction.

The credential type that’s supported for this feature is called restore key, which is a public key compatible with passkey / FIDO2 backends.

A diagram shows the device-to-device and cloud backup restore processes for app data and restore keys between old and new devices.  Steps are numbered and explained within the diagram.
Diagram depicting the restoration of app data to a new device using a restore credential, including creating the credential, initiating a restore flow, and automatic user sign-in.

User flow

On the old device:

    1. If the current signed-in user is trusted, you can generate a restore key at any point after they've authenticated in your app. For instance, this could be immediately after login or during a routine check for an existing restore key.
    2. The restore key is stored locally and backed up to the cloud. Apps can opt out of backing it up to the cloud.

On the new device:

    1. When setting up a new device, the user can select one of two options to restore data: restore from a cloud backup, or transfer the data locally from the old device. If the user transfers locally, the restore key is transferred directly from the old device to the new one. If the user restores from the cloud backup, the restore key is downloaded along with the app data.
    2. Once this restore key is available on the new device, the app can use it to log in the user on the new device silently in the background.
Note: You should delete the restore key as soon as the user signs out. You don’t want your user to get stuck in a cycle of signing out intentionally and then automatically getting logged back in.

How to implement Restore Credentials

Using the Jetpack Credential Manager lets you create, get, and clear the relevant Restore Credentials:

    • Create a Restore Credential: When the user signs in to your app, create a Restore Credential associated with their account. This credential is stored locally and synced to the cloud if the user has enabled Google Backup and end-to-end encryption is available. Apps can opt out of syncing to the cloud.
    • Get the Restore Credential: When the user sets up a new device, your app requests the Restore Credential from Credential Manager. This allows your user to sign in automatically.
    • Clear the Restore Credential: When the user signs out of your app, delete the associated Restore Credential.

Restore Credentials is available through the Credential Manager Jetpack library. The minimum version of the Jetpack Library is 1.5.0-beta01, and the minimum GMS version is 242200000. For more on this, refer to the Restore Credentials DAC page. To get started, follow these steps:

    1. Add the Credential Manager dependency to your project.

// build.gradle.kts
implementation("androidx.credentials:credentials:1.5.0-beta01")

    2. Create a CreateRestoreCredentialRequest object.

// Fetch Registration JSON from server
// Same as the registrationJson created at the time of creating a Passkey
// See documentation for more info
val registrationJson = ... 

// Create the CreateRestoreCredentialRequest object
// Pass in the registrationJson
val createRequest = CreateRestoreCredentialRequest(
  registrationJson,
  /* isCloudBackupEnabled = */ true
)
      NOTE: Set the isCloudBackupEnabled flag to false if you want the restore key to be stored only locally and not backed up to the cloud. It’s set to true by default.

    3. Call the createCredential() method on the CredentialManager object.

val credentialManager = CredentialManager.create(context)

// On a successful authentication, create a restore key.
// Pass in the context and the CreateRestoreCredentialRequest object.
val response = credentialManager.createCredential(
    context,
    createRequest
)

    4. When the user sets up a new device, call the getCredential() method on the CredentialManager object.

// Fetch the Authentication JSON from server
val authenticationJson = ...

// Create the GetRestoreCredentialRequest object
val options = GetRestoreCredentialOption(authenticationJson)
val getRequest = GetCredentialRequest(listOf(options))

// The restore key can be fetched in two scenarios:
// 1. On the first launch of the app on the new device
// 2. In the onRestore callback (if the app implements a backup agent)
val response = credentialManager.getCredential(context, getRequest)

If you're using a backup agent, perform the getCredential part within the onRestore callback. This ensures that the app's credentials are restored immediately after the app data is restored.
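
As a rough sketch of that wiring, the hypothetical agent below uses the onRestoreFinished() hook, which runs once restore completes; the server fetch and sign-in helpers are placeholders for your own logic:

import android.app.backup.BackupAgent
import android.app.backup.BackupDataInput
import android.app.backup.BackupDataOutput
import android.os.ParcelFileDescriptor
import androidx.credentials.CredentialManager
import androidx.credentials.GetCredentialRequest
import androidx.credentials.GetCredentialResponse
import androidx.credentials.GetRestoreCredentialOption
import kotlinx.coroutines.runBlocking

// Hypothetical agent: register it in the manifest via android:backupAgent.
class MyBackupAgent : BackupAgent() {

  override fun onBackup(
    oldState: ParcelFileDescriptor?,
    data: BackupDataOutput?,
    newState: ParcelFileDescriptor?
  ) { /* key/value backup logic, if any */ }

  override fun onRestore(
    data: BackupDataInput?,
    appVersionCode: Int,
    newState: ParcelFileDescriptor?
  ) { /* key/value restore logic, if any */ }

  override fun onRestoreFinished() {
    // Runs after the app's data (including the restore key) has been restored.
    val credentialManager = CredentialManager.create(this)
    val authenticationJson = fetchAuthenticationJsonFromServer() // placeholder
    val getRequest = GetCredentialRequest(listOf(GetRestoreCredentialOption(authenticationJson)))
    // Blocking here keeps the sketch short; production code may prefer a background worker.
    val response: GetCredentialResponse = runBlocking {
      credentialManager.getCredential(this@MyBackupAgent, getRequest)
    }
    signInSilently(response) // placeholder: complete silent sign-in with your backend
  }

  private fun fetchAuthenticationJsonFromServer(): String = TODO("App-specific")
  private fun signInSilently(response: GetCredentialResponse) { /* App-specific */ }
}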

    5. When the user signs out of your app, call the clearCredentialState() method on the CredentialManager object.

// Create a ClearCredentialStateRequest object
val clearRequest = ClearCredentialStateRequest(/* requestType = */ 1)

// On user log-out, clear the restore key
val response = credentialManager.clearCredentialState(clearRequest)

Conclusion

The Restore Credentials feature provides significant benefits, ensuring users experience a smooth transition between devices, and allowing them to log in quickly and easily through backup agents or restore key checks. For developers, the feature is straightforward to integrate and leverages existing passkey server-side infrastructure. Overall, Restore Credentials is a valuable tool that delivers a practical and user-friendly authentication solution.


This blog post is a part of our series: Spotlight Week: Passkeys. We're providing you with a wealth of resources through the week. Think informative blog posts, engaging videos, practical sample code, and more—all carefully designed to help you leverage the latest advancements in seamless sign-up and sign-in experiences.

With these cutting-edge solutions, you can enhance security, reduce friction for your users, and stay ahead of the curve in the rapidly evolving landscape of digital identity. To get a complete overview of what Spotlight Week has to offer and how it can benefit you, be sure to read our overview blog post.

Computer-aided diagnosis for lung cancer screening

Lung cancer is the leading cause of cancer-related deaths globally with 1.8 million deaths reported in 2020. Late diagnosis dramatically reduces the chances of survival. Lung cancer screening via computed tomography (CT), which provides a detailed 3D image of the lungs, has been shown to reduce mortality in high-risk populations by at least 20% by detecting potential signs of cancers earlier. In the US, screening involves annual scans, with some countries or cases recommending more or less frequent scans.

The United States Preventive Services Task Force recently expanded lung cancer screening recommendations by roughly 80%, which is expected to increase screening access for women and racial and ethnic minority groups. However, false positives (i.e., incorrectly reporting a potential cancer in a cancer-free patient) can cause anxiety and lead to unnecessary procedures for patients while increasing costs for the healthcare system. Moreover, efficiency in screening a large number of individuals can be challenging depending on healthcare infrastructure and radiologist availability.

At Google we have previously developed machine learning (ML) models for lung cancer detection, and have evaluated their ability to automatically detect and classify regions that show signs of potential cancer. Performance has been shown to be comparable to that of specialists in detecting possible cancer. While they have achieved high performance, effectively communicating findings in realistic environments is necessary to realize their full potential.

To that end, in “Assistive AI in Lung Cancer Screening: A Retrospective Multinational Study in the US and Japan”, published in Radiology AI, we investigate how ML models can effectively communicate findings to radiologists. We also introduce a generalizable user-centric interface to help radiologists leverage such models for lung cancer screening. The system takes CT imaging as input and outputs a cancer suspicion rating using four categories (no suspicion, probably benign, suspicious, highly suspicious) along with the corresponding regions of interest. We evaluate the system’s utility in improving clinician performance through randomized reader studies in both the US and Japan, using the local cancer scoring systems (Lung-RADS V1.1 and the Sendai Score) and image viewers that mimic realistic settings. We found that reader specificity increases with model assistance in both reader studies. To accelerate progress in conducting similar studies with ML models, we have open-sourced code to process CT images and generate images compatible with the picture archiving and communication system (PACS) used by radiologists.


Developing an interface to communicate model results

Integrating ML models into radiologist workflows involves understanding the nuances and goals of their tasks to meaningfully support them. In the case of lung cancer screening, hospitals follow various country-specific guidelines that are regularly updated. For example, in the US, Lung-RADS V1.1 assigns an alpha-numeric score to indicate the lung cancer risk and follow-up recommendations. When assessing patients, radiologists load the CT in their workstation to read the case, find lung nodules or lesions, and apply set guidelines to determine follow-up decisions.

Our first step was to improve the previously developed ML models through additional training data and architectural improvements, including self-attention. Then, instead of targeting specific guidelines, we experimented with a complementary way of communicating AI results independent of guidelines or their particular versions. Specifically, the system output offers a suspicion rating and localization (regions of interest) for the user to consider in conjunction with their own specific guidelines. The interface produces output images directly associated with the CT study, requiring no changes to the user’s workstation. The radiologist only needs to review a small set of additional images. There is no other change to their system or interaction with the system.

Example of the assistive lung cancer screening system outputs. Results for the radiologist’s evaluation are visualized on the location of the CT volume where the suspicious lesion is found. The overall suspicion is displayed at the top of the CT images. Circles highlight the suspicious lesions while squares show a rendering of the same lesion from a different perspective, called a sagittal view.

The assistive lung cancer screening system comprises 13 models and has a high-level architecture similar to the end-to-end system used in prior work. The models coordinate with each other to first segment the lungs, obtain an overall assessment, locate three suspicious regions, then use the information to assign a suspicion rating to each region. The system was deployed on Google Cloud using a Google Kubernetes Engine (GKE) that pulled the images, ran the ML models, and provided results. This allows scalability and directly connects to servers where the images are stored in DICOM stores.

Outline of the Google Cloud deployment of the assistive lung cancer screening system and the directional calling flow for the individual components that serve the images and compute results. Images are served to the viewer and to the system using Google Cloud services. The system is run on a Google Kubernetes Engine that pulls the images, processes them, and writes them back into the DICOM store.


Reader studies

To evaluate the system’s utility in improving clinical performance, we conducted two reader studies (i.e., experiments designed to assess clinical performance comparing expert performance with and without the aid of a technology) with 12 radiologists using pre-existing, de-identified CT scans. We presented 627 challenging cases to 6 US-based and 6 Japan-based radiologists. In the experimental setup, readers were divided into two groups that read each case twice, with and without assistance from the model. Readers were asked to apply scoring guidelines they typically use in their clinical practice and report their overall suspicion of cancer for each case. We then compared the results of the reader’s responses to measure the impact of the model on their workflow and decisions. The score and suspicion level were judged against the actual cancer outcomes of the individuals to measure sensitivity, specificity, and area under the ROC curve (AUC) values. These were compared with and without assistance.

A multi-case multi-reader study involves each case being reviewed by each reader twice, once with ML system assistance and once without. In this visualization one reader first reviews Set A without assistance (blue) and then with assistance (orange) after a wash-out period. A second reader group follows the opposite path by reading the same set of cases Set A with assistance first. Readers are randomized to these groups to remove the effect of ordering.

The ability to conduct these studies using the same interface highlights its generalizability to completely different cancer scoring systems, and the generalization of the model and assistive capability to different patient populations. Our study results demonstrated that when radiologists used the system in their clinical evaluation, they had an increased ability to correctly identify lung images without actionable lung cancer findings (i.e., specificity) by an absolute 5–7% compared to when they didn’t use the assistive system. This potentially means that for every 15–20 patients screened, one may be able to avoid unnecessary follow-up procedures, thus reducing their anxiety and the burden on the health care system. This can, in turn, help improve the sustainability of lung cancer screening programs, particularly as more people become eligible for screening.

Reader specificity increases with ML model assistance in both the US-based and Japan-based reader studies. Specificity values were derived from reader scores from actionable findings (something suspicious was found) versus no actionable findings, compared against the true cancer outcome of the individual. Under model assistance, readers flagged fewer cancer-negative individuals for follow-up visits. Sensitivity for cancer positive individuals remained the same.

Translating this into real-world impact through partnership

The system results demonstrate the potential for fewer follow-up visits, reduced anxiety, and lower overall costs for lung cancer screening. In an effort to translate this research into real-world clinical impact, we are working with DeepHealth, a leading AI-powered health informatics provider, and Apollo Radiology International, a leading provider of radiology services in India, to explore paths for incorporating this system into future products. In addition, we are looking to help other researchers studying how best to integrate ML model results into clinical workflows by open sourcing code used for the reader study and incorporating the insights described in this blog. We hope that this will help accelerate medical imaging researchers looking to conduct reader studies for their AI models, and catalyze translational research in the field.


Acknowledgements

Key contributors to this project include Corbin Cunningham, Zaid Nabulsi, Ryan Najafi, Jie Yang, Charles Lau, Joseph R. Ledsam, Wenxing Ye, Diego Ardila, Scott M. McKinney, Rory Pilgrim, Hiroaki Saito, Yasuteru Shimamura, Mozziyar Etemadi, Yun Liu, David Melnick, Sunny Jansen, Nadia Harhen, David P. Nadich, Mikhail Fomitchev, Ziyad Helali, Shabir Adeel, Greg S. Corrado, Lily Peng, Daniel Tse, Shravya Shetty, Shruthi Prabhakara, Neeral Beladia, and Krish Eswaran. Thanks to Arnav Agharwal and Andrew Sellergren for their open sourcing support and Vivek Natarajan and Michael D. Howell for their feedback. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study, and Jonny Wong and Carli Sampson for coordinating the reader studies.

Source: Google AI Blog


Resolving code review comments with ML

Code-change reviews are a critical part of the software development process at scale, taking a significant amount of the code authors’ and the code reviewers’ time. As part of this process, the reviewer inspects the proposed code and asks the author for code changes through comments written in natural language. At Google, we see millions of reviewer comments per year, and authors require an average of ~60 minutes active shepherding time between sending changes for review and finally submitting the change. In our measurements, the required active work time that the code author must do to address reviewer comments grows almost linearly with the number of comments. However, with machine learning (ML), we have an opportunity to automate and streamline the code review process, e.g., by proposing code changes based on a comment’s text.

Today, we describe applying recent advances of large sequence models in a real-world setting to automatically resolve code review comments in the day-to-day development workflow at Google (publication forthcoming). As of today, code-change authors at Google address a substantial amount of reviewer comments by applying an ML-suggested edit. We expect that to reduce time spent on code reviews by hundreds of thousands of hours annually at Google scale. Unsolicited, very positive feedback highlights that the impact of ML-suggested code edits increases Googlers' productivity and allows them to focus on more creative and complex tasks.


Predicting the code edit

We started by training a model that predicts code edits needed to address reviewer comments. The model is pre-trained on various coding tasks and related developer activities (e.g., renaming a variable, repairing a broken build, editing a file). It’s then fine-tuned for this specific task with reviewed code changes, the reviewer comments, and the edits the author performed to address those comments.

An example of an ML-suggested edit of refactorings that are spread within the code.

Google uses a monorepo, a single repository for all of its software artifacts, which allows our training dataset to include all unrestricted code used to build Google's most recent software, as well as previous versions.

To improve the model quality, we iterated on the training dataset. For example, we compared the model performance for datasets with a single reviewer comment per file to datasets with multiple comments per file, and experimented with classifiers to clean up the training data based on a small, curated dataset to choose the model with the best offline precision and recall metrics.


Serving infrastructure and user experience

We designed and implemented the feature on top of the trained model, focusing on the overall user experience and developer efficiency. As part of this, we explored different user experience (UX) alternatives through a series of user studies. We then refined the feature based on insights from an internal beta (i.e., a test of the feature in development) including user feedback (e.g., a “Was this helpful?” button next to the suggested edit).

The final model was calibrated for a target precision of 50%. That is, we tuned the model and the suggestions filtering, so that 50% of suggested edits on our evaluation dataset are correct. In general, increasing the target precision reduces the number of shown suggested edits, and decreasing the target precision leads to more incorrect suggested edits. Incorrect suggested edits take the developers time and reduce the developers’ trust in the feature. We found that a target precision of 50% provides a good balance.

At a high level, for every new reviewer comment, we generate the model input in the same format that is used for training, query the model, and generate the suggested code edit. If the model is confident in the prediction and a few additional heuristics are satisfied, we send the suggested edit to downstream systems. The downstream systems, i.e., the code review frontend and the integrated development environment (IDE), expose the suggested edits to the user and log user interactions, such as preview and apply events. A dedicated pipeline collects these logs and generates aggregate insights, e.g., the overall acceptance rates as reported in this blog post.

Architecture of the ML-suggested edits infrastructure. We process code and infrastructure from multiple services, get the model predictions and surface the predictions in the code review tool and IDE.

The developer interacts with the ML-suggested edits in the code review tool and the IDE. Based on insights from the user studies, the integration into the code review tool is most suitable for a streamlined review experience. The IDE integration provides additional functionality and supports 3-way merging of the ML-suggested edits (left in the figure below) in case of conflicting local changes on top of the reviewed code state (right) into the merge result (center).

3-way-merge UX in IDE.

Results

Offline evaluations indicate that the model addresses 52% of comments with a target precision of 50%. The online metrics of the beta and the full internal launch confirm these offline metrics, i.e., we see model suggestions above our target model confidence for around 50% of all relevant reviewer comments. 40% to 50% of all previewed suggested edits are applied by code authors.

We used the “not helpful” feedback during the beta to identify recurring failure patterns of the model. We implemented serving-time heuristics to filter these and, thus, reduce the number of shown incorrect predictions. With these changes, we traded quantity for quality and observed an increased real-world acceptance rate.

Code review tool UX. The suggestion is shown as part of the comment and can be previewed, applied and rated as helpful or not helpful.

Our beta launch showed a discoverability challenge: code authors only previewed ~20% of all generated suggested edits. We modified the UX and introduced a prominent “Show ML-edit” button (see the figure above) next to the reviewer comment, leading to an overall preview rate of ~40% at launch. We additionally found that suggested edits in the code review tool are often not applicable due to conflicting changes that the author did during the review process. We addressed this with a button in the code review tool that opens the IDE in a merge view for the suggested edit. We now observe that more than 70% of these are applied in the code review tool and fewer than 30% are applied in the IDE. All these changes allowed us to increase the overall fraction of reviewer comments that are addressed with an ML-suggested edit by a factor of 2 from beta to the full internal launch. At Google scale, these results help automate the resolution of hundreds of thousands of comments each year.

Suggestions filtering funnel.

We see ML-suggested edits addressing a wide range of reviewer comments in production. This includes simple localized refactorings and refactorings that are spread within the code, as shown in the examples throughout the blog post above. The feature addresses longer and less formally-worded comments that require code generation, refactorings and imports.

Example of a suggestion for a longer and less formally worded comment that requires code generation, refactorings and imports.

The model can also respond to complex comments and produce extensive code edits (shown below). The generated test case follows the existing unit test pattern, while changing the details as described in the comment. Additionally, the edit suggests a comprehensive name for the test reflecting the test semantics.

Example of the model's ability to respond to complex comments and produce extensive code edits.

Conclusion and future work

In this post, we introduced an ML-assistance feature to reduce the time spent on code review related changes. At the moment, a substantial amount of all actionable code review comments on supported languages are addressed with applied ML-suggested edits at Google. A 12-week A/B experiment across all Google developers will further measure the impact of the feature on the overall developer productivity.

We are working on improvements throughout the whole stack. This includes increasing the quality and recall of the model and building a more streamlined experience for the developer with improved discoverability throughout the review process. As part of this, we are investigating the option of showing suggested edits to the reviewer while they draft comments and expanding the feature into the IDE to enable code-change authors to get suggested code edits for natural-language commands.


Acknowledgements

This is the work of many people in Google Core Systems & Experiences team, Google Research, and DeepMind. We'd like to specifically thank Peter Choy for bringing the collaboration together, and all of our team members for their key contributions and useful advice, including Marcus Revaj, Gabriela Surita, Maxim Tabachnyk, Jacob Austin, Nimesh Ghelani, Dan Zheng, Peter Josling, Mariana Stariolo, Chris Gorgolewski, Sascha Varkevisser, Katja Grünwedel, Alberto Elizondo, Tobias Welp, Paige Bailey, Pierre-Antoine Manzagol, Pascal Lamblin, Chenjie Gu, Petros Maniatis, Henryk Michalewski, Sara Wiltberger, Ambar Murillo, Satish Chandra, Madhura Dudhgaonkar, Niranjan Tulpule, Zoubin Ghahramani, Juanjo Carin, Danny Tarlow, Kevin Villela, Stoyan Nikolov, David Tattersall, Boris Bokowski, Kathy Nix, Mehdi Ghissassi, Luis C. Cobo, Yujia Li, David Choi, Kristóf Molnár, Vahid Meimand, Amit Patel, Brett Wiltshire, Laurent Le Brun, Mingpan Guo, Hermann Loose, Jonas Mattes, Savinee Dancs.

Source: Google AI Blog


How Project Starline improves remote communication

As companies settle into a new normal of hybrid and distributed work, remote communication technology remains critical for connecting and collaborating with colleagues. While this technology has improved, the core user experience often falls short: conversation can feel stilted, attention can be difficult to maintain, and usage can be fatiguing.

Project Starline renders people at natural scale on a 3D display and enables natural eye contact.

At Google I/O 2021 we announced Project Starline, a technology project that combines advances in hardware and software to create a remote communication experience that feels like you’re together, even when you’re thousands of miles apart. This perception of co-presence is created by representing users in 3D at natural scale, enabling eye contact, and providing spatially accurate audio. But to what extent do these technological innovations translate to meaningful, observable improvement in user value compared to traditional video conferencing?

In this blog we share results from a number of studies across a variety of methodologies, finding converging evidence that Project Starline outperforms traditional video conferencing in terms of conversation dynamics, video meeting fatigue, and attentiveness. Some of these results were previously published while others we are sharing for the first time as preliminary findings.


Improved conversation dynamics

In our qualitative studies, users often describe conversations in Project Starline as “more natural.” However, when asked to elaborate, many have difficulty articulating this concept in a way that fully captures their experience. Because human communication relies partly on unconscious processes like nonverbal behavior, people might have a hard time reflecting on these processes that are potentially impacted by experiencing a novel technology. To address this challenge, we conducted a series of behavioral lab experiments to shed light on what “more natural” might mean for Project Starline. These experiments employed within-subjects designs in which participants experienced multiple conditions (e.g., meeting in Project Starline vs. traditional videoconferencing) in randomized order. This allowed us to control for between-subject differences by comparing how the same individual responded to a variety of conditions, thus increasing statistical power and reducing the sample size necessary to detect statistical differences (sample sizes in our behavioral experiments range from ~ 20 to 30).

In one study, preliminary data suggest Project Starline improves conversation dynamics by increasing rates of turn-taking. We recruited pairs of participants who had never met each other to have unstructured conversations in both Project Starline and traditional video conferencing. We analyzed the audio from each conversation and found that Project Starline facilitated significantly more dynamic "back and forth" conversations compared to traditional video conferencing. Specifically, participants averaged about 2-3 more speaker hand-offs in Project Starline conversations compared to those in traditional video conferencing across a two minute subsample of their conversation (a uniform selection at the end of each conversation to help standardize for interpersonal rapport). Participants also rated their Starline conversations as significantly more natural (“smooth,” “easy,” “not awkward”), higher in quality, and easier to recognize when it was their turn to speak compared to conversations using traditional video conferencing.

In another study, participants had conversations with a confederate in both Project Starline and traditional video conferencing. We recorded these conversations to analyze select nonverbal behaviors. In Project Starline, participants were more animated, using significantly more hand gestures (+43%), head nods (+26%), and eyebrow movements (+49%). Participants also reported a significantly better ability to perceive and convey nonverbal cues in Project Starline than in traditional video conferencing. Together with the turn-taking results, these data help explain why conversations in Project Starline may feel more natural.

We recorded participants to quantify their nonverbal behaviors and found that they were more animated in Project Starline (left) compared to traditional video conferencing (right).

Reduced video meeting fatigue

A well-documented challenge of video conferencing, especially within the workplace, is video meeting fatigue. The causes of video meeting fatigue are complex, but one possibility is that video communication is cognitively taxing because it becomes more difficult to convey and interpret nonverbal behavior. Considering previous findings that suggested Project Starline might improve nonverbal communication, we examined whether video meeting fatigue might also be improved (i.e., reduced) compared to traditional video conferencing.

Our study found preliminary evidence that Project Starline indeed reduces video meeting fatigue. Participants held 30-minute mock meetings in Project Starline and traditional video conferencing. Meeting content was standardized across participants using an exercise adapted from academic literature that emulates key elements of a work meeting, such as brainstorming and persuasion. We then measured video meeting fatigue via the Zoom Exhaustion and Fatigue (ZEF) Scale. Additionally, we measured participants' reaction times on a complex cognitive task originally used in cognitive psychology. We repurposed this task as a proxy for video meeting fatigue based on the assumption that more fatigue would lead to slower reaction times. Participants reported significantly less video meeting fatigue on the ZEF Scale (-31%) and had faster reaction times (-12%) on the cognitive task after using Project Starline compared to traditional video conferencing.


Increased attentiveness

Another challenge with video conferencing is focusing attention on the meeting at hand, rather than on other browser windows or secondary devices.

In our earlier study on nonverbal behavior, we included an exploratory information-retention task. We asked participants to write as much as they could remember about each conversation (one in Project Starline, and one in traditional video conferencing). We found that participants wrote 28% more in this task (by character count) after their conversation in Project Starline. This could be because they paid closer attention when in Project Starline, or possibly that they found conversations in Project Starline to be more engaging.

To explore the concept of attentiveness further, we conducted a study in which participants wore eye-tracking glasses. This allowed us to calculate the percentage of time participants spent focusing on their conversation partner’s face, an important source of social information in human interaction. Participants had a conversation with a confederate in Project Starline, traditional video conferencing, and in person. We found that participants spent a significantly higher proportion of time looking at their conversation partner's face in Project Starline (+14%) than they did in traditional video conferencing. In fact, visual attentiveness in Project Starline mirrored that of the in-person condition: participants spent roughly the same proportion of time focusing on their meeting partner’s face in the Project Starline and in-person conditions.

The use of eye-tracking glasses and facial detection software allowed us to quantify participants' gaze patterns. The video above illustrates how a hypothetical participant's eye tracking data (red dot) correspond to their meeting partner's face (white box).

User value in real meetings

The lab-based, experimental approach used in the studies above allows for causal inference while minimizing confounding variables. However, one limitation of these studies is that they are low in external validity — that is, they took place in a lab environment, and the extent to which their results extend to the real world is unclear. Thus, we studied actual users within Google who used Project Starline for their day-to-day work meetings and collected their feedback.

An internal pilot revealed that users derive meaningful value from using Project Starline. We used post-meeting surveys to capture immediate feedback on individual meetings, longer monthly surveys to capture holistic feedback on the experience, and conducted in-depth qualitative interviews with a subset of users. We evaluated Project Starline on concepts such as presence, nonverbal behavior, attentiveness, and personal connection. We found strong evidence that Project Starline delivered across these four metrics, with over 87% of participants expressing that their meetings in Project Starline were better than their previous experiences with traditional video conferencing.


Conclusion

Together, these findings offer a compelling case for Project Starline's value to users: improved conversation dynamics, reduced video meeting fatigue, and increased attentiveness. Participants expressed that Project Starline was a significant improvement over traditional video conferencing in highly controlled lab experiments, as well as when they used Project Starline for their actual work meetings. We’re excited to see these findings converge across multiple methodologies (surveys, qualitative interviews, experiments) and measurements (self-report, behavioral, qualitative), and we’re eager to continue exploring the implications of Project Starline on human interaction.


Acknowledgments

We’d like to thank Melba Tellez, Eric Baczuk, Jinghua Zhang, Matthew DuVall, and Travis Miller for contributing to visual assets and illustrations.

Source: Google AI Blog


ML-Enhanced Code Completion Improves Developer Productivity

The increasing complexity of code poses a key challenge to productivity in software engineering. Code completion has been an essential tool that has helped mitigate this complexity in integrated development environments (IDEs). Conventionally, code completion suggestions are implemented with rule-based semantic engines (SEs), which typically have access to the full repository and understand its semantic structure. Recent research has demonstrated that large language models (e.g., Codex and PaLM) enable longer and more complex code suggestions, and as a result, useful products have emerged (e.g., Copilot). However, the question of how code completion powered by machine learning (ML) impacts developer productivity, beyond perceived productivity and accepted suggestions, remains open.

Today we describe how we combined ML and SE to develop a novel Transformer-based hybrid semantic ML code completion, now available to internal Google developers. We discuss how ML and SEs can be combined by (1) re-ranking SE single token suggestions using ML, (2) applying single and multi-line completions using ML and checking for correctness with the SE, or (3) using single and multi-line continuation by ML of single token semantic suggestions. We compare the hybrid semantic ML code completion of 10k+ Googlers (over three months across eight programming languages) to a control group and see a 6% reduction in coding iteration time (time between builds and tests) and a 7% reduction in context switches (i.e., leaving the IDE) when exposed to single-line ML completion. These results demonstrate that the combination of ML and SEs can improve developer productivity. Currently, 3% of new code (measured in characters) is now generated from accepting ML completion suggestions.

Transformers for Completion
A common approach to code completion is to train transformer models, which use a self-attention mechanism for language understanding, to enable code understanding and completion predictions. We treat code similarly to natural language, representing it with sub-word tokens from a SentencePiece vocabulary, and use encoder-decoder transformer models running on TPUs to make completion predictions. The input is the code surrounding the cursor (~1,000-2,000 tokens) and the output is a set of suggestions to complete the current line or multiple lines. Sequences are generated with a beam search (or tree exploration) on the decoder.
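
To picture the decoding step, here is a minimal, illustrative beam search sketch in Kotlin. The token strings, the placeholder next-token scores, and the beam width are all assumptions; the production system operates on sub-word tokens and runs its decoder on TPUs.

```kotlin
// A minimal beam search over a stubbed next-token model, illustrating how the
// decoder can produce several ranked completion candidates.
data class Hypothesis(val tokens: List<String>, val logProb: Double)

// Hypothetical decoder step: given a prefix, return (token, logProb) candidates.
fun nextTokenCandidates(prefix: List<String>): List<Pair<String, Double>> =
    listOf("foo" to -0.1, "(" to -0.7, "<eol>" to -1.2)  // placeholder distribution

fun beamSearch(beamWidth: Int = 4, maxTokens: Int = 16): List<Hypothesis> {
    var beam = listOf(Hypothesis(emptyList(), 0.0))
    repeat(maxTokens) {
        beam = beam
            .flatMap { hyp ->
                // Finished hypotheses are carried over unchanged.
                if (hyp.tokens.lastOrNull() == "<eol>") listOf(hyp)
                else nextTokenCandidates(hyp.tokens).map { (tok, lp) ->
                    Hypothesis(hyp.tokens + tok, hyp.logProb + lp)
                }
            }
            .sortedByDescending { it.logProb }  // keep only the best hypotheses
            .take(beamWidth)
    }
    return beam
}
```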

During training on Google’s monorepo, we mask out the remainder of a line and some follow-up lines, to mimic code that is being actively developed. We train a single model on eight languages (C++, Java, Python, Go, TypeScript, Proto, Kotlin, and Dart) and observe improved or equal performance across all languages, removing the need for dedicated models. Moreover, we find that a model size of ~0.5B parameters gives a good tradeoff for high prediction accuracy with low latency and resource cost. The model strongly benefits from the quality of the monorepo, which is enforced by guidelines and reviews. For multi-line suggestions, we iteratively apply the single-line model with learned thresholds for deciding whether to start predicting completions for the following line.

Encoder-decoder transformer models are used to predict the remainder of the line or lines of code.
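
The multi-line strategy described above can be pictured as a small loop around the single-line model. The sketch below is only a schematic of that idea: the model call, its confidence score, and the fixed threshold are stand-ins for the learned, per-position thresholds mentioned in the post.

```kotlin
// Sketch of multi-line completion by iteratively applying a single-line model.
data class LineCompletion(val text: String, val confidence: Double)

// Hypothetical single-line model: completes the current line given the code so far.
fun completeLine(context: String): LineCompletion =
    LineCompletion(text = "...", confidence = 0.0)  // placeholder

fun completeMultiLine(
    initialContext: String,
    maxLines: Int = 4,
    continueThreshold: Double = 0.8,  // a learned threshold in the real system
): List<String> {
    val lines = mutableListOf<String>()
    var context = initialContext
    repeat(maxLines) {
        val next = completeLine(context)
        // Stop extending the suggestion once the model is no longer confident.
        if (next.confidence < continueThreshold) return lines
        lines += next.text
        context += next.text + "\n"
    }
    return lines
}
```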

Re-rank Single Token Suggestions with ML
While a user is typing in the IDE, code completions are interactively requested from the ML model and the SE simultaneously in the backend. The SE typically only predicts a single token. The ML models we use predict multiple tokens until the end of the line, but we only consider the first token to match predictions from the SE. We identify the top three ML suggestions that are also contained in the SE suggestions and boost their rank to the top. The re-ranked results are then shown as suggestions for the user in the IDE.
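
Concretely, the boosting step can be sketched as follows. The data classes and field names are illustrative; only the rule itself (promote up to three SE suggestions whose token matches the first token of an ML prediction) comes from the description above.

```kotlin
// Illustrative sketch of the re-ranking step: promote SE suggestions whose
// token also appears as the first token of an ML prediction.
data class SeSuggestion(val token: String)
data class MlPrediction(val tokens: List<String>, val score: Double)

fun rerank(
    seSuggestions: List<SeSuggestion>,
    mlPredictions: List<MlPrediction>,
    maxBoosted: Int = 3,
): List<SeSuggestion> {
    // First tokens of the ML predictions, best-scored first.
    val mlFirstTokens = mlPredictions
        .sortedByDescending { it.score }
        .mapNotNull { it.tokens.firstOrNull() }

    // Up to three SE suggestions that the ML model also predicted, in ML order.
    val boosted = mlFirstTokens
        .mapNotNull { token -> seSuggestions.find { it.token == token } }
        .distinct()
        .take(maxBoosted)

    // Boosted suggestions move to the top; the rest keep their original order.
    return boosted + (seSuggestions - boosted.toSet())
}
```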

In practice, our SEs are running in the cloud, providing language services (e.g., semantic completion, diagnostics, etc.) with which developers are familiar, and so we co-located the SEs with the TPUs performing ML inference. The SEs are based on an internal library that offers compiler-like features with low latencies. Because requests are issued in parallel and ML is typically faster to serve (~40 ms median), we do not add any latency to completions. We observe a significant quality improvement in real usage. For 28% of accepted completions, the rank of the completion is higher due to boosting, and in 0.4% of cases it is worse. Additionally, we find that users type >10% fewer characters before accepting a completion suggestion.
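
Because the SE and ML requests are issued in parallel and merged afterwards, the orchestration can be sketched with coroutines roughly along these lines. The backend calls are hypothetical placeholders, and the merge step stands in for the re-ranking sketched above.

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.coroutineScope

// Hypothetical backend calls; both are issued at the same time so that the ML
// request (typically faster to serve) adds no latency on top of the SE request.
suspend fun semanticEngineComplete(codeAroundCursor: String): List<String> = emptyList()
suspend fun mlComplete(codeAroundCursor: String): List<String> = emptyList()

fun merge(seSuggestions: List<String>, mlSuggestions: List<String>): List<String> =
    seSuggestions  // placeholder for the re-ranking step sketched earlier

suspend fun completions(codeAroundCursor: String): List<String> = coroutineScope {
    val se = async { semanticEngineComplete(codeAroundCursor) }
    val ml = async { mlComplete(codeAroundCursor) }
    merge(se.await(), ml.await())
}
```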

Check Single / Multi-line ML Completions for Semantic Correctness
At inference time, ML models are typically unaware of code outside of their input window, and code seen during training might miss recent additions needed for completions in actively changing repositories. This leads to a common drawback of ML-powered code completion whereby the model may suggest code that looks correct, but doesn’t compile. Based on internal user experience research, this issue can lead to the erosion of user trust over time while reducing productivity gains.

We use SEs to perform fast semantic correctness checks within a given latency budget (<100ms for end-to-end completion) and use cached abstract syntax trees to enable a “full” structural understanding. Typical semantic checks include reference resolution (i.e., does this object exist), method invocation checks (e.g., confirming the method was called with a correct number of parameters), and assignability checks (to confirm the type is as expected).
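
A simplified view of this filtering step is sketched below. The checker interface stands in for the internal AST-backed language services, and the fail-open behaviour when the budget runs out is an illustrative choice rather than the documented one.

```kotlin
// Sketch of filtering ML suggestions with fast semantic checks.
data class Suggestion(val text: String)

interface SemanticChecker {
    fun referencesResolve(s: Suggestion): Boolean      // does every symbol exist?
    fun invocationsWellFormed(s: Suggestion): Boolean  // right number of arguments?
    fun typesAssignable(s: Suggestion): Boolean        // are assigned types compatible?
}

fun filterSuggestions(
    candidates: List<Suggestion>,
    checker: SemanticChecker,
    deadlineMillis: Long = 100,  // end-to-end latency budget from the post
): List<Suggestion> {
    val start = System.currentTimeMillis()
    return candidates.filter { s ->
        // If the budget is exhausted, keep the remaining candidates unchecked
        // rather than delaying the completion popup (illustrative choice).
        if (System.currentTimeMillis() - start > deadlineMillis) return@filter true
        checker.referencesResolve(s) &&
            checker.invocationsWellFormed(s) &&
            checker.typesAssignable(s)
    }
}
```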

For example, for the coding language Go, ~8% of suggestions contain compilation errors before semantic checks. However, the application of semantic checks filtered out 80% of uncompilable suggestions. The acceptance rate for single-line completions improved by 1.9x over the first six weeks of incorporating the feature, presumably due to increased user trust. As a comparison, for languages where we did not add semantic checking, we only saw a 1.3x increase in acceptance.

Language servers with access to source code and the ML backend are collocated on the cloud. They both perform semantic checking of ML completion suggestions.

Results
With 10k+ Google-internal developers using the completion setup in their IDE, we measured a user acceptance rate of 25-34%. We determined that the transformer-based hybrid semantic ML code completion completes >3% of code, while reducing the coding iteration time for Googlers by 6% (at a 90% confidence level). The size of the shift corresponds to typical effects observed for transformational features (e.g., key framework) that typically affect only a subpopulation, whereas ML has the potential to generalize for most major languages and engineers.

  • Fraction of all code added by ML: 2.6%
  • Reduction in coding iteration duration: 6%
  • Reduction in number of context switches: 7%
  • Acceptance rate (for suggestions visible for >750 ms): 25%
  • Average characters per accept: 21

Key metrics for single-line code completion measured in production for 10k+ Google-internal developers using it in their daily development across eight languages.

  • Fraction of all code added by ML (with >1 line in suggestion): 0.6%
  • Average characters per accept: 73
  • Acceptance rate (for suggestions visible for >750 ms): 34%

Key metrics for multi-line code completion measured in production for 5k+ Google-internal developers using it in their daily development across eight languages.

Providing Long Completions while Exploring APIs
We also tightly integrated the semantic completion with full line completion. When the dropdown with semantic single token completions appears, we display inline the single-line completions returned from the ML model. The latter represent a continuation of the item that is in focus in the dropdown. For example, if a user looks at possible methods of an API, the inline full line completions show the full method invocation, including all of its parameters.

Integrated full line completions by ML continuing the semantic dropdown completion that is in focus.
Suggestions of multiple line completions by ML.

Conclusion and Future Work
We demonstrate how the combination of rule-based semantic engines and large language models can significantly improve developer productivity through better code completion. As a next step, we want to utilize SEs further by providing extra information to ML models at inference time. For example, for long predictions, the ML model and the SE could go back and forth, with the SE iteratively checking correctness and offering all possible continuations to the ML model. When adding new features powered by ML, we want to go beyond just “smart” results and ensure a positive impact on productivity.

Acknowledgements
This research is the outcome of a two-year collaboration between Google Core and Google Research, Brain Team. Special thanks to Marc Rasi, Yurun Shen, Vlad Pchelin, Charles Sutton, Varun Godbole, Jacob Austin, Danny Tarlow, Benjamin Lee, Satish Chandra, Ksenia Korovina, Stanislav Pyatykh, Cristopher Claeys, Petros Maniatis, Evgeny Gryaznov, Pavel Sychev, Chris Gorgolewski, Kristof Molnar, Alberto Elizondo, Ambar Murillo, Dominik Schulz, David Tattersall, Rishabh Singh, Manzil Zaheer, Ted Ying, Juanjo Carin, Alexander Froemmgen and Marcus Revaj for their contributions.


Source: Google AI Blog


Contextual Rephrasing in Google Assistant

When people converse with one another, context and references play a critical role in driving their conversation more efficiently. For instance, if one asks the question “Who wrote Romeo and Juliet?” and, after receiving an answer, asks “Where was he born?”, it is clear that ‘he’ is referring to William Shakespeare without the need to explicitly mention him. Or if someone mentions “python” in a sentence, one can use the context from the conversation to determine whether they are referring to a type of snake or a computer language. If a virtual assistant cannot robustly handle context and references, users would be required to adapt to the limitation of the technology by repeating previously shared contextual information in their follow-up queries to ensure that the assistant understands their requests and can provide relevant answers.

In this post, we present a technology currently deployed on Google Assistant that allows users to speak in a natural manner when referencing context that was defined in previous queries and answers. The technology, based on the latest machine learning (ML) advances, rephrases a user’s follow-up query to explicitly mention the missing contextual information, thus enabling it to be answered as a stand-alone query. While Assistant considers many types of context for interpreting the user input, in this post we are focusing on short-term conversation history.

Context Handling by Rephrasing
One of the approaches taken by Assistant to understand contextual queries is to detect if an input utterance is referring to previous context and then rephrase it internally to explicitly include the missing information. Following on from the previous example in which the user asked who wrote Romeo and Juliet, one may ask follow-up questions like “When?”. Assistant recognizes that this question is referring to both the subject (Romeo and Juliet) and answer from the previous query (William Shakespeare) and can rephrase “When?” to “When did William Shakespeare write Romeo and Juliet?”

While there are other ways to handle context, for instance, by applying rules directly to symbolic representations of the meaning of queries, like intents and arguments, the advantage of the rephrasing approach is that it operates horizontally at the string level across any query answering, parsing, or action fulfillment module.

Conversation on a smart display device, where Assistant understands multiple contextual follow-up queries, allowing the user to have a more natural conversation. The phrases appearing at the bottom of the display are suggestions for follow-up questions that the user can select. However, the user can still ask different questions.

A Wide Variety of Contextual Queries
The natural language processing field, traditionally, has not put much emphasis on a general approach to context, focusing on the understanding of stand-alone queries that are fully specified. Accurately incorporating context is a challenging problem, especially when considering the large variety of contextual query types. The table below contains example conversations that illustrate query variability and some of the many contextual challenges that Assistant’s rephrasing method can resolve (e.g., differentiating between referential and non-referential cases or identifying what context a query is referencing). We demonstrate how Assistant is now able to rephrase follow-up queries, adding contextual information before providing an answer.

System Architecture
At a high level, the rephrasing system generates rephrasing candidates by using different types of candidate generators. Each rephrasing candidate is then scored based on a number of signals, and the one with the highest score is selected.

High level architecture of Google Assistant contextual rephraser.
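
In code, that flow can be sketched roughly as follows; the interfaces and type names are illustrative rather than Assistant’s internal APIs.

```kotlin
// Sketch of the high-level rephrase flow: several generators propose candidates,
// each candidate is scored from a set of signals, and the best one is kept.
data class Query(val text: String)
data class DialogContext(val previousQueries: List<Query>, val previousAnswers: List<String>)
data class Candidate(val rephrasedText: String, val source: String)

interface CandidateGenerator {
    fun generate(query: Query, context: DialogContext): List<Candidate>
}

interface CandidateScorer {
    fun score(candidate: Candidate, query: Query, context: DialogContext): Double
}

fun rephrase(
    query: Query,
    context: DialogContext,
    generators: List<CandidateGenerator>,
    scorer: CandidateScorer,
): Candidate? =
    generators
        .flatMap { it.generate(query, context) }          // pool candidates from all generators
        .maxByOrNull { scorer.score(it, query, context) } // keep the highest-scoring one
```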

Candidate Generation
To generate rephrasing candidates we use a hybrid approach that applies different techniques, which we classify into three categories:

  1. Generators based on the analysis of the linguistic structure of the queries use grammatical and morphological rules to perform specific operations — for instance, the replacement of pronouns or other types of referential phrases with antecedents from the context (a toy sketch of this category follows the list).
  2. Generators based on query statistics combine key terms from the current query and its context to create candidates that match popular queries from historical data or common query patterns.
  3. Generators based on Transformer technologies, such as MUM, learn to generate sequences of words according to a number of training samples. LaserTagger and FELIX are technologies suitable for tasks with high overlap between the input and output texts, are very fast at inference time, and are not vulnerable to hallucination (i.e., generating text that is not related to the input texts). Once presented with a query and its context, they can generate a sequence of text edits to transform the input queries into a rephrasing candidate by indicating which portions of the context should be preserved and which words should be modified.
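
As a toy illustration of the first category, the sketch below replaces a pronoun in a follow-up query with the previous answer. Real generators rely on grammatical and morphological analysis; this string-level version only shows the general shape of such a generator.

```kotlin
// Toy illustration of a linguistic-rule generator: replace a pronoun in a
// follow-up query with an antecedent taken from the previous answer.
fun pronounReplacementCandidates(query: String, previousAnswer: String): List<String> {
    val pronouns = setOf("he", "she", "it", "they")
    val words = query.trimEnd('?').split(" ")
    if (words.none { it.lowercase() in pronouns }) return emptyList()
    val rephrased = words.joinToString(" ") { word ->
        if (word.lowercase() in pronouns) previousAnswer else word
    } + "?"
    return listOf(rephrased)
}

// Example:
// pronounReplacementCandidates("Where was he born?", "William Shakespeare")
//   returns ["Where was William Shakespeare born?"]
```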

Candidate Scoring
We extract a number of signals for each rephrasing candidate and use an ML model to select the most promising candidate. Some of the signals depend only on the current query and its context. For example, is the topic of the current query similar to the topic of the previous query? Or, is the current query a good stand-alone query or does it look incomplete? Other signals depend on the candidate itself: How much of the information of the context does the candidate preserve? Is the candidate well-formed from a linguistic point of view? Etc.

Recently, new signals generated by BERT and MUM models have significantly improved the performance of the ranker, fixing about one-third of the recall headroom while minimizing false positives on query sequences that are not contextual (and therefore do not require a rephrasing).

Example conversation on a phone where Assistant understands a sequence of contextual queries.

Conclusion
The solution described here attempts to resolve contextual queries by rephrasing them in order to make them fully answerable in a stand-alone manner, i.e., without having to relate to other information during the fulfillment phase. The benefit of this approach is that it is agnostic to the mechanisms that would fulfill the query, thus making it usable as a horizontal layer to be deployed before any further processing.

Given the variety of contexts naturally used in human languages, we adopted a hybrid approach that combines linguistic rules, large amounts of historic data through logs, and ML models based on state-of-the-art Transformer approaches. By generating a number of rephrasing candidates for each query and its context, and then scoring and ranking them using a variety of signals, Assistant can rephrase and thus correctly interpret most contextual queries. As Assistant can handle most types of linguistic references, we are empowering users to have more natural conversations. To make such multi-turn conversations even less cumbersome, Assistant users can turn on Continued Conversation mode to enable asking follow-up queries without the need to repeat "Hey Google" between each query. We are also using this technology in other virtual assistant settings, for instance, interpreting context from something shown on a screen or playing on a speaker.

Acknowledgements
This post reflects the combined work of Aliaksei Severyn, André Farias, Cheng-Chun Lee, Florian Thöle, Gabriel Carvajal, Gyorgy Gyepesi, Julien Cretin, Liana Marinescu, Martin Bölle, Patrick Siegler, Sebastian Krause, Victor Ähdel, Victoria Fossum, Vincent Zhao. We also thank Amar Subramanya, Dave Orr, Yury Pinsky for helpful discussions and support.

Source: Google AI Blog


App performance to drive app excellence

Posted by Maru Ahues Bouza, Director Android Developer Relations

hand drawing shapes on a tablet

In our previous blog post in this series, we defined app excellence as “creating an app that provides consistent, effortless, and seamless app user experiences. It is high performing and provides a great experience, no matter the device being used.” Let’s focus on the concept of app performance — what are the features of high performing apps, and how do you achieve app excellence through strong performance?

From a user’s perspective, high-performing apps “just work.” However, the process of creating a high performing app is not always straightforward. To break things down, here are the main dimensions of high performance:

Stability

An app should be robust and reliable. It should not freeze (application not responding, or “ANR”) or crash. Before you launch your app, check out Google Play’s pre-launch report to identify potential stability issues. After deployment, pay attention to the Android Vitals page in the Google Play developer console. ANRs, in particular, are typically caused by threading issues that block the main thread. The ANR troubleshooting guide can help you diagnose and resolve any ANRs in your app.
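
Since most ANRs come down to a blocked main thread, one common remedy is to move blocking work onto a background dispatcher. The following is a minimal sketch assuming Kotlin coroutines and the androidx lifecycle-runtime-ktx artifact; the loader and UI methods are hypothetical.

```kotlin
import androidx.appcompat.app.AppCompatActivity
import androidx.lifecycle.lifecycleScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext

class ReportActivity : AppCompatActivity() {

    // Keep the main thread responsive: run blocking work (disk, network, heavy
    // parsing) on a background dispatcher and only touch the UI when done.
    private fun loadReport() {
        lifecycleScope.launch {
            val report = withContext(Dispatchers.IO) {
                parseLargeFile()   // hypothetical blocking call
            }
            showReport(report)     // runs back on the main thread
        }
    }

    private fun parseLargeFile(): String = ""     // placeholder
    private fun showReport(report: String) { }    // placeholder
}
```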

Quick loading

Imagine the first experience a user has of your app is… waiting. At some point, they are going to get distracted or bored, and you have lost a new user. Your app should either load quickly or provide some sort of onscreen feedback, such as a progress indicator. You can use data from Android vitals to quantify any issues you may have with start up times. Android vitals flags start up times as excessive when:

  • Cold startup takes 5 seconds or longer.
  • Warm startup takes 2 seconds or longer.
  • Hot startup takes 1.5 seconds or longer.

However, these are relatively conservative numbers. We recommend you aim for lower. Here are some great tips on how to test start up performance.
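
As a complementary, hedged example using a standard platform API that the guidance above does not mention explicitly: calling Activity.reportFullyDrawn() once your first useful content is visible lets the system log a startup time that reflects what users actually see, rather than just the first frame. The loader callback below is hypothetical.

```kotlin
import android.os.Bundle
import androidx.appcompat.app.AppCompatActivity

class MainActivity : AppCompatActivity() {

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        // setContentView(...) and other setup here.

        loadInitialContent {
            // Invoked once the first meaningful content is on screen. The system
            // records the "fully drawn" time, which you can compare against the
            // thresholds listed above.
            reportFullyDrawn()
        }
    }

    // Hypothetical asynchronous loader that calls back when content is ready.
    private fun loadInitialContent(onReady: () -> Unit) = onReady()
}
```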

Fast rendering

High quality frame rendering is not just for games. Smooth visual experiences that don’t stall or feel sluggish are also important for apps. At a minimum, aim to render each frame within 16 ms to achieve 60 frames per second, but bear in mind that there are devices in the market with faster refresh rates. To monitor performance as you test, use the Profile HWUI rendering option. Here are tools to help diagnose rendering issues.
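
For a rough in-app view of frame pacing during development, here is a minimal sketch using the platform Choreographer. It is a development aid only, not a replacement for the profiling tools linked above, and the 16 ms budget is simply the 60 fps target.

```kotlin
import android.util.Log
import android.view.Choreographer

// Logs frames whose inter-frame interval exceeds a target budget.
class FrameWatcher(private val budgetMillis: Float = 16f) : Choreographer.FrameCallback {
    private var lastFrameTimeNanos = 0L

    fun start() = Choreographer.getInstance().postFrameCallback(this)

    override fun doFrame(frameTimeNanos: Long) {
        if (lastFrameTimeNanos != 0L) {
            val elapsedMillis = (frameTimeNanos - lastFrameTimeNanos) / 1_000_000f
            if (elapsedMillis > budgetMillis) {
                Log.w("FrameWatcher", "Slow frame: %.1f ms".format(elapsedMillis))
            }
        }
        lastFrameTimeNanos = frameTimeNanos
        // Re-register to keep observing subsequent frames.
        Choreographer.getInstance().postFrameCallback(this)
    }
}
```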

Economical with battery usage

As soon as a user realizes your app is draining their battery, they are going to consider uninstalling it. Your app can drain battery through stuck partial wake locks, excessive wakeups, background Wi-Fi scans, or background network usage. Use the Android Studio Energy Profiler to diagnose unexpected battery use, and plan background work carefully. For background tasks that need a guarantee that the system will run them even if the app exits, WorkManager is a battery-friendly Android Jetpack library that runs deferrable, guaranteed background work when the work’s constraints are satisfied.
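
Here is a minimal WorkManager sketch; the worker body and the specific constraint choices are illustrative assumptions. With these constraints, the upload runs only when the device is charging and on an unmetered network, so the work is deferred to a battery-friendly moment.

```kotlin
import android.content.Context
import androidx.work.Constraints
import androidx.work.NetworkType
import androidx.work.OneTimeWorkRequestBuilder
import androidx.work.WorkManager
import androidx.work.Worker
import androidx.work.WorkerParameters

// Deferrable, guaranteed background work: the system runs it even if the app exits.
class UploadLogsWorker(context: Context, params: WorkerParameters) : Worker(context, params) {
    override fun doWork(): Result {
        // ... upload logs here (hypothetical work) ...
        return Result.success()
    }
}

fun scheduleLogUpload(context: Context) {
    val constraints = Constraints.Builder()
        .setRequiresCharging(true)
        .setRequiredNetworkType(NetworkType.UNMETERED)
        .build()

    val request = OneTimeWorkRequestBuilder<UploadLogsWorker>()
        .setConstraints(constraints)
        .build()

    WorkManager.getInstance(context).enqueue(request)
}
```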

Using up-to-date SDKs

For both security and performance, it’s important that any Google or third-party SDKs used are up-to-date. Improvements to these SDKs, such as stability, compatibility, or security, should be available to users in a timely manner. You are responsible for the entire code base, including any third party SDKs you may utilize. For Google SDKs, consider using SDKs powered by Google Play services, when available. These SDKs are backward compatible, receive automatic updates, reduce your app package size, and make efficient use of on-device resources.

To learn more, please visit the Android app excellence webpage, where you will find case studies, practical tips, and the opportunity to sign up for our App Excellence summit.

In our next blog posts, we will talk about seamless user experiences across devices. Sign up to the Android developer newsletter here to be notified of the next installment and get news and insights from the Android team.

Sensing Force-Based Gestures on the Pixel 4



Touch input has traditionally focussed on two-dimensional finger pointing. Beyond tapping and swiping gestures, long pressing has been the main alternative path for interaction. However, a long press is sensed with a time-based threshold where a user’s finger must remain stationary for 400–500 ms. By its nature, a time-based threshold has negative effects for usability and discoverability as the lack of immediate feedback disconnects the user’s action from the system’s response. Fortunately, fingers are dynamic input devices that can express more than just location: when a user touches a surface, their finger can also express some level of force, which can be used as an alternative to a time-based threshold.

While a variety of force-based interactions have been pursued, sensing touch force requires dedicated hardware sensors that are expensive to design and integrate. Further, research indicates that touch force is difficult for people to control, and so most practical force-based interactions focus on discrete levels of force (e.g., a soft vs. firm touch) — which do not require the full capabilities of a hardware force sensor.

For a recent update to the Pixel 4, we developed a method for sensing force gestures that allowed us to deliver a more expressive touch interaction experience. By studying how the human finger interacts with touch sensors, we designed the experience to complement and support the long-press interactions that apps already have, but with a more natural gesture. In this post we describe the core principles of touch sensing and finger interaction, how we designed a machine learning algorithm to recognise press gestures from touch sensor data, and how we integrated it into the user experience for Pixel devices.

Touch Sensor Technology and Finger Biomechanics
A capacitive touch sensor is constructed from two conductive electrodes (a drive electrode and a sense electrode) that are separated by a non-conductive dielectric (e.g., glass). The two electrodes form a tiny capacitor (a cell) that can hold some charge. When a finger (or another conductive object) approaches this cell, it ‘steals’ some of the charge, which can be measured as a drop in capacitance. Importantly, the finger doesn’t have to come into contact with the electrodes (which are protected under another layer of glass) as the amount of charge stolen is inversely proportional to the distance between the finger and the electrodes.
Left: A finger interacts with a touch sensor cell by ‘stealing’ charge from the projected field around two electrodes. Right: A capacitive touch sensor is constructed from rows and columns of electrodes, separated by a dielectric. The electrodes overlap at cells, where capacitance is measured.
The cells are arranged as a matrix over the display of a device, but with a much lower density than the display pixels. For instance, the Pixel 4 has a 2280 × 1080 pixel display, but a 32 × 15 cell touch sensor. When scanned at a high resolution (at least 120 Hz), readings from these cells form a video of the finger’s interaction.
Slowed touch sensor recordings of a user tapping (left), pressing (middle), and scrolling (right).
Capacitive touch sensors don’t respond to changes in force per se, but are tuned to be highly sensitive to changes in distance within a couple of millimeters above the display. That is, a finger contact on the display glass should saturate the sensor near its centre, but will retain a high dynamic range around the perimeter of the finger’s contact (where the finger curls up).

When a user’s finger presses against a surface, its soft tissue deforms and spreads out. The nature of this spread depends on the size and shape of the user’s finger, and its angle to the screen. At a high level, we can observe a couple of key features in this spread (shown in the figures): it is asymmetric around the initial contact point, and the overall centre of mass shifts along the axis of the finger. This is also a dynamic change that occurs over some period of time, which differentiates it from contacts that have a long duration or a large area.
Touch sensor signals are saturated around the centre of the finger’s contact, but fall off at the edges. This allows us to sense small deformations in the finger’s contact shape caused by changes in the finger’s force.
However, the differences between users (and fingers) make it difficult to encode these observations with heuristic rules. We therefore designed a machine learning solution that would allow us to learn these features and their variances directly from user interaction samples.

Machine Learning for Touch Interaction
We approached the analysis of these touch signals as a gesture classification problem. That is, rather than trying to predict an abstract parameter, such as force or contact spread, we wanted to sense a press gesture — as if engaging a button or a switch. This allowed us to connect the classification to a well-defined user experience, and allowed users to perform the gesture during training at a comfortable force and posture.

Any classification model we designed had to operate within users’ high expectations for touch experiences. In particular, touch interaction is extremely latency-sensitive and demands real-time feedback. Users expect applications to be responsive to their finger movements as they make them, and application developers expect the system to deliver timely information about the gestures a user is performing. This means that classification of a press gesture needs to occur in real-time, and be able to trigger an interaction at the moment the finger’s force reaches its apex.

We therefore designed a neural network that combined convolutional (CNN) and recurrent (RNN) components. The CNN could attend to the spatial features we observed in the signal, while the RNN could attend to their temporal development. The RNN also helps provide a consistent runtime experience: each frame is processed by the network as it is received from the touch sensor, and the RNN state vectors are preserved between frames (rather than processing them in batches). The network was intentionally kept simple to minimise on-device inference costs when running concurrently with other applications (taking approximately 50 µs of processing per frame and less than 1 MB of memory using TensorFlow Lite).
An overview of the classification model’s architecture.
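
To make the frame-at-a-time pattern concrete, here is a hedged Kotlin sketch of on-device inference with TensorFlow Lite. The model path, tensor shapes, state size, and output layout are assumptions for illustration, not the actual Pixel implementation; the key idea is that the RNN state is carried across calls so each new sensor frame updates the gesture probability immediately.

```kotlin
import org.tensorflow.lite.Interpreter
import java.io.File

// Illustrative frame-at-a-time inference loop: each touch-sensor frame is fed to
// the model together with the RNN state from the previous frame.
class PressGestureClassifier(modelFile: File, stateSize: Int = 64) {
    private val interpreter = Interpreter(modelFile)
    private var rnnState = Array(1) { FloatArray(stateSize) }

    /** Returns the probability that the current touch is a press gesture. */
    fun onSensorFrame(frame: Array<FloatArray>): Float {
        val inputs = arrayOf<Any>(arrayOf(frame), rnnState)
        val gestureProbs = Array(1) { FloatArray(2) }        // assumed layout: [other, press]
        val newState = Array(1) { FloatArray(rnnState[0].size) }
        val outputs: Map<Int, Any> = mapOf(0 to gestureProbs, 1 to newState)

        interpreter.runForMultipleInputsOutputs(inputs, outputs)

        rnnState = newState                                   // carry state to the next frame
        return gestureProbs[0][1]
    }

    fun onTouchEnd() {
        rnnState = Array(1) { FloatArray(rnnState[0].size) }  // reset between touches
    }
}
```
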
The model was trained on a dataset of press gestures and other common touch interactions (tapping, scrolling, dragging, and long-pressing without force). As the model would be evaluated after each frame, we designed a loss function that temporally shaped the label probability distribution of each sample, and applied a time-increasing weight to errors. This ensured that the output probabilities were temporally smooth and converged towards the correct gesture classification.

User Experience Integration
Our UX research found that it was hard for users to discover force-based interactions, and that users frequently confused a force press with a long press because of the difficulty in coordinating the amount of force they were applying with the duration of their contact. Rather than creating a new interaction modality based on force, we therefore focussed on improving the user experience of long press interactions by accelerating them with force in a unified press gesture. A press gesture has the same outcome as a long press gesture, whose time threshold remains effective, but provides a stronger connection between the outcome and the user’s action when force is used.
A user long pressing (left) and firmly pressing (right) on a launcher icon.
This also means that users can take advantage of this gesture without developers needing to update their apps. Applications that use Android’s GestureDetector or View APIs will automatically get these press signals through their existing long-press handlers. Developers that implement custom long-press detection logic can receive these press signals through the MotionEvent classification API introduced in Android Q.
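
For apps with custom long-press detection, here is a minimal sketch of consuming that classification signal, assuming API level 29 or higher; the view and its guard flag are illustrative.

```kotlin
import android.content.Context
import android.os.Build
import android.view.MotionEvent
import android.view.View

// Custom view with its own long-press logic that also reacts to the system's
// press classification, so a firm press triggers the action without waiting
// for the full time threshold. (Views using GestureDetector get this for free.)
class PressableView(context: Context) : View(context) {

    private var longPressFired = false

    override fun onTouchEvent(event: MotionEvent): Boolean {
        when (event.actionMasked) {
            MotionEvent.ACTION_DOWN -> longPressFired = false
            MotionEvent.ACTION_MOVE ->
                if (!longPressFired &&
                    Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q &&
                    event.classification == MotionEvent.CLASSIFICATION_DEEP_PRESS
                ) {
                    longPressFired = true
                    performLongClick()   // accelerate the long-press outcome
                }
        }
        // ... existing time-threshold long-press detection would go here ...
        return true
    }
}
```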

Through this integration of machine-learning algorithms and careful interaction design, we were able to deliver a more expressive touch experience for Pixel users. We plan to continue researching and developing these capabilities to refine the touch experience on Pixel, and explore new forms of touch interaction.

Acknowledgements
This project is a collaborative effort between the Android UX, Pixel software, and Android framework teams.

Source: Google AI Blog