Google Cloud Platform Blog

Product updates, customer stories, and tips and tricks on Google Cloud Platform

Finer-grained security using custom roles for Cloud IAM



IT security aims to ensure the right people have access to the right resources and use them in the right ways. Making sure those are the only things that can happen is the "principle of least privilege," a cornerstone of enterprise security policy. Custom roles for Cloud IAM make that easier with the power to pick the precise permissions people need to do their jobs—and are now generally available.

Google Cloud Platform (GCP) offers hundreds of predefined roles that range from "Owner" to product- and job-specific roles as narrow as "Cloud Storage Viewer." These are curated combinations of the thousands of IAM permissions that control every API in GCP, from starting a virtual machine to making predictions using machine learning models. For even finer-grained access control, custom roles now offer production-level support for remixing permissions across all GCP services.


Security that’s built to fit


Consider a tool that needs access to multiple GCP services to inventory Cloud Storage buckets, BigQuery tables and Cloud Spanner databases. Enumerating data doesn’t require privileges to decrypt that data. While predefined roles that view an entire project may grant .query, .decrypt and .get as a set, custom roles make it possible to grant .get permission on its own. Since a custom role can also combine permissions from multiple GCP services, you can put all of the permissions for a service account in one place—and then share that new role across your entire organization.
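
As an illustration, here's a minimal sketch of creating such a role through the IAM API, assuming the google-api-python-client library and application default credentials with role-administration rights; the project ID, role ID and permission list below are made up for the example.

    import googleapiclient.discovery

    # Build an IAM API client using application default credentials.
    iam = googleapiclient.discovery.build("iam", "v1")

    # Hypothetical project, role ID and permission set for an "inventory" service account.
    role = iam.projects().roles().create(
        parent="projects/my-inventory-project",
        body={
            "roleId": "assetInventoryViewer",
            "role": {
                "title": "Asset Inventory Viewer",
                "description": "List and get Cloud Storage, BigQuery and Spanner resources.",
                "includedPermissions": [
                    "storage.buckets.get",
                    "bigquery.tables.get",
                    "spanner.databases.get",
                ],
                "stage": "GA",
            },
        },
    ).execute()

    print(role["name"])  # e.g. projects/my-inventory-project/roles/assetInventoryViewer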


Custom roles aren’t just for services; users can also benefit from roles that are properly tailored to get their jobs done. For example, one regulation may state that a privacy auditor should be able to inspect all the personally-identifiable information (PII) stored about your customers; another, that only full-time employees should process such data. Depending on job roles, it may be too powerful to grant BigQuery Data Owner to an auditor (who shouldn’t be able to delete data); yet BigQuery Data Viewer may be too weak for employees (who also need to search the data and run reports). IAM custom roles allow you to include or exclude permissions to match specific job roles:
“As the largest owner and operator of shopping centers in Australia and New Zealand, data security is crucial to our business. Google Cloud IAM custom roles help us meet our security standards, legislative requirements and remain compliant with the Australian Privacy Principles. With this feature, we can implement identity and access control to the authorized tasks performed by a specific person or machine, allowing us to fine-tune permissions and rigorously conform to the principle of least privilege.” 
— Evgeny Minkevich, Integration Solution Architect, Scentre Group

Managing custom roles


GCP is constantly expanding and evolving, and so is the set of permissions that controls all of its APIs. Almost all permissions are available for customization today, with the exception of a few that are only tested and supported in predefined role combinations. To keep abreast of new permissions, and changes in the support level of existing ones, you can now rely on a central permission change log for all public GCP services as well as a list of all supported permissions in custom roles.

We also suggest some recommended practices for testing, deploying and maintaining your own custom roles. To track and control changes to your custom roles, we’ve improved our integration with Cloud Deployment Manager to create and update custom roles, both within projects and across entire organizations (sample code). Together with existing Deployment Manager features that control how resources are created, organized and secured, IAM custom roles can help automate applying the principle of least privilege.

What’s next


We continue to invest in making IAM more powerful and easier to use, including helping you to create and manage custom roles. That starts with regular updates on permission changes, so you can keep your own custom roles in sync with Google’s new services, roles and permissions. It extends into research with the Forseti Security open source initiative to explain why a permission was granted or denied. We want the principle of "least privilege" to take the least effort, too!

Google Cloud Platform opens region in the Netherlands



Our fourteenth Google Cloud Platform region, located in the Netherlands, is now open for you to build applications and store your data.

The new Netherlands region, europe-west4, joins Belgium, London and Frankfurt in Europe and makes it easier to build highly available, performant applications using resources across those geographies.


Services


The Netherlands region has everything you need to build the next great application, and three zones to make it stand up to whatever Mother Nature has to offer.

Google Cloud Network


Interested in a GCP service that’s not available in the Netherlands region? No problem. You can access that service via the Google Network, the largest cloud network as measured by number of points of presence.

And that network infrastructure keeps getting stronger. We recently announced the Havfrue cable system to further expand our transatlantic information corridor, and will offer Dedicated Interconnect services from the Netherlands region, accessible from both the Equinix Amsterdam Schepenbergweg (AM5) (formerly Telecity AMS5) and Equinix Amsterdam (AM3) locations.

Our Dutch datacenter


The Netherlands region is located in our existing datacenter in Eemshaven. Prior to opening this datacenter two years ago, we had procured enough renewable energy on the Dutch grid to ensure consumption would be matched with 100% renewable energy from day one. This means that when you use this region to run your compute, store your data and develop your applications, you're doing so sustainably.


What customers are saying


Companies in the Netherlands welcome the addition of this GCP region.

“We are very excited with the arrival of the Google Cloud Platform in The Netherlands, one of our key European markets. Google Cloud Platform enables us to rapidly and easily grow our business globally, empowering our people to deliver true value to our customers. They are a strategic partner supporting us to provide personalized travel booking services at scale.”
 John Mangelaars, Chief Executive Officer, Travix

"At Blendle, we're big believers in multi-cloud and open source software such as Linux, Docker and Kubernetes. With the introduction of Google Kubernetes Engine, we switched a big part of our infrastructure to Google Cloud. Google's newest region underscores its growing commitment to Europe and The Netherlands, and will allow us to expand our mission of saving quality journalism in The Netherlands, Europe and eventually on the world." 
 Jean Mertz, Chief Technology Officer, Blendle
"By using Google Cloud Platform, we can focus our engineering effort on getting the best features for our customers and partners available 24/7."  
 Jurrie Van Rooijen, Chief Technology Officer, bol.com


Google Cloud in Europe


We’ve built out our Google Cloud presence in Europe over the last 10 years, with dedicated Cloud teams in over 20 offices across Europe, including our Benelux offices in Amsterdam and Brussels. We’re also active with local communities, supporting volunteer-run Google Developer Groups covering Android, Web, Cloud, IoT and GoLang, with over 3,000 members across them all. This year, we'll continue to grow our customer-facing and partner teams to help companies in the Netherlands transform their businesses.

Getting started


For help migrating to GCP, please contact our local partners. For additional details on the region, please visit our Netherlands region page, where you’ll get access to free resources, whitepapers, the "Cloud On-Air" on-demand video series and more. Our locations page provides updates on the availability of additional services and regions. Contact us to request early access to new regions and help us prioritize what we build next.

 *Please visit our Service Specific Terms to get detailed information on our data storage capabilities.

12 best practices for user account, authorization and password management



Account management, authorization and password management can be tricky. For many developers, account management is a dark corner that doesn't get enough attention. For product managers and customers, the resulting experience often falls short of expectations.

Fortunately, Google Cloud Platform (GCP) brings several tools to help you make good decisions around the creation, secure handling and authentication of user accounts (in this context, anyone who identifies themselves to your system, whether customers or internal users). Whether you're responsible for a website hosted in Google Kubernetes Engine, an API on Apigee, an app using Firebase or another service with authenticated users, this post will lay out the best practices to ensure you have a safe, scalable, usable account authentication system.

1. Hash those passwords


My most important rule for account management is to safely store sensitive user information, including their password. You must treat this data as sacred and handle it appropriately.

Do not store plaintext passwords under any circumstances. Your service should instead store a cryptographically strong hash of the password that cannot be reversed, created with, for example, PBKDF2, SHA3, Scrypt or Bcrypt. The hash should be salted with a value unique to that specific login credential. Do not use deprecated hashing technologies such as MD5 or SHA1, and under no circumstances should you use reversible encryption or try to invent your own hashing algorithm.
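
As a minimal sketch of the above (assuming Python 3.6+ built against an OpenSSL with scrypt support; the cost parameters are illustrative, not a tuning recommendation):

    import hashlib
    import hmac
    import os

    def hash_password(password):
        """Return (salt, digest); store both, never the plaintext password."""
        salt = os.urandom(16)  # a salt unique to this credential
        digest = hashlib.scrypt(password.encode("utf-8"), salt=salt, n=2**14, r=8, p=1)
        return salt, digest

    def verify_password(password, salt, digest):
        candidate = hashlib.scrypt(password.encode("utf-8"), salt=salt, n=2**14, r=8, p=1)
        return hmac.compare_digest(candidate, digest)  # constant-time comparison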

You should design your system assuming it will be compromised eventually. Ask yourself "If my database were exfiltrated today, would my users' safety and security be in peril on my service or other services they use? What can we do to mitigate the potential for damage in the event of a leak?"

Another point: if you could possibly produce a user's password in plaintext at any time outside of immediately after they provide it to you, there's a problem with your implementation.

2. Allow for third-party identity providers if possible


Third-party identity providers enable you to rely on a trusted external service to authenticate a user's identity. Google, Facebook and Twitter are commonly used providers.

You can implement external identity providers alongside your existing internal authentication system using a platform such as Firebase Auth. There are a number of benefits that come with Firebase Auth, including simpler administration, smaller attack surface and a multi-platform SDK. We'll touch on more benefits throughout this list. See our case studies on companies that were able to integrate Firebase Auth in as little as one day.
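
As one hedged example of leaning on such a platform, here's a sketch of server-side ID-token verification with the firebase-admin Python SDK; it assumes application default credentials for a Firebase project and that the client has already signed in and obtained an ID token.

    import firebase_admin
    from firebase_admin import auth

    # Uses application default credentials for the Firebase project.
    firebase_admin.initialize_app()

    def user_id_from_token(id_token):
        """Verify a client-supplied ID token and return the stable Firebase user ID."""
        decoded = auth.verify_id_token(id_token)  # raises if the token is invalid or expired
        return decoded["uid"]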

3. Separate the concept of user identity and user account


Your users are not an email address. They're not a phone number. They're not the unique ID provided by an OAuth response. Your users are the culmination of their unique, personalized data and experience within your service. A well-designed user management system has low coupling and high cohesion between different parts of a user's profile.

Keeping the concepts of user account and credentials separate will greatly simplify the process of implementing third-party identity providers, allowing users to change their username and linking multiple identities to a single user account. In practical terms, it may be helpful to have an internal global identifier for every user and link their profile and authentication identity via that ID as opposed to piling it all in a single record.
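
In code, that separation might look like the following sketch; the class and field names are illustrative, not a prescribed schema.

    import uuid
    from dataclasses import dataclass, field

    @dataclass
    class Identity:
        """One way a user can authenticate: a provider plus that provider's subject ID."""
        provider: str  # e.g. "password", "google.com", "facebook.com"
        subject: str   # the provider's identifier for this user (email, OAuth sub, ...)

    @dataclass
    class UserAccount:
        """The user's profile, keyed by an internal ID that never changes."""
        user_id: str = field(default_factory=lambda: uuid.uuid4().hex)
        display_name: str = ""
        identities: list = field(default_factory=list)  # zero or more linked Identity records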

4. Allow multiple identities to link to a single user account


A user who authenticates to your service using their username and password one week might choose Google Sign-In the next without understanding that this could create a duplicate account. Similarly, a user may have very good reason to link multiple email addresses to your service. If you properly separated user identity and authentication, it will be a simple process to link several identities to a single user.

Your backend will need to account for the possibility that a user gets part or all the way through the signup process before they realize they're using a new third-party identity not linked to their existing account in your system. This is most simply achieved by asking the user to provide a common identifying detail, such as email address, phone or username. If that data matches an existing user in your system, require them to also authenticate with a known identity provider and link the new ID to their existing account.

5. Don't block long or complex passwords


NIST has recently updated guidelines on password complexity and strength. Since you are (or will be very soon) using a strong cryptographic hash for password storage, a lot of problems are solved for you. Hashes will always produce a fixed-length output no matter the input length, so your users should be able to use passwords as long as they like. If you must cap password length, only do so based on the maximum POST size allowable by your servers. This is commonly well above 1MB. Seriously.

Your hashed passwords will consist of a small selection of known ASCII characters. If not, you can easily convert a binary hash to Base64. With that in mind, you should allow your users to use literally any characters they wish in their password. If someone wants a password made of Klingon, Emoji and control characters with whitespace on both ends, you should have no technical reason to deny them.
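
One practical wrinkle: the same visible character can arrive as different Unicode byte sequences from different keyboards, so normalize before hashing. A small sketch (choosing NFKC here is an illustrative assumption):

    import unicodedata

    def password_bytes(password):
        """Normalize Unicode so the same visible password always hashes identically."""
        return unicodedata.normalize("NFKC", password).encode("utf-8")

The normalized bytes can then go straight into a hashing routine like the one sketched under rule 1.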

6. Don't impose unreasonable rules for usernames


It's not unreasonable for a site or service to require usernames longer than two or three characters, block hidden characters and prevent whitespace at the beginning and end of a username. However, some sites go overboard with requirements such as a minimum length of eight characters or by blocking any characters outside of 7-bit ASCII letters and numbers.

A site with tight restrictions on usernames may offer some shortcuts to developers, but it does so at the expense of users, and extreme cases will drive some users away.

There are some cases where the best approach is to assign usernames. If that's the case for your service, ensure the assigned username is user-friendly insofar as users need to recall and communicate it. Alphanumeric IDs should avoid visually ambiguous symbols such as "Il1O0." You're also advised to perform a dictionary scan on any randomly generated string to ensure there are no unintended messages embedded in the username. These same guidelines apply to auto-generated passwords.
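
Here's a hedged sketch of generating such an identifier; the alphabet and length are illustrative, and a real implementation would still run the dictionary scan described above.

    import secrets

    # Alphabet with the visually ambiguous characters ("Il1O0") removed.
    ALPHABET = "abcdefghjkmnpqrstuvwxyz23456789"

    def generate_username(length=10):
        return "".join(secrets.choice(ALPHABET) for _ in range(length))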

7. Allow users to change their username


It's surprisingly common in legacy systems or any platform that provides email accounts not to allow users to change their username. There are very good reasons not to automatically release usernames for reuse, but long-term users of your system will eventually come up with a good reason to use a different username and they likely won't want to create a new account.

You can honor your users' desire to change their usernames by allowing aliases and letting your users choose the primary alias. You can apply any business rules you need on top of this functionality. Some organizations might only allow one username change per year or prevent a user from displaying anything but their primary username. Email providers might ensure users are thoroughly informed of the risks before detaching an old username from their account, or perhaps forbid unlinking old usernames entirely.

Choose the right rules for your platform, but make sure they allow your users to grow and change over time.


8. Let your users delete their accounts


A surprising number of services have no self-service means for a user to delete their account and associated data. There are a number of good reasons for a user to close an account permanently and delete all personal data. These concerns need to be balanced against your security and compliance needs, but most regulated environments provide specific guidelines on data retention. A common solution to avoid compliance and hacking concerns is to let users schedule their account for automatic future deletion.

In some circumstances, you may be legally required to comply with a user's request to delete their data in a timely manner. You also greatly increase your exposure in the event of a data breach where the data from "closed" accounts is leaked.

9. Make a conscious decision on session length


An often overlooked aspect of security and authentication is session length. Google puts a lot of effort into ensuring users are who they say they are and will double-check based on certain events or behaviors. Users can take steps to increase their security even further.

Your service may have good reason to keep a session open indefinitely for non-critical analytics purposes, but there should be thresholds after which you ask for a password, second factor or other user verification.

Consider how long a user should be able to be inactive before re-authenticating. Verify user identity in all active sessions if someone performs a password reset. Prompt for authentication or a second factor if a user changes core aspects of their profile or when they're performing a sensitive action. Consider whether it makes sense to disallow logging in from more than one device or location at a time.
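
A minimal sketch of those threshold checks, where the limits and the session fields are placeholders rather than recommendations:

    import time

    IDLE_LIMIT = 30 * 60             # re-authenticate after 30 minutes of inactivity
    ABSOLUTE_LIMIT = 30 * 24 * 3600  # and unconditionally after 30 days

    def needs_reauth(session, sensitive_action=False):
        """Decide whether to prompt for a password or second factor again."""
        now = time.time()
        if sensitive_action:  # e.g. changing email, password or payment details
            return True
        if now - session["last_seen"] > IDLE_LIMIT:
            return True
        return now - session["created_at"] > ABSOLUTE_LIMIT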

When your service does expire a user session or require re-authentication, prompt the user in real-time or provide a mechanism to preserve any activity they have unsaved since they were last authenticated. It's very frustrating for a user to fill out a long form, submit it some time later and find out all their input has been lost and they must log in again.

10. Use 2-Step Verification


Consider the practical impact on a user of having their account stolen when choosing from 2-Step Verification (also known as two-factor authentication or just 2FA) methods. SMS 2FA has been deprecated by NIST due to multiple weaknesses; however, it may be the most secure option your users will accept for what they consider a trivial service. Offer the most secure 2FA you reasonably can. Enabling third-party identity providers and piggybacking on their 2FA is a simple means to boost your security without great expense or effort.

11. Make user IDs case insensitive


Your users don't care and may not even remember the exact case of their username. Usernames should be fully case-insensitive. It's trivial to store usernames and email addresses in all lowercase and transform any input to lowercase before comparing.
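
For example (a sketch; str.casefold() handles a few non-ASCII edge cases that str.lower() misses):

    def normalize_username(raw):
        """Store and compare usernames in one canonical, case-insensitive form."""
        return raw.strip().casefold()

    assert normalize_username("  Jane.Doe ") == normalize_username("jane.doe")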

Smartphones represent an ever-increasing percentage of user devices. Most of them offer autocorrect and automatic capitalization of plain-text fields. Preventing this behavior at the UI level might not be desirable or completely effective, and your service should be robust enough to handle an email address or username that was unintentionally auto-capitalized.

12. Build a secure auth system


If you're using a service like Firebase Auth, a lot of security concerns are handled for you automatically. However, your service will always need to be engineered properly to prevent abuse. Core considerations include implementing a password reset instead of password retrieval, detailed account activity logging, rate limiting login attempts, locking out accounts after too many unsuccessful login attempts and requiring 2-factor authentication for unrecognized devices or accounts that have been idle for extended periods. There are many more aspects to a secure authentication system, so please see the section below for links to more information.
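
As one illustration of the rate-limiting point, here's a hedged, in-memory sketch; a real deployment would persist the counters, share them across servers and combine them with lockout and 2FA policies.

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 300  # look at the last five minutes
    MAX_FAILURES = 5      # failed attempts allowed per window (illustrative)

    _failures = defaultdict(deque)

    def allow_login_attempt(account_id):
        """Return False once an account has too many recent failed attempts."""
        now = time.time()
        attempts = _failures[account_id]
        while attempts and now - attempts[0] > WINDOW_SECONDS:
            attempts.popleft()  # drop failures that have aged out of the window
        return len(attempts) < MAX_FAILURES

    def record_failed_attempt(account_id):
        _failures[account_id].append(time.time())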


Further reading


There are a number of excellent resources available to guide you through the process of developing, updating or migrating your account and authentication management system, and I recommend seeking them out as a starting place.

Cloud Shell Tutorials: Learning experiences integrated into the Cloud Console



A few weeks ago we released the Open in Cloud Shell feature, which lets a simple hyperlink open Cloud Shell with an automatically cloned GitHub repo, preselected open files in the Cloud Editor and other features to make creating interactive content as easy as possible. Today we’re adding the ability to integrate your tutorials right into the Google Cloud Platform Console and run them via the click of a link.

Cloud Shell tutorials are authored in (CommonMark) Markdown syntax, with extensions that add capabilities to, for example:

  • Create a new project 
  • Enable billing 
  • Open a file 
  • Highlight a UI element 
Cloud Shell tutorials offer the ability to automate all of the above functions and more, via simple Markdown syntax.

In other words, instead of providing indirect instructions, like “open the file foo.txt by clicking the highlighted file in the image shown below,” your tutorial can, more directly and more simply, say “click here to open foo.txt.” Clicking that link will have the expected effect directly inside the Cloud Console.

Here are some examples of use cases:

  • a tutorial for a service or product where a sample app lives in an associated repo
  • a walkthrough of a repo that introduces people to the project structure 
  • a sequence of steps for installing prerequisites and building something interesting (akin to the INSTALL file found in some open source repos) 
Cloud Shell tutorials make it easy for anyone to build compelling instructional material, integrated right into the Cloud Console. You can read more about this feature in the public documentation, which contains an example tutorial showing off all the new capabilities along with the corresponding Markdown syntax.

We hope you enjoy building some great tutorials using this feature. Let us know about your experiences, good and bad, via the Feedback link at the bottom of the documentation page!

How we built a serverless digital archive with machine learning APIs, Cloud Pub/Sub and Cloud Functions



[Editor’s note: Today we hear from Incentro, a digital service provider and Google partner, which recently built a digital asset management solution on top of GCP. It combines machine learning services like Cloud Vision and Speech APIs to easily find and tag digital assets, plus Cloud Pub/Sub and Cloud Functions for an automated, serverless solution. Read on to learn how they did it.]

Here at Incentro, we have a large customer base among media and publishing companies. Recently, we noticed that our customers struggle with storing and searching for digital media assets in their archives. It’s a cumbersome process that involves a lot of manual tagging. As a result, the videos are often stored without being properly tagged, making it nearly impossible to find and reuse these assets afterwards.

To eliminate this sort of manual labour and to generate more business value from these expensive video assets, we sought to create a solution that would take care of mundane tasks like tagging photos and videos. This solution, called Segona Media (https://segona.io/media), lets our customers store, tag and index their digital assets automatically.

Segona Media management features


Segona Media currently supports images, video and audio assets. For each of these asset types, Google Cloud provides specific managed APIs to extract relevant content from the asset without customers having to tag them manually or transcribe them.

  • For images, Cloud Vision API extracts most of the content we need: labels, landmarks, text and image properties are all extracted and can be used to find an image.
  • For audio, Cloud Speech API showed us tremendous results in transcribing an audio track. After transcribing the audio, we also use Google Cloud Natural Language API to discover sentiment and categories in the transcription. This way, users can search for spoken text, but also search for categories of text and even sentiment.
  • For video, we typically use a combination of audio and image analysis. Cloud Video Intelligence API extracts labels and timeframes, and detects explicit content. On top of that, we process the audio track from a video the same way we process audio assets (see above). This way users can search content from the video as well as from spoken text in the video.

Segona Media architecture


The traditional way for developing a solution like this involves getting hardware running, determining and installing application servers, databases, storage nodes, etc. After developing and getting the solution into production you may then come across a variety of familiar challenges: the operating system needs to be updated or upgraded or databases don't scale to cope with unexpected production data. We didn't want any of this, so after careful consideration, decided on a completely managed solution and serverless architecture. That way we’d have no servers to maintain, we could leverage Google’s ongoing API improvements and our solution could scale to handle the largest archives we could find.

We wanted Segona Media to also be able to easily connect to common tools in the media and publishing industries. Adobe InDesign, Premiere, Photoshop and Digital Asset Management solutions must all be able to easily store and retrieve assets from Segona Media. We solved this by using the GCP APIs that were already in place for storing assets in Google Cloud Storage, and taking it from there. We retrieve assets using the managed Elasticsearch API that runs on GCP.

Each action that Segona Media performs is a separate Google Cloud Function, usually triggered by a Cloud Pub/Sub queue. Using a Pub/Sub queue to trigger a Cloud Function is an easy and scalable way to publish new actions.

Here’s a high-level architecture view of Segona Media:

High Level Architecture

And here's how the assets flow through Segona Media:

  1. An asset is uploaded/stored to a Cloud Storage bucket
  2. This event triggers a Cloud Function, which generates a unique ID, extracts metadata from the file object, moves it to the appropriate bucket (with lifecycle management) and creates the asset in the Elasticsearch index (we run Elastic Cloud hosted on GCP).
  3. This queues up multiple asset processors in Google Cloud Pub/Sub that are specific for an asset type and that extract relevant content from assets using Google APIs.


Media asset management


Now, let's see how Segona Media handles different types of media assets.

Images

Images have a lot of features on which you can search, which we do via a dedicated microservice processor.

  • We use ImageMagick to extract metadata from the image object itself. We extract all XMP and EXIF metadata that's embedded in the file itself. This information is then added to the Elastic index and makes the image searchable by, for example, copyright info or resolution. 
  • Cloud Vision API extracts labels in the image, landmarks, text and image properties. This takes away manual tagging of objects in the image and makes a picture searchable for its contents (a minimal sketch of this step follows this list).
  • Segona Media also lets customers create custom labels. For example, a television manufacturer might want to know the specific model of a television set in a picture. We’ve implemented custom predictions by building our own TensorFlow models trained on custom data, and we train and run predictions on Cloud ML Engine.
  • For easier serving of all assets, we also create a low resolution thumbnail of every image.
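
Here's the promised sketch: a background Cloud Function, triggered by an object landing in Cloud Storage, that asks Cloud Vision API for labels. It assumes the google-cloud-vision client library and the legacy background-function signature; error handling and the Elasticsearch write are omitted.

    from google.cloud import vision

    vision_client = vision.ImageAnnotatorClient()

    def label_uploaded_image(data, context):
        """Triggered by a finalized Cloud Storage object; prints Vision API labels."""
        gcs_uri = "gs://{}/{}".format(data["bucket"], data["name"])
        image = vision.Image(source=vision.ImageSource(image_uri=gcs_uri))
        response = vision_client.label_detection(image=image)
        labels = [(label.description, label.score) for label in response.label_annotations]
        print("Labels for {}: {}".format(gcs_uri, labels))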

Audio


Processing audio is pretty straightforward. We want to be able to search for spoken text in audio files, and we use Cloud Speech API to extract text from the audio. We then feed the transcription into the Elasticsearch index, making the audio file searchable by every word.

Video


Video is basically the combination of everything we do with images and audio files. There are some minor differences though, so let's see what microservices we invoke for these assets:

  • First of all, we create a thumbnail so we can serve up a low-res image of the video. We take a thumbnail at 50% of the video. We do this by combining FFmpeg and FFprobe in Cloud Functions, and store this thumbnail alongside the video asset. Creating a thumbnail with Cloud Functions and FFmpeg is easy! Check out this code snippet: https://bitbucket.org/snippets/keesvanbemmel/keAkqe (a hedged sketch also follows this list). 
  • Using the same FFmpeg architecture, we extract the audio stream from the video. This audio stream is then processed like any other audio file: We extract the text from spoken text in the audio stream and add it to the Elastic index so the video itself can also be found by searching for every word that's spoken. We extract the audio stream from the video in a single channel FLAC format as this gives us the best results. 
  • We also extract relevant information from the video contents itself using Cloud Video Intelligence. We extract labels that are in the video as well as the timestamps for when the labels were created. This way, we know which objects are at what point in the video. Knowing the timestamps for a given label is a fantastic way to point a user to not only a video, but the exact moment in the video that contains the object they're looking for.
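
And the promised thumbnail sketch: it assumes the ffmpeg and ffprobe binaries are bundled with (or otherwise available to) the function and that the video has already been downloaded locally; uploading the result back to Cloud Storage is omitted.

    import subprocess

    def video_duration_seconds(path):
        """Ask ffprobe for the container duration in seconds."""
        out = subprocess.check_output([
            "ffprobe", "-v", "error", "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1", path])
        return float(out.decode().strip())

    def thumbnail_at_midpoint(video_path, thumb_path):
        """Grab a single frame at 50% of the video's duration."""
        midpoint = video_duration_seconds(video_path) / 2
        subprocess.check_call([
            "ffmpeg", "-y", "-ss", str(midpoint), "-i", video_path,
            "-frames:v", "1", thumb_path])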

There you have it—a summary of how to do smart media tagging in a completely serverless fashion, without all the OS updates when scaling up or out, or of course, the infrastructure maintenance and support! This way we can focus on what we care about: bringing an innovative, scalable solution to our end customers. Any questions? Let us know! We love to talk about this stuff ;) Leave a comment below, email me at [email protected], or find us on Twitter at @incentro_.

An example escalation policy — CRE life lessons



In an earlier blog post, we discussed the spectrum of engineering effort between reliability and feature development and the importance of describing when and how an organization should dedicate engineering time towards the reliability of a service that is out of SLO. In this post, we show a lightly-edited SLO escalation policy and associated rationales from a Google SRE team to illustrate the trade-offs that particular teams make to maintain a high development velocity.

This SRE team works with large teams of developers focused on different areas of the serving stack, which comprises around ten high-traffic services and a dozen or so smaller ones, all with SRE support. The team has shards in Europe and America, each covering 12 hours of a follow-the-sun on-call rotation. The supported services have both coarse top-level SLOs representing desired user experience and finer-grained SLOs representing the availability requirements of stack components; crucially the SRE team can route pages to dev teams at the granularity of an individual SLO, making "revoking support" for an SLO both cheap and quick. Alerting is configured to page when the service has burned nine hours of error budget within an hour, and file a ticket when it has burned one week of error budget over the previous week.
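
To make the paging condition concrete, here's a hedged sketch of the burn-rate test (not the team's actual alerting configuration):

    def should_page(error_ratio_last_hour, slo, budget_hours_burned=9.0):
        """Page if the last hour consumed `budget_hours_burned` hours' worth of error budget."""
        hourly_error_budget = 1.0 - slo  # allowed error ratio at a steady burn rate
        return error_ratio_last_hour >= budget_hours_burned * hourly_error_budget

    # Example: with a 99.9% SLO, paging triggers once at least 0.9% of requests fail in an hour.
    assert should_page(error_ratio_last_hour=0.01, slo=0.999)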

It's important to note that this policy is just an example, and probably a poor one if your SRE team supports a service with availability targets of 99.99% or higher. The industry that this Google team operates in is highly competitive and moves quickly, making feature iteration speed and time-to-market more important than maintaining high levels of availability.

Escalation policy preamble


Before getting into the specifics of the escalation policy, it's important to consider the following broad points.

The intent of an escalation policy is not to be completely prescriptive; SREs are expected to make judgement calls as to appropriate responses to situations they face. Instead, this document establishes reasonable thresholds for specific actions to take place, with the intent of reducing the likely range of responses and achieving a measure of consistency. It's structured as a series of thresholds that, when crossed, trigger the redirection of more engineering effort towards addressing an SLO violation.

Furthermore, SRE must focus on fixing the class of issue before declaring an incident resolved. This is a higher bar than fixing the issue itself. For example, if a bad flag flip causes a severe outage, reverting the flag flip is insufficient to bring the service back into SLO. SRE must instead ensure that flag flips in general are extremely unlikely to threaten the SLO in the future, with staged rollouts, automated rollbacks on push failures, and versioned configuration to tie flags to binary versions.

For the following four thresholds in the escalation policy, "bringing a service back into SLO" means:
  • finding the root cause and fixing the relevant class of issue, or
  • automating remediation such that ongoing manual intervention is no longer necessary, or 
  • simply waiting one week, if the class of issue is extremely unlikely to recur with frequency and severity sufficient to threaten the SLO in the future
In other words, a plan for manual remediation is not sufficient to consider the service back within SLO. Bear in mind that you usually need to understand the root cause of a violation to conclude that it's unlikely to recur or to automate remediation.


Escalation policy thresholds


Threshold 1 - wherein SRE are notified that an SLO is potentially impacted

SRE will maintain alerting so as to be notified of danger to supported SLOs. Upon being notified, SRE will investigate and attempt to find and address the root cause. SRE will consider taking mitigating actions, including redirecting traffic at the load balancers and rolling back binary or configuration pushes. SRE on-call engineers will notify the dev team about the SLO impact and keep them updated as necessary, but no action on their part is required at this point.

Threshold 2 - wherein SRE escalates to the developers
  • If,
    • SRE have concluded they cannot bring the service into SLO without help, and
    • SRE and dev agree that the SLO represents desired user experience
  • Then,
    • SRE and dev on-calls prioritize fixing the root cause and update the bug daily
    • SRE escalates to dev leads for visibility and additional assistance if necessary
    • Alerting thresholds may be relaxed to avoid continually paging for the known issue, while continuing to provide protection against further regressions
  • When the service is brought back into SLO,
    • SRE will revert any alerting changes
    • SRE may create a postmortem
    • Or, if the SLO does not accurately represent desired user experience, the SRE, dev and product teams will agree to change or retire the SLO
Threshold 3 - wherein SRE pauses feature releases and focuses on reliability
  • If,
    • Conditions for the previous threshold are met for at least one week, and
    • The service has not been brought back into SLO, and
    • The 30-day error budget is exhausted
  • Then during the following week,
    • Only cherry-picked fixes for diagnosed root causes may be pushed to production
    • SRE may escalate to their leadership and dev management to request that members of the dev team prioritize finding and fixing the root cause over any non-emergency work
    • Daily updates may be made to an "escalations" mailing list (used to broadcast information about outages to a wide audience, including executive leadership).
  • When the service is brought back into SLO,
    • Normal binary releases resume
    • SRE creates a postmortem
    • Team members may re-prioritize normal project work

Threshold 4 - wherein SRE may escalate or revoke support

  • If,
    • Conditions for the previous threshold are met for at least one week, and
    • The service has not been brought back into SLO, and
    • The 90-day error budget is exhausted or the dev team is unwilling to pause feature work to improve reliability
  • Then,
    • SRE may escalate to executive leadership to commandeer more people dedicated to fixing the problem
    • SRE may revoke support for the SLO or the service, and re-direct or disable relevant alerting

On escalation and incident response


SREs are first responders, and there's an expectation that they'll make a reasonable effort to bring the service back within SLO before escalating to developers. As such, threshold 1 applies when the SRE team is notified about a violation, despite the one-week ticket alert indicating the seven-day budget is already exhausted. SRE should wait no longer than one week from the initial violation notification before escalating to developers, but they may exercise their own judgement as to whether escalation is appropriate before this point.

Every time SRE escalates, it’s important to ask developers whether the availability goals still represent the desired balance between reliability and development velocity. This gives them the choice between preserving availability goals by rolling back a new feature and temporarily relaxing them to preserve the availability of that feature for users if the latter is the desired user experience. For repeated violations of the same SLO in a short time window, you probably don't need to ask the question over and over again, though that's a strong signal that further escalation is necessary. It's also OK to insist that developers take back the pager for the service until they're willing to restore the previously-agreed availability targets—if they want to run a less reliable service temporarily so that a business-critical feature remains available while they work on its reliability, they can also shoulder the burden of its failures.

On blocking releases


Blocking releases is an appropriate course of action for three main reasons:
  1. Commonly, the largest source of burnt error budget at steady state is the release push. If you’ve already burned all your budget, not pushing new releases lowers the steady-state burn rate, bringing the service back into SLO more quickly
  2. It eliminates the risk of further unexpected SLO violations due to bugs in new code. This is also why any fixes for diagnosed root causes must be patched into the current release, rather than rolling forward to a new release
  3. While blocking releases is not intended as a punitive measure, it does directly impact release velocity, which the dev org cares about deeply. As such, tying SLO violations to reduced velocity aligns the incentives of both organizations. SRE wants the service to stay within SLO, the dev org wants to build new features quickly. This way, either both happen or neither do.
SRE should prefer to unblock feature releases sooner rather than later, once the root cause(s) of a violation has been found and fixed. Giving our dev teams the benefit of the doubt that there will be no further service degradation before the SLO is in compliance over a 30-day window strikes a more acceptable balance between reliability and velocity. This is effectively "borrowing" future error budget to unblock the release before the service is compliant, with the expectation that it will be within a reasonable timeframe. Absent any push-related outages, new features should increase user happiness with the service, repaying some of the unhappiness caused by the SLO violation.

SRE may choose not to unblock releases if pre-violation error-budget burn rates were close to the SLO threshold. In this case, there's less future budget to borrow, thus the risk of further violations is higher and the time until the service is SLO compliant will be significantly longer if releases are allowed to continue.

Summary


We hope that the above example gives you some ideas about how to make trade-offs between reliability and development velocity for a service where the latter is a key business priority. The main concessions to velocity are that SRE doesn’t immediately block releases when an SLO is violated, and provides a mechanism for them to resume before the SLO has returned to compliance with the informed consent of SRE. In the final post of the series, we'll take these policy thresholds out for a spin with some hypothetical scenarios.

Whitepaper: Embark on a journey from monoliths to microservices



Today we introduced the next in a series of white papers about migration entitled “Taking the Cloud-Native Approach with Microservices.” This paper switches gears from “lift-and-shift,” and introduces the idea of “move-and-improve.” If you missed the first white paper, you can read the blog and download a copy.

The white paper provides context on monolithic software application architecture, as well as microservices architecture. You’ll also learn about the shortcomings of monoliths: They can be challenging to scale properly, and their faults are harder to isolate. Deploying monoliths can also be cumbersome and time consuming, and they generally require a long-term commitment to a particular technology stack. Alternatively, microservices are thought to be more agile, fault-resilient and scalable, because the application is modularized into a system of small services with well-defined, narrowly scoped functions and APIs.

PetShop is an eCommerce website reference implementation that is well known within both the Java and Microsoft .NET development communities, and the white paper uses it to step through the process of deconstructing a monolith into microservices. Specifically, the paper considers three different layers that may or may not be deployed in different physical tiers: the presentation, business logic and data access layers.

In addition, you’ll be introduced to the concept of domain-driven design (DDD), which advocates modeling based on a business’s practical use cases. In its simplest form, DDD consists of decomposing a business domain into smaller functional chunks, at either the business function or business process level, so that the complexity of both a business and problem domain can be better understood and resolved through your application.

Download your copy of the white paper and the accompanying GitHub repositories; then, take a look at how you can deconstruct the PetShop reference implementation and build a microservice-based version. You’ll be well on your way to deconstructing and rebuilding your own monoliths!

Analyzing your BigQuery usage with Ocado Technology’s GCP Census



[Editor’s note: Today we hear from Google Cloud customer Ocado Technology, which created (and open sourced!) a program to give them at-a-glance insights about their data warehouse usage, by reading BigQuery metadata. Read on to learn about how they architected the tool, what kinds of questions it can answer and whether it might be useful in your own environment.]

Here at Ocado Technology, we use a wide range of Google Cloud Platform (GCP) big data products for data-driven decisions and machine learning. Notably, we use Google BigQuery as the main storage solution for data analytics in the Ocado Smart Platform, our proprietary solution for operating online retail businesses.

Because BigQuery is so central to the Ocado platform, we wanted an easy way to get a bird’s eye view of the data stored in it. So we created GCP Census, a tool that collects metadata about BigQuery tables and stores it back into BigQuery for analysis. To have a better overview of all the data stored in BigQuery, we wanted to ask:
  • Which datasets/tables are the largest or the most expensive?
  • How many tables/partitions do we have?
  • How often are tables/partitions updated over time?
  • How are our datasets/tables/partitions growing over time?
  • Which tables/datasets are stored in a specific location?
If you also need better visibility into your organization’s BigQuery usage, read on to learn about how we architected GCP Census and what it can do. Then go ahead and download it for your own use—we recently open sourced it!

Our BigQuery domain


We store petabytes of data in BigQuery, divided into multiple GCP projects and hundreds of thousands of tables. BigQuery has many useful features for enterprise cloud data warehouses, especially in terms of speed, scalability and reliability. One example is partitioned tables rather than daily tables, which we recently adopted for their numerous benefits. At the same time, partitioned tables increased the complexity and scale of our BigQuery environment, and BigQuery offers limited ways of analysing metadata:
  • overall data size per project (from billing data) 
  • single table size (from BigQuery UI or REST API) 
  • __TABLES_SUMMARY__ and __PARTITIONS_SUMMARY__ provide only basic information, like list of tables/partitions and last update time
These constraints inspired us to build an additional layer to give us a bird’s eye view of our data.

GCP Census architecture

The resulting tool, GCP Census, is a Google App Engine app written in Python that regularly collects metadata about BigQuery tables and stores it in BigQuery.


Here's how it works:
  1. App Engine cron triggers a daily run
  2. GCP Census crawls metadata from all projects/datasets/tables to which it has access
  3. It creates a task for each table and schedules it for execution in App Engine Task Queue
  4. A task worker then retrieves table metadata using the REST API and streams it into the metadata tables. In case of partitioned tables, GCP Census also retrieves the partitions’ summary by querying the partitioned table and stores the metadata in partition_metadata table
GCP Census is highly scalable as it can easily scan millions of table/partitions. It’s also easy to set up: before GCP Census scans the resources to which it has IAM access, it automatically creates the relevant tables and views. Finally, it’s a secure cloud-native solution with App Engine Firewall and fine-grained access control, plus App Engine’s scalability and reliability!
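
For a sense of what that crawl looks like, here's a minimal sketch using the google-cloud-bigquery client library; the project ID is a placeholder, and GCP Census itself uses the REST API with App Engine task queues rather than this simplified loop.

    from google.cloud import bigquery

    client = bigquery.Client(project="YOUR-PROJECT-ID")  # placeholder project ID

    for dataset_item in client.list_datasets():
        for table_item in client.list_tables(dataset_item.reference):
            table = client.get_table(table_item.reference)  # full metadata for one table
            print(table.project, table.dataset_id, table.table_id,
                  table.num_bytes, table.num_rows, table.modified)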

Using GCP Census


There are several benefits to using GCP Census. Now you can get answers to all of the questions above by querying BigQuery from the UI or the API.

You can find below a few examples of how you can query GCP Census metadata.
  • Count all data to which GCP Census has access
    SELECT sum(numBytes) FROM
    `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
  • Count all tables and partitions
    SELECT count(*)
    FROM `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
    SELECT count(*) FROM `YOUR-PROJECT-ID.bigquery_views.partition_metadata_v1_0`
  • Select top 100 largest datasets
    SELECT projectId, datasetId, sum(numBytes) as totalNumBytes
    FROM `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
    GROUP BY projectId, datasetId ORDER BY totalNumBytes DESC LIMIT 100
  • Select top 100 largest tables
    SELECT projectId, datasetId, tableId, numBytes
    FROM `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
    ORDER BY numBytes DESC LIMIT 100
  • Select top 100 largest partitions
    SELECT projectId, datasetId, tableId, partitionId, numBytes
    FROM `YOUR-PROJECT-ID.bigquery_views.partition_metadata_v1_0`
    ORDER BY numBytes DESC LIMIT 100
Optionally, you can create a Data Studio dashboard based on the metadata. We used Data Studio because of the ease and simplicity in creating dashboards with the BigQuery connector. Splitting data by project, dataset or label and diving into the storage costs is now a breeze, and we have multiple Data Studio dashboards that help us quickly dive into the largest project, dataset or table.

Below you can find a screenshot with one of our dashboards (all real data has been redacted).
With GCP Census, we’ve learned some of the characteristics of our data; for example, we now know which data is modified daily or which historical partitions have been modified recently. We were also able to identify potential cost optimization areas—huge temporary tables that no one uses but that were incurring significant storage costs. All in all, we’ve learned a lot about our operations, and saved a bunch of money!

You can find the source code for GCP Census on GitHub at https://github.com/ocadotechnology/gcp-census, plus the required steps needed for installation and setup. We look forward to your ideas and contributions!

Running dedicated game servers in Kubernetes Engine: tutorial



Packaging server applications as container images is quickly gaining traction across tech organizations, game companies among them. They want to use containers to improve VM utilization, as well as take advantage of the isolated run-time paradigm. Despite their interest, many game companies don't know where to start.

Using the orchestration framework Kubernetes to deploy production-scale fleets of dedicated game servers in containers is an excellent choice. We recommend Google Kubernetes Engine as the easiest way to start a Kubernetes cluster for game servers on Google Cloud Platform (GCP) without manual setup steps. Kubernetes will help simplify your configuration management and automatically select a VM with adequate resources to spin up a match for your players.

We recently put together a tutorial that shows you how to integrate dedicated game servers with Kubernetes Engine, and how to automatically scale the number of VMs up and down according to player demand. It also offers some key storage strategies, including how to manage your game server assets without having to manually distribute them with each container image. Check it out, and let us know what other Google Cloud tools you’d like to learn how to use in your game operations. You can reach me on Twitter at @gcpjoe.

Get latest Kubernetes version 1.9 on Google’s managed offering



We're excited to announce that Kubernetes version 1.9 will be available on Google Kubernetes Engine next week in our early access program. This release includes greater support for stateful and stateless applications, hardware accelerator support for machine learning workloads and storage enhancements. Overall, this release achieves a big milestone in making it easy to run a wide variety of production-ready applications on Kubernetes without having to worry about the underlying infrastructure. Google is the leading contributor to open-source Kubernetes releases, and now you can access the latest Kubernetes release on our fully-managed Kubernetes Engine and let us take care of managing, scaling, upgrading, backing up and helping to secure your clusters. Further, we recently simplified our pricing by removing the fee for cluster management, resulting in real dollar savings for your environment.

We're committed to providing the latest technological innovation to Kubernetes users with one new release every quarter. Let's take a closer look at the key enhancements in Kubernetes 1.9.

Workloads APIs move to GA


The core Workloads APIs (DaemonSet, Deployment, ReplicaSet and StatefulSet), which let you run stateful and stateless workloads in Kubernetes 1.9, move to general availability (GA) in this release, delivering production-grade quality, support and long-term backwards compatibility.

Hardware accelerator enhancements


Google Cloud Platform (GCP) provides a great environment for running machine learning and data analytics workloads in containers. With this release, we’ve improved support for hardware accelerators such as NVIDIA Tesla P100 and K80 GPUs. Compute-intensive workloads will benefit greatly from cost-effective and high performance GPUs for many use cases ranging from genomics and computational finance to recommendation systems and simulations.

Local storage enhancements for stateful applications


Improvements to the Kubernetes scheduler in this release make it easier to use local storage in Kubernetes. The local persistent storage feature (alpha) enables easy access to local SSD on GCP through Kubernetes’ standard PVC (Persistent Volume Claim) interface in a simple and portable way. This allows you to take an existing Helm Chart, or StatefulSet spec using remote PVCs, and easily switch to local storage by just changing the StorageClass name. Local SSD offers superior performance including high input/output operations per second (IOPS), low latency, and is ideal for high performance workloads, distributed databases, distributed file systems and other stateful workloads.

Storage interoperability through CSI


This Kubernetes release introduces an alpha implementation of Container Storage Interface (CSI). We've been working with the Kubernetes community to provide a single and consistent interface for different storage providers. CSI makes it easy to add different storage volume plugins in Kubernetes without requiring changes to the core codebase. CSI underscores our commitment to being open, flexible and collaborative while providing maximum value—and options—to our users.

Try it now!


In a few days, you can access the latest Kubernetes Engine release in your alpha clusters by joining our early access program.