Google Cloud Platform Blog

Product updates, customer stories, and tips and tricks on Google Cloud Platform

Time to “Hello, World”: VMs vs. containers vs. PaaS vs. FaaS



Do you want to build applications on Google Cloud Platform (GCP) but have no idea where to start? That was me, just a few months ago, before I joined the Google Cloud compute team. To prepare for my interview, I watched a bunch of GCP Next 2017 talks, to get up to speed with application development on GCP.

And since there is no better way to learn than by doing, I also decided to build a “Hello, World” web application on each of GCP’s compute offerings—Google Compute Engine (VMs), Google Kubernetes Engine (containers), Google App Engine (PaaS), and Google Cloud Functions (FaaS). To make this exercise more fun (and to do it in a single weekend), I timed things and took notes, the results of which I recently wrote up in a lengthy Medium post—check it out if you’re interested in following along and taking the same journey. 

So, where do I run my code?


At a high level, though, the answer to the question of which compute option to use is: it depends. Generally speaking, it boils down to thinking about the following three criteria:
  1. Level of abstraction (what you want to think about)
  2. Technical requirements and constraints
  3. Where your team and organization are going
Google Developer Advocate Brian Dorsey gave a great talk at Next last year on Deciding between Compute Engine, Container Engine, App Engine; here’s a condensed version:


As a general rule, developers prefer to take advantage of the higher rungs of the compute abstraction ladder, as that allows us to focus on the application and the problem we are solving, while avoiding undifferentiated work such as server maintenance and capacity planning. With Cloud Functions, all you need to think about is the code that runs in response to events (a developer's paradise!). But depending on the details of the problem you are trying to solve, technical constraints can pull you down the stack. For example, if you need a very specific kernel, you might be down at the base layer (Compute Engine). (For a good resource on navigating these decision points, check out: Choosing the right compute option in GCP: a decision tree.)

What programming language should I use?

GCP broadly supports the following programming languages: Go, Java, .NET, Node.js, PHP, Python, and Ruby (details and specific runtimes may vary by the service). The best language is a function of many factors, including the task at hand as well as personal preference. Since I was coming at this with no real-world backend development experience, I chose Node.js.

Quick aside for those of you who might be not familiar with Node.js: it’s an asynchronous JavaScript runtime designed for building scalable web application back-ends. Let’s unpack this last sentence:

  • Asynchronous means first-class support for asynchronous operations (compared to many other server-side languages where you might have to think about async operations and threading—a totally different mindset). It’s an ideal fit for most cloud applications, where a lot of operations are asynchronous. 
  • Node.js also is the easiest way for a lot of people who are coming from the frontend world (where JavaScript is the de-facto language) to start writing backend code. 
  • And there is also npm, the world’s largest collection of free, reusable code. That means you can import a lot of useful functionality without having to write it yourself.


Node.js is pretty cool, huh? I, for one, am convinced!
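
Before jumping in, here is roughly what such a minimal “Hello, World” web server looks like in Node.js, using only the built-in http module (a sketch; the PORT handling is an assumption about how the environment passes configuration):

const http = require('http');

// Respond to every request with a plain-text greeting.
const server = http.createServer((req, res) => {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello, World!\n');
});

// Use the port the environment provides, or 8080 when running locally.
const port = process.env.PORT || 8080;
server.listen(port, () => {
  console.log(`Listening on port ${port}`);
});

Something like this is what ends up running on the VM, in the container, on App Engine, and (reshaped as an exported handler) in Cloud Functions in the experiments below.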

On your mark… Ready, set, go!

For my interview prep, I started with Compute Engine and VMs first, and then moved up the levels of the compute abstraction ladder, to Kubernetes Engine and containers, App Engine and apps, and finally Cloud Functions. The following table provides a quick summary along with links to my detailed journey and useful getting started resources.


Compute Engine (time check: 4.5 hours)

Basic steps:
  1. Create & set up a VM instance
  2. Set up Node.js dev environment
  3. Code “Hello, World”
  4. Start Node server
  5. Expose the app to external traffic
  6. Understand how scaling works

Kubernetes Engine (time check: 6 hours)

Basic steps:
  1. Code “Hello, World”
  2. Package the app into a container
  3. Push the image to Container Registry
  4. Create a Kubernetes cluster
  5. Expose the app to external traffic
  6. Understand how scaling works

App Engine (time check: 1.5-2 hours)

Basic steps:
  1. Code “Hello, World”
  2. Configure an app.yaml project file
  3. Deploy the application
  4. Understand scaling options

Cloud Functions (time check: 15 minutes)

Basic steps:
  1. Code “Hello, World”
  2. Deploy the application
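
To make the Cloud Functions row concrete, the entire deployable unit can be a single exported HTTP handler. Here is a minimal sketch (the function name is illustrative), which you would deploy with a single command along the lines of gcloud functions deploy helloWorld --trigger-http:

// index.js: an HTTP-triggered Cloud Function.
// Cloud Functions passes Express-style request and response objects.
exports.helloWorld = (req, res) => {
  res.status(200).send('Hello, World!');
};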



Time-to-results comparison

Although this might be somewhat like comparing apples and oranges, here is a summary of my results. (As a reminder, this is just in the context of standing up a “Hello, World” web application from scratch, all concerns such as running the app in production aside.)

Your speed-to-results could be very different depending on multiple factors, including your level of expertise with a given technology. My goal was to grasp the fundamentals of every option in GCP’s compute stack and assess the amount of work required to get from point A to point B… That said, if there is ever a cross-technology Top Gear fighter jet vs. car style contest on standing up a scalable HTTP microservice from scratch, I wouldn’t be afraid to take on a Kubernetes grandmaster like Kelsey Hightower with Cloud Functions!

To find out more about application development on GCP, check out Computing on Google Cloud Platform. Don’t forget—you get $300 in free credits when you sign up.

Happy building!


Introducing sole-tenant nodes for Google Compute Engine — when sharing isn’t an option



Today we are excited to announce beta availability of sole-tenant nodes on Google Compute Engine. Sole-tenant nodes are physical Compute Engine servers designed for your dedicated use. Normally, VM instances run on physical hosts that may be shared by many customers.  With sole-tenant nodes, you have the host all to yourself.

You can launch instances using the same options you would use for regular compute instances, except on server capacity dedicated to you. You can launch instances of any shape (i.e., vCPU and memory). A placement algorithm automatically finds the optimal location to launch your instance across all your nodes. If you prefer more control, you can manually select the location upon which to launch your instances. Instances launched on sole-tenant nodes can take advantage of live migration to avoid downtime during proactive maintenance. Pricing remains simple: pay only for the nodes you use on a per-second basis with a one-minute minimum charge. Sustained use discounts automatically apply, as do any new or existing committed use discounts.

Sole-tenant nodes enable a number of valuable use cases:

  • Compliance and regulation - Organizations with strict compliance and regulatory requirements can use sole-tenant nodes with VM placement to facilitate physical separation of their compute resources in the cloud. 
  • Isolation and utilization - Control instance placement directly via user-defined labels, or let Compute Engine automatically handle instance placement across nodes. You can also create and launch different machine types, or “shapes” on your nodes, in order to achieve the highest level of utilization. 


It’s easy to get started with sole-tenant nodes. You can launch a VM onto a sole-tenant node from the Google Cloud SDK, as well as from the Compute Engine APIs (support for Google Cloud Console coming soon):

# Create a node template
gcloud beta compute sole-tenancy node-templates create mynodetemplate \
    --node-type n1-node-96-624 --region us-central1

# Provision a node group of size two
gcloud beta compute sole-tenancy node-groups create mynodegroup \
    --node-template mynodetemplate --target-size 2 --zone us-central1-a

# Create an instance with 4 vCPUs and 8 GB of memory on a node
gcloud beta compute instances create my-vm \
    --node-group mynodegroup --custom-cpu 4 --custom-memory 8 --zone us-central1-a
For guidance on manual placements and more, check out the documentation. Visit the pricing page to learn more about the offering, as well as regional availability.

What DBAs need to know about Cloud Spanner, part 1: Keys and indexes



Cloud Spanner is a relational and horizontally scalable database service, built from a cloud/distributed design perspective. It brings efficiency and high availability for developers and database administrators (DBAs), and differs structurally from typical databases you’re used to. In this blog series, we’ll explore some of the key differences that DBAs and developers will encounter as you migrate from traditional vertically-scaling (scale-up) relational database management systems (RDBMS) and move to Cloud Spanner. We will discuss some of the dos-and-don'ts, best practices and why things are different in Cloud Spanner.

In this series, we will explore a range of topics, including:
  • Selection of keys and use of indexes
  • How to approach business logic
  • Importing and exporting data
  • Migrating from your existing RDBMS
  • Optimizing performance
  • Access control and logging
You’ll gain an understanding of how to best use Cloud Spanner and release its potential to achieve linearly scalable performance over massive databases. In this first installment, let’s start with a closer look at how the concepts of keys and indexes work in Cloud Spanner.

Choosing keys in Cloud Spanner

Just like in other databases, the choice of key is vitally important to optimize the performance of the database. It’s even more important in Cloud Spanner, due to the way its mechanisms distribute database load. Unlike traditional RDBMS, you’ll need to take care when choosing the primary keys for the tables and choosing which columns to index.

Using well-distributed keys results in a table whose size and performance can scale linearly with the number of Cloud Spanner nodes, while using poorly distributed keys can result in hotspots, where a single node is responsible for the majority of reads and writes to the table.

In a traditional vertically scaled RDBMS, there is a single node that manages all of the tables. (Depending on the installation, there may be replicas that can be used for reading or for failover). This single node therefore has full control over the table row locks, and the issuing of unique keys from a numeric sequence.

Cloud Spanner is a distributed system, with many nodes reading and writing to the database at any one time. However, to achieve scalability, along with global ACID transactions and strong consistency, only one node at any one time can have write responsibility for a given row.

Cloud Spanner distributes management of rows across multiple nodes by breaking up each table into several splits, using ranges of the lexicographically sorted primary key.

This enables Cloud Spanner to achieve high availability and scalability, but it also means that using any continuously increasing or decreasing sequence as a key is detrimental to performance. To explain why, let’s explore how Cloud Spanner creates and manages its table splits.

Table splits and key choice

Cloud Spanner manages splits using Paxos (you can learn more about how in this detailed documentation: Life of Cloud Spanner Reads & Writes and Spanner: Google's Globally Distributed Database). In a regional instance of Cloud Spanner, the responsibility for reading/writing each split is distributed across a group of three nodes, one in each of the three availability zones of the Cloud Spanner instance.

One node in this group of three is elected Split Leader and manages the writes and locks for all the rows in the split. All three nodes in the group can perform reads.

To create a visual example, consider a table with 600 rows that uses a simple, continuous, monotonically increasing integer key (as is common in traditional RDBMS), broken into six splits, running on a two-node (per zone) Cloud Spanner instance. In an ideal situation, the table will have six splits, the leaders of which will be the six individual nodes available in the instance.

This distribution would result in ideal performance when reading/updating the rows, provided that the reads and updates are evenly distributed across the key range.

Problems with hotspots


The problem arises when new rows are appended to the database. Every new row will have an incrementing ID and will be added to the last split, meaning that out of the six available nodes, only one node will be handling all of the writes. In the example above, node 2c would be handling all writes. This node then becomes a hotspot, limiting the overall write performance of the database. In addition, the row distribution becomes unbalanced, with the last split becoming significantly larger, so it’s then handling more row reads than the others.

Cloud Spanner does try to compensate for uneven load by adding and removing splits in the background according to read and write load, and by creating a new split once the split size crosses a set threshold. However, in a frequently appended table, this will not happen quickly enough to avoid creating a hotspot.

Along with monotonically increasing or decreasing keys, this issue also affects tables that are indexed by any deterministic key—for example, an increasing timestamp in an event log table. Timestamp-keyed tables are also more likely to have a read hotspot because, in most cases, recently timestamped rows are accessed more frequently than the others. (Check out Cloud Spanner — Choosing the right primary keys for detailed information on detecting and avoiding hotspots.)

Problems with sequence generators

The concept of sequence generators, or lack thereof, is an important area to explore further. Traditional vertical RDBMS have integrated sequence generators, which create new integer keys from a sequence during a transaction. Cloud Spanner cannot have this due to its distributed architecture, as there would either be race conditions between the split leader nodes when inserting new keys, or the table would have to be globally locked when generating a new key, both of which would reduce performance.

One workaround could be that the key is generated by the application (for example, by storing the next key value in a separate table in the database, or by getting the current maximum key value from the table). However, you’ll run into the same performance problems. Consider that as the application is also likely to be distributed, there may be multiple database clients trying to append a row at the same time, with two potential results depending on how the new key is generated:
  • If the SELECT for the existing key is performed in the transaction, one application instance trying to append would block all other application instances trying to append due to row locking.
  • If the SELECT for the existing key is done outside of the transaction, then there is a race between each of the application instances trying to append the new row. One will succeed, while others would have to retry (including generating a new key) after the append fails, since the key already exists.

What makes a good key

So if sequential keys will limit database performance in Cloud Spanner, what’s a good key to use? Ideally, the high-order bits should be evenly and semi-randomly distributed when keys are generated.

One simple way to generate such a key is to use random numbers, such as a random universally unique identifier (UUID). Note that there are several classes of UUID: versions 1 and 2 use deterministic prefixes, such as timestamps or MAC addresses, so make sure the UUID generation method you use is truly randomly distributed (i.e., v4), at least over the higher-order bytes. This will ensure that the keys are evenly distributed among the keyspace, and hence that the load is distributed evenly over the Cloud Spanner nodes.
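
For example, here is a minimal Node.js sketch (assuming Node 14.17+ for the built-in crypto.randomUUID; the function name is just for illustration) that generates a v4 UUID to use as a row key:

const crypto = require('crypto');

// A v4 UUID is random in its high-order bytes, so rows keyed this way
// spread evenly across the key space and, therefore, across splits.
function newRowId() {
  return crypto.randomUUID(); // e.g. '1f2e7c3a-9b4d-4e2a-8f6c-0d5a9b7c3e21'
}

console.log(newRowId());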

Although another approach might be to use some real-world attributes of the data that are immutable and evenly distributed over the key range, this is quite a challenge since most uniformly distributed attributes are discrete and not continuous. For example, the random result of a dice roll is uniformly distributed and has six finite values. A continuous distribution could rely on an irrational number, for example π.

What if I really need an integer sequence as a key?

Though it’s not recommended, in some circumstances an integer sequence key is necessary, either for legacy or for external reasons, e.g., an employee ID.

To use an integer sequence key, you’ll first need a sequence generator that’s safe across a distributed system. One way of doing this is to have a table in Cloud Spanner contain a row for each required sequence that contains the next value in the sequence—so it would look something like this:
CREATE TABLE Sequences (
     Sequence_ID STRING(MAX) NOT NULL, -- The name of the sequence
     Next_Value INT64 NOT NULL
) PRIMARY KEY (Sequence_ID)
When a new ID value is required, the next value of the sequence is read, incremented and updated in the same transaction as the insert for the new row.
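As a rough illustration, here is a Node.js sketch of that read-increment-insert pattern using the @google-cloud/spanner client; the instance, database, sequence name, and target table and columns are assumptions for this example, not values from this post:

const {Spanner} = require('@google-cloud/spanner');

const database = new Spanner()
  .instance('my-instance')   // assumed instance name
  .database('my-database');  // assumed database name

database.runTransaction(async (err, transaction) => {
  if (err) {
    console.error(err);
    return;
  }
  // Read the next value for the named sequence.
  const [rows] = await transaction.run({
    sql: 'SELECT Next_Value FROM Sequences WHERE Sequence_ID = @id',
    params: {id: 'employee_id'},
  });
  const nextValue = Number(rows[0].toJSON().Next_Value);

  // Queue both mutations so the sequence update and the new row
  // commit (or fail) together in the same transaction.
  transaction.update('Sequences', [
    {Sequence_ID: 'employee_id', Next_Value: Spanner.int(nextValue + 1)},
  ]);
  transaction.insert('Employees', [
    {EmployeeId: Spanner.int(nextValue), FullName: 'Jane Doe'}, // illustrative table/columns
  ]);

  await transaction.commit();
});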

Note that this will limit performance when many rows are inserted, as each insert will block all other inserts due to the update of the Sequences table that we created above.

This performance issue can be reduced—though at the cost of possible gaps in the sequence—if each application instance reserves a block of, for example, 100 sequence values at once by incrementing Next_Value by 100, and then manages issuing individual IDs from that block internally.

In the table using the sequence, the key cannot simply be the numeric sequence value itself, as that will cause the last split to become a hotspot (as explained previously). So the application must generate a complex key that randomly distributes the rows among the splits.

This is known as application-level sharding and is achieved by prefixing the sequential ID with an additional column containing a value that’s evenly distributed among the key space—e.g., a hash of the original ID, or bit-reversing the ID. That looks something like this:
CREATE TABLE Table1 (
     Hashed_Id INT64 NOT NULL, 
     Id INT64 NOT NULL,
     -- other columns with data values follow....
) PRIMARY KEY (Hashed_Id, Id)
Even a simple cyclic redundancy check (CRC32) checksum is good enough to provide a suitably pseudo-random Hashed_Id. It does not have to be secure, just enough to randomize the row order of the sequentially numbered keys.
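
As an illustration, here is a small Node.js sketch of a plain CRC-32 used to derive the Hashed_Id; the helper names are just for this example:

// CRC-32 (IEEE polynomial), computed bit by bit; plenty fast for key hashing.
function crc32(str) {
  let crc = 0xffffffff;
  for (let i = 0; i < str.length; i++) {
    crc ^= str.charCodeAt(i);
    for (let j = 0; j < 8; j++) {
      crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Prefix the sequential ID with its hash so consecutive IDs land in different splits.
function keyForId(id) {
  return {Hashed_Id: crc32(String(id)), Id: id};
}

console.log(keyForId(1234)); // {Hashed_Id: <pseudo-random 32-bit value>, Id: 1234}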

Note that whenever a row is read directly, both the ID and Hashed_Id must be specified to prevent a table scan, as in this example:
SELECT * FROM Table1 t1
WHERE t1.Hashed_Id = 0xDEADBEEF
      AND t1.Id = 1234
Similarly, whenever this table is joined with other tables in the query by Id, the join must also use both the ID and the Hashed_Id. Otherwise, you’ll lose performance, since a table scan will be required to find the row. This means that the table that references the ID must also include the Hashed_Id, like this:
CREATE TABLE Table2 (
     Id String(MAX),  -- UUID
     Table1_Hashed_Id INT64 NOT NULL, 
     Table1_Id INT64 NOT NULL,
     -- other columns with data values follow....
) PRIMARY KEY (Id)

SELECT * from Table2 t2 INNER JOIN Table1 t1 
     ON t1.Hashed_Id = t2.Table1_Hashed_Id
     AND t1.Id = t2.Table1_Id
WHERE ... -- some criteria

What if I really need to use a timestamp as a key?

In many cases, the row using the timestamp as a key also refers to some other table data. For example, the transactions on a bank account will refer to the source account. In this case, assuming that the source account number is already reasonably evenly distributed, you can use a complex key containing the account number first and then the timestamp:
CREATE TABLE Transactions (
     account_number INT64 NOT NULL,
     timestamp TIMESTAMP NOT NULL,
     transaction_info ...,
) PRIMARY KEY (account_number, timestamp DESC)
The splits will be made primarily using the account number and not the timestamp, thus distributing the newly added rows over various splits.

Note that in this table, the timestamp is sorted by descending order. That’s because in most cases you want to read the most recent transactions—which will be first in the table—so you won’t need to scan through the entire table to find the most recent rows.

If you do not or cannot have an external reference, or any other data that you can use in the key to distribute the rows, then you will need to perform application-level sharding, as shown in the integer sequence example above.

Note, however, that using a simple hash will make queries by timestamp range extremely slow, since retrieving a range of timestamps will require a full table scan to cover all the hashes. Instead, we recommend generating a ShardId from the timestamp. So, for example,
TimestampShardId = CRC32(Timestamp) % 100
will return a pseudo-random value between 0 and 99 from the timestamp. Then, you can use this ShardId in the table key so that sequential timestamps are distributed across multiple splits, like so:
CREATE TABLE Events (
     TimestampShardId INT64 NOT NULL,
     Timestamp TIMESTAMP NOT NULL,
     event_info...
) PRIMARY KEY (TimestampShardId, Timestamp DESC)
For example, a table with dates from the first 10 days of 2018 (which without the ShardId would be stored in date order) is instead ordered by ShardId first, spreading consecutive dates across the key space.

When a query is made, you must use a BETWEEN clause to be able to select across all shards without performing a table scan:
Select * from Events
WHERE
   TimestampShardId BETWEEN 0 AND 99
   AND Timestamp > @lower_bound
   AND Timestamp < @upper_bound;
Note that the ShardId is only a way of improving key distribution so that Cloud Spanner can use multiple splits to store sequential timestamps. It does not identify an actual database split, and rows in different tables with the same ShardId may well be in different splits.

Migration implications

When you’re migrating from an existing RDBMS that uses keys that are not optimal for Cloud Spanner, take the above considerations into account. If necessary, add key hashes to tables or change the key ordering.

Deciding on indexes in Cloud Spanner

In a traditional RDBMS, indexes are very efficient ways of looking up rows in a table by a value that is not the primary key. Under most circumstances, a row lookup via an index will take approximately the same time as a row lookup via its key. That’s because the table and the index are managed by a single node, so the index can point directly to the on-disk row of the table.

In Cloud Spanner, indexes are actually implemented using tables, which allows them to be distributed and enables the same degree of scalability and performance as normal tables.

However, because of this type of implementation, using indexes to read the data from the table row is less efficient than in a traditional RDBMS. It’s effectively an inner join with the original table, so reading from a table using an indexed key turns into this process:
  • Look up split for index key
  • Read index row from split to get table key
  • Look up split for table key
  • Read table row from split to get row values
  • Return row values
Note that there is no guarantee that the split for the index key is on the same node as the split for the table key, so a simple index query may require cross-node communication just to read one row.

Similarly, updating an indexed table will most likely require a multi-node write to update the table row and the index row. So using an index in Cloud Spanner is always a trade-off between improved read performance and reduced write performance.

Index keys and hotspots

Because indexes are implemented as tables in Cloud Spanner, you’ll encounter the same issues with the indexed columns as you did with the table keys: An index on a column with poorly distributed values (such as a timestamp) will lead to the creation of a hotspot, even if the underlying table is using well-distributed keys. That’s because when rows are appended to the table, the index will also have new rows appended, and writes for these new rows will always be sent to the same split.

Therefore, care must be taken when creating indexes, and we recommend that you create indexes only using columns which have a well-distributed set of values, just like when choosing a table key.

In some cases, you’ll need to do application-level sharding for the indexed columns in order to create a synthetic ShardId column, which can be used in the index to distribute values over the splits.

For example, this configuration below will create a hotspot when appending events due to the index, even if UserId is randomly distributed.
CREATE TABLE Events (
      UserId String(MAX),
      Timestamp TIMESTAMP,
      EventData)
PRIMARY KEY (UserId, Timestamp DESC);

CREATE INDEX EventsByTimestamp ON Events (Timestamp DESC);
As with a table keyed by timestamp only, a synthetic ShardId column will need to be added to the table, and then used as the first indexed column to help the distribution of the index among the splits.

A simple ShardId generator could be:
TimestampShardId = CRC32(Timestamp) % 100
which will give a hash value between 0 and 99 from the timestamp. You’ll need to add this to the original table as a new column, then use it as the first index key, like this:
CREATE TABLE Events (
     UserId String(MAX),
     Timestamp TIMESTAMP,
     TimestampShardId INT64, 
     EventData)
PRIMARY KEY (UserId, Timestamp DESC);

CREATE INDEX EventsByTimestamp ON Events (TimestampShardId,Timestamp);
This will remove the hotspot on index update, but will slow down range queries on timestamp, since you’ll have to run the query for each ShardId value (0-99) to get the timestamp range from all shards:
Select * from Events@{FORCE_INDEX=EventsByTimestamp}
WHERE
   TimestampShardId BETWEEN 0 AND 99
   AND Timestamp > @lower_bound
   AND Timestamp < @upper_bound;
Using this type of index and sharding strategy means striking a balance between the additional complexity when reading and the increase in performance of an indexed query.

Other indexes you should know

When you’re migrating to Cloud Spanner, you’ll also want to understand how these other index types function and when you might need to use them:

NULL_FILTERED indexes

By default, Cloud Spanner indexes rows even when the indexed column value is NULL. A NULL is considered to be the smallest possible value, so these rows will appear at the start of the index.

It’s also possible to disable this behavior by using the CREATE NULL_FILTERED INDEX syntax, which will create an index ignoring rows with NULL indexed column values.

This index will be smaller than the complete index, as it will effectively be a materialized filtered view on the table, and will be faster to query than the full table when a table scan is necessary.

UNIQUE indexes

You can use a UNIQUE index to enforce that a column of a table has unique values. This constraint will be applied at transaction commit time (and at index creation).

Covering Indexes and STORING clause

To optimize performance when reading from indexes, Cloud Spanner can store the column values of the table row in the index itself, removing the need to read the table. This is known as a Covering Index. This is achieved by using the STORING clause when defining the index. The values of the column can then be read directly from the index, so reading from the index performs as well as reading from the table. For example, this table contains employee data:
CREATE TABLE Employees (
      CompanyUUID INT64,
      EmployeeUUID INT64,
      FullName STRING(MAX)
            ...
) PRIMARY KEY (CompanyUUID,EmployeeUUID)
If you often need to look up an employee’s full name, for example, you can create an index on employeeUUID, storing the full name for rapid lookups:
CREATE INDEX EmployeesById 
      ON Employees (EmployeeUUID) 
      STORING (FullName);

Forcing index use

Cloud Spanner’s query engine will only automatically use indexes in rare circumstances (when it is a query fully covered by the index), so it is important to use a FORCE_INDEX directive in the SQL SELECT statement to ensure that Cloud Spanner looks up values from the index. (You can find more details in the documentation.)
Select * 
from  Employees@{FORCE_INDEX=EmployeesById}
Where EmployeeUUID=xxx;
Note that when using the Cloud Spanner Read APIs, you can only perform fully covered queries—i.e., queries where the index stores all of the columns requested. To read the columns from the original table using an index, you must use an SQL query. See Use a Secondary Index section of the Getting Started docs for examples.
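
If you are issuing that query from application code, a minimal Node.js sketch with the @google-cloud/spanner client might look like the following; the instance and database names are assumptions for this example:

const {Spanner} = require('@google-cloud/spanner');

const database = new Spanner()
  .instance('my-instance')   // assumed instance name
  .database('my-database');  // assumed database name

async function getEmployeeName(employeeUuid) {
  // FORCE_INDEX makes Cloud Spanner look the row up via EmployeesById, and
  // FullName is read straight from the index thanks to the STORING clause.
  const [rows] = await database.run({
    sql: `SELECT FullName
          FROM Employees@{FORCE_INDEX=EmployeesById}
          WHERE EmployeeUUID = @id`,
    params: {id: employeeUuid},
    types: {id: 'int64'}, // EmployeeUUID is INT64; hint the type explicitly
  });
  return rows.length ? rows[0].toJSON().FullName : null;
}

getEmployeeName(12345).then(console.log).catch(console.error);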

Continuing your Cloud Spanner education

There are some big conceptual differences when you’re using a cloud-built, horizontally scalable database like Cloud Spanner in contrast with the RDBMS you’ve been using for years. Once you’re familiar with the way keys and indexes work, you can start to take advantage of the benefits of Cloud Spanner for faster scalability.

In the next episode of this series, we will look at how to deal with business logic that would previously be implemented by triggers and stored procedures, neither of which exist in Cloud Spanner.

Want to learn more about Cloud Spanner in person? We’ll be discussing data migration and more in this session at Next 2018 in July. For more information, and to register, visit the Next ‘18 website.


Related content:

How we used Cloud Spanner to build our email personalization system—from “Soup” to nuts

A closer look at the HANA ecosystem on Google Cloud Platform



Since we announced our partnership with SAP in early 2017, we’ve rapidly expanded our support for SAP HANA, SAP’s in-memory, column-oriented, relational database management system. From the beginning, we knew we’d need to build tools that integrate SAP HANA with Google Cloud Platform (GCP) that make it faster and easier for developers and administrators to take advantage of the platform.

In this blog post, we’ll walk you through the evolution of SAP HANA on GCP and take a deeper look at the ecosystem we’ve built to support our customers.

The evolution of SAP HANA on GCP


For many enterprise customers running SAP HANA databases, instances with large amounts of memory are essential. That’s why we’ve been working to make virtual machines with larger memory configurations available for SAP HANA workloads.

Our initial work with SAP on the certification process for SAP HANA began in early 2017. In that 15-month period, we’ve rapidly evolved from instances with 208 GB of memory to instances with 4 TB of memory. This has allowed us to support larger single-node SAP HANA installations of up to 4 TB.

Smart Data Access — Google BigQuery

Google BigQuery, our serverless data warehouse, enables low cost, high performance analytics at petabyte scale. We’ve worked with SAP to natively integrate SAP HANA with BigQuery through smart data access which allows you to extend SAP HANA’s capabilities and query data stored within BigQuery by means of virtual tables. This support has been available since SAP HANA 2.0 SPS 03, and you can try it out by following this step-by-step codelabs tutorial.

Fully automated deployment of SAP HANA

In many cases, the manual deployment process can be time consuming, error prone and cumbersome. It’s very important to reduce or eliminate the margin of error and make the deployment conform to SAP’s best practices and standards.


To address this, we’ve launched deployment templates that fully automate the deployment of single node and scale-out SAP HANA configurations on GCP. In addition, we’ve also launched a new deployment automation template that creates a high availability SAP HANA configuration with automatic failover.

With these templates, you have access to fully configured, ready-to-go SAP HANA environments in a matter of minutes. You can also see the resources you created, and a complete catalog of all your deployments, in one location through the GCP console. We’ve also made the deployment process fully visible by providing deployment logs through Google Stackdriver.

Monitoring SAP HANA with Stackdriver

Visibility into what’s happening inside your SAP HANA database can help you identify factors impacting your database, and prepare accordingly. For example, a time series view of how resource utilization or latency within the SAP HANA database changes over time can help administrators plan in advance, and in many cases successfully troubleshoot issues.

Stackdriver provides monitoring, logging, and diagnostics to better understand the health, performance, and availability of cloud-powered applications. Stackdriver’s integration with SAP HANA helps administrators monitor their SAP HANA databases, notifying and alerting them so they can proactively fix issues.

More information on this integration is available in our documentation.

TensorFlow support in SAP HANA

SAP has offered support for TensorFlow Serving beginning with SAP HANA 2.0. This lets you directly build inference into SAP HANA through custom machine learning models hosted in TensorFlow serving applications running on Google Compute Engine.

You can easily build a continuous training pipeline by exporting data in your SAP HANA database to Google Cloud Storage and then using Cloud TPUs to train deep learning models. These models can then be hosted with TensorFlow Serving to be used for inference within SAP HANA.

SAP Cloud Platform, SAP HANA as a Service

SAP HANA as a Service is now also deployable to the Google Cloud Platform. The fully managed cloud service makes it possible for developers to leverage the power of SAP HANA, without spending time on operational and administrative tasks. The service is especially well suited to customers who have the goal of rapid innovation, reducing time to value.

SAP HANA, express edition on Google Cloud Launcher

SAP HANA, express edition is meant for developers and technologists who prefer a hands-on learning experience. A free license is included, allowing users to use a large catalog of tutorial content, online courses and samples to get started with SAP HANA. Google Cloud Launcher provides users a fast and effective provisioning experience for SAP HANA, express edition.

Conclusion

These developments are all part of our continuing goal to make Google Cloud the best place to run SAP applications. We’ll continue listening to your feedback, and we’ll have more updates to share in the coming months. In the meantime, you can learn more about SAP HANA on GCP by visiting our website. And if you’d like to learn about all our announcements at SAPPHIRE NOW, read our Google Cloud blog post.

How to deploy geographically distributed services on Kubernetes Engine with kubemci



Increasingly, many enterprise Google Cloud Platform (GCP) customers use multiple Google Kubernetes Engine clusters to host their applications, for better resilience, scalability, isolation and compliance. In addition, their users expect low-latency access to applications from anywhere around the world. Today we are introducing a new command-line interface (CLI) tool called kubemci to automatically configure ingress using Google Cloud Load Balancer (GCLB) for multi-cluster Kubernetes Engine environments. This allows you to use a Kubernetes Ingress definition to leverage GCLB along with multiple Kubernetes Engine clusters running in regions around the world, to serve traffic from the closest cluster using a single anycast IP address, taking advantage of GCP’s 100+ Points of Presence and global network. For more information on how GCLB handles cross-region traffic, see this link.

Further, kubemci will be the initial interface to an upcoming controller-based multi-cluster ingress (MCI) solution that can adapt to different use-cases and can be manipulated using the standard kubectl CLI tool or via Kubernetes API calls.

For example, in the picture below, we have created three independent Kubernetes Engine clusters and spread them across three continents (Asia, North America, and Europe). We then deployed the same service, “zone-printer”, to each of these clusters and used kubemci to create a single GCLB instance to stitch the services together. In this case, the 1000 requests-per-second (rps) from Tokyo are routed to the cluster in Asia, the New York requests are routed to the North American cluster, and the remaining 1 rps from London is routed to the European cluster. Because each of these requests arrives at the cluster closest to the end user, they all benefit from low round-trip latency. Additionally, if a region, cluster, or service were ever to become unavailable, GCLB automatically detects that and routes users to one of the other healthy service instances.

The feedback on kubemci has been great so far. Marfeel is a Spanish ad tech platform and has been using kubemci in production to improve their service offering:
“At Marfeel, we appreciate the value that this tool provides for us and our customers. Kubemci is simple to use and easily integrates with our current processes, helping to speed up our Multi-Cluster deployment process. In summary, kubemci offers us granularity, simplicity, and speed.”
-Borja García - SRE Marfeel

Getting started

To get started with kubemci, please check out the how-to guide, which contains information on the prerequisites along with step-by-step instructions on how to download the tool and set up your clusters, services and ingress objects.

As a quick preview, once your applications and services are running, you can set up a multi-cluster ingress by running the following command:
$ kubemci create my-mci --ingress=ingress.yaml \
    --kubeconfig=cluster_list.yaml
To learn more, check out this talk on Multicluster Ingress by Google software engineers Greg Harmon and Nikhil Jindal, at KubeCon Europe in Copenhagen, demonstrating some initial work in this space.

Regional clusters in Google Kubernetes Engine are now generally available



Editor's note: This is one of many posts on enterprise features you’ll find in Kubernetes Engine 1.10. For the full coverage, follow along here.

A highly available Kubernetes cluster is a key requirement for most production applications. However, adding this protection can be complex. We’ve consistently heard from Kubernetes users that creating and managing a high-availability Kubernetes cluster is no small feat. Keeping etcd (the key-value store) replicas in sync across zones, scaling your masters, and ensuring that your control plane is fronted by a resilient load balancer are just some of the challenges users face when maintaining their own highly available cluster.

Today we’re proud to announce the general availability of one of Google Kubernetes Engine's most requested enterprise-grade features: regional clusters. Regional clusters create a multi-master, highly-available Kubernetes cluster that spreads both the control plane and the nodes across multiple zones in the same region, allowing us to increase the control plane uptime to 99.95%. In addition to the increased availability, regional clusters give you a zero-downtime upgrade experience, so that your cluster is always available for deployments.

We’ve seen rapid adoption of regional clusters since we announced the beta, with many users already running production workloads on them. In addition, we are pleased to announce today that regional clusters in Kubernetes Engine are available at no additional cost.

Get started with regional clusters

You can quickly create your first regional cluster using the Cloud Console or the gcloud command line tool.
$ gcloud container clusters create my-regional-cluster \
    --region=us-east1 --num-nodes=2
This creates a regional cluster in us-east1 with two nodes in each of the us-east1 zones.

By creating a regional cluster, you get:
  • Resilience from single zone failure - Because your masters and application nodes are available across a region rather than a single zone, your Kubernetes cluster is still fully functional if an entire zone goes down.
  • No downtime during master upgrades - Kubernetes Engine minimizes downtime during all Kubernetes master upgrades, but with a single master, some downtime is inevitable. By using regional clusters, the control plane remains online and available, even during upgrades.
Regional clusters are just one of the many features that make Kubernetes Engine a great choice for enterprises seeking to run a production-grade, managed Kubernetes cluster in the cloud. For a more detailed explanation of the regional clusters feature along with additional flags you can use, check out the documentation.

Last month today: GCP in May



When it comes to Google Cloud Platform (GCP), every month is chock full of news and information. We’re kicking off a monthly recap of key moments you may have missed.

What caught your attention this month:

Announcements about open source projects were some of the most-read this month.
  • Open-sourcing gVisor, a sandboxed container runtime was by far your favorite post in May. gVisor is a sandbox that lets you run containers in strongly isolated environments. It’s isolated like a virtual machine, but more lightweight and also more flexible, since it interfaces with the host OS just like another process.
  • Our introduction of Asylo, an open-source framework for confidential computing, also got your attention. As more and more sensitive workloads move to cloud, lots of businesses want to be able to verify that they’re properly isolated, inside a closed environment that’s only available to authorized users. Asylo democratizes trusted execution environments (TEEs) by allowing them to run on generic hardware. With Asylo, developers will be able to run their workloads encrypted in a highly secure environment, whether it’s on-premises or in the cloud.
  • Rounding out the open-source fun for the month was our introduction of the beta availability of Cloud Memorystore, a fully managed in-memory data store service for Redis. Cloud Memorystore gives you the caching power of Redis to reduce latency, without having to manage the details.


Hot topics: Kubernetes, DevOps and SRE

Google Kubernetes Engine 1.10 debuted in May, and we had a lot to say about the new features that this version enables—from security to brand-new monitoring functionality via Stackdriver Kubernetes Monitoring to networking. Start with this post to see what’s new and how customers like Spotify are using Kubernetes Engine on Google Cloud.

And one of our recent posts also struck a chord, as two of our site reliability engineering (SRE) experts delved into the differences—and similarities—between SRE and DevOps. They have similar goals, mostly around creating flexible, agile dev environments, but SRE generally gets much more specific and prescriptive than DevOps in accomplishing them.

Under the radar: GCP adds infrastructure options

As you look for new ways to use GCP to run your business, our engineers are adding features and new releases to give you more power, resources and coverage.

First, we introduced ultramem Google Compute Engine machine types, which offer more memory and compute resources than any other Compute Engine VM instance. These machine types are especially useful for those of you running enterprise workloads that need a lot of memory, like data analytics or high-performance applications.

We’ve also been busy on the back-end in other ways too, as we continue adding new regional cloud computing infrastructure. Our third zone of the Singapore region opened in May, and we’ll open a Zurich region next year.

Stay tuned in June for more on the technologies behind Google Cloud—we’ve got lots up our sleeve.

7 tips to maintain security controls in your GCP DR environment



Cloud computing has changed many traditional IT practices, and one particularly useful change has been in the area of disaster recovery (DR). Our team helps Google Cloud Platform (GCP) users build their infrastructures with cloud, and we’ve seen some great results when they use GCP as the DR target environment.

When you integrate a cloud provider like GCP into your DR plan, you no longer have to invest up front in mostly idle backup infrastructure. Testing that DR plan no longer seems so daunting, as you can bring up your DR environment automatically and close it all down again when it’s no longer needed—and it’s always ready for the next tests. In the event of an actual disaster, the DR environment can be made ready.

However, planning and testing for and recovering from a disaster involves more than just getting your application restored and available within your Recovery Time Objective (RTO). You need to ensure the security controls you implemented on-premises also apply to your recovered environment. This post provides tips to help you maintain your security controls in your cloud DR environment.

1. Grant users the same access they’re used to

If your production environment is running on GCP, it’s easy to replicate the identity and access management (IAM) policies already in place. Use infrastructure as code methods and employ tools such as Cloud Deployment Manager to recreate your IAM policies. You can then bind the policies to corresponding resources when you’re standing up the DR environment.

If your production environment is on-premises, map functional roles such as your network administrator and auditor roles to appropriate IAM policies. Use the IAM documentation, which has some example configurations such as the networking and audit logging functional roles. You’ll want to configure IAM policies to grant appropriate permissions to GCP products. For example, you might want to restrict access to specific Google Cloud Storage buckets.

Once you’ve created a test environment, ensure that the access granted to your users confers the same permissions as those they are granted on-premises.

If your production environment runs on another cloud, you will need to understand how to map the IAM controls to Google Cloud Identity and Access Management (IAM) policies. The GCP for AWS professionals management doc can help if your other cloud is AWS.

2. Ensure user understanding

Do not wait for a disaster to occur before checking that your users—whether developers, operators, data scientists, security or network admins—can access the DR environment. Make sure their accounts have been granted the appropriate access rights. If you are using an alternative identity system, ensure accounts have been synced with your Cloud Identity account. Make sure users who will need access to the DR environment (which will be the production environment for a while) are able to log in to the DR environment. Resolve any authentication issues. When you’re conducting regular DR tests, incorporate users logging into the DR environment as part of the test process.

Enable the GCP OS login feature on the projects that constitute your DR environment so you can centrally manage who has SSH access to VMs launched in the DR environment.

It’s also important to train users on the DR environment so that they understand how to undertake their usual actions in GCP. Use the test environment for this.

3. Double-check blocking and compliance requirements

It’s essential that your network controls confer the same separation and blocking settings in the DR as the source production environment. Learn how to configure Shared VPCs and GCP firewalls and take advantage of using service accounts as part of the firewall rules. Understand how to use service accounts to implement least privileges for applications accessing GCP APIs.

In addition, make sure your DR environment meets your compliance requirements. Validate that access to your DR environment is restricted to only those who need access. Ensure personally identifiable information (PII) data is appropriately redacted and encrypted. If you carry out regular penetration tests on your production environment, you should start including your DR environment and carry out those tests by regularly standing up a DR environment.

4. Capture log data

When the DR environment is in service, the logs collected during that time should be backfilled into your production environment log archive. Ensure that as part of your GCP DR environment you are able to export audit logs that are collected via Stackdriver to your main log sink archive. Use the export sink facilities. For application logs, stand up a mirror of your on-premises logging and monitoring environment. For another cloud, map across to the equivalent GCP services. Have a process in place to format this log input into your production environment.

5. Use Cloud Storage for day-to-day backup routines

Use Cloud Storage to store backups that result from the DR environment. Ensure the storage buckets containing your backups have limited permissions applied to them.

The same security controls should apply to your recovered data as if it were production data. The same permissions, encryption and audit requirements apply. Know where your backups are located, and track who restores them and when.

6. Consider secrets and key management

Manage application-level secrets and keys using GCP to host the key or secret management service. You can use Cloud Key Management Service (KMS) or a third-party solution like HashiCorp Vault with a GCP backend such as Cloud Spanner or Cloud Storage.

7. Manage VM images and snapshots

If you have predefined configurations for VMs, such as who can use them or make changes, reflect those controls in your GCP DR recovery site. Follow the guidance outlined in restricting access to images.

These tips can help make sure you don’t lose the security you’ve built into your production environment when you stand up a DR site. You’ll be on your way more quickly to having a cloud DR site that works for your users and the business.

Next steps:

Read our guide on How to Design a DR plan.

Related content:

Cloud Identity-Aware Proxy: a simple and more secure way to manage application access
Know thy enemy: how to prioritize and communicate risks - CRE life lessons
Building trust through Access Transparency

Kubernetes best practices: upgrading your clusters with zero downtime



Editor’s note: Today is the final installment in a seven-part video and blog series from Google Developer Advocate Sandeep Dinesh on how to get the most out of your Kubernetes environment.

Everyone knows it’s a good practice to keep your application up to date to optimize security and performance. Kubernetes and Docker can make performing these updates much easier, as you can build a new container with the updates and deploy it with relative ease.

Just like your applications, Kubernetes is constantly getting new features and security updates, so the underlying nodes and Kubernetes infrastructure need to be kept up to date as well.

In this episode of Kubernetes Best Practices, let’s take a look at how Google Kubernetes Engine can make upgrading your Kubernetes cluster painless!

The two parts of a cluster

When it comes to upgrading your cluster, there are two parts that both need to be updated: the masters and the nodes. The masters need to be updated first, and then the nodes can follow. Let’s see how to upgrade both using Kubernetes Engine.

Upgrading the master with zero downtime
Kubernetes Engine automatically upgrades the master as point releases come out; however, it usually won’t automatically upgrade to a new minor version (for example, 1.7 to 1.8). When you are ready to upgrade to a new version, you can just click the upgrade master button in the Kubernetes Engine console.

However, you may have noticed that the dialog box says the following:

“Changing the master version can result in several minutes of control plane downtime. During that period you will be unable to edit this cluster.”

When the master goes down for the upgrade, deployments, services, etc. continue to work as expected. However, anything that requires the Kubernetes API stops working. This means kubectl stops working, applications that use the Kubernetes API to get information about the cluster stop working, and basically you can’t make any changes to the cluster while it is being upgraded.

So how do you update the master without incurring downtime?


Highly available masters with Kubernetes Engine regional clusters

While the standard “zonal” Kubernetes Engine clusters only have one master node backing them, you can create “regional” clusters that provide multi-zone, highly available masters.

When creating your cluster, be sure to select the “regional” option:

And that’s it! Kubernetes Engine automatically creates your nodes and masters in three zones, with the masters behind a load-balanced IP address, so the Kubernetes API will continue to work during an upgrade.

Upgrading nodes with zero downtime

When upgrading nodes, there are a few different strategies you can use. There are two I want to focus on:
  1. Rolling update
  2. Migration with node pools
Rolling update
The simplest way to update your Kubernetes nodes is to use a rolling update. This is the default upgrade mechanism Kubernetes Engine uses to update your nodes.

A rolling update works in the following way. One by one, a node is drained and cordoned so that there are no more pods running on that node. Then the node is deleted, and a new node is created with the updated Kubernetes version. Once that node is up and running, the next node is updated. This goes on until all nodes are updated.

You can let Kubernetes Engine manage this process for you completely by enabling automatic node upgrades on the node pool.

If you don’t select this, the Kubernetes Engine dashboard alerts you when an upgrade is available:

Just click the link and follow the prompt to begin the rolling update.

Warning: Make sure your pods are managed by a ReplicaSet, Deployment, StatefulSet, or something similar. Standalone pods won’t be rescheduled!

While it’s simple to perform a rolling update on Kubernetes Engine, it has a few drawbacks.

One drawback is that you get one less node of capacity in your cluster. This issue is easily solved by scaling up your node pool to add extra capacity, and then scaling it back down once the upgrade is finished.

The fully automated nature of the rolling update makes it easy to do, but you have less control over the process. It also takes time to roll back to the old version if there is a problem, as you have to stop the rolling update and then undo it.

Migration with node pools
Instead of upgrading the “active” node pool as you would with a rolling update, you can create a fresh node pool, wait for all the nodes to be running, and then migrate workloads over one node at a time.

Let’s assume that our Kubernetes cluster has three VMs right now. You can see the nodes with the following command:
kubectl get nodes
NAME                                        STATUS  AGE
gke-cluster-1-default-pool-7d6b79ce-0s6z    Ready   3h
gke-cluster-1-default-pool-7d6b79ce-9kkm    Ready   3h
gke-cluster-1-default-pool-7d6b79ce-j6ch    Ready   3h


Creating the new node pool
To create the new node pool with the name “pool-two”, run the following command:
gcloud container node-pools create pool-two

Note: Remember to customize this command so that the new node pool is the same as the old pool. You can also use the GUI to create a new node pool if you want.

Now if you check the nodes, you will notice there are three more nodes with the new pool name:
$ kubectl get nodes
NAME                                        STATUS  AGE
gke-cluster-1-pool-two-9ca78aa9-5gmk        Ready   1m
gke-cluster-1-pool-two-9ca78aa9-5w6w        Ready   1m
gke-cluster-1-pool-two-9ca78aa9-v88c        Ready   1m
gke-cluster-1-default-pool-7d6b79ce-0s6z    Ready   3h
gke-cluster-1-default-pool-7d6b79ce-9kkm    Ready   3h
gke-cluster-1-default-pool-7d6b79ce-j6ch    Ready   3h

However, the pods are still on the old nodes! Let’s move them over.

Drain the old pool
Now we need to move work to the new node pool. Let’s move over one node at a time in a rolling fashion.

First, cordon each of the old nodes. This will prevent new pods from being scheduled onto them.

kubectl cordon <node_name>
Once all the old nodes are cordoned, pods can only be scheduled on the new nodes. This means you can start to remove pods from the old nodes, and Kubernetes automatically schedules them on the new nodes.

Warning: Make sure your pods are managed by a ReplicaSet, Deployment, StatefulSet, or something similar. Standalone pods won’t be rescheduled!

Run the following command to drain each node. This removes all the pods from that node so they can be rescheduled. (Depending on what’s running on the node, you may also need additional flags such as --ignore-daemonsets.)

kubectl drain <node_name> --force


After you drain a node, make sure the new pods are up and running before moving on to the next one.
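A quick way to check is to list your pods along with the nodes they’re scheduled on and confirm they’ve landed in the new pool (this shows the current namespace; add --all-namespaces if needed):

kubectl get pods -o wide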

If you have any issues during the migration, uncordon the old pool and then cordon and drain the new pool. The pods get rescheduled back to the old pool.

Delete the old pool
Once all the pods are safely rescheduled, it is time to delete the old pool.

Replace “default-pool” with the pool you want to delete.

gcloud container node-pools delete default-pool
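If gcloud can’t infer which cluster you mean, you may need to pass it explicitly (the cluster name and zone here are assumptions):

gcloud container node-pools delete default-pool --cluster cluster-1 --zone us-central1-a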


You have just successfully updated all your nodes!

Conclusion


By using Kubernetes Engine, you can keep your Kubernetes cluster up to date with just a few clicks.

If you are not using a managed service like Kubernetes Engine, you can still use the rolling update or node pools method with your own cluster to upgrade nodes. The difference is that you need to manually add the new nodes to your cluster and perform the master upgrade yourself, which can be tricky.

For a hassle-free upgrade experience, I highly recommend using Kubernetes Engine regional clusters, which give you high-availability masters, along with automatic node upgrades. If you need extra control over your node updates, using node pools gives you that control without giving up the advantages of the managed Kubernetes platform that Kubernetes Engine provides.

And thus concludes this series on Kubernetes best practices. If you have ideas for other topics you’d like me to address in the future, you can find me on Twitter. And if you’re attending Google Cloud Next ‘18 this July, be sure to drop by and say hi!

Troubleshooting tips: How to talk so your cloud provider will listen (and understand)



Editor’s note: We’re excited to bring you this blog post from the team of Google experts who wrote the book (really!) on Site Reliability Engineering (SRE) a few years back. The second edition of the book is underway, and as a teaser, this post delves into one area of SRE that’s relevant to many IT teams today: troubleshooting in the age of cloud computing. This is part one of two; the second installment focuses specifically on communicating with your cloud provider during troubleshooting.

Effective technology troubleshooting requires a systematic approach, as opposed to luck or experience. Troubleshooting can be a learned skill, as discussed in our site reliability engineering (SRE) troubleshooting primer.

But how does that change when you and your operations team are running systems and services on cloud infrastructure? Regardless of where your websites or apps live, you’re the one getting paged when the site goes down and you are the one under pressure to solve the problems and answer questions.

Cloud presents a new way of working for IT teams shifting away from legacy systems. You had full visibility and control of all aspects of your system when it was on-premises, but now you depend on off-site, cloud-based infrastructure, into which you may have limited visibility or understanding. That’s true no matter how top-notch your systems and processes are and how great a troubleshooter you are. You simply can’t see much into many cloud provider systems. The troubleshooting challenge is no longer limited to pure debugging; you also need to effectively communicate with your cloud provider. You’ll need to engage each provider’s support process and engineers to find where problems originated and fix them as soon as possible. This is an area of opportunity for IT teams as they gain new skills and adapt to the cloud-based technology model.

It’s definitely possible to do cloud troubleshooting right. When you’re choosing a provider, make sure to understand how they’ll work with you during any issues. Once you’re working with a provider, you have more control than you think over how you communicate issues and speed along their resolution. We propose this model, inspired by the SRE model of actionable improvements, for working with your cloud provider to troubleshoot more effectively and efficiently. (For more on cloud provider communications throughout the troubleshooting process, see the companion post.)

Understand your cloud provider's support workflow

Your goal here is to figure out the best way to provide the right information to your provider when an issue inevitably arises. This is a useful place to start, especially since you may depend on multiple cloud providers to run your infrastructure. Your interaction with cloud support will typically begin with questions related to migration. Your relationship then progresses into the domain of production integration, and finally, joint troubleshooting of production problems.

Keep in mind that different cloud providers have different philosophies when it comes to customer interaction. Some provide a large degree of freedom and little direct support, and expect you to find answers to most of your questions on online forums such as Stack Overflow. Other providers emphasize tight customer integration and joint troubleshooting. You have some homework to do before you begin serving real customer traffic from your cloud deployment: talk to your cloud provider's salespeople, project managers, and support engineers to get a sense of how they approach support. Ask them the following questions:
  • What does the lifecycle of a typical support issue report look like?
  • What is your internal escalation process if an issue becomes complex or critical?
  • Do you have an internal SLO for <service name>? If so, what is that SLO?
  • What types of premium support are available?
This step is critical in reducing your frustration when you have to troubleshoot an issue that involves a cloud provider’s giant black box.

Communicate with cloud provider support efficiently

Once you have a sense of the provider’s support workflow, you can figure out the best way to get your information across. There are some best practices for filing the perfect issue report with cloud provider support teams, including what to say in your issue report and why. These guidelines follow the 80/20 rule: we try to give 20% of the details that will be useful in 80% of your issue reports. The same principles apply if you're filing bug reports in issue trackers or posting to user groups and forums.

Your guiding principle when communicating to cloud providers should be clarity: specify the appropriate level of technical detail and communicate expectations explicitly.

Provide basic information
It may seem like common sense, but these basics are essential to include in an issue report. Failing to provide any of these details leads to delays and a poor experience.

Include four critical details
Effective troubleshooting starts with time, product, location and specific identifiers.

1. Time
Here are a few examples of including time in an issue report:
  • Starting at 2017-09-08 15:13 PDT and ending 5 minutes later, we observed...
  • Observed intermittently, starting no earlier than 2017-09-10 and observed 2-5 times...
  • Ongoing since 2017-09-08 15:13 PDT...
  • From 2017-09-08 15:13 PDT to 2017-09-08 22:22 PDT...
Including the onset time and duration allows support teams to focus their time-series monitoring on the relevant period. Be explicit about whether the issue is ongoing or whether it was observed only in the past. If the issue is not ongoing, provide an end time if possible.

Remember to always include the time zone. ISO 8601 format is a good choice because it is unambiguous and easy to sort. If you instead specify time in a relative way, whoever you're working with must convert that relative time into an absolute format they can input into time-series monitoring tools. This conversion is error-prone and costly: for example, sending an email about something that happened "earlier yesterday" means that your counterpart has to search for an email header and start doing mental math. This introduces cognitive load, which decreases the person’s mental energy available for solving the technical problem.
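If you need a quick way to produce such a timestamp, something like the following (standard on most Unix-like systems) prints the current UTC time in ISO 8601 form, e.g. 2017-09-08T22:13:00Z:

date -u +"%Y-%m-%dT%H:%M:%SZ"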

If an issue was intermittent over some period of time, state when it was first observed, or perhaps note a time in the past when it was surely not happening. Include the frequency of observations and note one or two specific examples.

Meanwhile, here are some antipatterns to avoid:
  • Earlier today: Not specific enough
  • Yesterday: Requires the recipient to figure out the implied date; can be confusing especially when work crosses the International Date Line
  • 9/8: Ambiguous, as the date might be read as September 8 in the United States or August 9 in other locales. Use ISO 8601 format for clarity.
2. Product
Be as specific as possible in your issue report about the product you're using, including version information where applicable. These, for example, aren’t specific enough to locate the components or logs that can help with diagnosis:
  • REST API returned errors...
  • The data mining query interface is hanging...
Ideally, you should refer to specific APIs or URLs, or include screenshots. If the issue originates in a specific component or tool (for example, the CLI or Terraform), clearly identify that tool. If multiple products are involved, be specific about each one. Describe the behavior you're observing, and the behavior you expected to occur.

Antipatterns:
  • Can't create virtual machine: It's not clear how you're attempting to create the machine, nor does it say what the failure mode is.
  • The CLI command is giving an error:
    • Instead, provide the specific error, and the command syntax so others can run the command themselves.
    • Better: I ran 'mktool create my-instance --zone us-central1' and got the following error message...
3. Location
It's important to specify the region and zone because cloud providers often roll out changes to one region or zone at a time. Therefore, region or zone is a proxy for a cloud-internal software version number.

These are examples of how you might include location information in an issue report:
  • In us-east1-a... 
  • I tried regions eu-west-1 and eu-west-3...
Given this information, the support team can see if there's a rollout underway in a given location, or map your issue to an internal release ID for use in an internal bug report.

4. Specific identifiers
Project identifiers are included in many troubleshooting tools. Specify whether you observed the error in multiple projects, or in one project but not another.

These are examples of specific identifiers:
  • In project 123412341234 or my-project-id... 
  • Across multiple projects (including 123412341234)... 
  • Connecting to cloud external IP 218.239.8.9 from our corporate gateway 56.56.56.56... 
IP addresses are another form of unambiguous identifiers, though they also require additional details when used in an issue report. When specifying an IP, try to describe the context of how it's used. For example, specify whether the IP is connected to a VM instance, a load balancer or a custom route, or if it's an API endpoint. If the IP address isn't part of the cloud platform (for example, your home internet, a VPN endpoint or an external monitoring system), specify that information. Remember, 192.168.0.1 occurs many times in the world.

Antipatterns:
  • One of our instances is unreachable…: Overly vague 
  • We can't connect from the Internet...: Overly vague 
Note that other models, such as the Five Ws (popular in fields like journalism), can also provide structure to your report.

Specify impact and response expectations
Basic information to include in your issue report to a cloud provider should also include how it’s affecting your business and when it needs to be resolved.

Priority expectations
Cloud provider support commonly uses the priority you specify to initially route the issue report and to determine the urgency of the issue. The priority rating drives the speed of response, potentially paging on-call personnel.

In addition to selecting the appropriate priority, it's useful to add a sentence describing the impact. Help avoid incorrect assumptions by being explicit about why you selected P1.

Think of the priority in terms of the impact to your business. Your cloud provider may have what appear to be strict definitions of priority (e.g., P1 signifies a total outage). Don't let these definitions slow progress on issues that are business-critical; extrapolate the impact if your issue isn't addressed, or describe the worst-case scenario of an exposure related to the issue. For example, the following two descriptions essentially describe impending P1 issues:
  • Our backlog is increasing. Current impact is minor, but if this issue is not fixed in 12 hrs, we're effectively down. 
  • A key monitoring component has failed. While there is no current impact, this means we have a blind spot that will cause the next failure to become a user visible outage. 
Response time expectations
If you have specific needs related to response time, indicate them clearly. For example, you might specify "I need a response by 5pm because that's when my shift ends." If you have internal outage communication SLOs, make sure you request a response time from your provider that is within that interval so you can meet those SLOs. Cloud providers likely have 24/7 support, but if this isn't the case, or if your relevant personnel are in a particular time zone, communicate your time zone-specific needs to your provider.

Including these details in your issue report will ideally save you time later and speed up your overall provider resolution process. Check out part two for tips on communicating with your cloud provider throughout the actual troubleshooting process.

Related content:
SRE vs. DevOps
Incident management at Google —adventures in SRE-land
Applying the Escalation Policy — CRE life lessons
Special thanks to Ralph Pearson, J.C. van Winkel, John Lowry, Dermot Duffy and Dave Rensin