Category Archives: Google Cloud Platform Blog

Product updates, customer stories, and tips and tricks on Google Cloud Platform

Three ways to configure robust firewall rules



If you administer firewall rules for Google Cloud VPCs, you want to ensure that the firewall rules you create can only be associated with the correct VM instances by developers in your organization. Without that assurance, it's difficult to manage access to sensitive content hosted on VMs in your VPCs or to allow those instances access to the internet, and you must carefully audit and monitor the instances to ensure that unintended access is not granted through the use of tags. With Google VPC, there are now multiple ways to achieve the required level of control, which we describe here in detail.

As an example, imagine you want to create a firewall rule to restrict access to sensitive user billing information in a data store running on a set of VMs in your VPC. Further, you’d like to ensure that developers who can create VMs for applications other than the billing frontend cannot enable these VMs to be governed by firewall rules created to allow access to billing data.
Example topology of a VPC setup requiring secure firewall access.
The traditional approach here is to attach tags to VMs and create a firewall rule that allows access to specific tags, e.g., in the above example you could create a firewall rule that allows all VMs with the billing-frontend tag access to all VMs with the tag billing-data. The drawback of this approach is that any developer with the Compute InstanceAdmin role for the project can attach billing-frontend as a tag to their own VM, and thus gain unintended access to sensitive data.

Configuring firewall rules with service accounts


With the general availability of firewall rules using service accounts, instead of using tags, you can block developers from enabling a firewall rule on their instances unless they have access to the appropriate centrally managed service accounts. Service accounts are special Google accounts that belong to your application or service running on a VM and can be used to authenticate the application or service for resources it needs to access. In the above example, you can create a firewall rule to allow access to the billing-data@ service account only if the originating source service account of the traffic is billing-frontend@.
Firewall setup using source and target service accounts. (Service account names are abbreviated for simplicity.)
You can create this firewall rule using the following gcloud command:
gcloud compute firewall-rules create secure-billing-data \
    --network web-network \
    --allow TCP:443 \
    --source-service-accounts billing-frontend@PROJECT_ID.iam.gserviceaccount.com \
    --target-service-accounts billing-data@PROJECT_ID.iam.gserviceaccount.com
If, in the above example, the billing frontend and billing data applications are autoscaled, you can specify the service accounts for the corresponding applications in the InstanceTemplate configured for creating the VMs.
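For example, a minimal sketch of such an instance template for the billing frontend might look like this (the template name, machine type and PROJECT_ID are illustrative placeholders):
gcloud compute instance-templates create billing-frontend-template \
    --machine-type n1-standard-1 \
    --service-account billing-frontend@PROJECT_ID.iam.gserviceaccount.com \
    --scopes cloud-platform
VMs created from this template, for example by a managed instance group, then carry the billing-frontend@ identity and are automatically covered by the firewall rule above.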

The advantage of this approach is that once it's set up, the firewall rules can remain unchanged even as the underlying IAM permissions change. However, you can currently associate only one service account with a VM, and to change that service account, the instance must be stopped.

Creating a custom IAM role for InstanceAdmin


If you want the flexibility of tags but the limitations of service accounts are a concern, you can create a custom role with more restricted permissions that removes the ability to set tags on VMs; do this by omitting the compute.instances.setTags permission. This custom role can include the other permissions present in the InstanceAdmin role and can then be granted to developers in the organization. With this custom role, you can create your firewall rules using tags:
gcloud compute firewall-rules create secure-billing-data \
    --network web-network \
    --allow TCP:443 \
    --source-tags billing-frontend \
    --target-tags billing-data
Note, however, that the permissions assigned to a custom role are static; if new permissions are later added to the InstanceAdmin role, you must add them to the custom role yourself as and when required.
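If you go this route, a sketch of creating such a custom role with gcloud might look like the following (the role ID and the permission list are illustrative and abbreviated; a real role would include the full set of InstanceAdmin permissions you want to keep, minus compute.instances.setTags):
gcloud iam roles create restrictedInstanceAdmin \
    --project PROJECT_ID \
    --title "Restricted Instance Admin" \
    --permissions compute.instances.create,compute.instances.delete,compute.instances.start,compute.instances.stop
You can then grant this custom role to developers instead of the built-in Compute InstanceAdmin role.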

Using subnetworks to partition workloads


You can also create firewall rules using source and destination IP CIDR ranges if the workloads can be partitioned into subnetworks of distinct ranges as shown in the example diagram below.
Firewall setup using source and destination ranges.
In order to restrict developers’ ability to create VMs in these subnetworks, you can grant the Compute Network User role selectively to developers on specific subnetworks, or use Shared VPC.
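If you're setting up such a topology from scratch, a minimal sketch of creating the two subnetworks with distinct ranges might look like this (the ranges match the firewall rule example below; the region is illustrative and this assumes web-network is a custom-mode VPC network):
gcloud compute networks subnets create billing-frontend-subnet \
    --network web-network \
    --region us-central1 \
    --range 10.20.0.0/16

gcloud compute networks subnets create billing-data-subnet \
    --network web-network \
    --region us-central1 \
    --range 10.30.0.0/16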

Here’s how to configure a firewall rule with source and destination ranges using gcloud:
gcloud compute firewall-rules create secure-billing-data \
    --network web-network \
    --allow TCP:443 \
    --source-ranges 10.20.0.0/16 \
    --destination-ranges 10.30.0.0/16
This method allows for better scalability with large VPCs and accommodates changes in the underlying VMs as long as the network topology remains unchanged. Note, however, that if a VM instance has can_ip_forward enabled, it may send traffic with a source address in the above range and thus gain access to sensitive workloads.

As you can see, there’s a lot to consider when configuring firewall rules for your VPCs. We hope these tips help you configure firewall rules in a more secure and efficient manner. To learn more about configuring firewall rules, check out the documentation.

Why you should pick strong consistency, whenever possible



Do you like complex application logic? We don’t either. One of the things we’ve learned here at Google is that application code is simpler and development schedules are shorter when developers can rely on underlying data stores to handle complex transaction processing and keeping data ordered. To quote the original Spanner paper, “we believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.”1

Put another way, data stores that provide transactions and consistency across the entire dataset by default lead to fewer bugs, fewer headaches and easier-to-maintain application code.

Defining database consistency


But to have an interesting discussion about consistency, it’s important to first define our terms. A quick look at different databases on the market shows that not all consistency models are created equal, and that some of the related terms can intimidate even the bravest database developer. Below is a short primer on consistency:



Consistency
Definition: Consistency in database systems refers to the requirement that any given database transaction must change affected data only in allowed ways. Any data written to the database must be valid according to all defined rules.2
What Cloud Spanner supports: Cloud Spanner provides external consistency, which is strong consistency plus additional properties (including serializability and linearizability). All transactions across a Cloud Spanner database satisfy this consistency property, not just those within a replica or region.

Serializability
Definition: Serializability is an isolation property of transactions, where every transaction may read and write multiple objects. It guarantees that transactions behave the same as if they had executed in some serial order. It's okay for that serial order to be different from the order in which transactions were actually run.3
What Cloud Spanner supports: Cloud Spanner provides external consistency, which is a stronger property than serializability: all transactions appear as if they executed in a serial order, even if some of the reads, writes and other operations of distinct transactions actually occurred in parallel.

Linearizability
Definition: Linearizability is a recency guarantee on reads and writes of a register (an individual object). It doesn’t group operations together into transactions, so it does not prevent problems such as write skew, unless you take additional measures such as materializing conflicts.4
What Cloud Spanner supports: Cloud Spanner provides external consistency, which is a stronger property than linearizability, because linearizability does not say anything about the behavior of transactions.

Strong Consistency
Definition: All accesses are seen by all parallel processes (or nodes, processors, etc.) in the same order (sequentially).5 In some definitions, a replication protocol exhibits "strong consistency" if the replicated objects are linearizable.
What Cloud Spanner supports: The default mode for reads in Cloud Spanner is "strong," which guarantees that they observe the effects of all transactions that committed before the start of the operation, independent of which replica receives the read.

Eventual Consistency
Definition: Eventual consistency means that if you stop writing to the database and wait for some unspecified length of time, then eventually all read requests will return the same value.6
What Cloud Spanner supports: Cloud Spanner supports bounded-staleness reads, which offer performance benefits similar to eventual consistency but with much stronger consistency guarantees.


Cloud Spanner, in particular, provides external consistency, which provides all the benefits of strong consistency plus serializability. All transactions (across rows, regions and continents) in a Cloud Spanner database satisfy the external consistency property, not just those within a replica. External consistency states that Cloud Spanner executes transactions in a manner that's indistinguishable from a system in which the transactions are executed serially, and furthermore, that the serial order is consistent with the order in which transactions can be observed to commit. External consistency is a stronger property than both linearizability and serializability.

Consistency in the wild


There are lots of use cases that call for external consistency. For example, a financial application might need to show users' account balances. When users make a deposit, they want to see the result of this deposit reflected immediately when they view their balance (otherwise they may fear their money has been lost!). There should never appear to be more or less money in aggregate in the bank than there really is. Another example might be a mail or messaging app: You click "send" on your message, then immediately view "sent messages" because you want to double check what you wrote. Without external consistency, the app’s request to retrieve your sent messages may go to a different replica that's behind on getting all state changes, and have no record of your message, resulting in a confusing, degraded user experience.

But what does it really mean from a technical standpoint to have external consistency? When performing read operations, external consistency means that you're reading the latest copy of your data in global order. It provides the ability to read the latest change to your data across rows, regions and continents. From a developer’s perspective, it means you can read a consistent view of the state of the entire database (not just a row or object) at any point in time. Anything less introduces tradeoffs and complexity in the application design. That in turn can lead to brittle, hard-to-maintain software and can cause innumerable maintenance headaches for developers and operators. Multi-master architectures and multiple levels of consistency are workarounds for not being able to provide the external consistency that Cloud Spanner does.

What’s the problem with using something less than external consistency? When you choose a relaxed/eventual consistency mode, you have to understand which consistency mode you need to use for each use case and have to hard code rigid transactional logic into your apps to guarantee the correctness and ordering of operations. To take advantage of "transactions" in database systems that have limited or no strong consistency across documents/objects/rows, you have to design your application schema such that you never need to make a change that involves multiple "things" at the same time. That’s a huge restriction and workarounds at the application layer are painful, complex, and often buggy.

Further, these workarounds have to be carried everywhere in the system. For example, take the case of adding a button to set your color scheme in an admin preferences panel. Even a simple feature like this is expected to be carried over immediately across the app and other devices and sessions. It needs a synchronous, strongly consistent update—or a makeshift way to obtain the same result. Using a workaround to achieve strong consistency at the application level adds a velocity-tax to every subsequent new feature—no matter how small. It also makes it really hard to scale the application dev team, because everyone needs to be an expert in these edge cases. With this example, a unit test that passes on a developer workstation does not imply it will work in production at scale, especially in high concurrency applications. Adding workarounds to an eventually consistent data store often introduces bugs that go unnoticed until they bite a real customer and corrupt data. In fact, you may not even recognize the workaround is needed in the first place.

Lots of application developers are under the impression that the performance hit of external or strong consistency is too high. And in some systems, that might be true. Additionally, we're firm believers that having choice is a good thing—as long as the database does not introduce unnecessary complexity or introduce potential bugs in the application. Inside Google, we aim to give application developers the performance they need while avoiding unnecessary complexity in their application code. To that end, we’ve been researching advanced distributed database systems for many years and have built a wide variety of data stores to get strong consistency just right. Some examples are Cloud Bigtable, which is strongly consistent within a row; Cloud Datastore, which is strongly consistent within a document or object; and Cloud Spanner, which offers strong consistency across rows, regions and continents with serializability. [Note: In fact, Cloud Spanner offers a stronger guarantee of external consistency (strong consistency + serializability), but we tend to talk about Cloud Spanner having strong consistency because it's a more broadly accepted term.]


Strongly consistent reads and Cloud Spanner


Cloud Spanner was designed from the ground up to serve strong reads (i.e., strongly consistent reads) by default with low latency and high throughput. Thanks to the unique power of TrueTime, Spanner provides strong reads for arbitrary queries without complex multi-phase consensus protocols and without locks of any kind. Cloud Spanner’s use of TrueTime also provides the added benefit of being able to do global bounded-staleness reads.
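For example, an ad hoc query issued through the gcloud CLI is served as a strong read by default, with no extra flags or locking hints needed (the instance, database and table names here are illustrative):
gcloud spanner databases execute-sql billing-db \
    --instance billing-instance \
    --sql 'SELECT AccountId, Balance FROM Accounts'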

Better yet, Cloud Spanner offers strong consistency for multi-region and regional configurations. Other globally distributed databases present a dilemma to developers: If they want to read the data from geographically distributed regions, they forfeit the ability to do strongly consistent reads. In these other systems, if a customer opts to have strongly consistent reads, then they forfeit the ability to do reads from remote regions.

To take maximum advantage of the external consistency guarantees that Cloud Spanner provides and to maximize your application's performance, we offer the following two recommendations:
  1. Always use strong reads, whenever possible. Strong reads, which provide strong consistency, ensure that you are reading the latest copy of your data. Strong consistency makes application code simpler and applications more trustworthy.
  2. If latency makes strong reads infeasible in some situations, then use reads with bounded-staleness to improve performance, in places where strong reads with the latest data are not necessary. Bounded-staleness semantics ensures you read a guaranteed prefix of the data (for example, within a specified period of time) that is consistent, as opposed to eventual consistency where you have no guarantees and your app can read almost anything forwards or back in time from when you queried it.
Forgoing strong consistency carries real risks. Strong reads across a database ensure that you're reading the latest copy of your data and that the referential integrity of the entire dataset is maintained, making it easier to reason about concurrent requests. Using weaker consistency models introduces the risk of software bugs and can be a waste of developer hours and, potentially, customer trust.

What about writes?


Strong consistency is even more important for write operations—especially read-modify-write transactions. Systems that don't provide strong consistency in such situations create a burden for application developers, as there's always a risk of putting your data into an inconsistent state.

Perhaps the most insidious type of problem is write skew. In write skew, two transactions read a set of objects and make changes to some of those objects. However, the modifications that each transaction makes affect what the other transaction should have read. For example, consider a database for an airline based in San Francisco. It’s the airline’s policy to always have a free plane in San Francisco, in the event that this spare plane is needed to replace another plane with maintenance problems or for some other need. Imagine two transactions that are both reserving planes for upcoming flights out of San Francisco:

Begin Transaction
  SELECT * FROM Airplanes WHERE location = "San Francisco" AND Availability = "Free";
  If number of airplanes > 1:  # to enforce the "one free plane" rule
    Pick 1 airplane
    Set its Availability to "InUse"
    Commit
  Else:
    Rollback


Without strong consistency (and, in particular, serializable isolation for these transactions), both transactions could successfully commit, thus potentially breaking our one free plane rule. There are many more situations where write skew can cause problems7.

Because Cloud Spanner was built from the ground up to be a relational database with strong, transactional consistency—even for complex multi-row and multi-table transactions—it can be used in many situations where a NoSQL database would cause headaches for application developers. Cloud Spanner protects applications from problems like write skew, which makes it appropriate for mission-critical applications in many domains including finance, logistics, gaming and merchandising.

How does Cloud Spanner differ from multi-master replication?


One topic that's often combined with scalability and consistency discussions is multi-master replication. At its core, multi-master replication is a strategy used to reduce mean time to recovery for vertically scalable database systems. In other words, it’s a disaster recovery solution, and not a solution for global, strong consistency. With a multi-master system, each machine contains the entire dataset, and changes are replicated to other machines for read-scaling and disaster recovery.

In contrast, Cloud Spanner is a truly distributed system, where data is distributed across multiple machines within a replica, and also replicated across multiple machines and multiple data centers. The primary distinction between Cloud Spanner and multi-master replication is that Cloud Spanner uses Paxos to synchronously replicate writes out of region, while still making progress in the face of single server/cluster/region failures. Synchronous out-of-region replication means that consistency can be maintained, and strongly consistent data can be served without downtime, even when a region is unavailable; no acknowledged writes are delayed or lost due to the unavailable region. Cloud Spanner’s Paxos implementation elects a leader so that it's not necessary to do time-intensive quorum reads to obtain strong consistency. Additionally, Cloud Spanner shards data horizontally across servers, so individual machine failures impact less data. While a node is recovering, replicated nodes on other clusters that contain that dataset can assume mastership easily, and serve strong reads without any visible downtime to the user.

A strongly consistent solution for your mission-critical data


For storing critical, transactional data in the cloud, Cloud Spanner offers a unique combination of external, strong consistency, relational semantics, high availability and horizontal scale. Stringent consistency guarantees are critical to delivering trustworthy services. Cloud Spanner was built from the ground up to provide those guarantees in a high-performance, intuitive way. We invite you to try it out and learn more.

See more on Cloud Spanner and external consistency.

1 https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
2 https://en.wikipedia.org/wiki/Consistency_(database_systems)
3 Kleppmann, Martin. Designing Data-Intensive Applications. O’Reilly, 2017, p. 329.
4 Kleppmann, Martin. Designing Data-Intensive Applications. O’Reilly, 2017, p. 329.
5 https://en.wikipedia.org/wiki/Strong_consistency
6 Kleppmann, Martin. Designing Data-Intensive Applications. O’Reilly, 2017, p. 322.
7 Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly, 2017, p. 246.

Trash talk: How moving Apigee Sense to GCP reduced our “data litter” and decreased our costs



In the year-plus since Apigee joined the Google Cloud family, we’ve had the opportunity to deploy several of our services to Google Cloud Platform (GCP). Most recently, we completely moved Apigee Sense to GCP to use its advanced machine learning capabilities. Along the way, we also experienced some important performance improvements as judged by a drop in what we call “data litter.” In this post, we explain what data litter is, and our perspective on how various GCP services keep it at bay. Through this account, you may come to recognize your own application, and come to see data litter as an important metric to consider.

What is data litter?


First, let’s take a look at Apigee Sense and its application characteristics. At its core, Apigee Sense protects APIs running on Apigee Edge from attacks and unwanted exploitation. Those attacks are usually performed by automated processes, or "bots," which run without the permission of the API owner. Sense is built around a four-element "CAVA" cycle: collect, analyze, visualize and act. It enhances human vigilance with statistical machine learning algorithms.

We collect a lot of traffic data as a by-product of billions of API calls that pass through Apigee Edge daily. The output end of each of the four elements in the CAVA cycle is stored in a database system. Therefore, the costs, performance and scalability of data management and data analysis toolchains are of great interest to us.

When optimizing an analytics application, there are several things that demand particular attention: latency, quality, throughput and cost.
  • Latency is the delay between when something happens and when we become aware of it. In the case of security, we define it as the delay between when a bot attacks and when we notice the attack.
  • The quality of our algorithmic smarts is measured by true and false positives and negatives.
  • Throughput measures the average rate at which data arrives into the analytics application.
  • Cost, of course, measures the average rate at which dollars (or other currency) leave your wallet.
To this mix I like to add a fifth metric: "data litter," which in many ways measures the interplay between the four traditional metrics. Fundamentally, all analytics systems are GIGO (garbage in / garbage out). That is, if the data entering the system is garbage, it doesn’t matter how quickly it is processed, how smart our algorithms are, or how much data we can process every second. The money we spend does matter, but only because of questions about the wisdom of continuing to spend it.

Sources of data litter


Generally speaking, there are three main sources of data litter in an analytical application like Apigee Sense.
  1. Timeliness of analysis: It’s the nature of a data-driven analysis engine like Sense to attempt to make a decision with all the data available to it when the decision needs to be made. A delayed decision is of little value in foiling an ongoing bot attack. Therefore, when there's little data available to make decisions, the engine makes a less-informed decision and moves on. Any data that arrives subsequently is discarded because it is no longer useful, as the decision has already been made. The result? Data litter. 
  2. Elasticity of data processing: If data arrives too quickly for the analysis engine to consume, it piles up and causes "data back pressure." There are two remedies: increase the size (and cost) of the analysis engine, or drop some data to relieve the pressure. Because you can’t scale up an analysis engine instantly, and because doing so can be cost-prohibitive, we build a pressure-release valve into the pipeline, causing data litter.
  3. Scalability of the consumption chain: If the target database is down, or unable to consume the results at the rate at which they're produced, you might as well stop the pipeline and discard the incoming data. It's pointless to analyze data when there's no way to use or store the results of the analysis. This too causes data litter.
Therefore, data litter is a holistic measure of the quality of the analysis system. It will be low only when the pipeline, analysis engine and target database are all well-tuned and constantly performing to expectations.


The easiest way to deal with the first kind of data litter is to slow down the pipeline by increasing latency. The easiest way to address the second kind is to throw money at the problem and run the analysis engine on a larger cluster. And the final problem is best addressed by adding more or bigger hardware to the database. Whichever path we take, we either increase latency and lose relevance, or lose money.

Moving to GCP


At Apigee, we track data litter with the data coverage metric, which is, roughly speaking, the inverse measure of how much of data gets dropped or otherwise doesn’t contribute to the analysis. When we moved the Sense analytics chain to GCP, the data coverage metric went from below 80% to roughly 99.8% for one of our toughest customer use cases. Put another way, our data litter decreased from over 20%, or one in five, to approximately one in five hundred. That’s a decrease of a factor of approximately 100, or two orders of magnitude!

The chart below shows the fraction of data available and used for decision making before and after our move to GCP. The chart shows the numbers for four different APIs, representing a subset of Sense customers.
These improvements were measured even as the cost of the deployed system and the pipeline latency were simultaneously tightened. Our throughput and algorithms stayed the same, while latencies and cost both dropped. Since the release a couple of months ago, these savings, along with the availability and performance benefits of the system, have persisted, even as our customer base and the traffic we process have grown. So we're getting more reliable answers more quickly than we did before, and paying less, for almost the exact same use case. Wow!


Where did the data litter go?


There were two problems that accounted for the bulk of the data litter in the Sense pipeline. These were the elasticity of data processing and the scalability of the transactional store.

To alert customers of an attack as quickly as possible, we designed our system with adequate latency to avoid systematic data litter. In our environment, two features of the GCP platform contributed most significantly to the reduction of unplanned data litter:
  1. System elasticity. The data rate coming into the system is never uniform, and is especially high when there is an ongoing attack. The system is most under pressure when it is of highest value and needs to have enough elasticity to be able to deal with spikes without being provisioned significantly above the median data rates. Without this, the pressure release valve needs to be constantly engaged. 
  2. Transactional processing power. The transactional load on the database at the end of the chain peaks during an attack. It also determines the performance characteristics of the user experience and of protective enforcement, both of which add to the workload when an API is under attack. Therefore, transactional loads need to be able to comfortably scale to meet the demands of the system near its limits. 
As part of this transition, we moved our analysis chain to Cloud Dataproc, which provided significantly more nimble and cost-controlled elasticity. Because the cost of the analytics pipeline represented our most significant constraint, we were able to size our processing capacity limits more aggressively. This gave us the additional elastic capacity needed to meet peak demands without increasing our cost.

We also moved our target database to BigQuery. BigQuery distributes and scales cost-effectively and without hiccups well beyond our needs, and indeed, beyond most reasonable IT budgets. This completely eliminated the back pressure issue from the end of the chain.

Because two of the three sources of data litter are now gone, our team is able to focus on improving the timeliness of our analysis—ensuring that we move data from where it's gathered through the analysis engine and make more intelligent and more relevant decisions with lower latency. This is what Sense was intended to do.

By moving Apigee Sense to GCP, we feel that we’ve taken back the control of our destiny. I'm sure that our customers will notice the benefits not just in terms of a more reliable service, but also in the velocity with which we are able to ship new capabilities to them.

Whitepaper: Lift and shift to Google Cloud Platform



Today we're announcing the availability of a new white paper entitled “How to Lift-and-Shift a Line of Business Application onto Google Cloud Platform.” This is the first in a series of four white papers focused on application migration and modernization. Stay tuned to the GCP blog as we release the next installments in the coming weeks.

The “Lift-and-Shift” white paper walks you through migrating a Microsoft Windows-based, two-tier expense reporting web application that currently resides on-premises in your data center. The white paper provides background information and a three-phased project methodology, as well as pointers to application code on GitHub. You'll be able to replicate the scenario on-premises and walk through migrating your application to Google Cloud Platform (GCP).

The phased project includes implementation of initial GCP resources, including GCP networking, a site-to-site VPN and virtual machines (VMs), as well as setting up Microsoft SQL Server availability groups, and configuring Microsoft Active Directory (AD) replication in your new hybrid environment.

Want to learn how to lift and shift your own application by reading through the white paper (or following the same steps yourself)? If you're ready to get started, you can download your copy of the white paper and start your migration today.

Simplify Cloud VPC firewall management with service accounts



Firewalls provide the first line of network defense for any infrastructure. On Google Cloud Platform (GCP), Google Cloud VPC firewalls do just that—controlling network access to and between all the instances in your VPC. Firewall rules determine who's allowed to talk to whom and more importantly who isn’t. Today, configuring and maintaining IP-based firewall rules is a complex and manual process that can lead to unauthorized access if done incorrectly. That’s why we’re excited to announce a powerful new management feature for Cloud VPC firewall management: support for service accounts.

If you run a complex application on GCP, you’re probably already familiar with service accounts in Cloud Identity and Access Management (IAM) that provide an identity to applications running on virtual machine instances. Service accounts simplify the application management lifecycle by providing mechanisms to manage authentication and authorization of applications. They provide a flexible yet secure mechanism to group virtual machine instances with similar applications and functions with a common identity. Security and access control can subsequently be enforced at the service account level.


Using service accounts, when a cloud-based application scales up or down, new VMs are automatically created from an instance template and assigned the correct service account identity. This way, when a VM boots up, it gets the right set of permissions and is placed within the relevant subnet, so firewall rules are automatically configured and applied.

Further, the ability to use Cloud IAM ACLs with service accounts allows application managers to express their firewall rules in the form of intent, for example, allow my “application x” servers to access my “database y.” This removes the need to manually manage server IP address lists while simultaneously reducing the likelihood of human error.
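As a rough sketch, that intent maps directly onto a rule like this one (the service account names, network and port are illustrative):
gcloud compute firewall-rules create allow-app-x-to-db-y \
    --network my-network \
    --allow TCP:5432 \
    --source-service-accounts app-x@PROJECT_ID.iam.gserviceaccount.com \
    --target-service-accounts db-y@PROJECT_ID.iam.gserviceaccount.com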
This process is leaps and bounds simpler and more manageable than maintaining IP address-based firewall rules, which can neither be automated nor templated for transient VMs with any semblance of ease.

Here at Google Cloud, we want you to deploy applications with the right access controls and permissions, right out of the gate. Click here to learn how to enable service accounts. And to learn more about Cloud IAM and service accounts, visit our documentation for using service accounts with firewalls.

Introducing Preemptible GPUs: 50% Off



In May 2015, Google Cloud introduced Preemptible VM instances to dramatically change how you think about (and pay for) computational resources for high-throughput batch computing, machine learning, scientific and technical workloads. Then last year, we introduced lower pricing for Local SSDs attached to Preemptible VMs, expanding preemptible cloud resources to high performance storage. Now we're taking it even further by announcing the beta release of GPUs attached to Preemptible VMs.

You can now attach NVIDIA K80 and NVIDIA P100 GPUs to Preemptible VMs for $0.22 and $0.73 per GPU hour, respectively. This is 50% cheaper than GPUs attached to on-demand instances, whose prices we also recently lowered. Preemptible GPUs are a particularly good fit for large-scale machine learning and other computational batch workloads, as customers can harness the power of GPUs to run distributed batch workloads at predictably affordable prices.

As a bonus, we're also glad to announce that our GPUs are now available in our us-central1 region. See our GPU documentation for a full list of available locations.

Resources attached to Preemptible VMs are the same as equivalent on-demand resources, with two key differences: Compute Engine may shut them down after providing you a 30-second warning, and you can use them for a maximum of 24 hours. This makes them a great choice for distributed, fault-tolerant workloads that don’t continuously require any single instance, and allows us to offer them at a substantial discount. But just like on-demand resources, preemptible pricing is fixed: you always get low cost and financial predictability, and we bill on a per-second basis. Any GPU attached to a Preemptible VM instance is considered preemptible and is billed at the lower rate. To get started, simply append --preemptible to your instance create command in gcloud, set scheduling.preemptible to true in the REST API, or set Preemptibility to "On" in the Google Cloud Platform Console, and then attach a GPU as usual. You can use your regular GPU quota to launch Preemptible GPUs or, alternatively, you can request a special Preemptible GPU quota that applies only to GPUs attached to Preemptible VMs.
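As a rough sketch, creating a preemptible instance with a single K80 attached can be as simple as the following (the instance name, zone and machine type are illustrative; GPU instances also require the host maintenance policy to be set to terminate):
gcloud compute instances create my-preemptible-gpu-instance \
    --zone us-central1-c \
    --machine-type n1-standard-8 \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --maintenance-policy TERMINATE \
    --preemptible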
For users looking to create dynamic pools of affordable GPU power, Compute Engine’s managed instance groups can be used to automatically re-create your preemptible instances when they're preempted (if capacity is available). Preemptible VMs are also integrated into cloud products built on top of Compute Engine, such as Kubernetes Engine (GKE’s GPU support is currently in preview. The sign-up form can be found here).

Over the years we’ve seen customers do some very exciting things with preemptible resources: everything from satellite image analysis and financial services to questions in quantum physics, computational mathematics and drug screening.
"Preemptible GPU instances from GCP give us the best combination of affordable pricing, easy access and sufficient scalability. In our drug discovery programs, cheaper computing means we can look at more molecules, thereby increasing our chances of finding promising drug candidates. Preemptible GPU instances have advantages over the other discounted cloud offerings we have explored, such as consistent pricing and transparent terms. This greatly improves our ability to plan large simulations, control costs and ensure we get the throughput needed to make decisions that impact our projects in a timely fashion." 
Woody Sherman, CSO, Silicon Therapeutics 

We’re excited to see what you build with GPUs attached to Preemptible VMs. If you want to share stories and demos of the cool things you've built with Preemptible VMs, reach out on Twitter, Facebook or G+.

For more details on Preemptible GPU resources, please check out the preemptible documentation, GPU documentation and best practices. For more pricing information, take a look at our Compute Engine pricing page or try out our pricing calculator. If you have questions or feedback, please visit our Getting Help page.

To get started using Preemptible GPUs today, sign up for Google Cloud Platform and get $300 in credits to try out Preemptible GPUs.

Consequences of SLO violations — CRE life lessons



Previous episodes of CRE life lessons have talked in detail about the importance of quantifying a service's availability and using SLOs to manage the competing priorities of features-focused development teams ("devs") versus a reliability-focused SRE team. Good SLOs can help reduce organizational friction and maintain development velocity without sacrificing reliability. But what should happen when SLOs are violated?

In this blogpost, we discuss why you should create a policy on how SREs and devs respond to SLO violations, and provide some ideas for the structure and components of that policy. Future posts will go over an example taken from an SRE team here at Google, and work through some scenarios that put that policy into action.

Features or reliability?


In the ideal world (assuming spherical SREs in a vacuum), an SLO represents the dividing line between two binary states: developing new features when there's error budget to spare, and improving service reliability when there isn't. Most real engineering organizations will instead vary their effort on a spectrum between these two extremes as business priorities dictate. Even when a service is operating well within its SLOs, choosing to do some proactive reliability work may reduce the risk of future outages, improve efficiency and provide cost savings; conversely it's rare to find an organization that completely drops all in-flight feature development as soon as an SLO is violated.

Describing key inflection points from that spectrum in a policy document is an important part of the relationship between an SRE team and the dev teams with whom they partner. This ensures that all parts of the organization have roughly the same understanding around what is expected of them when responding to (soon to be) violated SLOs, and – most importantly – that the consequences of not responding are clearly communicated to all parties. The exact choice of inflection points and consequences will be specific to the organization and its business priorities.

Inflection points


Having a strong culture of blameless postmortems and fixing root causes should eventually mean that most SLO violations are unique – informally, “we are in the business of novel outages.” It follows that the response to each violation will also be unique; making judgement calls around these is part of an SRE's job when responding to the violation. But a large variance in the range of possible responses results in inconsistency of outcomes, people trying to game the system and uncertainty for the engineering organization.

For the purposes of an escalation policy, we recommend that SLO violations be grouped into a few buckets of increasing severity based on the cumulative impact of the violation over time (i.e., how much error budget has been burned over what time horizon), with clearly defined boundaries for moving from one bucket to another. It's useful to have some business justification for why violations are grouped as they are, but this should be in an appendix to the main policy to keep the policy itself clear.

It's a good idea to tie at least some of the bucket boundaries to any SLO-based alerting you have. For example, you may choose to page SREs to investigate when 10% of the weekly error budget has been burned in the past hour; this is an example of an inflection point tied to a consequence. It forms the boundary between buckets we might informally title "not enough error budget burned to notify anyone immediately" and "someone needs to investigate this right now before the service is out of its long-term SLO." We'll examine more concrete examples in our next post, where we look at a policy from an SRE team within Google.
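To make the arithmetic behind a threshold like that concrete: for a hypothetical service with a 99.9% weekly availability SLO serving a steady 1,000 requests per second, the weekly error budget is roughly 600,000 failed requests, so burning 10% of it (about 60,000 errors) within a single hour corresponds to an error rate of roughly 1.7% over that hour, well above the 0.1% the service can sustain over the long term.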

Consequences


The consequences of a violation are the meat of the policy. They describe actions that will be taken to bring the service back into SLO, whether this is by root causing and fixing the relevant class of issue, automating any stop-gap mitigation tasks or by reducing the near-term risk of further deterioration. Again, the choice of consequence for a given threshold is going to be specific to the organization defining the policy, but there are several broad areas into which these fall. This list is not exhaustive!

Notify someone of potential or actual SLO violation

The most common consequence of any potential or actual SLO violation is that your monitoring system tells a human that they need to investigate and take remedial action. For a mature, SRE-supported service, this will normally be in the form of a page to the oncall when a large quantity of error budget has been burned over a short window, or a ticket when there’s an elevated burn rate over a longer time horizon. It's not a bad idea for that page to also create a ticket in which you can record debugging details, use as a centralized communication point and reference when escalating a serious violation.

The relevant dev team should also be notified. It's OK for this to be a manual process; the SRE team can add value by filtering and aggregating violations and providing meaningful context. But ideally a small group of senior people in the dev team should be made aware of actual violations in an automated fashion (e.g., by CCing them on any tickets), so that they're not surprised by escalations and can chime in if they have pertinent information.

Escalate the violation to the relevant dev team

The key difference between notification and escalation is the expectation of action on the part of the dev team. Many serious SLO violations require close cooperation between SREs and developers to find the root cause and prevent recurrence. Escalation is not an admission of defeat. SREs should escalate as soon as they're reasonably sure that input from the dev team will meaningfully reduce the time to resolution. The policy should set an upper bound on the length of time an SLO violation (or near miss) can persist without escalation.

Escalation does not signify the end of SRE’s involvement with an SLO violation. The policy should describe the responsibilities of each team and a lower bound on the amount of engineering time they should divert towards investigating the violation and fixing the root cause. It will probably be useful to describe multiple levels of escalation, up to and including getting executive-level support to commandeer the engineering time of the entire dev team until the service is reliable.

Mitigate risk of service changes causing further impact to SLOs

Since a service in violation of its SLO is by definition making users unhappy, day-to-day operations that may increase the rate at which error budget is burned should be slowed or stopped completely. Usually, this means restricting the rate of binary releases and experiments, or stopping them completely until the service is again within SLO. This is where the policy needs to ensure all parties (SRE, development, QA/testing, product and execs) are on the same page. For some engineering organizations, the idea that SLO violations will impact their development and release velocity may be difficult to accept. Reaching a documented agreement on how and when releases will be blocked – and what fraction of engineers will be dedicated to reliability work when this occurs – is a key goal.

Revoke support for the service

If a service is shown to be incapable of meeting its agreed-upon SLOs over an extended time period, and the dev team responsible for that service is unwilling to commit to engineering improvements to its reliability, then SRE teams at Google have the option of handing back the responsibility for running that service in production. This is unlikely to be the consequence of a single SLO violation; rather, it's the combination of multiple serious outages over an extended period of time, where postmortem action items (AIs) have been assigned to the dev team but not prioritized or completed.

This has worked well at Google, because it changes the incentives behind any conversation around engineering for reliability. Any dev team that neglects the reliability of a service knows that they will bear the consequences of that neglect. By definition, revoking SRE support for a service is a last resort, but stating the conditions that must be met for it to happen makes it a matter of policy, not an idle threat. Why should SRE care about service reliability if the dev team doesn't?

Summary


Hopefully this post has helped you think about the trade-off between engineering for reliability and features, and how responding to SLO violations moves the needle towards reliability. In our next post, we'll present an escalation policy from one of Google's SRE teams, to show the choices they made to help the dev teams they partner with maintain a high development velocity.

What a year! Google Cloud Platform in 2017



The end of the year is a time for reflection . . . and making lists. As 2017 comes to a close, we thought we’d review some of the most memorable Google Cloud Platform (GCP) product announcements, white papers and how-tos, as judged by popularity with our readership.

As we pulled the data for this post, some definite themes emerged about your interests when it comes to GCP:
  1. You love to hear about advanced infrastructure: CPUs, GPUs, TPUs, better network plumbing and more regions.
  2. How we harden our infrastructure is endlessly interesting to you, as are tips about how to use our security services.
  3. Open source is always a crowd-pleaser, particularly if it presents a cloud-native solution to an age-old problem.
  4. You’re inspired by Google innovation: unique technologies that we developed to address internal, Google-scale problems.

So, without further ado, we present to you the most-read stories of 2017.

Cutting-edge infrastructure

If you subscribe to the “bigger is always better” theory of cloud infrastructure, then you were a happy camper this year. Early in 2017, we announced that GCP would be the first cloud provider to offer the Intel Skylake architecture; GPUs for Compute Engine and Cloud Machine Learning became generally available; and Shazam talked about why cloud GPUs made sense for them. In the spring, you devoured a piece on the performance of TPUs, and another about the then-largest cloud-based compute cluster. We announced yet more new GPU models and, topping it all off, Compute Engine began offering machine types with a whopping 96 vCPUs and 624GB of memory.

It wasn’t just our chip offerings that grabbed your attention — you were pretty jazzed about Google Cloud network infrastructure too. You read deep dives about Espresso, our peering-edge architecture, TCP BBR congestion control and improved Compute Engine latency with Andromeda 2.1. You also dug stories about new networking features: Dedicated Interconnect, Network Service Tiers and GCP’s unique take on sneakernet: Transfer Appliance.

What’s the use of great infrastructure without somewhere to put it? 2017 was also a year of major geographic expansion. We started out the year with six regions, and ended it with 13, adding Northern Virginia, Singapore, Sydney, London, Germany, São Paulo and Mumbai. This was also the year that we shed our Earthly shackles, and expanded to Mars ;)

Security above all


Google has historically gone to great lengths to secure our infrastructure, and this was the year we discussed some of those advanced techniques in our popular Security in plaintext series. Among them: 7 ways we harden our KVM hypervisor, Fuzzing PCI Express and Titan in depth.

You also grooved on new GCP security services: Cloud Key Management and managed SSL certificates for App Engine applications. Finally, you took heart in a white paper on how to implement BeyondCorp as a more secure alternative to VPN, and support for the European GDPR data protection laws across GCP.

Open, hybrid development


When you think about GCP and open source, Kubernetes springs to mind. We open-sourced the container management platform back in 2014, but this year we showed that GCP is an optimal place to run it. It’s consistently among the first cloud services to run the latest version (most recently, Kubernetes 1.8) and comes with advanced management features out of the box. And as of this fall, it’s certified as a conformant Kubernetes distribution, complete with a new name: Google Kubernetes Engine.

Part of Kubernetes’ draw is as a platform-agnostic stepping stone to the cloud. Accordingly, many of you flocked to stories about Kubernetes and containers in hybrid scenarios. Think Pivotal Container Service and Kubernetes’ role in our new partnership with Cisco. The developers among you were smitten with Cloud Container Builder, a stand-alone tool for building container images, regardless of where you deploy them.

But our open source efforts aren’t limited to Kubernetes — we also made significant contributions to Spinnaker 1.0, and helped launch the Istio and Grafeas projects. You ate up our "Partnering on open source" series, featuring the likes of HashiCorp, Chef, Ansible and Puppet. Availability-minded developers loved our Customer Reliability Engineering (CRE) team’s missive on release canaries, and with API design: Choosing between names and identifiers in URLs, our Apigee team showed them a nifty way to have their proverbial cake and eat it too.

Google innovation


In distributed database circles, Google’s Spanner is legendary, so many of you were delighted when we announced Cloud Spanner and a discussion of how it defies the CAP Theorem. Having a scalable database that offers strong consistency and great performance seemed to really change your conception of what’s possible — as did Cloud IoT Core, our platform for connecting and managing “things” at scale. CREs, meanwhile, showed you the Google way to handle an incident.

2017 was also the year machine learning became accessible. For those of you with large datasets, we showed you how to use Cloud Dataprep, Dataflow, and BigQuery to clean up and organize unstructured data. It turns out you don’t need a PhD to learn to use TensorFlow, and for visual learners, we explained how to visualize a variety of neural net architectures with TensorFlow Playground. One Google Developer Advocate even taught his middle-school son TensorFlow and basic linear algebra, as applied to a game of rock-paper-scissors.

Natural language processing also became a mainstay of machine learning-based applications; here, we highlighted it with a lighthearted and relatable example. We launched the Video Intelligence API and showed how Cloud Machine Learning Engine simplifies the process of training a custom object detector. And the makers among you really went for a post that shows how to add machine learning to your IoT projects with the Google AIY Voice Kit. Talk about accessible!

Lastly, we want to thank all our customers, partners and readers for your continued loyalty and support this year, and wish you a peaceful, joyful holiday season. And be sure to rest up and visit us again next year. Because if you thought we had a lot to say in 2017, well, hold onto your hats.

Greetings from North Pole Operations! All systems go!



Hi there! I’m Merry, Santa’s CIE (Chief Information Elf), responsible for making sure computers help us deliver joy to the world each Christmas. My elf colleagues are really busy getting ready for the big day (or should I say night?), but this year, my team has things under control, thanks to our fully cloud-native architecture running on Google Cloud Platform (GCP)! What’s that? You didn’t know that the North Pole was running in the cloud? How else did you think that we could scale to meet the demands of bringing all those gifts to all those children around the world?
You see, North Pole Operations have evolved quite a lot since my parents were young elves. The world population increased from around 1.6 billion in the early 20th century to 7.5 billion today. The elf population couldn’t keep up with that growth and the increased production of all these new toys using our old methods, so we needed to improve efficiency.

Of course, our toy list has changed a lot too. It used to be relatively simple rocking horses, stuffed animals, dolls and toy trucks, mostly. The most complicated things we made when I was a young elf were Teddy Ruxpins (remember those?). Now toy cars and even trading card games come with their own apps and use machine learning.

This is where I come in. We build lots of computer programs to help us. My team is responsible for running hundreds of microservices. I explain a microservice to Santa as a computer program that performs a single service. We have a microservice for processing incoming letters from kids, another microservice for calculating kids’ niceness scores, even a microservice for tracking reindeer games rankings.
Here's an example of the Letter Processing Microservice, which takes handwritten letters in all languages (often including spelling and grammatical errors) and turns each one into text.
Each microservice runs on one or more computers (also called virtual machines or VMs). We tried to run it all from some computers we built here at the North Pole but we had trouble getting enough electricity for all these VMs (solar isn’t really an option here in December). So we decided to go with GCP. Santa had some reservations about “the Cloud” since he thought it meant our data would be damaged every time it rained (Santa really hates rain). But we managed to get him a tour of a data center (not even Santa can get in a Google data center without proper clearances), and he realized that cloud computing is really just a bunch of computers that Google manages for us.

Google lets us use projects, folders and orgs to group different VMs together. Multiple microservices can make up an application and everything together makes up our system. Our most important and most complicated application is our Christmas Planner application. Let’s talk about a few services in this application and how we make sure we have a successful Christmas Eve.
Our Christmas Planner application includes microservices for a variety of tasks: microservices generate lists of kids that are naughty or nice, as well as a final list of which child receives which gift based on preferences and inventory. Microservices plan the route, taking into consideration inclement weather and finally, generate a plan for how to pack the sleigh.

Small elves, big data


Our work starts months in advance, tracking naughty and nice kids by relying on parent reports, teacher reports, police reports and our mobile elves. Keeping track of almost 2 billion kids each year is no easy feat. Things really heat up around the beginning of December, when our army of Elves-on-the-Shelves is mobilized, reporting in nightly.
We send all this data to a system called BigQuery where we can easily analyze the billions of reports to determine who's naughty and who's nice in just seconds.
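If you're curious what that looks like in practice, here's a rough sketch of the kind of query we run. The dataset, table and column names below are made up for illustration (our real schema is classified at the highest North Pole level):

bq query --use_legacy_sql=false '
SELECT
  child_id,
  SUM(CASE WHEN report_type = "nice" THEN 1 ELSE -1 END) AS niceness_score
FROM
  `north-pole-ops.reports.daily_reports`
WHERE
  report_date BETWEEN "2017-09-01" AND "2017-12-24"
GROUP BY
  child_id
ORDER BY
  niceness_score DESC'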

Deck the halls with SLO dashboards


Our most important service level indicator, or SLI, is “child delight.” We target “five nines,” or a 99.999% delightment level, meaning 99,999 out of every 100,000 nice children are delighted. This target is our service level objective, or SLO, and one of the few things everyone here at the North Pole takes very seriously. Each individual service has SLOs we track as well.

We use Stackdriver for dashboards, which we show in our control center. We set up alerting policies that notify us when a service level indicator drops below what we expect. Santa was a little grumpy at first since he wanted red and green to be represented equally; we explained that red warnings mean there are alerts and incidents on a service, then put candy canes on all our monitors, and he was much happier.

Merry monitoring for all


We have a team of elite SREs (Site Reliability Elves, though they might be called Site Reliability Engineers by all you folks south of the North Pole) to make sure each and every microservice is working correctly, particularly around this most wonderful time of the year. One of the most important things to get right is the monitoring.

For example, we built our own “internet of things,” or IoT, where each toy production station has sensors and computers so we know the number of toys made, what their quota was and how many of them passed inspection. Last Tuesday, there was an alert that the number of failed toys had shot up. Our SREs sprang into action. They quickly pulled up the dashboards for the inspection stations and saw that the spike in failures was caused almost entirely by our baby doll line. They checked the logs and found that on Monday, a creative elf had come up with the idea of taping on arms and legs rather than sewing them to save time. They rolled back this change immediately. Crisis averted. Without proper monitoring and logging, it would have been very difficult to find and fix the issue, which is why our SREs consider them the base of their gift reliability pyramid.

All I want for Christmas is machine learning


Running things in Google Cloud has another benefit: we can use technology developed at Google. One of our most important services is our gift matching service, which takes 50 factors as input, including the child's wish list, niceness score, local regulations, existing toys, etc., and comes up with the final list of which gifts should be delivered to each child. Last year, we added machine learning, or ML: we gave Cloud ML Engine the last 10 years of inputs, gifts, and child and parent delight levels, and it automatically learned a new model to use in gift matching based on that data.

Using this new ML model, we reduced live animal gifts by 90%, ball pits by 50% and saw a 5% increase in child delight and a 250% increase in parent delight.

Tis the season for sharing


Know someone who loves technology and might enjoy this article, or someone who reminds you of Santa: someone with many amazing skills but whose eyes get that “reindeer-in-the-headlights” look when you talk about cloud computing? Share this article with them, and hopefully you'll soon be chatting about all the cool things you can do with cloud computing over Christmas cookies and eggnog... And be sure to tell them to sign up for a free trial, Google Cloud's gift to them!

One year of Cloud Performance Atlas



In March of this year, we kicked off a new content initiative called Cloud Performance Atlas, where we highlight best practices for GCP performance and how to solve the most common performance issues that cloud developers come across.

Here are the top topics from 2017 that developers found most useful.


5. The bandwidth delay problem


Every now and again, I’ll get a question from a company that recently upgraded its connection bandwidth from its on-premises systems to Google Cloud but, for some reason, isn't getting any better performance as a result. The issue, as we’ve seen multiple times, usually resides in an area of TCP called “the bandwidth delay problem.”

TCP transfers data in packets between two endpoints: a packet is sent, and an acknowledgement packet is returned. To get maximum performance, the connection between the two endpoints has to be tuned so that neither the sender nor the receiver is waiting around for acknowledgements of prior packets.

The most common way to address this problem is to adjust the TCP window size so that it matches the bandwidth-delay product of the connection, that is, the bandwidth multiplied by the round-trip time. This allows each side to keep sending data while ACKs for earlier packets are still in flight, leaving no gaps and achieving maximum throughput. Conversely, a low window size will limit your connection throughput, regardless of the available or advertised bandwidth between instances.
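To make this concrete, the window needs to cover the bandwidth-delay product of the link: for a 1 Gbps connection with an 80 ms round-trip time, that's 1,000,000,000 bits/s × 0.08 s ÷ 8 ≈ 10 MB that must be in flight at once. On a Linux VM, one illustrative way to let the TCP window grow that large is to raise the kernel's buffer limits; treat the numbers below as a sketch to adapt to your own measured bandwidth and RTT, not a recommendation:

# Raise the socket buffer ceilings to 16 MB (values in bytes)
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
# Let TCP auto-tune its windows up to 16 MB (min, default, max in bytes)
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"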

Find out more by checking out the video or article!

4. Improving CDN performance with custom keys


Google Cloud boasts an extremely powerful CDN that can leverage points-of-presence around the globe to get your data to users as fast as possible.

When setting up Cloud CDN for your site, one of the most important things is to ensure that you’re using the right custom cache keys to control which assets get cached and which ones don’t. In most cases the defaults are fine, but if you run a large site with content re-used across protocols (i.e., http and https), the protocol is part of the default cache key, so the same asset gets cached and filled once per protocol, and your cache fill costs can increase more than expected.
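For example, if your content is identical over http and https, one option is to drop the protocol from the cache key so both schemes share a single cache entry. Here's a hedged sketch using a hypothetical backend service name:

gcloud compute backend-services update my-web-backend \
    --global \
    --no-cache-key-include-protocol

You can trim the host or query string from the cache key the same way, but only when you're sure the cached response really is identical across those variations.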

You can see how we helped a sports website get their CDN keys just right in the video and article.


3. Google Cloud Storage and the sequential filename challenge


Google Cloud Storage is a one-stop shop for all your content serving needs. However, one developer kept running into slow upload speeds when pushing their content into the cloud.

The issue was that Cloud Storage uses the file path and name of the files being uploaded to segment and shard the uploads across multiple frontends (improving performance). As we found out, if those file names are sequential, you can end up in a situation where multiple connections get squashed down to a single upload thread (thus hurting performance)!

As shown in the video and article, we were able to help a nursery camera company get past this issue with a few small fixes.
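One common fix (sketched below with made-up file and bucket names) is to break up the sequence by prepending a short hash to each object name before uploading, so Cloud Storage can spread the uploads across frontends again:

# Prefix each sequential file name with a short hash of itself
for f in frame_0001.jpg frame_0002.jpg frame_0003.jpg; do
  prefix=$(echo -n "$f" | md5sum | cut -c1-6)
  gsutil cp "$f" "gs://my-nursery-cam-uploads/${prefix}_${f}"
done

In a real pipeline you'd rename the files first and then upload them in one parallel batch with gsutil -m cp, keeping multiple upload streams busy.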

2. Improving Compute Engine boot time with custom images


Any cloud-based service needs to grow and shrink its resource allocations to respond to traffic load. Most of the time, this is a good thing, especially during the holiday season. ;) As traffic increases to your service/application, your backends will need to spin up more Compute Engine VMs to provide a consistent experience to your users.

However, if it takes too long for your VMs to start up, the quality and performance for your users can be negatively impacted, especially if your VM needs to do a lot of work in its startup script, like compiling code or installing large packages.

As we showed in the video and article, you can pre-compute a lot of that work into a custom boot-disk image. When your VMs boot, they start from the custom image (with everything already installed) rather than doing all that work from scratch.
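Here's a hedged sketch of that flow with hypothetical disk, image and instance names: bake a prepared boot disk into a custom image once, then launch new VMs from it.

# Capture a disk that already has your packages and code installed
gcloud compute images create holiday-web-image \
    --source-disk holiday-web-template-disk \
    --source-disk-zone us-central1-a

# New VMs boot from the baked image instead of installing everything at startup
gcloud compute instances create web-server-1 \
    --image holiday-web-image \
    --zone us-central1-a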

If you’re looking to improve your GCE boot performance, custom images are worth checking out!

1. App Engine boot time


Modern managed languages (Java, Python, JavaScript, etc.) typically have a dependency-loading step that occurs during the init phase of the program, when code is imported and instantiated.

Before execution can begin, any global data, functions or state information are also set up. Most of the time, these systems are global in scope, since they need to be used by so many subsystems (for example, a logging system).

In the case of App Engine, this global initialization work can end up delaying instance start time, since it must complete before a request can be serviced. And as we showed in the video and article, when your application responds to spikes in workload, this type of global variable contention can hurt your request response times.


See you soon!


For the rest of 2017, our Cloud Performance team is enjoying a few hot cups of tea, relaxing over the holidays and counting down the days until the new year. In 2018, we’ve got a lot of awesome new topics to cover, including increased networking performance, Cloud Functions and Cloud Spanner!

Until then, make sure you check out the Cloud Performance Atlas videos on YouTube or our article series on Medium.

Thanks again for a great year everyone, and remember, every millisecond counts!