Category Archives: Google Cloud Platform Blog

Product updates, customer stories, and tips and tricks on Google Cloud Platform

Google Cloud CDN joins CDN Interconnect providers, delivering choice to users


With the general availability of Google Cloud CDN, we’re pleased to announce that the number of CDNs in Google Cloud’s CDN ecosystem has risen to eight, offering choice and speed to Google Cloud Platform (GCP) customers.
In this blog post, we provide details and testimonials from the companies behind these CDNs, and explain the unique advantages of our own CDN offering.

Google Cloud CDN highlights:
  • Google Cloud CDN is delivered on Google’s high-performance global network and edge infrastructure. Google Cloud CDN uses Google's 80+ globally distributed points of presence to accelerate content delivery for websites and applications served out of Google Compute Engine and Google Cloud Storage, reducing latency and serving costs.
  • Google Cloud CDN does not charge extra for HTTPS (TLS) traffic.
  • Google Cloud CDN is integrated into GCP. Once you've set up HTTP(S) Load Balancing, you can enable Cloud CDN with a single checkbox in the Load Balancing UI.
  • We'll be launching a new user interface (UI) for Google Cloud CDN at Google Cloud Next '17, with a separate navigation bar for CDN configuration and reporting, in addition to keeping the existing checkbox for enabling it via the Load Balancing UI.


Google Cloud CDN performance

Cedexis, a well-known web traffic optimization and analytics company, benchmarks and provides visibility into the latency, throughput and availability of leading CDN providers. Cedexis data shows Google Cloud CDN delivering industry-leading performance; the testing methodology used to gather this data can be viewed on the Cedexis website.


CDN Interconnect


The CDN Interconnect program provides choice and depth to Google Cloud customers. It allows select CDN providers to establish direct interconnect links so that customers egressing network traffic from GCP through one of these links can benefit from the direct connectivity to the CDN providers as well as Interconnect pricing.

Below, some of our providers describe their integrations with Google Cloud and the value their CDNs deliver to GCP customers:

Akamai
“As more and more enterprises come to rely on cloud-based computing and storage resources to drive their businesses, it’s critically important that performance is maximized and cost effectiveness is maintained. As the operator of the world’s largest distributed CDN platform, we’re collaborating with Google Cloud Platform to ensure that our joint customers can pass traffic directly from Google Cloud Platform to the Akamai CDN, empowering them to take full advantage of their cloud investments.” Bill Wheaton, executive vice president and general manager, Media Products Division, Akamai
Cloudflare
“We view our collaboration with Google Cloud Platform through their CDN Interconnect program as a crucial step in our mission to help build a better Internet. As a result of the high-speed interconnections between Google Cloud Platform and Cloudflare’s CDN, our joint customers will benefit from amplified performance, increased security and significant savings on their monthly bandwidth bill, making for an altogether better user experience.” Michelle Zatlyn, Co-Founder & Head of User Experience at Cloudflare.
Fastly
“Fastly customers benefit from pairing our platform’s advanced edge compute, enforcement and application delivery solutions with Google Cloud Platform's origin services. Businesses like Firebase and Vimeo leverage the power of Fastly’s edge features for live log streaming to Google BigQuery and deep analytics, as well as the ability to extend applications as close to the end user as possible for faster, more secure and more reliable global performance.” Lee Chen, Head of Strategic Partnerships at Fastly.
Level 3
“Our CDN is architected for high quality video delivery to support the growth in OTT video. Content creators can process their content in Google Cloud, then deploy seamlessly over our CDN to deliver the fast, high quality video experience end users demand.” Jon Alexander, Level 3 Senior Director of Product Management.
Verizon
“Verizon Digital Media Services’ interconnections with Google Cloud Platform help customers delight their users with a lightning fast digital experience that can be scaled globally through our extensive presence and robust connectivity. Our industry-leading cache efficiency supports great user experience while minimizing transaction needs and egress volumes from cloud origins. Our cache efficiency has negated the need for us to charge our customers for origin storage transactions or egress, which makes us an ideal companion CDN to GCP.” Kyle Okamoto, Chief Network Officer, Verizon.
Limelight Networks
“Using Google Cloud Platform in combination with the Limelight Orchestrate Platform helps our customers make the most of their cloud investments by extending their performance and capability with Limelight's content management, distribution and delivery services. Combining the flexibility of Google Cloud Platform with the workload efficiency and reach of Limelight's global infrastructure and experience provides a unique platform to help deliver customer satisfaction.” Steve Miller-Jones, Senior Director of Product Management at Limelight Networks.
Come to Google Cloud Next '17 to learn more about Google Cloud CDN and CDN Interconnect Providers.

Incident management at Google — adventures in SRE-land



Have you ever wondered what happens at Google when something goes wrong? Our industry is fond of using colorful metaphors such as “putting out fires” to describe what we do.
Of course, unlike actual firefighters, our incidents don’t normally involve risk to life and limb. Despite the imperfect metaphor, Google Site Reliability Engineers (SREs) have a lot in common with first responders in other fields.

Like these other first responders, SREs at Google regularly practice emergency response, honing the skills, tools, techniques and attitude required to quickly and effectively deal with the problem at hand.

In emergency services, and at Google, when something goes wrong, it's called an “incident.”

This is the story of my first “incident” as a Google SRE.

Prologue: preparation


For the past several months, I’ve been on a Mission Control rotation with the Google Compute Engine SRE team. I did one week of general SRE training. I learned about Compute Engine through weekly peer training sessions, and by taking on project work. I participated in weekly “Wheel of Misfortune” sessions, where we're given a typical on-call problem and try to solve it. I shadowed actual on-callers, helping them respond to problems. I was secondary on-call, assisting the primary with urgent issues, and handling less urgent issues independently.

Sooner or later, after all the preparation, it’s time to be at the sharp end. Primary on-call. The first responder.

Editor's Note: Chapter 28 “Accelerating SREs to On-Call and Beyond” in Site Reliability Engineering goes into detail about how we prepare new SREs to be ready to be first responders.

Going on-call

There's a lot more to being an SRE than being on-call. On-call is, by design, a minority of what Site Reliability Engineers (SREs) do, but it's also critical. Not only because someone needs to respond when things go wrong, but because the experience of being on-call informs many other things we do as SREs.

During my first on-call shifts, our alerting system saw fit to page1 me twice, and two other problems were escalated to me by other people. With each page, I felt a hit of adrenaline. I wondered “Can I handle this? What if I can’t?” But then I started to work the problem in front of me, like I was trained to, and I remembered that I don’t need to know everything: there are other people I can call on, and they will answer. I may be on point, but I’m not alone.

Editor’s Note: Chapter 11 “Being On-Call” in Site Reliability Engineering has lots of advice on how to organize on-call duties in a way that allows people to be effective over the long term.

It’s an incident!

Three of the pages I received were minor. The fourth was, shall we say, more interesting.

Another Google engineer using Compute Engine for their service had a test automation failure, and upon investigation noticed something unusual with a few of their instances. They notified the development team’s primary on-call, Parya, and she brought me into the loop. I reached out to my more experienced secondary, Benson, and the three of us started to investigate, along with others from the development team who were looped in. Relatively quickly we determined it was a genuine problem. Having no reason to believe that the impact was limited to the single internal customer who reported the issue, we declared an incident.

What does declaring an incident mean? In principle it means that an issue is of sufficient potential impact, scope and complexity that it will require a coordinated effort with well defined roles to manage it effectively. At some point, everything you see on the summary page of the Google Cloud Status Dashboard was declared an incident by someone at Google. In practice, declaring an incident at Google means creating a new incident in our internal incident management tool.

As part of my on-call training, I was trained on the principles behind Google’s incident management protocol, and the internal tool that we use to facilitate incident response. The incident management protocol defines roles and responsibilities for the individuals involved. Earlier I asserted that Google SREs have a lot in common with other first responders. Not surprisingly, our incident management process was inspired by, and is similar to, well established incident command protocols used in other forms of emergency response.

My role was Incident Commander. Less than seven minutes after I declared the incident, a member of our support team took on the External Communications role. In this particular incident, we did not declare any other formal roles, but in retrospect, Parya was the Operations Lead; she led the efforts to root-cause the issue, pulling in others as needed. Benson was the Assistant Incident Commander, as I asked him a series of questions of the form “I think we should do X, Y and Z. Does that sound reasonable to you?”

One of the keys to effective incident response is clear communication between incident responders, and others who may be affected by the incident. Part of that equation is the incident management tool itself, which is a central place that Googlers can go to know about any ongoing incidents with Google services. The tool then directs Googlers to additional relevant resources, such as an issue in our issue-tracking database that contains more details, or the communications channels being used to coordinate the incident response.

Editor’s Note: Chapters 12, 13 and 14 of Site Reliability Engineering discuss effective troubleshooting, emergency response and managing incidents, respectively.

The rollback — an SRE’s fire extinguisher


While some of us worked to understand the scope of the issue, others looked for the proximate and root causes so we could take action to mitigate the incident. The scope was determined to be relatively limited, and the cause was tracked down to a particular change included in a release that was currently being rolled out.

This is quite typical. The majority of problems in production systems are caused by changing something: a new configuration, a new binary, or a service you depend on doing one of those things. There are two best practices that help in this very common situation.

First, all non-emergency changes should use a progressive rollout, which simply means don’t change everything at once. This gives you the time to notice problems, such as the one described here, before they become big problems affecting large numbers of customers.

Second, all rollouts should have a well understood and well tested rollback mechanism. This means that once you understand which change is responsible for the problem, you have an “undo” button you can press to restore service.
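The two practices can be combined into a simple control loop. The sketch below is purely illustrative; the stage sizes and the `deploy`, `healthy` and `rollback` callables are hypothetical stand-ins, not a real Google tool:

```python
# Hypothetical sketch of a progressive rollout with an automatic rollback.
# The stage fractions and the callables are illustrative assumptions.

ROLLOUT_STAGES = [0.01, 0.10, 0.50, 1.00]  # fraction of the fleet per wave

def progressive_rollout(deploy, healthy, rollback):
    """Deploy in increasing waves; undo everything if a wave looks bad."""
    for fraction in ROLLOUT_STAGES:
        deploy(fraction)
        if not healthy():
            rollback()   # the "undo" button: restore the last good version
            return False  # aborted while the blast radius was still small
    return True           # change is fully rolled out
```

Because each wave only touches a fraction of the fleet, a bad change is caught while its blast radius is still small, and the trusted rollback restores the last known-good version.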

Keeping your problems small using a progressive rollout, and then mitigating them quickly via a trusted rollback mechanism are two powerful tools in the quest to meet your Service Level Objectives (SLOs).

This particular incident followed this pattern. We caught the problem while it was small, and then were able to mitigate it quickly via a rollback.

Editor’s Note: Chapter 36 “A Collection of Best Practices for Production Services” in Site Reliability Engineering talks more about these, and other, best practices.

Epilogue: the postmortem


With the rollback complete, and the problem mitigated, I declared the incident “closed.” At this point, the incident management tool helpfully created a postmortem document for the incident responders to collaborate on. Taking our firefighting analogy to its logical conclusion, this is analogous to the part where the fire marshal analyzes the fire, and the response to the fire, to see how similar fires could be prevented in the future, or handled more effectively.

Google has a blameless postmortem culture. We believe that when something goes wrong, you should not look for someone to blame and punish. Chances are the people in the story were well intentioned, competent and doing the best they could with the information they had at the time. If you want to make lasting change, and avoid having similar problems in the future, you need to look to how you can improve the systems, tools and processes around the people, such that a similar problem simply can’t happen again.

Despite the relatively limited impact of the incident, and the relatively subtle nature of the bug, the postmortem identified nine specific follow-up actions that could potentially avoid the problem in the future, or allow us to detect and mitigate it faster if a similar problem occurs. These nine issues were all filed in our bug tracking database, with owners assigned, so they'll be considered, researched and followed up on in the future.

The follow-up actions are not the only outcome of the postmortem. Since every incident at Google has a postmortem, and since we use a common template for our postmortem documents, we can perform analysis of overall trends. For example, this is how we know that a significant fraction of incidents at Google come from configuration changes. (Remember this the next time someone says “but it’s just a config change” when trying to convince you that it’s a good idea to push it out late on the Friday before a long weekend . . .)

Postmortems are also shared within the teams involved. On the Compute Engine team, for example, we have a weekly incident review meeting, where incident responders present their postmortem to a broader group of SREs and developers who work on Compute Engine. This helps identify additional follow-up items that may have been overlooked, and shares the lessons learned with the broader team, making everyone better at thinking about reliability from these case studies. It's also a very strong way to reinforce Google’s blameless postmortem culture. I recall one of these meetings where the person presenting the postmortem attempted to take blame for the problem. The person running the meeting said “While I appreciate your willingness to fall on your sword, we don’t do that here.”

The next time you read the phrase “We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence” on our status page, I hope you'll remember this story. Having experienced firsthand the way we follow up on incidents at Google, I can assure you that it's not an empty promise.

Editor's Note: Chapter 15, “Postmortem Culture: Learning from Failure” in Site Reliability Engineering discusses postmortem culture in depth.



1 We don’t actually use pagers anymore of course, but we still call it “getting paged” no matter what device or communications channel is used.

Google Cloud Platform is the first cloud provider to offer Intel Skylake



I’m excited to announce that Google Cloud Platform (GCP) is the first cloud provider to offer the next generation Intel Xeon processor, codenamed Skylake.

Customers across a range of industries, including healthcare, media and entertainment and financial services ask for the best performance and efficiency for their high-performance compute workloads. With Skylake processors, GCP customers are the first to benefit from the next level of performance.

Skylake includes Intel Advanced Vector Extensions (AVX-512), which make it ideal for scientific modeling, genomic research, 3D rendering, data analytics and engineering simulations. When compared to previous generations, Skylake’s AVX-512 doubles the floating-point performance for the heaviest calculations.

We optimized Skylake for Google Compute Engine’s complete family of VMs (standard, highmem, highcpu and Custom Machine Types) to help bring the next generation of high-performance compute instances to everyone.
“Google and Intel have had a long standing engineering partnership working on Data Center innovation. We're happy to see the latest Intel Xeon technology now available on Google Cloud Infrastructure. This technology delivers significant enhancements for compute-intensive workloads, efficiently accelerating data analytics that businesses depend on for operations and growth.” Diane Bryant, Intel Executive Vice President and GM of the Data Center Group
Skylake processors are available in five GCP regions: Western US, Eastern US, Central US, Western Europe and Eastern Asia Pacific. Sign up here to take advantage of the new Skylake processors.

You can learn more about Skylake for Google Compute Engine and see it in action at Google Cloud NEXT ’17 in San Francisco on March 8-10. Register today!

Three steps to help secure Elasticsearch on Google Cloud Platform



Elasticsearch is an open source search engine built on top of Lucene that is commonly used for internal site searches and analytics. Recently, several high-profile ransomware attacks against unsecured Elasticsearch servers have reminded us that there may be a few things to do to help secure your deployment. In this post, we’ll be covering best practices to help secure your Elasticsearch instances on Google Cloud Platform (GCP).

There are a number of ways to attack an Elasticsearch instance. Poisoning the index, exploiting unauthorized API access and exfiltrating sensitive data are just a few. Read on for some suggestions to combat them.

1. Lock down your access policy

Elasticsearch itself uses a trusting security model that relies on external access management. Thus, an important first step with a new Elasticsearch instance is to lock down its access policy. Best practices on GCP include using IAM policies for internal access and firewall rules for external connections. Add-ons such as X-Pack can add another layer of security.
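As a quick, illustrative sanity check after tightening your firewall rules, you can verify from an untrusted network that Elasticsearch's default port (9200) is no longer reachable. This is a sketch, not a substitute for a real security audit, and the example address is hypothetical:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unreachable
        return False

# Run from a machine that should be blocked; for a locked-down instance
# this should print False. The IP below is a hypothetical placeholder.
# print(port_open("203.0.113.10", 9200))
```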

2. Don’t index sensitive data

Once you’ve updated your access policy for your Elasticsearch instance, think carefully about what content you intend to index. Your initial import into Elasticsearch is likely to be a bulk migration of data. A best practice is to carefully filter out personally identifiable information (PII), cardholder data or other sensitive information to prevent it from leaking. Even if you only provide abstract document IDs from your search engine, hackers can still deduce particularly sensitive bits of information. For example, a bad actor could use wildcard searches to deduce credit card numbers, SSNs or other information one character at a time.
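A minimal illustration of that filtering step in Python follows. The field names and card-number pattern are examples only; a production pipeline should rely on a vetted redaction or DLP tool before anything reaches the index:

```python
import re

# Example deny-list and pattern -- illustrative, not exhaustive.
SENSITIVE_FIELDS = {"ssn", "credit_card", "password"}
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def scrub_for_index(doc):
    """Return a copy of `doc` with sensitive fields dropped and
    card-like numbers redacted from free-text values."""
    clean = {}
    for key, value in doc.items():
        if key.lower() in SENSITIVE_FIELDS:
            continue  # never index these fields at all
        if isinstance(value, str):
            value = CARD_RE.sub("[REDACTED]", value)
        clean[key] = value
    return clean
```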


3. Handle unfiltered content safely

Index poisoning occurs when unfiltered malicious content is ingested by Elasticsearch. If you index user-generated content (UGC), be sure to properly filter it before storing it. Any content returned from the search engine (or any other data storage, for that matter) should be properly escaped for the medium it will be presented through. That means HTML escaping any search result snippets presented in web pages and properly SQL escaping any result data you might use in a database query. See the OWASP pages on data validation and XSS prevention for more information.
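For HTML output, Python's standard library covers the escaping step; here is a minimal sketch (the `render_snippet` helper is a hypothetical example, not part of any Elasticsearch client):

```python
import html

def render_snippet(snippet):
    """Return an HTML-safe list item for a raw snippet from the index.
    html.escape converts <, >, & and quotes to entities, so stored
    markup in user-generated content is displayed, not executed."""
    return "<li>%s</li>" % html.escape(snippet)
```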

In short, improving the security of Elasticsearch is a lot like locking down any other cloud and/or open-source service. Apply best practices, think like a hacker and remember that chaining together several non-critical vulnerabilities often results in the most devastating attacks.

GPUs are now available for Google Compute Engine and Cloud Machine Learning



Google Cloud Platform gets a performance boost today with the much anticipated public beta of NVIDIA Tesla K80 GPUs. You can now spin up NVIDIA GPU-based VMs in three GCP regions: us-east1, asia-east1 and europe-west1, using the gcloud command-line tool. Support for creating GPU VMs using the Cloud Console will arrive later this week.

If you need extra computational power for deep learning, you can attach up to eight GPUs (4 K80 boards) to any custom Google Compute Engine virtual machine. GPUs can accelerate many types of computing and analysis, including video and image transcoding, seismic analysis, molecular modeling, genomics, computational finance, simulations, high-performance data analysis, computational chemistry, fluid dynamics and visualization.

NVIDIA K80 GPU Accelerator Board

Rather than constructing a GPU cluster in your own datacenter, just add GPUs to virtual machines running in our cloud. GPUs on Google Compute Engine are attached directly to the VM, providing bare-metal performance. Each NVIDIA GPU in a K80 has 2,496 stream processors with 12 GB of GDDR5 memory. You can shape your instances for optimal performance by flexibly attaching 1, 2, 4 or 8 NVIDIA GPUs to custom machine shapes.

Google Cloud supports as many as 8 GPUs attached to custom VMs, allowing you to optimize the performance of your applications.

These instances support popular machine learning and deep learning frameworks such as TensorFlow, Theano, Torch, MXNet and Caffe, as well as NVIDIA’s popular CUDA software for building GPU-accelerated applications.

Pricing

Like the rest of our infrastructure, the GPUs are priced competitively and billed per minute (10-minute minimum). In the US, each K80 GPU attached to a VM is priced at $0.700 per hour per GPU; in Asia and Europe, $0.770 per hour per GPU. As always, you only pay for what you use. This frees you up to spin up a large cluster of GPU machines for rapid deep learning and machine learning training with zero capital investment.
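As a worked example of the billing model described above (per-minute billing with a 10-minute minimum, at the US list price of $0.700 per K80 GPU per hour), a quick sketch; always consult the current pricing pages for real figures:

```python
# Illustrative cost calculation based on the prices quoted in this post.
US_PRICE_PER_GPU_HOUR = 0.700
MINIMUM_MINUTES = 10

def gpu_cost(gpus, minutes, price_per_hour=US_PRICE_PER_GPU_HOUR):
    """Cost in dollars for `gpus` K80s used for `minutes` minutes."""
    billed = max(minutes, MINIMUM_MINUTES)  # 10-minute minimum applies
    return gpus * billed * price_per_hour / 60.0
```

For example, four GPUs for 90 minutes comes to 4 × 90 × $0.700 / 60 = $4.20, and a 5-minute run is billed as 10 minutes.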

Supercharge machine learning

The new Google Cloud GPUs are tightly integrated with Google Cloud Machine Learning (Cloud ML), helping you slash the time it takes to train machine learning models at scale using the TensorFlow framework. Now, instead of taking several days to train an image classifier on a large image dataset on a single machine, you can run distributed training with multiple GPU workers on Cloud ML, dramatically shortening your development cycle and iterating quickly on the model.

Cloud ML is a fully-managed service that provides end-to-end training and prediction workflow with cloud computing tools such as Google Cloud Dataflow, Google BigQuery, Google Cloud Storage and Google Cloud Datalab.

Start small and train a TensorFlow model locally on a small dataset. Then, kick off a larger Cloud ML training job against a full dataset in the cloud to take advantage of the scale and performance of Google Cloud GPUs. For more on Cloud ML, please see the Quickstart guide to get started, or this document to dive into using GPUs.

Next steps

Register for Cloud NEXT, sign up for the Cloud ML Bootcamp and learn how to supercharge performance using GPUs in the cloud. You can use the gcloud command-line tool to create a VM today and start experimenting with TensorFlow-accelerated machine learning. Detailed documentation is available on our website.


Developer Advocates offer up their favorite Google Cloud NEXT 17 sessions



Here at Google Cloud, we employ a small army of developer advocates, DAs for short, who are out on the front lines at conferences, on customer premises and on social media, explaining our technologies and relaying your needs as members of the development community back to people like me and our product teams.

DAs take the responsibility of advocating for developers seriously, and have spent time poring over the extensive Google Cloud Next '17 session catalog, bookmarking the talks that will benefit you. To wit:
  • If you’re a developer working in Ruby, you know to turn to Aja Hammerly for all things Ruby/Google Cloud Platform (GCP)-related. Aja’s top pick for Rubyists at Next is Google Cloud Platform < 3 Ruby with Google Developer Program Engineer Remi Taylor, but there are other noteworthy mentions on her personal blog.
  • Mete Atamel is your go-to DA for all things Windows on GCP. Selfishly, his top Next session is his own about running ASP.NET apps on GCP, but he has plenty more suggestions for you to choose from.
  • Groovy nut Guillaume Laforge is going to be one busy guy at Next, jumping between sessions about PaaS, serverless and containers, to name a few. Here’s his full list of must-see sessions.
  • If you’re a game developer, let Mark Mandel be your guide. Besides co-presenting with Rob Whitehead, CTO of Improbable, Mark has bookmarked sessions about location-based gaming, using GPUs and game analytics. Mosey on over to his personal blog for the full list.
  • In the past year, Google Apps Script has opened the door to building amazing customizations for G Suite, our communication and collaboration platform. In this G Suite Developers blog post, Wesley Chun walks you through some of the cool Apps Script sessions, as well as sessions about App Maker and some nifty G Suite APIs. 
  • Want to attend sessions that teach you about our machine learning services? That’s where you’ll find our hands-on ML expert Sara Robinson, who in addition to recommending her favorite Next sessions, also examines her talk from last year’s event using Cloud Natural Language API. 
For my part, I’m really looking forward to Day 3, which we’re modeling after my favorite open source conferences thanks to Sarah Novotny’s leadership. We’ll have a carefully assembled set of open talks on Kubernetes, TensorFlow and Apache Beam that cover the technologies, how to contribute, the ecosystems around them and small group discussions with the developers. For a full list of keynotes, bootcamps and breakout sessions, check out the schedule and reserve your spot.

Guest post: Multi-Cloud continuous delivery using Spinnaker at Waze



At Waze, our mission of outsmarting traffic, together, forces us to be mindful of one of our users' most precious possessions: their time. Our cloud-based service saves users time by helping them find the optimal route based on crowdsourced data.

But a cloud service is only as good as it is available. At Waze, we use multiple cloud providers to improve the resiliency of our production systems. By running an active-active architecture across Google Cloud Platform (GCP) and AWS, we’re in a better position to survive a DNS DDoS attack, a regional failure, or even a global failure of an entire cloud provider.

Sometimes, though, a bug in routing or ETA calculations makes it to production undetected, and we need the ability to roll that code back or fix it as fast as possible; velocity is key. That’s easier said than done in a multi-cloud environment. For example, our realtime routing service spans over 80 autoscaling groups/managed instance groups across two cloud providers and multiple regions.

This is where continuous delivery helps out. Specifically, we use Spinnaker, an open source continuous delivery platform for releasing software changes with high velocity and confidence. Spinnaker has handled 100% of our production deployments for the past year, regardless of target platform.

Spinnaker and continuous delivery FTW

Large-scale cloud deployments can be complex. Fortunately, Spinnaker abstracts many of the particulars of each cloud provider, allowing our developers to focus on making Waze even better for our users instead of concerning themselves with the low-level details of multiple cloud providers. All the while, we’re able to maintain important continuous delivery concepts like canaries, immutable infrastructure and fast rollbacks.

Jenkins is a first-class citizen in Spinnaker, so once code is committed to git and Jenkins builds a package, that same package triggers the main deployment pipeline for that particular microservice. That pipeline bakes the package into an immutable machine image on multiple cloud providers in parallel and continues to run any automated testing stages. The deployment proceeds to staging using blue/green deployment strategy, and finally to production without having to get deep into the details of each platform. Note that Spinnaker automatically resolves the correct image IDs per cloud provider, so that each cloud’s deployment processes happen automatically and correctly without the need for any configuration.
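The blue/green cutover at the heart of that pipeline can be summarized in a short sketch. This is illustrative only; the function names are hypothetical stand-ins, not Spinnaker's API, which automates these steps per cloud provider:

```python
# Illustrative blue/green deployment logic. All callables are
# hypothetical: create_green launches the new immutable image group,
# healthy runs checks against it, shift_traffic cuts the load balancer
# over, and destroy tears down a bad build.

def blue_green_deploy(create_green, healthy, shift_traffic, destroy):
    """Stand up the new ("green") group beside the live ("blue") one,
    shift traffic only if green is healthy, and keep blue around
    briefly as the fast-rollback target."""
    green = create_green()
    if not healthy(green):
        destroy(green)        # bad build never takes traffic
        return False
    shift_traffic(to=green)   # cut over; blue remains for rollback
    return True
```

Rolling back is then just shifting traffic to the still-warm blue group, which is what makes the strategy fast and low-risk.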

Example of a multi-cloud pipeline
Multi-Cloud blue/green deployment using Spinnaker
Thanks to Spinnaker, developers can focus on developing business logic, rather than becoming experts on each cloud platform. Teams can track the lifecycle of a deployment using several notification mechanisms including email, Slack and SMS, allowing them to coordinate handoffs between developer and QA teams. Support for tools like canary analysis and fast rollbacks allows developers to make informed decisions about the state of their deployment. Since Spinnaker is designed from the ground up to be a self-service tool, developers can do all of this with minimal involvement from the Ops team.

At Waze, we strive to release new features and bug fixes to our users as quickly as possible. Spinnaker allows us to do just that while also helping keep multi-cloud deployments and rollbacks simple, easy and reliable.

If this sounds like something your organization would benefit from, check out Spinnaker. And don't miss our talks at GCP Next 2017.

Why we migrated Orbitera to managed Kubernetes on Google Cloud Platform



Google Cloud acquired Orbitera last fall, and already we’ve hit a key milestone: completing the migration of the multi-cloud commerce platform from AWS to Google Container Engine, our managed Kubernetes environment.

This is a real testament to the maturity of Kubernetes and Container Engine, which, in less than two years, has emerged as the most popular managed container orchestration service on the market, powering Google Cloud’s own services as well as customers such as Niantic with Pokémon GO.

Founded in 2012, Orbitera originally built its service on an extensive array of AWS services including EC2, S3, RDS and RedShift, and weighed the pros and cons of migrating to various Google Cloud compute platforms: Google Compute Engine, Google App Engine or Container Engine. Ultimately, migrating to Container Engine presented Orbitera with the best opportunity to modernize its service and move from a monolithic architecture running on virtual machines to a microservices-based architecture based on containers.

The resulting service allows Orbitera ISV partners to easily sell their software in the cloud, by managing the testing, provisioning and ongoing billing management of their applications across an array of public cloud providers.

Running on Container Engine is proving to be a DevOps management boon for the Orbitera service. With Container Engine, Orbitera now runs in multiple zones with three replicas in each zone, for better availability. Container Engine also makes it easy to scale component microservices up and down on demand as well as deploy to new regions or zones. Operators roll out new application builds regularly to individual Kubernetes pods, and easily roll them back if the new build behaves unexpectedly.
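The multi-zone, multi-replica setup and rolling updates described above map directly onto standard Kubernetes objects. As an illustration only (the service name and image path are hypothetical; Orbitera's actual manifests are not public), a Deployment along these lines might look like:

```yaml
# Illustrative Deployment: three replicas, updated one pod at a time.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-api          # hypothetical microservice name
spec:
  replicas: 3                # three replicas for availability
  selector:
    matchLabels:
      app: billing-api
  strategy:
    type: RollingUpdate      # roll out new builds pod by pod
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: billing-api
    spec:
      containers:
      - name: billing-api
        image: gcr.io/example-project/billing-api:v2   # hypothetical image
```

Rolling back a misbehaving build is then a single command: `kubectl rollout undo deployment/billing-api`.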

Meanwhile, as a native Google Cloud service, Container Engine integrates with multiple Google Cloud Platform (GCP) services such as Cloud SQL, load balancers and Google Stackdriver, whose alerts and dashboards allow Orbitera to respond to issues more quickly and more efficiently.

Going forward, running on Container Engine positions Orbitera to take advantage of more modular microservices and APIs and rapidly build out new services and capabilities for customers. That’s a win for enterprises that depend on Orbitera to provide a seamless, consistent and easy way to manage software running across multiple clouds, including Amazon Web Services and Microsoft Azure.

Stay tuned for a technical deep dive from the engineering team that performed the migration, where they'll share lessons learned, and tips and tricks for performing a successful migration to Container Engine yourself. In the meantime, if you have questions about Container Engine, you can find us on our Slack channel.

Introducing Cloud Spanner: a global database service for mission-critical applications





Today, we’re excited to announce the public beta for Cloud Spanner, a globally distributed relational database service that lets customers have their cake and eat it too: ACID transactions and SQL semantics, without giving up horizontal scaling and high availability.

When building cloud applications, database administrators and developers have been forced to choose between traditional databases that guarantee transactional consistency, or NoSQL databases that offer simple, horizontal scaling and data distribution. Cloud Spanner breaks that dichotomy, offering both of these critical capabilities in a single, fully managed service.
“Cloud Spanner presents tremendous value for our customers who are retailers, manufacturers and wholesale distributors around the world. With its ease of provisioning and scalability, it will accelerate our ability to bring cloud-based omni-channel supply chain solutions to our users around the world,” said John Sarvari, Group Vice President of Technology, JDA.
JDA, a retail and supply chain software leader, has used Google Cloud Platform (GCP) as the basis of its new application development and delivery since 2015 and was an early user of Cloud Spanner. The company saw its potential to handle the explosion of data coming from new information sources such as IoT, while providing the consistency and high availability needed when using this data.

Cloud Spanner rounds out our portfolio of database services on GCP, alongside Cloud SQL, Cloud Datastore and Cloud Bigtable.

As a managed service, Cloud Spanner provides key benefits to DBAs:
  • Focus on your application logic instead of spending valuable time managing hardware and software
  • Scale out your RDBMS solutions without complex sharding or clustering
  • Gain horizontal scaling without migration from relational to NoSQL databases
  • Maintain high availability and protect against disaster without needing to engineer a complex replication and failover infrastructure
  • Gain integrated security with data-layer encryption, identity and access management and audit logging

With Cloud Spanner, your database scales up and down as needed, and you'll only pay for what you use. It features a simple pricing model that charges for compute node-hours, actual storage consumption (no pre-provisioning) and external network access.
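To make the pricing model concrete, here is a back-of-the-envelope estimator. The rates below are placeholders for illustration only, not actual Cloud Spanner prices; check the current GCP pricing pages for real numbers.

```python
# Back-of-the-envelope Cloud Spanner cost estimator.
# All rates are hypothetical placeholders, not real prices.
NODE_HOUR_RATE = 0.90   # $/node/hour (hypothetical)
STORAGE_RATE = 0.30     # $/GB/month (hypothetical)
EGRESS_RATE = 0.12      # $/GB of external network egress (hypothetical)

def monthly_cost(nodes, storage_gb, egress_gb, hours=730):
    """Cost = compute node-hours + actual storage used + external egress."""
    compute = nodes * hours * NODE_HOUR_RATE
    storage = storage_gb * STORAGE_RATE   # billed on actual use, no pre-provisioning
    egress = egress_gb * EGRESS_RATE
    return compute + storage + egress

print(monthly_cost(nodes=3, storage_gb=500, egress_gb=100))  # -> 2133.0
```

The key point the model captures is that storage is metered on actual consumption rather than pre-provisioned capacity.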

Cloud Spanner keeps application development simple by supporting standard tools and languages in a familiar relational database environment. It’s ideal for operational workloads supported by traditional relational databases, including inventory management, financial transactions and control systems, that are outgrowing those systems. It supports distributed transactions, schemas and DDL statements, SQL queries and JDBC drivers and offers client libraries for the most popular languages, including Java, Go, Python and Node.js.
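For example, Cloud Spanner schemas are defined with familiar SQL DDL. The hypothetical inventory schema below also shows table interleaving, a Spanner feature that physically co-locates child rows with their parent row:

```sql
-- Hypothetical inventory schema (table and column names are illustrative).
CREATE TABLE Warehouses (
  WarehouseId INT64 NOT NULL,
  Name        STRING(256)
) PRIMARY KEY (WarehouseId);

-- Inventory rows are stored with their parent warehouse row.
CREATE TABLE Inventory (
  WarehouseId INT64 NOT NULL,
  ItemId      INT64 NOT NULL,
  Quantity    INT64
) PRIMARY KEY (WarehouseId, ItemId),
  INTERLEAVE IN PARENT Warehouses ON DELETE CASCADE;
```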

More Cloud Spanner customers share feedback

Quizlet, an online learning tool that supports more than 20 million students and teachers each month, uses MySQL as its primary database; database performance and stability are critical to the business. But with users growing at roughly 50% a year, Quizlet has been forced to scale its database many times to handle the load. By splitting tables into their own databases (vertical sharding) and moving query load to replicas, it’s been able to increase query capacity, but this technique is quickly reaching its limits, as the tables themselves are outgrowing what a single MySQL shard can support. In its search for a more scalable architecture, Quizlet discovered Cloud Spanner, which will allow it to easily scale its relational database and simplify its application:
“Based on our experience and performance testing, Cloud Spanner is the most compelling option we’ve seen to power a high-scale relational query workload. It has the performance and scalability of a NoSQL database, but can execute SQL so it’s a viable alternative to sharded MySQL. It’s an impressive technology and could dramatically simplify how we manage our databases,” said Peter Bakkum, Platform Lead, Quizlet.
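The interim architecture Quizlet describes, with each table in its own shard and reads spread across that shard's replicas, can be sketched with a toy query router. All database and table names here are hypothetical:

```python
# Toy model of vertical sharding plus read replicas: each table lives in
# its own shard (database), writes go to the shard primary, and reads are
# round-robined over the shard's replicas. All names are hypothetical.
import itertools

SHARDS = {
    "terms":   {"primary": "db-terms-0",   "replicas": ["db-terms-1", "db-terms-2"]},
    "studies": {"primary": "db-studies-0", "replicas": ["db-studies-1"]},
}

# One round-robin iterator per table's replica set.
_rr = {table: itertools.cycle(cfg["replicas"]) for table, cfg in SHARDS.items()}

def route(table, write=False):
    """Writes go to the shard primary; reads round-robin over replicas."""
    cfg = SHARDS[table]
    return cfg["primary"] if write else next(_rr[table])

print(route("terms", write=True))   # -> db-terms-0
print(route("terms"))               # -> db-terms-1
```

The limit is structural: routing spreads load across tables and replicas, but a single hot table still cannot outgrow its one primary shard, which is exactly where Cloud Spanner's horizontal scaling takes over.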

The history of Spanner

For decades, developers have relied on traditional databases with a relational data model and SQL semantics to build applications that meet business needs. Meanwhile, NoSQL solutions emerged that were great for scale and fast, efficient data processing, but they didn’t meet the need for strong consistency. Customers still grapple with this sub-optimal choice today; back in 2007, it led a team of systems researchers and engineers at Google to set out to develop a globally distributed database that could bridge the gap. In 2012, we published the Spanner research paper that described many of these innovations. The result was a database that offers the best of both worlds:

(click to enlarge)

Remarkably, Cloud Spanner achieves this combination of features without violating the CAP Theorem. To understand how, read this post by the author of the CAP Theorem and Google Vice President of Infrastructure, Eric Brewer.

Over the years, we’ve battle-tested Spanner internally with hundreds of different applications and petabytes of data across data centers around the world. At Google, Spanner supports tens of millions of queries per second and runs some of our most critical services, including AdWords and Google Play.

If you have a MySQL or PostgreSQL system that's bursting at the seams, or are struggling with hand-rolled transactions on top of an eventually-consistent database, Cloud Spanner could be the solution you're looking for. Visit the Cloud Spanner page to learn more and get started building applications on our next-generation database service.

Inside Cloud Spanner and the CAP Theorem



Building systems that manage globally distributed data, provide data consistency and are also highly available is really hard. The beauty of the cloud is that someone else can build that for you.

The CAP theorem says that a database can only have two of the three following desirable properties:

  • C: consistency, which implies a single value for shared data
  • A: 100% availability, for both reads and updates
  • P: tolerance to network partitions

This leads to three kinds of systems: CA, CP and AP, based on what letter you leave out. Designers are not entitled to two of the three, and many systems have zero or one of the properties.

For distributed systems over a “wide area,” it's generally viewed that partitions are inevitable, although not necessarily common. If you believe that partitions are inevitable, any distributed system must be prepared to forfeit either consistency (AP) or availability (CP), which is not a choice anyone wants to make. In fact, the original point of the CAP theorem was to get designers to take this tradeoff seriously. But there are two important caveats: First, you only need to forfeit consistency or availability during an actual partition, and even then there are many mitigations. Second, the actual theorem is about 100% availability; a more interesting discussion is about the tradeoffs involved to achieve realistic high availability.
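The tradeoff during a partition can be made concrete with a toy replicated register: a CP system answers only when it can reach a quorum of replicas, while an AP system always answers and may return stale data. This is an illustration of the theorem itself, not of how Spanner is implemented:

```python
# Toy CAP illustration: a register replicated across five nodes. During a
# partition, each side sees only some replicas. A CP system refuses
# requests without a majority; an AP system answers anyway, risking
# staleness.
TOTAL_REPLICAS = 5

def handle_read(reachable, last_seen_value, mode):
    has_quorum = reachable > TOTAL_REPLICAS // 2
    if mode == "CP":
        # Consistent: only answer when a quorum is reachable.
        return last_seen_value if has_quorum else "UNAVAILABLE"
    # "AP" -- available: always answer, possibly with stale data.
    return last_seen_value

# The network partitions into a 3-replica side and a 2-replica side.
print(handle_read(reachable=2, last_seen_value="v1", mode="CP"))  # -> UNAVAILABLE
print(handle_read(reachable=2, last_seen_value="v1", mode="AP"))  # -> v1 (possibly stale)
print(handle_read(reachable=3, last_seen_value="v2", mode="CP"))  # -> v2
```

The minority side of the partition is exactly where the choice bites: the CP system forfeits availability there, and the AP system forfeits consistency.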

Spanner joins Google Cloud

Today, Google is releasing Cloud Spanner for use by Google Cloud Platform (GCP) customers. Spanner is Google’s highly available, global SQL database. It manages replicated data at great scale, both in terms of size of data and volume of transactions. It assigns globally consistent real-time timestamps to every datum written to it, and clients can do globally consistent reads across the entire database without locking.
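Those globally consistent timestamps come from TrueTime, Google's globally synchronized clock (covered in depth in the white paper linked below). The toy sketch here imitates the core idea only: a clock that returns an uncertainty interval rather than a point, and a "commit wait" that pauses until a chosen timestamp is guaranteed to be in the past. The uncertainty bound and function names are illustrative, not Spanner's actual API:

```python
# Toy TrueTime-style clock with commit wait. EPSILON is an assumed
# clock-uncertainty bound; real TrueTime bounds vary over time.
import time

EPSILON = 0.007  # assumed uncertainty in seconds (~7 ms, hypothetical)

def tt_now():
    """Interval [earliest, latest] guaranteed to contain the true time."""
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)

def commit(value):
    """Pick a commit timestamp, then wait out the uncertainty so the
    timestamp is strictly in the past before the write becomes visible."""
    _, latest = tt_now()
    ts = latest
    while tt_now()[0] <= ts:        # wait until earliest-possible-now > ts
        time.sleep(EPSILON / 4)
    return ts, value

# Because of commit wait, real-time commit order matches timestamp order.
ts1, _ = commit("row-a")
ts2, _ = commit("row-b")
assert ts1 < ts2
```

This is why clients can read consistent global snapshots without locking: any transaction that finished before another started is guaranteed to carry a smaller timestamp.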

In terms of CAP, Spanner claims to be both consistent and highly available despite operating over a wide area, which many find surprising or even unlikely. The claim thus merits some discussion. Does this mean that Spanner is a CA system as defined by CAP? The short answer is “no” technically, but “yes” in effect and its users can and do assume CA.

The purist answer is “no” because partitions can happen and in fact have happened at Google, and during some partitions, Spanner chooses C and forfeits A. It is technically a CP system.

However, no system provides 100% availability, so the pragmatic question is whether or not Spanner delivers availability that is so high that most users don't worry about its outages. For example, given there are many sources of outages for an application, if Spanner is an insignificant contributor to its downtime, then users are correct to not worry about it.

In practice, we find that Spanner does meet this bar, with more than five 9s of availability (less than one failure in 10^5). Given this, the target for multi-region Cloud Spanner will be right at five 9s, as it has some additional new pieces that will be higher risk for a while.
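As a quick sanity check on what these numbers mean, "N nines" of availability translates directly into a yearly downtime budget:

```python
# Downtime budget implied by "N nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(nines):
    """Allowed downtime per year at 1 - 10^-nines availability."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability

for n in (3, 4, 5):
    print(f"{n} nines: {downtime_minutes(n):.2f} min/year")
```

Five 9s allows only about five minutes of downtime per year, which is why, for most applications, a dependency at that level stops being a meaningful contributor to overall outages.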

Inside Spanner 


The next question is, how is Spanner able to achieve this?

There are several factors, but the most important one is that Spanner runs on Google’s private network. Unlike most wide-area networks, and especially the public internet, Google controls the entire network and thus can ensure redundancy of hardware and paths, and can also control upgrades and operations in general. Fibers will still be cut, and equipment will fail, but the overall system remains quite robust.

It also took years of operational improvements to get to this point. For much of the last decade, Google has improved its redundancy, its fault containment and, above all, its processes for evolution. We found that the network contributed less than 10% of Spanner’s already rare outages.

Building systems that can manage data that spans the globe, provide data consistency and are also highly available is possible; it’s just really hard. The beauty of the cloud is that someone else can build that for you, and you can focus on innovation core to your service or application.

Next steps


For a significantly deeper dive into the details, see the white paper also released today. It covers Spanner, consistency and availability in depth (including new data). It also looks at the role played by Google’s TrueTime system, which provides a globally synchronized clock. We intend to release TrueTime for direct use by Cloud customers in the future.

Furthermore, look for the addition of new Cloud Spanner-related sessions at Google Cloud Next ‘17 in San Francisco next month. Register soon, because seats are limited.