
Exploring container security: Digging into Grafeas container image metadata



Editor’s note: This is the third in a series of blog posts on container security at Google.

The great thing about containers is how easy they are to create, modify and share. But that also raises the question of whether or not they're safe to deploy to production. One way to answer that is to track metadata about your container, for example, who worked on it, where it's stored, and whether it has any known vulnerabilities.

Last October, Google and several partners announced Grafeas, an open-source project that provides a structured metadata API for container images. Grafeas simplifies how you handle metadata, including security concerns like package vulnerability scan results, and keeps track of lots of different kinds of information:

  • "Build" metadata can be used to certify that a container image was built according to your build policies using a trusted builder with checked-in source code. 
  • "Package Vulnerability" metadata contains information from a vulnerability scanning service and lets you create policies on deploying vulnerable containers. 
  • "Image Basis" metadata contains information about the base image from which your container was derived as well as additional layers used in building the container image. 
  • "Package Manager" metadata indicates which packages were installed in the container image. 
  • "Deployment History" metadata allows you to track which resources were deployed and when.

Tracking Grafeas’ metadata lets you know what containers are in your environment and also lets you enforce restrictions on which containers get deployed. You should review Grafeas metadata for compliance with your policies before deploying. This is where one last type of Grafeas metadata comes in—the mighty "Attestation." If the metadata you pull from Grafeas is consistent with your policy, you can then write an attestation that certifies that your image complies with your deployment requirements. Then, using the Kubernetes Admission Controller, you can check for the expected attestations, and block deployment when they aren’t present. The snippet below sketches, in Go, how you might iterate over an image’s vulnerability occurrences and create an attestation when nothing fixable turns up:

occurrencesResponse := grafeas.ListOccurrencesResponse{}

imageUrl := "https://gcr.io/myproject/name@sha256hash"
hasFixableVulnerability := false
// Iterate through occurrences looking for vulnerability occurrences for this image
for _, occurrence := range occurrencesResponse.Occurrences {
   resourceUrl := occurrence.ResourceUrl
   // Skip occurrences for other images, or occurrences that aren't vulnerability results
   if imageUrl != resourceUrl || occurrence.Kind != grafeas.Note_PACKAGE_VULNERABILITY {
      continue
   }
   details := occurrence.GetVulnerabilityDetails()
   issues := details.GetPackageIssue()
   // If this vulnerability is fixable, we should not insert an attestation
   for _, i := range issues {
      if i.FixedLocation != nil {
         hasFixableVulnerability = true
      }
   }
}
// If there are no fixable vulnerabilities, create an attestation occurrence
if !hasFixableVulnerability {
   occ := &grafeas.Occurrence{
      ResourceUrl: imageUrl,
      NoteName:    "projects/vulnerability-certification/notes/vuln-free",
      Details: &grafeas.Occurrence_AttestationDetails{
         AttestationDetails: &grafeas.AttestationAuthority_AttestationDetails{
            Signature: &grafeas.AttestationAuthority_AttestationDetails_PgpSignedAttestation{
               PgpSignedAttestation: &grafeas.PgpSignedAttestation{
                  Signature:   "MySignature",
                  ContentType: grafeas.PgpSignedAttestation_SIMPLE_SIGNING_JSON,
                  KeyId:       "ThisIsMyKeyId",
               },
            },
         },
      },
   }
   // The request can then be sent to the Grafeas API to record the attestation
   req := grafeas.CreateOccurrenceRequest{Occurrence: occ}
   _ = req
}

In fact, you can use Grafeas to enforce all kinds of security policies. Check out the tutorial by Kelsey Hightower on how to use Grafeas to allow only container images signed by a specific private key. But it doesn’t stop there. For example, you can write policies to block container images with vulnerabilities from being deployed, ensure that all your deployed images are built from a base image sanctioned by your security team, or require images to go through your build pipeline. The possibilities are endless!

How Shopify uses Grafeas to manage metadata for its 500k container images


Big changes in infrastructure, like moving to Kubernetes for container orchestration, can create opportunities to change the way you govern your workloads. Here at Shopify, we use Grafeas as the central place to store all the metadata about our deploys, helping us build six thousand containers a day, and making our lives easier.

Grafeas lets us answer fundamental questions like:

When was the container created? What code runs inside it? Who created it? Where is the running container located? Is the container vulnerable to a known exploit and do we need to pull it out of production to replace it?

Shopify’s Continuous Integration/Deployment (CI/CD) pipeline supplies information about who built the code, when it was built, what the commit hash is, and whether or not tests passed. We store this data in the “build” metadata in Grafeas, leaving a digital paper trail that can be audited to track changes.

After build, we test the container for known vulnerabilities. Since that list of vulnerabilities can change at any time, we check our images at regular intervals. The information in Grafeas’ “package vulnerability” metadata field is updated at each check and allows us to make informed decisions about when to pull an image out of production and deploy a new secure one.

Next up, we’re working on how to track information about deploys in the deployment history metadata field, so that we know when, where and by whom a container was deployed, as well as its current status. We will use that information to create a “bill of materials” of what’s running in our environment.

The metadata in Grafeas gives us an answer to all of our questions around deploys. We also use it to restrict the deployment of containers. During the creation of a workload, the admission controller makes its decision by querying Grafeas for the container’s metadata: if it cannot find anything that violates predefined policies (for example, the existence of known vulnerabilities), it allows the workload to be created.

Using Grafeas in production gives us a 360-degree view of our deploys so we can protect our infrastructure and maintain a history of what’s going on in our cloud.

Get involved in Grafeas 


We’re currently working toward a beta version of the Grafeas API spec. Grafeas is also working with other projects, such as the in-toto team, to allow in-toto to create attestations based on Grafeas metadata.

If you’d like to contribute, Grafeas is always looking for more contributors as we grow this community. Join us!

Cloud-native architecture with serverless microservices — the Smart Parking story

By Brian Granatir, SmartCloud Engineering Team Lead, Smart Parking

Editor’s note: When it comes to microservices, a lot of developers ask why they would want to manage many services rather than a single, big, monolithic application. Serverless frameworks make doing microservices much easier because they remove a lot of the service management overhead around scaling, updating and reliability. In this first installment of a three-part series, Google Cloud Platform customer Smart Parking gives us their take on event-driven architecture using serverless microservices on GCP. Then read on for parts two and three, where they walk through how they built a high-volume, real-world smart city platform on GCP—with code samples!

Part 1


When "the cloud" first appeared, it was met with skepticism and doubt. “Why would anyone pay for virtual servers?” developers asked. “How do you control your environment?” You can't blame us; we're engineers. We resist change (I still use vim), and believe that proof is always better than a promise. But, eventually we found out that this "cloud thing" made our lives easier. Resistance was futile.

The same resistance to change happened with git (“svn isn't broken”) and docker (“it's just VMs”). Not surprising — for every success story, for every promise of a simpler developer life, there are a hundred failures (Ruby on Rails: shots fired). You can't blame any developer for being skeptical when some random "bloke with a blog" says they found the next great thing.

But here I am, telling you that serverless is the next great thing. Am I just a bloke? Is this a blog? HECK YES! So why should you read on (other than for the jokes, obviously)? Because you might learn a thing or two about serverless computing and how it can be used to solve non-trivial problems.

We developed this enthusiasm for serverless computing building a smart city platform. What is a smart city platform, you ask? Imagine you connect all the devices and events that occur in a city to improve resource efficiency and quality of citizen life. The platform detects a surge in parking events and changes traffic lights to help the flow of cars leaving downtown. It identifies a severe rainstorm and turns on street lights in the middle of the day. Public trash cans alert sanitation when they are full. Nathan Fillion is spotted on 12th street and it swarm-texts local citizens. A smart city is a vast network of distributed devices (IoT City 2000!) streaming data and methods to easily correlate these events and react to them. In other words, it's a hard problem with a massive scale—perfect for serverless computing!
In-ground vehicle detection sensor


What the heck is serverless?


But before we go into a lot more depth about the platform, let’s define our terms. In this first article, we give a brief overview of the main concepts used in our smart city platform and how they match up with GCP services. Then, in the second article, we'll dive deeper into the architecture and how each specific challenge was met using various different serverless solutions. Finally, we'll get extra technical and look at some code snippets and how you can maximize functionality and efficiency. In the meantime, if you have any questions or suggestions, please don't hesitate to leave a comment or email me directly ([email protected]).

First up, domain-driven design (DDD). What is domain-driven design? It's a methodology for designing software with an emphasis on expertise and language. In other words, we recognize that engineering, of any kind, is a human endeavour whose success relies largely on proper communication. A tiny miscommunication [wait, we're using inches?] can lead to massive delays or customer dissatisfaction. Developing a domain helps assure that everyone (not just the development team) is using the same terminology.

A quick example: imagine you’re working on a job board. A client calls customer support because a job they just posted never appeared online. The support representative contacts the development team to investigate. Unfortunately, they reach your manager, who promptly tells the team, “Hey! There’s an issue with a job in our system.” But the code base refers to job listings as "postings" and the daily database tasks as "jobs." So naturally, you look at the database "jobs" and discover that last night’s materialization failed. You restart the task and let support know that the issue should be resolved soon. Sadly, the customer’s issue wasn’t addressed, because you never looked at the "postings" error.

Of course, there are more potent examples of when language differences between various aspects of the business can lead to problems. Consider the words "output," "yield," and "spike" for software monitoring a nuclear reactor. Or, consider "sympathy" and "miss" for systems used by Klingons [hint: they don’t have words for both]. Is it too extreme to say domain-driven design could save your life? Ask a Klingon if he’ll miss you!

In some ways, domain-driven design is what this article is doing right now! We're establishing a strong, ubiquitous vocabulary for this series so everyone is on the same page. In part two, we'll apply DDD to our example smart city service.

Next, let's discuss event-driven architecture. Event-driven architecture (EDA) means constructing your system as a series of commands and/or events. A user submits an online form to make a purchase: that's a command. The items in stock are reserved: that's an event. A confirmation is sent to the user: that's an event. The concept is very simple. Everything in our system is either a command or an event. Commands lead to events and events may lead to new commands and so on.

Of course, defining events at the start of a project requires a good understanding of the domain. This is why it's common to see DDD and EDA together. That said, the elegance of a true event-driven architecture can be difficult to implement. If everything is a command or an event, where are the objects? I got that customer order, but where do I store the "order" and how do I access it? We'll investigate this in much more detail in part two of this series. For now, all you need to understand is that our example smart city project will be defining everything as commands and events!

Now, onto serverless. Serverless computing simply means using existing, auto-scaling cloud services to achieve system behaviours. In other words, I don't manage any servers or docker containers. I don't set up networks or manage operation (ops). I merely provide the serverless solution my recipe and it handles creation of any needed assets and performs the required computational process. A perfect example is Google BigQuery. If you haven't tried it out, please go do that. It's beyond cool (some kids may even say it's "dank": whatever that means). For many of us, it’s our first chance to interact with a nearly-infinite global compute service. We're talking about running SQL queries against terabytes of data in seconds! Seriously, if you can't appreciate what BigQuery does, then you better turn in your nerd card right now (mine says "I code in Jawa").

Why does serverless computing matter? It matters because I hate being woken up at night because something broke on production! Because it lets us auto-scale properly (instead of the cheating we all did to save money *cough* docker *cough*). Because it works wonderfully with event-driven architectures and microservices, as we'll see throughout parts 2 & 3 of this series.

Finally, what are microservices? Microservices is a philosophy, a methodology, and a swear word. Basically, it means building our system in the same way we try to write code, where each component does one thing and one thing only. No side effects. Easy to scale. Easy to test. Easier said than done. Where a traditional service may be one database with separate read/write modules, an equivalent microservices architecture may consist of sixteen databases each with individual access management.

Microservices are a lot like eating your vegetables. We all know it sounds right, but doing it consistently is a challenge. In fact, before serverless computing and the miracles of Google's cloud queuing and database services, trying to get microservices 100% right was nearly impossible (especially for a small team on a budget). However, as we'll see throughout this series, serverless computing has made microservices an easy (and affordable) reality. Potatoes are now vegetables!

With these four concepts, we’ve built a serverless sandwich, where:
  • Domain-driven design is the peanut butter, defining the language and context of our project 
  • Event-driven architecture is the jelly, limiting the scope of our domain to events 
  • Microservices are the bread, limiting our architecture to tiny components that react to single event streams
And finally, serverless is having someone else make the sandwich for you (and cutting off the crust), running components on auto-scaling, auto-maintained compute services.

As you may have guessed, we're going to have a microservice that reacts to every command and event in our architecture. Sounds crazy, but as you'll see, it's super simple, incredibly easy to maintain, and cheap. In other words, it's fun. Honestly, remember when coding was fun? Time to recapture that magic!

To repeat, serverless computing is the next big thing! It's the peanut butter and jelly sandwich of software development. It’s an uninterrupted night’s sleep. It's the reason I fell back in love with web services. We hope you’ll come back for part two where we take all these ideas and outline an actual architecture.

What we learned doing serverless — the Smart Parking story



Part 3 

You made it through all the fluff and huff! Welcome to the main event. Time for us to explore some key concepts in depth. Of course, we won't have time to cover everything. If you have any further questions, or recommendations for a follow-up (or a prequel . . . who does a prequel to a tech blog?), please don't hesitate to email me.

"Indexing" in Google Cloud Bigtable


In parts one & two, we mentioned Cloud Bigtable a lot. It's an incredible, serverless database with immense power. However, like all great software systems, it's designed to deal with a very specific set of problems. Therefore, it has constraints on how it can be used. There are no traditional indexes in Bigtable. You can't say "index the email column" and then query it later. Wait. No indexes? Sounds useless, right? Yet, this is the storage mechanism used by Google to run our life-depending sites: Gmail, YouTube, Google Maps, etc. But I can search in those. How do they do it without traditional indexes? I'm glad you asked!!

The answer has two parts: (1) using rowkeys and (2) data mitosis. Let's take a look at both. But, before we do that, let's make one very important point: Never assume you have expertise in anything just because you read a blog about it!!!

I know, it feels like reading this blog [with its overflowing abundance of awesome] might be the exception. Unfortunately, it's not. To master anything, you need to study the deepest parts of its implementation and practice. In other words, to master Bigtable you need to understand "what" it is and "why" it is. Fortunately, Bigtable implements the HBase API. This means you can learn heaps about Bigtable and this amazing data storage and access model by reading the plentiful documentation on HBase, and its sister project Hadoop. In fact, if you want to understand how to build any system for scale, you need to have at least a basic understanding of MapReduce and Hadoooooooooop (little known fact, "Hadoop" can be spelt with as many o's as desired; reduce it later).

If you just followed the concepts covered in this blog, you'd walk away with an incomplete and potentially dangerous view of Bigtable and what it can do. Bigtable will change your development life, so at least take it out to dinner a few times before you get down on one knee!

Ok, got the disclaimer out of the way, now onto rowkeys!

Rowkeys are the only form of indexing provided in Bigtable. A rowkey is the ID used to distinguish individual rows. For example, if I was storing a single row per user, I might have the rowkeys be the unique usernames. For example:
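(Purely illustrative usernames as rowkeys:)

  brian
  dax
  kiki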
We can then scan and select rows by using these keys. Sounds simple enough. However, we can make these rowkeys be compound indexes. That means that we carry multiple pieces of information within a single rowkey. For example, what if we had three kinds of users: admin, customer and employee. We can put this information in the rowkey. For example:
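(Again purely illustrative, with the user type as a prefix:)

  admin#brian
  customer#dax
  employee#kiki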


(Note: We're using # to delineate parts of our rowkey, but you can use any special character you want.)

Now we can query for user type easily. In other words, I can easily fetch all "admin" user rows by doing a prefix search (i.e., find all rows that start with "admin#"). We can get really fancy with our rowkeys too. For example, we can store user messages using something like:
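(One hypothetical layout, keyed by timestamp and then sender:)

  message#20180226T221500#brian
  message#20180226T221512#dax
  message#20180226T221547#brian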
However, we cannot search for the latest 10 messages by Brian using rowkeys. Also, there's no easy way to get a series of related messages in order. Maybe I need a unique conversation ID that I put at the start of each rowkey? Maybe.

Determining the right rowkeys is paramount to using Bigtable effectively. However, you'll need to watch out for hotspotting (a topic not covered in this blog post). Also, any Bigtablians out there will be very upset with me because my examples don't show column families. Yeah, Bigtable must have the best holidays, because everything is about families.

So, we can efficiently search our rows using rowkeys, but this may seem very limited. Who could design a single rowkey that covers every possible query? You can't. This is where the second major concept comes in: data mitosis.

What is data mitosis? It's replication of data into multiple tables that are optimized for specific queries. What? I'm replicating data just to overcome indexing limits? This is madness. NO! THIS. IS. SERVERLESS!

While it might sound insane, storage is cheap. In fact, storage is so cheap, we'd be naive not to abuse it. This means that we shouldn't be afraid to store our data as many times as we want, simply to improve overall access. Bigtable works efficiently with billions of rows. So go ahead and have billions of rows. Don't worry about capacity or maintaining a monstrous data cluster; Google does that for you. This is the power of serverless. I can do things that weren't possible before. I can take a single record and store it ten (or even a hundred) times just to make data sets optimized for specific usages (i.e., for specific queries).

To be honest, this is the true power of serverless. To quote myself, storage is magic!

So, if I needed to access all messages in my system for analytics, why not make another view of the same data:
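(For example, a second, hypothetical copy of the same messages, this time keyed by sender and then timestamp:)

  brian#20180226T221500
  brian#20180226T221547
  dax#20180226T221512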
Of course, data mitosis means you have insanely fast access to data, but it isn't without cost. You need to be careful in how you update data. Imagine the bookkeeping nightmare of trying to manage synchronized updates across dozens of data replicants. In most cases, the solution is never updating rows (only adding them). This is why event-driven architectures are ideal for Bigtable. That said, no database is perfect for all problems. That's why it's great that I can have SQL, NoSQL, and HBase databases all running for minimal costs (with no maintenance) using serverless! Why use only one database? Use them all!

Exporting to BigQuery


In the previous section we learned about the modern data storage model: store everything in the right database and store it multiple times. It sounds wonderful, but how do we run queries that transcend this eclectic set of data sources? The answer is. . . we cheat. BigQuery is cheating. I cannot think of any other way of describing the service. It's simply unfair. You know that room in your house (or maybe in your garage)? That place where you store EVERYTHING—all the stuff you never touch but don't want to toss out? Imagine if you had a service that could search through all the crap and instantly find what you're looking for. Wouldn't that be nice? That's BigQuery. If it existed IRL . . . it would save marriages. It's that good.

By using BigQuery, we can scale our searches across massive data sets and get results in seconds. Seriously. All we need to do is make our data accessible. Fortunately, BigQuery already has a bunch of onramps available (including pulling your existing data from Google Cloud Storage, CSVs, JSON, or Bigtable), but let's assume we need something custom. How do you do this? By streaming the data directly into BigQuery! Again, we're going to replicate our data into another place just for convenience. I would've never considered this until serverless made it cheap and easy.

In our architecture, this is almost too easy. We simply add a Cloud Function that listens to all our events and streams them into BigQuery. Just subscribe to the Pub/Sub topics and push. It’s so simple. Here's the code:

exports.exportStuffToBigQuery = function exportStuffToBigQuery( event, callback ) {
    // Parse the Pub/Sub message (a boilerplate helper), then stream the event into BigQuery
    return parseEvent(event)
    .then(( eventData ) => {
      const BigQuery = require('@google-cloud/bigquery')();
      return BigQuery.dataset('myData').table('stuff').insert(eventData);
    })
    .then(( ) => callback())
    .catch(( err ) => callback(err))  // surface errors so the platform can retry
  };

That's it! Did you think it was going to be a lot of code? These are Cloud Functions. They should be under 100 lines of code. In fact, they should be under 40. With a bit of boilerplate, we can make this one line of code:

exports.exportStuffToBigQuery = function exportStuffToBigQuery( event, callback ) {
    myFunction.run(event, callback, ( data ) => myFunction.sendOutput(data));
  };

Ok, but what is the boilerplate code? More on that in the next section. This section is short, as it should be. Honestly, getting data into BigQuery is easy. Google has provided a lot of input hooks and keeps adding more. Once you have the data in there (regardless of size), you can just run the standard SQL queries you all know and loathe (er, love). Up, up, down, down, left, right, left, right, B, A!


Cloud Function boilerplate


Cloud Functions use Node? Cloud Functions are JavaScript? Gag! Yes, that was my initial reaction. Now (9 months later), I never want to write anything else in my career. Why? Because Cloud Functions are simple. They are tiny. You don't need a big robust programming language when all you're doing is one thing. In fact, this is a case where less is more. Keep it simple! If your Cloud Function is too complex, break it apart.

Of course, there are a sequence of steps that we do in every Cloud Function:

  1) Parse trigger
  2) Do stuff
  3) Send output
  4) Issue callback
  5) Catch errors

The only thing we should be writing is step 2 (and sometimes step 5). This is where boilerplate code comes in. I like my code like I like my wine: DRY! [DRY = Don't Repeat Yourself, btw].

So write the code to parse your triggers and send your outputs once. There are more steps! The actual sequence of steps for a Cloud Function is:


  1) Filter unneeded events
  2) Log start
  3) Parse trigger
  4) Do stuff
  5) Send output(s)
  6) Issue retry for timeout errors
  7) Catch and log all fatal errors (no retry)
  8) Issue callback
  9) Do it all asynchronously
  10) Test above code
  11) Create environment resources (if needed)
  12) Deploy code

Ugh. So our simple Cloud Functions just became a giant list of steps. It sounds painful, but it can all be overcome with some boilerplate code and an understanding of how Cloud Functions work at a larger level.

How do we do this? By adding a common configuration for each Cloud Function that can be used to drive testing, deployment and common behaviour. All our Cloud Functions start with a block like this:

const options = {
    functionName: 'doStuff',
    trigger: 'stuff-commands',
    triggerType: 'pubsub',
    aggregateType: 'devices',
    aggregateSource: 'bigtable',
    targets: [
      { type: 'bigtable', name: 'stuff' },
      { type: 'pubsub', name: 'stuff-events'}
    ],
    filter: [ 'UpdateStuff' ]
  };

It may seem basic, but this understanding of Cloud Functions allows us to create a harness that can perform all of the above steps. We can deploy a Cloud Function if we know its trigger and its type. Since everything is inside GCP, we can easily create resources if we know our output targets and their types. We can perform efficient logging and track data through our system by knowing the start and end point (triggers and targets) for each function. The filter allows us to limit which events arriving in a Pub/Sub topic are handled.
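To make this concrete, here's a rough sketch of the kind of harness that a configuration block like this could drive. This isn't our actual library; the helper names (createHarness, parseTrigger and friends) are made up for illustration:

const PubSub = require('@google-cloud/pubsub')();

function createHarness( options ) {
    // Pub/Sub background functions deliver a base64-encoded payload
    const parseTrigger = ( event ) =>
      JSON.parse(Buffer.from(event.data.data, 'base64').toString());

    // Drop events whose subtype isn't in the configured filter list
    const isFiltered = ( fact ) =>
      options.filter && !options.filter.includes(fact.fact.subtype);

    return {
      run: ( event, callback, doStuff ) => {
        console.log(options.functionName + ': start');                 // log start
        return Promise.resolve(event)
          .then(parseTrigger)                                           // parse trigger
          .then(( fact ) => isFiltered(fact) ? null : doStuff(fact))    // filter, then do stuff
          .then(( ) => callback())                                      // issue callback
          .catch(( err ) => {
            console.error(options.functionName + ': ' + err);           // catch and log fatal errors
            callback(err);                                              // let the platform retry
          });
      },
      // Publish output to every Pub/Sub target named in the configuration
      sendOutput: ( data ) =>
        Promise.all(options.targets
          .filter(( t ) => t.type === 'pubsub')
          .map(( t ) => PubSub.topic(t.name).publish(data)))
    };
  }

Each real Cloud Function then supplies only its options block and its "do stuff" callback.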

So, what's the takeaway for this section? Make sure you understand Cloud Functions fully. See them as tiny connectors between a given input and target output (preferably only one). Use this to make boilerplate code. Each Cloud Function should contain a configuration and only the lines of code that make it unique. It may seem like a lot of work, but making a generic methodology for handling Cloud Functions will liberate you and your code. You'll get addicted and find yourself sheepishly saying, "Yeah, I kinda like JavaScript, and you know . . .Node" (imagine that!)

Testing


We can't end this blog without a quick talk on testing. Now let me be completely honest. I HATED testing for most of my career. I'm flawless, so why would I want to write tests? I know, I know . . . even a diamond needs to be polished every once in a while.

That said, now I love testing. Why? Because testing Cloud Functions is super easy. Seriously. Just use Ava and Sinon and "BAM". . . sorted. It really couldn't be simpler. In fact, I wouldn't mind writing another series of posts on just testing Cloud Functions (a blog on testing, who'd read that?).

Of course, you don't need to follow my example. Those amazing engineers at Google already have examples for almost every possible subsystem. Just take a look at their Node examples on GitHub for Cloud Functions: https://github.com/GoogleCloudPlatform/nodejs-docs-samples/tree/master/functions [hint: look in the test folders].

For many of you, this will be very familiar. What might be new is integration testing across microservices. Again, this could be an entire series of articles, but I can provide a few quick tips here.

First, use Google's emulators. They have them for just about everything (Pub/Sub, Datastore, Bigtable, Cloud Functions). Getting them set up is easy. Getting them to all work together isn't super simple, but not too hard. Again, we can leverage our Cloud Function configuration (seen in the previous section), to help drive emulation.

Second, use monitoring to help design integration testing. What is good monitoring if not a constant integration test? Think about how you would monitor your distributed microservices and how you'd look at various data points to look for slowness or errors. For example, I'd probably like to monitor the average time it takes for a single input to propagate across my architecture and send alerts if we slip beyond standard deviation. How do I do this? By having a common ID carried from the start to end of a process.

Take our architecture as an example. Everything is a chain of commands and events. Something like this:
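Roughly, using the smart city names from the rest of this series (and eliding a few steps):

  UpdateReadings (command) -> ReadingsUpdated (event) -> ... -> FrameUpdated (event) -> IssueTicket (command)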
If we have a single ID that flows through this chain, it'll be easy for us to monitor (and perform integration testing). This is why it's great to have a common parent for both commands and events. This is typically referred to as a "fact." So everything in our system is a "fact." The JSON might look something like this:

{
    fact: {
      id: "19fg-3fsf-gg49",
      type: "Command",
      subtype: "UpdateReadings"
    },
    readings: {}
  }

As we move through our chain of commands and events, we change the fact type and subtype, but never the ID. This means that we can log and monitor the flow of each of our initial inputs as it migrates through the system.
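For example, the ReadingsUpdated event produced in response to the command above might carry the same ID with a new type and subtype (the values are illustrative):

{
    fact: {
      id: "19fg-3fsf-gg49",
      type: "Event",
      subtype: "ReadingsUpdated"
    },
    readings: {}
  }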

Of course, as with all things monitoring (and integration testing), life isn't so simple. This stuff is hard. You simply cannot perfect your monitoring or integration testing. If you did, you would've solved the Halting Problem. Seriously, if I could give any one piece of advice to aspiring computer scientists, it would be to fully understand the Halting Problem and the CAP theorem.

Pitfalls of serverless


Serverless has no flaws! You think I'm joking. I'm not. The pitfalls in serverless have nothing to do with the services themselves. They all have to do with you. Yep. True serverless systems are extremely powerful and cost-efficient. The only problem: developers have a tendency to use these services wrong. They don't take the time to truly understand the design and rationale of the underlying technologies. Google uses these exact same services to run the world's largest and most performant web applications. Yet, I hear a lot of developers complaining that these services are just "too slow" or "missing a lot."

Frankly, you're wrong. Serverless is not generic. It isn't compute instances that let you install whatever you want. That’s not what we want! Serverless is a compute service that does a specific task. Now, those tasks may seem very generic (like databases or functions), but they're not. Each offering has a specific compute model in mind. Understanding that model is key to getting the maximum value.

So what is the biggest pitfall? Abuse. Serverless lets you do a LOT for very cheap. That means the mistakes are going to come from your design and implementation. With serverless, you have to really embrace the design process (more than ever). Boiling your problem down to its most fundamental elements will let you build a system that doesn't need to be replaced every three to four years. To get where we needed to be, my team rebuilt the entire kernel of our service three times in one year. This may seem like madness, and it was. We were our own worst enemy. We took old (and outdated) notions of software and web services and kept baking them into the new world. We didn't believe in serverless. We didn't embrace data mitosis. We resisted streaming. We didn't put data first. All mistakes. All because we didn’t fully understand the intent of our tools.

Now, we have an amazing platform (with a code base reduced by 80%) that will last for a very long time. It's optimized for monitoring and analytics, but we didn't even need to try. By embracing data and design, we got so much for free. It's actually really easy, if you get out of your own way.

Conclusion 


As development teams begin to transition to a world of IoT and serverless, they will encounter a unique set of challenges. The goal of this series was to provide an overview of recommended techniques and technologies used by one team to ship an IoT/serverless product. A quick summary of each part is as follows:

Part 1 - Getting the most out of serverless computing requires a cutting-edge approach to software design. With the ability to rapidly prototype and release software, it’s important to form a flexible architecture that can expand at the speed of inspiration. Sounds cheesy, but who doesn’t love cheese? Our team utilized domain-driven design (DDD) and event-driven architecture (EDA) to efficiently define a smart city platform. To implement this platform, we built microservices deployed on serverless compute services.

Biggest takeaway: serverless now makes event-driven architecture and microservices not only a reality, but almost a necessity. Viewing your system as a series of events will allow for resilient design and efficient expansion.

Part 2 - Implementation of an IoT architecture on serverless services is now easier than ever. On Google Cloud Platform (GCP), powerful serverless tools are available for:

  • IoT fleet management and security -> IoT Core 
  • Data streaming and windowing -> Dataflow 
  • High-throughput data storage -> Bigtable 
  • Easy transaction data storage -> Datastore 
  • Message distribution -> Pub/Sub 
  • Custom compute logic -> Cloud Functions 
  • On-demand analytics of disparate data sources -> BigQuery

Combining these services allows any development team to produce a robust, cost-efficient and extremely performant product. Our team uses all of these and was able to adopt each new service within a single one-week sprint.

Biggest takeaway: DevOps is dead. Serverless systems (with proper non-destructive, deterministic data management and testing) means that we’re just developers again! No calls at 2am because some server got stuck? Sign me up for serverless!

Part 3 - To be truly serverless, a service must offer a limited set of computational actions. In other words, to be truly auto-scaling and self-maintaining, the service can’t do everything. Understanding the intent and design of the serverless services you use will greatly improve the quality of your code. Take the time to understand the use-cases designed for the service so that you extract the most. Using a serverless offering incorrectly can lead to greatly reduced performance.

For example, Pub/Sub is designed to guarantee rapid, at-least-once delivery. This means messages may arrive multiple times or out-of-order. That may sound scary, but it’s not. Pub/Sub is used by Google to manage distribution of data for their services across the globe. They make it work. So can you. Hint: consider deterministic code. Hint, hint: If order is essential at time of data inception, use windowing (see Dataflow).

Biggest takeaway: Don’t try to use a hammer to clean your windows. Research serverless services and pick the ones that suit your problem best. In other words, not all serverless offerings are created equal. They may offer the same essential API, but the implementation and goals can be vastly different.

Finally, before we part, let me say, “Thank you.” Thanks for following through all my ramblings to reach this point. There's a lot of information, and I hope that it gives you a preview of the journey that lies ahead. We're entering a new era of web development. It's a landscape full of treasure, opportunity, dungeons and dragons. Serverless computing lets us discard the burden of DevOps and return to the adventure of pure coding. Remember when all you did was code (not maintenance calls at 2am)? It's time to get back there. I haven't felt this happy in my coding life in a long time. I want to share that feeling with all of you!

Please, send feedback, requests, and dogs (although, I already have 7). Software development is a never-ending story. Each chapter depends on the last. Serverless is just one more step on our shared quest for holodecks. Yeah, once we have holodecks, this party is over! But until then, code as one.

Implementing an event-driven architecture on serverless — the Smart Parking story



Part 2 


In this article, we’re going to explore how to build an event-driven architecture on serverless services to solve a complex, real-world problem. In this case, we’re building a smart city platform. An overview of the domain can be found in part one. If you haven’t read part one, please go take a look now. Initial reviews are in, and critics are saying “this may be the most brilliantly composed look at modern software development; where’s dinner?” (please note: in this case the "critics" are my dogs, Dax and Kiki).

Throughout this part, we’ll be slowly building up an architecture. In part three, we’ll dive deeper into some of the components and review some actual code. So let’s get to it. Where do we start? Obviously, with our input!

Zero step: defining our domain 


Before we begin, let’s define the domain of a smart city. As we learned in the previous post, defining the domain means establishing a clear language and terminology for referencing objects and processes in our software system. Of course, creating this design is typically more methodical, iterative, and far more in-depth. It would take a genius to just snap and put an accurate domain at the end of a blog post (it’s strange that I never learned to snap my fingers, right?).

Our basic flow for this project is a network of distributed IoT (Internet of Things) devices that send periodic readings, which are used to define the frames of larger correlated events throughout a city.
  • Sensor - electronic device that's capable of capturing and reporting one or more specialized readings 
  • Gateway - an internet-connected hub that's capable of receiving readings from one or more sensors and sending these packages to our smart cloud platform 
  • Device - the logical combination of a sensor and its reporting gateway (used to define a clear split between the onramp and processing) 
  • Readings - key-value pairs (e.g., { temperature: 35, battery: "low" } ) sent by sensors 
  • UpdateReadings - the command to update readings for a specific device 
  • ReadingsUpdated - the event that occurs in our system when new readings are received from a device (the response to an UpdateReadings command) 
  • Frame - a collection of correlated / collocated events (typically ReadingsUpdated) used to drive business logic through temporal reasoning [lots more on this later] 
  • Device Report - an analytic view of devices and their health metrics (typically used by technicians) 
  • Event Report - an analytic view of frames (typically used by business managers)

If we connect all of these parts together in a diagram, and add some serverless glue, we get a nice overview of our architecture:
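In rough text form, the diagram boils down to something like this (the serverless glue fills in over the next few steps):

  Sensors -> Gateways -> IoT Core -> Pub/Sub -> UpdateReadings (commands)
    -> ReadingsUpdated (events) -> Frames -> Device Reports / Event Reports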

Of course, there's a fair bit of missing glue in the above diagram. For example, how do we take an UpdateReadings command and get it into Bigtable? This is where my favorite serverless service comes into play: Cloud Functions! How do we install devices? Cloud Functions. How do we create organizations? Cloud Functions. How do we access data through an API? Cloud Functions. How do we conquer the world? Cloud Functions. Yep, I’m in love!

Alright, now we have our baseline, let’s spend the rest of this post exploring just how we go about implementing each part of our architecture and dataflows.

First step: inbound data


Our smart city platform is nothing more than a distributed network of internet-connected (IoT) devices. These devices are composed of one or more sensors that capture readings and their designated gateways that help package this data and send it through to our cloud.

For example, we may have an in-ground sensor used to detect a parked car. This sensor reports IR and magnetic readings that are transferred through RF (radio frequencies) to a nearby gateway. Another example is a smart trash can that monitors capacity and broadcasts when the bin is full.

The challenge of IoT-based systems has always been collecting data, updating in-field devices, and keeping everything secure. We could write an entire series of articles on how to deal with these challenges. In fact, the burden of this task is the reason we haven’t seen many sophisticated, generic IoT platforms. But not anymore! The problem has been solved for us by those wonderful engineers at Google. Cloud IoT Core is a serverless service offered by Google Cloud Platform (GCP) that helps you skip all the annoying steps. It’s like jumping on top of the bricks!

Wait . . . does anyone get that reference anymore? Mario Brothers. The video game for the NES. You could jump on top of the ceiling of bricks to reach a secret pipe that let you skip a bunch of levels. It was a pipe because you were a plumber . . .  fighting a turtle dragon to save a princess. And you could throw fireballs by eating flowers. Trust me, it made perfect sense!

Anyway! Cloud IoT Core is the secret passage that lets you skip a bunch of levels and get to the good stuff. It scales automatically and is simple to use. Seriously, don’t spend any time managing your devices and securing your streams. Let Google do that for you.

So, sensors are observing life in our city and streaming this data to IoT Core. Where does it end up after IoT Core? In Cloud Pub/Sub, Google’s serverless queuing service. Think of it as a globally distributed subscription queue with guaranteed delivery. The result: our vast network of data streams has been converted to a series of queues that our services can subscribe to. This is our inbound pipeline. It scales nearly infinitely and requires no operations support. Think about that. We haven’t even written any code yet and we already have an incredibly robust architecture. Trust me, it took my team only a week to move our existing device onramp over to IoT Core—it’s that straightforward. And how many problems have we had? How many calls at 3 AM to fix the inbound data? Zero. They should call it opsless rather than serverless!

Anyway, we got our data streaming in. So far, our architecture looks like this:
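In rough text form:

  Sensors -> Gateways -> IoT Core -> Pub/Sub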
While we’re exploring a smart city platform made from IoT devices, you can use this pipeline with almost any architecture. Just replace the IoT Core box with your onboarding service and output to Pub/Sub. If you still want that joy of serverless (and no calls at 3 AM), then consider using Google Cloud Dataflow as your onramp!

What is Dataflow? It's a serverless implementation of a Hadoop-like pipeline used for the transformation and enriching of streaming or batch data. Sounds really fancy, and it actually is. If you want to know just how fancy, grab any data engineer and ask for their war stories on setting up and maintaining a Hadoop cluster (it might take a while; bring popcorn). In our architecture, it can be used to onramp data from an external source, to help with efficient formation of aggregates (i.e., to MapReduce a large number of events), or to help with windowing for streaming data. This is huge. If you know anything about streaming data, then you’ll know the value of a powerful, flexible windowing service.

Ok, now that we got streaming data, let’s do something with it!

Second step: normalizing streaming data

How is a trash can like a street lamp? How is a parking sensor like a barometer? How is a phaser like a lightsaber? These are questions about normalization. We have a lot of streaming data, but how do we find a common way of correlating it all?

For IoT this is a complex topic and more information can be found in this whitepaper. Of course, the only whitepaper in most developers' lives comes on a roll. So here is a quick summary:

How do we normalize the data streaming from our distributed devices? By converting them all to geolocated events. If we know the time and location of a sensor reading, we can start to colocate and correlate events that can lead to action. In other words, we use location and time to help us build a common reference point for everything going on in our city.

Fortunately, many (if not all) devices will already need some form of decoding / translation. For example, consider our in-ground parking sensor. Since it's transmitting data over radio frequencies, it must optimize and encode data. Decoding could happen in the gateway, but we prefer a gateway to contain no knowledge of the devices it services. It should just act as the doorway to the world wide web (for all the Generation Z folks out there, that’s what the "www." in URLs stands for).

Ideally, devices would all natively speak "smart city" and no decoding or normalization would be needed. Until then, we still need to create this step. Fortunately, it is super simple with Cloud Functions.

Cloud Functions is a serverless compute offering from Google. It allows us to run a chunk of code whenever a trigger occurs. We simply supply the recipe and identify the trigger and Google handles all the scaling and compute resource allocation. In other words, all I need to do is write the 20-50 lines of code that makes my service unique and never worry about ops. Pretty sweet, huh?

So, what’s our trigger? A Pub/Sub topic. What’s our code? Something like this:

function decodeParkingReadings( triggerInput ) {
    // Parse the Pub/Sub message, decode the device payload, wrap it in an
    // UpdateReadings command, and publish it to the device commands topic
    return parsePubSubMessage(triggerInput)
    .then(decode)
    .then(convertToCommand)
    .then(( command ) => PubSub.topic('DeviceCommands').publish(command))
  }

If you’re not familiar with promises and async coding in JavaScript, the above code simply does the following:
  1. Parse the message from Pub/Sub 
  2. When this is done, decode the payload byte string sent by the sensor 
  3. When this is done, wrap the decoded readings with our normalized UpdateReadings command data struct 
  4. When this is done, publish the normalized command to the device commands Pub/Sub topic 
Of course, you’ll need to write the code for the "decode" and "convertToCommand" functions. If there's no timestamp provided by the device, then it would need to be added in one of these two steps. We’ll get more in-depth into code examples in part three.
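As a rough sketch, convertToCommand might do little more than wrap the decoded readings in the command shape (the exact field names and the uuid helper are assumptions, not a fixed schema):

const uuid = require('uuid/v4');

function convertToCommand( decodedReadings ) {
    return {
      fact: {
        id: uuid(),                // a fresh ID that follows this data through the system
        type: 'Command',
        subtype: 'UpdateReadings',
        // fall back to "now" if the device didn't supply a timestamp
        timestamp: decodedReadings.timestamp || new Date().toISOString()
      },
      readings: decodedReadings
    };
  }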

So, in summary, the second step is to normalize all our streams by converting them into commands. In this case, all sensors are sending in a command to UpdateReadings for their associated device. Why didn’t we just create the event? Why bother making a command? Remember, this is an event-driven architecture. This means that events can only be created as a result of a command. Is it nitpicky? Very. But is it necessary? Yes. By not breaking the command -> event -> command chain, we make a system that's easy to expand and test. Without it, you can easily get lost trying to track data through the system (yes, a lot more on tracking data flows later).

So our architecture now looks like this:
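In rough text form:

  Sensors -> Gateways -> IoT Core -> Pub/Sub (raw readings)
    -> Cloud Functions (decode / normalize) -> Pub/Sub (UpdateReadings commands)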
Data streams coming into our platform are decoded using bespoke Cloud Functions that output a normalized, timestamped command. So far, we’ve only had to write about 30 - 40 lines of code, and the best part . . . we’re almost halfway complete with our entire platform.

Now that we have commands, we move onto the real magic. . . storage. Wait, storage is magic?

Third step: storage and indexing events


Now that we've converted all our inbound data into a sequence of commands, we’re 100% into event-driven architecture. This means that now we need to address the challenges of this paradigm. What makes event-driven architecture so great? It makes sense and is super easy to extend. What makes event-driven architecture painful? Doing it right has been a pain. Why? Because you only have commands and events in your system. If you want something more meaningful you need to aggregate these events. What does that mean? Let’s consider a simple example.

Let’s say you’ve got an event-driven architecture for a website that sells t-shirts. The orders come in as commands from a user-submitted web form. Updates also come in as commands. On the backend, we store only the events. So consider the following event sequence for a single online order:

1 - (Order Created) 
     orderNumber: 123foo
     items: [ item: redShirt, size: XL, quantity: 2 ]
     shippingAddress: 123 Bar Lane
2 - (Address Changed)
     orderNumber: 123foo
     shippingAddress: 456 Infinite Loop
3 - (Quantity Changed)
      orderNumber: 123foo
      items: [ item: redShirt, size: XL, quantity: 1 ]

You cannot get an accurate view of the current order by looking at only one event. If you only looked at #2 (Address Changed), you wouldn’t know the item quantities. If you only looked at #3 (Quantity Changed), you wouldn’t have the address.

To get an accurate view, you need to "replay" all the events for the order. In event-driven architecture, this process is often referred to as "hydrating." Alternatively, you can maintain an aggregate view of the order (the current state of the order) in the database and update it whenever a new command arrives. Both of these methods are correct. In fact, many event-driven architectures use both hydration and aggregates.
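As a tiny sketch of hydration (assuming each event is a plain object like the ones listed above), replaying is just a fold over the events:

// Replay (hydrate) an order's events into its current state.
// Later events simply overwrite the fields they carry.
function hydrateOrder( events ) {
    return events.reduce(( order, event ) => Object.assign({}, order, event), {});
  }

const currentOrder = hydrateOrder([
    { orderNumber: '123foo',
      items: [ { item: 'redShirt', size: 'XL', quantity: 2 } ],
      shippingAddress: '123 Bar Lane' },
    { orderNumber: '123foo', shippingAddress: '456 Infinite Loop' },
    { orderNumber: '123foo',
      items: [ { item: 'redShirt', size: 'XL', quantity: 1 } ] }
  ]);
// currentOrder now shows quantity 1, shipping to 456 Infinite Loop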

Unfortunately, implementing consistent hydration and/or aggregation isn’t easy. There are libraries and even databases designed to handle this, but that was before the wondrous powers of serverless computing. Enter Google Cloud Bigtable and BigQuery.

Bigtable is the database service that Google uses to index and search the entire internet. Let that sink in. When you do a Google search, it's Bigtable that gets you the data you need in a blink of an eye. What does that mean? Unmatched power! We’re talking about a database optimized to handle billions of rows with millions of columns. Why is that so important? Because it lets us do event-driven architecture right!

For every command we receive, we create a corresponding event that we store in Bigtable.

Wow.
This blogger bloke just told us that we store data in a database.
Genius.

Why thank you! But honestly, it's the aspects of the database that matter. This isn’t just any database. Bigtable lets us optimize without optimizing. What does that mean? We can store everything and anything and access it with speed. We don’t need to write code to optimize our storage or build clever abstractions. We just store the data and retrieve it so fast that we can aggregate and interpret at access.

Huh?

Let me give you an example that might help explain the sheer joy of having a database that you cannot outrun.

These days, processors are fast. So fast that the slowest part of computing is loading data from the disk (even SSD). Therefore, most of the world’s most performant systems use aggressive compression when interacting with storage. This means that we'll compress all writes going to disk to reduce read-time. Why? Because we have so much excess processing power, it's faster to decompress data rather than read more bytes from disk. Go back 20 years and tell developers that we would "waste time" by compressing everything going to disk and they would tell you that you’re mad. You’ll never get good performance if you have to decompress everything coming off the disk!

In fact, most platforms go a step further and use the excessive processing power to also encrypt all data going to disk. Google does this. That’s why all your data is secure at rest in their cloud. Everything written to disk is compressed AND encrypted. That goes for their data too, so you know this isn’t impacting performance.

Bigtable is very much the same thing for web services. Querying data is so fast that we can perform processing post-query. Previously, we would optimize our data models and index the heck out of our tables just to reduce query time inside the database. When the database can query across billions of rows in 6 ms, that changes everything. Now we just store and store and store and process later.

This is why storing your data in a database is amazing, if it's the right database!

So how do we deal with aggregation and hydration? We don’t. At least not initially. We simply accept our command, load any auxiliary lookups needed (often the sensor / device won’t be smart enough to know its owner or location), and then save it to Bigtable. Again, we’re getting a lot of power with very little code and effort.

But wait, there’s more! I also mentioned BigQuery. This is the service that allows us to run complex SQL queries across massive datasets (even datasets stored in a non-SQL database). In other words, now that I’ve stored all this data, how do I get meaning from it? You could write a custom service, or just use BigQuery. It will let you perform queries and aggregations across terabytes of data in seconds.

So yes, for most of you, this could very well be the end of the architecture:
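In rough text form:

  Devices -> IoT Core -> Pub/Sub -> Cloud Functions (normalize) -> Pub/Sub (commands)
    -> Cloud Functions (store) -> Bigtable -> BigQuery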
Seriously, that’s it. You could build any modern web service (social media, music streaming, email) using this architecture. It will scale infinitely and have a max response time for queries of around 3 - 4 seconds. You would only need to write the initial normalization and the required SQL queries for BigQuery. If you wanted even faster responses, you could target queries directly at Bigtable, but that requires a little more effort.

This is why storage is magic. Pick the right storage and you can literally build your entire web service in two weeks and never worry about scale, ops, or performance. Bigtable is my Patronus!!

Now we could stop here. Literally. We could make a meaningful and useful city platform with just this architecture. We’d be able to make meaningful reports and views on events happening throughout our city. However, we want more! Our goal is to make a smart city, one that automatically reacts to events.

Fourth step: temporal reasoning


Ok, this is where things start to get a little more complex. We have events—a lot of events. They are stored, timestamped and geolocated. We can query this data easily and efficiently. However, we want to make our system react.

This is where temporal reasoning comes in. The fundamental idea: we build business rules and insights based on the temporal relationship between grouped events. These collections of related events are commonly referred to as "intervals" or "frames." For example, if my interval is a lecture, the lecture itself can contain many smaller events:
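(A purely illustrative set:)

  Lecture (interval)
    - lecture started
    - slides advanced
    - question asked
    - coffee break taken
    - lecture ended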


We can also take the lecture in the context of an even larger interval, such as a work day. And of course, these days can be held in the context of an even larger frame, like a work week.

Once we've built these frames (these collections of events), we can start asking meaningful questions. For example, "Has the average temperature been above the danger threshold for more than 5 minutes?", "Was maintenance scheduled before the spike in traffic?", "Did the vehicle depart before the parking time limit?"

For many of you, this process may sound familiar. This approach of applying business rules to streaming data has many similarities to a Complex Event Processing (CEP) service. In fact, a wonderful implementation of a CEP that uses temporal reasoning is the Drools Fusion module. Amazing stuff! Why not just use a CEP? Unfortunately, business rule management systems (BRMS) and CEPs haven't yet fully embraced the smaller, bite-size methodologies of microservices. Most of these systems require a single monolithic instance that demands absolute data access. What we need is a distributed collection of rules that can be easily referenced and applied by a distributed set of autoscaling workers.

Fortunately, writing the rules and applying the logic is easy once you have the grouped events. Creating and extending these intervals is the tricky part. For our smart city platform, this means having modules that define specific types of intervals and then add any and all related events.

For example, consider a parking module. This would take the readings from sensors that detect the arrival and departure of a vehicle and create a larger parking interval. An example of a parking interval might be:
We simply build a microservice that listens to ReadingsUpdated events and manages creation and extension of parking intervals. Then, we're free to make a service that reacts to the FrameUpdated events and runs temporal reasoning rules to see if a new command should be created. For example, "If there's no departure 60 minutes after arrival, broadcast an IssueTicket command."
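As a hypothetical sketch (not the project's actual code), the temporal-reasoning side of that rule could be a Pub/Sub-triggered Cloud Function in Go. The topic names, message fields and the exact rule threshold are assumptions:

package parking

import (
    "context"
    "encoding/json"
    "time"

    "cloud.google.com/go/pubsub"
)

// PubSubMessage is the payload of a Pub/Sub-triggered Cloud Function.
type PubSubMessage struct {
    Data []byte `json:"data"`
}

// ParkingFrame is the interval built by the ReadingsUpdated listener.
type ParkingFrame struct {
    SpaceID   string     `json:"space_id"`
    Arrival   time.Time  `json:"arrival"`
    Departure *time.Time `json:"departure,omitempty"`
    Paid      bool       `json:"paid"`
}

// OnFrameUpdated applies the rule: no departure and no payment 60 minutes
// after arrival -> broadcast an IssueTicket command.
func OnFrameUpdated(ctx context.Context, m PubSubMessage) error {
    var frame ParkingFrame
    if err := json.Unmarshal(m.Data, &frame); err != nil {
        return err
    }
    overstayed := frame.Departure == nil &&
        !frame.Paid &&
        time.Since(frame.Arrival) > 60*time.Minute
    if !overstayed {
        return nil
    }

    client, err := pubsub.NewClient(ctx, "my-project") // assumed project ID
    if err != nil {
        return err
    }
    defer client.Close()

    cmd, _ := json.Marshal(map[string]string{
        "command":  "IssueTicket",
        "space_id": frame.SpaceID,
    })
    res := client.Topic("commands").Publish(ctx, &pubsub.Message{Data: cmd}) // assumed topic
    _, err = res.Get(ctx)
    return err
}

More involved rules keep the same function shape; only the predicate in the middle changes.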

Of course, we may need to correlate events into an interval that fall outside the scope of the initial sensor. In the parking example, we see "payment made." Payment is clearly not collected by the parking sensor. How do we manage this? By creating links between the interval and all known associated entities. Then, whenever a new event enters our system, we can add it to all related intervals (if the associated producer or its assigned groups are related). This sounds complex, but it's actually rather easy to maintain a complex set of linkages in Bigtable. Google does it at significant scale (like indexing the entire internet). Of course, this would be a lot simpler if someone provided a serverless graph database!

So, without diving too much into the complexities, we have the final piece of our architecture. We collect events into common groups (intervals), maintain a list of links to related entities (for updates), and apply simple temporal reasoning (business rules) to drive system behavior. Again, this would be a nightmare without using an event-driven architecture built on serverless computing. In fact, once we get a serverless graph database and distributed BRMS, we've solved the internet (spoiler alert: we'll all change into data engineers, AI trainers and UI guys).

[BTW, for more information, please consult the work of the godfather of computer-based temporal reasoning, James F. Allen. More specifically, his papers "An Interval-Based Representation of Temporal Knowledge" and "Maintaining Knowledge about Temporal Intervals"]

Fifth step: the extras


While everything sounds easy, there are a few details I may have glossed over. I hope you found some! You’re an engineer, of course you did. Sorry, but this part is a little technical. You can skip it if you like! Or, just email me your doubts and I’ll reply!

A quick example: how do I query only a subset of devices? We have everything stored in Bigtable, but how do I look at only one group? For example, what if I only wanted to look at devices or events downtown?

This is where grouping comes in. It’s actually really easy with Bigtable. Since Bigtable is NoSQL, we can have sparse columns: a family called "groups" can hold any custom set of column qualifiers per row. In other words, we let an event belong to any number of groups. We look up the device's current groups when the command is received and add the appropriate columns. This will hopefully make more sense when we go deeper in part three.
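Here's a small, hedged sketch of that idea with the Bigtable Go client, reusing the hypothetical readings table from earlier. We tag an event row with one sparse column per group, then read back only the rows tagged "downtown":

package main

import (
    "context"
    "fmt"
    "log"

    "cloud.google.com/go/bigtable"
)

// tagAndQueryGroups tags one event row with sparse "groups" columns, then reads
// back only the rows that carry the "downtown" qualifier.
func tagAndQueryGroups(ctx context.Context, tbl *bigtable.Table, rowKey string) error {
    mut := bigtable.NewMutation()
    now := bigtable.Now()
    // The column qualifier itself carries the group name; the cell value is unimportant.
    mut.Set("groups", "downtown", now, []byte("1"))
    mut.Set("groups", "parking", now, []byte("1"))
    if err := tbl.Apply(ctx, rowKey, mut); err != nil {
        return err
    }

    // Read only the events tagged "downtown".
    filter := bigtable.ChainFilters(
        bigtable.FamilyFilter("groups"),
        bigtable.ColumnFilter("downtown"),
    )
    return tbl.ReadRows(ctx, bigtable.InfiniteRange(""), func(row bigtable.Row) bool {
        fmt.Println("downtown event:", row.Key())
        return true // keep reading
    }, bigtable.RowFilter(filter))
}

func main() {
    ctx := context.Background()
    client, err := bigtable.NewClient(ctx, "my-project", "my-instance") // assumed IDs
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()
    if err := tagAndQueryGroups(ctx, client.Open("readings"), "sensor-42#1234"); err != nil {
        log.Fatal(err)
    }
}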

Another area worth a passing mention is extension and testing. Why are serverless and event-driven architectures so easy to test and extend? The ease of testing comes from the microservices. Each component does one thing and does it well. It accepts either a command or an event and produces a simple output. For example, each of our event Pub/Subs has a Cloud Function that simply takes the events and stores them in Google Cloud Storage for archival purposes. This function is only 20 lines of code (mostly boilerplate) and has absolutely no impact on the performance of other parts of the system. Why no impact? It's serverless, meaning that it autoscales only for its needs. Also, thanks to the Pub/Sub queues, each microservice receives a non-destructive copy of the input (i.e., it gets its own copy of the message without putting a burden on any other part of our architecture).
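For a sense of scale, an archival function along those lines might look roughly like this in Go (the bucket name and object-naming scheme are assumptions, and the original function may differ):

package archive

import (
    "context"
    "fmt"
    "time"

    "cloud.google.com/go/storage"
)

// PubSubMessage is the payload of a Pub/Sub-triggered Cloud Function.
type PubSubMessage struct {
    Data []byte `json:"data"`
}

// Archive writes every event it receives to a Cloud Storage bucket, keyed by time.
func Archive(ctx context.Context, m PubSubMessage) error {
    client, err := storage.NewClient(ctx)
    if err != nil {
        return err
    }
    defer client.Close()

    object := fmt.Sprintf("events/%s.json", time.Now().UTC().Format(time.RFC3339Nano))
    w := client.Bucket("my-event-archive").Object(object).NewWriter(ctx) // assumed bucket
    if _, err := w.Write(m.Data); err != nil {
        return err
    }
    return w.Close()
}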

This zero impact is also why extension of our architecture is easy. If we want to build an entirely new subsystem, we simply branch off one of the Pub/Subs. This means a developer can rebuild the entire system if they want with zero impact and zero downtime for the existing system. I've done this [transitioned an entire architecture from Datastore to Bigtable1], and it's liberating. Finally, we can rebuild and refactor our services without having to toss out the core of our architecture—the events. In fact, since the heart of our system is events published through serverless queues, we can branch our system just like many developers branch their code in modern version control systems (i.e., Git). We simply create new ways to react to commands and events. This is perfect for introducing new team members. These noobs [technical term for a new starter] can branch off a Pub/Sub and deploy code to the production environment on their first day with zero risk of disrupting the existing system. That's powerful stuff! No-fear coding? Some dreams do come true.

BUT—and this is a big one (fortunately, I like big buts)—what about integration testing? Building and testing microservices is easy. Hosting them on serverless is easy. But how do we monitor them and, more importantly, how do we perform integration testing on this chain of independent functions? Fortunately, that's what Part Three is for. We'll cover this all in great detail there.

Conclusion


In this post, we went deep into how we can make an event-driven architecture work on serverless through the context of a smart city platform. Phew. That was a lot. Hope it all made sense (if not, drop me an email or leave a comment). In summary, modern serverless cloud services allow us to easily build powerful systems. By leveraging autoscaling storage, compute and queuing services, we can make a system that outpaces any demand and provides world-class performance. Furthermore, these systems (if designed correctly) can be easy to create, maintain and extend. Once you go serverless, you'll never go back! Why? Because it's just fun!

In the next part, we'll go even deeper and look at the code required to make all this stuff work.


A little more context on the refactor, for those who care to know. Google Datastore is a brilliant and extremely cost-efficient database. It is NoSQL like Bigtable but offers traditional-style indexing. For most teams, Datastore will be a magic bullet (solving all your scalability and throughput needs). It’s also ridiculously easy to use. However, as your data sets (especially for streaming) start to grow, you’ll find that the raw power of Bigtable cannot be denied. In fact, Datastore is built on Bigtable. Still, for most of us, Datastore will be everything we could want in a database (fast, easy and cheap, with infanite2 scaling).

Did I put a footnote in a footnote? Yes. Does that make it a toenote? Definitely. Is ‘infanite’ a word? Sort of. In·fa·nite (adjective) - practically infinite. Google’s serverless offerings are infanite, meaning that you’ll never hit the limit until your service goes galactic.

How to automatically scan Cloud Storage buckets for sensitive data: Taking charge of your security



Security in the cloud is often a matter of identifying—and sticking to—some simple best practices. A few months ago, we discussed some steps you can take to harden the security of your Cloud Storage buckets. We covered how to set up proper access permissions and provided tips on using tools like the Data Loss Prevention API to monitor your buckets for sensitive data. Today, let’s talk about how to automate data classification using the DLP API and Cloud Functions, Google Cloud Platform’s event-driven serverless compute platform that makes it easy for you to integrate and extend cloud services with code.

Imagine you need to regularly share data with a partner outside of your company, and this data cannot contain any sensitive elements such as Personally Identifiable Information (PII). You could just create a bucket, upload data to it, and grant access to your partner, but what if someone uploads the wrong file or doesn’t know that they aren’t supposed to upload PII? With the DLP API and Cloud Functions, you can automatically scan this data before it’s uploaded to the shared storage bucket.

Setting this up is easy: Simply create three buckets—one in which to upload data, one to share and one for any sensitive data that gets flagged. Then:

  1. Configure access appropriately so that relevant users can put data in the “upload” bucket. 
  2. Write a Cloud Function triggered by an upload that scans the data using the DLP API. 
  3. Based on any DLP findings, automatically move data into the share bucket or into a restricted bucket for further review.

To get you started, here’s a tutorial with detailed instructions including the Cloud Functions script. You can get it up and running in just a few minutes from the Cloud Console or via Cloud Shell. You can then easily modify the script for your environment, and add more advanced actions such as sending notification emails, creating a redacted copy or triggering approval workflows.
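As a rough sketch of what such a function could look like (the tutorial's script is the canonical version), here's a hedged Go example that inspects an uploaded object with the DLP API and routes it to a share or quarantine bucket. The bucket names, info types and the plain-text assumption are all illustrative:

package classify

import (
    "context"
    "io/ioutil"

    dlp "cloud.google.com/go/dlp/apiv2"
    "cloud.google.com/go/storage"
    dlppb "google.golang.org/genproto/googleapis/privacy/dlp/v2"
)

// GCSEvent is the payload of a Cloud Storage-triggered Cloud Function.
type GCSEvent struct {
    Bucket string `json:"bucket"`
    Name   string `json:"name"`
}

// ClassifyUpload scans the uploaded object and moves it to the appropriate bucket.
func ClassifyUpload(ctx context.Context, e GCSEvent) error {
    gcs, err := storage.NewClient(ctx)
    if err != nil {
        return err
    }
    defer gcs.Close()

    src := gcs.Bucket(e.Bucket).Object(e.Name)
    r, err := src.NewReader(ctx)
    if err != nil {
        return err
    }
    data, err := ioutil.ReadAll(r) // sketch only: assumes small, text-like objects
    r.Close()
    if err != nil {
        return err
    }

    // Ask the DLP API whether the content contains the info types we care about.
    dlpClient, err := dlp.NewClient(ctx)
    if err != nil {
        return err
    }
    defer dlpClient.Close()

    resp, err := dlpClient.InspectContent(ctx, &dlppb.InspectContentRequest{
        Parent: "projects/my-project", // assumed project ID
        InspectConfig: &dlppb.InspectConfig{
            InfoTypes: []*dlppb.InfoType{
                {Name: "EMAIL_ADDRESS"},
                {Name: "US_SOCIAL_SECURITY_NUMBER"},
            },
            MinLikelihood: dlppb.Likelihood_POSSIBLE,
        },
        Item: &dlppb.ContentItem{
            DataItem: &dlppb.ContentItem_Value{Value: string(data)},
        },
    })
    if err != nil {
        return err
    }

    // Route: sensitive findings go to quarantine, everything else to the share bucket.
    dest := "my-share-bucket" // assumed
    if len(resp.GetResult().GetFindings()) > 0 {
        dest = "my-quarantine-bucket" // assumed
    }
    dst := gcs.Bucket(dest).Object(e.Name)
    if _, err := dst.CopierFrom(src).Run(ctx); err != nil {
        return err
    }
    return src.Delete(ctx)
}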

We hope we’ve shown you some proactive steps you can take to prevent sensitive data from getting into the wrong hands. To learn more, check out the documentation for the DLP API and Cloud Functions.

Best practices for securing your Google Cloud databases



If information is gold, the database is a treasure chest. Web applications store their most valuable data in a database, and lots of sites would cease to exist if their data were stolen or deleted. This post aims to give you a series of best practices to help protect and defend the databases you host on Google Cloud Platform (GCP).

Database security starts before the first record is ever stored. You must consider the impact on security as you design the hosting environment. From firewall rules to logging, there's a lot to think about both inside and outside the database.

First considerations


When it comes to database security, one of the fundamental things to consider is whether to deploy your own database servers or to use one of Google's managed storage services. This decision should be influenced heavily by your existing architecture, staff skills, policies and need for specialized functionality.

This post is not intended to sell you specific GCP products, but absent any overwhelming reasons to host your own database, we recommend using a managed version thereof. Managed database services are designed to work at Google scale, with all of the benefits of our security model. Organizations seeking compliance with PCI, SOX, HIPAA, GDPR and other regimes will appreciate the significant reduction in effort that comes with a shared responsibility model. And even if these rules and regulations don't apply to your organization, I recommend following the PCI SAQ A (payment card industry self-assessment questionnaire type A) as a baseline set of best practices.

Access controls


You should limit access to the database itself as much as possible. Self-hosted databases should have VPC firewall rules locked down to only allow ingress from and egress to authorized hosts. All ports and endpoints not specifically required should be blocked. If possible, ensure changes to the firewall are logged and alerts are configured for unexpected changes. This happens automatically for firewall changes in GCP. Tools like Forseti Security can monitor and manage security configurations for both Google managed services and custom databases hosted on Google Compute Engine instances.
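As one hedged illustration of such a locked-down ingress rule, here's a sketch using the Compute Engine API from Go; the project, network, tags and port are placeholders, and you could achieve the same result with gcloud or Terraform:

package main

import (
    "context"
    "log"

    compute "google.golang.org/api/compute/v1"
)

func main() {
    ctx := context.Background()
    svc, err := compute.NewService(ctx)
    if err != nil {
        log.Fatal(err)
    }

    rule := &compute.Firewall{
        Name:      "allow-postgres-from-app-only", // assumed rule name
        Network:   "global/networks/default",
        Direction: "INGRESS",
        // Only the application tier may reach the database port; everything else
        // stays blocked by the implied deny-ingress rule.
        Allowed: []*compute.FirewallAllowed{
            {IPProtocol: "tcp", Ports: []string{"5432"}},
        },
        SourceTags: []string{"app-server"},
        TargetTags: []string{"db-server"},
    }

    op, err := svc.Firewalls.Insert("my-project", rule).Context(ctx).Do()
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("firewall rule requested: %s", op.Name)
}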

As you prepare to launch your database, you should also consider the environment in which it operates. Service accounts streamline authorization to Google databases using automatically rotating keys, and you can manage key rotation for self-hosted databases in GCP using Cloud Key Management Service (KMS).


Data security


Always keep your data retention policy in mind as you implement your schema. Sensitive data that you have no use for is a liability and should be archived or pruned. Many compliance regulations provide specific guidance (HIPAA, PCI, GDPR, SOX) to identify that data. You may find it helpful to operate under the pessimistic security model in which you assume your application will be cracked and your database will be exfiltrated. This can help clarify some of the decisions you need to make regarding retention, encryption at rest, etc.

Should the worst happen and your database is compromised, you should receive alerts about unexpected behavior such as spikes in egress traffic. Your organization may also benefit from using "canary" data—specially crafted information that should never be seen in the real world by your application under normal circumstances. Your application should be designed to detect a canary account logging in or canary values transiting the network. If found, your application should send you alerts and take immediate action to stem the possible compromise. In a way, canary data is similar to retail store anti-theft tags. These security devices have known detection characteristics and can be hidden inside a product. A security-conscious retailer will set up sensors at their store exits to detect unauthorized removal of inventory.

Of course, you should develop and test a disaster recovery plan. Having a copy of the database protects you from some failures, but it won't protect you if your data has been altered or deleted. A good disaster recovery plan will cover you in the event of data loss, hardware issues, network availability and any other disaster you might expect. And as always, you must regularly test and monitor the backup system to ensure reliable disaster recovery.

Configuration


If your database was deployed with a default login, you should make it a top priority to change or disable that account. Further, if any of your database accounts are password-protected, make sure those passwords are long and complex; don’t use simple or empty passwords under any circumstances. If you're able to use Cloud KMS, that should be your first choice. Beyond that, be sure to develop a schedule for credential rotation and define criteria for out-of-cycle rekeying.

Regardless of which method you use for authentication, you should have different credentials for read, write and admin-level privileges. Even if an application performs both read and write operations, separate credentials can limit the damage caused by bad code or unauthorized access.

Everyone who needs to access the database should have their own private credentials. Create service accounts for each discrete application with only the permissions required for that service. Cloud Identity and Access Management is a powerful tool for managing user privileges in GCP; generic administrator accounts should be avoided as they mask the identity of the user. User credentials should restrict rights to the minimum required to perform their duties. For example, a user that creates ad-hoc reports should not be able to alter schema. Consider using views, stored procedures and granular permissions to further restrict access to only what a user needs to know and further mitigate SQL injection vulnerabilities.

Logging is a critical part of any application. Databases should produce logs for all key events, especially login attempts and administrative actions. These logs should be aggregated in an immutable logging service hosted apart from the database. Whether you're using Stackdriver or some other service, credentials and read access to the logs should be completely separate from the database credentials.


Whenever possible, you should implement a monitor or proxy that can detect and block brute force login attempts. If you’re using Stackdriver, you could set up an alerting policy that calls a Cloud Function webhook, which keeps track of recent attempts and creates a firewall rule to block potential abuse.

The database should run as an application-specific user, not root or admin. Any host files should be secured with appropriate file permissions and ownership to prevent unauthorized execution or alteration. POSIX compliant operating systems offer chown and chmod to set file permissions, and Windows servers offer several tools as well. Tools such as Ubuntu's AppArmor go even further to confine applications to a limited set of resources.


Application considerations


When designing an application, an important best practice is to only employ encrypted connections to the database, which eliminates the possibility of an inadvertent leak of data or credentials via network sniffing. Cloud SQL users can do this using Google's encrypting SQL proxy. Some encryption methods also allow your application to authenticate the database host to reduce the threat of impersonation or man-in-the-middle attacks.
If your application deals with particularly sensitive information, you should consider whether you actually need to retain all of its data in the first place. In some cases, handling this data is governed by a compliance regime and the decision is made for you. Even so, additional controls may be prudent to ensure data security. You may want to encrypt some data at the application level in addition to automatic at-rest encryption. You might reduce other data to a hash if all you need to know is whether the same data is presented again in the future. If using your data to train machine learning models, consider reading this article on managing sensitive data in machine learning.
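To illustrate the hashing idea mentioned above, here's a tiny, self-contained Go sketch using an HMAC so you can recognize a repeated value later without retaining the raw data (in practice the key would come from Cloud KMS or a secret manager, not from code):

package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// fingerprint returns a keyed hash of value; equal inputs produce equal fingerprints.
func fingerprint(key, value []byte) string {
    mac := hmac.New(sha256.New, key)
    mac.Write(value)
    return hex.EncodeToString(mac.Sum(nil))
}

func main() {
    key := []byte("load-me-from-kms") // placeholder: never hard-code the real key
    stored := fingerprint(key, []byte("4111 1111 1111 1111"))

    // Later: has this value been seen before? Compare fingerprints, not raw data.
    seenAgain := fingerprint(key, []byte("4111 1111 1111 1111")) == stored
    fmt.Println("seen before:", seenAgain)
}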

Application security can be used to enhance database security, but must not be used in place of database security. Safeguards such as sanitization must be in place for any input sent to the database, whether it’s data for storage or parameters for querying. All application code should be peer-reviewed. Security scans for common vulnerabilities, including SQL injection and XSS, should be automated and run frequently.

The computers of anyone with access rights to the database should be subject to an organizational security policy. Some of the most audacious security breaches in history were due to malware, lax updates or mishandled portable data storage devices. Once a workstation is infected, every other system it touches is suspect. This also applies to printers, copiers and any other connected device.

Do not allow unsanitized production data to be used in development or test environments under any circumstances. This one policy will not only increase database security, but will all but eliminate the possibility of a non-production environment inadvertently emailing customers, charging accounts, changing states, etc.

Self-hosted database concerns


While Google's shared responsibility model allows managed database users to relieve themselves of some security concerns, we can't offer equally comprehensive controls for databases that you host yourself. When self-hosting, it's incumbent on the database administrator to ensure every attack vector is secured.

If you’re running your own database, make sure the service is running on its own host(s), with no other significant application functions allowed on it. The database should certainly not share a logical host with a web host or other web-accessible services. If this isn’t possible, the service and databases should reside in a restricted network shared and accessible by a minimum number of other hosts. A logical host is the lowest level computer or virtual computer in which an application itself is running. In GCP, this may be a container or virtual machine instance, while on-premises, logical hosts can be physical servers, VMs or containers.
A common use case for running your own database is to replicate across a hybrid cloud. For example, a hybrid cloud database deployment may have a master and one or more replicas in the cloud, with a replica on-premises. In that event, don’t connect the database to the corporate LAN or other broadly available networks. Similarly, on-premises hardware should be physically secure so it can't be stolen or tampered with. A proper physical security regime employs physical access controls, automated physical access logs and remote video monitoring.

Self-hosted databases also require your regular attention, to make sure they’ve been updated. Craft a software update policy and set up regular alerts for out-of-date packages. Consider using a system management (Ubuntu, Red Hat) or configuration management tool that lets you easily perform actions in bulk across all your instances, monitor and upgrade package versions and configure new instances. Be sure to monitor for out-of-cycle changes to key parts of the filesystem such as directories that contain configuration and executable files. 

Several compliance regimes recommend an intrusion detection system or IDS. The basic function of an IDS is to monitor and react to unauthorized system access. There are many products available that run either on each individual server or on a dedicated gateway host. Your choice in IDS may be influenced by several factors unique to your organization or application. An IDS may also be able to serve as a monitor for canary data, a security tactic described above.

All databases have specific controls that you must adjust to harden the service. You should always start with articles written by your database maker for software-specific advice on hardening the server. Hardening guides for several popular databases are linked in the further reading section below.
The underlying operating system should also be hardened as much as possible, and all applications that are not critical for database function should be disabled. You can achieve further isolation by sandboxing or containerizing the database. Use articles written by your OS maker for variant-specific advice on how to harden the platform. Guides for the most common operating systems available in Compute Engine are linked in the further reading section below.

Organizational security 


Staff policies that enforce security are an important but often overlooked part of IT security. It's a very nuanced and deep topic, but here are a few general tips that will aid in securing your database:

All staff with access to sensitive data should be considered for a criminal background check. Insist on strict adherence to a policy of eliminating or locking user accounts immediately upon transfer or termination. Human account password policies should follow the 2017 NIST Digital Identity Guidelines. Finally, consider running social engineering penetration tests and training to reduce the chance of staff inadvertently enabling an attack.

Further reading 


Security is a journey, not a destination. Even after you've tightened security on your database, application, and hosting environment, you must remain vigilant of emerging threats. In particular, self-hosted DBs come with additional responsibilities that you must tend to. For your convenience, here are some OS- and database-specific resources that you may find useful.

Google named a Leader in the Forrester Public Cloud Development Platform Wave, Q2 2018



As enterprises increasingly turn to the public cloud to build and run their applications, tools such as the Forrester Public Cloud Development Platform Wave have become an important way for developers to evaluate and compare functionality across cloud providers.

Today, Forrester named Google Cloud a Leader in its report, The Forrester Wave: Full-Stack Public Cloud Development Platforms, North America Q2, 2018. We believe this recognition reflects our commitment to building a better cloud: one that lets you innovate and deliver great products and services faster to your end users.

In its report, Forrester noted six key strengths for Google Cloud. Here’s a little more on these areas, and what they mean for you.

1. Investments in global infrastructure.


As Forrester mentions in its report, Google Cloud “invests more in global infrastructure annually than nearly any other cloud platform.” Over the past three years, Google has expanded its global infrastructure footprint with cumulative capex of $30.9 billion. This includes adding new data centers, points of presence and fiber optic networks to connect Google Cloud customers reliably, and securely, around the world.

2. A fully-managed serverless platform.


Forrester cited that a growing number of enterprises recognize our ambition to “lead in serverless platforms.” Serverless computing frees you from worrying about managing underlying infrastructure, scaling instances or paying up front for resources you might not need. GCP’s fully managed serverless platform lets you build end-to-end serverless applications through services like App Engine and Cloud Functions, Datastore and Machine Learning Engine. Features like zero config deployments, auto-scaling, zero server management and traffic splitting enable you to focus on building great applications without the hassle of managing them.

3. Maximizing developer productivity.


Forrester’s report also cites our aim to reduce friction from cloud-native development, maximizing developer productivity with zero-configuration infrastructure, a demonstrated commitment to open source, and fully-managed services. We want to remove anxiety from application development while providing a compelling user experience. This means you can bring your own languages, runtimes, libraries and frameworks to GCP, or choose from one of the popular languages we support out of the box. Our continuous integration/continuous deployment (CI/CD) offerings help you go from source to build, test and deploy seamlessly, regardless of whether you’re running multi-cloud deployments, on-prem clusters, or on GCP.

4. Robust data and machine learning services.


Our Machine Learning services are being adopted by a growing number of traditional enterprise customers, and Forrester took notice. We’ve worked hard to make data, analytics and intelligence as accessible as possible, which is why GCP lets you combine cloud-native services with open source tools as needed. Deep learning benchmarks have shown that GCP’s ML service excels in training performance and accuracy. Additionally, GCP customers rely on the same ML services and infrastructure that we use internally for major Google applications like Gmail, YouTube and Photos.

5. Transparent pricing.


Forrester’s report notes that GCP pricing “is fully pay-as-you-go and transparent.” We’ve worked hard to make our pricing as approachable as possible. We offer features like sustained use discounts, per-second billing, custom machine types and committed use discounts to help you get the most from your resources. In addition, there are no upfront costs or termination fees, so you won’t get stuck in long term contracts or configurations that no longer serve your needs. Due to our scale, efficiency, and discounting structure, the cost of GCP can be as much as 50% less than other cloud providers for many compute use cases.1

6. The GCP ecosystem.


In its report, Forrester notes that our partner ecosystem has expanded dramatically in the last two years. Our global partnerships with companies like SAP, Red Hat, Pivotal, Gitlab, and IBM are an important way that we serve our customers, and we’ll be continuing to add new partnerships, as well as expanding existing ones, over the coming months.

Feedback matters to us, regardless of whether it comes from organizations like Forrester or everyday users. Here are a few things we've heard from our customers.
"Google has the best serverless story right now. We've been a little bit on the bleeding edge with Google but so far it's worked out well." 
Nick Rockwell, CTO, New York Times

"We don’t have to work hard to get an answer, as Google does most of the heavy lifting and scaling with the data."
Paul Clarke, Chief Technology Officer, Ocado 

"With the help of Google Cloud Platform, we are changing the fundamental business model of selling lighting to consumers." 
George Yianni, Head of Technology, Home Systems, Philips Lighting

"Moving to Google Cloud Platform helps us quickly launch new services and features for a changing market." 
Garrett Plasky, Technical Operations Manager, Evernote 
You can download the full Forrester Public Cloud Development Platform Wave, Q2 2018 report here. To learn more about GCP, visit our website, and sign up for a free trial.


1 Combination of list price differences, automatic sustained use discounts, and right size recommendations.

How to dynamically generate GCP IAM credentials with a new HashiCorp Vault secrets engine



Applications often require secrets such as credentials at build- or run-time. These credentials are an assertion of a service’s or user’s identity that they can use to authenticate to other services. On Google Cloud Platform (GCP), you can manage services or temporary users using Cloud Identity and Access Management (IAM) service accounts, which are identities whose credentials your application code can use to access other GCP services. You can access a service account from code running on GCP, in your on-premises environment, or even another cloud.

Protecting service account keys is critical—you should tightly control access to them, rotate them, and make sure they're not committed in code. But managing these credentials shouldn’t get harder as your number of services increases. HashiCorp Vault is a popular open source tool for secret management that allows users to store, manage and control access to tokens, passwords, certificates, API keys and many other secrets. Vault supports pluggable mechanisms known as secrets engines for managing different secret types. These engines allow developers to store, rotate and dynamically generate many kinds of secrets in Vault. Because Vault's secrets engines are pluggable, they each provide different functionality. Some engines store static data, others manage PKI and certificates, and others manage database credentials.

Today, we're pleased to announce a Google Cloud Platform IAM secrets engine for HashiCorp Vault. This allows a user to dynamically generate IAM service account credentials with which a human, machine or application can access specific Google Cloud resources using their own identity. To limit the impact of a security incident, Vault allows credentials to be easily revoked.

This helps address some common use cases, for example:
  • Restricting application access for recurring jobs: An application runs a monthly batch job. Instead of a hard-coded, long-lived credential, the application can request a short-lived GCP IAM credential at the start of the job. After a short time, the credential automatically expires, reducing the surface area for a potential attack.
  • Restricting user access for temporary users: A contracting firm needs 90 days of read-only access to build dashboards. Instead of someone generating this credential and distributing it to the firm, the firm requests this credential through Vault. This creates a 1-1 mapping of the credential to its users in audit and access logs.

Getting started with the IAM service account secret engine


Let’s go through an example for generating new service account credentials using Vault. This example assumes the Vault server is already up and running.

You can also watch a demo of the backend in our new video below.
First, enable the secrets engine:

$ vault secrets enable gcp 
Then, set up the engine with initial config and role sets:

$ vault write gcp/config \
    credentials=@path/to/creds.json \
    ttl=3600 \
    max_ttl=86400

This config supplies default credentials that Vault will use to generate the service account keys and access tokens, as well as TTL metadata for the leases Vault assigns to these secrets when generated.

Role sets define the sets of IAM roles, bound to specific resources, that you assign to generated credentials. Each role set can generate one of two types of secrets: either `access_token` for one-use OAuth access tokens or `service_account_key` for long-lived service account keys. Here are some examples for both types of rolesets:

# Create role sets
$ vault write gcp/roleset/token-role-set \
    project="myproject" \
    secret_type="access_token" \
    bindings=@token_bindings.hcl \
    token_scopes="https://www.googleapis.com/auth/cloud-platform"

$ vault write gcp/roleset/key-role-set \
    project="myproject" \
    secret_type="service_account_key" \
    bindings="..."

The above bindings param expects a string (or, using the special Vault syntax ‘@’, a path to a file containing this string) with the following HCL (or JSON) format:
resource "path/to/my/resource" {
    roles = [
      "roles/viewer",
      "roles/my-other-role",
    ]
}

resource "path/to/another/resource" {
    roles = [
      "roles/editor",
    ]
}
Creating a new role set generates a new service account for a role set. When a user generates a set of credentials, they specify a role set (and thus service account) under which to create the credentials.

Once you have set up the secrets engine, a Vault client can easily generate new secrets:

$ vault read gcp/key/key-role-set

Key                 Value
---                 -----
lease_id            gcp/key/key-roleset/
lease_duration      1h
lease_renewable     true
key_algorithm       KEY_ALG_RSA_2048
key_type            TYPE_GOOGLE_CREDENTIALS_FILE
private_key_data    


$ vault read gcp/token/token-role-set

Key                 Value
---                 -----
lease_id           gcp/token/test/
lease_duration     59m59s
lease_renewable    false
token              ya29.c.restoftoken...
These credentials can then be used to make calls to GCP APIs as needed and can be automatically revoked by Vault.
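As a hedged example of that last step, here's how a Go program might use an access token returned by `vault read gcp/token/token-role-set` to call a GCP API; the bucket name and the way the token is fetched from Vault are assumptions:

package main

import (
    "context"
    "fmt"
    "log"

    "cloud.google.com/go/storage"
    "golang.org/x/oauth2"
    "google.golang.org/api/iterator"
    "google.golang.org/api/option"
)

func main() {
    ctx := context.Background()

    // In practice you'd read this from Vault (e.g. via the Vault API or agent).
    vaultIssuedToken := "ya29.c.restoftoken..."

    ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: vaultIssuedToken})
    client, err := storage.NewClient(ctx, option.WithTokenSource(ts))
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    // List objects in a bucket the role set's bindings grant access to.
    it := client.Bucket("my-authorized-bucket").Objects(ctx, nil) // assumed bucket
    for {
        obj, err := it.Next()
        if err == iterator.Done {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(obj.Name)
    }
}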

To learn more, check out the GCP IAM service account secret engine documentation.

How WePay uses HashiCorp Vault on GCP


WePay is an online payment service provider that uses HashiCorp Vault on GCP. It currently runs HashiCorp Vault servers as virtual machines on Google Compute Engine for two primary use cases:
  • Plain vanilla secret storage using a configuration service: WePay has a service-oriented architecture built on Google Kubernetes Engine. Each microservice stores secrets such as passwords, tokens, private keys, and certificates in a centralized configuration service. This in turn uses the generic "kv" (key value) HashiCorp Vault secrets engine to manage application secrets. The configuration service authenticates services that talk to it, and authorizes those services to access their secrets at deployment time. Secrets are segmented by service using base paths, i.e., superSecurePaymentSystem would only be able to access secrets in the superSecurePaymentSystem path.
  • Key management using a key management service: WePay needs a way to centrally manage the provisioning, deprovisioning and rotation of encryption keys used in its applications. A central key management service generates encryption keys, and stores these in HashiCorp Vault using the "kv" secret engine.
WePay has its complete infrastructure built in GCP, and the introduction of the GCP IAM service account secrets engine will help it put stronger security practices in place. WePay is exploring options for how to use the GCP IAM service account secrets engine in its infrastructure, and is excited by the possibilities.

Continuing work for HashiCorp Vault on GCP


We're excited to see what amazing applications and services users will build using the new HashiCorp Vault GCP secrets engine. This feature release is part of our ongoing partnership with HashiCorp, which dates back to 2013, and we look forward to continuing to help HashiCorp users make the best use of GCP services and features. To get up and running with HashiCorp Vault for IAM service accounts, check out our solution brief “Using Vault on Compute Engine for Secret Management” for an overview of best practices, and a new video on authentication options.

As always, both HashiCorp and Google welcome contributions from the open-source community. Give us a tweet or open an issue on GitHub if you have any questions!
