
Testing on the Toilet: Web Testing Made Easier: Debug IDs

by Ruslan Khamitov 

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

Adding ID attributes to elements can make it much easier to write tests that interact with the DOM (e.g., WebDriver tests). Consider the following DOM with two buttons that differ only by inner text:
<div class="button">Save</div>
<div class="button">Edit</div>

How would you tell WebDriver to interact with the “Save” button in this case? You have several options. One option is to interact with the button using a CSS selector:
div.button

However, this approach is not sufficient to identify a particular button, and there is no mechanism to filter by text in CSS. Another option would be to write an XPath, which is generally fragile and discouraged:
//div[@class='button' and text()='Save']

Your best option is to add unique hierarchical IDs where each widget is passed a base ID that it prepends to the ID of each of its children. The IDs for each button will be:
contact-form.save-button
contact-form.edit-button
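
With such IDs in place, a WebDriver test can locate each button directly by ID instead of relying on CSS classes or XPath text matching. A minimal sketch (the WebDriver setup is assumed, and the driver variable is illustrative):

// Locate the buttons by their unique hierarchical IDs.
// Assumes "driver" is a WebDriver instance already on the page under test.
WebElement saveButton = driver.findElement(By.id("contact-form.save-button"));
WebElement editButton = driver.findElement(By.id("contact-form.edit-button"));
saveButton.click();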

In GWT you can accomplish this by overriding onEnsureDebugId() on your widgets. Doing so allows you to create custom logic for applying debug IDs to the sub-elements that make up a custom widget:
@Override protected void onEnsureDebugId(String baseId) {
  super.onEnsureDebugId(baseId);
  saveButton.ensureDebugId(baseId + ".save-button");
  editButton.ensureDebugId(baseId + ".edit-button");
}
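
The base ID itself is typically set once on the enclosing widget, which triggers the override above (the widget instance name below is illustrative, and GWT applies debug IDs only when they are enabled for the module):

// Assigns "contact-form" as the base ID; onEnsureDebugId() then
// propagates it to the child buttons. "contactForm" is a hypothetical instance.
contactForm.ensureDebugId("contact-form");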

Consider another example. Let’s set IDs for repeated UI elements in Angular using ng-repeat. Setting an index can help differentiate between repeated instances of each element:
<tr id="feedback-{{$index}}" class="feedback" ng-repeat="feedback in ctrl.feedbacks" >

In GWT you can do this with ensureDebugId(). Let’s set an ID for each of the table cells:
@UiField FlexTable table;
UIObject.ensureDebugId(table.getCellFormatter().getElement(rowIndex, columnIndex),
baseID + colIndex + "-" + rowIndex);
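
Once repeated elements carry indexed IDs, a test can address a specific instance directly. A minimal WebDriver sketch (the driver variable is again assumed):

// Locate the second repeated feedback row from the Angular example above.
WebElement secondFeedbackRow = driver.findElement(By.id("feedback-1"));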

Take-away: Debug IDs are easy to set and make a huge difference for testing. Please add them early.

Testing on the Toilet: Don’t Put Logic in Tests

by Erik Kuefler

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

Programming languages give us a lot of expressive power. Concepts like operators and conditionals are important tools that allow us to write programs that handle a wide range of inputs. But this flexibility comes at the cost of increased complexity, which makes our programs harder to understand.

In tests, unlike in production code, simplicity is more important than flexibility. Most unit tests verify that a single, known input produces a single, known output. Tests can avoid complexity by stating their inputs and outputs directly rather than computing them. Otherwise it's easy for tests to develop their own bugs.

Let's take a look at a simple example. Does this test look correct to you?

@Test public void shouldNavigateToPhotosPage() {
  String baseUrl = "http://plus.google.com/";
  Navigator nav = new Navigator(baseUrl);
  nav.goToPhotosPage();
  assertEquals(baseUrl + "/u/0/photos", nav.getCurrentUrl());
}

The author is trying to avoid duplication by storing a shared prefix in a variable. Performing a single string concatenation doesn't seem too bad, but what happens if we simplify the test by inlining the variable?

@Test public void shouldNavigateToPhotosPage() {
  Navigator nav = new Navigator("http://plus.google.com/");
  nav.goToPhotosPage();
  assertEquals("http://plus.google.com//u/0/photos", nav.getCurrentUrl()); // Oops!
}

After eliminating the unnecessary computation from the test, the bug is obvious: we're expecting two slashes in the URL! This test will either fail or (even worse) incorrectly pass if the production code has the same bug. We never would have written this if we had stated our inputs and outputs directly instead of trying to compute them. And this is a very simple example; when a test adds more operators or includes loops and conditionals, it becomes increasingly difficult to be confident that it is correct.
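
One way to restate the test with literal inputs and outputs (assuming the intended behavior is a single slash before the path) is:

@Test public void shouldNavigateToPhotosPage() {
  Navigator nav = new Navigator("http://plus.google.com/");
  nav.goToPhotosPage();
  // The expected URL is stated literally rather than computed.
  assertEquals("http://plus.google.com/u/0/photos", nav.getCurrentUrl());
}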

Another way of saying this is that, whereas production code describes a general strategy for computing outputs given inputs, tests are concrete examples of input/output pairs (where output might include side effects like verifying interactions with other classes). It's usually easy to tell whether an input/output pair is correct or not, even if the logic required to compute it is very complex. For instance, it's hard to picture the exact DOM that would be created by a JavaScript function for a given server response. So the ideal test for such a function would just compare against a string containing the expected output HTML.

When tests do need their own logic, such logic should often be moved out of the test bodies and into utilities and helper functions. Since such helpers can get quite complex, it's usually a good idea for any nontrivial test utility to have its own tests.
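
For example, a hypothetical helper that generates test data could live in a shared test utility and carry a test of its own (all names below are illustrative; JUnit 4 and java.util imports are assumed):

// Hypothetical shared test utility.
static List<String> userIds(int count) {
  List<String> ids = new ArrayList<>();
  for (int i = 0; i < count; i++) {
    ids.add("user-" + i);
  }
  return ids;
}

// Because the helper contains a loop, it gets its own test,
// again stated with literal inputs and outputs.
@Test public void shouldGenerateUserIds() {
  assertEquals(Arrays.asList("user-0", "user-1", "user-2"), userIds(3));
}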

The Deadline to Sign up for GTAC 2014 is Jul 28

Posted by Anthony Vallone on behalf of the GTAC Committee

The deadline to sign up for GTAC 2014 is next Monday, July 28th, 2014. There is a great deal of interest in both attending and speaking, and we’ve received many outstanding proposals. However, it’s not too late to add yours for consideration. If you would like to speak or attend, be sure to complete the form by Monday.

We will be making regular updates to our site over the next several weeks, and you can find conference details there:
  developers.google.com/gtac

For those who have already signed up to attend or speak, we will contact you directly in mid-August.

Measuring Coverage at Google

By Marko Ivanković, Google Zürich

Introduction


Code coverage is a very interesting metric, covered by a large body of research that reaches somewhat contradictory conclusions. Some people think it is an extremely useful metric and that a certain percentage of coverage should be enforced on all code. Some think it is a useful tool for identifying areas that need more testing but don’t necessarily trust that covered code is truly well tested. Still others think that measuring coverage is actively harmful because it provides a false sense of security.

Our team’s mission was to collect coverage-related data and then develop and champion code coverage practices across Google. We designed an opt-in system where engineers could enable two different types of coverage measurements for their projects: daily and per-commit. With daily coverage, we run all tests for their project, whereas with per-commit coverage we run only the tests affected by the commit. The two measurements are independent, and many projects opted into both.

While we did experiment with branch, function and statement coverage, we ended up focusing mostly on statement coverage because of its relative simplicity and ease of visualization.

How we measured


Our job was made significantly easier by the wonderful Google build system, whose parallelism and flexibility allowed us to run our measurements at Google scale. The build system had integrated various language-specific open source coverage measurement tools like Gcov (C++), Emma / JaCoCo (Java), and Coverage.py (Python), and we provided a central system where teams could sign up for coverage measurement.

For daily whole-project coverage measurements, each team is provided with a simple cronjob that runs all tests across the project’s codebase. The results of these runs are available to the teams in a centralized dashboard that displays charts showing coverage over time and allows daily, weekly, quarterly, and yearly aggregations as well as per-language slicing. On this dashboard teams can also compare their project (or projects) with any other project, or with Google as a whole.

For per-commit measurement, we hook into the Google code review process (briefly explained in this article) and display the data visually to both the commit author and the reviewers. We display the data on two levels: color coded lines right next to the color coded diff and a total aggregate number for the entire commit.


Displayed above is a screenshot of the code review tool. The green line coloring is the standard diff coloring for added lines. The orange and lighter green coloring on the line numbers is the coverage information. We use light green for covered lines, orange for non-covered lines and white for non-instrumented lines.

It’s important to note that we surface the coverage information before the commit is submitted to the codebase, because this is the time when engineers are most likely to be interested in improving it.

Results


One of the main benefits of working at Google is the scale at which we operate. We have been running the coverage measurement system for some time now and we have collected data for more than 650 different projects, spanning 100,000+ commits. We have a significant amount of data for C++, Java, Python, Go and JavaScript code.

I am happy to say that we can share some preliminary results with you today:


The chart above is a histogram of the average measured absolute coverage across Google projects. The median (50th percentile) code coverage is 78%, the 75th percentile is 85%, and the 90th percentile is 90%. We believe that these numbers represent a very healthy codebase.

We have also found it very interesting that there are significant differences between languages:

Language     C++      Java     Go       JavaScript   Python
Coverage     56.6%    61.2%    63.0%    76.9%        84.2%


The table above shows the total coverage of all analyzed code for each language, averaged over the past quarter. We believe that the large differences are due to structural, paradigm, and best-practice differences between languages, as well as the ability to measure coverage more precisely in certain languages.

Note that these numbers should not be interpreted as guidelines for a particular language; the aggregation method used is too simple for that. Instead, this finding is simply a data point for any future research that analyzes samples from a single programming language.

The feedback from our fellow engineers was overwhelmingly positive. The most loved feature was surfacing coverage information at code review time. This early surfacing of coverage had a statistically significant impact: our initial analysis suggests that it increased coverage by 10% (averaged across all commits).

Future work


We are aware that there are a few problems with the dataset we collected. In particular, the individual tools we use to measure coverage are not perfect. Large integration tests, end-to-end tests, and UI tests are difficult to instrument, so large parts of code exercised by such tests can be misreported as non-covered.

We are working on improving the tools, and we are also analyzing the impact of unit tests, integration tests, and other types of tests individually.

In addition to languages, we will also investigate other factors that might influence coverage, such as platforms and frameworks, to allow all future research to account for their effect.

We will be publishing more of our findings in the future, so stay tuned.

And if this sounds like something you would like to work on, why not apply on our job site?