
VLDB 2015 and Database Research at Google



This week, Kohala, Hawaii hosts the 41st International Conference on Very Large Data Bases (VLDB 2015), a premier annual international forum for data management and database researchers, vendors, practitioners, application developers and users. As a leader in database research, Google will have a strong presence at VLDB 2015, with many Googlers publishing work, organizing workshops and presenting demos.

The research Google is presenting at VLDB involves the work of Structured Data teams, who are building intelligent and efficient systems to discover, annotate and explore structured data from the Web and surface it creatively through Google products (such as structured snippets and table search), as well as engineering efforts that create scalable, reliable, fast and general-purpose infrastructure for large-scale data processing (such as F1, Mesa, and Google Cloud's BigQuery).

If you are attending VLDB 2015, we hope you’ll stop by our booth and chat with our researchers about the projects and opportunities at Google that involve solving interesting problems for billions of people. You can also learn more about the research we are presenting at VLDB 2015 in the list below (Googlers highlighted in blue).

Google is a Gold Sponsor of VLDB 2015.

Papers:
Keys for Graphs
Wenfei Fan, Zhe Fan, Chao Tian, Xin Luna Dong

In-Memory Performance for Big Data
Goetz Graefe, Haris Volos, Hideaki Kimura, Harumi Kuno, Joseph Tucek, Mark Lillibridge, Alistair Veitch

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, Sam Whittle

Resource Bricolage for Parallel Database Systems
Jiexing Li, Jeffrey Naughton, Rimma Nehme

AsterixDB: A Scalable, Open Source BDMS
Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alex Behm, Vinayak Borkar, Yingyi Bu, Michael Carey, Inci Cetindil, Madhusudan Cheelangi, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis Tsotras, Rares Vernica, Jian Wen, Till Westmann

Knowledge-Based Trust: A Method to Estimate the Trustworthiness of Web Sources
Xin Luna Dong, Evgeniy Gabrilovich, Kevin Murphy, Van Dang, Wilko Horn, Camillo Lugaresi, Shaohua Sun, Wei Zhang

Efficient Evaluation of Object-Centric Exploration Queries for Visualization
You Wu, Boulos Harb, Jun Yang, Cong Yu

Interpretable and Informative Explanations of Outcomes
Kareem El Gebaly, Parag Agrawal, Lukasz Golab, Flip Korn, Divesh Srivastava

Take me to your leader! Online Optimization of Distributed Storage Configurations
Artyom Sharov, Alexander Shraer, Arif Merchant, Murray Stokely

TreeScope: Finding Structural Anomalies In Semi-Structured Data
Shanshan Ying, Flip Korn, Barna Saha, Divesh Srivastava

Workshops:
Workshop on Big-Graphs Online Querying - Big-O(Q) 2015
Workshop co-chair: Cong Yu

3rd International Workshop on In-Memory Data Management and Analytics
Program committee includes: Sandeep Tata

High-Availability at Massive Scale: Building Google's Data Infrastructure for Ads
Invited talk at BIRTE by: Ashish Gupta, Jeff Shute

Demonstrations:
KATARA: Reliable Data Cleaning with Knowledge Bases and Crowdsourcing
Xu Chu, John Morcos, Ihab Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye

Error Diagnosis and Data Profiling with Data X-Ray
Xiaolan Wang, Mary Feng, Yue Wang, Xin Luna Dong, Alexandra Meliou

Pulling Back the Curtain on Google’s Network Infrastructure



Pursuing Google’s mission of organizing the world’s information to make it universally accessible and useful takes an enormous amount of computing and storage. In fact, it requires coordination across a warehouse-scale computer. Ten years ago, we realized that we could not purchase, at any price, a datacenter network that could meet the combination of our scale and speed requirements. So, we set out to build our own datacenter network hardware and software infrastructure. Today, at the ACM SIGCOMM conference, we are presenting a paper with the technical details on five generations of our in-house data center network architecture. The paper expands on a talk we gave at the Open Networking Summit a few months ago.

From relatively humble beginnings, and after a misstep or two, we’ve built and deployed five generations of datacenter network infrastructure. Our latest-generation Jupiter network has improved capacity by more than 100x relative to our first-generation network, delivering more than 1 petabit/sec of total bisection bandwidth. This means that each of 100,000 servers can communicate with one another in an arbitrary pattern at 10 Gb/s.
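
As a quick sanity check on the arithmetic behind that figure, the sketch below simply multiplies the numbers quoted above; it is purely illustrative.

    # Back-of-the-envelope check: 100,000 servers at 10 Gb/s each.
    servers = 100_000
    per_server_gbps = 10

    total_gbps = servers * per_server_gbps   # 1,000,000 Gb/s
    total_pbps = total_gbps / 1_000_000      # 1 Pb/s, since 1 Pb/s = 10^6 Gb/s
    print(f"{total_gbps:,} Gb/s = {total_pbps:g} Pb/s of bisection bandwidth")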

Such network performance has been tremendously empowering for Google services. Engineers were liberated from optimizing their code for various levels of bandwidth hierarchy. For example, engineers initially faced a painful tradeoff: placing servers under the same top-of-rack switch for careful data locality, at the cost of correlated failures when that single switch failed. A high-performance network supporting tens of thousands of servers with flat bandwidth also enabled individual applications to scale far beyond what was otherwise possible and enabled tight coupling among multiple federated services. Finally, we were able to substantially improve the efficiency of our compute and storage infrastructure. As quantified in this recent paper, scheduling a set of jobs over a single larger domain supports much higher utilization than scheduling the same jobs over multiple smaller domains.
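
To make that last point concrete, here is a toy simulation of the effect; the workload numbers are invented for illustration and are not from the cited paper. Each small scheduling domain must be provisioned for its own demand peak, while one large domain only has to cover the peak of the aggregate, and independent bursts rarely coincide.

    # Toy illustration (invented numbers): pooling scheduling domains raises utilization.
    import random

    random.seed(0)
    TIMESTEPS, SERVICES = 1000, 4

    def bursty_demand():
        # Mostly-quiet service (100 cores) with occasional spikes (900 cores).
        return [900 if random.random() < 0.05 else 100 for _ in range(TIMESTEPS)]

    demand = [bursty_demand() for _ in range(SERVICES)]

    # Four small domains: each sized for its own service's peak.
    capacity_small = sum(max(series) for series in demand)
    # One large domain: sized for the peak of the combined demand.
    capacity_large = max(sum(series[t] for series in demand) for t in range(TIMESTEPS))

    avg_total = sum(map(sum, demand)) / TIMESTEPS
    print(f"4 small domains: {capacity_small} cores -> {avg_total / capacity_small:.0%} average utilization")
    print(f"1 large domain:  {capacity_large} cores -> {avg_total / capacity_large:.0%} average utilization")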

Delivering such a network meant we had to solve some fundamental problems in networking. Ten years ago, networking was defined by the interaction of individual hardware elements, e.g., switches, speaking standardized protocols to dynamically learn what the network looks like. Based on this dynamically learned information, switches would set their forwarding behavior. While robust, these protocols targeted deployment environments with perhaps tens of switches communicating among multiple organizations. Configuring and managing switches in such an environment was manual and error-prone. Changes in network state would spread slowly through the network using a high-overhead broadcast protocol. Most challenging of all, the system could not scale to meet our needs.

We adopted a set of principles for organizing our networks that has since become the primary driver of networking research and industrial innovation: Software Defined Networking (SDN). We observed that we could arrange emerging merchant switch silicon around a Clos topology to scale to the bandwidth requirements of a data center building. The topology of all five generations of our data center networks follows the blueprint below. Unfortunately, this meant that we would potentially require 10,000+ individual switching elements. Even if we could overcome the scalability challenges of existing network protocols, managing and configuring such a vast number of switching elements would be impossible.
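
To give a sense of the scale involved, the sketch below works through the standard three-tier fat-tree (folded-Clos) sizing arithmetic. The switch radixes are hypothetical and this is not the actual Jupiter configuration described in the paper; it only shows why building-scale bandwidth from merchant silicon implies thousands of individual switching elements.

    # Illustrative fat-tree (folded Clos) sizing -- NOT the actual Jupiter design.
    # A classic 3-tier fat-tree of k-port switches supports k^3/4 hosts using
    # k pods (k/2 edge + k/2 aggregation switches each) plus (k/2)^2 core switches.

    def fat_tree_size(k):
        hosts = k ** 3 // 4
        pod_switches = k * k          # k pods x k switches per pod
        core_switches = (k // 2) ** 2
        return hosts, pod_switches + core_switches

    for k in (24, 48, 96):            # hypothetical merchant-silicon radixes
        hosts, switches = fat_tree_size(k)
        print(f"k={k:3d}: {hosts:8,} hosts, {switches:6,} switches")

Even modest radixes reach a building-scale host count only by multiplying the number of switching elements into the thousands, which is exactly the management problem addressed next.
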
So, our next observation was that we knew the exact configuration of the network we were trying to build. Every switch would play a well-defined role in a larger ensemble. That is, from the perspective of a datacenter building, we wished to build a logical building-scale network. For network management, we built a single configuration for the entire network and pushed that one global configuration to all switches. Each switch would simply pull out its role in the larger whole. Finally, we transformed routing from a pair-wise, fully distributed (but somewhat slow and high-overhead) scheme to a logically centralized scheme under the control of a single dynamically elected master, as illustrated in the figure below from our paper.
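
The sketch below illustrates the "one global configuration, each switch extracts its role" idea in miniature. The data layout and field names are hypothetical, invented for illustration; they are not Google's actual configuration format or protocol.

    # Minimal sketch of a single global fabric config that every switch receives,
    # with each switch pulling out only its own role (hypothetical format).

    GLOBAL_CONFIG = {
        "fabric": "building-1",
        "switches": {
            "edge-0-0": {"role": "edge", "pod": 0, "uplinks": ["agg-0-0", "agg-0-1"]},
            "agg-0-0": {"role": "aggregation", "pod": 0, "uplinks": ["spine-0", "spine-1"]},
            "spine-0": {"role": "spine", "pod": None, "uplinks": []},
            # ... one entry per switch in the building-scale fabric
        },
    }

    def config_for(switch_id, global_config=GLOBAL_CONFIG):
        """Every switch gets the same global config and extracts its own part."""
        return global_config["switches"][switch_id]

    # A switch bootstraps by looking up its well-defined role in the larger ensemble:
    print(config_for("edge-0-0"))
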
Our SIGCOMM paper on Jupiter is one of four Google-authored papers being presented at the conference. Taken together, these papers show the benefits of embedding research within production teams, reflecting both collaborations with PhD students carrying out extended research efforts with Google engineers during internships and key insights from deployed production systems:

  • Our work on Bandwidth Enforcer shows how we can allocate wide area bandwidth among tens of thousands of individual applications based on centrally configured policy, substantially improving network utilization while simultaneously isolating services from one another.
  • Condor addresses the challenges of designing data center network topologies. Network designers can specify constraints for data center networks; Condor efficiently generates candidate network designs that meet these constraints, and evaluates these candidates against a variety of target metrics.
  • Congestion control in datacenter networks is challenging because of tiny buffers and very small round trip times. TIMELY shows how to manage datacenter bandwidth allocation while maintaining highly responsive and low latency network roundtrips in the data center.
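
For a flavor of the delay-based approach in the last bullet, here is a deliberately simplified sketch of steering a sender's rate by observed round-trip time. It is not the published TIMELY algorithm, which adapts rates using RTT gradients measured with NIC timestamps; the thresholds and step sizes below are invented for illustration.

    # Simplified delay-based rate control sketch -- NOT the published TIMELY
    # algorithm. Thresholds and step sizes are invented for illustration.

    T_LOW_US, T_HIGH_US = 50, 500        # hypothetical RTT thresholds (microseconds)
    ADD_STEP_MBPS, MD_FACTOR = 100, 0.8  # additive increase step, multiplicative decrease

    def update_rate(rate_mbps, rtt_us):
        """Return a new sending rate given the latest RTT sample."""
        if rtt_us < T_LOW_US:            # network looks idle: probe for more bandwidth
            return rate_mbps + ADD_STEP_MBPS
        if rtt_us > T_HIGH_US:           # queues are building: back off multiplicatively
            return rate_mbps * MD_FACTOR
        return rate_mbps                 # within the target band: hold steady

    # Example: a flow reacting to a brief congestion episode.
    rate = 1000.0
    for rtt in (40, 45, 700, 650, 60, 42):
        rate = update_rate(rate, rtt)
        print(f"rtt={rtt:4d}us -> rate={rate:7.1f} Mb/s")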

These efforts reflect the latest in a long series of substantial Google contributions to networking. We are excited about being increasingly open about the results of our research work: to solicit feedback on our approach, to influence future research and development directions so that we can benefit from community-wide efforts to improve networking, and to attract the next generation of great networking thinkers and builders to Google. Our focus on Google Cloud Platform further increases the importance of being open about our infrastructure. Since the same network that has powered Google’s infrastructure for a decade also underpins our Cloud Platform, all developers can leverage it to build highly robust, manageable, and globally scalable services.

Say hello to the Enigma conference



USENIX Enigma is a new conference focused on security, privacy and electronic crime through the lens of emerging threats and novel attacks. The goal of this conference is to help industry, academic, and public-sector practitioners better understand the threat landscape. Enigma will have a single track of 30-minute talks that are curated by a panel of experts, featuring strong technical content with practical applications to current and emerging threats.
Google is excited to both sponsor and help USENIX build Enigma, since we share many of its core principles: transparency, openness, and cutting-edge security research. Furthermore, we are proud to provide Enigma with engineering and design support, as well as volunteer participation in its program and steering committees.

The first Enigma will be held January 25-27, 2016, in San Francisco. You can sign up for more information about the conference or propose a talk through the official conference site at http://enigma.usenix.org