The Power of Open Source


At the day 1 keynote of Open Source Summit North America, Timothy Jordan, Director of Developer Relations and Open Source at Google, will talk about the landscape of open source and AI, the importance of a responsible approach, and the transformative impact of community collaboration. In anticipation of this talk, let’s break down the AI open source ecosystem, and how Google approaches it.

Google believes in the power of open technology to drive innovation and benefit everyone. It fosters creativity and collaboration, while ensuring technology access for developers and allowing customization to fit unique use cases. Open source licenses give developers full creative autonomy without restriction. It is this ecosystem of open source and open technology, shaped by ML frameworks like TensorFlow, Keras, and JAX, that has enabled so many incredible advances in AI in recent years.

The open source community has been discussing how to apply the Open Source Definition (OSD) to AI: carrying the OSD's open principles forward while addressing concepts like derived works and author attribution. During Timothy’s keynote, he’ll speak to his own philosophy on open source and AI, and share how his assumptions about how we apply open source to AI have evolved. The immediate availability of AI models, powered by the open source ecosystem of ML frameworks, means it’s more important than ever that we establish a shared definition for open source and AI.

While that definition is in development, at Google we’re using precise language to describe our openly available models like Gemma. The definition and license are only one part of this open ML/AI future; advancements in safety tooling, policies, and developer knowledge are all part of creating a responsible and open future for AI. Those advancements are all fueled by a dedication to collaboration. Whether sharing innovations and improvements with the community, or having conversations with policymakers and open source leaders, collaboration is key to a responsible approach to AI in the open ecosystem. AI can only be safe and responsible if everyone’s experiences and perspectives are brought to the forefront as it’s built.

To demonstrate how open source has made AI readily available, Timothy will also take the audience through a “low code” demo of how to run large language models in-browser for web applications. Using MediaPipe, the LLM Inference API, and Gemma, users can quickly add genAI capabilities like document summarization and text generation.

Join us at Open Source Summit North America for this keynote, and visit opensource.google to learn more.

By the Google Open Source team

Google Season of Docs announces participating organizations for 2024

Google Season of Docs provides support for open source projects to improve their documentation and gives professional technical writers an opportunity to gain experience in open source. Together we improve developer experience through better documentation, increase our understanding of best practices in open source documentation, and raise the profile of technical writers in open source.

For 2024, Google Season of Docs is pleased to announce that eleven organizations will be participating in the program! The list of participating organizations can be viewed on the website.

The project development phase now begins. Organizations and the technical writers they hire will work on their documentation projects from now until November 22nd. For organizations still looking to hire a technical writer, the hiring deadline is May 22nd.


How do I take part in Google Season of Docs as a technical writer?

Start by reading the technical writer guide and FAQs which give information about eligibility and choosing a project. Next, technical writers interested in working with accepted open source organizations can share their contact information via the Season of Docs GitHub repository; or they may submit a statement of interest directly to the organizations. We recommend technical writers reach out to organizations before submitting a statement of interest to discuss the project they’ll be working on and gain a better understanding of the organization. Technical writers do not need to submit a formal application through Google Season of Docs, so reach out to the organizations as soon as possible!


Will technical writers be paid while working with organizations accepted into Google Season of Docs?

Yes. Participating organizations will transfer funds directly to the technical writer via OpenCollective. Technical writers should review the organization's proposed project budgets and discuss their compensation and payment schedule with the organization before hiring. Check out our technical writer payment process guide for more details.


General Timeline

  • May 22, 2024 – Technical writer hiring deadline
  • June 5, 2024 – Organization administrators start reporting on their project status via monthly evaluations
  • December 10, 2024 – Final date for organization administrators to submit their case study and final project evaluation
  • December 13, 2024 – Google publishes the 2024 Season of Docs case studies and aggregate project data
  • May 1, 2025 – Organizations begin to participate in post-program followup surveys

See the full timeline for details.


Care to join us?

Explore the Google Season of Docs website at g.co/seasonofdocs to learn more about the program. Use our logo and other promotional resources to spread the word. Review the timeline, check out the FAQ, and reach out to organizations now!

If you have any questions about the program, please email us at [email protected].

By Erin McKean, Google Open Source Programs Office

Introducing Jpegli: A New JPEG Coding Library

The internet has changed the way we live, work, and communicate. However, it can turn into a source of frustration when pages load slowly. At the heart of this issue lies the encoding of images. To improve on this, we are introducing Jpegli, an advanced JPEG coding library that maintains high backward compatibility while offering enhanced capabilities and a 35% compression ratio improvement at high quality compression settings.

Jpegli is a new JPEG coding library designed to be faster, more efficient, and more visually pleasing than traditional JPEG. Its key features include:

  • It provides both a fully interoperable encoder and decoder complying with the original JPEG standard and its most conventional 8-bit formalism, and API/ABI compatibility with libjpeg-turbo and MozJPEG.
  • High quality results. When images are compressed or decompressed through Jpegli, more precise and psychovisually effective computations are performed and images will look clearer and have fewer observable artifacts.
  • Fast. While improving the image quality/compression density ratio, Jpegli's coding speed is comparable to traditional approaches such as libjpeg-turbo and MozJPEG. This means that web developers can effortlessly integrate Jpegli into their existing workflows without sacrificing coding speed or memory use.
  • 10+ bits. Jpegli can encode images with 10+ bits per component. Traditional JPEG coding solutions offer only 8 bits per component, causing visible banding artifacts in slow gradients. Jpegli's 10+ bit coding happens within the original 8-bit formalism, and the resulting images are fully interoperable with 8-bit viewers. 10+ bit dynamics are available as an API extension, and application code changes are needed to benefit from them.
  • More dense: Jpegli compresses images more efficiently than traditional JPEG codecs, which can save bandwidth and storage space, and speed up web pages.

How Jpegli works

Jpegli works by using a number of new techniques to reduce noise and improve image quality: mainly adaptive quantization heuristics from the JPEG XL reference implementation, improved quantization matrix selection, precise calculation of intermediate results, and the option to use a more advanced colorspace. All the new methods have been carefully crafted to use the traditional 8-bit JPEG formalism, so newly compressed images are compatible with existing JPEG viewers such as browsers and image processing software.


Adaptive quantization heuristics

Jpegli uses adaptive quantization to reduce noise and improve image quality. This is done by spatially modulating the dead zone in quantization based on psychovisual modeling. Using adaptive quantization heuristics that we originally developed for JPEG XL, the result is improved image quality and reduced file size. These heuristics are much faster than a similar approach originally used in guetzli.


Improved quantization matrix selection

Jpegli also uses a set of quantization matrices that were selected by optimizing for a mix of psychovisual quality metrics. Precise intermediate results in Jpegli improve image quality, and both encoding and decoding produce higher quality results. Jpegli can use JPEG XL's XYB colorspace for further quality and density improvements.


Testing Jpegli

In order to quantify Jpegli's image quality improvement, we enlisted crowdsourced raters to compare pairs of images from the Cloudinary Image Dataset '22, encoded using three codecs, Jpegli, libjpeg-turbo, and MozJPEG, at several bitrates.

In this comparison we limited ourselves to comparing the encoders only; decoding was always performed using libjpeg-turbo. We conducted the study with the XYB ICC color profile disabled, since that is how we anticipate most users would initially use Jpegli. To simplify comparing the results across the codecs and settings, we aggregated all the rater decisions using an Elo scoring system inspired by chess rankings.
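For readers unfamiliar with Elo-style aggregation, the sketch below shows the generic idea of turning pairwise preferences into ratings. The K-factor, the starting rating of 1500, and the sample decisions are illustrative assumptions, not the parameters or data used in the study.

def expected_score(r_a, r_b):
    # Expected win probability of A against B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, winner, loser, k=32.0):
    # Update ratings in place after one pairwise rater decision.
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

# Illustrative decisions: (preferred codec/setting, other codec/setting).
decisions = [
    ("jpegli@2.8bpp", "libjpeg-turbo@3.7bpp"),
    ("jpegli@2.8bpp", "mozjpeg@3.7bpp"),
    ("libjpeg-turbo@3.7bpp", "mozjpeg@3.7bpp"),
]
ratings = {name: 1500.0 for pair in decisions for name in pair}
for winner, loser in decisions:
    update(ratings, winner, loser)
print(ratings)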

Figure: Elo scores for the three codecs, shown as bar and scatter plots. A higher Elo score indicates better aggregate performance in the rater study. We can observe that Jpegli at 2.8 BPP received a higher Elo rating than libjpeg-turbo at 3.7 BPP, a bitrate 32% higher than Jpegli's.

Results

Our results show that Jpegli can compress high quality images 35% more than traditional JPEG codecs.


Jpegli is a promising new technology that has the potential to make the internet faster and more beautiful.

By Zoltan Szabadka, Martin Bruse and Jyrki Alakuijala – Paradigms of Intelligence, Google Research

OSV and helping developers fix known vulnerabilities

In 2021, we launched the OSV project with a goal of enabling easy management of known vulnerabilities in open source software dependencies. To achieve this, we started by building an open source, comprehensive database (https://osv.dev) that accurately describes all known OSS vulnerabilities in the easy-to-use OpenSSF OSV Schema.

Over time, we worked with numerous open source communities to adopt the OSV Schema (totalling over 24 ecosystems), and introduced open source tools like our API and OSV-Scanner to directly make this database useful to developers.
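As an example of how developers can use the API directly, the sketch below queries the public endpoint at https://api.osv.dev/v1/query for vulnerabilities affecting a specific package version; the package and version shown are purely illustrative.

import json
import urllib.request

# Ask OSV.dev for known vulnerabilities affecting one package version.
# The request shape follows the public OSV API; "jinja2" 2.4.1 on PyPI is
# just an example query.
query = {
    "package": {"name": "jinja2", "ecosystem": "PyPI"},
    "version": "2.4.1",
}
req = urllib.request.Request(
    "https://api.osv.dev/v1/query",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# Each entry in "vulns" is an OSV Schema record (id, summary, affected, ...).
for vuln in result.get("vulns", []):
    print(vuln["id"], "-", vuln.get("summary", ""))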

The OSV project takes a very developer-focused approach to vulnerability management, as we realize that day-to-day developers are often the ones who bear the burden of managing dependency updates and triaging vulnerabilities in their dependencies.

Today the OSV team is excited to share some updates on the work we’ve been doing, and to explain how the OSV project as a whole helps developers with vulnerability management today.


Announcing guided remediation

Developers are often faced with an overwhelming number of vulnerabilities reported against their dependencies. To tackle this, we’re announcing a tool as part of OSV-Scanner that enables developers to easily prioritize and fix the vulnerabilities that matter, both interactively and automatically.

The basic usage of the tool provides a simple command for developers to run which will automatically fix as many vulnerabilities as possible by upgrading their project’s dependencies.

For developers who need or want finer control over vulnerability remediation, the tool also provides the more advanced interactive mode. In the interactive mode, developers can preview and make informed decisions on which packages to upgrade or which vulnerabilities they want to prioritize based on metrics such as vulnerability severity, dependency depth, or dependency type.

Filtering by all these advanced metrics is also available via CLI flags for running the tool non-interactively, which enables integration of guided remediation into automated workflows. For example, developers can connect the tool with their CI/test pipelines to determine the set of non-breaking dependency upgrades.

Currently, the guided remediation tool supports npm package.json and package-lock.json dependencies, but we’ll be adding support for more ecosystems in the future.

Check out our detailed documentation for more information or if you would like to try it out for yourself!


OSV-Scanner GitHub action

We’ve also recently launched the OSV-Scanner GitHub action, which provides an easy way for developers to integrate vulnerability scanning using OSV.dev into their CI/CD pipelines. This is currently being used by TensorFlow and Flutter to provide continuous scanning of their dependencies.

Our GitHub Action can be configured to do the following:

  • Regular vulnerability scan workflow. A common use case is to set a schedule to regularly scan the repository, with the workflow failing if a new vulnerability is found. Another use case can be to block release workflows if a vulnerability is found.
  • Trigger a differential vulnerability scan to run when a pull request is opened. This workflow can determine if your changes introduce new vulnerabilities and can be configured to block pull requests when the action fails. Enabling just this feature can allow you to stop new vulnerabilities from being introduced, while not breaking your existing workflows.

Head over to our documentation to see a quick and easy guide on how to get started integrating the OSV-Scanner action into your GitHub repository.


Other OSV features

Guided remediation and GitHub Actions support form one piece of our goal of making vulnerability management easier.

OSV also provides a broad suite of other features.


What’s next?

We still have a lot more exciting work planned! A remaining challenge for dealing with known vulnerabilities in dependencies is remediation and dealing with false positives. Much of our work is focused on improving data quality and providing accurate and actionable results that lead to easy remediation.

These include:

  • Iterating on guided remediation by addressing user feedback and adding support for additional ecosystems.
  • Improving container scanning. OSV-Scanner has so far focused on source repository scanning. One important gap we aim to fill is better support for container scanning, in a way that provides actionable and useful remediation guidance while minimizing false positives.
  • Continuing to improve matching and data quality. A continuing focus for OSV-Scanner is making sure that our scanning is comprehensive and accurate. Accuracy is especially important for us, as one of our core goals is to minimize false positives and vulnerability noise for developers at the receiving end of the scanners, through techniques such as reachability analysis.

Interested in using OSV in your project? Check out our OSV-Scanner and OSV.dev documentation for how to get started. Please share any feedback or bugs you encounter via our GitHub issue tracker.

By Michael Kedar, Rex Pan, and Oliver Chang – Google Open Source Security Team

Google Summer of Code 2024 contributor applications are open!

We are thrilled to announce that contributor applications for Google Summer of Code (GSoC) 2024 are now open! If you are a student or a beginner in open source software development and 18+ years old, we hope you will apply. The application period opened March 18th at 18:00 UTC and closes April 2nd at 18:00 UTC.

This year we are celebrating the 20th year of Google Summer of Code! During GSoC, contributors will spend 12+ weeks writing code and learning more about open source software development under the guidance of experienced mentors.

Since 2005, GSoC has welcomed thousands of new developers into the open source community every year. The GSoC program has brought together over 20,000 contributors from 116 countries and 19,000 mentors from 850+ open source organizations.

This year we have added more projects focused on Artificial Intelligence, Machine Learning and Security than ever before; keep in mind the following points before applying:

  • Consider the three project sizes (~90 hours, ~175 hours, and ~350 hours) and choose the time commitment that is best for you.
  • Contributors can submit a maximum of 3 project proposals (to different orgs or even multiple proposals to the same org).

With GSoC contributor applications now open, please review these helpful tips to guide your application:

  • Read the program rules, FAQ, contributor guide, and advice for applying and join us in our Discord chat Channel to connect with the community.
  • Review the list of 195 mentoring organizations and use filters to sort by your interests, including programming language (Python, Rust, etc.) and category (data, development tools, artificial intelligence, infrastructure and cloud, security, etc.).
  • Narrow down your list to 2-4 organizations and review their ideas list.
  • Reach out to the organizations via their contact methods listed on the GSoC site immediately.
  • Engage with the organization early and often, good communication is key! You must talk to the organization about your proposal before the application period ends if you want to be accepted into the program.
  • Watch our Intro to GSoC video, as well as the GSoC Org Highlight videos and Community Talks series, to get inspired about projects that contributors have worked on in the past.

Interested contributors may register and submit project proposals on the GSoC site from now until Tuesday, April 2nd at 18:00 UTC.

Best of luck to all our applicants!

By Stephanie Taylor, Program Manager, and Lucila Ortiz, Associate Program Manager for the Google Open Source Programs Office

PJRT Plugin to Accelerate Machine Learning

PJRT is an open, stable interface for device runtime and compiler, which simplifies ML hardware and framework integration. With PJRT, ML frameworks become hardware-agnostic and ML hardware becomes pluggable. For the ML developer, it simplifies the adoption of new ML hardware and makes models more portable. This addresses ML infrastructure fragmentation across frameworks, compilers, and runtimes, enhancing the industry’s ability to productionize ML-driven advancements with velocity and at scale.

This article provides an overview of what building a PJRT plugin entails, how frameworks (and models) can use this plugin, and some updates on the PJRT API. PJRT is now used by a growing spectrum of hardware: Apple silicon, Google Cloud TPU, NVIDIA GPU, and Intel Max GPU. We also share a spotlight on Apple’s adoption of PJRT with some details on the workflow and performance.

If you’re developing an ML hardware accelerator or developing your own compiler and runtime, check out the PJRT source code on GitHub and sign up for the PJRT mailing list to quickly bootstrap your work.

What’s in a PJRT Plugin

PJRT was introduced to simplify the growing complexity of ML workload execution across hardware and frameworks. PJRT (used in conjunction with StableHLO) is a stable interface for device runtime and compiler, which abstracts away device specific implementations from frameworks.

An implementation of the PJRT API is called a PJRT plugin, which is usually a Python package for seamless ML model developer experience. To build a PJRT plugin for a hardware target, the following methods need to be implemented:

  • Compile: compile (program) -> executable
  • Runtime: execute (executable, arguments) -> results
  • Memory management: transfer buffer from host to device, device to host, device to device, as well as buffer management such as buffer donation
  • Topology information, such as the platform, how many accelerators there are, and how they are attached. (A conceptual sketch of these responsibilities follows this list.)
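To make the shape of those responsibilities concrete, here is a purely conceptual Python sketch. A real plugin implements the PJRT C API, so the class and method names below are illustrative only, not the actual interface.

from abc import ABC, abstractmethod
from typing import Any, Sequence

class PjrtPluginSketch(ABC):
    """Conceptual outline of a PJRT plugin's responsibilities.

    This is NOT the real PJRT C API; it only mirrors the four
    responsibilities listed above in plain Python for illustration.
    """

    @abstractmethod
    def compile(self, program: str) -> Any:
        """Compile a (StableHLO) program into a device executable."""

    @abstractmethod
    def execute(self, executable: Any, arguments: Sequence[Any]) -> Sequence[Any]:
        """Run a compiled executable with the given arguments and return results."""

    @abstractmethod
    def transfer(self, buffer: Any, src: str, dst: str) -> Any:
        """Move buffers host<->device or device<->device; handle buffer donation."""

    @abstractmethod
    def topology(self) -> dict:
        """Report the platform, accelerator count, and how accelerators are attached."""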

ML frameworks will discover and load one or multiple PJRT plugins and call the PJRT API to compile and execute the model. PJRT plugins may be required to register with the ML frameworks, depending on the specific discovery mechanism the framework uses.

API Updates

Versioning and ABI Compatibility

The PJRT API has a major version and a minor version. If the framework is newer than the plugin, the framework provides an N-week (N=6 today) forward compatibility window for minor version updates. Major version updates are coordinated updates; frameworks will not support plugins with a lower major version. If the plugin is newer than the framework, plugins define their own backward compatibility policy.
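As a rough illustration of that rule, the hypothetical sketch below checks whether a framework accepts a plugin, assuming each plugin minor version can be mapped to a release date. This is not how the check is actually implemented; it only restates the policy in code form.

from datetime import date, timedelta

FORWARD_COMPAT_WINDOW = timedelta(weeks=6)  # N = 6 today

def framework_accepts_plugin(framework_major, plugin_major,
                             plugin_minor_release_date, framework_build_date):
    # Major versions must match: major bumps are coordinated updates, and
    # frameworks do not support plugins with a lower major version.
    # (The plugin-newer-than-framework case follows the plugin's own
    # backward compatibility policy and is not modeled here.)
    if plugin_major != framework_major:
        return False
    # If the framework is newer than the plugin, it guarantees support only
    # for plugin minor versions released within the last N weeks.
    age = framework_build_date - plugin_minor_release_date
    return age <= FORWARD_COMPAT_WINDOW

print(framework_accepts_plugin(1, 1, date(2024, 3, 1), date(2024, 4, 1)))  # True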

Multi-Node

A PJRT client is per node, and the plugin may need some way to communicate among nodes in a distributed workload. The framework can pass key-value store callbacks to the plugin, which the plugin can use to bootstrap multi-node initialization and other coordination needs. An example with the NVIDIA GPU CUDA plugin is as follows (a user-level JAX sketch appears after the list):

  • JAX starts a distribution service and provides key-value store callbacks.
  • NVIDIA GPU CUDA plugin uses these callbacks to (1) generate global PJRT device topology that includes PJRT device information from all nodes, and (2) generate NCCL ids.
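From the ML developer’s perspective this bootstrapping is hidden behind the framework. The minimal JAX sketch below shows how a multi-node workload is launched; the coordinator address, process count, and process id are placeholders for a real launch configuration.

import jax

# jax.distributed.initialize() starts/uses the distribution service, and the
# key-value store callbacks it provides are what a PJRT plugin (such as the
# CUDA plugin) uses under the hood for multi-node bootstrapping.
jax.distributed.initialize(
    coordinator_address="10.0.0.1:1234",  # node 0's address (placeholder)
    num_processes=2,                       # total number of processes/nodes
    process_id=0,                          # this process's index
)

print(jax.devices())        # global devices across all nodes
print(jax.local_devices())  # devices attached to this node only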

DLPack

A few C APIs were added to PJRT to support DLPack.

  • PJRT_Client_CreateViewOfDeviceBuffer supports receiving buffers from DLPack.
  • Exporting buffers to DLPack uses PJRT_Buffer_IncreaseExternalReferenceCount and PJRT_Buffer_DecreaseExternalReferenceCount, together with PJRT_Buffer_OpaqueDeviceMemoryDataPointer to obtain the underlying device memory pointer. (A framework-level usage sketch follows below.)
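At the framework level, these hooks are what enable DLPack-based buffer exchange. The sketch below is a generic JAX/NumPy illustration (assuming reasonably recent JAX and NumPy versions and a CPU-resident buffer), not a direct call into the PJRT C API.

import jax.dlpack
import jax.numpy as jnp
import numpy as np

# A JAX array backed by a PJRT buffer.
x = jnp.arange(8, dtype=jnp.float32)

# Export: NumPy (1.22+) can consume any object implementing __dlpack__,
# which JAX arrays expose via the PJRT DLPack support described above.
x_np = np.from_dlpack(x)

# Import: recent JAX versions accept DLPack-capable arrays directly.
x_back = jax.dlpack.from_dlpack(x_np)
print(x_np, x_back)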

Extension

The PJRT API provides an extension mechanism through which a plugin can offer optional or experimental features. These extensions can have their own compatibility guarantees and do not need to follow the ABI compatibility of the PJRT API.

Industry Adoption

PJRT is the only interface for JAX, the primary interface for TensorFlow and fully supported for PyTorch through PyTorch/XLA. PJRT is not tied to a specific compiler and runtime. The toolchain-independent architecture and open-source availability as part of the OpenXLA Project allows it to be leveraged by any hardware, framework or compiler, with extensibility for unique features. This has allowed PJRT to be adopted by various industry partners through close collaboration. A brief account of Apple’s adoption of PJRT follows.

JAX on Apple Silicon

Apple’s PJRT plugin for the Metal training backend accelerates JAX models on Apple silicon and AMD GPUs. This empowers any ML developer to leverage the full potential of Apple silicon and AMD GPUs on their Apple hardware to accelerate JAX models for faster experimentation. The integration and user experience for accelerating JAX on Apple silicon GPUs is similar to the existing PyTorch and TensorFlow implementations.

The Metal plug-in uses the OpenXLA compiler and PJRT runtime to optimize and accelerate JAX workloads on GPU. When a JAX program is executed, the JAX graph is lowered into StableHLO, which is then passed to PJRT for compilation and execution. The StableHLO is converted to MPSGraph executables and the Metal runtime APIs are invoked to dispatch to the GPU.
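You can inspect the StableHLO that JAX hands to PJRT for any jitted function. In the sketch below, the function and input are arbitrary examples.

import jax
import jax.numpy as jnp

def f(x):
    # Arbitrary example computation.
    return jnp.sin(x) @ jnp.cos(x).T

x = jnp.ones((4, 4))

# jax.jit(...).lower(...) exposes the StableHLO module that JAX passes to the
# PJRT plugin (Metal, CUDA, TPU, ...) for compilation and execution.
print(jax.jit(f).lower(x).as_text())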

Performance

The Metal backend with the PJRT plugin provides impressive performance speedups for JAX. On an Apple MacBook Pro with M2 Max, training common networks in JAX sees speedups of up to 28x, with an average of 10x, over a CPU baseline. This empowers any ML developer to leverage the full potential of Apple silicon on their Apple hardware to accelerate JAX models for faster experimentation.

Figure 1: Performance speedups of up to 28x on Apple MacBook Pro with M2 Max over CPU for JAX training.

Getting Started

Adding Metal support to JAX is as simple as a single pip install:

python -m pip install jax-metal
python -c 'import jax; print(jax.numpy.arange(10))'

For more details on environment setup and installation of JAX on Apple hardware, please refer to the Metal Developer Resources page.
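Once installed, a quick sanity check is to list the available devices and run a small jitted computation. This is a generic sketch; the exact platform name reported by jax.devices() depends on the plugin.

import jax
import jax.numpy as jnp

# Should list the Metal-backed device(s) once jax-metal is installed.
print(jax.devices())

@jax.jit
def predict(w, x):
    return jnp.tanh(x @ w)

w = jax.random.normal(jax.random.PRNGKey(0), (128, 64))
x = jax.random.normal(jax.random.PRNGKey(1), (32, 128))
print(predict(w, x).shape)  # (32, 64), computed on the default device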

Google Cloud TPU

PJRT is the default runtime for PyTorch 2.0 on Google Cloud TPU. GitHub Readme has more details.
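For illustration, here is a minimal PyTorch/XLA sketch that targets a TPU through the PJRT runtime; it assumes a recent torch_xla release, and the tensor shapes are arbitrary.

import os
os.environ.setdefault("PJRT_DEVICE", "TPU")  # select the PJRT TPU runtime

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()              # an XLA device backed by PJRT
x = torch.randn(4, 4, device=device)
y = (x @ x).sum()
xm.mark_step()                        # materialize pending ops on the device
print(y.item())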

NVIDIA GPU

The NVIDIA GPU CUDA implementation in JAX is extracted and packaged as a PJRT plugin. ML model developers can install the NVIDIA GPU CUDA plugin from PyPI. This plugin uses the newly added features such as multi-node, DLPack, and extensions.

Intel GPU

Intel is leveraging PJRT in Intel® Extension for TensorFlow to provide the Intel GPU backend for TensorFlow, JAX and PyTorch. The example of executing a JAX program on Intel GPU demonstrates how this greatly simplifies the framework and hardware integration.

PJRT Resources

PJRT is available on GitHub: source code for the API, integration guides and issues. If you develop ML frameworks, compilers, runtimes or are interested in improving portability of workloads across hardware, we want your feedback. We encourage you to contribute code, design ideas and feature suggestions. We also invite you to join the PJRT mailing list to stay updated with the latest product and community announcements and to help shape the future of an interoperable ML infrastructure.

Acknowledgements

Chalana Bezawada, Daniel Doctor, Kulin Seth, Shuhan Ding from Apple 
Penporn Koanantakool, Peter Hawkins, Skye Wanderman-Milne, Xiao Yu from Google.

By Aman Verma – Product Manager, Machine Learning Infrastructure, Google and Jieying Luo – Software Engineer, Machine Learning Infrastructure, Google

Mentor organizations announced for Google Summer of Code 2024

We are thrilled to share that we have 195 open source projects that have been selected for Google Summer of Code (GSoC) 2024! This year we are excited to welcome 30 new organizations for their first year as part of the program.

Check out our program site to view the complete list of GSoC 2024 accepted mentoring organizations. Get to know more about each organization on their GSoC program page, which includes reading through the project ideas that they are looking for GSoC contributors to work on this year.

Are you interested in being a GSoC Contributor?

The 2024 GSoC program is open to students and to beginners in open source software development. Contributor applications will open on Monday, March 18, 2024 at 18:00 UTC with a deadline of Tuesday, April 2, 2024 18:00 UTC to submit your application (including your project proposal).

If you are eager to enhance your chances of becoming a successful contributor this year, we highly recommend beginning your preparations and initiating communication with the organizations that interest you right away. Below are some tips for prospective GSoC contributors to accomplish before the application period begins March 18th:

  • Watch our ‘Introduction to GSoC’ video to see a quick overview of the program, and view our Community Talks or Org Highlight Videos to get inspired and learn more about some projects that contributors have worked on in the past.
  • Check out the Contributor Guide (so much great info in here!) and Advice for Applying to GSoC doc.
  • Review the list of accepted organizations here. We recommend finding two to four that interest you and reading through their project ideas lists. Use the filters on the site to help you narrow down based on the programming languages you are familiar with and the categories that interest you (cloud, AI, security, science, etc.).
  • As soon as you see an idea that sparks your interest, reach out to the organization via their preferred communication methods (listed on their org page on the GSoC program site). The earlier you start the conversation, the better your chances of being accepted as a GSoC contributor.
  • Talk with the mentors and community to determine if this project idea is something you would enjoy working on during the program. Find a project that excites you, otherwise it may be a challenging summer for you and your mentor.
  • Use the information you received during your communications with the mentors and other org community members to write up your proposal.

You can find more information about the program on our website which includes a full timeline of important dates. We also urge anyone interested in applying to read the FAQ and Program Rules and watch some of our other videos with more details about GSoC for contributors and mentors.

A hearty welcome—and thank you—to all of our mentor organizations. We look forward to working with all of you during this 20th year of Google Summer of Code!

By Stephanie Taylor – Google Open Source

Building Open Models Responsibly in the Gemini Era

Google has long believed that open technology is not only good for our company, but good for the industry, consumers, and the world. We’ve released open-source projects like Android and Chromium that transformed access to mobile and web technologies, and have done the same in AI with Transformers, TensorFlow, and AlphaFold. The release of our Gemma family of open models is a next step in how we’re deepening our commitment to open technology alongside an industry-leading safe, responsible approach. At the same time, the rapidly evolving nature of AI raises important considerations for how to enable safety-aligned open models: an approach that supports broad innovation while promoting safe uses.

A benefit of open source is that once it is released, its license gives users full creative autonomy. This is a powerful guarantee of technology access for developers and end users. Another benefit is that open-source technology can be modified to fit the unique use case of the end user, without restriction.

In the hands of a malicious actor, however, the lack of restrictions can raise risks. Computing has been through similar cycles before, addressing issues such as protecting users of the open internet, handling cryptography, and addressing open-source software security. We now face this challenge with AI. Below we share the approach we took to openly releasing Gemma models, and the advancements in open model safety we hope to accelerate.


Providing access to Gemma open models

Today, Gemma models are being released as what the industry collectively has begun to refer to as “open models.” Open models feature free access to the model weights, but terms of use, redistribution, and variant ownership vary according to a model’s specific terms, which may not be based on an open-source license. The Gemma models’ terms of use make them freely available for individual developers, researchers, and commercial users for access and redistribution. Users are also free to create and publish model variants. In using Gemma models, developers agree to avoid harmful uses, reflecting our commitment to developing AI responsibly while increasing access to this technology.

We’re precise about the language we’re using to describe Gemma models because we’re proud to enable responsible AI access and innovation, and we’re equally proud supporters of open source. The definition of "Open Source" has been invaluable to computing and innovation because of requirements for redistribution and derived works, and against discrimination. These requirements enable cross-industry collaboration, individual innovation and entrepreneurship, and shared research to happen with exponential effects.

However, existing open-source concepts can’t always be directly applied to AI systems, which raises questions on how to use open-source licenses with AI. It’s important that we carry forward open principles that have made the sea-change we’re experiencing with AI possible while clarifying the concept of open-source AI and addressing concepts like derived work and author attribution.


Taking a comprehensive approach to releasing Gemma safely and responsibly

Licensing and terms of use are only one part of the evaluations, technical tools, and considered decision-making that went into aligning this release with our responsible AI Principles. Our approach involved:

  • Systematic internal review in accordance with our AI Principles: Consistent with our AI Principles, we release models only when we have determined the benefits are significant, and the risks of misuse are low or can be mitigated. We take that same approach to open models, incorporating a balance of the benefits of wider access to a particular model as well as the risks of misuse and how we can mitigate them. With Gemma, we considered the increased AI research and innovation by us and many others in the community, the access to AI technology the models could bring, and what access was needed to support these use cases.
  • A high evaluation bar: Gemma models underwent thorough evaluations, and were held to a higher bar for evaluating risk of abuse or harm than our proprietary models, given the more limited mitigations currently available for open models. These evaluations cover a broad range of responsible AI areas, including safety, fairness, privacy, societal risk, as well as capabilities such as chemical, biological, radiological, nuclear (CBRN) risks, cybersecurity, and autonomous replication. As described in our technical report, the Gemma models exhibit state-of-the-art safety performance in human side-by-side evaluations.
  • Responsibility tools for developers: As we release the Gemma models, we are also releasing a Responsible Generative AI Toolkit for developers, providing guidance and tools to help them create safer AI applications.

We continue to evolve our approach. As we build these frameworks further, we will proceed thoughtfully and incorporate what we learn into future model assessments. We will continue to explore the full range of access mechanisms, with benefits and risk mitigation in mind, including API-based access and staged releases.


Advancing open model safety together

Many of today’s AI safety tools are designed for systems where the design approach assumes restricted access and redistribution, as well as auxiliary controls like query filters. Similarly, much of the AI safety research for improving mitigations takes on the design assumptions of those systems. Just as we have created unique threat models and solutions for other open technology, we are developing safety and security tools appropriate for the differences of openly available AI.

As models become more and more capable, we are conducting research and investing in rigorous safety evaluation, testing, and mitigations for open models. We are also actively participating in conversations with policymakers and open-source community leaders on how the industry should approach this technology. This challenge is multifaceted, just like AI systems themselves. Model-sharing platforms like Hugging Face and Kaggle, where developers inspire each other with novel model iterations, play a critical role in efforts to develop open models safely; there is also a role for the cybersecurity community to contribute learnings and best practices.

Building those solutions requires access to open models and the sharing of innovations and improvements. We believe sharing the Gemma models will not just help increase access to AI technology, but also help the industry develop new approaches to safety and responsibility.

As developers adopt Gemma models and other safety-aligned open models, we look forward to working with the open-source community to develop more solutions for responsible approaches to AI in the open ecosystem. A global diversity of experiences, perspectives, and opportunities will help build safe and responsible AI that works for everyone.

By Anne Bertucio – Sr Program Manager, Open Source Programs Office; Helen King – Sr Director of Responsibility, Google DeepMind

Magika: AI powered fast and efficient file type identification

Today we are open-sourcing Magika, Google’s AI-powered file-type identification system, to help others accurately detect binary and textual file types. Under the hood, Magika employs a custom, highly optimized deep-learning model, enabling precise file identification within milliseconds, even when running on a CPU.

Magika command line tool used to identify the file types of a diverse set of files

You can try the Magika web demo today, or install it as a Python library and standalone command line tool (output is showcased above) by using the standard command line pip install magika.
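If you use the Python library, identification from in-memory bytes looks roughly like the sketch below; the exact result attribute names may differ slightly between Magika releases.

from magika import Magika

# A minimal sketch of the Python API (assuming Magika is installed via pip).
m = Magika()
result = m.identify_bytes(b"#!/usr/bin/env python3\nprint('hello')\n")
print(result.output.ct_label, result.output.score)  # e.g. a "python" label and a confidence score

# Files on disk can be identified with m.identify_path(pathlib.Path("some/file")).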

Why identifying file type is difficult

Since the early days of computing, accurately detecting file types has been crucial in determining how to process files. Linux comes equipped with libmagic and the file utility, which have served as the de facto standard for file type identification for over 50 years. Today web browsers, code editors, and countless other software rely on file-type detection to decide how to properly render a file. For example, modern code editors use file-type detection to choose which syntax coloring scheme to use as the developer starts typing in a new file.

Accurate file-type detection is a notoriously difficult problem because each file format has a different structure, or no structure at all. This is particularly challenging for textual formats and programming languages as they have very similar constructs. So far, libmagic and most other file-type-identification software have been relying on a handcrafted collection of heuristics and custom rules to detect each file format.

This manual approach is both time consuming and error prone as it is hard for humans to create generalized rules by hand. In particular for security applications, creating dependable detection is especially challenging as attackers are constantly attempting to confuse detection with adversarially-crafted payloads.

To address this issue and provide fast and accurate file-type detection, we researched and developed Magika, a new AI-powered file type detector. Under the hood, Magika uses a custom, highly optimized deep-learning model designed and trained using Keras that only weighs about 1MB. At inference time Magika uses ONNX as an inference engine to ensure files are identified in a matter of milliseconds, almost as fast as a non-AI tool, even on CPU.

Magika Performance

Magika detection quality compared to other tools on our 1M files benchmark

Performance-wise, thanks to its AI model and large training dataset, Magika is able to outperform other existing tools by about 20% when evaluated on a 1M files benchmark that encompasses over 100 file types. Breaking down by file type, as reported in the table below, we see even greater performance gains on textual files, including code files and configuration files that other tools can struggle with.

Various file type identification tools' performance for a selection of the file types included in our benchmark; n/a indicates the tool doesn’t detect the given file type.

Magika at Google

Internally, Magika is used at scale to help improve Google users’ safety by routing Gmail, Drive, and Safe Browsing files to the proper security and content policy scanners. Looking at a weekly average of hundreds of billions of files reveals that Magika improves file type identification accuracy by 50% compared to our previous system that relied on handcrafted rules. In particular, this increase in accuracy allows us to scan 11% more files with our specialized malicious AI document scanners and reduce the number of unidentified files to 3%.

The upcoming integration of Magika with VirusTotal will complement the platform's existing Code Insight functionality, which employs Google's generative AI to analyze and detect malicious code. Magika will act as a pre-filter before files are analyzed by Code Insight, improving the platform’s efficiency and accuracy. This integration, due to VirusTotal’s collaborative nature, directly contributes to the global cybersecurity ecosystem, fostering a safer digital environment.

Open Sourcing Magika

By open-sourcing Magika, we aim to help other software improve their file identification accuracy and offer researchers a reliable method for identifying file types at scale.

Magika code and model are freely available starting today on GitHub under the Apache 2.0 License. Magika can also quickly be installed as a standalone utility and Python library from PyPI by simply typing pip install magika, with no GPU required. We also have an experimental npm package if you would like to use the TFJS version.

To learn more about how to use it, please refer to the Magika documentation site.


Acknowledgements

Magika would not have been possible without the help of many people, including: Ange Albertini, Loua Farah, Francois Galilee, Giancarlo Metitieri, Luca Invernizzi, Young Maeng, Alex Petit-Bianco, David Tao, Kurt Thomas, and Amanda Walker.

By Elie Bursztein – Cybersecurity AI Technical and Research Lead and Yanick Fratantonio – Cybersecurity Research Scientist
