Google Launches Gemma 4 Models and Shifts to Apache 2.0 License

Key Points
- Google released four Gemma 4 models for local and mobile use.
- 26B Mixture of Experts and 31B Dense run unquantized on a single 80GB Nvidia H100 GPU.
- Quantized versions can operate on consumer GPUs, expanding accessibility.
- Effective 2B and Effective 4B are optimized for smartphones, Raspberry Pi, and Jetson Nano.
- Latency improvements include activation of only 3.8 billion parameters in the 26B MoE model.
- Google switched from a custom license to the Apache 2.0 license for all Gemma models.
- The 31B Dense model is projected to rank third on the Arena open‑model leaderboard.
- Collaboration with Qualcomm and MediaTek helped achieve low memory usage and near‑zero latency on edge devices.
Google introduced the Gemma 4 family of open-weight AI models, offering four variants optimized for local execution and mobile devices. The two larger models—26B Mixture of Experts and 31B Dense—run unquantized on a single 80GB Nvidia H100 GPU and can be quantized for consumer GPUs. The smaller Effective 2B and Effective 4B models target smartphones and edge hardware, benefiting from collaboration with Qualcomm and MediaTek. Google also replaced its custom Gemma license with the Apache 2.0 license, giving developers greater freedom. The company claims the Gemma 4 models are the most capable locally runnable AI systems, positioning them near the top of open AI model rankings.
New Gemma 4 Models
Google announced the Gemma 4 series, expanding its portfolio of open-weight artificial intelligence models. The family includes four sizes designed for different deployment scenarios, from high‑performance servers to mobile and edge devices. By providing models that can run locally, Google aims to give developers more control over inference environments and reduce reliance on cloud services.
Hardware and Performance
The two larger variants—named 26B Mixture of Experts (MoE) and 31B Dense—are built to run unquantized in bfloat16 format on a single 80GB Nvidia H100 GPU. While the H100 is a high‑end AI accelerator, Google notes that quantized versions of these models can run on consumer‑grade GPUs, widening accessibility. A key improvement is reduced latency: the 26B MoE model activates only 3.8 billion of its 26 billion parameters per token, delivering higher tokens‑per‑second throughput than similarly sized competitors. The 31B Dense model prioritizes output quality and is intended as a base for fine‑tuning to specific applications.
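A rough sanity check of the figures above can be done with back‑of‑the‑envelope arithmetic. This sketch is illustrative only (the function and constants are not from Google): it assumes weights dominate memory, with bfloat16 at 2 bytes per parameter and 4‑bit quantization at 0.5 bytes per parameter, and ignores activations and KV cache.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (using 1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# 31B Dense in bfloat16: 62.0 GB of weights, within an 80GB H100.
dense_bf16 = weight_memory_gb(31, 2.0)

# The same model quantized to 4 bits: 15.5 GB, consumer-GPU territory.
dense_4bit = weight_memory_gb(31, 0.5)

# MoE: only 3.8B of 26B parameters are active per token, so per-token
# compute resembles a ~3.8B dense model, even though all 26B weights
# must still be resident in memory.
active_fraction = 3.8 / 26  # roughly 15% of parameters active
```

This illustrates why a sparse MoE model can be faster than a dense model of the same total size: memory requirements scale with total parameters, but per‑token compute scales with active parameters.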
Mobile‑Optimized Variants
Effective 2B (E2B) and Effective 4B (E4B) are the smaller Gemma 4 models aimed at mobile and edge devices. Google worked closely with Qualcomm and MediaTek to optimize these models for smartphones, Raspberry Pi boards, and Jetson Nano platforms. The designs keep memory usage low during inference and promise “near‑zero latency,” offering a more efficient alternative to the previous Gemma 3 models.
Licensing Change
Responding to developer feedback about licensing constraints, Google has replaced its custom Gemma license with the Apache 2.0 license. This shift gives developers broader freedom to use, modify, and distribute the models without the restrictions previously imposed by the proprietary license.
Competitive Position
Google asserts that the Gemma 4 models are the most capable AI systems that can be run on local hardware. It predicts that the 31B Dense variant will rank third on the Arena list of top open AI models, trailing only GLM‑5 and Kimi 2.5. Despite this high ranking, the Gemma 4 models remain a fraction of the size of the leading competitors, potentially lowering operational costs for users.