Gemma, developed by Google, is a dense model architecture.
As of 16 August 2024, Gemma has two major versions: Gemma1 and Gemma2.
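Gemma models support quantization. The table below lists each combination of quantization technique and KV cache precision.

| Quantization technique | KV cache precision | Supported | Updated |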
|---|---|---|---|
| FP16 | FP16 | Yes | Yes |
| FP8 | FP16 | Yes | No |
| FP8 | FP8 | Yes | No |
| int8_weight_only | FP16 | Yes | Yes |
| int4_weight_only | FP16 | Yes | Yes |
| w4a16_awq | FP16 | No | No |
| w4a8_awq | FP16 | No | No |

The following models are covered:

| Model version | Model | Context size (tokens) |
|---|---|---|
| Gemma | gemma-2b | 8,192 |
| Gemma | gemma-7b | 8,192 |
| Gemma2 | gemma2-2b | 8,192 |
| Gemma2 | gemma2-9b | 8,192 |
| Gemma2 | gemma2-27b | 8,192 |
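
As a quick way to check a planned configuration against the two tables above, the sketch below transcribes them into plain Python dictionaries. The names `QUANT_SUPPORT`, `CONTEXT_SIZE`, `is_supported`, and `fits_context` are hypothetical helpers introduced here for illustration and are not part of any library. The context check assumes that prompt tokens and generated tokens together must fit within the listed context size.

```python
# Hypothetical lookup tables transcribed from the tables above.
# Key: (quantization technique, KV cache precision) -> supported?
QUANT_SUPPORT = {
    ("FP16", "FP16"): True,
    ("FP8", "FP16"): True,
    ("FP8", "FP8"): True,
    ("int8_weight_only", "FP16"): True,
    ("int4_weight_only", "FP16"): True,
    ("w4a16_awq", "FP16"): False,
    ("w4a8_awq", "FP16"): False,
}

# Key: model name as listed above -> context size in tokens.
CONTEXT_SIZE = {
    "gemma-2b": 8192,
    "gemma-7b": 8192,
    "gemma2-2b": 8192,
    "gemma2-9b": 8192,
    "gemma2-27b": 8192,
}


def is_supported(technique: str, kv_cache: str) -> bool:
    """Return True only if the combination is listed as supported."""
    # Combinations that do not appear in the table are treated as unsupported.
    return QUANT_SUPPORT.get((technique, kv_cache), False)


def fits_context(model: str, prompt_tokens: int, max_new_tokens: int) -> bool:
    """Check that prompt plus generated tokens fit in the model's context.

    Assumption: the context size bounds the sum of both token counts.
    Raises KeyError for a model that is not listed in the table.
    """
    return prompt_tokens + max_new_tokens <= CONTEXT_SIZE[model]


if __name__ == "__main__":
    print(is_supported("FP8", "FP8"))
    print(is_supported("w4a16_awq", "FP16"))
    print(fits_context("gemma2-9b", 8000, 192))
```

Treating unlisted combinations as unsupported is a conservative choice for this sketch; a combination missing from the table may simply be untested.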