The Llama architecture, developed by Meta AI, is a dense model trained on a large corpus of text data.
At the time of writing (16/08/2024), Llama has three major versions and one minor version:
the major versions are Llama1, Llama2, and Llama3, and the minor version is Llama3.1.
Llama is a transformer-based, decoder-only architecture.
Llama relies on multi-head self-attention, with Grouped-Query Attention (GQA) adopted in its later versions, to achieve state-of-the-art performance on various NLP tasks.
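Since GQA is the key attention variant mentioned above, here is a minimal PyTorch sketch of grouped-query attention, where several query heads share each key/value head. The shapes and head counts are illustrative, not the real Llama configuration; a real implementation also applies rotary position embeddings and a KV cache, both omitted here.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    _, n_heads, seq, head_dim = q.shape
    n_kv_heads = k.shape[1]
    group = n_heads // n_kv_heads
    # Each KV head is shared by `group` query heads: expand K/V along the head axis.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    # Causal mask, since Llama is a decoder-only model.
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Illustrative shapes: 8 query heads sharing 2 KV heads.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```

Sharing KV heads shrinks the KV cache by the factor `n_heads / n_kv_heads`, which is the main reason the larger Llama variants adopted GQA.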
Llama models support several quantization techniques, summarized in the table below; a minimal weight-only sketch follows the table.
| Quantization Technique | KV cache precision | Support | Updated |
| --- | --- | --- | --- |
| FP16 | FP16 | Yes | No |
| FP8 | FP16 | Yes | Yes |
| FP8 | FP8 | Yes | Yes |
| int8_weight_only | FP16 | Yes | Yes |
| int4_weight_only | FP16 | Yes | No |
| w4a16_awq | FP16 | Yes | Yes |
| w4a8_awq | FP16 | Yes | Yes |
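As an illustration of the int8_weight_only row above, here is a minimal sketch of weight-only int8 quantization: weights are stored as int8 with one scale per output channel and dequantized at matmul time, while activations (and the KV cache, per the table) stay in floating point. This is an illustrative PyTorch sketch, not the fused kernel a real inference runtime would use.

```python
import torch

def quantize_int8_weight_only(w):
    # Symmetric per-output-channel quantization: one FP scale per row of W,
    # chosen so the largest weight in the row maps to +/-127.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    return torch.round(w / scale).to(torch.int8), scale

def int8_linear(x, w_int8, scale):
    # Dequantize on the fly, then fall back to an ordinary matmul.
    return x @ (w_int8.float() * scale).t()

torch.manual_seed(0)
w = torch.randn(256, 256)  # hypothetical FP32 weight matrix
w_int8, scale = quantize_int8_weight_only(w)
x = torch.randn(4, 256)
err = (int8_linear(x, w_int8, scale) - x @ w.t()).abs().max().item()
print(w_int8.dtype, f"max abs error: {err:.4f}")  # torch.int8, small error
```

Weight-only schemes like this one halve (int8) or quarter (int4) the weight memory while leaving the math in floating point, which is why the KV cache column stays FP16 for those rows.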
The table below lists each Llama version, its model variants, and their context windows.

| Model version | Model | Context size (tokens) |
| --- | --- | --- |
| Llama | Llama-7b | 2,048 |
| Llama | Llama-13b | 2,048 |
| Llama | Llama-33b | 2,048 |
| Llama | Llama-65b | 2,048 |
| Llama2 | Llama-7b | 4,096 |
| Llama2 | Llama-13b | 4,096 |
| Llama2 | Llama-70b | 4,096 |
| Llama3 | Llama-8b | 8,192 |
| Llama3 | Llama-70b | 8,192 |
| Llama3.1 | Llama-8b | 131,072 |
| Llama3.1 | Llama-70b | 131,072 |
| Llama3.1 | Llama-405b | 131,072 |
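To make the context sizes above concrete, here is a back-of-the-envelope estimate of FP16 KV-cache memory per sequence. The defaults below are the published Llama3.1-8b configuration (32 layers, 8 KV heads, head dimension 128); treat the result as a rough lower bound, since runtimes add their own allocation overheads.

```python
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # 2x for keys and values, per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

for ctx in (2_048, 8_192, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB per sequence")
# 2,048 tokens -> 0.25 GiB; 8,192 tokens -> 1.00 GiB; 131,072 tokens -> 16.00 GiB
```

The 16x jump from Llama3's 8,192-token window to Llama3.1's 131,072-token window multiplies KV-cache memory by the same factor, which is why the quantized KV-cache options (FP8) in the table above matter for long-context serving.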