Llama


Model overview

Llama, developed by Meta AI, is a family of dense models trained on a large corpus of text data.

As of 16/08/2024, Llama has three major versions and one minor version.

The major versions are Llama1, Llama2, and Llama3. The minor version is Llama3.1.

Model details

Llama is a transformer-based model with a decoder-only architecture.

Llama uses self-attention in the form of Multi-Head Attention and, in newer models, Grouped-Query Attention (GQA) to achieve strong performance on various NLP tasks.
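In grouped-query attention, several query heads share a single key/value head, which shrinks the KV cache compared with giving every query head its own keys and values. The sketch below shows only the head mapping, not a full attention layer. The function name and the head counts (32 query heads, 8 key/value heads) are illustrative assumptions, not values taken from this page.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the key/value head that a query head reads from.

    Query heads are split into n_kv_heads equal groups; every query head
    in a group attends using the same key/value head.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


# Assumed, illustrative head counts.
N_Q_HEADS = 32
N_KV_HEADS = 8

# mapping[h] is the key/value head used by query head h.
mapping = [kv_head_for_query_head(h, N_Q_HEADS, N_KV_HEADS) for h in range(N_Q_HEADS)]
print(mapping[:8])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With `n_kv_heads == n_q_heads` the mapping is one-to-one, which corresponds to ordinary multi-head attention; with `n_kv_heads == 1` every query head shares one key/value head.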

Hugging Face model

Supported quantization

Llama models support the following quantization techniques.

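| Quantization technique | KV cache | Support | Updated |
| --- | --- | --- | --- |
| FP16 | FP16 | Yes | No |
| FP8 | FP16 | Yes | Yes |
| FP8 | FP8 | Yes | Yes |
| int8_weight_only | FP16 | Yes | Yes |
| int4_weight_only | FP16 | Yes | No |
| w4a16_awq | FP16 | Yes | Yes |
| w4a8_awq | FP16 | Yes | Yes |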
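To make the idea behind a weight-only int8 scheme concrete, here is a minimal sketch of symmetric per-row quantization: weights are stored as 8-bit integers plus one floating-point scale, and are dequantized before use. This is a generic illustration under stated assumptions, not the implementation behind `int8_weight_only` in any particular runtime; the function names and sample weights are hypothetical.

```python
def quantize_int8_weight_only(row):
    """Symmetric per-row quantization: returns (int8 values, scale)."""
    max_abs = max(abs(w) for w in row)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero row.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in row]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floating-point weights from int8 values."""
    return [v * scale for v in quantized]


# Hypothetical weights for one row of a weight matrix.
weights = [0.12, -0.5, 0.33, 0.0]
q, scale = quantize_int8_weight_only(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, max_error <= scale / 2 + 1e-9)
```

Activations stay in higher precision in a weight-only scheme; only the stored weights shrink, here from 16 bits to 8 bits per value plus one scale per row.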

Model versions

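| Model version | Model | Context size (tokens) |
| --- | --- | --- |
| Llama | Llama-7b | 2,048 |
| Llama | Llama-13b | 2,048 |
| Llama | Llama-33b | 2,048 |
| Llama | Llama-65b | 2,048 |
| Llama2 | Llama-7b | 4,096 |
| Llama2 | Llama-13b | 4,096 |
| Llama2 | Llama-70b | 4,096 |
| Llama3 | Llama-8b | 8,192 |
| Llama3 | Llama-70b | 8,192 |
| Llama3.1 | Llama-8b | 131,072 |
| Llama3.1 | Llama-70b | 131,072 |
| Llama3.1 | Llama-405b | 131,072 |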
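Context size and KV cache precision together determine how much memory the KV cache needs. The sketch below estimates that size with the usual formula: two tensors (keys and values) per layer, each of shape KV heads × head dimension × sequence length. The layer count, KV head count, and head dimension used here (32, 8, 128) are assumptions chosen for illustration and are not listed on this page; the 8,192-token context is the Llama3 value from the table above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value, batch_size=1):
    """Estimate KV cache size in bytes for one full-length sequence per batch item."""
    # Factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


GIB = 1024 ** 3

# Assumed model shape (illustrative): 32 layers, 8 KV heads, head dimension 128.
fp16_bytes = kv_cache_bytes(32, 8, 128, seq_len=8192, bytes_per_value=2)
fp8_bytes = kv_cache_bytes(32, 8, 128, seq_len=8192, bytes_per_value=1)

print(fp16_bytes / GIB)  # 1.0
print(fp8_bytes / GIB)   # 0.5
```

Under these assumptions, moving the KV cache from FP16 to FP8 halves its footprint, and the footprint grows linearly with context size, so the same shape at 131,072 tokens would need 16 times as much memory as at 8,192 tokens.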