The Llama architecture, developed by Meta AI, is a dense model trained on a large corpus of text data.
At the time of writing (16/08/2024), Llama has three major versions and one minor version:
the major versions are Llama1, Llama2, and Llama3, and the minor version is Llama3.1.
Llama is a transformer-based, decoder-only architecture.
Llama relies on multi-head self-attention, with Grouped-Query Attention (GQA) adopted in its later versions, to achieve state-of-the-art performance on various NLP tasks.
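Since GQA is the key attention variant mentioned above, here is a minimal PyTorch sketch of grouped-query attention, where several query heads share each key/value head. The shapes and head counts are illustrative, not the real Llama configuration; a real implementation also applies rotary position embeddings and a KV cache, both omitted here.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    _, n_heads, seq, head_dim = q.shape
    n_kv_heads = k.shape[1]
    group = n_heads // n_kv_heads
    # Each KV head is shared by `group` query heads: expand K/V along the head axis.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    # Causal mask, since Llama is a decoder-only model.
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Illustrative shapes: 8 query heads sharing 2 KV heads.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```

Sharing KV heads shrinks the KV cache by the factor `n_heads / n_kv_heads`, which is the main reason the larger Llama variants adopted GQA.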
Llama models support several quantization techniques, summarized in the table below; a minimal weight-only sketch follows the table.
| Quantization Technique | KV cache precision | Support | Updated |
| --- | --- | --- | --- |
| FP16 | FP16 | Yes | No |
| FP8 | FP16 | Yes | Yes |
| FP8 | FP8 | Yes | Yes |
| int8_weight_only | FP16 | Yes | Yes |
| int4_weight_only | FP16 | Yes | No |
| w4a16_awq | FP16 | Yes | Yes |
| w4a8_awq | FP16 | Yes | Yes |
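As an illustration of the int8_weight_only row above, here is a minimal sketch of weight-only int8 quantization: weights are stored as int8 with one scale per output channel and dequantized at matmul time, while activations (and the KV cache, per the table) stay in floating point. This is an illustrative PyTorch sketch, not the fused kernel a real inference runtime would use.

```python
import torch

def quantize_int8_weight_only(w):
    # Symmetric per-output-channel quantization: one FP scale per row of W,
    # chosen so the largest weight in the row maps to +/-127.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    return torch.round(w / scale).to(torch.int8), scale

def int8_linear(x, w_int8, scale):
    # Dequantize on the fly, then fall back to an ordinary matmul.
    return x @ (w_int8.float() * scale).t()

torch.manual_seed(0)
w = torch.randn(256, 256)  # hypothetical FP32 weight matrix
w_int8, scale = quantize_int8_weight_only(w)
x = torch.randn(4, 256)
err = (int8_linear(x, w_int8, scale) - x @ w.t()).abs().max().item()
print(w_int8.dtype, f"max abs error: {err:.4f}")  # torch.int8, small error
```

Weight-only schemes like this one halve (int8) or quarter (int4) the weight memory while leaving the math in floating point, which is why the KV cache column stays FP16 for those rows.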
The table below lists each Llama version, its model variants, and their context windows.

| Model version | Model | Context size (tokens) |
| --- | --- | --- |
| Llama | Llama-7b | 2,048 |
| Llama | Llama-13b | 2,048 |
| Llama | Llama-33b | 2,048 |
| Llama | Llama-65b | 2,048 |
| Llama2 | Llama-7b | 4,096 |
| Llama2 | Llama-13b | 4,096 |
| Llama2 | Llama-70b | 4,096 |
| Llama3 | Llama-8b | 8,192 |
| Llama3 | Llama-70b | 8,192 |
| Llama3.1 | Llama-8b | 131,072 |
| Llama3.1 | Llama-70b | 131,072 |
| Llama3.1 | Llama-405b | 131,072 |
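To make the context sizes above concrete, here is a back-of-the-envelope estimate of FP16 KV-cache memory per sequence. The defaults below are the published Llama3.1-8b configuration (32 layers, 8 KV heads, head dimension 128); treat the result as a rough lower bound, since runtimes add their own allocation overheads.

```python
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # 2x for keys and values, per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

for ctx in (2_048, 8_192, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB per sequence")
# 2,048 tokens -> 0.25 GiB; 8,192 tokens -> 1.00 GiB; 131,072 tokens -> 16.00 GiB
```

The 16x jump from Llama3's 8,192-token window to Llama3.1's 131,072-token window multiplies KV-cache memory by the same factor, which is why the quantized KV-cache options (FP8) in the table above matter for long-context serving.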