LLM
Quantize.Float16 is a web-based tool for comparing the inference speed of LLM models across different quantization techniques and KV cache settings.
H100x1
Model | Quantization | KV Cache | Batch size | Tokens per sec | Category |
---|---|---|---|---|---|
Llama3-8b | w4a8_awq | FP16 | 1 | 244.54 | Best latency |
Llama3-8b | FP8 | FP8 | 32 | 1,947.10 | Best throughput |
Gemma2-27b | FP16 | FP16 | 1 | 47.23 | Best latency |
Gemma2-27b | FP16 | FP16 | 8 | 251.34 | Best throughput |
RecurrentGemma-9b | FP16 | FP16 | 1 | 114.68 | Best latency |
RecurrentGemma-9b | FP16 | FP16 | 128 | 1,940.87 | Best throughput |
Mamba2-2.7b | FP16 | FP16 | 1 | 224.34 | Best latency |
Mamba2-2.7b | FP16 | FP16 | 192 | 2,948.73 | Best throughput |
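The quantization labels encode weight and activation precision: w4a8_awq, for example, stores weights in 4 bits and activations in 8 bits using AWQ (activation-aware weight quantization). The following NumPy sketch illustrates only the "w4" part, per-group symmetric 4-bit weight quantization; AWQ's activation-aware channel rescaling is omitted, and the group size of 128 is an assumption chosen for illustration.

```python
import numpy as np

def quantize_w4_per_group(w, group_size=128):
    """Symmetric 4-bit per-group weight quantization: the 'w4' in
    w4a8_awq / w4a16_awq. AWQ additionally rescales salient channels
    based on activation statistics, which this sketch omits."""
    groups = w.reshape(-1, group_size)
    # One scale per group; map the group's max magnitude to int4 level 7.
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Toy weight matrix standing in for one linear layer.
w = np.random.randn(4096, 128).astype(np.float32)
q, s = quantize_w4_per_group(w.ravel())
err = np.abs(dequantize(q, s).ravel() - w.ravel()).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```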
Notes
- Best latency refers to the fastest completion of a single request. (batch size = 1, input length = 512, output length = 32)
- Best throughput refers to the highest number of output tokens processed per second across a batch of concurrent requests. (batch size varies by model, as shown above; input length = 512, output length = 32)
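A tokens-per-second figure like those above is typically wall-clock based: run one batched request to completion and divide the total output tokens by the elapsed time. A minimal sketch of such a harness, assuming a hypothetical engine-specific `generate_batch` callable (not part of any specific API):

```python
import time

# Benchmark shape from the notes above.
INPUT_LEN, OUTPUT_LEN = 512, 32

def tokens_per_sec(generate_batch, prompts):
    """Wall-clock output tokens per second for one batched request.

    `generate_batch` is a hypothetical, engine-specific callable that
    takes a list of prompts and returns one completion per prompt,
    each OUTPUT_LEN tokens long.
    """
    start = time.perf_counter()
    outputs = generate_batch(prompts)
    elapsed = time.perf_counter() - start
    return len(outputs) * OUTPUT_LEN / elapsed

# Usage: tokens_per_sec(engine.generate, prompts), where len(prompts)
# equals the batch size being measured.
```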
Configuration
TensorRT-LLM version: 0.11.0-rel
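For reference, a single configuration could be exercised roughly as follows with TensorRT-LLM's high-level LLM API. This is a hedged sketch, not the tool's actual pipeline: import paths and parameter names vary across TensorRT-LLM releases, and quantized checkpoints (e.g. FP8) are typically produced beforehand with the quantization scripts shipped in the TensorRT-LLM examples.

```python
# A sketch of one benchmark configuration, not the tool's actual
# harness. Assumes the high-level LLM API bundled with TensorRT-LLM;
# exact imports and argument names differ between releases.
from tensorrt_llm import LLM, SamplingParams

# A Hugging Face checkpoint or a prebuilt engine directory.
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Output length = 32, as in the notes above (this parameter is named
# max_new_tokens in some releases).
params = SamplingParams(max_tokens=32)

outputs = llm.generate(["An example 512-token prompt ..."], params)
print(outputs[0].outputs[0].text)
```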
Supported Models
Model Family | Model Architecture | Model | Context window (tokens) | Output size (tokens) |
---|---|---|---|---|
Llama | Dense | Llama3-8b | 8,192 | 4,096 |
Gemma | Dense | Gemma2-27b | 2,048 | 2,048 |
RecurrentGemma | Hybrid | RecurrentGemma-9b | 4,096 | 4,096 |
Mamba | State-space | Mamba2-2.7b | 2,048 | 2,048 |
Updated
- 2024-09-22: Update model Llama3-8b FP16
- 2024-08-20: Update model Llama3-8b Quantization w4a8_awq
- 2024-08-20: Update model Gemma2-27b Quantization FP16, int8_weight_only, int4_weight_only
- 2024-08-19: Update model Llama3-8b Quantization FP8, w4a16_awq, int8_weight_only
- 2024-08-19: Update model Gemma2-27b Quantization FP16
- 2024-08-19: Update model RecurrentGemma-9b Quantization FP16
- 2024-08-19: Update model Mamba2-2.7b Quantization FP16
- 2024-08-19: Initial release