LLM Inference Speed

Quantize.Float16 is a web-based tool for comparing the inference speed of LLMs under different quantization techniques and KV cache settings.
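Quantization here means storing the weights (and optionally the activations or the KV cache) at lower precision than FP16, trading a little accuracy for memory and bandwidth. As a rough illustration of the idea behind weight-only formats such as int4_weight_only, the snippet below performs naive symmetric 4-bit weight quantization; it is a minimal sketch of the general technique, not the AWQ or TensorRT-LLM implementation.

```python
import numpy as np

def quantize_int4_weight_only(w: np.ndarray):
    """Naive symmetric per-row 4-bit weight quantization (illustrative only)."""
    # One scale per output row so each row can use the full int4 range.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # packed to 4 bits in practice
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).astype(np.float16)

w = np.random.randn(4, 8).astype(np.float16)
q, scale = quantize_int4_weight_only(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```

Formats such as w4a8_awq additionally quantize activations to 8 bits and pick scales with activation-aware calibration rather than the plain rounding shown here, while the KV Cache column in the results shows that the cache precision is chosen independently of the weights.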

TL;DR

H100x1

| Model | Quantize | KV Cache | Batch size | Tokens per sec | Category |
|---|---|---|---|---|---|
| Llama3-8b | w4a8_awq | FP16 | 1 | 244.54 | Best latency |
| Llama3-8b | FP8 | FP8 | 32 | 1,947.10 | Best throughput |
| Gemma2-27b | FP16 | FP16 | 1 | 47.23 | Best latency |
| Gemma2-27b | FP16 | FP16 | 8 | 251.34 | Best throughput |
| RecurrentGemma-9b | FP16 | FP16 | 1 | 114.68 | Best latency |
| RecurrentGemma-9b | FP16 | FP16 | 128 | 1,940.87 | Best throughput |
| Mamba2-2.7b | FP16 | FP16 | 1 | 224.34 | Best latency |
| Mamba2-2.7b | FP16 | FP16 | 192 | 2,948.73 | Best throughput |

Notes

  • Best latency is the fastest completion of a single request (batch size = 1, input length = 512, output length = 32).
  • Best throughput is the highest number of tokens processed per second across a batch of requests (batch size as listed in the table, input length = 512, output length = 32); see the sketch below.
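To make the two metrics concrete: tokens per second is the number of generated tokens divided by the wall-clock time for the batch, so at batch size 1 it doubles as an inverse latency measure (32 output tokens at 244.54 tokens/sec is roughly 0.13 s of decode time per request). The sketch below is framework-agnostic; `generate` stands in for whatever runtime is being benchmarked (here, TensorRT-LLM engines) and its signature is an assumption.

```python
import time

def tokens_per_sec(generate, prompts, output_len=32):
    """Decode throughput for one batch: generated tokens / wall-clock seconds.

    `generate` is a placeholder for the runtime's batched generation call
    (e.g. a TensorRT-LLM runner) and is assumed to block until every prompt
    in the batch has produced `output_len` tokens.
    """
    start = time.perf_counter()
    generate(prompts, max_new_tokens=output_len)  # hypothetical signature
    elapsed = time.perf_counter() - start
    return len(prompts) * output_len / elapsed
```

Increasing the batch size raises throughput until the GPU saturates, which is why the best-throughput rows above use larger batches than the best-latency rows.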

Configuration

  • Driver version: 550.54.15
  • CUDA version: 12.4
  • TensorRT-LLM version: 0.11.0-rel
  • GPU type: H100
  • Latest update: 2024-08-20

Supported Models

| Model Family | Model Architecture | Model | Context window | Output size | Learn more |
|---|---|---|---|---|---|
| Llama | Dense | Llama3-8b | 8,192 | 4,096 | |
| Gemma | Dense | Gemma2-27b | 2,048 | 2,048 | |
| RecurrentGemma | Hybrid | RecurrentGemma-9b | 4,096 | 4,096 | |
| Mamba | State-space | Mamba2-2.7b | 2,048 | 2,048 | |
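The context window bounds how many input plus generated tokens a request can use, and the output size caps the generated portion. Below is a minimal sketch of that check, using the limits from the table above; the helper name and the dictionary are illustrative, not part of the tool.

```python
# Context window / output size limits copied from the table above.
LIMITS = {
    "Llama3-8b":         {"context_window": 8192, "output_size": 4096},
    "Gemma2-27b":        {"context_window": 2048, "output_size": 2048},
    "RecurrentGemma-9b": {"context_window": 4096, "output_size": 4096},
    "Mamba2-2.7b":       {"context_window": 2048, "output_size": 2048},
}

def fits(model: str, input_len: int, output_len: int) -> bool:
    """True if input_len prompt tokens plus output_len new tokens fit the model."""
    lim = LIMITS[model]
    return (input_len + output_len <= lim["context_window"]
            and output_len <= lim["output_size"])

# The benchmark settings above (input length 512, output length 32) fit every listed model.
assert all(fits(m, 512, 32) for m in LIMITS)
```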

Updated

  • 2024-09-22: Updated Llama3-8b FP16 results
  • 2024-08-20: Updated Llama3-8b with w4a8_awq quantization
  • 2024-08-20: Updated Gemma2-27b with FP16, int8_weight_only, and int4_weight_only quantization
  • 2024-08-19: Updated Llama3-8b with FP8, w4a16_awq, and int8_weight_only quantization
  • 2024-08-19: Updated Gemma2-27b with FP16 quantization
  • 2024-08-19: Updated RecurrentGemma-9b with FP16 quantization
  • 2024-08-19: Updated Mamba2-2.7b with FP16 quantization
  • 2024-08-19: Initial release