LLM
Quantize.Float16 is a web-based tool for comparing the inference speed of LLM models across different quantization techniques and KV cache settings.
H100x1
Model | Quantization | KV Cache | Batch size | Tokens per sec | Category |
---|---|---|---|---|---|
Llama3-8b | w4a8_awq | FP16 | 1 | 244.54 | Best latency |
Llama3-8b | FP8 | FP8 | 32 | 1,947.10 | Best throughput |
Gemma2-27b | FP16 | FP16 | 1 | 47.23 | Best latency |
Gemma2-27b | FP16 | FP16 | 8 | 251.34 | Best throughput |
RecurrentGemma-9b | FP16 | FP16 | 1 | 114.68 | Best latency |
RecurrentGemma-9b | FP16 | FP16 | 128 | 1,940.87 | Best throughput |
Mamba2-2.7b | FP16 | FP16 | 1 | 224.34 | Best latency |
Mamba2-2.7b | FP16 | FP16 | 192 | 2,948.73 | Best throughput |
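The quantization labels encode weight and activation precision: w4a8_awq, for example, stores weights in 4 bits and activations in 8 bits using AWQ (activation-aware weight quantization). The following NumPy sketch illustrates only the "w4" part, per-group symmetric 4-bit weight quantization; AWQ's activation-aware channel rescaling is omitted, and the group size of 128 is an assumption chosen for illustration.

```python
import numpy as np

def quantize_w4_per_group(w, group_size=128):
    """Symmetric 4-bit per-group weight quantization: the 'w4' in
    w4a8_awq / w4a16_awq. AWQ additionally rescales salient channels
    based on activation statistics, which this sketch omits."""
    groups = w.reshape(-1, group_size)
    # One scale per group; map the group's max magnitude to int4 level 7.
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Toy weight matrix standing in for one linear layer.
w = np.random.randn(4096, 128).astype(np.float32)
q, s = quantize_w4_per_group(w.ravel())
err = np.abs(dequantize(q, s).ravel() - w.ravel()).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```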
Notes
- Best latency refers to the fastest completion of a single request. (batch size = 1, input length = 512, output length = 32)
- Best throughput refers to the highest number of output tokens processed per second across a batch of concurrent requests. (batch size varies by model, as shown above; input length = 512, output length = 32)
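A tokens-per-second figure like those above is typically wall-clock based: run one batched request to completion and divide the total output tokens by the elapsed time. A minimal sketch of such a harness, assuming a hypothetical engine-specific `generate_batch` callable (not part of any specific API):

```python
import time

# Benchmark shape from the notes above.
INPUT_LEN, OUTPUT_LEN = 512, 32

def tokens_per_sec(generate_batch, prompts):
    """Wall-clock output tokens per second for one batched request.

    `generate_batch` is a hypothetical, engine-specific callable that
    takes a list of prompts and returns one completion per prompt,
    each OUTPUT_LEN tokens long.
    """
    start = time.perf_counter()
    outputs = generate_batch(prompts)
    elapsed = time.perf_counter() - start
    return len(outputs) * OUTPUT_LEN / elapsed

# Usage: tokens_per_sec(engine.generate, prompts), where len(prompts)
# equals the batch size being measured.
```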
Configuration
TensorRT-LLM version: 0.11.0-rel
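For reference, a single configuration could be exercised roughly as follows with TensorRT-LLM's high-level LLM API. This is a hedged sketch, not the tool's actual pipeline: import paths and parameter names vary across TensorRT-LLM releases, and quantized checkpoints (e.g. FP8) are typically produced beforehand with the quantization scripts shipped in the TensorRT-LLM examples.

```python
# A sketch of one benchmark configuration, not the tool's actual
# harness. Assumes the high-level LLM API bundled with TensorRT-LLM;
# exact imports and argument names differ between releases.
from tensorrt_llm import LLM, SamplingParams

# A Hugging Face checkpoint or a prebuilt engine directory.
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Output length = 32, as in the notes above (this parameter is named
# max_new_tokens in some releases).
params = SamplingParams(max_tokens=32)

outputs = llm.generate(["An example 512-token prompt ..."], params)
print(outputs[0].outputs[0].text)
```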
Supported Models
Model Family | Model Architecture | Model | Context window (tokens) | Output size (tokens) |
---|---|---|---|---|
Llama | Dense | Llama3-8b | 8,192 | 4,096 |
Gemma | Dense | Gemma2-27b | 2,048 | 2,048 |
RecurrentGemma | Hybrid | RecurrentGemma-9b | 4,096 | 4,096 |
Mamba | State-space | Mamba2-2.7b | 2,048 | 2,048 |
Updated
- 2024-09-22: Update model Llama3-8b FP16
- 2024-08-20: Update model Llama3-8b Quantization w4a8_awq
- 2024-08-20: Update model Gemma2-27b Quantization FP16, int8_weight_only, int4_weight_only
- 2024-08-19: Update model Llama3-8b Quantization FP8, w4a16_awq, int8_weight_only
- 2024-08-19: Update model Gemma2-27b Quantization FP16
- 2024-08-19: Update model RecurrentGemma-9b Quantization FP16
- 2024-08-19: Update model Mamba2-2.7b Quantization FP16
- 2024-08-19: Initial release