Token Throughput per GPU vs End-to-end Latency

LLM Benchmarks • 8 GPUs • full benchmark at https://github.com/Scicom-AI-Enterprise-Organization/llm-benchmaq/tree/main/benchmarks
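The chart's two axes relate as follows: throughput per GPU normalizes total token generation rate by device count, while end-to-end latency is the wall-clock time from request submission to final token. A minimal sketch of the throughput metric (function and variable names here are illustrative, not taken from the linked repository):

```python
def throughput_per_gpu(total_tokens: int, elapsed_s: float, num_gpus: int) -> float:
    """Tokens generated per second, normalized by GPU count.

    total_tokens: tokens produced across all requests in the run
    elapsed_s:    end-to-end wall-clock time of the run, in seconds
    num_gpus:     number of GPUs serving the model
    """
    return total_tokens / elapsed_s / num_gpus

# Example: 8 GPUs generating 160_000 tokens over a 20 s run
tps = throughput_per_gpu(160_000, 20.0, 8)  # 1000.0 tokens/s/GPU
```

Plotting this value against per-request latency shows the usual serving trade-off: larger batches raise aggregate throughput but lengthen each request's end-to-end latency.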

Benchmark Configuration