GPU sizing for Llama 3.1 70B inference: numbers from benchmarks
How many GPUs does it really take to run Llama 3.1 70B in-house? Concrete configs (A100, H100, H200), the impact of quantization (FP16 → FP8 → INT4), tokens/s, TTFT, and cost per 1M tokens. No marketing — numbers from vLLM and TensorRT-LLM benchmarks.

Why calculate this at all
The question "how many GPUs do we need to run Llama 3.1 70B in-house" comes up in three out of five conversations with CIOs and IT architects in mid-sized manufacturing. It usually comes up for two reasons that get conflated:
- "Can we afford it?" (the CFO's intent)
- "Can we operate it?" (the IT architect's intent)
These are two different questions. The first is answered with a TCO table. The second is answered with an inference benchmark: tokens/s, time-to-first-token (TTFT), concurrency, cost per 1M tokens. This post is about the second, because without numbers from the second question the first table is guesswork.
Scope: an on-prem scenario, a mid-sized manufacturer of 200–1000 FTE, Llama 3.1 70B Instruct, mixed workloads (service-desk RAG plus SOP generation plus proposal drafting), peak concurrency of 20–80 simultaneous sessions.
The rest is numbers.
The metrics that matter (and the ones that mislead)
Throughput (tokens/s, output) is how many tokens the model produces per unit of time at a given batch size. The standard headline benchmark. The trap: single session vs aggregate throughput under batching. Vendor marketing usually shows the latter. Operationally we care about the former (per request) and the latter (per server, since it defines how many users we serve at once).
TTFT (time to first token) is the time from sending the prompt to the first response token appearing. For long-context RAG (10–30k tokens), TTFT can exceed the entire generation time of a short chat. If a benchmark omits TTFT at long context, it omits the worst part of the experience.
Concurrency at acceptable latency is how many parallel sessions we sustain at a specific SLO. We define the SLO ourselves. In practice, for a manufacturing service desk a reasonable threshold is TTFT < 2 s and streaming > 25 tokens/s per session. Above that line the user doesn't experience "thinking" as a problem.
Cost per 1M tokens (output) is the only metric that compares to cloud-API pricing. We compute it from hardware TCO amortised over 3 years, divided by the aggregate tokens produced at a utilisation typical for a mid-sized firm (usually 25–40% daily).
What I don't look at in the first iteration: MMLU, HellaSwag, pure quality rankings. Those are useful when choosing a model, not when sizing hardware for a model we've already chosen.
Three configurations a mid-sized firm realistically considers
Shared assumptions: FP16 precision (as baseline), vLLM 0.6+ or TensorRT-LLM 0.13+, a 16k-token context window, tensor parallelism within a single server, no pipeline parallelism (one node). The values are medians from public benchmarks (vLLM, NVIDIA Triton) and independent Anyscale/Together tests from H2 2025. Variance of ±15% is normal, depending on drivers, CUDA version, and kernel flags.
Configuration A: 2x NVIDIA H100 80GB (single node, tensor parallel 2)
The most common "we're getting into on-prem" config. The hardware fits in one 4U server; NVLink between GPUs is mandatory (PCIe-only drops throughput ~30%).
Llama 3.1 70B, FP16, vLLM 0.6, batch=1, context 4k:
- Single-session throughput: ~32–38 tokens/s
- TTFT (prompt 4k tokens): ~600–900 ms
- VRAM used: 65 GB per card (with headroom for KV cache)
Llama 3.1 70B, FP16, vLLM 0.6, batch=16, context 4k:
- Aggregate throughput: ~380–460 tokens/s
- Per-session throughput: ~24–28 tokens/s
- TTFT (prompt 4k): ~1.2–1.6 s
Concurrency at SLO (TTFT < 2 s, per-session > 25 tps): ~12–18 simultaneous sessions.
Comment: the sweet spot for a 200–500 FTE firm if AI is one of three tools (not the centre). The bottleneck: KV cache at longer context (16k+). Each extra 1k of context eats ~1.2 GB VRAM at batch 16.
Configuration B: 4x NVIDIA H100 80GB (single node, tensor parallel 4)
A jump in performance — and price. Sensible if AI is meant to be a foundation of operations (service desk plus instructions plus proposals in parallel) or if you plan to upgrade to a larger model within 18 months.
Llama 3.1 70B, FP16, vLLM 0.6, batch=1, context 4k:
- Single-session throughput: ~58–66 tokens/s
- TTFT (prompt 4k): ~350–500 ms
Llama 3.1 70B, FP16, vLLM 0.6, batch=32, context 4k:
- Aggregate throughput: ~720–860 tokens/s
- Per-session throughput: ~22–27 tokens/s
- TTFT (prompt 4k): ~900 ms – 1.3 s
Concurrency at SLO: ~30–40 simultaneous sessions.
Comment: tensor parallel 4 doesn't scale linearly (efficiency drops to ~70% vs 2x). For most mid-sized firms it's too much without a clear utilisation plan.
Configuration C: 2x NVIDIA H200 141GB (single node, tensor parallel 2)
The H200 has 76% more VRAM (141 GB vs 80 GB) and ~40% higher memory bandwidth than the H100. That changes the maths: the same model at higher concurrency, larger context for free, KV cache not a bottleneck up to batch 32+.
Llama 3.1 70B, FP16, vLLM 0.6, batch=1, context 4k:
- Single-session throughput: ~42–48 tokens/s
- TTFT (prompt 4k): ~480–700 ms
Llama 3.1 70B, FP16, vLLM 0.6, batch=32, context 16k:
- Aggregate throughput: ~620–740 tokens/s
- Per-session throughput: ~19–23 tokens/s
- TTFT (prompt 16k): ~2.4–3.1 s
Concurrency at SLO (TTFT < 2 s at 16k context): ~20–28 sessions.
Comment: for long-context RAG (15k–30k input tokens, typical for technical documentation) the H200 is often a better decision than 4x H100. Fewer cards to maintain, less networking overhead. H200 list price is ~30–35% above H100 as of May 2026.
Quantization: what FP8 and INT4 change
Quantization reduces the size of weights (and sometimes activations) from 16 bits to 8 or 4. It buys two things: less VRAM (larger batch or longer context) and higher throughput (less data through memory).
FP8 (NVIDIA Hopper native):
- VRAM ~50% less than FP16 (Llama 3.1 70B in FP8 fits in ~40 GB, i.e. on a single H100 80GB)
- Throughput +30–55% vs FP16 on the same hardware
- Quality loss on MMLU/HellaSwag ~0.5–1.5 pp, in practice imperceptible for RAG and drafting
- Requires TensorRT-LLM or vLLM 0.6+ with FP8 quantization
INT4 (GPTQ, AWQ, ExLlamaV2):
- VRAM ~25% of FP16 (~17–20 GB for 70B)
- Throughput +50–90% vs FP16 (but TTFT sometimes rises, due to runtime dequantization)
- Quality loss on MMLU ~2–4 pp, noticeable on reasoning tasks. For pure RAG (extractive summarization from a document) — acceptable. For long-form generation (instructions, proposals) — test it.
- Requires AWQ or GPTQ quantized weights (publicly available on Hugging Face for Llama 3.1 70B)
Practical recommendation:
- Configuration A (2x H100) in FP8: usually enough for a 200–500 FTE firm.
- Configuration A in INT4: test quality before deployment, but one server can handle ~30–40 concurrent sessions.
- Configuration B/C: rarely worth quantizing, since VRAM isn't the bottleneck.
Realistic utilisation and cost per 1M tokens
This is where most business cases break.
Typical assumption: 30% daily utilisation (7 hours of productive traffic per day, 5 days a week — i.e. ~21% on a 24/7 basis). The rest is idle (nights, weekends), warm standby, and peak headroom.
For Configuration A (2x H100 80GB), FP8, average aggregate throughput ~600 tokens/s at a realistic mix of batch sizes:
- Daily production: 600 tps × 25,200 s = 15.1M tokens/day
- Monthly: ~330M tokens
Hardware TCO (3 years): a 2x H100 80GB SXM5 server + chassis + switching + power + maintenance + license amortisation ~ EUR 240,000–290,000. Spread over 36 months: ~EUR 7,500/month. Plus an operator (0.2–0.3 FTE at a mid-sized firm with an existing IT team): ~EUR 3,000/month.
Cost per 1M output tokens: ~EUR 32 (10,500 / 330M × 1M).
For comparison: GPT-4o at public pricing is ~USD 10 per 1M output tokens. Claude Sonnet 4: ~USD 15 per 1M. So on-prem is more expensive per unit at 330M/month.
But that's a completely wrong axis of comparison if a regulatory need exists (NIS2, AI Act high-risk, a regulated sector). On-prem doesn't compete with a cloud API on price. It competes on risk profile (data doesn't leave), control (the model doesn't change without your consent), and predictability (pricing doesn't jump with every vendor iteration).
On the other hand: if you grow to 1.5–2 billion tokens/month (e.g. adding proposal automation for the technical office on top of the service desk), cost per 1M tokens drops to ~EUR 7–9. At that point on-prem wins economically against the cloud on unit price too.
What I don't cover here
- Multi-node setups (8+ GPU, pipeline parallelism). For a mid-sized firm it's overkill in year one, perhaps not in year two. A separate post.
- Fine-tuning / training overhead. Inference is ~5% of compute cost in a typical production scenario. Fine-tuning happens once a quarter; training, mostly not at all.
- Networking and storage sizing for RAG. KV cache and embeddings are a separate topic with separate bottlenecks. Also a separate post.
- AMD MI300X and non-NVIDIA alternatives. Briefly: reasonable for single workloads, immature for mixed production inference at a mid-sized firm. Worth revisiting in 2027.
// disclosure & biasesDisclosure and biases
I work on an on-prem AI platform (CortexMine) for European manufacturers. The numbers in this post come from public benchmarks (vLLM, NVIDIA Triton, Anyscale, Together AI) and from our internal tests on test hardware. Where our results differed from public benchmarks by more than 15%, I kept the public values (they're more representative of a typical setup than our optimisation).
The bias I'm aware of: I write from the "on-prem is a real option" perspective. For firms without a concrete regulatory need and below 200 FTE, it often isn't a real option economically. That's a different conversation.
Related notes
- Pillar: On-prem AI in European manufacturing 2026: a complete architecture guide — the broader architectural context and the 7 deployment layers.
- Cluster: Bare-metal, colocation, or appliance: where to put on-prem AI (CAPEX and OPEX) — where the hardware from this post fits into the deployment model.
Building CortexMine, an on-prem AI platform for European manufacturers under NIS2. Where this bias could affect conclusions, it is flagged inline.