GPU sizing for Llama 3.1 70B inference: numbers from benchmarks

Fryderyk Pryjma·published May 26, 2026·updated June 9, 2026·8 min · 1618 words

[architecture]GPU sizingLlama 3.1inferenceon-prem

GPU sizing for Llama 3.1 70B inference: numbers from benchmarks

Why calculate this at all

The question "how many GPUs do we need to run Llama 3.1 70B in-house" comes up in three out of five conversations with CIOs and IT architects in mid-sized manufacturing. It usually comes up for two reasons that get conflated:

"Can we afford it?" (the CFO's intent)
"Can we operate it?" (the IT architect's intent)

These are two different questions. The first is answered with a TCO table. The second is answered with an inference benchmark: tokens/s, time-to-first-token (TTFT), concurrency, cost per 1M tokens. This post is about the second, because without numbers from the second question the first table is guesswork.

Scope: an on-prem scenario, a mid-sized manufacturer of 200–1000 FTE, Llama 3.1 70B Instruct, mixed workloads (service-desk RAG plus SOP generation plus proposal drafting), peak concurrency of 20–80 simultaneous sessions.

The rest is numbers.

The metrics that matter (and the ones that mislead)

Throughput (tokens/s, output) is how many tokens the model produces per unit of time at a given batch size. The standard headline benchmark. The trap: single session vs aggregate throughput under batching. Vendor marketing usually shows the latter. Operationally we care about the former (per request) and the latter (per server, since it defines how many users we serve at once).

TTFT (time to first token) is the time from sending the prompt to the first response token appearing. For long-context RAG (10–30k tokens), TTFT can exceed the entire generation time of a short chat. If a benchmark omits TTFT at long context, it omits the worst part of the experience.

Concurrency at acceptable latency is how many parallel sessions we sustain at a specific SLO. We define the SLO ourselves. In practice, for a manufacturing service desk a reasonable threshold is TTFT < 2 s and streaming > 25 tokens/s per session. Above that line the user doesn't experience "thinking" as a problem.

Cost per 1M tokens (output) is the only metric that compares to cloud-API pricing. We compute it from hardware TCO amortised over 3 years, divided by the aggregate tokens produced at a utilisation typical for a mid-sized firm (usually 25–40% daily).

What I don't look at in the first iteration: MMLU, HellaSwag, pure quality rankings. Those are useful when choosing a model, not when sizing hardware for a model we've already chosen.

Three configurations a mid-sized firm realistically considers

Shared assumptions: FP16 precision (as baseline), vLLM 0.6+ or TensorRT-LLM 0.13+, a 16k-token context window, tensor parallelism within a single server, no pipeline parallelism (one node). The values are medians from public benchmarks (vLLM, NVIDIA Triton) and independent Anyscale/Together tests from H2 2025. Variance of ±15% is normal, depending on drivers, CUDA version, and kernel flags.

Configuration A: 2x NVIDIA H100 80GB (single node, tensor parallel 2)

The most common "we're getting into on-prem" config. The hardware fits in one 4U server; NVLink between GPUs is mandatory (PCIe-only drops throughput ~30%).

Llama 3.1 70B, FP16, vLLM 0.6, batch=1, context 4k:

Single-session throughput: ~32–38 tokens/s
TTFT (prompt 4k tokens): ~600–900 ms
VRAM used: 65 GB per card (with headroom for KV cache)

Llama 3.1 70B, FP16, vLLM 0.6, batch=16, context 4k:

Aggregate throughput: ~380–460 tokens/s
Per-session throughput: ~24–28 tokens/s
TTFT (prompt 4k): ~1.2–1.6 s

Concurrency at SLO (TTFT < 2 s, per-session > 25 tps): ~12–18 simultaneous sessions.

Comment: the sweet spot for a 200–500 FTE firm if AI is one of three tools (not the centre). The bottleneck: KV cache at longer context (16k+). Each extra 1k of context eats ~1.2 GB VRAM at batch 16.

Configuration B: 4x NVIDIA H100 80GB (single node, tensor parallel 4)

A jump in performance — and price. Sensible if AI is meant to be a foundation of operations (service desk plus instructions plus proposals in parallel) or if you plan to upgrade to a larger model within 18 months.

Llama 3.1 70B, FP16, vLLM 0.6, batch=1, context 4k:

Single-session throughput: ~58–66 tokens/s
TTFT (prompt 4k): ~350–500 ms

Llama 3.1 70B, FP16, vLLM 0.6, batch=32, context 4k:

Aggregate throughput: ~720–860 tokens/s
Per-session throughput: ~22–27 tokens/s
TTFT (prompt 4k): ~900 ms – 1.3 s

Concurrency at SLO: ~30–40 simultaneous sessions.

Comment: tensor parallel 4 doesn't scale linearly (efficiency drops to ~70% vs 2x). For most mid-sized firms it's too much without a clear utilisation plan.

Configuration C: 2x NVIDIA H200 141GB (single node, tensor parallel 2)

The H200 has 76% more VRAM (141 GB vs 80 GB) and ~40% higher memory bandwidth than the H100. That changes the maths: the same model at higher concurrency, larger context for free, KV cache not a bottleneck up to batch 32+.

Llama 3.1 70B, FP16, vLLM 0.6, batch=1, context 4k:

Single-session throughput: ~42–48 tokens/s
TTFT (prompt 4k): ~480–700 ms

Llama 3.1 70B, FP16, vLLM 0.6, batch=32, context 16k:

Aggregate throughput: ~620–740 tokens/s
Per-session throughput: ~19–23 tokens/s
TTFT (prompt 16k): ~2.4–3.1 s

Concurrency at SLO (TTFT < 2 s at 16k context): ~20–28 sessions.

Comment: for long-context RAG (15k–30k input tokens, typical for technical documentation) the H200 is often a better decision than 4x H100. Fewer cards to maintain, less networking overhead. H200 list price is ~30–35% above H100 as of May 2026.

Quantization: what FP8 and INT4 change

Quantization reduces the size of weights (and sometimes activations) from 16 bits to 8 or 4. It buys two things: less VRAM (larger batch or longer context) and higher throughput (less data through memory).

FP8 (NVIDIA Hopper native):

VRAM ~50% less than FP16 (Llama 3.1 70B in FP8 fits in ~40 GB, i.e. on a single H100 80GB)
Throughput +30–55% vs FP16 on the same hardware
Quality loss on MMLU/HellaSwag ~0.5–1.5 pp, in practice imperceptible for RAG and drafting
Requires TensorRT-LLM or vLLM 0.6+ with FP8 quantization

INT4 (GPTQ, AWQ, ExLlamaV2):

VRAM ~25% of FP16 (~17–20 GB for 70B)
Throughput +50–90% vs FP16 (but TTFT sometimes rises, due to runtime dequantization)
Quality loss on MMLU ~2–4 pp, noticeable on reasoning tasks. For pure RAG (extractive summarization from a document) — acceptable. For long-form generation (instructions, proposals) — test it.
Requires AWQ or GPTQ quantized weights (publicly available on Hugging Face for Llama 3.1 70B)

Practical recommendation:

Configuration A (2x H100) in FP8: usually enough for a 200–500 FTE firm.
Configuration A in INT4: test quality before deployment, but one server can handle ~30–40 concurrent sessions.
Configuration B/C: rarely worth quantizing, since VRAM isn't the bottleneck.

Realistic utilisation and cost per 1M tokens

This is where most business cases break.

Typical assumption: 30% daily utilisation (7 hours of productive traffic per day, 5 days a week — i.e. ~21% on a 24/7 basis). The rest is idle (nights, weekends), warm standby, and peak headroom.

For Configuration A (2x H100 80GB), FP8, average aggregate throughput ~600 tokens/s at a realistic mix of batch sizes:

Daily production: 600 tps × 25,200 s = 15.1M tokens/day
Monthly: ~330M tokens

Hardware TCO (3 years): a 2x H100 80GB SXM5 server + chassis + switching + power + maintenance + license amortisation ~ EUR 240,000–290,000. Spread over 36 months: ~EUR 7,500/month. Plus an operator (0.2–0.3 FTE at a mid-sized firm with an existing IT team): ~EUR 3,000/month.

Cost per 1M output tokens: ~EUR 32 (10,500 / 330M × 1M).

For comparison: GPT-4o at public pricing is ~USD 10 per 1M output tokens. Claude Sonnet 4: ~USD 15 per 1M. So on-prem is more expensive per unit at 330M/month.

But that's a completely wrong axis of comparison if a regulatory need exists (NIS2, AI Act high-risk, a regulated sector). On-prem doesn't compete with a cloud API on price. It competes on risk profile (data doesn't leave), control (the model doesn't change without your consent), and predictability (pricing doesn't jump with every vendor iteration).

On the other hand: if you grow to 1.5–2 billion tokens/month (e.g. adding proposal automation for the technical office on top of the service desk), cost per 1M tokens drops to ~EUR 7–9. At that point on-prem wins economically against the cloud on unit price too.

What I don't cover here

Multi-node setups (8+ GPU, pipeline parallelism). For a mid-sized firm it's overkill in year one, perhaps not in year two. A separate post.
Fine-tuning / training overhead. Inference is ~5% of compute cost in a typical production scenario. Fine-tuning happens once a quarter; training, mostly not at all.
Networking and storage sizing for RAG. KV cache and embeddings are a separate topic with separate bottlenecks. Also a separate post.
AMD MI300X and non-NVIDIA alternatives. Briefly: reasonable for single workloads, immature for mixed production inference at a mid-sized firm. Worth revisiting in 2027.

// disclosure & biasesDisclosure and biases

I work on an on-prem AI platform (CortexMine) for European manufacturers. The numbers in this post come from public benchmarks (vLLM, NVIDIA Triton, Anyscale, Together AI) and from our internal tests on test hardware. Where our results differed from public benchmarks by more than 15%, I kept the public values (they're more representative of a typical setup than our optimisation).

The bias I'm aware of: I write from the "on-prem is a real option" perspective. For firms without a concrete regulatory need and below 200 FTE, it often isn't a real option economically. That's a different conversation.

Pillar: On-prem AI in European manufacturing 2026: a complete architecture guide — the broader architectural context and the 7 deployment layers.
Cluster: Bare-metal, colocation, or appliance: where to put on-prem AI (CAPEX and OPEX) — where the hardware from this post fits into the deployment model.

// author

Fryderyk Pryjma

Building CortexMine, an on-prem AI platform for European manufacturers under NIS2. Where this bias could affect conclusions, it is flagged inline.

More about the author LinkedIn CortexMine

Want to apply this to your case: architecture, compliance, and cost?

→ Book 30 min

// related notes

GPU sizing for Llama 3.1 70B inference: numbers from benchmarks

Why calculate this at all

The metrics that matter (and the ones that mislead)

Three configurations a mid-sized firm realistically considers

Configuration A: 2x NVIDIA H100 80GB (single node, tensor parallel 2)

Configuration B: 4x NVIDIA H100 80GB (single node, tensor parallel 4)

Configuration C: 2x NVIDIA H200 141GB (single node, tensor parallel 2)

Quantization: what FP8 and INT4 change

Realistic utilisation and cost per 1M tokens

What I don't cover here

// disclosure & biasesDisclosure and biases

Monitoring and observability for on-prem LLMs: what to log and how

Reranking in on-prem RAG: when it lifts relevance and when it just burns GPU

On-prem RAG: architecture, chunking, retrieval and what actually drives quality

GPU sizing for Llama 3.1 70B inference: numbers from benchmarks

Why calculate this at all

The metrics that matter (and the ones that mislead)

Three configurations a mid-sized firm realistically considers

Configuration A: 2x NVIDIA H100 80GB (single node, tensor parallel 2)

Configuration B: 4x NVIDIA H100 80GB (single node, tensor parallel 4)

Configuration C: 2x NVIDIA H200 141GB (single node, tensor parallel 2)

Quantization: what FP8 and INT4 change

Realistic utilisation and cost per 1M tokens

What I don't cover here

// disclosure & biasesDisclosure and biases

Related notes

Monitoring and observability for on-prem LLMs: what to log and how

Reranking in on-prem RAG: when it lifts relevance and when it just burns GPU

On-prem RAG: architecture, chunking, retrieval and what actually drives quality