// topic

Architecture

Reference architectures, RAG pipelines, GPU sizing benchmarks, and integrations for on-prem AI.

1 note · last updated June 9, 2026

How many GPUs does it really take to run Llama 3.1 70B in-house? Concrete configs (A100, H100, H200), the impact of quantization (FP16 → FP8 → INT4), tokens/s, time to first token (TTFT), and cost per 1M tokens. No marketing: the numbers come from vLLM and TensorRT-LLM benchmarks.
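
As a rough first pass at the sizing question above, the weight footprint and the cost per 1M tokens can both be estimated with simple arithmetic. The sketch below is only an illustration, not taken from the benchmarks: it assumes roughly 70 billion parameters, counts weights only (no KV cache, activations, or runtime overhead, so real deployments need more headroom), and uses nominal GPU memory sizes. The function names and the hourly cost and throughput inputs are hypothetical placeholders.

```python
import math

# Assumption: ~70e9 parameters; weights only, so these are lower bounds.
PARAMS = 70e9
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}
# Assumption: nominal memory per GPU in GB (80 GB A100/H100 variants, 141 GB H200).
GPU_MEMORY_GB = {"A100 80GB": 80, "H100 80GB": 80, "H200 141GB": 141}


def weight_gb(precision: str) -> float:
    """Approximate size of the model weights in GB at a given precision."""
    return PARAMS * BYTES_PER_PARAM[precision] / 1e9


def min_gpus_for_weights(precision: str, gpu: str) -> int:
    """Lower bound on GPU count: weights must fit, nothing else is counted."""
    return math.ceil(weight_gb(precision) / GPU_MEMORY_GB[gpu])


def cost_per_million_tokens(hourly_cost: float, tokens_per_s: float) -> float:
    """Cost of generating 1M tokens given an hourly cost and sustained throughput."""
    return hourly_cost / (tokens_per_s * 3600) * 1e6


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        for gpu in GPU_MEMORY_GB:
            n = min_gpus_for_weights(precision, gpu)
            print(f"{precision:>5} on {gpu}: {weight_gb(precision):.0f} GB weights, >= {n} GPU(s)")
```

These counts are floors, not recommendations: a configuration where the weights barely fit leaves no room for the KV cache, and tensor-parallel deployments usually pick a GPU count that divides the model's attention heads evenly. Measured tokens/s from an actual benchmark run is what should go into `cost_per_million_tokens`.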