Topic cluster

On-prem AI architecture: RAG, GPU sizing, benchmarks

4 notes · last update July 23, 2026

Quick answer

The technical layer of an on-prem AI deployment: how to build a RAG pipeline that doesn't hallucinate, how to size GPUs for the model and the load, and how to measure retrieval quality and inference throughput. Specifics, benchmarks, and reference architectures instead of marketing diagrams.

This cluster is the engineering side of on-prem AI. We break the RAG pipeline into the parts that actually decide quality: chunking, embeddings, retrieval, and reranking, with a clear answer on when a reranker (cross-encoder) improves relevance and when it just eats GPU budget. We publish GPU sizing in numbers (e.g. Llama 70B: throughput, VRAM, quantization, vLLM/TensorRT-LLM) and reference architectures mapped to security requirements. The material is for CIOs and architects who must make hardware and design decisions but can't find neutral benchmarks in Polish, because nearly all of them are published in English. We link every technical article to its regulatory consequence (logging, isolation, oversight) and to the TCO calculator, so every architectural decision comes with a price tag.

// notes in this topic

Monitoring and observability for on-prem LLMs: what to log and how

7 min · July 23, 2026

Four telemetry layers for an on-prem LLM: infrastructure, serving, quality and audit. What to log, what not to store, and how observability feeds the audit trail NIS2 expects.

Reranking in on-prem RAG: when it lifts relevance and when it just burns GPU

6 min · June 29, 2026

A reranker sharpens top-k ordering when queries are long and the corpus is dense. The numbers, the VRAM and latency cost, and five setups where it backfires.

On-prem RAG: architecture, chunking, retrieval and what actually drives quality

7 min · June 22, 2026

How to build RAG outside the public cloud: the pipeline layers, the most common retrieval failures, the data boundary inside the prompt, and the questions an auditor will ask. A technical note for architects and CISOs.

GPU sizing for Llama 3.1 70B inference: numbers from benchmarks

8 min · May 26, 2026

How many GPUs does it really take to run Llama 3.1 70B in-house? Concrete configs (A100, H100, H200), the impact of quantization (FP16 → FP8 → INT4), tokens/s, TTFT, and cost per 1M tokens. No marketing, just numbers from vLLM and TensorRT-LLM benchmarks.

Frequently asked questions

Does a reranker always improve RAG?

No. It helps with large, noisy corpora; with small, well-chunked sets it can add GPU cost and latency without improving relevance.
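To make the trade-off concrete, here is a minimal sketch of the two-stage pattern a reranker implies: a cheap first stage scores the whole corpus, and a more expensive scorer runs only on the shortlisted candidates. The function names (`retrieve`, `rerank`, `toy_pair_score`) are hypothetical, and the scoring functions are toy stand-ins; in a real deployment the first stage would typically be vector or BM25 search and `pair_score` would be a cross-encoder model.

```python
from typing import Callable


def _tokens(text: str) -> list[str]:
    return text.lower().split()


def retrieve(query: str, corpus: list[str], n: int) -> list[str]:
    """First stage: cheap token-overlap score over the WHOLE corpus."""
    q = set(_tokens(query))
    ranked = sorted(corpus, key=lambda d: len(q & set(_tokens(d))), reverse=True)
    return ranked[:n]


def rerank(
    query: str,
    candidates: list[str],
    pair_score: Callable[[str, str], float],
    k: int,
) -> list[str]:
    """Second stage: costly (query, doc) scorer, run only on the candidates."""
    ranked = sorted(candidates, key=lambda d: pair_score(query, d), reverse=True)
    return ranked[:k]


def toy_pair_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder (assumption, not a real model):
    the fraction of document tokens that also appear in the query."""
    q = set(_tokens(query))
    doc_tokens = _tokens(doc)
    if not doc_tokens:
        return 0.0
    return sum(t in q for t in doc_tokens) / len(doc_tokens)


corpus = [
    "gpu sizing for llama inference on premises with many extra words about gpu",
    "gpu sizing",
    "audit logging",
]
candidates = retrieve("gpu sizing", corpus, n=2)
top = rerank("gpu sizing", candidates, toy_pair_score, k=1)
```

The second stage costs one scorer call per candidate, so its GPU and latency cost grows with `n`. When the first stage already orders a small, clean candidate set well, that extra pass has little left to fix.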

vLLM or Ollama for production?

Ollama is convenient for development; in production under concurrent load, vLLM delivers many times the throughput thanks to request batching and multi-GPU serving.

How much VRAM do I need?

It depends on model size and quantization; see the GPU sizing article. As a rule of thumb, FP16 weights take 2 bytes per parameter, so about 2 GB per billion parameters (roughly 140 GB for a 70B model), plus headroom for context (the KV cache).
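The rule of thumb above can be sketched as a small calculation. This is a rough estimate under stated assumptions, not a sizing tool: the bytes-per-parameter table covers weights only, and `headroom_fraction` is a hypothetical placeholder for KV cache and runtime overhead, which in practice depend on context length, batch size, and the serving engine.

```python
# Approximate bytes per parameter for weights only, by precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billion: float, precision: str) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 params * bytes/param / 1e9 bytes/GB
    simplifies to params_billion * bytes/param.
    """
    return params_billion * BYTES_PER_PARAM[precision]


def estimate_vram_gb(
    params_billion: float,
    precision: str,
    headroom_fraction: float = 0.2,  # assumption: placeholder, measure for your workload
) -> float:
    """Weights plus a flat headroom share for KV cache and runtime overhead."""
    return weight_vram_gb(params_billion, precision) * (1.0 + headroom_fraction)


print(weight_vram_gb(70, "fp16"))  # 140.0
print(weight_vram_gb(70, "fp8"))   # 70.0
print(weight_vram_gb(70, "int4"))  # 35.0
```

Treat the headroom value as something to measure under your own context lengths and concurrency rather than a constant.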
