Question 1

How much VRAM does an LLM need?

Accepted Answer

Weights VRAM = parameters × bytes-per-parameter for the chosen quantization (FP16 ≈ 2.0 bytes, Q8 ≈ 1.06, Q5_K_M ≈ 0.73, Q4_K_M ≈ 0.58, Q3 ≈ 0.46), plus KV-cache for your context length. Example: a 24B model at Q4_K_M needs about 24 × 0.58 ≈ 14 GB for weights, plus 1–3 GB KV-cache at 8K context — so a 16 GB GPU is a tight but workable fit.

Question 2

How is the KV-cache size calculated?

Accepted Answer

When the architecture is known we compute it exactly: 2 (K+V) × layers × context-tokens × kv-heads × head-dim × bytes (2 bytes FP16, 1 byte with KV-quantization). This is GQA-aware, so models with grouped-query attention need far less cache than their size suggests. If architecture details are missing we fall back to a size-based heuristic.

Question 3

How do you estimate tokens per second?

Accepted Answer

Decode speed is memory-bandwidth-bound: estimated tps ≈ (GPU memory bandwidth ÷ active-parameter bytes) × efficiency (≈0.82 for discrete GPUs, ≈0.68 for unified memory). Where the community has submitted measured benchmarks per GPU, quant and backend, we show real measured tokens/sec instead.

Question 4

What does Q4_K_M / quantization mean?

Accepted Answer

Quantization stores model weights in fewer bits to cut VRAM and disk use. Q4_K_M (~4.6 bits/weight) is the common sweet spot: roughly 70% smaller than FP16 with minimal quality loss for most use. Q8 is near-lossless; Q3 saves more memory but degrades quality noticeably.

Question 5

What is the composite score?

Accepted Answer

A weighted aggregate of published benchmarks — GPQA Diamond (reasoning), SWE-bench Verified (coding), AIME (math), MMLU-Pro (knowledge), BFCL v3 (tool use) and more — normalised to a 0–100 scale so local and API models can be compared on one axis.

Question 6

Local model or API — which is cheaper?

Accepted Answer

It depends on volume. APIs bill per million tokens with zero hardware cost; a local GPU is a fixed cost with near-zero marginal cost per token. Heavy daily use (agents, batch processing, RAG over private data) usually favours local; sporadic use favours APIs. Each model page lists both the VRAM requirements and current per-provider API pricing so you can compare directly.

Question 7

Can I run a 70B model on a 24 GB GPU?

Accepted Answer

Not fully in VRAM: 70B at Q4_K_M needs ~40 GB for weights alone. Options: partial CPU offload (slow), a second 24 GB GPU, a Mac with 64 GB+ unified memory, or picking a smaller model — modern 24–32B models often outperform older 70B ones. Use the fit-calculator on the homepage with your exact GPU.

Question 8

Where does the data come from?

Accepted Answer

Model specs come from official model cards and papers; benchmark scores from published evaluations; tokens/sec from community-submitted measurements per GPU, quantization and backend; prices from provider price pages. Reviews and upvotes come from registered community members.

Frequently asked questions