The clearest place to find what a local LLM is actually good at

slopsome.com is a search and comparison engine for open-weight & API language models and the GPUs that run them — built on one idea: organise models by capability, and be honest about how much to trust every number.

1. The problem

Picking a model should start from a single, practical question: “Which model should I run, will it fit my hardware, how fast, and what will it cost?” — for both local open-weight models and paid APIs, in one place.

Existing resources answer a different question. Leaderboards collapse everything into one rank; vendor pages quote a hand-picked benchmark that “runs hot”; aggregators auto-generate numbers that can’t be traced. Almost none address the thing a hobbyist actually needs — will this run on my 4090, and how fast? — and none tell you how much a given score is worth.

2. Thesis: capability + trust

Two principles drive the whole platform:

For local models we add what cloud leaderboards ignore entirely: effective context (not the advertised number), quantisation quality per bit-width, VRAM fit, and real tokens/sec on actual hardware.

3. Hard rule: never invent data

Every figure is sourced or modelled, never fabricated. Unknown values are stored as null with a data_gaps note rather than guessed. Architecture comes straight from each model’s Hugging Face config.json (factual). Benchmark numbers are attributed to a source and labelled vendor vs independent. Speeds are either modelled from first principles or crowdsourced from real runs, and clearly marked as such.
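As an illustration of the rule above, here is a minimal sketch of how an unknown value might be stored as null with a data_gaps note. The `BenchmarkScore` class, its field names (other than `data_gaps`) and the `record_score` helper are hypothetical, chosen only for this example; they are not the site's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BenchmarkScore:
    """Hypothetical record shape; only `data_gaps` is named in the text."""
    benchmark: str
    value: Optional[float]            # None means "unknown", never a guess
    source: Optional[str] = None      # where the number came from
    origin: Optional[str] = None      # "vendor" or "independent"
    data_gaps: List[str] = field(default_factory=list)


def record_score(benchmark, value=None, source=None, origin=None):
    """Store a score, refusing any value that has no source."""
    gaps = []
    if value is None:
        # Unknown stays unknown: keep null and note the gap.
        gaps.append(f"{benchmark}: no sourced value found")
    elif source is None:
        # A number without attribution is treated as invented.
        raise ValueError("a value without a source is not accepted")
    return BenchmarkScore(benchmark, value, source, origin, gaps)


missing = record_score("GPQA Diamond")
print(missing.value, missing.data_gaps)
```

The design choice sketched here is that a missing number is represented explicitly rather than defaulted to zero, so downstream code can tell "unknown" apart from "scored badly".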

4. Architecture

Self-hosted, no cloud. A single Docker stack on a Raspberry Pi:

5. The fit model: three honest states

Binary “fits / doesn’t fit” is wrong; the truth has three positive states:

This is computed with exact GQA-aware KV-cache math and a model’s real layer / KV-head / head-dim values, plus a usable-memory fraction — because a 32 GB Mac (unified memory, OS-reserved) is not a 32 GB discrete GPU.

weights_GB  = total_params_B × bits_per_weight / 8
kv_cache_GB = 2 × layers × context × (kv_heads × head_dim) × kv_bytes / 1e9
VRAM_total  = weights + kv_cache + overhead    (MoE: weights use TOTAL params)
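The three formulas above translate directly into code. The sketch below is one possible rendering, not the site's implementation: the function names, the default `overhead_gb` and the example usable-memory fraction are assumptions made for illustration, and the model dimensions in the example are hypothetical.

```python
def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB. For MoE models pass TOTAL params, not active."""
    return total_params_b * bits_per_weight / 8


def kv_cache_gb(layers: int, context: int, kv_heads: int,
                head_dim: int, kv_bytes: float = 2) -> float:
    """GQA-aware KV cache in GB: keys and values (the factor 2) per layer,
    sized by the number of KV heads rather than attention heads."""
    return 2 * layers * context * (kv_heads * head_dim) * kv_bytes / 1e9


def vram_total_gb(total_params_b, bits_per_weight, layers, context,
                  kv_heads, head_dim, kv_bytes=2, overhead_gb=1.0):
    """weights + kv_cache + overhead. The 1 GB overhead is an assumed value."""
    return (weights_gb(total_params_b, bits_per_weight)
            + kv_cache_gb(layers, context, kv_heads, head_dim, kv_bytes)
            + overhead_gb)


def headroom_gb(memory_gb: float, usable_fraction: float, needed_gb: float) -> float:
    """Usable memory minus required memory; negative means it does not fit."""
    return memory_gb * usable_fraction - needed_gb


# Hypothetical 8B dense model: 32 layers, 8 KV heads, head_dim 128,
# ~4.5 bits per weight, 8192-token context, 16-bit KV cache.
needed = vram_total_gb(8, 4.5, 32, 8192, 8, 128)
print(round(needed, 2))                          # weights 4.5 + KV ~1.07 + overhead 1.0
print(round(headroom_gb(24, 0.9, needed), 2))    # 24 GB card, assumed 90% usable
```

The `usable_fraction` parameter is where the difference between unified memory and a discrete GPU would enter; the 0.9 used here is only a placeholder.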

6. The speed model

Token generation (decode) is memory-bandwidth-bound, not compute-bound. So the headline rule is:

decode_tok/s ≈ memory_bandwidth_GBs / active_weight_GB × efficiency

A MoE model only reads its active experts per token, so it’s fast once it fits. Prefill (time-to-first-token) is the compute-bound part. Speculative decoding, flash attention and batching all raise the compute-to-memory ratio. The takeaway we teach inline: for local inference, bandwidth matters more than TFLOPS.
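The headline decode rule can be sketched as a one-line function. The function name, the default `efficiency` of 0.6 and the hardware and model figures in the example are illustrative assumptions, not measured values.

```python
def decode_tok_per_s(memory_bandwidth_gbs: float, active_weight_gb: float,
                     efficiency: float = 0.6) -> float:
    """Bandwidth-bound decode estimate: each generated token requires
    reading the active weights once, so throughput scales with
    bandwidth divided by active weight size."""
    if active_weight_gb <= 0:
        raise ValueError("active_weight_gb must be positive")
    return memory_bandwidth_gbs / active_weight_gb * efficiency


# Assumed 1000 GB/s of memory bandwidth in both cases.
dense = decode_tok_per_s(1000, 40.0)   # dense model: all 40 GB read per token
moe = decode_tok_per_s(1000, 5.0)      # MoE: only 5 GB of active experts read
print(round(dense, 1), round(moe, 1))
```

With the same bandwidth, the hypothetical MoE model decodes eight times faster than the dense one because only its active weights are read per token, even though its total weights still have to fit in memory.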

7. Quantisation quality

Quantisation trades VRAM for accuracy, and the trade isn’t linear. Q4_K_M is the sweet spot; Q8 is near-lossless; below ~Q3 quality visibly drops. We prefer KL-divergence vs FP16 over raw perplexity (perplexity hides per-token distortion), and note that MoE models tolerate lower bit-widths than dense models, but their routers are quantisation-sensitive. Real quant-quality evals (e.g. lm-evaluation-harness) are crowdsourced per quant.
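To make the KL-divergence preference concrete, the sketch below computes the mean per-token KL divergence between an FP16 reference and a quantised model from their next-token logits. It is a pure-Python illustration under the assumption that both sets of logits are available for the same token positions; the function names are hypothetical, and a real evaluation would run over a large corpus with a tensor library.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(ref_logits, quant_logits):
    """KL(P_ref || Q_quant) for one token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_token_kl(ref_batch, quant_batch):
    """Average KL over positions; each position contributes its own
    distortion instead of being folded into a single perplexity figure."""
    kls = [token_kl(r, q) for r, q in zip(ref_batch, quant_batch)]
    return sum(kls) / len(kls)


# Toy 4-token vocabulary, two positions (values are made up).
fp16 = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, 0.0]]
quant = [[1.9, 1.1, 0.0, -0.8], [0.4, 0.7, 2.6, 0.1]]
print(mean_token_kl(fp16, quant))
```

Because the comparison is made against the reference distribution at every position, a quant that shifts probabilities on many tokens shows up here even when its overall perplexity barely moves.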

8. The composite score

A single 0–10 number is a weighted blend, renormalised over whichever axes are present (so a model missing an axis isn’t penalised). It’s documented, not magic:

SWE-bench Verified  0.20    MMLU-Pro                  0.10
GPQA Diamond        0.15    LMArena coding            0.08
HLE                 0.12    Instruction-following     0.07
BFCL tools          0.10    Multilingual              0.05
AIME                0.10    Hallucination (inverted)  0.03
APEX agents         0.05    MRCR long-context         0.05

Vendor benchmarks are flagged (they run hot). Dollar-denominated agentic metrics (Vending-Bench, SWE-Lancer) are stored and shown but deliberately kept out of the blend, because normalising them would require inventing a cap.
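The renormalised blend can be sketched as follows, using the weights listed above. This is an illustration rather than the site's code: it assumes every axis has already been normalised to a 0–1 value where higher is better (so the hallucination axis is already inverted), and the function name is hypothetical.

```python
from typing import Dict, Optional

# Weights as documented in the table above.
WEIGHTS = {
    "SWE-bench Verified": 0.20, "GPQA Diamond": 0.15, "HLE": 0.12,
    "BFCL tools": 0.10, "AIME": 0.10, "APEX agents": 0.05,
    "MMLU-Pro": 0.10, "LMArena coding": 0.08, "Instruction-following": 0.07,
    "Multilingual": 0.05, "Hallucination (inverted)": 0.03,
    "MRCR long-context": 0.05,
}


def composite_score(normalised: Dict[str, Optional[float]]) -> Optional[float]:
    """Weighted 0-10 blend over the axes that are present.
    Missing axes (None or absent) are dropped and the remaining
    weights are renormalised, so a gap is not scored as zero."""
    present = {k: v for k, v in normalised.items()
               if k in WEIGHTS and v is not None}
    if not present:
        return None  # no data: report unknown rather than invent a score
    total_weight = sum(WEIGHTS[k] for k in present)
    blended = sum(WEIGHTS[k] * v for k, v in present.items())
    return 10 * blended / total_weight


# A model with only two axes reported (made-up values).
print(composite_score({"SWE-bench Verified": 0.6, "GPQA Diamond": 0.4, "HLE": None}))
```

Dividing by the sum of the present weights is what keeps a model with missing axes from being penalised; it also means the weights do not need to sum to exactly 1.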

9. The trust layer: the differentiator

Almost no site tells you how much to believe a benchmark. We attach, per score:

10. Capability taxonomy

Each skill maps to the best current benchmark, with the saturated classics kept only as historical context:

11. Community layer

Crowdsourced signal, kept honest by rules enforced in the database, not the UI:

12. Data pipeline

Three sources, each at a different cadence and clearly labelled:

13. Principles

14. Roadmap

Live benchmark/price auto-ingest from primary, independent sources; measured quant quality (perplexity/KL per quant); a fine-tune / LoRA VRAM calculator; concurrency-aware throughput curves; and, once there’s a sensible comparison axis, embeddings, rerankers and vision models, which the schema already accommodates.