slopsome.com is a search and comparison engine for open-weight & API language models and the GPUs that run them — built on one idea: organise models by capability, and be honest about how much to trust every number.
Picking a model should start from a single, practical question: “Which model should I run, will it fit my hardware, how fast, and what will it cost?” — for both local open-weight models and paid APIs, in one place.
Existing resources answer a different question. Leaderboards collapse everything into one rank; vendor pages quote a hand-picked benchmark that “runs hot”; aggregators auto-generate numbers that can’t be traced. Almost none address the thing a hobbyist actually needs — will this run on my 4090, and how fast? — and none tell you how much a given score is worth.
Two principles drive the whole platform:
For local models we add what cloud leaderboards ignore entirely: effective context (not the advertised number), quantisation quality per bit-width, VRAM fit, and real tokens/sec on actual hardware.
Every figure is sourced or modelled, never fabricated. Unknown values are stored as null with a data_gaps note rather than guessed. Architecture comes straight from each model’s Hugging Face config.json (factual). Benchmark numbers are attributed to a source and labelled vendor vs independent. Speeds are either modelled from first principles or crowdsourced from real runs, and clearly marked as such.
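As a rough illustration of the "null plus data_gaps" convention, the sketch below shows what such a record might look like. Only the data_gaps name comes from the text above; the model name, the other field names and the note wording are hypothetical.

```python
import json

# Hypothetical record: every field name except "data_gaps" is an assumption.
record = {
    "model": "example-model-7b",       # hypothetical model name
    "context_advertised": 131072,      # illustrative value only
    "context_effective": None,         # unknown: stored as null, not guessed
    "data_gaps": [
        "context_effective: no independent long-context eval found",
    ],
}


def missing_fields(rec):
    """Return the names of fields whose value is unknown (None)."""
    return sorted(key for key, value in rec.items() if value is None)


# None serialises to JSON null, so the gap stays visible downstream.
print(json.dumps(record, indent=2))
print(missing_fields(record))
```

Keeping the gap explicit means a consumer can tell "not measured" apart from "measured as zero".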
Self-hosted, no cloud. A single Docker stack on a Raspberry Pi:
Binary “fits / doesn’t fit” is wrong; the truth has three positive states:
This is computed with exact GQA-aware KV-cache math and a model’s real layer / KV-head / head-dim values, plus a usable-memory fraction — because a 32 GB Mac (unified memory, OS-reserved) is not a 32 GB discrete GPU.
weights_GB = total_params_B × bits_per_weight / 8
kv_cache_GB = 2 × layers × context × (kv_heads × head_dim) × kv_bytes / 1e9
VRAM_total = weights + kv_cache + overhead (MoE: weights use TOTAL params)
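The three formulas above can be sketched directly in Python. This is a minimal sketch, not the site's implementation: the function names, the 1 GB default overhead and the example model shape (8B parameters, 32 layers, 8 KV heads, head dimension 128, about 4.5 bits per weight) are illustrative assumptions.

```python
def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB. For MoE models pass TOTAL params, not active."""
    return total_params_b * bits_per_weight / 8


def kv_cache_gb(layers: int, context: int, kv_heads: int, head_dim: int,
                kv_bytes: float = 2.0) -> float:
    """KV-cache memory in GB. The leading 2 covers keys and values;
    using kv_heads (not attention heads) is what makes it GQA-aware."""
    return 2 * layers * context * (kv_heads * head_dim) * kv_bytes / 1e9


def vram_total_gb(total_params_b: float, bits_per_weight: float,
                  layers: int, context: int, kv_heads: int, head_dim: int,
                  kv_bytes: float = 2.0, overhead_gb: float = 1.0) -> float:
    """Total = weights + KV cache + overhead (overhead_gb is an assumed value)."""
    return (weights_gb(total_params_b, bits_per_weight)
            + kv_cache_gb(layers, context, kv_heads, head_dim, kv_bytes)
            + overhead_gb)


# Illustrative model shape (assumed, not taken from any real config.json).
total = vram_total_gb(total_params_b=8, bits_per_weight=4.5,
                      layers=32, context=8192, kv_heads=8, head_dim=128)
print(f"estimated VRAM: {total:.2f} GB")
```

With these inputs the weights come to 4.5 GB and the 16-bit KV cache at 8,192 tokens to roughly 1.07 GB, so context length is a visible part of the budget even for a small model.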
Token generation (decode) is memory-bandwidth-bound, not compute-bound. So the headline rule is:
decode_tok/s ≈ memory_bandwidth_GBs / active_weight_GB × efficiency
A MoE model only reads its active experts per token, so it’s fast once it fits. Prefill (time-to-first-token) is the compute-bound part. Speculative decoding, flash attention and batching all raise the compute-to-memory ratio. The takeaway we teach inline: for local inference, bandwidth matters more than TFLOPS.
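The headline rule can be sketched as below. The function names, the 0.7 default efficiency, the 1,000 GB/s bandwidth figure and the two model sizes are assumptions chosen for illustration, not measured values.

```python
def active_weight_gb(active_params_b: float, bits_per_weight: float) -> float:
    """GB of weights read per generated token (active params only for MoE)."""
    return active_params_b * bits_per_weight / 8


def decode_tok_per_s(memory_bandwidth_gbs: float, weight_gb: float,
                     efficiency: float = 0.7) -> float:
    """decode_tok/s ~= bandwidth / active weights x efficiency.
    The efficiency default is an assumed fudge factor, not a measurement."""
    if weight_gb <= 0:
        raise ValueError("weight_gb must be positive")
    return memory_bandwidth_gbs / weight_gb * efficiency


bandwidth = 1000.0  # GB/s, illustrative round number

# Dense 30B model vs. a hypothetical MoE with 30B total but 3B active params,
# both at about 4.5 bits per weight.
dense = decode_tok_per_s(bandwidth, active_weight_gb(30, 4.5))
moe = decode_tok_per_s(bandwidth, active_weight_gb(3, 4.5))
print(f"dense: {dense:.1f} tok/s, MoE: {moe:.1f} tok/s")
```

Both models need the same VRAM for weights (total params), but the MoE reads a tenth of the bytes per token, which is why it decodes faster once it fits.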
Quantisation trades VRAM for accuracy, and the trade isn’t linear. Q4_K_M is the sweet spot; Q8 is near-lossless; below ~Q3 quality visibly drops. We prefer KL-divergence vs FP16 over raw perplexity (perplexity hides per-token distortion), and note that MoE models tolerate lower bit-widths than dense models, although their routers are quantisation-sensitive. Real quant-quality evals (e.g. lm-evaluation-harness) are crowdsourced per quant.
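To make the KL-divergence idea concrete, here is a minimal pure-Python sketch that compares the next-token distributions of an FP16 reference and a quantised model, position by position. The function names and toy logits are assumptions; a real evaluation would take logits from both models over a shared text corpus.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (max-shifted for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats; P is the FP16 reference, Q the quantised model."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


def mean_token_kl(ref_logits, quant_logits):
    """Average per-token KL over a sequence of logit vectors."""
    kls = [kl_divergence(softmax(r), softmax(q))
           for r, q in zip(ref_logits, quant_logits)]
    return sum(kls) / len(kls)


# Toy logits over a 3-token vocabulary at two positions (illustrative only).
reference = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quantised = [[1.8, 1.2, 0.1], [0.5, 2.5, 0.3]]
print(f"mean KL vs reference: {mean_token_kl(reference, quantised):.5f}")
```

Because the comparison covers the whole distribution at every position, a shift in probability mass shows up even when the correct token's probability, and hence perplexity, barely moves.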
A single 0–10 number is a weighted blend, renormalised over whichever axes are present (so a model missing an axis isn’t penalised). It’s documented, not magic:
| Benchmark | Weight | Benchmark | Weight |
| --- | --- | --- | --- |
| SWE-bench Verified | 0.20 | MMLU-Pro | 0.10 |
| GPQA Diamond | 0.15 | LMArena coding | 0.08 |
| HLE | 0.12 | Instruction-following | 0.07 |
| BFCL tools | 0.10 | Multilingual | 0.05 |
| AIME | 0.10 | Hallucination (inverted) | 0.03 |
| APEX agents | 0.05 | MRCR long-context | 0.05 |
Vendor benchmarks are flagged (they run hot). Dollar-denominated agentic metrics (Vending-Bench, SWE-Lancer) are stored and shown but deliberately kept out of the blend — normalising them would require inventing a cap.
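The renormalised blend can be sketched as follows. The weights are copied from the table above; the dictionary keys, the function name, and the assumption that every axis has already been scaled to 0–10 (with hallucination already inverted) are illustrative choices, not the site's actual code.

```python
from typing import Dict, Optional

# Weights from the table above; key names are hypothetical identifiers.
WEIGHTS: Dict[str, float] = {
    "swe_bench_verified": 0.20, "mmlu_pro": 0.10,
    "gpqa_diamond": 0.15, "lmarena_coding": 0.08,
    "hle": 0.12, "instruction_following": 0.07,
    "bfcl_tools": 0.10, "multilingual": 0.05,
    "aime": 0.10, "hallucination_inverted": 0.03,
    "apex_agents": 0.05, "mrcr_long_context": 0.05,
}


def composite_score(axis_scores: Dict[str, Optional[float]]) -> Optional[float]:
    """Weighted mean over the axes that are present (not None).
    Assumes each score is already on a 0-10 scale."""
    present = {k: v for k, v in axis_scores.items()
               if k in WEIGHTS and v is not None}
    if not present:
        return None  # nothing to blend: report unknown rather than guess
    total_weight = sum(WEIGHTS[k] for k in present)
    return sum(WEIGHTS[k] * v for k, v in present.items()) / total_weight


# A model with only two axes measured is scored on those two alone.
partial = {"swe_bench_verified": 8.0, "gpqa_diamond": 6.0, "aime": None}
print(composite_score(partial))
```

Dividing by the sum of the weights actually present is what keeps a missing axis from dragging the score down, and it also means the listed weights do not need to sum to exactly 1.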
Almost no site tells you how much to believe a benchmark. We attach, per score:
last_verified), because eval sets drift and today’s separator saturates within months.
Each skill maps to the best current benchmark, with the saturated classics kept only as historical context:
Crowdsourced signal, kept honest by rules enforced in the database, not the UI:
Three sources, each at a different cadence and clearly labelled:
Live benchmark/price auto-ingest from primary, independent sources; measured quant quality (perplexity/KL per quant); a fine-tune / LoRA VRAM calculator; concurrency-aware throughput curves; and — once there’s a sensible comparison axis — embeddings, rerankers and vision models, which the schema already accommodates.