#benchmarking

adingbatponder :nixos: 👾 adingbatponder@fosstodon.org
2026-03-10

Thinkpad #X230 of 2013 (gift from a friend),
#apple #imac 18,2 of 2017 (350 euros, 64 GB RAM, 27" Retina display),
#hetzner CX43 #vm (10 euros per month) online server,
#Supermicro (SM in the table on the right) X11-WTR SYS-5019P-WTR #Xeon Silver 4110 × 16 (basic parts, 500 euros second-hand)
CPU comparisons using #hardinfo2 #benchmarking #homelab

[benchmark comparison table]

Weights & Biases (@wandb)

Weights & Biases announced that W&B Inference is now listed on @ArtificialAnlys. W&B says every model it serves is independently benchmarked and compared on intelligence, speed, cost, and latency, naming GLM-5, Kimi K2.5, and MiniMax M2.5 as examples of benchmarked models.

x.com/wandb/status/20303911103

#wandb #inference #benchmarking #models #llm

Minko Gechev (@mgechev)

AI ์—์ด์ „ํŠธ ์Šคํ‚ฌ์„ ์œ„ํ•œ '์œ ๋‹› ํ…Œ์ŠคํŠธ' ๊ฐœ๋…์˜ Skill Eval์ด ์†Œ๊ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. Docker๋กœ ๊ฒฉ๋ฆฌ๋œ ๋ฒค์น˜๋งˆํฌ์™€ ๊ฒฐ์ •๋ก ์  ๊ฒ€์‚ฌ ๋ฐ LLM ๊ธฐ๋ฐ˜ ์ฑ„์ ์ด ๊ฒฐํ•ฉ๋˜์–ด ์—์ด์ „ํŠธ ์Šคํ‚ฌ์˜ ํšŒ๊ท€๋ฅผ ์‚ฌ์ „์— ์žก์•„๋‚ด๋„๋ก ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

x.com/mgechev/status/202921483

#skilleval #benchmarking #docker #agent #evaluation

2026-03-04

CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

#CUDA #LLM #Benchmarking #Package

hgpu.org/?p=30630

Bindu Reddy (@bindureddy)

A report that Gemini Pro 3.1 is leading nearly every LiveBench leaderboard by a wide margin. It scores lower on hidden (unseen) questions, however, which raises suspicions of benchmark optimization; its real generalization performance deserves scrutiny.

x.com/bindureddy/status/202857

#geminipro #livebench #benchmarking #llm

fly51fly (@fly51fly)

ISO-Bench proposes a benchmark and analysis of whether coding agents can optimize real inference workloads. Published by researchers at Lossfunk, the work examines coding agents' performance, their practical constraints, and the feasibility of optimizing real-world inference cost and speed, offering practical insights for developer automation tooling and agent research.

x.com/fly51fly/status/20282242

#benchmarking #codingagents #inference #optimization

Ivan Fioravanti แฏ… (@ivanfioravanti)

While benchmarking the latest exolabs release, he found a /bench/chat/completions endpoint. The endpoint disables caching between calls, which makes it suitable for measuring real performance; he calls it a perfect fit for performance testing on the M3 Ultra.

x.com/ivanfioravanti/status/20

#exolabs #benchmarking #api #m3ultra

MRLN (@mrlnonai)

๋ชจ๋ธ์ด๋‚˜ ์‹œ์Šคํ…œ์ด SVG ์ถœ๋ ฅ ๋“ฑ ํŠน์ • ๋ฒค์น˜๋งˆํฌ์—์„œ ์„ฑ๋Šฅ์„ ๊ณผ๋Œ€ ํฌ์žฅ('benchmaxxed')ํ•˜๊ณ  ํ›ˆ๋ จ๋น„์šฉ์„ ๋‹จ 1000๋งŒ ๋‹ฌ๋Ÿฌ๋ผ๊ณ  ์ฃผ์žฅํ•˜๋Š” ์‚ฌ๋ก€์— ๋Œ€ํ•œ ๋น„ํŒ์  ์ฝ”๋ฉ˜ํŠธ์ž…๋‹ˆ๋‹ค. ์‹ค์ œ๋กœ๋Š” ๊ทธ๋ž˜ํ”ฝ์นด๋“œ ๋“ฑ ์žฅ๋น„ ์ธ์ˆ˜ ๋น„์šฉ์ด ํฌํ•จ๋˜์ง€ ์•Š๊ณ  ์—๋„ˆ์ง€๋น„๋งŒ ๊ณ„์‚ฐ๋˜๋Š” ๋“ฑ ๋น„์šฉ ์‚ฐ์ •์˜ ์™œ๊ณก ๊ฐ€๋Šฅ์„ฑ์„ ์ง€์ ํ•˜๊ณ , ์‚ฌ์šฉ ์‹œ ์ถ”๋ก  ํ† ํฐ์ด ๋งŽ์•„์ง„๋‹ค๋Š” ์ฃผ์žฅ์„ ๋‹ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

x.com/mrlnonai/status/20278918

#benchmarking #trainingcosts #modelevaluation #svg #ml

danzindanzin
2026-02-28

I built labeille to find CPython JIT crashes, but at its core it's a "run real-world test suites at scale" platform.

It also works for:
โ€” Checking which packages pass their tests on a new CPython version
โ€” Testing free-threaded (no-GIL) CPython compatibility
โ€” Measuring coverage.py or memray overhead across hundreds of packages
โ€” Comparing CPython vs PyPy performance on real code

The registry of 350+ packages with install/test commands is the core.

danzindanzin
2026-02-28

I've been working on a new Python tool: labeille. Its main purpose is to look for CPython JIT crashes by running real-world test suites.

github.com/devdanzin/labeille

But it's grown a feature that might interest more people: benchmarking using PyPI packages.

How does that work?

labeille allows you to run test suites in two different configurations: say, with coverage on and off, or with memray on and off. Here's an example:

gist.github.com/devdanzin/6352
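The two-configuration comparison can be sketched generically. This is not labeille's actual CLI or internals; it's a minimal stdlib-only timing harness where `workload` is a hypothetical stand-in for a real test suite and a line tracer stands in for coverage-style instrumentation:

```python
import sys
import time

def measure(fn, repeats=5):
    """Best wall-clock time over several runs (best-of reduces noise)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def workload():
    # Stand-in for a real test suite: some pure-Python work.
    return sum(i * i for i in range(100_000))

def traced_workload():
    # Crude stand-in for coverage instrumentation: trace every line.
    def tracer(frame, event, arg):
        return tracer
    sys.settrace(tracer)
    try:
        workload()
    finally:
        sys.settrace(None)

plain = measure(workload)
traced = measure(traced_workload)
print(f"plain: {plain:.4f}s  traced: {traced:.4f}s  "
      f"overhead: {traced / plain:.1f}x")
```

Running the same workload under both configurations and reporting the ratio is the core of the idea; labeille does this across entire PyPI test suites instead of a toy loop.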

Ivan Fioravanti แฏ… (@ivanfioravanti)

A brief update reconfirming, in further tests, that the RTX 3090 is faster. A more detailed performance comparison and analysis will appear in an article to be published tomorrow, with additional tests planned.

x.com/ivanfioravanti/status/20

#rtx3090 #gpu #benchmarking #nvidia

้‡‘ใฎใƒ‹ใƒฏใƒˆใƒช (@gosrum)

Inference-speed evaluation of Qwen3.5-27B-UD-Q4_K_XL with llama.cpp confirms the RTX 5090 is very fast when the model fits in VRAM. RTX 5090 (×1): prefill ~2800 tps, decode ~60 tps. M2 Ultra (×2): prefill ~256 tps, decode ~18 tps.

x.com/gosrum/status/2026450569

#qwen #llamacpp #benchmarking #rtx5090

้‡‘ใฎใƒ‹ใƒฏใƒˆใƒช (@gosrum)

Added ts-bench results for Qwen3.5-122B-A10B, but it won't run on the RTX 5090, so evaluation is slow going. Note: it's better to leave the default "thinking" option off (at every size it slows things down and lowers scores too). Also, in this round the 122B scored lower than the 27B.

x.com/gosrum/status/2026577182

#qwen #benchmarking #llm #gpu

2026-02-23

Sadly, adding quamina didn't bring any meaningful change to the integration test suite I'm using for my federated server, probably because the amount of data it handles is way too low and the overhead of running the application and test suite is way too high.

It looks like I need to build some artificial benchmarks exercising strictly the storage fetches.
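A minimal version of such an artificial benchmark might look like the following sketch (stdlib-only; `store` and `fetch` are hypothetical stand-ins — swap `fetch` for the real storage call to isolate it from application overhead):

```python
import timeit

# Hypothetical in-memory stand-in for the storage backend.
store = {f"key:{i}": {"id": i, "payload": "x" * 64} for i in range(10_000)}

def fetch(key):
    # Stand-in for a storage fetch; replace with the real backend call.
    return store[key]

def bench():
    # Touch 1,000 keys per iteration: nothing but fetches, no app logic.
    for i in range(1_000):
        fetch(f"key:{i}")

runs = 100
total = timeit.timeit(bench, number=runs)
per_fetch_us = total / (runs * 1_000) * 1e6
print(f"{per_fetch_us:.2f} us per fetch")
```

Because the loop does nothing but fetches, changes to the matcher or storage layer should show up directly in the per-fetch time instead of being drowned out by the rest of the application.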

#benchmarking

Python Trending (@pythontrending)

InferenceX๋ผ๋Š” ์˜คํ”ˆ์†Œ์Šค ์—ฐ์† ์ถ”๋ก (continuous inference) ๋ฒค์น˜๋งˆํ‚น ํ”„๋กœ์ ํŠธ์—์„œ Qwen3.5, DeepSeek, GPTOSS ๋“ฑ ๋ชจ๋ธ์„ ๋Œ€์ƒ์œผ๋กœ GB200 NVL72, MI355X, B200, GB300 NVL72, H100 ๋“ฑ ๋‹ค์–‘ํ•œ ์ถ”๋ก  ํ•˜๋“œ์›จ์–ด๋ฅผ ๋น„๊ตํ•˜๋Š” ๋ฒค์น˜๋งˆํฌ๋ฅผ ์†Œ๊ฐœํ•˜๋ฉฐ, ๊ณง TPUv6e/v7 ๋ฐ Trainium2/3 ์ง€์› ์˜ˆ์ •์ž„์„ ์•Œ๋ฆฝ๋‹ˆ๋‹ค.

x.com/pythontrending/status/20

#inferencex #benchmarking #opensource #qwen3.5 #h100

Bindu Reddy (@bindureddy)

An announcement that Claude Sonnet 4.6 has been released and is likely to be the best overall performer relative to cost. LiveBench results are teased as coming soon, so independent verification of its real performance is imminent.

x.com/bindureddy/status/202382

#claude #sonnet #llm #benchmarking #livebench

Latent.Space (@latentspacepod)

๋ฒค์น˜๋งˆํฌ์— ๋Œ€ํ•œ ์ฝ”๋ฉ˜ํŠธ๋กœ, ํŠนํžˆ ๊ณต๊ฐœ๋œ ์™ธ๋ถ€ ๋ฒค์น˜๋งˆํฌ๋Š” ์œ ์šฉํ•˜์ง€๋งŒ ์œ ํšจ๊ธฐ๊ฐ„์ด ์žˆ๋‹ค๋Š” ๊ด€์ ์ž…๋‹ˆ๋‹ค. ๊ฐ€์žฅ ์ข‹์€ ๋ฒค์น˜๋งˆํฌ๋Š” ์ดˆ๊ธฐ ์ ์ˆ˜๊ฐ€ 10~30% ์ˆ˜์ค€์œผ๋กœ ์‹œ์ž‘ํ•ด ์ดํ›„ ๊ฐœ์„ ์˜ ์—ฌ์ง€๊ฐ€ ๋‚จ์•„์žˆ์–ด ์—ฐ๊ตฌยท๊ฐœ์„  ํ™œ๋™์„ ์ด‰์ง„ํ•˜๋Š” ์œ ํ˜•์ด๋ผ๋Š” ์ฃผ์žฅ์ž…๋‹ˆ๋‹ค.

x.com/latentspacepod/status/20

#benchmarking #evaluation #ml #aibenchmarks

[Show GN: AutoRAG-Research - a collection of pipeline implementations from recent RAG papers and a comparison-experiment tool]

AutoRAG-Research is an open-source project for reimplementing and comparing RAG (Retrieval-Augmented Generation) methods, providing standardized benchmark datasets and ready-made implementations of recent RAG papers. The project has a plugin architecture designed to make adding custom datasets and RAG pipelines easy, and stresses that RAG remains important even in the AI-agent era.

news.hada.io/topic?id=26624

#rag #opensource #airesearch #benchmarking #automl

DeepInfra (@DeepInfra)

DeepInfra claims the top spot on @ArtificialAnlys's GLM-4.7-Flash benchmark, leading on throughput, latency, and price. Reported numbers: 105.7 tok/s, 0.24 s TTFT, $0.14/1M tokens. The poster stresses that a better kernel delivers more throughput on the same budget, making this a notable update for AI inference infrastructure optimization and cost efficiency.

x.com/DeepInfra/status/2019225

#deepinfra #glm4.7 #inference #benchmarking

khazzz1c (@Imkhazzz1c)

After joining a new company, the poster confesses to anxiety: an ICLR paper due within a month, plus post-training a model in the tens of billions of parameters and getting it onto the leaderboard, with Gemini 2.5 named as the target. The post lays bare competition- and benchmark-level performance targets and the burden of post-training large models.

x.com/Imkhazzz1c/status/201898

#iclr #gemini #modeltraining #llm #benchmarking
