#benchmarking

adingbatponder :nixos: 👾 adingbatponder@fosstodon.org
2026-03-10

Thinkpad #X230 of 2013 (gift from a friend),
#apple #imac 18,2 of 2017 (350 euros, 64 GB RAM, 27" Retina display),
#hetzner CX43 #vm (10 euros per month) online server,
#Supermicro (SM in the table on the right) X11-WTR SYS-5019P-WTR #Xeon Silver 4110 × 16 (basic parts, 500 euros second-hand)
CPU comparisons using #hardinfo2 #benchmarking #homelab

[benchmark comparison table]

Weights & Biases (@wandb)

Weights & Biases announced that W&B Inference is now listed on @ArtificialAnlys. W&B says every model it serves is independently benchmarked and compared on intelligence, speed, cost, and latency, naming GLM-5, Kimi K2.5, and MiniMax M2.5 as examples of benchmarked models.

x.com/wandb/status/20303911103

#wandb #inference #benchmarking #models #llm

Minko Gechev (@mgechev)

AI ์—์ด์ „ํŠธ ์Šคํ‚ฌ์„ ์œ„ํ•œ '์œ ๋‹› ํ…Œ์ŠคํŠธ' ๊ฐœ๋…์˜ Skill Eval์ด ์†Œ๊ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. Docker๋กœ ๊ฒฉ๋ฆฌ๋œ ๋ฒค์น˜๋งˆํฌ์™€ ๊ฒฐ์ •๋ก ์  ๊ฒ€์‚ฌ ๋ฐ LLM ๊ธฐ๋ฐ˜ ์ฑ„์ ์ด ๊ฒฐํ•ฉ๋˜์–ด ์—์ด์ „ํŠธ ์Šคํ‚ฌ์˜ ํšŒ๊ท€๋ฅผ ์‚ฌ์ „์— ์žก์•„๋‚ด๋„๋ก ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

x.com/mgechev/status/202921483

#skilleval #benchmarking #docker #agent #evaluation

2026-03-04

CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

#CUDA #LLM #Benchmarking #Package

hgpu.org/?p=30630

Bindu Reddy (@bindureddy)

A report that Gemini Pro 3.1 is leading nearly every LiveBench leaderboard by a wide margin. It scores lower on hidden (unseen) questions, however, which raises suspicions of benchmark optimization; its real generalization performance deserves scrutiny.

x.com/bindureddy/status/202857

#geminipro #livebench #benchmarking #llm

fly51fly (@fly51fly)

ISO-Bench proposes a benchmark and analysis of whether coding agents can optimize real inference workloads. Published by researchers at Lossfunk, the work examines coding agents' performance, their practical constraints, and the feasibility of optimizing real-world inference cost and speed, offering practical insights for developer automation tooling and agent research.

x.com/fly51fly/status/20282242

#benchmarking #codingagents #inference #optimization

Ivan Fioravanti แฏ… (@ivanfioravanti)

While benchmarking the latest exolabs release, he found a /bench/chat/completions endpoint. The endpoint disables caching between calls, which makes it suitable for measuring real performance; he calls it a perfect fit for performance testing on the M3 Ultra.

x.com/ivanfioravanti/status/20

#exolabs #benchmarking #api #m3ultra

MRLN (@mrlnonai)

๋ชจ๋ธ์ด๋‚˜ ์‹œ์Šคํ…œ์ด SVG ์ถœ๋ ฅ ๋“ฑ ํŠน์ • ๋ฒค์น˜๋งˆํฌ์—์„œ ์„ฑ๋Šฅ์„ ๊ณผ๋Œ€ ํฌ์žฅ('benchmaxxed')ํ•˜๊ณ  ํ›ˆ๋ จ๋น„์šฉ์„ ๋‹จ 1000๋งŒ ๋‹ฌ๋Ÿฌ๋ผ๊ณ  ์ฃผ์žฅํ•˜๋Š” ์‚ฌ๋ก€์— ๋Œ€ํ•œ ๋น„ํŒ์  ์ฝ”๋ฉ˜ํŠธ์ž…๋‹ˆ๋‹ค. ์‹ค์ œ๋กœ๋Š” ๊ทธ๋ž˜ํ”ฝ์นด๋“œ ๋“ฑ ์žฅ๋น„ ์ธ์ˆ˜ ๋น„์šฉ์ด ํฌํ•จ๋˜์ง€ ์•Š๊ณ  ์—๋„ˆ์ง€๋น„๋งŒ ๊ณ„์‚ฐ๋˜๋Š” ๋“ฑ ๋น„์šฉ ์‚ฐ์ •์˜ ์™œ๊ณก ๊ฐ€๋Šฅ์„ฑ์„ ์ง€์ ํ•˜๊ณ , ์‚ฌ์šฉ ์‹œ ์ถ”๋ก  ํ† ํฐ์ด ๋งŽ์•„์ง„๋‹ค๋Š” ์ฃผ์žฅ์„ ๋‹ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

x.com/mrlnonai/status/20278918

#benchmarking #trainingcosts #modelevaluation #svg #ml

danzindanzin
2026-02-28

I built labeille to find CPython JIT crashes, but at its core it's a "run real-world test suites at scale" platform.

It also works for:
โ€” Checking which packages pass their tests on a new CPython version
โ€” Testing free-threaded (no-GIL) CPython compatibility
โ€” Measuring coverage.py or memray overhead across hundreds of packages
โ€” Comparing CPython vs PyPy performance on real code

The registry of 350+ packages with install/test commands is the core.

danzindanzin
2026-02-28

I've been working on a new Python tool: labeille. Its main purpose is to look for CPython JIT crashes by running real-world test suites.

github.com/devdanzin/labeille

But it's grown a feature that might interest more people: benchmarking using PyPI packages.

How does that work?

labeille allows you to run test suites in two different configurations: say, with coverage on and off, or with memray on and off. Here's an example:

gist.github.com/devdanzin/6352
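The two-configuration comparison can be sketched generically. This is not labeille's actual CLI or internals; it's a minimal stdlib-only timing harness where `workload` is a hypothetical stand-in for a real test suite and a line tracer stands in for coverage-style instrumentation:

```python
import sys
import time

def measure(fn, repeats=5):
    """Best wall-clock time over several runs (best-of reduces noise)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def workload():
    # Stand-in for a real test suite: some pure-Python work.
    return sum(i * i for i in range(100_000))

def traced_workload():
    # Crude stand-in for coverage instrumentation: trace every line.
    def tracer(frame, event, arg):
        return tracer
    sys.settrace(tracer)
    try:
        workload()
    finally:
        sys.settrace(None)

plain = measure(workload)
traced = measure(traced_workload)
print(f"plain: {plain:.4f}s  traced: {traced:.4f}s  "
      f"overhead: {traced / plain:.1f}x")
```

Running the same workload under both configurations and reporting the ratio is the core of the idea; labeille does this across entire PyPI test suites instead of a toy loop.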

Ivan Fioravanti แฏ… (@ivanfioravanti)

A brief update reconfirming, in further tests, that the RTX 3090 is faster. A more detailed performance comparison and analysis will appear in an article to be published tomorrow, with additional tests planned.

x.com/ivanfioravanti/status/20

#rtx3090 #gpu #benchmarking #nvidia

้‡‘ใฎใƒ‹ใƒฏใƒˆใƒช (@gosrum)

Inference-speed evaluation of Qwen3.5-27B-UD-Q4_K_XL with llama.cpp confirms the RTX 5090 is very fast when the model fits in VRAM. RTX 5090 (×1): prefill ~2800 tps, decode ~60 tps. M2 Ultra (×2): prefill ~256 tps, decode ~18 tps.

x.com/gosrum/status/2026450569

#qwen #llamacpp #benchmarking #rtx5090

้‡‘ใฎใƒ‹ใƒฏใƒˆใƒช (@gosrum)

Added ts-bench results for Qwen3.5-122B-A10B, but it won't run on the RTX 5090, so evaluation is slow going. Note: it's better to leave the default "thinking" option off (at every size it slows things down and lowers scores too). Also, in this round the 122B scored lower than the 27B.

x.com/gosrum/status/2026577182

#qwen #benchmarking #llm #gpu

2026-02-23

Sadly, adding quamina didn't bring any meaningful change to the integration test suite I'm using for my federated server, probably because the amount of data it handles is way too low and the overhead of running the application and test suite is way too high.

It looks like I need to build some artificial benchmarks exercising strictly the storage fetches.
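A minimal version of such an artificial benchmark might look like the following sketch (stdlib-only; `store` and `fetch` are hypothetical stand-ins — swap `fetch` for the real storage call to isolate it from application overhead):

```python
import timeit

# Hypothetical in-memory stand-in for the storage backend.
store = {f"key:{i}": {"id": i, "payload": "x" * 64} for i in range(10_000)}

def fetch(key):
    # Stand-in for a storage fetch; replace with the real backend call.
    return store[key]

def bench():
    # Touch 1,000 keys per iteration: nothing but fetches, no app logic.
    for i in range(1_000):
        fetch(f"key:{i}")

runs = 100
total = timeit.timeit(bench, number=runs)
per_fetch_us = total / (runs * 1_000) * 1e6
print(f"{per_fetch_us:.2f} us per fetch")
```

Because the loop does nothing but fetches, changes to the matcher or storage layer should show up directly in the per-fetch time instead of being drowned out by the rest of the application.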

#benchmarking

Python Trending (@pythontrending)

InferenceX๋ผ๋Š” ์˜คํ”ˆ์†Œ์Šค ์—ฐ์† ์ถ”๋ก (continuous inference) ๋ฒค์น˜๋งˆํ‚น ํ”„๋กœ์ ํŠธ์—์„œ Qwen3.5, DeepSeek, GPTOSS ๋“ฑ ๋ชจ๋ธ์„ ๋Œ€์ƒ์œผ๋กœ GB200 NVL72, MI355X, B200, GB300 NVL72, H100 ๋“ฑ ๋‹ค์–‘ํ•œ ์ถ”๋ก  ํ•˜๋“œ์›จ์–ด๋ฅผ ๋น„๊ตํ•˜๋Š” ๋ฒค์น˜๋งˆํฌ๋ฅผ ์†Œ๊ฐœํ•˜๋ฉฐ, ๊ณง TPUv6e/v7 ๋ฐ Trainium2/3 ์ง€์› ์˜ˆ์ •์ž„์„ ์•Œ๋ฆฝ๋‹ˆ๋‹ค.

x.com/pythontrending/status/20

#inferencex #benchmarking #opensource #qwen3.5 #h100

Bindu Reddy (@bindureddy)

An announcement that Claude Sonnet 4.6 has been released and is likely to be the best overall performer relative to cost. LiveBench results are teased as coming soon, so independent verification of its real performance is imminent.

x.com/bindureddy/status/202382

#claude #sonnet #llm #benchmarking #livebench

Latent.Space (@latentspacepod)

๋ฒค์น˜๋งˆํฌ์— ๋Œ€ํ•œ ์ฝ”๋ฉ˜ํŠธ๋กœ, ํŠนํžˆ ๊ณต๊ฐœ๋œ ์™ธ๋ถ€ ๋ฒค์น˜๋งˆํฌ๋Š” ์œ ์šฉํ•˜์ง€๋งŒ ์œ ํšจ๊ธฐ๊ฐ„์ด ์žˆ๋‹ค๋Š” ๊ด€์ ์ž…๋‹ˆ๋‹ค. ๊ฐ€์žฅ ์ข‹์€ ๋ฒค์น˜๋งˆํฌ๋Š” ์ดˆ๊ธฐ ์ ์ˆ˜๊ฐ€ 10~30% ์ˆ˜์ค€์œผ๋กœ ์‹œ์ž‘ํ•ด ์ดํ›„ ๊ฐœ์„ ์˜ ์—ฌ์ง€๊ฐ€ ๋‚จ์•„์žˆ์–ด ์—ฐ๊ตฌยท๊ฐœ์„  ํ™œ๋™์„ ์ด‰์ง„ํ•˜๋Š” ์œ ํ˜•์ด๋ผ๋Š” ์ฃผ์žฅ์ž…๋‹ˆ๋‹ค.

x.com/latentspacepod/status/20

#benchmarking #evaluation #ml #aibenchmarks

[Show GN: AutoRAG-Research - a collection of pipeline implementations from recent RAG papers and a comparison-experiment tool]

AutoRAG-Research is an open-source project for reimplementing and comparing RAG (Retrieval-Augmented Generation) methods, providing standardized benchmark datasets and ready-made implementations of recent RAG papers. The project has a plugin architecture designed to make adding custom datasets and RAG pipelines easy, and stresses that RAG remains important even in the AI-agent era.

news.hada.io/topic?id=26624

#rag #opensource #airesearch #benchmarking #automl

DeepInfra (@DeepInfra)

DeepInfra claims the top spot on @ArtificialAnlys's GLM-4.7-Flash benchmark, leading on throughput, latency, and price. Reported numbers: 105.7 tok/s, 0.24 s TTFT, $0.14/1M tokens. The poster stresses that a better kernel delivers more throughput on the same budget, making this a notable update for AI inference infrastructure optimization and cost efficiency.

x.com/DeepInfra/status/2019225

#deepinfra #glm4.7 #inference #benchmarking

khazzz1c (@Imkhazzz1c)

After joining a new company, the poster confesses to anxiety: an ICLR paper due within a month, plus post-training a model in the tens of billions of parameters and getting it onto the leaderboard, with Gemini 2.5 named as the target. The post lays bare competition- and benchmark-level performance targets and the burden of post-training large models.

x.com/Imkhazzz1c/status/201898

#iclr #gemini #modeltraining #llm #benchmarking
