#Benchmarking

Ultralytics (@ultralytics)

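A one-line benchmark for Ultralytics YOLO26 has been introduced. It is a tool and workflow guide that measures inference speed and accuracy across CPU, GPU, and FP16 settings so you can quickly identify the best configuration for your deployment environment.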

https://x.com/ultralytics/status/2018373879224308103

#ultralytics #yolo26 #benchmarking #computervision

TechFollow (@TechFollowrazzi)

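Micah Hill-Smith (co-founder and CEO of ArtificialAnlys) runs an independent AI benchmarking platform. The platform is presented as a tool that helps teams compare and choose the best model and API provider for a specific use case, which makes it useful for evaluating and selecting AI tools and models.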

https://x.com/TechFollowrazzi/status/2017990678022676755

#benchmarking #ai #models #platform

Ultralytics (@ultralytics)

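New tutorial: an introduction to benchmarking Ultralytics YOLO26. It explains how to use the Ultralytics benchmark mode to measure inference speed, latency, and throughput in order to understand real-world performance trade-offs, and provides links to the content and video.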

https://x.com/ultralytics/status/2017296843886019036

#yolo #computervision #benchmarking #ultralytics
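
The latency and throughput measurements mentioned in the Ultralytics post above can be illustrated with a minimal, library-independent sketch. This is not the Ultralytics benchmark mode or its API; the names `measure` and `fake_inference` are hypothetical and exist only for illustration, and the stand-in workload should be swapped for a real model call.

```python
import statistics
import time
from typing import Callable, Dict


def measure(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> Dict[str, float]:
    """Time repeated calls to fn and summarize latency and throughput."""
    # Warm-up calls are discarded so one-time setup costs do not skew results.
    for _ in range(warmup):
        fn()

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    total_s = sum(latencies_ms) / 1000.0
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
        # Calls completed per second of measured time.
        "throughput_per_s": runs / total_s if total_s > 0 else float("inf"),
    }


def fake_inference() -> int:
    # Hypothetical stand-in workload; replace with a real model call.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(measure(fake_inference))
```

Reporting the median alongside the mean helps show whether a few slow calls are dominating the average.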

Who doesn't want to nearly double the performance of their code?

#Benchmarking #Performance #DotNet #CSharp

OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)
https://quesma.com/blog/introducing-otel-bench/
#ycombinator #benchmarking #opentelemetry #observability #llm #instrumentation #tracing

Without benchmarking LLMs, you're likely overpaying 5-10x

https://karllorey.com/posts/without-benchmarking-llms-youre-overpaying

#HackerNews #LLMs #Benchmarking #Overpaying #AIInsights #CostEfficiency

Benchmarking a Baseline Fully-in-Place Functional Language Compiler [pdf]

https://trendsfp.github.io/papers/tfp26-paper-12.pdf

#HackerNews #Benchmarking #Functional #Language #Compiler #Compiler #Performance #Programming #Research #Tech #Trends

#throwback What started as a simple DBaaS comparison turned into a deep dive into PostgreSQL benchmarking
🚀 Dirk Krautschick shares hard-earned lessons on tools, workloads, tuning, and real vs synthetic benchmarks. Avoid common pitfalls and benchmark smarter.

▶️ Watch now! https://www.youtube.com/watch?v=aB5dNcpBI44&list=PL_m-TUcr7ZvnSBmPoxZvcB1lfy7C9eced&index=7

#PostgreSQL #PGDay #PPDD #Benchmarking #DatabasePerformance

CrystalMark 3D25 - Demo Scene GPU Benchmark Utility

While updating some of my apps, I noticed the team behind CrystalDiskMark & CrystalDiskInfo have released CrystalMark 3D25, a 3D GPU performance benchmark based on demo scene examples (as in the 3D programs coded to run from really small executables). It's a fun way to benchmark the GPU ✌😄

https://crystalmark.info/en/software/crystalmark3d25/

#CrystalMark #CrystalMark3D25 #DemoScene #3D #GPU #RealTime #Graphics #RealTimeGraphics #Benchmark #Benchmarking #GameDev #Gaming

Antoine Moyroud (@antoine_moyroud)

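The post points out that the price, quality, and latency trade-offs of AI models are opaque, and credits @ArtificialAnlys and Hugging Face with together showing more clearly which models (public API and closed) are actually leading. It stresses the importance of both platforms for model comparison and evaluation.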

https://x.com/antoine_moyroud/status/2012240016760541455

#benchmarking #huggingface #modelevaluation #pricing

Janek Mann (@janekm)

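The author says they cannot make sense of the benchmark results from the paper/presenter, and suggests that either the benchmarking was done very badly or the released model has a bug. They also note that Z-Image-Turbo (second image) produces better text output in their own setup, raising questions about the released model's performance and reproducibility.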

https://x.com/janekm/status/2011415256506179735

#imagemodels #benchmarking #zimageturbo #modelbug

Python Trending (@pythontrending)

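τ²-Bench (tau2-bench) is a benchmark/tool for evaluating conversational agents in a dual-control environment. It presents a research evaluation framework used for comparative measurement of agents' interaction quality, safety, controllability, and similar properties.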

https://x.com/pythontrending/status/2011476866201198982

#benchmarking #conversationalai #evaluation #agents

Rohan Paul (@rohanpaul_ai)

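This paper reports that today's coding agents consume a lot of energy even with small models, while their bug-fixing results remain poor. AutoCodeRover used 9.4 times more energy than OpenHands, and the best success rate was only 4%. The experiment compared four frameworks on 50 SWE-bench Verified Mini problems using Gemma-3 4B and Qwen-3 1.7B.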

https://x.com/rohanpaul_ai/status/2008465814471799150

#energyefficiency #codingagents #benchmarking #gemma3 #qwen3

vani (@vaniagrwall)

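For day 13 of BuildInPublic, the author set up the research.site domain and released 'museum of queries', an index-based project. They deployed a side-by-side experiment demo that preserves and compares the outputs of several providers (OpenAI, Perplexity AI, GeminiApp, p0) for the same prompt, so that each provider's reasoning process can be seen (test link included).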

https://x.com/vaniagrwall/status/2005464073711042936

#openai #perplexity #gemini #benchmarking #llm

The profiling trap

C++ optimization and profiling: branchless code lost to a plain if-else. What went wrong? Let's work through it together.

https://habr.com/ru/articles/979778/

#оптимизация #benchmark #benchmarking #benchmarks

We are building a terminal-first, extensible agent platform that supports multi-agent workflows, a CLI/TUI interface, error control, and LLM integrations. We need engineers for workflows, plugins, and benchmarks. Ownership is offered, along with the chance to contribute as part of the core team.
#OSS #AI #Mastodon #MáyTính #ĐồngLậpTrình #MảngLậpTrình #Benchmarking #OpenSource #LậpTrìnhMở

https://www.reddit.com/r/LocalLLaMA/comments/1pkpmpr/oss_terminalfirst_agent_orchestration_platform/

On SimpleBench, GPT‑5.2 and GPT‑5.2 Pro are rated lower than GPT‑5. Performance dropped: the newer versions received lower scores than GPT‑5, as reported on the Simple‑Bench leaderboard. Information from lmcouncil.ai/benchmarks (discussed on Reddit).
#AI #Benchmark #GPT #ĐánhGiá #AIChatbots #ArtificialIntelligence #ĐánhGia #Benchmarking

https://www.reddit.com/r/singularity/comments/1pkp2sw/simplebench_for_gpt_52_and_gpt_52_pro_both_scored/

Here’s some context for #GHULbenchmark:

Most tools show synthetic numbers — #GHUL measures real heat, real load, real sensors. Hardware doesn’t die from FPS; it dies from thermals, VRAM hotspots, and PSUs begging for mercy. 🔥💀

Fun fact from RDNA4 testing: the new “silent” fan feature is a silent killer. VRAM hits 90°C, hotspot follows, fans chill at 46% (BIOS-enforced).
GHUL would cook the card instantly if I hadn’t added emergency shutdowns.

Uploads aren’t required — real nerds test locally first, then send a PR when something explodes or a sensor speaks folklore. 😄

AMD & NVIDIA supported; Intel ARC is next.
Own an ARC card? Congrats, you’re volunteered.

#Linux #Benchmarking #FOSS #GHUL #LinuxGaming #AMD #NVIDIA #ARC

🚀 GHULbenchmark v0.3 is here!

A Linux-native hardware torture & analysis suite — built because nothing out there did what it should.
Same reason Linux exists.
Same reason Git exists.
If the tools suck, we write better ones. GNU-style. 🐐

🔥 #Hellfire Stress Tests (CPU/GPU/RAM/Cooler)
🧠 Sensor autodiscovery (--dump-layout)
💀 GPU Diagnostic Mode
📈 Upload system coming soon — fake scores will die screaming
🐧 #AMD & #NVIDIA supported — Intel #ARC enters the arena next

#GHUL #benchmark
Built FOR Linux.
Built ON Linux.
Built BECAUSE Linux.

👉 https://github.com/g-h-u-l/GHULbenchmark

#Gaming #Linux #Benchmarking #FOSS #OpenSource #AMD #NVIDIA #GHUL #SysAdmin #Manjaro #ArchLinux #GNU #RicingButForScience #NoRGBneeded

Ready for local hardware tests on the rig:
no GUI, no marketing — just #bashing raw data into JSON and scientific results.
Some comments are borderline, but hey — my humor is my trademark.

Best experiences can be expected on Arch-based gaming rigs

The @Cyberagentur is launching HEGEMON, a research competition unique in Europe for evaluating and adapting foundation models for security-critical applications. Four teams are developing benchmarks and AI models for complex tasks in geoinformation.
More: https://t1p.de/7ct97
#Cyberagentur #HEGEMON #KI #FoundationModels #Cybersicherheit #Benchmarking
