
🚨 Are AI agents getting exponentially better? METR shows rising "time horizons", but 50% success means every second attempt fails. Log scale: more stable.

👉 My assessment: economically relevant, but no proof of imminent agent takeovers.

(Picture credits to METR, via metr org, retrieved 22 Feb 2026, "Model Evaluation & Threat Research"; social media editing and screenshot by: Marlon Niklas Kaulich)

#KI #AIAgents #METR #KünstlicheIntelligenz

https://winbuzzer.com/2026/02/25/anthropic-drops-hard-safety-limit-responsible-scaling-policy-xcxwbn/

Anthropic Drops Hard Safety Limits From its AI Scaling Policy

#AI #Anthropic #ResponsibleScalingPolicy #AISafety #AIRegulation #AISafety #AIModels #AITraining #CatastrophicRisk #METR #TrumpAdministration #Claude

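What it means that Opus 4.6 solves problems that take a human 14.5 hours (METR Time Horizon)

The research organization METR has published results evaluating the ability of the Opus 4.6 model to solve, with 50% probability, problems that take a human expert 14.5 hours. The study measures AI's capacity for long-horizon, autonomous work and suggests that AI has reached a threshold at which it can replace highly skilled knowledge work.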

https://news.hada.io/topic?id=26872

#ai #metr #opus46 #automation #timehorizon

Dear #devs,

A #METR study found that experienced developers were convinced that #AI made them 20% faster.

The reality: they took 19% longer.

Perception vs reality

🔗 https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/?utm_source=perplexity

#llm #claudecode #chatgpt #codex #gemini #agents #agentsai

Artificial intelligence increases workload instead of reducing it

Developers who used AI tools such as Cursor Pro with Claude 3.5/3.7 Sonnet needed 19 percent longer for their tasks than without AI assistance.

https://www.all-about-security.de/kuenstliche-intelligenz-verstaerkt-arbeitsbelastung-statt-sie-zu-verringern/

#METR #entwickler #ki #kitools

https://winbuzzer.com/2026/02/06/metr-five-hour-ai-claim-misunderstood-graph-xcxwbn/

METR's Five-Hour AI Claim: Why Everyone Misunderstood the Graph

#AI #METR #Anthropic #Claude #ClaudeOpus45 #AIResearch #AISafety #AIBenchmarks #LLMs #AICoding #AIAgents #AgenticAI #AICoding #AISafety #AIEthics

If AI coding is so good … where are the performance numbers?

https://fed.brid.gy/r/https://pivot-to-ai.com/2026/01/13/if-ai-coding-is-so-good-where-are-the-performance-numbers/

The interesting thing about the graph is not that 8-hour tasks (at a 50% success rate) are forecast for around the middle of this year, but how bleak the graph looks if you switch to an 80% success rate (there it is something like 15 minutes at the start of 2026, rather than 4.5 hours as at 50%).

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

#LLM #METR #evals #llm_evals #ai_evals
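
A rough sketch of why the 80% line sits so far below the 50% line. It assumes success probability follows a logistic curve in log2(task length), and it back-solves the slope from the two figures quoted in the post above (4.5 hours at 50%, roughly 15 minutes at 80%). The function names and the single-slope model are illustrative assumptions, not METR's code or fitted parameters.

```python
import math

def success_probability(task_minutes, h50_minutes, slope):
    """Assumed logistic curve in log2(task length); equals 0.5 at h50."""
    x = slope * (math.log2(h50_minutes) - math.log2(task_minutes))
    return 1.0 / (1.0 + math.exp(-x))

def horizon_at(p, h50_minutes, slope):
    """Task length in minutes at which the assumed curve gives success p."""
    logit = math.log(p / (1.0 - p))
    return h50_minutes * 2.0 ** (-logit / slope)

def slope_from_two_horizons(h50_minutes, hp_minutes, p):
    """Slope implied by knowing both the 50% horizon and the horizon at p."""
    logit = math.log(p / (1.0 - p))
    return logit / math.log2(h50_minutes / hp_minutes)

# Figures quoted in the post (approximate, read off a chart).
H50 = 4.5 * 60   # 50% horizon in minutes
H80 = 15.0       # 80% horizon in minutes

slope = slope_from_two_horizons(H50, H80, 0.8)

if __name__ == "__main__":
    for p in (0.5, 0.8, 0.9):
        print(f"{int(p * 100)}% horizon: {horizon_at(p, H50, slope):.1f} min")
```

Under these assumptions the curve is shallow: moving from 50% to 80% reliability costs a factor of 18 in task length, which is the gap the post is pointing at.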

METR finds Opus 4.5 has a 50% time horizon of 4 hours 49 minutes. A new technical evaluation of the AI model. #Opus45 #METR #AILab #ThờiGianXửLý #ĐánhGiáAI

https://www.reddit.com/r/singularity/comments/1pr39qf/metr_finds_opus_45_has_a_50_time_horizon_of_4/

GPT-5.1-Codex-Max sets a new standard on METR, asserting a leading position in natural language processing technology. #GPT51CodexMax #METR #TríTuệNhânTạo #ArtificialIntelligence #XửLýNgônNgữTựNhiên #NaturalLanguageProcessing

https://www.reddit.com/r/singularity/comments/1p1k1gr/gpt51codexmax_is_the_new_sota_on_metr/

«The supposed #productivity #revolution is not showing up in the numbers: a rigorous #METR study, which cannot be dismissed as technophobic, found that experienced #software #developers were 20% slower when using #AI tools. The problem lies in the gap between capability and reliability: the systems can perform impressive tasks, but with an inconsistency that demands constant human supervision, ...»
https://cenital.com/la-burbuja-de-la-inteligencia-artificial/
#LLM #Capitalismo

https://newsletter.getdx.com/p/unpacking-metr-findings-does-ai-slow-developers-down?inbox=true&triedRedirect=true Unpacking #METR's findings: Does #AI slow developers down?

People are starting to realize that #AI slows you down on projects of even minimal complexity (see the randomized #METR trial and this https://venturebeat.com/ai/stack-overflow-data-reveals-the-hidden-productivity-tax-of-almost-right-ai-code/), so what's the proposed solution? Put a human in the loop, so the poor can fix the mess. I haven't read the paper, but it sounds so stupid! It comes from #Microsoft by the way, so... https://arxiv.org/pdf/2507.22358

Interesting METR experiment: AI tools like Cursor cut raw coding time but ultimately slow devs down due to prompt crafting, reviewing, and tweaking. A solid study - though focused on one tool. Timely reminder: AI isn’t a magic bullet. #METR #AICoding #GenAI #SoftwareDev #Cursor

Cursor makes developers less e...

Very thoughtful analysis by @grimalkina of the experimental design and results from the recent METR study on “the impact of early-2025 AI on experienced open-source developer productivity”.

https://www.fightforthehuman.com/are-developers-slowed-down-by-ai-evaluating-an-rct-and-what-it-tells-us-about-developer-productivity/

#metr #cursor

METR study: using Cursor slows experienced developers down by 19%

It is treated as established truth that code-completion tools and other assistance from large language models help people program faster. A study by the organization METR calls this factoid into question and even demonstrates the opposite effect. An analysis of the work of 16 programmers found that AI slows a person down by 19%. This contradicts the opinion of machine learning industry experts, economists, and the experiment's participants themselves. Importantly, the test was run not on yet another benchmark or on timed algorithmic puzzles, but on people's ordinary work.

https://habr.com/ru/articles/927072/

#METR #Model_Evaluation_Threat_Research #научные_исследования #большие_языковые_модели #БЯМ #Сursor #программирование #GitHub #Git #автодополнение_кода

A #study by #METR found that #experienceddevelopers using #AIcoding tools on mature projects experienced a 19% #decrease in #productivity, contrary to their 20% increase estimate. While the results suggest limitations in AI coding tools, they do not negate their potential benefits in other contexts. https://secondthoughts.ai/p/ai-coding-slowdown?eicker.news #tech #media #news

Some quick notes on Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, a super interesting study on AI tooling’s effect on productivity.

https://vale.rocks/micros/20250711-0800

#AI #LLM #METR

Large language models are improving exponentially
https://spectrum.ieee.org/large-language-model-performance
#ycombinator #2030 #ai_capabilities #exponential_growth #large_language_models #metr #task_completion_time #type_departments
