#SWEbench

AI Daily Post (@aidailypost)
2025-11-08

Moonshot AI’s Kimi K2 Thinking just hit 71.3% on SWE-Bench, outpacing GPT-5, Claude Sonnet 4.5 and DeepSeek-V3.2. This open-source milestone shows how far community-driven models have come in handling HTML, React and real-world coding tasks. Dive into the details and see why K2 is setting a new bar for AI coding assistants.

🔗 aidailypost.com/news/moonshot-

2025-10-28

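The efficiency revolution shown by MiniMax M2: 8% of Claude's cost, twice the speed

China's MiniMax has released the M2 model, which outperforms Claude Opus 4.1 at 8% of the cost of Claude Sonnet and twice the speed. This post covers its efficient design, which activates only 10 billion of its 230 billion parameters, and how to use it in practice.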

aisparkup.com/posts/5962

2025-10-16

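Claude Haiku 4.5 released: Sonnet 4 performance at one-third the price

Anthropic's Claude Haiku 4.5 delivers what was state-of-the-art performance five months ago at one-third the price and twice the speed, changing the paradigm of AI use. Explore new possibilities, from real-world coding to multi-agent collaboration.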

aisparkup.com/posts/5653

2025-10-15

🚀 Kh/ubuntu Haiku 4.5 threatens Sonnet 4 on SWE Bench! The results show Permissions that are automatic and efficient. (Image: test results) #Haiku4.5 #Sonnet4 #SWEBench #AI #Tecnology

reddit.com/r/singularity/comme

2025-09-30

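Claude Sonnet 4.5, the new champion of AI coding models

Anthropic's Claude Sonnet 4.5 scored 70.6% on SWE-bench, beating GPT-5 to take first place. It can code autonomously for more than 30 hours and is delivering real results across a range of industries.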

aisparkup.com/posts/5213

Jeff Triplett (@webology)
2025-09-26

If your company is benefiting from Django’s stability and maturity to test or train AI models, consider **funding Django’s development**.

💚 Support Django: djangoproject.com/fundraising/

Dash Remover (@dashremover)
2025-09-22

Every time someone calls developers 'code monkeys' in 2025, a VC whispers 'founder material' and invests in a Slack plugin that reschedules meetings using vibes.

😂💸🧵

Sara Zan (@zansara)
2025-05-23

📢 Don't overlook this in the wave of releases! Mistral AI has a new coding LLM: it's Devstral, an open model perfect for on-prem, private and local deployments 🐈

📰 Have a look at the announcement: mistral.ai/news/devstral

(Image: SWE Bench results for Devstral)

Sara Zan (@zansara)
2025-05-22

🧠 Another flagship model released! Anthropic just unveiled Claude Opus 4 and Claude Sonnet 4, and they are at the top of the leaderboard for coding 💻

📰 Check out the announcement: anthropic.com/news/claude-4

N-gated Hacker News (@ngate)
2025-05-22

🎉🥳 OMG, Refact.ai scored a groundbreaking 69.8 on #SWEbench and now it's charging you in coins! 💰🔧 Apparently, solving 349 out of 500 tasks makes it the reigning champion of open-source AI agents. Who knew moving from request limits to coin tossing was the future of tech? 🤪👨‍💻
refact.ai/blog/2025/open-sourc
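
The 69.8 figure quoted above is consistent with the stated task count: 349 resolved tasks out of 500 works out to 69.8%. A minimal sketch of that arithmetic, assuming the score is simply resolved tasks divided by total tasks (the function name `resolve_rate` is hypothetical and not part of any benchmark tooling):

```python
def resolve_rate(resolved: int, total: int) -> float:
    """Return the share of tasks resolved as a percentage, rounded to one decimal.

    Assumption: the score is plain resolved / total, with no weighting.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * resolved / total, 1)


# Numbers taken from the post above: 349 of 500 tasks solved.
print(resolve_rate(349, 500))  # 69.8
```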

2025-05-21

#Devstral: New #opensource Model for Coding Agents by #MistralAI & #AllHandsAI 🧠

• 🏆 #Devstral achieves 46.8% on #SWEBench Verified, outperforming previous #opensource models by more than 6 percentage points and surpassing #GPT4 mini by 20%

🧵👇#AI #coding

2025-04-15

How we build SWE-bench for other languages

Modern software development is a melting pot of languages: Java, C#, JS/TS, Go, Kotlin… the list goes on. But when it comes to evaluating AI agents that can help write and fix code, we often run into limitations. The popular SWE-bench benchmark, for example, long supported only Python. To bridge the gap between the reality of development and the ability to evaluate AI, our team at…

habr.com/ru/companies/doubleta

#swebench #ии #нейросети #ml #машинное_обучение #искусственный_интеллект #github #open_source

2024-11-14

[Translation] A comparison of LLM benchmarks for software development

In this article, we compare various benchmarks that help rank large language models on software development tasks.

habr.com/ru/articles/857754/

#LLM #бенчмарки #бенчмаркинг #HumanEval #DevQualityEval #CodeXGLUE #Aider #SWEbench #ClassEval #BigCodeBench

2024-10-29

🚀 #Claude35Sonnet is now rolling out on #GitHubCopilot, bringing advanced coding capabilities directly to #VisualStudioCode and GitHub.com

• 🏆 Performance highlights:
- Highest score among public models on #SWEbench Verified
- 93.7% accuracy on #HumanEval for #Python function writing

• 💻 Key features:
- Production-ready code generation
- Inline debugging assistance
- Automated test suite creation
- Contextual code explanations

• ⚙️ Technical details:
- Runs via #AmazonBedrock
- Cross-region inference for enhanced reliability
- Available to all #GitHub Copilot Chat users and organizations

Source: anthropic.com/news/github-copi

2024-10-22

🚀 #Anthropic announces major updates to their #AI model lineup:

💻 Upgraded #Claude35Sonnet shows significant improvements:
• Achieves 49% on #SWEbench Verified coding benchmark
• Leads in software engineering capabilities
• Maintains same price and speed as predecessor
• Tested by US and UK #AI Safety Institutes

🔄 New #Claude35Haiku introduction:
• Matches #Claude3Opus performance at lower cost
• Scores 40.6% on SWEbench Verified
• Optimized for user-facing products
• Available across multiple cloud platforms

🖱️ Pioneering #ComputerUse beta feature:
• Allows AI to navigate interfaces like humans
• Scores 22% on #OSWorld benchmark
• Currently in experimental phase
• Supported by new safety classifiers

⚡ Enterprise adoption:
• #GitLab reports 10% improvement in DevSecOps tasks
• #Replit leverages computer use for app evaluation
• #Cognition notes enhanced problem-solving capabilities

anthropic.com/news/3-5-models-

marmelab (@marmelab)
2024-08-06

How do AI software engineering agents work?🤔🤖

Find the answer, along with valuable insights from the creators of SWE-bench & SWE-agent, in this article⬇️

newsletter.pragmaticengineer.c

Great read! 👏 @gergelyorosz, @elin Nilsson
