#SWEBench

Sara Zan (@zansara)
2025-05-23

📢 Don't overlook this in the wave of releases! Mistral AI has a new coding LLM: it's Devstral, an open model perfect for on-prem, private, and local deployments 🐈

📰 Have a look at the announcement: mistral.ai/news/devstral

(Image: SWE-bench results for Devstral)
Sara Zan (@zansara)
2025-05-22

🧠 Another flagship model released! Anthropic just unveiled Claude Opus 4 and Claude Sonnet 4, and they sit at the top of the coding leaderboards 💻

📰 Check out the announcement: anthropic.com/news/claude-4

N-gated Hacker News (@ngate)
2025-05-22

🎉🥳 OMG, Refact.ai scored a groundbreaking 69.8 on #SWEbench Verified and now it's charging you in coins! 💰🔧 Apparently, solving 349 out of 500 tasks makes it the reigning champion of open-source AI agents. Who knew moving from request limits to coin tossing was the future of tech? 🤪👨‍💻
refact.ai/blog/2025/open-sourc
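SWE-bench scores like the one above are a resolve rate: the share of benchmark tasks whose hidden tests pass after the agent's patch is applied. A minimal sketch of that arithmetic (the function name is ours, not from any SWE-bench tooling):

```python
def resolve_rate(solved: int, total: int) -> float:
    """Percentage of benchmark task instances resolved, rounded to one decimal."""
    return round(100 * solved / total, 1)

# Refact.ai's reported result: 349 of SWE-bench Verified's 500 tasks.
print(resolve_rate(349, 500))  # 69.8
```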

2025-05-21

#Devstral: New #opensource Model for Coding Agents by #MistralAI & #AllHandsAI 🧠

• 🏆 #Devstral achieves 46.8% on #SWEBench Verified, outperforming previous #opensource models by over 6 percentage points and surpassing GPT-4.1 mini by over 20%

🧵👇#AI #coding

2025-04-15

How we build SWE-bench for other languages

Modern software development is a melting pot of languages: Java, C#, JS/TS, Go, Kotlin… and the list goes on. But when it comes to evaluating AI agents that can help write and fix code, we often run into limitations. The popular SWE-bench benchmark, for example, long supported only Python. To bridge the gap between real-world development and what AI evaluation can cover, our team at
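For context on what porting SWE-bench to another language involves: each task instance is a record pairing a real GitHub issue with a repository snapshot and the tests that decide success. A sketch of the core fields, following the public dataset schema (the values here are illustrative placeholders, not real dataset entries):

```python
# Sketch of a SWE-bench task instance; the field names follow the public
# dataset schema, but every value below is an illustrative placeholder.
instance = {
    "instance_id": "example__repo-1234",  # unique task identifier (hypothetical)
    "repo": "example/repo",               # GitHub repository the issue comes from
    "base_commit": "abc123",              # commit the agent starts from
    "problem_statement": "Issue text the agent must resolve.",
    "FAIL_TO_PASS": ["tests/test_fix.py::test_bug"],  # must flip to passing
    "PASS_TO_PASS": ["tests/test_core.py::test_ok"],  # must not regress
}

# A task counts as resolved only if every FAIL_TO_PASS test passes and no
# PASS_TO_PASS test breaks after the model's patch is applied.
print(sorted(instance))
```

Supporting a new language means rebuilding exactly this pipeline: reproducible repository environments, and a test harness that can run FAIL_TO_PASS / PASS_TO_PASS suites for that ecosystem's tooling.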

habr.com/ru/companies/doubleta

#swebench #ai #neuralnetworks #ml #machine_learning #artificial_intelligence #github #open_source

2024-11-14

[Translation] Comparing LLM benchmarks for software development

In this article, we compare the various benchmarks used to rank large language models on software-development tasks.

habr.com/ru/articles/857754/

#LLM #бенчмарки #бенчмаркинг #HumanEval #DevQualityEval #CodeXGLUE #Aider #SWEbench #ClassEval #BigCodeBench

2024-10-29

🚀 #Claude35Sonnet is now rolling out on #GitHubCopilot, bringing advanced coding capabilities directly to #VisualStudioCode and GitHub.com

• 🏆 Performance highlights:
- Highest score among public models on #SWEbench Verified
- 93.7% accuracy on #HumanEval for #Python function writing

• 💻 Key features:
- Production-ready code generation
- Inline debugging assistance
- Automated test suite creation
- Contextual code explanations

• ⚙️ Technical details:
- Runs via #AmazonBedrock
- Cross-region inference for enhanced reliability
- Available to all #GitHub Copilot Chat users and organizations

Source: anthropic.com/news/github-copi

2024-10-22

🚀 #Anthropic announces major updates to their #AI model lineup:

💻 Upgraded #Claude35Sonnet shows significant improvements:
• Achieves 49% on #SWEbench Verified coding benchmark
• Leads in software engineering capabilities
• Maintains same price and speed as predecessor
• Tested by US and UK #AI Safety Institutes

🔄 New #Claude35Haiku introduction:
• Matches #Claude3Opus performance at lower cost
• Scores 40.6% on SWEbench Verified
• Optimized for user-facing products
• Available across multiple cloud platforms

🖱️ Pioneering #ComputerUse beta feature:
• Allows AI to navigate interfaces like humans
• Scores 22% on #OSWorld benchmark
• Currently in experimental phase
• Supported by new safety classifiers

⚡ Enterprise adoption:
• #GitLab reports 10% improvement in DevSecOps tasks
• #Replit leverages computer use for app evaluation
• #Cognition notes enhanced problem-solving capabilities

anthropic.com/news/3-5-models-

marmelab (@marmelab)
2024-08-06

How do AI software engineering agents work?🤔🤖

Find the answer, along with valuable insights from the creators of SWE-bench & SWE-agent, in this article⬇️

newsletter.pragmaticengineer.c

Great read! 👏 @gergelyorosz, @elin Nilsson
