
Recently published a short tutorial on evaluating cultural diplomacy projects. 🎨🌍

Evaluation in cultural diplomacy isn’t about measuring art itself. It’s about making visible the networks, partnerships, and opportunities that cultural diplomacy creates. Using widely available tools like Calc, Excel or Google Sheets can help teams reflect, learn, and stay accountable.

https://my-site-12f6cf.gitlab.io/portfolio/evaluating_cultural_diplomacy_made_easy/

#CulturalDiplomacy #Evaluation #Monitoring #PublicSector #CulturalManagement #DataDriven

Montreal mayor gives herself an 8 out of 10 on her first 100 days in office
Mayor Soraya Martinez Ferrada made 10 key promises to Montrealers that she said she’d achieve in her first 100 days in office. Today she hit that 100-day milestone. How did she do? She told reporters this week that she gives herself an eight out of 10.

#politics #evaluation #Montreal
https://www.cbc.ca/news/canada/montreal/montreal-mayor-100-days-9.7100424?cmp=rss

Tejal Patwardhan (@tejalpatwardhan)

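A study newly published in Nature reports new results on AI 'wet lab' evaluation. It appears to address the evaluation of AI models in biological, experiment-based settings, and the research team presumably presents an evaluation method that combines real experimental data with AI analysis.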

https://x.com/tejalpatwardhan/status/2024636639126102513

#research #ai #nature #evaluation #wetlab

prinz (@deredleritt3r)

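The author opens with ‘Denying the antecedent!’ and refers to a post in which Elon Musk claimed that benchmarks do not matter. While partly agreeing that benchmarks are not everything, the author points out that having nothing that can fully replace them is a problem, and that alternatives or complements to benchmarks are needed.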

https://x.com/deredleritt3r/status/2024545823401660765

#benchmarks #ai #evaluation #elonmusk

Chubby (@kimmonismus)

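The post says the author is still waiting for the official benchmark results for Grok 4.20. It expresses anticipation of, or a call for, the release of official benchmarks to verify performance, and asks for an objective evaluation of that version.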

https://x.com/kimmonismus/status/2023846221128052942

#grok #benchmark #evaluation #llm

The briefing also features perspectives from:

👤 Dr. Anne Reinhardt, Ludwig-Maximilians-Universität München

👤 Prof. Dr. Ute Schmid, Otto-Friedrich-Universität Bamberg / Bamberger Zentrum für Künstliche Intelligenz (BaCAI)

👤 Prof. Dr. Kerstin Denecke, Berner Fachhochschule BFH

📄 𝗥𝗲𝗮𝗱 𝘁𝗵𝗲 𝗳𝘂𝗹𝗹 𝗚𝗲𝗿𝗺𝗮𝗻 𝗯𝗿𝗶𝗲𝗳𝗶𝗻𝗴 (𝗦𝗠𝗖):
https://www.sciencemediacenter.de/angebote/chatbots-fehlerhafte-kommunikation-bei-gesundheitsfragen-26029

🧾 𝗡𝗮𝘁𝘂𝗿𝗲 𝗠𝗲𝗱𝗶𝗰𝗶𝗻𝗲 𝗽𝗮𝗽𝗲𝗿:
https://www.nature.com/articles/s41591-025-04074-y

#NLP #LLMs #HealthAI #HumanAIInteraction #Evaluation #UKPLab

On #EU project coordination: survey on the impact (#Wirkung) of projects under #H2020 and #HEurope https://ec.europa.eu/eusurvey/runner/HorizonSurvey2026 open until 2026-03-09. #Agrarforschung #Forschung #Evaluation

Edit: submission deadlines extended!

Reminder that the deadlines for the IEEE Engineering Reliable Autonomous Systems Conference 2026 in Zagreb, Croatia (May 28-29, just before ICRA in Vienna) are coming up!

March 7: Regular and short papers
March 7: Workshop and tutorial proposals
April 7: Late-breaking reports

Stakeholders across all autonomous system domains and practices are welcome!

https://2026-erasrobotics.org/index.html

#verification #robotics #autonomy #Conference #evaluation #testing #IEEE #cfp #zagreb #specification #autonomoussystems #reliability #eras2026 #reliablesystems

Ivan Fioravanti ᯅ (@ivanfioravanti)

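Points out that RepoBench reflects large-context reasoning, instruction following, and file-editing precision more than it measures a model's coding ability itself, and comments that the latest models sometimes appear weaker than earlier ones. Also shares a link to RepoPrompt's bench page.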

https://x.com/ivanfioravanti/status/2023444897806848112

#repoprompt #repobench #benchmark #llm #evaluation

Latent.Space (@latentspacepod)

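A comment on benchmarks, taking the view that public external benchmarks in particular are useful but have a shelf life. It argues that the best benchmarks are those where initial scores start at around 10 to 30%, leaving room for later improvement and thereby spurring research and improvement work.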

https://x.com/latentspacepod/status/2023306359132061992

#benchmarking #evaluation #ml #aibenchmarks

Chubby (@kimmonismus)

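A notice that the author was informed that the DeepSeek v4 evaluation results were fake, and has deleted and corrected the post in question. As a correction of a mistaken evaluation claim, it flags an issue with the reliability of research and model evaluation.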

https://x.com/kimmonismus/status/2023148930306109486

#deepseek #evaluation #retraction #researchintegrity

Sam Altman (@sama)

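An assessment that, within a few years, AI systems that struggled even with elementary-school math have become able to solve research-level math problems. The author agrees that Jakub's evaluation is currently the most important evaluation, and says he expects the public reaction to be along the lines of 'that's not so hard'.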

https://x.com/sama/status/2022729068949717182

#ai #research #math #evaluation

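Jakub Pachocki (@merettm)

Expresses anticipation for the "First Proof" challenge and stresses the importance of frontier research for evaluating the capabilities of next-generation AI models. Mentions the results of running an internal model on the 10 proposed problems under limited human supervision.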

https://x.com/merettm/status/2022517085193277874

#airesearch #benchmark #evaluation #challenge

📻 [ #notation and #évaluation of #fonctionnaires (rating and evaluation of civil servants) ] 🚨 The recordings of the 6 February session of *Dialogues autour de la fonction publique*, with Hélène Guillet, Jean-Francois Verdier, Jean Le Bihan & Pierre Karila-Cohen, are online 👉 https://compter.hypotheses.org/3071

End of the verification era: CBSE implements digital marking for Class 12 in 2026, removing human errors and the need for post-result mark checks. https://english.mathrubhumi.com/news/india/cbse-class-12-digital-evaluation-2026-no-more-mark-verification-b26lbu07?utm_source=dlvr.it&utm_medium=mastodon #CBSE #class12 #boardexam #answersheets #evaluation

Protection Of U.S. Streams Is Insufficient To Safeguard Stream Diversity And Prevent Habitat Impairment
--
https://doi.org/10.1038/s44458-025-00026-2 <-- shared paper
--
#water #surfacewater #USA #US #biodiversity #protection #streams #rivers #regulations #waterresources #watermanagement #waterquality #diversity #habitat #damage #risk #hazard #GlobalBiodiversityFramework #target #CONUS #spatialanalysis #geostatistics #geophysics #biogeographic #impacts #humanimpacts #conservation #watershed #upstream #impairment #evaluation #GAPStatus #catchment #metrics #agriculture #urban #industry #transportation #diversity

Our Impact Unit team advises you free of charge on the #Evaluation of your #Wisskomm (science communication)! In just a few clicks you can book your 30-minute video call and get help and tips at short notice. The service runs (almost🐞) every week on Wednesdays at 10:00 and 10:30.

Our colleagues advise you on everything from the data collection method and practical implementation through to the interpretation and reporting of your results.

We look forward to hearing from you! Book an appointment now:
https://impactunit.de/evaluationsberatung/


🔍 Including users' perspectives in impact orientation: that is the subject of a new post on my blog.

In the post I discuss why it is essential to also include users' perspectives in impact orientation and impact analysis.

Link to the blog 👉 https://blog.soziale-wirkung.de/2026/02/12/perspektiven-nutzerinnen-wirkungsorientierung-einbinden/

#Wirkung #Wirkungsanalyse #Wirkungsorientierung #SozialeArbeit #Sozialwirtschaft #Wirkmodelle #Monitoring #Evaluation #SocialImpact #data4good

Ai2 (@allen_ai)

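LLMs often generate step-by-step instructions for real-world tasks (e.g. filing taxes) or for AI agent planning, but steps that look fluent may not actually work, and current datasets cover only a limited range of domains, which makes improvement difficult. How2Everything presents an evaluation/dataset solution for evaluating (and training on) this step-by-step instruction problem at scale.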

https://x.com/allen_ai/status/2021264352175648956

#llm #how2everything #dataset #evaluation
