#AIEvaluation

FutureOfTesting @futureoftesting
2025-06-24

Scale AI introduces "Scale Evaluation", a new platform for the automated evaluation of AI models across various benchmarks. The goal: identify weaknesses and improve them in a targeted way.

🔗 opentools.ai/news/scale-ai-unv

The educator panic over AI is real, and rational.
I've been there myself. The difference is I moved past denial to a more pragmatic question: since AI regulation seems unlikely (with both camps refusing to engage), how do we actually work with these systems?

The "AI will kill critical thinking" crowd has a point, but they're missing context.
Critical reasoning wasn't exactly thriving before AI arrived: just look around. The real question isn't whether AI threatens thinking skills, but whether we can leverage it the same way we leverage other cognitive tools.

We don't hunt our own food or walk everywhere anymore.
We use supermarkets and cars. Most of us Google instead of visiting libraries. Each tool trade-off changed how we think and what skills matter. AI is the next step in this progression, if we're smart about it.

The key is learning to think with AI rather than being replaced by it.
That means understanding both its capabilities and our irreplaceable human advantages.

1/3

#AI #Education #FutureOfEducation #AIinEducation #LLM #ChatGPT #Claude #EdAI #CriticalThinking #CognitiveScience #Metacognition #HigherOrderThinking #Reasoning #Vygotsky #Hutchins #Sweller #LearningScience #EducationalPsychology #SocialLearning #TechforGood #EthicalAI #AILiteracy #PromptEngineering #AISkills #DigitalLiteracy #FutureSkills #LRM #AIResearch #AILimitations #SystemsThinking #AIEvaluation #MentalModels #LifelongLearning #AIEthics #HumanCenteredAI #DigitalTransformation #AIRegulation #ResponsibleAI #Philosophy

AI isn't going anywhere. Time to get strategic:
Instead of mourning lost critical thinking skills, let's build on them through cognitive delegation—using AI as a thinking partner, not a replacement.

This isn't some Silicon Valley fantasy.
Three decades of cognitive research have already mapped out how this works:

Cognitive Load Theory:
Our brains can only juggle so much at once. Let AI handle the grunt work while you focus on making meaningful connections.

Distributed Cognition:
Naval crews don't navigate with individual genius—they spread thinking across people, instruments, and procedures. AI becomes another crew member in your cognitive system.

Zone of Proximal Development:
We learn best with expert guidance bridging what we can't quite do alone. AI can serve as that "more knowledgeable other" (though it's still early days).
The table below shows what this looks like in practice:

2/3

Critical reasoning vs Cognitive Delegation

Old School Focus:

Building internal cognitive capabilities and managing cognitive load independently.

Cognitive Delegation Focus:

Orchestrating distributed cognitive systems while maintaining quality control over AI-augmented processes.

We can still go for a jog or hunt our own deer, but to reach the stars, we apes do what apes do best: use tools to build on our cognitive abilities. AI is a tool.

3/3

A large table comparing unassisted critical reasoning vs "Cognitive Delegation", leveraging AI for higher order thinking.

IB Teguh TM @teguhteja
2025-05-28

Master Python Ragas AI Evaluation! Learn to effectively assess your LLMs and RAG systems for top-tier performance. Full tutorial inside.

teguhteja.id/python-ragas-ai-e

Mr Tech King @mrtechking
2025-05-08

SWE-Bench, a hot AI coding test, faces a big question: is it being gamed? Models might ace it but flunk real tasks, showing we may be testing test-smarts, not true skill. Time for better AI evaluation.

Rethinking AI Tests: Building Benchmarks That Actually Work.

PPC Land @ppcland
2025-04-16

ICYMI: Google updates quality rater guidelines with AI content evaluation criteria: Google's latest guidelines provide clearer direction on evaluating AI-generated content and spam tactics. ppc.land/google-updates-qualit

PPC Land @ppcland
2025-04-15

Google updates quality rater guidelines with AI content evaluation criteria: Google's latest guidelines provide clearer direction on evaluating AI-generated content and spam tactics. ppc.land/google-updates-qualit

SAIL Research Network @SAILnetwork
2025-04-02

🎉 That’s a wrap! The SAIL Spring School 2025 at Bielefeld University was an inspiring event, bringing together young researchers to explore AI evaluation beyond accuracy & precision.

🍕 A highlight: our poster & pizza session – Congrats to Kathrin Lammers & Thorben Markmann for winning Best Poster Awards! 👏

A big thank you to all speakers, participants & organizers! 🤝 See you at the next SAIL Spring School 2026 in Paderborn! 🚀

2024-10-25

"The #gamma GLM is a relatively assumption-light means of #modeling non-negative data, given gamma's flexibility.
[…]
"Explaining what is used and what is not used, despite merits and demerits […]: Loosely, the larger the internal literature in any field on modelling techniques, the less inclined people in that field seem to be to try something different."

Nick Cox, 2013: stats.stackexchange.com/questi

#normality #normalDistribution #Γ #modelling #dataDev #AIDev #ML #AIEvaluation #logNormal
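
For readers who want to see what a gamma GLM looks like in practice, here is a minimal sketch. It is not taken from Nick Cox's answer. It assumes a single predictor and a log link, and the function name `fit_gamma_glm_log_link` and the toy data are hypothetical. It fits the model by iteratively reweighted least squares in plain Python; in real work a statistics library would be the usual choice.

```python
import math


def fit_gamma_glm_log_link(x, y, iterations=25):
    """Fit E[y] = exp(b0 + b1 * x) for strictly positive y (gamma family, log link).

    Hypothetical helper for illustration only: one predictor, fixed iteration count.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    if any(v <= 0 for v in y):
        raise ValueError("the gamma family needs strictly positive responses")
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")

    # Start from a constant model at the log of the mean response.
    b0, b1 = math.log(sum(y) / n), 0.0
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]   # linear predictor
        mu = [math.exp(e) for e in eta]    # fitted means
        # Working response. For the gamma family with a log link the IRLS
        # weights are all equal, so each step is an ordinary least-squares fit.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        z_bar = sum(z) / n
        sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
        b1 = sxz / sxx
        b0 = z_bar - b1 * x_bar
    return b0, b1


# Toy, made-up data: positive and right-skewed.
doses = [0, 1, 2, 3, 4, 5]
responses = [1.8, 2.1, 3.3, 3.9, 5.6, 7.2]
intercept, slope = fit_gamma_glm_log_link(doses, responses)
print(f"log-mean intercept={intercept:.3f}, slope={slope:.3f}")
```

The only hard requirement on the data is that responses are positive, which is part of why the quote calls the approach assumption-light. The coefficients describe the log of the mean response, so `exp(slope)` is the multiplicative change in the mean per unit of `x`.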

2024-10-23

@datadon

"The following sections discuss several state-of-the-art interpretable and explainable #ML methods. The selection of works does not comprise an exhaustive survey of the literature. Instead, it is meant to illustrate the commonest properties and inductive biases behind interpretable models and [black-box] explanation methods using concrete instances."
wires.onlinelibrary.wiley.com/ 🧵

#interpretability #explainability #aiethics #compliance #taxonomy #ethicalai #aievaluation #linearRegression

2024-10-23

Model "#interpretability and [black-box] #explainability, although not necessary in many straightforward applications, become instrumental when the problem definition is incomplete and in the presence of additional desiderata, such as trust, causality, or fairness."

wires.onlinelibrary.wiley.com/

#aiethics #compliance #taxonomy #ethicalai #aievaluation

2024-09-10

I'm continuing research in LLM evals for apps and single prompts, which IMO is one of the most challenging fields in machine learning right now. I'm excited to learn more about Arize Phoenix and their "open-source observability library".

#LLM #MachineLearning #AIEvaluation #Evaluation #AIMetrics #MLOps #AIInsights #ArizeAI #Phoenix #AIEthics #AITransparency #ResponsibleAI

I'm linking a very informative video of theirs that got me interested in what they made:
youtube.com/watch?v=9Ay0WcjrdG

2024-09-09

Today I'm trying out DeepEval, an open-source LLM evaluation framework.
The team behind it wrote a wonderful article about evaluations that is very informative but still easy to understand.

#AI #LLM #MachineLearning #AIEvaluation #NaturalLanguageProcessing #TechInsights

confident-ai.com/blog/llm-eval
