Lmst

Slower than expected on moderately complex prompts, but better than expected results.

When deploying an LLM application, how do you test a model change before release?
The author currently tests by hand (10-20 prompts → deploy → monitor → fix bugs) and finds AWS SageMaker shadow testing too complex. Questions for the community:
1. What is an effective way to test a new model?
2. Are there tools for replaying real traffic?
3. Is manual testing enough?
Do you have a solution? #LLM #AITesting #MachineLearning #AI #TríTuệNhânTạo #KiểmThửAI #HọcMáy

https://www.reddit.com/r/LocalLLaMA/comments/1qr27hi/how_do_you_test_llm_mod
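
One way to picture the "replay real traffic" idea from the questions above is a small harness that runs logged prompts through both the current and the candidate model and lists the prompts that regress. This is only a sketch under stated assumptions: `replay_prompts`, the two stub models, and the `check` function are hypothetical names invented for illustration, not part of any tool mentioned in the thread.

```python
from typing import Callable, Dict, List


def replay_prompts(
    logged_prompts: List[str],
    current_model: Callable[[str], str],
    candidate_model: Callable[[str], str],
    check: Callable[[str], bool],
) -> Dict[str, object]:
    """Run logged prompts through both models and collect regressions.

    A regression is a prompt whose output passes `check` on the current
    model but fails it on the candidate model.
    """
    regressions = []
    passed_current = 0
    passed_candidate = 0
    for prompt in logged_prompts:
        old_ok = check(current_model(prompt))
        new_ok = check(candidate_model(prompt))
        passed_current += int(old_ok)
        passed_candidate += int(new_ok)
        if old_ok and not new_ok:
            regressions.append(prompt)
    total = len(logged_prompts)
    return {
        "current_pass_rate": passed_current / total if total else 0.0,
        "candidate_pass_rate": passed_candidate / total if total else 0.0,
        "regressions": regressions,
    }


# Hypothetical stand-ins for real model calls (assumption: a real setup
# would call the deployed model and the candidate model here).
def current_stub(prompt: str) -> str:
    return prompt.upper()


def candidate_stub(prompt: str) -> str:
    # Simulates a candidate that returns nothing for refund questions.
    return "" if "refund" in prompt else prompt.upper()


report = replay_prompts(
    ["reset my password", "refund my order"],
    current_stub,
    candidate_stub,
    check=lambda output: len(output) > 0,  # placeholder quality check
)
print(report)
```

In practice the `check` function is the hard part; a non-empty test like the one here only catches outright failures.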

PsyPost: Researchers are using Dungeons & Dragons to find the breaking points of major AI models. “A new study presented at the NeurIPS 2025 conference suggests that the tabletop game Dungeons & Dragons can serve as a tool for testing the intelligence of artificial intelligence agents. Researchers found that while current models can handle simple questions, they struggle to manage the multiple […]

https://rbfirehose.com/2026/01/26/psypost-researchers-are-using-dungeons-dragons-to-find-the-breaking-points-of-major-ai-models/

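Catching AI cheating for 42 cents per student: an NYU professor's AI oral exam experiment

NYU professor Panos Ipeirotis ran an experiment that used AI to conduct oral exams. In an AI/ML product management class of 36 students, the experiment used ElevenLabs voice AI to explore a cheap, efficient way to assess students. Early results showed that AI assessment can be fair and efficient, but they also exposed problems such as human-like bias in the AI agent and increased stress for students.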

https://news.hada.io/topic?id=25656

#aiassessment #aitesting #elevenlabs #voiceai #educationtechnology

I just built an automated, AI-powered QA testing platform 🤖

🎯 It handles tedious, repetitive QA work in 3 steps:
1️⃣ Discovery - captures the website's visual context
2️⃣ Test Flow Generation - creates test cases
3️⃣ Test Executions - runs the tests

Every step can be run automatically by an AI agent or configured manually ✨

It currently runs as a desktop-based app to keep the local browser secure. If you're interested in trying it, I'm happy to give free credits in exchange for feedback! 🚀

#AITesting #Quali

Testing AI-generated professional product photos with no distortion of logos, materials, or shapes. I built CL Storyboard with a "hidden 3D scene" to keep consistency high, control the lens (14mm-100mm), and reproduce natural lighting. Looking for test users (free credits included). I need feedback on the interface and on where consistency breaks down in real use.
#AITesting #ProductPhotography #EcommerceTools #CLStoryboard #AIẢnhSảnPhẩm #ThửNghiệmAI #CôngCụKinhDoanh

https://www.reddit.com/r/SaaS/comments/1q41

A real-world test comparing GLM 4.7 and Minimax M2.1 at generating e2e tests. Minimax M2.1 came out ahead on speed and reliability (it finished in 40 minutes, while GLM ran for 70 minutes without finishing).

Notably, although GLM 4.7 did not finish, it caught a code design flaw that Minimax missed. The user prefers Minimax M2.1 for performance but is still considering GLM as a backup for deeper issues.

#AI #LLM #GLM47 #MinimaxM21 #AITesting #TechReview #SoSanhAI #KiểmThửAI

https://www.reddit.com/r/LocalLLaMA/comments/1ptq7r

AI testing tools for software testing are gaining momentum as QA teams handle complex systems and faster release cycles.

We recently published a video exploring the AI testing tools landscape and how teams use these tools to improve coverage and reduce maintenance. Sharing it here to exchange views on how AI-driven testing is being adopted in real-world QA setups.

https://youtu.be/hheoLq4c7nQ

#AIToolsForSoftwareTesting #SoftwareTesting #AITesting #TestAutomation #QualityEngineering

Use a **local Ollama model** (for example, *llama3.2*) to test AI agents instead of a cloud API. Benefits: lower cost, data privacy, and offline operation. Setup is simple with **EvalView**: `pip install evalview`, then connect Ollama using YAML syntax to evaluate AI responses. The project is open source on GitHub. Which other Ollama models would you suggest trying?

#AItesting #Ollama #AIBots #Llama3 #ĐánhGIáAI #CơChếTesting #PythonTools #AIĐịaPhương #MastodonAI #TechNewsVN

https://www.reddit.com/r/ollam

Momentic raises $15M to revolutionize software testing, preventing 390,000 bugs with AI-powered verification platform. Transforming quality assurance for tech teams worldwide. #AITesting #SoftwareDevelopment

🟦 Set Up Evaluations in Microsoft Copilot Studio
Want reliable Copilot agents? Build test sets, run automated evaluations, and measure pass rates to improve accuracy and relevance 🚀

💡 Define test sets: import, generate, or add cases.
🔍 Pick methods: exact, partial, similarity, or quality.
⚖️ Run evaluations: simulate chats, score responses, set thresholds.

▶︎https://www.hubsite365.com/en-ww/citizen-developer/?id=4138de3e-13c6-f011-bbd3-7ced8d5e09ec&topic=9f678e9a-8cd4-ec11-a7b5-6045bd92fe52&theater=true

Ready to boost agent quality? Watch the guide or DM for a step-by-step walkthrough.
#CopilotStudio #AItesting #PowerPlatform #ConversationalAI
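
To make the method names in the post above concrete, here is a rough sketch of how exact, partial, and similarity checks and a pass rate could be computed in plain Python. It is not Copilot Studio's implementation: the function names, the substring rule for "partial", the `difflib`-based similarity score, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def exact_match(expected: str, actual: str) -> bool:
    """Pass only if the texts are equal after trimming and lowercasing."""
    return expected.strip().lower() == actual.strip().lower()


def partial_match(expected: str, actual: str) -> bool:
    """Pass if the expected text appears anywhere inside the response."""
    return expected.strip().lower() in actual.strip().lower()


def similarity_match(expected: str, actual: str, threshold: float = 0.8) -> bool:
    """Pass if a character-level similarity ratio reaches the threshold."""
    ratio = SequenceMatcher(None, expected.lower(), actual.lower()).ratio()
    return ratio >= threshold


def pass_rate(cases: List[Dict[str, str]], method: Callable[[str, str], bool]) -> float:
    """Fraction of test cases that pass under the given method."""
    if not cases:
        return 0.0
    passed = sum(1 for c in cases if method(c["expected"], c["actual"]))
    return passed / len(cases)


# Made-up test set for illustration only.
cases = [
    {"expected": "Paris", "actual": "Paris"},
    {"expected": "Paris", "actual": "The capital is Paris."},
    {"expected": "Paris", "actual": "Lyon"},
]

for name, method in [("exact", exact_match), ("partial", partial_match),
                     ("similarity", similarity_match)]:
    print(name, round(pass_rate(cases, method), 2))
```

The same test set scores differently under each method, which is why the choice of method and threshold matters as much as the test cases themselves.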

A/B testing is key in AI application development. It helps evaluate latency, cost efficiency, accuracy, and user experience. #AITesting #ABTesting #AIEngineering #PhátTriểnngDụngAI #KiểmThửA_B #MachineLearning #AITools

https://www.reddit.com/r/SaaS/comments/1or2o0n/why_ab_testing_is_crucial_in_ai_app_development/
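
As a rough illustration of the metrics named in the post above, the sketch below assigns users to a variant deterministically and then summarizes latency, cost, and accuracy per variant. Everything here is an assumption for illustration: the function names, the hash-based 50/50 split, the record fields, and the sample numbers are made up and do not come from the linked thread.

```python
import hashlib
from statistics import mean
from typing import Dict, List


def assign_variant(user_id: str) -> str:
    """Deterministically map a user to variant A or B (assumed 50/50 split)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def summarize(results: List[Dict]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-variant sample size, mean latency, mean cost, accuracy."""
    summary = {}
    for variant in sorted({r["variant"] for r in results}):
        rows = [r for r in results if r["variant"] == variant]
        summary[variant] = {
            "n": len(rows),
            "mean_latency_ms": mean(r["latency_ms"] for r in rows),
            "mean_cost_usd": mean(r["cost_usd"] for r in rows),
            "accuracy": sum(1 for r in rows if r["correct"]) / len(rows),
        }
    return summary


# Made-up records for illustration only; real ones would come from logs.
results = [
    {"variant": "A", "latency_ms": 100, "cost_usd": 0.002, "correct": True},
    {"variant": "A", "latency_ms": 200, "cost_usd": 0.004, "correct": False},
    {"variant": "B", "latency_ms": 300, "cost_usd": 0.001, "correct": True},
]

print(assign_variant("user-123"))
print(summarize(results))
```

A summary like this only shows averages; with small samples, the difference between variants may be noise rather than a real effect.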

The engineering cost of flaky tests is too high. I'm presenting our work in progress on a data-driven solution tomorrow at #SFSCon.

Join me for: "Zap the Flakes! Leveraging AI to Combat Flaky Tests with CANNIER."

I will detail the CANNIER research: using ML to predict and flag flakiness risks in KubeVirt CI.

⏰ Tomorrow, 8 November 2025 @ 10:00 CET
📍 Bolzano, Italy
🔗 Talk Details: https://www.sfscon.it/talks/zap-the-flakes/

#SoftwareQuality #CI #DevOps #AITesting #KubeVirt #WIP

Good speed for short prompts and decent results.

#AITesting

Client Info