
Qwen2-72B takes first place on the leaderboard by duplicating 7 middle layers, without touching a single weight

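Developer David Noel Ng reached first place on the HuggingFace Open LLM Leaderboard using a simple method that runs a block of seven middle layers of Qwen2-72B a second time, with no weight changes and no fine-tuning. The method passes a specific range of middle layers (45 to 51) through one additional time, and scores rose on 5 of the 6 main benchmarks. The finding supports the hypothesis that functionally specialized circuits exist inside LLMs, and it shows that exploiting them can improve performance substantially without touching the weights.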
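A minimal sketch of what running layers 45 to 51 a second time means for the execution order. The 80-layer depth, the 0-based indexing, and the names `build_layer_order` and `run_stack` are assumptions made for illustration and are not taken from the original write-up; the toy "layers" are plain functions, not transformer blocks.

```python
from typing import Callable, List, Sequence


def build_layer_order(num_layers: int, start: int, end: int) -> List[int]:
    """Return layer indices in execution order, with [start, end] run twice."""
    if not 0 <= start <= end < num_layers:
        raise ValueError("repeat range must lie inside the layer stack")
    first_pass = list(range(0, end + 1))      # layers 0..end
    repeat = list(range(start, end + 1))      # layers start..end once more
    rest = list(range(end + 1, num_layers))   # remaining layers
    return first_pass + repeat + rest


def run_stack(layers: Sequence[Callable[[float], float]],
              order: Sequence[int], x: float) -> float:
    """Apply layers in the given order; layer objects are reused, not copied."""
    for idx in order:
        x = layers[idx](x)
    return x


# Assumed depth of 80 layers; indices 45..51 are the seven repeated layers.
order = build_layer_order(num_layers=80, start=45, end=51)

# Toy stand-ins for layers: each one just adds 1 to its input.
toy_layers = [lambda v: v + 1.0 for _ in range(80)]
result = run_stack(toy_layers, order, 0.0)
```

Because the repeated indices point at the same layer objects, the parameter set is unchanged; only the forward pass gets seven steps longer.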

https://news.hada.io/topic?id=27406

#llm #qwen2 #neuroanatomy #transformer #modeloptimization

Brie Wensleydale (@SlipperyGem)

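Reports that Unsloth has updated the LTX 2.3 GGUF to "UD (Unsloth Dynamic 2.0)". The update is said to keep GGUF's file-size and speed advantages while reducing the "smudgeiness" (quality degradation) noted in earlier GGUFs. The post links to the Hugging Face repository, making this a notable update on model format and quality.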

https://x.com/SlipperyGem/status/2031387673487937820

#unsloth #gguf #ltx2.3 #huggingface #modeloptimization

Andy Peng (@pymhq)

The author reports attending the "Cafe Compute Seattle: Cozy Edition" meetup in Seattle (hosted by Cerebras and GitHub) while running an EAGLE training job. There were discussions on model optimization at the event, and the author plans to keep updating the thread.

https://x.com/pymhq/status/2026551179736666296

#training #modeloptimization #meetup #cerebras

Clément Pillette (@ClementPillette)

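Reports that running kim-dev 72B in BF16 parallelized across 2 GPUs was a bit of a stretch, so the author is trying AWQ 4-bit quantization instead. The tweet also shares practical results: thanks to the MLX team (especially @ivanfioravanti), running models on a Mac Studio has become much easier, and Minimax 2.5 runs well at 8-bit at 30 tps.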
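A rough back-of-the-envelope sketch of why a 72B model is heavy in BF16 and much lighter at 4 bits. It counts weight storage only and assumes exactly 72e9 parameters; the helper name `weight_memory_gb` is hypothetical, and the estimate ignores quantization scales and zero points, the KV cache, and activations.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params_72b = 72e9  # assumed parameter count, read off the "72B" in the name

bf16_gb = weight_memory_gb(params_72b, 16)  # BF16 stores 16 bits per weight
int4_gb = weight_memory_gb(params_72b, 4)   # 4-bit weights, metadata ignored
```

Under these assumptions the weights come to about 144 GB in BF16 and about 36 GB at 4 bits, a fourfold reduction before any runtime overhead is counted.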

https://x.com/ClementPillette/status/2024153241387196892

#quantization #awq #llm #modeloptimization #bf16

Tarjei Mandt (@kernelpool)

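Shares the technical observation that sparse attention slows down processing in the prefill stage, and notes that the problem is fixable. The post raises an important performance issue, and room for improvement, for the LLM inference pipeline (prefill in particular) and attention optimization.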

https://x.com/kernelpool/status/2022691285312901537

#sparseattention #prefill #performance #modeloptimization

Python Trending (@pythontrending)

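A model compression toolkit called AngelSlim has been released. The post emphasizes that it is a developer toolkit designed for better usability, comprehensiveness, and efficiency, supporting model compression and optimization workflows.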

https://x.com/pythontrending/status/2021903637635530796

#modelcompression #modeloptimization #toolkit #ai

Mojofull (@furoku)

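A one-sentence tweet announcing that "the race to speed up AI models has begun." It reads as a trend alert suggesting that competition over inference and training speed and optimization is now in full swing.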

https://x.com/furoku/status/2018864750575378827

#ai #modeloptimization #inference #performance

Looking for a way to fine-tune a small (quantized) language model directly in C++ without porting the code to Python? Running into trouble because the existing codebase only supports C++? Which approach works well?

#C++_Programming #MachineLearning #ModelOptimization #FineTuning
#LậpTrìnhC_ #HọcMáy #TốiƯuMôHình #ĐiềuChỉnhMôHình

https://www.reddit.com/r/LocalLLaMA/comments/1qs9x1h/finetune_model_in_c/

A new compact model, Falcon‑H1R 7B, is shaking up AI benchmarks by matching or beating models up to 7× larger on math and coding tasks—showing small can be seriously powerful.

#AI #LLMs #ModelOptimization
https://kersai.com/ai-breakthroughs-in-2026/

Unsloth AI (@UnslothAI)

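Output quality of the GLM-4.7-Flash GGUF files improved significantly after recent bug fixes in llama.cpp, so the GGUFs have been reconverted and updated. The model can run locally in 4-bit with 18GB of RAM. To get the fixes, re-download the updated GGUFs and use the inference parameters suggested by @Zai_org.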

https://x.com/UnslothAI/status/2013966866646180345

#glm #gguf #llamacpp #huggingface #modeloptimization

🚀 Our latest benchmark shows hyperparameter tuning with Optuna hits 0.9617 validation accuracy in just 64.59 seconds! Using Bayesian optimization and the Tree‑structured Parzen Estimator, we ran 100 trials to squeeze out every percent. Dive into the details of the experiment and see how you can apply these tricks to your own models. #HyperparameterTuning #Optuna #BayesianOptimization #ModelOptimization

🔗 https://aidailypost.com/news/hyperparameter-tuning-reaches-09617-accuracy-6459-seconds

The Lottery Ticket Hypothesis: finding sparse trainable NNs with 90% less params
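A minimal sketch of the pruning step behind a "winning ticket", assuming a simplified one-shot version: rank weights by trained magnitude, keep the top fraction, and reset the survivors to their initial values. The linked paper also retrains the pruned network and describes iterative pruning, both omitted here; the function name `winning_ticket` and the toy weights are hypothetical.

```python
from typing import List


def winning_ticket(initial: List[float], trained: List[float],
                   keep_fraction: float) -> List[float]:
    """Keep the largest-magnitude trained weights, reset survivors to init."""
    if len(initial) != len(trained):
        raise ValueError("initial and trained weights must have the same length")
    keep = max(1, round(len(trained) * keep_fraction))
    # Indices of the `keep` weights with the largest absolute trained value.
    ranked = sorted(range(len(trained)), key=lambda i: abs(trained[i]), reverse=True)
    survivors = set(ranked[:keep])
    # Pruned positions become 0.0; survivors go back to their initialization.
    return [initial[i] if i in survivors else 0.0 for i in range(len(trained))]


# Hypothetical toy weights: ten values, keep 10% (one weight), so 90% are pruned.
initial_w = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9, -1.0]
trained_w = [0.0, 0.1, -2.5, 0.2, 0.0, 0.3, -0.1, 0.4, 0.2, 0.1]
ticket = winning_ticket(initial_w, trained_w, keep_fraction=0.1)
```

The resulting sparse vector is the starting point that would then be trained again; whether it matches the dense network's accuracy is the empirical claim the paper tests.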

https://arxiv.org/abs/1803.03635

#HackerNews #LotteryTicketHypothesis #SparseNeuralNetworks #DeepLearning #AIResearch #ModelOptimization

New results show that Vulkan can be faster than CUDA on specific models. For example, Ministral3 14B 2512 Q4 gets a 4.4x speedup in prompt processing. CUDA is still the best choice in most cases. #Vulkan #CUDA #ModelOptimization #TechNews #ThiếtKếModel #BảoMật #LenhLem #HóaCván #SốHúc #LinhTụ #ThépKin #TệpMúzeum #CơSốVănHóa

https://www.reddit.com/r/LocalLLaMA/comments/1pydegt/benchmarking_local_llms_for_speed_with_cuda_and/

Does Kimi K2 Thinking hold up at 2.5 to 3.5 bits per weight? The model is reportedly natively 4-bit. By comparison, DeepSeek models (natively 8-bit) remain effective at ~3 bpw. A user tried Q2_K_XL (3 bpw) locally and found it quite good, but has not yet been able to compare it with native 4-bit. Discussion on r/LocalLLaMA about quantization performance. #quantization #AI #machinelearning #KimiK2 #DeepSeek #localAI #modeloptimization #Quantisierung #KünstlicheIntelligenz

https://www.reddit.com/r/LocalLLaMA/com

🚀 GPT OSS 120B with only 2 experts performs like it does with 4 experts, but 2x faster! A user reached 40 tps with 2 experts.
#AI #GPT #MachineLearning #Llama #ModelOptimization #Tech #FastAI
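A minimal sketch of top-k expert routing, to illustrate why cutting the number of active experts from 4 to 2 can roughly halve expert compute per token while changing the output little when the router's weight is concentrated on a few experts. This is a generic toy and not the model's actual router: the function `moe_forward`, the softmax over only the selected experts, and the scalar "experts" are all assumptions.

```python
import math
from typing import Callable, Sequence


def moe_forward(x: float, router_logits: Sequence[float],
                experts: Sequence[Callable[[float], float]], top_k: int) -> float:
    """Route x to the top_k experts by router logit and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Softmax over the chosen experts' logits only, so mixing weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return sum((e / total) * experts[i](x) for e, i in zip(exps, chosen))


# Hypothetical toy setup: four "experts" that scale the input by fixed factors.
toy_experts = [lambda v, s=s: v * s for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.0, 0.0, 5.0, 5.0]  # router strongly prefers the last two experts

out_k4 = moe_forward(1.0, logits, toy_experts, top_k=4)  # four expert calls
out_k2 = moe_forward(1.0, logits, toy_experts, top_k=2)  # two expert calls
```

In this toy case the two outputs differ only slightly because the dropped experts carried very little router weight; with a flatter router distribution the gap would be larger.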

https://www.reddit.com/r/LocalLLaMA/comments/1o9o5eb/using_only_2_expert_for_gpt_oss_120b/

🚀 More efficient MoE! Qwen3-Coder pruned by 25% (363B) and 50% (246B), in FP8, without losing accuracy. Uses REAP, with no patches needed for vLLM. Read here: arXiv.org/abs/2510.13999.
#AI #MoE #Qwen3 #NLP #ModelOptimization #HuggingFace

https://www.reddit.com/r/LocalLLaMA/comments/1o98f57/new_from_cerebras_reap_the_experts_why_pruning/

A closer look at the difference between larger parameter counts and quantization in AI. Comparing Q6 and Q8 of the same model, the poster sees no clear difference. Limited experience with Q8/F16-32.
#AI #MachineLearning #Quantization #ModelOptimization #TinTếTúc #TươngGiácNghệLearning #TốiHstrateBảnPhân