Lmst

Nvidia introduces Dynamic Memory Sparsification (DMS). The technique shrinks the KV cache by a factor of eight by dynamically evicting unimportant tokens during inference. According to the paper from the University of Edinburgh, accuracy is preserved while the hardware requirements for long contexts drop substantially. Initial implementations already exist. #Nvidia #DMS #KVCache
https://www.all-ai.de/news/news26/nvidia-speicher-8x

Where the money for neural networks goes, and why

A fact little known among ordinary people: neural networks have no "conversations" at all. In the web interface you see a "dialogue", but that is an illusion, a neat trick. Every time you write a new message, all the old messages are processed again. Neural networks have no truly reusable tasks. If the result changes slightly, the web interface simply won't show you the changed messages. Otherwise the user would feel like they were in a madhouse, with the AI constantly gaslighting them, so to speak, by changing old answers without warning. In practice, the conversation history in AI chats is fixed, one way or another. And that would cost a fortune. Interesting.

https://habr.com/ru/companies/bar/articles/991126/

#LLM #transformer #attention #KVcache #inference #GPU #CUDA #ChatGPT #Claude #токены
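
The reprocessing cost described in the post above can be made concrete with a small sketch. It is a simplified model, not a measurement: the function name `tokens_processed` and the idea of an "ideal prefix cache" that never misses are assumptions made here for illustration, and the turn lengths are made up.

```python
def tokens_processed(turn_lengths, prefix_cache=False):
    """Total tokens the model has to run through over a whole chat.

    turn_lengths: tokens added to the history at each turn (hypothetical values).
    Without a cache, every turn re-reads the full history so far.
    With an ideal prefix cache, only the newly added tokens are processed.
    """
    total = 0
    history = 0
    for added in turn_lengths:
        history += added
        total += added if prefix_cache else history
    return total


turns = [100] * 10  # ten turns of 100 tokens each (made-up numbers)
print(tokens_processed(turns))                     # 5500
print(tokens_processed(turns, prefix_cache=True))  # 1000
```

Under these assumptions the uncached cost grows roughly with the square of the number of turns, which is the reason a KV cache matters for chat workloads.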

𝗭𝗲𝗻 𝗠𝗮𝗴𝗻𝗲𝘁𝘀 (@ZenMagnets)

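Sharing a simple workaround for GLM-4.7-Flash's large KV cache problem (FATASS). A one-line fix that enables MLA in vllm reportedly fits a 200k context into about 10GB instead of 180GB, which the author claims makes it possible to run the full 200k context of GLM-4.7-Flash-NVFP4 on a single 32GB 5090 GPU. The author recommends using MLA, as @Zai_org intended.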

https://x.com/ZenMagnets/status/2013838570059170117

#glm4.7flash #vllm #kvcache #mla #gpu

GLM-4-32B-0414 stands out for having only **2 KV heads**, which saves a significant amount of KV cache memory thanks to GQA. Unfortunately, GLM-4.7-Flash dropped this feature, making memory optimization less effective. #AI #LLM #GLM #KVCache #GQA #TríTuệNhânTạo #MôHìnhNgônNgữ #AIoptimization

https://www.reddit.com/r/LocalLLaMA/comments/1qiphdr/two_heads_is_all_i_need/
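
To see why the number of KV heads matters so much, here is a rough sizing sketch using the standard per-token accounting (one key and one value vector per KV head, per layer). The layer count, head dimension, context length, and the 48-head baseline below are hypothetical and are not taken from any GLM config; only the "2 KV heads" figure comes from the post.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Approximate KV cache size in bytes for one sequence."""
    # Factor 2: one key vector and one value vector per KV head, layer, and token.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
LAYERS, HEAD_DIM, TOKENS = 60, 128, 32_768

full = kv_cache_bytes(LAYERS, 48, HEAD_DIM, TOKENS)  # every head keeps its own K/V
gqa = kv_cache_bytes(LAYERS, 2, HEAD_DIM, TOKENS)    # GQA with 2 shared KV heads

print(full / 2**30)  # 45.0 (GiB)
print(gqa / 2**30)   # 1.875 (GiB)
```

With these made-up dimensions, going from 48 KV heads to 2 cuts the cache by a factor of 24, since the size scales linearly with the KV head count.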

MI455X has 24 LPDDR5X modules on board.

The MI455X carries 24 LPDDR5X modules, which are expected to be used for KV cache. Peak bandwidth is estimated at 1.63TB/s, and the part was mentioned in comparison with AMD's Venice accelerator.

https://news.hada.io/topic?id=25990

#lpdrd5x #kvcache #amd #mi455x #memory

NVIDIA’s new Inference Context Memory Storage Platform reshapes AI inference by treating KV cache as a multi-tier memory hierarchy—from HBM to NVMe SSD. This enables longer context windows, persistent reasoning, and scalable multi-agent inference while keeping hot data in GPU memory and offloading cold context to SSD.
https://www.buysellram.com/blog/nvidia-unveils-the-inference-context-memory-storage-platform/
#NVIDIA #Rubin #AI #Inference #LLM #AIInfrastructure #MemoryHierarchy #HBM #NVMe #DPU #BlueField4 #AIHardware #GPU #DRAM #KVCache #DataCenter #tech

NVIDIA’s Inference Context Memory Storage Platform, announced at CES 2026, marks a major shift in how AI inference is architected. Instead of forcing massive KV caches into limited GPU HBM, NVIDIA formalizes a hierarchical memory model that spans GPU HBM, CPU memory, cluster-level shared context, and persistent NVMe SSD storage.

This enables longer-context and multi-agent inference by keeping the most active KV data in HBM while offloading less frequently used context to NVMe—expanding capacity without sacrificing performance. This shift also has implications for AI infrastructure procurement and the secondary GPU/DRAM market, as demand moves toward higher bandwidth memory and context-centric architectures.

https://www.buysellram.com/blog/nvidia-unveils-the-inference-context-memory-storage-platform/

#NVIDIA #Rubin #AI #Inference #LLM #AIInfrastructure #MemoryHierarchy #HBM #NVMe #DPU #BlueField4 #AIHardware #GPU #DRAM #KVCache #LongContextAI #DataCenter #AIStorage #AICompute #AIEcosystem #technology
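
The hot/cold split described above can be illustrated with a toy two-tier store. This is only a sketch of the general idea and not NVIDIA's design: the class name `TieredKVStore`, the least-recently-used eviction policy, and the use of plain dictionaries as stand-ins for HBM and NVMe are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredKVStore:
    """Toy two-tier store: a small 'hot' tier backed by a larger 'cold' tier."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # stands in for GPU HBM; ordered oldest -> newest use
        self.cold = {}            # stands in for NVMe SSD

    def put(self, block_id, kv_block):
        """Store a block in the hot tier, demoting the least recently used if full."""
        self.cold.pop(block_id, None)
        self.hot[block_id] = kv_block
        self.hot.move_to_end(block_id)
        while len(self.hot) > self.hot_capacity:
            old_id, old_block = self.hot.popitem(last=False)
            self.cold[old_id] = old_block

    def get(self, block_id):
        """Return a block, promoting it back to the hot tier if it was cold."""
        if block_id in self.hot:
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        block = self.cold.pop(block_id)  # raises KeyError if the block was never stored
        self.put(block_id, block)
        return block


store = TieredKVStore(hot_capacity=2)
for name in ("a", "b", "c"):
    store.put(name, f"kv-{name}")
store.get("a")           # "a" was demoted earlier and is promoted again
print(list(store.hot))   # ['c', 'a']
print(list(store.cold))  # ['b']
```

A real system would also have to account for transfer latency between tiers, which this sketch ignores entirely.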

#Nvidia's new #KVcache system is creating significant discussion within the industry, particularly regarding its overlap with #datastorage partners like #NetApp. Analysts are expressing concerns that this development may exacerbate the existing #memoryshortage, potentially resulting in increased prices for #enterpriseIT buyers.

https://www.techtarget.com/searchstorage/news/366637161/Nvidias-new-KV-cache-makes-waves-in-enterprise-storage

😂 A user asks: how do you manage 100+ ChatGPT conversations? Store the KV cache (costs RAM) or recompute it when a conversation resumes (costs CPU)? They are looking for a balanced approach from devs who build their own LLM chatbots. #MachineLearning #KVcache #ComputationalTradeoff #ChatbotDevelopment #MemoryOptimization #TríTuệNhânTạo #TốiƯuHiệuSuất #TransformerModel #GiaoTiếpAI

https://www.reddit.com/r/LocalLLaMA/comments/1q8eqtc/longterm_kv_cache_storage_or_reruns_for_ongoing/
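
One way to reason about the store-or-recompute question is a back-of-the-envelope comparison of resume latency. The sketch below is not from the linked thread: the function `resume_seconds` and every number passed to it (cache bytes per token, storage bandwidth, prefill throughput) are hypothetical placeholders to be replaced with measurements from a real setup.

```python
def resume_seconds(num_tokens, kv_bytes_per_token, load_bytes_per_s, prefill_tokens_per_s):
    """Return (reload_time, recompute_time) in seconds for resuming one conversation."""
    reload_time = num_tokens * kv_bytes_per_token / load_bytes_per_s
    recompute_time = num_tokens / prefill_tokens_per_s
    return reload_time, recompute_time


reload_t, recompute_t = resume_seconds(
    num_tokens=8_000,
    kv_bytes_per_token=100_000,      # hypothetical cache footprint per token
    load_bytes_per_s=2_000_000_000,  # hypothetical storage read bandwidth
    prefill_tokens_per_s=4_000,      # hypothetical prefill throughput
)
print(reload_t, recompute_t)  # 0.4 2.0
```

With these placeholder values reloading wins, but the balance flips as the per-token cache grows or storage gets slower, so the answer depends on the model and hardware in question.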

While building the LLM bot ATOM, long-term memory retrieval wipes out most of the KV cache, leaving <5% reuse. Every way of injecting memory (adding a message, inserting it into the system prompt, deferring it) breaks reuse and causes errors. A new approach is needed to keep prefix reuse high when memory is updated. #LLM #AI #KVCache #Memory #MachineLearning #AIVietnam #CôngNghệ #TríTuệNhânTạo

https://www.reddit.com/r/LocalLLaMA/comments/1q792fk/kv_cache_gets_nuked_by_longterm_memory_retrieval/
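
Why the position of injected memory matters can be shown with a toy model of exact-prefix matching, where a cached KV prefix is reusable only up to the first token that differs. The function `reusable_prefix_len` and the string "tokens" below are illustrative assumptions; this does not reproduce the poster's setup, where the post reports that every injection variant hurt reuse.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens shared by the cached prompt and the new prompt."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


# Hypothetical token sequences; real prompts would be lists of token ids.
system = ["sys"] * 4
history = ["u1", "a1", "u2", "a2"]
memory = ["mem1", "mem2"]
cached = system + history

in_system_prompt = system + memory + history + ["u3"]  # memory injected early
appended = system + history + memory + ["u3"]          # memory injected at the end

print(reusable_prefix_len(cached, in_system_prompt))  # 4
print(reusable_prefix_len(cached, appended))          # 8
```

In this simplified model, anything inserted before the history invalidates every cached token after the insertion point, while content appended at the end leaves the earlier prefix intact.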

KV cache compression: shrinks the memory footprint, speeds up processing, and extends context length. Examples: ARC-Encoder (kyutai) compresses 4x, Clara (Apple) 128x, Cartridges (Stanford) 40x while preserving the data. However, these techniques have not yet been integrated into popular frameworks such as llama.cpp/vllm. Why? What would it take to deploy them widely?

#AI #MachineLearning #LLM #NénDữLiệu #CôngNghệ #KVCache #MởRộngBốiCảnh #HiệuSuất #AISángTạo #TechVietnam

https://www.reddit.com/r/LocalLLaMA/comments/1pqj2so/where_are_cache_compressions/

Do you want to compare the caching performance of your LLM serving stack? We've put together a simple command line tool to do so. Introducing Tensormesh Benchmark.
https://www.tensormesh.ai/blog-posts/tensormesh-benchmark

#llm #ai #kvcache #lmcache #vllm #benchmarking

🎉 Breaking news: scientists invent a way to accelerate AI learning without actually teaching it anything! 🚀 The secret? 🤔 Just enable KV cache and parallel decoding—because who needs training when you can just fast-forward to the finish line? 🏁 Let’s all donate to #arXiv to keep this kind of cutting-edge "innovation" flowing. 💸
https://arxiv.org/abs/2505.22618 #AIlearning #Innovation #KVcache #ParallelDecoding #HackerNews #ngated
