
Engineering Supercomputing Platforms for Biomolecular Applications

#CUDA #ROCm #Biology #Biomolecules #MolecularDynamics #HPC #Physics #Package

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs

#CUDA #Compilers #Sparse #MatrixMultiplication

Only yesterday I was laughed at for proposing an AI for the people. But #Merz and the #CDU have pulled it off: there will soon be a sovereign AI for the German economy. Super important for staying in the race for the future and for not depending on foreign companies or nations. Well, it is being built together with NVIDIA. #NVIDIA is trying to establish an AI monopoly with #CUDA-X. Heise reports that #Amazon and #Microsoft are on board. It could hardly be more sovereign; after all, it is called "Sovereign AI" and is as sovereign as "Open AI" is open. The 10,000 GPUs will also use a fair amount of electricity, but in Germany the fossil fuel industry supplies "green" electricity, which is as green as the Greens.

We could also have our own AI with 25 GPUs. The only pity is that Merz (known from "They Called Him #FotzenFritz") would not be on board. Somebody has to do the #Drecksarbeit (dirty work). Right?

https://word.undead-network.de/2025/06/19/gestern-noch-verlacht-worden-fuer-den-vorschlag-einer-ki-fuer-die-menschen/
#ai #DigitaleSouveränität #ki

Show HN: I built a tensor library from scratch in C++/CUDA

Link: https://github.com/nirw4nna/dsc
Discussion: https://news.ycombinator.com/item?id=44310678

#cuda

I built a tensor library from scratch in C++/CUDA

https://github.com/nirw4nna/dsc

#HackerNews #tensorLibrary #C++ #CUDA #programming #buildFromScratch #openSource

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

#CUDA #LLM #CodeGeneration #AI

https://hgpu.org/?p=29941

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

#CUDA #LLM #Compilers #AI #PerformancePortability #Package

https://hgpu.org/?p=29940

GPU Acceleration of SQL Analytics on Compressed Data

#CUDA #Databases

https://hgpu.org/?p=29939

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

#CUDA #Physics #Chemistry #MD #ML #Package

https://hgpu.org/?p=29937

Part2: #dailyreport #cuda #nvidia #gentoo #llvm #clang

I learned about CMake config files and the differences
between the compiler runtime library (GCC: libgcc and
libatomic; LLVM/Clang: compiler-rt; MSVC: vcruntime.lib),
the C standard library (glibc, musl), the C++ standard
library (GCC: libstdc++; LLVM: libc++; MSVC: STL), the
linker (GCC: binutils; LLVM: lld), and the ABI. Also the
difference between a "toolchain" and a "build pipeline".

Gentoo STL:
- libstdc++: sys-devel/gcc
- libc++: llvm-runtimes/libcxx

Gentoo libc: sys-libs/glibc and sys-libs/musl

I learned how NVIDIA CUDA and cuDNN are distributed and
what tools PyTorch has.

Also, I updated my daemon+script that finds the heaviest
recently running process, which I share in my Gentoo
overlay as a package.

Part1: #dailyreport #cuda #nvidia #gentoo #llvm #clang
#programming #gcc #c++ #linux #toolchain #pytorch

I am compiling PyTorch with CUDA and cuDNN. PyTorch is
mainly a Python library built around the main part of
the Caffe2 C++ library.

The main dependency of Caffe2 with CUDA support is the
NVIDIA "cutlass" library (a collection of CUDA C++
template abstractions). This library has "CUDA code"
that can be compiled with nvcc, the NVIDIA CUDA compiler
distributed with nvidia-cuda-toolkit, or with the LLVM
Clang++ compiler. LLVM supports CUDA only up to version
12.1, but it can be used to compile CUDA for the sm_52
architecture. Looks like kneeling before NVIDIA. :)

Before installing dev-libs/cutlass you should do:
export CUDAARCHS=75

I successfully compiled cutlass; now I am trying to
compile the PyTorch CUDA code with the Clang++ compiler.

A Cult AI Computer’s Boom and Bust:
I am aware that CUDA isn’t a language. But 🤷‍♂️

📺 https://www.youtube.com/watch?v=sV7C6Ezl35A

#video #yt #youtube #ai #boom #bust #it #ipl #aicomputing #history #aicult #aiboom #cuda #lisp #code

#cuda #gpu #ai

https://youtu.be/K9anz4aB0S0?si=gcrbDaSGF5V6gSq2

Ask HN: How to learn CUDA to professional level | Hacker News

https://news.ycombinator.com/item?id=35756489

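📌 Summary:
This thread gathers experience and advice from many developers and CUDA users on how to reach professional-level CUDA programming skills. The core of learning CUDA is understanding the GPU's parallel computing architecture and the CUDA programming model, with NVIDIA's official CUDA Programming Guide and books as the foundation. Beginners should have solid C/C++ fundamentals and start by implementing small, simple parallel programs, gradually getting familiar with the toolchain, the compiler, and the hardware limits. On the hardware side, an NVIDIA graphics card from recent years with a reasonably new driver is recommended, for example a GTX 1080 or later, to ensure compatibility with the CUDA Toolkit.

In practice, debugging and performance optimization challenges are unavoidable while learning, including details such as memory layout, warp scheduling, synchronization, and L2 cache management. Some developers recommend getting correctness first and then optimizing performance step by step, so that premature optimization does not introduce bugs. As for application areas, CUDA is mostly used in high-performance computing, 3D game graphics, and machine learning/AI; if the goal is AI model development, however, high-level frameworks such as PyTorch or TensorFlow may be the better fit. Suggested learning paths include reading open-source projects (such as Leela Chess Zero), taking NVIDIA's official courses, reading books on high-performance computing, and joining community discussions or hands-on projects.

In addition, the hardware compatibility of CUDA code cannot be ignored: GPUs of different generations and models differ in instruction set and hardware resources, so beginners work more efficiently when they pick a target GPU and develop for that specific architecture. Higher-level work can also lean on abstraction layers such as CUTLASS to lower the barrier to entry. Overall, mastering CUDA takes a lot of time and patience; the recommendation is to draw up a 6-to-8-week study plan and practise step by step in order to be competitive on the job market.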

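🎯 Key Points:
→ Getting started
★ Use NVIDIA's official CUDA Programming Guide and books to learn the underlying theory and the API
★ Be proficient in C or C++ and have a clear understanding of parallel programming concepts
★ Hands-on practice: start with simple parallel tasks (such as matrix multiplication) and gradually increase complexity
★ Use suitable GPU hardware: a recent NVIDIA card (such as a GTX 1080 or the RTX 20 series and later) with a driver version that meets the requirements

→ Learning workflow and tools
★ Install and get familiar with the CUDA Toolkit (example version 12.9.1) and debugging tools such as NVIDIA Nsight and compute-sanitizer
★ Read and analyze public CUDA projects on GitHub to understand usage through real code
★ Practise using shared memory, warp scheduling, and Tensor Core acceleration to improve performance
★ Use LLMs (large language models) or community resources to help with understanding code and troubleshooting

→ Advanced challenges and application areas
★ Memory management, the variety of instruction sets, and compatibility across GPU architectures are the hard parts
★ CUDA is mostly used in high-performance computing areas such as 3D game graphics and AI training; AI developers mostly use high-level frameworks such as PyTorch/TensorFlow
★ Practical advice: make sure the functionality is correct first, then optimize performance, and avoid memory errors
★ Reading advanced HPC (high-performance computing) and parallel computing books is encouraged, such as "Programming Massively Parallel Processors" and "Scientific Parallel Computing"
★ Understand the differences between GPU vendors and APIs; tools such as HIPIFY can be used for cross-platform porting when needed
★ Combine hands-on projects with study to build up a complete skill tree step by step

🔖 Keywords:
#CUDA #GPUProgramming #ParallelComputing #NVIDIA #HPC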
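
The advice in this summary (start with matrix multiplication, get it correct first, only then optimize) can be sketched on the CPU. What follows is a minimal pure-Python illustration, not CUDA code and not taken from the thread: `matmul_naive`, `matmul_tiled`, and the `tile` parameter are hypothetical names, and the loop blocking only mimics the access pattern that shared-memory tiling uses on a GPU.

```python
# CPU-only sketch: a naive reference and a tiled (blocked) variant.
# All names here are hypothetical and chosen for illustration.

def matmul_naive(a, b):
    """Reference version: one dot product per output element."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Blocked version: works on tile x tile sub-blocks, the same
    decomposition a GPU kernel would stage through shared memory."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of C
        for j0 in range(0, m, tile):      # tile column of C
            for p0 in range(0, k, tile):  # slice of the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]

# Correctness first: the optimized variant must match the reference.
assert matmul_tiled(A, B) == matmul_naive(A, B)
```

Keeping the naive version around as a reference is the point of the exercise: any later optimization can be checked against it before performance is measured.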

Ask HN: How to learn CUDA to professional level

Discussion: https://news.ycombinator.com/item?id=44216123

#cuda

All You Need Is Binary Search! A Practical View on Lightweight Database Indexing on GPUs

#CUDA #Databases #Performance

https://hgpu.org/?p=29922

GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency

#OpenCL #CUDA #Concurrency #Memory #ModelCheck

https://hgpu.org/?p=29921

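🌖 Highly efficient matrix transpose in Mojo 🔥
➤ High-performance GPU computing with Mojo
✤ https://veitner.bearblog.dev/highly-efficient-matrix-transpose-in-mojo/
This article shows step by step how to implement a highly efficient matrix transpose kernel for the Hopper architecture in the Mojo language. The best kernel reaches a bandwidth of 2775.49 GB/s, an efficiency of 84.1056%. The author compares this with the 2771.35 GB/s previously achieved with pure CUDA on the same H100 hardware, showing that Mojo can reach performance similar to CUDA on the same task. The article covers the basic approach, the use of TMA (Tensor Memory Accelerator), and optimization techniques such as swizzling and thread coarsening, with detailed code examples and performance comparisons.
+ Wow, Mojo really has potential! Being on par with CUDA, and even surpassing it in some respects, is truly impressive.
+ The article explains things very clearly; even people unfamiliar with Mojo can follow it. The code examples are practical too and can be taken directly
#GPUProgramming #Mojo #MatrixOperations #CUDA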
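
The figures quoted in this post can be cross-checked with a little arithmetic. The sketch below is an illustration under stated assumptions: `effective_bandwidth_gbs` is a hypothetical helper that uses the usual convention that a transpose reads and writes every element once, and the "implied peak" is simply derived from the quoted bandwidth and efficiency, not a value stated in the post.

```python
# Sanity-check arithmetic on the bandwidth figures quoted in the post.
# The helper name and the read-once/write-once convention are assumptions.

def effective_bandwidth_gbs(rows, cols, bytes_per_elem, seconds):
    """Effective bandwidth of a transpose in GB/s (1 GB = 1e9 bytes)."""
    bytes_moved = 2 * rows * cols * bytes_per_elem  # one read + one write
    return bytes_moved / seconds / 1e9


mojo_gbs = 2775.49     # best Mojo kernel, from the post
cuda_gbs = 2771.35     # earlier pure-CUDA result, from the post
efficiency = 0.841056  # 84.1056 %, from the post

# Peak bandwidth that the quoted efficiency implies (derived, not quoted).
implied_peak_gbs = mojo_gbs / efficiency

# Relative difference between the Mojo and CUDA kernels.
relative_gain = (mojo_gbs - cuda_gbs) / cuda_gbs

print(f"implied peak: {implied_peak_gbs:.1f} GB/s")
print(f"Mojo vs CUDA: {relative_gain * 100:.2f} % faster")
```

Under these assumptions the two kernels differ by a fraction of a percent, which matches the post's conclusion that Mojo reaches performance similar to CUDA on this task.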

Bringing GPU-Level Performance to Enterprise Java: A Practical Guide to CUDA Integration

#cuda #gpu #java #performance

https://www.infoq.com/articles/cuda-integration-for-java/

Oh, sweet mercy of progress! 🙄 After countless hours of fiddling with a 'superior' language, #Mojo, our hero achieved a staggering 14% #improvement over #CUDA, which translates to a groundbreaking difference of... wait for it... a couple of GBs! 🚀 Clearly, the future of computing hangs in the balance of this monumental leap. 🤡
https://veitner.bearblog.dev/highly-efficient-matrix-transpose-in-mojo/ #tech #innovation #progress #HackerNews #ngated

#Cuda
