I spent a couple of months at the beginning of this year learning about GPU programming by trying to optimize inference for Cheng Chi’s awesome Diffusion Policy paper. I was able to improve inference time for the convolutional U-Net by ~3.4x over PyTorch eager mode and ~2.65x over PyTorch compile mode! For anyone interested in GPU optimizations for deep learning, I wrote a 9-part blog post that builds up from the physical structure of DRAM/SRAM cells all the way to integrating custom CUDA kernels in PyTorch: https://lnkd.in/dBMSqh4g I also have a Twitter thread of the most interesting tidbits here: https://lnkd.in/db_hEqbD

This video (requires audio) is unrelated to the Diffusion inference stuff but imo, more amusing… I was able to get my Nvidia RTX 3090’s inductor coils to play ‘Twinkle Twinkle Little Star’ using kernels (GPU programs) that modulate power draw at the right frequencies!

What’s happening here is that each kernel launch triggers a surge of in-rush current in the GPU’s inductor coils. The Lorentz force due to the changing current (proportional to the change in current divided by the change in time, i.e. dI/dt) causes the coil to move slightly. If we play with the kernel launch frequencies, we can vibrate the coils and produce noises in the audible range. Unfortunately we can’t make sounds lower than 2000 Hz: at lower launch frequencies the ‘change in time’ part of the equation becomes too large, so dI/dt shrinks and the resulting vibration is too weak to make audible noise. So we end up with Twinkle Twinkle shifted up many octaves 😀
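For the curious, here’s a minimal sketch (not my actual code; all names are made up for illustration) of the scheduling idea: each note’s pitch becomes a kernel-launch frequency, and since the coils only sing above ~2000 Hz, each note is raised by whole octaves until it clears that floor. One kernel launch per period means one in-rush current surge per period.

```python
# Hypothetical sketch: turn the opening of 'Twinkle Twinkle Little Star'
# into a kernel-launch schedule. NOTE_HZ, MELODY, and the function names
# are illustrative, not from the real implementation.

AUDIBLE_FLOOR_HZ = 2000.0

# Equal-temperament frequencies (Hz) for the notes we need, around middle C.
NOTE_HZ = {"C4": 261.63, "G4": 392.00, "A4": 440.00}

# Opening phrase: C C G G A A G, quarter notes except the final half note.
MELODY = [("C4", 0.25), ("C4", 0.25), ("G4", 0.25), ("G4", 0.25),
          ("A4", 0.25), ("A4", 0.25), ("G4", 0.5)]

def shift_above_floor(freq_hz, floor_hz=AUDIBLE_FLOOR_HZ):
    """Raise the pitch by octaves (doubling) until it clears the floor."""
    while freq_hz < floor_hz:
        freq_hz *= 2.0
    return freq_hz

def launch_schedule(melody):
    """Return (launch_freq_hz, launch_period_s, beats) for each note.
    Launching one busy kernel every `launch_period_s` seconds would
    modulate the power draw (and the coil vibration) at `launch_freq_hz`."""
    schedule = []
    for note, beats in melody:
        f = shift_above_floor(NOTE_HZ[note])
        schedule.append((f, 1.0 / f, beats))
    return schedule

for f, period, beats in launch_schedule(MELODY):
    print(f"{f:8.2f} Hz -> launch every {period * 1e6:7.1f} us for {beats} beats")
```

On the GPU side, a real driver loop would busy-wait between launches of a short dummy kernel so that the launch cadence (and hence the current surges) tracks these periods.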
This is written in great detail! Kudos!
Holy shit man this is incredible - I thought it’d be a quick read but this is a LOT!
That’s amazing!
Thanks for sharing
LOL good stuff!
Epic
Unbelievable Stuff...🖤
This is awesome!
Great work
This is true engineering stuff right here. How do you make a VC-backable start up with this knowledge