PyTorch’s Post
Training MoEs at Scale with PyTorch 🔥 In our latest post, we show how we scale to over three thousand GPUs using PyTorch Distributed and MegaBlocks, an efficient open-source MoE implementation in PyTorch. Check it out: https://hubs.ly/Q02DBzL50
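For intuition, here is a minimal sketch of the core MoE idea, token-level top-2 routing over a set of expert FFNs, in plain PyTorch. Sizes and names are illustrative; this is not the MegaBlocks implementation, which replaces the Python routing loop with block-sparse kernels and shards experts across GPUs with PyTorch Distributed.

```python
# Minimal sketch of a token-routed MoE layer (top-2 gating), for intuition only.
# NOT the MegaBlocks implementation; shapes and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # scores each token per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                             # x: (tokens, d_model)
        scores = self.router(x)                       # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # normalize over selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                   # dispatch tokens to their k-th expert
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

The nested dispatch loop is exactly the part that systems like MegaBlocks turn into efficient grouped/block-sparse GEMMs.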
More Relevant Posts
-
If you want to learn more about MoE training with PyTorch, this is a great resource to get started
Training MoEs at Scale with PyTorch
pytorch.org
-
Using PyG? Check out this blog post on how to optimize PyG performance for both training and inference using PyTorch 2.0's flagship torch.compile feature. #oneAPI #pytorch #iamintel
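The usage pattern is just wrapping the model in torch.compile; a rough sketch with a toy GCN (the model and data below are illustrative, not taken from the Intel post):

```python
# Sketch: compiling a small PyG GCN with torch.compile for CPU inference.
import torch
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, out_dim)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

model = GCN(16, 32, 4).eval()
compiled = torch.compile(model)                 # PyTorch 2.x; Inductor is the default backend

x = torch.randn(100, 16)                        # 100 nodes, 16 features each
edge_index = torch.randint(0, 100, (2, 400))    # random graph connectivity
with torch.no_grad():
    out = compiled(x, edge_index)
```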
3 Ways to Accelerate PyTorch* Geometric on Intel® CPUs
intel.com
-
The power of Neural Architecture Search in YOLO
GitHub - Deci-AI/super-gradients: Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
github.com
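Loading a pretrained YOLO-NAS model from super-gradients takes only a few lines; a sketch based on the repo's README (model name and weights key are assumed from there):

```python
# Sketch: pretrained YOLO-NAS inference with super-gradients.
from super_gradients.training import models

model = models.get("yolo_nas_s", pretrained_weights="coco")
predictions = model.predict("path/to/image.jpg")  # run detection on one image
predictions.show()                                # visualize the predicted boxes
```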
-
A little book about deep learning. There's a lot to learn about deep learning in this short guide. Topics covered include:
- Machine Learning Foundations
- Efficient computations with GPUs & TPUs
- Model Architectures
- Model Training
And it's free: https://lnkd.in/gZ7epA63
-
Hugging Face presents 🤗 quanto, a new and adaptable PyTorch quantization toolkit. It offers a range of distinctive features, including:
🎯 Compatibility with eager mode, making it functional with models that are not traceable.
🍁 Device-agnostic capabilities, enabling usage on CPU, CUDA, and MPS.
🚀 Support for int2 and int4 weights, in addition to int8 weights.
🍄 Support for float8 activations, in addition to int8 activations.
❄️ No need to manually insert quantization and dequantization stubs.
An interesting detail is that quanto does not draw a hard line between dynamic and static quantization: models are quantized dynamically first, but their weights can later be "frozen" to static values.
Blog: https://lnkd.in/gAnc26tA GitHub: https://lnkd.in/gJmB8PSH #huggingface #deeplearning #pytorch
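The dynamic-then-frozen workflow described above looks roughly like this (a minimal sketch; see the linked blog for the exact, current API):

```python
# Sketch of quanto's quantize -> calibrate -> freeze workflow.
# Assumes the `quanto` package as presented in the linked blog post.
import torch
from quanto import quantize, freeze, qint8

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())

quantize(model, weights=qint8, activations=qint8)  # model is now dynamically quantized
# ... run a few calibration batches here to record activation ranges ...
freeze(model)  # weights are converted ("frozen") to static integer values
```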
Quanto: a pytorch quantization toolkit
huggingface.co
-
We live in great times when AI innovators like Andrej Karpathy share their knowledge so dedicatedly. He recently released an LLM training library in simple, pure C/CUDA, with GPT-2 as the first model implemented in pure C. It compiles instantly and matches the performance of the PyTorch reference implementation. He plans to add:
1. A direct CUDA implementation, which will be significantly faster and probably come close to PyTorch.
2. A faster CPU version using SIMD instructions: AVX2 on x86, NEON on ARM (e.g. Apple Silicon).
3. More modern architectures, e.g. Llama 2, Gemma, etc.
Go through this link and rejoice.
GitHub - karpathy/llm.c: LLM training in simple, raw C/CUDA
github.com
-
Optimizing the YOLOv7 Model using Intel® Extension for PyTorch
medium.com
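The core pattern behind such articles is a one-line ipex.optimize call; a minimal sketch with a stand-in model (the actual YOLOv7 loading and the article's specifics are omitted):

```python
# Sketch: applying Intel® Extension for PyTorch for CPU inference.
import torch
import intel_extension_for_pytorch as ipex

# Stand-in for a loaded YOLOv7 model; any eval-mode nn.Module follows the same pattern.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval()

# ipex.optimize applies weight prepacking and other CPU-side kernel optimizations.
model = ipex.optimize(model, dtype=torch.bfloat16)

x = torch.randn(1, 3, 640, 640)
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    y = model(x)
```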
-
Experimentation Made Easy: Gemma.cpp for Research Use Cases
Gemma.cpp is a lightweight, standalone C++ inference engine for Google's Gemma models. It provides a minimalist implementation of the Gemma 2B and 7B models, focusing on simplicity and directness. The project targets experimentation and research use cases, and is designed to be easily embedded in other projects with minimal dependencies, with a small core implementation that is easy to modify. The engine uses Google's Highway library for portable SIMD CPU inference, and supports both bfloat16 weights for higher fidelity and 8-bit switched floating point weights for faster inference.
#Gemma #InferenceEngine #CPlusPlus #MachineLearning #DeepLearning #AI #Research #Experimentation #EmbeddedSystems #SIMD #HighwayLibrary #ModelInference
GitHub - google/gemma.cpp: lightweight, standalone C++ inference engine for Google's Gemma models.
github.com
-
The blog post discusses optimizations to the PyTorch Inductor C++/OpenMP backend for accelerated CPU inference. Intel implemented a hybrid strategy categorizing operations into Conv/GEMM and non-Conv/GEMM types, leading to significant performance improvements, especially for popular deep learning models. They employed techniques such as weight prepacking, post-operation fusion using the oneDNN library, and explicit vectorization in C++ codegen. The optimizations increased efficiency and reliability, with promising results across TorchBench, Hugging Face, and TIMM benchmark suites. https://lnkd.in/eM_EXQA7
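All of this is reached through the standard torch.compile entry point; a minimal sketch of CPU inference where Inductor can apply fusions like the Conv/GEMM post-op fusion described above (the model here is illustrative):

```python
# Sketch: CPU inference through torch.compile (Inductor is the default backend).
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
).eval()

# On CPU, Inductor lowers this to C++/OpenMP code, fusing e.g. conv + bn + relu.
compiled = torch.compile(model)

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    y = compiled(x)
```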
Accelerated CPU Inference with PyTorch Inductor using torch.compile
pytorch.org
-
Quick #AI training tip. If you have too much data to fit into GPU RAM at once (cards with big RAM are $$$), e.g. high-resolution photos, you can try Horovod: it splits your training across multiple GPUs, each holding only its share of the batch, at roughly 90% scaling efficiency. We had missed this feature; we just found it and are testing it now. I think it's worth sharing. Any ideas for a similar project that splits the computation over the network?
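For reference, the basic Horovod data-parallel pattern for PyTorch looks roughly like this (model, data, and hyperparameters are placeholders):

```python
# Sketch: Horovod data parallelism with PyTorch. Launch with e.g.:
#   horovodrun -np 4 python train.py
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # one GPU per process

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers each step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
# Start all workers from identical weights.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for step in range(100):
    x = torch.randn(32, 128).cuda()       # each worker loads ITS OWN shard of the data
    y = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```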
GitHub - horovod/horovod: Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
github.com