Hey AV, self-driving, and ML infra (:wink:) communities! I'm proud to announce a new Nuro technical publication called "FTL Model Compiler Framework". Last time we shared our efforts on scaling ML training (link: https://lnkd.in/gXBszEME ). This time, let's talk about inference. Can we integrate multiple ML sub-compilers to provide multi-GPU serving, execution priority control, and even custom GPU kernel injection? The answer is "Yes" with our Faster Than Light (FTL) framework: https://lnkd.in/g_vryPRn ! Thanks to the authors: Ali Boubezari, Muyang Yu, Nick Korovaiko, and Hongze Z.!
Aleksandr Petiushko’s Post
More Relevant Posts
-
PyTorch 2.0 succeeds its predecessor, PyTorch 1.x, with improved performance and efficiency. Here's a rundown of the new features and capabilities:

Backward compatibility:
- Code, model, and API compatibility with PyTorch 1.x
- Maintained compatibility with libraries and dependencies

Fusing operations:
- Fuses multiple operations into a single GPU kernel
- Reduces kernel launch overhead

Reducing type-conversion overhead:
- Identifies and minimizes unnecessary type-conversion code
- Avoids redundancies

Reusing buffers:
- Reduced memory allocation
- Lower memory consumption during model training

Read on to learn about all the improvements PyTorch 2.0 brings and how they work: https://bit.ly/47QsWR2
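These optimizations kick in when a model is wrapped with torch.compile; a minimal sketch, assuming a PyTorch 2.x install (the toy model below is just a placeholder):

```python
# Minimal sketch: torch.compile is the PyTorch 2.0 entry point that triggers kernel
# fusion and the other graph-level optimizations described above.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

# Existing 1.x-style code keeps working; wrapping the module is the only change.
compiled_model = torch.compile(model)
out = compiled_model(torch.randn(32, 512))
```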
-
Research, Development, and Implementation Specialist | Driving Innovation with AI/ML Solutions | Creating Cutting-Edge Technical Solutions
Quantum AI offers a cutting-edge approach to integrating real-world manufacturing challenges with advanced research and solution frameworks. Leveraging Python libraries such as D-Wave's Ocean software stack, I applied quantum annealing and various quantum-inspired optimization algorithms. How? By processing hardware sensor data on an AMD EPYC 7763 processor (third-generation EPYC, 64 cores). The high-speed data processing enabled robust data analysis, supporting a modified XGBoost combined with the quantum approach. The outcome: higher productivity.
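A minimal, hypothetical sketch of the kind of QUBO formulation such a workflow rests on, using dimod from D-Wave's Ocean stack; the variables and weights below are illustrative, not the actual manufacturing model:

```python
# Hypothetical QUBO sketch with D-Wave's Ocean stack (dimod): choose one of three
# candidate process settings. Linear terms reward settings with lower predicted
# defect rates; quadratic terms penalize selecting two settings at once.
import dimod

linear = {"setting_a": -1.0, "setting_b": -0.8, "setting_c": -0.5}
quadratic = {
    ("setting_a", "setting_b"): 2.0,
    ("setting_a", "setting_c"): 2.0,
    ("setting_b", "setting_c"): 2.0,
}
bqm = dimod.BinaryQuadraticModel(linear, quadratic, 0.0, dimod.BINARY)

# The exact reference solver is fine at this size; on quantum hardware you would
# sample with EmbeddingComposite(DWaveSampler()) from dwave.system instead.
sampleset = dimod.ExactSolver().sample(bqm)
print(sampleset.first.sample, sampleset.first.energy)
```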
-
I wrote an article on PyTorch memory tuning - reducing GPU memory usage during inference and training. Along with some fundamental topics like mixed precision and inference mode, I also dig into some less well-known features: activation checkpointing and the use of replacement optimizers from bitsandbytes. https://lnkd.in/dxtHSTJP
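A minimal sketch of two of the techniques mentioned (inference mode and mixed precision), assuming a CUDA device; the model and input are placeholders:

```python
# inference_mode() drops autograd bookkeeping, so no gradient state is kept, and
# autocast runs eligible ops in float16, roughly halving activation memory.
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
x = torch.randn(8, 1024, device="cuda")

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)
```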
-
Welcome to ⚡️ Lightning Thunder! A cutting-edge #deeplearning compiler for #PyTorch. Developed by the brilliant minds at Lightning AI in collaboration with NVIDIA, Thunder made its dazzling debut yesterday at #GTC24. Thunder is fast, easy to extend, and easy to inspect. Written entirely in #Python, Thunder runs PyTorch programs leveraging optimized kernels from NVIDIA’s nvFuser, cuDNN, and Apex, as well as extensions written with #OpenAI Triton. Thunder incorporates an automatic differentiation pass that can interoperate with PyTorch’s autograd and produce highly optimized training code. Furthermore, Thunder understands distributed calls and can express elaborate distributed strategies as simple program transformations. Ah, we almost forgot to mention – of course, it's #opensource! 🔥 Hats off to the team for their outstanding contribution! 👏 Kudos to Luca Antiga, Thomas Viehmann, William Falcon, Mike Ruberry and all involved. Another significant stride towards the real-world deployment of #AI! 🚀 👉 https://lnkd.in/dZscUpEZ #LightningThunder #opensource #opensourcecommunity #WeMakeAIhappen #GoodPeopleMakeGoodAI #ArtificialIntelligence
Excited to announce our new PyTorch compiler - Thunder! (built in collaboration with NVIDIA 🤯 🤯) Thunder is a source-to-source compiler for PyTorch. It speeds up PyTorch models. As an example, it speeds up Llama 2 7B by 40%. https://lnkd.in/dCF63VaH
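A minimal sketch of compiling a model with Thunder, assuming thunder.jit is the entry point (check the repo for the current API); the toy model below is a placeholder:

```python
# Sketch only: thunder.jit traces the PyTorch program and returns a callable that
# can dispatch to optimized executors (nvFuser, cuDNN, Apex, Triton) where available.
import torch
import thunder

model = torch.nn.Sequential(
    torch.nn.Linear(2048, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 2048),
)

compiled_model = thunder.jit(model)
out = compiled_model(torch.randn(4, 2048))
```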
-
GPU capacity and infrastructure are a major challenge when it comes to fine-tuning LLMs. Thankfully, the PyTorch Lightning package helps with fine-tuning LLMs on minimal resources. Here is a snippet showing how to use this awesome package.
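A minimal, hypothetical sketch of the idea (not the original snippet): PyTorch Lightning with mixed precision and gradient accumulation to keep GPU memory low; the model and data below are placeholders.

```python
# Hypothetical sketch: a tiny LightningModule trained with 16-bit mixed precision
# and gradient accumulation, two settings that reduce peak GPU memory.
import torch
import lightning as L


class TinyLM(L.LightningModule):
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, vocab_size)

    def training_step(self, batch, batch_idx):
        tokens, targets = batch
        logits = self.head(self.embed(tokens))
        return torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1)
        )

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=3e-4)


trainer = L.Trainer(max_epochs=1, precision="16-mixed", accumulate_grad_batches=8)
# trainer.fit(TinyLM(), train_dataloaders=my_dataloader)  # my_dataloader: your data
```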
-
How did I build DogeGraph from zero into the most valuable publicly available deep-learning framework? By learning from other people's mistakes:
- Not utilizing the GPU enough by leaving dead memory on the GPU
- Using an asynchronous approach instead of pure multi-threading
- Not getting memory management right

At Doge, we tackle those problems by:
- Running global_virtual_address allocation as a separate process
- Allocating global_virtual_address at a rate of 1 << 30 / machine * s
- Using a customized compiler to turn synchronous code into multi-threaded code (spanning the computation cluster)
- Re-implementing the kernel internet protocol (UDP-like)
- Guaranteeing pointer stability for logit values (float8, float16, ...); logit-value addresses have "sequential" locality
- Splitting the global memory_region into slow_mem_region (CPU + RAM), fast mem_region (GPU-only), and a uniform-bandwidth, high-memory-usage region (RAMSZ > GPUSZ, flush_on_cap)
- Using shared pointers (at the cost of RAM usage)
- Increasing the SHARED_PAGE transfer rate by using replicas (only if the requested data is const_qualified)
-
Join our webinars to discover the power of PyTorch with Intel hardware optimizations, unlock multi-platform parallelism with SYCL and SYCLomatic, and dive into the latest chapter of the Intel Fortran Compiler. Learn about CUDA migration, wildfire prediction with the Intel Extension for PyTorch powered by oneAPI, and building a smart queue management system with the OpenVINO toolkit in our hands-on workshop. Register now: https://intel.ly/43ObbQG #Developer #oneAPI #OpenVINO #AI #PyTorch #SYCL
-
Infosec leader, Responsible AI, Data Protection, Cyber-Psychology amateur, providing thought leadership and business strategy. AI Governance Professional (IAPP), ex CISSP Instructor
This is pretty big (and literally!). Full release of Grok-1: the 314B-parameter model and source code. It's a pretty big download, around 300 GB, and a GPU cluster is recommended, but I'm keen to hear if anyone runs it locally with any success :) Code is at https://lnkd.in/gruzTH_i https://x.ai/blog/grok-os
Open Release of Grok-1
x.ai
-
Worth a read: PhyCV, the first physics-inspired computer vision library, open-sourced by the Jalali Lab at UCLA. PhyCV offers a new class of computer vision algorithms that emulate the propagation of light through a physical medium with natural and engineered diffractive properties, followed by coherent detection. https://lnkd.in/dezyU3Rk
GitHub - JalaliLabUCLA/phycv: PhyCV: The First Physics-inspired Computer Vision Library
github.com
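A conceptual sketch of that pipeline in NumPy (not PhyCV's actual API or kernels): FFT the image, apply an engineered frequency-domain phase profile standing in for the diffractive medium, inverse FFT, and read out the phase as the "coherent detection" step.

```python
# Conceptual sketch only, not PhyCV code: PhyCV's algorithms use carefully derived
# phase kernels; the simple radial profile here is just an illustrative stand-in.
import numpy as np

def phase_propagation(image: np.ndarray, strength: float = 0.5) -> np.ndarray:
    rows, cols = image.shape
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    radius = np.sqrt(fx**2 + fy**2)
    phase_kernel = np.exp(-1j * strength * radius)   # phase-only "diffractive medium"
    propagated = np.fft.ifft2(np.fft.fft2(image.astype(float)) * phase_kernel)
    return np.angle(propagated)                      # "coherent detection": phase map

demo = phase_propagation(np.random.rand(64, 64))
```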
-
Solutions Architect for Innovation | Master's Student | Distributed Computing, Functional Programming Enthusiast
Interesting to also see an implementation of "Vector Similarity Search". I should play with that soon :) #cuda #gpu
Principal Engineer | Big-Data Science, ML and Graph Analytics | High-Performance & Distributed Computing
If you are attending Supercomputing '23, be sure to watch Akira Naruse's talk on November 16th about parallel top-k algorithms on the GPU. Akira will be presenting two novel, state-of-the-art k-selection methods in particular: a variation on the popular RadixTopK and a novel GridSelection algorithm. The performance improvements achieved by these new algorithms are mind-blowing, and the best part is that they are already included in the RAPIDS RAFT library (https://lnkd.in/eHpCzBQC). For more information about Akira's talk, visit https://lnkd.in/eqNsZJd5. If you can't make it to his talk, you should definitely check out the paper! https://lnkd.in/eZtHAQqs
GitHub - rapidsai/raft: RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.
github.com
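For a frame of reference (this is not RAFT code; RAFT is a CUDA C++ library), the operation these k-selection algorithms accelerate is batched top-k, which in plain PyTorch looks like:

```python
# Baseline only: batched k-selection with PyTorch's built-in topk. 100 query rows,
# one million candidate scores each, keep the 64 best per row.
import torch

scores = torch.randn(100, 1_000_000, device="cuda")
top_values, top_indices = torch.topk(scores, k=64, dim=1, largest=True)
```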