Hey AV, self-driving, and ML infra (:wink:) communities! I'm proud to announce a new Nuro technical publication called "FTL Model Compiler Framework". Last time we shared our efforts on scaling ML training (link: https://lnkd.in/gXBszEME ). This time, let's talk about inference. Can we integrate multiple ML sub-compilers to provide multi-GPU serving, execution priority control, and even custom GPU kernel injection? The answer is "Yes" with our Faster Than Light (FTL) framework: https://lnkd.in/g_vryPRn ! Thanks to the authors: Ali Boubezari, Muyang Yu, Nick Korovaiko, and Hongze Z.!
Aleksandr Petiushko’s Post
More Relevant Posts
-
PyTorch 2.0 succeeds its predecessor, PyTorch 1.x, with improved performance and efficiency. Here's a rundown of the new features and capabilities:

Backward compatibility:
- Code, model, and API compatibility with PyTorch 1.x
- Maintained compatibility with libraries and dependencies

Fusing operations:
- Fuses multiple operations into a single GPU kernel
- Reduces kernel launch overhead

Reducing type-conversion overhead:
- Identifies and minimizes unnecessary type-conversion code
- Avoids redundancies

Reusing buffers:
- Reduced memory allocation
- Lower memory consumption during model training

Read on to learn about all the improvements PyTorch 2.0 brings and how they work: https://bit.ly/47QsWR2
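These optimizations kick in when a model is wrapped with torch.compile; a minimal sketch, assuming a PyTorch 2.x install (the toy model below is just a placeholder):

```python
# Minimal sketch: torch.compile is the PyTorch 2.0 entry point that triggers kernel
# fusion and the other graph-level optimizations described above.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

# Existing 1.x-style code keeps working; wrapping the module is the only change.
compiled_model = torch.compile(model)
out = compiled_model(torch.randn(32, 512))
```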
-
Research, Development, and Implementation Specialist | Driving Innovation with AI/ML Solutions | Creating Cutting-Edge Technical Solutions
Quantum AI offers a cutting-edge approach to integrating real-world manufacturing challenges with advanced research and solution frameworks. Leveraging Python libraries such as D-Wave's Ocean software stack, I applied quantum annealing and various quantum-inspired optimization algorithms. How? By processing hardware sensor data on an AMD EPYC 7763 processor (third-generation EPYC, 64 cores). The high-speed data processing enabled robust data analysis, supporting a modified XGBoost combined with the quantum approach. The outcome: higher productivity.
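A minimal, hypothetical sketch of the kind of QUBO formulation such a workflow rests on, using dimod from D-Wave's Ocean stack; the variables and weights below are illustrative, not the actual manufacturing model:

```python
# Hypothetical QUBO sketch with D-Wave's Ocean stack (dimod): choose one of three
# candidate process settings. Linear terms reward settings with lower predicted
# defect rates; quadratic terms penalize selecting two settings at once.
import dimod

linear = {"setting_a": -1.0, "setting_b": -0.8, "setting_c": -0.5}
quadratic = {
    ("setting_a", "setting_b"): 2.0,
    ("setting_a", "setting_c"): 2.0,
    ("setting_b", "setting_c"): 2.0,
}
bqm = dimod.BinaryQuadraticModel(linear, quadratic, 0.0, dimod.BINARY)

# The exact reference solver is fine at this size; on quantum hardware you would
# sample with EmbeddingComposite(DWaveSampler()) from dwave.system instead.
sampleset = dimod.ExactSolver().sample(bqm)
print(sampleset.first.sample, sampleset.first.energy)
```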
-
I wrote an article on PyTorch memory tuning - reducing GPU memory usage during inference and training. Along with some fundamental topics like mixed precision and inference mode, I also dig into some less well-known features: activation checkpointing and the use of replacement optimizers from bitsandbytes. https://lnkd.in/dxtHSTJP
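A minimal sketch of two of the techniques mentioned (inference mode and mixed precision), assuming a CUDA device; the model and input are placeholders:

```python
# inference_mode() drops autograd bookkeeping, so no gradient state is kept, and
# autocast runs eligible ops in float16, roughly halving activation memory.
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
x = torch.randn(8, 1024, device="cuda")

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)
```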
-
Welcome to ⚡️ Lightning Thunder! A cutting-edge #deeplearning compiler for #PyTorch. Developed by the brilliant minds at Lightning AI in collaboration with NVIDIA, Thunder made its dazzling debut yesterday at #GTC24. Thunder is fast, easy to extend, and easy to inspect. Written entirely in #Python, Thunder runs PyTorch programs leveraging optimized kernels from NVIDIA’s nvFuser, cuDNN, and Apex, as well as extensions written with #OpenAI Triton. Thunder incorporates an automatic differentiation pass that can interoperate with PyTorch’s autograd and produce highly optimized training code. Furthermore, Thunder understands distributed calls and can express elaborate distributed strategies as simple program transformations. Ah, we almost forgot to mention – of course, it's #opensource! 🔥 Hats off to the team for their outstanding contribution! 👏 Kudos to Luca Antiga, Thomas Viehmann, William Falcon, Mike Ruberry and all involved. Another significant stride towards the real-world deployment of #AI! 🚀 👉 https://lnkd.in/dZscUpEZ #LightningThunder #opensource #opensourcecommunity #WeMakeAIhappen #GoodPeopleMakeGoodAI #ArtificialIntelligence
Excited to announce our new PyTorch compiler - Thunder! (built in collaboration with NVIDIA 🤯 🤯) Thunder is a source-to-source compiler for PyTorch. It speeds up PyTorch models. As an example, it speeds up Llama 2 7B by 40%. https://lnkd.in/dCF63VaH
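A minimal sketch of compiling a model with Thunder, assuming thunder.jit is the entry point (check the repo for the current API); the toy model below is a placeholder:

```python
# Sketch only: thunder.jit traces the PyTorch program and returns a callable that
# can dispatch to optimized executors (nvFuser, cuDNN, Apex, Triton) where available.
import torch
import thunder

model = torch.nn.Sequential(
    torch.nn.Linear(2048, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 2048),
)

compiled_model = thunder.jit(model)
out = compiled_model(torch.randn(4, 2048))
```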
-
GPU capacity and infrastructure are a major challenge when it comes to fine-tuning LLMs. Thankfully, the PyTorch Lightning package helps with fine-tuning LLMs on minimal resources. Here is a snippet showing how to use this awesome package.
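A minimal, hypothetical sketch of the idea (not the original snippet): PyTorch Lightning with mixed precision and gradient accumulation to keep GPU memory low; the model and data below are placeholders.

```python
# Hypothetical sketch: a tiny LightningModule trained with 16-bit mixed precision
# and gradient accumulation, two settings that reduce peak GPU memory.
import torch
import lightning as L


class TinyLM(L.LightningModule):
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, vocab_size)

    def training_step(self, batch, batch_idx):
        tokens, targets = batch
        logits = self.head(self.embed(tokens))
        return torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1)
        )

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=3e-4)


trainer = L.Trainer(max_epochs=1, precision="16-mixed", accumulate_grad_batches=8)
# trainer.fit(TinyLM(), train_dataloaders=my_dataloader)  # my_dataloader: your data
```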
-
How did I build DogeGraph from zero into the most valuable publicly available deep-learning framework? By learning from other people's mistakes:
- Not utilizing the GPU enough by leaving dead memory on the GPU
- Using an asynchronous approach instead of pure multi-threading
- Not getting memory management right

At Doge, we tackle those problems by:
- Running global_virtual_address allocation as a separate process
- Allocating global_virtual_address at a rate of 1 << 30 / machine * s
- Using a customized compiler to turn synchronous code into multi-threaded code (spanning the computation cluster)
- Re-implementing the kernel internet protocol (UDP-like)
- Guaranteeing pointer stability for logit values (float8, float16, ...); logit-value addresses have "sequential" locality
- Splitting the global memory_region into slow_mem_region (CPU + RAM), fast mem_region (GPU-only), and a uniform-bandwidth, high-memory-usage region (RAMSZ > GPUSZ, flush_on_cap)
- Using shared pointers (at the cost of RAM usage)
- Increasing the SHARED_PAGE transfer rate by using replicas (only if the requested data is const_qualified)
-
Join our webinars to discover the power of PyTorch with Intel hardware optimizations, unlock multi-platform parallelism with SYCL and SYCLomatic, and dive into the latest chapter of the Intel Fortran Compiler. Learn about CUDA migration, wildfire prediction with the Intel Extension for PyTorch powered by oneAPI, and building a smart queue management system with the OpenVINO toolkit in our hands-on workshop. Register now: https://intel.ly/43ObbQG #Developer #oneAPI #OpenVINO #AI #PyTorch #SYCL
-
Infosec leader, Responsible AI, Data Protection, Cyber-Psychology amateur, providing thought leadership and business strategy. AI Governance Professional (IAPP), ex CISSP Instructor
This is pretty big (and literally!). Full release of Grok-1: the 314B-parameter model and source code. It's a pretty big download, around 300 GB, and a GPU cluster is recommended, but I'm keen to hear if anyone runs it locally with any success :) Code is at https://lnkd.in/gruzTH_i https://x.ai/blog/grok-os
Open Release of Grok-1
x.ai
-
Worth a read: PhyCV, the first physics-inspired computer vision library, open-sourced by the Jalali Lab at UCLA. PhyCV offers a new class of computer vision algorithms that emulate the propagation of light through a physical medium with natural and engineered diffractive properties, followed by coherent detection. https://lnkd.in/dezyU3Rk
GitHub - JalaliLabUCLA/phycv: PhyCV: The First Physics-inspired Computer Vision Library
github.com
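A conceptual sketch of that pipeline in NumPy (not PhyCV's actual API or kernels): FFT the image, apply an engineered frequency-domain phase profile standing in for the diffractive medium, inverse FFT, and read out the phase as the "coherent detection" step.

```python
# Conceptual sketch only, not PhyCV code: PhyCV's algorithms use carefully derived
# phase kernels; the simple radial profile here is just an illustrative stand-in.
import numpy as np

def phase_propagation(image: np.ndarray, strength: float = 0.5) -> np.ndarray:
    rows, cols = image.shape
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    radius = np.sqrt(fx**2 + fy**2)
    phase_kernel = np.exp(-1j * strength * radius)   # phase-only "diffractive medium"
    propagated = np.fft.ifft2(np.fft.fft2(image.astype(float)) * phase_kernel)
    return np.angle(propagated)                      # "coherent detection": phase map

demo = phase_propagation(np.random.rand(64, 64))
```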
-
Solutions Architect for Innovation | Master's Student | Distributed Computing, Functional Programming Enthusiast
Interesting to also see an implementation of "Vector Similarity Search". I should play with that soon :) #cuda #gpu
Principal Engineer | Big-Data Science, ML and Graph Analytics | High-Performance & Distributed Computing
If you are attending Supercomputing '23, be sure to watch Akira Naruse's talk on November 16th about parallel top-k algorithms on the GPU. Akira will be presenting two novel, state-of-the-art k-selection methods in particular: a variation on the popular RadixTopK and a novel GridSelection algorithm. The performance improvements achieved by these new algorithms are mind-blowing, and the best part is that they are already included in the RAPIDS RAFT library (https://lnkd.in/eHpCzBQC). For more information about Akira's talk, visit https://lnkd.in/eqNsZJd5. If you can't make it to his talk, you should definitely check out the paper! https://lnkd.in/eZtHAQqs
GitHub - rapidsai/raft: RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.
github.com
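For a frame of reference (this is not RAFT code; RAFT is a CUDA C++ library), the operation these k-selection algorithms accelerate is batched top-k, which in plain PyTorch looks like:

```python
# Baseline only: batched k-selection with PyTorch's built-in topk. 100 query rows,
# one million candidate scores each, keep the 64 best per row.
import torch

scores = torch.randn(100, 1_000_000, device="cuda")
top_values, top_indices = torch.topk(scores, k=64, dim=1, largest=True)
```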