🗓 Happening today at 2PM EST! Learn why vLLM is the leading open-source inference server and how Neural Magic works with enterprises to build and scale vLLM-based model services. https://hubs.li/Q02DVnBd0
Neural Magic’s Post
-
FP8 quantization is now available in vLLM - check it out! Quantized inference is one of the best ways to reduce the costs of LLM deployments.
We’ve recently contributed FP8 support to vLLM in collaboration with Neural Magic. With this feature, you can see up to a 1.8x reduction in inter-token latency while preserving >99% accuracy. A common concern with FP8 is whether users will experience accuracy degradation. To address this, Neural Magic has produced checkpoints for key models with >99% accuracy preservation across a wide range of benchmarks (https://lnkd.in/gTimN5dZ), including:
- Llama3-70b
- Mixtral 8x7b
- Llama3-8b
You can easily try this out in vLLM, and read more about the feature here: https://lnkd.in/gzKJqerB
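As a rough illustration of what FP8 weight quantization does, here is a minimal per-tensor sketch emulated in float32 with NumPy. This is a sketch under stated assumptions, not vLLM's implementation: the helper names are illustrative, the commented checkpoint name is an assumption, and real FP8 kernels round to native E4M3 values on the GPU rather than merely scaling and clamping.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_fp8(weights: np.ndarray):
    """Per-tensor FP8-style quantization: scale weights into E4M3 range.

    A real kernel would also round each value to the nearest representable
    E4M3 number; here we only emulate the scaling and clamping in float32.
    """
    scale = np.abs(weights).max() / E4M3_MAX
    q = np.clip(weights / scale, -E4M3_MAX, E4M3_MAX)
    return q, scale

def dequantize_fp8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate original weights from quantized values."""
    return q * scale

w = np.array([1.0, -2.0, 0.5, 4.0])
q, scale = quantize_fp8(w)
print(scale)                     # 4.0 / 448.0
print(dequantize_fp8(q, scale))  # ≈ [ 1.  -2.   0.5  4. ]

# To try an actual FP8 checkpoint in vLLM (requires a supported GPU;
# the model name below is an assumption, not taken from the post):
#   from vllm import LLM
#   llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8")
```

The per-tensor scale is what the FP8 checkpoints linked above ship alongside the weights; activations are handled analogously at runtime.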
-
Our biweekly vLLM Office Hours continue tomorrow. We are excited to bring Philipp Moritz and Cody Yu from Anyscale for a deep dive into FP8 quantization in vLLM. Join us to give feedback and get your questions answered: https://lnkd.in/euF8m73q
-
Are you looking to optimize your #LLM inference for more performance and lower costs? Tune in to hear Eldar Kurtić, our Sr. ML Researcher, break down how quantization can optimize LLM inference and reduce memory footprint without compromising model accuracy.
The second episode of the "Efficient Inference through Sparsity and Quantization" podcast series is out now. In the first episode, I talked about how sparsity can enhance the performance and efficiency of machine learning models, leading to significant cost reductions on both CPUs and GPUs. In this newly released episode, we dive deep into quantization techniques. Discover how quantization can further optimize model inference and reduce memory footprint without compromising accuracy. Listen to the second episode here: https://lnkd.in/duK8ijTC
57. Eldar Kurtic - Efficient Inference through sparsity and quantization - Part 2/2
-
Neural Magic's CEO, Brian Stevens, recently spent some time with host Heather Haskin from The Catalyst by Softchoice podcast to talk about the intersection of AI, open source, and the future of responsible development. Listen in on "The case for open-source AI" to learn more about the vital role of open-source models and why the democratization of AI is important for the success of today's enterprise. https://lnkd.in/eumGUGBH
The Catalyst by Softchoice
-
Optimizing your AI models with techniques like sparsity and quantization increases production performance while decreasing your total infrastructure spend. Eldar Kurtić, our expert in AI model optimization, shares more details in this podcast. Check it out 👇
I was recently invited to share my insights on "Efficient Inference through Sparsity and Quantization" in a two-part podcast series. In the first episode, we dive into how sparsity can improve the performance and efficiency of machine learning models, reducing deployment costs on both CPUs and GPUs. The next episode, which will focus on quantization, is coming soon. Listen to the first episode here: https://lnkd.in/dnaCzzsm
56. Eldar Kurtic - Efficient Inference through sparsity and quantization - Part 1/2
-
Are you using, or considering, vLLM for your LLM inference serving? Join us this Wednesday to ask your questions and learn more about accelerated #LLM inference with vLLM and Neural Magic. vLLM maintainer Simon Mo and committer Michael Goin will spend an hour with the community, answering questions and sharing recent project updates. Register and ask your questions here: https://lnkd.in/euF8m73q If you can't make it this Wednesday, you can register for the June 20th session.
-
We were at Nutanix #NEXTConf in Barcelona last week, and what an exciting event it was! From the main-stage announcements to the enthusiastic conversations with attendees at the AI Pavilion, congrats to Nutanix for creating a dynamic environment where innovation thrives. Thank you to the Nutanix team, including Gregory Lehrer, Gali Ross-Hasson, Tarkan Maner, Luke Congdon, Wolfgang Huse, and others, for partnering with Neural Magic. We appreciate the opportunity to be part of the Nutanix #AI strategy. 🙏
Neural Magic at Nutanix .NEXT 2024.