Neural Magic

Software Development

Somerville, Massachusetts · 15,933 followers

We are on a mission to bring open-source LLMs and vLLM to every enterprise on the planet. The future of AI is open.

About us

Together with our community, we engineer sparse LLM, CV, and NLP models that are more efficient and performant in production. Why does this matter? Sparse models are more flexible and can achieve unrivaled latency and throughput performance on your private CPU and GPU infrastructure. Check us out on GitHub and join the Neural Magic Slack Community to get started with software-delivered AI.

Website
http://neuralmagic.com/
Industry
Software Development
Company size
51-200 employees
Headquarters
Somerville, Massachusetts
Type
Privately Held
Founded
2018
Specialties
machine learning, deep learning, and artificial intelligence

Locations

  • Primary

    55 Davis Sq

    Floor 3

    Somerville, Massachusetts 02144, US

Updates

  • Neural Magic

    We further enhanced Meta's Llama 3.1 405B model with complete FP8 quantization, covering every linear module, unlike the original FP8 release, which left 510 of them unquantized. The result: 20% less memory (~400 GB vs. 500 GB), 99.74% accuracy recovery, and no OOM errors. See the fully quantized model on our Hugging Face Model Hub: https://lnkd.in/eFAr8t53

    Mark Kurtz, CTO @ Neural Magic:

    📢 Full FP8 Llama 3.1 405B Now Available! 📢 Exciting news from Neural Magic! Our research team has successfully compressed the largest model from Meta's Llama 3.1 launch, producing a fully quantized FP8 version (no layers skipped!) of the 405B model with ~100% accuracy recovery.

    This model fits on any 8xH100 or 8xA100 system without the OOM errors commonly seen with the original FP8 and FP16 versions. Inference is also over 2X faster, thanks to faster memory and compute, and there is no need for CPU offloading or distribution across multiple nodes.

    Explore the model links:
    - FP8 Dynamic Quantization: https://lnkd.in/eWsijBTV
    - FP8 Static Quantization: https://lnkd.in/eCaGBm39

    For further insights, don't miss my previous Llama 3.1 posts:
    - FP8 8B: https://lnkd.in/eWPXcUVj
    - FP8 70B: https://lnkd.in/dV3-6pbW
    - INT8 8B: https://lnkd.in/eqPgi2Bs

    Stay tuned for updates on our INT4 variations, comprehensive blog writeups detailing best practices and research, and much more!

    • A generated image representing a quantized model racing and beating the larger unquantized model as two llamas racing
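    A minimal usage sketch for the checkpoint above with vLLM's offline API. The repo id follows the naming used in these posts, but the "neuralmagic" Hugging Face org prefix and the sampling settings are assumptions, not an official snippet.

    ```python
    # Load the fully quantized FP8 405B checkpoint on one 8-GPU node.
    # tensor_parallel_size=8 shards the ~400 GB of weights across
    # 8xH100 or 8xA100, matching the deployment described in the post.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic",  # assumed repo id
        tensor_parallel_size=8,
    )

    outputs = llm.generate(
        ["What does FP8 quantization change about serving a 405B model?"],
        SamplingParams(temperature=0.7, max_tokens=128),
    )
    print(outputs[0].outputs[0].text)
    ```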
  • Neural Magic reposted this

    Asif Razzaq, AI Research Editor and CEO @ Marktechpost:

    Neural Magic Releases Fully Quantized FP8 Version of Meta's Llama 3.1 405B Model: FP8 Dynamic Quantization and FP8 Static Quantization

    Neural Magic has announced a significant breakthrough in AI model compression: a fully quantized FP8 version of Meta's Llama 3.1 405B model. This milestone allows the massive 405-billion-parameter model to fit seamlessly on any 8xH100 or 8xA100 system without the out-of-memory (OOM) errors typically encountered with the original FP8 and FP16 versions. The new model solves memory constraints and speeds up inference by over 2X, leveraging faster memory and compute capabilities and eliminating the need for CPU offloading or distribution across multiple nodes.

    Neural Magic provides two key versions of the model:
    ✅ Meta-Llama-3.1-405B-Instruct-FP8-dynamic
    ✅ Meta-Llama-3.1-405B-Instruct-FP8

    The fully quantized Meta-Llama-3.1-405B-Instruct-FP8-dynamic maintains the architecture of Meta-Llama-3.1 and is designed for assistant-like chat in multiple languages, though it is restricted to use in English and for lawful applications only. Released as version 1.0, the model was developed by Neural Magic and operates under the llama3.1 license…

    Read our full take on this model: https://lnkd.in/gfcbsTRv

    Explore the model links:
    - FP8 Dynamic Quantization: https://lnkd.in/gjhYydTC
    - FP8 Static Quantization: https://lnkd.in/g3vC7RtY

    Neural Magic, Mark Kurtz, Brian Stevens, Alex Matveev

  • Neural Magic

    📹 We just released our latest vLLM Office Hours recording with model compression expert Eldar Kurtić. In this July 25th session, Eldar dives deep into Model Quantization for Efficient vLLM Inference and introduces the new llm-compressor library. Additionally, we shared updates on the vLLM v0.5.2 and v0.5.3 releases, including model support for Llama 3.1, Mistral NeMo, and Chameleon.

    Watch the full session recording here: https://lnkd.in/eXBzJNCg

    P.S. We're excited to invite you to our next vLLM Office Hours on August 8th, featuring Roger Wang from Roblox, who will cover Multimodal Models in vLLM: https://lnkd.in/euF8m73q

    #vLLM #AI #ModelCompression #LLMs #MachineLearning #DeepLearning #AICommunity
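    As a companion to the session above, here is a minimal sketch of FP8 dynamic quantization with llm-compressor, modeled on the library's published quickstart pattern; treat the exact import paths, argument names, and example model id as assumptions and check the repo for the current API.

    ```python
    # Quantize an open Llama checkpoint to FP8 (weights plus dynamic
    # activation scales) and save a vLLM-ready model directory.
    from llmcompressor.transformers import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    # Target every Linear module, keeping the lm_head in higher precision.
    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=["lm_head"],
    )

    oneshot(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed example model
        recipe=recipe,
        output_dir="Meta-Llama-3.1-8B-Instruct-FP8-dynamic",
    )
    ```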

  • Neural Magic reposted this

    Mark Kurtz, CTO @ Neural Magic:

    📢 The Llama 3.1 Models Are Here, and We're Actively Compressing Them! 📢 Meta unveiled their latest Llama series, featuring an impressive 405-billion-parameter model that surpasses OpenAI's GPT-4o. This milestone significantly boosts open source and the AI community, although the largest model now requires multiple servers (810 GB!). Model compression is crucial!

    Our (Neural Magic) Llama 3.1 compression project is underway, aiming for cost-effective and sustainable deployments without compromising accuracy. The FP8 quantized Llama 3.1 8B model has already achieved over 99% accuracy recovery, with detailed accuracy metrics and deployment guidelines available. We've also introduced FP8 model support for all Llama versions in vLLM for immediate use.

    Explore the latest models here:
    - Meta-Llama-3.1-8B-Instruct-FP8: https://lnkd.in/dDAcXAAY
    - Meta-Llama-3.1-8B-Instruct-FP8-dynamic: https://lnkd.in/djtw4GMr

    For more insights, visit the vLLM Llama 3.1 blog:
    - https://lnkd.in/dnZsvKjy

    Stay tuned for further updates - I'll be sharing more posts in the days ahead!

    #LLMs #vLLM #AI #MachineLearning #Quantization #NeuralMagic
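    A quick single-GPU usage sketch for the 8B FP8 checkpoint named above in vLLM; the "neuralmagic" org prefix on the repo id is an assumption (the post links through lnkd.in shorteners), as are the sampling settings.

    ```python
    # Run the FP8 Llama 3.1 8B checkpoint on a single GPU with vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic")  # assumed repo id

    params = SamplingParams(temperature=0.6, max_tokens=64)
    for out in llm.generate(["Summarize FP8 quantization in one sentence."], params):
        print(out.outputs[0].text)
    ```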

  • Neural Magic reposted this

    Mark Kurtz, CTO @ Neural Magic:

    📢 FP8 Quantized Llama 3.1 70B Now Available! 📢 Continuing our (Neural Magic) Llama 3.1 compression project, the first versions of the FP8 quantized 70B model are ready, achieving ~100% accuracy recovery. Quantization reduces the model size to 70 GB, enabling deployment on a single H100 or A100 GPU instead of the two required for the FP16 version (a 50% cost reduction!). It also offers ~2X faster inference on the latest NVIDIA hardware.

    Explore the model links:
    - FP8 Dynamic Quantization: https://lnkd.in/eKtbAJBv
    - FP8 Static Quantization: https://lnkd.in/eQDSj_9i

    For more insights, check out the vLLM Llama 3.1 blog:
    - https://lnkd.in/dnZsvKjy

    Previous Llama 3.1 8B FP8 post:
    - https://lnkd.in/eWPXcUVj

    Stay tuned for further updates - I'll be sharing more posts in the days ahead!

    #LLMs #AI #MachineLearning #Quantization #vLLM #OpenSource
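    The back-of-the-envelope math behind "one GPU instead of two" (weights only; KV cache and activation memory add overhead on top):

    ```python
    # 70B parameters at 2 bytes per weight (FP16) vs. 1 byte (FP8).
    params = 70e9
    fp16_gb = params * 2 / 1e9  # 140 GB -> exceeds one 80 GB H100/A100, needs two
    fp8_gb = params * 1 / 1e9   # 70 GB  -> fits on a single 80 GB GPU
    print(f"FP16 weights: {fp16_gb:.0f} GB, FP8 weights: {fp8_gb:.0f} GB")
    ```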

  • Neural Magic

    Join us for bi-weekly vLLM Office Hours this Thursday, July 25! Eldar Kurtić, a model optimization expert, will show us how to quantize LLMs for fast and efficient inference in #vLLM. We'll have ample time for questions and community discussion. See you there!

    Eldar Kurtić:

    In this week's installment of 'vLLM Office Hours,' we'll explore the when, why, and how of quantizing LLMs for efficient inference. We'll have plenty of time for questions and open discussion. Register at: https://lnkd.in/dvSmUrVe

  • Neural Magic

    It's a big day for us - we've joined the Linux Foundation! We are thrilled to collaborate with the greater ecosystem and accelerate #opensource AI innovation.

    LF AI & Data Foundation:

    We are thrilled to announce Neural Magic has joined LF AI & Data as a Premier Member! 🚀

    Neural Magic is at the forefront of enabling enterprise deployment of leading #opensource large language models (#LLMs) across a broad set of infrastructure, securely, whether that’s in the cloud, a private data center, or at the edge.

    ✅ “Our vision at Neural Magic is The Future of AI is Open, so it would seem only natural to be joining the LF AI & Data as a Premier Member. We look forward to contributing within the LF AI & Data community to help make our vision become a reality.” - Brian Stevens, Neural Magic CEO

    🔗 Read the full announcement: https://lnkd.in/e8ASy9VP

    #linuxfoundation #oss #opensource #lfaidata

    Neural Magic Joins LF AI & Data as a Premier Member (https://lfaidata.foundation)

  • Neural Magic

    EXCITING NEWS: Neural Magic and Anyscale contributed FP8 quantization support to vLLM, making LLM inference even more efficient. FP8 reduces latency on NVIDIA GPUs by 2x with >99% accuracy preservation. Thank you to NVIDIA AI for validating our benchmarks!

    🔍 What is FP8? FP8 is a modern quantization format that balances precision and efficiency, with hardware acceleration on newer GPUs. It significantly reduces memory usage, enabling more cost-effective LLM deployments and higher throughput.

    📈 Performance gains: FP8 delivers up to 2x Inter-Token Latency (ITL) improvement for Llama 3 70B, 1.6x ITL improvement for Mixtral 8x7B, and up to 3x throughput improvement on 2 NVIDIA H100 GPUs. Memory savings allow for larger batch sizes, boosting performance across various models. Our blog contains specific accuracy details.

    ✅ Model accuracy: We validated the accuracy preservation of FP8 in vLLM through lm-evaluation-harness comparisons on Open LLM Leaderboard v1 tasks. Most models retain over 99% of their accuracy compared to the unquantized baseline.

    🛠️ Get started: You can now try out FP8 support in vLLM using a quantized FP8 checkpoint. Access Neural Magic's growing list of accuracy-verified FP8 checkpoints of popular LLMs on our Hugging Face Model Hub, ready to use with vLLM: https://lnkd.in/gTimN5dZ

    🗓️ Learn more: See our blog for more detailed FP8 insights, and join our bi-weekly vLLM Office Hours to regularly hear from and give feedback to the vLLM committer community: https://lnkd.in/g2suBKvr

    🙏 Thank you for reading, and please spread the word about FP8 in vLLM by sharing this post.

    • Inter-Token Latency (ITL) benchmarks for Llama 3 70B and Mixtral 8x7B on 2xH100. Note that FP8 MoE support currently requires Triton version 2.3.1 or higher.
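    A minimal sketch of the accuracy-preservation comparison described above, using lm-evaluation-harness's Python API (v0.4-style). The task subset, metric keys, and checkpoint ids are assumptions for illustration, not Neural Magic's actual evaluation script.

    ```python
    import lm_eval

    TASKS = ["arc_challenge", "winogrande", "hellaswag"]  # Open LLM Leaderboard v1 subset

    def scores(model_id: str) -> dict:
        """Evaluate a checkpoint through vLLM and return per-task accuracy."""
        out = lm_eval.simple_evaluate(
            model="vllm",
            model_args=f"pretrained={model_id}",
            tasks=TASKS,
        )
        return {t: out["results"][t]["acc,none"] for t in TASKS}

    baseline = scores("meta-llama/Meta-Llama-3-8B-Instruct")  # unquantized
    fp8 = scores("neuralmagic/Meta-Llama-3-8B-Instruct-FP8")  # quantized (assumed id)

    # "Accuracy preservation" = quantized score as a fraction of the baseline.
    for task in TASKS:
        print(f"{task}: {100 * fp8[task] / baseline[task]:.2f}% preserved")
    ```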
