Neural Magic

Software Development

Somerville, Massachusetts · 15,933 followers

We are on a mission to bring open-source LLMs and vLLM to every enterprise on the planet. The future of AI is open.

About us

Together with our community, we engineer sparse LLM, CV, and NLP models that are more efficient and performant in production. Why does this matter? Sparse models are more flexible and can achieve unrivaled latency and throughput performance on your private CPU and GPU infrastructure. Check us out on GitHub and join the Neural Magic Slack Community to get started with software-delivered AI.

Website
http://neuralmagic.com/
Industry
Software Development
Company size
51-200 employees
Headquarters
Somerville, Massachusetts
Type
Privately Held
Founded
2018
Specialties
machine learning, deep learning, and artificial intelligence

Locations

  • Primary

    55 Davis Sq

    Floor 3

    Somerville, Massachusetts 02144, US

Updates

  • Neural Magic

    We further enhanced Meta's Llama 3.1 405B model with complete FP8 quantization, covering every linear module, unlike the original FP8 release, which left 510 of them unquantized. The result: 20% less memory (~400 GB vs. 500 GB), 99.74% accuracy recovery, and no OOM errors. See the fully quantized model on our Hugging Face Model Hub: https://lnkd.in/eFAr8t53

    Mark Kurtz, CTO @ Neural Magic:

    📢 Full FP8 Llama 3.1 405B Now Available! 📢 Exciting news from Neural Magic! Our research team has successfully compressed the largest model from Meta's Llama 3.1 launch, producing a fully quantized FP8 version (no layers skipped!) of the 405B model with ~100% accuracy recovery.

    This model fits on any 8xH100 or 8xA100 system without the OOM errors commonly seen with the original FP8 and FP16 versions. Inference is also over 2X faster, thanks to faster memory and compute, and there is no need for CPU offloading or distribution across multiple nodes.

    Explore the model links:
    - FP8 Dynamic Quantization: https://lnkd.in/eWsijBTV
    - FP8 Static Quantization: https://lnkd.in/eCaGBm39

    For further insights, don't miss my previous Llama 3.1 posts:
    - FP8 8B: https://lnkd.in/eWPXcUVj
    - FP8 70B: https://lnkd.in/dV3-6pbW
    - INT8 8B: https://lnkd.in/eqPgi2Bs

    Stay tuned for updates on our INT4 variations, comprehensive blog writeups detailing best practices and research, and much more!

    • A generated image representing a quantized model racing and beating the larger unquantized model as two llamas racing
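    A minimal usage sketch for the checkpoint above with vLLM's offline API. The repo id follows the naming used in these posts, but the "neuralmagic" Hugging Face org prefix and the sampling settings are assumptions, not an official snippet.

    ```python
    # Load the fully quantized FP8 405B checkpoint on one 8-GPU node.
    # tensor_parallel_size=8 shards the ~400 GB of weights across
    # 8xH100 or 8xA100, matching the deployment described in the post.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic",  # assumed repo id
        tensor_parallel_size=8,
    )

    outputs = llm.generate(
        ["What does FP8 quantization change about serving a 405B model?"],
        SamplingParams(temperature=0.7, max_tokens=128),
    )
    print(outputs[0].outputs[0].text)
    ```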
  • Neural Magic reposted this

    Asif Razzaq, AI Research Editor and CEO @ Marktechpost:

    Neural Magic Releases Fully Quantized FP8 Version of Meta's Llama 3.1 405B Model: FP8 Dynamic Quantization and FP8 Static Quantization

    Neural Magic has announced a significant breakthrough in AI model compression: a fully quantized FP8 version of Meta's Llama 3.1 405B model. This milestone allows the massive 405-billion-parameter model to fit seamlessly on any 8xH100 or 8xA100 system without the out-of-memory (OOM) errors typically encountered with the original FP8 and FP16 versions. The new model solves memory constraints and speeds up inference by over 2X, leveraging faster memory and compute capabilities and eliminating the need for CPU offloading or distribution across multiple nodes.

    Neural Magic provides two key versions of the model:
    ✅ Meta-Llama-3.1-405B-Instruct-FP8-dynamic
    ✅ Meta-Llama-3.1-405B-Instruct-FP8

    The fully quantized Meta-Llama-3.1-405B-Instruct-FP8-dynamic maintains the architecture of Meta-Llama-3.1 and is designed for assistant-like chat in multiple languages, though it is restricted to use in English and for lawful applications only. Released as version 1.0, the model was developed by Neural Magic and operates under the llama3.1 license…

    Read our full take on this model: https://lnkd.in/gfcbsTRv

    Explore the model links:
    - FP8 Dynamic Quantization: https://lnkd.in/gjhYydTC
    - FP8 Static Quantization: https://lnkd.in/g3vC7RtY

    Neural Magic, Mark Kurtz, Brian Stevens, Alex Matveev

  • Neural Magic

    📹 We just released our latest vLLM Office Hours recording with model compression expert Eldar Kurtić. In this July 25th session, Eldar dives deep into Model Quantization for Efficient vLLM Inference and introduces the new llm-compressor library. Additionally, we shared updates on the vLLM v0.5.2 and v0.5.3 releases, including model support for Llama 3.1, Mistral NeMo, and Chameleon.

    Watch the full session recording here: https://lnkd.in/eXBzJNCg

    P.S. We're excited to invite you to our next vLLM Office Hours on August 8th, featuring Roger Wang from Roblox, who will cover Multimodal Models in vLLM: https://lnkd.in/euF8m73q

    #vLLM #AI #ModelCompression #LLMs #MachineLearning #DeepLearning #AICommunity
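    As a companion to the session above, here is a minimal sketch of FP8 dynamic quantization with llm-compressor, modeled on the library's published quickstart pattern; treat the exact import paths, argument names, and example model id as assumptions and check the repo for the current API.

    ```python
    # Quantize an open Llama checkpoint to FP8 (weights plus dynamic
    # activation scales) and save a vLLM-ready model directory.
    from llmcompressor.transformers import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    # Target every Linear module, keeping the lm_head in higher precision.
    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=["lm_head"],
    )

    oneshot(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed example model
        recipe=recipe,
        output_dir="Meta-Llama-3.1-8B-Instruct-FP8-dynamic",
    )
    ```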

  • Neural Magic reposted this

    Mark Kurtz, CTO @ Neural Magic:

    📢 The Llama 3.1 Models Are Here, and We're Actively Compressing Them! 📢 Meta unveiled their latest Llama series, featuring an impressive 405-billion-parameter model that surpasses OpenAI's GPT-4o. This milestone significantly boosts open source and the AI community, although the largest model now requires multiple servers (810 GB!). Model compression is crucial!

    Our (Neural Magic) Llama 3.1 compression project is underway, aiming for cost-effective and sustainable deployments without compromising accuracy. The FP8 quantized Llama 3.1 8B model has already achieved over 99% accuracy recovery, with detailed accuracy metrics and deployment guidelines available. We've also introduced FP8 model support for all Llama versions in vLLM for immediate use.

    Explore the latest models here:
    - Meta-Llama-3.1-8B-Instruct-FP8: https://lnkd.in/dDAcXAAY
    - Meta-Llama-3.1-8B-Instruct-FP8-dynamic: https://lnkd.in/djtw4GMr

    For more insights, visit the vLLM Llama 3.1 blog:
    - https://lnkd.in/dnZsvKjy

    Stay tuned for further updates - I'll be sharing more posts in the days ahead!

    #LLMs #vLLM #AI #MachineLearning #Quantization #NeuralMagic
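    A quick single-GPU usage sketch for the 8B FP8 checkpoint named above in vLLM; the "neuralmagic" org prefix on the repo id is an assumption (the post links through lnkd.in shorteners), as are the sampling settings.

    ```python
    # Run the FP8 Llama 3.1 8B checkpoint on a single GPU with vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic")  # assumed repo id

    params = SamplingParams(temperature=0.6, max_tokens=64)
    for out in llm.generate(["Summarize FP8 quantization in one sentence."], params):
        print(out.outputs[0].text)
    ```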

  • Neural Magic reposted this

    Mark Kurtz, CTO @ Neural Magic:

    📢 FP8 Quantized Llama 3.1 70B Now Available! 📢 Continuing our (Neural Magic) Llama 3.1 compression project, the first versions of the FP8 quantized 70B model are ready, achieving ~100% accuracy recovery. Quantization reduces the model size to 70 GB, enabling deployment on a single H100 or A100 GPU instead of the two required for the FP16 version (a 50% cost reduction!). It also offers ~2X faster inference on the latest NVIDIA hardware.

    Explore the model links:
    - FP8 Dynamic Quantization: https://lnkd.in/eKtbAJBv
    - FP8 Static Quantization: https://lnkd.in/eQDSj_9i

    For more insights, check out the vLLM Llama 3.1 blog:
    - https://lnkd.in/dnZsvKjy

    Previous Llama 3.1 8B FP8 post:
    - https://lnkd.in/eWPXcUVj

    Stay tuned for further updates - I'll be sharing more posts in the days ahead!

    #LLMs #AI #MachineLearning #Quantization #vLLM #OpenSource
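    The back-of-the-envelope math behind "one GPU instead of two" (weights only; KV cache and activation memory add overhead on top):

    ```python
    # 70B parameters at 2 bytes per weight (FP16) vs. 1 byte (FP8).
    params = 70e9
    fp16_gb = params * 2 / 1e9  # 140 GB -> exceeds one 80 GB H100/A100, needs two
    fp8_gb = params * 1 / 1e9   # 70 GB  -> fits on a single 80 GB GPU
    print(f"FP16 weights: {fp16_gb:.0f} GB, FP8 weights: {fp8_gb:.0f} GB")
    ```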

  • Neural Magic

    Join us for bi-weekly vLLM Office Hours this Thursday, July 25! Eldar Kurtić, a model optimization expert, will show us how to quantize LLMs for fast and efficient inference in #vLLM. We'll have ample time for questions and community discussion. See you there!

    Eldar Kurtić:

    In this week's installment of 'vLLM Office Hours,' we'll explore the when, why, and how of quantizing LLMs for efficient inference. We'll have plenty of time for questions and open discussion. Register at: https://lnkd.in/dvSmUrVe

  • Neural Magic

    It's a big day for us - we've joined the Linux Foundation! We are thrilled to collaborate with the greater ecosystem and accelerate #opensource AI innovation.

    LF AI & Data Foundation:

    We are thrilled to announce Neural Magic has joined LF AI & Data as a Premier Member! 🚀

    Neural Magic is at the forefront of enabling enterprise deployment of leading #opensource large language models (#LLMs) across a broad set of infrastructure, securely, whether that’s in the cloud, a private data center, or at the edge.

    ✅ “Our vision at Neural Magic is The Future of AI is Open, so it would seem only natural to be joining the LF AI & Data as a Premier Member. We look forward to contributing within the LF AI & Data community to help make our vision become a reality.” - Brian Stevens, Neural Magic CEO

    🔗 Read the full announcement: https://lnkd.in/e8ASy9VP

    #linuxfoundation #oss #opensource #lfaidata

    Neural Magic Joins LF AI & Data as a Premier Member (https://lfaidata.foundation)

  • Neural Magic

    EXCITING NEWS: Neural Magic and Anyscale contributed FP8 quantization support to vLLM, making LLM inference even more efficient. FP8 reduces latency on NVIDIA GPUs by 2x with >99% accuracy preservation. Thank you to NVIDIA AI for validating our benchmarks!

    🔍 What is FP8? FP8 is a modern quantization format that balances precision and efficiency, with hardware acceleration on newer GPUs. It significantly reduces memory usage, enabling more cost-effective LLM deployments and higher throughput.

    📈 Performance gains: FP8 delivers up to 2x Inter-Token Latency (ITL) improvement for Llama 3 70B, 1.6x ITL improvement for Mixtral 8x7B, and up to 3x throughput improvement on 2 NVIDIA H100 GPUs. Memory savings allow for larger batch sizes, boosting performance across various models. Our blog contains specific accuracy details.

    ✅ Model accuracy: We validated the accuracy preservation of FP8 in vLLM through lm-evaluation-harness comparisons on Open LLM Leaderboard v1 tasks. Most models retain over 99% of their accuracy compared to the unquantized baseline.

    🛠️ Get started: You can now try out FP8 support in vLLM using a quantized FP8 checkpoint. Access Neural Magic's growing list of accuracy-verified FP8 checkpoints of popular LLMs on our Hugging Face Model Hub, ready to use with vLLM: https://lnkd.in/gTimN5dZ

    🗓️ Learn more: See our blog for more detailed FP8 insights, and join our bi-weekly vLLM Office Hours to regularly hear from and give feedback to the vLLM committer community: https://lnkd.in/g2suBKvr

    🙏 Thank you for reading, and please spread the word about FP8 in vLLM by sharing this post.

    • Inter-Token Latency (ITL) benchmarks for Llama 3 70B and Mixtral 8x7B on 2xH100. Note that FP8 MoE support currently requires Triton version 2.3.1 or higher.
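    A minimal sketch of the accuracy-preservation comparison described above, using lm-evaluation-harness's Python API (v0.4-style). The task subset, metric keys, and checkpoint ids are assumptions for illustration, not Neural Magic's actual evaluation script.

    ```python
    import lm_eval

    TASKS = ["arc_challenge", "winogrande", "hellaswag"]  # Open LLM Leaderboard v1 subset

    def scores(model_id: str) -> dict:
        """Evaluate a checkpoint through vLLM and return per-task accuracy."""
        out = lm_eval.simple_evaluate(
            model="vllm",
            model_args=f"pretrained={model_id}",
            tasks=TASKS,
        )
        return {t: out["results"][t]["acc,none"] for t in TASKS}

    baseline = scores("meta-llama/Meta-Llama-3-8B-Instruct")  # unquantized
    fp8 = scores("neuralmagic/Meta-Llama-3-8B-Instruct-FP8")  # quantized (assumed id)

    # "Accuracy preservation" = quantized score as a fraction of the baseline.
    for task in TASKS:
        print(f"{task}: {100 * fp8[task] / baseline[task]:.2f}% preserved")
    ```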
