Neural Magic

Software Development

Somerville, Massachusetts 15,461 followers

High-performance inference serving solutions for you to deploy leading open-source LLMs. #SoftwareDeliveredAI

About us

Together with our community, we engineer sparse LLM, CV, and NLP models that are more efficient and performant in production. Why does this matter? Sparse models are more flexible and can achieve unrivaled latency and throughput performance on your private CPU and GPU infrastructure. Check us out on GitHub and join the Neural Magic Slack Community to get started with software-delivered AI.

Website
http://neuralmagic.com/
Industry
Software Development
Company size
51-200 employees
Headquarters
Somerville, Massachusetts
Type
Privately Held
Founded
2018
Specialties
machine learning, deep learning, and artificial intelligence

Locations

  • Primary

    55 Davis Sq

    Floor 3

    Somerville, Massachusetts 02144, US

Updates

  • Neural Magic

    Are you looking to optimize your #LLM inference for more performance and lower costs? Tune in to hear Eldar Kurtić, our Sr. ML Researcher, break down how quantization can optimize LLM inference and reduce memory footprint without compromising model accuracy.

    Eldar Kurtić

    Machine Learning

    The second episode of the "Efficient Inference through Sparsity and Quantization" podcast series is out now. In the first episode, I talked about how sparsity can enhance the performance and efficiency of machine learning models, leading to significant cost reductions on both CPUs and GPUs. In this newly released episode, we dive deep into quantization techniques. Discover how quantization can further optimize model inference and reduce memory footprint without compromising accuracy. Listen to the second episode here: https://lnkd.in/duK8ijTC

    57. Eldar Kurtic - Efficient Inference through sparsity and quantization - Part 2/2

    https://spotify.com
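
    For a concrete picture of the technique discussed in the episode above, here is a toy sketch of symmetric per-tensor INT8 weight quantization in NumPy. It illustrates the memory-footprint idea only and is not Neural Magic's implementation; production methods (e.g. GPTQ) are typically group-wise and calibration-aware.

    ```python
    # Toy sketch: symmetric per-tensor INT8 quantization of one weight matrix.
    import numpy as np

    w = np.random.randn(4096, 4096).astype(np.float32)  # stand-in FP32 weights

    scale = np.abs(w).max() / 127.0                      # map max |w| onto the int8 range
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    w_dequant = w_int8.astype(np.float32) * scale        # reconstructed at compute time

    print(f"fp32: {w.nbytes / 1e6:.0f} MB -> int8: {w_int8.nbytes / 1e6:.0f} MB")  # 4x smaller
    print(f"max abs rounding error: {np.abs(w - w_dequant).max():.4f}")
    ```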

  • Neural Magic

    The ecosystem of open-source LLMs has exploded over the past year; a new model tops the leaderboard almost every week. Enterprises can now deploy state-of-the-art, open-source LLMs like Llama 3 securely on their infrastructure of choice, fine-tuned with their data for domain-specific use cases, at a significantly lower cost than proprietary APIs.

    vLLM has emerged as the most popular inference server for deploying open-source LLMs, with leading performance, ease of use, broad model support, and heterogeneous hardware backends. Neural Magic is a leading contributor to the vLLM project and offers nm-vllm, an enterprise-ready vLLM distribution. nm-vllm includes stable builds of vLLM with long-term support, tools for optimizing LLMs for inference with techniques like quantization and sparsity, reference architectures for scalable deployments with Kubernetes, integration with telemetry and key monitoring systems, and more.

    Join us on July 11, 2024, at 2:00 PM EDT (11:00 AM PDT) to learn why vLLM is the leading open-source inference server and how Neural Magic works with enterprises to build and scale vLLM-based model services.
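
    For readers evaluating vLLM, a basic offline deployment is only a few lines. The sketch below assumes `pip install vllm` and a CUDA GPU, and uses Llama 3 8B Instruct as an illustrative model ID; the API shown is vLLM's as of mid-2024 and may shift between releases.

    ```python
    # Minimal vLLM offline-inference sketch (illustrative model ID and settings).
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # gated model; accept license first
    params = SamplingParams(temperature=0.7, max_tokens=128)

    for out in llm.generate(["Why deploy open-source LLMs on your own infrastructure?"], params):
        print(out.outputs[0].text)
    ```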


  • Neural Magic reposted this

    Mark Kurtz

    CTO @ Neural Magic

    🚨 New blog posted! We've published a comprehensive blog at Neural Magic on deploying Llama 3 8B with vLLM, showcasing an inexpensive, end-to-end open-source solution for cost-effective, high-performance LLM serving.

    🔍 Key Takeaways:
    - Superior Accuracy: Llama 3 8B outperforms larger models on real-world use cases, averaging 28% better than Llama 2 70B.
    - Cost Efficiency: Achieve savings of up to 16X by running the more accurate, smaller model on a single A10 GPU, with faster performance than the dual-A100 baseline for larger models.
    - Seamless Deployment: Integrate Llama 3 8B with vLLM effortlessly for rapid AI enhancements to your applications.

    To dive in further, the link is in the comments (and a minimal client sketch follows the image captions below)! #LLMs #vLLM #AI #MachineLearning #Innovation #OpenSource

    • Llama 3 8B compared with Llama 2 models across various use case evaluations, including Chat, Code Generation, Summarization, and Retrieval Augmented Generation.*
      *CodeLlama models were used instead of Llama 2 due to the Llama 2 models' poor baseline performance on code generation tasks.
    • Llama 3 8B compared to Llama 2 70B for deploying customer support use cases at various deployment sizes.
    • Llama 3 8B compared with Llama 2 70B for deploying summarization use cases at various deployment sizes.
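
    As a companion to the post above, here is a minimal sketch of querying a vLLM deployment through its OpenAI-compatible endpoint. The server address, API key, and model ID are placeholder assumptions; the server is assumed to have been launched with vLLM's OpenAI-compatible entrypoint.

    ```python
    # Sketch: query a running vLLM OpenAI-compatible server, e.g. started with
    #   python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
    # Assumes `pip install openai` (v1+ client).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1",  # placeholder address
                    api_key="EMPTY")                      # vLLM ignores the key by default

    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": "Why can a smaller LLM be cheaper to serve?"}],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)
    ```
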
  • Neural Magic

    Neural Magic's CEO, Brian Stevens, recently spent some time with host Heather Haskin from The Catalyst by Softchoice podcast to talk about the intersection of AI, open source, and the future of responsible development. Listen in on "The case for open-source AI" to learn more about the vital role of open-source models and why the democratization of AI is important for the success of today's enterprise. https://lnkd.in/eumGUGBH

    The Catalyst by Softchoice

    link.chtbl.com

  • Neural Magic

    Optimizing your AI models with techniques like sparsity and quantization increases production performance while decreasing your total infrastructure spend. Eldar Kurtić, our expert in AI model optimization, shares more details in this podcast. Check it out 👇

    Eldar Kurtić

    Machine Learning

    I was recently invited to share my insights on "Efficient Inference through Sparsity and Quantization" in a two-part podcast series. In the first episode, we dive into how sparsity can improve the performance and efficiency of machine learning models, reducing deployment costs on both CPUs and GPUs. The next episode, which will focus on quantization, is coming soon. Listen to the first episode here: https://lnkd.in/dnaCzzsm

    56. Eldar Kurtic - Efficient Inference through sparsity and quantization - Part 1/2

    https://spotify.com

  • Neural Magic

    Are you using, or considering, vLLM for your LLM inference serving? Join us this Wednesday to ask all your questions and learn more about accelerated #LLM inference with vLLM and Neural Magic. vLLM project maintainer Simon Mo and vLLM project committer Michael Goin will spend one hour with the community answering questions and sharing recent vLLM project updates. Register and ask your questions here: https://lnkd.in/euF8m73q If you can't make it this Wednesday, you can register for the June 20th session.

  • Neural Magic

    We were at Nutanix #NEXTConf in Barcelona last week and it was such an exciting event! From the main stage announcements to the enthusiastic conversations with attendees at the AI Pavilion, congrats to Nutanix for creating a dynamic environment where innovation thrived. Thank you to the Nutanix team, including Gregory Lehrer, Gali Ross-Hasson, Tarkan Maner, Luke Congdon, Wolfgang Huse, and others for partnering with Neural Magic. We appreciate the opportunity to be a part of the Nutanix #AI strategy. 🙏

  • Neural Magic

    Only a few hours after its release, our team optimized Mistral-7B-Instruct-v0.3 for 3x faster deployment with #vLLM. You can now fit the entire model and the full 32k context length inside a single A10 GPU. Check it out and start deploying efficiently today! #teamwork #opensource
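
    As a rough sketch of what serving such a compressed checkpoint can look like, the snippet below uses a placeholder checkpoint name (not the exact released model ID); vLLM auto-detects supported quantization formats from the model's config.

    ```python
    # Sketch: serve a quantized Mistral-7B-Instruct-v0.3 on a single 24 GB A10 with vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="neuralmagic/Mistral-7B-Instruct-v0.3-quantized",  # placeholder checkpoint ID
        max_model_len=32768,          # the full 32k context window
        gpu_memory_utilization=0.90,  # leave a little headroom on 24 GB
    )

    out = llm.generate(["Summarize the KV cache in one sentence."], SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)
    ```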

  • Neural Magic

    It was awesome meeting the vLLM community and chatting with everyone during our recent office hours with Michael Goin and Simon Mo. Many attendees asked that we make this a regular, bi-weekly event, and we are excited to deliver! We've penciled in two more dates to answer all your vLLM and Neural Magic questions: June 5th and June 20th. Get the details and the Zoom link here: https://lnkd.in/eAw98NHF

    vLLM Office Hours

    https://neuralmagic.com

  • Neural Magic reposted this

    Mark Kurtz

    CTO @ Neural Magic

    New research posted! At Neural Magic, we've collaborated with Cerebras Systems to release the first highly sparse, foundational large language models (LLMs). We removed up to 70% of the connections (nearly 5 billion weights, about 10 gigabytes!) without affecting accuracy on popular tasks such as chat, code generation, and summarization. The result is cheaper, faster, and more energy-efficient models, with nearly 9X savings for LLM deployments.

    This is just the beginning of our push towards establishing efficient LLMs as the default pathway. It will save immense amounts of energy and money as enterprises and the open-source community continue to fine-tune and deploy these models for their revolutionary applications. So, watch for more efficient versions of the latest architectures as we roll out better results over the next few weeks and months.

    Feel free to ask any questions about this latest research, which is linked below: everything from high-level questions about its practicality to deep dives into why and how pruning and distillation work! (A small sparsity-measurement sketch follows the image captions below.)

    Paper: https://lnkd.in/entBB_AK
    Models: https://lnkd.in/eDPAtf8p

    • Llama 2 7B sparsity vs. baseline accuracy recovery for a chat fine-tuning task
    • Llama 2 7B sparsity vs. baseline accuracy recovery for a code generation fine-tuning task
    • Llama 2 7B sparsity vs. baseline accuracy recovery for an instruction-following fine-tuning task
    • Llama 2 7B prefill performance for various sparsity levels across FP32 baseline and INT8 quantized precisions on an 8-core CPU
    • Llama 2 7B decode performance for various sparsity levels across FP32 baseline and INT8 quantized precisions on an 8-core CPU
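
    To make the sparsity numbers above concrete, here is a minimal sketch of measuring the fraction of zeroed weights in a checkpoint with PyTorch and Transformers. The model ID and the choice to count only 2-D weight matrices are illustrative assumptions.

    ```python
    # Sketch: measure weight sparsity of a transformer checkpoint.
    # Assumes `pip install torch transformers`; swap in a sparse checkpoint ID to
    # see a non-trivial number (a 70%-sparse model should report roughly 70%).
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                                 torch_dtype=torch.float16)

    total = zeros = 0
    for _, param in model.named_parameters():
        if param.dim() == 2:                 # weight matrices only, not biases/norms
            total += param.numel()
            zeros += (param == 0).sum().item()

    print(f"weight sparsity: {zeros / total:.1%}")
    ```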
