Insights: vllm-project/vllm
Overview
90 Pull requests merged by 46 people
-
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user
#6954 merged
Aug 1, 2024 -
PP comm optimization: replace send with partial send + allgather
#6695 merged
Aug 1, 2024 -
[Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings
#6758 merged
Aug 1, 2024 -
[Bugfix][TPU] Do not use torch.Generator for TPUs
#6981 merged
Aug 1, 2024 -
[Model] Pipeline parallel support for Qwen2
#6924 merged
Aug 1, 2024 -
[Kernel][RFC] Refactor the punica kernel based on Triton
#5036 merged
Aug 1, 2024 -
Revert "[Frontend] Factor out code for running uvicorn"
#7012 merged
Jul 31, 2024 -
[Misc] Add compressed-tensors to optimized quant list
#7006 merged
Jul 31, 2024 -
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4)
#6996 merged
Jul 31, 2024 -
[Kernel] Enable FP8 Cutlass for Ada Lovelace
#6950 merged
Jul 31, 2024 -
[Bugfix] Support cpu offloading with quant_method.process_weights_after_loading
#6960 merged
Jul 31, 2024 -
[MISC] Introduce pipeline parallelism partition strategies
#6920 merged
Jul 31, 2024 -
[Model] use FusedMoE layer in Jamba
#6935 merged
Jul 31, 2024 -
[Bugfix] Fix feature size calculation for LLaVA-NeXT
#6982 merged
Jul 31, 2024 -
[Bugfix] Clean up MiniCPM-V
#6939 merged
Jul 31, 2024 -
Support W4A8 quantization for vllm
#5218 merged
Jul 31, 2024 -
[Bugfix] fix logit processor exceeding vocab size issue
#6927 merged
Jul 31, 2024 -
[Bugfix][TPU] Set readonly=True for non-root devices
#6980 merged
Jul 31, 2024 -
[CI/Build] Fix mypy errors
#6968 merged
Jul 31, 2024 -
[Bugfix] Fix broadcasting logic for `multi_modal_kwargs`
#6836 merged
Jul 31, 2024 -
[mypy] Enable following imports for some directories
#6681 merged
Jul 31, 2024 -
[Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding
#6964 merged
Jul 31, 2024 -
[CI] [nightly benchmark] Do not re-download sharegpt dataset if exists
#6706 merged
Jul 30, 2024 -
[Nightly benchmarking suite] Remove pkill python from run benchmark suite
#6965 merged
Jul 30, 2024 -
[Build] Temporarily Disable Kernels and LoRA tests
#6961 merged
Jul 30, 2024 -
[core][misc] improve free_finished_seq_groups
#6865 merged
Jul 30, 2024 -
[Kernel] Remove scaled_fp8_quant kernel padding footgun
#6842 merged
Jul 30, 2024 -
[Bugfix] Fix tensorizer memory profiling bug during testing
#6881 merged
Jul 30, 2024 -
[OpenVINO] Updated OpenVINO requirements and build docs
#6948 merged
Jul 30, 2024 -
[Kernel] Squash a few more warnings
#6914 merged
Jul 30, 2024 -
[BugFix] Fix use of per-request seed with pipeline parallel
#6698 merged
Jul 30, 2024 -
[Doc] Super tiny fix doc typo
#6949 merged
Jul 30, 2024 -
[Bugfix] Fix PaliGemma MMP
#6930 merged
Jul 30, 2024 -
[TPU] Fix greedy decoding
#6933 merged
Jul 30, 2024 -
[Kernel] Tuned int8 kernels for Ada Lovelace
#6848 merged
Jul 30, 2024 -
[Kernel] Fix marlin divide-by-zero warnings
#6904 merged
Jul 30, 2024 -
[ci] GHA workflow to remove ready label upon "/notready" comment
#6921 merged
Jul 30, 2024 -
[Kernel] Remove unused variables in awq/gemm_kernels.cu
#6908 merged
Jul 30, 2024 -
[Frontend] New `allowed_token_ids` decoding request parameter
#6753 merged
Jul 29, 2024 -
[Bugfix] Allow vllm to still work if triton is not installed.
#6786 merged
Jul 29, 2024 -
[TPU] Add TPU tensor parallelism to async engine
#6891 merged
Jul 29, 2024 -
[Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel
#6901 merged
Jul 29, 2024 -
[Core] Reduce unnecessary compute when logprobs=None
#6532 merged
Jul 29, 2024 -
[Kernel] Tuned FP8 Kernels for Ada Lovelace
#6677 merged
Jul 29, 2024 -
[Model] Initialize support for InternVL2 series models
#6514 merged
Jul 29, 2024 -
[Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8
#6871 merged
Jul 28, 2024 -
Add Nemotron to PP_SUPPORTED_MODELS
#6863 merged
Jul 27, 2024 -
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel
#6795 merged
Jul 27, 2024 -
[TPU] Reduce compilation time & Upgrade PyTorch XLA version
#6856 merged
Jul 27, 2024 -
[Docs] Add RunLLM chat widget
#6857 merged
Jul 27, 2024 -
[Model] Initial support for BLIP-2
#5920 merged
Jul 27, 2024 -
[CI/Build][Doc] Update CI and Doc for VLM example changes
#6860 merged
Jul 27, 2024 -
[bugfix] make args.stream work
#6831 merged
Jul 27, 2024 -
[Bugfix] Fix VLM example typo
#6859 merged
Jul 27, 2024 -
[Misc][VLM][Doc] Consolidate offline examples for vision language models
#6858 merged
Jul 27, 2024 -
[Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor
#6802 merged
Jul 27, 2024 -
[Doc] Add missing mock import to docs `conf.py`
#6834 merged
Jul 27, 2024 -
[Hardware][TPU] Implement tensor parallelism with Ray
#5871 merged
Jul 27, 2024 -
[Model] H2O Danube3-4b
#6451 merged
Jul 27, 2024 -
[Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba
#6784 merged
Jul 27, 2024 -
[Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron
#6844 merged
Jul 27, 2024 -
[ROCm] Upgrade PyTorch nightly version
#6845 merged
Jul 27, 2024 -
[Bugfix]: Fix Tensorizer test failures
#6835 merged
Jul 27, 2024 -
[Bug Fix] Illegal memory access, FP8 Llama 3.1 405b
#6852 merged
Jul 27, 2024 -
[Frontend] Factor out code for running uvicorn
#6828 merged
Jul 27, 2024 -
[TPU] Support collective communications in XLA devices
#6813 merged
Jul 27, 2024 -
enforce eager mode with bnb quantization temporarily
#6846 merged
Jul 27, 2024 -
Update README.md
#6847 merged
Jul 27, 2024 -
[Doc] Update SkyPilot doc for wrong indents and instructions for update service
#4283 merged
Jul 26, 2024 -
[Doc] Add Nemotron to supported model docs
#6843 merged
Jul 26, 2024 -
[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation
#6125 merged
Jul 26, 2024 -
[Misc][TPU] Support TPU in initialize_ray_cluster
#6812 merged
Jul 26, 2024 -
[Build/CI][ROCm] Minor simplification to Dockerfile.rocm
#6811 merged
Jul 26, 2024 -
[Bugfix][Kernel] Promote another index to int64_t
#6838 merged
Jul 26, 2024 -
[Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron)
#6611 merged
Jul 26, 2024 -
[doc][debugging] add known issues for hangs
#6816 merged
Jul 26, 2024 -
[Core] Use array to speedup padding
#6779 merged
Jul 26, 2024 -
[Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor.
#6770 merged
Jul 26, 2024 -
Fix ReplicatedLinear weight loading
#6793 merged
Jul 26, 2024 -
[ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check
#6810 merged
Jul 26, 2024 -
[ci][distributed] fix flaky tests
#6806 merged
Jul 26, 2024 -
[Core] Fix ray forward_dag error mssg
#6792 merged
Jul 25, 2024 -
[Docs] Publish 5th meetup slides
#6799 merged
Jul 25, 2024 -
[doc][distributed] improve multinode serving doc
#6804 merged
Jul 25, 2024 -
[Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors
#6798 merged
Jul 25, 2024 -
[Doc] Add documentations for nightly benchmarks
#6412 merged
Jul 25, 2024 -
[Bugfix] Add synchronize to prevent possible data race
#6788 merged
Jul 25, 2024 -
[Bugfix] Fix `kv_cache_dtype=fp8` without scales for FP8 checkpoints
#6761 merged
Jul 25, 2024 -
[ Misc ] `fp8-marlin` channelwise via `compressed-tensors`
#6524 merged
Jul 25, 2024 -
[Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V
#6787 merged
Jul 25, 2024
56 Pull requests opened by 48 people
-
[BugFix][Speculative Decoding] Fixes the generation token numbers with sps
#6782 opened
Jul 25, 2024 -
[CI] Reproduce SGLANG benchmark results
#6794 opened
Jul 25, 2024 -
[Core] Get KV from Block, add KV to Block
#6808 opened
Jul 26, 2024 -
Prefetch all
#6817 opened
Jul 26, 2024 -
[CI/Build] upgrade Dockerfile to ubuntu 22.04
#6820 opened
Jul 26, 2024 -
[wip]
#6821 opened
Jul 26, 2024 -
[Model] Teleflm Support
#6822 opened
Jul 26, 2024 -
[Speculative Decoding] EAGLE Implementation with Top-1 proposer
#6830 opened
Jul 26, 2024 -
[CI/Build] bump Dockerfile.neuron image base, use public ECR
#6832 opened
Jul 26, 2024 -
[Core] Pipeline parallel with Ray ADAG
#6837 opened
Jul 26, 2024 -
[ DO NOT MERGE ] grpc openai server prototypes
#6839 opened
Jul 26, 2024 -
[Build] Dockerfile revert to CUDA 12.1
#6840 opened
Jul 26, 2024 -
[Build] Add initial conditional testing spec
#6841 opened
Jul 26, 2024 -
[Bugfix][fast] Fix the get_num_blocks_touched logic
#6849 opened
Jul 26, 2024 -
[Kernel] [Triton] Add Triton implementation of awq_dequantize
#6850 opened
Jul 26, 2024 -
Add required libcuda.so
#6864 opened
Jul 27, 2024 -
[core][scheduler] simplify and improve scheduler
#6867 opened
Jul 27, 2024 -
[Core] generate from input embeds
#6869 opened
Jul 27, 2024 -
[ DO NOT MERGE] pyzmq based openai server prototypes (w/ python pickle)
#6874 opened
Jul 28, 2024 -
Support for guided decoding for offline LLM
#6878 opened
Jul 28, 2024 -
[ Do Not Merge ] pyzmq based openai server prototypes (w/ protobuf)
#6880 opened
Jul 29, 2024 -
merge to main
#6882 opened
Jul 29, 2024 -
[ Frontend ] Multiprocessing for OpenAI Server with `zeromq`
#6883 opened
Jul 29, 2024 -
[CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking
#6892 opened
Jul 29, 2024 -
Removes duplicate outlines processors
#6900 opened
Jul 29, 2024 -
[wip/spmd] Serialization Optimization
#6903 opened
Jul 29, 2024 -
[Bugfix] Support Rank Stabilized LoRA (RSLoRA)
#6909 opened
Jul 29, 2024 -
[Kernel][Misc] Add meta functions for ops to prevent graph breaks
#6917 opened
Jul 29, 2024 -
[SpecDecode] Support FlashInfer in DraftModelRunner
#6926 opened
Jul 30, 2024 -
[Hardware][Intel CPU] Update torch 2.4.0 for CPU backend
#6931 opened
Jul 30, 2024 -
[Frontend]: Add apply_chat_template method and update generate method in LLM class
#6936 opened
Jul 30, 2024 -
[Model] SiglipVisionModel ported from transformers
#6942 opened
Jul 30, 2024 -
[CI/Build] Update torch to 2.4
#6951 opened
Jul 30, 2024 -
[DO NOT MERGE] Asynchronous Output Processing POC [using asyncio]
#6958 opened
Jul 30, 2024 -
[CI/Build][ROCm] Enabling tensorizer tests for ROCm
#6959 opened
Jul 30, 2024 -
[Kernel] add punica dimensions for granite 20b
#6962 opened
Jul 30, 2024 -
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification
#6963 opened
Jul 30, 2024 -
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace
#6971 opened
Jul 31, 2024 -
llama_index serving integration documentation
#6973 opened
Jul 31, 2024 -
[Models] Support Qwen model with PP
#6974 opened
Jul 31, 2024 -
[WIP] Add Fused MoE W8A8 (Int8) Support
#6978 opened
Jul 31, 2024 -
[Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm
#6992 opened
Jul 31, 2024 -
[Model] Further cleanup MiniCPM-V
#6995 opened
Jul 31, 2024 -
Update run-amd-test.sh
#6997 opened
Jul 31, 2024 -
[Doc] Proofreading documentation
#6998 opened
Jul 31, 2024 -
[CI/Build] bump minimum cmake version
#6999 opened
Jul 31, 2024 -
[WIP] [core] Multi Step Scheduling
#7000 opened
Jul 31, 2024 -
[CI/Build] Treat warnings as errors in CUDA [DO NOT MERGE]
#7001 opened
Jul 31, 2024 -
[Bugfix] Lower gemma's unloaded_params exception to warning
#7002 opened
Jul 31, 2024 -
Add Load-time W8A16 quantization for TPU Backend
#7005 opened
Jul 31, 2024 -
[Bugfix] Use correct length in beam search scoring
#7007 opened
Jul 31, 2024 -
[Kernel] Fix input for flashinfer prefill wrapper.
#7008 opened
Jul 31, 2024 -
[Kernel] Add Fused Layernorm + Asymmetric int8 Quant
#7010 opened
Jul 31, 2024 -
[Draft][MISC] Use torch.frombuffer(array(list)) in prepare_input
#7014 opened
Aug 1, 2024 -
Add Classifier free guidance
#7016 opened
Aug 1, 2024
47 Issues closed by 22 people
-
[Bug]: Engine crashes when max_tokens undefined
#6707 closed
Aug 1, 2024 -
[Misc]: call for help to fix tensorizer tests
#6809 closed
Aug 1, 2024 -
[Bug]: Command R+ GPTQ bad output on ROCm
#3980 closed
Jul 31, 2024 -
[Bug]: cuda OOM errors persist across requests.
#6907 closed
Jul 31, 2024 -
[Bug]: `RuntimeError: b_q_weight is not on GPU` CPU Offloading
#6952 closed
Jul 31, 2024 -
[Bug]: FP8 Quantization (static and dynamic) incompatible with `--cpu-offload-gb`
#6765 closed
Jul 31, 2024 -
[Usage]: internVL2 inference is not supported
#6989 closed
Jul 31, 2024 -
[Doc]: Supported Hardware for Quantization Kernels
#6979 closed
Jul 31, 2024 -
[Usage]: deepseek-v2-lite not supported yet?
#6986 closed
Jul 31, 2024 -
[Bug]: AttributeError: 'MiniCPMVConfig' object has no attribute 'version'
#6814 closed
Jul 31, 2024 -
[Bug]: index out of bound for logits_processors cause vllm.engine.async_llm_engine.AsyncEngineDeadError
#6866 closed
Jul 31, 2024 -
[Bug]: MiniCPM-Llama3-V-2_5 error when tensor_parallel_size>1
#6946 closed
Jul 31, 2024 -
[Misc]: Problem about running with openvino
#6898 closed
Jul 30, 2024 -
[Installation]: Unable to build docker image using Dockerfile.openvino
#6769 closed
Jul 30, 2024 -
[Bug]: Seed issue with Pipeline Parallel
#6449 closed
Jul 30, 2024 -
[Bug]: crash when using response_format of type json_object
#6953 closed
Jul 30, 2024 -
[RFC]: OpenVINO vLLM backend
#5377 closed
Jul 30, 2024 -
[Bug]: Paligemma does not work with tensor parallelism
#6910 closed
Jul 30, 2024 -
[Bug]: MiniCPM-V-2 does not appear to have a file named preprocessor_config.json
#6934 closed
Jul 30, 2024 -
[Bug]: Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
#6897 closed
Jul 30, 2024 -
[Model]: Support for InternVL-Chat-V1-5
#4393 closed
Jul 29, 2024 -
[Model]: Support for InternVL2
#6321 closed
Jul 29, 2024 -
[Bug]: Error: Failed to initialize the TMA descriptor 700 for LLaMa 3.1 405B on 8*H100 -- prefill error?
#6870 closed
Jul 28, 2024 -
[Bug]: Different quality responses using GPTQ / marlin kernels on A10 vs A100 GPUs
#5793 closed
Jul 27, 2024 -
[Bug]: Concurrent requests are skipped when enabling --enable-chunked-prefill
#6726 closed
Jul 27, 2024 -
[New Model]: Blip2 Support required
#4739 closed
Jul 27, 2024 -
[Doc]: Sampling page is no longer showing up
#6853 closed
Jul 27, 2024 -
[Bug][ROCm] The embedding layer does not support long inputs
#6807 closed
Jul 27, 2024 -
[Bug]: Possible data race when running Llama 405b fp8
#6767 closed
Jul 27, 2024 -
[Bug]: BitsandBytes quantization is not working as expected
#5569 closed
Jul 27, 2024 -
[New Model]: Support Nemotron-4-340B
#5722 closed
Jul 26, 2024 -
[Bug]: Pipeline parallelism is very slow when inferencing one request
#6826 closed
Jul 26, 2024 -
[Usage]: How to use vLLM on multi-nodes
#6825 closed
Jul 26, 2024 -
[Bug]: Llama3.1-70B-FP8 Prompt insufficient computing power
#6815 closed
Jul 26, 2024 -
[Feature]: return Usage info for streaming request for each chunk in ChatCompletion
#6540 closed
Jul 26, 2024 -
[FEATURE] Implement Dynamic SplitFuse
#1562 closed
Jul 26, 2024 -
[Misc]: setting environment variables in multi-node serving
#6803 closed
Jul 25, 2024 -
AWQ + Marlin Error
#3392 closed
Jul 25, 2024 -
[Bug]: The inference speed of vllm running command-r-plus-gptq is very slow
#5076 closed
Jul 25, 2024 -
[Bug]: The FP8 models and FP8 KV-Cache-Scales loaded together failed on the latest 0.5.3
#6738 closed
Jul 25, 2024 -
[Bug]: tensorizer error: name '_write_stream' is not defined
#6791 closed
Jul 25, 2024 -
[Bug]: Reproducing Llama 3.1 distributed inference from the blog
#6775 closed
Jul 25, 2024 -
[Bug]: batch inference not consistent (even temperature=0)
#6735 closed
Jul 25, 2024 -
[Bug]: Broken accuracy on LLaMa 3.1 70B -- worse than even 8B
#6760 closed
Jul 25, 2024
85 Issues opened by 75 people
-
[Bug]: vllm llama3/3.1-8b response is cut
#7015 opened
Aug 1, 2024 -
[Bug]: Distributed inference with Ray: cuda errors if I import torch
#7013 opened
Jul 31, 2024 -
[Bug] [ROCm]: ROCm fails to stop generating tokens on multiple GPTQ models
#7011 opened
Jul 31, 2024 -
[Bug]: Mistral Nemo Instruct almost never returns JSON, but model on HF does
#7004 opened
Jul 31, 2024 -
[Bug]: VLLM crashes when prefix caching is enabled
#7003 opened
Jul 31, 2024 -
[RFC]: More rigorous compilation warnings
#6994 opened
Jul 31, 2024 -
[Feature]: Is it possible to control whether to use speculative decoding when making a request?
#6993 opened
Jul 31, 2024 -
[Bug]: TypeError: 'NoneType' object is not callable
#6991 opened
Jul 31, 2024 -
[Usage]: Add support for Python 3.12
#6990 opened
Jul 31, 2024 -
[Bug]: Importing default_dump_dir raises an error
#6987 opened
Jul 31, 2024 -
[Bug]: Meet conflicts when using AutoAWQ marlin methods and vLLM
#6985 opened
Jul 31, 2024 -
[Bug]: base_model.model.model.layers.0.mlp.down_proj.lora_magnitude_vector is unsupported LoRA weight
#6983 opened
Jul 31, 2024 -
[Bug]: Wrong image hallucination for InternVL2 model
#6977 opened
Jul 31, 2024 -
[Bug]: RuntimeError: CUDA error: an illegal memory access was encountered
#6976 opened
Jul 31, 2024 -
[Usage]: how to abort request and stop inference?
#6975 opened
Jul 31, 2024 -
[Feature]: Add security scheme to server
#6970 opened
Jul 31, 2024 -
[Bug]: speculative decoding doesn't work with online mode
#6967 opened
Jul 31, 2024 -
[Bug]: Failed to launch api_server with FP8D quantized gemma-2-27b-it on vllm 0.5.3post1
#6957 opened
Jul 30, 2024 -
[Performance]: In tokens processing performance. Bad scaling with tensor-parallel
#6955 opened
Jul 30, 2024 -
[Feature]: Add embeddings api for Llama
#6947 opened
Jul 30, 2024 -
[Performance]: Mode/flag/option to maximize throughput while allowing large latency?
#6945 opened
Jul 30, 2024 -
[Feature]: when will torch 2.4 be supported out of the box?
#6944 opened
Jul 30, 2024 -
[Performance]: Performance degrades severely with long input
#6943 opened
Jul 30, 2024 -
[Feature]: SiglipVisionModel Support
#6941 opened
Jul 30, 2024 -
[Usage]: Streaming response
#6940 opened
Jul 30, 2024 -
[Usage]: Error with Multi Node llama 405B inference
#6938 opened
Jul 30, 2024 -
[Installation]: my env: cuda version is 12.0, python 3.10; which release should I choose?
#6937 opened
Jul 30, 2024 -
[Feature Request]: Support INT4 for MiniCPM-Llama3-V-2_5
#6932 opened
Jul 30, 2024 -
[Feature]: Support rerank models
#6928 opened
Jul 30, 2024 -
[Bug]: Prefix Caching in BlockSpaceManagerV2 Increases Time to First Token (TTFT) and Slows Down System
#6923 opened
Jul 30, 2024 -
[Installation]: What is required for wheels to build?
#6919 opened
Jul 29, 2024 -
[Bug]: error: Segmentation fault(SIGSEGV received at time)
#6918 opened
Jul 29, 2024 -
[Bug]: Unable to build image from `vllm` repo Dockerfile
#6916 opened
Jul 29, 2024 -
[Performance] [Speculative decoding]: Compute prepare inputs of the scoring model on GPU
#6915 opened
Jul 29, 2024 -
[RFC]: Asynchronous Output Processor
#6913 opened
Jul 29, 2024 -
[Feature]: Reduce LoRA latency via speculative decoding
#6912 opened
Jul 29, 2024 -
[Feature]: Combine pipeline parallelism with speculative decoding
#6911 opened
Jul 29, 2024 -
[Bug]: JSON-guided generation failing to close text values
#6905 opened
Jul 29, 2024 -
[Bug]: Mixtral 8-way TP with --enable-lora crashes with CUDA illegal memory access error
#6902 opened
Jul 29, 2024 -
[Feature]: Add Sliding Window support to FlashInfer backend?
#6899 opened
Jul 29, 2024 -
[Bug]: When using vllm in a Ray actor, the error "No CUDA GPUs are available" occurs.
#6896 opened
Jul 29, 2024 -
[Bug]: distributed inference for vl model crashed (so slow that the connection closed)
#6894 opened
Jul 29, 2024 -
[Bug]: Stuck at "generating GPU P2P access cache"
#6893 opened
Jul 29, 2024 -
[Bug]: Vllm api server does not receive supported parameter `truncate_prompt_tokens`
#6890 opened
Jul 29, 2024 -
[Bug]: "apply_gptq_marlin_linear" Error When TP > 1
#6889 opened
Jul 29, 2024 -
[Performance]: tracking ray dag plus spmd performance
#6888 opened
Jul 29, 2024 -
[Bug]: dag teardown error AttributeError: 'Worker' object has no attribute 'core_worker'
#6887 opened
Jul 29, 2024 -
[Bug]: Speculative Decoding + FlashInfer + benchmark_serving.py TransferEncodingError ISSUE
#6885 opened
Jul 29, 2024 -
[Bug]: First input (bf16) and second input (uint8) must have the same dtype!
#6884 opened
Jul 29, 2024 -
[Performance]: use Python array to replace Python list for zero-copy tensor creation
#6879 opened
Jul 28, 2024 -
[Feature]: Support Python 3.12
#6877 opened
Jul 28, 2024 -
[Bug]: vLLM takes forever to load a locally stored 7B model
#6876 opened
Jul 28, 2024 -
[Bug]: Error Running DeepSeek-v2-Lite w/ FP8
#6875 opened
Jul 28, 2024 -
[Usage]: Only one thread is utilized when vllm is used with the llamaindex framework on the CPU.
#6873 opened
Jul 28, 2024 -
[Misc]: Why doesn't a larger block size result in faster performance?
#6868 opened
Jul 27, 2024 -
[Bug]: Can't load BNB model
#6861 opened
Jul 27, 2024 -
[RFC]: Multi-Step Scheduling
#6854 opened
Jul 26, 2024 -
[Bug]: enable_prefix_caching leads to persistent illegal memory access error
#6833 opened
Jul 26, 2024 -
llama 3 8b model with A10 GPU, OOM with VLLM, but holds good on HF transformer pipeline
#6829 opened
Jul 26, 2024 -
[Usage]: How do I deploy a model on two GPUs with different memory?
#6824 opened
Jul 26, 2024 -
[Bug]: Got "[WARNING shm_broadcast.py:404] No available block found in 60 second."
#6818 opened
Jul 26, 2024 -
[New Model]: Adding MiniGPT4_video model
#6805 opened
Jul 25, 2024 -
[RFC]: Performance Roadmap
#6801 opened
Jul 25, 2024 -
[Bug]: Discrepancy in vLLM and LoRA Adapter Scores with Different Package Versions
#6800 opened
Jul 25, 2024 -
[RFC]: Isolate OpenAI Server Into Separate Process
#6797 opened
Jul 25, 2024 -
[Bug]: Engine iteration timed out. This should never happen!
#6790 opened
Jul 25, 2024 -
[Usage]: can I use it with classification model (e.g. GemmaForSequenceClassification) ?
#6789 opened
Jul 25, 2024 -
[Feature]: Evaluate multiple ngram speculations in speculative decoding
#6785 opened
Jul 25, 2024 -
[Bug]: SIGSEGV received at time=1721904360 on cpu 140, Fatal Python error: Segmentation fault
#6783 opened
Jul 25, 2024 -
[Performance]: Slow TTFT(?) for Qwen2-72B-GPTQ-Int4 on H100 *2
#6781 opened
Jul 25, 2024 -
[Bug]: N-gram spec_decode in flash_attention bug
#6780 opened
Jul 25, 2024 -
[Feature]: support Mistral-Large-Instruct-2407 function calling
#6778 opened
Jul 25, 2024 -
[Performance]: Medusa SD have poor performance than baseline
#6777 opened
Jul 25, 2024 -
[Bug]: qwen2-72b-instruct model with RuntimeError: CUDA error: an illegal memory access was encountered
#6776 opened
Jul 25, 2024 -
[Bug]: --max-model-len configuration robustness
#6774 opened
Jul 25, 2024 -
[Usage]: Pipeline Parallelism but with quantized model?
#6773 opened
Jul 25, 2024
140 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support)
#4942 commented on
Jul 31, 2024 • 93 new comments -
[Kernel] Add per-tensor and per-token AZP epilogues
#5941 commented on
Jul 31, 2024 • 42 new comments -
Support Open Models that allow OpenAI API-style tool use & "auto" tool choice
#5649 commented on
Aug 1, 2024 • 20 new comments -
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend
#6143 commented on
Jul 30, 2024 • 17 new comments -
[Hardware][Nvidia][Core][Feature] new feature add: vmm(virtual memory manage) kv cache for nvidia gpu
#6102 commented on
Jul 29, 2024 • 12 new comments -
[Kernel] Add Fused Layernorm + Dynamic-Per-Token Quant Kernels
#6763 commented on
Jul 31, 2024 • 8 new comments -
[Bugfix] Enable chunked-prefill and prefix cache with flash-attn backend
#6144 commented on
Jul 25, 2024 • 4 new comments -
[Frontend] Kill the server on engine death
#6594 commented on
Jul 26, 2024 • 4 new comments -
[Misc] Disambiguate quantized types via a new ScalarType
#6396 commented on
Jul 31, 2024 • 4 new comments -
[Model] Implement DualChunkAttention for Qwen2 Models
#6139 commented on
Jul 29, 2024 • 3 new comments -
[Feature][Hardware][AMD] Enable Scaled FP8 GEMM on ROCm
#6006 commented on
Jul 26, 2024 • 3 new comments -
[Core] Support loading GGUF model
#5191 commented on
Jul 31, 2024 • 1 new comment -
[RFC]: Priority Scheduling
#6077 commented on
Aug 1, 2024 • 0 new comments -
[Bug]: error: triton_flash_attention.py
#5696 commented on
Aug 1, 2024 • 0 new comments -
[Usage]: How to use Multi-instance in Vllm? (Model replication on multiple GPUs)
#6155 commented on
Aug 1, 2024 • 0 new comments -
[Feature]: support Qwen2 embedding
#5600 commented on
Aug 1, 2024 • 0 new comments -
[Feature]: Request for Ascend NPU support
#6368 commented on
Aug 1, 2024 • 0 new comments -
[Bug]: inter-token latency is lower than TPOT in serving benchmark result
#6531 commented on
Aug 1, 2024 • 0 new comments -
Support W8A8 inference in vllm
#1508 commented on
Jul 30, 2024 • 0 new comments -
[WIP] Qwen-style dynamic-NTK ROPE kernel for long sequence support
#1860 commented on
Aug 1, 2024 • 0 new comments -
[Core] Add retention policy code for processing requests
#4513 commented on
Aug 1, 2024 • 0 new comments -
[CI/Build] use setuptools-scm to set __version__
#4738 commented on
Jul 26, 2024 • 0 new comments -
Adding idefics2
#4937 commented on
Jul 29, 2024 • 0 new comments -
[wip] spmd delta optimization
#6771 commented on
Jul 25, 2024 • 0 new comments -
[Bug]: Special tokens split when decoding after 0.4.0.post1
#4577 commented on
Jul 31, 2024 • 0 new comments -
[Model] Meta Llama 3.1 Know Issues & FAQ
#6689 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: failed when run Qwen2-54B-A14B-GPTQ-Int4(MOE)
#6465 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: TRACKING ISSUE: `AsyncEngineDeadError`
#5901 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: NCCL hangs and causes timeout
#5484 commented on
Jul 31, 2024 • 0 new comments -
[Feature]: vllm-flash-attn cu118 compatibility
#5232 commented on
Jul 31, 2024 • 0 new comments -
[Feature]: load/unload API to run multiple LLMs in a single GPU instance
#5491 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: CUDA OOM error when loading another model after exiting the first one.
#6682 commented on
Jul 31, 2024 • 0 new comments -
Multi-node serving with vLLM - Problems with Ray
#2406 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: 0.4.2 error on H20
#5001 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: `flash_attn_cuda.varlen_fwd` may output a bad result when enabling prefix caching
#5678 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: Illegal memory access
#5687 commented on
Jul 31, 2024 • 0 new comments -
[Roadmap] vLLM Roadmap Q3 2024
#5805 commented on
Jul 31, 2024 • 0 new comments -
[Feature]: Allow user defined extra request args to be logged in OpenAI compatible server
#5467 commented on
Jul 31, 2024 • 0 new comments -
Beam Search Length Normalization Wrong
#2606 commented on
Jul 31, 2024 • 0 new comments -
[RFC]: Single Program Multiple Data (SPMD) Worker Control Plane
#6556 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: RuntimeError: GET was unable to find an engine to execute this computation for llava-next model
#6713 commented on
Aug 1, 2024 • 0 new comments -
[Bugfix]: use PretrainedConfig to communicate config objects with trust remote code
#6751 commented on
Jul 25, 2024 • 0 new comments -
(Dont Merge) Add rwkv6
#6749 commented on
Jul 28, 2024 • 0 new comments -
[Model][Jamba] Mamba cache single buffer
#6739 commented on
Jul 31, 2024 • 0 new comments -
Update logits processor with tensor caching
#6715 commented on
Jul 27, 2024 • 0 new comments -
[Draft] [Speculative decoding] Use SPMD worker to reduce control plane communication
#6664 commented on
Jul 30, 2024 • 0 new comments -
[Kernel] Add dynamic asymmetric quantization kernel
#6651 commented on
Jul 31, 2024 • 0 new comments -
[CI/Build] bump ruff version, fix linting issues
#6546 commented on
Jul 31, 2024 • 0 new comments -
[Model] Add Support for GPTQ Fused MOE
#6502 commented on
Jul 31, 2024 • 0 new comments -
[Model] Support Mamba
#6484 commented on
Jul 31, 2024 • 0 new comments -
[ Kernel ] AWQ Fused MoE
#6422 commented on
Jul 30, 2024 • 0 new comments -
torch.compile based model optimizer
#6377 commented on
Jul 29, 2024 • 0 new comments -
[BigFix] Fix the lm_head in gpt_bigcode in lora mode
#6357 commented on
Jul 31, 2024 • 0 new comments -
[Model] Add support for 'gte-Qwen2' embedding models
#6282 commented on
Aug 1, 2024 • 0 new comments -
[core] Sampling controller interface
#6273 commented on
Jul 29, 2024 • 0 new comments -
[Core][Model] Add simple_model_runner and a new model XLMRobertaForSequenceClassification through multimodal interface
#6260 commented on
Jul 31, 2024 • 0 new comments -
[Core] implement disaggregated prefilling via KV cache transfer
#6170 commented on
Jul 31, 2024 • 0 new comments -
[Kernel] Unify the kernel used in flash attention backend
#6052 commented on
Jul 25, 2024 • 0 new comments -
Whisper support
#5964 commented on
Jul 29, 2024 • 0 new comments -
[Frontend] Warn if user `max_model_len` is greater than derived `max_model_len`
#5911 commented on
Jul 31, 2024 • 0 new comments -
feat: controlling max queue time
#5884 commented on
Jul 31, 2024 • 0 new comments -
[Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2`
#5874 commented on
Jul 30, 2024 • 0 new comments -
[Model] Initialize deepseek-vl support
#5817 commented on
Jul 25, 2024 • 0 new comments -
[LoRA] Adds support for bias in LoRA
#5733 commented on
Jul 31, 2024 • 0 new comments -
[Bugfix] support `tie_word_embeddings` for all models
#5724 commented on
Jul 27, 2024 • 0 new comments -
[Model] Add support for Qwen2 for embeddings
#5611 commented on
Aug 1, 2024 • 0 new comments -
[Model] Bert Embedding Model
#5447 commented on
Jul 25, 2024 • 0 new comments -
[Model] Add GLM-4v support
#5358 commented on
Jul 31, 2024 • 0 new comments -
Hete spec decode
#5065 commented on
Jul 30, 2024 • 0 new comments -
[Kernel] Initial commit containing new Triton kernels for multi lora serving.
#5025 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: AttributeError: '_OpNamespace' '_C' object has no attribute 'rotary_embedding' / gemma-2-9b with vllm=0.5.2
#6478 commented on
Jul 25, 2024 • 0 new comments -
[Performance]: GPU utilization is low when running large batches on H100
#6560 commented on
Jul 26, 2024 • 0 new comments -
[RFC]: Enhancing LoRA Management for Production Environments in vLLM
#6275 commented on
Jul 26, 2024 • 0 new comments -
[Feature]: Pipeline parallelism support for qwen model
#6471 commented on
Jul 26, 2024 • 0 new comments -
[Installation]: Problem with docker image (ROCm version)
#6512 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: Deploying the Qwen2-72b-instruct-awq model with vllm-0.5.3.post1: the service works at first, but errors occur under high concurrency
#6734 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: WSL2 (also Docker): 1 GPU works but 2 do not (--tensor-parallel-size 2)
#5161 commented on
Jul 26, 2024 • 0 new comments -
Attention sliding window
#3385 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
#5060 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: vLLM is unable to load Mistral on Inferentia and AWS neuron, likely memory issue.
#6452 commented on
Jul 26, 2024 • 0 new comments -
[Usage]: Does Prefix Caching currently support offloading to the CPU?
#6676 commented on
Jul 26, 2024 • 0 new comments -
v0.5.2, v0.5.3, v0.6.0 Release Tracker
#6434 commented on
Jul 27, 2024 • 0 new comments -
[RFC] Initial Support for Cloud TPUs
#3620 commented on
Jul 27, 2024 • 0 new comments -
[Bug]: Is vllm support function call mode?
#6631 commented on
Jul 27, 2024 • 0 new comments -
[Feature]: vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving
#6687 commented on
Jul 28, 2024 • 0 new comments -
0.4.3 error CUDA error: an illegal memory access was encountered
#5376 commented on
Jul 28, 2024 • 0 new comments -
[Bug]: Phi-3-mini does not work when using Ray
#6607 commented on
Jul 28, 2024 • 0 new comments -
[Bug]: No available block found in 60 second in shm
#6614 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: 8-way tensor parallelism w/ Punica broken on Ubuntu 20.04 (effectively Azure) since v0.5
#6725 commented on
Jul 25, 2024 • 0 new comments -
[Bug]: call for stack trace for "Watchdog caught collective operation timeout"
#6042 commented on
Jul 25, 2024 • 0 new comments -
deploying embedding model in same way as LLM
#6498 commented on
Jul 25, 2024 • 0 new comments -
[Feature]: Support distributing serving with KubeRay's autoscaler
#3522 commented on
Jul 25, 2024 • 0 new comments -
Support JSON mode.
#2483 commented on
Jul 25, 2024 • 0 new comments -
Generate nothing from VLLM output
#1185 commented on
Jul 25, 2024 • 0 new comments -
[Bug]: RuntimeError: CUDA error: an illegal memory access was encountered
#5371 commented on
Jul 25, 2024 • 0 new comments -
[Bug]: OpenAI server unexpected shutdown
#6629 commented on
Jul 25, 2024 • 0 new comments -
[Speculative decoding]: `AttributeError: 'NoneType' object has no attribute 'numel'` when exceeding draft context length
#5342 commented on
Jul 25, 2024 • 0 new comments -
[RFC]: Add control panel support for vLLM
#4873 commented on
Jul 25, 2024 • 0 new comments -
[Bug]: `pt_main_thread` processes are not killed after main process is killed in MP distributed executor backend
#6766 commented on
Jul 25, 2024 • 0 new comments -
[Feature]: FlashAttention 3 support
#6348 commented on
Jul 26, 2024 • 0 new comments -
Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
#2729 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: python3: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed. Aborted (core dumped)
#6723 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: "Triton Error [CUDA]: device kernel image is invalid" when loading Mixtral-8x7B-Instruct-v0.1 in fused_moe.py
#5713 commented on
Jul 26, 2024 • 0 new comments -
[Usage]: How to inference a model with medusa speculative sampling.
#6768 commented on
Jul 26, 2024 • 0 new comments -
[Feature]: chat API assistant prefill
#6772 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: Qwen-14B-Chat-Int4 with guided_json error
#3778 commented on
Jul 29, 2024 • 0 new comments -
[Installation]: import llm meet error
#4163 commented on
Jul 30, 2024 • 0 new comments -
[Usage]: deploy Llama3.1 405B-Instruct-FP8 with H800 * 8 not work
#6750 commented on
Jul 30, 2024 • 0 new comments -
[Usage]: Cannot load model on 2 4090
#3991 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: vllm does not support fp8 kv cache when using flashinfer
#6537 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: async llm engine failed unexpectedly (using mixtral-8x7b with tp=4)
#4135 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: PaliGemma serving
#6644 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: vLLM server crashes when `echo=True` and `max_tokens=0`
#6521 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: Internal Server Error when hosting Salesforce/SFR-Embedding-Mistral
#5906 commented on
Jul 30, 2024 • 0 new comments -
[Feature]: Add OpenAI server `prompt_logprobs` support
#6508 commented on
Jul 30, 2024 • 0 new comments -
[Feature]: Support for Higher than 64 LoRa Ranks
#3934 commented on
Jul 30, 2024 • 0 new comments -
[Feature]: Apply chat template through `LLM` class
#6416 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: vLLM failing on AWS Inferentia (inf2)
#6640 commented on
Jul 30, 2024 • 0 new comments -
Using the VLLM engine framework for inference, why is the first character generated always a space?
#3683 commented on
Jul 30, 2024 • 0 new comments -
[RFC]: Multi-modality Support Refactoring
#4194 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: Cannot find any of ['adapter_name_or_path'] in the model's quantization config
#6727 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: VLLM 0.5.3.post1 [rank0]: RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
#6732 commented on
Jul 30, 2024 • 0 new comments -
[Feature]: Return hidden states (in progress?)
#6165 commented on
Jul 30, 2024 • 0 new comments -
[RFC]: A Graph Optimization System in vLLM using torch.compile
#6378 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: Load LoRA adaptor for Llama3 seems not working
#6250 commented on
Jul 29, 2024 • 0 new comments -
lora load failed
#3374 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: CUDA illegal memory access error when `enable_prefix_caching=True`
#5537 commented on
Jul 29, 2024 • 0 new comments -
Aborted request without reason
#2484 commented on
Jul 29, 2024 • 0 new comments -
No executable after building vllm from source with CPU support
#6259 commented on
Jul 29, 2024 • 0 new comments -
Is there a way to terminate vllm.LLM and release the GPU memory
#1908 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: llama-3.1-70b model shard_memory objects to clean
#6716 commented on
Jul 29, 2024 • 0 new comments -
[Installation]: Running CohereForAI/c4ai-command-r-v01 with main pytorch
#6355 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: topk=1 and temperature=0 cause different output in vllm
#5404 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: VllmWorkerProcess does not exit correctly when TP > 1
#6219 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: AssertionError when load miqu70b after full sft
#3813 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: Gemma2 supports 8192 context with sliding window, but vllm only does 4196 or fails if try 8192
#6220 commented on
Jul 29, 2024 • 0 new comments -
[Usage]: The 8xH100 device failed to run meta-llama/Meta-Llama-3.1-405B-Instruct-FP8.
#6746 commented on
Jul 29, 2024 • 0 new comments -
[Feature]: Support Lora Adapter generated from mistral-finetune
#6573 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: Shape error encountered in speculative decoding when `enable_lora=True`
#4872 commented on
Jul 29, 2024 • 0 new comments -
[Misc]: Throughput/Latency for guided_json with ~100% GPU cache utilization
#3567 commented on
Jul 29, 2024 • 0 new comments