Insights: vllm-project/vllm
Overview
90 Pull requests merged by 46 people
-
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user
#6954 merged
Aug 1, 2024 -
PP comm optimization: replace send with partial send + allgather
#6695 merged
Aug 1, 2024 -
[Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings
#6758 merged
Aug 1, 2024 -
[Bugfix][TPU] Do not use torch.Generator for TPUs
#6981 merged
Aug 1, 2024 -
[Model] Pipeline parallel support for Qwen2
#6924 merged
Aug 1, 2024 -
[Kernel][RFC] Refactor the punica kernel based on Triton
#5036 merged
Aug 1, 2024 -
Revert "[Frontend] Factor out code for running uvicorn"
#7012 merged
Jul 31, 2024 -
[Misc] Add compressed-tensors to optimized quant list
#7006 merged
Jul 31, 2024 -
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4)
#6996 merged
Jul 31, 2024 -
[Kernel] Enable FP8 Cutlass for Ada Lovelace
#6950 merged
Jul 31, 2024 -
[Bugfix] Support cpu offloading with quant_method.process_weights_after_loading
#6960 merged
Jul 31, 2024 -
[MISC] Introduce pipeline parallelism partition strategies
#6920 merged
Jul 31, 2024 -
[Model] use FusedMoE layer in Jamba
#6935 merged
Jul 31, 2024 -
[Bugfix] Fix feature size calculation for LLaVA-NeXT
#6982 merged
Jul 31, 2024 -
[Bugfix] Clean up MiniCPM-V
#6939 merged
Jul 31, 2024 -
Support W4A8 quantization for vllm
#5218 merged
Jul 31, 2024 -
[Bugfix] fix logit processor exceeding vocab size issue
#6927 merged
Jul 31, 2024 -
[Bugfix][TPU] Set readonly=True for non-root devices
#6980 merged
Jul 31, 2024 -
[CI/Build] Fix mypy errors
#6968 merged
Jul 31, 2024 -
[Bugfix] Fix broadcasting logic for `multi_modal_kwargs`
#6836 merged
Jul 31, 2024 -
[mypy] Enable following imports for some directories
#6681 merged
Jul 31, 2024 -
[Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding
#6964 merged
Jul 31, 2024 -
[CI] [nightly benchmark] Do not re-download sharegpt dataset if exists
#6706 merged
Jul 30, 2024 -
[Nightly benchmarking suite] Remove pkill python from run benchmark suite
#6965 merged
Jul 30, 2024 -
[Build] Temporarily Disable Kernels and LoRA tests
#6961 merged
Jul 30, 2024 -
[core][misc] improve free_finished_seq_groups
#6865 merged
Jul 30, 2024 -
[Kernel] Remove scaled_fp8_quant kernel padding footgun
#6842 merged
Jul 30, 2024 -
[Bugfix] Fix tensorizer memory profiling bug during testing
#6881 merged
Jul 30, 2024 -
[OpenVINO] Updated OpenVINO requirements and build docs
#6948 merged
Jul 30, 2024 -
[Kernel] Squash a few more warnings
#6914 merged
Jul 30, 2024 -
[BugFix] Fix use of per-request seed with pipeline parallel
#6698 merged
Jul 30, 2024 -
[Doc] Super tiny fix doc typo
#6949 merged
Jul 30, 2024 -
[Bugfix] Fix PaliGemma MMP
#6930 merged
Jul 30, 2024 -
[TPU] Fix greedy decoding
#6933 merged
Jul 30, 2024 -
[Kernel] Tuned int8 kernels for Ada Lovelace
#6848 merged
Jul 30, 2024 -
[Kernel] Fix marlin divide-by-zero warnings
#6904 merged
Jul 30, 2024 -
[ci] GHA workflow to remove ready label upon "/notready" comment
#6921 merged
Jul 30, 2024 -
[Kernel] Remove unused variables in awq/gemm_kernels.cu
#6908 merged
Jul 30, 2024 -
[Frontend] New `allowed_token_ids` decoding request parameter
#6753 merged
Jul 29, 2024 -
[Bugfix] Allow vllm to still work if triton is not installed.
#6786 merged
Jul 29, 2024 -
[TPU] Add TPU tensor parallelism to async engine
#6891 merged
Jul 29, 2024 -
[Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel
#6901 merged
Jul 29, 2024 -
[Core] Reduce unnecessary compute when logprobs=None
#6532 merged
Jul 29, 2024 -
[Kernel] Tuned FP8 Kernels for Ada Lovelace
#6677 merged
Jul 29, 2024 -
[Model] Initialize support for InternVL2 series models
#6514 merged
Jul 29, 2024 -
[Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8
#6871 merged
Jul 28, 2024 -
Add Nemotron to PP_SUPPORTED_MODELS
#6863 merged
Jul 27, 2024 -
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel
#6795 merged
Jul 27, 2024 -
[TPU] Reduce compilation time & Upgrade PyTorch XLA version
#6856 merged
Jul 27, 2024 -
[Docs] Add RunLLM chat widget
#6857 merged
Jul 27, 2024 -
[Model] Initial support for BLIP-2
#5920 merged
Jul 27, 2024 -
[CI/Build][Doc] Update CI and Doc for VLM example changes
#6860 merged
Jul 27, 2024 -
[bugfix] make args.stream work
#6831 merged
Jul 27, 2024 -
[Bugfix] Fix VLM example typo
#6859 merged
Jul 27, 2024 -
[Misc][VLM][Doc] Consolidate offline examples for vision language models
#6858 merged
Jul 27, 2024 -
[Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor
#6802 merged
Jul 27, 2024 -
[Doc] Add missing mock import to docs `conf.py`
#6834 merged
Jul 27, 2024 -
[Hardware][TPU] Implement tensor parallelism with Ray
#5871 merged
Jul 27, 2024 -
[Model] H2O Danube3-4b
#6451 merged
Jul 27, 2024 -
[Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba
#6784 merged
Jul 27, 2024 -
[Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron
#6844 merged
Jul 27, 2024 -
[ROCm] Upgrade PyTorch nightly version
#6845 merged
Jul 27, 2024 -
[Bugfix]: Fix Tensorizer test failures
#6835 merged
Jul 27, 2024 -
[Bug Fix] Illegal memory access, FP8 Llama 3.1 405b
#6852 merged
Jul 27, 2024 -
[Frontend] Factor out code for running uvicorn
#6828 merged
Jul 27, 2024 -
[TPU] Support collective communications in XLA devices
#6813 merged
Jul 27, 2024 -
enforce eager mode with bnb quantization temporarily
#6846 merged
Jul 27, 2024 -
Update README.md
#6847 merged
Jul 27, 2024 -
[Doc] Update SkyPilot doc for wrong indents and instructions for update service
#4283 merged
Jul 26, 2024 -
[Doc] Add Nemotron to supported model docs
#6843 merged
Jul 26, 2024 -
[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation
#6125 merged
Jul 26, 2024 -
[Misc][TPU] Support TPU in initialize_ray_cluster
#6812 merged
Jul 26, 2024 -
[Build/CI][ROCm] Minor simplification to Dockerfile.rocm
#6811 merged
Jul 26, 2024 -
[Bugfix][Kernel] Promote another index to int64_t
#6838 merged
Jul 26, 2024 -
[Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron)
#6611 merged
Jul 26, 2024 -
[doc][debugging] add known issues for hangs
#6816 merged
Jul 26, 2024 -
[Core] Use array to speedup padding
#6779 merged
Jul 26, 2024 -
[Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor.
#6770 merged
Jul 26, 2024 -
Fix ReplicatedLinear weight loading
#6793 merged
Jul 26, 2024 -
[ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check
#6810 merged
Jul 26, 2024 -
[ci][distributed] fix flaky tests
#6806 merged
Jul 26, 2024 -
[Core] Fix ray forward_dag error mssg
#6792 merged
Jul 25, 2024 -
[Docs] Publish 5th meetup slides
#6799 merged
Jul 25, 2024 -
[doc][distributed] improve multinode serving doc
#6804 merged
Jul 25, 2024 -
[Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors
#6798 merged
Jul 25, 2024 -
[Doc] Add documentations for nightly benchmarks
#6412 merged
Jul 25, 2024 -
[Bugfix] Add synchronize to prevent possible data race
#6788 merged
Jul 25, 2024 -
[Bugfix] Fix `kv_cache_dtype=fp8` without scales for FP8 checkpoints
#6761 merged
Jul 25, 2024 -
[ Misc ] `fp8-marlin` channelwise via `compressed-tensors`
#6524 merged
Jul 25, 2024 -
[Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V
#6787 merged
Jul 25, 2024
56 Pull requests opened by 48 people
-
[BugFix][Speculative Decoding] Fixes the generation token numbers with sps
#6782 opened
Jul 25, 2024 -
[CI] Reproduce SGLANG benchmark results
#6794 opened
Jul 25, 2024 -
[Core] Get KV from Block, add KV to Block
#6808 opened
Jul 26, 2024 -
Prefetch all
#6817 opened
Jul 26, 2024 -
[CI/Build] upgrade Dockerfile to ubuntu 22.04
#6820 opened
Jul 26, 2024 -
[wip]
#6821 opened
Jul 26, 2024 -
[Model] Teleflm Support
#6822 opened
Jul 26, 2024 -
[Speculative Decoding] EAGLE Implementation with Top-1 proposer
#6830 opened
Jul 26, 2024 -
[CI/Build] bump Dockerfile.neuron image base, use public ECR
#6832 opened
Jul 26, 2024 -
[Core] Pipeline parallel with Ray ADAG
#6837 opened
Jul 26, 2024 -
[ DO NOT MERGE ] grpc openai server prototypes
#6839 opened
Jul 26, 2024 -
[Build] Dockerfile revert to CUDA 12.1
#6840 opened
Jul 26, 2024 -
[Build] Add initial conditional testing spec
#6841 opened
Jul 26, 2024 -
[Bugfix][fast] Fix the get_num_blocks_touched logic
#6849 opened
Jul 26, 2024 -
[Kernel] [Triton] Add Triton implementation of awq_dequantize
#6850 opened
Jul 26, 2024 -
Add required libcuda.so
#6864 opened
Jul 27, 2024 -
[core][scheduler] simplify and improve scheduler
#6867 opened
Jul 27, 2024 -
[Core] generate from input embeds
#6869 opened
Jul 27, 2024 -
[ DO NOT MERGE] pyzmq based openai server prototypes (w/ python pickle)
#6874 opened
Jul 28, 2024 -
Support for guided decoding for offline LLM
#6878 opened
Jul 28, 2024 -
[ Do Not Merge ] pyzmq based openai server prototypes (w/ protobuf)
#6880 opened
Jul 29, 2024 -
merge to main
#6882 opened
Jul 29, 2024 -
[ Frontend ] Multiprocessing for OpenAI Server with `zeromq`
#6883 opened
Jul 29, 2024 -
[CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking
#6892 opened
Jul 29, 2024 -
Removes duplicate outlines processors
#6900 opened
Jul 29, 2024 -
[wip/spmd] Serialization Optimization
#6903 opened
Jul 29, 2024 -
[Bugfix] Support Rank Stabilized LoRA (RSLoRA)
#6909 opened
Jul 29, 2024 -
[Kernel][Misc] Add meta functions for ops to prevent graph breaks
#6917 opened
Jul 29, 2024 -
[SpecDecode] Support FlashInfer in DraftModelRunner
#6926 opened
Jul 30, 2024 -
[Hardware][Intel CPU] Update torch 2.4.0 for CPU backend
#6931 opened
Jul 30, 2024 -
[Frontend]: Add apply_chat_template method and update generate method in LLM class
#6936 opened
Jul 30, 2024 -
[Model] SiglipVisionModel ported from transformers
#6942 opened
Jul 30, 2024 -
[CI/Build] Update torch to 2.4
#6951 opened
Jul 30, 2024 -
[DO NOT MERGE] Asynchronous Output Processing POC [using asyncio]
#6958 opened
Jul 30, 2024 -
[CI/Build][ROCm] Enabling tensorizer tests for ROCm
#6959 opened
Jul 30, 2024 -
[Kernel] add punica dimensions for granite 20b
#6962 opened
Jul 30, 2024 -
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification
#6963 opened
Jul 30, 2024 -
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace
#6971 opened
Jul 31, 2024 -
llama_index serving integration documentation
#6973 opened
Jul 31, 2024 -
[Models] Support Qwen model with PP
#6974 opened
Jul 31, 2024 -
[WIP] Add Fused MoE W8A8 (Int8) Support
#6978 opened
Jul 31, 2024 -
[Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm
#6992 opened
Jul 31, 2024 -
[Model] Further cleanup MiniCPM-V
#6995 opened
Jul 31, 2024 -
Update run-amd-test.sh
#6997 opened
Jul 31, 2024 -
[Doc] Proofreading documentation
#6998 opened
Jul 31, 2024 -
[CI/Build] bump minimum cmake version
#6999 opened
Jul 31, 2024 -
[WIP] [core] Multi Step Scheduling
#7000 opened
Jul 31, 2024 -
[CI/Build] Treat warnings as errors in CUDA [DO NOT MERGE]
#7001 opened
Jul 31, 2024 -
[Bugfix] Lower gemma's unloaded_params exception to warning
#7002 opened
Jul 31, 2024 -
Add Load-time W8A16 quantization for TPU Backend
#7005 opened
Jul 31, 2024 -
[Bugfix] Use correct length in beam search scoring
#7007 opened
Jul 31, 2024 -
[Kernel] Fix input for flashinfer prefill wrapper.
#7008 opened
Jul 31, 2024 -
[Kernel] Add Fused Layernorm + Asymmetric int8 Quant
#7010 opened
Jul 31, 2024 -
[Draft][MISC] Use torch.frombuffer(array(list)) in prepare_input
#7014 opened
Aug 1, 2024 -
Add Classifier free guidance
#7016 opened
Aug 1, 2024
47 Issues closed by 22 people
-
[Bug]: Engine crashes when max_tokens undefined
#6707 closed
Aug 1, 2024 -
[Misc]: call for help to fix tensorizer tests
#6809 closed
Aug 1, 2024 -
[Bug]: Command R+ GPTQ bad output on ROCm
#3980 closed
Jul 31, 2024 -
[Bug]: cuda OOM errors persist across requests.
#6907 closed
Jul 31, 2024 -
[Bug]: `RuntimeError: b_q_weight is not on GPU` CPU Offloading
#6952 closed
Jul 31, 2024 -
[Bug]: FP8 Quantization (static and dynamic) incompatible with `--cpu-offload-gb`
#6765 closed
Jul 31, 2024 -
[Usage]: internVL2 inference is not supported
#6989 closed
Jul 31, 2024 -
[Doc]: Supported Hardware for Quantization Kernels
#6979 closed
Jul 31, 2024 -
[Usage]: deepseek-v2-lite not supported yet?
#6986 closed
Jul 31, 2024 -
[Bug]: AttributeError: 'MiniCPMVConfig' object has no attribute 'version'
#6814 closed
Jul 31, 2024 -
[Bug]: index out of bound for logits_processors cause vllm.engine.async_llm_engine.AsyncEngineDeadError
#6866 closed
Jul 31, 2024 -
[Bug]: MiniCPM-Llama3-V-2_5 error when tensor_parallel_size>1
#6946 closed
Jul 31, 2024 -
[Misc]: Problem about running with openvino
#6898 closed
Jul 30, 2024 -
[Installation]: Unable to build docker image using Dockerfile.openvino
#6769 closed
Jul 30, 2024 -
[Bug]: Seed issue with Pipeline Parallel
#6449 closed
Jul 30, 2024 -
[Bug]: crash when using response_format of type json_object
#6953 closed
Jul 30, 2024 -
[RFC]: OpenVINO vLLM backend
#5377 closed
Jul 30, 2024 -
[Bug]: Paligemma does not work with tensor parallelism
#6910 closed
Jul 30, 2024 -
[Bug]: MiniCPM-V-2 does not appear to have a file named preprocessor_config.json
#6934 closed
Jul 30, 2024 -
[Bug]: Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
#6897 closed
Jul 30, 2024 -
[Model]: Support for InternVL-Chat-V1-5
#4393 closed
Jul 29, 2024 -
[Model]: Support for InternVL2
#6321 closed
Jul 29, 2024 -
[Bug]: Error: Failed to initialize the TMA descriptor 700 for LLaMa 3.1 405B on 8*H100 -- prefill error?
#6870 closed
Jul 28, 2024 -
[Bug]: Different quality responses using GPTQ / marlin kernels on A10 vs A100 GPUs
#5793 closed
Jul 27, 2024 -
[Bug]: Concurrent requests are skipped when enabling --enable-chunked-prefill
#6726 closed
Jul 27, 2024 -
[New Model]: Blip2 Support required
#4739 closed
Jul 27, 2024 -
[Doc]: Sampling page is no longer showing up
#6853 closed
Jul 27, 2024 -
[Bug][ROCm] The embedding layer does not support long inputs
#6807 closed
Jul 27, 2024 -
[Bug]: Possible data race when running Llama 405b fp8
#6767 closed
Jul 27, 2024 -
[Bug]: BitsandBytes quantization is not working as expected
#5569 closed
Jul 27, 2024 -
[New Model]: Support Nemotron-4-340B
#5722 closed
Jul 26, 2024 -
[Bug]: Pipeline parallelism is very slow when inferencing one request
#6826 closed
Jul 26, 2024 -
[Usage]: How to use vLLM on multi-nodes
#6825 closed
Jul 26, 2024 -
[Bug]: Llama3.1-70B-FP8 Prompt insufficient computing power
#6815 closed
Jul 26, 2024 -
[Feature]: return Usage info for streaming request for each chunk in ChatCompletion
#6540 closed
Jul 26, 2024 -
[FEATURE] Implement Dynamic SplitFuse
#1562 closed
Jul 26, 2024 -
[Misc]: setting environment variables in multi-node serving
#6803 closed
Jul 25, 2024 -
AWQ + Marlin Error
#3392 closed
Jul 25, 2024 -
[Bug]: The inference speed of vllm running command-r-plus-gptq is very slow
#5076 closed
Jul 25, 2024 -
[Bug]: The FP8 models and FP8 KV-Cache-Scales loaded together failed on the latest 0.5.3
#6738 closed
Jul 25, 2024 -
[Bug]: tensorizer error: name '_write_stream' is not defined
#6791 closed
Jul 25, 2024 -
[Bug]: Reproducing Llama 3.1 distributed inference from the blog
#6775 closed
Jul 25, 2024 -
[Bug]: batch inference not consistent (even temperature=0)
#6735 closed
Jul 25, 2024 -
[Bug]: Broken accuracy on LLaMa 3.1 70B -- worse than even 8B
#6760 closed
Jul 25, 2024
85 Issues opened by 75 people
-
[Bug]: vllm llama3/3.1-8b response is cut
#7015 opened
Aug 1, 2024 -
[Bug]: Distributed inference with Ray: cuda errors if I import torch
#7013 opened
Jul 31, 2024 -
[Bug] [ROCm]: ROCm fails to stop generating tokens on multiple GPTQ models
#7011 opened
Jul 31, 2024 -
[Bug]: Mistral Nemo Instruct almost never returns JSON, but model on HF does
#7004 opened
Jul 31, 2024 -
[Bug]: VLLM crashes when prefix caching is enabled
#7003 opened
Jul 31, 2024 -
[RFC]: More rigorous compilation warnings
#6994 opened
Jul 31, 2024 -
[Feature]: Is it possible to control whether to use speculative decoding when making a request?
#6993 opened
Jul 31, 2024 -
[Bug]: TypeError: 'NoneType' object is not callable
#6991 opened
Jul 31, 2024 -
[Usage]: Add support for Python 3.12
#6990 opened
Jul 31, 2024 -
[Bug]: Importing default_dump_dir raises an error
#6987 opened
Jul 31, 2024 -
[Bug]: Meet conflicts when using AutoAWQ marlin methods and vLLM
#6985 opened
Jul 31, 2024 -
[Bug]: base_model.model.model.layers.0.mlp.down_proj.lora_magnitude_vector is unsupported LoRA weight
#6983 opened
Jul 31, 2024 -
[Bug]: Wrong image hallucination for InternVL2 model
#6977 opened
Jul 31, 2024 -
[Bug]: RuntimeError: CUDA error: an illegal memory access was encountered
#6976 opened
Jul 31, 2024 -
[Usage]: how to abort request and stop inference?
#6975 opened
Jul 31, 2024 -
[Feature]: Add security scheme to server
#6970 opened
Jul 31, 2024 -
[Bug]: speculative decoding doesn't work with online mode
#6967 opened
Jul 31, 2024 -
[Bug]: Failed to launch api_server with FP8D quantized gemma-2-27b-it on vllm 0.5.3post1
#6957 opened
Jul 30, 2024 -
[Performance]: In tokens processing performance. Bad scaling with tensor-parallel
#6955 opened
Jul 30, 2024 -
[Feature]: Add embeddings api for Llama
#6947 opened
Jul 30, 2024 -
[Performance]: Mode/flag/option to maximize throughput while allowing large latency?
#6945 opened
Jul 30, 2024 -
[Feature]: when will torch 2.4 be supported out of the box?
#6944 opened
Jul 30, 2024 -
[Performance]: Performance degrades severely with long input
#6943 opened
Jul 30, 2024 -
[Feature]: SiglipVisionModel Support
#6941 opened
Jul 30, 2024 -
[Usage]: Streaming response
#6940 opened
Jul 30, 2024 -
[Usage]: Error with Multi Node llama 405B inference
#6938 opened
Jul 30, 2024 -
[Installation]: my env: cuda version is 12.0, python 3.10; which release should I choose?
#6937 opened
Jul 30, 2024 -
[Feature Request]: Support INT4 for MiniCPM-Llama3-V-2_5
#6932 opened
Jul 30, 2024 -
[Feature]: Support rerank models
#6928 opened
Jul 30, 2024 -
[Bug]: Prefix Caching in BlockSpaceManagerV2 Increases Time to First Token (TTFT) and Slows Down System
#6923 opened
Jul 30, 2024 -
[Installation]: What is required for wheels to build?
#6919 opened
Jul 29, 2024 -
[Bug]: error: Segmentation fault(SIGSEGV received at time)
#6918 opened
Jul 29, 2024 -
[Bug]: Unable to build image from `vllm` repo Dockerfile
#6916 opened
Jul 29, 2024 -
[Performance] [Speculative decoding]: Compute prepare inputs of the scoring model on GPU
#6915 opened
Jul 29, 2024 -
[RFC]: Asynchronous Output Processor
#6913 opened
Jul 29, 2024 -
[Feature]: Reduce LoRA latency via speculative decoding
#6912 opened
Jul 29, 2024 -
[Feature]: Combine pipeline parallelism with speculative decoding
#6911 opened
Jul 29, 2024 -
[Bug]: JSON-guided generation failing to close text values
#6905 opened
Jul 29, 2024 -
[Bug]: Mixtral 8-way TP with --enable-lora crashes with CUDA illegal memory access error
#6902 opened
Jul 29, 2024 -
[Feature]: Add Sliding Window support to FlashInfer backend?
#6899 opened
Jul 29, 2024 -
[Bug]: When using vllm in a Ray actor, the error "No CUDA GPUs are available" occurs.
#6896 opened
Jul 29, 2024 -
[Bug]: distributed inference for vl model crashed (so slow that the connection closed)
#6894 opened
Jul 29, 2024 -
[Bug]: Stuck at "generating GPU P2P access cache"
#6893 opened
Jul 29, 2024 -
[Bug]: Vllm api server does not receive supported parameter `truncate_prompt_tokens`
#6890 opened
Jul 29, 2024 -
[Bug]: "apply_gptq_marlin_linear" Error When TP > 1
#6889 opened
Jul 29, 2024 -
[Performance]: tracking ray dag plus spmd performance
#6888 opened
Jul 29, 2024 -
[Bug]: dag teardown error AttributeError: 'Worker' object has no attribute 'core_worker'
#6887 opened
Jul 29, 2024 -
[Bug]: Speculative Decoding + FlashInfer + benchmark_serving.py TransferEncodingError ISSUE
#6885 opened
Jul 29, 2024 -
[Bug]: First input (bf16) and second input (uint8) must have the same dtype!
#6884 opened
Jul 29, 2024 -
[Performance]: use Python array to replace Python list for zero-copy tensor creation
#6879 opened
Jul 28, 2024 -
[Feature]: Support Python 3.12
#6877 opened
Jul 28, 2024 -
[Bug]: vLLM takes forever to load a locally stored 7B model
#6876 opened
Jul 28, 2024 -
[Bug]: Error Running DeepSeek-v2-Lite w/ FP8
#6875 opened
Jul 28, 2024 -
[Usage]: Only one thread is utilized when vllm is used with the llamaindex framework on the CPU.
#6873 opened
Jul 28, 2024 -
[Misc]: Why doesn't a larger block size result in faster performance?
#6868 opened
Jul 27, 2024 -
[Bug]: Can't load BNB model
#6861 opened
Jul 27, 2024 -
[RFC]: Multi-Step Scheduling
#6854 opened
Jul 26, 2024 -
[Bug]: enable_prefix_caching leads to persistent illegal memory access error
#6833 opened
Jul 26, 2024 -
llama 3 8b model with A10 GPU, OOM with VLLM, but holds good on HF transformer pipeline
#6829 opened
Jul 26, 2024 -
[Usage]: How do I deploy a model on two GPUs with different memory?
#6824 opened
Jul 26, 2024 -
[Bug]: Got "[WARNING shm_broadcast.py:404] No available block found in 60 second."
#6818 opened
Jul 26, 2024 -
[New Model]: Adding MiniGPT4_video model
#6805 opened
Jul 25, 2024 -
[RFC]: Performance Roadmap
#6801 opened
Jul 25, 2024 -
[Bug]: Discrepancy in vLLM and LoRA Adapter Scores with Different Package Versions
#6800 opened
Jul 25, 2024 -
[RFC]: Isolate OpenAI Server Into Separate Process
#6797 opened
Jul 25, 2024 -
[Bug]: Engine iteration timed out. This should never happen!
#6790 opened
Jul 25, 2024 -
[Usage]: can I use it with classification model (e.g. GemmaForSequenceClassification) ?
#6789 opened
Jul 25, 2024 -
[Feature]: Evaluate multiple ngram speculations in speculative decoding
#6785 opened
Jul 25, 2024 -
[Bug]: SIGSEGV received at time=1721904360 on cpu 140, Fatal Python error: Segmentation fault
#6783 opened
Jul 25, 2024 -
[Performance]: Slow TTFT(?) for Qwen2-72B-GPTQ-Int4 on H100 *2
#6781 opened
Jul 25, 2024 -
[Bug]: N-gram spec_decode in flash_attention bug
#6780 opened
Jul 25, 2024 -
[Feature]: support Mistral-Large-Instruct-2407 function calling
#6778 opened
Jul 25, 2024 -
[Performance]: Medusa SD have poor performance than baseline
#6777 opened
Jul 25, 2024 -
[Bug]: qwen2-72b-instruct model with RuntimeError: CUDA error: an illegal memory access was encountered
#6776 opened
Jul 25, 2024 -
[Bug]: --max-model-len configuration robustness
#6774 opened
Jul 25, 2024 -
[Usage]: Pipeline Parallelism but with quantized model?
#6773 opened
Jul 25, 2024
140 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support)
#4942 commented on
Jul 31, 2024 • 93 new comments -
[Kernel] Add per-tensor and per-token AZP epilogues
#5941 commented on
Jul 31, 2024 • 42 new comments -
Support Open Models that allow OpenAI API-style tool use & "auto" tool choice
#5649 commented on
Aug 1, 2024 • 20 new comments -
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend
#6143 commented on
Jul 30, 2024 • 17 new comments -
[Hardware][Nvidia][Core][Feature] new feature add: vmm(virtual memory manage) kv cache for nvidia gpu
#6102 commented on
Jul 29, 2024 • 12 new comments -
[Kernel] Add Fused Layernorm + Dynamic-Per-Token Quant Kernels
#6763 commented on
Jul 31, 2024 • 8 new comments -
[Bugfix] Enable chunked-prefill and prefix cache with flash-attn backend
#6144 commented on
Jul 25, 2024 • 4 new comments -
[Frontend] Kill the server on engine death
#6594 commented on
Jul 26, 2024 • 4 new comments -
[Misc] Disambiguate quantized types via a new ScalarType
#6396 commented on
Jul 31, 2024 • 4 new comments -
[Model] Implement DualChunkAttention for Qwen2 Models
#6139 commented on
Jul 29, 2024 • 3 new comments -
[Feature][Hardware][AMD] Enable Scaled FP8 GEMM on ROCm
#6006 commented on
Jul 26, 2024 • 3 new comments -
[Core] Support loading GGUF model
#5191 commented on
Jul 31, 2024 • 1 new comment -
[RFC]: Priority Scheduling
#6077 commented on
Aug 1, 2024 • 0 new comments -
[Bug]: error: triton_flash_attention.py
#5696 commented on
Aug 1, 2024 • 0 new comments -
[Usage]: How to use Multi-instance in Vllm? (Model replication on multiple GPUs)
#6155 commented on
Aug 1, 2024 • 0 new comments -
[Feature]: support Qwen2 embedding
#5600 commented on
Aug 1, 2024 • 0 new comments -
[Feature]: Request for Ascend NPU support
#6368 commented on
Aug 1, 2024 • 0 new comments -
[Bug]: inter-token latency is lower than TPOT in serving benchmark result
#6531 commented on
Aug 1, 2024 • 0 new comments -
Support W8A8 inference in vllm
#1508 commented on
Jul 30, 2024 • 0 new comments -
[WIP] Qwen-style dynamic-NTK ROPE kernel for long sequence support
#1860 commented on
Aug 1, 2024 • 0 new comments -
[Core] Add retention policy code for processing requests
#4513 commented on
Aug 1, 2024 • 0 new comments -
[CI/Build] use setuptools-scm to set __version__
#4738 commented on
Jul 26, 2024 • 0 new comments -
Adding idefics2
#4937 commented on
Jul 29, 2024 • 0 new comments -
[wip] spmd delta optimization
#6771 commented on
Jul 25, 2024 • 0 new comments -
[Bug]: Special tokens split when decoding after 0.4.0.post1
#4577 commented on
Jul 31, 2024 • 0 new comments -
[Model] Meta Llama 3.1 Know Issues & FAQ
#6689 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: failed when run Qwen2-54B-A14B-GPTQ-Int4(MOE)
#6465 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: TRACKING ISSUE: `AsyncEngineDeadError`
#5901 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: NCCL hangs and causes timeout
#5484 commented on
Jul 31, 2024 • 0 new comments -
[Feature]: vllm-flash-attn cu118 compatibility
#5232 commented on
Jul 31, 2024 • 0 new comments -
[Feature]: load/unload API to run multiple LLMs in a single GPU instance
#5491 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: CUDA OOM error when loading another model after exiting the first one.
#6682 commented on
Jul 31, 2024 • 0 new comments -
Multi-node serving with vLLM - Problems with Ray
#2406 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: 0.4.2 error on H20
#5001 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: `flash_attn_cuda.varlen_fwd` may output a bad result when enabling prefix caching
#5678 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: Illegal memory access
#5687 commented on
Jul 31, 2024 • 0 new comments -
[Roadmap] vLLM Roadmap Q3 2024
#5805 commented on
Jul 31, 2024 • 0 new comments -
[Feature]: Allow user defined extra request args to be logged in OpenAI compatible server
#5467 commented on
Jul 31, 2024 • 0 new comments -
Beam Search Length Normalization Wrong
#2606 commented on
Jul 31, 2024 • 0 new comments -
[RFC]: Single Program Multiple Data (SPMD) Worker Control Plane
#6556 commented on
Jul 31, 2024 • 0 new comments -
[Bug]: RuntimeError: GET was unable to find an engine to execute this computation for llava-next model
#6713 commented on
Aug 1, 2024 • 0 new comments -
[Bugfix]: use PretrainedConfig to communicate config objects with trust remote code
#6751 commented on
Jul 25, 2024 • 0 new comments -
(Dont Merge) Add rwkv6
#6749 commented on
Jul 28, 2024 • 0 new comments -
[Model][Jamba] Mamba cache single buffer
#6739 commented on
Jul 31, 2024 • 0 new comments -
Update logits processor with tensor caching
#6715 commented on
Jul 27, 2024 • 0 new comments -
[Draft] [Speculative decoding] Use SPMD worker to reduce control plane communication
#6664 commented on
Jul 30, 2024 • 0 new comments -
[Kernel] Add dynamic asymmetric quantization kernel
#6651 commented on
Jul 31, 2024 • 0 new comments -
[CI/Build] bump ruff version, fix linting issues
#6546 commented on
Jul 31, 2024 • 0 new comments -
[Model] Add Support for GPTQ Fused MOE
#6502 commented on
Jul 31, 2024 • 0 new comments -
[Model] Support Mamba
#6484 commented on
Jul 31, 2024 • 0 new comments -
[ Kernel ] AWQ Fused MoE
#6422 commented on
Jul 30, 2024 • 0 new comments -
torch.compile based model optimizer
#6377 commented on
Jul 29, 2024 • 0 new comments -
[BigFix] Fix the lm_head in gpt_bigcode in lora mode
#6357 commented on
Jul 31, 2024 • 0 new comments -
[Model] Add support for 'gte-Qwen2' embedding models
#6282 commented on
Aug 1, 2024 • 0 new comments -
[core] Sampling controller interface
#6273 commented on
Jul 29, 2024 • 0 new comments -
[Core][Model] Add simple_model_runner and a new model XLMRobertaForSequenceClassification through multimodal interface
#6260 commented on
Jul 31, 2024 • 0 new comments -
[Core] implement disaggregated prefilling via KV cache transfer
#6170 commented on
Jul 31, 2024 • 0 new comments -
[Kernel] Unify the kernel used in flash attention backend
#6052 commented on
Jul 25, 2024 • 0 new comments -
Whisper support
#5964 commented on
Jul 29, 2024 • 0 new comments -
[Frontend] Warn if user `max_model_len` is greater than derived `max_model_len`
#5911 commented on
Jul 31, 2024 • 0 new comments -
feat: controlling max queue time
#5884 commented on
Jul 31, 2024 • 0 new comments -
[Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2`
#5874 commented on
Jul 30, 2024 • 0 new comments -
[Model] Initialize deepseek-vl support
#5817 commented on
Jul 25, 2024 • 0 new comments -
[LoRA] Adds support for bias in LoRA
#5733 commented on
Jul 31, 2024 • 0 new comments -
[Bugfix] support `tie_word_embeddings` for all models
#5724 commented on
Jul 27, 2024 • 0 new comments -
[Model] Add support for Qwen2 for embeddings
#5611 commented on
Aug 1, 2024 • 0 new comments -
[Model] Bert Embedding Model
#5447 commented on
Jul 25, 2024 • 0 new comments -
[Model] Add GLM-4v support
#5358 commented on
Jul 31, 2024 • 0 new comments -
Hete spec decode
#5065 commented on
Jul 30, 2024 • 0 new comments -
[Kernel] Initial commit containing new Triton kernels for multi lora serving.
#5025 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: AttributeError: '_OpNamespace' '_C' object has no attribute 'rotary_embedding' / gemma-2-9b with vllm=0.5.2
#6478 commented on
Jul 25, 2024 • 0 new comments -
[Performance]: GPU utilization is low when running large batches on H100
#6560 commented on
Jul 26, 2024 • 0 new comments -
[RFC]: Enhancing LoRA Management for Production Environments in vLLM
#6275 commented on
Jul 26, 2024 • 0 new comments -
[Feature]: Pipeline parallelism support for qwen model
#6471 commented on
Jul 26, 2024 • 0 new comments -
[Installation]: Problem with docker image (ROCm version)
#6512 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: Deploying the Qwen2-72b-instruct-awq model with vllm-0.5.3.post1: the service works at first, but errors occur under high concurrency
#6734 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: WSL2 (also Docker): 1 GPU works but 2 do not (--tensor-parallel-size 2)
#5161 commented on
Jul 26, 2024 • 0 new comments -
Attention sliding window
#3385 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
#5060 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: vLLM is unable to load Mistral on Inferentia and AWS neuron, likely memory issue.
#6452 commented on
Jul 26, 2024 • 0 new comments -
[Usage]: Does Prefix Caching currently support offloading to the CPU?
#6676 commented on
Jul 26, 2024 • 0 new comments -
v0.5.2, v0.5.3, v0.6.0 Release Tracker
#6434 commented on
Jul 27, 2024 • 0 new comments -
[RFC] Initial Support for Cloud TPUs
#3620 commented on
Jul 27, 2024 • 0 new comments -
[Bug]: Is vllm support function call mode?
#6631 commented on
Jul 27, 2024 • 0 new comments -
[Feature]: vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving
#6687 commented on
Jul 28, 2024 • 0 new comments -
0.4.3 error CUDA error: an illegal memory access was encountered
#5376 commented on
Jul 28, 2024 • 0 new comments -
[Bug]: Phi-3-mini does not work when using Ray
#6607 commented on
Jul 28, 2024 • 0 new comments -
[Bug]: No available block found in 60 second in shm
#6614 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: 8-way tensor parallelism w/ Punica broken on Ubuntu 20.04 (effectively Azure) since v0.5
#6725 commented on
Jul 25, 2024 • 0 new comments -
[Bug]: call for stack trace for "Watchdog caught collective operation timeout"
#6042 commented on
Jul 25, 2024 • 0 new comments -
deploying embedding model in same way as LLM
#6498 commented on
Jul 25, 2024 • 0 new comments -
[Feature]: Support distributing serving with KubeRay's autoscaler
#3522 commented on
Jul 25, 2024 • 0 new comments -
Support JSON mode.
#2483 commented on
Jul 25, 2024 • 0 new comments -
Generate nothing from VLLM output
#1185 commented on
Jul 25, 2024 • 0 new comments -
[Bug]: RuntimeError: CUDA error: an illegal memory access was encountered
#5371 commented on
Jul 25, 2024 • 0 new comments -
[Bug]: OpenAI server unexpected shutdown
#6629 commented on
Jul 25, 2024 • 0 new comments -
[Speculative decoding]: `AttributeError: 'NoneType' object has no attribute 'numel'` when exceeding draft context length
#5342 commented on
Jul 25, 2024 • 0 new comments -
[RFC]: Add control panel support for vLLM
#4873 commented on
Jul 25, 2024 • 0 new comments -
[Bug]: `pt_main_thread` processes are not killed after main process is killed in MP distributed executor backend
#6766 commented on
Jul 25, 2024 • 0 new comments -
[Feature]: FlashAttention 3 support
#6348 commented on
Jul 26, 2024 • 0 new comments -
Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
#2729 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: python3: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed. Aborted (core dumped)
#6723 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: "Triton Error [CUDA]: device kernel image is invalid" when loading Mixtral-8x7B-Instruct-v0.1 in fused_moe.py
#5713 commented on
Jul 26, 2024 • 0 new comments -
[Usage]: How to inference a model with medusa speculative sampling.
#6768 commented on
Jul 26, 2024 • 0 new comments -
[Feature]: chat API assistant prefill
#6772 commented on
Jul 26, 2024 • 0 new comments -
[Bug]: Qwen-14B-Chat-Int4 with guided_json error
#3778 commented on
Jul 29, 2024 • 0 new comments -
[Installation]: import llm meet error
#4163 commented on
Jul 30, 2024 • 0 new comments -
[Usage]: deploy Llama3.1 405B-Instruct-FP8 with H800 * 8 not work
#6750 commented on
Jul 30, 2024 • 0 new comments -
[Usage]: Cannot load model on 2 4090
#3991 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: vllm does not support fp8 kv cache when using flashinfer
#6537 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: async llm engine failed unexpectedly (using mixtral-8x7b with tp=4)
#4135 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: PaliGemma serving
#6644 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: vLLM server crashes when `echo=True` and `max_tokens=0`
#6521 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: Internal Server Error when hosting Salesforce/SFR-Embedding-Mistral
#5906 commented on
Jul 30, 2024 • 0 new comments -
[Feature]: Add OpenAI server `prompt_logprobs` support
#6508 commented on
Jul 30, 2024 • 0 new comments -
[Feature]: Support for Higher than 64 LoRa Ranks
#3934 commented on
Jul 30, 2024 • 0 new comments -
[Feature]: Apply chat template through `LLM` class
#6416 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: vLLM failing on AWS Inferentia (inf2)
#6640 commented on
Jul 30, 2024 • 0 new comments -
Using the VLLM engine framework for inference, why is the first character generated always a space?
#3683 commented on
Jul 30, 2024 • 0 new comments -
[RFC]: Multi-modality Support Refactoring
#4194 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: Cannot find any of ['adapter_name_or_path'] in the model's quantization config
#6727 commented on
Jul 30, 2024 • 0 new comments -
[Bug]: VLLM 0.5.3.post1 [rank0]: RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
#6732 commented on
Jul 30, 2024 • 0 new comments -
[Feature]: Return hidden states (in progress?)
#6165 commented on
Jul 30, 2024 • 0 new comments -
[RFC]: A Graph Optimization System in vLLM using torch.compile
#6378 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: Load LoRA adaptor for Llama3 seems not working
#6250 commented on
Jul 29, 2024 • 0 new comments -
lora load failed
#3374 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: CUDA illegal memory access error when `enable_prefix_caching=True`
#5537 commented on
Jul 29, 2024 • 0 new comments -
Aborted request without reason
#2484 commented on
Jul 29, 2024 • 0 new comments -
No executable after building vllm from source with CPU support
#6259 commented on
Jul 29, 2024 • 0 new comments -
Is there a way to terminate vllm.LLM and release the GPU memory
#1908 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: llama-3.1-70b model shard_memory objects to clean
#6716 commented on
Jul 29, 2024 • 0 new comments -
[Installation]: Running CohereForAI/c4ai-command-r-v01 with main pytorch
#6355 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: topk=1 and temperature=0 cause different output in vllm
#5404 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: VllmWorkerProcess does not exit correctly when TP > 1
#6219 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: AssertionError when load miqu70b after full sft
#3813 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: Gemma2 supports 8192 context with sliding window, but vllm only does 4196 or fails if try 8192
#6220 commented on
Jul 29, 2024 • 0 new comments -
[Usage]: The 8xH100 device failed to run meta-llama/Meta-Llama-3.1-405B-Instruct-FP8.
#6746 commented on
Jul 29, 2024 • 0 new comments -
[Feature]: Support Lora Adapter generated from mistral-finetune
#6573 commented on
Jul 29, 2024 • 0 new comments -
[Bug]: Shape error encountered in speculative decoding when `enable_lora=True`
#4872 commented on
Jul 29, 2024 • 0 new comments -
[Misc]: Throughput/Latency for guided_json with ~100% GPU cache utilization
#3567 commented on
Jul 29, 2024 • 0 new comments