The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
Updated Jul 30, 2024 · Python
Official release of the InternLM2.5 7B base and chat models, with 1M-token context support.
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
🎉 CUDA/C++ notes / hand-written CUDA kernels for LLMs / technical blog, updated occasionally: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
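As a reference point for one of the kernels listed above, RMSNorm fits in a few lines of NumPy. This is a minimal sketch (the function name and signature are illustrative, not taken from any of these repos):

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # RMSNorm: divide x by the root-mean-square of its last axis, then apply
    # a learned per-feature gain. Unlike LayerNorm, no mean is subtracted.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```

The CUDA versions of such kernels typically compute the mean-of-squares with a warp/block reduction, which is exactly what the "warp reduce" and "block reduce" notes above cover.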
FlashInfer: Kernel Library for LLM Serving
Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed in pipeline-parallel mode. Faster than ZeRO/ZeRO++/FSDP.
Utilities for efficient fine-tuning, inference, and evaluation of code generation models.
Python package for rematerialization-aware gradient checkpointing
Triton implementation of FlashAttention2 that adds support for custom masks.
Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.
A simple PyTorch implementation of Flash MultiHead Attention.
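The core trick shared by the FlashAttention implementations listed above is an online softmax: keys/values are processed in blocks while a running row-max and normalizer are maintained, so the full N×N score matrix is never materialized. A minimal NumPy sketch of the idea (single head, no masking; all names here are illustrative, not from any of these repos):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference: softmax(Q K^T / sqrt(d)) V, materializing the full score matrix.
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def flash_attention(Q, K, V, block=4):
    # Tiled attention with an online softmax: iterate over K/V blocks,
    # keeping a running row-wise max m and normalizer l, and rescaling the
    # accumulated output O whenever the max increases.
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row-wise max of scores
    l = np.zeros(N)           # running softmax normalizer
    scale = 1.0 / np.sqrt(d)
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                   # scores for this K/V block
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])         # unnormalized block probabilities
        alpha = np.exp(m - m_new)              # rescale factor for old statistics
        l = l * alpha + P.sum(axis=-1)
        O = O * alpha[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]
```

The tiled version matches the naive one exactly (up to floating-point error); the real kernels get their speedup by keeping each block in on-chip SRAM rather than reading and writing the score matrix to HBM.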
Training GPT-2 on FineWeb-Edu in JAX/Flax
Poplar implementation of FlashAttention for IPU