The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
Updated Jul 30, 2024 · Python
Official release of the InternLM2.5 7B base and chat models, with 1M-token context support.
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
🎉 CUDA/C++ notes / hand-written CUDA kernels for LLMs / technical blog, updated occasionally: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
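As a reference point for one of the kernels listed above, RMSNorm fits in a few lines of NumPy. This is a minimal sketch (the function name and signature are illustrative, not taken from any of these repos):

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # RMSNorm: divide x by the root-mean-square of its last axis, then apply
    # a learned per-feature gain. Unlike LayerNorm, no mean is subtracted.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```

The CUDA versions of such kernels typically compute the mean-of-squares with a warp/block reduction, which is exactly what the "warp reduce" and "block reduce" notes above cover.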
FlashInfer: Kernel Library for LLM Serving
Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed in pipeline-parallel mode. Faster than ZeRO/ZeRO++/FSDP.
Utilities for efficient fine-tuning, inference, and evaluation of code generation models.
Python package for rematerialization-aware gradient checkpointing
Triton implementation of FlashAttention2 that adds support for custom masks.
Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.
A simple PyTorch implementation of Flash MultiHead Attention.
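The core trick shared by the FlashAttention implementations listed above is an online softmax: keys/values are processed in blocks while a running row-max and normalizer are maintained, so the full N×N score matrix is never materialized. A minimal NumPy sketch of the idea (single head, no masking; all names here are illustrative, not from any of these repos):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference: softmax(Q K^T / sqrt(d)) V, materializing the full score matrix.
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def flash_attention(Q, K, V, block=4):
    # Tiled attention with an online softmax: iterate over K/V blocks,
    # keeping a running row-wise max m and normalizer l, and rescaling the
    # accumulated output O whenever the max increases.
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row-wise max of scores
    l = np.zeros(N)           # running softmax normalizer
    scale = 1.0 / np.sqrt(d)
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                   # scores for this K/V block
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])         # unnormalized block probabilities
        alpha = np.exp(m - m_new)              # rescale factor for old statistics
        l = l * alpha + P.sum(axis=-1)
        O = O * alpha[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]
```

The tiled version matches the naive one exactly (up to floating-point error); the real kernels get their speedup by keeping each block in on-chip SRAM rather than reading and writing the score matrix to HBM.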
Training GPT-2 on FineWeb-Edu in JAX/Flax
Poplar implementation of FlashAttention for IPU