Million-Tokens Prompt Inference for Long-context LLMs. The latency breakdown of a single attention kernel for the three sparse patterns and FlashAttention across different context windows on a single A100, including the index time for dynamic sparse approximation and for building the dynamic sparsity indices.
MInference: Million-Tokens Prompt Inference for LLMs. In the case of LLaMA-3-8B and GLM-4-9B-1M, MInference achieves full green performance for context windows up to 1M. In comparison, StreamingLLM and InfLLM drop to below 20% in the middle segments of prompts even at a 70K context window.
MInference 1.0: 10x Faster Million-Context Inference with a Single GPU. MInference implements dedicated GPU kernels for the three proposed sparse attention patterns: the A-shape and block-sparse kernels compute over blocks of size 64×64, while the vertical-slash kernel computes over blocks of size 64×1.
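For intuition, here is a minimal sketch of the kind of mask those block shapes describe: a vertical-slash pattern materialized at 64×1 granularity under a causal constraint. The specific vertical and slash indices below are made-up toy values; in MInference they are estimated per attention head at runtime rather than fixed.

# Sketch: a vertical-slash sparsity mask at 64x1 granularity (toy example).
# The vertical/slash indices are illustrative; MInference estimates them
# per attention head at runtime from a small slice of the query.
import numpy as np

SEQ_LEN = 512     # toy context length
BLOCK_Q = 64      # query rows are processed in 64-row blocks (64x1 tiles)

vertical_idx = np.array([0, 1, 2, 3, 17, 100, 255])  # always-kept key columns (hypothetical)
slash_idx = np.array([0, 1, 2, 64, 128])             # kept diagonals, as offsets behind the query

mask = np.zeros((SEQ_LEN, SEQ_LEN), dtype=bool)
mask[:, vertical_idx] = True                          # vertical lines
for off in slash_idx:                                 # slash (diagonal) lines
    rows = np.arange(off, SEQ_LEN)
    mask[rows, rows - off] = True
mask &= np.tril(np.ones((SEQ_LEN, SEQ_LEN), dtype=bool))  # causal constraint

# A 64x1-tiled kernel only loads key columns that are active somewhere
# inside each 64-row query block:
for q0 in range(0, SEQ_LEN, BLOCK_Q):
    active = np.flatnonzero(mask[q0:q0 + BLOCK_Q].any(axis=0))
    print(f"query rows {q0}-{q0 + BLOCK_Q - 1}: {len(active)} active key columns")

The A-shape and block-sparse kernels play the same game at 64×64 granularity, skipping any query/key tile whose 64×64 block is entirely inactive.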
PowerPoint Presentation. We present nmSPARSE, a GPU library of SpMV and SpMM kernels for general N:M sparsity with various sparsity ratios. We hope nmSPARSE can benefit efficient sparse model inference and motivate new innovations on N:M sparsity in both the machine learning and systems communities.
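To make "general N:M sparsity" concrete, the following sketch keeps the N largest-magnitude weights in every group of M consecutive weights along a row; N and M are free parameters (the hardware-supported 2:4 pattern is the special case N=2, M=4), and the weights here are random toy data rather than anything nmSPARSE-specific.

# Sketch: building a general N:M sparsity mask (keep N of every M consecutive
# weights per row, chosen by magnitude). Toy data; N and M are arbitrary.
import torch

def nm_prune_mask(weight: torch.Tensor, n: int, m: int) -> torch.Tensor:
    """Return a boolean mask with exactly n nonzeros in each group of m columns."""
    rows, cols = weight.shape
    assert cols % m == 0, "columns must be divisible by the group size m"
    groups = weight.abs().reshape(rows, cols // m, m)   # (rows, num_groups, m)
    keep = groups.topk(n, dim=-1).indices               # positions of the n largest
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep, 1.0)
    return mask.reshape(rows, cols).bool()

w = torch.randn(8, 16)
mask = nm_prune_mask(w, n=2, m=8)   # 2:8 pattern -> 75% sparsity
print(mask.float().mean().item())   # 0.25 of the weights survive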
SeerAttention TRANSPARENCY.md at main - GitHub. When applied to long-context fine-tuning with YaRN, SeerAttention can achieve a remarkable 90% sparsity ratio at a 32k context length with minimal perplexity loss, offering a 5.67x speedup over FlashAttention-2. More details can be found in SeerAttention.
Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning. This work presents nmSPARSE, a GPU library of SpMV and SpMM kernels for sparse DNN inference with general N:M sparsity patterns and various sparsity ratios. nmSPARSE addresses the longstanding challenges of irregular computation and scattered memory accesses in sparse matrix multiplications by leveraging the intrinsic balance characteristic of N:M sparsity.
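The "balance characteristic" is simply that every group of M weights contains exactly N nonzeros, so a compressed layout of (values, in-group indices) gives every row and every group the same amount of work and predictable offsets. Below is a simplified CPU-side sketch of that compressed format and the matching SpMV; it illustrates the idea only and is not nmSPARSE's actual GPU data layout.

# Sketch: N:M compressed storage and SpMV. Every group of M columns stores
# exactly N (value, index) pairs, so per-row work is perfectly uniform.
# Simplified CPU model for illustration, not nmSPARSE's real GPU layout.
import numpy as np

def compress_nm(dense, n, m):
    rows, cols = dense.shape
    groups = dense.reshape(rows, cols // m, m)
    idx = np.sort(np.argsort(-np.abs(groups), axis=-1)[..., :n], axis=-1)
    vals = np.take_along_axis(groups, idx, axis=-1)     # (rows, num_groups, n)
    return vals, idx

def spmv_nm(vals, idx, x, m):
    rows, num_groups, n = vals.shape
    y = np.zeros(rows)
    for g in range(num_groups):                          # uniform loop bounds per row
        cols = g * m + idx[:, g, :]                      # absolute column indices
        y += np.einsum("rn,rn->r", vals[:, g, :], x[cols])
    return y

dense, x = np.random.randn(4, 16), np.random.randn(16)
vals, idx = compress_nm(dense, n=2, m=8)

# Check against a dense matmul over the explicitly pruned matrix.
pruned = np.zeros((4, 2, 8))
np.put_along_axis(pruned, idx, vals, axis=-1)
assert np.allclose(spmv_nm(vals, idx, x, m=8), pruned.reshape(4, 16) @ x)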
GPU MODE Lecture 11: Sparsity – Christian Mills. Lecture #11 discusses GPU sparsity, specifically semi-structured and block sparsity techniques, for accelerating neural network inference and training by leveraging optimized kernels and sparse data representations.
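For the semi-structured case, recent PyTorch versions can represent 2:4 sparse weights directly and dispatch matmuls to the sparse tensor cores. The sketch below assumes PyTorch 2.1+ with torch.sparse.to_sparse_semi_structured and an Ampere-or-newer NVIDIA GPU; it enforces a 2:4 pattern by magnitude and then swaps in the compressed representation.

# Sketch: 2:4 semi-structured sparsity on a linear layer. Assumes a recent
# PyTorch (>= 2.1) and a GPU whose tensor cores support 2:4 sparsity.
import torch
from torch.sparse import to_sparse_semi_structured

if torch.cuda.is_available():
    linear = torch.nn.Linear(4096, 4096, bias=False).half().cuda()

    # Enforce 2:4: keep the 2 largest-magnitude weights of every 4 consecutive ones.
    w = linear.weight.detach()
    groups = w.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    linear.weight = torch.nn.Parameter((groups * mask).reshape_as(w))

    # Replace the dense weight with the compressed semi-structured form;
    # subsequent forward passes use the sparse tensor-core kernels.
    linear.weight = torch.nn.Parameter(to_sparse_semi_structured(linear.weight))

    x = torch.randn(8, 4096, dtype=torch.float16, device="cuda")
    print(linear(x).shape)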
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. To efficiently and accurately estimate the criticality of KV cache tokens, we propose Quest, an algorithm that exploits query-aware context sparsity to approximately select the most potentially critical KV cache pages for the current query.
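One cheap way to realize that page selection, in the spirit of Quest, is to keep per-channel min/max summaries of the keys in each page and score a page by an upper bound on q·k, taking per channel whichever extreme maximizes the product with the query. The sketch below is single-head and unbatched, and the page size and top-k values are illustrative.

# Sketch: query-aware KV-cache page selection via per-page key min/max bounds.
# Single head, no batching; page_size and topk_pages are toy values.
import torch

def select_pages(q, keys, page_size=16, topk_pages=4):
    seq, dim = keys.shape
    usable = seq - seq % page_size
    pages = keys[:usable].reshape(-1, page_size, dim)
    kmin = pages.min(dim=1).values                     # (num_pages, dim)
    kmax = pages.max(dim=1).values

    # Upper bound of q . k over any key in the page: per channel, take
    # whichever extreme gives the larger product with q, then sum.
    bound = torch.maximum(q * kmin, q * kmax).sum(dim=-1)
    return bound.topk(min(topk_pages, bound.numel())).indices

q = torch.randn(64)
keys = torch.randn(1024, 64)
print(select_pages(q, keys))   # indices of the pages worth loading for this query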
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs. When applied to long-context fine-tuning with YaRN, SeerAttention can achieve a remarkable 90% sparsity ratio at a 32k context length with minimal perplexity loss, offering a 5.67× speedup over FlashAttention-2.
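The learned part of SeerAttention is a small gate that predicts which blocks of the attention map matter, rather than relying on a hand-designed pattern. A rough, simplified sketch of such a block-level gate is shown below; the average pooling, projection sizes, and top-k thresholding are assumptions for illustration and are not SeerAttention's exact gate design.

# Sketch: a learned block-level attention gate (simplified). Queries and keys
# are average-pooled per block, projected by small learned layers, and the
# block-score map is thresholded into a block-sparse mask. Illustrative only.
import torch
import torch.nn as nn

class BlockAttnGate(nn.Module):
    def __init__(self, dim=64, block=64, gate_dim=32):
        super().__init__()
        self.block = block
        self.q_proj = nn.Linear(dim, gate_dim, bias=False)
        self.k_proj = nn.Linear(dim, gate_dim, bias=False)

    def forward(self, q, k, keep_ratio=0.1):
        # q, k: (seq, dim); pool each block of tokens down to one vector.
        qb = q.reshape(-1, self.block, q.shape[-1]).mean(dim=1)
        kb = k.reshape(-1, self.block, k.shape[-1]).mean(dim=1)
        scores = self.q_proj(qb) @ self.k_proj(kb).T          # (q_blocks, k_blocks)
        k_keep = max(1, int(keep_ratio * scores.shape[-1]))
        keep = scores.topk(k_keep, dim=-1).indices
        mask = torch.zeros_like(scores)
        mask.scatter_(-1, keep, 1.0)
        return mask.bool()                                     # block-sparse mask

gate = BlockAttnGate()
q, k = torch.randn(1024, 64), torch.randn(1024, 64)
print(gate(q, k).float().mean())   # fraction of key blocks each query block keeps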
Sparsity — Intel® Neural Compressor documentation. We validate sparsity on typical models across different domains (including CV, NLP, and recommendation systems). The table below shows the sparsity pattern, sparsity ratio, and accuracy of the sparse and dense (reference) models for each model.