A Code Walkthrough of vLLM Paged Attention
At the end of March I put together a PPT about the classic Paged Attention algorithm. That reminded me that I had not written a blog post in years, so I turned the slides into an article just to prove that I am still alive.
Paged Attention in vLLM
Before we begin, it is worth clarifying that vLLM has gone through several different implementations of the Paged Attention kernel:
Early vLLM versions:
- Prefilling -> `flash_attn_varlen_func` from Flash Attention
- Decoding -> vLLM's own Paged Attention implementation:
  - `paged_attention_v1`: used for relatively short sequences
  - `paged_attention_v2`: used in the cases where you would rather not use v1 :)
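Both decode kernels read the KV cache through a block table that maps a sequence's logical token positions to physical cache blocks; that indirection is the core of Paged Attention. Here is a minimal sketch of that lookup in plain Python. The names (`gather_kv`, `block_table`) and the block size of 16 are illustrative, not vLLM's actual internals.

```python
BLOCK_SIZE = 16  # tokens per physical block (illustrative value)

def gather_kv(kv_cache, block_table, seq_len):
    """Collect the K/V entries for one sequence from a paged cache."""
    out = []
    for pos in range(seq_len):
        block = block_table[pos // BLOCK_SIZE]   # which physical block
        offset = pos % BLOCK_SIZE                # slot within that block
        out.append(kv_cache[block][offset])
    return out

# Example: a cache with 4 physical blocks; each entry records (block, slot)
kv_cache = [[(b, s) for s in range(BLOCK_SIZE)] for b in range(4)]
# Logical block 0 lives in physical block 2, logical block 1 in physical block 0
block_table = [2, 0]
kv = gather_kv(kv_cache, block_table, seq_len=20)
assert kv[0] == (2, 0)    # token 0 -> physical block 2, slot 0
assert kv[16] == (0, 0)   # token 16 -> physical block 0, slot 0
```

Because blocks can live anywhere in memory, a sequence can grow without reserving one giant contiguous KV buffer up front.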
The source code looks roughly like this:
```python
# NOTE(woosuk): We use a simple heuristic to decide whether to use
# PagedAttention V1 or V2. If the number of partitions is 1, we use
# V1 to avoid the overhead of reduction. Also, if the number of
# sequences or heads is large, we use V1 since there is enough work
# to parallelize.
# TODO(woosuk): Tune this heuristic.
# For context len > 8192, use V2 kernel to avoid shared memory
# shortage.
use_v1 = (max_seq_len <= 8192
          and (max_num_partitions == 1 or num_seqs * num_heads > 512))
```
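To see the heuristic end to end, here is a small self-contained sketch that also computes `max_num_partitions`, which the quoted snippet takes as given. The partition size of 512 matches vLLM's `_PARTITION_SIZE` constant at the time, but treat it as an assumption of this sketch rather than a guaranteed detail.

```python
import math

PARTITION_SIZE = 512  # v2 splits each sequence into partitions of this size

def choose_kernel(max_seq_len, num_seqs, num_heads):
    """Sketch of the v1/v2 selection heuristic quoted above."""
    max_num_partitions = math.ceil(max_seq_len / PARTITION_SIZE)
    use_v1 = (max_seq_len <= 8192
              and (max_num_partitions == 1 or num_seqs * num_heads > 512))
    return "v1" if use_v1 else "v2"

# Short sequence: a single partition, no cross-partition reduction -> v1
assert choose_kernel(max_seq_len=256, num_seqs=1, num_heads=8) == "v1"
# Longer sequence with few seqs*heads: split into partitions for parallelism -> v2
assert choose_kernel(max_seq_len=4096, num_seqs=1, num_heads=8) == "v2"
# Longer sequence but lots of seqs*heads: already enough parallel work -> v1
assert choose_kernel(max_seq_len=4096, num_seqs=128, num_heads=32) == "v1"
# Very long context: always v2, to stay within shared-memory limits
assert choose_kernel(max_seq_len=16384, num_seqs=128, num_heads=32) == "v2"
```

The intuition: v1 launches one thread block per (sequence, head), so it only pays off when there are enough such pairs to fill the GPU; v2 adds intra-sequence parallelism across partitions at the cost of an extra reduction pass.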
In the latest vLLM versions, both prefilling and decoding have moved to Flash Attention, whose kernels are implemented with CUTLASS.