A Code Walkthrough of vLLM Paged Attention
At the end of March I put together a PPT about the classic Paged Attention algorithm. That reminded me that I had not written a blog post in years, so I turned the slides into an article just to prove that I am still alive.
Paged Attention in vLLM
Before we begin, it is worth clarifying that vLLM has gone through several different implementations of the Paged Attention kernel:
Early vLLM versions:
- Prefilling -> `flash_attn_varlen_func` from Flash Attention
- Decoding -> vLLM's own Paged Attention implementation:
  - `paged_attention_v1`: used for relatively short sequences
  - `paged_attention_v2`: used in the cases where you would rather not use v1 :)
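Both decode kernels read the KV cache through a block table that maps a sequence's logical token positions to physical cache blocks; that indirection is the core of Paged Attention. Here is a minimal sketch of that lookup in plain Python. The names (`gather_kv`, `block_table`) and the block size of 16 are illustrative, not vLLM's actual internals.

```python
BLOCK_SIZE = 16  # tokens per physical block (illustrative value)

def gather_kv(kv_cache, block_table, seq_len):
    """Collect the K/V entries for one sequence from a paged cache."""
    out = []
    for pos in range(seq_len):
        block = block_table[pos // BLOCK_SIZE]   # which physical block
        offset = pos % BLOCK_SIZE                # slot within that block
        out.append(kv_cache[block][offset])
    return out

# Example: a cache with 4 physical blocks; each entry records (block, slot)
kv_cache = [[(b, s) for s in range(BLOCK_SIZE)] for b in range(4)]
# Logical block 0 lives in physical block 2, logical block 1 in physical block 0
block_table = [2, 0]
kv = gather_kv(kv_cache, block_table, seq_len=20)
assert kv[0] == (2, 0)    # token 0 -> physical block 2, slot 0
assert kv[16] == (0, 0)   # token 16 -> physical block 0, slot 0
```

Because blocks can live anywhere in memory, a sequence can grow without reserving one giant contiguous KV buffer up front.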
The source code looks roughly like this:
```python
# NOTE(woosuk): We use a simple heuristic to decide whether to use
# PagedAttention V1 or V2. If the number of partitions is 1, we use
# V1 to avoid the overhead of reduction. Also, if the number of
# sequences or heads is large, we use V1 since there is enough work
# to parallelize.
# TODO(woosuk): Tune this heuristic.
# For context len > 8192, use V2 kernel to avoid shared memory
# shortage.
use_v1 = (max_seq_len <= 8192
          and (max_num_partitions == 1 or num_seqs * num_heads > 512))
```
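To see the heuristic end to end, here is a small self-contained sketch that also computes `max_num_partitions`, which the quoted snippet takes as given. The partition size of 512 matches vLLM's `_PARTITION_SIZE` constant at the time, but treat it as an assumption of this sketch rather than a guaranteed detail.

```python
import math

PARTITION_SIZE = 512  # v2 splits each sequence into partitions of this size

def choose_kernel(max_seq_len, num_seqs, num_heads):
    """Sketch of the v1/v2 selection heuristic quoted above."""
    max_num_partitions = math.ceil(max_seq_len / PARTITION_SIZE)
    use_v1 = (max_seq_len <= 8192
              and (max_num_partitions == 1 or num_seqs * num_heads > 512))
    return "v1" if use_v1 else "v2"

# Short sequence: a single partition, no cross-partition reduction -> v1
assert choose_kernel(max_seq_len=256, num_seqs=1, num_heads=8) == "v1"
# Longer sequence with few seqs*heads: split into partitions for parallelism -> v2
assert choose_kernel(max_seq_len=4096, num_seqs=1, num_heads=8) == "v2"
# Longer sequence but lots of seqs*heads: already enough parallel work -> v1
assert choose_kernel(max_seq_len=4096, num_seqs=128, num_heads=32) == "v1"
# Very long context: always v2, to stay within shared-memory limits
assert choose_kernel(max_seq_len=16384, num_seqs=128, num_heads=32) == "v2"
```

The intuition: v1 launches one thread block per (sequence, head), so it only pays off when there are enough such pairs to fill the GPU; v2 adds intra-sequence parallelism across partitions at the cost of an extra reduction pass.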
In the latest vLLM versions, both prefilling and decoding have moved to Flash Attention, whose kernels are implemented with CUTLASS.