Sword Pavilion
A Code Walkthrough of vLLM Paged Attention

At the end of March I put together a slide deck about the classic Paged Attention algorithm. That reminded me that I had not written a blog post in years, so I turned the slides into an article, if only to prove that I am still alive.

Paged Attention in vLLM

Before we begin, it is worth clarifying that vLLM has gone through several different implementations of the Paged Attention kernel:

Early vLLM versions:

  • Prefilling -> flash_attn_varlen_func from Flash Attention
  • Decoding -> vLLM’s own Paged Attention implementation
  • paged_attention_v1: used for relatively short sequences
    • paged_attention_v2: used for everything else, i.e. long sequences that have to be split into partitions and reduced :)

The source code looks roughly like this:

    # NOTE(woosuk): We use a simple heuristic to decide whether to use
    # PagedAttention V1 or V2. If the number of partitions is 1, we use
    # V1 to avoid the overhead of reduction. Also, if the number of
    # sequences or heads is large, we use V1 since there is enough work
    # to parallelize.
    # TODO(woosuk): Tune this heuristic.
    # For context len > 8192, use V2 kernel to avoid shared memory
    # shortage.
    use_v1 = (max_seq_len <= 8192
              and (max_num_partitions == 1 or num_seqs * num_heads > 512))

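The `max_num_partitions` term in that heuristic comes from the V2 kernel splitting each sequence into fixed-size partitions and then reducing the partial results. A minimal sketch of the dispatch logic, assuming a 512-token partition size (the function name and the constant here are my own; only the boolean expression is taken from the source above):

```python
def use_paged_attention_v1(max_seq_len: int, num_seqs: int, num_heads: int,
                           partition_size: int = 512) -> bool:
    """Re-statement of the heuristic quoted above.

    partition_size=512 is an assumption standing in for the partition
    constant used by vLLM's decode kernels.
    """
    # V2 splits each sequence into ceil(max_seq_len / partition_size)
    # chunks and reduces the partial softmax results afterwards.
    max_num_partitions = -(-max_seq_len // partition_size)  # ceil division
    return max_seq_len <= 8192 and (
        max_num_partitions == 1 or num_seqs * num_heads > 512
    )

# A single partition (short sequence): V1, no reduction overhead.
print(use_paged_attention_v1(max_seq_len=256, num_seqs=1, num_heads=8))   # True
# Long sequence, few sequences/heads: not enough parallel work, so V2.
print(use_paged_attention_v1(max_seq_len=4096, num_seqs=2, num_heads=8))  # False
```

The trade-off the comment describes: V1 gives each sequence-head pair to one thread block, which only saturates the GPU when there are many of them; V2 buys extra parallelism along the sequence dimension at the cost of a final reduction pass.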
In the latest vLLM versions, both prefill and decode have moved to Flash Attention, whose kernels are implemented on top of CUTLASS.
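Across all of these kernel generations, the core paging idea stays the same: the KV cache lives in fixed-size physical blocks, and a per-sequence block table maps logical token positions to those blocks. A toy sketch of that indirection, assuming a block size of 16 tokens (a common vLLM default; the function name is illustrative):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (assumed; 16 is a common default)

def locate_kv(block_table: list[int], token_idx: int) -> tuple[int, int]:
    """Translate a logical token position into (physical_block, offset).

    This is the per-token indirection the paged attention kernels perform
    when gathering keys and values from the block-structured cache.
    """
    return block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

# A sequence whose 3 logical blocks landed in scattered physical blocks.
block_table = [7, 2, 9]
print(locate_kv(block_table, 0))    # token 0  -> (7, 0)
print(locate_kv(block_table, 20))   # token 20 -> (2, 4)
print(locate_kv(block_table, 47))   # token 47 -> (9, 15)
```

Because physical blocks need not be contiguous, sequences can grow block by block without large preallocated buffers, which is the memory win that motivated Paged Attention in the first place.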

2025-04-20    
Hugo Theme Diary by Rise
Ported from Makito's Journal.

© Copyright (c) 2015. All rights reserved.