SGLang KVCache Lifecycle and PD-Disaggregation Transfer Analysis

The goal of this article is to walk the full path "user enters a prompt → KVCache generation → node A → KVCache transfer → node B restore → decode → detokenizer → output", and to lay out, in as much detail as possible, the core data structures, tensors, conversion/restore points, lifetimes, and process boundaries involved along the way.

0. Overview: process/thread/device boundaries

An SGLang server (python -m sglang.launch_server) typically starts several subprocesses/components:

  • HTTP Server (FastAPI/Uvicorn): accepts /generate and other requests
  • TokenizerManager / TokenizerWorker: turns prompt → token_ids and manages the disaggregation bootstrap
  • Scheduler process: the core scheduling loop; builds batches, calls the ModelRunner, manages the KV cache / memory pools
  • ModelWorker / TPWorker: runs the model forward pass (prefill/extend/decode)
  • Detokenizer process (or thread): turns output token_ids → strings (with streaming support)

With PD disaggregation:

  • Prefill server A: runs only prefill (forward pass over the prompt) and produces KV
  • Decode server B: runs only decode (autoregressive generation) and must restore KV
  • Router: routes requests to A/B and fronts the unified HTTP interface for clients

At the device/storage level:

  • L1: GPU KV cache pool (MHATokenToKVPool/MLATokenToKVPool/...)
  • L2: HostKVCache (pinned host memory)
  • L3: external storage (Mooncake/HF3FS/File/…)

1. End-to-end main path

Below is an ASCII overview of the main path:

Client (python/curl/web)
    |
    | HTTP POST /generate
    | JSON: { text, sampling_params, ... }
    v
+--------------------------+
| Router Process           |
| - policy (random/...)    |
| - PD pairing state       |
+--------------------------+
    |                      |
    | route(prompt)        | route(next token loop)
    v                      v
+---------------------+  +---------------------+
| Prefill Server (A)  |  | Decode Server (B)   |
| disagg_mode=prefill |  | disagg_mode=decode  |
+---------------------+  +---------------------+
    |                           ^
    | (1) tokenize + schedule   | (4) restore + decode loop
    | (2) prefill forward       |
    | (3) transfer KV + meta ---+
    |
    v
(optional: A also returns some first tokens depending on implementation)
    |
    +------------------> Router merges/streams response
                             |
                             v
                         Client receives:
                         - generated_text / stream chunks

Expanded:

           (A) Prefill server                            (B) Decode server
+-----------------------------------------+ +-----------------------------------------+
| HTTP Server                             | | HTTP Server                             |
| - parse JSON request                    | | - may receive from router internal path |
+-------------------+---------------------+ +-------------------+---------------------+
                    |                                          |
                    v                                          v
        +-----------------------+                  +-----------------------+
        | TokenizerManager      |                  | TokenizerManager      |
        | - prompt -> token_ids |                  | - may not tokenize    |
        | - disagg bootstrap    |                  | - bootstrap client    |
        +-----------+-----------+                  +-----------+-----------+
                    |                                          |
                    v                                          v
        +-----------------------+                  +-----------------------+
        | Scheduler Process     |                  | Scheduler Process     |
        | - ReqQueue            |                  | - DecodeQueue         |
        | - BatchBuilder        |                  | - PreallocQueue       |
        | - MemoryPool (L1/L2)  |                  | - MemoryPool (L1/L2)  |
        | - (Hi)RadixCache      |                  | - (Hi)RadixCache opt  |
        +-----------+-----------+                  +-----------+-----------+
                    |                                          |
                    v                                          v
        +-----------------------+                  +-----------------------+
        | ModelRunner/TPWorker  |                  | ModelRunner/TPWorker  |
        | forward_prefill       |                  | forward_decode        |
        | writes KV to L1 pool  |                  | reads KV from L1 pool |
        +-----------+-----------+                  +-----------+-----------+
                    |                                          |
                    |     KV DATA + META (indices/mapping)     |
                    +-----------(NIXL/Mooncake/...)------------+
                                         |
                                         v
                                +------------------+
                                | Detokenizer      |
                                | token_ids->text  |
                                +------------------+

2. Inside Prefill (A)

2.1 token_id generation

prompt (str)
|
| tokenizer.encode(...)
v
input_ids: List[int] / torch.Tensor[int32|int64] (CPU)
|
| (optional) add BOS/EOS/system prompt template
v
Request(token_ids, sampling_params, ...)

Core structure (a toy sketch follows the list):

  • Request
    • rid / request_id
    • prompt_text (or not retained)
    • input_ids (CPU tensor / list)
    • sampling_params (temperature/top_p/max_new_tokens/…)
    • stage (more visible under PD: PREFILL_* / DECODE_*)
    • req_pool_idx (assigned at scheduling time)
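
As a toy illustration (not SGLang's actual classes; the HuggingFace tokenizer and the model name are assumptions for the example):

# Minimal sketch: a Request-like record built from a prompt.
from dataclasses import dataclass
from typing import List, Optional

from transformers import AutoTokenizer

@dataclass
class SamplingParams:
    temperature: float = 0.7
    top_p: float = 0.9
    max_new_tokens: int = 128

@dataclass
class Request:
    rid: int
    input_ids: List[int]
    sampling_params: SamplingParams
    prompt_text: Optional[str] = None   # may be dropped to save memory
    stage: str = "PREFILL_WAITING"      # PREFILL_* / DECODE_* under PD
    req_pool_idx: Optional[int] = None  # assigned later by the scheduler

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
prompt = "Explain KV cache in one sentence."
req = Request(rid=1, input_ids=tokenizer.encode(prompt),
              sampling_params=SamplingParams(), prompt_text=prompt)
print(len(req.input_ids), req.stage)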

2.2 PrefixCache/RadixCache

token_ids (full prompt)
|
| match_prefix(token_ids)
v
MatchResult:
- device_indices (L1 hit: token->KV slot indices)
- host_hit_length (L2 hit)
- last_device_node / last_host_node

If a prefix is hit:

  • The KV for that prefix already exists (in L1, or reloadable from L2/L3)
  • Prefill has fewer tokens to compute (only the unmatched part)
  • The Request carries:
    • cached_prefix_len
    • cached_device_indices (to reuse KV slots directly)
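
A minimal sketch of the idea behind prefix matching, using a plain token trie rather than SGLang's actual RadixCache:

# Toy prefix cache: maps cached token sequences to KV slot indices and
# returns the longest cached prefix for a new prompt.
from typing import Dict, List, Tuple

class TrieNode:
    def __init__(self):
        self.children: Dict[int, "TrieNode"] = {}
        self.slot: int = -1  # KV slot index of the token ending here

class ToyPrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids: List[int], slots: List[int]) -> None:
        node = self.root
        for tok, slot in zip(token_ids, slots):
            node = node.children.setdefault(tok, TrieNode())
            node.slot = slot

    def match_prefix(self, token_ids: List[int]) -> Tuple[int, List[int]]:
        """Return (cached_prefix_len, cached_device_indices)."""
        node, slots = self.root, []
        for tok in token_ids:
            if tok not in node.children:
                break
            node = node.children[tok]
            slots.append(node.slot)
        return len(slots), slots

cache = ToyPrefixCache()
cache.insert([1, 5, 9, 2], slots=[100, 101, 102, 103])
print(cache.match_prefix([1, 5, 9, 7]))  # -> (3, [100, 101, 102])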

2.3 Two-level memory-pool mapping

(Request slot) req_pool_idx
|
| ReqToTokenPool.req_to_token[req_pool_idx, pos] = token_slot_idx
v
(token_slot_idx) <---- allocated by TokenToKVPoolAllocator
|
| token_slot_idx indexes into per-layer KV buffers
v
KVCache buffers (L1 GPU):
k_buffer[layer][token_slot_idx, head, dim]
v_buffer[layer][token_slot_idx, head, dim]

More concretely (MHA, simplified):

ReqToTokenPool.req_to_token : int32 tensor on GPU (or CPU, depending on the impl)
    shape = [max_running_requests, max_context_len]
    entry = token_slot_idx

TokenToKVPoolAllocator:
    free_slots: list / bitmap / queue
    alloc(n_tokens, page_size) -> token_slot_indices (possibly paged)
    free(token_slot_indices)

MHATokenToKVPool:
    k_buffer: List[Tensor], length = num_layers
        each: [kv_pool_size + page_size, n_kv_heads, head_dim], dtype=store_dtype, device=cuda
    v_buffer: same

With --page_size 1, the paging granularity is finest, which makes indices and transfers easiest to observe, at the cost of higher overhead.
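
A runnable toy version of the two-level mapping with plain torch tensors (page_size = 1, tiny sizes; names mirror the text, but this is not SGLang code):

# Toy two-level mapping: req_to_token points at slots in per-layer buffers.
import torch

max_running_requests, max_context_len = 8, 64
num_layers, n_kv_heads, head_dim, kv_pool_size = 2, 4, 16, 256

req_to_token = torch.zeros(max_running_requests, max_context_len, dtype=torch.int32)
free_slots = list(range(kv_pool_size))  # trivial allocator: a plain free list
k_buffer = [torch.zeros(kv_pool_size, n_kv_heads, head_dim) for _ in range(num_layers)]
v_buffer = [torch.zeros(kv_pool_size, n_kv_heads, head_dim) for _ in range(num_layers)]

def alloc(n_tokens: int) -> torch.Tensor:
    return torch.tensor([free_slots.pop() for _ in range(n_tokens)], dtype=torch.int32)

# Reserve slots for a 10-token prompt sitting in request slot 3:
req_pool_idx, prompt_len = 3, 10
kv_indices = alloc(prompt_len)
req_to_token[req_pool_idx, :prompt_len] = kv_indices

# Layer-0 KV for context position 4 of this request lives at:
slot = int(req_to_token[req_pool_idx, 4])
print(k_buffer[0][slot].shape)  # torch.Size([4, 16])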

2.4 Prefill forward

During prefill the Scheduler builds a batch:

  • ForwardBatch
    • input_ids (GPU int32)
    • position_ids (GPU int32)
    • req_pool_indices (GPU int32; tells attention which request slots participate)
    • seq_lens / qo_indptr / kv_indptr (metadata for the flashinfer/triton backends)
    • kv_indices (key: the KV-pool slot indices where this step's new tokens are written)
    • index_slice (key: the slice of req_to_token covering this extend's token span)

Prefill's key side effect:

  • For every attention layer:
    • compute K, V for this step's tokens
    • write them into k_buffer[layer][kv_indices] and v_buffer[layer][kv_indices]

Core tensor relationships between one batch and its new-token writes:

new_tokens for request rid:
    token positions: [cached_prefix_len ... prompt_len-1]
        |
        | allocator alloc -> kv_indices (GPU int32)
        v
kv_indices: [n_new_tokens] (or paged [n_pages, page_size])
        |
        | used as index into per-layer buffers
        v
for each layer l:
    K_l_new: [n_new_tokens, n_kv_heads, head_dim] (compute dtype)
    V_l_new: [n_new_tokens, n_kv_heads, head_dim]
        |
        | scatter/store into pool
        v
    k_buffer[l][kv_indices] <- K_l_new (store dtype)
    v_buffer[l][kv_indices] <- V_l_new
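
A sketch of this write path under the same toy setup (random tensors stand in for real attention outputs; not SGLang code):

# Scatter freshly computed K/V for the new tokens into the pool at kv_indices.
import torch

num_layers, n_kv_heads, head_dim, pool_size = 2, 4, 16, 256
k_buffer = [torch.zeros(pool_size, n_kv_heads, head_dim) for _ in range(num_layers)]
v_buffer = [torch.zeros(pool_size, n_kv_heads, head_dim) for _ in range(num_layers)]

n_new_tokens = 6
kv_indices = torch.tensor([11, 12, 13, 40, 41, 42])  # possibly non-contiguous

for l in range(num_layers):
    K_l_new = torch.randn(n_new_tokens, n_kv_heads, head_dim)  # from attention
    V_l_new = torch.randn(n_new_tokens, n_kv_heads, head_dim)
    k_buffer[l][kv_indices] = K_l_new  # advanced-index scatter along dim 0
    v_buffer[l][kv_indices] = V_l_new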

3. A→B transfer

The problem PD transfer must solve: B has to hold an equivalent KV state in its own KV pool so that it can continue decoding from the end of the prompt without redoing prefill.

The transferred content therefore falls into two categories:

3.1 KV data plane

There are two approaches; implementations differ:

  1. Transfer by token slot (most direct): copy the data at A's k_buffer[layer][kv_indices] to the corresponding slots in B's pool
  2. Transfer by page (more efficient): if the allocator is page/block based, contiguous blocks saturate bandwidth more easily

In the abstract:

KVPayload:
    for layer in [0..L-1]:
        K: bytes of shape [n_transfer_tokens, n_kv_heads, head_dim] (store dtype)
        V: bytes of shape [n_transfer_tokens, n_kv_heads, head_dim] (store dtype)
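
A sketch of building such a payload from toy buffers (gather per layer, then flatten to bytes; illustrative only):

# Gather the transferred token range per layer and flatten to raw bytes.
import torch

def pack_kv_payload(k_buffer, v_buffer, kv_indices):
    """Per layer: [n_transfer_tokens, n_kv_heads, head_dim] -> (K bytes, V bytes)."""
    payload = []
    for k_l, v_l in zip(k_buffer, v_buffer):
        payload.append((
            k_l[kv_indices].contiguous().cpu().numpy().tobytes(),
            v_l[kv_indices].contiguous().cpu().numpy().tobytes(),
        ))
    return payload

k_buffer = [torch.randn(256, 4, 16) for _ in range(2)]
v_buffer = [torch.randn(256, 4, 16) for _ in range(2)]
payload = pack_kv_payload(k_buffer, v_buffer, torch.tensor([11, 12, 13]))
print(len(payload), len(payload[0][0]))  # 2 layers; 3*4*16*4 = 768 bytes per K chunk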

3.2 KV control plane

B must know which request and token range these KV entries belong to, where to write them, and which layout to interpret them with.

At minimum it needs:

  • rid / request_id
  • prompt_len (or at least the length of the already-existing context)
  • kv_cache_dtype / store_dtype
  • attention_arch (MHA/MLA/NSA…) and its parameters (head_dim, kv_heads, layer_num, etc.)
  • page_size (if paged)
  • kv_indices (A-side token→slot mapping, or an equivalent compressed representation)
  • index_slice / token_range (the span of this prefill/extend)
  • (optional) cached_prefix_len and prefix-cache hit info (if B needs to know which tokens need not be transferred)

Roughly:

KVTransferMeta (example):
{
    rid: u64,
    model_id: string,
    arch: "MHA" | "MLA" | ...,
    layer_num: int,
    kv_heads: int,
    head_dim: int,
    v_head_dim: int,
    store_dtype: "bf16" | "fp16" | "uint8(fp8)" | ...,
    page_size: int,
    prompt_len: int,
    cached_prefix_len: int,
    token_range: [start_pos, end_pos),
    kv_indices: int32[...] or pages[...]   // A->B mapping handle
}
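
A sketch of serializing this control-plane meta as JSON (field names follow the example above; the wire format is an assumption, not what SGLang/NIXL actually uses):

# Dataclass round-trip for the control-plane meta.
import json
from dataclasses import dataclass, asdict
from typing import List, Tuple

@dataclass
class KVTransferMeta:
    rid: int
    model_id: str
    arch: str
    layer_num: int
    kv_heads: int
    head_dim: int
    v_head_dim: int
    store_dtype: str
    page_size: int
    prompt_len: int
    cached_prefix_len: int
    token_range: Tuple[int, int]
    kv_indices: List[int]

meta = KVTransferMeta(rid=1, model_id="toy", arch="MHA", layer_num=2,
                      kv_heads=4, head_dim=16, v_head_dim=16,
                      store_dtype="bf16", page_size=1, prompt_len=10,
                      cached_prefix_len=0, token_range=(0, 10),
                      kv_indices=list(range(10)))
meta_bytes = json.dumps(asdict(meta)).encode()  # control plane on the wire
print(KVTransferMeta(**json.loads(meta_bytes)))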

3.3 Transfer behavior, with NIXL as the example

PD transfer can be abstracted as submitting a set of buffers plus event synchronization:

A side:
    buffers_to_send = [
        meta_bytes (CPU or GPU),
        K_layer0_chunk (GPU),
        V_layer0_chunk (GPU),
        ...
    ]
    nixl.submit(send, buffers_to_send) -> completion event

B side:
    prealloc slots in KV pool
    nixl.submit(recv, buffers_to_recv) -> completion event
    on complete: "restore mapping" so decode loop can see correct req_to_token / kv_indices
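
The sketch below only emulates this submit → completion-event pattern; a Future plays the role of the completion event and an in-process queue plays the role of the RDMA/TCP wire. It does not use NIXL's real API:

# Emulation of submit/complete; not NIXL.
import queue
from concurrent.futures import ThreadPoolExecutor

wire = queue.Queue()
pool = ThreadPoolExecutor(max_workers=2)

def _send(buffers):
    for b in buffers:
        wire.put(b)

def _recv(n_bufs):
    return [wire.get() for _ in range(n_bufs)]

def submit_send(buffers):
    return pool.submit(_send, buffers)   # -> completion "event"

def submit_recv(n_bufs):
    return pool.submit(_recv, n_bufs)

# A side: meta first, then per-layer K/V chunks
send_done = submit_send([b"meta", b"K_layer0", b"V_layer0"])
# B side: slots were preallocated; wait for completion, then restore mapping
recv_done = submit_recv(3)
send_done.result()
print(recv_done.result())  # [b'meta', b'K_layer0', b'V_layer0']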

4. Inside Decode (B)

The hard part on the decode server is turning incoming KVCache into internally decodable state: Bootstrap/Prealloc → KV restore → Decode loop → Logits → Sample

4.1 Bootstrap

prefill server registers to bootstrap service
decode server registers to bootstrap service
bootstrap matches them (room/pair_key)
-> returns transport endpoints/credentials for direct transfer

4.2 PreallocQueue

Slots and structures are prepared before KV arrives; decode typically does the following ahead of reception:

  • allocate req_pool_idx (the request slot)
  • allocate token slot indices (by prompt_len or token_range)
  • reserve the slice in the req_to_token table

That way, the moment KV arrives it can be copied/scattered straight to the target locations, overlapping allocation and RDMA/CUDA registration with the data copy.

4.3 KV restore

Restore means rebuilding the token→slot mapping and refilling the KV pool. Abstractly:

recv(K,V payload + meta)
    |
    | parse meta -> knows prompt_len, token_range, page_size, ...
    |
    | allocate target slots in B's KV pool:
    |     b_kv_indices = allocator.alloc(n_tokens)
    |
    | receive/copy KV bytes into:
    |     b_k_buffer[layer][b_kv_indices] <- payload.K[layer]
    |     b_v_buffer[layer][b_kv_indices] <- payload.V[layer]
    |
    | rebuild ReqToTokenPool mapping:
    |     req_to_token[req_pool_idx, token_range] = b_kv_indices
    v
Request now becomes "decode-ready"

Expanded in ASCII:

B side restore
==============
meta.token_range = [0, prompt_len)          # restore KV for the whole prompt
meta.kv_indices  = [a_slot0, a_slot1, ...]  # A's slot ids (possibly only for debug/validation)
    |
    | B does not necessarily reuse A's slot ids; it usually allocates its own
    v
b_kv_indices = allocator_B.alloc(prompt_len)

for each layer l:
    k_buffer_B[l][b_kv_indices] <- recv K bytes
    v_buffer_B[l][b_kv_indices] <- recv V bytes

ReqToTokenPool_B.req_to_token[req_pool_idx, 0:prompt_len] = b_kv_indices
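
A toy, runnable version of this restore path (float32 toy buffers, a plain free-list allocator; not SGLang code):

# B-side restore: alloc slots, scatter received bytes, rebuild the mapping.
import numpy as np
import torch

num_layers, n_kv_heads, head_dim, pool_size = 2, 4, 16, 256
k_buffer_B = [torch.zeros(pool_size, n_kv_heads, head_dim) for _ in range(num_layers)]
v_buffer_B = [torch.zeros(pool_size, n_kv_heads, head_dim) for _ in range(num_layers)]
req_to_token_B = torch.zeros(8, 64, dtype=torch.int32)
free_slots = list(range(pool_size))

def restore(req_pool_idx: int, prompt_len: int, payload) -> torch.Tensor:
    b_kv_indices = torch.tensor([free_slots.pop() for _ in range(prompt_len)])
    shape = (prompt_len, n_kv_heads, head_dim)
    for l, (k_bytes, v_bytes) in enumerate(payload):
        k_buffer_B[l][b_kv_indices] = torch.from_numpy(
            np.frombuffer(k_bytes, dtype=np.float32).reshape(shape).copy())
        v_buffer_B[l][b_kv_indices] = torch.from_numpy(
            np.frombuffer(v_bytes, dtype=np.float32).reshape(shape).copy())
    req_to_token_B[req_pool_idx, :prompt_len] = b_kv_indices.to(torch.int32)
    return b_kv_indices  # the request is now "decode-ready"

payload = [(torch.randn(4, 4, 16).numpy().tobytes(),
            torch.randn(4, 4, 16).numpy().tobytes()) for _ in range(num_layers)]
print(restore(req_pool_idx=0, prompt_len=4, payload=payload))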

4.4 Decode loop

Once in the decode loop, each step roughly does:

(last_token_id, position_id, req_pool_idx)
|
| attention reads KV from pool via req_to_token mapping
v
logits -> sampling -> next_token_id
|
+--> append token_id to request output buffer
|
+--> write new token KV into KV pool (extend)

The corresponding data structures:

  • ForwardBatch (decode)
    • input_ids: usually just the last token (or a few tokens)
    • req_pool_indices
    • kv_indices: slots for this step's newly generated tokens (freshly allocated)
    • the remaining attention metadata (indptr, etc.)

Important: decode writes KV exactly the way prefill does, just with very few tokens per step.
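
A sketch of one decode step over the toy pool (random logits stand in for a real forward pass; not SGLang code):

# Toy single decode step: read context KV via req_to_token, pick a next
# token, then extend the pool by one slot and write the new token's KV.
import torch

req_to_token = torch.zeros(8, 64, dtype=torch.int32)
k_buffer = [torch.zeros(256, 4, 16)]
v_buffer = [torch.zeros(256, 4, 16)]
free_slots = list(range(256))
vocab_size = 32000

def decode_step(req_pool_idx: int, seq_len: int) -> int:
    ctx_slots = req_to_token[req_pool_idx, :seq_len].long()
    k_ctx, v_ctx = k_buffer[0][ctx_slots], v_buffer[0][ctx_slots]  # attention input
    logits = torch.randn(vocab_size)            # stand-in for forward_decode
    next_token_id = int(torch.argmax(logits))
    new_slot = free_slots.pop()                 # extend: 1 new slot per step
    req_to_token[req_pool_idx, seq_len] = new_slot
    k_buffer[0][new_slot] = torch.randn(4, 16)  # KV of the newly decoded token
    v_buffer[0][new_slot] = torch.randn(4, 16)
    return next_token_id

print(decode_step(req_pool_idx=0, seq_len=10))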


5. Detokenizer

The detokenizer's core task is to convert output token ids into strings and handle streaming (incremental output).

Simplified flow:

Scheduler / ModelWorker produces:
    out_token_ids: List[int] (per request)
    + maybe logprobs, finish_reason, usage
    |
    | send to detokenizer process via queue/pipe
    v
Detokenizer:
    tokenizer.decode(incremental_ids, skip_special_tokens=...)
    |
    v
text delta / full text
    |
    v
HTTP response:
    - non-stream: {"text": "..."}
    - stream: chunks (SSE/websocket/HTTP chunked)

For A→B replay, the key observation points on the detokenizer side are:

  • whether the token_ids sequence produced on B matches the monolithic run (deterministic inference)
  • whether the streaming "deltas" match (chunking and caching can sometimes affect this)
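
One common way to produce safe streaming deltas is to decode the accumulated id list each step and diff against the previously emitted text, holding back incomplete UTF-8; a sketch assuming a HuggingFace tokenizer:

# Incremental detokenization by diffing full decodes.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def stream_deltas(token_ids):
    prev_text = ""
    ids = []
    for tid in token_ids:
        ids.append(tid)
        text = tokenizer.decode(ids, skip_special_tokens=True)
        if not text.endswith("\ufffd"):  # hold back incomplete multi-byte chars
            yield text[len(prev_text):]
            prev_text = text

for delta in stream_deltas(tokenizer.encode("Hello KV cache world")):
    print(repr(delta))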

6. Lifecycle

time ───────────────────────────────────────────────────────────────────────────────>

Client JSON req
|
Router receives req (owns request envelope)
|
| (A) prefill stage
v
A: TokenizerManager creates Request(rid, input_ids, sampling_params) [lifetime: until prefill done]
|
A: Scheduler alloc req_pool_idx_A [until request finished or handed off]
|
A: ReqToTokenPool slice reserved [until request finished or cache pinned]
|
A: TokenToKVAllocator alloc kv_indices for prompt tokens [until freed or cached in radix]
|
A: ModelRunner prefill forward writes K/V into KV pool [KV pool global lifetime; slots per req lifetime]
|
A: (optional) RadixCache inserts nodes / inc lock_ref [until eviction/refcount drop]
|
A: Disagg Transfer packages (meta + KV bytes) [ephemeral; per transfer]
|
A: Transfer completes → A may free its per-request slots OR keep cached prefix
|
+-----------------------------------------------------------------------------------+
|
v
B: PreallocQueue reserves req_pool_idx_B + kv slots [until request finished]
B: Receive KV bytes into B's KV pool slots [KV pool global; slots per req]
B: Rebuild ReqToTokenPool mapping for rid [until finished]
B: Decode loop generates tokens; each token alloc new kv slot [until finished/cached]
B: Token ids pushed to detokenizer queue [ephemeral buffer]
B: Detokenizer builds output text [until response complete]
B: When finished: free req_pool_idx_B slots (unless cached) [free list]

  • The KV pool is a global buffer (per instance); a request merely occupies some of its token slots
  • ReqToTokenPool is the per-instance request index table, mapping a request's context positions to token slots
  • Radix/HiCache can extend the lifetime of some token slots beyond the request itself (shared as cache / backed up)

7. Replay

From the perspective of a custom KVStore that must implement A→B restore + replay:

7.1 The minimal closure for replay

To continue decoding on B, at least the following must be restored:

  1. Identical model weights (same model_path / same weights version)
  2. Identical tokenizer/template (prompt → token_ids must match; otherwise the KV won't line up)
  3. The context token_ids sequence (at least length/position alignment)
  4. The KV contents (per-layer K/V covering the [0:prompt_len) context)
  5. The token→slot mapping (B's req_to_token[req_pool_idx, 0:prompt_len])
  6. Generation state (sampling_params, random_seed, whether deterministic)

This can be abstracted into a serializable snapshot:

ReplaySnapshot {
    rid,
    model_id,
    prompt_token_ids,
    prompt_len,
    kv_store_ref / kv_bytes,
    kv_layout_meta,
    req_to_token_mapping (implicit or explicit),
    sampling_state (seed, temperature, top_p, ...)
}
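
A sketch of persisting such a snapshot (plain pickle and a toy payload for illustration; a real store would use a layout-aware format):

# Serialize/restore a ReplaySnapshot-like record.
import pickle
from dataclasses import dataclass, asdict

@dataclass
class ReplaySnapshot:
    rid: int
    model_id: str
    prompt_token_ids: list
    prompt_len: int
    kv_bytes: list        # per-layer (K bytes, V bytes)
    kv_layout_meta: dict  # layer_num / heads / head_dim / dtype / page_size
    sampling_state: dict  # seed, temperature, top_p, ...

snap = ReplaySnapshot(rid=1, model_id="toy", prompt_token_ids=[1, 5, 9],
                      prompt_len=3, kv_bytes=[(b"k0", b"v0")],
                      kv_layout_meta={"layer_num": 1, "head_dim": 16},
                      sampling_state={"seed": 42, "temperature": 0.0})
with open("/tmp/replay_snapshot.pkl", "wb") as f:
    pickle.dump(asdict(snap), f)
with open("/tmp/replay_snapshot.pkl", "rb") as f:
    restored = ReplaySnapshot(**pickle.load(f))
print(restored.rid, restored.prompt_len)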

7.2 Adaptation boundary

Adapting the HiCache interface to a custom KVStore generally means:

  • writing/reading the KV of some token_range to/from the external store
  • restoring external KV into a server instance's KV pool + mapping structures (see the sketch after the interface below)

store_put(key=(model, layer, page/token_range, ...), value=KV bytes, meta=layout)
store_get(key=..., ...) -> KV bytes + meta

restore_to_kvpool(meta, kv_bytes) -> b_kv_indices + write req_to_token
extract_from_kvpool(req_pool_idx, token_range) -> kv_bytes + meta
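
A minimal file-backed KVStore covering the store_put/store_get half of this surface (key layout and file naming are made up for illustration; restore_to_kvpool corresponds to the restore sketch in 4.3):

# Toy file-backed KVStore.
import json
from pathlib import Path

class FileKVStore:
    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: tuple) -> Path:
        return self.root / "_".join(map(str, key))

    def store_put(self, key: tuple, value: bytes, meta: dict) -> None:
        self._path(key).with_suffix(".meta").write_text(json.dumps(meta))
        self._path(key).with_suffix(".kv").write_bytes(value)

    def store_get(self, key: tuple):
        meta = json.loads(self._path(key).with_suffix(".meta").read_text())
        return self._path(key).with_suffix(".kv").read_bytes(), meta

store = FileKVStore("/tmp/toy_kvstore")
store.store_put(("toy-model", 0, "0-10"), b"\x00" * 768,
                {"dtype": "fp32", "shape": [10, 4, 16]})
kv_bytes, meta = store.store_get(("toy-model", 0, "0-10"))
print(meta, len(kv_bytes))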

8. Optional observation points

8.1 Prefill(A) - before send

  • rid
  • prompt_len, cached_prefix_len
  • req_pool_idx_A
  • index_slice / token_range
  • kv_indices (shape/dtype + first 16 entries + min/max)
  • KV pool layout meta (layer_num, head_dim, kv_heads, dtype, page_size)
  • payload_total_bytes (computed per layer or obtained from the backend)
  • (optional) k_hash/v_hash (hash over a small window; see the sketch below)
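
A sketch of that window hash, usable on A before send and on B after restore (sha256 over the first few transferred slots; a toy, not SGLang's instrumentation):

# Hash a small window of one layer's K or V buffer for A/B comparison.
import hashlib
import torch

def kv_window_hash(buf: torch.Tensor, kv_indices: torch.Tensor, window: int = 16) -> str:
    sample = buf[kv_indices[:window]].contiguous().cpu().numpy().tobytes()
    return hashlib.sha256(sample).hexdigest()[:16]

k_buffer = [torch.randn(256, 4, 16)]
idx = torch.tensor([11, 12, 13])
print(kv_window_hash(k_buffer[0], idx))  # log on A before send / on B after restore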

8.2 Transfer backend (NIXL) - submit

  • rid
  • num_bufs, total_bytes
  • per buffer:
    • device (cuda/cpu)
    • nbytes
    • dtype (if a tensor)
  • completion latency (submit → complete)

8.3 Decode(B) - after restore

  • rid
  • req_pool_idx_B
  • prompt_len
  • b_kv_indices (or the slice written into req_to_token after restore)
  • req_to_token[req_pool_idx_B, 0:prompt_len] (desc + head)
  • k_hash/v_hash (compare against A)

8.4 Decode loop - per step

  • step i
  • last_token_id
  • next_token_id
  • kv_indices_new (tokens added this step)
  • finish reason (eos/length/…)

Appendix: full-pipeline ASCII diagram

(1) Client -> Router
====================
POST /generate
payload = {
    "text": prompt:str,
    "sampling_params": {...}
}

(2) Router -> Prefill(A)
========================
Router selects (prefill_url, decode_url)
Router forwards request envelope to A (or calls A internal API)

(3) Prefill(A) tokenize + schedule
==================================
prompt:str
-> token_ids: List[int] / Tensor[int32] (CPU)
-> Request(rid, token_ids, sampling_params, stage=PREFILL_*)
-> Scheduler:
    req_pool_idx_A = ReqToTokenPool.alloc(1)
    (Radix match) -> cached_prefix_len + cached_device_indices
    n_new = prompt_len - cached_prefix_len
    kv_indices_A = TokenToKVPoolAllocator.alloc(n_new, page_size)
    ReqToTokenPool.req_to_token[req_pool_idx_A, cached_prefix_len:prompt_len] = kv_indices_A

(4) Prefill(A) forward_prefill writes KV
========================================
ForwardBatch.prefill:
    input_ids       : GPU int32 [n_tokens_total or chunk]
    position_ids    : GPU int32
    req_pool_indices: GPU int32 [batch_size]
    kv_indices      : GPU int32 [n_new_tokens]
    index_slice     : (req_pool_idx_A, slice(cached_prefix_len, prompt_len))
    attn meta       : qo_indptr / kv_indptr / ...
ModelRunner.forward_prefill():
    for layer l:
        K_l_new, V_l_new computed
        k_buffer_A[l][kv_indices_A] <- K_l_new
        v_buffer_A[l][kv_indices_A] <- V_l_new

(5) Prefill(A) -> Transfer (KV + meta)
======================================
KVTransferMeta:
    rid, prompt_len, cached_prefix_len,
    layer_num, kv_heads, head_dim, dtype, page_size,
    token_range, (maybe) kv_indices_A (for debug)
KVPayload:
    for l in layers:
        K bytes for token_range
        V bytes for token_range

(6) Decode(B) prealloc + receive + restore mapping
==================================================
DecodePreallocQueue:
    req_pool_idx_B = ReqToTokenPool.alloc(1)
    b_kv_indices = TokenToKVPoolAllocator.alloc(prompt_len)
Receive payload:
    for layer l:
        k_buffer_B[l][b_kv_indices] <- recv K bytes
        v_buffer_B[l][b_kv_indices] <- recv V bytes
Restore:
    ReqToTokenPool.req_to_token[req_pool_idx_B, 0:prompt_len] = b_kv_indices
    Request.stage = DECODE_READY

(7) Decode loop
===============
for step in [0..max_new_tokens):
    ForwardBatch.decode:
        input_ids (last token) : GPU int32 [bs]
        position_ids           : GPU int32
        req_pool_indices       : GPU int32
        kv_indices_new         : GPU int32 [bs] (alloc 1 slot / req)
    ModelRunner.forward_decode():
        reads KV via req_to_token mapping
        outputs logits
    Sampler():
        next_token_id
    write KV for new token:
        k_buffer_B[l][kv_indices_new] <- K_new
        v_buffer_B[l][kv_indices_new] <- V_new
    append token_id to output stream

(8) Detokenizer + HTTP response
===============================
token_ids stream -> detokenizer queue
detokenizer.decode(...) -> text delta
router returns:
    {"text": "..."} or streaming chunks
