This article traces the full request path — "a user submits a prompt → KV cache is generated → node A → KV cache transfer → node B restores it → decode → detokenizer → output" — from prompt, through the forward pass, to output, and maps out in as much detail as possible the core data structures, tensors, conversion/restore points, lifetimes, and process boundaries involved along that chain.
0. Overview: process / thread / device boundaries

An SGLang server (python -m sglang.launch_server) typically starts several subprocesses/components:

- HTTP Server (FastAPI/Uvicorn): accepts /generate and other requests
- TokenizerManager / TokenizerWorker: turns prompt → token_ids and manages the disaggregation bootstrap
- Scheduler process: the core scheduling loop — builds batches, calls ModelRunner, manages the KV cache / memory pools
- ModelWorker / TPWorker: runs the model forward passes (prefill/extend/decode)
- Detokenizer process (or thread): turns output token_ids → strings (with streaming support)
Under PD (prefill/decode) disaggregation:

- Prefill server A: only does prefill (the forward pass over the prompt), producing KV
- Decode server B: only does decode (autoregressive generation), and must restore the KV
- Router: routes requests to A/B and exposes the unified HTTP interface to clients
On the device side:

- L1: GPU KV cache pool (MHATokenToKVPool/MLATokenToKVPool/...)
- L2: HostKVCache (pinned host memory)
- L3: external storage (Mooncake/HF3FS/File/…)
1. End-to-end main path

An ASCII overview of the main path:

```
Client (python/curl/web)
        |
        | HTTP POST /generate
        | JSON: { text, sampling_params, ... }
        v
+--------------------------+
| Router Process           |
| - policy (random/...)    |
| - PD pairing state       |
+--------------------------+
        |                      |
        | route(prompt)        | route(next token loop)
        v                      v
+---------------------+   +---------------------+
| Prefill Server (A)  |   | Decode Server (B)   |
| disagg_mode=prefill |   | disagg_mode=decode  |
+---------------------+   +---------------------+
        |                           ^
        | (1) tokenize + schedule   | (4) restore + decode loop
        | (2) prefill forward       |
        | (3) transfer KV + meta ---+
        |
        v
(optional: A also returns some first tokens depending on implementation)
        |
        +-------------------------------> Router merges/streams response
                                               |
                                               v
                                     Client receives:
                                     - generated_text / stream chunks
```
Expanded:

```
(A) Prefill server                              (B) Decode server
+-----------------------------------------+    +-----------------------------------------+
| HTTP Server                             |    | HTTP Server                             |
| - parse JSON request                    |    | - may receive from router internal path |
+-------------------+---------------------+    +-------------------+---------------------+
                    |                                              |
                    v                                              v
        +-----------------------+                      +-----------------------+
        | TokenizerManager      |                      | TokenizerManager      |
        | - prompt -> token_ids |                      | - may not tokenize    |
        | - disagg bootstrap    |                      | - bootstrap client    |
        +-----------+-----------+                      +-----------+-----------+
                    |                                              |
                    v                                              v
        +-----------------------+                      +-----------------------+
        | Scheduler Process     |                      | Scheduler Process     |
        | - ReqQueue            |                      | - DecodeQueue         |
        | - BatchBuilder        |                      | - PreallocQueue       |
        | - MemoryPool (L1/L2)  |                      | - MemoryPool (L1/L2)  |
        | - (Hi)RadixCache      |                      | - (Hi)RadixCache opt  |
        +-----------+-----------+                      +-----------+-----------+
                    |                                              |
                    v                                              v
        +-----------------------+                      +-----------------------+
        | ModelRunner/TPWorker  |                      | ModelRunner/TPWorker  |
        | forward_prefill       |                      | forward_decode        |
        | writes KV to L1 pool  |                      | reads KV from L1 pool |
        +-----------+-----------+                      +-----------+-----------+
                    |                                              |
                    |      KV DATA + META (indices/mapping)        |
                    +-------------(NIXL/Mooncake/...)--------------+
                                                                   |
                                                                   v
                                                        +------------------+
                                                        | Detokenizer      |
                                                        | token_ids->text  |
                                                        +------------------+
```
2. Inside Prefill (A)

2.1 token_id generation

```
prompt (str)
  |
  | tokenizer.encode(...)
  v
input_ids: List[int] / torch.Tensor[int32|int64] (CPU)
  |
  | (optional) add BOS/EOS/system prompt template
  v
Request(token_ids, sampling_params, ...)
```
Core structure:

- Request
  - rid / request_id
  - prompt_text (or not retained)
  - input_ids (CPU tensor / list)
  - sampling_params (temperature/top_p/max_new_tokens/…)
  - stage (more visible under PD: PREFILL_* / DECODE_*)
  - req_pool_idx (assigned at schedule time)
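A minimal sketch of this request object in Python; the field names follow the list above, not SGLang's exact class definitions:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SamplingParams:
    temperature: float = 1.0
    top_p: float = 1.0
    max_new_tokens: int = 128

@dataclass
class Request:
    rid: str                            # request id
    input_ids: List[int]                # CPU-side token ids from the tokenizer
    sampling_params: SamplingParams
    prompt_text: Optional[str] = None   # may be dropped after tokenization
    stage: str = "PREFILL_PENDING"      # PREFILL_* / DECODE_* under PD disaggregation
    req_pool_idx: Optional[int] = None  # assigned by the scheduler at schedule time
```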
2.2 PrefixCache / RadixCache

```
token_ids (full prompt)
  |
  | match_prefix(token_ids)
  v
MatchResult:
  - device_indices  (L1 hit: token->KV slot indices)
  - host_hit_length (L2 hit)
  - last_device_node / last_host_node
```
If a prefix hits the cache:

- the KV for that prefix already exists (in L1, or reloadable from L2/L3)
- prefill has fewer tokens to compute (only the uncached part)

The Request then carries:

- cached_prefix_len
- cached_device_indices (to reuse those KV slots directly)
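A hypothetical sketch of how a hit shrinks the prefill work; match_prefix and alloc follow the shapes described above, not SGLang's exact signatures:

```python
def plan_prefill(token_ids, radix_cache, allocator):
    match = radix_cache.match_prefix(token_ids)
    cached_prefix_len = len(match.device_indices)  # KV for these tokens already sits in L1
    n_new = len(token_ids) - cached_prefix_len     # only this suffix needs a forward pass
    new_kv_indices = allocator.alloc(n_new)        # fresh slots for the uncached tokens
    return cached_prefix_len, match.device_indices, new_kv_indices
```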
2.3 Two-level memory pool mapping

```
(Request slot) req_pool_idx
  |
  | ReqToTokenPool.req_to_token = token_slot_idx
  v
(token_slot_idx) <---- allocated by TokenToKVPoolAllocator
  |
  | token_slot_idx indexes into per-layer KV buffers
  v
KVCache buffers (L1 GPU):
  k_buffer
  v_buffer
```

More concretely (MHA as the example, simplified):
```
ReqToTokenPool.req_to_token:
  int32 tensor on GPU (or CPU depending on impl)
  shape = [max_running_requests, max_context_len]
  entry = token_slot_idx

TokenToKVPoolAllocator:
  free_slots: list / bitmap / queue
  alloc(n_tokens, page_size) -> token_slot_indices (possibly paged)
  free(token_slot_indices)

MHATokenToKVPool:
  k_buffer: List[Tensor]
    length = num_layers
    each: [kv_pool_size + page_size, n_kv_heads, head_dim]
    dtype = store_dtype
    device = cuda
  v_buffer: same
```
With --page_size 1, paging is at its finest granularity, which makes indices and transfers easiest to observe, but the overhead is higher.
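The two-level lookup itself is plain tensor indexing. A toy, CPU-only illustration (all sizes made up):

```python
import torch

max_reqs, max_ctx = 8, 64
num_layers, n_kv_heads, head_dim, pool_size = 2, 4, 16, 1024

# Level 1: (request slot, context position) -> token slot
req_to_token = torch.zeros(max_reqs, max_ctx, dtype=torch.int32)
# Level 2: token slot -> per-layer KV storage
k_buffer = [torch.zeros(pool_size, n_kv_heads, head_dim) for _ in range(num_layers)]

# Suppose request slot 0 holds a 10-token context in token slots 100..109:
req_to_token[0, :10] = torch.arange(100, 110, dtype=torch.int32)

# Attention then gathers that request's K for layer 0 like this:
token_slots = req_to_token[0, :10].long()
k_for_attn = k_buffer[0][token_slots]   # shape [10, n_kv_heads, head_dim]
```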
2.4 Prefill forward

During prefill, the Scheduler builds a batch:

- ForwardBatch
  - input_ids (GPU int32)
  - position_ids (GPU int32)
  - req_pool_indices (GPU int32; tells attention which request slots participate)
  - seq_lens / qo_indptr / kv_indptr (metadata for the flashinfer/triton backends)
  - kv_indices (key: the KV pool slot indices this step's new tokens are written to)
  - index_slice (key: the slice of req_to_token covered by this extend, i.e. the token span)
The key side effect of prefill:

- For each attention layer:
  - compute K and V for this step's tokens
  - write them into k_buffer[layer][kv_indices] and v_buffer[layer][kv_indices]

The core tensor relationship between a single batch and the new-token writes:

```
new_tokens for request rid:
  token positions: [cached_prefix_len, prompt_len)
  |
  | allocator alloc -> kv_indices (GPU int32)
  v
kv_indices: [n_new] (or paged)
  |
  | used as index into per-layer buffers
  v
for each layer l:
  K_l_new: [n_new, n_kv_heads, head_dim] (compute dtype)
  V_l_new: [n_new, n_kv_heads, head_dim]
  |
  | scatter/store into pool
  v
k_buffer[l][kv_indices] <- K_l_new (store dtype)
v_buffer[l][kv_indices] <- V_l_new
```
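In PyTorch terms the scatter/store at the bottom of the diagram is just indexed assignment plus a dtype cast. A toy sketch with assumed sizes:

```python
import torch

n_new, n_kv_heads, head_dim, pool_size = 5, 4, 16, 1024
k_buffer = torch.zeros(pool_size, n_kv_heads, head_dim, dtype=torch.bfloat16)  # store dtype
v_buffer = torch.zeros_like(k_buffer)

kv_indices = torch.tensor([100, 101, 102, 103, 104])  # slots the allocator handed out
K_new = torch.randn(n_new, n_kv_heads, head_dim)      # compute dtype (fp32 here)
V_new = torch.randn(n_new, n_kv_heads, head_dim)

k_buffer[kv_indices] = K_new.to(k_buffer.dtype)       # cast to store dtype, then scatter
v_buffer[kv_indices] = V_new.to(v_buffer.dtype)
```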
3. A→B transfer

The problem PD transfer has to solve: B must end up "owning an equivalent KV state" in its own KV pool, so that it can continue decoding from the end of the prompt without redoing prefill.

The transferred content therefore falls into two categories:
3.1 KV data plane

Two approaches; implementations vary:

- Transfer by token slot (most direct): copy the data at kv_indices in A's per-layer k_buffer/v_buffer into the corresponding slots of B's pool
- Transfer by page in bulk (more efficient): if the allocator is page/block based, contiguous blocks saturate bandwidth more easily
Abstractly:

```
KVPayload:
  for layer in [0..L-1]:
    K: bytes of shape [n_transfer_tokens, n_kv_heads, head_dim] (store dtype)
    V: bytes of shape [n_transfer_tokens, n_kv_heads, head_dim] (store dtype)
```
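One possible way to materialize this payload is to gather the transferred slots from every layer into contiguous chunks; pack_kv_payload is a hypothetical helper, not an SGLang API:

```python
def pack_kv_payload(k_buffer, v_buffer, kv_indices):
    """k_buffer/v_buffer: per-layer lists of [pool_size, n_kv_heads, head_dim] tensors."""
    idx = kv_indices.long()
    payload = []
    for l in range(len(k_buffer)):
        # .contiguous() turns each chunk into one dense region a transport can register/send
        payload.append((k_buffer[l][idx].contiguous(), v_buffer[l][idx].contiguous()))
    return payload  # List[(K_l, V_l)], each [n_transfer_tokens, n_kv_heads, head_dim]
```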
3.2 KV control plane

B must know which request and which token range this KV belongs to, where to write it, and which layout to interpret it with.

At a minimum it needs:

- rid / request_id
- prompt_len (or at least the "existing context length")
- kv_cache_dtype / store_dtype
- attention_arch (MHA/MLA/NSA…) and its parameters (head_dim, kv_heads, layer_num, etc.)
- page_size (if paged)
- kv_indices (the A-side token→slot mapping, or an equivalent compressed representation)
- index_slice / token_range (the span of this prefill/extend)
- (optional) cached_prefix_len and prefix-cache hit info (if B needs to know which tokens need no transfer)
Roughly:

```
KVTransferMeta (example):
{
  rid: u64,
  model_id: string,
  arch: "MHA" | "MLA" | ...,
  layer_num: int,
  kv_heads: int,
  head_dim: int,
  v_head_dim: int,
  store_dtype: "bf16" | "fp16" | "uint8(fp8)" | ...,
  page_size: int,
  prompt_len: int,
  cached_prefix_len: int,
  token_range: [start_pos, end_pos),
  kv_indices: int32[...] or pages[...]   // A->B mapping handle
}
```
3.3 Transfer behavior, with NIXL as the example

PD transfer can be abstracted as submitting a set of buffers plus event synchronization:

```
A side:
  buffers_to_send = [
    meta_bytes (CPU or GPU),
    K_layer0_chunk (GPU),
    V_layer0_chunk (GPU),
    ...
  ]
  nixl.submit(send, buffers_to_send) -> completion event

B side:
  prealloc slots in KV pool
  nixl.submit(recv, buffers_to_recv) -> completion event
  on complete:
    "restore mapping" so decode loop can see correct req_to_token / kv_indices
```
4. Inside Decode (B)

The hard part on the decode server is turning the incoming KV cache into an internally decode-ready state: Bootstrap/Prealloc → KV restore → Decode loop → Logits → Sample.
4.1 Bootstrap

```
prefill server registers to bootstrap service
decode server registers to bootstrap service
bootstrap matches them (room/pair_key)
  -> returns transport endpoints/credentials for direct transfer
```
4.2 PreallocQueue

Slots and structures are prepared before the KV arrives. Before receiving, the decode side typically:

- allocates req_pool_idx (the request slot)
- allocates token slot indices (sized by prompt_len or token_range)
- reserves the slice in the req_to_token table

That way, as soon as the KV arrives it can be copied/scattered straight into its target location, overlapping allocation and RDMA/CUDA registration with the data copy.
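A minimal sketch of that prealloc step (req_pool and allocator are this article's abstractions, not exact SGLang classes); everything here can run before a single KV byte arrives, so the later receive is a pure copy:

```python
def prealloc_for_incoming(meta, req_pool, allocator, req_to_token):
    req_pool_idx = req_pool.alloc(1)                    # request slot
    b_kv_indices = allocator.alloc(meta["prompt_len"])  # destination token slots
    # The mapping slice can be written now; the data lands in these slots later.
    req_to_token[req_pool_idx, : meta["prompt_len"]] = b_kv_indices
    return req_pool_idx, b_kv_indices
```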
4.3 KV restore

Restoring means rebuilding the token→slot mapping and refilling the KV pool. Abstractly:

```
recv(K, V payload + meta)
  |
  | parse meta -> knows prompt_len, token_range, page_size, ...
  |
  | allocate target slots in B's KV pool:
  |   b_kv_indices = allocator.alloc(n_tokens)
  |
  | receive/copy KV bytes into:
  |   b_k_buffer[layer][b_kv_indices] <- payload.K[layer]
  |   b_v_buffer[layer][b_kv_indices] <- payload.V[layer]
  |
  | rebuild ReqToTokenPool mapping:
  |   req_to_token[req_pool_idx, token_range] = b_kv_indices
  v
Request now becomes "decode-ready"
```
Expanded in ASCII:

```
B side restore
==============
meta.token_range = [0, prompt_len)           # restore the KV of the entire prompt
meta.kv_indices  = [a_slot0, a_slot1, ...]   # A's slot ids (possibly only for debug/validation)
  |
  | B does not necessarily reuse A's slot ids; it usually allocs its own slots
  v
b_kv_indices = allocator_B.alloc(prompt_len)

for each layer l:
  k_buffer_B[l][b_kv_indices] <- recv K bytes
  v_buffer_B[l][b_kv_indices] <- recv V bytes

ReqToTokenPool_B.req_to_token[req_pool_idx, 0:prompt_len] = b_kv_indices
```
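Continuing the earlier sketches: once the payload lands, restore is a per-layer indexed copy into B's pool (the cast covers any compute-vs-store dtype gap):

```python
def restore_kv(payload, b_kv_indices, k_buffer_B, v_buffer_B):
    idx = b_kv_indices.long()
    for l, (K_l, V_l) in enumerate(payload):
        k_buffer_B[l][idx] = K_l.to(k_buffer_B[l].dtype)
        v_buffer_B[l][idx] = V_l.to(v_buffer_B[l].dtype)
    # With req_to_token already filled during prealloc, the request is decode-ready.
```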
4.4 Decode loop

Once in the decode loop, each step roughly looks like:

```
(last_token_id, position_id, req_pool_idx)
  |
  | attention reads KV from pool via req_to_token mapping
  v
logits -> sampling -> next_token_id
  |
  +--> append token_id to request output buffer
  |
  +--> write new token KV into KV pool (extend)
```
The corresponding data structures:

- ForwardBatch (decode)
  - input_ids: usually just the last token (or a handful of tokens)
  - req_pool_indices
  - kv_indices: slots for this step's newly generated tokens (freshly allocated)
  - the remaining attention metadata (indptr etc.)

Important: decode writes KV exactly the way prefill does, just with very few tokens per step.
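A single decode step, heavily simplified and greedy-only; model.forward_decode is a stand-in for ModelRunner's real entry point:

```python
import torch

def decode_step(model, req, req_to_token, allocator):
    new_slot = allocator.alloc(1)                      # one KV slot for the token being generated
    pos = req.seq_len
    req_to_token[req.req_pool_idx, pos] = new_slot[0]  # extend the mapping by one entry
    # forward_decode reads old KV via req_to_token and writes the new token's KV
    # into new_slot, the same write path as prefill, for a single position.
    logits = model.forward_decode(req.last_token_id, pos, req.req_pool_idx, new_slot)
    next_token_id = int(torch.argmax(logits, dim=-1))
    req.output_ids.append(next_token_id)
    req.last_token_id, req.seq_len = next_token_id, pos + 1
    return next_token_id
```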
5. Detokenizer

The Detokenizer's core job is to turn the model's output token ids into strings, handling streaming (incremental output).

Simplified flow:

```
Scheduler / ModelWorker produces:
  out_token_ids: List[int] (per request)
  + maybe logprobs, finish_reason, usage
  |
  | send to detokenizer process via queue/pipe
  v
Detokenizer:
  tokenizer.decode(incremental_ids, skip_special_tokens=...)
  |
  v
text delta / full text
  |
  v
HTTP response:
  - non-stream: {"text": "..."}
  - stream: chunks (SSE/websocket/HTTP chunked)
```
For A→B replay, the key observation points on the detokenizer side are:

- whether the token_ids sequence produced on B matches the monolithic deployment (deterministic inference)
- whether the streaming "deltas" are identical (they can be affected by chunking and caching)
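A minimal incremental-detokenization sketch with a Hugging Face tokenizer. The full-prefix re-decode plus diff shown here is one common way to keep deltas stable across multi-byte characters; it illustrates the idea, not SGLang's exact implementation:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any model works for illustration

def stream_deltas(token_ids):
    sent = ""
    for i in range(1, len(token_ids) + 1):
        full = tok.decode(token_ids[:i], skip_special_tokens=True)
        if not full.endswith("\ufffd"):  # hold back while a multi-byte char is incomplete
            yield full[len(sent):]       # emit only the new delta
            sent = full
```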
6. Lifecycles

```
time ──────────────────────────────────────────────────────────────────────>

Client JSON req
  |
Router receives req (owns request envelope)
  |
  | (A) prefill stage
  v
A: TokenizerManager creates Request(rid, input_ids, sampling_params)
                                              [lifetime: until prefill done]
  |
A: Scheduler alloc req_pool_idx_A             [until request finished or handed off]
  |
A: ReqToTokenPool slice reserved              [until request finished or cache pinned]
  |
A: TokenToKVAllocator alloc kv_indices
   for prompt tokens                          [until freed or cached in radix]
  |
A: ModelRunner prefill forward writes K/V
   into KV pool                               [KV pool global lifetime; slots per req lifetime]
  |
A: (optional) RadixCache inserts nodes /
   inc lock_ref                               [until eviction/refcount drop]
  |
A: Disagg Transfer packages (meta + KV bytes) [ephemeral; per transfer]
  |
A: Transfer completes -> A may free its per-request slots OR keep cached prefix
  |
  +--------------------------------------------------------------------+
                                                                       |
                                                                       v
B: PreallocQueue reserves req_pool_idx_B + kv slots  [until request finished]
B: Receive KV bytes into B
B: Rebuild ReqToTokenPool mapping for rid            [until finished]
B: Decode loop generates tokens; each token
   allocs a new kv slot                              [until finished/cached]
B: Token ids pushed to detokenizer queue             [ephemeral buffer]
B: Detokenizer builds output text                    [until response complete]
B: When finished: free req_pool_idx_B slots
   (unless cached)                                   [free list]
```

- The KV pool is a global buffer (per instance); a request merely occupies some of its token slots
- ReqToTokenPool is the per-instance request index table, mapping a request's context positions to token slots
- Radix/HiCache can extend the lifetime of some token slots beyond the request itself (shared or backed up as cache)
7. Replay

From the perspective of a custom KVStore that has to implement A→B restore + replay:

7.1 The minimal closure for replay

To continue decoding on B, you must restore at least:
- identical model weights (same model_path / same weights version)
- identical tokenizer/template (prompt → token_ids must match, otherwise the KV will not line up)
- the context token_ids sequence (at least length/position alignment)
- the KV contents (per-layer K/V covering the context [0:prompt_len))
- the token→slot mapping (B's req_to_token[req_pool_idx, 0:prompt_len])
- the generation state (sampling_params, random_seed, whether inference is deterministic)
This can be abstracted into a serializable snapshot:

```
ReplaySnapshot {
  rid,
  model_id,
  prompt_token_ids,
  prompt_len,
  kv_store_ref / kv_bytes,
  kv_layout_meta,
  req_to_token_mapping (implicit or explicit),
  sampling_state (seed, temperature, top_p, ...)
}
```
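A possible serializable form in Python; pickle keeps the sketch short, whereas a real store would want a stable wire format:

```python
import pickle
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ReplaySnapshot:
    rid: str
    model_id: str
    prompt_token_ids: List[int]
    prompt_len: int
    kv_bytes: bytes        # or a kv_store_ref pointing into an external store
    kv_layout_meta: Dict   # arch, layer_num, kv_heads, head_dim, store_dtype, page_size
    sampling_state: Dict   # seed, temperature, top_p, ...

    def to_bytes(self) -> bytes:
        return pickle.dumps(self)

    @staticmethod
    def from_bytes(raw: bytes) -> "ReplaySnapshot":
        return pickle.loads(raw)
```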
7.2 Adapter boundary

Adapting the HiCache interface to a custom KVStore generally comes down to:

- writing/reading the KV of a given token_range to/from the external store
- restoring external KV into a server instance's KV pool + mapping structures

```
store_put(key=(model, layer, page/token_range, ...), value=KV bytes, meta=layout)
store_get(key=..., ...) -> KV bytes + meta
restore_to_kvpool(meta, kv_bytes) -> b_kv_indices + write req_to_token
extract_from_kvpool(req_pool_idx, token_range) -> kv_bytes + meta
```
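The same four operations written as typing.Protocols, so a custom backend can be swapped in behind the boundary; the signatures are this article's assumption, not the actual HiCache interface:

```python
from typing import Dict, List, Protocol, Tuple

class KVStore(Protocol):
    def store_put(self, key: Tuple, value: bytes, meta: Dict) -> None: ...
    def store_get(self, key: Tuple) -> Tuple[bytes, Dict]: ...

class KVPoolAdapter(Protocol):
    def extract_from_kvpool(self, req_pool_idx: int,
                            token_range: Tuple[int, int]) -> Tuple[bytes, Dict]: ...
    def restore_to_kvpool(self, meta: Dict, kv_bytes: bytes) -> List[int]:
        """Writes into the pool and req_to_token; returns b_kv_indices."""
        ...
```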
8. Optional observation points

8.1 Prefill (A) - before send

- rid
- prompt_len, cached_prefix_len
- req_pool_idx_A
- index_slice / token_range
- kv_indices (shape/dtype + first 16 entries + min/max)
- KV pool layout meta (layer_num, head_dim, kv_heads, dtype, page_size)
- payload_total_bytes (computed per layer or obtained from the backend)
- (optional) k_hash/v_hash (hash over a small window)
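Two illustrative helpers for these checkpoints: a one-line index summary, and a short hash over a small KV window so A and B can be diffed. The names are hypothetical:

```python
import hashlib
import torch

def describe_indices(name: str, t: torch.Tensor, head: int = 16) -> None:
    print(f"{name}: shape={tuple(t.shape)} dtype={t.dtype} "
          f"head={t[:head].tolist()} min={int(t.min())} max={int(t.max())}")

def kv_window_hash(buf: torch.Tensor, indices: torch.Tensor, window: int = 4) -> str:
    chunk = buf[indices[:window].long()].float().cpu().numpy().tobytes()
    return hashlib.sha256(chunk).hexdigest()[:16]
```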
8.2 Transfer backend (NIXL) - submit

- rid
- num_bufs, total_bytes
- for each buffer:
  - device (cuda/cpu)
  - nbytes
  - dtype (if it is a tensor)
- completion latency (submit → complete)
8.3 Decode (B) - after restore

- rid
- req_pool_idx_B
- prompt_len
- b_kv_indices (or the slice written into req_to_token after restore)
- req_to_token[req_pool_idx_B, 0:prompt_len] (desc + head)
- k_hash/v_hash (compared against A's)
8.4 Decode loop - per step

- step i
- last_token_id
- next_token_id
- kv_indices_new (tokens added this step)
- finish reason (eos/length/…)
Appendix: full-path ASCII diagram

```
(1) Client -> Router
====================
POST /generate
payload = { "text": prompt:str, "sampling_params": {...} }

(2) Router -> Prefill(A)
========================
Router selects (prefill_url, decode_url)
Router forwards request envelope to A (or calls A internal API)

(3) Prefill(A) tokenize + schedule
==================================
prompt:str
  -> token_ids: List[int] / Tensor[int32] (CPU)
  -> Request(rid, token_ids, sampling_params, stage=PREFILL_*)
  -> Scheduler:
       req_pool_idx_A = ReqToTokenPool.alloc(1)
       (Radix match) -> cached_prefix_len + cached_device_indices
       n_new = prompt_len - cached_prefix_len
       kv_indices_A = TokenToKVPoolAllocator.alloc(n_new, page_size)
       ReqToTokenPool.req_to_token[req_pool_idx_A,
           cached_prefix_len:prompt_len] = kv_indices_A

(4) Prefill(A) forward_prefill writes KV
========================================
ForwardBatch.prefill:
  input_ids       : GPU int32 [n_tokens_total or chunk]
  position_ids    : GPU int32
  req_pool_indices: GPU int32 [batch_size]
  kv_indices      : GPU int32 [n_new_tokens]
  index_slice     : (req_pool_idx_A, slice(cached_prefix_len, prompt_len))
  attn meta       : qo_indptr / kv_indptr / ...

ModelRunner.forward_prefill():
  for layer l:
    K_l_new, V_l_new computed
    k_buffer_A[l][kv_indices_A] <- K_l_new
    v_buffer_A[l][kv_indices_A] <- V_l_new

(5) Prefill(A) -> Transfer (KV + meta)
======================================
KVTransferMeta:
  rid, prompt_len, cached_prefix_len,
  layer_num, kv_heads, head_dim, dtype, page_size,
  token_range, (maybe) kv_indices_A (for debug)

KVPayload:
  for l in layers:
    K bytes for token_range
    V bytes for token_range

(6) Decode(B) prealloc + receive + restore mapping
==================================================
DecodePreallocQueue:
  req_pool_idx_B = ReqToTokenPool.alloc(1)
  b_kv_indices   = TokenToKVPoolAllocator.alloc(prompt_len)

Receive payload:
  for layer l:
    k_buffer_B[l][b_kv_indices] <- recv K bytes
    v_buffer_B[l][b_kv_indices] <- recv V bytes

Restore:
  ReqToTokenPool.req_to_token[req_pool_idx_B, 0:prompt_len] = b_kv_indices
  Request.stage = DECODE_READY

(7) Decode loop
===============
for step in [0..max_new_tokens):
  ForwardBatch.decode:
    input_ids (last token) : GPU int32 [bs]
    position_ids           : GPU int32
    req_pool_indices       : GPU int32
    kv_indices_new         : GPU int32 [bs] (alloc 1 slot / req)

  ModelRunner.forward_decode():
    reads KV via req_to_token mapping
    outputs logits

  Sampler():
    next_token_id

  write KV for new token:
    k_buffer_B[l][kv_indices_new] <- K_new
    v_buffer_B[l][kv_indices_new] <- V_new

  append token_id to output stream

(8) Detokenizer + HTTP response
===============================
token_ids stream -> detokenizer queue
detokenizer.decode(...) -> text delta
router returns: {"text": "..."} or streaming chunks
```