Topic: DeepSeek

Deploying DeepSeek on 96 H100 GPUs
source lmsys.org Aug 29, 2025

Article URL: https://lmsys.org/blog/2025-05-05-large-scale-ep/
Comments URL: https://news.ycombinator.com/item?id=45064329
Points: 90
# Comments: 28...

TL;DR
The SGLang team replicated DeepSeek's inference system on 96 H100 GPUs using prefill-decode disaggregation, expert parallelism, and large-scale load balancing, reaching a throughput of 52.3k input tokens per second and 22.3k output tokens per second per node.

Key Takeaways:
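  • PD (prefill-decode) disaggregation runs the prefill and decode phases on separate instances so each can be optimized independently, reducing latency and improving efficiency.
  • Expert parallelism (EP) combined with the expert parallelism load balancer (EPLB) achieves speedups of 1.49x (prefill) and 2.54x (decode) by addressing workload imbalances across GPUs.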
  • DisposableTensor and expert workload extraction tools enhance memory management and analysis, providing insights for optimization and simulation.
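
To make the load-balancing takeaway concrete, here is a minimal sketch of why expert placement matters under expert parallelism. It is not SGLang's or DeepSeek's EPLB algorithm: it uses a generic greedy "heaviest expert to the least-loaded GPU" heuristic, and the function names, the per-expert token counts, and the 8-expert, 4-GPU setup are all hypothetical values chosen for illustration.

```python
import heapq


def contiguous_placement(loads, num_gpus):
    """Naive baseline: experts 0..k-1 on GPU 0, the next k on GPU 1, and so on."""
    assert len(loads) % num_gpus == 0, "sketch assumes an even split of experts"
    per_gpu = len(loads) // num_gpus
    return [list(range(g * per_gpu, (g + 1) * per_gpu)) for g in range(num_gpus)]


def greedy_placement(loads, num_gpus):
    """Assign the heaviest remaining expert to the least-loaded GPU with free slots.

    A generic greedy heuristic used here for illustration only; it is not the
    EPLB algorithm described in the article.
    """
    assert len(loads) % num_gpus == 0, "sketch assumes an even split of experts"
    per_gpu = len(loads) // num_gpus
    heap = [(0.0, g) for g in range(num_gpus)]  # (accumulated load, GPU id)
    heapq.heapify(heap)
    placement = [[] for _ in range(num_gpus)]
    for expert in sorted(range(len(loads)), key=lambda e: -loads[e]):
        total, g = heapq.heappop(heap)
        placement[g].append(expert)
        # A GPU that has reached its expert capacity is not returned to the heap.
        if len(placement[g]) < per_gpu:
            heapq.heappush(heap, (total + loads[expert], g))
    return placement


def imbalance(loads, placement):
    """Ratio of the busiest GPU's load to the mean GPU load (1.0 is perfect)."""
    gpu_loads = [sum(loads[e] for e in experts) for experts in placement]
    return max(gpu_loads) / (sum(gpu_loads) / len(gpu_loads))


# Hypothetical tokens routed to each of 8 experts during one batch.
expert_loads = [100, 90, 20, 10, 40, 30, 5, 5]

naive = contiguous_placement(expert_loads, num_gpus=4)
balanced = greedy_placement(expert_loads, num_gpus=4)

print("naive imbalance:   ", round(imbalance(expert_loads, naive), 3))
print("balanced imbalance:", round(imbalance(expert_loads, balanced), 3))
```

Because every GPU in an expert-parallel group waits for the slowest one, the max-to-mean ratio is a rough proxy for wasted time per step. Lowering that ratio is the kind of effect the reported prefill and decode speedups are attributed to.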