Publications

A collection of my research work.

L4: Low-Latency and Load-Balanced LLM Serving via Length-Aware Scheduling

Yitao Yuan, Chenqi Zhao, Bohan Zhao, Zane Cao, Yongchao He, Wenfei Wu

2025

A Unified Sparse Attention via Multi-Granularity Compression

Siran Liu, Zane Cao, Yongchao He

2025

SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving

Bohan Zhao, Zane Cao, Yongchao He

2025

MegatronApp: Efficient and Comprehensive Management on Distributed LLM Training

Bohan Zhao, Guang Yang, Shuo Chen, Ruitao Liu, Tingrui Zhang, Yongchao He, Wei Xu

2025

SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference

Yongchao He, Bohan Zhao, Zheng Cao

2025

HeteroSpec: Leveraging Contextual Heterogeneity for Efficient Speculative Decoding

Siran Liu, Yang Ye, Qianchao Zhu, Zane Cao, Yongchao He

2025

RateSheriff: Multipath Flow-aware and Resource Efficient Rate Limiter Placement for Data Center Networks

Songshi Dou, Yongchao He, Sen Liu, Wenfei Wu, Zehua Guo

2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS), 2023

A Generic Service to Provide In-Network Aggregation for Key-Value Streams

Yongchao He, Wenfei Wu, Yanfang Le, Ming Liu, ChonLam Lao

Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

Consistent and Fine-Grained Rule Update with In-Network Control for Distributed Rate Limiting

Yongchao He, Wenfei Wu

2022 IEEE/ACM 30th International Symposium on Quality of Service (IWQoS), 2022

SFP: Service Function Chain Provision on Programmable Switches for Cloud Tenants

Hongyi Huang, Wenfei Wu, Yongchao He, Zehua Guo

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2022

Scalable On-Switch Rate Limiters for the Cloud

Yongchao He, Wenfei Wu, Xuemin Wen, Haifeng Li, Yongqiang Yang

IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 2021

NFD: Using Behavior Models to Develop Cross-Platform Network Functions

Hongyi Huang, Wenfei Wu, Yongchao He, Bangwen Deng, Ying Zhang, Yongqiang Xiong, Guo Chen, Yong Cui, Peng Cheng

IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 2021

Fully Functional Rate Limiter Design on Programmable Hardware Switches

Yongchao He, Wenfei Wu

Proceedings of the ACM SIGCOMM 2019 Conference Posters and Demos, 2019

SpeedyBox: Low-Latency NFV Service Chains with Cross-NF Runtime Consolidation

Yimin Jiang, Yong Cui, Wenfei Wu, Zhe Xu, Jiahan Gu, K. K. Ramakrishnan, Yongchao He, Xuehai Qian

2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), 2019
