LLM Inference

An opinionated and incomplete survey of LLM inference and serving runtimes, viewed through a systems and infrastructure lens.

  1. LLMs and Transformers
     Introduction, embeddings, transformers, and attention mechanisms
  2. Inference and the KV Cache
     Inference execution and the KV cache
  3. Sharding a Model
     Pipeline, tensor, and expert parallelism
  4. Batching, Scheduling, and Paging
     Continuous batching, Orca, and PagedAttention
  5. I/O-Aware Kernels
     FlashAttention and FlashInfer
  6. Speculative Decoding
     Speculative decoding, EAGLE, Medusa Trees, and Multi-Token Prediction
  7. Prefill-Decode Scheduling and Disaggregation
     Chunked prefill and prefill-decode disaggregation
  8. KV Cache Management and Offload
     Prefix caching and KV offload
  9. Appendix: Overview of Training
     Fine-tuning, RLHF, RLAIF, quantization, and alignment techniques
  10. Appendix: GPU Hardware
      Architecture, CUDA and ROCm, kernels and Triton, and the memory hierarchy
  11. Appendix: Inference Runtimes
      LLM serving stacks, TensorRT, Triton, vLLM, and SGLang