Blog | KVCache.AI

Blog

Explore our latest articles on LLM inference, optimization techniques, and system architecture.

How Much KV Cache Budget Do We Need for LLM Serving?

Jun 26, 2026 KVCache.AI team

Estimate the KV Cache capacity needed for LLM inference workloads by analyzing hit rate and prefill speedup under different cache budgets, with the help of the KV Cache Hit Rate Simulator.

KV Cache, LLM Serving, Hit Rate, Cache Eviction, Mooncake
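
To make the idea of "hit rate under different cache budgets" concrete, here is a minimal sketch in Python. It is not the KV Cache Hit Rate Simulator mentioned above; it assumes a toy model in which each request is a list of hypothetical KV block IDs, the cache evicts with plain LRU, and the budget is counted in blocks. It also ignores the prefix-contiguity constraint of real prefix caching, so treat the numbers as illustrative only.

```python
from collections import OrderedDict


def lru_hit_rate(requests, budget_blocks):
    """Fraction of requested blocks found in an LRU cache holding budget_blocks blocks.

    requests: list of requests, each a list of hashable block IDs (hypothetical).
    budget_blocks: cache capacity in blocks (assumption: all blocks are equal size).
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    total = 0
    for blocks in requests:
        for block in blocks:
            total += 1
            if block in cache:
                hits += 1
                cache.move_to_end(block)  # mark as most recently used
            elif budget_blocks > 0:
                cache[block] = True
                if len(cache) > budget_blocks:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


# Toy trace: three requests sharing a two-block system prompt (made-up data).
trace = [
    ["sys0", "sys1", "a0"],
    ["sys0", "sys1", "b0"],
    ["sys0", "sys1", "a0"],
]

for budget in (0, 2, 3, 4):
    print(f"budget={budget} blocks, hit rate={lru_hit_rate(trace, budget):.2f}")
```

Sweeping the budget this way shows where the hit rate stops improving in the toy trace, which is the kind of curve a capacity estimate would be read from. A real analysis would replay production traces and account for block size, prefix matching, and the eviction policy actually deployed.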

KT-FT v0.6.1: Closing the Loop from MoE Fine-Tuning to Local Serving

May 29, 2026 KTransformers Team

KT-FT v0.6.1 connects MoE SFT and local SGLang serving into an end-to-end loop; split LoRA serving bridges KT expert and SGLang non-expert adapters for Qwen3.5 MoE.

KTransformers, Fine-Tuning, LLaMA-Factory, LoRA, MoE, SGLang

OpenClaw + Mooncake: A Stability Upgrade for Real-World Multi-Session Inference

Mar 19, 2026 Mooncake community

By integrating Mooncake into OpenClaw's real inference path, we not only improved fast-path latency, but also sharply reduced TTFT tail latency in multi-session, long-context workloads, turning a system that was usually fast but occasionally slow into one that feels consistently smooth.

Mooncake, OpenClaw, SGLang, TTFT, KVCache

Mooncake Joins the PyTorch Ecosystem

Feb 12, 2026 Mooncake community

Mooncake is now part of the PyTorch Ecosystem, complementing PyTorch-native LLM serving with high-performance disaggregated data transfer and storage.

Mooncake, PyTorch, LLM Serving

KTransformers + LLaMA-Factory + SGLang: Low-Cost Local Fine-Tuning and Inference

Nov 1, 2025 KTransformers Team

A low-cost, low-memory end-to-end fine-tuning and inference workflow for large MoE models with KTransformers, LLaMA-Factory, and SGLang.

KTransformers, LLaMA-Factory, Fine-Tuning, MoE, LoRA, Heterogeneous Computing, SGLang