<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Heterogeneous Computing | KVCache.ai</title>
    <link>https://kvcache.ai/tag/heterogeneous-computing/</link>
      <atom:link href="https://kvcache.ai/tag/heterogeneous-computing/index.xml" rel="self" type="application/rss+xml" />
    <description>Heterogeneous Computing</description>
    <generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 01 Nov 2025 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://kvcache.ai/media/logo.svg</url>
      <title>Heterogeneous Computing</title>
      <link>https://kvcache.ai/tag/heterogeneous-computing/</link>
    </image>
    
    <item>
      <title>KTransformers &#43; LLaMA-Factory &#43; SGLang: Low-Cost Local Fine-Tuning and Inference</title>
      <link>https://kvcache.ai/blog/ktransformers-llamafactory-fine-tuning/</link>
      <pubDate>Sat, 01 Nov 2025 00:00:00 +0000</pubDate>
      <guid>https://kvcache.ai/blog/ktransformers-llamafactory-fine-tuning/</guid>
      <description>&lt;p&gt;On a local workstation, the hard part of large-model experimentation is usually the cost of getting a large MoE model into the same loop as your data and evaluation target. A researcher may want to try a domain dataset, a product prototype, or a benchmark, but the model quickly turns into a GPU-memory problem. This guide presents KTransformers, LLaMA-Factory, and SGLang as a low-cost, low-memory end-to-end path: LoRA fine-tuning keeps a familiar training recipe, KTransformers relieves GPU memory pressure through GPU+CPU heterogeneous execution, and the adapted model carries on into inference and benchmark testing.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://typora-tuchuang-jimmy.oss-cn-beijing.aliyuncs.com/img/20251229170124823.png&#34;
alt=&#34;KTransformers, LLaMA-Factory, and SGLang local fine-tuning and inference pipeline&#34;
style=&#34;zoom:50%&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Inside that workflow, LLaMA-Factory sits at the user-facing orchestration layer: it owns dataset preparation, model templates, LoRA configuration, checkpoint output, and the first chat/API validation path. KTransformers plugs in underneath as the LoRA backend engine for Attention and MoE operators, moving memory-heavy expert computation into a GPU+CPU heterogeneous path while preserving the LLaMA-Factory interface. SGLang then takes the trained adapter into the inference side of the same end-to-end flow for batch inference and benchmark traffic.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://typora-tuchuang-jimmy.oss-cn-beijing.aliyuncs.com/img/20260525034705-ktransformers-ft-01-architecture.png&#34;
alt=&#34;KTransformers and LLaMA-Factory integration architecture&#34;
style=&#34;zoom:45%&#34;/&gt;&lt;/p&gt;
&lt;h2 id=&#34;why-this-integration-matters&#34;&gt;Why This Integration Matters&lt;/h2&gt;
&lt;p&gt;In the same LLaMA-Factory LoRA workflow, the KTransformers backend is the path that can handle ultra-large MoE models on commodity hardware. On DeepSeek-V2-Lite, it raises throughput from 303.58 to 530.38 tokens/s and cuts GPU memory from 32.12 GB to 6.08 GB relative to the HuggingFace backend. At DeepSeek-V3 scale, the default HuggingFace path cannot run in this 4090-class setting, while KTransformers keeps training feasible through heterogeneous placement.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;LoRA BF16 with NekoQA-10K stylized dialogue&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;HuggingFace backend&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Unsloth backend&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;KTransformers backend&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V2-Lite 14B throughput&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;303.58 tokens/s&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;455.37 tokens/s&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;530.38 tokens/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V2-Lite 14B GPU memory&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;32.12 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;9.64 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;6.08 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V3 671B throughput&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;Too large to run&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;Not supported&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;40.35 tokens/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V3 671B GPU memory, summed across GPUs&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;theoretical 1400 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;Not supported&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;70 GB measured peak&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
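&lt;p&gt;The 1400 GB figure is a theoretical estimate of the footprint needed to keep all 671B parameters resident in FP16 (2 bytes per parameter), not a measurement. The measured KTransformers number comes from keeping Attention on the GPU and offloading the MoE expert layers to the CPU and host memory.&lt;/p&gt;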
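&lt;p&gt;As a rough check on the theoretical figure, the sketch below multiplies parameter count by bytes per parameter. The helper name &lt;code&gt;resident_footprint_gb&lt;/code&gt; and the decimal-gigabyte convention are assumptions made for illustration; the estimate covers weights only and ignores activations, gradients, and optimizer state.&lt;/p&gt;

```python
# Back-of-the-envelope weight footprint: parameters times bytes per parameter.
# Assumption: decimal gigabytes (1 GB = 1e9 bytes); weights only, with no
# activations, gradients, or optimizer state.

BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "fp8": 1}


def resident_footprint_gb(num_params, dtype="fp16"):
    """Return the memory needed to keep all weights resident, in GB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    deepseek_v3_params = 671e9  # 671B parameters, as named in the table
    print(resident_footprint_gb(deepseek_v3_params, "fp16"))
```

&lt;p&gt;Under these assumptions the weights alone come to about 1342 GB, which is in the same range as the 1400 GB quoted in the table.&lt;/p&gt;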
&lt;p&gt;&lt;img src=&#34;https://typora-tuchuang-jimmy.oss-cn-beijing.aliyuncs.com/img/20260525034705-ktransformers-ft-02-backend-comparison.png&#34;
alt=&#34;Backend comparison by model scale&#34;
style=&#34;zoom:42%&#34;/&gt;&lt;/p&gt;
&lt;h2 id=&#34;fine-tuning-results&#34;&gt;Fine-Tuning Results&lt;/h2&gt;
&lt;p&gt;We validated the setup on three representative customization tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Stylized dialogue, using NekoQA-10K to make a model consistently answer in a recognizable persona.&lt;/li&gt;
&lt;li&gt;Translational-style generation, using an exaggerated Westernized translation tone.&lt;/li&gt;
&lt;li&gt;Medical question answering, using AfriMed-QA short-answer and multiple-choice tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For stylized dialogue, the fine-tuned model follows the target tone and address terms more consistently than the base model.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://typora-tuchuang-jimmy.oss-cn-beijing.aliyuncs.com/img/20260525034705-ktransformers-ft-03-stylized-dialogue.png&#34;
alt=&#34;Base model and fine-tuned model stylized dialogue comparison&#34;
style=&#34;zoom:45%&#34;/&gt;&lt;/p&gt;
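&lt;p&gt;For the translational-style task, both DeepSeek-V2-Lite and DeepSeek-V3 improve on every reported metric after KT-LoRA fine-tuning; BLEU-4, for example, rises from 2.89 to 11.18 on V2-Lite and from 0.96 to 11.49 on V3.&lt;/p&gt;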
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Translational-Style dataset&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;BLEU-1&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;BLEU-2&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;BLEU-3&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;BLEU-4&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;ROUGE-1&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;ROUGE-2&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;ROUGE-L&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;V2-Lite, no LoRA&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;20.66&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;8.33&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;4.54&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;2.89&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;22.71&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;4.52&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;19.19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KT-LoRA fine-tuned V2-Lite&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;35.41&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;22.44&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;15.42&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;11.18&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;42.03&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;18.38&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;33.10&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;V3 base, no LoRA&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;8.49&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;3.34&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;1.62&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;0.96&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;15.91&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;2.55&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;10.07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KT-LoRA fine-tuned V3&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;37.02&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;23.70&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;16.21&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;11.49&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;43.43&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;18.96&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;34.54&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For AfriMed-QA, KT-LoRA also improves both short-answer generation and multiple-choice accuracy: accuracy rises from 0.0645 to 0.4812 on V2-Lite and from 0.5833 to 0.7930 on V3.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AfriMed-QA short answer&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;BLEU-1&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;BLEU-2&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;BLEU-3&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;BLEU-4&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;ROUGE-1&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;ROUGE-2&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;ROUGE-L&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;V2-Lite, no LoRA&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;13.58&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;11.12&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;9.10&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;7.23&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;22.48&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;7.81&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;11.73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KT-LoRA fine-tuned V2-Lite&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;35.90&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;27.63&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;22.99&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;19.15&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;35.25&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;17.50&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;28.44&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;V3 base, no LoRA&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;12.75&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;10.27&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;8.05&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;5.99&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;20.33&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;5.65&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;10.11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KT-LoRA fine-tuned V3&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;42.42&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;34.12&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;28.95&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;24.54&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;41.97&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;22.37&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;33.28&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AfriMed-QA multiple choice&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;V2-Lite, no LoRA&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;0.0645&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KT-LoRA fine-tuned V2-Lite&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;0.4812&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;V3 base, no LoRA&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;0.5833&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KT-LoRA fine-tuned V3&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;0.7930&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These are representative small-scale evaluations rather than a complete scaling-law study. The main takeaway is about resource cost: under the same LLaMA-Factory workflow, KTransformers makes LoRA adaptation feasible for MoE models that would otherwise exceed workstation GPU memory.&lt;/p&gt;
&lt;h2 id=&#34;quick-start-may-be-outdated-please-refer-to-the-newest-blog&#34;&gt;Quick Start [May be outdated, please refer to the newest blog]&lt;/h2&gt;
&lt;p&gt;Use sections 1, 2, and 5 if you only need inference. Use sections 1 through 5 if you want the full LoRA fine-tuning and inference loop.&lt;/p&gt;
&lt;h3 id=&#34;1-hardware-requirements&#34;&gt;1. Hardware Requirements&lt;/h3&gt;
&lt;p&gt;Start from the job you want to run:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For inference only, CPU requirements are lighter, but host memory still determines how large a model you can hold.&lt;/li&gt;
&lt;li&gt;For KT LoRA fine-tuning, the CPU must support Intel AMX. Check with &lt;code&gt;lscpu | grep -i amx || true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;GPU memory determines how many experts can stay resident on the GPU for speed. KTransformers lets you trade GPU residency for host memory and CPU compute through placement rules.&lt;/li&gt;
&lt;/ul&gt;
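&lt;p&gt;The AMX check can also be scripted, for example as a preflight step before launching a long fine-tuning job. The sketch below is a minimal illustration, assuming a Linux host that reports AMX through the &lt;code&gt;amx_tile&lt;/code&gt;, &lt;code&gt;amx_bf16&lt;/code&gt;, and &lt;code&gt;amx_int8&lt;/code&gt; flags in &lt;code&gt;/proc/cpuinfo&lt;/code&gt;; the function name &lt;code&gt;amx_flags_in&lt;/code&gt; is hypothetical and not part of KTransformers.&lt;/p&gt;

```python
# Detect Intel AMX support from Linux CPU flags.
# Assumption: Linux exposes AMX as the flags amx_tile, amx_bf16, amx_int8
# in /proc/cpuinfo; other operating systems need a different check.

AMX_FLAGS = {"amx_tile", "amx_bf16", "amx_int8"}


def amx_flags_in(cpuinfo_text):
    """Return the sorted AMX flags found in /proc/cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Each flags line looks like "flags : fpu vme ... amx_tile".
            _, _, value = line.partition(":")
            found.update(AMX_FLAGS.intersection(value.split()))
    return sorted(found)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            flags = amx_flags_in(handle.read())
    except OSError:
        flags = []
    print("AMX flags:", ", ".join(flags) if flags else "none found")
```

&lt;p&gt;An empty result suggests the CPU does not advertise AMX, in which case the KT LoRA fine-tuning path described here is unlikely to work on that machine.&lt;/p&gt;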
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;KT inference, rough starting point&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;KT fine-tuning reference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V2-Lite-14B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;3 GB GPU + 15 GB host memory&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;6 GB GPU + 30 GB host memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-30B-A3B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;3 GB GPU + 30 GB host memory&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;5 GB GPU + 60 GB host memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-235B-A22B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;9 GB GPU + 225 GB host memory&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;18 GB GPU + 450 GB host memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V3-671B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;35 GB GPU + 0.65 TB host memory&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;70 GB GPU + 1.3 TB host memory&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
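&lt;p&gt;To compare a machine against these reference points, the table can be encoded as a small lookup. This is only a sketch: the figures are copied from the table above, the helper name &lt;code&gt;shortfall_gb&lt;/code&gt; and the example workstation are hypothetical, and the numbers remain rough starting points that do not guarantee a given configuration will run.&lt;/p&gt;

```python
# Reference figures copied from the table above: (GPU GB, host memory GB).
# Assumption: 0.65 TB and 1.3 TB are written as 650 GB and 1300 GB.
REFERENCE_GB = {
    "DeepSeek-V2-Lite-14B": {"inference": (3, 15), "finetune": (6, 30)},
    "Qwen3-30B-A3B": {"inference": (3, 30), "finetune": (5, 60)},
    "Qwen3-235B-A22B": {"inference": (9, 225), "finetune": (18, 450)},
    "DeepSeek-V3-671B": {"inference": (35, 650), "finetune": (70, 1300)},
}


def shortfall_gb(model, job, gpu_gb, host_gb):
    """Return (GPU, host) GB still missing for the job; 0 means enough."""
    need_gpu, need_host = REFERENCE_GB[model][job]
    return max(0, need_gpu - gpu_gb), max(0, need_host - host_gb)


if __name__ == "__main__":
    # Hypothetical workstation: 24 GB of GPU memory, 512 GB of host memory.
    print(shortfall_gb("Qwen3-235B-A22B", "finetune", 24, 512))
    print(shortfall_gb("DeepSeek-V3-671B", "finetune", 24, 512))
```

&lt;p&gt;For the hypothetical workstation, the Qwen3-235B-A22B fine-tuning reference is met, while DeepSeek-V3-671B fine-tuning is short by 46 GB of GPU memory and 788 GB of host memory. The DeepSeek-V3 GPU figure is a total summed across GPUs, so compare it with the combined memory of all cards.&lt;/p&gt;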
&lt;h3 id=&#34;2-environment-and-model-preparation&#34;&gt;2. Environment and Model Preparation&lt;/h3&gt;
&lt;p&gt;Install the three layers used in this workflow: KTransformers for heterogeneous execution, SGLang for serving, and LLaMA-Factory for recipe-style fine-tuning.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# KTransformers inference kernel path.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/kvcache-ai/ktransformers.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; ktransformers
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; kt-kernel
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./install.sh
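&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; ../..
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;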
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# SGLang branch used with KTransformers.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/kvcache-ai/sglang.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; sglang
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -e &lt;span class=&#34;s2&#34;&gt;&amp;#34;python[all]&amp;#34;&lt;/span&gt;
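&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; ..
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;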
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# LLaMA-Factory.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/hiyouga/LLaMA-Factory.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; LLaMA-Factory
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -e &lt;span class=&#34;s2&#34;&gt;&amp;#34;.[torch,metrics]&amp;#34;&lt;/span&gt; --no-build-isolation
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# KTransformers fine-tuning dependencies.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Prefer matched wheels when available to avoid local compilation.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Match Python, PyTorch, CUDA, and ABI with your machine.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install ktransformers-0.4.2+cu128torch27fancy-cp311-cp311-linux_x86_64.whl
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install &lt;span class=&#34;nv&#34;&gt;transformers&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;4.56.0
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Use BF16 model weights for KT fine-tuning. DeepSeek-V3-671B is often distributed in FP8 form, so download a BF16 checkpoint directly or convert FP8 weights before training.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -U &lt;span class=&#34;nv&#34;&gt;huggingface_hub&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;0.34.0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;huggingface-cli download --resume-download &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  Qwen/Qwen3-235B-A22B-Instruct-2507 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --local-dir /path/to/Qwen3-235B-A22B-Instruct-2507-BF16
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;3-lora-fine-tuning-with-ktransformers&#34;&gt;3. LoRA Fine-Tuning with KTransformers&lt;/h3&gt;
&lt;p&gt;The training command stays compact. Most experiment changes should live in the LLaMA-Factory YAML.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; LLaMA-Factory
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;USE_KT&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; llamafactory-cli train examples/train_lora/qwen3moe_lora_sft_kt.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The important KT fields are &lt;code&gt;use_kt&lt;/code&gt;, &lt;code&gt;kt_optimize_rule&lt;/code&gt;, &lt;code&gt;cpu_infer&lt;/code&gt;, and &lt;code&gt;chunk_size&lt;/code&gt;. Choose an &lt;code&gt;*-sft-*&lt;/code&gt; optimize rule that matches your model, CPU backend, and GPU count.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c&#34;&gt;### model&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;model_name_or_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;/path/to/Qwen3-235B-A22B-Instruct-2507-BF16&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;template&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;qwen3&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;c&#34;&gt;### method&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;stage&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;sft&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;do_train&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;finetuning_type&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;lora&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;lora_rank&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;lora_alpha&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;32&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;lora_dropout&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0.1&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;lora_target&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;all&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;c&#34;&gt;### dataset&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;dataset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;identity, alpaca_en_demo&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;cutoff_len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2048&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;max_samples&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;100000&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;overwrite_cache&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;preprocessing_num_workers&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;16&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;dataloader_num_workers&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;c&#34;&gt;### output&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;output_dir&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;saves/qwen3moe_lora_sft_kt&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;logging_steps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;save_steps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;500&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;plot_loss&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;overwrite_output_dir&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;save_only_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;false&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;report_to&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;none&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;c&#34;&gt;### train&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;learning_rate&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1.0e-4&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;num_train_epochs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;3.0&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;lr_scheduler_type&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;cosine&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;warmup_ratio&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0.1&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;bf16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;ddp_timeout&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;180000000&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;c&#34;&gt;### ktransformers&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;use_kt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;kt_optimize_rule&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;examples/kt_optimize_rules/&amp;lt;model&amp;gt;-sft-amx-&amp;lt;gpu-count&amp;gt;.yaml&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;cpu_infer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;64&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;chunk_size&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2048&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Training writes LoRA adapter artifacts to &lt;code&gt;output_dir&lt;/code&gt;, usually as safetensors weights plus adapter metadata. That directory is reused by the inference steps below.&lt;/p&gt;
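&lt;p&gt;Before moving on, it can help to confirm that the adapter directory looks complete. The sketch below is a minimal check, assuming the PEFT default file names &lt;code&gt;adapter_config.json&lt;/code&gt; and &lt;code&gt;adapter_model.safetensors&lt;/code&gt;. The helper name &lt;code&gt;missing_adapter_files&lt;/code&gt; is hypothetical, and your LLaMA-Factory version may write additional files.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;from pathlib import Path

# Assumed PEFT default file names; adjust if your setup writes different ones.
EXPECTED_FILES = (&#34;adapter_config.json&#34;, &#34;adapter_model.safetensors&#34;)


def missing_adapter_files(output_dir):
    # Return the expected adapter files that are absent from output_dir.
    root = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == &#34;__main__&#34;:
    # Same output_dir as in the training YAML above.
    missing = missing_adapter_files(&#34;saves/qwen3moe_lora_sft_kt&#34;)
    print(&#34;missing files:&#34;, missing or &#34;none&#34;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;An empty result only means the expected files exist; it does not validate the adapter weights themselves.&lt;/p&gt;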
&lt;h3 id=&#34;4-quick-validation-with-llama-factory&#34;&gt;4. Quick Validation with LLaMA-Factory&lt;/h3&gt;
&lt;p&gt;Right after fine-tuning, use LLaMA-Factory for a few interactive checks. This path is meant to confirm that the adapter loads and that the target behavior appears; it is not the fastest serving path.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; LLaMA-Factory
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llamafactory-cli chat examples/inference/qwen3moe_lora_sft_kt.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The inference YAML should point to the base model and the adapter directory, set &lt;code&gt;infer_backend: ktransformers&lt;/code&gt;, and use the inference KT optimize rule that matches the training setup: same model, same CPU backend, and same GPU count.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;model_name_or_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;/path/to/Qwen3-235B-A22B-Instruct-2507-BF16&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;adapter_name_or_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;saves/qwen3moe_lora_sft_kt&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;template&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;qwen3&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;infer_backend&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;ktransformers&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;use_kt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;kt_optimize_rule&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;examples/kt_optimize_rules/&amp;lt;model&amp;gt;-infer-amx-&amp;lt;gpu-count&amp;gt;.yaml&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;cpu_infer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;64&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;chunk_size&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2048&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For batch evaluation through the same LLaMA-Factory stack, launch its API server with the same config:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;API_PORT&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8000&lt;/span&gt; llamafactory-cli api examples/inference/qwen3moe_lora_sft_kt.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;5-faster-serving-and-benchmarking-with-sglang&#34;&gt;5. Faster Serving and Benchmarking with SGLang&lt;/h3&gt;
&lt;p&gt;For benchmark runs or application-facing APIs, use SGLang with KT enabled. The serving path has three steps: convert the LoRA adapter, optionally quantize CPU-side weights, then launch the server with KT and LoRA flags.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; sglang
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python convert_lora.py &amp;lt;YOUR_LORA_ADAPTER_PATH&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; ktransformers/kt-kernel
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python scripts/convert_cpu_weights.py &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --input-path &amp;lt;PATH_TO&amp;gt;/Qwen3-30B-A3B-Instruct-2507 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --input-type bf16 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --output &amp;lt;PATH_TO&amp;gt;/Qwen3-30B-A3B-Instruct-2507-INT8 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --quant-method int8
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python -m sglang.launch_server &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --host 0.0.0.0 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --port &lt;span class=&#34;m&#34;&gt;10103&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --model &amp;lt;PATH_TO&amp;gt;/Qwen3-30B-A3B-Instruct-2507 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --mem-fraction-static 0.7 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --chunked-prefill-size &lt;span class=&#34;m&#34;&gt;2048&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --served-model-name Qwen3-30B-A3B-Instruct-2507 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --tensor-parallel-size &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --kt-method AMXINT8 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --kt-weight-path &amp;lt;PATH_TO&amp;gt;/Qwen3-30B-A3B-Instruct-2507-INT8 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --kt-cpuinfer &lt;span class=&#34;m&#34;&gt;64&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --kt-threadpool-count &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --kt-num-gpu-experts &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --enable-lora &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --lora-paths &lt;span class=&#34;nv&#34;&gt;lora0&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&amp;lt;YOUR_ADAPTER_PATH&amp;gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --max-loras-per-batch &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  --lora-backend triton
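&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For base-model-only inference, remove the four LoRA-related flags at the end of the command (&lt;code&gt;--enable-lora&lt;/code&gt;, &lt;code&gt;--lora-paths&lt;/code&gt;, &lt;code&gt;--max-loras-per-batch&lt;/code&gt;, and &lt;code&gt;--lora-backend&lt;/code&gt;), along with the trailing backslash on the line before them. For Kimi K2, MiniMax M2/M2.1, and other newer models, follow the corresponding KTransformers V0.5.0 or later instructions when native FP8 or INT4 inference is required.&lt;/p&gt;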
&lt;p&gt;&lt;img src=&#34;https://typora-tuchuang-jimmy.oss-cn-beijing.aliyuncs.com/img/20251224165610619.png&#34;
alt=&#34;SGLang server running with KTransformers&#34;
style=&#34;zoom:50%&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Once the SGLang server is running, call it through the OpenAI-compatible API:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;openai&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;OpenAI&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;client&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;OpenAI&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;base_url&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;http://localhost:10103/v1&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api_key&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;EMPTY&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;resp&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;client&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;completions&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Qwen3-30B-A3B-Instruct-2507&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Write quicksort in C++, Python, and Rust.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;max_tokens&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;256&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;resp&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;choices&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;kt-tuning-knobs&#34;&gt;KT Tuning Knobs&lt;/h3&gt;
&lt;p&gt;For fine-tuning, start by changing &lt;code&gt;kt_optimize_rule&lt;/code&gt;. Rule names usually encode the model family, whether the rule is for SFT, the CPU backend such as AMX, and the GPU count. In the LLaMA-Factory YAML, only four KT fields normally need user-side adjustment: &lt;code&gt;use_kt&lt;/code&gt;, &lt;code&gt;kt_optimize_rule&lt;/code&gt;, &lt;code&gt;cpu_infer&lt;/code&gt;, and &lt;code&gt;chunk_size&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For SGLang serving, reduce memory pressure in this order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Lower &lt;code&gt;--chunked-prefill-size&lt;/code&gt; if prefill runs out of memory.&lt;/li&gt;
&lt;li&gt;Lower &lt;code&gt;--max-running-requests&lt;/code&gt; if decode runs out of memory.&lt;/li&gt;
&lt;li&gt;Reduce &lt;code&gt;--kt-num-gpu-experts&lt;/code&gt; when GPU-resident experts are too expensive.&lt;/li&gt;
&lt;li&gt;Quantize CPU weights to INT8 when host memory or bandwidth is tight.&lt;/li&gt;
&lt;li&gt;Tune &lt;code&gt;--mem-fraction-static&lt;/code&gt; for the target benchmark workload.&lt;/li&gt;
&lt;/ol&gt;
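&lt;p&gt;The tuning order above can be captured in a small launcher helper. This is only a sketch: &lt;code&gt;build_launch_args&lt;/code&gt; and its parameter names are hypothetical, the default values are illustrative rather than recommended, and only flags that already appear in this guide are used. The remaining KT and LoRA flags from the launch command are omitted for brevity.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;import shlex


def build_launch_args(model_path, chunked_prefill=2048, max_running=None,
                      gpu_experts=1, mem_fraction=0.7):
    # Assemble the argv for sglang.launch_server from the memory-related knobs.
    args = [
        &#34;python&#34;, &#34;-m&#34;, &#34;sglang.launch_server&#34;,
        &#34;--model&#34;, model_path,
        &#34;--chunked-prefill-size&#34;, str(chunked_prefill),
        &#34;--kt-num-gpu-experts&#34;, str(gpu_experts),
        &#34;--mem-fraction-static&#34;, str(mem_fraction),
    ]
    if max_running is not None:
        # Cap concurrent requests only when decode runs out of memory.
        args += [&#34;--max-running-requests&#34;, str(max_running)]
    return args


if __name__ == &#34;__main__&#34;:
    # Example: halve the prefill chunk after a prefill OOM, then cap decode concurrency.
    argv = build_launch_args(&#34;/path/to/model&#34;, chunked_prefill=1024, max_running=8)
    print(shlex.join(argv))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Keeping the knobs in one function makes it easier to change one setting at a time and record which change removed the OOM.&lt;/p&gt;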
&lt;h2 id=&#34;performance-and-memory&#34;&gt;Performance and Memory&lt;/h2&gt;
&lt;p&gt;The reported experiments use 16 gradient-accumulation steps (&lt;code&gt;GAS=16&lt;/code&gt;) and a sequence length of 512 tokens (&lt;code&gt;qlen=512&lt;/code&gt;), so each optimizer step processes 16 × 512 = 8192 tokens.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Step time&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Tokens per step&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V3 671B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;203 s&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;8192&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;40.35 tokens/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V2-Lite 14B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;36 s&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;8192&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;&lt;strong&gt;227.6 tokens/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
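&lt;p&gt;The throughput column follows directly from the step time. As a quick check, the sketch below recomputes it from the figures in the table, assuming one 512-token sequence per micro-batch. The variable and function names are illustrative.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;GAS = 16    # gradient accumulation steps
QLEN = 512  # tokens per sequence

tokens_per_step = GAS * QLEN  # tokens processed per optimizer step


def throughput(step_seconds):
    # Tokens processed per second for one optimizer step.
    return tokens_per_step / step_seconds


if __name__ == &#34;__main__&#34;:
    # Step times taken from the table above.
    for name, seconds in [(&#34;DeepSeek-V3 671B&#34;, 203), (&#34;DeepSeek-V2-Lite 14B&#34;, 36)]:
        print(name, round(throughput(seconds), 2), &#34;tokens/s&#34;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;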
&lt;p&gt;The measured memory footprint is:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;GPU memory&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Host memory&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V3 671B, 58 MoE layers out of 61&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;about &lt;strong&gt;70 GB&lt;/strong&gt; total GPU memory&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;about &lt;strong&gt;1.2-1.3 TB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V2-Lite 14B, 26 MoE layers out of 27&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;about &lt;strong&gt;5.5 GB&lt;/strong&gt; GPU memory&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;about &lt;strong&gt;150 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;technical-notes&#34;&gt;Technical Notes&lt;/h2&gt;
&lt;p&gt;The following section condenses the original Developer Technical Notes. Blocks marked &lt;strong&gt;Deprecated in V2 Current&lt;/strong&gt; describe earlier implementation details kept only as historical context.&lt;/p&gt;
&lt;h3 id=&#34;attention-with-lora&#34;&gt;Attention with LoRA&lt;/h3&gt;
&lt;p&gt;KTransformers provides operator injection through &lt;code&gt;BaseInjectedModule&lt;/code&gt;, while PEFT provides LoRA layer insertion. For fine-tuning, the integration uses a &lt;code&gt;KTransformersLinearLora&lt;/code&gt; layer that inherits from both the KT linear path and the LoRA layer path.&lt;/p&gt;
&lt;p&gt;This keeps KT&amp;rsquo;s fast &lt;code&gt;prefill_linear&lt;/code&gt; and &lt;code&gt;generate_linear&lt;/code&gt; paths while adding trainable LoRA matrices. During preparation, Q/K/V/O linear layers are replaced so that the Attention block remains optimized but becomes LoRA-trainable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://typora-tuchuang-jimmy.oss-cn-beijing.aliyuncs.com/img/20260525034705-ktransformers-ft-06-attention-lora.png&#34;
alt=&#34;Attention LoRA replacement in KTransformers&#34;
style=&#34;zoom:45%&#34;/&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://typora-tuchuang-jimmy.oss-cn-beijing.aliyuncs.com/img/20260525034705-ktransformers-ft-07-linear-lora.png&#34;
alt=&#34;KTransformersLinearLora structure&#34;
style=&#34;zoom:45%&#34;/&gt;&lt;/p&gt;
&lt;h3 id=&#34;moe-as-a-differentiable-backend-operator&#34;&gt;MoE as a Differentiable Backend Operator&lt;/h3&gt;
&lt;p&gt;MoE parameters dominate the model size, but MoE compute is sparse. KTransformers encapsulates expert computation as a differentiable black-box operator: upstream, PyTorch sees a compact autograd node; downstream, pybind11 calls C++ kernels for forward and backward.&lt;/p&gt;
&lt;p&gt;That backend can be selected through config. The evaluated paths include AMX BF16/INT8 and llamafile-style CPU kernels.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://typora-tuchuang-jimmy.oss-cn-beijing.aliyuncs.com/img/20260525034705-ktransformers-ft-08-moe-autograd.png&#34;
alt=&#34;MoE autograd encapsulation&#34;
style=&#34;zoom:45%&#34;/&gt;&lt;/p&gt;
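&lt;p&gt;The general pattern can be illustrated with a custom &lt;code&gt;torch.autograd.Function&lt;/code&gt;. The sketch below is not KTransformers code: &lt;code&gt;ExpertLinear&lt;/code&gt; is a hypothetical single-matrix stand-in for an expert, and plain PyTorch matmuls stand in for the C++ kernels that the real operator reaches through pybind11.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;import torch


class ExpertLinear(torch.autograd.Function):
    # Hypothetical stand-in: y = x @ W^T with a hand-written backward.

    @staticmethod
    def forward(ctx, x, weight):
        # A real backend would call into a C++ kernel here.
        ctx.save_for_backward(x, weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_x = grad_out @ weight   # gradient with respect to the input
        grad_w = grad_out.t() @ x    # gradient with respect to the weight
        return grad_x, grad_w


if __name__ == &#34;__main__&#34;:
    x = torch.randn(4, 8, dtype=torch.double, requires_grad=True)
    w = torch.randn(16, 8, dtype=torch.double, requires_grad=True)
    # gradcheck compares the hand-written backward with numerical gradients.
    print(torch.autograd.gradcheck(ExpertLinear.apply, (x, w)))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;From PyTorch&amp;rsquo;s point of view the whole operator is a single autograd node, regardless of where the forward and backward computations actually run.&lt;/p&gt;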
&lt;h3 id=&#34;moe-backward-cpu-deprecated-in-v2-current&#34;&gt;MoE Backward (CPU) (Deprecated in V2 Current)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Deprecated in V2 Current.&lt;/strong&gt; According to the original technical notes, the MoE backward pass frequently needs the transposed weights $W^\top$. To avoid repeated runtime transposes, the earlier implementation precomputed $W^\top$ at load time and kept it cached. That stored transposed copy is deprecated in the current V2 and is described here only as historical context.&lt;/p&gt;
&lt;p&gt;The original notes also describe caching necessary intermediate activations, such as expert projections, to reuse in backward and reduce recomputation. Treat this subsection as historical unless it is re-verified against the current V2 implementation.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://typora-tuchuang-jimmy.oss-cn-beijing.aliyuncs.com/img/20260525034705-ktransformers-ft-09-moe-backward-cache.png&#34;
alt=&#34;MoE backward cache and transposed weights&#34;
style=&#34;zoom:45%&#34;/&gt;&lt;/p&gt;
&lt;h3 id=&#34;multi-gpu-loadingtraining-placement-strategy-instead-of-dataparallel-deprecated-in-v2-current&#34;&gt;Multi-GPU Loading/Training: Placement Strategy Instead of DataParallel (Deprecated in V2 Current)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Deprecated in V2 Current.&lt;/strong&gt; The &lt;code&gt;KTrainer&lt;/code&gt;, explicit placement, and DataParallel-avoidance details in this subsection reflect the original Developer Technical Notes and are not the current V2 behavior. They are preserved only as historical context.&lt;/p&gt;
&lt;p&gt;In the original notes, the multi-GPU strategy was explicit placement plus model parallelism:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Deprecated in V2 Current:&lt;/strong&gt; &lt;code&gt;KTrainer&lt;/code&gt; prevents the entire model from being moved to one GPU.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deprecated in V2 Current:&lt;/strong&gt; Layers are constructed directly on target devices according to the KT config.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deprecated in V2 Current:&lt;/strong&gt; Automatic DataParallel wrappers are disabled when the KT path is active.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deprecated in V2 Current:&lt;/strong&gt; Gradients are reduced where needed, while intermediate activations stay local as much as possible.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Deprecated in V2 Current.&lt;/strong&gt; The original notes describe this as keeping Attention and KV-related work on GPUs while MoE experts are placed on CPU and accelerated there, reducing per-GPU memory pressure without changing the user-facing LLaMA-Factory training flow. This specific placement/trainer description is deprecated in V2 current.&lt;/p&gt;
&lt;h2 id=&#34;limitations&#34;&gt;Limitations&lt;/h2&gt;
&lt;p&gt;The evaluation above is scoped around the low-memory training and inference path. Most measurements use single datasets and relatively small fine-tuning sets, usually no more than 20k examples. They show that LoRA adaptation can run under constrained hardware, but they are not a full study of generalization, scaling laws, multi-seed variance, or multilingual robustness.&lt;/p&gt;
&lt;p&gt;We welcome additional community results, especially when they include the KT config, dataset samples, training/evaluation YAMLs, GPU memory, CPU memory, CPU model, and backend details. These details make performance numbers easier to compare and more useful for other developers.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;KTransformers, LLaMA-Factory, and SGLang turn ultra-large MoE adaptation into a low-cost, low-memory workflow that runs end to end: LLaMA-Factory keeps training recipes familiar, LoRA keeps adaptation lightweight, KTransformers supplies heterogeneous placement and optimized Attention/MoE operators, and SGLang carries the inference path for benchmark or application traffic.&lt;/p&gt;
&lt;p&gt;For smaller MoE models, the same path reduces GPU memory and improves throughput. For 671B-scale MoE models, it gives users a low-memory route where default full-GPU training is out of reach.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
