<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://blog.huikang.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="http://blog.huikang.dev/" rel="alternate" type="text/html" /><updated>2026-05-17T07:44:00+00:00</updated><id>http://blog.huikang.dev/feed.xml</id><title type="html">Huikang’s blog</title><subtitle>Writing AGI fanfiction</subtitle><author><name>Tong Hui Kang</name></author><entry><title type="html">All coding models will be interaction models</title><link href="http://blog.huikang.dev/2026/05/15/coding-interaction.html" rel="alternate" type="text/html" title="All coding models will be interaction models" /><published>2026-05-15T00:00:00+00:00</published><updated>2026-05-15T00:00:00+00:00</updated><id>http://blog.huikang.dev/2026/05/15/coding-interaction</id><content type="html" xml:base="http://blog.huikang.dev/2026/05/15/coding-interaction.html"><![CDATA[<p>Thinking Machines released <a href="https://thinkingmachines.ai/blog/interaction-models/">interaction models</a> earlier this week (One year ago, I called them <a href="/2025/05/14/multichannel-prediction.html">multichannel models</a>).</p>

<p>I argue that all frontier coding harnesses will soon be using only interaction models.</p>

<h2 id="what-is-an-interaction-model">What is an interaction model</h2>

<p>Current frontier models (Opus 4.7, GPT-5.5) that we interact with are non-interaction models.
They have one output stream.
They are next-token predictors - they write one token at a time, each token conditioned on everything that came before, and have to finish one thought before starting another.
To do two things, they have to do one and then the other.</p>

<p>An interaction model runs many channels of input and output in parallel, instead of one stream that takes turns.
It can think on one channel while it writes to you on another, while it watches a tool’s output on a third, while it edits a file on a fourth.
The streams are concurrent, not sequential.</p>

<p>On the input side, an interaction model is not waiting for a turn either.
It reads your messages as you type them, terminal output as it streams, file changes as they happen, webhook events as they arrive.
None of these block the others.</p>

<p>The shape comes from how humans work.
A person on a coding task is reading the screen, hearing a coworker, typing, thinking about what to do next, and watching a test suite stream - all at the same time.
An interaction model is the same shape.</p>

<p>Thinking Machines’ <a href="https://thinkingmachines.ai/blog/interaction-models/">post</a> has video demos that make this concrete - <a href="https://www.youtube.com/watch?v=Ys6i_MGnjUA">seamless dialog management</a>, <a href="https://www.youtube.com/watch?v=n2GXGjy41HQ">visual interjection</a>, <a href="https://www.youtube.com/watch?v=2ky5MXBvZP8">simultaneous speech</a>, and <a href="https://www.youtube.com/watch?v=ly3GtaiRFyo">simultaneous tool calls and search</a>.
The demos are voice-and-video rather than coding, but the same multi-stream idea applies.</p>

<h2 id="think-of-how-you-interact-with-ai-coding-tools">Think of how you interact with AI coding tools</h2>

<p>You ask the coding agent to add an “Export to CSV” button to your analytics dashboard.
For each phase of the session, I describe what it is today, and what it should be.</p>

<h4 id="listening"><strong>Listening</strong></h4>

<p><em>What it is.</em>
You ask the model something like “add Export to CSV to the dashboard”.
You can send immediately, or refine your statement in full (which dashboard, what is the time limit<sup id="fnref:banks" role="doc-noteref"><a href="#fn:banks" class="footnote" rel="footnote">1</a></sup>) before sending.
You may be dictating through a speech-to-text tool like Wispr Flow, or macOS dictation.<sup id="fnref:voice" role="doc-noteref"><a href="#fn:voice" class="footnote" rel="footnote">2</a></sup>
Whatever you type, delete, or rephrase before sending is invisible to the model.
The model only sees the final message when you hit send, and starts working from there.</p>

<p><em>What it should be.</em>
You ask the model something like “add Export to CSV to the dashboard”.
When you complete the first phrase, the agent starts to research your code.
As you modify and elaborate more, it steers the research in real time.
By the time you complete your multi-sentence request, the model has already made significant progress in the research.
The agent can already ask useful follow-up questions for your request.</p>

<h4 id="researching"><strong>Researching</strong></h4>

<p><em>What it is.</em>
The agent starts with a general instruction on what to do “add Export to CSV to the dashboard”.
Claude Code could decide to search the codebase in parallel with <a href="https://code.claude.com/docs/en/sub-agents">Explore</a> subagents.
Claude Code writes the instruction up front for the subagent.
One subagent explores the dashboard component.
Another subagent explores the data query.
Another subagent explores the design components.
However, the main agent writes the instruction up front, waits for the subagent to return, and reads only its summary - it does not see what exactly the subagent saw.
Claude Code cannot steer subagents mid-flight.
Information learnt from one subagent cannot influence the research process of another subagent.
This slows the overall research process, and is likely incomplete because subagents do not talk to each other.</p>

<p><em>What it should be.</em>
The agent starts with a general instruction on what to do “add Export to CSV to the dashboard”.
The agent immediately starts multiple channels that investigate the different components.
One channel explores the dashboard component.
You do not waste tokens explaining the situation to the subagent.
Another channel, with the same prefix, explores the data query.
Another channel, with the same prefix, explores the design components.
Information learnt from one channel is immediately shared with another channel.
With each channel informed of how the other channels are doing, the research process is faster and more complete.
For example, if it is discovered that we have similar data export functions for chat history, this information helps to inform the design components to use and the data queries to make.
When the research is done, you also do not waste tokens writing the summary.</p>

<h4 id="aligning"><strong>Aligning</strong></h4>

<p><em>What it is.</em>
There are design decisions involved in a simple button to export a CSV.
Do we give a choice to the user on what to export?
Is exporting instantaneous, or is the user required to check back after an hour or so for their data?
These are questions you need to ask the user.
There is a tradeoff on whether you want to ask the questions early, or whether you want to do your research first before asking the questions.
There is a tradeoff on whether to even ask the question, because Claude Code currently does not work in the background when questions need to be answered.</p>

<p><em>What it should be.</em>
The agent should not need to make these tradeoffs.
The agent could ask questions as early as they can while working on the research in the background.
The agent could retract questions if they have found the answer in the resources (for example, there are data exports that are not instantaneous for data requests of smaller sizes).
All responses to the agent will immediately influence the research.</p>

<h4 id="steering"><strong>Steering</strong></h4>

<p><em>What it is.</em>
The agent is already working - it has drafted the button, wired up the export handler, and is running a first export to see the output.
Halfway through, you notice that the button text is not visible in dark mode.
You type a correction into the chat box.
Your message is queued until the next tool boundary - it feels like the agent is stonewalling you.
You can interrupt the agent to get your queries immediately answered, but this discards the agent’s current progress.</p>

<p><em>What it should be.</em>
You point out that the text is not visible in dark mode.
The model reads your message immediately and acknowledges your comment.
The planning channel updates to include the new constraint.
The implementation channel will pick it up at the next available opportunity.
You do not have to wait for a turn boundary, and you do not feel stonewalled.</p>

<h4 id="approving"><strong>Approving</strong></h4>

<p><em>What it is.</em>
The agent wants to run the export against the production database to validate it on real data, and it needs your approval to do so.
The agent halts and surfaces the approval prompt.
You approve or deny<sup id="fnref:auto-mode" role="doc-noteref"><a href="#fn:auto-mode" class="footnote" rel="footnote">3</a></sup>.
Everything else the agent was doing - drafting the button, type-checking the handler - stops too.
The model is single-stream, so a pending approval blocks all the work.</p>

<p><em>What it should be.</em>
The agent surfaces the approval to run the export against production on a dedicated approval channel.
Only the export channel pauses.
The other channels keep running - the agent continues drafting the button code and refining the export handler while you decide.
If you approve, the paused channel resumes and the export runs.</p>

<h4 id="testing"><strong>Testing</strong></h4>

<p><em>What it is.</em>
Testing follows a linear process.
Your dashboard has hundreds of existing tests that programmatically test each component.
You need to choose a testing setup - do you stop tests on the first failure, or do you continue to run all the tests?
If the agent stops tests on the first failure, the agent will not be aware of the other tests that will fail and the agent will need multiple round trips to fix.
If the agent lets all the tests run, the agent will not be able to fix the first failure as soon as it can.<sup id="fnref:monitor" role="doc-noteref"><a href="#fn:monitor" class="footnote" rel="footnote">4</a></sup></p>

<p><em>What it should be.</em>
The agent starts testing and there is one input channel dedicated to listening for errors.
There is an output channel that surfaces testing issues.
If there is indeed a failure, the channel working on the code will be informed and it will be expected to investigate and fix the failure.
Tests can continue to run so that if there are more test failures, they will be surfaced to the agent.
We get both early fixes to failures, and reduced round trips between fixing and testing.</p>

<h4 id="compacting"><strong>Compacting</strong></h4>

<p><em>What it is.</em>
The inference infrastructure is not wired to generate tokens after a certain context length.
Models are also not trained to generate tokens after a certain context length.
Agents need to compact their context before more tokens can be generated. <sup id="fnref:million-token-context" role="doc-noteref"><a href="#fn:million-token-context" class="footnote" rel="footnote">5</a></sup>
The agent cannot do anything when it is compacting.</p>

<p><em>What it should be.</em>
Compacting should be done in parallel.
As the agent works on the problem, there should be another channel that decides which information is worth storing and which information should be removed from context.
Information removed from context should still be searchable from any channel.</p>

<h4 id="improving"><strong>Improving</strong></h4>

<p><em>What it is.</em>
After you ship your feature, you want to improve your future experience working with the model.
You write and improve skills that help you do your work more efficiently.
For example, when testing the dashboard, the agent should remember to try both light and dark mode and confirm visibility of every text element.
The agent will need to search their history and correctly surface pain points that could have been informed with skills.</p>

<p><em>What it should be.</em>
Reflection happens continuously in a dedicated channel that the agent maintains throughout the session.
When the agent finds out that the button text is not visible in dark mode, the reflection channel should note down the issue in parallel.
When the feature is shipped, the agent will propose to make improvements to AI instructions.</p>

<h2 id="implications">Implications</h2>

<p>If interaction models are coming, here is what I think changes for users, builders, and the model market.</p>

<h4 id="humans-will-prefer-the-better-interface"><strong>Humans will prefer the better interface</strong></h4>

<p>I still prefer Claude Code as my main interface.</p>

<p>For most of the work that I do, it is not possible for me to give perfect instructions in the first turn.
I also operate with imperfect information, and I do not have all the answers.
I need to interact with the agent to understand the problem together.</p>

<p>For me, Claude Code still feels easier to interact with.</p>

<p>I do not really care whether one model is slightly more intelligent than the other.
I care about how easy it is for me to communicate with the agent and get things done.</p>

<p>The companies that ship interaction models first will set the floor for what users expect.
Going back to a single-stream model will feel like going from a chat app to email.</p>

<h4 id="current-interfaces-will-continue-to-be-supported"><strong>Current interfaces will continue to be supported</strong></h4>

<p>Coding agents using interaction models should not require you to turn on your webcam and microphone.</p>

<p>You should still be able to talk to your coding agent with chat, just that it is more responsive and effective.</p>

<p>For users, you should continue to be great at using the current text-interface AI coding tools.</p>

<h4 id="you-will-still-need-to-teach-your-coding-agent"><strong>You will still need to teach your coding agent</strong></h4>

<p>The coding agent does not know about your business.</p>

<p>You will still need to teach your coding agent the environment you are working with.
Even with interaction models, the model still starts every session afresh.</p>

<p>You will still need to manage instructions and resources for the agent to access.
Skills will continue to be written.
Resources will still need to be accessed.</p>

<h4 id="the-model-will-decide-everything"><strong>The model will decide everything</strong></h4>

<p>Currently the harnesses manage plenty of decisions - for example, whether to compact, whether to auto-approve a command, and the context reset after planning mode.</p>

<p>A lot of the harnesses are built with the assumption of a single-stream model - compaction, monitoring, chain-of-thought.
Prompts are written and evaluated.
With this, I think most of the harnesses that we use today will be thrown away.</p>

<p>If I am building yet another coding tool, I will make the harness work only with interaction models.</p>

<h4 id="there-will-be-one-coding-model-size"><strong>There will be one coding model size</strong></h4>

<p>My bet is that the coding model market will collapse to one model size served via API<sup id="fnref:local-models" role="doc-noteref"><a href="#fn:local-models" class="footnote" rel="footnote">6</a></sup>.</p>

<p>Currently Claude Code by default is served with two models - Opus as the main model and Haiku as the explore model.</p>

<p>With interaction models, everything will be one model, which means one model size.</p>

<p>There will be different knobs that the model can decide to turn for itself.<sup id="fnref:knobs" role="doc-noteref"><a href="#fn:knobs" class="footnote" rel="footnote">7</a></sup></p>

<h2 id="closing">Closing</h2>

<p>Coding is the first killer use case for LLMs.</p>

<p>I think coding will also be the first killer use case for interaction models.</p>

<p>All coding models will be interaction models<sup id="fnref:robotics" role="doc-noteref"><a href="#fn:robotics" class="footnote" rel="footnote">8</a></sup>.</p>

<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:banks" role="doc-endnote">
      <p>For some reason all the banks I use make it difficult for me to export all my transaction history at once. <a href="#fnref:banks" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:voice" role="doc-endnote">
      <p>Coding tools should ship with dictation (the human’s voice in) and text-to-speech (the model’s voice out) built in. 
Today they do not.
I had to add a <a href="https://github.com/tonghuikang/claude-code-template/blob/main/.claude/hooks/notify_kokoro.py">TTS hook</a> to my Claude Code template so the agent can speak its notifications out loud.
I do not have voice dictation software. <a href="#fnref:voice" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:auto-mode" role="doc-endnote">
      <p>I am aware that Claude Code has an auto-mode where the agent has a process to automatically decide which commands are safe to run.
However, I think interaction models are useful here, there could be one channel where the model decides whether to approve running the command. <a href="#fnref:auto-mode" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:monitor" role="doc-endnote">
      <p>Claude Code has this <a href="https://code.claude.com/docs/en/tools-reference#monitor-tool">monitor tool</a> where the agent will monitor something in the background. <a href="#fnref:monitor" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:million-token-context" role="doc-endnote">
      <p>There are models with millions of tokens of context.
In my experience with Opus 4.7, I feel that the model simply forgets a lot of things after the 200,000th token.
I would rather the model automatically compact at the 200,000th token. <a href="#fnref:million-token-context" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:local-models" role="doc-endnote">
      <p>If there are models of different sizes being developed, I think they are local models that need to be run on device. <a href="#fnref:local-models" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:knobs" role="doc-endnote">
      <p>We are familiar with the thinking effort as a knob.
I think models should be able to tune their thinking effort by prompting themselves.
There are other knobs that could be turned if you train the model to do so.
Maybe the size of the prefix that you can attend to is tunable.
Maybe the number of experts that you can use is also tunable. <a href="#fnref:knobs" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:robotics" role="doc-endnote">
      <p>I think the first human-level robotics model will also be an interaction model. <a href="#fnref:robotics" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Tong Hui Kang</name></author><summary type="html"><![CDATA[Thinking Machines released interaction models earlier this week (One year ago, I called them multichannel models).]]></summary></entry><entry><title type="html">Drawing the flash attention animation</title><link href="http://blog.huikang.dev/2026/05/02/flash-attention-animation.html" rel="alternate" type="text/html" title="Drawing the flash attention animation" /><published>2026-05-02T00:00:00+00:00</published><updated>2026-05-02T00:00:00+00:00</updated><id>http://blog.huikang.dev/2026/05/02/flash-attention-animation</id><content type="html" xml:base="http://blog.huikang.dev/2026/05/02/flash-attention-animation.html"><![CDATA[<p>Some time ago, I was studying how Flash Attention works.</p>

<p>The main material available is the <a href="https://github.com/dao-ailab/flash-attention">pyramid visualization</a> from the Flash Attention paper.</p>

<p>I wanted the visualization to be animated. I did not manage to find a good resource out there <sup id="fnref:existing" role="doc-noteref"><a href="#fn:existing" class="footnote" rel="footnote">1</a></sup>. So I made mine.</p>

<p>The <a href="https://tonghuikang.github.io/flash-attention-animation/">animation</a> was made with MacOS Keynote, manually. <sup id="fnref:benchmark" role="doc-noteref"><a href="#fn:benchmark" class="footnote" rel="footnote">2</a></sup>. As of writing, it appears to rank second on <a href="https://www.google.com/search?q=flash+attention+animation">Google search</a>.</p>

<p>I also wrote <a href="https://www.quora.com/How-does-flash-attention-work/answer/Tong-Hui-Kang-1">a Quora answer</a> explaining how Flash Attention works.</p>

<p>I asked in the GPU Mode Discord for opinions on my work.</p>

<p><a href="https://x.com/gaunernst">gau.nernst</a> replied with the following comments, which I greatly appreciate.</p>

<blockquote>
  <p>i wanted to comment that the loop order was reversed. but upon checking, turns out FA1 used this loop ordering, but FA2 reversed it (and I only read the FA2 paper lmao)</p>

  <p>so in FA2, iterating along K/V is the inner loop, iterating along Q/O is the outer loop, which is implemented as 1 threadblock handling 1 Q/O tile</p>

  <p>yea I think FA3 and FA4 also follow the FA2’s general design, but optimized for Hopper and Blackwell respectively</p>
</blockquote>

<p>The explanation may not be complete, the details may not be fully correct, but I still hope this makes it slightly easier for you to understand Flash Attention.</p>

<h3 id="footnotes">Footnotes</h3>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:existing" role="doc-endnote">
      <p>There exists an <a href="https://github.com/Dao-AILab/flash-attention/pull/736">animation</a> made by LuisAVasquez, but I want to stay as close to the source material as possible. <a href="#fnref:existing" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:benchmark" role="doc-endnote">
      <p>I think a good benchmark for AI these days is to reproduce this work. <a href="#fnref:benchmark" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Tong Hui Kang</name></author><summary type="html"><![CDATA[Some time ago, I was studying how Flash Attention works.]]></summary></entry><entry><title type="html">Winning the Nemotron Progress Prize</title><link href="http://blog.huikang.dev/2026/05/02/nemotron-progress-prize.html" rel="alternate" type="text/html" title="Winning the Nemotron Progress Prize" /><published>2026-05-02T00:00:00+00:00</published><updated>2026-05-02T00:00:00+00:00</updated><id>http://blog.huikang.dev/2026/05/02/nemotron-progress-prize</id><content type="html" xml:base="http://blog.huikang.dev/2026/05/02/nemotron-progress-prize.html"><![CDATA[<p>I won the <a href="www.kaggle.com/competitions/nvidia-nemotron-model-reasoning-challenge/overview/prizes">Open Progress Prize</a> for <a href="https://www.kaggle.com/competitions/nvidia-nemotron-model-reasoning-challenge/">NVIDIA Nemotron Model Reasoning Challenge</a></p>

<p>The writeup with the links is available on <a href="https://www.kaggle.com/competitions/nvidia-nemotron-model-reasoning-challenge/discussion/689915">Kaggle</a>.</p>

<p>This is the first time I won prize money from Kaggle competitions.</p>]]></content><author><name>Tong Hui Kang</name></author><summary type="html"><![CDATA[I won the Open Progress Prize for NVIDIA Nemotron Model Reasoning Challenge]]></summary></entry><entry><title type="html">Writing skills for the company</title><link href="http://blog.huikang.dev/2026/02/21/skills.html" rel="alternate" type="text/html" title="Writing skills for the company" /><published>2026-02-21T00:00:00+00:00</published><updated>2026-02-21T00:00:00+00:00</updated><id>http://blog.huikang.dev/2026/02/21/skills</id><content type="html" xml:base="http://blog.huikang.dev/2026/02/21/skills.html"><![CDATA[<p>Agent <a href="https://agentskills.io/home">Skills</a> are instructions that agents can discover and use to do things more accurately and efficiently. The keywords are “accurately” and “efficiently”.</p>

<p>Think of the most capable person that you have ever worked with.</p>

<p>You hire them into a new company.</p>

<p>There are still things this person could have done more “accurately” and “efficiently”.</p>

<p>Think of what they will need to learn between the first day they join the company until the day where they become effective individual contributors. They will need to learn where to look for information. They will need to know who to approach to access privileged information. They will need to learn the processes necessary to ship things. They will need to learn the pitfalls that they should avoid.</p>

<p>Humans have memory and they could remember things.
For AI coding tools however, every time you start a chat, their memory is reset. They know absolutely nothing about your company at the beginning of each session. <sup id="fnref:nothing" role="doc-noteref"><a href="#fn:nothing" class="footnote" rel="footnote">1</a></sup> You need to teach them again.</p>

<p>This teaching process could have been accelerated with skill files.</p>

<p>I want to encourage my colleagues to write and maintain skills to accelerate their work.
Here I describe how I am doing it.</p>

<h2 id="what-exactly-are-skills">What exactly are skills?</h2>

<p>Skills consist of three elements at minimum:</p>
<ul>
  <li>Skill name</li>
  <li>Skill description</li>
  <li>Skill content</li>
</ul>

<p>Let’s use writing a commit message as an example of a skill.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
name: write-commit-message
description: Commit message guidelines. Use when writing git commit messages.
---

&lt;skill content&gt;
</code></pre></div></div>

<p>When the AI coding tool starts a session, it will load the skill name and skill description into the model context.</p>

<p>This is what you see in Claude Code when you run <code class="language-plaintext highlighter-rouge">/context</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Skills · /skills

Project
└ write-commit-message: 21 tokens
</code></pre></div></div>

<p>When the agent needs to write a commit message, it should decide to invoke a skill. When the skill is invoked, the agent will read the skill content.
In this case, the skill content contains the information on how to write commit messages.</p>

<h2 id="why-do-we-need-skills">Why do we need skills?</h2>

<p>Skills are needed to instruct the agent on how to do things more accurately and efficiently.</p>

<p>On writing commit messages, the company likely has some standards on how commit messages should be written.</p>

<p>For example, the commit message should contain this information:</p>
<ul>
  <li>How to write the commit title</li>
  <li>What to include in the commit description (context, design decisions, test plan)</li>
  <li>What not to include in the commit description</li>
  <li>Who the reviewers are</li>
  <li>What URLs need to be included (for example Slack, Asana)</li>
</ul>

<p>If agents want to write commit messages accurately without additional instructions, they will need to figure out these requirements by looking at similar commits, or running the unit tests related to the commit message.</p>

<p>However, if you require the agent to learn the commit message pattern on every session by reading many similar commits, this is not efficient.
If the agent skips the learning process, the agent is not being accurate.
This will still be the case even as models get better, because not every company has the same commit message standards.</p>

<p>Writing commit messages could have been done more accurately and efficiently<sup id="fnref:keywords" role="doc-noteref"><a href="#fn:keywords" class="footnote" rel="footnote">2</a></sup>.
Skill files allow this.
When the agent is going to write commit messages, it will “invoke the skill” and read the skill content.</p>

<p>Initially I placed the commit message standards in CLAUDE.md / AGENTS.md.
This was a reasonable place to put the instruction because it is globally relevant.
I have been advocating for CLAUDE.md to only include <a href="https://blog.huikang.dev/2025/05/31/writing-claude-md.html">globally</a> relevant information.
However, this means the commit message instructions are loaded even when the user is not writing a commit message, for example when asking questions about the codebase.
There is room for improvement here, regarding efficiency.</p>

<p>Then I moved the commit message standards to the git commit template.
Instead of placing the full instructions in CLAUDE.md, I added a pointer to the git commit template.
This is what I have been doing before we had skills.
This still follows the progressive disclosure principle, because I load the commit message standards only when I write the commit message.</p>

<p>Even though placing the commit message standards in the git commit template fulfills the principle of progressive disclosure, there is still benefit in making <code class="language-plaintext highlighter-rouge">write-commit-message</code> a skill.
We want to centralize our AI instructions instead of scattering them over the codebase.
When we implement telemetry and feedback loops for skills, <code class="language-plaintext highlighter-rouge">write-commit-message</code> can also benefit if it is a skill.</p>

<h2 id="when-you-should-write-a-skill">When you should write a skill</h2>

<p>If you want a process to be done more accurately and efficiently with AI coding tools, you should write a skill.
These are some examples where you should think about writing a skill.</p>

<p>You have a resource that you want your agent to access.
The resource could be Notion, Slack, Asana, or any internal pages.
Instead of playing telephone between the AI coding tool and the resource, you can write a skill that teaches the agent how to read the resource.
However, this assumes that your AI coding tool has access to the resources, which you will have to set up first.</p>

<p>You execute repetitive processes that you want automated.
For example, every day I am supposed to check the feed statistics for our recommendation system.
This involves looking at dashboards.
If there are significant movements in the metrics, I need to explain it by looking at commit logs.
This should have been a skill.</p>

<p>You want a process to be done more efficiently in the future.
One such process is on-call pages.
You might already be handling on-call pages with AI coding tools that have access to dashboards and error logs.
In the future, you want to handle this more efficiently.
You can write a skill that informs the agent of the resources that it should look at and the dead ends that it should be aware of.</p>

<p>There are cases where you should not write a skill.</p>

<ul>
  <li>Tasks that the agent could already solve accurately and efficiently.
For example, you should not add a skill on how to search the code, because the agent is likely already searching the code in the most efficient manner.<sup id="fnref:search" role="doc-noteref"><a href="#fn:search" class="footnote" rel="footnote">3</a></sup></li>
  <li>Features that the AI coding tool should already be good at.
There should not be a <code class="language-plaintext highlighter-rouge">plan-mode</code> or <code class="language-plaintext highlighter-rouge">clarify-user-questions</code> skill because AI coding tools should already include this in their system prompt.</li>
  <li>Workflows that should have been a deterministic script.
If you are writing a <code class="language-plaintext highlighter-rouge">check-commit-message</code> skill, you should not be asking the agent to run checks that could be unit tests.
The agent should not be an expensive linter.
If there is still value in writing <code class="language-plaintext highlighter-rouge">check-commit-message</code> to check the qualitative aspects of the commit message, the skill should ask the agent to run the relevant unit tests for the deterministic checks.</li>
</ul>

<h2 id="skill-writing-advice">Skill writing advice</h2>

<p>Start by writing the simplest possible skill that is worth using.</p>

<p>You could look at what you did in the past week and think of:</p>

<ul>
  <li>The documents that you need to repeatedly write or review</li>
  <li>Questions that you need to repeatedly answer</li>
  <li>Investigations that you need to repeat</li>
</ul>

<p>Then, think whether any of these processes could be done more accurately and efficiently with AI coding tools.</p>

<p>If so, you have found a good candidate for a skill.</p>

<p>Then write your skill. Start simple, with only a <code class="language-plaintext highlighter-rouge">SKILL.md</code> file.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
name: &lt;name&gt;
description: &lt;description&gt;
---

&lt;What exactly the skill does&gt;

Example queries
- &lt;example query&gt;

# Workflow

&lt;step by step process&gt;

Checklist
- [ ] Item 1
- [ ] Item 2

# Pitfalls to avoid

&lt;list them&gt;

</code></pre></div></div>

<p>After writing your skill, you should test it.
When you commit your skill, you should include evidence that it is tested.
Unlike unit tests, testing a skill is not deterministic, but you should still provide evidence of testing.</p>

<p>Here are some good forms of evidence:</p>

<ul>
  <li>If the skill output is a document, the resulting document could be evidence.</li>
  <li>If the skill provides instructions on how to read a resource, you could start a new session to see whether the agent could invoke the skill and read the resource without tripping over issues.</li>
  <li>If the skill helps with an investigation, the investigation thread could be evidence.</li>
</ul>

<p>Your colleagues will review your skill, just as code is reviewed in the codebase.</p>

<h2 id="managing-skills-for-the-company">Managing skills for the company</h2>

<p>As hundreds of colleagues commit skills into the codebase, you will soon have hundreds of skills.</p>

<p>This means that you will have hundreds of skill names, and hundreds of skill descriptions.
If each skill is 50 tokens, this will be 5000 tokens.
Also depending on the quality of your skill descriptions, the agent might invoke skills unnecessarily, or fail to invoke skills when it is needed.</p>

<p>If you look at the skills, there are skills that are company-wide and there are skills that are team-wide.
You should only load company-wide skills into context.</p>

<p>This can be done in Claude Code.
For team-wide skills, add <code class="language-plaintext highlighter-rouge">disable-model-invocation: true</code> to prevent the skill from being loaded in context.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
name: investigate-speed-feed
description: 
disable-model-invocation: true
---

&lt;skill content&gt;

</code></pre></div></div>

<p>This will mean that if you go to Claude Code and ask “please investigate feed speed”, the skill will not be invoked.
You need to write <code class="language-plaintext highlighter-rouge">/investigate-speed-feed</code>.
This is fine, because people who need to use the skill should know about the skill.</p>

<p>By separating team-wide skills and company-wide skills, and requiring all team-wide skills to have model invocation disabled, you reduce the risk of the agent not invoking necessary skills or invoking unnecessary skills<sup id="fnref:naming" role="doc-noteref"><a href="#fn:naming" class="footnote" rel="footnote">4</a></sup>.</p>

<p>Organize the team-wide skills and company-wide skills into two folders.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>skills/
     ├── company_wide/
     │   └── write-commit-message/
     │       └── SKILL.md
     └── team_wide/
         └── investigate-speed-feed/
             └── SKILL.md
</code></pre></div></div>

<p>However, the skill standard requires all skills to be at the same level.</p>

<p>Then symlink every skill folder into <code class="language-plaintext highlighter-rouge">skills/all</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>skills/
     ├── all/
     ├── company_wide/
     └── team_wide/
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">.claude/skills</code>, <code class="language-plaintext highlighter-rouge">.cursor/skills</code>, and <code class="language-plaintext highlighter-rouge">.codex/skills</code> are soft symlinks to <code class="language-plaintext highlighter-rouge">skills/all</code>.</p>

<p>To enforce that skills follow the intended format, you should write unit tests to test skills.</p>

<p>For example, you could have unit tests that check that</p>
<ul>
  <li>The skill name is short</li>
  <li>The skill description follows convention <sup id="fnref:description" role="doc-noteref"><a href="#fn:description" class="footnote" rel="footnote">5</a></sup></li>
  <li>Whether the SKILL.md file is under 500 lines</li>
  <li>Required components in the SKILL.md file (I require example queries to appear as its own section within the first 50 lines.)</li>
  <li>Whether the symlinks are added correctly</li>
</ul>

<p>As the skills maintainer, you need to define your responsibilities.
You should not be reviewing every line of every skill.
If there is an issue with a skill, you should direct the complaint to the owner of the skill. You can give advice, but should not be required to.</p>

<p>There are still responsibilities that belong to you as the skills maintainer.
For example, if a skill is not invoked when it is supposed to, or unnecessarily invoked, you need to investigate.
It might be the issue of how the user queries the model, how the skill name and description is written, or it might be the fault of the model or the AI coding tool.
It is still your duty to triage and provide advice here.
Periodically, you also need to review how skills are written and identify anti-patterns and legislate against these anti-patterns.</p>

<p>To help your colleagues write skills, you write a skill that helps them write skills.</p>

<p>Initially I used Anthropic’s <code class="language-plaintext highlighter-rouge">skill-creator</code> <a href="https://github.com/anthropics/skills/tree/main/skills/skill-creator">skill</a> to write skills.
However I found out there are many unnecessary parts.
For example, there is no need for the <code class="language-plaintext highlighter-rouge">init_skill.py</code> steps that create all the resource directories.
Skills should start simple with just a SKILL.md file.</p>

<p>To help your colleagues improve skills, you again write a skill that helps them improve skills.</p>

<p>There will be a <code class="language-plaintext highlighter-rouge">skill-feedback</code> skill where agents can provide feedback on skills.
Feedback should be provided when the agent finds an inaccuracy in the skill file, or pitfalls that the skill has not documented.
The feedback will be stored in some data lake, which should be queried when we iterate on skills.</p>

<p>I hope this helps you write and improve skills for your company, so that the agents can do things more accurately and efficiently.</p>

<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:nothing" role="doc-endnote">
      <p>This sentence was paraphrased from the resource I recommend on how to write <a href="https://www.humanlayer.dev/blog/writing-a-good-claude-md">CLAUDE.md</a>. <a href="#fnref:nothing" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:keywords" role="doc-endnote">
      <p>If you have not noticed by now, the keywords are accurately and efficiently. <a href="#fnref:keywords" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:search" role="doc-endnote">
      <p>That said, it might still be reasonable to include a skill that helps to search code if the agent could not find the code.
 You might have some code that is hard to search, for example you might be searching for a string but the actual string is broken into two pieces with each string defined at different places.
 However, it should not be expected for the agent to trigger this skill for every search, but only when previous search attempts fail.
 Of course, the better way to solve this is to avoid writing code that requires doing this, or improving your codebase instead. <a href="#fnref:search" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:naming" role="doc-endnote">
      <p>The field name is <code class="language-plaintext highlighter-rouge">disable-model-invocation: true</code>, which unfortunately is a negative.
It could have been “invokable skills” or “non-invokable skills”, but it is confusing because users can invoke a skill with the slash command.
The more precise term is “model-invokable skills” and “non-model-invokable skills” but that is too long.
I am glad that I have arrived at the terms “team-wide skills” and “company-wide skills”. <a href="#fnref:naming" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:description" role="doc-endnote">
      <p>Anthropic recommends the <a href="https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices#naming-conventions">gerund form</a>.
 However Anthropic is not <a href="https://github.com/anthropics/skills/tree/main/skills">really</a> following the conventions they recommend.
 Currently I only require skill name to be in the format <code class="language-plaintext highlighter-rouge">{resource / workflow}-{team name}</code>.
 This is a problem I should worry when the monorepo actually has a hundred skills. <a href="#fnref:description" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Tong Hui Kang</name></author><summary type="html"><![CDATA[Agent Skills are instructions that agents can discover and use to do things more accurately and efficiently. The keywords are “accurately” and “efficiently”.]]></summary></entry><entry><title type="html">Fake tasks for LLMs</title><link href="http://blog.huikang.dev/2026/01/18/fake-tasks-for-LLMs.html" rel="alternate" type="text/html" title="Fake tasks for LLMs" /><published>2026-01-18T00:00:00+00:00</published><updated>2026-01-18T00:00:00+00:00</updated><id>http://blog.huikang.dev/2026/01/18/fake-tasks-for-LLMs</id><content type="html" xml:base="http://blog.huikang.dev/2026/01/18/fake-tasks-for-LLMs.html"><![CDATA[<p>When a frontier LLM fails at an evaluation test case, you should critically evaluate whether your test case is worth passing.</p>

<p>These are some evaluation test cases that I think LLMs should no longer be evaluated on.</p>

<h2 id="tasks-on-recalling-facts-not-critical-to-your-work">Tasks on recalling facts not critical to your work</h2>

<p>There are evals that test whether the LLM remembers certain things (MMLU).</p>

<p>This is the README example in <a href="https://huggingface.co/datasets/cais/mmlu">MMLU dataset</a>.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"question"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What is the embryological origin of the hyoid bone?"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"choices"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="s2">"The first pharyngeal arch"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"The first and second pharyngeal arches"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"The second pharyngeal arch"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"The second and third pharyngeal arches"</span><span class="w">
  </span><span class="p">],</span><span class="w">
  </span><span class="nl">"answer"</span><span class="p">:</span><span class="w"> </span><span class="s2">"D"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>I argue that this is an unreasonable memorization task<sup id="fnref:bits" role="doc-noteref"><a href="#fn:bits" class="footnote" rel="footnote">1</a></sup>. LLMs do not need to memorize this to be helpful. Neither do doctors need to memorize this to execute their work.</p>

<p>Even as we do not expect LLMs to always provide the perfect answer to this, we still expect AI systems to solve this task perfectly. AI systems are more than just models. AI systems involve LLMs having access to tools such as web search.</p>

<h2 id="tasks-on-recalling-from-a-super-long-context">Tasks on recalling from a super-long context</h2>

<p>There are currently evals that test whether a model can retrieve a specific piece of information buried in a very long context. Model providers boast that their models could memorize the context well into the <a href="https://cloud.google.com/blog/products/ai-machine-learning/the-needle-in-the-haystack-test-and-how-gemini-pro-solves-it">millions</a>.</p>

<p>I argue that this is a non-goal. Humans can solve very long problems without the ability to memorize the entire conversation history. A math professor does not need to recall what they did exactly 300 days ago to solve an open math problem that requires a lot of research and trial and error.</p>

<p>The goal here is to solve complex tasks with the minimum compute. I expect the solution to involve an AI system where the LLM manages its own context and queries previous context with tools. The solution does not necessarily require a model that is able to retrieve something within a million tokens.</p>

<p>This means I still expect AI systems to be perfect at super-long context retrieval. LLMs should be able to use tools to query the history and perfectly retrieve any specific piece of information.</p>

<h2 id="tasks-that-require-a-playbook">Tasks that require a playbook</h2>

<p>Some evaluations test whether an LLM can answer domain-specific questions - legal advice, medical diagnoses, tax regulations.</p>

<p>You cannot expect a lawyer to do well at their job without access to the legal resources. Similarly, you cannot expect the LLM to perform as well as the lawyer without access to the same set of legal resources.</p>

<p>LLMs should have baseline domain knowledge, like a human professional would. We should avoid having evals that simply test the LLMs without providing them access to the necessary legal resources.</p>

<h2 id="tasks-that-are-better-served-with-tools">Tasks that are better served with tools</h2>

<p>Some evaluations test raw computation - multiplying large numbers, performing complex arithmetic, counting characters.</p>

<p>We should not be evaluating LLMs on whether they can do 100-digit multiplication without chain of thought.
LLMs should be able to figure out and follow recipes to do 100 by 100 digit multiplication in O(n²) tokens.</p>

<p>LLMs should be evaluated on whether they can discover and execute algorithms, not whether they can compute magically<sup id="fnref:architecture" role="doc-noteref"><a href="#fn:architecture" class="footnote" rel="footnote">2</a></sup>.</p>

<h2 id="tasks-based-on-noisy-data">Tasks based on noisy data</h2>

<p>We should not use the same mindset we used to participate in traditional Kaggle competitions to train LLMs.</p>

<p>In traditional Kaggle competitions, you do whatever it takes to maximize your score on the leaderboard, even if it means training on incorrect labels in the training set. We should not carry this mindset to evaluating LLMs.</p>

<p>I expect model providers to remove inconsistent or ambiguous test cases when evaluating their models internally.</p>

<h2 id="summary">Summary</h2>

<p>I hope we apply higher scrutiny to the test cases we use to evaluate LLMs.</p>

<p>I also hope that you understand the difference in standards that we apply to LLMs and to AI systems. There are tasks we expect AI systems to solve perfectly, but do not expect LLMs to get correct.</p>

<p>So when an LLM fails a benchmark task, ask yourself: is the model wrong, or is the task wrong?</p>

<h2 id="footnotes">Footnotes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:bits" role="doc-endnote">
      <p>Research shows that LLMs memorize approximately <a href="https://arxiv.org/abs/2505.24832">3.6 bits</a> of information per parameter. Based on this budget, model providers should probably at least have some internal standards on what LLMs are supposed to know and not supposed to know. <a href="#fnref:bits" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:architecture" role="doc-endnote">
      <p>It is still an interesting exercise to design weights to a given neural network architecture to multiply integers without a thinking process. <a href="#fnref:architecture" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Tong Hui Kang</name></author><summary type="html"><![CDATA[When a frontier LLM fails at an evaluation test case, you should critically evaluate whether your test case is worth passing.]]></summary></entry><entry><title type="html">My comments on the Evals FAQ</title><link href="http://blog.huikang.dev/2026/01/04/comments-on-evals-faq.html" rel="alternate" type="text/html" title="My comments on the Evals FAQ" /><published>2026-01-04T00:00:00+00:00</published><updated>2026-01-04T00:00:00+00:00</updated><id>http://blog.huikang.dev/2026/01/04/comments-on-evals-faq</id><content type="html" xml:base="http://blog.huikang.dev/2026/01/04/comments-on-evals-faq.html"><![CDATA[<p>Hamel’s evals FAQ is a great resource.</p>

<p><a href="https://hamel.dev/blog/posts/evals-faq/">hamel.dev/blog/posts/evals-faq/</a></p>

<p>Many of the ideas I have been promoting<sup id="fnref:prompting" role="doc-noteref"><a href="#fn:prompting" class="footnote" rel="footnote">1</a></sup> align with the Evals FAQ.</p>

<p>The content here does not represent the prevailing opinions or practices of any organization.</p>

<h2 id="sections-that-i-agree-with-and-promote">Sections that I agree with and promote</h2>

<p>These are ideas that I have been promoting.
I should have cited the Evals FAQ when promoting my ideas.</p>

<hr />
<p>On <a href="https://hamel.dev/blog/posts/evals-faq/#q-why-do-you-recommend-binary-passfail-evaluations-instead-of-1-5-ratings-likert-scales">binary</a> (pass/fail) evaluations instead of 1-5 ratings (Likert scales).</p>

<blockquote>
  <p>Having binary options forces people to make a decision rather than hiding uncertainty in middle values. Binary decisions are also faster to make during error analysis - you don’t waste time debating whether something is a 3 or 4.</p>
</blockquote>

<p>For example, you want to classify whether content is adult. You might want to ask for a scale of 1 to 4 instead of a binary classification. On the adult side, you want a label of 1 if the content is <a href="https://en.wikipedia.org/wiki/Child_pornography">CSAM</a> that you have to eliminate from your platform, and a label of 2 if it is merely adult. On the less adult side, you want a label of 3 if it is provocative that you do not want to show to new users, and 4 if the content is perfectly fine. Only users who opted in to adult content will see content with label 2.</p>

<p>I would rather we have three binary classifiers - on whether something is adult (1+2 vs 3+4), on whether something is CSAM (1 vs the rest), and on whether something should be shown to new users (4 vs the rest).</p>

<p>There are some arguments for having one classifier for all 4 labels</p>

<ul>
  <li>You intend to run the LLM for all your content. You think running the LLM only once saves you money, compared to running the LLM three times for three separate classifiers.</li>
  <li>You think that having one system is simpler, as opposed to multiple systems. You think that one system is easier to maintain.</li>
</ul>

<p>I argue that you should think harder about whether having a graduated 1-4 scale is really a better system than having three binary classifiers.</p>

<ul>
  <li>You need to think about what happens when you want to change one of the systems. Let’s say you want to have a different standard on what is allowed to be displayed to new users (label 3 versus label 4). You need to ensure that your single LLM classifier still works for label 2 and label 1. Experimentation is also more complicated now.</li>
  <li>There might be different levels of scrutiny for different classifications - you are okay if label 3 versus label 4 is mixed up, but you do not want to mix up label 1 versus label 2 because accounts posting label 1 content will get banned.</li>
  <li>Aligning a binary classifier is much easier than aligning a graduated classifier. It should be easier and faster to align three binary classifiers independently than to align one 4-class classifier. It is also easier to find a cheaper and equally performant classifier if the task is just binary classification.</li>
</ul>

<p>If your systems are uncoupled, you just need to align the parts that you need to align. Sure, it might cost more to run the LLMs three times, but engineering time usually costs even more money.</p>

<hr />
<p>On <a href="https://hamel.dev/blog/posts/evals-faq/#q-how-should-i-version-and-manage-prompts">versioning</a> and managing prompts</p>

<blockquote>
  <p>There is an unavoidable tension between keeping prompts close to the code vs. an environment that non-technical stakeholders can access.</p>

  <p><strong>My preferred approach is storing prompts in Git.</strong> This treats them as software artifacts that are versioned, reviewed, and deployed atomically with the application code.</p>

  <p>Prompt management tools are inherently limiting because they can’t easily execute your application’s code. Even when they can, there’s often significant indirection involved, making it difficult to test prompts with your system’s capabilities.</p>
</blockquote>

<p>There is indeed a requirement for non-technical stakeholders to read and write prompts. Prompts should already be easily readable if your organization has good AI tooling to point you to the exact prompt.</p>

<p>There is also the case where non-technical stakeholders want to write and test the prompts<sup id="fnref:system" role="doc-noteref"><a href="#fn:system" class="footnote" rel="footnote">2</a></sup> without opening a terminal or Jupyter notebook. It should be possible to keep the prompts in git while they experiment with the different prompts.</p>

<hr />

<p>On how many people should be <a href="https://hamel.dev/blog/posts/evals-faq/#q-how-many-people-should-annotate-my-llm-outputs">annotating</a> LLM outputs</p>

<blockquote>
  <p>A single expert eliminates annotation conflicts and prevents the paralysis that comes from “too many cooks in the kitchen”. The benevolent dictator can incorporate input and feedback from others, but they drive the process.</p>
</blockquote>

<p>I have written exactly on this <a href="https://blog.huikang.dev/2024/12/31/prompting-projects.html">before</a>.</p>

<blockquote>
  <p>I argue that democracy is a horrible way to build a dataset for LLM evaluation. Let’s say you want to build dataset to determine whether an advertisement is low quality. You get your team to label the content and treat the labels as immutable ground truth. But you notice the labels provided by your team often disagrees with each other.</p>

  <p>If you pass this dataset to your colleague, your colleague is simply guessing what people are voting. Your colleague will likely not perform well. This is what you will be tuning your prompt to.</p>
</blockquote>

<p>The FAQ suggests a benevolent dictator to own the prompts, which I agree with.</p>

<h2 id="sections-that-i-would-add-or-edit">Sections that I would add or edit</h2>

<p>Evals should only focus on cases you care about. In other words, you do not need a ground truth label for everything.</p>

<p>For example, in adult content classification, you only want to label what is <a href="https://en.wikipedia.org/wiki/I_know_it_when_I_see_it">obviously</a> adult and what is obviously not adult. If something is neither obviously adult nor obviously not adult, there is no need for a ground truth label. This means, in production, you are okay with the content being classified either way.</p>

<hr />

<p>I think some updates have to be made considering models are much more powerful now.</p>

<ul>
  <li>I do not think we should spend effort ensuring that every step the model takes is correct. For example, I do not think we need to care whether every search term made by the model is ideal. We know that frontier models serve as reliable agents that complete tasks. They know when they are making mistakes. As model users, we should not need to scrutinize every step the model is making; we should be more concerned with the outcome.</li>
  <li>I think models could play a bigger role in prompting. One new use case for LLMs in building the evaluation dataset is to brainstorm the edge cases that we care about. LLMs could even agentically query your data for mistakes the binary classifier is making in production, and put these labels up for human approval.</li>
</ul>

<h2 id="sections-that-i-disagree-with">Sections that I disagree with</h2>

<p>On passing 100% of your evals</p>

<blockquote>
  <p>Be <a href="https://ai-execs.com/2_intro.html#a-case-study-in-misleading-ai-advice">wary of optimizing for high eval pass rates</a>. If you’re passing 100% of your evals, you’re likely not challenging your system enough. A 70% pass rate might indicate a more meaningful evaluation that’s actually stress-testing your application. Focus on evals that help you catch real issues, not ones that make your metrics look good.</p>
</blockquote>

<p>I do not agree that passing 100% of your evals is wrong in itself. Of course, you still need to justify the value of evals that you already score 100% (or any passing rate as well).</p>

<p>When there is a classification failure, either one or multiple parts must be wrong - the classifier (model + prompt), or the label (or the LLM-as-a-judge evaluator). You should find out which one it is. It is possible that the label is wrong<sup id="fnref:borderline" role="doc-noteref"><a href="#fn:borderline" class="footnote" rel="footnote">3</a></sup>. If the prompt is wrong, you can probably tweak the prompt. If the model is wrong, you likely cannot do anything. My point here is, when there is a mistake, something must be wrong.</p>

<p>Mistakes are mistakes. It is possible to fix all these mistakes to get your eval to 100%. Models are better these days, and now you have AI to help you write and improve prompts. It is possible that you are not able to find any more mistakes with reasonable effort.</p>

<p>There could still be value for evals that pass 100%. For example, you have a classifier that classifies whether something is adult. The requirements are quite loose. You only need to classify correctly if something is obviously adult or obviously not adult. There are a lot of borderline cases where you are okay with the system tagging either way. Models today can perform 100% at this task. There is still value in this eval even though it passes 100%.</p>

<p>You can deploy this classifier and monitor mistakes in production. The classifier is likely to make mistakes (something obviously adult being classified as not adult), and you add the mistakes to the evaluation. Then you can tune the classifier to achieve performance on the mistakes and the initial dataset. The eval that passes 100% is still useful.</p>

<p>You can use the same eval when you migrate models. Models get deprecated, or you found a much cheaper model that is equally performant. The eval that scores 100% still serves as a unit test that is only run once every model migration.</p>

<p>What you should not do is try to add borderline examples to the dataset to make the eval perform at 70%. Similar to how “the goal of evaluations isn’t to pat yourself on the back for a perfect score”, it should also be the case that “the goal of evaluations isn’t to pat yourself on the back for an imperfect score”.</p>

<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:prompting" role="doc-endnote">
      <p>I have previously written about prompting projects <a href="https://blog.huikang.dev/2024/12/31/prompting-projects.html">here</a>. <a href="#fnref:prompting" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:system" role="doc-endnote">
      <p>My view of my role as a prompt engineer is not to write the prompt, but to correctly build the system for other people to write the prompts. <a href="#fnref:system" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:borderline" role="doc-endnote">
      <p>Again, the case might be borderline and we should probably allow the system to make either classification and not compute loss. <a href="#fnref:borderline" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Tong Hui Kang</name></author><summary type="html"><![CDATA[Hamel’s evals FAQ is a great resource.]]></summary></entry><entry><title type="html">Predictions for 2026</title><link href="http://blog.huikang.dev/2026/01/01/predictions-for-2026.html" rel="alternate" type="text/html" title="Predictions for 2026" /><published>2026-01-01T00:00:00+00:00</published><updated>2026-01-01T00:00:00+00:00</updated><id>http://blog.huikang.dev/2026/01/01/predictions-for-2026</id><content type="html" xml:base="http://blog.huikang.dev/2026/01/01/predictions-for-2026.html"><![CDATA[<p>These are my predictions for 2026.<sup id="fnref:2025" role="doc-noteref"><a href="#fn:2025" class="footnote" rel="footnote">1</a></sup></p>

<p>There will continue to be more people working on AI models and products.
There will continue to be more compute.
Access to compute will be better.
You will also have competent AI that helps you improve AI.</p>

<h2 id="ai-will-be-human-level-at-manipulating-browsers">AI will be human-level at manipulating browsers</h2>

<p>The only AI-powered browsing experience that I have tried is the Claude extension on Chrome.</p>

<p>There are a few read-only tasks I want it to do:</p>
<ul>
  <li>Go to the date grid on Google Flights so I can see when the cheapest flights are</li>
  <li>Tabulate my rent options based on listings on Craigslist</li>
  <li>Estimate the value of my (mostly IKEA) furniture from an image of my room</li>
</ul>

<p>The experience is very slow and very inaccurate<sup id="fnref:puppeteer" role="doc-noteref"><a href="#fn:puppeteer" class="footnote" rel="footnote">2</a></sup>.</p>

<p>I see a few problems here:</p>
<ul>
  <li>Every action has to be preceded by a thought process</li>
  <li>Context is filled up quickly and compaction is slow, because the model is trying to keep every screenshot</li>
  <li>The model is bad at managing multiple tabs and parallelizing work</li>
  <li>The model is kind of blind and only looks once<sup id="fnref:blind" role="doc-noteref"><a href="#fn:blind" class="footnote" rel="footnote">3</a></sup></li>
</ul>

<p>I hope to be able to trust AI with all my read-only browser tasks.
After AI earns my trust with their performance on read-only tasks, I can slowly trust AI with tasks that make changes.</p>

<h2 id="context-constraints-will-be-invisible">Context constraints will be invisible</h2>

<p>For the user experience, I think context constraints should already be invisible to users of frontier AI products.
You should not be hit with an instruction on ChatGPT that your conversation is too long.
There are some processes in the background that compact the context.<sup id="fnref:decaching" role="doc-noteref"><a href="#fn:decaching" class="footnote" rel="footnote">4</a></sup></p>

<p>For the developer experience though, you need to be aware of the context limits.
In 2025, the phrase <a href="https://www.quora.com/What-do-you-think-of-context-engineering/answer/Tong-Hui-Kang-1">context engineering</a> was coined.</p>

<p>However, I think developers will no longer need to care about context length.
The model API will be shipped with context management,<sup id="fnref:responses" role="doc-noteref"><a href="#fn:responses" class="footnote" rel="footnote">5</a></sup> similar to how you talk to your friend - you do not need to ask your friend to delete old messages so that you can continue talking to them.</p>

<p>This developer experience should already be achievable with existing models and a suite of scaffolds.
There might also be improvements in the model architecture to help.<sup id="fnref:decaching:1" role="doc-noteref"><a href="#fn:decaching" class="footnote" rel="footnote">4</a></sup></p>

<h2 id="we-will-stop-taking-turns-with-ai">We will stop taking turns with AI</h2>

<p>We are familiar with the chat interface with AI.
You type something, hit enter, and AI replies with something.
You are taking turns with AI.</p>

<p>The experience talking to another human is different.
You take note of their facial expression and body language.
You look up certain information while they are talking.
You write notes to yourself.
There are no explicit turns to take.</p>

<p>There are some efforts to break this turn-based experience with AI:</p>
<ul>
  <li>You can interrupt while AI is replying.</li>
  <li>You may submit messages while AI is replying.
However, how early your replies steer the response differs between products.</li>
</ul>

<p>There are still some bottlenecks:</p>
<ul>
  <li>Ideally the response to my question should be ready by the time I hit send.<sup id="fnref:helpdesk" role="doc-noteref"><a href="#fn:helpdesk" class="footnote" rel="footnote">6</a></sup></li>
  <li>The AI may need to ask follow-up questions to clarify my search request.
But the AI should already start searching in parallel.</li>
</ul>

<p>It is possible to achieve all this with a suite of scaffolds.
There might also be improvements in the model architecture to help.<sup id="fnref:multichannel" role="doc-noteref"><a href="#fn:multichannel" class="footnote" rel="footnote">7</a></sup></p>

<h2 id="ai-will-write-their-own-instructions">AI will write their own instructions</h2>

<p>Currently we expect models, with their harness, to complete any task with the resources available to them.
They are expected to use existing instructions.<sup id="fnref:instructions" role="doc-noteref"><a href="#fn:instructions" class="footnote" rel="footnote">8</a></sup>
Models are already expected to check their own work before reporting success.</p>

<p>Soon we should expect AI to not just follow processes, but also to improve the processes.</p>

<p>Currently, the human needs to take the initiative to prompt the model to fix the instructions.
The instructions may include comments and docstrings in the code that inform future models.
The instructions may be <a href="https://agentskills.io/home">skills files</a> that provide guides on how to execute certain processes.</p>

<p>We will see AI products taking the initiative to suggest changes to the process.
We will see models that tastefully fix outdated and inaccurate comments in the codebase.
We will see skills written, and maintained as they are being used.</p>

<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:2025" role="doc-endnote">
      <p>See my <a href="/2024/12/29/competitive-programming-and-superintelligence.html">predictions</a> <a href="/2024/12/30/prompting-in-2025.html">for</a> <a href="/2025/01/02/mathematical-superintelligence.html">2025</a> for reference on how inaccurate they were. <a href="#fnref:2025" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:puppeteer" role="doc-endnote">
      <p>I think I can get better results with Claude Code on puppeteer MCP with subagents.
Regarding puppeteer MCP - even though <a href="https://github.com/modelcontextprotocol/servers/tree/main/src/puppeteer"><code class="language-plaintext highlighter-rouge">@modelcontextprotocol/server-puppeteer</code></a> is deprecated, I am not recommending <a href="https://github.com/microsoft/playwright-mcp"><code class="language-plaintext highlighter-rouge">@playwright/mcp</code></a> because it takes up more tokens (4k vs 14k) and “No vision models needed, operates purely on structured data” is outdated. <a href="#fnref:puppeteer" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:blind" role="doc-endnote">
      <p>AI is kind of blind - you can see that the bottom set of 30 x 30 matrices are <a href="https://www.quora.com/Is-ARC-wrong-and-flawed-and-not-an-AGI-test-at-all/answer/Tong-Hui-Kang-1">misaligned</a>, which should not pass any design review.
I think o3 <a href="https://openai.com/index/thinking-with-images/">pioneered a method</a> where the model zooms in to a specific section.
I look forward to models drawing lines on an image to check whether they aligned the CSS correctly. <a href="#fnref:blind" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:decaching" role="doc-endnote">
      <p>One way to scaffold this is to have a parallel summarization process that kicks in at 50k tokens to summarize into 10k tokens, and at 80k tokens, replaces the 50k tokens with the 10k token summary so the conversation continues from 40k tokens.
I have an idea where the model discards the KV-cache of parts of the conversation that are no longer relevant, and uses tools to search the conversation history instead. This removes the need to generate 10k tokens every 50k tokens. One prime candidate for decaching is the screenshots when manipulating browsers. <a href="#fnref:decaching" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:decaching:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:responses" role="doc-endnote">
      <p>OpenAI <a href="https://developers.openai.com/blog/responses-api/">released</a> the <a href="https://platform.openai.com/docs/guides/conversation-state">Responses API</a> in 2025 which tracks conversation state server-side.
You do not need to pass in the entire conversation history with each request.
Instead, you pass around an id representing the state of the conversation, and OpenAI keeps it up-to-date for you.
You still need to manually call <code class="language-plaintext highlighter-rouge">/responses/compact</code> to compact the context. <a href="#fnref:responses" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:helpdesk" role="doc-endnote">
      <p>For some human-powered helpdesk chatbots, apparently the human operator can <a href="https://gizmodo.com/be-warned-customer-service-agents-can-see-what-youre-t-1830688119">see what you typed before you send</a>. <a href="#fnref:helpdesk" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:multichannel" role="doc-endnote">
      <p>I wrote about how <a href="/2025/05/14/multichannel-prediction.html">models should be multichannel</a> - humans have multiple input and output channels, and models should be able to converse like humans. <a href="#fnref:multichannel" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:instructions" role="doc-endnote">
      <p>I have written about how <a href="/2025/10/20/delivering-ai-instructions.html">instructions are more than just user and system prompts</a>. <a href="#fnref:instructions" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Tong Hui Kang</name></author><summary type="html"><![CDATA[These are my predictions for 2026.1 See my predictions for 2025 for reference on how inaccurate they were. &#8617;]]></summary></entry><entry><title type="html">Things that I should do on a plane</title><link href="http://blog.huikang.dev/2025/12/29/things-to-do-on-a-plane.html" rel="alternate" type="text/html" title="Things that I should do on a plane" /><published>2025-12-29T00:00:00+00:00</published><updated>2025-12-29T00:00:00+00:00</updated><id>http://blog.huikang.dev/2025/12/29/things-to-do-on-a-plane</id><content type="html" xml:base="http://blog.huikang.dev/2025/12/29/things-to-do-on-a-plane.html"><![CDATA[<p>There are some things that I wanted to do, but I do not think it is worth doing unless I am stuck on a plane.</p>

<p><strong>Crafting LLM weights to multiply integers without thinking.</strong></p>

<p>It is already expected that frontier models have perfect performance at multiplication without tool use. 
Frontier models should be able multiply step-by-step, only limited by time and space complexity.</p>

<p>However, I am interested in multiplication without a thought process, the model is expected to reply with the product directly.</p>

<p>I want to try crafting weights to perform multiplication perfectly. I will also need to study the multiplication algorithm itself.</p>

<p>There is this related work on <a href="https://arxiv.org/abs/2301.05217">training</a> a model to perform modular addition perfectly.</p>

<p><strong>Understanding <a href="https://cp-algorithms.com/data_structures/segment_tree.html">segment trees</a>.</strong></p>

<p>Segment trees are an important concept if you want to get a higher rating in competitive programming.
I have a <a href="https://github.com/tonghuikang/codecomp/blob/3f93809d7577a544e05eb56754370c9a68f6fc4c/template/template_segment_trees.py">template</a> that I do not know how to use.
I think I also need to understand the algorithm and the intuition behind segment trees.</p>

<p><strong>Understanding <a href="https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm">Aho–Corasick algorithm</a>.</strong></p>

<p>LeetCode contests tested this two times.</p>
<ul>
  <li><a href="https://leetcode.com/problems/construct-string-with-minimum-cost/">Construct String with Minimum Cost</a> in <a href="https://leetcode.com/contest/weekly-contest-405/">Weekly Contest 405</a></li>
  <li><a href="https://leetcode.com/problems/minimum-number-of-valid-strings-to-form-target-ii/">Minimum Number of Valid Strings to Form Target II</a> in <a href="https://leetcode.com/contest/weekly-contest-415/">Weekly Contest 415</a></li>
</ul>

<p>I solved both problems with other methods.</p>

<p><strong>Tree matching problem with wildcards</strong></p>

<p>You have two rooted trees.
Some nodes are wildcards.</p>

<p>The task is to determine whether you can assign each wildcard a subtree so that you can match the two trees.</p>

<p>Please do not ask me where I got this problem.</p>

<p><strong>Memorize the entire architecture of <a href="https://huggingface.co/openai/gpt-oss-120b">gpt-oss</a>.</strong></p>

<p>gpt-oss-120b is the go-to model for <a href="https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-3">AIMO 3</a>. I want to understand the model architecture very well.
I need to be able to derive the parameter count, and also its GPU memory usage.
I will also need to compare and explain the differences between 20b and 120b versions.</p>

<p><strong>Watch some films</strong></p>

<p>These are the films I have yet to watch</p>
<ul>
  <li>Zootopia 2</li>
  <li>Demon Slayer: Kimetsu no Yaiba Infinity Castle</li>
  <li>Chainsaw Man - The Movie: Reze Arc (only watched the first half)</li>
  <li>Ne Zha 2</li>
  <li>KPop Demon Hunters</li>
  <li>Neon Genesis Evangelion</li>
</ul>]]></content><author><name>Tong Hui Kang</name></author><summary type="html"><![CDATA[There are some things that I wanted to do, but I do not think it is worth doing unless I am stuck on a plane.]]></summary></entry><entry><title type="html">How to read vLLM logs</title><link href="http://blog.huikang.dev/2025/12/28/vllm-logs.html" rel="alternate" type="text/html" title="How to read vLLM logs" /><published>2025-12-28T00:00:00+00:00</published><updated>2025-12-28T00:00:00+00:00</updated><id>http://blog.huikang.dev/2025/12/28/vllm-logs</id><content type="html" xml:base="http://blog.huikang.dev/2025/12/28/vllm-logs.html"><![CDATA[<p>vLLM logs tell you exactly what’s happening during model loading and inference—memory allocation, attention backends, CUDA graph capture, KV cache sizing. Understanding these logs helps you debug performance issues, optimize configurations, and reason about why your setup behaves the way it does.</p>

<p>This post walks through the startup logs from serving <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS-120B</a> (a 117B parameter MoE model with MXFP4 quantization) on a single GPU via <a href="https://github.com/vllm-project/vllm">vLLM v0.11.2</a>. Each log line is explained with links to source code and documentation. Logs are from <a href="https://www.kaggle.com/code/huikang/streaming-inference?scriptVersionId=282411196">this Kaggle notebook</a>.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">INFO 11-28 11:45:51 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=2048.</code></p>
</blockquote>

<p><code class="language-plaintext highlighter-rouge">max_num_batched_tokens</code> caps tokens per scheduler step. With <a href="https://docs.vllm.ai/en/stable/configuration/optimization/#chunked-prefill">chunked prefill</a> enabled (default in V1), the scheduler prioritizes decode requests and batches pending prefills into remaining token budget. Lower values (e.g., 2048) achieve better ITL (inter-token latency) because there are fewer prefills interrupting decodes. Higher values achieve better TTFT (time-to-first-token) as more prefill tokens are processed per batch. Default is 8192 for online serving, 16384 for offline. See <a href="https://docs.vllm.ai/en/stable/configuration/optimization/#performance-tuning-with-chunked-prefill">vLLM Optimization docs</a>.</p>

<p>This value also affects how much memory is left for KV cache. vLLM <a href="https://github.com/vllm-project/vllm/issues/20256">allocates all remaining GPU memory to KV cache</a> after loading model weights—controlled by <code class="language-plaintext highlighter-rouge">gpu_memory_utilization</code> (default 0.9). Higher <code class="language-plaintext highlighter-rouge">max_num_batched_tokens</code> reserves more activation memory during the profiling step, leaving less for KV cache and reducing <code class="language-plaintext highlighter-rouge">max_concurrency</code>. Decreasing <code class="language-plaintext highlighter-rouge">max_num_batched_tokens</code> or <code class="language-plaintext highlighter-rouge">max_num_seqs</code> <a href="https://docs.vllm.ai/en/stable/configuration/optimization/#preemption">frees KV cache space</a> for more concurrent requests.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:45:51 [api_server.py:1977] vLLM API server version 0.11.2</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:45:51 [utils.py:253] non-default args: {'host': '0.0.0.0', 'model': '/kaggle/input/gpt-oss-120b/transformers/default/1', 'max_model_len': 98304, 'served_model_name': ['vllm-model'], 'gpu_memory_utilization': 0.96, 'max_num_seqs': 6}</code></p>
</blockquote>

<p><code class="language-plaintext highlighter-rouge">max_num_seqs</code> caps concurrent sequences per batch (default 1024 in V1, up from 256 in V0). Higher values allow more concurrent requests but <a href="https://docs.vllm.ai/en/stable/configuration/optimization/#preemption">require more KV cache space</a> at runtime. Lower values <a href="https://docs.vllm.ai/en/stable/configuration/optimization/#preemption">reduce memory pressure</a> and avoid <a href="https://docs.vllm.ai/en/stable/configuration/optimization/#preemption">preemption</a>, where requests are evicted and recomputed when KV cache fills. Here, <code class="language-plaintext highlighter-rouge">max_num_seqs=6</code> is set low for this memory-constrained single-GPU setup.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:46:35 [model.py:631] Resolved architecture: GptOssForCausalLM</code></p>
</blockquote>

<p>vLLM reads <code class="language-plaintext highlighter-rouge">architectures</code> from <code class="language-plaintext highlighter-rouge">config.json</code> and maps it to the corresponding model class in <a href="https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models"><code class="language-plaintext highlighter-rouge">vllm/model_executor/models/</code></a>.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) ERROR 11-28 11:46:35 [config.py:307] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/kaggle/input/gpt-oss-120b/transformers/default/1'. Use repo_type argument if needed., retrying 1 of 2</code>
<code class="language-plaintext highlighter-rouge">(APIServer pid=90) ERROR 11-28 11:46:37 [config.py:305] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/kaggle/input/gpt-oss-120b/transformers/default/1'. Use repo_type argument if needed.</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:46:37 [model.py:1968] Downcasting torch.float32 to torch.bfloat16.</code></p>
</blockquote>

<p>This is about the original checkpoint dtype. Since GPT-OSS uses <code class="language-plaintext highlighter-rouge">quantization=mxfp4</code>, weights end up as 4-bit anyway. bfloat16 is used for activations and KV cache.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:46:37 [model.py:1745] Using max model len 98304</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:46:45 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=8192.</code></p>
</blockquote>

<p>Printed twice: first during APIServer config parsing (default 2048), second when EngineCore initializes with computed value (8192). 8192 is the online serving default.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:46:45 [config.py:272] Overriding max cuda graph capture size to 1024 for performance.</code></p>
</blockquote>

<p><a href="https://docs.vllm.ai/en/stable/design/cuda_graphs/">CUDA graphs</a> pre-record GPU operations to eliminate per-kernel CPU launch overhead. Each captured batch size requires memory to store the graph. Limiting to 1024 max balances memory usage against coverage of typical batch sizes. See <a href="https://docs.vllm.ai/en/stable/design/cuda_graphs/">vLLM CUDA Graphs docs</a>.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:47:23 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='/kaggle/input/gpt-oss-120b/transformers/default/1', speculative_config=None, tokenizer='/kaggle/input/gpt-oss-120b/transformers/default/1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=98304, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=vllm-model, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': &lt;CompilationMode.VLLM_COMPILE: 3&gt;, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': &lt;CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)&gt;, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 1024, 'local_cache_dir': None}</code></p>
</blockquote>

<p>Key settings for GPT-OSS:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">quantization=mxfp4</code>: <a href="https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/mxfp4.html">MXFP4</a> (Microscaling FP4) compresses weights to 4-bit with per-group scaling factors</li>
  <li><code class="language-plaintext highlighter-rouge">enable_prefix_caching=True</code>: <a href="https://docs.vllm.ai/en/stable/design/prefix_caching/">Automatic Prefix Caching</a> shares KV cache for requests with common prefixes</li>
  <li><code class="language-plaintext highlighter-rouge">cudagraph_mode=FULL_AND_PIECEWISE</code>: <a href="https://docs.vllm.ai/en/stable/design/cuda_graphs/">FULL_AND_PIECEWISE</a> uses full CUDA graphs for uniform decode batches, piecewise graphs for mixed prefill-decode batches</li>
  <li><code class="language-plaintext highlighter-rouge">reasoning_parser='openai_gptoss'</code>: GPT-OSS specific reasoning output parser</li>
</ul>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:47:31 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.19.2.2:54751 backend=nccl</code></p>
</blockquote>

<p>Single GPU setup. <code class="language-plaintext highlighter-rouge">world_size=1</code> means one process, <code class="language-plaintext highlighter-rouge">backend=nccl</code> uses <a href="https://developer.nvidia.com/nccl">NVIDIA NCCL</a> (NVIDIA Collective Communications Library) for fast GPU-to-GPU communication in multi-GPU setups.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">[W1128 11:47:31.402851082 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3</code>
<code class="language-plaintext highlighter-rouge">[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0</code></p>
</blockquote>

<p>Benign. With one GPU, there are no peers to connect to.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:47:31 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0</code></p>
</blockquote>

<p>Parallelism types (all 0 because single GPU). See <a href="https://jax-ml.github.io/scaling-book/training/">How to Parallelize a Transformer</a> from the JAX Scaling Book:</p>
<ul>
  <li><strong>DP</strong> (<a href="https://jax-ml.github.io/scaling-book/training/#data-parallelism">Data Parallel</a>): activations sharded along batch dimension, parameters replicated on each device, AllReduce gradients during backward pass</li>
  <li><strong>PP</strong> (<a href="https://jax-ml.github.io/scaling-book/training/#pipelining">Pipeline Parallel</a>): layers distributed across devices, activations microbatched through pipeline stages</li>
  <li><strong>TP</strong> (<a href="https://jax-ml.github.io/scaling-book/training/#tensor-parallelism">Tensor Parallel</a>): activations sharded along model dimension, parameters sharded along feed-forward dimension, AllGather/ReduceScatter between blocks</li>
  <li><strong>EP</strong> (<a href="https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/">Expert Parallel</a>): for MoE models, distribute different experts across GPUs, tokens routed to GPUs holding selected experts</li>
</ul>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:47:31 [gpu_model_runner.py:3259] Starting to load model /kaggle/input/gpt-oss-120b/transformers/default/1...</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) WARNING 11-28 11:47:32 [mxfp4.py:196] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.</code></p>
</blockquote>

<p>Some layers lack FP4 kernel implementations and run in bfloat16 instead. See <a href="https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/mxfp4.py"><code class="language-plaintext highlighter-rouge">vllm/model_executor/layers/quantization/mxfp4.py</code></a>.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) WARNING 11-28 11:47:32 [mxfp4.py:208] MXFP4 attention layer is not implemented. Skipping quantization for this layer.</code></p>
</blockquote>

<p>Attention layers (Q, K, V projections) run unquantized. See the same <a href="https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/mxfp4.py"><code class="language-plaintext highlighter-rouge">mxfp4.py</code></a> source.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:47:32 [cuda.py:377] Using AttentionBackendEnum.TRITON_ATTN backend.</code></p>
</blockquote>

<p>vLLM supports multiple attention backends:</p>
<ul>
  <li><strong>TRITON_ATTN</strong>: Uses <a href="https://github.com/triton-lang/triton">OpenAI’s Triton compiler</a> to generate attention kernels. FlashAttention-style memory efficiency with <a href="https://arxiv.org/abs/2205.14135">O(n) memory instead of O(n²)</a>.</li>
  <li><strong>FLASH_ATTN</strong>: <a href="https://github.com/Dao-AILab/flash-attention">FlashAttention</a> CUDA implementation. <a href="https://github.com/vllm-project/vllm/issues/22610">H100/H800 GPUs default to FlashAttention-3 automatically</a>—no need to set <code class="language-plaintext highlighter-rouge">VLLM_FLASH_ATTN_VERSION=3</code>. FlashAttention-3 (<a href="https://arxiv.org/abs/2407.08608">paper</a>, <a href="https://github.com/vllm-project/vllm/issues/12429">vLLM integration</a>) is optimized for Hopper using WGMMA async Tensor Cores and TMA, achieving <a href="https://tridao.me/blog/2024/flash3/">75-85% of H100 theoretical max FLOPS</a> (740-850 TFLOPs) vs. 35% for FA2. Requires CUDA ≥12.3. Note that <a href="https://github.com/vllm-project/vllm/issues/15344">logs don’t distinguish FA2 vs FA3</a>—the version is determined silently based on GPU architecture.</li>
  <li><strong>FLASHINFER</strong>: <a href="https://github.com/flashinfer-ai/flashinfer">FlashInfer</a> backend with different optimization strategies.</li>
</ul>

<p>Here, <code class="language-plaintext highlighter-rouge">TRITON_ATTN</code> is selected.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:47:32 [layer.py:342] Enabled separate cuda stream for MoE shared_experts</code></p>
</blockquote>

<p>GPT-OSS is a Mixture-of-Experts (MoE) model with <a href="https://arxiv.org/abs/2401.06066">shared experts</a> (always active, capture common knowledge) and routed experts (selected by gating). Running shared experts on a separate CUDA stream allows overlapping their computation with the routed expert dispatch/combine operations. The expected overlap pattern is: shared experts compute in parallel with token dispatch to routed experts, then routed experts compute, then results combine. See <a href="https://github.com/vllm-project/vllm/issues/9203">vLLM RFC #9203</a> and <a href="https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/fused_moe">vLLM MoE implementation</a>.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:47:32 [mxfp4.py:141] Using Marlin backend</code></p>
</blockquote>

<p><a href="https://github.com/IST-DASLab/marlin">Marlin</a> is a weight-only quantization kernel from IST-DASLab. On GPUs without native FP4 tensor cores, Marlin decompresses FP4 weights on-the-fly during computation. See <a href="https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/marlin_utils_fp4.py">vLLM Marlin utils</a>.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) Loading safetensors checkpoint shards: 100% Completed 15/15 [09:05&lt;00:00, 36.34s/it]</code>
<code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:57:30 [default_loader.py:314] Loading weights took 545.24 seconds</code></p>
</blockquote>

<p>9 minutes is slow. Kaggle storage I/O is the bottleneck. Local NVMe + tensor parallelism would be faster.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) WARNING 11-28 11:57:30 [marlin_utils_fp4.py:204] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.</code></p>
</blockquote>

<p>This Kaggle environment’s GPU lacks native FP4 tensor cores. <a href="https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/">Native FP4 requires Blackwell architecture (SM100+)</a>—Hopper (SM90) only supports FP8. The Marlin kernel provides weight-only compression: weights stored in FP4 (memory savings) but computed in higher precision. See <a href="https://github.com/vllm-project/vllm/issues/30135">GitHub issue #30135</a> for more on MXFP4 backend selection.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:57:33 [gpu_model_runner.py:3338] Model loading took 65.9651 GiB memory and 600.408480 seconds</code></p>
</blockquote>

<p>65.97 GiB is expected. Here’s the math based on <a href="https://huggingface.co/openai/gpt-oss-120b/blob/main/config.json">GPT-OSS-120B config.json</a>:</p>

<p><strong>Architecture:</strong> 36 layers, hidden_size=2880, 128 experts per layer, intermediate_size=2880 per expert</p>

<p><strong>MoE weights (~115B params, MXFP4 quantized):</strong></p>
<ul>
  <li>Each expert (SwiGLU): 3 × <code class="language-plaintext highlighter-rouge">hidden_size</code> × <code class="language-plaintext highlighter-rouge">intermediate_size</code> = 3 × 2880 × 2880 = 24.9M params</li>
  <li>Per layer: 128 experts × 24.9M = 3.19B params</li>
  <li>36 layers: 36 × 3.19B = <strong>114.8B params</strong></li>
  <li>MXFP4 (4-bit + scaling): 114.8B × 0.5 bytes × 1.03 ≈ <strong>59.1 GB</strong></li>
</ul>

<p><strong>Non-MoE weights (~2B params, BF16):</strong></p>
<ul>
  <li>Attention (Q/K/V/O), embeddings, layernorms, routers</li>
  <li>~2B × 2 bytes = <strong>~4 GB</strong></li>
</ul>

<p><strong>Total:</strong> 59.1 + 4 + ~2-3 GB (CUDA context, buffers) ≈ <strong>65-66 GB</strong></p>

<p>This matches <a href="https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune">Unsloth’s documentation</a> which recommends “at least 66GB of unified memory” for gpt-oss-120B inference. See also <a href="https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf">OpenAI’s model card</a> for architecture details.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:57:52 [backends.py:631] Using cache directory: /root/.cache/vllm/torch_compile_cache/7fcbe477d2/rank_0_0/backbone for vLLM's torch.compile</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:57:52 [backends.py:647] Dynamo bytecode transform time: 19.07 s</code>
<code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:57:58 [backends.py:251] Cache the graph for dynamic shape for later use</code>
<code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:58:38 [backends.py:282] Compiling a graph for dynamic shape takes 44.90 s</code>
<code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:58:39 [monitor.py:34] torch.compile takes 63.97 s in total</code></p>
</blockquote>

<p><a href="https://docs.pytorch.org/docs/stable/torch.compiler.html"><code class="language-plaintext highlighter-rouge">torch.compile</code></a> is PyTorch’s JIT compiler. <a href="https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html">TorchDynamo</a> (the frontend) traces Python code and captures computation graphs. TorchInductor (the backend) then fuses multiple operations into optimized CUDA/Triton kernels—this reduces memory bandwidth usage since intermediate results stay in registers instead of being written to and read from global memory. See <a href="https://docs.pytorch.org/tutorials/intermediate/torch_compile_tutorial.html">PyTorch’s torch.compile tutorial</a> for details on how fusion works. vLLM adds custom fusion passes (RMSNorm+quantization, SiLU+quantization) for additional speedups. Compiled artifacts are cached in <code class="language-plaintext highlighter-rouge">~/.cache/vllm/torch_compile_cache</code> for reuse across runs. See <a href="https://docs.vllm.ai/en/stable/design/torch_compile.html">vLLM torch.compile docs</a>.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:58:41 [gpu_worker.py:359] Available KV cache memory: 8.99 GiB</code>
<code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:58:41 [kv_cache_utils.py:1229] GPU KV cache size: 130,960 tokens</code>
<code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:58:41 [kv_cache_utils.py:1234] Maximum concurrency for 98,304 tokens per request: 2.46x</code></p>
</blockquote>

<p>The 2.46x multiplier is calculated in <a href="https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/kv_cache_utils.py"><code class="language-plaintext highlighter-rouge">vllm/v1/core/kv_cache_utils.py</code></a>. GPT-OSS uses an <a href="https://huggingface.co/openai/gpt-oss-120b/blob/main/config.json">alternating attention pattern</a>: odd layers use sliding window attention (<code class="language-plaintext highlighter-rouge">sliding_window</code>=128), even layers use full attention. This creates 2 KV cache groups.</p>

<p><strong>Derivation</strong> (block_size=16, page_size=32,768 bytes for GQA with 8 KV heads × 64 head_dim × 2 bytes × 2 K+V):</p>

<ol>
  <li><strong>FullAttentionSpec</strong> memory per layer: ⌈98,304 / 16⌉ × 32,768 = 201.3 MB</li>
  <li><strong>SlidingWindowSpec</strong> memory per layer: ⌈(128 − 1 + 8,192) / 16⌉ × 32,768 = 17.0 MB
    <ul>
      <li>Note: includes <code class="language-plaintext highlighter-rouge">max_num_batched_tokens</code> (8,192) for chunked prefill buffer</li>
    </ul>
  </li>
  <li><strong>Sum per layer</strong>: 201.3 + 17.0 = 218.4 MB</li>
  <li><strong>Per request</strong> (18 layers per group): 18 × 218.4 MB = 3.93 GB</li>
  <li><strong>Blocks per request</strong>: ⌈3.93 GB / (32,768 × 18)⌉ = 6,666 blocks</li>
  <li><strong>Total blocks</strong>: 130,960 tokens × 2 groups / 16 block_size = 16,370 blocks</li>
  <li><strong>max_concurrency</strong>: 16,370 / 6,666 = <strong>2.46x</strong></li>
</ol>

<p>When KV cache fills: vLLM <a href="https://docs.vllm.ai/en/stable/configuration/optimization/">preempts</a> lower-priority requests via recomputation (V1 default) or swap to CPU. This causes latency spikes for preempted requests.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) 2025-11-28 11:58:41,951 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...</code>
<code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) 2025-11-28 11:58:41,973 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends</code>
<code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 83/83</code></p>
</blockquote>

<p>83 <a href="https://docs.vllm.ai/en/stable/design/cuda_graphs/">CUDA graphs</a> captured (one per batch size in <code class="language-plaintext highlighter-rouge">cudagraph_capture_sizes</code> up to 1024). <a href="https://docs.vllm.ai/en/stable/design/cuda_graphs/">PIECEWISE mode</a> splits the computation graph at attention operations—attention stays in eager mode while everything else (MLPs, norms) goes into CUDA graphs. This is necessary because attention has variable memory access patterns that are difficult to capture in static graphs.</p>

<p>Why CUDA graphs matter: each kernel launch has ~20μs of <a href="https://developer.nvidia.com/blog/cuda-graphs/">CPU overhead</a> for setup and dispatch. For small batches where GPU compute time is short, this overhead dominates total latency. CUDA graphs pre-record the entire workflow once, then replay it with <a href="https://developer.nvidia.com/blog/constant-time-launch-for-straight-line-cuda-graphs-and-other-performance-enhancements/">near-constant launch time</a> (~2.5μs + ~1ns per node). See <a href="https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/cuda-graphs.html">NVIDIA CUDA Programming Guide</a>.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) Capturing CUDA graphs (decode, FULL): 3/3</code>
<code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:58:51 [gpu_model_runner.py:4244] Graph capturing finished in 9 secs, took 0.64 GiB</code></p>
</blockquote>

<p>3 full graphs for decode-only batches (simpler access pattern allows capturing entire forward pass). Uses 0.64 GiB memory.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(EngineCore_DP0 pid=289) INFO 11-28 11:58:51 [core.py:250] init engine (profile, create kv cache, warmup model) took 78.26 seconds</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:58:53 [api_server.py:1725] Supported tasks: ['generate']</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) WARNING 11-28 11:58:53 [serving_responses.py:175] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:58:53 [api_server.py:2052] Starting vLLM API server 0 on http://0.0.0.0:8000</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:58:53 [launcher.py:38] Available routes are:</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:58:53 [launcher.py:46] Route: /docs, Methods: GET, HEAD</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:58:53 [launcher.py:46] Route: /tokenize, Methods: POST</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:58:53 [launcher.py:46] Route: /detokenize, Methods: POST</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:58:53 [launcher.py:46] Route: /v1/models, Methods: GET</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:58:53 [launcher.py:46] Route: /v1/chat/completions, Methods: POST</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:58:53 [launcher.py:46] Route: /v1/completions, Methods: POST</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:58:53 [launcher.py:46] Route: /metrics, Methods: GET</code></p>
</blockquote>

<p>vLLM exposes an <a href="https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html">OpenAI-compatible API</a>:</p>
<ul>
  <li><a href="https://docs.vllm.ai/en/latest/serving/openai_compatible_server/#chat-api"><code class="language-plaintext highlighter-rouge">/v1/chat/completions</code></a> - Chat completions (conversation format with messages array)</li>
  <li><a href="https://docs.vllm.ai/en/latest/serving/openai_compatible_server/#completions-api"><code class="language-plaintext highlighter-rouge">/v1/completions</code></a> - Text completions (raw prompt format)</li>
  <li><code class="language-plaintext highlighter-rouge">/v1/models</code> - List available models and metadata</li>
  <li><a href="https://docs.vllm.ai/en/latest/serving/openai_compatible_server/#tokenizer-api"><code class="language-plaintext highlighter-rouge">/tokenize</code></a>, <a href="https://docs.vllm.ai/en/latest/serving/openai_compatible_server/#tokenizer-api"><code class="language-plaintext highlighter-rouge">/detokenize</code></a> - Encode text to token IDs and decode back</li>
  <li><code class="language-plaintext highlighter-rouge">/metrics</code> - <a href="https://docs.vllm.ai/en/stable/usage/metrics.html">Prometheus metrics</a> for monitoring (<code class="language-plaintext highlighter-rouge">vllm:e2e_request_latency_seconds</code>, <code class="language-plaintext highlighter-rouge">vllm:num_requests_running</code>, <code class="language-plaintext highlighter-rouge">vllm:kv_cache_usage_perc</code>)</li>
  <li><code class="language-plaintext highlighter-rouge">/docs</code> - Swagger/OpenAPI interactive documentation</li>
</ul>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO:     Started server process [90]</code>
<code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO:     Waiting for application startup.</code>
<code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO:     Application startup complete.</code>
<code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO:     127.0.0.1:55440 - "GET /v1/models HTTP/1.1" 200 OK</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO:     127.0.0.1:55440 - "POST /v1/chat/completions HTTP/1.1" 200 OK</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO: 127.0.0.1:55484 - "POST /v1/chat/completions HTTP/1.1" 200 OK</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:59:14 [loggers.py:236] Engine 000: Avg generation throughput: 544.8 tokens/s, GPU KV cache usage: 4.3%, Prefix cache hit rate: 81.6%</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 11:59:44 [loggers.py:236] Engine 000: Avg generation throughput: 514.2 tokens/s, GPU KV cache usage: 10.4%, Prefix cache hit rate: 81.6%</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 12:00:44 [loggers.py:236] Engine 000: Avg generation throughput: 468.0 tokens/s, GPU KV cache usage: 21.5%, Prefix cache hit rate: 81.6%</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 12:01:44 [loggers.py:236] Engine 000: Avg generation throughput: 426.6 tokens/s, GPU KV cache usage: 31.6%, Prefix cache hit rate: 81.6%</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 12:02:44 [loggers.py:236] Engine 000: Avg generation throughput: 396.0 tokens/s, GPU KV cache usage: 41.0%, Prefix cache hit rate: 81.6%</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 12:03:54 [loggers.py:236] Engine 000: Avg generation throughput: 367.2 tokens/s, GPU KV cache usage: 51.1%, Prefix cache hit rate: 81.6%</code></p>
</blockquote>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">(APIServer pid=90) INFO 11-28 12:04:54 [loggers.py:236] Engine 000: Avg generation throughput: 348.6 tokens/s, GPU KV cache usage: 59.3%, Prefix cache hit rate: 81.6%</code></p>
</blockquote>

<p><strong>Throughput drops as KV cache fills:</strong></p>

<ul>
  <li>4.3% cache → 544.8 tok/s</li>
  <li>21.5% cache → 468.0 tok/s</li>
  <li>41.0% cache → 396.0 tok/s</li>
  <li>59.3% cache → 348.6 tok/s</li>
</ul>

<p>This demonstrates that LLM decoding is <a href="https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/">memory-bandwidth bound</a>. During each decode step, the GPU must load the entire KV cache from HBM (high-bandwidth memory) to compute attention over all previous tokens. As sequences grow longer, the amount of data transferred per token increases linearly, but the compute per token stays constant (one output token). According to <a href="https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/">NVIDIA’s LLM optimization guide</a>, “the speed at which the data (weights, keys, values, activations) is transferred to the GPU from memory dominates the latency, not how fast the computation actually happens.”</p>

<p>With 6 concurrent requests generating long outputs, each request’s growing KV cache competes for memory bandwidth. The <a href="https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html">vLLM architecture blog</a> notes that “decode requests are memory-bandwidth-bound since we still need to load all LLM weights (and KV caches) just to compute one token.”</p>]]></content><author><name>Tong Hui Kang</name></author><summary type="html"><![CDATA[vLLM logs tell you exactly what’s happening during model loading and inference—memory allocation, attention backends, CUDA graph capture, KV cache sizing. Understanding these logs helps you debug performance issues, optimize configurations, and reason about why your setup behaves the way it does.]]></summary></entry><entry><title type="html">Papers I kept in 2025</title><link href="http://blog.huikang.dev/2025/12/27/papers-I-kept-in-2025.html" rel="alternate" type="text/html" title="Papers I kept in 2025" /><published>2025-12-27T00:00:00+00:00</published><updated>2025-12-27T00:00:00+00:00</updated><id>http://blog.huikang.dev/2025/12/27/papers-I-kept-in-2025</id><content type="html" xml:base="http://blog.huikang.dev/2025/12/27/papers-I-kept-in-2025.html"><![CDATA[<p>I am moving from one place to another. Over my stay in the United States, I have printed plenty of papers, on large language models and recommendation systems. These are the papers I have kept<sup id="fnref:disclaimer" role="doc-noteref"><a href="#fn:disclaimer" class="footnote" rel="footnote">1</a></sup> as I moved.</p>

<h2 id="large-language-model-papers">Large Language Model papers</h2>

<p><strong>Training guides</strong></p>

<ul>
  <li>How to Scale Your Model [<a href="https://jax-ml.github.io/scaling-book/">link</a>] Google DeepMind, 2025 - If you want a good <a href="https://twopug.com/interview-prep-ml-grind/">shot</a> at training LLMs, you will need to do the included homework.</li>
</ul>

<p><strong>GRPO and variants</strong> - I write about my views on <a href="https://blog.huikang.dev/2025/10/28/group-relative-policy-optimization.html">GRPO</a>. It is only recently that I <a href="https://www.interconnects.ai/p/the-dpo-debate">found</a> out about this <a href="https://x.com/tomgoldsteincs/status/1729910334318633116">meme</a> which I agree from the left side.</p>

<ul>
  <li>Group Sequence Policy Optimization [<a href="https://arxiv.org/abs/2507.18071">link</a>] Qwen, 24 Jul 2025</li>
  <li>Understanding R1-Zero-Like Training: A Critical Perspective [<a href="https://arxiv.org/abs/2503.20783">link</a>] Sea AI Lab, 26 Mar 2025 - This introduces “Dr.GRPO”.</li>
  <li>Why RLHF (and Other RL-Like Methods) Don’t Bring True RL to LLMs [<a href="https://www.linkedin.com/pulse/why-rlhf-other-rl-like-methods-dont-bring-true-rl-llmsand-atlas-wang-s1efc/">link</a>] Atlas Wang, 2025</li>
  <li>DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [<a href="https://arxiv.org/abs/2402.03300">link</a>] DeepSeek, 5 Feb 2024 - This is the GRPO paper.</li>
  <li>KTO: Model Alignment as Prospect Theoretic Optimization [<a href="https://arxiv.org/abs/2402.01306">link</a>] Cohere, 2 Feb 2024</li>
  <li>ORPO: Monolithic Preference Optimization without Reference Model [<a href="https://arxiv.org/abs/2403.07691">link</a>] KAIST, 12 Mar 2024</li>
  <li>Direct Preference Optimization: Your Language Model is Secretly a Reward Model [<a href="https://arxiv.org/abs/2305.18290">link</a>] Stanford, 29 May 2023 - This is the DPO paper.</li>
  <li>A General Theoretical Paradigm to Understand Learning from Human Preferences [<a href="https://arxiv.org/abs/2310.12036">link</a>] DeepMind, 18 Oct 2023 - They call their algorithm IPO.</li>
</ul>

<p><strong>Technical reports</strong> - I will read this in the future to look back what models were optimizing for.</p>

<ul>
  <li>Mixtral of Experts [<a href="https://arxiv.org/abs/2401.04088">link</a>] Mistral AI, 8 Jan 2024</li>
  <li>Kimi k1.5: Scaling Reinforcement Learning with LLMs [<a href="https://huggingface.co/papers/2501.12599">link</a>] Moonshot AI, 20 Jan 2025</li>
  <li>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [<a href="https://arxiv.org/abs/2501.12948">link</a>] DeepSeek, 22 Jan 2025</li>
  <li>GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models [<a href="https://arxiv.org/abs/2508.06471">link</a>] Zhipu AI, 20 Jun 2025</li>
  <li>The Llama 3 Herd of Models [<a href="https://arxiv.org/abs/2407.21783">link</a>] Meta, 31 Jul 2024</li>
  <li>LLaMA: Open and Efficient Foundation Language Models [<a href="https://arxiv.org/abs/2302.13971">link</a>] Meta, 27 Feb 2023</li>
</ul>

<p><strong>Efficiency efforts</strong></p>

<ul>
  <li>LoRA: Low-Rank Adaptation of Large Language Models [<a href="https://arxiv.org/abs/2106.09685">link</a>] Microsoft, 17 Jun 2021 - LoRA is now regaining popularity and Fireworks and Thinking Machines are supporting fine-tuning with LoRA.</li>
  <li>Hyena Hierarchy: Towards Larger Convolutional Language Models [<a href="https://arxiv.org/abs/2302.10866">link</a>] Stanford, 21 Feb 2023 - Tri Dao contributed to this paper. I kept this because I want to understand how attention can be made theoretically faster with approximations.</li>
</ul>

<p><strong>Prompting</strong> - Now models are trained to effectively run long chains of thoughts without prompting.</p>

<ul>
  <li>Let’s Verify Step by Step [<a href="https://arxiv.org/abs/2305.20050">link</a>] OpenAI, 31 May 2023</li>
  <li>Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [<a href="https://arxiv.org/abs/2201.11903">link</a>] Google, 28 Jan 2022</li>
  <li>In-Context Learning for Extreme Multi-Label Classification [<a href="https://arxiv.org/abs/2401.12178">link</a>] Ghent University, 22 Jan 2024 - I printed this because I used this in my actual work.</li>
</ul>

<p><strong>Early alignment efforts</strong> - I kept this to read in the future to understand what people were thinking.</p>

<ul>
  <li>Training Language Models to Follow Instructions with Human Feedback [<a href="https://arxiv.org/abs/2203.02155">link</a>] OpenAI, 4 Mar 2022</li>
  <li>Reinforced Self-Training (ReST) for Language Modeling [<a href="https://arxiv.org/abs/2308.08998">link</a>] DeepMind, 17 Aug 2023</li>
  <li>RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback [<a href="https://arxiv.org/abs/2309.00267">link</a>] Google, 1 Sep 2023</li>
  <li>Constitutional AI: Harmlessness from AI Feedback [<a href="https://arxiv.org/abs/2212.08073">link</a>] Anthropic, 15 Dec 2022</li>
</ul>

<h2 id="recommendation-systems-papers">Recommendation Systems papers</h2>

<p>For an introduction to recommendation systems, I recommend</p>
<ul>
  <li>This Chinese language <a href="https://www.youtube.com/playlist?list=PLvOO0btloRntAi-VnV06M1Bu0X1xljUUP">playlist</a> by Shusen Wang.</li>
  <li>Recommendation systems viewed in <a href="https://medium.com/nvidia-merlin/recommender-systems-not-just-recommender-models-485c161c755e">four stages</a>, with online and offline processes.</li>
</ul>

<p><strong>Value modeling</strong> - Recommendation systems calculate P(action) for multiple actions, for each candidate. The candidates are ranked based on a utility function. The utility function takes the action probability as arguments. You need to design a good utility function for your recommendation system. The design also involves deciding how important is each action.</p>

<ul>
  <li>What We Know About Using Non-Engagement Signals in Content Ranking [<a href="https://arxiv.org/abs/2402.06831">link</a>] Integrity Institute, 9 Feb 2024 - This puts down in writing that engaging content is usually negatively correlated with “quality”.</li>
  <li>Multi-Objective Recommendation via Multivariate Policy Learning [<a href="https://arxiv.org/abs/2405.02141">link</a>] Spotify, 3 May 2024</li>
  <li>Feedback Shaping: A Modeling Approach to Nurture Content Creation [<a href="https://arxiv.org/abs/2106.11312">link</a>] LinkedIn, 21 Jun 2021</li>
</ul>

<p><strong>Training multi-task models</strong> - We usually use one neural network model to predict multiple action probabilities. The alternative is to use a separate model to predict each action probability. However,  sometimes individual models are better at predicting action probabilities than the combined model, even controlling for total parameter count. Hence there is this line of research to bridge the performance gap.</p>

<ul>
  <li>Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts [<a href="https://dl.acm.org/doi/10.1145/3219819.3220007">link</a>] Google, 13 Jun 2018</li>
  <li>Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations [<a href="https://dl.acm.org/doi/10.1145/3383313.3412236">link</a>] Tencent, 22 Sep 2020</li>
  <li>Recommending What Video to Watch Next: A Multitask Ranking System [<a href="https://dl.acm.org/doi/10.1145/3298689.3346997">link</a>] Google, 10 Sep 2019</li>
</ul>

<p><strong>Calibration</strong> - When you ship a ranking model (the model that predicts P(action) for multiple actions), you also ship how miscalibrated it is. I think calibration is a very easily misunderstood topic. The concept of calibration should have been taught and tested in schools.</p>

<ul>
  <li>On Calibration of Modern Neural Networks [<a href="https://arxiv.org/abs/1706.04599">link</a>] Cornell University, 14 Jun 2017</li>
  <li>Why Model Calibration Matters and How to Achieve It [<a href="https://www.unofficialgoogledatascience.com/2021/04/why-model-calibration-matters-and-how.html">link</a>] Google, Apr 2021</li>
  <li>Multi-task Learning and Calibration for Utility-based Home Feed Ranking [<a href="https://medium.com/pinterest-engineering/multi-task-learning-and-calibration-for-utility-based-home-feed-ranking-64087a7bcbad">link</a>] Pinterest, 14 Sep 2020</li>
  <li>The Foundations of Cost-Sensitive Learning [<a href="https://dl.acm.org/doi/10.5555/1642194.1642224">link</a>] UCSD, 4 Aug 2001</li>
  <li>Predicting Good Probabilities with Supervised Learning [<a href="https://dl.acm.org/doi/10.1145/1102351.1102430">link</a>] Cornell, 7 Aug 2005</li>
</ul>

<p><strong>Feature engineering</strong> - Manually engineering features does not scale well. Whenever you add a new feature you will need to implement all the feature crosses. It would be great if this process is learnt by the model instead.</p>

<ul>
  <li>DCN V2: Improved Deep &amp; Cross Network and Practical Lessons for Web-scale Learning to Rank Systems [<a href="https://arxiv.org/abs/2008.13535">link</a>] Google, 31 Aug 2020 - I wrote about this <a href="https://recsys.quora.com/Deep-and-Cross-Network">here</a>.</li>
</ul>

<p><strong>Sequence feature modeling</strong> - Your sparse features could be a sequence of IDs. You might believe that you can make better predictions on action probabilities by learning from this sequence.</p>

<ul>
  <li>TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest [<a href="https://arxiv.org/abs/2306.00248">link</a>] Pinterest, 1 Jun 2023</li>
  <li>Behavior Sequence Transformer for E-commerce Recommendation in Alibaba [<a href="https://arxiv.org/abs/1905.06874">link</a>] Alibaba, 16 May 2019</li>
  <li>Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction [<a href="https://arxiv.org/abs/2006.05639">link</a>] Alibaba, 10 Jun 2020 -  Shusen Wang covered this <a href="https://www.youtube.com/watch?v=_4J9aF5KR84">here</a>.</li>
  <li>Deep Interest Network for Click-Through Rate Prediction [<a href="https://arxiv.org/abs/1706.06978">link</a>] Alibaba, 21 Jun 2017 - This is known as DIN. Shusen Wang covered this <a href="https://www.youtube.com/watch?v=_4J9aF5KR84">here</a>.</li>
</ul>

<p><strong>Trainable embeddings</strong> - If your model uses sparse features (item IDs, action type is an example of a sparse feature, float values like age is an example of a dense feature), you will need to map each ID to an embedding and train the embeddings. The problem happens when you have too many IDs to train on. It is not a good idea to just use a larger GPU.</p>

<ul>
  <li>Monolith: Real Time Recommendation System With Collisionless Embedding Table [<a href="https://arxiv.org/abs/2209.07663">link</a>] ByteDance, 16 Sep 2022</li>
  <li>Efficient Data Representation Learning in Google-scale Systems [<a href="https://dl.acm.org/doi/10.1145/3604915.3608882">link</a>] Google, 14 Sep 2023</li>
</ul>

<p><strong>Pretrained embeddings</strong> - In your neural network model, you can also use embeddings that you do not intend to train. One such embedding is content embeddings, and you can introduce content embeddings to the neural network model, thinking that the model can better predict the action probabilities by knowing more about the content. You still need to prove that these embeddings are useful, and even if you fail to do so, you should be prepared to learn something.</p>

<ul>
  <li>Cross-lingual Language Model Pretraining [<a href="https://arxiv.org/abs/1901.07291">link</a>] Facebook AI, 22 Jan 2019 - This introduces XLM embeddings.</li>
  <li>Text Embeddings by Weakly-Supervised Contrastive Pre-training [<a href="https://arxiv.org/abs/2212.03533">link</a>] Microsoft, 7 Dec 2022 - This introduces E5 embeddings.</li>
</ul>

<p><strong>Two tower model</strong> - Recommendation systems involve first retrieving thousands of candidates from millions of indexed content. Currently, you index with the item embedding, you retrieve with the user embedding, for items with the largest dot product. The two tower model produces the item and user embedding. You need to train the model.</p>

<ul>
  <li>Self-supervised Learning for Large-scale Item Recommendations [<a href="https://arxiv.org/abs/2007.12865">link</a>] Google, 25 Jul 2020</li>
  <li>Cross-Batch Negative Sampling for Training Two-Tower Recommenders [<a href="https://arxiv.org/abs/2110.15154">link</a>] Huawei, 28 Oct 2021</li>
  <li>Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations [<a href="https://dl.acm.org/doi/10.1145/3366424.3386195">link</a>] Google, 20 Apr 2020</li>
  <li>Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations [<a href="https://dl.acm.org/doi/abs/10.1145/3298689.3346996">link</a>] Google, 10 Sep 2019</li>
  <li>Deep Neural Networks for YouTube Recommendations [<a href="https://dl.acm.org/doi/10.1145/2959100.2959190">link</a>] Google, 15 Sep 2016</li>
</ul>

<p><strong>User interest exploration</strong> - A good recommendation system does not just recommend content similar to ones that you have liked. The recommendation system should also appropriately explore what other types of content that you might like.</p>

<ul>
  <li>Values of User Exploration in Recommender Systems [<a href="https://dl.acm.org/doi/10.1145/3460231.3474236">link</a>] Google, 13 Sep 2021</li>
</ul>

<p><strong>Item exploration</strong> - New content is essential to any recommendation system. To help fresh content succeed, you implement methods to surface it more effectively. However, you still need to prove these methods actually work. This is where a challenge arises: traditional user-split A/B testing tends to show lower engagement for variants that prioritize new content—simply because new content hasn’t yet accumulated the signals that make established content perform well. Hence there is this line of research on how do you both deliver new content effectively and demonstrate that your approach is beneficial.</p>

<ul>
  <li>Nonlinear Bandits Exploration for Recommendations [<a href="https://dl.acm.org/doi/10.1145/3604915.3610245">link</a>] Google, 14 Sep 2023</li>
  <li>Online Matching: A Real-time Bandit System for Large-scale Recommendations [<a href="https://arxiv.org/abs/2307.15893">link</a>] Google, 29 Jul 2023</li>
  <li>Long-Term Value of Exploration: Measurements, Findings and Algorithms [<a href="https://arxiv.org/abs/2305.07764">link</a>] Google, 12 May 2023</li>
  <li>Fresh Content Needs More Attention: Multi-funnel Fresh Content Recommendation [<a href="https://arxiv.org/abs/2306.01720">link</a>] Google, 2 Jun 2023</li>
</ul>

<p><strong>Recommendation as sequence prediction</strong> - Instead of predicting the P(action), there is a line of research where you predict the item directly, similar to how you predict words in a sentence. I think this line of work only starts contributing value when you have systems that are bilingual in semantic IDs and English. Eugene Yan has an open source <a href="https://eugeneyan.com/writing/semantic-ids/">implementation</a>.</p>

<ul>
  <li>Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations [<a href="https://arxiv.org/abs/2402.17152">link</a>] Meta, 26 Feb 2024 - I still do not understand this paper.</li>
  <li>Effective and Efficient Training for Sequential Recommendation using Recency Sampling [<a href="https://arxiv.org/abs/2207.02643">link</a>] University of Glasgow, 6 Jul 2022</li>
  <li>BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer [<a href="https://arxiv.org/abs/1904.06690">link</a>] Alibaba, 15 Apr 2019</li>
  <li>Learning from Negative User Feedback and Measuring Responsiveness for Sequential Recommenders [<a href="https://arxiv.org/abs/2308.12256">link</a>] Google, 23 Aug 2023</li>
</ul>

<p><strong>Miscellaneous</strong> - These are some papers that I do not manage to classify.</p>

<ul>
  <li>Improving Training Stability for Multitask Ranking Models in Recommender Systems [<a href="https://arxiv.org/abs/2302.09178">link</a>] Google, 18 Feb 2023</li>
  <li>Trustworthy Online Marketplace Experimentation with Budget-split Design [<a href="https://arxiv.org/abs/2012.08724">link</a>] LinkedIn, 16 Dec 2020 - You cannot run a traditional A/B testing process for ads ranking because users in the variant can cannibalize the budget of the users in control. Even though the A/B test analysis reports that users in variant contributed more revenue to the users in control, the truth might be the opposite direction. Therefore you need an experiment design where the budget allocation is split.</li>
  <li>Why do tree-based models still outperform deep learning on tabular data? [<a href="https://arxiv.org/abs/2207.08815">link</a>] 18 Jul 2022 - You cannot just migrate to a neural network from tree-based models and expect an improvement in metrics.</li>
  <li>Towards Understanding the Overfitting Phenomenon of Deep Click-Through Rate Prediction Models [<a href="https://arxiv.org/abs/2209.06053">link</a>] Alibaba, 13 Sep 2022 - It seems that in recommendation systems if you train on more than one epoch you overfit.</li>
  <li>Fairness in Recommendation Ranking through Pairwise Comparisons [<a href="https://arxiv.org/abs/1903.00780">link</a>] Google, 2 Mar 2019</li>
  <li>Practical Lessons from Predicting Clicks on Ads at Facebook [<a href="https://dl.acm.org/doi/10.1145/2648584.2648589">link</a>] Facebook, 24 Aug 2014</li>
  <li>Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs [<a href="https://arxiv.org/abs/1603.09320">link</a>] Russian Academy of Sciences, 30 Mar 2016 - This is the HNSW paper. I think it is a good idea to have some intuition on how approximate retrieval works so that you have some idea of what it can and cannot do.</li>
  <li>Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations [<a href="https://arxiv.org/abs/2007.07203">link</a>] ByteDance, 14 Jul 2020 - I still do not really understand this. I recommend watching <a href="https://www.youtube.com/watch?v=BYtzZ48hRFM">this</a>.</li>
  <li>Full Index Deep Retrieval: End-to-End User and Item Structures for Cold-start and Long-tail Item Recommendation [<a href="https://dl.acm.org/doi/10.1145/3604915.3608773">link</a>] ByteDance/SJTU, 14 Sep 2023 - This is a follow-up to the Deep Retrieval paper.</li>
</ul>

<h2 id="other-ml-resources">Other ML resources</h2>

<p><strong>Reinforcement learning</strong> - I wrote about reinforcement learning <a href="https://blog.huikang.dev/2025/08/23/reinforcement-learning-life-lessons.html">here</a>.</p>

<ul>
  <li>Reinforcement Learning: An Introduction [<a href="http://incompleteideas.net/book/the-book-2nd.html">link</a>] Sutton &amp; Barto, 2018 - Papers involving reinforcement learning assumes that you have read this book because they do not fully explain the symbols and terminologies they use.</li>
</ul>

<p><strong>Image models</strong></p>

<ul>
  <li>High-Resolution Image Synthesis with Latent Diffusion Models [<a href="https://arxiv.org/abs/2112.10752">link</a>] LMU Munich, 20 Dec 2021 - This is the stable diffusion paper.</li>
  <li>An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [<a href="https://arxiv.org/abs/2010.11929">link</a>] Google, 22 Oct 2020</li>
</ul>

<p><strong>Attention mechanism</strong> - I drew the attention mechanism <a href="https://www.quora.com/profile/Tong-Hui-Kang-1/I-drew-an-image-of-the-attention-mechanism-of-of-Qwen2-5-0-5B-Instruct-Some-things-I-learnt-while-drawing-What-is-th">here</a>.</p>

<ul>
  <li>Attention Is All You Need [<a href="https://arxiv.org/abs/1706.03762">link</a>] Google, 12 Jun 2017</li>
  <li>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [<a href="https://arxiv.org/abs/1810.04805">link</a>] Google, 11 Oct 2018</li>
  <li>Long Short-Term Memory-Networks for Machine Reading [<a href="https://arxiv.org/abs/1601.06733">link</a>] University of Edinburgh, 25 Jan 2016</li>
  <li>Neural Machine Translation by Jointly Learning to Align and Translate [<a href="https://arxiv.org/abs/1409.0473">link</a>] Mila, 1 Sep 2014</li>
</ul>

<h2 id="footnotes">Footnotes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:disclaimer" role="doc-endnote">
      <p>This just means that I previously printed the papers, and I did not discard the papers as I moved my residence within the Bay Area. There are very impactful papers in the field that I did not print. There are also papers in the list which I had not really read. <a href="#fnref:disclaimer" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Tong Hui Kang</name></author><summary type="html"><![CDATA[I am moving from one place to another. Over my stay in the United States, I have printed plenty of papers, on large language models and recommendation systems. These are the papers I have kept1 as I moved. This just means that I previously printed the papers, and I did not discard the papers as I moved my residence within the Bay Area. There are very impactful papers in the field that I did not print. There are also papers in the list which I had not really read. &#8617;]]></summary></entry></feed>