Papers I kept in 2025

I am moving from one place to another. During my stay in the United States, I printed plenty of papers on large language models and recommendation systems. These are the papers I have kept¹ as I moved.

Large Language Model papers

Training guides

How to Scale Your Model [link] Google DeepMind, 2025 - If you want a good shot at training LLMs, you will need to do the included homework.

GRPO and variants - I write about my views on GRPO. I only recently found out about this meme, which I agree with from the left side.

Group Sequence Policy Optimization [link] Qwen, 24 Jul 2025
Understanding R1-Zero-Like Training: A Critical Perspective [link] Sea AI Lab, 26 Mar 2025 - This introduces “Dr.GRPO”.
Why RLHF (and Other RL-Like Methods) Don’t Bring True RL to LLMs [link] Atlas Wang, 2025
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [link] DeepSeek, 5 Feb 2024 - This is the GRPO paper.
KTO: Model Alignment as Prospect Theoretic Optimization [link] Contextual AI, 2 Feb 2024
ORPO: Monolithic Preference Optimization without Reference Model [link] KAIST, 12 Mar 2024
Direct Preference Optimization: Your Language Model is Secretly a Reward Model [link] Stanford, 29 May 2023 - This is the DPO paper.
A General Theoretical Paradigm to Understand Learning from Human Preferences [link] DeepMind, 18 Oct 2023 - They call their algorithm IPO.
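
The core idea shared by GRPO and its variants, scoring each sampled response relative to the other responses for the same prompt, can be sketched in a few lines. This is a minimal sketch, not the exact code from any of the papers above: the function name `group_relative_advantages`, the `normalize_std` flag, the `eps` constant and the use of the population standard deviation are all assumptions made for illustration. Setting `normalize_std=False` keeps only the mean-centering, which is one of the changes argued for in the Dr.GRPO paper as I understand it.

```python
import statistics


def group_relative_advantages(rewards, normalize_std=True, eps=1e-8):
    """Turn the rewards of one prompt's sampled responses into advantages."""
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        # Mean-centering only, in the spirit of Dr.GRPO.
        return centered
    # Population standard deviation is an assumption of this sketch.
    std = statistics.pstdev(rewards)
    return [c / (std + eps) for c in centered]


# Four sampled responses to one prompt, with a 0/1 correctness reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
centered_only = group_relative_advantages(rewards, normalize_std=False)
```

Responses that beat the group average get a positive advantage and the rest get a negative one, so no separate value model is needed to form a baseline.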

Technical reports - I will read these in the future to look back at what models were optimizing for.

Mixtral of Experts [link] Mistral AI, 8 Jan 2024
Kimi k1.5: Scaling Reinforcement Learning with LLMs [link] Moonshot AI, 20 Jan 2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [link] DeepSeek, 22 Jan 2025
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models [link] Zhipu AI, 20 Jun 2025
The Llama 3 Herd of Models [link] Meta, 31 Jul 2024
LLaMA: Open and Efficient Foundation Language Models [link] Meta, 27 Feb 2023

Efficiency efforts

LoRA: Low-Rank Adaptation of Large Language Models [link] Microsoft, 17 Jun 2021 - LoRA is now regaining popularity, and Fireworks and Thinking Machines both support fine-tuning with LoRA.
Hyena Hierarchy: Towards Larger Convolutional Language Models [link] Stanford, 21 Feb 2023 - Tri Dao contributed to this paper. I kept this because I want to understand how attention can be made theoretically faster with approximations.

Prompting - Models are now trained to run long chains of thought effectively without prompting.

Let’s Verify Step by Step [link] OpenAI, 31 May 2023
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [link] Google, 28 Jan 2022
In-Context Learning for Extreme Multi-Label Classification [link] Ghent University, 22 Jan 2024 - I printed this because I used this in my actual work.

Early alignment efforts - I kept these to read in the future to understand what people were thinking at the time.

Training Language Models to Follow Instructions with Human Feedback [link] OpenAI, 4 Mar 2022
Reinforced Self-Training (ReST) for Language Modeling [link] DeepMind, 17 Aug 2023
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback [link] Google, 1 Sep 2023
Constitutional AI: Harmlessness from AI Feedback [link] Anthropic, 15 Dec 2022

Recommendation Systems papers

For an introduction to recommendation systems, I recommend:

This Chinese-language playlist by Shusen Wang.
Recommendation systems viewed in four stages, with online and offline processes.

Value modeling - Recommendation systems calculate P(action) for multiple actions, for each candidate. The candidates are ranked based on a utility function, which takes the action probabilities as arguments. You need to design a good utility function for your recommendation system. The design also involves deciding how important each action is.

What We Know About Using Non-Engagement Signals in Content Ranking [link] Integrity Institute, 9 Feb 2024 - This puts down in writing that engaging content is usually negatively correlated with “quality”.
Multi-Objective Recommendation via Multivariate Policy Learning [link] Spotify, 3 May 2024
Feedback Shaping: A Modeling Approach to Nurture Content Creation [link] LinkedIn, 21 Jun 2021
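
The ranking step described under value modeling can be sketched as a weighted sum over predicted action probabilities. This is only an illustration under stated assumptions: the linear form of `utility`, the action names, the weights and the probabilities are all made up here, and real systems may use a very different utility function.

```python
def utility(action_probs, weights):
    """Linear utility: a weighted sum of predicted action probabilities."""
    return sum(weights[action] * p for action, p in action_probs.items())


# Hypothetical weights encoding how important each action is.
weights = {"click": 1.0, "like": 2.0, "share": 4.0, "hide": -10.0}

# Hypothetical P(action) predictions for three candidates.
candidates = {
    "item_a": {"click": 0.30, "like": 0.05, "share": 0.01, "hide": 0.001},
    "item_b": {"click": 0.20, "like": 0.10, "share": 0.02, "hide": 0.002},
    "item_c": {"click": 0.40, "like": 0.01, "share": 0.00, "hide": 0.050},
}

# Rank candidates by utility, highest first.
ranked = sorted(
    candidates,
    key=lambda item: utility(candidates[item], weights),
    reverse=True,
)
print(ranked)  # ['item_b', 'item_a', 'item_c']
```

Note that `item_c` has the highest click probability but ranks last, because the negative weight on `hide` outweighs its clicks. Choosing those weights is the design problem.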

Training multi-task models - We usually use one neural network model to predict multiple action probabilities. The alternative is to use a separate model to predict each action probability. However, individual models are sometimes better at predicting action probabilities than the combined model, even when controlling for total parameter count. Hence there is a line of research on bridging this performance gap.

Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts [link] Google, 13 Jun 2018
Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations [link] Tencent, 22 Sep 2020
Recommending What Video to Watch Next: A Multitask Ranking System [link] Google, 10 Sep 2019

Calibration - When you ship a ranking model (the model that predicts P(action) for multiple actions), you also ship how miscalibrated it is. I think calibration is a very easily misunderstood topic. The concept of calibration should have been taught and tested in schools.

On Calibration of Modern Neural Networks [link] Cornell University, 14 Jun 2017
Why Model Calibration Matters and How to Achieve It [link] Google, Apr 2021
Multi-task Learning and Calibration for Utility-based Home Feed Ranking [link] Pinterest, 14 Sep 2020
The Foundations of Cost-Sensitive Learning [link] UCSD, 4 Aug 2001
Predicting Good Probabilities with Supervised Learning [link] Cornell, 7 Aug 2005
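
One way to put a number on how miscalibrated a model is for a single binary action is a binned calibration error: group predictions into probability bins and compare the mean predicted probability with the observed action rate in each bin. The sketch below is a simplified binary variant written for illustration; the function name `binned_calibration_error` and the choice of 10 equal-width bins are assumptions, not something taken from the papers above.

```python
def binned_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean prediction and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    error = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        observed_rate = sum(y for _, y in bucket) / len(bucket)
        error += len(bucket) / total * abs(mean_pred - observed_rate)
    return error


# Predicts 0.1 and the action happens 1 time in 10: well calibrated.
calibrated = binned_calibration_error([0.1] * 10, [1] + [0] * 9)
# Predicts 0.9 but the action happens only 5 times in 10: overconfident.
overconfident = binned_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

A model can rank candidates perfectly and still score badly here, which matters once the probabilities are fed into a utility function.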

Feature engineering - Manually engineering features does not scale well. Whenever you add a new feature, you need to implement all of its feature crosses. It would be great if this process were learnt by the model instead.

DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems [link] Google, 31 Aug 2020 - I wrote about this here.
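
The cross layer in DCN V2 computes x_{l+1} = x_0 ⊙ (W x_l + b) + x_l, which lets the model form feature crosses without anyone writing them by hand. Below is a minimal pure-Python sketch of one such layer. The function name `cross_layer` and the tiny hand-picked weights are assumptions for illustration; in practice W and b are learnt and the layer runs on tensors.

```python
def cross_layer(x0, xl, W, b):
    """One cross layer: x0 * (W @ xl + b) + xl, with an elementwise product."""
    projected = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, xl)) + b_i
        for row, b_i in zip(W, b)
    ]
    return [x0_i * p_i + xl_i for x0_i, p_i, xl_i in zip(x0, projected, xl)]


# Two input features. W is hand-picked to swap them, so the elementwise
# product with x0 yields the explicit cross x0[0] * x0[1].
x0 = [1.0, 2.0]
W = [[0.0, 1.0], [1.0, 0.0]]
b = [0.0, 0.0]

x1 = cross_layer(x0, x0, W, b)
print(x1)  # [3.0, 4.0]
```

Each output here is the cross term `x0[0] * x0[1] = 2.0` plus the residual input, so stacking layers raises the degree of the crosses the model can represent.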

Sequence feature modeling - Your sparse features could be a sequence of IDs. You might believe that you can make better predictions on action probabilities by learning from this sequence.

TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest [link] Pinterest, 1 Jun 2023
Behavior Sequence Transformer for E-commerce Recommendation in Alibaba [link] Alibaba, 16 May 2019
Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction [link] Alibaba, 10 Jun 2020 - Shusen Wang covered this here.
Deep Interest Network for Click-Through Rate Prediction [link] Alibaba, 21 Jun 2017 - This is known as DIN. Shusen Wang covered this here.
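
A common way to learn from a sequence of IDs is to pool the embeddings of past items, weighted by how relevant each one is to the candidate being scored. The sketch below uses softmax over dot products, which is a simplification chosen for illustration; the papers above each use their own scoring functions, and the names `dot` and `target_attention_pool` and the toy embeddings are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def target_attention_pool(candidate, history):
    """Pool history embeddings, weighted by similarity to the candidate."""
    scores = [dot(candidate, h) for h in history]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    pooled = [
        sum(w * h[d] for w, h in zip(weights, history))
        for d in range(len(candidate))
    ]
    return pooled, weights


# Toy 2-d embeddings: one candidate and three past items.
candidate = [1.0, 0.0]
history = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
pooled, weights = target_attention_pool(candidate, history)
```

The pooled vector would then be fed to the ranking model as a dense input alongside the other features.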

Trainable embeddings - If your model uses sparse features (item IDs and action types are examples of sparse features; float values like age are examples of dense features), you will need to map each ID to an embedding and train the embeddings. The problem arises when you have too many IDs to train on. It is not a good idea to just use a larger GPU.

Monolith: Real Time Recommendation System With Collisionless Embedding Table [link] ByteDance, 16 Sep 2022
Efficient Data Representation Learning in Google-scale Systems [link] Google, 14 Sep 2023
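
A common baseline for handling too many IDs is the hashing trick: map every raw ID into a fixed number of buckets and share one embedding per bucket. The sketch below is only an illustration of that baseline and of the collisions it causes; the class name `HashedEmbeddingTable`, the modulo standing in for a real hash function and the initialization scale are assumptions.

```python
import random


class HashedEmbeddingTable:
    """Fixed-size embedding table indexed by a hash of the raw ID."""

    def __init__(self, num_buckets, dim, seed=0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        # Small random initial values; training would update these rows.
        self.table = [
            [rng.gauss(0.0, 0.01) for _ in range(dim)]
            for _ in range(num_buckets)
        ]

    def bucket(self, raw_id):
        # Modulo stands in for a real hash function in this sketch.
        return raw_id % self.num_buckets

    def lookup(self, raw_id):
        return self.table[self.bucket(raw_id)]


table = HashedEmbeddingTable(num_buckets=1000, dim=4)
# IDs 5 and 1005 collide, so they are forced to share one embedding row.
shared = table.lookup(5) is table.lookup(1005)
```

Memory stays bounded no matter how many IDs appear, but unrelated IDs end up sharing gradients, which is the trade-off that collisionless designs try to avoid.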

Pretrained embeddings - In your neural network model, you can also use embeddings that you do not intend to train. One example is content embeddings, which you can introduce to the neural network model in the hope that it predicts the action probabilities better by knowing more about the content. You still need to prove that these embeddings are useful, and even if you fail to do so, you should be prepared to learn something.

Cross-lingual Language Model Pretraining [link] Facebook AI, 22 Jan 2019 - This introduces XLM embeddings.
Text Embeddings by Weakly-Supervised Contrastive Pre-training [link] Microsoft, 7 Dec 2022 - This introduces E5 embeddings.

Two tower model - Recommendation systems first retrieve thousands of candidates from millions of indexed items. Currently, you index items by their item embeddings and retrieve with the user embedding, looking for the items with the largest dot product. The two tower model produces the item and user embeddings. You need to train the model.

Self-supervised Learning for Large-scale Item Recommendations [link] Google, 25 Jul 2020
Cross-Batch Negative Sampling for Training Two-Tower Recommenders [link] Huawei, 28 Oct 2021
Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations [link] Google, 20 Apr 2020
Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations [link] Google, 10 Sep 2019
Deep Neural Networks for YouTube Recommendations [link] Google, 15 Sep 2016
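
The retrieval step of a two tower setup can be sketched as a brute-force search for the items whose embeddings have the largest dot product with the user embedding. This is a toy illustration: the names `dot` and `retrieve_top_k` and the 2-d embeddings are assumptions, and a production system would use an approximate nearest neighbor index rather than scanning every item.

```python
import heapq


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(user_embedding, item_embeddings, k):
    """Return the k item IDs with the largest dot product, best first."""
    return heapq.nlargest(
        k,
        item_embeddings,
        key=lambda item_id: dot(user_embedding, item_embeddings[item_id]),
    )


# Hypothetical outputs of the user tower and the item tower.
user_embedding = [1.0, 0.0]
item_embeddings = {
    "item_a": [0.9, 0.1],
    "item_b": [-1.0, 0.0],
    "item_c": [0.5, 0.5],
}

top_items = retrieve_top_k(user_embedding, item_embeddings, k=2)
print(top_items)  # ['item_a', 'item_c']
```

Because the item embeddings do not depend on the user, they can be computed and indexed offline, leaving only the user tower and the search to run at request time.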

User interest exploration - A good recommendation system does not just recommend content similar to what you have liked. It should also appropriately explore what other types of content you might like.

Values of User Exploration in Recommender Systems [link] Google, 13 Sep 2021

Item exploration - New content is essential to any recommendation system. To help fresh content succeed, you implement methods to surface it more effectively. However, you still need to prove these methods actually work. This is where a challenge arises: traditional user-split A/B testing tends to show lower engagement for variants that prioritize new content, simply because new content has not yet accumulated the signals that make established content perform well. Hence there is this line of research on how to both deliver new content effectively and demonstrate that your approach is beneficial.

Nonlinear Bandits Exploration for Recommendations [link] Google, 14 Sep 2023
Online Matching: A Real-time Bandit System for Large-scale Recommendations [link] Google, 29 Jul 2023
Long-Term Value of Exploration: Measurements, Findings and Algorithms [link] Google, 12 May 2023
Fresh Content Needs More Attention: Multi-funnel Fresh Content Recommendation [link] Google, 2 Jun 2023

Recommendation as sequence prediction - Instead of predicting the P(action), there is a line of research where you predict the item directly, similar to how you predict words in a sentence. I think this line of work only starts contributing value when you have systems that are bilingual in semantic IDs and English. Eugene Yan has an open source implementation.

Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations [link] Meta, 26 Feb 2024 - I still do not understand this paper.
Effective and Efficient Training for Sequential Recommendation using Recency Sampling [link] University of Glasgow, 6 Jul 2022
BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer [link] Alibaba, 15 Apr 2019
Learning from Negative User Feedback and Measuring Responsiveness for Sequential Recommenders [link] Google, 23 Aug 2023

Miscellaneous - These are some papers that I did not manage to classify.

Improving Training Stability for Multitask Ranking Models in Recommender Systems [link] Google, 18 Feb 2023
Trustworthy Online Marketplace Experimentation with Budget-split Design [link] LinkedIn, 16 Dec 2020 - You cannot run a traditional A/B test for ads ranking because users in the variant can cannibalize the budget of the users in control. Even if the A/B test analysis reports that users in the variant contributed more revenue than the users in control, the truth might be the opposite. Therefore you need an experiment design where the budget allocation is split.
Why do tree-based models still outperform deep learning on tabular data? [link] 18 Jul 2022 - You cannot just migrate to a neural network from tree-based models and expect an improvement in metrics.
Towards Understanding the Overfitting Phenomenon of Deep Click-Through Rate Prediction Models [link] Alibaba, 13 Sep 2022 - It seems that in recommendation systems, if you train for more than one epoch, you overfit.
Fairness in Recommendation Ranking through Pairwise Comparisons [link] Google, 2 Mar 2019
Practical Lessons from Predicting Clicks on Ads at Facebook [link] Facebook, 24 Aug 2014
Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs [link] Russian Academy of Sciences, 30 Mar 2016 - This is the HNSW paper. I think it is a good idea to have some intuition on how approximate retrieval works so that you have some idea of what it can and cannot do.
Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations [link] ByteDance, 14 Jul 2020 - I still do not really understand this. I recommend watching this.
Full Index Deep Retrieval: End-to-End User and Item Structures for Cold-start and Long-tail Item Recommendation [link] ByteDance/SJTU, 14 Sep 2023 - This is a follow-up to the Deep Retrieval paper.

Other ML resources

Reinforcement learning - I wrote about reinforcement learning here.

Reinforcement Learning: An Introduction [link] Sutton & Barto, 2018 - Papers involving reinforcement learning assume that you have read this book because they do not fully explain the symbols and terminology they use.

Image models

High-Resolution Image Synthesis with Latent Diffusion Models [link] LMU Munich, 20 Dec 2021 - This is the Stable Diffusion paper.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [link] Google, 22 Oct 2020

Attention mechanism - I drew the attention mechanism here.

Attention Is All You Need [link] Google, 12 Jun 2017
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [link] Google, 11 Oct 2018
Long Short-Term Memory-Networks for Machine Reading [link] University of Edinburgh, 25 Jan 2016
Neural Machine Translation by Jointly Learning to Align and Translate [link] Mila, 1 Sep 2014

Footnotes

¹ This just means that I had previously printed the papers and did not discard them as I moved my residence within the Bay Area. There are very impactful papers in the field that I did not print. There are also papers in the list that I have not really read.