I am moving from one place to another. Over my stay in the United States, I have printed plenty of papers on large language models and recommendation systems. These are the papers I have kept[1] as I moved.

Large Language Model papers

Training guides

  • How to Scale Your Model [link] Google DeepMind, 2025 - If you want a good shot at training LLMs, you will need to do the included homework.

GRPO and variants - I write about my views on GRPO. I only recently found out about this meme, and I agree with its left side.

  • Group Sequence Policy Optimization [link] Qwen, 24 Jul 2025
  • Understanding R1-Zero-Like Training: A Critical Perspective [link] Sea AI Lab, 26 Mar 2025 - This introduces “Dr.GRPO”.
  • Why RLHF (and Other RL-Like Methods) Don’t Bring True RL to LLMs [link] Atlas Wang, 2025
  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [link] DeepSeek, 5 Feb 2024 - This is the GRPO paper.
  • KTO: Model Alignment as Prospect Theoretic Optimization [link] Contextual AI, 2 Feb 2024
  • ORPO: Monolithic Preference Optimization without Reference Model [link] KAIST, 12 Mar 2024
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model [link] Stanford, 29 May 2023 - This is the DPO paper.
  • A General Theoretical Paradigm to Understand Learning from Human Preferences [link] DeepMind, 18 Oct 2023 - They call their algorithm IPO.

Technical reports - I will read these in the future to look back at what models were optimizing for.

  • Mixtral of Experts [link] Mistral AI, 8 Jan 2024
  • Kimi k1.5: Scaling Reinforcement Learning with LLMs [link] Moonshot AI, 20 Jan 2025
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [link] DeepSeek, 22 Jan 2025
  • GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models [link] Zhipu AI, 20 Jun 2025
  • The Llama 3 Herd of Models [link] Meta, 31 Jul 2024
  • LLaMA: Open and Efficient Foundation Language Models [link] Meta, 27 Feb 2023

Efficiency efforts

  • LoRA: Low-Rank Adaptation of Large Language Models [link] Microsoft, 17 Jun 2021 - LoRA is now regaining popularity and Fireworks and Thinking Machines are supporting fine-tuning with LoRA.
  • Hyena Hierarchy: Towards Larger Convolutional Language Models [link] Stanford, 21 Feb 2023 - Tri Dao contributed to this paper. I kept this because I want to understand how attention can be made theoretically faster with approximations.

Prompting - Models are now trained to run long chains of thought effectively without prompting.

  • Let’s Verify Step by Step [link] OpenAI, 31 May 2023
  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [link] Google, 28 Jan 2022
  • In-Context Learning for Extreme Multi-Label Classification [link] Ghent University, 22 Jan 2024 - I printed this because I used this in my actual work.

Early alignment efforts - I kept these to read in the future to understand what people were thinking.

  • Training Language Models to Follow Instructions with Human Feedback [link] OpenAI, 4 Mar 2022
  • Reinforced Self-Training (ReST) for Language Modeling [link] DeepMind, 17 Aug 2023
  • RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback [link] Google, 1 Sep 2023
  • Constitutional AI: Harmlessness from AI Feedback [link] Anthropic, 15 Dec 2022

Recommendation Systems papers

For an introduction to recommendation systems, I recommend

  • This Chinese language playlist by Shusen Wang.
  • Recommendation systems viewed in four stages, with online and offline processes.

Value modeling - Recommendation systems calculate P(action) for multiple actions, for each candidate. The candidates are ranked based on a utility function. The utility function takes the action probabilities as arguments. You need to design a good utility function for your recommendation system. The design also involves deciding how important each action is.

  • What We Know About Using Non-Engagement Signals in Content Ranking [link] Integrity Institute, 9 Feb 2024 - This puts down in writing that engaging content is usually negatively correlated with “quality”.
  • Multi-Objective Recommendation via Multivariate Policy Learning [link] Spotify, 3 May 2024
  • Feedback Shaping: A Modeling Approach to Nurture Content Creation [link] LinkedIn, 21 Jun 2021
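
The value-modeling description above can be sketched as follows. This is a minimal illustration, assuming the utility is a simple weighted sum of action probabilities; the action names, weights, and item IDs are hypothetical, and real systems often use more elaborate utility functions.

```python
from typing import Dict, List, Tuple

# Hypothetical weights: how important each action is to the product.
ACTION_WEIGHTS: Dict[str, float] = {
    "click": 1.0,
    "like": 2.0,
    "share": 4.0,
    "hide": -10.0,  # negative actions pull the utility down
}


def utility(action_probs: Dict[str, float],
            weights: Dict[str, float] = ACTION_WEIGHTS) -> float:
    """Weighted sum of the predicted P(action) values for one candidate."""
    return sum(weights[action] * p for action, p in action_probs.items())


def rank(candidates: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (item_id, utility) pairs sorted by utility, highest first."""
    scored = [(item_id, utility(probs)) for item_id, probs in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical model outputs for two candidates.
candidates = {
    "item_a": {"click": 0.30, "like": 0.05, "share": 0.01, "hide": 0.02},
    "item_b": {"click": 0.20, "like": 0.10, "share": 0.02, "hide": 0.001},
}
ranking = rank(candidates)
```

Here item_a has the higher click probability, but item_b ranks first because the weights favor likes and shares and penalize hides. Choosing those weights is the design work.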

Training multi-task models - We usually use one neural network model to predict multiple action probabilities. The alternative is to use a separate model to predict each action probability. However, sometimes individual models are better at predicting action probabilities than the combined model, even controlling for total parameter count. Hence there is this line of research to bridge the performance gap.

  • Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts [link] Google, 13 Jun 2018
  • Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations [link] Tencent, 22 Sep 2020
  • Recommending What Video to Watch Next: A Multitask Ranking System [link] Google, 10 Sep 2019
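
To make "one neural network model to predict multiple action probabilities" concrete, here is a minimal sketch of a shared-bottom forward pass: one shared hidden layer feeding one sigmoid head per action. The class name, layer sizes, and task names are assumptions for illustration, the weights are random and untrained, and this is only the simple baseline, not the architecture of any paper listed above.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def random_matrix(rng: random.Random, rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class SharedBottomModel:
    """One shared hidden layer, then a separate output head per task."""

    def __init__(self, input_dim: int, hidden_dim: int,
                 tasks: List[str], seed: int = 0) -> None:
        rng = random.Random(seed)
        self.shared_w = random_matrix(rng, hidden_dim, input_dim)
        self.shared_b = [0.0] * hidden_dim
        # Each task owns a single-row weight matrix on top of the shared layer.
        self.heads: Dict[str, Matrix] = {
            task: random_matrix(rng, 1, hidden_dim) for task in tasks
        }

    def predict(self, features: Vector) -> Dict[str, float]:
        """Return P(action) for every task from one forward pass."""
        hidden = [max(0.0, h)  # ReLU
                  for h in dense(features, self.shared_w, self.shared_b)]
        return {task: sigmoid(dense(hidden, w, [0.0])[0])
                for task, w in self.heads.items()}


model = SharedBottomModel(input_dim=3, hidden_dim=4, tasks=["click", "like"])
probs = model.predict([0.2, -1.0, 0.5])
```

Because every head reads the same hidden representation, tasks that disagree compete for it, which is one intuition for the performance gap mentioned above.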

Calibration - When you ship a ranking model (the model that predicts P(action) for multiple actions), you also ship how miscalibrated it is. I think calibration is a very easily misunderstood topic. The concept of calibration should have been taught and tested in schools.

  • On Calibration of Modern Neural Networks [link] Cornell University, 14 Jun 2017
  • Why Model Calibration Matters and How to Achieve It [link] Google, Apr 2021
  • Multi-task Learning and Calibration for Utility-based Home Feed Ranking [link] Pinterest, 14 Sep 2020
  • The Foundations of Cost-Sensitive Learning [link] UCSD, 4 Aug 2001
  • Predicting Good Probabilities with Supervised Learning [link] Cornell, 7 Aug 2005
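
As a rough illustration of what "how miscalibrated it is" can mean for a P(action) model, the sketch below computes two common diagnostics: the ratio of total predicted to total observed positives, and a binned calibration error for binary labels. The function names and the bin count are assumptions, and the binned metric is a binary variant in the spirit of expected calibration error rather than the exact definition from any paper above.

```python
from typing import List, Tuple


def calibration_ratio(preds: List[float], labels: List[int]) -> float:
    """Predicted positives over observed positives; 1.0 is ideal.

    Assumes at least one positive label, otherwise this divides by zero.
    """
    return sum(preds) / sum(labels)


def binned_calibration_error(preds: List[float], labels: List[int],
                             num_bins: int = 10) -> float:
    """Weighted average of |mean prediction - observed rate| over bins."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(num_bins)]
    for p, y in zip(preds, labels):
        # Clamp so that p == 1.0 falls in the last bin.
        index = min(int(p * num_bins), num_bins - 1)
        bins[index].append((p, y))

    total = len(preds)
    error = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        observed_rate = sum(y for _, y in bucket) / len(bucket)
        error += (len(bucket) / total) * abs(mean_pred - observed_rate)
    return error


# One positive in ten impressions, so the observed action rate is 0.1.
labels = [1] + [0] * 9
well_calibrated = [0.1] * 10
over_predicting = [0.2] * 10

ratio_good = calibration_ratio(well_calibrated, labels)
ratio_bad = calibration_ratio(over_predicting, labels)
error_good = binned_calibration_error(well_calibrated, labels)
error_bad = binned_calibration_error(over_predicting, labels)
```

Both models rank these ten impressions identically, so a ranking metric such as AUC cannot tell them apart. Only the calibration diagnostics show that the second model predicts twice as many actions as actually happen.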

Feature engineering - Manually engineering features does not scale well. Whenever you add a new feature, you need to implement all of its feature crosses. It would be great if this process were learnt by the model instead.

  • DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems [link] Google, 31 Aug 2020 - I wrote about this here.
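
For intuition on how a model can learn feature crosses, here is a minimal sketch of a single cross layer in the style of DCN V2, computing x_next = x0 * (W x + b) + x with an elementwise product. The toy inputs and weights are made up for illustration, and a real model would learn W and b and stack several such layers next to a deep network.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def cross_layer(x0: Vector, x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """One cross layer: x0 * (W x + b) + x, with * applied elementwise."""
    projected = [sum(w * v for w, v in zip(row, x)) + b
                 for row, b in zip(weight, bias)]
    return [a * p + residual for a, p, residual in zip(x0, projected, x)]


x0 = [1.0, 2.0]  # two toy input features

# With an identity weight, each feature is crossed with itself.
identity = [[1.0, 0.0], [0.0, 1.0]]
squared = cross_layer(x0, x0, identity, [0.0, 0.0])

# With a swap weight, each feature is crossed with the other one.
swap = [[0.0, 1.0], [1.0, 0.0]]
crossed = cross_layer(x0, x0, swap, [0.0, 0.0])
```

With the swap weight, both outputs contain the product of the two input features, so the off-diagonal entries of W decide which pairwise crosses the layer builds.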

Sequence feature modeling - Your sparse features could be a sequence of IDs. You might believe that you can make better predictions on action probabilities by learning from this sequence.

  • TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest [link] Pinterest, 1 Jun 2023
  • Behavior Sequence Transformer for E-commerce Recommendation in Alibaba [link] Alibaba, 16 May 2019
  • Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction [link] Alibaba, 10 Jun 2020 - Shusen Wang covered this here.
  • Deep Interest Network for Click-Through Rate Prediction [link] Alibaba, 21 Jun 2017 - This is known as DIN. Shusen Wang covered this here.

Trainable embeddings - If your model uses sparse features (item IDs and action types are examples of sparse features; float values like age are examples of dense features), you will need to map each ID to an embedding and train the embeddings. The problem happens when you have too many IDs to train on. It is not a good idea to just use a larger GPU.

  • Monolith: Real Time Recommendation System With Collisionless Embedding Table [link] ByteDance, 16 Sep 2022
  • Efficient Data Representation Learning in Google-scale Systems [link] Google, 14 Sep 2023
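
One common way to cope with too many IDs is to hash each ID into a fixed number of embedding rows. The sketch below assumes that baseline; the class name, table size, and ID format are hypothetical, and it is not the method of either paper above. It shows where collisions come from, which is the problem a collisionless table is meant to avoid.

```python
import hashlib
import random
from typing import List


class HashedEmbeddingTable:
    """Fixed-size embedding table indexed by a hash of the raw ID."""

    def __init__(self, num_rows: int, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.num_rows = num_rows
        # Small random initial values; training would update these rows.
        self.rows: List[List[float]] = [
            [rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(num_rows)
        ]

    def row_index(self, raw_id: str) -> int:
        """Deterministically map an arbitrary ID to a row number."""
        digest = hashlib.md5(raw_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_rows

    def lookup(self, raw_id: str) -> List[float]:
        return self.rows[self.row_index(raw_id)]


table = HashedEmbeddingTable(num_rows=8, dim=4)
embedding = table.lookup("item_1")

# 100 distinct IDs share only 8 rows, so many IDs share one embedding.
used_rows = {table.row_index(f"item_{i}") for i in range(100)}
```

Memory stays bounded no matter how many IDs appear, at the cost of unrelated IDs sharing and jointly updating the same embedding.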

Pretrained embeddings - In your neural network model, you can also use embeddings that you do not intend to train. One such type is content embeddings. You can introduce content embeddings to the neural network model, thinking that the model can better predict the action probabilities by knowing more about the content. You still need to prove that these embeddings are useful, and even if you fail to do so, you should be prepared to learn something.

  • Cross-lingual Language Model Pretraining [link] Facebook AI, 22 Jan 2019 - This introduces XLM embeddings.
  • Text Embeddings by Weakly-Supervised Contrastive Pre-training [link] Microsoft, 7 Dec 2022 - This introduces E5 embeddings.

Two tower model - Recommendation systems involve first retrieving thousands of candidates from millions of indexed items. Currently, you index items by their item embeddings and retrieve with the user embedding, returning the items with the largest dot product. The two tower model produces the item and user embeddings. You need to train the model.

  • Self-supervised Learning for Large-scale Item Recommendations [link] Google, 25 Jul 2020
  • Cross-Batch Negative Sampling for Training Two-Tower Recommenders [link] Huawei, 28 Oct 2021
  • Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations [link] Google, 20 Apr 2020
  • Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations [link] Google, 10 Sep 2019
  • Deep Neural Networks for YouTube Recommendations [link] Google, 15 Sep 2016
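
The retrieval step described above can be sketched as a brute-force dot-product search. The embeddings here are made-up two-dimensional vectors standing in for the outputs of trained item and user towers, and at real scale the exhaustive scan would be replaced by an approximate nearest neighbor index.

```python
import heapq
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_embedding: Vector, item_index: Dict[str, Vector],
             k: int) -> List[str]:
    """Return the k item IDs with the largest dot product, best first."""
    return heapq.nlargest(
        k, item_index,
        key=lambda item_id: dot(user_embedding, item_index[item_id]),
    )


# Hypothetical item-tower outputs, computed offline and indexed.
item_index = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.7, 0.7],
}
# Hypothetical user-tower output, computed at request time.
user_embedding = [0.9, 0.2]

top_items = retrieve(user_embedding, item_index, k=2)
```

The item embeddings can be precomputed because they do not depend on the user, which is what makes this split into two towers practical for retrieval.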

User interest exploration - A good recommendation system does not just recommend content similar to what you have liked. The recommendation system should also appropriately explore what other types of content you might like.

  • Values of User Exploration in Recommender Systems [link] Google, 13 Sep 2021

Item exploration - New content is essential to any recommendation system. To help fresh content succeed, you implement methods to surface it more effectively. However, you still need to prove these methods actually work. This is where a challenge arises: traditional user-split A/B testing tends to show lower engagement for variants that prioritize new content, simply because new content hasn’t yet accumulated the signals that make established content perform well. Hence there is this line of research on how to both deliver new content effectively and demonstrate that your approach is beneficial.

  • Nonlinear Bandits Exploration for Recommendations [link] Google, 14 Sep 2023
  • Online Matching: A Real-time Bandit System for Large-scale Recommendations [link] Google, 29 Jul 2023
  • Long-Term Value of Exploration: Measurements, Findings and Algorithms [link] Google, 12 May 2023
  • Fresh Content Needs More Attention: Multi-funnel Fresh Content Recommendation [link] Google, 2 Jun 2023

Recommendation as sequence prediction - Instead of predicting the P(action), there is a line of research where you predict the item directly, similar to how you predict words in a sentence. I think this line of work only starts contributing value when you have systems that are bilingual in semantic IDs and English. Eugene Yan has an open source implementation.

  • Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations [link] Meta, 26 Feb 2024 - I still do not understand this paper.
  • Effective and Efficient Training for Sequential Recommendation using Recency Sampling [link] University of Glasgow, 6 Jul 2022
  • BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer [link] Alibaba, 15 Apr 2019
  • Learning from Negative User Feedback and Measuring Responsiveness for Sequential Recommenders [link] Google, 23 Aug 2023

Miscellaneous - These are some papers that I did not manage to classify.

  • Improving Training Stability for Multitask Ranking Models in Recommender Systems [link] Google, 18 Feb 2023
  • Trustworthy Online Marketplace Experimentation with Budget-split Design [link] LinkedIn, 16 Dec 2020 - You cannot run a traditional A/B testing process for ads ranking because users in the variant can cannibalize the budget of the users in control. Even though the A/B test analysis reports that users in the variant contributed more revenue than the users in control, the truth might be the opposite. Therefore you need an experiment design where the budget allocation is split.
  • Why do tree-based models still outperform deep learning on tabular data? [link] 18 Jul 2022 - You cannot just migrate to a neural network from tree-based models and expect an improvement in metrics.
  • Towards Understanding the Overfitting Phenomenon of Deep Click-Through Rate Prediction Models [link] Alibaba, 13 Sep 2022 - It seems that in recommendation systems, if you train for more than one epoch, you overfit.
  • Fairness in Recommendation Ranking through Pairwise Comparisons [link] Google, 2 Mar 2019
  • Practical Lessons from Predicting Clicks on Ads at Facebook [link] Facebook, 24 Aug 2014
  • Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs [link] Russian Academy of Sciences, 30 Mar 2016 - This is the HNSW paper. I think it is a good idea to have some intuition on how approximate retrieval works so that you have some idea of what it can and cannot do.
  • Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations [link] ByteDance, 14 Jul 2020 - I still do not really understand this. I recommend watching this.
  • Full Index Deep Retrieval: End-to-End User and Item Structures for Cold-start and Long-tail Item Recommendation [link] ByteDance/SJTU, 14 Sep 2023 - This is a follow-up to the Deep Retrieval paper.

Other ML resources

Reinforcement learning - I wrote about reinforcement learning here.

  • Reinforcement Learning: An Introduction [link] Sutton & Barto, 2018 - Papers involving reinforcement learning assume that you have read this book because they do not fully explain the symbols and terminology they use.

Image models

  • High-Resolution Image Synthesis with Latent Diffusion Models [link] LMU Munich, 20 Dec 2021 - This is the Stable Diffusion paper.
  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [link] Google, 22 Oct 2020

Attention mechanism - I drew the attention mechanism here.

  • Attention Is All You Need [link] Google, 12 Jun 2017
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [link] Google, 11 Oct 2018
  • Long Short-Term Memory-Networks for Machine Reading [link] University of Edinburgh, 25 Jan 2016
  • Neural Machine Translation by Jointly Learning to Align and Translate [link] Mila, 1 Sep 2014

Footnotes

  1. This just means that I previously printed the papers, and I did not discard them as I moved my residence within the Bay Area. There are very impactful papers in the field that I did not print. There are also papers in the list which I have not really read.