[1] Understanding Reference Policies in Direct Preference Optimization:https://arxiv.org/pdf/2407.13709
[2] Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study:https://arxiv.org/pdf/2404.10719v2
[3] Prefix-Tuning:https://arxiv.org/pdf/2101.00190
[4] P-Tuning:https://arxiv.org/pdf/2103.10385
[5] Prompt Tuning:https://arxiv.org/pdf/2104.08691
[6] P-Tuning v2:https://arxiv.org/pdf/2110.07602
[7] LoRA:https://arxiv.org/pdf/2106.09685
[8] AdaLoRA:https://arxiv.org/pdf/2303.10512
[9] PiSSA:https://arxiv.org/abs/2404.02948
[10] OLoRA:https://arxiv.org/pdf/2406.01775
[11] LoHa:https://arxiv.org/pdf/2108.06098
[12] LoKr:https://arxiv.org/pdf/2309.14859
[13] QLoRA:https://arxiv.org/abs/2305.14314
[14] LoftQ:https://arxiv.org/pdf/2310.08659
[15] DoRA:https://arxiv.org/pdf/2402.09353
[16] Adapter tuning:https://arxiv.org/pdf/1902.00751
[17] LLM finetuning:https://arxiv.org/pdf/2402.17193
[18] DPO SFT:https://arxiv.org/pdf/2406.04879
[19] DEEPSEEK DPO:https://arxiv.org/pdf/2401.02954
[20] LLaMA Factory:https://github.com/hiyouga/LLaMA-Factory
[21] Qwen:https://huggingface.co/Qwen/Qwen2-0.5B-Instruct
[22] LIMA:https://arxiv.org/pdf/2305.11206
[23] InsTag:https://arxiv.org/pdf/2308.07074
[24] IFD:https://arxiv.org/pdf/2308.12032v5
[25] WizardLM: Empowering Large Language Models to Follow Complex Instructions:https://arxiv.org/pdf/2304.12244
[26] LESS: Selecting Influential Data for Targeted Instruction Tuning:https://arxiv.org/pdf/2402.04333
[27] DEITA:https://arxiv.org/pdf/2312.15685
[28] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model:https://arxiv.org/pdf/2405.04434
[29] Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?:https://arxiv.org/pdf/2405.05904
[30] Knowledge Verification to Nip Hallucination in the Bud:https://arxiv.org/pdf/2401.10768
[31] OpenAI, Parameter Space Noise for Exploration:https://arxiv.org/pdf/1706.01905
[32] Reinforcement Learning: An Introduction, 2nd Edition, Richard S. Sutton:http://incompleteideas.net/book/the-book-2nd.html
[33] MuZero:https://arxiv.org/pdf/1911.08265
[34] DPG:https://proceedings.mlr.press/v32/silver14.pdf
[35] DDPG:https://arxiv.org/pdf/1509.02971
[36] TD3:https://arxiv.org/pdf/1802.09477
[37] Dec-POMDP:https://arxiv.org/pdf/1301.3836
[38] MAPPO:https://arxiv.org/pdf/2103.01955
[39] QMIX:https://arxiv.org/pdf/1803.11485
[40] COMA:https://arxiv.org/pdf/1705.08926
[41] MADDPG:https://arxiv.org/pdf/1706.02275
[42] MAXQ:https://www.jair.org/index.php/jair/article/view/10266/24463
[43] Feudal Reinforcement Learning:https://www.cs.toronto.edu/~fritz/absps/dh93.pdf
[44] Dyna-Q:http://www.incompleteideas.net/papers/sutton-90.pdf
[45] POMDP, Optimal Control of Markov Processes with Incomplete State Information:https://core.ac.uk/download/pdf/82498456.pdf
[46] DPO:https://arxiv.org/pdf/2305.18290
[47] Actor-Critic architecture:http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf
[48] SAC:https://arxiv.org/pdf/1801.01290
[49] A3C:https://arxiv.org/pdf/1602.01783
[50] GAE:https://arxiv.org/pdf/1506.02438
[51] TRPO:https://arxiv.org/pdf/1502.05477
[52] PPO:https://arxiv.org/abs/1707.06347
[53] John Schulman:https://www.technologyreview.com/2023/03/03/1069311/inside-story-oral-history-how-chatgpt-built-openai/
[54] Deep Reinforcement Learning from Human Preferences:https://arxiv.org/pdf/1706.03741
[55] Fine-Tuning Language Models from Human Preferences:https://arxiv.org/pdf/1909.08593
[56] InstructGPT: Training language models to follow instructions with human feedback:https://arxiv.org/pdf/2203.02155
[57] Llama 2:https://arxiv.org/pdf/2307.09288
[58] Llama 3:https://arxiv.org/abs/2407.21783
[59] Scaling Laws for Reward Model Overoptimization:https://arxiv.org/pdf/2210.10760
[60] TRL:https://github.com/huggingface/trl
[61] Andrew Ng, IRL, Algorithms for Inverse Reinforcement Learning:https://ai.stanford.edu/~ang/papers/icml00-irl.pdf
[62] CAI:https://arxiv.org/pdf/2212.08073
[63] RLAIF-V:https://arxiv.org/pdf/2405.17220
[64] Claude’s Constitution:https://www.anthropic.com/news/claudes-constitution
[65] RLlib:https://docs.ray.io/en/latest/rllib/index.html
[66] Stable Baselines3 (SB3):https://stable-baselines3.readthedocs.io/en/master/
[67] OpenRLHF:https://openrlhf.readthedocs.io/en/latest/
[68] lilianweng:https://lilianweng.github.io/posts/2024-11-28-reward-hacking/
[69] SEAL:https://arxiv.org/abs/2408.10270
[70] Reward hacking:https://arxiv.org/abs/2201.03544
[71] Anthropic, Rejection Sampling:https://arxiv.org/pdf/2204.05862
[72] GRPO:https://arxiv.org/pdf/2402.03300
[73] OpenAI RBR:https://openai.com/index/improving-model-safety-behavior-with-rule-based-rewards/
[74] Contrastive Search:https://arxiv.org/pdf/2202.06417
[75] Lookahead Decoding:https://arxiv.org/pdf/2402.02057
[76] Phi-4:https://arxiv.org/pdf/2412.08905
[77] DoLa:https://arxiv.org/abs/2309.03883
[78] Transformers:https://huggingface.co/docs/transformers/index
[79] Prompt Engineering Guide:https://www.promptingguide.ai/
[80] OpenAI Prompt:https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api
[81] OpenAI Prompt:https://platform.openai.com/docs/guides/prompt-engineering#six-strategies-for-getting-better-results
[82] CoT:https://arxiv.org/pdf/2201.11903
[83] ToT:https://arxiv.org/pdf/2305.10601
[84] Auto-CoT:https://arxiv.org/pdf/2210.03493
[85] Self-Consistency with CoT:https://arxiv.org/pdf/2203.11171
[86] XoT:https://arxiv.org/pdf/2311.04254
[87] GoT:https://arxiv.org/pdf/2308.09687
[88] MoT:https://arxiv.org/pdf/2305.05181
[89] Multimodal CoT:https://arxiv.org/pdf/2302.00923
[90] VLM CoT:https://arxiv.org/pdf/2410.16198
[91] Zero-shot-CoT:https://arxiv.org/pdf/2205.11916
[92] langchain:https://python.langchain.com/docs/
[93] Contextual RAG:https://www.anthropic.com/news/contextual-retrieval
[94] RAGFlow:https://ragflow.io/
[95] TD:http://incompleteideas.net/papers/sutton-88-with-erratum.pdf
[96] Hugging Face PEFT:https://huggingface.co/docs/peft/index
[97] Byte Latent Transformer:https://arxiv.org/pdf/2412.09871
[98] OpenAI Scaling Law:https://arxiv.org/pdf/2001.08361
[99] DeepMind Chinchilla Scaling Law:https://arxiv.org/pdf/2203.15556
[100] OpenAI o1 Scaling Law:https://openai.com/index/learning-to-reason-with-llms/
[101] VLMEvalKit:https://github.com/open-compass/VLMEvalKit
[102] opencompass:https://github.com/open-compass/opencompass
[103] ollama:https://github.com/ollama/ollama
[104] mlc-llm:https://github.com/mlc-ai/mlc-llm
[105] llama.cpp:https://github.com/ggerganov/llama.cpp
[106] text-generation-inference:https://github.com/huggingface/text-generation-inference
[107] langgraph:https://github.com/langchain-ai/langgraph
[108] Qwen2.5:https://arxiv.org/pdf/2412.15115
[109] DQN:https://arxiv.org/pdf/1312.5602
[110] Q-learning:https://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf
[111] Policy Gradient & REINFORCE:https://people.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf
[112] Policy Gradient Theorem:https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf
[113] Hugging Face:https://huggingface.co/
[114] Diverse Beam Search:https://arxiv.org/pdf/1610.02424
[115] Constrained Beam Search:https://arxiv.org/pdf/1612.00576
[116] Top-P:https://arxiv.org/pdf/1904.09751
[117] Top-K:https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[118] Speculative Sampling:https://arxiv.org/pdf/2302.01318
[119] UCB:https://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf
[120] Rainbow:https://arxiv.org/pdf/1710.02298
[121] Prioritized Experience Replay:https://arxiv.org/pdf/1511.05952
[122] Dueling DQN:https://arxiv.org/pdf/1511.06581
[123] Double DQN:https://arxiv.org/pdf/1509.06461
[124] DQN + Target Network:https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
[125] POSG:https://www.khoury.northeastern.edu/home/camato/publications/aaai-SS-04.pdf
[126] IQL:https://web.media.mit.edu/~cynthiab/Readings/tan-MAS-reinfLearn.pdf
[127] BC:https://proceedings.neurips.cc/paper/1988/file/812b4ba287f5ee0bc9d43bbf5bbe87fb-Paper.pdf
[128] GAIL:https://arxiv.org/pdf/1606.03476
[129] MCTS:https://www.davidsilver.uk/wp-content/uploads/2020/03/pomcp.pdf
[131] Distributional RL:https://arxiv.org/pdf/1707.06887
[132] Chatbot Arena:https://www.lmarena.ai/
[133] Teacher Forcing:https://gwern.net/doc/ai/nn/rnn/1989-williams-2.pdf
[134] GPT-1:https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[135] Attention Is All You Need:https://arxiv.org/pdf/1706.03762
[136] Ilya Sutskever, seq2seq, LSTM:https://arxiv.org/pdf/1409.3215
[137] MoE, Switch Transformers:https://arxiv.org/pdf/2101.03961
[138] RoPE, Jianlin Su:https://arxiv.org/pdf/2104.09864
[139] ResNet, Kaiming He:https://arxiv.org/pdf/1512.03385
[140] DriveVLM, Tsinghua University, Li Auto:https://arxiv.org/pdf/2402.12289
[141] ELMo:https://arxiv.org/pdf/1802.05365
[142] Generative Verifiers:https://arxiv.org/pdf/2408.15240
[143] rStar:https://arxiv.org/pdf/2408.06195
[144] Scaling LLM Test-Time Compute:https://arxiv.org/pdf/2408.03314v1
[145] LoRA:https://github.com/microsoft/LoRA
[146] GPT-3:https://arxiv.org/pdf/2005.14165
[147] RAG:https://arxiv.org/pdf/2005.11401
[148] Richard S. Sutton:http://incompleteideas.net/
[149] The Bitter Lesson:http://incompleteideas.net/IncIdeas/BitterLesson.html
[150] Yarn:https://arxiv.org/pdf/2309.00071
[151] Qwen2.5:https://arxiv.org/pdf/2412.15115
[152] DeepSeek-V3:https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
[153] Speculative Decoding:https://arxiv.org/pdf/2211.17192
[154] MCTS:https://inria.hal.science/inria-00116992/document
[155] UCT:http://ggp.stanford.edu/readings/uct.pdf
[156] AlphaGo:https://www.davidsilver.uk/wp-content/uploads/2020/03/unformatted_final_mastering_go.pdf
[157] About BoN:https://arxiv.org/pdf/2009.01325
[158] BOND:https://arxiv.org/pdf/2407.14622
[159] DVTS:https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute
[160] A Survey on KD of LLM:https://arxiv.org/pdf/2402.13116
[161] Investigating Mysteries of CoT-Augmented Distillation:https://arxiv.org/pdf/2406.14511
[162] TTT:https://ekinakyurek.github.io/papers/ttt.pdf
[163] RESET:https://arxiv.org/pdf/2409.14586
[164] OpenAI, Let’s Verify Step by Step:https://arxiv.org/pdf/2305.20050
[165] DeepMind, Google, OmegaPRM:https://arxiv.org/pdf/2406.06592
[166] DeepMind, PRM:https://arxiv.org/pdf/2211.14275
[167] Epoch AI, Will we run out of data:https://arxiv.org/pdf/2211.04325v2
[168] Epoch AI:https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data
[169] A Survey on Data Synthesis and Augmentation:https://arxiv.org/pdf/2410.12896
[170] OLMo 2:https://arxiv.org/pdf/2501.00656
[171] TÜLU 3:https://arxiv.org/pdf/2411.15124
[172] ReFT:https://arxiv.org/pdf/2401.08967
[173] Reinforcement Fine-Tuning:https://openai.com/12-days/
[174] rStar-Math:https://arxiv.org/pdf/2501.04519
[175] A* search:https://ai.stanford.edu/~nilsson/OnlinePubs-Nils/PublishedPapers/astar.pdf
[176] Meta-CoT:https://arxiv.org/pdf/2501.04682
[177] Best-first:https://arxiv.org/pdf/2407.01476
[178] AlphaZero:https://arxiv.org/pdf/1712.01815
[179] AlphaGo Zero:https://discovery.ucl.ac.uk/id/eprint/10045895/1/agz_unformatted_nature.pdf
[180] Self-Rewarding:https://arxiv.org/pdf/2401.10020
[181] Meta, Meta-Rewarding:https://arxiv.org/pdf/2407.19594
[182] DeepMind, SCoRe:https://arxiv.org/pdf/2409.12917
[183] OpenAI, Deliberative Alignment:https://arxiv.org/pdf/2412.16339
[184] Distillation:https://arxiv.org/pdf/1503.02531
[185] DeepSeek-R1:https://arxiv.org/pdf/2501.12948
[186] Approximating KL Divergence:http://joschu.net/blog/kl-approx.html
[187] Li Fei-Fei, s1:https://arxiv.org/pdf/2501.19393
[188] BitNet b1.58:https://arxiv.org/pdf/2402.17764
[189] Unsloth:https://unsloth.ai/
[190] GRPO Trainer:https://huggingface.co/docs/trl/main/en/grpo_trainer