References

[1] Understanding Reference Policies in Direct Preference Optimization：https://arxiv.org/pdf/2407.13709

[2] Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study：https://arxiv.org/pdf/2404.10719v2

[3] Prefix-Tuning：https://arxiv.org/pdf/2101.00190

[4] P-Tuning：https://arxiv.org/pdf/2103.10385

[5] Prompt Tuning：https://arxiv.org/pdf/2104.08691

[6] P-Tuning v2：https://arxiv.org/pdf/2110.07602

[7] LoRA：https://arxiv.org/pdf/2106.09685

[8] AdaLoRA：https://arxiv.org/pdf/2303.10512

[9] PiSSA：https://arxiv.org/abs/2404.02948

[10] OLoRA：https://arxiv.org/pdf/2406.01775

[11] LoHa：https://arxiv.org/pdf/2108.06098

[12] LoKr：https://arxiv.org/pdf/2309.14859

[13] QLoRA：https://arxiv.org/abs/2305.14314

[14] LoftQ：https://arxiv.org/pdf/2310.08659

[15] DoRA：https://arxiv.org/pdf/2402.09353

[16] Adapter tuning：https://arxiv.org/pdf/1902.00751

[17] LLM finetuning：https://arxiv.org/pdf/2402.17193

[18] DPO SFT：https://arxiv.org/pdf/2406.04879

[19] DEEPSEEK DPO：https://arxiv.org/pdf/2401.02954

[20] LLaMA Factory：https://github.com/hiyouga/LLaMA-Factory

[21] Qwen：https://huggingface.co/Qwen/Qwen2-0.5B-Instruct

[22] LIMA：https://arxiv.org/pdf/2305.11206

[23] InsTag：https://arxiv.org/pdf/2308.07074

[24] IFD：https://arxiv.org/pdf/2308.12032v5

[25] WizardLM: Empowering Large Language Models to Follow Complex Instructions：https://arxiv.org/pdf/2304.12244

[26] LESS: Selecting Influential Data for Targeted Instruction Tuning：https://arxiv.org/pdf/2402.04333

[27] DEITA：https://arxiv.org/pdf/2312.15685

[28] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model：https://arxiv.org/pdf/2405.04434

[29] Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?：https://arxiv.org/pdf/2405.05904

[30] Knowledge Verification to Nip Hallucination in the Bud：https://arxiv.org/pdf/2401.10768

[31] OpenAI, Parameter Space Noise for Exploration：https://arxiv.org/pdf/1706.01905

[32] Reinforcement Learning: An Introduction, 2nd Edition, Richard S. Sutton：http://incompleteideas.net/book/the-book-2nd.html

[33] MuZero：https://arxiv.org/pdf/1911.08265

[34] DPG：https://proceedings.mlr.press/v32/silver14.pdf

[35] DDPG：https://arxiv.org/pdf/1509.02971

[36] TD3：https://arxiv.org/pdf/1802.09477

[37] Dec-POMDP：https://arxiv.org/pdf/1301.3836

[38] MAPPO：https://arxiv.org/pdf/2103.01955

[39] QMIX：https://arxiv.org/pdf/1803.11485

[40] COMA：https://arxiv.org/pdf/1705.08926

[41] MADDPG：https://arxiv.org/pdf/1706.02275

[42] MAXQ：https://www.jair.org/index.php/jair/article/view/10266/24463

[43] Feudal Reinforcement Learning：https://www.cs.toronto.edu/~fritz/absps/dh93.pdf

[44] Dyna-Q：http://www.incompleteideas.net/papers/sutton-90.pdf

[45] POMDP：Optimal control of Markov processes with incomplete state information, https://core.ac.uk/download/pdf/82498456.pdf

[46] DPO：https://arxiv.org/pdf/2305.18290

[47] Actor-Critic architecture：http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf

[48] SAC：https://arxiv.org/pdf/1801.01290

[49] A3C：https://arxiv.org/pdf/1602.01783

[50] GAE：https://arxiv.org/pdf/1506.02438

[51] TRPO：https://arxiv.org/pdf/1502.05477

[52] PPO：https://arxiv.org/abs/1707.06347

[53] John Schulman：https://www.technologyreview.com/2023/03/03/1069311/inside-story-oral-history-how-chatgpt-built-openai/

[54] Deep Reinforcement Learning from Human Preferences：https://arxiv.org/pdf/1706.03741

[55] Fine-Tuning Language Models from Human Preferences：https://arxiv.org/pdf/1909.08593

[56] InstructGPT: Training language models to follow instructions with human feedback：https://arxiv.org/pdf/2203.02155

[57] Llama 2：https://arxiv.org/pdf/2307.09288

[58] Llama 3：https://arxiv.org/abs/2407.21783

[59] Scaling Laws for Reward Model Overoptimization：https://arxiv.org/pdf/2210.10760

[60] TRL：https://github.com/huggingface/trl

[61] Andrew Ng, IRL, Algorithms for Inverse Reinforcement Learning：https://ai.stanford.edu/~ang/papers/icml00-irl.pdf

[62] CAI：https://arxiv.org/pdf/2212.08073

[63] RLAIF-V：https://arxiv.org/pdf/2405.17220

[64] Claude’s Constitution：https://www.anthropic.com/news/claudes-constitution

[65] RLlib：https://docs.ray.io/en/latest/rllib/index.html

[66] Stable Baselines3（SB3）：https://stable-baselines3.readthedocs.io/en/master/

[67] OpenRLHF：https://openrlhf.readthedocs.io/en/latest/

[68] lilianweng：https://lilianweng.github.io/posts/2024-11-28-reward-hacking/

[69] SEAL：https://arxiv.org/abs/2408.10270

[70] Reward hacking：https://arxiv.org/abs/2201.03544

[71] Anthropic, Rejection Sampling：https://arxiv.org/pdf/2204.05862

[72] GRPO：https://arxiv.org/pdf/2402.03300

[73] OpenAI RBR：https://openai.com/index/improving-model-safety-behavior-with-rule-based-rewards/

[74] Contrastive Search：https://arxiv.org/pdf/2202.06417

[75] Lookahead Decoding：https://arxiv.org/pdf/2402.02057

[76] Phi-4：https://arxiv.org/pdf/2412.08905

[77] DoLa：https://arxiv.org/abs/2309.03883

[78] Transformers：https://huggingface.co/docs/transformers/index

[79] Prompt Engineering Guide：https://www.promptingguide.ai/

[80] OpenAI Prompt：https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api

[81] OpenAI Prompt：https://platform.openai.com/docs/guides/prompt-engineering#six-strategies-for-getting-better-results

[82] CoT：https://arxiv.org/pdf/2201.11903

[83] ToT：https://arxiv.org/pdf/2305.10601

[84] Auto-CoT：https://arxiv.org/pdf/2210.03493

[85] Self-Consistency with CoT：https://arxiv.org/pdf/2203.11171

[86] XoT：https://arxiv.org/pdf/2311.04254

[87] GoT：https://arxiv.org/pdf/2308.09687

[88] MoT：https://arxiv.org/pdf/2305.05181

[89] Multimodal CoT：https://arxiv.org/pdf/2302.00923

[90] VLM CoT：https://arxiv.org/pdf/2410.16198

[91] Zero-shot-CoT：https://arxiv.org/pdf/2205.11916

[92] langchain：https://python.langchain.com/docs/

[93] Contextual RAG：https://www.anthropic.com/news/contextual-retrieval

[94] RAGFlow：https://ragflow.io/

[95] TD：http://incompleteideas.net/papers/sutton-88-with-erratum.pdf

[96] Hugging Face PEFT：https://huggingface.co/docs/peft/index

[97] Byte Latent Transformer：https://arxiv.org/pdf/2412.09871

[98] OpenAI Scaling Law：https://arxiv.org/pdf/2001.08361

[99] DeepMind Chinchilla Scaling Law：https://arxiv.org/pdf/2203.15556

[100] OpenAI o1 Scaling Law：https://openai.com/index/learning-to-reason-with-llms/

[101] VLMEvalKit：https://github.com/open-compass/VLMEvalKit

[102] opencompass：https://github.com/open-compass/opencompass

[103] ollama：https://github.com/ollama/ollama

[104] mlc-llm：https://github.com/mlc-ai/mlc-llm

[105] llama.cpp：https://github.com/ggerganov/llama.cpp

[106] text-generation-inference：https://github.com/huggingface/text-generation-inference

[107] langgraph：https://github.com/langchain-ai/langgraph

[108] Qwen2.5：https://arxiv.org/pdf/2412.15115

[109] DQN：https://arxiv.org/pdf/1312.5602

[110] Q-learning：https://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf

[111] Policy Gradient&REINFORCE：https://people.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf

[112] Policy Gradient Theorem：https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf

[113] Hugging Face：https://huggingface.co/

[114] Diverse Beam Search：https://arxiv.org/pdf/1610.02424

[115] Constrained Beam Search：https://arxiv.org/pdf/1612.00576

[116] Top-P：https://arxiv.org/pdf/1904.09751

[117] Top-K：https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

[118] Speculative Sampling：https://arxiv.org/pdf/2302.01318

[119] UCB：https://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf

[120] Rainbow：https://arxiv.org/pdf/1710.02298

[121] Prioritized Experience Replay：https://arxiv.org/pdf/1511.05952

[122] Dueling DQN：https://arxiv.org/pdf/1511.06581

[123] Double DQN：https://arxiv.org/pdf/1509.06461

[124] DQN + Target Network：https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf

[125] POSG：https://www.khoury.northeastern.edu/home/camato/publications/aaai-SS-04.pdf

[126] IQL：https://web.media.mit.edu/~cynthiab/Readings/tan-MAS-reinfLearn.pdf

[127] BC：https://proceedings.neurips.cc/paper/1988/file/812b4ba287f5ee0bc9d43bbf5bbe87fb-Paper.pdf

[128] GAIL：https://arxiv.org/pdf/1606.03476

[129] MCTS：https://www.davidsilver.uk/wp-content/uploads/2020/03/pomcp.pdf

[130] HRL：https://proceedings.neurips.cc/paper_files/paper/1997/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf

[131] Distributional RL：https://arxiv.org/pdf/1707.06887

[132] Chatbot Arena：https://www.lmarena.ai/

[133] Teacher Forcing：https://gwern.net/doc/ai/nn/rnn/1989-williams-2.pdf

[134] GPT-1：https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

[135] Attention Is All You Need：https://arxiv.org/pdf/1706.03762

[136] Ilya Sutskever, seq2seq, LSTM：https://arxiv.org/pdf/1409.3215

[137] MoE, Switch Transformers：https://arxiv.org/pdf/2101.03961

[138] RoPE, Jianlin Su：https://arxiv.org/pdf/2104.09864

[139] ResNet, Kaiming He：https://arxiv.org/pdf/1512.03385

[140] DriveVLM, Tsinghua University, Li Auto：https://arxiv.org/pdf/2402.12289

[141] ELMo：https://arxiv.org/pdf/1802.05365

[142] Generative Verifiers：https://arxiv.org/pdf/2408.15240

[143] rStar：https://arxiv.org/pdf/2408.06195

[144] Scaling LLM Test-Time Compute：https://arxiv.org/pdf/2408.03314v1

[145] LoRA：https://github.com/microsoft/LoRA

[146] GPT-3：https://arxiv.org/pdf/2005.14165

[147] RAG：https://arxiv.org/pdf/2005.11401

[148] Richard S. Sutton：http://incompleteideas.net/

[149] The Bitter Lesson：http://incompleteideas.net/IncIdeas/BitterLesson.html

[150] Yarn：https://arxiv.org/pdf/2309.00071

[151] Qwen2.5：https://arxiv.org/pdf/2412.15115

[152] DeepSeek-V3：https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf

[153] Speculative Decoding：https://arxiv.org/pdf/2211.17192

[154] MCTS：https://inria.hal.science/inria-00116992/document

[155] UCT：http://ggp.stanford.edu/readings/uct.pdf

[156] AlphaGo：https://www.davidsilver.uk/wp-content/uploads/2020/03/unformatted_final_mastering_go.pdf

[157] About BoN：https://arxiv.org/pdf/2009.01325

[158] BOND：https://arxiv.org/pdf/2407.14622

[159] DVTS：https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute

[160] A Survey on KD of LLM：https://arxiv.org/pdf/2402.13116

[161] Investigating Mysteries of CoT-Augmented Distillation：https://arxiv.org/pdf/2406.14511

[162] TTT：https://ekinakyurek.github.io/papers/ttt.pdf

[163] RESET：https://arxiv.org/pdf/2409.14586

[164] OpenAI, Let’s Verify Step by Step：https://arxiv.org/pdf/2305.20050

[165] DeepMind, Google, OmegaPRM：https://arxiv.org/pdf/2406.06592

[166] DeepMind, PRM：https://arxiv.org/pdf/2211.14275

[167] Epoch AI, Will we run out of data：https://arxiv.org/pdf/2211.04325v2

[168] Epoch AI：https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data

[169] A Survey on Data Synthesis and Augmentation：https://arxiv.org/pdf/2410.12896

[170] OLMo 2：https://arxiv.org/pdf/2501.00656

[171] TÜLU 3：https://arxiv.org/pdf/2411.15124

[172] ReFT：https://arxiv.org/pdf/2401.08967

[173] Reinforcement Fine-Tuning：https://openai.com/12-days/

[174] rStar-Math：https://arxiv.org/pdf/2501.04519

[175] A* search：https://ai.stanford.edu/~nilsson/OnlinePubs-Nils/PublishedPapers/astar.pdf

[176] Meta-CoT：https://arxiv.org/pdf/2501.04682

[177] Best-first：https://arxiv.org/pdf/2407.01476

[178] AlphaZero：https://arxiv.org/pdf/1712.01815

[179] AlphaGo Zero：https://discovery.ucl.ac.uk/id/eprint/10045895/1/agz_unformatted_nature.pdf

[180] Self-Rewarding：https://arxiv.org/pdf/2401.10020

[181] Meta, Meta-Rewarding：https://arxiv.org/pdf/2407.19594

[182] DeepMind, SCoRe：https://arxiv.org/pdf/2409.12917

[183] OpenAI, Deliberative Alignment：https://arxiv.org/pdf/2412.16339

[184] Distillation：https://arxiv.org/pdf/1503.02531

[185] DeepSeek-R1：https://arxiv.org/pdf/2501.12948

[186] Approximating KL Divergence：http://joschu.net/blog/kl-approx.html

[187] Li Fei-Fei, s1：https://arxiv.org/pdf/2501.19393

[188] BitNet b1.58：https://arxiv.org/pdf/2402.17764

[189] Unsloth：https://unsloth.ai/

[190] GRPO Trainer：https://huggingface.co/docs/trl/main/en/grpo_trainer