References

[1] Understanding Reference Policies in Direct Preference Optimization:https://arxiv.org/pdf/2407.13709

[2] Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study:https://arxiv.org/pdf/2404.10719v2

[3] Prefix-Tuning:https://arxiv.org/pdf/2101.00190

[4] P-Tuning:https://arxiv.org/pdf/2103.10385

[5] Prompt Tuning:https://arxiv.org/pdf/2104.08691

[6] P-Tuning v2:https://arxiv.org/pdf/2110.07602

[7] LoRA:https://arxiv.org/pdf/2106.09685

[8] AdaLoRA:https://arxiv.org/pdf/2303.10512

[9] PiSSA:https://arxiv.org/abs/2404.02948

[10] OLoRA:https://arxiv.org/pdf/2406.01775

[11] LoHa:https://arxiv.org/pdf/2108.06098

[12] LoKr:https://arxiv.org/pdf/2309.14859

[13] QLoRA:https://arxiv.org/abs/2305.14314

[14] LoftQ:https://arxiv.org/pdf/2310.08659

[15] DoRA:https://arxiv.org/pdf/2402.09353

[16] Adapter tuning:https://arxiv.org/pdf/1902.00751

[17] LLM finetuning:https://arxiv.org/pdf/2402.17193

[18] DPO SFT:https://arxiv.org/pdf/2406.04879

[19] DeepSeek DPO:https://arxiv.org/pdf/2401.02954

[20] LLaMA Factory:https://github.com/hiyouga/LLaMA-Factory

[21] Qwen:https://huggingface.co/Qwen/Qwen2-0.5B-Instruct

[22] LIMA:https://arxiv.org/pdf/2305.11206

[23] InsTag:https://arxiv.org/pdf/2308.07074

[24] IFD:https://arxiv.org/pdf/2308.12032v5

[25] WizardLM: Empowering Large Language Models to Follow Complex Instructions:https://arxiv.org/pdf/2304.12244

[26] LESS: Selecting Influential Data for Targeted Instruction Tuning:https://arxiv.org/pdf/2402.04333

[27] DEITA:https://arxiv.org/pdf/2312.15685

[28] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model:https://arxiv.org/pdf/2405.04434

[29] Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?:https://arxiv.org/pdf/2405.05904

[30] Knowledge Verification to Nip Hallucination in the Bud:https://arxiv.org/pdf/2401.10768

[31] OpenAI, Parameter Space Noise for Exploration:https://arxiv.org/pdf/1706.01905

[32] Reinforcement Learning: An Introduction, 2nd Edition, Richard S. Sutton:http://incompleteideas.net/book/the-book-2nd.html

[33] MuZero:https://arxiv.org/pdf/1911.08265

[34] DPG:https://proceedings.mlr.press/v32/silver14.pdf

[35] DDPG:https://arxiv.org/pdf/1509.02971

[36] TD3:https://arxiv.org/pdf/1802.09477

[37] Dec-POMDP:https://arxiv.org/pdf/1301.3836

[38] MAPPO:https://arxiv.org/pdf/2103.01955

[39] QMIX:https://arxiv.org/pdf/1803.11485

[40] COMA:https://arxiv.org/pdf/1705.08926

[41] MADDPG:https://arxiv.org/pdf/1706.02275

[42] MAXQ:https://www.jair.org/index.php/jair/article/view/10266/24463

[43] Feudal Reinforcement Learning:https://www.cs.toronto.edu/~fritz/absps/dh93.pdf

[44] Dyna-Q:http://www.incompleteideas.net/papers/sutton-90.pdf

[45] POMDP, Optimal control of Markov processes with incomplete state information:https://core.ac.uk/download/pdf/82498456.pdf

[46] DPO:https://arxiv.org/pdf/2305.18290

[47] Actor-Critic (AC) architecture:http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf

[48] SAC:https://arxiv.org/pdf/1801.01290

[49] A3C:https://arxiv.org/pdf/1602.01783

[50] GAE:https://arxiv.org/pdf/1506.02438

[51] TRPO:https://arxiv.org/pdf/1502.05477

[52] PPO:https://arxiv.org/abs/1707.06347

[53] John Schulman:https://www.technologyreview.com/2023/03/03/1069311/inside-story-oral-history-how-chatgpt-built-openai/

[54] Deep Reinforcement Learning from Human Preferences:https://arxiv.org/pdf/1706.03741

[55] Fine-Tuning Language Models from Human Preferences:https://arxiv.org/pdf/1909.08593

[56] InstructGPT: Training language models to follow instructions with human feedback:https://arxiv.org/pdf/2203.02155

[57] Llama 2:https://arxiv.org/pdf/2307.09288

[58] Llama 3:https://arxiv.org/abs/2407.21783

[59] Scaling Laws for Reward Model Overoptimization:https://arxiv.org/pdf/2210.10760

[60] TRL:https://github.com/huggingface/trl

[61] Andrew Ng, IRL, Algorithms for Inverse Reinforcement Learning:https://ai.stanford.edu/~ang/papers/icml00-irl.pdf

[62] CAI:https://arxiv.org/pdf/2212.08073

[63] RLAIF-V:https://arxiv.org/pdf/2405.17220

[64] Claude’s Constitution:https://www.anthropic.com/news/claudes-constitution

[65] RLlib:https://docs.ray.io/en/latest/rllib/index.html

[66] Stable Baselines3(SB3):https://stable-baselines3.readthedocs.io/en/master/

[67] OpenRLHF:https://openrlhf.readthedocs.io/en/latest/

[68] Lilian Weng, Reward Hacking:https://lilianweng.github.io/posts/2024-11-28-reward-hacking/

[69] SEAL:https://arxiv.org/abs/2408.10270

[70] Reward hacking:https://arxiv.org/abs/2201.03544

[71] Anthropic, Rejection Sampling:https://arxiv.org/pdf/2204.05862

[72] GRPO:https://arxiv.org/pdf/2402.03300

[73] OpenAI RBR:https://openai.com/index/improving-model-safety-behavior-with-rule-based-rewards/

[74] Contrastive Search:https://arxiv.org/pdf/2202.06417

[75] Lookahead Decoding:https://arxiv.org/pdf/2402.02057

[76] Phi-4:https://arxiv.org/pdf/2412.08905

[77] DoLa:https://arxiv.org/abs/2309.03883

[78] Transformers:https://huggingface.co/docs/transformers/index

[79] Prompt Engineering Guide:https://www.promptingguide.ai/

[80] OpenAI Prompt:https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api

[81] OpenAI Prompt:https://platform.openai.com/docs/guides/prompt-engineering#six-strategies-for-getting-better-results

[82] CoT:https://arxiv.org/pdf/2201.11903

[83] ToT:https://arxiv.org/pdf/2305.10601

[84] Auto-CoT:https://arxiv.org/pdf/2210.03493

[85] Self-Consistency with CoT:https://arxiv.org/pdf/2203.11171

[86] XoT:https://arxiv.org/pdf/2311.04254

[87] GoT:https://arxiv.org/pdf/2308.09687

[88] MoT:https://arxiv.org/pdf/2305.05181

[89] Multimodal CoT:https://arxiv.org/pdf/2302.00923

[90] VLM CoT:https://arxiv.org/pdf/2410.16198

[91] Zero-shot-CoT:https://arxiv.org/pdf/2205.11916

[92] langchain:https://python.langchain.com/docs/

[93] Contextual RAG:https://www.anthropic.com/news/contextual-retrieval

[94] RAGFlow:https://ragflow.io/

[95] TD:http://incompleteideas.net/papers/sutton-88-with-erratum.pdf

[96] Hugging Face PEFT:https://huggingface.co/docs/peft/index

[97] Byte Latent Transformer:https://arxiv.org/pdf/2412.09871

[98] OpenAI Scaling Law:https://arxiv.org/pdf/2001.08361

[99] DeepMind Chinchilla Scaling Law:https://arxiv.org/pdf/2203.15556

[100] OpenAI o1 Scaling Law:https://openai.com/index/learning-to-reason-with-llms/

[101] VLMEvalKit:https://github.com/open-compass/VLMEvalKit

[102] opencompass:https://github.com/open-compass/opencompass

[103] ollama:https://github.com/ollama/ollama

[104] mlc-llm:https://github.com/mlc-ai/mlc-llm

[105] llama.cpp:https://github.com/ggerganov/llama.cpp

[106] text-generation-inference:https://github.com/huggingface/text-generation-inference

[107] langgraph:https://github.com/langchain-ai/langgraph

[108] Qwen2.5:https://arxiv.org/pdf/2412.15115

[109] DQN:https://arxiv.org/pdf/1312.5602

[110] Q-learning:https://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf

[111] Policy Gradient&REINFORCE:https://people.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf

[112] Policy Gradient Theorem:https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf

[113] Hugging Face:https://huggingface.co/

[114] Diverse Beam Search:https://arxiv.org/pdf/1610.02424

[115] Constrained Beam Search:https://arxiv.org/pdf/1612.00576

[116] Top-P:https://arxiv.org/pdf/1904.09751

[117] Top-K:https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

[118] Speculative Sampling:https://arxiv.org/pdf/2302.01318

[119] UCB:https://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf

[120] Rainbow:https://arxiv.org/pdf/1710.02298

[121] Prioritized Experience Replay:https://arxiv.org/pdf/1511.05952

[122] Dueling DQN:https://arxiv.org/pdf/1511.06581

[123] Double DQN:https://arxiv.org/pdf/1509.06461

[124] DQN + Target Network:https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf

[125] POSG:https://www.khoury.northeastern.edu/home/camato/publications/aaai-SS-04.pdf

[126] IQL:https://web.media.mit.edu/~cynthiab/Readings/tan-MAS-reinfLearn.pdf

[127] BC:https://proceedings.neurips.cc/paper/1988/file/812b4ba287f5ee0bc9d43bbf5bbe87fb-Paper.pdf

[128] GAIL:https://arxiv.org/pdf/1606.03476

[129] MCTS:https://www.davidsilver.uk/wp-content/uploads/2020/03/pomcp.pdf

[130] HRL:https://proceedings.neurips.cc/paper_files/paper/1997/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf

[131] Distributional RL:https://arxiv.org/pdf/1707.06887

[132] Chatbot Arena:https://www.lmarena.ai/

[133] Teacher Forcing:https://gwern.net/doc/ai/nn/rnn/1989-williams-2.pdf

[134] GPT-1:https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

[135] Attention Is All You Need:https://arxiv.org/pdf/1706.03762

[136] Ilya Sutskever, seq2seq, LSTM:https://arxiv.org/pdf/1409.3215

[137] MoE, Switch Transformers:https://arxiv.org/pdf/2101.03961

[138] RoPE, Jianlin Su:https://arxiv.org/pdf/2104.09864

[139] ResNet, Kaiming He:https://arxiv.org/pdf/1512.03385

[140] DriveVLM, Tsinghua University, Li Auto:https://arxiv.org/pdf/2402.12289

[141] ELMo:https://arxiv.org/pdf/1802.05365

[142] Generative Verifiers:https://arxiv.org/pdf/2408.15240

[143] rStar:https://arxiv.org/pdf/2408.06195

[144] Scaling LLM Test-Time Compute:https://arxiv.org/pdf/2408.03314v1

[145] LoRA:https://github.com/microsoft/LoRA

[146] GPT-3:https://arxiv.org/pdf/2005.14165

[147] RAG:https://arxiv.org/pdf/2005.11401

[148] Richard S. Sutton:http://incompleteideas.net/

[149] The Bitter Lesson:http://incompleteideas.net/IncIdeas/BitterLesson.html

[150] Yarn:https://arxiv.org/pdf/2309.00071

[151] Qwen2.5:https://arxiv.org/pdf/2412.15115

[152] DeepSeek-V3:https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf

[153] Speculative Decoding:https://arxiv.org/pdf/2211.17192

[154] MCTS:https://inria.hal.science/inria-00116992/document

[155] UCT:http://ggp.stanford.edu/readings/uct.pdf

[156] AlphaGo:https://www.davidsilver.uk/wp-content/uploads/2020/03/unformatted_final_mastering_go.pdf

[157] About BoN:https://arxiv.org/pdf/2009.01325

[158] BOND:https://arxiv.org/pdf/2407.14622

[159] DVTS:https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute

[160] A Survey on KD of LLM:https://arxiv.org/pdf/2402.13116

[161] Investigating Mysteries of CoT-Augmented Distillation:https://arxiv.org/pdf/2406.14511

[162] TTT:https://ekinakyurek.github.io/papers/ttt.pdf

[163] RESET:https://arxiv.org/pdf/2409.14586

[164] OpenAI, Let’s Verify Step by Step:https://arxiv.org/pdf/2305.20050

[165] DeepMind, Google, OmegaPRM:https://arxiv.org/pdf/2406.06592

[166] DeepMind, PRM:https://arxiv.org/pdf/2211.14275

[167] Epoch AI, Will we run out of data:https://arxiv.org/pdf/2211.04325v2

[168] Epoch AI:https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data

[169] A Survey on Data Synthesis and Augmentation:https://arxiv.org/pdf/2410.12896

[170] OLMo 2:https://arxiv.org/pdf/2501.00656

[171] TÜLU 3:https://arxiv.org/pdf/2411.15124

[172] ReFT:https://arxiv.org/pdf/2401.08967

[173] Reinforcement Fine-Tuning:https://openai.com/12-days/

[174] rStar-Math:https://arxiv.org/pdf/2501.04519

[175] A* search:https://ai.stanford.edu/~nilsson/OnlinePubs-Nils/PublishedPapers/astar.pdf

[176] Meta-CoT:https://arxiv.org/pdf/2501.04682

[177] Best-first:https://arxiv.org/pdf/2407.01476

[178] AlphaZero:https://arxiv.org/pdf/1712.01815

[179] AlphaGo Zero:https://discovery.ucl.ac.uk/id/eprint/10045895/1/agz_unformatted_nature.pdf

[180] Self-Rewarding:https://arxiv.org/pdf/2401.10020

[181] Meta, Meta-Rewarding:https://arxiv.org/pdf/2407.19594

[182] DeepMind, SCoRe:https://arxiv.org/pdf/2409.12917

[183] OpenAI, Deliberative Alignment:https://arxiv.org/pdf/2412.16339

[184] Distillation:https://arxiv.org/pdf/1503.02531

[185] DeepSeek-R1:https://arxiv.org/pdf/2501.12948

[186] Approximating KL Divergence:http://joschu.net/blog/kl-approx.html

[187] Li Fei-Fei, s1:https://arxiv.org/pdf/2501.19393

[188] BitNet b1.58:https://arxiv.org/pdf/2402.17764

[189] Unsloth:https://unsloth.ai/

[190] GRPO Trainer:https://huggingface.co/docs/trl/main/en/grpo_trainer