- SGD [Book]
 - Momentum [Book]
 - RMSProp [Book]
 - AdaGrad [Link]
 - ADAM [Link]
 - AdaBound [Link] [Github]
 - ADAMAX [Link]
 - NADAM [Link]
 - ADAMW [Link]
 - AdaLOMO Link
 - A comprehensive list of optimizers: Awesome-Optimizer (see the update-rule sketch after this list)
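
A minimal NumPy sketch (not taken from any single reference above) of three of the listed update rules: plain SGD, SGD with momentum, and Adam with bias correction. The toy quadratic objective and all hyperparameter values are illustrative.

```python
import numpy as np

# Minimal update rules; `grad` is the gradient of the loss at the current weights `w`.

def sgd_step(w, grad, lr=1e-2):
    return w - lr * grad

def momentum_step(w, grad, v, lr=1e-2, beta=0.9):
    v = beta * v + grad                       # heavy-ball velocity
    return w - lr * v, v

def adam_step(w, grad, m, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad              # first moment estimate
    s = b2 * s + (1 - b2) * grad ** 2         # second moment estimate
    m_hat = m / (1 - b1 ** t)                 # bias correction (t starts at 1)
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

# Toy usage: minimize f(w) = ||w||^2 with Adam.
w = np.array([3.0, -2.0])
m, s = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 201):
    w, m, s = adam_step(w, 2 * w, m, s, t, lr=0.1)
print(w)   # converges toward [0, 0]
```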
 
- BatchNorm [Link]
 - Weight Norm [Link]
 - Spectral Norm [Link]
 - Cosine Normalization [Link]
 - L2 Regularization versus Batch and Weight Normalization Link
 - WHY GRADIENT CLIPPING ACCELERATES TRAINING: A THEORETICAL JUSTIFICATION FOR ADAPTIVITY Link (see the clipping sketch after this list)
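
A sketch of global-norm gradient clipping of the kind analyzed in the clipping paper above, assuming gradients are held as a list of NumPy arrays; the `max_norm` value and the toy gradients are placeholders.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so that their combined (global) L2 norm
    is at most max_norm; if it is already smaller, the gradients are unchanged."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads]

# Toy usage: two parameter groups whose combined gradient norm is 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # 5.0
```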
 
- Convex Neural Networks [Link]
 - Breaking the Curse of Dimensionality with Convex Neural Networks [Link]
 - UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION [Link]
 - Optimal Control Via Neural Networks: A Convex Approach. [Link]
 - Input Convex Neural Networks [Link] (see the sketch after this list)
 - A New Concept of Convex based Multiple Neural Networks Structure. [Link]
 - SGD Converges to Global Minimum in Deep Learning via Star-convex Path [Link]
 - A Convergence Theory for Deep Learning via Over-Parameterization Link
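
A rough forward-pass sketch of an input convex neural network: the output stays convex in the input as long as the z-path weights are non-negative and the activation is convex and non-decreasing. The layer sizes, the ReLU choice, and the random weights are illustrative, not details from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def icnn_forward(x, Wx_list, Wz_list, b_list):
    """Forward pass of a small input convex network: the output is convex in x
    as long as every Wz is elementwise non-negative and the activation is
    convex and non-decreasing (ReLU satisfies both)."""
    z = relu(Wx_list[0] @ x + b_list[0])                  # first layer: no z-path yet
    for Wx, Wz, b in zip(Wx_list[1:], Wz_list, b_list[1:]):
        z = relu(Wz @ z + Wx @ x + b)                     # Wz >= 0 preserves convexity
    return z

# Toy usage with random weights; abs() keeps the z-path weights non-negative.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
Wx_list = [rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(1, 4))]
Wz_list = [np.abs(rng.normal(size=(8, 8))), np.abs(rng.normal(size=(1, 8)))]
b_list = [np.zeros(8), np.zeros(8), np.zeros(1)]
print(icnn_forward(x, Wx_list, Wz_list, b_list))
```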
 
- Curriculum Learning [Link]
 - SOLVING RUBIK’S CUBE WITH A ROBOT HAND Link
 - Noisy Activation Function [Link]
 - Mollifying Networks [Link]
 - Curriculum Learning by Transfer Learning: Theory and Experiments with Deep Networks Link Talk
 - Automated Curriculum Learning for Neural Networks Link
 - On The Power of Curriculum Learning in Training Deep Networks Link (see the pacing sketch after this list)
 - On-line Adaptative Curriculum Learning for GANs Link
 - Parameter Continuation with Secant Approximation for Deep Neural Networks and Step-up GAN Link
 - HashNet: Deep Learning to Hash by Continuation. [Link]
 - Learning Combinations of Activation Functions. [Link]
 - Learning and development in neural networks: The importance of starting small (1993) Link
 - Flexible shaping: How learning in small steps helps Link
 - Curriculum Labeling: Self-paced Pseudo-Labeling for Semi-Supervised Learning Link
 - RETHINKING CURRICULUM LEARNING WITH INCREMENTAL LABELS AND ADAPTIVE COMPENSATION Link
 - Parameter Continuation Methods for the Optimization of Deep Neural Networks Link
 - Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection [Link](https://www.aclweb.org/anthology/W18-6314.pdf)
 - Reinforcement Learning based Curriculum Optimization for Neural Machine Translation Link
 - EVOLUTIONARY POPULATION CURRICULUM FOR SCALING MULTI-AGENT REINFORCEMENT LEARNING Link
 - ENTROPY-SGD: BIASING GRADIENT DESCENT INTO WIDE VALLEYS Link
 - NEIGHBOURHOOD DISTILLATION: ON THE BENEFITS OF NON END-TO-END DISTILLATION Link
 - LEARNING TO EXECUTE Link
 - Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing Link
 - Data Parameters: A New Family of Parameters for Learning a Differentiable Curriculum Link
 - Breaking the Curse of Space Explosion: Towards Efficient NAS with Curriculum Search Link
 - Continuation Methods and Curriculum Learning for Learning to Rank Link
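
A toy easy-to-hard curriculum in the spirit of the pacing-function work above: sort examples by a precomputed difficulty score and grow the visible fraction of the dataset linearly with the epoch. The linear pacing function, batch size, and random difficulty scores are stand-ins.

```python
import numpy as np

def curriculum_batches(X, y, difficulty, n_epochs=10, batch_size=32, start_frac=0.2):
    """Yield mini-batches under a simple easy-to-hard curriculum: examples are
    sorted by a precomputed difficulty score, and the fraction of the data
    exposed to the model grows linearly with the epoch (a linear pacing function)."""
    order = np.argsort(difficulty)                    # easiest examples first
    n = len(X)
    for epoch in range(n_epochs):
        frac = start_frac + (1.0 - start_frac) * epoch / max(1, n_epochs - 1)
        pool = order[: max(batch_size, int(frac * n))].copy()
        np.random.shuffle(pool)                       # shuffle within the current pool
        for i in range(0, len(pool), batch_size):
            idx = pool[i : i + batch_size]
            yield epoch, X[idx], y[idx]

# Toy usage with random data; in practice `difficulty` could be the loss of a
# pretrained scoring model, as in the transfer-learning curriculum papers.
X, y = np.random.randn(1000, 8), np.random.randint(0, 2, size=1000)
difficulty = np.random.rand(1000)
for epoch, xb, yb in curriculum_batches(X, y, difficulty, n_epochs=3):
    pass                                              # a train_step(xb, yb) would go here
```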
 
- Flat-LoRA: Low-Rank Adaptation over a Flat Loss Landscape Link
 - Low-Pass Filtering SGD for Recovering Flat Optima in the Deep Learning Optimization Landscape Link
 - Exact solutions to the nonlinear dynamics of learning in deep linear neural networks Link
 - QUALITATIVELY CHARACTERIZING NEURAL NETWORK OPTIMIZATION PROBLEMS [Link] (see the interpolation sketch after this list)
 - The Loss Surfaces of Multilayer Networks [Link]
 - Visualizing the Loss Landscape of Neural Nets [Link]
 - The Loss Surface Of Deep Linear Networks Viewed Through The Algebraic Geometry Lens [Link]
 - How regularization affects the critical points in linear networks. [Link]
 - Local minima in training of neural networks [Link]
 - Necessary and Sufficient Geometries for Gradient Methods Link
 - Fine-grained Optimization of Deep Neural Networks Link
 - SCORE-BASED GENERATIVE MODELING THROUGH STOCHASTIC DIFFERENTIAL EQUATIONS Link
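
A sketch of the 1-D loss interpolation used to qualitatively characterize optimization problems: evaluate the loss along the segment between two parameter vectors (for example, the initial and final weights of a run). The quadratic toy loss stands in for a real network.

```python
import numpy as np

def loss_along_line(loss_fn, theta_a, theta_b, n_points=25):
    """Evaluate the loss on the segment between two flattened parameter vectors,
    the 1-D slice used to visualize how "nice" the path between two points in
    weight space is (e.g. between the initial and the final weights of a run)."""
    alphas = np.linspace(0.0, 1.0, n_points)
    losses = [loss_fn((1.0 - a) * theta_a + a * theta_b) for a in alphas]
    return alphas, losses

# Toy usage on a quadratic stand-in loss; in practice theta_a / theta_b would be
# flattened network weights and loss_fn a forward pass on a held-out batch.
A = np.diag([1.0, 10.0])
loss_fn = lambda th: 0.5 * th @ A @ th
alphas, losses = loss_along_line(loss_fn, np.array([2.0, 2.0]), np.zeros(2))
print(losses[0], losses[-1])   # the loss decreases monotonically along this slice
```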
 
- Deep Equilibrium Models Link (see the fixed-point sketch after this list)
 - Bifurcations of Recurrent Neural Networks in Gradient Descent Learning [Link]
 - On the difficulty of training recurrent neural networks [Link]
 - Understanding and Controlling Memory in Recurrent Neural Networks [Link]
 - Dynamics and Bifurcation of Neural Networks [Link]
 - Context Aware Machine Learning [Link]
 - The trade-off between long-term memory and smoothness for recurrent networks [Link]
 - Dynamical complexity and computation in recurrent neural networks beyond their fixed point [Link]
 - Bifurcations in discrete-time neural networks: controlling complex network behaviour with inputs [Link]
 - Interpreting Recurrent Neural Networks Behaviour via Excitable Network Attractors [Link]
 - Bifurcation analysis of a neural network model Link
 - A Differentiable Physics Engine for Deep Learning in Robotics Link
 - Deep learning for universal linear embeddings of nonlinear dynamics Link
 - Deep Hidden Physics Models: Deep Learning of Nonlinear Partial Differential Equations Link
 - Analysis of gradient descent learning algorithms for multilayer feedforward neural networks Link
 - A dynamical model for the analysis and acceleration of learning in feedforward networks Link
 - A bio-inspired bistable recurrent cell allows for long-lasting memory Link
 - Equilibrium Propagation: Bridging the Gap between Energy-Based Models and Backpropagation [Link](https://www.frontiersin.org/articles/10.3389/fncom.2017.00024/full)
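
A naive fixed-point iteration for the defining equation of a deep equilibrium layer, z* = f(z*, x). The DEQ papers use faster root solvers such as Broyden's method; the tanh cell and the weight scaling below are chosen only to keep the map contractive.

```python
import numpy as np

def deq_fixed_point(x, W, U, b, tol=1e-6, max_iter=500):
    """Solve z* = tanh(W z* + U x + b) by naive fixed-point iteration, the
    defining equation of a deep equilibrium layer. This loop is only for
    illustration; practical DEQs use quasi-Newton root solvers."""
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ x + b)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z

# Toy usage: W is scaled down so the map is contractive and the iteration converges.
rng = np.random.default_rng(0)
d, dx = 16, 8
W = 0.3 * rng.normal(size=(d, d)) / np.sqrt(d)
U = rng.normal(size=(d, dx)) / np.sqrt(dx)
b, x = np.zeros(d), rng.normal(size=dx)
z_star = deq_fixed_point(x, W, U, b)
print(np.linalg.norm(np.tanh(W @ z_star + U @ x + b) - z_star))   # ~0: z_star is an equilibrium
```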
 
- Adding One Neuron Can Eliminate All Bad Local Minima Link
 - Deep Learning without Poor Local Minima Link
 - Elimination of All Bad Local Minima in Deep Learning Link
 - How to escape saddle points efficiently. Link (see the sketch after this list)
 - Depth with Nonlinearity Creates No Bad Local Minima in ResNets Link
 - Sharp Minima Can Generalize For Deep Nets Link
 - Asymmetric Valleys: Beyond Sharp and Flat Local Minima Link
 - A Reparameterization-Invariant Flatness Measure for Deep Neural Networks Link
 - A Simple Weight Decay Can Improve Generalization Link
 - Finding Critical and Gradient-Flat Points of Deep Neural Network Loss Functions Link
 - The Loss Surface Of Deep Linear Networks Viewed Through The Algebraic Geometry Lens Link
 - Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization Link
 - Flatness is a False Friend Link
 - Are Saddles Good Enough for Deep Learning Link
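
A sketch of perturbed gradient descent, the mechanism behind escaping saddle points efficiently: add a small random kick whenever the gradient is nearly zero. The paper's perturbation schedule is more careful; the saddle f(x, y) = x^2 - y^2 and all constants here are illustrative.

```python
import numpy as np

def perturbed_gradient_descent(grad_fn, x0, lr=0.05, g_thresh=1e-3, radius=0.1,
                               steps=50, seed=0):
    """Gradient descent that adds a small uniform perturbation whenever the
    gradient is (near) zero, the core mechanism for escaping strict saddles."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = grad_fn(x)
        if np.linalg.norm(g) < g_thresh:
            x = x + rng.uniform(-radius, radius, size=x.shape)   # random kick off the saddle
        else:
            x = x - lr * g
    return x

# f(x, y) = x^2 - y^2 has a strict saddle at the origin: plain gradient descent
# started there never moves, while the perturbed variant escapes along the y-axis.
grad_fn = lambda p: np.array([2 * p[0], -2 * p[1]])
print(perturbed_gradient_descent(grad_fn, [0.0, 0.0]))
```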
 
- Deep learning course notes Link
 - On the importance of initialization and momentum in deep learning Link
 - The Break-Even Point on Optimization Trajectories of Deep Neural Networks Link
 - THE EARLY PHASE OF NEURAL NETWORK TRAINING Link
 - One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers Link
 - PCA-Initialized Deep Neural Networks Applied To Document Image Analysis Link
 - Understanding the difficulty of training deep feedforward neural networks Link (see the initialization sketch after this list)
 - Unitary Evolution of RNNs Link
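
A sketch of Glorot/Xavier uniform initialization from "Understanding the difficulty of training deep feedforward neural networks"; the layer sizes in the usage example are arbitrary.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Glorot/Xavier uniform initialization: draw weights in [-limit, limit] with
    limit = sqrt(6 / (fan_in + fan_out)), which keeps activation and gradient
    variances roughly constant across layers at the start of training."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

# Toy usage: initialize the weight matrices of a 3-layer MLP.
sizes = [784, 256, 128, 10]
weights = [glorot_uniform(m, n) for m, n in zip(sizes[:-1], sizes[1:])]
print([w.shape for w in weights])   # [(256, 784), (128, 256), (10, 128)]
print(weights[0].std())             # ~sqrt(2 / (784 + 256)) ~ 0.044
```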
 
- RETHINKING THE HYPERPARAMETERS FOR FINE-TUNING Link
 - Momentum Residual Neural Networks Link
 - Smooth momentum: improving lipschitzness in gradient descent Link
 - Momentum-based Weight Interpolation of Strong Zero-Shot Models for Continual Learning Link (see the sketch after this list)
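
A simplified stand-in for momentum-based weight interpolation: keep an exponential moving average of the fine-tuned ("fast") weights as a second, slowly moving model. The momentum value and the toy fine-tuning update are placeholders, not details from the paper.

```python
import numpy as np

def ema_update(slow_weights, fast_weights, momentum=0.99):
    """Momentum-based weight interpolation: the slow copy is an exponential
    moving average of the fast (fine-tuned) weights, which tends to stay
    closer to the original model than the fast weights alone."""
    return [momentum * ws + (1.0 - momentum) * wf
            for ws, wf in zip(slow_weights, fast_weights)]

# Toy usage: the slow copy trails the fast weights during "fine-tuning".
fast, slow = [np.zeros(3)], [np.zeros(3)]
for step in range(100):
    fast = [w + 0.1 for w in fast]          # stand-in for a fine-tuning update
    slow = ema_update(slow, fast)
print(fast[0][0], slow[0][0])               # fast reaches 10.0, slow lags behind
```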
 
- ON LARGE-BATCH TRAINING FOR DEEP LEARNING: GENERALIZATION GAP AND SHARP MINIMA Link (see the scaling sketch after this list)
 - Revisiting Small Batch Training for Deep Neural Networks Link
 - LARGE BATCH TRAINING OF CONVOLUTIONAL NETWORKS Link
 - Large Batch Optimization for Deep Learning: Training BERT in 76 minutes Link
 - DON’T DECAY THE LEARNING RATE, INCREASE THE BATCH SIZE Link
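
A sketch of two heuristics from the large-batch papers above: the linear learning-rate scaling rule, and raising the batch size at milestones instead of decaying the learning rate. The milestone steps and the factor of 5 are illustrative.

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule used in large-batch training: keep the ratio of
    learning rate to batch size constant as the batch size grows."""
    return base_lr * batch / base_batch

def schedule(step, base_lr=0.1, base_batch=256, milestones=(30_000, 60_000)):
    """Instead of decaying the learning rate at each milestone, increase the
    batch size by the same factor; the learning rate itself stays fixed."""
    factor = 5 ** sum(step >= m for m in milestones)
    return base_lr, base_batch * factor

print(scaled_lr(0.1, 256, 4096))                          # 1.6
print(schedule(0), schedule(45_000), schedule(90_000))    # batch size grows, lr stays 0.1
```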
 
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks Link
 - Avoiding pathologies in very deep networks Link
 - Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice Link
 - SKIP CONNECTIONS ELIMINATE SINGULARITIES Link
 - How degenerate is the parametrization of neural networks with the ReLU activation function? Link
 - Theory of Deep Learning III: explaining the non-overfitting puzzle Link
 - Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks Link
 - Understanding Deep Learning: Expected Spanning Dimension and Controlling the Flexibility of Neural Networks Link
 - The Loss Surface Of Deep Linear Networks Viewed Through The Algebraic Geometry Lens Link
 - PYHESSIAN: Neural Networks Through the Lens of the Hessian Link (see the power-iteration sketch after this list)
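
A sketch of estimating the top Hessian eigenvalue by power iteration on Hessian-vector products, here approximated by finite differences of the gradient; tools such as PyHessian compute exact Hessian-vector products with automatic differentiation instead.

```python
import numpy as np

def top_hessian_eigenvalue(grad_fn, theta, iters=100, eps=1e-4, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue at `theta` by power
    iteration on Hessian-vector products, with the products approximated by
    finite differences of the gradient: H v ~ (grad(theta + eps*v) - grad(theta)) / eps."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=theta.shape)
    v /= np.linalg.norm(v)
    g0 = grad_fn(theta)
    eig = 0.0
    for _ in range(iters):
        hv = (grad_fn(theta + eps * v) - g0) / eps
        eig = float(v @ hv)                       # Rayleigh quotient with the unit vector v
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return eig

# Toy usage: quadratic loss with Hessian diag(1, 10) -> top eigenvalue ~10.
A = np.diag([1.0, 10.0])
grad_fn = lambda th: A @ th
print(top_hessian_eigenvalue(grad_fn, np.array([1.0, 1.0])))
```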
 
- A CONVERGENCE ANALYSIS OF GRADIENT DESCENT FOR DEEP LINEAR NEURAL NETWORKS Link
 - A Convergence Theory for Deep Learning via Over-Parameterization Link
 - Convergence Analysis of Homotopy-SGD for Non-Convex Optimization Link
 
- Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning. Link
 - Learning a Multitask Curriculum for Neural Machine Translation. Link
 - Self-paced Curriculum Learning. Link
 - Curriculum Learning of Multiple Tasks. Link
 
- A Primal-Dual Formulation for Deep Learning with Constraints Link (see the sketch below)
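
A minimal gradient descent-ascent loop for a single inequality constraint, the basic idea behind primal-dual formulations of constrained training; the toy projection problem and step sizes are illustrative.

```python
import numpy as np

def primal_dual_descent(loss_grad, constraint, constraint_grad, theta,
                        lr_theta=0.05, lr_lam=0.05, steps=2000):
    """Gradient descent on the primal variables and projected ascent on a
    Lagrange multiplier for one inequality constraint g(theta) <= 0, i.e.
    alternating updates on the Lagrangian L = loss + lam * g."""
    lam = 0.0
    for _ in range(steps):
        g = constraint(theta)
        theta = theta - lr_theta * (loss_grad(theta) + lam * constraint_grad(theta))
        lam = max(0.0, lam + lr_lam * g)          # dual ascent, keep lam >= 0
    return theta, lam

# Toy problem: minimize ||theta - c||^2 subject to sum(theta) - 1 <= 0.
c = np.array([2.0, 2.0])
theta, lam = primal_dual_descent(
    loss_grad=lambda t: 2 * (t - c),
    constraint=lambda t: np.sum(t) - 1.0,
    constraint_grad=lambda t: np.ones_like(t),
    theta=np.zeros(2),
)
print(theta, lam)    # theta ~ [0.5, 0.5] with the constraint active, lam ~ 3
```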
 
- Object-Oriented Curriculum Generation for Reinforcement Learning Link
 - Teacher-Student Curriculum Learning Link
 
- Curriculum Learning: A Survey Link
 - A Comprehensive Survey on Curriculum Learning Link
 - Off the Convex Path (blog): https://www.offconvex.org/
 - An overview of gradient descent optimization algorithms [Link]
 - Review of second-order optimization techniques in artificial neural networks backpropagation Link
 - Linear Algebra and data Link
 - Why Momentum Really Works? [Blog]
 - Optimization [Book]
 - Optimization for deep learning: theory and algorithms Link
 - Generalization Error in Deep Learning Link
 - Automatic Differentiation in Machine Learning: a Survey Link
 - Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey Link
 - Automatic Curriculum Learning For Deep RL: A Short Survey Link
 - The Generalization Mystery: Sharp vs Flat Minima Link
 
If you've found any informative resources that you think belong here, be sure to submit a pull request or create an issue!
- Or send me 2-4 dollars on my Venmo account @HARSHNILESH-PATHAK.