Flat minima and generalization in deep learning: a case study in low-rank matrix recovery
Halıcıoğlu Data Science Institute Building, Room 123, 3234 Matthews Lane, La Jolla, CA, United States

Abstract: Recent advances in machine learning and artificial intelligence have relied on fitting highly overparameterized models, notably deep neural networks, to observed data. In such settings, the number of parameters of the model is much greater than the number of data samples, thereby resulting in a continuum of models with near-zero training error. Understanding which of these models generalize well and which do not is the central open question in deep learning. Recent empirical evidence suggests one mechanism for generalization: the shape of the training loss around a local minimizer seems to strongly impact the model’s performance. In particular, flat minima -- those around which the loss grows slowly -- appear to generalize well. Clarifying this phenomenon can shed new light on generalization in deep learning, which still largely remains a mystery.
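To make the notion of flatness concrete (the notation here is illustrative and not taken from the talk itself): for a training loss $\mathcal{L}(\theta)$ with a minimizer $\theta^\star$, so that $\nabla \mathcal{L}(\theta^\star) = 0$, a second-order expansion gives
\[
\mathcal{L}(\theta^\star + v) \;\approx\; \mathcal{L}(\theta^\star) + \tfrac{1}{2}\, v^\top \nabla^2 \mathcal{L}(\theta^\star)\, v,
\]
so the loss grows slowly around $\theta^\star$ precisely when the Hessian $\nabla^2 \mathcal{L}(\theta^\star)$ has small eigenvalues. A common scalar summary of flatness, and the one used below, is the Hessian trace $\mathrm{tr}\,\nabla^2 \mathcal{L}(\theta^\star)$: the smaller the trace, the flatter the minimizer.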
I will describe our recent work that takes a step towards this goal by focusing on the simplest class of overparameterized nonlinear models: those arising in low-rank matrix recovery. We analyze overparameterized matrix and bilinear sensing, robust PCA, covariance matrix estimation, and single-hidden-layer neural networks with quadratic activation functions. In all cases, we show that flat minima, measured by the trace of the Hessian, exactly recover the ground truth under standard statistical assumptions. These results suggest (i) a theoretical basis for favoring methods that bias iterates towards flat solutions and (ii) that the Hessian trace can serve as an effective regularizer for some learning tasks. We end by discussing the impact of depth on the generalization properties of flat solutions; surprisingly, added depth is not always beneficial.
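As a hedged illustration of the simplest setting above, overparameterized symmetric matrix sensing (the notation is mine and describes one of several variants, not a quotation of the talk's setup): one observes linear measurements $y_i = \langle A_i, M_\star \rangle$, $i = 1, \dots, m$, of an unknown rank-$r$ positive semidefinite matrix $M_\star \in \mathbb{R}^{d \times d}$, and fits a factor $X \in \mathbb{R}^{d \times k}$ with $k \ge r$ by minimizing
\[
f(X) \;=\; \frac{1}{2m} \sum_{i=1}^{m} \big( \langle A_i, X X^\top \rangle - y_i \big)^2 .
\]
In the overparameterized regime the set of global minimizers with $f(X) = 0$ forms a continuum, and many of its elements need not satisfy $X X^\top = M_\star$. The result described above says that, under standard statistical assumptions on the measurement matrices $A_i$, the flattest of these minimizers, namely those minimizing $\mathrm{tr}\,\nabla^2 f(X)$ over the zero-loss set, do satisfy $X X^\top = M_\star$, i.e., they exactly recover the ground truth.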