We are delighted that Misha Belkin will give this week’s One World Mathematics of Information, Data, and Signals (1W-MINDS) Seminar on Thursday, February 18, at 2:30 pm EST (11:30 am Pacific time). As usual, attendees can join the seminar via the following Zoom link at that time:
https://msu.zoom.us/j/96421373881 (the passcode is the first prime number > 100).
Prof. Belkin’s title and abstract are below; they can also be found on the seminar website, along with information about other upcoming talks, videos of past talks, and more, at https://sites.google.com/view/minds-seminar/home
A Theory of Optimization and Transition to Linearity in Deep Learning
The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. In this talk I will discuss some general mathematical principles that allow for efficient optimization in over-parameterized non-linear systems, a setting that includes deep neural networks. Remarkably, optimization of such systems appears to be “easy”. In particular, the optimization problems corresponding to these systems are not convex, even locally; instead, they locally satisfy the Polyak-Lojasiewicz (PL) condition, which allows for efficient optimization by gradient descent or SGD. We connect the PL condition of these systems to the condition number associated with the tangent kernel and develop a non-linear theory parallel to classical analyses of over-parameterized linear equations.
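For readers unfamiliar with the condition named in the abstract, the standard statement of the PL inequality and the linear convergence it yields for gradient descent are sketched below. This is the textbook form of the condition, not a formula taken from the talk; L* denotes the infimum of the loss and beta its smoothness constant.

```latex
% Polyak-Lojasiewicz (PL) condition for a loss L with infimum L^*:
% a point w satisfies the mu-PL condition if
\[
  \tfrac{1}{2}\,\|\nabla L(w)\|^{2} \;\ge\; \mu\,\bigl(L(w) - L^{*}\bigr),
  \qquad \mu > 0.
\]
% If L is beta-smooth and mu-PL, gradient descent with step size
% eta = 1/beta converges linearly, without any convexity assumption:
\[
  L(w_{t}) - L^{*} \;\le\; \Bigl(1 - \tfrac{\mu}{\beta}\Bigr)^{t}
  \bigl(L(w_{0}) - L^{*}\bigr).
\]
```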
In a related but conceptually separate development, I will discuss a new perspective on the remarkable, recently discovered phenomenon of transition to linearity (constancy of the NTK) in certain classes of large neural networks. I will show how this transition to linearity results from the scaling of the Hessian with the size of the network.
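A rough sketch of why Hessian scaling implies near-linearity (our paraphrase of the general mechanism, not the speaker’s exact statement): by Taylor’s theorem, the deviation of the model f from its linearization at initialization is controlled by the Hessian of f, so if the Hessian’s spectral norm shrinks as the network width m grows, the model becomes effectively linear in a ball of fixed radius.

```latex
% Deviation of f from its linearization at w_0, for w in a ball B(w_0, R):
\[
  \bigl| f(w) - f(w_{0}) - \nabla f(w_{0})^{\top}(w - w_{0}) \bigr|
  \;\le\; \tfrac{1}{2}\,\|w - w_{0}\|^{2}
  \sup_{v \in B(w_{0}, R)} \|H_{f}(v)\|.
\]
% Thus if \|H_f\| \to 0 as the width m \to \infty (uniformly on the ball),
% f is near-linear there, and the tangent kernel (NTK) is near-constant.
```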
Joint work with Chaoyue Liu and Libin Zhu.