The use of optimization as a framework for formulating machine learning problems has become much more widespread in recent years. In some cases, the demands of the machine learning problems go beyond the scope of traditional optimization paradigms. While existing optimization formulations and algorithms serve as a good starting point for solution strategies, important work must be carried out at the interface of optimization and machine learning to devise strategies that exploit the special features of the application and that perform well on very large data sets. This talk reviews recent developments from an optimization perspective, focusing on activity during the past three years, and looking in particular at problems where the machine learning application has motivated novel algorithms or analysis in the optimization domain. We also discuss some current challenges, highlighting several recent developments in optimization that may be useful in machine learning applications.
We propose a unifying framework for the solution of convex programs by polyhedral approximation. It includes classical methods, such as cutting plane, Dantzig-Wolfe decomposition, bundle, and simplicial decomposition, but also includes refinements of these methods, as well as new methods that are well-suited for important large-scale types of problems, arising for example in network optimization.
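As a concrete illustration of one of the classical schemes named above, the sketch below implements a basic Kelley-style cutting-plane method: a convex function is minimized by repeatedly minimizing a polyhedral (piecewise-linear) lower model built from subgradient cuts. This is only an illustration of the general idea, not the unified framework proposed in the abstract; the test function, box bounds, and iteration count are arbitrary choices made for the example.

```python
# Kelley-style cutting-plane sketch: minimize a convex f over a box by
# repeatedly minimizing a growing piecewise-linear lower model (the cuts) via an LP.
import numpy as np
from scipy.optimize import linprog

C = np.array([3.0, -1.0])            # illustrative problem data

def f(x):                            # convex test function: ||x - C||_1
    return np.abs(x - C).sum()

def subgrad(x):                      # one subgradient of f at x
    return np.sign(x - C)

def kelley(x0, lo=-10.0, hi=10.0, iters=30):
    n = x0.size
    cuts = []                        # each cut encodes f(x) >= g @ x + b
    x = x0.copy()
    for _ in range(iters):
        g, fx = subgrad(x), f(x)
        cuts.append((g, fx - g @ x))
        # LP over (x, t): minimize t subject to t >= g_i @ x + b_i, x in the box
        c_lp = np.r_[np.zeros(n), 1.0]
        A = np.array([np.r_[gi, -1.0] for gi, _ in cuts])
        b = np.array([-bi for _, bi in cuts])
        bounds = [(lo, hi)] * n + [(None, None)]
        res = linprog(c_lp, A_ub=A, b_ub=b, bounds=bounds, method="highs")
        x = res.x[:n]                # minimizer of the current polyhedral model
    return x

print(kelley(np.zeros(2)))           # approaches the minimizer (3, -1)
```

Each new cut tightens the outer polyhedral approximation of f; bundle methods, for example, add a stabilizing proximal term to this same model.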
The presentation stresses important differences between machine learning and conventional optimization approaches and proposes some solutions. The first part discusses the interaction of two kinds of asymptotic properties: those of the statistics and those of the optimization algorithm. Unlikely optimization algorithms such as stochastic gradient descent show amazing performance on large-scale machine learning problems. The second part shows how the deeper causes of this performance suggest the theoretical possibility of learning large-scale problems with a single pass over the data. Practical algorithms will be discussed, including second-order stochastic gradient methods, averaging methods, and dual methods with data reprocessing.
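As a concrete illustration of the averaging methods mentioned, the sketch below runs plain stochastic gradient descent with Polyak-Ruppert iterate averaging over a single pass of a synthetic least-squares problem. It is not the speaker's specific algorithm; the data, step-size schedule, and problem size are assumptions made for the example.

```python
# Single-pass SGD with Polyak-Ruppert averaging on a synthetic least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 10                     # illustrative problem size
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)                        # current SGD iterate
w_avg = np.zeros(d)                    # running average of the iterates
for t, (x_t, y_t) in enumerate(zip(X, y), start=1):
    grad = (x_t @ w - y_t) * x_t       # stochastic gradient of 0.5*(x'w - y)^2
    w -= grad / (20.0 + t)             # decreasing step size, roughly O(1/t)
    w_avg += (w - w_avg) / t           # incremental Polyak-Ruppert average

print(np.linalg.norm(w_avg - w_true))  # error is small after the single pass
```

The averaged iterate is typically far less sensitive to the step-size schedule than the last iterate, which is one reason averaging features prominently among practical single-pass algorithms.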