18th INFORMS Computing Society (ICS) Conference

Toronto, Canada, March 14–16, 2025

Optimization for Machine Learning

March 16, 2025, 16:45 – 18:15

Room: East Common

Chaired by Calvin Tsay

3 presentations

  • 16:45 – 17:07

    Fast Methods for Large-scale Kernel Ridge Regression

    • Pratik Rathore, presenter, Stanford University
    • Zachary Frangella, Stanford University
    • Jiaming Yang, University of Michigan
    • Michał Dereziński, University of Michigan
    • Madeleine Udell, Stanford University

    Kernel ridge regression (KRR) is a fundamental computational tool, appearing in problems ranging from computational chemistry to health analytics, and it is of particular interest due to its central role in Gaussian process regression. However, it is challenging to scale KRR solvers to large datasets: the memory and per-iteration computational complexities of direct methods (e.g., Cholesky decomposition) and indirect methods (e.g., PCG) scale quadratically or cubically with the number of training samples.

    To alleviate these scalability issues, we propose an approximate sketch-and-project method, ASkotch, to solve full KRR, and a variance-reduced stochastic gradient method, Mimosa, to solve inducing points KRR. Our algorithms (i) have lower memory and per-iteration computational complexities than existing KRR solvers, (ii) are easy to tune, (iii) obtain linear convergence to the optimum of the KRR training problem, and (iv) outperform state-of-the-art KRR solvers on a variety of large-scale regression and classification tasks.
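
    To make the scaling discussion concrete, the snippet below is a minimal baseline sketch of a direct full-KRR solve with an RBF kernel, not the proposed ASkotch or Mimosa methods: forming the kernel matrix costs O(n^2) memory and the Cholesky factorization costs O(n^3) time, which is the bottleneck the abstract refers to. The regularization value, lengthscale, and function names are illustrative assumptions.

        import numpy as np
        from scipy.linalg import cho_factor, cho_solve
        from scipy.spatial.distance import cdist

        def krr_fit_direct(X, y, lam=1e-3, lengthscale=1.0):
            # Direct (Cholesky) solve of full KRR: (K + n*lam*I) alpha = y.
            # Forming K needs O(n^2) memory; factorizing it costs O(n^3) time.
            n = X.shape[0]
            K = np.exp(-cdist(X, X, "sqeuclidean") / (2.0 * lengthscale**2))
            c, low = cho_factor(K + n * lam * np.eye(n))
            return cho_solve((c, low), y)  # dual coefficients alpha

        def krr_predict(X_train, X_test, alpha, lengthscale=1.0):
            K_xt = np.exp(-cdist(X_test, X_train, "sqeuclidean") / (2.0 * lengthscale**2))
            return K_xt @ alpha

        # Small usage example; the cubic cost only becomes prohibitive for large n.
        rng = np.random.default_rng(0)
        X = rng.standard_normal((500, 10))
        y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
        alpha = krr_fit_direct(X, y)
        y_hat = krr_predict(X, X[:10], alpha)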

  • 17:07 – 17:29

    Preconditioned Variance-Reduced Stochastic Gradient Algorithms for Faster Empirical Risk Minimization

    • Zachary Frangella, presenter, Stanford University
    • Pratik Rathore, Stanford University
    • Shipu Zhao, Cornell University
    • Madeleine Udell, Stanford University

    Ill-conditioned problems are ubiquitous in large-scale machine learning: as a data set grows to include more and more features correlated with the labels, the condition number increases. Yet traditional stochastic gradient methods converge slowly on these ill-conditioned problems, even with careful hyperparameter tuning.

    In this talk, we introduce PROMISE, a suite of sketching-based preconditioned stochastic gradient algorithms that deliver fast convergence on ill-conditioned large-scale convex optimization problems arising in machine learning. PROMISE includes preconditioned versions of SVRG, SAGA, and Katyusha; each algorithm comes with a strong theoretical analysis and effective default hyperparameter values. Empirically, we verify the superiority of the proposed algorithms by showing that, using default hyperparameter values, they outperform or match popular tuned stochastic gradient optimizers.

    Overall, PROMISE methods offer a compelling alternative to popular stochastic first-order optimization algorithms for large-scale machine learning.
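
    For illustration, the following is a generic preconditioned SVRG loop, a hedged sketch of the kind of update PROMISE builds on. The actual PROMISE preconditioners are constructed via randomized sketching; here the preconditioner is abstracted behind a user-supplied P_inv operator, and all names, defaults, and the toy least-squares usage are illustrative assumptions.

        import numpy as np

        def preconditioned_svrg(grad_i, full_grad, x0, P_inv, n,
                                lr=0.1, n_epochs=10, inner_steps=None, seed=0):
            # Generic SVRG with a preconditioned step x <- x - lr * P_inv(v).
            # grad_i(x, i): gradient of the i-th loss term; full_grad(x): full gradient.
            # P_inv(v): applies an approximate inverse preconditioner to the vector v.
            rng = np.random.default_rng(seed)
            inner_steps = inner_steps or n
            x, x_ref = x0.copy(), x0.copy()
            for _ in range(n_epochs):
                g_ref = full_grad(x_ref)              # anchor gradient (variance reduction)
                for _ in range(inner_steps):
                    i = rng.integers(n)
                    v = grad_i(x, i) - grad_i(x_ref, i) + g_ref
                    x = x - lr * P_inv(v)             # preconditioned update
                x_ref = x.copy()                      # refresh the reference point
            return x

        # Example: least squares with a diagonal (Jacobi) preconditioner, purely illustrative.
        A = np.random.default_rng(1).standard_normal((200, 20))
        b = A @ np.ones(20)
        d = np.sum(A * A, axis=0) / A.shape[0]        # diagonal of the Hessian A^T A / n
        x_star = preconditioned_svrg(
            grad_i=lambda x, i: (A[i] @ x - b[i]) * A[i],
            full_grad=lambda x: A.T @ (A @ x - b) / A.shape[0],
            x0=np.zeros(20), P_inv=lambda v: v / d, n=A.shape[0])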

  • 17:29 – 17:51

    An MIQP Formulation for Gaussian Processes Using a Piecewise-Linear Kernel Approximation

    • Yilin Xie, Imperial College London
    • Shiqiang Zhang, Imperial College London
    • Joel Paulson, The Ohio State University
    • Calvin Tsay, presenter, Imperial College London

    Optimization over a Gaussian process model is a key step in many Bayesian optimization (BO) algorithms. However, this step is itself a challenging, nonlinear, non-convex optimization problem. Despite the importance of this step, most BO algorithms employ sampling- or gradient-based methods, which do not provably converge to global optima. This presentation investigates mixed-integer programming (MIP) as a paradigm for global optimization over Gaussian processes. Specifically, we present PK-MIQP (Piecewise-linear Kernel Mixed Integer Quadratic Programming), a formulation based on a piecewise-linear approximation of Gaussian process kernels that admits a corresponding MIQP representation. The proposed method pre-processes and approximates nonlinear kernel functions to make global optimization tractable. The final MIQP model can optionally be warm-started by solving a simpler sub-model. We analyze the theoretical regret bounds of the proposed approximation and empirically demonstrate the framework in a full BO procedure on synthetic functions, constrained benchmarks, and hyperparameter tuning.
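
    As a hedged illustration of the core ingredient, the snippet below approximates a one-dimensional stationary RBF kernel by linear interpolation between breakpoints; in a mixed-integer formulation this kind of approximation would be encoded with SOS2-style or binary constraints. The breakpoint grid, kernel choice, and function names are illustrative assumptions rather than the exact PK-MIQP formulation.

        import numpy as np

        def rbf_kernel_1d(r, lengthscale=1.0):
            # Stationary RBF kernel as a function of the distance r = ||x - x'||.
            return np.exp(-r**2 / (2.0 * lengthscale**2))

        def piecewise_linear_approx(f, breakpoints):
            # Linear interpolation of f between breakpoints; inside an MIQP this
            # interpolation would be modeled with SOS2 / binary variables.
            fb = f(breakpoints)
            return lambda r: np.interp(r, breakpoints, fb)

        # Quick check of approximation error on a 9-point grid (grid choice is an assumption).
        breakpoints = np.linspace(0.0, 4.0, 9)
        k_pwl = piecewise_linear_approx(rbf_kernel_1d, breakpoints)
        r = np.linspace(0.0, 4.0, 401)
        print("max abs error:", np.max(np.abs(rbf_kernel_1d(r) - k_pwl(r))))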
