All-at-once optimization for kernel machines with canonical polyadic decompositions

Title: All-at-once optimization for kernel machines with canonical polyadic decompositions: Enabling large scale learning for kernel machines
Author: van Mourik, Ewoud (TU Delft Mechanical, Maritime and Materials Engineering; TU Delft Delft Center for Systems and Control)
Contributors: Batselier, K. (mentor); Wesel, F. (mentor); Wahls, S. (graduation committee); Dauwels, J.H.G. (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Mechanical Engineering | Systems and Control
Date: 2022-11-15

Abstract:
This thesis studies the Canonical Polyadic Decomposition (CPD) constrained kernel machine for large scale learning, i.e. learning with a large number of samples. The kernel machine optimization problem is solved in the primal space, such that the complexity of the problem scales linearly in the number of samples, as opposed to scaling cubically in the dual space. Product feature maps are applied to transform the input data. The weights are constrained to be a CPD, so the number of weights scales linearly in the number of features. The CPD introduces a nonlinearity, so nonlinear optimization must be applied.

It is studied in which situations it is more advantageous to apply iterative all-at-once optimization rather than Alternating Least Squares (ALS) to solve the CPD constrained kernel machine problem. Specifically, all-at-once gradient descent methods are studied. An efficient analytical algorithm for the all-at-once gradient is derived. Furthermore, it is shown that automatic differentiation (AD) can also be applied, but it is found to be slower than the analytical method. The selection of a step size is found to be challenging: it is shown that the magnitude of the gradient of the mean squared error (MSE) term decreases for an increasing number of features.
As a result, selecting the step size becomes more difficult as the number of features grows. To overcome this, line search and the Adam method are studied. A general expression for the exact line search solution is derived; it can be applied to compute the optimal step size for any step direction and any number of features. However, the Adam method performs better in terms of loss after training, convergence, and training run time.

The mini-batch Adam method is used to evaluate the performance of all-at-once optimization for large scale learning. It is found that the Adam method no longer performs well for data sets with around 16 features or more, likely due to the decrease in the magnitude of the gradient of the MSE term. On large scale data sets with fewer features, the Adam method outperforms ALS in terms of run time until convergence while achieving similar training and validation losses. The Adam method reached convergence on a data set with 11 million samples within ten minutes. Furthermore, it is shown that the scaling of the run time of the Adam method in terms of the feature map order and the CP-rank is more than an order lower than the scaling of ALS when the methods are run on a GPU. This makes the Adam method more suitable for more complex models.

Subject: tensor decomposition; Canonical Polyadic Decomposition; optimization; feature mapping; all-at-once optimization; machine learning
To reference this document use: http://resolver.tudelft.nl/uuid:b96ce7d0-99f0-4bf4-b501-903bba3f180e
Bibliographical note: GitHub repository of the thesis: https://github.com/ewoudvm/all-at-once-cpd-learning
Part of collection: Student theses
Document type: master thesis
Rights: © 2022 Ewoud van Mourik
Files: PDF, MSc_Thesis_Ewoud_van_Mour ... itions.pdf (6.99 MB)
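The abstract describes a model whose weight tensor is constrained to a CPD and which is trained in the primal with an analytical all-at-once gradient and Adam. The following is a minimal NumPy sketch of that idea, not the thesis's implementation (see the linked GitHub repository for that). The polynomial feature map, the toy two-feature target, and all dimensions and hyperparameters here are illustrative assumptions:

```python
import numpy as np

def poly_features(x, M):
    # Illustrative product feature map (an assumption): phi(x) = [1, x, ..., x^(M-1)]
    return np.stack([x ** m for m in range(M)], axis=1)          # (N, M)

def factor_scores(W, Phis):
    # Z[d] = Phi_d @ W_d: per-feature scores of each of the R rank-one terms
    return np.stack([Phi @ Wd for Phi, Wd in zip(Phis, W)])      # (D, N, R)

def predict(W, Phis):
    # CPD-constrained model: f(x) = sum_r prod_d phi_d(x)^T W_d[:, r]
    return np.prod(factor_scores(W, Phis), axis=0).sum(axis=1)   # (N,)

def all_at_once_grads(W, Phis, y):
    # Analytical gradient of the MSE w.r.t. all CPD factors at once:
    # dL/dW_d = (2/N) Phi_d^T (residual * prod_{k != d} Z_k)
    Z = factor_scores(W, Phis)
    resid = np.prod(Z, axis=0).sum(axis=1) - y
    gs = []
    for d, Phi in enumerate(Phis):
        others = np.prod(np.delete(Z, d, axis=0), axis=0)        # (N, R)
        gs.append((2.0 / len(y)) * Phi.T @ (resid[:, None] * others))
    return gs

def adam_train(W, Phis, y, steps=800, lr=0.02, b1=0.9, b2=0.999, eps=1e-8):
    # Standard (full-batch) Adam with bias-corrected moments on all factors
    m = [np.zeros_like(Wd) for Wd in W]
    v = [np.zeros_like(Wd) for Wd in W]
    for t in range(1, steps + 1):
        for d, g in enumerate(all_at_once_grads(W, Phis, y)):
            m[d] = b1 * m[d] + (1 - b1) * g
            v[d] = b2 * v[d] + (1 - b2) * g ** 2
            W[d] = W[d] - lr * (m[d] / (1 - b1 ** t)) / (np.sqrt(v[d] / (1 - b2 ** t)) + eps)
    return W

# Toy usage: y = x1 * x2 is exactly representable by a rank-1 CPD with these features
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))        # N = 200 samples, D = 2 features
y = X[:, 0] * X[:, 1]
M, R = 3, 2                                      # feature map order and CP-rank
Phis = [poly_features(X[:, d], M) for d in range(X.shape[1])]
W = [0.1 * rng.standard_normal((M, R)) for _ in range(X.shape[1])]
mse_before = np.mean((predict(W, Phis) - y) ** 2)
W = adam_train(W, Phis, y)
mse_after = np.mean((predict(W, Phis) - y) ** 2)
```

Note that the number of parameters is D factor matrices of size M x R, i.e. linear in the number of features D, as the abstract states. The mini-batch variant evaluated in the thesis would compute each gradient on a random subset of the rows of `Phis` and `y` per step rather than on the full data set.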