All-at-once optimization for kernel machines with canonical polyadic decompositions

Title: All-at-once optimization for kernel machines with canonical polyadic decompositions: Enabling large scale learning for kernel machines
Author: van Mourik, Ewoud (TU Delft Mechanical, Maritime and Materials Engineering; TU Delft Delft Center for Systems and Control)
Contributors: Batselier, K. (mentor); Wesel, F. (mentor); Wahls, S. (graduation committee); Dauwels, J.H.G. (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Mechanical Engineering | Systems and Control
Date: 2022-11-15

Abstract:
This thesis studies the Canonical Polyadic Decomposition (CPD) constrained kernel machine for large scale learning, i.e. learning with a large number of samples. The kernel machine optimization problem is solved in the primal space, such that the complexity of the problem scales linearly in the number of samples, as opposed to scaling cubically in the dual space. Product feature maps are applied to transform the input data. The weights are constrained to be a CPD, so the number of weights scales linearly in the number of features. The CPD introduces a nonlinearity, so nonlinear optimization must be applied.

It is studied in which situations it is more advantageous to apply iterative all-at-once optimization rather than Alternating Least Squares (ALS) to solve the CPD constrained kernel machine problem. Specifically, all-at-once gradient descent methods are studied. An efficient analytical algorithm for the all-at-once gradient is derived. Furthermore, it is shown that automatic differentiation (AD) can also be applied, but it is found to be slower than the analytical method. The selection of a step size is found to be challenging: it is shown that the magnitude of the gradient of the mean squared error (MSE) term decreases for an increasing number of features.
As a result, selecting the step size becomes more difficult as the number of features grows. To overcome this, line search and the Adam method are studied. A general expression for the exact line search solution is derived; it can be applied to compute the optimal step size for any step direction and any number of features. However, the Adam method performs better in terms of loss after training, convergence, and training run time.

The mini-batch Adam method is used to evaluate the performance of all-at-once optimization for large scale learning. It is found that the Adam method no longer performs well for data sets with around 16 features or more, likely due to the decrease in the magnitude of the gradient of the MSE term. On large scale data sets with fewer features, the Adam method outperforms ALS in terms of run time until convergence while achieving similar training and validation losses. The Adam method reached convergence on a data set with 11 million samples within ten minutes. Furthermore, it is shown that the scaling of the run time of the Adam method in terms of the feature map order and the CP-rank is more than an order lower than the scaling of ALS when the methods are run on a GPU. This makes the Adam method more suitable for more complex models.

Subject: tensor decomposition; Canonical Polyadic Decomposition; optimization; feature mapping; all-at-once optimization; machine learning
To reference this document use: http://resolver.tudelft.nl/uuid:b96ce7d0-99f0-4bf4-b501-903bba3f180e
Bibliographical note: GitHub repository of the thesis: https://github.com/ewoudvm/all-at-once-cpd-learning
Part of collection: Student theses
Document type: master thesis
Rights: © 2022 Ewoud van Mourik
Files: PDF, MSc_Thesis_Ewoud_van_Mour ... itions.pdf (6.99 MB)
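The abstract describes a model whose weight tensor is constrained to a CPD and which is trained in the primal with an analytical all-at-once gradient and Adam. The following is a minimal NumPy sketch of that idea, not the thesis's implementation (see the linked GitHub repository for that). The polynomial feature map, the toy two-feature target, and all dimensions and hyperparameters here are illustrative assumptions:

```python
import numpy as np

def poly_features(x, M):
    # Illustrative product feature map (an assumption): phi(x) = [1, x, ..., x^(M-1)]
    return np.stack([x ** m for m in range(M)], axis=1)          # (N, M)

def factor_scores(W, Phis):
    # Z[d] = Phi_d @ W_d: per-feature scores of each of the R rank-one terms
    return np.stack([Phi @ Wd for Phi, Wd in zip(Phis, W)])      # (D, N, R)

def predict(W, Phis):
    # CPD-constrained model: f(x) = sum_r prod_d phi_d(x)^T W_d[:, r]
    return np.prod(factor_scores(W, Phis), axis=0).sum(axis=1)   # (N,)

def all_at_once_grads(W, Phis, y):
    # Analytical gradient of the MSE w.r.t. all CPD factors at once:
    # dL/dW_d = (2/N) Phi_d^T (residual * prod_{k != d} Z_k)
    Z = factor_scores(W, Phis)
    resid = np.prod(Z, axis=0).sum(axis=1) - y
    gs = []
    for d, Phi in enumerate(Phis):
        others = np.prod(np.delete(Z, d, axis=0), axis=0)        # (N, R)
        gs.append((2.0 / len(y)) * Phi.T @ (resid[:, None] * others))
    return gs

def adam_train(W, Phis, y, steps=800, lr=0.02, b1=0.9, b2=0.999, eps=1e-8):
    # Standard (full-batch) Adam with bias-corrected moments on all factors
    m = [np.zeros_like(Wd) for Wd in W]
    v = [np.zeros_like(Wd) for Wd in W]
    for t in range(1, steps + 1):
        for d, g in enumerate(all_at_once_grads(W, Phis, y)):
            m[d] = b1 * m[d] + (1 - b1) * g
            v[d] = b2 * v[d] + (1 - b2) * g ** 2
            W[d] = W[d] - lr * (m[d] / (1 - b1 ** t)) / (np.sqrt(v[d] / (1 - b2 ** t)) + eps)
    return W

# Toy usage: y = x1 * x2 is exactly representable by a rank-1 CPD with these features
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))        # N = 200 samples, D = 2 features
y = X[:, 0] * X[:, 1]
M, R = 3, 2                                      # feature map order and CP-rank
Phis = [poly_features(X[:, d], M) for d in range(X.shape[1])]
W = [0.1 * rng.standard_normal((M, R)) for _ in range(X.shape[1])]
mse_before = np.mean((predict(W, Phis) - y) ** 2)
W = adam_train(W, Phis, y)
mse_after = np.mean((predict(W, Phis) - y) ** 2)
```

Note that the number of parameters is D factor matrices of size M x R, i.e. linear in the number of features D, as the abstract states. The mini-batch variant evaluated in the thesis would compute each gradient on a random subset of the rows of `Phis` and `y` per step rather than on the full data set.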