Title: Prototype Selection for Classification in Standard and Generalized Dissimilarity Spaces
Author: Plasencia Calaña, Y.
Contributor: Reinders, M.J.T. (promotor)
Faculty: Electrical Engineering, Mathematics and Computer Science
Department: Intelligent Systems
Date: 2015-09-24

Abstract
Automatic pattern classification for a given problem domain aims at assigning a class or category membership to a new, unseen object from the same domain. This is performed in three main stages: data preprocessing, representation and classification. The data preprocessing stage depends strongly on the data type (e.g. images, signals), which makes its study highly domain dependent. The representation and classification stages are more general: the same type of representation or classifier can be studied for different problems. This thesis focuses on the representation stage, since a better representation results in better classification performance. Traditionally, pattern recognition has made use of vector space representations and structural representations. Drawbacks, such as the possible unavailability of distinguishing features and the lack of learning tools for structural representations, have led to alternatives such as the Dissimilarity Representation (DR), a relational representation in which objects are represented by their (potentially non-Euclidean and non-metric) dissimilarities to a set of prototypes. One of the possibilities when considering DRs is the Dissimilarity Space (DS) approach. It is postulated as a Euclidean space in which an object is represented by its dissimilarities to a set of prototypes. The DS is attractive since it offers a good trade-off between accuracy and the computational cost of the representation, especially when the prototypes are carefully selected.
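As a concrete illustration of the DS idea above: given a full dissimilarity matrix, the DS representation of an object is simply its row of dissimilarities to the chosen prototypes. A minimal sketch, using synthetic data and a Euclidean dissimilarity as stand-ins (all names and numbers here are our assumptions, not from the thesis):

```python
import numpy as np

# 50 synthetic objects with 8 raw features (illustrative data only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))

# Full pairwise dissimilarity matrix. Euclidean is used here for
# simplicity; the DS construction itself also accepts non-Euclidean
# and non-metric dissimilarities.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Picking three prototypes turns each object into a 3-dimensional
# vector of dissimilarities: its dissimilarity-space representation.
prototypes = [3, 17, 42]
DS = D[:, prototypes]
print(DS.shape)  # (50, 3)
```

Any vector-space classifier can then be trained on `DS`, which is what makes the DS approach attractive when the prototype set is small and well chosen.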
In this thesis we study how to define and select prototypes that yield good representations, in the sense of a good compromise between classification accuracy and computational cost. Our main research question is: can we create better prototypes and/or selection procedures if we take the nature and characteristics of the dissimilarity data into account? This thesis presents new prototype selection methods based on Genetic Algorithms (GAs). Ideally, prototypes should be chosen to optimize a suitable criterion that expresses the representativeness of the prototype set for a given problem. Since similar objects have similar representational power, randomized methods such as GAs are a powerful approach for selecting prototypes. This property was further exploited in two new scalable GA-based methods: one uses as its criterion the maximization of the weight of the minimum spanning tree over the set of prototypes; the other maximizes the match between the labels of objects and those of their assigned prototypes after nearest-prototype clustering. We found that, for multiclass problems, our proposed criterion based on maximizing the diversity of the prototypes was crucial to selecting good prototypes. The second part of the thesis studies the creation and selection of models as prototypes for classification in generalized dissimilarity spaces. A new method, based on the nearest feature line classifier, is proposed to select feature lines as prototypes. Feature lines are suitable for data under representational limitations. We also studied the creation and selection of clusters as prototypes, considering different ways to measure the distance of an object to a cluster: the minimum, maximum, and average statistics.
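The minimum-spanning-tree criterion mentioned above can be sketched as follows. This is a hypothetical illustration only: the crude random search stands in for the actual GA, and the data, names, and Euclidean dissimilarity are our assumptions, not the thesis code.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_weight(D, prototypes):
    # Total weight of the minimum spanning tree over the chosen
    # prototypes in the dissimilarity matrix D. A larger weight means
    # a more spread-out (diverse) prototype set, which is the kind of
    # criterion the thesis proposes to maximize.
    sub = D[np.ix_(prototypes, prototypes)]
    return minimum_spanning_tree(sub).sum()

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Random search standing in for the GA: evaluate random candidate
# prototype sets and keep the most diverse one.
candidates = [rng.choice(30, size=5, replace=False) for _ in range(50)]
best = max(candidates, key=lambda p: mst_weight(D, p))
print(mst_weight(D, best))
```

A GA would instead evolve the candidate sets by crossover and mutation, but the fitness evaluation would follow the same shape as `mst_weight`.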
A new method based on the Nyström formula was proposed to measure a subspace distance between an object and the positive part of the subspace spanned by the objects inside a cluster, using only the information contained in the given dissimilarities. The results of the study showed that cluster-based prototypes were always better than object-based prototypes when comparing DSs of the same dimension. The last part of the thesis studies the creation and selection of extended prototypes. First, the extension is achieved by considering directed asymmetric dissimilarities, where we obtain two dissimilarity values: one computed from the objects to the prototypes and one vice versa. Prototype selection in extended asymmetric dissimilarity spaces is studied as an alternative to symmetrization by averaging, minimum and maximum, as well as to the two individual directed DSs. Supervised selection procedures are studied, since they are able to select each prototype together with its best associated direction for computing the dissimilarities. We concluded from this study that there is useful information in the asymmetry, and that the dissimilarity space with prototype selection is a means to exploit it. In addition, we studied another way to use extended prototypes, for multiscale dissimilarity data. In this case, the prototypes are selected in an extended multiscale dissimilarity space (EMDS). A GA optimizing a classification criterion was proposed, owing to its ability to cope with large candidate sets of prototypes: it finds the best prototypes together with their best related scales, in order to take advantage of multiscale data provided in the form of dissimilarities. We found that our proposal of a reduced EMDS obtained by prototype selection was useful for problems where the scales perform significantly differently. This thesis has contributed insights into prototype selection for classification in the DS.
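The minimum, maximum, and average cluster statistics discussed for cluster-based prototypes are straightforward to sketch: each summarizes the dissimilarities between an object and the members of the cluster. The data, cluster membership, and Euclidean dissimilarity below are our own illustrative assumptions.

```python
import numpy as np

# Synthetic dissimilarity data (illustration only).
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 6))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# A cluster of objects used as a single prototype: the distance of an
# object to the cluster is summarized over the cluster members.
cluster = [5, 11, 23, 30]
d_min = D[:, cluster].min(axis=1)    # nearest-member distance
d_max = D[:, cluster].max(axis=1)    # farthest-member distance
d_avg = D[:, cluster].mean(axis=1)   # average-member distance
```

Each statistic yields one column of the resulting dissimilarity space per cluster-prototype, so the three variants give different representations of the same dimensionality.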
We found that diversity plays a key role in the selection of prototypes; this explains why a set of random prototypes is already a good starting point, and why GAs are powerful methods to optimize such a set further. In addition, we discovered that creating and selecting more complex models as prototypes (depending on the data characteristics), such as clusters or feature lines, increases classification accuracy. This points to promising research directions such as automatically learning the best prototype models. Finally, we showed that it is beneficial for classification to include extra knowledge, such as asymmetry in the data or multiscale dissimilarities, in an extended dissimilarity representation.

Subject: prototype selection
To reference this document use: https://doi.org/10.4233/uuid:4a8f0412-fc16-4dc7-8f42-cb223c64de1b
ISBN: 978-94-6295-348-2
Part of collection: Institutional Repository
Document type: doctoral thesis
Rights: (c) 2015 Plasencia Calaña, Y.
Files: thesis_final.pdf (PDF, 2.44 MB)