Evaluating the Performance of the Model Selection with Average ECE and Naive Calibration in Out-of-Domain Generalization Problems for Binary Classifiers

Liu, Anxian

Title

Evaluating the Performance of the Model Selection with Average ECE and Naive Calibration in Out-of-Domain Generalization Problems for Binary Classifiers

Author

Liu, Anxian (TU Delft Electrical Engineering, Mathematics and Computer Science; TU Delft Pattern Recognition and Bioinformatics)

Contributor

Krijthe, J.H. (mentor)
Karlsson, R.K.A. (mentor)
Bongers, S.R. (mentor)
Höllt, T. (graduation committee)

Degree granting institution

Delft University of Technology

Programme

Computer Science and Engineering

Project

CSE3000 Research Project

Date

2022-06-24

Abstract

Out-of-domain (OOD) generalization refers to learning a model from one or more different but related domains so that it can be used in an unknown test domain. This remains challenging for existing machine learning models. Several methods have been proposed to address the problem, and multi-domain calibration is one of them. Two approaches to implementing multi-domain calibration are naive calibration and model selection with the average expected calibration error (ECE) across training domains. However, neither approach is guaranteed to learn a genuinely well-calibrated model in the multi-domain setting. This paper therefore evaluates how naive calibration and model selection with average ECE perform in the OOD generalization problem for binary classifiers. We generated many synthetic datasets and set up three experiments to answer this question. The empirical results lead to three conclusions: 1) Naive calibration improves the average accuracy across unseen domains (OOD accuracy) and the average area under the ROC curve across unseen domains (OOD AUROC) for some binary classifiers, but not for all of them. At least, however, it does not make the model worse for OOD generalization. 2) On the synthetic datasets we generated, the OOD accuracy of most binary classifiers increases as the number of training domains increases. 3) Average ECE is a reasonable metric for selecting a model in the OOD generalization problem and is better than validation accuracy, because OOD accuracy has a strong linear relationship with the average ECE across the training domains, and this relationship is stronger than the one between OOD accuracy and validation accuracy.
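The selection criterion named in the abstract, average ECE across training domains, can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes a common equal-width binning of the predicted positive-class probability with 10 bins, and the function names (`binary_ece`, `average_ece`, `select_by_average_ece`) and the toy data are hypothetical.

```python
import numpy as np

def binary_ece(y_true, p_pos, n_bins=10):
    """ECE of a binary classifier (assumed equal-width binning variant).

    Bins the predicted positive-class probabilities and returns the
    sample-weighted mean of |observed positive rate - mean predicted probability|.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1;
    # a probability of exactly 1.0 falls into the last bin.
    bin_ids = np.digitize(p_pos, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - p_pos[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

def average_ece(per_domain, n_bins=10):
    """Mean ECE over training domains; per_domain is a list of (y_true, p_pos)."""
    return float(np.mean([binary_ece(y, p, n_bins) for y, p in per_domain]))

def select_by_average_ece(candidates, n_bins=10):
    """Return the name of the candidate model with the lowest average ECE."""
    return min(candidates, key=lambda name: average_ece(candidates[name], n_bins))

# Hypothetical toy predictions for two candidate models on two training domains.
calibrated = ([1, 0, 0, 0, 1, 1, 1, 0], [0.25] * 4 + [0.75] * 4)
overconfident = ([1, 1, 0, 0, 0, 0, 1, 1], [0.95] * 4 + [0.05] * 4)
candidates = {
    "calibrated": [calibrated, calibrated],
    "overconfident": [overconfident, overconfident],
}
best = select_by_average_ece(candidates)
```

Under these assumptions, the candidate whose predicted probabilities match the observed positive rates in every training domain is preferred, regardless of its validation accuracy.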

Subject

Out-of-domain (OOD) generalization
Calibration
Naive calibration
Expected calibration error
Multi-domain calibration

To reference this document use:

http://resolver.tudelft.nl/uuid:00365c48-9e5e-47db-9fe8-962c34011ef6

Part of collection

Student theses

Document type

bachelor thesis

Files

PDF

anxian_liu_final_paper_fi ... ersion.pdf

1.86 MB
