Title: PCADA: Partial Correlation Aware Data Augmentation for random forest classifier
Author: Lorek, Oskar (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributor: Ionescu, A. (mentor); Hai, R. (mentor); Epema, D.H.J. (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Computer Science and Engineering
Project: CSE3000 Research Project
Date: 2022-06-22

Abstract: Machine learning models require rich, high-quality data sets to achieve high accuracy. With the current exponential growth in generated data, it is becoming increasingly hard to prepare high-quality tables within a reasonable time frame. To combat this issue, automated data augmentation methods have emerged in recent years. However, existing solutions do not focus on the specific ML algorithm used for training on the data. In this paper we propose a data augmentation framework designed specifically for the random forest classifier. The algorithm uses sample joins to estimate the partial correlation between features in the neighbouring tables and the target column, while controlling for all other features. Moreover, we show that partial correlation is the most suitable characteristic for determining feature importance for the random forest classifier. In addition, we demonstrate that PCADA can improve accuracy and run-time compared with baseline data augmentation approaches. Finally, we show that the framework can also be used for other decision tree classifiers (CART, XGBoost) and for a linear classifier (Support Vector Machine).

To reference this document use: http://resolver.tudelft.nl/uuid:dc80eb98-a1da-49d6-bb8e-325e62870dd0
Part of collection: Student theses
Document type: bachelor thesis
Rights: © 2022 Oskar Lorek
Files: PDF oskar_lorek_research_project.pdf (883.87 KB)
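
The abstract's central quantity, the partial correlation between a candidate feature and the target while controlling for other features, can be illustrated with a minimal sketch. This is not taken from the thesis: the function name `partial_correlation`, the synthetic data, and the residual-based estimator (regress both variables on the controls, then correlate the residuals) are assumptions chosen for illustration, and the sample-join step is not modelled here.

```python
import numpy as np


def partial_correlation(x, y, controls):
    """Correlation of x and y after removing the linear effect of `controls`.

    Hypothetical helper: a standard residual-based estimator, not
    necessarily the estimator used in the thesis.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    controls = np.asarray(controls, dtype=float).reshape(len(x), -1)
    # Design matrix: intercept column plus the control features.
    design = np.column_stack([np.ones(len(x)), controls])
    # Residuals of x and y after least-squares regression on the controls.
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])


# Synthetic example (all values are made up for illustration).
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)                     # feature already in the base table
redundant = z + 0.1 * rng.normal(size=n)   # candidate that mostly duplicates z
informative = rng.normal(size=n)           # candidate carrying new signal
target = (z + informative + 0.5 * rng.normal(size=n) > 0).astype(float)

plain_redundant = float(np.corrcoef(redundant, target)[0, 1])
pc_redundant = partial_correlation(redundant, target, z)
pc_informative = partial_correlation(informative, target, z)

print(f"plain corr, redundant feature:     {plain_redundant:.3f}")
print(f"partial corr, redundant feature:   {pc_redundant:.3f}")
print(f"partial corr, informative feature: {pc_informative:.3f}")
```

In this toy setup the redundant candidate looks useful under plain correlation, but its partial correlation with the target falls close to zero once the existing feature is controlled for, whereas the informative candidate keeps a clearly positive score. That is the kind of distinction a partial-correlation-aware selection step is meant to capture.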