Machine Learning for Software Refactoring: a Large-Scale Empirical Study

Author: Gerling, Jan (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributor: Finavaro Aniche, M. (mentor); van Deursen, A. (graduation committee); Erkin, Z. (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Computer Science
Date: 2020-11-19

Abstract

Refactorings tackle the challenge of architectural degradation in object-oriented software projects by improving their internal structure without changing their behavior. Applied correctly, refactorings improve software quality and maintainability. However, identifying refactoring opportunities is a challenging problem for developers and researchers alike. In recent work, machine learning algorithms have shown great potential to solve this problem. This thesis used RefactoringMiner to detect refactorings in open-source Java projects and computed code metrics by static analysis. We defined refactoring opportunity detection as a binary classification problem and applied machine learning algorithms to solve it. The models distinguish between a specific refactoring type and a stable class, using the metrics as features. Multiple machine learning experiments were designed based on the results of an empirical study of the refactorings. For this work, we created the largest data set of refactorings in Java source code to date, including 92,800 open-source projects from GitHub with a total of 33.67 million refactoring samples. The data analysis revealed that Class- and Package-Level refactorings occur most frequently in the early development stages of a class, whereas Method- and Variable-Level refactorings are applied uniformly throughout the development of a class.
The machine learning models achieve high performance, ranging from 80% to 89% total average accuracy for different configurations of the refactoring opportunity prediction problem on unseen projects. Selecting a high Stable Commit Threshold (K) improves the recall of the models significantly, but also strongly reduces their generalizability. The Random Forest (RF) classifier shows great potential for refactoring opportunity detection: it adapts to various configurations of the problem, identifies a large variety of relevant metrics in the data, and is able to distinguish different refactoring types. This work shows that solving the refactoring opportunity detection problem requires a large variety of metrics, as a small set of metrics cannot represent the complexity of the problem.

Subject: Refactoring, software engineering, machine learning, data set, open source, Java
To reference this document use: http://resolver.tudelft.nl/uuid:bf649e9c-9d53-4e8c-a91b-f0a6b6aab733
Bibliographical note:
http://doi.org/10.5281/zenodo.4267824 (Appendix: Data Analysis and Machine Learning Experiments)
http://doi.org/10.5281/zenodo.4267711 (Appendix: Refactoring Data Set)
https://github.com/refactoring-ai/Data-Collection (Repository link: Refactoring Mining Tool)
https://github.com/refactoring-ai/Machine-Learning (Repository link: Machine Learning Pipeline)
Part of collection: Student theses
Document type: master thesis
Rights: © 2020 Jan Gerling
Files: Master_Thesis_Jan.pdf (PDF, 4.4 MB)
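
The abstract mentions a Stable Commit Threshold (K) but does not define it. The following is a minimal sketch under one assumed reading: a class yields a "stable" sample once it has been changed in K consecutive commits without a refactoring, and a "refactored" sample whenever a refactoring is detected. The function name `label_instances` and the boolean history encoding are hypothetical and are not taken from the thesis or its repositories.

```python
from typing import List, Tuple


def label_instances(history: List[bool], k: int) -> List[Tuple[int, str]]:
    """Label the commits of one class (assumed interpretation of K).

    history[i] is True if commit i refactored the class and False if it
    only changed the class in some other way. A refactoring commit yields
    a "refactored" sample. A run of k consecutive non-refactoring commits
    yields one "stable" sample, after which the counter restarts.
    """
    labels: List[Tuple[int, str]] = []
    unrefactored_run = 0
    for index, refactored in enumerate(history):
        if refactored:
            labels.append((index, "refactored"))
            unrefactored_run = 0
        else:
            unrefactored_run += 1
            if unrefactored_run == k:
                labels.append((index, "stable"))
                unrefactored_run = 0
    return labels


# Hypothetical history: seven commits, the third one is a refactoring.
history = [False, False, True, False, False, False, False]
low_k = label_instances(history, k=2)
high_k = label_instances(history, k=4)
```

Under this reading, raising K leaves the refactored samples unchanged but produces fewer stable samples, each backed by a longer unrefactored history. That may be one way to think about the trade-off between recall and generalizability reported in the abstract, although the abstract itself does not explain the mechanism.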