Title: Detecting PII in Git commits
Author: van der Plas, Niek (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributor: Cruz, Luis (mentor); Oliveira, Luiz (graduation committee); van Deursen, A. (mentor)
Degree granting institution: Delft University of Technology
Programme: Computer Science | Software Technology
Date: 2022-07-04
Abstract: With the advancement of technology, organizations are experiencing more trouble keeping their data private, as it is often leaked to the public via their code repositories or databases. There are methods to counter the leakage of data while pushing code to a repository; however, these rely heavily on regular expressions. Personal names, locations and other Personally Identifiable Information (PII) do not follow a recurring pattern, so their leakage can only be prevented by manual code reviews, which are also prone to errors. A tool to detect such PII should be designed as an initial measure to counteract the leakage. In this paper, we propose a heavily modifiable tool in which we combine the strength of regular expressions with a state-of-the-art machine learning model to detect a variety of important PII within the code changes of Python software projects. We use CodeBERT, a RoBERTa-like Transformer model, as our PII recognizer. This recognizer is fine-tuned on Git commits of the Scikit-learn library, into which we injected fake sensitive data. To test and improve the quality of the model and the entire tool, we design an experimental methodology to find the optimal values for the hyperparameters of the model, compare it against another Transformer model, and run the fine-tuned model against several other code bases written in different programming languages.
The outcomes of these experiments improve the quality of the model and allow us to design a robust tool with a well-performing machine learning model that detects a variety of entities. This tool can be personalized to any business and can mitigate a significant part of the potential data leaks.
To reference this document use: http://resolver.tudelft.nl/uuid:fe195c17-ecf5-4811-a987-89f238a6802f
Part of collection: Student theses
Document type: master thesis
Rights: © 2022 Niek van der Plas
Files: PDF Master_Thesis_Niek.pdf (1.98 MB)