Title: e-Discovery: Discovering fraud related e-mails using Bayesian statistical techniques
Author: Kaak, Davey (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributor: Bierkens, Joris (mentor); Dashorst, Ian (graduation committee)
Degree granting institution: Delft University of Technology
Date: 2019-06-06
Abstract: During a digital fraud investigation, the search for relevant information in the mailboxes of custodians is like finding a needle in a haystack. This time-consuming task can be improved and made more efficient on various levels. Technology Assisted Review (TAR) is one of the machine learning techniques already available that helps speed up the process of finding relevant information. In Technology Assisted Review, a model is trained on the classification of e-mails by expert review. During the review process, TAR continuously returns the (potentially) most relevant e-mails that still need to be classified. The downside of this algorithm is that a manual expert review is still needed before TAR can give recommendations. This thesis presents introductory research on models that give an initial sorting before the expert review is done. The working hypothesis is that this sorting (or classification) can be done in much the same way that spam e-mails are moved to the junk folder of a mailbox. Three features were used (word frequencies, word occurrences and e-mail length), each with four models (a generative and a discriminative model, each fitted with either maximum likelihood estimation or Bayesian estimation). Each of these 12 implementations was tested on three datasets (TREC, ENRON and a confidential dataset).
In 5-fold cross-validation, the Bayesian generative model based on word frequencies performed best on the confidential dataset. This result indicates that a classification at the start of a digital fraud investigation can be helpful. Combining different models and finding the best parameters for practical use of the model are left for further research.
Subject: classification, fraud, generative model, discriminative model, Naive Bayes, logistic regression, TAR, e-Discovery, Bayesian statistics
To reference this document use: http://resolver.tudelft.nl/uuid:52a65ac9-afb3-4b2a-84e2-3b4efe7185a5
Part of collection: Student theses
Document type: master thesis
Rights: © 2019 Davey Kaak
Files: PDF Final_report_Master_Thesi ... Kaak_2.pdf (5.32 MB)