Title: Who said that? Comparing performance of TF-IDF and fastText to identify authorship of short sentences
Author: van Tussenbroek, Thomas (TU Delft Electrical Engineering, Mathematics and Computer Science; TU Delft Pattern Recognition and Bioinformatics)
Contributor: Viering, T.J. (graduation committee); Makrodimitris, S. (graduation committee); Naseri Jahfari, A. (graduation committee); Tax, D.M.J. (mentor); Loog, M. (mentor)
Degree granting institution: Delft University of Technology
Programme: Computer Science and Engineering
Project: CSE3000 Research Project
Date: 2020-06-22

Abstract
Authorship identification is often applied to large documents, but less so to short, everyday sentences. The ability to identify who said a short line could help chatbots or personal assistants. This research compares the performance of TF-IDF and fastText at identifying the authorship of short sentences by applying both feature extraction techniques to transcripts of the television series Friends. TF-IDF outperforms fastText in every measurement, but its performance is only marginally better than randomly guessing the original character: it reaches an accuracy of 28 percent when distinguishing between 6 characters. For both techniques, accuracy increases linearly and at the same rate as the minimum word count per sentence imposed on the test data increases. TF-IDF's confidence remains constant when this limit is imposed on either the test or the training data, whereas fastText's confidence decreases when the limit is imposed on the test data and increases when it is imposed on the training data. Cross-entropy loss, however, remains constant for fastText and decreases for TF-IDF as the minimum word count imposed on the test data increases.
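
The quantities named in the abstract can be made concrete with a small sketch. It is not the thesis code: the function names, the toy sentences, and the speaker labels are all hypothetical, and the TF-IDF weighting is one common smoothed variant rather than the one the author necessarily used. The chance-level figures assume the 6 characters are equally frequent, which the abstract does not state.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return sentence.lower().split()


def filter_min_words(samples, min_words):
    """Keep only (sentence, speaker) pairs with at least min_words tokens."""
    return [(s, a) for s, a in samples if len(tokenize(s)) >= min_words]


def fit_idf(sentences):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(tokenize(s)))  # count each term once per sentence
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(sentence, idf):
    """L2-normalised sparse TF-IDF vector; unseen terms are ignored."""
    counts = Counter(t for t in tokenize(sentence) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def chance_accuracy(n_classes):
    """Accuracy of uniform random guessing, assuming balanced classes."""
    return 1.0 / n_classes


def uniform_cross_entropy(n_classes):
    """Cross-entropy loss (in nats) of a uniform prediction over n_classes."""
    return math.log(n_classes)


# Hypothetical toy data, not taken from the Friends transcripts.
samples = [
    ("hello", "speaker_a"),
    ("i really like coffee", "speaker_b"),
    ("coffee is great", "speaker_a"),
    ("no", "speaker_b"),
]

idf = fit_idf([s for s, _ in samples])
long_enough = filter_min_words(samples, min_words=3)
vector = tfidf_vector("coffee is great", idf)

print(len(long_enough), "sentences have at least 3 words")
print("chance accuracy for 6 classes:", round(chance_accuracy(6), 3))
print("uniform cross-entropy for 6 classes:", round(uniform_cross_entropy(6), 3))
```

Under the balanced-class assumption, random guessing among 6 characters gives an accuracy of about 17 percent, which is the baseline against which the reported 28 percent can be read.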
Subject: Authorship Identification; fastText; TF-IDF; short sentences; Natural Language Processing
To reference this document use: http://resolver.tudelft.nl/uuid:93873bbf-2886-4023-b696-e11be2b99024
Part of collection: Student theses
Document type: bachelor thesis
Rights: © 2020 Thomas van Tussenbroek
Files: PDF Research_Paper_Thomas_van ... 534794.pdf (906.45 KB)