Plagiarism detection by similarity join

Schellenberger, R.

Title

Plagiarism detection by similarity join

Author

Schellenberger, R.

Contributor

De Vries, A.P. (mentor)
De Bruijn, S.T.J. (mentor)

Faculty

Electrical Engineering, Mathematics and Computer Science

Department

Media and Knowledge Engineering

Date

2009-08-18

Abstract

Because the internet is vast and most of its content is public, it is very hard to determine where a piece of information originally came from. Many websites publish news articles, so people and organizations can easily lose track of where their articles are reused, with or without their permission. This paper presents a plagiarism detection algorithm that quickly compares online news articles with a collection of personal news articles and detects plagiarized passages with the same quality as a human would. The algorithm uses a basic shingle index, together with a Signature Tree as a more advanced pre-filtering step, to narrow down the candidate documents for a query. The algorithm achieves a precision of 0.96 and a recall of 0.94, but it is too resource intensive to be considered scalable. When only the pre-filtering step is used, it achieves a precision and recall of 0.85 while running nearly one order of magnitude faster.
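To make the idea of a shingle index concrete, the following is a minimal sketch of generic word-level shingling with an inverted index and Jaccard similarity. It is not the implementation from the thesis: the function names, the shingle width `w=3`, and the `threshold=0.2` cutoff are all assumptions chosen for illustration, and the Signature Tree pre-filter is not shown.

```python
from collections import defaultdict


def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Texts shorter than one window yield a single short shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def build_index(collection, w=3):
    """Map each shingle to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for s in shingles(text, w):
            index[s].add(doc_id)
    return index


def candidates(query, index, collection, w=3, threshold=0.2):
    """Return (doc_id, Jaccard similarity) pairs at or above threshold."""
    q = shingles(query, w)
    # Only documents sharing at least one shingle with the query are scored.
    seen = set()
    for s in q:
        seen |= index.get(s, set())
    results = []
    for doc_id in seen:
        d = shingles(collection[doc_id], w)
        sim = len(q & d) / len(q | d)  # Jaccard similarity of shingle sets
        if sim >= threshold:
            results.append((doc_id, sim))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Hypothetical two-document collection, for illustration only.
collection = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "stock markets fell sharply on monday morning",
}
index = build_index(collection)
print(candidates("a quick brown fox jumps over the lazy cat", index, collection))
```

In this sketch the inverted index acts as a cheap filter: a document that shares no shingle with the query is never scored, so the Jaccard comparison runs only on plausible matches.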

Subject

plagiarism detection
similarity join
shingles
signature tree
news articles

To reference this document use:

http://resolver.tudelft.nl/uuid:a18b62c5-e73e-44fc-9336-83a78275f266

Embargo date

2011-08-18

Part of collection

Student theses

Document type

master thesis

Files

PDF

Plagiarism_detection_by_s ... y_join.pdf

2.96 MB
