Title: Performance of near-duplicate detection algorithms for Crawljax
Author: Van Eyk, E.D.C.; Van Leeuwen, W.J.
Contributor: Van Deursen, A. (mentor)
Faculty: Delft University of Technology
Department: Electrical Engineering, Mathematics and Computer Science
Programme: Computer Science
Date: 2014-07-10

Abstract: Crawljax is a crawler that finds not only states reachable via regular links but also states hidden behind JavaScript actions. However, this leads to a very large number of states, many of them duplicates. A near-duplicate detection algorithm can limit the number of states found by Crawljax while still crawling the most essential, unique states. A literature survey showed that Simhash and Broder are two state-of-the-art near-duplicate detection algorithms suitable for Crawljax. In this project, both algorithms are implemented in Crawljax. They have been tested extensively to compare their performance in Crawljax with that of the current duplicate detection. The testing was done with a separate calibration tool, which can distribute tasks over different machines to reduce the time needed for the tests. The calibration tool reports the number of mistakes made by each near-duplicate detection algorithm for many different parameter values, which makes it possible to compare the performance of the algorithms. The results of the calibration tool showed that Crawlhash was faster, but Broder was slightly better. Additionally, the so-called threshold-slider has been designed to simulate what would have happened to the state-flow graph of a crawl if a higher threshold had been used. This makes it possible to find a suitable threshold for a specific web application.
Subject: Crawljax, fingerprinting, near-duplicate detection, testing
To reference this document use: http://resolver.tudelft.nl/uuid:66711106-520b-4b27-ae79-256f2eb7250c
Part of collection: Student theses
Document type: bachelor thesis
Rights: (c) 2014 Van Eyk, E.D.C.; Van Leeuwen, W.J.
Files: BEP_fingerprinting_Crawljax.pdf (PDF, 1.93 MB)