DarkDiff: Explainable web page similarity of TOR onion sites

08/23/2023
by   Pieter Hartel, et al.

In large-scale data analysis, near-duplicates are often a problem. For example, when two phishing emails are near-duplicates, a difference in the salutation (Mr versus Ms) is not essential, but whether the email names bank A or bank B is important. The state of the art in near-duplicate detection, MinHash, is a black-box approach: one learns only that two emails are near-duplicates, not why. We present DarkDiff, which efficiently detects near-duplicates and also reports the reason why two items are considered near-duplicates. We developed DarkDiff to detect near-duplicate homepages on the Darkweb. DarkDiff works well on those pages because they resemble the clear web of the past.
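To make the contrast between a similarity score and an explanation concrete, here is a minimal Python sketch. It is not the DarkDiff algorithm, which the abstract does not describe. It assumes word-level tokens, uses exact Jaccard similarity over word shingles (the quantity MinHash estimates) as the black-box score, and uses the standard-library `difflib` as a stand-in for an explainable comparison. The function names and the two sample emails are hypothetical.

```python
import difflib


def shingles(text, k=2):
    """Return the set of k-word shingles of a text (hypothetical helper)."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of the shingle sets; MinHash approximates this."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def explain(a, b):
    """List the word spans that differ, as (text in a, text in b) pairs."""
    wa, wb = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=wa, b=wb, autojunk=False)
    return [
        (" ".join(wa[i1:i2]), " ".join(wb[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Illustrative emails modelled on the example in the abstract.
email_1 = "Dear Mr Smith, your account at Bank A has been locked"
email_2 = "Dear Ms Smith, your account at Bank B has been locked"

score = jaccard(email_1, email_2)       # a single number, no reason attached
differences = explain(email_1, email_2)  # the spans that actually differ
print(score, differences)
```

The score says only how much the two emails overlap. The list of differing spans shows that one difference is the salutation and the other is the bank, which is the kind of information an analyst needs to decide whether the near-duplicate matters.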
