Blogs, Twitter Feeds, and Reddit Comments: Cross-domain Authorship Attribution

Authors: Rebekah Overdorf (Drexel University), Rachel Greenstadt (Drexel University)

Volume: 2016
Issue: 3
Pages: 155–171


Abstract: Stylometry is a form of authorship attribution that relies on linguistic information to attribute documents of unknown authorship based on the writing styles of a suspect set of authors. This paper focuses on the cross-domain subproblem, in which the known and suspect documents differ in the setting in which they were created. Three distinct domains are explored in this work: Twitter feeds, blog entries, and Reddit comments. We determine that state-of-the-art methods in stylometry do not perform as well in cross-domain situations (34.3% accuracy) as they do in in-domain situations (83.5% accuracy), and we propose feature-level and classification-level techniques that improve performance in the cross-domain setting, increasing accuracy to as much as 70%. In addition to testing these approaches on a large real-world dataset, we also examine real-world adversarial cases in which an author is actively attempting to hide their identity. Being able to identify authors across domains facilitates linking identities across the Internet, making this a key security and privacy concern; users can take other measures to ensure their anonymity, but because of their unique writing style, they may not be as anonymous as they believe.

Keywords: Stylometry, Machine Learning, Domain Adaptation, Privacy

Copyright in PoPETs articles is held by their authors. This article is published under a Creative Commons Attribution-NonCommercial-NoDerivs license.