Git Blame Who?: Stylistic Authorship Attribution of Small, Incomplete Source Code Fragments

Authors: Edwin Dauber (Drexel University), Aylin Caliskan (George Washington University), Richard Harang (Sophos Data Science Team), Gregory Shearer (ICF International), Michael Weisman (United States Army Research Laboratory), Frederica Nelson (United States Army Research Laboratory), Rachel Greenstadt (New York University)

Volume: 2019
Issue: 3
Pages: 389–408
DOI: https://doi.org/10.2478/popets-2019-0053

Abstract: Program authorship attribution has implications for the privacy of programmers who wish to contribute code anonymously. While previous work has shown that individually authored complete files can be attributed, these efforts have focused on such ideal data sets as contest submissions and student assignments. We explore the problem of authorship attribution “in the wild,” examining source code obtained from open-source version control systems, and investigate how contributions can be attributed to their authors, either on an individual or a per-account basis. In this work, we present a study of attribution of code collected from collaborative environments and identify factors which make attribution of code fragments more or less successful. For individual contributions, we show that previous methods (adapted to be applied to short code fragments) yield an accuracy of approximately 50% or 60%, depending on whether we average by sample or by author, at identifying the correct author out of a set of 104 programmers. By ensembling the classification probabilities of a sufficiently large set of samples belonging to the same author we achieve much higher accuracy for assigning the set of samples to the correct author from a known suspect set. Additionally, we propose the use of calibration curves to identify which samples are by unknown and previously unencountered authors.
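
Illustration: the ensembling step described in the abstract can be sketched as follows. This is a minimal sketch and not the authors' implementation. It assumes that the per-sample class probabilities are combined by simple averaging, which is only one possible ensembling rule, and the function name, author labels, and probability values are all hypothetical.

```python
from typing import Sequence


def ensemble_attribution(
    sample_probs: Sequence[Sequence[float]],
    authors: Sequence[str],
) -> str:
    """Average per-sample class probabilities and return the top author.

    Assumption: every sample in sample_probs is known to come from the
    same (unknown) author, and each row has one probability per entry
    in authors, in the same order.
    """
    if not sample_probs:
        raise ValueError("need at least one sample")
    n = len(sample_probs)
    # Mean probability assigned to each candidate author across samples.
    mean = [
        sum(row[j] for row in sample_probs) / n
        for j in range(len(authors))
    ]
    best = max(range(len(authors)), key=lambda j: mean[j])
    return authors[best]


# Hypothetical classifier outputs for four fragments from one account.
authors = ["alice", "bob", "carol"]
probs = [
    [0.40, 0.35, 0.25],  # on its own, this fragment points to alice
    [0.30, 0.45, 0.25],  # points to bob
    [0.20, 0.50, 0.30],  # points to bob
    [0.45, 0.40, 0.15],  # points to alice
]
predicted = ensemble_attribution(probs, authors)
```

In this made-up data the individual fragments split evenly between two candidates, while the averaged probabilities favor a single one. That is the intuition behind combining many short fragments from the same author rather than classifying each in isolation.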

Keywords: Stylometry, code authorship attribution

Copyright in PoPETs articles is held by their authors. This article is published under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 license.