Overview of the PAN@FIRE 2020 Task on the Authorship Identification of SOurce COde

Ali Fadel, Husam Musleh, Ibraheem Tuffaha, Mahmoud Al-Ayyoub, Yaser Jararweh, Elhadj Benkhelifa, Paolo Rosso

December 2020

Abstract

Authorship identification is essential for detecting the deceptive misuse of others’ content and for exposing the owners of anonymous malicious content. While it is widely studied for natural languages, it is rarely considered for programming languages. Accordingly, a PAN@FIRE task, named Authorship Identification of SOurce COde (AI-SOCO), is proposed with a focus on identifying the authors of source code. The dataset consists of source code crawled from the CodeForces online judge platform, submitted by the top 1,000 human users with 100 or more correct C++ submissions. The participating systems are asked to predict the author of a given source code from the predefined list of code authors. In total, 60 teams registered on the task’s CodaLab page. Of these, 14 teams submitted 94 runs. Accuracy is surprisingly high, with many teams and baselines exceeding 90%. These systems used a wide range of models and techniques, from pretrained word embeddings (especially those adapted to handle source code) to stylometric features.
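The task setup described in the abstract (choosing one author from a fixed list and scoring by accuracy) can be illustrated with a small sketch. This is not one of the task's baselines or any participant's system. It is a simple character n-gram profile classifier, chosen only because it needs nothing beyond the Python standard library. All function names and the toy C++ snippets are hypothetical and invented for illustration.

```python
from collections import Counter
import math


def char_ngrams(code, n=3):
    """Count overlapping character n-grams in a source string."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one aggregated n-gram profile per author from (author, code) pairs."""
    profiles = {}
    for author, code in samples:
        profiles.setdefault(author, Counter()).update(char_ngrams(code))
    return profiles


def predict(profiles, code):
    """Return the known author whose profile is most similar to the code."""
    grams = char_ngrams(code)
    return max(profiles, key=lambda author: cosine(profiles[author], grams))


def accuracy(profiles, labelled):
    """Fraction of (author, code) pairs whose author is predicted correctly."""
    correct = sum(predict(profiles, code) == author for author, code in labelled)
    return correct / len(labelled)


# Toy data (invented): two authors with visibly different formatting habits.
train_data = [
    ("author_a", "for(int i=0;i<n;i++){cin>>a[i];}"),
    ("author_a", "for(int j=0;j<m;j++){cout<<b[j];}"),
    ("author_b", 'for (int i = 0; i < n; ++i) {\n    scanf("%d", &a[i]);\n}'),
    ("author_b", 'for (int j = 0; j < m; ++j) {\n    printf("%d", b[j]);\n}'),
]
test_data = [
    ("author_a", "for(int k=0;k<q;k++){cin>>c[k];}"),
    ("author_b", 'for (int k = 0; k < q; ++k) {\n    scanf("%d", &c[k]);\n}'),
]

profiles = train(train_data)
score = accuracy(profiles, test_data)
```

Character n-grams pick up spacing, brace placement, and I/O idioms, which are the kind of stylistic cues that stylometric features aim to capture. A realistic system for 1,000 authors would need far more data and a stronger model than this sketch.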

Type

Conference paper

Publication

Forum for Information Retrieval Evaluation

nlp, Authorship Identification, Source Code, deep learning, Neural Networks

Ali Fadel

Machine Learning Engineer II
