Evidence-based Token Abstraction for Software Plagiarism Detection

From SDQ-Institutsseminar
Speaker Hannes Greule
Talk type Bachelor's thesis
Advisor Timur Sağlam
Date Fri, 28 April 2023
Talk language
Talk mode in person
Abstract Programming assignments for students are a target of plagiarism. Especially for graded assignments, instructors want to detect plagiarism among the students. For larger courses, however, manually inspecting all submissions is a resource-intensive task. For this purpose, there are numerous tools that can help detect plagiarism in submissions. Many well-known plagiarism detection tools are token-based detectors. In an abstraction step, they map source code to a list of tokens, and these lists are then compared with each other. While there is much research on comparison algorithms, the mapping itself is often considered only superficially. In this work, we conduct two experiments that address the issue of token abstraction. To this end, we design different token abstractions and explain their differences. We then evaluate these abstractions using multiple datasets. We show that different abstractions have pros and cons, and that a higher abstraction level does not necessarily perform better. These findings are useful when adding support for new programming languages and when improving existing plagiarism detection tools. Furthermore, the results can help in choosing abstractions tailored to specific requirements.
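To make the abstraction step more concrete, the following is a minimal sketch of how source code might be mapped to token lists at two abstraction levels and then compared. It is not the implementation evaluated in the thesis: the function names `abstract` and `similarity`, the two levels `"low"` and `"high"`, and the token labels are all hypothetical, and Python's `difflib.SequenceMatcher` is used only as a simple stand-in for a real comparison algorithm.

```python
import io
import keyword
import tokenize
from difflib import SequenceMatcher

# Token types that carry no structural information in this sketch.
SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def abstract(source, level):
    """Map Python source to a token list (hypothetical abstraction).

    level="low"  keeps identifier names,
    level="high" replaces every identifier with the generic token "ID".
    Literals are always abstracted to "NUM" or "STR".
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME:
            if keyword.iskeyword(tok.string):
                tokens.append(tok.string.upper())
            elif level == "high":
                tokens.append("ID")
            else:
                tokens.append("ID:" + tok.string)
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")
        elif tok.type == tokenize.STRING:
            tokens.append("STR")
        else:
            tokens.append(tok.string)  # operators and delimiters
    return tokens


def similarity(a, b, level):
    """Similarity in [0, 1] between two sources (stand-in comparison)."""
    matcher = SequenceMatcher(
        None, abstract(a, level), abstract(b, level), autojunk=False
    )
    return matcher.ratio()


original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def sum_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"

print(similarity(original, renamed, "low"))
print(similarity(original, renamed, "high"))
```

In this sketch, renaming every identifier lowers the similarity at the `"low"` level, while at the `"high"` level the two token lists are identical. The coarser level is therefore more robust against renaming, but it also cannot tell apart unrelated programs that happen to share the same structure, which is one plausible reason why a higher abstraction level is not automatically better.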