Semantic Tokenization in Source Code Plagiarism Detection

From SDQ-Institutsseminar
Speaker Simon Wessel
Talk type Bachelor's thesis
Advisor Robin Maisch
Date Fri, 24 April 2026, 14:00 (Room 010 (Building 50.34))
Talk language German
Talk mode In person
Abstract This thesis presents an approach to mitigate this limitation by enriching tokens with additional structural and semantic context. The proposed method extends the token representation with vector-based features derived from the abstract syntax tree and from variable dependencies. Furthermore, a filtration–verification process is introduced into the Greedy String Tiling algorithm so that similarity can be computed on these enriched representations.

The approach is implemented in the state-of-the-art plagiarism detection system JPlag and evaluated on real-world submissions from introductory programming courses. The results demonstrate that the proposed method effectively reduces the similarity scores of independently developed submission pairs, thereby improving the separation between plagiarized and independent submission pairs, while incurring only minor runtime overhead.
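
To make the idea of a filtration–verification step inside Greedy String Tiling more concrete, the following is a minimal Python sketch. It is not the thesis's implementation and does not use JPlag's API. The `Token` class, the `features` vector, the cosine-similarity check, the `threshold` parameter, and the similarity formula are all assumptions chosen for illustration. The sketch first filters token pairs by plain token type (cheap) and only then verifies them by comparing their context vectors (more expensive).

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    kind: str                    # plain token type, e.g. "ASSIGN"
    features: Tuple[float, ...]  # hypothetical context vector (assumption)


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tokens_match(a, b, threshold):
    # Filtration: a cheap type comparison rules out most pairs.
    if a.kind != b.kind:
        return False
    # Verification: the enriched context must also be similar enough.
    return cosine(a.features, b.features) >= threshold


def greedy_string_tiling(left, right, min_match=2, threshold=0.9):
    """Simplified (quadratic) Greedy String Tiling; returns (i, j, length) tiles."""
    marked_l = [False] * len(left)
    marked_r = [False] * len(right)
    tiles: List[Tuple[int, int, int]] = []
    while True:
        max_len, matches = min_match, []
        # Phase 1: collect the longest matches among unmarked tokens.
        for i in range(len(left)):
            for j in range(len(right)):
                k = 0
                while (i + k < len(left) and j + k < len(right)
                       and not marked_l[i + k] and not marked_r[j + k]
                       and tokens_match(left[i + k], right[j + k], threshold)):
                    k += 1
                if k > max_len:
                    max_len, matches = k, [(i, j, k)]
                elif k == max_len:
                    matches.append((i, j, k))
        if not matches:
            return tiles
        # Phase 2: turn non-overlapping longest matches into tiles.
        for i, j, k in matches:
            if any(marked_l[i:i + k]) or any(marked_r[j:j + k]):
                continue
            for offset in range(k):
                marked_l[i + offset] = True
                marked_r[j + offset] = True
            tiles.append((i, j, k))


def similarity(left, right, **kwargs):
    """Share of tokens covered by tiles, averaged over both sequences."""
    covered = sum(k for _, _, k in greedy_string_tiling(left, right, **kwargs))
    total = len(left) + len(right)
    return 2 * covered / total if total else 0.0


# Two sequences with identical token types but different (made-up) contexts.
a = [Token("VARDEF", (1.0, 0.0)), Token("ASSIGN", (1.0, 0.0)), Token("APPLY", (0.0, 1.0))]
b = [Token("VARDEF", (1.0, 0.0)), Token("ASSIGN", (0.0, 1.0)), Token("APPLY", (1.0, 0.0))]
```

In this toy setup, a `threshold` of `0.0` reduces the comparison to plain type matching because the example vectors are non-negative, so `a` and `b` match completely. With the stricter default threshold, the differing context vectors prevent any tile of the minimum length from forming. This mirrors, in a very reduced form, how enriched tokens could lower the score of pairs that only look alike at the token-type level.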