Abstract
With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or use a syntactic tree representation, treating it like sentences written in a natural language. However, none of the existing methods are sufficient to comprehend program semantics robustly, due to structural features such as function calls, branching, and interchangeable order of statements. In this paper, we propose a novel processing technique to learn code semantics, and apply it to a variety of program analysis tasks. In particular, we stipulate that a robust distributional hypothesis of code applies to both human- and machine-generated programs. Following this hypothesis, we define an embedding space, inst2vec, based on an Intermediate Representation (IR) of the code that is independent of the source programming language. We provide a novel definition of contextual flow for this IR, leveraging both the underlying data- and control-flow of the program. We then analyze the embeddings qualitatively using analogies and clustering, and evaluate the learned representation on three different high-level tasks. We show that even without fine-tuning, a single RNN architecture with fixed inst2vec embeddings outperforms specialized approaches for performance prediction (compute device mapping, optimal thread coarsening) and for algorithm classification from raw code (104 classes), where we set a new state of the art.