Flow2Vec: Value-Flow-Based Precise Code Embedding (SPLASH 2020 - OOPSLA)

Sun 15 - Sat 21 November 2020 Online Conference

Who

Yulei Sui, Xiao Cheng, Guanqin Zhang, Haoyu Wang

Track

SPLASH 2020 OOPSLA

Time Zone

The program is currently displayed in (GMT-06:00) Central Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 17 Nov 2020 17:00 - 17:20 at SPLASH-I - T-6A Chair(s): Zhefeng Wu, Filip Niksic
Wed 18 Nov 2020 05:00 - 05:20 at SPLASH-I - T-6A Chair(s): Michael Pradel, Konstantinos Kallas

Abstract

Code embedding, as an emerging paradigm for source code analysis, has attracted much attention over the past few years. It aims to represent code semantics through distributed vector representations, which can be used to support a variety of program analysis tasks (e.g., code summarization and semantic labeling). However, existing code embedding approaches are intraprocedural and alias-unaware, and they ignore the asymmetric transitivity of directed graphs abstracted from source code. As a result, they remain ineffective in preserving the structural information of code.

This paper presents Flow2Vec, a new code embedding approach that precisely preserves interprocedural program dependence (a.k.a. value-flows). By approximating the high-order proximity, i.e., the asymmetric transitivity of value-flows, Flow2Vec embeds control-flows and alias-aware data-flows of a program in a low-dimensional vector space. Our value-flow embedding is formulated as matrix multiplication to preserve context-sensitive transitivity through CFL reachability by filtering out infeasible value-flow paths. We have evaluated Flow2Vec using 32 popular open-source projects. Results from our experiments show that Flow2Vec successfully boosts the performance of two recent code embedding approaches, code2vec and code2seq, for two client applications, i.e., code classification and code summarization. For code classification, Flow2Vec improves code2vec with an average increase of 21.2%, 20.1% and 20.7% in precision, recall and F1, respectively. For code summarization, Flow2Vec outperforms code2seq by an average of 13.2%, 18.8% and 16.0% in precision, recall and F1, respectively.
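To make "asymmetric transitivity" and "high-order proximity via matrix multiplication" concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: it uses a generic truncated Katz-style sum over powers of an adjacency matrix, applies no CFL-reachability filtering, and stops before any factorization into vectors. The function names, the decay factor `beta`, and the three-node graph are all assumptions made for illustration.

```python
def matmul(x, y):
    """Multiply two square matrices given as lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def high_order_proximity(adj, beta=0.5, max_len=3):
    """Sum beta**k * adj**k for k = 1..max_len (a truncated Katz-style index).

    Entry [i][j] is positive only if j is reachable from i along directed
    edges within max_len steps; longer paths are weighted less.
    """
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # starts as adj**1
    weight = beta
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
        power = matmul(power, adj)  # advance to the next path length
        weight *= beta
    return total


# Hypothetical value-flow chain: node 0 -> node 1 -> node 2
ADJ = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
PROX = high_order_proximity(ADJ)
```

In this toy graph, `PROX[0][2]` is positive because a value can flow from node 0 to node 2 through node 1, while `PROX[2][0]` stays zero because no flow exists in the reverse direction. An embedding that preserves this matrix therefore has to treat source and target roles differently, which is the asymmetry the abstract refers to.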

Link to Publication

https://dl.acm.org/doi/pdf/10.1145/3428301

DOI

https://doi.org/10.1145/3428301

Yulei Sui

University of Technology Sydney

Xiao Cheng

Beijing University of Posts and Telecommunications

Guanqin Zhang

University of Technology Sydney

Haoyu Wang

Beijing University of Posts and Telecommunications
