Recent research shows that language models, such as n-gram models, are useful for a wide variety of software engineering tasks, e.g., code completion, bug identification, code summarisation, etc. However, such models require appropriate settings for numerous parameters. Moreover, the different ways one can read code essentially yield different models (based on the different sequences of tokens). In this paper, we focus on n-gram models and evaluate how the choice of tokenizer, smoothing, unknown threshold and n values impacts the predictive ability of these models. To this end, we compare the use of multiple tokenizers and sets of different parameters (smoothing, unknown threshold and n values) with the aim of identifying the most appropriate combinati...
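To make the parameters named in the abstract above concrete, here is a minimal sketch, assuming a Python setting, of an n-gram model over code tokens with an unknown-token threshold and add-k smoothing; the function name, the pre-tokenized input format and the default values of n, unk_threshold and k are illustrative assumptions, not the configuration evaluated in the paper.

```python
from collections import Counter

def build_ngram_model(token_streams, n=3, unk_threshold=2, k=0.01):
    """Train a simple n-gram model over pre-tokenized code.

    Tokens seen fewer than `unk_threshold` times are mapped to <UNK>,
    and add-k smoothing reserves probability mass for unseen n-grams.
    """
    freq = Counter(tok for stream in token_streams for tok in stream)
    vocab = {t for t, c in freq.items() if c >= unk_threshold}
    vocab |= {"<UNK>", "<s>", "</s>"}

    def norm(tok):
        return tok if tok in vocab else "<UNK>"

    ngrams, contexts = Counter(), Counter()
    for stream in token_streams:
        toks = ["<s>"] * (n - 1) + [norm(t) for t in stream] + ["</s>"]
        for i in range(n - 1, len(toks)):
            ctx = tuple(toks[i - n + 1:i])
            ngrams[ctx + (toks[i],)] += 1
            contexts[ctx] += 1

    def prob(context, token):
        # add-k smoothing: every vocabulary item receives k pseudo-counts
        ctx = tuple(norm(t) for t in context)[-(n - 1):]
        return (ngrams[ctx + (norm(token),)] + k) / (contexts[ctx] + k * len(vocab))

    return prob

# Example: a bigram model over two toy code token streams
streams = [["def", "add", "(", "a", ",", "b", ")", ":"],
           ["def", "sub", "(", "a", ",", "b", ")", ":"]]
p = build_ngram_model(streams, n=2, unk_threshold=1)
print(p(["def"], "add"))  # P(add | def) under the smoothed bigram model
```

Varying the tokenizer that produces the token streams, the smoothing scheme (add-k here, purely for illustration), the unknown threshold and n yields the different model configurations that an evaluation like the one described above compares.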
The probing of software by security testers to detect possible vulnerabilities is of primary importa...
In this paper, an extension of n-grams, called x-grams, is proposed. In this extension, the memory o...
Context: Identifying defects in code early is important. A wide range of static code metrics have be...
Natural language processing techniques, in particular n-gram models, have been applied ...
We live in a time where software is used everywhere. It is used even for creating other software by ...
Analyzing source code using computational linguistics and exploiting the linguistic properties of so...
We present a tutorial introduction to n-gram models for language modeling and survey the most widely...
The recent availability of large corpora for training N-gram language models has shown the utility o...
Natural languages like English are rich, complex, and powerful. The highly creative and gra...
It seems obvious that a successful model of natural language would incorporate a great deal of both ...
The smoothing of n-gram models is a core technique in language modelling (LM). Modified Kneser-Ney ...
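As background for the smoothing discussed above, interpolated Kneser-Ney with a single discount D can be written as follows (a standard textbook formulation, not taken from the paper); modified Kneser-Ney refines it by using separate discounts D_1, D_2 and D_{3+} chosen according to the n-gram's count.

$$
P_{\mathrm{KN}}(w_i \mid w_{i-n+1}^{i-1}) =
  \frac{\max\bigl(c(w_{i-n+1}^{i}) - D,\ 0\bigr)}{c(w_{i-n+1}^{i-1})}
  + \lambda(w_{i-n+1}^{i-1})\, P_{\mathrm{KN}}(w_i \mid w_{i-n+2}^{i-1}),
\qquad
\lambda(w_{i-n+1}^{i-1}) = \frac{D}{c(w_{i-n+1}^{i-1})}\, N_{1+}(w_{i-n+1}^{i-1}\,\bullet)
$$

where $N_{1+}(h\,\bullet)$ is the number of distinct token types observed after history $h$; the recursion bottoms out in a continuation-count unigram distribution $P_{\mathrm{cont}}(w) = N_{1+}(\bullet\, w) / N_{1+}(\bullet\,\bullet)$.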
This paper systematically investigates the generation of code explanations by Large Language Models ...
N-grams have had a great impact on the state of the art in natural language parsing. They are centra...
In this paper, an extension of n-grams is proposed. In this extension, the memory of the model (n) i...