Empirical Software Linguistics: An Investigation of Code Reviews, Recommendations and Faults

Communication is fundamental to human nature and underlies many of humanity's successes as a species. In recent decades, the adoption of increasingly abstract software languages has supported many advances in computer science and software engineering. Although distinct from natural language in many regards, software language has also proven surprisingly similar to it and has been studied successfully using natural language models. Recent studies have investigated this "naturalness" property of software in relation to a variety of applications including code completion, fault detection, and language migration. In this thesis, based on three research papers, we investigate three main aspects of software naturalness. Firstly, we investigate the relation between the perceived (un)naturalness of source code (according to a statistical language model) and the reaction of software developers to such code. In open-source projects, we find that contributions containing code that (statistically speaking) fits in less well are subject to more scrutiny from reviewers and are rejected more often. Secondly, we investigate an application of highly predictable code: code completion. Previous work had evaluated the performance of language models in this application in isolation; we compare the language model approach to a commonly used code completion engine. We find that it compares favorably, achieving substantially higher accuracy scores. In particular, a combination of the two approaches yielded the best results. Finally, we investigate instances of highly unpredictable code in order to automatically detect faults. We find that buggy lines of code are substantially less predictable and become more predictable after a bug is fixed. Our bug detection approach yields performance comparable to popular static bug finders, such as FindBugs and PMD.
Our results further confirm that the statistical (ir)regularity of source code, viewed from a natural language perspective, reflects real-world phenomena.

Subject

software linguistics
software engineering
fault detection
code completion
code review

To reference this document use:

http://resolver.tudelft.nl/uuid:ff7acb60-a3e9-4f72-9c8d-bc65398d8d6a

Embargo date

2015-08-18

Part of collection

Student theses

Document type

master thesis

Files

PDF

thesis.pdf

1.02 MB
