Empirical Software Linguistics: An Investigation of Code Reviews, Recommendations and Faults
Author: Hellendoorn, V.J.
Contributor: Bacchelli, A. (mentor)
Faculty: Electrical Engineering, Mathematics and Computer Science
Department: Computer Science
Programme: Software Engineering
Date: 2015-08-25
Abstract
Communication is fundamental to human nature and underlies many of humanity's successes as a species. In recent decades, the adoption of increasingly abstract software languages has supported many advances in computer science and software engineering. Although distinct from natural language in many regards, software language has also proven surprisingly similar to it and has been studied successfully using natural language models. Recent studies have investigated this "naturalness" property of software in relation to a variety of applications, including code completion, fault detection, and language migration. In this thesis, based on three research papers, we investigate three main aspects of software naturalness.
Firstly, we investigate the relation between the perceived (un)naturalness of source code (according to the statistical model) and the reaction of software developers to such code. In open-source projects, we find that contributions containing code that (statistically speaking) fits in less well are subject to more scrutiny from reviewers and are rejected more often.
Secondly, we investigate an application of highly predictable code: code completion. Previous work had evaluated the performance of language models in this application in isolation; we compare the language model approach to a commonly used code completion engine. We find that it compares favorably, achieving substantially higher accuracy scores. In particular, a combination of the two approaches yielded the best results.
Finally, we investigate instances of highly unpredictable code in order to automatically detect faults. We find that buggy lines of code are substantially less predictable and become more predictable once the bug is fixed. Our bug detection approach yields performance comparable to popular static bug finders, such as FindBugs and PMD. Our results further confirm that the statistical (ir)regularity of source code from a natural language perspective reflects real-world phenomena.
Subject: software linguistics, software engineering, fault detection, code completion, code review
To reference this document use: http://resolver.tudelft.nl/uuid:ff7acb60-a3e9-4f72-9c8d-bc65398d8d6a
Embargo date: 2015-08-18
Part of collection: Student theses
Document type: master thesis
Rights: (c) 2015 Hellendoorn, V.J.
Files: thesis.pdf (PDF, 1.02 MB)