Date(s) - 03/09/2018
8:30 am - 5:30 pm
The topic of the CHOOSE forum 2018 is Software Engineering and Machine Learning.
To seed discussions, we sampled the space with six high-profile talks from both academia and industry. Confirmed speakers for this year are: Earl T. Barr, Georgios Gousios, Marc Brockschmidt, Veselin Raychev, Maxim Podkolzine, and Prem Devanbu.
|08:30 – 08:45||Registration|
|08:45 – 09:00||Welcome and introduction|
|09:00 – 09:45||“Why is Software Natural?” (Prof. Dr. Prem Devanbu, University of California, Davis, USA)|
|09:45 – 10:00||Short Break|
|10:00 – 10:45||“Big Data in Software Engineering” (Dr. Georgios Gousios, Delft University of Technology, The Netherlands)|
|10:45 – 11:15||Coffee Break|
|11:15 – 12:00||“More Productive Software Engineers Through Deep Learning” (Dr. Marc Brockschmidt, Microsoft Research, Cambridge, UK)|
|12:00 – 12:20||CHOOSE General Assembly|
|12:20 – 13:30||Lunch|
|13:30 – 14:15||“Machine Learning at JetBrains” (Maxim Podkolzine, JetBrains, Munich, Germany)|
|14:15 – 14:30||Short Break|
|14:30 – 15:15||“Finding Program Vulnerabilities by Learning from Code Changes” (Dr. Veselin Raychev, DeepCode.ai, Switzerland)|
|15:15 – 15:45||Coffee Break|
|15:45 – 16:30||“Bimodal Software Engineering” (Dr. Earl T. Barr, University College London, UK)|
|16:30 – 17:15||Panel|
|17:15 – 17:30||Closing|
Earl T. Barr
Bimodal Software Engineering
Source code is bimodal: it combines a formal algorithmic channel and a natural language channel of identifiers and comments. To date, most work has focused exclusively on a single channel. This is a missed opportunity because the two channels interact: the natural language often explains or summarizes the algorithmic channel, so information in one channel can be used to improve analyses of the other channel. A canonical bimodal fact is an identifier named “secret” (NL channel) being printed to the console (AL channel). To exploit such bimodal facts, one must overcome two challenges: find cross-channel synchronisation points and handle noise in the form of ambiguity in the NL channel and imprecision in the AL channel. Thus, bimodality is a natural fit for machine learning. I will present RefiNym, a bimodal analysis that models code with *name-flows*, a dataflow graph augmented to track identifier names. Conceptual types are logically different types that do not always coincide with program types. Passwords and URLs are example conceptual types that can share the program type String. RefiNym is an unsupervised method that mines a lattice of conceptual types from name-flows and reifies those conceptual types into distinct nominal types. For the String type, we show that RefiNym reduces co-occurrence of disparate conceptual types in the same scope by 42%, thereby making it harder for a developer to inadvertently introduce an unintended flow.
Earl T. Barr is a senior lecturer (associate professor) at University College London. He received his Ph.D. at UC Davis in 2009. Earl’s research interests include bimodal software engineering, testing and analysis, and computer security. His recent work focuses on automated software transplantation, applying game theory to software process, and using machine learning to solve programming problems. Earl dodges vans and taxis on his bike commute in London.
Georgios Gousios
Big Data in Software Engineering
The (not so) secret sauce that led to the explosive developments in Machine Learning we are currently witnessing is the availability of vast, organized datasets. In Software Engineering, perhaps the most well-known one is GHTorrent. GHTorrent monitors the GitHub event timeline, collects all data offered through its API, and offers it in various formats to researchers. Hundreds of researchers are using it both as an index of all things happening on GitHub and as a source of data to apply Machine/Deep Learning on.
In our talk, we present the workings and evolution of GHTorrent along with selected examples of how GHTorrent was used as a data source for ML research.
Georgios Gousios is an assistant professor at the Software Engineering Group, Delft University of Technology, where he leads the group’s Software Analytics lab. He works in the fields of software ecosystems, software testing, distributed software development processes and research infrastructures. His research has been published in top venues (ICSE, FSE, TSE), where he has received four distinguished paper awards. He is the main author of the GHTorrent data collection and curation framework (for which he received a foundational contribution award), the Alitheia Core repository mining platform and various widely used tools and datasets. Currently, he is leading the Unified Call Graph project, which analyzes package repositories at a fine level of detail, and the CodeFeedr project, which delivers real-time software analytics. Dr. Gousios holds a PhD with distinction in Software Engineering (mining software repositories) from the Athens University of Economics and Business (AUEB) and an MSc with distinction (software engineering) from the University of Manchester.
Marc Brockschmidt
More Productive Software Engineers Through Deep Learning
Deep Learning has been the crucial step forward in perceptual tasks such as the understanding of images, speech and natural language. So far, it has had less impact on the practice of Software Engineering, where it is competing with a wide variety of mature, existing methods based on logic and deduction. I will discuss how these two worlds can be combined, as well as present first results that can show the way towards practical tools. Finally, I will report on user experience problems arising from the use of Deep Learning in Software Engineering tools.
Marc Brockschmidt is a Senior Researcher in the Machine Intelligence group at Microsoft Research Cambridge (UK). He obtained his PhD studying formal methods that can automatically prove termination of Java programs. Surprisingly, that worked substantially less well than manually proving termination of Java programs. He thus moved on to study how computers can learn the skills that make humans better at programming than machines.
Veselin Raychev
Finding Program Vulnerabilities by Learning from Code Changes
I will present a new machine learning approach for finding code vulnerabilities. The key idea is to learn program analysis rules from large datasets of program changes (e.g., as found on GitHub), technically achieved via a careful balance of the right semantic program abstractions with clustering-based methods (e.g., hierarchical clustering). I will show several analysis rules our learning system discovered automatically, some of which were missed by existing state-of-the-art checkers. I will also briefly discuss a more elaborate version of this approach as developed by DeepCode (deepcode.ai), a startup which built the first AI-based code review system and whose product is also available online.
Veselin Raychev is CTO of DeepCode AG, an ETH spin-off that builds an AI-based code review system. He obtained his PhD from ETH Zurich in 2016 on the topic of “Learning from Large Codebases”. Veselin’s recent work focuses on doing precise program analysis that scales to huge codebases and on using such analysis to learn how open source projects fix defects and improve their code.
Maxim Podkolzine
Machine Learning at JetBrains
JetBrains is one of the world’s leading software tools companies, traditionally relying on strict computer science algorithms to analyze code. In this talk, I’ll share how we started to apply Machine Learning in various contexts and what challenges we faced with that.
Maxim Podkolzine is a senior software engineer at JetBrains, where he has been building “intelligent” tools for developers for 10 years. He is a machine learning enthusiast.
Prem Devanbu
Why is Software Natural?
Sometime during the summer of 2011, several of us at UC Davis were quite puzzled and shocked to discover that software is “natural”, viz., repetitive and predictable, as are natural language corpora; in fact, much, much more so! By now, this early experiment has been replicated many times, in many ways, and various applications of naturalness have been developed. But why is this? Is it just because of programming language syntax? Or is it due to something else, like conscious programmer choice? How can we study this question? Are there other “natural” corpora (other than software) that are similar to software? While these questions are purely scientific, investigating them can and will lead to actionable insights.
Prem Devanbu received his B.Tech from the Indian Institute of Technology in Chennai, India, before you were born, and his PhD from Rutgers University in 1994. After spending nearly two decades at Bell Labs and its various offshoots, he escaped New Jersey traffic to join the CS faculty at UC Davis in late 1997. For about 15 years now, he has been working on ways to exploit the copious amounts of available open-source project data to bring more joy, meaning, and fulfillment to the lives of programmers.
The registration to the CHOOSE Forum 2018 can be completed here.