Random notes on software, programming and languages.
By Adrian Kuhn

Archive for the ‘Research’ Category

Sneak Peek: Unit Test Dependencies


Monday, September 7th, 2009

Here is a preview of a recently submitted paper on unit test dependencies. Besides the main contribution, a case study on Lea’s automatic API migration and improved defect localization with JExample, the paper presents a survey of over 2,500 open source projects. We used the database of the Sourcerer code search engine (by Sushil Bajracharya and Joel Ossher) to analyse how unit testing frameworks are used and extended. As an appetizer: every second project has no test suite, every fourth test suite uses mock objects, and every tenth test suite uses third-party extensions.

[Figure: jexample-sneak-peak]

The preview was created with Wordle, an idea that I owe to Tom Zimmermann.

On why Software follows a Power Law


Wednesday, May 20th, 2009

There is a growing body of research showing that software (vocabulary) follows a power law [Baxter 06, Concas 07, Louridas 08, Zhang 08, Veldhuizen 05, Linstead 09, Pierret 09, etc.]. But little attention has been paid to why we find a power law distribution of word frequencies in software.

For those unfamiliar with the notion of a power law: if we say that software vocabulary follows a power law, we mean that terms are not evenly distributed over program code. A few terms are very frequent, whereas most terms are rare. This is basically the same as the 80-20 rule of the Pareto principle. We find the same kind of distribution almost everywhere in nature and social interaction. For example, take the size of towns in a country, the distribution of wealth, or the number of friends on Twitter.

How can we explain power law?

Luckily, there is a very simple model that explains the occurrence of both power law and normal distributions. Imagine a huge set of nodes. Now let’s start connecting the nodes with each other…

  • If we do so by randomly picking two nodes and connecting them, we end up with a network where the number of connections per node follows a bell-shaped, roughly normal distribution.
  • If we do so by preferring nodes with many connections over nodes with few connections, we end up with a network where the number of connections per node follows a power law distribution.

That is, the function to pick nodes is no longer uniformly random but weighted by the already established connections. This is also known as the “rich get richer” principle.
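The two ways of connecting nodes can be tried out in a few lines of Python. This is only a minimal sketch under stated assumptions: the preferential variant grows the network one node at a time (in the style of the Barabási–Albert model) instead of starting from a fixed set of nodes, and the function names and sizes are made up for illustration.

```python
import random


def random_attachment(n_nodes, n_edges, rng):
    """Connect two uniformly chosen distinct nodes, n_edges times.

    Returns a list with the number of connections (degree) of each node.
    """
    degree = [0] * n_nodes
    for _ in range(n_edges):
        a, b = rng.sample(range(n_nodes), 2)
        degree[a] += 1
        degree[b] += 1
    return degree


def preferential_attachment(n_nodes, rng):
    """Grow a network where each new node links to one existing node,
    chosen with probability proportional to that node's current degree
    ("the rich get richer"). Assumption: growth-based variant.
    """
    degree = [1, 1]      # start with two nodes joined by one edge
    endpoints = [0, 1]   # one entry per edge endpoint, so a uniform pick
                         # from this list is a degree-weighted pick of a node
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        degree.append(1)
        degree[target] += 1
        endpoints.extend([new, target])
    return degree


if __name__ == "__main__":
    rng = random.Random(2009)
    uniform = random_attachment(2000, 1999, rng)
    preferential = preferential_attachment(2000, rng)
    # Both networks have the same number of edges, but the preferential one
    # typically contains a few hubs with far more connections.
    print("max degree, uniform picks:     ", max(uniform))
    print("max degree, preferential picks:", max(preferential))
```

Plotting a histogram of the two degree lists on a log-log scale would be the natural next step to see the difference in the tails.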

In the process of writing software, it thus seems, developers are more likely to use components (choose names, call methods, etc…) that other developers preferred before. In some cases this is easy to explain. For example, it is obvious that general-purpose classes such as String are more likely to be used than highly specialized classes such as WestinBayshoreHotelLobby or PackageExplorerView. In other cases, the reason might be less obvious. As usual, answering one question raises many more questions.

If you would like to learn more about power law distributions in networks, I highly recommend reading “Linked: The New Science of Networks” by Barabási. It’s a fun and inspiring read (at least the first third, I never finished the book). Thanks go to Jacek Jonczy of the RUN group, who recommended the book to me about a year ago.

Who should fix that Bug?


Tuesday, April 14th, 2009

Dominique Matter’s paper got accepted at MSR 09, congratulations!

For his Master’s thesis, Dominique developed an approach to automatically find developers who have the appropriate expertise for solving a bug. The novel contribution of his work is that no prior bug reports are required to train the recommendation system. His approach models the expertise of developers using their source code vocabulary. He collects the word frequencies found in SVN or CVS diffs to model the expertise of a contributor.
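To make the idea concrete, here is a minimal sketch of a vocabulary-based recommender. It assumes plain word counts and cosine similarity, which is an illustrative choice and not necessarily the measure used in the paper; all names and the sample data are hypothetical.

```python
import math
import re
from collections import Counter


def vocabulary(text):
    """Word frequencies of a text; camelCase identifiers are split into words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return Counter(word.lower() for word in re.findall(r"[A-Za-z]+", spaced))


def cosine(a, b):
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_developers(diffs_by_developer, bug_report):
    """Rank developers by how well their diff vocabulary matches the report."""
    report = vocabulary(bug_report)
    scores = {
        developer: cosine(vocabulary(" ".join(diffs)), report)
        for developer, diffs in diffs_by_developer.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical diffs, standing in for what would be mined from SVN or CVS.
    diffs = {
        "alice": ["void renderTooltip(Widget w) { drawBorder(); }"],
        "bob": ["Connection openSocket(String host) { retryTimeout(); }"],
    }
    print(rank_developers(diffs, "Tooltip border is not drawn when widget is hidden"))
```

Note that no prior bug reports are needed here: the only training data is the source code each developer has contributed.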

Weighted list (aka tag cloud) created with Wordle:

[Figure: msr09-matter-kuhn-nierstrasz-assigning-bug-reports-using-a-vocabulary-based-expertise-model-of-developers]

Title: Assigning Bug Reports using a Vocabulary-Based Expertise Model of Developers

Abstract: For popular software systems, the number of daily submitted bug reports is high. Triaging these incoming reports is a time consuming task. Part of the bug triage is the assignment of a report to a developer with the appropriate expertise. In this paper, we present an approach to automatically suggest developers who have the appropriate expertise for handling a bug report. We model developer expertise using the vocabulary found in their source code contributions and compare this vocabulary to the vocabulary of bug reports. We evaluate our approach by comparing the suggested experts to the persons who eventually worked on the bug. Using eight years of Eclipse development as a case study, we achieve 33.6% top-1 precision and 71.0% top-10 recall.

Download the full paper (PDF).

What is Context Anyway?


Wednesday, March 11th, 2009

I just had a heated debate with Toon Verwaest about applications of context-aware programming languages. Given his background with AmbientTalk, his views are rather different from mine. To him, context includes availability and proximity of mobile devices, whereas I am more focused on programming-language specific context. In this post, I’d like to sort out the views and thoughts that we discussed. In particular, I’ll try to identify different kinds of context with regard to message dispatch.

  • The first distinction is between context of the sender and context of the receiver. In the first case, the behavior of an object depends on our view of it. In the second case, the behavior depends on the object itself. A special case of receiver (or is it sender?) context is multiple dispatch, that is, context that depends on the arguments of the sent message.
  • The next distinction is between programming-specific context and external context. External context depends neither on the sender nor on any other construct of the programming language. Examples of external context are the weather, the current room temperature, or the availability of external resources such as the wireless telephone network or the printer next door. External context often is either categorical or numerical, that is, it is either modeled by an enumeration of states or by some numerical value.
  • I am unsure whether spatial context deserves a distinction of its own. For example, for robots on a football field, proximity to the other players and the goals is context as well (taken from Collaborative Confusion, via HOP). Clearly, external context is a generalization of spatial context, but there may be good reasons to treat spatiality as a context of its own. Geometry and metric distance set spatial context apart from other external contexts. Actually, positions in any metric space can be used to define spatial proximity.
  • Programming-specific context is obviously further subdivided into lexical and dynamic context. Lexical context is typically given by the location in the source code. More generally, it is given by the location within a program’s structure. Dynamic context, on the other hand, is given by the flow of execution. Technically, dynamic context often boils down to the context given by the call stack.
  • However, lexical and dynamic context are not the only programming-specific context. Considering Adrian Lienhard’s theory of aliases we can dispatch messages based on the object flow. For example, the reference returned by a collection’s #get method could understand the context-aware method #next which yields the next object within the aforementioned collection. This approach may take us beyond a mere sender-receiver distinction to behavior that depends on the origin of a reference.
  • Other programming-specific contexts are composition and state. For example, messages sent to a collection could be dispatched to different methods depending on the type of the collection’s elements, which is what we do in Swarm Behavior. In the same way, we can imagine behavior that depends on the state of the sender. Given the popularity of both the State and the Strategy pattern, it seems useful to bring this kind of context closer to the language.
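Of the kinds of context listed above, dynamic context is probably the easiest to sketch in code. The following is a rough illustration only, assuming a Python context variable as a stand-in for context carried by the call stack; the layer name, the Person class, and the helper functions are all hypothetical and not taken from any of the systems mentioned here.

```python
import contextvars
from contextlib import contextmanager

# Holds the name of the layer active on the current flow of execution.
_active_layer = contextvars.ContextVar("active_layer", default=None)


@contextmanager
def with_layer(name):
    """Activate a layer for everything called inside the with-block."""
    token = _active_layer.set(name)
    try:
        yield
    finally:
        _active_layer.reset(token)  # restore whatever was active before


class Person:
    def __init__(self, name, address):
        self.name = name
        self.address = address

    def describe(self):
        # Dispatch on dynamic context: the result depends on which layer
        # is active, not on where this method appears in the source code.
        if _active_layer.get() == "address":
            return f"{self.name}, {self.address}"
        return self.name


def report(person):
    # report() knows nothing about layers; the context simply flows
    # through it to the call of describe().
    return person.describe()


if __name__ == "__main__":
    ada = Person("Ada", "Bern")
    print(report(ada))              # outside any layer
    with with_layer("address"):
        print(report(ada))          # same call, different dynamic context
```

Lexical context would instead be decided by where the call to describe() is written, for example by the module or class that contains it, regardless of who called that code.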

Enough brain dump; I am now off to read the Context-oriented Programming paper by Robert Hirschfeld, Pascal Costanza, and Oscar Nierstrasz. And as you can imagine, the discussion with Toon is already continuing by chat…

TermMap of OOPSLA


Tuesday, October 21st, 2008

While browsing the proceedings of this year’s OOPSLA, I thought, hey, let’s create a themescape of the proceedings. So I fired up SoftwareCartographer and created a “code map” of all PDFs found on the CD. Normally I use SoftwareCartographer to analyse the vocabulary of software systems, but since it operates on vocabulary only, it can be applied to normal text files as well.

But before we dive into software cartography, here is the word cloud of all proceedings documents:

Obviously, there are many Java programmers fighting with their type systems at OOPSLA. In the cloud above, the terms are weighted by number of occurrences in the proceedings documents. I guess in a cloud weighted by fun, Smalltalk Superpowers and Animal Verbing would show up the largest.

In the picture above, we see the “CodeMap” of OOPSLA together with the word cloud of selected papers (click for larger version). CodeMap is a visualization that shows source code files (here PDF files) and how similar they are in terms of vocabulary [WCRE 2008]. Each file is rendered as a hill, with the file size used as the hill’s height. The location of the files reflects topical similarity: files that use the same vocabulary are close to each other, files that use different vocabulary are far apart.

SoftwareCartographer is written in Smalltalk; if you have VW installed, you can download the WCRE demo and apply it to your own software systems or conference papers. SoftwareCartographer uses Hapax and Pimon, but not Moose.
