Twitter icon.
Random notes on software, programming and languages.
By Adrian Kuhn

Exploring the Layout of Software Maps

In this post I will cover my current work on Software Cartography. If you are unfamiliar with spatial-representation of software please refer to “A Software cartographer’s Vision” previously featured on this blog.

Codemap is available for download as Eclipse plug-in.

A major issue of software cartography is the base layout of software maps. A good base layout is both stable over time and conveys a meaningful grouping of software artifacts into islands (ie clusters). My initial attempt was to based the layout on vocabulary found in the source code, that is identifier names and comments. Vocabulary has the nice property that it is more stable over time than structure, and that it naturally conveys a meaningful clustering of latent topics.

However a first user study has shown that a vocabulary-based layout does not meet the developers intuition. Even though developers had been well aware that the layout was based on vocabulary, they took conclusions that assumed a structure-based layout. We learnt this from the think-aloud protocol used in the user study.

The user study included six developers that each had 1.5 hours time to explore an unknown software system. They were given six tasks of increasing complexity—cumulating in fixing an actual bug.

I will cover the user study in more details in a future blog post. In this post I shall discuss some proposals for alternative layouts…

* * *

Package-based layout — is typically the first layout that comes to people’s mind when they hear about software cartography. In fact there is a rich set of previous work that uses packag-based layout. For example, Codecity uses a treemap layout to visualize software systems based in their package structure.

A treemap layout makes good use of screen space, and using packages is likely to convey a meaningful clustering (ignoring for the moment that Java’s package nesting is a mere naming convention and bears no language semantics what so ever.)

However, packaging layouts are not stable in the face of change. They are gap-less and thus major parts of the map have to be moved aside in order to make space for new elements (and vice-versa for disappearing elements). Codecity works around this limitation by offering an after-the-fact analysis mode where the layout of past snapshots anticipates the latest state of the system. However, this is only applicable to post-mortem analysis but not to tools that are embedded in a development environment with live changes.

Callgraph-based layout — is the other common layout of software visualization. Static analysis is used to find the call-relations between software artifacts, and then standard force-based graph drawing is applied visualize to the system. Force-based graph drawing adapt well to change, however this is canceled by call-relations having a very high change frequency.

Call-relations are well understood by developers and possibly quite close to developers’s intuitive understanding of distance in software systems. Personally however, I would prefer if the map were based on a more abstract distance measurement.

For example, It would be desirable if call-relations that are displayed on the map had a meaningful interpretation. A long-distance-call should have a diagnostic interpretation. Given a layout based on call-graphs however, a long-distance-call would just indicate failure of the force-based layout to find a solution where all calls have the same length.

Law-of-demeter layout — the “Law of Demeter” is a guideline in software design. It states that each method should only talk to its friends, which are defined as its class’s fields, its local variables and its method arguments. Based on this we can defined an idealized call-based distance between software artifacts.

Given a LOD-based layout, software artifacts are close to one another if they are supposed to call one another and far apart if they better oughta not call one another. Thus we get the desired property that visualizing call-graphs conveys meaningful arrow distances. And also, compared to a raw call-graph, a LOD-based graph is less connected and thus better suited for graph drawing.

Best fo all: on a LOD-based map, any long-distance-call has a diagnostic interpretation that helps developers to take actions: long calls possibly violates the “Law of Demeter”!

History-driven layout — Claus Lewerentz and Frank Steinbrückner developed a layout for software cities that is based on historical data. They start out with a central plaza and then each new packages branches off as a new street, and each new class is a building along these streets. This generates visually awesome and intuitively stable software maps.

Their work has been (first?) presented at the recent MSA 2010 workshop and will be published soon. I will cover their work in more detail when their publication appears. For the moment please refer to this slide deck of Frank Steinbrückner’s MSA presentation

Test-dependency-based layout — despite Agile claims, unit tests do depend on one another. We can record this dependencies using dynamic instrumentation and profiling, and establish a partial-order of unit tests. Based on that partial-order we can do a radial tree layout of tests and place each software artifact closest to their corresponding tests (kernel method).

Test-dependency-based layout is stable, since changes to test code are typically less frequent than changes to the code under tests. In particular, changes to the dependencies between tests.

Also, a test-dependency-based layout has a clear diagnostic interpretation. Along any radial axis, software artifact on the inside provide services to their clients on the outside. Thus the map’s layout is like a cell, with API interfaces on the outside and a basic kernel at the core.

* * *

An important criterion, that has not yet been discussed, is “ease of retrieval” (there might be proper term in data mining for that). What I mean is availability and accessibility of required data. For example, vocabulary data is always available and easy to access (you don’t even need to parse the source code) while historical data and dynamic instrumentation are often either not available or not easy accessible. So far, I’ve thus only implement vocabulary-based layout (featured in the current download of Codemap) and a prototype of law-of-demeter-based layout.

In this post I’ve only scratched the surface of possible map layouts and their design space. For example, Niko Schwarz took inspiration from Richard Dawkins and proposed code hospitality as a foundation for software maps. Code hospitality is a measure for how likely snippets of one class are to run when copied into another class.

If you’ve got your own crazy ideas, tweet or blog about it!

I’ll collect all pingbacks in a future blog post.

2 Responses to “Exploring the Layout of Software Maps”

  1. Niko Schwarz Says:

    I think the history-driven layout is pretty good. And c’mon, for plenty of projects the version history is right there in a cleanable repository. But their city-construction metaphor begs the question: what do you do on large merges? Those become more and more common. I have a hunch that they’re playing with data where the changes come in keystroke by keystroke, but that doesn’t need to be the way change comes to you.

  2. akuhn Says:

    @niko they showed a movie with like 20 snapshots from the entire version history of a system at #msa2010. As far I understood though, their visualization tool is embedded in a nightly build system. So no live stream of events as with Codemap or Codecanvas.

For.example is Digg proof thanks to caching by WP Super Cache!