Wednesday, May 20th, 2009
There is a growing body of research showing that software vocabulary follows a power law [Baxter 06, Concas 07, Louridas 08, Zhang 08, Veldhuizen 05, Linstead 09, Pierret 09, etc]. But little attention has been paid to why we find a power law distribution of word frequencies in software.
For those unfamiliar with the notion of a power law: if we say that software vocabulary follows a power law, we mean that terms are not evenly distributed over program code. A few terms are very frequent, whereas most terms are rare. This is basically the same as the 80-20 rule of the Pareto principle. We find the same kind of distribution almost everywhere in nature and social interaction. For example, take the size of towns in a country, the distribution of wealth, or the number of friends on Twitter.
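To make "a few terms are very frequent, most are rare" concrete, here is a minimal sketch that counts identifier-like terms in a piece of source text and reports how much of the total the most frequent 20% of terms account for. The function names `term_frequencies` and `top_share`, the tokenizing regex, and the sample text are my own assumptions for illustration, not something taken from the papers cited above.

```python
import re
from collections import Counter

def term_frequencies(source: str) -> Counter:
    """Count how often each identifier-like term occurs in the text (case-insensitive)."""
    # Assumption: a "term" is a letter or underscore followed by letters, digits or underscores.
    terms = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    return Counter(term.lower() for term in terms)

def top_share(freqs: Counter, fraction: float = 0.2) -> float:
    """Share of all occurrences covered by the most frequent `fraction` of distinct terms."""
    counts = sorted(freqs.values(), reverse=True)
    if not counts:
        return 0.0
    k = max(1, int(len(counts) * fraction))
    return sum(counts[:k]) / sum(counts)

# Hypothetical sample; in practice you would read real source files instead.
sample = """
String name = user.getName();
String title = user.getTitle();
String label = name + title;
"""
freqs = term_frequencies(sample)
print(freqs.most_common(3))
print(f"top 20% of terms cover {top_share(freqs):.0%} of all occurrences")
```

On a real code base, a share far above 20% would be in line with the skewed distribution described above, though a single number like this does not by itself prove a power law.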
How can we explain power laws?
Luckily, there is a very simple model that explains the occurrence of both power law and normal distributions. Imagine a huge set of nodes. Now let’s start connecting the nodes with each other…
- If we do so by randomly picking two nodes and connecting them, we end up with a network where the number of connections per node follows a roughly normal (bell-shaped) distribution.
- If we do so by preferring nodes with many connections over nodes with few connections, we end up with a network where the number of connections per node follows a power law distribution.
That is, in the second case the function to pick nodes is no longer purely random but a function of the already established connections. This is also known as the “rich get richer” principle.
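The two ways of connecting nodes can be tried out in a few lines. The following is only a rough sketch under my own assumptions: the function name `grow_network`, the network sizes, and the choice to pick one endpoint uniformly and the other in proportion to its current number of connections are all made up for illustration and are not the exact model from the book.

```python
import random

def grow_network(num_nodes: int, num_edges: int, preferential: bool, seed: int = 0) -> list[int]:
    """Add edges to a fixed set of nodes and return the number of connections per node."""
    rng = random.Random(seed)
    degree = [0] * num_nodes
    # One entry per edge endpoint: a uniform pick from this list selects a node
    # with probability proportional to its current number of connections.
    endpoints: list[int] = []
    for _ in range(num_edges):
        a = rng.randrange(num_nodes)  # first endpoint: always uniform
        if preferential and endpoints:
            b = rng.choice(endpoints)  # "rich get richer"
        else:
            b = rng.randrange(num_nodes)  # purely random
        # Note: a == b is allowed here; such a self-loop counts twice for that node.
        degree[a] += 1
        degree[b] += 1
        endpoints.extend((a, b))
    return degree

random_degrees = grow_network(2000, 10000, preferential=False)
rich_degrees = grow_network(2000, 10000, preferential=True)
print("max connections, random picking:      ", max(random_degrees))
print("max connections, preferential picking:", max(rich_degrees))
```

Both networks have the same average number of connections per node. With random picking the values should cluster around that average, whereas with preferential picking a handful of hubs should collect far more connections than everyone else.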
In the process of writing software, it thus seems, developers are more likely to use components (choosing names, calling methods, etc.) that other developers preferred before. In some cases this is easy to explain. For example, it is obvious that general-purpose classes such as String are more likely to be used than highly specialized classes such as WestinBayshoreHotelLobby or PackageExplorerView. In other cases, the reason might be less obvious. As usual, answering one question raises many more questions.
If you would like to learn more about power law distributions in networks, I highly recommend reading “Linked: The New Science of Networks” by Barabási. It’s a fun and inspiring read (at least the first third, I never finished the book). Thanks go to Jacek Jonczy of the RUN group, who recommended that book to me in the first place about a year ago.