Community-cation: A Look Back at Week 43, 2008

The whitepaper "Estimating the Total Development Cost of a Linux Distribution," published this week, was a really fun collaborative effort. It basically boiled down to Ron Hale-Evans analyzing the Fedora code, me scoping out the kernel code, and Amanda McPherson pulling it all together into a coherent document.

A couple of good things came out of this project for me, besides the paper itself. The first was a very intensive crash course in command-line syntax and scripting. After a number of years working with GUI tools, getting into some serious CLI action was a welcome shift.

The second, and more important to the general public, was the accumulation of a ton of data. SLOCCount pushes out a lot of results, and with a little scripting, you can easily pull the data together into nice reports. Digging through all of this information was really a look into the history of Linux, which made my assignment even more enjoyable. It even got to the point where Amanda, my boss, had to gently but firmly tell me that the paper was done, and it didn't need any more information.

But, she added, that did not preclude me from reporting on it in this forum. Over the next few weeks, I'll be publishing some more interesting facts from the research done for the whitepaper. For today, I wanted to share the data from SLOCCount that deals with the languages used in Linux development.

When SLOCCount generates a report, it analyzes the language of the lines of source code it's counting. This is a pretty straightforward report, but we agreed that including it in the whitepaper (along with the top 10 packages data in the Appendix) might be overkill. The table below shows the number of lines by language found in the complete Fedora 9 package set.

Language   Lines of Code   Percentage
ansic          105146849       52.86%
cpp             50850770       25.56%
java            10769952        5.41%
perl             5142060        2.59%
lisp             4644173        2.33%
sh               3887886        1.95%
cs               3295714        1.66%
fortran          2353071        1.18%
pascal           2280308        1.15%
php              2250944        1.13%
asm              1849326        0.93%
ada              1475653        0.74%
tcl              1268139        0.64%
python            666467        0.34%
ml                597753        0.30%
ruby              568669        0.29%
yacc              508982        0.26%
haskell           405920        0.20%
exp               321405        0.16%
objc              196274        0.10%
lex               190273        0.10%
f90                86924        0.04%
jsp                56403        0.03%
awk                47775        0.02%
csh                33996        0.02%
sed                20341        0.01%
modula3               33        0.00%
cobol                 27        0.00%

As you can see, the bulk of the code is written in either C (52.86%) or C++ (25.56%), which together account for a little over 78% of the total. Every other language falls into single-digit percentages, with Java, Perl, and Lisp rounding out the top 5. This is even more clearly illustrated in the graph shown in the next figure.
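For anyone who wants to double-check the percentage column, here is a minimal Python sketch that recomputes each language's share from the raw line counts in the Fedora 9 table. This is not the scripting used for the whitepaper; the names `FEDORA_ROWS`, `parse_rows`, `language_shares`, and `top_languages` are my own placeholders, and the only input is the table text as it appears in this post.

```python
# A sketch only: recompute the percentage column from raw line counts.
# FEDORA_ROWS copies the Fedora 9 table from this post, one row per line
# in the form "language count percent". All names here are hypothetical.
FEDORA_ROWS = """\
Language Lines of Code Percentage
ansic 105146849 52.86%
cpp 50850770 25.56%
java 10769952 5.41%
perl 5142060 2.59%
lisp 4644173 2.33%
sh 3887886 1.95%
cs 3295714 1.66%
fortran 2353071 1.18%
pascal 2280308 1.15%
php 2250944 1.13%
asm 1849326 0.93%
ada 1475653 0.74%
tcl 1268139 0.64%
python 666467 0.34%
ml 597753 0.30%
ruby 568669 0.29%
yacc 508982 0.26%
haskell 405920 0.20%
exp 321405 0.16%
objc 196274 0.10%
lex 190273 0.10%
f90 86924 0.04%
jsp 56403 0.03%
awk 47775 0.02%
csh 33996 0.02%
sed 20341 0.01%
modula3 33 0.00%
cobol 27 0.00%
"""


def parse_rows(text):
    """Return {language: line_count} from whitespace-separated table rows."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (its second field is not a number).
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        counts[fields[0]] = int(fields[1])
    return counts


def language_shares(counts):
    """Return {language: percent of all counted lines}."""
    total = sum(counts.values())
    return {lang: 100.0 * n / total for lang, n in counts.items()}


def top_languages(counts, n=5):
    """Return the n languages with the most lines, largest first."""
    return sorted(counts, key=counts.get, reverse=True)[:n]


counts = parse_rows(FEDORA_ROWS)
shares = language_shares(counts)
for lang in top_languages(counts):
    print(f"{lang:<10}{counts[lang]:>12}{shares[lang]:>8.2f}%")
```

Because `parse_rows` only splits on whitespace, pasting in the kernel table in place of `FEDORA_ROWS` should work the same way.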

Compare this to the kernel itself, which has an even more dominant C presence:

Language   Lines of Code   Percentage
ansic            6466319       95.47%
asm               236826        3.50%
python             45928        0.68%
sh                  7511        0.11%
perl                6577        0.10%
cpp                 3962        0.06%
yacc                2901        0.04%
lex                 1824        0.03%
objc                 613        0.01%
lisp                 218        0.00%
pascal               116        0.00%
awk                   96        0.00%
sed                   11        0.00%

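Obviously, C is still the dominant part of the Linux development landscape, both within the kernel and without. C++ is a different story: it is the clear number two across the distribution, but at 0.06% it barely registers in the kernel, where assembly takes second place.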
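To put numbers on the kernel-versus-distribution comparison, here is a small Python sketch that lines up the two percentage columns and reports the gap in percentage points. The percentages are copied from the two tables in this post for a handful of languages that appear in both; `share_gaps` and the dictionary names are hypothetical helpers of mine, not anything SLOCCount provides.

```python
# A sketch only: compare language shares between the kernel and Fedora 9.
# Percentages are copied from the two tables in this post (SLOCCount labels).
FEDORA_PCT = {"ansic": 52.86, "cpp": 25.56, "perl": 2.59,
              "sh": 1.95, "asm": 0.93, "python": 0.34}
KERNEL_PCT = {"ansic": 95.47, "cpp": 0.06, "perl": 0.10,
              "sh": 0.11, "asm": 3.50, "python": 0.68}


def share_gaps(kernel, distro):
    """Return kernel share minus distro share, in percentage points,
    for every language present in both dictionaries."""
    return {lang: round(kernel[lang] - distro[lang], 2)
            for lang in kernel if lang in distro}


gaps = share_gaps(KERNEL_PCT, FEDORA_PCT)
# Print the languages the kernel leans on most heavily first.
for lang, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<8}{gap:+7.2f}")
```

A positive gap means the language makes up a larger share of the kernel than of the distribution as a whole; a negative gap means the opposite.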

What surprised me was that, at the distribution level, Java came in third among the languages, ahead of something like Perl or Python. It seems indicative of how much progress Java has made at the application level on Linux. But other interpretations are certainly welcome!

Copyright © 2008 Linux Foundation. All rights reserved.
LSB is a trademark of the Linux Foundation. Linux is a registered trademark of Linus Torvalds