Community-cation: A Look Back at Week 43, 2008
The whitepaper "Estimating the Total Development Cost of a Linux Distribution" published this week was a really fun bit of collaborative effort. It basically boiled down to Ron Hale-Evans analyzing the Fedora code, me scoping out the kernel code, and Amanda McPherson pulling it all together into a coherent document.
There were a couple of good things that came out of this project for me, besides the result of the paper itself. The first was a very intensive crash course on command line syntax and scripting. After a number of years working with GUI tools, getting into some serious CLI action was a welcome shift.
The second, and more important to the general public, was the accumulation of a ton of data. SLOCCount pushes out a lot of results, and with a little scripting, you can easily pull the data together into nice reports. Digging through all of this information was really a look into the history of Linux, which made my assignment even more enjoyable. It even got to the point where Amanda, my boss, had to gently but firmly tell me that the paper was done, and it didn't need any more information.
But, she added, that did not preclude me from reporting on it in this forum. Over the next few weeks, I'll be publishing some more interesting facts from the research done on the Estimating Kernel Costs whitepaper. For today, I wanted to share the data from SLOCCount that deals with the languages used in Linux development.
When SLOCCount generates a report, it analyzes the language of the lines of source code it's counting. This is a pretty straightforward report, but we agreed that including it in the whitepaper (along with the top 10 packages data in the Appendix) might be overkill. The table below shows the number of lines by language found in the complete Fedora 9 package set.
| Language | Lines of Code | Percentage |
| --- | --- | --- |
| ansic | 105146849 | 52.86% |
| cpp | 50850770 | 25.56% |
| java | 10769952 | 5.41% |
| perl | 5142060 | 2.59% |
| lisp | 4644173 | 2.33% |
| sh | 3887886 | 1.95% |
| cs | 3295714 | 1.66% |
| fortran | 2353071 | 1.18% |
| pascal | 2280308 | 1.15% |
| php | 2250944 | 1.13% |
| asm | 1849326 | 0.93% |
| ada | 1475653 | 0.74% |
| tcl | 1268139 | 0.64% |
| python | 666467 | 0.34% |
| ml | 597753 | 0.30% |
| ruby | 568669 | 0.29% |
| yacc | 508982 | 0.26% |
| haskell | 405920 | 0.20% |
| exp | 321405 | 0.16% |
| objc | 196274 | 0.10% |
| lex | 190273 | 0.10% |
| f90 | 86924 | 0.04% |
| jsp | 56403 | 0.03% |
| awk | 47775 | 0.02% |
| csh | 33996 | 0.02% |
| sed | 20341 | 0.01% |
| modula3 | 33 | 0.00% |
| cobol | 27 | 0.00% |
As you can see, most of the code is written in either C (52.86%) or C++ (25.56%), which together account for 78.42% of the total. Every other language sits in the single digits, with Java, Perl, and Lisp rounding out the top five. This is even more clearly illustrated in the graph shown in the next figure.
Compare this to the kernel itself, which has an even more dominant C presence:
| Language | Lines of Code | Percentage |
| --- | --- | --- |
| ansic | 6466319 | 95.47% |
| asm | 236826 | 3.50% |
| python | 45928 | 0.68% |
| sh | 7511 | 0.11% |
| perl | 6577 | 0.10% |
| cpp | 3962 | 0.06% |
| yacc | 2901 | 0.04% |
| lex | 1824 | 0.03% |
| objc | 613 | 0.01% |
| lisp | 218 | 0.00% |
| pascal | 116 | 0.00% |
| awk | 96 | 0.00% |
| sed | 11 | 0.00% |
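If you want to check the percentage column yourself, it can be rebuilt from the raw line counts. Here is a minimal sketch in Python using the kernel figures from the table above. The names `KERNEL_SLOC` and `language_shares` are my own for illustration and are not part of SLOCCount, and this is not necessarily how the report or the whitepaper computed its numbers.

```python
# Kernel line counts by language, copied from the table above.
KERNEL_SLOC = {
    "ansic": 6466319,
    "asm": 236826,
    "python": 45928,
    "sh": 7511,
    "perl": 6577,
    "cpp": 3962,
    "yacc": 2901,
    "lex": 1824,
    "objc": 613,
    "lisp": 218,
    "pascal": 116,
    "awk": 96,
    "sed": 11,
}


def language_shares(counts):
    """Return (language, lines, percent) tuples, largest count first.

    Hypothetical helper: each percentage is that language's lines
    divided by the sum of all counts passed in.
    """
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(lang, lines, 100.0 * lines / total) for lang, lines in ranked]


if __name__ == "__main__":
    # Print rows in the same pipe-delimited layout as the tables in this post.
    for lang, lines, pct in language_shares(KERNEL_SLOC):
        print(f"| {lang} | {lines} | {pct:.2f}% |")
```

The same function should work for the Fedora 9 table if you feed it those 28 counts instead.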
Obviously, C is still the dominant language in Linux development, both within the kernel and without. C++ is a strong second across the distribution, but it barely registers in the kernel itself (0.06%).
What surprised me was that, at the distribution level, Java came in as the number three language, and not something like Perl or Python. It seems indicative of how much progress Java has made at the application level on Linux. But other interpretations are certainly welcome!


