So Kevin and I had a workshop paper accepted today. It’s a summary of our experiences in spidering and parsing SourceForge for our study of free and open source software development team social structure. The best part is that the Mining Software Repositories workshop is co-located with the International Conference on Software Engineering (ICSE 2004) in Edinburgh on May 25, which means I get a trip out to visit me Granny ;)
Here’s the abstract; the full PDF of The Pitfalls and Perils of Mining SourceForge is also available. Happily it is already written, with just a few small changes suggested by the reviewers left to make.
SourceForge provides abundant accessible data from Open Source Software development projects, making it an attractive data source for software engineering research. However, it is not without theoretical peril and practical pitfalls. In this paper, we outline practical lessons gained from our spidering, parsing and analysis of SourceForge data. SourceForge can be practically difficult: projects are defunct, data from earlier systems has been dumped in and crucial data is hosted outside SourceForge, dirtying the retrieved data. These practical issues play directly into analysis: decisions made in screening projects can reduce the range of variables, skewing data and biasing correlations.
SourceForge is theoretically perilous because it provides easily accessible data items for each project, tempting researchers to fit their theories to these limited data. Worse, few are plausible dependent variables. Studies are thus likely to test the same hypotheses even if they start from different theoretical bases. To avoid these problems, analyses of SourceForge projects should go beyond project level variables and carefully consider which variables are used for screening projects and which for testing hypotheses.
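To make the screening point a bit more concrete, here is a minimal Python sketch of how restricting the range of a variable can change a correlation. It is not from the paper: the data are synthetic, and the names `activity` and `downloads`, the noise pattern, and the cut-off of 80 are all assumptions made up purely for illustration.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic, hypothetical data: 100 "projects" with an activity score of 0..99.
activity = list(range(100))
# Deterministic pseudo-noise in the range -10..10 (an assumption, not real data).
noise = [(i * 37) % 21 - 10 for i in activity]
# A made-up outcome that tracks activity plus noise.
downloads = [a + e for a, e in zip(activity, noise)]

# Correlation across every project.
r_full = pearson(activity, downloads)

# Screen out "inactive" projects: keep only those with activity >= 80.
screened = [(a, d) for a, d in zip(activity, downloads) if a >= 80]
r_screened = pearson([a for a, _ in screened], [d for _, d in screened])

print(f"all projects:      r = {r_full:.2f}")
print(f"screened projects: r = {r_screened:.2f}")
```

In this toy setup the screened sample shows a noticeably weaker correlation than the full one, even though the underlying relationship is identical. The only thing that changed is the range of `activity` left after screening.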
So it’s far from earth-shattering stuff but still good news ;)
Posted by james at March 30, 2004 07:44 PM
so help me god if you come back without a few gpg sigs…
Hmmm… If you have to be in Scotland on May 25, why not come a week and a half early, and meet me in Paris? Just a thought.