Tuesday, September 10, 2013

Becoming a better scientist

I'm really interested in the work the Force11 group and many others are doing to apply the techniques of software engineering and digital preservation to making science more reproducible and re-usable. Unfortunately, work on our recent Mellon Foundation grant and on the TRAC audit of the CLOCKSS Archive has left me too busy to contribute or even pay much attention. But a discussion broke out on the Force11 mailing list, sparked by Paul Groth pointing to his blog post Becoming a better scientist (reproducibility edition), and it really grabbed my attention. Follow me below the fold for the details.


Paul said:
The major issue I have is that the methods are probably not as reproducible or transparent as they should be – essentially it’s a bit messy for other people to figure out exactly what I was up to when doing something new. It’s not in one place nor is it clearly documented. It also hurts my process in that a lot of the mucking about I do gets lost or it takes time to find. I see this is as a particular problem as I do more web science research where the gathering cleaning and reanalyzing data is a critical part of the endeavor.

With that in mind, I’ve decided to get my act together and follow in the footsteps of the likes of Titus Brown and Carl Boettiger and do more of my science in a reproducible and open fashion.

To do this, I’ve decided to adopt IPython Notebooks as my new note taking environment. This solves the problem of allowing me to try different things out and keep track of all the parts of a project together. Additionally, it lets me “narrate my work” – that is mix commentary with my code, which is pretty cool.

My notebook is on github and also contains information about how my system is setup including versions of libraries I’m relying on.
Carole Goble responded with two good pointers:
  • To her keynote on reproducibility at ISMB/ECCB entitled results may vary which, as usual with her talks, is full of interesting references and pithy remarks. Paul Groth commented on the talk:
    One of the great things about your keynote, was that it made the case that we weren't all horrible people for not doing perfect science but that we should try to be better and there are ways to do it.
  • To a blog post by Mike Jackson about their workshop What makes good code good at INTECOL13, one of the premier conferences for ecologists. About the workshop, Carole commented:
    about 120 people showed up for a lunchtime workshop that meant missing lunch with about 8 competing workshops in parallel. about 100 outed themselves as coders (mainly R scripts) - about 80% not only hadn't heard of github but hadn't even thought of version control and most hadn't thought of publishing their R scripts despite continually conflating their models and statistics (key to the papers) with the R code itself.
I hadn't heard about IPython Notebooks; they seem to be a really interesting way of combining documents and code. I installed them easily via Ubuntu's apt-get mechanism (if you install the optional stuff, you will get a whole lot of packages), and I'm planning to use them as we analyze data from LOCKSS networks to see if the notebooks live up to the hype.

Carole also pointed to Dexy, which seems to be a similar idea for combining code and documents, but agnostic about the language it works with. That seems like an advantage. However, IPython Notebooks are internally represented as JSON, a plain-text format that many tools can parse, which gives a certain level of preservability; I'm not clear how Dexy would provide this.
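To make the preservability point a little more concrete, here is a minimal sketch of pulling the code out of a notebook using nothing but Python's standard json module. The sample document and the function name extract_code_cells are hypothetical, and the layout it assumes (a top-level "cells" list whose entries carry "cell_type" and "source") is based on version 4 of the notebook format; other versions of the format arrange cells differently, so treat this as an illustration rather than a general-purpose reader.

```python
import json

# Hypothetical, hand-written notebook used only for illustration.
# Assumption: a version-4-style layout with a top-level "cells" list.
SAMPLE_NOTEBOOK = """
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {},
     "source": ["# Poll analysis\\n", "Notes go here."]},
    {"cell_type": "code", "metadata": {}, "execution_count": null,
     "outputs": [],
     "source": ["polls = [3, 5, 8]\\n", "print(sum(polls))"]}
  ]
}
"""


def extract_code_cells(notebook_text):
    """Return the source of each code cell as a single string."""
    notebook = json.loads(notebook_text)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and any other cell types
        source = cell.get("source", "")
        # The source may be stored as a list of lines or as one string.
        if isinstance(source, list):
            source = "".join(source)
        sources.append(source)
    return sources


if __name__ == "__main__":
    for number, code in enumerate(extract_code_cells(SAMPLE_NOTEBOOK), start=1):
        print("--- code cell %d ---" % number)
        print(code)
```

The point is less the code itself than the fact that it needs no IPython installation at all: as long as some JSON parser survives, the narrative and the code in a notebook remain recoverable.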
