This post follows a workshop delivered on Data Preparation for Digital Humanities Research.
The slides and a companion OpenRefine guide are linked below.
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003). Data preparation is not just a first step, but must be repeated many times over the course of analysis as new problems come to light or new data is collected.
Hadley Wickham, Tidy Data
Wickham captures well what most of us learn pretty quickly when we embark on a first Digital Humanities research project. Data cleaning and preparation are challenging and often take up the lion’s share of effort. The fun stuff – representation via things like different flavors of visualization – is the icing on a fairly complex cake. If we were to gauge the work put into most projects, an 80/20 rule might best portray the distribution of effort: 80% spent on iterative data preparation, cleaning, and analysis, with perhaps 20% spent on documentation and representation.
If you’ve put together a workshop, you know how much work it is. You have to get the dataset, clean the dataset, put it somewhere people can find it, troubleshoot the software, plan out the steps, etcetera, etcetera. You have to do this if you want to get through all the material in an hour. But guess what? THAT PREP WORK IS WHAT DOING DH IS. All that garbage prep work is what we spend most of our time doing. This seamless processing of data is a fantasy world!
Miriam Posner, Here and There: Creating DH Community
There goes Miriam speaking truth. Again.
We might hesitate to frame Digital Humanities along the lines of an 80/20 rule because it appears to foreground uninteresting work (data cleaning and preparation). It’s tempting instead to frame DH work with reference to end products and the tools that enable generation of quick results. While this latter framing is attractive because it aligns DH with the type of work manifest in the high-level representations of data that likely first charmed us (e.g. maps, network visualizations, etc.), it shifts attention away from the skills and dispositions that are needed to do Digital Humanities independent of a carefully crafted, fit-for-purpose dataset provided to illustrate the functions of a tool during a workshop or institute.
In my neck of the woods I’ve made first modest efforts to begin imparting the skills and dispositions needed for tackling preparation and cleaning of tabular data – a small start! This approach is informed by some research I’ve embarked on with colleagues, which in turn was informed by the Data Information Literacy project. This is the beginning of what I hope will be a series of graduated sessions that build on each other in complexity. While I’m not totally sure of a route to success in this area, I do think it’s a step in the right direction.