'Digital Research: preserving your research data', History libraries & research open day, Senate House Library, University of London, 18 March 2014
Digital Research: preserving your research data
Notes from a talk I gave at the ‘History libraries & research open day’ at Senate House Library, University of London, 18 March 2014 organised by the Committee of London Research Libraries in History
The following text represents my notes rather than precisely what was said on the day and should be taken in that spirit.
S Background of team, more than resource discovery, S deluge of data et al, libraries increasingly full of data as much as books S category of research we support, S new contexts for scholarship in HSS.
Within this context sits a much more mundane challenge that offers a nice case study of the sort of thing the Digital Research team at the BL is interested in: the challenge of managing all your digital research stuff, all your data.
S A decade ago Roy Rosenzweig sought to alert historians to ‘the fragility of evidence in the digital era’. Whilst Rosenzweig’s concerns were focused on sources available on the open web, they can easily be extended to the born-digital materials – or data – historians create during their research.
S But why should you care? (and why do I care!)
Well, ordinary working historians are moving toward using computers as the default means of storing all of their stuff. Your manuscripts have been digital objects for some time (a dog doesn’t eat our students’ essays; their computers die) and your research is moving in the same direction: in the form of typed notes, photographs of archives, and tabulated data.
Putting research data into digital form and backing it up does not guarantee that data will survive. And here I mean survive neither in a literal sense nor in a ‘readable by the next version of Word’ sense, but rather in a ‘usable by people’ sense. The research data you generate is at risk of loss if you are not able to generate and preserve it in a form your future self can understand and find meaningful – let alone someone else wading through the idiosyncrasies of your research process years or decades after the fact. In short, there is a risk of loss by data becoming detached from the context of its creation, from the tacit knowledge that made it useful at the time of preparing talk X or manuscript Y. As William Stafford Noble puts it S:
The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why […] Most commonly, however, that “someone” is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.
William Stafford Noble (2009) A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e1000424. doi:10.1371/journal.pcbi.1000424
There are two simple things you can do that make all this trouble go away. You can document your data. And you can think about how you structure your data.
Birkwood, Katie (girlinthe). “Victory is mine: while ago I worked out some Clever Stuff (tm) in Excel. And I MADE NOTES ON IT. And those notes ENABLED ME TO DO IT AGAIN.” 7 October 2013, 3:46 a.m. Tweet. https://twitter.com/Girlinthe/status/387166944094199809
The purpose of documentation is to capture the process of data creation, changes made to data, and tacit knowledge associated with data. Project management methodologies place great emphasis on precise, structured, and verbose documentation. Whilst there are benefits to this approach, especially for large, complex, multi-partner projects, the average working historian is more likely to benefit from a flexible, bespoke approach to documentation that draws on, but is not yoked to, project management principles.
In the case of historical research, useful documentation might include: S
- details describing notes taken whilst examining a document in an archive, from the archival reference for the original document (which I’m sure you all do already!), to how representative the notes are (for example, full transcriptions, partial transcriptions, or just summaries), how much of the document was examined, and decisions taken to exclude sections of the document from the research process.
- details describing what tabulated data is, how it was generated (for example, by hand or in an automated manner), what attributes of the original sources were retained (and why).
- details describing a directory of digital images, such as how each image was created or where those images were downloaded from, and links to research notes that refer to them.
As the last example suggests, good documentation should describe the meaningful connections that exist between research data, links that may not remain obvious over time.
Time also plays a role in the file formats you use. Ideally, research data and documentation should be saved in platform-agnostic formats such as .txt or .md for notes and .csv (comma-separated values) or .tsv (tab-separated values) for tabulated data.
These plain formats are preferable to the proprietary formats used as defaults by Microsoft Office or iWork because they can be opened by many software packages and have a strong chance of remaining viewable and editable in the future. But more importantly, the .txt, .md, .csv, and .tsv formats contain only machine-readable elements. Whilst it is common to use bold, italics, and colouring to signify headings or to make a visual connection between bits of research, these display-orientated annotations are not machine-readable, can’t be queried and searched, and hence are not appropriate for large quantities of information. How you denote emphasis in a machine-readable way is largely up to you, though there are existing schemas such as Markdown available to build on and adapt for your own needs. The rule of thumb is: if you couldn’t Google it, your notation is no use to you.
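To see why machine-readable markers pay off, here is a minimal sketch (the notes file, its archival references, and the ‘IMPORTANT:’ marker are all invented for illustration) of how plain-text, Markdown-style notes can be queried with a few lines of Python:

```python
import re

# Hypothetical research notes kept as plain text with Markdown-style markers
notes = """\
# 2014-01-15 Archive visit
## CO 137/91 (partial transcription)
IMPORTANT: compare with notes on CO 137/92
## CO 137/92 (summary only)
"""

# Because the markers are plain text, they can be queried like any data:
headings = re.findall(r"^## (.+)$", notes, flags=re.MULTILINE)
flagged = [line for line in notes.splitlines() if line.startswith("IMPORTANT:")]

print(headings)  # every document examined
print(flagged)   # every note flagged for follow-up
```

A bold heading in a Word document supports none of this; a `##` prefix in a .txt or .md file supports all of it, in any tool, on any platform.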
Documenting your research is made easier by structuring your research data in a consistent and predictable manner.
Well, every time we use a library or archive catalogue, we rely upon structured information to help us navigate data (both physical and digital) the library or archive contains.
Examining URLs is a good way of thinking about why structuring research data in a consistent and predictable manner might be useful in your research. Good URLs represent with clarity the content of the page they identify, either by containing semantic elements or by using a single data element found across a set or majority of pages.
Semantic URLs are common to news websites or blogging services S and URLs structured by a single data element are common to archival catalogues. For example, The British Cartoon Archive structures its web archive using the format:
- website name/record/reference number
Consistent and predictable data structures are readable both by humans and machines. Transferring this logic to digital data accumulated during the course of historical research makes research data easier to browse, to search and to query using the standard tools provided by our operating systems. In practice, the structure of your research data archive might look something like this: S
A root directory…
…with a series of sub-directories.
\root\events\
\root\research\
\root\teaching\
\root\writing\
A series of directories for each event, research project, module, or piece of writing, including a date element.
Finally, further sub-directories can be used to separate out data as the project grows.
\root\research\2014_Journal_Articles\analysis\
\root\research\2014_Journal_Articles\data\
\root\research\2014_Journal_Articles\notes\
Obviously not all information will fit neatly within any given structure and as new projects arise taxonomies will need to be revisited. Idiosyncrasy is fine so long as the overall directory structure is consistent and predictable.
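The skeleton above takes only a few lines to build. Here is a sketch using Python’s pathlib (the root name and the dated project directory are my own examples, not prescribed in the talk):

```python
from pathlib import Path

root = Path("root")  # hypothetical root directory

# Top-level taxonomy: one directory per broad category of activity
for area in ["events", "research", "teaching", "writing"]:
    (root / area).mkdir(parents=True, exist_ok=True)

# A project directory (with a date element) and its sub-directories
project = root / "research" / "2014-01_Journal_Articles"
for sub in ["analysis", "data", "notes"]:
    (project / sub).mkdir(parents=True, exist_ok=True)
```

Because the structure is consistent and predictable, standard tools can then browse and query it — for example, `root.rglob("*.tsv")` finds every tabulated data file anywhere in the archive.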
S How you name files also helps ensure the directory structure is logical and readable. A file describing the scope of a folder is better called something like ‘2014-01-03_writing_readme.txt’ (or ‘2014-01-03_writing_readme.md’) than ‘Notes about this folder’, as it replicates the title of the directory and includes some date information to anchor it among other files (North American audiences should note that I’ve chosen the structure year-month-day).
Applying such naming conventions to all research data in a consistent and predictable manner further aids the readability and comprehension of the data structure. So… S
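One way to keep such names consistent is to generate them rather than type them. A small helper along these lines (my own sketch, not part of the talk) enforces the year-month-day convention:

```python
from datetime import date

def dated_name(topic, label, ext="txt", when=None):
    """Build a file name of the form 'YYYY-MM-DD_topic_label.ext'.

    A hypothetical helper illustrating the naming convention; defaults
    to today's date when none is given.
    """
    when = when or date.today()
    return f"{when.isoformat()}_{topic}_{label}.{ext}"

print(dated_name("writing", "readme", when=date(2014, 1, 3)))
# → 2014-01-03_writing_readme.txt
```

A side benefit of leading with the date in this form is that files sort chronologically under a plain alphabetical sort, in any file manager.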
…for a project on journal articles we might choose the directory…
…where the year-month elements captures when the project started.
Within this directory we include a \data\ directory where the original tabulated data used in the project is kept.
Alongside this data is documentation that describes 2014-01-31_Journal_Articles.tsv.
In the \analysis\ subdirectory we place notes from the data analysis.
Note the different month and day attributes here, reflecting the dates on which this analysis took place, a convention described briefly in 2014-02-02_Journal_Articles_analysis_readme.txt.
Finally, a directory within \data\ called \derived_data\ contains data derived from the original 2014-01-31_Journal_Articles.tsv. In this case, each derived file contains only those lines of the data that include one of the keywords ‘africa’, ‘america’, ‘art’, and ‘britain’, and is named accordingly.
2014-01-31_Journal_Articles_KW_africa.tsv
2014-01-31_Journal_Articles_KW_america.tsv
2014-02-01_Journal_Articles_KW_art.tsv
2014-02-02_Journal_Articles_KW_britain.tsv
And so on as the project progresses.
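The derivation step itself can be scripted, which makes it repeatable and self-documenting. A minimal sketch, assuming a tab-separated master file with a ‘keyword’ column (the column name, the toy rows, and the fixed analysis date are invented for illustration):

```python
import csv
from pathlib import Path

# Toy stand-in for the original tabulated data (invented rows)
master = Path("2014-01-31_Journal_Articles.tsv")
master.write_text("title\tkeyword\nSlavery and empire\tafrica\nNew Deal murals\tart\n")

derived = Path("derived_data")
derived.mkdir(exist_ok=True)

with master.open(newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

# One derived file per keyword, dated here with a fixed analysis date
# (in practice, the date on which the analysis took place)
for kw in ["africa", "america", "art", "britain"]:
    out = derived / f"2014-02-01_Journal_Articles_KW_{kw}.tsv"
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "keyword"], delimiter="\t")
        writer.writeheader()
        writer.writerows(r for r in rows if r["keyword"] == kw)
```

Because the script is kept alongside the derived data, it doubles as documentation: anyone (including your future self) can see exactly how each derived file was produced from the original.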
S So what have I just done? I’ve suggested ways you might document and structure your research data, the purpose of which is to ensure that data is preserved by capturing tacit knowledge gained during the research process and thus making the information easy to use in the future. I have recommended the use of platform agnostic and machine-readable formats for documentation and research data. And I have suggested that URLs offer a practical example of both good and bad data structures that can be replicated for the purposes of a historian’s research data.
These suggestions are intended merely as guides; historians should adapt them to suit their own purposes. Time spent documenting and structuring research should not be a burden: rather, its purpose is to make research that generates data more, not less, efficient - which, as I’ve suggested, is most research undertaken by the ordinary working historian these days. S This is part and parcel of the supporting role the Digital Research team at the BL aims to play - helping scholars make the best of the digital in the course of their research. And this extends way beyond preserving your research data for the future, so do come and have a chat with me during the Fair if you think we can help.