by Sam Hames
As part of a regular feature in our quarterly newsletter, we asked LDaCA’s Research Analytics lead, Sam Hames, for a tip to pass on to readers. Read on to find out his tip about purpose and data reuse.
Language data and materials don’t just fall from the sky fully formed – collections are made by people! And those people usually have a particular purpose or set of purposes when creating a collection. Their purposes shape how collections are assembled. Elements of a collection that are central to their purpose will generally receive more time and attention, while everything else is a bonus. This means that when reusing data, you need to carefully examine the materials you’re working with to evaluate whether they are fit for your purposes.
In practice this means being prepared to take a deep dive to look at the data in question, one record or item at a time. This lets you critically analyse how the materials have been put together – and evaluate whether that’s appropriate for your needs. A collection that looks promising may turn out to have subtle (or not so subtle!) problems, and it’s better to find that out sooner rather than later.
We can see an example of this in item 1-009 from the Corpus of Oz Early English (COOEE) collection. The metadata lists only a single gender entry per text, and consequently records the author of this text as male. However, the recorded testimony is clearly from two people: a male seaman and a female convict. If gender is central to your research, this collection may require further annotation of the materials before it is fit for purpose.
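One simple way to start this kind of record-by-record review is to pull a small random sample and read each item alongside its metadata. The Python sketch below is only an illustration: the function name `sample_for_review`, the record ids, and the `gender` and `text` fields are all hypothetical placeholders, not the structure of COOEE or any other real collection.

```python
import random


def sample_for_review(records, k=5, seed=0):
    """Return up to k records chosen at random for close manual reading."""
    # A fixed seed means the same sample can be revisited later.
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


# Hypothetical records: ids, field names and text are placeholders,
# not real collection data.
records = [
    {"id": f"item-{i}", "gender": "unknown", "text": f"placeholder text {i}"}
    for i in range(1, 21)
]

# Print each sampled item's metadata next to its text so the two
# can be compared by eye.
for record in sample_for_review(records, k=3):
    print(record["id"], "| metadata gender:", record["gender"])
    print(record["text"])
    print("-" * 40)
```

A random sample will not catch every problem, but reading even a handful of items against their metadata can be enough to show whether the fields you care about were central to the original collectors' purpose.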

Image Source: Marc Grimwade, courtesy of ARDC
Sam Hames
Sam Hames is a research fellow in computational humanities at UQ and analytics lead for LDaCA. His PhD was on machine learning for analysing medical images. His primary area of interest is in how we can use computational approaches to support qualitative and interpretive inquiry in the humanities.