FAIR and CARE
Data is becoming increasingly important in today’s world, so corpus linguists might feel that the rest of the world is finally catching up. But the rest of the world is bringing with it new approaches to how data is handled, which means that fields such as corpus linguistics may need to reassess their practices. Such reassessment includes addressing concerns about how data is stored and who can access it (data stewardship). These concerns are part of the Open Science movement, which is ultimately grounded in principles of equity and accountability.
The most influential approach to data stewardship today is the FAIR principles. According to these principles, data should be:
Findable
Metadata and data should be easy to find for both humans and computers.
Accessible
Once users find the required data, they need to know how the data can be accessed, possibly including authentication and authorisation.
Interoperable
The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
Reusable
The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.
In general, corpus linguists do well on the interoperability criterion. Corpus data is usually stored in non-proprietary formats; even when some structure is imposed on the data, this is almost always in a form which is saved as a simple text file (e.g. CSV files or XML annotations). Data stored in such formats is easy to move between applications. But what about the other three criteria?
Some corpus data is easy to discover; it is findable. For example, CLARIN, the portal to the European Union language resource infrastructure, provides access to many large data collections, as does the Linguistic Data Consortium in the USA. However, some data is never made part of a large collection and often remains under the control of individual researchers or research teams. Such data may be almost impossible to find. Even if we can find such data, it is unlikely to be accompanied by good descriptions of the data and metadata, making reusability problematic. Of course, big corpora such as the British National Corpus will be both findable and accompanied by comprehensive corpus manuals. However, it is worth considering how to make other corpora more findable, including the provision of corpus manuals or corpus descriptions. Corpus resource databases such as CoRD do aim to work towards this principle.
Accessibility may also be an issue for some data. Copyright law may allow use of material for individual research but prohibit any further distribution of the material. The FAIR approach to such cases is that metadata should be available so that interested parties can know that a data holding exists (F), and the metadata will include information about the conditions under which the data may or may not be shared or reused (A and R).
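To make this concrete, here is a minimal sketch in Python of what such a metadata record might look like for a corpus whose texts cannot be redistributed. Everything in it is an assumption made for illustration: the corpus title, the contact address, and the field names are hypothetical and do not follow any particular metadata standard. The point is only that the record itself can be published (F) while stating the conditions for access and reuse (A and R).

```python
import json

# Hypothetical metadata record for a corpus that cannot be redistributed.
# All values and field names are invented for illustration; they do not
# follow any particular metadata standard.
record = {
    "title": "Example Corpus of Newspaper Text",
    "description": "Plain-text newspaper articles with CSV annotations.",
    "contact": "corpus-holder@example.org",
    "formats": ["text/plain", "text/csv"],
    "access": {
        "data_public": False,
        "conditions": "Copyright material; available on request "
                      "for individual research only.",
    },
    "reuse": {
        "redistribution_allowed": False,
    },
}

# Fields we assume a record needs before it is worth publishing.
REQUIRED_FIELDS = ("title", "description", "contact", "access", "reuse")


def missing_fields(rec):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not rec.get(field)]


def to_json(rec):
    """Serialise a record as JSON so that it can be published or harvested."""
    return json.dumps(rec, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    problems = missing_fields(record)
    if problems:
        print("Missing fields:", ", ".join(problems))
    else:
        print(to_json(record))
```

A record like this can be deposited in a catalogue even when the data itself stays with the research team. In practice, an established metadata schema would be a better choice than ad hoc field names.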
For linguists, there is another very important set of principles concerning data, the CARE principles developed by the Global Indigenous Data Alliance:
Collective benefit
Data ecosystems shall be designed and function in ways that enable Indigenous Peoples to derive benefit from the data.
Authority to control
Indigenous Peoples’ rights and interests in Indigenous data must be recognised and their authority to control such data be empowered.
Responsibility
Those working with Indigenous data have a responsibility to share how those data are used to support Indigenous Peoples’ self-determination and collective benefit.
Ethics
Indigenous Peoples’ rights and wellbeing should be the primary concern at all stages of the data life cycle and across the data ecosystem.
These principles are presented as applying particularly to Indigenous data, but we believe that researchers should adopt this approach in all cases where the people who participate in our research can be seen to have some moral rights in the information they have contributed. We should demonstrate respect for those moral rights by recognising the participants’ authority to control how data is used, by seeking to ensure that participants derive benefit from use of the data, and by acting ethically and transparently in our relations with the participants. Deborah Cameron and her colleagues (Cameron et al. 1993) raised similar issues almost 30 years ago, arguing that the imbalance of power in the relation between researchers and participants needed to be reduced. The CARE principles continue along this path, but go even further in explicitly returning power to the sources of information.
Corpus data is often written language. We have already mentioned that copyright law is relevant to some such material, and that body of law protects at least some rights for the creators of the material. But corpus linguists also work with other kinds of data such as spoken language (spontaneous or produced as a response to some prompt) or written material produced by research participants according to some protocol. In such cases, ethical research practice should include addressing the issues raised by the CARE principles. Some aspects of this practice will fall under institutional ethics requirements (for example, thinking carefully about what permissions we request on consent forms), but other questions must be part of the relationship between the researcher and the research participants. Corpus linguists working with spoken, computer-mediated, or otherwise particularly sensitive data have been aware of at least some of these issues, but the CARE principles offer an opportunity to go further.
Acquiring data for linguistic research takes effort and often that means money. It is therefore a good use of resources if any data we collect can be used by others. The FAIR principles provide a framework to make sharing and reusing data easier, and applying the CARE principles where relevant helps to ensure that our research has a sound ethical basis.
Note: This post is based on the presentation ‘Advance Australia FAIR’, given by Simon Musgrave and Michael Haugh to the 4th Forum on Englishes in Australia (La Trobe University, August 27, 2021).
Thanks to Leah Gustafson and Monika Bednarek for helpful comments on drafts.