by Harriet Sheppard
My fieldwork experience, in many ways, conforms to stereotypes of language fieldwork: a community outsider working with a smaller language community located far from an urban centre. For my postgraduate research, I documented and described grammatical topics of Sudest (known as Vanga Vanatɨna by its speakers), an Oceanic language spoken in Papua New Guinea (PNG). The language is spoken on the islands of Vanatinai (also known as Sudest and Tagula) and Yeina in the Louisiade Archipelago, approximately 350 kilometres southeast of the PNG mainland. In the following, I discuss decisions I made and processes I followed related to the collection of language data and archiving the resulting corpus of texts, including ethical considerations, ownership of the data, data formats, metadata, and data security and reuse. This is by no means meant as a guide (of best practice) for projects where language data are being collected with speakers of un(der)described languages. Rather, it is an account of some of the types of considerations and procedures that arise when collecting language data and how I handled them in this instance.
Before recording with a speaker for the first time, I would explain the project to the speaker and their rights as contributors to the project. I would explain how I would use the language recordings, how they would be stored and accessed and how speakers of the language or other researchers might access and use the recordings in the future. This included discussing how they, the person(s) making the recording, would retain ownership over the recording (i.e., intellectual and cultural property rights). They would also be the ones to choose access rights over the text and have the option to withdraw any or all recordings they made from the corpus at any point (more on these below). After these discussions, if the speaker was still interested in the project, I would then record an oral consent form with them. The consent form was in English, the language of wider communication in the province. In cases where the speaker didn’t speak or spoke limited English, we would have a translator present for this process. I chose to use an oral consent form as it is accessible to all participants no matter their literacy levels. It also has the advantage of being a more durable format compared with a paper consent form, particularly when working in a hot and humid climate and it can also be easily backed up.
Each time I made a recording with a speaker or speakers, I played the recording back to them to check that they were happy with the recording and so that they could decide on what sort of access conditions they wanted the archived recording to have. If the speaker was not satisfied with the recording, sometimes they would choose to make another recording and delete the original or simply delete the recording and maybe try again later. If they were happy with the recording, they would decide on the access conditions they wanted for the archived recording. The possibilities ranged from the most restricted, being that only I, as the original researcher, could access the recordings and use them for my research, to the opposite end of the access continuum where anyone interested could download and listen to the recordings, read any transcriptions, and potentially use them for educational or research purposes. In between these two ends of the continuum, they could opt for other access conditions such as only allowing access for registered users of the archive, or only for researchers and speakers, or just for speakers. They could also choose to have a text embargoed for a certain period of time.
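The access continuum described above could be modelled along the following lines. This is only a minimal sketch, not the archive's actual system: the level names, the role names, and the rule that an embargo leaves the recording visible to the depositor alone are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical access levels, from most restricted to fully open.
# Each level lists the (equally hypothetical) roles allowed to access a recording.
ALLOWED_ROLES = {
    "depositor_only": {"depositor"},
    "speakers_only": {"depositor", "speaker"},
    "researchers_and_speakers": {"depositor", "speaker", "researcher"},
    "registered_users": {"depositor", "speaker", "researcher", "registered_user"},
    "open": {"depositor", "speaker", "researcher", "registered_user", "public"},
}


@dataclass
class AccessCondition:
    """The access choice a speaker makes for one recording."""
    level: str
    embargo_until: Optional[date] = None  # None means no embargo


def can_access(condition: AccessCondition, role: str, today: date) -> bool:
    """Return True if someone with this role may access the recording today."""
    if role not in ALLOWED_ROLES[condition.level]:
        return False
    # Assumption: while an embargo is running, only the depositor has access.
    if condition.embargo_until is not None and today < condition.embargo_until:
        return role == "depositor"
    return True
```

Because the speaker can change their mind at any point, a real system would also need to record who set each condition and when, which this sketch leaves out.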
When I was preparing my ethics application for the university, I applied for and received ethics approval to work with minors. I didn’t plan to collect data from children (although I did want the option of recording older teens if any showed an interest in recording), but included minors in my ethics application because of the potential for children to be inadvertently filmed, for example if they walk into frame or come up to a parent while the parent is being recorded. The ethics approval meant that if such instances arose during a recording, the recording could still be kept and archived (assuming approval from the speaker making the recording). While such situations didn’t end up arising during my own data collection, this would not have been an ideal fix for this issue. More widely, as a discipline, we don’t have good model(s) for how we deal with such scenarios with regard to access and consent. One suggestion I heard was to only make such recordings openly available after obtaining informed consent from any children involved after they turned 18 but, for most circumstances, such an approach is unlikely to be realistic.
As the researcher who collected the data and deposited it in the archive, I am the person the archive contacts when someone puts in a request to access restricted data. But granting any access to data, whether restricted or not, is based on the wishes of the owner(s) of a specific text, that is, the speaker(s) who recorded it. As mentioned, the speaker(s) retain ownership of the text and intellectual and cultural property rights over the recording. They also retain the right to change access restrictions on a recording or remove it completely from the archived corpus at any point and request that it not be used in any future research from that point forwards (given it would be difficult or impossible to remove excerpts from recordings in previously submitted or published work on the language, this is not something that is generally done). Away from the field, this can get a bit complicated due to limited telephone and internet coverage in the islands, although it is increasingly feasible as more and more people have access to mobile telephones and apps like WhatsApp and Facebook. Any future users of the corpus who download data from the archive must agree to only use the data for educational or research purposes and not for (direct) financial gain. Having said that, it needs to be acknowledged that researchers including myself can and do benefit indirectly from such data collections in the form of scholarships, grants, qualifications, and other potential employment opportunities related to our work with the language. Although I, like many (hopefully most) fellow linguists, subscribe to the idea that the source of the data, that is the speaker(s), should have control over their own recordings, the current systems we have set up make that hard or impossible to implement. This is a problem that is increasingly being acknowledged and we as a community of researchers don’t currently have answers for it.
Each speaker community is different and so one-size-fits-all models for communities are unlikely to work.
When collecting spoken or signed language data, the best practice is to make audio and video recordings in formats that are ‘lossless’ rather than ‘lossy’. Lossless compression allows for perfect reconstruction of the original data. Lossy formats reduce file size but also discard some information that cannot be reconstructed, which is obviously not an ideal outcome for precious language data! For audio recordings, I use a recorder that records Waveform Audio File Format or WAV files (.wav), and for video recordings I record Advanced Video Coding High Definition or AVCHD files (.mts). Both are accepted by many archives. WAV files are lossless and are also the audio format needed for many linguistics software programs. AVCHD, strictly speaking, is not lossless, since it stores video compressed with the lossy H.264 codec, but it is the format the camera records natively, so keeping it avoids any further loss. AVCHD files are, however, not a file type accepted by the archive where the Sudest corpus is housed, which only accepts MPEG video files (.mpg). These are also lossy, so converting to them discards further information. This situation is likely due to the fact that the archive was set up some two decades ago when storage capacity was a big issue and the archive has not yet caught up in this regard. This is not an ideal situation and something that should be considered if you are able to choose the archive where your data will be deposited. For derived data including text-audio aligned transcriptions and translations as well as any other annotations, I use plain text (.txt) and ELAN Annotation Format files (.eaf). The plain text files are the file type created when working in the Field Linguist’s Toolbox software program and can be opened, read, and edited in most text editors. Plain text is also a good format for storing information if you want it to persist. The EAF files are more specialist, although the format is transferable. They are only created and used with ELAN, which is an annotation tool for audio and video widely used by linguists. Both text and EAF files are among the standard file formats for transcriptions and annotations that are generally accepted by language archives today.
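Before depositing, it can be worth confirming that each audio file really is uncompressed WAV with the expected properties. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM files and raises `wave.Error` on anything else. The function names and the minimum values of 44,100 Hz and 16 bits are assumptions for illustration, not requirements taken from any particular archive.

```python
import wave


def describe_wav(path):
    """Return basic properties of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": sample_rate,
            "bit_depth": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "duration_s": wav.getnframes() / sample_rate,
            "compression": wav.getcomptype(),  # "NONE" for plain PCM
        }


def meets_minimum(props, min_rate_hz=44100, min_bits=16):
    """Check properties against assumed minimums (check your archive's own)."""
    return (
        props["compression"] == "NONE"
        and props["sample_rate_hz"] >= min_rate_hz
        and props["bit_depth"] >= min_bits
    )
```

Running `describe_wav` over a whole folder of recordings gives a quick inventory that can also feed into the technical fields of a metadata record.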
When collecting metadata, I adopted the Component MetaData Infrastructure (CMDI) schema used by the archive where I was planning to deposit my data. The logistics of fieldwork can be overwhelming at times and following the archive’s metadata schema meant I would minimally have the metadata required to deposit the corpus. The schema was also practical in that some fields were obligatory (e.g., text title, creation date, country of recording) but many were optional (e.g., speaker date of birth/age, education level, address where recording took place). Having optional fields that could be filled in or not worked well for me because collecting more detailed information, particularly biographical information about speaker(s), wasn’t always possible or appropriate. Not all speakers, particularly older speakers, necessarily knew their date or year of birth and they may only know their approximate age. Some speakers I only met when they came to make a recording and getting the detailed biographic information one might like for metadata records is not necessarily culturally appropriate, particularly given the power imbalance between myself, an educated, white researcher from Australia, and speakers I don’t have an established relationship with. Aside from asking about where the speaker grew up, currently lived, and their approximate age, if I didn’t already have an established relationship with the individual, I tended to let information emerge incidentally as volunteered rather than questioning them in depth. One topic I did discuss with speakers regarding metadata was the question of authorship and attribution of the recording and whether they wanted their name to be listed or would like to be given a pseudonym.
With small communities, especially when you are recording audio and video of the speaker, complete anonymisation is not really possible if the speaker also wants the recording to be accessible to other speakers and this could be a potential issue. However, in my experience, (all) speakers chose to have their names listed as the author(s) of the text without anonymisation and were generally excited or proud to be making a record of their language and knowledge.
While the amount of personal metadata collected for individual speakers varied, there was also situational data that I wish I had collected and didn’t. I collected information including the date and location of the recording (e.g., the village and house where the recording took place), who was present at the recording, and the genre of recording (e.g., a historical narrative, conversation, instructions on how to complete a specific task, etc.). One piece of information that I didn’t include, for example, is the exact location of the speaker and their orientation in space. Were they facing towards the north? North-west? Since my original fieldwork, I have begun to research the use of co-speech gestures in the language, that is, gestures that occur at the same time as someone is speaking. Frequently, speakers point to real world locations while they are talking but this can be hard to identify if you don’t know exactly how the speaker was positioned in relation to where they are gesturing. Luckily, by watching the videos I can still identify the direction a speaker was facing and add this to the recording’s metadata, but this would be very difficult, if not impossible, information for a future user of the data to ascertain if they wanted to study gesture.
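A split between obligatory and optional metadata fields, like the one described above, lends itself to a simple check before each deposit. The sketch below is an illustration only: the field names are hypothetical stand-ins rather than actual CMDI element names, and the list of optional fields, including one for the speaker's orientation, is an assumption based on the kinds of information discussed here.

```python
# Hypothetical field names; a real CMDI profile defines its own elements.
REQUIRED_FIELDS = ("title", "creation_date", "country")
OPTIONAL_FIELDS = (
    "speaker_name_or_pseudonym",
    "speaker_age",
    "education_level",
    "recording_location",
    "people_present",
    "genre",
    "speaker_facing",  # e.g. "north-west", useful for later gesture research
)


def check_record(record):
    """Report which fields of one recording's metadata are missing or unexpected."""
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    return {
        # Obligatory fields that are absent or empty block the deposit.
        "missing_required": [f for f in REQUIRED_FIELDS if not record.get(f)],
        # Fields the schema does not know about, often typos in field names.
        "unknown_fields": sorted(set(record) - known),
        # Optional fields left blank: a reminder, not an error.
        "unfilled_optional": [f for f in OPTIONAL_FIELDS if not record.get(f)],
    }
```

Listing unfilled optional fields separately keeps them visible as prompts without forcing questions that may not be appropriate to ask a given speaker.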
Data security is something I had to consider both in the field and at home. In order to keep data secure on computers and storage devices, I made sure that they were all password protected. In the field, I also had to consider specific factors in the environment that could endanger the data, the major issue being the humidity. The average humidity level on the island hovers around 80 percent year-round and I had paper notebooks, a computer, hard drives and SD cards that I needed to keep in working order in a house that was relatively open to the outside environment, with no glass windows and no electricity (I do remember a field manual suggesting a fridge makes a good home for equipment in humid environments if that is an option). To protect equipment, I stored everything in waterproof pouches with silica sachets. I would distribute backup USBs and SD cards across different pouches so that if one failed, I would hopefully have backups. I also tried to double bag equipment and notes when travelling, particularly by boat. Boat transport in the islands is either by traditional wooden sailing outrigger canoe or fibreglass dinghy (‘banana’ boat), both of which are open to the elements.
To secure data, ideally, I would also make backup copies of all new recordings and transcription files each day in the field. However, during my first fieldtrip, I didn’t have access to a power source and therefore couldn’t charge my laptop, meaning I couldn’t back up recordings. As this also meant I was transcribing all texts using pencil and paper, I did photograph all new transcriptions each night to preserve a backup copy. Towards the end of my trip, I became quite worried about having no backups for recordings, particularly when thinking about the boat trip back to the main island, which can take up to eight hours (or more in bad conditions). To avoid making this trip with only the original copies, I tagged along on a walk to the local primary school two and a half hours away because it was rumoured that one of the teachers had a solar panel connected to a car battery and I would be able to charge my laptop and therefore back up the recordings. All the community’s primary school students make this walk twice a week, living at school during the week. At one point in the journey, you have to cross a river in a small outrigger canoe. It was quite alarming when the teenager paddling me across nonchalantly pointed out a nearby crocodile while we were mid-river but at least I managed to back up all the files! For subsequent fieldtrips, I invested in a battery and solar panel for charging equipment and backing up data, although I still had to cross rivers with crocodiles from time to time.
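When power is available, a daily backup routine like the one described above can be scripted so that every copy is checked against the original. The following is a minimal sketch, not the procedure actually used on these fieldtrips: the function names are hypothetical, and it assumes the backup folder sits on a separate drive outside the source folder.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 checksum, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source_dir, backup_dir):
    """Copy new or changed files to backup_dir and verify each copy.

    Returns the relative paths of the files copied in this run.
    Assumes backup_dir is not located inside source_dir.
    """
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    copied = []
    for source in sorted(source_dir.rglob("*")):
        if not source.is_file():
            continue
        target = backup_dir / source.relative_to(source_dir)
        # Skip files whose backup already matches the original.
        if target.exists() and sha256_of(target) == sha256_of(source):
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        if sha256_of(target) != sha256_of(source):
            raise IOError(f"Backup of {source} does not match the original")
        copied.append(str(source.relative_to(source_dir)))
    return copied
```

Comparing checksums rather than file sizes or dates catches silent corruption, which matters when storage media have spent weeks in heat and humidity.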
In order to secure the corpus of texts into the future and make sure that it doesn’t get lost on an old hard drive, I have uploaded it to a language archive. The archive is a digital repository tasked with safeguarding corpora of languages for which there is limited documented data. As well as keeping copies of the uploaded data, the archive will also ensure that the data continues to be accessible, for example, by converting data files to newer formats if the format a file was uploaded in becomes obsolete. Having the corpus saved in an online archive also means that it is discoverable for other researchers, speakers, and other interested parties. It can be found through the search portal of the archive by searching by language name(s), the contributor’s name, or by browsing by country. The corpus can also be found by searching through the Open Language Archives Community (OLAC) portal, which is an online virtual library of information and links to language resources available in a specific language (e.g., grammars, dictionaries, archived texts). The information on OLAC is automatically collected or ‘harvested’ daily from participating archives, meaning that when you upload a file to an archive, it will automatically be listed in OLAC the next time it is updated. Because there is limited information online about Sudest, keyword searches on most online search engines such as Google also display links to the corpus quite high up in the results.
It’s often suggested that the researcher should make a will and assign a literary executor for the data they collect. This raises the question of whom to nominate. As a graduate student, I nominated one of my supervisors as the person to have direct access and editing powers over the archived corpus. This makes sense in that, after myself, my supervisor is one of the people who know the most about the project and data collection. Supervisors are also often the chief investigator for ethics approval for graduate data collection. But supervisors tend to be older than graduate students and don’t necessarily have any personal connection to the speaker community. Ideally, any succession plan would bring control of the data back to the community but what would that mean in practice? There may be issues both political and practical if just one individual from the community becomes the executor. In some cases, there may be an obvious option if the community has a language and culture centre or registered Indigenous corporation. In such cases, access and stewardship plans may have been discussed and incorporated into the project since the beginning. For many communities, especially in the Pacific, this is not the case and therefore there is no clear governing group which might take such a task on.
In future research projects I might work on, I can build on my past experiences and hopefully improve on them. This would likely involve expanding the list of situational metadata I automatically collect for each recording as well as considering the file types accepted by an archive when deciding where to deposit any recordings. Although I’ve only touched briefly on issues relating to speaker-community control and intellectual and cultural property rights, I would also aim to build more funding and space into the timeline of a project for more equitable community consultation and collaboration. Such discussion would then likely have a flow-on effect for how to secure the data into the future. Ethical and practical questions regarding data collection, access, and stewardship are complex. It is impossible to take all contingencies into account. Best practice and standards change, with much of the change led by Indigenous and First Nations researchers. Best practices from a decade ago are not those of today and this is a good thing!