The Australian Text Analytics Platform (ATAP) and the Language Data Commons of Australia (LDaCA) are collaborative projects led by the University of Queensland and supported by the Australian Research Data Commons to develop infrastructure for researchers who work with language data. In this blog post series, we feature interviews with the Chief Investigators of the two projects. In each post, we present their answers to three questions:
- What is your role in these projects? (What do you/your team do as part of your participation?)
- What excites you most about the projects? (What motivates you to participate?)
- What advice would you give someone who wants to get started with text analytics, corpus linguistics, or language data collection?
This blog post features Catherine Travis (CT), Nicholas Evans (NE), and Monika Bednarek (MB). Nick was a Chief Investigator in one of the first ARDC-funded projects for LDaCA under the Australian Data Partnerships program. Although he is not a CI on our current project, Nick remains a friend and supporter of LDaCA.
The interviews were conducted via email, and we are grateful to Kelvin Lee from the Sydney Corpus Lab for his assistance with the interviews and with creating these blog posts.
What is your role in these projects?
CT: Our main goal at the Australian National University is to grow the set of collections that will be incorporated into LDaCA with a particular focus on Australian English and migrant languages. This involves identifying relevant collections, which is a bit of a ‘language dig’, as these are diverse, dispersed, and often quite hidden away. Many have restrictions around data sharing, so we work closely with data stewards to set up the appropriate access and licensing conditions. As well as this, we are developing tools that can enhance the usability of both legacy corpora and newer collections, including tools for aligning transcripts that exist in different formats with their corresponding audio, standardising orthography, streamlining the anonymisation process, and so on. Right now, most of what we know about Australian English comes from middle-class Australians of Anglo-Celtic background living in major urban centres. A better representation of Australian society in our language collections will broaden the scope of the research that can be done, allow new questions to be asked, and provide a better picture of Australia overall.
NE: My role is to keep our foot on the gas for the huge job of getting a continued flow of new corpus data, as well as digitised legacy data, mostly for the languages of New Guinea and the Pacific.
MB: I’m the Academic Lead for these projects at the University of Sydney. In this role, I manage the projects overall and work closely with staff from the Sydney Corpus Lab and the Sydney Informatics Hub. We develop text analytics tools and training for analysis of text/language, including but not limited to tools for linguistic and discourse analysis. A team at PARADISEC is also involved and contributes to work packages around language collections for Indigenous languages of Australia and the Pacific.
My role as director of the Sydney Corpus Lab is crucial for these projects. The lab’s mission is to build research capacity in corpus linguistics at the University of Sydney, to connect Australian corpus linguists, and to promote the method in Australia, both in linguistics and in other disciplines. We organise relevant events on corpus linguistics and text analytics, including guest lectures and workshops, and create resources such as corpora, blog posts, video playlists, and curated introductions in different languages.
What excites you most about the projects?
CT: There is a wide range of collections out there that represent an absolute treasure trove of knowledge not only for language in Australia, but also for Australian society, culture, and history. For example, oral histories or ethnographic interviews can provide invaluable linguistic data, and likewise, sociolinguistic interviews may contain invaluable information for historians, sociologists, or anthropologists. But because there is no way to know what collections exist, what is in them, and what is accessible, this treasure trove is underutilised. With many decades of data collection behind us and with recent technological developments, now is the perfect time to open this up to create a language data commons. It is exciting to think of the opportunities that this presents for us to cross disciplinary boundaries, and expand our understanding of Australia’s social, cultural, and linguistic history.
NE: We all know that most of the world’s languages are under-resourced, lacking even a grammar or dictionary. A dream I have – which this project will move us very slightly toward – is that every one of the world’s 7,000 languages might one day have a vast library of corpus data, comparable to the 60 million words we have for Classical Greek, for example. Each language and culture deserves its own vast library wing. To give an idea of the scale of the challenge, so far in our part of the world there are only three languages (Bislama, Gurindji Kriol, and Ku Waru) where we have even one million words.
MB: To work across disciplines, to collaborate with data scientists, and to learn many new things myself! And to discover the value that we bring to such collaborations as domain experts in the Humanities and Social Sciences.
What advice would you give someone who wants to get started with text analytics, corpus linguistics, or language data collection?
CT: Given the number of existing language collections, one piece of advice I would give to someone who wants to get started with data collection would be to learn what is already out there prior to beginning new data collection, and to think about how any new data you collect may contribute to existing collections or research projects, or alternatively, how existing collections might shape the kind of new data you might collect or the kind of research you might do. It is standard practice to contextualise the research that we do within the field; we are at a point where we should be doing the same with data collection. It is through our cumulative efforts, taking advantage of the work that has preceded us and building on that, that we will most advance in our scientific endeavours.
NE: Find good ways of getting sensitive and accurate commentaries on the meaning of materials. We need the equivalent of Biblical or Talmudic commentaries for the digital age – metatexts that comment on what other texts mean.
MB: Just give it a go and don’t be daunted! Go to workshops or summer schools, and avail yourself of other free or low-cost opportunities. Be (and remain) critical in your use of tools, and remember the value of your own knowledge and expertise. Discover the joy of mastering a new tool/technique or of knowing enough to find out that it is not for you or for your research project. If you find it useful, practise and also keep good notes so that you can return to the tool later and still know what to do. Always be transparent in how you use the tool/technique and make sure you know why you are using it.
Acknowledgments
The Australian Text Analytics Platform program (https://doi.org/10.47486/PL074) and the HASS Research Data Commons and Indigenous Research Capability Program (https://doi.org/10.47486/HIR001) received investment from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS).
