The Australian Text Analytics Platform (ATAP) and the Language Data Commons of Australia (LDaCA) are collaborative projects led by the University of Queensland and supported by the Australian Research Data Commons to develop infrastructure for researchers who work with language data. In this blog post series, we feature interviews with the Chief Investigators of the two projects. In each post, we present their answers to three questions:
- What is your role in these projects? (What do you/your team do as part of your participation?)
- What excites you most about the projects? (What motivates you to participate?)
- What advice would you give someone who wants to get started with text analytics, corpus linguistics, or language data collection?
This blog post features Louisa Willoughby (LW), Martin Schweinberger (MS) and Nick Thieberger (NT). The interviews were conducted via email, and we are grateful to Kelvin Lee from the Sydney Corpus Lab for his assistance in carrying them out and creating these blog posts.
1. What is your role in these projects?
LW: I’m the academic lead on the multimodal corpus. We’re building infrastructure for people to create video dictionaries that link to corpus examples (and vice versa), using Signbank and the Auslan corpus as our test case.
MS: I am a Chief Investigator and steering committee member of the Australian Text Analytics Platform (ATAP) project, and I am also a CI of the Language Data Commons of Australia (LDaCA). I focus in particular on the Language Technology and Data Analysis Laboratory (LADAL), which is part of ATAP and provides infrastructure for computational text analytics in the humanities. LADAL has an outreach component: it organizes webinars and workshops, and it provides computational resources in the form of self-paced online tutorials and interactive notebooks that allow researchers to try out methods and apply them to their own data. I see my task as organizing and managing activities around LADAL by creating and optimizing resources, getting involved in outreach via events, and building national and international partnerships and collaborations with other computational humanities infrastructures.
NT: I am in charge of work packages 1.1 and 1.2, Indigenous languages of Australia and the Pacific. I worked at AIATSIS in the past and set up an Aboriginal language centre in Port Hedland (Wangka Maya). I have worked in Vanuatu, and did my PhD research on Nafsan, a language from Efate, sparking an interest in languages of the Pacific. The work of recording speakers of Nafsan also resulted in a corpus of time-aligned text and media, as well as historical manuscripts that I have been finding in archives around the world. This led me to be concerned that existing records of these languages be made accessible to current speakers and so I helped establish the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) which has many connections to cultural agencies in the Pacific. I currently lead a project (Nyingarn) to convert manuscripts in Australian Indigenous languages to text.
2. What excites you most about the projects?
LW: That it is both building an exciting tool for researchers to better understand language variation AND making a resource that is useful for Auslan students and members of the community.
MS: One of the most fulfilling aspects of my work is the opportunity to collaborate with researchers from diverse communities and fields or to see how resources I created help people from very different backgrounds in their work. I do take pride in inspiring and enabling researchers to explore the potential of computational methods, whether they are working on projects related to the language sciences, health sciences, social sciences, arts, or beyond. As a quantitative linguist who came to computation as a computerphobe philosopher, I really enjoy creating visualizations that communicate data and results and that provide an intuitive understanding of complex data sets. I also take pleasure in showing others how to perform statistical analyses that help them make informed decisions based on their findings. By optimizing workflows and automating procedures, I help researchers to work more efficiently and effectively, enabling them to achieve their goals more quickly and easily. Through my work, I aim to create an environment where researchers from diverse backgrounds and specializations can benefit from the potential of computational tools, regardless of their field of study or level of experience. I value the opportunity to work collaboratively with researchers to find solutions to complex problems, and to share my expertise with others to promote innovation and growth within the field. Overall, I believe that the application of computational methods is an exciting area of research that has the potential to transform the way we approach scientific inquiry in the humanities and social sciences. By helping researchers to explore the possibilities of these tools, I hope to contribute to a more vibrant and innovative research community.
NT: Making textual material in these languages available so that speakers can find records in their own languages. Often there is very little available information for these languages, so every record becomes all the more important, especially in the context of colonial dispossession where the languages are no longer spoken every day and there are efforts to relearn the language. It is exciting that this work can become part of a national commons, and be supported into the future, so that more and more material can be included. The simple task of locating a language record, digitising it, and making the files available with appropriate licences means that it can be used by speakers, and by researchers for various new purposes.
3. What advice would you give someone who wants to get started with text analytics, corpus linguistics, or language data collection?
LW: Jump in and start playing! There are lots of different tools and corpora online and I find it easiest to learn how to use them by just having a go and seeing what you can get out. It can be fun to look for little bits of variation or to see how common a certain word or phrase is, and doing something small will give you a sense of the kind of data you get out of the tools and how you might then incorporate them into a wider project.
MS: If you’re just starting out with a new project, it’s essential to take it one step at a time. Start with a simple visualization or basic text processing task, such as concordancing, and build on it gradually. Don’t be intimidated by others who may have more experience or skills than you do. Instead, take the opportunity to learn from them and seek their guidance when necessary. It’s important to keep in mind that progress takes time, and comparing yourself to others can be discouraging. Instead, focus on comparing your current skills and accomplishments to your former self. Take pride in what you’ve created and the progress you’ve made. Celebrate your small wins and use them as motivation to keep moving forward. Another key to improving your skills is to seek feedback from others. Share your work with peers, mentors, or instructors who can provide constructive criticism and suggestions for improvement. Don’t be afraid to ask questions or seek clarification when you’re unsure of something. Remember that learning is a continuous process, and even the most experienced professionals have room for growth and improvement. Embrace the journey, enjoy the learning process, and stay motivated by setting achievable goals for yourself. By doing so, you’ll be well on your way to developing your skills and achieving your goals.
NT: The first steps in language data collection involve recording speakers, with all appropriate permissions in place, and with a consent form signed by them. Make those recordings as well as possible, following recommendations and doing training, and manage the resulting files so they are not lost. Learn the basics of text querying, searching, and regular expressions. Be aware of the ethical considerations in using material from someone else’s language. But just start now!
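For readers who would like to try the kind of small first task mentioned in these answers (concordancing, and searching with regular expressions), here is a minimal sketch in Python using only the standard library. It is an illustration, not a tool from ATAP, LDaCA or LADAL: the function name `concordance`, the `width` parameter and the sample sentence are all made up for this example.

```python
import re

def concordance(text, keyword, width=30):
    """Return keyword-in-context (KWIC) lines for each whole-word match."""
    # \b restricts matches to whole words; re.escape guards special characters.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the match.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Hypothetical sample text, invented for this illustration.
sample = (
    "Language data can be hard to find. Every language record matters, "
    "and a language archive keeps those records safe."
)

for line in concordance(sample, "language"):
    print(line)
```

Swapping in your own text and keyword is enough to start exploring; counting the returned lines also gives a rough answer to how common a word is in that text.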
The Australian Text Analytics Platform program (https://doi.org/10.47486/PL074) and the HASS Research Data Commons and Indigenous Research Capability Program (https://doi.org/10.47486/HIR001) received investment from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS).