What happened to the Australian National Corpus (AusNC)?

Overview of the Australian National Corpus (AusNC)

The Australian National Corpus (AusNC) project was an important precursor to the Language Data Commons of Australia (LDaCA) project. Important aims and motivations of the LDaCA project are to secure language data in Australia and to make it easily available to researchers (and other interested parties). But language researchers have been aware of these issues for some time and the AusNC was an earlier attempt to address them.

The AusNC was discussed in various places before the work of assembling data began. A meeting of interested people was held at the conferences of the Australian Linguistic Society and the Applied Linguistics Association of Australia in 2008, and a workshop was held in the same year (proceedings published in Haugh et al. 2009), supported by an ARC-funded network for research into speech and communication (HCSNet). Following these meetings, funding support from the Australian National Data Service and subsequent funding and technical support from Griffith University enabled us to start the AusNC project, which was led by Michael Haugh, then a staff member at Griffith. Several data sets, which already existed and could easily be accessed, were included in this initial stage of the project:

AustLit (selected samples of out-of-copyright poetry, fiction and criticism ranging from 1795 to the 1930s)
The Australian Corpus of English (ACE)
The Australian component of the International Corpus of English (ICE-AUS)
Australian Radio Talkback (ART)
The Corpus of Oz Early English (COOEE)
The Monash Corpus of English (MCE)
The Griffith Corpus of Spoken Australian English (GCSAusE)
Braided Channels (oral histories of women from the Channel Country)
Mitchell and Delbridge (a sample from data collected around 1960 documenting the speech of Australian adolescents)

The team behind the AusNC always hoped that additional data would be added to the corpus, but funding constraints meant that only one further collection, the La Trobe Corpus of Spoken English, was added, shortly after the initial launch of the AusNC web interface.

The Griffith University eResearch Services built the web interface and hosted the data for the AusNC, which was launched in 2012. From 2014, the Alveo virtual laboratory also held a copy of the data and provided alternative ways of interacting with it. However, after the launch of the Alveo version, the Griffith version of the corpus changed in various ways, and the two versions remained out of sync for a time, before both eventually became inaccessible when changes in the underlying technologies meant that maintaining those digital infrastructures was no longer cost effective.

The Alveo virtual laboratory still exists, in principle, but has not been regularly maintained since 2019, and its services are rarely operational. With no ongoing funding and no staff member affiliated with the AusNC project, Griffith University indicated that it could not maintain the AusNC data sets and interface, and closed those facilities when the data was transferred to LDaCA in 2023.¹ All the original data sets are now available (or are in the process of being made available) through the LDaCA data portal, and in one case, there is a substantial increase in the amount of data available (see discussion of Mitchell and Delbridge below).

Figure 1: Results of a frequency search in the Australian National Corpus interface.
Image Source: LDaCA

Corpus design

There are three general types of corpora for language data. In the first type, the data was collected for specific research purposes by individuals. In the second type, the data was collected following a plan. In the third type, the data was collected opportunistically.

Data collected for specific research purposes by individuals

Likely the most common type of corpus is a collection created by an individual researcher or research group to try to answer a specific question or set of questions.

An example of such a corpus in the AusNC is the Mitchell and Delbridge dataset. Mitchell and Delbridge wanted to know what Australian speech sounded like, how it differed from other varieties of English (often Received Pronunciation), and what sort of variation occurred within the Australian community. To answer these questions, Mitchell and Delbridge recruited school principals across the country to organise recordings of adolescents reading word lists and other prepared material, with a small, less structured component.

As mentioned above, the AusNC made only a small part of this dataset available. LDaCA is delighted that, with the assistance of the University of Sydney, we are now able to provide access to data from all of the 7,736 speakers from 330 schools who were recorded.

Another example of such a corpus in the AusNC is the Braided Channels collection, which contains 70 hours of oral history interviews with women from Queensland’s Channel Country. This is an example of a collection that was created to answer historical and cultural questions, but has been repurposed to enable researchers to address linguistic questions as well.

Data collected following a plan

In the second type of corpus, data is collected according to a defined plan. The kinds of material and the relative amounts of different kinds of material are decided in advance, and then data is collected accordingly. We can distinguish two sub-types within this category: the plan already exists or the plan needs to be developed independently.

Existing plan

In some cases, data is collected according to the plan of another corpus which already exists, so that meaningful comparisons can be made.

An example of this approach in the AusNC is the Australian Corpus of English (ACE), which was designed to be comparable with two previous corpora: the Brown Corpus of American English and the Lancaster-Oslo-Bergen Corpus of UK English (which was itself designed to be comparable with the Brown Corpus). The Australian component of the International Corpus of English (ICE-AUS) is another example of this type of corpus in the AusNC.

Independent plan

Alternatively, a corpus can be designed independently, but with the intention that it should be representative of some realm of language use, showing different genres and registers, and possibly also sampling some aspects of demographic variation.

A classic example of such a corpus is the British National Corpus (BNC: 1994 version, 2014 version), which includes samples of both written and spoken language, with different kinds of written text in specified proportions, and which samples speaker variation across dimensions such as geography, gender and age. The BNC is sometimes referred to as a ‘reference corpus’; the idea is that it provides a baseline against which phenomena from other datasets can be compared.

COOEE was constructed on similar, if less ambitious, lines. The corpus is divided into four time periods, with an approximately equal amount of material in each. Texts are assigned to one of four different types, and the proportion of each of those types is approximately equal in each time period. Such a design makes it easy to examine chronological change or variation between registers.
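
As a rough illustration of why a balanced design of this kind is convenient, the sketch below cross-tabulates word counts by period and text type and reports each text type's share within each period. The period labels, text-type names and word counts are invented placeholders, not COOEE's actual categories or figures, and the function names `balance_table` and `type_shares` are hypothetical.

```python
from collections import defaultdict

# Hypothetical metadata records: (period, text_type, word_count).
# Every label and count here is invented for illustration only.
TEXTS = [
    ("period_1", "type_A", 1000),
    ("period_1", "type_B", 600),
    ("period_1", "type_B", 400),
    ("period_2", "type_A", 500),
    ("period_2", "type_A", 500),
    ("period_2", "type_B", 1000),
]


def balance_table(texts):
    """Sum word counts for each (period, text_type) cell."""
    table = defaultdict(lambda: defaultdict(int))
    for period, text_type, words in texts:
        table[period][text_type] += words
    return {period: dict(cells) for period, cells in table.items()}


def type_shares(table):
    """For each period, compute the share of words in each text type."""
    shares = {}
    for period, cells in table.items():
        total = sum(cells.values())
        shares[period] = {t: n / total for t, n in cells.items()}
    return shares


table = balance_table(TEXTS)
shares = type_shares(table)

# Print one line per period so the proportions can be compared by eye.
for period in sorted(shares):
    print(period, {t: round(s, 2) for t, s in sorted(shares[period].items())})
```

When the shares are close to equal in every period, as in this toy data, a difference observed between periods is less likely to be a side effect of a changing mix of text types.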

Data collected opportunistically

The third type of corpus is one that is almost without a plan; the collection of data is opportunistic. Again, we can distinguish some sub-types here.

Restricted amount of data

Sometimes, opportunistic data collection is the only viable strategy. An example is the approach often taken by linguists documenting languages whose future is uncertain. In that situation, carefully planning what kinds of data to collect and in what proportions would be ideal, but is typically a luxury which cannot be afforded. Collecting anything which is easily available is at least a good starting point. If one becomes aware that there are large gaps in the data collected, one can try to address the problem in the future, but that may never be possible. This type of corpus is the result of making the best one can of limited opportunities, which restrict the amount of data available. Collections of spoken interaction are often assembled in this way. The Griffith Corpus of Spoken Australian English (GCSAusE), which includes recordings made and transcribed by students, is an example of this type of corpus in the AusNC.

Extensive data

Opportunistic data collection is also relevant to big data: many researchers work with data from the web nowadays, and this almost inevitably involves opportunistic procedures. However, in this case, the large amount of data which is accessible can justify a claim that a significant part of the variation occurring in language use (at least for a certain type of use) will be represented in the data. On this basis, many linguists are happy to use a corpus of web data, at least for exploratory research.

Figure 2: The Australian National Corpus logo.
Image Source: LDaCA

AusNC as an opportunistic collection

We have set out three broad strategies above for constructing a corpus. The AusNC calls itself a corpus, so can we assign it to one of these types? We hope that the answer to this question is clear: the AusNC was an opportunistic collection.

When the possibility of a corpus was being discussed, there were proposals for it to be carefully planned. For example, in a paper published following one of the workshops leading up to the AusNC, Pho (2009) argued that an Australian corpus should mirror the design of the BNC. However, even putting together a collection of written material on the scale of the BNC is a huge task, never mind the spoken language component; the BNC itself was produced by a large team working over a number of years. The AusNC project never enjoyed a level of funding which would have made such a plan possible (indeed, the amount of funding we received, while very welcome, was modest). In these circumstances, we had no alternative but to create the AusNC from existing data sources which custodians were prepared to share. We did hope that this model would make it easy to expand the corpus, but this did not occur: with the limited funding we had, we were not able to develop the infrastructure necessary for ongoing expansion.

In 2012, it seemed obvious that the appropriate term for our project was corpus. We were putting together a collection of language data and it came to exist as an entity in its own right. What was included might be the result of rather random factors, but there was a coherence in that all the data related to language use and, more specifically, to the use of the English language in Australia. No one was talking about ‘data commons’² when the AusNC was created (or we were not listening). If the concept had been actively discussed, perhaps we would have described what we were doing as a data commons.

The inception of LDaCA coincided, more or less, with the two factors mentioned above which led to the AusNC becoming non-viable: the understandable withdrawal of support from Griffith University and the instability of the Alveo infrastructure due to funding constraints. It was therefore inevitable that transferring the AusNC collection to the new project would be an early priority.

Approaching this task, we faced the problem of whether it still made sense to refer to the collection of datasets as the AusNC. After discussion within our project, and with the custodians of the AusNC datasets, we decided that retaining the name would suggest a coherence which did not exist now that the AusNC was part of LDaCA. The AusNC is now part of a larger assemblage of collections of data, and we felt that any internal coherence which the AusNC might have did not differentiate it, or the datasets it was composed of, from that broader landscape.

Therefore, we decided to include the component datasets of the AusNC as separate collections in the LDaCA data portal. The top-level description of each of the collections contains the following sentence to acknowledge the history of the data: “This collection was previously accessible online via the Australian National Corpus (AusNC), an initiative managed by Griffith University between 2012 and 2023.” Additionally, this blog post is available to those who are interested in the history of the AusNC.³

The corpus is dead, long live the data commons!

References

Grossman, Robert L., Allison Heath, Mark Murphy, Maria Patterson, and Walt Wells. 2016. “A Case for Data Commons: Toward Data Science as a Service.” Computing in Science & Engineering 18 (5): 10–20. DOI: 10.1109/MCSE.2016.92

Haugh, Michael, Kate Burridge, Jean Mulder and Pam Peters (eds.). 2009. Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages. Somerville, MA: Cascadilla Proceedings Project. http://www.lingref.com/cpp/ausnc/2008/index.html.

Musgrave, Simon & Michael Haugh. 2020. The Australian National Corpus (and beyond). In Louisa Willoughby & Howard Manns (eds.), Australian English Reimagined. Abingdon: Routledge.

Pho, Phuong Dzung. 2009. Towards the Design of the Australian National Corpus. In Michael Haugh, Kate Burridge, Jean Mulder & Pam Peters (eds.), Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages, 25–29. Somerville, MA: Cascadilla Proceedings Project. http://www.lingref.com/cpp/ausnc/2008/index.html.

Footnotes

Thanks to Teresa Chan who provided valuable feedback on a draft.

1 We think it is important, however, to acknowledge and recognise the longstanding commitment by Griffith University to maintaining the AusNC over a period of more than ten years.

2 “A global trusted system of systems that provides frictionless access to high quality interoperable resources, services and artefacts for research… Data commons collocate data, storage, and computing infrastructure with core services and commonly used tools and applications for managing, analyzing, and sharing data to create an interoperable resource for the research community.” (Grossman et al. 2016)

3 More details of the history of the AusNC can be found in Musgrave & Haugh (2020).