LDaCA has begun adding data sets, including:

  • Corpus of Oz Early English (CoOEE): a collection of texts written in Australia between 1788 and 1900. The corpus is divided into four time periods (1788–1825, 1826–1850, 1851–1875 and 1876–1900), each holding about 500,000 words. Four registers were defined for CoOEE: the Speech-based Register (SB), the Private Written Register (PrW), the Public Written Register (PcW) and the register of Government English (GE). Within each of the four periods, the registers contain a similar number of words.
  • Sydney Speaks: This project seeks to document and explore Australian English, as spoken in Australia’s largest and most ethnically and linguistically diverse city – Sydney.
  • From Farms to Freeways: This research project sought to analyse the experiences of women who had lived in the Blacktown and Penrith areas since the early 1950s, including their responses to social changes brought about by rapid suburbanisation in the Western Sydney region in the post-war period. Two-hour taped discussions were held with 34 women, aged sixty and over, who were in their early twenties during the Western Sydney region’s population growth.

We have also ingested a collection of government documents in various languages. This is a very small dataset, assembled to check that our technology can handle different languages and scripts; more information about this work is available in this presentation.

Work is under way to make the data from other earlier projects accessible through LDaCA:

  • The Australian National Corpus
  • AusTalk
