LDaCA has begun adding datasets, including:
- most of the datasets that were part of the Australian National Corpus collection, now available at our data portal (or click the Data Portal button at the top right)
- Corpus of Oz Early English (COOEE): a collection of texts written in Australia between 1788 and 1900. The corpus is divided into four time periods (1788–1825, 1826–1850, 1851–1875 and 1876–1900), each holding about 500,000 words. Four registers were defined for COOEE: the Speech-based Register (SB), the Private Written Register (PrW), the Public Written Register (PcW) and the register of Government English (GE). Within each time period, the registers contain a similar number of words.
- The Australian Corpus of English: The Australian Corpus of English (ACE) was compiled to match Australian data from 1986 with the American (Brown) and British (LOB) corpora of written English from the 1960s. It includes 500 samples of published texts taken from 15 different categories of nonfiction and fiction, including newspapers (reportage, editorials, reviews); magazines and journals (popular, academic); government and corporate documents; and fiction monographs and short stories (both popular and literary).
- The International Corpus of English (Australian component): The Australian component of the International Corpus of English (ICE-AUS) is an approximately one-million-word corpus of transcribed spoken and written Australian English from 1992–1995. It consists of 500 samples of Australian English (60% speech, 40% writing), a design that matches the structure of other corpora associated with the International Corpus of English.
- Australian Radio Talkback: Australian Radio Talkback (ART) is a set of transcribed recordings of samples of national, regional and commercial Australian talkback radio from 2004 to 2006. It includes 27 audio recordings and transcripts of talkback from ABC National Radio, ABC Radio broadcasts to eastern Australia, ABC Radio broadcasts to southern and western Australia, as well as commercial stations broadcasting to eastern Australia and southern and western Australia.
- AustLit: The contribution from AustLit provides full-text access to selected samples of out-of-copyright poetry, fiction and criticism ranging from 1795 to the 1930s. The collection includes literature intended for popular audiences as well as literature intended for audiences concerned with literary quality or the establishment of a national canon.
- Braided Channels: The Braided Channels research collection is constructed from some 70 hours of oral history interviews with women from Australia’s Channel Country, together with archival film, transcripts, photos and music. It includes both audiovisual recordings and transcripts of interviews.
- The La Trobe Corpus of Spoken Australian English: The La Trobe Corpus of Spoken Australian English comprises a collection of six recordings and transcriptions of spoken interaction amongst Australian speakers of English (some in conversation with native French speakers speaking English) made in Melbourne from 2001 to 2002.
- The speech of Australian adolescents: research data and recordings collected by A.G. Mitchell and Arthur Delbridge in 1959 and 1960: This dataset comprises 22,187 recordings of Australian English as spoken by 7,736 students at 330 schools across Australia and specific information about the speakers. The recordings were made on reel-to-reel tapes and were used to create the 1965 monograph The speech of Australian adolescents: a survey and the revised 1965 publication The pronunciation of English in Australia (originally published in 1946). The Australian National Corpus provided access to a sample of this material; the full dataset is now available.
- other datasets not yet available in the portal:
  - Sydney Speaks: This project seeks to document and explore Australian English, as spoken in Australia’s largest and most ethnically and linguistically diverse city – Sydney.
  - From Farms to Freeways: This research project sought to analyse the experiences of women who had lived in the Blacktown and Penrith areas since the early 1950s, including their responses to social changes brought about by rapid suburbanisation in the Western Sydney region in the post-war period. Two-hour taped discussions were held with 34 women, aged 60 and over, who were in their early twenties during the Western Sydney region’s population growth.
  - A collection of government documents in various languages. This is a very small dataset assembled to check that our technology can handle different languages and different scripts; more information about this work is available in this presentation.
Work is under way to make data from other earlier projects accessible through LDaCA:
- Datasets from the Australian National Corpus not listed above (Monash Corpus of English, Griffith Corpus of Spoken Australian English)
- Corpus of Australian English as a Second Language (AusESL)