LDaCA Software Tools


A variety of open-source LDaCA tools are available from our GitHub organisation. Highlights include:


Metadata Editor

Crate-O: A tool for creating and updating Research Object Crates (RO-Crates) using a web interface and metadata spreadsheets. It gives researchers a relatively simple way to describe their data following best practices in formal metadata description.

For more information about Crate-O and how to create RO-Crates with the interface, see the Crate-O User Guide.
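To make the idea of an RO-Crate concrete, the following is a minimal sketch of the `ro-crate-metadata.json` structure defined by the RO-Crate 1.1 specification, built by hand in Python. It is not Crate-O's output or API; the function name `minimal_ro_crate` and all example values (collection name, description, date, licence) are hypothetical and chosen only for illustration.

```python
import json


def minimal_ro_crate(name, description, date_published, license_id):
    """Build a minimal RO-Crate 1.1 metadata document as a plain dict.

    Hypothetical helper for illustration; not part of Crate-O.
    """
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                # Metadata descriptor: points at the root dataset it describes.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                # Root dataset: the collection or object being described.
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "description": description,
                "datePublished": date_published,
                "license": {"@id": license_id},
            },
        ],
    }


# All values below are made up for the example.
crate = minimal_ro_crate(
    name="Example interview collection",
    description="Transcribed interviews used to illustrate RO-Crate metadata.",
    date_published="2024-01-01",
    license_id="https://creativecommons.org/licenses/by/4.0/",
)

print(json.dumps(crate, indent=2))
```

Writing the printed JSON to a file named `ro-crate-metadata.json` in the root of a data folder is what turns that folder into an RO-Crate. A tool such as Crate-O manages this structure through its interface, so it does not need to be written by hand.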


Data Storage

OCFL-js: A library that implements the Oxford Common File Layout (OCFL), a specification for laying out digital collections on file or object storage. OCFL is designed with long-term preservation principles in mind and does not rely on specialised software. The benefits of using OCFL with RO-Crate objects include:

  • completeness: a repository can be re-indexed from the files it stores
  • versioning: repositories can make changes to objects and still allow their history to persist
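As a rough illustration of how OCFL achieves versioning, the sketch below builds a simplified OCFL-style inventory in Python. It is not the OCFL-js API (which is a JavaScript library); the helper names `build_inventory` and `files_in_version`, the object identifier, and the file contents are all hypothetical. The inventory keys (`id`, `type`, `digestAlgorithm`, `head`, `manifest`, `versions`, `state`, `created`) follow the OCFL 1.0 specification, but this sketch keeps everything in memory, writes nothing to storage, and does no validation.

```python
import hashlib


def sha512_hex(data: bytes) -> str:
    """Return the SHA-512 digest of some file content as a hex string."""
    return hashlib.sha512(data).hexdigest()


def build_inventory(object_id, versions):
    """Build a simplified OCFL-style inventory.

    `versions` is a list of dicts, oldest first, each with a "created"
    timestamp and a "files" mapping of logical path -> bytes.
    Hypothetical helper for illustration; not part of OCFL-js.
    """
    manifest = {}
    inventory_versions = {}
    for number, version in enumerate(versions, start=1):
        version_name = f"v{number}"
        state = {}
        for logical_path, data in sorted(version["files"].items()):
            digest = sha512_hex(data)
            # Content is stored once, in the version where it first appears.
            if digest not in manifest:
                manifest[digest] = [f"{version_name}/content/{logical_path}"]
            state.setdefault(digest, []).append(logical_path)
        inventory_versions[version_name] = {
            "created": version["created"],
            "state": state,
        }
    return {
        "id": object_id,
        "type": "https://ocfl.io/1.0/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": f"v{len(versions)}",
        "manifest": manifest,
        "versions": inventory_versions,
    }


def files_in_version(inventory, version_name):
    """Map each logical path in a version to the stored content path."""
    state = inventory["versions"][version_name]["state"]
    return {
        logical_path: inventory["manifest"][digest][0]
        for digest, logical_paths in state.items()
        for logical_path in logical_paths
    }


# Two versions of a made-up object: the metadata changes, the audio does not.
inventory = build_inventory(
    "example-object-1",
    [
        {
            "created": "2024-01-01T00:00:00Z",
            "files": {
                "ro-crate-metadata.json": b'{"name": "draft"}',
                "audio.wav": b"fake audio bytes",
            },
        },
        {
            "created": "2024-02-01T00:00:00Z",
            "files": {
                "ro-crate-metadata.json": b'{"name": "final"}',
                "audio.wav": b"fake audio bytes",
            },
        },
    ],
)

for name in ("v1", "v2"):
    print(name, files_in_version(inventory, name))
```

In this example the second version stores only the changed metadata file, while the unchanged audio file still resolves to its original location under `v1/content/`. The first version remains fully recoverable, which is the sense in which an object's history persists.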

Data Portal and Access API

Oni: A web application that provides indexing, search and access for secure data repositories following the Arkisto model. Oni is used to build the LDaCA Portal, the online interface of the Language Data Commons of Australia, where users can discover and access language collections.


Tools developed in the project

Sydney Informatics Hub / Sydney Corpus Lab

  • Document Similarity: A tool that identifies identical and similar text in a corpus.
  • Quotation Tool: A tool to extract quotes and other useful information from a newspaper article/corpus.
  • Semantic Tagger: A tool that tags a text or corpus so that token-level semantic tags can be extracted from the tagged texts.
  • Keywords Analysis: A tool to analyse words in a collection of corpora and identify whether certain words are over- or under-represented in a particular corpus.
  • Discursis: A conversational analysis and visualisation tool.
  • Concordancer: A concordancing tool that demonstrates how to analyse turn-taking pairs.
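To illustrate what "over- or under-represented" can mean in a keyword analysis, here is a minimal Python sketch that compares word frequencies in a study corpus against a reference corpus using the log-likelihood statistic, one commonly used keyness measure. This page does not state which statistics the Keywords Analysis tool computes, so this is an assumed stand-in and not that tool's implementation. The function names and the toy token lists are hypothetical, and both corpora are assumed to be non-empty.

```python
import math
from collections import Counter


def log_likelihood(freq_study, freq_ref, total_study, total_ref):
    """Log-likelihood keyness score for one word across two corpora."""
    combined = freq_study + freq_ref
    # Expected counts if the word were equally likely in both corpora.
    expected_study = total_study * combined / (total_study + total_ref)
    expected_ref = total_ref * combined / (total_study + total_ref)
    score = 0.0
    # A zero count contributes nothing to the sum.
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


def keyness(study_tokens, ref_tokens):
    """Return (word, score, direction) tuples, highest score first."""
    study_counts = Counter(study_tokens)
    ref_counts = Counter(ref_tokens)
    total_study = len(study_tokens)
    total_ref = len(ref_tokens)
    rows = []
    for word in set(study_counts) | set(ref_counts):
        a, b = study_counts[word], ref_counts[word]
        # Compare relative frequencies without dividing.
        if a * total_ref > b * total_study:
            direction = "over"
        elif a * total_ref < b * total_study:
            direction = "under"
        else:
            direction = "equal"
        rows.append((word, log_likelihood(a, b, total_study, total_ref), direction))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy corpora, made up for the example.
study = "the cat sat on the mat the cat".split()
reference = "the dog sat on the log a dog".split()

for word, score, direction in keyness(study, reference):
    print(f"{word:5} {score:6.2f} {direction}")
```

A higher score means the difference in relative frequency between the two corpora is less likely to be due to chance, and the direction column shows whether the word is over- or under-represented in the study corpus.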

University of Queensland

  • Image Dataset Explorer: A tool that embeds images from a zip file using off-the-shelf image embeddings, then creates a static HTML visualisation for browsing and exploring clusters and relations.

Language Technology & Data Analysis Laboratory (LADAL) Tools

  • Shiny Tools: Fully graphical, point-and-click applications built with R Shiny.
    • FileRenamer: Batch rename plain-text files
    • TextCleaner: Remove and replace text elements with regex
    • POSTagger: Part-of-speech tagging & dependency parsing in 65+ languages
    • WordFinder: Keyword-in-context concordancing
    • KeywordExtractor: Keyness analysis vs. a reference corpus
    • WordWebber: Word co-occurrence network visualisation
    • SentimentExplorer: NRC word-emotion sentiment analysis
    • CollocationCalculator: Collocation association measures
    • TopicDetector: Unsupervised & seeded LDA topic modelling
  • Jupyter Notebook Tools: Interactive R notebooks that run in a JupyterLab environment in a browser.
    • Concordance Explorer: KWIC concordances for finding any word or phrase in context
    • Text Cleaner: Remove or replace words, tags, URLs and patterns
    • Part-of-Speech Tagger: POS tagging and dependency parsing in 65+ languages
    • Collocation Analyser: Association measures showing which words attract each other
    • Keyword Finder: Over- and under-represented words vs. a reference corpus
    • Network Visualiser: Network graphs from structured edge-list data
    • Topic Explorer: LDA topic discovery across text collections
    • Sentiment Explorer: Polarity scoring and eight basic emotion categories (NRC lexicon)

Australian National University

  • ELAN Replacer: A tool for context-dependent search and replace across a folder of ELAN files.
  • ELAN Annotation Splitter: A tool that can split ELAN annotations.
  • ELAN Inventory: A web application that summarises ELAN files and compiles configuration files to create an ANNIS corpus.
  • ELAN Commander: A web application that finds unwanted characters in ELAN annotations.
  • Anonymising ELAN Files: A tool that anonymises the content of ELAN files.
  • ELAN Audio Segmentation: A tool that takes an audio or video file as input and creates an ELAN file with empty annotations wherever a voice is heard in the audio.

University of Melbourne

  • Lameta: Added an RO-Crate output option for this metadata tool.