A variety of LDaCA open-source tools are available from our GitHub organisation. Highlights include:
Metadata Editor
Crate-O: A tool for creating and updating Research Object Crates (RO-Crates) through a web interface or from metadata spreadsheets. It gives researchers a relatively simple way to describe their data following best practices in formal metadata description.
For more information about Crate-O and how to create RO-Crates with the interface, see the Crate-O User Guide.
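To make the idea of an RO-Crate concrete, the following is a minimal sketch of the `ro-crate-metadata.json` file that sits at the root of a crate, based on the RO-Crate 1.1 specification. It is not Crate-O output: the function name `minimal_ro_crate` and all of the values passed to it are hypothetical placeholders, and a crate produced with Crate-O will normally carry much richer metadata.

```python
import json


def minimal_ro_crate(name: str, description: str, date_published: str, license_id: str) -> dict:
    """Build the JSON-LD for a minimal RO-Crate 1.1 metadata file (illustrative helper)."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                # Metadata descriptor: says which entity describes the crate itself.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                # Root dataset: the collection or object being described.
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "description": description,
                "datePublished": date_published,
                "license": {"@id": license_id},
            },
        ],
    }


# Placeholder values, for illustration only.
crate = minimal_ro_crate(
    name="Example language collection",
    description="A hypothetical collection used to illustrate the file structure.",
    date_published="2024-01-01",
    license_id="https://creativecommons.org/licenses/by/4.0/",
)
print(json.dumps(crate, indent=2))
```

Saving the printed JSON as `ro-crate-metadata.json` in a folder alongside the data files is what turns that folder into a crate.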
Data Storage
OCFL-js: A library that implements the Oxford Common File Layout (OCFL), a specification for laying out digital collections on file or object storage. OCFL is designed with long-term preservation principles in mind and does not rely on specialised software. The benefits of using OCFL with RO-Crate objects include:
- completeness: a repository can be re-indexed from the files it stores
- versioning: repositories can make changes to objects and still allow their history to persist
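Both benefits follow from the `inventory.json` file that OCFL keeps with every object. The sketch below is written in Python against the inventory structure defined in the OCFL specification. It does not use the OCFL-js API, and the object identifier, shortened digests, timestamps and the helper name `resolve_version` are all hypothetical.

```python
from typing import Dict, Optional

# A hand-written inventory for illustration. Real digests are full SHA-512
# (or SHA-256) hex strings; they are shortened here for readability.
inventory = {
    "id": "arcp://name,example/object1",
    "type": "https://ocfl.io/1.1/spec/#inventory",
    "digestAlgorithm": "sha512",
    "head": "v2",
    # manifest: digest -> content paths where those bytes are stored on disk
    "manifest": {
        "aaa111": ["v1/content/ro-crate-metadata.json"],
        "bbb222": ["v1/content/audio.wav"],
        "ccc333": ["v2/content/ro-crate-metadata.json"],
    },
    # versions: for each version, digest -> logical paths in that version
    "versions": {
        "v1": {
            "created": "2024-01-01T00:00:00Z",
            "state": {"aaa111": ["ro-crate-metadata.json"], "bbb222": ["audio.wav"]},
        },
        "v2": {
            "created": "2024-02-01T00:00:00Z",
            "state": {"ccc333": ["ro-crate-metadata.json"], "bbb222": ["audio.wav"]},
        },
    },
}


def resolve_version(inv: dict, version: Optional[str] = None) -> Dict[str, str]:
    """Map each logical path in a version to the content path holding its bytes."""
    version = version or inv["head"]
    state = inv["versions"][version]["state"]
    resolved = {}
    for digest, logical_paths in state.items():
        # Any content path listed for a digest holds identical bytes; take the first.
        content_path = inv["manifest"][digest][0]
        for logical_path in logical_paths:
            resolved[logical_path] = content_path
    return resolved


for v in inventory["versions"]:
    print(v, resolve_version(inventory, v))
```

In this example, version `v2` replaces the metadata file but still points to the audio stored under `v1`, so the earlier state stays recoverable without duplicating unchanged files. Because the inventory is stored with the object, an index can be rebuilt by walking the storage and reading these files.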
Data Portal and Access API
Oni: A web application that provides indexing, search and access for secure data repositories following the Arkisto model. Oni is used to build the LDaCA Portal, the online interface of the Language Data Commons of Australia, where users can discover and access language collections.
Tools developed in the project
Sydney Informatics Hub / Sydney Corpus Lab
- Document Similarity: A tool that identifies identical and similar text in a corpus.
- Quotation Tool: A tool to extract quotes and other useful information from a newspaper article/corpus.
- Semantic Tagger: A tool that tags a text or corpus so you can extract token-level semantic tags from the tagged texts.
- Keywords Analysis: A tool to analyse words in a collection of corpora and identify whether certain words are over- or under-represented in a particular corpus.
- Discursis: A conversational analysis and visualisation tool.
- Concordancer: A concordancing tool that demonstrates how to analyse turn-taking pairs.
University of Queensland
- Image Dataset Explorer: A tool that embeds images from a zip file using off-the-shelf image embeddings, then creates a static HTML visualisation for browsing and exploring clusters and relations.
Language Technology & Data Analysis Laboratory (LADAL) Tools
- Shiny Tools: Fully graphical, point-and-click applications built with R Shiny.
  - FileRenamer: Batch rename plain-text files
  - TextCleaner: Remove and replace text elements with regex
  - POSTagger: Part-of-speech tagging & dependency parsing in 65+ languages
  - WordFinder: Keyword-in-context concordancing
  - KeywordExtractor: Keyness analysis vs. a reference corpus
  - WordWebber: Word co-occurrence network visualisation
  - SentimentExplorer: NRC word-emotion sentiment analysis
  - CollocationCalculator: Collocation association measures
  - TopicDetector: Unsupervised & seeded LDA topic modelling
- Jupyter Notebook Tools: Interactive R notebooks that run in a JupyterLab environment in a browser.
  - Concordance Explorer: KWIC concordances for finding any word or phrase in context
  - Text Cleaner: Remove or replace words, tags, URLs and patterns
  - Part-of-Speech Tagger: POS tagging and dependency parsing in 65+ languages
  - Collocation Analyser: Association measures showing which words attract each other
  - Keyword Finder: Over- and under-represented words vs. a reference corpus
  - Network Visualiser: Network graphs from structured edge-list data
  - Topic Explorer: LDA topic discovery across text collections
  - Sentiment Explorer: Polarity scoring and eight basic emotion categories (NRC lexicon)
Australian National University
- ELAN Replacer: A tool that enables context-dependent search and replace across a folder of ELAN files.
- ELAN Annotation Splitter: A tool that can split ELAN annotations.
- ELAN Inventory: A web application that summarises ELAN files and compiles configuration files to create an ANNIS corpus.
- ELAN Commander: A web application that finds unwanted characters in ELAN annotations.
- Anonymising ELAN Files: A tool that anonymises the content of ELAN files.
- ELAN Audio Segmentation: A tool that takes an audio or video file as input and creates an ELAN file with empty annotations wherever a voice is heard in the audio.
University of Melbourne
- Lameta: A metadata tool to which the project added an RO-Crate output option.
