Implementing PILARS

2025-11-23

[Slide: Implementing PILARS. Ensuring digital language and cultural-heritage materials remain accessible, usable, and sustainably managed over time.]

An adaptation of a presentation delivered at the 2025 Annual Symposium of the HASS and Indigenous Research Data Commons.

Preserving digital language and cultural-heritage materials isn’t just a technical exercise—it’s about safeguarding knowledge, identity, and history for future generations. As collections grow and as data becomes increasingly fragmented across institutions, the challenge is no longer simply storing information. It’s ensuring that community-owned knowledge remains accessible, usable, and sustainably managed over time.

At the Language Data Commons of Australia (LDaCA), we’ve been working toward this goal by adopting open standards, building clear governance mechanisms, and designing infrastructure that communities can trust and control. The result of this work is PILARS: the Protocols for Implementing Long-Term Archival Repository Services.

How are we implementing our work?

We designed the Protocols for Implementing Long-Term Archival Repository Services (PILARS). The protocols are intended to:

Work in low-resource environments

Allow communities to have agency and control over their materials

Prioritise sustainability, simplicity, standardisation, linked-data description, and clear licensing over user-interface features

[Slide: PILARS. A framework of protocols for designing sustainable archival systems that supports the FAIR (Findable, Accessible, Interoperable, Reusable) and CARE (Collective Benefit, Authority to Control, Responsibility, Ethics) principles. 1. Data Portability: commodity storage; storage objects; documentation stored within the storage root. 2. Metadata & Annotation: each object has descriptive metadata (usage rights, provenance); use linked data; represent high-level structures. 3. Governance. PILARS goals: Autonomy, Sustainability, Value.]

Why PILARS?

PILARS is our framework for designing sustainable archival systems—particularly in low-resource environments where communities need agency, autonomy, and long-term reliability.

The protocols are guided by both the FAIR (Findable, Accessible, Interoperable, Reusable) and CARE (Collective Benefit, Authority to Control, Responsibility, Ethics) principles. Together, these ensure that while data remains discoverable and reusable, the rights and authority of communities are respected and embedded in the system itself.

Our goals are simple:

Autonomy: Reduce reliance on closed, proprietary, or opaque storage systems.

Sustainability: Ensure that data remains intact and accessible decades from now.

Value: Maximise the return on investment in digital collections and research infrastructure.

[Slide: The Oxford Common File Layout. 1. Data is portable.]

1. Data Portability: The Foundation

PILARS insists on storing data in a way that is portable, stable, and independent of any particular platform.

We do this by using:

OCFL (Oxford Common File Layout)

A community standard that ensures digital objects are stored in a transparent, predictable, and platform-independent structure. LDaCA extends OCFL with a storage-layout specification that maps identifiers to directory structures in both filesystems and object storage.
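To make the idea of mapping identifiers to directories concrete, here is a minimal Python sketch. It is not LDaCA's storage-layout specification. It assumes a hashed n-tuple scheme of the kind described in the OCFL community extensions, and the function name `object_path` and the example identifier are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def object_path(identifier: str, tuple_size: int = 3, tuple_count: int = 3) -> PurePosixPath:
    """Map an object identifier to a relative directory path.

    Assumption: a hashed n-tuple layout. The identifier is hashed with
    SHA-256, the leading characters of the hex digest become nested
    directories, and the full digest names the object directory.
    """
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
    tuples = [
        digest[i * tuple_size:(i + 1) * tuple_size]
        for i in range(tuple_count)
    ]
    return PurePosixPath(*tuples, digest)


# Hypothetical identifier, used only for illustration.
example_id = "arcp://name,example-collection/item-001"
print(object_path(example_id))
```

Because the result is a plain relative path, the same mapping can serve as a directory on a filesystem or as a key prefix in object storage, which is the portability property the layout is meant to provide.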

RO-Crate

Every storage object we deposit is an RO-Crate—a research object composed of:

The data files themselves

A JSON-LD metadata file (ro-crate-metadata.json) that describes the content, its provenance, and licensing

An RO-Crate can represent a collection, an interview, a series, or any structured set of materials. Each file is described, linked, and licensed in machine-readable form so that tools, portals, and search engines can reconstruct meaning without custom logic.
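As an illustration of what such a metadata file contains, the sketch below builds a minimal RO-Crate 1.1 document in Python. The function name, the file names, and the example values are hypothetical, and a real LDaCA crate would carry additional properties from the LDaCA schema that are not shown here.

```python
import json


def build_crate_metadata(name: str, description: str, date_published: str,
                         license_url: str, files: list) -> dict:
    """Build a minimal RO-Crate 1.1 metadata document as a Python dict."""
    file_entities = [
        {
            "@id": f["path"],
            "@type": "File",
            "name": f["name"],
            "encodingFormat": f["encoding_format"],
            "license": {"@id": license_url},
        }
        for f in files
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            # Metadata descriptor: marks this file as the crate's metadata.
            {
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            # Root dataset: the collection, interview, or series itself.
            {
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "description": description,
                "datePublished": date_published,
                "license": {"@id": license_url},
                "hasPart": [{"@id": e["@id"]} for e in file_entities],
            },
            *file_entities,
        ],
    }


# Hypothetical example values.
metadata = build_crate_metadata(
    name="Example interview",
    description="A single recorded interview with its transcript.",
    date_published="2025-01-01",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    files=[
        {"path": "interview.wav", "name": "Audio recording", "encoding_format": "audio/wav"},
        {"path": "transcript.txt", "name": "Transcript", "encoding_format": "text/plain"},
    ],
)
print(json.dumps(metadata, indent=2))
```

The output would be saved as ro-crate-metadata.json alongside the data files, so the description and licensing travel with the object itself.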

[Slide: Persistent IDs. OCFL objects are identified by URIs, which are mapped to directory hierarchies.]

Persistent Identifiers

Institutions do not always have persistent identifier (PID) systems in place, so LDaCA supports temporary identifiers such as ARCP URIs until more formal identifiers (such as DOIs) are assigned. This means repositories can begin storing and organising data immediately, without waiting for institution-wide decisions.
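A rough sketch of this temporary-then-formal pattern is shown below. It assumes the UUID form of the ARCP URI scheme. The `IdentifierRegistry` class and the placeholder DOI are hypothetical and only illustrate the idea; they are not part of LDaCA's tooling.

```python
import uuid


def mint_temporary_id(path: str = "") -> str:
    """Return an ARCP URI built on a freshly generated random UUID."""
    return f"arcp://uuid,{uuid.uuid4()}/{path.lstrip('/')}"


class IdentifierRegistry:
    """Hypothetical in-memory map from temporary IDs to formal PIDs."""

    def __init__(self) -> None:
        self._formal = {}

    def assign_formal(self, temporary_id: str, formal_id: str) -> None:
        """Record the formal identifier once one has been issued."""
        self._formal[temporary_id] = formal_id

    def resolve(self, temporary_id: str) -> str:
        """Return the formal ID if assigned, otherwise the temporary one."""
        return self._formal.get(temporary_id, temporary_id)


registry = IdentifierRegistry()
temp_id = mint_temporary_id("collection/item-001")

# Before a formal PID exists, the temporary identifier is used as-is.
print(registry.resolve(temp_id))

# Later, a formal identifier is assigned (placeholder value, not a real DOI).
registry.assign_formal(temp_id, "doi:10.99999/placeholder")
print(registry.resolve(temp_id))
```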

2. Metadata & Annotation: Making Data Understandable

Metadata is where collections become meaningful. For LDaCA, metadata runs across several layers:

RO-Crate Metadata Schema, built on Schema.org

LDaCA Metadata Schema, an extension for language-specific concepts

LDaCA RO-Crate Profile, a document that explains how schemas are applied in practice, including both human-readable guidance and machine-readable constraints

Validation rules and Crate-O Mode Files generated from the profile to ensure every dataset meets requirements

Our schemas are publicly available:

http://w3id.org/ldac/profile

http://w3id.org/ldac/terms
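As a rough illustration of the validation layer described above, the following Python sketch checks the root dataset of an RO-Crate for a handful of required properties. The rule set here is an assumption made for the example; the actual constraints are defined by the LDaCA RO-Crate Profile, and the function names are hypothetical.

```python
from typing import Optional

# Assumed rule set for illustration only; the real constraints live in
# the LDaCA RO-Crate Profile.
REQUIRED_ROOT_PROPERTIES = ("name", "description", "datePublished", "license")


def find_root_dataset(metadata: dict) -> Optional[dict]:
    """Follow the metadata descriptor's "about" link to the root dataset."""
    entities = {e["@id"]: e for e in metadata.get("@graph", [])}
    descriptor = entities.get("ro-crate-metadata.json")
    if descriptor is None:
        return None
    return entities.get(descriptor.get("about", {}).get("@id"))


def validate_crate(metadata: dict) -> list:
    """Return a list of human-readable problems; an empty list means a pass."""
    root = find_root_dataset(metadata)
    if root is None:
        return ["no root dataset found"]
    return [
        f"root dataset is missing required property '{prop}'"
        for prop in REQUIRED_ROOT_PROPERTIES
        if not root.get(prop)
    ]


# A deliberately incomplete crate: it has a name but nothing else.
incomplete = {
    "@graph": [
        {"@id": "ro-crate-metadata.json", "about": {"@id": "./"}},
        {"@id": "./", "@type": "Dataset", "name": "Example collection"},
    ]
}
for problem in validate_crate(incomplete):
    print(problem)
```

Returning a list of problems rather than raising on the first failure lets a depositor see everything that needs fixing in one pass.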

Tools for creating metadata

To support researchers and communities, we created Crate-O, a Vue.js-based tool that:

Provides guided metadata creation

Integrates with services like ROR for organisation lookup

Accepts spreadsheets and converts them into RO-Crates

Allows batch upload of metadata

Can run locally, in portals, or as GitHub Pages

Most of the real work of metadata creation still happens in spreadsheets—but Crate-O helps turn that into structured, validated metadata.
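The general shape of a spreadsheet-to-metadata conversion can be sketched in a few lines of Python. This is not Crate-O's implementation or its spreadsheet format: the column names (`filename`, `name`, `language`) and the function name are assumptions made for the example.

```python
import csv
import io


def rows_to_entities(csv_text: str) -> list:
    """Turn spreadsheet rows into JSON-LD file entities, one per row.

    Assumed columns: "filename", "name", "language". These are
    hypothetical and are not Crate-O's actual spreadsheet format.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    entities = []
    for row in reader:
        entities.append({
            "@id": row["filename"].strip(),
            "@type": "File",
            "name": row["name"].strip(),
            "inLanguage": row["language"].strip(),
        })
    return entities


# A small spreadsheet exported as CSV (illustrative data only).
sheet = """filename,name,language
recording-01.wav,Interview part 1,en
recording-02.wav,Interview part 2,en
"""
entities = rows_to_entities(sheet)
for entity in entities:
    print(entity)
```

Each resulting entity could then be added to the `@graph` of an RO-Crate metadata file and checked against the profile before deposit.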

[Slide: Index. Portals can then be indexed from the storage to make the data findable.]

Findability: From Storage to Search

Once data is stored and described, it must be discoverable. LDaCA uses:

PILARS-compliant storage

An API layer that exposes data consistently

Indexing pipelines to make data searchable across distributed services

Tools for building portals on demand—automated via Terraform—that communities can manage themselves

Our main portal aggregates language datasets curated by LDaCA, while community instances provide tailored environments for managing and exploring their own collections.
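To show how an index can be derived entirely from the stored metadata, here is a toy Python sketch of an indexing step. It is an assumption-laden simplification and not LDaCA's pipeline: it builds an in-memory inverted index over `name` and `description` only, and the function names are hypothetical.

```python
import re
from collections import defaultdict


def build_index(crates: dict) -> dict:
    """Build an inverted index: lowercase word -> set of crate identifiers.

    `crates` maps a crate identifier to its parsed RO-Crate metadata.
    Only "name" and "description" strings are indexed in this sketch.
    """
    index = defaultdict(set)
    for crate_id, metadata in crates.items():
        for entity in metadata.get("@graph", []):
            for field in ("name", "description"):
                value = entity.get(field)
                if isinstance(value, str):
                    for word in re.findall(r"\w+", value.lower()):
                        index[word].add(crate_id)
    return dict(index)


def search(index: dict, query: str) -> set:
    """Return the crates that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    results = [index.get(word, set()) for word in words]
    return set.intersection(*results)


# Two tiny illustrative crates.
crates = {
    "crate-a": {"@graph": [{"@id": "./", "name": "Farming interviews",
                            "description": "Oral history recordings"}]},
    "crate-b": {"@graph": [{"@id": "./", "name": "Radio interviews"}]},
}
index = build_index(crates)
print(search(index, "oral interviews"))
```

Because the index is rebuilt from storage, it can be discarded and regenerated at any time, which keeps the portal replaceable while the stored data stays authoritative.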

[Slide: Access Control. A distributed access-control system that leverages federated authentication (AAF) independently of authorization services. Key features: license-based access control; enforcement points; interoperable protocols. Motivation: the FAIR data principles require not just openness but controlled access in many contexts, and traditional centralized access-control solutions struggle with scalability, sustainability, cross-institutional trust, privacy, and fine-grained permissions. Architecture and workflow: user requests access; enforcement point at the repository; repository polls the authorization server if necessary; decision point at the authorization server; audit and logging.]

Access Control: Distributed, License-Based, and Interoperable

LDaCA has implemented a distributed access-control system that separates:

Authentication (who you are)

Authorization (what you’re allowed to access)

We use federated identity systems like CILogon and eduGAIN, and partner with platforms like CADRE and REMS to manage entitlement workflows.

Instead of role-based permissions locked inside a single system, we use license-based access control. Your entitlements—granted through a governance process—travel with you, allowing enforcement points at each data repository to make decisions consistently and automatically.

This approach:

Scales across institutions

Supports sensitive or community-restricted data

Ensures transparent auditing and revocation

Respects the governance requirements of language communities
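The decision made at a repository's enforcement point can be sketched as follows. This is a simplified model under stated assumptions, not the CADRE or REMS API: the grants table stands in for a call to an authorization service, and every name, license URL, and user in it is hypothetical.

```python
from dataclasses import dataclass

# Assumption: licenses in this set need no entitlement check.
OPEN_LICENSES = frozenset({"https://creativecommons.org/licenses/by/4.0/"})


@dataclass(frozen=True)
class AccessDecision:
    allowed: bool
    reason: str


def lookup_entitlements(user_id: str, grants: dict) -> frozenset:
    """Stand-in for polling an authorization service for a user's licenses."""
    return frozenset(grants.get(user_id, ()))


def enforce(user_id, object_license: str, grants: dict, audit_log: list) -> AccessDecision:
    """Enforcement point: decide access from the object's license alone."""
    if object_license in OPEN_LICENSES:
        decision = AccessDecision(True, "open license")
    elif user_id is None:
        decision = AccessDecision(False, "authentication required")
    elif object_license in lookup_entitlements(user_id, grants):
        decision = AccessDecision(True, "license granted to user")
    else:
        decision = AccessDecision(False, "license not granted to user")
    # Every decision is recorded so that access can be audited later.
    audit_log.append((user_id, object_license, decision.allowed))
    return decision


# Hypothetical grants produced by a governance process.
restricted = "https://example.org/licenses/community-restricted"
grants = {"alice": [restricted]}
audit_log = []

print(enforce("alice", restricted, grants, audit_log))
print(enforce("bob", restricted, grants, audit_log))
```

Note that the enforcement point never inspects roles or groups. It only compares the license attached to the object with the licenses the user holds, which is what allows the same check to run unchanged at every repository.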

The above diagram represents our Authorisation and Authentication Infrastructure.

CILogon: an identity and access management platform that enables researchers to sign in with their existing credentials.

eduGAIN: supported by AAF, we use the eduGAIN interfederation service, which connects identity federations around the world.

CADRE (Coordinated Access for Data, Researchers and Environments): a shared platform for safely handling requests to access sensitive data, addressing the governance, creation, management, and sharing of data for research. We have a service agreement with CADRE to provide access controls. CADRE uses REMS (Resource Entitlement Management System), a tool for managing access rights to resources such as research datasets, as its back end for resource management.

In other words, under this license-based authorization mechanism, you are licensed to access sensitive materials.

Photo of LDaCA workshop in Darwin at Charles Darwin University campus

[Slide: Beyond project websites; sustainable dashboards. Contrasts delivery-focused project websites, where decisions are made for speed and appearance, with systems designed for long-term value and maintainability.]

Building Sustainable Systems, Not Just Dashboards

One of our biggest learnings is this: Dashboards and portals are easy to build, but hard to maintain.

Too often:

Design choices prioritise speed over long-term care
Knowledge is tied to individual developers
Data, code, and dependencies get tightly coupled
Tools become fragile and unmaintained once the project ends

We want to change that pattern.

By building with open standards, separating data from code, and treating maintenance as an expected part of system design, we ensure tools outlive projects—and people.

[Slide: TODO. Fix bugs, maintain our tools, and improve UX; design and implement a complete workflow for interactive deposits; add more language data collections; add more analytical notebooks and tools. OCFL specification: https://ocfl.io/1.1.0/spec/]

What’s Next

There is still work ahead:

Fixing bugs and improving the user experience
Completing workflows for interactive deposits
Expanding language data collections
Adding analytical notebooks and tools
Sharing the LDaCA approach across disciplines
Strengthening governance frameworks

But the foundation—the PILARS protocols—gives us a sustainable, community-centered way forward.

[Slide: Implementing PILARS. Moises Sacal Bonequi]

Acknowledging the iconography within the Language Data Commons of Australia (LDaCA) logo, designed by Dylan Sarra. The design draws inspiration from the Burnett River Petroglyphs, and we recognise the Gureng Gureng communal knowledges that inhere within these symbols.

Much like the Indigenous language and cultural data dispersed across many institutions and archives which LDaCA engages with regularly, the Burnett River Petroglyphs themselves were jackhammered and scattered across Queensland in 1972 — this story lives on today through people… And it is ‘people’ who are central to the data which LDaCA intersects with.

Created with https://github.com/ptsefton/pptx_to_md